Diffstat (limited to 'doc/features')
-rw-r--r-- doc/features/afr-statistics.md | 142
-rw-r--r-- doc/features/afr-v1.md | 340
-rw-r--r-- doc/features/bd-xlator.md | 469
-rw-r--r-- doc/features/brick-failure-detection.md | 67
-rw-r--r-- doc/features/ctime.md | 68
-rw-r--r-- doc/features/file-snapshot.md | 91
-rw-r--r-- doc/features/ganesha-ha.md | 43
-rw-r--r-- doc/features/geo-replication/distributed-geo-rep.md | 71
-rw-r--r-- doc/features/geo-replication/libgfchangelog.md | 119
-rw-r--r-- doc/features/gfid-access.md | 73
-rw-r--r-- doc/features/libgfapi.md | 381
-rw-r--r-- doc/features/nufa.md | 20
-rw-r--r-- doc/features/ovirt-integration.md | 106
-rw-r--r-- doc/features/qemu-integration.md | 231
-rw-r--r-- doc/features/quota-scalability.md | 52
-rw-r--r-- doc/features/rdma-cm-in-3.4.0.txt | 9
-rw-r--r-- doc/features/rdmacm.md | 17
-rw-r--r-- doc/features/readdir-ahead.md | 14
-rw-r--r-- doc/features/rebalance.md | 74
-rw-r--r-- doc/features/server-quorum.md | 44
-rw-r--r-- doc/features/worm.md | 75
-rw-r--r-- doc/features/zerofill.md | 26
22 files changed, 111 insertions, 2421 deletions
diff --git a/doc/features/afr-statistics.md b/doc/features/afr-statistics.md
deleted file mode 100644
index d0705845aa4..00000000000
--- a/doc/features/afr-statistics.md
+++ /dev/null
@@ -1,142 +0,0 @@
-##gluster volume heal <volume-name> statistics
-
-##Description
-During an index self-heal, the self-heal daemon reads entries from the
-/brick-path/.glusterfs/indices/xattrop/ directory of each local brick and
-attempts to heal the files those entries refer to.
-Executing this command lists the crawl statistics of the self-heals done for
-each brick.
-
-For each brick, it will list:
-1. Starting time of crawl done for that brick.
-2. Ending time of crawl done for that brick.
-3. Number of entries for which self-heal was successfully attempted.
-4. Number of split-brain entries found while self-healing.
-5. Number of entries for which heal failed.
-
-
-
-Example:
-a) Create a gluster volume with replica count 2.
-b) Create 10 files.
-c) Kill brick_1 of this replica.
-d) Overwrite all 10 files.
-e) Kill the other brick (brick_2), and bring brick_1 back up.
-f) Overwrite all 10 files.
-
-Now all 10 files are in split-brain. The self-heal daemon will crawl both
-bricks and count 10 files from each brick.
-It will report 10 files in split-brain with respect to the given brick.
-
-Gathering crawl statistics on volume volume1 has been successful
-------------------------------------------------
-
-Crawl statistics for brick no 0
-Hostname of brick 192.168.122.1
-
-Starting time of crawl: Tue May 20 19:13:11 2014
-
-Ending time of crawl: Tue May 20 19:13:12 2014
-
-Type of crawl: INDEX
-No. of entries healed: 0
-No. of entries in split-brain: 10
-No. of heal failed entries: 0
-------------------------------------------------
-
-Crawl statistics for brick no 1
-Hostname of brick 192.168.122.1
-
-Starting time of crawl: Tue May 20 19:13:12 2014
-
-Ending time of crawl: Tue May 20 19:13:12 2014
-
-Type of crawl: INDEX
-No. of entries healed: 0
-No. of entries in split-brain: 10
-No. of heal failed entries: 0
-
-------------------------------------------------
-
-
-As the output shows, the self-heal daemon detects 10 files in split-brain with
-respect to the given brick.
-
-
-
-
-##gluster volume heal <volume-name> statistics heal-count
-It lists the number of entries present in
-/<brick>/.glusterfs/indices/xattrop for each brick.
-
-
-1. Create a replicate volume.
-2. Kill one brick of the replicate volume (volume1).
-3. Create 10 files.
-4. Execute above command.
-
---------------------------------------------------------------------------------
-Gathering count of entries to be healed on volume volume1 has been successful
-
-Brick 192.168.122.1:/brick_1
-Number of entries: 10
-
-Brick 192.168.122.1:/brick_2
-No gathered input for this brick
--------------------------------------------------------------------------------
-
-
-
-
-
-
-##gluster volume heal <volume-name> statistics heal-count replica \
- ip_addr:/brick_location
-
-To list the number of entries to be healed in a particular replicate
-subvolume, name any one child of that replicate subvolume in the command.
-The command then lists the entries for all the children of that subvolume.
-
-Example:
-
-                  dht
-                /     \
-              /         \
-       replica-1       replica-2
-         /    \          /    \
-    child-1 child-2 child-3 child-4
-    /brick1 /brick2 /brick3 /brick4
-
-gluster volume heal <vol-name> statistics heal-count replica ip:/brick1
-will list count only for child-1 and child-2.
-
-gluster volume heal <vol-name> statistics heal-count replica ip:/brick3
-will list count only for child-3 and child-4.
-
-
-
-1. Create a volume same as mentioned in the above graph.
-2. Kill Brick-2.
-3. Create some files.
-4. If we are interested in knowing the number of files to be healed from each
- brick of replica-1 only, mention any one child of that replica.
-
-gluster volume heal volume1 statistics heal-count replica 192.168.122.1:/brick2
-
-output:
--------
-Gathering count of entries to be healed per replica on volume volume1 has \
-been successful
-
-Brick 192.168.122.1:/brick_1
-Number of entries: 10                 <-- 10 files
-
-Brick 192.168.122.1:/brick_2
-No gathered input for this brick      <-- Brick is down
-
-Brick 192.168.122.1:/brick_3
-No gathered input for this brick      <-- No result, as we are not interested.
-
-Brick 192.168.122.1:/brick_4
-No gathered input for this brick      <-- No result, as we are not interested.
-
-
diff --git a/doc/features/afr-v1.md b/doc/features/afr-v1.md
deleted file mode 100644
index 8df66658613..00000000000
--- a/doc/features/afr-v1.md
+++ /dev/null
@@ -1,340 +0,0 @@
-#Automatic File Replication
-Afr xlator in glusterfs is responsible for replicating the data across the bricks.
-
-###Responsibilities of AFR
-Its responsibilities include the following:
-
-1. Maintain replication consistency, i.e. the data on both the bricks should be the same, even when operations happen on the same file/directory in parallel from multiple applications/mount points, as long as all the bricks in the replica set are up.
-
-2. Provide a way of recovering data in case of failures as long as there is
- at least one brick which has the correct data.
-
-3. Serve fresh data for read/stat/readdir etc
-
-###Transaction framework
-For 1 and 2 above, afr uses a transaction framework which consists of the following 5
-phases, which happen on all the bricks in the replica set (the bricks which are in replication):
-
-####1. Lock Phase
-####2. Pre-op Phase
-####3. Op Phase
-####4. Post-op Phase
-####5. Unlock Phase
-
-*Op Phase* is the actual operation sent by application (`write/create/unlink` etc). For every operation which afr receives that modifies data it sends that same operation in parallel to all the bricks in its replica set. This is how it achieves replication.
-
-*Lock, Unlock Phases* take necessary locks so that *Op phase* can provide **replication consistency** in normal work flow.
-
-#####For example:
-If an application performs `touch a` and the other one does `mkdir a`, afr makes sure that either file with name `a` or directory with name `a` is created on both the bricks.
-
-*Pre-op, Post-op Phases* provide changelogging which enables afr to figure out which copy is fresh.
-Once afr knows how to figure out fresh copy in the replica set it can **recover data** from fresh copy to stale copy. Also it can **serve fresh** data for `read/stat/readdir` etc.
-
-##Internal Operations
-Brief introduction to internal operations in Glusterfs which make *Locking, Unlocking, Pre/Post ops* possible:
-
-###Internal Locking Operations
-Glusterfs has **locks** translator which provides the following internal locking operations called `inodelk`, `entrylk` which are used by afr to achieve synchronization of operations on files or directories that conflict with each other.
-
-`Inodelk` gives the facility for translators in Glusterfs to obtain range (denoted by tuple with **offset**, **length**) locks in a given domain for an inode.
-Full file lock is denoted by the tuple (offset: `0`, length: `0`) i.e. length `0` is considered as infinity.
-
-`Entrylk` enables translators of Glusterfs to obtain locks on `name` in a given domain for an inode, typically a directory.
-
-**Locks** translator provides both *blocking* and *nonblocking* variants of these operations.
-
-###Xattrop
-For pre/post ops posix translator provides an operation called xattrop.
-xattrop is a way of *incrementing*/*decrementing* a number present in the extended attribute of the inode *atomically*.
-
-##Transaction Types
-There are 3 types of transactions in AFR.
-1. Data transactions
-    - Operations that add/modify/truncate the file contents.
-    - `Write`/`Truncate`/`Ftruncate` etc
-
-2. Metadata transactions
-    - Operations that modify the data kept in inode.
-    - `Chmod`/`Chown` etc
-
-3. Entry transactions
-    - Operations that add/remove/rename entries in a directory
-    - `Touch`/`Mkdir`/`Mknod` etc
-
-###Data transactions:
-
-*write* (`offset`, `size`) - writes `size` bytes of data starting at `offset`.
-
-*ftruncate*/*truncate* (`offset`) - truncates data from `offset` till the end of file.
-
-Afr internal locking needs to make sure that two conflicting data operations happen in order, one after the other, so that they do not result in replication inconsistency. Afr data operations take inodelks in the same domain (let's call it the `data` domain).
-
-*Write* operation with offset `O` and size `S` takes an inode lock in data domain with range `(O, S)`.
-
-*Ftruncate*/*Truncate* operations with offset `O` take inode locks in `data` domain with range `(O, 0)`. Please note that size `0` means size infinity.
-
-These ranges make sure that overlapping write/truncate/ftruncate operations are done one after the other.
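The overlap rule for these ranges can be illustrated with a minimal Python sketch. It is not the locks translator's code; `ranges_conflict` is a hypothetical helper, and it only assumes what the text states: a lock is an (offset, length) tuple and a length of `0` means "to infinity".

```python
def ranges_conflict(a, b):
    """Return True if two (offset, length) lock ranges overlap.

    A length of 0 is treated as "until infinity", as described above.
    """
    (a_off, a_len), (b_off, b_len) = a, b
    a_end = float("inf") if a_len == 0 else a_off + a_len
    b_end = float("inf") if b_len == 0 else b_off + b_len
    # Two half-open intervals overlap when each starts before the other ends.
    return a_off < b_end and b_off < a_end


write1 = (0, 4096)      # write at offset 0, size 4096
write2 = (4096, 4096)   # write at offset 4096, size 4096
truncate = (2048, 0)    # truncate from offset 2048 to the end of file

print(ranges_conflict(write1, write2))    # False: adjacent, not overlapping
print(ranges_conflict(write1, truncate))  # True
print(ranges_conflict(write2, truncate))  # True
```

Under this model the two writes may proceed in parallel, while the truncate has to be serialized against both of them.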
-
-Now that we know the ranges the operations take locks on, we will see how locking happens in afr.
-
-####Lock:
-Afr initially attempts **non-blocking** locks on **all** the bricks of the replica set in **parallel**. If all the locks are successful then it goes on to perform pre-op. But in case **non-blocking** locks **fail** because there is *at least one conflicting operation* which already has a **granted lock**, it **unlocks** the **non-blocking** locks it already acquired in the previous step and proceeds to perform **blocking** locks **one after the other** on each of the subvolumes in the order of subvolumes specified in the volfile.
-
-The chance of **conflicting operations** is **very low**, and the time elapsed in the **non-blocking** locks phase is `Max(latencies of the bricks for responding to inodelk)`, whereas the time elapsed in the **blocking locks** phase is `Sum(latencies of the bricks for responding to inodelk)`. That is why afr always tries non-blocking locks first and only then moves to blocking locks.
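The two-step strategy and its cost can be sketched as follows. This is a toy model rather than afr's implementation: the `Brick` class, its methods and the latency figures are all made up for illustration, and the "blocking" lock simply pretends the conflicting holder has released.

```python
class Brick:
    """Toy brick that tracks whether a conflicting lock is currently granted."""

    def __init__(self, name, held=False):
        self.name = name
        self.held = held

    def try_lock(self):
        """Non-blocking attempt: fail immediately if the lock is taken."""
        if self.held:
            return False
        self.held = True
        return True

    def lock(self):
        """Blocking lock: in this toy model the other holder has released by now."""
        self.held = True

    def unlock(self):
        self.held = False


def acquire(bricks):
    """Try non-blocking locks on every brick, else fall back to ordered blocking locks."""
    granted = []
    for brick in bricks:            # sent in parallel in the description above
        if brick.try_lock():
            granted.append(brick)
    if len(granted) == len(bricks):
        return "non-blocking"
    for brick in granted:           # give back what was acquired
        brick.unlock()
    for brick in bricks:            # one after the other, in volfile order
        brick.lock()
    return "blocking"


latencies_ms = [2, 5, 3]            # hypothetical inodelk latencies per brick
print(acquire([Brick("a"), Brick("b")]))             # non-blocking
print(acquire([Brick("a"), Brick("b", held=True)]))  # blocking
print(max(latencies_ms), sum(latencies_ms))          # 5 10
```

With these made-up latencies the parallel phase costs as much as the slowest brick (5 ms), while the serial phase costs the sum (10 ms).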
-
-####Pre-op:
-Each file/dir in a brick maintains the changelog (roughly, a pending operation count) of itself and that of the files
-present in all the other bricks in its replica set, as seen by that brick.
-
-Let's consider an example replica volume with 2 bricks, brick-a and brick-b.
-All files in brick-a will have 2 entries,
-one for itself and the other for the file present in the other brick of its replica set, i.e. brick-b.
-One can inspect changelogs using the getfattr command.
-
-    # getfattr -d -e hex -m. brick-a/file.txt
-    trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
-    trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for brick-b as seen by brick-a
-
-Likewise, all files in brick-b will have:
-
-    # getfattr -d -e hex -m. brick-b/file.txt
-    trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
-    trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for itself (brick-b)
-
-#####Interpreting Changelog Value:
-Each extended attribute has a value which is `24` hexadecimal digits, i.e. `12` bytes.
-The first `8` digits (`4` bytes) represent the changelog of `data`. The second `8` digits represent the changelog
-of `metadata`. The last `8` digits represent the changelog of `directory entries`.
-
-Pictorially representing the same, we have:
-
-    0x 00000000 00000000 00000000
-          |        |        |
-          |        |        \_ changelog of directory entries
-          |        \_ changelog of metadata
-          \_ changelog of data
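The layout above can be decoded with a few lines of Python. This is a small illustrative helper, not part of GlusterFS; the name `parse_changelog` is hypothetical, and it assumes the value is given in the hex form printed by `getfattr -e hex`.

```python
def parse_changelog(hex_value):
    """Split a 24-hex-digit changelog value into its three 4-byte counters."""
    digits = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    if len(digits) != 24:
        raise ValueError("expected 24 hexadecimal digits")
    return {
        "data": int(digits[0:8], 16),       # first 8 digits
        "metadata": int(digits[8:16], 16),  # second 8 digits
        "entry": int(digits[16:24], 16),    # last 8 digits
    }


print(parse_changelog("0x000000010000000000000000"))
# {'data': 1, 'metadata': 0, 'entry': 0}
```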
-
-Before write operation is performed on the brick, afr marks the file saying there is a pending operation.
-
-As part of this pre-op afr sends xattrop operation with increment 1 for data operation to make the extended attributes the following:
-
-    # getfattr -d -e hex -m. brick-a/file.txt
-    trusted.afr.vol-client-0=0x000000010000000000000000 -->changelog for itself (brick-a)
-    trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for brick-b as seen by brick-a
-
-Likewise, all files in brick-b will have:
-
-    # getfattr -d -e hex -m. brick-b/file.txt
-    trusted.afr.vol-client-0=0x000000010000000000000000 -->changelog for brick-a as seen by brick-b
-    trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for itself (brick-b)
-
-As the operation is in progress on files on both the bricks all the extended attributes show the same value.
-
-####Op:
-Now it sends the actual write operation to both the bricks. Afr remembers whether the operation is successful or not on all the subvolumes.
-
-####Post-Op:
-If the operation succeeds on all the bricks then there are no pending operations on any of the bricks, so as part of POST-OP afr sends xattrop operation with increment -1, i.e. decrement by 1, for data operation to set the extended attributes back to all zeros again.
-
-In case there is a failure on brick-b then there is still a pending operation on brick-b, whereas no pending operations are there on brick-a. So the xattrop operations for these two extended attributes differ now. For the extended attribute corresponding to brick-a, i.e. trusted.afr.vol-client-0, decrement by 1 is sent, whereas for the extended attribute corresponding to brick-b, increment by '0' is sent to retain the pending operation count.
-
-    # getfattr -d -e hex -m. brick-a/file.txt
-    trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
-    trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for brick-b as seen by brick-a
-
-    # getfattr -d -e hex -m. brick-b/file.txt
-    trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
-    trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for itself (brick-b)
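The pre-op and post-op bookkeeping for one write can be replayed with a short Python model. It is only a sketch under simplifying assumptions: plain list arithmetic stands in for the atomic xattrop performed on the brick, only the xattrs kept on one healthy brick are modelled, and the function names are hypothetical.

```python
DATA, METADATA, ENTRY = 0, 1, 2  # positions of the three 4-byte counters


def xattrop(changelog, index, delta):
    """Add delta to one counter (stand-in for the atomic xattrop on the brick)."""
    changelog[index] += delta


def to_hex(changelog):
    """Render the three counters the way getfattr -e hex shows them."""
    return "0x" + "".join(f"{count:08x}" for count in changelog)


def run_data_transaction(op_succeeded):
    """Model the xattrs kept on one healthy brick for a single write.

    op_succeeded maps each xattr name to whether the write succeeded
    on the brick that xattr refers to.
    """
    xattrs = {name: [0, 0, 0] for name in op_succeeded}
    for changelog in xattrs.values():       # pre-op: mark pending everywhere
        xattrop(changelog, DATA, +1)
    for name, ok in op_succeeded.items():   # post-op: clear only the successes
        xattrop(xattrs[name], DATA, -1 if ok else 0)
    return {name: to_hex(changelog) for name, changelog in xattrs.items()}


result = run_data_transaction({
    "trusted.afr.vol-client-0": True,   # write succeeded on brick-a
    "trusted.afr.vol-client-1": False,  # write failed on brick-b
})
for name, value in result.items():
    print(f"{name}={value}")
```

For the failure case this prints all zeros for `trusted.afr.vol-client-0` and a data count of 1 for `trusted.afr.vol-client-1`, matching the getfattr output shown for brick-a.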
-
-####Unlock:
-Once the transaction is completed unlock is sent on all the bricks where lock is acquired.
-
-
-###Meta Data transactions:
-
-setattr, setxattr, removexattr
-All metadata operations take same inode lock with same range in metadata domain.
-
-####Lock:
-Metadata locking also starts initially with non-blocking locks, then moves on to blocking locks on any failures because of conflicting operations.
-
-####Pre-op:
-Before metadata operation is performed on the brick, afr marks the file saying there is a pending operation.
-As part of this pre-op afr sends xattrop operation with increment 1 for metadata operation to make the extended attributes the following:
-
-    # getfattr -d -e hex -m. brick-a/file.txt
-    trusted.afr.vol-client-0=0x000000000000000100000000 -->changelog for itself (brick-a)
-    trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for brick-b as seen by brick-a
-
-Likewise, all files in brick-b will have:
-
-    # getfattr -d -e hex -m. brick-b/file.txt
-    trusted.afr.vol-client-0=0x000000000000000100000000 -->changelog for brick-a as seen by brick-b
-    trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for itself (brick-b)
-
-As the operation is in progress on files on both the bricks all the extended attributes show the same value.
-
-####Op:
-Now it sends the actual metadata operation to both the bricks. Afr remembers whether the operation is successful or not on all the subvolumes.
-
-####Post-Op:
-If the operation succeeds on all the bricks then there are no pending operations on any of the bricks, so as part of POST-OP afr sends xattrop operation with increment -1, i.e. decrement by 1, for metadata operation to set the extended attributes back to all zeros again.
-
-In case there is a failure on brick-b then there is still a pending operation on brick-b, whereas no pending operations are there on brick-a. So the xattrop operations for these two extended attributes differ now. For the extended attribute corresponding to brick-a, i.e. trusted.afr.vol-client-0, decrement by 1 is sent, whereas for the extended attribute corresponding to brick-b, increment by '0' is sent to retain the pending operation count.
-
-    # getfattr -d -e hex -m. brick-a/file.txt
-    trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
-    trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for brick-b as seen by brick-a
-
-    # getfattr -d -e hex -m. brick-b/file.txt
-    trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
-    trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for itself (brick-b)
-
-####Unlock:
-Once the transaction is completed unlock is sent on all the bricks where lock is acquired.
-
-
-###Entry transactions:
-
-create, mknod, mkdir, link, symlink, rename, unlink, rmdir
-Pre-op/Post-op (done using xattrop) always happens on the parent directory.
-
-Entry Locks taken by these entry operations:
-
-**Create** (file `dir/a`): Lock on name `a` in inode of `dir`
-
-**mknod** (file `dir/a`): Lock on name `a` in inode of `dir`
-
-**mkdir** (dir `dir/a`): Lock on name `a` in inode of `dir`
-
-**link** (file `oldfile`, file `dir/newfile`): Lock on name `newfile` in inode of `dir`
-
-**Symlink** (file `oldfile`, file `dir`/`symlinkfile`): Lock on name `symlinkfile` in inode of `dir`
-
-**rename** of (file `dir1`/`file1`, file `dir2`/`file2`): Lock on name `file1` in inode of `dir1`, Lock on name `file2` in inode of `dir2`
-
-**rename** of (dir `dir1`/`dir2`, dir `dir3`/`dir4`): Lock on name `dir2` in inode of `dir1`, Lock on name `dir4` in inode of `dir3`, Lock on `NULL` in inode of `dir4`
-
-**unlink** (file `dir`/`a`): Lock on name `a` in inode of `dir`
-
-**rmdir** (dir `dir`/`a`): Lock on name `a` in inode of `dir`, Lock on `NULL` in inode of `a`
-
-####Lock:
-Entry locking also starts initially with non-blocking locks, then moves on to blocking locks on any failures because of conflicting operations.
-
-####Pre-op:
-Before entry operation is performed on the brick, afr marks the directory saying there is a pending operation.
-
-As part of this pre-op afr sends xattrop operation with increment 1 for entry operation to make the extended attributes the following:
-
-    # getfattr -d -e hex -m. brick-a/
-    trusted.afr.vol-client-0=0x000000000000000000000001 -->changelog for itself (brick-a)
-    trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for brick-b as seen by brick-a
-
-Likewise, the directory in brick-b will have:
-
-    # getfattr -d -e hex -m. brick-b/
-    trusted.afr.vol-client-0=0x000000000000000000000001 -->changelog for brick-a as seen by brick-b
-    trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for itself (brick-b)
-
-As the operation is in progress on files on both the bricks all the extended attributes show the same value.
-
-####Op:
-Now it sends the actual entry operation to both the bricks. Afr remembers whether the operation is successful or not on all the subvolumes.
-
-####Post-Op:
-If the operation succeeds on all the bricks then there are no pending operations on any of the bricks, so as part of POST-OP afr sends xattrop operation with increment -1, i.e. decrement by 1, for entry operation to set the extended attributes back to all zeros again.
-
-In case there is a failure on brick-b then there is still a pending operation on brick-b, whereas no pending operations are there on brick-a. So the xattrop operations for these two extended attributes differ now. For the extended attribute corresponding to brick-a, i.e. trusted.afr.vol-client-0, decrement by 1 is sent, whereas for the extended attribute corresponding to brick-b, increment by '0' is sent to retain the pending operation count.
-
-    # getfattr -d -e hex -m. brick-a/
-    trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
-    trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for brick-b as seen by brick-a
-
-    # getfattr -d -e hex -m. brick-b/
-    trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
-    trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for itself (brick-b)
-
-####Unlock:
-Once the transaction is completed unlock is sent on all the bricks where lock is acquired.
-
-The parts above cover how replication consistency is achieved in afr.
-
-Now let us look at how afr can figure out how to recover from failures given the changelog extended attributes.
-
-###Recovering from failures (Self-heal)
-For recovering from failures afr tries to determine which copy is the fresh copy based on the extended attributes.
-
-There are 3 possibilities:
-1. All the extended attributes are zero on all the bricks. This means there are no pending operations on any of the bricks so there is nothing to recover.
-2. According to the extended attributes there is a brick (brick-a) which noticed that there are operations pending on the other brick (brick-b).
-    - There are 4 possibilities for brick-b
-
-        - It did not even participate in the transaction (all extended attributes on brick-b are zeros). Choose brick-a as source and perform recovery to brick-b.
-
-        - It participated in the transaction but died even before post-op (all extended attributes on brick-b have a pending-count). Choose brick-a as source and perform recovery to brick-b.
-
-        - It participated in the transaction and after the post-op extended attributes on brick-b show that there are pending operations on itself. Choose brick-a as source and perform recovery to brick-b.
-
-        - It participated in the transaction and after the post-op extended attributes on brick-b show that there are pending operations on brick-a. This situation is called Split-brain and there is no way to recover. This situation can happen in cases of network partition.
-
-3. The only possibility now is where both brick-a and brick-b have pending operations. In this case the changelog extended attributes are all non-zero on all the bricks. Basically what could have happened is that the operations started on the file but either the whole replica set went down or the mount process itself died before post-op was performed. In this case there is a possibility that the data on the bricks is different. Afr then chooses the file with the bigger size as source. If both files have the same size, it chooses the subvolume which has witnessed the larger number of pending operations on the other brick as source. If both have the same number of pending operations, it chooses the file with the newest ctime as source. If this is also the same, it just picks one of the two bricks as source and syncs data on to the other to make sure that the files are replicas of each other.
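The decision procedure above can be condensed into a small Python sketch for a two-brick replica and a single changelog type. It is a simplified reading of the three possibilities, not afr's source-selection code: the dictionary keys (`on_a`, `on_b`, `size`, `ctime`) and the function name are hypothetical, and every combination not listed above falls through to the tie-break of possibility 3.

```python
def pick_source(a, b):
    """Decide the heal source from the xattrs stored on brick-a and brick-b.

    Each argument is a dict: 'on_a' / 'on_b' are the pending counts that
    brick records against brick-a / brick-b, plus 'size' and 'ctime'.
    """
    if not any((a["on_a"], a["on_b"], b["on_a"], b["on_b"])):
        return "nothing to heal"                      # possibility 1
    a_blames_b = a["on_a"] == 0 and a["on_b"] > 0     # a is clean and accuses b
    b_blames_a = b["on_b"] == 0 and b["on_a"] > 0     # b is clean and accuses a
    if a_blames_b and b_blames_a:
        return "split-brain"                          # possibility 2, last case
    if a_blames_b:
        return "brick-a"                              # possibility 2, first three cases
    if b_blames_a:
        return "brick-b"
    # Possibility 3: bigger size, then more pending ops witnessed on the
    # other brick, then newest ctime; on a full tie pick brick-a.
    key_a = (a["size"], a["on_b"], a["ctime"])
    key_b = (b["size"], b["on_a"], b["ctime"])
    return "brick-a" if key_a >= key_b else "brick-b"


brick_a = {"on_a": 0, "on_b": 1, "size": 100, "ctime": 10}
brick_b = {"on_a": 1, "on_b": 0, "size": 100, "ctime": 12}
print(pick_source(brick_a, brick_b))  # split-brain
```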
-
-###Self-healing:
-Afr does 3 types of self-heals for data recovery.
-
-1. Data self-heal
-
-2. Metadata self-heal
-
-3. Entry self-heal
-
-As we have seen earlier, afr depends on changelog extended attributes to figure out which copy is source and which copy is sink. General algorithm for performing this recovery (self-heal) is same for all of these different self-heals.
-
-1. Take appropriate full locks on the file/directory to make sure no other transaction is in progress while inspecting changelog extended attributes.
-In this step,
-    - Data self-heal takes an inode lock with `offset: 0` and `size: 0` (infinity) in data domain.
-    - Entry self-heal takes an entry lock on the directory with `NULL` name, i.e. a full directory lock.
-    - Metadata self-heal takes an inode lock on the pre-defined range in metadata domain on which all the metadata operations on that inode take locks. To prevent duplicate data self-heal an inode lock is taken in self-heal domain as well.
-
-2. Perform Sync from fresh copy to stale copy.
-In this step,
-    - Metadata self-heal gets the inode attributes and extended attributes from the source copy and sets them on the stale copy.
-
-    - Entry self-heal reads entries on stale directories and checks whether they are present on the source directory; if they are not present it deletes them. Then it reads entries on the fresh directory and creates the missing entries on stale directories.
-
-    - Data self-heal does things a bit differently to make sure no other writes on the file are blocked for the duration of self-heal, because file sizes could be as big as 100G (VM files) and we don't want to block all the transactions until the self-heal is over. Locks translator allows two overlapping locks to be granted if they are from the same lock owner. Using this, data self-heal takes a small 128k range lock and unlocks the previously acquired lock, heals just that 128k chunk, then takes the lock for the next 128k chunk, unlocks the previous lock and moves to the next one. It always makes sure that at least one lock is held on the file by self-heal throughout the duration of self-heal so that two self-heals don't happen in parallel.
-
-    - Data self-heal has two algorithms. In 'diff' self-heal, a chunk is copied only when there is a data mismatch for that chunk. The other one, 'full' self-heal, is a blind copy of each chunk.
-
-3. Change extended attributes to mark new sources after the sync.
-
-4. Unlock the locks acquired to perform self-heal.
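The chunk-by-chunk data self-heal and its 'diff' and 'full' algorithms can be illustrated in Python on in-memory buffers. This is only a sketch: locking is left out, the SHA-256 comparison is an assumed stand-in for whatever checksum afr uses, and `heal` is a hypothetical name.

```python
import hashlib

CHUNK = 128 * 1024  # heal in 128k chunks, as described above


def heal(source: bytes, sink: bytearray, algorithm="diff"):
    """Sync sink to source chunk by chunk; return the number of chunks copied."""
    copied = 0
    del sink[len(source):]  # drop trailing data the source does not have
    for offset in range(0, len(source), CHUNK):
        src_chunk = source[offset:offset + CHUNK]
        dst_chunk = bytes(sink[offset:offset + CHUNK])
        if algorithm == "diff":
            same = hashlib.sha256(src_chunk).digest() == hashlib.sha256(dst_chunk).digest()
            if same:
                continue  # 'diff': skip chunks that already match
        sink[offset:offset + len(src_chunk)] = src_chunk
        copied += 1
    return copied


fresh = b"a" * CHUNK + b"b" * CHUNK + b"c" * 10
stale = bytearray(b"a" * CHUNK + b"x" * CHUNK)
print(heal(fresh, stale, "diff"))  # 2: one mismatching chunk, one missing chunk
print(bytes(stale) == fresh)       # True
```

A 'full' heal of the same pair would copy all three chunks, which is simpler but moves more data.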
-
-### Transaction Optimizations:
-As we saw earlier afr transaction for all the operations that modify data happens in 5 phases, i.e. it sends 5 operations on the network for every operation. In the following sections we will see optimizations already implemented in afr which reduce the number of operations on the network to just 1 per transaction in best case.
-
-####Changelog-piggybacking
-This optimization comes into picture when on same file descriptor, before write1's post op is complete write2's pre-op starts and the operations are succeeding. When writes come in that manner we can piggyback on the pre-op of write1 for write2 and somehow tell write1 that write2 will do the post-op that was supposed to be done by write1. So write1's post-op does not happen over network, write2's pre-op does not happen over network. This optimization does not hold if there are any failures in write1's phases.
-
-####Delayed Post-op
-This optimization just delays post-op of the write transaction (write1) by a pre-configured amount of time to increase the probability of the next write piggybacking on the pre-op done by write1.
-
-With the combination of these two optimizations for operations like full file copy which are write intensive operations, what will essentially happen is for the first write a pre-op will happen. Then for the last write on the file post-op happens. So for all the write transactions between first write and last write afr reduces network operations from 5 to 3.
-
-####Eager-locking:
-This optimization comes into picture when only one file descriptor is open on the file and performing writes just like in the previous optimization. What this optimization does is it takes a full file lock on the file irrespective of the offset and size of the write, so that the lock acquired by write1 can be piggybacked by write2 and write2 takes the responsibility of unlocking it. Both write1 and write2 will have the same lock owner, and afr takes the responsibility of serializing overlapping writes so that replication consistency is maintained.
-
-With the combination of these optimizations for operations like full file copy which are write intensive operations, what will essentially happen is for the first write a pre-op, full-file lock will happen. Then for the last write on the file post-op, unlock happens. So for all the write transactions between first write and last write afr reduced network operations from 5 to 1.
-
-###Quorum in afr:
-To avoid split-brains, afr employs the following quorum policies.
- - In a replica set with an odd number of bricks, the replica set is said to be in quorum if more than half of the bricks are up.
- - In a replica set with an even number of bricks, the replica set is in quorum if more than half of the bricks are up. If the number of bricks that are up is equal to the number of bricks that are down, it is in quorum only if the first brick is among the bricks that are up.
-
-When quorum is not met in the replica set then modify operations on the mount are not allowed by afr.
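The quorum policy above fits in a few lines of Python. This is a sketch of the rule as stated in the text, not afr's code; `in_quorum` is a hypothetical name and the list is assumed to be in volfile order so that index 0 is the first brick.

```python
def in_quorum(brick_up):
    """brick_up: list of booleans in volfile order; brick_up[0] is the first brick."""
    total = len(brick_up)
    up = sum(brick_up)
    if 2 * up > total:        # more than half of the bricks are up
        return True
    if 2 * up == total:       # exact half (even-sized sets only): first brick decides
        return brick_up[0]
    return False


print(in_quorum([True, True, False]))  # True
print(in_quorum([True, False]))        # True: first brick is up
print(in_quorum([False, True]))        # False: first brick is down
```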
-
-###Self-heal daemon and Index translator usage by afr:
-
-####Index xlator:
-On each brick index xlator is loaded. This xlator keeps track of what is happening in afr's pre-op and post-op. If there is an ongoing I/O or a pending self-heal, changelog xattrs would have non-zero values. Whenever xattrop/fxattrop fop (pre-op, post-ops are done using these fops) comes to index xlator a link (with gfid as name of the file on which the fop is performed) is added in <brick>/.glusterfs/indices/xattrop directory. If the value returned by the fop is zero the link is removed from the index otherwise it is kept until zero is returned in the subsequent xattrop/fxattrop fops.
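The index bookkeeping can be modelled with a set of gfids. This is a toy model only: the real xlator keeps links in the `indices/xattrop` directory on disk, and the class, method names and gfid strings below are hypothetical.

```python
class XattropIndex:
    """Toy stand-in for the <brick>/.glusterfs/indices/xattrop directory."""

    def __init__(self):
        self.links = set()

    def on_xattrop(self, gfid, resulting_changelog):
        """Record the outcome of an xattrop/fxattrop fop for one file."""
        self.links.add(gfid)               # link added when the fop arrives
        if not any(resulting_changelog):   # all counters zero: nothing pending
            self.links.discard(gfid)

    def heal_count(self):
        """Number of entries a heal-count query would see for this brick."""
        return len(self.links)


index = XattropIndex()
index.on_xattrop("gfid-1", [1, 0, 0])  # pre-op: data operation pending
index.on_xattrop("gfid-2", [1, 0, 0])
index.on_xattrop("gfid-1", [0, 0, 0])  # post-op succeeded everywhere
print(index.heal_count())              # 1
```

Entries that stay in the index after the post-op are the ones the self-heal daemon later picks up.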
-
-####Self-heal-daemon:
-The self-heal-daemon process keeps running on each machine of the trusted storage pool. This process has afr xlators of all the volumes which are started. Its job is to crawl indices on bricks that are local to that machine. If any of the files represented by the gfid in the link name need healing, it automatically heals them. This operation is performed every 10 minutes for each replica set. Additionally, this operation is also performed when a brick comes online.
diff --git a/doc/features/bd-xlator.md b/doc/features/bd-xlator.md
deleted file mode 100644
index 1771fb6e24b..00000000000
--- a/doc/features/bd-xlator.md
+++ /dev/null
@@ -1,469 +0,0 @@
-#Block device translator
-
-Block device translator (BD xlator) is a translator added to GlusterFS which provides block backend for GlusterFS. This replaces the existing bd_map translator in GlusterFS that provided similar but very limited functionality. GlusterFS expects the underlying brick to be formatted with a POSIX compatible file system. BD xlator changes that and allows for having bricks that are raw block devices like LVM which needn’t have any file systems on them. Hence with BD xlator, it becomes possible to build a GlusterFS volume comprising bricks that are logical volumes (LV).
-
-##bd
-
-BD xlator maps underlying LVs to files and hence the LVs appear as files to GlusterFS clients. Though BD volume externally appears very similar to the usual Posix volume, not all operations are supported or possible for the files on a BD volume. Only those operations that make sense for a block device are supported and the exact semantics are described in subsequent sections.
-
-While Posix volume takes a file system directory as brick, BD volume needs a volume group (VG) as brick. In the usual use case of BD volume, a file created on BD volume will result in an LV being created in the brick VG. In addition to a VG, BD volume also needs a file system directory that should be specified at the volume creation time. This directory is necessary for supporting the notion of directories and directory hierarchy for the BD volume. Metadata about LVs (size, mapping info) is stored in this directory.
-
-BD xlator was mainly developed to use block devices directly as VM images when GlusterFS is used as storage for KVM virtualization. Some of the salient points of BD xlator are
-
-* Since BD supports file level snapshots and clones by leveraging the snapshot and clone capabilities of LVM, it can be used to fully off-load snapshot and cloning operations from QEMU to the storage (GlusterFS) itself.
-
-* BD understands dm-thin LVs and hence can support files that are backed by thinly provisioned LVs. This capability of BD xlator translates to having thinly provisioned raw VM images.
-
-* BD enables thin LVs from a thin pool to be used from multiple nodes that have visibility to GlusterFS BD volume. Thus thin pool can be used as a VM image repository allowing access/visibility to it from multiple nodes.
-
-* BD supports true zerofill by using BLKZEROOUT ioctl on underlying block devices. Thus BD allows SCSI WRITESAME to be used on underlying block device if the device supports it.
-
-Though BD xlator is primarily intended to be used with block devices, it does provide full Posix xlator compatibility for files that are created on BD volume but are not backed by or mapped to a block device. Such files which don’t have a block device mapping exist on the Posix directory that is specified during BD volume creation. BD xlator is available from GlusterFS-3.5 release.
-
-###Compiling BD translator
-
-BD xlator needs the lvm2 development library. The `--enable-bd-xlator` option can be used with the `./configure` script to explicitly enable BD translator. The following snippet from the output of the configure script shows that BD xlator is enabled for compilation.
-
-
-#####GlusterFS configure summary
-
- …
- Block Device xlator : yes
-
-
-###Creating a BD volume
-
-BD supports hosting of both linear LVs and thin LVs within the same volume. However, separate examples are provided below. As noted above, the prerequisite for a BD volume is a VG, which is created from a loop device here, but it can be any other device too.
-
-
-* Creating BD volume with linear LV backend
-
-* Create a loop device
-
-
- [root@node ~]# dd if=/dev/zero of=bd-loop count=1024 bs=1M
-
- [root@node ~]# losetup /dev/loop0 bd-loop
-
-
-* Prepare a brick by creating a VG
-
- [root@node ~]# pvcreate /dev/loop0
-
- [root@node ~]# vgcreate bd-vg /dev/loop0
-
-
-* Create the BD volume
-
-* Create a POSIX directory first
-
-
- [root@node ~]# mkdir /bd-meta
-
-It is recommended that this directory is created on an LV in the brick VG itself so that both data and metadata live together on the same device.
-
-
-* Create and mount the volume
-
- [root@node ~]# gluster volume create bd node:/bd-meta?bd-vg force
-
-
-The general syntax for specifying the brick is `host:/posix-dir?volume-group-name` where “?” is the separator.
-
-
-
- [root@node ~]# gluster volume start bd
- [root@node ~]# gluster volume info bd
- Volume Name: bd
- Type: Distribute
- Volume ID: cb042d2a-f435-4669-b886-55f5927a4d7f
- Status: Started
- Xlator 1: BD
- Capability 1: offload_copy
- Capability 2: offload_snapshot
- Number of Bricks: 1
- Transport-type: tcp
- Bricks:
- Brick1: node:/bd-meta
- Brick1 VG: bd-vg
-
-
-
- [root@node ~]# mount -t glusterfs node:/bd /mnt
-
-* Create a file that is backed by an LV
-
- [root@node ~]# ls /mnt
-
- [root@node ~]#
-
-Since the volume is empty now, so is the underlying VG.
-
- [root@node ~]# lvdisplay bd-vg
- [root@node ~]#
-
-Creating a file that is mapped to an LV is a 2 step operation. First the file should be created on the mount point, and then a specific extended attribute should be set to map the file to an LV.
-
- [root@node ~]# touch /mnt/lv
- [root@node ~]# setfattr -n "user.glusterfs.bd" -v "lv" /mnt/lv
-
-Now an LV got created in the VG brick and the file /mnt/lv maps to this LV. Any read/write to this file ends up as read/write to the underlying LV.
-
- [root@node ~]# lvdisplay bd-vg
- --- Logical volume ---
- LV Path /dev/bd-vg/6ff0f25f-2776-4d19-adfb-df1a3cab8287
- LV Name 6ff0f25f-2776-4d19-adfb-df1a3cab8287
- VG Name bd-vg
- LV UUID PjMPcc-RkD5-RADz-6ixG-UYsk-oclz-vL0nv6
- LV Write Access read/write
- LV Creation host, time node, 2013-11-26 16:15:45 +0530
- LV Status available
- # open 0
- LV Size 4.00 MiB
- Current LE 1
- Segments 1
- Allocation inherit
- Read ahead sectors 0
- Block device 253:6
-
-The file gets created with the default LV size, which is 1 LE (4MB in this case).
-
- [root@node ~]# ls -lh /mnt/lv
- -rw-r--r--. 1 root root 4.0M Nov 26 16:15 /mnt/lv
-
-truncate can be used to set the required file size.
-
- [root@node ~]# truncate /mnt/lv -s 256M
- [root@node ~]# lvdisplay bd-vg
- --- Logical volume ---
- LV Path /dev/bd-vg/6ff0f25f-2776-4d19-adfb-df1a3cab8287
- LV Name 6ff0f25f-2776-4d19-adfb-df1a3cab8287
- VG Name bd-vg
- LV UUID PjMPcc-RkD5-RADz-6ixG-UYsk-oclz-vL0nv6
- LV Write Access read/write
- LV Creation host, time node, 2013-11-26 16:15:45 +0530
- LV Status available
- # open 0
- LV Size 256.00 MiB
- Current LE 64
- Segments 1
- Allocation inherit
- Read ahead sectors 0
- Block device 253:6
-
-
- [root@node ~]# ls -lh /mnt/lv
- -rw-r--r--. 1 root root 256M Nov 26 16:15 /mnt/lv
-
- currently LV size has been set to 256
-
-The size of the file/LV can be specified during creation/mapping time itself like this:
-
- setfattr -n "user.glusterfs.bd" -v "lv:256MB" /mnt/lv
-
-2. Creating BD volume with thin LV backend
-
-* Create a loop device
-
-
- [root@node ~]# dd if=/dev/zero of=bd-loop-thin count=1024 bs=1M
-
- [root@node ~]# losetup /dev/loop0 bd-loop-thin
-
-
-* Prepare a brick by creating a VG and thin pool
-
-
- [root@node ~]# pvcreate /dev/loop0
-
- [root@node ~]# vgcreate bd-vg-thin /dev/loop0
-
-
-* Create a thin pool
-
-
- [root@node ~]# lvcreate --thin bd-vg-thin -L 1000M
-
- Rounding up size to full physical extent 4.00 MiB
- Logical volume "lvol0" created
-
-lvdisplay shows the thin pool
-
- [root@node ~]# lvdisplay bd-vg-thin
- --- Logical volume ---
- LV Name lvol0
- VG Name bd-vg-thin
- LV UUID HVa3EM-IVMS-QG2g-oqU6-1UxC-RgqS-g8zhVn
- LV Write Access read/write
- LV Creation host, time node, 2013-11-26 16:39:06 +0530
- LV Pool transaction ID 0
- LV Pool metadata lvol0_tmeta
- LV Pool data lvol0_tdata
- LV Pool chunk size 64.00 KiB
- LV Zero new blocks yes
- LV Status available
- # open 0
- LV Size 1000.00 MiB
- Allocated pool data 0.00%
- Allocated metadata 0.88%
- Current LE 250
- Segments 1
- Allocation inherit
- Read ahead sectors auto
- Block device 253:9
-
-* Create the BD volume
-
-* Create a POSIX directory first
-
-
- [root@node ~]# mkdir /bd-meta-thin
-
-* Create and mount the volume
-
- [root@node ~]# gluster volume create bd-thin node:/bd-meta-thin?bd-vg-thin force
-
- [root@node ~]# gluster volume start bd-thin
-
-
- [root@node ~]# gluster volume info bd-thin
- Volume Name: bd-thin
- Type: Distribute
- Volume ID: 27aa7eb0-4ffa-497e-b639-7cbda0128793
- Status: Started
- Xlator 1: BD
- Capability 1: thin
- Capability 2: offload_copy
- Capability 3: offload_snapshot
- Number of Bricks: 1
- Transport-type: tcp
- Bricks:
- Brick1: node:/bd-meta-thin
- Brick1 VG: bd-vg-thin
-
-
- [root@node ~]# mount -t glusterfs node:/bd-thin /mnt
-
-* Create a file that is backed by a thin LV
-
-
- [root@node ~]# ls /mnt
-
- [root@node ~]#
-
-Creating a file that is mapped to a thin LV is a 2 step operation. First the file should be created on the mount point, and then a specific extended attribute should be set to map the file to a thin LV.
-
- [root@node ~]# touch /mnt/thin-lv
-
- [root@node ~]# setfattr -n "user.glusterfs.bd" -v "thin:256MB" /mnt/thin-lv
-
-Now /mnt/thin-lv is a thin provisioned file that is backed by a thin LV, and its size has been set to 256 MB.
-
- [root@node ~]# lvdisplay bd-vg-thin
- --- Logical volume ---
- LV Name lvol0
- VG Name bd-vg-thin
- LV UUID HVa3EM-IVMS-QG2g-oqU6-1UxC-RgqS-g8zhVn
- LV Write Access read/write
- LV Creation host, time node, 2013-11-26 16:39:06 +0530
- LV Pool transaction ID 1
- LV Pool metadata lvol0_tmeta
- LV Pool data lvol0_tdata
- LV Pool chunk size 64.00 KiB
- LV Zero new blocks yes
- LV Status available
- # open 0
- LV Size 1000.00 MiB
- Allocated pool data 0.00%
- Allocated metadata 0.98%
- Current LE 250
- Segments 1
- Allocation inherit
- Read ahead sectors auto
- Block device 253:9
-
-
-
-
- --- Logical volume ---
- LV Path /dev/bd-vg-thin/081b01d1-1436-4306-9baf-41c7bf5a2c73
- LV Name 081b01d1-1436-4306-9baf-41c7bf5a2c73
- VG Name bd-vg-thin
- LV UUID coxpTY-2UZl-9293-8H2X-eAZn-wSp6-csZIeB
- LV Write Access read/write
- LV Creation host, time node, 2013-11-26 16:43:19 +0530
- LV Pool name lvol0
- LV Status available
- # open 0
- LV Size 256.00 MiB
- Mapped size 0.00%
- Current LE 64
- Segments 1
- Allocation inherit
- Read ahead sectors auto
- Block device 253:10
-
-
-
-
-
-As can be seen from above, creation of a file resulted in creation of a thin LV in the brick.
-
-
-###Improvements to BD translator:
-
-The first version of BD xlator (block backend) had a few limitations, such as
-
-* Creation of directories not supported
-* Supports only single brick
-* Does not use extended attributes (and client gfid) like posix xlator
-* Creation of special files (symbolic links, device nodes etc) not
- supported
-
-The basic limitation of not allowing directory creation prevented
-oVirt/VDSM from consuming BD xlator as part of a Gluster domain, since VDSM
-creates multi-level directories when GlusterFS is used as the storage
-backend for storing VM images.
-
-To overcome these limitations, a new BD xlator with the following
-improvements is implemented.
-
-* New hybrid BD xlator that handles both regular files and block device
- files
-* The volume will have both POSIX and BD bricks. Regular files are
- created on POSIX bricks, block devices are created on the BD brick (VG)
-* BD xlator leverages the existing POSIX xlator for most POSIX calls and
- hence sits above the POSIX xlator
-* Block device file is differentiated from regular file by an extended
- attribute
-* The xattr 'user.glusterfs.bd' (BD_XATTR) plays a role in mapping a
- posix file to Logical Volume (LV).
-* When a client sends a request to set BD_XATTR on a posix file, a new
- LV is created and mapped to posix file. So every block device will
- have a representative file in POSIX brick with 'user.glusterfs.bd'
- (BD_XATTR) set.
-* Hereafter, all operations on this file result in LV-related
- operations.
-
-For example, opening a file that has BD_XATTR set results in opening
-the LV block device, reading results in reading the corresponding LV
-block device.
-
-When BD xlator gets a request to set BD_XATTR via a setxattr call, it
-creates an LV, and information about this LV is placed in the xattr of the
-posix file. The xattr "user.glusterfs.bd" is used to identify that the posix file
-is mapped to BD.
-
-Usage:
-Server side:
-
- [root@host1 ~]# gluster volume create bdvol host1:/storage/vg1_info?vg1 host2:/storage/vg2_info?vg2
-
-It creates a distributed gluster volume 'bdvol' with Volume Group vg1
-using posix brick /storage/vg1_info in host1 and Volume Group vg2 using
-/storage/vg2_info in host2.
-
-
- [root@host1 ~]# gluster volume start bdvol
-
-Client side:
-
- [root@node ~]# mount -t glusterfs host1:/bdvol /media
- [root@node ~]# touch /media/posix
-
-It creates regular posix file 'posix' in either host1:/vg1 or host2:/vg2 brick
-
- [root@node ~]# mkdir /media/image
-
- [root@node ~]# touch /media/image/lv1
-
-
-It also creates regular posix file 'lv1' in either host1:/vg1 or
-host2:/vg2 brick
-
- [root@node ~]# setfattr -n "user.glusterfs.bd" -v "lv" /media/image/lv1
-
- [root@node ~]#
-
-
-The above setxattr results in creating a new LV in the corresponding brick's VG
-and it sets 'user.glusterfs.bd' with value 'lv:<default-extent-size>'
-
-
- [root@node ~]# truncate -s5G /media/image/lv1
-
-
-It results in resizing LV 'lv1' to 5G
-
-New BD xlator code is placed in `xlators/storage/bd` directory.
-
-The volume UUID is also added to the VG as a tag so that the same VG cannot be used for other
-bricks/volumes. After deleting a gluster volume, one has to manually
-remove the associated tag using vgchange <vg-name> --deltag
-`<trusted.glusterfs.volume-id:<volume-id>>`
-
-
-#### Exposing volume capabilities
-
-With multiple storage translators (posix and bd) being supported in GlusterFS, it becomes
-necessary to know the volume type so that the user can issue appropriate calls that are relevant
-only to a given volume type. Hence there needs to be a way to expose the type of
-the storage translator of the volume to the user.
-
-BD xlator is capable of providing server offloaded file copy, server/storage offloaded
-zeroing of a file etc. These capabilities should be visible to the client/user, so that these
-features can be exploited.
-
-BD xlator exports capability information through gluster volume info (and --xml) output. For example:
-
-`snip of gluster volume info output for a BD based volume`
-
- Xlator 1: BD
- Capability 1: thin
-
-`snip of gluster volume info --xml output for a BD based volume`
-
- <xlators>
- <xlator>
- <name>BD</name>
- <capabilities>
- <capability>thin</capability>
- </capabilities>
- </xlator>
- </xlators>
-
-But this capability information should also be exposed through some other means so that a host
-which is not part of the Gluster peer group can also make use of these capabilities.
-
-* Type
-
-BD translator supports both regular files and block devices, i.e., one can create files on a
-GlusterFS volume backed by BD translator and such a file could end up as a regular posix file or
-a logical volume (block device) based on the user's choice. The user can do a setxattr on the
-created file to convert it to a logical volume.
-
-Users of a BD backed volume, like QEMU, would like to know that they are working with a BD type of volume
-so that they can issue an additional setxattr call after creating a VM image on the GlusterFS backend.
-This is necessary to ensure that the created VM image is backed by an LV instead of a file.
-
-There are different ways to expose this information (BD type of volume) to the user.
-One way is to export it via a `getxattr` call. That is, when a client issues getxattr("volume_type")
-on the root gfid, bd xlator will return 1, implying that it is the BD xlator. But posix xlator will return ENODATA
-and client code can interpret this as posix xlator. Also, the capability list can be returned via
-getxattr("caps") for the root gfid.
-
-* Capabilities
-
-BD xlator supports new features such as server offloaded file copy, thin provisioned VM images etc.
-
-There is no standard way of exploiting these features from the client side (such as a syscall
-to exploit server offloaded copy). So these features need to be exported to the client so that
-they can be used. The latest version of BD xlator exports this capability information through
-gluster volume info (and --xml) output. But if a client is not part of a GlusterFS peer group,
-it can't run the volume info command to get the list of capabilities of a given GlusterFS volume.
-For example, the GlusterFS block driver in qemu needs to get the capability list so that these features can be used.
-
-
-
-Parts of this documentation were originally published here
-#http://raobharata.wordpress.com/2013/11/27/glusterfs-block-device-translator/
diff --git a/doc/features/brick-failure-detection.md b/doc/features/brick-failure-detection.md
deleted file mode 100644
index 24f2a18f39f..00000000000
--- a/doc/features/brick-failure-detection.md
+++ /dev/null
@@ -1,67 +0,0 @@
-# Brick Failure Detection
-
-This feature attempts to identify storage/file system failures and disable the failed brick without disrupting the remainder of the node's operation.
-
-## Description
-
-Detecting failures on the filesystem that a brick uses makes it possible to handle errors that are caused from outside of the Gluster environment.
-
-There have been hanging brick processes when the underlying storage of a brick became unavailable. A hanging brick process can still use the network and respond to clients, but actual I/O to the storage is impossible and can cause noticeable delays on the client side.
-
-This feature provides better detection of storage subsystem failures and prevents bricks from hanging. It should prevent hanging brick processes when storage hardware or the filesystem fails.
-
-A health-checker (thread) has been added to the posix xlator. This thread periodically checks the status of the filesystem (implies checking of functional storage-hardware).
-
-`glusterd` can detect that the brick process has exited, and `gluster volume status` will show that the brick process is not running anymore. System administrators checking the logs should be able to triage the cause.
-
-## Usage and Configuration
-
-The health-checker is enabled by default and runs a check every 30 seconds. This interval can be changed per volume with:
-
- # gluster volume set <VOLNAME> storage.health-check-interval <SECONDS>
-
-If `SECONDS` is set to 0, the health-checker will be disabled.
-
-## Failure Detection
-
-Errors are logged to the standard syslog (mostly `/var/log/messages`):
-
- Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 5 buf count 512
- Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): I/O Error Detected. Shutting down filesystem
- Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): Please umount the filesystem and rectify the problem(s)
- Jun 24 11:31:49 vm130-32 kernel: VFS:Filesystem freeze failed
- Jun 24 11:31:50 vm130-32 GlusterFS[1969]: [2013-06-24 10:31:50.500674] M [posix-helpers.c:1114:posix_health_check_thread_proc] 0-failing_xfs-posix: health-check failed, going down
- Jun 24 11:32:09 vm130-32 kernel: XFS (dm-2): xfs_log_force: error 5 returned.
- Jun 24 11:32:20 vm130-32 GlusterFS[1969]: [2013-06-24 10:32:20.508690] M [posix-helpers.c:1119:posix_health_check_thread_proc] 0-failing_xfs-posix: still alive! -> SIGTERM
-
-The messages labelled with `GlusterFS` in the above output are also written to the logs of the brick process.
-
-## Recovery after a failure
-
-When a brick process detects that the underlying storage is not responding anymore, the process will exit. There is no automated way to restart the brick process; the sysadmin will need to fix the problem with the storage first.
-
-After correcting the storage (hardware or filesystem) issue, the following command will start the brick process again:
-
- # gluster volume start <VOLNAME> force
-
-## How To Test
-
-The health-checker thread that is part of each brick process will get started automatically when a volume has been started. Verifying its functionality can be done in different ways.
-
-On virtual hardware:
-
-* disconnect the disk from the VM that holds the brick
-
-On real hardware:
-
-* simulate a RAID-card failure by unplugging the card or cables
-
-On a system that uses LVM for the bricks:
-
-* use device-mapper to load an error-table for the disk, see [this description](http://review.gluster.org/5176).
-
-On any system (writing to random offsets of the block device, more difficult to trigger):
-
-1. cause corruption on the filesystem that holds the brick
-2. read contents from the brick, hoping to hit the corrupted area
-3. the filesystem should abort after hitting a bad spot, the health-checker should notice that shortly afterwards
diff --git a/doc/features/ctime.md b/doc/features/ctime.md
new file mode 100644
index 00000000000..74a77abed4b
--- /dev/null
+++ b/doc/features/ctime.md
@@ -0,0 +1,68 @@
+# Consistent time attributes in gluster across replica/distribute
+
+
+#### Problem:
+Traditionally gluster has been using time attributes (ctime, atime, mtime) of files/dirs from bricks. The problem with this approach is that it is not consistent across replica and distribute bricks. Applications which depend on it break, as a replica might not always return time attributes from the same brick.
+
+Tar especially gives "file changed as we read it" whenever it detects ctime differences when stat is served from different bricks. The way we have been trying to solve it is to serve the stat structures from the same brick in afr, and max-time in dht. But it doesn't avoid the problem completely. Because there is no way to change ctime at the moment (lutimes() only allows mtime, atime), there is little we can do to make sure ctimes match after self-heals/xattr updates/rebalance.
+
+#### Solution Proposed:
+Store time attributes (ctime, mtime, atime) as an xattr of the file. The xattr is updated based
+on the fop. If a filesystem fop changes only mtime and ctime, update only those in the xattr for
+that file.
+
+#### Design Overview:
+1) As part of each fop, the top layer will generate a time stamp and pass it down along
+ with other information
+ - This will bring a dependency on NTP synced clients along with servers
+ - There can be a difference in time if the fop is stuck in an xlator for various reasons,
+for example, because of locks.
+
+ 2) On the server, the posix layer stores the value in memory (inode ctx) and will sync the data periodically to the disk as an extended attr
+ - Of course a sync call will also force it. If a fop comes for an inode which is not linked, we do the sync immediately.
+
+ 3) Each time inodes are created or initialized, the data is read from disk and stored in the inode ctx.
+
+ 4) Before setting to inode_ctx we compare the timestamp stored and the timestamp received, and only store if the stored value is less than the current value.
+
+ 5) So in the best case, data will be stored and retrieved from memory. We replace the values in iatt with the values in inode_ctx.
+
+ 6) File ops that change the parent directory attr time need to be consistent across all the distributed directories across the subvolumes. (for eg: a create call will change ctime and mtime of parent dir)
+
+ - This has to be handled separately because we only send the fop to the hashed subvolume.
+ - We can asynchronously send the timeupdate setattr fop to the other subvolumes and change the values for the parent directory if the file fop is successful on the hashed subvolume.
+ - This will have a window where the times are inconsistent across dht subvolumes (Please provide your suggestions)
+
+7) Currently we have a couple of mount options for time attributes like noatime, relatime, nodiratime etc. But we do not explicitly handle those options even if they are given as mount options for a gluster mount.
+
+
+#### Implementation Overview:
+This feature involves changes in the following xlators.
+ - utime xlator
+ - posix xlator
+
+##### utime xlator:
+This is a new client side xlator which does the following tasks.
+
+1. It generates a time stamp and passes it down in frame->root->ctime and over the network.
+2. Based on the fop, it also decides the time attributes to be updated, and this is passed using "frame->root->flags"
+
+ Patches:
+ 1. https://review.gluster.org/#/c/19857/
+
+##### posix xlator:
+The following tasks are done in posix xlator:
+
+1. Provides APIs to set and get the xattr from the backend. It also caches the xattr in the inode context. During get, it updates the iatt structure with the time attributes stored in the xattr.
+2. Based on the flags from utime xlator, relevant fops update the time attributes in the xattr.
+
+ Patches:
+ 1. https://review.gluster.org/#/c/19267/
+ 2. https://review.gluster.org/#/c/19795/
+ 3. https://review.gluster.org/#/c/19796/
+
+#### Pending Work:
+1. Handling of time related mount options (noatime, relatime, etc)
+2. flag based create (depending on flags in open, create behaviour might change)
+3. Changes in dht for directory sync across multiple subvolumes
+4. readdirp stat needs to be worked on.
diff --git a/doc/features/file-snapshot.md b/doc/features/file-snapshot.md
deleted file mode 100644
index 7f7c419fc7f..00000000000
--- a/doc/features/file-snapshot.md
+++ /dev/null
@@ -1,91 +0,0 @@
-#File Snapshot
-This feature gives the ability to take snapshot of files.
-
-##Description
-This feature adds file snapshotting support to glusterfs. Snapshots can be created, deleted and reverted.
-
-To take a snapshot of a file, the file should be in QCOW2 format, as the code for the block layer snapshot has been taken from Qemu and put into gluster as a translator.
-
-With this feature, glusterfs will have better integration with Openstack Cinder, and in general ability to take snapshots of files (typically VM images).
-
-New extended attribute (xattr) will be added to identify files which are 'snapshot managed' vs raw files.
-
-##Volume Options
-The following volume option needs to be set on the volume for taking file snapshots.
-
- # features.file-snapshot on
-##CLI parameters
-The following cli parameters need to be passed with the setfattr command to create, delete and revert file snapshots.
-
- # trusted.glusterfs.block-format
- # trusted.glusterfs.block-snapshot-create
- # trusted.glusterfs.block-snapshot-goto
-##Fully loaded Example
-Download glusterfs3.5 rpms from download.gluster.org
-Install these rpms.
-
-start glusterd by using the command
-
- # service glusterd start
-Now create a volume by using the command
-
- # gluster volume create <vol_name> <brick_path>
-Run the command below to make sure that volume is created.
-
- # gluster volume info
-Now turn on the snapshot feature on the volume by using the command
-
- # gluster volume set <vol_name> features.file-snapshot on
-Verify that the option is set by using the command
-
- # gluster volume info
-User should be able to see another option in the volume info
-
- # features.file-snapshot: on
-Now mount the volume using fuse mount
-
- # mount -t glusterfs <vol_name> <mount point>
-cd into the mount point
- # cd <mount_point>
- # touch <file_name>
-Size of the file can be set and format of the file can be changed to QCOW2 by running the command below. File size can be in KB/MB/GB
-
- # setfattr -n trusted.glusterfs.block-format -v qcow2:<file_size> <file_name>
-Now create another file and send data to that file by running the command
-
- # echo 'ABCDEFGHIJ' > <data_file1>
-Copy the data from one file to the other by running the command
-
- # dd if=data-file1 of=big-file conv=notrunc
-Now take the `snapshot of the file` by running the command
-
- # setfattr -n trusted.glusterfs.block-snapshot-create -v <image1> <file_name>
-Add some more contents to the file and take another file snapshot by doing the following steps
-
- # echo '1234567890' > <data_file2>
- # dd if=<data_file2> of=<file_name> conv=notrunc
- # setfattr -n trusted.glusterfs.block-snapshot-create -v <image2> <file_name>
-Now `revert` both the file snapshots and write data to some files so that data can be compared.
-
- # setfattr -n trusted.glusterfs.block-snapshot-goto -v <image1> <file_name>
- # dd if=<file_name> of=<out-file1> bs=11 count=1
- # setfattr -n trusted.glusterfs.block-snapshot-goto -v <image2> <file_name>
- # dd if=<file_name> of=<out-file2> bs=11 count=1
-Now read the contents of the files and compare as below:
-
- # cat <data_file1>, <out_file1> and compare contents.
- # cat <data_file2>, <out_file2> and compare contents.
-##one line description for the variables used
-file_name = File which is created in the mount point initially.
-
-data_file1 = File which contains data 'ABCDEFGHIJ'
-
-image1 = First file snapshot which has 'ABCDEFGHIJ' + some null values.
-
-data_file2 = File which contains data '1234567890'
-
-image2 = second file snapshot which has '1234567890' + some null values.
-
-out_file1 = After reverting image1 this contains 'ABCDEFGHIJ'
-
-out_file2 = After reverting image2 this contains '1234567890'
diff --git a/doc/features/ganesha-ha.md b/doc/features/ganesha-ha.md
new file mode 100644
index 00000000000..4b226a22ccf
--- /dev/null
+++ b/doc/features/ganesha-ha.md
@@ -0,0 +1,43 @@
+# Overview of Ganesha HA Resource Agents in GlusterFS 3.7
+
+The ganesha_mon RA monitors its ganesha.nfsd daemon. While the
+daemon is running, it creates two attributes: ganesha-active and
+grace-active. When the daemon stops for any reason, the attributes
+are deleted. Deleting the ganesha-active attribute triggers the
+failover of the virtual IP (the IPaddr RA) to another node —
+according to constraint location rules — where ganesha.nfsd is
+still running.
+
+The ganesha_grace RA monitors the grace-active attribute. When
+the grace-active attribute is deleted, the ganesha_grace RA stops,
+and will not restart. This triggers pacemaker to invoke the notify
+action in the ganesha_grace RAs on the other nodes in the cluster;
+which send a DBUS message to their respective ganesha.nfsd.
+
+(N.B. grace-active is a bit of a misnomer. While the grace-active
+attribute exists, everything is normal and healthy. Deleting the
+attribute triggers putting the surviving ganesha.nfsds into GRACE.)
+
+To ensure that the remaining/surviving ganesha.nfsds are put into
+ NFS-GRACE before the IPaddr (virtual IP) fails over there is a
+short delay (sleep) between deleting the grace-active attribute
+and the ganesha-active attribute. To summarize, e.g. in a four
+node cluster:
+
+1. on node 2 ganesha_mon::monitor notices that ganesha.nfsd has died
+
+2. on node 2 ganesha_mon::monitor deletes its grace-active attribute
+
+3. on node 2 ganesha_grace::monitor notices that grace-active is gone
+and returns OCF_ERR_GENERIC, a.k.a. new error. When pacemaker tries
+to (re)start ganesha_grace, its start action will return
+OCF_NOT_RUNNING, a.k.a. known error, don't attempt further restarts.
+
+4. on nodes 1, 3, and 4, ganesha_grace::notify receives a post-stop
+notification indicating that node 2 is gone, and sends a DBUS message
+to its ganesha.nfsd, putting it into NFS-GRACE.
+
+5. on node 2 ganesha_mon::monitor waits a short period, then deletes
+its ganesha-active attribute. This triggers the IPaddr (virt IP)
+failover according to constraint location rules.
+
diff --git a/doc/features/geo-replication/distributed-geo-rep.md b/doc/features/geo-replication/distributed-geo-rep.md
deleted file mode 100644
index 0a3183d6269..00000000000
--- a/doc/features/geo-replication/distributed-geo-rep.md
+++ /dev/null
@@ -1,71 +0,0 @@
-Introduction
-============
-
-This document goes through the new design of distributed geo-replication, its features and the nature of changes involved. First we list some of the important features.
-
- - Distributed asynchronous replication
- - Fast and versatile change detection
- - Replica failover
- - Hardlink synchronization
- - Effective handling of deletes and renames
- - Configurable sync engine (rsync, tar+ssh)
- - Adaptive to a wide variety of workloads
- - GFID synchronization
-
-Geo-replication makes use of the all new *journaling* infrastructure (a.k.a. changelog) to achieve great performance and feature improvements as mentioned above. To understand more about changelogging and the helper library (*libgfchangelog*) refer to document: doc/features/geo-replication/libgfchangelog.md
-
-Data Replication
-----------------
-
-Geo-replication is responsible for incrementally replicating data from the master node to the slave. But isn't that similar to what AFR does? Yes, but here the slave is located geographically distant from the master. Geo-replication follows the eventually consistent replication model, which implies that, at any point of time, the slave would be lagging w.r.t. master, but would eventually catch up. Replication performance is dependent on two crucial factors:
- - Network latency
- - Change detection
-
-Network latency is something that is not in direct control for many reasons, but still there is always a best effort. Therefore, geo-replication offloads the data replication part to common UNIX file transfer utilities. We choose the grand daddy of file transfers [rsync(1)] [1] as the default synchronization engine, as it's best known for its diff transfer algorithm for efficient usage of network and lightning fast transfers (let alone the flexibility). But what about small file performance? Due to its checksumming algorithm, rsync has more overhead for small files -- the overhead of checksumming outweighs the bytes to be transferred for small files. Therefore, geo-replication can also use a combination of tar piped over ssh to transfer a large number of small files. Tests have shown a great improvement over standard rsync. However, the sync engine is not yet dynamic to the file type and needs to be chosen manually by a configuration option.
-
-OTOH, change detection is something that is in full control of the application. Earlier (< release 3.5), geo-replication would perform a file system crawl to identify changes in the file system. This was not an unintelligent *check-every-single-inode* in the file system, but crawl logic based on *xtime*. xtime is an extended attribute maintained by the *marker* translator for each inode on the master and follows an upward-recursive marking pattern. Geo-replication would traverse a directory based on this simple condition:
-
-> xtime(master) > xtime(slave)
-
-E.g.:
-
-> MASTER SLAVE
->
-> /\ /\
-> d0 dir0 d0 dir0
-> / \ / \
-> d1 dir1 d1 dir1
-> / /
-> d2 d2
-> / /
-> file0 file0
-
-Consider the directory tree above. Assume that master and slave were in sync and the following operation happens on master:
-```
-touch /d0/d1/d2/file0
-```
-This would trigger an xtime marking (xtime being the current timestamp) from the leaf (*file0*) up to the root (*/*), i.e. an *xattr* update on *file0*, *d2*, *d1*, *d0* and finally */*. The geo-replication daemon would crawl the file system based on the condition mentioned before and hence would only crawl the **left** part of the directory tree (as the **right** part would have equal xtimes).
-
-Although the above crawling algorithm is fast, it still has to crawl a good part of the file system. Also, to decide whether to crawl a particular subdirectory, geo-rep needs to compare xtimes -- which is basically a **getxattr()** call on the master and on the slave (remember, the *slave* is over a WAN).
-
-Therefore, in 3.5 the need arose to take crawling to the next level. Geo-replication now uses the changelogging infrastructure to identify changes in the filesystem. In fact, no crawl is involved at all: changelog-based detection is notification based. The geo-replication daemon registers itself with the changelog consumer library (*libgfchangelog*) and invokes a set of APIs to get the list of changes in the filesystem and replay them onto the slave. No crawl and no extended attributes are involved.
-
-Distributed Geo-Replication
----------------------------
-Geo-replication (also known as gsyncd or geo-rep) used to be non-distributed before release 3.5. The node on which the geo-rep start command was executed was responsible for replicating data to the slave. If this node went offline for some reason (reboot, crash, etc.), replication would cease. So one of the main development efforts for release 3.5 was to *distributify* geo-replication. The geo-rep daemon running on each node (per brick) is responsible for replicating data **local** to each brick. This results in full parallelism and effective use of cluster and network resources.
-
-With release 3.5, the geo-rep start command spawns a geo-replication daemon on each node in the master cluster (one per brick). The geo-rep *status* command shows the geo-rep session status from each master node. Similarly, *stop* gracefully tears down the session on all nodes.
-
-What else is synced?
---------------------
- - GFID: Synchronizing the inode number (GFID) between master and the slave helps in synchronizing hardlinks.
- - Purges are also handled effectively as there is no entry comparison between master and slave. With changelog replay, geo-rep performs unlink operations without having to resort to an expensive **readdir()** over the WAN.
- - Renames: With earlier geo-replication, because of the path based nature of crawling, renames were actually a delete and a create on the slave, followed by data transfer (not to mention the inode number change). Now, with changelogging, it's actually a **rename()** call on the slave.
-
-Replica Failover
-----------------
-One of the basic volume configurations is a replicated volume (synchronous replication). Having geo-replication sync data from all replicas would waste network bandwidth and could possibly corrupt data on the slave (though that's unlikely). Therefore, geo-rep on such volume configurations works in an **ACTIVE** and **PASSIVE** mode. The geo-rep daemon on one of the replicas is responsible for replicating data (**ACTIVE**), while the other geo-rep daemon is basically doing nothing (**PASSIVE**).
-
-In the event of the *ACTIVE* node going offline, the *PASSIVE* node identifies this event (with a lag of at most 60 seconds) and switches to *ACTIVE*, thereby taking over the role of replicating data from where the earlier *ACTIVE* node left off. This guarantees uninterrupted data replication even on node reboots or failures.
-
-[1]:http://rsync.samba.org
diff --git a/doc/features/geo-replication/libgfchangelog.md b/doc/features/geo-replication/libgfchangelog.md
deleted file mode 100644
index 1dd0d24253a..00000000000
--- a/doc/features/geo-replication/libgfchangelog.md
+++ /dev/null
@@ -1,119 +0,0 @@
-libgfchangelog: "GlusterFS changelog" consumer library
-======================================================
-
-This document puts forward the need for a GlusterFS changelog consumer library (a.k.a. libgfchangelog) for consuming changelogs produced by the Changelog translator. Further, it mentions the proposed design and the API exposed by it. A brief explanation of the changelog translator can also be found as a commit message in the upstream source tree and the review link can be [accessed here] [1].
-
-The initial consumer of changelogs would be Geo-Replication (release 3.5). Possible consumers in the future could be backup utilities, GlusterFS self-heal, bit-rot detection and AV scanners. All these utilities have one thing in common: they need a list of changed entities (created/modified/deleted) in the file system. Therefore, the need arises to provide such functionality in the form of a shared library that applications can link against and query for changes (see the API section). There is no plan as of now to provide language bindings as such, but for shell script friendliness a 'gfind' command line utility (which would be dynamically linked with libgfchangelog) would be helpful. As of now, development of this utility has not yet commenced.
-
-The next section gives a brief introduction to how changelogs are organized and managed. Then we propose a couple of designs for libgfchangelog. The API set is not covered in this document (maybe later).
-
-Changelogs
-==========
-
-Changelogs can be thought of as a running history for an entity in the file system from the time the entity came into existence. The goal is to capture all possible transitions the entity underwent until it was purged. The transition namespace is broken up into three categories, with each category represented by a specific changelog format. Changes are recorded in a flat file in the filesystem and are rolled over after a specific time interval. All three categories are recorded in a single changelog file (sequentially), with a type for each entry. Having a single file reduces disk seeks and fragmentation and leaves fewer files to deal with. The strategy for pruning old logs is still undecided.
-
-
-Changelog Transition Namespace
-------------------------------
-
-As mentioned before the transition namespace is categorized into three types:
- - TYPE-I : Data operation
- - TYPE-II : Metadata operation
- - TYPE-III : Entry operation
-
-One could visualize the transition of a file system entity as a state machine transitioning from one type to another. For TYPE-I and TYPE-II operations there is no state transition as such, but a TYPE-III operation involves a state change from the file system's perspective. We can now classify file operations (fops) into one of the three types:
- - Data operation: write(), writev(), truncate(), ftruncate()
- - Metadata operation: setattr(), fsetattr(), setxattr(), fsetxattr(), removexattr(), fremovexattr()
- - Entry operation: create(), mkdir(), mknod(), symlink(), link(), rename(), unlink(), rmdir()
-
-Changelog Entry Format
-----------------------
-
-To record the type of operation an entity underwent, a type identifier is used. Normally, the entity on which the operation is performed would be identified by its pathname, the most common way of addressing in a file system, but we chose to use the GlusterFS internal file identifier (GFID) instead (GlusterFS supports a GFID-based backend, the pathname field may not always be valid, and there are other reasons which are out of scope of this document). Therefore, the format of the record for the three types of operation can be summarized as follows:
-
- - TYPE-I : GFID of the file
- - TYPE-II : GFID of the file
- - TYPE-III : GFID + FOP + MODE + UID + GID + PARGFID/BNAME [PARGFID/BNAME]
-
-GFIDs are analogous to inodes. TYPE-I and TYPE-II fops record the GFID of the entity on which the operation was performed, thereby recording that there was a data/metadata change on the inode. TYPE-III fops record at the minimum a set of six or seven records (depending on the type of operation), which is sufficient to identify what type of operation the entity underwent. Normally this record includes the GFID of the entity, the type of file operation (which is an integer [an enumerated value which is used in GlusterFS]) and the parent GFID and the basename (analogous to parent inode and basename).
-
-Changelogs can be in either ascii or binary format, the difference being the format of the records that are persisted. In a binary changelog the gfids are recorded in their native format, i.e. as a 16 byte record, and the fop number as a 4 byte integer. In an ascii changelog, the gfids are stored in their canonical form and the fop number is stringified and persisted. The null character is used as the record separator in changelogs. This makes it hard to read changelogs from the command line, but the packed format is needed to support file names with spaces and special characters. Below is a snippet of a changelog alongside its hexdump.
-
-```
-00000000 47 6c 75 73 74 65 72 46 53 20 43 68 61 6e 67 65 |GlusterFS Change|
-00000010 6c 6f 67 20 7c 20 76 65 72 73 69 6f 6e 3a 20 76 |log | version: v|
-00000020 31 2e 31 20 7c 20 65 6e 63 6f 64 69 6e 67 20 3a |1.1 | encoding :|
-00000030 20 32 0a 45 61 36 39 33 63 30 34 65 2d 61 66 39 | 2.Ea693c04e-af9|
-00000040 65 2d 34 62 61 35 2d 39 63 61 37 2d 31 63 34 61 |e-4ba5-9ca7-1c4a|
-00000050 34 37 30 31 30 64 36 32 00 32 33 00 33 33 32 36 |47010d62.23.3326|
-00000060 31 00 30 00 30 00 66 36 35 34 32 33 32 65 2d 61 |1.0.0.f654232e-a|
-00000070 34 32 62 2d 34 31 62 33 2d 62 35 61 61 2d 38 30 |42b-41b3-b5aa-80|
-00000080 33 62 33 64 61 34 35 39 33 37 2f 6c 69 62 76 69 |3b3da45937/libvi|
-00000090 72 74 5f 64 72 69 76 65 72 5f 6e 65 74 77 6f 72 |rt_driver_networ|
-000000a0 6b 2e 73 6f 00 44 61 36 39 33 63 30 34 65 2d 61 |k.so.Da693c04e-a|
-000000b0 66 39 65 2d 34 62 61 35 2d 39 63 61 37 2d 31 63 |f9e-4ba5-9ca7-1c|
-000000c0 34 61 34 37 30 31 30 64 36 32 00 45 36 65 39 37 |4a47010d62.E6e97|
-```
-
-As you can see, there is an *entry* operation (journal record starting with an "E"). Records for this operation are:
- - GFID : a693c04e-af9e-4ba5-9ca7-1c4a47010d62
- - FOP : 23 (create)
- - Mode : 33261
- - UID : 0
- - GID : 0
- - PARGFID/BNAME: f654232e-a42b-41b3-b5aa-803b3da45937/libvirt_driver_network.so
-
-**NOTE**: In case of a rename operation, there would be an additional record (for the target PARGFID/BNAME).
-
-libgfchangelog
---------------
-
-NOTE: changelogs generated by the changelog translator are rolled over [with the timestamp as the suffix] after a specific interval, after which a new changelog is started. The current changelog [changelog file without the timestamp as the suffix] should never be processed until it has been rolled over. The rolled over logs should be treated as read-only.
-
-Capturing changes performed on a file system is useful for applications that rely on a file system scan (crawl) to figure out such information. Backup utilities, automatic file healing in a replicated environment, bit-rot detection and the like are some of the end user applications that require a set of changed entities in a file system to act on. The goal of libgfchangelog is to provide the application (consumer) with a fast and easy to use common query interface (API). The consumer need not worry about the changelog format, the nomenclature of the changelog files, etc.
-
-Now we list functionality and some of the features.
-
-Functionality
--------------
-
-Changelog Processing: Processing involves reading changelog file(s) and converting the entries into a human-readable (or application understandable) format (in case of binary log format).
-Book-keeping: Keeping track of how much of the changelog the application has consumed (i.e. changes during the time slice start-time -> end-time).
-Serve API request: Update the consumer by providing the set of changes.
-
-Processing could be done in two ways:
-
-* Pre-processing (pre-processing from the library POV):
-Once a changelog file is rolled over (by the changelog translator), a set of post-processing operations is performed. These operations could include converting a binary log file to an understandable format, collating a bunch of logs into a larger sampling period, or just keeping a private copy of the changelog (in ascii format). Extra disk space is consumed to store this private copy. The library would then be free to consume these logs and serve application requests.
-
-* On-demand:
-The processing of the changelogs is triggered when an application requests changes. The downside is the additional time spent decoding the logs and accumulating data at application request time (but no additional disk space is used over the time period).
-
-After processing, the changelog is ready to be consumed by the application. The function of processing is to convert the logs into human/application readable format (an example is shown below):
-
-```
-E a7264fe2-dd6b-43e1-8786-a03b42cc2489 CREATE 33188 0 0 00000000-0000-0000-0000-000000000001%2Fservices1
-M a7264fe2-dd6b-43e1-8786-a03b42cc2489 NULL
-M 00000000-0000-0000-0000-000000000001 NULL
-D a7264fe2-dd6b-43e1-8786-a03b42cc2489
-```
-
-Features
---------
-
-The following points mention some of the features that the library could provide.
-
- - Consumer could choose the update type when it registers with the library. 'types' could be:
- - Streaming: The consumer is updated via stream of changes, ie. the library would just replay the logs
- - Consolidated: The consumer is provided with a consolidated view of the changelog, e.g. if <gfid> had a DATA and a METADATA operation, it would be presented as a single update. Similarly for ENTRY operations.
- - Raw: This mode provides the consumer with the pathnames of the changelog files themselves (after processing). The changelogs should be strictly treated as read-only. This gives the consumer the flexibility to extract updates in their own preferred way (e.g. using command line tools like sed, awk, sort | uniq etc.).
- - Application may choose to adopt a synchronous (blocking) or an asynchronous (callback) notification mechanism.
- - Provide a unified view of changelogs from multiple peers (replication scenario) or a global changelog view of the entire cluster.
-
-
-**The first cut of the library supports:**
- - Raw access mode
- - Synchronous programming model
- - Per brick changelog consumption ie. no unified/globally aggregated changelog
-
-[1]:http://review.gluster.org/5127
diff --git a/doc/features/gfid-access.md b/doc/features/gfid-access.md
deleted file mode 100644
index 2d324a18bdb..00000000000
--- a/doc/features/gfid-access.md
+++ /dev/null
@@ -1,73 +0,0 @@
-#Gfid-access Translator
-The 'gfid-access' translator provides access to data in glusterfs using a
-virtual path. This particular translator is designed to provide direct access to
-files in glusterfs using its gfid. 'GFID' is glusterfs's inode number for a file
-to identify it uniquely. As of now, Geo-replication is the only consumer of this
-translator. The changelog translator logs the 'gfid' with corresponding file
-operation in journals which are consumed by Geo-Replication to replicate the
-files using gfid-access translator very efficiently.
-
-###Implications and Usage
-A new virtual directory called '.gfid' is exposed in the aux-gfid mount
-point when gluster volume is mounted with 'aux-gfid-mount' option.
-All the gfids of files are exposed in one level under the '.gfid' directory.
-No matter at what level the file resides, it is accessed using its
-gfid under this virtual directory as shown in the example below. All access
-protocols work seamlessly, as the complexities are handled internally.
-
-###Testing
-1. Mount glusterfs client with '-o aux-gfid-mount' as follows.
-
- mount -t glusterfs -o aux-gfid-mount <node-ip>:<volname> <mountpoint>
-
- Example:
-
- #mount -t glusterfs -o aux-gfid-mount rhs1:master /master-aux-mnt
-
-2. Get the 'gfid' of a file using normal mount or aux-gfid-mount and do some
- operations as follows.
-
- getfattr -n glusterfs.gfid.string <file>
-
- Example:
-
- #getfattr -n glusterfs.gfid.string /master-aux-mnt/file
- # file: file
- glusterfs.gfid.string="796d3170-0910-4853-9ff3-3ee6b1132080"
-
- #cat /master-aux-mnt/file
- sample data
-
- #stat /master-aux-mnt/file
- File: `file'
- Size: 12 Blocks: 1 IO Block: 131072 regular file
- Device: 13h/19d Inode: 11525625031905452160 Links: 1
- Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
- Access: 2014-05-23 20:43:33.239999863 +0530
- Modify: 2014-05-23 17:36:48.224999989 +0530
- Change: 2014-05-23 20:44:10.081999938 +0530
-
-
-3. Access files using virtual path as follows.
-
- /mountpoint/.gfid/<actual-canonical-gfid-of-the-file>
-
- Example:
-
- #cat /master-aux-mnt/.gfid/796d3170-0910-4853-9ff3-3ee6b1132080
- sample data
- #stat /master-aux-mnt/.gfid/796d3170-0910-4853-9ff3-3ee6b1132080
- File: `.gfid/796d3170-0910-4853-9ff3-3ee6b1132080'
- Size: 12 Blocks: 1 IO Block: 131072 regular file
- Device: 13h/19d Inode: 11525625031905452160 Links: 1
- Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
- Access: 2014-05-23 20:43:33.239999863 +0530
- Modify: 2014-05-23 17:36:48.224999989 +0530
- Change: 2014-05-23 20:44:10.081999938 +0530
-
- We can notice that 'cat' command on the 'file' using path and using virtual
- path displays the same data. Similarly 'stat' command on the 'file' and using
- virtual path with gfid gives the same inode number, confirming that it is the same file.
-
-###Nature of changes
-This feature is introduced with 'gfid-access' translator.
diff --git a/doc/features/libgfapi.md b/doc/features/libgfapi.md
deleted file mode 100644
index dfc8cfe6527..00000000000
--- a/doc/features/libgfapi.md
+++ /dev/null
@@ -1,381 +0,0 @@
-One of the known methods to access glusterfs is via fuse module. However, it has some overhead or performance issues because of the number of context switches which need to be performed to complete one i/o transaction[1].
-
-
-To overcome this limitation, a new method called ‘libgfapi’ was introduced. libgfapi support is available from the GlusterFS-3.4 release.
-
-libgfapi is a userspace library for accessing data in glusterfs. The libgfapi library performs I/O on gluster volumes directly, without a FUSE mount. It is a filesystem-like API that runs in the application's process context. libgfapi eliminates FUSE and the kernel VFS layer from the glusterfs volume access path. Throughput and latency improve with libgfapi access. [1]
-
-
-Using libgfapi, various user-space filesystems (like NFS-Ganesha or Samba) or virtualizers (like QEMU) can interact with GlusterFS, which serves as the back-end filesystem. Currently, the projects below integrate with glusterfs using libgfapi interfaces.
-
-
-* qemu storage layer
-* Samba VFS plugin
-* NFS-Ganesha
-
-All the APIs in libgfapi make use of `struct glfs` object. This object
-contains information about volume name, glusterfs context associated,
-subvols in the graph etc which makes it unique for each volume.
-
-
-For any application to make use of libgfapi, it should typically start
-with the below APIs in the following order -
-
-* To create a new glfs object :
-
- glfs_t *glfs_new (const char *volname) ;
-
- glfs_new() returns glfs_t object.
-
-
-* On this newly created glfs_t, you need to set either a volfile path
- (glfs_set_volfile) or a volfile server (glfs_set_volfile_server).
- In case of failure, the corresponding cleanup routine is
- "glfs_unset_volfile_server"
-
- int glfs_set_volfile (glfs_t *fs, const char *volfile);
-
- int glfs_set_volfile_server (glfs_t *fs, const char *transport,const char *host, int port) ;
-
- int glfs_unset_volfile_server (glfs_t *fs, const char *transport,const char *host, int port) ;
-
-* Specify logging parameters using glfs_set_logging():
-
- int glfs_set_logging (glfs_t *fs, const char *logfile, int loglevel) ;
-
-* Initialize the glfs_t object using glfs_init():
- int glfs_init (glfs_t *fs) ;
-
-#### FOPs APIs available with libgfapi :
-
-
-
- int glfs_get_volumeid (struct glfs *fs, char *volid, size_t size);
-
- int glfs_setfsuid (uid_t fsuid) ;
-
- int glfs_setfsgid (gid_t fsgid) ;
-
- int glfs_setfsgroups (size_t size, const gid_t *list) ;
-
- glfs_fd_t *glfs_open (glfs_t *fs, const char *path, int flags) ;
-
- glfs_fd_t *glfs_creat (glfs_t *fs, const char *path, int flags,mode_t mode) ;
-
- int glfs_close (glfs_fd_t *fd) ;
-
- glfs_t *glfs_from_glfd (glfs_fd_t *fd) ;
-
- int glfs_set_xlator_option (glfs_t *fs, const char *xlator, const char *key,const char *value) ;
-
- typedef void (*glfs_io_cbk) (glfs_fd_t *fd, ssize_t ret, void *data);
-
- ssize_t glfs_read (glfs_fd_t *fd, void *buf,size_t count, int flags) ;
-
- ssize_t glfs_write (glfs_fd_t *fd, const void *buf,size_t count, int flags) ;
-
- int glfs_read_async (glfs_fd_t *fd, void *buf, size_t count, int flags, glfs_io_cbk fn, void *data) ;
-
- int glfs_write_async (glfs_fd_t *fd, const void *buf, size_t count, int flags, glfs_io_cbk fn, void *data) ;
-
- ssize_t glfs_readv (glfs_fd_t *fd, const struct iovec *iov, int iovcnt,int flags) ;
-
- ssize_t glfs_writev (glfs_fd_t *fd, const struct iovec *iov, int iovcnt,int flags) ;
-
- int glfs_readv_async (glfs_fd_t *fd, const struct iovec *iov, int count, int flags, glfs_io_cbk fn, void *data) ;
-
- int glfs_writev_async (glfs_fd_t *fd, const struct iovec *iov, int count, int flags, glfs_io_cbk fn, void *data) ;
-
- ssize_t glfs_pread (glfs_fd_t *fd, void *buf, size_t count, off_t offset,int flags) ;
-
- ssize_t glfs_pwrite (glfs_fd_t *fd, const void *buf, size_t count, off_t offset, int flags) ;
-
- int glfs_pread_async (glfs_fd_t *fd, void *buf, size_t count, off_t offset,int flags, glfs_io_cbk fn, void *data) ;
-
- int glfs_pwrite_async (glfs_fd_t *fd, const void *buf, int count, off_t offset,int flags, glfs_io_cbk fn, void *data) ;
-
- ssize_t glfs_preadv (glfs_fd_t *fd, const struct iovec *iov, int iovcnt, off_t offset, int flags) ;
-
- ssize_t glfs_pwritev (glfs_fd_t *fd, const struct iovec *iov, int iovcnt, off_t offset, int flags) ;
-
- int glfs_preadv_async (glfs_fd_t *fd, const struct iovec *iov, int count, off_t offset, int flags, glfs_io_cbk fn, void *data) ;
-
- int glfs_pwritev_async (glfs_fd_t *fd, const struct iovec *iov, int count, off_t offset, int flags, glfs_io_cbk fn, void *data) ;
-
- off_t glfs_lseek (glfs_fd_t *fd, off_t offset, int whence) ;
-
- int glfs_truncate (glfs_t *fs, const char *path, off_t length) ;
-
- int glfs_ftruncate (glfs_fd_t *fd, off_t length) ;
-
- int glfs_ftruncate_async (glfs_fd_t *fd, off_t length, glfs_io_cbk fn,void *data) ;
-
- int glfs_lstat (glfs_t *fs, const char *path, struct stat *buf) ;
-
- int glfs_stat (glfs_t *fs, const char *path, struct stat *buf) ;
-
- int glfs_fstat (glfs_fd_t *fd, struct stat *buf) ;
-
- int glfs_fsync (glfs_fd_t *fd) ;
-
- int glfs_fsync_async (glfs_fd_t *fd, glfs_io_cbk fn, void *data) ;
-
- int glfs_fdatasync (glfs_fd_t *fd) ;
-
- int glfs_fdatasync_async (glfs_fd_t *fd, glfs_io_cbk fn, void *data) ;
-
- int glfs_access (glfs_t *fs, const char *path, int mode) ;
-
- int glfs_symlink (glfs_t *fs, const char *oldpath, const char *newpath) ;
-
- int glfs_readlink (glfs_t *fs, const char *path,char *buf, size_t bufsiz) ;
-
- int glfs_mknod (glfs_t *fs, const char *path, mode_t mode, dev_t dev) ;
-
- int glfs_mkdir (glfs_t *fs, const char *path, mode_t mode) ;
-
- int glfs_unlink (glfs_t *fs, const char *path) ;
-
- int glfs_rmdir (glfs_t *fs, const char *path) ;
-
- int glfs_rename (glfs_t *fs, const char *oldpath, const char *newpath) ;
-
- int glfs_link (glfs_t *fs, const char *oldpath, const char *newpath) ;
-
- glfs_fd_t *glfs_opendir (glfs_t *fs, const char *path) ;
-
- int glfs_readdir_r (glfs_fd_t *fd, struct dirent *dirent,struct dirent **result) ;
-
- int glfs_readdirplus_r (glfs_fd_t *fd, struct stat *stat, struct dirent *dirent, struct dirent **result) ;
-
- struct dirent *glfs_readdir (glfs_fd_t *fd) ;
-
- struct dirent *glfs_readdirplus (glfs_fd_t *fd, struct stat *stat) ;
-
- long glfs_telldir (glfs_fd_t *fd) ;
-
- void glfs_seekdir (glfs_fd_t *fd, long offset) ;
-
- int glfs_closedir (glfs_fd_t *fd) ;
-
- int glfs_statvfs (glfs_t *fs, const char *path, struct statvfs *buf) ;
-
- int glfs_chmod (glfs_t *fs, const char *path, mode_t mode) ;
-
- int glfs_fchmod (glfs_fd_t *fd, mode_t mode) ;
-
- int glfs_chown (glfs_t *fs, const char *path, uid_t uid, gid_t gid) ;
-
- int glfs_lchown (glfs_t *fs, const char *path, uid_t uid, gid_t gid) ;
-
- int glfs_fchown (glfs_fd_t *fd, uid_t uid, gid_t gid) ;
-
- int glfs_utimens (glfs_t *fs, const char *path,struct timespec times[2]) ;
-
- int glfs_lutimens (glfs_t *fs, const char *path,struct timespec times[2]) ;
-
- int glfs_futimens (glfs_fd_t *fd, struct timespec times[2]) ;
-
- ssize_t glfs_getxattr (glfs_t *fs, const char *path, const char *name,void *value, size_t size) ;
-
- ssize_t glfs_lgetxattr (glfs_t *fs, const char *path, const char *name,void *value, size_t size) ;
-
- ssize_t glfs_fgetxattr (glfs_fd_t *fd, const char *name,void *value, size_t size) ;
-
- ssize_t glfs_listxattr (glfs_t *fs, const char *path,void *value, size_t size) ;
-
- ssize_t glfs_llistxattr (glfs_t *fs, const char *path, void *value,size_t size) ;
-
- ssize_t glfs_flistxattr (glfs_fd_t *fd, void *value, size_t size) ;
-
- int glfs_setxattr (glfs_t *fs, const char *path, const char *name,const void *value, size_t size, int flags) ;
-
- int glfs_lsetxattr (glfs_t *fs, const char *path, const char *name,const void *value, size_t size, int flags) ;
-
- int glfs_fsetxattr (glfs_fd_t *fd, const char *name,const void *value, size_t size, int flags) ;
-
- int glfs_removexattr (glfs_t *fs, const char *path, const char *name) ;
-
- int glfs_lremovexattr (glfs_t *fs, const char *path, const char *name) ;
-
- int glfs_fremovexattr (glfs_fd_t *fd, const char *name) ;
-
- int glfs_fallocate(glfs_fd_t *fd, int keep_size, off_t offset, size_t len) ;
-
- int glfs_discard(glfs_fd_t *fd, off_t offset, size_t len) ;
-
- int glfs_discard_async (glfs_fd_t *fd, off_t length, size_t lent, glfs_io_cbk fn, void *data) ;
-
- int glfs_zerofill(glfs_fd_t *fd, off_t offset, off_t len) ;
-
- int glfs_zerofill_async (glfs_fd_t *fd, off_t length, off_t len, glfs_io_cbk fn, void *data) ;
-
- char *glfs_getcwd (glfs_t *fs, char *buf, size_t size) ;
-
- int glfs_chdir (glfs_t *fs, const char *path) ;
-
- int glfs_fchdir (glfs_fd_t *fd) ;
-
- char *glfs_realpath (glfs_t *fs, const char *path, char *resolved_path) ;
-
- int glfs_posix_lock (glfs_fd_t *fd, int cmd, struct flock *flock) ;
-
- glfs_fd_t *glfs_dup (glfs_fd_t *fd) ;
-
-
- struct glfs_object *glfs_h_lookupat (struct glfs *fs,struct glfs_object *parent,
- const char *path,
- struct stat *stat) ;
-
- struct glfs_object *glfs_h_creat (struct glfs *fs, struct glfs_object *parent,
- const char *path, int flags, mode_t mode,
- struct stat *sb) ;
-
- struct glfs_object *glfs_h_mkdir (struct glfs *fs, struct glfs_object *parent,
- const char *path, mode_t flags,
- struct stat *sb) ;
-
- struct glfs_object *glfs_h_mknod (struct glfs *fs, struct glfs_object *parent,
- const char *path, mode_t mode, dev_t dev,
- struct stat *sb) ;
-
- struct glfs_object *glfs_h_symlink (struct glfs *fs, struct glfs_object *parent,
- const char *name, const char *data,
- struct stat *stat) ;
-
-
- int glfs_h_unlink (struct glfs *fs, struct glfs_object *parent,
- const char *path) ;
-
- int glfs_h_close (struct glfs_object *object) ;
-
- int glfs_caller_specific_init (void *uid_caller_key, void *gid_caller_key,
- void *future) ;
-
- int glfs_h_truncate (struct glfs *fs, struct glfs_object *object,
- off_t offset) ;
-
- int glfs_h_stat(struct glfs *fs, struct glfs_object *object,
- struct stat *stat) ;
-
- int glfs_h_getattrs (struct glfs *fs, struct glfs_object *object,
- struct stat *stat) ;
-
- int glfs_h_getxattrs (struct glfs *fs, struct glfs_object *object,
- const char *name, void *value,
- size_t size) ;
-
- int glfs_h_setattrs (struct glfs *fs, struct glfs_object *object,
- struct stat *sb, int valid) ;
-
- int glfs_h_setxattrs (struct glfs *fs, struct glfs_object *object,
- const char *name, const void *value,
- size_t size, int flags) ;
-
- int glfs_h_readlink (struct glfs *fs, struct glfs_object *object, char *buf,
- size_t bufsiz) ;
-
- int glfs_h_link (struct glfs *fs, struct glfs_object *linktgt,
- struct glfs_object *parent, const char *name) ;
-
- int glfs_h_rename (struct glfs *fs, struct glfs_object *olddir,
- const char *oldname, struct glfs_object *newdir,
- const char *newname) ;
-
- int glfs_h_removexattrs (struct glfs *fs, struct glfs_object *object,
- const char *name) ;
-
- ssize_t glfs_h_extract_handle (struct glfs_object *object,
- unsigned char *handle, int len) ;
-
- struct glfs_object *glfs_h_create_from_handle (struct glfs *fs,
- unsigned char *handle, int len,
- struct stat *stat) ;
-
-
- struct glfs_fd *glfs_h_opendir (struct glfs *fs,
- struct glfs_object *object) ;
-
- struct glfs_fd *glfs_h_open (struct glfs *fs, struct glfs_object *object,
- int flags) ;
-
-For more details on these APIs, please refer to glfs.h and glfs-handles.h in the source tree (api/src/) of glusterfs.
-
-* In case of failure, or to close the connection and destroy the glfs_t
-object, use glfs_fini.
-
- int glfs_fini (glfs_t *fs) ;
-
-
-The file operations are typically divided into the categories below:
-
-* a) Handle based Operations -
-
-These APIs create/make use of a glfs_object (referred to as a handle) unique
-to each file within a volume.
-The structure glfs_object contains inode pointer and gfid.
-
-For example: Since NFS protocol uses file handles to access files, these APIs are
-mainly used by NFS-Ganesha server.
-
-Eg:
-
- struct glfs_object *glfs_h_lookupat (struct glfs *fs,
- struct glfs_object *parent,
- const char *path,
- struct stat *stat);
-
- struct glfs_object *glfs_h_creat (struct glfs *fs,
- struct glfs_object *parent,
- const char *path,
- int flags, mode_t mode,
- struct stat *sb);
-
- struct glfs_object *glfs_h_mkdir (struct glfs *fs,
- struct glfs_object *parent,
- const char *path, mode_t flags,
- struct stat *sb);
-
-
-
-* b) File path/descriptor based Operations -
-
-These APIs use a file path or descriptor to determine the file on
-which to operate.
-
-For example: Samba uses these APIs for file operations.
-
-Examples of the APIs using file path -
-
- int glfs_chdir (glfs_t *fs, const char *path) ;
-
- char *glfs_realpath (glfs_t *fs, const char *path, char *resolved_path) ;
-
-Once the file is opened, the file-descriptor generated is used for
-further operations.
-
-Eg:
-
- int glfs_posix_lock (glfs_fd_t *fd, int cmd, struct flock *flock) ;
- glfs_fd_t *glfs_dup (glfs_fd_t *fd) ;
-
-
-
-#### libgfapi bindings :
-
-libgfapi bindings are available for the languages below:
-
- - Go
- - Java
- - python [2]
- - Ruby
-
-For more details on these bindings, please refer to:
-
- #http://www.gluster.org/community/documentation/index.php/Language_Bindings
-
-References:
-
-[1] http://humblec.com/libgfapi-interface-glusterfs/
-[2] http://www.gluster.org/2014/04/play-with-libgfapi-and-its-python-bindings/
-
diff --git a/doc/features/nufa.md b/doc/features/nufa.md
deleted file mode 100644
index 03b8194b4c0..00000000000
--- a/doc/features/nufa.md
+++ /dev/null
@@ -1,20 +0,0 @@
-# NUFA Translator
-
-The NUFA ("Non Uniform File Access") is a variant of the DHT ("Distributed Hash
-Table") translator, intended for use with workloads that have a high locality
-of reference. Instead of placing new files pseudo-randomly, it places them on
-the same nodes where they are created so that future accesses can be made
-locally. For replicated volumes, this means that one copy will be local and
-others will be remote; the read-replica selection mechanisms will then favor
-the local copy for reads. For non-replicated volumes, the only copy will be
-local.
-
-## Interface
-
-Use of NUFA is controlled by a volume option, as follows.
-
- gluster volume set myvolume cluster.nufa on
-
-This will cause the NUFA translator to be used wherever the DHT translator
-otherwise would be. The rest is all automatic.
-
diff --git a/doc/features/ovirt-integration.md b/doc/features/ovirt-integration.md
deleted file mode 100644
index 46dbeabbbaa..00000000000
--- a/doc/features/ovirt-integration.md
+++ /dev/null
@@ -1,106 +0,0 @@
-##oVirt Integration with glusterfs
-
-oVirt is an open source virtualization management platform. You can use oVirt to manage
-hardware nodes, storage and network resources, and to deploy and monitor virtual machines
-running in your data center. oVirt serves as the bedrock for Red Hat's Enterprise Virtualization product,
-and is the "upstream" project where new features are developed in advance of their inclusion
-in that supported product offering.
-
-To learn more about oVirt, please visit http://www.ovirt.org/. To configure it, see
-#http://www.ovirt.org/Quick_Start_Guide#Install_oVirt_Engine_.28Fedora.29%60
-
-For the oVirt installation steps, please refer to
-#http://www.ovirt.org/Quick_Start_Guide#Install_oVirt_Engine_.28Fedora.29%60
-
-When oVirt is integrated with gluster, glusterfs can be used in the following ways:
-
-* As a storage domain to host VM disks.
-
-There are mainly two ways to exploit glusterfs as a storage domain.
- - POSIXFS_DOMAIN ( >=oVirt 3.1 )
- - GLUSTERFS_DOMAIN ( >=oVirt 3.3)
-
-The former has a performance overhead and is not an ideal way to consume images hosted in glusterfs volumes.
-With this method, qemu accesses VM images through the glusterfs `mount point` and so incurs FUSE overhead.
-libvirt treats this as a `file` type disk in its xml schema.
-
-The latter is the recommended way of using glusterfs with oVirt as a storage domain. It provides a better
-and more efficient way to access images hosted on glusterfs volumes. When qemu accesses a glusterfs volume using this method,
-it makes use of the `libgfapi` implementation of glusterfs; this method is called native integration.
-Here glusterfs is added as a block backend to qemu, and libvirt treats this as a `network` type disk.
-
-For more details on this, please refer to # http://www.ovirt.org/Features/GlusterFS_Storage_Domain
-However, there are 2 bugs which block usage of this feature.
-
-https://bugzilla.redhat.com/show_bug.cgi?id=1022961
-https://bugzilla.redhat.com/show_bug.cgi?id=1017289
-
-Please check the above bugs for the latest status.
-
-* To manage gluster trusted pools.
-
-oVirt web admin console can be used to -
- - add new / import existing gluster cluster
- - add/delete volumes
- - add/delete bricks
- - set/reset volume options
- - optimize volume for virt store
- - Rebalance and Remove bricks
- - Monitor gluster deployment - node, brick, volume status,
- Enhanced service monitoring (physical node resources as well as quota, geo-rep and self-heal status) through Nagios integration (>=oVirt 3.4)
-
-
-
-When configuring oVirt to manage only a gluster cluster/trusted pool, you need to select `gluster` as the input for
-`Application mode` in the OVIRT ENGINE CONFIGURATION option of the `engine-setup` command.
-Refer # http://www.ovirt.org/Quick_Start_Guide#Install_oVirt_Engine_.28Fedora.29%60
-
-If you want to use gluster for both purposes (as a storage domain to host VM disks and to manage gluster trusted pools),
-you need to input `both` as the value for `Application mode` in the `engine-setup` command.
-
-Once you have successfully installed oVirt Engine as mentioned above, you will be provided with instructions
-to access oVirt's web console.
-
-The example below shows how to configure gluster nodes on Fedora.
-
-
-#Configuring gluster nodes.
-
-On the machine designated as your host, install any supported distribution (e.g. Fedora, CentOS or RHEL).
-A minimal installation is sufficient.
-
-Refer # http://www.ovirt.org/Quick_Start_Guide#Install_Hosts
-
-
-##Connect to Ovirt Engine
-
-Log In to Administration Console
-
-Ensure that you have the administrator password configured during installation of oVirt engine.
-
-- To connect to oVirt webadmin console
-
-
-Open a browser and navigate to https://domain.example.com/webadmin. Substitute domain.example.com with the URL provided during installation.
-
-If this is your first time connecting to the administration console, oVirt Engine will issue
-security certificates for your browser. Click the link labelled `this certificate` to trust the
-ca.cer certificate. A pop-up is displayed; click Open to launch the Certificate dialog.
-Click `Install Certificate` and select to place the certificate in Trusted Root Certification Authorities store.
-
-
-The console login screen displays. Enter admin as your User Name, and enter the Password that
-you provided during installation. Ensure that your domain is set to Internal. Click Login.
-
-
-You have now successfully logged in to the oVirt web administration console. Here, you can configure and manage all your gluster resources.
-
-To manage gluster trusted pool:
-
-- Create a cluster with "Enable gluster service" turned on. (Also turn on "Enable virt service" if the same nodes are used as hypervisors.)
-- Add hosts which have already been set up as described in the step Configuring gluster nodes.
-- Create a volume and click on "Optimize for virt store". This sets the volume tunables to optimize the volume for use as an image store.
-
-To use this volume as a storage domain:
-
-Please refer to the `User interface` section of www.ovirt.org/Features/GlusterFS_Storage_Domain
diff --git a/doc/features/qemu-integration.md b/doc/features/qemu-integration.md
deleted file mode 100644
index b44dc06bb43..00000000000
--- a/doc/features/qemu-integration.md
+++ /dev/null
@@ -1,231 +0,0 @@
-Using GlusterFS volumes to host VM images and data used to be sub-optimal due to the FUSE overhead involved in accessing gluster volumes via the GlusterFS native client. However, this has changed with two specific enhancements:
-
-- A new library called libgfapi is now available as part of GlusterFS; it provides POSIX-like C APIs for accessing gluster volumes. libgfapi support is available from the GlusterFS-3.4 release.
-- QEMU (starting from QEMU-1.3) has a GlusterFS block driver that uses libgfapi, so there is no longer any FUSE overhead when QEMU works with VM images on gluster volumes.
-
-GlusterFS, with its pluggable translator model, can serve as a flexible storage backend for QEMU. QEMU only has to talk to GlusterFS, and GlusterFS hides the different file systems and storage types underneath. Various GlusterFS storage features like replication and striping are automatically available to QEMU. Efforts are also under way to add a block device backend in Gluster via the Block Device (BD) translator, which will expose underlying block devices as files to QEMU. This allows GlusterFS to be a single storage backend for both file and block based storage types.
-
-###GlusterFS specification in QEMU
-
-A VM image residing on a gluster volume can be specified on the QEMU command line using the following URI format:
-
- gluster[+transport]://[server[:port]]/volname/image[?socket=...]
-
-
-
-* `gluster` is the protocol.
-
-* `transport` specifies the transport type used to connect to the gluster management daemon (glusterd). Valid transport types are `tcp`, `unix` and `rdma`. If a transport type isn’t specified, then tcp is assumed.
-
-* `server` specifies the server where the volume file specification for the given volume resides. This can be a hostname, an ipv4 address or an ipv6 address. An ipv6 address needs to be within square brackets [ ]. If the transport type is unix, then the server field should not be specified. Instead, the socket field needs to be populated with the path to the unix domain socket.
-
-* `port` is the port number on which glusterd is listening. This is optional; if it is not specified, QEMU sends 0, which makes gluster use the default port. If the transport type is unix, then port should not be specified.
-
-* `volname` is the name of the gluster volume which contains the VM image.
-
-* `image` is the path to the actual VM image that resides on gluster volume.
-
-
-###Examples:
-
- gluster://1.2.3.4/testvol/a.img
- gluster+tcp://1.2.3.4/testvol/a.img
- gluster+tcp://1.2.3.4:24007/testvol/dir/a.img
- gluster+tcp://[1:2:3:4:5:6:7:8]/testvol/dir/a.img
- gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img
- gluster+tcp://server.domain.com:24007/testvol/dir/a.img
- gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket
- gluster+rdma://1.2.3.4:24007/testvol/a.img
-
-
-
-NOTE: (GlusterFS URI description and above examples are taken from QEMU documentation)
-
-###Configuring QEMU with GlusterFS backend
-
-While building QEMU from source, in addition to the normal configuration options, ensure that the --enable-glusterfs option is specified explicitly to the ./configure script to get glusterfs support in qemu.
-
-Starting with QEMU-1.6, pkg-config is used to configure the GlusterFS backend in QEMU. If you are using GlusterFS compiled and installed from sources, then the GlusterFS package config file (glusterfs-api.pc) might not be present at the standard path and you will have to explicitly add the path by executing this command before running the QEMU configure script:
-
- export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/
-
-Without this, GlusterFS driver will not be compiled into QEMU even when GlusterFS is present in the system.
-
-* Creating a VM image on GlusterFS backend
-
-The qemu-img command can be used to create VM images on a gluster backend.
-
-The general syntax for image creation looks like this:
-
- qemu-img create gluster://server/volname/path/to/image size
-
-## How to set up the environment:
-
-This use case (using a glusterfs backend as the VM disk store) is known as the 'Virt-Store' use case. The steps for the entire procedure can be split into:
-
-* Steps to be done on gluster volume side
-* Steps to be done on Hypervisor side
-
-
-##Steps to be done on gluster side
-
-These are the steps that need to be done on the gluster side. Specifically, this involves
-
- Creating "Trusted Storage Pool"
- Creating a volume
- Tuning the volume for virt-store
- Tuning glusterd to accept requests from QEMU
- Tuning glusterfsd to accept requests from QEMU
- Setting ownership on the volume
- Starting the volume
-
-* Creating "Trusted Storage Pool"
-
-Install the glusterfs rpms on the node. You can create a volume with a single node. You can also scale up the cluster, which we call the `Trusted Storage Pool`, by adding more nodes to it:
-
- gluster peer probe <hostname>
-
-* Creating a volume
-
-It is highly recommended to use a replicate or distribute-replicate volume for the virt-store use case, as it adds high availability and fault tolerance. Note that a plain distribute volume works equally well.
-
- gluster volume create <volname> replica 2 <brick1> .. <brickN>
-
-where `<brick1>` is `<hostname>:/<path-of-dir>`
-
-
-Note: It is recommended to create a sub-directory inside the brick mount point and use that when creating a volume. For example, if /home/brick1 is the mountpoint of an XFS filesystem, you can create a sub-directory /home/brick1/b1 inside it and use it while creating the volume. You can also use space available in the root filesystem for bricks. The gluster cli, by default, throws a warning in that case. You can override it by using the force option:
-
- gluster volume create <volname> replica 2 <brick1> .. <brickN> force
-
-If you are new to GlusterFS, you can take a look at QuickStart (http://www.gluster.org/community/documentation/index.php/QuickStart) guide.
-
-* Tuning the volume for virt-store
-
-There are recommended settings available for virt-store. These provide good performance characteristics when enabled on the volume used for virt-store.
-
-Refer to http://www.gluster.org/community/documentation/index.php/Virt-store-usecase#Tunables for recommended tunables and for applying them on the volume, http://www.gluster.org/community/documentation/index.php/Virt-store-usecase#Applying_the_Tunables_on_the_volume
-
-
-* Tuning glusterd to accept requests from QEMU
-
-glusterd accepts requests only from applications that connect from a port number less than 1024 and blocks all others. QEMU uses a port number greater than 1024, so to make glusterd accept requests from QEMU, edit the glusterd vol file, /etc/glusterfs/glusterd.vol, and add the following:
-
- option rpc-auth-allow-insecure on
-
-Note: If you have installed glusterfs from source, you can find glusterd vol file at `/usr/local/etc/glusterfs/glusterd.vol`
-
-Restart glusterd after adding that option to glusterd vol file
-
- service glusterd restart
-
-* Tuning glusterfsd to accept requests from QEMU
-
-Enable the option `allow-insecure` on the particular volume
-
- gluster volume set <volname> server.allow-insecure on
-
-IMPORTANT: As of now (April 2, 2014) there is a bug: allow-insecure is not set dynamically on a volume. You need to restart the volume for the change to take effect.
-
-
-* Setting ownership on the volume
-
-Set the ownership of the volume to qemu:qemu
-
- gluster volume set <vol-name> storage.owner-uid 107
- gluster volume set <vol-name> storage.owner-gid 107
-
-* Starting the volume
-
-Start the volume
-
- gluster volume start <vol-name>
-
-## Steps to be done on Hypervisor Side:
-
-To create a raw image,
-
- qemu-img create gluster://1.2.3.4/testvol/dir/a.img 5G
-
-To create a qcow2 image,
-
- qemu-img create -f qcow2 gluster://server.domain.com:24007/testvol/a.img 5G
-
-
-
-
-
-## Booting VM image from GlusterFS backend
-
-A VM image 'a.img' residing on gluster volume testvol can be booted using QEMU like this:
-
-
- qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio
-
-In addition to VM images, gluster drives can also be used as data drives:
-
- qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio -drive file=gluster://1.2.3.4/datavol/a-data.img,if=virtio
-
-Here 'a-data.img' from datavol gluster volume appears as a 2nd drive for the guest.
-
-It is also possible to make use of libvirt to define a disk and use it with qemu:
-
-
-### Create libvirt XML to define Virtual Machine
-
-virt-install is a python wrapper which is mostly used to create a VM from a set of params. However, virt-install doesn't support any network filesystem [ https://bugzilla.redhat.com/show_bug.cgi?id=1017308 ]
-
-Create a libvirt VM xml (see http://libvirt.org/formatdomain.html) in which the disk section is formatted so that the qemu driver for glusterfs is used. This can be seen in the following example xml description:
-
-
- <disk type='network' device='disk'>
- <driver name='qemu' type='raw' cache='none'/>
- <source protocol='gluster' name='distrepvol/vm3.img'>
- <host name='10.70.37.106' port='24007'/>
- </source>
- <target dev='vda' bus='virtio'/>
- <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
- </disk>
-
-
-
-
-
-* Define the VM from the XML file that was created earlier
-
-
- virsh define <xml-file-description>
-
-* Verify that the VM is created successfully
-
-
- virsh list --all
-
-* Start the VM
-
-
- virsh start <VM>
-
-* Verification
-
-You can verify the disk image file that is being used by the VM
-
- virsh domblklist <VM-Domain-Name/ID>
-
-The above should show the volume name and image name. Here is an example:
-
-
- [root@test ~]# virsh domblklist vm-test2
- Target Source
- ------------------------------------------------
- vda distrepvol/test.img
- hdc -
-
-
-Reference:
-
-For more details on this feature implementation and its advantages, please refer:
-
-http://raobharata.wordpress.com/2012/10/29/qemu-glusterfs-native-integration/
-
-http://www.gluster.org/community/documentation/index.php/Libgfapi_with_qemu_libvirt
diff --git a/doc/features/quota-scalability.md b/doc/features/quota-scalability.md
deleted file mode 100644
index e47c898dd2a..00000000000
--- a/doc/features/quota-scalability.md
+++ /dev/null
@@ -1,52 +0,0 @@
-Issues with older implementation:
------------------------------------
-* >#### Enforcement of quota was done on client side. This had following two issues :
- > >* All clients are not trusted and hence enforcement is not secure.
- > >* Quota enforcer caches directory size for a certain time out period to reduce network calls to fetch size. On time out, this cache is validated by querying server. With more clients, the traffic caused due to this
-validation increases.
-
-* >#### Relying on lookup calls on a file/directory (inode) to update its contribution [time consuming]
-
-* >####Hardlimits were stored in a comma separated list.
- > >* Hence, changing hard limit of one directory is not an independent operation and would invalidate hard limits of other directories. We need to parse the string once for each of these directories just to identify whether its hard limit is changed. This limits the number of hard limits we can configure.
-
-* >####Cli used to fetch the list of directories on which quota-limit is set, from glusterd.
- > >* With a larger number of limits, the network overhead incurred to fetch this list limits the scalability of the number of directories on which we can set quota.
-
-* >#### Problem with NFS mount
- > >* Quota, for its enforcement and accounting requires all the ancestors of a file/directory till root. However, with NFS relying heavily on nameless lookups (through which there is no guarantee that ancestry can be
-accessed) this ancestry is not always present. Hence accounting and enforcement was not correct.
-
-
-New Design Implementation:
---------------------------------
-
-* Quota enforcement is moved to server side. This addresses issues that arose because of client side enforcement.
-
-* Two levels of quota limits, soft and hard, are introduced.
- This will result in a message being logged on reaching the soft limit, and writes will fail with EDQUOT after the hard limit is reached.
-
-Work Flow
------------------
-
-* Accounting
- # This is done using the marker translator loaded on each brick of the volume. Accounting happens in the background, i.e., it doesn't happen in-flight with the file operation. The latency of file operations is not
-directly affected by the time taken to perform accounting. This update is sent recursively upwards up to the root of the volume.
-
-* Enforcement
- # The enforcer updates its 'view' (cached) of directory's disk usage on the incidence of a file operation after the expiry of hard/soft timeout, depending on the current usage. Enforcer uses quotad to get the
-aggregated disk usage of a directory from the accounting information present on each brick (viz, provided by marker).
-
-* Aggregator (quotad)
- # Quotad is a daemon that serves volume-wide disk usage of a directory, on which quota is configured. It is present on all nodes in the cluster (trusted storage pool) as bricks don't have a global view of cluster.
-Quotad queries the disk usage information from all the bricks in that volume and aggregates. It manages all the volumes on which quota is enabled.
-
-
-Benefit to GlusterFS
----------------------------------
-
-* Support up to 65536 quota configurations per volume.
-* More quotas can be configured in a single volume, making GlusterFS usable for use cases like home directories.
-
-###For more information on quota usability refer to the following link :
-> https://access.redhat.com/site/documentation/en-US/Red_Hat_Storage/2.1/html-single/Administration_Guide/index.html#chap-User_Guide-Dir_Quota-Enable
diff --git a/doc/features/rdma-cm-in-3.4.0.txt b/doc/features/rdma-cm-in-3.4.0.txt
deleted file mode 100644
index fd953e56b3f..00000000000
--- a/doc/features/rdma-cm-in-3.4.0.txt
+++ /dev/null
@@ -1,9 +0,0 @@
-Following is the impact of http://review.gluster.org/#change,149.
-
-New userspace packages needed:
-librdmacm
-librdmacm-devel
-
-rdmacm needs an IPoIB address for connection establishment. This requirement results in following issues:
-* Because of bug #890502, we have to probe the peer on an IPoIB address. This imposes a restriction that all volumes created in the future have to communicate over IPoIB address (irrespective of whether they use gluster's tcp or rdma transport).
-* Currently client has an independence to choose between tcp and rdma transports while communicating with the server (by creating volumes with transport-type tcp,rdma). This independence was a byproduct of our ability to use the normal channel used with transport-type tcp for rdma connection establishment handshake too. However, with new requirement of IPoIB address for connection establishment, we lose this independence (till we bring in multi-network support - where a brick can be identified by a set of ip-addresses and we can choose different pairs of ip-addresses for communication based on our requirements - in glusterd).
diff --git a/doc/features/rdmacm.md b/doc/features/rdmacm.md
deleted file mode 100644
index caacab40452..00000000000
--- a/doc/features/rdmacm.md
+++ /dev/null
@@ -1,17 +0,0 @@
-## Rdma Connection manager ##
-
-### What? ###
-Infiniband requires addresses of end points to be exchanged using an out-of-band channel (like tcp/ip). Glusterfs used a custom protocol over a tcp/ip channel to exchange this address. However, librdmacm provides the same functionality with the advantage of being a standard protocol. This helps if we want to communicate with a non-glusterfs entity (say nfs client with gluster nfs server) over infiniband.
-
-### Dependencies ###
-* [IP over Infiniband](http://pkg-ofed.alioth.debian.org/howto/infiniband-howto-5.html) - The value to *option* **remote-host** in glusterfs transport configuration should be an IPoIB address
-* [rdma cm kernel module](http://pkg-ofed.alioth.debian.org/howto/infiniband-howto-4.html#ss4.4)
-* [user space rdmacm library - librdmacm](https://www.openfabrics.org/downloads/rdmacm)
-
-### Limitations ###
-* Because of bug [890502](https://bugzilla.redhat.com/show_bug.cgi?id=890502), we have to probe the peer on an IPoIB address. This imposes a restriction that all volumes created in the future have to communicate over IPoIB address (irrespective of whether they use gluster's tcp or rdma transport).
-
-* Currently client has independence to choose between tcp and rdma transports while communicating with the server (by creating volumes with **transport-type tcp,rdma**). This independence was a by-product of our ability to use the tcp/ip channel - transports with *option transport-type tcp* - for rdma connection establishment handshake too. However, with new requirement of IPoIB address for connection establishment, we lose this independence (till we bring in [multi-network support](https://bugzilla.redhat.com/show_bug.cgi?id=765437) - where a brick can be identified by a set of ip-addresses and we can choose different pairs of ip-addresses for communication based on our requirements - in glusterd).
-
-### External links ###
-* [Infiniband Howto](http://pkg-ofed.alioth.debian.org/howto/infiniband-howto.html)
diff --git a/doc/features/readdir-ahead.md b/doc/features/readdir-ahead.md
deleted file mode 100644
index 5302a021202..00000000000
--- a/doc/features/readdir-ahead.md
+++ /dev/null
@@ -1,14 +0,0 @@
-## Readdir-ahead ##
-
-### Summary ###
-Provide read-ahead support for directories to improve sequential directory read performance.
-
-### Owners ###
-Brian Foster
-
-### Detailed Description ###
-The read-ahead feature for directories is analogous to read-ahead for files. The objective is to detect sequential directory read operations and establish a pipeline for directory content. When a readdir request is received and fulfilled, preemptively issue subsequent readdir requests to the server in anticipation of those requests from the user. If sequential readdir requests are received, the directory content is already immediately available in the client. If subsequent requests are not sequential or not received, said data is simply dropped and the optimization is bypassed.
-
-readdir-ahead is currently disabled by default. It can be enabled with the following command:
-
- gluster volume set <volname> readdir-ahead on
diff --git a/doc/features/rebalance.md b/doc/features/rebalance.md
deleted file mode 100644
index 29b993008d2..00000000000
--- a/doc/features/rebalance.md
+++ /dev/null
@@ -1,74 +0,0 @@
-## Background
-
-
-For a more detailed description, view Jeff Darcy's blog post [here]
-(http://hekafs.org/index.php/2012/03/glusterfs-algorithms-distribution/)
-
-GlusterFS uses the distribute translator (DHT) to aggregate space of multiple servers. DHT distributes files among its subvolumes using a consistent hashing method providing 32-bit hashes. Each DHT subvolume is given a range in the 32-bit hash space. A hash value is calculated for every file from its name. The file is then placed in the subvolume with the hash range that contains the hash value.
-
-## What is rebalance?
-
-The rebalance process migrates files between the DHT subvolumes when necessary.
-
-## When is rebalance required?
-
-Rebalancing is required for two main cases.
-
-1. Addition/Removal of bricks
-
-2. Renaming of a file
-
-## Addition/Removal of bricks
-
-Whenever the number or order of DHT subvolumes changes, the hash range given to each subvolume is recalculated. When this happens, already existing files on the volume will need to be moved to the correct subvolume based on their hash. Rebalance does this activity.
-
-Addition of bricks which increase the size of a volume will increase the number of DHT subvolumes and lead to recalculation of hash ranges (This doesn't happen when bricks are added to a volume to increase redundancy, i.e. increase replica count of a volume). This will require an explicit rebalance command to be issued to migrate the files.
-
-Removal of bricks which decrease the size of a volume also causes the hash ranges of DHT to be recalculated. But we don't need to issue an explicit rebalance command in this case, as rebalance is done automatically by the remove-brick process if needed.
-
-## Renaming of a file
-
-Renaming a file will cause its hash to change. The file now needs to be moved to the correct subvolume based on its new hash. Rebalance does this.
-
-## How does rebalance work?
-
-At a high level, the rebalance process consists of the following 3 steps:
-
-1. Crawl the volume to access all files
-2. Calculate the hash for the file
-3. If needed, migrate the file to the correct subvolume.
-
-
-The rebalance process has been optimized by making it distributed across the trusted storage pool. With distributed rebalance, a rebalance process is launched on each peer in the cluster. Each rebalance process will crawl files on only those bricks of the volume which are present on it, and migrate the files which need migration to the correct brick. This speeds up the rebalance process considerably.
-
-## What will happen if rebalance is not run?
-
-### Addition of bricks
-
-With the current implementation of add-brick, when the size of a volume is augmented by adding new bricks, the new bricks are not put into use immediately, i.e., the hash ranges are not recalculated immediately. This means that the files will still be placed only onto the existing bricks, leaving the newly added storage space unused. Starting a rebalance process on the volume will cause the hash ranges to be recalculated with the new bricks included, which allows the newly added storage space to be used.
-
-### Renaming a file
-
-When a file rename causes the file to be hashed to a new subvolume, DHT writes a link file on the new subvolume leaving the actual file on the original subvolume. A link file is an empty file, which has an extended attribute set that points to the subvolume on which the actual file exists. So, when a client accesses the renamed file, DHT first looks for the file in the hashed subvolume and gets the link file. DHT understands the link file, and gets the actual file from the subvolume pointed to by the link file. This leads to a slight reduction in performance. A rebalance will move the actual file to the hashed subvolume, allowing clients to access the file directly once again.
-
-## Are clients affected during a rebalance process?
-
-The rebalance process is transparent to applications on the clients. Applications which have open files on the volume will not be affected by the rebalance process, even if the open file requires migration. The DHT translator on the client will hide the migration from the applications.
-
-##How are open files migrated?
-
-(A more technical description of the algorithm used can be seen in the commit message of commit a07bb18c8adeb8597f62095c5d1361c5bad01f09.)
-
-To achieve migration of open files, two things need to be ensured:
-a) any writes or changes happening to the file during migration are correctly synced to the destination subvolume after the migration is complete.
-b) any further changes are made to the destination subvolume.
-
-Both of these requirements require sending notifications to clients. Clients are notified by overloading an attribute used in every callback function. DHT understands these attributes in the callbacks and can thus be notified whether a file is being migrated.
-
-During rebalance, a file will be in one of two phases:
-
-1. Migration in process - In this phase the file is being migrated by the rebalance process from the source subvolume to the destination subvolume. The rebalance process will set an 'in-migration' attribute on the file, which will notify the clients' DHT translator. The clients' DHT translator will then take care to send any further changes to the destination subvolume as well. This way we satisfy the first requirement.
-
-2. Migration completed - Once the file has been migrated, the rebalance process will set a 'migration-complete' attribute on the file. The clients will be notified of the completion and all further operations on the file will happen on the destination subvolume.
-
-The DHT translator handles the above and allows the applications on the clients to continue working on a file under migration.
diff --git a/doc/features/server-quorum.md b/doc/features/server-quorum.md
deleted file mode 100644
index 7b20084cea8..00000000000
--- a/doc/features/server-quorum.md
+++ /dev/null
@@ -1,44 +0,0 @@
-# Server Quorum
-
-Server quorum is a feature intended to reduce the occurrence of "split brain"
-after a brick failure or network partition. Split brain happens when different
-sets of servers are allowed to process different sets of writes, leaving data
-in a state that can not be reconciled automatically. The key to avoiding split
-brain is to ensure that there can be only one set of servers - a quorum - that
-can continue handling writes. Server quorum does this by the brutal but
-effective means of forcing down all brick daemons on cluster nodes that can no
-longer reach enough of their peers to form a majority. Because there can only
-be one majority, there can be only one set of bricks remaining, and thus split
-brain can not occur.
-
-## Options
-
-Server quorum is controlled by two parameters:
-
- * **cluster.server-quorum-type**
-
- This value may be "server" to indicate that server quorum is enabled, or
- "none" to mean it's disabled.
-
- * **cluster.server-quorum-ratio**
-
- This is the percentage of cluster nodes that must be up to maintain quorum.
- More precisely, this percentage of nodes *plus one* must be up.
-
-Note that these are cluster-wide flags. All volumes served by the cluster will
-be affected. Once these values are set, quorum actions - starting or stopping
-brick daemons in response to node or network events - will be automatic.
-
-## Best Practices
-
-If a cluster with an even number of nodes is split exactly down the middle,
-neither half can have quorum (which requires **more than** half of the total).
-This is particularly important when N=2, in which case the loss of either node
-leads to loss of quorum. Therefore, it is highly advisable to ensure that the
-cluster size is three or greater. The "extra" node in this case need not have
-any bricks or serve any data. It need only be present to preserve the notion
-of a quorum majority less than the entire cluster membership, allowing the
-cluster to survive the loss of a single node without losing quorum.
-
-
-
diff --git a/doc/features/worm.md b/doc/features/worm.md
deleted file mode 100644
index dba99777da5..00000000000
--- a/doc/features/worm.md
+++ /dev/null
@@ -1,75 +0,0 @@
-#WORM (Write Once Read Many)
-This feature enables you to create a `WORM volume` using the gluster CLI.
-##Description
-WORM (write once, read many) is a desired feature for users who want to store data, such as `log files`, that must not be modified once written.
-
-GlusterFS provides a new key `features.worm` which takes boolean values (enable/disable) for volume set.
-
-Internally, the volume set command with the 'features.worm' key will add the 'features/worm' translator to the brick's volume file.
-
-`This change would be reflected on a subsequent restart of the volume`, i.e. a gluster volume stop followed by a gluster volume start.
-
-With a volume converted to WORM, the changes are as follows:
-
-* Reads are handled normally
-* Writes are supported only on files opened with the O_APPEND flag.
-* Truncation and deletion won't be supported.
-
-##Volume Options
-Use the volume set command on a volume and see if the volume is actually turned into WORM type.
-
- # gluster volume set <vol_name> features.worm enable
-##Fully loaded Example
-The WORM feature is supported from glusterfs version 3.4 onwards.
-Start glusterd by using the command
-
- # service glusterd start
-Now create a volume by using the command
-
- # gluster volume create <vol_name> <brick_path>
-Start the volume you created by running the command below.
-
- # gluster vol start <vol_name>
-Run the command below to make sure that the volume has been created.
-
- # gluster volume info
-Now turn on the WORM feature on the volume by using the command
-
- # gluster vol set <vol_name> worm enable
-Verify that the option is set by using the command
-
- # gluster volume info
-The user should be able to see another option in the volume info output
-
- # features.worm: enable
-Now restart the volume for the changes to take effect, by performing a volume stop and start.
-
- # gluster volume stop <vol_name>
- # gluster volume start <vol_name>
-Now mount the volume using a FUSE mount
-
- # mount -t glusterfs <server>:/<vol_name> <mnt_point>
-Create a file inside the mount point by running the command below
-
- # touch <file_name>
-Verify that the user was able to create the file by running the command below
-
- # ls <file_name>
-
-##How To Test
-Now try deleting the file that was created above
-
- # rm <file_name>
-Since WORM is enabled on the volume, it gives the following error message `rm: cannot remove '/<mnt_point>/<file_name>': Read-only file system`
-
-Put some content into the file created above by appending to it.
-
- # echo "at the end of the file" >> <file_name>
-Now try editing the file by running the command below and verify that a `Read-only file system` error is displayed
-
- # sed -i "1iAt the beginning of the file" <file_name>
-Now read the contents of the file and verify that file can be read.
-
- # cat <file_name>
-
-`Note: If the WORM option is set on the volume before it is started, then the volume need not be restarted for the changes to take effect`.
diff --git a/doc/features/zerofill.md b/doc/features/zerofill.md
deleted file mode 100644
index c0f1fc5014c..00000000000
--- a/doc/features/zerofill.md
+++ /dev/null
@@ -1,26 +0,0 @@
-#zerofill API for GlusterFS
-The zerofill() API allows creation of pre-allocated and zeroed-out files on GlusterFS volumes by offloading the zeroing part to the server and/or storage (storage offloads use SCSI WRITESAME).
-## Description
-
-Zerofill writes zeroes to a file in the specified range. This fop is useful when a whole file needs to be initialized with zeroes (for example, when provisioning zero-filled VM disk images or when scrubbing VM disk images).
-
-A client/application can issue this fop to zero out a range. The Gluster server then zeroes out the required range of bytes, i.e. the zeroing is offloaded to the server. In the absence of this fop, the client/application has to repeatedly issue write (zero) fops to the server, which is very inefficient because of the overheads involved in RPC calls and acknowledgements.
-
-WRITESAME is a SCSI T10 command that takes a block of data as input and writes the same data to other blocks. The write is handled completely within the storage and hence is known as an offload. Linux now has support for the SCSI WRITESAME command, which is exposed to the user in the form of the BLKZEROOUT ioctl. BD Xlator can exploit the BLKZEROOUT ioctl to implement this fop. Thus zeroing out operations can be completely offloaded to the storage device,
-making them highly efficient.
-
-The fop takes two arguments, offset and size. It zeroes out 'size' bytes in an opened file, starting from the 'offset' position.
-This feature adds zerofill support to the following areas:
-- libglusterfs
-- io-stats
-- performance/md-cache,open-behind
-- quota
-- cluster/afr,dht,stripe
-- rpc/xdr
-- protocol/client,server
-- io-threads
-- marker
-- storage/posix
-- libgfapi
-
-Client applications can exploit this fop by using glfs_zerofill, introduced in libgfapi. FUSE support for this fop has not been added, as there is no system call for this fop.