Diffstat (limited to 'done')
-rw-r--r--  done/Features/README.md | 42
-rw-r--r--  done/Features/afr-arbiter-volumes.md | 56
-rw-r--r--  done/Features/afr-statistics.md | 142
-rw-r--r--  done/Features/afr-v1.md | 340
-rw-r--r--  done/Features/bitrot-docs.md | 7
-rw-r--r--  done/Features/brick-failure-detection.md | 67
-rw-r--r--  done/Features/dht.md | 223
-rw-r--r--  done/Features/distributed-geo-rep.md | 71
-rw-r--r--  done/Features/file-snapshot.md | 91
-rw-r--r--  done/Features/gfid-access.md | 73
-rw-r--r--  done/Features/glusterfs_nfs-ganesha_integration.md | 123
-rw-r--r--  done/Features/heal-info-and-split-brain-resolution.md | 448
-rw-r--r--  done/Features/leases.md | 11
-rw-r--r--  done/Features/libgfapi.md | 382
-rw-r--r--  done/Features/libgfchangelog.md | 119
-rw-r--r--  done/Features/memory-usage.md | 49
-rw-r--r--  done/Features/meta.md | 206
-rw-r--r--  done/Features/mount_gluster_volume_using_pnfs.md | 68
-rw-r--r--  done/Features/nufa.md | 20
-rw-r--r--  done/Features/object-versioning.md | 230
-rw-r--r--  done/Features/ovirt-integration.md | 106
-rw-r--r--  done/Features/qemu-integration.md | 230
-rw-r--r--  done/Features/quota-object-count.md | 47
-rw-r--r--  done/Features/quota-scalability.md | 52
-rw-r--r--  done/Features/rdmacm.md | 26
-rw-r--r--  done/Features/readdir-ahead.md | 14
-rw-r--r--  done/Features/rebalance.md | 74
-rw-r--r--  done/Features/server-quorum.md | 44
-rw-r--r--  done/Features/shard.md | 68
-rw-r--r--  done/Features/tier.md | 170
-rw-r--r--  done/Features/trash_xlator.md | 80
-rw-r--r--  done/Features/upcall.md | 38
-rw-r--r--  done/Features/worm.md | 75
-rw-r--r--  done/Features/zerofill.md | 26
-rw-r--r--  done/GlusterFS 3.5/AFR CLI enhancements.md | 204
-rw-r--r--  done/GlusterFS 3.5/Brick Failure Detection.md | 151
-rw-r--r--  done/GlusterFS 3.5/Disk Encryption.md | 443
-rw-r--r--  done/GlusterFS 3.5/Exposing Volume Capabilities.md | 161
-rw-r--r--  done/GlusterFS 3.5/File Snapshot.md | 101
-rw-r--r--  done/GlusterFS 3.5/Onwire Compression-Decompression.md | 96
-rw-r--r--  done/GlusterFS 3.5/Quota Scalability.md | 99
-rw-r--r--  done/GlusterFS 3.5/Virt store usecase.md | 140
-rw-r--r--  done/GlusterFS 3.5/Zerofill.md | 192
-rw-r--r--  done/GlusterFS 3.5/gfid access.md | 89
-rw-r--r--  done/GlusterFS 3.5/index.md | 32
-rw-r--r--  done/GlusterFS 3.5/libgfapi with qemu libvirt.md | 222
-rw-r--r--  done/GlusterFS 3.5/readdir ahead.md | 117
-rw-r--r--  done/GlusterFS 3.6/Better Logging.md | 348
-rw-r--r--  done/GlusterFS 3.6/Better Peer Identification.md | 172
-rw-r--r--  done/GlusterFS 3.6/Gluster User Serviceable Snapshots.md | 39
-rw-r--r--  done/GlusterFS 3.6/Gluster Volume Snapshot.md | 354
-rw-r--r--  done/GlusterFS 3.6/New Style Replication.md | 230
-rw-r--r--  done/GlusterFS 3.6/Persistent AFR Changelog xattributes.md | 178
-rw-r--r--  done/GlusterFS 3.6/RDMA Improvements.md | 101
-rw-r--r--  done/GlusterFS 3.6/Server-side Barrier feature.md | 213
-rw-r--r--  done/GlusterFS 3.6/Thousand Node Gluster.md | 150
-rw-r--r--  done/GlusterFS 3.6/afrv2.md | 244
-rw-r--r--  done/GlusterFS 3.6/better-ssl.md | 137
-rw-r--r--  done/GlusterFS 3.6/disperse.md | 142
-rw-r--r--  done/GlusterFS 3.6/glusterd volume locks.md | 48
-rw-r--r--  done/GlusterFS 3.6/heterogeneous-bricks.md | 136
-rw-r--r--  done/GlusterFS 3.6/index.md | 96
-rw-r--r--  done/GlusterFS 3.7/Archipelago Integration.md | 93
-rw-r--r--  done/GlusterFS 3.7/BitRot.md | 211
-rw-r--r--  done/GlusterFS 3.7/Clone of Snapshot.md | 100
-rw-r--r--  done/GlusterFS 3.7/Data Classification.md | 279
-rw-r--r--  done/GlusterFS 3.7/Easy addition of Custom Translators.md | 129
-rw-r--r--  done/GlusterFS 3.7/Exports and Netgroups Authentication.md | 134
-rw-r--r--  done/GlusterFS 3.7/Gluster CLI for NFS Ganesha.md | 120
-rw-r--r--  done/GlusterFS 3.7/Gnotify.md | 168
-rw-r--r--  done/GlusterFS 3.7/HA for Ganesha.md | 156
-rw-r--r--  done/GlusterFS 3.7/Improve Rebalance Performance.md | 277
-rw-r--r--  done/GlusterFS 3.7/Object Count.md | 113
-rw-r--r--  done/GlusterFS 3.7/Policy based Split-brain Resolution.md | 128
-rw-r--r--  done/GlusterFS 3.7/SE Linux Integration.md | 4
-rw-r--r--  done/GlusterFS 3.7/Scheduling of Snapshot.md | 229
-rw-r--r--  done/GlusterFS 3.7/Sharding xlator.md | 129
-rw-r--r--  done/GlusterFS 3.7/Small File Performance.md | 433
-rw-r--r--  done/GlusterFS 3.7/Trash.md | 182
-rw-r--r--  done/GlusterFS 3.7/Upcall Infrastructure.md | 747
-rw-r--r--  done/GlusterFS 3.7/arbiter.md | 100
-rw-r--r--  done/GlusterFS 3.7/index.md | 90
-rw-r--r--  done/GlusterFS 3.7/rest-api.md | 152
83 files changed, 12427 insertions(+), 0 deletions(-)
diff --git a/done/Features/README.md b/done/Features/README.md
new file mode 100644
index 0000000..97c1175
--- /dev/null
+++ b/done/Features/README.md
@@ -0,0 +1,42 @@
+###Features in GlusterFS 3.7
+
+- [AFR Arbiter Volumes](./afr-arbiter-volumes.md)
+- [bit rot docs](./bitrot-docs.md)
+- [bit rot memory usage](./memory-usage.md)
+- [bit rot object versioning](./object-versioning.md)
+- [shard](./shard.md)
+- [upcall](./upcall.md)
+- [quota object count](./quota-object-count.md)
+- [GlusterFS NFS Ganesha Integration](./glusterfs_nfs-ganesha_integration.md)
+- [Tiering](./tier.md)
+- [trash_xlator](./trash_xlator.md)
+
+###Features in GlusterFS 3.6
+
+
+###Features in GlusterFS 3.5
+
+- [AFR Statistics](./afr-statistics.md)
+- [AFR ver 1](./afr-v1.md)
+- [Brick failure detection](./brick-failure-detection.md)
+- [File Snapshot](./file-snapshot.md)
+- [gfid access](./gfid-access.md)
+- [Quota Scalability](./quota-scalability.md)
+- [Readdir-ahead](./readdir-ahead.md)
+- [zerofill API for GlusterFS](./zerofill.md)
+
+###Other Gluster Features
+
+- [Distributed Hash Tables](./dht.md)
+- [Heal Info and Split Brain Resolution](./heal-info-and-split-brain-resolution.md)
+- [libgfapi](./libgfapi.md)
+- [Mounting Gluster Volumes using PNFS](./mount_gluster_volume_using_pnfs.md)
+- [Non Uniform File Access](./nufa.md)
+- [OVirt Integration](./ovirt-integration.md)
+- [QEMU Integration](./qemu-integration.md)
+- [RDMA Connection Manager](./rdmacm.md)
+- [Rebalance](./rebalance.md)
+- [Server Quorum](./server-quorum.md)
+- [Write Once Read Many](./worm.md)
+- [Distributed geo replication](./distributed-geo-rep.md)
+- [libgfchangelog](./libgfchangelog.md)
\ No newline at end of file
diff --git a/done/Features/afr-arbiter-volumes.md b/done/Features/afr-arbiter-volumes.md
new file mode 100644
index 0000000..e31bc31
--- /dev/null
+++ b/done/Features/afr-arbiter-volumes.md
@@ -0,0 +1,56 @@
+Usage guide: Replicate volumes with arbiter configuration
+==========================================================
+
+Arbiter volumes are replica 3 volumes where the 3rd brick of the replica is
+automatically configured as an arbiter node. What this means is that the 3rd
+brick will store only the file name and metadata, but will not contain any data.
+This configuration is helpful in avoiding split-brains while providing the same
+level of consistency as a normal replica 3 volume.
+
+The arbiter volume can be created with the following command:
+
+ gluster volume create <VOLNAME> replica 3 arbiter 1 host1:brick1 host2:brick2 host3:brick3
+
+Note that the syntax is similar to creating a normal replica 3 volume with the
+exception of the `arbiter 1` keyword. As seen in the command above, the only
+permissible values for the replica count and arbiter count are 3 and 1
+respectively. Also, the 3rd brick is always chosen as the arbiter brick and it
+is not configurable to have any other brick as the arbiter.
+
+Client/ Mount behaviour:
+========================
+
+By default, client quorum (`cluster.quorum-type`) is set to `auto` for a replica
+3 volume when it is created; i.e. at least 2 bricks need to be up to satisfy
+quorum and to allow writes. This setting should not be changed for arbiter
+volumes either. Additionally, the arbiter volume has some additional checks to
+prevent files from ending up in split-brain:
+
+* Clients take full file locks when writing to a file as opposed to range locks
+ in a normal replica 3 volume.
+
+* If 2 bricks are up and if one of them is the arbiter (i.e. the 3rd brick) *and*
+ it blames the other up brick, then all FOPS will fail with ENOTCONN (Transport
+  endpoint is not connected). If the arbiter doesn't blame the other brick,
+ FOPS will be allowed to proceed. 'Blaming' here is w.r.t the values of AFR
+ changelog extended attributes.
+
+* If 2 bricks are up and the arbiter is down, then FOPS will be allowed.
+
+* In all cases, if there is only one source before the FOP is initiated and if
+ the FOP fails on that source, the application will receive ENOTCONN.
+
+Note: It is possible to see if a replica 3 volume has arbiter configuration from
+the mount point. If
+`$mount_point/.meta/graphs/active/$V0-replicate-0/options/arbiter-count` exists
+and its value is 1, then it is an arbiter volume. Also the client volume graph
+will have arbiter-count as a xlator option for AFR translators.
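+
+For example, a quick check from the mount point might look like this (a sketch, assuming a
+volume named `testvol` mounted at `/mnt/testvol`; the `testvol-replicate-0` path component
+depends on the volume name):
+
+    # cat /mnt/testvol/.meta/graphs/active/testvol-replicate-0/options/arbiter-count
+    1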
+
+Self-heal daemon behaviour:
+===========================
+Since the arbiter brick does not store any data for the files, data-self-heal
+from the arbiter brick will not take place. For example if there are 2 source
+bricks B2 and B3 (B3 being arbiter brick) and B2 is down, then data-self-heal
+will *not* happen from B3 to sink brick B1, and will be pending until B2 comes
+up and heal can happen from it. Note that metadata and entry self-heals can
+still happen from B3 if it is one of the sources.
\ No newline at end of file
diff --git a/done/Features/afr-statistics.md b/done/Features/afr-statistics.md
new file mode 100644
index 0000000..d070584
--- /dev/null
+++ b/done/Features/afr-statistics.md
@@ -0,0 +1,142 @@
+##gluster volume heal <volume-name> statistics
+
+##Description
+In case of index self-heal, self-heal daemon reads the entries from the
+local bricks, from /brick-path/.glusterfs/indices/xattrop/ directory.
+So based on the entries read by self heal daemon, it will attempt self-heal.
+Executing this command will list the crawl statistics of self heal done for
+each brick.
+
+For each brick, it will list:
+1. Starting time of crawl done for that brick.
+2. Ending time of crawl done for that brick.
+3. No. of entries for which self-heal was successfully attempted.
+4. No. of split-brain entries found while self-healing.
+5. No. of entries for which heal failed.
+
+
+
+Example:
+a) Create a gluster volume with replica count 2.
+b) Create 10 files.
+c) Kill brick_1 of this replica.
+d) Overwrite all 10 files.
+e) Kill the other brick (brick_2), and bring brick_1 back up.
+f) Overwrite all 10 files.
+
+Now we have 10 files which are in split-brain. The self-heal daemon will crawl
+both bricks and will count 10 files from each brick.
+It will report 10 files under split-brain with respect to each brick.
+
+Gathering crawl statistics on volume volume1 has been successful
+------------------------------------------------
+
+Crawl statistics for brick no 0
+Hostname of brick 192.168.122.1
+
+Starting time of crawl: Tue May 20 19:13:11 2014
+
+Ending time of crawl: Tue May 20 19:13:12 2014
+
+Type of crawl: INDEX
+No. of entries healed: 0
+No. of entries in split-brain: 10
+No. of heal failed entries: 0
+------------------------------------------------
+
+Crawl statistics for brick no 1
+Hostname of brick 192.168.122.1
+
+Starting time of crawl: Tue May 20 19:13:12 2014
+
+Ending time of crawl: Tue May 20 19:13:12 2014
+
+Type of crawl: INDEX
+No. of entries healed: 0
+No. of entries in split-brain: 10
+No. of heal failed entries: 0
+
+------------------------------------------------
+
+
+As the output shows, the self-heal daemon detects 10 files in split-brain with
+respect to each brick.
+
+
+
+
+##gluster volume heal <volume-name> statistics heal-count
+It lists the number of entries present in
+/<brick>/.glusterfs/indices/xattrop on each brick.
+
+
+1. Create a replicate volume.
+2. Kill one brick of a replicate volume1.
+3. Create 10 files.
+4. Execute above command.
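+
+The command for step 4, using the volume name from the sample output below:
+
+    # gluster volume heal volume1 statistics heal-count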
+
+--------------------------------------------------------------------------------
+Gathering count of entries to be healed on volume volume1 has been successful
+
+Brick 192.168.122.1:/brick_1
+Number of entries: 10
+
+Brick 192.168.122.1:/brick_2
+No gathered input for this brick
+-------------------------------------------------------------------------------
+
+
+
+
+
+
+##gluster volume heal <volume-name> statistics heal-count replica ip_addr:/brick_location
+
+To list the number of entries to be healed for a particular replicate
+subvolume, specify any one child (brick) of that replicate subvolume in the
+command; it will list the entries for all the children of that replicate subvolume.
+
+Example:
+
+                        dht
+                      /     \
+                     /       \
+            replica-1         replica-2
+            /       \         /       \
+       child-1  child-2   child-3  child-4
+       /brick1  /brick2   /brick3  /brick4
+
+gluster volume heal <vol-name> statistics heal-count ip:/brick1
+will list count only for child-1 and child-2.
+
+gluster volume heal <vol-name> statistics heal-count ip:/brick3
+will list count only for child-3 and child-4.
+
+
+
+1. Create a volume same as mentioned in the above graph.
+2. Kill Brick-2.
+3. Create some files.
+4. If we are interested in knowing the number of files to be healed from each
+ brick of replica-1 only, mention any one child of that replica.
+
+gluster volume heal volume1 statistics heal-count replica 192.168.122.1:/brick2
+
+output:
+-------
+Gathering count of entries to be healed per replica on volume volume1 has \
+been successful
+
+Brick 192.168.122.1:/brick_1
+Number of entries: 10                 <-- 10 files
+
+Brick 192.168.122.1:/brick_2
+No gathered input for this brick      <-- brick is down
+
+Brick 192.168.122.1:/brick_3
+No gathered input for this brick      <-- not part of the replica we asked about
+
+Brick 192.168.122.1:/brick_4
+No gathered input for this brick      <-- not part of the replica we asked about
+
+
diff --git a/done/Features/afr-v1.md b/done/Features/afr-v1.md
new file mode 100644
index 0000000..0ab41a1
--- /dev/null
+++ b/done/Features/afr-v1.md
@@ -0,0 +1,340 @@
+#Automatic File Replication
+Afr xlator in glusterfs is responsible for replicating the data across the bricks.
+
+###Responsibilities of AFR
+Its responsibilities include the following:
+
+1. Maintain replication consistency (i.e. Data on both the bricks should be same, even in the cases where there are operations happening on same file/directory in parallel from multiple applications/mount points as long as all the bricks in replica set are up)
+
+2. Provide a way of recovering data in case of failures as long as there is
+ at least one brick which has the correct data.
+
+3. Serve fresh data for read/stat/readdir etc
+
+###Transaction framework
+For 1 and 2 above, afr uses a transaction framework which consists of the following 5
+phases that happen on all the bricks in the replica set (the bricks which are in replication):
+
+####1.Lock Phase
+####2. Pre-op Phase
+####3. Op Phase
+####4. Post-op Phase
+####5. Unlock Phase
+
+*Op Phase* is the actual operation sent by application (`write/create/unlink` etc). For every operation which afr receives that modifies data it sends that same operation in parallel to all the bricks in its replica set. This is how it achieves replication.
+
+*Lock, Unlock Phases* take necessary locks so that *Op phase* can provide **replication consistency** in normal work flow.
+
+#####For example:
+If an application performs `touch a` and the other one does `mkdir a`, afr makes sure that either file with name `a` or directory with name `a` is created on both the bricks.
+
+*Pre-op, Post-op Phases* provide changelogging which enables afr to figure out which copy is fresh.
+Once afr knows how to figure out fresh copy in the replica set it can **recover data** from fresh copy to stale copy. Also it can **serve fresh** data for `read/stat/readdir` etc.
+
+##Internal Operations
+Brief introduction to internal operations in Glusterfs which make *Locking, Unlocking, Pre/Post ops* possible:
+
+###Internal Locking Operations
+Glusterfs has **locks** translator which provides the following internal locking operations called `inodelk`, `entrylk` which are used by afr to achieve synchronization of operations on files or directories that conflict with each other.
+
+`Inodelk` gives the facility for translators in Glusterfs to obtain range (denoted by tuple with **offset**, **length**) locks in a given domain for an inode.
+Full file lock is denoted by the tuple (offset: `0`, length: `0`) i.e. length `0` is considered as infinity.
+
+`Entrylk` enables translators of Glusterfs to obtain locks on `name` in a given domain for an inode, typically a directory.
+
+**Locks** translator provides both *blocking* and *nonblocking* variants of these operations.
+
+###Xattrop
+For pre/post ops posix translator provides an operation called xattrop.
+xattrop is a way of *incrementing*/*decrementing* a number present in the extended attribute of the inode *atomically*.
+
+##Transaction Types
+There are 3 types of transactions in AFR.
+1. Data transactions
+ - Operations that add/modify/truncate the file contents.
+ - `Write`/`Truncate`/`Ftruncate` etc
+
+2. Metadata transactions
+ - Operations that modify the data kept in inode.
+ - `Chmod`/`Chown` etc
+
+3. Entry transactions
+ - Operations that add/remove/rename entries in a directory
+ - `Touch`/`Mkdir`/`Mknod` etc
+
+###Data transactions:
+
+*write* (`offset`, `size`) - writes data from `offset` of `size`
+
+*ftruncate*/*truncate* (`offset`) - truncates data from `offset` till the end of file.
+
+Afr internal locking needs to make sure that two conflicting data operations happen in order, one after the other so that it does not result in replication inconsistency. Afr data operations take inodelks in same domain (lets call it `data` domain).
+
+*Write* operation with offset `O` and size `S` takes an inode lock in data domain with range `(O, S)`.
+
+*Ftruncate*/*Truncate* operations with offset `O` take inode locks in `data` domain with range `(O, 0)`. Please note that size `0` means size infinity.
+
+These ranges make sure that overlapping write/truncate/ftruncate operations are done one after the other.
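+
+For example, a write of 4096 bytes at offset 1048576 takes an inodelk with range
+(1048576, 4096), while a truncate to offset 1048576 takes (1048576, 0); since the two
+ranges overlap, whichever operation locks second waits for the first one to complete.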
+
+Now that we know the ranges the operations take locks on, we will see how locking happens in afr.
+
+####Lock:
+Afr initially attempts **non-blocking** locks on **all** the bricks of the replica set in **parallel**. If all the locks are successful then it goes on to perform pre-op. But in case **non-blocking** locks **fail** because there is *at least one conflicting operation* which already has a **granted lock** then it **unlocks** the **non-blocking** locks it already acquired in this previous step and proceeds to perform **blocking** locks **one after the other** on each of the subvolumes in the order of subvolumes specified in the volfile.
+
+Chances of **conflicting operations** are **very low** and the time elapsed in the **non-blocking** locks phase is `Max(latencies of the bricks for responding to inodelk)`, whereas the time elapsed in the **blocking locks** phase is `Sum(latencies of the bricks for responding to inodelk)`. That is why afr always tries non-blocking locks first and only then moves to blocking locks.
+
+####Pre-op:
+Each file/dir in a brick maintains the changelog (roughly a pending operation count) of itself and that of the files
+present in all the other bricks in its replica set as seen by that brick.
+
+Let's consider an example replica volume with 2 bricks, brick-a and brick-b.
+All files in brick-a will have 2 entries:
+one for itself and the other for the file present in its replica set, i.e. brick-b.
+One can inspect the changelogs using the getfattr command:
+
+ # getfattr -d -e hex -m. brick-a/file.txt
+ trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
+ trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for brick-b as seen by brick-a
+
+Likewise, all files in brick-b will have:
+
+ # getfattr -d -e hex -m. brick-b/file.txt
+ trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
+ trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for itself (brick-b)
+
+#####Interpreting Changelog Value:
+Each extended attribute has a value which is `24` hexadecimal digits, i.e. `12` bytes.
+The first `8` digits (`4` bytes) represent the changelog of `data`. The second `8` digits represent the changelog
+of `metadata`. The last `8` digits represent the changelog of `directory entries`.
+
+Pictorially representing the same, we have:
+
+ 0x 00000000 00000000 00000000
+ | | |
+ | | \_ changelog of directory entries
+ | \_ changelog of metadata
+ \ _ changelog of data
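+
+For example, a (hypothetical) value of `0x000000020000000100000003` would mean 2 pending
+data operations, 1 pending metadata operation and 3 pending entry operations recorded
+against the brick that this extended attribute refers to.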
+
+Before write operation is performed on the brick, afr marks the file saying there is a pending operation.
+
+As part of this pre-op afr sends xattrop operation with increment 1 for data operation to make the extended attributes the following:
+
+ # getfattr -d -e hex -m. brick-a/file.txt
+ trusted.afr.vol-client-0=0x000000010000000000000000 -->changelog for itself (brick-a)
+ trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for brick-b as seen by brick-a
+
+Likewise, all files in brick-b will have:
+
+ # getfattr -d -e hex -m. brick-b/file.txt
+ trusted.afr.vol-client-0=0x000000010000000000000000 -->changelog for brick-a as seen by brick-b
+ trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for itself (brick-b)
+
+As the operation is in progress on files on both the bricks all the extended attributes show the same value.
+
+####Op:
+Now it sends the actual write operation to both the bricks. Afr remembers whether the operation is successful or not on all the subvolumes.
+
+####Post-Op:
+If the operation succeeds on all the bricks then there is no pending operations on any of the bricks so as part of POST-OP afr sends xattrop operation with increment -1 i.e. decrement by 1 for data operation to make the extended attributes back to all zeros again.
+
+In case there is a failure on brick-b then there is still a pending operation on brick-b, whereas there are no pending operations on brick-a. So the xattrop operation for these two extended attributes differs now. For the extended attribute corresponding to brick-a, i.e. trusted.afr.vol-client-0, a decrement by 1 is sent, whereas for the extended attribute corresponding to brick-b an increment by '0' is sent to retain the pending operation count.
+
+ # getfattr -d -e hex -m. brick-a/file.txt
+ trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
+ trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for brick-b as seen by brick-a
+
+ # getfattr -d -e hex -m. brick-b/file.txt
+ trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
+ trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for itself (brick-b)
+
+####Unlock:
+Once the transaction is completed unlock is sent on all the bricks where lock is acquired.
+
+
+###Meta Data transactions:
+
+setattr, setxattr, removexattr
+All metadata operations take same inode lock with same range in metadata domain.
+
+####Lock:
+Metadata locking also starts initially with non-blocking locks then move on to blocking locks on any failures because of conflicting operations.
+
+####Pre-op:
+Before metadata operation is performed on the brick, afr marks the file saying there is a pending operation.
+As part of this pre-op afr sends xattrop operation with increment 1 for metadata operation to make the extended attributes the following:
+
+ # getfattr -d -e hex -m. brick-a/file.txt
+ trusted.afr.vol-client-0=0x000000000000000100000000 -->changelog for itself (brick-a)
+ trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for brick-b as seen by brick-a
+
+Likewise, all files in brick-b will have:
+
+ # getfattr -d -e hex -m. brick-b/file.txt
+ trusted.afr.vol-client-0=0x000000000000000100000000 -->changelog for brick-a as seen by brick-b
+ trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for itself (brick-b)
+
+As the operation is in progress on files on both the bricks all the extended attributes show the same value.
+
+####Op:
+Now it sends the actual metadata operation to both the bricks. Afr remembers whether the operation is successful or not on all the subvolumes.
+
+####Post-Op:
+If the operation succeeds on all the bricks then there is no pending operations on any of the bricks so as part of POST-OP afr sends xattrop operation with increment -1 i.e. decrement by 1 for metadata operation to make the extended attributes back to all zeros again.
+
+In case there is a failure on brick-b then there is still a pending operation on brick-b, whereas there are no pending operations on brick-a. So the xattrop operation for these two extended attributes differs now. For the extended attribute corresponding to brick-a, i.e. trusted.afr.vol-client-0, a decrement by 1 is sent, whereas for the extended attribute corresponding to brick-b an increment by '0' is sent to retain the pending operation count.
+
+ # getfattr -d -e hex -m. brick-a/file.txt
+ trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
+ trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for brick-b as seen by brick-a
+
+ # getfattr -d -e hex -m. brick-b/file.txt
+ trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
+ trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for itself (brick-b)
+
+####Unlock:
+Once the transaction is completed unlock is sent on all the bricks where lock is acquired.
+
+
+###Entry transactions:
+
+create, mknod, mkdir, link, symlink, rename, unlink, rmdir
+Pre-op/Post-op (done using xattrop) always happens on the parent directory.
+
+Entry Locks taken by these entry operations:
+
+**Create** (file `dir/a`): Lock on name `a` in inode of `dir`
+
+**mknod** (file `dir/a`): Lock on name `a` in inode of `dir`
+
+**mkdir** (dir `dir/a`): Lock on name `a` in inode of `dir`
+
+**link** (file `oldfile`, file `dir/newfile`): Lock on name `newfile` in inode of `dir`
+
+**Symlink** (file `oldfile`, file `dir`/`symlinkfile`): Lock on name `symlinkfile` in inode of `dir`
+
+**rename** of (file `dir1`/`file1`, file `dir2`/`file2`): Lock on name `file1` in inode of `dir1`, Lock on name `file2` in inode of `dir2`
+
+**rename** of (dir `dir1`/`dir2`, dir `dir3`/`dir4`): Lock on name `dir2` in inode of `dir1`, Lock on name `dir4` in inode of `dir3`, Lock on `NULL` in inode of `dir4`
+
+**unlink** (file `dir`/`a`): Lock on name `a` in inode of `dir`
+
+**rmdir** (dir dir/a): Lock on name `a` in inode of `dir`, Lock on `NULL` in inode of `a`
+
+####Lock:
+Even entry locking starts initially with non-blocking locks then move on to blocking locks on any failures because of conflicting operations.
+
+####Pre-op:
+Before entry operation is performed on the brick, afr marks the directory saying there is a pending operation.
+
+As part of this pre-op afr sends xattrop operation with increment 1 for entry operation to make the extended attributes the following:
+
+ # getfattr -d -e hex -m. brick-a/
+ trusted.afr.vol-client-0=0x000000000000000000000001 -->changelog for itself (brick-a)
+ trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for brick-b as seen by brick-a
+
+Likewise, all files in brick-b will have:
+
+ # getfattr -d -e hex -m. brick-b/
+ trusted.afr.vol-client-0=0x000000000000000000000001 -->changelog for brick-a as seen by brick-b
+ trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for itself (brick-b)
+
+As the operation is in progress on files on both the bricks all the extended attributes show the same value.
+
+####Op:
+Now it sends the actual entry operation to both the bricks. Afr remembers whether the operation is successful or not on all the subvolumes.
+
+####Post-Op:
+If the operation succeeds on all the bricks then there is no pending operations on any of the bricks so as part of POST-OP afr sends xattrop operation with increment -1 i.e. decrement by 1 for entry operation to make the extended attributes back to all zeros again.
+
+In case there is a failure on brick-b then there is still a pending operation on brick-b, whereas there are no pending operations on brick-a. So the xattrop operation for these two extended attributes differs now. For the extended attribute corresponding to brick-a, i.e. trusted.afr.vol-client-0, a decrement by 1 is sent, whereas for the extended attribute corresponding to brick-b an increment by '0' is sent to retain the pending operation count.
+
+ # getfattr -d -e hex -m. brick-a/file.txt
+ trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
+ trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for brick-b as seen by brick-a
+
+ # getfattr -d -e hex -m. brick-b/file.txt
+ trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
+ trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for itself (brick-b)
+
+####Unlock:
+Once the transaction is completed unlock is sent on all the bricks where lock is acquired.
+
+The parts above cover how replication consistency is achieved in afr.
+
+Now let us look at how afr can figure out how to recover from failures given the changelog extended attributes
+
+###Recovering from failures (Self-heal)
+For recovering from failures afr tries to determine which copy is the fresh copy based on the extended attributes.
+
+There are 3 possibilities:
+1. All the extended attributes are zero on all the bricks. This means there are no pending operations on any of the bricks so there is nothing to recover.
+2. According to the extended attributes there is a brick(brick-a) which noticed that there are operations pending on the other brick(brick-b).
+ - There are 4 possibilities for brick-b
+
+ - It did not even participate in transaction (all extended attributes on brick-b are zeros). Choose brick-a as source and perform recovery to brick-b.
+
+ - It participated in the transaction but died even before post-op. (All extended attributes on brick-b have a pending-count). Choose brick-a as source and perform recovery to brick-b.
+
+ - It participated in the transaction and after the post-op extended attributes on brick-b show that there are pending operations on itself. Choose brick-a as source and perform recovery to brick-b.
+
+ - It participated in the transaction and after the post-op extended attributes on brick-b show that there are pending operations on brick-a. This situation is called Split-brain and there is no way to recover. This situation can happen in cases of network partition.
+
+3. The only possibility now is where both brick-a and brick-b have pending operations. In this case the changelog extended attributes are all non-zero on all the bricks. Basically what could have happened is that the operations started on the file but either the whole replica set went down or the mount process itself died before the post-op was performed. In this case there is a possibility that the data on the bricks is different. Here afr chooses the file with the bigger size as the source; if both files have the same size then it chooses the subvolume which has witnessed the larger number of pending operations on the other brick as the source. If both have the same number of pending operations then it chooses the file with the newest ctime as the source. If this is also the same then it just picks one of the two bricks as the source and syncs data on to the other to make sure that the files are replicas of each other.
+
+###Self-healing:
+Afr does 3 types of self-heals for data recovery.
+
+1. Data self-heal
+
+2. Metadata self-heal
+
+3. Entry self-heal
+
+As we have seen earlier, afr depends on changelog extended attributes to figure out which copy is source and which copy is sink. General algorithm for performing this recovery (self-heal) is same for all of these different self-heals.
+
+1. Take appropriate full locks on the file/directory to make sure no other transaction is in progress while inspecting changelog extended attributes.
+In this step, for
+ - Data self-heal afr takes inode lock with `offset: 0` and `size: 0`(infinity) in data domain.
+ - Entry self-heal takes entry lock on directory with `NULL` name i.e. full directory lock.
+ - Metadata self-heal it takes pre-defined range in metadata domain on which all the metadata operations on that inode take locks on. To prevent duplicate data self-heal an inode lock is taken in self-heal domain as well.
+
+2. Perform Sync from fresh copy to stale copy.
+In this step,
+ - Metadata self-heal gets the inode attributes, extended attributes from source copy and sets them on the stale copy.
+
+  - Entry self-heal reads the entries on the stale directories and checks whether they are present on the source directory; if they are not present it deletes them. Then it reads the entries on the fresh directory and creates the missing entries on the stale directories.
+
+  - Data self-heal does things a bit differently to make sure no other writes on the file are blocked for the duration of self-heal, because file sizes could be as big as 100G (VM files) and we don't want to block all transactions until the self-heal is over. The locks translator allows two overlapping locks to be granted if they are from the same lock owner. Using this, data self-heal takes a small 128k range lock, heals just that 128k chunk, then takes the lock on the next 128k chunk and unlocks the previous lock, and moves on to the next chunk. It always makes sure that at least one self-heal lock is present on the file throughout the duration of self-heal so that two self-heals don't happen in parallel.
+
+  - Data self-heal has two algorithms: one where a chunk is copied only when there is a data mismatch for that chunk, called 'diff' self-heal, and the other which is a blind copy of each chunk, called 'full' self-heal.
+
+3. Change extended attributes to mark new sources after the sync.
+
+4. Unlock the locks acquired to perform self-heal.
+
+### Transaction Optimizations:
+As we saw earlier afr transaction for all the operations that modify data happens in 5 phases, i.e. it sends 5 operations on the network for every operation. In the following sections we will see optimizations already implemented in afr which reduce the number of operations on the network to just 1 per transaction in best case.
+
+####Changelog-piggybacking
+This optimization comes into picture when on same file descriptor, before write1's post op is complete write2's pre-op starts and the operations are succeeding. When writes come in that manner we can piggyback on the pre-op of write1 for write2 and somehow tell write1 that write2 will do the post-op that was supposed to be done by write1. So write1's post-op does not happen over network, write2's pre-op does not happen over network. This optimization does not hold if there are any failures in write1's phases.
+
+####Delayed Post-op
+This optimization just delays the post-op of the write transaction (write1) by a pre-configured amount of time to increase the probability of the next write piggybacking on the pre-op done by write1.
+
+With the combination of these two optimizations for operations like full file copy which are write intensive operations, what will essentially happen is for the first write a pre-op will happen. Then for the last write on the file post-op happens. So for all the write transactions between first write and last write afr reduced network operations from 5 to 3.
+
+####Eager-locking:
+This optimization comes into picture when only one file descriptor is open on the file and performing writes just like in the previous optimization. What this optimization does is it takes a full file lock on the file irrespective of the offset, size of the write, so that lock acquired by write1 can be piggybacked by write2 and write2 takes the responsibility of unlocking it. both write1, write2 will have same lock owner and afr takes the responsibility of serializing overlapping writes so that replication consistency is maintained.
+
+With the combination of these optimizations for operations like full file copy which are write intensive operations, what will essentially happen is for the first write a pre-op, full-file lock will happen. Then for the last write on the file post-op, unlock happens. So for all the write transactions between first write and last write afr reduced network operations from 5 to 1.
+
+###Quorum in afr:
+To avoid split-brains, afr employs the following quorum policies.
+ - In replica set with odd number of bricks, replica set is said to be in quorum if more than half of the bricks are up.
+ - In replica set with even number of bricks, if more than half of the bricks are up then it is said to be in quorum but if number of bricks that are up is equal to number of bricks that are down then, it is said to be in quorum if the first brick is also up in the set of bricks that are up.
+
+When quorum is not met in the replica set then modify operations on the mount are not allowed by afr.
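+
+Client quorum behaviour is controlled through volume options; a minimal sketch (the `auto`
+value corresponds to the policy described above, while `fixed` uses an explicit count):
+
+    # gluster volume set <VOLNAME> cluster.quorum-type auto
+    # gluster volume set <VOLNAME> cluster.quorum-type fixed
+    # gluster volume set <VOLNAME> cluster.quorum-count 2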
+
+###Self-heal daemon and Index translator usage by afr:
+
+####Index xlator:
+On each brick index xlator is loaded. This xlator keeps track of what is happening in afr's pre-op and post-op. If there is an ongoing I/O or a pending self-heal, changelog xattrs would have non-zero values. Whenever xattrop/fxattrop fop (pre-op, post-ops are done using these fops) comes to index xlator a link (with gfid as name of the file on which the fop is performed) is added in <brick>/.glusterfs/indices/xattrop directory. If the value returned by the fop is zero the link is removed from the index otherwise it is kept until zero is returned in the subsequent xattrop/fxattrop fops.
+
+####Self-heal-daemon:
+The self-heal-daemon process keeps running on each machine of the trusted storage pool. This process has the afr xlators of all the volumes which are started. Its job is to crawl the indices on the bricks that are local to that machine. If any of the files represented by the gfid of the link name need healing, it automatically heals them. This operation is performed every 10 minutes for each replica set. It is also performed when a brick comes online.
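+
+The set of entries currently queued for heal can also be viewed from the CLI (see the
+heal-info document in this directory for details):
+
+    # gluster volume heal <VOLNAME> info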
diff --git a/done/Features/bitrot-docs.md b/done/Features/bitrot-docs.md
new file mode 100644
index 0000000..90edffc
--- /dev/null
+++ b/done/Features/bitrot-docs.md
@@ -0,0 +1,7 @@
+### Sources where you can find more resources on bitrot
+
+* Feature page: http://www.gluster.org/community/documentation/index.php/Features/BitRot
+
+* Design: http://goo.gl/Mjy4mD
+
+* CLI specification: http://goo.gl/2o12Fn
diff --git a/done/Features/brick-failure-detection.md b/done/Features/brick-failure-detection.md
new file mode 100644
index 0000000..24f2a18
--- /dev/null
+++ b/done/Features/brick-failure-detection.md
@@ -0,0 +1,67 @@
+# Brick Failure Detection
+
+This feature attempts to identify storage/file system failures and disable the failed brick without disrupting the remainder of the node's operation.
+
+## Description
+
+Detecting failures on the filesystem that a brick uses makes it possible to handle errors that are caused from outside of the Gluster environment.
+
+There have been hanging brick processes when the underlying storage of a brick became unavailable. A hanging brick process can still use the network and respond to clients, but actual I/O to the storage is impossible and can cause noticeable delays on the client side.
+
+The goal is to provide better detection of storage subsystem failures and to prevent bricks from hanging: hanging brick processes should be prevented when the storage hardware or the filesystem fails.
+
+A health-checker (thread) has been added to the posix xlator. This thread periodically checks the status of the filesystem (implies checking of functional storage-hardware).
+
+`glusterd` can detect that the brick process has exited, `gluster volume status` will show that the brick process is not running anymore. System administrators checking the logs should be able to triage the cause.
+
+## Usage and Configuration
+
+The health-checker is enabled by default and runs a check every 30 seconds. This interval can be changed per volume with:
+
+ # gluster volume set <VOLNAME> storage.health-check-interval <SECONDS>
+
+If `SECONDS` is set to 0, the health-checker will be disabled.
+
+## Failure Detection
+
+Error are logged to the standard syslog (mostly `/var/log/messages`):
+
+ Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 5 buf count 512
+ Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): I/O Error Detected. Shutting down filesystem
+ Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): Please umount the filesystem and rectify the problem(s)
+ Jun 24 11:31:49 vm130-32 kernel: VFS:Filesystem freeze failed
+ Jun 24 11:31:50 vm130-32 GlusterFS[1969]: [2013-06-24 10:31:50.500674] M [posix-helpers.c:1114:posix_health_check_thread_proc] 0-failing_xfs-posix: health-check failed, going down
+ Jun 24 11:32:09 vm130-32 kernel: XFS (dm-2): xfs_log_force: error 5 returned.
+ Jun 24 11:32:20 vm130-32 GlusterFS[1969]: [2013-06-24 10:32:20.508690] M [posix-helpers.c:1119:posix_health_check_thread_proc] 0-failing_xfs-posix: still alive! -> SIGTERM
+
+The messages labelled with `GlusterFS` in the above output are also written to the logs of the brick process.
+
+## Recovery after a failure
+
+When a brick process detects that the underlying storage is not responding anymore, the process will exit. There is no automated way to restart the brick process; the sysadmin will need to fix the problem with the storage first.
+
+After correcting the storage (hardware or filesystem) issue, the following command will start the brick process again:
+
+ # gluster volume start <VOLNAME> force
+
+## How To Test
+
+The health-checker thread that is part of each brick process will get started automatically when a volume has been started. Verifying its functionality can be done in different ways.
+
+On virtual hardware:
+
+* disconnect the disk from the VM that holds the brick
+
+On real hardware:
+
+* simulate a RAID-card failure by unplugging the card or cables
+
+On a system that uses LVM for the bricks:
+
+* use device-mapper to load an error-table for the disk, see [this description](http://review.gluster.org/5176).
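+
+A hedged sketch of this device-mapper approach (the device name is hypothetical; the dm
+"error" target replaces the real mapping so that all I/O to the brick fails):
+
+    # dmsetup table vg0-brick1 > /tmp/brick1.table        # save the original mapping
+    # dmsetup suspend vg0-brick1
+    # dmsetup load vg0-brick1 --table "0 $(blockdev --getsz /dev/vg0/brick1) error"
+    # dmsetup resume vg0-brick1
+
+Loading the saved table again (followed by a resume) restores the device afterwards.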
+
+On any system (writing to random offsets of the block device, more difficult to trigger):
+
+1. cause corruption on the filesystem that holds the brick
+2. read contents from the brick, hoping to hit the corrupted area
+3. the filesystem should abort after hitting a bad spot, and the health-checker should notice that shortly afterwards
diff --git a/done/Features/dht.md b/done/Features/dht.md
new file mode 100644
index 0000000..c35dd6d
--- /dev/null
+++ b/done/Features/dht.md
@@ -0,0 +1,223 @@
+# How GlusterFS Distribution Works
+
+The defining feature of any scale-out system is its ability to distribute work
+or data among many servers. Accordingly, people in the distributed-system
+community have developed many powerful techniques to perform such distribution,
+but those techniques often remain little known or understood even among other
+members of the file system and database communities that benefit. This
+confusion is represented even in the name of the GlusterFS component that
+performs distribution - DHT, which stands for Distributed Hash Table but is not
+actually a DHT as that term is most commonly used or defined. The way
+GlusterFS's DHT works is based on a few basic principles:
+
+ * All operations are driven by clients, which are all equal. There are no
+ special nodes with special knowledge of where files are or should be.
+
+ * Directories exist on all subvolumes (bricks or lower-level aggregations of
+ bricks); files exist on only one.
+
+ * Files are assigned to subvolumes based on *consistent hashing*, and even
+ more specifically a form of consistent hashing exemplified by Amazon's
+ [Dynamo][dynamo].
+
+The result of all this is that users are presented with a set of files that is
+the union of the files present on all subvolumes. The following sections
+describe how this "uniting" process actually works.
+
+## Layouts
+
+The conceptual basis of Dynamo-style consistent hashing is of numbers around a
+circle, like a clock. First, the circle is divided into segments and those
+segments are assigned to bricks. (For the sake of simplicity we'll use
+"bricks" hereafter even though they might actually be replicated/striped
+subvolumes.) Several factors guide this assignment.
+
+ * Assignments are done separately for each directory.
+
+ * Historically, segments have all been the same size. However, this can lead
+ to smaller bricks becoming full while plenty of space remains on larger
+   ones. If the *cluster.weighted-rebalance* option is set, segment sizes
+ will be proportional to brick sizes.
+
+ * Assignments need not include all bricks in the volume. If the
+ *cluster.subvols-per-directory* option is set, only that many bricks will
+ receive assignments for that directory.
+
+However these assignments are done, they collectively become what we call a
+*layout* for a directory. This layout is then stored using extended
+attributes, with each brick's copy of that extended attribute on that directory
+consisting of four 32-bit fields.
+
+ * A version, which might be DHT\_HASH\_TYPE\_DM to represent an assignment as
+ described above, or DHT\_HASH\_TYPE\_DM\_USER to represent an assignment made
+ manually by the user (or external script).
+
+ * A "commit hash" which will be described later.
+
+ * The first number in the assigned range (segment).
+
+ * The last number in the assigned range.
+
+For example, the extended attributes representing a weighted assignment between
+three bricks, one twice as big as the others, might look like this.
+
+ * Brick A (the large one): DHT\_HASH\_TYPE\_DM 1234 0 0x7fffffff
+
+ * Brick B: DHT\_HASH\_TYPE\_DM 1234 0x80000000 0xbfffffff
+
+ * Brick C: DHT\_HASH\_TYPE\_DM 1234 0xc0000000 0xffffffff
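+
+The layout actually stored on a brick can be inspected with getfattr; a sketch (the brick
+path is illustrative, and the hexadecimal value packs the four 32-bit fields described above):
+
+    # getfattr -d -e hex -m trusted.glusterfs.dht /bricks/brick-a/some-directory
+    trusted.glusterfs.dht=0x<version><commit-hash><range-start><range-end>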
+
+## Placing Files
+
+To place a file in a directory, we first need a layout for that directory - as
+described above. Next, we calculate a hash for the file. To minimize
+collisions either between files in the same directory with different names or
+between files in different directories with the same name, this hash is
+generated using both the (containing) directory's unique GFID and the file's
+name. This hash is then matched to one of the layout assignments, to yield
+what we call a *hashed location*. For example, consider the layout shown
+above. The hash 0xabad1dea is between 0x80000000 and 0xbfffffff, so the
+corresponding file's hashed location would be on Brick B. A second file with a
+hash of 0xfaceb00c would be assigned to Brick C by the same reasoning.
+
+## Looking Up Files
+
+Because layout assignments might change, especially as bricks are added or
+removed, finding a file involves more than calculating its hashed location and
+looking there. That is in fact the first step, and works most of the time -
+i.e. the file is found where we expected it to be - but there are a few more
+steps when that's not the case. Historically, the next step has been to look
+for the file **everywhere** - i.e. to broadcast our lookup request to all
+subvolumes. If the file isn't found that way, it doesn't exist. At this
+point, an open that requires the file's presence will fail, or a create/mkdir
+that requires its absence will be allowed to continue.
+
+Regardless of whether a file is found at its hashed location or elsewhere, we
+now know its *cached location*. As the name implies, this is stored within DHT
+to satisfy future lookups. If it's not the same as the hashed location, we
+also take an extra step. This step is the creation of a *linkfile*, which is a
+special stub left at the **hashed** location pointing to the **cached**
+location. Therefore, if a client naively looks for a file at its hashed
+location and finds a linkfile instead, it can use that linkfile to look up the
+file where it really is instead of needing to inquire everywhere.
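+
+On the hashed brick itself, a linkfile appears as a zero-byte file with the sticky bit set
+and a `trusted.glusterfs.dht.linkto` extended attribute naming the subvolume that actually
+holds the data; an illustrative (not exact) example:
+
+    # getfattr -n trusted.glusterfs.dht.linkto /bricks/brick-b/dir/somefile
+    trusted.glusterfs.dht.linkto="myvol-client-2"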
+
+## Rebalancing
+
+As bricks are added or removed, or files are renamed, many files can end up
+somewhere other than at their hashed locations. When this happens, the volumes
+need to be rebalanced. This process consists of two parts.
+
+ 1. Calculate new layouts, according to the current set of bricks (and possibly
+ their characteristics). We call this the "fix-layout" phase.
+
+ 2. Migrate any "misplaced" files to their correct (hashed) locations, and
+ clean up any linkfiles which are no longer necessary. We call this the
+ "migrate-data" phase.
+
+Usually, these two phases are done together. (In fact, the code for them is
+somewhat intermingled.) However, the migrate-data phase can involve a lot of
+I/O and be very disruptive, so users can do just the fix-layout phase and defer
+migrate-data until a more convenient time. This allows new files to be placed
+on new bricks, even though old files might still be in the "wrong" place.
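+
+The two phases map onto the rebalance CLI roughly as follows (a sketch):
+
+    # gluster volume rebalance <VOLNAME> fix-layout start   # fix-layout phase only
+    # gluster volume rebalance <VOLNAME> start              # fix-layout plus migrate-data
+    # gluster volume rebalance <VOLNAME> status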
+
+When calculating a new layout to replace an old one, DHT specifically tries to
+maximize overlap of the assigned ranges, thus minimizing data movement. This
+difference can be very large. For example, consider the case where our example
+layout from earlier is updated to add a new double-sided brick. Here's a very
+inefficient way to do that.
+
+ * Brick A (the large one): 0x00000000 to 0x55555555
+
+ * Brick B: 0x55555556 to 0x7fffffff
+
+ * Brick C: 0x80000000 to 0xaaaaaaaa
+
+ * Brick D (the new one): 0xaaaaaaab to 0xffffffff
+
+This would cause files in the following ranges to be migrated:
+
+ * 0x55555556 to 0x7fffffff (from A to B)
+
+ * 0x80000000 to 0xaaaaaaaa (from B to C)
+
+ * 0xaaaaaaab to 0xbfffffff (from B to D)
+
+ * 0xc0000000 to 0xffffffff (from C to D)
+
+As an historical note, this is exactly what we used to do, and in this case it
+would have meant moving 7/12 of all files in the volume. Now let's consider a
+new layout that's optimized to maximize overlap with the old one.
+
+ * Brick A: 0x00000000 to 0x55555555
+
+ * Brick D: 0x55555556 to 0xaaaaaaaa <- optimized insertion point
+
+ * Brick B: 0xaaaaaaab to 0xd5555554
+
+ * Brick C: 0xd5555555 to 0xffffffff
+
+In this case we only need to move 5/12 of all files. In a volume with millions
+or even billions of files, reducing data movement by 1/6 of all files is a
+pretty big improvement. In the future, DHT might use "virtual node IDs" or
+multiple hash rings to make rebalancing even more efficient.
+
+## Rename Optimizations
+
+With the file-lookup mechanisms we already have in place, it's not necessary to
+move a file from one brick to another when it's renamed - even across
+directories. It will still be found, albeit a little less efficiently. The
+first client to look for it after the rename will add a linkfile, which every
+other client will follow from then on. Also, every client that has found the
+file once will continue to find it based on its cached location, without any
+network traffic at all. Because the extra lookup cost is small, and the
+movement cost might be very large, DHT renames the file "in place" on its
+current brick instead (taking advantage of the fact that directories exist
+everywhere).
+
+This optimization is further extended to handle cases where renames are very
+common. For example, rsync and similar tools often use a "write new then
+rename" idiom in which a file "xxx" is actually written as ".xxx.1234" and then
+moved into place only after its contents have been fully written. To make this
+process more efficient, DHT uses a regular expression to separate the permanent
+part of a file's name (in this case "xxx") from what is likely to be a
+temporary part (the leading "." and trailing ".1234"). That way, after the
+file is renamed it will be in its correct hashed location - which it wouldn't
+be otherwise if "xxx" and ".xxx.1234" hash differently - and no linkfiles or
+broadcast lookups will be necessary.
+
+In fact, there are two regular expressions available for this purpose -
+*cluster.rsync-hash-regex* and *cluster.extra-hash-regex*. As its name
+implies, *rsync-hash-regex* defaults to the pattern that rsync uses, while
+*extra-hash-regex* can be set by the user to support a second tool using the
+same temporary-file idiom.
+
+## Commit Hashes
+
+A very recent addition to DHT's algorithmic arsenal is intended to reduce the
+number of "broadcast" lookups the it issues. If a volume is completely in
+balance, then no file could exist anywhere but at its hashed location.
+Therefore, if we've already looked there and not found it, then looking
+elsewhere would be pointless (and wasteful). The *commit hash* mechanism is
+used to detect this case. A commit hash is assigned to a volume, and
+separately to each directory, and then updated according to the following
+rules.
+
+ * The volume commit hash is changed whenever actions are taken that might
+ cause layout assignments across all directories to become invalid - i.e.
+ bricks being added, removed, or replaced.
+
+ * The directory commit hash is changed whenever actions are taken that might
+ cause files to be "misplaced" - e.g. when they're renamed.
+
+ * The directory commit hash is set to the volume commit hash when the
+ directory is created, and whenever the directory is fully rebalanced so that
+ all files are at their hashed locations.
+
+In other words, whenever either the volume or directory commit hash is changed
+that creates a mismatch. In that case we revert to the "pessimistic"
+broadcast-lookup method described earlier. However, if the two hashes match
+then we can skip the broadcast lookup and return a result immediately.
+This has been observed to cause a 3x performance improvement in workloads that
+involve creating many small files across many bricks.
+
+[dynamo]: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
diff --git a/done/Features/distributed-geo-rep.md b/done/Features/distributed-geo-rep.md
new file mode 100644
index 0000000..0a3183d
--- /dev/null
+++ b/done/Features/distributed-geo-rep.md
@@ -0,0 +1,71 @@
+Introduction
+============
+
+This document goes through the new design of distributed geo-replication, its features and the nature of the changes involved. First we list some of the important features.
+
+ - Distributed asynchronous replication
+ - Fast and versatile change detection
+ - Replica failover
+ - Hardlink synchronization
+ - Effective handling of deletes and renames
+ - Configurable sync engine (rsync, tar+ssh)
+ - Adaptive to a wide variety of workloads
+ - GFID synchronization
+
+Geo-replication makes use of the all new *journaling* infrastructure (a.k.a. changelog) to achieve great performance and feature improvements as mentioned above. To understand more about changelogging and the helper library (*libgfchangelog*) refer to document: doc/features/geo-replication/libgfchangelog.md
+
+Data Replication
+----------------
+
+Geo-replication is responsible for incrementally replicating data from the master node to the slave. But isn't that similar to what AFR does? Yes, but here the slave is located geographically distant from the master. Geo-replication follows the eventually consistent replication model, which implies that, at any point of time, the slave may be lagging behind the master, but will eventually catch up. Replication performance depends on two crucial factors:
+ - Network latency
+ - Change detection
+
+Network latency is something that is not in direct control for many reasons, but still there is always a best effort. Therefore, geo-replication offloads the data replication part to common UNIX file transfer utilities. We choose the grand-daddy of file transfers [rsync(1)] [1] as the default synchronization engine, as it's best known for its delta-transfer algorithm for efficient usage of the network and lightning fast transfers (leave alone the flexibility). But what about small-file performance? Due to its checksumming algorithm, rsync has more overhead for small files -- the overhead of checksumming outweighs the bytes to be transferred for small files. Therefore, geo-replication can also use a combination of tar piped over ssh to transfer a large number of small files. Tests have shown a great improvement over standard rsync. However, the sync engine is not yet dynamic to the file type and needs to be chosen manually by a configuration option.
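+
+A hedged sketch of switching the sync engine to tar+ssh through the geo-replication config
+interface (the option name `use_tarssh` is an assumption here and should be verified
+against the CLI of the installed release):
+```
+gluster volume geo-replication <MASTER_VOL> <SLAVE_HOST>::<SLAVE_VOL> config use_tarssh true
+```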
+
+OTOH, change detection is something that is in full control of the application. Earlier (< release 3.5), geo-replication would perform a file system crawl to identify changes in the file system. This was not an unintelligent *check-every-single-inode* crawl of the file system, but crawl logic based on *xtime*. xtime is an extended attribute maintained by the *marker* translator for each inode on the master and follows an upward-recursive marking pattern. Geo-replication would traverse a directory based on this simple condition:
+
+> xtime(master) > xtime(slave)
+
+E.g.:
+
+        MASTER                        SLAVE
+
+             /\                            /\
+           d0  dir0                      d0  dir0
+           / \                           / \
+         d1   dir1                     d1   dir1
+         /                             /
+       d2                            d2
+       /                             /
+     file0                         file0
+
+Consider the directory tree above. Assume that master and slave were in sync and the following operation happens on master:
+```
+touch /d0/d1/d2/file0
+```
+This would trigger an xtime marking (xtime being the current timestamp) from the leaf (*file0*) up to the root (*/*), i.e. an *xattr* update on *file0*, *d2*, *d1*, *d0* and finally */*. The geo-replication daemon would crawl the file system based on the condition mentioned before and hence would only crawl the **left** part of the directory tree (as the **right** part would have equal xtimes).
+
+Although the above crawling algorithm is fast, it still has to crawl a good part of the file system. Also, to decide whether to crawl a particular subdirectory, geo-rep needs to compare xtimes -- which is basically a **getxattr()** call on the master and the slave (remember, the *slave* is across a WAN).
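+
+As an illustration only (this is not the gsyncd implementation), the per-directory decision can be sketched in C as below. The attribute name, the {seconds, nanoseconds} layout and the assumption that both trees are locally reachable are hypothetical; the real xtime attribute embeds the volume UUID and the slave's copy is queried over the network:
+
+```c
+#include <stdio.h>
+#include <sys/xattr.h>
+
+/* Hypothetical attribute name; the real one embeds the volume UUID. */
+#define XTIME_ATTR "trusted.glusterfs.<volume-uuid>.xtime"
+
+/* Sketch of the crawl condition: descend only if xtime(master) > xtime(slave).
+ * The value is assumed to be a {seconds, nanoseconds} pair; the actual on-disk
+ * encoding (byte order, width) is not shown here. */
+static int
+needs_crawl (const char *master_path, const char *slave_path)
+{
+        unsigned int m[2] = {0, 0}, s[2] = {0, 0};
+
+        getxattr (master_path, XTIME_ATTR, m, sizeof (m));
+        getxattr (slave_path, XTIME_ATTR, s, sizeof (s));
+
+        return (m[0] > s[0]) || (m[0] == s[0] && m[1] > s[1]);
+}
+
+int
+main (int argc, char *argv[])
+{
+        if (argc != 3) {
+                fprintf (stderr, "usage: %s <master-dir> <slave-dir>\n", argv[0]);
+                return 1;
+        }
+
+        printf ("crawl %s? %s\n", argv[1],
+                needs_crawl (argv[1], argv[2]) ? "yes" : "no");
+        return 0;
+}
+```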
+
+Therefore, in 3.5 the need arose to take change detection to the next level. Geo-replication now uses the changelogging infrastructure to identify changes in the filesystem; there is no crawl involved at all. Changelog-based detection is notification based: the geo-replication daemon registers itself with the changelog consumer library (*libgfchangelog*) and invokes a set of APIs to get the list of changes in the filesystem, then replays them onto the slave. No crawl, and no extended attribute comparison, is involved.
+
+Distributed Geo-Replication
+---------------------------
+Geo-replication (also known as gsyncd or geo-rep) used to be non-distributed before release 3.5. The node on which the geo-rep start command was executed was responsible for replicating data to the slave. If this node went offline for some reason (reboot, crash, etc.), replication would cease. So one of the main development efforts for release 3.5 was to *distributify* geo-replication. A geo-rep daemon running on each node (one per brick) is responsible for replicating the data **local** to each brick. This results in full parallelism and effective use of cluster/network resources.
+
+With release 3.5, the geo-rep start command spawns a geo-replication daemon on each node in the master cluster (one per brick). The geo-rep *status* command shows the geo-rep session status from each master node. Similarly, *stop* gracefully tears down the session on all nodes.
+
+What else is synced?
+--------------------
+ - GFID: Synchronizing the inode number (GFID) between the master and the slave helps in synchronizing hardlinks.
+ - Purges: Deletes are also handled effectively, as there is no entry comparison between master and slave. With changelog replay, geo-rep performs the unlink operation without having to resort to an expensive **readdir()** over the WAN.
+ - Renames: With earlier geo-replication, because of the path-based nature of crawling, renames were actually a delete and a create on the slave, followed by a data transfer (not to mention the inode number change). Now, with changelogging, it's an actual **rename()** call on the slave.
+
+Replica Failover
+----------------
+One of the basic volume configurations is a replicated volume (synchronous replication). Having geo-replication sync data from all replicas would waste network bandwidth and could possibly corrupt data on the slave (though that's unlikely). Therefore, geo-rep on such volume configurations works in **ACTIVE** and **PASSIVE** modes. The geo-rep daemon on one of the replicas is responsible for replicating data (**ACTIVE**), while the other geo-rep daemon does essentially nothing (**PASSIVE**).
+
+In the event of the *ACTIVE* node going offline, the *PASSIVE* node identifies this (with a lag of at most 60 seconds) and switches to *ACTIVE*, thereby taking over the role of replicating data from where the earlier *ACTIVE* node left off. This guarantees uninterrupted data replication even across node reboots/failures.
+
+[1]:http://rsync.samba.org
diff --git a/done/Features/file-snapshot.md b/done/Features/file-snapshot.md
new file mode 100644
index 0000000..7f7c419
--- /dev/null
+++ b/done/Features/file-snapshot.md
@@ -0,0 +1,91 @@
+#File Snapshot
+This feature gives the ability to take snapshots of files.
+
+##Description
+This feature adds file snapshotting support to glusterfs. Snapshots can be created, deleted and reverted.
+
+To take a snapshot of a file, the file should be in QCOW2 format, as the code for the block-layer snapshot has been taken from QEMU and put into gluster as a translator.
+
+With this feature, glusterfs will have better integration with OpenStack Cinder and, in general, the ability to take snapshots of files (typically VM images).
+
+A new extended attribute (xattr) will be added to identify files which are 'snapshot managed' vs. raw files.
+
+##Volume Options
+The following volume option needs to be set on the volume for taking file snapshots.
+
+ # features.file-snapshot on
+##CLI parameters
+The following CLI parameters need to be passed with the setfattr command to create, delete and revert file snapshots.
+
+ # trusted.glusterfs.block-format
+ # trusted.glusterfs.block-snapshot-create
+ # trusted.glusterfs.block-snapshot-goto
+##Fully loaded Example
+Download the glusterfs 3.5 RPMs from download.gluster.org and install them.
+
+Start glusterd by using the command
+
+ # service glusterd start
+Now create a volume by using the command
+
+ # gluster volume create <vol_name> <brick_path>
+Run the command below to make sure that the volume is created.
+
+ # gluster volume info
+Now turn on the snapshot feature on the volume by using the command
+
+ # gluster volume set <vol_name> features.file-snapshot on
+Verify that the option is set by using the command
+
+ # gluster volume info
+You should be able to see another option in the volume info output
+
+ # features.file-snapshot: on
+Now mount the volume using a FUSE mount
+
+ # mount -t glusterfs <hostname>:<vol_name> <mount_point>
+cd into the mount point
+ # cd <mount_point>
+ # touch <file_name>
+The size of the file can be set and the format of the file changed to QCOW2 by running the command below. The file size can be in KB/MB/GB.
+
+ # setfattr -n trusted.glusterfs.block-format -v qcow2:<file_size> <file_name>
+Now create another file and send data to that file by running the command
+
+ # echo 'ABCDEFGHIJ' > <data_file1>
+Copy the data from one file to the other by running the command
+
+ # dd if=<data_file1> of=<file_name> conv=notrunc
+Now take the `snapshot of the file` by running the command
+
+ # setfattr -n trusted.glusterfs.block-snapshot-create -v <image1> <file_name>
+Add some more contents to the file and take another file snapshot by doing the following steps
+
+ # echo '1234567890' > <data_file2>
+ # dd if=<data_file2> of=<file_name> conv=notrunc
+ # setfattr -n trusted.glusterfs.block-snapshot-create -v <image2> <file_name>
+Now `revert` to each of the file snapshots in turn and copy the data out to separate files so that the contents can be compared.
+
+ # setfattr -n trusted.glusterfs.block-snapshot-goto -v <image1> <file_name>
+ # dd if=<file_name> of=<out-file1> bs=11 count=1
+ # setfattr -n trusted.glusterfs.block-snapshot-goto -v <image2> <file_name>
+ # dd if=<file_name> of=<out-file2> bs=11 count=1
+Now read the contents of the files and compare as below:
+
+ # cat <data_file1>, <out_file1> and compare contents.
+ # cat <data_file2>, <out_file2> and compare contents.
+##One-line description of the variables used
+file_name = File which will be created in the mount point initially.
+
+data_file1 = File which contains the data 'ABCDEFGHIJ'
+
+image1 = First file snapshot, which has 'ABCDEFGHIJ' + some null values.
+
+data_file2 = File which contains the data '1234567890'
+
+image2 = Second file snapshot, which has '1234567890' + some null values.
+
+out_file1 = After reverting to image1, this contains 'ABCDEFGHIJ'
+
+out_file2 = After reverting to image2, this contains '1234567890'
diff --git a/done/Features/gfid-access.md b/done/Features/gfid-access.md
new file mode 100644
index 0000000..2d324a1
--- /dev/null
+++ b/done/Features/gfid-access.md
@@ -0,0 +1,73 @@
+#Gfid-access Translator
+The 'gfid-access' translator provides access to data in glusterfs using a
+virtual path. This particular translator is designed to provide direct access to
+files in glusterfs using their gfid. 'GFID' is glusterfs's inode number for a file,
+identifying it uniquely. As of now, Geo-replication is the only consumer of this
+translator. The changelog translator logs the 'gfid' with the corresponding file
+operation in journals, which are consumed by Geo-Replication to replicate the
+files very efficiently using the gfid-access translator.
+
+###Implications and Usage
+A new virtual directory called '.gfid' is exposed in the aux-gfid mount
+point when a gluster volume is mounted with the 'aux-gfid-mount' option.
+All the gfids of files are exposed in one level under the '.gfid' directory.
+No matter at what level the file resides, it is accessed using its
+gfid under this virtual directory, as shown in the example below. All access
+protocols work seamlessly, as the complexities are handled internally.
+
+###Testing
+1. Mount glusterfs client with '-o aux-gfid-mount' as follows.
+
+ mount -t glusterfs -o aux-gfid-mount <node-ip>:<volname> <mountpoint>
+
+ Example:
+
+ #mount -t glusterfs -o aux-gfid-mount rhs1:master /master-aux-mnt
+
+2. Get the 'gfid' of a file using normal mount or aux-gfid-mount and do some
+ operations as follows.
+
+ getfattr -n glusterfs.gfid.string <file>
+
+ Example:
+
+ #getfattr -n glusterfs.gfid.string /master-aux-mnt/file
+ # file: file
+ glusterfs.gfid.string="796d3170-0910-4853-9ff3-3ee6b1132080"
+
+ #cat /master-aux-mnt/file
+ sample data
+
+ #stat /master-aux-mnt/file
+ File: `file'
+ Size: 12 Blocks: 1 IO Block: 131072 regular file
+ Device: 13h/19d Inode: 11525625031905452160 Links: 1
+ Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
+ Access: 2014-05-23 20:43:33.239999863 +0530
+ Modify: 2014-05-23 17:36:48.224999989 +0530
+ Change: 2014-05-23 20:44:10.081999938 +0530
+
+
+3. Access files using virtual path as follows.
+
+ /mountpoint/.gfid/<actual-canonical-gfid-of-the-file\>'
+
+ Example:
+
+ #cat /master-aux-mnt/.gfid/796d3170-0910-4853-9ff3-3ee6b1132080
+ sample data
+ #stat /master-aux-mnt/.gfid/796d3170-0910-4853-9ff3-3ee6b1132080
+ File: `.gfid/796d3170-0910-4853-9ff3-3ee6b1132080'
+ Size: 12 Blocks: 1 IO Block: 131072 regular file
+ Device: 13h/19d Inode: 11525625031905452160 Links: 1
+ Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
+ Access: 2014-05-23 20:43:33.239999863 +0530
+ Modify: 2014-05-23 17:36:48.224999989 +0530
+ Change: 2014-05-23 20:44:10.081999938 +0530
+
+ Notice that the 'cat' command on the file using its path and using the virtual
+ path displays the same data. Similarly, the 'stat' command on the file and on the
+ virtual gfid path gives the same inode number, confirming that it is the same file.
+
+###Nature of changes
+This feature is introduced with 'gfid-access' translator.
diff --git a/done/Features/glusterfs_nfs-ganesha_integration.md b/done/Features/glusterfs_nfs-ganesha_integration.md
new file mode 100644
index 0000000..b306715
--- /dev/null
+++ b/done/Features/glusterfs_nfs-ganesha_integration.md
@@ -0,0 +1,123 @@
+# GlusterFS and NFS-Ganesha integration
+
+NFS-Ganesha can support the NFS (v3, 4.0, 4.1, pNFS) and 9P (from the Plan 9 operating system) protocols concurrently. It provides a FUSE-compatible File System Abstraction Layer (FSAL) to allow file-system developers to plug in their own storage mechanism and access it from any NFS client.
+
+With NFS-GANESHA, the NFS client talks to the NFS-GANESHA server instead, which is in the user address space already. NFS-GANESHA can access the FUSE filesystems directly through its FSAL without copying any data to or from the kernel, thus potentially improving response times. Of course the network streams themselves (TCP/UDP) will still be handled by the Linux kernel when using NFS-GANESHA.
+
+GlusterFS has also been integrated with NFS-Ganesha in the recent past to export the volumes created via glusterfs, using “libgfapi”. libgfapi is a new userspace library developed to access data in glusterfs. It performs I/O on gluster volumes directly, without a FUSE mount. It is a filesystem-like API which runs/sits in the application process context (which is NFS-Ganesha here) and eliminates the use of FUSE and the kernel VFS layer from glusterfs volume access. Thus, by integrating NFS-Ganesha and libgfapi, speed and latency are improved compared to FUSE mount access.
+
+### 1.) Pre-requisites
+
+ - Before starting to setup NFS-Ganesha, a GlusterFS volume should be created.
+ - Disable kernel-nfs, gluster-nfs services on the system using the following commands
+ - service nfs stop
+ - gluster vol set <volname> nfs.disable ON (Note: this command has to be repeated for all the volumes in the trusted-pool)
+ - Usually the libgfapi.so* files are installed in “/usr/lib” or “/usr/local/lib”, based on whether you have installed glusterfs using RPMs or from sources. Verify that those libgfapi.so* files are linked in “/usr/lib64” and “/usr/local/lib64” as well. If not, create the links for those .so files in those directories.
+
+### 2.) Installing nfs-ganesha
+
+##### i) using rpm install
+
+ - nfs-ganesha RPMs are available in Fedora 19 and later. So to install nfs-ganesha, run
+ - *#yum install nfs-ganesha*
+ - On CentOS or EL, download the RPMs from the link below:
+ - http://download.gluster.org/pub/gluster/glusterfs/nfs-ganesha
+
+##### ii) using sources
+
+ - cd /root
+ - git clone git://github.com/nfs-ganesha/nfs-ganesha.git
+ - cd nfs-ganesha/
+ - git submodule update --init
+ - git checkout -b next origin/next (Note : origin/next is the current development branch)
+ - rm -rf ~/build; mkdir ~/build ; cd ~/build
+ - cmake -DUSE_FSAL_GLUSTER=ON -DCURSES_LIBRARY=/usr/lib64 -DCURSES_INCLUDE_PATH=/usr/include/ncurses -DCMAKE_BUILD_TYPE=Maintainer /root/nfs-ganesha/src/
+ - make; make install
+> Note: libcap-devel, libnfsidmap, dbus-devel, libacl-devel ncurses* packages
+> may need to be installed prior to running this command. For Fedora, libjemalloc,
+> libjemalloc-devel may also be required.
+
+### 3.) Run nfs-ganesha server
+
+ - To start nfs-ganesha manually, execute the following command:
+ - *#ganesha.nfsd -f <location_of_nfs-ganesha.conf_file> -L <location_of_log_file> -N <log_level> -d*
+
+```sh
+For example:
+#ganesha.nfsd -f nfs-ganesha.conf -L nfs-ganesha.log -N NIV_DEBUG -d
+where:
+nfs-ganesha.log is the log file for the ganesha.nfsd process.
+nfs-ganesha.conf is the configuration file
+NIV_DEBUG is the log level.
+```
+ - To check if nfs-ganesha has started, execute the following command:
+ - *#ps aux | grep ganesha*
+ - By default '/' will be exported
+
+### 4.) Exporting GlusterFS volume via nfs-ganesha
+
+#####step 1 :
+
+To export any GlusterFS volume or directory inside a volume, create an EXPORT block for each of those entries in a .conf file, for example export.conf. The following parameters are required to export any entry.
+- *#cat export.conf*
+
+```sh
+EXPORT{
+ Export_Id = 1 ; # Export ID unique to each export
+ Path = "volume_path"; # Path of the volume to be exported. Eg: "/test_volume"
+
+ FSAL {
+ name = GLUSTER;
+ hostname = "10.xx.xx.xx"; # IP of one of the nodes in the trusted pool
+ volume = "volume_name"; # Volume name. Eg: "test_volume"
+ }
+
+ Access_type = RW; # Access permissions
+ Squash = No_root_squash; # To enable/disable root squashing
+ Disable_ACL = TRUE; # To enable/disable ACL
+ Pseudo = "pseudo_path"; # NFSv4 pseudo path for this export. Eg: "/test_volume_pseudo"
+ Protocols = "3","4" ; # NFS protocols supported
+ Transports = "UDP","TCP" ; # Transport protocols supported
+ SecType = "sys"; # Security flavors supported
+}
+```
+
+#####step 2 :
+
+Define/copy the “nfs-ganesha.conf” file to a suitable location. This file is available in “/etc/glusterfs-ganesha” on installation of the nfs-ganesha RPMs; if building from sources, rename the “/root/nfs-ganesha/src/FSAL/FSAL_GLUSTER/README” file to “nfs-ganesha.conf”.
+
+#####step 3 :
+
+Now include the “export.conf” file in nfs-ganesha.conf. This can be done by adding the line below at the end of nfs-ganesha.conf.
+ - %include “export.conf”
+
+#####step 4 :
+
+ - run ganesha server as mentioned in section 3
+ - To check if the volume is exported, run
+ - *#showmount -e localhost*
+
+### 5.) Additional Notes
+
+To switch back to gluster-nfs/kernel-nfs, kill the ganesha daemon and start those services using the below commands :
+
+ - pkill ganesha
+ - service nfs start (for kernel-nfs)
+ - gluster v set <volname> nfs.disable off
+
+
+### 6.) References
+
+ - Setup and create glusterfs volumes :
+http://www.gluster.org/community/documentation/index.php/QuickStart
+
+ - NFS-Ganesha wiki : https://github.com/nfs-ganesha/nfs-ganesha/wiki
+
+ - Sample configuration files
+ - /root/nfs-ganesha/src/config_samples/gluster.conf
+ - https://github.com/nfs-ganesha/nfs-ganesha/blob/master/src/config_samples/gluster.conf
+
+ - https://forge.gluster.org/nfs-ganesha-and-glusterfs-integration/pages/Home
+
+ - http://blog.gluster.org/2014/09/glusterfs-and-nfs-ganesha-integration/
+
diff --git a/done/Features/heal-info-and-split-brain-resolution.md b/done/Features/heal-info-and-split-brain-resolution.md
new file mode 100644
index 0000000..6ca2be2
--- /dev/null
+++ b/done/Features/heal-info-and-split-brain-resolution.md
@@ -0,0 +1,448 @@
+The following document explains the usage of volume heal info and split-brain
+resolution commands.
+
+##`gluster volume heal <VOLNAME> info [split-brain]` commands
+###volume heal info
+Usage: `gluster volume heal <VOLNAME> info`
+
+This lists all the files that need healing (either their path or
+GFID is printed).
+###Interpreting the output
+All the files that are listed in the output of this command need healing to be
+done. Apart from this, there are 2 special cases that may be associated with
+an entry -
+a) Is in split-brain
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A file in data/metadata split-brain will
+be listed with " - Is in split-brain" appended after its path/gfid. Eg.,
+"/file4" in the output provided below. But for a gfid split-brain,
+ the parent directory of the file is shown to be in split-brain and the file
+itself is shown to be needing heal. Eg., "/dir" in the output provided below
+which is in split-brain because of gfid split-brain of file "/dir/a".
+b) Is possibly undergoing heal
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A file is listed as possibly undergoing
+heal when it may have been undergoing heal at the time the heal status
+was being determined, but this cannot be said for sure. This can happen when
+the self-heal daemon and the glfsheal process that is trying to get heal information
+are competing for the same lock. Another possible case
+could be multiple glfsheal processes running simultaneously (e.g., multiple users
+ran the heal info command at the same time), competing for the same lock.
+
+The following is an example of heal info command's output.
+###Example
+Consider a replica volume "test" with 2 bricks b1 and b2;
+self-heal daemon off, mounted at /mnt.
+
+`gluster volume heal test info`
+~~~
+Brick \<hostname:brickpath-b1>
+<gfid:aaca219f-0e25-4576-8689-3bfd93ca70c2> - Is in split-brain
+<gfid:39f301ae-4038-48c2-a889-7dac143e82dd> - Is in split-brain
+<gfid:c3c94de2-232d-4083-b534-5da17fc476ac> - Is in split-brain
+<gfid:6dc78b20-7eb6-49a3-8edb-087b90142246>
+
+Number of entries: 4
+
+Brick <hostname:brickpath-b2>
+/dir/file2
+/dir/file1 - Is in split-brain
+/dir - Is in split-brain
+/dir/file3
+/file4 - Is in split-brain
+/dir/a
+
+
+Number of entries: 6
+~~~
+
+###Analysis of the output
+It can be seen that
+A) from brick b1 4 entries need healing:
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1) file with gfid:6dc78b20-7eb6-49a3-8edb-087b90142246 needs healing
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2) "aaca219f-0e25-4576-8689-3bfd93ca70c2",
+"39f301ae-4038-48c2-a889-7dac143e82dd" and "c3c94de2-232d-4083-b534-5da17fc476ac"
+ are in split-brain
+
+B) from brick b2 6 entries need healing-
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1) "a", "file2" and "file3" need healing
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2) "file1", "file4" & "/dir" are in split-brain
+
+###volume heal info split-brain
+Usage: `gluster volume heal <VOLNAME> info split-brain`
+This command shows all the files that are in split-brain.
+##Example
+`gluster volume heal test info split-brain`
+~~~
+Brick <hostname:brickpath-b1>
+<gfid:aaca219f-0e25-4576-8689-3bfd93ca70c2>
+<gfid:39f301ae-4038-48c2-a889-7dac143e82dd>
+<gfid:c3c94de2-232d-4083-b534-5da17fc476ac>
+Number of entries in split-brain: 3
+
+Brick <hostname:brickpath-b2>
+/dir/file1
+/dir
+/file4
+Number of entries in split-brain: 3
+~~~
+Note that, similar to heal info command, for gfid split-brains (same filename but different gfid)
+their parent directories are listed to be in split-brain.
+
+##Resolution of split-brain using CLI
+Once the files in split-brain are identified, their resolution can be done
+from the command line. Note that entry/gfid split-brain resolution is not supported.
+Split-brain resolution commands let the user resolve split-brain in 3 ways.
+###Select the bigger-file as source
+This command is useful for per file healing where it is known/decided that the
+file with bigger size is to be considered as source.
+1.`gluster volume heal <VOLNAME> split-brain bigger-file <FILE>`
+`<FILE>` can be either the full file name as seen from the root of the volume
+(or) the gfid-string representation of the file, which sometimes gets displayed
+in the heal info command's output.
+Once this command is executed, the replica containing the FILE with bigger
+size is found out and heal is completed with it as source.
+
+###Example :
+Consider the above output of heal info split-brain command.
+
+Before healing the file, notice file size and md5 checksums :
+~~~
+On brick b1:
+# stat b1/dir/file1
+ File: ‘b1/dir/file1’
+ Size: 17 Blocks: 16 IO Block: 4096 regular file
+Device: fd03h/64771d Inode: 919362 Links: 2
+Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
+Access: 2015-03-06 13:55:40.149897333 +0530
+Modify: 2015-03-06 13:55:37.206880347 +0530
+Change: 2015-03-06 13:55:37.206880347 +0530
+ Birth: -
+
+# md5sum b1/dir/file1
+040751929ceabf77c3c0b3b662f341a8 b1/dir/file1
+
+On brick b2:
+# stat b2/dir/file1
+ File: ‘b2/dir/file1’
+ Size: 13 Blocks: 16 IO Block: 4096 regular file
+Device: fd03h/64771d Inode: 919365 Links: 2
+Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
+Access: 2015-03-06 13:54:22.974451898 +0530
+Modify: 2015-03-06 13:52:22.910758923 +0530
+Change: 2015-03-06 13:52:22.910758923 +0530
+ Birth: -
+# md5sum b2/dir/file1
+cb11635a45d45668a403145059c2a0d5 b2/dir/file1
+~~~
+Healing file1 using the above command -
+`gluster volume heal test split-brain bigger-file /dir/file1`
+Healed /dir/file1.
+
+After healing is complete, the md5sum and file size on both bricks should be the same.
+~~~
+On brick b1:
+# stat b1/dir/file1
+ File: ‘b1/dir/file1’
+ Size: 17 Blocks: 16 IO Block: 4096 regular file
+Device: fd03h/64771d Inode: 919362 Links: 2
+Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
+Access: 2015-03-06 14:17:27.752429505 +0530
+Modify: 2015-03-06 13:55:37.206880347 +0530
+Change: 2015-03-06 14:17:12.880343950 +0530
+ Birth: -
+# md5sum b1/dir/file1
+040751929ceabf77c3c0b3b662f341a8 b1/dir/file1
+
+On brick b2:
+# stat b2/dir/file1
+ File: ‘b2/dir/file1’
+ Size: 17 Blocks: 16 IO Block: 4096 regular file
+Device: fd03h/64771d Inode: 919365 Links: 2
+Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
+Access: 2015-03-06 14:17:23.249403600 +0530
+Modify: 2015-03-06 13:55:37.206880000 +0530
+Change: 2015-03-06 14:17:12.881343955 +0530
+ Birth: -
+
+# md5sum b2/dir/file1
+040751929ceabf77c3c0b3b662f341a8 b2/dir/file1
+~~~
+###Select one replica as source for a particular file
+2.`gluster volume heal <VOLNAME> split-brain source-brick <HOSTNAME:BRICKNAME> <FILE>`
+`<HOSTNAME:BRICKNAME>` is selected as source brick,
+FILE present in the source brick is taken as source for healing.
+
+###Example :
+Notice the md5 checksums and file size before and after heal.
+
+Before heal :
+~~~
+On brick b1:
+
+ stat b1/file4
+ File: ‘b1/file4’
+ Size: 4 Blocks: 16 IO Block: 4096 regular file
+Device: fd03h/64771d Inode: 919356 Links: 2
+Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
+Access: 2015-03-06 13:53:19.417085062 +0530
+Modify: 2015-03-06 13:53:19.426085114 +0530
+Change: 2015-03-06 13:53:19.426085114 +0530
+ Birth: -
+# md5sum b1/file4
+b6273b589df2dfdbd8fe35b1011e3183 b1/file4
+
+On brick b2:
+
+# stat b2/file4
+ File: ‘b2/file4’
+ Size: 4 Blocks: 16 IO Block: 4096 regular file
+Device: fd03h/64771d Inode: 919358 Links: 2
+Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
+Access: 2015-03-06 13:52:35.761833096 +0530
+Modify: 2015-03-06 13:52:35.769833142 +0530
+Change: 2015-03-06 13:52:35.769833142 +0530
+ Birth: -
+# md5sum b2/file4
+0bee89b07a248e27c83fc3d5951213c1 b2/file4
+~~~
+`gluster volume heal test split-brain source-brick test-host:/test/b1 gfid:c3c94de2-232d-4083-b534-5da17fc476ac`
+Healed gfid:c3c94de2-232d-4083-b534-5da17fc476ac.
+
+After healing :
+~~~
+On brick b1:
+# stat b1/file4
+ File: ‘b1/file4’
+ Size: 4 Blocks: 16 IO Block: 4096 regular file
+Device: fd03h/64771d Inode: 919356 Links: 2
+Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
+Access: 2015-03-06 14:23:38.944609863 +0530
+Modify: 2015-03-06 13:53:19.426085114 +0530
+Change: 2015-03-06 14:27:15.058927962 +0530
+ Birth: -
+# md5sum b1/file4
+b6273b589df2dfdbd8fe35b1011e3183 b1/file4
+
+On brick b2:
+# stat b2/file4
+ File: ‘b2/file4’
+ Size: 4 Blocks: 16 IO Block: 4096 regular file
+Device: fd03h/64771d Inode: 919358 Links: 2
+Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
+Access: 2015-03-06 14:23:38.944609000 +0530
+Modify: 2015-03-06 13:53:19.426085000 +0530
+Change: 2015-03-06 14:27:15.059927968 +0530
+ Birth: -
+# md5sum b2/file4
+b6273b589df2dfdbd8fe35b1011e3183 b2/file4
+~~~
+Note that, as mentioned earlier, entry split-brain and gfid split-brain healing
+ are not supported using CLI. However, they can be fixed using the method described
+ [here](https://github.com/gluster/glusterfs/blob/master/doc/debugging/split-brain.md).
+###Example:
+Trying to heal /dir would fail as it is in entry split-brain.
+`gluster volume heal test split-brain source-brick test-host:/test/b1 /dir`
+Healing /dir failed:Operation not permitted.
+Volume heal failed.
+
+3.`gluster volume heal <VOLNAME> split-brain source-brick <HOSTNAME:BRICKNAME>`
+Consider a scenario where many files are in split-brain such that one brick of
+the replica pair is the source. As a result of the above command, all split-brained
+files in `<HOSTNAME:BRICKNAME>` are selected as the source and healed to the sink.
+
+###Example:
+Consider a volume having three entries "a, b and c" in split-brain.
+~~~
+`gluster volume heal test split-brain source-brick test-host:/test/b1`
+Healed gfid:944b4764-c253-4f02-b35f-0d0ae2f86c0f.
+Healed gfid:3256d814-961c-4e6e-8df2-3a3143269ced.
+Healed gfid:b23dd8de-af03-4006-a803-96d8bc0df004.
+Number of healed entries: 3
+~~~
+
+## An overview of working of heal info commands
+When these commands are invoked, a "glfsheal" process is spawned which reads
+the entries from the `/<brick-path>/.glusterfs/indices/xattrop/` directory of all
+the bricks that are up (that it can connect to), one after another. These
+entries are the GFIDs of files that might need healing. Once the GFID entries from a
+brick are obtained, then, based on the lookup response for the file on each
+participating brick of the replica pair and its trusted.afr.* extended attributes, it is
+determined whether the file needs healing, is in split-brain, etc. (depending on the
+requirement of each command), and the result is displayed to the user.
+
+
+##Resolution of split-brain from the mount point
+A set of getfattr and setfattr commands have been provided to detect the data and metadata split-brain status of a file and resolve split-brain, if any, from the mount point.
+
+Consider a volume "test", having bricks b0, b1, b2 and b3.
+
+~~~
+# gluster volume info test
+
+Volume Name: test
+Type: Distributed-Replicate
+Volume ID: 00161935-de9e-4b80-a643-b36693183b61
+Status: Started
+Number of Bricks: 2 x 2 = 4
+Transport-type: tcp
+Bricks:
+Brick1: test-host:/test/b0
+Brick2: test-host:/test/b1
+Brick3: test-host:/test/b2
+Brick4: test-host:/test/b3
+~~~
+
+Directory structure of the bricks is as follows:
+
+~~~
+# tree -R /test/b?
+/test/b0
+├── dir
+│   └── a
+└── file100
+
+/test/b1
+├── dir
+│   └── a
+└── file100
+
+/test/b2
+├── dir
+├── file1
+├── file2
+└── file99
+
+/test/b3
+├── dir
+├── file1
+├── file2
+└── file99
+~~~
+
+Some files in the volume are in split-brain.
+~~~
+# gluster v heal test info split-brain
+Brick test-host:/test/b0/
+/file100
+/dir
+Number of entries in split-brain: 2
+
+Brick test-host:/test/b1/
+/file100
+/dir
+Number of entries in split-brain: 2
+
+Brick test-host:/test/b2/
+/file99
+<gfid:5399a8d1-aee9-4653-bb7f-606df02b3696>
+Number of entries in split-brain: 2
+
+Brick test-host:/test/b3/
+<gfid:05c4b283-af58-48ed-999e-4d706c7b97d5>
+<gfid:5399a8d1-aee9-4653-bb7f-606df02b3696>
+Number of entries in split-brain: 2
+~~~
+###To know data/metadata split-brain status of a file:
+~~~
+getfattr -n replica.split-brain-status <path-to-file>
+~~~
+The above command, executed from the mount, tells whether a file is in data/metadata split-brain. It also provides the list of afr children to analyze to get more information about the file.
+This command is not applicable to gfid/directory split-brain.
+
+###Example:
+1) "file100" is in metadata split-brain. Executing the above mentioned command for file100 gives :
+~~~
+# getfattr -n replica.split-brain-status file100
+# file: file100
+replica.split-brain-status="data-split-brain:no metadata-split-brain:yes Choices:test-client-0,test-client-1"
+~~~
+
+2) "file1" is in data split-brain.
+~~~
+# getfattr -n replica.split-brain-status file1
+# file: file1
+replica.split-brain-status="data-split-brain:yes metadata-split-brain:no Choices:test-client-2,test-client-3"
+~~~
+
+3) "file99" is in both data and metadata split-brain.
+~~~
+# getfattr -n replica.split-brain-status file99
+# file: file99
+replica.split-brain-status="data-split-brain:yes metadata-split-brain:yes Choices:test-client-2,test-client-3"
+~~~
+
+4) "dir" is in directory split-brain but as mentioned earlier, the above command is not applicable to such split-brain. So it says that the file is not under data or metadata split-brain.
+~~~
+# getfattr -n replica.split-brain-status dir
+# file: dir
+replica.split-brain-status="The file is not under data or metadata split-brain"
+~~~
+
+5) "file2" is not in any kind of split-brain.
+~~~
+# getfattr -n replica.split-brain-status file2
+# file: file2
+replica.split-brain-status="The file is not under data or metadata split-brain"
+~~~
+
+### To analyze the files in data and metadata split-brain
+Trying to do operations (say cat, getfattr, etc.) from the mount on files in split-brain gives an input/output error. To enable users to analyze such files, a setfattr command is provided.
+
+~~~
+# setfattr -n replica.split-brain-choice -v "choiceX" <path-to-file>
+~~~
+Using this command, a particular brick can be chosen to access the file in split-brain from.
+
+###Example:
+1) "file1" is in data-split-brain. Trying to read from the file gives input/output error.
+~~~
+# cat file1
+cat: file1: Input/output error
+~~~
+Split-brain choices provided for file1 were test-client-2 and test-client-3.
+
+Setting test-client-2 as split-brain choice for file1 serves reads from b2 for the file.
+~~~
+# setfattr -n replica.split-brain-choice -v test-client-2 file1
+~~~
+Now, read operations on the file can be done.
+~~~
+# cat file1
+xyz
+~~~
+Similarly, to inspect the file from other choice, replica.split-brain-choice is to be set to test-client-3.
+
+Trying to inspect the file from a wrong choice errors out.
+
+To undo the split-brain-choice that has been set, the above mentioned setfattr command can be used
+with "none" as the value for extended attribute.
+
+###Example:
+~~~
+1) setfattr -n replica.split-brain-choice -v none file1
+~~~
+Now performing cat operation on the file will again result in input/output error, as before.
+~~~
+# cat file1
+cat: file1: Input/output error
+~~~
+
+Once the choice for resolving split-brain is made, the source brick has to be set for the healing to be done.
+This is done using the following command:
+
+~~~
+# setfattr -n replica.split-brain-heal-finalize -v <heal-choice> <path-to-file>
+~~~
+
+##Example
+~~~
+# setfattr -n replica.split-brain-heal-finalize -v test-client-2 file1
+~~~
+The above process can be used to resolve data and/or metadata split-brain on all the files.
+
+NOTE:
+1) If "fopen-keep-cache" fuse mount option is disabled then inode needs to be invalidated each time before selecting a new replica.split-brain-choice to inspect a file. This can be done by using:
+~~~
+# setfattr -n inode-invalidate -v 0 <path-to-file>
+~~~
+
+2) The above-mentioned process for split-brain resolution from the mount will not work on NFS mounts, as they do not provide xattr support.
diff --git a/done/Features/leases.md b/done/Features/leases.md
new file mode 100644
index 0000000..08f2056
--- /dev/null
+++ b/done/Features/leases.md
@@ -0,0 +1,11 @@
+##Leases
+
+###API:
+
+###Lease Semantics in Gluster
+
+###High level implementation details
+
+###Known issues
+
+###TODO
diff --git a/done/Features/libgfapi.md b/done/Features/libgfapi.md
new file mode 100644
index 0000000..34adf60
--- /dev/null
+++ b/done/Features/libgfapi.md
@@ -0,0 +1,382 @@
+One of the known methods to access glusterfs is via the FUSE module. However, it has some overhead and performance issues because of the number of context switches needed to complete one I/O transaction [1].
+
+
+To overcome this limitation, a new method called ‘libgfapi’ was introduced. libgfapi support is available from the GlusterFS-3.4 release.
+
+libgfapi is a userspace library for accessing data in glusterfs. The libgfapi library performs I/O on gluster volumes directly, without a FUSE mount. It is a filesystem-like API and runs/sits in the application process context. libgfapi eliminates FUSE and the kernel VFS layer from glusterfs volume access. Speed and latency improve with libgfapi access. [1]
+
+
+Using libgfapi, various user-space filesystems (like NFS-Ganesha or Samba) or a virtualizer (like QEMU) can interact with GlusterFS, which serves as the back-end filesystem. Currently, the projects below integrate with glusterfs using the libgfapi interfaces.
+
+
+* qemu storage layer
+* Samba VFS plugin
+* NFS-Ganesha
+
+All the APIs in libgfapi make use of a `struct glfs` object. This object
+contains information about the volume name, the associated glusterfs context,
+the subvols in the graph, etc., which makes it unique for each volume.
+
+
+For any application to make use of libgfapi, it should typically start
+with the APIs below, in the following order (a short example follows this list) -
+
+* To create a new glfs object :
+
+ glfs_t *glfs_new (const char *volname) ;
+
+ glfs_new() returns glfs_t object.
+
+
+* On this newly created glfs_t, you need to either set a volfile path
+ (glfs_set_volfile) or a volfile server (glfs_set_volfile_server).
+ In case of failures, the corresponding cleanup routine is
+ "glfs_unset_volfile_server".
+
+ int glfs_set_volfile (glfs_t *fs, const char *volfile);
+
+ int glfs_set_volfile_server (glfs_t *fs, const char *transport,const char *host, int port) ;
+
+ int glfs_unset_volfile_server (glfs_t *fs, const char *transport,const char *host, int port) ;
+
+* Specify logging parameters using glfs_set_logging():
+
+ int glfs_set_logging (glfs_t *fs, const char *logfile, int loglevel) ;
+
+* Initialize the glfs_t object using glfs_init():
+ int glfs_init (glfs_t *fs) ;
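+
+Putting the calls above together, here is a minimal, illustrative libgfapi program. The volume name `testvol`, the host `server1`, the log file path and the header include path are placeholders/assumptions for this sketch; glusterd's default port 24007 is used:
+
+```c
+/* Minimal libgfapi sketch: create a file on a volume and write to it.
+ * Build (typically): gcc gfapi-demo.c -o gfapi-demo -lgfapi
+ * "server1" and "testvol" are placeholders for your environment. */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <fcntl.h>
+#include <glusterfs/api/glfs.h>   /* header location may vary by install */
+
+int
+main (void)
+{
+        glfs_t    *fs = NULL;
+        glfs_fd_t *fd = NULL;
+        const char msg[] = "hello from libgfapi\n";
+        int        ret;
+
+        fs = glfs_new ("testvol");
+        if (!fs)
+                return EXIT_FAILURE;
+
+        /* Point the object at a volfile server instead of a local volfile. */
+        glfs_set_volfile_server (fs, "tcp", "server1", 24007);
+
+        /* Log level: higher values are more verbose. */
+        glfs_set_logging (fs, "/tmp/gfapi-demo.log", 7);
+
+        ret = glfs_init (fs);
+        if (ret) {
+                fprintf (stderr, "glfs_init failed\n");
+                glfs_fini (fs);
+                return EXIT_FAILURE;
+        }
+
+        fd = glfs_creat (fs, "/hello.txt", O_WRONLY, 0644);
+        if (fd) {
+                glfs_write (fd, msg, strlen (msg), 0);
+                glfs_close (fd);
+        }
+
+        glfs_fini (fs);
+        return EXIT_SUCCESS;
+}
+```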
+
+#### FOPs APIs available with libgfapi :
+
+
+
+ int glfs_get_volumeid (struct glfs *fs, char *volid, size_t size);
+
+ int glfs_setfsuid (uid_t fsuid) ;
+
+ int glfs_setfsgid (gid_t fsgid) ;
+
+ int glfs_setfsgroups (size_t size, const gid_t *list) ;
+
+ glfs_fd_t *glfs_open (glfs_t *fs, const char *path, int flags) ;
+
+ glfs_fd_t *glfs_creat (glfs_t *fs, const char *path, int flags,mode_t mode) ;
+
+ int glfs_close (glfs_fd_t *fd) ;
+
+ glfs_t *glfs_from_glfd (glfs_fd_t *fd) ;
+
+ int glfs_set_xlator_option (glfs_t *fs, const char *xlator, const char *key,const char *value) ;
+
+ typedef void (*glfs_io_cbk) (glfs_fd_t *fd, ssize_t ret, void *data);
+
+ ssize_t glfs_read (glfs_fd_t *fd, void *buf,size_t count, int flags) ;
+
+ ssize_t glfs_write (glfs_fd_t *fd, const void *buf,size_t count, int flags) ;
+
+ int glfs_read_async (glfs_fd_t *fd, void *buf, size_t count, int flags, glfs_io_cbk fn, void *data) ;
+
+ int glfs_write_async (glfs_fd_t *fd, const void *buf, size_t count, int flags, glfs_io_cbk fn, void *data) ;
+
+ ssize_t glfs_readv (glfs_fd_t *fd, const struct iovec *iov, int iovcnt,int flags) ;
+
+ ssize_t glfs_writev (glfs_fd_t *fd, const struct iovec *iov, int iovcnt,int flags) ;
+
+ int glfs_readv_async (glfs_fd_t *fd, const struct iovec *iov, int count, int flags, glfs_io_cbk fn, void *data) ;
+
+ int glfs_writev_async (glfs_fd_t *fd, const struct iovec *iov, int count, int flags, glfs_io_cbk fn, void *data) ;
+
+ ssize_t glfs_pread (glfs_fd_t *fd, void *buf, size_t count, off_t offset,int flags) ;
+
+ ssize_t glfs_pwrite (glfs_fd_t *fd, const void *buf, size_t count, off_t offset, int flags) ;
+
+ int glfs_pread_async (glfs_fd_t *fd, void *buf, size_t count, off_t offset,int flags, glfs_io_cbk fn, void *data) ;
+
+ int glfs_pwrite_async (glfs_fd_t *fd, const void *buf, int count, off_t offset,int flags, glfs_io_cbk fn, void *data) ;
+
+ ssize_t glfs_preadv (glfs_fd_t *fd, const struct iovec *iov, int iovcnt, int count, off_t offset, int flags,glfs_io_cbk fn, void *data) ;
+
+ ssize_t glfs_pwritev (glfs_fd_t *fd, const struct iovec *iov, int iovcnt,int count, off_t offset, int flags, glfs_io_cbk fn, void *data) ;
+
+ int glfs_preadv_async (glfs_fd_t *fd, const struct iovec *iov, glfs_io_cbk fn, void *data) ;
+
+ int glfs_pwritev_async (glfs_fd_t *fd, const struct iovec *iov, glfs_io_cbk fn, void *data) ;
+
+ off_t glfs_lseek (glfs_fd_t *fd, off_t offset, int whence) ;
+
+ int glfs_truncate (glfs_t *fs, const char *path, off_t length) ;
+
+ int glfs_ftruncate (glfs_fd_t *fd, off_t length) ;
+
+ int glfs_ftruncate_async (glfs_fd_t *fd, off_t length, glfs_io_cbk fn,void *data) ;
+
+ int glfs_lstat (glfs_t *fs, const char *path, struct stat *buf) ;
+
+ int glfs_stat (glfs_t *fs, const char *path, struct stat *buf) ;
+
+ int glfs_fstat (glfs_fd_t *fd, struct stat *buf) ;
+
+ int glfs_fsync (glfs_fd_t *fd) ;
+
+ int glfs_fsync_async (glfs_fd_t *fd, glfs_io_cbk fn, void *data) ;
+
+ int glfs_fdatasync (glfs_fd_t *fd) ;
+
+ int glfs_fdatasync_async (glfs_fd_t *fd, glfs_io_cbk fn, void *data) ;
+
+ int glfs_access (glfs_t *fs, const char *path, int mode) ;
+
+ int glfs_symlink (glfs_t *fs, const char *oldpath, const char *newpath) ;
+
+ int glfs_readlink (glfs_t *fs, const char *path,char *buf, size_t bufsiz) ;
+
+ int glfs_mknod (glfs_t *fs, const char *path, mode_t mode, dev_t dev) ;
+
+ int glfs_mkdir (glfs_t *fs, const char *path, mode_t mode) ;
+
+ int glfs_unlink (glfs_t *fs, const char *path) ;
+
+ int glfs_rmdir (glfs_t *fs, const char *path) ;
+
+ int glfs_rename (glfs_t *fs, const char *oldpath, const char *newpath) ;
+
+ int glfs_link (glfs_t *fs, const char *oldpath, const char *newpath) ;
+
+ glfs_fd_t *glfs_opendir (glfs_t *fs, const char *path) ;
+
+ int glfs_readdir_r (glfs_fd_t *fd, struct dirent *dirent,struct dirent **result) ;
+
+ int glfs_readdirplus_r (glfs_fd_t *fd, struct stat *stat, struct dirent *dirent, struct dirent **result) ;
+
+ struct dirent *glfs_readdir (glfs_fd_t *fd) ;
+
+ struct dirent *glfs_readdirplus (glfs_fd_t *fd, struct stat *stat) ;
+
+ long glfs_telldir (glfs_fd_t *fd) ;
+
+ void glfs_seekdir (glfs_fd_t *fd, long offset) ;
+
+ int glfs_closedir (glfs_fd_t *fd) ;
+
+ int glfs_statvfs (glfs_t *fs, const char *path, struct statvfs *buf) ;
+
+ int glfs_chmod (glfs_t *fs, const char *path, mode_t mode) ;
+
+ int glfs_fchmod (glfs_fd_t *fd, mode_t mode) ;
+
+ int glfs_chown (glfs_t *fs, const char *path, uid_t uid, gid_t gid) ;
+
+ int glfs_lchown (glfs_t *fs, const char *path, uid_t uid, gid_t gid) ;
+
+ int glfs_fchown (glfs_fd_t *fd, uid_t uid, gid_t gid) ;
+
+ int glfs_utimens (glfs_t *fs, const char *path,struct timespec times[2]) ;
+
+ int glfs_lutimens (glfs_t *fs, const char *path,struct timespec times[2]) ;
+
+ int glfs_futimens (glfs_fd_t *fd, struct timespec times[2]) ;
+
+ ssize_t glfs_getxattr (glfs_t *fs, const char *path, const char *name,void *value, size_t size) ;
+
+ ssize_t glfs_lgetxattr (glfs_t *fs, const char *path, const char *name,void *value, size_t size) ;
+
+ ssize_t glfs_fgetxattr (glfs_fd_t *fd, const char *name,void *value, size_t size) ;
+
+ ssize_t glfs_listxattr (glfs_t *fs, const char *path,void *value, size_t size) ;
+
+ ssize_t glfs_llistxattr (glfs_t *fs, const char *path, void *value,size_t size) ;
+
+ ssize_t glfs_flistxattr (glfs_fd_t *fd, void *value, size_t size) ;
+
+ int glfs_setxattr (glfs_t *fs, const char *path, const char *name,const void *value, size_t size, int flags) ;
+
+ int glfs_lsetxattr (glfs_t *fs, const char *path, const char *name,const void *value, size_t size, int flags) ;
+
+ int glfs_fsetxattr (glfs_fd_t *fd, const char *name,const void *value, size_t size, int flags) ;
+
+ int glfs_removexattr (glfs_t *fs, const char *path, const char *name) ;
+
+ int glfs_lremovexattr (glfs_t *fs, const char *path, const char *name) ;
+
+ int glfs_fremovexattr (glfs_fd_t *fd, const char *name) ;
+
+ int glfs_fallocate(glfs_fd_t *fd, int keep_size, off_t offset, size_t len) ;
+
+ int glfs_discard(glfs_fd_t *fd, off_t offset, size_t len) ;
+
+ int glfs_discard_async (glfs_fd_t *fd, off_t length, size_t lent, glfs_io_cbk fn, void *data) ;
+
+ int glfs_zerofill(glfs_fd_t *fd, off_t offset, off_t len) ;
+
+ int glfs_zerofill_async (glfs_fd_t *fd, off_t length, off_t len, glfs_io_cbk fn, void *data) ;
+
+ char *glfs_getcwd (glfs_t *fs, char *buf, size_t size) ;
+
+ int glfs_chdir (glfs_t *fs, const char *path) ;
+
+ int glfs_fchdir (glfs_fd_t *fd) ;
+
+ char *glfs_realpath (glfs_t *fs, const char *path, char *resolved_path) ;
+
+ int glfs_posix_lock (glfs_fd_t *fd, int cmd, struct flock *flock) ;
+
+ glfs_fd_t *glfs_dup (glfs_fd_t *fd) ;
+
+
+ struct glfs_object *glfs_h_lookupat (struct glfs *fs,struct glfs_object *parent,
+ const char *path,
+ struct stat *stat) ;
+
+ struct glfs_object *glfs_h_creat (struct glfs *fs, struct glfs_object *parent,
+ const char *path, int flags, mode_t mode,
+ struct stat *sb) ;
+
+ struct glfs_object *glfs_h_mkdir (struct glfs *fs, struct glfs_object *parent,
+ const char *path, mode_t flags,
+ struct stat *sb) ;
+
+ struct glfs_object *glfs_h_mknod (struct glfs *fs, struct glfs_object *parent,
+ const char *path, mode_t mode, dev_t dev,
+ struct stat *sb) ;
+
+ struct glfs_object *glfs_h_symlink (struct glfs *fs, struct glfs_object *parent,
+ const char *name, const char *data,
+ struct stat *stat) ;
+
+
+ int glfs_h_unlink (struct glfs *fs, struct glfs_object *parent,
+ const char *path) ;
+
+ int glfs_h_close (struct glfs_object *object) ;
+
+ int glfs_caller_specific_init (void *uid_caller_key, void *gid_caller_key,
+ void *future) ;
+
+ int glfs_h_truncate (struct glfs *fs, struct glfs_object *object,
+ off_t offset) ;
+
+ int glfs_h_stat(struct glfs *fs, struct glfs_object *object,
+ struct stat *stat) ;
+
+ int glfs_h_getattrs (struct glfs *fs, struct glfs_object *object,
+ struct stat *stat) ;
+
+ int glfs_h_getxattrs (struct glfs *fs, struct glfs_object *object,
+ const char *name, void *value,
+ size_t size) ;
+
+ int glfs_h_setattrs (struct glfs *fs, struct glfs_object *object,
+ struct stat *sb, int valid) ;
+
+ int glfs_h_setxattrs (struct glfs *fs, struct glfs_object *object,
+ const char *name, const void *value,
+ size_t size, int flags) ;
+
+ int glfs_h_readlink (struct glfs *fs, struct glfs_object *object, char *buf,
+ size_t bufsiz) ;
+
+ int glfs_h_link (struct glfs *fs, struct glfs_object *linktgt,
+ struct glfs_object *parent, const char *name) ;
+
+ int glfs_h_rename (struct glfs *fs, struct glfs_object *olddir,
+ const char *oldname, struct glfs_object *newdir,
+ const char *newname) ;
+
+ int glfs_h_removexattrs (struct glfs *fs, struct glfs_object *object,
+ const char *name) ;
+
+ ssize_t glfs_h_extract_handle (struct glfs_object *object,
+ unsigned char *handle, int len) ;
+
+ struct glfs_object *glfs_h_create_from_handle (struct glfs *fs,
+ unsigned char *handle, int len,
+ struct stat *stat) ;
+
+
+ struct glfs_fd *glfs_h_opendir (struct glfs *fs,
+ struct glfs_object *object) ;
+
+ struct glfs_fd *glfs_h_open (struct glfs *fs, struct glfs_object *object,
+ int flags) ;
+
+For more details on these APIs, please refer to glfs.h and glfs-handles.h in the source tree (api/src/) of glusterfs.
+
+* In case of failures, or to close the connection and destroy the glfs_t
+object, use glfs_fini.
+
+ int glfs_fini (glfs_t *fs) ;
+
+
+All the file operations are typically divided into the categories below.
+
+* a) Handle based Operations -
+
+These APIs create/make use of a glfs_object (referred to as a handle) unique
+to each file within a volume.
+The glfs_object structure contains the inode pointer and the gfid.
+
+For example: since the NFS protocol uses file handles to access files, these APIs are
+mainly used by the NFS-Ganesha server.
+
+Eg:
+
+ struct glfs_object *glfs_h_lookupat (struct glfs *fs,
+ struct glfs_object *parent,
+ const char *path,
+ struct stat *stat);
+
+ struct glfs_object *glfs_h_creat (struct glfs *fs,
+ struct glfs_object *parent,
+ const char *path,
+ int flags, mode_t mode,
+ struct stat *sb);
+
+ struct glfs_object *glfs_h_mkdir (struct glfs *fs,
+ struct glfs_object *parent,
+ const char *path, mode_t flags,
+ struct stat *sb);
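+
+As an illustrative sketch (not taken from NFS-Ganesha), the handle-based calls above can be combined as follows. It assumes an already-initialized `glfs_t *fs`, and it assumes that passing a NULL parent to glfs_h_lookupat() resolves the path relative to the volume root; the file name `hello.txt` and the 16-byte handle buffer are illustrative values:
+
+```c
+#include <stdio.h>
+#include <sys/stat.h>
+#include <glusterfs/api/glfs.h>
+#include <glusterfs/api/glfs-handles.h>   /* header locations may vary */
+
+/* 'fs' is assumed to have already gone through glfs_new() and glfs_init(). */
+int
+handle_demo (glfs_t *fs)
+{
+        struct stat         st;
+        struct glfs_object *root = NULL, *obj = NULL;
+        unsigned char       handle[16];   /* GFID-sized opaque handle buffer */
+
+        /* Assumption: a NULL parent resolves the path relative to the root. */
+        root = glfs_h_lookupat (fs, NULL, "/", &st);
+        if (!root)
+                return -1;
+
+        /* Look up a child by name, relative to the root handle. */
+        obj = glfs_h_lookupat (fs, root, "hello.txt", &st);
+        if (!obj) {
+                glfs_h_close (root);
+                return -1;
+        }
+
+        /* Flatten the object into an opaque handle, the way an NFS server
+         * would before handing it out to its clients. */
+        glfs_h_extract_handle (obj, handle, sizeof (handle));
+
+        glfs_h_close (obj);
+        glfs_h_close (root);
+        return 0;
+}
+```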
+
+
+
+* b) File path/descriptor based Operations -
+
+These APIs make use of a file path/descriptor to determine the file
+on which they need to operate.
+
+For example: Samba uses these APIs for file operations.
+
+Examples of the APIs using file path -
+
+ int glfs_chdir (glfs_t *fs, const char *path) ;
+
+ char *glfs_realpath (glfs_t *fs, const char *path, char *resolved_path) ;
+
+Once the file is opened, the file-descriptor generated is used for
+further operations.
+
+Eg:
+
+ int glfs_posix_lock (glfs_fd_t *fd, int cmd, struct flock *flock) ;
+ glfs_fd_t *glfs_dup (glfs_fd_t *fd) ;
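+
+The `*_async` variants listed earlier follow a callback model (see the `glfs_io_cbk` typedef above). Below is a minimal sketch, again assuming an already-initialized `glfs_t *fs`; the file name and the POSIX semaphore used to wait for completion are illustrative choices, not part of libgfapi:
+
+```c
+#include <stdio.h>
+#include <string.h>
+#include <fcntl.h>
+#include <semaphore.h>
+#include <glusterfs/api/glfs.h>
+
+static sem_t done;
+
+/* Matches the glfs_io_cbk typedef listed above: invoked when the I/O
+ * completes, with 'ret' holding the byte count or a negative error. */
+static void
+write_done (glfs_fd_t *fd, ssize_t ret, void *data)
+{
+        printf ("async write finished: %zd bytes\n", ret);
+        sem_post (&done);
+}
+
+/* 'fs' is assumed to be a glfs_t that has already passed glfs_init(). */
+int
+async_write_demo (glfs_t *fs)
+{
+        const char  msg[] = "written asynchronously\n";
+        glfs_fd_t  *fd;
+
+        sem_init (&done, 0, 0);
+
+        fd = glfs_creat (fs, "/async.txt", O_WRONLY, 0644);
+        if (!fd)
+                return -1;
+
+        glfs_pwrite_async (fd, msg, strlen (msg), 0, 0, write_done, NULL);
+
+        sem_wait (&done);        /* block until the callback fires */
+        glfs_close (fd);
+        sem_destroy (&done);
+        return 0;
+}
+```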
+
+
+
+#### libgfapi bindings :
+
+libgfapi bindings are available for below languages:
+
+ - Go
+ - Java
+ - python [2]
+ - Ruby
+ - Rust
+
+For more details on these bindings, please refer to:
+
+ #http://www.gluster.org/community/documentation/index.php/Language_Bindings
+
+References:
+
+[1] http://humblec.com/libgfapi-interface-glusterfs/
+[2] http://www.gluster.org/2014/04/play-with-libgfapi-and-its-python-bindings/
+
diff --git a/done/Features/libgfchangelog.md b/done/Features/libgfchangelog.md
new file mode 100644
index 0000000..1dd0d24
--- /dev/null
+++ b/done/Features/libgfchangelog.md
@@ -0,0 +1,119 @@
+libgfchangelog: "GlusterFS changelog" consumer library
+======================================================
+
+This document puts forward the intended need for a GlusterFS changelog consumer library (a.k.a. libgfchangelog) for consuming changelogs produced by the changelog translator. Further, it mentions the proposed design and the API exposed by it. A brief explanation of the changelog translator can also be found as a commit message in the upstream source tree, and the review link can be [accessed here] [1].
+
+The initial consumer of changelogs would be Geo-Replication (release 3.5). Possible consumers in the future could be backup utilities, GlusterFS self-heal, bit-rot detection and AV scanners. All these utilities have one thing in common: to get a list of changed entities (created/modified/deleted) in the file system. Therefore, the need arises to provide such functionality in the form of a shared library that applications can link against and query for changes (see the API section). There is no plan as of now to provide language bindings as such, but for shell-script friendliness a 'gfind' command line utility (which would be dynamically linked with libgfchangelog) would be helpful. As of now, development of this utility has not commenced.
+
+The next section gives a brief introduction to how changelogs are organized and managed. Then we propose a couple of designs for libgfchangelog. The API set is not covered in this document (maybe later).
+
+Changelogs
+==========
+
+Changelogs can be thought of as a running history for an entity in the file system from the time the entity came into existence. The goal is to capture all possible transitions the entity underwent until the time it got purged. The transition namespace is broken up into three categories, with each category represented by a specific changelog format. Changes are recorded in a flat file in the filesystem and are rolled over after a specific time interval. All three categories are recorded in a single changelog file (sequentially) with a type for each entry. Having a single file reduces disk seeks and fragmentation, and leaves fewer files to deal with. The strategy for pruning old logs is still undecided.
+
+
+Changelog Transition Namespace
+------------------------------
+
+As mentioned before the transition namespace is categorized into three types:
+ - TYPE-I : Data operation
+ - TYPE-II : Metadata operation
+ - TYPE-III : Entry operation
+
+One could visualize the transition of a file system entity as a state machine transitioning from one type to another. For TYPE-I and TYPE-II operations there is no state transition as such, but a TYPE-III operation involves a state change from the file system's perspective. We can now classify file operations (fops) into one of the three types:
+ - Data operation: write(), writev(), truncate(), ftruncate()
+ - Metadata operation: setattr(), fsetattr(), setxattr(), fsetxattr(), removexattr(), fremovexattr()
+ - Entry operation: create(), mkdir(), mknod(), symlink(), link(), rename(), unlink(), rmdir()
+
+Changelog Entry Format
+----------------------
+
+In order to record the type of operation the entity underwent, a type identifier is used. Normally, the entity on which the operation is performed would be identified by its pathname, which is the most common way of addressing in a file system, but we choose to use the GlusterFS internal file identifier (GFID) instead (as GlusterFS supports a GFID-based backend, the pathname field may not always be valid, and for other reasons which are out of the scope of this document). Therefore, the format of the record for the three types of operation can be summarized as follows:
+
+ - TYPE-I : GFID of the file
+ - TYPE-II : GFID of the file
+ - TYPE-III : GFID + FOP + MODE + UID + GID + PARGFID/BNAME [PARGFID/BNAME]
+
+GFIDs are analogous to inodes. TYPE-I and TYPE-II fops record the GFID of the entity on which the operation was performed, thereby recording that there was a data/metadata change on the inode. TYPE-III fops record, at a minimum, a set of six or seven fields (depending on the type of operation), which is sufficient to identify what type of operation the entity underwent. Normally this record includes the GFID of the entity, the type of file operation (which is an integer, an enumerated value used in GlusterFS), the parent GFID and the basename (analogous to parent inode and basename).
+
+Changelogs can be in either ASCII or binary format, the difference being the format of the records that are persisted. In a binary changelog the gfids are recorded in their native format, i.e. a 16-byte record, and the fop number as a 4-byte integer. In an ASCII changelog, the gfids are stored in their canonical form and the fop number is stringified and persisted. A null character is used as the record separator in changelogs. This makes it hard to read changelogs from the command line, but the packed format is needed to support file names with spaces and special characters. Below is a snippet of a changelog alongside its hexdump.
+
+```
+00000000 47 6c 75 73 74 65 72 46 53 20 43 68 61 6e 67 65 |GlusterFS Change|
+00000010 6c 6f 67 20 7c 20 76 65 72 73 69 6f 6e 3a 20 76 |log | version: v|
+00000020 31 2e 31 20 7c 20 65 6e 63 6f 64 69 6e 67 20 3a |1.1 | encoding :|
+00000030 20 32 0a 45 61 36 39 33 63 30 34 65 2d 61 66 39 | 2.Ea693c04e-af9|
+00000040 65 2d 34 62 61 35 2d 39 63 61 37 2d 31 63 34 61 |e-4ba5-9ca7-1c4a|
+00000050 34 37 30 31 30 64 36 32 00 32 33 00 33 33 32 36 |47010d62.23.3326|
+00000060 31 00 30 00 30 00 66 36 35 34 32 33 32 65 2d 61 |1.0.0.f654232e-a|
+00000070 34 32 62 2d 34 31 62 33 2d 62 35 61 61 2d 38 30 |42b-41b3-b5aa-80|
+00000080 33 62 33 64 61 34 35 39 33 37 2f 6c 69 62 76 69 |3b3da45937/libvi|
+00000090 72 74 5f 64 72 69 76 65 72 5f 6e 65 74 77 6f 72 |rt_driver_networ|
+000000a0 6b 2e 73 6f 00 44 61 36 39 33 63 30 34 65 2d 61 |k.so.Da693c04e-a|
+000000b0 66 39 65 2d 34 62 61 35 2d 39 63 61 37 2d 31 63 |f9e-4ba5-9ca7-1c|
+000000c0 34 61 34 37 30 31 30 64 36 32 00 45 36 65 39 37 |4a47010d62.E6e97|
+```
+
+As you can see, there is an *entry* operation (journal record starting with an "E"). Records for this operation are:
+ - GFID : a693c04e-af9e-4ba5-9ca7-1c4a47010d62
+ - FOP : 23 (create)
+ - Mode : 33261
+ - UID : 0
+ - GID : 0
+ - PARGFID/BNAME: f654232e-a42b-41b3-b5aa-803b3da45937
+
+**NOTE**: In case of a rename operation, there would be an additional record (for the target PARGFID/BNAME).
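+
+For illustration, the NUL-separated layout shown in the hexdump above can be inspected with a few lines of C. The sketch below is not part of libgfchangelog; it simply skips the header line and prints every NUL-terminated field on its own line (grouping fields into records based on the leading E/M/D type character is left to the reader):
+
+```c
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+/* Dump the NUL-separated fields of an ASCII changelog (see hexdump above). */
+int
+main (int argc, char *argv[])
+{
+        FILE *fp;
+        char *buf;
+        long  len, i = 0;
+
+        if (argc != 2) {
+                fprintf (stderr, "usage: %s <rolled-over-changelog>\n", argv[0]);
+                return 1;
+        }
+
+        fp = fopen (argv[1], "rb");
+        if (!fp)
+                return 1;
+
+        fseek (fp, 0, SEEK_END);
+        len = ftell (fp);
+        rewind (fp);
+
+        buf = malloc (len + 1);
+        if (!buf || fread (buf, 1, len, fp) != (size_t) len) {
+                fclose (fp);
+                free (buf);
+                return 1;
+        }
+        buf[len] = '\0';
+
+        /* Skip the newline-terminated "GlusterFS Changelog | ..." header. */
+        while (i < len && buf[i] != '\n')
+                i++;
+        i++;
+
+        /* Every subsequent field is NUL-terminated; a field beginning with
+         * 'E', 'M' or 'D' starts a new record of that type. */
+        while (i < len) {
+                printf ("%s\n", buf + i);
+                i += (long) strlen (buf + i) + 1;
+        }
+
+        fclose (fp);
+        free (buf);
+        return 0;
+}
+```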
+
+libgfchangelog
+--------------
+
+NOTE: changelogs generated by the changelog translator are rolled over [with the timestamp as the suffix] after a specific interval, after which a new changelog is started. The current changelog [the changelog file without the timestamp as the suffix] should never be processed unless it has been rolled over. The rolled-over logs should be treated as read-only.
+
+Capturing changes performed on a file system is useful for applications that rely on file system scan (crawl) to figure out such information. Backup utilities, automatic file healing in a replicated environment, bit-rot detection and the likes are some of the end user applications that require a set of changed entities in a file system to act on. Goal of libgfchangelog is to provide the application (consumer) a fast and easy to use common query interface (API). The consumer need not worry about the changelog format, nomenclature of the changelog files etc.
+
+Now we list functionality and some of the features.
+
+Functionality
+-------------
+
+Changelog processing: Processing involves reading changelog file(s) and converting the entries into a human-readable (or application-understandable) format (in the case of the binary log format).
+Book-keeping: Keeping track of how much of the changelog the application has consumed (i.e. the changes during the time slice start-time -> end-time).
+Serving API requests: Updating the consumer by providing the set of changes.
+
+Processing could be done in two ways:
+
+* Pre-processing (pre-processing from the library POV):
+Once a changelog file is rolled over (by the changelog translator), a set of post-processing operations is performed. These operations could include converting a binary log file to an understandable format, collating a bunch of logs into a larger sampling period, or just keeping a private copy of the changelog (in ASCII format). Extra disk space is consumed to store this private copy. The library would then be free to consume these logs and serve application requests.
+
+* On-demand:
+The processing of the changelogs is triggered when an application requests changes. The downside of this is the additional time spent decoding the logs and accumulating data at request time (but no additional disk space is used over the time period).
+
+After processing, the changelog is ready to be consumed by the application. The function of processing is to convert the logs into human/application readable format (an example is shown below):
+
+```
+E a7264fe2-dd6b-43e1-8786-a03b42cc2489 CREATE 33188 0 0 00000000-0000-0000-0000-000000000001%2Fservices1
+M a7264fe2-dd6b-43e1-8786-a03b42cc2489 NULL
+M 00000000-0000-0000-0000-000000000001 NULL
+D a7264fe2-dd6b-43e1-8786-a03b42cc2489
+```
+
+Features
+--------
+
+The following points mention some of the features that the library could provide.
+
+ - The consumer can choose the update type when it registers with the library. 'Types' could be:
+   - Streaming: The consumer is updated via a stream of changes, i.e. the library just replays the logs.
+   - Consolidated: The consumer is provided with a consolidated view of the changelog, e.g. if <gfid> had a DATA and a METADATA operation, it would be presented as a single update. Similarly for ENTRY operations.
+   - Raw: This mode provides the consumer with the pathnames of the changelog files themselves (after processing). The changelogs should be strictly treated as read-only. This gives the consumer the flexibility to extract updates in their own preferred way (e.g. using command line tools like sed, awk, sort | uniq etc.).
+ - The application may choose to adopt a synchronous (blocking) or an asynchronous (callback) notification mechanism.
+ - Provide a unified view of changelogs from multiple peers (replication scenario) or a global changelog view of the entire cluster.
+
+
+**The first cut of the library supports** (see the sketch below):
+ - Raw access mode
+ - Synchronous programming model
+ - Per brick changelog consumption ie. no unified/globally aggregated changelog
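+
+Below is a minimal consumer sketch for this first cut (raw access, synchronous, per brick). It assumes the public C API exposed by the library (gf_changelog_register(), gf_changelog_scan(), gf_changelog_next_change(), gf_changelog_done()); the brick path, scratch directory, log file and the build line (gcc consumer.c -lgfchangelog) are illustrative placeholders.
+
+```
+#include <limits.h>
+#include <stdio.h>
+#include <unistd.h>
+
+/* prototypes normally come from the changelog header shipped with glusterfs */
+int     gf_changelog_register (char *brick_path, char *scratch_dir,
+                               char *log_file, int log_level, int max_reconnects);
+ssize_t gf_changelog_scan (void);
+ssize_t gf_changelog_next_change (char *bufptr, size_t maxlen);
+int     gf_changelog_done (char *file);
+
+int
+main (void)
+{
+        char file[PATH_MAX] = {0,};
+
+        /* register against a single brick: consumption is per brick */
+        if (gf_changelog_register ("/bricks/brick1", "/tmp/scratch",
+                                   "/tmp/consumer.log", 9, 5) < 0) {
+                perror ("gf_changelog_register");
+                return 1;
+        }
+
+        for (;;) {
+                /* pick up changelogs rolled over since the last scan */
+                if (gf_changelog_scan () > 0) {
+                        /* raw mode: we get pathnames of (read-only) changelogs */
+                        while (gf_changelog_next_change (file, PATH_MAX) > 0) {
+                                printf ("processing changelog: %s\n", file);
+                                /* ... parse the E/M/D records here ... */
+                                gf_changelog_done (file);  /* mark as consumed */
+                        }
+                }
+                sleep (10);
+        }
+
+        return 0;
+}
+```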
+
+[1]:http://review.gluster.org/5127
diff --git a/done/Features/memory-usage.md b/done/Features/memory-usage.md
new file mode 100644
index 0000000..4e1a8a0
--- /dev/null
+++ b/done/Features/memory-usage.md
@@ -0,0 +1,49 @@
+object expiry tracking memory usage
+====================================
+
+The bitrot daemon tracks objects for expiry in a data structure known
+as a "timer-wheel" (after which the object is signed). It's a well-known
+data structure for tracking the expiry of millions of objects.
+Let's see the memory usage involved when tracking 1 million
+objects (per brick).
+
+Bitrot daemon uses "br_object" structure to hold information
+needed for signing. An instance of this structure is allocated
+for each object that needs to be signed.
+
+ struct br_object {
+ xlator_t *this;
+
+ br_child_t *child;
+
+ void *data;
+ uuid_t gfid;
+ unsigned long signedversion;
+
+ struct list_head list;
+ };
+
+Timer-wheel requires an instance of the structure below per
+object that needs to be tracked for expiry.
+
+ struct gf_tw_timer_list {
+ void *data;
+ unsigned long expires;
+
+ /** callback routine */
+ void (*function)(struct gf_tw_timer_list *, void *, unsigned long);
+
+ struct list_head entry;
+ };
+
+Structure sizes:
+
+- sizeof (struct br_object): 64 bytes
+- sizeof (struct gf_tw_timer_list): 40 bytes
+
+Together, these structures take up 104 bytes. To track all 1 million objects
+at the same time, the amount of memory taken up would be:
+
+**1,000,000 * 104 bytes: ~100 MB**
+
+Not so bad, I think.
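+
+The arithmetic can be sanity-checked with a small standalone program. The
+structures below are simplified stand-ins (the gluster-internal pointer types
+are replaced by `void *`, and `uuid_t`/`struct list_head` are redefined
+locally), so the sizes are representative of a typical 64-bit build rather
+than authoritative.
+
+    #include <stdio.h>
+
+    /* local stand-ins for gluster/kernel types, sized as on 64-bit builds */
+    struct list_head { struct list_head *next, *prev; };
+    typedef unsigned char uuid_t[16];
+
+    struct br_object {
+            void             *this;    /* xlator_t *   in the real structure */
+            void             *child;   /* br_child_t * in the real structure */
+            void             *data;
+            uuid_t            gfid;
+            unsigned long     signedversion;
+            struct list_head  list;
+    };
+
+    struct gf_tw_timer_list {
+            void             *data;
+            unsigned long     expires;
+            void            (*function)(struct gf_tw_timer_list *, void *, unsigned long);
+            struct list_head  entry;
+    };
+
+    int main(void)
+    {
+            size_t per_object = sizeof(struct br_object) +
+                                sizeof(struct gf_tw_timer_list);
+
+            printf("br_object        : %zu bytes\n", sizeof(struct br_object));
+            printf("gf_tw_timer_list : %zu bytes\n", sizeof(struct gf_tw_timer_list));
+            printf("1,000,000 objects: ~%.0f MB\n",
+                   1000000.0 * per_object / (1024 * 1024));
+            return 0;
+    }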
diff --git a/done/Features/meta.md b/done/Features/meta.md
new file mode 100644
index 0000000..da0d62a
--- /dev/null
+++ b/done/Features/meta.md
@@ -0,0 +1,206 @@
+Meta translator
+===============
+
+Introduction
+------------
+
+Meta xlator provides an interface similar to the Linux procfs, for GlusterFS
+runtime and configuration. This document lists some useful information about
+GlusterFS internals that could be accessed via the meta xlator. This is not
+exhaustive at the moment. Contributors are welcome to improve this.
+
+Note: Meta xlator is loaded automatically in the client graph, ie. in the
+mount process' graph.
+
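+The examples below use a `$META` shell variable. Assuming the volume is
+mounted at `/mnt/fuse` (as in the listing further down), it can be set as
+follows:
+
+```
+export META=/mnt/fuse/.meta
+```
+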
+### GlusterFS native mount version
+
+>[root@trantor codebase]# cat $META/version
+>{
+> "Package Version": "3.7dev"
+>}
+
+### Listing of some files under the `meta` folder
+
+>[root@trantor codebase]# mount -t glusterfs trantor:/vol /mnt/fuse
+>[root@trantor codebase]# ls $META
+>cmdline frames graphs logging mallinfo master measure_latency process_uuid version
+
+### GlusterFS' process identifier
+
+>[root@trantor codebase]# cat $META/process_uuid
+>trantor-11149-2014/07/25-18:48:50:468259
+>
+This identifier appears in connection establishment log messages.
+For eg.,
+
+>[2014-07-25 18:48:49.017927] I [server-handshake.c:585:server_setvolume] 0-vol-server: accepted client from trantor-11087-2014/07/25-18:48:48:779656-vol-client-0-0-0 (version: 3.7dev)
+>
+
+### GlusterFS command line
+
+>[root@trantor codebase]# cat $META/cmdline
+>{
+> "Cmdlinestr": "/usr/local/sbin/glusterfs --volfile-server=trantor --volfile-id=/vol /mnt/fuse"
+>}
+
+### GlusterFS volume graph
+
+The following directory structure reveals the way xlators are stacked in a
+graph-like fashion. Each (virtual) file under an xlator directory provides
+runtime information for that xlator. For example, 'name' contains the name of
+the xlator. The individual entries can be read like regular files; an example
+follows the tree below.
+
+```
+/mnt/fuse/.meta/graphs/active
+|-- meta-autoload
+| |-- history
+| |-- meminfo
+| |-- name
+| |-- options
+| |-- private
+| |-- profile
+| |-- subvolumes
+| | `-- 0 -> ../../vol
+| |-- type
+| `-- view
+|-- top -> meta-autoload
+|-- vol
+| |-- history
+| |-- meminfo
+| |-- name
+| |-- options
+| | |-- count-fop-hits
+| | `-- latency-measurement
+| |-- private
+| |-- profile
+| |-- subvolumes
+| | `-- 0 -> ../../vol-md-cache
+| |-- type
+| `-- view
+|-- vol-client-0
+| |-- history
+| |-- meminfo
+| |-- name
+| |-- options
+| | |-- client-version
+| | |-- clnt-lk-version
+| | |-- fops-version
+| | |-- password
+| | |-- ping-timeout
+| | |-- process-uuid
+| | |-- remote-host
+| | |-- remote-subvolume
+| | |-- send-gids
+| | |-- transport-type
+| | |-- username
+| | |-- volfile-checksum
+| | `-- volfile-key
+| |-- private
+| |-- profile
+| |-- subvolumes
+| |-- type
+| `-- view
+|-- vol-client-1
+| |-- history
+| |-- meminfo
+| |-- name
+| |-- options
+| | |-- client-version
+| | |-- clnt-lk-version
+| | |-- fops-version
+| | |-- password
+| | |-- ping-timeout
+| | |-- process-uuid
+| | |-- remote-host
+| | |-- remote-subvolume
+| | |-- send-gids
+| | |-- transport-type
+| | |-- username
+| | |-- volfile-checksum
+| | `-- volfile-key
+| |-- private
+| |-- profile
+| |-- subvolumes
+| |-- type
+| `-- view
+|-- vol-dht
+| |-- history
+| |-- meminfo
+| |-- name
+| |-- options
+| |-- private
+| |-- profile
+| |-- subvolumes
+| | |-- 0 -> ../../vol-client-0
+| | `-- 1 -> ../../vol-client-1
+| |-- type
+| `-- view
+|-- volfile
+|-- vol-io-cache
+| |-- history
+| |-- meminfo
+| |-- name
+| |-- options
+| |-- private
+| |-- profile
+| |-- subvolumes
+| | `-- 0 -> ../../vol-read-ahead
+| |-- type
+| `-- view
+|-- vol-md-cache
+| |-- history
+| |-- meminfo
+| |-- name
+| |-- options
+| |-- private
+| |-- profile
+| |-- subvolumes
+| | `-- 0 -> ../../vol-open-behind
+| |-- type
+| `-- view
+|-- vol-open-behind
+| |-- history
+| |-- meminfo
+| |-- name
+| |-- options
+| |-- private
+| |-- profile
+| |-- subvolumes
+| | `-- 0 -> ../../vol-quick-read
+| |-- type
+| `-- view
+|-- vol-quick-read
+| |-- history
+| |-- meminfo
+| |-- name
+| |-- options
+| |-- private
+| |-- profile
+| |-- subvolumes
+| | `-- 0 -> ../../vol-io-cache
+| |-- type
+| `-- view
+|-- vol-read-ahead
+| |-- history
+| |-- meminfo
+| |-- name
+| |-- options
+| |-- private
+| |-- profile
+| |-- subvolumes
+| | `-- 0 -> ../../vol-write-behind
+| |-- type
+| `-- view
+`-- vol-write-behind
+ |-- history
+ |-- meminfo
+ |-- name
+ |-- options
+ |-- private
+ |-- profile
+ |-- subvolumes
+ | `-- 0 -> ../../vol-dht
+ |-- type
+ `-- view
+
+```
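+
+For instance, the per-xlator entries above can be read with plain file tools
+(output omitted; the paths follow the tree shown):
+
+```
+cat /mnt/fuse/.meta/graphs/active/vol-dht/type
+cat /mnt/fuse/.meta/graphs/active/vol-client-0/options/remote-host
+```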
diff --git a/done/Features/mount_gluster_volume_using_pnfs.md b/done/Features/mount_gluster_volume_using_pnfs.md
new file mode 100644
index 0000000..6807a21
--- /dev/null
+++ b/done/Features/mount_gluster_volume_using_pnfs.md
@@ -0,0 +1,68 @@
+# How to export gluster volumes using pNFS?
+
+The Parallel Network File System (pNFS) is part of the NFS v4.1 protocol that
+allows compute clients to access storage devices directly and in parallel.
+A pNFS cluster consists of an MDS (Meta-Data Server) and DSes (Data Servers).
+The client sends all read/write requests directly to the DS and all other
+operations are handled by the MDS. pNFS support is implemented as part of the
+glusterFS+NFS-ganesha integration.
+
+### 1.) Pre-requisites
+
+ - Create a GlusterFS volume
+
+ - Install nfs-ganesha (refer section 5)
+
+ - Disable kernel-nfs, gluster-nfs services on the system using the following commands
+ - service nfs stop
+ - gluster vol set <volname> nfs.disable ON (Note: this command has to be repeated for all the volumes in the trusted-pool)
+
+ - Turn on feature.cache-invalidation for the volume.
+ - gluster v set <volname> features.cache-invalidation on
+
+### 2.) Configure nfs-ganesha for pNFS
+
+ - Disable nfs-ganesha and tear down the HA cluster via the gluster CLI (pNFS does not require the HA setup)
+ - gluster features.ganesha disable
+
+ - For the optimal working of pNFS, ganesha servers should be run manually on every node in the trusted pool (refer to section 5)
+ - *#ganesha.nfsd -f <location_of_nfs-ganesha.conf_file> -L <location_of_log_file> -N <log_level> -d*
+
+ - Check whether volume is exported via nfs-ganesha in all the nodes.
+ - *#showmount -e localhost*
+
+ - Configure the MDS by adding the following block to the ganesha configuration file
+```sh
+GLUSTER
+{
+ PNFS_MDS = true;
+}
+```
+
+
+### 3.) Mount volume via pNFS
+
+Mount the volume using any nfs-ganesha server in the trusted pool. By default, NFS version 4.1 will use the pNFS protocol for gluster volumes.
+ - *#mount -t nfs4 -o minorversion=1 <ip of server>:/<volume name> <mount path>*
+
+### 4.) Points to be noted
+
+ - The current architecture supports only a single MDS and multiple DSes. The server with which the client mounts will act as the MDS, and all servers including the MDS can act as a DS.
+
+ - If any of the DSes goes down, then the MDS will handle those I/Os.
+
+ - Hereafter, all subsequent NFS clients need to use the same server for mounting that volume via pNFS, i.e. more than one MDS for a volume is not preferred.
+
+ - pNFS support is only tested with distributed, replicated or distributed-replicated volumes.
+
+ - It is tested and verified with RHEL 6.5, Fedora 20 and Fedora 21 NFS clients. It is always better to use the latest NFS clients.
+
+### 5.) References
+
+ - Setup and create glusterfs volumes : http://www.gluster.org/community/documentation/index.php/QuickStart
+
+ - NFS-Ganesha wiki : https://github.com/nfs-ganesha/nfs-ganesha/wiki
+
+ - For installing, running NFS-Ganesha and exporting a volume :
+ - read doc/features/glusterfs_nfs-ganesha_integration.md
+ - http://blog.gluster.org/2014/09/glusterfs-and-nfs-ganesha-integration/
diff --git a/done/Features/nufa.md b/done/Features/nufa.md
new file mode 100644
index 0000000..03b8194
--- /dev/null
+++ b/done/Features/nufa.md
@@ -0,0 +1,20 @@
+# NUFA Translator
+
+The NUFA ("Non Uniform File Access") is a variant of the DHT ("Distributed Hash
+Table") translator, intended for use with workloads that have a high locality
+of reference. Instead of placing new files pseudo-randomly, it places them on
+the same nodes where they are created so that future accesses can be made
+locally. For replicated volumes, this means that one copy will be local and
+others will be remote; the read-replica selection mechanisms will then favor
+the local copy for reads. For non-replicated volumes, the only copy will be
+local.
+
+## Interface
+
+Use of NUFA is controlled by a volume option, as follows.
+
+ gluster volume set myvolume cluster.nufa on
+
+This will cause the NUFA translator to be used wherever the DHT translator
+otherwise would be. The rest is all automatic.
+
diff --git a/done/Features/object-versioning.md b/done/Features/object-versioning.md
new file mode 100644
index 0000000..aaa8e26
--- /dev/null
+++ b/done/Features/object-versioning.md
@@ -0,0 +1,230 @@
+Object versioning
+=================
+
+ Bitrot detection in GlusterFS relies on object (file) checksum (hash) verification,
+ also known as "object signature". An object is signed when there are no active
+ file descriptors referring to its inode (i.e., upon the last close()). This is just a
+ hint for the initiation of hash calculation (and therefore signing). There is
+ absolutely no control over when clients can initiate modification operations on
+ the object. An object could be under modification while its hash computation is
+ in progress. It would also be inappropriate to restrict access to such objects
+ for the duration of signing.
+
+ Object versioning is used as a mechanism to identify the staleness of an object's
+ signature. The document below does not just list the version update protocol,
+ but also goes through the various factors that led to its design.
+
+*NOTE:* The word "object" is used to represent a "regular file" (in the Linux sense) and
+ object versions are persisted in the extended attributes of the object's inode.
+ Signature calculation includes the object's data (no metadata as of now).
+
+INDEX
+=====
+1. Version updation protocol
+2. Correctness guarantees
+3. Implementation
+4. Protocol enhancements
+
+1. Version updation protocol
+============================
+ There are two types of versions associated with an object:
+
+ a) Ongoing version: This version is incremented on the first open() [when
+    the in-memory representation of the object (inode) is marked dirty]
+    and synchronized to disk. When an object is created, a default ongoing
+    version of one (1) is assigned. An object lookup() too assigns the
+    default version if not present. When a version is initialized upon
+    a lookup() or creat() FOP, it need not be durable on disk and therefore
+    can just be an extended attribute set without an expensive fsync()
+    syscall.
+
+ b) Signing version: This is the version against which an object is deemed
+    to be signed. An object's signature is tied to a particular signed version.
+    Since an object is a candidate for signing upon the last release() [last
+    close()], the signing version is the "ongoing version" at that point in time.
+
+ An object's signature is trustable when the version it was signed against
+ matches the ongoing version, i.e., if the hash is calculated by hand and
+ compared against the object signature, it *should* be a perfect match if
+ and only if the versions are equal. Otherwise, the signature is
+ considered stale (it might or might not match the hash just calculated).
+
+ Initialization of object versions
+ ---------------------------------
+ An object that existed before the versioning days is assigned the
+ default versions upon lookup(). The protocol at this point expects "no"
+ durability guarantees of the versions, i.e., extended attribute sets
+ need not be followed by an explicit filesystem sync (fsync()). In case
+ of a power outage or a crash, versions are re-initialized with defaults
+ if found to be non-existent. The signing version is initialized with a
+ default value of zero (0) and the ongoing version as one (1).
+
+ *NOTE:* If an object already has versions on-disk, lookup() just brings
+         the versions into memory. In this case both versions may or may
+         not match, depending on the state the object was left in.
+
+ Increment of object versions
+ ----------------------------
+ During initial versioning, the in-memory representation of the object is
+ marked dirty, so that a subsequent modification operation on the object
+ triggers a version synchronization to disk (extended attribute set).
+ Moreover, this operation needs to be durable on disk for the protocol
+ to be crash consistent.
+
+ Let's picture the various version states after subsequent open()s.
+ Not all modification operations need to increment the ongoing version;
+ only the first operation needs to (subsequent operations are NO-OPs).
+
+ *NOTE:* From here on, "[s]" depicts a durable filesystem operation and
+         "*" depicts the inode as dirty.
+
+
+ lookup() open() open() open()
+ ===========================================================
+
+ OV(m): 1* 2 2 2
+ -----------------------------------------
+ OV(d): 1 2[s] 2 2
+ SV(d): 0 0 0 0
+
+
+ Let's now picture the state when an already signed object undergoes
+ file operations.
+
+ on-disk state:
+ OV(d): 3
+ SV(d): 3|<signature>
+
+
+ lookup() open() open() open()
+ ===========================================================
+
+ OV(m): 3* 4 4 4
+ -----------------------------------------
+ OV(d): 3 4[s] 4 4
+ SV(d): 3 3 3 3
+
+ Signing process
+ ---------------
+ As per the above example, when the last open file descriptor is closed,
+ signing needs to be performed. The protocol restricts that the signing
+ needs to be attached to a version, which in this case is the in-memory
+ value of the ongoing version. A release() also marks the inode dirty,
+ therefore, the next open() does a durable version synchronization to
+ disk.
+
+ [carry forwarding the versions from earlier example]
+
+ close() release() open() open()
+ ===========================================================
+
+ OV(m): 4 4* 5 5
+ -----------------------------------------
+ OV(d): 4 4 5[s] 5
+ SV(d): 3 3 3 3
+
+ As shown above, a release() call triggers signing with the signing version
+ as OV(m), which in this case is 4. During signing, the object is signed
+ with a signature attached to version 4 as shown below (continuing with
+ the last open() call from above):
+
+ open() sign(4, signature)
+ ===========================================================
+
+ OV(m): 5 5
+ -----------------------------------------
+ OV(d): 5 5
+ SV(d): 3 4:<signature>[s]
+
+ A signature comparison at this point in time is untrustworthy due to
+ the version mismatch. This also protects from node crashes and hard
+ reboots, due to the durability guarantee of the on-disk version on the
+ first open().
+
+ close() release() open()
+ ===========================================================
+
+ OV(m): 4 4* 5
+ -------------------------------- CRASH
+ OV(d): 4 4 5[s]
+ SV(d): 3 3 3
+
+ The protocol is immune to signing requests after crashes due to
+ the version synchronization performed on the first open(). A signing
+ request for a version lesser than the *current* ongoing version
+ can be ignored. It's left to the implementation to either
+ accept or ignore such signing request(s).
+
+ *NOTE:* An inode forget() causes a fresh lookup() to be triggered.
+         Since a forget() call is received when there are no
+         active references for an inode, the on-disk version is
+         the latest and would be copied in-memory on lookup().
+
+2. Correctness Guarantees
+==========================
+
+ Concurrent open()'s
+ -------------------
+ When an inode is dirty (i.e., the very next operation would try to
+ synchronize the version to disk), there can be multiple calls [say,
+ open()] that find the inode state as dirty and try to write back
+ the new version to disk. Also, note that marking the inode as synced
+ and updating the in-memory version is done *after* the new version
+ is written on disk. This is done to avoid an incorrect version being stored
+ on-disk in case the version synchronization fails (with the in-memory
+ version still holding the updated value).
+ Coming back to multiple open() calls on an object, each open() call
+ tries to synchronize the new version to disk if the inode is marked
+ as dirty. This is safe as each open() would try to synchronize the
+ new version (ongoing version + 1) even if the update is concurrent.
+ The in-memory version is finally updated to reflect the updated
+ version and mark the inode non-dirty. Again, this is done *only* if
+ the inode is dirty, so open() calls which updated the on-disk
+ version but lost the race to update the in-memory version
+ are NO-OPs.
+
+ on-disk state:
+ OV(d): 3
+ SV(d): 3|<signature>
+
+
+ lookup() open() open()' open()' open()
+ =============================================================
+
+ OV(m): 3* 3* 3* 4 NO-OP
+ --------------------------------------------------
+ OV(d): 3 4[s] 4[s] 4 4
+ SV(d): 3 3 3 3 3
+
+ open()/release() race
+ ---------------------
+ This race can cause a release() [on the last close()] to pick up the
+ ongoing version which was just incremented on a fresh open(). This
+ leads to signing of the object with the same version as the
+ ongoing version, thereby mismatching signatures when calculated.
+ Another point worth mentioning here is that the open
+ file descriptor is *attached* to its inode *after* the
+ version synchronization (and increment) is done. Hence, if a release()
+ sneaks into this window, the file descriptor list for the given
+ inode is still empty, and release() therefore considers it a
+ last close().
+ To counter this, the protocol should track the open and release
+ counts for file descriptors. A release() should only trigger a
+ signing request when the file descriptor list for an inode is empty
+ and the number of releases matches the number of opens. When an
+ open() sneaks in and increments the ongoing version but the file
+ descriptor is not yet attached to the inode, the open and release
+ counts mismatch, hence identifying an open() in progress.
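+
+ The rules above can be summarized in a small illustrative sketch. This is
+ *not* the actual bit-rot stub code (see section 3); the structure, helper
+ functions and lock-free style are hypothetical and only mirror the protocol
+ described in this document.
+
+    /* illustrative only -- not the code under xlators/feature/bit-rot/src/stub */
+    struct obj_ctx {
+            unsigned long ongoing_version;
+            int           dirty;     /* next open() must sync the version to disk */
+            unsigned int  opens;
+            unsigned int  releases;
+            unsigned int  fd_count;  /* fds actually attached to the inode */
+    };
+
+    /* hypothetical helpers standing in for the xattr write and the signer */
+    static int  sync_version_to_disk (unsigned long v) { (void) v; return 0; }
+    static void request_signing      (unsigned long v) { (void) v; }
+
+    static int obj_open (struct obj_ctx *c)
+    {
+            if (c->dirty) {
+                    /* durably write (ongoing + 1) before touching memory */
+                    if (sync_version_to_disk (c->ongoing_version + 1) != 0)
+                            return -1;
+                    if (c->dirty) {          /* racing opens become NO-OPs */
+                            c->ongoing_version++;
+                            c->dirty = 0;
+                    }
+            }
+            c->opens++;
+            c->fd_count++;                   /* fd attached only at this point */
+            return 0;
+    }
+
+    static void obj_release (struct obj_ctx *c)
+    {
+            c->releases++;
+            c->fd_count--;
+            /* sign only on the "real" last close: no attached fds AND
+             * opens == releases, so an open() that bumped the version but
+             * has not yet attached its fd defers the signing request */
+            if (c->fd_count == 0 && c->opens == c->releases) {
+                    request_signing (c->ongoing_version);
+                    c->dirty = 1;            /* next open() bumps the version */
+            }
+    }
+
+    int main (void)
+    {
+            struct obj_ctx c = { .ongoing_version = 1, .dirty = 1 };
+
+            obj_open (&c);       /* OV: 1 -> 2, synced durably      */
+            obj_open (&c);       /* NO-OP for the version           */
+            obj_release (&c);
+            obj_release (&c);    /* last close: sign against v2     */
+            return 0;
+    }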
+
+3. Implementation
+==================
+
+Refer to: xlators/feature/bit-rot/src/stub
+
+4. Protocol enhancements
+=========================
+
+ a) Delaying persisting on-disk versions till open()
+ b) Lazy version updation (until signing?)
+ c) Protocol changes required to handle anonymous file
+ descriptors in GlusterFS.
diff --git a/done/Features/ovirt-integration.md b/done/Features/ovirt-integration.md
new file mode 100644
index 0000000..46dbeab
--- /dev/null
+++ b/done/Features/ovirt-integration.md
@@ -0,0 +1,106 @@
+##Ovirt Integration with glusterfs
+
+oVirt is an open-source virtualization management platform. You can use oVirt to manage
+hardware nodes, storage and network resources, and to deploy and monitor virtual machines
+running in your data center. oVirt serves as the bedrock for Red Hat's Enterprise Virtualization product,
+and is the "upstream" project where new features are developed in advance of their inclusion
+in that supported product offering.
+
+To know more about oVirt, please visit http://www.ovirt.org/
+
+For the installation steps of oVirt, please refer to
+http://www.ovirt.org/Quick_Start_Guide#Install_oVirt_Engine_.28Fedora.29%60
+
+When oVirt is integrated with gluster, glusterfs can be used in the following ways:
+
+* As a storage domain to host VM disks.
+
+There are mainly two ways to exploit glusterfs as a storage domain.
+ - POSIXFS_DOMAIN ( >=oVirt 3.1 )
+ - GLUSTERFS_DOMAIN ( >=oVirt 3.3)
+
+The former has a performance overhead and is not an ideal way to consume images hosted on glusterfs volumes.
+When used in this way, qemu uses the glusterfs `mount point` to access VM images and incurs FUSE overhead.
+libvirt treats this as a file-type disk in its XML schema.
+
+The latter is the recommended way of using glusterfs with oVirt as a storage domain. This provides a better
+and more efficient way to access images hosted on glusterfs volumes. When qemu accesses a glusterfs volume using this method,
+it makes use of the `libgfapi` implementation of glusterfs, and this method is called native integration.
+Here glusterfs is added as a block backend to qemu and libvirt treats this as a `network`-type disk.
+
+For more details on this, please refer # http://www.ovirt.org/Features/GlusterFS_Storage_Domain
+However, there are two bugs which block usage of this feature:
+
+https://bugzilla.redhat.com/show_bug.cgi?id=1022961
+https://bugzilla.redhat.com/show_bug.cgi?id=1017289
+
+Please check above bugs for latest status.
+
+* To manage gluster trusted pools.
+
+oVirt web admin console can be used to -
+ - add new / import existing gluster cluster
+ - add/delete volumes
+ - add/delete bricks
+ - set/reset volume options
+ - optimize volume for virt store
+ - Rebalance and Remove bricks
+ - Monitor gluster deployment - node, brick, volume status, and
+ enhanced service monitoring (physical node resources as well as quota, geo-rep and self-heal status) through Nagios integration (>=oVirt 3.4)
+
+
+
+When configuring oVirt to manage only a gluster cluster/trusted pool, you need to select `gluster` as the input for
+`Application mode` in the OVIRT ENGINE CONFIGURATION option of the `engine-setup` command.
+Refer # http://www.ovirt.org/Quick_Start_Guide#Install_oVirt_Engine_.28Fedora.29%60
+
+If you want to use gluster for both (as a storage domain to host VM disks and to manage gluster trusted pools),
+you need to input `both` as the value for `Application mode` in the engine-setup command.
+
+Once you have successfully installed oVirt Engine as mentioned above, you will be provided with instructions
+to access oVirt's web console.
+
+Below example shows how to configure gluster nodes in fedora.
+
+
+#Configuring gluster nodes.
+
+On the machine designated as your host, install any supported distribution (e.g. Fedora/CentOS/RHEL).
+A minimal installation is sufficient.
+
+Refer # http://www.ovirt.org/Quick_Start_Guide#Install_Hosts
+
+
+##Connect to Ovirt Engine
+
+Log In to Administration Console
+
+Ensure that you have the administrator password configured during installation of oVirt engine.
+
+- To connect to oVirt webadmin console
+
+
+Open a browser and navigate to https://domain.example.com/webadmin. Substitute domain.example.com with the URL provided during installation
+
+If this is your first time connecting to the administration console, oVirt Engine will issue
+security certificates for your browser. Click the link labelled "this certificate" to trust the
+ca.cer certificate. A pop-up displays; click Open to launch the Certificate dialog.
+Click `Install Certificate` and select to place the certificate in the Trusted Root Certification Authorities store.
+
+
+The console login screen displays. Enter admin as your User Name, and enter the Password that
+you provided during installation. Ensure that your domain is set to Internal. Click Login.
+
+
+You have now successfully logged in to the oVirt web administration console. Here, you can configure and manage all your gluster resources.
+
+To manage gluster trusted pool:
+
+- Create a cluster with "Enable gluster service" turned on. (Turn on "Enable virt service" if the same nodes are also used as hypervisors.)
+- Add hosts which have already been set up as in the step "Configuring gluster nodes".
+- Create a volume and click on "Optimize for virt store". This sets the volume tunables that optimize the volume for use as an image store.
+
+To use this volume as a storage domain:
+
+Please refer `User interface` section of www.ovirt.org/Features/GlusterFS_Storage_Domain
diff --git a/done/Features/qemu-integration.md b/done/Features/qemu-integration.md
new file mode 100644
index 0000000..aba3621
--- /dev/null
+++ b/done/Features/qemu-integration.md
@@ -0,0 +1,230 @@
+Using GlusterFS volumes to host VM images and data was sub-optimal due to the FUSE overhead involved in accessing gluster volumes via GlusterFS native client. However this has changed now with two specific enhancements:
+
+- A new library called libgfapi is now available as part of GlusterFS that provides POSIX-like C APIs for accessing gluster volumes. libgfapi support is available from GlusterFS-3.4 release.
+- QEMU (starting from QEMU-1.3) will have GlusterFS block driver that uses libgfapi and hence there is no FUSE overhead any longer when QEMU works with VM images on gluster volumes.
+
+GlusterFS, with its pluggable translator model serves as a flexible storage backend for QEMU. QEMU has to just talk to GlusterFS and GlusterFS will hide different file systems and storage types underneath. Various GlusterFS storage features like replication and striping will automatically be available for QEMU. Efforts are also on to add block device backend in Gluster via Block Device (BD) translator that will expose underlying block devices as files to QEMU. This allows GlusterFS to be a single storage backend for both file and block based storage types.
+
+###GlusterFS specification in QEMU
+
+A VM image residing on a gluster volume can be specified on the QEMU command line using the following URI format:
+
+ gluster[+transport]://[server[:port]]/volname/image[?socket=...]
+
+
+
+* `gluster` is the protocol.
+
+* `transport` specifies the transport type used to connect to gluster management daemon (glusterd). Valid transport types are `tcp, unix and rdma.` If a transport type isn’t specified, then tcp type is assumed.
+
+* `server` specifies the server where the volume file specification for the given volume resides. This can be either hostname, ipv4 address or ipv6 address. ipv6 address needs to be within square brackets [ ]. If transport type is unix, then server field should not be specified. Instead the socket field needs to be populated with the path to unix domain socket.
+
+* `port` is the port number on which glusterd is listening. This is optional and if not specified, QEMU will send 0 which will make gluster to use the default port. If the transport type is unix, then port should not be specified.
+
+* `volname` is the name of the gluster volume which contains the VM image.
+
+* `image` is the path to the actual VM image that resides on gluster volume.
+
+
+###Examples:
+
+ gluster://1.2.3.4/testvol/a.img
+ gluster+tcp://1.2.3.4/testvol/a.img
+ gluster+tcp://1.2.3.4:24007/testvol/dir/a.img
+ gluster+tcp://[1:2:3:4:5:6:7:8]/testvol/dir/a.img
+ gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img
+ gluster+tcp://server.domain.com:24007/testvol/dir/a.img
+ gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket
+ gluster+rdma://1.2.3.4:24007/testvol/a.img
+
+
+
+NOTE: (GlusterFS URI description and above examples are taken from QEMU documentation)
+
+###Configuring QEMU with GlusterFS backend
+
+While building QEMU from source, in addition to the normal configuration options, ensure that the --enable-glusterfs option is specified explicitly with the ./configure script to get glusterfs support in QEMU.
+
+Starting with QEMU-1.6, pkg-config is used to configure the GlusterFS backend in QEMU. If you are using GlusterFS compiled and installed from sources, then the GlusterFS package config file (glusterfs-api.pc) might not be present at the standard path and you will have to explicitly add the path by executing this command before running the QEMU configure script:
+
+ export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/
+
+Without this, GlusterFS driver will not be compiled into QEMU even when GlusterFS is present in the system.
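+
+Putting the above together, a typical from-source build that enables the GlusterFS driver might look like this (the install prefix is illustrative):
+
+    export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/
+    ./configure --enable-glusterfs
+    make && make install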
+
+* Creating a VM image on GlusterFS backend
+
+The qemu-img command can be used to create VM images on a gluster backend. The general syntax for image creation is:
+
+    qemu-img create gluster://server/volname/path/to/image size
+
+## How to setup the environment:
+
+This use case (using a glusterfs backend as the VM disk store) is known as the 'Virt-Store' use case. The steps for the entire procedure can be split into:
+
+* Steps to be done on gluster volume side
+* Steps to be done on Hypervisor side
+
+
+##Steps to be done on gluster side
+
+These are the steps that need to be done on the gluster side. Precisely, this involves:
+
+ Creating "Trusted Storage Pool"
+ Creating a volume
+ Tuning the volume for virt-store
+ Tuning glusterd to accept requests from QEMU
+ Tuning glusterfsd to accept requests from QEMU
+ Setting ownership on the volume
+ Starting the volume
+
+* Creating "Trusted Storage Pool"
+
+Install the glusterfs rpms on the NODE. You can create a volume with a single node. You can also scale up the cluster, known as a `Trusted Storage Pool`, by adding more nodes to the cluster:
+
+ gluster peer probe <hostname>
+
+* Creating a volume
+
+It is highly recommended to have a replicate or distribute-replicate volume for the virt-store use case, as it adds high availability and fault-tolerance. Note that a plain distribute volume works equally well.
+
+    gluster volume create <volname> replica 2 <brick1> .. <brickN>
+
+where `<brickN>` is `<hostname>:/<path-of-dir>`
+
+
+Note: It is recommended to create sub-directories inside the brick mount point and use those while creating a volume. For example, say /home/brick1 is the mountpoint of an XFS filesystem; you can create a sub-directory inside it, /home/brick1/b1, and use that while creating the volume. You can also use space available in the root filesystem for bricks. The gluster CLI, by default, throws a warning in that case. You can override it by using the force option:
+
+    gluster volume create <volname> replica 2 <brick1> .. <brickN> force
+
+If you are new to GlusterFS, you can take a look at [QuickStart](../Quick-Start-Guide/Quickstart.md) guide.
+
+* Tuning the volume for virt-store
+
+There are recommended settings available for virt-store. This provides good performance characteristics when enabled on the volume which is used for virt-store
+
+Refer to [Virt store usecases-Tunables](../Feature Planning/GlusterFS 3.5/Virt store usecase.md#tunables) for recommended tunables and for applying them on the volume, [refer this section](../Feature Planning/GlusterFS 3.5/Virt store usecase.md#applying-the-tunables-on-the-volume)
+
+
+* Tuning glusterd to accept requests from QEMU
+
+glusterd accepts requests only from applications that run with a port number less than 1024; otherwise the request is blocked. QEMU uses port numbers greater than 1024. To make glusterd accept requests from QEMU, you must edit the glusterd vol file, /etc/glusterfs/glusterd.vol, and add the following:
+
+ option rpc-auth-allow-insecure on
+
+Note: If you have installed glusterfs from source, you can find glusterd vol file at: `/usr/local/etc/glusterfs/glusterd.vol`
+
+Restart glusterd after adding the option to glusterd vol file.
+
+ service glusterd restart
+
+* Tuning glusterfsd to accept requests from QEMU
+
+Enable the option `allow-insecure` on the particular volume
+
+ gluster volume set <volname> server.allow-insecure on
+
+IMPORTANT: As of now (April 2, 2014) there is a bug, as allow-insecure is not dynamically set on a volume. You need to restart the volume for the change to take effect.
+
+
+* Setting ownership on the volume
+
+Set qemu:qemu ownership on the volume:
+
+ gluster volume set <vol-name> storage.owner-uid 107
+ gluster volume set <vol-name> storage.owner-gid 107
+
+* Starting the volume
+
+Start the volume
+
+ gluster volume start <vol-name>
+
+## Steps to be done on Hypervisor Side:
+
+To create a raw image,
+
+ qemu-img create gluster://1.2.3.4/testvol/dir/a.img 5G
+
+To create a qcow2 image,
+
+ qemu-img create -f qcow2 gluster://server.domain.com:24007/testvol/a.img 5G
+
+
+
+
+
+## Booting VM image from GlusterFS backend
+
+A VM image 'a.img' residing on gluster volume testvol can be booted using QEMU like this:
+
+
+ qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio
+
+In addition to VM images, gluster drives can also be used as data drives:
+
+ qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio -drive file=gluster://1.2.3.4/datavol/a-data.img,if=virtio
+
+Here 'a-data.img' from datavol gluster volume appears as a 2nd drive for the guest.
+
+It is also possible to make use of libvirt to define a disk and use it with qemu:
+
+
+### Create libvirt XML to define Virtual Machine
+
+virt-install is a Python wrapper which is mostly used to create a VM from a set of parameters. However, virt-install doesn't support any network filesystem [ https://bugzilla.redhat.com/show_bug.cgi?id=1017308 ].
+
+Create a libvirt VM XML file (see http://libvirt.org/formatdomain.html) where the disk section is formatted such that the qemu driver for glusterfs is used. This can be seen in the following example XML description:
+
+
+ <disk type='network' device='disk'>
+ <driver name='qemu' type='raw' cache='none'/>
+ <source protocol='gluster' name='distrepvol/vm3.img'>
+ <host name='10.70.37.106' port='24007'/>
+ </source>
+ <target dev='vda' bus='virtio'/>
+ <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
+ </disk>
+
+
+
+
+
+* Define the VM from the XML file that was created earlier
+
+
+ virsh define <xml-file-description>
+
+* Verify that the VM is created successfully
+
+
+ virsh list --all
+
+* Start the VM
+
+
+ virsh start <VM>
+
+* Verification
+
+You can verify the disk image file that is being used by VM
+
+ virsh domblklist <VM-Domain-Name/ID>
+
+The above should show the volume name and the image name. Here is an example:
+
+
+ [root@test ~]# virsh domblklist vm-test2
+ Target Source
+ ------------------------------------------------
+ vda distrepvol/test.img
+ hdc -
+
+
+Reference:
+
+For more details on this feature implementation and its advantages, please refer:
+
+- http://raobharata.wordpress.com/2012/10/29/qemu-glusterfs-native-integration/
+- [Libgfapi_with_qemu_libvirt](../Feature Planning/GlusterFS 3.5/libgfapi with qemu libvirt.md)
diff --git a/done/Features/quota-object-count.md b/done/Features/quota-object-count.md
new file mode 100644
index 0000000..063aa7c
--- /dev/null
+++ b/done/Features/quota-object-count.md
@@ -0,0 +1,47 @@
+Previous mechanism:
+====================
+
+The only way we could retrieve the number of files/objects in a directory or volume was to crawl the entire directory/volume. That was expensive and not scalable.
+
+New Design Implementation:
+==========================
+The proposed mechanism will provide an easier alternative to determine the count of files/objects in a directory or volume.
+
+The new mechanism will store the count of objects/files as part of an extended attribute of a directory. Each directory's extended attribute value will indicate the number of files/objects present in a tree, with the directory being considered as the root of the tree.
+
+Inode quota management
+======================
+
+**setting limits**
+
+Syntax:
+*gluster volume quota <volname\> limit-objects <path\> <number\>*
+
+Details:
+<number\> is a hard-limit on the number of objects for the path <path\>. If the hard-limit is exceeded, creation of files or directories is no longer permitted.
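+
+For example, to allow at most 10 objects under /dir of a volume named vol1 (the names are illustrative):
+
+    gluster volume quota vol1 limit-objects /dir 10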
+
+**list-objects**
+
+Syntax:
+*gluster volume quota <volname\> list-objects \[path\] ...*
+
+Details:
+If a path is not specified, then all the directories which have an object limit set on them will be displayed. If a path is provided, then only that particular path is displayed along with its associated details.
+
+Sample output:
+
+ Path Hard-limit Soft-limit Files Dirs Available Soft-limit exceeded? Hard-limit exceeded?
+ ---------------------------------------------------------------------------------------------------------------------------------------------
+ /dir 10 80% 0 1 9 No No
+
+**Deleting limits**
+
+Syntax:
+*gluster volume quota <volname\> remove-objects <path\>*
+
+Details:
+This will remove the object limit set on the specified path.
+
+Note: There is a known issue associated with remove-objects. When both a usage limit and an object limit are set on a path, removal of either limit will lead to removal of the other limit as well. This is tracked in bug #1202244.
+
+
diff --git a/done/Features/quota-scalability.md b/done/Features/quota-scalability.md
new file mode 100644
index 0000000..e47c898
--- /dev/null
+++ b/done/Features/quota-scalability.md
@@ -0,0 +1,52 @@
+Issues with the older implementation:
+-----------------------------------
+* **Enforcement of quota was done on the client side.** This had the following two issues:
+    * All clients are not trusted and hence enforcement is not secure.
+    * The quota enforcer caches directory sizes for a certain time-out period to reduce the network calls needed to fetch the size. On time-out, this cache is validated by querying the server. With more clients, the traffic caused by this validation increases.
+
+* **Relying on lookup calls on a file/directory (inode) to update its contribution** [time consuming]
+
+* **Hard limits were stored in a comma-separated list.**
+    * Hence, changing the hard limit of one directory is not an independent operation and would invalidate the hard limits of other directories. We need to parse the string once for each of these directories just to identify whether its hard limit has changed. This limits the number of hard limits we can configure.
+
+* **The CLI used to fetch the list of directories on which a quota limit is set from glusterd.**
+    * With a larger number of limits, the network overhead incurred to fetch this list limits the number of directories on which we can set quota.
+
+* **Problem with NFS mounts**
+    * Quota, for its enforcement and accounting, requires all the ancestors of a file/directory up to the root. However, with NFS relying heavily on nameless lookups (through which there is no guarantee that the ancestry can be accessed), this ancestry is not always present. Hence accounting and enforcement were not correct.
+
+
+New Design Implementation:
+--------------------------------
+
+* Quota enforcement is moved to server side. This addresses issues that arose because of client side enforcement.
+
+* Two levels of quota limits, soft and hard quota, are introduced.
+ This will result in a message being logged on reaching the soft quota, and writes will fail with EDQUOT after the hard limit is reached.
+
+Work Flow
+-----------------
+
+* Accounting
+
+    This is done using the marker translator loaded on each brick of the volume. Accounting happens in the background, i.e. it doesn't happen in-flight with the file operation, so the file operation's latency is not directly affected by the time taken to perform accounting. The update is sent recursively upwards up to the root of the volume.
+
+* Enforcement
+
+    The enforcer updates its (cached) 'view' of a directory's disk usage on the incidence of a file operation, after the expiry of the hard/soft timeout, depending on the current usage. The enforcer uses quotad to get the aggregated disk usage of a directory from the accounting information present on each brick (viz. provided by marker).
+
+* Aggregator (quotad)
+
+    Quotad is a daemon that serves the volume-wide disk usage of a directory on which quota is configured. It is present on all nodes in the cluster (trusted storage pool), as bricks don't have a global view of the cluster. Quotad queries the disk usage information from all the bricks in that volume and aggregates it. It manages all the volumes on which quota is enabled.
+
+
+Benefit to GlusterFS
+---------------------------------
+
+* Support for up to 65536 quota configurations per volume.
+* More quotas can be configured in a single volume, thereby making GlusterFS suitable for use cases like home directories.
+
+###For more information on quota usability, refer to the following link:
+> https://access.redhat.com/site/documentation/en-US/Red_Hat_Storage/2.1/html-single/Administration_Guide/index.html#chap-User_Guide-Dir_Quota-Enable
diff --git a/done/Features/rdmacm.md b/done/Features/rdmacm.md
new file mode 100644
index 0000000..2c287e8
--- /dev/null
+++ b/done/Features/rdmacm.md
@@ -0,0 +1,26 @@
+## Rdma Connection manager ##
+
+### What? ###
+Infiniband requires addresses of end points to be exchanged using an out-of-band channel (like tcp/ip). Glusterfs used a custom protocol over a tcp/ip channel to exchange this address. However, librdmacm provides the same functionality with the advantage of being a standard protocol. This helps if we want to communicate with a non-glusterfs entity (say nfs client with gluster nfs server) over infiniband.
+
+### Dependencies ###
+* [IP over Infiniband](http://pkg-ofed.alioth.debian.org/howto/infiniband-howto-5.html) - The value to *option* **remote-host** in glusterfs transport configuration should be an IPoIB address
+* [rdma cm kernel module](http://pkg-ofed.alioth.debian.org/howto/infiniband-howto-4.html#ss4.4)
+* [user space rdmacm library - librdmacm](https://www.openfabrics.org/downloads/rdmacm)
+
+### rdma-cm in >= GlusterFs 3.4 ###
+
+Following is the impact of http://review.gluster.org/#change,149.
+
+New userspace packages needed:
+librdmacm
+librdmacm-devel
+
+### Limitations ###
+
+* Because of bug [890502](https://bugzilla.redhat.com/show_bug.cgi?id=890502), we have to probe the peer on an IPoIB address. This imposes a restriction that all volumes created in the future have to communicate over the IPoIB address (irrespective of whether they use gluster's tcp or rdma transport).
+
+* Currently the client has the independence to choose between tcp and rdma transports while communicating with the server (by creating volumes with **transport-type tcp,rdma**). This independence was a by-product of our ability to use the tcp/ip channel - transports with *option transport-type tcp* - for the rdma connection establishment handshake too. However, with the new requirement of an IPoIB address for connection establishment, we lose this independence (till we bring in [multi-network support](https://bugzilla.redhat.com/show_bug.cgi?id=765437) - where a brick can be identified by a set of ip-addresses and we can choose different pairs of ip-addresses for communication based on our requirements - in glusterd).
+
+### External links ###
+* [Infiniband Howto](http://pkg-ofed.alioth.debian.org/howto/infiniband-howto.html)
diff --git a/done/Features/readdir-ahead.md b/done/Features/readdir-ahead.md
new file mode 100644
index 0000000..5302a02
--- /dev/null
+++ b/done/Features/readdir-ahead.md
@@ -0,0 +1,14 @@
+## Readdir-ahead ##
+
+### Summary ###
+Provide read-ahead support for directories to improve sequential directory read performance.
+
+### Owners ###
+Brian Foster
+
+### Detailed Description ###
+The read-ahead feature for directories is analogous to read-ahead for files. The objective is to detect sequential directory read operations and establish a pipeline for directory content. When a readdir request is received and fulfilled, preemptively issue subsequent readdir requests to the server in anticipation of those requests from the user. If sequential readdir requests are received, the directory content is already immediately available in the client. If subsequent requests are not sequential or not received, said data is simply dropped and the optimization is bypassed.
+
+readdir-ahead is currently disabled by default. It can be enabled with the following command:
+
+ gluster volume set <volname> readdir-ahead on
diff --git a/done/Features/rebalance.md b/done/Features/rebalance.md
new file mode 100644
index 0000000..e7212d4
--- /dev/null
+++ b/done/Features/rebalance.md
@@ -0,0 +1,74 @@
+## Background
+
+
+For a more detailed description, view Jeff Darcy's blog post [here](http://pl.atyp.us/hekafs.org/index.php/2012/03/glusterfs-algorithms-distribution/).
+
+GlusterFS uses the distribute translator (DHT) to aggregate space of multiple servers. DHT distributes files among its subvolumes using a consistent hashing method providing 32-bit hashes. Each DHT subvolume is given a range in the 32-bit hash space. A hash value is calculated for every file based on its name. The file is then placed in the subvolume whose hash range contains that hash value.
+
+## What is rebalance?
+
+The rebalance process migrates files between the DHT subvolumes when necessary.
+
+## When is rebalance required?
+
+Rebalancing is required for two main cases.
+
+1. Addition/Removal of bricks
+
+2. Renaming of a file
+
+## Addition/Removal of bricks
+
+Whenever the number or order of DHT subvolumes change, the hash range given to each subvolume is recalculated. When this happens, already existing files on the volume will need to be moved to the correct subvolume based on their hash. Rebalance does this activity.
+
+Addition of bricks which increases the size of a volume will increase the number of DHT subvolumes and lead to recalculation of hash ranges (this doesn't happen when bricks are added to a volume to increase redundancy, i.e. to increase the replica count of a volume). This will require an explicit rebalance command to be issued to migrate the files.
+
+Removal of bricks which decreases the size of a volume also causes the hash ranges of DHT to be recalculated. But we don't need to issue an explicit rebalance command in this case, as rebalance is done automatically by the remove-brick process if needed.
+
+## Renaming of a file
+
+Renaming a file will cause its hash to change. The file now needs to be moved to the correct subvolume based on its new hash. Rebalance does this.
+
+## How does rebalance work?
+
+At a high level, the rebalance process consists of the following 3 steps:
+
+1. Crawl the volume to access all files
+2. Calculate the hash for each file
+3. If needed, migrate the file to the correct subvolume.
+
+
+The rebalance process has been optimized by making it distributed across the trusted storage pool. With distributed rebalance, a rebalance process is launched on each peer in the cluster. Each rebalance process will crawl files on only those bricks of the volume which are present on it, and migrate the files which need migration to the correct brick. This speeds up the rebalance process considerably.
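+
+In practice, a rebalance is started and monitored per volume with the rebalance CLI, for example:
+
+    gluster volume rebalance <volname> start
+    gluster volume rebalance <volname> status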
+
+## What will happen if rebalance is not run?
+
+### Addition of bricks
+
+With the current implementation of add-brick, when the size of a volume is augmented by adding new bricks, the new bricks are not put into use immediately, i.e., the hash ranges are not recalculated immediately. This means that the files will still be placed only onto the existing bricks, leaving the newly added storage space unused. Starting a rebalance process on the volume will cause the hash ranges to be recalculated with the new bricks included, which allows the newly added storage space to be used.
+
+### Renaming a file
+
+When a file rename causes the file to be hashed to a new subvolume, DHT writes a link file on the new subvolume leaving the actual file on the original subvolume. A link file is an empty file, which has an extended attribute set that points to the subvolume on which the actual file exists. So, when a client accesses the renamed file, DHT first looks for the file in the hashed subvolume and gets the link file. DHT understands the link file, and gets the actual file from the subvolume pointed to by the link file. This leads to a slight reduction in performance. A rebalance will move the actual file to the hashed subvolume, allowing clients to access the file directly once again.
+
+## Are clients affected during a rebalance process?
+
+The rebalance process is transparent to applications on the clients. Applications which have open files on the volume will not be affected by the rebalance process, even if the open file requires migration. The DHT translator on the client will hide the migration from the applications.
+
+## How are open files migrated?
+
+(A more technical description of the algorithm used can be seen in the commit message of commit a07bb18c8adeb8597f62095c5d1361c5bad01f09.)
+
+To achieve migration of open files, two things need to be ensured:
+a) any writes or changes happening to the file during migration are correctly synced to the destination subvolume after the migration is complete, and
+b) any further changes are made to the destination subvolume.
+
+Both of these requirements require sending notifications to clients. Clients are notified by overloading an attribute used in every callback function. DHT understands these attributes in the callbacks and can tell whether a file is being migrated or not.
+
+During rebalance, a file will be in two phases
+
+1. Migration in process - In this phase the file is being migrated by the rebalance process from the source subvolume to the destination subvolume. The rebalance process will set an 'in-migration' attribute on the file, which will notify the clients' DHT translator. The clients' DHT translator will then take care to send any further changes to the destination subvolume as well. This way the first requirement is satisfied.
+
+2. Migration completed - Once the file has been migrated, the rebalance process will set a 'migration-complete' attribute on the file. The clients will be notified of the completion and all further operations on the file will happen on the destination subvolume.
+
+The DHT translator handles the above and allows the applications on the clients to continue working on a file under migration.
diff --git a/done/Features/server-quorum.md b/done/Features/server-quorum.md
new file mode 100644
index 0000000..7b20084
--- /dev/null
+++ b/done/Features/server-quorum.md
@@ -0,0 +1,44 @@
+# Server Quorum
+
+Server quorum is a feature intended to reduce the occurrence of "split brain"
+after a brick failure or network partition. Split brain happens when different
+sets of servers are allowed to process different sets of writes, leaving data
+in a state that can not be reconciled automatically. The key to avoiding split
+brain is to ensure that there can be only one set of servers - a quorum - that
+can continue handling writes. Server quorum does this by the brutal but
+effective means of forcing down all brick daemons on cluster nodes that can no
+longer reach enough of their peers to form a majority. Because there can only
+be one majority, there can be only one set of bricks remaining, and thus split
+brain can not occur.
+
+## Options
+
+Server quorum is controlled by two parameters:
+
+ * **cluster.server-quorum-type**
+
+ This value may be "server" to indicate that server quorum is enabled, or
+ "none" to mean it's disabled.
+
+ * **cluster.server-quorum-ratio**
+
+ This is the percentage of cluster nodes that must be up to maintain quorum.
+ More precisely, this percentage of nodes *plus one* must be up.
+
+Note that these are cluster-wide flags. All volumes served by the cluster will
+be affected. Once these values are set, quorum actions - starting or stopping
+brick daemons in response to node or network events - will be automatic.
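+
+As a sketch (the volume name and ratio are illustrative), the options are applied through the usual volume-set interface; the ratio, being cluster-wide, is typically set against the special volume name "all":
+
+    gluster volume set <volname> cluster.server-quorum-type server
+    gluster volume set all cluster.server-quorum-ratio 51%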
+
+## Best Practices
+
+If a cluster with an even number of nodes is split exactly down the middle,
+neither half can have quorum (which requires **more than** half of the total).
+This is particularly important when N=2, in which case the loss of either node
+leads to loss of quorum. Therefore, it is highly advisable to ensure that the
+cluster size is three or greater. The "extra" node in this case need not have
+any bricks or serve any data. It need only be present to preserve the notion
+of a quorum majority less than the entire cluster membership, allowing the
+cluster to survive the loss of a single node without losing quorum.
+
+
+
diff --git a/done/Features/shard.md b/done/Features/shard.md
new file mode 100644
index 0000000..2bdf6ce
--- /dev/null
+++ b/done/Features/shard.md
@@ -0,0 +1,68 @@
+### Sharding xlator (Stripe 2.0)
+
+GlusterFS's answer to very large files (those which can grow beyond a
+single brick) has never been clear. There is a stripe xlator which allows you to
+do that, but that comes at a cost of flexibility - you can add servers only in
+multiple of stripe-count x replica-count, mixing striped and unstriped files is
+not possible in an "elegant" way. This also happens to be a big limiting factor
+for the big data/Hadoop use case where super large files are the norm (and where
+you want to split a file even if it could fit within a single server.)
+
+The proposed solution for this is to replace the current stripe xlator with a
+new Shard xlator. Unlike the stripe xlator, Shard is not a cluster xlator. It is
+placed on top of DHT. Initially all files will be created as normal files, even
+up to a certain configurable size. The first block (default 4MB) will be stored
+like a normal file. However further blocks will be stored in a file, named by
+the GFID and block index in a separate namespace (like /.shard/GFID1.1,
+/.shard/GFID1.2 ... /.shard/GFID1.N). File IO happening to a particular offset
+will write to the appropriate "piece file", creating it if necessary. The
+aggregated file size and block count will be stored in the xattr of the original
+(first block) file.
+
+The advantage of such a model:
+
+- Data blocks are distributed by DHT in a "normal way".
+- Adding servers can happen in any number (even one at a time) and DHT's
+ rebalance will spread out the "piece files" evenly.
+- Self-healing of a large file is now more distributed into smaller files across
+ more servers.
+- piece file naming scheme is immune to renames and hardlinks.
+
+Source: https://gist.github.com/avati/af04f1030dcf52e16535#sharding-xlator-stripe-20
+
+## Usage:
+
+Shard translator is disabled by default. To enable it on a given volume, execute:
+
+ gluster volume set <VOLNAME> features.shard on
+
+The default shard block size is 4MB. To modify it, execute:
+
+ gluster volume set <VOLNAME> features.shard-block-size <value>
+
+When a file is created in a volume with sharding enabled, its block size is
+persisted in its xattr on the first block. This property of the file will remain
+even if the shard-block-size for the volume is reconfigured later.
+
+If you want to disable sharding on a volume, it is advisable to create a new
+volume without sharding and copy out contents of this volume into the new
+volume.
+
+## Note:
+
+* Shard translator is still a beta feature in 3.7.0 and will be possibly fully
+ supported in one of the 3.7.x releases.
+* It is advisable to use shard translator in volumes with replication enabled
+ for fault tolerance.
+
+## TO-DO:
+
+* Complete implementation of zerofill, discard and fallocate fops.
+* Introduce caching and its invalidation within shard translator to store size
+ and block count of shard'ed files.
+* Make shard translator work for non-Hadoop and non-VM use cases where there are
+ multiple clients operating on the same file.
+* Serialize appending writes.
+* Manage recovery of size and block count better in the face of faults during
+ ongoing inode write fops.
+* Anything else that could crop up later :) \ No newline at end of file
diff --git a/done/Features/tier.md b/done/Features/tier.md
new file mode 100644
index 0000000..d44af09
--- /dev/null
+++ b/done/Features/tier.md
@@ -0,0 +1,170 @@
+##Tiering
+
+####Feature page:
+
+[http://gluster.readthedocs.org/en/latest/Feature Planning/GlusterFS 3.7/Data Classification/](http://gluster.readthedocs.org/en/latest/Feature Planning/GlusterFS 3.7/Data Classification/)
+
+#####Design:
+
+[goo.gl/bkU5qv](goo.gl/bkU5qv)
+
+###Theory of operation
+
+
+The tiering feature enables different storage types to be used by the same
+logical volume. In Gluster 3.7, the two types are classified as "cold" and
+"hot", and are represented as two groups of bricks. The hot group acts as
+a cache for the cold group. The bricks within the two groups themselves are
+arranged according to standard Gluster volume conventions, e.g. replicated,
+distributed replicated, or dispersed.
+
+A normal gluster volume can become a tiered volume by "attaching" bricks
+to it. The attached bricks become the "hot" group. The bricks within the
+original gluster volume are the "cold" bricks.
+
+For example, the original volume may be dispersed on HDD, and the "hot"
+tier could be distributed-replicated SSDs.
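+
+As an illustrative sketch only (the exact CLI syntax depends on the release, and
+the host names, brick paths and volume name below are hypothetical), attaching a
+replicated pair of SSD bricks as the hot tier of an existing volume uses the
+attach-tier command introduced in the 3.7 series:
+
+    gluster volume attach-tier myvol replica 2 \
+        ssd-node1:/bricks/hot1 ssd-node2:/bricks/hot2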
+
+Once this new "tiered" volume is built, I/Os to it are subjected to caching
+heuristics:
+
+* All I/Os are forwarded to the hot tier.
+
+* If a lookup to the hot tier fails, the I/O will be forwarded to the cold
+tier. This is a "cache miss".
+
+* Files on the hot tier that are not touched within some time are demoted
+(moved) to the cold tier (see performance parameters, below).
+
+* Files on the cold tier that are touched one or more times are promoted
+(moved) to the hot tier. (see performance parameters, below).
+
+This resembles implementations by Ceph and the Linux device mapper (DM)
+component.
+
+Performance enhancements being considered include:
+
+* Biasing migration of large files over small.
+
+* Only demoting when the hot tier is close to full.
+
+* Write-back cache for database updates.
+
+###Code organization
+
+The design endeavors to be upward compatible with future migration policies,
+such as scheduled file migration, data classification, etc. For example,
+the caching logic is self-contained and separate from the file migration. A
+different set of migration policies could use the same underlying migration
+engine. The I/O tracking and metadata store components are intended to be
+reusable for purposes besides caching semantics.
+
+####Libgfdb:
+
+Libgfdb provides an abstract mechanism to record the extra/rich metadata
+required for data maintenance, such as data tiering/classification.
+It provides the consumer with an API for recording and querying, keeping
+the consumer abstracted from the data store used underneath.
+It works in a plug-and-play model, where data stores can be plugged in.
+Presently there is a plugin for SQLite3; in the future a recording and
+querying performance optimizer will be provided. In the current
+implementation the schema of the metadata is fixed.
+
+####Schema:
+
+ GF_FILE_TB Table:
+ This table has one entry per file inode. It holds the metadata required to
+ make decisions in data maintenance.
+ GF_ID (Primary key) : File GFID (Universal Unique IDentifier in the namespace)
+ W_SEC, W_MSEC : Write wind time in sec & micro-sec
+ UW_SEC, UW_MSEC : Write un-wind time in sec & micro-sec
+ W_READ_SEC, W_READ_MSEC : Read wind time in sec & micro-sec
+ UW_READ_SEC, UW_READ_MSEC : Read un-wind time in sec & micro-sec
+ WRITE_FREQ_CNTR INTEGER : Write Frequency Counter
+ READ_FREQ_CNTR INTEGER : Read Frequency Counter
+
+ GF_FLINK_TABLE:
+ This table has all the hardlinks to a file inode.
+    GF_ID : File GFID (Composite Primary Key) ---|
+    GF_PID : Parent Directory GFID (Composite Primary Key) |-> Primary Key
+    FNAME : File Base Name (Composite Primary Key) ---|
+    FPATH : File Full Path (It is redundant for now; this will go)
+    W_DEL_FLAG : This flag is used for crash consistency when a link is unlinked,
+                 i.e. it is set to 1 during the unlink wind and the record is deleted during the unwind
+    LINK_UPDATE : This flag is used when a link is changed, i.e. renamed.
+                  Set to 1 in the rename wind and set back to 0 in the rename unwind
+
+Libgfdb API:
+Refer to libglusterfs/src/gfdb/gfdb_data_store.h
+
+####ChangeTimeRecorder (CTR) Translator:
+
+ChangeTimeRecorder (CTR) is a server-side xlator (translator) which sits
+just above the posix xlator. The main role of this xlator is to record the
+access/write patterns on files residing on the brick. It records the
+read (data only) and write (data and metadata) times and also counts
+how many times a file is read or written. This xlator also captures
+the hard links to a file (as this is required by data tiering to move
+files).
+
+CTR Xlator is the consumer of libgfdb.
+
+To Enable/Disable CTR Xlator:
+
+    gluster volume set <volume-name> features.ctr-enabled {on/off}
+
+To Enable/Disable Frequency Counter Recording in CTR Xlator:
+
+    gluster volume set <volume-name> features.record-counters {on/off}
+
+
+####Migration daemon:
+
+When a tiered volume is created, a migration daemon starts. There is one daemon
+for every tiered volume per node. The daemon sleeps and then periodically
+queries the database for files to promote or demote. The query callback
+assembles the files into a list, which is then enumerated. The frequency with
+which promotions and demotions happen is subject to user configuration.
+
+Selected files are migrated between the tiers using existing DHT migration
+logic. The tier translator will leverage DHT rebalance performance
+enhancements.
+
+Configurables for the migration daemon:
+
+    gluster volume set <volume-name> cluster.tier-demote-frequency <SECS>
+
+    gluster volume set <volume-name> cluster.tier-promote-frequency <SECS>
+
+    gluster volume set <volume-name> cluster.read-freq-threshold <COUNT>
+
+    gluster volume set <volume-name> cluster.write-freq-threshold <COUNT>
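+
+For example, demoting idle files once an hour while checking for promotion
+candidates every two minutes might look like the following (the volume name and
+values are purely illustrative):
+
+    gluster volume set myvol cluster.tier-demote-frequency 3600
+    gluster volume set myvol cluster.tier-promote-frequency 120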
+
+
+####Tier Translator:
+
+The tier translator is the root node in tiered volumes. The first subvolume
+is the cold tier, and the second is the hot tier. DHT logic for forwarding I/Os
+is largely unchanged. Exceptions are handled according to the dht_methods_t
+structure, which forks control according to DHT or tier type.
+
+The major exception is that DHT's layout is not utilized for choosing hashed
+subvolumes. Rather, the hot tier is always the hashed subvolume.
+
+Changes to DHT were made to allow "stacking", i.e. DHT over DHT:
+
+* readdir operations remember the index of the "leaf node" in the volume graph
+(client id), rather than a unique index for each DHT instance.
+
+* Each DHT instance uses a unique extended attribute for tracking migration.
+
+* In certain cases, it is legal for tiered volumes to have unpopulated inodes
+(whereas this would be an error in DHT's case).
+
+Currently, changing the brick count of a tiered volume (adding or removing bricks) is unsupported.
+
+####glusterd:
+
+The tiered volume tree is a composition of two other volumes. The glusterd
+daemon builds it. Existing logic for adding and removing bricks is heavily
+leveraged to attach and detach tiers, and perform statistics collection.
diff --git a/done/Features/trash_xlator.md b/done/Features/trash_xlator.md
new file mode 100644
index 0000000..3e38e87
--- /dev/null
+++ b/done/Features/trash_xlator.md
@@ -0,0 +1,80 @@
+Trash Translator
+=================
+
+The trash translator allows users to access deleted or truncated files. Every brick maintains a hidden .trashcan directory, which is used to store the files deleted or truncated from the respective brick. The aggregate of all those .trashcan directories can be accessed from the mount point. In order to avoid name collisions, a timestamp is appended to the original file name while it is being moved to the trash directory.
+
+##Implications and Usage
+Apart from the primary use-case of accessing files deleted or truncated by the user, the trash translator can be helpful for internal operations such as self-heal and rebalance. During self-heal and rebalance it is possible to lose crucial data; in those circumstances the trash translator can assist in recovery of the lost data. The trash translator is designed to intercept unlink, truncate and ftruncate fops, store a copy of the current file in the trash directory, and then perform the fop on the original file. For internal operations, the files are stored under the 'internal_op' folder inside the trash directory.
+
+##Volume Options
+1. *gluster volume set &lt;VOLNAME> features.trash &lt;on | off>*
+
+    This command can be used to enable the trash translator in a volume. If set to on, a trash directory will be created in every brick inside the volume during the volume start command. By default the translator is loaded during volume start but remains non-functional. Disabling trash with the help of this option will not remove the trash directory or even its contents from the volume.
+
+2. *gluster volume set &lt;VOLNAME> features.trash-dir &lt;name>*
+
+ This command is used to reconfigure the trash directory to a user specified name. The argument is a valid directory name. Directory will be created inside every brick under this name. If not specified by the user, the trash translator will create the trash directory with the default name “.trashcan”. This can be used only when trash-translator is on.
+
+3. *gluster volume set &lt;VOLNAME> features.trash-max-filesize &lt;size>*
+
+    This command can be used to filter files entering the trash directory based on their size. Files above trash-max-filesize are deleted/truncated directly. The value for size may be followed by the multiplicative suffixes KB (=1024), MB (=1024*1024) and GB (=1024*1024*1024). The default size is set to 5MB. As of now any value specified higher than 1GB will be capped at 1GB (see the example after this list).
+
+4. *gluster volume set &lt;VOLNAME> features.trash-eliminate-path &lt;path1> [ , &lt;path2> , . . . ]*
+
+    This command can be used to set the eliminate pattern for the trash translator. Files residing under this pattern will not be moved to the trash directory during deletion/truncation. The path must be a valid one present in the volume.
+
+5. *gluster volume set &lt;VOLNAME> features.trash-internal-op &lt;on | off>*
+
+ This command can be used to enable trash for internal operations like self-heal and re-balance. By default set to off.
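+
+For illustration, the size filter and eliminate pattern from options 3 and 4
+above could be applied to the volume named `test` used in the Testing section
+below (the 200MB limit and the path are hypothetical values):
+
+    gluster volume set test features.trash-max-filesize 200MB
+    gluster volume set test features.trash-eliminate-path /scratch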
+
+##Testing
+The following steps illustrate a simple scenario of deleting a file from a directory.
+
+1. Create a volume and start it.
+
+ gluster volume create test rhs:/home/brick
+
+ gluster volume start test
+
+2. Enable trash translator
+
+        gluster volume set test features.trash on
+
+3. Mount glusterfs client as follows.
+
+ mount -t glusterfs rhs:test /mnt
+
+4. Create a directory and file in the mount.
+
+        mkdir /mnt/dir
+
+        echo abc > /mnt/dir/file
+
+5. Delete the file from the mount.
+
+        rm -f /mnt/dir/file
+
+6. Check inside the trash directory.
+
+        ls /mnt/.trashcan
+
+We can find the deleted file inside the trash directory with a timestamp appended to its filename.
+
+For example,
+
+ [root@rh-host ~]#mount -t glusterfs rh-host:/test /mnt/test
+ [root@rh-host ~]#mkdir /mnt/test/abc
+ [root@rh-host ~]#touch /mnt/test/abc/file
+    [root@rh-host ~]#rm /mnt/test/abc/file
+ remove regular empty file ‘/mnt/test/abc/file’? y
+ [root@rh-host ~]#ls /mnt/test/abc
+ [root@rh-host ~]#
+ [root@rh-host ~]#ls /mnt/test/.trashcan/abc/
+ file2014-08-21_123400
+
+##Points to be remembered
+[1] As soon as the volume is started, the trash directory will be created inside the volume and will be visible through the mount. Disabling trash will not have any impact on its visibility from the mount.
+[2] Even though deletion of the trash directory is not permitted, issuing a delete on it will remove its current contents, leaving behind an empty trash directory.
+
+##Known issues
+[1] Since the trash translator resides on the server side, the DHT translator is unaware of the rename and truncate operations done by this translator, which eventually move files to the trash directory. Until a complete path-based lookup is performed on trashed files, they may not be visible from the mount.
diff --git a/done/Features/upcall.md b/done/Features/upcall.md
new file mode 100644
index 0000000..5e9ced2
--- /dev/null
+++ b/done/Features/upcall.md
@@ -0,0 +1,38 @@
+## Upcall
+
+### Summary
+
+A generic and extensible framework, used to maintain state in the glusterfsd process for each of the files accessed (including information about the clients performing the fops) and to send notifications to the respective glusterfs clients in case of any change in that state.
+
+A few of the use-cases currently making use of this infrastructure are:
+
+- Inode Update/Invalidation
+
+### Detailed Description
+
+GlusterFS, a scale-out storage platform, is a distributed file system which follows the client-server architectural model.
+
+It is the client (glusterfs) which usually initiates an RPC request to the server (glusterfsd). After processing the request, a reply is sent to the client as the response to that same request. Until now, there was no interface or use-case for the server to notify or make a request to the client.
+
+This support is now being added using the **“Upcall Infrastructure”.**
+
+A new xlator (Upcall) has been defined to maintain and process the state of events which require the server to send upcall notifications. For each I/O on an inode, we create/update an ‘upcall_inode_ctx’ and store/update the list of clients’ info (‘upcall_client_t’) in that context.
+
+#### Cache Invalidation
+
+Each of the GlusterFS clients/applications caches certain state of the files (e.g., inode or attributes). In a multi-node environment these caches could lead to data-integrity issues for a period of time if multiple clients access the same file simultaneously.
+
+To avoid such scenarios, we need the server to notify clients in case of any change in the file state/attributes.
+
+More details can be found at the links below:
+ http://www.gluster.org/community/documentation/index.php/Features/Upcall-infrastructure
+ https://soumyakoduri.wordpress.com/2015/02/25/glusterfs-understanding-upcall-infrastructure-and-cache-invalidation-support/
+
+*cache-invalidation is currently disabled by default.*
+It can be enabled with the following command:
+
+ gluster volume set <volname> features.cache-invalidation on
+
+Note: This upcall notification is sent only to those clients which have accessed the file recently (i.e., within CACHE_INVALIDATE_PERIOD – default 60 sec). This option can be tuned using the following command:
+
+ gluster volume set <volname> features.cache-invalidation-timeout <value> \ No newline at end of file
diff --git a/done/Features/worm.md b/done/Features/worm.md
new file mode 100644
index 0000000..dba9977
--- /dev/null
+++ b/done/Features/worm.md
@@ -0,0 +1,75 @@
+#WORM (Write Once Read Many)
+This feature enables you to create a `WORM volume` using the gluster CLI.
+##Description
+WORM (write once, read many) is a desired feature for users who want to store data, such as `log files`, where the data is not allowed to be modified.
+
+GlusterFS provides a new key `features.worm` which takes boolean values (enable/disable) for volume set.
+
+Internally, the volume set command with the 'features.worm' key will add the 'features/worm' translator to the brick's volume file.
+
+`This change would be reflected on a subsequent restart of the volume`, i.e. gluster volume stop, followed by gluster volume start.
+
+With a volume converted to WORM, the changes are as follows:
+
+* Reads are handled normally.
+* Writes are supported only for files opened with the O_APPEND flag.
+* Truncation and deletion are not supported.
+
+##Volume Options
+Use the volume set command on a volume and then check whether the volume has actually been turned into WORM type.
+
+ # features.worm enable
+##Fully loaded Example
+The WORM feature is supported from glusterfs version 3.4.
+Start glusterd by using the command
+
+ # service glusterd start
+Now create a volume by using the command
+
+ # gluster volume create <vol_name> <brick_path>
+Start the volume by running the command below.
+
+ # gluster vol start <vol_name>
+Run the command below to make sure that the volume is created.
+
+ # gluster volume info
+Now turn on the WORM feature on the volume by using the command
+
+    # gluster vol set <vol_name> features.worm enable
+Verify that the option is set by using the command
+
+ # gluster volume info
+The user should be able to see another option in the volume info output
+
+ # features.worm: enable
+Now restart the volume for the changes to take effect, by performing a volume stop and start.
+
+    # gluster volume stop <vol_name>
+    # gluster volume start <vol_name>
+Now mount the volume using a FUSE mount
+
+    # mount -t glusterfs <hostname>:/<vol_name> <mnt_point>
+Create a file inside the mount point by running the command below
+
+ # touch <file_name>
+Verify that the file has been created by running the command below
+
+ # ls <file_name>
+
+##How To Test
+Now try deleting the file which was created above
+
+ # rm <file_name>
+Since WORM is enabled on the volume, it gives the following error message `rm: cannot remove '/<mnt_point>/<file_name>': Read-only file system`
+
+Append some content to the file created above.
+
+ # echo "at the end of the file" >> <file_name>
+Now try editing the file in place by running the command below and verify that it fails with a `Read-only file system` error
+
+ # sed -i "1iAt the beginning of the file" <file_name>
+Now read the contents of the file and verify that the file can be read.
+
+ cat <file_name>
+
+`Note: If the WORM option is set on the volume before it is started, then the volume need not be restarted for the changes to be reflected`.
diff --git a/done/Features/zerofill.md b/done/Features/zerofill.md
new file mode 100644
index 0000000..c0f1fc5
--- /dev/null
+++ b/done/Features/zerofill.md
@@ -0,0 +1,26 @@
+#zerofill API for GlusterFS
+The zerofill() API allows creation of pre-allocated and zeroed-out files on GlusterFS volumes by offloading the zeroing part to the server and/or storage (storage offloads use SCSI WRITESAME).
+## Description
+
+Zerofill writes zeroes to a file in the specified range. This fop will be useful when a whole file needs to be initialized with zero (could be useful for zero filled VM disk image provisioning or during scrubbing of VM disk images).
+
+The client/application can issue this FOP for zeroing out. The Gluster server will zero out the required range of bytes, i.e. server-offloaded zeroing. In the absence of this fop, the client/application has to repeatedly issue write (zero) fops to the server, which is a very inefficient method because of the overheads involved in RPC calls and acknowledgements.
+
+WRITESAME is a SCSI T10 command that takes a block of data as input and writes the same data to other blocks; this write is handled completely within the storage and hence is known as an offload. Linux now has support for the SCSI WRITESAME command, which is exposed to the user in the form of the BLKZEROOUT ioctl. The BD xlator can exploit the BLKZEROOUT ioctl to implement this fop. Thus zeroing-out operations can be completely offloaded to the storage device,
+making them highly efficient.
+
+The fop takes two arguments, offset and size. It zeroes out 'size' bytes in an opened file starting from the 'offset' position.
+This feature adds zerofill support to the following areas:
+> - libglusterfs
+- io-stats
+- performance/md-cache,open-behind
+- quota
+- cluster/afr,dht,stripe
+- rpc/xdr
+- protocol/client,server
+- io-threads
+- marker
+- storage/posix
+- libgfapi
+
+Client applications can exploit this fop by using glfs_zerofill introduced in libgfapi. FUSE support for this fop has not been added, as there is no corresponding system call.
diff --git a/done/GlusterFS 3.5/AFR CLI enhancements.md b/done/GlusterFS 3.5/AFR CLI enhancements.md
new file mode 100644
index 0000000..88f4980
--- /dev/null
+++ b/done/GlusterFS 3.5/AFR CLI enhancements.md
@@ -0,0 +1,204 @@
+Feature
+-------
+
+AFR CLI enhancements
+
+SUMMARY
+-------
+
+Presently the AFR reporting via CLI has many problems in the
+representation of logs, because of which users may not be able to use the
+data effectively. This feature is to correct these problems and provide
+a coherent mechanism to present heal status, information and the
+associated logs.
+
+Owners
+------
+
+Venkatesh Somayajulu
+Raghavan
+
+Current status
+--------------
+
+There are many bugs related to this which indicates the current status
+and why these requirements are required.
+
+​1) 924062 - gluster volume heal info shows only gfids in some cases and
+sometimes names. This is very confusing for the end user.
+
+​2) 852294 - gluster volume heal info hangs/crashes when there is a
+large number of entries to be healed.
+
+​3) 883698 - when self heal daemon is turned off, heal info does not
+show any output. But healing can happen because of lookups from IO path.
+Hence list of entries to be healed still needs to be shown.
+
+​4) 921025 - directories are not reported when list of split brain
+entries needs to be displayed.
+
+​5) 981185 - when self heal daemon process is offline, volume heal info
+gives error as "staging failure"
+
+​6) 952084 - We need a command to resolve files in split brain state.
+
+​7) 986309 - We need to report source information for files which got
+healed during a self heal session.
+
+​8) 986317 - Sometimes list of files to get healed also includes files
+to which IO is being done, since the entries for these files could be in
+the xattrop directory. This could be confusing for the user.
+
+There is a master bug 926044 that sums up most of the above problems. It
+does give the QA perspective of the current representation out of the
+present reporting infrastructure.
+
+Detailed Description
+--------------------
+
+​1) One common thread among all the above complaints is that the
+information presented to the user is <B>FUD</B> because of the following
+reasons:
+
+(a) Split brain itself is a scary scenario especially with VMs.
+(b) The data that we present to the users cannot be used in a stable
+ manner for them to get to the list of these files. <I>For ex:</I> we
+ need to give mechanisms by which he can automate the resolution out
+ of split brain.
+(c) The logs that are generated are all the more scarier since we
+ see repetition of some error lines running into hundreds of lines.
+ Our mailing lists are filled with such emails from end users.
+
+Any data is useless unless it is associated with an event. For self
+heal, the event that leads to self heal is the loss of connectivity to a
+brick from a client. So all healing info and especially split brain
+should be associated with such events.
+
+The following is hence the proposed mechanism:
+
+(a) Every loss of a brick from client's perspective is logged and
+ available via some ID. The information provides the time from when
+    the brick went down to when it came up. It should also report
+    the number of IO transactions (modifications) that happened during this
+ event.
+(b) The list of these events are available via some CLI command. The
+ actual command needs to be detailed as part of this feature.
+(c) All volume info commands regarding list of files to be healed,
+ files healed and split brain files should be associated with this
+ event(s).
+
+​2) Provide a mechanism to show statistics at a volume and replica group
+level. It should show the number of files to be healed and number of
+split brain files at both the volume and replica group level.
+
+​3) Provide a mechanism to show per volume list of files to be
+healed/files healed/split brain in the following info:
+
+This should have the following information:
+
+(a) File name
+(b) Bricks location
+(c) Event association (brick going down)
+(d) Source
+(e) Sink
+
+​4) Self heal crawl statistics - Introduce new CLI commands for showing
+more information on self heal crawl per volume.
+
+(a) Display why a self heal crawl ran (timeouts, brick coming up)
+(b) Start time and end time
+(c) Number of files it attempted to heal
+(d) Location of the self heal daemon
+
+​5) Scale the logging infrastructure to handle huge number of file list
+that needs to be displayed as part of the logging.
+
+(a) Right now the system crashes or hangs in case of a high number
+ of files.
+(b) It causes CLI timeouts arbitrarily. The latencies involved in
+ the logging have to be studied (profiled) and mechanisms to
+ circumvent them have to be introduced.
+(c) All files are displayed on the output. Have a better way of
+ representing them.
+
+Options are:
+
+(a) Maybe write to a glusterd log file or have a separate directory
+ for afr heal logs.
+(b) Have a status kind of command. This will display the current
+ status of the log building and maybe have batched way of
+ representing when there is a huge list.
+
+​6) We should provide mechanism where the user can heal split brain by
+some pre-established policies:
+
+(a) Let the system figure out the latest files (assuming all nodes
+ are in time sync) and choose the copies that have the latest time.
+(b) Choose one particular brick as the source for split brain and
+ heal all split brains from this brick.
+(c) Just remove the split brain information from the changelog. We leave
+    the exercise of repairing the split brain to the user, who would then
+    rewrite to the split-brained files (right now the user is forced to
+    remove xattrs manually for this step).
+
+Benefits to GlusterFS
+--------------------
+
+Makes the end user more aware of healing status and provides statistics.
+
+Scope
+-----
+
+6.1. Nature of proposed change
+
+Modification to AFR and CLI and glusterd code
+
+6.2. Implications on manageability
+
+New CLI commands to be added. Existing commands to be improved.
+
+6.3. Implications on presentation layer
+
+N/A
+
+6.4. Implications on persistence layer
+
+N/A
+
+6.5. Implications on 'GlusterFS' backend
+
+N/A
+
+6.6. Modification to GlusterFS metadata
+
+N/A
+
+6.7. Implications on 'glusterd'
+
+Changes for healing specific commands will be introduced.
+
+How To Test
+-----------
+
+See the documentation section
+
+User Experience
+---------------
+
+*Changes in CLI, effect on User experience...*
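+
+As an illustration of the intended user experience (the authoritative syntax is
+defined by the patches referenced below), the enhanced heal CLI exposes
+commands along these lines:
+
+    # gluster volume heal <VOLNAME> info
+    # gluster volume heal <VOLNAME> info split-brain
+    # gluster volume heal <VOLNAME> statistics
+    # gluster volume heal <VOLNAME> statistics heal-count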
+
+Documentation
+-------------
+
+<http://review.gluster.org/#/c/7792/1/doc/features/afr-statistics.md>
+
+Status
+------
+
+Patches :
+
+<http://review.gluster.org/6044> <http://review.gluster.org/4790>
+
+Status:
+
+Merged \ No newline at end of file
diff --git a/done/GlusterFS 3.5/Brick Failure Detection.md b/done/GlusterFS 3.5/Brick Failure Detection.md
new file mode 100644
index 0000000..9952698
--- /dev/null
+++ b/done/GlusterFS 3.5/Brick Failure Detection.md
@@ -0,0 +1,151 @@
+Feature
+-------
+
+Brick Failure Detection
+
+Summary
+-------
+
+This feature attempts to identify storage/file system failures and
+disable the failed brick without disrupting the remainder of the node's
+operation.
+
+Owners
+------
+
+Vijay Bellur with help from Niels de Vos (or the other way around)
+
+Current status
+--------------
+
+Currently, if an underlying storage or file system failure happens, the
+brick process will continue to function. In some cases, a brick can hang
+due to failures in the underlying system. Due to such hangs in brick
+processes, applications running on glusterfs clients can hang.
+
+Detailed Description
+--------------------
+
+Detecting failures on the filesystem that a brick uses makes it possible
+to handle errors that are caused from outside of the Gluster
+environment.
+
+There have been hanging brick processes when the underlying storage of a
+brick became unavailable. A hanging brick process can still use the
+network and respond to clients, but actual I/O to the storage is
+impossible and can cause noticeable delays on the client side.
+
+Benefit to GlusterFS
+--------------------
+
+Provide better detection of storage subsystem failures and prevent bricks
+from hanging.
+
+Scope
+-----
+
+### Nature of proposed change
+
+Add a health-checker to the posix xlator that periodically checks the
+status of the filesystem (implies checking of functional
+storage-hardware).
+
+### Implications on manageability
+
+When a brick process detects that the underlying storage is not
+responding anymore, the process will exit. There is no automated way
+for the brick process to get restarted; the sysadmin will need to fix the
+problem with the storage first.
+
+After correcting the storage (hardware or filesystem) issue, the
+following command will start the brick process again:
+
+ # gluster volume start <VOLNAME> force
+
+### Implications on presentation layer
+
+None
+
+### Implications on persistence layer
+
+None
+
+### Implications on 'GlusterFS' backend
+
+None
+
+### Modification to GlusterFS metadata
+
+None
+
+### Implications on 'glusterd'
+
+'glusterd' can detect that the brick process has exited,
+`gluster volume status` will show that the brick process is not running
+anymore. System administrators checking the logs should be able to
+triage the cause.
+
+How To Test
+-----------
+
+The health-checker thread that is part of each brick process will get
+started automatically when a volume has been started. Verifying its
+functionality can be done in different ways.
+
+On virtual hardware:
+
+- disconnect the disk from the VM that holds the brick
+
+On real hardware:
+
+- simulate a RAID-card failure by unplugging the card or cables
+
+On a system that uses LVM for the bricks:
+
+- use device-mapper to load an error-table for the disk, see [this
+ description](http://review.gluster.org/5176).
+
+On any system (writing to random offsets of the block device, more
+difficult to trigger):
+
+1. cause corruption on the filesystem that holds the brick
+2. read contents from the brick, hoping to hit the corrupted area
+3. the filesystem should abort after hitting a bad spot; the
+ health-checker should notice that shortly afterwards
+
+User Experience
+---------------
+
+No more hanging brick processes when storage-hardware or the filesystem
+fails.
+
+Dependencies
+------------
+
+Posix translator, not available for the BD-xlator.
+
+Documentation
+-------------
+
+The health-checker is enabled by default and runs a check every 30
+seconds. This interval can be changed per volume with:
+
+ # gluster volume set <VOLNAME> storage.health-check-interval <SECONDS>
+
+If `SECONDS` is set to 0, the health-checker will be disabled.
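+
+For example, to poll the brick filesystem every 10 seconds, or to disable the
+check entirely (the interval values are illustrative):
+
+    # gluster volume set <VOLNAME> storage.health-check-interval 10
+    # gluster volume set <VOLNAME> storage.health-check-interval 0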
+
+For further details refer:
+<https://forge.gluster.org/glusterfs-core/glusterfs/blobs/release-3.5/doc/features/brick-failure-detection.md>
+
+Status
+------
+
+glusterfs-3.4 and newer include a health-checker for the posix xlator,
+which was introduced with [bug
+971774](https://bugzilla.redhat.com/971774):
+
+- [posix: add a simple
+ health-checker](http://review.gluster.org/5176)?
+
+Comments and Discussion
+-----------------------
diff --git a/done/GlusterFS 3.5/Disk Encryption.md b/done/GlusterFS 3.5/Disk Encryption.md
new file mode 100644
index 0000000..4c6ab89
--- /dev/null
+++ b/done/GlusterFS 3.5/Disk Encryption.md
@@ -0,0 +1,443 @@
+Feature
+=======
+
+Transparent encryption. Allows a volume to be encrypted "at rest" on the
+server using keys only available on the client.
+
+1 Summary
+=========
+
+Distributed systems impose tighter requirements on at-rest encryption.
+This is because your encrypted data will be stored on servers, which are
+de facto untrusted. In particular, your private encrypted data can be
+subjected to analysis and tampering, which will eventually lead to it
+being revealed if it is not properly protected. Specifically, it is usually
+not enough to just encrypt data. In distributed systems serious
+protection of your personal data is possible only in conjunction with a
+special process, which is called authentication. GlusterFS provides such
+enhanced service: In GlusterFS encryption is enhanced with
+authentication. Currently we provide protection from "silent tampering".
+This is a kind of tampering, which is hard to detect, because it doesn't
+break POSIX compliance. Specifically, we protect the encryption-specific
+metadata of files. Such metadata includes the file's unique object id (GFID),
+cipher algorithm id, cipher block size and other attributes used by the
+encryption process.
+
+1.1 Restrictions
+----------------
+
+​1. We encrypt only file content. The feature of transparent encryption
+doesn't protect file names: they are neither encrypted, nor verified.
+Protection of file names is not so critical as protection of
+encryption-specific file's metadata: any attacks based on tampering file
+names will break POSIX compliance and result in massive corruption,
+which is easy to detect.
+
+​2. The feature of transparent encryption doesn't work in NFS-mounts of
+GlusterFS volumes: NFS's file handles introduce security issues, which
+are hard to resolve. NFS mounts of encrypted GlusterFS volumes will
+result in failed file operations (see section "Encryption in different
+types of mount sessions" for more details).
+
+​3. The feature of transparent encryption is incompatible with GlusterFS
+performance translators quick-read, write-behind and open-behind.
+
+2 Owners
+========
+
+Jeff Darcy <jdarcy@redhat.com>
+Edward Shishkin <eshishki@redhat.com>
+
+3 Current status
+================
+
+Merged to the upstream.
+
+4 Detailed Description
+======================
+
+See Summary.
+
+5 Benefit to GlusterFS
+======================
+
+Besides the justifications that have applied to on-disk encryption just
+about forever, recent events have raised awareness significantly.
+Encryption using keys that are physically present at the server leaves
+data vulnerable to physical seizure of the server. Encryption using keys
+that are kept by the same organization entity leaves data vulnerable to
+"insider threat" plus coercion or capture at the organization level. For
+many, especially various kinds of service providers, only pure
+client-side encryption provides the necessary levels of privacy and
+deniability.
+
+Competitively, other projects - most notably
+[Tahoe-LAFS](https://leastauthority.com/) - are already using recently
+heightened awareness of these issues to attract users who would be
+better served by our performance/scalability, usability, and diversity
+of interfaces. Only the lack of proper encryption holds us back in these
+cases.
+
+6 Scope
+=======
+
+6.1. Nature of proposed change
+------------------------------
+
+This is a new client-side translator, using user-provided key
+information plus information stored in xattrs to encrypt data
+transparently as it's written and decrypt when it's read.
+
+6.2. Implications on manageability
+----------------------------------
+
+User needs to manage a per-volume master key (MK). That is:
+
+​1) Generate an independent MK for every volume which is to be
+encrypted. Note, that one MK is created for the whole life of the
+volume.
+
+​2) Provide MK on the client side at every mount in accordance with the
+location, which has been specified at volume create time, or overridden
+via respective mount option (see section How To Test).
+
+​3) Keep MK between mount sessions. Note that after successful mount MK
+may be removed from the specified location. In this case user should
+retain MK safely till next mount session.
+
+MK is a 256-bit secret string, which is known only to user. Generating
+and retention of MK is in user's competence.
+
+WARNING!!! Losing MK will make content of all regular files of your
+volume inaccessible. It is possible to mount a volume with improper MK,
+however such mount sessions will allow access only to file names, as they
+are not encrypted.
+
+Recommendations on MK generation
+
+MK has to be a high-entropy key, appropriately generated by a key
+derivation algorithm. One of the possible ways is using rand(1) provided
+by the OpenSSL package. You need to specify the option "-hex" for proper
+output format. For example, the next command prints a generated key to
+the standard output:
+
+ $ openssl rand -hex 32
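+
+For instance, to generate a key and store it at the location that will later be
+provided as the master-key (the path below is hypothetical):
+
+    $ openssl rand -hex 32 > /home/user/mykey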
+
+6.3. Implications on presentation layer
+---------------------------------------
+
+N/A
+
+6.4. Implications on persistence layer
+--------------------------------------
+
+N/A
+
+6.5. Implications on 'GlusterFS' backend
+----------------------------------------
+
+All encrypted files on the servers contain padding at the end of the file.
+That is, the size of all encrypted files on the servers is a multiple of the
+cipher block size. The real file size is stored as a file xattr with the key
+"trusted.glusterfs.crypt.att.size". The translation padded-file-size -\>
+real-file-size (and backward) is performed by the crypt translator.
+
+6.6. Modification to GlusterFS metadata
+---------------------------------------
+
+Encryption-specific metadata in specified format is stored as file's
+xattr with the key "trusted.glusterfs.crypt.att.cfmt". Current format of
+metadata string is described in the slide \#27 of the following [ design
+document](http://www.gluster.org/community/documentation/index.php/File:GlusterFS_transparent_encryption.pdf)
+
+6.7. Options of the crypt translator
+------------------------------------
+
+- data-cipher-alg
+
+Specifies the cipher algorithm for file data encryption. Currently only one
+option is available: AES\_XTS. This is a hidden option.
+
+- block-size
+
+Specifies size (in bytes) of logical chunk which is encrypted as a whole
+unit in the file body. If cipher modes with initial vectors are used for
+encryption, then the initial vector gets reset for every such chunk.
+Available values are: "512", "1024", "2048" and "4096". Default value is
+"4096".
+
+- data-key-size
+
+Specifies size (in bits) of data cipher key. For AES\_XTS available
+values are: "256" and "512". Default value is "256". The larger key size
+("512") is for stronger security.
+
+- master-key
+
+Specifies pathname of the regular file, or symlink. Defines location of
+the master volume key on the trusted client machine.
+
+7 Getting Started With Crypt Translator
+=======================================
+
+​1. Create a volume <vol_name>.
+
+​2. Turn on crypt xlator:
+
+ # gluster volume set `<vol_name>` encryption on
+
+​3. Turn off performance xlators that currently encryption is
+incompatible with:
+
+ # gluster volume set <vol_name> performance.quick-read off
+ # gluster volume set <vol_name> performance.write-behind off
+ # gluster volume set <vol_name> performance.open-behind off
+
+​4. (optional) Set location of the volume master key:
+
+ # gluster volume set <vol_name> encryption.master-key <master_key_location>
+
+where <master_key_location> is an absolute pathname of the file, which
+will contain the volume master key (see section implications on
+manageability).
+
+​5. (optional) Override default options of crypt xlator:
+
+ # gluster volume set <vol_name> encryption.data-key-size <data_key_size>
+
+where <data_key_size> should have one of the following values:
+"256"(default), "512".
+
+ # gluster volume set <vol_name> encryption.block-size <block_size>
+
+where <block_size> should have one of the following values: "512",
+"1024", "2048", "4096"(default).
+
+​6. Define location of the master key on your client machine, if it
+wasn't specified at section 4 above, or you want it to be different from
+the <master_key_location>, specified at section 4.
+
+​7. On the client side make sure that the file with name
+<master_key_location> (or <master_key_new_location> defined at section
+6) exists and contains respective per-volume master key (see section
+implications on manageability). This key has to be in hex form, i.e.
+should be represented by 64 symbols from the set {'0', ..., '9', 'a',
+..., 'f'}. The key should start at the beginning of the file. All
+symbols at offsets \>= 64 are ignored.
+
+NOTE: <master_key_location> (or <master_key_new_location> defined at
+step 6) can be a symlink. In this case make sure that the target file of
+this symlink exists and contains respective per-volume master key.
+
+​8. Mount the volume <vol_name> on the client side as usual. If you
+specified a location of the master key at section 6, then use the mount
+option
+
+--xlator-option=<suffixed_vol_name>.master-key=<master_key_new_location>
+
+where <master_key_new_location> is location of master key specified at
+section 6, <suffixed_vol_name> is <vol_name> suffixed with "-crypt". For
+example, if you created a volume "myvol" in the step 1, then
+suffixed\_vol\_name is "myvol-crypt".
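+
+A minimal sketch of such a mount, assuming the mount.glusterfs helper is used to
+pass the xlator option through (the server name, volume name and key path below
+are hypothetical):
+
+    # mount -t glusterfs \
+        -o xlator-option=myvol-crypt.master-key=/home/user/mykey \
+        server1:/myvol /mnt/myvol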
+
+​9. During mount your client machine receives configuration info from
+the untrusted server, so this step is extremely important! Check that
+your volume is really encrypted, and that it is encrypted with the
+proper master key (see FAQ \#1,\#2).
+
+​10. (optional) After successful mount the file which contains master
+key may be removed. NOTE: Next mount session will require the master-key
+again. Keeping the master key between mount sessions is in user's
+competence (see section implications on manageability).
+
+8 How to test
+=============
+
+From a correctness standpoint, it's sufficient to run normal tests with
+encryption enabled. From a security standpoint, there's a whole
+discipline devoted to analysing the stored data for weaknesses, and
+engagement with practitioners of that discipline will be necessary to
+develop the right tests.
+
+9 Dependencies
+==============
+
+Crypt translator requires OpenSSL of version \>= 1.0.1
+
+10 Documentation
+================
+
+10.1 Basic design concepts
+--------------------------
+
+The basic design concepts are described in the following [pdf
+slides](http://www.gluster.org/community/documentation/index.php/File:GlusterFS_transparent_encryption.pdf)
+
+10.2 Procedure of security open
+-------------------------------
+
+So, in accordance with the basic design concepts above, before every
+access to a file's body (by read(2), write(2), truncate(2), etc) we need
+to make sure that the file's metadata is trusted. Otherwise, we risk
+dealing with untrusted file data.
+
+To make sure that file's metadata is trusted, file is subjected to a
+special procedure of security open. The procedure of security open is
+performed by crypt translator at FOP-\>open() (crypt\_open) time by the
+function open\_format(). Currently this is a hardcoded composition of 2
+checks:
+
+1. verification of file's GFID by the file name;
+2. verification of file's metadata by the verified GFID;
+
+If the security open succeeds, then the cache of trusted client machine
+is replenished with file descriptor and file's inode, and user can
+access the file's content by read(2), write(2), ftruncate(2), etc.
+system calls, which accept file descriptor as argument.
+
+However, the file API also allows access to a file's body without opening the
+file. For example, truncate(2) accepts a pathname instead of a file
+descriptor. To make sure that the file's metadata is trusted, we create a
+temporary file descriptor and unconditionally call crypt\_open() before
+truncating the file's body.
+
+10.3 Encryption in different types of mount sessions
+----------------------------------------------------
+
+Everything described in the section above is valid only for FUSE-mounts.
+Besides, GlusterFS also supports so-called NFS-mounts. From the
+standpoint of security the key difference between the mentioned types of
+mount sessions is that in NFS-mount sessions file operations instead of
+file name accept a so-called file handle (which is actually GFID). It
+creates problems, since the file name is a basic point for verification.
+As it follows from the section above, using the step 1, we can replenish
+the cache of trusted machine with trusted file handles (GFIDs), and
+perform a security open only by trusted GFID (by the step 2). However,
+in this case we need to make sure that there are no leaks of non-trusted
+GFIDs (and, moreover, that such leaks won't be introduced by the development
+process in future). This is possible only with changed GFID format:
+everywhere in GlusterFS GFID should appear as a pair (uuid,
+is\_verified), where is\_verified is a boolean variable which is true
+if this GFID has passed the verification procedure (step 1 in the
+section above).
+
+The next problem is that current NFS protocol doesn't encrypt the
+channel between NFS client and NFS server. It means that in NFS-mounts
+of GlusterFS volumes NFS client and GlusterFS client should be the same
+(trusted) machine.
+
+Taking into account the described problems, encryption in GlusterFS is
+not supported in NFS-mount sessions.
+
+10.4 Class of cipher algorithms for file data encryption that can be supported by the crypt translator
+------------------------------------------------------------------------------------------------------
+
+We'll assume that any symmetric block cipher algorithm is completely
+determined by a pair (alg\_id, mode\_id), where alg\_id is an algorithm
+defined on elementary cipher blocks (e.g. AES), and mode\_id is a mode
+of operation (e.g. ECB, XTS, etc).
+
+Technically, the crypt translator is able to support any symmetric block
+cipher algorithms via additional options of the crypt translator.
+However, in practice the set of supported algorithms is narrowed because
+of various security and organization issues. Currently we support only
+one algorithm. This is AES\_XTS.
+
+10.5 Bibliography
+-----------------
+
+1.  Recommendation for Block Cipher Modes of Operation (NIST
+ Special Publication 800-38A).
+2. Recommendation for Block Cipher Modes of Operation: The XTS-AES Mode
+ for Confidentiality on Storage Devices (NIST Special Publication
+ 800-38E).
+3. Recommendation for Key Derivation Using Pseudorandom Functions,
+ (NIST Special Publication 800-108).
+4. Recommendation for Block Cipher Modes of Operation: The CMAC Mode
+ for Authentication, (NIST Special Publication 800-38B).
+5. Recommendation for Block Cipher Modes of Operation: Methods for Key
+ Wrapping, (NIST Special Publication 800-38F).
+6. FIPS PUB 198-1 The Keyed-Hash Message Authentication Code (HMAC).
+7. David A. McGrew, John Viega "The Galois/Counter Mode of Operation
+ (GCM)".
+
+11 FAQ
+======
+
+**1. How to make sure that my volume is really encrypted?**
+
+Check the respective graph of translators on your trusted client
+machine. This graph is created at mount time and is stored by default in
+the file /usr/local/var/log/glusterfs/mountpoint.log
+
+Here "mountpoint" is the absolute name of the mountpoint, where "/" are
+replaced with "-". For example, if your volume is mounted to
+/mnt/testfs, then you'll need to check the file
+/usr/local/var/log/glusterfs/mnt-testfs.log
+
+Make sure that this graph contains the crypt translator, which looks
+like the following:
+
+ 13: volume xvol-crypt
+ 14:     type encryption/crypt
+ 15:     option master-key /home/edward/mykey
+ 16:     subvolumes xvol-dht
+ 17: end-volume
+
+**2. How to make sure that my volume is encrypted with a proper master
+key?**
+
+Check the graph of translators on your trusted client machine (see the
+FAQ\#1). Make sure that the option "master-key" of the crypt translator
+specifies correct location of the master key on your trusted client
+machine.
+
+**3. Can I change the encryption status of a volume?**
+
+You can change encryption status (enable/disable encryption) only for
+empty volumes. Otherwise it will be incorrect (you'll end up with IO
+errors, data corruption and security problems). We strongly recommend
+deciding once and for all at volume creation time whether your volume has
+to be encrypted or not.
+
+**4. I am able to mount my encrypted volume with improper master keys
+and get list of file names for every directory. Is it normal?**
+
+Yes, it is normal. It doesn't contradict the announced functionality: we
+encrypt only file's content. File names are not encrypted, so it doesn't
+make sense to hide them on the trusted client machine.
+
+**5. What is the reason for only supporting AES-XTS? This mode is not
+using Intel's AES-NI instruction thus not utilizing hardware feature..**
+
+Distributed file systems impose tighter requirements on at-rest
+encryption. We offer more than "at-rest-encryption". We offer "at-rest
+encryption and authentication in distributed systems with non-trusted
+servers". Data and metadata on the server can be easily subjected to
+tampering and analysis with the purpose to reveal secret user's data.
+And we have to resist to this tampering by performing data and metadata
+authentication.
+
+Unfortunately, it is technically hard to implement full-fledged data
+authentication via a stackable file system (GlusterFS translator), so we
+have decided to perform a "light" authentication by using a special
+cipher mode, which is resistant to tampering. Currently OpenSSL supports
+only one such mode: this is XTS. Tampering of ciphertext created in XTS
+mode will lead to unpredictable changes in the plain text. That is, the
+user will see "unpredictable gibberish" on the client side. Of course,
+this is not an "official way" to detect tampering, but this is much
+better than nothing. The "official way" (creating/checking MACs) we use
+for metadata authentication.
+
+Other modes like CBC, CFB, OFB, etc supported by OpenSSL are strongly
+not recommended for use in distributed systems with non-trusted servers.
+For example, CBC mode doesn't "survive" overwrite of a logical block in
+a file. It means that with every such overwrite (standard file system
+operation) we'll need to re-encrypt the whole(!) file with different
+key. CFB and OFB modes are sensitive to tampering: there is a way to
+perform \*predictable\* changes in plaintext, which is unacceptable.
+
+Yes, XTS is slow (at least its current implementation in OpenSSL), but
+we don't promise that CFB or OFB with full-fledged authentication would be
+faster. So..
diff --git a/done/GlusterFS 3.5/Exposing Volume Capabilities.md b/done/GlusterFS 3.5/Exposing Volume Capabilities.md
new file mode 100644
index 0000000..0f72fbc
--- /dev/null
+++ b/done/GlusterFS 3.5/Exposing Volume Capabilities.md
@@ -0,0 +1,161 @@
+Feature
+-------
+
+Provide a capability to:
+
+- Probe the type (posix or bd) of volume.
+- Provide a list of capabilities of an xlator/volume. For example, the posix
+  xlator could support zerofill, and the BD xlator could support offloaded
+  copy, thin provisioning, etc.
+
+Summary
+-------
+
+With multiple storage translators (posix and bd) being supported in
+GlusterFS, it becomes necessary to know the volume type so that the user can
+issue appropriate calls that are relevant only to a given volume
+type. Hence there needs to be a way to expose the type of the storage
+translator of the volume to the user.
+
+BD xlator is capable of providing server offloaded file copy,
+server/storage offloaded zeroing of a file, etc. These capabilities should
+be visible to the client/user, so that these features can be exploited.
+
+Owners
+------
+
+M. Mohan Kumar
+Bharata B Rao.
+
+Current status
+--------------
+
+BD xlator exports capability information through gluster volume info
+(and --xml) output. For eg:
+
+*snip of gluster volume info output for a BD based volume*
+
+ Xlator 1: BD
+ Capability 1: thin
+
+*snip of gluster volume info --xml output for a BD based volume*
+
+ <xlators>
+   <xlator>
+     <name>BD</name>
+     <capabilities>
+       <capability>thin</capability>
+     </capabilities>
+   </xlator>
+ </xlators>
+
+But this capability information should also be exposed through some other
+means so that a host which is not part of the Gluster peer group can also
+make use of these capabilities.
+
+Exposing the type of the volume (i.e. posix or BD) is still in a conceptual
+state currently and needs discussion.
+
+Detailed Description
+--------------------
+
+1. Type
+- BD translator supports both regular files and block devices,
+i.e., one can create files on a GlusterFS volume backed by the BD
+translator and this file could end up as regular posix file or a
+logical volume (block device) based on the user's choice. User
+can do a setxattr on the created file to convert it to a logical
+volume.
+- Users of a BD backed volume, such as QEMU, would like to know that they
+are working with a BD type of volume so that they can issue an
+additional setxattr call after creating a VM image on the GlusterFS
+backend. This is necessary to ensure that the created VM image
+is backed by an LV instead of a file.
+- There are different ways to expose this information (BD type of
+volume) to user. One way is to export it via a getxattr call.
+
+2. Capabilities
+- BD xlator supports new features such as server offloaded file
+copy, thin provisioned VM images etc (there is a patch posted to
+Gerrit to add server offloaded file zeroing in posix xlator).
+There is no standard way of exploiting these features from
+client side (such as syscall to exploit server offloaded copy).
+So these features need to be exported to the client so that they
+can be used. The BD xlator V2 patch exports this capability
+information through gluster volume info (and --xml) output. But
+if a client is not part of GlusterFS peer it can't run volume
+info command to get the list of capabilities of a given
+GlusterFS volume. Also, the GlusterFS block driver in qemu needs to
+get the capability list so that these features can be used.
+
+Benefit to GlusterFS
+--------------------
+
+Enables proper consumption of the BD xlator and lets clients exploit the new
+features added in both the posix and BD xlators.
+
+### Scope
+
+Nature of proposed change
+-------------------------
+
+- The quickest way to expose the volume type to a client is via the
+  getxattr fop. When a client issues getxattr("volume\_type") on
+  the root gfid, the bd xlator will return 1, implying it is the BD xlator,
+  while the posix xlator will return ENODATA, which client code can
+  interpret as the posix xlator.
+
+- The capability list can likewise be returned via getxattr("caps") for the
+  root gfid (see the sketch after this list).
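+
+A hypothetical client-side probe of both keys, assuming the xattrs end up being
+exposed literally as `volume_type` and `caps` on the root of a FUSE mount (the
+mount point and key names are illustrative, pending the discussion above):
+
+    # getfattr -n volume_type /mnt/bdvol
+    # getfattr -n caps /mnt/bdvol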
+
+Implications on manageability
+-----------------------------
+
+None.
+
+Implications on presentation layer
+----------------------------------
+
+N/A
+
+Implications on persistence layer
+---------------------------------
+
+N/A
+
+Implications on 'GlusterFS' backend
+-----------------------------------
+
+N/A
+
+Modification to GlusterFS metadata
+----------------------------------
+
+N/A
+
+Implications on 'glusterd'
+--------------------------
+
+N/A
+
+How To Test
+-----------
+
+User Experience
+---------------
+
+Dependencies
+------------
+
+Documentation
+-------------
+
+Status
+------
+
+Patch : <http://review.gluster.org/#/c/4809/>
+
+Status : Merged
+
+Comments and Discussion
+-----------------------
diff --git a/done/GlusterFS 3.5/File Snapshot.md b/done/GlusterFS 3.5/File Snapshot.md
new file mode 100644
index 0000000..b2d6c69
--- /dev/null
+++ b/done/GlusterFS 3.5/File Snapshot.md
@@ -0,0 +1,101 @@
+Feature
+-------
+
+File Snapshots in GlusterFS
+
+### Summary
+
+Ability to take snapshots of files in GlusterFS
+
+### Owners
+
+Anand Avati
+
+### Source code
+
+Patch for this feature - <http://review.gluster.org/5367>
+
+### Detailed Description
+
+The feature adds file snapshotting support to GlusterFS. *To use this
+feature the file format should be QCOW2 (from QEMU)*. The patch takes
+the block layer code from QEMU and converts it into a translator in
+gluster.
+
+### Benefit to GlusterFS
+
+Better integration with Openstack Cinder, and in general ability to take
+snapshots of files (typically VM images)
+
+### Usage
+
+*To take a snapshot of a file, the file format should be QCOW2. To set the
+file type as qcow2, check step \#2 below. A combined example follows the
+steps.*
+
+​1. Turning on snapshot feature :
+
+ gluster volume set `<vol_name>` features.file-snapshot on
+
+​2. To set qcow2 file format:
+
+ setfattr -n trusted.glusterfs.block-format -v qcow2:10GB <file_name>
+
+​3. To create a snapshot:
+
+ setfattr -n trusted.glusterfs.block-snapshot-create -v <image_name> <file_name>
+
+​4. To apply/revert back to a snapshot:
+
+ setfattr -n trusted.glusterfs.block-snapshot-goto -v <image_name> <file_name>
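+
+Putting the steps together, a hypothetical session against a FUSE mount of the
+volume might look like this (the volume, file and snapshot names are purely
+illustrative):
+
+    gluster volume set myvol features.file-snapshot on
+    setfattr -n trusted.glusterfs.block-format -v qcow2:10GB /mnt/myvol/vm1.img
+    setfattr -n trusted.glusterfs.block-snapshot-create -v snap1 /mnt/myvol/vm1.img
+    setfattr -n trusted.glusterfs.block-snapshot-goto -v snap1 /mnt/myvol/vm1.img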
+
+### Scope
+
+#### Nature of proposed change
+
+The work is going to be a new translator. Very minimal changes to
+existing code (minor change in syncops)
+
+#### Implications on manageability
+
+Will need ability to load/unload the translator in the stack.
+
+#### Implications on presentation layer
+
+Feature must be presentation layer independent.
+
+#### Implications on persistence layer
+
+No implications
+
+#### Implications on 'GlusterFS' backend
+
+Internal snapshots - No implications. External snapshots - there will be
+hidden directories added.
+
+#### Modification to GlusterFS metadata
+
+New xattr will be added to identify files which are 'snapshot managed'
+vs raw files.
+
+#### Implications on 'glusterd'
+
+Yet another turn on/off feature for glusterd. Volgen will have to add a
+new translator in the generated graph.
+
+### How To Test
+
+Snapshots can be tested by taking a snapshot along with a checksum of
+the state of the file, making further changes, going back to the old
+snapshot, and verifying the checksum again.
+
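+A minimal test sketch along those lines; the mount point and file names
+are hypothetical:
+
+    # mark the file as qcow2-managed and record its checksum
+    setfattr -n trusted.glusterfs.block-format -v qcow2:10GB /mnt/testvol/vm.img
+    md5sum /mnt/testvol/vm.img > before.md5
+
+    # take a snapshot, modify the file, then revert
+    setfattr -n trusted.glusterfs.block-snapshot-create -v snap1 /mnt/testvol/vm.img
+    dd if=/dev/urandom of=/mnt/testvol/vm.img bs=1M count=10 conv=notrunc
+    setfattr -n trusted.glusterfs.block-snapshot-goto -v snap1 /mnt/testvol/vm.img
+
+    # the checksum should match the pre-snapshot state again
+    md5sum -c before.md5
+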
+### Dependencies
+
+Dependent QEMU code is imported into the codebase.
+
+### Documentation
+
+<http://review.gluster.org/#/c/7488/6/doc/features/file-snapshot.md>
+
+### Status
+
+Merged in master and available in Gluster3.5 \ No newline at end of file
diff --git a/done/GlusterFS 3.5/Onwire Compression-Decompression.md b/done/GlusterFS 3.5/Onwire Compression-Decompression.md
new file mode 100644
index 0000000..a26aa7a
--- /dev/null
+++ b/done/GlusterFS 3.5/Onwire Compression-Decompression.md
@@ -0,0 +1,96 @@
+Feature
+=======
+
+On-Wire Compression/Decompression
+
+1. Summary
+==========
+
+Translator to compress/decompress data in flight between client and
+server.
+
+2. Owners
+=========
+
+- Venky Shankar <vshankar@redhat.com>
+- Prashanth Pai <ppai@redhat.com>
+
+3. Current Status
+=================
+
+Code has already been merged. Needs more testing.
+
+The [initial submission](http://review.gluster.org/3251) contained a
+`compress` option, which introduced [some
+confusion](https://bugzilla.redhat.com/1053670). [A correction has been
+sent](http://review.gluster.org/6765) to rename the user visible options
+to start with `network.compression`.
+
+TODO
+
+- Make xlator pluggable to add support for other compression methods
+- Add support for lz4 compression: <https://code.google.com/p/lz4/>
+
+4. Detailed Description
+=======================
+
+- When a writev call occurs, the client compresses the data before
+ sending it to server. On the server, compressed data is
+ decompressed. Similarly, when a readv call occurs, the server
+ compresses the data before sending it to client. On the client, the
+ compressed data is decompressed. Thus the amount of data sent over
+ the wire is minimized.
+
+- Compression/Decompression is done using Zlib library.
+
+- During normal operation, this is the format of data sent over wire:
+ <compressed-data> + trailer(8 bytes). The trailer contains the CRC32
+ checksum and length of original uncompressed data. This is used for
+ validation.
+
+5. Usage
+========
+
+Turning on compression xlator:
+
+ # gluster volume set <vol_name> network.compression on
+
+Configurable options:
+
+ # gluster volume set <vol_name> network.compression.compression-level 8
+ # gluster volume set <vol_name> network.compression.min-size 50
+
+6. Benefits to GlusterFS
+========================
+
+Fewer bytes transferred over the network.
+
+7. Issues
+=========
+
+- Issues with striped volumes: the compression xlator cannot work with
+  striped volumes.
+
+- Issues with write-behind: the mount point hangs when writing a file
+  with the write-behind xlator turned on. To overcome this, turn off
+  write-behind entirely OR set "performance.strict-write-ordering" to
+  on (see the example after this list).
+
+- Issues with AFR: AFR v1 currently does not propagate xdata.
+ <https://bugzilla.redhat.com/show_bug.cgi?id=951800> This issue has
+ been resolved in AFR v2.
+
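+A sketch of the write-behind workaround mentioned above; the option
+names are the standard gluster ones, but treat the exact invocations as
+an assumption:
+
+     # gluster volume set <vol_name> performance.write-behind off
+
+or, to keep write-behind enabled:
+
+     # gluster volume set <vol_name> performance.strict-write-ordering on
+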
+8. Dependencies
+===============
+
+Zlib library
+
+9. Documentation
+================
+
+<http://review.gluster.org/#/c/7479/3/doc/network_compression.md>
+
+10. Status
+==========
+
+Code merged upstream. \ No newline at end of file
diff --git a/done/GlusterFS 3.5/Quota Scalability.md b/done/GlusterFS 3.5/Quota Scalability.md
new file mode 100644
index 0000000..f3b0a0d
--- /dev/null
+++ b/done/GlusterFS 3.5/Quota Scalability.md
@@ -0,0 +1,99 @@
+Feature
+-------
+
+Quota Scalability
+
+Summary
+-------
+
+Support up to 65536 quota configurations per volume.
+
+Owners
+------
+
+Krishnan Parthasarathi
+Vijay Bellur
+
+Current status
+--------------
+
+The current implementation of Directory Quota cannot scale beyond a few
+hundred configured limits per volume. The aim of this feature is to
+support up to 65536 quota configurations per volume.
+
+Detailed Description
+--------------------
+
+TBD
+
+Benefit to GlusterFS
+--------------------
+
+More quota limits can be configured in a single volume, thereby enabling
+GlusterFS to support use cases like home directories.
+
+Scope
+-----
+
+### Nature of proposed change
+
+- Move quota enforcement translator to the server
+- Introduce a new quota daemon which helps in aggregating directory
+ consumption on the server
+- Enhance marker's accounting to be modular
+- Revamp configuration persistence and CLI listing for better scale
+- Allow configuration of soft limits in addition to hard limits (see the
+  sketch below).
+
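+As an illustration of the intended scale and the soft-limit support, a
+per-directory configuration could look roughly as follows (volume and
+path names are hypothetical; the soft-limit argument is an assumption
+based on the evolving quota CLI):
+
+    gluster volume quota homevol enable
+    gluster volume quota homevol limit-usage /users/alice 10GB 80%
+    gluster volume quota homevol list
+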
+### Implications on manageability
+
+Mostly the CLI will be backward compatible. New CLI to be introduced
+needs to be enumerated here.
+
+### Implications on presentation layer
+
+None
+
+### Implications on persistence layer
+
+None
+
+### Implications on 'GlusterFS' backend
+
+None
+
+### Modification to GlusterFS metadata
+
+- Addition of a new extended attribute for storing configured hard and
+soft limits on directories.
+
+### Implications on 'glusterd'
+
+- New file based configuration persistence
+
+How To Test
+-----------
+
+TBD
+
+User Experience
+---------------
+
+TBD
+
+Dependencies
+------------
+
+None
+
+Documentation
+-------------
+
+TBD
+
+Status
+------
+
+In development
+
+Comments and Discussion
+-----------------------
diff --git a/done/GlusterFS 3.5/Virt store usecase.md b/done/GlusterFS 3.5/Virt store usecase.md
new file mode 100644
index 0000000..3e649b2
--- /dev/null
+++ b/done/GlusterFS 3.5/Virt store usecase.md
@@ -0,0 +1,140 @@
+ Work In Progress
+ Author - Satheesaran Sundaramoorthi
+ <sasundar@redhat.com>
+
+**Introduction**
+----------------
+
+Gluster volumes are used to host virtual machine images, i.e. virtual
+machine images are stored on gluster volumes. This usecase is popularly
+known as the *virt-store* usecase.
+
+This document explains more about,
+
+1. Enabling gluster volumes for virt-store usecase
+2. Common Pitfalls
+3. FAQs
+4. References
+
+**Enabling gluster volumes for virt-store**
+-------------------------------------------
+
+This section describes how to enable gluster volumes for virt store
+usecase
+
+#### Volume Types
+
+Ideally, gluster volumes serving virt-store should provide
+high-availability for the VMs running on them. If the volume is not
+available, the VMs may move into an unusable state. So it is best
+recommended to use a **replica** or **distribute-replicate** volume for
+this usecase.
+
+*If you are new to GlusterFS, you can take a look at
+[QuickStart](http://gluster.readthedocs.org/en/latest/Quick-Start-Guide/Quickstart/) guide or the [admin
+guide](http://gluster.readthedocs.org/en/latest/Administrator%20Guide/README/)*
+
+#### Tunables
+
+The following set of volume options is recommended for the virt-store
+usecase, as they add a performance boost:
+
+ quick-read=off
+ read-ahead=off
+ io-cache=off
+ stat-prefetch=off
+ eager-lock=enable
+ remote-dio=enable
+ quorum-type=auto
+ server-quorum-type=server
+
+- quick-read is turned off; it is meant for improving small-file read
+  performance, which is not relevant for VM image files
+- read-ahead is turned off. VMs have their own way of doing that, and it
+  is usual to leave read-ahead to the VM
+- io-cache is turned off
+- stat-prefetch is turned off. stat-prefetch caches the metadata
+  related to files, and this is not a concern for VM images
+- eager-lock is turned on
+- remote-dio is turned on, so in open() and creat() calls the O\_DIRECT
+  flag will be filtered at the client protocol level and the server will
+  still continue to cache the file
+- quorum-type is set to auto. This basically enables client-side
+  quorum. When client-side quorum is enabled, at least half of the
+  bricks in the replica group should be up and running; if not, the
+  replica group becomes read-only
+- server-quorum-type is set to server. This basically enables
+  server-side quorum. This lays down the condition that, in a cluster,
+  at least half the number of nodes should be up; if not, the bricks
+  (read: brick processes) will be killed, and the volume thereby goes
+  offline (the corresponding fully-qualified option names are sketched
+  after this list)
+
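+For reference, the short names above correspond to fully-qualified
+volume options which could also be set one by one; the mapping below is
+an assumption based on the standard gluster option namespaces, and the
+group-file method described next remains the recommended approach:
+
+    gluster volume set <vol-name> performance.quick-read off
+    gluster volume set <vol-name> performance.read-ahead off
+    gluster volume set <vol-name> performance.io-cache off
+    gluster volume set <vol-name> performance.stat-prefetch off
+    gluster volume set <vol-name> cluster.eager-lock enable
+    gluster volume set <vol-name> network.remote-dio enable
+    gluster volume set <vol-name> cluster.quorum-type auto
+    gluster volume set <vol-name> cluster.server-quorum-type server
+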
+#### Applying the Tunables on the volume
+
+There are a number of ways to do it.
+
+1. Make use of group-virt.example file
+2. Copy & Paste
+
+##### Make use of group-virt.example file
+
+This is the best-suited and recommended method.
+*/etc/glusterfs/group-virt.example* has all the options recommended for
+virt-store, as explained earlier. Copy this file,
+*/etc/glusterfs/group-virt.example*, to */var/lib/glusterd/groups/virt*:
+
+ cp /etc/glusterfs/group-virt.example /var/lib/glusterd/groups/virt
+
+Optimize the volume with all the options available in this *virt* file
+in a single go:
+
+ gluster volume set <vol-name> group virt
+
+NOTE: No restart of the volume is required. Verify the same with the
+command:
+
+ gluster volume info
+
+In forthcoming releases, this file will be automatically put in
+*/var/lib/glusterd/groups/* and you can directly apply it on the volume
+
+##### Copy & Paste
+
+Copy all the options from the section
+[Virt-store-usecase\#Tunables](Virt-store-usecase#Tunables "wikilink")
+above, put them in a file named *virt* under */var/lib/glusterd/groups/*,
+and apply all the options on the volume:
+
+ gluster volume set <vol-name> group virt
+
+NOTE: This is not recommended, as the recommended volume options may
+change in the future. Always stick to the *virt* file available with the
+RPMs.
+
+#### Adding Ownership to Volume
+
+You can add uid:gid to the volume,
+
+ gluster volume set <vol-name> storage.owner-uid <number>
+ gluster volume set <vol-name> storage.owner-gid <number>
+
+For example, when the volume will be accessed by qemu/kvm, you need to
+set the ownership to 107:107:
+
+ gluster volume set <vol-name> storage.owner-uid 107
+ gluster volume set <vol-name> storage.owner-gid 107
+
+It would be 36:36 in the case of oVirt/RHEV, 165:165 in the case of
+OpenStack Block Storage (Cinder), and 161:161 in the case of OpenStack
+Image Service (Glance) accessing this volume.
+
+NOTE: Not setting the correct ownership may lead to "Permission Denied"
+errors when accessing the image files residing on the volume
+
+**Common Pitfalls**
+-------------------
+
+**FAQs**
+--------
+
+**References**
+-------------- \ No newline at end of file
diff --git a/done/GlusterFS 3.5/Zerofill.md b/done/GlusterFS 3.5/Zerofill.md
new file mode 100644
index 0000000..43b279d
--- /dev/null
+++ b/done/GlusterFS 3.5/Zerofill.md
@@ -0,0 +1,192 @@
+Feature
+-------
+
+zerofill API for GlusterFS
+
+Summary
+-------
+
+zerofill() API would allow creation of pre-allocated and zeroed-out
+files on GlusterFS volumes by offloading the zeroing part to server
+and/or storage (storage offloads use SCSI WRITESAME).
+
+Owners
+------
+
+Bharata B Rao
+M. Mohankumar
+
+Current status
+--------------
+
+Patch on gerrit: <http://review.gluster.org/5327>
+
+Detailed Description
+--------------------
+
+Add support for a new ZEROFILL fop. Zerofill writes zeroes to a file in
+the specified range. This fop will be useful when a whole file needs to
+be initialized with zero (could be useful for zero filled VM disk image
+provisioning or during scrubbing of VM disk images).
+
+The client/application can issue this FOP for zeroing out. The Gluster
+server will zero out the required range of bytes, i.e. server-offloaded
+zeroing. In the absence of this fop, the client/application has to
+repetitively issue write (zero) fops to the server, which is a very
+inefficient method because of the overheads involved in RPC calls and
+acknowledgements.
+
+WRITESAME is a SCSI T10 command that takes a block of data as input and
+writes the same data to other blocks; this write is handled completely
+within the storage and hence is known as an offload. Linux now has
+support for the SCSI WRITESAME command, which is exposed to the user in
+the form of the BLKZEROOUT ioctl. The BD xlator can exploit the
+BLKZEROOUT ioctl to implement this fop. Thus zeroing-out operations can
+be completely offloaded to the storage device, making it highly
+efficient.
+
+The fop takes two arguments, offset and size. It zeroes out 'size' bytes
+in an opened file starting from the 'offset' position.
+
+Benefit to GlusterFS
+--------------------
+
+Benefits GlusterFS in virtualization by providing the ability to quickly
+create pre-allocated and zeroed-out VM disk image by using
+server/storage off-loads.
+
+### Scope
+
+Nature of proposed change
+-------------------------
+
+An FOP supported in libgfapi and FUSE.
+
+Implications on manageability
+-----------------------------
+
+None.
+
+Implications on presentation layer
+----------------------------------
+
+N/A
+
+Implications on persistence layer
+---------------------------------
+
+N/A
+
+Implications on 'GlusterFS' backend
+-----------------------------------
+
+N/A
+
+Modification to GlusterFS metadata
+----------------------------------
+
+N/A
+
+Implications on 'glusterd'
+--------------------------
+
+N/A
+
+How To Test
+-----------
+
+Test server offload by measuring the time taken for creating a fully
+allocated and zeroed file on Posix backend.
+
+Test storage offload by measuring the time taken for creating a fully
+allocated and zeroed file on BD backend.
+
+User Experience
+---------------
+
+Fast provisioning of VM images when GlusterFS is used as a file system
+backend for KVM virtualization.
+
+Dependencies
+------------
+
+zerofill() support in BD backend depends on the new BD translator -
+<http://review.gluster.org/#/c/4809/>
+
+Documentation
+-------------
+
+This feature adds support for a new ZEROFILL fop. Zerofill writes zeroes
+to a file in the specified range. This fop will be useful when a whole
+file needs to be initialized with zero (could be useful for zero-filled
+VM disk image provisioning or during scrubbing of VM disk images).
+
+The client/application can issue this FOP for zeroing out. The Gluster
+server will zero out the required range of bytes, i.e. server-offloaded
+zeroing. In the absence of this fop, the client/application has to
+repetitively issue write (zero) fops to the server, which is a very
+inefficient method because of the overheads involved in RPC calls and
+acknowledgements.
+
+WRITESAME is a SCSI T10 command that takes a block of data as input and
+writes the same data to other blocks; this write is handled completely
+within the storage and hence is known as an offload. Linux now has
+support for the SCSI WRITESAME command, which is exposed to the user in
+the form of the BLKZEROOUT ioctl. The BD xlator can exploit the
+BLKZEROOUT ioctl to implement this fop. Thus zeroing-out operations can
+be completely offloaded to the storage device, making it highly
+efficient.
+
+The fop takes two arguments, offset and size. It zeroes out 'size' bytes
+in an opened file starting from the 'offset' position.
+
+This feature adds zerofill support to the following areas:
+
+-  libglusterfs
+-  io-stats
+-  performance/md-cache,open-behind
+-  quota
+-  cluster/afr,dht,stripe
+-  rpc/xdr
+-  protocol/client,server
+-  io-threads
+-  marker
+-  storage/posix
+-  libgfapi
+
+Client applications can exploit this fop by using glfs\_zerofill,
+introduced in libgfapi. FUSE support for this fop has not been added, as
+there is no system call for this fop.
+
+Here is a performance comparison of server-offloaded zerofill vs.
+zeroing out using repeated writes.
+
+ [root@llmvm02 remote]# time ./offloaded aakash-test log 20
+
+ real    3m34.155s
+ user    0m0.018s
+ sys 0m0.040s
+
+
+  [root@llmvm02 remote]# time ./manually aakash-test log 20
+
+ real    4m23.043s
+ user    0m2.197s
+ sys 0m14.457s
+  [root@llmvm02 remote]# time ./offloaded aakash-test log 25;
+
+ real    4m28.363s
+ user    0m0.021s
+ sys 0m0.025s
+ [root@llmvm02 remote]# time ./manually aakash-test log 25
+
+ real    5m34.278s
+ user    0m2.957s
+ sys 0m18.808s
+
+The second argument, log, is a file used for logging purposes, and the
+third argument is the size in GB.
+
+As we can see, there is a performance improvement of around 20% with
+this fop.
+
+Status
+------
+
+Patch : <http://review.gluster.org/5327> Status : Merged \ No newline at end of file
diff --git a/done/GlusterFS 3.5/gfid access.md b/done/GlusterFS 3.5/gfid access.md
new file mode 100644
index 0000000..db64076
--- /dev/null
+++ b/done/GlusterFS 3.5/gfid access.md
@@ -0,0 +1,89 @@
+### Instructions
+
+**Feature**
+
+'gfid-access' translator to provide access to data in glusterfs using a virtual path.
+
+**1 Summary**
+
+This particular translator is designed to provide direct access to files in glusterfs using their GFID. A 'GFID' is glusterfs's equivalent of an inode number, identifying a file uniquely.
+
+**2 Owners**
+
+Amar Tumballi <atumball@redhat.com>
+Raghavendra G <rgowdapp@redhat.com>
+Anand Avati <aavati@redhat.com>
+
+**3 Current status**
+
+With glusterfs-3.4.0, glusterfs provides only path-based access. A
+feature was added in the 'fuse' layer in the current master branch, but
+it is desirable to have it as a separate translator for long-term
+maintenance.
+
+**4 Detailed Description**
+
+With this method, we can consume the data in the changelog translator
+(which logs 'gfid's internally) very efficiently.
+
+**5 Benefit to GlusterFS**
+
+Provides a way to access files quickly with direct gfid.
+
+​**6. Scope**
+
+6.1. Nature of proposed change
+
+* A new translator.
+* Fixes in 'glusterfsd.c' to add this translator automatically based
+on a mount-time option.
+* A change to mount.glusterfs to parse this new option
+(a single-digit number of lines changed).
+
+6.2. Implications on manageability
+
+* No CLI required.
+* mount.glusterfs script gets a new option.
+
+6.3. Implications on presentation layer
+
+* A new virtual access path is made available. But all access protocols work seamlessly, as the complexities are handled internally.
+
+6.4. Implications on persistence layer
+
+* None
+
+6.5. Implications on 'GlusterFS' backend
+
+* None
+
+6.6. Modification to GlusterFS metadata
+
+* None
+
+6.7. Implications on 'glusterd'
+
+* None
+
+7 How To Test
+
+* Mount the glusterfs client with '-o aux-gfid-mount' and access files using '/mount/point/.gfid/<actual-canonical-gfid-of-the-file>', as sketched below.
+
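+A minimal sketch; server, volume, and GFID values are hypothetical, and
+the glusterfs.gfid.string virtual xattr used to fetch a file's GFID is
+an assumption:
+
+    mount -t glusterfs -o aux-gfid-mount server1:/testvol /mnt/testvol
+
+    # fetch the GFID of an existing file
+    getfattr -n glusterfs.gfid.string /mnt/testvol/dir/file.txt
+
+    # access the same file through the virtual .gfid path
+    cat /mnt/testvol/.gfid/6a2a9d77-3b54-4b7d-9f3c-6d0f8b4c2a11
+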
+8 User Experience
+
+* A new virtual path available for users.
+
+9 Dependencies
+
+* None
+
+10 Documentation
+
+This wiki.
+
+11 Status
+
+Patch sent upstream. More review comments required. (http://review.gluster.org/5497)
+
+12 Comments and Discussion
+
+Please do give comments :-) \ No newline at end of file
diff --git a/done/GlusterFS 3.5/index.md b/done/GlusterFS 3.5/index.md
new file mode 100644
index 0000000..e8c2c88
--- /dev/null
+++ b/done/GlusterFS 3.5/index.md
@@ -0,0 +1,32 @@
+GlusterFS 3.5 Release
+---------------------
+
+Tentative Dates:
+
+**Latest: 13-Nov, 2014 GlusterFS 3.5.3**
+
+17th Apr, 2014 - 3.5.0 GA
+
+GlusterFS 3.5
+-------------
+
+### Features in 3.5.0
+
+- [Features/AFR CLI enhancements](./AFR CLI enhancements.md)
+- [Features/exposing volume capabilities](./Exposing Volume Capabilities.md)
+- [Features/File Snapshot](./File Snapshot.md)
+- [Features/gfid-access](./gfid access.md)
+- [Features/On-Wire Compression + Decompression](./Onwire Compression-Decompression.md)
+- [Features/Quota Scalability](./Quota Scalability.md)
+- [Features/readdir ahead](./readdir ahead.md)
+- [Features/zerofill](./Zerofill.md)
+- [Features/Brick Failure Detection](./Brick Failure Detection.md)
+- [Features/disk-encryption](./Disk Encryption.md)
+- Changelog based parallel geo-replication
+- Improved block device translator
+
+Proposing New Features
+----------------------
+
+New feature proposals should be built using the New Feature Template in
+the GlusterFS 3.7 planning page
diff --git a/done/GlusterFS 3.5/libgfapi with qemu libvirt.md b/done/GlusterFS 3.5/libgfapi with qemu libvirt.md
new file mode 100644
index 0000000..2309016
--- /dev/null
+++ b/done/GlusterFS 3.5/libgfapi with qemu libvirt.md
@@ -0,0 +1,222 @@
+ Work In Progress
+ Author - Satheesaran Sundaramoorthi
+ <sasundar@redhat.com>
+
+**Purpose**
+-----------
+
+A Gluster volume can be used to store VM disk images. This usecase is
+popularly known as the 'Virt-Store' usecase. Earlier, a gluster volume
+had to be FUSE mounted and images were created/accessed over the FUSE
+mount.
+
+With the introduction of GlusterFS libgfapi, QEMU supports glusterfs
+through libgfapi directly. We call this the *QEMU driver for glusterfs*.
+This document explains how to make use of the QEMU driver for glusterfs.
+
+The steps for the entire procedure can be split into two parts:
+
+1. Steps to be done on gluster volume side
+2. Steps to be done on Hypervisor side
+
+**Steps to be done on gluster side**
+------------------------------------
+
+These are the steps that need to be done on the gluster side. Precisely,
+this involves:
+
+1. Creating "Trusted Storage Pool"
+2. Creating a volume
+3. Tuning the volume for virt-store
+4. Tuning glusterd to accept requests from QEMU
+5. Tuning glusterfsd to accept requests from QEMU
+6. Setting ownership on the volume
+7. Starting the volume
+
+##### Creating "Trusted Storage Pool"
+
+Install the glusterfs RPMs on the node. You can create a volume with a
+single node. You can also scale up the cluster, known as a *Trusted
+Storage Pool*, by adding more nodes to the cluster:
+
+ gluster peer probe <hostname>
+
+##### Creating a volume
+
+It is highly recommended to have a replicate or distribute-replicate
+volume for the virt-store usecase, as it adds high availability and
+fault-tolerance. Remember that plain distribute works equally well.
+
+    gluster volume create <volname> replica 2 <brick1> .. <brickN>
+
+where <brick1> is <hostname>:/<path-of-dir>.
+
+Note: It is recommended to create sub-directories inside the brick
+mountpoint and use those while creating a volume. For example, if
+*/home/brick1* is the mountpoint of XFS, you can create a sub-directory
+inside it, */home/brick1/b1*, and use that while creating the volume.
+You can also use space available in the root filesystem for bricks; the
+Gluster CLI, by default, throws a warning in that case. You can override
+it by using the *force* option:
+
+    gluster volume create <volname> replica 2 <brick1> .. <brickN> force
+
+*If you are new to GlusterFS, you can take a look at
+[QuickStart](http://gluster.readthedocs.org/en/latest/Quick-Start-Guide/Quickstart/) guide.*
+
+##### Tuning the volume for virt-store
+
+There are recommended settings available for virt-store. This provide
+good performance characteristics when enabled on the volume that was
+used for *virt-store*
+
+Refer to
+[Virt-store-usecase\#Tunables](Virt-store-usecase#Tunables "wikilink")
+for recommended tunables and for applying them on the volume,
+[Virt-store-usecase\#Applying\_the\_Tunables\_on\_the\_volume](Virt-store-usecase#Applying_the_Tunables_on_the_volume "wikilink")
+
+##### Tuning glusterd to accept requests from QEMU
+
+glusterd accepts requests only from applications that connect from a
+port number less than 1024 and blocks requests from others. QEMU uses
+port numbers greater than 1024; to make glusterd accept requests from
+QEMU, edit the glusterd vol file, */etc/glusterfs/glusterd.vol*, and add
+the following:
+
+ option rpc-auth-allow-insecure on
+
+Note: If you have installed glusterfs from source, you can find glusterd
+vol file at */usr/local/etc/glusterfs/glusterd.vol*
+
+Restart glusterd after adding that option to glusterd vol file
+
+ service glusterd restart
+
+##### Tuning glusterfsd to accept requests from QEMU
+
+Enable the option *allow-insecure* on the particular volume
+
+ gluster volume set <volname> server.allow-insecure on
+
+**IMPORTANT:** As of now (April 2, 2014) there is a bug:
+*allow-insecure* is not applied dynamically on a volume. You need to
+restart the volume for the change to take effect.
+
+##### Setting ownership on the volume
+
+Set the ownership of qemu:qemu on to the volume
+
+ gluster volume set <vol-name> storage.owner-uid 107
+ gluster volume set <vol-name> storage.owner-gid 107
+
+**IMPORTANT:** The UID and GID can differ per Linux distribution, or
+even per installation. The UID/GID should be the one from the *qemu* or
+*kvm* user; you can get the IDs with these commands:
+
+ id qemu
+ getent group kvm
+
+##### Starting the volume
+
+Start the volume
+
+ gluster volume start <vol-name>
+
+**Steps to be done on Hypervisor Side**
+---------------------------------------
+
+The hypervisor is just the machine which spawns the virtual machines.
+This machine should ideally be bare metal, with ample memory and
+computing power. The following steps need to be done on the hypervisor:
+
+1. Install qemu-kvm
+2. Install libvirt
+3. Create a VM Image
+4. Add ownership to the Image file
+5. Create libvirt XML to define Virtual Machine
+6. Define the VM
+7. Start the VM
+8. Verification
+
+##### Install qemu-kvm
+
+##### Install libvirt
+
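+On a Fedora/RHEL-like hypervisor, qemu-kvm and libvirt can typically be
+installed from the distribution repositories; the exact package names
+are an assumption and vary by distribution:
+
+    yum install qemu-kvm libvirt
+    service libvirtd start
+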
+##### Create a VM Image
+
+Images can be created using *qemu-img* utility
+
+ qemu-img create -f <format> gluster://<server>/<vol-name>/ <image> <size>
+
+- format - This can be raw or qcow2
+- server - One of the gluster Node's IP or FQDN
+- vol-name - gluster volume name
+- image - Image File name
+- size - Size of the image
+
+Here is sample,
+
+ qemu-img create -f qcow2 gluster://host.sample.com/vol1/vm1.img 10G
+
+##### Add ownership to the Image file
+
+NFS or FUSE mount the glusterfs volume and change the ownership of the
+image file to qemu:qemu
+
+ mount -t nfs -o vers=3 <gluster-server>:/<vol-name> <mount-point>
+
+Change the ownership of the image file that was earlier created using
+*qemu-img* utility
+
+ chown qemu:qemu <mount-point>/<image-file>
+
+##### Create libvirt XML to define Virtual Machine
+
+*virt-install* is a Python wrapper which is mostly used to create VMs
+using a set of parameters. *virt-install* doesn't support any network
+filesystem [ <https://bugzilla.redhat.com/show_bug.cgi?id=1017308> ].
+
+Create a libvirt XML - <http://libvirt.org/formatdomain.html>. Ensure
+that the disk section is formatted in such a way that the QEMU driver
+for glusterfs is used, as shown in the following example XML
+description:
+
+ <disk type='network' device='disk'>
+ <driver name='qemu' type='raw' cache='none'/>
+ <source protocol='gluster' name='distrepvol/vm3.img'>
+ <host name='10.70.37.106' port='24007'/>
+ </source>
+ <target dev='vda' bus='virtio'/>
+ <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
+ </disk>
+
+##### Define the VM from XML
+
+Define the VM from the XML file that was created earlier
+
+ virsh define <xml-file-description>
+
+Verify that the VM is created successfully
+
+ virsh list --all
+
+##### Start the VM
+
+Start the VM
+
+ virsh start <VM>
+
+##### Verification
+
+You can verify the disk image file that is being used by VM
+
+ virsh domblklist <VM-Domain-Name/ID>
+
+The above should show the volume name and image name. Here is the
+example,
+
+ [root@test ~]# virsh domblklist vm-test2
+ Target Source
+ ------------------------------------------------
+ vda distrepvol/test.img
+ hdc - \ No newline at end of file
diff --git a/done/GlusterFS 3.5/readdir ahead.md b/done/GlusterFS 3.5/readdir ahead.md
new file mode 100644
index 0000000..fe34a97
--- /dev/null
+++ b/done/GlusterFS 3.5/readdir ahead.md
@@ -0,0 +1,117 @@
+Feature
+-------
+
+readdir-ahead
+
+Summary
+-------
+
+Provide read-ahead support for directories to improve sequential
+directory read performance.
+
+Owners
+------
+
+Brian Foster
+
+Current status
+--------------
+
+Gluster currently does not attempt to improve directory read
+performance. As a result, simple operations (i.e., ls) on large
+directories are slow.
+
+Detailed Description
+--------------------
+
+The read-ahead feature for directories is analogous to read-ahead for
+files. The objective is to detect sequential directory read operations
+and establish a pipeline for directory content. When a readdir request
+is received and fulfilled, preemptively issue subsequent readdir
+requests to the server in anticipation of those requests from the user.
+If sequential readdir requests are received, the directory content is
+already immediately available in the client. If subsequent requests are
+not sequential or not received, said data is simply dropped and the
+optimization is bypassed.
+
+Benefit to GlusterFS
+--------------------
+
+Improved read performance of large directories.
+
+### Scope
+
+Nature of proposed change
+-------------------------
+
+readdir-ahead support is enabled through a new client-side translator.
+
+Implications on manageability
+-----------------------------
+
+None beyond the ability to enable and disable the translator.
+
+Implications on presentation layer
+----------------------------------
+
+N/A
+
+Implications on persistence layer
+---------------------------------
+
+N/A
+
+Implications on 'GlusterFS' backend
+-----------------------------------
+
+N/A
+
+Modification to GlusterFS metadata
+----------------------------------
+
+N/A
+
+Implications on 'glusterd'
+--------------------------
+
+N/A
+
+How To Test
+-----------
+
+Performance testing. Verify that sequential reads of large directories
+complete faster (e.g., ls, xfs\_io -c readdir), as sketched below.
+
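+A rough timing sketch (volume, mount point, and directory names are
+hypothetical; drop caches or remount between runs for a fair
+comparison):
+
+    gluster volume set testvol readdir-ahead disable
+    time ls -l /mnt/testvol/large-dir > /dev/null
+
+    gluster volume set testvol readdir-ahead enable
+    time ls -l /mnt/testvol/large-dir > /dev/null
+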
+User Experience
+---------------
+
+Improved performance on sequential read workloads. The translator should
+otherwise be invisible and not detract performance or disrupt behavior
+in any way.
+
+Dependencies
+------------
+
+N/A
+
+Documentation
+-------------
+
+Set the associated config option to enable or disable directory
+read-ahead on a volume:
+
+ gluster volume set <vol> readdir-ahead [enable|disable]
+
+readdir-ahead is disabled by default.
+
+Status
+------
+
+Development complete for the initial version. Minor changes and bug
+fixes likely.
+
+Future versions might expand to provide generic caching and more
+flexible behavior.
+
+Comments and Discussion
+----------------------- \ No newline at end of file
diff --git a/done/GlusterFS 3.6/Better Logging.md b/done/GlusterFS 3.6/Better Logging.md
new file mode 100644
index 0000000..6aad602
--- /dev/null
+++ b/done/GlusterFS 3.6/Better Logging.md
@@ -0,0 +1,348 @@
+Feature
+-------
+
+Gluster logging enhancements to support message IDs per message
+
+Summary
+-------
+
+Enhance gluster logging to provide the following features (SubFeature
+--\> SF):
+
+- SF1: Add message IDs to message
+
+- SF2: Standardize error num reporting across messages
+
+- SF3: Enable repetitive message suppression in logs
+
+- SF4: Log location and hierarchy standardization (in case anything is
+further required here, analysis pending)
+
+- SF5: Enable per sub-module logging level configuration
+
+- SF6: Enable logging to other frameworks, than just the current gluster
+logs
+
+- SF7: Generate a catalogue of these message, with message ID, message,
+reason for occurrence, recovery/troubleshooting steps.
+
+Owners
+------
+
+Balamurugan Arumugam <barumuga@redhat.com>
+Krishnan Parthasarathi <kparthas@redhat.com>
+Krutika Dhananjay <kdhananj@redhat.com>
+Shyamsundar Ranganathan <srangana@redhat.com>
+
+Current status
+--------------
+
+### Existing infrastructure:
+
+Currently gf\_logXXX exists as an infrastructure API for all logging
+related needs. This (typically) takes the form,
+
+gf\_log(dom, levl, fmt...)
+
+where,
+
+    dom: Open format string usually the xlator name, or "cli" or volume name etc.
+    levl: One of, GF_LOG_EMERG, GF_LOG_ALERT, GF_LOG_CRITICAL, GF_LOG_ERROR, GF_LOG_WARNING, GF_LOG_NOTICE, GF_LOG_INFO, GF_LOG_DEBUG, GF_LOG_TRACE
+    fmt: the actual message string, followed by the required arguments in the string
+
+The log initialization happens through,
+
+gf\_log\_init (void \*data, const char \*filename, const char \*ident)
+
+where,
+
+    data: glusterfs_ctx_t, largely unused in logging other than the required FILE and mutex fields
+    filename: file name to log to
+    ident: Like syslog ident parameter, largely unused
+
+The above infrastructure leads to logs of type, (sample extraction from
+nfs.log)
+
+     [2013-12-08 14:17:17.603879] I [socket.c:3485:socket_init] 0-socket.ACL: SSL support is NOT enabled
+     [2013-12-08 14:17:17.603937] I [socket.c:3500:socket_init] 0-socket.ACL: using system polling thread
+     [2013-12-08 14:17:17.612128] I [nfs.c:934:init] 0-nfs: NFS service started
+     [2013-12-08 14:17:17.612383] I [dht-shared.c:311:dht_init_regex] 0-testvol-dht: using regex rsync-hash-regex = ^\.(.+)\.[^.]+$
+
+### Limitations/Issues in the infrastructure
+
+​1) Auto analysis of logs needs to be done based on the final message
+string. Automated tools that can help with log message and related
+troubleshooting options need to use the final string, which needs to be
+intelligently parsed and also may change between releases. It would be
+desirable to have message IDs so that such tools and trouble shooting
+options can leverage the same in a much easier fashion.
+
+​2) The log message itself currently does not use the \_ident\_ which
+can help as we move to more common logging frameworks like journald,
+rsyslog (or syslog as the case maybe)
+
+​3) errno is the primary identifier of errors across gluster, i.e we do
+not have error codes in gluster and use errno values everywhere. The log
+messages currently do not lend themselves to standardization like
+printing the string equivalent of errno rather than the actual errno
+value, which \_could\_ be cryptic to administrators
+
+​4) Typical logging infrastructures provide suppression (on a
+configurable basis) for repetitive messages to prevent log flooding,
+this is currently missing in the current infrastructure
+
+​5) The current infrastructure cannot be used to control log levels at a
+per xlator or sub module, as the \_dom\_ passed is a string that change
+based on volume name, translator name etc. It would be desirable to have
+a better module identification mechanism that can help with this
+feature.
+
+​6) Currently the entire logging infrastructure resides within gluster.
+It would be desirable in scaled situations to have centralized logging
+and monitoring solutions in place, to be able to better analyse and
+monitor the cluster health and take actions.
+
+This requires some form of pluggable logging frameworks that can be used
+within gluster to enable this possibility. Currently the existing
+framework is used throughout gluster and hence we need only to change
+configuration and logging.c to enable logging to other frameworks (as an
+example the current syslog plug that was provided).
+
+It would be desirable to enhance this to provide a more robust framework
+for future extensions to other frameworks. This is not a limitation of
+the current framework, so much as a re-factor to be able to switch
+logging frameworks with more ease.
+
+​7) For centralized logging in the future, it would need better
+identification strings from various gluster processes and hosts, which
+is currently missing or suppressed in the logging infrastructure.
+
+Due to the nature of enhancements proposed, it is required that we
+better the current infrastructure for the stated needs and do some
+future proofing in terms of newer messages that would be added.
+
+Detailed Description
+--------------------
+
+NOTE: Covering details for SF1, SF2, and partially SF3, SF5, SF6. SF4/7
+will be covered in later revisions/phases.
+
+### Logging API changes:
+
+​1) Change the logging API as follows,
+
+From: gf\_log(dom, levl, fmt...)
+
+To: gf\_msg(dom, levl, errnum, msgid, fmt...)
+
+Where:
+
+    dom: Open string as used in the current logging infrastructure (helps in backward compat)
+    levl: As in current logging infrastructure (current levels seem sufficient enough to not add more levels for better debuggability etc.)
+    <new fields>
+    msgid: A message identifier, unique to this message FMT string and possibly this invocation. (SF1, lending to SF3)
+    errnum: The errno that this message is generated for (with an implicit 0 meaning no error number per se with this message) (SF2)
+
+NOTE: Internally the gf\_msg would still be a macro that would add the
+\_\_FILE\_\_ \_\_LINE\_\_ \_\_FUNCTION\_\_ arguments
+
+​2) Enforce \_ident\_ in the logging initialization API, gf\_log\_init
+(void \*data, const char \*filename, const char \*ident)
+
+Where:
+
+ ident would be the identifier string like, nfs, <mountpoint>, brick-<brick-name>, cli, glusterd, as is the case with the log file name that is generated today (lending to SF6)
+
+#### What this achieves:
+
+With the above changes, we now have a message ID per message
+(\_msgid\_), location of the message in terms of which component
+(\_dom\_) and which process (\_ident\_). The further identification of
+the message location in terms of host (ip/name) can be done in the
+framework, when centralized logging infrastructure is introduced.
+
+#### Log message changes:
+
+With the above changes to the API the log message can now appear in a
+compatibility mode to adhere to current logging format, or be presented
+as follows,
+
+log invoked as: gf\_msg(dom, levl, ENOTSUP, msgidX)
+
+Example: gf\_msg ("logchecks", GF\_LOG\_CRITICAL, 22, logchecks\_msg\_4,
+42, "Forty-Two", 42);
+
+Where: logchecks\_msg\_4 (GLFS\_COMP\_BASE + 4), "Critical: Format
+testing: %d:%s:%x"
+
+​1) Gluster logging framework (logged as)
+
+ [2014-02-17 08:52:28.038267] I [MSGID: 1002] [logchecks.c:44:go_log] 0-logchecks: Informational: Format testing: 42:Forty-Two:2a [Invalid argument]
+
+​2) syslog (passed as)
+
+ Feb 17 14:17:42 somari logchecks[26205]: [MSGID: 1002] [logchecks.c:44:go_log] 0-logchecks: Informational: Format testing: 42:Forty-Two:2a [Invalid argument]
+
+​3) journald (passed as)
+
+    sd_journal_send("MESSAGE=<vasprintf(dom, msgid(fmt))>",
+                        "MESSAGE_ID=msgid",
+                        "PRIORITY=levl",
+                        "CODE_FILE=`<fname>`", "CODE_LINE=`<lnum>", "CODE_FUNC=<fnnam>",
+                        "ERRNO=errnum",
+                        "SYSLOG_IDENTIFIER=<ident>"
+                        NULL);
+
+​4) CEE (Common Event Expression) format string passed to any CEE
+consumer (say lumberjack)
+
+Based on generating @CEE JSON string as per specifications and passing
+it the infrastructure in question.
+
+#### Message ID generation:
+
+​1) Some rules for message IDs
+
+- Every message, even if it is the same message FMT, will have a unique
+  message ID.
+- Changes to a specific message string hence will not change its ID and
+  also will not impact other locations in the code that use the same
+  message FMT.
+
+​2) A glfs-message-id.h file would contain ranges per component for
+individual component based messages to be created without overlapping on
+the ranges.
+
+​3) <component>-message.h would contain something as follows,
+
+     #define GLFS_COMP_BASE         GLFS_MSGID_COMP_<component>
+     #define GLFS_NUM_MESSAGES       1
+     #define GLFS_MSGID_END          (GLFS_COMP_BASE + GLFS_NUM_MESSAGES + 1)
+     /* Messaged with message IDs */
+     #define glfs_msg_start_x GLFS_COMP_BASE, "Invalid: Start of messages"
+     /*------------*/
+     #define <component>_msg_1 (GLFS_COMP_BASE + 1), "Test message, replace with"\
+                        " original when using the template"
+     /*------------*/
+     #define glfs_msg_end_x GLFS_MSGID_END, "Invalid: End of messages"
+
+​5) Each call to gf\_msg hence would be,
+
+    gf_msg(dom, levl, errnum, glfs_msg_x, ...)
+
+#### Setting per xlator logging levels (SF5):
+
+short description to be elaborated later
+
+Leverage this-\>loglevel to override the global loglevel. This can be
+also configured from gluster CLI at runtime to change the log levels at
+a per xlator level for targeted debugging.
+
+#### Multiple log suppression(SF3):
+
+short description to be elaborated later
+
+​1) Save the message string as follows, Msg\_Object(msgid,
+msgstring(vasprintf(dom, fmt)), timestamp, repetitions)
+
+​2) On each message received by the logging infrastructure check the
+list of saved last few Msg\_Objects as follows,
+
+2.1) compare msgid and on success compare msgstring for a match, compare
+repetition tolerance time with current TS and saved TS in the
+Msg\_Object
+
+2.1.1) if tolerance is within limits, increment repetitions and do not
+print message
+
+2.1.2) if tolerance is outside limits, print repetition count for saved
+message (if any) and print the new message
+
+2.2) If none of the messages match the current message, knock off the
+oldest message in the list printing any repetition count message for the
+same, and stash new message into the list
+
+The key things to remember and act on here would be to, minimize the
+string duplication on each message, and also to keep the comparison
+quick (hence base it off message IDs and errno to start with)
+
+#### Message catalogue (SF7):
+
+<short description to be elaborated later>
+
+The idea is to use Doxygen comments in the <component>-message.h per
+component, to list information in various sections per message of
+consequence and later use Doxygen to publish this catalogue on a per
+release basis.
+
+Benefit to GlusterFS
+--------------------
+
+The mentioned limitations and auto log analysis benefits would accrue
+for GlusterFS
+
+Scope
+-----
+
+### Nature of proposed change
+
+All gf\_logXXX function invocations would change to gf\_msgXXX
+invocations.
+
+### Implications on manageability
+
+None
+
+### Implications on presentation layer
+
+None
+
+### Implications on persistence layer
+
+None
+
+### Implications on 'GlusterFS' backend
+
+None
+
+### Modification to GlusterFS metadata
+
+None
+
+### Implications on 'glusterd'
+
+None
+
+How To Test
+-----------
+
+A separate test utility that tests various logs and formats would be
+provided to ensure that functionality can be tested independent of
+GlusterFS
+
+User Experience
+---------------
+
+Users would notice changed logging formats as mentioned above, the
+additional field of importance would be the MSGID:
+
+Dependencies
+------------
+
+None
+
+Documentation
+-------------
+
+Intending to add a logging.md (or modify the same) to elaborate on how a
+new component should now use the new framework and generate messages
+with IDs in the same.
+
+Status
+------
+
+In development (see, <http://review.gluster.org/#/c/6547/> )
+
+Comments and Discussion
+-----------------------
+
+<Follow here> \ No newline at end of file
diff --git a/done/GlusterFS 3.6/Better Peer Identification.md b/done/GlusterFS 3.6/Better Peer Identification.md
new file mode 100644
index 0000000..a8c6996
--- /dev/null
+++ b/done/GlusterFS 3.6/Better Peer Identification.md
@@ -0,0 +1,172 @@
+Feature
+-------
+
+**Better peer identification**
+
+Summary
+-------
+
+This proposal is regarding better identification of peers.
+
+Owners
+------
+
+Kaushal Madappa <kmadappa@redhat.com>
+
+Current status
+--------------
+
+Glusterd currently is inconsistent in the way it identifies peers. This
+causes problems when the same peer is referenced with different names in
+different gluster commands.
+
+Detailed Description
+--------------------
+
+Currently, the way we identify peers is not consistent all through the
+gluster code. We use uuids internally and hostnames externally.
+
+This setup works pretty well when all the peers are on a single network,
+have one address, and are referred to in all the gluster commands with
+same address.
+
+But once we start mixing up addresses in the commands (ip, shortnames,
+fqdn) and bring in multiple networks we have problems.
+
+The problems were discussed in the following mailing list threads and
+some solutions were proposed.
+
+- How do we identify peers? [^1]
+- RFC - "Connection Groups" concept [^2]
+
+The solution to the multi-network problem is dependent on the solution
+to the peer identification problem. So it'll be good to target fixing
+the peer identification problem asap, ie. in 3.6, and take up the
+networks problem later.
+
+Benefit to GlusterFS
+--------------------
+
+Sanity. It will be great to have all internal identifiers for peers
+happening through a UUID, and being translated into a host/IP at the
+most superficial layer.
+
+Scope
+-----
+
+### Nature of proposed change
+
+The following changes will be done in Glusterd to improve peer
+identification.
+
+1. Peerinfo struct will be extended to have a list of associated
+ hostnames/addresses, instead of a single hostname as it is
+ currently. The import/export and store/restore functions will be
+ changed to handle this. CLI will be updated to show this list of
+ addresses in peer status and pool list commands.
+2. Peer probe will be changed to append an address to the peerinfo
+ address list, when we observe that the given address belongs to an
+ existing peer.
+3. Have a new API for translation between hostname/addresses into
+ UUIDs. This new API will be used in all places where
+ hostnames/addresses were being validated, including peer probe, peer
+ detach, volume create, add-brick, remove-brick etc.
+4. A new command - 'gluster peer add-address <existing> <new-address>'
+ - which appends to the address list will be implemented if time
+ permits.
+5. A new command - 'gluster peer rename <existing> <new>' - which will
+ rename all occurrences of a peer with the newly given name will be
+ implemented if time permits.
+
+Changes 1-3 are the base for the other changes and will the primary
+deliverables for this feature.
+
+### Implications on manageability
+
+The primary changes will bring about some changes to the CLI output of
+'peer status' and 'pool list' commands. The normal and XML outputs for
+these commands will contain a list of addresses for each peer, instead
+of a single hostname.
+
+Tools depending on the output of these commands will need to be updated.
+
+**TODO**: *Add sample outputs*
+
+The new commands 'peer add-address' and 'peer rename' will improve
+manageability of peers.
+
+### Implications on presentation layer
+
+None
+
+### Implications on persistence layer
+
+None
+
+### Implications on 'GlusterFS' backend
+
+None
+
+### Modification to GlusterFS metadata
+
+None
+
+### Implications on 'glusterd'
+
+<persistent store, configuration changes, brick-op...>
+
+How To Test
+-----------
+
+**TODO:** *Add test cases*
+
+User Experience
+---------------
+
+User experience will improve for commands which use peer identifiers
+(volume create/add-brick/remove-brick, peer probe, peer detach), as the
+user will no longer face errors caused by mixed usage of identifiers.
+
+Dependencies
+------------
+
+None.
+
+Documentation
+-------------
+
+The new behaviour of the peer probe command will need to be documented.
+The new commands will need to be documented as well.
+
+**TODO:** *Add more documentations*
+
+Status
+------
+
+The feature is under development on forge [^3] and github [^4]. This
+github merge request [^5] can be used for performing preliminary
+reviews. Once we are satisfied with the changes, it will be posted for
+review on gerrit.
+
+Comments and Discussion
+-----------------------
+
+There are open issues around node crash + re-install with same IP (but
+new UUID) which need to be addressed in this effort.
+
+Links
+-----
+
+<references>
+</references>
+
+[^1]: <http://lists.gnu.org/archive/html/gluster-devel/2013-06/msg00067.html>
+
+[^2]: <http://lists.gnu.org/archive/html/gluster-devel/2013-06/msg00069.html>
+
+[^3]: <https://forge.gluster.org/~kshlm/glusterfs-core/kshlms-glusterfs/commits/better-peer-identification>
+
+[^4]: <https://github.com/kshlm/glusterfs/tree/better-peer-identification>
+
+[^5]: <https://github.com/kshlm/glusterfs/pull/2>
diff --git a/done/GlusterFS 3.6/Gluster User Serviceable Snapshots.md b/done/GlusterFS 3.6/Gluster User Serviceable Snapshots.md
new file mode 100644
index 0000000..9af7062
--- /dev/null
+++ b/done/GlusterFS 3.6/Gluster User Serviceable Snapshots.md
@@ -0,0 +1,39 @@
+Feature
+-------
+
+Enable user-serviceable snapshots for GlusterFS Volumes based on
+GlusterFS-Snapshot feature
+
+Owners
+------
+
+Anand Avati
+Anand Subramanian <anands@redhat.com>
+Raghavendra Bhat
+Varun Shastry
+
+Summary
+-------
+
+Each snapshot-capable GlusterFS volume will contain a .snaps directory
+through which a user will be able to access previously taken
+point-in-time snapshot copies of their data. This will be enabled
+through a hidden .snaps folder in each directory or sub-directory within
+the volume. These user-serviceable snapshot copies will be read-only.
+
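+A minimal usage sketch (volume name, mount point, and snapshot name are
+hypothetical):
+
+    gluster volume set testvol features.uss enable
+
+    # list available snapshots from any directory in the mount
+    ls /mnt/testvol/somedir/.snaps
+
+    # browse a particular snapshot read-only
+    ls /mnt/testvol/somedir/.snaps/snap1
+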
+Tests
+-----
+
+1) Enable uss (gluster volume set <volume name> features.uss enable). A
+snap daemon should get started for the volume. It should be visible in
+the gluster volume status command.
+
+2) Entering the snapshot world: ls on .snaps from any directory within
+the filesystem should be successful and should show the list of
+snapshots as directories.
+
+3) Accessing the snapshots: one of the snapshots can be entered and it
+should show the contents of the directory from which .snaps was entered,
+as they were when the snapshot was taken. NOTE: If the directory was not
+present when a snapshot was taken (say snap1) and was created later,
+then entering the snap1 directory (or any access) will fail with a stale
+file handle error.
+
+4) Reading from snapshots: any kind of read operation from the snapshots
+should be successful, but any modification to snapshot data is not
+allowed.
+Snapshots are read-only \ No newline at end of file
diff --git a/done/GlusterFS 3.6/Gluster Volume Snapshot.md b/done/GlusterFS 3.6/Gluster Volume Snapshot.md
new file mode 100644
index 0000000..468992a
--- /dev/null
+++ b/done/GlusterFS 3.6/Gluster Volume Snapshot.md
@@ -0,0 +1,354 @@
+Feature
+-------
+
+Snapshot of Gluster Volume
+
+Summary
+-------
+
+Gluster volume snapshot will provide point-in-time copy of a GlusterFS
+volume. This snapshot is an online-snapshot therefore file-system and
+its associated data continue to be available for the clients, while the
+snapshot is being taken.
+
+Snapshot of a GlusterFS volume will create another read-only volume
+which will be a point-in-time copy of the original volume. Users can use
+this read-only volume to recover any file(s) they want. Snapshot will
+also provide restore feature which will help the user to recover an
+entire volume. The restore operation will replace the original volume
+with the snapshot volume.
+
+Owner(s)
+--------
+
+Rajesh Joseph <rjoseph@redhat.com>
+
+Copyright
+---------
+
+Copyright (c) 2013-2014 Red Hat, Inc. <http://www.redhat.com>
+
+This feature is licensed under your choice of the GNU Lesser General
+Public License, version 3 or any later version (LGPLv3 or later), or the
+GNU General Public License, version 2 (GPLv2), in all cases as published
+by the Free Software Foundation.
+
+Current status
+--------------
+
+Gluster volume snapshot support is provided in GlusterFS 3.6
+
+Detailed Description
+--------------------
+
+GlusterFS snapshot feature will provide a crash consistent point-in-time
+copy of Gluster volume(s). This snapshot is an online-snapshot therefore
+file-system and its associated data continue to be available for the
+clients, while the snapshot is being taken. As of now we are not
+planning to provide application level crash consistency. That means if a
+snapshot is restored then applications need to rely on journals or other
+technique to recover or cleanup some of the operations performed on
+GlusterFS volume.
+
+A GlusterFS volume is made up of multiple bricks spread across multiple
+nodes. Each brick translates to a directory path on a given file-system.
+The current snapshot design is based on thinly provisioned LVM2 snapshot
+feature. Therefore as a prerequisite the Gluster bricks should be on
+thinly provisioned LVM. For a single lvm, taking a snapshot would be
+straight forward for the admin, but this is compounded in a GlusterFS
+volume which has bricks spread across multiple LVM’s across multiple
+nodes. Gluster volume snapshot feature aims to provide a set of
+interfaces from which the admin can snap and manage the snapshots for
+Gluster volumes.
+
+Gluster volume snapshot is nothing but snapshots of all the bricks in
+the volume. So ideally all the bricks should be snapped at the same
+time. But with real-life latencies (processor and network) this may not
+hold true all the time. Therefore we need to make sure that during
+snapshot the file-system is in consistent state. Therefore we barrier
+few operation so that the file-system remains in a healthy state during
+snapshot.
+
+For details about barrier [Server Side
+Barrier](http://www.gluster.org/community/documentation/index.php/Features/Server-side_Barrier_feature)
+
+Benefit to GlusterFS
+--------------------
+
+Snapshot of glusterfs volume allows users to
+
+- A point in time checkpoint from which to recover/failback
+- Allow read-only snaps to be the source of backups.
+
+Scope
+-----
+
+### Nature of proposed change
+
+Gluster cli will be modified to provide new commands for snapshot
+management. The entire snapshot core implementation will be done in
+glusterd.
+
+Apart from this, snapshot will also make use of a quiescing xlator. This
+will be a server-side translator which will quiesce fops that can modify
+disk state. The quiescing will be done till the snapshot operation is
+complete.
+
+### Implications on manageability
+
+Snapshot will provide new set of cli commands to manage snapshots. REST
+APIs are not planned for this release.
+
+### Implications on persistence layer
+
+Snapshot will create new volume per snapshot. These volumes are stored
+in /var/lib/glusterd/snaps folder. Apart from this each volume will have
+additional snapshot related information stored in snap\_list.info file
+in its respective vol folder.
+
+### Implications on 'glusterd'
+
+Snapshot information and snapshot volume details are stored in
+persistent stores.
+
+How To Test
+-----------
+
+For testing this feature one needs to have multiple thinly provisioned
+logical volumes, or else needs to create LVs using loopback devices.
+
+Details of how to create thin volume can be found at the following link
+<https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/thinly_provisioned_volume_creation.html>
+
+Each brick needs to be on an independent LV, and these LVs should be
+thinly provisioned. From these bricks, create a Gluster volume. This
+volume can then be used for snapshot testing, as sketched below.
+
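+A rough sketch of preparing one such thinly provisioned brick; the
+device, volume group, and mount names are hypothetical:
+
+    pvcreate /dev/sdb
+    vgcreate snap_vg /dev/sdb
+    lvcreate -L 20G -T snap_vg/thinpool
+    lvcreate -V 10G -T snap_vg/thinpool -n brick1_lv
+    mkfs.xfs /dev/snap_vg/brick1_lv
+    mkdir -p /bricks/brick1
+    mount /dev/snap_vg/brick1_lv /bricks/brick1
+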
+See the User Experience section for various commands of snapshot.
+
+User Experience
+---------------
+
+##### Snapshot creation
+
+ snapshot create <snapname> <volname(s)> [description <description>] [force]
+
+This command will create a snapshot of the volume identified by volname.
+snapname is a mandatory field and the name should be unique in the
+entire cluster. Users can also provide an optional description to be
+saved along with the snap (max 1024 characters). The force keyword is
+used if some bricks of the original volume are down and you still want
+to take the snapshot.
+
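+For example (snapshot and volume names are hypothetical):
+
+    gluster snapshot create snap1 testvol description "before upgrade"
+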
+##### Listing of available snaps
+
+ gluster snapshot list [snap-name] [vol <volname>]
+
+This command is used to list all snapshots taken, or for a specified
+volume. If snap-name is provided then it will list the details of that
+snap.
+
+##### Configuring the snapshot behavior
+
+ gluster snapshot config [vol-name]
+
+This command will display existing config values for a volume. If the
+volume name is not provided then the config values of all the volumes
+are displayed.
+
+ gluster snapshot config [vol-name] [<snap-max-limit> <count>] [<snap-max-soft-limit> <percentage>] [force]
+
+The above command can be used to change the existing config values. If
+vol-name is provided then config value of that volume is changed, else
+it will set/change the system limit.
+
+The system limit is the default config value for all the volumes. A
+volume-specific limit cannot cross the system limit. If a
+volume-specific limit is not provided then the system limit will be
+considered.
+
+If any of these limits is decreased and the current snap count of the
+system/volume is more than the limit, then the command will fail. If the
+user still wants to decrease the limit then the force option should be
+used.
+
+**snap-max-limit**: Maximum snapshot limit for a volume. Snapshots
+creation will fail if snap count reach this limit.
+
+**snap-max-soft-limit**: Soft snapshot limit for a volume, represented
+as a percentage value. Snapshots can still be created after the snap
+count reaches this limit, but an auto-deletion will be triggered: the
+oldest snaps will be deleted once the snap count crosses this limit.
+
+##### Status of snapshots
+
+ gluster snapshot status ([snap-name] | [volume <vol-name>])
+
+Shows the status of all the snapshots or the specified snapshot. The
+status will include the brick details, LVM details, process details,
+etc.
+
+##### Activating a snap volume
+
+By default a created snapshot will be in an inactive state. Use the
+following command to activate a snapshot.
+
+ gluster snapshot activate <snap-name>
+
+##### Deactivating a snap volume
+
+ gluster snapshot deactivate <snap-name>
+
+The above command will deactivate an active snapshot.
+
+##### Deleting snaps
+
+ gluster snapshot delete <snap-name>
+
+This command will delete the specified snapshot.
+
+##### Restoring snaps
+
+ gluster snapshot restore <snap-name>
+
+This command restores an already taken snapshot of a single volume or of
+multiple volumes. Snapshot restore is an offline activity; therefore, if
+any volume which is part of the given snap is online then the restore
+operation will fail.
+
+Once the snapshot is restored it will be deleted from the list of
+snapshots.
+
+Dependencies
+------------
+
+To provide support for a crash-consistent snapshot feature, Gluster core
+components themselves should be crash-consistent. As of now Gluster as a
+whole is not crash-consistent. In this section we identify those Gluster
+components which are not crash-consistent.
+
+**Geo-Replication**: Geo-replication provides a master-slave
+synchronization option to Gluster. Geo-replication maintains state
+information for completing the sync operation. Therefore, ideally, when
+a snapshot is taken both the master and slave snapshots should be taken,
+and both should be in a mutually consistent state.
+
+Geo-replication makes use of the change-log to do the sync. By default
+the change-log is stored in the .glusterfs folder in every brick, but
+the change-log path is configurable. If the change-log is part of the
+brick then the snapshot will contain the change-log changes as well; if
+it is not, then it needs to be saved separately during a snapshot.
+
+Following things should be considered for making the change-log
+crash-consistent:
+
+- Change-log is part of the brick of the same volume.
+- Change-log is outside the brick. As of now there is no size limit on
+  the change-log files. We need to answer the following questions here:
+    - Time taken to make a copy of the entire change-log. This will
+      affect the overall time of the snapshot operation.
+    - The location where it can be copied. This will impact the disk
+      usage of the target disk or file-system.
+- Some part of the change-log is present in the brick and some is
+  outside the brick. This situation arises when the change-log path is
+  changed in-between.
+- Change-log is saved in another volume and this volume forms a CG with
+  the volume about to be snapped.
+
+**Note**: Considering the above points we have decided not to support
+change-log stored outside the bricks.
+
+For this release automatic snapshot of both the master and slave session
+is not supported. If required, the user needs to explicitly take
+snapshots of both master and slave. The following steps need to be
+followed while taking a snapshot of a master and slave setup (a command
+sketch follows the list):
+
+- Stop geo-replication manually.
+- Snapshot all the slaves first.
+- When the slave snapshot is done then initiate the master snapshot.
+- When both snapshots are complete, geo-synchronization can be started
+ again.
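+
+A command sketch of the procedure above (session, host and volume names
+are illustrative; the slave snapshot is taken on the slave cluster):
+
+    # stop the geo-replication session
+    gluster volume geo-replication mastervol slavehost::slavevol stop
+
+    # snapshot the slave volume first (on the slave cluster), then the master
+    gluster snapshot create slavesnap1 slavevol
+    gluster snapshot create mastersnap1 mastervol
+
+    # resume geo-replication
+    gluster volume geo-replication mastervol slavehost::slavevol start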
+
+**Gluster Quota**: Quota enables an admin to specify a per-directory
+quota. Quota makes use of the marker translator to enforce quota. As of
+now the marker framework is not completely crash-consistent. As part of
+the snapshot feature we need to address the following issues.
+
+- If a snapshot is taken while the contribution size of a file is
+  being updated then you might end up with a snapshot where there is a
+  mismatch between the actual size of the file and the contribution of
+  the file. These inconsistencies can only be rectified when a
+  look-up is issued on the snapshot volume for the same file. As a
+  workaround the admin needs to issue an explicit file-system crawl to
+  rectify the problem.
+- For NFS, quota makes use of pgfid to build a path from the gfid and
+  enforce quota. As of now the pgfid update is not crash-consistent.
+- Quota saves its configuration on the file-system under the
+  /var/lib/glusterd folder. As part of the snapshot feature we need to
+  save this file as well.
+
+**NFS**: NFS uses a single graph to represent all the volumes in the
+system, and to make all the snapshot volumes accessible these snapshot
+volumes should be added to this graph. This brings in another
+restriction, i.e. all the snapshot names should be unique, and
+additionally a snap name should not clash with any other volume name.
+
+To handle this situation we have decided to use an internal uuid as the
+snap name, and to keep a mapping of this uuid and the user-given snap
+name in an internal structure.
+
+Another restriction with NFS is that when a newly created volume
+(snapshot volume) is started it will restart the NFS server. Therefore
+we decided that when a snapshot is taken it will be in the stopped
+state. Later, when the snapshot volume is needed, it can be started
+explicitly.
+
+**DHT**: The DHT xlator decides which node to look in for a
+file/directory. Some of the DHT fops are not atomic in nature, e.g.
+rename (both file and directory), and these operations are not
+transactional either. That means if a crash happens the data on the
+server might be in an inconsistent state. Depending upon the time of the
+snapshot and which DHT operation is in what state, the snapshot can be
+inconsistent.
+
+**AFR**: AFR is the high-availability module in Gluster. AFR keeps track
+of the fresh and correct copy of data using extended attributes.
+Therefore it is important that these extended attributes are written to
+disk before taking a snapshot. To make sure of this, the snapshot module
+will issue an explicit sync after the barrier/quiescing.
+
+The other issue with the current AFR is that it writes the volume name
+into the extended attributes of all the files. AFR uses this for
+self-healing. When a snapshot is taken of such a volume, the snapshotted
+volume will also have the same volume name. Therefore AFR needs to
+create a mapping of the real volume name and the extended-attribute
+entry name in the volfile, so that the correct name can be referred to
+during self-heal.
+
+Another dependency on AFR is that currently there is no direct API or
+callback function which tells whether AFR self-healing has completed on
+a volume. This capability is required to heal a snapshot volume before a
+restore.
+
+Documentation
+-------------
+
+Status
+------
+
+In development
+
+Comments and Discussion
+-----------------------
+
+<Follow here>
diff --git a/done/GlusterFS 3.6/New Style Replication.md b/done/GlusterFS 3.6/New Style Replication.md
new file mode 100644
index 0000000..ffd8167
--- /dev/null
+++ b/done/GlusterFS 3.6/New Style Replication.md
@@ -0,0 +1,230 @@
+Goal
+----
+
+More partition-tolerant replication, with higher performance for most
+use cases.
+
+Summary
+-------
+
+NSR is a new synchronous replication translator, complementing or
+perhaps some day replacing AFR.
+
+Owners
+------
+
+Jeff Darcy <jdarcy@redhat.com>
+Venky Shankar <vshankar@redhat.com>
+
+Current status
+--------------
+
+Design and prototype (nearly) complete, implementation beginning.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+[AFR bugs related to "split
+brain"](https://bugzilla.redhat.com/buglist.cgi?classification=Community&component=replicate&list_id=3040567&product=GlusterFS&query_format=advanced&short_desc=split&short_desc_type=allwordssubstr)
+
+[AFR bugs related to
+"perf"](https://bugzilla.redhat.com/buglist.cgi?classification=Community&component=replicate&list_id=3040572&product=GlusterFS&query_format=advanced&short_desc=perf&short_desc_type=allwordssubstr)
+
+(Both lists are undoubtedly partial because not all bugs in these areas
+use these specific words. In particular, "GFID mismatch" bugs are
+really a kind of split brain, but aren't represented.)
+
+Detailed Description
+--------------------
+
+NSR is designed to have the following features.
+
+- Server based - "chain" replication can use bandwidth of both client
+ and server instead of splitting client bandwidth N ways.
+
+- Journal based - for reduced network traffic in normal operation,
+ plus faster recovery and greater resistance to "split brain" errors.
+
+- Variable consistency model - based on
+ [Dynamo](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)
+ to provide options trading some consistency for greater availability
+ and/or performance.
+
+- Newer, smaller codebase - reduces technical debt, enables higher
+ replica counts, more informative status reporting and logging, and
+ other future features (e.g. ordered asynchronous replication).
+
+Benefit to GlusterFS
+====================
+
+Faster, more robust, more manageable/maintainable replication.
+
+Scope
+=====
+
+Nature of proposed change
+-------------------------
+
+At least two new translators will be necessary.
+
+- A simple client-side translator to route requests to the current
+ leader among the bricks in a replica set.
+
+- A server-side translator to handle the "heavy lifting" of
+ replication, recovery, etc.
+
+Implications on manageability
+-----------------------------
+
+At a high level, commands to enable, configure, and manage NSR will be
+very similar to those already used for AFR. At a lower level, the
+options affecting things like quorum, consistency, and placement
+of journals will all be completely different.
+
+Implications on presentation layer
+----------------------------------
+
+Minimal. Most changes will be to simplify or remove special handling for
+AFR's unique behavior (especially around lookup vs. self-heal).
+
+Implications on persistence layer
+---------------------------------
+
+N/A
+
+Implications on 'GlusterFS' backend
+-----------------------------------
+
+The journal for each brick in an NSR volume might (for performance
+reasons) be placed on one or more local volumes other than the one
+containing the brick's data. Special requirements around AIO, fsync,
+etc. will be less than with AFR.
+
+Modification to GlusterFS metadata
+----------------------------------
+
+NSR will not use the same xattrs as AFR, reducing the need for larger
+inodes.
+
+Implications on 'glusterd'
+--------------------------
+
+Volgen must be able to configure the client-side and server-side parts
+of NSR, instead of AFR on the client side and index (which will no
+longer be necessary) on the server side. Other interactions with
+glusterd should remain mostly the same.
+
+How To Test
+===========
+
+Most basic AFR tests - e.g. reading/writing data, killing nodes,
+starting/stopping self-heal - would apply to NSR as well. Tests that
+embed assumptions about AFR xattrs or other internal artifacts will need
+to be re-written.
+
+User Experience
+===============
+
+Minimal change, mostly related to new options.
+
+Dependencies
+============
+
+NSR depends on a cluster-management framework that can provide
+membership tracking, leader election, and robust consistent key/value
+data storage. This is expected to be developed in parallel as part of
+the glusterd-scalability feature, but can be implemented (in simplified
+form) within NSR itself if necessary.
+
+Documentation
+=============
+
+TBD.
+
+Status
+======
+
+Some parts of earlier implementation updated to current tree, others in
+the middle of replacement.
+
+- [New design](http://review.gluster.org/#/c/8915/)
+
+- [Basic translator code](http://review.gluster.org/#/c/8913/) (needs
+  update to the new code-generation infrastructure)
+
+- [GF\_FOP\_IPC](http://review.gluster.org/#/c/8812/)
+
+- [etcd support](http://review.gluster.org/#/c/8887/)
+
+- [New code-generation
+ infrastructure](http://review.gluster.org/#/c/9411/)
+
+- [New data-logging
+ translator](https://forge.gluster.org/~jdarcy/glusterfs-core/jdarcys-glusterfs-data-logging)
+
+Comments and Discussion
+=======================
+
+My biggest concern with journal-based replication comes from my previous
+use of DRBD. They do an "activity log"[^1] which sounds like the same
+basic concept. Once that log filled, I experienced cascading failure.
+When the journal can be filled faster than it's emptied this could cause
+the problem I experienced.
+
+So what I'm looking to be convinced is how journalled replication
+maintains full redundancy and how it will prevent the journal input from
+exceeding the capacity of the journal output or at least how it won't
+fail if this should happen.
+
+[jjulian](User:Jjulian "wikilink")
+([talk](User talk:Jjulian "wikilink")) 17:21, 13 August 2013 (UTC)
+
+<hr/>
+This is akin to a CAP Theorem[^2][^3] problem. If your nodes can't
+communicate, what do you do with writes? Our replication approach has
+traditionally been CP - enforce quorum, allow writes only among the
+majority - and for the sake of satisfying user expectations (or POSIX)
+pretty much has to remain CP at least by default. I personally think we
+need to allow an AP choice as well, which is why the quorum levels in
+NSR are tunable to get that result.
+
+So, what do we do if a node runs out of journal space? Well, it's unable
+to function normally, i.e. it's failed, so it can't count toward quorum.
+This would immediately lead to loss of write availability in a two-node
+replica set, and could happen easily enough in a three-node replica set
+if two similarly configured nodes ran out of journal space
+simultaneously. A significant part of the complexity in our design is
+around pruning no-longer-needed journal segments, precisely because this
+is an icky problem, but even with all the pruning in the world it could
+still happen eventually. Therefore the design also includes the notion
+of arbiters, which can be quorum-only or can also have their own
+journals (with no or partial data). Therefore, your quorum for
+admission/journaling purposes can be significantly higher than your
+actual replica count. So what options do we have to avoid or deal with
+journal exhaustion?
+
+- Add more journal space (it's just files, so this can be done
+ reactively during an extended outage).
+
+- Add arbiters.
+
+- Decrease the quorum levels.
+
+- Manually kick a node out of the replica set.
+
+- Add admission control, artificially delaying new requests as the
+ journal becomes full. (This one requires more code.)
+
+If you do \*none\* of these things then yeah, you're scrod. That said,
+do you think these options seem sufficient?
+
+[Jdarcy](User:Jdarcy "wikilink") ([talk](User talk:Jdarcy "wikilink"))
+15:27, 29 August 2013 (UTC)
+
+<references/>
+
+[^1]: <http://www.drbd.org/users-guide-emb/s-activity-log.html>
+
+[^2]: <http://www.julianbrowne.com/article/viewer/brewers-cap-theorem>
+
+[^3]: <http://henryr.github.io/cap-faq/>
diff --git a/done/GlusterFS 3.6/Persistent AFR Changelog xattributes.md b/done/GlusterFS 3.6/Persistent AFR Changelog xattributes.md
new file mode 100644
index 0000000..e21b788
--- /dev/null
+++ b/done/GlusterFS 3.6/Persistent AFR Changelog xattributes.md
@@ -0,0 +1,178 @@
+Feature
+-------
+
+Provide a unique and consistent name for AFR changelog extended
+attributes/ client translator names in the volume graph.
+
+Summary
+-------
+
+Make AFR changelog extended attribute names independent of brick
+position in the graph, which ensures that there will be no potential
+misdirected self-heals during remove-brick operation.
+
+Owners
+------
+
+Ravishankar N <ravishankar@redhat.com>
+Pranith Kumar K <pkarampu@redhat.com>
+
+Current status
+--------------
+
+Patches merged in master.
+
+<http://review.gluster.org/#/c/7122/>
+
+<http://review.gluster.org/#/c/7155/>
+
+Detailed Description
+--------------------
+
+BACKGROUND ON THE PROBLEM:
+==========================
+
+AFR makes use of changelog extended attributes on a per-file basis which
+record pending operations on that file and are used to determine the
+sources and sinks when healing needs to be done. As of today, AFR uses
+the client translator names (from the volume graph) as the names of the
+changelog attributes. For eg. for a replica 3 volume, each file on every
+brick has the following extended attributes:
+
+ trusted.afr.<volname>-client-0-->maps to Brick0
+ trusted.afr.<volname>-client-1-->maps to Brick1
+ trusted.afr.<volname>-client-2-->maps to Brick2
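+
+For reference, these attributes can be inspected directly on a brick with
+getfattr (the brick path and file name are illustrative):
+
+    getfattr -d -m trusted.afr -e hex /bricks/brick0/some-file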
+
+1) Now when any brick is removed (say Brick1), the graph is regenerated
+and AFR maps the xattrs to the bricks so:
+
+    trusted.afr.<volname>-client-0-->maps to Brick0
+    trusted.afr.<volname>-client-1-->maps to Brick2
+
+Thus the xattr 'trusted.afr.testvol-client-1', which earlier referred to
+Brick1's attributes, now refers to Brick2's. If there were pending
+self-heals before the remove-brick happened, healing could possibly
+happen in the wrong direction, thereby causing data loss.
+
+2) The second problem is a dependency with the Snapshot feature. Snapshot
+volumes have new names (UUID based) and thus the (client) xlator names
+are different. Eg: \<<volname>-client-0\> will now be
+\<<snapvolname>-client-0\>. When AFR uses these new names to query for
+its changelog xattrs, the files on the bricks still carry the old
+changelog xattrs. Hence the heal information is completely lost.
+
+WHAT IS THE EXACT ISSUE WE ARE SOLVING OR OBJECTIVE OF THE FEATURE/DESIGN?
+==========================================================================
+
+In a nutshell, the solution is to generate unique and persistent names
+for the client translators so that even if any of the bricks are
+removed, the translator names always map to the same bricks. In turn
+AFR, which uses these names for the changelog xattr names, also refers
+to the correct bricks.
+
+SOLUTION:
+
+The solution is explained as a sequence of steps:
+
+- The client translator names will still use the existing
+  nomenclature, except that now they are monotonically increasing
+  (<volname>-client-0,1,2...) and are not dependent on the brick
+  position. Let us call these names brick-IDs. These brick-IDs are
+  also written to the brickinfo files (in
+  /var/lib/glusterd/vols/<volname>/bricks/\*) by glusterd during
+  volume creation. When the volfile is generated, these brick-IDs
+  form the client xlator names.
+
+- Whenever a brick operation is performed, the names are retained for
+ existing bricks irrespective of their position in the graph. New
+ bricks get the monotonically increasing brick-ID while names for
+ existing bricks are obtained from the brickinfo file.
+
+- Note that this approach does not affect client versions (old/new) in
+ anyway because the clients just use the volume config provided by
+ the volfile server.
+
+- For retaining backward compatibility, we need to check two items:
+  (a) under what condition is remove-brick allowed; (b) when is the
+  brick-ID written to the brickinfo file.
+
+For the above 2 items, the implementation rules will be thus:
+
+i) This feature is implemented in 3.6. Let's say its op-version is 5.
+
+ii) We need to implement a check to allow remove-brick only if the
+cluster op-version is \>= 5.
+
+iii) The brick-ID is written to brickinfo when the nodes are upgraded
+(during glusterd restore) and when a peer is probed (i.e. during volfile
+import).
+
+Benefit to GlusterFS
+--------------------
+
+Even if there are pending self-heals, remove-brick operations can be
+carried out safely without fear of incorrect heals which may cause data
+loss.
+
+Scope
+-----
+
+### Nature of proposed change
+
+Modifications will be made in restore, volfile import and volgen
+portions of glusterd.
+
+### Implications on manageability
+
+N/A
+
+### Implications on presentation layer
+
+N/A
+
+### Implications on persistence layer
+
+N/A
+
+### Implications on 'GlusterFS' backend
+
+N/A
+
+### Modification to GlusterFS metadata
+
+N/A
+
+### Implications on 'glusterd'
+
+As described earlier.
+
+How To Test
+-----------
+
+The remove-brick operation needs to be carried out on rep/dist-rep
+volumes having pending self-heals, and it must be verified that no data
+is lost. Snapshots of the volumes must also be able to access files
+without any issues.
+
+User Experience
+---------------
+
+N/A
+
+Dependencies
+------------
+
+None.
+
+Documentation
+-------------
+
+TBD
+
+Status
+------
+
+See 'Current status' section.
+
+Comments and Discussion
+-----------------------
+
+<Follow here> \ No newline at end of file
diff --git a/done/GlusterFS 3.6/RDMA Improvements.md b/done/GlusterFS 3.6/RDMA Improvements.md
new file mode 100644
index 0000000..1e71729
--- /dev/null
+++ b/done/GlusterFS 3.6/RDMA Improvements.md
@@ -0,0 +1,101 @@
+Feature
+-------
+
+**RDMA Improvements**
+
+Summary
+-------
+
+This proposal is regarding getting RDMA volumes out of tech preview.
+
+Owners
+------
+
+Raghavendra Gowdappa <rgowdapp@redhat.com>
+Vijay Bellur <vbellur@redhat.com>
+
+Current status
+--------------
+
+Work in progress
+
+Detailed Description
+--------------------
+
+Fix known & unknown issues in volumes with transport type rdma so that
+RDMA can be used as the interconnect between clients and servers &
+between servers.
+
+- Performance issues - performance was found to be bad when plain
+  ib-verbs send/recv was compared with RDMA reads and writes.
+- Co-existence with tcp - there seemed to be some memory corruptions
+  when we had both tcp and rdma transports.
+- librdmacm for connection management - with this there is a
+  requirement that the brick has to listen on an IPoIB address, and
+  this affects our current flexibility where a peer can connect to
+  either an ethernet or an infiniband address. Another related
+  feature, Better peer identification, will help us resolve this
+  issue.
+- More testing required
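+
+For reference, a volume using the rdma transport can be created and
+mounted as shown below (host names and brick paths are illustrative):
+
+    gluster volume create rdmavol transport rdma \
+        server1:/bricks/b1 server2:/bricks/b1
+    gluster volume start rdmavol
+    mount -t glusterfs -o transport=rdma server1:/rdmavol /mnt/rdmavol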
+
+Benefit to GlusterFS
+--------------------
+
+Scope
+-----
+
+### Nature of proposed change
+
+Bug-fixes to transport/rdma
+
+### Implications on manageability
+
+Remove the warning about creation of rdma volumes in CLI.
+
+### Implications on presentation layer
+
+TBD
+
+### Implications on persistence layer
+
+No impact
+
+### Implications on 'GlusterFS' backend
+
+No impact
+
+### Modification to GlusterFS metadata
+
+No impact
+
+### Implications on 'glusterd'
+
+No impact
+
+How To Test
+-----------
+
+TBD
+
+User Experience
+---------------
+
+TBD
+
+Dependencies
+------------
+
+Better Peer identification
+
+Documentation
+-------------
+
+TBD
+
+Status
+------
+
+In development
+
+Comments and Discussion
+-----------------------
diff --git a/done/GlusterFS 3.6/Server-side Barrier feature.md b/done/GlusterFS 3.6/Server-side Barrier feature.md
new file mode 100644
index 0000000..c13e25a
--- /dev/null
+++ b/done/GlusterFS 3.6/Server-side Barrier feature.md
@@ -0,0 +1,213 @@
+Server-side barrier feature
+===========================
+
+- Author(s): Varun Shastry, Krishnan Parthasarathi
+- Date: Jan 28 2014
+- Bugzilla: <https://bugzilla.redhat.com/1060002>
+- Document ID: BZ1060002
+- Document Version: 1
+- Obsoletes: NA
+
+Abstract
+--------
+
+Snapshot feature needs a mechanism in GlusterFS, where acknowledgements
+to file operations (FOPs) are held back until the snapshot of all the
+bricks of the volume are taken.
+
+The barrier feature would stop holding back FOPs after a configurable
+'barrier-timeout' seconds. This is to prevent an accidental lockdown of
+the volume.
+
+This mechanism should have the following properties:
+
+- Should keep 'barriering' transparent to the applications.
+- Should not acknowledge FOPs that fall into the barrier class. A FOP
+  that, when acknowledged to the application, could lead to the
+  snapshot of the volume becoming inconsistent, is a barrier class FOP.
+
+The example below, using 'unlink', explains how a FOP is classified as
+barrier class.
+
+Consider the following sequence of events, assuming the unlink FOP was
+not barriered, on a replicate volume with two bricks, namely b1 and b2.
+
+ b1 b2
+ time ----------------------------------
+ | t1 snapshot
+ | t2 unlink /a unlink /a
+ \/ t3 mkdir /a mkdir /a
+ t4 snapshot
+
+The result of this sequence of events is that /a is stored as a file in
+snapshot b1 while /a is stored as a directory in snapshot b2. This leads
+to a split-brain problem in AFR and, more generally, to an inconsistent
+volume.
+
+Copyright
+---------
+
+Copyright (c) 2014 Red Hat, Inc. <http://www.redhat.com>
+
+This feature is licensed under your choice of the GNU Lesser General
+Public License, version 3 or any later version (LGPLv3 or later), or the
+GNU General Public License, version 2 (GPLv2), in all cases as published
+by the Free Software Foundation.
+
+Introduction
+------------
+
+The volume snapshot feature snapshots a volume by snapshotting the
+individual bricks that are available, using lvm-snapshot technology. As
+part of using lvm-snapshot, the design requires bricks to be free from a
+certain set of modifications (fops in the barrier class) to avoid
+inconsistency. This is where server-side barriering of FOPs comes into
+the picture.
+
+Terminology
+-----------
+
+- barrier(ing) - To make barrier fops temporarily inactive or
+ disabled.
+- available - A brick is said to be available when the corresponding
+ glusterfsd process is running and serving file operations.
+- FOP - File Operation
+
+High Level Design
+-----------------
+
+### Architecture/Design Overview
+
+- Server-side barriering, for Snapshot, must be enabled/disabled on
+ the bricks of a volume in a synchronous manner. ie, any command
+ using this would be blocked until barriering is enabled/disabled.
+ The brick process would provide this mechanism via an RPC.
+- Barrier translator would be placed immediately above io-threads
+ translator in the server/brick stack.
+- Barrier translator would queue FOPs when enabled. On disable, the
+ translator dequeues all the FOPs, while serving new FOPs from
+ application. By default, barriering is disabled.
+- The barrier feature would stop blocking the acknowledgements of FOPs
+ after a configurable 'barrier-timeout' seconds. This is to prevent
+ an accidental lockdown of the volume.
+- Operations those fall into barrier class are listed below. Any other
+ fop not listed below does not fall into this category and hence are
+ not barriered.
+ - rmdir
+ - unlink
+ - rename
+ - [f]truncate
+ - fsync
+ - write with O\_SYNC flag
+ - [f]removexattr
+
+### Design Feature
+
+Following timeline diagram depicts message exchanges between glusterd
+and brick during enable and disable of barriering. This diagram assumes
+that enable operation is synchronous and disable is asynchronous. See
+below for alternatives.
+
+         glusterd (snapshot)           barrier @ brick
+         -------------------           ---------------
+    t1   |                             |
+    t2   |                             continue to pass through
+         |                             all the fops
+    t3   send 'enable'                 |
+    t4   |                             * starts barriering the fops
+         |                             * send back the ack
+    t5   receive the ack               |
+         |                             |
+    t6   | <take snap>                 |
+         |     .                       |
+         |     .                       |
+         |     .                       |
+         | </take snap>                |
+         |                             |
+    t7   send disable                  |
+         (does not wait for the ack)   |
+    t8   |                             release all the held fops
+         |                             and no more barriering
+         |                             |
+    t9   |                             continue in PASS_THROUGH mode
+
+Glusterd would send an RPC (described in API section), to enable
+barriering on a brick, by setting option feature.barrier to 'ON' in
+barrier translator. This would be performed on all the bricks present in
+that node, belonging to the set of volumes that are being snapshotted.
+
+Disable of barriering can happen in synchronous or asynchronous mode.
+The choice is left to the consumer of this feature.
+
+On disable, all FOPs queued up will be dequeued. Simultaneously the
+subsequent barrier request(s) will be served.
+
+Barrier option enable/disable is persisted into the volfile. This is to
+make the feature available for consumers in asynchronous mode, like any
+other (configurable) feature.
+
+The barrier feature also has a timeout option based on which dequeuing would
+get triggered if the consumer fails to send the disable request.
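+
+A hypothetical CLI sketch of these knobs (assuming the barrier switch and
+its timeout are exposed as the volume options features.barrier and
+features.barrier-timeout; the volume name and values are illustrative):
+
+    gluster volume set vol1 features.barrier enable
+    gluster volume set vol1 features.barrier-timeout 120
+    gluster volume set vol1 features.barrier disable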
+
+Low-level details of Barrier translator working
+-----------------------------------------------
+
+The translator operates in one of two states, namely QUEUEING and
+PASS\_THROUGH.
+
+When barriering is enabled, the translator moves to QUEUEING state. It
+queues outgoing FOPs thereafter in the call back path.
+
+When barriering is disabled, the translator moves to PASS\_THROUGH state
+and does not queue when it is in PASS\_THROUGH state. Additionally, the
+queued FOPs are 'released', when the translator moves from QUEUEING to
+PASS\_THROUGH state.
+
+It has a translator global queue (doubly linked lists, see
+libglusterfs/src/list.h) where the FOPs are queued in the form of a call
+stub (see libglusterfs/src/call-stub.[ch])
+
+When a FOP has succeeded but the barrier translator failed to queue it
+in the call back, the barrier translator would disable barriering and
+release any queued FOPs; barrier would inform the consumer about this
+failure on the subsequent disable request.
+
+Interfaces
+----------
+
+### Application Programming Interface
+
+- An RPC procedure is added at the brick side, which allows any client
+ [sic] to set the feature.barrier option of the barrier translator
+ with a given value.
+- Glusterd would be using this to set server-side-barriering on, on a
+ brick.
+
+Performance Considerations
+--------------------------
+
+- The barriering of FOPs may be perceived as a performance degrade by
+ the applications. Since this is a hard requirement for snapshot, the
+ onus is on the snapshot feature to reduce the window for which
+ barriering is enabled.
+
+### Scalability
+
+- In glusterd, each brick operation is executed in a serial manner.
+ So, the latency of enabling barriering is a function of the no. of
+ bricks present on the node of the set of volumes being snapshotted.
+ This is not a scalability limitation of the mechanism of enabling
+ barriering but a limitation in the brick operations mechanism in
+ glusterd.
+
+Migration Considerations
+------------------------
+
+The barrier translator is introduced with op-version 4. It is a
+server-side translator and does not impact older clients even when this
+feature is enabled.
+
+Installation and deployment
+---------------------------
+
+- The barrier xlator is not packaged with the glusterfs-server rpm. With
+  this change, it has to be added to the rpm.
diff --git a/done/GlusterFS 3.6/Thousand Node Gluster.md b/done/GlusterFS 3.6/Thousand Node Gluster.md
new file mode 100644
index 0000000..54c3e13
--- /dev/null
+++ b/done/GlusterFS 3.6/Thousand Node Gluster.md
@@ -0,0 +1,150 @@
+Goal
+----
+
+Thousand-node scalability for glusterd
+
+Summary
+=======
+
+This "feature" is really a set of infrastructure changes that will
+enable glusterd to manage a thousand servers gracefully.
+
+Owners
+======
+
+Krishnan Parthasarathi <kparthas@redhat.com>
+Jeff Darcy <jdarcy@redhat.com>
+
+Current status
+==============
+
+Proposed, awaiting summit for approval.
+
+Related Feature Requests and Bugs
+=================================
+
+N/A
+
+Detailed Description
+====================
+
+There are three major areas of change included in this proposal.
+
+- Replace the current order-n-squared heartbeat/membership protocol
+ with a much smaller "monitor cluster" based on Paxos or
+ [Raft](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf),
+ to which I/O servers check in.
+
+- Use the monitor cluster to designate specific functions or roles -
+ e.g. self-heal, rebalance, leadership in an NSR subvolume - to I/O
+ servers in a coordinated and globally optimal fashion.
+
+- Replace the current system of replicating configuration data on all
+ servers (providing practically no guarantee of consistency if one is
+ absent during a configuration change) with storage of configuration
+ data in the monitor cluster.
+
+Benefit to GlusterFS
+====================
+
+Scaling of our management plane to 1000+ nodes, enabling competition
+with other projects such as HDFS or Ceph which already have or claim
+such scalability.
+
+Scope
+=====
+
+Nature of proposed change
+-------------------------
+
+Functionality very similar to what we need in the monitor cluster
+already exists in some of the Raft implementations, notably
+[etcd](https://github.com/coreos/etcd). Such a component could provide
+the services described above to a modified glusterd running on each
+server. The changes to glusterd would mostly consist of removing the
+current heartbeat and config-storage code, replacing it with calls into
+(and callbacks from) the monitor cluster.
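+
+As a purely illustrative sketch (assuming an external etcd cluster and
+its v2 command-line client; the key layout shown is hypothetical),
+configuration data could be stored and watched like this:
+
+    # store a volume option in the monitor cluster
+    etcdctl set /gluster/volumes/vol1/options/quorum-type server
+
+    # read it back from any server
+    etcdctl get /gluster/volumes/vol1/options/quorum-type
+
+    # be notified when the configuration changes
+    etcdctl watch /gluster/volumes/vol1/options/quorum-type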
+
+Implications on manageability
+-----------------------------
+
+Enabling/starting monitor daemons on those few nodes that have them must
+be done separately from starting glusterd. Since the changes mostly are
+to how each glusterd interacts with others and with its own local
+storage back end, interactions with the CLI or with glusterfsd need not
+change.
+
+Implications on presentation layer
+----------------------------------
+
+N/A
+
+Implications on persistence layer
+---------------------------------
+
+N/A
+
+Implications on 'GlusterFS' backend
+-----------------------------------
+
+N/A
+
+Modification to GlusterFS metadata
+----------------------------------
+
+The monitor daemons need space for their data, much like that currently
+maintained in /var/lib/glusterd.
+
+Implications on 'glusterd'
+--------------------------
+
+Drastic. See sections above.
+
+How To Test
+===========
+
+A new set of tests for the monitor-cluster functionality will need to be
+developed, perhaps derived from those for the external project if we
+adopt one. Most tests related to our multi-node testing facilities
+(cluster.rc) will also need to change. Tests which merely invoke the CLI
+should require little if any change.
+
+User Experience
+===============
+
+Minimal change.
+
+Dependencies
+============
+
+A mature/stable enough implementation of Raft or a similar protocol.
+Failing that, we'd need to develop our own service along similar lines.
+
+Documentation
+=============
+
+TBD.
+
+Status
+======
+
+In design.
+
+The choice of technology and approaches are being discussed on the
+-devel ML.
+
+- "Proposal for Glusterd-2.0" -
+ [1](http://www.gluster.org/pipermail/gluster-users/2014-September/018639.html)
+
+: Though the discussion has become passive, the question is whether we
+  choose to implement a consensus algorithm inside our project or depend
+  on external projects that provide a similar service.
+
+- "Management volume proposal" -
+ [2](http://www.gluster.org/pipermail/gluster-devel/2014-November/042944.html)
+
+: This has limitations due to the circular dependency making it
+ infeasible.
+
+Comments and Discussion
+=======================
diff --git a/done/GlusterFS 3.6/afrv2.md b/done/GlusterFS 3.6/afrv2.md
new file mode 100644
index 0000000..a1767c7
--- /dev/null
+++ b/done/GlusterFS 3.6/afrv2.md
@@ -0,0 +1,244 @@
+Feature
+-------
+
+This feature is a major code re-factor of the current afr along with a
+key design change in the way changelog extended attributes are stored
+in afr.
+
+Summary
+-------
+
+This feature introduces a design change in afr which separates the
+ongoing-transaction count from the pending-operation count for
+files/directories.
+Owners
+------
+
+Anand Avati
+Pranith Kumar Karampuri
+
+Current status
+--------------
+
+The feature is in final stages of review at
+<http://review.gluster.org/6010>
+
+Detailed Description
+--------------------
+
+How AFR works:
+
+In order to keep track of what copies of the file are modified and up to
+date, and what copies require to be healed, AFR keeps state information
+in the extended attributes of the file called changelog extended
+attributes. These extended attributes stores that copy's view of how up
+to date the other copies are. The extended attributes are modified in a
+transaction which consists of 5 phases - LOCK, PRE-OP, OP, POST-OP and
+UNLOCK. In the PRE-OP phase the extended attributes are updated to store
+the intent of modification (in the OP phase.)
+
+In the POST-OP phase, depending on how many servers crashed mid way and
+on how many servers the OP was applied successfully, a corresponding
+change is made in the extended attributes (of the surviving copies) to
+represent the staleness of the copies which missed the OP phase.
+
+Further, when those lagging servers become available, healing decisions
+are taken based on these extended attribute values.
+
+Today, a PRE-OP increments the pending counters of all elements in the
+array (where each element represents a server, and therefore one of the
+members of the array represents that server itself.) The POST-OP
+decrements those counters which represent servers where the operation
+was successful. The update is performed on all the servers which have
+made it till the POST-OP phase. The decision of whether a server crashed
+in the middle of a transaction or whether the server lived through the
+transaction and witnessed the other server crash, is inferred by
+inspecting the extended attributes of all servers together. Because
+there is no distinction between these counters as to how many of those
+increments represent "in transit" operations and how many of those are
+retained without decrement to represent "pending counters", there is
+value in adding clarity to the system by separating the two.
+
+The change is to now have only one dirty flag on each server per file.
+We also make the PRE-OP increment only that dirty flag rather than all
+the elements of the pending array. The dirty flag must be set before
+performing the operation, and based on which of the servers the
+operation failed, we will set the pending counters representing these
+failed servers on the remaining ones in the POST-OP phase. The dirty
+counter is also cleared at the end of the POST-OP. This means, in
+successful operations only the dirty flag (one integer) is incremented
+and decremented per server per file. However if a pending counter is set
+because of an operation failure, then the flag is an unambiguous "finger
+pointing" at the other server. Meaning, if a file has a pending counter
+AND a dirty flag, it will not undermine the "strength" of the pending
+counter. This change completely removes today's ambiguity of whether a
+pending counter represents a still ongoing operation (or crashed in
+transit) vs a surely missed operation.
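+
+As an illustration (assuming the dirty flag is exposed as a
+trusted.afr.dirty xattr and the pending counters keep AFR's existing
+per-brick changelog xattr names; the encoding shown is only indicative),
+a file as seen on brick-0 after brick-1 missed a write might carry:
+
+    # ongoing-transaction marker: raised in PRE-OP, cleared in POST-OP
+    trusted.afr.dirty=0x000000000000000000000000
+
+    # per-brick pending counters: non-zero only for the brick that
+    # missed the operation
+    trusted.afr.<volname>-client-0=0x000000000000000000000000
+    trusted.afr.<volname>-client-1=0x000000010000000000000000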
+
+Benefit to GlusterFS
+--------------------
+
+It increases the clarity of whether a file has any ongoing transactions
+and any pending self-heals. Code is more maintainable now.
+
+Scope
+-----
+
+### Nature of proposed change
+
+- Remove client side self-healing completely (opendir, openfd, lookup)
+- Re-work readdir-failover to work reliably in case of NFS
+- Remove unused/dead lock recovery code
+- Consistently use xdata in both calls and callbacks in all FOPs
+- Per-inode event generation, used to force inode ctx refresh
+- Implement dirty flag support (in place of pending counts)
+- Eliminate inode ctx structure, use read subvol bits + event\_generation
+- Implement inode ctx refreshing based on event generation
+- Provide backward compatibility in transactions
+- Remove unused variables and functions
+- Make code more consistent in style and pattern
+- Regularize and clean up inode-write transaction code
+- Regularize and clean up dir-write transaction code
+- Regularize and clean up common FOPs
+- Reorganize transaction framework code
+- Skip setting xattrs in pending dict if nothing is pending
+- Re-write self-healing code using syncops
+- Re-write simpler self-heal-daemon
+
+### Implications on manageability
+
+None
+
+### Implications on presentation layer
+
+None
+
+### Implications on persistence layer
+
+None
+
+### Implications on 'GlusterFS' backend
+
+None
+
+### Modification to GlusterFS metadata
+
+This changes the way pending counts vs Ongoing transactions are
+represented in changelog extended attributes.
+
+### Implications on 'glusterd'
+
+None
+
+How To Test
+-----------
+
+Same test cases of afrv1 hold.
+
+User Experience
+---------------
+
+None
+
+Dependencies
+------------
+
+None
+
+Documentation
+-------------
+
+---
+
+Status
+------
+
+The feature is in final stages of review at
+<http://review.gluster.org/6010>
+
+Comments and Discussion
+-----------------------
+
+---
+
diff --git a/done/GlusterFS 3.6/better-ssl.md b/done/GlusterFS 3.6/better-ssl.md
new file mode 100644
index 0000000..44136d5
--- /dev/null
+++ b/done/GlusterFS 3.6/better-ssl.md
@@ -0,0 +1,137 @@
+Feature
+=======
+
+Better SSL Support
+
+1 Summary
+=========
+
+Our SSL support is currently incomplete in several areas. This "feature"
+covers several enhancements (see Detailed Description below) to close
+gaps and make it more user-friendly.
+
+2 Owners
+========
+
+Jeff Darcy <jdarcy@redhat.com>
+
+3 Current status
+================
+
+Some patches already submitted.
+
+4 Detailed Description
+======================
+
+These are the items necessary to make our SSL support more of a useful
+differentiating feature vs. other projects.
+
+- Enable SSL for the management plane (glusterd). There are currently
+ several bugs and UI issues precluding this.
+
+- Allow SSL identities to be used for authorization as well as
+ authentication (and encryption). At a minimum this would apply to
+ the I/O path, restricting specific volumes to specific
+ SSL-identified principals. It might also apply to the management
+ path, restricting certain actions (and/or actions on certain
+ volumes) to certain principals. Ultimately this could be the basis
+ for full role-based access control, but that's not in scope
+ currently.
+
+- Provide more options, e.g. for cipher suites or certificate-signing
+
+- Fix bugs related to increased concurrency levels from the
+ multi-threaded transport.
+
+5 Benefit to GlusterFS
+======================
+
+Sufficient security to support deployment in environments where security
+is a non-negotiable requirement (e.g. government). Sufficient usability
+to support deployment by anyone who merely desires additional security.
+Improved performance in some cases, due to the multi-threaded transport.
+
+6 Scope
+=======
+
+6.1. Nature of proposed change
+------------------------------
+
+Most of the proposed changes do not actually involve the SSL transport
+itself, but are in surrounding components instead. The exception is the
+addition of options, which should be pretty simple. However, bugs
+related to increased concurrency levels could show up anywhere, most
+likely in our more complex translators (e.g. DHT or AFR), and will need
+to be fixed on a case-by-case basis.
+
+6.2. Implications on manageability
+----------------------------------
+
+Additional configuration will be necessary to enable SSL for glusterd.
+Additional commands will also be needed to manage certificates and keys;
+the [HekaFS
+documentation](https://git.fedorahosted.org/cgit/CloudFS.git/tree/doc)
+can serve as an example of what's needed.
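+
+For orientation, a sketch of what enabling SSL on the I/O path could look
+like (assuming the option names client.ssl, server.ssl and auth.ssl-allow
+and the /etc/ssl/glusterfs.* file locations; the volume and principal
+names are illustrative):
+
+    # certificate, private key, and CA bundle used by the SSL transport
+    ls /etc/ssl/glusterfs.pem /etc/ssl/glusterfs.key /etc/ssl/glusterfs.ca
+
+    # encrypt the I/O path for one volume
+    gluster volume set myvol client.ssl on
+    gluster volume set myvol server.ssl on
+
+    # restrict the volume to specific SSL-identified principals
+    gluster volume set myvol auth.ssl-allow 'client1,client2'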
+
+6.3. Implications on presentation layer
+---------------------------------------
+
+N/A
+
+6.4. Implications on persistence layer
+--------------------------------------
+
+N/A
+
+6.5. Implications on 'GlusterFS' backend
+----------------------------------------
+
+N/A
+
+6.6. Modification to GlusterFS metadata
+---------------------------------------
+
+N/A
+
+6.7. Implications on 'glusterd'
+-------------------------------
+
+Significant changes to how glusterd calls the transport layer (and
+expects to be called in return) will be necessary to fix bugs and to
+enable SSL on its connections.
+
+7 How To Test
+=============
+
+New tests will be needed for each major change in the detailed
+description. Also, to improve test coverage and smoke out all of the
+concurrency bugs, it might be desirable to change the test framework to
+allow running in a mode where SSL is enabled for all tests.
+
+8 User Experience
+=================
+
+Correspondent to "implications on manageability" section above.
+
+9 Dependencies
+==============
+
+Currently we use OpenSSL, so its idiosyncrasies guide implementation
+choices and timelines. Sometimes it even affects the user experience,
+e.g. in terms of what options exist for cipher suites or certificate
+depth. It's possible that it will prove advantageous to switch to
+another SSL/TLS package with a better interface, probably PolarSSL
+(which often responds to new threats more quickly than OpenSSL).
+
+10 Documentation
+================
+
+TBD, likely extensive (see "User Experience" section).
+
+11 Status
+=========
+
+Awaiting approval.
+
+12 Comments and Discussion
+==========================
diff --git a/done/GlusterFS 3.6/disperse.md b/done/GlusterFS 3.6/disperse.md
new file mode 100644
index 0000000..e2bad37
--- /dev/null
+++ b/done/GlusterFS 3.6/disperse.md
@@ -0,0 +1,142 @@
+Feature
+=======
+
+Dispersed volume translator
+
+Summary
+=======
+
+The disperse translator is a new type of volume for GlusterFS that can
+be used to offer a configurable level of fault tolerance while
+optimizing the disk space waste. It can be seen as a RAID5-like volume.
+
+Owners
+======
+
+Xavier Hernandez <xhernandez@datalab.es>
+
+Current status
+==============
+
+A working version is included in GlusterFS 3.6
+
+Detailed Description
+====================
+
+The disperse translator is based on erasure codes to allow the recovery
+of the data stored on one or more bricks in case of failure. The number
+of bricks that can fail without losing data is configurable.
+
+Each brick stores only a portion of each block of data. Some of these
+portions are called parity or redundancy blocks. They are computed using
+a mathematical transformation so that they can be used to recover the
+content of the portion stored on another brick.
+
+Each volume is composed of a set of N bricks (as many as you want), and
+R of them are used to store the redundancy information. In this
+configuration, if each brick has capacity C, the total space available
+on the volume will be (N - R) \* C.
+
+A versioning system is used to detect inconsistencies and initiate a
+self-heal if appropriate.
+
+All these operations are made on the fly, transparently to the user.
+
+Benefit to GlusterFS
+====================
+
+It can be used to create volumes with a configurable level of redundancy
+(like replicate), but optimizing disk usage.
+
+Scope
+=====
+
+Nature of proposed change
+-------------------------
+
+The dispersed volume is implemented by a client-side translator that
+will be responsible of encoding/decoding the brick contents.
+
+Implications on manageability
+-----------------------------
+
+The new type of volume will be configured as any other one. However the
+healing operations are quite different and maybe will be necessary to
+handle them separately.
+
+Implications on presentation layer
+----------------------------------
+
+N/A
+
+Implications on persistence layer
+---------------------------------
+
+N/A
+
+Implications on 'GlusterFS' backend
+-----------------------------------
+
+N/A
+
+Modification to GlusterFS metadata
+----------------------------------
+
+Three new extended attributes are created to manage a dispersed file:
+
+- trusted.ec.config: Contains information about the parameters used to
+ encode the file.
+- trusted.ec.version: Tracks the number of changes made to the file.
+- trusted.ec.size: Tracks the real size of the file.
+
+Implications on 'glusterd'
+--------------------------
+
+glusterd and cli have been modified to add the needed functionality to
+create and manage dispersed volumes.
+
+How To Test
+===========
+
+There is a new glusterd syntax to create dispersed volumes:
+
+ gluster volume create <volname> [disperse [count]] [redundancy <count>]] <bricks>
+
+Both 'disperse' and 'redundancy' are optional, but at least one of them
+must be present to create a dispersed volume. The <count> of 'disperse'
+is also optional: if not specified, the number of bricks specified in
+the command line is taken as the <count> value. To create a
+distributed-disperse volume, it's necessary to specify 'disperse' with a
+<count> value smaller than the total number of bricks.
+
+When 'redundancy' is not specified, its default value is computed so
+that it generates an optimal configuration. A configuration is optimal
+if *number of bricks - redundancy* is a power of 2 (for example, 6
+bricks with redundancy 2 gives 6 - 2 = 4). If such a value exists and
+it's greater than one, a warning is shown to validate the number. If it
+doesn't exist, 1 is taken and another warning is shown.
+
+Once created, the disperse volume can be started, mounted and used as
+any other volume.
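+
+For example (host names and brick paths are illustrative), a 6-brick
+volume with redundancy 2, i.e. the usable capacity of 4 bricks, could be
+created and used like this:
+
+    gluster volume create dispvol disperse 6 redundancy 2 \
+        server{1..6}:/bricks/disp1
+    gluster volume start dispvol
+    mount -t glusterfs server1:/dispvol /mnt/dispvol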
+
+User Experience
+===============
+
+Almost the same. Only a new volume type added.
+
+Dependencies
+============
+
+None
+
+Documentation
+=============
+
+Not available yet.
+
+Status
+======
+
+First implementation ready.
+
+Comments and Discussion
+=======================
diff --git a/done/GlusterFS 3.6/glusterd volume locks.md b/done/GlusterFS 3.6/glusterd volume locks.md
new file mode 100644
index 0000000..a8f8ebd
--- /dev/null
+++ b/done/GlusterFS 3.6/glusterd volume locks.md
@@ -0,0 +1,48 @@
+As of today most gluster commands take a cluster-wide lock before
+performing their respective operations. As a result any two gluster
+commands, even ones which have no interdependency with each other, can't
+be executed simultaneously. To remove this restriction we propose to
+replace this cluster-wide lock with a volume-specific lock, so that two
+operations on two different volumes can be performed simultaneously.
+
+1. We classify all gluster operations in three different classes:
+create volume, delete volume, and volume-specific operations.
+
+2. At any given point of time, we should allow two simultaneous
+operations (create, delete or volume-specific), as long as both
+operations are not happening on the same volume.
+
+3. If two simultaneous operations are performed on the same volume, the
+operation which manages to acquire the volume lock will succeed, while
+the other will fail. Both might also fail to acquire the volume lock on
+the cluster, in which case both operations will fail.
+
+In order to achieve this, we propose a locking engine, which will
+receive lock requests from these three types of operations. Each such
+request for a particular volume will contest for the same volume lock
+(based on the volume name and the node-uuid). For example, a delete
+volume command for volume1 and a volume status command for volume1 will
+contest for the same lock (comprising the volume name and the uuid of
+the node winning the lock), in which case one of these commands will
+succeed while the other will fail to acquire the lock.
+
+Whereas, if two operations are simultaneously performed on different
+volumes they should happen smoothly, as these operations would request
+two different locks from the locking engine and will succeed in
+acquiring them in parallel.
+
+We maintain a global list of volume-locks (using a dict for a list)
+where the key is the volume name, and which saves the uuid of the
+originator glusterd. These locks are held and released per volume
+transaction.
+
+In order to achieve multiple gluster operations occurring at the same
+time, we also separate opinfos in the op-state-machine as a part of
+this patch. To do so, we generate a unique transaction-id (uuid) per
+gluster transaction. An opinfo is then associated with this
+transaction-id, which is used throughout the transaction. We maintain a
+run-time global list (using a dict) of transaction-ids and their
+respective opinfos to achieve this.
+
+Gluster devel Mailing Thread:
+<http://lists.gnu.org/archive/html/gluster-devel/2013-09/msg00042.html> \ No newline at end of file
diff --git a/done/GlusterFS 3.6/heterogeneous-bricks.md b/done/GlusterFS 3.6/heterogeneous-bricks.md
new file mode 100644
index 0000000..a769b56
--- /dev/null
+++ b/done/GlusterFS 3.6/heterogeneous-bricks.md
@@ -0,0 +1,136 @@
+Feature
+-------
+
+Support heterogeneous (different size) bricks.
+
+Summary
+-------
+
+DHT is currently very naive about brick sizes, assigning equal "weight"
+to each brick/subvolume for purposes of placing new files even though
+the bricks might actually have different sizes. It would be better if
+DHT assigned greater weight (i.e. would create more files) on bricks
+with more total or free space.
+
+This proposal came out of a [mailing-list
+discussion](http://www.gluster.org/pipermail/gluster-users/2014-January/038638.html)
+
+Owners
+------
+
+- Raghavendra G (rgowdapp@redhat.com)
+
+Current status
+--------------
+
+There is a
+[script](https://github.com/gluster/glusterfs/blob/master/extras/rebalance.py)
+representing much of the necessary logic, using DHT's "custom layout"
+feature and other tricks.
+
+The most basic kind of heterogeneous-brick-friendly rebalancing has been
+implemented. [patch](http://review.gluster.org/#/c/8020/)
+
+Detailed Description
+--------------------
+
+There should be (at least) three options:
+
+- Assign subvolume weights based on **total** space.
+
+- Assign subvolume weights based on **free** space.
+
+- Assign all (or nearly all) weight to specific subvolumes.
+
+The last option is useful for those who expand a volume by adding bricks
+and intend to let the system "rebalance automatically" by directing new
+files to the new bricks, without migrating any old data. Once the
+appropriate weights have been calculated, the rebalance command should
+apply the results recursively to all directories within the volume
+(except those with custom layouts) and DHT should assign layouts to new
+directories in accordance with the same weights.
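+
+For example (numbers purely illustrative), with total-space weighting
+and two bricks of 1 TB and 3 TB, roughly 25% of the hash range (and
+hence of new file placements) would be assigned to the first brick and
+75% to the second.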
+
+Benefit to GlusterFS
+--------------------
+
+Better support for adding new bricks that are a different size than the
+old, which is common as disk capacity tends to improve with each
+generation (as noted in the ML discussion).
+
+Better support for adding capacity without an expensive (data-migrating)
+rebalance operation.
+
+Scope
+-----
+
+This will involve changes to all current rebalance code - CLI, glusterd,
+DHT, and probably others.
+
+### Implications on manageability
+
+New CLI options.
+
+### Implications on presentation layer
+
+None.
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+None, unless we want to add a "policy" xattr on the root inode to be
+consulted when new directories are created (could also be done via
+translator options).
+
+### Modification to GlusterFS metadata
+
+Same as previous.
+
+### Implications on 'glusterd'
+
+New fields in rebalance-related RPCs.
+
+How To Test
+-----------
+
+For each policy:
+
+1. Create a volume with small bricks (ramdisk-based if need be).
+
+1. Fill the bricks to varying degrees.
+
+1. (optional) Add more empty bricks.
+
+1. Rebalance using the target policy.
+
+1. Create some dozens/hundreds of new files.
+
+1. Verify that the distribution of new files matches what is expected
+ for the given policy.
+
+User Experience
+---------------
+
+New options for the "rebalance" command.
+
+Dependencies
+------------
+
+None.
+
+Documentation
+-------------
+
+TBD
+
+Status
+------
+
+Original author has abandoned this change. If anyone else wants to make
+a \*positive\* contribution to fix a long-standing user concern, feel
+free.
+
+Comments and Discussion
+-----------------------
diff --git a/done/GlusterFS 3.6/index.md b/done/GlusterFS 3.6/index.md
new file mode 100644
index 0000000..f4d83db
--- /dev/null
+++ b/done/GlusterFS 3.6/index.md
@@ -0,0 +1,96 @@
+GlusterFS 3.6 Release Planning
+------------------------------
+
+Tentative Dates:
+
+4th Mar, 2014 - Feature proposal freeze
+
+17th Jul, 2014 - Feature freeze & Branching
+
+12th Sep, 2014 - Community Test Weekend \#1
+
+21st Sep, 2014 - 3.6.0 Beta Release
+
+22nd Sep, 2014 - Community Test Week
+
+31st Oct, 2014 - 3.6.0 GA
+
+Feature proposal for GlusterFS 3.6
+----------------------------------
+
+### Features in 3.6.0
+
+- [Features/better-ssl](./better-ssl.md):
+ Various improvements to SSL support.
+
+- [Features/heterogeneous-bricks](./heterogeneous-bricks.md):
+ Support different-sized bricks.
+
+- [Features/disperse](./disperse.md):
+ Volumes based on erasure codes.
+
+- [Features/glusterd-volume-locks](./glusterd volume locks.md):
+ Volume wide locks for glusterd
+
+- [Features/persistent-AFR-changelog-xattributes](./Persistent AFR Changelog xattributes.md):
+ Persistent naming scheme for client xlator names and AFR changelog
+ attributes.
+
+- [Features/better-logging](./Better Logging.md):
+ Gluster logging enhancements to support message IDs per message
+
+- [Features/Better peer identification](./Better Peer Identification.md)
+
+- [Features/Gluster Volume Snapshot](./Gluster Volume Snapshot.md)
+
+- [Features/Gluster User Serviceable Snapshots](./Gluster User Serviceable Snapshots.md)
+
+- **[Features/afrv2](./afrv2.md)**: Afr refactor.
+
+- [Features/RDMA Improvements](./RDMA Improvements.md):
+ Improvements for RDMA
+
+- [Features/Server-side Barrier feature](./Server-side Barrier feature.md):
+ A supplementary feature for the
+ [Features/Gluster Volume Snapshot](./Gluster Volume Snapshot.md) which maintains the consistency across the snapshots.
+
+### Features beyond 3.6.0
+
+- [Features/Smallfile Perf](../GlusterFS 3.7/Small File Performance.md):
+ Small-file performance enhancement.
+
+- [Features/data-classification](../GlusterFS 3.7/Data Classification.md):
+ Tiering, rack-aware placement, and more.
+
+- [Features/new-style-replication](./New Style Replication.md):
+ Log-based, chain replication.
+
+- [Features/thousand-node-glusterd](./Thousand Node Gluster.md):
+ Glusterd changes for higher scale.
+
+- [Features/Trash](../GlusterFS 3.7/Trash.md):
+ Trash translator for GlusterFS
+
+- [Features/Object Count](../GlusterFS 3.7/Object Count.md)
+- [Features/SELinux Integration](../GlusterFS 3.7/SE Linux Integration.md)
+- [Features/Easy addition of custom translators](../GlusterFS 3.7/Easy addition of Custom Translators.md)
+- [Features/Exports Netgroups Authentication](../GlusterFS 3.7/Exports and Netgroups Authentication.md)
+- [Features/outcast](../GlusterFS 3.7/Outcast.md): Outcast
+
+- **[Features/Policy based Split-brain Resolution](../GlusterFS 3.7/Policy based Split-brain Resolution.md)**: Policy Based
+Split-brain resolution
+
+- [Features/rest-api](../GlusterFS 3.7/rest-api.md):
+ REST API for Gluster Volume and Peer Management
+
+- [Features/Archipelago Integration](../GlusterFS 3.7/Archipelago Integration.md):
+ Improvements for integration with Archipelago
+
+Release Criterion
+-----------------
+
+- All new features to be documented in admin guide
+
+- Feature tested as part of testing days.
+
+- More to follow \ No newline at end of file
diff --git a/done/GlusterFS 3.7/Archipelago Integration.md b/done/GlusterFS 3.7/Archipelago Integration.md
new file mode 100644
index 0000000..69ce61d
--- /dev/null
+++ b/done/GlusterFS 3.7/Archipelago Integration.md
@@ -0,0 +1,93 @@
+Feature
+-------
+
+**Archipelago Integration**
+
+Summary
+-------
+
+This proposal is about adding support in libgfapi for better
+integration with Archipelago.
+
+Owners
+------
+
+Vijay Bellur <vbellur@redhat.com>
+
+Current status
+--------------
+
+Work in progress
+
+Detailed Description
+--------------------
+
+Please refer to discussion at:
+
+<http://lists.nongnu.org/archive/html/gluster-devel/2013-12/msg00011.html>
+
+Benefit to GlusterFS
+--------------------
+
+More interfaces in libgfapi.
+
+Scope
+-----
+
+To be explained better.
+
+### Nature of proposed change
+
+TBD
+
+### Implications on manageability
+
+N/A
+
+### Implications on presentation layer
+
+TBD
+
+### Implications on persistence layer
+
+No impact
+
+### Implications on 'GlusterFS' backend
+
+No impact
+
+### Modification to GlusterFS metadata
+
+No impact
+
+### Implications on 'glusterd'
+
+No impact
+
+How To Test
+-----------
+
+TBD
+
+User Experience
+---------------
+
+TBD
+
+Dependencies
+------------
+
+TBD
+
+Documentation
+-------------
+
+TBD
+
+Status
+------
+
+In development
+
+Comments and Discussion
+-----------------------
diff --git a/done/GlusterFS 3.7/BitRot.md b/done/GlusterFS 3.7/BitRot.md
new file mode 100644
index 0000000..deca9ee
--- /dev/null
+++ b/done/GlusterFS 3.7/BitRot.md
@@ -0,0 +1,211 @@
+Feature
+=======
+
+BitRot Detection
+
+1 Summary
+=========
+
+BitRot detection is a technique used to identify an “insidious” type of
+disk error where data is silently corrupted with no indication from the
+disk to the storage software layer that an error has occurred. BitRot
+detection is exceptionally useful when using JBOD (which has no way of
+knowing that the data is corrupted on disk) rather than RAID (esp. RAID6,
+which has a performance penalty for certain kinds of workloads).
+
+2 Use cases
+===========
+
+- Archival/Compliance
+- Openstack cinder
+- Gluster health
+
+Refer
+[here](http://supercolony.gluster.org/pipermail/gluster-devel/2014-December/043248.html)
+for an elaborate discussion on use cases.
+
+3 Owners
+========
+
+Venky Shankar <vshankar@redhat.com, yknev.shankar@gmail.com>
+Raghavendra Bhat <rabhat@redhat.com>
+Vijay Bellur <vbellur@redhat.com>
+
+4 Current Status
+================
+
+Initial approach is [here](http://goo.gl/TSjLJn). The document goes into
+some details on why one could end up with "rotten" data and approaches
+taken by block level filesystems to detect and recover from bitrot. Some
+of those design goals are carried forward and adapted to fit GlusterFS.
+
+Status as of 11th Feb 2015:
+
+Done
+
+- Object notification
+- Object expiry tracking using timer-wheel
+
+In Progress
+
+- BitRot server stub
+- BitRot Daemon
+
+5 Detailed Description
+======================
+
+**NOTE: Points marked with [NIS] are "Not in Scope" for 3.7 release.**
+
+The basic idea is to maintain file data/metadata checksums as an
+extended attribute. Checksum granularity is per file for now, however
+this can be extended to be per "block-size" blocks (chunks). A BitRot
+daemon per brick is responsible for checksum maintenance for files local
+to the brick. Distributing the work per brick enables scale and effective
+resource utilization of the cluster (memory, disk, etc.).
+
+BitD (BitRot Daemon)
+
+- Daemon per brick takes care of maintaining checksums for data local
+ to the brick.
+- Checksums are SHA256 (default) hash
+ - Of file data (regular files only)
+ - "Rolling" metadata checksum of extended attributes (GlusterFS
+ xattrs) **[NIS]**
+ - Master checksum: checksum of checksums (data + metadata)
+ **[NIS]**
+    - The hash type is persisted alongside the checksum and can be tuned
+      per file type
+
+- Checksum maintenance is "lazy"
+ - "not" inline to the data path (expensive)
+    - The list of changed files is notified by the filesystem, although a
+      single filesystem scan is needed to get to the current state.
+      BitD is built over the existing journaling infrastructure (a.k.a.
+      changelog)
+ - Laziness is governed by policies that determine when to
+ (re)calculate checksum. IOW, checksum is calculated when a file
+ is considered "stable"
+ - Release+Expiry: on a file descriptor release and an
+ inactivity for "X" seconds.
+
+- Filesystem scan
+ - Required once after stop/start or for initial data set
+ - Xtime based scan (marker framework)
+ - Considerations
+ - Parallelize crawl
+ - Sort by inode \# to reduce disk seek
+ - Integrate with libgfchangelog
+
+Detection
+
+- Upon file/data access (expensive)
+ - open() or read() (disabled by default)
+- Data scrubbing
+ - Filesystem checksum validation
+ - "Bad" file marking
+ - Deep: validate data checksum
+ - Timestamp of last validity - used for replica repair **[NIS]**
+ - Repair **[NIS]**
+ - Shallow: validate metadata checksum **[NIS]**
+
+Repair/Recovery strategies **[NIS]**
+
+- Mirrored file data
+ - self-heal
+- Erasure Codes (ec xlator)
+
+It would also be beneficial to use the inbuilt bitrot capabilities of
+backend filesystems such as btrfs. For such cases, it's better to hand
+over the bulk of the work to the backend filesystem and keep the
+implementation on the daemon side minimal. This area needs to be
+explored more (i.e., ongoing and not for 3.7).
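+
+As a purely illustrative sketch of the checksum-as-xattr idea described
+above (the xattr key and helper names below are hypothetical, not the ones
+BitD will actually use):
+
+    import hashlib
+    import os
+
+    # Hypothetical xattr key; the real key used by BitD may differ.
+    BITROT_XATTR = b"user.example.bitrot.sha256"
+
+    def compute_checksum(path, blocksize=1024 * 1024):
+        # SHA256 of the file's data, computed lazily (off the I/O path).
+        h = hashlib.sha256()
+        with open(path, "rb") as f:
+            for chunk in iter(lambda: f.read(blocksize), b""):
+                h.update(chunk)
+        return h.hexdigest()
+
+    def sign_file(path):
+        # Persist the checksum alongside the file as an extended attribute.
+        os.setxattr(path, BITROT_XATTR, compute_checksum(path).encode())
+
+    def verify_file(path):
+        # A scrubber compares the stored signature with a freshly computed one.
+        return os.getxattr(path, BITROT_XATTR).decode() == compute_checksum(path)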
+
+6 Benefit to GlusterFS
+======================
+
+The ability to detect silent corruption (and even backend tinkering with
+a file) means that reading bad data can be avoided, as can using a bad
+copy as a truthful source to heal other copies, or even replicating it
+remotely to a backup node and damaging a good copy. Scrubbing allows
+pro-active detection of corrupt files and repairing them before access.
+
+7 Design and CLI specification
+==============================
+
+- [Design document](http://goo.gl/Mjy4mD)
+- [CLI specification](http://goo.gl/2o12Fn)
+
+8 Scope
+=======
+
+8.1. Nature of proposed change
+------------------------------
+
+The most basic changes being introduction of a server side daemon (per
+brick) to maintain file data checksums. Changes to changelog and
+consumer library would be needed to support requirements for bitrot
+daemon.
+
+8.2. Implications on manageability
+----------------------------------
+
+Introduction of new CLI commands to enable bitrot detection, trigger
+scrub, query file status, etc.
+
+8.3. Implications on presentation layer
+---------------------------------------
+
+N/A
+
+8.4. Implications on persistence layer
+--------------------------------------
+
+Introduction of new extended attributes.
+
+8.5. Implications on 'GlusterFS' backend
+----------------------------------------
+
+As in 8.4
+
+8.6. Modification to GlusterFS metadata
+---------------------------------------
+
+BitRot related extended attributes
+
+8.7. Implications on 'glusterd'
+-------------------------------
+
+Supporting changes to CLI.
+
+9 How To Test
+=============
+
+10 User Experience
+==================
+
+Refer to Section \#7
+
+11 Dependencies
+===============
+
+Enhancement to changelog translator (and libgfchangelog) is the most
+prevalent change. Other dependencies include glusterd.
+
+12 Documentation
+================
+
+TBD
+
+13 Status
+=========
+
+- Initial set of patches merged
+- Bug fixing/enhancement in progress
+
+14 Comments and Discussion
+==========================
+
+More than welcome :-)
+
+- [BitRot tracker Bug](https://bugzilla.redhat.com/1170075)
+- [BitRot hash computation](https://bugzilla.redhat.com/914874) \ No newline at end of file
diff --git a/done/GlusterFS 3.7/Clone of Snapshot.md b/done/GlusterFS 3.7/Clone of Snapshot.md
new file mode 100644
index 0000000..ca6304c
--- /dev/null
+++ b/done/GlusterFS 3.7/Clone of Snapshot.md
@@ -0,0 +1,100 @@
+Feature
+-------
+
+Clone of a Snapshot
+
+Summary
+-------
+
+GlusterFS volume snapshot provides point-in-time copy of a GlusterFS
+volume. When we take a volume snapshot, the newly created snap volume is
+a read only volume.
+
+With this feature, the snap volume can later be 'cloned' to create a new
+regular volume which contains the same contents as the snapshot bricks.
+The clone is space efficient: it is created instantaneously and shares
+disk space in the back-end, just as a snapshot does with its origin
+volume.
+
+Owners
+------
+
+Mohammed Rafi KC <rkavunga@redhat.com>
+
+Current status
+--------------
+
+Requirement for OpenStack Manila.
+
+Detailed Description
+--------------------
+
+Snapshot create takes a point-in-time snapshot of a volume. Upon
+successful completion, it creates a new read-only volume. But the
+new volume is not considered a regular volume, which prevents us from
+performing any volume related operations on this snapshot volume. The
+ultimate aim of this feature is to create a new regular volume out of
+this snapshot.
+
+For e.g.:
+
+ gluster snapshot create snap1 vol1
+
+The above command will create a read-only snapshot "snap1" from volume
+vol1.
+
+ gluster snapshot clone share1 snap1
+
+The above command will create a regular gluster volume share1 from
+snap1.
+
+Benefit to GlusterFS
+--------------------
+
+We will have a writable snapshot.
+
+Scope
+-----
+
+### Nature of proposed change
+
+Modification to glusterd snapshot code.
+
+### Implications on manageability
+
+Changes to glusterd and the gluster CLI.
+
+### Implications on 'GlusterFS' backend
+
+There will be performance degradation for the first write of each
+block of the main volume.
+
+### Modification to GlusterFS metadata
+
+none
+
+How To Test
+-----------
+
+Create a volume, take a snapshot, and create a clone. Start the clone.
+The cloned volume should support all operations of a regular volume
+(see the example below).
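+
+For example (volume, snapshot and clone names are illustrative):
+
+    gluster snapshot create snap1 vol1
+    gluster snapshot clone share1 snap1
+    gluster volume start share1
+    gluster volume info share1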
+
+User Experience
+---------------
+
+There will be an additional CLI option for snapshot:
+```gluster snapshot clone <clonename> <snapname> [<description> <description text>] [force]```
+
+Dependencies
+------------
+
+Documentation
+-------------
+
+Status
+------
+
+In development
+
+Comments and Discussion
+-----------------------
diff --git a/done/GlusterFS 3.7/Data Classification.md b/done/GlusterFS 3.7/Data Classification.md
new file mode 100644
index 0000000..a3bb35c
--- /dev/null
+++ b/done/GlusterFS 3.7/Data Classification.md
@@ -0,0 +1,279 @@
+Goal
+----
+
+Support tiering and other policy-driven (as opposed to pseudo-random)
+placement of files.
+
+Summary
+-------
+
+"Data classification" is an umbrella term covering things:
+locality-aware data placement, SSD/disk or
+normal/deduplicated/erasure-coded data tiering, HSM, etc. They share
+most of the same infrastructure, and so are proposed (for now) as a
+single feature.
+
+NB this has also been referred to as "DHT on DHT" in various places,
+though "unify on DHT" might be more accurate.
+
+Owners
+------
+
+Dan Lambright <dlambrig@redhat.com>
+
+Joseph Fernandes <josferna@redhat.com>
+
+Current status
+--------------
+
+Cache tiering under development upstream. Tiers may be added to existing
+volumes. Tiers are made up of bricks.
+
+Volume-granularity tiering has been prototyped (bugzilla \#9387) and
+merged in a branch (origin/fix\_9387) to the cache tiering forge
+project. This will allow existing volumes to be combined into a single
+one offering both functionalities.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+N/A
+
+Detailed Description
+--------------------
+
+The basic idea is to layer multiple instances of a modified DHT
+translator on top of one another, each making placement/rebalancing
+decisions based on different criteria. The current consistent-hashing
+method is one possibility. Other possibilities involve matching
+file/directory characteristics to subvolume characteristics.
+
+- File/directory characteristics: size, age, access rate, type
+ (extension), ...
+
+- Subvolume characteristics: physical location, storage type (e.g.
+ SSD/disk/PCM, cache), encoding method (e.g. erasure coded or
+ deduplicated).
+
+- Either (arbitrary tags assigned by user): owner, security level,
+  HIPAA category
+
+For example, a first level might redirect files based on security level,
+a second level might match age or access rate vs. SSD-based or
+disk-based subvolumes, and then a third level might use consistent
+hashing across several similarly-equipped bricks.
+
+### Cache tier
+
+The cache tier will support data placement based on access frequency.
+Frequently accessed files shall exist on a "hot" subvolume. Infrequently
+accessed files shall reside on a "cold" subvolume. Files will migrate
+between the hot and cold subvolumes according to observed usage.
+
+Read caching is a desired future enhancement.
+
+When the "cold" subvolume is expensive to use (e.g. erasure coded), this
+feature will mitigate its overhead for many workloads.
+
+Some use cases:
+
+- fast subvolumes are SSDs, slow subvolumes are normal disks
+- fast subvolumes are normal disks, slow subvolumes are erasure coded.
+- fast subvolume is backed up more frequently than slow tier.
+- read caching only, good in cases where migration overhead is
+  unacceptable
+
+Benefit to GlusterFS
+--------------------
+
+By itself, data classification can be used to improve performance (by
+optimizing where "hot" files are placed) and security or regulatory
+compliance (by placing sensitive data only on the most secure storage).
+It also serves as an enabling technology for other enhancements by
+allowing users to combine more cost-effective or archivally oriented
+storage for the majority of their data with higher-performance storage
+to absorb the majority of their I/O load. This enabling effect applies
+e.g. to compression, deduplication, erasure coding, or bitrot detection.
+
+Scope
+-----
+
+### Nature of proposed change
+
+The most basic set of changes involves making the data-placement part of
+DHT more modular, and providing modules/plugins to do the various kinds
+of intelligent placement discussed above. Other changes will be
+explained in subsequent sections.
+
+### Implications on manageability
+
+Eventually, the CLI must provide users with a way to arrange bricks into
+a hierarchy, and assign characteristics such as storage type or security
+level at any level within that hierarchy. They must also be able to
+express which policy (plugin), with which parameters, should apply to
+any level. A data classification language has been proposed to help
+express these concepts; see the syntax proposal linked in the Status
+section below.
+
+The cache tier's graph is more rigid and can be expressed using the
+"volume attach-tier" command described below. Both a "hot" tier and
+"cold tier" are made up of dispersed / distributed / replicated bricks
+in the same manner as a normal volume, and they are combined with the
+tier translator.
+
+#### Cache Tier
+
+An "attach" command will declare an existing volume as "cold" and create
+a new "hot" volume which is appended to it. Together, the combination is
+a single "cache tiered" volume. For example:
+
+gluser volume attach-tier [name] [redundancy \#] brick1 brick2 .. brickN
+
+.. will attach a hot tier made up of brick[1..N] to existing volume
+[name].
+
+The tier can be detached. Data is first migrated off the hot volume, in
+the same manner as brick removal, and then the hot volume is removed
+from the volfile.
+
+gluster volume detach-tier brick1,...,brickN
+
+To start cache tiering:
+
+gluster volume rebalance [name] tier start
+
+Enable the change time recorder:
+
+gluster voiume set [name] features.ctr-enabled on
+
+Other cache parameters:
+
+tier-demote-frequency - how often thread wakes up to demote data
+
+tier-promote-frequency - as above , to promote data
+
+To stop it:
+
+gluster volume rebalance [name] tier stop
+
+To get status:
+
+gluster volume rebalance [name] tier status
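+
+If the promote/demote frequencies above are exposed as regular volume
+options (the exact option names and namespace may differ from what is
+shown here), tuning them would look something like:
+
+    gluster volume set [name] tier-demote-frequency 120
+    gluster volume set [name] tier-promote-frequency 120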
+
+Upcoming:
+
+A "pause-tier" command will allow users to stop using the hot tier.
+While paused, data will be migrated off the hot tier to the cold tier,
+and all I/Os will be forwarded to the cold tier. A status CLI will
+indicate how much data remains to be "flushed" from the hot tier to the
+cold tier.
+
+### Implications on presentation layer
+
+N/A
+
+### Implications on persistence layer
+
+N/A
+
+### Implications on 'GlusterFS' backend
+
+A tiered volume is a new volume type.
+
+Simple rules may be represented using volume "options" key-value in the
+volfile. Eventually, for more elaborate graphs, some information about a
+brick's characteristics and relationships (within the aforementioned
+hierarchy) may be stored on the bricks themselves as well as in the
+glusterd configuration. In addition, the volume's "info" file may
+include an adjacency list to represent more elaborate graphs.
+
+### Modification to GlusterFS metadata
+
+There are no plans to change meta-data for the cache tier. However in
+the future, categorizing files and directories (especially with
+user-defined tags) may require additional xattrs.
+
+### Implications on 'glusterd'
+
+Volgen must be able to convert these specifications into a
+corresponding hierarchy of translators and options for those
+translators.
+
+Adding and removing tiers dynamically closely resembles the add and
+remove brick operations.
+
+How To Test
+-----------
+
+Eventually, new tests will be needed to set up multi-layer hierarchies,
+create files/directories, issue rebalance commands etc. and ensure that
+files end up in the right place(s). Many of the tests are
+policy-specific, e.g. to test an HSM policy one must effectively change
+files' ages or access rates (perhaps artificially).
+
+Interoperability tests between the Snap, geo-rep, and quota features are
+necessary.
+
+### Cache tier
+
+Automated tests are under development in the forge repository in the
+file tier.t. Tests should include:
+
+- The performance of "cache friendly" workloads (e.g. repeated access to
+a small set of files) is improved.
+
+- Performance is not substantially worse in "cache unfriendly" workloads
+(e.g. sequential writes over large numbers of files.)
+
+- Performance should not become substantially worse when the hot tier's
+bricks become full.
+
+User Experience
+---------------
+
+The hierarchical arrangement of bricks, with attributes and policies
+potentially at many levels, represents a fundamental change to the
+current "sea of identical bricks" model. Eventually, some commands that
+currently apply to whole volumes will need to be modified to work on
+sub-volume-level groups (or even individual bricks) as well.
+
+The cache tier must provide statistics on data migration.
+
+Dependencies
+------------
+
+Documentation
+-------------
+
+See below.
+
+Status
+------
+
+Cache tiering implementation in progress for 3.7; some bits for more
+general DC also done (fix 9387).
+
+- [Syntax
+ proposal](https://docs.google.com/presentation/d/1e8tuh9DKNi9eCMrdt5vetppn1D3BiJSmfR7lDW2wRvA/edit#slide=id.p)
+ (dormant).
+- [Syntax prototype](https://forge.gluster.org/data-classification)
+ (dormant, not part of cache tiering).
+- [Cache tier
+ design](https://docs.google.com/document/d/1cjFLzRQ4T1AomdDGk-yM7WkPNhAL345DwLJbK3ynk7I/edit)
+- [Bug 763746](https://bugzilla.redhat.com/763746) - We need an easy
+ way to alter client configs without breaking DVM
+- [Bug 905747](https://bugzilla.redhat.com/905747) - [FEAT] Tier
+ support for Volumes
+- [Working tree for
+ tiering](https://forge.gluster.org/data-classification/data-classification)
+- [Volgen changes for general DC](http://review.gluster.org/#/c/9387/)
+- [d\_off changes to allow stacked
+ DHTs](https://www.mail-archive.com/gluster-devel%40gluster.org/msg03155.html)
+ (prototyped)
+- [Video on the concept](https://www.youtube.com/watch?v=V4cvawIv1qA)
+ Efficient Data Maintenance in GlusterFS using DataBases : Data
+ Classification as the case study
+
+Comments and Discussion
+-----------------------
diff --git a/done/GlusterFS 3.7/Easy addition of Custom Translators.md b/done/GlusterFS 3.7/Easy addition of Custom Translators.md
new file mode 100644
index 0000000..487770e
--- /dev/null
+++ b/done/GlusterFS 3.7/Easy addition of Custom Translators.md
@@ -0,0 +1,129 @@
+Feature
+-------
+
+Easy addition of custom translators
+
+Summary
+-------
+
+I'd like to propose we add a way for people to easily add custom
+translators they've written (using C, Glupy, or anything else).
+
+Owners
+------
+
+Justin Clift <jclift@redhat.com>
+Anand Avati <avati@redhat.com>
+
+Current status
+--------------
+
+At present, when a custom translator has been developed it's difficult
+to get it included in generated .vol files properly.
+
+It **can** be done using the GlusterFS "filter" mechanism, but that's
+non-optimal and open to catastrophic failure.
+
+Detailed Description
+--------------------
+
+Discussed on the gluster-devel mailing list here:
+
+[http://lists.nongnu.org/archive/html/gluster-devel/2013-08/msg00074.html](http://lists.nongnu.org/archive/html/gluster-devel/2013-08/msg00074.html)
+
+We could have a new Gluster install sub-directory which takes a .so/.py
+translator file and a JSON fragment saying what to do with it (see the
+hypothetical example below). No CLI involvement would be needed.
+
+This would suit deployment via packaging, and should be simple enough
+for developers to make use of easily as well.
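+
+A purely hypothetical illustration of what such a JSON fragment might
+contain; the real schema was only sketched on the mailing list (linked
+above), so every key shown here is made up:
+
+    {
+        "translator": "my-xlator.so",
+        "type": "features/my-xlator",
+        "insert-above": "io-threads",
+        "options": { "enabled": "on" }
+    }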
+
+Benefit to GlusterFS
+--------------------
+
+Having an easily usable / deployable approach for custom translators is
+a key part of extending the Gluster Developer Community, especially in
+combination with rapid feature prototyping through Glupy.
+
+Scope
+-----
+
+### Nature of proposed change
+
+Modification of existing code, to enable much easier addition of custom
+translators.
+
+### Implications on packaging
+
+The gluster-devel package should include all the necessary header and
+library files to compile a standalone glusterfs translator.
+
+### Implications on development
+
+/usr/share/doc/gluster-devel/examples/translators/hello-world should
+contain skeleton translator code (well commented), README.txt and build
+files. This code becomes the starting point to implement a new
+translator. Make a few changes and you should be able to build, install,
+test and package your translator.
+
+Ideally, this would be implemented via a script.
+
+Similar to autoproject, "translator-gen NAME" should produce all the
+necessary skeleton translator code and associated files. This avoids
+erroneous find-replace steps.
+
+### Implications on manageability
+
+TBD
+
+### Implications on presentation layer
+
+N/A
+
+### Implications on persistence layer
+
+TBD
+
+### Implications on 'GlusterFS' backend
+
+TBD
+
+### Modification to GlusterFS metadata
+
+TBD
+
+### Implications on 'glusterd'
+
+TBD
+
+How To Test
+-----------
+
+TBD
+
+User Experience
+---------------
+
+TBD
+
+Dependencies
+------------
+
+No new dependencies.
+
+Documentation
+-------------
+
+At least "Getting Started" documentation and API documentation needs to
+be created, including libglusterfs APIs.
+
+Status
+------
+
+Initial concept proposal only.
+
+Comments and Discussion
+-----------------------
+
+- An initial potential concept for the JSON fragment is on the mailing
+ list:
+ - <http://lists.nongnu.org/archive/html/gluster-devel/2013-08/msg00080.html>
diff --git a/done/GlusterFS 3.7/Exports and Netgroups Authentication.md b/done/GlusterFS 3.7/Exports and Netgroups Authentication.md
new file mode 100644
index 0000000..03b43f0
--- /dev/null
+++ b/done/GlusterFS 3.7/Exports and Netgroups Authentication.md
@@ -0,0 +1,134 @@
+Feature
+-------
+
+Exports and Netgroups Authentication for NFS
+
+Summary
+-------
+
+This feature adds Linux-style exports & netgroups authentication to
+Gluster's NFS server. More specifically, this feature allows you to
+restrict access to specific clients & netgroups for both Gluster volumes
+and subdirectories within Gluster volumes.
+
+Owners
+------
+
+Shreyas Siravara
+Richard Wareing
+
+Current Status
+--------------
+
+Today, Gluster can restrict access to volumes through a simple IP list.
+This feature makes that capability more scalable by allowing large lists
+of IPs to be managed through a netgroup. It also allows more granular
+permission handling on volumes.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+- [Bug 1143880](https://bugzilla.redhat.com/1143880): Exports and
+ Netgroups Authentication for Gluster NFS mount
+
+Patches ([Gerrit
+link](http://review.gluster.org/#/q/project:glusterfs+branch:master+topic:bug-1143880,n,z)):
+
+- [\#1](http://review.gluster.org/9359): core: add generic parser
+ utility
+- [\#2](http://review.gluster.org/9360): nfs: add structures and
+ functions for parsing netgroups
+- [\#3](http://review.gluster.org/9361): nfs: add support for separate
+ 'exports' file
+- [\#4](http://review.gluster.org/9362): nfs: more fine grained
+ authentication for the MOUNT protocol
+- [\#5](http://review.gluster.org/9363): nfs: add auth-cache for the
+ MOUNT protocol
+- [\#6](http://review.gluster.org/8758): gNFS: Export / Netgroup
+ authentication on Gluster NFS mount
+- [\#7](http://review.gluster.org/9364): glusterd: add new NFS options
+ for exports/netgroups and related caching
+- [\#8](http://review.gluster.org/9365): glusterfsd: add
+ "print-netgroups" and "print-exports" command
+
+Detailed Description
+--------------------
+
+This feature allows users to restrict access to Gluster volumes (and
+subdirectories within a volume) to specific IPs (exports authentication)
+or a netgroup (netgroups authentication), or a combination of both.
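+
+For illustration, an exports entry and a netgroups entry in the familiar
+Linux style might look like the following (the volume, subdirectory, host
+and netgroup names are made up, and the exact set of supported options is
+defined by the feature, not by this sketch):
+
+    # exports file (illustrative)
+    /testvol        @qa-hosts(rw) 10.0.0.5(ro)
+    /testvol/subdir 10.0.1.0/24(rw)
+
+    # netgroups file (illustrative)
+    qa-hosts (qa1.example.com,,) (qa2.example.com,,)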
+
+Benefit to GlusterFS
+--------------------
+
+This is a scalable security model and allows more granular permissions.
+
+Scope
+-----
+
+### Nature of proposed change
+
+This change modifies the NFS server code and the mount daemon code. It
+adds two parsers for the exports & netgroups files as well as some files
+relating to caching to improve performance.
+
+### Implications on manageability
+
+The authentication can be turned off with a simple volume setting
+('gluster vol set <VOLNAME> nfs.exports-auth-enable off'). The feature
+has some tunable parameters (how long authorizations should be cached,
+etc.) that can be adjusted through the CLI interface.
+
+### Implications on presentation layer
+
+Adds per-fileop authentication to the NFS server. No other elements of
+the presentation layer are affected.
+
+### Implications on persistence layer
+
+No implications.
+
+### Implications on 'GlusterFS' backend
+
+No implications.
+
+### Modification to GlusterFS metadata
+
+No modifications.
+
+### Implications on 'glusterd'
+
+Adds a few configuration options to NFS to tweak the authentication
+model.
+
+How To Test
+-----------
+
+Restrict some volume in the exports file to some IP, turn on the
+authentication through the Gluster CLI and see mounts/file-operations
+denied (or authorized depending on your setup).
+
+User Experience
+---------------
+
+Authentication can be toggled through the command line.
+
+Dependencies
+------------
+
+No external dependencies.
+
+Documentation
+-------------
+
+TBD
+
+Status
+------
+
+Feature complete, currently testing & working on enhancements.
+
+Comments and Discussion
+-----------------------
+
+TBD
diff --git a/done/GlusterFS 3.7/Gluster CLI for NFS Ganesha.md b/done/GlusterFS 3.7/Gluster CLI for NFS Ganesha.md
new file mode 100644
index 0000000..94028e4
--- /dev/null
+++ b/done/GlusterFS 3.7/Gluster CLI for NFS Ganesha.md
@@ -0,0 +1,120 @@
+Feature
+-------
+
+Gluster CLI support to manage nfs-ganesha exports.
+
+Summary
+-------
+
+NFS-ganesha support for GlusterFS volumes has been operational for quite
+some time now. In the upcoming release, we intend to provide gluster CLI
+commands to manage nfs-ganesha exports analogous to the commands
+provided for Gluster-NFS. CLI commands to support ganesha specific
+options shall also be introduced.
+
+Owners
+------
+
+Meghana Madhusudhan
+
+Current status
+--------------
+
+1. Options nfs-ganesha.enable and nfs-ganesha.host defined in
+ gluster-nfs code.
+2. Writing into config files and starting nfs-ganesha is done as part
+ of hook scripts.
+3. User has to manually stop gluster-nfs and configure the DBus interface
+    (required to add/remove exports dynamically).
+4. Volume level options
+
+ gluster vol set testvol nfs-ganesha.host 10.70.43.78
+ gluster vol set testvol nfs-ganesha.enable on
+
+Drawbacks
+---------
+
+1. Volume set options show a success status irrespective of what the
+    outcome is. The post phase of the hook scripts does not allow us to
+    handle errors.
+2. Multi-headed ganesha scenarios were difficult to avoid in this
+ approach.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+Detailed Description
+--------------------
+
+Benefit to GlusterFS
+--------------------
+
+These CLI options are aimed at making the switch between gluster-nfs and
+nfs-ganesha seamless. The approach is to let the end user execute the
+kind of commands that they are already familiar with.
+
+Scope
+-----
+
+### Nature of proposed change
+
+The CLI integration would mean introduction of a number of options that
+are analogous to gluster-nfs. A dummy translator will be introduced on
+the client side for this purpose. Having it as a separate translator
+would provide the necessary modularity and the correct placeholder for
+all nfs-ganesha related functions. When the translator is loaded, all
+the options that are enabled for nfs-ganesha will be listed in that
+(nfs-ganesha) block. This approach will make the user experience with
+nfs-ganesha close to the one that's familiar.
+
+### Implications on manageability
+
+All the options related to nfs-ganesha will appear in the volfile once
+the nfs-ganesha translator is enabled.
+
+### Implications on presentation layer
+
+Gluster-nfs should be disabled to export any volume via nfs-ganesha.
+
+### Implications on persistence layer
+
+None
+
+### Implications on 'GlusterFS' backend
+
+None
+
+### Modification to GlusterFS metadata
+
+None
+
+### Implications on 'glusterd'
+
+Some code will be added to glusterd to manage nfs-ganesha options.
+
+How To Test
+-----------
+
+Execute CLI commands and check for expected behaviour.
+
+User Experience
+---------------
+
+User will be introduced to new CLI commands to manage nfs-ganesha
+exports. Most of the commands will be volume level options.
+
+Dependencies
+------------
+
+None
+
+Documentation
+-------------
+
+In development
+
+Comments and Discussion
+-----------------------
+
+The feature page is not complete as yet. This will be updated regularly.
diff --git a/done/GlusterFS 3.7/Gnotify.md b/done/GlusterFS 3.7/Gnotify.md
new file mode 100644
index 0000000..4f2597c
--- /dev/null
+++ b/done/GlusterFS 3.7/Gnotify.md
@@ -0,0 +1,168 @@
+Feature
+=======
+
+GlusterFS Backup API (a.k.a Gnotify)
+
+1 Summary
+=========
+
+Gnotify is analogous to inotify(7) for the Gluster distributed filesystem,
+to monitor filesystem events. Currently a similar mechanism exists via
+libgfchangelog (per brick), but it is more of a notification + poll
+model. This feature makes the notification purely callback based and
+provides an API that resembles inotify's blocking read() for events.
+There may be efforts to support filesystem notifications on the client
+at a volume level.
+
+2 Owners
+========
+
+Venky Shankar <vshankar@redhat.com>
+Aravinda V K <avishwan@redhat.com>
+
+3 Current Status
+================
+
+As of now, there exists a "notification + poll" based event consumption
+mechanism (used by Geo-replication). This has vastly improved
+performance (as the filesystem crawl goes away) and has a set of APIs that
+respond to event queries by an application. We call this the "higher
+level" API, as the application needs to deal with changelogs (user
+consumable journals), taking care of format, record position, etc.
+
+The proposed change would be to make the API simple, elegant and "backup"
+friendly, apart from designing it to be "purely" notify based. Engaging
+the community is a must, so as to identify how various backup utilities
+work and prototype APIs accordingly.
+
+4 Detailed Description
+======================
+
+The idea is to have a set of APIs used by applications to retrieve a list
+of changes in the filesystem. As of now, the changes are classified
+into three categories:
+
+- Entry operation
+ - Operations that act on filesystem namespace such as creat(),
+      unlink(), rename(), etc. fall into this category. These
+      operations require the parent inode and the basename as part of
+      the file operation method.
+
+- Data operation
+ - Operations that modify data blocks fall into this category:
+ write(), truncate(), etc.
+
+- Metadata operation
+ - Operation that modify inode data such as setattr(), setxattr()
+ [set extended attributes] etc. fall in this category.
+
+Details of the record format and the consumer library (libgfchangelog)
+is explained in this
+[document](https://github.com/gluster/glusterfs/blob/master/doc/features/geo-replication/libgfchangelog.md).
+Operations that are persisted in the journal can be notified. Therefore,
+operations such as open(), close() are not notified (via journals
+consumption). It's beneficial that notifications for such operations be
+short circuited directly from the changelog translator to
+libgfchangelog.
+
+For gnotify, we introduce a set of low level APIs. Using the low level
+interface relieves the application of knowing the record format and
+other details such as journal state, let alone periodic polling, which
+could be expensive at times. The low level interface induces a callback
+based programming model (and an inotify() style blocking read() call)
+with minimal heavy lifting required from the application.
+
+The API prototypes are listed below (NOTE: the prototypes are subject
+to change):
+
+- changelog\_low\_level\_register()
+
+- changelog\_put\_buffer()
+
+It's also necessary to provide an interface to get changes via
+filesystem crawl based on changed time (xtime): beneficial for initial
+crawl when journals are not available or after a stop/start.
+
+5 Benefit to GlusterFS
+======================
+
+Integrating backup applications with GlusterFS to incrementally back up
+the filesystem is a powerful functionality. Having notifications delivered
+to \*each\* client adds to the usefulness of this feature. Apart from the
+backup perspective, journals can be used by utilities such as the self-heal
+daemon and Geo-replication (which already uses the high level API).
+
+6 Scope
+=======
+
+6.1. Nature of proposed change
+------------------------------
+
+Changes to the changelog translator and consumer library (plus
+integration of parallel filesystem crawl and exposing an API).
+
+6.2. Implications on manageability
+----------------------------------
+
+None
+
+6.3. Implications on presentation layer
+---------------------------------------
+
+None
+
+6.4. Implications on persistence layer
+--------------------------------------
+
+None
+
+6.5. Implications on 'GlusterFS' backend
+----------------------------------------
+
+None
+
+6.6. Modification to GlusterFS metadata
+---------------------------------------
+
+Introduction of the 'xtime' extended attribute. This is nothing new, as
+it is already maintained by the marker translator. With the integration
+of the 'xsync' crawl into libgfchangelog, 'xtime' would additionally be
+maintained by the library.
+
+6.7. Implications on 'glusterd'
+-------------------------------
+
+None
+
+7 How To Test
+=============
+
+Test backup scripts integrated with the API, or use the shipped 'gfind'
+tool as an example.
+
+8 User Experience
+=================
+
+Easy to use backup friendly API, well integrated with GlusterFS
+ecosystem. Does away with polling or expensive duplication of filesystem
+crawl code.
+
+9 Dependencies
+==============
+
+None
+
+10 Documentation
+================
+
+TBD
+
+11 Status
+=========
+
+Design/Development in progress
+
+12 Comments and Discussion
+==========================
+
+More than welcome :-)
diff --git a/done/GlusterFS 3.7/HA for Ganesha.md b/done/GlusterFS 3.7/HA for Ganesha.md
new file mode 100644
index 0000000..fbd3192
--- /dev/null
+++ b/done/GlusterFS 3.7/HA for Ganesha.md
@@ -0,0 +1,156 @@
+Feature
+-------
+
+HA support for NFS-ganesha.
+
+Summary
+-------
+
+Automated resource monitoring and fail-over of the ganesha.nfsd in a
+cluster of GlusterFS and NFS-Ganesha servers.
+
+Owners
+------
+
+Kaleb Keithley
+
+Current status
+--------------
+
+Implementation is in progress.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+- [Gluster CLI for
+ Ganesha](Features/Gluster_CLI_for_ganesha "wikilink")
+- [Upcall Infrastructure](Features/Upcall-infrastructure "wikilink")
+
+Detailed Description
+--------------------
+
+The implementation uses the Corosync and Pacemaker HA solution. The
+implementation consists of three parts:
+
+1. a script for setup and teardown of the clustering,
+2. three new Pacemaker resource agent files, and
+3. use of the existing IPaddr and Dummy Pacemaker resource agents for
+   handling a floating Virtual IP address (VIP) and putting the
+   ganesha.nfsd into Grace.
+
+The three new resource agents are tentatively named ganesha\_grace,
+ganesha\_mon, and ganesha\_nfsd.
+
+The ganesha\_nfsd resource agent is cloned on all nodes in the cluster.
+Each ganesha\_nfsd resource agent is responsible for mounting and
+unmounting a shared volume used for persistent storage of the state of
+all the ganesha.nfsds in the cluster and starting the ganesha.nfsd
+process on each node.
+
+The ganesha\_mon resource agent is cloned on all nodes in the cluster.
+Each ganesha\_mon resource agent monitors the state of its ganesha.nfsd.
+If the daemon terminates for any reason it initiates the move of its VIP
+to another node in the cluster. A Dummy resource agent is created which
+represents the dead ganesha.nfsd. The ganesha\_grace resource agents use
+this resource to send the correct hostname in the dbus event they send.
+
+The ganesha\_grace resource agent is cloned on all nodes in the cluster.
+Each ganesha\_grace resource agent monitors the states of all
+ganesha.nfsds in the cluster. If any ganesha.nfsd has died, it sends a
+DBUS event to its own ganesha.nfsd to put it into Grace.
+
+IPaddr and Dummy resource agents are created on each node in the
+cluster. Each IPaddr resource agent has a unique name derived from the
+node name (e.g. mynodename-cluster\_ip-1) and manages an associated
+virtual IP address. There is one virtual IP address for each node.
+Initially each IPaddr and its virtual IP address is tied to its
+respective node, and moves to another node when its ganesha.nfsd dies
+for any reason. Each Dummy resource agent has a unique name derived from
+the node name (e.g. mynodename-trigger\_ip-1) and is used to ensure the
+proper order of operations, i.e. move the virtual IP, then send the dbus
+signal.
+
+N.B. Originally fail-back was outside the scope for the Everglades
+release. After a redesign we got fail-back for free. If the ganesha.nfsd
+is restarted on a node its virtual IP will automatically fail back.
+
+Benefit to GlusterFS
+--------------------
+
+GlusterFS is expected to be a common storage medium for NFS-Ganesha
+NFSv4 storage solutions. GlusterFS has its own built-in HA feature.
+NFS-Ganesha will ultimately support pNFS, a cluster-aware version of
+NFSv4, but does not have its own HA functionality. This will allow users
+to deploy HA NFS-Ganesha.
+
+Scope
+-----
+
+TBD
+
+### Nature of proposed change
+
+TBD
+
+### Implications on manageability
+
+Simplifies setup of HA by providing a supported solution with a recipe
+for basic configuration plus an automated setup.
+
+### Implications on presentation layer
+
+None
+
+### Implications on persistence layer
+
+A small shared volume is required. The NFSganesha resource agent mounts
+and unmounts the volume when it starts and stops.
+
+This volume is used by the ganesha.nfsd to persist things like its lock
+state and is used by another ganesha.nfsd after a fail-over.
+
+### Implications on 'GlusterFS' backend
+
+A small shared volume is required. The NFSganesha resource agent mounts
+and unmounts the volume when it starts and stops.
+
+This volume must be created before HA setup is attempted.
+
+### Modification to GlusterFS metadata
+
+None
+
+### Implications on 'glusterd'
+
+None
+
+How To Test
+-----------
+
+TBD
+
+User Experience
+---------------
+
+The user experience is intended to be as seamless and invisible as
+possible. There are a few new CLI commands added that will invoke the
+setup script. The Corosync/Pacemaker setup takes about 15-30 seconds on
+a four node cluster, so there is a short delay between invoking the CLI
+and the cluster being ready.
+
+Dependencies
+------------
+
+GlusterFS CLI and Upcall Infrastructure (see related features).
+
+Documentation
+-------------
+
+In development
+
+Comments and Discussion
+-----------------------
+
+The feature page is not complete as yet. This will be updated regularly.
diff --git a/done/GlusterFS 3.7/Improve Rebalance Performance.md b/done/GlusterFS 3.7/Improve Rebalance Performance.md
new file mode 100644
index 0000000..32a2eec
--- /dev/null
+++ b/done/GlusterFS 3.7/Improve Rebalance Performance.md
@@ -0,0 +1,277 @@
+Feature
+-------
+
+Improve GlusterFS rebalance performance
+
+Summary
+-------
+
+Improve the current rebalance mechanism in GlusterFS by utilizing the
+resources better, to speed up overall rebalance and also to speed up the
+brick removal and addition cases where data needs to be spread faster
+than the current rebalance mechanism does.
+
+Owners
+------
+
+Raghavendra Gowdappa <rgowdapp@redhat.com>
+Shyamsundar Ranganathan <srangana@redhat.com>
+Susant Palai <spalai@redhat.com>
+Venkatesh Somyajulu <vsomyaju@redhat.com>
+
+Current status
+--------------
+
+This section is split into 2 parts, explaining how the current rebalance
+works and what its limitations are.
+
+### Current rebalance mechanism
+
+Currently rebalance works as follows,
+
+A) Each node in the Gluster cluster kicks off a rebalance process for
+one of the following actions:
+
+- layout fixing
+- rebalance data, with space constraints in check
+ - Will rebalance data with file size and disk free availability
+ constraints, and move files that will not cause a brick
+ imbalance in terms of amount of data stored across bricks
+- rebalance data, based on layout precision
+ - Will rebalance data so that the layout is adhered to and hence
+ optimize lookups in the future (find the file where the layout
+ claims it is)
+
+B) Each node's process then uses the following algorithm to proceed:
+
+- 1: Open root of the volume
+- 2: Fix the layout of root
+- 3: Start a crawl on the current directory
+- 4: For each file in the current directory,
+ - 4.1: Determine if file is in the current node (optimize on
+ network reads for file data)
+ - 4.2: If it does, migrate file based on type of rebalance sought
+ - 4.3: End the file crawl once crawl returns no more entries
+- 5: For each directory in the current directory
+ - 5.1: Fix the layout, and iterate on starting the crawl for this
+ directory (goto step 3)
+- 6: End the directory crawl once crawl returns no more entries
+- 7: Cleanup and exit
+
+### Limitations and issues in the current mechanism
+
+The current mechanism spreads the work of rebalance to all nodes in the
+cluster, and also takes into account only files that belong to the node
+on which the rebalance process is running. This spreads the load of
+rebalance well and also optimizes network reads of data, by taking into
+account only files local to the current node.
+
+Where this becomes slow is in the following cases,
+
+1) It rebalances only one file at a time, as it uses the syncop
+infrastructure to start the rebalance of a file by issuing a setxattr with
+the special attribute "distribute.migrate-data", which in turn would
+return after its synctask of migrating the file completes (synctask:
+rebalance\_task)
+
+- This reduces the bandwidth consumption of several resources, like
+ disk, CPU and network as we would read and write a single file at a
+ time
+
+2) Rebalance of data is serial between reads and writes of data, i.e.
+for a file a chunk of data is read from disk, written to the network,
+awaiting a response to the write from the remote node, and then
+proceeding with the next read
+
+- This makes read-write dependent on each other, and waiting for one
+ or the other to complete, so we either have the network idle when
+ reads from disk are in progress or vice-versa
+
+- This further makes serial use of resource like the disk or network,
+ reading or writing one block at a time
+
+​3) Each rebalance process crawls the entire volume for files to
+migrate, and chooses only files that are local to it
+
+- This crawl could be expensive and as a node deals with files that
+ are local to it, based on the cluster size and number of nodes,
+ quite a proportion of the entries crawled would hence be dropped
+
+4) On a remove-brick, the current rebalance ends up rebalancing the
+entire cluster. If the interest is in removing or replacing the brick(s),
+rebalancing the entire cluster can be costly.
+
+​5) On addition of bricks, again the entire cluster is rebalanced. If
+space constraints are in play due to which bricks were added, it is
+sub-optimal to rebalance the entire cluster.
+
+​6) In cases where AFR is below DHT, all the nodes in AFR participate in
+the rebalance, and end up rebalancing (or attempting to) the same set of
+files. This is racy, and could (maybe) be made better.
+
+Detailed Description
+--------------------
+
+The above limitations can be broken down into separate features to
+improve rebalance performance and to also provide options in rebalance
+when specific use cases like quicker brick removal is sought. The
+following sections detail out these improvements.
+
+### Rebalance multiple files in parallel
+
+Instead of rebalancing file by file, due to using syncops to trigger the
+rebalance of a file's data using setxattr, use the wind infrastructure to
+migrate multiple files at a time. This would end up using the disk and
+network infrastructure better and can hence enable faster rebalance of
+data. It would also mean that when one file is blocked on a disk read,
+another parallel stream could be writing data to the network, so
+starvation in the read-write-read model between disk and network could
+also be alleviated to a point (see the sketch below).
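+
+A minimal, purely illustrative sketch of the idea (in Python, not the
+actual syncop/STACK_WIND based implementation): several in-flight
+migrations keep both the disk and the network busy, instead of strictly
+alternating read -> write -> read per file. File paths and the migration
+helper are stand-ins.
+
+    import shutil
+    from concurrent.futures import ThreadPoolExecutor
+
+    def migrate_file(src, dst):
+        # Stand-in for DHT migrating a file from its cached subvolume
+        # to its hashed subvolume (read from one brick, write to another).
+        shutil.copyfile(src, dst)
+        return src
+
+    def rebalance(files_to_move, parallelism=8):
+        # files_to_move: list of (source_path, destination_path) tuples
+        with ThreadPoolExecutor(max_workers=parallelism) as pool:
+            for done in pool.map(lambda p: migrate_file(*p), files_to_move):
+                print("migrated", done)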
+
+### Split reads and writes of files into separate tasks when rebalancing the data
+
+This is to reduce the wait between a disk read or a network write, and
+to ensure both these resources can be kept busy. By rebalancing more
+files in parallel, this improvement may not be needed, as the parallel
+streams would end up in keeping one or the other resource busy with
+better probability. Noting this enhancement down anyway to see if this
+needs consideration post increasing the parallelism of rebalance as
+above.
+
+### Crawl only bricks that belong to the current node
+
+As explained, the current rebalance takes into account only those files
+that belong to the current node. As this is a DHT level operation, we
+can hence choose not to send opendir/readdir calls to subvolumes that do
+not belong to the current node. This would reduce the crawls that are
+performed in rebalance for files at least and help in speeding up the
+entire process.
+
+NOTE: We would still need to evaluate the cost of this crawl vis-a-vis
+the overall rebalance process, to evaluate its benefits
+
+### Rebalance on access
+
+When removing bricks, one of the intentions is to drain the brick of all
+its data and hence enable removing the brick as soon as possible.
+
+When adding bricks, one of the requirements could be that the cluster is
+reaching its capacity and hence we want to increase the same.
+
+In both the cases rebalancing the entire cluster could/would take time.
+Instead an alternate approach is being proposed, where we do 3 things
+essentially as follows,
+
+- Kick off rebalance to fix layout, and drain a brick of its data,
+ or rebalance files onto a newly added brick
+- On further access of data, if the access is leading to a double
+ lookup or redirection based on layout (due to older bricks data not
+ yet having been rebalanced), start a rebalance of this file in
+ tandem to IO access (call this rebalance on access)
+- Start a slower, or a later, rebalance of the cluster, once the
+ intended use case is met, i.e draining a brick of its data or
+ creating space in other bricks and filling the newly added brick
+ with relevant data. This is to get the cluster balanced again,
+ without requiring data to be accessed.
+
+### Make rebalance aware of IO path requirements
+
+One of the problems of improving resource consumption in a node by the
+rebalance process would be that, we could starve the IO path. So further
+to some of the above enhancements, take into account IO path resource
+utilization (i.e disk/network/CPU) and slow down or speed up the
+rebalance process appropriately (say by increasing or decreasing the
+number of files that are rebalanced in parallel).
+
+NOTE: This requirement is being noted down, just to ensure we do not
+make the IO access to the cluster slow as rebalance is being made
+faster, resources to monitor and tune rebalance may differ when tested
+and experimented upon
+
+### Further considerations
+
+- We could consider some further layout optimization to reduce the
+ amount of data that is being rebalanced
+- Addition of scheduled rebalance, or the ability to stop and
+ continue rebalance from a point, could be useful for preventing IO
+ path slowness in cases where an admin could choose to run rebalance
+ on non-critical hours (do these even exist today?)
+- There are no performance xlators in the rebalance graph. We should
+ try experiments loading them.
+
+Benefit to GlusterFS
+--------------------
+
+Gluster is a grow as you need distributed file system. With this in
+picture, rebalance is key to grow the cluster in relatively sane amount
+of time. This enhancement attempts to speed things up in rebalance, in
+order to better this use case.
+
+Scope
+-----
+
+### Nature of proposed change
+
+This is intended as a modification to existing code only, there are no
+new xlators being introduced. BUT, as things evolve and we consider say,
+layout optimization based on live data or some such notions, we would
+need to extend this section to capture the proposed changes.
+
+### Implications on manageability
+
+The gluster command would need some extensions, for example the number
+of files to process in parallel, as we introduce these changes. As it is
+currently in the prototype phase, keeping this and the sections below as
+TBDs
+
+**Document TBD from here on...**
+
+### Implications on presentation layer
+
+*NFS/SAMBA/UFO/FUSE/libglusterfsclient Integration*
+
+### Implications on persistence layer
+
+*LVM, XFS, RHEL ...*
+
+### Implications on 'GlusterFS' backend
+
+*brick's data format, layout changes*
+
+### Modification to GlusterFS metadata
+
+*extended attributes used, internal hidden files to keep the metadata...*
+
+### Implications on 'glusterd'
+
+*persistent store, configuration changes, brick-op...*
+
+How To Test
+-----------
+
+*Description on Testing the feature*
+
+User Experience
+---------------
+
+*Changes in CLI, effect on User experience...*
+
+Dependencies
+------------
+
+*Dependencies, if any*
+
+Documentation
+-------------
+
+*Documentation for the feature*
+
+Status
+------
+
+Design/Prototype in progress
+
+Comments and Discussion
+-----------------------
+
+*Follow here*
diff --git a/done/GlusterFS 3.7/Object Count.md b/done/GlusterFS 3.7/Object Count.md
new file mode 100644
index 0000000..5c7c014
--- /dev/null
+++ b/done/GlusterFS 3.7/Object Count.md
@@ -0,0 +1,113 @@
+Feature
+-------
+
+Object Count
+
+Summary
+-------
+
+An efficient mechanism to retrieve the number of objects per directory
+or volume.
+
+Owners
+------
+
+Vijaikumar M <vmallika@redhat.com>
+Sachin Pandit <spandit@redhat.com>
+
+Current status
+--------------
+
+Currently, the only way to retrieve the number of files/objects in a
+directory or volume is to do a crawl of the entire directory/volume.
+This is expensive and is not scalable.
+
+The proposed mechanism will provide an easier alternative to determine
+the count of files/objects in a directory or volume.
+
+Detailed Description
+--------------------
+
+The new mechanism proposes to store count of objects/files as part of an
+extended attribute of a directory. Each directory's extended attribute
+value will indicate the number of files/objects present in a tree with
+the directory being considered as the root of the tree.
+
+The count value can be accessed by performing a getxattr(). Cluster
+translators like afr, dht and stripe will perform aggregation of count
+values from various bricks when getxattr() happens on the key associated
+with file/object count.
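+
+As an illustration of the intended access pattern (the xattr key shown
+here is hypothetical; the feature will define the actual key):
+
+    import os
+
+    # Hypothetical key for the aggregated object count of a directory tree.
+    OBJECT_COUNT_KEY = b"trusted.example.object-count"
+
+    def object_count(directory):
+        # A single getxattr() replaces a crawl of the whole tree; cluster
+        # translators (afr/dht/stripe) aggregate the per-brick values.
+        return int(os.getxattr(directory, OBJECT_COUNT_KEY))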
+
+Benefit to GlusterFS
+--------------------
+
+- Easy to query number of objects present in a volume.
+- Can serve as an accounting mechanism for quota enforcement based on
+ number of inodes.
+- This interface will be useful for integration with OpenStack Swift
+ and Ceilometer.
+
+Scope
+-----
+
+### Nature of proposed change
+
+- Marker translator to be modified to perform accounting on all
+ create/delete operations.
+
+- A new volume option to enable/disable this feature.
+
+### Implications on manageability
+
+- A new volume option to enable/disable this feature.
+- A new CLI interface to display this count at either a volume or
+ directory level.
+
+### Implications on presentation layer
+
+None
+
+### Implications on persistence layer
+
+None
+
+### Implications on 'GlusterFS' backend
+
+None
+
+### Modification to GlusterFS metadata
+
+A new extended attribute for storing count of objects at each directory
+level.
+
+### Implications on 'glusterd'
+
+TBD
+
+How To Test
+-----------
+
+TBD
+
+User Experience
+---------------
+
+TBD
+
+Dependencies
+------------
+
+None
+
+Documentation
+-------------
+
+TBD
+
+Status
+------
+
+Design Ready
+
+Comments and Discussion
+-----------------------
diff --git a/done/GlusterFS 3.7/Policy based Split-brain Resolution.md b/done/GlusterFS 3.7/Policy based Split-brain Resolution.md
new file mode 100644
index 0000000..f7a6870
--- /dev/null
+++ b/done/GlusterFS 3.7/Policy based Split-brain Resolution.md
@@ -0,0 +1,128 @@
+Feature
+-------
+
+This feature provides a way of resolving split-brains based on policies
+from the gluster CLI.
+
+Summary
+-------
+
+This feature provides a way of resolving split-brains based on policies.
+The goal is to provide commands that resolve split-brains using policies
+such as 'choose a specific brick as source' or 'choose the bigger file
+as source'.
+
+Owners
+------
+
+Ravishankar N
+Pranith Kumar Karampuri
+
+Current status
+--------------
+
+Feature completed.
+
+Detailed Description
+--------------------
+
+Until now, resolving a split-brain required manual intervention. In
+most cases, however, either the files from a particular brick are chosen
+as the source, or the file with the bigger size is chosen as the source.
+This feature provides CLIs that can be used to resolve the split-brains
+present in the system using these policies.
+
+Benefit to GlusterFS
+--------------------
+
+It improves the manageability of resolving split-brains.
+
+Scope
+-----
+
+### Nature of proposed change
+
+#### Added new gluster CLIs:
+
+1. ```gluster volume heal <VOLNAME> split-brain bigger-file <FILE>```
+
+Locates the replica containing ```<FILE>```, selects the bigger file as
+the source and completes the heal.
+
+2. ```gluster volume heal <VOLNAME> split-brain source-brick <HOSTNAME:BRICKNAME> <FILE>```
+
+Selects the ```<FILE>``` present in ```<HOSTNAME:BRICKNAME>``` as the
+source and completes the heal.
+
+3. ```gluster volume heal <VOLNAME> split-brain <HOSTNAME:BRICKNAME>```
+
+Selects **all** split-brained files in ```<HOSTNAME:BRICKNAME>``` as the
+source and completes the heal.
+
+Note: ```<FILE>``` can be either the full file name as seen from the
+root of the volume or the gfid-string representation of the file, which
+sometimes gets displayed in the heal info command's output.
+
+### Implications on manageability
+
+New CLIs are added to improve the manageability of files in split-brain.
+
+### Implications on presentation layer
+
+None
+
+### Implications on persistence layer
+
+None
+
+### Implications on 'GlusterFS' backend
+
+None
+
+### Modification to GlusterFS metadata
+
+None
+
+### Implications on 'glusterd'
+
+None
+
+How To Test
+-----------
+
+Create files in data and metadata split-brain. Accessing the files from
+clients gives EIO. Use the CLI commands to pick the source and trigger
+the heal. After the CLI returns success, the files should be identical
+on the replica bricks and must be accessible again by the clients.
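+
+A minimal sketch of this test flow, assuming a FUSE mount at
+/mnt/testvol and placeholder volume/file names:
+
+    import errno
+    import subprocess
+
+    path = "/mnt/testvol/dir/a.txt"   # hypothetical file in split-brain
+
+    # Reading a split-brained file from the client should fail with EIO.
+    try:
+        with open(path, "rb") as f:
+            f.read()
+    except OSError as e:
+        assert e.errno == errno.EIO
+
+    # Resolve the split-brain by picking the bigger file as the source
+    # (volume and file names are placeholders).
+    subprocess.check_call(["gluster", "volume", "heal", "testvol",
+                           "split-brain", "bigger-file", "/dir/a.txt"])
+
+    # After a successful heal the file must be readable again.
+    with open(path, "rb") as f:
+        f.read()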
+
+User Experience
+---------------
+
+New CLIs are introduced.
+
+Dependencies
+------------
+
+None
+
+Documentation
+-------------
+
+TODO: Add an md file in glusterfs/doc.
+
+Status
+------
+
+Feature completed. Main and dependency patches:
+
+<http://review.gluster.org/9377>
+<http://review.gluster.org/9375>
+<http://review.gluster.org/9376>
+<http://review.gluster.org/9439>
+
+Comments and Discussion
+-----------------------
+
+---
diff --git a/done/GlusterFS 3.7/SE Linux Integration.md b/done/GlusterFS 3.7/SE Linux Integration.md
new file mode 100644
index 0000000..8d282e6
--- /dev/null
+++ b/done/GlusterFS 3.7/SE Linux Integration.md
@@ -0,0 +1,4 @@
+SELinux Integration
+-------------------
+
+The work here is really to get SELinux to work with gluster (saving labels on gluster inodes etc.), and most of the work is really outside Gluster. There's really not any coding involved in the gluster side, but to push for things in the SELinux project to get the right policies and code changes in the kernel to deal with FUSE based filesystems. In the process we might discover issues in the gluster side (not sure what they are) - and I would like to fix those not-yet-known problems before 3.5. \ No newline at end of file
diff --git a/done/GlusterFS 3.7/Scheduling of Snapshot.md b/done/GlusterFS 3.7/Scheduling of Snapshot.md
new file mode 100644
index 0000000..0b2b49c
--- /dev/null
+++ b/done/GlusterFS 3.7/Scheduling of Snapshot.md
@@ -0,0 +1,229 @@
+Feature
+-------
+
+Scheduling Of Snapshot
+
+Summary
+-------
+
+GlusterFS volume snapshot provides a point-in-time copy of a GlusterFS
+volume. Currently, GlusterFS volume snapshots can be easily scheduled by
+setting up cron jobs on one of the nodes in the GlusterFS trusted
+storage pool. This has a single point of failure (SPOF), as scheduled
+jobs can be missed if the node running the cron jobs dies.
+
+We can avoid the SPOF by distributing the cron jobs to all nodes of the
+trusted storage pool.
+
+Owner(s)
+--------
+
+Avra Sengupta <asengupt@redhat.com>
+
+Copyright
+---------
+
+Copyright (c) 2015 Red Hat, Inc. <http://www.redhat.com>
+
+This feature is licensed under your choice of the GNU Lesser General
+Public License, version 3 or any later version (LGPLv3 or later), or the
+GNU General Public License, version 2 (GPLv2), in all cases as published
+by the Free Software Foundation.
+
+Detailed Description
+--------------------
+
+The solution to the above problem involves the use of:
+
+- A shared storage - A gluster volume by the name of
+ "gluster\_shared\_storage" is used as a shared storage across nodes
+ to co-ordinate the scheduling operations. This shared storage is
+ mounted at /var/run/gluster/shared\_storage on all the nodes.
+
+- An agent - This agent will perform the actual snapshot commands,
+ instead of cron. It will contain the logic to perform coordinated
+ snapshots.
+
+- A helper script - This script will allow the user to initialise the
+ scheduler on the local node, enable/disable scheduling,
+ add/edit/list/delete snapshot schedules.
+
+- cronie - It is the default cron daemon shipped with RHEL. It invokes
+  the agent at the intervals specified by the user to perform the
+  snapshot operation on the volume named in the schedule.
+
+Initial Setup
+-------------
+
+The administrator needs to create a shared storage that can be available
+to nodes across the cluster. A GlusterFS volume by the name of
+"gluster\_shared\_storage" should be created for this purpose. It is
+preferable that the \*shared volume\* be a replicate volume to avoid
+SPOF.
+
+Once the shared storage is created, it should be mounted on all nodes in
+the trusted storage pool which will be participating in the scheduling.
+The location where the shared\_storage should be mounted
+(/var/run/gluster/shared\_storage) in these nodes is fixed and is not
+configurable. Each node participating in the scheduling then needs to
+perform an initialisation of the snapshot scheduler by invoking the
+following:
+
+snap\_scheduler.py init
+
+NOTE: This command needs to be run on all the nodes participating in the
+scheduling
+
+Helper Script
+-------------
+
+The helper script (snap\_scheduler.py) will initialise the scheduler on
+the local node, enable/disable scheduling, and add/edit/list/delete
+snapshot schedules.
+
+​a) snap\_scheduler.py init
+
+This command initialises the snap\_scheduler and interfaces it with the
+crond running on the local node. This is the first step, before
+executing any scheduling related commands from a node.
+
+NOTE: The helper script needs to be run with this option on all the
+nodes participating in the scheduling. Other options of the helper
+script can be run independently from any node, where initialisation has
+been successfully completed.
+
+​b) snap\_scheduler.py enable
+
+The snap scheduler is disabled by default after initialisation. This
+command enables the snap scheduler.
+
+​c) snap\_scheduler.py disable
+
+This command disables the snap scheduler.
+
+​d) snap\_scheduler.py status
+
+This command displays the current status (Enabled/Disabled) of the snap
+scheduler.
+
+​e) snap\_scheduler.py add "Job Name" "Schedule" "Volume Name"
+
+This command adds a new snapshot schedule. All the arguments must be
+provided within double-quotes(""). It takes three arguments:
+
+-\> Job Name: This name uniquely identifies this particular schedule,
+and can be used to reference this schedule for future events like
+edit/delete. If a schedule already exists for the specified Job Name,
+the add command will fail.
+
+-\> Schedule: The schedules are accepted in the format crond
+understands:
+
+    # Example of job definition:
+    # .---------------- minute (0 - 59)
+    # |  .------------- hour (0 - 23)
+    # |  |  .---------- day of month (1 - 31)
+    # |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
+    # |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
+    # |  |  |  |  |
+    # *  *  *  *  *  user-name  command to be executed
+
+Although all valid cron schedules are accepted, we currently support a
+maximum granularity of half-hourly snapshots.
+
+-\> Volume Name: The name of the volume on which the scheduled snapshot
+operation will be performed.
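+
+For example, a half-hourly schedule for a hypothetical volume could be
+added as follows (a sketch; the job, schedule and volume names are
+placeholders):
+
+    import subprocess
+
+    # Equivalent to running from a shell:
+    #   snap_scheduler.py add "Job1" "*/30 * * * *" "test_vol"
+    subprocess.check_call(["snap_scheduler.py", "add",
+                           "Job1", "*/30 * * * *", "test_vol"])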
+
+​f) snap\_scheduler.py edit "Job Name" "Schedule" "Volume Name"
+
+This command edits an existing snapshot schedule. It takes the same
+three arguments that the add option takes. All the arguments must be
+provided within double-quotes(""). If a schedule does not exist for the
+specified Job Name, the edit command will fail.
+
+​g) snap\_scheduler.py delete "Job Name"
+
+This command deletes an existing snapshot schedule. It takes the job
+name of the schedule as argument. The argument must be provided within
+double-quotes(""). If a schedule does not exist for the specified Job
+Name, the delete command will fail.
+
+​h) snap\_scheduler.py list
+
+This command lists the existing snapshot schedules in the following
+manner:
+
+``\
+`# snap_scheduler.py list`\
+`JOB_NAME         SCHEDULE         OPERATION        VOLUME NAME      `\
+`--------------------------------------------------------------------`\
+`Job0             * * * * *        Snapshot Create  test_vol    `
+
+The agent
+---------
+
+The schedules created with the help of the helper script are read by
+crond, which then invokes the agent (gcron.py) at the scheduled
+intervals to perform the snapshot operations on the specified volumes.
+The agent coordinates the scheduled snapshots across nodes using the
+following algorithm.
+
+Pseudocode:
+
+``\
+`start_time = get current time`\
+`lock_file = job_name passed as an argument`\
+`vol_name = volume name passed as an argument`\
+`try POSIX locking the $lock_file`\
+`    if lock is obtained, then`\
+`        mod_time = Get modification time of $lock_file`\
+`        if $mod_time < $start_time, then`\
+`            Take snapshot of $vol_name`\
+`            if snapshot failed, then`\
+`                log the failure`\
+`            Update modification time of $lock_file to current time`\
+`        unlock the $lock_file`
+
+The coordination with the scripts running on other nodes is handled by
+the use of POSIX locks. All the instances of the script will attempt to
+lock the lock\_file, which is essentially an empty file named after the
+job, and the one which gets the lock will take the snapshot.
+
+To prevent redoing a completed task, the script makes use of the mtime
+attribute of the lock\_file. At the beginning of execution, the script
+saves its start time. Once the script obtains the lock on the
+lock\_file, before taking the snapshot it compares the mtime of the
+lock\_file with the start time. The snapshot is taken only if the mtime
+is smaller than the start time. Once the snapshot command completes,
+the script updates the mtime of the lock\_file to the current time
+before unlocking.
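+
+A minimal Python sketch of this coordination logic, assuming a per-job
+lock file under the shared storage mount and using the documented
+snapshot CLI; the lock-file path and snapshot naming are placeholders
+and the actual gcron.py implementation may differ:
+
+    import fcntl
+    import os
+    import subprocess
+    import syslog
+    import time
+
+    def coordinated_snapshot(job_name, vol_name):
+        """Take at most one snapshot of vol_name per schedule across nodes."""
+        start_time = time.time()
+        # Hypothetical layout of per-job lock files on the shared storage.
+        lock_path = os.path.join("/var/run/gluster/shared_storage/snaps", job_name)
+
+        with open(lock_path, "a") as lock_file:
+            try:
+                # Only one node in the pool wins this non-blocking POSIX lock.
+                fcntl.lockf(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
+            except (IOError, OSError):
+                return  # another node is handling this schedule
+
+            # Skip if another node already serviced this schedule (mtime bumped).
+            if os.stat(lock_path).st_mtime < start_time:
+                snap_name = job_name + "-" + time.strftime("%Y%m%d%H%M")
+                ret = subprocess.call(["gluster", "snapshot", "create",
+                                       snap_name, vol_name])
+                if ret != 0:
+                    syslog.syslog("snapshot of %s failed for job %s"
+                                  % (vol_name, job_name))
+                os.utime(lock_path, None)  # record that the schedule ran
+            fcntl.lockf(lock_file, fcntl.LOCK_UN)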
+
+If a snapshot command fails, the script will log the failure (in syslog)
+and continue with its operation. It will not attempt to retry the failed
+snapshot in the current schedule, but will attempt it again in the next
+schedule. It is left to the administrator to monitor the logs and
+decide what to do after a failure.
+
+Assumptions and Limitations
+---------------------------
+
+It is assumed that all nodes in the trusted storage pool have their
+times synced using NTP or any other mechanism. This is a hard
+requirement for this feature to work.
+
+The administrator needs to have Python 2.7 or higher installed, as well
+as the argparse module, to be able to use the helper script
+(snap\_scheduler.py).
+
+There is a latency of up to one minute between issuing a command via
+the helper script and that command taking effect. Hence, we currently do
+not support snapshot schedules with per-minute granularity.
+
+The administrator can, however, leverage the scheduler to schedule
+snapshots at half-hourly/hourly/daily/weekly/monthly/yearly periodic
+intervals. They can also create customised schedules specifying the
+minute of the hour, the day of the week, the week of the month, and the
+month of the year at which the snapshot operation should run.
diff --git a/done/GlusterFS 3.7/Sharding xlator.md b/done/GlusterFS 3.7/Sharding xlator.md
new file mode 100644
index 0000000..b33d698
--- /dev/null
+++ b/done/GlusterFS 3.7/Sharding xlator.md
@@ -0,0 +1,129 @@
+Goal
+----
+
+Better support for striping.
+
+Summary
+-------
+
+The current stripe translator, below DHT, requires that bricks be added
+in a multiple of the stripe count times the replica/erasure count. It
+also means that failures or performance anomalies in one brick
+disproportionately affect one set of striped files (a fraction equal to
+stripe count divided by total bricks) while the rest remain unaffected.
+By moving above DHT, we can avoid both the configuration limit and the
+performance asymmetry.
+
+Owners
+------
+
+Vijay Bellur <vbellur@redhat.com>
+Jeff Darcy <jdarcy@redhat.com>
+Pranith Kumar Karampuri <pkarampu@redhat.com>
+Krutika Dhananjay <kdhananj@redhat.com>
+
+Current status
+--------------
+
+Proposed, waiting until summit for approval.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+None.
+
+Detailed Description
+--------------------
+
+The new sharding translator sits above DHT, creating "shard files" that
+DHT is then responsible for distributing. The shard translator is thus
+oblivious to the topology under DHT, even when that changes (or for that
+matter when the implementation of DHT changes). Because the shard files
+will each be hashed and placed separately by DHT, we'll also be using
+more combinations of DHT subvolumes and the effect of any imbalance
+there will be distributed more evenly.
+
+Benefit to GlusterFS
+--------------------
+
+More configuration flexibility and resilience to failures.
+
+Data transformations such as compression or de-duplication would benefit
+from sharding because portions of the file may be processed rather than
+exclusively at whole-file granularity. For example, to read a small
+extent from the middle of a compressed large file, only the shards
+overlapping the extent would need to be decompressed. Sharding could
+mean the "chunking" step is not needed at the dedupe level. For example,
+if a small portion of a de-duplicated file was modified, only the shard
+that changed would need to be reverted to an original non-deduped state.
+The untouched shards could continue as deduped and their savings
+maintained.
+
+The cache tiering feature would benefit from sharding. Currently large
+files must be migrated in full between tiers, even if only a small
+portion of the file is accessed. With sharding, only the shard accessed
+would need to be migrated.
+
+Scope
+-----
+
+### Nature of proposed change
+
+Most of the existing stripe translator remains applicable, except that
+it needs to be adapted to its new location above DHT instead of below.
+In particular, it needs to generate unique shard-file names and pass
+them all down to the same (DHT) subvolume, instead of using the same
+name across multiple (AFR/client) subvolumes.
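+
+As an illustration of the kind of offset-to-shard mapping involved, here
+is a minimal sketch; the shard size, naming scheme and helper names are
+assumptions for illustration, not the final on-disk format:
+
+    SHARD_SIZE = 4 * 1024 * 1024  # assumed shard block size (tunable)
+
+    def shard_name(gfid, index):
+        """Name of the shard file holding the given block of the original
+        file (illustrative naming only)."""
+        return gfid if index == 0 else "%s.%d" % (gfid, index)
+
+    def shards_for_range(gfid, offset, length):
+        """Shard files touched by a read/write spanning
+        [offset, offset + length); DHT then hashes and places each shard
+        file independently."""
+        first = offset // SHARD_SIZE
+        last = (offset + length - 1) // SHARD_SIZE
+        return [shard_name(gfid, i) for i in range(first, last + 1)]
+
+    # e.g. a 1 MB write at offset 6 MB touches only the shard with index 1:
+    print(shards_for_range("0bd5-example-gfid", 6 * 1024 * 1024, 1024 * 1024))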
+
+### Implications on manageability
+
+None, except perhaps the name change ("shard" vs. "stripe").
+
+### Implications on presentation layer
+
+None.
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+None.
+
+### Modification to GlusterFS metadata
+
+Possibly some minor xattr changes.
+
+### Implications on 'glusterd'
+
+None.
+
+How To Test
+-----------
+
+Current stripe tests should still be applicable. More should be written,
+since it is a little-used feature and not many tests exist currently.
+
+User Experience
+---------------
+
+None, except the name change.
+
+Dependencies
+------------
+
+None.
+
+Documentation
+-------------
+
+TBD, probably minor.
+
+Status
+------
+
+Work In Progress
+
+Comments and Discussion
+-----------------------
diff --git a/done/GlusterFS 3.7/Small File Performance.md b/done/GlusterFS 3.7/Small File Performance.md
new file mode 100644
index 0000000..3b868a6
--- /dev/null
+++ b/done/GlusterFS 3.7/Small File Performance.md
@@ -0,0 +1,433 @@
+Feature
+-------
+
+Small-file performance
+
+Summary
+-------
+
+This page describes a menu of optimizations that together can improve
+small-file performance, along with expected cases where optimizations
+matter, degree of improvement expected, and degree of difficulty.
+
+Owners
+------
+
+Shyamsundar Ranganathan <srangana@redhat.com>
+
+Ben England <bengland@redhat.com>
+
+Current status
+--------------
+
+Some of these optimizations are proposed patches upstream, some are also
+features being planned, such as Darcy's Gluster V4 DHT and NSR changes,
+and some are not specified yet at all. Where they already exist in some
+form links will be provided.
+
+Some previous optimizations have been included already in the Gluster
+code base, such as quick-read and open-behind translators. While these
+were useful and do improve throughput, they do not solve the general
+problem.
+
+Detailed Description
+--------------------
+
+What is a small file? While this term seems ambiguous, it really is just
+a file where the metadata access time far exceeds the data access time.
+Another term used for this is "metadata-intensive workload". To be
+clear, it is possible to have a metadata-intensive workload running
+against large files, if it is not the file data that is being accessed
+(example: "ls -l", "rm"). But what we are really concerned with is
+throughput and response time of common operations on files where the
+data is being accessed but metadata access time is severely restricting
+throughput.
+
+Why do we have a performance problem? We would expect that Gluster
+small-file performance would be within some reasonable percentage of the
+bottleneck determined by network performance and storage performance,
+and that a user would be happy to pay a performance "tax" in order to
+achieve scalability and high-availability that Gluster offers, as well
+as a wealth of functionality. However, repeatedly we see cases where
+Gluster small-file perf is an order of magnitude off of these
+bottlenecks, indicating that there are flaws in the software. This
+interferes with the most common tasks that a system admin or user has to
+perform, such as copying files into or out of Gluster, migrating or
+rebalancing data, and self-heal.
+
+So why do we care? Many of us anticipated that many Gluster workloads
+would have increasingly large files, however we are continuing to
+observe that Gluster workloads such as "unstructured data", are
+surprisingly metadata-intensive. As compute and storage power increase
+exponentially, we would expect that the average size of a storage object
+would also increase, but in fact it hasn't -- in several common cases we
+have files as small as 100 KB average size, or even 7 KB average size in
+one case. We can tell customers to rewrite their applications, or we can
+improve Gluster to be adequate for their needs, even if it isn't the
+design center for Gluster.
+
+The single-threadedness of many customer applications (examples include
+common Linux utilities such as rsync and tar) amplifies this problem,
+converting what was a throughput problem into a *latency* problem.
+
+Benefit to GlusterFS
+--------------------
+
+Improvement of small-file performance will remove a barrier to
+widespread adoption of this filesystem for mainstream use.
+
+Scope
+-----
+
+Although the scope of the individual changes is limited, the overall
+scope is very wide. Some changes can be done incrementally, and some
+cannot. That is why changes are presented as a menu rather than an
+all-or-nothing proposal.
+
+We know that scope of DHT+NSR V4 is large and changes will be discussed
+elsewhere, so we won't cover that here.
+
+##### multi-thread-epoll
+
+*Status*: DONE in glusterfs-3.6! [ <http://review.gluster.org/#/c/3842/>
+based on Anand Avati's patch ]
+
+*Why*: remove single-thread-per-brick barrier to higher CPU utilization
+by servers
+
+*Use case*: multi-client and multi-thread applications
+
+*Improvement*: measured 40% with 2 epoll threads and 100% with 4 epoll
+threads for small file creates to an SSD
+
+*Disadvantage*: might expose some race conditions in Gluster that might
+otherwise happen far less frequently, because receive message processing
+will be less sequential. These need to be fixed anyway.
+
+**Note**: this enhancement also helps high-IOPS applications such as
+databases and virtualization which are not metadata-intensive. This has
+been measured already using a Fusion I/O SSD performing random reads and
+writes -- it was necessary to define multiple bricks per SSD device to
+get Gluster to the same order of magnitude IOPS as a local filesystem.
+But this workaround is problematic for users, because storage space is
+not properly measured when there are multiple bricks on the same
+filesystem.
+
+##### remove io-threads translator
+
+*Status*: no patch yet, hopefully can be tested with volfile edit
+
+*Why*: don't need io-threads xlator now. Anand Avati suggested this
+optimization was possible. io-threads translator was created to allow a
+single "epoll" thread to launch multiple concurrent disk I/O requests,
+and this made sense back in the era of 1-GbE networking and rotational
+storage. However, thread context switching is getting more and more
+expensive as CPUs get faster. For example, switching between threads on
+different NUMA nodes is very costly. Switching to a powered-down core is
+also expensive. And context switch makes the CPUs forget whatever they
+learned about the application's memory and instructions. So this
+optimization could be vital as we try to make Gluster competitive in
+performance.
+
+*Use case*: lower latency for latency-sensitive workloads such as
+single-thread or single-client loads, and also improve efficiency of
+glusterfsd process.
+
+*Improvement*: no data yet
+
+*Disadvantage*: we need to have a much bigger epoll thread pool to keep
+a large set of disks busy. In principle this is no worse than having the
+io-threads pool, is it?
+
+##### glusterfsd stat and xattr cache
+
+Please see feature page
+[Features/stat-xattr-cache](../GlusterFS 4.0/stat-xattr-cache.md)
+
+*Why*: remove most system call latency from small-file read and create
+in brick process (glusterfsd)
+
+*Use case*: single-thread throughput, response time
+
+##### SSD and glusterfs tiering feature
+
+*Status*: [
+<http://www.gluster.org/community/documentation/index.php/Features/data-classification>
+feature page ]
+
+This is Jeff Darcy's proposal for re-using portions of DHT
+infrastructure to do storage tiering and other things. One possible use
+of this data classification feature is SSD caching of hot files, which
+Dan Lambright has begun to implement and demo.
+
+also see [
+<https://www.mail-archive.com/gluster-devel@gluster.org/msg00385.html>
+discussion in gluster-devel ]
+
+*Improvement*: results are particularly dramatic with erasure coding for
+small files, Dan's single-thread demo of 20-KB file reads showed a 100x
+reduction in latency with O\_DIRECT reads.
+
+*Disadvantages*: this will not help and may even slow down workloads
+with a "working set" (set of concurrently active files) much larger than
+the SSD tier, or with a rapidly changing working set that prevents the
+cache from "warming up". At present tiering works at the level of the
+entire file, which means it could be very expensive for some
+applications such as virtualization that do not read the entire file, as
+Ceph found out.
+
+##### migrate .glusterfs to SSD
+
+*Status*: [ <https://forge.gluster.org/gluster-meta-data-on-ssd> Dan
+Lambright's code for moving .glusterfs to SSD ]
+
+Also see [
+<http://blog.gluster.org/2014/03/experiments-using-ssds-with-gluster/> ]
+for background on other attempts to use SSD without changing Gluster to
+be SSD-aware.
+
+*Why*: lower latency of metadata access on disk
+
+*Improvement*: a small smoke test showed a 10x improvement for
+single-thread create, it is expected that this will help small-file
+workloads that are not cache-friendly.
+
+*Disadvantages*: This will not help large-file workloads. It will not
+help workloads where the Linux buffer cache is sufficient to get a high
+cache hit rate.
+
+*Costs*: Gluster bricks now have an external dependency on an SSD device
+- what if it fails?
+
+##### tiering at block device level
+
+*Status*: transparent to GlusterFS core. We mention it here because it
+is a design alternative to preceding item (.glusterfs in SSD).
+
+This option includes use of Linux features like dm-cache (Device Mapper
+caching module) to accelerate reads and writes to Gluster "brick"
+filesystems. Early experimentation with firmware-resident SSD caching
+algorithms suggests that this isn't as maintainable and flexible as a
+software-defined implementation, but these too are transparent to
+GlusterFS.
+
+*Use Case*: can provide acceleration for data ingest, as well as for
+cache-friendly read-intensive workloads where the total size of the hot
+data subset fits within SSD.
+
+*Improvement*: For create-intensive workloads, normal writeback caching
+in RAID controllers does provide some of the same benefits at lower
+cost. For very small files, read acceleration can be as much as 10x if
+SSD cache hits are obtained (and if the total size of hot files does NOT
+fit in buffer cache). BTW, This approach can also have as much as a 30x
+improvement in random read and write performance under these conditions.
+This could also provide lower response times for Device Mapper thin-p
+metadata device.
+
+**NOTE**: we have to change our workload generation to use a non-uniform
+file access distribution, preferably with a *moving* mean, to
+acknowledge that in real-world workloads, not all files are equally
+accessed, and that the "hot" files change over time. Without these two
+workload features, we are not going to observe much benefit from cache
+tiering.
+
+*Disadvantage*: This does not help sequential workloads. It does not
+help workloads where Linux buffer cache can provide cache hits. Because
+this caching is done on the server and not the client, there are limits
+imposed by network round trips on response times that limit the
+improvement.
+
+*Costs*: This adds complexity to the already-complex Gluster brick
+configuration process.
+
+##### cluster.lookup-unhashed auto
+
+*Status*: DONE in glusterfs-3.7! [ <http://review.gluster.org/#/c/7702/>
+Jeff Darcy patch ]
+
+*Why*: When safe, don't look up the path on every brick before every
+file create, in order to make small-file creation scalable with brick
+and server count.
+
+**Note**: With JBOD bricks, we are going to hit this scalability wall a
+lot sooner for small-file creates!
+
+*Use case*: small-file creates of any sort with large brick counts
+
+*Improvement*: [
+<https://s3.amazonaws.com/ben.england/small-file-perf-feature-page.pdf>
+graphs ]
+
+*Costs*: Requires monitoring hooks, see below.
+
+*Disadvantage*: if DHT subvolumes are added/removed, how quickly do we
+recover to state where we don't have to do the paranoid thing and lookup
+on every DHT subvolume? As we scale, does DHT subvolume addition/removal
+become a significantly more frequent occurrence?
+
+##### lower RPC calls per file access
+
+Please see
+[Features/composite-operations](../GlusterFS 4.0/composite-operations.md)
+page for details.
+
+*Status*: no proposals exist for this, but NFS compound RPC and SMB ANDX
+are examples, and NSR and DHT for Gluster V4 are necessary for this.
+
+*Why*: reduce round-trip-induced latency between Gluster client and
+servers.
+
+*Use case*: small file creates -- example: [
+<https://bugzilla.redhat.com/show_bug.cgi?id=1086681> bz-1086681 ]
+
+*Improvement*: small-file operations can avoid pessimistic round-trip
+patterns, and small-file creates can potentially avoid round trips
+required because of AFR implementation. For clients with high round-trip
+time to server, this has a dramatic improvement in throughput.
+
+*Costs*: some of these code modifications are very non-trivial.
+
+*Disadvantage*: may not be backward-compatible?
+
+##### object-store API
+
+Some of details are covered in
+[Features/composite-operations](../GlusterFS 4.0/composite-operations.md)
+
+*Status*: Librados in Ceph and Swift in OpenStack are examples. The
+proposal would be to create an API that lets you do equivalent of Swift
+PUT or GET, including opening/creating a file, accessing metadata, and
+transferring data, in a single API call.
+
+*Why*: on creates, allow application to avoid many round trips to server
+to do lookups, create the file, then retrieve the data, then set
+attributes, then close the file. On reads, allow application to get all
+data in a single round trip (like Swift API).
+
+*Use case*: applications which do not have to use POSIX, such as
+OpenStack Swift.
+
+*Improvement*: for clients that have a long network round trip time to
+server, performance improvement could be 5x. Load on the server could be
+greatly reduced due to lower context-switching overhead.
+
+*Disadvantage*: Without preceding reduction in round trips, the changed
+API may not result in much performance gain if any.
+
+##### dentry injection
+
+*Why*: This is not about small files themselves, but applies to
+directories full of many small files. No matter how much we prefetch
+directory entries from the server to the client, directory-listing speed
+will still be limited by context switches from the application to the
+glusterfs client process. One way to ameliorate this would be to
+prefetch entries and *inject* them into FUSE, so that when the
+application asks they'll be available directly from the kernel.
+
+*Status*: Have discussed this with Brian Foster, not aware of subsequent
+attempts/measurements.
+
+*Use case*: All of those applications which insist on listing all files
+in a huge directory, plus users who do so from the command line. We can
+warn people and recommend against this all we like, but "ls" is often
+one of the first things users do on their new file system and it can be
+hard to recover from a bad first impression.
+
+*Improvement*: TBS, probably not much impact until we have optimized
+directory browsing round trips to server as discussed in
+composite-operations.
+
+*Disadvantage*: Some extra effort might be required to deal with
+consistency issues.
+
+### Implications on manageability
+
+lookup-unhashed=auto implies that the system can, by adding/removing DHT
+subvolumes, get itself into a state where it is not safe to do file
+lookup using consistent hashing, until a rebalance has completed. This
+needs to be visible at the management interface so people know why their
+file creates have slowed down and when they will speed up again.
+
+Use of SSDs implies greater complexity and inter-dependency in managing
+the system as a whole (not necessarily Gluster).
+
+### Implications on presentation layer
+
+No change is required for multi-thread epoll, xattr+stat cache,
+lookup-unhashed=off. If Swift uses libgfapi then Object-Store API
+proposal affects it. DHT and NSR changes will impact management of
+Gluster but should be transparent to translators farther up the stack
+perhaps?
+
+### Implications on persistence layer
+
+None
+
+### Implications on 'GlusterFS' backend
+
+Massive changes would be required for DHT and NSR V4 to on-disk format.
+
+### Modification to GlusterFS metadata
+
+lookup-unhashed-auto change would require an additional xattr to track
+cases where it's not safe to trust consistent hashing for a directory?
+
+### Implications on 'glusterd'
+
+DHT+NSR V4 require big changes to glusterd, covered elsewhere.
+
+How To Test
+-----------
+
+Small-file performance testing methods are discussed in [Gluster
+performance test
+page](http://gluster.readthedocs.org/en/latest/Administrator%20Guide/Performance%20Testing/)
+
+User Experience
+---------------
+
+We anticipate that user experience will become far more pleasant as the
+system performance matches the user expectations and the hardware
+capacity. Operations like loading data into Gluster and running
+traditional NFS or SMB apps will be completed in a reasonable amount of
+time without heroic effort from sysadmins.
+
+SSDs are becoming an increasingly important form of storage, possibly
+even replacing traditional spindles for some high-IOPS apps in the 2016
+timeframe. Multi-thread-epoll and xattr+stat caching are a requirement
+for Gluster to utilize more CPUs, and utilize them more efficiently, to
+keep up with SSDs.
+
+Dependencies
+------------
+
+None other than above.
+
+Documentation
+-------------
+
+lookup-unhashed-auto behavior and how to monitor it will have to be
+documented.
+
+Status
+------
+
+Design-ready
+
+Comments and Discussion
+-----------------------
+
+This work can be, and should be, done incrementally. However, if we
+order these investments by ratio of effort to perf improvement, it might
+look like this:
+
+- multi-thread-epoll (done)
+- lookup-unhashed-auto (done)
+- remove io-threads translator (from brick)
+- .glusterfs on SSD (prototyped)
+- cache tiering (in development)
+- glusterfsd stat+xattr cache
+- libgfapi Object-Store API
+- DHT in Gluster V4
+- NSR
+- reduction in RPCs/file-access
diff --git a/done/GlusterFS 3.7/Trash.md b/done/GlusterFS 3.7/Trash.md
new file mode 100644
index 0000000..cc03ccd
--- /dev/null
+++ b/done/GlusterFS 3.7/Trash.md
@@ -0,0 +1,182 @@
+Feature
+-------
+
+Trash translator for GlusterFS
+
+Summary
+-------
+
+This feature will enable user to temporarily store deleted files from
+GlusterFS for a specified time period.
+
+Owners
+------
+
+Anoop C S <achiraya@redhat.com>
+
+Jiffin Tony Thottan <jthottan@redhat.com>
+
+Current status
+--------------
+
+In the present scenario, deletion by a user results in permanent removal
+of a file from the storage pool. An incompatible trash translator is
+currently available as part of the codebase. Moreover, the gluster CLI
+lacks a volume set option to load the trash translator into the volume
+graph.
+
+Detailed Description
+--------------------
+
+Trash is a desired feature for users who accidentally delete some files
+and may need to get them back in the near future. Currently, the
+GlusterFS codebase includes a trash translator which is not compatible
+with the current version and so is not usable. The trash feature is
+planned to be implemented as a separate directory in every single brick
+inside a volume. This would be achieved by a volume set option from
+gluster cli.
+
+A file can only be deleted when all hard links to it have been
+removed. This feature can be extended to operations like truncation
+where we need to retain the original file.
+
+Benefit to GlusterFS
+--------------------
+
+With the implementation of trash, accidental deletion of files can be
+easily avoided.
+
+Scope
+-----
+
+### Nature of proposed change
+
+Proposed implementation mostly involves modifications to existing code
+for trash translator.
+
+### Implications on manageability
+
+Gluster cli will provide an option for creating trash directories on
+various bricks.
+
+### Implications on presentation layer
+
+None
+
+### Implications on persistence layer
+
+None
+
+### Implications on 'GlusterFS' backend
+
+The overall brick structure will include a separate section for trash in
+which regular files will not be stored, i.e., space occupied by the
+trash becomes unusable for regular data.
+
+### Modification to GlusterFS metadata
+
+The original path of files can be stored as an extended attribute.
+
+### Implications on 'glusterd'
+
+An alert can be triggered when trash exceeds a particular size limit.
+Purging of a file from trash depends on its size and age attributes or
+other policies.
+
+### Implications on Rebalancing
+
+Trash can act as an intermediate storage when a file is moved from one
+brick to another during rebalancing of volumes.
+
+### Implications on Self-healing
+
+Self-healing must avoid the chance of re-creating a file which was
+deleted from a brick while one among the other bricks were offline.
+Trash can be used to track the deleted file inside a brick.
+
+### Scope of Recovery
+
+This feature can enhance the restoring of files to previous locations
+through gluster cli with the help of extended attributes residing along
+with the file.
+
+How To Test
+-----------
+
+Functionality of this trash translator can be checked using the file
+operations deletion and truncation, or using gluster internal operations
+like self-heal and rebalance.
+
+Steps :
+
+1.) Create a glusterfs volume.
+
+2.) Start the volume.
+
+3.) Mount the volume.
+
+4.) Check whether the ".trashcan" directory is created on the mount or
+not. By default the trash directory is created when the volume is
+started, but files are not moved to the trash directory on deletion or
+truncation until the trash translator is turned on.
+
+5.) The trash directory name is a user-configurable option and its
+default value is ".trashcan". It can be configured only when the volume
+is started. We cannot remove or rename the trash directory from the
+mount (like the .glusterfs directory).
+
+6.) Set features.trash on
+
+7.) Create some files in the mount and perform deletion or truncation on
+those files. Check whether these files are recreated under the trash
+directory with a timestamp appended to the file name. For example,
+
+        [root@rh-host ~]#mount -t glusterfs rh-host:/test /mnt/test
+        [root@rh-host ~]#mkdir /mnt/test/abc
+        [root@rh-host ~]#touch /mnt/test/abc/file
+        [root@rh-host ~]#rm /mnt/test/abc/file
+        remove regular empty file ‘/mnt/test/abc/file’? y
+        [root@rh-host ~]#ls /mnt/test/abc
+        [root@rh-host ~]# 
+        [root@rh-host ~]#ls /mnt/test/.trashcan/abc/
+        file2014-08-21_123400
+
+8.) Check whether files deleted from the trash directory are permanently
+removed.
+
+9.) Perform internal operations such as rebalance and self-heal on the
+volume. Check whether files are created under the trash directory as a
+result of internal ops [we can also restrict the trash translator to
+internal operations only by setting the option features.trash-internal-op
+on].
+
+10.) Reconfigure the trash directory name and check whether files are
+retained in the new one.
+
+11.) Check whether other options for the trash translator, such as
+eliminate pattern and maximum file size, are working or not.
+
+User Experience
+---------------
+
+Users can access files which were deleted accidentally or intentionally
+and can review the original file which was truncated.
+
+Dependencies
+------------
+
+None
+
+Documentation
+-------------
+
+None
+
+Status
+------
+
+Under review
+
+Comments and Discussion
+-----------------------
+
+Follow here
diff --git a/done/GlusterFS 3.7/Upcall Infrastructure.md b/done/GlusterFS 3.7/Upcall Infrastructure.md
new file mode 100644
index 0000000..47cc8d6
--- /dev/null
+++ b/done/GlusterFS 3.7/Upcall Infrastructure.md
@@ -0,0 +1,747 @@
+Feature
+-------
+
+A server-side framework to track certain state of the files accessed
+and send notifications to the connected clients.
+
+Summary
+-------
+
+A generic and extensible framework, used to maintain state in the
+glusterfsd process for each of the files accessed (including information
+about the clients performing the fops) and send notifications to the
+respective glusterfs clients in case of any change in that state.
+
+A few of the currently identified use-cases of this infrastructure are:
+
+- Inode Update/Invalidation
+- Recall Delegations/lease locks
+- Maintain Share Reservations/Locks states.
+- Refresh attributes in md-cache
+
+One of the initial consumers of this feature is NFS-ganesha.
+
+Owners
+------
+
+Soumya Koduri <skoduri@redhat.com>
+
+Poornima Gurusiddaiah <pgurusid@redhat.com>
+
+Current status
+--------------
+
+- Currently there is no infra available in GlusterFS which can notify
+  clients in case of any change in the file state.
+- There is no support for lease and shared locks.
+
+Drawbacks
+---------
+
+- NFS-Ganesha cannot be run as Multi-Head with Active-Active HA
+  support.
+- NFS-Ganesha cannot support NFSv4 delegations and open share
+  reservations.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+<http://www.gluster.org/community/documentation/index.php/Features/Gluster_CLI_for_ganesha>
+
+<http://www.gluster.org/community/documentation/index.php/Features/HA_for_ganesha>
+
+Detailed Description
+--------------------
+
+There are various scenarios which require server processes to notify
+the clients connected to them of certain events/information (by means of
+callbacks). A few such cases are
+
+Cache Invalidation:
+:   Each of the GlusterFS clients/applications caches certain state of
+    the files (e.g., inode or attributes). In a multi-node environment
+    these caches could lead to data-integrity issues for a period of
+    time, if there are multiple clients accessing the same file
+    simultaneously.
+
+:   To avoid such scenarios, we need the server to notify clients in
+    case of any change in the file state/attributes.
+
+Delegations/Lease-locks:
+:   Currently there is no support for lease locks/delegations in
+    GlusterFS. We need an infra to maintain the lock state on the server
+    side and send notifications to recall those locks in case of any
+    conflicting access by a different client. This can be achieved by
+    using the Upcall infra.
+
+Similar to above use-cases, this framework could easily be extended to
+handle any other event notifications required to be sent by server.
+
+### Design Considerations
+
+Upcall notifications are RPC calls sent from Gluster server process to
+the client.
+
+Note : A new rpc procedure has been added to "GlusterFS Callback"
+program to send notifications. This rpc call support from gluster server
+to client has been prototyped by Poornima Gurusiddaiah(multi-protocol
+team). We have taken that support and enhanced it to suit our
+requirements.
+
+"clients" referred below are GlusterFS clients. GlusterFS server just
+needs to store the details of the clients accessing the file; these
+clients, when notified, can look up the corresponding file entry based
+on the gfid, take action upon it and intimate the application
+accordingly.
+
+A new upcall xlator is defined to maintain all the state required for
+upcall notifications. This xlator is below io-threads xlator
+(considering protocol/server xlator is on top). The reason for choosing
+this xlator to be below io-threads is to be able to spawn new threads to
+send upcall notifications, to detect conflicts or to do the cleanup etc.
+
+At present we store all the state related to the file entries accessed
+by the clients in the inode context. Each of these entries has 'gfid'
+as the key value and a list of client entries accessing that file.
+
+For each file accessed, we create a new entry or update an existing one,
+and append/update the client info accessing that file.
+
+Sample structure of the upcall and client entries are -
+
+ struct _upcall_client_entry_t {
+ struct list_head client_list;
+ char *client_uid; /* unique UID of the client */
+ rpc_transport_t *trans; /* RPC transport object of the client */
+ rpcsvc_t *rpc; /* RPC structure of the client */
+ time_t access_time; /* time last accessed */
+ time_t recall_time; /* time recall_deleg sent */
+ deleg_type deleg; /* Delegation granted to the client */
+ };
+
+ typedef struct _upcall_client_entry_t upcall_client_entry;
+
+ struct _upcall_entry_t {
+ struct list_head list;
+ uuid_t gfid; /* GFID of the file */
+ upcall_client_entry client; /* list of clients */
+ int deleg_cnt /* no. of delegations granted for this file */
+ };
+
+ typedef struct _upcall_entry_t upcall_entry;
+
+As upcall notifications are rpc calls, the Gluster server needs to store
+client rpc details as well in the upcall xlator. These rpc details are
+passed from the protocol/server xlator to the upcall xlator via the
+"client\_t" structure stored as "frame-\>root-\>client".
+
+Below is a brief overview of how each of the above defined use-cases are
+handled.
+
+#### Register for callback notifications
+
+We shall provide APIs in gfapi to register and unregister, for receiving
+specific callback events from the server. At present, we support below
+upcall events.
+
+- Cache Invalidation
+- Recall Lease-Lock
+
+#### Cache Invalidation
+
+: Whenever a client sends a fop, after processing the fop, in callback
+ path, server
+
+- get/add upcall entry based on gfid.
+- lookup/add the client entry to that upcall entry based on
+ client\_t-\>client\_uid, with timestamp updated
+- check if there are other clients which have accessed the same file
+ within cache invalidation time (default 60sec and tunable)
+- if present, send notifications to those clients with the attributes
+ info to be invalidated/refreshed on the client side.
+
+:   For e.g., a WRITE fop would result in changes to the size, atime,
+    ctime and mtime attributes.
+
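+The following Python-style sketch illustrates the bookkeeping described
+above (a per-gfid list of clients with access timestamps). It is
+illustrative pseudocode only; the real upcall xlator implements this in
+C inside the brick process, and the helper names here are invented:
+
+    import time
+
+    CACHE_INVALIDATION_TIMEOUT = 60  # seconds; tunable per the design
+
+    # gfid -> {client_uid: last_access_time}; kept in the inode context in
+    # the real xlator, a plain dict here for illustration.
+    upcall_state = {}
+
+    def on_fop_reply(gfid, client_uid, flags, notify):
+        """Called in the STACK_UNWIND path after a fop has been processed.
+        'notify' sends the CACHE_INVALIDATE upcall RPC to one client."""
+        now = time.time()
+        clients = upcall_state.setdefault(gfid, {})
+
+        # Notify every *other* client that accessed this file recently;
+        # entries older than the timeout are expired (the reaper thread
+        # also does this periodically).
+        for uid, last_access in list(clients.items()):
+            if now - last_access > CACHE_INVALIDATION_TIMEOUT:
+                del clients[uid]          # expired entry
+            elif uid != client_uid:
+                notify(uid, gfid, flags)  # e.g. size/ctime/mtime changed
+
+        # Record/refresh the entry for the client that issued this fop.
+        clients[client_uid] = now
+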
+###### Sequence diagram
+
+
+ ------------- ---------------------- ------------ ----------------------- -------------
+ |NFS-Client(C1)| |NFS-Ganesha server(GC1)| |Brick server| |NFS-ganesha server(GC2)| |NFS-Client(C2)|
+ ------------- ----------------------- ------------ ----------------------- -------------
+ | | | | |
+ | | | | |
+ ' ' ' ' '
+ ' ' ' ' '
+ ' I/O on file1 ' ' ' '
+ '------------------------>' ' ' '
+ ' ' Send fop via rpc request ' ' '
+ ' '-------------------------->' ' '
+ ' ' ' ' '
+ ' ' ' ' '
+ ' ' Make an upcall entry of ' '
+ ' ' 'GC1' for 'file1' in ' '
+ ' ' STACK_UNWIND path ' '
+ ' ' Send fop response ' ' '
+ ' '<------------------------- ' ' '
+ ' Response to I/O ' ' ' '
+ '<------------------------' ' ' '
+ ' ' ' ' Request an I/O on 'file1' '
+ ' ' ' '<------------------------------'
+ ' ' ' ' '
+ ' ' ' ' '
+ ' ' ' Send rpc request ' '
+ ' ' '<------------------------------' '
+ ' ' ' ' '
+ ' ' ' ' '
+ ' ' In STACK_UNWIND CBK path, ' '
+ ' ' add upcall entry 'GC2' for ' '
+ ' ' 'file1' ' '
+ ' ' ' ' '
+ ' ' Send 'CACHE_INVALIDATE' ' ' '
+ ' ' Upcall event ' ' '
+ ' '<--------------------------' ' '
+ ' ' ' Send rpc response ' '
+ ' ' '------------------------------>' '
+ ' ' ' ' Response to I/O '
+ ' ' ' '------------------------------>'
+ ' ' ' ' '
+ ' ' ' ' '
+ ' ' ' ' '
+
+Reaper thread
+:   In case of cache\_invalidation, the upcall states maintained are
+    considered valid only if the corresponding client's last
+    access\_time hasn't exceeded 60sec (the current default).
+: To clean-up the expired state entries, a new reaper thread will be
+ spawned which will crawl through all the upcalls states, detect and
+ cleanup the expired entries.
+
+#### delegations/lease-locks
+
+A file lease provides a mechanism whereby the process holding the lease
+(the "lease holder") is notified when a process (the "lease breaker")
+tries to perform a fop with conflicting access on the same file.
+
+NFS Delegation (similar to lease\_locks) is a technique by which the
+server delegates the management of a file to a client which guarantees
+that no client can open the file in a conflicting mode.
+
+The advantage of these locks is that they greatly reduce the
+interactions between the server and the client for delegated files.
+
+This feature also provides support for granting and processing these
+lease-locks/delegations on files.
+
+##### API to request lease
+
+: A new API has been introduced in "gfapi" for the applications to
+ request or unlock the lease-locks.
+
+: This API will be an extension to the existing API "glfs\_posix\_lock
+ (int fd, int cmd, struct flock fl)" which is used to request for
+ posix locks, with below extra parameters -
+
+- lktype (byte-range or lease-lock or share-reservation)
+- lkowner (to differentiate between different application clients)
+
+: On receiving lease-lock request, the GlusterFS client uses existing
+ rpc program "GFS3\_OP\_LK" to send lock request to the brick process
+ but with lkflags denoting lease-lock set in the xdata of the
+ request.
+
+##### Lease-lock processing on the server-side
+
+Add Lease
+:   On receiving the lock request, the server (in the upcall xlator)
+    first checks the lkflags to determine if it is a lease-lock request.
+    Once it identifies it as such, and provided there are no
+    lease-conflicts for that file, it
+
+- fetches the inode\_ctx for that inode entry
+- lookup/add the client entry to that upcall entry based on
+ client\_t-\>client\_uid, with timestamp updated
+- checks whether there are any existing open-fds with conflicting
+ access requests on that file. If yes bail out and do not grant the
+ lease.
+- In addition, the server now also needs to keep track of and verify
+  that there aren't any non-fd related fops (like SETATTR) being
+  processed in parallel before granting the lease. This is done by
+  either
+
+<!-- -->
+
+  * not granting a lease irrespective of which client requested those fops, or
+  * providing a mechanism for the applications to set a client-id while doing each fop. The server can then match the client-ids before deciding to grant the lease.
+
+- Update the lease info in the client entry and mark it as lease
+ granted.
+- If there is already a lease-lock granted to the same client for the
+  same fd, this request will be considered a duplicate and success is
+  returned to the client.
+
+Remove Lease
+:   Similar to the "Add Lease" case above, on receiving an UNLOCK
+    request for a lease-lock, the server
+
+- fetches the inode\_ctx
+- lookup/add the client entry to that upcall entry based on
+ client\_t-\>client\_uid, with timestamp updated
+- remove the lease granted to that client from that list.
+- Even if the lease is not found, the server will return success (as
+  done for POSIX locks).
+- After removing the lease, the server starts processing the fops from
+ the blocked queue if there are any.
+
+Lease-conflict check/Recalling lease-lock
+:   For each fop issued by a client, the server now first needs to
+    check if it conflicts with any existing lease-lock taken on that
+    file. For that it first
+
+- fetches its inode\_ctx
+- verify if there are lease-locks granted with conflicting access to
+ any other client for that file.
+
+(Note: in case of the same client, the assumption is that the
+application will handle all the conflict checks between its clients and
+block them if necessary. However, in future we plan to provide a
+framework/API for applications to set their client-id, like lkowner in
+case of locks, before sending any fops, for the server to identify and
+differentiate them.)
+
+- if yes, send upcall notifications to recall the lease-lock and
+ either
+
+`   * send EDELAY error if the fop is 'NON-BLOCKING'. Else`\
+`   * add the fop to the blocking queue`
+
+- Trigger a timer event to notify if the recall doesn't happen within
+ certain configured time.
+
+Purge Lease
+
+- If the client doesn't unlock the lease within the recall timeout
+  period, the timer thread will trigger an event to purge that lease
+  forcefully.
+- Post that, fops (if any) in the blocked queue are processed.
+
+##### Sequence Diagram
+
+ ------------- ---------------------- ------------ ----------------------- -------------
+ |NFS-Client(C1)| |NFS-Ganesha server(GC1)| |Brick server| |NFS-ganesha server(GC2)| |NFS-Client(C2)|
+ ------------- ----------------------- ------------ ----------------------- -------------
+ | | | | |
+ ' Open on file1 ' ' ' '
+ '------------------------>' ' ' '
+ ' ' Send OPEN on 'file1' ' ' '
+ ' '-------------------------->' ' '
+ ' ' ' ' '
+ ' ' OPEN response ' ' '
+ ' '<--------------------------' ' '
+ ' ' ' ' '
+ ' ' LOCK on 'file1' with ' ' '
+ ' ' LEASE_LOCK type ' ' '
+ ' '-------------------------->' ' '
+ ' ' ' ' '
+ ' ' Take a lease_lock for ' '
+ ' ' entire file range. ' '
+ ' ' If it suceeds, add an upcall ' '
+ ' ' lease entry 'GC1' for 'file1' ' '
+ ' ' Send Success ' ' '
+ ' '<------------------------- ' ' '
+ ' Response to OPEN ' ' ' '
+ '<------------------------' ' ' '
+ ' ' ' ' Conflicting I/O on 'file1' '
+ ' ' ' '<------------------------------'
+ ' ' ' Send rpc request ' '
+ ' ' '<------------------------------' '
+ ' ' Send Upcall event ' ' '
+ ' ' 'RECALL_LEASE' ' ' '
+ ' '<--------------------------' ' '
+ ' RECALL_DELEGATION ' (a)Now either block I/O ' '
+ '<------------------------' or ' '
+ ' ' (b) ' Send EDELAY/ERETRY ' '
+ ' ' '------------------------------>' '
+ ' ' ' ' (b)SEND EDELAY/ERETRY '
+ ' Send I/O to flush data ' ' '------------------------------>'
+ '------------------------>' ' ' '
+ ' ' RPC reqeust for all fops' ' '
+ ' '-------------------------->' ' '
+ ' ' ' ' '
+ ' ' Send rpc response ' ' '
+ ' '<--------------------------' ' '
+ ' Send success ' ' ' '
+ '<------------------------' ' ' '
+ ' ' ' ' '
+ ' Return DELEGATION ' ' ' '
+ '------------------------>' ' ' '
+ ' ' UNLOCK request with type ' ' '
+ ' ' LEASE_LOCK ' ' '
+ ' '-------------------------->' ' '
+ ' ' ' ' '
+ ' ' Unlock the lease_lk. ' '
+ ' ' (a) Unblock the fop ' '
+ ' ' Send Success ' ' '
+ ' '<--------------------------'(a) Send response to I/O ' '
+ ' ' '------------------------------>' '
+ ' Return Success ' ' ' (a) SEND RESPONSE '
+ '<------------------------' ' '------------------------------>'
+ ' ' ' ' '
+ ' ' ' ' '
+ ' ' ' ' '
+ ' ' ' ' '
+
+
+#### Upcall notifications processing on the client side
+
+: The structure of the upcall data sent by the server is noted in the
+ "Documentation" section.
+:   On receiving the upcall notifications, the protocol/client xlator
+    detects that it is a callback event, decodes the upcall data sent
+    ('gfs3\_upcall\_req' noted in the Documentation section) and passes
+    the same to the parent translators.
+:   On receiving these notify calls from protocol/client, parent
+    translators (planning to use this infra) have to first process the
+    event\_type of the upcall data received and take action accordingly.
+: Currently as this infra is used by only nfs-ganesha, these notify
+ calls are directly sent to gfapi from protocol/client xlator.
+: For each of such events received, gfapi creates an entry and queues
+ it to the list of upcall events received.
+
+: Sample entry structure -
+
+<!-- -->
+
+ struct _upcall_events_list {
+ struct list_head upcall_entries;
+ uuid_t gfid;
+ upcall_event_type event_type;
+ uint32_t flags;
+ };
+ typedef struct _upcall_events_list upcall_events_list;
+
+:   Now the application can either choose to regularly poll for such
+    upcall events, or gfapi can notify the application via a signal or
+    a cond-variable.
+
+### Extentions
+
+: This framework could easily be extended to send any other event
+ notifications to the client process.
+:   A new event has to be added to the list of upcall event types
+    (mentioned in the Documentation section) and any extra data which
+    needs to be sent has to be added to the gfs3\_upcall\_req structure.
+: On the client side, the translator (interested) should check for the
+ event type and the data passed to take action accordingly.
+: FUSE can also make use of this feature to support lease-locks
+: A new performance xlator can be added to take lease-locks and cache
+ I/O.
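+
+As an illustration only (not an actual planned event), a hypothetical new
+event type and its extra payload would be added alongside the existing
+definitions from the Documentation section like this:
+
+    enum upcall_event_type_t {
+        CACHE_INVALIDATION,
+        RECALL_READ_DELEG,
+        RECALL_READ_WRITE_DELEG,
+        QUOTA_LIMIT_REACHED        /* hypothetical new event */
+    };
+
+    struct gfs3_upcall_req {
+        char   gfid[16];
+        u_int  event_type;
+        u_int  flags;
+        u_int  quota_usage_pct;    /* hypothetical extra data for the new event */
+    };
+
+A client-side translator interested in such an event would then match on
+the new event type in its notify handler and read the extra field.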
+
+### Limitations
+
+Rebalancing
+:   At present, lock states are not migrated after a rebalance.
+    Similarly, the state maintained by this new xlator will not be
+    migrated either.
+:   However, after migrating the file, since DHT deletes the file on
+    the source brick, in case of
+
+-   cache-invalidation, we may falsely notify the client that the file
+    is deleted. (Note: to avoid this at present, we do not send any
+    "Destroy Flag".)
+-   delegations/lease-locks being present, the 'delete' will be blocked
+    till that delegation is recalled. This way, the clients holding
+    those locks can flush their data, which will now be redirected to
+    the new brick.
+
+Self-Heal
+:   If a brick process goes down, the replica brick (which maintains the
+    same state) will take over processing of all the fops.
+:   But if the first brick process later comes back up, the
+    upcall/lease-lock states are not healed onto that process.
+
+Network Partitions
+:   In case of network partitions between the glusterfsd brick process
+    and the glusterfs client process, the upcall/lease-lock state
+    maintained by this new xlator will be lost, just like the lock
+    states.
+:   However, if a replica brick is present, clients will get
+    re-directed to that process (which still has the states maintained).
+    That brick process will take care of checking the conflicts and
+    sending notifications.
+:   Maybe the client could try reconnecting with the same client\_uid
+    and replay the locks. But if any of those operations fail, gfapi
+    will return 'EBADFD' to the applications. This enhancement will be
+    considered for the future.
+
+Directory leases are not yet supported.
+: This feature at present mainly targets file-level
+ delegations/leases.
+
+Lease Upgrade
+: Read-to-write lease upgrade is not supported currently.
+
+Heuristics
+: Have to maintain heuristics in Gluster as well to determine when to
+ grant the lease/delegations.
+
+Benefit to GlusterFS
+--------------------
+
+This feature is definitely needed to support NFS-Ganesha Multi-head and
+Active-Active HA deployments.
+
+Along with that, this infra may potentially be used for
+
+-   multi-protocol access
+-   small-file performance improvements
+-   pNFS support
+
+Scope
+-----
+
+### Nature of proposed change
+
+- A new xlator 'Upcalls' will be introduced in the server-side stack
+ to maintain the states and send callbacks.
+
+-   This xlator will be ON only when the Ganesha feature is enabled.
+
+- "client\_t" structure is modified to contain rpc connection details.
+
+- New procedures have been added to "Glusterfs Callback" rpc program
+ for each of the notify events.
+
+-   Support will be added on the gfapi side to handle these upcalls and
+    inform the applications accordingly.
+
+-   md-cache will probably also add support to handle these upcalls.
+
+### Implications on manageability
+
+A new xlator 'Upcalls' is added to the server vol file.
+
+### Implications on presentation layer
+
+Applications planning to use Upcall Infra have to invoke new APIs
+provided by gfapi to receive these notifications.
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+None
+
+### Modification to GlusterFS metadata
+
+None
+
+### Implications on 'glusterd'
+
+This infra is currently supported only when the new CLI option
+introduced to enable Ganesha is ON. This may need to be revisited if
+there are other consumers of this feature.
+
+How To Test
+-----------
+
+- Bring up Multi-head Ganesha servers.
+-   Test if the I/Os performed using one head are reflected on the
+    other server.
+- Test if delegations are granted and successfully recalled when
+ needed.
+
+User Experience
+---------------
+
+-   This infra will be controlled by a tunable (currently the
+    'nfs-ganesha.enable' option, as it is the only consumer). If the
+    option is off, fops will just pass through without any additional
+    processing.
+-   But if it is ON, the consumers of this infra may see some
+    performance hit due to the additional state maintained and
+    processed, and the extra RPCs sent over the wire for notifications.
+
+Dependencies
+------------
+
+Gluster CLI to enable ganesha
+: It depends on the new [Gluster CLI
+ option](http://www.gluster.org/community/documentation/index.php/Features/Gluster_CLI_for_ganesha)
+ which is to be added to enable Ganesha.
+
+Wireshark
+: In addition, the new RPC procedure introduced to send callbacks has
+ to be added to the list of Gluster RPC Procedures supported by
+ [Wireshark](https://forge.gluster.org/wireshark/pages/Todo).
+
+Rebalance/Self-Heal/Tiering
+:   The upcall state maintained is analogous to the lock state. Hence,
+
+-   During rebalance or tiering of files, along with the lock state,
+    the state maintained by this xlator also needs to be migrated to
+    the new subvolume.
+
+-   When there is self-heal support for the lock state, this xlator's
+    state also needs to be considered.
+
+Filter-out duplicate notifications
+:   In case of replica bricks maintained by AFR/EC, the upcall state is
+    maintained and processed on all the replica bricks. This results in
+    duplicate notifications being sent by all those bricks in case of
+    non-idempotent fops. Also, in case of distributed volumes, cache
+    invalidation notifications on a directory entry will be sent by all
+    the bricks that are part of that volume. Hence we need support to
+    filter out such duplicate callback notifications.
+
+:   The approach we shall take to address it (sketched after this list)
+    is:
+
+-   Add a new xlator on the client side to track all the fops, maybe
+    creating a unique transaction id and sending it to the server.
+-   The server needs to store this transaction id in the client info as
+    part of the upcall state.
+-   While sending any notifications, add this transaction id to the
+    request as well.
+-   The client (the new xlator) has to filter out duplicate requests
+    based on the transaction ids received.
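+
+A minimal sketch of the client-side filtering step, assuming a simple
+monotonically increasing transaction id per fop (the actual id scheme,
+per-client generation, wrap-around handling and so on are still to be
+designed):
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    /* Per-gfid record of the last transaction id already delivered. */
+    struct upcall_dedup_ctx {
+        uint64_t last_txn_id;
+    };
+
+    /* Returns true if this notification is a duplicate (already delivered
+     * via another replica brick) and should be dropped. */
+    static bool
+    upcall_is_duplicate (struct upcall_dedup_ctx *ctx, uint64_t txn_id)
+    {
+        if (txn_id <= ctx->last_txn_id)
+            return true;             /* already processed - drop */
+
+        ctx->last_txn_id = txn_id;   /* remember the newest id delivered */
+        return false;
+    }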
+
+Special fops
+:   During rebalance/self-heal, though it is not the client application
+    that is performing the fops, the brick process may still send
+    notifications. To avoid that, we need a registration mechanism so
+    that only clients which register receive upcall notifications.
+
+Cleanup during network disconnect - protocol/server
+:   At present, in case of a network disconnect between the
+    glusterfs server and the client, protocol/server looks up the fd
+    table associated with that client and sends a 'flush' op for each of
+    those fds to clean up the locks associated with them.
+
+:   We need similar support to flush the lease-locks taken. Hence, while
+    granting the lease-lock, we plan to associate that upcall\_entry
+    with the corresponding fd\_ctx or inode\_ctx so that it can be
+    easily tracked if it needs to be cleaned up. This will also help in
+    faster lookup of the upcall entry while processing fops that use
+    the same fd/inode.
+
+Note: The above cleanup is done only for the upcall state associated
+with lease-locks. For the other entries maintained (e.g., for
+cache-invalidation), the reaper thread will clean up those stale
+entries once they expire (i.e., access\_time \> 1 min).
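+
+A sketch of the expiry check such a reaper thread could apply to
+cache-invalidation entries (the 1-minute threshold comes from the note
+above; the entry layout is illustrative):
+
+    #include <stdbool.h>
+    #include <time.h>
+
+    #define UPCALL_ENTRY_EXPIRY_SECS 60   /* access_time older than 1 min */
+
+    struct upcall_client_entry {
+        time_t access_time;   /* last time this client accessed the inode */
+    };
+
+    static bool
+    upcall_entry_expired (const struct upcall_client_entry *entry, time_t now)
+    {
+        return (now - entry->access_time) > UPCALL_ENTRY_EXPIRY_SECS;
+    }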
+
+Replay the lease-locks taken
+: At present, replay of locks by the client xlator seems to have been
+ disabled.
+:   But when it is re-enabled, we need to add support to replay the
+    lease-locks taken as well.
+
+Documentation
+-------------
+
+Sample upcall request structure sent to the clients-
+
+ struct gfs3_upcall_req {
+ char gfid[16];
+ u_int event_type;
+ u_int flags;
+ };
+ typedef struct gfs3_upcall_req gfs3_upcall_req;
+
+ enum upcall_event_type_t {
+ CACHE_INVALIDATION,
+ RECALL_READ_DELEG,
+ RECALL_READ_WRITE_DELEG
+ };
+ typedef enum upcall_event_type_t upcall_event_type;
+
+ flags to be sent for inode update/invalidation-
+ #define UP_NLINK 0x00000001 /* update nlink */
+ #define UP_MODE 0x00000002 /* update mode and ctime */
+ #define UP_OWN 0x00000004 /* update mode,uid,gid and ctime */
+ #define UP_SIZE 0x00000008 /* update fsize */
+ #define UP_TIMES 0x00000010 /* update all times */
+ #define UP_ATIME 0x00000020 /* update atime only */
+ #define UP_PERM 0x00000040 /* update fields needed for
+ permission checking */
+ #define UP_RENAME 0x00000080 /* this is a rename op -
+ delete the cache entry */
+ #define UP_FORGET 0x00000100 /* inode_forget on server side -
+ invalidate the cache entry */
+ #define UP_PARENT_TIMES 0x00000200 /* update parent dir times */
+    #define UP_XATTR_FLAGS     0x00000400 /* update xattr */
+
+    /* for fops - open, read, lk, which do not trigger upcall notifications
+     * but need to update the client info in the upcall state */
+ #define UP_UPDATE_CLIENT (UP_ATIME)
+
+ /* for fop - write, truncate */
+ #define UP_WRITE_FLAGS (UP_SIZE | UP_TIMES)
+
+ /* for fop - setattr */
+ #define UP_ATTR_FLAGS (UP_SIZE | UP_TIMES | UP_OWN | \
+ UP_MODE | UP_PERM)
+ /* for fop - rename */
+ #define UP_RENAME_FLAGS (UP_RENAME)
+
+ /* to invalidate parent directory entries for fops -rename, unlink,
+ * rmdir, mkdir, create */
+ #define UP_PARENT_DENTRY_FLAGS (UP_PARENT_TIMES)
+
+ /* for fop - unlink, link, rmdir, mkdir */
+ #define UP_NLINK_FLAGS (UP_NLINK | UP_TIMES)
+
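+A minimal sketch of how a client-side consumer (for example md-cache or
+gfapi) might act on the flags defined above when it receives a
+cache-invalidation upcall; the handler and cache helper names are
+illustrative, not an existing API:
+
+    /* hypothetical cache helpers */
+    extern void cache_purge_entry (const char *gfid);
+    extern void cache_refresh_size (const char *gfid);
+    extern void cache_refresh_times (const char *gfid);
+    extern void cache_refresh_attrs (const char *gfid);
+
+    static void
+    handle_cache_invalidation (struct gfs3_upcall_req *up)
+    {
+        if (up->flags & (UP_RENAME | UP_FORGET)) {
+            /* the cached entry is no longer valid - purge it */
+            cache_purge_entry (up->gfid);
+            return;
+        }
+
+        if (up->flags & UP_SIZE)
+            cache_refresh_size (up->gfid);
+
+        if (up->flags & (UP_TIMES | UP_ATIME))
+            cache_refresh_times (up->gfid);
+
+        if (up->flags & (UP_MODE | UP_OWN | UP_PERM))
+            cache_refresh_attrs (up->gfid);
+    }
+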
+The fops currently identified as triggering inode update/invalidate
+notifications, and the flags sent with them, are:
+
+ fop - flags to be sent - UPDATE/ - Entries
+ INVALIDATION affected
+ ----------------------------------------------------------------------------
+ writev - UP_WRITE_FLAGS - INODE_UPDATE - file
+ truncate - UP_WRITE_FLAGS - INODE_UPDATE - file
+ lk/lock - UP_UPDATE_CLIENT - INODE_UPDATE - file
+ setattr - UP_ATTR_FLAGS - INODE_UPDATE/INVALIDATE - file
+ rename - UP_RENAME_FLAGS, UP_PARENT_DENTRY_FLAGS - INODE_INVALIDATE - both file and parent dir
+ unlink - UP_NLINK_FLAGS, UP_PARENT_DENTRY_FLAGS - INODE_INVALIDATE - file & parent_dir
+ rmdir - UP_NLINK_FLAGS, UP_PARENT_DENTRY_FLAGS - INODE_INVALIDATE - file & parent_dir
+ link - UP_NLINK_FLAGS, UP_PARENT_DENTRY_FLAGS - INODE_UPDATE - file & parent_dir
+ create - UP_TIMES, UP_PARENT_DENTRY_FLAGS - INODE_UPDATE - parent_dir
+ mkdir - UP_TIMES, UP_PARENT_DENTRY_FLAGS - INODE_UPDATE - parent_dir
+ setxattr - UP_XATTR_FLAGS - INODE_UPDATE - file
+ removexattr - UP_UPDATE_CLIENT - INODE_UPDATE - file
+ mknod - UP_TIMES, UP_PARENT_DENTRY_FLAGS - INODE_UPDATE - parent_dir
+ symlink - UP_TIMES, UP_PARENT_DENTRY_FLAGS - INODE_UPDATE - file
+
+The fops which result in a delegation/lease-lock recall are listed
+below; a sketch of the recall decision follows the list.
+
+ open
+ read
+ write
+ truncate
+ setattr
+ lock
+ link
+ remove
+ rename
+
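+A minimal sketch of the server-side decision the upcall xlator could make
+when one of the fops listed above arrives on a leased file: recall the
+lease only if the fop comes from a client other than the lease holder
+(types and field names are illustrative):
+
+    #include <stdbool.h>
+    #include <string.h>
+
+    struct lease_state {
+        bool granted;                /* is a delegation/lease currently held? */
+        char holder_client_uid[64];  /* client holding the lease */
+    };
+
+    /* Returns true if a RECALL_LEASE upcall must be sent (and the fop
+     * blocked or failed with EDELAY/ERETRY) before proceeding. */
+    static bool
+    lease_needs_recall (const struct lease_state *lease, const char *client_uid)
+    {
+        if (!lease->granted)
+            return false;    /* nothing to recall */
+
+        /* fops from the lease holder itself do not conflict */
+        return strcmp (lease->holder_client_uid, client_uid) != 0;
+    }
+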
+Comments and Discussion
+-----------------------
+
+### TODO
+
+- Lease-locks implementation is currently work in progress [BZ
+ 1200268](https://bugzilla.redhat.com/show_bug.cgi?id=1200268)
+- Clean up expired client entries (in case of cache-invalidation).
+ Refer to the section 'Cache Invalidation' [BZ
+ 1200267](https://bugzilla.redhat.com/show_bug.cgi?id=1200267)
+-   At present, for cache-invalidation, callback notifications are sent
+    in the fop path. Instead, to avoid brick latency, have a mechanism
+    to send them asynchronously. [BZ
+    1200264](https://bugzilla.redhat.com/show_bug.cgi?id=1200264)
+-   Filter out duplicate callback notifications [BZ
+    1200266](https://bugzilla.redhat.com/show_bug.cgi?id=1200266)
+- Support for Directory leases.
+-   Support for read-to-write lease upgrade.
+- Have to maintain heuristics in Gluster as well to determine when to
+ grant the lease/delegations.
diff --git a/done/GlusterFS 3.7/arbiter.md b/done/GlusterFS 3.7/arbiter.md
new file mode 100644
index 0000000..797f005
--- /dev/null
+++ b/done/GlusterFS 3.7/arbiter.md
@@ -0,0 +1,100 @@
+Feature
+-------
+
+This feature provides a way of preventing split-brains in replica 3 gluster volumes both in time and space.
+
+Summary
+-------
+
+Please see <http://review.gluster.org/#/c/9656/> for the design discussions
+
+Owners
+------
+
+Pranith Kumar Karampuri
+Ravishankar N
+
+Current status
+--------------
+
+Feature complete.
+
+Code patches: <http://review.gluster.org/#/c/10257/> and
+<http://review.gluster.org/#/c/10258/>
+
+Detailed Description
+--------------------
+Arbiter volumes are replica 3 volumes where the 3rd brick of the replica is
+automatically configured as an arbiter node. This means that the 3rd brick
+stores only the file name and metadata, but does not contain any data.
+This configuration is helpful in avoiding split-brains while providing the same
+level of consistency as a normal replica 3 volume.
+
+Benefit to GlusterFS
+--------------------
+
+It prevents split-brains in replica 3 volumes and consumes less space than a normal replica 3 volume.
+
+Scope
+-----
+
+### Nature of proposed change
+
+### Implications on manageability
+
+None
+
+### Implications on presentation layer
+
+None
+
+### Implications on persistence layer
+
+None
+
+### Implications on 'GlusterFS' backend
+
+None
+
+### Modification to GlusterFS metadata
+
+None
+
+### Implications on 'glusterd'
+
+None
+
+How To Test
+-----------
+
+If we bring down bricks and perform writes in such a way that the
+arbiter brick is the only source online, reads/writes will fail with
+ENOTCONN. See 'tests/basic/afr/arbiter.t' in the glusterfs tree for
+examples.
+
+User Experience
+---------------
+
+Similar to a normal replica 3 volume. The only change is the syntax in
+volume creation. See
+<https://github.com/gluster/glusterfs-specs/blob/master/Features/afr-arbiter-volumes.md>
+
+Dependencies
+------------
+
+None
+
+Documentation
+-------------
+
+---
+
+Status
+------
+
+Feature completed. See 'Current status' section for the patches.
+
+Comments and Discussion
+-----------------------
+Some optimizations are under way.
+---
diff --git a/done/GlusterFS 3.7/index.md b/done/GlusterFS 3.7/index.md
new file mode 100644
index 0000000..99381cf
--- /dev/null
+++ b/done/GlusterFS 3.7/index.md
@@ -0,0 +1,90 @@
+GlusterFS 3.7 Release Planning
+------------------------------
+
+Tentative Dates:
+
+9th Mar, 2015 - Feature freeze & Branching
+
+28th Apr, 2015 - 3.7.0 Beta Release
+
+14th May, 2015 - 3.7.0 GA
+
+Features in GlusterFS 3.7
+-------------------------
+
+* [Features/Smallfile Performance](./Small File Performance.md):
+Small-file performance enhancement - multi-threaded epoll implemented.
+
+* [Features/Data-classification](./Data Classification.md):
+Tiering, rack-aware placement, and more. - Policy based tiering
+implemented
+
+* [Features/Trash](./Trash.md): Trash translator for
+GlusterFS
+
+* [Features/Object Count](./Object Count.md)
+
+* [Features/SELinux Integration](./SE Linux Integration.md)
+
+* [Features/Exports Netgroups Authentication](./Exports and Netgroups Authentication.md)
+
+* [Features/Policy based Split-brain resolution](./Policy based Split-brain Resolution.md): Policy Based
+Split-brain resolution
+
+* [Features/BitRot](./BitRot.md)
+
+* [Features/Gnotify](./Gnotify.md)
+
+* [Features/Improve Rebalance Performance](./Improve Rebalance Performance.md)
+
+* [Features/Upcall-infrastructure](./Upcall Infrastructure.md):
+Support for delegations/lease-locks, inode-invalidation, etc..
+
+* [Features/Gluster CLI for ganesha](./Gluster CLI for NFS Ganesha.md): Gluster CLI
+support to manage nfs-ganesha exports
+
+* [Features/Scheduling of Snapshot](./Scheduling of Snapshot.md): Schedule creation
+of gluster volume snapshots from command line, using cron.
+
+* [Features/sharding-xlator](./Sharding xlator.md)
+
+* [Features/HA for ganesha](./HA for Ganesha.md): HA
+support for NFS-Ganesha
+
+* [Features/Clone of Snapshot](./Clone of Snapshot.md)
+
+Other big changes
+-----------------
+
+* **GlusterD Daemon code refactoring**: GlusterD
+manages a lot of other daemons (bricks, NFS server, SHD, rebalance
+etc.), and there are several more on the way. This refactoring will
+introduce a common framework to manage all these daemons, which will
+make maintenance easier.
+
+* **RCU in GlusterD**: GlusterD has issues
+with thread synchronization and data access. This has been discussed at
+<http://www.gluster.org/pipermail/gluster-devel/2014-December/043382.html>.
+We will be using the RCU method to solve these issues and will be
+using [Userspace-RCU](http://urcu.so) to help with the implementation.
+
+Features beyond GlusterFS 3.7
+-----------------------------
+
+* [Features/Easy addition of custom translators](./Easy addition of Custom Translators.md)
+
+* [Features/outcast](./Outcast.md): Outcast
+
+* [Features/rest-api](./rest-api.md): REST API for
+Gluster Volume and Peer Management
+
+* [Features/Archipelago Integration](./Archipelago Integration.md):
+Improvements for integration with Archipelago
+
+Release Criterion
+-----------------
+
+- All new features to be documented in admin guide
+
+- Regression tests added
\ No newline at end of file
diff --git a/done/GlusterFS 3.7/rest-api.md b/done/GlusterFS 3.7/rest-api.md
new file mode 100644
index 0000000..e967d28
--- /dev/null
+++ b/done/GlusterFS 3.7/rest-api.md
@@ -0,0 +1,152 @@
+Feature
+-------
+
+REST API for GlusterFS
+
+Summary
+-------
+
+Provides REST API for Gluster Volume and Peer Management.
+
+Owners
+------
+
+Aravinda VK <mail@aravindavk.in> (http://aravindavk.in)
+
+Current status
+--------------
+
+REST API is not available in GlusterFS.
+
+Detailed Description
+--------------------
+
+The GlusterFS REST service can be started by running the following
+command on any node.
+
+ sudo glusterrest -p 8080
+
+Features:
+
+-   No separate server required; the command can be run on any one node.
+-   Provides Basic authentication (user groups can be added)
+-   Any REST client can be used.
+-   JSON output
+
+Benefit to GlusterFS
+--------------------
+
+Provides REST API for GlusterFS cluster.
+
+Scope
+-----
+
+### Nature of proposed change
+
+New code.
+
+### Implications on manageability
+
+### Implications on presentation layer
+
+### Implications on persistence layer
+
+### Implications on 'GlusterFS' backend
+
+### Modification to GlusterFS metadata
+
+### Implications on 'glusterd'
+
+How To Test
+-----------
+
+User Experience
+---------------
+
+Dependencies
+------------
+
+Documentation
+-------------
+
+### Usage:
+
+A new CLI command, \`glusterrest\`, will be available:
+
+ usage: glusterrest [-h] [-p PORT] [--users USERS]
+ [--no-password-hash]
+
+ optional arguments:
+ -h, --help show this help message and exit
+ -p PORT, --port PORT PORT Number
+ -u USERS, --users USERS
+ Users JSON file
+ --no-password-hash No Password Hash
+
+
+The following command will start the REST server on the given port
+(8080 in this example).
+
+ sudo glusterrest -p 8080 --users /root/secured_dir/gluster_users.json
+
+Format of the users JSON file (a list of username/password pairs):
+
+ [
+ {"username": "aravindavk", "password": "5ebe2294ecd0e0f08eab7690d2a6ee69"}
+ ]
+
+The password is an MD5 hash; if no hashing is required, use
+--no-password-hash when running the glusterrest command.
+
+### API Documentation
+
+Getting list of peers
+
+ GET /api/1/peers
+
+Peer Probe
+
+ CREATE /api/1/peers/:hostname
+
+Peer Detach
+
+ DELETE /api/1/peers/:hostname
+
+Creating Gluster volumes
+
+ CREATE /api/1/volumes/:name
+ CREATE /api/1/volumes/:name/force
+
+Deleting Gluster Volume
+
+ DELETE /api/1/volumes/:name
+ DELETE /api/1/volumes/:name/stop
+
+Gluster volume actions
+
+ POST /api/1/volumes/:name/start
+ POST /api/1/volumes/:name/stop
+ POST /api/1/volumes/:name/start-force
+ POST /api/1/volumes/:name/stop-force
+ POST /api/1/volumes/:name/restart
+
+Gluster Volume modifications
+
+ PUT /api/1/volumes/:name/add-brick
+ PUT /api/1/volumes/:name/remove-brick
+ PUT /api/1/volumes/:name/set
+ PUT /api/1/volumes/:name/reset
+
+Getting volume information
+
+ GET /api/1/volumes
+ GET /api/1/volumes/:name
+
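+A minimal usage sketch using libcurl (any REST client works; host, port
+and credentials below are placeholders) to list volumes with Basic
+authentication:
+
+    #include <stdio.h>
+    #include <curl/curl.h>
+
+    int
+    main (void)
+    {
+        CURL     *curl;
+        CURLcode  res;
+
+        curl_global_init (CURL_GLOBAL_DEFAULT);
+        curl = curl_easy_init ();
+        if (!curl)
+            return 1;
+
+        /* GET /api/1/volumes - returns volume information as JSON */
+        curl_easy_setopt (curl, CURLOPT_URL,
+                          "http://localhost:8080/api/1/volumes");
+        curl_easy_setopt (curl, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
+        curl_easy_setopt (curl, CURLOPT_USERPWD, "username:password");
+
+        res = curl_easy_perform (curl);   /* response body goes to stdout */
+        if (res != CURLE_OK)
+            fprintf (stderr, "request failed: %s\n", curl_easy_strerror (res));
+
+        curl_easy_cleanup (curl);
+        curl_global_cleanup ();
+        return (res == CURLE_OK) ? 0 : 1;
+    }
+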
+Status
+------
+
+Coding is 50% complete; documentation writing has started.
+
+Comments and Discussion
+-----------------------