From 9e9e3c5620882d2f769694996ff4d7e0cf36cc2b Mon Sep 17 00:00:00 2001 From: raghavendra talur Date: Thu, 20 Aug 2015 15:09:31 +0530 Subject: Create basic directory structure All new features specs go into in_progress directory. Once signed off, it should be moved to done directory. For now, This change moves all the Gluster 4.0 feature specs to in_progress. All other specs are under done/release-version. More cleanup required will be done incrementally. Change-Id: Id272d301ba8c434cbf7a9a966ceba05fe63b230d BUG: 1206539 Signed-off-by: Raghavendra Talur Reviewed-on: http://review.gluster.org/11969 Reviewed-by: Humble Devassy Chirammal Reviewed-by: Prashanth Pai Tested-by: Humble Devassy Chirammal --- done/Features/README.md | 42 ++ done/Features/afr-arbiter-volumes.md | 56 ++ done/Features/afr-statistics.md | 142 ++++ done/Features/afr-v1.md | 340 ++++++++++ done/Features/bitrot-docs.md | 7 + done/Features/brick-failure-detection.md | 67 ++ done/Features/dht.md | 223 ++++++ done/Features/distributed-geo-rep.md | 71 ++ done/Features/file-snapshot.md | 91 +++ done/Features/gfid-access.md | 73 ++ done/Features/glusterfs_nfs-ganesha_integration.md | 123 ++++ .../heal-info-and-split-brain-resolution.md | 448 ++++++++++++ done/Features/leases.md | 11 + done/Features/libgfapi.md | 382 +++++++++++ done/Features/libgfchangelog.md | 119 ++++ done/Features/memory-usage.md | 49 ++ done/Features/meta.md | 206 ++++++ done/Features/mount_gluster_volume_using_pnfs.md | 68 ++ done/Features/nufa.md | 20 + done/Features/object-versioning.md | 230 +++++++ done/Features/ovirt-integration.md | 106 +++ done/Features/qemu-integration.md | 230 +++++++ done/Features/quota-object-count.md | 47 ++ done/Features/quota-scalability.md | 52 ++ done/Features/rdmacm.md | 26 + done/Features/readdir-ahead.md | 14 + done/Features/rebalance.md | 74 ++ done/Features/server-quorum.md | 44 ++ done/Features/shard.md | 68 ++ done/Features/tier.md | 170 +++++ done/Features/trash_xlator.md | 80 +++ 
done/Features/upcall.md | 38 ++ done/Features/worm.md | 75 +++ done/Features/zerofill.md | 26 + done/GlusterFS 3.5/AFR CLI enhancements.md | 204 ++++++ done/GlusterFS 3.5/Brick Failure Detection.md | 151 +++++ done/GlusterFS 3.5/Disk Encryption.md | 443 ++++++++++++ done/GlusterFS 3.5/Exposing Volume Capabilities.md | 161 +++++ done/GlusterFS 3.5/File Snapshot.md | 101 +++ .../Onwire Compression-Decompression.md | 96 +++ done/GlusterFS 3.5/Quota Scalability.md | 99 +++ done/GlusterFS 3.5/Virt store usecase.md | 140 ++++ done/GlusterFS 3.5/Zerofill.md | 192 ++++++ done/GlusterFS 3.5/gfid access.md | 89 +++ done/GlusterFS 3.5/index.md | 32 + done/GlusterFS 3.5/libgfapi with qemu libvirt.md | 222 ++++++ done/GlusterFS 3.5/readdir ahead.md | 117 ++++ done/GlusterFS 3.6/Better Logging.md | 348 ++++++++++ done/GlusterFS 3.6/Better Peer Identification.md | 172 +++++ .../Gluster User Serviceable Snapshots.md | 39 ++ done/GlusterFS 3.6/Gluster Volume Snapshot.md | 354 ++++++++++ done/GlusterFS 3.6/New Style Replication.md | 230 +++++++ .../Persistent AFR Changelog xattributes.md | 178 +++++ done/GlusterFS 3.6/RDMA Improvements.md | 101 +++ done/GlusterFS 3.6/Server-side Barrier feature.md | 213 ++++++ done/GlusterFS 3.6/Thousand Node Gluster.md | 150 +++++ done/GlusterFS 3.6/afrv2.md | 244 +++++++ done/GlusterFS 3.6/better-ssl.md | 137 ++++ done/GlusterFS 3.6/disperse.md | 142 ++++ done/GlusterFS 3.6/glusterd volume locks.md | 48 ++ done/GlusterFS 3.6/heterogeneous-bricks.md | 136 ++++ done/GlusterFS 3.6/index.md | 96 +++ done/GlusterFS 3.7/Archipelago Integration.md | 93 +++ done/GlusterFS 3.7/BitRot.md | 211 ++++++ done/GlusterFS 3.7/Clone of Snapshot.md | 100 +++ done/GlusterFS 3.7/Data Classification.md | 279 ++++++++ .../Easy addition of Custom Translators.md | 129 ++++ .../Exports and Netgroups Authentication.md | 134 ++++ done/GlusterFS 3.7/Gluster CLI for NFS Ganesha.md | 120 ++++ done/GlusterFS 3.7/Gnotify.md | 168 +++++ done/GlusterFS 3.7/HA for Ganesha.md | 156 
+++++ .../GlusterFS 3.7/Improve Rebalance Performance.md | 277 ++++++++ done/GlusterFS 3.7/Object Count.md | 113 ++++ .../Policy based Split-brain Resolution.md | 128 ++++ done/GlusterFS 3.7/SE Linux Integration.md | 4 + done/GlusterFS 3.7/Scheduling of Snapshot.md | 229 +++++++ done/GlusterFS 3.7/Sharding xlator.md | 129 ++++ done/GlusterFS 3.7/Small File Performance.md | 433 ++++++++++++ done/GlusterFS 3.7/Trash.md | 182 +++++ done/GlusterFS 3.7/Upcall Infrastructure.md | 747 +++++++++++++++++++++ done/GlusterFS 3.7/arbiter.md | 100 +++ done/GlusterFS 3.7/index.md | 90 +++ done/GlusterFS 3.7/rest-api.md | 152 +++++ 83 files changed, 12427 insertions(+) create mode 100644 done/Features/README.md create mode 100644 done/Features/afr-arbiter-volumes.md create mode 100644 done/Features/afr-statistics.md create mode 100644 done/Features/afr-v1.md create mode 100644 done/Features/bitrot-docs.md create mode 100644 done/Features/brick-failure-detection.md create mode 100644 done/Features/dht.md create mode 100644 done/Features/distributed-geo-rep.md create mode 100644 done/Features/file-snapshot.md create mode 100644 done/Features/gfid-access.md create mode 100644 done/Features/glusterfs_nfs-ganesha_integration.md create mode 100644 done/Features/heal-info-and-split-brain-resolution.md create mode 100644 done/Features/leases.md create mode 100644 done/Features/libgfapi.md create mode 100644 done/Features/libgfchangelog.md create mode 100644 done/Features/memory-usage.md create mode 100644 done/Features/meta.md create mode 100644 done/Features/mount_gluster_volume_using_pnfs.md create mode 100644 done/Features/nufa.md create mode 100644 done/Features/object-versioning.md create mode 100644 done/Features/ovirt-integration.md create mode 100644 done/Features/qemu-integration.md create mode 100644 done/Features/quota-object-count.md create mode 100644 done/Features/quota-scalability.md create mode 100644 done/Features/rdmacm.md create mode 100644 
done/Features/readdir-ahead.md create mode 100644 done/Features/rebalance.md create mode 100644 done/Features/server-quorum.md create mode 100644 done/Features/shard.md create mode 100644 done/Features/tier.md create mode 100644 done/Features/trash_xlator.md create mode 100644 done/Features/upcall.md create mode 100644 done/Features/worm.md create mode 100644 done/Features/zerofill.md create mode 100644 done/GlusterFS 3.5/AFR CLI enhancements.md create mode 100644 done/GlusterFS 3.5/Brick Failure Detection.md create mode 100644 done/GlusterFS 3.5/Disk Encryption.md create mode 100644 done/GlusterFS 3.5/Exposing Volume Capabilities.md create mode 100644 done/GlusterFS 3.5/File Snapshot.md create mode 100644 done/GlusterFS 3.5/Onwire Compression-Decompression.md create mode 100644 done/GlusterFS 3.5/Quota Scalability.md create mode 100644 done/GlusterFS 3.5/Virt store usecase.md create mode 100644 done/GlusterFS 3.5/Zerofill.md create mode 100644 done/GlusterFS 3.5/gfid access.md create mode 100644 done/GlusterFS 3.5/index.md create mode 100644 done/GlusterFS 3.5/libgfapi with qemu libvirt.md create mode 100644 done/GlusterFS 3.5/readdir ahead.md create mode 100644 done/GlusterFS 3.6/Better Logging.md create mode 100644 done/GlusterFS 3.6/Better Peer Identification.md create mode 100644 done/GlusterFS 3.6/Gluster User Serviceable Snapshots.md create mode 100644 done/GlusterFS 3.6/Gluster Volume Snapshot.md create mode 100644 done/GlusterFS 3.6/New Style Replication.md create mode 100644 done/GlusterFS 3.6/Persistent AFR Changelog xattributes.md create mode 100644 done/GlusterFS 3.6/RDMA Improvements.md create mode 100644 done/GlusterFS 3.6/Server-side Barrier feature.md create mode 100644 done/GlusterFS 3.6/Thousand Node Gluster.md create mode 100644 done/GlusterFS 3.6/afrv2.md create mode 100644 done/GlusterFS 3.6/better-ssl.md create mode 100644 done/GlusterFS 3.6/disperse.md create mode 100644 done/GlusterFS 3.6/glusterd volume locks.md create mode 100644 
done/GlusterFS 3.6/heterogeneous-bricks.md create mode 100644 done/GlusterFS 3.6/index.md create mode 100644 done/GlusterFS 3.7/Archipelago Integration.md create mode 100644 done/GlusterFS 3.7/BitRot.md create mode 100644 done/GlusterFS 3.7/Clone of Snapshot.md create mode 100644 done/GlusterFS 3.7/Data Classification.md create mode 100644 done/GlusterFS 3.7/Easy addition of Custom Translators.md create mode 100644 done/GlusterFS 3.7/Exports and Netgroups Authentication.md create mode 100644 done/GlusterFS 3.7/Gluster CLI for NFS Ganesha.md create mode 100644 done/GlusterFS 3.7/Gnotify.md create mode 100644 done/GlusterFS 3.7/HA for Ganesha.md create mode 100644 done/GlusterFS 3.7/Improve Rebalance Performance.md create mode 100644 done/GlusterFS 3.7/Object Count.md create mode 100644 done/GlusterFS 3.7/Policy based Split-brain Resolution.md create mode 100644 done/GlusterFS 3.7/SE Linux Integration.md create mode 100644 done/GlusterFS 3.7/Scheduling of Snapshot.md create mode 100644 done/GlusterFS 3.7/Sharding xlator.md create mode 100644 done/GlusterFS 3.7/Small File Performance.md create mode 100644 done/GlusterFS 3.7/Trash.md create mode 100644 done/GlusterFS 3.7/Upcall Infrastructure.md create mode 100644 done/GlusterFS 3.7/arbiter.md create mode 100644 done/GlusterFS 3.7/index.md create mode 100644 done/GlusterFS 3.7/rest-api.md (limited to 'done') diff --git a/done/Features/README.md b/done/Features/README.md new file mode 100644 index 0000000..97c1175 --- /dev/null +++ b/done/Features/README.md @@ -0,0 +1,42 @@ +###Features in GlusterFS 3.7 + +- [AFR Arbiter Volumes](./afr-arbiter-volumes.md) +- [bit rot docs](./bitrot-docs.md) +- [bit rot memory usage](./memory-usage.md) +- [bit rot object versioning](./object-versioning.md) +- [shard](./shard.md) +- [upcall](./upcall.md) +- [quota object count](./quota-object-count.md) +- [GlusterFS NFS Ganesha Integration](./glusterfs_nfs-ganesha_integration.md) +- [Tiering](./tier.md) +- 
[trash_xlator](./trash_xlator.md) + +###Features in GlusterFS 3.6 + + +###Features in GlusterFS 3.5 + +- [AFR Statistics](./afr-statistics.md) +- [AFR ver 1](./afr-v1.md) +- [Brick failure detection](./brick-failure-detection.md) +- [File Snapshot](./file-snapshot.md) +- [gfid access](./gfid-access.md) +- [quota scalability.md](./quota-scalability.md) +- [Readdir-ahead](./readdir-ahead.md) +- [zerofill API for GlusterFS](./zerofill.md) + +###Other Gluster Features + +- [Distributed Hash Tables](./dht.md) +- [Heal Info and Split Brain Resolution](./heal-info-and-split-brain-resolution.md) +- [libgfapi](./libgfapi.md) +- [Mounting Gluster Volumes using PNFS](./mount_gluster_volume_using_pnfs.md) +- [Non Uniform File Access](./nufa.md) +- [OVirt Integration](./ovirt-integration.md) +- [QEMU Integration](./qemu-integration.md) +- [RDMA Connection Manager](./rdmacm.md) +- [Rebalance](./rebalance.md) +- [Server Quorum](./server-quorum.md) +- [Write Once Read Many](./worm.md) +- [Distributed geo replication](./distributed-geo-rep.md) +- [libgf changelog](./libgfchangelog.md) \ No newline at end of file diff --git a/done/Features/afr-arbiter-volumes.md b/done/Features/afr-arbiter-volumes.md new file mode 100644 index 0000000..e31bc31 --- /dev/null +++ b/done/Features/afr-arbiter-volumes.md @@ -0,0 +1,56 @@ +Usage guide: Replicate volumes with arbiter configuration +========================================================== + +Arbiter volumes are replica 3 volumes where the 3rd brick of the replica is +automatically configured as an arbiter node. What this means is that the 3rd +brick will store only the file name and metadata, but does not contain any data. +This configuration is helpful in avoiding split-brains while providing the same +level of consistency as a normal replica 3 volume. 
+ +The arbiter volume can be created with the following command: + + gluster volume create replica 3 arbiter 1 host1:brick1 host2:brick2 host3:brick3 + +Note that the syntax is similar to creating a normal replica 3 volume with the +exception of the `arbiter 1` keyword. As seen in the command above, the only +permissible values for the replica count and arbiter count are 3 and 1 +respectively. Also, the 3rd brick is always chosen as the arbiter brick; no +other brick can be configured as the arbiter. + +Client/ Mount behaviour: +======================== + +By default, client quorum (`cluster.quorum-type`) is set to `auto` for a replica +3 volume when it is created; i.e. at least 2 bricks need to be up to satisfy +quorum and to allow writes. This setting must not be changed for arbiter +volumes either. Additionally, the arbiter volume has some additional checks to +prevent files from ending up in split-brain: + +* Clients take full file locks when writing to a file as opposed to range locks + in a normal replica 3 volume. + +* If 2 bricks are up and if one of them is the arbiter (i.e. the 3rd brick) *and* + it blames the other up brick, then all FOPS will fail with ENOTCONN (Transport + endpoint is not connected). If the arbiter doesn't blame the other brick, + FOPS will be allowed to proceed. 'Blaming' here is with respect to the values of AFR + changelog extended attributes. + +* If 2 bricks are up and the arbiter is down, then FOPS will be allowed. + +* In all cases, if there is only one source before the FOP is initiated and if + the FOP fails on that source, the application will receive ENOTCONN. + +Note: It is possible to see if a replica 3 volume has arbiter configuration from +the mount point. If +`$mount_point/.meta/graphs/active/$V0-replicate-0/options/arbiter-count` exists +and its value is 1, then it is an arbiter volume. Also the client volume graph +will have arbiter-count as a xlator option for AFR translators. 
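The client-side rules above can be condensed into a small decision function. This is only an illustrative Python sketch of the prose, not AFR code: the function name `arbiter_fop_allowed`, the brick numbering (0 and 1 for the data bricks, 2 for the arbiter) and the boolean `arbiter_blames_other` are assumptions made for the example, and the case where all three bricks are up is assumed to behave like a normal replica 3 volume.

```python
ARBITER = 2  # hypothetical index of the 3rd (arbiter) brick


def arbiter_fop_allowed(up_bricks, arbiter_blames_other=False):
    """Return True if a FOP may proceed, following the rules in the text.

    up_bricks: set of brick indices that are currently up.
    arbiter_blames_other: True if the arbiter's AFR changelog xattrs
        blame the one other brick that is up.
    """
    if len(up_bricks) < 2:
        # Quorum type 'auto' needs at least 2 of the 3 bricks.
        return False
    if len(up_bricks) == 2 and ARBITER in up_bricks:
        # One data brick plus the arbiter: allowed only if the arbiter
        # does not blame that data brick (otherwise ENOTCONN per the text).
        return not arbiter_blames_other
    # Both data bricks up with the arbiter down, or all three up (assumed).
    return True
```

For example, `arbiter_fop_allowed({0, 2}, arbiter_blames_other=True)` is `False`, which corresponds to the ENOTCONN case, while `arbiter_fop_allowed({0, 1})` is `True`.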
+ +Self-heal daemon behaviour: +=========================== +Since the arbiter brick does not store any data for the files, data-self-heal +from the arbiter brick will not take place. For example, if there are 2 source +bricks B2 and B3 (B3 being the arbiter brick) and B2 is down, then data-self-heal +will *not* happen from B3 to sink brick B1, and will be pending until B2 comes +up and heal can happen from it. Note that metadata and entry self-heals can +still happen from B3 if it is one of the sources. \ No newline at end of file diff --git a/done/Features/afr-statistics.md b/done/Features/afr-statistics.md new file mode 100644 index 0000000..d070584 --- /dev/null +++ b/done/Features/afr-statistics.md @@ -0,0 +1,142 @@ +##gluster volume heal statistics + +##Description +In case of index self-heal, the self-heal daemon reads the entries from the +/brick-path/.glusterfs/indices/xattrop/ directory of the local bricks +and attempts self-heal on the entries it has read. +Executing this command lists the crawl statistics of the self-heal done for +each brick. + +For each brick, it will list: +1. Starting time of crawl done for that brick. +2. Ending time of crawl done for that brick. +3. Number of entries for which self-heal was attempted successfully. +4. Number of split-brain entries found while self-healing. +5. Number of entries for which heal failed. + + + +Example: +a) Create a gluster volume with replica count 2. +b) Create 10 files. +c) Kill brick_1 of this replica. +d) Overwrite all 10 files. +e) Kill the other brick (brick_2), and bring brick_1 back up. +f) Overwrite all 10 files. + +Now all 10 files are in split-brain. The self-heal daemon will crawl +both bricks and count 10 files on each brick. +It will report 10 files in split-brain with respect to the given brick. 
+ +Gathering crawl statistics on volume volume1 has been successful +------------------------------------------------ + +Crawl statistics for brick no 0 +Hostname of brick 192.168.122.1 + +Starting time of crawl: Tue May 20 19:13:11 2014 + +Ending time of crawl: Tue May 20 19:13:12 2014 + +Type of crawl: INDEX +No. of entries healed: 0 +No. of entries in split-brain: 10 +No. of heal failed entries: 0 +------------------------------------------------ + +Crawl statistics for brick no 1 +Hostname of brick 192.168.122.1 + +Starting time of crawl: Tue May 20 19:13:12 2014 + +Ending time of crawl: Tue May 20 19:13:12 2014 + +Type of crawl: INDEX +No. of entries healed: 0 +No. of entries in split-brain: 10 +No. of heal failed entries: 0 + +------------------------------------------------ + + +As the output shows, the self-heal daemon detects 10 files in split-brain with +respect to the given brick. + + + + +##gluster volume heal statistics heal-count +It lists the number of entries present in +//.glusterfs/indices/xattrop from each brick. + + +1. Create a replicate volume. +2. Kill one brick of the replicate volume volume1. +3. Create 10 files. +4. Execute the above command. + +-------------------------------------------------------------------------------- +Gathering count of entries to be healed on volume volume1 has been successful + +Brick 192.168.122.1:/brick_1 +Number of entries: 10 + +Brick 192.168.122.1:/brick_2 +No gathered input for this brick +------------------------------------------------------------------------------- + + + + + + +##gluster volume heal statistics heal-count replica \ + ip_addr:/brick_location + +To list the number of entries to be healed from a particular replicate +subvolume, name any one child of that replicate subvolume in the command; +the entries are then listed for all the children of that replicate subvolume. 
+ +Example: dht + / \ + / \ + replica-1 replica-2 + / \ / \ + child-1 child-2 child-3 child-4 + /brick1 /brick2 /brick3 /brick4 + +gluster volume heal statistics heal-count ip:/brick1 +will list count only for child-1 and child-2. + +gluster volume heal statistics heal-count ip:/brick3 +will list count only for child-3 and child-4. + + + +1. Create a volume same as mentioned in the above graph. +2. Kill Brick-2. +3. Create some files. +4. If we are interested in knowing the number of files to be healed from each + brick of replica-1 only, mention any one child of that replica. + +gluster volume heal volume1 statistics heal-count replica 192.168.122.1:/brick2 + +output: +------- +Gathering count of entries to be healed per replica on volume volume1 has \ +been successful + +Brick 192.168.122.1:/brick_1 +Number of entries: 10 <--10 files + +Brick 192.168.122.1:/brick_2 +No gathered input for this brick <-Brick is down + +Brick 192.168.122.1:/brick_3 +No gathered input for this brick <--No result, as we are not + interested. + +Brick 192.168.122.1:/brick_4 <--No result, as we are not +No gathered input for this brick interested. + + diff --git a/done/Features/afr-v1.md b/done/Features/afr-v1.md new file mode 100644 index 0000000..0ab41a1 --- /dev/null +++ b/done/Features/afr-v1.md @@ -0,0 +1,340 @@ +#Automatic File Replication +Afr xlator in glusterfs is responsible for replicating the data across the bricks. + +###Responsibilities of AFR +Its responsibilities include the following: + +1. Maintain replication consistency (i.e. Data on both the bricks should be same, even in the cases where there are operations happening on same file/directory in parallel from multiple applications/mount points as long as all the bricks in replica set are up) + +2. Provide a way of recovering data in case of failures as long as there is + at least one brick which has the correct data. + +3. 
Serve fresh data for read/stat/readdir etc + +###Transaction framework +For 1, 2 above afr uses transaction framework which consists of the following 5 +phases which happen on all the bricks in replica set(Bricks which are in replication): + +####1.Lock Phase +####2. Pre-op Phase +####3. Op Phase +####4. Post-op Phase +####5. Unlock Phase + +*Op Phase* is the actual operation sent by application (`write/create/unlink` etc). For every operation which afr receives that modifies data it sends that same operation in parallel to all the bricks in its replica set. This is how it achieves replication. + +*Lock, Unlock Phases* take necessary locks so that *Op phase* can provide **replication consistency** in normal work flow. + +#####For example: +If an application performs `touch a` and the other one does `mkdir a`, afr makes sure that either file with name `a` or directory with name `a` is created on both the bricks. + +*Pre-op, Post-op Phases* provide changelogging which enables afr to figure out which copy is fresh. +Once afr knows how to figure out fresh copy in the replica set it can **recover data** from fresh copy to stale copy. Also it can **serve fresh** data for `read/stat/readdir` etc. + +##Internal Operations +Brief introduction to internal operations in Glusterfs which make *Locking, Unlocking, Pre/Post ops* possible: + +###Internal Locking Operations +Glusterfs has **locks** translator which provides the following internal locking operations called `inodelk`, `entrylk` which are used by afr to achieve synchronization of operations on files or directories that conflict with each other. + +`Inodelk` gives the facility for translators in Glusterfs to obtain range (denoted by tuple with **offset**, **length**) locks in a given domain for an inode. +Full file lock is denoted by the tuple (offset: `0`, length: `0`) i.e. length `0` is considered as infinity. 
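The range convention above (a length of `0` meaning "up to infinity") can be made concrete with a small overlap check. This is a minimal Python sketch for illustration only: the helper names `lock_end` and `ranges_conflict` are hypothetical, and it treats every lock as exclusive, ignoring lock domains and read/write lock types.

```python
def lock_end(offset, length):
    """Exclusive end of a lock range; length 0 is treated as infinity."""
    return float("inf") if length == 0 else offset + length


def ranges_conflict(a, b):
    """Return True if two (offset, length) lock ranges overlap."""
    (off_a, len_a), (off_b, len_b) = a, b
    return off_a < lock_end(off_b, len_b) and off_b < lock_end(off_a, len_a)


# A full file lock (0, 0) overlaps every other range.
print(ranges_conflict((0, 0), (4096, 512)))
# Two adjacent ranges do not overlap.
print(ranges_conflict((0, 4096), (4096, 4096)))
```

With this convention, a lock taken as `(O, 0)` conflicts with any range that extends past offset `O`, while two disjoint byte ranges can be held at the same time.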
+ +`Entrylk` enables translators of Glusterfs to obtain locks on `name` in a given domain for an inode, typically a directory. + +**Locks** translator provides both *blocking* and *nonblocking* variants of these operations. + +###Xattrop +For pre/post ops posix translator provides an operation called xattrop. +xattrop is a way of *incrementing*/*decrementing* a number present in the extended attribute of the inode *atomically*. + +##Transaction Types +There are 3 types of transactions in AFR. +1. Data transactions + - Operations that add/modify/truncate the file contents. + - `Write`/`Truncate`/`Ftruncate` etc + +2. Metadata transactions + - Operations that modify the data kept in inode. + - `Chmod`/`Chown` etc + +3. Entry transactions + - Operations that add/remove/rename entries in a directory + - `Touch`/`Mkdir`/`Mknod` etc + +###Data transactions: + +*write* (`offset`, `size`) - writes `size` bytes of data starting at `offset` + +*ftruncate*/*truncate* (`offset`) - truncates data from `offset` till the end of file. + +Afr internal locking needs to make sure that two conflicting data operations happen in order, one after the other, so that they do not result in replication inconsistency. Afr data operations take inodelks in the same domain (let's call it the `data` domain). + +*Write* operation with offset `O` and size `S` takes an inode lock in data domain with range `(O, S)`. + +*Ftruncate*/*Truncate* operations with offset `O` take inode locks in `data` domain with range `(O, 0)`. Please note that size `0` means size infinity. + +These ranges make sure that overlapping write/truncate/ftruncate operations are done one after the other. + +Now that we know the ranges the operations take locks on, we will see how locking happens in afr. + +####Lock: +Afr initially attempts **non-blocking** locks on **all** the bricks of the replica set in **parallel**. If all the locks are successful then it goes on to perform pre-op. 
But in case **non-blocking** locks **fail** because there is *at least one conflicting operation* which already has a **granted lock** then it **unlocks** the **non-blocking** locks it already acquired in the previous step and proceeds to perform **blocking** locks **one after the other** on each of the subvolumes in the order of subvolumes specified in the volfile. + +The chance of **conflicting operations** is **very low** and time elapsed in **non-blocking** locks phase is `Max(latencies of the bricks for responding to inodelk)`, whereas time elapsed in **blocking locks** phase is `Sum(latencies of the bricks for responding to inodelk)`. That is why afr always tries for non-blocking locks first and only then it moves to blocking locks. + +####Pre-op: +Each file/dir in a brick maintains the changelog (roughly pending operation count) of itself and that of the files +present in all the other bricks in its replica set as seen by that brick. + +Let's consider an example replica volume with 2 bricks brick-a and brick-b. +All files in brick-a will have 2 entries, +one for itself and the other for the file present in its replica set, i.e. brick-b. +One can inspect changelogs using getfattr command. + + # getfattr -d -e hex -m. brick-a/file.txt + trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a) + trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for brick-b as seen by brick-a + +Likewise, all files in brick-b will have: + + # getfattr -d -e hex -m. brick-b/file.txt + trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b + trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for itself (brick-b) + +#####Interpreting Changelog Value: +Each extended attribute has a value which is `24` hexadecimal digits, i.e. `12` bytes. +First `8` digits (`4` bytes) represent changelog of `data`. Second `8` digits represent changelog +of `metadata`. 
Last 8 digits represent Changelog of `directory entries`. + +Pictorially representing the same, we have: + + 0x 00000000 00000000 00000000 + | | | + | | \_ changelog of directory entries + | \_ changelog of metadata + \ _ changelog of data + +Before write operation is performed on the brick, afr marks the file saying there is a pending operation. + +As part of this pre-op afr sends xattrop operation with increment 1 for data operation to make the extended attributes the following: + # getfattr -d -e hex -m. brick-a/file.txt + trusted.afr.vol-client-0=0x000000010000000000000000 -->changelog for itself (brick-a) + trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for brick-b as seen by brick-a + +Likewise, all files in brick-b will have: + + # getfattr -d -e hex -m. brick-b/file.txt + trusted.afr.vol-client-0=0x000000010000000000000000 -->changelog for brick-a as seen by brick-b + trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for itself (brick-b) + +As the operation is in progress on files on both the bricks all the extended attributes show the same value. + +####Op: +Now it sends the actual write operation to both the bricks. Afr remembers whether the operation is successful or not on all the subvolumes. + +####Post-Op: +If the operation succeeds on all the bricks then there is no pending operations on any of the bricks so as part of POST-OP afr sends xattrop operation with increment -1 i.e. decrement by 1 for data operation to make the extended attributes back to all zeros again. + +In case there is a failure on brick-b then there is still a pending operation on brick-b where as no pending operations are there on brick-a. So xattrop operation for both of these extended attributes differs now. For extended attribute corresponding to brick-a i.e. trusted.afr.vol-client-0 decrement by 1 is sent where as for extended attribute corresponding to brick-b increment by '0' is sent to retain the pending operation count. 
+ + # getfattr -d -e hex -m. brick-a/file.txt + trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a) + trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for brick-b as seen by brick-a + + # getfattr -d -e hex -m. brick-b/file.txt + trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b + trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for itself (brick-b) + +####Unlock: +Once the transaction is completed unlock is sent on all the bricks where lock is acquired. + + +###Meta Data transactions: + +setattr, setxattr, removexattr +All metadata operations take same inode lock with same range in metadata domain. + +####Lock: +Metadata locking also starts initially with non-blocking locks then move on to blocking locks on any failures because of conflicting operations. + +####Pre-op: +Before metadata operation is performed on the brick, afr marks the file saying there is a pending operation. +As part of this pre-op afr sends xattrop operation with increment 1 for metadata operation to make the extended attributes the following: + # getfattr -d -e hex -m. brick-a/file.txt + trusted.afr.vol-client-0=0x000000000000000100000000 -->changelog for itself (brick-a) + trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for brick-b as seen by brick-a + +Likewise, all files in brick-b will have: + # getfattr -d -e hex -m. brick-b/file.txt + trusted.afr.vol-client-0=0x000000000000000100000000 -->changelog for brick-a as seen by brick-b + trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for itself (brick-b) + +As the operation is in progress on files on both the bricks all the extended attributes show the same value. + +####Op: +Now it sends the actual metadata operation to both the bricks. Afr remembers whether the operation is successful or not on all the subvolumes. 
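The `trusted.afr.*` values shown in the getfattr listings can be split into their three counters as described under "Interpreting Changelog Value" (8 hex digits each for data, metadata and directory entries). The following is a minimal Python sketch of that layout; the function name `decode_changelog` and the returned dictionary keys are assumptions for illustration and are not part of any Gluster tool.

```python
def decode_changelog(value):
    """Split a 24-hex-digit AFR changelog value into its three counters.

    value: string as printed by `getfattr -e hex`, with or without
           the leading "0x".
    """
    hex_digits = value[2:] if value.startswith("0x") else value
    if len(hex_digits) != 24:
        raise ValueError("expected 24 hexadecimal digits (12 bytes)")
    # First 8 digits: data, next 8: metadata, last 8: directory entries.
    data, metadata, entry = (int(hex_digits[i:i + 8], 16) for i in (0, 8, 16))
    return {"data": data, "metadata": metadata, "entry": entry}


print(decode_changelog("0x000000010000000000000000"))
print(decode_changelog("0x000000000000000100000000"))
```

The first value is the one seen during a data transaction's pre-op and decodes to one pending data operation; the second is the metadata pre-op value and decodes to one pending metadata operation.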
+ +####Post-Op: +If the operation succeeds on all the bricks then there are no pending operations on any of the bricks, so as part of POST-OP afr sends xattrop operation with increment -1 i.e. decrement by 1 for metadata operation to make the extended attributes back to all zeros again. + +In case there is a failure on brick-b then there is still a pending operation on brick-b whereas no pending operations are there on brick-a. So xattrop operation for both of these extended attributes differs now. For extended attribute corresponding to brick-a i.e. trusted.afr.vol-client-0 decrement by 1 is sent whereas for extended attribute corresponding to brick-b increment by '0' is sent to retain the pending operation count. + + # getfattr -d -e hex -m. brick-a/file.txt + trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a) + trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for brick-b as seen by brick-a + + # getfattr -d -e hex -m. brick-b/file.txt + trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b + trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for itself (brick-b) + +####Unlock: +Once the transaction is completed unlock is sent on all the bricks where lock is acquired. + + +###Entry transactions: + +create, mknod, mkdir, link, symlink, rename, unlink, rmdir +Pre-op/Post-op (done using xattrop) always happens on the parent directory. 
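The pre-op/post-op bookkeeping described for data and metadata transactions follows the same pattern for every transaction type, and can be modelled in a few lines. This is a simplified Python sketch, not AFR code: the function name `run_transaction` and the dictionary layout are hypothetical, a single counter stands in for one of the three changelog fields, and it assumes (as in the getfattr listings above) that the xattrop calls themselves reach every brick even when the operation fails on one of them.

```python
def run_transaction(op_ok):
    """Simulate pre-op, op and post-op pending counts for one transaction.

    op_ok: dict mapping brick name -> True if the op succeeded there.
    Returns xattrs[brick][target]: the pending count that `brick`
    stores about `target` after the post-op.
    """
    bricks = list(op_ok)
    xattrs = {b: {t: 0 for t in bricks} for b in bricks}

    # Pre-op: every brick marks a pending operation for every target.
    for b in bricks:
        for t in bricks:
            xattrs[b][t] += 1

    # Post-op: decrement for targets where the op succeeded,
    # add 0 (keep the pending count) for targets where it failed.
    for b in bricks:
        for t in bricks:
            xattrs[b][t] += -1 if op_ok[t] else 0

    return xattrs


print(run_transaction({"brick-a": True, "brick-b": False}))
```

With a failure on `brick-b`, both bricks end up recording zero pending operations for `brick-a` and one for `brick-b`, which matches the listings shown for the failure case.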
+ +Entry Locks taken by these entry operations: + +**Create** (file `dir/a`): Lock on name `a` in inode of `dir` + +**mknod** (file `dir/a`): Lock on name `a` in inode of `dir` + +**mkdir** (dir `dir/a`): Lock on name `a` in inode of `dir` + +**link** (file `oldfile`, file `dir/newfile`): Lock on name `newfile` in inode of `dir` + +**Symlink** (file `oldfile`, file `dir`/`symlinkfile`): Lock on name `symlinkfile` in inode of `dir` + +**rename** of (file `dir1`/`file1`, file `dir2`/`file2`): Lock on name `file1` in inode of `dir1`, Lock on name `file2` in inode of `dir2` + +**rename** of (dir `dir1`/`dir2`, dir `dir3`/`dir4`): Lock on name `dir2` in inode of `dir1`, Lock on name `dir4` in inode of `dir3`, Lock on `NULL` in inode of `dir4` + +**unlink** (file `dir`/`a`): Lock on name `a` in inode of `dir` + +**rmdir** (dir dir/a): Lock on name `a` in inode of `dir`, Lock on `NULL` in inode of `a` + +####Lock: +Even entry locking starts initially with non-blocking locks then move on to blocking locks on any failures because of conflicting operations. + +####Pre-op: +Before entry operation is performed on the brick, afr marks the directory saying there is a pending operation. + +As part of this pre-op afr sends xattrop operation with increment 1 for entry operation to make the extended attributes the following: + + # getfattr -d -e hex -m. brick-a/ + trusted.afr.vol-client-0=0x000000000000000000000001 -->changelog for itself (brick-a) + trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for brick-b as seen by brick-a + +Likewise, all files in brick-b will have: + # getfattr -d -e hex -m. brick-b/ + trusted.afr.vol-client-0=0x000000000000000000000001 -->changelog for brick-a as seen by brick-b + trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for itself (brick-b) + +As the operation is in progress on files on both the bricks all the extended attributes show the same value. 
+ +####Op: +Now it sends the actual entry operation to both the bricks. Afr remembers whether the operation is successful or not on all the subvolumes. + +####Post-Op: +If the operation succeeds on all the bricks then there is no pending operations on any of the bricks so as part of POST-OP afr sends xattrop operation with increment -1 i.e. decrement by 1 for entry operation to make the extended attributes back to all zeros again. + +In case there is a failure on brick-b then there is still a pending operation on brick-b where as no pending operations are there on brick-a. So xattrop operation for both of these extended attributes differs now. For extended attribute corresponding to brick-a i.e. trusted.afr.vol-client-0 decrement by 1 is sent where as for extended attribute corresponding to brick-b increment by '0' is sent to retain the pending operation count. + + # getfattr -d -e hex -m. brick-a/file.txt + trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a) + trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for brick-b as seen by brick-a + + # getfattr -d -e hex -m. brick-b/file.txt + trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b + trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for itself (brick-b) + +####Unlock: +Once the transaction is completed unlock is sent on all the bricks where lock is acquired. + +The parts above cover how replication consistency is achieved in afr. + +Now let us look at how afr can figure out how to recover from failures given the changelog extended attributes + +###Recovering from failures (Self-heal) +For recovering from failures afr tries to determine which copy is the fresh copy based on the extended attributes. + +There are 3 possibilities: +1. All the extended attributes are zero on all the bricks. This means there are no pending operations on any of the bricks so there is nothing to recover. +2. 
According to the extended attributes there is a brick (brick-a) which noticed that there are operations pending on the other brick (brick-b). + - There are 4 possibilities for brick-b + + - It did not even participate in the transaction (all extended attributes on brick-b are zeros). Choose brick-a as source and perform recovery to brick-b. + + - It participated in the transaction but died even before post-op. (All extended attributes on brick-b have a pending-count). Choose brick-a as source and perform recovery to brick-b. + + - It participated in the transaction and after the post-op the extended attributes on brick-b show that there are pending operations on itself. Choose brick-a as source and perform recovery to brick-b. + + - It participated in the transaction and after the post-op the extended attributes on brick-b show that there are pending operations on brick-a. This situation is called split-brain: each brick blames the other, so afr has no way to choose a source and recover. This situation can happen in cases of network partition. + +3. The only possibility now is where both brick-a and brick-b have pending operations. In this case the changelog extended attributes are all non-zero on all the bricks. Basically what could have happened is that the operations started on the file but either the whole replica set went down or the mount process itself died before post-op was performed. In this case there is a possibility that data on the bricks is different. Afr chooses the file with the bigger size as source. If both files have the same size then it chooses the subvolume which has witnessed the larger number of pending operations on the other brick as source. If both have the same number of pending operations then it chooses the file with the newest ctime as source. If this is also the same then it just picks one of the two bricks as source and syncs data on to the other to make sure that the files are replicas of each other. + +###Self-healing: +Afr does 3 types of self-heals for data recovery. + +1. Data self-heal + +2.
Metadata self-heal + +3. Entry self-heal + +As we have seen earlier, afr depends on changelog extended attributes to figure out which copy is source and which copy is sink. The general algorithm for performing this recovery (self-heal) is the same for all of these different self-heals. + +1. Take appropriate full locks on the file/directory to make sure no other transaction is in progress while inspecting changelog extended attributes. +In this step, + - Data self-heal takes an inode lock with `offset: 0` and `size: 0` (infinity) in the data domain. To prevent duplicate data self-heals an inode lock is taken in the self-heal domain as well. + - Entry self-heal takes an entry lock on the directory with `NULL` name i.e. a full directory lock. + - Metadata self-heal takes a lock on the pre-defined range in the metadata domain on which all the metadata operations on that inode take locks. + +2. Perform sync from fresh copy to stale copy. +In this step, + - Metadata self-heal gets the inode attributes and extended attributes from the source copy and sets them on the stale copy. + + - Entry self-heal reads entries on the stale directories and checks whether they are present on the source directory; if they are not present it deletes them. Then it reads entries on the fresh directory and creates the missing entries on the stale directories. + + - Data self-heal does things a bit differently to make sure no other writes on the file are blocked for the duration of self-heal, because file sizes could be as big as 100G (VM files) and we don't want to block all the transactions until the self-heal is over. The locks translator allows two overlapping locks to be granted if they are from the same lock owner. Using this, data self-heal takes a small 128k range lock and unlocks the previously acquired lock, heals just that 128k chunk, then takes the lock for the next 128k chunk, unlocks the previous lock and moves on to the next one.
It always makes sure that at least one lock is held on the file by self-heal throughout the duration of self-heal so that two self-heals don't happen in parallel. + + - Data self-heal has two algorithms. In 'diff' self-heal a chunk is copied only when there is a data mismatch for that chunk. The other one, 'full' self-heal, is a blind copy of each chunk. + +3. Change extended attributes to mark new sources after the sync. + +4. Unlock the locks acquired to perform self-heal. + +### Transaction Optimizations: +As we saw earlier, an afr transaction for any operation that modifies data happens in 5 phases, i.e. it sends 5 operations on the network for every operation. In the following sections we will see optimizations already implemented in afr which reduce the number of operations on the network to just 1 per transaction in the best case. + +####Changelog-piggybacking +This optimization comes into the picture when, on the same file descriptor, write2's pre-op starts before write1's post-op is complete and the operations are succeeding. When writes come in that manner we can piggyback on the pre-op of write1 for write2 and tell write1 that write2 will do the post-op that was supposed to be done by write1. So write1's post-op does not happen over the network and write2's pre-op does not happen over the network. This optimization does not hold if there are any failures in write1's phases. + +####Delayed Post-op +This optimization just delays the post-op of the write transaction (write1) by a pre-configured amount of time to increase the probability of the next write piggybacking on the pre-op done by write1. + +With the combination of these two optimizations, for write-intensive operations like a full file copy, what essentially happens is that a pre-op is done for the first write and a post-op is done for the last write on the file. So for all the write transactions between the first write and the last write afr reduces network operations from 5 to 3.
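The saving from changelog-piggybacking plus delayed post-op can be checked with a simple counting model. This is a sketch under stated assumptions, not a measurement: it assumes back-to-back successful writes on one file descriptor, counts one network operation per phase, and assumes every later write manages to piggyback. The function name `network_ops` is hypothetical.

```python
# The five phases of an afr transaction as described above.
PHASES = ("lock", "pre-op", "op", "post-op", "unlock")


def network_ops(num_writes, piggyback=False):
    """Count network operations for a run of successful writes on one fd."""
    if num_writes <= 0:
        return 0
    if not piggyback:
        # Every write pays for all five phases.
        return len(PHASES) * num_writes
    # Every write still sends lock, op and unlock.
    per_write = 3 * num_writes
    # Only the first write sends a pre-op and only the last sends a post-op.
    return per_write + 2


print(network_ops(100))                  # without the optimizations
print(network_ops(100, piggyback=True))  # with piggybacking and delayed post-op
```

In this model the writes between the first and the last cost 3 operations each, which is the 5 to 3 reduction stated above.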
+ +####Eager-locking: +This optimization comes into picture when only one file descriptor is open on the file and performing writes just like in the previous optimization. What this optimization does is it takes a full file lock on the file irrespective of the offset, size of the write, so that lock acquired by write1 can be piggybacked by write2 and write2 takes the responsibility of unlocking it. both write1, write2 will have same lock owner and afr takes the responsibility of serializing overlapping writes so that replication consistency is maintained. + +With the combination of these optimizations for operations like full file copy which are write intensive operations, what will essentially happen is for the first write a pre-op, full-file lock will happen. Then for the last write on the file post-op, unlock happens. So for all the write transactions between first write and last write afr reduced network operations from 5 to 1. + +###Quorum in afr: +To avoid split-brains, afr employs the following quorum policies. + - In replica set with odd number of bricks, replica set is said to be in quorum if more than half of the bricks are up. + - In replica set with even number of bricks, if more than half of the bricks are up then it is said to be in quorum but if number of bricks that are up is equal to number of bricks that are down then, it is said to be in quorum if the first brick is also up in the set of bricks that are up. + +When quorum is not met in the replica set then modify operations on the mount are not allowed by afr. + +###Self-heal daemon and Index translator usage by afr: + +####Index xlator: +On each brick index xlator is loaded. This xlator keeps track of what is happening in afr's pre-op and post-op. If there is an ongoing I/O or a pending self-heal, changelog xattrs would have non-zero values. 
Whenever an xattrop/fxattrop fop (pre-op and post-op are done using these fops) comes to the index xlator, a link (with the gfid of the file on which the fop is performed as its name) is added in the /.glusterfs/indices/xattrop directory. If the value returned by the fop is zero the link is removed from the index, otherwise it is kept until zero is returned by a subsequent xattrop/fxattrop fop. + +####Self-heal-daemon: +The self-heal-daemon process keeps running on each machine of the trusted storage pool. This process has the afr xlators of all the volumes which are started. Its job is to crawl the indices on bricks that are local to that machine. If any of the files represented by the gfid link names need healing, it heals them automatically. This operation is performed every 10 minutes for each replica set. It is also performed when a brick comes online. diff --git a/done/Features/bitrot-docs.md b/done/Features/bitrot-docs.md new file mode 100644 index 0000000..90edffc --- /dev/null +++ b/done/Features/bitrot-docs.md @@ -0,0 +1,7 @@ +### Sources where you can find more resources on bitrot + +* Feature page: http://www.gluster.org/community/documentation/index.php/Features/BitRot + +* Design: http://goo.gl/Mjy4mD + +* CLI specification: http://goo.gl/2o12Fn diff --git a/done/Features/brick-failure-detection.md b/done/Features/brick-failure-detection.md new file mode 100644 index 0000000..24f2a18 --- /dev/null +++ b/done/Features/brick-failure-detection.md @@ -0,0 +1,67 @@ +# Brick Failure Detection + +This feature attempts to identify storage/file system failures and disable the failed brick without disrupting the remainder of the node's operation. + +## Description + +Detecting failures on the filesystem that a brick uses makes it possible to handle errors that are caused from outside of the Gluster environment. + +There have been hanging brick processes when the underlying storage of a brick became unavailable.
A hanging brick process can still use the network and respond to clients, but actual I/O to the storage is impossible and can cause noticeable delays on the client side. + +The goal is to provide better detection of storage subsystem failures and prevent bricks from hanging. It should prevent hanging brick processes when storage-hardware or the filesystem fails. + +A health-checker (thread) has been added to the posix xlator. This thread periodically checks the status of the filesystem (implies checking of functional storage-hardware). + +`glusterd` can detect that the brick process has exited, and `gluster volume status` will show that the brick process is not running anymore. System administrators checking the logs should be able to triage the cause. + +## Usage and Configuration + +The health-checker is enabled by default and runs a check every 30 seconds. This interval can be changed per volume with: + + # gluster volume set <VOLNAME> storage.health-check-interval <SECONDS> + +If `SECONDS` is set to 0, the health-checker will be disabled. + +## Failure Detection + +Errors are logged to the standard syslog (mostly `/var/log/messages`): + + Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 5 buf count 512 + Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): I/O Error Detected. Shutting down filesystem + Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): Please umount the filesystem and rectify the problem(s) + Jun 24 11:31:49 vm130-32 kernel: VFS:Filesystem freeze failed + Jun 24 11:31:50 vm130-32 GlusterFS[1969]: [2013-06-24 10:31:50.500674] M [posix-helpers.c:1114:posix_health_check_thread_proc] 0-failing_xfs-posix: health-check failed, going down + Jun 24 11:32:09 vm130-32 kernel: XFS (dm-2): xfs_log_force: error 5 returned. + Jun 24 11:32:20 vm130-32 GlusterFS[1969]: [2013-06-24 10:32:20.508690] M [posix-helpers.c:1119:posix_health_check_thread_proc] 0-failing_xfs-posix: still alive!
-> SIGTERM + +The messages labelled with `GlusterFS` in the above output are also written to the logs of the brick process. + +## Recovery after a failure + +When a brick process detects that the underlying storage is not responding anymore, the process will exit. The brick process is not restarted automatically; the sysadmin will need to fix the problem with the storage first. + +After correcting the storage (hardware or filesystem) issue, the following command will start the brick process again: + + # gluster volume start <VOLNAME> force + +## How To Test + +The health-checker thread that is part of each brick process will get started automatically when a volume has been started. Verifying its functionality can be done in different ways. + +On virtual hardware: + +* disconnect the disk from the VM that holds the brick + +On real hardware: + +* simulate a RAID-card failure by unplugging the card or cables + +On a system that uses LVM for the bricks: + +* use device-mapper to load an error-table for the disk, see [this description](http://review.gluster.org/5176). + +On any system (writing to random offsets of the block device, more difficult to trigger): + +1. cause corruption on the filesystem that holds the brick +2. read contents from the brick, hoping to hit the corrupted area +3. the filesystem should abort after hitting a bad spot, the health-checker should notice that shortly afterwards diff --git a/done/Features/dht.md b/done/Features/dht.md new file mode 100644 index 0000000..c35dd6d --- /dev/null +++ b/done/Features/dht.md @@ -0,0 +1,223 @@ +# How GlusterFS Distribution Works + +The defining feature of any scale-out system is its ability to distribute work +or data among many servers.
Accordingly, people in the distributed-system +community have developed many powerful techniques to perform such distribution, +but those techniques often remain little known or understood even among other +members of the file system and database communities that benefit. This +confusion is represented even in the name of the GlusterFS component that +performs distribution - DHT, which stands for Distributed Hash Table but is not +actually a DHT as that term is most commonly used or defined. The way +GlusterFS's DHT works is based on a few basic principles: + + * All operations are driven by clients, which are all equal. There are no + special nodes with special knowledge of where files are or should be. + + * Directories exist on all subvolumes (bricks or lower-level aggregations of + bricks); files exist on only one. + + * Files are assigned to subvolumes based on *consistent hashing*, and even + more specifically a form of consistent hashing exemplified by Amazon's + [Dynamo][dynamo]. + +The result of all this is that users are presented with a set of files that is +the union of the files present on all subvolumes. The following sections +describe how this "uniting" process actually works. + +## Layouts + +The conceptual basis of Dynamo-style consistent hashing is of numbers around a +circle, like a clock. First, the circle is divided into segments and those +segments are assigned to bricks. (For the sake of simplicity we'll use +"bricks" hereafter even though they might actually be replicated/striped +subvolumes.) Several factors guide this assignment. + + * Assignments are done separately for each directory. + + * Historically, segments have all been the same size. However, this can lead + to smaller bricks becoming full while plenty of space remains on larger + ones. If the *cluster.weighted-rebalance* option is set, segments sizes + will be proportional to brick sizes. + + * Assignments need not include all bricks in the volume. 
If the + *cluster.subvols-per-directory* option is set, only that many bricks will + receive assignments for that directory. + +However these assignments are done, they collectively become what we call a +*layout* for a directory. This layout is then stored using extended +attributes, with each brick's copy of that extended attribute on that directory +consisting of four 32-bit fields. + + * A version, which might be DHT\_HASH\_TYPE\_DM to represent an assignment as + described above, or DHT\_HASH\_TYPE\_DM\_USER to represent an assignment made + manually by the user (or external script). + + * A "commit hash" which will be described later. + + * The first number in the assigned range (segment). + + * The last number in the assigned range. + +For example, the extended attributes representing a weighted assignment between +three bricks, one twice as big as the others, might look like this. + + * Brick A (the large one): DHT\_HASH\_TYPE\_DM 1234 0x00000000 0x7fffffff + + * Brick B: DHT\_HASH\_TYPE\_DM 1234 0x80000000 0xbfffffff + + * Brick C: DHT\_HASH\_TYPE\_DM 1234 0xc0000000 0xffffffff + +## Placing Files + +To place a file in a directory, we first need a layout for that directory - as +described above. Next, we calculate a hash for the file. To minimize +collisions either between files in the same directory with different names or +between files in different directories with the same name, this hash is +generated using both the (containing) directory's unique GFID and the file's +name. This hash is then matched to one of the layout assignments, to yield +what we call a *hashed location*. For example, consider the layout shown +above. The hash 0xabad1dea is between 0x80000000 and 0xbfffffff, so the +corresponding file's hashed location would be on Brick B. A second file with a +hash of 0xfaceb00c would be assigned to Brick C by the same reasoning.
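Matching a hash against a layout is a range lookup, which the following sketch illustrates. It is not DHT's implementation: the hash is taken as given rather than computed from the GFID and name, and the list `LAYOUT` and function `hashed_location` are hypothetical names. The ranges are the three-brick example layout, with Brick A covering 0x00000000 to 0x7fffffff.

```python
# Example layout: (brick name, first hash in range, last hash in range).
LAYOUT = [
    ("Brick A", 0x00000000, 0x7FFFFFFF),
    ("Brick B", 0x80000000, 0xBFFFFFFF),
    ("Brick C", 0xC0000000, 0xFFFFFFFF),
]


def hashed_location(file_hash, layout=LAYOUT):
    """Return the brick whose assigned range contains file_hash."""
    for brick, start, end in layout:
        if start <= file_hash <= end:
            return brick
    # No range covers this hash, i.e. the layout has a hole.
    return None


print(hashed_location(0xABAD1DEA))
print(hashed_location(0xFACEB00C))
```

The two calls reproduce the placements worked out in the text: the first hash falls in Brick B's range and the second in Brick C's.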
+ +## Looking Up Files + +Because layout assignments might change, especially as bricks are added or +removed, finding a file involves more than calculating its hashed location and +looking there. That is in fact the first step, and works most of the time - +i.e. the file is found where we expected it to be - but there are a few more +steps when that's not the case. Historically, the next step has been to look +for the file **everywhere** - i.e. to broadcast our lookup request to all +subvolumes. If the file isn't found that way, it doesn't exist. At this +point, an open that requires the file's presence will fail, or a create/mkdir +that requires its absence will be allowed to continue. + +Regardless of whether a file is found at its hashed location or elsewhere, we +now know its *cached location*. As the name implies, this is stored within DHT +to satisfy future lookups. If it's not the same as the hashed location, we +also take an extra step. This step is the creation of a *linkfile*, which is a +special stub left at the **hashed** location pointing to the **cached** +location. Therefore, if a client naively looks for a file at its hashed +location and finds a linkfile instead, it can use that linkfile to look up the +file where it really is instead of needing to inquire everywhere. + +## Rebalancing + +As bricks are added or removed, or files are renamed, many files can end up +somewhere other than at their hashed locations. When this happens, the volumes +need to be rebalanced. This process consists of two parts. + + 1. Calculate new layouts, according to the current set of bricks (and possibly + their characteristics). We call this the "fix-layout" phase. + + 2. Migrate any "misplaced" files to their correct (hashed) locations, and + clean up any linkfiles which are no longer necessary. We call this the + "migrate-data" phase. + +Usually, these two phases are done together. (In fact, the code for them is +somewhat intermingled.) 
However, the migrate-data phase can involve a lot of +I/O and be very disruptive, so users can do just the fix-layout phase and defer +migrate-data until a more convenient time. This allows new files to be placed +on new bricks, even though old files might still be in the "wrong" place. + +When calculating a new layout to replace an old one, DHT specifically tries to +maximize overlap of the assigned ranges, thus minimizing data movement. This +difference can be very large. For example, consider the case where our example +layout from earlier is updated to add a new double-sized brick. Here's a very +inefficient way to do that. + + * Brick A (the large one): 0x00000000 to 0x55555555 + + * Brick B: 0x55555556 to 0x7fffffff + + * Brick C: 0x80000000 to 0xaaaaaaaa + + * Brick D (the new one): 0xaaaaaaab to 0xffffffff + +This would cause files in the following ranges to be migrated: + + * 0x55555556 to 0x7fffffff (from A to B) + + * 0x80000000 to 0xaaaaaaaa (from B to C) + + * 0xaaaaaaab to 0xbfffffff (from B to D) + + * 0xc0000000 to 0xffffffff (from C to D) + +As an historical note, this is exactly what we used to do, and in this case it +would have meant moving 2/3 of all files in the volume (the four ranges above +cover 1/6 + 1/6 + 1/12 + 1/4 of the hash space). Now let's consider a +new layout that's optimized to maximize overlap with the old one. + + * Brick A: 0x00000000 to 0x55555555 + + * Brick D: 0x55555556 to 0xaaaaaaaa <- optimized insertion point + + * Brick B: 0xaaaaaaab to 0xd5555554 + + * Brick C: 0xd5555555 to 0xffffffff + +In this case we only need to move 5/12 of all files. In a volume with millions +or even billions of files, reducing data movement by 1/4 of all files is a +pretty big improvement. In the future, DHT might use "virtual node IDs" or +multiple hash rings to make rebalancing even more efficient. + +## Rename Optimizations + +With the file-lookup mechanisms we already have in place, it's not necessary to +move a file from one brick to another when it's renamed - even across +directories.
It will still be found, albeit a little less efficiently. The +first client to look for it after the rename will add a linkfile, which every +other client will follow from then on. Also, every client that has found the +file once will continue to find it based on its cached location, without any +network traffic at all. Because the extra lookup cost is small, and the +movement cost might be very large, DHT renames the file "in place" on its +current brick instead (taking advantage of the fact that directories exist +everywhere). + +This optimization is further extended to handle cases where renames are very +common. For example, rsync and similar tools often use a "write new then +rename" idiom in which a file "xxx" is actually written as ".xxx.1234" and then +moved into place only after its contents have been fully written. To make this +process more efficient, DHT uses a regular expression to separate the permanent +part of a file's name (in this case "xxx") from what is likely to be a +temporary part (the leading "." and trailing ".1234"). That way, after the +file is renamed it will be in its correct hashed location - which it wouldn't +be otherwise if "xxx" and ".xxx.1234" hash differently - and no linkfiles or +broadcast lookups will be necessary. + +In fact, there are two regular expressions available for this purpose - +*cluster.rsync-hash-regex* and *cluster.extra-hash-regex*. As its name +implies, *rsync-hash-regex* defaults to the pattern that rsync uses, while +*extra-hash-regex* can be set by the user to support a second tool using the +same temporary-file idiom. + +## Commit Hashes + +A very recent addition to DHT's algorithmic arsenal is intended to reduce the +number of "broadcast" lookups that it issues. If a volume is completely in +balance, then no file could exist anywhere but at its hashed location. +Therefore, if we've already looked there and not found it, then looking +elsewhere would be pointless (and wasteful).
The *commit hash* mechanism is +used to detect this case. A commit hash is assigned to a volume, and +separately to each directory, and then updated according to the following +rules. + + * The volume commit hash is changed whenever actions are taken that might + cause layout assignments across all directories to become invalid - i.e. + bricks being added, removed, or replaced. + + * The directory commit hash is changed whenever actions are taken that might + cause files to be "misplaced" - e.g. when they're renamed. + + * The directory commit hash is set to the volume commit hash when the + directory is created, and whenever the directory is fully rebalanced so that + all files are at their hashed locations. + +In other words, whenever either the volume or directory commit hash is changed, +that creates a mismatch. In that case we revert to the "pessimistic" +broadcast-lookup method described earlier. However, if the two hashes match +then we can skip the broadcast lookup and return a result immediately. +This has been observed to cause a 3x performance improvement in workloads that +involve creating many small files across many bricks. + +[dynamo]: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf diff --git a/done/Features/distributed-geo-rep.md b/done/Features/distributed-geo-rep.md new file mode 100644 index 0000000..0a3183d --- /dev/null +++ b/done/Features/distributed-geo-rep.md @@ -0,0 +1,71 @@ +Introduction +============ + +This document goes through the new design of distributed geo-replication, its features and the nature of changes involved. First we list some of the important features.
+ + - Distributed asynchronous replication + - Fast and versatile change detection + - Replica failover + - Hardlink synchronization + - Effective handling of deletes and renames + - Configurable sync engine (rsync, tar+ssh) + - Adaptive to a wide variety of workloads + - GFID synchronization + +Geo-replication makes use of the all new *journaling* infrastructure (a.k.a. changelog) to achieve great performance and feature improvements as mentioned above. To understand more about changelogging and the helper library (*libgfchangelog*) refer to document: doc/features/geo-replication/libgfchangelog.md + +Data Replication +---------------- + +Geo-replication is responsible for incrementally replicating data from the master node to the slave. But isn't that similar to what AFR does? Yes, but here the slave is located geographically distant from the master. Geo-replication follows the eventually consistent replication model, which implies that, at any point of time, the slave would be lagging w.r.t. master, but would eventually catch up. Replication performance is dependent on two crucial factors: + - Network latency + - Change detection + +Network latency is something that is not in direct control for many reasons, but still there is always a best effort. Therefore, geo-replication offloads the data replication part to common UNIX file transfer utilities. We choose the granddaddy of file transfers [rsync(1)] [1] as the default synchronization engine, as it's best known for its diff transfer algorithm for efficient usage of network and lightning fast transfers (let alone the flexibility). But what about small files performance? Due to its checksumming algorithm, rsync has more overhead for small files -- the overhead of checksumming outweighs the bytes to be transferred for small files. Therefore, geo-replication can also use a combination of tar piped over ssh to transfer a large number of small files. Tests have shown a great improvement over standard rsync.
However, the sync engine is not yet dynamic to the file type and needs to be chosen manually by a configuration option. + +OTOH, change detection is something that is in full control of the application. Earlier (< release 3.5), geo-replication would perform a file system crawl to identify changes in the file system. This was not an unintelligent *check-every-single-inode* in the file system, but crawl logic based on *xtime*. xtime is an extended attribute maintained by the *marker* translator for each inode on the master and follows an upward-recursive marking pattern. Geo-replication would traverse a directory based on this simple condition: + +> xtime(master) > xtime(slave) + +E.g.: + +> MASTER SLAVE +> +> /\ /\ +> d0 dir0 d0 dir0 +> / \ / \ +> d1 dir1 d1 dir1 +> / / +> d2 d2 +> / / +> file0 file0 + +Consider the directory tree above. Assume that master and slave were in sync and the following operation happens on master: +``` +touch /d0/d1/d2/file0 +``` +This would trigger an xtime marking (xtime being the current timestamp) from the leaf (*file0*) up to the root (*/*), i.e. an *xattr* update on *file0*, *d2*, *d1*, *d0* and finally */*. The geo-replication daemon would crawl the file system based on the condition mentioned before and hence would only crawl the **left** part of the directory tree (as the **right** part would have equal xtimes). + +Although the above crawling algorithm is fast, it still has to crawl a good part of the file system. Also, to decide whether to crawl a particular subdirectory, geo-rep needs to compare xtime -- which is basically a **getxattr()** call on the master and slave (remember, *slave* is over a WAN). + +Therefore, in 3.5 the need arose to take crawling to the next level. Geo-replication now uses the changelogging infrastructure to identify changes in the filesystem. Actually, there is absolutely no crawl involved. Changelogging based detection is notification based.
The geo-replication daemon registers itself with the changelog consumer library (*libgfchangelog*) and basically invokes a set of APIs to get the list of changes in the filesystem and replays them onto the slave. No crawl and no extended attribute of any kind is involved. + +Distributed Geo-Replication +--------------------------- +Geo-replication (also known as gsyncd or geo-rep) used to be non-distributed before release 3.5. The node on which the geo-rep start command was executed was responsible for replicating data to the slave. If this node went offline for some reason (reboot, crash, etc.), replication would cease. So one of the main development efforts for release 3.5 was to *distributify* geo-replication. The geo-rep daemon running on each node (per brick) is responsible for replicating data **local** to each brick. This results in full parallelism and effective use of cluster/network resources. + +With release 3.5, the geo-rep start command spawns a geo-replication daemon on each node in the master cluster (one per brick). The geo-rep *status* command shows geo-rep session status from each master node. Similarly, *stop* gracefully tears down the session from all nodes. + +What else is synced? +-------------------- + - GFID: Synchronizing the inode number (GFID) between master and the slave helps in synchronizing hardlinks. + - Purges are also handled effectively as there is no entry comparison between master and slave. With changelog replay, geo-rep performs the unlink operation without having to resort to an expensive **readdir()** over the WAN. + - Renames: With earlier geo-replication, because of the path based nature of crawling, renames were actually a delete and a create on the slave, followed by data transfer (not to mention the inode number change). Now, with changelogging, it's actually a **rename()** call on the slave.
+ +Replica Failover +---------------- +One of the basic volume configurations is a replicated volume (synchronous replication). Having geo-replication sync data from all replicas would waste network bandwidth and could possibly corrupt data on the slave (though that's unlikely). Therefore, geo-rep on such volume configurations works in an **ACTIVE** and **PASSIVE** mode. The geo-rep daemon on one of the replicas is responsible for replicating data (**ACTIVE**), while the other geo-rep daemon does nothing (**PASSIVE**). + +When the *ACTIVE* node goes offline, the *PASSIVE* node detects this (with a lag of at most 60 seconds) and switches to *ACTIVE*, thereby taking over the role of replicating data from where the earlier *ACTIVE* node left off. This keeps data replication going even on node reboots/failures. + +[1]:http://rsync.samba.org diff --git a/done/Features/file-snapshot.md b/done/Features/file-snapshot.md new file mode 100644 index 0000000..7f7c419 --- /dev/null +++ b/done/Features/file-snapshot.md @@ -0,0 +1,91 @@ +#File Snapshot +This feature gives the ability to take snapshots of files. + +##Description +This feature adds file snapshotting support to glusterfs. Snapshots can be created, deleted and reverted. + +To take a snapshot of a file, the file should be in QCOW2 format, as the code for the block layer snapshot has been taken from Qemu and put into gluster as a translator. + +With this feature, glusterfs will have better integration with Openstack Cinder, and in general the ability to take snapshots of files (typically VM images). + +A new extended attribute (xattr) will be added to identify files which are 'snapshot managed' vs raw files. + +##Volume Options +The following volume option needs to be set on the volume for taking file snapshots.
+ + # features.file-snapshot on +##CLI parameters +The following CLI parameters need to be passed to the setfattr command to create, delete and revert file snapshots. + + # trusted.glusterfs.block-format + # trusted.glusterfs.block-snapshot-create + # trusted.glusterfs.block-snapshot-goto +##Fully loaded Example +Download glusterfs3.5 rpms from download.gluster.org +Install these rpms. + +Start glusterd by using the command + + # service glusterd start +Now create a volume by using the command + + # gluster volume create +Run the command below to make sure that the volume is created. + + # gluster volume info +Now turn on the snapshot feature on the volume by using the command + + # gluster volume set features.file-snapshot on +Verify that the option is set by using the command + + # gluster volume info +The user should be able to see another option in the volume info + + # features.file-snapshot: on +Now mount the volume using fuse mount + + # mount -t glusterfs +cd into the mount point + # cd + # touch +The size of the file can be set and the format of the file can be changed to QCOW2 by running the command below. File size can be in KB/MB/GB + + # setfattr -n trusted.glusterfs.block-format -v qcow2: +Now create another file and send data to that file by running the command + + # echo 'ABCDEFGHIJ' > +Copy the data from one file to the other by running the command + + # dd if=data-file1 of=big-file conv=notrunc +Now take the `snapshot of the file` by running the command + + # setfattr -n trusted.glusterfs.block-snapshot-create -v +Add some more contents to the file and take another file snapshot by doing the following steps + + # echo '1234567890' > + # dd if= of= conv=notrunc + # setfattr -n trusted.glusterfs.block-snapshot-create -v +Now `revert` both the file snapshots and write data to some files so that data can be compared.
+ + # setfattr -n trusted.glusterfs.block-snapshot-goto -v + # dd if= of= bs=11 count=1 + # setfattr -n trusted.glusterfs.block-snapshot-goto -v + # dd if= of= bs=11 count=1 +Now read the contents of the files and compare as below: + + # cat , and compare contents. + # cat , and compare contents. +##One line description for the variables used +file_name = File which is created in the mount point initially. + +data_file1 = File which contains data 'ABCDEFGHIJ' + +image1 = First file snapshot which has 'ABCDEFGHIJ' + some null values. + +data_file2 = File which contains data '1234567890' + +image2 = Second file snapshot which has '1234567890' + some null values. + +out_file1 = After reverting image1 this contains 'ABCDEFGHIJ' + +out_file2 = After reverting image2 this contains '1234567890' diff --git a/done/Features/gfid-access.md b/done/Features/gfid-access.md new file mode 100644 index 0000000..2d324a1 --- /dev/null +++ b/done/Features/gfid-access.md @@ -0,0 +1,73 @@ +#Gfid-access Translator +The 'gfid-access' translator provides access to data in glusterfs using a +virtual path. This particular translator is designed to provide direct access to +files in glusterfs using their gfids. 'GFID' is glusterfs's inode number for a file +to identify it uniquely. As of now, Geo-replication is the only consumer of this +translator. The changelog translator logs the 'gfid' with corresponding file +operation in journals which are consumed by Geo-Replication to replicate the +files using gfid-access translator very efficiently. + +###Implications and Usage +A new virtual directory called '.gfid' is exposed in the aux-gfid mount +point when gluster volume is mounted with 'aux-gfid-mount' option. +All the gfids of files are exposed in one level under the '.gfid' directory. +No matter at what level the file resides, it is accessed using its +gfid under this virtual directory as shown in example below.
All access +protocols work seamlessly, as the complexities are handled internally. + +###Testing +1. Mount glusterfs client with '-o aux-gfid-mount' as follows. + + mount -t glusterfs -o aux-gfid-mount : + + Example: + + #mount -t glusterfs -o aux-gfid-mount rhs1:master /master-aux-mnt + +2. Get the 'gfid' of a file using normal mount or aux-gfid-mount and do some + operations as follows. + + getfattr -n glusterfs.gfid.string + + Example: + + #getfattr -n glusterfs.gfid.string /master-aux-mnt/file + # file: file + glusterfs.gfid.string="796d3170-0910-4853-9ff3-3ee6b1132080" + + #cat /master-aux-mnt/file + sample data + + #stat /master-aux-mnt/file + File: `file' + Size: 12 Blocks: 1 IO Block: 131072 regular file + Device: 13h/19d Inode: 11525625031905452160 Links: 1 + Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) + Access: 2014-05-23 20:43:33.239999863 +0530 + Modify: 2014-05-23 17:36:48.224999989 +0530 + Change: 2014-05-23 20:44:10.081999938 +0530 + + +3. Access files using virtual path as follows. + + /mountpoint/.gfid/' + + Example: + + #cat /master-aux-mnt/.gfid/796d3170-0910-4853-9ff3-3ee6b1132080 + sample data + #stat /master-aux-mnt/.gfid/796d3170-0910-4853-9ff3-3ee6b1132080 + File: `.gfid/796d3170-0910-4853-9ff3-3ee6b1132080' + Size: 12 Blocks: 1 IO Block: 131072 regular file + Device: 13h/19d Inode: 11525625031905452160 Links: 1 + Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) + Access: 2014-05-23 20:43:33.239999863 +0530 + Modify: 2014-05-23 17:36:48.224999989 +0530 + Change: 2014-05-23 20:44:10.081999938 +0530 + + Note that the 'cat' command displays the same data whether the 'file' is + accessed by its path or by its virtual path. Similarly, the 'stat' command gives + the same Inode Number for both paths, confirming that it is the same file. + +###Nature of changes +This feature is introduced with 'gfid-access' translator.
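
The mapping from a gfid to its virtual path, as used in the Testing steps above, can be sketched as plain string handling. This is an illustrative helper only: `parse_gfid` and `gfid_path` are hypothetical names that are not part of glusterfs, and the sketch assumes the `getfattr` output has the shape shown in step 2.

```python
import re
import uuid
from pathlib import PurePosixPath

# Matches the attribute line printed by `getfattr -n glusterfs.gfid.string`.
GFID_LINE = re.compile(r'^glusterfs\.gfid\.string="([0-9a-fA-F-]{36})"$')


def parse_gfid(getfattr_output: str) -> str:
    """Extract the gfid string from getfattr output, validating it as a UUID."""
    for line in getfattr_output.splitlines():
        match = GFID_LINE.match(line.strip())
        if match:
            return str(uuid.UUID(match.group(1)))
    raise ValueError("no glusterfs.gfid.string attribute found")


def gfid_path(mountpoint: str, gfid: str) -> str:
    """Build the virtual path of a file under the '.gfid' directory."""
    return str(PurePosixPath(mountpoint) / ".gfid" / gfid)


# Output copied from the example in step 2.
sample = '''# file: file
glusterfs.gfid.string="796d3170-0910-4853-9ff3-3ee6b1132080"
'''

virtual_path = gfid_path("/master-aux-mnt", parse_gfid(sample))
print(virtual_path)
```

The resulting path is flat regardless of how deep the file sits in the volume, which is the property the '.gfid' directory provides.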
diff --git a/done/Features/glusterfs_nfs-ganesha_integration.md b/done/Features/glusterfs_nfs-ganesha_integration.md new file mode 100644 index 0000000..b306715 --- /dev/null +++ b/done/Features/glusterfs_nfs-ganesha_integration.md @@ -0,0 +1,123 @@ +# GlusterFS and NFS-Ganesha integration + +NFS-Ganesha can support NFS (v3, 4.0, 4.1 pNFS) and 9P (from the Plan9 operating system) protocols concurrently. It provides a FUSE-compatible File System Abstraction Layer (FSAL) to allow file-system developers to plug in their own storage mechanism and access it from any NFS client. + +With NFS-Ganesha, the NFS client talks to the NFS-Ganesha server, which already runs in user address space. NFS-Ganesha can access FUSE filesystems directly through its FSAL without copying any data to or from the kernel, thus potentially improving response times. Of course the network streams themselves (TCP/UDP) will still be handled by the Linux kernel when using NFS-Ganesha. + +GlusterFS has also recently been integrated with NFS-Ganesha, using “libgfapi”, to export volumes created via glusterfs. libgfapi is a new userspace library developed to access data in glusterfs. It performs I/O on gluster volumes directly without a FUSE mount. It is a filesystem-like API which runs in the application process context (which is NFS-Ganesha here) and eliminates the use of fuse and the kernel vfs layer from the glusterfs volume access. Thus by integrating NFS-Ganesha and libgfapi, the speed and latency have been improved compared to FUSE mount access. + +### 1.) Pre-requisites + + - Before starting to set up NFS-Ganesha, a GlusterFS volume should be created.
+ - Disable kernel-nfs, gluster-nfs services on the system using the following commands + - service nfs stop + - gluster vol set nfs.disable ON (Note: this command has to be repeated for all the volumes in the trusted-pool) + - Usually the libgfapi.so* files are installed in “/usr/lib” or “/usr/local/lib”, based on whether you have installed glusterfs using rpm or sources. Verify if those libgfapi.so* files are linked in “/usr/lib64″ and “/usr/local/lib64″ as well. If not create the links for those .so files in those directories. + +### 2.) Installing nfs-ganesha + +##### i) using rpm install + + - nfs-ganesha rpms are available in Fedora19 or later packages. So to install nfs-ganesha, run + - *#yum install nfs-ganesha* + - Using CentOS or EL, download the rpms from the below link : + - http://download.gluster.org/pub/gluster/glusterfs/nfs-ganesha + +##### ii) using sources + + - cd /root + - git clone git://github.com/nfs-ganesha/nfs-ganesha.git + - cd nfs-ganesha/ + - git submodule update --init + - git checkout -b next origin/next (Note : origin/next is the current development branch) + - rm -rf ~/build; mkdir ~/build ; cd ~/build + - cmake -DUSE_FSAL_GLUSTER=ON -DCURSES_LIBRARY=/usr/lib64 -DCURSES_INCLUDE_PATH=/usr/include/ncurses -DCMAKE_BUILD_TYPE=Maintainer /root/nfs-ganesha/src/ + - make; make install +> Note: libcap-devel, libnfsidmap, dbus-devel, libacl-devel ncurses* packages +> may need to be installed prior to running this command. For Fedora, libjemalloc, +> libjemalloc-devel may also be required. + +### 3.) Run nfs-ganesha server + + - To start nfs-ganesha manually, execute the following command: + - *#ganesha.nfsd -f -L -N -d + +```sh +For example: +#ganesha.nfsd -f nfs-ganesha.conf -L nfs-ganesha.log -N NIV_DEBUG -d +where: +nfs-ganesha.log is the log file for the ganesha.nfsd process. +nfs-ganesha.conf is the configuration file +NIV_DEBUG is the log level. 
+``` + - To check if nfs-ganesha has started, execute the following command: + - *#ps aux | grep ganesha* + - By default '/' will be exported + +### 4.) Exporting GlusterFS volume via nfs-ganesha + +#####step 1 : + +To export any GlusterFS volume or a directory inside a volume, create an EXPORT block for each of those entries in a .conf file, for example export.conf. The following parameters are required to export any entry. +- *#cat export.conf* + +```sh +EXPORT{ + Export_Id = 1 ; # Export ID unique to each export + Path = "volume_path"; # Path of the volume to be exported. Eg: "/test_volume" + + FSAL { + name = GLUSTER; + hostname = "10.xx.xx.xx"; # IP of one of the nodes in the trusted pool + volume = "volume_name"; # Volume name. Eg: "test_volume" + } + + Access_type = RW; # Access permissions + Squash = No_root_squash; # To enable/disable root squashing + Disable_ACL = TRUE; # To enable/disable ACL + Pseudo = "pseudo_path"; # NFSv4 pseudo path for this export. Eg: "/test_volume_pseudo" + Protocols = "3","4" ; # NFS protocols supported + Transports = "UDP","TCP" ; # Transport protocols supported + SecType = "sys"; # Security flavors supported +} +``` + +#####step 2 : + +Define/copy the “nfs-ganesha.conf” file to a suitable location. This file is available in “/etc/glusterfs-ganesha” on installation of nfs-ganesha rpms. If using the sources, rename the “/root/nfs-ganesha/src/FSAL/FSAL_GLUSTER/README” file to “nfs-ganesha.conf”. + +#####step 3 : + +Now include the “export.conf” file in nfs-ganesha.conf. This can be done by adding the line below at the end of nfs-ganesha.conf. + - %include "export.conf" + +#####step 4 : + + - Run the ganesha server as mentioned in section 3 + - To check if the volume is exported, run + - *#showmount -e localhost* + +### 5.)
Additional Notes + +To switch back to gluster-nfs/kernel-nfs, kill the ganesha daemon and start those services using the below commands : + + - pkill ganesha + - service nfs start (for kernel-nfs) + - gluster v set nfs.disable off + + +### 6.) References + + - Setup and create glusterfs volumes : +http://www.gluster.org/community/documentation/index.php/QuickStart + + - NFS-Ganesha wiki : https://github.com/nfs-ganesha/nfs-ganesha/wiki + + - Sample configuration files + - /root/nfs-ganesha/src/config_samples/gluster.conf + - https://github.com/nfs-ganesha/nfs-ganesha/blob/master/src/config_samples/gluster.conf + + - https://forge.gluster.org/nfs-ganesha-and-glusterfs-integration/pages/Home + + - http://blog.gluster.org/2014/09/glusterfs-and-nfs-ganesha-integration/ + diff --git a/done/Features/heal-info-and-split-brain-resolution.md b/done/Features/heal-info-and-split-brain-resolution.md new file mode 100644 index 0000000..6ca2be2 --- /dev/null +++ b/done/Features/heal-info-and-split-brain-resolution.md @@ -0,0 +1,448 @@ +The following document explains the usage of volume heal info and split-brain +resolution commands. + +##`gluster volume heal info [split-brain]` commands +###volume heal info +Usage: `gluster volume heal info` + +This lists all the files that need healing (either their path or +GFID is printed). +###Interpretting the output +All the files that are listed in the output of this command need healing to be +done. Apart from this, there are 2 special cases that may be associated with +an entry - +a) Is in split-brain +       A file in data/metadata split-brain will +be listed with " - Is in split-brain" appended after its path/gfid. Eg., +"/file4" in the output provided below. But for a gfid split-brain, + the parent directory of the file is shown to be in split-brain and the file +itself is shown to be needing heal. Eg., "/dir" in the output provided below +which is in split-brain because of gfid split-brain of file "/dir/a". 
+b) Is possibly undergoing heal +       A file is said to be possibly undergoing + heal because it is possible that the file was undergoing heal when heal status +was being determined but it cannot be said for sure. It could so have happened +that self-heal daemon and glfsheal process that is trying to get heal information +are competing for the same lock leading to such conclusion. Another possible case + could be multiple glfsheal processes running simultaneously (e.g., multiple users + ran heal info command at the same time), competing for same lock. + +The following is an example of heal info command's output. +###Example +Consider a replica volume "test" with 2 bricks b1 and b2; +self-heal daemon off, mounted at /mnt. + +`gluster volume heal test info` +~~~ +Brick \ + - Is in split-brain + - Is in split-brain + - Is in split-brain + + +Number of entries: 4 + +Brick +/dir/file2 +/dir/file1 - Is in split-brain +/dir - Is in split-brain +/dir/file3 +/file4 - Is in split-brain +/dir/a + + +Number of entries: 6 +~~~ + +###Analysis of the output +It can be seen that +A) from brick b1 4 entries need healing: +      1) file with gfid:6dc78b20-7eb6-49a3-8edb-087b90142246 needs healing +      2) "aaca219f-0e25-4576-8689-3bfd93ca70c2", +"39f301ae-4038-48c2-a889-7dac143e82dd" and "c3c94de2-232d-4083-b534-5da17fc476ac" + are in split-brain + +B) from brick b2 6 entries need healing- +      1) "a", "file2" and "file3" need healing +      2) "file1", "file4" & "/dir" are in split-brain + +###volume heal info split-brain +Usage: `gluster volume heal info split-brain` +This command shows all the files that are in split-brain. +##Example +`gluster volume heal test info split-brain` +~~~ +Brick + + + +Number of entries in split-brain: 3 + +Brick +/dir/file1 +/dir +/file4 +Number of entries in split-brain: 3 +~~~ +Note that, similar to heal info command, for gfid split-brains (same filename but different gfid) +their parent directories are listed to be in split-brain. 
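
The `heal info` output format shown above lends itself to simple post-processing. The following is a minimal sketch, assuming the output keeps the layout of the example (a `Brick` line, one entry per line, and the " - Is in split-brain" suffix). The function name `split_brain_entries` and the brick name `test-host:/test/b2` in the sample are illustrative assumptions, not output captured from a real cluster.

```python
from typing import Dict, List, Optional

SPLIT_BRAIN_SUFFIX = " - Is in split-brain"


def split_brain_entries(heal_info: str) -> Dict[str, List[str]]:
    """Group the entries flagged as split-brain by the brick that reported them."""
    result: Dict[str, List[str]] = {}
    brick: Optional[str] = None
    for raw in heal_info.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            brick = line[len("Brick "):]
            result[brick] = []
        elif brick is not None and line.endswith(SPLIT_BRAIN_SUFFIX):
            # Keep only the path or gfid, dropping the status suffix.
            result[brick].append(line[: -len(SPLIT_BRAIN_SUFFIX)])
    return result


# Entries copied from the second brick in the example; the brick name is assumed.
sample = """\
Brick test-host:/test/b2
/dir/file2
/dir/file1 - Is in split-brain
/dir - Is in split-brain
/dir/file3
/file4 - Is in split-brain
/dir/a

Number of entries: 6
"""

entries = split_brain_entries(sample)
print(entries)
```

Entries without the suffix (such as `/dir/a`) still need healing but are left out here, which mirrors what `heal info split-brain` reports for the same brick.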
+ +##Resolution of split-brain using CLI +Once the files in split-brain are identified, their resolution can be done +from the command line. Note that entry/gfid split-brain resolution is not supported. +Split-brain resolution commands let the user resolve split-brain in 3 ways. +###Select the bigger-file as source +This command is useful for per file healing where it is known/decided that the +file with bigger size is to be considered as source. +1.`gluster volume heal split-brain bigger-file ` +`` can be either the full file name as seen from the root of the volume +(or) the gfid-string representation of the file, which sometimes gets displayed +in the heal info command's output. +Once this command is executed, the replica containing the FILE with bigger +size is found out and heal is completed with it as source. + +###Example : +Consider the above output of heal info split-brain command. + +Before healing the file, notice file size and md5 checksums : +~~~ +On brick b1: +# stat b1/dir/file1 + File: ‘b1/dir/file1’ + Size: 17 Blocks: 16 IO Block: 4096 regular file +Device: fd03h/64771d Inode: 919362 Links: 2 +Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) +Access: 2015-03-06 13:55:40.149897333 +0530 +Modify: 2015-03-06 13:55:37.206880347 +0530 +Change: 2015-03-06 13:55:37.206880347 +0530 + Birth: - + +# md5sum b1/dir/file1 +040751929ceabf77c3c0b3b662f341a8 b1/dir/file1 + +On brick b2: +# stat b2/dir/file1 + File: ‘b2/dir/file1’ + Size: 13 Blocks: 16 IO Block: 4096 regular file +Device: fd03h/64771d Inode: 919365 Links: 2 +Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) +Access: 2015-03-06 13:54:22.974451898 +0530 +Modify: 2015-03-06 13:52:22.910758923 +0530 +Change: 2015-03-06 13:52:22.910758923 +0530 + Birth: - +# md5sum b2/dir/file1 +cb11635a45d45668a403145059c2a0d5 b2/dir/file1 +~~~ +Healing file1 using the above command - +`gluster volume heal test split-brain bigger-file /dir/file1` +Healed /dir/file1. 
+ +After healing is complete, the md5sum and file size on both bricks should be the same. +~~~ +On brick b1: +# stat b1/dir/file1 + File: ‘b1/dir/file1’ + Size: 17 Blocks: 16 IO Block: 4096 regular file +Device: fd03h/64771d Inode: 919362 Links: 2 +Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) +Access: 2015-03-06 14:17:27.752429505 +0530 +Modify: 2015-03-06 13:55:37.206880347 +0530 +Change: 2015-03-06 14:17:12.880343950 +0530 + Birth: - +# md5sum b1/dir/file1 +040751929ceabf77c3c0b3b662f341a8 b1/dir/file1 + +On brick b2: +# stat b2/dir/file1 + File: ‘b2/dir/file1’ + Size: 17 Blocks: 16 IO Block: 4096 regular file +Device: fd03h/64771d Inode: 919365 Links: 2 +Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) +Access: 2015-03-06 14:17:23.249403600 +0530 +Modify: 2015-03-06 13:55:37.206880000 +0530 +Change: 2015-03-06 14:17:12.881343955 +0530 + Birth: - + +# md5sum b2/dir/file1 +040751929ceabf77c3c0b3b662f341a8 b2/dir/file1 +~~~ +###Select one replica as source for a particular file +2.`gluster volume heal split-brain source-brick ` +`` is selected as source brick, +FILE present in the source brick is taken as source for healing. + +###Example : +Notice the md5 checksums and file size before and after heal. 
+ +Before heal : +~~~ +On brick b1: + + stat b1/file4 + File: ‘b1/file4’ + Size: 4 Blocks: 16 IO Block: 4096 regular file +Device: fd03h/64771d Inode: 919356 Links: 2 +Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) +Access: 2015-03-06 13:53:19.417085062 +0530 +Modify: 2015-03-06 13:53:19.426085114 +0530 +Change: 2015-03-06 13:53:19.426085114 +0530 + Birth: - +# md5sum b1/file4 +b6273b589df2dfdbd8fe35b1011e3183 b1/file4 + +On brick b2: + +# stat b2/file4 + File: ‘b2/file4’ + Size: 4 Blocks: 16 IO Block: 4096 regular file +Device: fd03h/64771d Inode: 919358 Links: 2 +Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) +Access: 2015-03-06 13:52:35.761833096 +0530 +Modify: 2015-03-06 13:52:35.769833142 +0530 +Change: 2015-03-06 13:52:35.769833142 +0530 + Birth: - +# md5sum b2/file4 +0bee89b07a248e27c83fc3d5951213c1 b2/file4 +~~~ +`gluster volume heal test split-brain source-brick test-host:/test/b1 gfid:c3c94de2-232d-4083-b534-5da17fc476ac` +Healed gfid:c3c94de2-232d-4083-b534-5da17fc476ac. + +After healing : +~~~ +On brick b1: +# stat b1/file4 + File: ‘b1/file4’ + Size: 4 Blocks: 16 IO Block: 4096 regular file +Device: fd03h/64771d Inode: 919356 Links: 2 +Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) +Access: 2015-03-06 14:23:38.944609863 +0530 +Modify: 2015-03-06 13:53:19.426085114 +0530 +Change: 2015-03-06 14:27:15.058927962 +0530 + Birth: - +# md5sum b1/file4 +b6273b589df2dfdbd8fe35b1011e3183 b1/file4 + +On brick b2: +# stat b2/file4 + File: ‘b2/file4’ + Size: 4 Blocks: 16 IO Block: 4096 regular file +Device: fd03h/64771d Inode: 919358 Links: 2 +Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) +Access: 2015-03-06 14:23:38.944609000 +0530 +Modify: 2015-03-06 13:53:19.426085000 +0530 +Change: 2015-03-06 14:27:15.059927968 +0530 + Birth: - +# md5sum b2/file4 +b6273b589df2dfdbd8fe35b1011e3183 b2/file4 +~~~ +Note that, as mentioned earlier, entry split-brain and gfid split-brain healing + are not supported using CLI. 
However, they can be fixed using the method described + [here](https://github.com/gluster/glusterfs/blob/master/doc/debugging/split-brain.md). +###Example: +Trying to heal /dir would fail as it is in entry split-brain. +`gluster volume heal test split-brain source-brick test-host:/test/b1 /dir` +Healing /dir failed:Operation not permitted. +Volume heal failed. + +3.`gluster volume heal split-brain source-brick ` +Consider a scenario where many files are in split-brain such that one brick of +replica pair is source. As the result of the above command all split-brained +files in `` are selected as source and healed to the sink. + +###Example: +Consider a volume having three entries "a, b and c" in split-brain. +~~~ +`gluster volume heal test split-brain source-brick test-host:/test/b1` +Healed gfid:944b4764-c253-4f02-b35f-0d0ae2f86c0f. +Healed gfid:3256d814-961c-4e6e-8df2-3a3143269ced. +Healed gfid:b23dd8de-af03-4006-a803-96d8bc0df004. +Number of healed entries: 3 +~~~ + +## An overview of working of heal info commands +When these commands are invoked, a "glfsheal" process is spawned which reads +the entries from `//.glusterfs/indices/xattrop/` directory of all +the bricks that are up (that it can connect to) one after another. These +entries are GFIDs of files that might need healing. Once GFID entries from a +brick are obtained, based on the lookup response of this file on each +participating brick of replica-pair & trusted.afr.* extended attributes it is +found out if the file needs healing, is in split-brain etc based on the +requirement of each command and displayed to the user. + + +##Resolution of split-brain from the mount point +A set of getfattr and setfattr commands have been provided to detect the data and metadata split-brain status of a file and resolve split-brain, if any, from mount point. + +Consider a volume "test", having bricks b0, b1, b2 and b3. 
+ +~~~ +# gluster volume info test + +Volume Name: test +Type: Distributed-Replicate +Volume ID: 00161935-de9e-4b80-a643-b36693183b61 +Status: Started +Number of Bricks: 2 x 2 = 4 +Transport-type: tcp +Bricks: +Brick1: test-host:/test/b0 +Brick2: test-host:/test/b1 +Brick3: test-host:/test/b2 +Brick4: test-host:/test/b3 +~~~ + +Directory structure of the bricks is as follows: + +~~~ +# tree -R /test/b? +/test/b0 +├── dir +│   └── a +└── file100 + +/test/b1 +├── dir +│   └── a +└── file100 + +/test/b2 +├── dir +├── file1 +├── file2 +└── file99 + +/test/b3 +├── dir +├── file1 +├── file2 +└── file99 +~~~ + +Some files in the volume are in split-brain. +~~~ +# gluster v heal test info split-brain +Brick test-host:/test/b0/ +/file100 +/dir +Number of entries in split-brain: 2 + +Brick test-host:/test/b1/ +/file100 +/dir +Number of entries in split-brain: 2 + +Brick test-host:/test/b2/ +/file99 + +Number of entries in split-brain: 2 + +Brick test-host:/test/b3/ + + +Number of entries in split-brain: 2 +~~~ +###To know data/metadata split-brain status of a file: +~~~ +getfattr -n replica.split-brain-status +~~~ +The above command executed from mount provides information if a file is in data/metadata split-brain. Also provides the list of afr children to analyze to get more information about the file. +This command is not applicable to gfid/directory split-brain. + +###Example: +1) "file100" is in metadata split-brain. Executing the above mentioned command for file100 gives : +~~~ +# getfattr -n replica.split-brain-status file100 +# file: file100 +replica.split-brain-status="data-split-brain:no metadata-split-brain:yes Choices:test-client-0,test-client-1" +~~~ + +2) "file1" is in data split-brain. +~~~ +# getfattr -n replica.split-brain-status file1 +# file: file1 +replica.split-brain-status="data-split-brain:yes metadata-split-brain:no Choices:test-client-2,test-client-3" +~~~ + +3) "file99" is in both data and metadata split-brain. 
+~~~ +# getfattr -n replica.split-brain-status file99 +# file: file99 +replica.split-brain-status="data-split-brain:yes metadata-split-brain:yes Choices:test-client-2,test-client-3" +~~~ + +4) "dir" is in directory split-brain but as mentioned earlier, the above command is not applicable to such split-brain. So it says that the file is not under data or metadata split-brain. +~~~ +# getfattr -n replica.split-brain-status dir +# file: dir +replica.split-brain-status="The file is not under data or metadata split-brain" +~~~ + +5) "file2" is not in any kind of split-brain. +~~~ +# getfattr -n replica.split-brain-status file2 +# file: file2 +replica.split-brain-status="The file is not under data or metadata split-brain" +~~~ + +### To analyze the files in data and metadata split-brain +Trying to do operations (say cat, getfattr etc) from the mount on files in split-brain, gives an input/output error. To enable the users analyze such files, a setfattr command is provided. + +~~~ +# setfattr -n replica.split-brain-choice -v "choiceX" +~~~ +Using this command, a particular brick can be chosen to access the file in split-brain from. + +###Example: +1) "file1" is in data-split-brain. Trying to read from the file gives input/output error. +~~~ +# cat file1 +cat: file1: Input/output error +~~~ +Split-brain choices provided for file1 were test-client-2 and test-client-3. + +Setting test-client-2 as split-brain choice for file1 serves reads from b2 for the file. +~~~ +# setfattr -n replica.split-brain-choice -v test-client-2 file1 +~~~ +Now, read operations on the file can be done. +~~~ +# cat file1 +xyz +~~~ +Similarly, to inspect the file from other choice, replica.split-brain-choice is to be set to test-client-3. + +Trying to inspect the file from a wrong choice errors out. + +To undo the split-brain-choice that has been set, the above mentioned setfattr command can be used +with "none" as the value for extended attribute. 
+ +###Example: +~~~ +# setfattr -n replica.split-brain-choice -v none file1 +~~~ +Now performing cat operation on the file will again result in input/output error, as before. +~~~ +# cat file1 +cat: file1: Input/output error +~~~ + +Once the choice for resolving split-brain is made, the source brick must be set for the healing to be done. +This is done using the following command: + +~~~ +# setfattr -n replica.split-brain-heal-finalize -v +~~~ + +###Example: +~~~ +# setfattr -n replica.split-brain-heal-finalize -v test-client-2 file1 +~~~ +The above process can be used to resolve data and/or metadata split-brain on all the files. + +NOTE: +1) If the "fopen-keep-cache" fuse mount option is disabled, then the inode needs to be invalidated each time before selecting a new replica.split-brain-choice to inspect a file. This can be done by using: +~~~ +# setfattr -n inode-invalidate -v 0 +~~~ + +2) The above mentioned process for split-brain resolution from mount will not work on nfs mounts, as nfs does not provide xattr support. diff --git a/done/Features/leases.md b/done/Features/leases.md new file mode 100644 index 0000000..08f2056 --- /dev/null +++ b/done/Features/leases.md @@ -0,0 +1,11 @@ +##Leases + +###API: + +###Lease Semantics in Gluster + +###High level implementation details + +###Known issues + +###TODO diff --git a/done/Features/libgfapi.md b/done/Features/libgfapi.md new file mode 100644 index 0000000..34adf60 --- /dev/null +++ b/done/Features/libgfapi.md @@ -0,0 +1,382 @@ +One of the known methods to access glusterfs is via the fuse module. However, it has some overhead or performance issues because of the number of context switches which need to be performed to complete one i/o transaction[1]. + + +To overcome this limitation, a new method called ‘libgfapi’ is introduced. libgfapi support is available from the GlusterFS-3.4 release. + +libgfapi is a userspace library for accessing data in glusterfs.
The libgfapi library performs IO on gluster volumes directly without a FUSE mount. It is a filesystem-like API and runs in the application process context. libgfapi eliminates the fuse and the kernel vfs layer from the glusterfs volume access. The speed and latency have improved with libgfapi access. [1] + + +Using libgfapi, various user-space filesystems (like NFS-Ganesha or Samba) or the virtualizer (like QEMU) can interact with GlusterFS, which serves as the back-end filesystem. Currently the projects below integrate with glusterfs using libgfapi interfaces. + + +* qemu storage layer +* Samba VFS plugin +* NFS-Ganesha + +All the APIs in libgfapi make use of the `struct glfs` object. This object +contains information about the volume name, the associated glusterfs context, +subvols in the graph etc., which makes it unique for each volume. + + +For any application to make use of libgfapi, it should typically start +with the below APIs in the following order - + +* To create a new glfs object : + + glfs_t *glfs_new (const char *volname) ; + + glfs_new() returns a glfs_t object. + + +* On this newly created glfs_t, you need to set either a volfile path + (glfs_set_volfile) or a volfile server (glfs_set_volfile_server).
+ Incase of failures, the corresponding cleanup routine is + "glfs_unset_volfile_server" + + int glfs_set_volfile (glfs_t *fs, const char *volfile); + + int glfs_set_volfile_server (glfs_t *fs, const char *transport,const char *host, int port) ; + + int glfs_unset_volfile_server (glfs_t *fs, const char *transport,const char *host, int port) ; + +* Specify logging parameters using glfs_set_logging(): + + int glfs_set_logging (glfs_t *fs, const char *logfile, int loglevel) ; + +* Initializes the glfs_t object using glfs_init() + int glfs_init (glfs_t *fs) ; + +#### FOPs APIs available with libgfapi : + + + + int glfs_get_volumeid (struct glfs *fs, char *volid, size_t size); + + int glfs_setfsuid (uid_t fsuid) ; + + int glfs_setfsgid (gid_t fsgid) ; + + int glfs_setfsgroups (size_t size, const gid_t *list) ; + + glfs_fd_t *glfs_open (glfs_t *fs, const char *path, int flags) ; + + glfs_fd_t *glfs_creat (glfs_t *fs, const char *path, int flags,mode_t mode) ; + + int glfs_close (glfs_fd_t *fd) ; + + glfs_t *glfs_from_glfd (glfs_fd_t *fd) ; + + int glfs_set_xlator_option (glfs_t *fs, const char *xlator, const char *key,const char *value) ; + + typedef void (*glfs_io_cbk) (glfs_fd_t *fd, ssize_t ret, void *data); + + ssize_t glfs_read (glfs_fd_t *fd, void *buf,size_t count, int flags) ; + + ssize_t glfs_write (glfs_fd_t *fd, const void *buf,size_t count, int flags) ; + + int glfs_read_async (glfs_fd_t *fd, void *buf, size_t count, int flags, glfs_io_cbk fn, void *data) ; + + int glfs_write_async (glfs_fd_t *fd, const void *buf, size_t count, int flags, glfs_io_cbk fn, void *data) ; + + ssize_t glfs_readv (glfs_fd_t *fd, const struct iovec *iov, int iovcnt,int flags) ; + + ssize_t glfs_writev (glfs_fd_t *fd, const struct iovec *iov, int iovcnt,int flags) ; + + int glfs_readv_async (glfs_fd_t *fd, const struct iovec *iov, int count, int flags, glfs_io_cbk fn, void *data) ; + + int glfs_writev_async (glfs_fd_t *fd, const struct iovec *iov, int count, int flags, glfs_io_cbk 
fn, void *data) ; + + ssize_t glfs_pread (glfs_fd_t *fd, void *buf, size_t count, off_t offset,int flags) ; + + ssize_t glfs_pwrite (glfs_fd_t *fd, const void *buf, size_t count, off_t offset, int flags) ; + + int glfs_pread_async (glfs_fd_t *fd, void *buf, size_t count, off_t offset,int flags, glfs_io_cbk fn, void *data) ; + + int glfs_pwrite_async (glfs_fd_t *fd, const void *buf, int count, off_t offset,int flags, glfs_io_cbk fn, void *data) ; + + ssize_t glfs_preadv (glfs_fd_t *fd, const struct iovec *iov, int iovcnt, off_t offset, int flags) ; + + ssize_t glfs_pwritev (glfs_fd_t *fd, const struct iovec *iov, int iovcnt, off_t offset, int flags) ; + + int glfs_preadv_async (glfs_fd_t *fd, const struct iovec *iov, int count, off_t offset, int flags, glfs_io_cbk fn, void *data) ; + + int glfs_pwritev_async (glfs_fd_t *fd, const struct iovec *iov, int count, off_t offset, int flags, glfs_io_cbk fn, void *data) ; + + off_t glfs_lseek (glfs_fd_t *fd, off_t offset, int whence) ; + + int glfs_truncate (glfs_t *fs, const char *path, off_t length) ; + + int glfs_ftruncate (glfs_fd_t *fd, off_t length) ; + + int glfs_ftruncate_async (glfs_fd_t *fd, off_t length, glfs_io_cbk fn,void *data) ; + + int glfs_lstat (glfs_t *fs, const char *path, struct stat *buf) ; + + int glfs_stat (glfs_t *fs, const char *path, struct stat *buf) ; + + int glfs_fstat (glfs_fd_t *fd, struct stat *buf) ; + + int glfs_fsync (glfs_fd_t *fd) ; + + int glfs_fsync_async (glfs_fd_t *fd, glfs_io_cbk fn, void *data) ; + + int glfs_fdatasync (glfs_fd_t *fd) ; + + int glfs_fdatasync_async (glfs_fd_t *fd, glfs_io_cbk fn, void *data) ; + + int glfs_access (glfs_t *fs, const char *path, int mode) ; + + int glfs_symlink (glfs_t *fs, const char *oldpath, const char *newpath) ; + + int glfs_readlink (glfs_t *fs, const char *path,char *buf, size_t bufsiz) ; + + int glfs_mknod (glfs_t *fs, const char *path, mode_t mode, dev_t dev) ; + + int glfs_mkdir (glfs_t *fs, const char *path, mode_t mode) ; + + int
glfs_unlink (glfs_t *fs, const char *path) ; + + int glfs_rmdir (glfs_t *fs, const char *path) ; + + int glfs_rename (glfs_t *fs, const char *oldpath, const char *newpath) ; + + int glfs_link (glfs_t *fs, const char *oldpath, const char *newpath) ; + + glfs_fd_t *glfs_opendir (glfs_t *fs, const char *path) ; + + int glfs_readdir_r (glfs_fd_t *fd, struct dirent *dirent,struct dirent **result) ; + + int glfs_readdirplus_r (glfs_fd_t *fd, struct stat *stat, struct dirent *dirent, struct dirent **result) ; + + struct dirent *glfs_readdir (glfs_fd_t *fd) ; + + struct dirent *glfs_readdirplus (glfs_fd_t *fd, struct stat *stat) ; + + long glfs_telldir (glfs_fd_t *fd) ; + + void glfs_seekdir (glfs_fd_t *fd, long offset) ; + + int glfs_closedir (glfs_fd_t *fd) ; + + int glfs_statvfs (glfs_t *fs, const char *path, struct statvfs *buf) ; + + int glfs_chmod (glfs_t *fs, const char *path, mode_t mode) ; + + int glfs_fchmod (glfs_fd_t *fd, mode_t mode) ; + + int glfs_chown (glfs_t *fs, const char *path, uid_t uid, gid_t gid) ; + + int glfs_lchown (glfs_t *fs, const char *path, uid_t uid, gid_t gid) ; + + int glfs_fchown (glfs_fd_t *fd, uid_t uid, gid_t gid) ; + + int glfs_utimens (glfs_t *fs, const char *path,struct timespec times[2]) ; + + int glfs_lutimens (glfs_t *fs, const char *path,struct timespec times[2]) ; + + int glfs_futimens (glfs_fd_t *fd, struct timespec times[2]) ; + + ssize_t glfs_getxattr (glfs_t *fs, const char *path, const char *name,void *value, size_t size) ; + + ssize_t glfs_lgetxattr (glfs_t *fs, const char *path, const char *name,void *value, size_t size) ; + + ssize_t glfs_fgetxattr (glfs_fd_t *fd, const char *name,void *value, size_t size) ; + + ssize_t glfs_listxattr (glfs_t *fs, const char *path,void *value, size_t size) ; + + ssize_t glfs_llistxattr (glfs_t *fs, const char *path, void *value,size_t size) ; + + ssize_t glfs_flistxattr (glfs_fd_t *fd, void *value, size_t size) ; + + int glfs_setxattr (glfs_t *fs, const char *path, const char 
*name,const void *value, size_t size, int flags) ; + + int glfs_lsetxattr (glfs_t *fs, const char *path, const char *name,const void *value, size_t size, int flags) ; + + int glfs_fsetxattr (glfs_fd_t *fd, const char *name,const void *value, size_t size, int flags) ; + + int glfs_removexattr (glfs_t *fs, const char *path, const char *name) ; + + int glfs_lremovexattr (glfs_t *fs, const char *path, const char *name) ; + + int glfs_fremovexattr (glfs_fd_t *fd, const char *name) ; + + int glfs_fallocate(glfs_fd_t *fd, int keep_size, off_t offset, size_t len) ; + + int glfs_discard(glfs_fd_t *fd, off_t offset, size_t len) ; + + int glfs_discard_async (glfs_fd_t *fd, off_t length, size_t lent, glfs_io_cbk fn, void *data) ; + + int glfs_zerofill(glfs_fd_t *fd, off_t offset, off_t len) ; + + int glfs_zerofill_async (glfs_fd_t *fd, off_t length, off_t len, glfs_io_cbk fn, void *data) ; + + char *glfs_getcwd (glfs_t *fs, char *buf, size_t size) ; + + int glfs_chdir (glfs_t *fs, const char *path) ; + + int glfs_fchdir (glfs_fd_t *fd) ; + + char *glfs_realpath (glfs_t *fs, const char *path, char *resolved_path) ; + + int glfs_posix_lock (glfs_fd_t *fd, int cmd, struct flock *flock) ; + + glfs_fd_t *glfs_dup (glfs_fd_t *fd) ; + + + struct glfs_object *glfs_h_lookupat (struct glfs *fs,struct glfs_object *parent, + const char *path, + struct stat *stat) ; + + struct glfs_object *glfs_h_creat (struct glfs *fs, struct glfs_object *parent, + const char *path, int flags, mode_t mode, + struct stat *sb) ; + + struct glfs_object *glfs_h_mkdir (struct glfs *fs, struct glfs_object *parent, + const char *path, mode_t flags, + struct stat *sb) ; + + struct glfs_object *glfs_h_mknod (struct glfs *fs, struct glfs_object *parent, + const char *path, mode_t mode, dev_t dev, + struct stat *sb) ; + + struct glfs_object *glfs_h_symlink (struct glfs *fs, struct glfs_object *parent, + const char *name, const char *data, + struct stat *stat) ; + + + int glfs_h_unlink (struct glfs *fs, struct 
glfs_object *parent, + const char *path) ; + + int glfs_h_close (struct glfs_object *object) ; + + int glfs_caller_specific_init (void *uid_caller_key, void *gid_caller_key, + void *future) ; + + int glfs_h_truncate (struct glfs *fs, struct glfs_object *object, + off_t offset) ; + + int glfs_h_stat(struct glfs *fs, struct glfs_object *object, + struct stat *stat) ; + + int glfs_h_getattrs (struct glfs *fs, struct glfs_object *object, + struct stat *stat) ; + + int glfs_h_getxattrs (struct glfs *fs, struct glfs_object *object, + const char *name, void *value, + size_t size) ; + + int glfs_h_setattrs (struct glfs *fs, struct glfs_object *object, + struct stat *sb, int valid) ; + + int glfs_h_setxattrs (struct glfs *fs, struct glfs_object *object, + const char *name, const void *value, + size_t size, int flags) ; + + int glfs_h_readlink (struct glfs *fs, struct glfs_object *object, char *buf, + size_t bufsiz) ; + + int glfs_h_link (struct glfs *fs, struct glfs_object *linktgt, + struct glfs_object *parent, const char *name) ; + + int glfs_h_rename (struct glfs *fs, struct glfs_object *olddir, + const char *oldname, struct glfs_object *newdir, + const char *newname) ; + + int glfs_h_removexattrs (struct glfs *fs, struct glfs_object *object, + const char *name) ; + + ssize_t glfs_h_extract_handle (struct glfs_object *object, + unsigned char *handle, int len) ; + + struct glfs_object *glfs_h_create_from_handle (struct glfs *fs, + unsigned char *handle, int len, + struct stat *stat) ; + + + struct glfs_fd *glfs_h_opendir (struct glfs *fs, + struct glfs_object *object) ; + + struct glfs_fd *glfs_h_open (struct glfs *fs, struct glfs_object *object, + int flags) ; + +For more details on these APIs, please refer to glfs.h and glfs-handles.h in the source tree (api/src/) of glusterfs. + +* In case of failure, or to close the connection and destroy the glfs_t +object, use glfs_fini.
+ + int glfs_fini (glfs_t *fs) ; + + +All the fileops are typically divided into the categories below + +* a) Handle based Operations - + +These APIs create/make use of a glfs_object (referred to as a handle) unique +to each file within a volume. +The structure glfs_object contains an inode pointer and the gfid. + +For example: Since the NFS protocol uses file handles to access files, these APIs are +mainly used by the NFS-Ganesha server. + +Eg: + + struct glfs_object *glfs_h_lookupat (struct glfs *fs, + struct glfs_object *parent, + const char *path, + struct stat *stat); + + struct glfs_object *glfs_h_creat (struct glfs *fs, + struct glfs_object *parent, + const char *path, + int flags, mode_t mode, + struct stat *sb); + + struct glfs_object *glfs_h_mkdir (struct glfs *fs, + struct glfs_object *parent, + const char *path, mode_t flags, + struct stat *sb); + + + +* b) File path/descriptor based Operations - + +These APIs make use of a file path/descriptor to determine the file on +which they need to operate. + +For example: Samba uses these APIs for file operations. + +Examples of the APIs using file path - + + int glfs_chdir (glfs_t *fs, const char *path) ; + + char *glfs_realpath (glfs_t *fs, const char *path, char *resolved_path) ; + +Once the file is opened, the file-descriptor generated is used for +further operations.
+ +Eg: + + int glfs_posix_lock (glfs_fd_t *fd, int cmd, struct flock *flock) ; + glfs_fd_t *glfs_dup (glfs_fd_t *fd) ; + + + +#### libgfapi bindings : + +libgfapi bindings are available for the languages below: + + - Go + - Java + - python [2] + - Ruby + - Rust + +For more details on these bindings, please refer to: + + #http://www.gluster.org/community/documentation/index.php/Language_Bindings + +References: + +[1] http://humblec.com/libgfapi-interface-glusterfs/ +[2] http://www.gluster.org/2014/04/play-with-libgfapi-and-its-python-bindings/ + diff --git a/done/Features/libgfchangelog.md b/done/Features/libgfchangelog.md new file mode 100644 index 0000000..1dd0d24 --- /dev/null +++ b/done/Features/libgfchangelog.md @@ -0,0 +1,119 @@ +libgfchangelog: "GlusterFS changelog" consumer library +====================================================== + +This document puts forward the need for a GlusterFS changelog consumer library (a.k.a. libgfchangelog) for consuming changelogs produced by the Changelog translator. Further, it mentions the proposed design and the API exposed by it. A brief explanation of the changelog translator can also be found as a commit message in the upstream source tree and the review link can be [accessed here] [1]. + +The initial consumer of changelogs would be Geo-Replication (release 3.5). Possible consumers in the future could be backup utilities, GlusterFS self-heal, bit-rot detection, AV scanners. All these utilities have one thing in common - they need a list of changed entities (created/modified/deleted) in the file system. Therefore, the need arises to provide such functionality in the form of a shared library that applications can link against and query for changes (See API section). There is no plan as of now to provide language bindings as such, but for shell script friendliness a 'gfind' command line utility (which would be dynamically linked with libgfchangelog) would be helpful. As of now, development of this utility has not yet commenced.
+ +The next section gives a brief introduction to how changelogs are organized and managed. Then we propose a couple of designs for libgfchangelog. The API set is not covered in this document (maybe later). + +Changelogs +========== + +Changelogs can be thought of as a running history of an entity in the file system from the time the entity came into existence. The goal is to capture all possible transitions the entity underwent until the time it got purged. The transition namespace is broken up into three categories, with each category represented by a specific changelog format. Changes are recorded in a flat file in the filesystem and are rolled over after a specific time interval. All three categories are recorded in a single changelog file (sequentially) with a type for each entry. Having a single file reduces disk seeks and fragmentation and leaves fewer files to deal with. The strategy for pruning old logs is still undecided. + + +Changelog Transition Namespace +------------------------------ + +As mentioned before, the transition namespace is categorized into three types: + - TYPE-I : Data operation + - TYPE-II : Metadata operation + - TYPE-III : Entry operation + +One could visualize the transitions of a file system entity as a state machine transitioning from one type to another. For TYPE-I and TYPE-II operations there is no state transition as such, but a TYPE-III operation involves a state change from the file system's perspective. We can now classify file operations (fops) into one of the three types: + - Data operation: write(), writev(), truncate(), ftruncate() + - Metadata operation: setattr(), fsetattr(), setxattr(), fsetxattr(), removexattr(), fremovexattr() + - Entry operation: create(), mkdir(), mknod(), symlink(), link(), rename(), unlink(), rmdir() + +Changelog Entry Format +---------------------- + +In order to record the type of operation an entity underwent, a type identifier is used.
Normally, the entity on which the operation is performed would be identified by the pathname, which is the most common way of addressing in a file system, but we choose to use the GlusterFS internal file identifier (GFID) instead (as GlusterFS supports a GFID based backend, the pathname field may not always be valid, and for other reasons which are out of scope of this document). Therefore, the format of the record for the three types of operation can be summarized as follows: + + - TYPE-I : GFID of the file + - TYPE-II : GFID of the file + - TYPE-III : GFID + FOP + MODE + UID + GID + PARGFID/BNAME [PARGFID/BNAME] + +GFIDs are analogous to inodes. TYPE-I and TYPE-II fops record the GFID of the entity on which the operation was performed, thereby recording that there was a data/metadata change on the inode. TYPE-III fops record at the minimum a set of six or seven records (depending on the type of operation), which is sufficient to identify what type of operation the entity underwent. Normally this record includes the GFID of the entity, the type of file operation (an integer [an enumerated value which is used in GlusterFS]), and the parent GFID and the basename (analogous to parent inode and basename). + +Changelogs can be either in ascii or binary format, the difference being the format of the records that are persisted. In a binary changelog the gfids are recorded in their native format, i.e. a 16 byte record, and the fop number as a 4 byte integer. In an ascii changelog, the gfids are stored in their canonical form and the fop number is stringified and persisted. The null character is used as the record separator in changelogs. This makes it hard to read changelogs from the command line, but the packed format is needed to support file names with spaces and special characters. Below is a snippet of a changelog alongside its hexdump.
+ +``` +00000000 47 6c 75 73 74 65 72 46 53 20 43 68 61 6e 67 65 |GlusterFS Change| +00000010 6c 6f 67 20 7c 20 76 65 72 73 69 6f 6e 3a 20 76 |log | version: v| +00000020 31 2e 31 20 7c 20 65 6e 63 6f 64 69 6e 67 20 3a |1.1 | encoding :| +00000030 20 32 0a 45 61 36 39 33 63 30 34 65 2d 61 66 39 | 2.Ea693c04e-af9| +00000040 65 2d 34 62 61 35 2d 39 63 61 37 2d 31 63 34 61 |e-4ba5-9ca7-1c4a| +00000050 34 37 30 31 30 64 36 32 00 32 33 00 33 33 32 36 |47010d62.23.3326| +00000060 31 00 30 00 30 00 66 36 35 34 32 33 32 65 2d 61 |1.0.0.f654232e-a| +00000070 34 32 62 2d 34 31 62 33 2d 62 35 61 61 2d 38 30 |42b-41b3-b5aa-80| +00000080 33 62 33 64 61 34 35 39 33 37 2f 6c 69 62 76 69 |3b3da45937/libvi| +00000090 72 74 5f 64 72 69 76 65 72 5f 6e 65 74 77 6f 72 |rt_driver_networ| +000000a0 6b 2e 73 6f 00 44 61 36 39 33 63 30 34 65 2d 61 |k.so.Da693c04e-a| +000000b0 66 39 65 2d 34 62 61 35 2d 39 63 61 37 2d 31 63 |f9e-4ba5-9ca7-1c| +000000c0 34 61 34 37 30 31 30 64 36 32 00 45 36 65 39 37 |4a47010d62.E6e97| +``` + +As you can see, there is an *entry* operation (journal record starting with an "E"). Records for this operation are: + - GFID : a693c04e-af9e-4ba5-9ca7-1c4a47010d62 + - FOP : 23 (create) + - Mode : 33261 + - UID : 0 + - GID : 0 + - PARGFID/BNAME: f654232e-a42b-41b3-b5aa-803b3da45937/libvirt_driver_network.so + +**NOTE**: In case of a rename operation, there would be an additional record (for the target PARGFID/BNAME). + +libgfchangelog +-------------- + +NOTE: changelogs generated by the changelog translator are rolled over [with the timestamp as the suffix] after a specific interval, after which a new changelog is started. The current changelog [changelog file without the timestamp as the suffix] should never be processed until it is rolled over. The rolled over logs should be treated as read-only. + +Capturing changes performed on a file system is useful for applications that rely on file system scan (crawl) to figure out such information.
Backup utilities, automatic file healing in a replicated environment, bit-rot detection and the like are some of the end user applications that require a set of changed entities in a file system to act on. The goal of libgfchangelog is to provide the application (consumer) a fast and easy to use common query interface (API). The consumer need not worry about the changelog format, nomenclature of the changelog files etc. + +Now we list the functionality and some of the features. + +Functionality +------------- + +Changelog Processing: Processing involves reading changelog file(s) and converting the entries into a human-readable (or application understandable) format (in case of binary log format). +Book-keeping: Keeping track of how much of the changelog the application has consumed (ie. changes during the time slice start-time -> end-time). +Serve API request: Update the consumer by providing the set of changes. + +Processing could be done in two ways: + +* Pre-processing (pre-processing from the library POV): +Once a changelog file is rolled over (by the changelog translator), a set of post processing operations is performed. These operations could include converting a binary log file to an understandable format, collating a bunch of logs into a larger sampling period or just keeping a private copy of the changelog (in ascii format). Extra disk space is consumed to store this private copy. The library would then be free to consume these logs and serve application requests. + +* On-demand: +The processing of the changelogs is triggered when an application requests changes. The downside is the additional time spent on decoding the logs and accumulating data at application request time (but no additional disk space is used over the time period). + +After processing, the changelog is ready to be consumed by the application.
The function of processing is to convert the logs into a human/application readable format (an example is shown below): + +``` +E a7264fe2-dd6b-43e1-8786-a03b42cc2489 CREATE 33188 0 0 00000000-0000-0000-0000-000000000001%2Fservices1 +M a7264fe2-dd6b-43e1-8786-a03b42cc2489 NULL +M 00000000-0000-0000-0000-000000000001 NULL +D a7264fe2-dd6b-43e1-8786-a03b42cc2489 +``` + +Features +-------- + +The following points mention some of the features that the library could provide. + + - The consumer could choose the update type when it registers with the library. 'types' could be: + - Streaming: The consumer is updated via a stream of changes, ie. the library would just replay the logs + - Consolidated: The consumer is provided with a consolidated view of the changelog, eg. if an entity had a DATA and a METADATA operation, it would be presented as a single update. Similarly for ENTRY operations. + - Raw: This mode provides the consumer with the pathnames of the changelog files themselves (after processing). The changelogs should be strictly treated as read-only. This gives the consumer the flexibility to extract updates in their own preferred way (eg. using command line tools like sed, awk, sort | uniq etc.). + - The application may choose to adopt a synchronous (blocking) or an asynchronous (callback) notification mechanism. + - Provide a unified view of changelogs from multiple peers (replication scenario) or a global changelog view of the entire cluster. + + +**The first cut of the library supports:** + - Raw access mode + - Synchronous programming model + - Per brick changelog consumption ie.
no unified/globally aggregated changelog + +[1]:http://review.gluster.org/5127 diff --git a/done/Features/memory-usage.md b/done/Features/memory-usage.md new file mode 100644 index 0000000..4e1a8a0 --- /dev/null +++ b/done/Features/memory-usage.md @@ -0,0 +1,49 @@ +object expiry tracking memory usage +==================================== + +Bitrot daemon tracks objects for expiry (after which the object is +signed) in a data structure known as a "timer-wheel". It's a well +known data structure for tracking millions of objects for expiry. +Let's see the memory usage involved when tracking 1 million +objects (per brick). + +Bitrot daemon uses the "br_object" structure to hold information +needed for signing. An instance of this structure is allocated +for each object that needs to be signed. + + struct br_object { + xlator_t *this; + + br_child_t *child; + + void *data; + uuid_t gfid; + unsigned long signedversion; + + struct list_head list; + }; + +Timer-wheel requires an instance of the structure below per +object that needs to be tracked for expiry. + + struct gf_tw_timer_list { + void *data; + unsigned long expires; + + /** callback routine */ + void (*function)(struct gf_tw_timer_list *, void *, unsigned long); + + struct list_head entry; + }; + +Structure sizes: + +- sizeof (struct br_object): 64 bytes +- sizeof (struct gf_tw_timer_list): 40 bytes + +Together, these structures take up 104 bytes. To track all 1 million objects +at the same time, the amount of memory taken up would be: + +**1,000,000 * 104 bytes: ~100MB** + +Not so bad, I think. diff --git a/done/Features/meta.md b/done/Features/meta.md new file mode 100644 index 0000000..da0d62a --- /dev/null +++ b/done/Features/meta.md @@ -0,0 +1,206 @@ +Meta translator +=============== + +Introduction +------------ + +Meta xlator provides an interface similar to the Linux procfs, for GlusterFS +runtime and configuration.
This document lists some useful information about +GlusterFS internals that could be accessed via the meta xlator. This is not +exhaustive at the moment. Contributors are welcome to improve this. + +Note: Meta xlator is loaded automatically in the client graph, ie. in the +mount process' graph. + +### GlusterFS native mount version + +>[root@trantor codebase]# cat $META/version +>{ +> "Package Version": "3.7dev" +>} + +### Listing of some files under the `meta` folder + +>[root@trantor codebase]# mount -t glusterfs trantor:/vol /mnt/fuse +>[root@trantor codebase]# ls $META +>cmdline frames graphs logging mallinfo master measure_latency process_uuid version + +### GlusterFS' process identifier + +>[root@trantor codebase]# cat $META/process_uuid +>trantor-11149-2014/07/25-18:48:50:468259 +> +This identifier appears in connection establishment log messages. +For eg., + +>[2014-07-25 18:48:49.017927] I [server-handshake.c:585:server_setvolume] 0-vol-server: accepted client from trantor-11087-2014/07/25-18:48:48:779656-vol-client-0-0-0 (version: 3.7dev) +> + +### GlusterFS command line + +>[root@trantor codebase]# cat $META/cmdline +>{ +> "Cmdlinestr": "/usr/local/sbin/glusterfs --volfile-server=trantor --volfile-id=/vol /mnt/fuse" +>} + +### GlusterFS volume graph + +The following directory structure reveals the way xlators are stacked in a +graph like fashion. Each (virtual) file under a xlator directory provides +runtime information of that xlator. For eg. 'name' contains the name of the +xlator. 
+ +``` +/mnt/fuse/.meta/graphs/active +|-- meta-autoload +| |-- history +| |-- meminfo +| |-- name +| |-- options +| |-- private +| |-- profile +| |-- subvolumes +| | `-- 0 -> ../../vol +| |-- type +| `-- view +|-- top -> meta-autoload +|-- vol +| |-- history +| |-- meminfo +| |-- name +| |-- options +| | |-- count-fop-hits +| | `-- latency-measurement +| |-- private +| |-- profile +| |-- subvolumes +| | `-- 0 -> ../../vol-md-cache +| |-- type +| `-- view +|-- vol-client-0 +| |-- history +| |-- meminfo +| |-- name +| |-- options +| | |-- client-version +| | |-- clnt-lk-version +| | |-- fops-version +| | |-- password +| | |-- ping-timeout +| | |-- process-uuid +| | |-- remote-host +| | |-- remote-subvolume +| | |-- send-gids +| | |-- transport-type +| | |-- username +| | |-- volfile-checksum +| | `-- volfile-key +| |-- private +| |-- profile +| |-- subvolumes +| |-- type +| `-- view +|-- vol-client-1 +| |-- history +| |-- meminfo +| |-- name +| |-- options +| | |-- client-version +| | |-- clnt-lk-version +| | |-- fops-version +| | |-- password +| | |-- ping-timeout +| | |-- process-uuid +| | |-- remote-host +| | |-- remote-subvolume +| | |-- send-gids +| | |-- transport-type +| | |-- username +| | |-- volfile-checksum +| | `-- volfile-key +| |-- private +| |-- profile +| |-- subvolumes +| |-- type +| `-- view +|-- vol-dht +| |-- history +| |-- meminfo +| |-- name +| |-- options +| |-- private +| |-- profile +| |-- subvolumes +| | |-- 0 -> ../../vol-client-0 +| | `-- 1 -> ../../vol-client-1 +| |-- type +| `-- view +|-- volfile +|-- vol-io-cache +| |-- history +| |-- meminfo +| |-- name +| |-- options +| |-- private +| |-- profile +| |-- subvolumes +| | `-- 0 -> ../../vol-read-ahead +| |-- type +| `-- view +|-- vol-md-cache +| |-- history +| |-- meminfo +| |-- name +| |-- options +| |-- private +| |-- profile +| |-- subvolumes +| | `-- 0 -> ../../vol-open-behind +| |-- type +| `-- view +|-- vol-open-behind +| |-- history +| |-- meminfo +| |-- name +| |-- options +| 
|-- private +| |-- profile +| |-- subvolumes +| | `-- 0 -> ../../vol-quick-read +| |-- type +| `-- view +|-- vol-quick-read +| |-- history +| |-- meminfo +| |-- name +| |-- options +| |-- private +| |-- profile +| |-- subvolumes +| | `-- 0 -> ../../vol-io-cache +| |-- type +| `-- view +|-- vol-read-ahead +| |-- history +| |-- meminfo +| |-- name +| |-- options +| |-- private +| |-- profile +| |-- subvolumes +| | `-- 0 -> ../../vol-write-behind +| |-- type +| `-- view +`-- vol-write-behind + |-- history + |-- meminfo + |-- name + |-- options + |-- private + |-- profile + |-- subvolumes + | `-- 0 -> ../../vol-dht + |-- type + `-- view + +``` diff --git a/done/Features/mount_gluster_volume_using_pnfs.md b/done/Features/mount_gluster_volume_using_pnfs.md new file mode 100644 index 0000000..6807a21 --- /dev/null +++ b/done/Features/mount_gluster_volume_using_pnfs.md @@ -0,0 +1,68 @@ +# How to export gluster volumes using pNFS? + +The Parallel Network File System (pNFS) is part of the NFS v4.1 protocol that +allows compute clients to access storage devices directly and in parallel. +The pNFS cluster consists of an MDS (Meta-Data-Server) and DS (Data-Server). +The client sends all the read/write requests directly to the DS and all other +operations are handled by the MDS. pNFS support is implemented as part of +glusterFS+NFS-ganesha integration. + +### 1.) Pre-requisites + + - Create a GlusterFS volume + + - Install nfs-ganesha (refer section 5) + + - Disable kernel-nfs, gluster-nfs services on the system using the following commands + - service nfs stop + - gluster vol set nfs.disable ON (Note: this command has to be repeated for all the volumes in the trusted-pool) + + - Turn on features.cache-invalidation for the volume. + - gluster v set features.cache-invalidation on + +### 2.)
Configure nfs-ganesha for pNFS + + - Disable nfs-ganesha and tear down the HA cluster via gluster cli (pNFS does not need the HA setup) + - gluster features.ganesha disable + + - For the optimal working of pNFS, ganesha servers should be started manually on every node in the trusted pool (refer section 5) + - *#ganesha.nfsd -f -L -N -d* + + - Check whether the volume is exported via nfs-ganesha on all the nodes. + - *#showmount -e localhost* + + - Configure the MDS by adding the following block to the ganesha configuration file +```sh +GLUSTER +{ + PNFS_MDS = true; +} +``` + + +### 3.) Mount volume via pNFS + +Mount the volume using any nfs-ganesha server in the trusted pool. By default, nfs version 4.1 will use the pNFS protocol for gluster volumes + - *#mount -t nfs4 -o minorversion=1 :/ * + +### 4.) Points to be noted + + - The current architecture supports only a single MDS and multiple DS. The server with which the client mounts will act as MDS and all servers including the MDS can act as DS. + + - If any of the DS goes down, then the MDS will handle those I/Os. + + - Hereafter, all the subsequent nfs clients need to use the same server for mounting that volume via pNFS, i.e. more than one MDS for a volume is not preferred + + - pNFS support is only tested with distributed, replicated or distribute-replicate volumes + + - It is tested and verified with RHEL 6.5, fedora 20, fedora 21 nfs clients. It is always better to use the latest nfs-clients + +### 5.)
References + + - Setup and create glusterfs volumes : http://www.gluster.org/community/documentation/index.php/QuickStart + + - NFS-Ganesha wiki : https://github.com/nfs-ganesha/nfs-ganesha/wiki + + - For installing, running NFS-Ganesha and exporting a volume : + - read doc/features/glusterfs_nfs-ganesha_integration.md + - http://blog.gluster.org/2014/09/glusterfs-and-nfs-ganesha-integration/ diff --git a/done/Features/nufa.md b/done/Features/nufa.md new file mode 100644 index 0000000..03b8194 --- /dev/null +++ b/done/Features/nufa.md @@ -0,0 +1,20 @@ +# NUFA Translator + +The NUFA ("Non Uniform File Access") translator is a variant of the DHT ("Distributed Hash +Table") translator, intended for use with workloads that have a high locality +of reference. Instead of placing new files pseudo-randomly, it places them on +the same nodes where they are created so that future accesses can be made +locally. For replicated volumes, this means that one copy will be local and +others will be remote; the read-replica selection mechanisms will then favor +the local copy for reads. For non-replicated volumes, the only copy will be +local. + +## Interface + +Use of NUFA is controlled by a volume option, as follows. + + gluster volume set myvolume cluster.nufa on + +This will cause the NUFA translator to be used wherever the DHT translator +otherwise would be. The rest is all automatic. + diff --git a/done/Features/object-versioning.md b/done/Features/object-versioning.md new file mode 100644 index 0000000..aaa8e26 --- /dev/null +++ b/done/Features/object-versioning.md @@ -0,0 +1,230 @@ +Object versioning +================= + + Bitrot detection in GlusterFS relies on object (file) checksum (hash) verification, + also known as "object signature". An object is signed when there are no active + file descriptors referring to its inode (i.e., upon last close()). This is just a + hint for the initiation of hash calculation (and therefore signing).
There is + absolutely no control over when clients can initiate modification operations on + the object. An object could be under modification while its hash computation is + in progress. It would also be inappropriate to restrict access to such objects + for the duration of signing. + + Object versioning is used as a mechanism to identify the staleness of an object's + signature. The document below does not just list the version update protocol, + but goes through the various factors that led to its design. + +*NOTE:* The word "object" is used to represent a "regular file" (in linux sense) and + object versions are persisted in extended attributes of the object's inode. + Signature calculation includes the object's data (no metadata as of now). + +INDEX +===== +1. Version updation protocol +2. Correctness guarantees +3. Implementation +4. Protocol enhancements + +1. Version updation protocol +============================ + There are two types of versions associated with an object: + + a) Ongoing version: This version is incremented on first open() [when + the in-memory representation of the object (inode) is marked dirty] + and synchronized to disk. When an object is created, a default ongoing + version of one (1) is assigned. An object lookup() too assigns the + default version if not present. When a version is initialized upon + lookup() or creat() FOP, it need not be durable on disk and therefore + can just be an extended attribute set without an expensive fsync() + syscall. + + b) Signing version: This is the version against which an object is deemed + to be signed. An object's signature is tied to a particular signed version.
+ Since an object is a candidate for signing upon last release() [last + close()], the signing version is the "ongoing version" at that point in time. + + An object's signature is trustable when the version it was signed against + matches the ongoing version, i.e., if the hash is calculated by hand and + compared against the object signature, it *should* be a perfect match + when the versions are equal. When the versions differ, the signature is + considered stale (it might or might not match the hash just calculated). + + Initialization of object versions + --------------------------------- + An object that predates versioning is assigned the + default versions upon lookup(). The protocol at this point expects "no" + durability guarantees of the versions, i.e., extended attribute sets + need not be followed by an explicit filesystem sync (fsync()). In case + of a power outage or a crash, versions are re-initialized with defaults + if found to be non-existent. The signing version is initialized with a + default value of zero (0) and the ongoing version as one (1). + + *NOTE:* If an object already has versions on-disk, lookup() just brings + the versions in memory. In this case both versions may or may + not match depending on the state the object was left in. + + Increment of object versions + ---------------------------- + During initial versioning, the in-memory representation of the object is + marked dirty, so that subsequent modification operations on the object + trigger a version synchronization to disk (extended attribute set). + Moreover, this operation needs to be durable on disk, for the protocol + to be crash consistent. + + Let's picture the various version states after subsequent open()s. + Not all modification operations need to increment the ongoing version, + only the first operation needs to (subsequent operations are NO-OPs).
+ + *NOTE:* From here on "[s]" depicts a durable filesystem operation and + "*" depicts the inode as dirty. OV(m) and OV(d) are the in-memory and + on-disk ongoing versions, and SV(d) is the on-disk signing version. + + + lookup() open() open() open() + =========================================================== + + OV(m): 1* 2 2 2 + ----------------------------------------- + OV(d): 1 2[s] 2 2 + SV(d): 0 0 0 0 + + + Let's now picture the state when an already signed object undergoes + file operations. + + on-disk state: + OV(d): 3 + SV(d): 3 + + + lookup() open() open() open() + =========================================================== + + OV(m): 3* 4 4 4 + ----------------------------------------- + OV(d): 3 4[s] 4 4 + SV(d): 3 3 3 3 + + Signing process + --------------- + As per the above example, when the last open file descriptor is closed, + signing needs to be performed. The protocol requires that the signing + be attached to a version, which in this case is the in-memory + value of the ongoing version. A release() also marks the inode dirty, + therefore, the next open() does a durable version synchronization to + disk. + + [carrying forward the versions from the earlier example] + + close() release() open() open() + =========================================================== + + OV(m): 4 4* 5 5 + ----------------------------------------- + OV(d): 4 4 5[s] 5 + SV(d): 3 3 3 3 + + As shown above, a release() call triggers a signing with the signing version + set to OV(m), which in this case is 4. During signing, the object is signed + with a signature attached to version 4 as shown below (continuing with + the last open() call from above): + + open() sign(4, signature) + =========================================================== + + OV(m): 5 5 + ----------------------------------------- + OV(d): 5 5 + SV(d): 3 4:[s] + + A signature comparison at this point in time cannot be trusted due to + the version mismatch. This also protects against node crashes and hard + reboots, thanks to the durability guarantee of the on-disk version on first + open().
+ + close() release() open() + =========================================================== + + OV(m): 4 4* 5 + -------------------------------- CRASH + OV(d): 4 4 5[s] + SV(d): 3 3 3 + + The protocol is immune to signing requests after crashes due to + the version synchronization performed on first open(). A signing + request for a version lower than the *current* ongoing version + can be ignored. It is left to the implementation to either + accept or ignore such signing request(s). + + *NOTE:* Inode forget() causes a fresh lookup() to be triggered. + Since a forget() call is received when there are no + active references for an inode, the on-disk version is + the latest and would be copied in-memory on lookup(). + +2. Correctness Guarantees +========================== + + Concurrent open()'s + ------------------- + When an inode is dirty (i.e., the very next operation would try to + synchronize the version to disk), there can be multiple calls [say, + open()] that would find the inode state as dirty and try to write back + the new version to disk. Also, note that marking the inode as synced + and updating the in-memory version is done *after* the new version + is written on disk. This avoids the in-memory version holding the + updated value when the on-disk version synchronization has failed. + Coming back to multiple open() calls on an object, each open() call + tries to synchronize the new version to disk if the inode is marked + as dirty. This is safe as each open() would try to synchronize the + same new version (ongoing version + 1) even if the updation is concurrent. + The in-memory version is finally updated to reflect the updated + version and the inode is marked non-dirty. Again this is done *only* if + the inode is dirty, so open() calls which updated the on-disk + version but lost the race to update the in-memory version + are NO-OPs.
+ + on-disk state: + OV(d): 3 + SV(d): 3 + + + lookup() open() open()' open()' open() + ============================================================= + + OV(m): 3* 3* 3* 4 NO-OP + -------------------------------------------------- + OV(d): 3 4[s] 4[s] 4 4 + SV(d): 3 3 3 3 3 + + open()/release() race + --------------------- + This race can cause a release() [on last close()] to pick up the + ongoing version which was just incremented on a fresh open(). This + leads to signing of the object with the same version as the + ongoing version, thereby mismatching signatures when calculated. + Another point that's worth mentioning here is that the open + file descriptor is *attached* to its inode *after* it has done + version synchronization (and increment). Hence, if a release() + sneaks in during this window, the file descriptor list for the given + inode is still empty, and release() therefore considers it a + last close(). + To counter this, the protocol should track the open and release + counts for file descriptors. A release() should only trigger a + signing request when the file descriptor list for an inode is empty + and the number of releases matches the number of opens. When an + open() sneaks in and increments the ongoing version but the file + descriptor is still not attached to the inode, the open and release + counts mismatch, hence identifying an open() in progress. + +3. Implementation +================== + +Refer to: xlators/features/bit-rot/src/stub + +4. Protocol enhancements +========================= + + a) Delaying persisting on-disk versions till open() + b) Lazy version updation (until signing?) + c) Protocol changes required to handle anonymous file + descriptors in GlusterFS.
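
The version update protocol described in this file can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the bit-rot stub implementation: the class names (`OnDisk`, `Inode`), the method names, and the single-threaded model are all hypothetical, `release()` stands for every descriptor close, and the sketch accepts signing requests for older versions (the text leaves that choice to the implementation).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OnDisk:
    """Stand-in for the versioning extended attributes of one object."""
    ongoing: int = 1      # OV(d), default ongoing version
    signing: int = 0      # SV(d), default signing version
    signature: str = ""


@dataclass
class Inode:
    """In-memory state for one object (illustrative names only)."""
    disk: OnDisk = field(default_factory=OnDisk)
    ongoing: int = 0      # OV(m)
    dirty: bool = False
    opens: int = 0
    releases: int = 0

    def lookup(self) -> None:
        # Copy the on-disk version into memory and mark the inode dirty.
        self.ongoing = self.disk.ongoing
        self.dirty = True

    def open(self) -> None:
        if self.dirty:
            new_version = self.ongoing + 1
            self.disk.ongoing = new_version   # durable write happens first
            self.ongoing = new_version        # memory is updated afterwards
            self.dirty = False
        self.opens += 1                       # fd is attached after the sync

    def release(self) -> Optional[int]:
        """Return the version to sign against, or None if fds remain open."""
        self.releases += 1
        if self.releases != self.opens:
            return None
        self.dirty = True                     # next open() bumps the version
        return self.ongoing

    def sign(self, version: int, signature: str) -> None:
        # This sketch accepts every request; ignoring stale ones is also allowed.
        self.disk.signing = version
        self.disk.signature = signature

    def signature_trusted(self) -> bool:
        # A signature is trusted only when both on-disk versions match.
        return self.disk.signing == self.disk.ongoing
```

Starting from an on-disk state of OV(d) = 3 and SV(d) = 3, calling `lookup()`, two `open()`s, two `release()`s and a further `open()` reproduces the tables above: the second `release()` asks for a signature against version 4, and that signature is not trusted once the next `open()` has moved the ongoing version to 5.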
diff --git a/done/Features/ovirt-integration.md b/done/Features/ovirt-integration.md new file mode 100644 index 0000000..46dbeab --- /dev/null +++ b/done/Features/ovirt-integration.md @@ -0,0 +1,106 @@ +##Ovirt Integration with glusterfs + +oVirt is an opensource virtualization management platform. You can use oVirt to manage +hardware nodes, storage and network resources, and to deploy and monitor virtual machines +running in your data center. oVirt serves as the bedrock for Red Hat''s Enterprise Virtualization product, +and is the "upstream" project where new features are developed in advance of their inclusion +in that supported product offering. + +To know more about ovirt please visit http://www.ovirt.org/ and to configure +#http://www.ovirt.org/Quick_Start_Guide#Install_oVirt_Engine_.28Fedora.29%60 + +For the installation step of ovirt, please refer +#http://www.ovirt.org/Quick_Start_Guide#Install_oVirt_Engine_.28Fedora.29%60 + +When oVirt integrated with gluster, glusterfs can be used in below forms: + +* As a storage domain to host VM disks. + +There are mainly two ways to exploit glusterfs as a storage domain. + - POSIXFS_DOMAIN ( >=oVirt 3.1 ) + - GLUSTERFS_DOMAIN ( >=oVirt 3.3) + +The former one has performance overhead and is not an ideal way to consume images hosted in glusterfs volumes. +When used by this method, qemu uses glusterfs `mount point` to access VM images and invite FUSE overhead. +The libvirt treats this as a file type disk in its xml schema. + +The latter is the recommended way of using glusterfs with ovirt as a storage domain. This provides better +and efficient way to access images hosted under glusterfs volumes.When qemu accessing glusterfs volume using this method, +it make use of `libgfapi` implementation of glusterfs and this method is called native integration. +Here the glusterfs is added as a block backend to qemu and libvirt treat this as a `network` type disk. 
+ +For more details on this, please refer # http://www.ovirt.org/Features/GlusterFS_Storage_Domain +However there are 2 bugs which block usage of this feature. + +https://bugzilla.redhat.com/show_bug.cgi?id=1022961 +https://bugzilla.redhat.com/show_bug.cgi?id=1017289 + +Please check above bugs for latest status. + +* To manage gluster trusted pools. + +oVirt web admin console can be used to - + - add new / import existing gluster cluster + - add/delete volumes + - add/delete bricks + - set/reset volume options + - optimize volume for virt store + - Rebalance and Remove bricks + - Monitor gluster deployment - node, brick, volume status, + Enhanced service monitoring (Physical node resources as well Quota, geo-rep and self-heal status) through Nagios integration(>=oVirt 3.4) + + + +When configuing ovirt to manage only gluster cluster/trusted pool, you need to select `gluster` as an input for +`Application mode` in OVIRT ENGINE CONFIGURATION option of `engine-setup` command. +Refer # http://www.ovirt.org/Quick_Start_Guide#Install_oVirt_Engine_.28Fedora.29%60 + +If you want to use gluster as both ( as a storage domain to host VM disks and to manage gluster trusted pools) +you need to input `both` as a value for `Application mode` in engine-setup command. + +Once you have successfully installed oVirt Engine as mentioned above, you will be provided with instructions +to access oVirt''s web console. + +Below example shows how to configure gluster nodes in fedora. + + +#Configuring gluster nodes. + +On the machine designated as your host, install any supported distribution( ex:Fedora/CentOS/RHEL...etc). +A minimal installation is sufficient. + +Refer # http://www.ovirt.org/Quick_Start_Guide#Install_Hosts + + +##Connect to Ovirt Engine + +Log In to Administration Console + +Ensure that you have the administrator password configured during installation of oVirt engine. 
+ +- To connect to oVirt webadmin console + + +Open a browser and navigate to https://domain.example.com/webadmin. Substitute domain.example.com with the URL provided during installation + +If this is your first time connecting to the administration console, oVirt Engine will issue +security certificates for your browser. Click the link labelled this certificate to trust the +ca.cer certificate. A pop-up displays, click Open to launch the Certificate dialog. +Click `Install Certificate` and select to place the certificate in Trusted Root Certification Authorities store. + + +The console login screen displays. Enter admin as your User Name, and enter the Password that +you provided during installation. Ensure that your domain is set to Internal. Click Login. + + +You have now successfully logged in to the oVirt web administration console. Here, you can configure and manage all your gluster resources. + +To manage gluster trusted pool: + +- Create a cluster with "Enable gluster service" - turned on. (Turn on "Enable virt service" if the same nodes are used as hypervisor as well) +- Add hosts which have already been set up as in step Configuring gluster nodes. +- Create a volume, and click on "Optimize for virt store",This sets the volume tunables optimize volume to be used as an image store + +To use this volume as a storage domain: + +Please refer `User interface` section of www.ovirt.org/Features/GlusterFS_Storage_Domain diff --git a/done/Features/qemu-integration.md b/done/Features/qemu-integration.md new file mode 100644 index 0000000..aba3621 --- /dev/null +++ b/done/Features/qemu-integration.md @@ -0,0 +1,230 @@ +Using GlusterFS volumes to host VM images and data was sub-optimal due to the FUSE overhead involved in accessing gluster volumes via GlusterFS native client. 
However this has changed now with two specific enhancements: + +- A new library called libgfapi is now available as part of GlusterFS that provides POSIX-like C APIs for accessing gluster volumes. libgfapi support is available from GlusterFS-3.4 release. +- QEMU (starting from QEMU-1.3) will have GlusterFS block driver that uses libgfapi and hence there is no FUSE overhead any longer when QEMU works with VM images on gluster volumes. + +GlusterFS, with its pluggable translator model serves as a flexible storage backend for QEMU. QEMU has to just talk to GlusterFS and GlusterFS will hide different file systems and storage types underneath. Various GlusterFS storage features like replication and striping will automatically be available for QEMU. Efforts are also on to add block device backend in Gluster via Block Device (BD) translator that will expose underlying block devices as files to QEMU. This allows GlusterFS to be a single storage backend for both file and block based storage types. + +###GlusterFS specifcation in QEMU + +VM image residing on gluster volume can be specified on QEMU command line using URI format + + gluster[+transport]://[server[:port]]/volname/image[?socket=...] + + + +* `gluster` is the protocol. + +* `transport` specifies the transport type used to connect to gluster management daemon (glusterd). Valid transport types are `tcp, unix and rdma.` If a transport type isn’t specified, then tcp type is assumed. + +* `server` specifies the server where the volume file specification for the given volume resides. This can be either hostname, ipv4 address or ipv6 address. ipv6 address needs to be within square brackets [ ]. If transport type is unix, then server field should not be specified. Instead the socket field needs to be populated with the path to unix domain socket. + +* `port` is the port number on which glusterd is listening. This is optional and if not specified, QEMU will send 0 which will make gluster to use the default port. 
If the transport type is unix, then port should not be specified. + +* `volname` is the name of the gluster volume which contains the VM image. + +* `image` is the path to the actual VM image that resides on gluster volume. + + +###Examples: + + gluster://1.2.3.4/testvol/a.img + gluster+tcp://1.2.3.4/testvol/a.img + gluster+tcp://1.2.3.4:24007/testvol/dir/a.img + gluster+tcp://[1:2:3:4:5:6:7:8]/testvol/dir/a.img + gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img + gluster+tcp://server.domain.com:24007/testvol/dir/a.img + gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket + gluster+rdma://1.2.3.4:24007/testvol/a.img + + + +NOTE: (GlusterFS URI description and above examples are taken from QEMU documentation) + +###Configuring QEMU with GlusterFS backend + +While building QEMU from source, in addition to the normal configuration options, ensure that the `--enable-glusterfs` option is specified explicitly with the ./configure script to get glusterfs support in qemu. + +Starting with QEMU-1.6, pkg-config is used to configure the GlusterFS backend in QEMU. If you are using GlusterFS compiled and installed from sources, then the GlusterFS package config file (glusterfs-api.pc) might not be present at the standard path and you will have to explicitly add the path by executing this command before running the QEMU configure script: + + export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/ + +Without this, the GlusterFS driver will not be compiled into QEMU even when GlusterFS is present in the system. + +* Creating a VM image on GlusterFS backend + +The qemu-img command can be used to create VM images on a gluster backend. The general syntax for image creation looks like this: + +For ex: + + qemu-img create gluster://server/volname/path/to/image size + +## How to setup the environment: + +This usecase (using glusterfs backend for VM disk store) is known as the 'Virt-Store' usecase.
Steps for the entire procedure could be split to: + +* Steps to be done on gluster volume side +* Steps to be done on Hypervisor side + + +##Steps to be done on gluster side + +These are the steps that needs to be done on the gluster side. Precisely this involves + + Creating "Trusted Storage Pool" + Creating a volume + Tuning the volume for virt-store + Tuning glusterd to accept requests from QEMU + Tuning glusterfsd to accept requests from QEMU + Setting ownership on the volume + Starting the volume + +* Creating "Trusted Storage Pool" + +Install glusterfs rpms on the NODE. You can create a volume with a single node. You can also scale up the cluster, as we call as `Trusted Storage Pool`, by adding more nodes to the cluster + + gluster peer probe + +* Creating a volume + +It is highly recommended to have replicate volume or distribute-replicate volume for virt-store usecase, as it would add high availability and fault-tolerance. Remember the plain distribute works equally well + + gluster volume create replica 2 .. + +where, ` is :/ ` + + +Note: It is recommended to create sub-directories inside brick and that could be used to create a volume.For example, say, /home/brick1 is the mountpoint of XFS, then you can create a sub-directory inside it /home/brick1/b1 and use it while creating a volume.You can also use space available in root filesystem for bricks. Gluster cli, by default, throws warning in that case. You can override it by using force option + + gluster volume create replica 2 .. force + +If you are new to GlusterFS, you can take a look at [QuickStart](../Quick-Start-Guide/Quickstart.md) guide. + +* Tuning the volume for virt-store + +There are recommended settings available for virt-store. 
This provides good performance characteristics when enabled on the volume which is used for virt-store + +Refer to [Virt store usecases-Tunables](../Feature Planning/GlusterFS 3.5/Virt store usecase.md#tunables) for recommended tunables and for applying them on the volume, [refer this section](../Feature Planning/GlusterFS 3.5/Virt store usecase.md#applying-the-tunables-on-the-volume) + + +* Tuning glusterd to accept requests from QEMU + +glusterd receives requests only from the application that runs with port number less than 1024, otherwise the request is blocked. QEMU uses port number greater than 1024. To make glusterd accept requests from QEMU, you must edit the glusterd vol file, /etc/glusterfs/glusterd.vol and add the following: + + option rpc-auth-allow-insecure on + +Note: If you have installed glusterfs from source, you can find glusterd vol file at: `/usr/local/etc/glusterfs/glusterd.vol` + +Restart glusterd after adding the option to glusterd vol file. + + service glusterd restart + +* Tuning glusterfsd to accept requests from QEMU + +Enable the option `allow-insecure` on the particular volume + + gluster volume set server.allow-insecure on + +IMPORTANT : As of now (april 2,2014)there is a bug, as allow-insecure is not dynamically set on a volume.You need to restart the volume for the change to take effect + + +* Setting ownership on the volume + +Set the ownership of qemu:qemu on to the volume + + gluster volume set storage.owner-uid 107 + gluster volume set storage.owner-gid 107 + +* Starting the volume + +Start the volume + + gluster volume start + +## Steps to be done on Hypervisor Side: + +To create a raw image, + + qemu-img create gluster://1.2.3.4/testvol/dir/a.img 5G + +To create a qcow2 image, + + qemu-img create -f qcow2 gluster://server.domain.com:24007/testvol/a.img 5G + + + + + +## Booting VM image from GlusterFS backend + +A VM image 'a.img' residing on gluster volume testvol can be booted using QEMU like this: + + + qemu-system-x86_64 
-drive file=gluster://1.2.3.4/testvol/a.img,if=virtio + +In addition to VM images, gluster drives can also be used as data drives: + + qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio -drive file=gluster://1.2.3.4/datavol/a-data.img,if=virtio + +Here 'a-data.img' from datavol gluster volume appears as a 2nd drive for the guest. + +It is also possible to make use of libvirt to define a disk and use it with qemu: + + +### Create libvirt XML to define Virtual Machine + +virt-install is python wrapper which is mostly used to create VM using set of params. How-ever virt-install doesn't support any network filesystem [ https://bugzilla.redhat.com/show_bug.cgi?id=1017308 ] + +Create a libvirt VM xml - http://libvirt.org/formatdomain.html where the disk section is formatted in such a way, qemu driver for glusterfs is being used. This can be seen in the following example xml description + + + + + + + + +
+ + + + + + +* Define the VM from the XML file that was created earlier + + + virsh define + +* Verify that the VM is created successfully + + + virsh list --all + +* Start the VM + + + virsh start + +* Verification + +You can verify the disk image file that is being used by VM + + virsh domblklist + +The above should show the volume name and image name. Here is the example, + + + [root@test ~]# virsh domblklist vm-test2 + Target Source + ------------------------------------------------ + vda distrepvol/test.img + hdc - + + +Reference: + +For more details on this feature implementation and its advantages, please refer: + +- http://raobharata.wordpress.com/2012/10/29/qemu-glusterfs-native-integration/ +- [Libgfapi_with_qemu_libvirt](../Feature Planning/GlusterFS 3.5/libgfapi with qemu libvirt.md) diff --git a/done/Features/quota-object-count.md b/done/Features/quota-object-count.md new file mode 100644 index 0000000..063aa7c --- /dev/null +++ b/done/Features/quota-object-count.md @@ -0,0 +1,47 @@ +Previous mechanism: +==================== + +The only way we could have retrieved the number of files/objects in a directory or volume was to do a crawl of the entire directory/volume. That was expensive and was not scalable. + +New Design Implementation: +========================== +The proposed mechanism will provide an easier alternative to determine the count of files/objects in a directory or volume. + +The new mechanism will store count of objects/files as part of an extended attribute of a directory. Each directory extended attribute value will indicate the number of files/objects present in a tree with the directory being considered as the root of the tree. + +Inode quota management +====================== + +**setting limits** + +Syntax: +*gluster volume quota limit-objects * + +Details: + is a hard-limit for number of objects limitation for path . If hard-limit is exceeded, creation of file or directory is no longer permitted. 
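
The per-directory object count and its hard limit, as described above, can be illustrated with a small Python model. This is a sketch under stated assumptions and not the quota implementation: the names `Dir`, `QuotaExceeded`, `limit` and `objects` are hypothetical, the count lives in a plain attribute instead of an extended attribute, and each directory is assumed to count itself, as the list-objects sample output in this document suggests.

```python
class QuotaExceeded(Exception):
    """Raised when a create would cross an object hard limit."""


class Dir:
    def __init__(self, name, parent=None, limit=None):
        self.name = name
        self.parent = parent
        self.limit = limit   # object hard limit, or None when unset
        self.objects = 1     # objects in this subtree (assumption: counts itself)

    def ancestors(self):
        # Walk from this directory up to the root of the volume.
        node = self
        while node is not None:
            yield node
            node = node.parent

    def create(self, name, is_dir=False):
        # Enforcement: refuse if any directory on the path would exceed its limit.
        for node in self.ancestors():
            if node.limit is not None and node.objects + 1 > node.limit:
                raise QuotaExceeded(node.name)
        # Accounting: the new object is added to every ancestor's count.
        for node in self.ancestors():
            node.objects += 1
        return Dir(name, parent=self) if is_dir else None
```

Because every directory already holds the total for its subtree, reading the count for a path is a single attribute lookup and needs no crawl.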
+ +**list-objects** + +Syntax: +*gluster volume quota list-objects \[path\] ...* + +Details: +If path is not specified, then all the directories which has object limit set on it will be displayed. If we provide path then only that particular path is displayed along with the details associated with that. + +Sample output: + + Path Hard-limit Soft-limit Files Dirs Available Soft-limit exceeded? Hard-limit exceeded? + --------------------------------------------------------------------------------------------------------------------------------------------- + /dir 10 80% 0 1 9 No No + +**Deleting limits** + +Syntax: +*gluster volume quota remove-objects * + +Details: +This will remove the object limit set on the specified path. + +Note: There is a known issue associated with remove-objects. When both usage limit and object limit is set on a path, then removal of any limit will lead to removal of other limit as well. This is tracked in the bug #1202244 + + diff --git a/done/Features/quota-scalability.md b/done/Features/quota-scalability.md new file mode 100644 index 0000000..e47c898 --- /dev/null +++ b/done/Features/quota-scalability.md @@ -0,0 +1,52 @@ +Issues with older implemetation: +----------------------------------- +* >#### Enforcement of quota was done on client side. This had following two issues : + > >* All clients are not trusted and hence enforcement is not secure. + > >* Quota enforcer caches directory size for a certain time out period to reduce network calls to fetch size. On time out, this cache is validated by querying server. With more clients, the traffic caused due to this +validation increases. + +* >#### Relying on lookup calls on a file/directory (inode) to update its contribution [time consuming] + +* >####Hardlimits were stored in a comma separated list. + > >* Hence, changing hard limit of one directory is not an independent operation and would invalidate hard limits of other directories. 
We need to parse the string once for each of these directories just to identify whether its hard limit is changed. This limits the number of hard limits we can configure. + +* >####Cli used to fetch the list of directories on which quota-limit is set, from glusterd. + > >* With more number of limits, the network overhead incurred to fetch this list limits the scalability of number of directories on which we can set quota. + +* >#### Problem with NFS mount + > >* Quota, for its enforcement and accounting requires all the ancestors of a file/directory till root. However, with NFS relying heavily on nameless lookups (through which there is no guarantee that ancestry can be +accessed) this ancestry is not always present. Hence accounting and enforcement was not correct. + + +New Design Implementation: +-------------------------------- + +* Quota enforcement is moved to server side. This addresses issues that arose because of client side enforcement. + +* Two levels of quota limits, soft and hard quota is introduced. + This will result in a message being logged on reaching soft quota and writes will fail with EDQUOT after hard limit is reached. + +Work Flow +----------------- + +* Accounting + # This is done using the marker translator loaded on each brick of the volume. Accounting happens in the background. Ie, it doesn't happen in-flight with the file operation. The file operations latency is not +directly affected by the time taken to perform accounting. This update is sent recursively upwards up to the root of the volume. + +* Enforcement + # The enforcer updates its 'view' (cached) of directory's disk usage on the incidence of a file operation after the expiry of hard/soft timeout, depending on the current usage. Enforcer uses quotad to get the +aggregated disk usage of a directory from the accounting information present on each brick (viz, provided by marker). 
+ +* Aggregator (quotad) + # Quotad is a daemon that serves volume-wide disk usage of a directory, on which quota is configured. It is present on all nodes in the cluster (trusted storage pool) as bricks don't have a global view of cluster. +Quotad queries the disk usage information from all the bricks in that volume and aggregates. It manages all the volumes on which quota is enabled. + + +Benefit to GlusterFS +--------------------------------- + +* Support upto 65536 quota configurations per volume. +* More quotas can be configured in a single volume thereby leading to support GlusterFS for use cases like home directory. + +###For more information on quota usability refer the following link : +> https://access.redhat.com/site/documentation/en-US/Red_Hat_Storage/2.1/html-single/Administration_Guide/index.html#chap-User_Guide-Dir_Quota-Enable diff --git a/done/Features/rdmacm.md b/done/Features/rdmacm.md new file mode 100644 index 0000000..2c287e8 --- /dev/null +++ b/done/Features/rdmacm.md @@ -0,0 +1,26 @@ +## Rdma Connection manager ## + +### What? ### +Infiniband requires addresses of end points to be exchanged using an out-of-band channel (like tcp/ip). Glusterfs used a custom protocol over a tcp/ip channel to exchange this address. However, librdmacm provides the same functionality with the advantage of being a standard protocol. This helps if we want to communicate with a non-glusterfs entity (say nfs client with gluster nfs server) over infiniband. + +### Dependencies ### +* [IP over Infiniband](http://pkg-ofed.alioth.debian.org/howto/infiniband-howto-5.html) - The value to *option* **remote-host** in glusterfs transport configuration should be an IPoIB address +* [rdma cm kernel module](http://pkg-ofed.alioth.debian.org/howto/infiniband-howto-4.html#ss4.4) +* [user space rdmacm library - librdmacm](https://www.openfabrics.org/downloads/rdmacm) + +### rdma-cm in >= GlusterFs 3.4 ### + +Following is the impact of http://review.gluster.org/#change,149. 
+ +New userspace packages needed: +librdmacm +librdmacm-devel + +### Limitations ### + +* Because of bug [890502](https://bugzilla.redhat.com/show_bug.cgi?id=890502), we've to probe the peer on an IPoIB address. This imposes a restriction that all volumes created in the future have to communicate over IPoIB address (irrespective of whether they use gluster's tcp or rdma transport). + +* Currently client has independence to choose b/w tcp and rdma transports while communicating with the server (by creating volumes with **transport-type tcp,rdma**). This independence was a by-product of our ability to use the tcp/ip channel - transports with *option transport-type tcp* - for rdma connection establishment handshake too. However, with new requirement of IPoIB address for connection establishment, we loose this independence (till we bring in [multi-network support](https://bugzilla.redhat.com/show_bug.cgi?id=765437) - where a brick can be identified by a set of ip-addresses and we can choose different pairs of ip-addresses for communication based on our requirements - in glusterd). + +### External links ### +* [Infiniband Howto](http://pkg-ofed.alioth.debian.org/howto/infiniband-howto.html) diff --git a/done/Features/readdir-ahead.md b/done/Features/readdir-ahead.md new file mode 100644 index 0000000..5302a02 --- /dev/null +++ b/done/Features/readdir-ahead.md @@ -0,0 +1,14 @@ +## Readdir-ahead ## + +### Summary ### +Provide read-ahead support for directories to improve sequential directory read performance. + +### Owners ### +Brian Foster + +### Detailed Description ### +The read-ahead feature for directories is analogous to read-ahead for files. The objective is to detect sequential directory read operations and establish a pipeline for directory content. When a readdir request is received and fulfilled, preemptively issue subsequent readdir requests to the server in anticipation of those requests from the user. 
If sequential readdir requests are received, the directory content is already immediately available in the client. If subsequent requests are not sequential or not received, said data is simply dropped and the optimization is bypassed. + +readdir-ahead is currently disabled by default. It can be enabled with the following command: + + gluster volume set readdir-ahead on diff --git a/done/Features/rebalance.md b/done/Features/rebalance.md new file mode 100644 index 0000000..e7212d4 --- /dev/null +++ b/done/Features/rebalance.md @@ -0,0 +1,74 @@ +## Background + + +For a more detailed description, view Jeff Darcy's blog post [here] +(http://pl.atyp.us/hekafs.org/index.php/2012/03/glusterfs-algorithms-distribution/) + +GlusterFS uses the distribute translator (DHT) to aggregate space of multiple servers. DHT distributes files among its subvolumes using a consistent hashing method providing 32-bit hashes. Each DHT subvolume is given a range in the 32-bit hash space. A hash value is calculated for every file using a combination of its name. The file is then placed in the subvolume with the hash range that contains the hash value. + +## What is rebalance? + +The rebalance process migrates files between the DHT subvolumes when necessary. + +## When is rebalance required? + +Rebalancing is required for two main cases. + +1. Addition/Removal of bricks + +2. Renaming of a file + +## Addition/Removal of bricks + +Whenever the number or order of DHT subvolumes change, the hash range given to each subvolume is recalculated. When this happens, already existing files on the volume will need to be moved to the correct subvolume based on their hash. Rebalance does this activity. + +Addition of bricks which increase the size of a volume will increase the number of DHT subvolumes and lead to recalculation of hash ranges (This doesn't happen when bricks are added to a volume to increase redundancy, i.e. increase replica count of a volume). 
This will require an explicit rebalance command to be issued to migrate the files.
+
+Removal of bricks which decrease the size of a volume also causes the hash ranges of DHT to be recalculated. But we don't need to issue an explicit rebalance command in this case, as rebalance is done automatically by the remove-brick process if needed.
+
+## Renaming of a file
+
+Renaming a file will cause its hash to change. The file now needs to be moved to the correct subvolume based on its new hash. Rebalance does this.
+
+## How does rebalance work?
+
+At a high level, the rebalance process consists of the following 3 steps:
+
+1. Crawl the volume to access all files
+2. Calculate the hash for the file
+3. If needed, migrate the file to the correct subvolume.
+
+
+The rebalance process has been optimized by making it distributed across the trusted storage pool. With distributed rebalance, a rebalance process is launched on each peer in the cluster. Each rebalance process will crawl files on only those bricks of the volume which are present on it, and migrate the files which need migration to the correct brick. This speeds up the rebalance process considerably.
+
+## What will happen if rebalance is not run?
+
+### Addition of bricks
+
+With the current implementation of add-brick, when the size of a volume is augmented by adding new bricks, the new bricks are not put into use immediately, i.e., the hash ranges are not recalculated immediately. This means that the files will still be placed only onto the existing bricks, leaving the newly added storage space unused. Starting a rebalance process on the volume will cause the hash ranges to be recalculated with the new bricks included, which allows the newly added storage space to be used.
+
+### Renaming a file
+
+When a file rename causes the file to be hashed to a new subvolume, DHT writes a link file on the new subvolume leaving the actual file on the original subvolume.
A link file is an empty file, which has an extended attribute set that points to the subvolume on which the actual file exists. So, when a client accesses the renamed file, DHT first looks for the file in the hashed subvolume and gets the link file. DHT understands the link file, and gets the actual file from the subvolume pointed to by the link file. This leads to a slight reduction in performance. A rebalance will move the actual file to the hashed subvolume, allowing clients to access the file directly once again.
+
+## Are clients affected during a rebalance process?
+
+The rebalance process is transparent to applications on the clients. Applications which have open files on the volume will not be affected by the rebalance process, even if the open file requires migration. The DHT translator on the client will hide the migration from the applications.
+
+## How are open files migrated?
+
+(A more technical description of the algorithm used can be seen in the commit message of commit a07bb18c8adeb8597f62095c5d1361c5bad01f09.)
+
+To achieve migration of open files, two things need to be ensured:
+a) any writes or changes happening to the file during migration are correctly synced to the destination subvolume after the migration is complete.
+b) any further changes should be made to the destination subvolume
+
+Both of these requirements require sending notifications to clients. Clients are notified by overloading an attribute used in every callback function. DHT understands these attributes in the callbacks and can be notified if a file is being migrated or not.
+
+During rebalance, a file will be in one of two phases
+
+1. Migration in process - In this phase the file is being migrated by the rebalance process from the source subvolume to the destination subvolume. The rebalance process will set an 'in-migration' attribute on the file, which will notify the clients' DHT translator.
The clients' DHT translator will then take care to send any further changes to the destination subvolume as well. This way we satisfy the first requirement.
+
+2. Migration completed - Once the file has been migrated, the rebalance process will set a 'migration-complete' attribute on the file. The clients will be notified of the completion and all further operations on the file will happen on the destination subvolume.
+
+The DHT translator handles the above and allows the applications on the clients to continue working on a file under migration.
diff --git a/done/Features/server-quorum.md b/done/Features/server-quorum.md
new file mode 100644
index 0000000..7b20084
--- /dev/null
+++ b/done/Features/server-quorum.md
@@ -0,0 +1,44 @@
+# Server Quorum
+
+Server quorum is a feature intended to reduce the occurrence of "split brain"
+after a brick failure or network partition. Split brain happens when different
+sets of servers are allowed to process different sets of writes, leaving data
+in a state that cannot be reconciled automatically. The key to avoiding split
+brain is to ensure that there can be only one set of servers - a quorum - that
+can continue handling writes. Server quorum does this by the brutal but
+effective means of forcing down all brick daemons on cluster nodes that can no
+longer reach enough of their peers to form a majority. Because there can only
+be one majority, there can be only one set of bricks remaining, and thus split
+brain cannot occur.
+
+## Options
+
+Server quorum is controlled by two parameters:
+
+ * **cluster.server-quorum-type**
+
+    This value may be "server" to indicate that server quorum is enabled, or
+    "none" to mean it's disabled.
+
+ * **cluster.server-quorum-ratio**
+
+    This is the percentage of cluster nodes that must be up to maintain quorum.
+    More precisely, this percentage of nodes *plus one* must be up.
+
+Note that these are cluster-wide flags. All volumes served by the cluster will
+be affected.
Once these values are set, quorum actions - starting or stopping
+brick daemons in response to node or network events - will be automatic.
+
+## Best Practices
+
+If a cluster with an even number of nodes is split exactly down the middle,
+neither half can have quorum (which requires **more than** half of the total).
+This is particularly important when N=2, in which case the loss of either node
+leads to loss of quorum. Therefore, it is highly advisable to ensure that the
+cluster size is three or greater. The "extra" node in this case need not have
+any bricks or serve any data. It need only be present to preserve the notion
+of a quorum majority less than the entire cluster membership, allowing the
+cluster to survive the loss of a single node without losing quorum.
+
+
+
diff --git a/done/Features/shard.md b/done/Features/shard.md
new file mode 100644
index 0000000..2bdf6ce
--- /dev/null
+++ b/done/Features/shard.md
@@ -0,0 +1,68 @@
+### Sharding xlator (Stripe 2.0)
+
+GlusterFS's answer to very large files (those which can grow beyond a
+single brick) has never been clear. There is a stripe xlator which allows you to
+do that, but that comes at a cost of flexibility - you can add servers only in
+multiples of stripe-count x replica-count, and mixing striped and unstriped files is
+not possible in an "elegant" way. This also happens to be a big limiting factor
+for the big data/Hadoop use case where super large files are the norm (and where
+you want to split a file even if it could fit within a single server.)
+
+The proposed solution for this is to replace the current stripe xlator with a
+new Shard xlator. Unlike the stripe xlator, Shard is not a cluster xlator. It is
+placed on top of DHT. Initially all files will be created as normal files, even
+up to a certain configurable size. The first block (default 4MB) will be stored
+like a normal file.
However, further blocks will be stored in a file, named by
+the GFID and block index in a separate namespace (like /.shard/GFID1.1,
+/.shard/GFID1.2 ... /.shard/GFID1.N). File IO happening to a particular offset
+will write to the appropriate "piece file", creating it if necessary. The
+aggregated file size and block count will be stored in the xattr of the original
+(first block) file.
+
+The advantages of such a model:
+
+- Data blocks are distributed by DHT in a "normal way".
+- Adding servers can happen in any number (even one at a time) and DHT's
+  rebalance will spread out the "piece files" evenly.
+- Self-healing of a large file is now more distributed into smaller files across
+  more servers.
+- piece file naming scheme is immune to renames and hardlinks.
+
+Source: https://gist.github.com/avati/af04f1030dcf52e16535#sharding-xlator-stripe-20
+
+## Usage:
+
+Shard translator is disabled by default. To enable it on a given volume, execute:
+
+    gluster volume set <VOLNAME> features.shard on
+
+The default shard block size is 4MB. To modify it, execute:
+
+    gluster volume set <VOLNAME> features.shard-block-size <value>
+
+When a file is created in a volume with sharding enabled, its block size is
+persisted in its xattr on the first block. This property of the file will remain
+even if the shard-block-size for the volume is reconfigured later.
+
+If you want to disable sharding on a volume, it is advisable to create a new
+volume without sharding and copy out contents of this volume into the new
+volume.
+
+## Note:
+
+* Shard translator is still a beta feature in 3.7.0 and will possibly be fully
+  supported in one of the 3.7.x releases.
+* It is advisable to use shard translator in volumes with replication enabled
+  for fault tolerance.
+
+## TO-DO:
+
+* Complete implementation of zerofill, discard and fallocate fops.
+* Introduce caching and its invalidation within shard translator to store size
+  and block count of shard'ed files.
+* Make shard translator work for non-Hadoop and non-VM use cases where there are
+  multiple clients operating on the same file.
+* Serialize appending writes.
+* Manage recovery of size and block count better in the face of faults during
+  ongoing inode write fops.
+* Anything else that could crop up later :)
\ No newline at end of file
diff --git a/done/Features/tier.md b/done/Features/tier.md
new file mode 100644
index 0000000..d44af09
--- /dev/null
+++ b/done/Features/tier.md
@@ -0,0 +1,170 @@
+##Tiering
+
+####Feature page:
+
+[http://gluster.readthedocs.org/en/latest/Feature Planning/GlusterFS 3.7/Data Classification/](http://gluster.readthedocs.org/en/latest/Feature Planning/GlusterFS 3.7/Data Classification/)
+
+#####Design:
+
+[goo.gl/bkU5qv](goo.gl/bkU5qv)
+
+###Theory of operation
+
+
+The tiering feature enables different storage types to be used by the same
+logical volume. In Gluster 3.7, the two types are classified as "cold" and
+"hot", and are represented as two groups of bricks. The hot group acts as
+a cache for the cold group. The bricks within the two groups themselves are
+arranged according to standard Gluster volume conventions, e.g. replicated,
+distributed replicated, or dispersed.
+
+A normal gluster volume can become a tiered volume by "attaching" bricks
+to it. The attached bricks become the "hot" group. The bricks within the
+original gluster volume are the "cold" bricks.
+
+For example, the original volume may be dispersed on HDD, and the "hot"
+tier could be distributed-replicated SSDs.
+
+Once this new "tiered" volume is built, I/Os to it are subjected to caching
+heuristics:
+
+* All I/Os are forwarded to the hot tier.
+
+* If a lookup on the hot tier fails, the I/O will be forwarded to the cold
+tier. This is a "cache miss".
+
+* Files on the hot tier that are not touched within some time are demoted
+(moved) to the cold tier (see performance parameters, below).
+
+* Files on the cold tier that are touched one or more times are promoted
+(moved) to the hot tier (see performance parameters, below).
+
+This resembles implementations by Ceph and the Linux device mapper (DM)
+component.
+
+Performance enhancements being considered include:
+
+* Biasing migration of large files over small.
+
+* Only demoting when the hot tier is close to full.
+
+* Write-back cache for database updates.
+
+###Code organization
+
+The design endeavors to be upward compatible with future migration policies,
+such as scheduled file migration, data classification, etc. For example,
+the caching logic is self-contained and separate from the file migration. A
+different set of migration policies could use the same underlying migration
+engine. The I/O tracking and metadata store components are intended to be
+reusable for things besides caching semantics.
+
+####Libgfdb:
+
+Libgfdb provides an abstract mechanism to record extra/rich metadata
+required for data maintenance, such as data tiering/classification.
+It provides the consumer with an API for recording and querying, keeping
+the consumer abstracted from the data store used beneath for storing data.
+It works in a plug-and-play model, where data stores can be plugged in.
+Presently we have a plugin for SQLite3. In the future it will provide a recording
+and querying performance optimizer. In the current implementation the schema
+of metadata is fixed.
+
+####Schema:
+
+    GF_FILE_TB Table:
+    This table has one entry per file inode. It holds the metadata required to
+    make decisions in data maintenance.
+    GF_ID (Primary key)        : File GFID (Universal Unique IDentifier in the namespace)
+    W_SEC, W_MSEC              : Write wind time in sec & micro-sec
+    UW_SEC, UW_MSEC            : Write un-wind time in sec & micro-sec
+    W_READ_SEC, W_READ_MSEC    : Read wind time in sec & micro-sec
+    UW_READ_SEC, UW_READ_MSEC  : Read un-wind time in sec & micro-sec
+    WRITE_FREQ_CNTR INTEGER    : Write Frequency Counter
+    READ_FREQ_CNTR INTEGER     : Read Frequency Counter
+
+    GF_FLINK_TABLE:
+    This table has all the hardlinks to a file inode.
+    GF_ID       : File GFID             (Composite Primary Key)``|
+    GF_PID      : Parent Directory GFID (Composite Primary Key)  |-> Primary Key
+    FNAME       : File Base Name        (Composite Primary Key)__|
+    FPATH       : File Full Path (It's redundant for now, this will go)
+    W_DEL_FLAG  : This Flag is used for crash consistency, when a link is unlinked.
+                  i.e. Set to 1 during unlink wind and during unwind this record is deleted
+    LINK_UPDATE : This Flag is used when a link is changed i.e. rename.
+                  Set to 1 when rename wind and set to 0 in rename unwind
+
+Libgfdb API :
+Refer libglusterfs/src/gfdb/gfdb_data_store.h
+
+####ChangeTimeRecorder (CTR) Translator:
+
+ChangeTimeRecorder (CTR) is a server-side xlator (translator) which sits
+just above the posix xlator. The main role of this xlator is to record the
+access/write patterns on a file residing on the brick. It records the
+read (only data) and write (data and metadata) times and also counts
+how many times a file is read or written. This xlator also captures
+the hard links to a file (as it is required by data tiering to move
+files).
+
+CTR Xlator is the consumer of libgfdb.
+
+To Enable/Disable CTR Xlator:
+
+    **gluster volume set <VOLNAME> features.ctr-enabled {on/off}**
+
+To Enable/Disable Frequency Counter Recording in CTR Xlator:
+
+    **gluster volume set <VOLNAME> features.record-counters {on/off}**
+
+
+####Migration daemon:
+
+When a tiered volume is created, a migration daemon starts. There is one daemon
+for every tiered volume per node.
The daemon sleeps and then periodically
+queries the database for files to promote or demote. The query callbacks
+assemble files in a list, which is then enumerated. The frequencies at
+which promotions and demotions happen are subject to user configuration.
+
+Selected files are migrated between the tiers using existing DHT migration
+logic. The tier translator will leverage DHT rebalance performance
+enhancements.
+
+Configurable options for the migration daemon:
+
+    gluster volume set <VOLNAME> cluster.tier-demote-frequency <value>
+
+    gluster volume set <VOLNAME> cluster.tier-promote-frequency <value>
+
+    gluster volume set <VOLNAME> cluster.read-freq-threshold <value>
+
+    gluster volume set <VOLNAME> cluster.write-freq-threshold <value>
+
+
+####Tier Translator:
+
+The tier translator is the root node in tiered volumes. The first subvolume
+is the cold tier, and the second the hot tier. DHT logic for forwarding I/Os
+is largely unchanged. Exceptions are handled according to the dht_methods_t
+structure, which forks control according to DHT or tier type.
+
+The major exception is that DHT's layout is not utilized for choosing hashed
+subvolumes. Rather, the hot tier is always the hashed subvolume.
+
+Changes to DHT were made to allow "stacking", i.e. DHT over DHT:
+
+* readdir operations remember the index of the "leaf node" in the volume graph
+(client id), rather than a unique index for each DHT instance.
+
+* Each DHT instance uses a unique extended attribute for tracking migration.
+
+* In certain cases, it is legal for tiered volumes to have unpopulated inodes
+(whereas this would be an error in DHT's case).
+
+Currently tiered volume expansion (adding and removing bricks) is unsupported.
+
+####glusterd:
+
+The tiered volume tree is a composition of two other volumes. The glusterd
+daemon builds it. Existing logic for adding and removing bricks is heavily
+leveraged to attach and detach tiers, and perform statistics collection.
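The promote/demote selection described for the migration daemon in tier.md can be sketched roughly as follows. This is a minimal illustration, not the tier translator's actual logic: `FileRecord`, `select_migrations`, the `demote_window` parameter, and the exact way the read/write thresholds are compared are all assumptions made for the example. Only the general idea comes from the text: idle files on the hot tier are demoted, and cold files that have been touched are promoted.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical per-file record, loosely modelled on the GF_FILE_TB columns."""
    gfid: str           # file identifier (cf. GF_ID)
    tier: str           # "hot" or "cold"
    last_access: float  # time of the latest read or write, in seconds
    read_count: int     # cf. READ_FREQ_CNTR
    write_count: int    # cf. WRITE_FREQ_CNTR


def select_migrations(records, now, demote_window, read_threshold, write_threshold):
    """Return (to_promote, to_demote) lists of GFIDs.

    Assumed policy: a hot file idle for longer than demote_window is demoted;
    a cold file whose read or write counter reaches its threshold is promoted.
    """
    to_promote, to_demote = [], []
    for rec in records:
        idle = now - rec.last_access
        if rec.tier == "hot" and idle > demote_window:
            to_demote.append(rec.gfid)
        elif rec.tier == "cold" and (
            rec.read_count >= read_threshold or rec.write_count >= write_threshold
        ):
            to_promote.append(rec.gfid)
    return to_promote, to_demote


records = [
    FileRecord("gfid-a", "hot", last_access=100.0, read_count=5, write_count=0),
    FileRecord("gfid-b", "hot", last_access=990.0, read_count=1, write_count=1),
    FileRecord("gfid-c", "cold", last_access=995.0, read_count=2, write_count=0),
    FileRecord("gfid-d", "cold", last_access=10.0, read_count=0, write_count=0),
]
promote, demote = select_migrations(
    records, now=1000.0, demote_window=300.0, read_threshold=1, write_threshold=1
)
print("promote:", promote)
print("demote:", demote)
```

In this toy run the idle hot file is selected for demotion and the recently read cold file for promotion; the other two files stay where they are.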
diff --git a/done/Features/trash_xlator.md b/done/Features/trash_xlator.md
new file mode 100644
index 0000000..3e38e87
--- /dev/null
+++ b/done/Features/trash_xlator.md
@@ -0,0 +1,80 @@
+Trash Translator
+=================
+
+Trash translator will allow users to access deleted or truncated files. Every brick will maintain a hidden .trashcan directory, which will be used to store the files deleted or truncated from the respective brick. The aggregate of all those .trashcan directories can be accessed from the mount point. In order to avoid name collisions, a time stamp is appended to the original file name while it is being moved to the trash directory.
+
+##Implications and Usage
+Apart from the primary use-case of accessing files deleted or truncated by the user, the trash translator can be helpful for internal operations such as self-heal and rebalance. During self-heal and rebalance it is possible to lose crucial data. In those circumstances the trash translator can assist in recovery of the lost data. The trash translator is designed to intercept unlink, truncate and ftruncate fops, store a copy of the current file in the trash directory, and then perform the fop on the original file. For the internal operations, the files are stored under the 'internal_op' folder inside the trash directory.
+
+##Volume Options
+1. *gluster volume set <VOLNAME> features.trash <on | off>*
+
+    This command can be used to enable the trash translator in a volume. If set to on, a trash directory will be created in every brick inside the volume during the volume start command. By default the translator is loaded during volume start but remains non-functional. Disabling trash with the help of this option will not remove the trash directory or even its contents from the volume.
+
+2. *gluster volume set <VOLNAME> features.trash-dir <name>*
+
+    This command is used to reconfigure the trash directory to a user specified name. The argument is a valid directory name.
Directory will be created inside every brick under this name. If not specified by the user, the trash translator will create the trash directory with the default name “.trashcan”. This can be used only when trash-translator is on.
+
+3. *gluster volume set <VOLNAME> features.trash-max-filesize <size>*
+
+    This command can be used to filter files entering the trash directory based on their size. Files above trash_max_filesize are deleted/truncated directly. The value for size may be followed by multiplicative suffixes KB (=1024), MB (=1024*1024) and GB. Default size is set to 5MB. As of now any value specified higher than 1GB will be changed to 1GB at the maximum level.
+
+4. *gluster volume set <VOLNAME> features.trash-eliminate-path <path1> [ , <path2> , . . . ]*
+
+    This command can be used to set the eliminate pattern for the trash translator. Files residing under this pattern will not be moved to the trash directory during deletion/truncation. Path must be a valid one present in the volume.
+
+5. *gluster volume set <VOLNAME> features.trash-internal-op <on | off>*
+
+    This command can be used to enable trash for internal operations like self-heal and re-balance. By default set to off.
+
+##Testing
+The following steps illustrate a simple scenario of deleting a file from a directory.
+
+1. Create a distributed volume with two bricks and start it.
+
+        gluster volume create test rhs:/home/brick
+
+        gluster volume start test
+
+2. Enable trash translator
+
+        gluster volume set test features.trash on
+
+3. Mount glusterfs client as follows.
+
+        mount -t glusterfs rhs:test /mnt
+
+4. Create a directory and file in the mount.
+
+        mkdir /mnt/dir
+
+        echo abc > /mnt/dir/file
+
+5. Delete the file from the mount.
+
+        rm /mnt/dir/file -rf
+
+6. Check inside the trash directory.
+
+        ls /mnt/.trashcan
+
+We can find the deleted file inside the trash directory with a timestamp appended to its filename.
+
+For example,
+
+    [root@rh-host ~]#mount -t glusterfs rh-host:/test /mnt/test
+    [root@rh-host ~]#mkdir /mnt/test/abc
+    [root@rh-host ~]#touch /mnt/test/abc/file
+    [root@rh-host ~]#rm /mnt/test/abc/file
+    remove regular empty file ‘/mnt/test/abc/file’? y
+    [root@rh-host ~]#ls /mnt/test/abc
+    [root@rh-host ~]#
+    [root@rh-host ~]#ls /mnt/test/.trashcan/abc/
+    file2014-08-21_123400
+
+##Points to be remembered
+[1] As soon as the volume is started, the trash directory will be created inside the volume and will be visible through the mount. Disabling trash will not have any impact on its visibility from the mount.
+[2] Even though deletion of the trash directory is not permitted, the contents currently residing in it will be removed on issuing delete on it, and only an empty trash directory remains.
+
+##Known issues
+[1] Since the trash translator resides on the server side, the DHT translator is unaware of the rename and truncate operations done by this translator, which eventually move the files to the trash directory. Unless and until a complete-path-based lookup comes on trashed files, those may not be visible from the mount.
diff --git a/done/Features/upcall.md b/done/Features/upcall.md
new file mode 100644
index 0000000..5e9ced2
--- /dev/null
+++ b/done/Features/upcall.md
@@ -0,0 +1,38 @@
+## Upcall
+
+### Summary
+
+A generic and extensible framework, used to maintain states in the glusterfsd process for each of the files accessed (including the info of the clients doing the fops) and send notifications to the respective glusterfs clients in case of any change in that state.
+
+A few of the use-cases (currently using) of this infrastructure are:
+
+- Inode Update/Invalidation
+
+### Detailed Description
+
+GlusterFS, a scale-out storage platform, comprises a distributed file system which follows the client-server architectural model.
+
+It is the client (glusterfs) which usually initiates an rpc request to the server (glusterfsd).
After processing the request, a reply is sent to the client as the response to the same request. Until now, there was no interface or use case for the server to notify or make a request to the client.
+
+This support is now being added using **“Upcall Infrastructure”.**
+
+A new xlator (Upcall) has been defined to maintain and process the state of the events which require the server to send upcall notifications. For each I/O on an inode, we create/update an ‘upcall_inode_ctx’ and store/update the list of clients’ info ‘upcall_client_t’ in the context.
+
+#### Cache Invalidation
+
+Each of the GlusterFS clients/applications caches certain state of the files (e.g., inode or attributes). In a multi-node environment these caches could lead to data-integrity issues, for a certain time, if there are multiple clients accessing the same file simultaneously.
+
+To avoid such scenarios, we need the server to notify clients in case of any change in the file state/attributes.
+
+More details can be found in the below links -
+ http://www.gluster.org/community/documentation/index.php/Features/Upcall-infrastructure
+ https://soumyakoduri.wordpress.com/2015/02/25/glusterfs-understanding-upcall-infrastructure-and-cache-invalidation-support/
+
+*cache-invalidation is currently disabled by default.*
+It can be enabled with the following command:
+
+    gluster volume set <VOLNAME> features.cache-invalidation on
+
+Note: This upcall notification is sent to only those clients which have accessed the file recently (i.e., within CACHE_INVALIDATE_PERIOD - default 60sec). This option can be tuned using the following command:
+
+    gluster volume set <VOLNAME> features.cache-invalidation-timeout <value>
\ No newline at end of file
diff --git a/done/Features/worm.md b/done/Features/worm.md
new file mode 100644
index 0000000..dba9977
--- /dev/null
+++ b/done/Features/worm.md
@@ -0,0 +1,75 @@
+#WORM (Write Once Read Many)
+This feature enables you to create a `WORM volume` using gluster CLI.
+##Description
+WORM (write once, read many) is a desired feature for users who want to store data such as `log files` where the data is not allowed to be modified.
+
+GlusterFS provides a new key `features.worm` which takes boolean values (enable/disable) for volume set.
+
+Internally, the volume set command with the 'features.worm' key will add the 'features/worm' translator in the brick's volume file.
+
+`This change would be reflected on a subsequent restart of the volume`, i.e. gluster volume stop, followed by a gluster volume start.
+
+With a volume converted to WORM, the changes are as follows:
+
+* Reads are handled normally
+* Only files with O_APPEND flag will be supported.
+* Truncation and deletion won't be supported.
+
+##Volume Options
+Use the volume set command on a volume and see if the volume is actually turned into WORM type.
+
+    # features.worm enable
+##Fully loaded Example
+WORM feature is being supported from glusterfs version 3.4
+Start glusterd by using the command
+
+    # service glusterd start
+Now create a volume by using the command
+
+    # gluster volume create <VOLNAME> <BRICK>
+Start the volume created by running the command below.
+
+    # gluster vol start <VOLNAME>
+Run the command below to make sure that the volume is created.
+
+    # gluster volume info
+Now turn on the WORM feature on the volume by using the command
+
+    # gluster vol set <VOLNAME> worm enable
+Verify that the option is set by using the command
+
+    # gluster volume info
+User should be able to see another option in the volume info
+
+    # features.worm: enable
+Now restart the volume for the changes to take effect, by performing volume stop and start.
+
+    # gluster volume stop <VOLNAME>
+    # gluster volume start <VOLNAME>
+Now mount the volume using fuse mount
+
+    # mount -t glusterfs <SERVER>:/<VOLNAME> <MOUNT-POINT>
+Create a file inside the mount point by running the command below
+
+    # touch <FILE>
+Verify that the user is able to create a file by running the command below
+
+    # ls <FILE>
+
+##How To Test
+Now try deleting the file which has been created above
+
+    # rm <FILE>
+Since WORM is enabled on the volume, it gives the following error message `rm: cannot remove '/<MOUNT-POINT>/<FILE>': Read-only file system`
+
+Put some content into the file which was created above.
+
+    # echo "at the end of the file" >> <FILE>
+Now try editing the file by running the command below and verify that a `Read-only file system` error message is displayed
+
+    # sed -i "1iAt the beginning of the file" <FILE>
+Now read the contents of the file and verify that the file can be read.
+
+    cat <FILE>
+
+`Note: If WORM option is set on the volume before it is started, then volume need not be restarted for the changes to get reflected`.
diff --git a/done/Features/zerofill.md b/done/Features/zerofill.md
new file mode 100644
index 0000000..c0f1fc5
--- /dev/null
+++ b/done/Features/zerofill.md
@@ -0,0 +1,26 @@
+#zerofill API for GlusterFS
+zerofill() API would allow creation of pre-allocated and zeroed-out files on GlusterFS volumes by offloading the zeroing part to server and/or storage (storage offloads use SCSI WRITESAME).
+## Description
+
+Zerofill writes zeroes to a file in the specified range. This fop will be useful when a whole file needs to be initialized with zero (could be useful for zero filled VM disk image provisioning or during scrubbing of VM disk images).
+
+Client/application can issue this FOP for zeroing out. Gluster server will zero out the required range of bytes, i.e. server offloaded zeroing. In the absence of this fop, the client/application has to repeatedly issue write (zero) fops to the server, which is a very inefficient method because of the overheads involved in RPC calls and acknowledgements.
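To get a feel for the overhead of client-side zeroing described above, the following back-of-the-envelope sketch counts how many write requests a client would need to zero a range, compared with a single offloaded zerofill request. The 10 GiB image size and the 128 KiB per-request payload are assumptions chosen for illustration, and `write_calls_needed` is a hypothetical helper, not part of any GlusterFS API.

```python
def write_calls_needed(range_bytes, write_size_bytes):
    """Number of write requests needed to zero range_bytes from the client side."""
    # Integer ceiling division: a final partial chunk still costs one request.
    return -(-range_bytes // write_size_bytes)


KIB = 1024
GIB = 1024 ** 3

disk_image = 10 * GIB   # assumed size of a VM disk image to be zeroed
write_size = 128 * KIB  # assumed payload carried by each write request

client_side_requests = write_calls_needed(disk_image, write_size)
offloaded_requests = 1  # one zerofill fop covering the whole range

print("client-side write requests:", client_side_requests)
print("offloaded zerofill requests:", offloaded_requests)
```

Under these assumptions the client would issue tens of thousands of round trips for work that a single zerofill request hands to the server.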
+
+WRITESAME is a SCSI T10 command that takes a block of data as input and writes the same data to other blocks. This write is handled completely within the storage and hence is known as offload. Linux now has support for the SCSI WRITESAME command, which is exposed to the user in the form of the BLKZEROOUT ioctl. BD Xlator can exploit the BLKZEROOUT ioctl to implement this fop. Thus zeroing out operations can be completely offloaded to the storage device,
+making it highly efficient.
+
+The fop takes two arguments, offset and size. It zeroes out 'size' number of bytes in an opened file starting from 'offset' position.
+This feature adds zerofill support to the following areas:
+- libglusterfs
+- io-stats
+- performance/md-cache,open-behind
+- quota
+- cluster/afr,dht,stripe
+- rpc/xdr
+- protocol/client,server
+- io-threads
+- marker
+- storage/posix
+- libgfapi
+
+Client applications can exploit this fop by using glfs_zerofill introduced in libgfapi. FUSE support for this fop has not been added as there is no system call for this fop.
diff --git a/done/GlusterFS 3.5/AFR CLI enhancements.md b/done/GlusterFS 3.5/AFR CLI enhancements.md
new file mode 100644
index 0000000..88f4980
--- /dev/null
+++ b/done/GlusterFS 3.5/AFR CLI enhancements.md
@@ -0,0 +1,204 @@
+Feature
+-------
+
+AFR CLI enhancements
+
+SUMMARY
+-------
+
+Presently the AFR reporting via CLI has lots of problems in the
+representation of logs because of which users may not be able to use the
+data effectively. This feature is to correct these problems and provide
+a coherent mechanism to present heal status, information and the logs
+associated.
+
+Owners
+------
+
+Venkatesh Somayajulu
+Raghavan
+
+Current status
+--------------
+
+There are many bugs related to this which indicate the current status
+and why these requirements are required.
+
+1) 924062 - gluster volume heal info shows only gfids in some cases and
+sometimes names. This is very confusing for the end user.
+
+2) 852294 - gluster volume heal info hangs/crashes when there is a
+large number of entries to be healed.
+
+3) 883698 - when self heal daemon is turned off, heal info does not
+show any output. But healing can happen because of lookups from IO path.
+Hence list of entries to be healed still needs to be shown.
+
+4) 921025 - directories are not reported when list of split brain
+entries needs to be displayed.
+
+5) 981185 - when self heal daemon process is offline, volume heal info
+gives error as "staging failure"
+
+6) 952084 - We need a command to resolve files in split brain state.
+
+7) 986309 - We need to report source information for files which got
+healed during a self heal session.
+
+8) 986317 - Sometimes list of files to get healed also includes files
+to which IO is being done since the entries for these files could be in
+the xattrop directory. This could be confusing for the user.
+
+There is a master bug 926044 that sums up most of the above problems. It
+does give the QA perspective of the current representation out of the
+present reporting infrastructure.
+
+Detailed Description
+--------------------
+
+1) One common thread among all the above complaints is that the
+information presented to the user is FUD because of the following
+reasons:
+
+(a) Split brain itself is a scary scenario especially with VMs.
+(b) The data that we present to the users cannot be used in a stable
+    manner for them to get to the list of these files. For example, we
+    need to give mechanisms by which they can automate the resolution out
+    of split brain.
+(c) The logs that are generated are all the scarier since we
+    see repetition of some error lines running into hundreds of lines.
+    Our mailing lists are filled with such emails from end users.
+
+Any data is useless unless it is associated with an event. For self
+heal, the event that leads to self heal is the loss of connectivity to a
+brick from a client.
So all healing info and especially split brain +should be associated with such events. + +The following is hence the proposed mechanism: + +(a) Every loss of a brick from client's perspective is logged and + available via some ID. The information provides the time from when + the brick went down to when it came up. It should also report + the number of IO transactions (modifies) that happened during this + event. +(b) The list of these events is available via some CLI command. The + actual command needs to be detailed as part of this feature. +(c) All volume info commands regarding list of files to be healed, + files healed and split brain files should be associated with this + event(s). + +2) Provide a mechanism to show statistics at a volume and replica group +level. It should show the number of files to be healed and number of +split brain files at both the volume and replica group level. + +3) Provide a mechanism to show per volume list of files to be +healed/files healed/split brain files. + +This should have the following information: + +(a) File name +(b) Bricks location +(c) Event association (brick going down) +(d) Source +(e) Sink + +4) Self heal crawl statistics - Introduce new CLI commands for showing +more information on self heal crawl per volume. + +(a) Display why a self heal crawl ran (timeouts, brick coming up) +(b) Start time and end time +(c) Number of files it attempted to heal +(d) Location of the self heal daemon + +5) Scale the logging infrastructure to handle the huge list of files +that needs to be displayed as part of the logging. + +(a) Right now the system crashes or hangs in case of a high number + of files. +(b) It causes CLI timeouts arbitrarily. The latencies involved in + the logging have to be studied (profiled) and mechanisms to + circumvent them have to be introduced. +(c) All files are displayed on the output. Have a better way of + representing them. 
+ +Options are: + +(a) Maybe write to a glusterd log file or have a separate directory + for afr heal logs. +(b) Have a status kind of command. This will display the current + status of the log building and maybe have a batched way of + representing when there is a huge list. + +6) We should provide a mechanism where the user can heal split brain by +some pre-established policies: + +(a) Let the system figure out the latest files (assuming all nodes + are in time sync) and choose the copies that have the latest time. +(b) Choose one particular brick as the source for split brain and + heal all split brains from this brick. +(c) Just remove the split brain information from changelog. We leave + the exercise to the user to repair split brain, wherein they would + rewrite to the split brained files. (right now the user is forced to + remove xattrs manually for this step). + +Benefits to GlusterFS +-------------------- + +Makes the end user more aware of healing status and provides statistics. + +Scope +----- + +6.1. Nature of proposed change + +Modification to AFR and CLI and glusterd code + +6.2. Implications on manageability + +New CLI commands to be added. Existing commands to be improved. + +6.3. Implications on presentation layer + +N/A + +6.4. Implications on persistence layer + +N/A + +6.5. Implications on 'GlusterFS' backend + +N/A + +6.6. Modification to GlusterFS metadata + +N/A + +6.7. Implications on 'glusterd' + +Changes for healing specific commands will be introduced. 
+ +How To Test +----------- + +See documentation section + +User Experience +--------------- + +*Changes in CLI, effect on User experience...* + +Documentation +------------- + + + +Status +------ + +Patches : + + + +Status: + +Merged \ No newline at end of file diff --git a/done/GlusterFS 3.5/Brick Failure Detection.md b/done/GlusterFS 3.5/Brick Failure Detection.md new file mode 100644 index 0000000..9952698 --- /dev/null +++ b/done/GlusterFS 3.5/Brick Failure Detection.md @@ -0,0 +1,151 @@ +Feature +------- + +Brick Failure Detection + +Summary +------- + +This feature attempts to identify storage/file system failures and +disable the failed brick without disrupting the remainder of the node's +operation. + +Owners +------ + +Vijay Bellur with help from Niels de Vos (or the other way around) + +Current status +-------------- + +Currently, if the underlying storage or file system fails, a +brick process will continue to function. In some cases, a brick can hang +due to failures in the underlying system. Due to such hangs in brick +processes, applications running on glusterfs clients can hang. + +Detailed Description +-------------------- + +Detecting failures on the filesystem that a brick uses makes it possible +to handle errors that are caused from outside of the Gluster +environment. + +There have been hanging brick processes when the underlying storage of a +brick became unavailable. A hanging brick process can still use the +network and respond to clients, but actual I/O to the storage is +impossible and can cause noticeable delays on the client side. + +Benefit to GlusterFS +-------------------- + +Provide better detection of storage subsystem failures and prevent bricks +from hanging. + +Scope +----- + +### Nature of proposed change + +Add a health-checker to the posix xlator that periodically checks the +status of the filesystem (implies checking of functional +storage-hardware). 
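The periodic check described above can be sketched as follows. This is only an illustration, assuming the check amounts to a `stat()` call on the brick directory at a fixed interval; the names `brick_is_healthy` and `run_health_checker` are hypothetical, and the real health-checker lives in the posix xlator (written in C) and may work differently.

```python
import os
import threading


def brick_is_healthy(brick_path):
    """Return True if the brick directory can still be stat()ed.

    Assumption for this sketch: a failing filesystem surfaces as an
    OSError (for example EIO) from stat().
    """
    try:
        os.stat(brick_path)
        return True
    except OSError:
        return False


def run_health_checker(brick_path, interval, on_failure, stop_event):
    """Check the brick every `interval` seconds.

    Stops when `stop_event` is set, or after the first failed check,
    in which case `on_failure(brick_path)` is called once.
    """
    # Event.wait() returns False on timeout, so the loop body runs
    # once per interval until the event is set.
    while not stop_event.wait(interval):
        if not brick_is_healthy(brick_path):
            on_failure(brick_path)
            return
```

In this sketch `on_failure` stands in for whatever the brick process does when the storage stops responding (the spec says the process exits). A filesystem that hangs rather than returning an error would block `os.stat` itself, so a production check would likely also need a timeout around the call.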
+ +### Implications on manageability + +When a brick process detects that the underlying storage is not +responding anymore, the process will exit. The brick process is not +restarted automatically; the sysadmin will need to fix the +problem with the storage first. + +After correcting the storage (hardware or filesystem) issue, the +following command will start the brick process again: + + # gluster volume start <VOLNAME> force + +### Implications on presentation layer + +None + +### Implications on persistence layer + +None + +### Implications on 'GlusterFS' backend + +None + +### Modification to GlusterFS metadata + +None + +### Implications on 'glusterd' + +'glusterd' can detect that the brick process has exited; +`gluster volume status` will show that the brick process is not running +anymore. System administrators checking the logs should be able to +triage the cause. + +How To Test +----------- + +The health-checker thread that is part of each brick process will get +started automatically when a volume has been started. Verifying its +functionality can be done in different ways. + +On virtual hardware: + +- disconnect the disk from the VM that holds the brick + +On real hardware: + +- simulate a RAID-card failure by unplugging the card or cables + +On a system that uses LVM for the bricks: + +- use device-mapper to load an error-table for the disk, see [this + description](http://review.gluster.org/5176). + +On any system (writing to random offsets of the block device, more +difficult to trigger): + +1. cause corruption on the filesystem that holds the brick +2. read contents from the brick, hoping to hit the corrupted area +3. the filesystem should abort after hitting a bad spot, the + health-checker should notice that shortly afterwards + +User Experience +--------------- + +No more hanging brick processes when storage-hardware or the filesystem +fails. + +Dependencies +------------ + +Posix translator, not available for the BD-xlator. 
+ +Documentation +------------- + +The health-checker is enabled by default and runs a check every 30 +seconds. This interval can be changed per volume with: + + # gluster volume set <VOLNAME> storage.health-check-interval <SECONDS> + +If `SECONDS` is set to 0, the health-checker will be disabled. + +For further details refer: + + +Status +------ + +glusterfs-3.4 and newer include a health-checker for the posix xlator, +which was introduced with [bug +971774](https://bugzilla.redhat.com/971774): + +- [posix: add a simple + health-checker](http://review.gluster.org/5176) + +Comments and Discussion +----------------------- diff --git a/done/GlusterFS 3.5/Disk Encryption.md b/done/GlusterFS 3.5/Disk Encryption.md new file mode 100644 index 0000000..4c6ab89 --- /dev/null +++ b/done/GlusterFS 3.5/Disk Encryption.md @@ -0,0 +1,443 @@ +Feature +======= + +Transparent encryption. Allows a volume to be encrypted "at rest" on the +server using keys only available on the client. + +1 Summary +========= + +Distributed systems impose tighter requirements on at-rest encryption. +This is because your encrypted data will be stored on servers, which are +de facto untrusted. In particular, your private encrypted data can be +subjected to analysis and tampering, which eventually will lead to its +disclosure, if it is not properly protected. Specifically, usually it is +not enough to just encrypt data. In distributed systems serious +protection of your personal data is possible only in conjunction with a +special process, which is called authentication. GlusterFS provides such +enhanced service: In GlusterFS encryption is enhanced with +authentication. Currently we provide protection from "silent tampering". +This is a kind of tampering, which is hard to detect, because it doesn't +break POSIX compliance. Specifically, we protect encryption-specific +file's metadata. 
Such metadata includes unique file's object id (GFID), +cipher algorithm id, cipher block size and other attributes used by the +encryption process. + +1.1 Restrictions +---------------- + +​1. We encrypt only file content. The feature of transparent encryption +doesn't protect file names: they are neither encrypted, nor verified. +Protection of file names is not so critical as protection of +encryption-specific file's metadata: any attacks based on tampering file +names will break POSIX compliance and result in massive corruption, +which is easy to detect. + +​2. The feature of transparent encryption doesn't work in NFS-mounts of +GlusterFS volumes: NFS's file handles introduce security issues, which +are hard to resolve. NFS mounts of encrypted GlusterFS volumes will +result in failed file operations (see section "Encryption in different +types of mount sessions" for more details). + +​3. The feature of transparent encryption is incompatible with GlusterFS +performance translators quick-read, write-behind and open-behind. + +2 Owners +======== + +Jeff Darcy +Edward Shishkin + +3 Current status +================ + +Merged to the upstream. + +4 Detailed Description +====================== + +See Summary. + +5 Benefit to GlusterFS +====================== + +Besides the justifications that have applied to on-disk encryption just +about forever, recent events have raised awareness significantly. +Encryption using keys that are physically present at the server leaves +data vulnerable to physical seizure of the server. Encryption using keys +that are kept by the same organization entity leaves data vulnerable to +"insider threat" plus coercion or capture at the organization level. For +many, especially various kinds of service providers, only pure +client-side encryption provides the necessary levels of privacy and +deniability. 
+ +Competitively, other projects - most notably +[Tahoe-LAFS](https://leastauthority.com/) - are already using recently +heightened awareness of these issues to attract users who would be +better served by our performance/scalability, usability, and diversity +of interfaces. Only the lack of proper encryption holds us back in these +cases. + +6 Scope +======= + +6.1. Nature of proposed change +------------------------------ + +This is a new client-side translator, using user-provided key +information plus information stored in xattrs to encrypt data +transparently as it's written and decrypt when it's read. + +6.2. Implications on manageability +---------------------------------- + +User needs to manage a per-volume master key (MK). That is: + +​1) Generate an independent MK for every volume which is to be +encrypted. Note, that one MK is created for the whole life of the +volume. + +​2) Provide MK on the client side at every mount in accordance with the +location, which has been specified at volume create time, or overridden +via respective mount option (see section How To Test). + +​3) Keep MK between mount sessions. Note that after successful mount MK +may be removed from the specified location. In this case user should +retain MK safely till next mount session. + +MK is a 256-bit secret string, which is known only to user. Generating +and retention of MK is in user's competence. + +WARNING!!! Losing MK will make content of all regular files of your +volume inaccessible. It is possible to mount a volume with improper MK, +however such mount sessions will allow to access only file names as they +are not encrypted. + +Recommendations on MK generation + +MK has to be a high-entropy key, appropriately generated by a key +derivation algorithm. One of the possible ways is using rand(1) provided +by the OpenSSL package. You need to specify the option "-hex" for proper +output format. 
For example, the next command prints a generated key to +the standard output: + + $ openssl rand -hex 32 + +6.3. Implications on presentation layer +--------------------------------------- + +N/A + +6.4. Implications on persistence layer +-------------------------------------- + +N/A + +6.5. Implications on 'GlusterFS' backend +---------------------------------------- + +All encrypted files on the servers contain padding at the end of file. +That is, the size of all encrypted files on the servers is a multiple of +the cipher block size. Real file size is stored as file's xattr with the key +"trusted.glusterfs.crypt.att.size". The translation padded-file-size -\> +real-file-size (and backward) is performed by the crypt translator. + +6.6. Modification to GlusterFS metadata +--------------------------------------- + +Encryption-specific metadata in specified format is stored as file's +xattr with the key "trusted.glusterfs.crypt.att.cfmt". Current format of +metadata string is described in the slide \#27 of the following [design +document](http://www.gluster.org/community/documentation/index.php/File:GlusterFS_transparent_encryption.pdf) + +6.7. Options of the crypt translator +------------------------------------ + +- data-cipher-alg + +Specifies cipher algorithm for file data encryption. Currently only one +option is available: AES\_XTS. This is a hidden option. + +- block-size + +Specifies size (in bytes) of logical chunk which is encrypted as a whole +unit in the file body. If cipher modes with initialization vectors are used for +encryption, then the initialization vector gets reset for every such chunk. +Available values are: "512", "1024", "2048" and "4096". Default value is +"4096". + +- data-key-size + +Specifies size (in bits) of data cipher key. For AES\_XTS available +values are: "256" and "512". Default value is "256". The larger key size +("512") is for stronger security. 
+ +- master-key + +Specifies pathname of the regular file, or symlink. Defines location of +the master volume key on the trusted client machine. + +7 Getting Started With Crypt Translator +======================================= + +​1. Create a volume . + +​2. Turn on crypt xlator: + + # gluster volume set `` encryption on + +​3. Turn off performance xlators that currently encryption is +incompatible with: + + # gluster volume set  performance.quick-read off + # gluster volume set  performance.write-behind off + # gluster volume set  performance.open-behind off + +​4. (optional) Set location of the volume master key: + + # gluster volume set  encryption.master-key  + +where is an absolute pathname of the file, which +will contain the volume master key (see section implications on +manageability). + +​5. (optional) Override default options of crypt xlator: + + # gluster volume set  encryption.data-key-size  + +where should have one of the following values: +"256"(default), "512". + + # gluster volume set  encryption.block-size  + +where should have one of the following values: "512", +"1024", "2048", "4096"(default). + +​6. Define location of the master key on your client machine, if it +wasn't specified at section 4 above, or you want it to be different from +the , specified at section 4. + +​7. On the client side make sure that the file with name + (or defined at section +6) exists and contains respective per-volume master key (see section +implications on manageability). This key has to be in hex form, i.e. +should be represented by 64 symbols from the set {'0', ..., '9', 'a', +..., 'f'}. The key should start at the beginning of the file. All +symbols at offsets \>= 64 are ignored. + +NOTE: (or defined at +step 6) can be a symlink. In this case make sure that the target file of +this symlink exists and contains respective per-volume master key. + +​8. Mount the volume on the client side as usual. 
If you +specified a location of the master key at section 6, then use the mount +option + +--xlator-option=.master-key= + +where is location of master key specified at +section 6, is suffixed with "-crypt". For +example, if you created a volume "myvol" in the step 1, then +suffixed\_vol\_name is "myvol-crypt". + +​9. During mount your client machine receives configuration info from +the untrusted server, so this step is extremely important! Check, that +your volume is really encrypted, and that it is encrypted with the +proper master key (see FAQ \#1,\#2). + +​10. (optional) After successful mount the file which contains master +key may be removed. NOTE: Next mount session will require the master-key +again. Keeping the master key between mount sessions is in user's +competence (see section implications on manageability). + +8 How to test +============= + +From a correctness standpoint, it's sufficient to run normal tests with +encryption enabled. From a security standpoint, there's a whole +discipline devoted to analysing the stored data for weaknesses, and +engagement with practitioners of that discipline will be necessary to +develop the right tests. + +9 Dependencies +============== + +Crypt translator requires OpenSSL of version \>= 1.0.1 + +10 Documentation +================ + +10.1 Basic design concepts +-------------------------- + +The basic design concepts are described in the following [pdf +slides](http://www.gluster.org/community/documentation/index.php/File:GlusterFS_transparent_encryption.pdf) + +10.2 Procedure of security open +------------------------------- + +So, in accordance with the basic design concepts above, before every +access to a file's body (by read(2), write(2), truncate(2), etc) we need +to make sure that the file's metadata is trusted. Otherwise, we risk to +deal with untrusted file's data. + +To make sure that file's metadata is trusted, file is subjected to a +special procedure of security open. 
The procedure of security open is +performed by crypt translator at FOP-\>open() (crypt\_open) time by the +function open\_format(). Currently this is a hardcoded composition of 2 +checks: + +1. verification of file's GFID by the file name; +2. verification of file's metadata by the verified GFID; + +If the security open succeeds, then the cache of trusted client machine +is replenished with file descriptor and file's inode, and user can +access the file's content by read(2), write(2), ftruncate(2), etc. +system calls, which accept file descriptor as argument. + +However, file API also allows access to the file body without opening the +file. For example, truncate(2), which accepts pathname instead of file +descriptor. To make sure that file's metadata is trusted, we create a +temporary file descriptor and always call crypt\_open() before +truncating the file's body. + +10.3 Encryption in different types of mount sessions +---------------------------------------------------- + +Everything described in the section above is valid only for FUSE-mounts. +Besides, GlusterFS also supports so-called NFS-mounts. From the +standpoint of security the key difference between the mentioned types of +mount sessions is that in NFS-mount sessions file operations instead of +file name accept a so-called file handle (which is actually GFID). It +creates problems, since the file name is a basic point for verification. +As it follows from the section above, using the step 1, we can replenish +the cache of trusted machine with trusted file handles (GFIDs), and +perform a security open only by trusted GFID (by the step 2). However, +in this case we need to make sure that there are no leaks of non-trusted +GFIDs (and, moreover, such leaks won't be introduced by the development +process in future). 
This is possible only with changed GFID format: +everywhere in GlusterFS GFID should appear as a pair (uuid, +is\_verified), where is\_verified is a boolean variable, which is true, +if this GFID passed the verification procedure (step 1 in the +section above). + +The next problem is that current NFS protocol doesn't encrypt the +channel between NFS client and NFS server. It means that in NFS-mounts +of GlusterFS volumes NFS client and GlusterFS client should be the same +(trusted) machine. + +Taking into account the described problems, encryption in GlusterFS is +not supported in NFS-mount sessions. + +10.4 Class of cipher algorithms for file data encryption that can be supported by the crypt translator +------------------------------------------------------------------------------------------------------ + +We'll assume that any symmetric block cipher algorithm is completely +determined by a pair (alg\_id, mode\_id), where alg\_id is an algorithm +defined on elementary cipher blocks (e.g. AES), and mode\_id is a mode +of operation (e.g. ECB, XTS, etc). + +Technically, the crypt translator is able to support any symmetric block +cipher algorithms via additional options of the crypt translator. +However, in practice the set of supported algorithms is narrowed because +of various security and organization issues. Currently we support only +one algorithm. This is AES\_XTS. + +10.5 Bibliography +----------------- + +1. Recommendation for Block Cipher Modes of Operation (NIST + Special Publication 800-38A). +2. Recommendation for Block Cipher Modes of Operation: The XTS-AES Mode + for Confidentiality on Storage Devices (NIST Special Publication + 800-38E). +3. Recommendation for Key Derivation Using Pseudorandom Functions, + (NIST Special Publication 800-108). +4. Recommendation for Block Cipher Modes of Operation: The CMAC Mode + for Authentication, (NIST Special Publication 800-38B). +5. 
Recommendation for Block Cipher Modes of Operation: Methods for Key + Wrapping, (NIST Special Publication 800-38F). +6. FIPS PUB 198-1 The Keyed-Hash Message Authentication Code (HMAC). +7. David A. McGrew, John Viega "The Galois/Counter Mode of Operation + (GCM)". + +11 FAQ +====== + +**1. How to make sure that my volume is really encrypted?** + +Check the respective graph of translators on your trusted client +machine. This graph is created at mount time and is stored by default in +the file /usr/local/var/log/glusterfs/mountpoint.log + +Here "mountpoint" is the absolute name of the mountpoint, where "/" are +replaced with "-". For example, if your volume is mounted to +/mnt/testfs, then you'll need to check the file +/usr/local/var/log/glusterfs/mnt-testfs.log + +Make sure that this graph contains the crypt translator, which looks +like the following: + + 13: volume xvol-crypt + 14:     type encryption/crypt + 15:     option master-key /home/edward/mykey + 16:     subvolumes xvol-dht + 17: end-volume + +**2. How to make sure that my volume is encrypted with a proper master +key?** + +Check the graph of translators on your trusted client machine (see the +FAQ\#1). Make sure that the option "master-key" of the crypt translator +specifies correct location of the master key on your trusted client +machine. + +**3. Can I change the encryption status of a volume?** + +You can change encryption status (enable/disable encryption) only for +empty volumes. Otherwise it will be incorrect (you'll end with IO +errors, data corruption and security problems). We strongly recommend to +decide once and forever at volume creation time, whether your volume has +to be encrypted, or not. + +**4. I am able to mount my encrypted volume with improper master keys +and get list of file names for every directory. Is it normal?** + +Yes, it is normal. It doesn't contradict the announced functionality: we +encrypt only file's content. 
File names are not encrypted, so it doesn't +make sense to hide them on the trusted client machine. + +**5. What is the reason for only supporting AES-XTS? This mode is not +using Intel's AES-NI instruction thus not utilizing hardware feature..** + +Distributed file systems impose tighter requirements to at-rest +encryption. We offer more than "at-rest-encryption". We offer "at-rest +encryption and authentication in distributed systems with non-trusted +servers". Data and metadata on the server can be easily subjected to +tampering and analysis with the purpose to reveal secret user's data. +And we have to resist to this tampering by performing data and metadata +authentication. + +Unfortunately, it is technically hard to implement full-fledged data +authentication via a stackable file system (GlusterFS translator), so we +have decided to perform a "light" authentication by using a special +cipher mode, which is resistant to tampering. Currently OpenSSL supports +only one such mode: this is XTS. Tampering of ciphertext created in XTS +mode will lead to unpredictable changes in the plain text. That said, +user will see "unpredictable gibberish" on the client side. Of course, +this is not an "official way" to detect tampering, but this is much +better than nothing. The "official way" (creating/checking MACs) we use +for metadata authentication. + +Other modes like CBC, CFB, OFB, etc supported by OpenSSL are strongly +not recommended for use in distributed systems with non-trusted servers. +For example, CBC mode doesn't "survive" overwrite of a logical block in +a file. It means that with every such overwrite (standard file system +operation) we'll need to re-encrypt the whole(!) file with different +key. CFB and OFB modes are sensitive to tampering: there is a way to +perform \*predictable\* changes in plaintext, which is unacceptable. 
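The "predictable changes in plaintext" mentioned above for keystream-style modes can be illustrated with a toy example. This is a sketch under loud assumptions: the keystream is built from SHA-256 with output feedback purely as a stand-in for a block cipher in OFB mode, it is not AES, it is not secure, and it is not code from the crypt translator. The helper names (`keystream`, `xor_crypt`, `tamper`) are hypothetical.

```python
import hashlib


def keystream(key, iv, length):
    """Toy OFB-style keystream: each output block is fed back as input.

    SHA-256 is used here only as a stand-in for a block cipher.
    """
    out = b""
    block = iv
    while len(out) < length:
        block = hashlib.sha256(key + block).digest()
        out += block
    return out[:length]


def xor_crypt(key, iv, data):
    """Encrypt or decrypt by XOR with the keystream (same operation)."""
    ks = keystream(key, iv, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))


def tamper(ciphertext, offset, old, new):
    """Replace known plaintext bytes `old` with `new` without the key.

    Because ciphertext = plaintext XOR keystream, XORing the ciphertext
    with (old XOR new) changes the decrypted bytes from old to new.
    """
    forged = bytearray(ciphertext)
    for i, (a, b) in enumerate(zip(old, new)):
        forged[offset + i] ^= a ^ b
    return bytes(forged)
```

An attacker on the server who knows or guesses part of a file's plaintext could, under these assumptions, call `tamper(ct, 4, b"0100", b"9999")` on the ciphertext of `b"pay 0100 to bob"` and the client would decrypt a clean, attacker-chosen message. XTS does not rule out tampering either, but a modified ciphertext block decrypts to unpredictable data rather than to a value chosen by the attacker.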
+ +Yes, XTS is slow (at least its current implementation in OpenSSL), but +we don't promise that CFB, OFB with full-fledged authentication will be +faster. So.. diff --git a/done/GlusterFS 3.5/Exposing Volume Capabilities.md b/done/GlusterFS 3.5/Exposing Volume Capabilities.md new file mode 100644 index 0000000..0f72fbc --- /dev/null +++ b/done/GlusterFS 3.5/Exposing Volume Capabilities.md @@ -0,0 +1,161 @@ +Feature +------- + +Provide a capability to: + +- Probe the type (posix or bd) of volume. +- Provide list of capabilities of a xlator/volume. For example posix + xlator could support zerofill, BD xlator could support offloaded + copy, thin provisioning etc + +Summary +------- + +With multiple storage translators (posix and bd) being supported in +GlusterFS, it becomes necessary to know the volume type so that user can +issue appropriate calls that are relevant only to a given volume +type. Hence there needs to be a way to expose the type of the storage +translator of the volume to the user. + +BD xlator is capable of providing server offloaded file copy, +server/storage offloaded zeroing of a file etc. These capabilities should +be visible to the client/user, so that these features can be exploited. + +Owners +------ + +M. Mohan Kumar +Bharata B Rao. + +Current status +-------------- + +BD xlator exports capability information through gluster volume info +(and --xml) output. For eg: + +*snip of gluster volume info output for a BD based volume* + + Xlator 1: BD + Capability 1: thin + +*snip of gluster volume info --xml output for a BD based volume* + + +    +     BD +      +       thin +      +    + + +But this capability information should also be exposed through some other +means so that a host which is not a Gluster peer could also make use of +these capabilities. + +Exposing the type of volume (i.e. posix or BD) is still in conceptual +state currently and needs discussion. + +Detailed Description +-------------------- + +1. 
Type +- BD translator supports both regular files and block devices, +i.e., one can create files on GlusterFS volume backed by BD +translator and this file could end up as regular posix file or a +logical volume (block device) based on the user's choice. User +can do a setxattr on the created file to convert it to a logical +volume. +- Users of BD backed volume like QEMU would like to know that it +is working with BD type of volume so that it can issue an +additional setxattr call after creating a VM image on GlusterFS +backend. This is necessary to ensure that the created VM image +is backed by LV instead of file. +- There are different ways to expose this information (BD type of +volume) to user. One way is to export it via a getxattr call. + +2. Capabilities +- BD xlator supports new features such as server offloaded file +copy, thin provisioned VM images etc (there is a patch posted to +Gerrit to add server offloaded file zeroing in posix xlator). +There is no standard way of exploiting these features from +client side (such as syscall to exploit server offloaded copy). +So these features need to be exported to the client so that they +can be used. BD xlator V2 patch exports this capability +information through gluster volume info (and --xml) output. But +if a client is not a GlusterFS peer it can't run volume +info command to get the list of capabilities of a given +GlusterFS volume. Also GlusterFS block driver in qemu needs to +get the capability list so that these features are used. + +Benefit to GlusterFS +-------------------- + +Enables proper consumption of BD xlator and lets clients exploit new features +added in both posix and BD xlator. + +### Scope + +Nature of proposed change +------------------------- + +- Quickest way to expose volume type to a client can be achieved by + using getxattr fop. When a client issues getxattr("volume\_type") on + a root gfid, bd xlator will return 1 implying it is the BD xlator. 
But + posix xlator will return ENODATA and client code can interpret this + as posix xlator. + +- Also capability list can be returned via getxattr("caps") for root + gfid. + +Implications on manageability +----------------------------- + +None. + +Implications on presentation layer +---------------------------------- + +N/A + +Implications on persistence layer +--------------------------------- + +N/A + +Implications on 'GlusterFS' backend +----------------------------------- + +N/A + +Modification to GlusterFS metadata +---------------------------------- + +N/A + +Implications on 'glusterd' +-------------------------- + +N/A + +How To Test +----------- + +User Experience +--------------- + +Dependencies +------------ + +Documentation +------------- + +Status +------ + +Patch : + +Status : Merged + +Comments and Discussion +----------------------- diff --git a/done/GlusterFS 3.5/File Snapshot.md b/done/GlusterFS 3.5/File Snapshot.md new file mode 100644 index 0000000..b2d6c69 --- /dev/null +++ b/done/GlusterFS 3.5/File Snapshot.md @@ -0,0 +1,101 @@ +Feature +------- + +File Snapshots in GlusterFS + +### Summary + +Ability to take snapshots of files in GlusterFS + +### Owners + +Anand Avati + +### Source code + +Patch for this feature - + +### Detailed Description + +The feature adds file snapshotting support to GlusterFS. '' To use this +feature the file format should be QCOW2 (from QEMU)'' . The patch takes +the block layer code from Qemu and converts it into a translator in +gluster. + +### Benefit to GlusterFS + +Better integration with Openstack Cinder, and in general ability to take +snapshots of files (typically VM images) + +### Usage + +*To take snapshot of a file, the file format should be QCOW2. To set +file type as qcow2 check step \#2 below* + +​1. Turning on snapshot feature : + + gluster volume set `` features.file-snapshot on + +​2. To set qcow2 file format: + + setfattr -n trusted.glusterfs.block-format -v qcow2:10GB  + +​3. 
To create a snapshot: + + setfattr -n trusted.glusterfs.block-snapshot-create -v  + +​4. To apply/revert back to a snapshot: + + setfattr -n trusted.glusterfs.block-snapshot-goto -v   + +### Scope + +#### Nature of proposed change + +The work is going to be a new translator. Very minimal changes to +existing code (minor change in syncops) + +#### Implications on manageability + +Will need ability to load/unload the translator in the stack. + +#### Implications on presentation layer + +Feature must be presentation layer independent. + +#### Implications on persistence layer + +No implications + +#### Implications on 'GlusterFS' backend + +Internal snapshots - No implications. External snapshots - there will be +hidden directories added. + +#### Modification to GlusterFS metadata + +New xattr will be added to identify files which are 'snapshot managed' +vs raw files. + +#### Implications on 'glusterd' + +Yet another turn on/off feature for glusterd. Volgen will have to add a +new translator in the generated graph. + +### How To Test + +Snapshots can be tested by taking snapshots along with checksum of the +state of the file, making further changes and going back to old snapshot +and verify the checksum again. + +### Dependencies + +Dependent QEMU code is imported into the codebase. + +### Documentation + + + +### Status + +Merged in master and available in Gluster3.5 \ No newline at end of file diff --git a/done/GlusterFS 3.5/Onwire Compression-Decompression.md b/done/GlusterFS 3.5/Onwire Compression-Decompression.md new file mode 100644 index 0000000..a26aa7a --- /dev/null +++ b/done/GlusterFS 3.5/Onwire Compression-Decompression.md @@ -0,0 +1,96 @@ +Feature +======= + +On-Wire Compression/Decompression + +1. Summary +========== + +Translator to compress/decompress data in flight between client and +server. + +2. Owners +========= + +- Venky Shankar +- Prashanth Pai + +3. Current Status +================= + +Code has already been merged. Needs more testing. 
+ +The [initial submission](http://review.gluster.org/3251) contained a +`compress` option, which introduced [some +confusion](https://bugzilla.redhat.com/1053670). [A correction has been +sent](http://review.gluster.org/6765) to rename the user visible options +to start with `network.compression`. + +TODO + +- Make xlator pluggable to add support for other compression methods +- Add support for lz4 compression: + +4. Detailed Description +======================= + +- When a writev call occurs, the client compresses the data before + sending it to server. On the server, compressed data is + decompressed. Similarly, when a readv call occurs, the server + compresses the data before sending it to client. On the client, the + compressed data is decompressed. Thus the amount of data sent over + the wire is minimized. + +- Compression/Decompression is done using Zlib library. + +- During normal operation, this is the format of data sent over wire: + + trailer(8 bytes). The trailer contains the CRC32 + checksum and length of original uncompressed data. This is used for + validation. + +5. Usage +======== + +Turning on compression xlator: + + # gluster volume set  network.compression on + +Configurable options: + + # gluster volume set  network.compression.compression-level 8 + # gluster volume set  network.compression.min-size 50 + +6. Benefits to GlusterFS +======================== + +Fewer bytes transferred over the network. + +7. Issues +========= + +- Issues with striped volumes. Compression xlator cannot work with + striped volumes + +- Issues with write-behind: Mount point hangs when writing a file with + write-behind xlator turned on. To overcome this, turn off + write-behind entirely OR set "performance.strict-write-ordering" to + on. + +- Issues with AFR: AFR v1 currently does not propagate xdata. + This issue has + been resolved in AFR v2. + +8. Dependencies +=============== + +Zlib library + +9. Documentation +================ + + + +10. 
Status +========== + +Code merged upstream. \ No newline at end of file diff --git a/done/GlusterFS 3.5/Quota Scalability.md b/done/GlusterFS 3.5/Quota Scalability.md new file mode 100644 index 0000000..f3b0a0d --- /dev/null +++ b/done/GlusterFS 3.5/Quota Scalability.md @@ -0,0 +1,99 @@ +Feature +------- + +Quota Scalability + +Summary +------- + +Support upto 65536 quota configurations per volume. + +Owners +------ + +Krishnan Parthasarathi +Vijay Bellur + +Current status +-------------- + +Current implementation of Directory Quota cannot scale beyond a few +hundred configured limits per volume. The aim of this feature is to +support upto 65536 quota configurations per volume. + +Detailed Description +-------------------- + +TBD + +Benefit to GlusterFS +-------------------- + +More quotas can be configured in a single volume thereby leading to +support GlusterFS for use cases like home directory. + +Scope +----- + +### Nature of proposed change + +- Move quota enforcement translator to the server +- Introduce a new quota daemon which helps in aggregating directory + consumption on the server +- Enhance marker's accounting to be modular +- Revamp configuration persistence and CLI listing for better scale +- Allow configuration of soft limits in addition to hard limits. + +### Implications on manageability + +Mostly the CLI will be backward compatible. New CLI to be introduced +needs to be enumerated here. + +### Implications on presentation layer + +None + +### Implications on persistence layer + +None + +### Implications on 'GlusterFS' backend + +None + +### Modification to GlusterFS metadata + +- Addition of a new extended attribute for storing configured hard and +soft limits on directories. 
+ +### Implications on 'glusterd' + +- New file based configuration persistence + +How To Test +----------- + +TBD + +User Experience +--------------- + +TBD + +Dependencies +------------ + +None + +Documentation +------------- + +TBD + +Status +------ + +In development + +Comments and Discussion +----------------------- diff --git a/done/GlusterFS 3.5/Virt store usecase.md b/done/GlusterFS 3.5/Virt store usecase.md new file mode 100644 index 0000000..3e649b2 --- /dev/null +++ b/done/GlusterFS 3.5/Virt store usecase.md @@ -0,0 +1,140 @@ + Work In Progress + Author - Satheesaran Sundaramoorthi + + +**Introduction** +---------------- + +Gluster volumes are used to host Virtual Machine Images, i.e., Virtual +Machine Images are stored on gluster volumes. This usecase is popularly +known as the *virt-store* usecase. + +This document explains more about, + +1. Enabling gluster volumes for virt-store usecase +2. Common Pitfalls +3. FAQs +4. References + +**Enabling gluster volumes for virt-store** +------------------------------------------- + +This section describes how to enable gluster volumes for virt store +usecase + +#### Volume Types + +Ideally, gluster volumes serving virt-store should provide +high-availability for the VMs running on them. If the volume is not +available, the VMs may move into an unusable state. So, it is +recommended to use a **replica** or **distribute-replicate** volume for +this usecase + +*If you are new to GlusterFS, you can take a look at +[QuickStart](http://gluster.readthedocs.org/en/latest/Quick-Start-Guide/Quickstart/) guide or the [admin +guide](http://gluster.readthedocs.org/en/latest/Administrator%20Guide/README/)* + +#### Tunables + +A set of volume options is recommended for the virt-store usecase, which +adds a performance boost.
Following are those options, + + quick-read=off + read-ahead=off + io-cache=off + stat-prefetch=off + eager-lock=enable + remote-dio=enable + quorum-type=auto + server-quorum-type=server + +- quick-read is meant for improving small-file read performance, which + is not relevant for VM Image files +- read-ahead is turned off. VMs have their own way of doing that. It + is usual to leave it to the VM to determine the read-ahead +- io-cache is turned off +- stat-prefetch is turned off. stat-prefetch caches the metadata + related to files and this is no longer a concern for VM Images (why + ?) +- eager-lock is turned on (why?) +- remote-dio is turned on, so in open() and creat() calls, O\_DIRECT + flag will be filtered at the client protocol level so server will + still continue to cache the file. +- quorum-type is set to auto. This basically enables client side + quorum. When client side quorum is enabled, there exists the rule + such that at least half of the bricks in the replica group should be + UP and running. If not, the replica group would become read-only +- server-quorum-type is set to server. This basically enables + server-side quorum. This lays a condition that in a cluster, at least + half the number of nodes should be UP. If not, the bricks (read as + brick processes) will be killed, and thereby the volume goes offline + +#### Applying the Tunables on the volume + +There are a number of ways to do it. + +1. Make use of group-virt.example file +2. Copy & Paste + +##### Make use of group-virt.example file + +This is the method best suited and recommended. +*/etc/glusterfs/group-virt.example* has all options recommended for +virt-store as explained earlier.
Copy this file, +*/etc/glusterfs/group-virt.example* to */var/lib/glusterd/groups/virt* + + cp /etc/glusterfs/group-virt.example /var/lib/glusterd/groups/virt + +Optimize the volume with all the options available in this *virt* file +in a single go + + gluster volume set group virt + +NOTE: No restart of the volume is required Verify the same with the +command, + + gluster volume info + +In forthcoming releases, this file will be automatically put in +*/var/lib/glusterd/groups/* and you can directly apply it on the volume + +##### Copy & Paste + +Copy all options from the above +section,[Virt-store-usecase\#Tunables](Virt-store-usecase#Tunables "wikilink") +and put in a file named *virt* in */var/lib/glusterd/groups/virt* Apply +all the options on the volume, + + gluster volume set group virt + +NOTE: This is not recommended, as the recommended volume options may/may +not change in future.Always stick to *virt* file available with the rpms + +#### Adding Ownership to Volume + +You can add uid:gid to the volume, + + gluster volume set storage.owner-uid + gluster volume set storage.owner-gid + +For example, when the volume would be accessed by qemu/kvm, you need to +add ownership as 107:107, + + gluster volume set storage.owner-uid 107 + gluster volume set storage.owner-gid 107 + +It would be 36:36 in the case of oVirt/RHEV, 165:165 in the case of +OpenStack Block Service (cinder),161:161 in case of OpenStack Image +Service (glance) is accessing this volume + +NOTE: Not setting the correct ownership may lead to "Permission Denied" +errors when accessing the image files residing on the volume + +**Common Pitfalls** +------------------- + +**FAQs** +-------- + +**References** +-------------- \ No newline at end of file diff --git a/done/GlusterFS 3.5/Zerofill.md b/done/GlusterFS 3.5/Zerofill.md new file mode 100644 index 0000000..43b279d --- /dev/null +++ b/done/GlusterFS 3.5/Zerofill.md @@ -0,0 +1,192 @@ +Feature +------- + +zerofill API for GlusterFS + +Summary 
+------- + +zerofill() API would allow creation of pre-allocated and zeroed-out +files on GlusterFS volumes by offloading the zeroing part to server +and/or storage (storage offloads use SCSI WRITESAME). + +Owners +------ + +Bharata B Rao +M. Mohankumar + +Current status +-------------- + +Patch on gerrit: + +Detailed Description +-------------------- + +Add support for a new ZEROFILL fop. Zerofill writes zeroes to a file in +the specified range. This fop will be useful when a whole file needs to +be initialized with zero (could be useful for zero filled VM disk image +provisioning or during scrubbing of VM disk images). + +Client/application can issue this FOP for zeroing out. Gluster server +will zero out required range of bytes ie server offloaded zeroing. In +the absence of this fop, client/application has to repetitively issue +write (zero) fop to the server, which is very inefficient method because +of the overheads involved in RPC calls and acknowledgements. + +WRITESAME is a SCSI T10 command that takes a block of data as input and +writes the same data to other blocks and this write is handled +completely within the storage and hence is known as offload . Linux ,now +has support for SCSI WRITESAME command which is exposed to the user in +the form of BLKZEROOUT ioctl. BD Xlator can exploit BLKZEROOUT ioctl to +implement this fop. Thus zeroing out operations can be completely +offloaded to the storage device , making it highly efficient. + +The fop takes two arguments offset and size. It zeroes out 'size' number +of bytes in an opened file starting from 'offset' position. + +Benefit to GlusterFS +-------------------- + +Benefits GlusterFS in virtualization by providing the ability to quickly +create pre-allocated and zeroed-out VM disk image by using +server/storage off-loads. + +### Scope + +Nature of proposed change +------------------------- + +An FOP supported in libgfapi and FUSE. + +Implications on manageability +----------------------------- + +None. 
+ +Implications on presentation layer +---------------------------------- + +N/A + +Implications on persistence layer +--------------------------------- + +N/A + +Implications on 'GlusterFS' backend +----------------------------------- + +N/A + +Modification to GlusterFS metadata +---------------------------------- + +N/A + +Implications on 'glusterd' +-------------------------- + +N/A + +How To Test +----------- + +Test server offload by measuring the time taken for creating a fully +allocated and zeroed file on Posix backend. + +Test storage offload by measuring the time taken for creating a fully +allocated and zeroed file on BD backend. + +User Experience +--------------- + +Fast provisioning of VM images when GlusterFS is used as a file system +backend for KVM virtualization. + +Dependencies +------------ + +zerofill() support in BD backend depends on the new BD translator - + + +Documentation +------------- + +This feature add support for a new ZEROFILL fop. Zerofill writes zeroes +to a file in the specified range. This fop will be useful when a whole +file needs to be initialized with zero (could be useful for zero filled +VM disk image provisioning or during scrubbing of VM disk images). + +Client/application can issue this FOP for zeroing out. Gluster server +will zero out required range of bytes ie server offloaded zeroing. In +the absence of this fop, client/application has to repetitively issue +write (zero) fop to the server, which is very inefficient method because +of the overheads involved in RPC calls and acknowledgements. + +WRITESAME is a SCSI T10 command that takes a block of data as input and +writes the same data to other blocks and this write is handled +completely within the storage and hence is known as offload . Linux ,now +has support for SCSI WRITESAME command which is exposed to the user in +the form of BLKZEROOUT ioctl. BD Xlator can exploit BLKZEROOUT ioctl to +implement this fop. 
Thus zeroing out operations can be completely +offloaded to the storage device, making it highly efficient. + +The fop takes two arguments offset and size. It zeroes out 'size' number +of bytes in an opened file starting from 'offset' position. + +This feature adds zerofill support to the following areas: + +-  libglusterfs +-  io-stats +-  performance/md-cache,open-behind +-  quota +-  cluster/afr,dht,stripe +-  rpc/xdr +-  protocol/client,server +-  io-threads +-  marker +-  storage/posix +-  libgfapi + +Client applications can exploit this fop by using glfs\_zerofill +introduced in libgfapi. FUSE support for this fop has not been added as +there is no system call for this fop. + +Here is a performance comparison of server offloaded zerofill vs zeroing +out using repeated writes. + + [root@llmvm02 remote]# time ./offloaded aakash-test log 20 + + real    3m34.155s + user    0m0.018s + sys 0m0.040s + + +  [root@llmvm02 remote]# time ./manually aakash-test log 20 + + real    4m23.043s + user    0m2.197s + sys 0m14.457s +  [root@llmvm02 remote]# time ./offloaded aakash-test log 25; + + real    4m28.363s + user    0m0.021s + sys 0m0.025s + [root@llmvm02 remote]# time ./manually aakash-test log 25 + + real    5m34.278s + user    0m2.957s + sys 0m18.808s + +The argument log is a file which we want to set for logging purpose and +the third argument is size in GB. + +As we can see there is a performance improvement of around 20% with this +fop. + +Status +------ + +Patch : Status : Merged \ No newline at end of file diff --git a/done/GlusterFS 3.5/gfid access.md b/done/GlusterFS 3.5/gfid access.md new file mode 100644 index 0000000..db64076 --- /dev/null +++ b/done/GlusterFS 3.5/gfid access.md @@ -0,0 +1,89 @@ +### Instructions + +**Feature** + +'gfid-access' translator to provide access to data in glusterfs using a virtual path.
+ +**1 Summary** + +This particular Translator is designed to provide direct access to files in glusterfs using their gfid. The 'GFID' is glusterfs's equivalent of an inode number, used to identify a file uniquely. + +**2 Owners** + +Amar Tumballi  +Raghavendra G  +Anand Avati  + +**3 Current status** + +With glusterfs-3.4.0, glusterfs provides only path based access. A feature is added in 'fuse' layer in the current master branch, +but it is desirable to have it as a separate translator for long-term +maintenance. + +**4 Detailed Description** + +With this method, we can consume the data in changelog translator +(which is logging 'gfid' internally) very efficiently. + +**5 Benefit to GlusterFS** + +Provides a way to access files quickly with direct gfid. + +​**6. Scope** + +6.1. Nature of proposed change + +* A new translator. +* Fixes in 'glusterfsd.c' to add this translator automatically based +on mount time option. +* change to mount.glusterfs to parse this new option  +(single digit number of lines changed) + +6.2. Implications on manageability + +* No CLI required. +* mount.glusterfs script gets a new option. + +6.3. Implications on presentation layer + +* A new virtual access path is made available. But all access protocols work seamlessly, as the complexities are handled internally. + +6.4. Implications on persistence layer + +* None + +6.5. Implications on 'GlusterFS' backend + +* None + +6.6. Modification to GlusterFS metadata + +* None + +6.7. Implications on 'glusterd' + +* None + +7 How To Test + +* Mount glusterfs client with '-o aux-gfid-mount' and access files using '/mount/point/.gfid/ '. + +8 User Experience + +* A new virtual path available for users. + +9 Dependencies + +* None + +10 Documentation + +This wiki. + +11 Status + +Patch sent upstream. More review comments required.
(http://review.gluster.org/5497) + +12 Comments and Discussion + +Please do give comments :-) \ No newline at end of file diff --git a/done/GlusterFS 3.5/index.md b/done/GlusterFS 3.5/index.md new file mode 100644 index 0000000..e8c2c88 --- /dev/null +++ b/done/GlusterFS 3.5/index.md @@ -0,0 +1,32 @@ +GlusterFS 3.5 Release +--------------------- + +Tentative Dates: + +Latest: 13-Nov, 2014 GlusterFS 3.5.3 + +17th Apr, 2014 - 3.5.0 GA + +GlusterFS 3.5 +------------- + +### Features in 3.5.0 + +- [Features/AFR CLI enhancements](./AFR CLI enhancements.md) +- [Features/exposing volume capabilities](./Exposing Volume Capabilities.md) +- [Features/File Snapshot](./File Snapshot.md) +- [Features/gfid-access](./gfid access.md) +- [Features/On-Wire Compression + Decompression](./Onwire Compression-Decompression.md) +- [Features/Quota Scalability](./Quota Scalability.md) +- [Features/readdir ahead](./readdir ahead.md) +- [Features/zerofill](./Zerofill.md) +- [Features/Brick Failure Detection](./Brick Failure Detection.md) +- [Features/disk-encryption](./Disk Encryption.md) +- Changelog based parallel geo-replication +- Improved block device translator + +Proposing New Features +---------------------- + +New feature proposals should be built using the New Feature Template in +the GlusterFS 3.7 planning page diff --git a/done/GlusterFS 3.5/libgfapi with qemu libvirt.md b/done/GlusterFS 3.5/libgfapi with qemu libvirt.md new file mode 100644 index 0000000..2309016 --- /dev/null +++ b/done/GlusterFS 3.5/libgfapi with qemu libvirt.md @@ -0,0 +1,222 @@ + Work In Progress + Author - Satheesaran Sundaramoorthi + + +**Purpose** +----------- + +Gluster volume can be used to store VM Disk images. This usecase is +popularly known as 'Virt-Store' usecase. Earlier, gluster volume had to +be fuse mounted and images are created/accessed over the fuse mount. + +With the introduction of GlusterFS libgfapi, QEMU supports glusterfs +through libgfapi directly. 
This we call the *QEMU driver for glusterfs*. +This document explains how to make use of the QEMU driver for +glusterfs + +The steps for the entire procedure can be split into two views, +namely, + +1. Steps to be done on gluster volume side +2. Steps to be done on Hypervisor side + +**Steps to be done on gluster side** +------------------------------------ + +These are the steps that need to be done on the gluster side. Precisely, +this involves + +1. Creating "Trusted Storage Pool" +2. Creating a volume +3. Tuning the volume for virt-store +4. Tuning glusterd to accept requests from QEMU +5. Tuning glusterfsd to accept requests from QEMU +6. Setting ownership on the volume +7. Starting the volume + +##### Creating "Trusted Storage Pool" + +Install glusterfs rpms on the NODE. You can create a volume with a +single node. You can also scale up the cluster, which we call the *Trusted +Storage Pool*, by adding more nodes to the cluster + + gluster peer probe  + +##### Creating a volume + +It is highly recommended to have replicate volume or +distribute-replicate volume for virt-store usecase, as it would add high +availability and fault-tolerance. Remember the plain distribute works +equally well + + gluster volume create replica 2 .. + +where, is :/ Note: It is recommended to +create sub-directories inside brick and that could be used to create a +volume. For example, say, */home/brick1* is the mountpoint of XFS, then +you can create a sub-directory inside it */home/brick1/b1* and use it +while creating a volume. You can also use space available in root +filesystem for bricks. Gluster cli, by default, throws a warning in that +case. You can override by using the *force* option + + gluster volume create replica 2 ..
force + +*If you are new to GlusterFS, you can take a look at +[QuickStart](http://gluster.readthedocs.org/en/latest/Quick-Start-Guide/Quickstart/) guide.* + +##### Tuning the volume for virt-store + +There are recommended settings available for virt-store. This provide +good performance characteristics when enabled on the volume that was +used for *virt-store* + +Refer to +[Virt-store-usecase\#Tunables](Virt-store-usecase#Tunables "wikilink") +for recommended tunables and for applying them on the volume, +[Virt-store-usecase\#Applying\_the\_Tunables\_on\_the\_volume](Virt-store-usecase#Applying_the_Tunables_on_the_volume "wikilink") + +##### Tuning glusterd to accept requests from QEMU + +glusterd receives the request only from the applications that run with +port number less than 1024 and it blocks otherwise. QEMU uses port +number greater than 1024 and to make glusterd accept requests from QEMU, +edit the glusterd vol file, */etc/glusterfs/glusterd.vol* and add the +following, + + option rpc-auth-allow-insecure on + +Note: If you have installed glusterfs from source, you can find glusterd +vol file at */usr/local/etc/glusterfs/glusterd.vol* + +Restart glusterd after adding that option to glusterd vol file + + service glusterd restart + +##### Tuning glusterfsd to accept requests from QEMU + +Enable the option *allow-insecure* on the particular volume + + gluster volume set  server.allow-insecure on + +**IMPORTANT :** As of now(april 2,2014)there is a bug, as +*allow-insecure* is not dynamically set on a volume.You need to restart +the volume for the change to take effect + +##### Setting ownership on the volume + +Set the ownership of qemu:qemu on to the volume + + gluster volume set  storage.owner-uid 107 + gluster volume set  storage.owner-gid 107 + +**IMPORTANT :** The UID and GID can differ per Linux distribution, or +even installation. 
The UID/GID should be the one from the *qemu* or +*kvm* user; you can get the IDs with these commands: + + id qemu + getent group kvm + +##### Starting the volume + +Start the volume + + gluster volume start  + +**Steps to be done on Hypervisor Side** +--------------------------------------- + +Hypervisor is just the machine which spawns the Virtual Machines. This +machine should ideally be bare metal with more memory and +computing power. The following steps need to be done on the hypervisor, + +1. Install qemu-kvm +2. Install libvirt +3. Create a VM Image +4. Add ownership to the Image file +5. Create libvirt XML to define Virtual Machine +6. Define the VM +7. Start the VM +8. Verification + +##### Install qemu-kvm + +##### Install libvirt + +##### Create a VM Image + +Images can be created using the *qemu-img* utility + + qemu-img create -f gluster://// + +- format - This can be raw or qcow2 +- server - One of the gluster Node's IP or FQDN +- vol-name - gluster volume name +- image - Image File name +- size - Size of the image + +Here is a sample, + + qemu-img create -f qcow2 gluster://host.sample.com/vol1/vm1.img 10G + +##### Add ownership to the Image file + +NFS or FUSE mount the glusterfs volume and change the ownership of the +image file to qemu:qemu + + mount -t nfs -o vers=3 :/  + +Change the ownership of the image file that was earlier created using +the *qemu-img* utility + + chown qemu:qemu / + +##### Create libvirt XML to define Virtual Machine + +*virt-install* is a Python wrapper which is mostly used to create a VM using +a set of params. *virt-install* doesn't support any network filesystem [ + ] + +Create a libvirt xml - See to it +that the disk section is formatted in such a way that the qemu driver for +glusterfs is used. This can be seen in the following example xml +description + + + + + + + +
+ + +##### Define the VM from XML + +Define the VM from the XML file that was created earlier + + virsh define  + +Verify that the VM is created successfully + + virsh list --all + +##### Start the VM + +Start the VM + + virsh start  + +##### Verification + +You can verify the disk image file that is being used by VM + + virsh domblklist  + +The above should show the volume name and image name. Here is the +example, + + [root@test ~]# virsh domblklist vm-test2 + Target Source + ------------------------------------------------ + vda distrepvol/test.img + hdc - \ No newline at end of file diff --git a/done/GlusterFS 3.5/readdir ahead.md b/done/GlusterFS 3.5/readdir ahead.md new file mode 100644 index 0000000..fe34a97 --- /dev/null +++ b/done/GlusterFS 3.5/readdir ahead.md @@ -0,0 +1,117 @@ +Feature +------- + +readdir-ahead + +Summary +------- + +Provide read-ahead support for directories to improve sequential +directory read performance. + +Owners +------ + +Brian Foster + +Current status +-------------- + +Gluster currently does not attempt to improve directory read +performance. As a result, simple operations (i.e., ls) on large +directories are slow. + +Detailed Description +-------------------- + +The read-ahead feature for directories is analogous to read-ahead for +files. The objective is to detect sequential directory read operations +and establish a pipeline for directory content. When a readdir request +is received and fulfilled, preemptively issue subsequent readdir +requests to the server in anticipation of those requests from the user. +If sequential readdir requests are received, the directory content is +already immediately available in the client. If subsequent requests are +not sequential or not received, said data is simply dropped and the +optimization is bypassed. + +Benefit to GlusterFS +-------------------- + +Improved read performance of large directories. 
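The pipeline described under Detailed Description for readdir-ahead can be sketched in C as below. This is only an illustrative simulation under stated assumptions, not the translator's actual code: the `rda_` names, the `RDA_BATCH` size and the counters are hypothetical, and the prefetch is done synchronously here, whereas a real client would issue it asynchronously.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RDA_BATCH 4 /* hypothetical number of entries fetched per request */

/* Hypothetical per-open-directory state kept on the client side. */
struct rda_ctx {
    long next_off;         /* offset at which the buffered batch starts */
    bool have_prefetch;    /* true if a batch for next_off is buffered */
    unsigned hits;         /* requests served from the buffer */
    unsigned misses;       /* requests that bypassed the optimization */
    unsigned server_calls; /* simulated round trips to the server */
};

/* Stand-in for a readdir round trip to the server. */
static void rda_fetch_from_server(struct rda_ctx *ctx)
{
    ctx->server_calls++;
}

/*
 * Serve a readdir request for the batch starting at `off`.
 * Returns true if it was answered from the prefetched buffer.
 */
bool rda_readdir(struct rda_ctx *ctx, long off)
{
    assert(ctx != NULL);

    bool hit = ctx->have_prefetch && ctx->next_off == off;
    if (hit) {
        ctx->hits++;
    } else {
        /* Not sequential: drop buffered data and go to the server. */
        ctx->have_prefetch = false;
        ctx->misses++;
        rda_fetch_from_server(ctx);
    }

    /* Preemptively fetch the following batch for the next request. */
    rda_fetch_from_server(ctx);
    ctx->next_off = off + RDA_BATCH;
    ctx->have_prefetch = true;
    return hit;
}
```

In this sketch a sequential scan pays for one extra fetch up front and afterwards finds each batch already buffered, while a seek to an unrelated offset simply discards the buffered batch.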
+ +### Scope + +Nature of proposed change +------------------------- + +readdir-ahead support is enabled through a new client-side translator. + +Implications on manageability +----------------------------- + +None beyond the ability to enable and disable the translator. + +Implications on presentation layer +---------------------------------- + +N/A + +Implications on persistence layer +--------------------------------- + +N/A + +Implications on 'GlusterFS' backend +----------------------------------- + +N/A + +Modification to GlusterFS metadata +---------------------------------- + +N/A + +Implications on 'glusterd' +-------------------------- + +N/A + +How To Test +----------- + +Performance testing. Verify that sequential reads of large directories +complete faster (i.e., ls, xfs\_io -c readdir). + +User Experience +--------------- + +Improved performance on sequential read workloads. The translator should +otherwise be invisible and not detract performance or disrupt behavior +in any way. + +Dependencies +------------ + +N/A + +Documentation +------------- + +Set the associated config option to enable or disable directory +read-ahead on a volume: + + gluster volume set  readdir-ahead [enable|disable] + +readdir-ahead is disabled by default. + +Status +------ + +Development complete for the initial version. Minor changes and bug +fixes likely. + +Future versions might expand to provide generic caching and more +flexible behavior. 
+ +Comments and Discussion +----------------------- \ No newline at end of file diff --git a/done/GlusterFS 3.6/Better Logging.md b/done/GlusterFS 3.6/Better Logging.md new file mode 100644 index 0000000..6aad602 --- /dev/null +++ b/done/GlusterFS 3.6/Better Logging.md @@ -0,0 +1,348 @@ +Feature +------- + +Gluster logging enhancements to support message IDs per message + +Summary +------- + +Enhance gluster logging to provide the following features, SubFeature +--\> SF + +- SF1: Add message IDs to message + +- SF2: Standardize error num reporting across messages + +- SF3: Enable repetitive message suppression in logs + +- SF4: Log location and hierarchy standardization (in case anything is +further required here, analysis pending) + +- SF5: Enable per sub-module logging level configuration + +- SF6: Enable logging to other frameworks, than just the current gluster +logs + +- SF7: Generate a catalogue of these message, with message ID, message, +reason for occurrence, recovery/troubleshooting steps. + +Owners +------ + +Balamurugan Arumugam +Krishnan Parthasarathi +Krutika Dhananjay +Shyamsundar Ranganathan + +Current status +-------------- + +### Existing infrastructure: + +Currently gf\_logXXX exists as an infrastructure API for all logging +related needs. This (typically) takes the form, + +gf\_log(dom, levl, fmt...) + +where, + +    dom: Open format string usually the xlator name, or "cli" or volume name etc. 
+    levl: One of, GF_LOG_EMERG, GF_LOG_ALERT, GF_LOG_CRITICAL, GF_LOG_ERROR, GF_LOG_WARNING, GF_LOG_NOTICE, GF_LOG_INFO, GF_LOG_DEBUG, GF_LOG_TRACE +    fmt: the actual message string, followed by the required arguments in the string + +The log initialization happens through, + +gf\_log\_init (void \*data, const char \*filename, const char \*ident) + +where, + +    data: glusterfs_ctx_t, largely unused in logging other than the required FILE and mutex fields +    filename: file name to log to +    ident: Like syslog ident parameter, largely unused + +The above infrastructure leads to logs of type, (sample extraction from +nfs.log) + +     [2013-12-08 14:17:17.603879] I [socket.c:3485:socket_init] 0-socket.ACL: SSL support is NOT enabled +     [2013-12-08 14:17:17.603937] I [socket.c:3500:socket_init] 0-socket.ACL: using system polling thread +     [2013-12-08 14:17:17.612128] I [nfs.c:934:init] 0-nfs: NFS service started +     [2013-12-08 14:17:17.612383] I [dht-shared.c:311:dht_init_regex] 0-testvol-dht: using regex rsync-hash-regex = ^\.(.+)\.[^.]+$ + +### Limitations/Issues in the infrastructure + +​1) Auto analysis of logs needs to be done based on the final message +string. Automated tools that can help with log message and related +troubleshooting options need to use the final string, which needs to be +intelligently parsed and also may change between releases. It would be +desirable to have message IDs so that such tools and trouble shooting +options can leverage the same in a much easier fashion. + +​2) The log message itself currently does not use the \_ident\_ which +can help as we move to more common logging frameworks like journald, +rsyslog (or syslog as the case maybe) + +​3) errno is the primary identifier of errors across gluster, i.e we do +not have error codes in gluster and use errno values everywhere. 
The log +messages currently do not lend themselves to standardization like +printing the string equivalent of errno rather than the actual errno +value, which \_could\_ be cryptic to administrators. + +4) Typical logging infrastructures provide suppression (on a +configurable basis) for repetitive messages to prevent log flooding; +this is missing in the current infrastructure. + +5) The current infrastructure cannot be used to control log levels at a +per xlator or sub-module level, as the \_dom\_ passed is a string that changes +based on volume name, translator name etc. It would be desirable to have +a better module identification mechanism that can help with this +feature. + +6) Currently the entire logging infrastructure resides within gluster. +It would be desirable in scaled situations to have centralized logging +and monitoring solutions in place, to be able to better analyse and +monitor the cluster health and take actions. + +This requires some form of pluggable logging frameworks that can be used +within gluster to enable this possibility. Currently the existing +framework is used throughout gluster and hence we need only to change +configuration and logging.c to enable logging to other frameworks (as an +example the current syslog plug that was provided). + +It would be desirable to enhance this to provide a more robust framework +for future extensions to other frameworks. This is not a limitation of +the current framework, so much as a re-factor to be able to switch +logging frameworks with more ease. + +7) Centralized logging in the future would need better +identification strings from various gluster processes and hosts, which +are currently missing or suppressed in the logging infrastructure. + +Due to the nature of enhancements proposed, it is required that we +better the current infrastructure for the stated needs and do some +future proofing in terms of newer messages that would be added.
+ +Detailed Description +-------------------- + +NOTE: Covering details for SF1, SF2, and partially SF3, SF5, SF6. SF4/7 +will be covered in later revisions/phases. + +### Logging API changes: + +​1) Change the logging API as follows, + +From: gf\_log(dom, levl, fmt...) + +To: gf\_msg(dom, levl, errnum, msgid, fmt...) + +Where: + +    dom: Open string as used in the current logging infrastructure (helps in backward compat) +    levl: As in current logging infrastructure (current levels seem sufficient enough to not add more levels for better debuggability etc.) +     +    msgid: A message identifier, unique to this message FMT string and possibly this invocation. (SF1, lending to SF3) +    errnum: The errno that this message is generated for (with an implicit 0 meaning no error number per se with this message) (SF2) + +NOTE: Internally the gf\_msg would still be a macro that would add the +\_\_FILE\_\_ \_\_LINE\_\_ \_\_FUNCTION\_\_ arguments + +​2) Enforce \_ident\_ in the logging initialization API, gf\_log\_init +(void \*data, const char \*filename, const char \*ident) + +Where: + + ident would be the identifier string like, nfs, , brick-, cli, glusterd, as is the case with the log file name that is generated today (lending to SF6) + +#### What this achieves: + +With the above changes, we now have a message ID per message +(\_msgid\_), location of the message in terms of which component +(\_dom\_) and which process (\_ident\_). The further identification of +the message location in terms of host (ip/name) can be done in the +framework, when centralized logging infrastructure is introduced. 
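The note that gf\_msg stays a macro which adds the \_\_FILE\_\_, \_\_LINE\_\_ and \_\_FUNCTION\_\_ arguments can be illustrated with a minimal C sketch. This is not the GlusterFS implementation: sketch\_msg and sketch\_format are hypothetical names, the log level and timestamp are left out to keep the output deterministic, and the line layout only loosely follows the proposed format.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch only: sketch_format() and sketch_msg() are
 * illustration names, not GlusterFS APIs. */

/* Format one log line into buf: message ID, call site, domain, message
 * text and, for a non-zero errnum, its string form. Returns the
 * snprintf-style length. */
static int
sketch_format (char *buf, size_t len, const char *file, int line,
               const char *func, const char *dom, int errnum, int msgid,
               const char *fmt, ...)
{
        char    body[256];
        va_list ap;

        assert (buf != NULL && fmt != NULL);

        va_start (ap, fmt);
        vsnprintf (body, sizeof (body), fmt, ap);
        va_end (ap);

        /* errnum == 0 means "no error number for this message". */
        if (errnum != 0)
                /* strerror() is used for brevity; it is not thread-safe. */
                return snprintf (buf, len, "[MSGID: %d] [%s:%d:%s] %s: %s [%s]",
                                 msgid, file, line, func, dom, body,
                                 strerror (errnum));

        return snprintf (buf, len, "[MSGID: %d] [%s:%d:%s] %s: %s",
                         msgid, file, line, func, dom, body);
}

/* The macro supplies the call-site location so callers never pass it.
 * The format string travels inside __VA_ARGS__. */
#define sketch_msg(buf, len, dom, errnum, msgid, ...)                   \
        sketch_format (buf, len, __FILE__, __LINE__, __func__, dom,     \
                       errnum, msgid, __VA_ARGS__)
```

A caller would write something like sketch\_msg (out, sizeof (out), "0-logchecks", 0, 1002, "Format testing: %d", 42) and get the file, line and function of the call site filled in automatically, which is the property the proposal relies on for keeping call sites short.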
+ +#### Log message changes: + +With the above changes to the API the log message can now appear in a +compatibility mode to adhere to current logging format, or be presented +as follows, + +log invoked as: gf\_msg(dom, levl, ENOTSUP, msgidX) + +Example: gf\_msg ("logchecks", GF\_LOG\_CRITICAL, 22, logchecks\_msg\_4, +42, "Forty-Two", 42); + +Where: logchecks\_msg\_4 (GLFS\_COMP\_BASE + 4), "Critical: Format +testing: %d:%s:%x" + +1) Gluster logging framework (logged as) + +    [2014-02-17 08:52:28.038267] I [MSGID: 1002] [logchecks.c:44:go_log] 0-logchecks: Informational: Format testing: 42:Forty-Two:2a [Invalid argument] + +2) syslog (passed as) + +    Feb 17 14:17:42 somari logchecks[26205]: [MSGID: 1002] [logchecks.c:44:go_log] 0-logchecks: Informational: Format testing: 42:Forty-Two:2a [Invalid argument] + +3) journald (passed as) + +    sd_journal_send("MESSAGE=", +                        "MESSAGE_ID=msgid", +                        "PRIORITY=levl", +                        "CODE_FILE=", "CODE_LINE=", "CODE_FUNC=", +                        "ERRNO=errnum", +                        "SYSLOG_IDENTIFIER=", +                        NULL); + +4) CEE (Common Event Expression) format string passed to any CEE +consumer (say lumberjack) + +Based on generating @CEE JSON string as per specifications and passing +it to the infrastructure in question. + +#### Message ID generation: + +1) Some rules for message IDs + +- Every message, even if it is the same message FMT, will have a unique message ID +- Changes to a specific message string hence will not change +its ID and also not impact other locations in the code that use the same +message FMT + +2) A glfs-message-id.h file would contain ranges per component for +individual component based messages to be created without overlapping on +the ranges.
+ +3) -message.h would contain something as follows, + +     #define GLFS_COMP_BASE         GLFS_MSGID_COMP_ +     #define GLFS_NUM_MESSAGES       1 +     #define GLFS_MSGID_END          (GLFS_COMP_BASE + GLFS_NUM_MESSAGES + 1) +     /* Messages with message IDs */ +     #define glfs_msg_start_x GLFS_COMP_BASE, "Invalid: Start of messages" +     /*------------*/ +     #define _msg_1 (GLFS_COMP_BASE + 1), "Test message, replace with"\ +                        " original when using the template" +     /*------------*/ +     #define glfs_msg_end_x GLFS_MSGID_END, "Invalid: End of messages" + +4) Each call to gf\_msg hence would be, + +    gf_msg(dom, levl, errnum, glfs_msg_x, ...) + +#### Setting per xlator logging levels (SF5): + +short description to be elaborated later + +Leverage this-\>loglevel to override the global loglevel. This can +also be configured from gluster CLI at runtime to change the log levels at +a per xlator level for targeted debugging. + +#### Multiple log suppression (SF3): + +short description to be elaborated later + +1) Save the message string as follows, Msg\_Object(msgid, +msgstring(vasprintf(dom, fmt)), timestamp, repetitions) + +2) On each message received by the logging infrastructure check the +list of saved last few Msg\_Objects as follows, + +2.1) compare msgid and on success compare msgstring for a match, compare +repetition tolerance time with current TS and saved TS in the +Msg\_Object + +2.1.1) if tolerance is within limits, increment repetitions and do not +print message + +2.1.2) if tolerance is outside limits, print repetition count for saved +message (if any) and print the new message + +2.2) If none of the messages match the current message, knock off the +oldest message in the list printing any repetition count message for the +same, and stash new message into the list + +The key things to remember and act on here would be to minimize the +string duplication on each message, and also to keep the comparison +quick (hence
base it off message IDs and errno to start with) + +#### Message catalogue (SF7): + + + +The idea is to use Doxygen comments in the -message.h per +component, to list information in various sections per message of +consequence and later use Doxygen to publish this catalogue on a per +release basis. + +Benefit to GlusterFS +-------------------- + +The mentioned limitations and auto log analysis benefits would accrue +for GlusterFS + +Scope +----- + +### Nature of proposed change + +All gf\_logXXX function invocations would change to gf\_msgXXX +invocations. + +### Implications on manageability + +None + +### Implications on presentation layer + +None + +### Implications on persistence layer + +None + +### Implications on 'GlusterFS' backend + +None + +### Modification to GlusterFS metadata + +None + +### Implications on 'glusterd' + +None + +How To Test +----------- + +A separate test utility that tests various logs and formats would be +provided to ensure that functionality can be tested independent of +GlusterFS + +User Experience +--------------- + +Users would notice changed logging formats as mentioned above, the +additional field of importance would be the MSGID: + +Dependencies +------------ + +None + +Documentation +------------- + +Intending to add a logging.md (or modify the same) to elaborate on how a +new component should now use the new framework and generate messages +with IDs in the same. + +Status +------ + +In development (see, ) + +Comments and Discussion +----------------------- + + \ No newline at end of file diff --git a/done/GlusterFS 3.6/Better Peer Identification.md b/done/GlusterFS 3.6/Better Peer Identification.md new file mode 100644 index 0000000..a8c6996 --- /dev/null +++ b/done/GlusterFS 3.6/Better Peer Identification.md @@ -0,0 +1,172 @@ +Feature +------- + +**Better peer identification** + +Summary +------- + +This proposal is regarding better identification of peers. 
+ +Owners +------ + +Kaushal Madappa + +Current status +-------------- + +Glusterd currently is inconsistent in the way it identifies peers. This +causes problems when the same peer is referenced with different names in +different gluster commands. + +Detailed Description +-------------------- + +Currently, the way we identify peers is not consistent all through the +gluster code. We use uuids internally and hostnames externally. + +This setup works pretty well when all the peers are on a single network, +have one address, and are referred to in all the gluster commands with +the same address. + +But once we start mixing up addresses in the commands (ip, shortnames, +fqdn) and bring in multiple networks we have problems. + +The problems were discussed in the following mailing list threads and +some solutions were proposed. + +- How do we identify peers? [^1] +- RFC - "Connection Groups" concept [^2] + +The solution to the multi-network problem is dependent on the solution +to the peer identification problem. So it'll be good to target fixing +the peer identification problem as soon as possible, i.e. in 3.6, and take up the +networks problem later. + +Benefit to GlusterFS +-------------------- + +Sanity. It will be great to have all internal identifiers for peers +happening through a UUID, and being translated into a host/IP at the +most superficial layer. + +Scope +----- + +### Nature of proposed change + +The following changes will be done in Glusterd to improve peer +identification. + +1. Peerinfo struct will be extended to have a list of associated + hostnames/addresses, instead of a single hostname as it is + currently. The import/export and store/restore functions will be + changed to handle this. CLI will be updated to show this list of + addresses in peer status and pool list commands. +2. Peer probe will be changed to append an address to the peerinfo + address list, when we observe that the given address belongs to an + existing peer. +3.
Have a new API for translation of hostnames/addresses into + UUIDs. This new API will be used in all places where + hostnames/addresses were being validated, including peer probe, peer + detach, volume create, add-brick, remove-brick etc. +4. A new command - 'gluster peer add-address ' - which + appends to the address list will be implemented if time + permits. +5. A new command - 'gluster peer rename ' - which will + rename all occurrences of a peer with the newly given name will be + implemented if time permits. + +Changes 1-3 are the base for the other changes and will be the primary +deliverables for this feature. + +### Implications on manageability + +The primary changes will bring about some changes to the CLI output of +'peer status' and 'pool list' commands. The normal and XML outputs for +these commands will contain a list of addresses for each peer, instead +of a single hostname. + +Tools depending on the output of these commands will need to be updated. + +**TODO**: *Add sample outputs* + +The new commands 'peer add-address' and 'peer rename' will improve +manageability of peers. + +### Implications on presentation layer + +None + +### Implications on persistence layer + +None + +### Implications on 'GlusterFS' backend + +None + +### Modification to GlusterFS metadata + +None + +### Implications on 'glusterd' + + + +How To Test +----------- + +**TODO:** *Add test cases* + +User Experience +--------------- + +User experience will improve for commands which use peer identifiers +(volume create/add-brick/remove-brick, peer probe, peer detach), as the +user will no longer face errors caused by mixed usage of +identifiers. + +Dependencies +------------ + +None. + +Documentation +------------- + +The new behaviour of the peer probe command will need to be documented. +The new commands will need to be documented as well. + +**TODO:** *Add more documentation* + +Status +------ + +The feature is under development on forge [^3] and github [^4].
This +github merge request [^5] can be used for performing preliminary +reviews. Once we are satisfied with the changes, it will be posted for +review on gerrit. + +Comments and Discussion +----------------------- + +There are open issues around node crash + re-install with same IP (but +new UUID) which need to be addressed in this effort. + +Links +----- + + + + +[^1]: + +[^2]: + +[^3]: + +[^4]: + +[^5]: diff --git a/done/GlusterFS 3.6/Gluster User Serviceable Snapshots.md b/done/GlusterFS 3.6/Gluster User Serviceable Snapshots.md new file mode 100644 index 0000000..9af7062 --- /dev/null +++ b/done/GlusterFS 3.6/Gluster User Serviceable Snapshots.md @@ -0,0 +1,39 @@ +Feature +------- + +Enable user-serviceable snapshots for GlusterFS Volumes based on +GlusterFS-Snapshot feature + +Owners +------ + +Anand Avati +Anand Subramanian +Raghavendra Bhat +Varun Shastry + +Summary +------- + +Each snapshot capable GlusterFS Volume will contain a .snaps directory +through which a user will be able to access previous point-in-time +snapshot copies of their data. This will be enabled through a hidden +.snaps folder in each directory or sub-directory within the volume. +These user-serviceable snapshot copies will be read-only. + +Tests +----- + +1) Enable uss (gluster volume set features.uss enable). A snap daemon +should get started for the volume. It should be visible in the gluster +volume status command. +2) Entering the snapshot world: ls on .snaps from any directory within +the filesystem should be successful and should show the list of +snapshots as directories. +3) Accessing the snapshots: one of the snapshots can be entered and it +should show the contents of the directory from which .snaps was entered, as of when the snapshot was taken. +NOTE: If the directory was not present when a snapshot was taken (say snap1) and was created later, then entering the snap1 +directory (or any access) will fail with stale file handle. +4) Reading from snapshots: any kind of read operation from the snapshots +should be successful, but modifications to snapshot data are not allowed. +Snapshots are read-only. \ No newline at end of file diff --git a/done/GlusterFS 3.6/Gluster Volume Snapshot.md b/done/GlusterFS 3.6/Gluster Volume Snapshot.md new file mode 100644 index 0000000..468992a --- /dev/null +++ b/done/GlusterFS 3.6/Gluster Volume Snapshot.md @@ -0,0 +1,354 @@ +Feature +------- + +Snapshot of Gluster Volume + +Summary +------- + +Gluster volume snapshot will provide point-in-time copy of a GlusterFS +volume. This snapshot is an online-snapshot therefore file-system and +its associated data continue to be available for the clients, while the +snapshot is being taken. + +Snapshot of a GlusterFS volume will create another read-only volume +which will be a point-in-time copy of the original volume. Users can use +this read-only volume to recover any file(s) they want. Snapshot will +also provide a restore feature which will help the user to recover an +entire volume. The restore operation will replace the original volume +with the snapshot volume. + +Owner(s) +-------- + +Rajesh Joseph + +Copyright +--------- + +Copyright (c) 2013-2014 Red Hat, Inc. + +This feature is licensed under your choice of the GNU Lesser General +Public License, version 3 or any later version (LGPLv3 or later), or the +GNU General Public License, version 2 (GPLv2), in all cases as published +by the Free Software Foundation. + +Current status +-------------- + +Gluster volume snapshot support is provided in GlusterFS 3.6 + +Detailed Description +-------------------- + +GlusterFS snapshot feature will provide a crash consistent point-in-time +copy of Gluster volume(s). This snapshot is an online-snapshot therefore +file-system and its associated data continue to be available for the +clients, while the snapshot is being taken. As of now we are not +planning to provide application level crash consistency.
That means if a +snapshot is restored then applications need to rely on journals or other +techniques to recover or clean up some of the operations performed on the +GlusterFS volume. + +A GlusterFS volume is made up of multiple bricks spread across multiple +nodes. Each brick translates to a directory path on a given file-system. +The current snapshot design is based on thinly provisioned LVM2 snapshot +feature. Therefore as a prerequisite the Gluster bricks should be on +thinly provisioned LVM. For a single LVM, taking a snapshot would be +straightforward for the admin, but this is compounded in a GlusterFS +volume which has bricks spread across multiple LVMs across multiple +nodes. Gluster volume snapshot feature aims to provide a set of +interfaces from which the admin can snap and manage the snapshots for +Gluster volumes. + +Gluster volume snapshot is nothing but snapshots of all the bricks in +the volume. So ideally all the bricks should be snapped at the same +time. But with real-life latencies (processor and network) this may not +hold true all the time. Therefore we need to make sure that during +snapshot the file-system is in consistent state. Therefore we barrier +a few operations so that the file-system remains in a healthy state during +snapshot. + +For details about the barrier, see [Server Side +Barrier](http://www.gluster.org/community/documentation/index.php/Features/Server-side_Barrier_feature) + +Benefit to GlusterFS +-------------------- + +Snapshot of a GlusterFS volume provides users with + +- A point-in-time checkpoint from which to recover/fail back +- Read-only snaps that can be the source of backups. + +Scope +----- + +### Nature of proposed change + +Gluster cli will be modified to provide new commands for snapshot +management. The entire snapshot core implementation will be done in +glusterd. + +Apart from this Snapshot will also make use of quiescing xlator for +doing quiescing.
This will be a server side translator which will +quiesce all fops which can modify disk state. The quiescing will be done +till the snapshot operation is complete. + +### Implications on manageability + +Snapshot will provide new set of cli commands to manage snapshots. REST +APIs are not planned for this release. + +### Implications on persistence layer + +Snapshot will create new volume per snapshot. These volumes are stored +in /var/lib/glusterd/snaps folder. Apart from this each volume will have +additional snapshot related information stored in snap\_list.info file +in its respective vol folder. + +### Implications on 'glusterd' + +Snapshot information and snapshot volume details are stored in +persistent stores. + +How To Test +----------- + +For testing this feature one needs to have multiple thinly provisioned +volumes or else needs to create LVM using loop back devices. + +Details of how to create thin volume can be found at the following link + + +Each brick needs to be in an independent LVM. And these LVMs should be +thinly provisioned. From these bricks create Gluster volume. This volume +can then be used for snapshot testing. + +See the User Experience section for various commands of snapshot. + +User Experience +--------------- + +##### Snapshot creation + +    snapshot create   [description ] [force] + +This command will create a snapshot of the volume identified by volname. +snapname is a mandatory field and the name should be unique in the +entire cluster. Users can also provide an optional description to be +saved along with the snap (max 1024 characters). The force keyword is used +if some bricks of the original volume are down and you still want to take the +snapshot. + +##### Listing of available snaps + +    gluster snapshot list [snap-name] [vol ] + +This command is used to list all snapshots taken, or those of a specified +volume. If snap-name is provided then it will list the details of that +snap.
+ +##### Configuring the snapshot behavior + +    gluster snapshot config [vol-name] + +This command will display existing config values for a volume. If volume +name is not provided then config values of all the volumes are displayed. + +    gluster snapshot config [vol-name] [ ] [ ] [force] + +The above command can be used to change the existing config values. If +vol-name is provided then config value of that volume is changed, else +it will set/change the system limit. + +The system limit is the default value of the config for all the volumes. +Volume specific limit cannot cross the system limit. If a volume +specific limit is not provided then system limit will be considered. + +If any of these limits is decreased and the current snap count of the +system/volume is more than the limit then the command will fail. If the user +still wants to decrease the limit then the force option should be used. + +**snap-max-limit**: Maximum snapshot limit for a volume. Snapshot +creation will fail if the snap count reaches this limit. + +**snap-max-soft-limit**: Soft snapshot limit for a volume. Snapshots +can still be created if the snap count reaches this limit, but an auto-deletion +will be triggered: the oldest snaps will be +deleted. This is represented as a +percentage value. + +##### Status of snapshots + +    gluster snapshot status ([snap-name] | [volume ]) + +Shows the status of all the snapshots or the specified snapshot. The +status will include the brick details, LVM details, process details, +etc. + +##### Activating a snap volume + +By default the snapshot created will be in an inactive state. Use the +following command to activate a snapshot. + +    gluster snapshot activate  + +##### Deactivating a snap volume + +    gluster snapshot deactivate  + +The above command will deactivate an active snapshot. + +##### Deleting snaps + +    gluster snapshot delete  + +This command will delete the specified snapshot.
+ +##### Restoring snaps + +    gluster snapshot restore  + +This command restores an already taken snapshot of single or multiple +volumes. Snapshot restore is an offline activity therefore if any volume +which is part of the given snap is online then the restore operation +will fail. + +Once the snapshot is restored it will be deleted from the list of +snapshots. + +Dependencies +------------ + +To provide support for a crash-consistent snapshot feature Gluster core +components themselves should be crash-consistent. As of now Gluster as a +whole is not crash-consistent. In this section we will identify those +Gluster components which are not crash-consistent. + +**Geo-Replication**: Geo-replication provides master-slave +synchronization option to Gluster. Geo-replication maintains state +information for completing the sync operation. Therefore ideally when a +snapshot is taken then both the master and slave snapshot should be +taken. And both master and slave snapshot should be in mutually +consistent state. + +Geo-replication makes use of change-log to do the sync. By default the +change-log is stored in the .glusterfs folder in every brick. But the +change-log path is configurable. If change-log is part of the brick then +snapshot will contain the change-log changes as well. But if it is not +then it needs to be saved separately during a snapshot. + +Following things should be considered for making change-log +crash-consistent: + +- Change-log is part of the brick of the same volume. +- Change-log is outside the brick. As of now there is no size limit on +  the change-log files. We need to answer following questions here +    - Time taken to make a copy of the entire change-log. Will affect +      the overall time of snapshot operation. +    - The location where it can be copied. Will impact the disk usage +      of the target disk or file-system. +- Some part of change-log is present in the brick and some are outside +  the brick. This situation will arise when change-log path is changed +  in-between. +- Change-log is saved in another volume and this volume forms a CG +  with the volume about to be snapped. + +**Note**: Considering the above points we have decided not to support +change-log stored outside the bricks. + +For this release automatic snapshot of both master and slave session is +not supported. If required the user needs to explicitly take snapshots of both +master and slave. Following steps need to be followed while taking +snapshot of a master and slave setup + +- Stop geo-replication manually. +- Snapshot all the slaves first. +- When the slave snapshot is done then initiate master snapshot. +- When both the snapshots are complete geo-synchronization can be started + again. + +**Gluster Quota**: Quota enables an admin to specify per directory +quota. Quota makes use of marker translator to enforce quota. As of now +the marker framework is not completely crash-consistent. As part of +snapshot feature we need to address following issues. + +- If a snapshot is taken while the contribution size of a file is + being updated then you might end up with a snapshot where there is a + mismatch between the actual size of the file and the contribution of + the file. These inconsistencies can only be rectified when a + look-up is issued on the snapshot volume for the same file. As a + workaround admin needs to issue an explicit file-system crawl to + rectify the problem. +- For NFS, quota makes use of pgfid to build a path from gfid and + enforce quota. As of now pgfid update is not crash-consistent. +- Quota saves its configuration in file-system under /var/lib/glusterd + folder. As part of snapshot feature we need to save this file. + +**NFS**: NFS uses a single graph to represent all the volumes in the +system. And to make all the snapshot volumes accessible these snapshot +volumes should be added to this graph. This brings in another +restriction, i.e.
all the snapshot names should be unique and +additionally snap name should not clash with any other volume name as +well. + +To handle this situation we have decided to use an internal uuid as snap +name. And keep a mapping of this uuid and user given snap name in an +internal structure. + +Another restriction with NFS is that when a newly created volume +(snapshot volume) is started it will restart NFS server. Therefore we +decided when snapshot is taken it will be in stopped state. Later when +snapshot volume is needed it can be started explicitly. + +**DHT**: DHT xlator decides which node to look for a file/directory. +Some of the DHT fop are not atomic in nature, e.g rename (both file and +directory). Also these operations are not transactional in nature. That +means if a crash happens the data in server might be in an inconsistent +state. Depending upon the time of snapshot and which DHT operation is in +what state there can be an inconsistent snapshot. + +**AFR**: AFR is the high-availability module in Gluster. AFR keeps track +of fresh and correct copy of data using extended attributes. Therefore +it is important that before taking snapshot these extended attributes +are written into the disk. To make sure these attributes are written to +disk snapshot module will issue explicit sync after the +barrier/quiescing. + +The other issue with the current AFR is that it writes the volume name +to the extended attribute of all the files. AFR uses this for +self-healing. When snapshot is taken of such a volume the snapshotted +volume will also have the same volume name. Therefore AFR needs to +create a mapping of the real volume name and the extended entry name in +the volfile. So that correct name can be referred during self-heal. + +Another dependency on AFR is that currently there is no direct API or +call back function which will tell that AFR self-healing is completed on +a volume. This feature is required to heal a snapshot volume before +restore. 
+ +Documentation +------------- + +Status +------ + +In development + +Comments and Discussion +----------------------- + + diff --git a/done/GlusterFS 3.6/New Style Replication.md b/done/GlusterFS 3.6/New Style Replication.md new file mode 100644 index 0000000..ffd8167 --- /dev/null +++ b/done/GlusterFS 3.6/New Style Replication.md @@ -0,0 +1,230 @@ +Goal +---- + +More partition-tolerant replication, with higher performance for most +use cases. + +Summary +------- + +NSR is a new synchronous replication translator, complementing or +perhaps some day replacing AFR. + +Owners +------ + +Jeff Darcy +Venky Shankar + +Current status +-------------- + +Design and prototype (nearly) complete, implementation beginning. + +Related Feature Requests and Bugs +--------------------------------- + +[AFR bugs related to "split +brain"](https://bugzilla.redhat.com/buglist.cgi?classification=Community&component=replicate&list_id=3040567&product=GlusterFS&query_format=advanced&short_desc=split&short_desc_type=allwordssubstr) + +[AFR bugs related to +"perf"](https://bugzilla.redhat.com/buglist.cgi?classification=Community&component=replicate&list_id=3040572&product=GlusterFS&query_format=advanced&short_desc=perf&short_desc_type=allwordssubstr) + +(Both lists are undoubtedly partial because not all bugs in these areas +use these specific words. In particular, "GFID mismatch" bugs are +really a kind of split brain, but aren't represented.) + +Detailed Description +-------------------- + +NSR is designed to have the following features. + +- Server based - "chain" replication can use bandwidth of both client + and server instead of splitting client bandwidth N ways. + +- Journal based - for reduced network traffic in normal operation, + plus faster recovery and greater resistance to "split brain" errors.
+ +- Variable consistency model - based on + [Dynamo](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) + to provide options trading some consistency for greater availability + and/or performance. + +- Newer, smaller codebase - reduces technical debt, enables higher + replica counts, more informative status reporting and logging, and + other future features (e.g. ordered asynchronous replication). + +Benefit to GlusterFS +==================== + +Faster, more robust, more manageable/maintainable replication. + +Scope +===== + +Nature of proposed change +------------------------- + +At least two new translators will be necessary. + +- A simple client-side translator to route requests to the current + leader among the bricks in a replica set. + +- A server-side translator to handle the "heavy lifting" of + replication, recovery, etc. + +Implications on manageability +----------------------------- + +At a high level, commands to enable, configure, and manage NSR will be +very similar to those already used for AFR. At a lower level, the +options affecting things like quorum, consistency, and placement +of journals will all be completely different. + +Implications on presentation layer +---------------------------------- + +Minimal. Most changes will be to simplify or remove special handling for +AFR's unique behavior (especially around lookup vs. self-heal). + +Implications on persistence layer +--------------------------------- + +N/A + +Implications on 'GlusterFS' backend +----------------------------------- + +The journal for each brick in an NSR volume might (for performance +reasons) be placed on one or more local volumes other than the one +containing the brick's data. Special requirements around AIO, fsync, +etc. will be less than with AFR. + +Modification to GlusterFS metadata +---------------------------------- + +NSR will not use the same xattrs as AFR, reducing the need for larger +inodes.
+ +Implications on 'glusterd' +-------------------------- + +Volgen must be able to configure the client-side and server-side parts +of NSR, instead of AFR on the client side and index (which will no +longer be necessary) on the server side. Other interactions with +glusterd should remain mostly the same. + +How To Test +=========== + +Most basic AFR tests - e.g. reading/writing data, killing nodes, +starting/stopping self-heal - would apply to NSR as well. Tests that +embed assumptions about AFR xattrs or other internal artifacts will need +to be re-written. + +User Experience +=============== + +Minimal change, mostly related to new options. + +Dependencies +============ + +NSR depends on a cluster-management framework that can provide +membership tracking, leader election, and robust consistent key/value +data storage. This is expected to be developed in parallel as part of +the glusterd-scalability feature, but can be implemented (in simplified +form) within NSR itself if necessary. + +Documentation +============= + +TBD. + +Status +====== + +Some parts of earlier implementation updated to current tree, others in +the middle of replacement. + +- [New design](http://review.gluster.org/#/c/8915/) + +- [Basic translator code](http://review.gluster.org/#/c/8913/) (needs + update to new code-generation infrastructure) + +- [GF\_FOP\_IPC](http://review.gluster.org/#/c/8812/) + +- [etcd support](http://review.gluster.org/#/c/8887/) + +- [New code-generation + infrastructure](http://review.gluster.org/#/c/9411/) + +- [New data-logging + translator](https://forge.gluster.org/~jdarcy/glusterfs-core/jdarcys-glusterfs-data-logging) + +Comments and Discussion +======================= + +My biggest concern with journal-based replication comes from my previous +use of DRBD. They do an "activity log"[^1] which sounds like the same +basic concept. Once that log filled, I experienced cascading failure.
+When the journal can be filled faster than it's emptied this could cause +the problem I experienced. + +So what I'm looking to be convinced of is how journalled replication +maintains full redundancy and how it will prevent the journal input from +exceeding the capacity of the journal output or at least how it won't +fail if this should happen. + +[jjulian](User:Jjulian "wikilink") +([talk](User talk:Jjulian "wikilink")) 17:21, 13 August 2013 (UTC) + +
+This is akin to a CAP Theorem[^2][^3] problem. If your nodes can't +communicate, what do you do with writes? Our replication approach has +traditionally been CP - enforce quorum, allow writes only among the +majority - and for the sake of satisfying user expectations (or POSIX) +pretty much has to remain CP at least by default. I personally think we +need to allow an AP choice as well, which is why the quorum levels in +NSR are tunable to get that result. + +So, what do we do if a node runs out of journal space? Well, it's unable +to function normally, i.e. it's failed, so it can't count toward quorum. +This would immediately lead to loss of write availability in a two-node +replica set, and could happen easily enough in a three-node replica set +if two similarly configured nodes ran out of journal space +simultaneously. A significant part of the complexity in our design is +around pruning no-longer-needed journal segments, precisely because this +is an icky problem, but even with all the pruning in the world it could +still happen eventually. Therefore the design also includes the notion +of arbiters, which can be quorum-only or can also have their own +journals (with no or partial data). Therefore, your quorum for +admission/journaling purposes can be significantly higher than your +actual replica count. So what options do we have to avoid or deal with +journal exhaustion? + +- Add more journal space (it's just files, so this can be done + reactively during an extended outage). + +- Add arbiters. + +- Decrease the quorum levels. + +- Manually kick a node out of the replica set. + +- Add admission control, artificially delaying new requests as the + journal becomes full. (This one requires more code.) + +If you do \*none\* of these things then yeah, you're scrod. That said, +do you think these options seem sufficient? 
+ +[Jdarcy](User:Jdarcy "wikilink") ([talk](User talk:Jdarcy "wikilink")) +15:27, 29 August 2013 (UTC) + + + +[^1]: + +[^2]: + +[^3]: diff --git a/done/GlusterFS 3.6/Persistent AFR Changelog xattributes.md b/done/GlusterFS 3.6/Persistent AFR Changelog xattributes.md new file mode 100644 index 0000000..e21b788 --- /dev/null +++ b/done/GlusterFS 3.6/Persistent AFR Changelog xattributes.md @@ -0,0 +1,178 @@ +Feature +------- + +Provide a unique and consistent name for AFR changelog extended +attributes/client translator names in the volume graph. + +Summary +------- + +Make AFR changelog extended attribute names independent of brick +position in the graph, which ensures that there will be no potential +misdirected self-heals during a remove-brick operation. + +Owners +------ + +Ravishankar N +Pranith Kumar K + +Current status +-------------- + +Patches merged in master. + + + + + +Detailed Description +-------------------- + +BACKGROUND ON THE PROBLEM: +=========================== +AFR makes use of +changelog extended attributes on a per-file basis. They record pending +operations on that file and are used to determine the sources and sinks +when healing needs to be done. As of today, AFR uses the client +translator names (from the volume graph) as the names of the changelog +attributes. For example, for a replica 3 volume, each file on every brick has +the following extended attributes: + + trusted.afr.<volname>-client-0-->maps to Brick0 + trusted.afr.<volname>-client-1-->maps to Brick1 + trusted.afr.<volname>-client-2-->maps to Brick2 + +1) Now when any brick is removed (say Brick1), the graph is regenerated +and AFR maps the xattrs to the bricks so: + + trusted.afr.<volname>-client-0-->maps to Brick0 + trusted.afr.<volname>-client-1-->maps to Brick2 + +Thus the xattr 'trusted.afr.testvol-client-1', which earlier referred to +Brick1's attributes, now refers to Brick2's. If there are self-heals +pending from before the remove-brick, healing could possibly +happen in the wrong direction, thereby causing data loss.
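The renaming hazard in 1) can be reproduced in a few lines. What follows is only an illustrative Python sketch, not glusterd code: `positional_names` and `BrickIds` are hypothetical helpers introduced here, and `testvol` is the example volume name used above. It contrasts position-based naming with names that stay attached to a brick once assigned, which is the direction this document goes on to propose.

```python
def positional_names(volname, bricks):
    """Name changelog xattrs by brick position in the graph (current scheme)."""
    return {f"trusted.afr.{volname}-client-{i}": brick
            for i, brick in enumerate(bricks)}


class BrickIds:
    """Hypothetical allocator: a brick keeps the ID it was first given."""

    def __init__(self, volname):
        self.volname = volname
        self.next_id = 0   # monotonically increasing, never reused
        self.ids = {}      # brick -> persistent brick-ID

    def add(self, brick):
        self.ids[brick] = self.next_id
        self.next_id += 1

    def remove(self, brick):
        # Removing a brick does not renumber the remaining ones.
        del self.ids[brick]

    def names(self):
        return {f"trusted.afr.{self.volname}-client-{i}": brick
                for brick, i in self.ids.items()}
```

With position-based naming, `positional_names("testvol", ["Brick0", "Brick2"])` maps `trusted.afr.testvol-client-1` to Brick2 even though the same name meant Brick1 before the removal. With `BrickIds`, removing Brick1 leaves Brick2 under `trusted.afr.testvol-client-2`, so an existing xattr never silently changes its meaning.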
+ +2) The second problem is a dependency on the Snapshot feature. Snapshot +volumes have new names (UUID based) and thus the (client) xlator names +are different. E.g. \<volname-client-0\> will now be +\<snapvolname-client-0\>. AFR uses these new names to query for its +changelog xattrs, but the files on the bricks still have the old changelog +xattrs. Hence the heal information is completely lost. + +WHAT IS THE EXACT ISSUE WE ARE SOLVING OR OBJECTIVE OF THE +FEATURE/DESIGN? +========================================================================== +In a nutshell, the solution is to generate unique and persistent names +for the client translators so that even if any of the bricks are +removed, the translator names always map to the same bricks. In turn, +AFR, which uses these names for the changelog xattr names, also refers to +the correct bricks. + +SOLUTION: + +The solution is explained as a sequence of steps: + +- The client translator names will still use the existing + nomenclature, except that now they are monotonically increasing + (\<volname\>-client-0,1,2...) and are not dependent on the brick + position. Let us call these names brick-IDs. These brick-IDs are + also written to the brickinfo files (in + /var/lib/glusterd/vols/\<volname\>/bricks/\*) by glusterd during + volume creation. When the volfile is generated, these + brick-IDs form the client xlator names. + +- Whenever a brick operation is performed, the names are retained for + existing bricks irrespective of their position in the graph. New + bricks get the monotonically increasing brick-ID while names for + existing bricks are obtained from the brickinfo file. + +- Note that this approach does not affect client versions (old/new) in + any way because the clients just use the volume config provided by + the volfile server. + +- For retaining backward compatibility, we need to check two items: + (a) Under what condition is remove-brick allowed; (b) When is the brick-ID + written to the brickinfo file.
+ +For the above 2 items, the implementation rules will be thus: + +i) This feature is implemented in 3.6. Let's say its op-version is 5. + +ii) We need to implement a check to allow remove-brick only if the cluster +op-version is \>=5 + +iii) The brick-ID is written to brickinfo when the nodes are upgraded +(during glusterd restore) and when a peer is probed (i.e. during volfile +import). + +Benefit to GlusterFS +-------------------- + +Even if there are pending self-heals, remove-brick operations can be +carried out safely without fear of incorrect heals which may cause data +loss. + +Scope +----- + +### Nature of proposed change + +Modifications will be made in restore, volfile import and volgen +portions of glusterd. + +### Implications on manageability + +N/A + +### Implications on presentation layer + +N/A + +### Implications on persistence layer + +N/A + +### Implications on 'GlusterFS' backend + +N/A + +### Modification to GlusterFS metadata + +N/A + +### Implications on 'glusterd' + +As described earlier. + +How To Test +----------- + +A remove-brick operation needs to be carried out on rep/dist-rep volumes +having pending self-heals and it must be verified that no data is lost. +Snapshots of the volumes must also be able to access files without any +issues. + +User Experience +--------------- + +N/A + +Dependencies +------------ + +None. + +Documentation +------------- + +TBD + +Status +------ + +See 'Current status' section. + +Comments and Discussion +----------------------- + + \ No newline at end of file diff --git a/done/GlusterFS 3.6/RDMA Improvements.md b/done/GlusterFS 3.6/RDMA Improvements.md new file mode 100644 index 0000000..1e71729 --- /dev/null +++ b/done/GlusterFS 3.6/RDMA Improvements.md @@ -0,0 +1,101 @@ +Feature +------- + +**RDMA Improvements** + +Summary +------- + +This proposal is regarding getting RDMA volumes out of tech preview.
+ +Owners +------ + +Raghavendra Gowdappa +Vijay Bellur + +Current status +-------------- + +Work in progress + +Detailed Description +-------------------- + +Fix known & unknown issues in volumes with transport type rdma so that +RDMA can be used as the interconnect between client - servers & between +servers. + +- Performance Issues - Had found that performance was bad when + compared with plain ib-verbs send/recv v/s RDMA reads and writes. +- Co-existence with tcp - There seemed to be some memory corruptions + when we had both tcp and rdma transports. +- librdmacm for connection management - with this there is a + requirement that the brick has to listen on an IPoIB address and + this affects our current ability where a peer has the flexibility to + connect to either ethernet or infiniband address. Another related + feature Better peer identification will help us to resolve this + issue. +- More testing required + +Benefit to GlusterFS +-------------------- + +Scope +----- + +### Nature of proposed change + +Bug-fixes to transport/rdma + +### Implications on manageability + +Remove the warning about creation of rdma volumes in CLI. 
+ +### Implications on presentation layer + +TBD + +### Implications on persistence layer + +No impact + +### Implications on 'GlusterFS' backend + +No impact + +### Modification to GlusterFS metadata + +No impact + +### Implications on 'glusterd' + +No impact + +How To Test +----------- + +TBD + +User Experience +--------------- + +TBD + +Dependencies +------------ + +Better Peer identification + +Documentation +------------- + +TBD + +Status +------ + +In development + +Comments and Discussion +----------------------- diff --git a/done/GlusterFS 3.6/Server-side Barrier feature.md b/done/GlusterFS 3.6/Server-side Barrier feature.md new file mode 100644 index 0000000..c13e25a --- /dev/null +++ b/done/GlusterFS 3.6/Server-side Barrier feature.md @@ -0,0 +1,213 @@ +Server-side barrier feature +=========================== + +- Author(s): Varun Shastry, Krishnan Parthasarathi +- Date: Jan 28 2014 +- Bugzilla: +- Document ID: BZ1060002 +- Document Version: 1 +- Obsoletes: NA + +Abstract +-------- + +The Snapshot feature needs a mechanism in GlusterFS whereby acknowledgements +to file operations (FOPs) are held back until the snapshots of all the +bricks of the volume are taken. + +The barrier feature would stop holding back FOPs after a configurable +'barrier-timeout' seconds. This is to prevent an accidental lockdown of +the volume. + +This mechanism should have the following properties: + +- Should keep 'barriering' transparent to the applications. +- Should not acknowledge FOPs that fall into the barrier class. A FOP + that, when acknowledged to the application, could lead to the + snapshot of the volume becoming inconsistent is a barrier class FOP. + +The 'unlink' example below explains how a FOP is classified as barrier +class. + +Consider the following sequence of events, assuming the unlink FOP was not +barriered. Assume a replicate volume with two bricks, namely b1 and b2.
+ + b1 b2 + time ---------------------------------- + | t1 snapshot + | t2 unlink /a unlink /a + \/ t3 mkdir /a mkdir /a + t4 snapshot + +As a result of this sequence of events, /a is stored as a file in the snapshot +of b1 (taken at t1, before the unlink) while /a is stored as a directory in the +snapshot of b2 (taken at t4, after the mkdir). This leads to an AFR split-brain, +in other words an inconsistent volume. + +Copyright +--------- + +Copyright (c) 2014 Red Hat, Inc. + +This feature is licensed under your choice of the GNU Lesser General +Public License, version 3 or any later version (LGPLv3 or later), or the +GNU General Public License, version 2 (GPLv2), in all cases as published +by the Free Software Foundation. + +Introduction +------------ + +The volume snapshot feature snapshots a volume by snapshotting +individual bricks, that are available, using the lvm-snapshot +technology. As part of using lvm-snapshot, the design requires bricks to +be free from a certain set of modifications (fops in Barrier Class) to avoid +the inconsistency. This is where the server-side barriering of FOPs +comes into the picture. + +Terminology +----------- + +- barrier(ing) - To make barrier fops temporarily inactive or + disabled. +- available - A brick is said to be available when the corresponding + glusterfsd process is running and serving file operations. +- FOP - File Operation + +High Level Design +----------------- + +### Architecture/Design Overview + +- Server-side barriering, for Snapshot, must be enabled/disabled on + the bricks of a volume in a synchronous manner, i.e. any command + using this would be blocked until barriering is enabled/disabled. + The brick process would provide this mechanism via an RPC. +- Barrier translator would be placed immediately above io-threads + translator in the server/brick stack. +- Barrier translator would queue FOPs when enabled. On disable, the + translator dequeues all the FOPs, while serving new FOPs from + application. By default, barriering is disabled.
+- The barrier feature would stop blocking the acknowledgements of FOPs + after a configurable 'barrier-timeout' seconds. This is to prevent + an accidental lockdown of the volume. +- Operations that fall into the barrier class are listed below. Any other + fop not listed below does not fall into this category and hence is + not barriered. + - rmdir + - unlink + - rename + - [f]truncate + - fsync + - write with O\_SYNC flag + - [f]removexattr + +### Design Feature + +The following timeline diagram depicts message exchanges between glusterd +and brick during enable and disable of barriering. This diagram assumes +that the enable operation is synchronous and disable is asynchronous. See +below for alternatives. + + glusterd (snapshot) barrier @ brick + ------------------ --------------- + t1 | | + t2 | continue to pass through + | all the fops + t3 send 'enable' | + t4 | * starts barriering the fops + | * send back the ack + t5 receive the ack | + | | + t6 | <take snap> | + | . | + | . | + | . | + | </take snap> | + | | + t7 send disable | + (does not wait for the ack) | + t8 | release all the held fops + | and no more barriering + | | + t9 | continue in PASS_THROUGH mode + +Glusterd would send an RPC (described in API section), to enable +barriering on a brick, by setting option feature.barrier to 'ON' in +barrier translator. This would be performed on all the bricks present in +that node, belonging to the set of volumes that are being snapshotted. + +Disable of barriering can happen in synchronous or asynchronous mode. +The choice is left to the consumer of this feature. + +On disable, all FOPs queued up will be dequeued. Simultaneously the +subsequent barrier request(s) will be served. + +Barrier option enable/disable is persisted into the volfile. This is to +make the feature available for consumers in asynchronous mode, like any +other (configurable) feature.
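The enable/disable behaviour described above can be modelled roughly as follows. This is a toy Python sketch under stated assumptions, not the translator's implementation: the `Barrier` class, its method names and the explicit `now` argument are hypothetical, the FOP names are taken from the barrier-class list above, and the timeout is checked lazily on the next acknowledgement rather than by a timer.

```python
from collections import deque

# FOP names from the barrier-class list above; "[f]truncate" and
# "[f]removexattr" are expanded into both variants.
BARRIER_FOPS = {"rmdir", "unlink", "rename", "truncate", "ftruncate",
                "fsync", "removexattr", "fremovexattr"}


def is_barrier_fop(name, sync=False):
    """Return True if the acknowledgement of this FOP must be held back."""
    if name == "write":
        return sync            # only writes with O_SYNC are barrier class
    return name in BARRIER_FOPS


class Barrier:
    """Holds back acknowledgements of barrier-class FOPs while enabled."""

    def __init__(self, timeout):
        self.timeout = timeout     # stands in for 'barrier-timeout' (seconds)
        self.enabled_at = None     # None means barriering is disabled
        self.queue = deque()       # held-back acknowledgements, in order

    def enable(self, now):
        self.enabled_at = now

    def disable(self):
        """Stop barriering and return the acknowledgements to release."""
        self.enabled_at = None
        released = list(self.queue)
        self.queue.clear()
        return released

    def on_ack(self, name, now, sync=False):
        """Called when a FOP completes; returns the acks to send right now."""
        released = []
        if self.enabled_at is not None and now - self.enabled_at >= self.timeout:
            # Timeout expired: release everything to avoid a lockdown.
            released = self.disable()
        if self.enabled_at is not None and is_barrier_fop(name, sync):
            self.queue.append(name)    # hold this acknowledgement back
            return released
        return released + [name]
```

For example, after `enable(now=10)` an `unlink` completing at `now=11` returns no acknowledgements while a `read` is acknowledged immediately; a later `disable()` hands back the held `unlink`.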
+ +The barrier feature also has a timeout option, based on which dequeuing would +get triggered if the consumer fails to send the disable request. + +Low-level details of Barrier translator working +----------------------------------------------- + +The translator operates in one of two states, namely QUEUEING and +PASS\_THROUGH. + +When barriering is enabled, the translator moves to QUEUEING state. It +queues outgoing FOPs thereafter in the call back path. + +When barriering is disabled, the translator moves to PASS\_THROUGH state +and does not queue when it is in PASS\_THROUGH state. Additionally, the +queued FOPs are 'released', when the translator moves from QUEUEING to +PASS\_THROUGH state. + +It has a translator global queue (doubly linked lists, see +libglusterfs/src/list.h) where the FOPs are queued in the form of a call +stub (see libglusterfs/src/call-stub.[ch]) + +When the FOP has succeeded but the barrier translator fails to queue it in +the call back, the barrier translator would disable barriering and +release any queued FOPs. The barrier would inform the consumer about this +failure on the subsequent disable request. + +Interfaces +---------- + +### Application Programming Interface + +- An RPC procedure is added at the brick side, which allows any client + [sic] to set the feature.barrier option of the barrier translator + with a given value. +- Glusterd would be using this to set server-side-barriering on, on a + brick. + +Performance Considerations +-------------------------- + +- The barriering of FOPs may be perceived as a performance degradation by + the applications. Since this is a hard requirement for snapshot, the + onus is on the snapshot feature to reduce the window for which + barriering is enabled. + +### Scalability + +- In glusterd, each brick operation is executed in a serial manner. + So, the latency of enabling barriering is a function of the number of + bricks present on the node of the set of volumes being snapshotted.
+ This is not a scalability limitation of the mechanism of enabling + barriering but a limitation in the brick operations mechanism in + glusterd. + +Migration Considerations +------------------------ + +The barrier translator is introduced with op-version 4. It is a +server-side translator and does not impact older clients even when this +feature is enabled. + +Installation and deployment +--------------------------- + +- Barrier xlator is not packaged with glusterfs-server rpm. With this + change, it has to be added to the rpm. diff --git a/done/GlusterFS 3.6/Thousand Node Gluster.md b/done/GlusterFS 3.6/Thousand Node Gluster.md new file mode 100644 index 0000000..54c3e13 --- /dev/null +++ b/done/GlusterFS 3.6/Thousand Node Gluster.md @@ -0,0 +1,150 @@ +Goal +---- + +Thousand-node scalability for glusterd + +Summary +======= + +This "feature" is really a set of infrastructure changes that will +enable glusterd to manage a thousand servers gracefully. + +Owners +====== + +Krishnan Parthasarathi +Jeff Darcy + +Current status +============== + +Proposed, awaiting summit for approval. + +Related Feature Requests and Bugs +================================= + +N/A + +Detailed Description +==================== + +There are three major areas of change included in this proposal. + +- Replace the current order-n-squared heartbeat/membership protocol + with a much smaller "monitor cluster" based on Paxos or + [Raft](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf), + to which I/O servers check in. + +- Use the monitor cluster to designate specific functions or roles - + e.g. self-heal, rebalance, leadership in an NSR subvolume - to I/O + servers in a coordinated and globally optimal fashion. + +- Replace the current system of replicating configuration data on all + servers (providing practically no guarantee of consistency if one is + absent during a configuration change) with storage of configuration + data in the monitor cluster.
+ +Benefit to GlusterFS +==================== + +Scaling of our management plane to 1000+ nodes, enabling competition +with other projects such as HDFS or Ceph which already have or claim +such scalability. + +Scope +===== + +Nature of proposed change +------------------------- + +Functionality very similar to what we need in the monitor cluster +already exists in some of the Raft implementations, notably +[etcd](https://github.com/coreos/etcd). Such a component could provide +the services described above to a modified glusterd running on each +server. The changes to glusterd would mostly consist of removing the +current heartbeat and config-storage code, replacing it with calls into +(and callbacks from) the monitor cluster. + +Implications on manageability +----------------------------- + +Enabling/starting monitor daemons on those few nodes that have them must +be done separately from starting glusterd. Since the changes mostly are +to how each glusterd interacts with others and with its own local +storage back end, interactions with the CLI or with glusterfsd need not +change. + +Implications on presentation layer +---------------------------------- + +N/A + +Implications on persistence layer +--------------------------------- + +N/A + +Implications on 'GlusterFS' backend +----------------------------------- + +N/A + +Modification to GlusterFS metadata +---------------------------------- + +The monitor daemons need space for their data, much like that currently +maintained in /var/lib/glusterd. + +Implications on 'glusterd' +-------------------------- + +Drastic. See sections above. + +How To Test +=========== + +A new set of tests for the monitor-cluster functionality will need to be +developed, perhaps derived from those for the external project if we +adopt one. Most tests related to our multi-node testing facilities +(cluster.rc) will also need to change. Tests which merely invoke the CLI +should require little if any change.
+ +User Experience +=============== + +Minimal change. + +Dependencies +============ + +A mature/stable enough implementation of Raft or a similar protocol. +Failing that, we'd need to develop our own service along similar lines. + +Documentation +============= + +TBD. + +Status +====== + +In design. + +The choice of technology and approaches are being discussed on the +-devel ML. + +- "Proposal for Glusterd-2.0" - + [1](http://www.gluster.org/pipermail/gluster-users/2014-September/018639.html) + +: Though the discussion has become passive, the question is whether we + choose to implement a consensus algorithm inside our project or depend + on external projects that provide a similar service. + +- "Management volume proposal" - + [2](http://www.gluster.org/pipermail/gluster-devel/2014-November/042944.html) + +: This has limitations due to the circular dependency making it + infeasible. + +Comments and Discussion +======================= diff --git a/done/GlusterFS 3.6/afrv2.md b/done/GlusterFS 3.6/afrv2.md new file mode 100644 index 0000000..a1767c7 --- /dev/null +++ b/done/GlusterFS 3.6/afrv2.md @@ -0,0 +1,244 @@ +Feature +------- + +This feature is a major code re-factor of the current afr along with a key +design change in the way changelog extended attributes are stored in +afr. + +Summary +------- + +This feature introduces a design change in afr which separates the ongoing +transaction count from the pending operation count for files/directories. + +Owners +------ + +Anand Avati +Pranith Kumar Karampuri + +Current status +-------------- + +The feature is in final stages of review at + + +Detailed Description +-------------------- + +How AFR works: + +In order to keep track of which copies of the file are modified and up to +date, and which copies need to be healed, AFR keeps state information +in extended attributes of the file called changelog extended +attributes. These extended attributes store that copy's view of how up +to date the other copies are.
The extended attributes are modified in a +transaction which consists of 5 phases - LOCK, PRE-OP, OP, POST-OP and +UNLOCK. In the PRE-OP phase the extended attributes are updated to store +the intent of modification (in the OP phase.) + +In the POST-OP phase, depending on how many servers crashed mid way and +on how many servers the OP was applied successfully, a corresponding +change is made in the extended attributes (of the surviving copies) to +represent the staleness of the copies which missed the OP phase. + +Further, when those lagging servers become available, healing decisions +are taken based on these extended attribute values. + +Today, a PRE-OP increments the pending counters of all elements in the +array (where each element represents a server, and therefore one of the +members of the array represents that server itself.) The POST-OP +decrements those counters which represent servers where the operation +was successful. The update is performed on all the servers which have +made it till the POST-OP phase. The decision of whether a server crashed +in the middle of a transaction or whether the server lived through the +transaction and witnessed the other server crash, is inferred by +inspecting the extended attributes of all servers together. Because +there is no distinction between these counters as to how many of those +increments represent "in transit" operations and how many of those are +retained without decrement to represent "pending counters", there is +value in adding clarity to the system by separating the two. + +The change is to now have only one dirty flag on each server per file. +We also make the PRE-OP increment only that dirty flag rather than all +the elements of the pending array. The dirty flag must be set before +performing the operation, and based on which of the servers the +operation failed, we will set the pending counters representing these +failed servers on the remaining ones in the POST-OP phase. 
The dirty +counter is also cleared at the end of the POST-OP. This means, in +successful operations only the dirty flag (one integer) is incremented +and decremented per server per file. However if a pending counter is set +because of an operation failure, then the flag is an unambiguous "finger +pointing" at the other server. Meaning, if a file has a pending counter +AND a dirty flag, it will not undermine the "strength" of the pending +counter. This change completely removes today's ambiguity of whether a +pending counter represents a still ongoing operation (or crashed in +transit) vs a surely missed operation. + +Benefit to GlusterFS +-------------------- + +It increases the clarity of whether a file has any ongoing transactions +and any pending self-heals. Code is more maintainable now. + +Scope +----- + +### Nature of proposed change + +- Remove client side self-healing completely (opendir, openfd, lookup) +- Re-work readdir-failover to work reliably in case of NFS +- Remove unused/dead lock recovery code +- Consistently use xdata in both calls and callbacks in all FOPs +- Per-inode event generation, used to force inode ctx refresh +- Implement dirty flag support (in place of pending counts) +- Eliminate inode ctx structure, use read subvol bits + event\_generation +- Implement inode ctx refreshing based on event generation +- Provide backward compatibility in transactions +- remove unused variables and functions +- make code more consistent in style and pattern +- regularize and clean up inode-write transaction code +- regularize and clean up dir-write transaction code +- regularize and clean up common FOPs +- reorganize transaction framework code +- skip setting xattrs in pending dict if nothing is pending +- re-write self-healing code using syncops +- re-write simpler self-heal-daemon + +### Implications on manageability + +None + +### Implications on presentation layer + +None + +### Implications on persistence layer + +None + +### Implications on 'GlusterFS'
backend + +None + +### Modification to GlusterFS metadata + +This changes the way pending counts vs Ongoing transactions are +represented in changelog extended attributes. + +### Implications on 'glusterd' + +None + +How To Test +----------- + +Same test cases of afrv1 hold. + +User Experience +--------------- + +None + +Dependencies +------------ + +None + +Documentation +------------- + +--- + +Status +------ + +The feature is in final stages of review at + + +Comments and Discussion +----------------------- + +--- + +Summary +------- + + + +Owners +------ + + + +Current status +-------------- + + + +Detailed Description +-------------------- + + + +Benefit to GlusterFS +-------------------- + + + +Scope +----- + +### Nature of proposed change + + + +### Implications on manageability + + + +### Implications on presentation layer + + + +### Implications on persistence layer + + + +### Implications on 'GlusterFS' backend + + + +### Modification to GlusterFS metadata + +