32 files changed, 0 insertions, 3559 deletions
diff --git a/doc/features/afr-arbiter-volumes.md b/doc/features/afr-arbiter-volumes.md
deleted file mode 100644
index 1348e5645b8..00000000000
--- a/doc/features/afr-arbiter-volumes.md
+++ /dev/null
@@ -1,53 +0,0 @@
-Usage guide: Replicate volumes with arbiter configuration
-==========================================================
-Arbiter volumes are replica 3 volumes where the 3rd brick of the replica is
-automatically configured as an arbiter node. What this means is that the 3rd
-brick will store only the file name and metadata, but does not contain any data.
-This configuration is helpful in avoiding split-brains while providing the same
-level of consistency as a normal replica 3 volume.
-
-The arbiter volume can be created with the following command:
-`gluster volume create <VOLNAME> replica 3 arbiter 1 host1:brick1 host2:brick2 host3:brick3`
-
-Note that the syntax is similar to creating a normal replica 3 volume with the
-exception of the `arbiter 1` keyword. As seen in the command above, the only
-permissible values for the replica count and arbiter count are 3 and 1
-respectively. Also, the 3rd brick is always chosen as the arbiter brick and it
-is not configurable to have any other brick as the arbiter.
-
-Client/ Mount behaviour:
-========================
-By default, client quorum (`cluster.quorum-type`) is set to `auto` for a replica
-3 volume when it is created; i.e. at least 2 bricks need to be up to satisfy
-quorum and to allow writes. This setting is not to be changed for arbiter
-volumes also. Additionally, the arbiter volume has additional some checks to
-prevent files from ending up in split-brain:
-
-* Clients take full file locks when writing to a file as opposed to range locks
-  in a normal replica 3 volume.
-
-* If 2 bricks are up and if one of them is the arbiter (i.e. the 3rd brick) *and*
-  it blames the other up brick, then all FOPS will fail with ENOTCONN (Transport
-  endpoint is not connected). IF the arbiter doesn't blame the other brick,
-  FOPS will be allowed to proceed. 'Blaming' here is w.r.t the values of AFR
-  changelog extended attributes.
-
-* If 2 bricks are up and the arbiter is down, then FOPS will be allowed.
-
-* In all cases, if there is only one source before the FOP is initiated and if
-  the FOP fails on that source, the application will receive ENOTCONN.
-
-Note: It is possible to see if a replica 3 volume has arbiter configuration from
-the mount point. If
-`$mount_point/.meta/graphs/active/$V0-replicate-0/options/arbiter-count` exists
-and its value is 1, then it is an arbiter volume. Also the client volume graph
-will have arbiter-count as a xlator option for AFR translators.
-
-Self-heal daemon behaviour:
-===========================
-Since the arbiter brick does not store any data for the files, data-self-heal
-from the arbiter brick will not take place. For example if there are 2 source
-bricks B2 and B3 (B3 being arbiter brick) and B2 is down, then data-self-heal
-will *not* happen from B3 to sink brick B1, and will be pending until B2 comes
-up and heal can happen from it. Note that metadata and entry self-heals can
-still happen from B3 if it is one of the sources.
diff --git a/doc/features/afr-statistics.md b/doc/features/afr-statistics.md
deleted file mode 100644
index d0705845aa4..00000000000
--- a/doc/features/afr-statistics.md
+++ /dev/null
@@ -1,142 +0,0 @@
-##gluster volume heal <volume-name> statistics
-
-##Description
-In case of index self-heal, self-heal daemon reads the entries from the
-local bricks, from /brick-path/.glusterfs/indices/xattrop/ directory.
-So based on the entries read by self heal daemon, it will attempt self-heal.
-Executing this command will list the crawl statistics of self heal done for
-each brick.
-
-For each brick, it will list:
-1. Starting time of crawl done for that brick.
-2. Ending time of crawl done for that brick.
-3. No of entries for which self-heal is successfully attempted.
-4. No of split-brain entries found while self-healing.
-5. No of entries for which heal failed.
-
-
-
-Example:
-a) Create a gluster volume with replica count 2.
-b) Create 10 files.
-c) kill brick_1 of this replica.
-d) Overwrite all 10 files.
-e) Kill the other brick (brick_2), and bring back (brick_1) up.
-f) Overwrite all 10 files.
-
-Now we have 10 files, which are in split brain. Self heal daemon will crawl for
-both the bricks, and will count 10 files from each brick.
-It will report 10 files under split-brain with respect to given brick.
-
-Gathering crawl statistics on volume volume1 has been successful
-------------------------------------------------
-
-Crawl statistics for brick no 0
-Hostname of brick 192.168.122.1
-
-Starting time of crawl: Tue May 20 19:13:11 2014
-
-Ending time of crawl: Tue May 20 19:13:12 2014
-
-Type of crawl: INDEX
-No. of entries healed: 0
-No. of entries in split-brain: 10
-No. of heal failed entries: 0
-------------------------------------------------
-
-Crawl statistics for brick no 1
-Hostname of brick 192.168.122.1
-
-Starting time of crawl: Tue May 20 19:13:12 2014
-
-Ending time of crawl: Tue May 20 19:13:12 2014
-
-Type of crawl: INDEX
-No. of entries healed: 0
-No. of entries in split-brain: 10
-No. of heal failed entries: 0
-
-------------------------------------------------
-
-
-As the output shows, self-heal daemon detects 10 files in split-brain with
-resept to given brick.
-
-
-
-
-##gluster volume heal <volume-name> statistics heal-count
-It lists the number of entries present in
-/<brick>/.glusterfs/indices/xattrop from each-brick.
-
-
-1. Create a replicate volume.
-2. Kill one brick of a replicate volume1.
-3. Create 10 files.
-4. Execute above command.
-
---------------------------------------------------------------------------------
-Gathering count of entries to be healed on volume volume1 has been successful
-
-Brick 192.168.122.1:/brick_1
-Number of entries: 10
-
-Brick 192.168.122.1:/brick_2
-No gathered input for this brick
--------------------------------------------------------------------------------
-
-
-
-
-
-
-##gluster volume heal <volume-name> statistics heal-count replica \
-  ip_addr:/brick_location
-
-To list the number of entries to be healed from a particular replicate
-subvolume, listing any one child of that replicate subvolume in the command,
-will list the entries for all the childrens of that replicate subvolume.
-
-Example:                   dht
-                        /      \
-                      /          \
-                 replica-1      replica-2
-                 /    \          /    \
-           child-1  child-2  child-3  child-4
-           /brick1  /brick2  /brick3  /brick4
-
-gluster volume heal <vol-name> statistics heal-count ip:/brick1
-will list count only for child-1 and child-2.
-
-gluster volume heal <vol-name> statistics heal-count ip:/brick3
-will list count only for child-3 and child-4.
-
-
-
-1. Create a volume same as mentioned in the above graph.
-2. Kill Brick-2.
-3. Create some files.
-4. If we are interested in knowing the number of files to be healed from each
-   brick of replica-1 only, mention any one child of that replica.
-
-gluster volume heal volume1 statistics heal-count replica 192.168.122.1:/brick2
-
-output:
--------
-Gathering count of entries to be healed per replica on volume volume1 has \
-been successful
-
-Brick 192.168.122.1:/brick_1
-Number of entries: 10                     <--10 files
-
-Brick 192.168.122.1:/brick_2
-No gathered input for this brick          <-Brick is down
-
-Brick 192.168.122.1:/brick_3
-No gathered input for this brick          <--No result, as we are not
-                                             interested.
-
-Brick 192.168.122.1:/brick_4              <--No result, as we are not
-No gathered input for this brick             interested.
-
-
diff --git a/doc/features/afr-v1.md b/doc/features/afr-v1.md
deleted file mode 100644
index 0ab41a1ab4c..00000000000
--- a/doc/features/afr-v1.md
+++ /dev/null
@@ -1,340 +0,0 @@
-#Automatic File Replication
-Afr xlator in glusterfs is responsible for replicating the data across the bricks.
-
-###Responsibilities of AFR
-Its responsibilities include the following:
-
-1. Maintain replication consistency (i.e. Data on both the bricks should be same, even in the cases where there are operations happening on same file/directory in parallel from multiple applications/mount points as long as all the bricks in replica set are up)
-
-2. Provide a way of recovering data in case of failures as long as there is
-   at least one brick which has the correct data.
-
-3. Serve fresh data for read/stat/readdir etc
-
-###Transaction framework
-For 1, 2 above afr uses transaction framework which consists of the following 5
-phases which happen on all the bricks in replica set(Bricks which are in replication):
-
-####1.Lock Phase
-####2. Pre-op Phase
-####3. Op Phase
-####4. Post-op Phase
-####5. Unlock Phase
-
-*Op Phase* is the actual operation sent by application (`write/create/unlink` etc). For every operation which afr receives that modifies data it sends that same operation in parallel to all the bricks in its replica set. This is how it achieves replication.
-
-*Lock, Unlock Phases* take necessary locks so that *Op phase* can provide **replication consistency** in normal work flow.
-
-#####For example:
-If an application performs `touch a` and the other one does `mkdir a`, afr makes sure that either file with name `a` or directory with name `a` is created on both the bricks.
-
-*Pre-op, Post-op Phases* provide changelogging which enables afr to figure out which copy is fresh.
-Once afr knows how to figure out fresh copy in the replica set it can **recover data** from fresh copy to stale copy. Also it can **serve fresh** data for `read/stat/readdir` etc.
-
-##Internal Operations
-Brief introduction to internal operations in Glusterfs which make *Locking, Unlocking, Pre/Post ops* possible:
-
-###Internal Locking Operations
-Glusterfs has **locks** translator which provides the following internal locking operations called `inodelk`, `entrylk` which are used by afr to achieve synchronization of operations on files or directories that conflict with each other.
-
-`Inodelk` gives the facility for translators in Glusterfs to obtain range (denoted by tuple with **offset**, **length**) locks in a given domain for an inode.
-Full file lock is denoted by the tuple (offset: `0`, length: `0`) i.e. length `0` is considered as infinity.
-
-`Entrylk` enables translators of Glusterfs to obtain locks on `name` in a given domain for an inode, typically a directory.
-
-**Locks** translator provides both *blocking* and *nonblocking* variants and of these operations.
-
-###Xattrop
-For pre/post ops posix translator provides an operation called xattrop.
-xattrop is a way of *incrementing*/*decrementing* a number present in the extended attribute of the inode *atomically*.
-
-##Transaction Types
-There are 3 types of transactions in AFR.
-1. Data transactions
-   - Operations that add/modify/truncate the file contents.
-   - `Write`/`Truncate`/`Ftruncate` etc
-
-2. Metadata transactions
-   - Operations that modify the data kept in inode.
-   - `Chmod`/`Chown` etc
-
-3) Entry transactions
-   - Operations that add/remove/rename entries in a directory
-   - `Touch`/`Mkdir`/`Mknod` etc
-
-###Data transactions:
-
-*write* (`offset`, `size`) - writes data from `offset` of `size`
-
-*ftruncate*/*truncate* (`offset`) - truncates data from `offset` till the end of file.
-
-Afr internal locking needs to make sure that two conflicting data operations happen in order, one after the other so that it does not result in replication inconsistency. Afr data operations take inodelks in same domain (lets call it `data` domain).
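Reviewer note: the `(offset, length)` convention described above for `Inodelk`, where length `0` is treated as infinity, can be illustrated with a small overlap check. This is only a sketch in Python under that stated convention; `ranges_conflict` and its tuple arguments are hypothetical names chosen for illustration and are not part of the GlusterFS locks translator.

```python
def ranges_conflict(a, b):
    """Return True if two (offset, length) ranges overlap.

    A length of 0 is treated as "until end of file" (infinity),
    following the convention described in the text above.
    Hypothetical helper, not a GlusterFS API.
    """
    a_off, a_len = a
    b_off, b_len = b
    # Convert each range to a half-open interval [start, end).
    a_end = float("inf") if a_len == 0 else a_off + a_len
    b_end = float("inf") if b_len == 0 else b_off + b_len
    # Two half-open intervals overlap when each starts before the other ends.
    return a_off < b_end and b_off < a_end
```

For instance, a full-file lock `(0, 0)` conflicts with any other range, while `(0, 10)` and `(10, 10)` are adjacent and do not conflict.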
-
-*Write* operation with offset `O` and size `S` takes an inode lock in data domain with range `(O, S)`.
-
-*Ftruncate*/*Truncate* operations with offset `O` take inode locks in `data` domain with range `(O, 0)`. Please note that size `0` means size infinity.
-
-These ranges make sure that overlapping write/truncate/ftruncate operations are done one after the other.
-
-Now that we know the ranges the operations take locks on, we will see how locking happens in afr.
-
-####Lock:
-Afr initially attempts **non-blocking** locks on **all** the bricks of the replica set in **parallel**. If all the locks are successful then it goes on to perform pre-op. But in case **non-blocking** locks **fail** because there is *at least one conflicting operation* which already has a **granted lock** then it **unlocks** the **non-blocking** locks it already acquired in this previous step and proceeds to perform **blocking** locks **one after the other** on each of the subvolumes in the order of subvolumes specified in the volfile.
-
-Chances of **conflicting operations** is **very low** and time elapsed in **non-blocking** locks phase is `Max(latencies of the bricks for responding to inodelk)`, where as time elapsed in **blocking locks** phase is `Sum(latencies of the bricks for responding to inodelk)`. That is why afr always tries for non-blocking locks first and only then it moves to blocking locks.
-
-####Pre-op:
-Each file/dir in a brick maintains the changelog(roughly pending operation count) of itself and that of the files
-present in all the other bricks in it's replica set as seen by that brick.
-
-Lets consider an example replica volume with 2 bricks brick-a and brick-b.
-all files in brick-a will have 2 entries
-one for itself and the other for the file present in it's replica set, i.e.brick-b:
-One can inspect changelogs using getfattr command.
-
-    # getfattr -d -e hex -m. brick-a/file.txt
-    trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
-    trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for brick-b as seen by brick-a
-
-Likewise, all files in brick-b will have:
-
-    # getfattr -d -e hex -m. brick-b/file.txt
-    trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
-    trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for itself (brick-b)
-
-#####Interpreting Changelog Value:
-Each extended attribute has a value which is `24` hexa decimal digits. i.e. `12` bytes
-First `8` digits (`4` bytes) represent changelog of `data`. Second `8` digits represent changelog
-of `metadata`. Last 8 digits represent Changelog of `directory entries`.
-
-Pictorially representing the same, we have:
-
-    0x 00000000  00000000  00000000
-           |         |         |
-           |         |          \_ changelog of directory entries
-           |          \_ changelog of metadata
-            \ _ changelog of data
-
-Before write operation is performed on the brick, afr marks the file saying there is a pending operation.
-
-As part of this pre-op afr sends xattrop operation with increment 1 for data operation to make the extended attributes the following:
-    # getfattr -d -e hex -m. brick-a/file.txt
-    trusted.afr.vol-client-0=0x000000010000000000000000 -->changelog for itself (brick-a)
-    trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for brick-b as seen by brick-a
-
-Likewise, all files in brick-b will have:
-
-    # getfattr -d -e hex -m. brick-b/file.txt
-    trusted.afr.vol-client-0=0x000000010000000000000000 -->changelog for brick-a as seen by brick-b
-    trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for itself (brick-b)
-
-As the operation is in progress on files on both the bricks all the extended attributes show the same value.
-
-####Op:
-Now it sends the actual write operation to both the bricks. Afr remembers whether the operation is successful or not on all the subvolumes.
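Reviewer note: the layout given under "Interpreting Changelog Value" above (24 hexadecimal digits, split into three 8-digit counters for data, metadata and directory entries) can be decoded with a few lines of Python. This is a minimal sketch that only parses a value string as printed by `getfattr -e hex`; `parse_changelog` is a hypothetical helper name and the function does not read extended attributes itself.

```python
def parse_changelog(value):
    """Split a 24-hex-digit AFR changelog value into its three counters.

    Hypothetical helper for illustration; it parses the hex string only
    and does not query any brick.
    """
    # Accept the value with or without the leading "0x" that getfattr prints.
    hex_digits = value[2:] if value.lower().startswith("0x") else value
    if len(hex_digits) != 24:
        raise ValueError("expected 24 hexadecimal digits")
    return {
        "data": int(hex_digits[0:8], 16),       # first 8 digits
        "metadata": int(hex_digits[8:16], 16),  # second 8 digits
        "entry": int(hex_digits[16:24], 16),    # last 8 digits
    }
```

Applied to the pre-op value shown above, `parse_changelog("0x000000010000000000000000")` yields a data count of 1 with metadata and entry counts of 0.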
-
-####Post-Op:
-If the operation succeeds on all the bricks then there is no pending operations on any of the bricks so as part of POST-OP afr sends xattrop operation with increment -1 i.e. decrement by 1 for data operation to make the extended attributes back to all zeros again.
-
-In case there is a failure on brick-b then there is still a pending operation on brick-b where as no pending operations are there on brick-a. So xattrop operation for both of these extended attributes differs now. For extended attribute corresponding to brick-a i.e. trusted.afr.vol-client-0 decrement by 1 is sent where as for extended attribute corresponding to brick-b increment by '0' is sent to retain the pending operation count.
-
-    # getfattr -d -e hex -m. brick-a/file.txt
-    trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
-    trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for brick-b as seen by brick-a
-
-    # getfattr -d -e hex -m. brick-b/file.txt
-    trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
-    trusted.afr.vol-client-1=0x000000010000000000000000 -->changelog for itself (brick-b)
-
-####Unlock:
-Once the transaction is completed unlock is sent on all the bricks where lock is acquired.
-
-
-###Meta Data transactions:
-
-setattr, setxattr, removexattr
-All metadata operations take same inode lock with same range in metadata domain.
-
-####Lock:
-Metadata locking also starts initially with non-blocking locks then move on to blocking locks on any failures because of conflicting operations.
-
-####Pre-op:
-Before metadata operation is performed on the brick, afr marks the file saying there is a pending operation.
-As part of this pre-op afr sends xattrop operation with increment 1 for metadata operation to make the extended attributes the following:
-    # getfattr -d -e hex -m. brick-a/file.txt
-    trusted.afr.vol-client-0=0x000000000000000100000000 -->changelog for itself (brick-a)
-    trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for brick-b as seen by brick-a
-
-Likewise, all files in brick-b will have:
-    # getfattr -d -e hex -m. brick-b/file.txt
-    trusted.afr.vol-client-0=0x000000000000000100000000 -->changelog for brick-a as seen by brick-b
-    trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for itself (brick-b)
-
-As the operation is in progress on files on both the bricks all the extended attributes show the same value.
-
-####Op:
-Now it sends the actual metadata operation to both the bricks. Afr remembers whether the operation is successful or not on all the subvolumes.
-
-Post-Op:
-If the operation succeeds on all the bricks then there is no pending operations on any of the bricks so as part of POST-OP afr sends xattrop operation with increment -1 i.e. decrement by 1 for metadata operation to make the extended attributes back to all zeros again.
-
-In case there is a failure on brick-b then there is still a pending operation on brick-b where as no pending operations are there on brick-a. So xattrop operation for both of these extended attributes differs now. For extended attribute corresponding to brick-a i.e. trusted.afr.vol-client-0 decrement by 1 is sent where as for extended attribute corresponding to brick-b increment by '0' is sent to retain the pending operation count.
-
-    # getfattr -d -e hex -m. brick-a/file.txt
-    trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
-    trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for brick-b as seen by brick-a
-
-    # getfattr -d -e hex -m. brick-b/file.txt
-    trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
-    trusted.afr.vol-client-1=0x000000000000000100000000 -->changelog for itself (brick-b)
-
-####Unlock:
-Once the transaction is completed unlock is sent on all the bricks where lock is acquired.
-
-
-###Entry transactions:
-
-create, mknod, mkdir, link, symlink, rename, unlink, rmdir
-Pre-op/Post-op (done using xattrop) always happens on the parent directory.
-
-Entry Locks taken by these entry operations:
-
-**Create** (file `dir/a`): Lock on name `a` in inode of `dir`
-
-**mknod** (file `dir/a`): Lock on name `a` in inode of `dir`
-
-**mkdir** (dir `dir/a`): Lock on name `a` in inode of `dir`
-
-**link** (file `oldfile`, file `dir/newfile`): Lock on name `newfile` in inode of `dir`
-
-**Symlink** (file `oldfile`, file `dir`/`symlinkfile`): Lock on name `symlinkfile` in inode of `dir`
-
-**rename** of (file `dir1`/`file1`, file `dir2`/`file2`): Lock on name `file1` in inode of `dir1`, Lock on name `file2` in inode of `dir2`
-
-**rename** of (dir `dir1`/`dir2`, dir `dir3`/`dir4`): Lock on name `dir2` in inode of `dir1`, Lock on name `dir4` in inode of `dir3`, Lock on `NULL` in inode of `dir4`
-
-**unlink** (file `dir`/`a`): Lock on name `a` in inode of `dir`
-
-**rmdir** (dir dir/a): Lock on name `a` in inode of `dir`, Lock on `NULL` in inode of `a`
-
-####Lock:
-Even entry locking starts initially with non-blocking locks then move on to blocking locks on any failures because of conflicting operations.
-
-####Pre-op:
-Before entry operation is performed on the brick, afr marks the directory saying there is a pending operation.
-
-As part of this pre-op afr sends xattrop operation with increment 1 for entry operation to make the extended attributes the following:
-
-    # getfattr -d -e hex -m. brick-a/
-    trusted.afr.vol-client-0=0x000000000000000000000001 -->changelog for itself (brick-a)
-    trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for brick-b as seen by brick-a
-
-Likewise, all files in brick-b will have:
-    # getfattr -d -e hex -m. brick-b/
-    trusted.afr.vol-client-0=0x000000000000000000000001 -->changelog for brick-a as seen by brick-b
-    trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for itself (brick-b)
-
-As the operation is in progress on files on both the bricks all the extended attributes show the same value.
-
-####Op:
-Now it sends the actual entry operation to both the bricks. Afr remembers whether the operation is successful or not on all the subvolumes.
-
-####Post-Op:
-If the operation succeeds on all the bricks then there is no pending operations on any of the bricks so as part of POST-OP afr sends xattrop operation with increment -1 i.e. decrement by 1 for entry operation to make the extended attributes back to all zeros again.
-
-In case there is a failure on brick-b then there is still a pending operation on brick-b where as no pending operations are there on brick-a. So xattrop operation for both of these extended attributes differs now. For extended attribute corresponding to brick-a i.e. trusted.afr.vol-client-0 decrement by 1 is sent where as for extended attribute corresponding to brick-b increment by '0' is sent to retain the pending operation count.
-
-    # getfattr -d -e hex -m. brick-a/file.txt
-    trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
-    trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for brick-b as seen by brick-a
-
-    # getfattr -d -e hex -m. brick-b/file.txt
-    trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
-    trusted.afr.vol-client-1=0x000000000000000000000001 -->changelog for itself (brick-b)
-
-####Unlock:
-Once the transaction is completed unlock is sent on all the bricks where lock is acquired.
-
-The parts above cover how replication consistency is achieved in afr.
-
-Now let us look at how afr can figure out how to recover from failures given the changelog extended attributes
-
-###Recovering from failures (Self-heal)
-For recovering from failures afr tries to determine which copy is the fresh copy based on the extended attributes.
-
-There are 3 possibilities:
-1. All the extended attributes are zero on all the bricks. This means there are no pending operations on any of the bricks so there is nothing to recover.
-2. According to the extended attributes there is a brick(brick-a) which noticed that there are operations pending on the other brick(brick-b).
-
-   There are 4 possibilities for brick-b
-
-   - It did not even participate in transaction (all extended attributes on brick-b are zeros). Choose brick-a as source and perform recovery to brick-b.
-
-   - It participated in the transaction but died even before post-op. (All extended attributes on brick-b have a pending-count). Choose brick-a as source and perform recovery to brick-b.
-
-   - It participated in the transaction and after the post-op extended attributes on brick-b show that there are pending operations on itself. Choose brick-a as source and perform recovery to brick-b.
-
-   - It participated in the transaction and after the post-op extended attributes on brick-b show that there are pending operations on brick-a. This situation is called Split-brain and there is no way to recover. This situation can happen in cases of network partition.
-
-3. The only possibility now is where both brick-a, brick-b have pending operations. In this case changelogs extended attributes are all non-zeros on all the bricks. Basically what could have happened is the operations started on the file but either the whole replica set went down or the mount process itself dies before post-op is performed. In this case there is a possibility that data on the bricks is different. In this case afr chooses file with bigger size as source, if both files have same size then it choses the subvolume which has witnessed large number of pending operations on the other brick as source. If both have same number of pending operations then it chooses the file with newest ctime as source. If this is also same then it just picks one of the two bricks as source and syncs data on to the other to make sure that the files are replicas to each other.
-
-###Self-healing:
-Afr does 3 types of self-heals for data recovery.
-
-1. Data self-heal
-
-2. Metadata self-heal
-
-3. Entry self-heal
-
-As we have seen earlier, afr depends on changelog extended attributes to figure out which copy is source and which copy is sink. General algorithm for performing this recovery (self-heal) is same for all of these different self-heals.
-
-1. Take appropriate full locks on the file/directory to make sure no other transaction is in progress while inspecting changelog extended attributes.
-In this step, for
-   - Data self-heal afr takes inode lock with `offset: 0` and `size: 0`(infinity) in data domain.
-   - Entry self-heal takes entry lock on directory with `NULL` name i.e. full directory lock.
-   - Metadata self-heal it takes pre-defined range in metadata domain on which all the metadata operations on that inode take locks on. To prevent duplicate data self-heal an inode lock is taken in self-heal domain as well.
-
-2. Perform Sync from fresh copy to stale copy.
-In this step,
-   - Metadata self-heal gets the inode attributes, extended attributes from source copy and sets them on the stale copy.
-
-   - Entry self-heal reads entries on stale directories and see if they are present on source directory, if they are not present it deletes them. Then it reads entries on fresh directory and creates the missing entries on stale directories.
-
-   - Data self-heal does things a bit differently to make sure no other writes on the file are blocked for the duration of self-heal because files sizes could be as big as 100G(VM files) and we don't want to block all the transactions until the self-heal is over. Locks translator allows two overlapping locks to be granted if they are from same lock owner. Using this what data self-heal does is it takes a small 128k size range lock and unlock previous acquired lock, heals just that 128k chunk and takes next 128k chunk lock and unlock previous lock and moves to the next one. It always makes sure that at least one lock is present on the file by selfheal throughout the duration of self-heal so that two self-heals don't happen in parallel.
-
-   - Data self-heal has two algorithms, where the file can be copied only when there is data mismatch for that chunk called as 'diff' self-heal. The otherone is blind copy of each chunk called 'full' self-heal
-
-3. Change extended attributes to mark new sources after the sync.
-
-4. Unlock the locks acquired to perform self-heal.
-
-### Transaction Optimizations:
-As we saw earlier afr transaction for all the operations that modify data happens in 5 phases, i.e. it sends 5 operations on the network for every operation. In the following sections we will see optimizations already implemented in afr which reduce the number of operations on the network to just 1 per transaction in best case.
-
-####Changelog-piggybacking
-This optimization comes into picture when on same file descriptor, before write1's post op is complete write2's pre-op starts and the operations are succeeding. When writes come in that manner we can piggyback on the pre-op of write1 for write2 and somehow tell write1 that write2 will do the post-op that was supposed to be done by write1. So write1's post-op does not happen over network, write2's pre-op does not happen over network. This optimization does not hold if there are any failures in write1's phases.
-
-####Delayed Post-op
-This optimization just delays post-op of the write transaction(write1) by a pre-configured amount time to increase the probability of next write piggybacking on the pre-op done by write1.
-
-With the combination of these two optimizations for operations like full file copy which are write intensive operations, what will essentially happen is for the first write a pre-op will happen. Then for the last write on the file post-op happens. So for all the write transactions between first write and last write afr reduced network operations from 5 to 3.
-
-####Eager-locking:
-This optimization comes into picture when only one file descriptor is open on the file and performing writes just like in the previous optimization. What this optimization does is it takes a full file lock on the file irrespective of the offset, size of the write, so that lock acquired by write1 can be piggybacked by write2 and write2 takes the responsibility of unlocking it. both write1, write2 will have same lock owner and afr takes the responsibility of serializing overlapping writes so that replication consistency is maintained.
-
-With the combination of these optimizations for operations like full file copy which are write intensive operations, what will essentially happen is for the first write a pre-op, full-file lock will happen. Then for the last write on the file post-op, unlock happens. So for all the write transactions between first write and last write afr reduced network operations from 5 to 1.
-
-###Quorum in afr:
-To avoid split-brains, afr employs the following quorum policies.
- - In replica set with odd number of bricks, replica set is said to be in quorum if more than half of the bricks are up.
- - In replica set with even number of bricks, if more than half of the bricks are up then it is said to be in quorum but if number of bricks that are up is equal to number of bricks that are down then, it is said to be in quorum if the first brick is also up in the set of bricks that are up.
-
-When quorum is not met in the replica set then modify operations on the mount are not allowed by afr.
-
-###Self-heal daemon and Index translator usage by afr:
-
-####Index xlator:
-On each brick index xlator is loaded. This xlator keeps track of what is happening in afr's pre-op and post-op. If there is an ongoing I/O or a pending self-heal, changelog xattrs would have non-zero values. Whenever xattrop/fxattrop fop (pre-op, post-ops are done using these fops) comes to index xlator a link (with gfid as name of the file on which the fop is performed) is added in <brick>/.glusterfs/indices/xattrop directory. If the value returned by the fop is zero the link is removed from the index otherwise it is kept until zero is returned in the subsequent xattrop/fxattrop fops.
-
-####Self-heal-daemon:
-self-heal-daemon process keeps running on each machine of the trusted storage pool. This process has afr xlators of all the volumes which are started. Its job is to crawl indices on bricks that are local to that machine. If any of the files represented by the gfid of the link name need healing and automatically heal them. This operation is performed every 10 minutes for each replica set. Additionally when a brick comes online also this operation is performed.
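Reviewer note: the two quorum rules listed under "Quorum in afr" in the removed afr-v1.md can be restated as a short predicate. The sketch below is an illustration in Python, not AFR code; `in_quorum` is a hypothetical name, and it assumes the bricks are given as a list of booleans in replica-set order so that index 0 is the "first brick" the text refers to.

```python
def in_quorum(up):
    """Decide quorum for one replica set.

    `up` is a list of booleans, one per brick in replica-set order
    (an assumption for this sketch); True means the brick is up.
    """
    total = len(up)
    if total == 0:
        raise ValueError("replica set must contain at least one brick")
    up_count = sum(up)
    # More than half of the bricks up: in quorum for odd and even sets.
    if 2 * up_count > total:
        return True
    # Even-sized set with exactly half up: the first brick breaks the tie.
    if total % 2 == 0 and 2 * up_count == total:
        return up[0]
    return False
```

With this rule a replica 2 set stays writable when only the first brick is up, but not when only the second brick is up.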
diff --git a/doc/features/bit-rot/00-INDEX b/doc/features/bit-rot/00-INDEX deleted file mode 100644 index d351a1976ff..00000000000 --- a/doc/features/bit-rot/00-INDEX +++ /dev/null @@ -1,8 +0,0 @@ -00-INDEX - - this file -bitrot-docs.txt - - links to design, spec and feature page -object-versioning.txt - - object versioning mechanism to track object signature -memory-usage.txt - - memory usage during object expiry tracking diff --git a/doc/features/bit-rot/bitrot-docs.txt b/doc/features/bit-rot/bitrot-docs.txt deleted file mode 100644 index 39cd491dbcd..00000000000 --- a/doc/features/bit-rot/bitrot-docs.txt +++ /dev/null @@ -1,5 +0,0 @@ -* Feature page: http://www.gluster.org/community/documentation/index.php/Features/BitRot - -* Design: http://goo.gl/Mjy4mD - -* CLI specification: http://goo.gl/2o12Fn diff --git a/doc/features/bit-rot/memory-usage.txt b/doc/features/bit-rot/memory-usage.txt deleted file mode 100644 index 5fe06d4a209..00000000000 --- a/doc/features/bit-rot/memory-usage.txt +++ /dev/null @@ -1,48 +0,0 @@ -object expiry tracking memory usage -==================================== - -Bitrot daemon tracks objects for expiry (after which the object is -signed) in a data structure known as "timer-wheel". It's a well -known data structure for tracking the expiry of millions of objects. -Let's see the memory usage involved when tracking 1 million -objects (per brick). - -Bitrot daemon uses the "br_object" structure to hold information -needed for signing. An instance of this structure is allocated -for each object that needs to be signed. - -struct br_object { - xlator_t *this; - - br_child_t *child; - - void *data; - uuid_t gfid; - unsigned long signedversion; - - struct list_head list; -}; - -Timer-wheel requires an instance of the structure below per -object that needs to be tracked for expiry.
- -struct gf_tw_timer_list { - void *data; - unsigned long expires; - - /** callback routine */ - void (*function)(struct gf_tw_timer_list *, void *, unsigned long); - - struct list_head entry; -}; - -Structure sizes: - sizeof (struct br_object): 64 bytes - sizeof (struct gf_tw_timer_list): 40 bytes - -Together, these structures take up 104 bytes. To track all 1 million objects -at the same time, the amount of memory taken up would be: - - 1,000,000 * 104 bytes: ~100MB - -Not so bad, I think. diff --git a/doc/features/bit-rot/object-versioning.txt b/doc/features/bit-rot/object-versioning.txt deleted file mode 100644 index def901f0fc5..00000000000 --- a/doc/features/bit-rot/object-versioning.txt +++ /dev/null @@ -1,236 +0,0 @@ -Object versioning -================= - - Bitrot detection in GlusterFS relies on object (file) checksum (hash) verification, - also known as "object signature". An object is signed when there are no active - file descriptors referring to its inode (i.e., upon last close()). This is just a - hint for the initiation of hash calculation (and therefore signing). There is - absolutely no control over when clients can initiate modification operations on - the object. An object could be under modification while its hash computation is - in progress. It would also be inappropriate to restrict access to such objects - for the duration of signing. - - Object versioning is used as a mechanism to identify the staleness of an object's - signature. The document below does not just list the version update protocol, - but goes through the various factors that led to its design. - -NOTE: The word "object" is used to represent a "regular file" (in the Linux sense) and - object versions are persisted in extended attributes of the object's inode. - Signature calculation includes the object's data (no metadata as of now). - -INDEX -===== - i. Version updation protocol - ii. Correctness guarantees - iii. Implementation - iv. Protocol enhancements - -i.
Version updation protocol -============================ - There are two types of versions associated with an object: - - a) Ongoing version: This version is incremented on first open() [when - the in-memory representation of the object (inode) is marked dirty] - and synchronized to disk. When an object is created, a default ongoing - version of one (1) is assigned. An object lookup() too assigns the - default version if not present. When a version is initialized upon - lookup() or creat() FOP, it need not be durable on disk and therefore - can just be an extended attribute set without an expensive fsync() - syscall. - - b) Signing version: This is the version against which an object is deemed - to be signed. An object's signature is tied to a particular signed version. - Since an object is a candidate for signing upon last release() [last - close()], the signing version is the "ongoing version" at that point in time. - - An object's signature is trustable when the version it was signed against - matches the ongoing version, i.e., if the hash is calculated by hand and - compared against the object signature, it *should* be a perfect match if - and only if the versions are equal. Otherwise, the signature is - considered stale (might or might not match the hash just calculated). - - Initialization of object versions - --------------------------------- - An object that existed before versioning was introduced is assigned the - default versions upon lookup(). The protocol at this point expects "no" - durability guarantees of the versions, i.e., extended attribute sets - need not be followed by an explicit filesystem sync (fsync()). In case - of a power outage or a crash, versions are re-initialized with defaults - if found to be non-existent. The signing version is initialized with a - default value of zero (0) and the ongoing version as one (1). - - [ - NOTE: If an object already has versions on-disk, lookup() just brings - the versions in memory.
In this case both versions may or may - not match depending on the state the object was left in. - ] - - - Increment of object versions - ---------------------------- - During initial versioning, the in-memory representation of the object is - marked dirty, so that subsequent modification operations on the object - trigger a version synchronization to disk (extended attribute set). - Moreover, this operation needs to be durable on disk, for the protocol - to be crash consistent. - - Let's picture the various version states after subsequent open()s. - Not all modification operations need to increment the ongoing version, - only the first operation needs to (subsequent operations are NO-OPs). - - NOTE: From here on, "[s]" depicts a durable filesystem operation and - "*" depicts the inode as dirty. OV(m) and OV(d) are the in-memory and - on-disk ongoing versions; SV(d) is the on-disk signing version. - - - lookup() open() open() open() - =========================================================== - - OV(m): 1* 2 2 2 - ----------------------------------------- - OV(d): 1 2[s] 2 2 - SV(d): 0 0 0 0 - - - Let's now picture the state when an already signed object undergoes - file operations. - - on-disk state: - OV(d): 3 - SV(d): 3|<signature> - - - lookup() open() open() open() - =========================================================== - - OV(m): 3* 4 4 4 - ----------------------------------------- - OV(d): 3 4[s] 4 4 - SV(d): 3 3 3 3 - - Signing process - --------------- - As per the above example, when the last open file descriptor is closed, - signing needs to be performed. The protocol requires that the signing - be attached to a version, which in this case is the in-memory - value of the ongoing version. A release() also marks the inode dirty, - therefore, the next open() does a durable version synchronization to - disk.
- - [carrying forward the versions from the earlier example] - - close() release() open() open() - =========================================================== - - OV(m): 4 4* 5 5 - ----------------------------------------- - OV(d): 4 4 5[s] 5 - SV(d): 3 3 3 3 - - As shown above, a release() call triggers a signing with signing version - as OV(m): which in this case is 4. During signing, the object is signed - with a signature attached to version 4 as shown below (continuing with - the last open() call from above): - - open() sign(4, signature) - =========================================================== - - OV(m): 5 5 - ----------------------------------------- - OV(d): 5 5 - SV(d): 3 4:<signature>[s] - - A signature comparison at this point in time is untrustworthy due to - version mismatches. This also protects from node crashes and hard - reboots due to the durability guarantee of the on-disk version on first - open(). - - close() release() open() - =========================================================== - - OV(m): 4 4* 5 - -------------------------------- CRASH - OV(d): 4 4 5[s] - SV(d): 3 3 3 - - The protocol is immune to signing requests after crashes due to - the version synchronization performed on first open(). A signing - request for a version lower than the *current* ongoing version - can be ignored. It is left to the implementation to either - accept or ignore such signing request(s). - - [ - NOTE: Inode forget() causes a fresh lookup() to be triggered. - Since a forget() call is received when there are no - active references for an inode, the on-disk version is - the latest and would be copied in-memory on lookup(). - ] - -ii. Correctness Guarantees -========================== - - Concurrent open()'s - ------------------- - When an inode is dirty (i.e., the very next operations would try to - synchronize the version to disk), there can be multiple calls [say, - open()] that would find the inode state as dirty and try to writeback - the new version to disk.
Also, note that marking the inode as synced - and updating the in-memory version is done *after* the new version - is written on disk. This is done to avoid a mismatch with the version stored - on-disk in case the version synchronization fails (with the in-memory - version still holding the updated value). - Coming back to multiple open() calls on an object, each open() call - tries to synchronize the new version to disk if the inode is marked - as dirty. This is safe as each open() would try to synchronize the - new version (ongoingversion + 1) even if the update is concurrent. - The in-memory version is finally updated to reflect the updated - version and the inode is marked non-dirty. Again this is done *only* if - the inode is dirty, so open() calls which updated the on-disk - version but lost the race to update the in-memory version - are NO-OPs. - - on-disk state: - OV(d): 3 - SV(d): 3|<signature> - - - lookup() open() open()' open()' open() - ============================================================= - - OV(m): 3* 3* 3* 4 NO-OP - -------------------------------------------------- - OV(d): 3 4[s] 4[s] 4 4 - SV(d): 3 3 3 3 3 - - - open()/release() race - --------------------- - This race can cause a release() [on last close()] to pick up the - ongoing version which was just incremented on fresh open(). This - leads to signing of the object with the same version as the - ongoing version, and therefore to mismatching signatures when calculated. - Another point that's worth mentioning here is that the open - file descriptor is *attached* to its inode *after* it has done - version synchronization (and increment). Hence, if a release() - sneaks in this window, the file descriptor list for the given - inode is still empty, and release() therefore considers it a - last close(). - To counter this, the protocol should track the open and release - counts for file descriptors.
A release() should only trigger a - signing request when the file descriptor list for an inode is empty - and the number of releases matches the number of opens. When an - open() sneaks in and increments the ongoing version but the file - descriptor is still not attached to the inode, the open and release - counts mismatch, hence identifying an open() in progress. - - -iii. Implementation -=================== - Refer to: xlators/feature/bit-rot/src/stub - -iv. Protocol enhancements -========================= - - a) Delaying persisting on-disk versions till open() - b) Lazy version updation (until signing?) - c) Protocol changes required to handle anonymous file - descriptors in GlusterFS. diff --git a/doc/features/brick-failure-detection.md b/doc/features/brick-failure-detection.md deleted file mode 100644 index 24f2a18f39f..00000000000 --- a/doc/features/brick-failure-detection.md +++ /dev/null @@ -1,67 +0,0 @@ -# Brick Failure Detection - -This feature attempts to identify storage/file system failures and disable the failed brick without disrupting the remainder of the node's operation. - -## Description - -Detecting failures on the filesystem that a brick uses makes it possible to handle errors that are caused from outside of the Gluster environment. - -There have been hanging brick processes when the underlying storage of a brick became unavailable. A hanging brick process can still use the network and respond to clients, but actual I/O to the storage is impossible and can cause noticeable delays on the client side. - -This feature provides better detection of storage subsystem failures and prevents brick processes from hanging when the storage hardware or the filesystem fails. - -A health-checker (thread) has been added to the posix xlator. This thread periodically checks the status of the filesystem (implies checking of functional storage-hardware).
- -`glusterd` can detect that the brick process has exited, and `gluster volume status` will show that the brick process is not running anymore. System administrators checking the logs should be able to triage the cause. - -## Usage and Configuration - -The health-checker is enabled by default and runs a check every 30 seconds. This interval can be changed per volume with: - - # gluster volume set <VOLNAME> storage.health-check-interval <SECONDS> - -If `SECONDS` is set to 0, the health-checker will be disabled. - -## Failure Detection - -Errors are logged to the standard syslog (mostly `/var/log/messages`): - - Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 5 buf count 512 - Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): I/O Error Detected. Shutting down filesystem - Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): Please umount the filesystem and rectify the problem(s) - Jun 24 11:31:49 vm130-32 kernel: VFS:Filesystem freeze failed - Jun 24 11:31:50 vm130-32 GlusterFS[1969]: [2013-06-24 10:31:50.500674] M [posix-helpers.c:1114:posix_health_check_thread_proc] 0-failing_xfs-posix: health-check failed, going down - Jun 24 11:32:09 vm130-32 kernel: XFS (dm-2): xfs_log_force: error 5 returned. - Jun 24 11:32:20 vm130-32 GlusterFS[1969]: [2013-06-24 10:32:20.508690] M [posix-helpers.c:1119:posix_health_check_thread_proc] 0-failing_xfs-posix: still alive! -> SIGTERM - -The messages labelled with `GlusterFS` in the above output are also written to the logs of the brick process. - -## Recovery after a failure - -When a brick process detects that the underlying storage is not responding anymore, the process will exit. The brick process is not restarted automatically; the sysadmin will need to fix the problem with the storage first.
- -After correcting the storage (hardware or filesystem) issue, the following command will start the brick process again: - - # gluster volume start <VOLNAME> force - -## How To Test - -The health-checker thread that is part of each brick process will get started automatically when a volume has been started. Verifying its functionality can be done in different ways. - -On virtual hardware: - -* disconnect the disk from the VM that holds the brick - -On real hardware: - -* simulate a RAID-card failure by unplugging the card or cables - -On a system that uses LVM for the bricks: - -* use device-mapper to load an error-table for the disk, see [this description](http://review.gluster.org/5176). - -On any system (writing to random offsets of the block device, more difficult to trigger): - -1. cause corruption on the filesystem that holds the brick -2. read contents from the brick, hoping to hit the corrupted area -3. the filesystem should abort after hitting a bad spot, the health-checker should notice that shortly afterwards diff --git a/doc/features/dht.md b/doc/features/dht.md deleted file mode 100644 index c35dd6d0c27..00000000000 --- a/doc/features/dht.md +++ /dev/null @@ -1,223 +0,0 @@ -# How GlusterFS Distribution Works - -The defining feature of any scale-out system is its ability to distribute work -or data among many servers. Accordingly, people in the distributed-system -community have developed many powerful techniques to perform such distribution, -but those techniques often remain little known or understood even among other -members of the file system and database communities that benefit. This -confusion is represented even in the name of the GlusterFS component that -performs distribution - DHT, which stands for Distributed Hash Table but is not -actually a DHT as that term is most commonly used or defined. The way -GlusterFS's DHT works is based on a few basic principles: - - * All operations are driven by clients, which are all equal.
There are no - special nodes with special knowledge of where files are or should be. - - * Directories exist on all subvolumes (bricks or lower-level aggregations of - bricks); files exist on only one. - - * Files are assigned to subvolumes based on *consistent hashing*, and even - more specifically a form of consistent hashing exemplified by Amazon's - [Dynamo][dynamo]. - -The result of all this is that users are presented with a set of files that is -the union of the files present on all subvolumes. The following sections -describe how this "uniting" process actually works. - -## Layouts - -The conceptual basis of Dynamo-style consistent hashing is a set of numbers -arranged around a circle, like a clock. First, the circle is divided into segments and those -segments are assigned to bricks. (For the sake of simplicity we'll use -"bricks" hereafter even though they might actually be replicated/striped -subvolumes.) Several factors guide this assignment. - - * Assignments are done separately for each directory. - - * Historically, segments have all been the same size. However, this can lead - to smaller bricks becoming full while plenty of space remains on larger - ones. If the *cluster.weighted-rebalance* option is set, segment sizes - will be proportional to brick sizes. - - * Assignments need not include all bricks in the volume. If the - *cluster.subvols-per-directory* option is set, only that many bricks will - receive assignments for that directory. - -However these assignments are done, they collectively become what we call a -*layout* for a directory. This layout is then stored using extended -attributes, with each brick's copy of that extended attribute on that directory -consisting of four 32-bit fields. - - * A version, which might be DHT\_HASH\_TYPE\_DM to represent an assignment as - described above, or DHT\_HASH\_TYPE\_DM\_USER to represent an assignment made - manually by the user (or external script). - - * A "commit hash" which will be described later.
- - * The first number in the assigned range (segment). - - * The last number in the assigned range. - -For example, the extended attributes representing a weighted assignment between -three bricks, one twice as big as the others, might look like this. - - * Brick A (the large one): DHT\_HASH\_TYPE\_DM 1234 0 0x7fffffff - - * Brick B: DHT\_HASH\_TYPE\_DM 1234 0x80000000 0xbfffffff - - * Brick C: DHT\_HASH\_TYPE\_DM 1234 0xc0000000 0xffffffff - -## Placing Files - -To place a file in a directory, we first need a layout for that directory - as -described above. Next, we calculate a hash for the file. To minimize -collisions either between files in the same directory with different names or -between files in different directories with the same name, this hash is -generated using both the (containing) directory's unique GFID and the file's -name. This hash is then matched to one of the layout assignments, to yield -what we call a *hashed location*. For example, consider the layout shown -above. The hash 0xabad1dea is between 0x80000000 and 0xbfffffff, so the -corresponding file's hashed location would be on Brick B. A second file with a -hash of 0xfaceb00c would be assigned to Brick C by the same reasoning. - -## Looking Up Files - -Because layout assignments might change, especially as bricks are added or -removed, finding a file involves more than calculating its hashed location and -looking there. That is in fact the first step, and works most of the time - -i.e. the file is found where we expected it to be - but there are a few more -steps when that's not the case. Historically, the next step has been to look -for the file **everywhere** - i.e. to broadcast our lookup request to all -subvolumes. If the file isn't found that way, it doesn't exist. At this -point, an open that requires the file's presence will fail, or a create/mkdir -that requires its absence will be allowed to continue.
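The range matching described under "Placing Files" above can be illustrated with a small sketch. It is not DHT's implementation: the `LAYOUT` table and the `hashed_location` helper are hypothetical names, the real hash function (computed from the directory GFID and the file name) is not reproduced, and Brick A's range is assumed to end at 0x7fffffff so that the three ranges cover the whole 32-bit space.

```python
# Hypothetical in-memory form of the example layout: (brick, first, last).
LAYOUT = [
    ("Brick A", 0x00000000, 0x7FFFFFFF),  # the large brick, half the ring
    ("Brick B", 0x80000000, 0xBFFFFFFF),
    ("Brick C", 0xC0000000, 0xFFFFFFFF),
]


def hashed_location(file_hash, layout=LAYOUT):
    """Return the brick whose assigned range contains file_hash.

    file_hash is taken as given; computing it is outside this sketch.
    Returns None if no range matches (a gap in the layout).
    """
    for brick, first, last in layout:
        if first <= file_hash <= last:
            return brick
    return None


if __name__ == "__main__":
    # The two hashes used as examples in the text above.
    for h in (0xABAD1DEA, 0xFACEB00C):
        print(hex(h), "->", hashed_location(h))
```

A linear scan is enough here because a layout has one entry per brick; it keeps the sketch close to the prose description.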
- -Regardless of whether a file is found at its hashed location or elsewhere, we -now know its *cached location*. As the name implies, this is stored within DHT -to satisfy future lookups. If it's not the same as the hashed location, we -also take an extra step. This step is the creation of a *linkfile*, which is a -special stub left at the **hashed** location pointing to the **cached** -location. Therefore, if a client naively looks for a file at its hashed -location and finds a linkfile instead, it can use that linkfile to look up the -file where it really is instead of needing to inquire everywhere. - -## Rebalancing - -As bricks are added or removed, or files are renamed, many files can end up -somewhere other than at their hashed locations. When this happens, the volumes -need to be rebalanced. This process consists of two parts. - - 1. Calculate new layouts, according to the current set of bricks (and possibly - their characteristics). We call this the "fix-layout" phase. - - 2. Migrate any "misplaced" files to their correct (hashed) locations, and - clean up any linkfiles which are no longer necessary. We call this the - "migrate-data" phase. - -Usually, these two phases are done together. (In fact, the code for them is -somewhat intermingled.) However, the migrate-data phase can involve a lot of -I/O and be very disruptive, so users can do just the fix-layout phase and defer -migrate-data until a more convenient time. This allows new files to be placed -on new bricks, even though old files might still be in the "wrong" place. - -When calculating a new layout to replace an old one, DHT specifically tries to -maximize overlap of the assigned ranges, thus minimizing data movement. This -difference can be very large. For example, consider the case where our example -layout from earlier is updated to add a new double-sized brick. Here's a very -inefficient way to do that.
- - * Brick A (the large one): 0x00000000 to 0x55555555 - - * Brick B: 0x55555556 to 0x7fffffff - - * Brick C: 0x80000000 to 0xaaaaaaaa - - * Brick D (the new one): 0xaaaaaaab to 0xffffffff - -This would cause files in the following ranges to be migrated: - - * 0x55555556 to 0x7fffffff (from A to B) - - * 0x80000000 to 0xaaaaaaaa (from B to C) - - * 0xaaaaaaab to 0xbfffffff (from B to D) - - * 0xc0000000 to 0xffffffff (from C to D) - -As a historical note, this is exactly what we used to do, and in this case it -would have meant moving 8/12 of all files in the volume (1/6 + 1/6 + 1/12 + -1/4 of the hash space). Now let's consider a -new layout that's optimized to maximize overlap with the old one. - - * Brick A: 0x00000000 to 0x55555555 - - * Brick D: 0x55555556 to 0xaaaaaaaa <- optimized insertion point - - * Brick B: 0xaaaaaaab to 0xd5555554 - - * Brick C: 0xd5555555 to 0xffffffff - -In this case we only need to move 5/12 of all files. In a volume with millions -or even billions of files, reducing data movement by 1/4 of all files is a -pretty big improvement. In the future, DHT might use "virtual node IDs" or -multiple hash rings to make rebalancing even more efficient. - -## Rename Optimizations - -With the file-lookup mechanisms we already have in place, it's not necessary to -move a file from one brick to another when it's renamed - even across -directories. It will still be found, albeit a little less efficiently. The -first client to look for it after the rename will add a linkfile, which every -other client will follow from then on. Also, every client that has found the -file once will continue to find it based on its cached location, without any -network traffic at all. Because the extra lookup cost is small, and the -movement cost might be very large, DHT renames the file "in place" on its -current brick instead (taking advantage of the fact that directories exist -everywhere). - -This optimization is further extended to handle cases where renames are very -common.
For example, rsync and similar tools often use a "write new then -rename" idiom in which a file "xxx" is actually written as ".xxx.1234" and then -moved into place only after its contents have been fully written. To make this -process more efficient, DHT uses a regular expression to separate the permanent -part of a file's name (in this case "xxx") from what is likely to be a -temporary part (the leading "." and trailing ".1234"). That way, after the -file is renamed it will be in its correct hashed location - which it wouldn't -be otherwise if "xxx" and ".xxx.1234" hash differently - and no linkfiles or -broadcast lookups will be necessary. - -In fact, there are two regular expressions available for this purpose - -*cluster.rsync-hash-regex* and *cluster.extra-hash-regex*. As its name -implies, *rsync-hash-regex* defaults to the pattern that rsync uses, while -*extra-hash-regex* can be set by the user to support a second tool using the -same temporary-file idiom. - -## Commit Hashes - -A very recent addition to DHT's algorithmic arsenal is intended to reduce the -number of "broadcast" lookups that it issues. If a volume is completely in -balance, then no file could exist anywhere but at its hashed location. -Therefore, if we've already looked there and not found it, then looking -elsewhere would be pointless (and wasteful). The *commit hash* mechanism is -used to detect this case. A commit hash is assigned to a volume, and -separately to each directory, and then updated according to the following -rules. - - * The volume commit hash is changed whenever actions are taken that might - cause layout assignments across all directories to become invalid - i.e. - bricks being added, removed, or replaced. - - * The directory commit hash is changed whenever actions are taken that might - cause files to be "misplaced" - e.g. when they're renamed.
- - * The directory commit hash is set to the volume commit hash when the - directory is created, and whenever the directory is fully rebalanced so that - all files are at their hashed locations. - -In other words, whenever either the volume or directory commit hash is changed, -that creates a mismatch. In that case we revert to the "pessimistic" -broadcast-lookup method described earlier. However, if the two hashes match -then we can skip the broadcast lookup and return a result immediately. -This has been observed to cause a 3x performance improvement in workloads that -involve creating many small files across many bricks. - -[dynamo]: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf diff --git a/doc/features/file-snapshot.md b/doc/features/file-snapshot.md deleted file mode 100644 index 7f7c419fc7f..00000000000 --- a/doc/features/file-snapshot.md +++ /dev/null @@ -1,91 +0,0 @@ -#File Snapshot -This feature gives the ability to take snapshots of files. - -##Description -This feature adds file snapshotting support to glusterfs. Snapshots can be created, deleted and reverted. - -To take a snapshot of a file, the file should be in QCOW2 format, as the code for the block layer snapshot has been taken from Qemu and put into gluster as a translator. - -With this feature, glusterfs will have better integration with Openstack Cinder, and in general the ability to take snapshots of files (typically VM images). - -A new extended attribute (xattr) will be added to identify files which are 'snapshot managed' vs raw files. - -##Volume Options -The following volume option needs to be set on the volume for taking file snapshots. - - # features.file-snapshot on -##CLI parameters -The following CLI parameters need to be passed with the setfattr command to create, delete and revert file snapshots.
- - # trusted.glusterfs.block-format - # trusted.glusterfs.block-snapshot-create - # trusted.glusterfs.block-snapshot-goto -##Fully loaded Example -Download glusterfs3.5 rpms from download.gluster.org -Install these rpms. - -Start glusterd by using the command - - # service glusterd start -Now create a volume by using the command - - # gluster volume create <vol_name> <brick_path> -Run the command below to make sure that the volume is created. - - # gluster volume info -Now turn on the snapshot feature on the volume by using the command - - # gluster volume set <vol_name> features.file-snapshot on -Verify that the option is set by using the command - - # gluster volume info -The user should be able to see another option in the volume info - - # features.file-snapshot: on -Now mount the volume using a fuse mount - - # mount -t glusterfs <hostname>:/<vol_name> <mount_point> -cd into the mount point and create a file - # cd <mount_point> - # touch <file_name> -The size of the file can be set and the format of the file can be changed to QCOW2 by running the command below. The file size can be in KB/MB/GB - - # setfattr -n trusted.glusterfs.block-format -v qcow2:<file_size> <file_name> -Now create another file and send data to that file by running the command - - # echo 'ABCDEFGHIJ' > <data_file1> -Copy the data from that file to the first file by running the command - - # dd if=<data_file1> of=<file_name> conv=notrunc -Now take the `snapshot of the file` by running the command - - # setfattr -n trusted.glusterfs.block-snapshot-create -v <image1> <file_name> -Add some more contents to the file and take another file snapshot by doing the following steps - - # echo '1234567890' > <data_file2> - # dd if=<data_file2> of=<file_name> conv=notrunc - # setfattr -n trusted.glusterfs.block-snapshot-create -v <image2> <file_name> -Now `revert` to both the file snapshots and write the data to some files so that the data can be compared.
- - # setfattr -n trusted.glusterfs.block-snapshot-goto -v <image1> <file_name> - # dd if=<file_name> of=<out_file1> bs=11 count=1 - # setfattr -n trusted.glusterfs.block-snapshot-goto -v <image2> <file_name> - # dd if=<file_name> of=<out_file2> bs=11 count=1 -Now read the contents of the files and compare as below: - - # cat <data_file1>, <out_file1> and compare contents. - # cat <data_file2>, <out_file2> and compare contents. -##one line description for the variables used -file_name = File which is created in the mount point initially. - -data_file1 = File which contains data 'ABCDEFGHIJ' - -image1 = First file snapshot which has 'ABCDEFGHIJ' + some null values. - -data_file2 = File which contains data '1234567890' - -image2 = Second file snapshot which has '1234567890' + some null values. - -out_file1 = After reverting to image1 this contains 'ABCDEFGHIJ' - -out_file2 = After reverting to image2 this contains '1234567890' diff --git a/doc/features/geo-replication/distributed-geo-rep.md b/doc/features/geo-replication/distributed-geo-rep.md deleted file mode 100644 index 0a3183d6269..00000000000 --- a/doc/features/geo-replication/distributed-geo-rep.md +++ /dev/null @@ -1,71 +0,0 @@ -Introduction -============ - -This document goes through the new design of distributed geo-replication, its features and the nature of changes involved. First we list some of the important features. - - - Distributed asynchronous replication - - Fast and versatile change detection - - Replica failover - - Hardlink synchronization - - Effective handling of deletes and renames - - Configurable sync engine (rsync, tar+ssh) - - Adaptive to a wide variety of workloads - - GFID synchronization - -Geo-replication makes use of the all new *journaling* infrastructure (a.k.a. changelog) to achieve great performance and feature improvements as mentioned above.
To understand more about changelogging and the helper library (*libgfchangelog*) refer to document: doc/features/geo-replication/libgfchangelog.md - -Data Replication ----------------- - -Geo-replication is responsible for incrementally replicating data from the master node to the slave. But isn't that similar to what AFR does? Yes, but here the slave is located geographically distant from the master. Geo-replication follows the eventually consistent replication model, which implies that, at any point in time, the slave would be lagging w.r.t. the master, but would eventually catch up. Replication performance is dependent on two crucial factors: - - Network latency - - Change detection - -Network latency is something that is not in direct control for many reasons, but still there is always a best effort. Therefore, geo-replication offloads the data replication part to common UNIX file transfer utilities. We choose the grand daddy of file transfers [rsync(1)] [1] as the default synchronization engine, as it's best known for its diff transfer algorithm for efficient usage of the network and lightning fast transfers (let alone the flexibility). But what about small files performance? Due to its checksumming algorithm, rsync has more overhead for small files -- the overhead of checksumming outweighs the bytes to be transferred for small files. Therefore, geo-replication can also use a combination of tar piped over ssh to transfer a large number of small files. Tests have shown a great improvement over standard rsync. However, the sync engine does not yet adapt to the file type and needs to be chosen manually by a configuration option. - -OTOH, change detection is something that is in full control of the application. Earlier (< release 3.5), geo-replication would perform a file system crawl to identify changes in the file system. This was not an unintelligent *check-every-single-inode* in the file system, but crawl logic based on *xtime*.
xtime is an extended attribute maintained by the *marker* translator for each inode on the master and follows an upward-recursive marking pattern. Geo-replication would traverse a directory based on this simple condition: - -> xtime(master) > xtime(slave) - -E.g.: - -> MASTER SLAVE -> -> /\ /\ -> d0 dir0 d0 dir0 -> / \ / \ -> d1 dir1 d1 dir1 -> / / -> d2 d2 -> / / -> file0 file0 - -Consider the directory tree above. Assume that master and slave were in sync and the following operation happens on master: -``` -touch /d0/d1/d2/file0 -``` -This would trigger an xtime marking (xtime being the current timestamp) from the leaf (*file0*) up to the root (*/*), i.e. an *xattr* update on *file0*, *d2*, *d1*, *d0* and finally */*. The geo-replication daemon would crawl the file system based on the condition mentioned before and hence would only crawl the **left** part of the directory tree (as the **right** part would have equal xtimes). - -Although the above crawling algorithm is fast, it still has to crawl a good part of the file system. Also, to decide whether to crawl a particular subdirectory, geo-rep needs to compare xtimes -- which is basically a **getxattr()** call on the master and slave (remember, the *slave* is over a WAN). - -Therefore, in 3.5 the need arose to take crawling to the next level. Geo-replication now uses the changelogging infrastructure to identify changes in the filesystem. Actually, there is absolutely no crawl involved. Changelogging based detection is notification based. The geo-replication daemon registers itself with the changelog consumer library (*libgfchangelog*) and basically invokes a set of APIs to get the list of changes in the filesystem and replays them onto the slave. No crawl or extended attribute of any kind is involved. - -Distributed Geo-Replication ---------------------------- -Geo-replication (also known as gsyncd or geo-rep) used to be non-distributed before release 3.5.
The node on which the geo-rep start command was executed was responsible for replicating data to the slave. If this node went offline for some reason (reboot, crash, etc.), replication would cease. So one of the main development efforts for release 3.5 was to *distributify* geo-replication. The geo-rep daemon running on each node (per brick) is responsible for replicating data **local** to each brick. This results in full parallelism and effective use of cluster/network resources. - -With release 3.5, the geo-rep start command spawns a geo-replication daemon on each node in the master cluster (one per brick). The geo-rep *status* command shows geo-rep session status from each master node. Similarly, *stop* gracefully tears down the session from all nodes. - -What else is synced? --------------------- - - GFID: Synchronizing the inode number (GFID) between master and the slave helps in synchronizing hardlinks. - - Purges are also handled effectively as there is no entry comparison between master and slave. With changelog replay, geo-rep performs unlink operations without having to resort to expensive **readdir()** over the WAN. - - Renames: With earlier geo-replication, because of the path based nature of crawling, renames were actually a delete and a create on the slave, followed by data transfer (not to mention the inode number change). Now, with changelogging, it's actually a **rename()** call on the slave. - -Replica Failover ----------------- -One of the basic volume configurations is a replicated volume (synchronous replication). Having geo-replication sync data from all replicas would mean wastage of network bandwidth and possibly data corruption on the slave (though that's unlikely). Therefore, geo-rep on such volume configurations works in an **ACTIVE** and **PASSIVE** mode. The geo-rep daemon on one of the replicas is responsible for replicating data (**ACTIVE**), while the other geo-rep daemon is basically doing nothing (**PASSIVE**).
- -In the event of the *ACTIVE* node going offline, the *PASSIVE* node identifies this event (there's a lag of max 60 seconds for this identification) and switches to *ACTIVE*, thereby taking over the role of replicating data from where the earlier *ACTIVE* node left off. This guarantees uninterrupted data replication even on node reboot/failures. - -[1]:http://rsync.samba.org diff --git a/doc/features/geo-replication/libgfchangelog.md b/doc/features/geo-replication/libgfchangelog.md deleted file mode 100644 index 1dd0d24253a..00000000000 --- a/doc/features/geo-replication/libgfchangelog.md +++ /dev/null @@ -1,119 +0,0 @@ -libgfchangelog: "GlusterFS changelog" consumer library -====================================================== - -This document puts forward the intended need for a GlusterFS changelog consumer library (a.k.a. libgfchangelog) for consuming changelogs produced by the Changelog translator. Further, it mentions the proposed design and the API exposed by it. A brief explanation of the changelog translator can also be found as a commit message in the upstream source tree and the review link can be [accessed here] [1]. - -The initial consumer of changelogs would be Geo-Replication (release 3.5). Possible consumers in the future could be backup utilities, GlusterFS self-heal, bit-rot detection, AV scanners. All these utilities have one thing in common - to get a list of changed entities (created/modified/deleted) in the file system. Therefore, the need arises to provide such functionality in the form of a shared library that applications can link against and query for changes (See API section). There is no plan as of now to provide language bindings as such, but for shell script friendliness a 'gfind' command line utility (which would be dynamically linked with libgfchangelog) would be helpful. As of now, development of this utility has not yet commenced. - -The next section gives a brief introduction about how changelogs are organized and managed.
Then we propose a couple of designs for libgfchangelog. The API set is not covered in this document (maybe later). - -Changelogs -========== - -Changelogs can be thought of as a running history for an entity in the file system from the time the entity came into existence. The goal is to capture all possible transitions the entity underwent till the time it got purged. The transition namespace is broken up into three categories with each category represented by a specific changelog format. Changes are recorded in a flat file in the filesystem and are rolled over after a specific time interval. All three categories are recorded in a single changelog file (sequentially) with a type for each entry. Having a single file reduces disk seeks and fragmentation and leaves fewer files to deal with. The strategy for pruning old logs is still undecided. - - -Changelog Transition Namespace ------------------------------- - -As mentioned before the transition namespace is categorized into three types: - - TYPE-I : Data operation - - TYPE-II : Metadata operation - - TYPE-III : Entry operation - -One could visualize the transition of a file system entity as a state machine transitioning from one type to another. For TYPE-I and TYPE-II operations there is no state transition as such, but a TYPE-III operation involves a state change from the file system's perspective. We can now classify file operations (fops) into one of the three types: - - Data operation: write(), writev(), truncate(), ftruncate() - - Metadata operation: setattr(), fsetattr(), setxattr(), fsetxattr(), removexattr(), fremovexattr() - - Entry operation: create(), mkdir(), mknod(), symlink(), link(), rename(), unlink(), rmdir() - -Changelog Entry Format ----------------------- - -In order to record the type of operation an entity underwent, a type identifier is used.
Normally, the entity on which the operation is performed would be identified by the pathname, which is the most common way of addressing in a file system, but we choose to use the GlusterFS internal file identifier (GFID) instead (as GlusterFS supports a GFID based backend, the pathname field may not always be valid, and for other reasons which are out of scope of this document). Therefore, the format of the record for the three types of operation can be summarized as follows: - - - TYPE-I : GFID of the file - - TYPE-II : GFID of the file - - TYPE-III : GFID + FOP + MODE + UID + GID + PARGFID/BNAME [PARGFID/BNAME] - -GFIDs are analogous to inodes. TYPE-I and TYPE-II fops record the GFID of the entity on which the operation was performed, thereby recording that there was a data/metadata change on the inode. TYPE-III fops record at the minimum a set of six or seven records (depending on the type of operation), which is sufficient to identify what type of operation the entity underwent. Normally this record includes the GFID of the entity, the type of file operation (which is an integer [an enumerated value which is used in GlusterFS]) and the parent GFID and the basename (analogous to parent inode and basename). - -Changelogs can be either in ascii or binary format, the difference being the format of the records that are persisted. In a binary changelog the gfids are recorded in their native format, i.e. a 16 byte record, and the fop number as a 4 byte integer. In an ascii changelog, the gfids are stored in their canonical form and the fop number is stringified and persisted. The null character is used as the record separator in changelogs. This makes it hard to read changelogs from the command line, but the packed format is needed to support file names with spaces and special characters. Below is a snippet of a changelog alongside its hexdump.
- -``` -00000000 47 6c 75 73 74 65 72 46 53 20 43 68 61 6e 67 65 |GlusterFS Change| -00000010 6c 6f 67 20 7c 20 76 65 72 73 69 6f 6e 3a 20 76 |log | version: v| -00000020 31 2e 31 20 7c 20 65 6e 63 6f 64 69 6e 67 20 3a |1.1 | encoding :| -00000030 20 32 0a 45 61 36 39 33 63 30 34 65 2d 61 66 39 | 2.Ea693c04e-af9| -00000040 65 2d 34 62 61 35 2d 39 63 61 37 2d 31 63 34 61 |e-4ba5-9ca7-1c4a| -00000050 34 37 30 31 30 64 36 32 00 32 33 00 33 33 32 36 |47010d62.23.3326| -00000060 31 00 30 00 30 00 66 36 35 34 32 33 32 65 2d 61 |1.0.0.f654232e-a| -00000070 34 32 62 2d 34 31 62 33 2d 62 35 61 61 2d 38 30 |42b-41b3-b5aa-80| -00000080 33 62 33 64 61 34 35 39 33 37 2f 6c 69 62 76 69 |3b3da45937/libvi| -00000090 72 74 5f 64 72 69 76 65 72 5f 6e 65 74 77 6f 72 |rt_driver_networ| -000000a0 6b 2e 73 6f 00 44 61 36 39 33 63 30 34 65 2d 61 |k.so.Da693c04e-a| -000000b0 66 39 65 2d 34 62 61 35 2d 39 63 61 37 2d 31 63 |f9e-4ba5-9ca7-1c| -000000c0 34 61 34 37 30 31 30 64 36 32 00 45 36 65 39 37 |4a47010d62.E6e97| -``` - -As you can see, there is an *entry* operation (journal record starting with an "E"). Records for this operation are: - - GFID : a693c04e-af9e-4ba5-9ca7-1c4a47010d62 - - FOP : 23 (create) - - Mode : 33261 - - UID : 0 - - GID : 0 - - PARGFID/BNAME: f654232e-a42b-41b3-b5aa-803b3da45937/libvirt_driver_network.so - -**NOTE**: In case of a rename operation, there would be an additional record (for the target PARGFID/BNAME). - -libgfchangelog --------------- - -NOTE: changelogs generated by the changelog translator are rolled over [with the timestamp as the suffix] after a specific interval, after which a new changelog is started. The current changelog [changelog file without the timestamp as the suffix] should never be processed unless it's rolled over. The rolled over logs should be treated as read-only. - -Capturing changes performed on a file system is useful for applications that rely on file system scan (crawl) to figure out such information.
Backup utilities, automatic file healing in a replicated environment, bit-rot detection and the like are some of the end user applications that require a set of changed entities in a file system to act on. The goal of libgfchangelog is to provide the application (consumer) a fast and easy to use common query interface (API). The consumer need not worry about the changelog format, nomenclature of the changelog files etc. - -Now we list functionality and some of the features. - -Functionality -------------- - -Changelog Processing: Processing involves reading changelog file(s), converting the entries into human-readable (or application understandable) format (in case of binary log format). -Book-keeping: Keeping track of how much of the changelog the application has consumed (i.e. changes during the time slice start-time -> end-time). -Serve API request: Update the consumer by providing the set of changes. - -Processing could be done in two ways: - -* Pre-processing (pre-processing from the library POV): -Once a changelog file is rolled over (by the changelog translator), a set of post processing operations are performed. These operations could include converting a binary log file to an understandable format, collating a bunch of logs into a larger sampling period or just keeping a private copy of the changelog (in ascii format). Extra disk space is consumed to store this private copy. The library would then be free to consume these logs and serve application requests. - -* On-demand: -The processing of the changelogs is triggered when an application requests changes. The downside is the additional time spent on decoding the logs and accumulating data at application request time (but no additional disk space is used over the time period). - -After processing, the changelog is ready to be consumed by the application.
The function of processing is to convert the logs into human/application readable format (an example is shown below): - -``` -E a7264fe2-dd6b-43e1-8786-a03b42cc2489 CREATE 33188 0 0 00000000-0000-0000-0000-000000000001%2Fservices1 -M a7264fe2-dd6b-43e1-8786-a03b42cc2489 NULL -M 00000000-0000-0000-0000-000000000001 NULL -D a7264fe2-dd6b-43e1-8786-a03b42cc2489 -``` - -Features --------- - -The following points mention some of the features that the library could provide. - - - Consumer could choose the update type when it registers with the library. 'types' could be: - - Streaming: The consumer is updated via a stream of changes, i.e. the library would just replay the logs - - Consolidated: The consumer is provided with a consolidated view of the changelog, e.g. if <gfid> had a DATA and a METADATA operation, it would be presented as a single update. Similarly for ENTRY operations. - - Raw: This mode provides the consumer with the pathnames of the changelog files themselves (after processing). The changelogs should be strictly treated as read-only. This gives the consumer the flexibility to extract updates in their own preferred way (e.g. using command line tools like sed, awk, sort | uniq etc.). - - Application may choose to adopt a synchronous (blocking) or an asynchronous (callback) notification mechanism. - - Provide a unified view of changelogs from multiple peers (replication scenario) or a global changelog view of the entire cluster. - - -**The first cut of the library supports:** - - Raw access mode - - Synchronous programming model - - Per brick changelog consumption, i.e. no unified/globally aggregated changelog - -[1]:http://review.gluster.org/5127 diff --git a/doc/features/gfid-access.md b/doc/features/gfid-access.md deleted file mode 100644 index 2d324a18bdb..00000000000 --- a/doc/features/gfid-access.md +++ /dev/null @@ -1,73 +0,0 @@ -#Gfid-access Translator -The 'gfid-access' translator provides access to data in glusterfs using a -virtual path.
This particular translator is designed to provide direct access to -files in glusterfs using their gfids. 'GFID' is glusterfs's inode number for a file -to identify it uniquely. As of now, Geo-replication is the only consumer of this -translator. The changelog translator logs the 'gfid' with the corresponding file -operation in journals which are consumed by Geo-Replication to replicate the -files using the gfid-access translator very efficiently. - -###Implications and Usage -A new virtual directory called '.gfid' is exposed in the aux-gfid mount -point when a gluster volume is mounted with the 'aux-gfid-mount' option. -All the gfids of files are exposed in one level under the '.gfid' directory. -No matter at what level the file resides, it is accessed using its -gfid under this virtual directory as shown in the example below. All access -protocols work seamlessly, as the complexities are handled internally. - -###Testing -1. Mount glusterfs client with '-o aux-gfid-mount' as follows. - - mount -t glusterfs -o aux-gfid-mount <node-ip>:<volname> <mountpoint> - - Example: - - #mount -t glusterfs -o aux-gfid-mount rhs1:master /master-aux-mnt - -2. Get the 'gfid' of a file using normal mount or aux-gfid-mount and do some - operations as follows. - - getfattr -n glusterfs.gfid.string <file> - - Example: - - #getfattr -n glusterfs.gfid.string /master-aux-mnt/file - # file: file - glusterfs.gfid.string="796d3170-0910-4853-9ff3-3ee6b1132080" - - #cat /master-aux-mnt/file - sample data - - #stat /master-aux-mnt/file - File: `file' - Size: 12 Blocks: 1 IO Block: 131072 regular file - Device: 13h/19d Inode: 11525625031905452160 Links: 1 - Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) - Access: 2014-05-23 20:43:33.239999863 +0530 - Modify: 2014-05-23 17:36:48.224999989 +0530 - Change: 2014-05-23 20:44:10.081999938 +0530 - - -3. Access files using virtual path as follows.
- - /mountpoint/.gfid/<actual-canonical-gfid-of-the-file> - - Example: - - #cat /master-aux-mnt/.gfid/796d3170-0910-4853-9ff3-3ee6b1132080 - sample data - #stat /master-aux-mnt/.gfid/796d3170-0910-4853-9ff3-3ee6b1132080 - File: `.gfid/796d3170-0910-4853-9ff3-3ee6b1132080' - Size: 12 Blocks: 1 IO Block: 131072 regular file - Device: 13h/19d Inode: 11525625031905452160 Links: 1 - Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) - Access: 2014-05-23 20:43:33.239999863 +0530 - Modify: 2014-05-23 17:36:48.224999989 +0530 - Change: 2014-05-23 20:44:10.081999938 +0530 - - We can see that the 'cat' command on the 'file' using its path and using the virtual - path displays the same data. Similarly, the 'stat' command on the 'file' and on the - virtual path with gfid gives the same Inode Number, confirming that it is the same file. - -###Nature of changes -This feature is introduced with 'gfid-access' translator. diff --git a/doc/features/glusterfs_nfs-ganesha_integration.md b/doc/features/glusterfs_nfs-ganesha_integration.md deleted file mode 100644 index b30671506d7..00000000000 --- a/doc/features/glusterfs_nfs-ganesha_integration.md +++ /dev/null @@ -1,123 +0,0 @@ -# GlusterFS and NFS-Ganesha integration - -Nfs-ganesha can support NFS (v3, 4.0, 4.1 pNFS) and 9P (from the Plan9 operating system) protocols concurrently. It provides a FUSE-compatible File System Abstraction Layer(FSAL) to allow the file-system developers to plug in their own storage mechanism and access it from any NFS client. - -With NFS-GANESHA, the NFS client talks to the NFS-GANESHA server instead, which is in the user address space already. NFS-GANESHA can access the FUSE filesystems directly through its FSAL without copying any data to or from the kernel, thus potentially improving response times. Of course the network streams themselves (TCP/UDP) will still be handled by the Linux kernel when using NFS-GANESHA.
- -GlusterFS has recently been integrated with NFS-Ganesha to export the volumes created via glusterfs, using “libgfapi”. libgfapi is a new userspace library developed to access data in glusterfs. It performs I/O on gluster volumes directly without a FUSE mount. It is a filesystem-like API which runs in the application process context (which is NFS-Ganesha here) and eliminates the use of fuse and the kernel vfs layer from glusterfs volume access. Thus by integrating NFS-Ganesha and libgfapi, the speed and latency have been improved compared to FUSE mount access. - -### 1.) Pre-requisites - - - Before starting to set up NFS-Ganesha, a GlusterFS volume should be created. - - Disable kernel-nfs, gluster-nfs services on the system using the following commands - - service nfs stop - - gluster vol set <volname> nfs.disable ON (Note: this command has to be repeated for all the volumes in the trusted-pool) - - Usually the libgfapi.so* files are installed in “/usr/lib” or “/usr/local/lib”, based on whether you have installed glusterfs using rpm or sources. Verify that those libgfapi.so* files are linked in “/usr/lib64” and “/usr/local/lib64” as well. If not, create the links for those .so files in those directories. - -### 2.) Installing nfs-ganesha - -##### i) using rpm install - - - nfs-ganesha rpms are available in Fedora19 or later packages.
So to install nfs-ganesha, run - - *#yum install nfs-ganesha* - - On CentOS or EL, download the rpms from the below link : - - http://download.gluster.org/pub/gluster/glusterfs/nfs-ganesha - -##### ii) using sources - - - cd /root - - git clone git://github.com/nfs-ganesha/nfs-ganesha.git - - cd nfs-ganesha/ - - git submodule update --init - - git checkout -b next origin/next (Note : origin/next is the current development branch) - - rm -rf ~/build; mkdir ~/build ; cd ~/build - - cmake -DUSE_FSAL_GLUSTER=ON -DCURSES_LIBRARY=/usr/lib64 -DCURSES_INCLUDE_PATH=/usr/include/ncurses -DCMAKE_BUILD_TYPE=Maintainer /root/nfs-ganesha/src/ - - make; make install -> Note: libcap-devel, libnfsidmap, dbus-devel, libacl-devel, ncurses* packages -> may need to be installed prior to running this command. For Fedora, libjemalloc, -> libjemalloc-devel may also be required. - -### 3.) Run nfs-ganesha server - - - To start nfs-ganesha manually, execute the following command: - - *#ganesha.nfsd -f <location_of_nfs-ganesha.conf_file> -L <location_of_log_file> -N <log_level> -d* - -```sh -For example: -#ganesha.nfsd -f nfs-ganesha.conf -L nfs-ganesha.log -N NIV_DEBUG -d -where: -nfs-ganesha.log is the log file for the ganesha.nfsd process. -nfs-ganesha.conf is the configuration file -NIV_DEBUG is the log level. -``` - - To check if nfs-ganesha has started, execute the following command: - - *#ps aux | grep ganesha* - - By default '/' will be exported - -### 4.) Exporting GlusterFS volume via nfs-ganesha - -#####step 1 : - -To export any GlusterFS volume or directory inside a volume, create the EXPORT block for each of those entries in a .conf file, for example export.conf. The following parameters are required to export any entry. -- *#cat export.conf* - -```sh -EXPORT{ - Export_Id = 1 ; # Export ID unique to each export - Path = "volume_path"; # Path of the volume to be exported.
Eg: "/test_volume" - - FSAL { - name = GLUSTER; - hostname = "10.xx.xx.xx"; # IP of one of the nodes in the trusted pool - volume = "volume_name"; # Volume name. Eg: "test_volume" - } - - Access_type = RW; # Access permissions - Squash = No_root_squash; # To enable/disable root squashing - Disable_ACL = TRUE; # To enable/disable ACL - Pseudo = "pseudo_path"; # NFSv4 pseudo path for this export. Eg: "/test_volume_pseudo" - Protocols = "3","4" ; # NFS protocols supported - Transports = "UDP","TCP" ; # Transport protocols supported - SecType = "sys"; # Security flavors supported -} -``` - -#####step 2 : - -Define/copy the “nfs-ganesha.conf” file to a suitable location. This file is available in “/etc/glusterfs-ganesha” on installation of the nfs-ganesha rpms or, if using the sources, rename the “/root/nfs-ganesha/src/FSAL/FSAL_GLUSTER/README” file to “nfs-ganesha.conf”. - -#####step 3 : - -Now include the “export.conf” file in nfs-ganesha.conf. This can be done by adding the line below at the end of nfs-ganesha.conf. - - %include "export.conf" - -#####step 4 : - - - run ganesha server as mentioned in section 3 - - To check if the volume is exported, run - - *#showmount -e localhost* - -### 5.) Additional Notes - -To switch back to gluster-nfs/kernel-nfs, kill the ganesha daemon and start those services using the below commands : - - - pkill ganesha - - service nfs start (for kernel-nfs) - - gluster v set <volname> nfs.disable off - - -### 6.)
References - - - Setup and create glusterfs volumes : -http://www.gluster.org/community/documentation/index.php/QuickStart - - - NFS-Ganesha wiki : https://github.com/nfs-ganesha/nfs-ganesha/wiki - - - Sample configuration files - - /root/nfs-ganesha/src/config_samples/gluster.conf - - https://github.com/nfs-ganesha/nfs-ganesha/blob/master/src/config_samples/gluster.conf - - - https://forge.gluster.org/nfs-ganesha-and-glusterfs-integration/pages/Home - - - http://blog.gluster.org/2014/09/glusterfs-and-nfs-ganesha-integration/ - diff --git a/doc/features/heal-info-and-split-brain-resolution.md b/doc/features/heal-info-and-split-brain-resolution.md deleted file mode 100644 index 7a6691db14e..00000000000 --- a/doc/features/heal-info-and-split-brain-resolution.md +++ /dev/null @@ -1,459 +0,0 @@ -The following document explains the usage of volume heal info and split-brain -resolution commands. - -##`gluster volume heal <VOLNAME> info [split-brain]` commands -###volume heal info -Usage: `gluster volume heal <VOLNAME> info` - -This lists all the files that need healing (either their path or -GFID is printed). -###Interpreting the output -All the files that are listed in the output of this command need healing to be -done. Apart from this, there are 2 special cases that may be associated with -an entry - -a) Is in split-brain - A file in data/metadata split-brain will -be listed with " - Is in split-brain" appended after its path/gfid. Eg., -"/file4" in the output provided below. But for a gfid split-brain, - the parent directory of the file is shown to be in split-brain and the file -itself is shown to be needing heal. Eg., "/dir" in the output provided below -which is in split-brain because of gfid split-brain of file "/dir/a". -b) Is possibly undergoing heal - A file is said to be possibly undergoing - heal because it is possible that the file was undergoing heal when heal status -was being determined but it cannot be said for sure. It could so have happened
It could so have happened -that self-heal daemon and glfsheal process that is trying to get heal information -are competing for the same lock leading to such conclusion. Another possible case - could be multiple glfsheal processes running simultaneously (e.g., multiple users - ran heal info command at the same time), competing for same lock. - -The following is an example of heal info command's output. -###Example -Consider a replica volume "test" with 2 bricks b1 and b2; -self-heal daemon off, mounted at /mnt. - -`gluster volume heal test info` -~~~ -Brick \<hostname:brickpath-b1> -<gfid:aaca219f-0e25-4576-8689-3bfd93ca70c2> - Is in split-brain -<gfid:39f301ae-4038-48c2-a889-7dac143e82dd> - Is in split-brain -<gfid:c3c94de2-232d-4083-b534-5da17fc476ac> - Is in split-brain -<gfid:6dc78b20-7eb6-49a3-8edb-087b90142246> - -Number of entries: 4 - -Brick <hostname:brickpath-b2> -/dir/file2 -/dir/file1 - Is in split-brain -/dir - Is in split-brain -/dir/file3 -/file4 - Is in split-brain -/dir/a - - -Number of entries: 6 -~~~ - -###Analysis of the output -It can be seen that -A) from brick b1 4 entries need healing: - 1) file with gfid:6dc78b20-7eb6-49a3-8edb-087b90142246 needs healing - 2) "aaca219f-0e25-4576-8689-3bfd93ca70c2", -"39f301ae-4038-48c2-a889-7dac143e82dd" and "c3c94de2-232d-4083-b534-5da17fc476ac" - are in split-brain - -B) from brick b2 6 entries need healing- - 1) "a", "file2" and "file3" need healing - 2) "file1", "file4" & "/dir" are in split-brain - -###volume heal info split-brain -Usage: `gluster volume heal <VOLNAME> info split-brain` -This command shows all the files that are in split-brain. 
-##Example -`gluster volume heal test info split-brain` -~~~ -Brick <hostname:brickpath-b1> -<gfid:aaca219f-0e25-4576-8689-3bfd93ca70c2> -<gfid:39f301ae-4038-48c2-a889-7dac143e82dd> -<gfid:c3c94de2-232d-4083-b534-5da17fc476ac> -Number of entries in split-brain: 3 - -Brick <hostname:brickpath-b2> -/dir/file1 -/dir -/file4 -Number of entries in split-brain: 3 -~~~ -Note that, similar to heal info command, for gfid split-brains (same filename but different gfid) -their parent directories are listed to be in split-brain. - -##Resolution of split-brain using CLI -Once the files in split-brain are identified, their resolution can be done -from the command line. Note that entry/gfid split-brain resolution is not supported. -Split-brain resolution commands let the user resolve split-brain in 3 ways. -###Select the bigger-file as source -This command is useful for per file healing where it is known/decided that the -file with bigger size is to be considered as source. -1.`gluster volume heal <VOLNAME> split-brain bigger-file <FILE>` -`<FILE>` can be either the full file name as seen from the root of the volume -(or) the gfid-string representation of the file, which sometimes gets displayed -in the heal info command's output. -Once this command is executed, the replica containing the FILE with bigger -size is found out and heal is completed with it as source. - -###Example : -Consider the above output of heal info split-brain command. 
- -Before healing the file, notice file size and md5 checksums : -~~~ -On brick b1: -# stat b1/dir/file1 - File: ‘b1/dir/file1’ - Size: 17 Blocks: 16 IO Block: 4096 regular file -Device: fd03h/64771d Inode: 919362 Links: 2 -Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) -Access: 2015-03-06 13:55:40.149897333 +0530 -Modify: 2015-03-06 13:55:37.206880347 +0530 -Change: 2015-03-06 13:55:37.206880347 +0530 - Birth: - - -# md5sum b1/dir/file1 -040751929ceabf77c3c0b3b662f341a8 b1/dir/file1 - -On brick b2: -# stat b2/dir/file1 - File: ‘b2/dir/file1’ - Size: 13 Blocks: 16 IO Block: 4096 regular file -Device: fd03h/64771d Inode: 919365 Links: 2 -Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) -Access: 2015-03-06 13:54:22.974451898 +0530 -Modify: 2015-03-06 13:52:22.910758923 +0530 -Change: 2015-03-06 13:52:22.910758923 +0530 - Birth: - -# md5sum b2/dir/file1 -cb11635a45d45668a403145059c2a0d5 b2/dir/file1 -~~~ -Healing file1 using the above command - -`gluster volume heal test split-brain bigger-file /dir/file1` -Healed /dir/file1. - -After healing is complete, the md5sum and file size on both bricks should be the same. 
-~~~ -On brick b1: -# stat b1/dir/file1 - File: ‘b1/dir/file1’ - Size: 17 Blocks: 16 IO Block: 4096 regular file -Device: fd03h/64771d Inode: 919362 Links: 2 -Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) -Access: 2015-03-06 14:17:27.752429505 +0530 -Modify: 2015-03-06 13:55:37.206880347 +0530 -Change: 2015-03-06 14:17:12.880343950 +0530 - Birth: - -# md5sum b1/dir/file1 -040751929ceabf77c3c0b3b662f341a8 b1/dir/file1 - -On brick b2: -# stat b2/dir/file1 - File: ‘b2/dir/file1’ - Size: 17 Blocks: 16 IO Block: 4096 regular file -Device: fd03h/64771d Inode: 919365 Links: 2 -Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) -Access: 2015-03-06 14:17:23.249403600 +0530 -Modify: 2015-03-06 13:55:37.206880000 +0530 -Change: 2015-03-06 14:17:12.881343955 +0530 - Birth: - - -# md5sum b2/dir/file1 -040751929ceabf77c3c0b3b662f341a8 b2/dir/file1 -~~~ -###Select one replica as source for a particular file -2.`gluster volume heal <VOLNAME> split-brain source-brick <HOSTNAME:BRICKNAME> <FILE>` -`<HOSTNAME:BRICKNAME>` is selected as source brick, -FILE present in the source brick is taken as source for healing. - -###Example : -Notice the md5 checksums and file size before and after heal. 
- -Before heal : -~~~ -On brick b1: - - stat b1/file4 - File: ‘b1/file4’ - Size: 4 Blocks: 16 IO Block: 4096 regular file -Device: fd03h/64771d Inode: 919356 Links: 2 -Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) -Access: 2015-03-06 13:53:19.417085062 +0530 -Modify: 2015-03-06 13:53:19.426085114 +0530 -Change: 2015-03-06 13:53:19.426085114 +0530 - Birth: - -# md5sum b1/file4 -b6273b589df2dfdbd8fe35b1011e3183 b1/file4 - -On brick b2: - -# stat b2/file4 - File: ‘b2/file4’ - Size: 4 Blocks: 16 IO Block: 4096 regular file -Device: fd03h/64771d Inode: 919358 Links: 2 -Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) -Access: 2015-03-06 13:52:35.761833096 +0530 -Modify: 2015-03-06 13:52:35.769833142 +0530 -Change: 2015-03-06 13:52:35.769833142 +0530 - Birth: - -# md5sum b2/file4 -0bee89b07a248e27c83fc3d5951213c1 b2/file4 -~~~ -`gluster volume heal test split-brain source-brick test-host:/test/b1 gfid:c3c94de2-232d-4083-b534-5da17fc476ac` -Healed gfid:c3c94de2-232d-4083-b534-5da17fc476ac. - -After healing : -~~~ -On brick b1: -# stat b1/file4 - File: ‘b1/file4’ - Size: 4 Blocks: 16 IO Block: 4096 regular file -Device: fd03h/64771d Inode: 919356 Links: 2 -Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) -Access: 2015-03-06 14:23:38.944609863 +0530 -Modify: 2015-03-06 13:53:19.426085114 +0530 -Change: 2015-03-06 14:27:15.058927962 +0530 - Birth: - -# md5sum b1/file4 -b6273b589df2dfdbd8fe35b1011e3183 b1/file4 - -On brick b2: -# stat b2/file4 - File: ‘b2/file4’ - Size: 4 Blocks: 16 IO Block: 4096 regular file -Device: fd03h/64771d Inode: 919358 Links: 2 -Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) -Access: 2015-03-06 14:23:38.944609000 +0530 -Modify: 2015-03-06 13:53:19.426085000 +0530 -Change: 2015-03-06 14:27:15.059927968 +0530 - Birth: - -# md5sum b2/file4 -b6273b589df2dfdbd8fe35b1011e3183 b2/file4 -~~~ -Note that, as mentioned earlier, entry split-brain and gfid split-brain healing - are not supported using CLI. 
However, they can be fixed using the method described - [here](https://github.com/gluster/glusterfs/blob/master/doc/debugging/split-brain.md). -###Example: -Trying to heal /dir would fail as it is in entry split-brain. -`gluster volume heal test split-brain source-brick test-host:/test/b1 /dir` -Healing /dir failed:Operation not permitted. -Volume heal failed. - -3.`gluster volume heal <VOLNAME> split-brain source-brick <HOSTNAME:BRICKNAME>` -Consider a scenario where many files are in split-brain such that one brick of -replica pair is source. As the result of the above command all split-brained -files in `<HOSTNAME:BRICKNAME>` are selected as source and healed to the sink. - -###Example: -Consider a volume having three entries "a, b and c" in split-brain. -~~~ -`gluster volume heal test split-brain source-brick test-host:/test/b1` -Healed gfid:944b4764-c253-4f02-b35f-0d0ae2f86c0f. -Healed gfid:3256d814-961c-4e6e-8df2-3a3143269ced. -Healed gfid:b23dd8de-af03-4006-a803-96d8bc0df004. -Number of healed entries: 3 -~~~ - -## An overview of working of heal info commands -When these commands are invoked, a "glfsheal" process is spawned which reads -the entries from `/<brick-path>/.glusterfs/indices/xattrop/` directory of all -the bricks that are up (that it can connect to) one after another. These -entries are GFIDs of files that might need healing. Once GFID entries from a -brick are obtained, based on the lookup response of this file on each -participating brick of replica-pair & trusted.afr.* extended attributes it is -found out if the file needs healing, is in split-brain etc based on the -requirement of each command and displayed to the user. - - -##Resolution of split-brain from the mount point -A set of getfattr and setfattr commands have been provided to detect the data and metadata split-brain status of a file and resolve split-brain, if any, from mount point. - -Consider a volume "test", having bricks b0, b1, b2 and b3. 
- -~~~ -# gluster volume info test - -Volume Name: test -Type: Distributed-Replicate -Volume ID: 00161935-de9e-4b80-a643-b36693183b61 -Status: Started -Number of Bricks: 2 x 2 = 4 -Transport-type: tcp -Bricks: -Brick1: test-host:/test/b0 -Brick2: test-host:/test/b1 -Brick3: test-host:/test/b2 -Brick4: test-host:/test/b3 -~~~ - -Directory structure of the bricks is as follows: - -~~~ -# tree -R /test/b? -/test/b0 -├── dir -│ └── a -└── file100 - -/test/b1 -├── dir -│ └── a -└── file100 - -/test/b2 -├── dir -├── file1 -├── file2 -└── file99 - -/test/b3 -├── dir -├── file1 -├── file2 -└── file99 -~~~ - -Some files in the volume are in split-brain. -~~~ -# gluster v heal test info split-brain -Brick test-host:/test/b0/ -/file100 -/dir -Number of entries in split-brain: 2 - -Brick test-host:/test/b1/ -/file100 -/dir -Number of entries in split-brain: 2 - -Brick test-host:/test/b2/ -/file99 -<gfid:5399a8d1-aee9-4653-bb7f-606df02b3696> -Number of entries in split-brain: 2 - -Brick test-host:/test/b3/ -<gfid:05c4b283-af58-48ed-999e-4d706c7b97d5> -<gfid:5399a8d1-aee9-4653-bb7f-606df02b3696> -Number of entries in split-brain: 2 -~~~ -###To know data/metadata split-brain status of a file: -~~~ -getfattr -n replica.split-brain-status <path-to-file> -~~~ -The above command executed from mount provides information if a file is in data/metadata split-brain. Also provides the list of afr children to analyze to get more information about the file. -This command is not applicable to gfid/directory split-brain. - -###Example: -1) "file100" is in metadata split-brain. Executing the above mentioned command for file100 gives : -~~~ -# getfattr -n replica.split-brain-status file100 -# file: file100 -replica.split-brain-status="data-split-brain:no metadata-split-brain:yes Choices:test-client-0,test-client-1" -~~~ - -2) "file1" is in data split-brain. 
-~~~ -# getfattr -n replica.split-brain-status file1 -# file: file1 -replica.split-brain-status="data-split-brain:yes metadata-split-brain:no Choices:test-client-2,test-client-3" -~~~ - -3) "file99" is in both data and metadata split-brain. -~~~ -# getfattr -n replica.split-brain-status file99 -# file: file99 -replica.split-brain-status="data-split-brain:yes metadata-split-brain:yes Choices:test-client-2,test-client-3" -~~~ - -4) "dir" is in directory split-brain but as mentioned earlier, the above command is not applicable to such split-brain. So it says that the file is not under data or metadata split-brain. -~~~ -# getfattr -n replica.split-brain-status dir -# file: dir -replica.split-brain-status="The file is not under data or metadata split-brain" -~~~ - -5) "file2" is not in any kind of split-brain. -~~~ -# getfattr -n replica.split-brain-status file2 -# file: file2 -replica.split-brain-status="The file is not under data or metadata split-brain" -~~~ - -### To analyze the files in data and metadata split-brain -Trying to do operations (say cat, getfattr etc) from the mount on files in split-brain, gives an input/output error. To enable the users analyze such files, a setfattr command is provided. - -~~~ -# setfattr -n replica.split-brain-choice -v "choiceX" <path-to-file> -~~~ -Using this command, a particular brick can be chosen to access the file in split-brain from. - -###Example: -1) "file1" is in data-split-brain. Trying to read from the file gives input/output error. -~~~ -# cat file1 -cat: file1: Input/output error -~~~ -Split-brain choices provided for file1 were test-client-2 and test-client-3. - -Setting test-client-2 as split-brain choice for file1 serves reads from b2 for the file. -~~~ -# setfattr -n replica.split-brain-choice -v test-client-2 file1 -~~~ -Now, read operations on the file can be done. -~~~ -# cat file1 -xyz -~~~ -Similarly, to inspect the file from other choice, replica.split-brain-choice is to be set to test-client-3. 
- -Trying to inspect the file from a wrong choice errors out. - -To undo the split-brain-choice that has been set, the above mentioned setfattr command can be used -with "none" as the value for extended attribute. - -###Example: -~~~ -1) setfattr -n replica.split-brain-choice -v none file1 -~~~ -Now performing cat operation on the file will again result in input/output error, as before. -~~~ -# cat file1 -cat: file1: Input/output error -~~~ - -The user can access each file for a timeout amount of period every time replica.split-brain-choice is set. This timeout is configurable by user, with a default value of 5 minutes. -### To set split-brain-choice timeout -A setfattr command from the mount allows the user set this timeout, to be specified in minutes. -~~~ -# setfattr -n replica.split-brain-choice-timeout -v <timeout-in-minutes> <mount_point/file> -~~~ -This is a global timeout, i.e. applicable to all files as long as the mount exists. So, the timeout need not be set each time a file needs to be inspected but for a new mount it will have to be set again for the first time. This option also needs to be set every time there is a client graph switch (_See note #3_). - -### Resolving the split-brain -Once the choice for resolving split-brain is made, source brick is supposed to be set for the healing to be done. -This is done using the following command: - -~~~ -# setfattr -n replica.split-brain-heal-finalize -v <heal-choice> <path-to-file> -~~~ - -##Example -~~~ -# setfattr -n replica.split-brain-heal-finalize -v test-client-2 file1 -~~~ -The above process can be used to resolve data and/or metadata split-brain on all the files. - -NOTE: -1) If "fopen-keep-cache" fuse mount option is disabled then inode needs to be invalidated each time before selecting a new replica.split-brain-choice to inspect a file.
This can be done by using: -~~~ -# setfattr -n inode-invalidate -v 0 <path-to-file> -~~~ - -2) The above mentioned process for split-brain resolution from mount will not work on nfs mounts as it doesn't provide xattrs support. - -3) Client graph switch occurs when there is a change in the client side translator graph; typically during addition of new translators to the graph on client side and add-brick/remove-brick operations. diff --git a/doc/features/libgfapi.md b/doc/features/libgfapi.md deleted file mode 100644 index dfc8cfe6527..00000000000 --- a/doc/features/libgfapi.md +++ /dev/null @@ -1,381 +0,0 @@ -One of the known methods to access glusterfs is via fuse module. However, it has some overhead or performance issues because of the number of context switches which need to be performed to complete one i/o transaction[1]. - - -To over come this limitation, a new method called ‘libgfapi’ is introduced. libgfapi support is available from GlusterFS-3.4 release. - -libgfapi is a userspace library for accessing data in glusterfs. libgfapi library perform IO on gluster volumes directly without FUSE mount. It is a filesystem like api and runs/sits in application process context. libgfapi eliminates the fuse and the kernel vfs layer from the glusterfs volume access. The speed and latency have improved with libgfapi access. [1] - - -Using libgfapi, various user-space filesystems (like NFS-Ganesha or Samba) or the virtualizer (like QEMU) can interact with GlusterFS which serves as back-end filesystem. Currently below projects integrate with glusterfs using libgfapi interfaces. - - -* qemu storage layer -* Samba VFS plugin -* NFS-Ganesha - -All the APIs in libgfapi make use of `struct glfs` object. This object -contains information about volume name, glusterfs context associated, -subvols in the graph etc which makes it unique for each volume.
- - -For any application to make use of libgfapi, it should typically start -with the below APIs in the following order - - -* To create a new glfs object : - - glfs_t *glfs_new (const char *volname) ; - - glfs_new() returns glfs_t object. - - -* On this newly created glfs_t, you need to be either set a volfile path - (glfs_set_volfile) or a volfile server (glfs_set_volfile_server). - Incase of failures, the corresponding cleanup routine is - "glfs_unset_volfile_server" - - int glfs_set_volfile (glfs_t *fs, const char *volfile); - - int glfs_set_volfile_server (glfs_t *fs, const char *transport,const char *host, int port) ; - - int glfs_unset_volfile_server (glfs_t *fs, const char *transport,const char *host, int port) ; - -* Specify logging parameters using glfs_set_logging(): - - int glfs_set_logging (glfs_t *fs, const char *logfile, int loglevel) ; - -* Initializes the glfs_t object using glfs_init() - int glfs_init (glfs_t *fs) ; - -#### FOPs APIs available with libgfapi : - - - - int glfs_get_volumeid (struct glfs *fs, char *volid, size_t size); - - int glfs_setfsuid (uid_t fsuid) ; - - int glfs_setfsgid (gid_t fsgid) ; - - int glfs_setfsgroups (size_t size, const gid_t *list) ; - - glfs_fd_t *glfs_open (glfs_t *fs, const char *path, int flags) ; - - glfs_fd_t *glfs_creat (glfs_t *fs, const char *path, int flags,mode_t mode) ; - - int glfs_close (glfs_fd_t *fd) ; - - glfs_t *glfs_from_glfd (glfs_fd_t *fd) ; - - int glfs_set_xlator_option (glfs_t *fs, const char *xlator, const char *key,const char *value) ; - - typedef void (*glfs_io_cbk) (glfs_fd_t *fd, ssize_t ret, void *data); - - ssize_t glfs_read (glfs_fd_t *fd, void *buf,size_t count, int flags) ; - - ssize_t glfs_write (glfs_fd_t *fd, const void *buf,size_t count, int flags) ; - - int glfs_read_async (glfs_fd_t *fd, void *buf, size_t count, int flags, glfs_io_cbk fn, void *data) ; - - int glfs_write_async (glfs_fd_t *fd, const void *buf, size_t count, int flags, glfs_io_cbk fn, void *data) ; - - ssize_t 
glfs_readv (glfs_fd_t *fd, const struct iovec *iov, int iovcnt,int flags) ; - - ssize_t glfs_writev (glfs_fd_t *fd, const struct iovec *iov, int iovcnt,int flags) ; - - int glfs_readv_async (glfs_fd_t *fd, const struct iovec *iov, int count, int flags, glfs_io_cbk fn, void *data) ; - - int glfs_writev_async (glfs_fd_t *fd, const struct iovec *iov, int count, int flags, glfs_io_cbk fn, void *data) ; - - ssize_t glfs_pread (glfs_fd_t *fd, void *buf, size_t count, off_t offset,int flags) ; - - ssize_t glfs_pwrite (glfs_fd_t *fd, const void *buf, size_t count, off_t offset, int flags) ; - - int glfs_pread_async (glfs_fd_t *fd, void *buf, size_t count, off_t offset,int flags, glfs_io_cbk fn, void *data) ; - - int glfs_pwrite_async (glfs_fd_t *fd, const void *buf, int count, off_t offset,int flags, glfs_io_cbk fn, void *data) ; - - ssize_t glfs_preadv (glfs_fd_t *fd, const struct iovec *iov, int iovcnt, int count, off_t offset, int flags,glfs_io_cbk fn, void *data) ; - - ssize_t glfs_pwritev (glfs_fd_t *fd, const struct iovec *iov, int iovcnt,int count, off_t offset, int flags, glfs_io_cbk fn, void *data) ; - - int glfs_preadv_async (glfs_fd_t *fd, const struct iovec *iov, glfs_io_cbk fn, void *data) ; - - int glfs_pwritev_async (glfs_fd_t *fd, const struct iovec *iov, glfs_io_cbk fn, void *data) ; - - off_t glfs_lseek (glfs_fd_t *fd, off_t offset, int whence) ; - - int glfs_truncate (glfs_t *fs, const char *path, off_t length) ; - - int glfs_ftruncate (glfs_fd_t *fd, off_t length) ; - - int glfs_ftruncate_async (glfs_fd_t *fd, off_t length, glfs_io_cbk fn,void *data) ; - - int glfs_lstat (glfs_t *fs, const char *path, struct stat *buf) ; - - int glfs_stat (glfs_t *fs, const char *path, struct stat *buf) ; - - int glfs_fstat (glfs_fd_t *fd, struct stat *buf) ; - - int glfs_fsync (glfs_fd_t *fd) ; - - int glfs_fsync_async (glfs_fd_t *fd, glfs_io_cbk fn, void *data) ; - - int glfs_fdatasync (glfs_fd_t *fd) ; - - int glfs_fdatasync_async (glfs_fd_t *fd, glfs_io_cbk fn, void 
*data) ; - - int glfs_access (glfs_t *fs, const char *path, int mode) ; - - int glfs_symlink (glfs_t *fs, const char *oldpath, const char *newpath) ; - - int glfs_readlink (glfs_t *fs, const char *path,char *buf, size_t bufsiz) ; - - int glfs_mknod (glfs_t *fs, const char *path, mode_t mode, dev_t dev) ; - - int glfs_mkdir (glfs_t *fs, const char *path, mode_t mode) ; - - int glfs_unlink (glfs_t *fs, const char *path) ; - - int glfs_rmdir (glfs_t *fs, const char *path) ; - - int glfs_rename (glfs_t *fs, const char *oldpath, const char *newpath) ; - - int glfs_link (glfs_t *fs, const char *oldpath, const char *newpath) ; - - glfs_fd_t *glfs_opendir (glfs_t *fs, const char *path) ; - - int glfs_readdir_r (glfs_fd_t *fd, struct dirent *dirent,struct dirent **result) ; - - int glfs_readdirplus_r (glfs_fd_t *fd, struct stat *stat, struct dirent *dirent, struct dirent **result) ; - - struct dirent *glfs_readdir (glfs_fd_t *fd) ; - - struct dirent *glfs_readdirplus (glfs_fd_t *fd, struct stat *stat) ; - - long glfs_telldir (glfs_fd_t *fd) ; - - void glfs_seekdir (glfs_fd_t *fd, long offset) ; - - int glfs_closedir (glfs_fd_t *fd) ; - - int glfs_statvfs (glfs_t *fs, const char *path, struct statvfs *buf) ; - - int glfs_chmod (glfs_t *fs, const char *path, mode_t mode) ; - - int glfs_fchmod (glfs_fd_t *fd, mode_t mode) ; - - int glfs_chown (glfs_t *fs, const char *path, uid_t uid, gid_t gid) ; - - int glfs_lchown (glfs_t *fs, const char *path, uid_t uid, gid_t gid) ; - - int glfs_fchown (glfs_fd_t *fd, uid_t uid, gid_t gid) ; - - int glfs_utimens (glfs_t *fs, const char *path,struct timespec times[2]) ; - - int glfs_lutimens (glfs_t *fs, const char *path,struct timespec times[2]) ; - - int glfs_futimens (glfs_fd_t *fd, struct timespec times[2]) ; - - ssize_t glfs_getxattr (glfs_t *fs, const char *path, const char *name,void *value, size_t size) ; - - ssize_t glfs_lgetxattr (glfs_t *fs, const char *path, const char *name,void *value, size_t size) ; - - ssize_t glfs_fgetxattr 
(glfs_fd_t *fd, const char *name,void *value, size_t size) ; - - ssize_t glfs_listxattr (glfs_t *fs, const char *path,void *value, size_t size) ; - - ssize_t glfs_llistxattr (glfs_t *fs, const char *path, void *value,size_t size) ; - - ssize_t glfs_flistxattr (glfs_fd_t *fd, void *value, size_t size) ; - - int glfs_setxattr (glfs_t *fs, const char *path, const char *name,const void *value, size_t size, int flags) ; - - int glfs_lsetxattr (glfs_t *fs, const char *path, const char *name,const void *value, size_t size, int flags) ; - - int glfs_fsetxattr (glfs_fd_t *fd, const char *name,const void *value, size_t size, int flags) ; - - int glfs_removexattr (glfs_t *fs, const char *path, const char *name) ; - - int glfs_lremovexattr (glfs_t *fs, const char *path, const char *name) ; - - int glfs_fremovexattr (glfs_fd_t *fd, const char *name) ; - - int glfs_fallocate(glfs_fd_t *fd, int keep_size, off_t offset, size_t len) ; - - int glfs_discard(glfs_fd_t *fd, off_t offset, size_t len) ; - - int glfs_discard_async (glfs_fd_t *fd, off_t length, size_t lent, glfs_io_cbk fn, void *data) ; - - int glfs_zerofill(glfs_fd_t *fd, off_t offset, off_t len) ; - - int glfs_zerofill_async (glfs_fd_t *fd, off_t length, off_t len, glfs_io_cbk fn, void *data) ; - - char *glfs_getcwd (glfs_t *fs, char *buf, size_t size) ; - - int glfs_chdir (glfs_t *fs, const char *path) ; - - int glfs_fchdir (glfs_fd_t *fd) ; - - char *glfs_realpath (glfs_t *fs, const char *path, char *resolved_path) ; - - int glfs_posix_lock (glfs_fd_t *fd, int cmd, struct flock *flock) ; - - glfs_fd_t *glfs_dup (glfs_fd_t *fd) ; - - - struct glfs_object *glfs_h_lookupat (struct glfs *fs,struct glfs_object *parent, - const char *path, - struct stat *stat) ; - - struct glfs_object *glfs_h_creat (struct glfs *fs, struct glfs_object *parent, - const char *path, int flags, mode_t mode, - struct stat *sb) ; - - struct glfs_object *glfs_h_mkdir (struct glfs *fs, struct glfs_object *parent, - const char *path, mode_t flags, - 
struct stat *sb) ; - - struct glfs_object *glfs_h_mknod (struct glfs *fs, struct glfs_object *parent, - const char *path, mode_t mode, dev_t dev, - struct stat *sb) ; - - struct glfs_object *glfs_h_symlink (struct glfs *fs, struct glfs_object *parent, - const char *name, const char *data, - struct stat *stat) ; - - - int glfs_h_unlink (struct glfs *fs, struct glfs_object *parent, - const char *path) ; - - int glfs_h_close (struct glfs_object *object) ; - - int glfs_caller_specific_init (void *uid_caller_key, void *gid_caller_key, - void *future) ; - - int glfs_h_truncate (struct glfs *fs, struct glfs_object *object, - off_t offset) ; - - int glfs_h_stat(struct glfs *fs, struct glfs_object *object, - struct stat *stat) ; - - int glfs_h_getattrs (struct glfs *fs, struct glfs_object *object, - struct stat *stat) ; - - int glfs_h_getxattrs (struct glfs *fs, struct glfs_object *object, - const char *name, void *value, - size_t size) ; - - int glfs_h_setattrs (struct glfs *fs, struct glfs_object *object, - struct stat *sb, int valid) ; - - int glfs_h_setxattrs (struct glfs *fs, struct glfs_object *object, - const char *name, const void *value, - size_t size, int flags) ; - - int glfs_h_readlink (struct glfs *fs, struct glfs_object *object, char *buf, - size_t bufsiz) ; - - int glfs_h_link (struct glfs *fs, struct glfs_object *linktgt, - struct glfs_object *parent, const char *name) ; - - int glfs_h_rename (struct glfs *fs, struct glfs_object *olddir, - const char *oldname, struct glfs_object *newdir, - const char *newname) ; - - int glfs_h_removexattrs (struct glfs *fs, struct glfs_object *object, - const char *name) ; - - ssize_t glfs_h_extract_handle (struct glfs_object *object, - unsigned char *handle, int len) ; - - struct glfs_object *glfs_h_create_from_handle (struct glfs *fs, - unsigned char *handle, int len, - struct stat *stat) ; - - - struct glfs_fd *glfs_h_opendir (struct glfs *fs, - struct glfs_object *object) ; - - struct glfs_fd *glfs_h_open (struct glfs 
*fs, struct glfs_object *object, - int flags) ; - -For more details on these apis please refer glfs.h and glfs-handles.h in the source tree (api/src/) of glusterfs: - -* Incase of failures or to close the connection and destroy glfs_t -object, use glfs_fini. - - int glfs_fini (glfs_t *fs) ; - - -All the fileops are typically divided into below categories - -* a) Handle based Operations - - -These APIs create/make use of a glfs_object (referred as handles) unique -to each file within a volume. -The structure glfs_object contains inode pointer and gfid. - -For example: Since NFS protocol uses file handles to access files, these APIs are -mainly used by NFS-Ganesha server. - -Eg: - - struct glfs_object *glfs_h_lookupat (struct glfs *fs, - struct glfs_object *parent, - const char *path, - struct stat *stat); - - struct glfs_object *glfs_h_creat (struct glfs *fs, - struct glfs_object *parent, - const char *path, - int flags, mode_t mode, - struct stat *sb); - - struct glfs_object *glfs_h_mkdir (struct glfs *fs, - struct glfs_object *parent, - const char *path, mode_t flags, - struct stat *sb); - - - -* b) File path/descriptor based Operations - - -These APIs make use of file path/descriptor to determine the file on -which it needs to operate on. - -For example: Samba uses these APIs for file operations. - -Examples of the APIs using file path - - - int glfs_chdir (glfs_t *fs, const char *path) ; - - char *glfs_realpath (glfs_t *fs, const char *path, char *resolved_path) ; - -Once the file is opened, the file-descriptor generated is used for -further operations. 
- -Eg: - - int glfs_posix_lock (glfs_fd_t *fd, int cmd, struct flock *flock) ; - glfs_fd_t *glfs_dup (glfs_fd_t *fd) ; - - - -#### libgfapi bindings : - -libgfapi bindings are available for below languages: - - - Go - - Java - - python [2] - - Ruby - -For more details on these bindings,please refer : - - #http://www.gluster.org/community/documentation/index.php/Language_Bindings - -References: - -[1] http://humblec.com/libgfapi-interface-glusterfs/ -[2] http://www.gluster.org/2014/04/play-with-libgfapi-and-its-python-bindings/ - diff --git a/doc/features/mount_gluster_volume_using_pnfs.md b/doc/features/mount_gluster_volume_using_pnfs.md deleted file mode 100644 index 403f0c80e81..00000000000 --- a/doc/features/mount_gluster_volume_using_pnfs.md +++ /dev/null @@ -1,56 +0,0 @@ -# How to export gluster volumes using pNFS? - -The Parallel Network File System (pNFS) is part of the NFS v4.1 protocol that -allows compute clients to access storage devices directly and in parallel. -The pNFS cluster consists of MDS(Meta-Data-Server) and DS (Data-Server). -The client sends all the read/write requests directly to DS and all other -operations are handle by the MDS. pNFS support is implemented as part of -glusterFS+NFS-ganesha integration. - -### 1.) Pre-requisites - - - Create a GlusterFS volume - - - Install nfs-ganesha (refer section 5) - - - Disable kernel-nfs, gluster-nfs services on the system using the following commands - - service nfs stop - - gluster vol set <volname> nfs.disable ON (Note: this command has to be repeated for all the volumes in the trusted-pool) - -### 2.) 
Configure nfs-ganesha for pNFS - - - Disable nfs-ganesha and tear down HA cluster via gluster cli (pNFS did not need to disturb HA setup) - - gluster features.ganesha disable - - - For the optimal working of pNFS, ganesha servers should run on every node in the trusted pool manually(refer section 5) - - *#ganesha.nfsd -f <location_of_nfs-ganesha.conf_file> -L <location_of_log_file> -N <log_level> -d* - - - Check whether volume is exported via nfs-ganesha in all the nodes. - - *#showmount -e localhost* - -### 3.) Mount volume via pNFS - -Mount the volume using any nfs-ganesha server in the trusted pool.By default, nfs version 4.1 will use pNFS protocol for gluster volumes - - *#mount -t nfs4 -o minorversion=1 <ip of server>:/<volume name> <mount path>* - -### 4.) Points to be noted - - - Current architecture supports only single MDS and mulitple DS. The server with which client mounts will act as MDS and all severs including MDS can act as DS. - - - If any of the DS goes down , then MDS will handle those I/O's. - - - Hereafter, all the subsequent nfs clients need to use same server for mounting that volume via pNFS. i.e more than one MDS for a volume is not prefered - - - pNFS support is only tested with distributed, replicated or distribute-replicate volumes - - - It is tested and verfied with RHEL 6.5 , fedora 20, fedora 21 nfs clients. It is always better to use latest nfs-clients - -### 5.) 
References - - - Setup and create glusterfs volumes : http://www.gluster.org/community/documentation/index.php/QuickStart - - - NFS-Ganesha wiki : https://github.com/nfs-ganesha/nfs-ganesha/wiki - - - For installing, running NFS-Ganesha and exporting a volume : - - read doc/features/glusterfs_nfs-ganesha_integration.md - - http://blog.gluster.org/2014/09/glusterfs-and-nfs-ganesha-integration/ diff --git a/doc/features/nufa.md b/doc/features/nufa.md deleted file mode 100644 index 03b8194b4c0..00000000000 --- a/doc/features/nufa.md +++ /dev/null @@ -1,20 +0,0 @@ -# NUFA Translator - -The NUFA ("Non Uniform File Access") is a variant of the DHT ("Distributed Hash -Table") translator, intended for use with workloads that have a high locality -of reference. Instead of placing new files pseudo-randomly, it places them on -the same nodes where they are created so that future accesses can be made -locally. For replicated volumes, this means that one copy will be local and -others will be remote; the read-replica selection mechanisms will then favor -the local copy for reads. For non-replicated volumes, the only copy will be -local. - -## Interface - -Use of NUFA is controlled by a volume option, as follows. - - gluster volume set myvolume cluster.nufa on - -This will cause the NUFA translator to be used wherever the DHT translator -otherwise would be. The rest is all automatic. - diff --git a/doc/features/ovirt-integration.md b/doc/features/ovirt-integration.md deleted file mode 100644 index 46dbeabbbaa..00000000000 --- a/doc/features/ovirt-integration.md +++ /dev/null @@ -1,106 +0,0 @@ -##Ovirt Integration with glusterfs - -oVirt is an opensource virtualization management platform. You can use oVirt to manage -hardware nodes, storage and network resources, and to deploy and monitor virtual machines -running in your data center. 
oVirt serves as the bedrock for Red Hat''s Enterprise Virtualization product, -and is the "upstream" project where new features are developed in advance of their inclusion -in that supported product offering. - -To know more about ovirt please visit http://www.ovirt.org/ and to configure -#http://www.ovirt.org/Quick_Start_Guide#Install_oVirt_Engine_.28Fedora.29%60 - -For the installation step of ovirt, please refer -#http://www.ovirt.org/Quick_Start_Guide#Install_oVirt_Engine_.28Fedora.29%60 - -When oVirt integrated with gluster, glusterfs can be used in below forms: - -* As a storage domain to host VM disks. - -There are mainly two ways to exploit glusterfs as a storage domain. - - POSIXFS_DOMAIN ( >=oVirt 3.1 ) - - GLUSTERFS_DOMAIN ( >=oVirt 3.3) - -The former one has performance overhead and is not an ideal way to consume images hosted in glusterfs volumes. -When used by this method, qemu uses glusterfs `mount point` to access VM images and invite FUSE overhead. -The libvirt treats this as a file type disk in its xml schema. - -The latter is the recommended way of using glusterfs with ovirt as a storage domain. This provides better -and efficient way to access images hosted under glusterfs volumes.When qemu accessing glusterfs volume using this method, -it make use of `libgfapi` implementation of glusterfs and this method is called native integration. -Here the glusterfs is added as a block backend to qemu and libvirt treat this as a `network` type disk. - -For more details on this, please refer # http://www.ovirt.org/Features/GlusterFS_Storage_Domain -However there are 2 bugs which block usage of this feature. - -https://bugzilla.redhat.com/show_bug.cgi?id=1022961 -https://bugzilla.redhat.com/show_bug.cgi?id=1017289 - -Please check above bugs for latest status. - -* To manage gluster trusted pools. 
- -oVirt web admin console can be used to - - - add new / import existing gluster cluster - - add/delete volumes - - add/delete bricks - - set/reset volume options - - optimize volume for virt store - - Rebalance and Remove bricks - - Monitor gluster deployment - node, brick, volume status, - Enhanced service monitoring (Physical node resources as well Quota, geo-rep and self-heal status) through Nagios integration(>=oVirt 3.4) - - - -When configuing ovirt to manage only gluster cluster/trusted pool, you need to select `gluster` as an input for -`Application mode` in OVIRT ENGINE CONFIGURATION option of `engine-setup` command. -Refer # http://www.ovirt.org/Quick_Start_Guide#Install_oVirt_Engine_.28Fedora.29%60 - -If you want to use gluster as both ( as a storage domain to host VM disks and to manage gluster trusted pools) -you need to input `both` as a value for `Application mode` in engine-setup command. - -Once you have successfully installed oVirt Engine as mentioned above, you will be provided with instructions -to access oVirt''s web console. - -Below example shows how to configure gluster nodes in fedora. - - -#Configuring gluster nodes. - -On the machine designated as your host, install any supported distribution( ex:Fedora/CentOS/RHEL...etc). -A minimal installation is sufficient. - -Refer # http://www.ovirt.org/Quick_Start_Guide#Install_Hosts - - -##Connect to Ovirt Engine - -Log In to Administration Console - -Ensure that you have the administrator password configured during installation of oVirt engine. - -- To connect to oVirt webadmin console - - -Open a browser and navigate to https://domain.example.com/webadmin. Substitute domain.example.com with the URL provided during installation - -If this is your first time connecting to the administration console, oVirt Engine will issue -security certificates for your browser. Click the link labelled this certificate to trust the -ca.cer certificate. 
A pop-up displays, click Open to launch the Certificate dialog. -Click `Install Certificate` and select to place the certificate in Trusted Root Certification Authorities store. - - -The console login screen displays. Enter admin as your User Name, and enter the Password that -you provided during installation. Ensure that your domain is set to Internal. Click Login. - - -You have now successfully logged in to the oVirt web administration console. Here, you can configure and manage all your gluster resources. - -To manage gluster trusted pool: - -- Create a cluster with "Enable gluster service" - turned on. (Turn on "Enable virt service" if the same nodes are used as hypervisor as well) -- Add hosts which have already been set up as in step Configuring gluster nodes. -- Create a volume, and click on "Optimize for virt store",This sets the volume tunables optimize volume to be used as an image store - -To use this volume as a storage domain: - -Please refer `User interface` section of www.ovirt.org/Features/GlusterFS_Storage_Domain diff --git a/doc/features/qemu-integration.md b/doc/features/qemu-integration.md deleted file mode 100644 index b44dc06bb43..00000000000 --- a/doc/features/qemu-integration.md +++ /dev/null @@ -1,231 +0,0 @@ -Using GlusterFS volumes to host VM images and data was sub-optimal due to the FUSE overhead involved in accessing gluster volumes via GlusterFS native client. However this has changed now with two specific enhancements: - -- A new library called libgfapi is now available as part of GlusterFS that provides POSIX-like C APIs for accessing gluster volumes. libgfapi support is available from GlusterFS-3.4 release. -- QEMU (starting from QEMU-1.3) will have GlusterFS block driver that uses libgfapi and hence there is no FUSE overhead any longer when QEMU works with VM images on gluster volumes. - -GlusterFS with its pluggable translator model can serve as a flexible storage backend for QEMU. 
QEMU only has to talk to GlusterFS, and GlusterFS will hide the different file systems and storage types underneath. Various GlusterFS storage features like replication and striping will automatically be available for QEMU. Efforts are also under way to add a block device backend in Gluster via the Block Device (BD) translator, which will expose underlying block devices as files to QEMU. This allows GlusterFS to be a single storage backend for both file and block based storage types.
-
-###GlusterFS specification in QEMU
-
-A VM image residing on a gluster volume can be specified on the QEMU command line using the URI format
-
-    gluster[+transport]://[server[:port]]/volname/image[?socket=...]
-
-
-
-* `gluster` is the protocol.
-
-* `transport` specifies the transport type used to connect to the gluster management daemon (glusterd). Valid transport types are `tcp`, `unix` and `rdma`. If a transport type isn't specified, then tcp is assumed.
-
-* `server` specifies the server where the volume file specification for the given volume resides. This can be a hostname, an IPv4 address or an IPv6 address. An IPv6 address needs to be within square brackets [ ]. If the transport type is unix, then the server field should not be specified. Instead the socket field needs to be populated with the path to the unix domain socket.
-
-* `port` is the port number on which glusterd is listening. This is optional and if not specified, QEMU will send 0, which will make gluster use the default port. If the transport type is unix, then port should not be specified.
-
-* `volname` is the name of the gluster volume which contains the VM image.
-
-* `image` is the path to the actual VM image that resides on the gluster volume.
- - -###Examples: - - gluster://1.2.3.4/testvol/a.img - gluster+tcp://1.2.3.4/testvol/a.img - gluster+tcp://1.2.3.4:24007/testvol/dir/a.img - gluster+tcp://[1:2:3:4:5:6:7:8]/testvol/dir/a.img - gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img - gluster+tcp://server.domain.com:24007/testvol/dir/a.img - gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket - gluster+rdma://1.2.3.4:24007/testvol/a.img - - - -NOTE: (GlusterFS URI description and above examples are taken from QEMU documentation) - -###Configuring QEMU with GlusterFS backend - -While building QEMU from source, in addition to the normal configuration options, ensure that –enable-glusterfs options are specified explicitly with ./configure script to get glusterfs support in qemu. - -Starting with QEMU-1.6, pkg-config is used to configure the GlusterFS backend in QEMU. If you are using GlusterFS compiled and installed from sources, then the GlusterFS package config file (glusterfs-api.pc) might not be present at the standard path and you will have to explicitly add the path by executing this command before running the QEMU configure script: - - export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/ - -Without this, GlusterFS driver will not be compiled into QEMU even when GlusterFS is present in the system. - -* Creating a VM image on GlusterFS backend - -qemu-img command can be used to create VM images on gluster backend. The general syntax for image creation looks like this: - -For ex: - - qemu-img create gluster://server/volname/path/to/image size - -## How to setup the environment: - -This usecase ( using glusterfs backend for VM disk store), is known as 'Virt-Store' usecase. Steps for the entire procedure could be split to: - -* Steps to be done on gluster volume side -* Steps to be done on Hypervisor side - - -##Steps to be done on gluster side - -These are the steps that needs to be done on the gluster side. 
Precisely this involves - - Creating "Trusted Storage Pool" - Creating a volume - Tuning the volume for virt-store - Tuning glusterd to accept requests from QEMU - Tuning glusterfsd to accept requests from QEMU - Setting ownership on the volume - Starting the volume - -* Creating "Trusted Storage Pool" - -Install glusterfs rpms on the NODE. You can create a volume with a single node. You can also scale up the cluster, as we call as `Trusted Storage Pool`, by adding more nodes to the cluster - - gluster peer probe <hostname> - -* Creating a volume - -It is highly recommended to have replicate volume or distribute-replicate volume for virt-store usecase, as it would add high availability and fault-tolerance. Remember the plain distribute works equally well - - gluster volume create replica 2 <brick1> .. <brickN> - -where, `<brick1> is <hostname>:/<path-of-dir> ` - - -Note: It is recommended to create sub-directories inside brick and that could be used to create a volume.For example, say, /home/brick1 is the mountpoint of XFS, then you can create a sub-directory inside it /home/brick1/b1 and use it while creating a volume.You can also use space available in root filesystem for bricks. Gluster cli, by default, throws warning in that case. You can override it by using force option - - gluster volume create replica 2 <brick1> .. <brickN> force - -If you are new to GlusterFS, you can take a look at QuickStart (http://www.gluster.org/community/documentation/index.php/QuickStart) guide. - -* Tuning the volume for virt-store - -There are recommended settings available for virt-store. 
This provide good performance characteristics when enabled on the volume that was used for virt-store - -Refer to http://www.gluster.org/community/documentation/index.php/Virt-store-usecase#Tunables for recommended tunables and for applying them on the volume, http://www.gluster.org/community/documentation/index.php/Virt-store-usecase#Applying_the_Tunables_on_the_volume - - -* Tuning glusterd to accept requests from QEMU - -glusterd receives the request only from the applications that run with port number less than 1024 and it blocks otherwise. QEMU uses port number greater than 1024 and to make glusterd accept requests from QEMU, edit the glusterd vol file, /etc/glusterfs/glusterd.vol and add the following, - - option rpc-auth-allow-insecure on - -Note: If you have installed glusterfs from source, you can find glusterd vol file at `/usr/local/etc/glusterfs/glusterd.vol` - -Restart glusterd after adding that option to glusterd vol file - - service glusterd restart - -* Tuning glusterfsd to accept requests from QEMU - -Enable the option `allow-insecure` on the particular volume - - gluster volume set <volname> server.allow-insecure on - -IMPORTANT : As of now(april 2,2014)there is a bug, as allow-insecure is not dynamically set on a volume.You need to restart the volume for the change to take effect - - -* Setting ownership on the volume - -Set the ownership of qemu:qemu on to the volume - - gluster volume set <vol-name> storage.owner-uid 107 - gluster volume set <vol-name> storage.owner-gid 107 - -* Starting the volume - -Start the volume - - gluster volume start <vol-name> - -## Steps to be done on Hypervisor Side: - -To create a raw image, - - qemu-img create gluster://1.2.3.4/testvol/dir/a.img 5G - -To create a qcow2 image, - - qemu-img create -f qcow2 gluster://server.domain.com:24007/testvol/a.img 5G - - - - - -## Booting VM image from GlusterFS backend - -A VM image 'a.img' residing on gluster volume testvol can be booted using QEMU like this: - - - 
qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio - -In addition to VM images, gluster drives can also be used as data drives: - - qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio -drive file=gluster://1.2.3.4/datavol/a-data.img,if=virtio - -Here 'a-data.img' from datavol gluster volume appears as a 2nd drive for the guest. - -It is also possible to make use of libvirt to define a disk and use it with qemu: - - -### Create libvirt XML to define Virtual Machine - -virt-install is python wrapper which is mostly used to create VM using set of params. How-ever virt-install doesn't support any network filesystem [ https://bugzilla.redhat.com/show_bug.cgi?id=1017308 ] - -Create a libvirt VM xml - http://libvirt.org/formatdomain.html where the disk section is formatted in such a way, qemu driver for glusterfs is being used. This can be seen in the following example xml description - - - <disk type='network' device='disk'> - <driver name='qemu' type='raw' cache='none'/> - <source protocol='gluster' name='distrepvol/vm3.img'> - <host name='10.70.37.106' port='24007'/> - </source> - <target dev='vda' bus='virtio'/> - <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> - </disk> - - - - - -* Define the VM from the XML file that was created earlier - - - virsh define <xml-file-description> - -* Verify that the VM is created successfully - - - virsh list --all - -* Start the VM - - - virsh start <VM> - -* Verification - -You can verify the disk image file that is being used by VM - - virsh domblklist <VM-Domain-Name/ID> - -The above should show the volume name and image name. 
Here is the example, - - - [root@test ~]# virsh domblklist vm-test2 - Target Source - ------------------------------------------------ - vda distrepvol/test.img - hdc - - - -Reference: - -For more details on this feature implementation and its advantages, please refer: - -http://raobharata.wordpress.com/2012/10/29/qemu-glusterfs-native-integration/ - -http://www.gluster.org/community/documentation/index.php/Libgfapi_with_qemu_libvirt diff --git a/doc/features/quota/quota-object-count.md b/doc/features/quota/quota-object-count.md deleted file mode 100644 index 063aa7c5d61..00000000000 --- a/doc/features/quota/quota-object-count.md +++ /dev/null @@ -1,47 +0,0 @@ -Previous mechanism: -==================== - -The only way we could have retrieved the number of files/objects in a directory or volume was to do a crawl of the entire directory/volume. That was expensive and was not scalable. - -New Design Implementation: -========================== -The proposed mechanism will provide an easier alternative to determine the count of files/objects in a directory or volume. - -The new mechanism will store count of objects/files as part of an extended attribute of a directory. Each directory extended attribute value will indicate the number of files/objects present in a tree with the directory being considered as the root of the tree. - -Inode quota management -====================== - -**setting limits** - -Syntax: -*gluster volume quota <volname\> limit-objects <path\> <number\>* - -Details: -<number\> is a hard-limit for number of objects limitation for path <path\>. If hard-limit is exceeded, creation of file or directory is no longer permitted. - -**list-objects** - -Syntax: -*gluster volume quota <volname\> list-objects \[path\] ...* - -Details: -If path is not specified, then all the directories which has object limit set on it will be displayed. If we provide path then only that particular path is displayed along with the details associated with that. 
- -Sample output: - - Path Hard-limit Soft-limit Files Dirs Available Soft-limit exceeded? Hard-limit exceeded? - --------------------------------------------------------------------------------------------------------------------------------------------- - /dir 10 80% 0 1 9 No No - -**Deleting limits** - -Syntax: -*gluster volume quota <volname\> remove-objects <path\>* - -Details: -This will remove the object limit set on the specified path. - -Note: There is a known issue associated with remove-objects. When both usage limit and object limit is set on a path, then removal of any limit will lead to removal of other limit as well. This is tracked in the bug #1202244 - - diff --git a/doc/features/quota/quota-scalability.md b/doc/features/quota/quota-scalability.md deleted file mode 100644 index e47c898dd2a..00000000000 --- a/doc/features/quota/quota-scalability.md +++ /dev/null @@ -1,52 +0,0 @@ -Issues with older implemetation: ------------------------------------ -* >#### Enforcement of quota was done on client side. This had following two issues : - > >* All clients are not trusted and hence enforcement is not secure. - > >* Quota enforcer caches directory size for a certain time out period to reduce network calls to fetch size. On time out, this cache is validated by querying server. With more clients, the traffic caused due to this -validation increases. - -* >#### Relying on lookup calls on a file/directory (inode) to update its contribution [time consuming] - -* >####Hardlimits were stored in a comma separated list. - > >* Hence, changing hard limit of one directory is not an independent operation and would invalidate hard limits of other directories. We need to parse the string once for each of these directories just to identify whether its hard limit is changed. This limits the number of hard limits we can configure. - -* >####Cli used to fetch the list of directories on which quota-limit is set, from glusterd. 
- > >* With more number of limits, the network overhead incurred to fetch this list limits the scalability of number of directories on which we can set quota. - -* >#### Problem with NFS mount - > >* Quota, for its enforcement and accounting requires all the ancestors of a file/directory till root. However, with NFS relying heavily on nameless lookups (through which there is no guarantee that ancestry can be -accessed) this ancestry is not always present. Hence accounting and enforcement was not correct. - - -New Design Implementation: --------------------------------- - -* Quota enforcement is moved to server side. This addresses issues that arose because of client side enforcement. - -* Two levels of quota limits, soft and hard quota is introduced. - This will result in a message being logged on reaching soft quota and writes will fail with EDQUOT after hard limit is reached. - -Work Flow ------------------ - -* Accounting - # This is done using the marker translator loaded on each brick of the volume. Accounting happens in the background. Ie, it doesn't happen in-flight with the file operation. The file operations latency is not -directly affected by the time taken to perform accounting. This update is sent recursively upwards up to the root of the volume. - -* Enforcement - # The enforcer updates its 'view' (cached) of directory's disk usage on the incidence of a file operation after the expiry of hard/soft timeout, depending on the current usage. Enforcer uses quotad to get the -aggregated disk usage of a directory from the accounting information present on each brick (viz, provided by marker). - -* Aggregator (quotad) - # Quotad is a daemon that serves volume-wide disk usage of a directory, on which quota is configured. It is present on all nodes in the cluster (trusted storage pool) as bricks don't have a global view of cluster. -Quotad queries the disk usage information from all the bricks in that volume and aggregates. 
It manages all the volumes on which quota is enabled.
-
-
-Benefit to GlusterFS
----------------------------------
-
-* Support up to 65536 quota configurations per volume.
-* More quotas can be configured in a single volume, which lets GlusterFS support use cases like home directories.
-
-###For more information on quota usability, refer to the following link:
-> https://access.redhat.com/site/documentation/en-US/Red_Hat_Storage/2.1/html-single/Administration_Guide/index.html#chap-User_Guide-Dir_Quota-Enable
diff --git a/doc/features/rdmacm.md b/doc/features/rdmacm.md
deleted file mode 100644
index 2c287e85fb6..00000000000
--- a/doc/features/rdmacm.md
+++ /dev/null
@@ -1,26 +0,0 @@
-## Rdma Connection manager ##
-
-### What? ###
-Infiniband requires addresses of end points to be exchanged using an out-of-band channel (like tcp/ip). Glusterfs used a custom protocol over a tcp/ip channel to exchange this address. However, librdmacm provides the same functionality with the advantage of being a standard protocol. This helps if we want to communicate with a non-glusterfs entity (say nfs client with gluster nfs server) over infiniband.
-
-### Dependencies ###
-* [IP over Infiniband](http://pkg-ofed.alioth.debian.org/howto/infiniband-howto-5.html) - The value to *option* **remote-host** in glusterfs transport configuration should be an IPoIB address
-* [rdma cm kernel module](http://pkg-ofed.alioth.debian.org/howto/infiniband-howto-4.html#ss4.4)
-* [user space rdmacm library - librdmacm](https://www.openfabrics.org/downloads/rdmacm)
-
-### rdma-cm in >= GlusterFs 3.4 ###
-
-Following is the impact of http://review.gluster.org/#change,149.
-
-New userspace packages needed:
-librdmacm
-librdmacm-devel
-
-### Limitations ###
-
-* Because of bug [890502](https://bugzilla.redhat.com/show_bug.cgi?id=890502), we have to probe the peer on an IPoIB address. This imposes the restriction that all volumes created in the future have to communicate over an IPoIB address (irrespective of whether they use gluster's tcp or rdma transport).
-
-* Currently the client has the independence to choose between tcp and rdma transports while communicating with the server (by creating volumes with **transport-type tcp,rdma**). This independence was a by-product of our ability to use the tcp/ip channel - transports with *option transport-type tcp* - for the rdma connection establishment handshake too. However, with the new requirement of an IPoIB address for connection establishment, we lose this independence (till we bring in [multi-network support](https://bugzilla.redhat.com/show_bug.cgi?id=765437) - where a brick can be identified by a set of ip-addresses and we can choose different pairs of ip-addresses for communication based on our requirements - in glusterd).
-
-### External links ###
-* [Infiniband Howto](http://pkg-ofed.alioth.debian.org/howto/infiniband-howto.html)
diff --git a/doc/features/readdir-ahead.md b/doc/features/readdir-ahead.md deleted file mode 100644 index 5302a021202..00000000000 --- a/doc/features/readdir-ahead.md +++ /dev/null @@ -1,14 +0,0 @@ -## Readdir-ahead ##
-
-### Summary ###
-Provide read-ahead support for directories to improve sequential directory read performance.
-
-### Owners ###
-Brian Foster
-
-### Detailed Description ###
-The read-ahead feature for directories is analogous to read-ahead for files. The objective is to detect sequential directory read operations and establish a pipeline for directory content. When a readdir request is received and fulfilled, preemptively issue subsequent readdir requests to the server in anticipation of those requests from the user. If sequential readdir requests are received, the directory content is already immediately available in the client. If subsequent requests are not sequential or not received, said data is simply dropped and the optimization is bypassed.
-
-readdir-ahead is currently disabled by default. It can be enabled with the following command:
-
- gluster volume set <volname> readdir-ahead on
diff --git a/doc/features/rebalance.md b/doc/features/rebalance.md deleted file mode 100644 index e7212d4011f..00000000000 --- a/doc/features/rebalance.md +++ /dev/null @@ -1,74 +0,0 @@ -## Background - - -For a more detailed description, view Jeff Darcy's blog post [here] -(http://pl.atyp.us/hekafs.org/index.php/2012/03/glusterfs-algorithms-distribution/) - -GlusterFS uses the distribute translator (DHT) to aggregate space of multiple servers. DHT distributes files among its subvolumes using a consistent hashing method providing 32-bit hashes. Each DHT subvolume is given a range in the 32-bit hash space. A hash value is calculated for every file using a combination of its name. The file is then placed in the subvolume with the hash range that contains the hash value. - -## What is rebalance? - -The rebalance process migrates files between the DHT subvolumes when necessary. - -## When is rebalance required? - -Rebalancing is required for two main cases. - -1. Addition/Removal of bricks - -2. Renaming of a file - -## Addition/Removal of bricks - -Whenever the number or order of DHT subvolumes change, the hash range given to each subvolume is recalculated. When this happens, already existing files on the volume will need to be moved to the correct subvolume based on their hash. Rebalance does this activity. - -Addition of bricks which increase the size of a volume will increase the number of DHT subvolumes and lead to recalculation of hash ranges (This doesn't happen when bricks are added to a volume to increase redundancy, i.e. increase replica count of a volume). This will require an explicit rebalance command to be issued to migrate the files. - -Removal of bricks which decrease the size of a volumes also causes the hash ranges of DHT to be recalculated. But we don't need to issue an explicit rebalance command in this case, as rebalance is done automatically by the remove-brick process if needed. 
-
-## Renaming of a file
-
-Renaming a file will cause its hash to change. The file now needs to be moved to the correct subvolume based on its new hash. Rebalance does this.
-
-## How does rebalance work?
-
-At a high level, the rebalance process consists of the following 3 steps:
-
-1. Crawl the volume to access all files
-2. Calculate the hash for the file
-3. If needed, migrate the file to the correct subvolume.
-
-
-The rebalance process has been optimized by making it distributed across the trusted storage pool. With distributed rebalance, a rebalance process is launched on each peer in the cluster. Each rebalance process will crawl files on only those bricks of the volume which are present on it, and migrate the files which need migration to the correct brick. This speeds up the rebalance process considerably.
-
-## What will happen if rebalance is not run?
-
-### Addition of bricks
-
-With the current implementation of add-brick, when the size of a volume is augmented by adding new bricks, the new bricks are not put into use immediately, i.e., the hash ranges are not recalculated immediately. This means that the files will still be placed only onto the existing bricks, leaving the newly added storage space unused. Starting a rebalance process on the volume will cause the hash ranges to be recalculated with the new bricks included, which allows the newly added storage space to be used.
-
-### Renaming a file
-
-When a file rename causes the file to be hashed to a new subvolume, DHT writes a link file on the new subvolume, leaving the actual file on the original subvolume. A link file is an empty file, which has an extended attribute set that points to the subvolume on which the actual file exists. So, when a client accesses the renamed file, DHT first looks for the file in the hashed subvolume and gets the link file. DHT understands the link file, and gets the actual file from the subvolume pointed to by the link file. This extra lookup leads to a slight reduction in performance. A rebalance will move the actual file to the hashed subvolume, allowing clients to access the file directly once again.
-
-## Are clients affected during a rebalance process?
-
-The rebalance process is transparent to applications on the clients. Applications which have open files on the volume will not be affected by the rebalance process, even if the open file requires migration. The DHT translator on the client will hide the migration from the applications.
-
-##How are open files migrated?
-
-(A more technical description of the algorithm used can be seen in the commit message of commit a07bb18c8adeb8597f62095c5d1361c5bad01f09.)
-
-To achieve migration of open files, two things need to be ensured:
-a) any writes or changes happening to the file during migration are correctly synced to the destination subvolume after the migration is complete.
-b) any further changes should be made to the destination subvolume
-
-Both of these requirements require sending notifications to clients. Clients are notified by overloading an attribute used in every callback function. DHT understands these attributes in the callbacks and can thus be notified whether a file is being migrated or not.
-
-During rebalance, a file will be in two phases
-
-1. Migration in process - In this phase the file is being migrated by the rebalance process from the source subvolume to the destination subvolume. The rebalance process will set an 'in-migration' attribute on the file, which will notify the clients' DHT translator. The clients' DHT translator will then take care to send any further changes to the destination subvolume as well. This way we satisfy the first requirement
-
-2. Migration completed - Once the file has been migrated, the rebalance process will set a 'migration-complete' attribute on the file. The clients will be notified of the completion and all further operations on the file will happen on the destination subvolume.
- -The DHT translator handles the above and allows the applications on the clients to continue working on a file under migration. diff --git a/doc/features/server-quorum.md b/doc/features/server-quorum.md deleted file mode 100644 index 7b20084cea8..00000000000 --- a/doc/features/server-quorum.md +++ /dev/null @@ -1,44 +0,0 @@ -# Server Quorum - -Server quorum is a feature intended to reduce the occurrence of "split brain" -after a brick failure or network partition. Split brain happens when different -sets of servers are allowed to process different sets of writes, leaving data -in a state that can not be reconciled automatically. The key to avoiding split -brain is to ensure that there can be only one set of servers - a quorum - that -can continue handling writes. Server quorum does this by the brutal but -effective means of forcing down all brick daemons on cluster nodes that can no -longer reach enough of their peers to form a majority. Because there can only -be one majority, there can be only one set of bricks remaining, and thus split -brain can not occur. - -## Options - -Server quorum is controlled by two parameters: - - * **cluster.server-quorum-type** - - This value may be "server" to indicate that server quorum is enabled, or - "none" to mean it's disabled. - - * **cluster.server-quorum-ratio** - - This is the percentage of cluster nodes that must be up to maintain quorum. - More precisely, this percentage of nodes *plus one* must be up. - -Note that these are cluster-wide flags. All volumes served by the cluster will -be affected. Once these values are set, quorum actions - starting or stopping -brick daemons in response to node or network events - will be automatic. - -## Best Practices - -If a cluster with an even number of nodes is split exactly down the middle, -neither half can have quorum (which requires **more than** half of the total). -This is particularly important when N=2, in which case the loss of either node -leads to loss of quorum. 
Therefore, it is highly advisable to ensure that the -cluster size is three or greater. The "extra" node in this case need not have -any bricks or serve any data. It need only be present to preserve the notion -of a quorum majority less than the entire cluster membership, allowing the -cluster to survive the loss of a single node without losing quorum. - - - diff --git a/doc/features/shard.md b/doc/features/shard.md deleted file mode 100644 index 3588531a2b4..00000000000 --- a/doc/features/shard.md +++ /dev/null @@ -1,68 +0,0 @@ -### Sharding xlator (Stripe 2.0) - -GlusterFS's answer to very large files (those which can grow beyond a -single brick) has never been clear. There is a stripe xlator which allows you to -do that, but that comes at a cost of flexibility - you can add servers only in -multiple of stripe-count x replica-count, mixing striped and unstriped files is -not possible in an "elegant" way. This also happens to be a big limiting factor -for the big data/Hadoop use case where super large files are the norm (and where -you want to split a file even if it could fit within a single server.) - -The proposed solution for this is to replace the current stripe xlator with a -new Shard xlator. Unlike the stripe xlator, Shard is not a cluster xlator. It is -placed on top of DHT. Initially all files will be created as normal files, even -up to a certain configurable size. The first block (default 4MB) will be stored -like a normal file. However further blocks will be stored in a file, named by -the GFID and block index in a separate namespace (like /.shard/GFID1.1, -/.shard/GFID1.2 ... /.shard/GFID1.N). File IO happening to a particular offset -will write to the appropriate "piece file", creating it if necessary. The -aggregated file size and block count will be stored in the xattr of the original -(first block) file. - -The advantage of such a model: - -- Data blocks are distributed by DHT in a "normal way". 
-- Adding servers can happen in any number (even one at a time) and DHT's
-  rebalance will spread out the "piece files" evenly.
-- Self-healing of a large file is now more distributed into smaller files across
-  more servers.
-- piece file naming scheme is immune to renames and hardlinks.
-
-Source: https://gist.github.com/avati/af04f1030dcf52e16535#sharding-xlator-stripe-20
-
-## Usage:
-
-Shard translator is disabled by default. To enable it on a given volume, execute
-<code>
-gluster volume set <VOLNAME> features.shard on
-</code>
-
-The default shard block size is 4MB. To modify it, execute
-<code>
-gluster volume set <VOLNAME> features.shard-block-size <value>
-</code>
-
-When a file is created in a volume with sharding enabled, its block size is
-persisted in its xattr on the first block. This property of the file will remain
-even if the shard-block-size for the volume is reconfigured later.
-
-If you want to disable sharding on a volume, it is advisable to create a new
-volume without sharding and copy out contents of this volume into the new
-volume.
-
-## Note:
-* Shard translator is still a beta feature in 3.7.0 and will possibly be fully
-  supported in one of the 3.7.x releases.
-* It is advisable to use shard translator in volumes with replication enabled
-  for fault tolerance.
-
-## TO-DO:
-* Complete implementation of zerofill, discard and fallocate fops.
-* Introduce caching and its invalidation within shard translator to store size
-  and block count of shard'ed files.
-* Make shard translator work for non-Hadoop and non-VM use cases where there are
-  multiple clients operating on the same file.
-* Serialize appending writes.
-* Manage recovery of size and block count better in the face of faults during
-  ongoing inode write fops.
-* Anything else that could crop up later :)
diff --git a/doc/features/tier/tier.md b/doc/features/tier/tier.md
deleted file mode 100644
index 13e7d971bdf..00000000000
--- a/doc/features/tier/tier.md
+++ /dev/null
@@ -1,168 +0,0 @@
-##Tiering
-
-* ####Feature page:
-http://www.gluster.org/community/documentation/index.php/Features/data-classification
-
-* #####Design: goo.gl/bkU5qv
-
-###Theory of operation
-
-
-The tiering feature enables different storage types to be used by the same
-logical volume. In Gluster 3.7, the two types are classified as "cold" and
-"hot", and are represented as two groups of bricks. The hot group acts as
-a cache for the cold group. The bricks within the two groups themselves are
-arranged according to standard Gluster volume conventions, e.g. replicated,
-distributed replicated, or dispersed.
-
-A normal gluster volume can become a tiered volume by "attaching" bricks
-to it. The attached bricks become the "hot" group. The bricks within the
-original gluster volume are the "cold" bricks.
-
-For example, the original volume may be dispersed on HDD, and the "hot"
-tier could be distributed-replicated SSDs.
-
-Once this new "tiered" volume is built, I/Os to it are subjected to caching
-heuristics:
-
-* All I/Os are forwarded to the hot tier.
-
-* If a lookup fails on the hot tier, the I/O will be forwarded to the cold
-tier. This is a "cache miss".
-
-* Files on the hot tier that are not touched within some time are demoted
-(moved) to the cold tier (see performance parameters, below).
-
-* Files on the cold tier that are touched one or more times are promoted
-(moved) to the hot tier (see performance parameters, below).
-
-This resembles implementations by Ceph and the Linux device mapper (DM)
-component.
-
-Performance enhancements being considered include:
-
-* Biasing migration of large files over small.
-
-* Only demoting when the hot tier is close to full.
-
-* Write-back cache for database updates.
-
-###Code organization
-
-The design endeavors to be upward compatible with future migration policies,
-such as scheduled file migration, data classification, etc. For example,
-the caching logic is self-contained and separate from the file migration. A
-different set of migration policies could use the same underlying migration
-engine. The I/O tracking and metadata store components are intended to be
-reusable for things besides caching semantics.
-
-####Libgfdb:
-
-Libgfdb provides an abstract mechanism to record extra/rich metadata
-required for data maintenance, such as data tiering/classification.
-It provides the consumer with an API for recording and querying, keeping
-the consumer abstracted from the data store used beneath for storing data.
-It works in a plug-and-play model, where data stores can be plugged in.
-Presently there is a plugin for Sqlite3. In the future it will provide a
-recording and querying performance optimizer. In the current implementation
-the schema of metadata is fixed.
-
-####Schema:
-
-    GF_FILE_TB Table:
-    This table has one entry per file inode. It holds the metadata required to
-    make decisions in data maintenance.
-    GF_ID (Primary key)        : File GFID (Universal Unique IDentifier in the namespace)
-    W_SEC, W_MSEC              : Write wind time in sec & micro-sec
-    UW_SEC, UW_MSEC            : Write un-wind time in sec & micro-sec
-    W_READ_SEC, W_READ_MSEC    : Read wind time in sec & micro-sec
-    UW_READ_SEC, UW_READ_MSEC  : Read un-wind time in sec & micro-sec
-    WRITE_FREQ_CNTR INTEGER    : Write Frequency Counter
-    READ_FREQ_CNTR INTEGER     : Read Frequency Counter
-
-    GF_FLINK_TABLE:
-    This table has all the hardlinks to a file inode.
-    GF_ID       : File GFID              (Composite Primary Key)``|
-    GF_PID      : Parent Directory GFID  (Composite Primary Key)  |-> Primary Key
-    FNAME       : File Base Name         (Composite Primary Key)__|
-    FPATH       : File Full Path (It is redundant for now and will be removed)
-    W_DEL_FLAG  : This flag is used for crash consistency when a link is unlinked,
-                  i.e. it is set to 1 during unlink wind, and the record is deleted during unwind
-    LINK_UPDATE : This flag is used when a link is changed, i.e. on rename.
-                  Set to 1 during rename wind and set to 0 during rename unwind
-
-Libgfdb API :
-Refer libglusterfs/src/gfdb/gfdb_data_store.h
-
-####ChangeTimeRecorder (CTR) Translator:
-
-ChangeTimeRecorder (CTR) is a server side xlator (translator) which sits
-just above the posix xlator. The main role of this xlator is to record the
-access/write patterns on a file residing on the brick. It records the
-read (data only) and write (data and metadata) times and also counts
-how many times a file is read or written. This xlator also captures
-the hard links to a file (as this is required by data tiering to move
-files).
-
-CTR Xlator is the consumer of libgfdb.
-
-To Enable/Disable CTR Xlator:
-
-    **gluster volume set <volume-name> features.ctr-enabled {on/off}**
-
-To Enable/Disable Frequency Counter Recording in CTR Xlator:
-
-    **gluster volume set <volume-name> features.record-counters {on/off}**
-
-
-####Migration daemon:
-
-When a tiered volume is created, a migration daemon starts. There is one daemon
-for every tiered volume per node. The daemon sleeps and then periodically
-queries the database for files to promote or demote. The query callbacks
-assemble files in a list, which is then enumerated. The frequencies at
-which promotes and demotes happen are subject to user configuration.
-
-Selected files are migrated between the tiers using existing DHT migration
-logic. The tier translator will leverage DHT rebalance performance
-enhancements.
-
-Configurables for the migration daemon:
-
-    gluster volume set <volume-name> cluster.tier-demote-frequency <SECS>
-
-    gluster volume set <volume-name> cluster.tier-promote-frequency <SECS>
-
-    gluster volume set <volume-name> cluster.read-freq-threshold <COUNT>
-
-    gluster volume set <volume-name> cluster.write-freq-threshold <COUNT>
-
-
-####Tier Translator:
-
-The tier translator is the root node in tiered volumes.
The first subvolume
-is the cold tier, and the second the hot tier. DHT logic for forwarding I/Os
-is largely unchanged. Exceptions are handled according to the dht_methods_t
-structure, which forks control according to DHT or tier type.
-
-The major exception is that DHT's layout is not utilized for choosing hashed
-subvolumes. Rather, the hot tier is always the hashed subvolume.
-
-Changes to DHT were made to allow "stacking", i.e. DHT over DHT:
-
-* readdir operations remember the index of the "leaf node" in the volume graph
-(client id), rather than a unique index for each DHT instance.
-
-* Each DHT instance uses a unique extended attribute for tracking migration.
-
-* In certain cases, it is legal for tiered volumes to have unpopulated inodes
-(whereas this would be an error in DHT's case).
-
-Currently tiered volume expansion (adding and removing bricks) is unsupported.
-
-####glusterd:
-
-The tiered volume tree is a composition of two other volumes. The glusterd
-daemon builds it. Existing logic for adding and removing bricks is heavily
-leveraged to attach and detach tiers, and perform statistics collection.
-
diff --git a/doc/features/trash.md b/doc/features/trash.md
deleted file mode 100644
index 3e38e872cf7..00000000000
--- a/doc/features/trash.md
+++ /dev/null
@@ -1,80 +0,0 @@
-Trash Translator
-=================
-
-Trash translator will allow users to access deleted or truncated files. Every brick will maintain a hidden .trashcan directory, which will be used to store the files deleted or truncated from the respective brick. The aggregate of all those .trashcan directories can be accessed from the mount point. In order to avoid name collisions, a time stamp is appended to the original file name while it is being moved to the trash directory.
-
-##Implications and Usage
-Apart from the primary use-case of accessing files deleted or truncated by the user, the trash translator can be helpful for internal operations such as self-heal and rebalance.
During self-heal and rebalance it is possible to lose crucial data. In those circumstances the trash translator can assist in recovery of the lost data. The trash translator is designed to intercept unlink, truncate and ftruncate fops, store a copy of the current file in the trash directory, and then perform the fop on the original file. For the internal operations, the files are stored under the 'internal_op' folder inside the trash directory.
-
-##Volume Options
-1. *gluster volume set <VOLNAME> features.trash <on | off>*
-
-    This command can be used to enable the trash translator in a volume. If set to on, the trash directory will be created in every brick inside the volume during the volume start command. By default the translator is loaded during volume start but remains non-functional. Disabling trash with the help of this option will not remove the trash directory or even its contents from the volume.
-
-2. *gluster volume set <VOLNAME> features.trash-dir <name>*
-
-    This command is used to reconfigure the trash directory to a user specified name. The argument is a valid directory name. The directory will be created inside every brick under this name. If not specified by the user, the trash translator will create the trash directory with the default name “.trashcan”. This can be used only when the trash translator is on.
-
-3. *gluster volume set <VOLNAME> features.trash-max-filesize <size>*
-
-    This command can be used to filter files entering the trash directory based on their size. Files larger than trash-max-filesize are deleted/truncated directly. The value for size may be followed by the multiplicative suffixes KB (=1024), MB (=1024*1024) and GB. The default size is 5MB. As of now, any value higher than 1GB will be capped at 1GB.
-
-4. *gluster volume set <VOLNAME> features.trash-eliminate-path <path1> [ , <path2> , . . . ]*
-
-    This command can be used to set the eliminate pattern for the trash translator.
Files residing under this pattern will not be moved to the trash directory during deletion/truncation. The path must be a valid one present in the volume.
-
-5. *gluster volume set <VOLNAME> features.trash-internal-op <on | off>*
-
-    This command can be used to enable trash for internal operations like self-heal and re-balance. By default it is set to off.
-
-##Testing
-The following steps illustrate a simple scenario of deleting a file from a directory.
-
-1. Create a distributed volume with two bricks and start it.
-
-        gluster volume create test rhs:/home/brick
-
-        gluster volume start test
-
-2. Enable trash translator
-
-        gluster volume set test features.trash on
-
-3. Mount glusterfs client as follows.
-
-        mount -t glusterfs rhs:/test /mnt
-
-4. Create a directory and file in the mount.
-
-        mkdir /mnt/dir
-
-        echo abc > /mnt/dir/file
-
-5. Delete the file from the mount.
-
-        rm -rf /mnt/dir/file
-
-6. Check inside the trash directory.
-
-        ls /mnt/.trashcan
-
-We can find the deleted file inside the trash directory with a timestamp appended to its filename.
-
-For example,
-
-        [root@rh-host ~]#mount -t glusterfs rh-host:/test /mnt/test
-        [root@rh-host ~]#mkdir /mnt/test/abc
-        [root@rh-host ~]#touch /mnt/test/abc/file
-        [root@rh-host ~]#rm /mnt/test/abc/file
-        remove regular empty file ‘/mnt/test/abc/file’? y
-        [root@rh-host ~]#ls /mnt/test/abc
-        [root@rh-host ~]#
-        [root@rh-host ~]#ls /mnt/test/.trashcan/abc/
-        file2014-08-21_123400
-
-##Points to be remembered
-[1] As soon as the volume is started, the trash directory will be created inside the volume and will be visible through the mount. Disabling trash will not have any impact on its visibility from the mount.
-[2] Even though deletion of the trash directory is not permitted, issuing a delete on it removes the contents currently residing in it, leaving only an empty trash directory.
-
-##Known issues
-[1] Since the trash translator resides on the server side, the DHT translator is unaware of the rename and truncate operations done by this translator, which eventually move the files to the trash directory. Unless and until a complete-path-based lookup comes in on trashed files, those may not be visible from the mount.
diff --git a/doc/features/upcall.md b/doc/features/upcall.md
deleted file mode 100644
index 894bd54264d..00000000000
--- a/doc/features/upcall.md
+++ /dev/null
@@ -1,33 +0,0 @@
-## Upcall ##
-
-### Summary ###
-A generic and extensible framework used to maintain state in the glusterfsd process for each file accessed (including information about the clients performing the fops) and to send notifications to the respective glusterfs clients in case of any change in that state.
-
-The use cases currently relying on this infrastructure are:
-
- Inode Update/Invalidation
-
-### Detailed Description ###
-GlusterFS, a scale-out storage platform, comprises a distributed file system that follows the client-server architectural model.
-
-It is the client (glusterfs) that usually initiates an RPC request to the server (glusterfsd). After processing the request, the server sends a reply to the client as the response to that request. Until now, there was no interface or use case for the server to notify, or make a request to, the client.
-
-This support is now being added using “Upcall Infrastructure”.
-
-A new xlator (Upcall) has been defined to maintain and process the state of the events which require the server to send upcall notifications. For each I/O on an inode, we create/update an ‘upcall_inode_ctx’ and store/update the list of client info entries (‘upcall_client_t’) in that context.
-
-#### Cache Invalidation ####
-Each GlusterFS client/application caches certain state of the files (e.g., inode or attributes). In a multi-node environment these caches could lead to data-integrity issues for a certain time if multiple clients access the same file simultaneously.
-To avoid such scenarios, the server needs to notify clients in case of any change in the file state/attributes.
-
-More details can be found at the links below:
- http://www.gluster.org/community/documentation/index.php/Features/Upcall-infrastructure
- https://soumyakoduri.wordpress.com/2015/02/25/glusterfs-understanding-upcall-infrastructure-and-cache-invalidation-support/
-
-cache-invalidation is currently disabled by default. It can be enabled with the following command:
-
-    gluster volume set <volname> features.cache-invalidation on
-
-Note: This upcall notification is sent only to those clients which have accessed the file recently (i.e., within CACHE_INVALIDATE_PERIOD, default 60 sec). This option can be tuned using the following command:
-
-    gluster volume set <volname> features.cache-invalidation-timeout <value>
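
The cache-invalidation rule described in upcall.md above (track which clients touched an inode, and notify only the recent ones) can be illustrated with a small model. This is a sketch in Python, not the xlator's C code: the names `UpcallInodeCtx`, `UpcallTracker`, `record_access` and `clients_to_notify` are hypothetical, and skipping the client that made the change is an assumption rather than something the text states. Only the 60-second default comes from the document.

```python
from dataclasses import dataclass, field

# Default quoted in the doc (CACHE_INVALIDATE_PERIOD), in seconds.
CACHE_INVALIDATE_PERIOD = 60


@dataclass
class UpcallInodeCtx:
    """Toy stand-in for the per-inode context: client id -> last access time."""
    last_access: dict = field(default_factory=dict)


class UpcallTracker:
    """Hypothetical model of the server-side bookkeeping, keyed by GFID."""

    def __init__(self, timeout=CACHE_INVALIDATE_PERIOD):
        self.timeout = timeout
        self.inodes = {}

    def record_access(self, gfid, client, now):
        # Create the inode context on first access, then refresh the client entry.
        ctx = self.inodes.setdefault(gfid, UpcallInodeCtx())
        ctx.last_access[client] = now

    def clients_to_notify(self, gfid, writer, now):
        # Clients that accessed the inode within the timeout window.
        # Assumption: the client that made the change is not notified.
        ctx = self.inodes.get(gfid)
        if ctx is None:
            return []
        return sorted(
            client
            for client, seen in ctx.last_access.items()
            if client != writer and now - seen <= self.timeout
        )


tracker = UpcallTracker()
tracker.record_access("gfid-1", "client-a", now=0)   # stale by the time of the write
tracker.record_access("gfid-1", "client-b", now=50)  # still inside the window
tracker.record_access("gfid-1", "client-c", now=90)  # the client making the change
recipients = tracker.clients_to_notify("gfid-1", writer="client-c", now=90)
print(recipients)
```

In this model, raising the timeout widens the set of clients that receive invalidations, which is the trade-off the `features.cache-invalidation-timeout` option exposes.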
diff --git a/doc/features/worm.md b/doc/features/worm.md
deleted file mode 100644
index dba99777da5..00000000000
--- a/doc/features/worm.md
+++ /dev/null
@@ -1,75 +0,0 @@
-#WORM (Write Once Read Many)
-This feature enables you to create a `WORM volume` using the gluster CLI.
-##Description
-WORM (write once, read many) is a desired feature for users who want to store data such as `log files`, where the data must not be modified.
-
-GlusterFS provides a new key `features.worm` which takes boolean values (enable/disable) for volume set.
-
-Internally, the volume set command with the 'features.worm' key will add the 'features/worm' translator to the brick's volume file.
-
-`This change would be reflected on a subsequent restart of the volume`, i.e. gluster volume stop, followed by a gluster volume start.
-
-With a volume converted to WORM, the changes are as follows:
-
-* Reads are handled normally.
-* Only files opened with the O_APPEND flag can be written to.
-* Truncation and deletion are not supported.
-
-##Volume Options
-Use the volume set command on a volume and see if the volume is actually turned into WORM type.
-
-        # features.worm enable
-##Fully loaded Example
-The WORM feature is supported from glusterfs version 3.4.
-Start glusterd by using the command
-
-        # service glusterd start
-Now create a volume by using the command
-
-        # gluster volume create <vol_name> <brick_path>
-Start the volume created by running the command below.
-
-        # gluster vol start <vol_name>
-Run the command below to make sure that the volume is created.
-
-        # gluster volume info
-Now turn on the WORM feature on the volume by using the command
-
-        # gluster vol set <vol_name> features.worm enable
-Verify that the option is set by using the command
-
-        # gluster volume info
-The user should be able to see another option in the volume info
-
-        # features.worm: enable
-Now restart the volume for the changes to take effect, by performing volume stop and start.
-
-        # gluster volume stop <vol_name>
-        # gluster volume start <vol_name>
-Now mount the volume using a fuse mount
-
-        # mount -t glusterfs <hostname>:/<vol_name> <mnt_point>
-Create a file inside the mount point by running the command below
-
-        # touch <file_name>
-Verify that the user is able to create a file by running the command below
-
-        # ls <file_name>
-
-##How To Test
-Now try deleting the file that was created above
-
-        # rm <file_name>
-Since WORM is enabled on the volume, it gives the following error message `rm: cannot remove '/<mnt_point>/<file_name>': Read-only file system`
-
-Put some content into the file created above.
-
-        # echo "at the end of the file" >> <file_name>
-Now try editing the file by running the command below and verify that it fails with a `Read-only file system` error
-
-        # sed -i "1iAt the beginning of the file" <file_name>
-Now read the contents of the file and verify that the file can be read.
-
-        cat <file_name>
-
-`Note: If the WORM option is set on the volume before it is started, then the volume need not be restarted for the changes to take effect`.
diff --git a/doc/features/zerofill.md b/doc/features/zerofill.md
deleted file mode 100644
index c0f1fc5014c..00000000000
--- a/doc/features/zerofill.md
+++ /dev/null
@@ -1,26 +0,0 @@
-#zerofill API for GlusterFS
-The zerofill() API allows creation of pre-allocated and zeroed-out files on GlusterFS volumes by offloading the zeroing part to the server and/or storage (storage offloads use SCSI WRITESAME).
-## Description
-
-Zerofill writes zeroes to a file in the specified range. This fop is useful when a whole file needs to be initialized with zeroes (for example, for zero-filled VM disk image provisioning or during scrubbing of VM disk images).
-
-A client/application can issue this fop for zeroing out. The Gluster server will zero out the required range of bytes, i.e. the zeroing is offloaded to the server.
In the absence of this fop, the client/application has to repeatedly issue write (zero) fops to the server, which is very inefficient because of the overheads involved in RPC calls and acknowledgements.
-
-WRITESAME is a SCSI T10 command that takes a block of data as input and writes the same data to other blocks. This write is handled completely within the storage and hence is known as offload. Linux now has support for the SCSI WRITESAME command, which is exposed to the user in the form of the BLKZEROOUT ioctl. The BD xlator can exploit the BLKZEROOUT ioctl to implement this fop. Thus zeroing out operations can be completely offloaded to the storage device,
-making it highly efficient.
-
-The fop takes two arguments, offset and size. It zeroes out 'size' number of bytes in an opened file starting from the 'offset' position.
-This feature adds zerofill support to the following areas:
-> - libglusterfs
-- io-stats
-- performance/md-cache,open-behind
-- quota
-- cluster/afr,dht,stripe
-- rpc/xdr
-- protocol/client,server
-- io-threads
-- marker
-- storage/posix
-- libgfapi
-
-Client applications can exploit this fop by using glfs_zerofill, introduced in libgfapi. FUSE support for this fop has not been added, as there is no system call for it.