author    Humble Devassy Chirammal <hchiramm@redhat.com>    2015-03-30 12:21:05 +0530
committer Kaleb KEITHLEY <kkeithle@redhat.com>    2015-03-30 05:51:12 -0700
commit    8907f67ba215172b01a7018adcbb063fcc4570e9 (patch)
tree      68cf6557990ae4215926068625a7b5e0e1374882 /doc/developer-guide/afr
parent    e3bd2387a5973df4548fe4a62b5dfc227a2bdc64 (diff)
doc: restructure developer docs to new layout
The developer oriented information is scattered in source and its very
difficult to identify which are those. With this patch subdirs are created
under developer-guide which will be the parent for developer notes. The
changes suggested in http://review.gluster.org/#/c/8827/ are also included
in this patch.

Change-Id: I4c8510d52c49f4066225f72cac8f97f087d6c70c
BUG: 1206539
Signed-off-by: Humble Devassy Chirammal <hchiramm@redhat.com>
Reviewed-on: http://review.gluster.org/10038
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Lalatendu Mohanty <lmohanty@redhat.com>
Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>
Diffstat (limited to 'doc/developer-guide/afr')
-rw-r--r--  doc/developer-guide/afr/afr-locks-evolution.md   91
-rw-r--r--  doc/developer-guide/afr/afr.md                  191
-rw-r--r--  doc/developer-guide/afr/self-heal-daemon.md      91
3 files changed, 373 insertions, 0 deletions
diff --git a/doc/developer-guide/afr/afr-locks-evolution.md b/doc/developer-guide/afr/afr-locks-evolution.md
new file mode 100644
index 00000000000..7d2a136d871
--- /dev/null
+++ b/doc/developer-guide/afr/afr-locks-evolution.md
@@ -0,0 +1,91 @@
+History of locking in AFR
+--------------------------
+
+GlusterFS has a **locks** translator which provides two internal locking operations, `inodelk` and `entrylk`, which are used by AFR to synchronize operations on files or directories that conflict with each other.
+
+`Inodelk` allows translators in GlusterFS to obtain range locks (denoted by a tuple of **offset** and **length**) in a given **domain** for an inode.
+A full-file lock is denoted by the tuple (offset: `0`, length: `0`), i.e. a length of `0` is treated as infinity.
+
+`Entrylk` enables translators of GlusterFS to obtain locks on a `name` in a given **domain** for an inode, typically a directory.
+
+The **locks** translator provides both *blocking* and *non-blocking* variants of these locks.
+
+
+AFR makes use of the locks xlator extensively:
+
+1) For FOPs (from clients)
+-----------------------
+* Data transactions take inode locks on the data domain; let's refer to this domain as DATA_DOMAIN.
+
+  So locking for writes would be something like this: `inodelk(offset, length, DATA_DOMAIN)`.
+  For truncating a file to zero, it would be `inodelk(0, 0, DATA_DOMAIN)`.
+
+* Metadata transactions (chown/chmod) also take inode locks, but on a special range in the metadata domain,
+  i.e. `(LLONG_MAX-1, 0, METADATA_DOMAIN)`.
+
+* Entry transactions (create, mkdir, rmdir, unlink, symlink, link, rename) take an entrylk on `(name, parent inode)`. (A small sketch of these lock requests follows this list.)
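+
+To make this concrete, here is a small illustrative sketch (plain Python, not GlusterFS code) of the lock requests described in the list above; the `InodeLock`/`EntryLock` tuples and the helper names are invented for illustration only:
+
+```
+# Illustrative only: models the lock requests AFR issues for a few
+# representative FOPs, using the conceptual domain names from this document.
+from collections import namedtuple
+
+LLONG_MAX = 2**63 - 1   # mirrors the C constant used for the metadata range
+
+InodeLock = namedtuple("InodeLock", "offset length domain")
+EntryLock = namedtuple("EntryLock", "name parent_inode")
+
+def data_lock(offset, length):
+    # writev: lock just the byte range being written; (0, 0) means the whole file
+    return InodeLock(offset, length, "DATA_DOMAIN")
+
+def metadata_lock():
+    # chown/chmod: a special range in the metadata domain
+    return InodeLock(LLONG_MAX - 1, 0, "METADATA_DOMAIN")
+
+def entry_lock(name, parent_inode):
+    # create/mkdir/unlink/...: lock the (name, parent inode) pair
+    return EntryLock(name, parent_inode)
+
+print(data_lock(4096, 131072))        # a 128KB write at offset 4096
+print(data_lock(0, 0))                # truncate to zero: full-file lock
+print(metadata_lock())                # chmod/chown
+print(entry_lock("foo", "<parent>"))  # create "foo" under a directory
+```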
+
+
+2) For self-heal:
+-------------
+* For metadata self-heal, it is the same, i.e. `inodelk(LLONG_MAX-1, 0, METADATA_DOMAIN)`.
+* For entry self-heal, it is `entrylk(NULL name, parent inode)`. Specifying NULL for the name takes a full lock on the directory referred to by the inode.
+* For data self-heal, there is a bit of history as to how locks evolved:
+
+### Initial version (say version 1):
+There was no concept of a self-heal daemon (shd). Only client lookups triggered heals, so AFR always took `inodelk(0,0,DATA_DOMAIN)` for healing. The issue with this approach was that while a heal was in progress, I/O from clients was blocked.
+
+### Version 2:
+The shd was introduced. We needed to allow I/O to go through while a heal was in progress, provided the ranges did not overlap. To that end, the following approach was adopted:
+
++ 1. shd takes a full inodelk in DATA_DOMAIN. Thus client FOPs are blocked and cannot modify the changelog xattrs.
++ 2. shd inspects the xattrs to determine source/sink.
++ 3. shd takes a chunk inodelk (0-128KB), again in DATA_DOMAIN (the locks xlator allows overlapping locks if the lock owner is the same).
++ 4. shd unlocks the full lock.
++ 5. shd heals the chunk.
++ 6. shd takes the next chunk lock (the next 128KB).
++ 7. shd unlocks the first chunk lock, heals the second chunk, and so on.
+
+
+Thus after step 4, any client FOP could write to regions that were not currently under heal. The exception was truncate (to size 0), because it needs a full-file lock and will always block: some chunk is always locked by the shd until the heal completes.
+
+Another issue was that two shds could run in parallel. Say SHD1 and SHD2 compete for step 1 and SHD1 wins. It proceeds and completes step 4. Now SHD2 also succeeds in step 1 and continues through all the steps. Thus at the end both shds will decrement the changelog, leading to negative values in it.
+
+### Version 3
+To prevent parallel self-heals, another domain was introduced; let us call it SELF_HEAL_DOMAIN. With this domain, the following approach was adopted and is **the approach currently in use**:
+
++ 1. shd takes a full inodelk on SELF_HEAL_DOMAIN.
++ 2. shd takes a full inodelk on DATA_DOMAIN.
++ 3. shd inspects the xattrs to determine source/sink.
++ 4. shd unlocks the full lock on DATA_DOMAIN.
++ 5. shd takes a chunk lock (0-128KB) on DATA_DOMAIN.
++ 6. shd heals the chunk.
++ 7. shd takes the next chunk lock (the next 128KB) on DATA_DOMAIN.
++ 8. shd unlocks the first chunk lock, heals, and so on.
++ 9. Finally, shd releases the full lock on SELF_HEAL_DOMAIN.
+
+Thus until one shd completes step 9, another shd cannot start step 1, which solves the problem of simultaneous heals.
+Note that the issue of the truncate (to zero) FOP hanging still remains.
+Also, there are multiple network calls involved per chunk in this scheme (lock, heal (i.e. read + write), unlock), i.e. 4 calls per chunk. The whole sequence is sketched below.
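+
+Below is a minimal sketch of this version-3 sequence, purely for illustration; `inodelk()`, `unlock()`, `inspect_xattrs()` and `heal_chunk()` are hypothetical Python stand-ins for the real lock and heal plumbing, not actual shd code:
+
+```
+# Minimal stubs so the sketch runs stand-alone; in reality these are network
+# calls into the locks xlator and the read/write path.
+def inodelk(offset, length, domain): return (offset, length, domain)
+def unlock(lock): pass
+def inspect_xattrs(): return ("source-brick", ["sink-brick"])
+def heal_chunk(source, sinks, offset, length): pass
+
+CHUNK = 128 * 1024
+
+def data_self_heal_v3(file_size):
+    shd_lock  = inodelk(0, 0, "SELF_HEAL_DOMAIN")   # 1. serialize self-heal daemons
+    full_lock = inodelk(0, 0, "DATA_DOMAIN")        # 2. briefly block client FOPs
+    source, sinks = inspect_xattrs()                # 3. decide source/sink from changelogs
+    unlock(full_lock)                               # 4. let non-overlapping client I/O resume
+
+    offset = 0
+    prev = inodelk(offset, CHUNK, "DATA_DOMAIN")    # 5. lock the first chunk
+    while offset < file_size:
+        heal_chunk(source, sinks, offset, CHUNK)    # 6. copy this chunk source -> sinks
+        offset += CHUNK
+        if offset < file_size:
+            nxt = inodelk(offset, CHUNK, "DATA_DOMAIN")  # 7. lock the next chunk first...
+            unlock(prev)                                 # 8. ...then release the previous one
+            prev = nxt
+    unlock(prev)
+    unlock(shd_lock)                                # 9. allow the next shd to start
+
+data_self_heal_v3(512 * 1024)
+```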
+
+### Version 4 (ToDo)
+Some improvements that need to be made in version 3:
+* Reduce network calls using piggybacking.
+* After taking a chunk lock and healing, we need to unlock it before locking the next chunk. This gives a window for any pending truncate FOPs to succeed. If the truncate succeeds, the heal of the next chunk will fail (read returns zero)
+and the heal is stopped. *BUT* there is **yet another** issue:
+
+* shd does steps 1 to 4. Let's assume the source is brick b1 and the sink is brick b2, i.e. the xattrs are (0,1) and (0,0) on b1 and b2 respectively. Now before the shd takes the (0-128KB) lock, a client FOP takes it.
+It modifies data, but the FOP succeeds only on brick b2. `writev` returns success, and the xattrs now read (0,1) and (1,0). The shd takes over and heals. It had observed (0,1),(0,0) earlier
+and thus goes ahead and copies the stale 128KB from brick b1 to brick b2. Thus as far as the application is concerned, `writev` returned success but the bricks have stale data.
+What needs to be done is that `writev` must return success only if it succeeded on at least one source brick (brick b1 in this case). Otherwise the heal still happens in the reverse direction, but as far as the application is concerned, it received an error.
+
+### Note on lock **domains**
+We have used conceptual names in this document like DATA_DOMAIN, METADATA_DOMAIN and SELF_HEAL_DOMAIN. In the code, these are mapped to strings that are based on the AFR xlator name, like so:
+
+DATA_DOMAIN      ---> "vol_name-replicate-n"
+
+METADATA_DOMAIN  ---> "vol_name-replicate-n:metadata"
+
+SELF_HEAL_DOMAIN ---> "vol_name-replicate-n:self-heal"
+
+where vol_name is the name of the volume and 'n' is the replica subvolume index (starting from 0).
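+
+As a small illustration (plain Python, not GlusterFS code; the volume name "testvol" and subvolume index 0 are examples), the mapping can be written as:
+
+```
+# Builds the conceptual-name -> on-wire domain string mapping described above.
+def afr_domains(vol_name, n):
+    base = "%s-replicate-%d" % (vol_name, n)
+    return {
+        "DATA_DOMAIN":      base,
+        "METADATA_DOMAIN":  base + ":metadata",
+        "SELF_HEAL_DOMAIN": base + ":self-heal",
+    }
+
+print(afr_domains("testvol", 0))
+# {'DATA_DOMAIN': 'testvol-replicate-0',
+#  'METADATA_DOMAIN': 'testvol-replicate-0:metadata',
+#  'SELF_HEAL_DOMAIN': 'testvol-replicate-0:self-heal'}
+```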
diff --git a/doc/developer-guide/afr/afr.md b/doc/developer-guide/afr/afr.md
new file mode 100644
index 00000000000..566573a4e26
--- /dev/null
+++ b/doc/developer-guide/afr/afr.md
@@ -0,0 +1,191 @@
+cluster/afr translator
+======================
+
+Locking
+-------
+
+Before understanding replicate, one must understand two internal FOPs:
+
+### `GF_FILE_LK`
+
+This is exactly like `fcntl(2)` locking, except the locks are in a
+separate domain from locks held by applications.
+
+### `GF_DIR_LK (loc_t *loc, char *basename)`
+
+This allows one to lock a name under a directory. For example,
+to lock /mnt/glusterfs/foo, one would use the call:
+
+```
+GF_DIR_LK ({loc_t for "/mnt/glusterfs"}, "foo")
+```
+
+If one wishes to lock *all* the names under a particular directory,
+supply the basename argument as `NULL`.
+
+The locks can either be read locks or write locks; consult the
+function prototype for more details.
+
+Both these operations are implemented by the features/locks (earlier
+known as posix-locks) translator.
+
+Basic design
+------------
+
+All FOPs can be classified into four major groups (plus a catch-all group of others):
+
+### inode-read
+
+Operations that read an inode's data (file contents) or metadata (perms, etc.).
+
+access, getxattr, fstat, readlink, readv, stat.
+
+### inode-write
+
+Operations that modify an inode's data or metadata.
+
+chmod, chown, truncate, writev, utimens.
+
+### dir-read
+
+Operations that read a directory's contents or metadata.
+
+readdir, getdents, checksum.
+
+### dir-write
+
+Operations that modify a directory's contents or metadata.
+
+create, link, mkdir, mknod, rename, rmdir, symlink, unlink.
+
+Some of these make a subgroup in that they modify *two* different entries:
+link, rename, symlink.
+
+### Others
+
+Other operations.
+
+flush, lookup, open, opendir, statfs.
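+
+For quick reference while reading the algorithms below, the same classification can be written as a lookup table (illustrative Python, not taken from the source tree):
+
+```
+# The FOP groups described above, as a dictionary.
+FOP_GROUPS = {
+    "inode-read":  ["access", "getxattr", "fstat", "readlink", "readv", "stat"],
+    "inode-write": ["chmod", "chown", "truncate", "writev", "utimens"],
+    "dir-read":    ["readdir", "getdents", "checksum"],
+    "dir-write":   ["create", "link", "mkdir", "mknod", "rename", "rmdir",
+                    "symlink", "unlink"],
+    "others":      ["flush", "lookup", "open", "opendir", "statfs"],
+}
+
+# dir-write FOPs that modify *two* different entries and therefore lock both names:
+TWO_NAME_FOPS = {"link", "rename", "symlink"}
+```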
+
+Algorithms
+----------
+
+Each of the four major groups has its own algorithm:
+
+### inode-read, dir-read
+
+1. Send a request to the first child that is up:
+   * if it fails:
+     * try the next available child
+   * if we have exhausted all children:
+     * return failure
+
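+A sketch of this read fallback (illustrative Python; `children` and the `send` callable are hypothetical stand-ins for the client xlators) might look like:
+
+```
+# First-up child wins; on failure, fall through to the next available child.
+def read_fop(children, send):
+    for child in children:
+        if not child.is_up:
+            continue
+        ok, reply = send(child)          # forward the FOP to this child
+        if ok:
+            return reply                 # success: done
+    raise IOError("all children exhausted")   # every child failed
+```
+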
+### inode-write
+
+ All operations are done in parallel unless specified otherwise. (A sketch of the whole transaction follows the list below.)
+
+1. Send a `GF_FILE_LK` request on all children for a write lock on the
+   appropriate region
+   (for metadata operations: the entire file, i.e. (0, 0); for writev:
+   (offset, offset + size of buffer)).
+   * If a lock request fails on a child:
+     * unlock all children
+     * try to acquire a blocking lock (`F_SETLKW`) on each child, serially.
+       If this fails (due to `ENOTCONN` or `EINVAL`):
+       consider this child as dead for the rest of the transaction.
+2. Mark all children as "pending" on all (alive) children (see below for the
+   meaning of "pending").
+   * If it fails on any child:
+     * mark it as dead (in transaction local state).
+3. Perform the operation on all (alive) children.
+   * If it fails on any child:
+     * mark it as dead (in transaction local state).
+4. Unmark the "pending" flag, for all children on which the operation succeeded, on all (alive) children.
+5. Unlock the region on all (alive) children.
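+
+The following sketch (illustrative Python; `lock`, `unlock`, `mark_pending`, `clear_pending` and `perform` are hypothetical callables standing in for GF_FILE_LK, the pending-xattr updates and the FOP itself) shows the shape of the transaction:
+
+```
+# A rough model of the inode-write transaction described above.
+def inode_write_transaction(children, region, lock, unlock,
+                            mark_pending, clear_pending, perform):
+    # 1. Write-lock the region on every child; fall back to serial blocking
+    #    locks if any non-blocking request fails.
+    alive = [c for c in children if lock(c, region)]
+    if len(alive) != len(children):
+        for c in alive:
+            unlock(c, region)
+        alive = [c for c in children if lock(c, region, blocking=True)]
+
+    # 2. Mark all children as "pending" on every alive child.
+    for c in list(alive):
+        if not mark_pending(c, children):
+            alive.remove(c)                      # treat as dead for this transaction
+
+    # 3. Perform the operation on every alive child.
+    succeeded = []
+    for c in list(alive):
+        if perform(c):
+            succeeded.append(c)
+        else:
+            alive.remove(c)
+
+    # 4. Clear the "pending" marks of the successful children, everywhere.
+    for c in alive:
+        clear_pending(c, succeeded)
+
+    # 5. Release the locks.
+    for c in alive:
+        unlock(c, region)
+    return succeeded
+```
+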
+
+### dir-write
+
+ The algorithm for dir-write is the same as above, except that instead of holding
+ `GF_FILE_LK` locks we hold a `GF_DIR_LK` lock on the name being operated upon.
+ In the case of link-type calls, we hold locks on both operand names.
+
+"pending"
+---------
+
+The "pending" number is like a journal entry. A pending entry is an
+array of 32-bit integers stored in network byte-order as the extended
+attribute of an inode (which can be a directory as well).
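+
+As a small aside, here is how such an array of network-byte-order 32-bit counters could be packed and unpacked (illustrative Python; the three-element layout below is just an example, not the exact on-disk format):
+
+```
+import struct
+
+# Pack a list of pending counters into network byte order ('!'), 4 bytes each.
+def pack_pending(counts):
+    return struct.pack("!%dI" % len(counts), *counts)
+
+def unpack_pending(blob):
+    return list(struct.unpack("!%dI" % (len(blob) // 4), blob))
+
+raw = pack_pending([1, 0, 0])   # e.g. one pending operation recorded
+print(raw.hex())                # 000000010000000000000000
+print(unpack_pending(raw))      # [1, 0, 0]
+```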
+
+There are three keys corresponding to three types of pending operations:
+
+### `AFR_METADATA_PENDING`
+
+There are some metadata operations pending on this inode (perms, ctime/mtime,
+xattr, etc.).
+
+### `AFR_DATA_PENDING`
+
+There is some data pending on this inode (writev).
+
+### `AFR_ENTRY_PENDING`
+
+There are some directory operations pending on this directory
+(create, unlink, etc.).
+
+Self heal
+---------
+
+* On lookup, gather extended attribute data:
+  * If the entry is a regular file:
+    * If an entry is present on one child and not on others:
+      * create the entry on the others.
+    * If entries exist but have different metadata (perms, etc.):
+      * consider the entry with the highest `AFR_METADATA_PENDING` number as
+        definitive and replicate its attributes on the children.
+  * If the entry is a directory:
+    * Consider the entry with the highest `AFR_ENTRY_PENDING` number as
+      definitive and replicate its contents on all children.
+  * If any two entries have non-matching types (i.e., one is a file and the
+    other is a directory):
+    * Announce to the user via log that a split-brain situation has been
+      detected, and do nothing.
+* On open, gather extended attribute data:
+  * Consider the file with the highest `AFR_DATA_PENDING` number as
+    the definitive one and replicate its contents on all other
+    children.
+
+During all self heal operations, appropriate locks must be held on all
+regions/entries being affected.
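+
+The selection rule used above ("highest pending number wins") can be sketched as follows (illustrative Python; the input dictionary is an example, not a real xattr dump):
+
+```
+# Pick the child whose copy carries the highest AFR_*_PENDING value.
+def pick_definitive(pending_by_child):
+    return max(pending_by_child, key=pending_by_child.get)
+
+print(pick_definitive({"child-0": 3, "child-1": 0}))   # -> 'child-0'
+```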
+
+Inode scaling
+-------------
+
+Inode scaling is necessary because if a situation arises where an inode number
+is returned for a directory (by lookup) which was previously the inode number
+of a file (as per FUSE's table), then FUSE gets horribly confused (consult a
+FUSE expert for more details).
+
+To avoid such a situation, we distribute the 64-bit inode space equally
+among all children of replicate.
+
+To illustrate:
+
+If c1, c2, c3 are children of replicate, they each get 1/3 of the available
+inode space:
+
+------------- -- -- -- -- -- -- -- -- -- -- -- ---
+Child: c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 ...
+Inode number: 1 2 3 4 5 6 7 8 9 10 11 ...
+------------- -- -- -- -- -- -- -- -- -- -- -- ---
+
+Thus, if lookup on c1 returns an inode number "2", it is scaled to "4"
+(which is the second inode number in c1's space).
+
+This way we ensure that there is never a collision of inode numbers from
+two different children.
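+
+The scaling rule can be written down as a one-line formula (illustrative Python; children are numbered starting from 1 here to match the example above):
+
+```
+# The k-th inode of child i maps to the global number i + (k-1)*n_children.
+def scale_inode(local_ino, child_index, n_children):
+    return (local_ino - 1) * n_children + child_index
+
+print(scale_inode(2, 1, 3))   # lookup on c1 returned 2 -> scaled to 4
+print(scale_inode(2, 2, 3))   # the same local number on c2 -> 5, so no collision
+```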
+
+This reduction of inode space doesn't really reduce the usability of
+replicate since even if we assume replicate has 1024 children (which would be a
+highly unusual scenario), each child still has a 54-bit inode space:
+$2^{54} \sim 1.8 \times 10^{16}$, which is much larger than any real
+world requirement.
diff --git a/doc/developer-guide/afr/self-heal-daemon.md b/doc/developer-guide/afr/self-heal-daemon.md
new file mode 100644
index 00000000000..1fd8f08062a
--- /dev/null
+++ b/doc/developer-guide/afr/self-heal-daemon.md
@@ -0,0 +1,91 @@
+Self-Heal Daemon
+================
+The self-heal daemon (shd) is a glusterfs process that is responsible for healing files in a replicate/disperse gluster volume.
+Every server (brick) node of the volume runs one instance of the shd. So even if one node contains replicate/disperse bricks of
+multiple volumes, they are all healed by the same shd.
+
+This document only describes how the shd works for replicate (AFR) volumes.
+
+The shd is launched by glusterd when the volume starts (only if the volume includes a replicate configuration). The graph
+of the shd process in every node contains the following: the io-stats xlator at the top, its children being the
+replicate xlators (subvolumes) of *only* the bricks present on that particular node, and finally *all* the client xlators that are the children of those replicate xlators.
+
+The shd does two types of self-heal crawls: Index heal and Full heal. For both these types of crawls, the basic idea is the same:
+For each file encountered while crawling, perform metadata, data and entry heals under appropriate locks.
+* An overview of how each of these heals is performed is detailed in the 'Self-healing' section of *doc/features/afr-v1.md*
+* The different file locks which the shd takes for each of these heals are detailed in *doc/code/xlators/cluster/afr/afr-locks-evolution.md*
+
+Metadata heal refers to healing extended attributes, mode and permissions of a file or directory.
+Data heal refers to healing the file contents.
+Entry self-heal refers to healing entries inside a directory.
+
+Index heal
+==========
+The index heal is done:
+ a) Every 600 seconds (can be changed via the `cluster.heal-timeout` volume option)
+ b) When it is explicitly triggered via the `gluster vol heal <VOLNAME>` command
+ c) Whenever a replica brick that was down comes back up.
+
+Only one heal can be in progress at a time, irrespective of the reason it was triggered. If another heal is triggered before the first one completes, it will be queued.
+Only one heal can be queued while the first one is running. If an index heal is queued, it can be overridden by queuing a full heal, but not vice-versa. Also, before processing
+each entry in an index heal, a check is made whether a full heal is queued. If it is, the index heal is aborted so that the full heal can proceed.
+
+In an index heal, each shd reads the entries present inside the .glusterfs/indices/xattrop folder and triggers a heal on each entry under the appropriate locks.
+The .glusterfs/indices/xattrop directory contains a base entry named "xattrop-<virtual-gfid-string>". All other entries are hardlinks to the base entry. The
+*names* of the hardlinks are the gfid strings of the files that may need heal.
+
+When a client (mount) performs an operation on a file, the index xlator present in each brick process adds the hardlink in the pre-op phase of the FOP's transaction
+and removes it in the post-op phase if the operation is successful. Thus if an entry is present inside the .glusterfs/indices/xattrop directory when there is no I/O
+happening on the file, it means the file needs healing (or at least an examination, if the brick crashed after the operation completed but just before the hardlink could be removed).
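+
+As an illustration of what the crawl looks at (plain Python, not shd code; the brick path is an assumption for the example), the pending gfids can be listed like this:
+
+<pre><code>
+import os
+
+# Yield the gfid strings recorded in a brick's xattrop index directory.
+def pending_gfids(brick_path):
+    xattrop_dir = os.path.join(brick_path, ".glusterfs", "indices", "xattrop")
+    for name in os.listdir(xattrop_dir):
+        if name.startswith("xattrop-"):
+            continue            # the base entry; every other name is a hardlink to it
+        yield name              # gfid of a file that may need heal
+
+# Example: print(list(pending_gfids("/bricks/brick1")))
+</code></pre>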
+
+#### Index heal steps:
+<pre><code>
+In shd process of *each node* {
+    opendir + readdir (.glusterfs/indices/xattrop)
+    for each entry inside it {
+        self_heal_entry() // Explained below.
+    }
+}
+</code></pre>
+
+<pre><code>
+self_heal_entry() {
+    Call syncop_lookup(replica subvolume) which eventually does {
+        take appropriate locks
+        determine source and sinks from AFR changelog xattrs
+        perform whatever heal is needed (any of metadata, data and entry heal, in that order)
+        clear changelog xattrs and the hardlink inside .glusterfs/indices/xattrop
+    }
+}
+</code></pre>
+
+Note:
+* If the gfid hardlink is present in the .glusterfs/indices/xattrop of both replica bricks, then each shd will try to heal the file but only one of them will be able to proceed due to the self-heal domain lock.
+
+* While processing entries inside .glusterfs/indices/xattrop, if the shd encounters an entry whose parent is yet to be healed, it will skip that entry; it will be picked up in the next crawl.
+
+* If a file is in data/metadata split-brain, it will not be healed.
+
+* If a directory is in entry split-brain, a conservative merge will be performed, wherein after the merge, the entries of the directory will be a union of the entries in the replica pairs.
+
+Full heal
+=========
+A full heal is triggered by running `gluster vol heal <VOLNAME> full`. This command is usually run in disk replacement scenarios where the entire data is to be copied from one of the healthy bricks of the replica to the brick that was just replaced.
+
+Unlike the index heal, which runs on the shd of every node of a replicate subvolume, the full heal is run only on the shd of one node per replicate subvolume: the node having the highest UUID.
+i.e. in a 2x2 volume made of 4 nodes N1, N2, N3 and N4, if UUID of N1 > UUID of N2 and UUID of N4 > UUID of N3, then the full crawl is carried out by the shds of N1 and N4. (The node UUID can be found in `/var/lib/glusterd/glusterd.info`.)
+
+The full heal steps are almost identical to the index heal, except the heal is performed on each replica starting from the root of the volume:
+<pre><code>
+In shd process of *highest UUID node per replica* {
+    opendir + readdir ("/")
+    for each entry inside it {
+        self_heal_entry()
+        if (entry == directory) {
+            /* Recurse */
+            again opendir + readdir (directory) followed by self_heal_entry() of each entry.
+        }
+    }
+}
+</code></pre>