| author | Jeff Darcy <jdarcy@redhat.com> | 2014-07-04 19:40:14 -0700 |
|---|---|---|
| committer | Vijay Bellur <vbellur@redhat.com> | 2015-01-19 03:27:48 -0800 |
| commit | a29d6c80dfbd848e0b453d64faf761fe52b4d6c5 | |
| tree | 2193cd19ee645464d221e730a93073b547db79dd | |
| parent | b45f623a7a8e14ca09a10c6a04c4c5f81e3a62e2 | |
doc: add documentation for DHT
Change-Id: Iaa1ea72499a81134eb57a15867e0d08dd9c55bbd
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-on: http://review.gluster.org/8240
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: N Balachandran <nbalacha@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Diffstat (limited to 'doc')
| -rw-r--r-- | doc/features/dht.md | 223 | 
1 files changed, 223 insertions, 0 deletions
diff --git a/doc/features/dht.md b/doc/features/dht.md
new file mode 100644
index 00000000000..c35dd6d0c27
--- /dev/null
+++ b/doc/features/dht.md
@@ -0,0 +1,223 @@

# How GlusterFS Distribution Works

The defining feature of any scale-out system is its ability to distribute work or data among many servers. Accordingly, people in the distributed-system community have developed many powerful techniques to perform such distribution, but those techniques often remain little known or understood even among other members of the file system and database communities that could benefit from them. This confusion is reflected even in the name of the GlusterFS component that performs distribution - DHT, which stands for Distributed Hash Table but is not actually a DHT as that term is most commonly used or defined. The way GlusterFS's DHT works is based on a few basic principles:

 * All operations are driven by clients, which are all equal. There are no special nodes with special knowledge of where files are or should be.

 * Directories exist on all subvolumes (bricks or lower-level aggregations of bricks); files exist on only one.

 * Files are assigned to subvolumes based on *consistent hashing*, and even more specifically a form of consistent hashing exemplified by Amazon's [Dynamo][dynamo].

The result of all this is that users are presented with a set of files that is the union of the files present on all subvolumes. The following sections describe how this "uniting" process actually works.

## Layouts

The conceptual basis of Dynamo-style consistent hashing is a set of numbers arranged around a circle, like a clock. First, the circle is divided into segments and those segments are assigned to bricks. (For the sake of simplicity we'll use "bricks" hereafter even though they might actually be replicated/striped subvolumes.) Several factors guide this assignment.

 * Assignments are done separately for each directory.

 * Historically, segments have all been the same size. However, this can lead to smaller bricks becoming full while plenty of space remains on larger ones. If the *cluster.weighted-rebalance* option is set, segment sizes will be proportional to brick sizes.

 * Assignments need not include all bricks in the volume. If the *cluster.subvols-per-directory* option is set, only that many bricks will receive assignments for that directory.

However these assignments are done, they collectively become what we call a *layout* for a directory. This layout is then stored using extended attributes, with each brick's copy of that extended attribute on that directory consisting of four 32-bit fields.

 * A version, which might be DHT\_HASH\_TYPE\_DM to represent an assignment as described above, or DHT\_HASH\_TYPE\_DM\_USER to represent an assignment made manually by the user (or external script).

 * A "commit hash", which will be described later.

 * The first number in the assigned range (segment).

 * The last number in the assigned range.

For example, the extended attributes representing a weighted assignment between three bricks, one twice as big as the others, might look like this.

 * Brick A (the large one): DHT\_HASH\_TYPE\_DM 1234 0x00000000 0x7fffffff

 * Brick B: DHT\_HASH\_TYPE\_DM 1234 0x80000000 0xbfffffff

 * Brick C: DHT\_HASH\_TYPE\_DM 1234 0xc0000000 0xffffffff
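
To make the weighted-assignment arithmetic concrete, here is a minimal Python sketch. It is not GlusterFS code; `assign_layout` and its brick-weight input are invented for illustration. It simply splits the 32-bit ring among bricks in proportion to their sizes and reproduces the ranges shown above.

```python
# Illustrative sketch only -- not the actual GlusterFS implementation.
def assign_layout(brick_sizes):
    """Split the 32-bit hash ring among bricks proportionally to size."""
    ring_size = 1 << 32
    total = sum(brick_sizes.values())
    layout = {}
    start = 0
    names = list(brick_sizes)
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = ring_size - 1          # last brick absorbs any rounding
        else:
            end = start + (ring_size * brick_sizes[name]) // total - 1
        layout[name] = (start, end)
        start = end + 1
    return layout

if __name__ == "__main__":
    # Brick A is twice the size of B and C, as in the example above.
    for brick, (lo, hi) in assign_layout({"A": 2, "B": 1, "C": 1}).items():
        print(f"Brick {brick}: 0x{lo:08x} 0x{hi:08x}")
```

Running this prints the same three ranges as the example, with Brick A covering half the ring and B and C a quarter each.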
## Placing Files

To place a file in a directory, we first need a layout for that directory - as described above. Next, we calculate a hash for the file. To minimize collisions, either between files in the same directory with different names or between files in different directories with the same name, this hash is generated using both the (containing) directory's unique GFID and the file's name. This hash is then matched to one of the layout assignments to yield what we call a *hashed location*. For example, consider the layout shown above. The hash 0xabad1dea is between 0x80000000 and 0xbfffffff, so the corresponding file's hashed location would be on Brick B. A second file with a hash of 0xfaceb00c would be assigned to Brick C by the same reasoning.

## Looking Up Files

Because layout assignments might change, especially as bricks are added or removed, finding a file involves more than calculating its hashed location and looking there. That is in fact the first step, and it works most of the time - i.e. the file is found where we expected it to be - but there are a few more steps when that's not the case. Historically, the next step has been to look for the file **everywhere** - i.e. to broadcast our lookup request to all subvolumes. If the file isn't found that way, it doesn't exist. At this point, an open that requires the file's presence will fail, or a create/mkdir that requires its absence will be allowed to continue.

Regardless of whether a file is found at its hashed location or elsewhere, we now know its *cached location*. As the name implies, this is stored within DHT to satisfy future lookups. If it's not the same as the hashed location, we also take an extra step: the creation of a *linkfile*, which is a special stub left at the **hashed** location pointing to the **cached** location. Therefore, if a client naively looks for a file at its hashed location and finds a linkfile instead, it can follow that linkfile to the file's real location instead of needing to inquire everywhere.
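
The placement and lookup order described in the last two sections can be summarized in a rough Python sketch. This is not the DHT implementation: `hash32`, the dictionary-based layout and brick contents, and the linkfile representation are all stand-ins for DHT's real hash function and xattr machinery, but the decision order (hashed location, then linkfile, then broadcast) follows the text above.

```python
# Illustrative sketch, not GlusterFS source.
import hashlib

def hash32(gfid, name):
    """Map (parent GFID, file name) to a 32-bit value on the ring."""
    digest = hashlib.md5(gfid.encode() + b"/" + name.encode()).digest()
    return int.from_bytes(digest[:4], "big")

def hashed_subvol(layout, gfid, name):
    """Return the brick whose range contains the file's hash.
    Assumes the layout covers the whole ring with no holes."""
    h = hash32(gfid, name)
    for brick, (start, end) in layout.items():
        if start <= h <= end:
            return brick

def lookup(layout, gfid, name, bricks):
    """Hashed location first, then linkfile, then broadcast to everyone."""
    hashed = hashed_subvol(layout, gfid, name)
    entry = bricks[hashed].get(name)
    if entry is None:
        # Broadcast lookup: ask every subvolume for the real file.
        for brick, contents in bricks.items():
            if name in contents and not contents[name].get("linkfile"):
                return brick            # becomes the cached location
        return None                     # file does not exist anywhere
    if entry.get("linkfile"):
        return entry["points_to"]       # follow the stub to the cached location
    return hashed                       # found right where the hash says

if __name__ == "__main__":
    layout = {"A": (0x00000000, 0x7fffffff),
              "B": (0x80000000, 0xbfffffff),
              "C": (0xc0000000, 0xffffffff)}
    bricks = {"A": {}, "B": {"data.txt": {}}, "C": {}}
    print(lookup(layout, "parent-gfid", "data.txt", bricks))   # prints "B"
```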
## Rebalancing

As bricks are added or removed, or files are renamed, many files can end up somewhere other than at their hashed locations. When this happens, the volume needs to be rebalanced. This process consists of two parts.

 1. Calculate new layouts, according to the current set of bricks (and possibly their characteristics). We call this the "fix-layout" phase.

 2. Migrate any "misplaced" files to their correct (hashed) locations, and clean up any linkfiles which are no longer necessary. We call this the "migrate-data" phase.

Usually, these two phases are done together. (In fact, the code for them is somewhat intermingled.) However, the migrate-data phase can involve a lot of I/O and be very disruptive, so users can do just the fix-layout phase and defer migrate-data until a more convenient time. This allows new files to be placed on new bricks, even though old files might still be in the "wrong" place.

When calculating a new layout to replace an old one, DHT specifically tries to maximize overlap of the assigned ranges, thus minimizing data movement. The difference this makes can be very large. For example, consider the case where our example layout from earlier is updated to add a new double-sized brick. Here's a very inefficient way to do that.

 * Brick A (the large one): 0x00000000 to 0x55555555

 * Brick B: 0x55555556 to 0x7fffffff

 * Brick C: 0x80000000 to 0xaaaaaaaa

 * Brick D (the new one): 0xaaaaaaab to 0xffffffff

This would cause files in the following ranges to be migrated:

 * 0x55555556 to 0x7fffffff (from A to B)

 * 0x80000000 to 0xaaaaaaaa (from B to C)

 * 0xaaaaaaab to 0xbfffffff (from B to D)

 * 0xc0000000 to 0xffffffff (from C to D)

As a historical note, this is exactly what we used to do, and in this case it would have meant moving 8/12 of all files in the volume. Now let's consider a new layout that's optimized to maximize overlap with the old one.

 * Brick A: 0x00000000 to 0x55555555

 * Brick D: 0x55555556 to 0xaaaaaaaa  <- optimized insertion point

 * Brick B: 0xaaaaaaab to 0xd5555554

 * Brick C: 0xd5555555 to 0xffffffff

In this case we only need to move 5/12 of all files. In a volume with millions or even billions of files, reducing data movement by 3/12 of all files is a pretty big improvement. In the future, DHT might use "virtual node IDs" or multiple hash rings to make rebalancing even more efficient.

## Rename Optimizations

With the file-lookup mechanisms we already have in place, it's not necessary to move a file from one brick to another when it's renamed - even across directories. It will still be found, albeit a little less efficiently. The first client to look for it after the rename will add a linkfile, which every other client will follow from then on. Also, every client that has found the file once will continue to find it based on its cached location, without any network traffic at all. Because the extra lookup cost is small, and the movement cost might be very large, DHT renames the file "in place" on its current brick instead (taking advantage of the fact that directories exist everywhere).

This optimization is further extended to handle cases where renames are very common. For example, rsync and similar tools often use a "write new then rename" idiom in which a file "xxx" is actually written as ".xxx.1234" and then moved into place only after its contents have been fully written. To make this process more efficient, DHT uses a regular expression to separate the permanent part of a file's name (in this case "xxx") from what is likely to be a temporary part (the leading "." and trailing ".1234"). That way, after the file is renamed it will be in its correct hashed location - which it wouldn't be otherwise if "xxx" and ".xxx.1234" hash differently - and no linkfiles or broadcast lookups will be necessary.

In fact, there are two regular expressions available for this purpose - *cluster.rsync-hash-regex* and *cluster.extra-hash-regex*. As its name implies, *rsync-hash-regex* defaults to the pattern that rsync uses, while *extra-hash-regex* can be set by the user to support a second tool using the same temporary-file idiom.
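
As a rough illustration of this temporary-name handling, the sketch below uses a regular expression of the general kind described; the pattern shown is an approximation invented for this example, not necessarily the actual default value of *cluster.rsync-hash-regex*.

```python
# Illustrative only: an approximation of stripping the temporary parts of an
# rsync-style name before hashing. Not the real default of rsync-hash-regex.
import re

# Match names like ".xxx.1234": a leading dot, the permanent part,
# then a dot and a trailing suffix.
TEMP_NAME = re.compile(r"^\.(.+)\.[^.]+$")

def name_to_hash(filename):
    """Return the portion of the name that should be hashed."""
    m = TEMP_NAME.match(filename)
    return m.group(1) if m else filename

assert name_to_hash(".xxx.1234") == "xxx"   # temporary rsync-style name
assert name_to_hash("xxx") == "xxx"         # normal name is hashed as-is
```

With a rule like this, "xxx" and ".xxx.1234" hash to the same value, so the renamed file is already at its hashed location.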
## Commit Hashes

A very recent addition to DHT's algorithmic arsenal is intended to reduce the number of "broadcast" lookups that it issues. If a volume is completely in balance, then no file could exist anywhere but at its hashed location. Therefore, if we've already looked there and not found it, then looking elsewhere would be pointless (and wasteful). The *commit hash* mechanism is used to detect this case. A commit hash is assigned to a volume, and separately to each directory, and then updated according to the following rules.

 * The volume commit hash is changed whenever actions are taken that might cause layout assignments across all directories to become invalid - i.e. bricks being added, removed, or replaced.

 * The directory commit hash is changed whenever actions are taken that might cause files to be "misplaced" - e.g. when they're renamed.

 * The directory commit hash is set to the volume commit hash when the directory is created, and whenever the directory is fully rebalanced so that all files are at their hashed locations.

In other words, a change to either the volume commit hash or a directory's commit hash creates a mismatch between the two. In that case we revert to the "pessimistic" broadcast-lookup method described earlier. However, if the two hashes match then we can skip the broadcast lookup and return a result immediately. This has been observed to cause a 3x performance improvement in workloads that involve creating many small files across many bricks.

[dynamo]: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
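
To tie these rules together, here is a small, purely illustrative Python sketch of the commit-hash comparison; the function name and parameters are invented, but the decision it encodes is the one described above.

```python
# Illustrative sketch of the commit-hash decision, not actual DHT code.
def lookup_everywhere_needed(volume_commit_hash, dir_commit_hash,
                             found_at_hashed_location):
    """Decide whether a missed lookup must be broadcast to all subvolumes."""
    if found_at_hashed_location:
        return False                # nothing left to search for
    if dir_commit_hash == volume_commit_hash:
        # Directory is known to be fully balanced: a file that is not at its
        # hashed location cannot exist anywhere else, so skip the broadcast.
        return False
    return True                     # hashes differ: fall back to broadcast
```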
