Diffstat (limited to 'under_review/dht-scalability.md')
1 files changed, 171 insertions, 0 deletions
diff --git a/under_review/dht-scalability.md b/under_review/dht-scalability.md
new file mode 100644
@@ -0,0 +1,171 @@
+More scalable DHT translator.
+Current DHT inhibits scalability by requiring that directories be on all
+subvolumes. In addition to the extra message traffic this incurs during
+*mkdir*, it adds significant complexity keeping all of the directories
+consistent across operations like *create* and *rename*. What is
+proposed, in a nutshell, is that directories should only exist on one
+subvolume, which might contain "stubs" pointing to files and directories
+that can be accessed by GFID on the same or other subvolumes. In concert
+with this, the way we store layout information needs to change, so that
+at least the "fix-layout" part of rebalancing need not involve
+traversing every directory on every subvolume.
+Shyam Ranganathan <email@example.com>
+Raghavendra Gowdappa <firstname.lastname@example.org>
+Proposed, awaiting summit for approval.
+Related Feature Requests and Bugs
+[Features/thousand-node-glusterd](../GLusterFS 3.6/Thousand Node Gluster.md)
+will define new ways of managing maintenance activities, some of which
+are DHT-related. Also, the new "mon cluster" might be where we store
+[Features/data-classification](../GLusterFS 3.7/Data Classification.md)
+also affects layout storage and use.
+Under this scheme, path-based lookup becomes very different. Currently,
+we look up a path on the file's "hashed" subvol first (according to
+parent-directory layout and file GFID). If it's not there, we need to
+look elsewhere - in the worst case on **all** subvolumes. In the future,
+our first lookup should be at the parent directory's subvolume. If the
+file is not there, it's not linked anywhere (though it might still exist
+unlinked and accessible by GFID) and we can terminate immediately. If it
+is there, then that single copy of the parent directory will contain a
+"stub" giving the file's GFID and a hint for what subvolume it's on
+(similar to a current linkfile). That information can then be used to
+initiate a GFID-based lookup. Many other code paths, such as *rename*,
+can leverage this new infrastructure to avoid current problems with
+multiple directory entries and linkfiles all for the same actual file.
+A possible enhancement would be to include more information in stubs,
+allowing readdirp to operate only on the directory and avoid going to
+every subvolume for information about individual files. Also, some
+secondary issues such as hard links and garbage collection (of unlinked
+but still open files) remain TBD in the final design.
+With respect to layout storage, the basic idea is to store a fairly
+small number of actual layouts - default, user defined, or related to
+data classification - that are each shared across many directories.
+These layouts are stored as part of our configuration, and the xattrs on
+individual directories need only specify a shared layout ID (plus
+possibly some additional "tweak" parameters) instead of a full explicit
+layout. When we do any kind of rebalancing, we need only change the
+shared layouts and not the pointers on each directory.
+Benefit to GlusterFS
+Improved scalability and performance for all directory-entry operations.
+Improved reliability for conflicting directory-entry operations, and for
+Almost instantaneous "fix-layout" completion.
+### Nature of proposed change
+Due to the complexity of the changes involved, this will probably be a
+new translator developed using a similar model to that used for AFR2.
+While it's likely to share/borrow a significant amount of code from
+current DHT, the new version will go through most of its development
+lifecycle separately and then completely supplant the old version, as
+opposed to integrating individual changes. For testing of
+compatibility/migration, it is probably desirable for both versions to
+coexist in the source tree and packages, but not necessarily in the same
+### Implications on manageability
+New/different options, but otherwise no change.
+### Implications on presentation layer
+No change. At this level the new DHT translator should be a plug-in
+replacement for the old one.
+### Implications on persistence layer
+None, unless you count reduced xattr usage.
+### Implications on 'GlusterFS' backend
+This will fundamentally change the directory structure on our back end.
+A file that is currently visible within a brick as \$BRICK\_ROOT/a/b/c
+might now be visible only as \$GFID\_FOR\_B/c with neither of the parent
+directories even present on that brick. Even that "file" will actually
+be a stub containing only the file's own GFID; to get the contents, one
+would need to look up that GFID in .glusterfs on this or some other
+Linkfiles will be gone, also subsumed by stubs.
+### Modification to GlusterFS metadata
+Explicit layouts will be replaced by IDs for shared layouts (in config
+### Implications on 'glusterd'
+Minimal changes (mostly volfile generation).
+How To Test
+Most existing DHT tests should suffice, except for those that depend on
+the details of how layouts are stored and formatted. Those will have to
+be modified to go through the extra layer of indirection to where the
+actual layouts live.
+None, except for better performance and less lost data.
+See "related features" section.
+TBD. There should be very little at the user level, though hopefully
+this time we'll do better at documenting the things developers
+(including add-on tool developers) need to know.
+Design in progress
+Design and some notes can be found here,
+Thread at gluster-devel opening this up for discussion here,
+Comments and Discussion