author Kaushal M <kshlmster@gmail.com> 2016-01-20 13:09:23 +0530
committer Raghavendra Talur <rtalur@redhat.com> 2016-01-27 22:23:22 -0800
commit 601bfa2719d8c9be40982b8a6526c21cd0ea4966
tree d7d37c778873907ca98cf7c9961bc9686c3d182f /under_review/dht-scalability.md
parent 063b5556d7271bfe06ec80b6a1957fbd5cacec51
Rename in_progress to under_review
`in_progress` is a vague term, which could mean either that the feature
review is in progress or that the feature implementation is in progress.
Renaming to `under_review` gives a much better indication that the
feature is under review and implementation hasn't begun yet.

Refer to https://review.gluster.org/13187 for the discussion which led
to this change.

Change-Id: I3f48e15deb4cf5486d7b8cac4a7915f9925f38f5
Signed-off-by: Kaushal M <kshlmster@gmail.com>
Reviewed-on: http://review.gluster.org/13264
Reviewed-by: Raghavendra Talur <rtalur@redhat.com>
Tested-by: Raghavendra Talur <rtalur@redhat.com>
Diffstat (limited to 'under_review/dht-scalability.md')
-rw-r--r-- under_review/dht-scalability.md 171
1 files changed, 171 insertions, 0 deletions
diff --git a/under_review/dht-scalability.md b/under_review/dht-scalability.md
new file mode 100644
index 0000000..83ef255
--- /dev/null
+++ b/under_review/dht-scalability.md
@@ -0,0 +1,171 @@
+Goal
+----
+
+More scalable DHT translator.
+
+Summary
+-------
+
+Current DHT inhibits scalability by requiring that directories be on all
+subvolumes. In addition to the extra message traffic this incurs during
+*mkdir*, it adds significant complexity keeping all of the directories
+consistent across operations like *create* and *rename*. What is
+proposed, in a nutshell, is that directories should only exist on one
+subvolume, which might contain "stubs" pointing to files and directories
+that can be accessed by GFID on the same or other subvolumes. In concert
+with this, the way we store layout information needs to change, so that
+at least the "fix-layout" part of rebalancing need not involve
+traversing every directory on every subvolume.
+
+Owners
+------
+
+Shyam Ranganathan <srangana@redhat.com>
+
+Raghavendra Gowdappa <rgowdapp@redhat.com>
+
+Current status
+--------------
+
+Proposed, awaiting summit for approval.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+[Features/thousand-node-glusterd](../GLusterFS 3.6/Thousand Node Gluster.md)
+will define new ways of managing maintenance activities, some of which
+are DHT-related. Also, the new "mon cluster" might be where we store
+layout information.
+
+[Features/data-classification](../GLusterFS 3.7/Data Classification.md)
+also affects layout storage and use.
+
+Detailed Description
+--------------------
+
+Under this scheme, path-based lookup becomes very different. Currently,
+we look up a path on the file's "hashed" subvol first (according to
+parent-directory layout and file-name hash). If it's not there, we need to
+look elsewhere - in the worst case on **all** subvolumes. In the future,
+our first lookup should be at the parent directory's subvolume. If the
+file is not there, it's not linked anywhere (though it might still exist
+unlinked and accessible by GFID) and we can terminate immediately. If it
+is there, then that single copy of the parent directory will contain a
+"stub" giving the file's GFID and a hint for what subvolume it's on
+(similar to a current linkfile). That information can then be used to
+initiate a GFID-based lookup. Many other code paths, such as *rename*,
+can leverage this new infrastructure to avoid current problems with
+multiple directory entries and linkfiles all for the same actual file.
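
The proposed lookup flow can be sketched roughly as follows. This is a minimal in-memory model, not GlusterFS code: `Stub`, `Subvolume`, and `lookup` are hypothetical names, and the fallback scan when the hint turns out to be stale is an assumption, since the proposal only says the stub carries a "hint".

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Stub:
    """Directory entry: the file's GFID plus a hint for its subvolume."""
    gfid: str
    subvol_hint: str


@dataclass
class Subvolume:
    name: str
    # parent-directory GFID -> {entry name -> Stub}
    directories: Dict[str, Dict[str, Stub]] = field(default_factory=dict)
    # GFID -> file contents (a stand-in for the real object)
    objects: Dict[str, bytes] = field(default_factory=dict)


def lookup(subvols: Dict[str, Subvolume], parent_subvol: str,
           parent_gfid: str, name: str) -> Optional[bytes]:
    """Resolve `name` under a parent directory held on one subvolume."""
    entries = subvols[parent_subvol].directories.get(parent_gfid, {})
    stub = entries.get(name)
    if stub is None:
        # Not linked anywhere: stop without asking any other subvolume.
        return None
    # GFID-based lookup, trying the hinted subvolume first.
    hinted = subvols.get(stub.subvol_hint)
    if hinted is not None and stub.gfid in hinted.objects:
        return hinted.objects[stub.gfid]
    # Assumption: a stale hint falls back to scanning the other subvolumes.
    for sv in subvols.values():
        if stub.gfid in sv.objects:
            return sv.objects[stub.gfid]
    return None
```

In this model a miss costs a single directory read on the parent's subvolume, whereas the current scheme may have to ask every subvolume before concluding that the name does not exist.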
+
+A possible enhancement would be to include more information in stubs,
+allowing readdirp to operate only on the directory and avoid going to
+every subvolume for information about individual files. Also, some
+secondary issues such as hard links and garbage collection (of unlinked
+but still open files) remain TBD in the final design.
+
+With respect to layout storage, the basic idea is to store a fairly
+small number of actual layouts - default, user defined, or related to
+data classification - that are each shared across many directories.
+These layouts are stored as part of our configuration, and the xattrs on
+individual directories need only specify a shared layout ID (plus
+possibly some additional "tweak" parameters) instead of a full explicit
+layout. When we do any kind of rebalancing, we need only change the
+shared layouts and not the pointers on each directory.
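
A rough sketch of the shared-layout idea, under stated assumptions: the names `shared_layouts`, `dir_layout_id`, `place`, and `fix_layout` are hypothetical, layouts are modeled as equal ranges over a 32-bit hash space, and `zlib.crc32` stands in for whatever hash DHT actually uses. The "tweak" parameters mentioned above are left out.

```python
import zlib
from typing import Dict, List, Tuple

HASH_SPACE = 2 ** 32
Layout = List[Tuple[int, int, str]]  # (start, end inclusive, subvolume)


def even_layout(subvols: List[str]) -> Layout:
    """Split the hash space into equal contiguous ranges, one per subvolume."""
    step = HASH_SPACE // len(subvols)
    layout = []
    for i, sv in enumerate(subvols):
        start = i * step
        # The last range absorbs any remainder so the space is fully covered.
        end = HASH_SPACE - 1 if i == len(subvols) - 1 else start + step - 1
        layout.append((start, end, sv))
    return layout


# Shared layouts live in configuration, keyed by layout ID.
shared_layouts: Dict[str, Layout] = {"default": even_layout(["sv0", "sv1"])}

# Each directory's xattr holds only a layout ID, not an explicit layout.
dir_layout_id: Dict[str, str] = {"/a": "default", "/a/b": "default"}


def place(directory: str, name: str) -> str:
    """Pick the subvolume for `name` via the directory's shared layout."""
    layout = shared_layouts[dir_layout_id[directory]]
    h = zlib.crc32(name.encode()) % HASH_SPACE  # stand-in hash function
    for start, end, sv in layout:
        if start <= h <= end:
            return sv
    raise LookupError("hash not covered by layout")


def fix_layout(layout_id: str, subvols: List[str]) -> None:
    """Rewrite one shared layout; no directory is visited or modified."""
    shared_layouts[layout_id] = even_layout(subvols)
```

Calling `fix_layout("default", ["sv0", "sv1", "sv2"])` changes placement for every directory that points at `"default"` while leaving `dir_layout_id` untouched, which is why "fix-layout" no longer needs to traverse directories in this model.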
+
+Benefit to GlusterFS
+--------------------
+
+Improved scalability and performance for all directory-entry operations.
+
+Improved reliability for conflicting directory-entry operations, and for
+layout repair.
+
+Almost instantaneous "fix-layout" completion.
+
+Scope
+-----
+
+### Nature of proposed change
+
+Due to the complexity of the changes involved, this will probably be a
+new translator developed using a similar model to that used for AFR2.
+While it's likely to share/borrow a significant amount of code from
+current DHT, the new version will go through most of its development
+lifecycle separately and then completely supplant the old version, as
+opposed to integrating individual changes. For testing of
+compatibility/migration, it is probably desirable for both versions to
+coexist in the source tree and packages, but not necessarily in the same
+process.
+
+### Implications on manageability
+
+New/different options, but otherwise no change.
+
+### Implications on presentation layer
+
+No change. At this level the new DHT translator should be a plug-in
+replacement for the old one.
+
+### Implications on persistence layer
+
+None, unless you count reduced xattr usage.
+
+### Implications on 'GlusterFS' backend
+
+This will fundamentally change the directory structure on our back end.
+A file that is currently visible within a brick as \$BRICK\_ROOT/a/b/c
+might now be visible only as \$GFID\_FOR\_B/c with neither of the parent
+directories even present on that brick. Even that "file" will actually
+be a stub containing only the file's own GFID; to get the contents, one
+would need to look up that GFID in .glusterfs on this or some other
+brick.
+
+Linkfiles will be gone, also subsumed by stubs.
+
+### Modification to GlusterFS metadata
+
+Explicit layouts will be replaced by IDs for shared layouts (in config
+storage).
+
+### Implications on 'glusterd'
+
+Minimal changes (mostly volfile generation).
+
+How To Test
+-----------
+
+Most existing DHT tests should suffice, except for those that depend on
+the details of how layouts are stored and formatted. Those will have to
+be modified to go through the extra layer of indirection to where the
+actual layouts live.
+
+User Experience
+---------------
+
+None, except for better performance and less lost data.
+
+Dependencies
+------------
+
+See the "Related Feature Requests and Bugs" section.
+
+Documentation
+-------------
+
+TBD. There should be very little at the user level, though hopefully
+this time we'll do better at documenting the things developers
+(including add-on tool developers) need to know.
+
+Status
+------
+
+Design in progress
+
+The design document and some notes are available at
+<https://drive.google.com/open?id=15_TOW9jwzW4griAmk-rqg2cWF-LHiR_TJ8Jn0vOvYpU&authuser=0>
+
+The gluster-devel thread opening this up for discussion is at
+<https://www.mail-archive.com/gluster-devel%40gluster.org/msg03036.html>
+
+Comments and Discussion
+-----------------------