author | Raghavendra Talur <raghavendra.talur@gmail.com> | 2015-08-20 15:09:31 +0530
---|---|---
committer | Humble Devassy Chirammal <humble.devassy@gmail.com> | 2015-08-31 02:27:22 -0700
commit | 9e9e3c5620882d2f769694996ff4d7e0cf36cc2b (patch) |
tree | 3a00cbd0cc24eb7df3de9b2eeeb8d42ee9175f88 /Feature Planning/GlusterFS 3.6/New Style Replication.md |
parent | f6055cdb4dedde576ed8ec55a13814a69dceefdc (diff) |
Create basic directory structure
All new feature specs go into the in_progress directory.
Once signed off, a spec is moved to the done directory.
For now, this change moves all the Gluster 4.0 feature specs to
in_progress; all other specs are under done/release-version.
More cleanup is required and will be done incrementally.
Change-Id: Id272d301ba8c434cbf7a9a966ceba05fe63b230d
BUG: 1206539
Signed-off-by: Raghavendra Talur <rtalur@redhat.com>
Reviewed-on: http://review.gluster.org/11969
Reviewed-by: Humble Devassy Chirammal <humble.devassy@gmail.com>
Reviewed-by: Prashanth Pai <ppai@redhat.com>
Tested-by: Humble Devassy Chirammal <humble.devassy@gmail.com>
Diffstat (limited to 'Feature Planning/GlusterFS 3.6/New Style Replication.md')
-rw-r--r-- | Feature Planning/GlusterFS 3.6/New Style Replication.md | 230
1 file changed, 0 insertions, 230 deletions
diff --git a/Feature Planning/GlusterFS 3.6/New Style Replication.md b/Feature Planning/GlusterFS 3.6/New Style Replication.md
deleted file mode 100644
index ffd8167..0000000
--- a/Feature Planning/GlusterFS 3.6/New Style Replication.md
+++ /dev/null
@@ -1,230 +0,0 @@

Goal
----

More partition-tolerant replication, with higher performance for most use cases.

Summary
-------

NSR is a new synchronous replication translator, complementing or perhaps some day replacing AFR.

Owners
------

Jeff Darcy <jdarcy@redhat.com>
Venky Shankar <vshankar@redhat.com>

Current status
--------------

Design and prototype (nearly) complete, implementation beginning.

Related Feature Requests and Bugs
---------------------------------

[AFR bugs related to "split brain"](https://bugzilla.redhat.com/buglist.cgi?classification=Community&component=replicate&list_id=3040567&product=GlusterFS&query_format=advanced&short_desc=split&short_desc_type=allwordssubstr)

[AFR bugs related to "perf"](https://bugzilla.redhat.com/buglist.cgi?classification=Community&component=replicate&list_id=3040572&product=GlusterFS&query_format=advanced&short_desc=perf&short_desc_type=allwordssubstr)

(Both lists are undoubtedly partial because not all bugs in these areas use these specific words. In particular, "GFID mismatch" bugs are really a kind of split brain, but aren't represented.)

Detailed Description
--------------------

NSR is designed to have the following features.

- Server based - "chain" replication can use the bandwidth of both client and server instead of splitting client bandwidth N ways.

- Journal based - for reduced network traffic in normal operation, plus faster recovery and greater resistance to "split brain" errors.

- Variable consistency model - based on [Dynamo](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) to provide options trading some consistency for greater availability and/or performance (a quorum sketch follows this list).

- Newer, smaller codebase - reduces technical debt, enables higher replica counts, more informative status reporting and logging, and other future features (e.g. ordered asynchronous replication).
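To make the variable-consistency bullet concrete, here is a minimal sketch of Dynamo-style quorum arithmetic, assuming N replicas, a write quorum W, and a read quorum R. The function names and parameters are invented for illustration; they are not NSR's actual options or code.

```python
# Dynamo-style tunable quorums: a toy model, not NSR's implementation.
# With n replicas, a write commits after w acks and a read consults r
# replicas. r + w > n guarantees every read overlaps the latest write
# (the CP-leaning choice); lowering w or r trades that consistency for
# availability (the AP-leaning choice discussed in the comments below).

def write_commits(acks: int, w: int) -> bool:
    """A write is durable once w of the n replicas have journaled it."""
    return acks >= w

def reads_see_latest_write(n: int, w: int, r: int) -> bool:
    """Quorum intersection: any r-replica read meets any w-replica write."""
    return r + w > n

if __name__ == "__main__":
    n = 3
    print(reads_see_latest_write(n, w=2, r=2))  # True: consistent, but needs a majority up
    print(reads_see_latest_write(n, w=1, r=1))  # False: stays writable longer, may read stale data
```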
Benefit to GlusterFS
====================

Faster, more robust, more manageable/maintainable replication.

Scope
=====

Nature of proposed change
-------------------------

At least two new translators will be necessary.

- A simple client-side translator to route requests to the current leader among the bricks in a replica set.

- A server-side translator to handle the "heavy lifting" of replication, recovery, etc.

Implications on manageability
-----------------------------

At a high level, the commands to enable, configure, and manage NSR will be very similar to those already used for AFR. At a lower level, the options affecting things like quorum, consistency, and placement of journals will all be completely different.

Implications on presentation layer
----------------------------------

Minimal. Most changes will simplify or remove special handling for AFR's unique behavior (especially around lookup vs. self-heal).

Implications on persistence layer
---------------------------------

N/A

Implications on 'GlusterFS' backend
-----------------------------------

The journal for each brick in an NSR volume might (for performance reasons) be placed on one or more local volumes other than the one containing the brick's data. Special requirements around AIO, fsync, etc. will be less stringent than with AFR.

Modification to GlusterFS metadata
----------------------------------

NSR will not use the same xattrs as AFR, reducing the need for larger inodes.

Implications on 'glusterd'
--------------------------

Volgen must be able to configure the client-side and server-side parts of NSR, instead of AFR on the client side and index (which will no longer be necessary) on the server side. Other interactions with glusterd should remain mostly the same.

How To Test
===========

Most basic AFR tests - e.g. reading/writing data, killing nodes, starting/stopping self-heal - would apply to NSR as well. Tests that embed assumptions about AFR xattrs or other internal artifacts will need to be rewritten.

User Experience
===============

Minimal change, mostly related to new options.

Dependencies
============

NSR depends on a cluster-management framework that can provide membership tracking, leader election, and robust consistent key/value data storage. This is expected to be developed in parallel as part of the glusterd-scalability feature, but can be implemented (in simplified form) within NSR itself if necessary.

Documentation
=============

TBD.

Status
======

Some parts of the earlier implementation have been updated to the current tree; others are in the middle of being replaced.

- [New design](http://review.gluster.org/#/c/8915/)

- [Basic translator code](http://review.gluster.org/#/c/8913/) (needs update to the new code-generation infrastructure)

- [GF\_FOP\_IPC](http://review.gluster.org/#/c/8812/)

- [etcd support](http://review.gluster.org/#/c/8887/)

- [New code-generation infrastructure](http://review.gluster.org/#/c/9411/)

- [New data-logging translator](https://forge.gluster.org/~jdarcy/glusterfs-core/jdarcys-glusterfs-data-logging)

Comments and Discussion
=======================

My biggest concern with journal-based replication comes from my previous use of DRBD. They do an "activity log"[^1], which sounds like the same basic concept. Once that log filled, I experienced cascading failure. When the journal can be filled faster than it is emptied, this could cause the problem I experienced.

So what I'm looking to be convinced of is how journalled replication maintains full redundancy, and how it will prevent the journal input from exceeding the capacity of the journal output - or at least how it won't fail if this should happen.

[jjulian](User:Jjulian "wikilink") ([talk](User talk:Jjulian "wikilink")) 17:21, 13 August 2013 (UTC)

<hr/>
This is akin to a CAP Theorem[^2][^3] problem. If your nodes can't communicate, what do you do with writes? Our replication approach has traditionally been CP - enforce quorum, allow writes only among the majority - and for the sake of satisfying user expectations (or POSIX) it pretty much has to remain CP, at least by default. I personally think we need to allow an AP choice as well, which is why the quorum levels in NSR are tunable to get that result.

So, what do we do if a node runs out of journal space? Well, it's unable to function normally, i.e. it's failed, so it can't count toward quorum. This would immediately lead to loss of write availability in a two-node replica set, and could happen easily enough in a three-node replica set if two similarly configured nodes ran out of journal space simultaneously. A significant part of the complexity in our design is around pruning no-longer-needed journal segments, precisely because this is an icky problem, but even with all the pruning in the world it could still happen eventually. Therefore the design also includes the notion of arbiters, which can be quorum-only or can also have their own journals (with no or partial data), so your quorum for admission/journaling purposes can be significantly higher than your actual replica count. So what options do we have to avoid or deal with journal exhaustion?

- Add more journal space (it's just files, so this can be done reactively during an extended outage).

- Add arbiters.

- Decrease the quorum levels.

- Manually kick a node out of the replica set.

- Add admission control, artificially delaying new requests as the journal becomes full. (This one requires more code; a sketch follows this list.)
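Since the admission-control option above is the one that "requires more code", here is a rough sketch of what such backpressure could look like: delay appends progressively as the journal fills, so input cannot outrun the pruning that empties it. Every name and threshold here is hypothetical, invented for this discussion rather than taken from the NSR design.

```python
import time

# Hypothetical journal backpressure sketch (not NSR code): throttle new
# writes as free space shrinks, so the journal fills more slowly than
# pruning empties it instead of overflowing, DRBD-activity-log style.

class Journal:
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0

    def fullness(self) -> float:
        return self.used / self.capacity

    def admission_delay(self) -> float:
        """Seconds to stall an incoming write: none below 50% full,
        ramping to 10 ms at 90%, then a heavy 500 ms stall near-full."""
        f = self.fullness()
        if f < 0.5:
            return 0.0
        if f < 0.9:
            return 0.01 * (f - 0.5) / 0.4
        return 0.5

    def append(self, nbytes: int) -> bool:
        time.sleep(self.admission_delay())  # backpressure before admitting
        if self.used + nbytes > self.capacity:
            return False  # exhausted: the node fails and drops out of quorum
        self.used += nbytes
        return True

    def prune(self, nbytes: int) -> None:
        """Reclaim a no-longer-needed segment once it is fully replicated."""
        self.used = max(0, self.used - nbytes)
```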
If you do \*none\* of these things then yeah, you're scrod. That said, do you think these options seem sufficient?

[Jdarcy](User:Jdarcy "wikilink") ([talk](User talk:Jdarcy "wikilink")) 15:27, 29 August 2013 (UTC)

<references/>

[^1]: <http://www.drbd.org/users-guide-emb/s-activity-log.html>

[^2]: <http://www.julianbrowne.com/article/viewer/brewers-cap-theorem>

[^3]: <http://henryr.github.io/cap-faq/>