diff options
Diffstat (limited to 'done/GlusterFS 3.6/New Style Replication.md')
-rw-r--r-- | done/GlusterFS 3.6/New Style Replication.md | 230 |
1 files changed, 230 insertions, 0 deletions
diff --git a/done/GlusterFS 3.6/New Style Replication.md b/done/GlusterFS 3.6/New Style Replication.md new file mode 100644 index 0000000..ffd8167 --- /dev/null +++ b/done/GlusterFS 3.6/New Style Replication.md @@ -0,0 +1,230 @@ +Goal +---- + +More partition-tolerant replication, with higher performance for most +use cases. + +Summary +------- + +NSR is a new synchronous replication translator, complementing or +perhaps some day replacing AFR. + +Owners +------ + +Jeff Darcy <jdarcy@redhat.com> +Venky Shankar <vshankar@redhat.com> + +Current status +-------------- + +Design and prototype (nearly) complete, implementation beginning. + +Related Feature Requests and Bugs +--------------------------------- + +[AFR bugs related to "split +brain"](https://bugzilla.redhat.com/buglist.cgi?classification=Community&component=replicate&list_id=3040567&product=GlusterFS&query_format=advanced&short_desc=split&short_desc_type=allwordssubstr) + +[AFR bugs related to +"perf"](https://bugzilla.redhat.com/buglist.cgi?classification=Community&component=replicate&list_id=3040572&product=GlusterFS&query_format=advanced&short_desc=perf&short_desc_type=allwordssubstr) + +(Both lists are undoubtedly partial because not all bugs in these areas +using these specific words. In particular, "GFID mismatch" bugs are +really a kind of split brain, but aren't represented.) + +Detailed Description +-------------------- + +NSR is designed to have the following features. + +- Server based - "chain" replication can use bandwidth of both client + and server instead of splitting client bandwidth N ways. + +- Journal based - for reduced network traffic in normal operation, + plus faster recovery and greater resistance to "split brain" errors. + +- Variable consistency model - based on + [Dynamo](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) + to provide options trading some consistency for greater availability + and/or performance. + +- Newer, smaller codebase - reduces technical debt, enables higher + replica counts, more informative status reporting and logging, and + other future features (e.g. ordered asynchronous replication). + +Benefit to GlusterFS +==================== + +Faster, more robust, more manageable/maintainable replication. + +Scope +===== + +Nature of proposed change +------------------------- + +At least two new translators will be necessary. + +- A simple client-side translator to route requests to the current + leader among the bricks in a replica set. + +- A server-side translator to handle the "heavy lifting" of + replication, recovery, etc. + +Implications on manageability +----------------------------- + +At a high level, commands to enable, configure, and manage NSR will be +very similar to those already used for AFR. At a lower level, the +options affecting things things like quorum, consistency, and placement +of journals will all be completely different. + +Implications on presentation layer +---------------------------------- + +Minimal. Most changes will be to simplify or remove special handling for +AFR's unique behavior (especially around lookup vs. self-heal). + +Implications on persistence layer +--------------------------------- + +N/A + +Implications on 'GlusterFS' backend +----------------------------------- + +The journal for each brick in an NSR volume might (for performance +reasons) be placed on one or more local volumes other than the one +containing the brick's data. Special requirements around AIO, fsync, +etc. will be less than with AFR. + +Modification to GlusterFS metadata +---------------------------------- + +NSR will not use the same xattrs as AFR, reducing the need for larger +inodes. + +Implications on 'glusterd' +-------------------------- + +Volgen must be able to configure the client-side and server-side parts +of NSR, instead of AFR on the client side and index (which will no +longer be necessary) on the server side. Other interactions with +glusterd should remain mostly the same. + +How To Test +=========== + +Most basic AFR tests - e.g. reading/writing data, killing nodes, +starting/stopping self-heal - would apply to NSR as well. Tests that +embed assumptions about AFR xattrs or other internal artifacts will need +to be re-written. + +User Experience +=============== + +Minimal change, mostly related to new options. + +Dependencies +============ + +NSR depends on a cluster-management framework that can provide +membership tracking, leader election, and robust consistent key/value +data storage. This is expected to be developed in parallel as part of +the glusterd-scalability feature, but can be implemented (in simplified +form) within NSR itself if necessary. + +Documentation +============= + +TBD. + +Status +====== + +Some parts of earlier implementation updated to current tree, others in +the middle of replacement. + +- [New design](http://review.gluster.org/#/c/8915/) + +- [Basic translator code](http://review.gluster.org/#/c/8913/) (needs + update to new code-generation infractructure) + +- [GF\_FOP\_IPC](http://review.gluster.org/#/c/8812/) + +- [etcd support](http://review.gluster.org/#/c/8887/) + +- [New code-generation + infrastructure](http://review.gluster.org/#/c/9411/) + +- [New data-logging + translator](https://forge.gluster.org/~jdarcy/glusterfs-core/jdarcys-glusterfs-data-logging) + +Comments and Discussion +======================= + +My biggest concern with journal-based replication comes from my previous +use of DRBD. They do an "activity log"[^1] which sounds like the same +basic concept. Once that log filled, I experienced cascading failure. +When the journal can be filled faster than it's emptied this could cause +the problem I experienced. + +So what I'm looking to be convinced is how journalled replication +maintains full redundancy and how it will prevent the journal input from +exceeding the capacity of the journal output or at least how it won't +fail if this should happen. + +[jjulian](User:Jjulian "wikilink") +([talk](User talk:Jjulian "wikilink")) 17:21, 13 August 2013 (UTC) + +<hr/> +This is akin to a CAP Theorem[^2][^3] problem. If your nodes can't +communicate, what do you do with writes? Our replication approach has +traditionally been CP - enforce quorum, allow writes only among the +majority - and for the sake of satisfying user expectations (or POSIX) +pretty much has to remain CP at least by default. I personally think we +need to allow an AP choice as well, which is why the quorum levels in +NSR are tunable to get that result. + +So, what do we do if a node runs out of journal space? Well, it's unable +to function normally, i.e. it's failed, so it can't count toward quorum. +This would immediately lead to loss of write availability in a two-node +replica set, and could happen easily enough in a three-node replica set +if two similarly configured nodes ran out of journal space +simultaneously. A significant part of the complexity in our design is +around pruning no-longer-needed journal segments, precisely because this +is an icky problem, but even with all the pruning in the world it could +still happen eventually. Therefore the design also includes the notion +of arbiters, which can be quorum-only or can also have their own +journals (with no or partial data). Therefore, your quorum for +admission/journaling purposes can be significantly higher than your +actual replica count. So what options do we have to avoid or deal with +journal exhaustion? + +- Add more journal space (it's just files, so this can be done + reactively during an extended outage). + +- Add arbiters. + +- Decrease the quorum levels. + +- Manually kick a node out of the replica set. + +- Add admission control, artificially delaying new requests as the + journal becomes full. (This one requires more code.) + +If you do \*none\* of these things then yeah, you're scrod. That said, +do you think these options seem sufficient? + +[Jdarcy](User:Jdarcy "wikilink") ([talk](User talk:Jdarcy "wikilink")) +15:27, 29 August 2013 (UTC) + +<references/> + +[^1]: <http://www.drbd.org/users-guide-emb/s-activity-log.html> + +[^2]: <http://www.julianbrowne.com/article/viewer/brewers-cap-theorem> + +[^3]: <http://henryr.github.io/cap-faq/> |