Goal
----

More partition-tolerant replication, with higher performance for most
use cases.

Summary
-------

NSR is a new synchronous replication translator, complementing or
perhaps some day replacing AFR.

Owners
------

Jeff Darcy <jdarcy@redhat.com>  
Venky Shankar <vshankar@redhat.com>

Current status
--------------

Design and prototype (nearly) complete, implementation beginning.

Related Feature Requests and Bugs
---------------------------------

[AFR bugs related to "split
brain"](https://bugzilla.redhat.com/buglist.cgi?classification=Community&component=replicate&list_id=3040567&product=GlusterFS&query_format=advanced&short_desc=split&short_desc_type=allwordssubstr)

[AFR bugs related to
"perf"](https://bugzilla.redhat.com/buglist.cgi?classification=Community&component=replicate&list_id=3040572&product=GlusterFS&query_format=advanced&short_desc=perf&short_desc_type=allwordssubstr)

(Both lists are undoubtedly partial, because not all bugs in these areas
use these specific words. In particular, "GFID mismatch" bugs are really
a kind of split brain, but aren't represented.)

Detailed Description
--------------------

NSR is designed to have the following features.

-   Server based - "chain" replication can use the bandwidth of both
    client and server instead of splitting client bandwidth N ways.

-   Journal based - for reduced network traffic in normal operation,
    plus faster recovery and greater resistance to "split brain" errors.

-   Variable consistency model - based on
    [Dynamo](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)
    to provide options trading some consistency for greater availability
    and/or performance (see the sketch after this list).

-   Newer, smaller codebase - reduces technical debt, enables higher
    replica counts, more informative status reporting and logging, and
    other future features (e.g. ordered asynchronous replication).
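
As an illustration of the "variable consistency" point above, here is a
minimal sketch (not NSR code; the names and defaults are assumptions) of how
a tunable write quorum trades consistency against availability, in the
spirit of Dynamo's N/W parameters:

```python
# Minimal sketch, not NSR code: a Dynamo-style tunable write quorum.
# `replica_count` and `write_quorum` are illustrative names.

def write_is_durable(acks, replica_count, write_quorum):
    """A write is acknowledged once `write_quorum` of the `replica_count`
    members have journalled it.  write_quorum == replica_count favours
    consistency; write_quorum == 1 favours availability and latency."""
    assert 1 <= write_quorum <= replica_count
    return acks >= write_quorum

# Three-way replica set:
assert write_is_durable(2, 3, write_quorum=2)      # CP-leaning default
assert not write_is_durable(1, 3, write_quorum=2)  # minority can't accept writes
assert write_is_durable(1, 3, write_quorum=1)      # AP-leaning tuning
```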

Benefit to GlusterFS
====================

Faster, more robust, more manageable/maintainable replication.

Scope
=====

Nature of proposed change
-------------------------

At least two new translators will be necessary.

-   A simple client-side translator to route requests to the current
    leader among the bricks in a replica set (sketched below).

-   A server-side translator to handle the "heavy lifting" of
    replication, recovery, etc.
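
The client-side piece is deliberately thin. As a rough sketch (illustrative
only, not actual translator code), its job amounts to:

```python
# Rough sketch, not actual translator code: send every request to the
# current leader of the replica set, and re-resolve the leader on failure.
# The `get_leader` callback is an assumption standing in for a lookup in
# the cluster's membership/leader-election machinery.

class LeaderRouter:
    def __init__(self, bricks, get_leader):
        self.bricks = bricks          # connections to the replica set's bricks
        self.get_leader = get_leader  # returns the index of the current leader

    def dispatch(self, request):
        try:
            return self.bricks[self.get_leader()].submit(request)
        except ConnectionError:
            # Leader changed or died: re-resolve once and retry.
            return self.bricks[self.get_leader()].submit(request)
```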

Implications on manageability
-----------------------------

At a high level, commands to enable, configure, and manage NSR will be
very similar to those already used for AFR. At a lower level, the
options affecting things like quorum, consistency, and placement of
journals will all be completely different.

Implications on presentation layer
----------------------------------

Minimal. Most changes will be to simplify or remove special handling for
AFR's unique behavior (especially around lookup vs. self-heal).

Implications on persistence layer
---------------------------------

N/A

Implications on 'GlusterFS' backend
-----------------------------------

The journal for each brick in an NSR volume might (for performance
reasons) be placed on one or more local volumes other than the one
containing the brick's data. Special requirements around AIO, fsync,
etc. will be less than with AFR.
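
As a purely illustrative assumption (none of these names or values are
final), a brick's journal placement might be described along these lines:

```python
# Illustrative assumption only: a brick whose journal lives on a separate,
# faster local device than the brick's data.
brick = {
    "data_path":    "/bricks/brick1",         # main data disk
    "journal_path": "/fast-ssd/nsr/brick1",   # journal segments are just files
    "journal_segment_bytes": 64 * 1024 * 1024,
}
```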

Modification to GlusterFS metadata
----------------------------------

NSR will not use the same xattrs as AFR, reducing the need for larger
inodes.

Implications on 'glusterd'
--------------------------

Volgen must be able to configure the client-side and server-side parts
of NSR, instead of AFR on the client side and index (which will no
longer be necessary) on the server side. Other interactions with
glusterd should remain mostly the same.

How To Test
===========

Most basic AFR tests - e.g. reading/writing data, killing nodes,
starting/stopping self-heal - would apply to NSR as well. Tests that
embed assumptions about AFR xattrs or other internal artifacts will need
to be re-written.

User Experience
===============

Minimal change, mostly related to new options.

Dependencies
============

NSR depends on a cluster-management framework that can provide
membership tracking, leader election, and robust consistent key/value
data storage. This is expected to be developed in parallel as part of
the glusterd-scalability feature, but can be implemented (in simplified
form) within NSR itself if necessary.
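
As an example of how little NSR needs from that framework, leader election
can be reduced to an atomic "create if absent" with a lease on a consistent
key/value store such as etcd. The sketch below is an assumption for
illustration; the `kv` interface is hypothetical:

```python
# Hypothetical sketch: leader election on a consistent KV store (e.g. etcd).
# `kv.create_if_absent` stands in for the store's atomic compare-and-swap.

def try_acquire_leadership(kv, replica_set, my_id, lease_secs=10):
    """Whoever atomically creates the leader key first is leader until the
    lease expires (or is refreshed by the current leader)."""
    key = "/nsr/%s/leader" % replica_set
    return kv.create_if_absent(key, my_id, ttl=lease_secs)

def current_leader(kv, replica_set):
    return kv.get("/nsr/%s/leader" % replica_set)
```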

Documentation
=============

TBD.

Status
======

Some parts of the earlier implementation have been updated to the current
tree; others are in the middle of being replaced.

-   [New design](http://review.gluster.org/#/c/8915/)

-   [Basic translator code](http://review.gluster.org/#/c/8913/) (needs
    update to the new code-generation infrastructure)

-   [GF\_FOP\_IPC](http://review.gluster.org/#/c/8812/)

-   [etcd support](http://review.gluster.org/#/c/8887/)

-   [New code-generation
    infrastructure](http://review.gluster.org/#/c/9411/)

-   [New data-logging
    translator](https://forge.gluster.org/~jdarcy/glusterfs-core/jdarcys-glusterfs-data-logging)

Comments and Discussion
=======================

My biggest concern with journal-based replication comes from my previous
use of DRBD. DRBD keeps an "activity log"[^1], which sounds like the same
basic concept. Once that log filled, I experienced cascading failure. If
the journal here can be filled faster than it is emptied, the same problem
could occur.

So what I'm looking to be convinced of is how journalled replication
maintains full redundancy, how it will prevent journal input from
exceeding the capacity of journal output, or at least how it won't fail
if that should happen.

[jjulian](User:Jjulian "wikilink")
([talk](User talk:Jjulian "wikilink")) 17:21, 13 August 2013 (UTC)

<hr/>
This is akin to a CAP Theorem[^2][^3] problem. If your nodes can't
communicate, what do you do with writes? Our replication approach has
traditionally been CP - enforce quorum, allow writes only among the
majority - and for the sake of satisfying user expectations (or POSIX)
pretty much has to remain CP at least by default. I personally think we
need to allow an AP choice as well, which is why the quorum levels in
NSR are tunable to get that result.

So, what do we do if a node runs out of journal space? Well, it's unable
to function normally, i.e. it's failed, so it can't count toward quorum.
This would immediately lead to loss of write availability in a two-node
replica set, and could happen easily enough in a three-node replica set
if two similarly configured nodes ran out of journal space
simultaneously. A significant part of the complexity in our design is
around pruning no-longer-needed journal segments, precisely because this
is an icky problem, but even with all the pruning in the world it could
still happen eventually. Therefore the design also includes the notion
of arbiters, which can be quorum-only or can also have their own
journals (with no or partial data). This means your quorum for
admission/journaling purposes can be significantly higher than your
actual replica count. So what options do we have to avoid or deal with
journal exhaustion?

-   Add more journal space (it's just files, so this can be done
    reactively during an extended outage).

-   Add arbiters.

-   Decrease the quorum levels.

-   Manually kick a node out of the replica set.

-   Add admission control, artificially delaying new requests as the
    journal becomes full. (This one requires more code; a rough sketch
    follows below.)
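
As a rough illustration of that last option (an assumption on my part, not
part of the design), admission control could tie request latency to how
full the journal is:

```python
# Rough sketch, assumed behaviour: slow writers down as the journal fills,
# failing only when it is completely full.
import time

def admit(journal_used, journal_size, soft_limit=0.75, max_delay=0.5):
    fill = journal_used / float(journal_size)
    if fill >= 1.0:
        raise IOError("journal full")   # caller must wait or fail over
    if fill > soft_limit:
        # Delay grows linearly from 0 at the soft limit to max_delay when full.
        time.sleep(max_delay * (fill - soft_limit) / (1.0 - soft_limit))
```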

If you do \*none\* of these things then yeah, you're scrod. That said,
do you think these options seem sufficient?

[Jdarcy](User:Jdarcy "wikilink") ([talk](User talk:Jdarcy "wikilink"))
15:27, 29 August 2013 (UTC)


[^1]: <http://www.drbd.org/users-guide-emb/s-activity-log.html>

[^2]: <http://www.julianbrowne.com/article/viewer/brewers-cap-theorem>

[^3]: <http://henryr.github.io/cap-faq/>