done/GlusterFS 3.6/afrv2.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244

Feature
-------

This feature is major code re-factor of current afr along with a key
design change in the way changelog extended attributes are stored in
afr.

Summary
-------

This feature introduces design change in afr which separates ongoing
transaction, pending operation count for files/directories.

Owners
------

Anand Avati  
Pranith Kumar Karampuri

Current status
--------------

The feature is in final stages of review at
<http://review.gluster.org/6010>

Detailed Description
--------------------

How AFR works:

In order to keep track of what copies of the file are modified and up to
date, and what copies require to be healed, AFR keeps state information
in the extended attributes of the file called changelog extended
attributes. These extended attributes stores that copy's view of how up
to date the other copies are. The extended attributes are modified in a
transaction which consists of 5 phases - LOCK, PRE-OP, OP, POST-OP and
UNLOCK. In the PRE-OP phase the extended attributes are updated to store
the intent of modification (in the OP phase.)

In the POST-OP phase, depending on how many servers crashed mid way and
on how many servers the OP was applied successfully, a corresponding
change is made in the extended attributes (of the surviving copies) to
represent the staleness of the copies which missed the OP phase.

Further, when those lagging servers become available, healing decisions
are taken based on these extended attribute values.

Today, a PRE-OP increments the pending counters of all elements in the
array (where each element represents a server, and therefore one of the
members of the array represents that server itself.) The POST-OP
decrements those counters which represent servers where the operation
was successful. The update is performed on all the servers which have
made it till the POST-OP phase. The decision of whether a server crashed
in the middle of a transaction or whether the server lived through the
transaction and witnessed the other server crash, is inferred by
inspecting the extended attributes of all servers together. Because
there is no distinction between these counters as to how many of those
increments represent "in transit" operations and how many of those are
retained without decrement to represent "pending counters", there is
value in adding clarity to the system by separating the two.

The change is to now have only one dirty flag on each server per file.
We also make the PRE-OP increment only that dirty flag rather than all
the elements of the pending array. The dirty flag must be set before
performing the operation, and based on which of the servers the
operation failed, we will set the pending counters representing these
failed servers on the remaining ones in the POST-OP phase. The dirty
counter is also cleared at the end of the POST-OP. This means, in
successful operations only the dirty flag (one integer) is incremented
and decremented per server per file. However if a pending counter is set
because of an operation failure, then the flag is an unambiguous "finger
pointing" at the other server. Meaning, if a file has a pending counter
AND a dirty flag, it will not undermine the "strength" of the pending
counter. This change completely removes today's ambiguity of whether a
pending counter represents a still ongoing operation (or crashed in
transit) vs a surely missed operation.

Benefit to GlusterFS
--------------------

It increases the clarity of whether a file has any ongoing transactions
and any pending self-heals. Code is more maintainable now.

Scope
-----

### Nature of proposed change

- Remove client side self-healing completely (opendir, openfd, lookup) -
Re-work readdir-failover to work reliably in case of NFS - Remove
unused/dead lock recovery code - Consistently use xdata in both calls
and callbacks in all FOPs - Per-inode event generation, used to force
inode ctx refresh - Implement dirty flag support (in place of pending
counts) - Eliminate inode ctx structure, use read subvol bits +
event\_generation - Implement inode ctx refreshing based on event
generation - Provide backward compatibility in transactions - remove
unused variables and functions - make code more consistent in style and
pattern - regularize and clean up inode-write transaction code -
regularize and clean up dir-write transaction code - regularize and
clean up common FOPs - reorganize transaction framework code - skip
setting xattrs in pending dict if nothing is pending - re-write
self-healing code using syncops - re-write simpler self-heal-daemon

### Implications on manageability

None

### Implications on presentation layer

None

### Implications on persistence layer

None

### Implications on 'GlusterFS' backend

None

### Modification to GlusterFS metadata

This changes the way pending counts vs Ongoing transactions are
represented in changelog extended attributes.

### Implications on 'glusterd'

None

How To Test
-----------

Same test cases of afrv1 hold.

User Experience
---------------

None

Dependencies
------------

None

Documentation
-------------

---

Status
------

The feature is in final stages of review at
<http://review.gluster.org/6010>

Comments and Discussion
-----------------------

---

Summary
-------

<Brief Description of the Feature>

Owners
------

<Feature Owners - Ideally includes you :-)>

Current status
--------------

<Provide details on related existing features, if any and why this new feature is needed>

Detailed Description
--------------------

<Detailed Feature Description>

Benefit to GlusterFS
--------------------

<Describe Value additions to GlusterFS>

Scope
-----

### Nature of proposed change

<modification to existing code, new translators ...>

### Implications on manageability

<Glusterd, GlusterCLI, Web Console, REST API>

### Implications on presentation layer

<NFS/SAMBA/UFO/FUSE/libglusterfsclient Integration>

### Implications on persistence layer

<LVM, XFS, RHEL ...>

### Implications on 'GlusterFS' backend

<brick's data format, layout changes>

### Modification to GlusterFS metadata

<extended attributes used, internal hidden files to keep the metadata...>

### Implications on 'glusterd'

<persistent store, configuration changes, brick-op...>

How To Test
-----------

<Description on Testing the feature>

User Experience
---------------

<Changes in CLI, effect on User experience...>

Dependencies
------------

<Dependencies, if any>

Documentation
-------------

<Documentation for the feature>

Status
------

<Status of development - Design Ready, In development, Completed>

Comments and Discussion
-----------------------

<Follow here>