1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
|
Object versioning
=================
Bitrot detection in GlusterFS relies on object (file) checksum (hash) verification,
also known as "object signature". An object is signed when there are no active
file desciptors referring to it's inode (i.e., upon last close()). This is just an
hint for the initiation of hash calculation (and therefore signing). There is
absolutely no control over when clients can initiate modification operations on
the object. An object could be under modification while it's hash computation is
under progress. It would also be in-appropriate to restrict access to such objects
during the time duration of signing.
Object versioning is used as a mechanism to identify the staleness of an objects
signature. The document below does not just list down the version update protocol,
but goes through various factors that led to its design.
NOTE: The word "object" is used to represent a "regular file" (in linux sense) and
object versions are persisted in extended attributes of the object's inode.
Signature calculation includes object's data (no metadata as of now).
INDEX
=====
i. Version updation protocol
ii. Correctness guaraantees
iii. Implementation
iv. Protocol enhancements
i. Version updation protocol
============================
There are two types of versions associated with an object:
a) Ongoing version: This version is incremented on first open() [when
the in-memory representation of the object (inode) is marked dirty
and synchronized to disk. When an object is created, a default ongoing
version of one (1) is assigned. An object lookup() too assigns the
default version if not present. When a version is initialized upon
lookup() or creat() FOP, it need to be durable on disk and therefore
can just be a extended attrbute set with out an expensive fsync()
syscall.
b) Signing version: This is the version against which an object is deemed
to be signed. An objects signature is tied to a particular signed version.
Since, an object is a candidate for signing upon last release() [last
close()], signing version is the "ongoing version" at that point of time
An object's signature is trustable when the version it was signed against
matches the ongoing version, i.e., if the hash is calculated by hand and
compared against the object signature, it *should* be a perfect match if
and only if the versions are equal. On the other hand, the signature is
considered stale (might or might not match the hash just calculated).
Initialization of object versions
---------------------------------
An object that existed before the pre versioning days, is assigned the
default versions upon lookup(). The protocol at this point expects "no"
durability guarantess of the versions, i.e., extended attribute sets
need not be followed by an explicit filesystem sync (fsync()). In case
of a power outage or a crash, versions are re-initialized with defaults
if found to be non-existant. The signing version is initialized with a
deafault value of zero (0) and the ongoing version as one (1).
[
NOTE: If an object already has versions on-disk, lookup() just brings
the versions in memory. In this case both versions may or may
not match depending on state the object was left in.
]
Increment of object versions
----------------------------
During initial versioning, the in-memory representation of the object is
marked dirty, so that subsequent modification operations on the object
triggers a versiong synchronization to disk (extended attribute set).
Moreover, this operation needs to be durable on disk, for the protocol
to be crash consistent.
Let's picturize the various version states after subsequent open()s.
Not all modification operations need to increment the ongoing version,
only the first operations needs to (subsequent operations are NO-OPs).
NOTE: From here one "[s]" depicts a durable filesystem operation and
"*" depicts the inode as dirty.
lookup() open() open() open()
===========================================================
OV(m): 1* 2 2 2
-----------------------------------------
OV(d): 1 2[s] 2 2
SV(d): 0 0 0 0
Let's now picturize the state when an already signed object undergoes
file operations.
on-disk state:
OV(d): 3
SV(d): 3|<signature>
lookup() open() open() open()
===========================================================
OV(m): 3* 4 4 4
-----------------------------------------
OV(d): 3 4[s] 4 4
SV(d): 3 3 3 3
Signing process
---------------
As per the above example, when the last open file descriptor is closed,
signing needs to be performed. The protocol restricts that the signing
needs to be attached to a version, which in this case is the in-memory
value of the ongoing version. A release() also marks the inode dirty,
therefore, the next open() does a durable version synchronization to
disk.
[carry forwarding the versions from earlier example]
close() release() open() open()
===========================================================
OV(m): 4 4* 5 5
-----------------------------------------
OV(d): 4 4 5[s] 5
SV(d): 3 3 3 3
As shown above, a relase() call triggers a signing with signing version
as OV(m): which in this case is 4. During signing, the object is signed
with a signature attached to version 4 as shown below (continuing with
the last open() call from above):
open() sign(4, signature)
===========================================================
OV(m): 5 5
-----------------------------------------
OV(d): 5 5
SV(d): 3 4:<signature>[s]
A signature comparison at this point of time is un-trustable due to
version mismatches. This also protects from node crashes and hard
reboots due to durability guarantee of on-disk version on first
open().
close() release() open()
===========================================================
OV(m): 4 4* 5
-------------------------------- CRASH
OV(d): 4 4 5[s]
SV(d): 3 3 3
The protocol is immune to signing request after crashes due to
the version synchronization performed on first open(). Signing
request for a version lesser than the *current* ongoing version
can be ignored. It's left upon the implementation to either
accept or ignore such signing request(s).
[
NOTE: Inode forget() causes a fresh lookup() to be trigerred.
Since a forget() call is received when there are no
active references for an inode, the on-disk version is
the latest and would be copied in-memory on lookup().
]
ii. Correctness Guarantees
==========================
Concurrent open()'s
-------------------
When an inode is dirty (i.e., the very next operations would try to
synchronize the version to disk), there can be multiple calls [say,
open()] that would find the inode state as dirty and try to writeback
the new version to disk. Also, note that, marking the inode as synced
and updating the in-memory version is done *after* the new version
is written on disk. This is done to avoid incorrect version stored
on-disk in case the version synchronization fails (but the in-memory
version still holding the updated value).
Coming back to multiple open() calls on an object, each open() call
tries to synchronize the new version to disk if the inode is marked
as dirty. This is safe as each open() would try to synchronize the
new version (ongoingversion + 1) even if the updation is concurrent.
The in-memory version is finally updated to reflect the updated
version and mark the inode non-dirty. Again this is done *only* if
the inode is dirty, thereby open() calls which updated the on-disk
version but lost the race to update the in-memory version result
are NO-OPs.
on-disk state:
OV(d): 3
SV(d): 3|<signature>
lookup() open() open()' open()' open()
=============================================================
OV(m): 3* 3* 3* 4 NO-OP
--------------------------------------------------
OV(d): 3 4[s] 4[s] 4 4
SV(d): 3 3 3 3 3
open()/release() race
---------------------
This race can cause a release() [on last close()] to pick up the
ongoing version which was just incremented on fresh open(). This
leads to signing of the object with the same version as the
ongoing version, thereby, mismatching signatures when calculated.
Another point that's worth mentioning here is that the open
file descriptor is *attached* to it's inode *after* it's done
version synchronization (and increment). Hence, if a release()
sneaks in this window, the file desriptor list for the given
inode is still empty, therefore release() considering it as a
last close().
To counter this, the protocol should track the open and release
counts for file descriptors. A release() should only trigger a
signing request when the file desccriptor for an inode is empty
and the numbers of releases match the number of opens. When an
open() sneaks and increments the ongoing version but the file
descriptor is still not attached to the inode, open and release
counts mismatch, hence identifying an open() in progress.
iii. Implementation
===================
Refer to: xlators/feature/bit-rot/src/stub
iv. Protocol enhancements
=========================
a) Delaying persisting on-disk versions till open()
b) Lazy version updation (until signing?)
c) Protocol changes required to handle anonymous file
descriptors in GlusterFS.
|