Object versioning
=================

  Bitrot detection in GlusterFS relies on object (file) checksum (hash) verification,
  also known as "object signature". An object is signed when there are no active
  file descriptors referring to its inode (i.e., upon last close()). This is just a
  hint for the initiation of hash calculation (and therefore signing). There is
  absolutely no control over when clients can initiate modification operations on
  the object. An object could be under modification while its hash computation is
  in progress. It would also be inappropriate to restrict access to such objects
  for the duration of signing.

  Object versioning is used as a mechanism to identify the staleness of an object's
  signature. This document does not just list the version update protocol,
  but also goes through the various factors that led to its design.

NOTE: The word "object" is used to represent a "regular file" (in the Linux sense) and
      object versions are persisted in extended attributes of the object's inode.
      Signature calculation includes the object's data (no metadata as of now).

INDEX
=====
  i.   Version updation protocol
  ii.  Correctness guarantees
  iii. Implementation
  iv.  Protocol enhancements

i. Version updation protocol
============================
  There are two types of versions associated with an object:

  a) Ongoing version: This version is incremented on first open() [when
     the in-memory representation of the object (inode) is marked dirty]
     and synchronized to disk. When an object is created, a default ongoing
     version of one (1) is assigned. An object lookup() too assigns the
     default version if not present. When a version is initialized upon
     a lookup() or creat() FOP, it need not be durable on disk and therefore
     can just be an extended attribute set without an expensive fsync()
     syscall.

  b) Signing version: This is the version against which an object is deemed
     to be signed. An object's signature is tied to a particular signing version.
     Since an object is a candidate for signing upon last release() [last
     close()], the signing version is the "ongoing version" at that point in time.

  An object's signature is trustable when the version it was signed against
  matches the ongoing version, i.e., if the hash is calculated by hand and
  compared against the object signature, it *should* be a perfect match if
  the versions are equal. If the versions differ, the signature is
  considered stale (it might or might not match the hash just calculated).
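
  The trust rule above can be sketched as a small predicate. This is an
  illustrative Python sketch, not GlusterFS code; the names ObjectVersions
  and signature_is_trustable are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ObjectVersions:
    """Hypothetical container for the two versions of an object."""
    ongoing: int   # ongoing version (OV)
    signing: int   # version the current signature was computed against (SV)


def signature_is_trustable(v: ObjectVersions) -> bool:
    # The signature can be trusted only when it was computed against
    # the current ongoing version; otherwise it is considered stale.
    return v.signing == v.ongoing


print(signature_is_trustable(ObjectVersions(ongoing=3, signing=3)))  # True
print(signature_is_trustable(ObjectVersions(ongoing=4, signing=3)))  # False
```

  A freshly initialized object (ongoing 1, signing 0) is therefore never
  treated as signed.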

  Initialization of object versions
  ---------------------------------
     An object that existed before versioning was introduced is assigned the
     default versions upon lookup(). The protocol at this point expects "no"
     durability guarantees for the versions, i.e., extended attribute sets
     need not be followed by an explicit filesystem sync (fsync()). In case
     of a power outage or a crash, versions are re-initialized with defaults
     if found to be non-existent. The signing version is initialized with a
     default value of zero (0) and the ongoing version with one (1).

     [
       NOTE: If an object already has versions on-disk, lookup() just brings
             the versions in memory. In this case the two versions may or may
             not match depending on the state the object was left in.
     ]
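
     The initialization rules above can be sketched as follows. This is a
     simplified model, not the GlusterFS implementation: a plain dict stands
     in for the inode's extended attributes, and the keys "ov" and "sv" as
     well as the function name lookup are hypothetical.

```python
DEFAULT_ONGOING = 1   # default ongoing version (from the text above)
DEFAULT_SIGNING = 0   # default signing version (from the text above)


def lookup(xattrs: dict) -> dict:
    """Return the in-memory version state for an object.

    `xattrs` models the on-disk extended attributes; the keys "ov" and
    "sv" are hypothetical names used only in this sketch.
    """
    # Missing versions (an object predating versioning, or versions lost
    # in a crash because the set was not followed by fsync()) are
    # re-initialized with the defaults. No durability is required here.
    xattrs.setdefault("ov", DEFAULT_ONGOING)
    xattrs.setdefault("sv", DEFAULT_SIGNING)
    # The inode starts out dirty, so the first open() performs a durable
    # version synchronization.
    return {"ov": xattrs["ov"], "dirty": True}


fresh = {}                        # object without versions on disk
state = lookup(fresh)             # fresh becomes {"ov": 1, "sv": 0}

existing = {"ov": 4, "sv": 3}     # object with versions already on disk
state2 = lookup(existing)         # versions are only brought in memory
```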


  Increment of object versions
  ----------------------------
     During initial versioning, the in-memory representation of the object is
     marked dirty, so that the next modification operation on the object
     triggers a version synchronization to disk (extended attribute set).
     Moreover, this operation needs to be durable on disk for the protocol
     to be crash consistent.

     Let's picture the various version states after subsequent open()s.
     Not all modification operations need to increment the ongoing version;
     only the first operation needs to (subsequent operations are NO-OPs).

     NOTE: From here on, "[s]" depicts a durable filesystem operation and
           "*" depicts the inode as dirty. OV(m) and OV(d) denote the
           in-memory and on-disk ongoing version, and SV(d) the on-disk
           signing version.


                       lookup()     open()    open()    open()
            ===========================================================

            OV(m):        1*          2         2         2
                      -----------------------------------------
            OV(d):        1           2[s]      2         2
            SV(d):        0           0         0         0


     Let's now picture the state when an already signed object undergoes
     file operations.

     on-disk state:
          OV(d): 3
          SV(d): 3|<signature>


                       lookup()     open()    open()    open()
            ===========================================================

            OV(m):        3*          4         4         4
                      -----------------------------------------
            OV(d):        3           4[s]      4         4
            SV(d):        3           3         3         3
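
     The second table can be reproduced with a small model. This is a sketch
     under simplifying assumptions (single-threaded, disk writes never fail);
     the class Inode, its fields and the durable_syncs counter are
     hypothetical names, not GlusterFS identifiers.

```python
class Inode:
    """Hypothetical in-memory and on-disk version state for one object."""

    def __init__(self, ov_disk: int, sv_disk: int):
        self.ov_disk = ov_disk      # OV(d)
        self.sv_disk = sv_disk      # SV(d)
        self.ov_mem = ov_disk       # OV(m), as brought in by lookup()
        self.dirty = True           # lookup() leaves the inode dirty ("*")
        self.durable_syncs = 0      # number of "[s]" operations performed

    def open(self) -> None:
        if not self.dirty:
            return                  # subsequent open()s are NO-OPs
        new_version = self.ov_mem + 1
        self.ov_disk = new_version  # models the durable xattr set ("[s]")
        self.durable_syncs += 1
        self.ov_mem = new_version   # memory is updated only after disk
        self.dirty = False


inode = Inode(ov_disk=3, sv_disk=3)   # state after lookup(): OV(m) = 3*
for _ in range(3):
    inode.open()
# Only the first open() incremented and synchronized the version.
```

     After the three open() calls the model holds OV(m) = 4, OV(d) = 4 and
     SV(d) = 3, with a single durable synchronization.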

  Signing process
  ---------------
     As per the above example, when the last open file descriptor is closed,
     signing needs to be performed. The protocol requires that the signature
     be attached to a version, which in this case is the in-memory
     value of the ongoing version. A release() also marks the inode dirty;
     therefore, the next open() does a durable version synchronization to
     disk.

     [carrying forward the versions from the earlier example]

                       close()     release()  open()   open()
            ===========================================================

            OV(m):        4           4*        5         5
                      -----------------------------------------
            OV(d):        4           4         5[s]      5
            SV(d):        3           3         3         3

     As shown above, a release() call triggers signing with the signing version
     set to OV(m), which in this case is 4. During signing, the object is signed
     with a signature attached to version 4 as shown below (continuing with
     the last open() call from above):

                       open()           sign(4, signature)
            ===========================================================

            OV(m):        5                     5
                      -----------------------------------------
            OV(d):        5                     5
            SV(d):        3               4:<signature>[s]

     A signature comparison at this point in time is untrustworthy due to
     the version mismatch. This also protects against node crashes and hard
     reboots, owing to the durability guarantee of the on-disk version on first
     open().

                       close()     release()  open()
            ===========================================================

            OV(m):        4           4*        5
                      --------------------------------  CRASH
            OV(d):        4           4         5[s]
            SV(d):        3           3         3

     The protocol is immune to signing requests arriving after a crash due to
     the version synchronization performed on first open(). A signing
     request for a version lower than the *current* ongoing version
     can be ignored. It is left to the implementation to either
     accept or ignore such signing request(s).
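
     The signing flow and the handling of late signing requests can be
     sketched as below. This is an illustrative model only; the class Inode,
     the accept_stale parameter and the trustable() helper are hypothetical
     and merely mirror the two choices the protocol leaves to the
     implementation.

```python
class Inode:
    """Hypothetical model of the version state used during signing."""

    def __init__(self, ov: int, sv: int):
        self.ov_mem = ov        # OV(m)
        self.ov_disk = ov       # OV(d)
        self.sv_disk = sv       # SV(d)
        self.signature = None
        self.dirty = False

    def open(self) -> None:
        if self.dirty:
            self.ov_disk = self.ov_mem + 1   # durable sync first
            self.ov_mem = self.ov_disk
            self.dirty = False

    def release(self) -> int:
        """Last close(): mark the inode dirty, return the version to sign."""
        self.dirty = True
        return self.ov_mem

    def sign(self, version: int, signature: str, accept_stale: bool = True) -> bool:
        # A request for a version lower than the current ongoing version
        # is stale; the implementation may accept or ignore it.
        if version < self.ov_mem and not accept_stale:
            return False
        self.sv_disk = version               # durable: "version:<signature>[s]"
        self.signature = signature
        return True

    def trustable(self) -> bool:
        return self.sv_disk == self.ov_disk


inode = Inode(ov=4, sv=3)
version = inode.release()      # signing is requested against version 4
inode.open()                   # a new open() bumps the ongoing version to 5
inode.sign(version, "sig")     # accepted, but attached to version 4
```

     In this run the signature is stored against version 4 while the ongoing
     version is 5, so trustable() reports the signature as stale. Passing
     accept_stale=False would instead drop the request and leave SV(d) at 3.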

     [
        NOTE: Inode forget() causes a fresh lookup() to be triggered.
              Since a forget() call is received when there are no
              active references for an inode, the on-disk version is
              the latest and would be copied in-memory on lookup().
     ]

ii. Correctness Guarantees
==========================

     Concurrent open()'s
     -------------------
     When an inode is dirty (i.e., the very next operation would try to
     synchronize the version to disk), there can be multiple calls [say,
     open()] that find the inode state as dirty and try to write back
     the new version to disk. Also note that marking the inode as synced
     and updating the in-memory version is done *after* the new version
     is written to disk. This avoids an inconsistent state in case the
     version synchronization fails (the in-memory version would otherwise
     hold the updated value while the on-disk version does not).

     Coming back to multiple open() calls on an object, each open() call
     tries to synchronize the new version to disk if the inode is marked
     as dirty. This is safe as each open() would try to synchronize the
     same new version (ongoing version + 1) even if the updates are concurrent.
     The in-memory version is finally updated to reflect the new
     version and the inode is marked non-dirty. Again, this is done *only* if
     the inode is still dirty; open() calls which updated the on-disk
     version but lost the race to update the in-memory version are
     therefore NO-OPs.

     on-disk state:
          OV(d): 3
          SV(d): 3|<signature>


                       lookup()     open()    open()'   open()'  open()
            =============================================================

            OV(m):        3*          3*        3*        4      NO-OP
                      --------------------------------------------------
            OV(d):        3           4[s]      4[s]      4        4
            SV(d):        3           3         3         3        3
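
     The interleaving in the table can be modelled by splitting open() into
     its two phases and running them by hand in a racy order. This is a
     deterministic sketch rather than real concurrent code; the method names
     sync_to_disk and commit_in_memory are hypothetical, and an actual
     implementation would need to perform the second phase under a lock.

```python
class Inode:
    """Hypothetical model: open() split into a disk and a memory phase."""

    def __init__(self, ov: int):
        self.ov_mem = ov        # OV(m)
        self.ov_disk = ov       # OV(d)
        self.dirty = True       # state after lookup()

    def sync_to_disk(self):
        """Phase 1: durably write OV(m) + 1 if the inode is dirty.

        Returns the version written, or None when the inode is clean.
        """
        if not self.dirty:
            return None
        target = self.ov_mem + 1
        self.ov_disk = target   # racing callers all write the same value
        return target

    def commit_in_memory(self, target) -> bool:
        """Phase 2: publish the new version, only if still dirty."""
        if target is None or not self.dirty:
            return False        # lost the race or nothing to do: NO-OP
        self.ov_mem = target
        self.dirty = False
        return True


inode = Inode(ov=3)
a = inode.sync_to_disk()            # first open() writes 4 to disk
b = inode.sync_to_disk()            # racing open() also writes 4
won_a = inode.commit_in_memory(a)   # first caller updates OV(m) to 4
won_b = inode.commit_in_memory(b)   # second caller finds the inode clean
late = inode.sync_to_disk()         # a later open() does nothing
```

     Both racing calls write the same on-disk value, only one of them
     updates the in-memory version, and the final state is OV(m) = OV(d) = 4.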


     open()/release() race
     ---------------------
     This race can cause a release() [on last close()] to pick up the
     ongoing version which was just incremented by a fresh open(). This
     leads to signing of the object with the same version as the
     ongoing version, so the signature appears trustable even though it may
     not match the hash when recalculated.
     Another point worth mentioning here is that the open
     file descriptor is *attached* to its inode *after* the
     version synchronization (and increment) is done. Hence, if a release()
     sneaks into this window, the file descriptor list for the given
     inode is still empty, and release() therefore considers it the
     last close().

     To counter this, the protocol should track the open and release
     counts for file descriptors. A release() should only trigger a
     signing request when the file descriptor list for an inode is empty
     and the number of releases matches the number of opens. When an
     open() sneaks in and increments the ongoing version but the file
     descriptor is still not attached to the inode, the open and release
     counts mismatch, hence identifying an open() in progress.
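
     The counting rule can be sketched as follows. This is a minimal model
     under the assumption that an open() is counted as soon as it starts and
     attaches its file descriptor later; the names Inode, open_begin and
     open_finish are hypothetical.

```python
class Inode:
    """Hypothetical model of open/release accounting for one inode."""

    def __init__(self):
        self.fds = set()     # file descriptors attached to the inode
        self.opens = 0       # open() calls that have started
        self.releases = 0    # release() calls seen

    def open_begin(self) -> None:
        # open() has started (the version sync happens in this window);
        # the file descriptor is not attached yet.
        self.opens += 1

    def open_finish(self, fd: int) -> None:
        self.fds.add(fd)     # the fd is attached only after the sync

    def release(self, fd: int) -> bool:
        """Return True if this release() should trigger a signing request."""
        self.fds.discard(fd)
        self.releases += 1
        # Sign only when no fd is attached AND no open() is in progress.
        return not self.fds and self.releases == self.opens


inode = Inode()
inode.open_begin()
inode.open_finish(1)          # fd 1 fully opened
inode.open_begin()            # a second open() is in progress
first = inode.release(1)      # fd list is empty, but the counts mismatch
inode.open_finish(2)
second = inode.release(2)     # counts match and the fd list is empty
```

     Here the first release() does not trigger signing because an open() is
     still in flight, while the second one does.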


iii. Implementation
===================
     Refer to: xlators/features/bit-rot/src/stub

iv. Protocol enhancements
=========================

     a) Delaying persisting on-disk versions till open()
     b) Lazy version updation (until signing?)
     c) Protocol changes required to handle anonymous file
        descriptors in GlusterFS.