author Venky Shankar <vshankar@redhat.com> 2015-02-18 17:01:21 +0530
committer Vijay Bellur <vbellur@redhat.com> 2015-03-24 10:57:24 -0700
commit cd3d34289c92f01843a866f4432bdd2da1ee59db
tree dafc160856d7ceb2d0fbcb1a0385f3a5892d61ba
parent 866c64ba5e29a90b37fa051061a58300ae129a2c
doc: document bit-rot feature
Change-Id: Ibad640d01975906b7642c76a1649e3e272f3a8bc
BUG: 1170075
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-on: http://review.gluster.org/9712
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
-rw-r--r-- doc/features/bit-rot/00-INDEX 8
-rw-r--r-- doc/features/bit-rot/bitrot-docs.txt 5
-rw-r--r-- doc/features/bit-rot/memory-usage.txt 48
-rw-r--r-- doc/features/bit-rot/object-versioning.txt 236
4 files changed, 297 insertions, 0 deletions
diff --git a/doc/features/bit-rot/00-INDEX b/doc/features/bit-rot/00-INDEX
new file mode 100644
index 00000000000..d351a1976ff
--- /dev/null
+++ b/doc/features/bit-rot/00-INDEX
@@ -0,0 +1,8 @@
+00-INDEX
+ - this file
+bitrot-docs.txt
+ - links to design, spec and feature page
+object-versioning.txt
+ - object versioning mechanism to track object signature
+memory-usage.txt
+ - memory usage during object expiry tracking
diff --git a/doc/features/bit-rot/bitrot-docs.txt b/doc/features/bit-rot/bitrot-docs.txt
new file mode 100644
index 00000000000..39cd491dbcd
--- /dev/null
+++ b/doc/features/bit-rot/bitrot-docs.txt
@@ -0,0 +1,5 @@
+* Feature page: http://www.gluster.org/community/documentation/index.php/Features/BitRot
+
+* Design: http://goo.gl/Mjy4mD
+
+* CLI specification: http://goo.gl/2o12Fn
diff --git a/doc/features/bit-rot/memory-usage.txt b/doc/features/bit-rot/memory-usage.txt
new file mode 100644
index 00000000000..5fe06d4a209
--- /dev/null
+++ b/doc/features/bit-rot/memory-usage.txt
@@ -0,0 +1,48 @@
+object expiry tracking memory usage
+====================================
+
+The bitrot daemon tracks objects for expiry in a data structure known
+as a "timer-wheel"; an object is signed once its timer expires. The
+timer-wheel is a well known data structure for tracking the expiry of
+millions of objects. Let's look at the memory usage involved when
+tracking 1 million objects (per brick).
+
+The bitrot daemon uses the "br_object" structure to hold the information
+needed for signing. An instance of this structure is allocated
+for each object that needs to be signed.
+
+struct br_object {
+ xlator_t *this;
+
+ br_child_t *child;
+
+ void *data;
+ uuid_t gfid;
+ unsigned long signedversion;
+
+ struct list_head list;
+};
+
+Timer-wheel requires an instance of the structure below per
+object that needs to be tracked for expiry.
+
+struct gf_tw_timer_list {
+ void *data;
+ unsigned long expires;
+
+ /** callback routine */
+ void (*function)(struct gf_tw_timer_list *, void *, unsigned long);
+
+ struct list_head entry;
+};
+
+Structure sizes:
+ sizeof (struct br_object): 64 bytes
+ sizeof (struct gf_tw_timer_list): 40 bytes
+
+Together, these structures take up 104 bytes. To track all 1 million objects
+at the same time, the amount of memory taken up would be:
+
+ 1,000,000 * 104 bytes = 104,000,000 bytes (~100MB)
+
+Not so bad, I think.
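The arithmetic above can be checked with a small C sketch. The structure layouts are copied from the document. The `list_head` definition, the opaque `xlator_t` and `br_child_t` declarations, the 16-byte `sketch_uuid_t` stand-in for `uuid_t`, and the `tracking_bytes()` helper are assumptions made only so that the snippet compiles on its own. The 64 and 40 byte figures assume a platform with 8-byte pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for GlusterFS types (assumptions, for illustration only). */
typedef struct xlator xlator_t;           /* opaque; only pointers are used */
typedef struct br_child br_child_t;       /* opaque; only pointers are used */
typedef unsigned char sketch_uuid_t[16];  /* uuid_t is a 16-byte array */

struct list_head {
        struct list_head *next;
        struct list_head *prev;
};

/* Layout as given in the document (uuid_t replaced by the stand-in). */
struct br_object {
        xlator_t *this;
        br_child_t *child;
        void *data;
        sketch_uuid_t gfid;
        unsigned long signedversion;
        struct list_head list;
};

struct gf_tw_timer_list {
        void *data;
        unsigned long expires;
        void (*function)(struct gf_tw_timer_list *, void *, unsigned long);
        struct list_head entry;
};

/* Hypothetical helper: bytes needed to track nr_objects objects at once. */
static size_t
tracking_bytes (size_t nr_objects)
{
        return nr_objects * (sizeof (struct br_object) +
                             sizeof (struct gf_tw_timer_list));
}
```

With 8-byte pointers, `tracking_bytes (1000000)` evaluates to 104,000,000 bytes, which matches the estimate above. Allocator overhead per allocation is not included.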
diff --git a/doc/features/bit-rot/object-versioning.txt b/doc/features/bit-rot/object-versioning.txt
new file mode 100644
index 00000000000..def901f0fc5
--- /dev/null
+++ b/doc/features/bit-rot/object-versioning.txt
@@ -0,0 +1,236 @@
+Object versioning
+=================
+
+ Bitrot detection in GlusterFS relies on object (file) checksum (hash) verification,
+ also known as "object signature". An object is signed when there are no active
+ file descriptors referring to its inode (i.e., upon last close()). This is just a
+ hint for the initiation of hash calculation (and therefore signing). There is
+ absolutely no control over when clients can initiate modification operations on
+ the object. An object could be under modification while its hash computation is
+ in progress. It would also be inappropriate to restrict access to such objects
+ for the duration of signing.
+
+ Object versioning is used as a mechanism to identify the staleness of an object's
+ signature. The document below does not just list the version update protocol,
+ but goes through the various factors that led to its design.
+
+NOTE: The word "object" is used to represent a "regular file" (in the Linux sense) and
+ object versions are persisted in extended attributes of the object's inode.
+ Signature calculation covers the object's data only (no metadata as of now).
+
+INDEX
+=====
+ i. Version updation protocol
+ ii. Correctness guarantees
+ iii. Implementation
+ iv. Protocol enhancements
+
+i. Version updation protocol
+============================
+ There are two types of versions associated with an object:
+
+ a) Ongoing version: This version is incremented on the first open() [when
+ the in-memory representation of the object (inode) is marked dirty]
+ and synchronized to disk. When an object is created, a default ongoing
+ version of one (1) is assigned. An object lookup() also assigns the
+ default version if none is present. When a version is initialized upon
+ a lookup() or creat() FOP, it need not be durable on disk and therefore
+ can just be an extended attribute set without an expensive fsync()
+ syscall.
+
+ b) Signing version: This is the version against which an object is deemed
+ to be signed. An object's signature is tied to a particular signed version.
+ Since an object is a candidate for signing upon last release() [last
+ close()], the signing version is the "ongoing version" at that point in time.
+
+ An object's signature is trustworthy when the version it was signed against
+ matches the ongoing version, i.e., if the hash is calculated by hand and
+ compared against the object signature, it *should* be a perfect match if
+ the versions are equal. If the versions differ, the signature is
+ considered stale (it might or might not match the hash just calculated).
+
+ Initialization of object versions
+ ---------------------------------
+ An object that existed before versioning was introduced is assigned the
+ default versions upon lookup(). The protocol at this point expects "no"
+ durability guarantees for the versions, i.e., extended attribute sets
+ need not be followed by an explicit filesystem sync (fsync()). In case
+ of a power outage or a crash, versions are re-initialized with defaults
+ if found to be non-existent. The signing version is initialized with a
+ default value of zero (0) and the ongoing version with one (1).
+
+ [
+ NOTE: If an object already has versions on-disk, lookup() just brings
+ the versions in memory. In this case both versions may or may
+ not match depending on the state the object was left in.
+ ]
+
+
+ Increment of object versions
+ ----------------------------
+ During initial versioning, the in-memory representation of the object is
+ marked dirty, so that the next modification operation on the object
+ triggers a version synchronization to disk (extended attribute set).
+ Moreover, this operation needs to be durable on disk for the protocol
+ to be crash consistent.
+
+ Let's picture the various version states after subsequent open()s.
+ Not all modification operations need to increment the ongoing version;
+ only the first operation needs to (subsequent operations are NO-OPs).
+
+ NOTE: From here on, "[s]" depicts a durable filesystem operation and
+ "*" depicts the inode as dirty. OV(m) is the in-memory ongoing version,
+ OV(d) the on-disk ongoing version and SV(d) the on-disk signing version.
+
+
+           lookup()   open()    open()    open()
+ ===========================================================
+
+ OV(m):    1*         2         2         2
+ -----------------------------------------
+ OV(d):    1          2[s]      2         2
+ SV(d):    0          0         0         0
+
+
+ Let's now picture the state when an already signed object undergoes
+ file operations.
+
+ on-disk state:
+ OV(d): 3
+ SV(d): 3|<signature>
+
+
+           lookup()   open()    open()    open()
+ ===========================================================
+
+ OV(m):    3*         4         4         4
+ -----------------------------------------
+ OV(d):    3          4[s]      4         4
+ SV(d):    3          3         3         3
+
+ Signing process
+ ---------------
+ As per the above example, when the last open file descriptor is closed,
+ signing needs to be performed. The protocol requires that the signature
+ be attached to a version, which in this case is the in-memory
+ value of the ongoing version. A release() also marks the inode dirty;
+ therefore, the next open() does a durable version synchronization to
+ disk.
+
+ [carrying forward the versions from the earlier example]
+
+           close()    release() open()    open()
+ ===========================================================
+
+ OV(m):    4          4*        5         5
+ -----------------------------------------
+ OV(d):    4          4         5[s]      5
+ SV(d):    3          3         3         3
+
+ As shown above, a release() call triggers signing with the signing version
+ set to OV(m), which in this case is 4. During signing, the object is signed
+ with a signature attached to version 4 as shown below (continuing with
+ the last open() call from above):
+
+           open()     sign(4, signature)
+ ===========================================================
+
+ OV(m):    5          5
+ -----------------------------------------
+ OV(d):    5          5
+ SV(d):    3          4:<signature>[s]
+
+ A signature comparison at this point in time is untrustworthy due to
+ the version mismatch. This also protects against node crashes and hard
+ reboots, owing to the durability guarantee of the on-disk version on
+ first open().
+
+           close()    release() open()
+ ===========================================================
+
+ OV(m):    4          4*        5
+ -------------------------------- CRASH
+ OV(d):    4          4         5[s]
+ SV(d):    3          3         3
+
+ The protocol is immune to signing requests after crashes due to
+ the version synchronization performed on first open(). A signing
+ request for a version lower than the *current* ongoing version
+ can be ignored. It is left to the implementation to either
+ accept or ignore such signing request(s).
+
+ [
+ NOTE: Inode forget() causes a fresh lookup() to be triggered.
+ Since a forget() call is received when there are no
+ active references for an inode, the on-disk version is
+ the latest and would be copied in-memory on lookup().
+ ]
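The state transitions tabulated in section i can be sketched as a small in-memory model in C. This is not the stub translator's code: the `obj_version` structure and the `ver_*` function names are hypothetical, the "disk" is just a pair of struct fields, and durability ([s]) is only indicated in comments. The sketch takes the "accept" option for signing requests that are older than the ongoing version, as in the sign(4, signature) example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one object's versions (not GlusterFS code). */
struct obj_version {
        unsigned long ov_mem;   /* OV(m): in-memory ongoing version */
        unsigned long ov_disk;  /* OV(d): on-disk ongoing version */
        unsigned long sv_disk;  /* SV(d): on-disk signing version */
        bool dirty;             /* "*": next open() must sync durably */
};

/* lookup(): assign defaults if no versions exist, then load into memory. */
static void
ver_lookup (struct obj_version *v, bool has_xattrs)
{
        if (!has_xattrs) {
                v->ov_disk = 1; /* non-durable xattr set, no fsync() */
                v->sv_disk = 0;
        }
        v->ov_mem = v->ov_disk;
        v->dirty = true;
}

/* open(): only the first open() on a dirty inode increments the version. */
static void
ver_open (struct obj_version *v)
{
        if (!v->dirty)
                return;                 /* NO-OP */
        v->ov_disk = v->ov_mem + 1;     /* durable write [s] comes first */
        v->ov_mem = v->ov_disk;         /* then update the in-memory copy */
        v->dirty = false;
}

/* release(): returns the version to sign against and marks the inode dirty. */
static unsigned long
ver_release (struct obj_version *v)
{
        v->dirty = true;
        return v->ov_mem;
}

/* sign(): attach the signature to a version (older versions are accepted). */
static void
ver_sign (struct obj_version *v, unsigned long version)
{
        assert (version <= v->ov_mem);  /* cannot sign a future version */
        v->sv_disk = version;           /* durable write [s] */
}

/* A signature is trusted only when signing and ongoing versions match. */
static bool
ver_signature_trusted (const struct obj_version *v)
{
        return v->sv_disk == v->ov_disk;
}
```

Starting from the on-disk state OV(d) = 3, SV(d) = 3 and replaying lookup(), open(), release(), open() and then sign() with the version returned by release() leaves SV(d) at 4 and OV(d) at 5, so `ver_signature_trusted()` reports the signature as stale, as in the tables above.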
+
+ii. Correctness Guarantees
+==========================
+
+ Concurrent open()'s
+ -------------------
+ When an inode is dirty (i.e., the very next operation would try to
+ synchronize the version to disk), there can be multiple calls [say,
+ open()] that find the inode state dirty and try to write back
+ the new version to disk. Also note that marking the inode as synced
+ and updating the in-memory version is done *after* the new version
+ is written to disk. This avoids the in-memory version holding the
+ updated value when the version synchronization to disk has failed.
+ Coming back to multiple open() calls on an object, each open() call
+ tries to synchronize the new version to disk if the inode is marked
+ as dirty. This is safe as each open() would try to synchronize the
+ same new version (ongoing version + 1) even if the update is concurrent.
+ The in-memory version is finally updated to reflect the new
+ version and the inode is marked non-dirty. Again, this is done *only* if
+ the inode is dirty; open() calls which updated the on-disk
+ version but lost the race to update the in-memory version are
+ therefore NO-OPs.
+
+ on-disk state:
+ OV(d): 3
+ SV(d): 3|<signature>
+
+
+           lookup()   open()    open()'   open()'   open()
+ =============================================================
+
+ OV(m):    3*         3*        3*        4         NO-OP
+ --------------------------------------------------
+ OV(d):    3          4[s]      4[s]      4         4
+ SV(d):    3          3         3         3         3
+
+
+ open()/release() race
+ ---------------------
+ This race can cause a release() [on last close()] to pick up the
+ ongoing version which was just incremented by a fresh open(). This
+ leads to signing of the object with the same version as the
+ ongoing version, and therefore to mismatching signatures when calculated.
+ Another point worth mentioning here is that the open
+ file descriptor is *attached* to its inode *after* it has done
+ version synchronization (and increment). Hence, if a release()
+ sneaks into this window, the file descriptor list for the given
+ inode is still empty, and release() therefore considers it the
+ last close().
+ To counter this, the protocol should track the open and release
+ counts for file descriptors. A release() should only trigger a
+ signing request when the file descriptor list for an inode is empty
+ and the number of releases matches the number of opens. When an
+ open() sneaks in and increments the ongoing version but the file
+ descriptor is still not attached to the inode, the open and release
+ counts mismatch, hence identifying an open() in progress.
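The open/release counting rule can be sketched in C as follows. The `inode_ctx` structure and the `ctx_*` functions are hypothetical names chosen for this illustration. A real implementation would also need locking around these counters, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-inode bookkeeping (not GlusterFS code). */
struct inode_ctx {
        unsigned long opens;     /* open()s that have synced the version */
        unsigned long releases;  /* release()s seen so far */
        unsigned long nr_fds;    /* descriptors currently attached */
};

/* open() has synchronized the version; its fd is not attached yet. */
static void
ctx_open_begin (struct inode_ctx *ctx)
{
        ctx->opens++;
}

/* open() attaches its file descriptor to the inode. */
static void
ctx_open_attach (struct inode_ctx *ctx)
{
        ctx->nr_fds++;
}

/*
 * release(): detach one fd. Returns true only when signing should be
 * triggered, i.e. no fds are attached and every open() has been matched
 * by a release().
 */
static bool
ctx_release (struct inode_ctx *ctx)
{
        assert (ctx->nr_fds > 0);
        ctx->nr_fds--;
        ctx->releases++;
        return ctx->nr_fds == 0 && ctx->opens == ctx->releases;
}
```

If a second open() has called `ctx_open_begin()` but not yet `ctx_open_attach()` when the first descriptor is released, the descriptor list is empty but the counts differ (2 opens, 1 release), so no signing request is sent. The request is sent only when the second descriptor is released.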
+
+
+iii. Implementation
+===================
+ Refer to: xlators/features/bit-rot/src/stub
+
+iv. Protocol enhancements
+=========================
+
+ a) Delaying persisting on-disk versions till open()
+ b) Lazy version updation (until signing?)
+ c) Protocol changes required to handle anonymous file
+ descriptors in GlusterFS.