Added all files

author: Vikas Gorur <vikas@zresearch.com> 2009-02-18 17:36:07 +0530
committer: Vikas Gorur <vikas@zresearch.com> 2009-02-18 17:36:07 +0530
commit: 77adf4cd648dce41f89469dd185deec6b6b53a0b (patch)
tree: 02e155a5753b398ee572b45793f889b538efab6b /doc/hacker-guide/replicate.txt
parent: f3b2e6580e5663292ee113c741343c8a43ee133f (diff)
1 files changed, 206 insertions, 0 deletions
diff --git a/doc/hacker-guide/replicate.txt b/doc/hacker-guide/replicate.txt
new file mode 100644
index 000000000..284f373fb
--- /dev/null
+++ b/doc/hacker-guide/replicate.txt
@@ -0,0 +1,206 @@
+---------------
+* cluster/replicate
+---------------
+
+Before understanding replicate, one must understand two internal FOPs:
+
+GF_FILE_LK:
+  This is exactly like fcntl(2) locking, except the locks are in a 
+  separate domain from locks held by applications.
+
+GF_DIR_LK (loc_t *loc, char *basename):
+  This allows one to lock a name under a directory. For example,
+  to lock /mnt/glusterfs/foo, one would use the call:
+
+  GF_DIR_LK ({loc_t for "/mnt/glusterfs"}, "foo")
+
+  If one wishes to lock *all* the names under a particular directory,
+  supply the basename argument as NULL.
+
+  The locks can either be read locks or write locks; consult the 
+  function prototype for more details.
+
+Both these operations are implemented by the features/locks (earlier
+known as posix-locks) translator.
+
+--------------
+* Basic design
+--------------
+
+All FOPs can be classified into four major groups:
+
+ - inode-read
+   Operations that read an inode's data (file contents) or metadata (perms, etc.).
+
+   access, getxattr, fstat, readlink, readv, stat.
+
+ - inode-write
+   Operations that modify an inode's data or metadata.
+
+   chmod, chown, truncate, writev, utimens.
+
+ - dir-read
+   Operations that read a directory's contents or metadata.
+
+   readdir, getdents, checksum.
+
+ - dir-write
+   Operations that modify a directory's contents or metadata.
+
+   create, link, mkdir, mknod, rename, rmdir, symlink, unlink.
+
+   Some of these make a subgroup in that they modify *two* different entries:
+        link, rename, symlink.
+
+ - Others
+   Other operations.
+
+   flush, lookup, open, opendir, statfs.
+
+------------
+* Algorithms
+------------
+
+Each of the four major groups has its own algorithm:
+
+ ----------------------
+ - inode-read, dir-read
+ ----------------------
+
+ = Send a request to the first child that is up:
+   - if it fails:
+       try the next available child
+   - if we have exhausted all children:
+       return failure
+
+ -------------
+ - inode-write
+ -------------
+
+ All operations are done in parallel unless specified otherwise.
+
+ (1) Send a GF_FILE_LK request on all children for a write lock on 
+     the appropriate region
+            (for metadata operations: entire file (0, 0)
+             for writev: (offset, offset+size of buffer))
+
+     - If a lock request fails on a child:
+         unlock all children
+         try to acquire a blocking lock (F_SETLKW) on each child, serially.
+	 
+	 If this fails (due to ENOTCONN or EINVAL):
+           Consider this child as dead for rest of transaction.
+
+ (2) Mark all children as "pending" on all (alive) children 
+     (see below for meaning of "pending").
+
+     - If it fails on any child:
+         mark it as dead (in transaction local state).
+
+ (3) Perform operation on all (alive) children.
+
+     - If it fails on any child:
+         mark it as dead (in transaction local state).
+
+ (4) Unmark all successful children as not "pending" on all nodes.
+
+ (5) Unlock region on all (alive) children.
+     
+ -----------
+ - dir-write
+ -----------
+
+ The algorithm for dir-write is same as above except instead of holding
+ GF_FILE_LK locks we hold a GF_DIR_LK lock on the name being operated upon.
+ In case of link-type calls, we hold locks on both the operand names.
+
+-----------
+* "pending"
+-----------
+
+ The "pending" number is like a journal entry. A pending entry is an
+ array of 32-bit integers stored in network byte-order as the extended
+ attribute of an inode (which can be a directory as well).
+ 
+ There are three keys corresponding to three types of pending operations:
+
+ - AFR_METADATA_PENDING
+     There are some metadata operations pending on this inode (perms, ctime/mtime, 
+     xattr, etc.).
+
+ - AFR_DATA_PENDING
+     There is some data pending on this inode (writev).
+
+ - AFR_ENTRY_PENDING
+     There are some directory operations pending on this directory
+     (create, unlink, etc.).
+    
+-----------
+* Self heal
+-----------
+
+ - On lookup, gather extended attribute data:
+   - If entry is a regular file:
+     - If an entry is present on one child and not on others:
+       - create entry on others.
+     - If entries exist but have different metadata (perms, etc.):
+       - consider the entry with the highest AFR_METADATA_PENDING number as
+         definitive and replicate its attributes on children.
+
+   - If entry is a directory:
+     - Consider the entry with the higest AFR_ENTRY_PENDING number as
+       definitive and replicate its contents on all children.
+
+   - If any two entries have non-matching types (i.e., one is file and
+     other is directory):
+     - Announce to the user via log that a split-brain situation has been
+       detected, and do nothing.
+ 
+ - On open, gather extended attribute data:
+   - Consider the file with the highest AFR_DATA_PENDING number as
+     the definitive one and replicate its contents on all other
+     children.
+
+ During all self heal operations, appropriate locks must be held on all
+ regions/entries being affected.
+
+---------------
+* Inode scaling
+---------------
+
+Inode scaling is necessary because if a situation arises where:
+  - An inode number is returned for a directory (by lookup) which was
+    previously the inode number of a file (as per FUSE's table), then
+    FUSE gets horribly confused (consult a FUSE expert for more details).
+
+To avoid such a situation, we distribute the 64-bit inode space equally
+among all children of replicate.
+
+To illustrate:
+
+If c1, c2, c3 are children of replicate, they each get 1/3 of the available
+inode space:
+
+Child:        c1   c2   c3   c1   c2   c3   c1   c2   c3   c1   c2 ...
+Inode number: 1    2    3    4    5    6    7    8    9    10   11 ...
+
+Thus, if lookup on c1 returns an inode number "2", it is scaled to "4"
+(which is the second inode number in c1's space).
+
+This way we ensure that there is never a collision of inode numbers from
+two different children.
+
+This reduction of inode space doesn't really reduce the usability of 
+replicate since even if we assume replicate has 1024 children (which would be a
+highly unusual scenario), each child still has a 54-bit inode space.
+
+2^54 ~ 1.8 * 10^16
+
+which is much larger than any real world requirement.
+
+
+==============================================
+$ Last updated: Sun Oct 12 23:17:01 IST 2008 $
+$ Author: Vikas Gorur <vikas@zresearch.com>  $
+==============================================
+
author	Vikas Gorur <vikas@zresearch.com>	2009-02-18 17:36:07 +0530
committer	Vikas Gorur <vikas@zresearch.com>	2009-02-18 17:36:07 +0530
commit	77adf4cd648dce41f89469dd185deec6b6b53a0b (patch)
tree	02e155a5753b398ee572b45793f889b538efab6b /doc/hacker-guide/replicate.txt
parent	f3b2e6580e5663292ee113c741343c8a43ee133f (diff)