Diffstat (limited to 'doc/legacy/hacker-guide/replicate.txt')
| -rw-r--r-- | doc/legacy/hacker-guide/replicate.txt | 206 | 
1 files changed, 0 insertions, 206 deletions
diff --git a/doc/legacy/hacker-guide/replicate.txt b/doc/legacy/hacker-guide/replicate.txt
deleted file mode 100644
index ad5b352a829..00000000000
--- a/doc/legacy/hacker-guide/replicate.txt
+++ /dev/null
@@ -1,206 +0,0 @@
----------------
-* cluster/replicate
----------------
-
-Before understanding replicate, one must understand two internal FOPs:
-
-GF_FILE_LK:
-  This is exactly like fcntl(2) locking, except the locks are in a
-  separate domain from locks held by applications.
-
-GF_DIR_LK (loc_t *loc, char *basename):
-  This allows one to lock a name under a directory. For example,
-  to lock /mnt/glusterfs/foo, one would use the call:
-
-  GF_DIR_LK ({loc_t for "/mnt/glusterfs"}, "foo")
-
-  If one wishes to lock *all* the names under a particular directory,
-  supply the basename argument as NULL.
-
-  The locks can either be read locks or write locks; consult the
-  function prototype for more details.
-
-Both of these operations are implemented by the features/locks (earlier
-known as posix-locks) translator.
-
--------------
-* Basic design
--------------
-
-All FOPs can be classified into four major groups, plus a catch-all group for the rest:
-
- - inode-read
-   Operations that read an inode's data (file contents) or metadata (perms, etc.).
-
-   access, getxattr, fstat, readlink, readv, stat.
-
- - inode-write
-   Operations that modify an inode's data or metadata.
-
-   chmod, chown, truncate, writev, utimens.
-
- - dir-read
-   Operations that read a directory's contents or metadata.
-
-   readdir, getdents, checksum.
-
- - dir-write
-   Operations that modify a directory's contents or metadata.
-
-   create, link, mkdir, mknod, rename, rmdir, symlink, unlink.
-
-   Some of these form a subgroup because they modify *two* different entries:
-        link, rename, symlink.
-
- - Others
-   Other operations.
-
-   flush, lookup, open, opendir, statfs.
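The grouping above can be pictured as a plain lookup table. The following is only an illustrative sketch: the names `FOP_GROUPS`, `TWO_ENTRY_FOPS`, and `classify` are hypothetical and do not come from the replicate source; only the group names and FOP lists are taken from the text.

```python
# Hypothetical lookup table; the group names and FOP lists are copied from
# the design note, everything else is made up for illustration.
FOP_GROUPS = {
    "inode-read": ["access", "getxattr", "fstat", "readlink", "readv", "stat"],
    "inode-write": ["chmod", "chown", "truncate", "writev", "utimens"],
    "dir-read": ["readdir", "getdents", "checksum"],
    "dir-write": ["create", "link", "mkdir", "mknod", "rename", "rmdir",
                  "symlink", "unlink"],
    "others": ["flush", "lookup", "open", "opendir", "statfs"],
}

# The dir-write subgroup that the note says modifies two different entries.
TWO_ENTRY_FOPS = {"link", "rename", "symlink"}

# Invert the table once so that each FOP name maps to its group.
GROUP_OF = {fop: group for group, fops in FOP_GROUPS.items() for fop in fops}


def classify(fop):
    """Return (group, touches_two_entries) for a FOP name."""
    return GROUP_OF[fop], fop in TWO_ENTRY_FOPS
```

With this table, `classify("rename")` yields the dir-write group with the two-entry flag set, which is the case where locks on both operand names would be needed.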
-
-------------
-* Algorithms
-------------
-
-Each of the four major groups has its own algorithm:
-
- ----------------------
- - inode-read, dir-read
- ----------------------
-
- = Send a request to the first child that is up:
-   - if it fails:
-       try the next available child
-   - if we have exhausted all children:
-       return failure
-
- -------------
- - inode-write
- -------------
-
- All operations are done in parallel unless specified otherwise.
-
- (1) Send a GF_FILE_LK request on all children for a write lock on
-     the appropriate region
-            (for metadata operations: entire file (0, 0)
-             for writev: (offset, offset+size of buffer))
-
-     - If a lock request fails on a child:
-         unlock all children
-         try to acquire a blocking lock (F_SETLKW) on each child, serially.
-
-         If this fails (due to ENOTCONN or EINVAL):
-           Consider this child as dead for the rest of the transaction.
-
- (2) Mark all children as "pending" on all (alive) children
-     (see below for meaning of "pending").
-
-     - If it fails on any child:
-         mark it as dead (in transaction local state).
-
- (3) Perform operation on all (alive) children.
-
-     - If it fails on any child:
-         mark it as dead (in transaction local state).
-
- (4) Clear the "pending" mark of all successful children on all nodes.
-
- (5) Unlock region on all (alive) children.
-
- -----------
- - dir-write
- -----------
-
- The algorithm for dir-write is the same as above, except that instead of holding
- GF_FILE_LK locks we hold a GF_DIR_LK lock on the name being operated upon.
- In case of link-type calls, we hold locks on both the operand names.
-
------------
-* "pending"
------------
-
- The "pending" number is like a journal entry. A pending entry is an
- array of 32-bit integers stored in network byte-order as an extended
- attribute of an inode (which can be a directory as well).
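To make the journal analogy concrete, here is a small in-memory sketch of steps (2) to (4) of the inode-write algorithm, with the pending array packed as 32-bit integers in network byte-order. It rests on assumptions not stated in the note: the array is taken to hold one counter per child, locking (steps 1 and 5) is left out, and the `Child` class and all function names are hypothetical.

```python
import struct


def encode_pending(counts):
    """Pack counters as unsigned 32-bit integers in network byte-order."""
    return struct.pack("!%dI" % len(counts), *counts)


def decode_pending(blob):
    """Inverse of encode_pending."""
    return list(struct.unpack("!%dI" % (len(blob) // 4), blob))


class Child:
    """Hypothetical in-memory stand-in for one child of replicate."""

    def __init__(self, name, n_children, up=True):
        self.name = name
        self.up = up
        self.data = b""
        # ASSUMPTION: one pending counter per child of replicate.
        self.xattr = encode_pending([0] * n_children)

    def bump_pending(self, indices, delta):
        if not self.up:
            raise ConnectionError(self.name)  # stands in for ENOTCONN
        counts = decode_pending(self.xattr)
        for i in indices:
            counts[i] += delta
        self.xattr = encode_pending(counts)

    def write(self, data):
        if not self.up:
            raise ConnectionError(self.name)
        self.data = data


def replicated_write(children, data):
    """Steps (2)-(4) only; returns the indices of the children that succeeded."""
    everyone = range(len(children))
    alive = set()
    # (2) Mark all children as pending on every child that responds.
    for i, child in enumerate(children):
        try:
            child.bump_pending(everyone, +1)
            alive.add(i)
        except ConnectionError:
            pass  # dead for the rest of this transaction
    # (3) Perform the operation on the children that are still alive.
    succeeded = set()
    for i in sorted(alive):
        try:
            children[i].write(data)
            succeeded.add(i)
        except ConnectionError:
            alive.discard(i)
    # (4) Clear the pending mark of the successful children on alive children.
    for i in sorted(alive):
        try:
            children[i].bump_pending(sorted(succeeded), -1)
        except ConnectionError:
            alive.discard(i)
    return succeeded
```

In this sketch, if the third of three children is down during a write, the two surviving children end up with a non-zero counter in the third slot, which is the leftover journal entry recording that the third child missed the operation.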
-
- There are three keys corresponding to three types of pending operations:
-
- - AFR_METADATA_PENDING
-     There are some metadata operations pending on this inode (perms, ctime/mtime,
-     xattr, etc.).
-
- - AFR_DATA_PENDING
-     There is some data pending on this inode (writev).
-
- - AFR_ENTRY_PENDING
-     There are some directory operations pending on this directory
-     (create, unlink, etc.).
-
-----------
-* Self heal
-----------
-
- - On lookup, gather extended attribute data:
-   - If entry is a regular file:
-     - If an entry is present on one child and not on others:
-       - create entry on others.
-     - If entries exist but have different metadata (perms, etc.):
-       - consider the entry with the highest AFR_METADATA_PENDING number as
-         definitive and replicate its attributes on children.
-
-   - If entry is a directory:
-     - Consider the entry with the highest AFR_ENTRY_PENDING number as
-       definitive and replicate its contents on all children.
-
-   - If any two entries have non-matching types (i.e., one is file and
-     other is directory):
-     - Announce to the user via log that a split-brain situation has been
-       detected, and do nothing.
-
- - On open, gather extended attribute data:
-   - Consider the file with the highest AFR_DATA_PENDING number as
-     the definitive one and replicate its contents on all other
-     children.
-
- During all self heal operations, appropriate locks must be held on all
- regions/entries being affected.
-
---------------
-* Inode scaling
---------------
-
-Inode scaling is necessary to avoid the following situation:
-  - If an inode number is returned for a directory (by lookup) which was
-    previously the inode number of a file (as per FUSE's table), then
-    FUSE gets horribly confused (consult a FUSE expert for more details).
-
-To avoid such a situation, we distribute the 64-bit inode space equally
-among all children of replicate.
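One way to picture this distribution is a round-robin interleaving of inode numbers across the children. The sketch below assumes 1-based inode numbers and a 0-based child index; `scale_inode` and `unscale_inode` are hypothetical names, not functions from the replicate source.

```python
def scale_inode(local_ino, child_index, child_count):
    """Map a child's local inode number into that child's slice of the space.

    local_ino is 1-based; child_index is 0-based (c1 -> 0, c2 -> 1, ...).
    """
    if local_ino < 1 or not 0 <= child_index < child_count:
        raise ValueError("bad inode number or child index")
    return (local_ino - 1) * child_count + child_index + 1


def unscale_inode(scaled_ino, child_count):
    """Recover (local_ino, child_index) from a scaled inode number."""
    child_index = (scaled_ino - 1) % child_count
    local_ino = (scaled_ino - 1) // child_count + 1
    return local_ino, child_index
```

Because each child only ever produces numbers congruent to its own index modulo the number of children, two different children can never yield the same scaled inode number under this scheme.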
-
-To illustrate:
-
-If c1, c2, c3 are children of replicate, they each get 1/3 of the available
-inode space:
-
-Child:        c1   c2   c3   c1   c2   c3   c1   c2   c3   c1   c2 ...
-Inode number: 1    2    3    4    5    6    7    8    9    10   11 ...
-
-Thus, if lookup on c1 returns an inode number "2", it is scaled to "4"
-(which is the second inode number in c1's space).
-
-This way we ensure that there is never a collision of inode numbers from
-two different children.
-
-This reduction of inode space doesn't really reduce the usability of
-replicate since even if we assume replicate has 1024 children (which would be a
-highly unusual scenario), each child still has a 54-bit inode space.
-
-2^54 ~ 1.8 * 10^16
-
-which is much larger than any real world requirement.
-
-
-==============================================
-$ Last updated: Sun Oct 12 23:17:01 IST 2008 $
-$ Author: Vikas Gorur <vikas@gluster.com>  $
-==============================================
-
