diff options
author | Vijay Bellur <vijay@gluster.com> | 2012-04-09 23:11:52 +0530 |
---|---|---|
committer | Anand Avati <avati@redhat.com> | 2012-04-11 10:25:56 -0700 |
commit | 076830c068fb39bbc3e863c89a4253cbea36357e (patch) | |
tree | 842884d8db9a40d5a53e5171c852a84daa8e0f65 /doc/hacker-guide/replicate.txt | |
parent | df8e2f53b70f4f49af70df308010dddfe5ca35ec (diff) |
doc: Move outdated documentation to legacy
Change-Id: I0ceba9a993e8b1cdef4ff6a784bfd69c08107d88
BUG: 811311
Signed-off-by: Vijay Bellur <vijay@gluster.com>
Reviewed-on: http://review.gluster.com/3116
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
Diffstat (limited to 'doc/hacker-guide/replicate.txt')
-rw-r--r-- | doc/hacker-guide/replicate.txt | 206 |
1 files changed, 0 insertions, 206 deletions
diff --git a/doc/hacker-guide/replicate.txt b/doc/hacker-guide/replicate.txt deleted file mode 100644 index fd1ef2747f0..00000000000 --- a/doc/hacker-guide/replicate.txt +++ /dev/null @@ -1,206 +0,0 @@ ---------------- -* cluster/replicate ---------------- - -Before understanding replicate, one must understand two internal FOPs: - -GF_FILE_LK: - This is exactly like fcntl(2) locking, except the locks are in a - separate domain from locks held by applications. - -GF_DIR_LK (loc_t *loc, char *basename): - This allows one to lock a name under a directory. For example, - to lock /mnt/glusterfs/foo, one would use the call: - - GF_DIR_LK ({loc_t for "/mnt/glusterfs"}, "foo") - - If one wishes to lock *all* the names under a particular directory, - supply the basename argument as NULL. - - The locks can either be read locks or write locks; consult the - function prototype for more details. - -Both these operations are implemented by the features/locks (earlier -known as posix-locks) translator. - --------------- -* Basic design --------------- - -All FOPs can be classified into four major groups: - - - inode-read - Operations that read an inode's data (file contents) or metadata (perms, etc.). - - access, getxattr, fstat, readlink, readv, stat. - - - inode-write - Operations that modify an inode's data or metadata. - - chmod, chown, truncate, writev, utimens. - - - dir-read - Operations that read a directory's contents or metadata. - - readdir, getdents, checksum. - - - dir-write - Operations that modify a directory's contents or metadata. - - create, link, mkdir, mknod, rename, rmdir, symlink, unlink. - - Some of these make a subgroup in that they modify *two* different entries: - link, rename, symlink. - - - Others - Other operations. - - flush, lookup, open, opendir, statfs. - ------------- -* Algorithms ------------- - -Each of the four major groups has its own algorithm: - - ---------------------- - - inode-read, dir-read - ---------------------- - - = Send a request to the first child that is up: - - if it fails: - try the next available child - - if we have exhausted all children: - return failure - - ------------- - - inode-write - ------------- - - All operations are done in parallel unless specified otherwise. - - (1) Send a GF_FILE_LK request on all children for a write lock on - the appropriate region - (for metadata operations: entire file (0, 0) - for writev: (offset, offset+size of buffer)) - - - If a lock request fails on a child: - unlock all children - try to acquire a blocking lock (F_SETLKW) on each child, serially. - - If this fails (due to ENOTCONN or EINVAL): - Consider this child as dead for rest of transaction. - - (2) Mark all children as "pending" on all (alive) children - (see below for meaning of "pending"). - - - If it fails on any child: - mark it as dead (in transaction local state). - - (3) Perform operation on all (alive) children. - - - If it fails on any child: - mark it as dead (in transaction local state). - - (4) Unmark all successful children as not "pending" on all nodes. - - (5) Unlock region on all (alive) children. - - ----------- - - dir-write - ----------- - - The algorithm for dir-write is same as above except instead of holding - GF_FILE_LK locks we hold a GF_DIR_LK lock on the name being operated upon. - In case of link-type calls, we hold locks on both the operand names. - ------------ -* "pending" ------------ - - The "pending" number is like a journal entry. A pending entry is an - array of 32-bit integers stored in network byte-order as the extended - attribute of an inode (which can be a directory as well). - - There are three keys corresponding to three types of pending operations: - - - AFR_METADATA_PENDING - There are some metadata operations pending on this inode (perms, ctime/mtime, - xattr, etc.). - - - AFR_DATA_PENDING - There is some data pending on this inode (writev). - - - AFR_ENTRY_PENDING - There are some directory operations pending on this directory - (create, unlink, etc.). - ------------ -* Self heal ------------ - - - On lookup, gather extended attribute data: - - If entry is a regular file: - - If an entry is present on one child and not on others: - - create entry on others. - - If entries exist but have different metadata (perms, etc.): - - consider the entry with the highest AFR_METADATA_PENDING number as - definitive and replicate its attributes on children. - - - If entry is a directory: - - Consider the entry with the higest AFR_ENTRY_PENDING number as - definitive and replicate its contents on all children. - - - If any two entries have non-matching types (i.e., one is file and - other is directory): - - Announce to the user via log that a split-brain situation has been - detected, and do nothing. - - - On open, gather extended attribute data: - - Consider the file with the highest AFR_DATA_PENDING number as - the definitive one and replicate its contents on all other - children. - - During all self heal operations, appropriate locks must be held on all - regions/entries being affected. - ---------------- -* Inode scaling ---------------- - -Inode scaling is necessary because if a situation arises where: - - An inode number is returned for a directory (by lookup) which was - previously the inode number of a file (as per FUSE's table), then - FUSE gets horribly confused (consult a FUSE expert for more details). - -To avoid such a situation, we distribute the 64-bit inode space equally -among all children of replicate. - -To illustrate: - -If c1, c2, c3 are children of replicate, they each get 1/3 of the available -inode space: - -Child: c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 ... -Inode number: 1 2 3 4 5 6 7 8 9 10 11 ... - -Thus, if lookup on c1 returns an inode number "2", it is scaled to "4" -(which is the second inode number in c1's space). - -This way we ensure that there is never a collision of inode numbers from -two different children. - -This reduction of inode space doesn't really reduce the usability of -replicate since even if we assume replicate has 1024 children (which would be a -highly unusual scenario), each child still has a 54-bit inode space. - -2^54 ~ 1.8 * 10^16 - -which is much larger than any real world requirement. - - -============================================== -$ Last updated: Sun Oct 12 23:17:01 IST 2008 $ -$ Author: Vikas Gorur <vikas@gluster.com> $ -============================================== - |