diff options
Diffstat (limited to 'doc/hacker-guide/en-US/markdown/afr.md')
-rw-r--r-- | doc/hacker-guide/en-US/markdown/afr.md | 191 |
1 files changed, 0 insertions, 191 deletions
diff --git a/doc/hacker-guide/en-US/markdown/afr.md b/doc/hacker-guide/en-US/markdown/afr.md deleted file mode 100644 index 566573a4e26..00000000000 --- a/doc/hacker-guide/en-US/markdown/afr.md +++ /dev/null @@ -1,191 +0,0 @@ -cluster/afr translator -====================== - -Locking -------- - -Before understanding replicate, one must understand two internal FOPs: - -### `GF_FILE_LK` - -This is exactly like `fcntl(2)` locking, except the locks are in a -separate domain from locks held by applications. - -### `GF_DIR_LK (loc_t *loc, char *basename)` - -This allows one to lock a name under a directory. For example, -to lock /mnt/glusterfs/foo, one would use the call: - -``` -GF_DIR_LK ({loc_t for "/mnt/glusterfs"}, "foo") -``` - -If one wishes to lock *all* the names under a particular directory, -supply the basename argument as `NULL`. - -The locks can either be read locks or write locks; consult the -function prototype for more details. - -Both these operations are implemented by the features/locks (earlier -known as posix-locks) translator. - -Basic design ------------- - -All FOPs can be classified into four major groups: - -### inode-read - -Operations that read an inode's data (file contents) or metadata (perms, etc.). - -access, getxattr, fstat, readlink, readv, stat. - -### inode-write - -Operations that modify an inode's data or metadata. - -chmod, chown, truncate, writev, utimens. - -### dir-read - -Operations that read a directory's contents or metadata. - -readdir, getdents, checksum. - -### dir-write - -Operations that modify a directory's contents or metadata. - -create, link, mkdir, mknod, rename, rmdir, symlink, unlink. - -Some of these make a subgroup in that they modify *two* different entries: -link, rename, symlink. - -### Others - -Other operations. - -flush, lookup, open, opendir, statfs. - -Algorithms ----------- - -Each of the four major groups has its own algorithm: - -### inode-read, dir-read - -1. Send a request to the first child that is up: - * if it fails: - * try the next available child - * if we have exhausted all children: - * return failure - -### inode-write - - All operations are done in parallel unless specified otherwise. - -1. Send a ``GF_FILE_LK`` request on all children for a write lock on the - appropriate region - (for metadata operations: entire file (0, 0) for writev: - (offset, offset+size of buffer)) - * If a lock request fails on a child: - * unlock all children - * try to acquire a blocking lock (`F_SETLKW`) on each child, serially. - If this fails (due to `ENOTCONN` or `EINVAL`): - Consider this child as dead for rest of transaction. -2. Mark all children as "pending" on all (alive) children (see below for -meaning of "pending"). - * If it fails on any child: - * mark it as dead (in transaction local state). -3. Perform operation on all (alive) children. - * If it fails on any child: - * mark it as dead (in transaction local state). -4. Unmark all successful children as not "pending" on all nodes. -5. Unlock region on all (alive) children. - -### dir-write - - The algorithm for dir-write is same as above except instead of holding - `GF_FILE_LK` locks we hold a GF_DIR_LK lock on the name being operated upon. - In case of link-type calls, we hold locks on both the operand names. - -"pending" ---------- - -The "pending" number is like a journal entry. A pending entry is an -array of 32-bit integers stored in network byte-order as the extended -attribute of an inode (which can be a directory as well). - -There are three keys corresponding to three types of pending operations: - -### `AFR_METADATA_PENDING` - -There are some metadata operations pending on this inode (perms, ctime/mtime, -xattr, etc.). - -### `AFR_DATA_PENDING` - -There is some data pending on this inode (writev). - -### `AFR_ENTRY_PENDING` - -There are some directory operations pending on this directory -(create, unlink, etc.). - -Self heal ---------- - -* On lookup, gather extended attribute data: - * If entry is a regular file: - * If an entry is present on one child and not on others: - * create entry on others. - * If entries exist but have different metadata (perms, etc.): - * consider the entry with the highest `AFR_METADATA_PENDING` number as - definitive and replicate its attributes on children. - * If entry is a directory: - * Consider the entry with the highest `AFR_ENTRY_PENDING` number as - definitive and replicate its contents on all children. - * If any two entries have non-matching types (i.e., one is file and - other is directory): - * Announce to the user via log that a split-brain situation has been - detected, and do nothing. -* On open, gather extended attribute data: - * Consider the file with the highest `AFR_DATA_PENDING` number as - the definitive one and replicate its contents on all other - children. - -During all self heal operations, appropriate locks must be held on all -regions/entries being affected. - -Inode scaling -------------- - -Inode scaling is necessary because if a situation arises where an inode number -is returned for a directory (by lookup) which was previously the inode number -of a file (as per FUSE's table), then FUSE gets horribly confused (consult a -FUSE expert for more details). - -To avoid such a situation, we distribute the 64-bit inode space equally -among all children of replicate. - -To illustrate: - -If c1, c2, c3 are children of replicate, they each get 1/3 of the available -inode space: - -------------- -- -- -- -- -- -- -- -- -- -- -- --- -Child: c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 ... -Inode number: 1 2 3 4 5 6 7 8 9 10 11 ... -------------- -- -- -- -- -- -- -- -- -- -- -- --- - -Thus, if lookup on c1 returns an inode number "2", it is scaled to "4" -(which is the second inode number in c1's space). - -This way we ensure that there is never a collision of inode numbers from -two different children. - -This reduction of inode space doesn't really reduce the usability of -replicate since even if we assume replicate has 1024 children (which would be a -highly unusual scenario), each child still has a 54-bit inode space: -$2^{54} \sim 1.8 \times 10^{16}$, which is much larger than any real -world requirement. |