summaryrefslogtreecommitdiffstats
path: root/doc/hacker-guide/en-US/markdown
diff options
context:
space:
mode:
Diffstat (limited to 'doc/hacker-guide/en-US/markdown')
-rw-r--r--doc/hacker-guide/en-US/markdown/adding-fops.md18
-rw-r--r--doc/hacker-guide/en-US/markdown/afr.md191
-rw-r--r--doc/hacker-guide/en-US/markdown/coding-standard.md402
-rw-r--r--doc/hacker-guide/en-US/markdown/posix.md59
-rw-r--r--doc/hacker-guide/en-US/markdown/translator-development.md666
-rw-r--r--doc/hacker-guide/en-US/markdown/write-behind.md56
6 files changed, 1392 insertions, 0 deletions
diff --git a/doc/hacker-guide/en-US/markdown/adding-fops.md b/doc/hacker-guide/en-US/markdown/adding-fops.md
new file mode 100644
index 000000000..3f72ed3e2
--- /dev/null
+++ b/doc/hacker-guide/en-US/markdown/adding-fops.md
@@ -0,0 +1,18 @@
+Adding a new FOP
+================
+
+Steps to be followed when adding a new FOP to GlusterFS:
+
+1. Edit `glusterfs.h` and add a `GF_FOP_*` constant.
+2. Edit `xlator.[ch]` and:
+ * add the new prototype for fop and callback.
+ * edit `xlator_fops` structure.
+3. Edit `xlator.c` and add to fill_defaults.
+4. Edit `protocol.h` and add struct necessary for the new FOP.
+5. Edit `defaults.[ch]` and provide default implementation.
+6. Edit `call-stub.[ch]` and provide stub implementation.
+7. Edit `common-utils.c` and add to gf_global_variable_init().
+8. Edit client-protocol and add your FOP.
+9. Edit server-protocol and add your FOP.
+10. Implement your FOP in any translator for which the default implementation
+ is not sufficient.
diff --git a/doc/hacker-guide/en-US/markdown/afr.md b/doc/hacker-guide/en-US/markdown/afr.md
new file mode 100644
index 000000000..1be7e39f2
--- /dev/null
+++ b/doc/hacker-guide/en-US/markdown/afr.md
@@ -0,0 +1,191 @@
+cluster/afr translator
+======================
+
+Locking
+-------
+
+Before understanding replicate, one must understand two internal FOPs:
+
+### `GF_FILE_LK`
+
+This is exactly like `fcntl(2)` locking, except the locks are in a
+separate domain from locks held by applications.
+
+### `GF_DIR_LK (loc_t *loc, char *basename)`
+
+This allows one to lock a name under a directory. For example,
+to lock /mnt/glusterfs/foo, one would use the call:
+
+```
+GF_DIR_LK ({loc_t for "/mnt/glusterfs"}, "foo")
+```
+
+If one wishes to lock *all* the names under a particular directory,
+supply the basename argument as `NULL`.
+
+The locks can either be read locks or write locks; consult the
+function prototype for more details.
+
+Both these operations are implemented by the features/locks (earlier
+known as posix-locks) translator.
+
+Basic design
+------------
+
+All FOPs can be classified into four major groups:
+
+### inode-read
+
+Operations that read an inode's data (file contents) or metadata (perms, etc.).
+
+access, getxattr, fstat, readlink, readv, stat.
+
+### inode-write
+
+Operations that modify an inode's data or metadata.
+
+chmod, chown, truncate, writev, utimens.
+
+### dir-read
+
+Operations that read a directory's contents or metadata.
+
+readdir, getdents, checksum.
+
+### dir-write
+
+Operations that modify a directory's contents or metadata.
+
+create, link, mkdir, mknod, rename, rmdir, symlink, unlink.
+
+Some of these make a subgroup in that they modify *two* different entries:
+link, rename, symlink.
+
+### Others
+
+Other operations.
+
+flush, lookup, open, opendir, statfs.
+
+Algorithms
+----------
+
+Each of the four major groups has its own algorithm:
+
+### inode-read, dir-read
+
+1. Send a request to the first child that is up:
+ * if it fails:
+ * try the next available child
+ * if we have exhausted all children:
+ * return failure
+
+### inode-write
+
+ All operations are done in parallel unless specified otherwise.
+
+1. Send a ``GF_FILE_LK`` request on all children for a write lock on the
+ appropriate region
+ (for metadata operations: entire file (0, 0) for writev:
+ (offset, offset+size of buffer))
+ * If a lock request fails on a child:
+ * unlock all children
+ * try to acquire a blocking lock (`F_SETLKW`) on each child, serially.
+ If this fails (due to `ENOTCONN` or `EINVAL`):
+ Consider this child as dead for rest of transaction.
+2. Mark all children as "pending" on all (alive) children (see below for
+meaning of "pending").
+ * If it fails on any child:
+ * mark it as dead (in transaction local state).
+3. Perform operation on all (alive) children.
+ * If it fails on any child:
+ * mark it as dead (in transaction local state).
+4. Unmark all successful children as not "pending" on all nodes.
+5. Unlock region on all (alive) children.
+
+### dir-write
+
+ The algorithm for dir-write is same as above except instead of holding
+ `GF_FILE_LK` locks we hold a GF_DIR_LK lock on the name being operated upon.
+ In case of link-type calls, we hold locks on both the operand names.
+
+"pending"
+---------
+
+The "pending" number is like a journal entry. A pending entry is an
+array of 32-bit integers stored in network byte-order as the extended
+attribute of an inode (which can be a directory as well).
+
+There are three keys corresponding to three types of pending operations:
+
+### `AFR_METADATA_PENDING`
+
+There are some metadata operations pending on this inode (perms, ctime/mtime,
+xattr, etc.).
+
+### `AFR_DATA_PENDING`
+
+There is some data pending on this inode (writev).
+
+### `AFR_ENTRY_PENDING`
+
+There are some directory operations pending on this directory
+(create, unlink, etc.).
+
+Self heal
+---------
+
+* On lookup, gather extended attribute data:
+ * If entry is a regular file:
+ * If an entry is present on one child and not on others:
+ * create entry on others.
+ * If entries exist but have different metadata (perms, etc.):
+ * consider the entry with the highest `AFR_METADATA_PENDING` number as
+ definitive and replicate its attributes on children.
+ * If entry is a directory:
+ * Consider the entry with the higest `AFR_ENTRY_PENDING` number as
+ definitive and replicate its contents on all children.
+ * If any two entries have non-matching types (i.e., one is file and
+ other is directory):
+ * Announce to the user via log that a split-brain situation has been
+ detected, and do nothing.
+* On open, gather extended attribute data:
+ * Consider the file with the highest `AFR_DATA_PENDING` number as
+ the definitive one and replicate its contents on all other
+ children.
+
+During all self heal operations, appropriate locks must be held on all
+regions/entries being affected.
+
+Inode scaling
+-------------
+
+Inode scaling is necessary because if a situation arises where an inode number
+is returned for a directory (by lookup) which was previously the inode number
+of a file (as per FUSE's table), then FUSE gets horribly confused (consult a
+FUSE expert for more details).
+
+To avoid such a situation, we distribute the 64-bit inode space equally
+among all children of replicate.
+
+To illustrate:
+
+If c1, c2, c3 are children of replicate, they each get 1/3 of the available
+inode space:
+
+------------- -- -- -- -- -- -- -- -- -- -- -- ---
+Child: c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 ...
+Inode number: 1 2 3 4 5 6 7 8 9 10 11 ...
+------------- -- -- -- -- -- -- -- -- -- -- -- ---
+
+Thus, if lookup on c1 returns an inode number "2", it is scaled to "4"
+(which is the second inode number in c1's space).
+
+This way we ensure that there is never a collision of inode numbers from
+two different children.
+
+This reduction of inode space doesn't really reduce the usability of
+replicate since even if we assume replicate has 1024 children (which would be a
+highly unusual scenario), each child still has a 54-bit inode space:
+$2^{54} \sim 1.8 \times 10^{16}$, which is much larger than any real
+world requirement.
diff --git a/doc/hacker-guide/en-US/markdown/coding-standard.md b/doc/hacker-guide/en-US/markdown/coding-standard.md
new file mode 100644
index 000000000..178dc142a
--- /dev/null
+++ b/doc/hacker-guide/en-US/markdown/coding-standard.md
@@ -0,0 +1,402 @@
+GlusterFS Coding Standards
+==========================
+
+Structure definitions should have a comment per member
+------------------------------------------------------
+
+Every member in a structure definition must have a comment about its
+purpose. The comment should be descriptive without being overly verbose.
+
+*Bad:*
+
+```
+gf_lock_t lock; /* lock */
+```
+
+*Good:*
+
+```
+DBTYPE access_mode; /* access mode for accessing
+ * the databases, can be
+ * DB_HASH, DB_BTREE
+ * (option access-mode <mode>)
+ */
+```
+
+Declare all variables at the beginning of the function
+------------------------------------------------------
+
+All local variables in a function must be declared immediately after the
+opening brace. This makes it easy to keep track of memory that needs to be freed
+during exit. It also helps debugging, since gdb cannot handle variables
+declared inside loops or other such blocks.
+
+Always initialize local variables
+---------------------------------
+
+Every local variable should be initialized to a sensible default value
+at the point of its declaration. All pointers should be initialized to NULL,
+and all integers should be zero or (if it makes sense) an error value.
+
+
+*Good:*
+
+```
+int ret = 0;
+char *databuf = NULL;
+int _fd = -1;
+```
+
+Initialization should always be done with a constant value
+----------------------------------------------------------
+
+Never use a non-constant expression as the initialization value for a variable.
+
+
+*Bad:*
+
+```
+pid_t pid = frame->root->pid;
+char *databuf = malloc (1024);
+```
+
+Validate all arguments to a function
+------------------------------------
+
+All pointer arguments to a function must be checked for `NULL`.
+A macro named `VALIDATE` (in `common-utils.h`)
+takes one argument, and if it is `NULL`, writes a log message and
+jumps to a label called `err` after setting op_ret and op_errno
+appropriately. It is recommended to use this template.
+
+
+*Good:*
+
+```
+VALIDATE(frame);
+VALIDATE(this);
+VALIDATE(inode);
+```
+
+Never rely on precedence of operators
+-------------------------------------
+
+Never write code that relies on the precedence of operators to execute
+correctly. Such code can be hard to read and someone else might not
+know the precedence of operators as accurately as you do.
+
+*Bad:*
+
+```
+if (op_ret == -1 && errno != ENOENT)
+```
+
+*Good:*
+
+```
+if ((op_ret == -1) && (errno != ENOENT))
+```
+
+Use exactly matching types
+--------------------------
+
+Use a variable of the exact type declared in the manual to hold the
+return value of a function. Do not use an ``equivalent'' type.
+
+
+*Bad:*
+
+```
+int len = strlen (path);
+```
+
+*Good:*
+
+```
+size_t len = strlen (path);
+```
+
+Never write code such as `foo->bar->baz`; check every pointer
+-------------------------------------------------------------
+
+Do not write code that blindly follows a chain of pointer
+references. Any pointer in the chain may be `NULL` and thus
+cause a crash. Verify that each pointer is non-null before following
+it.
+
+Check return value of all functions and system calls
+----------------------------------------------------
+
+The return value of all system calls and API functions must be checked
+for success or failure.
+
+*Bad:*
+
+```
+close (fd);
+```
+
+*Good:*
+
+```
+op_ret = close (_fd);
+if (op_ret == -1) {
+ gf_log (this->name, GF_LOG_ERROR,
+ "close on file %s failed (%s)", real_path,
+ strerror (errno));
+ op_errno = errno;
+ goto out;
+}
+```
+
+
+Gracefully handle failure of malloc
+-----------------------------------
+
+GlusterFS should never crash or exit due to lack of memory. If a
+memory allocation fails, the call should be unwound and an error
+returned to the user.
+
+*Use result args and reserve the return value to indicate success or failure:*
+
+The return value of every functions must indicate success or failure (unless
+it is impossible for the function to fail --- e.g., boolean functions). If
+the function needs to return additional data, it must be returned using a
+result (pointer) argument.
+
+*Bad:*
+
+```
+int32_t dict_get_int32 (dict_t *this, char *key);
+```
+
+*Good:*
+
+```
+int dict_get_int32 (dict_t *this, char *key, int32_t *val);
+```
+
+Always use the `n' versions of string functions
+-----------------------------------------------
+
+Unless impossible, use the length-limited versions of the string functions.
+
+*Bad:*
+
+```
+strcpy (entry_path, real_path);
+```
+
+*Good:*
+
+```
+strncpy (entry_path, real_path, entry_path_len);
+```
+
+No dead or commented code
+-------------------------
+
+There must be no dead code (code to which control can never be passed) or
+commented out code in the codebase.
+
+Only one unwind and return per function
+---------------------------------------
+
+There must be only one exit out of a function. `UNWIND` and return
+should happen at only point in the function.
+
+Function length or Keep functions small
+---------------------------------------
+
+We live in the UNIX-world where modules do one thing and do it well.
+This rule should apply to our functions also. If a function is very long, try splitting it
+into many little helper functions. The question is, in a coding
+spree, how do we know a function is long and unreadable. One rule of
+thumb given by Linus Torvalds is that, a function should be broken-up
+if you have 4 or more levels of indentation going on for more than 3-4
+lines.
+
+*Example for a helper function:*
+```
+static int
+same_owner (posix_lock_t *l1, posix_lock_t *l2)
+{
+ return ((l1->client_pid == l2->client_pid) &&
+ (l1->transport == l2->transport));
+}
+```
+
+Defining functions as static
+----------------------------
+
+Define internal functions as static only if you're
+very sure that there will not be a crash(..of any kind..) emanating in
+that function. If there is even a remote possibility, perhaps due to
+pointer derefering, etc, declare the function as non-static. This
+ensures that when a crash does happen, the function name shows up the
+in the back-trace generated by libc. However, doing so has potential
+for polluting the function namespace, so to avoid conflicts with other
+components in other parts, ensure that the function names are
+prepended with a prefix that identify the component to which it
+belongs. For eg. non-static functions in io-threads translator start
+with iot_.
+
+Ensure function calls wrap around after 80-columns
+--------------------------------------------------
+
+Place remaining arguments on the next line if needed.
+
+Functions arguments and function definition
+-------------------------------------------
+
+Place all the arguments of a function definition on the same line
+until the line goes beyond 80-cols. Arguments that extend beyind
+80-cols should be placed on the next line.
+
+Style issues
+------------
+
+### Brace placement
+
+Use K&R/Linux style of brace placement for blocks.
+
+*Good:*
+
+```
+int some_function (...)
+{
+ if (...) {
+ /* ... */
+ } else if (...) {
+ /* ... */
+ } else {
+ /* ... */
+ }
+
+ do {
+ /* ... */
+ } while (cond);
+}
+```
+
+### Indentation
+
+Use *eight* spaces for indenting blocks. Ensure that your
+file contains only spaces and not tab characters. You can do this
+in Emacs by selecting the entire file (`C-x h`) and
+running `M-x untabify`.
+
+To make Emacs indent lines automatically by eight spaces, add this
+line to your `.emacs`:
+
+```
+(add-hook 'c-mode-hook (lambda () (c-set-style "linux")))
+```
+
+### Comments
+
+Write a comment before every function describing its purpose (one-line),
+its arguments, and its return value. Mention whether it is an internal
+function or an exported function.
+
+Write a comment before every structure describing its purpose, and
+write comments about each of its members.
+
+Follow the style shown below for comments, since such comments
+can then be automatically extracted by doxygen to generate
+documentation.
+
+*Good:*
+
+```
+/**
+* hash_name -hash function for filenames
+* @par: parent inode number
+* @name: basename of inode
+* @mod: number of buckets in the hashtable
+*
+* @return: success: bucket number
+* failure: -1
+*
+* Not for external use.
+*/
+```
+
+### Indicating critical sections
+
+To clearly show regions of code which execute with locks held, use
+the following format:
+
+```
+pthread_mutex_lock (&mutex);
+{
+ /* code */
+}
+pthread_mutex_unlock (&mutex);
+```
+
+*A skeleton fop function:*
+
+This is the recommended template for any fop. In the beginning come
+the initializations. After that, the `success' control flow should be
+linear. Any error conditions should cause a `goto` to a single
+point, `out`. At that point, the code should detect the error
+that has occured and do appropriate cleanup.
+
+```
+int32_t
+sample_fop (call_frame_t *frame, xlator_t *this, ...)
+{
+ char * var1 = NULL;
+ int32_t op_ret = -1;
+ int32_t op_errno = 0;
+ DIR * dir = NULL;
+ struct posix_fd * pfd = NULL;
+
+ VALIDATE_OR_GOTO (frame, out);
+ VALIDATE_OR_GOTO (this, out);
+
+ /* other validations */
+
+ dir = opendir (...);
+
+ if (dir == NULL) {
+ op_errno = errno;
+ gf_log (this->name, GF_LOG_ERROR,
+ "opendir failed on %s (%s)", loc->path,
+ strerror (op_errno));
+ goto out;
+ }
+
+ /* another system call */
+ if (...) {
+ op_errno = ENOMEM;
+ gf_log (this->name, GF_LOG_ERROR,
+ "out of memory :(");
+ goto out;
+ }
+
+ /* ... */
+
+ out:
+ if (op_ret == -1) {
+
+ /* check for all the cleanup that needs to be
+ done */
+
+ if (dir) {
+ closedir (dir);
+ dir = NULL;
+ }
+
+ if (pfd) {
+ FREE (pfd->path);
+ FREE (pfd);
+ pfd = NULL;
+ }
+ }
+
+ STACK_UNWIND (frame, op_ret, op_errno, fd);
+ return 0;
+}
+```
diff --git a/doc/hacker-guide/en-US/markdown/posix.md b/doc/hacker-guide/en-US/markdown/posix.md
new file mode 100644
index 000000000..84c813e55
--- /dev/null
+++ b/doc/hacker-guide/en-US/markdown/posix.md
@@ -0,0 +1,59 @@
+storage/posix translator
+========================
+
+Notes
+-----
+
+### `SET_FS_ID`
+
+This is so that all filesystem checks are done with the user's
+uid/gid and not GlusterFS's uid/gid.
+
+### `MAKE_REAL_PATH`
+
+This macro concatenates the base directory of the posix volume
+('option directory') with the given path.
+
+### `need_xattr` in lookup
+
+If this flag is passed, lookup returns a xattr dictionary that contains
+the file's create time, the file's contents, and the version number
+of the file.
+
+This is a hack to increase small file performance. If an application
+wants to read a small file, it can finish its job with just a lookup
+call instead of a lookup followed by read.
+
+### `getdents`/`setdents`
+
+These are used by unify to set and get directory entries.
+
+### `ALIGN_BUF`
+
+Macro to align an address to a page boundary (4K).
+
+### `priv->export_statfs`
+
+In some cases, two exported volumes may reside on the same
+partition on the server. Sending statvfs info for both
+the volumes will lead to erroneous df output at the client,
+since free space on the partition will be counted twice.
+
+In such cases, user can disable exporting statvfs info
+on one of the volumes by setting this option.
+
+### `xattrop`
+
+This fop is used by replicate to set version numbers on files.
+
+### `getxattr`/`setxattr` hack to read/write files
+
+A key, `GLUSTERFS_FILE_CONTENT_STRING`, is handled in a special way by
+`getxattr`/`setxattr`. A getxattr with the key will return the entire
+content of the file as the value. A `setxattr` with the key will write
+the value as the entire content of the file.
+
+### `posix_checksum`
+
+This calculates a simple XOR checksum on all entry names in a
+directory that is used by unify to compare directory contents.
diff --git a/doc/hacker-guide/en-US/markdown/translator-development.md b/doc/hacker-guide/en-US/markdown/translator-development.md
new file mode 100644
index 000000000..77d1b606a
--- /dev/null
+++ b/doc/hacker-guide/en-US/markdown/translator-development.md
@@ -0,0 +1,666 @@
+Translator development
+======================
+
+Setting the Stage
+-----------------
+
+This is the first post in a series that will explain some of the details of
+writing a GlusterFS translator, using some actual code to illustrate.
+
+Before we begin, a word about environments. GlusterFS is over 300K lines of
+code spread across a few hundred files. That's no Linux kernel or anything, but
+ you're still going to be navigating through a lot of code in every
+code-editing session, so some kind of cross-referencing is *essential*. I use
+cscope with the vim bindings, and if I couldn't do Crtl+G and such to jump
+between definitions all the time my productivity would be cut in half. You may
+prefer different tools, but as I go through these examples you'll need
+something functionally similar to follow on. OK, on with the show.
+
+The first thing you need to know is that translators are not just bags of
+functions and variables. They need to have a very definite internal structure
+so that the translator-loading code can figure out where all the pieces are.
+The way it does this is to use dlsym to look for specific names within your
+shared-object file, as follow (from `xlator.c`):
+
+```
+if (!(xl->fops = dlsym (handle, "fops"))) {
+ gf_log ("xlator", GF_LOG_WARNING, "dlsym(fops) on %s",
+ dlerror ());
+ goto out;
+}
+
+if (!(xl->cbks = dlsym (handle, "cbks"))) {
+ gf_log ("xlator", GF_LOG_WARNING, "dlsym(cbks) on %s",
+ dlerror ());
+ goto out;
+}
+
+if (!(xl->init = dlsym (handle, "init"))) {
+ gf_log ("xlator", GF_LOG_WARNING, "dlsym(init) on %s",
+ dlerror ());
+ goto out;
+}
+
+if (!(xl->fini = dlsym (handle, "fini"))) {
+ gf_log ("xlator", GF_LOG_WARNING, "dlsym(fini) on %s",
+ dlerror ());
+ goto out;
+}
+```
+
+In this example, `xl` is a pointer to the in-memory object for the translator
+we're loading. As you can see, it's looking up various symbols *by name* in the
+ shared object it just loaded, and storing pointers to those symbols. Some of
+them (e.g. init are functions, while others e.g. fops are dispatch tables
+containing pointers to many functions. Together, these make up the translator's
+ public interface.
+
+Most of this glue or boilerplate can easily be found at the bottom of one of
+the source files that make up each translator. We're going to use the `rot-13`
+translator just for fun, so in this case you'd look in `rot-13.c` to see this:
+
+```
+struct xlator_fops fops = {
+ .readv = rot13_readv,
+ .writev = rot13_writev
+};
+
+struct xlator_cbks cbks = {
+};
+
+struct volume_options options[] = {
+{ .key = {"encrypt-write"},
+ .type = GF_OPTION_TYPE_BOOL
+},
+{ .key = {"decrypt-read"},
+ .type = GF_OPTION_TYPE_BOOL
+},
+{ .key = {NULL} },
+};
+```
+
+The `fops` table, defined in `xlator.h`, is one of the most important pieces.
+This table contains a pointer to each of the filesystem functions that your
+translator might implement -- `open`, `read`, `stat`, `chmod`, and so on. There
+are 82 such functions in all, but don't worry; any that you don't specify here
+will be see as null and filled with defaults from `defaults.c` when your
+translator is loaded. In this particular example, since `rot-13` is an
+exceptionally simple translator, we only fill in two entries for `readv` and
+`writev`.
+
+There are actually two other tables, also required to have predefined names,
+that are also used to find translator functions: `cbks` (which is empty in this
+ snippet) and `dumpops` (which is missing entirely). The first of these specify
+ entry points for when inodes are forgotten or file descriptors are released.
+In other words, they're destructors for objects in which your translator might
+ have an interest. Mostly you can ignore them, because the default behavior
+handles even the simpler cases of translator-specific inode/fd context
+automatically. However, if the context you attach is a complex structure
+requiring complex cleanup, you'll need to supply these functions. As for
+dumpops, that's just used if you want to provide functions to pretty-print
+various structures in logs. I've never used it myself, though I probably
+should. What's noteworthy here is that we don't even define dumpops. That's
+because all of the functions that might use these dispatch functions will check
+ for `xl->dumpops` being `NULL` before calling through it. This is in sharp
+contrast to the behavior for `fops` and `cbks1`, which *must* be present. If
+they're not, translator loading will fail because these pointers are not
+checked every time and if they're `NULL` then we'll segfault. That's why we
+provide an empty definition for cbks; it's OK for the individual function
+pointers to be NULL, but not for the whole table to be absent.
+
+The last piece I'll cover today is options. As you can see, this is a table of
+translator-specific option names and some information about their types.
+GlusterFS actually provides a pretty rich set of types (`volume_option_type_t`
+in `options.`h) which includes paths, translator names, percentages, and times
+in addition to the obvious integers and strings. Also, the `volume_option_t`
+structure can include information about alternate names, min/max/default
+values, enumerated string values, and descriptions. We don't see any of these
+here, so let's take a quick look at some more complex examples from afr.c and
+then come back to `rot-13`.
+
+```
+{ .key = {"data-self-heal-algorithm"},
+ .type = GF_OPTION_TYPE_STR,
+ .default_value = "",
+ .description = "Select between \"full\", \"diff\". The "
+ "\"full\" algorithm copies the entire file from "
+ "source to sink. The \"diff\" algorithm copies to "
+ "sink only those blocks whose checksums don't match "
+ "with those of source.",
+ .value = { "diff", "full", "" }
+},
+{ .key = {"data-self-heal-window-size"},
+ .type = GF_OPTION_TYPE_INT,
+ .min = 1,
+ .max = 1024,
+ .default_value = "1",
+ .description = "Maximum number blocks per file for which "
+ "self-heal process would be applied simultaneously."
+},
+```
+
+When your translator is loaded, all of this information is used to parse the
+options actually provided in the volfile, and then the result is turned into a
+dictionary and stored as `xl->options`. This dictionary is then processed by
+your init function, which you can see being looked up in the first code
+fragment above. We're only going to look at a small part of the `rot-13`'s
+init for now.
+
+```
+priv->decrypt_read = 1;
+priv->encrypt_write = 1;
+
+data = dict_get (this->options, "encrypt-write");
+if (data) {
+ if (gf_string2boolean (data->data, &priv->encrypt_write
+ == -1) {
+ gf_log (this->name, GF_LOG_ERROR,
+ "encrypt-write takes only boolean options");
+ return -1;
+ }
+}
+```
+
+What we can see here is that we're setting some defaults in our priv structure,
+then looking to see if an `encrypt-write` option was actually provided. If so,
+we convert and store it. This is a pretty classic use of dict_get to fetch a
+field from a dictionary, and of using one of many conversion functions in
+`common-utils.c` to convert `data->data` into something we can use.
+
+So far we've covered the basic of how a translator gets loaded, how we find its
+various parts, and how we process its options. In my next Translator 101 post,
+we'll go a little deeper into other things that init and its companion fini
+might do, and how some other fields in our `xlator_t` structure (commonly
+referred to as this) are commonly used.
+
+`init`, `fini`, and private context
+-----------------------------------
+
+In the previous Translator 101 post, we looked at some of the dispatch tables
+and options processing in a translator. This time we're going to cover the rest
+ of the "shell" of a translator -- i.e. the other global parts not specific to
+handling a particular request.
+
+Let's start by looking at the relationship between a translator and its shared
+library. At a first approximation, this is the relationship between an object
+and a class in just about any object-oriented programming language. The class
+defines behaviors, but has to be instantiated as an object to have any kind of
+existence. In our case the object is an `xlator_t`. Several of these might be
+created within the same daemon, sharing all of the same code through init/fini
+and dispatch tables, but sharing *no data*. You could implement shared data (as
+ static variables in your shared libraries) but that's strongly discouraged.
+Every function in your shared library will get an `xlator_t` as an argument,
+and should use it. This lack of class-level data is one of the points where
+the analogy to common OOP systems starts to break down. Another place is the
+complete lack of inheritance. Translators inherit behavior (code) from exactly
+one shared library -- looked up and loaded using the `type` field in a volfile
+`volume ... end-volume` block -- and that's it -- not even single inheritance,
+no subclasses or superclasses, no mixins or prototypes, just the relationship
+between an object and its class. With that in mind, let's turn to the init
+function that we just barely touched on last time.
+
+```
+int32_t
+init (xlator_t *this)
+{
+ data_t *data = NULL;
+ rot_13_private_t *priv = NULL;
+
+ if (!this->children || this->children->next) {
+ gf_log ("rot13", GF_LOG_ERROR,
+ "FATAL: rot13 should have exactly one child");
+ return -1;
+ }
+
+ if (!this->parents) {
+ gf_log (this->name, GF_LOG_WARNING,
+ "dangling volume. check volfile ");
+ }
+
+ priv = GF_CALLOC (sizeof (rot_13_private_t), 1, 0);
+ if (!priv)
+ return -1;
+```
+
+At the very top, we see the function signature -- we get a pointer to the
+`xlator_t` object that we're initializing, and we return an `int32_t` status.
+As with most functions in the translator API, this should be zero to indicate
+success. In this case it's safe to return -1 for failure, but watch out: in
+dispatch-table functions, the return value means the status of the *function
+call* rather than the *request*. A request error should be reflected as a
+callback with a non-zero `op_re`t value, but the dispatch function itself
+should still return zero. In fact, the handling of a non-zero return from a
+dispatch function is not all that robust (we recently had a bug report in
+HekaFS related to this) so it's something you should probably avoid
+altogether. This only underscores the difference between dispatch functions
+and `init`/`fini` functions, where non-zero returns *are* expected and handled
+logically by aborting the translator setup. We can see that down at the
+bottom, where we return -1 to indicate that we couldn't allocate our
+private-data area (more about that later).
+
+The first thing this init function does is check that the translator is being
+set up in the right kind of environment. Translators are called by parents and
+in turn call children. Some translators are "initial" translators that inject
+requests into the system from elsewhere -- e.g. mount/fuse injecting requests
+from the kernel, protocol/server injecting requests from the network. Those
+translators don't need parents, but `rot-13` does and so we check for that.
+Similarly, some translators are "final" translators that (from the perspective
+of the current process) terminate requests instead of passing them on -- e.g.
+`protocol/client` passing them to another node, `storage/posix` passing them to
+a local filesystem. Other translators "multiplex" between multiple children --
+ passing each parent request on to one (`cluster/dht`), some
+(`cluster/stripe`), or all (`cluster/afr`) of those children. `rot-13` fits
+into none of those categories either, so it checks that it has *exactly one*
+child. It might be more convenient or robust if translator shared libraries
+had standard variables describing these requirements, to be checked in a
+consistent way by the translator-loading infrastructure itself instead of by
+each separate init function, but this is the way translators work today.
+
+The last thing we see in this fragment is allocating our private data area.
+This can literally be anything we want; the infrastructure just provides the
+priv pointer as a convenience but takes no responsibility for how it's used. In
+ this case we're using `GF_CALLOC` to allocate our own `rot_13_private_t`
+structure. This gets us all the benefits of GlusterFS's memory-leak detection
+infrastructure, but the way we're calling it is not quite ideal. For one thing,
+ the first two arguments -- from `calloc(3)` -- are kind of reversed. For
+another, notice how the last argument is zero. That can actually be an
+enumerated value, to tell the GlusterFS allocator *what* type we're
+allocating. This can be very useful information for memory profiling and leak
+detection, so it's recommended that you follow the example of any
+x`xx-mem-types.h` file elsewhere in the source tree instead of just passing
+zero here (even though that works).
+
+To finish our tour of standard initialization/termination, let's look at the
+end of `init` and the beginning of `fini`:
+
+```
+ this->private = priv;
+ gf_log ("rot13", GF_LOG_DEBUG, "rot13 xlator loaded");
+ return 0;
+}
+
+void
+fini (xlator_t *this)
+{
+ rot_13_private_t *priv = this->private;
+
+ if (!priv)
+ return;
+ this->private = NULL;
+ GF_FREE (priv);
+```
+
+At the end of init we're just storing our private-data pointer in the `priv`
+field of our `xlator_t`, then returning zero to indicate that initialization
+succeeded. As is usually the case, our fini is even simpler. All it really has
+to do is `GF_FREE` our private-data pointer, which we do in a slightly
+roundabout way here. Notice how we don't even have a return value here, since
+there's nothing obvious and useful that the infrastructure could do if `fini`
+failed.
+
+That's practically everything we need to know to get our translator through
+loading, initialization, options processing, and termination. If we had defined
+ no dispatch functions, we could actually configure a daemon to use our
+translator and it would work as a basic pass-through from its parent to a
+single child. In the next post I'll cover how to build the translator and
+configure a daemon to use it, so that we can actually step through it in a
+debugger and see how it all fits together before we actually start adding
+functionality.
+
+This Time For Real
+------------------
+
+In the first two parts of this series, we learned how to write a basic
+translator skeleton that can get through loading, initialization, and option
+processing. This time we'll cover how to build that translator, configure a
+volume to use it, and run the glusterfs daemon in debug mode.
+
+Unfortunately, there's not much direct support for writing new translators. You
+can check out a GlusterFS tree and splice in your own translator directory, but
+ that's a bit painful because you'll have to update multiple makefiles plus a
+bunch of autoconf garbage. As part of the HekaFS project, I basically reverse
+engineered the truly necessary parts of the translator-building process and
+then pestered one of the Fedora glusterfs package maintainers (thanks
+daMaestro!) to add a `glusterfs-devel` package with the required headers. Since
+ then the complexity level in the HekaFS tree has crept back up a bit, but I
+still remember the simple method and still consider it the easiest way to get
+started on a new translator. For the sake of those not using Fedora, I'm going
+to describe a method that doesn't depend on that header package. What it does
+depend on is a GlusterFS source tree, much as you might have cloned from GitHub
+ or the Gluster review site. This tree doesn't have to be fully built, but you
+do need to run `autogen.sh` and configure in it. Then you can take the
+following simple makefile and put it in a directory with your actual source.
+
+```
+# Change these to match your source code.
+TARGET = rot-13.so
+OBJECTS = rot-13.o
+
+# Change these to match your environment.
+GLFS_SRC = /srv/glusterfs
+GLFS_LIB = /usr/lib64
+HOST_OS = GF_LINUX_HOST_OS
+
+# You shouldn't need to change anything below here.
+
+CFLAGS = -fPIC -Wall -O0 -g \
+ -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE \
+ -D$(HOST_OS) -I$(GLFS_SRC) -I$(GLFS_SRC)/contrib/uuid \
+ -I$(GLFS_SRC)/libglusterfs/src
+LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB) -lglusterfs \
+ -lpthread
+
+$(TARGET): $(OBJECTS)
+ $(CC) $(OBJECTS) $(LDFLAGS) -o $(TARGET)
+```
+
+Yes, it's still Linux-specific. Mea culpa. As you can see, we're sticking with
+the `rot-13` example, so you can just copy the files from
+`xlators/encryption/rot-13/src` in your GlusterFS tree to follow on. Type
+`make` and you should be rewarded with a nice little `.so` file.
+
+```
+xlator_example$ ls -l rot-13.so
+-rwxr-xr-x. 1 jeff jeff 40784 Nov 16 16:41 rot-13.so
+```
+
+Notice that we've built with optimization level zero and debugging symbols
+included, which would not typically be the case for a packaged version of
+GlusterFS. Let's put our version of `rot-13.so` into a slightly different file
+on our system, so that it doesn't stomp on the installed version (not that
+you'd ever want to use that anyway).
+
+```
+xlator_example# ls /usr/lib64/glusterfs/3git/xlator/encryption/
+crypt.so crypt.so.0 crypt.so.0.0.0 rot-13.so rot-13.so.0
+rot-13.so.0.0.0
+xlator_example# cp rot-13.so \
+ /usr/lib64/glusterfs/3git/xlator/encryption/my-rot-13.so
+```
+
+These paths represent the current Gluster filesystem layout, which is likely to
+be deprecated in favor of the Fedora layout; your paths may vary. At this point
+ we're ready to configure a volume using our new translator. To do that, I'm
+going to suggest something that's strongly discouraged except during
+development (the Gluster guys are going to hate me for this): write our own
+volfile. Here's just about the simplest volfile you'll ever see.
+
+```
+volume my-posix
+ type storage/posix
+ option directory /srv/export
+end-volume
+
+volume my-rot13
+ type encryption/my-rot-13
+ subvolumes my-posix
+end-volume
+```
+
+All we have here is a basic brick using `/srv/export` for its data, and then
+an instance of our translator layered on top -- no client or server is
+necessary for what we're doing, and the system will automatically push a
+mount/fuse translator on top if there's no server translator. To try this out,
+all we need is the following command (assuming the directories involved already
+ exist).
+
+```
+xlator_example$ glusterfs --debug -f my.vol /srv/import
+```
+
+You should be rewarded with a whole lot of log output, including the text of
+the volfile (this is very useful for debugging problems in the field). If you
+go to another window on the same machine, you can see that you have a new
+filesystem mounted.
+
+```
+~$ df /srv/import
+Filesystem 1K-blocks Used Available Use% Mounted on
+/srv/xlator_example/my.vol
+ 114506240 2706176 105983488 3% /srv/import
+```
+
+Just for fun, write something into a file in `/srv/import`, then look at the
+corresponding file in `/srv/export` to see it all `rot-13`'ed for you.
+
+```
+~$ echo hello > /srv/import/a_file
+~$ cat /srv/export/a_file
+uryyb
+```
+
+There you have it -- functionality you control, implemented easily, layered on
+top of local storage. Now you could start adding functionality -- real
+encryption, perhaps -- and inevitably having to debug it. You could do that the
+ old-school way, with `gf_log` (preferred) or even plain old `printf`, or you
+could run daemons under `gdb` instead. Alternatively, you could wait for the
+next Translator 101 post, where we'll be doing exactly that.
+
+Debugging a Translator
+----------------------
+
+Now that we've learned what a translator looks like and how to build one, it's
+time to run one and actually watch it work. The best way to do this is good
+old-fashioned `gdb`, as follows (using some of the examples from last time).
+
+```
+xlator_example# gdb glusterfs
+GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
+...
+(gdb) r --debug -f my.vol /srv/import
+Starting program: /usr/sbin/glusterfs --debug -f my.vol /srv/import
+...
+[2011-11-23 11:23:16.495516] I [fuse-bridge.c:2971:fuse_init]
+ 0-glusterfs-fuse: FUSE inited with protocol versions:
+ glusterfs 7.13 kernel 7.13
+```
+
+If you get to this point, your glusterfs client process is already running. You
+can go to another window to see the mountpoint, do file operations, etc.
+
+```
+~# df /srv/import
+Filesystem 1K-blocks Used Available Use% Mounted on
+/root/xlator_example/my.vol
+ 114506240 2643968 106045568 3% /srv/import
+~# ls /srv/import
+a_file
+~# cat /srv/import/a_file
+hello
+```
+
+Now let's interrupt the process and see where we are.
+
+```
+^C
+Program received signal SIGINT, Interrupt.
+0x0000003a0060b3dc in pthread_cond_wait@@GLIBC_2.3.2 ()
+ from /lib64/libpthread.so.0
+(gdb) info threads
+ 5 Thread 0x7fffeffff700 (LWP 27206) 0x0000003a002dd8c7
+ in readv ()
+ from /lib64/libc.so.6
+ 4 Thread 0x7ffff50e3700 (LWP 27205) 0x0000003a0060b75b
+ in pthread_cond_timedwait@@GLIBC_2.3.2 ()
+ from /lib64/libpthread.so.0
+ 3 Thread 0x7ffff5f02700 (LWP 27204) 0x0000003a0060b3dc
+ in pthread_cond_wait@@GLIBC_2.3.2 ()
+ from /lib64/libpthread.so.0
+ 2 Thread 0x7ffff6903700 (LWP 27203) 0x0000003a0060f245
+ in sigwait ()
+ from /lib64/libpthread.so.0
+* 1 Thread 0x7ffff7957700 (LWP 27196) 0x0000003a0060b3dc
+ in pthread_cond_wait@@GLIBC_2.3.2 ()
+ from /lib64/libpthread.so.0
+```
+
+Like any non-toy server, this one has multiple threads. What are they all
+doing? Honestly, even I don't know. Thread 1 turns out to be in
+`event_dispatch_epoll`, which means it's the one handling all of our network
+I/O. Note that with socket multi-threading patch this will change, with one
+thread in `socket_poller` per connection. Thread 2 is in `glusterfs_sigwaiter`
+which means signals will be isolated to that thread. Thread 3 is in
+`syncenv_task`, so it's a worker process for synchronous requests such as
+those used by the rebalance and repair code. Thread 4 is in
+`janitor_get_next_fd`, so it's waiting for a chance to close no-longer-needed
+file descriptors on the local filesystem. (I admit I had to look that one up,
+BTW.) Lastly, thread 5 is in `fuse_thread_proc`, so it's the one fetching
+requests from our FUSE interface. You'll often see many more threads than
+this, but it's a pretty good basic set. Now, let's set a breakpoint so we can
+actually watch a request.
+
+```
+(gdb) b rot13_writev
+Breakpoint 1 at 0x7ffff50e4f0b: file rot-13.c, line 119.
+(gdb) c
+Continuing.
+```
+
+At this point we go into our other window and do something that will involve a write.
+
+```
+~# echo goodbye > /srv/import/another_file
+(back to the first window)
+[Switching to Thread 0x7fffeffff700 (LWP 27206)]
+
+Breakpoint 1, rot13_writev (frame=0x7ffff6e4402c, this=0x638440,
+ fd=0x7ffff409802c, vector=0x7fffe8000cd8, count=1, offset=0,
+ iobref=0x7fffe8001070) at rot-13.c:119
+119 rot_13_private_t *priv = (rot_13_private_t *)this->private;
+```
+
+Remember how we built with debugging symbols enabled and no optimization? That
+will be pretty important for the next few steps. As you can see, we're in
+`rot13_writev`, with several parameters.
+
+* `frame` is our always-present frame pointer for this request. Also,
+ `frame->local` will point to any local data we created and attached to the
+ request ourselves.
+* `this` is a pointer to our instance of the `rot-13` translator. You can examine
+ it if you like to see the name, type, options, parent/children, inode table,
+ and other stuff associated with it.
+* `fd` is a pointer to a file-descriptor *object* (`fd_t`, not just a
+ file-descriptor index which is what most people use "fd" for). This in turn
+ points to an inode object (`inode_t`) and we can associate our own
+ `rot-13`-specific data with either of these.
+* `vector` and `count` together describe the data buffers for this write, which
+ we'll get to in a moment.
+* `offset` is the offset into the file at which we're writing.
+* `iobref` is a buffer-reference object, which is used to track the life cycle
+ of buffers containing read/write data. If you look closely, you'll notice that
+ `vector[0].iov_base` points to the same address as `iobref->iobrefs[0].ptr`, which
+ should give you some idea of the inter-relationships between vector and iobref.
+
+OK, now what about that `vector`? We can use it to examine the data being
+written, like this.
+
+```
+(gdb) p vector[0]
+$2 = {iov_base = 0x7ffff7936000, iov_len = 8}
+(gdb) x/s 0x7ffff7936000
+0x7ffff7936000: "goodbye\n"
+```
+
+It's not always safe to view this data as a string, because it might just as
+well be binary data, but since we're generating the write this time it's safe
+and convenient. With that knowledge, let's step through things a bit.
+
+```
+(gdb) s
+120 if (priv->encrypt_write)
+(gdb)
+121 rot13_iovec (vector, count);
+(gdb)
+rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57
+57 for (i = 0; i < count; i++) {
+(gdb)
+58 rot13 (vector[i].iov_base, vector[i].iov_len);
+(gdb)
+rot13 (buf=0x7ffff7936000 "goodbye\n", len=8) at rot-13.c:45
+45 for (i = 0; i < len; i++) {
+(gdb)
+46 if (buf[i] >= 'a' && buf[i] <= 'z')
+(gdb)
+47 buf[i] = 'a' + ((buf[i] - 'a' + 13) % 26);
+```
+
+Here we've stepped into `rot13_iovec`, which iterates through our vector
+calling `rot13`, which in turn iterates through the characters in that chunk
+doing the `rot-13` operation if/as appropriate. This is pretty straightforward
+stuff, so let's skip to the next interesting bit.
+
+```
+(gdb) fin
+Run till exit from #0 rot13 (buf=0x7ffff7936000 "goodbye\n",
+ len=8) at rot-13.c:47
+rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57
+57 for (i = 0; i < count; i++) {
+(gdb) fin
+Run till exit from #0 rot13_iovec (vector=0x7fffe8000cd8,
+ count=1) at rot-13.c:57
+rot13_writev (frame=0x7ffff6e4402c, this=0x638440,
+ fd=0x7ffff409802c, vector=0x7fffe8000cd8, count=1,
+ offset=0, iobref=0x7fffe8001070) at rot-13.c:123
+123 STACK_WIND (frame,
+(gdb) b 129
+Breakpoint 2 at 0x7ffff50e4f35: file rot-13.c, line 129.
+(gdb) b rot13_writev_cbk
+Breakpoint 3 at 0x7ffff50e4db3: file rot-13.c, line 106.
+(gdb) c
+```
+
+So we've set breakpoints on both the callback and the statement following the
+`STACK_WIND`. Which one will we hit first?
+
+```
+Breakpoint 3, rot13_writev_cbk (frame=0x7ffff6e4402c,
+ cookie=0x7ffff6e440d8, this=0x638440, op_ret=8, op_errno=0,
+ prebuf=0x7fffefffeca0, postbuf=0x7fffefffec30)
+ at rot-13.c:106
+106 STACK_UNWIND_STRICT (writev, frame, op_ret, op_errno,
+ prebuf, postbuf);
+(gdb) bt
+#0 rot13_writev_cbk (frame=0x7ffff6e4402c,
+ cookie=0x7ffff6e440d8, this=0x638440, op_ret=8, op_errno=0,
+ prebuf=0x7fffefffeca0, postbuf=0x7fffefffec30)
+ at rot-13.c:106
+#1 0x00007ffff52f1b37 in posix_writev (frame=0x7ffff6e440d8,
+ this=<value optimized out>, fd=<value optimized out>,
+ vector=<value optimized out>, count=1,
+ offset=<value optimized out>, iobref=0x7fffe8001070)
+ at posix.c:2217
+#2 0x00007ffff50e513e in rot13_writev (frame=0x7ffff6e4402c,
+ this=0x638440, fd=0x7ffff409802c, vector=0x7fffe8000cd8,
+ count=1, offset=0, iobref=0x7fffe8001070) at rot-13.c:123
+```
+
+Surprise! We're in `rot13_writev_cbk` now, called (indirectly) while we're
+still in `rot13_writev` before `STACK_WIND` returns (still at rot-13.c:123). If
+ you did any request cleanup here, then you need to be careful about what you
+do in the remainder of `rot13_writev` because data may have been freed etc.
+It's tempting to say you should just do the cleanup in `rot13_writev` after
+the `STACK_WIND,` but that's not valid because it's also possible that some
+other translator returned without calling `STACK_UNWIND` -- i.e. before
+`rot13_writev` is called, so then it would be the one getting null-pointer
+errors instead. To put it another way, the callback and the return from
+`STACK_WIND` can occur in either order or even simultaneously on different
+threads. Even if you were to use reference counts, you'd have to make sure to
+use locking or atomic operations to avoid races, and it's not worth it. Unless
+you *really* understand the possible flows of control and know what you're
+doing, it's better to do cleanup in the callback and nothing after
+`STACK_WIND.`
+
+At this point all that's left is a `STACK_UNWIND` and a return. The
+`STACK_UNWIND` invokes our parent's completion callback, and in this case our
+parent is FUSE so at that point the VFS layer is notified of the write being
+complete. Finally, we return through several levels of normal function calls
+until we come back to fuse_thread_proc, which waits for the next request.
+
+So that's it. For extra fun, you might want to repeat this exercise by stepping
+through some other call -- stat or setxattr might be good choices -- but you'll
+ have to use a translator that actually implements those calls to see much
+that's interesting. Then you'll pretty much know everything I knew when I
+started writing my first for-real translators, and probably even a bit more. I
+hope you've enjoyed this series, or at least found it useful, and if you have
+any suggestions for other topics I should cover please let me know (via
+comments or email, IRC or Twitter).
diff --git a/doc/hacker-guide/en-US/markdown/write-behind.md b/doc/hacker-guide/en-US/markdown/write-behind.md
new file mode 100644
index 000000000..e20682249
--- /dev/null
+++ b/doc/hacker-guide/en-US/markdown/write-behind.md
@@ -0,0 +1,56 @@
+performance/write-behind translator
+===================================
+
+Basic working
+--------------
+
+Write behind is basically a translator to lie to the application that the
+write-requests are finished, even before it is actually finished.
+
+On a regular translator tree without write-behind, control flow is like this:
+
+1. application makes a `write()` system call.
+2. VFS ==> FUSE ==> `/dev/fuse`.
+3. fuse-bridge initiates a glusterfs `writev()` call.
+4. `writev()` is `STACK_WIND()`ed upto client-protocol or storage translator.
+5. client-protocol, on receiving reply from server, starts `STACK_UNWIND()` towards the fuse-bridge.
+
+On a translator tree with write-behind, control flow is like this:
+
+1. application makes a `write()` system call.
+2. VFS ==> FUSE ==> `/dev/fuse`.
+3. fuse-bridge initiates a glusterfs `writev()` call.
+4. `writev()` is `STACK_WIND()`ed upto write-behind translator.
+5. write-behind adds the write buffer to its internal queue and does a `STACK_UNWIND()` towards the fuse-bridge.
+
+write call is completed in application's percepective. after
+`STACK_UNWIND()`ing towards the fuse-bridge, write-behind initiates a fresh
+writev() call to its child translator, whose replies will be consumed by
+write-behind itself. Write-behind _doesn't_ cache the write buffer, unless
+`option flush-behind on` is specified in volume specification file.
+
+Windowing
+---------
+
+With respect to write-behind, each write-buffer has three flags: `stack_wound`, `write_behind` and `got_reply`.
+
+* `stack_wound`: if set, indicates that write-behind has initiated `STACK_WIND()` towards child translator.
+* `write_behind`: if set, indicates that write-behind has done `STACK_UNWIND()` towards fuse-bridge.
+* `got_reply`: if set, indicates that write-behind has received reply from child translator for a `writev()` `STACK_WIND()`. a request will be destroyed by write-behind only if this flag is set.
+
+Currently pending write requests = aggregate size of requests with write_behind = 1 and got_reply = 0.
+
+window size limits the aggregate size of currently pending write requests. once
+the pending requests' size has reached the window size, write-behind blocks
+writev() calls from fuse-bridge. Blocking is only from application's
+perspective. Write-behind does `STACK_WIND()` to child translator
+straight-away, but hold behind the `STACK_UNWIND()` towards fuse-bridge.
+`STACK_UNWIND()` is done only once write-behind gets enough replies to
+accomodate for currently blocked request.
+
+Flush behind
+------------
+
+If `option flush-behind on` is specified in volume specification file, then
+write-behind sends aggregate write requests to child translator, instead of
+regular per request `STACK_WIND()`s.