diff options
Diffstat (limited to 'doc/hacker-guide/en-US')
| -rw-r--r-- | doc/hacker-guide/en-US/markdown/adding-fops.md | 18 | ||||
| -rw-r--r-- | doc/hacker-guide/en-US/markdown/afr.md | 191 | ||||
| -rw-r--r-- | doc/hacker-guide/en-US/markdown/coding-standard.md | 402 | ||||
| -rw-r--r-- | doc/hacker-guide/en-US/markdown/posix.md | 59 | ||||
| -rw-r--r-- | doc/hacker-guide/en-US/markdown/translator-development.md | 666 | ||||
| -rw-r--r-- | doc/hacker-guide/en-US/markdown/write-behind.md | 56 |
6 files changed, 1392 insertions, 0 deletions
diff --git a/doc/hacker-guide/en-US/markdown/adding-fops.md b/doc/hacker-guide/en-US/markdown/adding-fops.md new file mode 100644 index 000000000..3f72ed3e2 --- /dev/null +++ b/doc/hacker-guide/en-US/markdown/adding-fops.md @@ -0,0 +1,18 @@ +Adding a new FOP +================ + +Steps to be followed when adding a new FOP to GlusterFS: + +1. Edit `glusterfs.h` and add a `GF_FOP_*` constant. +2. Edit `xlator.[ch]` and: + * add the new prototype for fop and callback. + * edit `xlator_fops` structure. +3. Edit `xlator.c` and add to fill_defaults. +4. Edit `protocol.h` and add struct necessary for the new FOP. +5. Edit `defaults.[ch]` and provide default implementation. +6. Edit `call-stub.[ch]` and provide stub implementation. +7. Edit `common-utils.c` and add to gf_global_variable_init(). +8. Edit client-protocol and add your FOP. +9. Edit server-protocol and add your FOP. +10. Implement your FOP in any translator for which the default implementation + is not sufficient. diff --git a/doc/hacker-guide/en-US/markdown/afr.md b/doc/hacker-guide/en-US/markdown/afr.md new file mode 100644 index 000000000..1be7e39f2 --- /dev/null +++ b/doc/hacker-guide/en-US/markdown/afr.md @@ -0,0 +1,191 @@ +cluster/afr translator +====================== + +Locking +------- + +Before understanding replicate, one must understand two internal FOPs: + +### `GF_FILE_LK` + +This is exactly like `fcntl(2)` locking, except the locks are in a +separate domain from locks held by applications. + +### `GF_DIR_LK (loc_t *loc, char *basename)` + +This allows one to lock a name under a directory. For example, +to lock /mnt/glusterfs/foo, one would use the call: + +``` +GF_DIR_LK ({loc_t for "/mnt/glusterfs"}, "foo") +``` + +If one wishes to lock *all* the names under a particular directory, +supply the basename argument as `NULL`. + +The locks can either be read locks or write locks; consult the +function prototype for more details. + +Both these operations are implemented by the features/locks (earlier +known as posix-locks) translator. + +Basic design +------------ + +All FOPs can be classified into four major groups: + +### inode-read + +Operations that read an inode's data (file contents) or metadata (perms, etc.). + +access, getxattr, fstat, readlink, readv, stat. + +### inode-write + +Operations that modify an inode's data or metadata. + +chmod, chown, truncate, writev, utimens. + +### dir-read + +Operations that read a directory's contents or metadata. + +readdir, getdents, checksum. + +### dir-write + +Operations that modify a directory's contents or metadata. + +create, link, mkdir, mknod, rename, rmdir, symlink, unlink. + +Some of these make a subgroup in that they modify *two* different entries: +link, rename, symlink. + +### Others + +Other operations. + +flush, lookup, open, opendir, statfs. + +Algorithms +---------- + +Each of the four major groups has its own algorithm: + +### inode-read, dir-read + +1. Send a request to the first child that is up: + * if it fails: + * try the next available child + * if we have exhausted all children: + * return failure + +### inode-write + + All operations are done in parallel unless specified otherwise. + +1. Send a ``GF_FILE_LK`` request on all children for a write lock on the + appropriate region + (for metadata operations: entire file (0, 0) for writev: + (offset, offset+size of buffer)) + * If a lock request fails on a child: + * unlock all children + * try to acquire a blocking lock (`F_SETLKW`) on each child, serially. + If this fails (due to `ENOTCONN` or `EINVAL`): + Consider this child as dead for rest of transaction. +2. Mark all children as "pending" on all (alive) children (see below for +meaning of "pending"). + * If it fails on any child: + * mark it as dead (in transaction local state). +3. Perform operation on all (alive) children. + * If it fails on any child: + * mark it as dead (in transaction local state). +4. Unmark all successful children as not "pending" on all nodes. +5. Unlock region on all (alive) children. + +### dir-write + + The algorithm for dir-write is same as above except instead of holding + `GF_FILE_LK` locks we hold a GF_DIR_LK lock on the name being operated upon. + In case of link-type calls, we hold locks on both the operand names. + +"pending" +--------- + +The "pending" number is like a journal entry. A pending entry is an +array of 32-bit integers stored in network byte-order as the extended +attribute of an inode (which can be a directory as well). + +There are three keys corresponding to three types of pending operations: + +### `AFR_METADATA_PENDING` + +There are some metadata operations pending on this inode (perms, ctime/mtime, +xattr, etc.). + +### `AFR_DATA_PENDING` + +There is some data pending on this inode (writev). + +### `AFR_ENTRY_PENDING` + +There are some directory operations pending on this directory +(create, unlink, etc.). + +Self heal +--------- + +* On lookup, gather extended attribute data: + * If entry is a regular file: + * If an entry is present on one child and not on others: + * create entry on others. + * If entries exist but have different metadata (perms, etc.): + * consider the entry with the highest `AFR_METADATA_PENDING` number as + definitive and replicate its attributes on children. + * If entry is a directory: + * Consider the entry with the higest `AFR_ENTRY_PENDING` number as + definitive and replicate its contents on all children. + * If any two entries have non-matching types (i.e., one is file and + other is directory): + * Announce to the user via log that a split-brain situation has been + detected, and do nothing. +* On open, gather extended attribute data: + * Consider the file with the highest `AFR_DATA_PENDING` number as + the definitive one and replicate its contents on all other + children. + +During all self heal operations, appropriate locks must be held on all +regions/entries being affected. + +Inode scaling +------------- + +Inode scaling is necessary because if a situation arises where an inode number +is returned for a directory (by lookup) which was previously the inode number +of a file (as per FUSE's table), then FUSE gets horribly confused (consult a +FUSE expert for more details). + +To avoid such a situation, we distribute the 64-bit inode space equally +among all children of replicate. + +To illustrate: + +If c1, c2, c3 are children of replicate, they each get 1/3 of the available +inode space: + +------------- -- -- -- -- -- -- -- -- -- -- -- --- +Child: c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 ... +Inode number: 1 2 3 4 5 6 7 8 9 10 11 ... +------------- -- -- -- -- -- -- -- -- -- -- -- --- + +Thus, if lookup on c1 returns an inode number "2", it is scaled to "4" +(which is the second inode number in c1's space). + +This way we ensure that there is never a collision of inode numbers from +two different children. + +This reduction of inode space doesn't really reduce the usability of +replicate since even if we assume replicate has 1024 children (which would be a +highly unusual scenario), each child still has a 54-bit inode space: +$2^{54} \sim 1.8 \times 10^{16}$, which is much larger than any real +world requirement. diff --git a/doc/hacker-guide/en-US/markdown/coding-standard.md b/doc/hacker-guide/en-US/markdown/coding-standard.md new file mode 100644 index 000000000..178dc142a --- /dev/null +++ b/doc/hacker-guide/en-US/markdown/coding-standard.md @@ -0,0 +1,402 @@ +GlusterFS Coding Standards +========================== + +Structure definitions should have a comment per member +------------------------------------------------------ + +Every member in a structure definition must have a comment about its +purpose. The comment should be descriptive without being overly verbose. + +*Bad:* + +``` +gf_lock_t lock; /* lock */ +``` + +*Good:* + +``` +DBTYPE access_mode; /* access mode for accessing + * the databases, can be + * DB_HASH, DB_BTREE + * (option access-mode <mode>) + */ +``` + +Declare all variables at the beginning of the function +------------------------------------------------------ + +All local variables in a function must be declared immediately after the +opening brace. This makes it easy to keep track of memory that needs to be freed +during exit. It also helps debugging, since gdb cannot handle variables +declared inside loops or other such blocks. + +Always initialize local variables +--------------------------------- + +Every local variable should be initialized to a sensible default value +at the point of its declaration. All pointers should be initialized to NULL, +and all integers should be zero or (if it makes sense) an error value. + + +*Good:* + +``` +int ret = 0; +char *databuf = NULL; +int _fd = -1; +``` + +Initialization should always be done with a constant value +---------------------------------------------------------- + +Never use a non-constant expression as the initialization value for a variable. + + +*Bad:* + +``` +pid_t pid = frame->root->pid; +char *databuf = malloc (1024); +``` + +Validate all arguments to a function +------------------------------------ + +All pointer arguments to a function must be checked for `NULL`. +A macro named `VALIDATE` (in `common-utils.h`) +takes one argument, and if it is `NULL`, writes a log message and +jumps to a label called `err` after setting op_ret and op_errno +appropriately. It is recommended to use this template. + + +*Good:* + +``` +VALIDATE(frame); +VALIDATE(this); +VALIDATE(inode); +``` + +Never rely on precedence of operators +------------------------------------- + +Never write code that relies on the precedence of operators to execute +correctly. Such code can be hard to read and someone else might not +know the precedence of operators as accurately as you do. + +*Bad:* + +``` +if (op_ret == -1 && errno != ENOENT) +``` + +*Good:* + +``` +if ((op_ret == -1) && (errno != ENOENT)) +``` + +Use exactly matching types +-------------------------- + +Use a variable of the exact type declared in the manual to hold the +return value of a function. Do not use an ``equivalent'' type. + + +*Bad:* + +``` +int len = strlen (path); +``` + +*Good:* + +``` +size_t len = strlen (path); +``` + +Never write code such as `foo->bar->baz`; check every pointer +------------------------------------------------------------- + +Do not write code that blindly follows a chain of pointer +references. Any pointer in the chain may be `NULL` and thus +cause a crash. Verify that each pointer is non-null before following +it. + +Check return value of all functions and system calls +---------------------------------------------------- + +The return value of all system calls and API functions must be checked +for success or failure. + +*Bad:* + +``` +close (fd); +``` + +*Good:* + +``` +op_ret = close (_fd); +if (op_ret == -1) { + gf_log (this->name, GF_LOG_ERROR, + "close on file %s failed (%s)", real_path, + strerror (errno)); + op_errno = errno; + goto out; +} +``` + + +Gracefully handle failure of malloc +----------------------------------- + +GlusterFS should never crash or exit due to lack of memory. If a +memory allocation fails, the call should be unwound and an error +returned to the user. + +*Use result args and reserve the return value to indicate success or failure:* + +The return value of every functions must indicate success or failure (unless +it is impossible for the function to fail --- e.g., boolean functions). If +the function needs to return additional data, it must be returned using a +result (pointer) argument. + +*Bad:* + +``` +int32_t dict_get_int32 (dict_t *this, char *key); +``` + +*Good:* + +``` +int dict_get_int32 (dict_t *this, char *key, int32_t *val); +``` + +Always use the `n' versions of string functions +----------------------------------------------- + +Unless impossible, use the length-limited versions of the string functions. + +*Bad:* + +``` +strcpy (entry_path, real_path); +``` + +*Good:* + +``` +strncpy (entry_path, real_path, entry_path_len); +``` + +No dead or commented code +------------------------- + +There must be no dead code (code to which control can never be passed) or +commented out code in the codebase. + +Only one unwind and return per function +--------------------------------------- + +There must be only one exit out of a function. `UNWIND` and return +should happen at only point in the function. + +Function length or Keep functions small +--------------------------------------- + +We live in the UNIX-world where modules do one thing and do it well. +This rule should apply to our functions also. If a function is very long, try splitting it +into many little helper functions. The question is, in a coding +spree, how do we know a function is long and unreadable. One rule of +thumb given by Linus Torvalds is that, a function should be broken-up +if you have 4 or more levels of indentation going on for more than 3-4 +lines. + +*Example for a helper function:* +``` +static int +same_owner (posix_lock_t *l1, posix_lock_t *l2) +{ + return ((l1->client_pid == l2->client_pid) && + (l1->transport == l2->transport)); +} +``` + +Defining functions as static +---------------------------- + +Define internal functions as static only if you're +very sure that there will not be a crash(..of any kind..) emanating in +that function. If there is even a remote possibility, perhaps due to +pointer derefering, etc, declare the function as non-static. This +ensures that when a crash does happen, the function name shows up the +in the back-trace generated by libc. However, doing so has potential +for polluting the function namespace, so to avoid conflicts with other +components in other parts, ensure that the function names are +prepended with a prefix that identify the component to which it +belongs. For eg. non-static functions in io-threads translator start +with iot_. + +Ensure function calls wrap around after 80-columns +-------------------------------------------------- + +Place remaining arguments on the next line if needed. + +Functions arguments and function definition +------------------------------------------- + +Place all the arguments of a function definition on the same line +until the line goes beyond 80-cols. Arguments that extend beyind +80-cols should be placed on the next line. + +Style issues +------------ + +### Brace placement + +Use K&R/Linux style of brace placement for blocks. + +*Good:* + +``` +int some_function (...) +{ + if (...) { + /* ... */ + } else if (...) { + /* ... */ + } else { + /* ... */ + } + + do { + /* ... */ + } while (cond); +} +``` + +### Indentation + +Use *eight* spaces for indenting blocks. Ensure that your +file contains only spaces and not tab characters. You can do this +in Emacs by selecting the entire file (`C-x h`) and +running `M-x untabify`. + +To make Emacs indent lines automatically by eight spaces, add this +line to your `.emacs`: + +``` +(add-hook 'c-mode-hook (lambda () (c-set-style "linux"))) +``` + +### Comments + +Write a comment before every function describing its purpose (one-line), +its arguments, and its return value. Mention whether it is an internal +function or an exported function. + +Write a comment before every structure describing its purpose, and +write comments about each of its members. + +Follow the style shown below for comments, since such comments +can then be automatically extracted by doxygen to generate +documentation. + +*Good:* + +``` +/** +* hash_name -hash function for filenames +* @par: parent inode number +* @name: basename of inode +* @mod: number of buckets in the hashtable +* +* @return: success: bucket number +* failure: -1 +* +* Not for external use. +*/ +``` + +### Indicating critical sections + +To clearly show regions of code which execute with locks held, use +the following format: + +``` +pthread_mutex_lock (&mutex); +{ + /* code */ +} +pthread_mutex_unlock (&mutex); +``` + +*A skeleton fop function:* + +This is the recommended template for any fop. In the beginning come +the initializations. After that, the `success' control flow should be +linear. Any error conditions should cause a `goto` to a single +point, `out`. At that point, the code should detect the error +that has occured and do appropriate cleanup. + +``` +int32_t +sample_fop (call_frame_t *frame, xlator_t *this, ...) +{ + char * var1 = NULL; + int32_t op_ret = -1; + int32_t op_errno = 0; + DIR * dir = NULL; + struct posix_fd * pfd = NULL; + + VALIDATE_OR_GOTO (frame, out); + VALIDATE_OR_GOTO (this, out); + + /* other validations */ + + dir = opendir (...); + + if (dir == NULL) { + op_errno = errno; + gf_log (this->name, GF_LOG_ERROR, + "opendir failed on %s (%s)", loc->path, + strerror (op_errno)); + goto out; + } + + /* another system call */ + if (...) { + op_errno = ENOMEM; + gf_log (this->name, GF_LOG_ERROR, + "out of memory :("); + goto out; + } + + /* ... */ + + out: + if (op_ret == -1) { + + /* check for all the cleanup that needs to be + done */ + + if (dir) { + closedir (dir); + dir = NULL; + } + + if (pfd) { + FREE (pfd->path); + FREE (pfd); + pfd = NULL; + } + } + + STACK_UNWIND (frame, op_ret, op_errno, fd); + return 0; +} +``` diff --git a/doc/hacker-guide/en-US/markdown/posix.md b/doc/hacker-guide/en-US/markdown/posix.md new file mode 100644 index 000000000..84c813e55 --- /dev/null +++ b/doc/hacker-guide/en-US/markdown/posix.md @@ -0,0 +1,59 @@ +storage/posix translator +======================== + +Notes +----- + +### `SET_FS_ID` + +This is so that all filesystem checks are done with the user's +uid/gid and not GlusterFS's uid/gid. + +### `MAKE_REAL_PATH` + +This macro concatenates the base directory of the posix volume +('option directory') with the given path. + +### `need_xattr` in lookup + +If this flag is passed, lookup returns a xattr dictionary that contains +the file's create time, the file's contents, and the version number +of the file. + +This is a hack to increase small file performance. If an application +wants to read a small file, it can finish its job with just a lookup +call instead of a lookup followed by read. + +### `getdents`/`setdents` + +These are used by unify to set and get directory entries. + +### `ALIGN_BUF` + +Macro to align an address to a page boundary (4K). + +### `priv->export_statfs` + +In some cases, two exported volumes may reside on the same +partition on the server. Sending statvfs info for both +the volumes will lead to erroneous df output at the client, +since free space on the partition will be counted twice. + +In such cases, user can disable exporting statvfs info +on one of the volumes by setting this option. + +### `xattrop` + +This fop is used by replicate to set version numbers on files. + +### `getxattr`/`setxattr` hack to read/write files + +A key, `GLUSTERFS_FILE_CONTENT_STRING`, is handled in a special way by +`getxattr`/`setxattr`. A getxattr with the key will return the entire +content of the file as the value. A `setxattr` with the key will write +the value as the entire content of the file. + +### `posix_checksum` + +This calculates a simple XOR checksum on all entry names in a +directory that is used by unify to compare directory contents. diff --git a/doc/hacker-guide/en-US/markdown/translator-development.md b/doc/hacker-guide/en-US/markdown/translator-development.md new file mode 100644 index 000000000..77d1b606a --- /dev/null +++ b/doc/hacker-guide/en-US/markdown/translator-development.md @@ -0,0 +1,666 @@ +Translator development +====================== + +Setting the Stage +----------------- + +This is the first post in a series that will explain some of the details of +writing a GlusterFS translator, using some actual code to illustrate. + +Before we begin, a word about environments. GlusterFS is over 300K lines of +code spread across a few hundred files. That's no Linux kernel or anything, but + you're still going to be navigating through a lot of code in every +code-editing session, so some kind of cross-referencing is *essential*. I use +cscope with the vim bindings, and if I couldn't do Crtl+G and such to jump +between definitions all the time my productivity would be cut in half. You may +prefer different tools, but as I go through these examples you'll need +something functionally similar to follow on. OK, on with the show. + +The first thing you need to know is that translators are not just bags of +functions and variables. They need to have a very definite internal structure +so that the translator-loading code can figure out where all the pieces are. +The way it does this is to use dlsym to look for specific names within your +shared-object file, as follow (from `xlator.c`): + +``` +if (!(xl->fops = dlsym (handle, "fops"))) { + gf_log ("xlator", GF_LOG_WARNING, "dlsym(fops) on %s", + dlerror ()); + goto out; +} + +if (!(xl->cbks = dlsym (handle, "cbks"))) { + gf_log ("xlator", GF_LOG_WARNING, "dlsym(cbks) on %s", + dlerror ()); + goto out; +} + +if (!(xl->init = dlsym (handle, "init"))) { + gf_log ("xlator", GF_LOG_WARNING, "dlsym(init) on %s", + dlerror ()); + goto out; +} + +if (!(xl->fini = dlsym (handle, "fini"))) { + gf_log ("xlator", GF_LOG_WARNING, "dlsym(fini) on %s", + dlerror ()); + goto out; +} +``` + +In this example, `xl` is a pointer to the in-memory object for the translator +we're loading. As you can see, it's looking up various symbols *by name* in the + shared object it just loaded, and storing pointers to those symbols. Some of +them (e.g. init are functions, while others e.g. fops are dispatch tables +containing pointers to many functions. Together, these make up the translator's + public interface. + +Most of this glue or boilerplate can easily be found at the bottom of one of +the source files that make up each translator. We're going to use the `rot-13` +translator just for fun, so in this case you'd look in `rot-13.c` to see this: + +``` +struct xlator_fops fops = { + .readv = rot13_readv, + .writev = rot13_writev +}; + +struct xlator_cbks cbks = { +}; + +struct volume_options options[] = { +{ .key = {"encrypt-write"}, + .type = GF_OPTION_TYPE_BOOL +}, +{ .key = {"decrypt-read"}, + .type = GF_OPTION_TYPE_BOOL +}, +{ .key = {NULL} }, +}; +``` + +The `fops` table, defined in `xlator.h`, is one of the most important pieces. +This table contains a pointer to each of the filesystem functions that your +translator might implement -- `open`, `read`, `stat`, `chmod`, and so on. There +are 82 such functions in all, but don't worry; any that you don't specify here +will be see as null and filled with defaults from `defaults.c` when your +translator is loaded. In this particular example, since `rot-13` is an +exceptionally simple translator, we only fill in two entries for `readv` and +`writev`. + +There are actually two other tables, also required to have predefined names, +that are also used to find translator functions: `cbks` (which is empty in this + snippet) and `dumpops` (which is missing entirely). The first of these specify + entry points for when inodes are forgotten or file descriptors are released. +In other words, they're destructors for objects in which your translator might + have an interest. Mostly you can ignore them, because the default behavior +handles even the simpler cases of translator-specific inode/fd context +automatically. However, if the context you attach is a complex structure +requiring complex cleanup, you'll need to supply these functions. As for +dumpops, that's just used if you want to provide functions to pretty-print +various structures in logs. I've never used it myself, though I probably +should. What's noteworthy here is that we don't even define dumpops. That's +because all of the functions that might use these dispatch functions will check + for `xl->dumpops` being `NULL` before calling through it. This is in sharp +contrast to the behavior for `fops` and `cbks1`, which *must* be present. If +they're not, translator loading will fail because these pointers are not +checked every time and if they're `NULL` then we'll segfault. That's why we +provide an empty definition for cbks; it's OK for the individual function +pointers to be NULL, but not for the whole table to be absent. + +The last piece I'll cover today is options. As you can see, this is a table of +translator-specific option names and some information about their types. +GlusterFS actually provides a pretty rich set of types (`volume_option_type_t` +in `options.`h) which includes paths, translator names, percentages, and times +in addition to the obvious integers and strings. Also, the `volume_option_t` +structure can include information about alternate names, min/max/default +values, enumerated string values, and descriptions. We don't see any of these +here, so let's take a quick look at some more complex examples from afr.c and +then come back to `rot-13`. + +``` +{ .key = {"data-self-heal-algorithm"}, + .type = GF_OPTION_TYPE_STR, + .default_value = "", + .description = "Select between \"full\", \"diff\". The " + "\"full\" algorithm copies the entire file from " + "source to sink. The \"diff\" algorithm copies to " + "sink only those blocks whose checksums don't match " + "with those of source.", + .value = { "diff", "full", "" } +}, +{ .key = {"data-self-heal-window-size"}, + .type = GF_OPTION_TYPE_INT, + .min = 1, + .max = 1024, + .default_value = "1", + .description = "Maximum number blocks per file for which " + "self-heal process would be applied simultaneously." +}, +``` + +When your translator is loaded, all of this information is used to parse the +options actually provided in the volfile, and then the result is turned into a +dictionary and stored as `xl->options`. This dictionary is then processed by +your init function, which you can see being looked up in the first code +fragment above. We're only going to look at a small part of the `rot-13`'s +init for now. + +``` +priv->decrypt_read = 1; +priv->encrypt_write = 1; + +data = dict_get (this->options, "encrypt-write"); +if (data) { + if (gf_string2boolean (data->data, &priv->encrypt_write + == -1) { + gf_log (this->name, GF_LOG_ERROR, + "encrypt-write takes only boolean options"); + return -1; + } +} +``` + +What we can see here is that we're setting some defaults in our priv structure, +then looking to see if an `encrypt-write` option was actually provided. If so, +we convert and store it. This is a pretty classic use of dict_get to fetch a +field from a dictionary, and of using one of many conversion functions in +`common-utils.c` to convert `data->data` into something we can use. + +So far we've covered the basic of how a translator gets loaded, how we find its +various parts, and how we process its options. In my next Translator 101 post, +we'll go a little deeper into other things that init and its companion fini +might do, and how some other fields in our `xlator_t` structure (commonly +referred to as this) are commonly used. + +`init`, `fini`, and private context +----------------------------------- + +In the previous Translator 101 post, we looked at some of the dispatch tables +and options processing in a translator. This time we're going to cover the rest + of the "shell" of a translator -- i.e. the other global parts not specific to +handling a particular request. + +Let's start by looking at the relationship between a translator and its shared +library. At a first approximation, this is the relationship between an object +and a class in just about any object-oriented programming language. The class +defines behaviors, but has to be instantiated as an object to have any kind of +existence. In our case the object is an `xlator_t`. Several of these might be +created within the same daemon, sharing all of the same code through init/fini +and dispatch tables, but sharing *no data*. You could implement shared data (as + static variables in your shared libraries) but that's strongly discouraged. +Every function in your shared library will get an `xlator_t` as an argument, +and should use it. This lack of class-level data is one of the points where +the analogy to common OOP systems starts to break down. Another place is the +complete lack of inheritance. Translators inherit behavior (code) from exactly +one shared library -- looked up and loaded using the `type` field in a volfile +`volume ... end-volume` block -- and that's it -- not even single inheritance, +no subclasses or superclasses, no mixins or prototypes, just the relationship +between an object and its class. With that in mind, let's turn to the init +function that we just barely touched on last time. + +``` +int32_t +init (xlator_t *this) +{ + data_t *data = NULL; + rot_13_private_t *priv = NULL; + + if (!this->children || this->children->next) { + gf_log ("rot13", GF_LOG_ERROR, + "FATAL: rot13 should have exactly one child"); + return -1; + } + + if (!this->parents) { + gf_log (this->name, GF_LOG_WARNING, + "dangling volume. check volfile "); + } + + priv = GF_CALLOC (sizeof (rot_13_private_t), 1, 0); + if (!priv) + return -1; +``` + +At the very top, we see the function signature -- we get a pointer to the +`xlator_t` object that we're initializing, and we return an `int32_t` status. +As with most functions in the translator API, this should be zero to indicate +success. In this case it's safe to return -1 for failure, but watch out: in +dispatch-table functions, the return value means the status of the *function +call* rather than the *request*. A request error should be reflected as a +callback with a non-zero `op_re`t value, but the dispatch function itself +should still return zero. In fact, the handling of a non-zero return from a +dispatch function is not all that robust (we recently had a bug report in +HekaFS related to this) so it's something you should probably avoid +altogether. This only underscores the difference between dispatch functions +and `init`/`fini` functions, where non-zero returns *are* expected and handled +logically by aborting the translator setup. We can see that down at the +bottom, where we return -1 to indicate that we couldn't allocate our +private-data area (more about that later). + +The first thing this init function does is check that the translator is being +set up in the right kind of environment. Translators are called by parents and +in turn call children. Some translators are "initial" translators that inject +requests into the system from elsewhere -- e.g. mount/fuse injecting requests +from the kernel, protocol/server injecting requests from the network. Those +translators don't need parents, but `rot-13` does and so we check for that. +Similarly, some translators are "final" translators that (from the perspective +of the current process) terminate requests instead of passing them on -- e.g. +`protocol/client` passing them to another node, `storage/posix` passing them to +a local filesystem. Other translators "multiplex" between multiple children -- + passing each parent request on to one (`cluster/dht`), some +(`cluster/stripe`), or all (`cluster/afr`) of those children. `rot-13` fits +into none of those categories either, so it checks that it has *exactly one* +child. It might be more convenient or robust if translator shared libraries +had standard variables describing these requirements, to be checked in a +consistent way by the translator-loading infrastructure itself instead of by +each separate init function, but this is the way translators work today. + +The last thing we see in this fragment is allocating our private data area. +This can literally be anything we want; the infrastructure just provides the +priv pointer as a convenience but takes no responsibility for how it's used. In + this case we're using `GF_CALLOC` to allocate our own `rot_13_private_t` +structure. This gets us all the benefits of GlusterFS's memory-leak detection +infrastructure, but the way we're calling it is not quite ideal. For one thing, + the first two arguments -- from `calloc(3)` -- are kind of reversed. For +another, notice how the last argument is zero. That can actually be an +enumerated value, to tell the GlusterFS allocator *what* type we're +allocating. This can be very useful information for memory profiling and leak +detection, so it's recommended that you follow the example of any +x`xx-mem-types.h` file elsewhere in the source tree instead of just passing +zero here (even though that works). + +To finish our tour of standard initialization/termination, let's look at the +end of `init` and the beginning of `fini`: + +``` + this->private = priv; + gf_log ("rot13", GF_LOG_DEBUG, "rot13 xlator loaded"); + return 0; +} + +void +fini (xlator_t *this) +{ + rot_13_private_t *priv = this->private; + + if (!priv) + return; + this->private = NULL; + GF_FREE (priv); +``` + +At the end of init we're just storing our private-data pointer in the `priv` +field of our `xlator_t`, then returning zero to indicate that initialization +succeeded. As is usually the case, our fini is even simpler. All it really has +to do is `GF_FREE` our private-data pointer, which we do in a slightly +roundabout way here. Notice how we don't even have a return value here, since +there's nothing obvious and useful that the infrastructure could do if `fini` +failed. + +That's practically everything we need to know to get our translator through +loading, initialization, options processing, and termination. If we had defined + no dispatch functions, we could actually configure a daemon to use our +translator and it would work as a basic pass-through from its parent to a +single child. In the next post I'll cover how to build the translator and +configure a daemon to use it, so that we can actually step through it in a +debugger and see how it all fits together before we actually start adding +functionality. + +This Time For Real +------------------ + +In the first two parts of this series, we learned how to write a basic +translator skeleton that can get through loading, initialization, and option +processing. This time we'll cover how to build that translator, configure a +volume to use it, and run the glusterfs daemon in debug mode. + +Unfortunately, there's not much direct support for writing new translators. You +can check out a GlusterFS tree and splice in your own translator directory, but + that's a bit painful because you'll have to update multiple makefiles plus a +bunch of autoconf garbage. As part of the HekaFS project, I basically reverse +engineered the truly necessary parts of the translator-building process and +then pestered one of the Fedora glusterfs package maintainers (thanks +daMaestro!) to add a `glusterfs-devel` package with the required headers. Since + then the complexity level in the HekaFS tree has crept back up a bit, but I +still remember the simple method and still consider it the easiest way to get +started on a new translator. For the sake of those not using Fedora, I'm going +to describe a method that doesn't depend on that header package. What it does +depend on is a GlusterFS source tree, much as you might have cloned from GitHub + or the Gluster review site. This tree doesn't have to be fully built, but you +do need to run `autogen.sh` and configure in it. Then you can take the +following simple makefile and put it in a directory with your actual source. + +``` +# Change these to match your source code. +TARGET = rot-13.so +OBJECTS = rot-13.o + +# Change these to match your environment. +GLFS_SRC = /srv/glusterfs +GLFS_LIB = /usr/lib64 +HOST_OS = GF_LINUX_HOST_OS + +# You shouldn't need to change anything below here. + +CFLAGS = -fPIC -Wall -O0 -g \ + -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE \ + -D$(HOST_OS) -I$(GLFS_SRC) -I$(GLFS_SRC)/contrib/uuid \ + -I$(GLFS_SRC)/libglusterfs/src +LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB) -lglusterfs \ + -lpthread + +$(TARGET): $(OBJECTS) + $(CC) $(OBJECTS) $(LDFLAGS) -o $(TARGET) +``` + +Yes, it's still Linux-specific. Mea culpa. As you can see, we're sticking with +the `rot-13` example, so you can just copy the files from +`xlators/encryption/rot-13/src` in your GlusterFS tree to follow on. Type +`make` and you should be rewarded with a nice little `.so` file. + +``` +xlator_example$ ls -l rot-13.so +-rwxr-xr-x. 1 jeff jeff 40784 Nov 16 16:41 rot-13.so +``` + +Notice that we've built with optimization level zero and debugging symbols +included, which would not typically be the case for a packaged version of +GlusterFS. Let's put our version of `rot-13.so` into a slightly different file +on our system, so that it doesn't stomp on the installed version (not that +you'd ever want to use that anyway). + +``` +xlator_example# ls /usr/lib64/glusterfs/3git/xlator/encryption/ +crypt.so crypt.so.0 crypt.so.0.0.0 rot-13.so rot-13.so.0 +rot-13.so.0.0.0 +xlator_example# cp rot-13.so \ + /usr/lib64/glusterfs/3git/xlator/encryption/my-rot-13.so +``` + +These paths represent the current Gluster filesystem layout, which is likely to +be deprecated in favor of the Fedora layout; your paths may vary. At this point + we're ready to configure a volume using our new translator. To do that, I'm +going to suggest something that's strongly discouraged except during +development (the Gluster guys are going to hate me for this): write our own +volfile. Here's just about the simplest volfile you'll ever see. + +``` +volume my-posix + type storage/posix + option directory /srv/export +end-volume + +volume my-rot13 + type encryption/my-rot-13 + subvolumes my-posix +end-volume +``` + +All we have here is a basic brick using `/srv/export` for its data, and then +an instance of our translator layered on top -- no client or server is +necessary for what we're doing, and the system will automatically push a +mount/fuse translator on top if there's no server translator. To try this out, +all we need is the following command (assuming the directories involved already + exist). + +``` +xlator_example$ glusterfs --debug -f my.vol /srv/import +``` + +You should be rewarded with a whole lot of log output, including the text of +the volfile (this is very useful for debugging problems in the field). If you +go to another window on the same machine, you can see that you have a new +filesystem mounted. + +``` +~$ df /srv/import +Filesystem 1K-blocks Used Available Use% Mounted on +/srv/xlator_example/my.vol + 114506240 2706176 105983488 3% /srv/import +``` + +Just for fun, write something into a file in `/srv/import`, then look at the +corresponding file in `/srv/export` to see it all `rot-13`'ed for you. + +``` +~$ echo hello > /srv/import/a_file +~$ cat /srv/export/a_file +uryyb +``` + +There you have it -- functionality you control, implemented easily, layered on +top of local storage. Now you could start adding functionality -- real +encryption, perhaps -- and inevitably having to debug it. You could do that the + old-school way, with `gf_log` (preferred) or even plain old `printf`, or you +could run daemons under `gdb` instead. Alternatively, you could wait for the +next Translator 101 post, where we'll be doing exactly that. + +Debugging a Translator +---------------------- + +Now that we've learned what a translator looks like and how to build one, it's +time to run one and actually watch it work. The best way to do this is good +old-fashioned `gdb`, as follows (using some of the examples from last time). + +``` +xlator_example# gdb glusterfs +GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6) +... +(gdb) r --debug -f my.vol /srv/import +Starting program: /usr/sbin/glusterfs --debug -f my.vol /srv/import +... +[2011-11-23 11:23:16.495516] I [fuse-bridge.c:2971:fuse_init] + 0-glusterfs-fuse: FUSE inited with protocol versions: + glusterfs 7.13 kernel 7.13 +``` + +If you get to this point, your glusterfs client process is already running. You +can go to another window to see the mountpoint, do file operations, etc. + +``` +~# df /srv/import +Filesystem 1K-blocks Used Available Use% Mounted on +/root/xlator_example/my.vol + 114506240 2643968 106045568 3% /srv/import +~# ls /srv/import +a_file +~# cat /srv/import/a_file +hello +``` + +Now let's interrupt the process and see where we are. + +``` +^C +Program received signal SIGINT, Interrupt. +0x0000003a0060b3dc in pthread_cond_wait@@GLIBC_2.3.2 () + from /lib64/libpthread.so.0 +(gdb) info threads + 5 Thread 0x7fffeffff700 (LWP 27206) 0x0000003a002dd8c7 + in readv () + from /lib64/libc.so.6 + 4 Thread 0x7ffff50e3700 (LWP 27205) 0x0000003a0060b75b + in pthread_cond_timedwait@@GLIBC_2.3.2 () + from /lib64/libpthread.so.0 + 3 Thread 0x7ffff5f02700 (LWP 27204) 0x0000003a0060b3dc + in pthread_cond_wait@@GLIBC_2.3.2 () + from /lib64/libpthread.so.0 + 2 Thread 0x7ffff6903700 (LWP 27203) 0x0000003a0060f245 + in sigwait () + from /lib64/libpthread.so.0 +* 1 Thread 0x7ffff7957700 (LWP 27196) 0x0000003a0060b3dc + in pthread_cond_wait@@GLIBC_2.3.2 () + from /lib64/libpthread.so.0 +``` + +Like any non-toy server, this one has multiple threads. What are they all +doing? Honestly, even I don't know. Thread 1 turns out to be in +`event_dispatch_epoll`, which means it's the one handling all of our network +I/O. Note that with socket multi-threading patch this will change, with one +thread in `socket_poller` per connection. Thread 2 is in `glusterfs_sigwaiter` +which means signals will be isolated to that thread. Thread 3 is in +`syncenv_task`, so it's a worker process for synchronous requests such as +those used by the rebalance and repair code. Thread 4 is in +`janitor_get_next_fd`, so it's waiting for a chance to close no-longer-needed +file descriptors on the local filesystem. (I admit I had to look that one up, +BTW.) Lastly, thread 5 is in `fuse_thread_proc`, so it's the one fetching +requests from our FUSE interface. You'll often see many more threads than +this, but it's a pretty good basic set. Now, let's set a breakpoint so we can +actually watch a request. + +``` +(gdb) b rot13_writev +Breakpoint 1 at 0x7ffff50e4f0b: file rot-13.c, line 119. +(gdb) c +Continuing. +``` + +At this point we go into our other window and do something that will involve a write. + +``` +~# echo goodbye > /srv/import/another_file +(back to the first window) +[Switching to Thread 0x7fffeffff700 (LWP 27206)] + +Breakpoint 1, rot13_writev (frame=0x7ffff6e4402c, this=0x638440, + fd=0x7ffff409802c, vector=0x7fffe8000cd8, count=1, offset=0, + iobref=0x7fffe8001070) at rot-13.c:119 +119 rot_13_private_t *priv = (rot_13_private_t *)this->private; +``` + +Remember how we built with debugging symbols enabled and no optimization? That +will be pretty important for the next few steps. As you can see, we're in +`rot13_writev`, with several parameters. + +* `frame` is our always-present frame pointer for this request. Also, + `frame->local` will point to any local data we created and attached to the + request ourselves. +* `this` is a pointer to our instance of the `rot-13` translator. You can examine + it if you like to see the name, type, options, parent/children, inode table, + and other stuff associated with it. +* `fd` is a pointer to a file-descriptor *object* (`fd_t`, not just a + file-descriptor index which is what most people use "fd" for). This in turn + points to an inode object (`inode_t`) and we can associate our own + `rot-13`-specific data with either of these. +* `vector` and `count` together describe the data buffers for this write, which + we'll get to in a moment. +* `offset` is the offset into the file at which we're writing. +* `iobref` is a buffer-reference object, which is used to track the life cycle + of buffers containing read/write data. If you look closely, you'll notice that + `vector[0].iov_base` points to the same address as `iobref->iobrefs[0].ptr`, which + should give you some idea of the inter-relationships between vector and iobref. + +OK, now what about that `vector`? We can use it to examine the data being +written, like this. + +``` +(gdb) p vector[0] +$2 = {iov_base = 0x7ffff7936000, iov_len = 8} +(gdb) x/s 0x7ffff7936000 +0x7ffff7936000: "goodbye\n" +``` + +It's not always safe to view this data as a string, because it might just as +well be binary data, but since we're generating the write this time it's safe +and convenient. With that knowledge, let's step through things a bit. + +``` +(gdb) s +120 if (priv->encrypt_write) +(gdb) +121 rot13_iovec (vector, count); +(gdb) +rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57 +57 for (i = 0; i < count; i++) { +(gdb) +58 rot13 (vector[i].iov_base, vector[i].iov_len); +(gdb) +rot13 (buf=0x7ffff7936000 "goodbye\n", len=8) at rot-13.c:45 +45 for (i = 0; i < len; i++) { +(gdb) +46 if (buf[i] >= 'a' && buf[i] <= 'z') +(gdb) +47 buf[i] = 'a' + ((buf[i] - 'a' + 13) % 26); +``` + +Here we've stepped into `rot13_iovec`, which iterates through our vector +calling `rot13`, which in turn iterates through the characters in that chunk +doing the `rot-13` operation if/as appropriate. This is pretty straightforward +stuff, so let's skip to the next interesting bit. + +``` +(gdb) fin +Run till exit from #0 rot13 (buf=0x7ffff7936000 "goodbye\n", + len=8) at rot-13.c:47 +rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57 +57 for (i = 0; i < count; i++) { +(gdb) fin +Run till exit from #0 rot13_iovec (vector=0x7fffe8000cd8, + count=1) at rot-13.c:57 +rot13_writev (frame=0x7ffff6e4402c, this=0x638440, + fd=0x7ffff409802c, vector=0x7fffe8000cd8, count=1, + offset=0, iobref=0x7fffe8001070) at rot-13.c:123 +123 STACK_WIND (frame, +(gdb) b 129 +Breakpoint 2 at 0x7ffff50e4f35: file rot-13.c, line 129. +(gdb) b rot13_writev_cbk +Breakpoint 3 at 0x7ffff50e4db3: file rot-13.c, line 106. +(gdb) c +``` + +So we've set breakpoints on both the callback and the statement following the +`STACK_WIND`. Which one will we hit first? + +``` +Breakpoint 3, rot13_writev_cbk (frame=0x7ffff6e4402c, + cookie=0x7ffff6e440d8, this=0x638440, op_ret=8, op_errno=0, + prebuf=0x7fffefffeca0, postbuf=0x7fffefffec30) + at rot-13.c:106 +106 STACK_UNWIND_STRICT (writev, frame, op_ret, op_errno, + prebuf, postbuf); +(gdb) bt +#0 rot13_writev_cbk (frame=0x7ffff6e4402c, + cookie=0x7ffff6e440d8, this=0x638440, op_ret=8, op_errno=0, + prebuf=0x7fffefffeca0, postbuf=0x7fffefffec30) + at rot-13.c:106 +#1 0x00007ffff52f1b37 in posix_writev (frame=0x7ffff6e440d8, + this=<value optimized out>, fd=<value optimized out>, + vector=<value optimized out>, count=1, + offset=<value optimized out>, iobref=0x7fffe8001070) + at posix.c:2217 +#2 0x00007ffff50e513e in rot13_writev (frame=0x7ffff6e4402c, + this=0x638440, fd=0x7ffff409802c, vector=0x7fffe8000cd8, + count=1, offset=0, iobref=0x7fffe8001070) at rot-13.c:123 +``` + +Surprise! We're in `rot13_writev_cbk` now, called (indirectly) while we're +still in `rot13_writev` before `STACK_WIND` returns (still at rot-13.c:123). If + you did any request cleanup here, then you need to be careful about what you +do in the remainder of `rot13_writev` because data may have been freed etc. +It's tempting to say you should just do the cleanup in `rot13_writev` after +the `STACK_WIND,` but that's not valid because it's also possible that some +other translator returned without calling `STACK_UNWIND` -- i.e. before +`rot13_writev` is called, so then it would be the one getting null-pointer +errors instead. To put it another way, the callback and the return from +`STACK_WIND` can occur in either order or even simultaneously on different +threads. Even if you were to use reference counts, you'd have to make sure to +use locking or atomic operations to avoid races, and it's not worth it. Unless +you *really* understand the possible flows of control and know what you're +doing, it's better to do cleanup in the callback and nothing after +`STACK_WIND.` + +At this point all that's left is a `STACK_UNWIND` and a return. The +`STACK_UNWIND` invokes our parent's completion callback, and in this case our +parent is FUSE so at that point the VFS layer is notified of the write being +complete. Finally, we return through several levels of normal function calls +until we come back to fuse_thread_proc, which waits for the next request. + +So that's it. For extra fun, you might want to repeat this exercise by stepping +through some other call -- stat or setxattr might be good choices -- but you'll + have to use a translator that actually implements those calls to see much +that's interesting. Then you'll pretty much know everything I knew when I +started writing my first for-real translators, and probably even a bit more. I +hope you've enjoyed this series, or at least found it useful, and if you have +any suggestions for other topics I should cover please let me know (via +comments or email, IRC or Twitter). diff --git a/doc/hacker-guide/en-US/markdown/write-behind.md b/doc/hacker-guide/en-US/markdown/write-behind.md new file mode 100644 index 000000000..e20682249 --- /dev/null +++ b/doc/hacker-guide/en-US/markdown/write-behind.md @@ -0,0 +1,56 @@ +performance/write-behind translator +=================================== + +Basic working +-------------- + +Write behind is basically a translator to lie to the application that the +write-requests are finished, even before it is actually finished. + +On a regular translator tree without write-behind, control flow is like this: + +1. application makes a `write()` system call. +2. VFS ==> FUSE ==> `/dev/fuse`. +3. fuse-bridge initiates a glusterfs `writev()` call. +4. `writev()` is `STACK_WIND()`ed upto client-protocol or storage translator. +5. client-protocol, on receiving reply from server, starts `STACK_UNWIND()` towards the fuse-bridge. + +On a translator tree with write-behind, control flow is like this: + +1. application makes a `write()` system call. +2. VFS ==> FUSE ==> `/dev/fuse`. +3. fuse-bridge initiates a glusterfs `writev()` call. +4. `writev()` is `STACK_WIND()`ed upto write-behind translator. +5. write-behind adds the write buffer to its internal queue and does a `STACK_UNWIND()` towards the fuse-bridge. + +write call is completed in application's percepective. after +`STACK_UNWIND()`ing towards the fuse-bridge, write-behind initiates a fresh +writev() call to its child translator, whose replies will be consumed by +write-behind itself. Write-behind _doesn't_ cache the write buffer, unless +`option flush-behind on` is specified in volume specification file. + +Windowing +--------- + +With respect to write-behind, each write-buffer has three flags: `stack_wound`, `write_behind` and `got_reply`. + +* `stack_wound`: if set, indicates that write-behind has initiated `STACK_WIND()` towards child translator. +* `write_behind`: if set, indicates that write-behind has done `STACK_UNWIND()` towards fuse-bridge. +* `got_reply`: if set, indicates that write-behind has received reply from child translator for a `writev()` `STACK_WIND()`. a request will be destroyed by write-behind only if this flag is set. + +Currently pending write requests = aggregate size of requests with write_behind = 1 and got_reply = 0. + +window size limits the aggregate size of currently pending write requests. once +the pending requests' size has reached the window size, write-behind blocks +writev() calls from fuse-bridge. Blocking is only from application's +perspective. Write-behind does `STACK_WIND()` to child translator +straight-away, but hold behind the `STACK_UNWIND()` towards fuse-bridge. +`STACK_UNWIND()` is done only once write-behind gets enough replies to +accomodate for currently blocked request. + +Flush behind +------------ + +If `option flush-behind on` is specified in volume specification file, then +write-behind sends aggregate write requests to child translator, instead of +regular per request `STACK_WIND()`s. |
