author | Humble Devassy Chirammal <hchiramm@redhat.com> | 2015-03-30 12:21:05 +0530 |
---|---|---|
committer | Kaleb KEITHLEY <kkeithle@redhat.com> | 2015-03-30 05:51:12 -0700 |
commit | 8907f67ba215172b01a7018adcbb063fcc4570e9 (patch) | |
tree | 68cf6557990ae4215926068625a7b5e0e1374882 /doc/hacker-guide | |
parent | e3bd2387a5973df4548fe4a62b5dfc227a2bdc64 (diff) |
doc: restructure developer docs to new layout
The developer oriented information is scattered in source
and its very difficult to identify which are those.
With this patch subdirs are created under developer-guide
which will be the parent for developer notes. The changes
suggested in http://review.gluster.org/#/c/8827/ are also
included in this patch.
Change-Id: I4c8510d52c49f4066225f72cac8f97f087d6c70c
BUG: 1206539
Signed-off-by: Humble Devassy Chirammal <hchiramm@redhat.com>
Reviewed-on: http://review.gluster.org/10038
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Lalatendu Mohanty <lmohanty@redhat.com>
Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>
Diffstat (limited to 'doc/hacker-guide')
-rw-r--r-- | doc/hacker-guide/en-US/markdown/adding-fops.md | 18 |
-rw-r--r-- | doc/hacker-guide/en-US/markdown/afr.md | 191 |
-rw-r--r-- | doc/hacker-guide/en-US/markdown/coding-standard.md | 402 |
-rw-r--r-- | doc/hacker-guide/en-US/markdown/inode.md | 226 |
-rw-r--r-- | doc/hacker-guide/en-US/markdown/posix.md | 59 |
-rw-r--r-- | doc/hacker-guide/en-US/markdown/translator-development.md | 666 |
-rw-r--r-- | doc/hacker-guide/en-US/markdown/unittest.md | 228 |
-rw-r--r-- | doc/hacker-guide/en-US/markdown/write-behind.md | 56 |
8 files changed, 0 insertions, 1846 deletions
diff --git a/doc/hacker-guide/en-US/markdown/adding-fops.md b/doc/hacker-guide/en-US/markdown/adding-fops.md
deleted file mode 100644
index 3f72ed3e23a..00000000000
--- a/doc/hacker-guide/en-US/markdown/adding-fops.md
+++ /dev/null
@@ -1,18 +0,0 @@
-Adding a new FOP
-================
-
-Steps to be followed when adding a new FOP to GlusterFS:
-
-1. Edit `glusterfs.h` and add a `GF_FOP_*` constant.
-2. Edit `xlator.[ch]` and:
-    * add the new prototype for fop and callback.
-    * edit `xlator_fops` structure.
-3. Edit `xlator.c` and add to fill_defaults.
-4. Edit `protocol.h` and add struct necessary for the new FOP.
-5. Edit `defaults.[ch]` and provide default implementation.
-6. Edit `call-stub.[ch]` and provide stub implementation.
-7. Edit `common-utils.c` and add to gf_global_variable_init().
-8. Edit client-protocol and add your FOP.
-9. Edit server-protocol and add your FOP.
-10. Implement your FOP in any translator for which the default implementation
-    is not sufficient.
diff --git a/doc/hacker-guide/en-US/markdown/afr.md b/doc/hacker-guide/en-US/markdown/afr.md
deleted file mode 100644
index 566573a4e26..00000000000
--- a/doc/hacker-guide/en-US/markdown/afr.md
+++ /dev/null
@@ -1,191 +0,0 @@
-cluster/afr translator
-======================
-
-Locking
--------
-
-Before understanding replicate, one must understand two internal FOPs:
-
-### `GF_FILE_LK`
-
-This is exactly like `fcntl(2)` locking, except the locks are in a
-separate domain from locks held by applications.
-
-### `GF_DIR_LK (loc_t *loc, char *basename)`
-
-This allows one to lock a name under a directory. For example,
-to lock /mnt/glusterfs/foo, one would use the call:
-
-```
-GF_DIR_LK ({loc_t for "/mnt/glusterfs"}, "foo")
-```
-
-If one wishes to lock *all* the names under a particular directory,
-supply the basename argument as `NULL`.
-
-The locks can either be read locks or write locks; consult the
-function prototype for more details.
-
-Both these operations are implemented by the features/locks (earlier
-known as posix-locks) translator.
-
-Basic design
-------------
-
-All FOPs can be classified into four major groups:
-
-### inode-read
-
-Operations that read an inode's data (file contents) or metadata (perms, etc.).
-
-access, getxattr, fstat, readlink, readv, stat.
-
-### inode-write
-
-Operations that modify an inode's data or metadata.
-
-chmod, chown, truncate, writev, utimens.
-
-### dir-read
-
-Operations that read a directory's contents or metadata.
-
-readdir, getdents, checksum.
-
-### dir-write
-
-Operations that modify a directory's contents or metadata.
-
-create, link, mkdir, mknod, rename, rmdir, symlink, unlink.
-
-Some of these make a subgroup in that they modify *two* different entries:
-link, rename, symlink.
-
-### Others
-
-Other operations.
-
-flush, lookup, open, opendir, statfs.
-
-Algorithms
-----------
-
-Each of the four major groups has its own algorithm:
-
-### inode-read, dir-read
-
-1. Send a request to the first child that is up:
-    * if it fails:
-        * try the next available child
-    * if we have exhausted all children:
-        * return failure
-
-### inode-write
-
- All operations are done in parallel unless specified otherwise.
-
-1. Send a ``GF_FILE_LK`` request on all children for a write lock on the
-   appropriate region
-   (for metadata operations: entire file (0, 0) for writev:
-   (offset, offset+size of buffer))
-    * If a lock request fails on a child:
-        * unlock all children
-        * try to acquire a blocking lock (`F_SETLKW`) on each child, serially.
-          If this fails (due to `ENOTCONN` or `EINVAL`):
-          Consider this child as dead for rest of transaction.
-2. Mark all children as "pending" on all (alive) children (see below for
-meaning of "pending").
-    * If it fails on any child:
-        * mark it as dead (in transaction local state).
-3. Perform operation on all (alive) children.
-    * If it fails on any child:
-        * mark it as dead (in transaction local state).
-4. Unmark all successful children as not "pending" on all nodes.
-5. Unlock region on all (alive) children.
-
-### dir-write
-
- The algorithm for dir-write is same as above except instead of holding
- `GF_FILE_LK` locks we hold a GF_DIR_LK lock on the name being operated upon.
- In case of link-type calls, we hold locks on both the operand names.
-
-"pending"
----------
-
-The "pending" number is like a journal entry. A pending entry is an
-array of 32-bit integers stored in network byte-order as the extended
-attribute of an inode (which can be a directory as well).
-
-There are three keys corresponding to three types of pending operations:
-
-### `AFR_METADATA_PENDING`
-
-There are some metadata operations pending on this inode (perms, ctime/mtime,
-xattr, etc.).
-
-### `AFR_DATA_PENDING`
-
-There is some data pending on this inode (writev).
-
-### `AFR_ENTRY_PENDING`
-
-There are some directory operations pending on this directory
-(create, unlink, etc.).
-
-Self heal
----------
-
-* On lookup, gather extended attribute data:
-    * If entry is a regular file:
-        * If an entry is present on one child and not on others:
-            * create entry on others.
-        * If entries exist but have different metadata (perms, etc.):
-            * consider the entry with the highest `AFR_METADATA_PENDING` number as
-              definitive and replicate its attributes on children.
-    * If entry is a directory:
-        * Consider the entry with the highest `AFR_ENTRY_PENDING` number as
-          definitive and replicate its contents on all children.
-    * If any two entries have non-matching types (i.e., one is file and
-      other is directory):
-        * Announce to the user via log that a split-brain situation has been
-          detected, and do nothing.
-* On open, gather extended attribute data:
-    * Consider the file with the highest `AFR_DATA_PENDING` number as
-      the definitive one and replicate its contents on all other
-      children.
-
-During all self heal operations, appropriate locks must be held on all
-regions/entries being affected.
-
-Inode scaling
--------------
-
-Inode scaling is necessary because if a situation arises where an inode number
-is returned for a directory (by lookup) which was previously the inode number
-of a file (as per FUSE's table), then FUSE gets horribly confused (consult a
-FUSE expert for more details).
-
-To avoid such a situation, we distribute the 64-bit inode space equally
-among all children of replicate.
-
-To illustrate:
-
-If c1, c2, c3 are children of replicate, they each get 1/3 of the available
-inode space:
-
-------------- -- -- -- -- -- -- -- -- -- -- -- ---
-Child:        c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 ...
-Inode number: 1  2  3  4  5  6  7  8  9  10 11 ...
-------------- -- -- -- -- -- -- -- -- -- -- -- ---
-
-Thus, if lookup on c1 returns an inode number "2", it is scaled to "4"
-(which is the second inode number in c1's space).
-
-This way we ensure that there is never a collision of inode numbers from
-two different children.
-
-This reduction of inode space doesn't really reduce the usability of
-replicate since even if we assume replicate has 1024 children (which would be a
-highly unusual scenario), each child still has a 54-bit inode space:
-$2^{54} \sim 1.8 \times 10^{16}$, which is much larger than any real
-world requirement.
diff --git a/doc/hacker-guide/en-US/markdown/coding-standard.md b/doc/hacker-guide/en-US/markdown/coding-standard.md
deleted file mode 100644
index 368c5553464..00000000000
--- a/doc/hacker-guide/en-US/markdown/coding-standard.md
+++ /dev/null
@@ -1,402 +0,0 @@
-GlusterFS Coding Standards
-==========================
-
-Structure definitions should have a comment per member
-------------------------------------------------------
-
-Every member in a structure definition must have a comment about its
-purpose. The comment should be descriptive without being overly verbose.
-
-*Bad:*
-
-```
-gf_lock_t lock;           /* lock */
-```
-
-*Good:*
-
-```
-DBTYPE    access_mode;    /* access mode for accessing
-                           * the databases, can be
-                           * DB_HASH, DB_BTREE
-                           * (option access-mode <mode>)
-                           */
-```
-
-Declare all variables at the beginning of the function
-------------------------------------------------------
-
-All local variables in a function must be declared immediately after the
-opening brace. This makes it easy to keep track of memory that needs to be freed
-during exit. It also helps debugging, since gdb cannot handle variables
-declared inside loops or other such blocks.
-
-Always initialize local variables
----------------------------------
-
-Every local variable should be initialized to a sensible default value
-at the point of its declaration. All pointers should be initialized to NULL,
-and all integers should be zero or (if it makes sense) an error value.
-
-
-*Good:*
-
-```
-int ret = 0;
-char *databuf = NULL;
-int _fd = -1;
-```
-
-Initialization should always be done with a constant value
-----------------------------------------------------------
-
-Never use a non-constant expression as the initialization value for a variable.
-
-
-*Bad:*
-
-```
-pid_t pid = frame->root->pid;
-char *databuf = malloc (1024);
-```
-
-Validate all arguments to a function
-------------------------------------
-
-All pointer arguments to a function must be checked for `NULL`.
-A macro named `VALIDATE` (in `common-utils.h`)
-takes one argument, and if it is `NULL`, writes a log message and
-jumps to a label called `err` after setting op_ret and op_errno
-appropriately. It is recommended to use this template.
-
-
-*Good:*
-
-```
-VALIDATE(frame);
-VALIDATE(this);
-VALIDATE(inode);
-```
-
-Never rely on precedence of operators
--------------------------------------
-
-Never write code that relies on the precedence of operators to execute
-correctly. Such code can be hard to read and someone else might not
-know the precedence of operators as accurately as you do.
-
-*Bad:*
-
-```
-if (op_ret == -1 && errno != ENOENT)
-```
-
-*Good:*
-
-```
-if ((op_ret == -1) && (errno != ENOENT))
-```
-
-Use exactly matching types
---------------------------
-
-Use a variable of the exact type declared in the manual to hold the
-return value of a function. Do not use an ``equivalent'' type.
-
-
-*Bad:*
-
-```
-int len = strlen (path);
-```
-
-*Good:*
-
-```
-size_t len = strlen (path);
-```
-
-Never write code such as `foo->bar->baz`; check every pointer
--------------------------------------------------------------
-
-Do not write code that blindly follows a chain of pointer
-references. Any pointer in the chain may be `NULL` and thus
-cause a crash. Verify that each pointer is non-null before following
-it.
-
-Check return value of all functions and system calls
-----------------------------------------------------
-
-The return value of all system calls and API functions must be checked
-for success or failure.
-
-*Bad:*
-
-```
-close (fd);
-```
-
-*Good:*
-
-```
-op_ret = close (_fd);
-if (op_ret == -1) {
-        gf_log (this->name, GF_LOG_ERROR,
-                "close on file %s failed (%s)", real_path,
-                strerror (errno));
-        op_errno = errno;
-        goto out;
-}
-```
-
-
-Gracefully handle failure of malloc
------------------------------------
-
-GlusterFS should never crash or exit due to lack of memory. If a
-memory allocation fails, the call should be unwound and an error
-returned to the user.
-
-*Use result args and reserve the return value to indicate success or failure:*
-
-The return value of every functions must indicate success or failure (unless
-it is impossible for the function to fail --- e.g., boolean functions). If
-the function needs to return additional data, it must be returned using a
-result (pointer) argument.
-
-*Bad:*
-
-```
-int32_t dict_get_int32 (dict_t *this, char *key);
-```
-
-*Good:*
-
-```
-int dict_get_int32 (dict_t *this, char *key, int32_t *val);
-```
-
-Always use the `n' versions of string functions
------------------------------------------------
-
-Unless impossible, use the length-limited versions of the string functions.
-
-*Bad:*
-
-```
-strcpy (entry_path, real_path);
-```
-
-*Good:*
-
-```
-strncpy (entry_path, real_path, entry_path_len);
-```
-
-No dead or commented code
--------------------------
-
-There must be no dead code (code to which control can never be passed) or
-commented out code in the codebase.
-
-Only one unwind and return per function
----------------------------------------
-
-There must be only one exit out of a function. `UNWIND` and return
-should happen at only point in the function.
-
-Function length or Keep functions small
----------------------------------------
-
-We live in the UNIX-world where modules do one thing and do it well.
-This rule should apply to our functions also. If a function is very long, try splitting it
-into many little helper functions. The question is, in a coding
-spree, how do we know a function is long and unreadable. One rule of
-thumb given by Linus Torvalds is that, a function should be broken-up
-if you have 4 or more levels of indentation going on for more than 3-4
-lines.
-
-*Example for a helper function:*
-```
-static int
-same_owner (posix_lock_t *l1, posix_lock_t *l2)
-{
-        return ((l1->client_pid == l2->client_pid) &&
-                (l1->transport == l2->transport));
-}
-```
-
-Defining functions as static
-----------------------------
-
-Define internal functions as static only if you're
-very sure that there will not be a crash(..of any kind..) emanating in
-that function. If there is even a remote possibility, perhaps due to
-pointer derefering, etc, declare the function as non-static. This
-ensures that when a crash does happen, the function name shows up the
-in the back-trace generated by libc. However, doing so has potential
-for polluting the function namespace, so to avoid conflicts with other
-components in other parts, ensure that the function names are
-prepended with a prefix that identify the component to which it
-belongs. For eg. non-static functions in io-threads translator start
-with iot_.
-
-Ensure function calls wrap around after 80-columns
---------------------------------------------------
-
-Place remaining arguments on the next line if needed.
-
-Functions arguments and function definition
--------------------------------------------
-
-Place all the arguments of a function definition on the same line
-until the line goes beyond 80-cols. Arguments that extend beyind
-80-cols should be placed on the next line.
-
-Style issues
-------------
-
-### Brace placement
-
-Use K&R/Linux style of brace placement for blocks.
-
-*Good:*
-
-```
-int some_function (...)
-{
-        if (...) {
-                /* ... */
-        } else if (...) {
-                /* ... */
-        } else {
-                /* ... */
-        }
-
-        do {
-                /* ... */
-        } while (cond);
-}
-```
-
-### Indentation
-
-Use *eight* spaces for indenting blocks. Ensure that your
-file contains only spaces and not tab characters. You can do this
-in Emacs by selecting the entire file (`C-x h`) and
-running `M-x untabify`.
-
-To make Emacs indent lines automatically by eight spaces, add this
-line to your `.emacs`:
-
-```
-(add-hook 'c-mode-hook (lambda () (c-set-style "linux")))
-```
-
-### Comments
-
-Write a comment before every function describing its purpose (one-line),
-its arguments, and its return value. Mention whether it is an internal
-function or an exported function.
-
-Write a comment before every structure describing its purpose, and
-write comments about each of its members.
-
-Follow the style shown below for comments, since such comments
-can then be automatically extracted by doxygen to generate
-documentation.
-
-*Good:*
-
-```
-/**
-* hash_name -hash function for filenames
-* @par: parent inode number
-* @name: basename of inode
-* @mod: number of buckets in the hashtable
-*
-* @return: success: bucket number
-*          failure: -1
-*
-* Not for external use.
-*/
-```
-
-### Indicating critical sections
-
-To clearly show regions of code which execute with locks held, use
-the following format:
-
-```
-pthread_mutex_lock (&mutex);
-{
-        /* code */
-}
-pthread_mutex_unlock (&mutex);
-```
-
-*A skeleton fop function:*
-
-This is the recommended template for any fop. In the beginning come
-the initializations. After that, the `success' control flow should be
-linear. Any error conditions should cause a `goto` to a single
-point, `out`. At that point, the code should detect the error
-that has occurred and do appropriate cleanup.
-
-```
-int32_t
-sample_fop (call_frame_t *frame, xlator_t *this, ...)
-{
-        char *            var1     = NULL;
-        int32_t           op_ret   = -1;
-        int32_t           op_errno = 0;
-        DIR *             dir      = NULL;
-        struct posix_fd * pfd      = NULL;
-
-        VALIDATE_OR_GOTO (frame, out);
-        VALIDATE_OR_GOTO (this, out);
-
-        /* other validations */
-
-        dir = opendir (...);
-
-        if (dir == NULL) {
-                op_errno = errno;
-                gf_log (this->name, GF_LOG_ERROR,
-                        "opendir failed on %s (%s)", loc->path,
-                        strerror (op_errno));
-                goto out;
-        }
-
-        /* another system call */
-        if (...) {
-                op_errno = ENOMEM;
-                gf_log (this->name, GF_LOG_ERROR,
-                        "out of memory :(");
-                goto out;
-        }
-
-        /* ... */
-
- out:
-        if (op_ret == -1) {
-
-          /* check for all the cleanup that needs to be
-             done */
-
-                if (dir) {
-                        closedir (dir);
-                        dir = NULL;
-                }
-
-                if (pfd) {
-                        FREE (pfd->path);
-                        FREE (pfd);
-                        pfd = NULL;
-                }
-        }
-
-        STACK_UNWIND (frame, op_ret, op_errno, fd);
-        return 0;
-}
-```
diff --git a/doc/hacker-guide/en-US/markdown/inode.md b/doc/hacker-guide/en-US/markdown/inode.md
deleted file mode 100644
index a340ab9ca8e..00000000000
--- a/doc/hacker-guide/en-US/markdown/inode.md
+++ /dev/null
@@ -1,226 +0,0 @@
-#Inode and dentry management in GlusterFS:
-
-##Background
-Filesystems internally refer to files and directories via inodes. Inodes
-are unique identifiers of the entities stored in a filesystem. Whenever an
-application has to operate on a file/directory (read/modify), the filesystem
-maps that file/directory to the right inode and start referring to that inode
-whenever an operation has to be performed on the file/directory.
-
-In GlusterFS a new inode gets created whenever a new file/directory is created
-OR when a successful lookup is done on a file/directory for the first time.
-Inodes in GlusterFS are maintained by the inode table which gets initiated when
-the filesystem daemon is started (both for the brick process as well as the
-mount process). Below are some important data structures for inode management.
-
-## Data-structure (inode-table)
-```
-struct _inode_table {
-        pthread_mutex_t    lock;
-        size_t             hashsize;    /* bucket size of inode hash and dentry hash */
-        char              *name;        /* name of the inode table, just for gf_log() */
-        inode_t           *root;        /* root directory inode, with inode
-                                           number and gfid 1 */
-        xlator_t          *xl;          /* xlator to be called to do purge and
-                                           the xlator which maintains the inode table*/
-        uint32_t           lru_limit;   /* maximum LRU cache size */
-        struct list_head  *inode_hash;  /* buckets for inode hash table */
-        struct list_head  *name_hash;   /* buckets for dentry hash table */
-        struct list_head   active;      /* list of inodes currently active (in an fop) */
-        uint32_t           active_size; /* count of inodes in active list */
-        struct list_head   lru;         /* list of inodes recently used.
-                                           lru.next most recent */
-        uint32_t           lru_size;    /* count of inodes in lru list */
-        struct list_head   purge;       /* list of inodes to be purged soon */
-        uint32_t           purge_size;  /* count of inodes in purge list */
-
-        struct mem_pool   *inode_pool;  /* memory pool for inodes */
-        struct mem_pool   *dentry_pool; /* memory pool for dentrys */
-        struct mem_pool   *fd_mem_pool; /* memory pool for fd_t */
-        int                ctxcount;    /* number of slots in inode->ctx */
-};
-```
-
-#Life-cycle
-```
-
-inode_table_new (size_t lru_limit, xlator_t *xl)
-
-This is a function which allocates a new inode table. Usually the top xlators in
-the graph such as protocol/server (for bricks), fuse and nfs (for fuse and nfs
-mounts) and libgfapi do inode managements. Hence they are the ones which will
-allocate a new inode table by calling the above function.
-
-Each xlator graph in glusterfs maintains an inode table. So in fuse clients,
-whenever there is a graph change due to add brick/remove brick or
-addition/removal of some other xlators, a new graph is created which creates a
-new inode table.
-
-Thus an allocated inode table is destroyed only when the filesystem daemon is
-killed or unmounted.
-
-```
-
-#what it contains.
-```
-
-Inode table in glusterfs mainly contains a hash table for maintaining inodes.
-In general a file/directory is considered to be existing if there is a
-corresponding inode present in the inode table. If a inode for a file/directory
-cannot be found in the inode table, glusterfs tries to resolve it by sending a
-lookup on the entry for which the inode is needed. If lookup is successful, then
-a new inode correponding to the entry is added to the hash table present in the
-inode table. Thus an inode present in the hash-table means, its an existing
-file/directory within the filesystem. The inode table also contains the hash
-size of the hash table (as of now it is hard coded to 14057. The hash value of
-a inode is calculated using its gfid).
-
-Apart from the hash table, inode table also maintains 3 important list of inodes
-1) Active list:
-Active list contains all the active inodes (i.e inodes which are currently part
-of some fop).
-2) Lru list:
-Least recently used inodes list. A limit can be set for the size of the lru
-list. For bricks it is 16384 and for clients it is infinity.
-3) Purge list:
-List of all the inodes which have to be purged (i.e inodes which have to be
-deleted from the inode table due to unlink/rmdir/forget).
-
-And at last it also contains the mem-pool for allocating inodes, dentries so
-that frequent malloc/calloc and free of the data structures can be avoided.
-```
-
-#Data structure (inode)
-```
-struct _inode {
-        inode_table_t       *table;         /* the table this inode belongs to */
-        uuid_t               gfid;          /* unique identifier of the inode */
-        gf_lock_t            lock;
-        uint64_t             nlookup;
-        uint32_t             fd_count;      /* Open fd count */
-        uint32_t             ref;           /* reference count on this inode */
-        ia_type_t            ia_type;       /* what kind of file */
-        struct list_head     fd_list;       /* list of open files on this inode */
-        struct list_head     dentry_list;   /* list of directory entries for this inode */
-        struct list_head     hash;          /* hash table pointers */
-        struct list_head     list;          /* active/lru/purge */
-
-        struct _inode_ctx   *_ctx;          /* place holder for keeping the
-                                               information about the inode by different xlators */
-};
-
-As said above, inodes are internal way of identifying the files/directories. A
-inode uniquely represents a file/directory. A new inode is created whenever a
-create/mkdir/symlink/mknod operations are performed. Apart from that a new inode
-is created upon the successful fresh lookup of a file/directory. Say the
-filesystem contained some file "a" within root and the filesystem was
-unmounted. Now when glusterfs is mounted and some operation is perfomed on "/a",
-glusterfs tries to get the inode for the entry "a" with parent inode as
-root. But, since glusterfs just came up, it will not be able to find the inode
-for "a" and will send a lookup on "/a". If the lookup operation succeeds (i.e.
-the root of glusterfs contains an entry called "a"), then a new inode for "/a"
-is created and added to the inode table.
-
-Depending upon the situation, an inode can be in one of the 3 lists maintained
-by the inode table. If some fop is happening on the inode, then the inode will
-be present in the active inodes list maintained by the inode table. Active
-inodes are those inodes whose refcount is greater than zero. Whenever some
-operation comes on a file/directory, and the resolver tries to find the inode
-for it, it increments the refcount of the inode before returning the inode. The
-refcount of an inode can be incremented by calling the below function
-
-inode_ref (inode_t *inode)
-
-Any xlator which wants to operate on a inode as part of some fop (or wants the
-inode in the callback), should hold a ref on the inode.
-Once the fop is completed before sending the reply of the fop to the above
-layers , the inode has to be unrefed. When the refcount of an inode becomes
-zero, it is removed from the active inodes list and put into LRU list maintained
-by the inode table. Thus in short if some fop is happening on a file/directory,
-the corresponding inode will be in the active list or it will be in the LRU
-list.
-```
-
-#Life Cycle
-
-A new inode is created whenever a new file/directory/symlink is created OR a
-successful lookup of an existing entry is done. The xlators which does inode
-management (as of now protocol/server, fuse, nfs, gfapi) will perform inode_link
-operation upon successful lookup or successful creation of a new entry.
-
-inode_link (inode_t *inode, inode_t *parent, const char *name,
-            struct iatt *buf);
-
-inode_link actually adds the inode to the inode table (to be precise it adds
-the inode to the hash table maintained by the inode table. The hash value is
-calculated based on the gfid). Copies the gfid to the inode (the gfid is
-present in the iatt structure). Creates a dentry with the new name.
-
-A inode is removed from the inode table and eventually destroyed when unlink
-or rmdir operation is performed on a file/directory, or the the lru limit of
-the inode table has been exceeded.
-
-#Data structure (dentry)
-```
-
-struct _dentry {
-        struct list_head   inode_list;   /* list of dentries of inode */
-        struct list_head   hash;         /* hash table pointers */
-        inode_t           *inode;        /* inode of this directory entry */
-        char              *name;         /* name of the directory entry */
-        inode_t           *parent;       /* directory of the entry */
-};
-
-A dentry is the presence of an entry for a file/directory within its parent
-directory. A dentry usually points to the inode to which it belongs to. In
-glusterfs a dentry contains the following fields.
-1) a hook using which it can add itself to the list of
-the dentries maintained by the inode to which it points to.
-2) A hash table pointer.
-3) Pointer to the inode to which it belongs to.
-4) Name of the dentry
-5) Pointer to the inode of the parent directory in which the dentry is present
-
-A new dentry is created when a new file/directory/symlink is created or a hard
-link to an existing file is created.
-
-__dentry_create (inode_t *inode, inode_t *parent, const char *name);
-
-A dentry holds a refcount on the parent
-directory so that the parent inode is never removed from the active inode's list
-and put to the lru list (If the lru limit of the lru list is exceeded, there is
-a chance of parent inode being destroyed. To avoid it, the dentries hold a
-reference to the parent inode). A dentry is removed whenevern a unlink/rmdir
-is perfomed on a file/directory. Or when the lru limit has been exceeded, the
-oldest inodes are purged out of the inode table, during which all the dentries
-of the inode are removed.
-
-Whenever a unlink/rmdir comes on a file/directory, the corresponding inode
-should be removed from the inode table. So upon unlink/rmdir, the inode will
-be moved to the purge list maintained by the inode table and from there it is
-destroyed. To be more specific, if a inode has to be destroyed, its refcount
-and nlookup count both should become 0. For refcount to become 0, the inode
-should not be part of any fop (there should not be any open fds). Or if the
-inode belongs to a directory, then there should not be any fop happening on the
-directory and it should not contain any dentries within it. For nlookup count to
-become zero, a forget has to be sent on the inode with nlookup count set to 0 as
-an argument. For fuse clients, forget is sent by the kernel itself whenever a
-unlink/rmdir is performed. But for brick processes, upon unlink/rmdir, the
-protocol/server itself has to do inode_forget. Whenever the inode has to be
-deleted due to file removal or lru limit being exceeded the inode is retired
-(i.e. all the dentries of the inode are deleted and the inode is moved to the
-purge list maintained by the inode table), the nlookup count is set to 0 via
-inode_forget api. The inode table, then prunes all the inodes from the purge
-list by destroying the inode contexts maintained by each xlator.
-
-unlinking of the dentry is done via inode_unlink;
-
-void
-inode_unlink (inode_t *inode, inode_t *parent, const char *name);
-
-If the inode has multiple hard links, then the unlink operation performed by
-the application results just in the removal of the dentry with the name provided
-by the application. For the inode to be removed, all the dentries of the inode
-should be unlinked.
-```
-
diff --git a/doc/hacker-guide/en-US/markdown/posix.md b/doc/hacker-guide/en-US/markdown/posix.md
deleted file mode 100644
index 84c813e55a2..00000000000
--- a/doc/hacker-guide/en-US/markdown/posix.md
+++ /dev/null
@@ -1,59 +0,0 @@
-storage/posix translator
-========================
-
-Notes
------
-
-### `SET_FS_ID`
-
-This is so that all filesystem checks are done with the user's
-uid/gid and not GlusterFS's uid/gid.
-
-### `MAKE_REAL_PATH`
-
-This macro concatenates the base directory of the posix volume
-('option directory') with the given path.
-
-### `need_xattr` in lookup
-
-If this flag is passed, lookup returns a xattr dictionary that contains
-the file's create time, the file's contents, and the version number
-of the file.
-
-This is a hack to increase small file performance. If an application
-wants to read a small file, it can finish its job with just a lookup
-call instead of a lookup followed by read.
-
-### `getdents`/`setdents`
-
-These are used by unify to set and get directory entries.
-
-### `ALIGN_BUF`
-
-Macro to align an address to a page boundary (4K).
-
-### `priv->export_statfs`
-
-In some cases, two exported volumes may reside on the same
-partition on the server. Sending statvfs info for both
-the volumes will lead to erroneous df output at the client,
-since free space on the partition will be counted twice.
-
-In such cases, user can disable exporting statvfs info
-on one of the volumes by setting this option.
-
-### `xattrop`
-
-This fop is used by replicate to set version numbers on files.
-
-### `getxattr`/`setxattr` hack to read/write files
-
-A key, `GLUSTERFS_FILE_CONTENT_STRING`, is handled in a special way by
-`getxattr`/`setxattr`. A getxattr with the key will return the entire
-content of the file as the value. A `setxattr` with the key will write
-the value as the entire content of the file.
-
-### `posix_checksum`
-
-This calculates a simple XOR checksum on all entry names in a
-directory that is used by unify to compare directory contents.
diff --git a/doc/hacker-guide/en-US/markdown/translator-development.md b/doc/hacker-guide/en-US/markdown/translator-development.md
deleted file mode 100644
index edadd5150dc..00000000000
--- a/doc/hacker-guide/en-US/markdown/translator-development.md
+++ /dev/null
@@ -1,666 +0,0 @@
-Translator development
-======================
-
-Setting the Stage
------------------
-
-This is the first post in a series that will explain some of the details of
-writing a GlusterFS translator, using some actual code to illustrate.
- -Before we begin, a word about environments. GlusterFS is over 300K lines of -code spread across a few hundred files. That's no Linux kernel or anything, but - you're still going to be navigating through a lot of code in every -code-editing session, so some kind of cross-referencing is *essential*. I use -cscope with the vim bindings, and if I couldn't do Ctrl+G and such to jump -between definitions all the time my productivity would be cut in half. You may -prefer different tools, but as I go through these examples you'll need -something functionally similar to follow on. OK, on with the show. - -The first thing you need to know is that translators are not just bags of -functions and variables. They need to have a very definite internal structure -so that the translator-loading code can figure out where all the pieces are. -The way it does this is to use dlsym to look for specific names within your -shared-object file, as follows (from `xlator.c`): - -``` -if (!(xl->fops = dlsym (handle, "fops"))) { - gf_log ("xlator", GF_LOG_WARNING, "dlsym(fops) on %s", - dlerror ()); - goto out; -} - -if (!(xl->cbks = dlsym (handle, "cbks"))) { - gf_log ("xlator", GF_LOG_WARNING, "dlsym(cbks) on %s", - dlerror ()); - goto out; -} - -if (!(xl->init = dlsym (handle, "init"))) { - gf_log ("xlator", GF_LOG_WARNING, "dlsym(init) on %s", - dlerror ()); - goto out; -} - -if (!(xl->fini = dlsym (handle, "fini"))) { - gf_log ("xlator", GF_LOG_WARNING, "dlsym(fini) on %s", - dlerror ()); - goto out; -} -``` - -In this example, `xl` is a pointer to the in-memory object for the translator -we're loading. As you can see, it's looking up various symbols *by name* in the - shared object it just loaded, and storing pointers to those symbols. Some of -them (e.g. `init`) are functions, while others (e.g. `fops`) are dispatch tables -containing pointers to many functions. Together, these make up the translator's - public interface. 
- -Most of this glue or boilerplate can easily be found at the bottom of one of -the source files that make up each translator. We're going to use the `rot-13` -translator just for fun, so in this case you'd look in `rot-13.c` to see this: - -``` -struct xlator_fops fops = { - .readv = rot13_readv, - .writev = rot13_writev -}; - -struct xlator_cbks cbks = { -}; - -struct volume_options options[] = { -{ .key = {"encrypt-write"}, - .type = GF_OPTION_TYPE_BOOL -}, -{ .key = {"decrypt-read"}, - .type = GF_OPTION_TYPE_BOOL -}, -{ .key = {NULL} }, -}; -``` - -The `fops` table, defined in `xlator.h`, is one of the most important pieces. -This table contains a pointer to each of the filesystem functions that your -translator might implement -- `open`, `read`, `stat`, `chmod`, and so on. There -are 82 such functions in all, but don't worry; any that you don't specify here -will be seen as null and filled with defaults from `defaults.c` when your -translator is loaded. In this particular example, since `rot-13` is an -exceptionally simple translator, we only fill in two entries for `readv` and -`writev`. - -There are actually two other tables, also required to have predefined names, -that are also used to find translator functions: `cbks` (which is empty in this - snippet) and `dumpops` (which is missing entirely). The first of these specifies - entry points for when inodes are forgotten or file descriptors are released. -In other words, they're destructors for objects in which your translator might - have an interest. Mostly you can ignore them, because the default behavior -handles even the simpler cases of translator-specific inode/fd context -automatically. However, if the context you attach is a complex structure -requiring complex cleanup, you'll need to supply these functions. As for -dumpops, that's just used if you want to provide functions to pretty-print -various structures in logs. I've never used it myself, though I probably -should. 
What's noteworthy here is that we don't even define dumpops. That's -because all of the functions that might use these dispatch functions will check - for `xl->dumpops` being `NULL` before calling through it. This is in sharp -contrast to the behavior for `fops` and `cbks`, which *must* be present. If -they're not, translator loading will fail because these pointers are not -checked every time and if they're `NULL` then we'll segfault. That's why we -provide an empty definition for cbks; it's OK for the individual function -pointers to be NULL, but not for the whole table to be absent. - -The last piece I'll cover today is options. As you can see, this is a table of -translator-specific option names and some information about their types. -GlusterFS actually provides a pretty rich set of types (`volume_option_type_t` -in `options.h`) which includes paths, translator names, percentages, and times -in addition to the obvious integers and strings. Also, the `volume_option_t` -structure can include information about alternate names, min/max/default -values, enumerated string values, and descriptions. We don't see any of these -here, so let's take a quick look at some more complex examples from afr.c and -then come back to `rot-13`. - -``` -{ .key = {"data-self-heal-algorithm"}, - .type = GF_OPTION_TYPE_STR, - .default_value = "", - .description = "Select between \"full\", \"diff\". The " - "\"full\" algorithm copies the entire file from " - "source to sink. The \"diff\" algorithm copies to " - "sink only those blocks whose checksums don't match " - "with those of source.", - .value = { "diff", "full", "" } -}, -{ .key = {"data-self-heal-window-size"}, - .type = GF_OPTION_TYPE_INT, - .min = 1, - .max = 1024, - .default_value = "1", - .description = "Maximum number blocks per file for which " - "self-heal process would be applied simultaneously." 
-}, -``` - -When your translator is loaded, all of this information is used to parse the -options actually provided in the volfile, and then the result is turned into a -dictionary and stored as `xl->options`. This dictionary is then processed by -your init function, which you can see being looked up in the first code -fragment above. We're only going to look at a small part of `rot-13`'s -init for now. - -``` -priv->decrypt_read = 1; -priv->encrypt_write = 1; - -data = dict_get (this->options, "encrypt-write"); -if (data) { - if (gf_string2boolean (data->data, &priv->encrypt_write) - == -1) { - gf_log (this->name, GF_LOG_ERROR, - "encrypt-write takes only boolean options"); - return -1; - } -} -``` - -What we can see here is that we're setting some defaults in our priv structure, -then looking to see if an `encrypt-write` option was actually provided. If so, -we convert and store it. This is a pretty classic use of dict_get to fetch a -field from a dictionary, and of using one of many conversion functions in -`common-utils.c` to convert `data->data` into something we can use. - -So far we've covered the basics of how a translator gets loaded, how we find its -various parts, and how we process its options. In my next Translator 101 post, -we'll go a little deeper into other things that init and its companion fini -might do, and how some other fields in our `xlator_t` structure (commonly -referred to as this) are commonly used. - -`init`, `fini`, and private context ------------------------------------ - -In the previous Translator 101 post, we looked at some of the dispatch tables -and options processing in a translator. This time we're going to cover the rest - of the "shell" of a translator -- i.e. the other global parts not specific to -handling a particular request. - -Let's start by looking at the relationship between a translator and its shared -library. 
At a first approximation, this is the relationship between an object -and a class in just about any object-oriented programming language. The class -defines behaviors, but has to be instantiated as an object to have any kind of -existence. In our case the object is an `xlator_t`. Several of these might be -created within the same daemon, sharing all of the same code through init/fini -and dispatch tables, but sharing *no data*. You could implement shared data (as - static variables in your shared libraries) but that's strongly discouraged. -Every function in your shared library will get an `xlator_t` as an argument, -and should use it. This lack of class-level data is one of the points where -the analogy to common OOP systems starts to break down. Another place is the -complete lack of inheritance. Translators inherit behavior (code) from exactly -one shared library -- looked up and loaded using the `type` field in a volfile -`volume ... end-volume` block -- and that's it -- not even single inheritance, -no subclasses or superclasses, no mixins or prototypes, just the relationship -between an object and its class. With that in mind, let's turn to the init -function that we just barely touched on last time. - -``` -int32_t -init (xlator_t *this) -{ - data_t *data = NULL; - rot_13_private_t *priv = NULL; - - if (!this->children || this->children->next) { - gf_log ("rot13", GF_LOG_ERROR, - "FATAL: rot13 should have exactly one child"); - return -1; - } - - if (!this->parents) { - gf_log (this->name, GF_LOG_WARNING, - "dangling volume. check volfile "); - } - - priv = GF_CALLOC (sizeof (rot_13_private_t), 1, 0); - if (!priv) - return -1; -``` - -At the very top, we see the function signature -- we get a pointer to the -`xlator_t` object that we're initializing, and we return an `int32_t` status. -As with most functions in the translator API, this should be zero to indicate -success. 
In this case it's safe to return -1 for failure, but watch out: in -dispatch-table functions, the return value means the status of the *function -call* rather than the *request*. A request error should be reflected as a -callback with a non-zero `op_ret` value, but the dispatch function itself -should still return zero. In fact, the handling of a non-zero return from a -dispatch function is not all that robust (we recently had a bug report in -HekaFS related to this) so it's something you should probably avoid -altogether. This only underscores the difference between dispatch functions -and `init`/`fini` functions, where non-zero returns *are* expected and handled -logically by aborting the translator setup. We can see that down at the -bottom, where we return -1 to indicate that we couldn't allocate our -private-data area (more about that later). - -The first thing this init function does is check that the translator is being -set up in the right kind of environment. Translators are called by parents and -in turn call children. Some translators are "initial" translators that inject -requests into the system from elsewhere -- e.g. mount/fuse injecting requests -from the kernel, protocol/server injecting requests from the network. Those -translators don't need parents, but `rot-13` does and so we check for that. -Similarly, some translators are "final" translators that (from the perspective -of the current process) terminate requests instead of passing them on -- e.g. -`protocol/client` passing them to another node, `storage/posix` passing them to -a local filesystem. Other translators "multiplex" between multiple children -- - passing each parent request on to one (`cluster/dht`), some -(`cluster/stripe`), or all (`cluster/afr`) of those children. `rot-13` fits -into none of those categories either, so it checks that it has *exactly one* -child. 
It might be more convenient or robust if translator shared libraries -had standard variables describing these requirements, to be checked in a -consistent way by the translator-loading infrastructure itself instead of by -each separate init function, but this is the way translators work today. - -The last thing we see in this fragment is allocating our private data area. -This can literally be anything we want; the infrastructure just provides the -priv pointer as a convenience but takes no responsibility for how it's used. In - this case we're using `GF_CALLOC` to allocate our own `rot_13_private_t` -structure. This gets us all the benefits of GlusterFS's memory-leak detection -infrastructure, but the way we're calling it is not quite ideal. For one thing, - the first two arguments -- from `calloc(3)` -- are kind of reversed. For -another, notice how the last argument is zero. That can actually be an -enumerated value, to tell the GlusterFS allocator *what* type we're -allocating. This can be very useful information for memory profiling and leak -detection, so it's recommended that you follow the example of any -`xxx-mem-types.h` file elsewhere in the source tree instead of just passing -zero here (even though that works). - -To finish our tour of standard initialization/termination, let's look at the -end of `init` and the beginning of `fini`: - -``` - this->private = priv; - gf_log ("rot13", GF_LOG_DEBUG, "rot13 xlator loaded"); - return 0; -} - -void -fini (xlator_t *this) -{ - rot_13_private_t *priv = this->private; - - if (!priv) - return; - this->private = NULL; - GF_FREE (priv); -``` - -At the end of init we're just storing our private-data pointer in the `private` -field of our `xlator_t`, then returning zero to indicate that initialization -succeeded. As is usually the case, our fini is even simpler. All it really has -to do is `GF_FREE` our private-data pointer, which we do in a slightly -roundabout way here. 
Notice how we don't even have a return value here, since -there's nothing obvious and useful that the infrastructure could do if `fini` -failed. - -That's practically everything we need to know to get our translator through -loading, initialization, options processing, and termination. If we had defined - no dispatch functions, we could actually configure a daemon to use our -translator and it would work as a basic pass-through from its parent to a -single child. In the next post I'll cover how to build the translator and -configure a daemon to use it, so that we can actually step through it in a -debugger and see how it all fits together before we actually start adding -functionality. - -This Time For Real ------------------- - -In the first two parts of this series, we learned how to write a basic -translator skeleton that can get through loading, initialization, and option -processing. This time we'll cover how to build that translator, configure a -volume to use it, and run the glusterfs daemon in debug mode. - -Unfortunately, there's not much direct support for writing new translators. You -can check out a GlusterFS tree and splice in your own translator directory, but - that's a bit painful because you'll have to update multiple makefiles plus a -bunch of autoconf garbage. As part of the HekaFS project, I basically reverse -engineered the truly necessary parts of the translator-building process and -then pestered one of the Fedora glusterfs package maintainers (thanks -daMaestro!) to add a `glusterfs-devel` package with the required headers. Since - then the complexity level in the HekaFS tree has crept back up a bit, but I -still remember the simple method and still consider it the easiest way to get -started on a new translator. For the sake of those not using Fedora, I'm going -to describe a method that doesn't depend on that header package. 
What it does -depend on is a GlusterFS source tree, much as you might have cloned from GitHub - or the Gluster review site. This tree doesn't have to be fully built, but you -do need to run `autogen.sh` and configure in it. Then you can take the -following simple makefile and put it in a directory with your actual source. - -``` -# Change these to match your source code. -TARGET = rot-13.so -OBJECTS = rot-13.o - -# Change these to match your environment. -GLFS_SRC = /srv/glusterfs -GLFS_LIB = /usr/lib64 -HOST_OS = GF_LINUX_HOST_OS - -# You shouldn't need to change anything below here. - -CFLAGS = -fPIC -Wall -O0 -g \ - -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE \ - -D$(HOST_OS) -I$(GLFS_SRC) -I$(GLFS_SRC)/contrib/uuid \ - -I$(GLFS_SRC)/libglusterfs/src -LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB) -LIBS = -lglusterfs -lpthread - -$(TARGET): $(OBJECTS) - $(CC) $(OBJECTS) $(LDFLAGS) -o $(TARGET) $(OBJECTS) $(LIBS) -``` - -Yes, it's still Linux-specific. Mea culpa. As you can see, we're sticking with -the `rot-13` example, so you can just copy the files from -`xlators/encryption/rot-13/src` in your GlusterFS tree to follow on. Type -`make` and you should be rewarded with a nice little `.so` file. - -``` -xlator_example$ ls -l rot-13.so --rwxr-xr-x. 1 jeff jeff 40784 Nov 16 16:41 rot-13.so -``` - -Notice that we've built with optimization level zero and debugging symbols -included, which would not typically be the case for a packaged version of -GlusterFS. Let's put our version of `rot-13.so` into a slightly different file -on our system, so that it doesn't stomp on the installed version (not that -you'd ever want to use that anyway). 
- -``` -xlator_example# ls /usr/lib64/glusterfs/3git/xlator/encryption/ -crypt.so crypt.so.0 crypt.so.0.0.0 rot-13.so rot-13.so.0 -rot-13.so.0.0.0 -xlator_example# cp rot-13.so \ - /usr/lib64/glusterfs/3git/xlator/encryption/my-rot-13.so -``` - -These paths represent the current Gluster filesystem layout, which is likely to -be deprecated in favor of the Fedora layout; your paths may vary. At this point - we're ready to configure a volume using our new translator. To do that, I'm -going to suggest something that's strongly discouraged except during -development (the Gluster guys are going to hate me for this): write our own -volfile. Here's just about the simplest volfile you'll ever see. - -``` -volume my-posix - type storage/posix - option directory /srv/export -end-volume - -volume my-rot13 - type encryption/my-rot-13 - subvolumes my-posix -end-volume -``` - -All we have here is a basic brick using `/srv/export` for its data, and then -an instance of our translator layered on top -- no client or server is -necessary for what we're doing, and the system will automatically push a -mount/fuse translator on top if there's no server translator. To try this out, -all we need is the following command (assuming the directories involved already - exist). - -``` -xlator_example$ glusterfs --debug -f my.vol /srv/import -``` - -You should be rewarded with a whole lot of log output, including the text of -the volfile (this is very useful for debugging problems in the field). If you -go to another window on the same machine, you can see that you have a new -filesystem mounted. - -``` -~$ df /srv/import -Filesystem 1K-blocks Used Available Use% Mounted on -/srv/xlator_example/my.vol - 114506240 2706176 105983488 3% /srv/import -``` - -Just for fun, write something into a file in `/srv/import`, then look at the -corresponding file in `/srv/export` to see it all `rot-13`'ed for you. 
- -``` -~$ echo hello > /srv/import/a_file -~$ cat /srv/export/a_file -uryyb -``` - -There you have it -- functionality you control, implemented easily, layered on -top of local storage. Now you could start adding functionality -- real -encryption, perhaps -- and inevitably having to debug it. You could do that the - old-school way, with `gf_log` (preferred) or even plain old `printf`, or you -could run daemons under `gdb` instead. Alternatively, you could wait for the -next Translator 101 post, where we'll be doing exactly that. - -Debugging a Translator ----------------------- - -Now that we've learned what a translator looks like and how to build one, it's -time to run one and actually watch it work. The best way to do this is good -old-fashioned `gdb`, as follows (using some of the examples from last time). - -``` -xlator_example# gdb glusterfs -GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6) -... -(gdb) r --debug -f my.vol /srv/import -Starting program: /usr/sbin/glusterfs --debug -f my.vol /srv/import -... -[2011-11-23 11:23:16.495516] I [fuse-bridge.c:2971:fuse_init] - 0-glusterfs-fuse: FUSE inited with protocol versions: - glusterfs 7.13 kernel 7.13 -``` - -If you get to this point, your glusterfs client process is already running. You -can go to another window to see the mountpoint, do file operations, etc. - -``` -~# df /srv/import -Filesystem 1K-blocks Used Available Use% Mounted on -/root/xlator_example/my.vol - 114506240 2643968 106045568 3% /srv/import -~# ls /srv/import -a_file -~# cat /srv/import/a_file -hello -``` - -Now let's interrupt the process and see where we are. - -``` -^C -Program received signal SIGINT, Interrupt. 
-0x0000003a0060b3dc in pthread_cond_wait@@GLIBC_2.3.2 () - from /lib64/libpthread.so.0 -(gdb) info threads - 5 Thread 0x7fffeffff700 (LWP 27206) 0x0000003a002dd8c7 - in readv () - from /lib64/libc.so.6 - 4 Thread 0x7ffff50e3700 (LWP 27205) 0x0000003a0060b75b - in pthread_cond_timedwait@@GLIBC_2.3.2 () - from /lib64/libpthread.so.0 - 3 Thread 0x7ffff5f02700 (LWP 27204) 0x0000003a0060b3dc - in pthread_cond_wait@@GLIBC_2.3.2 () - from /lib64/libpthread.so.0 - 2 Thread 0x7ffff6903700 (LWP 27203) 0x0000003a0060f245 - in sigwait () - from /lib64/libpthread.so.0 -* 1 Thread 0x7ffff7957700 (LWP 27196) 0x0000003a0060b3dc - in pthread_cond_wait@@GLIBC_2.3.2 () - from /lib64/libpthread.so.0 -``` - -Like any non-toy server, this one has multiple threads. What are they all -doing? Honestly, even I don't know. Thread 1 turns out to be in -`event_dispatch_epoll`, which means it's the one handling all of our network -I/O. Note that with socket multi-threading patch this will change, with one -thread in `socket_poller` per connection. Thread 2 is in `glusterfs_sigwaiter` -which means signals will be isolated to that thread. Thread 3 is in -`syncenv_task`, so it's a worker process for synchronous requests such as -those used by the rebalance and repair code. Thread 4 is in -`janitor_get_next_fd`, so it's waiting for a chance to close no-longer-needed -file descriptors on the local filesystem. (I admit I had to look that one up, -BTW.) Lastly, thread 5 is in `fuse_thread_proc`, so it's the one fetching -requests from our FUSE interface. You'll often see many more threads than -this, but it's a pretty good basic set. Now, let's set a breakpoint so we can -actually watch a request. - -``` -(gdb) b rot13_writev -Breakpoint 1 at 0x7ffff50e4f0b: file rot-13.c, line 119. -(gdb) c -Continuing. -``` - -At this point we go into our other window and do something that will involve a write. 
- -``` -~# echo goodbye > /srv/import/another_file -(back to the first window) -[Switching to Thread 0x7fffeffff700 (LWP 27206)] - -Breakpoint 1, rot13_writev (frame=0x7ffff6e4402c, this=0x638440, - fd=0x7ffff409802c, vector=0x7fffe8000cd8, count=1, offset=0, - iobref=0x7fffe8001070) at rot-13.c:119 -119 rot_13_private_t *priv = (rot_13_private_t *)this->private; -``` - -Remember how we built with debugging symbols enabled and no optimization? That -will be pretty important for the next few steps. As you can see, we're in -`rot13_writev`, with several parameters. - -* `frame` is our always-present frame pointer for this request. Also, - `frame->local` will point to any local data we created and attached to the - request ourselves. -* `this` is a pointer to our instance of the `rot-13` translator. You can examine - it if you like to see the name, type, options, parent/children, inode table, - and other stuff associated with it. -* `fd` is a pointer to a file-descriptor *object* (`fd_t`, not just a - file-descriptor index which is what most people use "fd" for). This in turn - points to an inode object (`inode_t`) and we can associate our own - `rot-13`-specific data with either of these. -* `vector` and `count` together describe the data buffers for this write, which - we'll get to in a moment. -* `offset` is the offset into the file at which we're writing. -* `iobref` is a buffer-reference object, which is used to track the life cycle - of buffers containing read/write data. If you look closely, you'll notice that - `vector[0].iov_base` points to the same address as `iobref->iobrefs[0].ptr`, which - should give you some idea of the inter-relationships between vector and iobref. - -OK, now what about that `vector`? We can use it to examine the data being -written, like this. 
- -``` -(gdb) p vector[0] -$2 = {iov_base = 0x7ffff7936000, iov_len = 8} -(gdb) x/s 0x7ffff7936000 -0x7ffff7936000: "goodbye\n" -``` - -It's not always safe to view this data as a string, because it might just as -well be binary data, but since we're generating the write this time it's safe -and convenient. With that knowledge, let's step through things a bit. - -``` -(gdb) s -120 if (priv->encrypt_write) -(gdb) -121 rot13_iovec (vector, count); -(gdb) -rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57 -57 for (i = 0; i < count; i++) { -(gdb) -58 rot13 (vector[i].iov_base, vector[i].iov_len); -(gdb) -rot13 (buf=0x7ffff7936000 "goodbye\n", len=8) at rot-13.c:45 -45 for (i = 0; i < len; i++) { -(gdb) -46 if (buf[i] >= 'a' && buf[i] <= 'z') -(gdb) -47 buf[i] = 'a' + ((buf[i] - 'a' + 13) % 26); -``` - -Here we've stepped into `rot13_iovec`, which iterates through our vector -calling `rot13`, which in turn iterates through the characters in that chunk -doing the `rot-13` operation if/as appropriate. This is pretty straightforward -stuff, so let's skip to the next interesting bit. - -``` -(gdb) fin -Run till exit from #0 rot13 (buf=0x7ffff7936000 "goodbye\n", - len=8) at rot-13.c:47 -rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57 -57 for (i = 0; i < count; i++) { -(gdb) fin -Run till exit from #0 rot13_iovec (vector=0x7fffe8000cd8, - count=1) at rot-13.c:57 -rot13_writev (frame=0x7ffff6e4402c, this=0x638440, - fd=0x7ffff409802c, vector=0x7fffe8000cd8, count=1, - offset=0, iobref=0x7fffe8001070) at rot-13.c:123 -123 STACK_WIND (frame, -(gdb) b 129 -Breakpoint 2 at 0x7ffff50e4f35: file rot-13.c, line 129. -(gdb) b rot13_writev_cbk -Breakpoint 3 at 0x7ffff50e4db3: file rot-13.c, line 106. -(gdb) c -``` - -So we've set breakpoints on both the callback and the statement following the -`STACK_WIND`. Which one will we hit first? 
- -``` -Breakpoint 3, rot13_writev_cbk (frame=0x7ffff6e4402c, - cookie=0x7ffff6e440d8, this=0x638440, op_ret=8, op_errno=0, - prebuf=0x7fffefffeca0, postbuf=0x7fffefffec30) - at rot-13.c:106 -106 STACK_UNWIND_STRICT (writev, frame, op_ret, op_errno, - prebuf, postbuf); -(gdb) bt -#0 rot13_writev_cbk (frame=0x7ffff6e4402c, - cookie=0x7ffff6e440d8, this=0x638440, op_ret=8, op_errno=0, - prebuf=0x7fffefffeca0, postbuf=0x7fffefffec30) - at rot-13.c:106 -#1 0x00007ffff52f1b37 in posix_writev (frame=0x7ffff6e440d8, - this=<value optimized out>, fd=<value optimized out>, - vector=<value optimized out>, count=1, - offset=<value optimized out>, iobref=0x7fffe8001070) - at posix.c:2217 -#2 0x00007ffff50e513e in rot13_writev (frame=0x7ffff6e4402c, - this=0x638440, fd=0x7ffff409802c, vector=0x7fffe8000cd8, - count=1, offset=0, iobref=0x7fffe8001070) at rot-13.c:123 -``` - -Surprise! We're in `rot13_writev_cbk` now, called (indirectly) while we're -still in `rot13_writev` before `STACK_WIND` returns (still at rot-13.c:123). If - you did any request cleanup here, then you need to be careful about what you -do in the remainder of `rot13_writev` because data may have been freed etc. -It's tempting to say you should just do the cleanup in `rot13_writev` after -the `STACK_WIND`, but that's not valid because it's also possible that some -other translator returned without calling `STACK_UNWIND` -- i.e. before -`rot13_writev_cbk` is called -- so then it would be the one getting null-pointer -errors instead. To put it another way, the callback and the return from -`STACK_WIND` can occur in either order or even simultaneously on different -threads. Even if you were to use reference counts, you'd have to make sure to -use locking or atomic operations to avoid races, and it's not worth it. 
Unless -you *really* understand the possible flows of control and know what you're -doing, it's better to do cleanup in the callback and nothing after -`STACK_WIND`. - -At this point all that's left is a `STACK_UNWIND` and a return. The -`STACK_UNWIND` invokes our parent's completion callback, and in this case our -parent is FUSE so at that point the VFS layer is notified of the write being -complete. Finally, we return through several levels of normal function calls -until we come back to fuse_thread_proc, which waits for the next request. - -So that's it. For extra fun, you might want to repeat this exercise by stepping -through some other call -- stat or setxattr might be good choices -- but you'll - have to use a translator that actually implements those calls to see much -that's interesting. Then you'll pretty much know everything I knew when I -started writing my first for-real translators, and probably even a bit more. I -hope you've enjoyed this series, or at least found it useful, and if you have -any suggestions for other topics I should cover please let me know (via -comments or email, IRC or Twitter). diff --git a/doc/hacker-guide/en-US/markdown/unittest.md b/doc/hacker-guide/en-US/markdown/unittest.md deleted file mode 100644 index 5c6c0a8a039..00000000000 --- a/doc/hacker-guide/en-US/markdown/unittest.md +++ /dev/null @@ -1,228 +0,0 @@ -# Unit Tests in GlusterFS - -## Overview -[Art-of-unittesting][definitionofunittest] provides a good definition for unit tests. A good unit test is: - -* Able to be fully automated -* Has full control over all the pieces running (Use mocks or stubs to achieve this isolation when needed) -* Can be run in any order if part of many other tests -* Runs in memory (no DB or File access, for example) -* Consistently returns the same result (You always run the same test, so no random numbers, for example. 
save those for integration or range tests) -* Runs fast -* Tests a single logical concept in the system -* Readable -* Maintainable -* Trustworthy (when you see its result, you don’t need to debug the code just to be sure) - -## cmocka -GlusterFS unit test framework is based on [cmocka][]. cmocka provides -developers with methods to isolate and test modules written in C language. It -also provides integration with Jenkins by providing JUnit XML compliant unit -test results. - -cmocka - -## Running Unit Tests -To execute the unit tests, all you need is to type `make check`. Here is a step-by-step example assuming you just cloned a GlusterFS tree: - -``` -$ ./autogen.sh -$ ./configure --enable-debug -$ make check -``` - -Sample output: - -``` -PASS: mem_pool_unittest -============================================================================ -Testsuite summary for glusterfs 3git -============================================================================ -# TOTAL: 1 -# PASS: 1 -# SKIP: 0 -# XFAIL: 0 -# FAIL: 0 -# XPASS: 0 -# ERROR: 0 -============================================================================ -``` - -In this example, `mem_pool_unittest` has multiple tests inside, but `make check` assumes that the program itself is the test, and that is why it only shows one test. Here is the output when we run `mem_pool_unittest` directly: - -``` -$ ./libglusterfs/src/mem_pool_unittest -[==========] Running 10 test(s). 
-[ RUN ] test_gf_mem_acct_enable_set -Expected assertion data != ((void *)0) occurred -[ OK ] test_gf_mem_acct_enable_set -[ RUN ] test_gf_mem_set_acct_info_asserts -Expected assertion xl != ((void *)0) occurred -Expected assertion size > ((4 + sizeof (size_t) + sizeof (xlator_t *) + 4 + 8) + 8) occurred -Expected assertion type <= xl->mem_acct.num_types occurred -[ OK ] test_gf_mem_set_acct_info_asserts -[ RUN ] test_gf_mem_set_acct_info_memory -[ OK ] test_gf_mem_set_acct_info_memory -[ RUN ] test_gf_calloc_default_calloc -[ OK ] test_gf_calloc_default_calloc -[ RUN ] test_gf_calloc_mem_acct_enabled -[ OK ] test_gf_calloc_mem_acct_enabled -[ RUN ] test_gf_malloc_default_malloc -[ OK ] test_gf_malloc_default_malloc -[ RUN ] test_gf_malloc_mem_acct_enabled -[ OK ] test_gf_malloc_mem_acct_enabled -[ RUN ] test_gf_realloc_default_realloc -[ OK ] test_gf_realloc_default_realloc -[ RUN ] test_gf_realloc_mem_acct_enabled -[ OK ] test_gf_realloc_mem_acct_enabled -[ RUN ] test_gf_realloc_ptr -Expected assertion ((void *)0) != ptr occurred -[ OK ] test_gf_realloc_ptr -[==========] 10 test(s) run. -[ PASSED ] 10 test(s). -[ FAILED ] 0 test(s). -[ REPORT ] Created libglusterfs_mem_pool_xunit.xml report -``` - - -## Writing Unit Tests - -### Enhancing your C functions - -#### Programming by Contract -Add the following to your C file: - -```c -#include <cmocka_pbc.h> -``` - -```c -/* - * Programming by Contract is a programming methodology - * which binds the caller and the function called to a - * contract. The contract is represented using Hoare Triple: - * {P} C {Q} - * where {P} is the precondition before executing command C, - * and {Q} is the postcondition. 
- *
- * See also:
- * http://en.wikipedia.org/wiki/Design_by_contract
- * http://en.wikipedia.org/wiki/Hoare_logic
- * http://dlang.org/dbc.html
- */
-#ifndef CMOCKERY_PBC_H_
-#define CMOCKERY_PBC_H_
-
-#if defined(UNIT_TESTING) || defined (DEBUG)
-
-#include <assert.h>
-
-/*
- * Checks caller responsibility against contract
- */
-#define REQUIRE(cond) assert(cond)
-
-/*
- * Checks function responsibility against contract.
- */
-#define ENSURE(cond) assert(cond)
-
-/*
- * While REQUIRE and ENSURE apply to functions, INVARIANT
- * applies to classes/structs. It ensures that instances
- * of the class/struct are consistent. In other words,
- * that the instance has not been corrupted.
- */
-#define INVARIANT(invariant_fnc) do{ (invariant_fnc) } while (0);
-
-#else
-#define REQUIRE(cond) do { } while (0);
-#define ENSURE(cond) do { } while (0);
-#define INVARIANT(invariant_fnc) do{ } while (0);
-
-#endif /* defined(UNIT_TESTING) || defined (DEBUG) */
-#endif /* CMOCKERY_PBC_H_ */
-```
-
-##### Example
-This is an _extremely_ simple example:
-
-```c
-int divide (int n, int d)
-{
-    int ans;
-
-    REQUIRE(d != 0);
-
-    ans = n / d;
-
-    // As code is added to this function throughout its lifetime,
-    // ENSURE will assert that data will be returned
-    // according to the contract. Again this is an
-    // extremely simple example. :-D
-    ENSURE( ans == (n / d) );
-
-    return ans;
-}
-
-```
-
-##### Important Note
-`REQUIRE`, `ENSURE`, and `INVARIANT` are only available when `DEBUG` or `UNIT_TESTING` is set in the CFLAGS. You must pass `--enable-debug` to `./configure` to enable PBC on your non-unittest builds.
-
-#### Overriding functions
-Cmockery2 provides its own memory allocation functions which check for buffer overrun and memory leaks. The following header file must be included **last** to be able to override any of the memory allocation functions:
-
-```c
-#include <cmocka.h>
-```
-
-This file will only take effect when the `UNIT_TESTING` CFLAG is set.
-
-### Creating a unit test
-Once you identify the C file you would like to test, first create a `unittest` directory under the directory where the C file is located. This will isolate the unittests to a different directory.
-
-Next, you need to edit the `Makefile.am` file in the directory where your C file is located. Initialize the
-`Makefile.am` if it does not already have the following sections:
-
-```
-#### UNIT TESTS #####
-CLEANFILES += *.gcda *.gcno *_xunit.xml
-noinst_PROGRAMS =
-TESTS =
-```
-
-Now you can add the following for each of the unit tests that you would like to build:
-
-```
-### UNIT TEST xxx_unittest ###
-xxx_unittest_CPPFLAGS = $(xxx_CPPFLAGS)
-xxx_unittest_SOURCES = xxx.c \
-                       unittest/xxx_unittest.c
-xxx_unittest_CFLAGS = $(UNITTEST_CFLAGS)
-xxx_unittest_LDFLAGS = $(UNITTEST_LDFLAGS)
-noinst_PROGRAMS += xxx_unittest
-TESTS += xxx_unittest
-```
-
-Where `xxx` is the name of your C file. For example, look at `libglusterfs/src/Makefile.am`.
-
-Copy the simple unit test from the [cmocka API][cmockaapi] to `unittest/xxx_unittest.c`. If you would like to see an example of a unit test, please refer to `libglusterfs/src/unittest/mem_pool_unittest.c`.
-
-#### Mocking
-You may see that the linker will complain about missing functions needed by the C file you would like to test. Identify the required functions, then place their stubs in a file called `unittest/xxx_mock.c`, then include this file in `Makefile.am` in `xxx_unittest_SOURCES`. This will allow you to use Cmockery2's mocking functions.
-
-#### Running the unit test
-You can type `make` in the directory where the C file is located. Once it builds without errors, you can execute the test either by directly executing the program (in our example above it is called `xxx_unittest`), or by running `make check`.
-
-#### Debugging
-Sometimes you may need to debug your unit test. To do that, you will have to point `gdb` to the binary which is located in the same directory as the source.
For example, you can do the following from the root of the source tree to debug `mem_pool_unittest`:
-
-```
-$ gdb libglusterfs/src/mem_pool_unittest
-```
-
-
-[cmocka]: https://cmocka.org
-[definitionofunittest]: http://artofunittesting.com/definition-of-a-unit-test/
-[cmockaapi]: https://api.cmocka.org
diff --git a/doc/hacker-guide/en-US/markdown/write-behind.md b/doc/hacker-guide/en-US/markdown/write-behind.md
deleted file mode 100644
index 0d78964fa20..00000000000
--- a/doc/hacker-guide/en-US/markdown/write-behind.md
+++ /dev/null
@@ -1,56 +0,0 @@
-performance/write-behind translator
-===================================
-
-Basic working
---------------
-
-Write-behind is basically a translator that lies to the application, reporting
-write requests as finished even before they are actually finished.
-
-On a regular translator tree without write-behind, control flow is like this:
-
-1. application makes a `write()` system call.
-2. VFS ==> FUSE ==> `/dev/fuse`.
-3. fuse-bridge initiates a glusterfs `writev()` call.
-4. `writev()` is `STACK_WIND()`ed up to client-protocol or storage translator.
-5. client-protocol, on receiving reply from server, starts `STACK_UNWIND()` towards the fuse-bridge.
-
-On a translator tree with write-behind, control flow is like this:
-
-1. application makes a `write()` system call.
-2. VFS ==> FUSE ==> `/dev/fuse`.
-3. fuse-bridge initiates a glusterfs `writev()` call.
-4. `writev()` is `STACK_WIND()`ed up to write-behind translator.
-5. write-behind adds the write buffer to its internal queue and does a `STACK_UNWIND()` towards the fuse-bridge.
-
-The write call is complete from the application's perspective. After
-`STACK_UNWIND()`ing towards the fuse-bridge, write-behind initiates a fresh
-writev() call to its child translator, whose replies will be consumed by
-write-behind itself. Write-behind _doesn't_ cache the write buffer, unless
-`option flush-behind on` is specified in volume specification file.
-
-Windowing
----------
-
-With respect to write-behind, each write-buffer has three flags: `stack_wound`, `write_behind` and `got_reply`.
-
-* `stack_wound`: if set, indicates that write-behind has initiated `STACK_WIND()` towards child translator.
-* `write_behind`: if set, indicates that write-behind has done `STACK_UNWIND()` towards fuse-bridge.
-* `got_reply`: if set, indicates that write-behind has received reply from child translator for a `writev()` `STACK_WIND()`. A request will be destroyed by write-behind only if this flag is set.
-
-Currently pending write requests = aggregate size of requests with write_behind = 1 and got_reply = 0.
-
-The window size limits the aggregate size of currently pending write requests. Once
-the pending requests' size has reached the window size, write-behind blocks
-writev() calls from fuse-bridge. Blocking is only from the application's
-perspective. Write-behind does `STACK_WIND()` to the child translator
-straight away, but holds back the `STACK_UNWIND()` towards fuse-bridge.
-`STACK_UNWIND()` is done only once write-behind gets enough replies to
-accommodate the currently blocked request.
-
-Flush behind
-------------
-
-If `option flush-behind on` is specified in volume specification file, then
-write-behind sends aggregate write requests to child translator, instead of
-regular per request `STACK_WIND()`s.
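Reviewer note, not part of the deleted file: the windowing rule in the removed write-behind.md can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not the translator's actual code. The struct `wb_request_sketch` and the functions `pending_bytes` and `can_unwind_early` are hypothetical names; only the three flag names come from the text above. The check "pending size plus new request size fits in the window" is one reading of "enough replies to accommodate the currently blocked request".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one queued write request.
 * Only the three flag names are taken from the document. */
struct wb_request_sketch {
    size_t size;        /* bytes carried by this write */
    bool stack_wound;   /* STACK_WIND() issued towards the child */
    bool write_behind;  /* STACK_UNWIND() already sent to fuse-bridge */
    bool got_reply;     /* child has replied to the writev() */
};

/* Aggregate size of requests with write_behind = 1 and got_reply = 0. */
static size_t pending_bytes(const struct wb_request_sketch *reqs, size_t count)
{
    size_t total = 0;

    assert(reqs != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        if (reqs[i].write_behind && !reqs[i].got_reply)
            total += reqs[i].size;
    }
    return total;
}

/* Assumed policy: a new write may be acknowledged early only if it
 * still fits inside the window next to the pending bytes. */
static bool can_unwind_early(const struct wb_request_sketch *reqs, size_t count,
                             size_t window_size, size_t new_size)
{
    return pending_bytes(reqs, count) + new_size <= window_size;
}
```

With a queue holding one acknowledged but unanswered 100-byte request and a 128-byte window, a further 28-byte write could be acknowledged immediately under this reading, while a 29-byte write would wait for a reply from the child.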