Diffstat (limited to 'doc/hacker-guide')
| -rw-r--r-- | doc/hacker-guide/Makefile.am | 8 |
| -rw-r--r-- | doc/hacker-guide/adding-fops.txt | 33 |
| -rw-r--r-- | doc/hacker-guide/bdb.txt | 70 |
| -rw-r--r-- | doc/hacker-guide/call-stub.txt | 1033 |
| -rw-r--r-- | doc/hacker-guide/en-US/markdown/adding-fops.md | 18 |
| -rw-r--r-- | doc/hacker-guide/en-US/markdown/afr.md | 191 |
| -rw-r--r-- | doc/hacker-guide/en-US/markdown/coding-standard.md | 402 |
| -rw-r--r-- | doc/hacker-guide/en-US/markdown/posix.md | 59 |
| -rw-r--r-- | doc/hacker-guide/en-US/markdown/translator-development.md | 666 |
| -rw-r--r-- | doc/hacker-guide/en-US/markdown/write-behind.md | 56 |
| -rw-r--r-- | doc/hacker-guide/hacker-guide.tex | 309 |
| -rw-r--r-- | doc/hacker-guide/lock-ahead.txt | 80 |
| -rw-r--r-- | doc/hacker-guide/posix.txt | 59 |
| -rw-r--r-- | doc/hacker-guide/replicate.txt | 206 |
| -rw-r--r-- | doc/hacker-guide/write-behind.txt | 45 |
15 files changed, 1392 insertions, 1843 deletions
diff --git a/doc/hacker-guide/Makefile.am b/doc/hacker-guide/Makefile.am
deleted file mode 100644
index 65c92ac23..000000000
--- a/doc/hacker-guide/Makefile.am
+++ /dev/null
@@ -1,8 +0,0 @@
-EXTRA_DIST = replicate.txt bdb.txt posix.txt call-stub.txt write-behind.txt
-
-#EXTRA_DIST = hacker-guide.tex afr.txt bdb.txt posix.txt call-stub.txt write-behind.txt
-#hacker_guidedir = $(docdir)
-#hacker_guide_DATA = hacker-guide.pdf
-
-#hacker-guide.pdf: $(EXTRA_DIST)
-# pdflatex $(srcdir)/hacker-guide.tex
diff --git a/doc/hacker-guide/adding-fops.txt b/doc/hacker-guide/adding-fops.txt
deleted file mode 100644
index e70dbbdc8..000000000
--- a/doc/hacker-guide/adding-fops.txt
+++ /dev/null
@@ -1,33 +0,0 @@
-        HOW TO ADD A NEW FOP TO GlusterFS
-        =================================
-
-Steps to be followed when adding a new FOP to GlusterFS:
-
-1. Edit glusterfs.h and add a GF_FOP_* constant.
-
-2. Edit xlator.[ch] and:
-   2a. add the new prototype for fop and callback.
-   2b. edit xlator_fops structure.
-
-3. Edit xlator.c and add to fill_defaults.
-
-4. Edit protocol.h and add struct necessary for the new FOP.
-
-5. Edit defaults.[ch] and provide default implementation.
-
-6. Edit call-stub.[ch] and provide stub implementation.
-
-7. Edit common-utils.c and add to gf_global_variable_init().
-
-8. Edit client-protocol and add your FOP.
-
-9. Edit server-protocol and add your FOP.
-
-10. Implement your FOP in any translator for which the default implementation
-    is not sufficient.
-
-==========================================
-Last updated: Mon Oct 27 21:35:49 IST 2008
-
-Author: Vikas Gorur <vikas@gluster.com>
-==========================================
diff --git a/doc/hacker-guide/bdb.txt b/doc/hacker-guide/bdb.txt
deleted file mode 100644
index fd0bd3652..000000000
--- a/doc/hacker-guide/bdb.txt
+++ /dev/null
@@ -1,70 +0,0 @@
-
-* How does file translates to key/value pair?
----------------------------------------------
-
-  in bdb a file is identified by key (obtained by taking basename() of the path of
-the file) and file contents are stored as value corresponding to the key in database
-file (defaults to glusterfs_storage.db under dirname() directory).
-
-* symlinks, directories
------------------------
-
-  symlinks and directories are stored as is.
-
-* db (database) files
----------------------
-
-  every directory, including root directory, contains a database file called
-glusterfs_storage.db. all the regular files contained in the directory are stored
-as key/value pair inside the glusterfs_storage.db.
-
-* internal data cache
----------------------
-
-  db does not provide a way to find out the size of the value corresponding to a key.
-so, bdb makes DB->get() call for key and takes the length of the value returned.
-since DB->get() also returns file contents for key, bdb maintains an internal cache and
-stores the file contents in the cache.
-  every directory maintains a seperate cache.
-
-* inode number transformation
------------------------------
-
-  bdb allocates a inode number to each file and directory on its own. bdb maintains a
-global counter and increments it after allocating inode number for each file
-(regular, symlink or directory). NOTE: bdb does not guarantee persistent inode numbers.
-
-* checkpoint thread
--------------------
-
-  bdb creates a checkpoint thread at the time of init(). checkpoint thread does a
-periodic checkpoint on the DB_ENV. checkpoint is the mechanism, provided by db, to
-forcefully commit the logged transactions to the storage.
-
-NOTES ABOUT FOPS:
------------------
-
-lookup() -
-  1> do lstat() on the path, if lstat fails, we assume that the file being looked up
-     is either a regular file or doesn't exist.
-  2> lookup in the DB of parent directory for key corresponding to path. if key exists,
-     return key, with.
-     NOTE: 'struct stat' stat()ed from DB file is used as a container for 'struct stat'
-           of the regular file. st_ino, st_size, st_blocks are updated with file's values.
-
-readv() -
-  1> do a lookup in bctx cache. if successful, return the requested data from cache.
-  2> if cache missed, do a DB->get() the entire file content and insert to cache.
-
-writev():
-  1> flush any cached content of this file.
-  2> do a DB->put(), with DB_DBT_PARTIAL flag.
-     NOTE: DB_DBT_PARTIAL is used to do partial update of a value in DB.
-
-readdir():
-  1> regular readdir() in a loop, and vomit all DB_ENV log files and DB files that
-     we encounter.
-  2> if the readdir() buffer still has space, open a DB cursor and do a sequential
-     DBC->get() to fill the reaadir buffer.
-
-
diff --git a/doc/hacker-guide/call-stub.txt b/doc/hacker-guide/call-stub.txt
deleted file mode 100644
index bca1579b2..000000000
--- a/doc/hacker-guide/call-stub.txt
+++ /dev/null
@@ -1,1033 +0,0 @@
-creating a call stub and pausing a call
----------------------------------------
-libglusterfs provides seperate API to pause each of the fop. parameters to each API is
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-         NOTE: @fn should exactly take the same type and number of parameters that
-               the corresponding regular fop takes.
-rest will be the regular parameters to corresponding fop.
-
-NOTE: @frame can never be NULL. fop_<operation>_stub() fails with errno
-      set to EINVAL, if @frame is NULL. also wherever @loc is applicable,
-      @loc cannot be NULL.
-
-refer to individual stub creation API to know about call-stub creation's behaviour with
-specific parameters.
-
-here is the list of stub creation APIs for xlator fops.
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-@need_xattr - flag to specify if xattr should be returned or not.
-call_stub_t *
-fop_lookup_stub (call_frame_t *frame,
-                 fop_lookup_t fn,
-                 loc_t *loc,
-                 int32_t need_xattr);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-call_stub_t *
-fop_stat_stub (call_frame_t *frame,
-               fop_stat_t fn,
-               loc_t *loc);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@fd - file descriptor parameter to lk fop.
-         NOTE: @fd is stored with a fd_ref().
-call_stub_t *
-fop_fstat_stub (call_frame_t *frame,
-                fop_fstat_t fn,
-                fd_t *fd);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to @loc->inode and
-               @loc->parent, if not NULL. also @loc->path will be copied to a different location.
-@mode - mode parameter to chmod.
-call_stub_t *
-fop_chmod_stub (call_frame_t *frame,
-                fop_chmod_t fn,
-                loc_t *loc,
-                mode_t mode);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@fd - file descriptor parameter to lk fop.
-         NOTE: @fd is stored with a fd_ref().
-@mode - mode parameter for fchmod fop.
-call_stub_t *
-fop_fchmod_stub (call_frame_t *frame,
-                 fop_fchmod_t fn,
-                 fd_t *fd,
-                 mode_t mode);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to @loc->inode and
-               @loc->parent, if not NULL. also @loc->path will be copied to a different location.
-@uid - uid parameter to chown.
-@gid - gid parameter to chown.
-call_stub_t *
-fop_chown_stub (call_frame_t *frame,
-                fop_chown_t fn,
-                loc_t *loc,
-                uid_t uid,
-                gid_t gid);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@fd - file descriptor parameter to lk fop.
-         NOTE: @fd is stored with a fd_ref().
-@uid - uid parameter to fchown.
-@gid - gid parameter to fchown.
-call_stub_t *
-fop_fchown_stub (call_frame_t *frame,
-                 fop_fchown_t fn,
-                 fd_t *fd,
-                 uid_t uid,
-                 gid_t gid);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location, if not NULL.
-@off - offset parameter to truncate fop.
-call_stub_t *
-fop_truncate_stub (call_frame_t *frame,
-                   fop_truncate_t fn,
-                   loc_t *loc,
-                   off_t off);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@fd - file descriptor parameter to lk fop.
-         NOTE: @fd is stored with a fd_ref().
-@off - offset parameter to ftruncate fop.
-call_stub_t *
-fop_ftruncate_stub (call_frame_t *frame,
-                    fop_ftruncate_t fn,
-                    fd_t *fd,
-                    off_t off);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-@tv - tv parameter to utimens fop.
-call_stub_t *
-fop_utimens_stub (call_frame_t *frame,
-                  fop_utimens_t fn,
-                  loc_t *loc,
-                  struct timespec tv[2]);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-@mask - mask parameter for access fop.
-call_stub_t *
-fop_access_stub (call_frame_t *frame,
-                 fop_access_t fn,
-                 loc_t *loc,
-                 int32_t mask);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-@size - size parameter to readlink fop.
-call_stub_t *
-fop_readlink_stub (call_frame_t *frame,
-                   fop_readlink_t fn,
-                   loc_t *loc,
-                   size_t size);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-@mode - mode parameter to mknod fop.
-@rdev - rdev parameter to mknod fop.
-call_stub_t *
-fop_mknod_stub (call_frame_t *frame,
-                fop_mknod_t fn,
-                loc_t *loc,
-                mode_t mode,
-                dev_t rdev);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-@mode - mode parameter to mkdir fop.
-call_stub_t *
-fop_mkdir_stub (call_frame_t *frame,
-                fop_mkdir_t fn,
-                loc_t *loc,
-                mode_t mode);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-call_stub_t *
-fop_unlink_stub (call_frame_t *frame,
-                 fop_unlink_t fn,
-                 loc_t *loc);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-call_stub_t *
-fop_rmdir_stub (call_frame_t *frame,
-                fop_rmdir_t fn,
-                loc_t *loc);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@linkname - linkname parameter to symlink fop.
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-call_stub_t *
-fop_symlink_stub (call_frame_t *frame,
-                  fop_symlink_t fn,
-                  const char *linkname,
-                  loc_t *loc);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@oldloc - pointer to location structure.
-         NOTE: @oldloc will be copied to a different location, with inode_ref() to
-               @oldloc->inode and @oldloc->parent, if not NULL. also @oldloc->path will
-               be copied to a different location, if not NULL.
-@newloc - pointer to location structure.
-         NOTE: @newloc will be copied to a different location, with inode_ref() to
-               @newloc->inode and @newloc->parent, if not NULL. also @newloc->path will
-               be copied to a different location, if not NULL.
-call_stub_t *
-fop_rename_stub (call_frame_t *frame,
-                 fop_rename_t fn,
-                 loc_t *oldloc,
-                 loc_t *newloc);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-@newpath - newpath parameter to link fop.
-call_stub_t *
-fop_link_stub (call_frame_t *frame,
-               fop_link_t fn,
-               loc_t *oldloc,
-               const char *newpath);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-@flags - flags parameter to create fop.
-@mode - mode parameter to create fop.
-@fd - file descriptor parameter to create fop.
-         NOTE: @fd is stored with a fd_ref().
-call_stub_t *
-fop_create_stub (call_frame_t *frame,
-                 fop_create_t fn,
-                 loc_t *loc,
-                 int32_t flags,
-                 mode_t mode, fd_t *fd);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@flags - flags parameter to open fop.
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-call_stub_t *
-fop_open_stub (call_frame_t *frame,
-               fop_open_t fn,
-               loc_t *loc,
-               int32_t flags,
-               fd_t *fd);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@fd - file descriptor parameter to lk fop.
-         NOTE: @fd is stored with a fd_ref().
-@size - size parameter to readv fop.
-@off - offset parameter to readv fop.
-call_stub_t *
-fop_readv_stub (call_frame_t *frame,
-                fop_readv_t fn,
-                fd_t *fd,
-                size_t size,
-                off_t off);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@fd - file descriptor parameter to lk fop.
-         NOTE: @fd is stored with a fd_ref().
-@vector - vector parameter to writev fop.
-         NOTE: @vector is iov_dup()ed while creating stub. and frame->root->req_refs
-               dictionary is dict_ref()ed.
-@count - count parameter to writev fop.
-@off - off parameter to writev fop.
-call_stub_t *
-fop_writev_stub (call_frame_t *frame,
-                 fop_writev_t fn,
-                 fd_t *fd,
-                 struct iovec *vector,
-                 int32_t count,
-                 off_t off);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@fd - file descriptor parameter to flush fop.
-         NOTE: @fd is stored with a fd_ref().
-call_stub_t *
-fop_flush_stub (call_frame_t *frame,
-                fop_flush_t fn,
-                fd_t *fd);
-
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@fd - file descriptor parameter to lk fop.
-         NOTE: @fd is stored with a fd_ref().
-@datasync - datasync parameter to fsync fop.
-call_stub_t *
-fop_fsync_stub (call_frame_t *frame,
-                fop_fsync_t fn,
-                fd_t *fd,
-                int32_t datasync);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to @loc->inode and
-               @loc->parent, if not NULL. also @loc->path will be copied to a different location.
-@fd - file descriptor parameter to opendir fop.
-         NOTE: @fd is stored with a fd_ref().
-call_stub_t *
-fop_opendir_stub (call_frame_t *frame,
-                  fop_opendir_t fn,
-                  loc_t *loc,
-                  fd_t *fd);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@fd - file descriptor parameter to getdents fop.
-         NOTE: @fd is stored with a fd_ref().
-@size - size parameter to getdents fop.
-@off - off parameter to getdents fop.
-@flags - flags parameter to getdents fop.
-call_stub_t *
-fop_getdents_stub (call_frame_t *frame,
-                   fop_getdents_t fn,
-                   fd_t *fd,
-                   size_t size,
-                   off_t off,
-                   int32_t flag);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@fd - file descriptor parameter to setdents fop.
-         NOTE: @fd is stored with a fd_ref().
-@flags - flags parameter to setdents fop.
-@entries - entries parameter to setdents fop.
-call_stub_t *
-fop_setdents_stub (call_frame_t *frame,
-                   fop_setdents_t fn,
-                   fd_t *fd,
-                   int32_t flags,
-                   dir_entry_t *entries,
-                   int32_t count);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@fd - file descriptor parameter to setdents fop.
-         NOTE: @fd is stored with a fd_ref().
-@datasync - datasync parameter to fsyncdir fop.
-call_stub_t *
-fop_fsyncdir_stub (call_frame_t *frame,
-                   fop_fsyncdir_t fn,
-                   fd_t *fd,
-                   int32_t datasync);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-call_stub_t *
-fop_statfs_stub (call_frame_t *frame,
-                 fop_statfs_t fn,
-                 loc_t *loc);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-@dict - dict parameter to setxattr fop.
-         NOTE: stub creation procedure stores @dict pointer with dict_ref() to it.
-call_stub_t *
-fop_setxattr_stub (call_frame_t *frame,
-                   fop_setxattr_t fn,
-                   loc_t *loc,
-                   dict_t *dict,
-                   int32_t flags);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-@name - name parameter to getxattr fop.
-call_stub_t *
-fop_getxattr_stub (call_frame_t *frame,
-                   fop_getxattr_t fn,
-                   loc_t *loc,
-                   const char *name);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-@name - name parameter to removexattr fop.
-         NOTE: name string will be copied to a different location while creating stub.
-call_stub_t *
-fop_removexattr_stub (call_frame_t *frame,
-                      fop_removexattr_t fn,
-                      loc_t *loc,
-                      const char *name);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@fd - file descriptor parameter to lk fop.
-         NOTE: @fd is stored with a fd_ref().
-@cmd - command parameter to lk fop.
-@lock - lock parameter to lk fop.
-         NOTE: lock will be copied to a different location while creating stub.
-call_stub_t *
-fop_lk_stub (call_frame_t *frame,
-             fop_lk_t fn,
-             fd_t *fd,
-             int32_t cmd,
-             struct flock *lock);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@fd - fd parameter to gf_lk fop.
-         NOTE: @fd is fd_ref()ed while creating stub, if not NULL.
-@cmd - cmd parameter to gf_lk fop.
-@lock - lock paramater to gf_lk fop.
-         NOTE: @lock is copied to a different memory location while creating
-               stub.
-call_stub_t *
-fop_gf_lk_stub (call_frame_t *frame,
-                fop_gf_lk_t fn,
-                fd_t *fd,
-                int32_t cmd,
-                struct flock *lock);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@fd - file descriptor parameter to readdir fop.
-         NOTE: @fd is stored with a fd_ref().
-@size - size parameter to readdir fop.
-@off - offset parameter to readdir fop.
-call_stub_t *
-fop_readdir_stub (call_frame_t *frame,
-                  fop_readdir_t fn,
-                  fd_t *fd,
-                  size_t size,
-                  off_t off);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@loc - pointer to location structure.
-         NOTE: @loc will be copied to a different location, with inode_ref() to
-               @loc->inode and @loc->parent, if not NULL. also @loc->path will be
-               copied to a different location.
-@flags - flags parameter to checksum fop.
-call_stub_t *
-fop_checksum_stub (call_frame_t *frame,
-                   fop_checksum_t fn,
-                   loc_t *loc,
-                   int32_t flags);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-@inode - inode parameter to @fn.
-         NOTE: @inode pointer is stored with a inode_ref().
-@buf - buf parameter to @fn.
-         NOTE: @buf is copied to a different memory location, if not NULL.
-@dict - dict parameter to @fn.
-         NOTE: @dict pointer is stored with dict_ref().
-call_stub_t *
-fop_lookup_cbk_stub (call_frame_t *frame,
-                     fop_lookup_cbk_t fn,
-                     int32_t op_ret,
-                     int32_t op_errno,
-                     inode_t *inode,
-                     struct stat *buf,
-                     dict_t *dict);
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-@buf - buf parameter to @fn.
-         NOTE: @buf is copied to a different memory location, if not NULL.
-call_stub_t *
-fop_stat_cbk_stub (call_frame_t *frame,
-                   fop_stat_cbk_t fn,
-                   int32_t op_ret,
-                   int32_t op_errno,
-                   struct stat *buf);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-@buf - buf parameter to @fn.
-         NOTE: @buf is copied to a different memory location, if not NULL.
-call_stub_t *
-fop_fstat_cbk_stub (call_frame_t *frame,
-                    fop_fstat_cbk_t fn,
-                    int32_t op_ret,
-                    int32_t op_errno,
-                    struct stat *buf);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-@buf - buf parameter to @fn.
-         NOTE: @buf is copied to a different memory location, if not NULL.
-call_stub_t *
-fop_chmod_cbk_stub (call_frame_t *frame,
-                    fop_chmod_cbk_t fn,
-                    int32_t op_ret,
-                    int32_t op_errno,
-                    struct stat *buf);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-@buf - buf parameter to @fn.
-         NOTE: @buf is copied to a different memory location, if not NULL.
-call_stub_t *
-fop_fchmod_cbk_stub (call_frame_t *frame,
-                     fop_fchmod_cbk_t fn,
-                     int32_t op_ret,
-                     int32_t op_errno,
-                     struct stat *buf);
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-@buf - buf parameter to @fn.
-         NOTE: @buf is copied to a different memory location, if not NULL.
-call_stub_t *
-fop_chown_cbk_stub (call_frame_t *frame,
-                    fop_chown_cbk_t fn,
-                    int32_t op_ret,
-                    int32_t op_errno,
-                    struct stat *buf);
-
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-@buf - buf parameter to @fn.
-         NOTE: @buf is copied to a different memory location, if not NULL.
-call_stub_t *
-fop_fchown_cbk_stub (call_frame_t *frame,
-                     fop_fchown_cbk_t fn,
-                     int32_t op_ret,
-                     int32_t op_errno,
-                     struct stat *buf);
-
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-@buf - buf parameter to @fn.
-         NOTE: @buf is copied to a different memory location, if not NULL.
-call_stub_t *
-fop_truncate_cbk_stub (call_frame_t *frame,
-                       fop_truncate_cbk_t fn,
-                       int32_t op_ret,
-                       int32_t op_errno,
-                       struct stat *buf);
-
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-@buf - buf parameter to @fn.
-         NOTE: @buf is copied to a different memory location, if not NULL.
-call_stub_t *
-fop_ftruncate_cbk_stub (call_frame_t *frame,
-                        fop_ftruncate_cbk_t fn,
-                        int32_t op_ret,
-                        int32_t op_errno,
-                        struct stat *buf);
-
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-@buf - buf parameter to @fn.
-         NOTE: @buf is copied to a different memory location, if not NULL.
-call_stub_t *
-fop_utimens_cbk_stub (call_frame_t *frame,
-                      fop_utimens_cbk_t fn,
-                      int32_t op_ret,
-                      int32_t op_errno,
-                      struct stat *buf);
-
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-call_stub_t *
-fop_access_cbk_stub (call_frame_t *frame,
-                     fop_access_cbk_t fn,
-                     int32_t op_ret,
-                     int32_t op_errno);
-
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-@path - path parameter to @fn.
-         NOTE: @path is copied to a different memory location, if not NULL.
-call_stub_t *
-fop_readlink_cbk_stub (call_frame_t *frame,
-                       fop_readlink_cbk_t fn,
-                       int32_t op_ret,
-                       int32_t op_errno,
-                       const char *path);
-
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-@inode - inode parameter to @fn.
-         NOTE: @inode pointer is stored with a inode_ref().
-@buf - buf parameter to @fn.
-         NOTE: @buf is copied to a different memory location, if not NULL.
-call_stub_t *
-fop_mknod_cbk_stub (call_frame_t *frame,
-                    fop_mknod_cbk_t fn,
-                    int32_t op_ret,
-                    int32_t op_errno,
-                    inode_t *inode,
-                    struct stat *buf);
-
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-@inode - inode parameter to @fn.
-         NOTE: @inode pointer is stored with a inode_ref().
-@buf - buf parameter to @fn.
-         NOTE: @buf is copied to a different memory location, if not NULL.
-call_stub_t *
-fop_mkdir_cbk_stub (call_frame_t *frame,
-                    fop_mkdir_cbk_t fn,
-                    int32_t op_ret,
-                    int32_t op_errno,
-                    inode_t *inode,
-                    struct stat *buf);
-
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-call_stub_t *
-fop_unlink_cbk_stub (call_frame_t *frame,
-                     fop_unlink_cbk_t fn,
-                     int32_t op_ret,
-                     int32_t op_errno);
-
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-call_stub_t *
-fop_rmdir_cbk_stub (call_frame_t *frame,
-                    fop_rmdir_cbk_t fn,
-                    int32_t op_ret,
-                    int32_t op_errno);
-
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-@inode - inode parameter to @fn.
-         NOTE: @inode pointer is stored with a inode_ref().
-@buf - buf parameter to @fn.
-         NOTE: @buf is copied to a different memory location, if not NULL.
-call_stub_t *
-fop_symlink_cbk_stub (call_frame_t *frame,
-                      fop_symlink_cbk_t fn,
-                      int32_t op_ret,
-                      int32_t op_errno,
-                      inode_t *inode,
-                      struct stat *buf);
-
-
-@frame - call frame which has to be used to resume the call at call_resume().
-@fn - procedure to call during call_resume().
-@op_ret - op_ret parameter to @fn.
-@op_errno - op_errno parameter to @fn.
-@buf - buf parameter to @fn.
-         NOTE: @buf is copied to a different memory location, if not NULL.
-call_stub_t * -fop_rename_cbk_stub (call_frame_t *frame, - fop_rename_cbk_t fn, - int32_t op_ret, - int32_t op_errno, - struct stat *buf); - - -@frame - call frame which has to be used to resume the call at call_resume(). -@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -@inode - inode parameter to @fn. - NOTE: @inode pointer is stored with a inode_ref(). -@buf - buf parameter to @fn. - NOTE: @buf is copied to a different memory location, if not NULL. -call_stub_t * -fop_link_cbk_stub (call_frame_t *frame, - fop_link_cbk_t fn, - int32_t op_ret, - int32_t op_errno, - inode_t *inode, - struct stat *buf); - -@frame - call frame which has to be used to resume the call at call_resume(). -@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -@fd - fd parameter to @fn. - NOTE: @fd pointer is stored with a fd_ref(). -@inode - inode parameter to @fn. - NOTE: @inode pointer is stored with a inode_ref(). -@buf - buf parameter to @fn. - NOTE: @buf is copied to a different memory location, if not NULL. -call_stub_t * -fop_create_cbk_stub (call_frame_t *frame, - fop_create_cbk_t fn, - int32_t op_ret, - int32_t op_errno, - fd_t *fd, - inode_t *inode, - struct stat *buf); - - -@frame - call frame which has to be used to resume the call at call_resume(). -@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -@fd - fd parameter to @fn. - NOTE: @fd pointer is stored with a fd_ref(). -call_stub_t * -fop_open_cbk_stub (call_frame_t *frame, - fop_open_cbk_t fn, - int32_t op_ret, - int32_t op_errno, - fd_t *fd); - - -@frame - call frame which has to be used to resume the call at call_resume(). -@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -@vector - vector parameter to @fn. 
- NOTE: @vector is copied to a different memory location, if not NULL. also - frame->root->rsp_refs is dict_ref()ed. -@stbuf - stbuf parameter to @fn. - NOTE: @stbuf is copied to a different memory location, if not NULL. -call_stub_t * -fop_readv_cbk_stub (call_frame_t *frame, - fop_readv_cbk_t fn, - int32_t op_ret, - int32_t op_errno, - struct iovec *vector, - int32_t count, - struct stat *stbuf); - - -@frame - call frame which has to be used to resume the call at call_resume(). -@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -@stbuf - stbuf parameter to @fn. - NOTE: @stbuf is copied to a different memory location, if not NULL. -call_stub_t * -fop_writev_cbk_stub (call_frame_t *frame, - fop_writev_cbk_t fn, - int32_t op_ret, - int32_t op_errno, - struct stat *stbuf); - - -@frame - call frame which has to be used to resume the call at call_resume(). -@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -call_stub_t * -fop_flush_cbk_stub (call_frame_t *frame, - fop_flush_cbk_t fn, - int32_t op_ret, - int32_t op_errno); - -@frame - call frame which has to be used to resume the call at call_resume(). -@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -call_stub_t * -fop_fsync_cbk_stub (call_frame_t *frame, - fop_fsync_cbk_t fn, - int32_t op_ret, - int32_t op_errno); - - -@frame - call frame which has to be used to resume the call at call_resume(). -@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -@fd - fd parameter to @fn. - NOTE: @fd pointer is stored with a fd_ref(). -call_stub_t * -fop_opendir_cbk_stub (call_frame_t *frame, - fop_opendir_cbk_t fn, - int32_t op_ret, - int32_t op_errno, - fd_t *fd); - - -@frame - call frame which has to be used to resume the call at call_resume(). 
-@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -@entries - entries parameter to @fn. -@count - count parameter to @fn. -call_stub_t * -fop_getdents_cbk_stub (call_frame_t *frame, - fop_getdents_cbk_t fn, - int32_t op_ret, - int32_t op_errno, - dir_entry_t *entries, - int32_t count); - - -@frame - call frame which has to be used to resume the call at call_resume(). -@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -call_stub_t * -fop_setdents_cbk_stub (call_frame_t *frame, - fop_setdents_cbk_t fn, - int32_t op_ret, - int32_t op_errno); - -@frame - call frame which has to be used to resume the call at call_resume(). -@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -call_stub_t * -fop_fsyncdir_cbk_stub (call_frame_t *frame, - fop_fsyncdir_cbk_t fn, - int32_t op_ret, - int32_t op_errno); - - -@frame - call frame which has to be used to resume the call at call_resume(). -@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -@buf - buf parameter to @fn. - NOTE: @buf is copied to a different memory location, if not NULL. -call_stub_t * -fop_statfs_cbk_stub (call_frame_t *frame, - fop_statfs_cbk_t fn, - int32_t op_ret, - int32_t op_errno, - struct statvfs *buf); - - -@frame - call frame which has to be used to resume the call at call_resume(). -@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -call_stub_t * -fop_setxattr_cbk_stub (call_frame_t *frame, - fop_setxattr_cbk_t fn, - int32_t op_ret, - int32_t op_errno); - - -@frame - call frame which has to be used to resume the call at call_resume(). -@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. 
-@op_errno - op_errno parameter to @fn. -@value - value dictionary parameter to @fn. - NOTE: @value pointer is stored with a dict_ref(). -call_stub_t * -fop_getxattr_cbk_stub (call_frame_t *frame, - fop_getxattr_cbk_t fn, - int32_t op_ret, - int32_t op_errno, - dict_t *value); - - -@frame - call frame which has to be used to resume the call at call_resume(). -@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -call_stub_t * -fop_removexattr_cbk_stub (call_frame_t *frame, - fop_removexattr_cbk_t fn, - int32_t op_ret, - int32_t op_errno); - - -@frame - call frame which has to be used to resume the call at call_resume(). -@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -@lock - lock parameter to @fn. - NOTE: @lock is copied to a different memory location while creating - stub. -call_stub_t * -fop_lk_cbk_stub (call_frame_t *frame, - fop_lk_cbk_t fn, - int32_t op_ret, - int32_t op_errno, - struct flock *lock); - -@frame - call frame which has to be used to resume the call at call_resume(). -@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -@lock - lock parameter to @fn. - NOTE: @lock is copied to a different memory location while creating - stub. -call_stub_t * -fop_gf_lk_cbk_stub (call_frame_t *frame, - fop_gf_lk_cbk_t fn, - int32_t op_ret, - int32_t op_errno, - struct flock *lock); - - -@frame - call frame which has to be used to resume the call at call_resume(). -@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -@entries - entries parameter to @fn. -call_stub_t * -fop_readdir_cbk_stub (call_frame_t *frame, - fop_readdir_cbk_t fn, - int32_t op_ret, - int32_t op_errno, - gf_dirent_t *entries); - - -@frame - call frame which has to be used to resume the call at call_resume(). 
-@fn - procedure to call during call_resume(). -@op_ret - op_ret parameter to @fn. -@op_errno - op_errno parameter to @fn. -@file_checksum - file_checksum parameter to @fn. - NOTE: file_checksum will be copied to a different memory location - while creating stub. -@dir_checksum - dir_checksum parameter to @fn. - NOTE: file_checksum will be copied to a different memory location - while creating stub. -call_stub_t * -fop_checksum_cbk_stub (call_frame_t *frame, - fop_checksum_cbk_t fn, - int32_t op_ret, - int32_t op_errno, - uint8_t *file_checksum, - uint8_t *dir_checksum); - -resuming a call: ---------------- - call can be resumed using call stub through call_resume API. - - void call_resume (call_stub_t *stub); - - stub - call stub created during pausing a call. - - NOTE: call_resume() will decrease reference count of any fd_t, dict_t and inode_t that it finds - in stub->args.<operation>.<fd_t-or-inode_t-or-dict_t>. so, if any fd_t, dict_t or - inode_t pointers are assigned at stub->args.<operation>.<fd_t-or-inode_t-or-dict_t> after - fop_<operation>_stub() call, they must be <fd_t-or-inode_t-or-dict_t>_ref()ed. - - call_resume does not STACK_DESTROY() for any fop. - - if stub->fn is NULL, call_resume does STACK_WIND() or STACK_UNWIND() using the stub->frame. - - return - call resume fails only if stub is NULL. call resume fails with errno set to EINVAL. diff --git a/doc/hacker-guide/en-US/markdown/adding-fops.md b/doc/hacker-guide/en-US/markdown/adding-fops.md new file mode 100644 index 000000000..3f72ed3e2 --- /dev/null +++ b/doc/hacker-guide/en-US/markdown/adding-fops.md @@ -0,0 +1,18 @@ +Adding a new FOP +================ + +Steps to be followed when adding a new FOP to GlusterFS: + +1. Edit `glusterfs.h` and add a `GF_FOP_*` constant. +2. Edit `xlator.[ch]` and: + * add the new prototype for fop and callback. + * edit `xlator_fops` structure. +3. Edit `xlator.c` and add to fill_defaults. +4. Edit `protocol.h` and add struct necessary for the new FOP. +5. 
Edit `defaults.[ch]` and provide default implementation. +6. Edit `call-stub.[ch]` and provide stub implementation. +7. Edit `common-utils.c` and add to gf_global_variable_init(). +8. Edit client-protocol and add your FOP. +9. Edit server-protocol and add your FOP. +10. Implement your FOP in any translator for which the default implementation + is not sufficient. diff --git a/doc/hacker-guide/en-US/markdown/afr.md b/doc/hacker-guide/en-US/markdown/afr.md new file mode 100644 index 000000000..1be7e39f2 --- /dev/null +++ b/doc/hacker-guide/en-US/markdown/afr.md @@ -0,0 +1,191 @@ +cluster/afr translator +====================== + +Locking +------- + +Before understanding replicate, one must understand two internal FOPs: + +### `GF_FILE_LK` + +This is exactly like `fcntl(2)` locking, except the locks are in a +separate domain from locks held by applications. + +### `GF_DIR_LK (loc_t *loc, char *basename)` + +This allows one to lock a name under a directory. For example, +to lock /mnt/glusterfs/foo, one would use the call: + +``` +GF_DIR_LK ({loc_t for "/mnt/glusterfs"}, "foo") +``` + +If one wishes to lock *all* the names under a particular directory, +supply the basename argument as `NULL`. + +The locks can either be read locks or write locks; consult the +function prototype for more details. + +Both these operations are implemented by the features/locks (earlier +known as posix-locks) translator. + +Basic design +------------ + +All FOPs can be classified into four major groups: + +### inode-read + +Operations that read an inode's data (file contents) or metadata (perms, etc.). + +access, getxattr, fstat, readlink, readv, stat. + +### inode-write + +Operations that modify an inode's data or metadata. + +chmod, chown, truncate, writev, utimens. + +### dir-read + +Operations that read a directory's contents or metadata. + +readdir, getdents, checksum. + +### dir-write + +Operations that modify a directory's contents or metadata. 
+ +create, link, mkdir, mknod, rename, rmdir, symlink, unlink. + +Some of these make a subgroup in that they modify *two* different entries: +link, rename, symlink. + +### Others + +Other operations. + +flush, lookup, open, opendir, statfs. + +Algorithms +---------- + +Each of the four major groups has its own algorithm: + +### inode-read, dir-read + +1. Send a request to the first child that is up: + * if it fails: + * try the next available child + * if we have exhausted all children: + * return failure + +### inode-write + + All operations are done in parallel unless specified otherwise. + +1. Send a ``GF_FILE_LK`` request on all children for a write lock on the + appropriate region + (for metadata operations: entire file (0, 0) for writev: + (offset, offset+size of buffer)) + * If a lock request fails on a child: + * unlock all children + * try to acquire a blocking lock (`F_SETLKW`) on each child, serially. + If this fails (due to `ENOTCONN` or `EINVAL`): + Consider this child as dead for rest of transaction. +2. Mark all children as "pending" on all (alive) children (see below for +meaning of "pending"). + * If it fails on any child: + * mark it as dead (in transaction local state). +3. Perform operation on all (alive) children. + * If it fails on any child: + * mark it as dead (in transaction local state). +4. Unmark all successful children as not "pending" on all nodes. +5. Unlock region on all (alive) children. + +### dir-write + + The algorithm for dir-write is same as above except instead of holding + `GF_FILE_LK` locks we hold a GF_DIR_LK lock on the name being operated upon. + In case of link-type calls, we hold locks on both the operand names. + +"pending" +--------- + +The "pending" number is like a journal entry. A pending entry is an +array of 32-bit integers stored in network byte-order as the extended +attribute of an inode (which can be a directory as well). 
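The storage format described above (an array of 32-bit integers kept in network byte order) can be sketched in C as follows. This is only an illustration under stated assumptions, not the AFR code: the names `pending_to_wire` and `pending_from_wire` are hypothetical, and treating the array as one counter per child is an assumption that the text does not spell out.

```c
#include <arpa/inet.h>  /* htonl, ntohl */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert host-order counters to network byte
 * order, ready to be stored as an extended attribute value. */
static void
pending_to_wire (const uint32_t *host, uint32_t *wire, size_t count)
{
        size_t i = 0;

        for (i = 0; i < count; i++)
                wire[i] = htonl (host[i]);
}

/* Hypothetical helper: convert a stored (network-order) array back to
 * host byte order after reading the extended attribute. */
static void
pending_from_wire (const uint32_t *wire, uint32_t *host, size_t count)
{
        size_t i = 0;

        for (i = 0; i < count; i++)
                host[i] = ntohl (wire[i]);
}
```

Storing the counters in a fixed byte order means the attribute can be read back correctly by a server with a different endianness than the one that wrote it.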
+ +There are three keys corresponding to three types of pending operations: + +### `AFR_METADATA_PENDING` + +There are some metadata operations pending on this inode (perms, ctime/mtime, +xattr, etc.). + +### `AFR_DATA_PENDING` + +There is some data pending on this inode (writev). + +### `AFR_ENTRY_PENDING` + +There are some directory operations pending on this directory +(create, unlink, etc.). + +Self heal +--------- + +* On lookup, gather extended attribute data: + * If entry is a regular file: + * If an entry is present on one child and not on others: + * create entry on others. + * If entries exist but have different metadata (perms, etc.): + * consider the entry with the highest `AFR_METADATA_PENDING` number as + definitive and replicate its attributes on children. + * If entry is a directory: + * Consider the entry with the highest `AFR_ENTRY_PENDING` number as + definitive and replicate its contents on all children. + * If any two entries have non-matching types (i.e., one is file and + other is directory): + * Announce to the user via log that a split-brain situation has been + detected, and do nothing. +* On open, gather extended attribute data: + * Consider the file with the highest `AFR_DATA_PENDING` number as + the definitive one and replicate its contents on all other + children. + +During all self heal operations, appropriate locks must be held on all +regions/entries being affected. + +Inode scaling +------------- + +Inode scaling is necessary because if a situation arises where an inode number +is returned for a directory (by lookup) which was previously the inode number +of a file (as per FUSE's table), then FUSE gets horribly confused (consult a +FUSE expert for more details). + +To avoid such a situation, we distribute the 64-bit inode space equally +among all children of replicate. 
+ +To illustrate: + +If c1, c2, c3 are children of replicate, they each get 1/3 of the available +inode space: + +------------- -- -- -- -- -- -- -- -- -- -- -- --- +Child: c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 ... +Inode number: 1 2 3 4 5 6 7 8 9 10 11 ... +------------- -- -- -- -- -- -- -- -- -- -- -- --- + +Thus, if lookup on c1 returns an inode number "2", it is scaled to "4" +(which is the second inode number in c1's space). + +This way we ensure that there is never a collision of inode numbers from +two different children. + +This reduction of inode space doesn't really reduce the usability of +replicate since even if we assume replicate has 1024 children (which would be a +highly unusual scenario), each child still has a 54-bit inode space: +$2^{54} \sim 1.8 \times 10^{16}$, which is much larger than any real +world requirement. diff --git a/doc/hacker-guide/en-US/markdown/coding-standard.md b/doc/hacker-guide/en-US/markdown/coding-standard.md new file mode 100644 index 000000000..178dc142a --- /dev/null +++ b/doc/hacker-guide/en-US/markdown/coding-standard.md @@ -0,0 +1,402 @@ +GlusterFS Coding Standards +========================== + +Structure definitions should have a comment per member +------------------------------------------------------ + +Every member in a structure definition must have a comment about its +purpose. The comment should be descriptive without being overly verbose. + +*Bad:* + +``` +gf_lock_t lock; /* lock */ +``` + +*Good:* + +``` +DBTYPE access_mode; /* access mode for accessing + * the databases, can be + * DB_HASH, DB_BTREE + * (option access-mode <mode>) + */ +``` + +Declare all variables at the beginning of the function +------------------------------------------------------ + +All local variables in a function must be declared immediately after the +opening brace. This makes it easy to keep track of memory that needs to be freed +during exit. 
It also helps debugging, since gdb cannot handle variables +declared inside loops or other such blocks. + +Always initialize local variables +--------------------------------- + +Every local variable should be initialized to a sensible default value +at the point of its declaration. All pointers should be initialized to NULL, +and all integers should be zero or (if it makes sense) an error value. + + +*Good:* + +``` +int ret = 0; +char *databuf = NULL; +int _fd = -1; +``` + +Initialization should always be done with a constant value +---------------------------------------------------------- + +Never use a non-constant expression as the initialization value for a variable. + + +*Bad:* + +``` +pid_t pid = frame->root->pid; +char *databuf = malloc (1024); +``` + +Validate all arguments to a function +------------------------------------ + +All pointer arguments to a function must be checked for `NULL`. +A macro named `VALIDATE` (in `common-utils.h`) +takes one argument, and if it is `NULL`, writes a log message and +jumps to a label called `err` after setting op_ret and op_errno +appropriately. It is recommended to use this template. + + +*Good:* + +``` +VALIDATE(frame); +VALIDATE(this); +VALIDATE(inode); +``` + +Never rely on precedence of operators +------------------------------------- + +Never write code that relies on the precedence of operators to execute +correctly. Such code can be hard to read and someone else might not +know the precedence of operators as accurately as you do. + +*Bad:* + +``` +if (op_ret == -1 && errno != ENOENT) +``` + +*Good:* + +``` +if ((op_ret == -1) && (errno != ENOENT)) +``` + +Use exactly matching types +-------------------------- + +Use a variable of the exact type declared in the manual to hold the +return value of a function. Do not use an "equivalent" type. 
+ + +*Bad:* + +``` +int len = strlen (path); +``` + +*Good:* + +``` +size_t len = strlen (path); +``` + +Never write code such as `foo->bar->baz`; check every pointer +------------------------------------------------------------- + +Do not write code that blindly follows a chain of pointer +references. Any pointer in the chain may be `NULL` and thus +cause a crash. Verify that each pointer is non-null before following +it. + +Check return value of all functions and system calls +---------------------------------------------------- + +The return value of all system calls and API functions must be checked +for success or failure. + +*Bad:* + +``` +close (fd); +``` + +*Good:* + +``` +op_ret = close (_fd); +if (op_ret == -1) { + gf_log (this->name, GF_LOG_ERROR, + "close on file %s failed (%s)", real_path, + strerror (errno)); + op_errno = errno; + goto out; +} +``` + + +Gracefully handle failure of malloc +----------------------------------- + +GlusterFS should never crash or exit due to lack of memory. If a +memory allocation fails, the call should be unwound and an error +returned to the user. + +*Use result args and reserve the return value to indicate success or failure:* + +The return value of every function must indicate success or failure (unless +it is impossible for the function to fail --- e.g., boolean functions). If +the function needs to return additional data, it must be returned using a +result (pointer) argument. + +*Bad:* + +``` +int32_t dict_get_int32 (dict_t *this, char *key); +``` + +*Good:* + +``` +int dict_get_int32 (dict_t *this, char *key, int32_t *val); +``` + +Always use the `n` versions of string functions +----------------------------------------------- + +Unless impossible, use the length-limited versions of the string functions. 
+ +*Bad:* + +``` +strcpy (entry_path, real_path); +``` + +*Good:* + +``` +strncpy (entry_path, real_path, entry_path_len); +``` + +No dead or commented code +------------------------- + +There must be no dead code (code to which control can never be passed) or +commented out code in the codebase. + +Only one unwind and return per function +--------------------------------------- + +There must be only one exit out of a function. `UNWIND` and return +should happen at only one point in the function. + +Function length or Keep functions small +--------------------------------------- + +We live in the UNIX-world where modules do one thing and do it well. +This rule should apply to our functions also. If a function is very long, try splitting it +into many little helper functions. The question is, in a coding +spree, how do we know a function is long and unreadable? One rule of +thumb given by Linus Torvalds is that a function should be broken up +if you have 4 or more levels of indentation going on for more than 3-4 +lines. + +*Example for a helper function:* +``` +static int +same_owner (posix_lock_t *l1, posix_lock_t *l2) +{ + return ((l1->client_pid == l2->client_pid) && + (l1->transport == l2->transport)); +} +``` + +Defining functions as static +---------------------------- + +Define internal functions as static only if you're +very sure that there will not be a crash (of any kind) originating in +that function. If there is even a remote possibility, perhaps due to +pointer dereferencing, etc., declare the function as non-static. This +ensures that when a crash does happen, the function name shows up +in the back-trace generated by libc. However, doing so has potential +for polluting the function namespace, so to avoid conflicts with other +components in other parts, ensure that the function names are +prepended with a prefix that identifies the component to which it +belongs. For example, non-static functions in the io-threads translator start +with `iot_`. 
+ +Ensure function calls wrap around after 80-columns +-------------------------------------------------- + +Place remaining arguments on the next line if needed. + +Function arguments and function definition +------------------------------------------- + +Place all the arguments of a function definition on the same line +until the line goes beyond 80-cols. Arguments that extend beyond +80-cols should be placed on the next line. + +Style issues +------------ + +### Brace placement + +Use K&R/Linux style of brace placement for blocks. + +*Good:* + +``` +int some_function (...) +{ + if (...) { + /* ... */ + } else if (...) { + /* ... */ + } else { + /* ... */ + } + + do { + /* ... */ + } while (cond); +} +``` + +### Indentation + +Use *eight* spaces for indenting blocks. Ensure that your +file contains only spaces and not tab characters. You can do this +in Emacs by selecting the entire file (`C-x h`) and +running `M-x untabify`. + +To make Emacs indent lines automatically by eight spaces, add this +line to your `.emacs`: + +``` +(add-hook 'c-mode-hook (lambda () (c-set-style "linux"))) +``` + +### Comments + +Write a comment before every function describing its purpose (one-line), +its arguments, and its return value. Mention whether it is an internal +function or an exported function. + +Write a comment before every structure describing its purpose, and +write comments about each of its members. + +Follow the style shown below for comments, since such comments +can then be automatically extracted by doxygen to generate +documentation. + +*Good:* + +``` +/** +* hash_name -hash function for filenames +* @par: parent inode number +* @name: basename of inode +* @mod: number of buckets in the hashtable +* +* @return: success: bucket number +* failure: -1 +* +* Not for external use. 
+*/ +``` + +### Indicating critical sections + +To clearly show regions of code which execute with locks held, use +the following format: + +``` +pthread_mutex_lock (&mutex); +{ + /* code */ +} +pthread_mutex_unlock (&mutex); +``` + +*A skeleton fop function:* + +This is the recommended template for any fop. In the beginning come +the initializations. After that, the "success" control flow should be +linear. Any error conditions should cause a `goto` to a single +point, `out`. At that point, the code should detect the error +that has occurred and do appropriate cleanup. + +``` +int32_t +sample_fop (call_frame_t *frame, xlator_t *this, ...) +{ + char * var1 = NULL; + int32_t op_ret = -1; + int32_t op_errno = 0; + DIR * dir = NULL; + struct posix_fd * pfd = NULL; + + VALIDATE_OR_GOTO (frame, out); + VALIDATE_OR_GOTO (this, out); + + /* other validations */ + + dir = opendir (...); + + if (dir == NULL) { + op_errno = errno; + gf_log (this->name, GF_LOG_ERROR, + "opendir failed on %s (%s)", loc->path, + strerror (op_errno)); + goto out; + } + + /* another system call */ + if (...) { + op_errno = ENOMEM; + gf_log (this->name, GF_LOG_ERROR, + "out of memory :("); + goto out; + } + + /* ... */ + + out: + if (op_ret == -1) { + + /* check for all the cleanup that needs to be + done */ + + if (dir) { + closedir (dir); + dir = NULL; + } + + if (pfd) { + FREE (pfd->path); + FREE (pfd); + pfd = NULL; + } + } + + STACK_UNWIND (frame, op_ret, op_errno, fd); + return 0; +} +``` diff --git a/doc/hacker-guide/en-US/markdown/posix.md b/doc/hacker-guide/en-US/markdown/posix.md new file mode 100644 index 000000000..84c813e55 --- /dev/null +++ b/doc/hacker-guide/en-US/markdown/posix.md @@ -0,0 +1,59 @@ +storage/posix translator +======================== + +Notes +----- + +### `SET_FS_ID` + +This is so that all filesystem checks are done with the user's +uid/gid and not GlusterFS's uid/gid. 
+ +### `MAKE_REAL_PATH` + +This macro concatenates the base directory of the posix volume +('option directory') with the given path. + +### `need_xattr` in lookup + +If this flag is passed, lookup returns a xattr dictionary that contains +the file's create time, the file's contents, and the version number +of the file. + +This is a hack to increase small file performance. If an application +wants to read a small file, it can finish its job with just a lookup +call instead of a lookup followed by read. + +### `getdents`/`setdents` + +These are used by unify to set and get directory entries. + +### `ALIGN_BUF` + +Macro to align an address to a page boundary (4K). + +### `priv->export_statfs` + +In some cases, two exported volumes may reside on the same +partition on the server. Sending statvfs info for both +the volumes will lead to erroneous df output at the client, +since free space on the partition will be counted twice. + +In such cases, user can disable exporting statvfs info +on one of the volumes by setting this option. + +### `xattrop` + +This fop is used by replicate to set version numbers on files. + +### `getxattr`/`setxattr` hack to read/write files + +A key, `GLUSTERFS_FILE_CONTENT_STRING`, is handled in a special way by +`getxattr`/`setxattr`. A getxattr with the key will return the entire +content of the file as the value. A `setxattr` with the key will write +the value as the entire content of the file. + +### `posix_checksum` + +This calculates a simple XOR checksum on all entry names in a +directory that is used by unify to compare directory contents. 
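The XOR checksum over entry names that `posix_checksum` is described as computing can be sketched roughly as below. This is a minimal sketch, not the translator's code: the buffer size `CHECKSUM_LEN` and the function names `xor_name_into_checksum` and `checksum_names` are hypothetical, and the real function may fold names into its buffer differently.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHECKSUM_LEN 256        /* hypothetical checksum buffer size */

/* Fold one entry name into the checksum, byte by byte, wrapping around
 * if the name is longer than the buffer. */
static void
xor_name_into_checksum (uint8_t *checksum, const char *name)
{
        size_t i = 0;
        size_t len = 0;

        len = strlen (name);
        for (i = 0; i < len; i++)
                checksum[i % CHECKSUM_LEN] ^= (uint8_t) name[i];
}

/* Compute the checksum of a whole directory listing, given as an
 * array of entry names. */
static void
checksum_names (uint8_t *checksum, const char **names, size_t count)
{
        size_t i = 0;

        memset (checksum, 0, CHECKSUM_LEN);
        for (i = 0; i < count; i++)
                xor_name_into_checksum (checksum, names[i]);
}
```

Because XOR is commutative, the result does not depend on the order in which entries are listed, so two directories with the same names produce the same checksum even if they are read in different orders. The price is that it is a weak check: different sets of names can collide.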
diff --git a/doc/hacker-guide/en-US/markdown/translator-development.md b/doc/hacker-guide/en-US/markdown/translator-development.md new file mode 100644 index 000000000..77d1b606a --- /dev/null +++ b/doc/hacker-guide/en-US/markdown/translator-development.md @@ -0,0 +1,666 @@ +Translator development +====================== + +Setting the Stage +----------------- + +This is the first post in a series that will explain some of the details of +writing a GlusterFS translator, using some actual code to illustrate. + +Before we begin, a word about environments. GlusterFS is over 300K lines of +code spread across a few hundred files. That's no Linux kernel or anything, but + you're still going to be navigating through a lot of code in every +code-editing session, so some kind of cross-referencing is *essential*. I use +cscope with the vim bindings, and if I couldn't do Ctrl+G and such to jump +between definitions all the time my productivity would be cut in half. You may +prefer different tools, but as I go through these examples you'll need +something functionally similar to follow along. OK, on with the show. + +The first thing you need to know is that translators are not just bags of +functions and variables. They need to have a very definite internal structure +so that the translator-loading code can figure out where all the pieces are. 
+The way it does this is to use dlsym to look for specific names within your +shared-object file, as follows (from `xlator.c`): + +``` +if (!(xl->fops = dlsym (handle, "fops"))) { + gf_log ("xlator", GF_LOG_WARNING, "dlsym(fops) on %s", + dlerror ()); + goto out; +} + +if (!(xl->cbks = dlsym (handle, "cbks"))) { + gf_log ("xlator", GF_LOG_WARNING, "dlsym(cbks) on %s", + dlerror ()); + goto out; +} + +if (!(xl->init = dlsym (handle, "init"))) { + gf_log ("xlator", GF_LOG_WARNING, "dlsym(init) on %s", + dlerror ()); + goto out; +} + +if (!(xl->fini = dlsym (handle, "fini"))) { + gf_log ("xlator", GF_LOG_WARNING, "dlsym(fini) on %s", + dlerror ()); + goto out; +} +``` + +In this example, `xl` is a pointer to the in-memory object for the translator +we're loading. As you can see, it's looking up various symbols *by name* in the + shared object it just loaded, and storing pointers to those symbols. Some of +them (e.g. `init`) are functions, while others (e.g. `fops`) are dispatch tables +containing pointers to many functions. Together, these make up the translator's + public interface. + +Most of this glue or boilerplate can easily be found at the bottom of one of +the source files that make up each translator. We're going to use the `rot-13` +translator just for fun, so in this case you'd look in `rot-13.c` to see this: + +``` +struct xlator_fops fops = { + .readv = rot13_readv, + .writev = rot13_writev +}; + +struct xlator_cbks cbks = { +}; + +struct volume_options options[] = { +{ .key = {"encrypt-write"}, + .type = GF_OPTION_TYPE_BOOL +}, +{ .key = {"decrypt-read"}, + .type = GF_OPTION_TYPE_BOOL +}, +{ .key = {NULL} }, +}; +``` + +The `fops` table, defined in `xlator.h`, is one of the most important pieces. +This table contains a pointer to each of the filesystem functions that your +translator might implement -- `open`, `read`, `stat`, `chmod`, and so on. 
There +are 82 such functions in all, but don't worry; any that you don't specify here +will be seen as null and filled with defaults from `defaults.c` when your +translator is loaded. In this particular example, since `rot-13` is an +exceptionally simple translator, we only fill in two entries for `readv` and +`writev`. + +There are actually two other tables, also required to have predefined names, +that are also used to find translator functions: `cbks` (which is empty in this + snippet) and `dumpops` (which is missing entirely). The first of these specifies + entry points for when inodes are forgotten or file descriptors are released. +In other words, they're destructors for objects in which your translator might + have an interest. Mostly you can ignore them, because the default behavior +handles even the simpler cases of translator-specific inode/fd context +automatically. However, if the context you attach is a complex structure +requiring complex cleanup, you'll need to supply these functions. As for +dumpops, that's just used if you want to provide functions to pretty-print +various structures in logs. I've never used it myself, though I probably +should. What's noteworthy here is that we don't even define dumpops. That's +because all of the functions that might use these dispatch functions will check + for `xl->dumpops` being `NULL` before calling through it. This is in sharp +contrast to the behavior for `fops` and `cbks`, which *must* be present. If +they're not, translator loading will fail because these pointers are not +checked every time and if they're `NULL` then we'll segfault. That's why we +provide an empty definition for cbks; it's OK for the individual function +pointers to be NULL, but not for the whole table to be absent. + +The last piece I'll cover today is options. As you can see, this is a table of +translator-specific option names and some information about their types. 
+GlusterFS actually provides a pretty rich set of types (`volume_option_type_t` +in `options.h`) which includes paths, translator names, percentages, and times +in addition to the obvious integers and strings. Also, the `volume_option_t` +structure can include information about alternate names, min/max/default +values, enumerated string values, and descriptions. We don't see any of these +here, so let's take a quick look at some more complex examples from `afr.c` and +then come back to `rot-13`. + +``` +{ .key = {"data-self-heal-algorithm"}, + .type = GF_OPTION_TYPE_STR, + .default_value = "", + .description = "Select between \"full\", \"diff\". The " + "\"full\" algorithm copies the entire file from " + "source to sink. The \"diff\" algorithm copies to " + "sink only those blocks whose checksums don't match " + "with those of source.", + .value = { "diff", "full", "" } +}, +{ .key = {"data-self-heal-window-size"}, + .type = GF_OPTION_TYPE_INT, + .min = 1, + .max = 1024, + .default_value = "1", + .description = "Maximum number blocks per file for which " + "self-heal process would be applied simultaneously." +}, +``` + +When your translator is loaded, all of this information is used to parse the +options actually provided in the volfile, and then the result is turned into a +dictionary and stored as `xl->options`. This dictionary is then processed by +your `init` function, which you can see being looked up in the first code +fragment above. We're only going to look at a small part of `rot-13`'s +`init` for now. + +``` +priv->decrypt_read = 1; +priv->encrypt_write = 1; + +data = dict_get (this->options, "encrypt-write"); +if (data) { + if (gf_string2boolean (data->data, &priv->encrypt_write) + == -1) { + gf_log (this->name, GF_LOG_ERROR, + "encrypt-write takes only boolean options"); + return -1; + } +} +``` + +What we can see here is that we're setting some defaults in our priv structure, +then looking to see if an `encrypt-write` option was actually provided.
If so, +we convert and store it. This is a pretty classic use of `dict_get` to fetch a +field from a dictionary, and of using one of many conversion functions in +`common-utils.c` to convert `data->data` into something we can use. + +So far we've covered the basics of how a translator gets loaded, how we find its +various parts, and how we process its options. In my next Translator 101 post, +we'll go a little deeper into other things that `init` and its companion `fini` +might do, and how some other fields in our `xlator_t` structure (commonly +referred to as `this`) are commonly used. + +`init`, `fini`, and private context +----------------------------------- + +In the previous Translator 101 post, we looked at some of the dispatch tables +and options processing in a translator. This time we're going to cover the rest + of the "shell" of a translator -- i.e. the other global parts not specific to +handling a particular request. + +Let's start by looking at the relationship between a translator and its shared +library. At a first approximation, this is the relationship between an object +and a class in just about any object-oriented programming language. The class +defines behaviors, but has to be instantiated as an object to have any kind of +existence. In our case the object is an `xlator_t`. Several of these might be +created within the same daemon, sharing all of the same code through init/fini +and dispatch tables, but sharing *no data*. You could implement shared data (as + static variables in your shared libraries) but that's strongly discouraged. +Every function in your shared library will get an `xlator_t` as an argument, +and should use it. This lack of class-level data is one of the points where +the analogy to common OOP systems starts to break down. Another place is the +complete lack of inheritance. Translators inherit behavior (code) from exactly +one shared library -- looked up and loaded using the `type` field in a volfile +`volume ...
end-volume` block -- and that's it -- not even single inheritance, +no subclasses or superclasses, no mixins or prototypes, just the relationship +between an object and its class. With that in mind, let's turn to the init +function that we just barely touched on last time. + +``` +int32_t +init (xlator_t *this) +{ + data_t *data = NULL; + rot_13_private_t *priv = NULL; + + if (!this->children || this->children->next) { + gf_log ("rot13", GF_LOG_ERROR, + "FATAL: rot13 should have exactly one child"); + return -1; + } + + if (!this->parents) { + gf_log (this->name, GF_LOG_WARNING, + "dangling volume. check volfile "); + } + + priv = GF_CALLOC (sizeof (rot_13_private_t), 1, 0); + if (!priv) + return -1; +``` + +At the very top, we see the function signature -- we get a pointer to the +`xlator_t` object that we're initializing, and we return an `int32_t` status. +As with most functions in the translator API, this should be zero to indicate +success. In this case it's safe to return -1 for failure, but watch out: in +dispatch-table functions, the return value means the status of the *function +call* rather than the *request*. A request error should be reflected as a +callback with a non-zero `op_ret` value, but the dispatch function itself +should still return zero. In fact, the handling of a non-zero return from a +dispatch function is not all that robust (we recently had a bug report in +HekaFS related to this) so it's something you should probably avoid +altogether. This only underscores the difference between dispatch functions +and `init`/`fini` functions, where non-zero returns *are* expected and handled +logically by aborting the translator setup. We can see that down at the +bottom, where we return -1 to indicate that we couldn't allocate our +private-data area (more about that later). + +The first thing this init function does is check that the translator is being +set up in the right kind of environment.
Translators are called by parents and +in turn call children. Some translators are "initial" translators that inject +requests into the system from elsewhere -- e.g. mount/fuse injecting requests +from the kernel, protocol/server injecting requests from the network. Those +translators don't need parents, but `rot-13` does and so we check for that. +Similarly, some translators are "final" translators that (from the perspective +of the current process) terminate requests instead of passing them on -- e.g. +`protocol/client` passing them to another node, `storage/posix` passing them to +a local filesystem. Other translators "multiplex" between multiple children -- + passing each parent request on to one (`cluster/dht`), some +(`cluster/stripe`), or all (`cluster/afr`) of those children. `rot-13` fits +into none of those categories either, so it checks that it has *exactly one* +child. It might be more convenient or robust if translator shared libraries +had standard variables describing these requirements, to be checked in a +consistent way by the translator-loading infrastructure itself instead of by +each separate init function, but this is the way translators work today. + +The last thing we see in this fragment is allocating our private data area. +This can literally be anything we want; the infrastructure just provides the +priv pointer as a convenience but takes no responsibility for how it's used. In + this case we're using `GF_CALLOC` to allocate our own `rot_13_private_t` +structure. This gets us all the benefits of GlusterFS's memory-leak detection +infrastructure, but the way we're calling it is not quite ideal. For one thing, + the first two arguments -- from `calloc(3)` -- are kind of reversed. For +another, notice how the last argument is zero. That can actually be an +enumerated value, to tell the GlusterFS allocator *what* type we're +allocating. 
This can be very useful information for memory profiling and leak +detection, so it's recommended that you follow the example of any +`xxx-mem-types.h` file elsewhere in the source tree instead of just passing +zero here (even though that works). + +To finish our tour of standard initialization/termination, let's look at the +end of `init` and the beginning of `fini`: + +``` + this->private = priv; + gf_log ("rot13", GF_LOG_DEBUG, "rot13 xlator loaded"); + return 0; +} + +void +fini (xlator_t *this) +{ + rot_13_private_t *priv = this->private; + + if (!priv) + return; + this->private = NULL; + GF_FREE (priv); +``` + +At the end of init we're just storing our private-data pointer in the `private` +field of our `xlator_t`, then returning zero to indicate that initialization +succeeded. As is usually the case, our fini is even simpler. All it really has +to do is `GF_FREE` our private-data pointer, which we do in a slightly +roundabout way here. Notice how we don't even have a return value here, since +there's nothing obvious and useful that the infrastructure could do if `fini` +failed. + +That's practically everything we need to know to get our translator through +loading, initialization, options processing, and termination. If we had defined + no dispatch functions, we could actually configure a daemon to use our +translator and it would work as a basic pass-through from its parent to a +single child. In the next post I'll cover how to build the translator and +configure a daemon to use it, so that we can actually step through it in a +debugger and see how it all fits together before we actually start adding +functionality. + +This Time For Real +------------------ + +In the first two parts of this series, we learned how to write a basic +translator skeleton that can get through loading, initialization, and option +processing. This time we'll cover how to build that translator, configure a +volume to use it, and run the glusterfs daemon in debug mode.
+ +Unfortunately, there's not much direct support for writing new translators. You +can check out a GlusterFS tree and splice in your own translator directory, but + that's a bit painful because you'll have to update multiple makefiles plus a +bunch of autoconf garbage. As part of the HekaFS project, I basically reverse +engineered the truly necessary parts of the translator-building process and +then pestered one of the Fedora glusterfs package maintainers (thanks +daMaestro!) to add a `glusterfs-devel` package with the required headers. Since + then the complexity level in the HekaFS tree has crept back up a bit, but I +still remember the simple method and still consider it the easiest way to get +started on a new translator. For the sake of those not using Fedora, I'm going +to describe a method that doesn't depend on that header package. What it does +depend on is a GlusterFS source tree, much as you might have cloned from GitHub + or the Gluster review site. This tree doesn't have to be fully built, but you +do need to run `autogen.sh` and configure in it. Then you can take the +following simple makefile and put it in a directory with your actual source. + +``` +# Change these to match your source code. +TARGET = rot-13.so +OBJECTS = rot-13.o + +# Change these to match your environment. +GLFS_SRC = /srv/glusterfs +GLFS_LIB = /usr/lib64 +HOST_OS = GF_LINUX_HOST_OS + +# You shouldn't need to change anything below here. + +CFLAGS = -fPIC -Wall -O0 -g \ + -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE \ + -D$(HOST_OS) -I$(GLFS_SRC) -I$(GLFS_SRC)/contrib/uuid \ + -I$(GLFS_SRC)/libglusterfs/src +LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB) -lglusterfs \ + -lpthread + +$(TARGET): $(OBJECTS) + $(CC) $(OBJECTS) $(LDFLAGS) -o $(TARGET) +``` + +Yes, it's still Linux-specific. Mea culpa. As you can see, we're sticking with +the `rot-13` example, so you can just copy the files from +`xlators/encryption/rot-13/src` in your GlusterFS tree to follow on. 
Type +`make` and you should be rewarded with a nice little `.so` file. + +``` +xlator_example$ ls -l rot-13.so +-rwxr-xr-x. 1 jeff jeff 40784 Nov 16 16:41 rot-13.so +``` + +Notice that we've built with optimization level zero and debugging symbols +included, which would not typically be the case for a packaged version of +GlusterFS. Let's put our version of `rot-13.so` into a slightly different file +on our system, so that it doesn't stomp on the installed version (not that +you'd ever want to use that anyway). + +``` +xlator_example# ls /usr/lib64/glusterfs/3git/xlator/encryption/ +crypt.so crypt.so.0 crypt.so.0.0.0 rot-13.so rot-13.so.0 +rot-13.so.0.0.0 +xlator_example# cp rot-13.so \ + /usr/lib64/glusterfs/3git/xlator/encryption/my-rot-13.so +``` + +These paths represent the current Gluster filesystem layout, which is likely to +be deprecated in favor of the Fedora layout; your paths may vary. At this point + we're ready to configure a volume using our new translator. To do that, I'm +going to suggest something that's strongly discouraged except during +development (the Gluster guys are going to hate me for this): write our own +volfile. Here's just about the simplest volfile you'll ever see. + +``` +volume my-posix + type storage/posix + option directory /srv/export +end-volume + +volume my-rot13 + type encryption/my-rot-13 + subvolumes my-posix +end-volume +``` + +All we have here is a basic brick using `/srv/export` for its data, and then +an instance of our translator layered on top -- no client or server is +necessary for what we're doing, and the system will automatically push a +mount/fuse translator on top if there's no server translator. To try this out, +all we need is the following command (assuming the directories involved already + exist). 
+ +``` +xlator_example$ glusterfs --debug -f my.vol /srv/import +``` + +You should be rewarded with a whole lot of log output, including the text of +the volfile (this is very useful for debugging problems in the field). If you +go to another window on the same machine, you can see that you have a new +filesystem mounted. + +``` +~$ df /srv/import +Filesystem 1K-blocks Used Available Use% Mounted on +/srv/xlator_example/my.vol + 114506240 2706176 105983488 3% /srv/import +``` + +Just for fun, write something into a file in `/srv/import`, then look at the +corresponding file in `/srv/export` to see it all `rot-13`'ed for you. + +``` +~$ echo hello > /srv/import/a_file +~$ cat /srv/export/a_file +uryyb +``` + +There you have it -- functionality you control, implemented easily, layered on +top of local storage. Now you could start adding functionality -- real +encryption, perhaps -- and inevitably having to debug it. You could do that the + old-school way, with `gf_log` (preferred) or even plain old `printf`, or you +could run daemons under `gdb` instead. Alternatively, you could wait for the +next Translator 101 post, where we'll be doing exactly that. + +Debugging a Translator +---------------------- + +Now that we've learned what a translator looks like and how to build one, it's +time to run one and actually watch it work. The best way to do this is good +old-fashioned `gdb`, as follows (using some of the examples from last time). + +``` +xlator_example# gdb glusterfs +GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6) +... +(gdb) r --debug -f my.vol /srv/import +Starting program: /usr/sbin/glusterfs --debug -f my.vol /srv/import +... +[2011-11-23 11:23:16.495516] I [fuse-bridge.c:2971:fuse_init] + 0-glusterfs-fuse: FUSE inited with protocol versions: + glusterfs 7.13 kernel 7.13 +``` + +If you get to this point, your glusterfs client process is already running. You +can go to another window to see the mountpoint, do file operations, etc. 
+ +``` +~# df /srv/import +Filesystem 1K-blocks Used Available Use% Mounted on +/root/xlator_example/my.vol + 114506240 2643968 106045568 3% /srv/import +~# ls /srv/import +a_file +~# cat /srv/import/a_file +hello +``` + +Now let's interrupt the process and see where we are. + +``` +^C +Program received signal SIGINT, Interrupt. +0x0000003a0060b3dc in pthread_cond_wait@@GLIBC_2.3.2 () + from /lib64/libpthread.so.0 +(gdb) info threads + 5 Thread 0x7fffeffff700 (LWP 27206) 0x0000003a002dd8c7 + in readv () + from /lib64/libc.so.6 + 4 Thread 0x7ffff50e3700 (LWP 27205) 0x0000003a0060b75b + in pthread_cond_timedwait@@GLIBC_2.3.2 () + from /lib64/libpthread.so.0 + 3 Thread 0x7ffff5f02700 (LWP 27204) 0x0000003a0060b3dc + in pthread_cond_wait@@GLIBC_2.3.2 () + from /lib64/libpthread.so.0 + 2 Thread 0x7ffff6903700 (LWP 27203) 0x0000003a0060f245 + in sigwait () + from /lib64/libpthread.so.0 +* 1 Thread 0x7ffff7957700 (LWP 27196) 0x0000003a0060b3dc + in pthread_cond_wait@@GLIBC_2.3.2 () + from /lib64/libpthread.so.0 +``` + +Like any non-toy server, this one has multiple threads. What are they all +doing? Honestly, even I don't know. Thread 1 turns out to be in +`event_dispatch_epoll`, which means it's the one handling all of our network +I/O. Note that with socket multi-threading patch this will change, with one +thread in `socket_poller` per connection. Thread 2 is in `glusterfs_sigwaiter` +which means signals will be isolated to that thread. Thread 3 is in +`syncenv_task`, so it's a worker process for synchronous requests such as +those used by the rebalance and repair code. Thread 4 is in +`janitor_get_next_fd`, so it's waiting for a chance to close no-longer-needed +file descriptors on the local filesystem. (I admit I had to look that one up, +BTW.) Lastly, thread 5 is in `fuse_thread_proc`, so it's the one fetching +requests from our FUSE interface. You'll often see many more threads than +this, but it's a pretty good basic set. 
Now, let's set a breakpoint so we can +actually watch a request. + +``` +(gdb) b rot13_writev +Breakpoint 1 at 0x7ffff50e4f0b: file rot-13.c, line 119. +(gdb) c +Continuing. +``` + +At this point we go into our other window and do something that will involve a write. + +``` +~# echo goodbye > /srv/import/another_file +(back to the first window) +[Switching to Thread 0x7fffeffff700 (LWP 27206)] + +Breakpoint 1, rot13_writev (frame=0x7ffff6e4402c, this=0x638440, + fd=0x7ffff409802c, vector=0x7fffe8000cd8, count=1, offset=0, + iobref=0x7fffe8001070) at rot-13.c:119 +119 rot_13_private_t *priv = (rot_13_private_t *)this->private; +``` + +Remember how we built with debugging symbols enabled and no optimization? That +will be pretty important for the next few steps. As you can see, we're in +`rot13_writev`, with several parameters. + +* `frame` is our always-present frame pointer for this request. Also, + `frame->local` will point to any local data we created and attached to the + request ourselves. +* `this` is a pointer to our instance of the `rot-13` translator. You can examine + it if you like to see the name, type, options, parent/children, inode table, + and other stuff associated with it. +* `fd` is a pointer to a file-descriptor *object* (`fd_t`, not just a + file-descriptor index which is what most people use "fd" for). This in turn + points to an inode object (`inode_t`) and we can associate our own + `rot-13`-specific data with either of these. +* `vector` and `count` together describe the data buffers for this write, which + we'll get to in a moment. +* `offset` is the offset into the file at which we're writing. +* `iobref` is a buffer-reference object, which is used to track the life cycle + of buffers containing read/write data. If you look closely, you'll notice that + `vector[0].iov_base` points to the same address as `iobref->iobrefs[0].ptr`, which + should give you some idea of the inter-relationships between vector and iobref. 
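The `vector`/`count` pair described above is an ordinary array of POSIX `struct iovec` entries, so the transformation that `rot-13` applies to a write can be sketched outside GlusterFS. What follows is a minimal standalone sketch rather than the translator's actual source: the function names mirror the ones visible in the debugger session, but the bodies are reconstructed from the stepped-through lines and assume lower-case-only handling.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Rotate lower-case ASCII letters by 13 places; all other bytes pass
 * through unchanged (assumption: this matches the lines seen in gdb). */
static void
rot13 (char *buf, size_t len)
{
        size_t i;

        for (i = 0; i < len; i++) {
                if (buf[i] >= 'a' && buf[i] <= 'z')
                        buf[i] = 'a' + ((buf[i] - 'a' + 13) % 26);
        }
}

/* Apply rot13 in place to every buffer described by the iovec array. */
static void
rot13_iovec (struct iovec *vector, int count)
{
        int i;

        for (i = 0; i < count; i++)
                rot13 (vector[i].iov_base, vector[i].iov_len);
}
```

Calling `rot13_iovec` on a two-element vector holding `"good"` and `"bye\n"` rewrites the buffers in place, which is why the transformation has to happen before the request is passed down to the child.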
+ +OK, now what about that `vector`? We can use it to examine the data being +written, like this. + +``` +(gdb) p vector[0] +$2 = {iov_base = 0x7ffff7936000, iov_len = 8} +(gdb) x/s 0x7ffff7936000 +0x7ffff7936000: "goodbye\n" +``` + +It's not always safe to view this data as a string, because it might just as +well be binary data, but since we're generating the write this time it's safe +and convenient. With that knowledge, let's step through things a bit. + +``` +(gdb) s +120 if (priv->encrypt_write) +(gdb) +121 rot13_iovec (vector, count); +(gdb) +rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57 +57 for (i = 0; i < count; i++) { +(gdb) +58 rot13 (vector[i].iov_base, vector[i].iov_len); +(gdb) +rot13 (buf=0x7ffff7936000 "goodbye\n", len=8) at rot-13.c:45 +45 for (i = 0; i < len; i++) { +(gdb) +46 if (buf[i] >= 'a' && buf[i] <= 'z') +(gdb) +47 buf[i] = 'a' + ((buf[i] - 'a' + 13) % 26); +``` + +Here we've stepped into `rot13_iovec`, which iterates through our vector +calling `rot13`, which in turn iterates through the characters in that chunk +doing the `rot-13` operation if/as appropriate. This is pretty straightforward +stuff, so let's skip to the next interesting bit. + +``` +(gdb) fin +Run till exit from #0 rot13 (buf=0x7ffff7936000 "goodbye\n", + len=8) at rot-13.c:47 +rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57 +57 for (i = 0; i < count; i++) { +(gdb) fin +Run till exit from #0 rot13_iovec (vector=0x7fffe8000cd8, + count=1) at rot-13.c:57 +rot13_writev (frame=0x7ffff6e4402c, this=0x638440, + fd=0x7ffff409802c, vector=0x7fffe8000cd8, count=1, + offset=0, iobref=0x7fffe8001070) at rot-13.c:123 +123 STACK_WIND (frame, +(gdb) b 129 +Breakpoint 2 at 0x7ffff50e4f35: file rot-13.c, line 129. +(gdb) b rot13_writev_cbk +Breakpoint 3 at 0x7ffff50e4db3: file rot-13.c, line 106. +(gdb) c +``` + +So we've set breakpoints on both the callback and the statement following the +`STACK_WIND`. Which one will we hit first? 
+ +``` +Breakpoint 3, rot13_writev_cbk (frame=0x7ffff6e4402c, + cookie=0x7ffff6e440d8, this=0x638440, op_ret=8, op_errno=0, + prebuf=0x7fffefffeca0, postbuf=0x7fffefffec30) + at rot-13.c:106 +106 STACK_UNWIND_STRICT (writev, frame, op_ret, op_errno, + prebuf, postbuf); +(gdb) bt +#0 rot13_writev_cbk (frame=0x7ffff6e4402c, + cookie=0x7ffff6e440d8, this=0x638440, op_ret=8, op_errno=0, + prebuf=0x7fffefffeca0, postbuf=0x7fffefffec30) + at rot-13.c:106 +#1 0x00007ffff52f1b37 in posix_writev (frame=0x7ffff6e440d8, + this=<value optimized out>, fd=<value optimized out>, + vector=<value optimized out>, count=1, + offset=<value optimized out>, iobref=0x7fffe8001070) + at posix.c:2217 +#2 0x00007ffff50e513e in rot13_writev (frame=0x7ffff6e4402c, + this=0x638440, fd=0x7ffff409802c, vector=0x7fffe8000cd8, + count=1, offset=0, iobref=0x7fffe8001070) at rot-13.c:123 +``` + +Surprise! We're in `rot13_writev_cbk` now, called (indirectly) while we're +still in `rot13_writev` before `STACK_WIND` returns (still at rot-13.c:123). If + you did any request cleanup here, then you need to be careful about what you +do in the remainder of `rot13_writev` because data may have been freed etc. +It's tempting to say you should just do the cleanup in `rot13_writev` after +the `STACK_WIND`, but that's not valid because it's also possible that some +other translator returned without calling `STACK_UNWIND` -- i.e. before +`rot13_writev_cbk` is called, so then it would be the one getting null-pointer +errors instead. To put it another way, the callback and the return from +`STACK_WIND` can occur in either order or even simultaneously on different +threads. Even if you were to use reference counts, you'd have to make sure to +use locking or atomic operations to avoid races, and it's not worth it.
Unless +you *really* understand the possible flows of control and know what you're +doing, it's better to do cleanup in the callback and nothing after +`STACK_WIND`. + +At this point all that's left is a `STACK_UNWIND` and a return. The +`STACK_UNWIND` invokes our parent's completion callback, and in this case our +parent is FUSE so at that point the VFS layer is notified of the write being +complete. Finally, we return through several levels of normal function calls +until we come back to `fuse_thread_proc`, which waits for the next request. + +So that's it. For extra fun, you might want to repeat this exercise by stepping +through some other call -- stat or setxattr might be good choices -- but you'll + have to use a translator that actually implements those calls to see much +that's interesting. Then you'll pretty much know everything I knew when I +started writing my first for-real translators, and probably even a bit more. I +hope you've enjoyed this series, or at least found it useful, and if you have +any suggestions for other topics I should cover please let me know (via +comments or email, IRC or Twitter). diff --git a/doc/hacker-guide/en-US/markdown/write-behind.md b/doc/hacker-guide/en-US/markdown/write-behind.md new file mode 100644 index 000000000..e20682249 --- /dev/null +++ b/doc/hacker-guide/en-US/markdown/write-behind.md @@ -0,0 +1,56 @@ +performance/write-behind translator +=================================== + +Basic working +-------------- + +Write-behind is a translator that tells the application its write requests +have finished before they actually have. + +On a regular translator tree without write-behind, control flow is like this: + +1. application makes a `write()` system call. +2. VFS ==> FUSE ==> `/dev/fuse`. +3. fuse-bridge initiates a glusterfs `writev()` call. +4. `writev()` is `STACK_WIND()`ed up to client-protocol or storage translator. +5.
client-protocol, on receiving reply from server, starts `STACK_UNWIND()` towards the fuse-bridge. + +On a translator tree with write-behind, control flow is like this: + +1. application makes a `write()` system call. +2. VFS ==> FUSE ==> `/dev/fuse`. +3. fuse-bridge initiates a glusterfs `writev()` call. +4. `writev()` is `STACK_WIND()`ed up to write-behind translator. +5. write-behind adds the write buffer to its internal queue and does a `STACK_UNWIND()` towards the fuse-bridge. + +The write call is now complete from the application's perspective. After +`STACK_UNWIND()`ing towards the fuse-bridge, write-behind initiates a fresh +`writev()` call to its child translator, whose replies will be consumed by +write-behind itself. Write-behind _doesn't_ cache the write buffer, unless +`option flush-behind on` is specified in volume specification file. + +Windowing +--------- + +With respect to write-behind, each write-buffer has three flags: `stack_wound`, `write_behind` and `got_reply`. + +* `stack_wound`: if set, indicates that write-behind has initiated `STACK_WIND()` towards child translator. +* `write_behind`: if set, indicates that write-behind has done `STACK_UNWIND()` towards fuse-bridge. +* `got_reply`: if set, indicates that write-behind has received reply from child translator for a `writev()` `STACK_WIND()`. A request will be destroyed by write-behind only if this flag is set. + +Currently pending write requests = aggregate size of requests with `write_behind` = 1 and `got_reply` = 0. + +The window size limits the aggregate size of currently pending write requests. Once +the pending requests' size has reached the window size, write-behind blocks +`writev()` calls from fuse-bridge. Blocking is only from the application's +perspective. Write-behind does `STACK_WIND()` to child translator +straight away, but holds back the `STACK_UNWIND()` towards fuse-bridge. +`STACK_UNWIND()` is done only once write-behind gets enough replies to +accommodate the currently blocked request.
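The window accounting described above can be sketched in a few lines of C. This is an illustration under stated assumptions, not the write-behind source: `struct wb_request_sketch`, `wb_pending_size` and `wb_can_unwind_now` are hypothetical names, and only the three flags and the "pending size" definition come from the text.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-request record; the flag names follow the text,
 * everything else is an assumption made for illustration. */
struct wb_request_sketch {
        size_t size;         /* bytes carried by this write request     */
        bool   stack_wound;  /* STACK_WIND() issued towards the child   */
        bool   write_behind; /* STACK_UNWIND() already sent to parent   */
        bool   got_reply;    /* child has replied to the wound writev() */
};

/* Aggregate size of requests with write_behind = 1 and got_reply = 0. */
static size_t
wb_pending_size (const struct wb_request_sketch *reqs, size_t nreqs)
{
        size_t total = 0;
        size_t i;

        for (i = 0; i < nreqs; i++) {
                if (reqs[i].write_behind && !reqs[i].got_reply)
                        total += reqs[i].size;
        }
        return total;
}

/* A new write may be unwound immediately only while the pending size
 * has not yet reached the window size. */
static bool
wb_can_unwind_now (const struct wb_request_sketch *reqs, size_t nreqs,
                   size_t window_size)
{
        return wb_pending_size (reqs, nreqs) < window_size;
}
```

A request that has been unwound and has also received its reply no longer counts against the window, which is how replies from the child make room for the blocked writes.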
+ +Flush behind +------------ + +If `option flush-behind on` is specified in volume specification file, then +write-behind sends aggregate write requests to child translator, instead of +regular per request `STACK_WIND()`s. diff --git a/doc/hacker-guide/hacker-guide.tex b/doc/hacker-guide/hacker-guide.tex deleted file mode 100644 index c2d7255d7..000000000 --- a/doc/hacker-guide/hacker-guide.tex +++ /dev/null @@ -1,309 +0,0 @@ -\documentclass{book}[12pt] -\usepackage{graphicx} -% \usepackage{fancyhdr} - -% \pagestyle{fancy} -\begin{document} - -% \headheight 117pt -% \rhead{\includegraphics{zr-logo.eps}} - -\author{Gluster} -\title{GlusterFS 1.3 Hacker's Guide} -\date{June 1, 2007} - -\maketitle -\frontmatter -\tableofcontents - -\mainmatter -\chapter{Introduction} - -\section{Coding guidelines} -GlusterFS uses Git for version control. To get the latest source do: -\begin{verbatim} - $ git clone git://git.gluster.com/glusterfs.git glusterfs -\end{verbatim} -\noindent -GlusterFS follows the GNU coding -standards\footnote{http://www.gnu.org/prep/standards\_toc.html} for the -most part. - -\chapter{Major components} -\section{libglusterfs} -\texttt{libglusterfs} contains supporting code used by all the other components. -The important files here are: - -\texttt{dict.c}: This is an implementation of a serializable dictionary type. It is -used by the protocol code to send requests and replies. It is also used to pass options -to translators. - -\texttt{logging.c}: This is a thread-safe logging library. The log messages go to a -file (default \texttt{/usr/local/var/log/glusterfs/*}). - -\texttt{protocol.c}: This file implements the GlusterFS on-the-wire -protocol. The protocol itself is a simple ASCII protocol, designed to -be easy to parse and be human readable. 
- -A sample GlusterFS protocol block looks like this: -\begin{verbatim} - Block Start header - 0000000000000023 callid - 00000001 type - 00000016 op - xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx human-readable name - 00000000000000000000000000000ac3 block size - <...> block - Block End -\end{verbatim} - -\texttt{stack.h}: This file defines the \texttt{STACK\_WIND} and -\texttt{STACK\_UNWIND} macros which are used to implement the parallel -stack that is maintained for inter-xlator calls. See the \textsl{Taking control -of the stack} section below for more details. - -\texttt{spec.y}: This contains the Yacc grammar for the GlusterFS -specification file, and the parsing code. - - -Draw diagrams of trees -Two rules: -(1) directory structure is same -(2) file can exist only on one node - -\section{glusterfs-fuse} -\section{glusterfsd} -\section{transport} -\section{scheduler} -\section{xlator} - -\chapter{xlators} -\section{Taking control of the stack} -One can think of STACK\_WIND/UNWIND as a very specific RPC mechanism. - -% \includegraphics{stack.eps} - -\section{Overview of xlators} - -\flushleft{\LARGE\texttt{cluster/}} -\vskip 2ex -\flushleft{\Large\texttt{afr}} -\vskip 2ex -\flushleft{\Large\texttt{stripe}} -\vskip 2ex -\flushleft{\Large\texttt{unify}} - -\vskip 4ex -\flushleft{\LARGE\texttt{debug/}} -\vskip 2ex -\flushleft{\Large\texttt{trace}} -\vskip 2ex -The trace xlator simply logs all fops and mops, and passes them through to its child. - -\vskip 4ex -\flushleft{\LARGE\texttt{features/}} -\flushleft{\Large\texttt{posix-locks}} -\vskip 2ex -This xlator implements \textsc{posix} record locking semantics over -any kind of storage. 
- -\vskip 4ex -\flushleft{\LARGE\texttt{performance/}} - -\flushleft{\Large\texttt{io-threads}} -\vskip 2ex -\flushleft{\Large\texttt{read-ahead}} -\vskip 2ex -\flushleft{\Large\texttt{stat-prefetch}} -\vskip 2ex -\flushleft{\Large\texttt{write-behind}} -\vskip 2ex - -\vskip 4ex -\flushleft{\LARGE\texttt{protocol/}} -\vskip 2ex - -\flushleft{\Large\texttt{client}} -\vskip 2ex - -\flushleft{\Large\texttt{server}} -\vskip 2ex - -\vskip 4ex -\flushleft{\LARGE\texttt{storage/}} -\flushleft{\Large\texttt{posix}} -\vskip 2ex -The \texttt{posix} xlator is the one which actually makes calls to the -on-disk filesystem. Currently this is the only storage xlator available. However, -plans to develop other storage xlators, such as one for Amazon's S3 service, are -on the roadmap. - -\chapter{Writing a simple xlator} -\noindent -In this section we're going to write a rot13 xlator. ``Rot13'' is a -simple substitution cipher which obscures a text by replacing each -letter with the letter thirteen places down the alphabet. So `a' (0) -would become `n' (12), `b' would be 'm', and so on. Rot13 applied to -a piece of ciphertext yields the plaintext again, because rot13 is its -own inverse, since: - -\[ -x_c = x + 13\; (mod\; 26) -\] -\[ -x_c + 13\; (mod\; 26) = x + 13 + 13\; (mod\; 26) = x -\] - -First we include the requisite headers. - -\begin{verbatim} -#include <ctype.h> -#include <sys/uio.h> - -#include "glusterfs.h" -#include "xlator.h" -#include "logging.h" - -/* - * This is a rot13 ``encryption'' xlator. It rot13's data when - * writing to disk and rot13's it back when reading it. - * This xlator is meant as an example, not for production - * use ;) (hence no error-checking) - */ - -\end{verbatim} - -Then we write the rot13 function itself. For simplicity, we only transform lower case -letters. Any other byte is passed through as it is. 
-
-\begin{verbatim}
-/* We only handle lower case letters for simplicity */
-static void
-rot13 (char *buf, int len)
-{
-        int i;
-        for (i = 0; i < len; i++) {
-                /* rotate within 'a'..'z'; every other byte is left as it is */
-                if (buf[i] >= 'a' && buf[i] <= 'z')
-                        buf[i] = (buf[i] - 'a' + 13) % 26 + 'a';
-        }
-}
-\end{verbatim}
-
-Next comes a utility function whose purpose will be clear after looking at the code
-below.
-
-\begin{verbatim}
-static void
-rot13_iovec (struct iovec *vector, int count)
-{
-        int i;
-        for (i = 0; i < count; i++) {
-                rot13 (vector[i].iov_base, vector[i].iov_len);
-        }
-}
-\end{verbatim}
-
-\begin{verbatim}
-static int32_t
-rot13_readv_cbk (call_frame_t *frame,
-                 call_frame_t *prev_frame,
-                 xlator_t *this,
-                 int32_t op_ret,
-                 int32_t op_errno,
-                 struct iovec *vector,
-                 int32_t count)
-{
-        rot13_iovec (vector, count);
-
-        STACK_UNWIND (frame, op_ret, op_errno, vector, count);
-        return 0;
-}
-
-static int32_t
-rot13_readv (call_frame_t *frame,
-             xlator_t *this,
-             dict_t *ctx,
-             size_t size,
-             off_t offset)
-{
-        STACK_WIND (frame,
-                    rot13_readv_cbk,
-                    FIRST_CHILD (this),
-                    FIRST_CHILD (this)->fops->readv,
-                    ctx, size, offset);
-        return 0;
-}
-
-static int32_t
-rot13_writev_cbk (call_frame_t *frame,
-                  call_frame_t *prev_frame,
-                  xlator_t *this,
-                  int32_t op_ret,
-                  int32_t op_errno)
-{
-        STACK_UNWIND (frame, op_ret, op_errno);
-        return 0;
-}
-
-static int32_t
-rot13_writev (call_frame_t *frame,
-              xlator_t *this,
-              dict_t *ctx,
-              struct iovec *vector,
-              int32_t count,
-              off_t offset)
-{
-        rot13_iovec (vector, count);
-
-        STACK_WIND (frame,
-                    rot13_writev_cbk,
-                    FIRST_CHILD (this),
-                    FIRST_CHILD (this)->fops->writev,
-                    ctx, vector, count, offset);
-        return 0;
-}
-
-\end{verbatim}
-
-Every xlator must define two functions and two external symbols. The functions are
-\texttt{init} and \texttt{fini}, and the symbols are \texttt{fops} and \texttt{mops}.
-The \texttt{init} function is called when the xlator is loaded by GlusterFS, and
-contains code for the xlator to initialize itself. Note that if an xlator is present
-multiple times in the spec tree, the \texttt{init} function will be called each time
-the xlator is loaded.
-
-\begin{verbatim}
-int32_t
-init (xlator_t *this)
-{
-        /* reject both "no child" and "more than one child" */
-        if (!this->children || this->children->next) {
-                gf_log ("rot13", GF_LOG_ERROR,
-                        "FATAL: rot13 should have exactly one child");
-                return -1;
-        }
-
-        gf_log ("rot13", GF_LOG_DEBUG, "rot13 xlator loaded");
-        return 0;
-}
-\end{verbatim}
-
-\begin{verbatim}
-
-void
-fini (xlator_t *this)
-{
-        return;
-}
-
-struct xlator_fops fops = {
-        .readv  = rot13_readv,
-        .writev = rot13_writev
-};
-
-
-\end{verbatim}
-
-\end{document}
-
diff --git a/doc/hacker-guide/lock-ahead.txt b/doc/hacker-guide/lock-ahead.txt
deleted file mode 100644
index 63392b7fa..000000000
--- a/doc/hacker-guide/lock-ahead.txt
+++ /dev/null
@@ -1,80 +0,0 @@
-                      Lock-ahead translator
-                      ---------------------
-
-The objective of the lock-ahead translator is to speculatively
-hold locks (inodelk and entrylk) on the universal set (0 - infinity
-in case of inodelk and all basenames in case of entrylk) even
-when a lock is requested only on a subset, in anticipation that
-further locks will be requested within the same universal set.
-
-So, for example, when cluster/replicate locks a region before
-writing to it, lock-ahead would instead lock the entire file.
-On further writes, lock-ahead can immediately return success for
-the lock requests, since the entire file has been previously locked.
-
-To avoid starvation of other clients/mountpoints, we employ a
-notify mechanism, described below.
-
-typedef struct {
-        struct list_head subset_locks;
-} la_universal_lock_t;
-
-Universal lock structure is stored in the inode context.
-
-typedef struct {
-        enum {LOCK_AHEAD_ENTRYLK, LOCK_AHEAD_FENTRYLK,
-              LOCK_AHEAD_INODELK, LOCK_AHEAD_FINODELK};
-
-        union {
-                fd_t *fd;
-                loc_t loc;
-        };
-
-        off_t l_start;
-        off_t l_len;
-
-        const char *basename;
-
-        struct list_head universal_lock;
-} la_subset_lock_t;
-
-
-fops implemented:
-
-* inodelk/finodelk/entrylk/fentrylk:
-
-lock:
-  if universal lock held:
-    add subset to it (save loc_t or fd) and return success
-  else:
-    send lock-notify fop
-    hold universal lock and return
-    (set inode context, add subset to it, save loc_t or fd)
-
-    if this fails:
-      forward the lock request
-
-unlock:
-  if subset exists in universal lock:
-    delete subset lock from list
-  else:
-    forward it
-
-* release:
-  hold subset locks (each subset lock using the saved loc_t or fd)
-  and release universal lock
-
-* lock-notify (on unwind) (new fop)
-  hold subset locks and release universal lock
-
-
-lock-notify in locks translator:
-
-if a subset lock in entrylk/inodelk cannot be satisfied
-because of a universal lock held by someone else:
-  unwind the lock-notify fop
-
-==============================================
-$ Last updated: Tue Feb 17 11:31:18 IST 2009 $
-$ Author: Vikas Gorur <vikas@gluster.com> $
-==============================================
diff --git a/doc/hacker-guide/posix.txt b/doc/hacker-guide/posix.txt
deleted file mode 100644
index d0132abfe..000000000
--- a/doc/hacker-guide/posix.txt
+++ /dev/null
@@ -1,59 +0,0 @@
----------------
-* storage/posix
----------------
-
-- SET_FS_ID
-
-  This is so that all filesystem checks are done with the user's
-  uid/gid and not GlusterFS's uid/gid.
-
-- MAKE_REAL_PATH
-
-  This macro concatenates the base directory of the posix volume
-  ('option directory') with the given path.
-
-- need_xattr in lookup
-
-  If this flag is passed, lookup returns a xattr dictionary that contains
-  the file's create time, the file's contents, and the version number
-  of the file.
-
-  This is a hack to increase small file performance. If an application
-  wants to read a small file, it can finish its job with just a lookup
-  call instead of a lookup followed by read.
-
-- getdents/setdents
-
-  These are used by unify to set and get directory entries.
-
-- ALIGN_BUF
-
-  Macro to align an address to a page boundary (4K).
-
-- priv->export_statfs
-
-  In some cases, two exported volumes may reside on the same
-  partition on the server. Sending statvfs info for both
-  the volumes will lead to erroneous df output at the client,
-  since free space on the partition will be counted twice.
-
-  In such cases, user can disable exporting statvfs info
-  on one of the volumes by setting this option.
-
-- xattrop
-
-  This fop is used by replicate to set version numbers on files.
-
-- getxattr/setxattr hack to read/write files
-
-  A key, GLUSTERFS_FILE_CONTENT_STRING, is handled in a special way by
-  getxattr/setxattr. A getxattr with the key will return the entire
-  content of the file as the value. A setxattr with the key will write
-  the value as the entire content of the file.
-
-- posix_checksum
-
-  This calculates a simple XOR checksum on all entry names in a
-  directory that is used by unify to compare directory contents.
-
-
diff --git a/doc/hacker-guide/replicate.txt b/doc/hacker-guide/replicate.txt
deleted file mode 100644
index fd1ef2747..000000000
--- a/doc/hacker-guide/replicate.txt
+++ /dev/null
@@ -1,206 +0,0 @@
----------------
-* cluster/replicate
----------------
-
-Before understanding replicate, one must understand two internal FOPs:
-
-GF_FILE_LK:
-    This is exactly like fcntl(2) locking, except the locks are in a
-    separate domain from locks held by applications.
-
-GF_DIR_LK (loc_t *loc, char *basename):
-    This allows one to lock a name under a directory. For example,
-    to lock /mnt/glusterfs/foo, one would use the call:
-
-        GF_DIR_LK ({loc_t for "/mnt/glusterfs"}, "foo")
-
-    If one wishes to lock *all* the names under a particular directory,
-    supply the basename argument as NULL.
-
-    The locks can either be read locks or write locks; consult the
-    function prototype for more details.
-
-Both these operations are implemented by the features/locks (earlier
-known as posix-locks) translator.
-
---------------
-* Basic design
---------------
-
-All FOPs can be classified into four major groups:
-
-  - inode-read
-    Operations that read an inode's data (file contents) or metadata (perms, etc.).
-
-    access, getxattr, fstat, readlink, readv, stat.
-
-  - inode-write
-    Operations that modify an inode's data or metadata.
-
-    chmod, chown, truncate, writev, utimens.
-
-  - dir-read
-    Operations that read a directory's contents or metadata.
-
-    readdir, getdents, checksum.
-
-  - dir-write
-    Operations that modify a directory's contents or metadata.
-
-    create, link, mkdir, mknod, rename, rmdir, symlink, unlink.
-
-    Some of these make a subgroup in that they modify *two* different entries:
-    link, rename, symlink.
-
-  - Others
-    Other operations.
-
-    flush, lookup, open, opendir, statfs.
-
-------------
-* Algorithms
-------------
-
-Each of the four major groups has its own algorithm:
-
-  ----------------------
-  - inode-read, dir-read
-  ----------------------
-
-    = Send a request to the first child that is up:
-      - if it fails:
-          try the next available child
-      - if we have exhausted all children:
-          return failure
-
-  -------------
-  - inode-write
-  -------------
-
-  All operations are done in parallel unless specified otherwise.
-
-  (1) Send a GF_FILE_LK request on all children for a write lock on
-      the appropriate region
-      (for metadata operations: entire file (0, 0)
-       for writev: (offset, offset+size of buffer))
-
-      - If a lock request fails on a child:
-          unlock all children
-          try to acquire a blocking lock (F_SETLKW) on each child, serially.
-
-          If this fails (due to ENOTCONN or EINVAL):
-             Consider this child as dead for rest of transaction.
-
-  (2) Mark all children as "pending" on all (alive) children
-      (see below for meaning of "pending").
-
-      - If it fails on any child:
-          mark it as dead (in transaction local state).
-
-  (3) Perform operation on all (alive) children.
-
-      - If it fails on any child:
-          mark it as dead (in transaction local state).
-
-  (4) Mark all successful children as no longer "pending" on all nodes.
-
-  (5) Unlock region on all (alive) children.
-
-  -----------
-  - dir-write
-  -----------
-
-  The algorithm for dir-write is the same as above, except that instead of
-  holding GF_FILE_LK locks we hold a GF_DIR_LK lock on the name being operated
-  upon. In case of link-type calls, we hold locks on both the operand names.
-
------------
-* "pending"
------------
-
-  The "pending" number is like a journal entry. A pending entry is an
-  array of 32-bit integers stored in network byte-order as the extended
-  attribute of an inode (which can be a directory as well).
-
-  There are three keys corresponding to three types of pending operations:
-
-    - AFR_METADATA_PENDING
-        There are some metadata operations pending on this inode (perms, ctime/mtime,
-        xattr, etc.).
-
-    - AFR_DATA_PENDING
-        There is some data pending on this inode (writev).
-
-    - AFR_ENTRY_PENDING
-        There are some directory operations pending on this directory
-        (create, unlink, etc.).
-
------------
-* Self heal
------------
-
-  - On lookup, gather extended attribute data:
-    - If entry is a regular file:
-      - If an entry is present on one child and not on others:
-        - create entry on others.
-      - If entries exist but have different metadata (perms, etc.):
-        - consider the entry with the highest AFR_METADATA_PENDING number as
-          definitive and replicate its attributes on children.
-
-    - If entry is a directory:
-      - Consider the entry with the highest AFR_ENTRY_PENDING number as
-        definitive and replicate its contents on all children.
-
-    - If any two entries have non-matching types (i.e., one is file and
-      other is directory):
-      - Announce to the user via log that a split-brain situation has been
-        detected, and do nothing.
-
-  - On open, gather extended attribute data:
-    - Consider the file with the highest AFR_DATA_PENDING number as
-      the definitive one and replicate its contents on all other
-      children.
-
-  During all self heal operations, appropriate locks must be held on all
-  regions/entries being affected.
-
----------------
-* Inode scaling
----------------
-
-Inode scaling is necessary because if a situation arises where:
-  - An inode number is returned for a directory (by lookup) which was
-    previously the inode number of a file (as per FUSE's table), then
-    FUSE gets horribly confused (consult a FUSE expert for more details).
-
-To avoid such a situation, we distribute the 64-bit inode space equally
-among all children of replicate.
-
-To illustrate:
-
-If c1, c2, c3 are children of replicate, they each get 1/3 of the available
-inode space:
-
-Child:         c1  c2  c3  c1  c2  c3  c1  c2  c3  c1  c2  ...
-Inode number:   1   2   3   4   5   6   7   8   9  10  11  ...
-
-Thus, if lookup on c1 returns an inode number "2", it is scaled to "4"
-(which is the second inode number in c1's space).
-
-This way we ensure that there is never a collision of inode numbers from
-two different children.
-
-This reduction of inode space doesn't really reduce the usability of
-replicate since even if we assume replicate has 1024 children (which would be a
-highly unusual scenario), each child still has a 54-bit inode space.
-
-2^54 ~ 1.8 * 10^16
-
-which is much larger than any real world requirement.
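
The interleaving shown in the inode scaling table above can be sketched in C. This is a minimal illustration under stated assumptions, not the replicate source: the names scale_inode, unscale_inode, and child_of_inode are hypothetical, child indices are assumed to be 0-based (c1 is index 0), and inode numbers are assumed to start at 1, as in the table.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map inode number `ino` reported by the child at
 * 0-based position `child_index` into the interleaved space shared by
 * `child_count` children. Inode numbers are assumed to start at 1. */
static uint64_t
scale_inode (uint64_t ino, uint64_t child_index, uint64_t child_count)
{
        assert (ino >= 1 && child_index < child_count);
        return (ino - 1) * child_count + child_index + 1;
}

/* Hypothetical inverse: recover the child's own inode number. */
static uint64_t
unscale_inode (uint64_t scaled, uint64_t child_count)
{
        assert (scaled >= 1 && child_count >= 1);
        return (scaled - 1) / child_count + 1;
}

/* Hypothetical inverse: recover which child (0-based) owns a scaled number. */
static uint64_t
child_of_inode (uint64_t scaled, uint64_t child_count)
{
        assert (scaled >= 1 && child_count >= 1);
        return (scaled - 1) % child_count;
}
```

With three children, scale_inode (2, 0, 3) gives 4, matching the example in the text, and two different children can never produce the same scaled number because the remainder modulo the child count differs.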
-
-
-==============================================
-$ Last updated: Sun Oct 12 23:17:01 IST 2008 $
-$ Author: Vikas Gorur <vikas@gluster.com> $
-==============================================
-
diff --git a/doc/hacker-guide/write-behind.txt b/doc/hacker-guide/write-behind.txt
deleted file mode 100644
index 498e95480..000000000
--- a/doc/hacker-guide/write-behind.txt
+++ /dev/null
@@ -1,45 +0,0 @@
-basic working
---------------
-
-  write behind is basically a translator that lies to the application that its write requests are finished, even before they are actually finished.
-
-  on a regular translator tree without write-behind, control flow is like this:
-
-  1. application makes a write() system call.
-  2. VFS ==> FUSE ==> /dev/fuse.
-  3. fuse-bridge initiates a glusterfs writev() call.
-  4. writev() is STACK_WIND()ed up to client-protocol or storage translator.
-  5. client-protocol, on receiving reply from server, starts STACK_UNWIND() towards the fuse-bridge.
-
-  on a translator tree with write-behind, control flow is like this:
-
-  1. application makes a write() system call.
-  2. VFS ==> FUSE ==> /dev/fuse.
-  3. fuse-bridge initiates a glusterfs writev() call.
-  4. writev() is STACK_WIND()ed up to write-behind translator.
-  5. write-behind adds the write buffer to its internal queue and does a STACK_UNWIND() towards the fuse-bridge.
-
-  write call is completed from the application's perspective. after STACK_UNWIND()ing towards the fuse-bridge, write-behind initiates a fresh writev() call to its child translator, whose replies will be consumed by write-behind itself. write-behind _doesn't_ cache the write buffer, unless 'option flush-behind on' is specified in volume specification file.
-
-windowing
----------
-
-  with respect to write-behind, each write-buffer has three flags: 'stack_wound', 'write_behind' and 'got_reply'.
-
-  stack_wound: if set, indicates that write-behind has initiated STACK_WIND() towards child translator.
-
-  write_behind: if set, indicates that write-behind has done STACK_UNWIND() towards fuse-bridge.
-
-  got_reply: if set, indicates that write-behind has received reply from child translator for a writev() STACK_WIND(). a request will be destroyed by write-behind only if this flag is set.
-
-  currently pending write requests = aggregate size of requests with write_behind = 1 and got_reply = 0.
-
-  window size limits the aggregate size of currently pending write requests. once the pending requests' size has reached the window size, write-behind blocks writev() calls from fuse-bridge.
-  blocking is only from application's perspective. write-behind does STACK_WIND() to child translator straight away, but holds back the STACK_UNWIND() towards fuse-bridge. STACK_UNWIND() is done only once write-behind gets enough replies to accommodate the currently blocked request.
-
-flush behind
-------------
-
-  if 'option flush-behind on' is specified in volume specification file, then write-behind sends aggregate write requests to child translator, instead of regular per request STACK_WIND()s.
-
-
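
The window accounting described in the windowing section above can be sketched in C. This is an illustrative model only, not the write-behind source: the type wb_request_t and the functions wb_pending_size and wb_can_unwind_now are hypothetical names, and the admission rule (pending size plus the new request must fit in the window) is an assumption drawn from the description, not a confirmed detail of the translator.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one queued write, carrying the three flags
 * named in the text. */
typedef struct {
        size_t size;          /* bytes in this write buffer */
        bool   stack_wound;   /* STACK_WIND() sent to the child translator */
        bool   write_behind;  /* STACK_UNWIND() already sent to fuse-bridge */
        bool   got_reply;     /* child has replied to the STACK_WIND() */
} wb_request_t;

/* Aggregate size of requests with write_behind = 1 and got_reply = 0. */
static size_t
wb_pending_size (const wb_request_t *reqs, size_t count)
{
        size_t i;
        size_t pending = 0;

        for (i = 0; i < count; i++) {
                if (reqs[i].write_behind && !reqs[i].got_reply)
                        pending += reqs[i].size;
        }
        return pending;
}

/* Assumed rule: a new request may be unwound immediately only if it
 * still fits in the window; otherwise its STACK_UNWIND() is held back
 * until enough replies arrive. */
static bool
wb_can_unwind_now (const wb_request_t *reqs, size_t count,
                   size_t new_size, size_t window_size)
{
        return wb_pending_size (reqs, count) + new_size <= window_size;
}
```

In this model a request that has been wound but not yet unwound does not count against the window, which mirrors the statement that blocking is visible only from the application's side.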
