Diffstat (limited to 'doc/hacker-guide')
-rw-r--r-- | doc/hacker-guide/Makefile.am | 8
-rw-r--r-- | doc/hacker-guide/adding-fops.txt | 33
-rw-r--r-- | doc/hacker-guide/bdb.txt | 70
-rw-r--r-- | doc/hacker-guide/call-stub.txt | 1033
-rw-r--r-- | doc/hacker-guide/hacker-guide.tex | 312
-rw-r--r-- | doc/hacker-guide/posix.txt | 59
-rw-r--r-- | doc/hacker-guide/replicate.txt | 206
-rw-r--r-- | doc/hacker-guide/write-behind.txt | 45
8 files changed, 1766 insertions, 0 deletions
diff --git a/doc/hacker-guide/Makefile.am b/doc/hacker-guide/Makefile.am new file mode 100644 index 000000000..65c92ac23 --- /dev/null +++ b/doc/hacker-guide/Makefile.am @@ -0,0 +1,8 @@ +EXTRA_DIST = replicate.txt bdb.txt posix.txt call-stub.txt write-behind.txt + +#EXTRA_DIST = hacker-guide.tex afr.txt bdb.txt posix.txt call-stub.txt write-behind.txt +#hacker_guidedir = $(docdir) +#hacker_guide_DATA = hacker-guide.pdf + +#hacker-guide.pdf: $(EXTRA_DIST) +# pdflatex $(srcdir)/hacker-guide.tex diff --git a/doc/hacker-guide/adding-fops.txt b/doc/hacker-guide/adding-fops.txt new file mode 100644 index 000000000..293de2637 --- /dev/null +++ b/doc/hacker-guide/adding-fops.txt @@ -0,0 +1,33 @@ + HOW TO ADD A NEW FOP TO GlusterFS + ================================= + +Steps to be followed when adding a new FOP to GlusterFS: + +1. Edit glusterfs.h and add a GF_FOP_* constant. + +2. Edit xlator.[ch] and: + 2a. add the new prototype for fop and callback. + 2b. edit xlator_fops structure. + +3. Edit xlator.c and add to fill_defaults. + +4. Edit protocol.h and add struct necessary for the new FOP. + +5. Edit defaults.[ch] and provide default implementation. + +6. Edit call-stub.[ch] and provide stub implementation. + +7. Edit common-utils.c and add to gf_global_variable_init(). + +8. Edit client-protocol and add your FOP. + +9. Edit server-protocol and add your FOP. + +10. Implement your FOP in any translator for which the default implementation + is not sufficient. + +========================================== +Last updated: Mon Oct 27 21:35:49 IST 2008 + +Author: Vikas Gorur <vikas@zresearch.com> +========================================== diff --git a/doc/hacker-guide/bdb.txt b/doc/hacker-guide/bdb.txt new file mode 100644 index 000000000..fd0bd3652 --- /dev/null +++ b/doc/hacker-guide/bdb.txt @@ -0,0 +1,70 @@ + +* How does file translates to key/value pair? 
+--------------------------------------------- + + in bdb a file is identified by key (obtained by taking basename() of the path of +the file) and file contents are stored as value corresponding to the key in database +file (defaults to glusterfs_storage.db under dirname() directory). + +* symlinks, directories +----------------------- + + symlinks and directories are stored as is. + +* db (database) files +--------------------- + + every directory, including root directory, contains a database file called +glusterfs_storage.db. all the regular files contained in the directory are stored +as key/value pairs inside the glusterfs_storage.db. + +* internal data cache +--------------------- + + db does not provide a way to find out the size of the value corresponding to a key. +so, bdb makes a DB->get() call for the key and takes the length of the value returned. +since DB->get() also returns file contents for key, bdb maintains an internal cache and +stores the file contents in the cache. + every directory maintains a separate cache. + +* inode number transformation +----------------------------- + + bdb allocates an inode number to each file and directory on its own. bdb maintains a +global counter and increments it after allocating inode number for each file +(regular, symlink or directory). NOTE: bdb does not guarantee persistent inode numbers. + +* checkpoint thread +------------------- + + bdb creates a checkpoint thread at the time of init(). checkpoint thread does a +periodic checkpoint on the DB_ENV. checkpoint is the mechanism, provided by db, to +forcefully commit the logged transactions to the storage. + +NOTES ABOUT FOPS: +----------------- + +lookup() - + 1> do lstat() on the path, if lstat fails, we assume that the file being looked up + is either a regular file or doesn't exist. + 2> lookup in the DB of parent directory for key corresponding to path. if key exists, + return key, with. 
+ NOTE: 'struct stat' stat()ed from DB file is used as a container for 'struct stat' + of the regular file. st_ino, st_size, st_blocks are updated with file's values. + +readv() - + 1> do a lookup in bctx cache. if successful, return the requested data from cache. + 2> on a cache miss, do a DB->get() of the entire file content and insert it into the cache. + +writev(): + 1> flush any cached content of this file. + 2> do a DB->put(), with DB_DBT_PARTIAL flag. + NOTE: DB_DBT_PARTIAL is used to do partial update of a value in DB. + +readdir(): + 1> regular readdir() in a loop, and skip all DB_ENV log files and DB files that + we encounter. + 2> if the readdir() buffer still has space, open a DB cursor and do a sequential + DBC->get() to fill the readdir buffer. + + diff --git a/doc/hacker-guide/call-stub.txt b/doc/hacker-guide/call-stub.txt new file mode 100644 index 000000000..bca1579b2 --- /dev/null +++ b/doc/hacker-guide/call-stub.txt @@ -0,0 +1,1033 @@ +creating a call stub and pausing a call +--------------------------------------- +libglusterfs provides a separate API to pause each fop. the parameters to each API are +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). + NOTE: @fn should take exactly the same type and number of parameters that + the corresponding regular fop takes. +rest will be the regular parameters to corresponding fop. + +NOTE: @frame can never be NULL. fop_<operation>_stub() fails with errno + set to EINVAL, if @frame is NULL. also wherever @loc is applicable, + @loc cannot be NULL. + +refer to individual stub creation API to know about call-stub creation's behaviour with +specific parameters. + +here is the list of stub creation APIs for xlator fops. + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. 
+ NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. also @loc->path will be + copied to a different location. +@need_xattr - flag to specify if xattr should be returned or not. +call_stub_t * +fop_lookup_stub (call_frame_t *frame, + fop_lookup_t fn, + loc_t *loc, + int32_t need_xattr); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. also @loc->path will be + copied to a different location. +call_stub_t * +fop_stat_stub (call_frame_t *frame, + fop_stat_t fn, + loc_t *loc); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@fd - file descriptor parameter to fstat fop. + NOTE: @fd is stored with a fd_ref(). +call_stub_t * +fop_fstat_stub (call_frame_t *frame, + fop_fstat_t fn, + fd_t *fd); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to @loc->inode and + @loc->parent, if not NULL. also @loc->path will be copied to a different location. +@mode - mode parameter to chmod. +call_stub_t * +fop_chmod_stub (call_frame_t *frame, + fop_chmod_t fn, + loc_t *loc, + mode_t mode); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@fd - file descriptor parameter to fchmod fop. + NOTE: @fd is stored with a fd_ref(). +@mode - mode parameter for fchmod fop. +call_stub_t * +fop_fchmod_stub (call_frame_t *frame, + fop_fchmod_t fn, + fd_t *fd, + mode_t mode); + +@frame - call frame which has to be used to resume the call at call_resume(). 
+@fn - procedure to call during call_resume(). +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to @loc->inode and + @loc->parent, if not NULL. also @loc->path will be copied to a different location. +@uid - uid parameter to chown. +@gid - gid parameter to chown. +call_stub_t * +fop_chown_stub (call_frame_t *frame, + fop_chown_t fn, + loc_t *loc, + uid_t uid, + gid_t gid); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@fd - file descriptor parameter to fchown fop. + NOTE: @fd is stored with a fd_ref(). +@uid - uid parameter to fchown. +@gid - gid parameter to fchown. +call_stub_t * +fop_fchown_stub (call_frame_t *frame, + fop_fchown_t fn, + fd_t *fd, + uid_t uid, + gid_t gid); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. also @loc->path will be + copied to a different location, if not NULL. +@off - offset parameter to truncate fop. +call_stub_t * +fop_truncate_stub (call_frame_t *frame, + fop_truncate_t fn, + loc_t *loc, + off_t off); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@fd - file descriptor parameter to ftruncate fop. + NOTE: @fd is stored with a fd_ref(). +@off - offset parameter to ftruncate fop. +call_stub_t * +fop_ftruncate_stub (call_frame_t *frame, + fop_ftruncate_t fn, + fd_t *fd, + off_t off); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. 
also @loc->path will be + copied to a different location. +@tv - tv parameter to utimens fop. +call_stub_t * +fop_utimens_stub (call_frame_t *frame, + fop_utimens_t fn, + loc_t *loc, + struct timespec tv[2]); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. also @loc->path will be + copied to a different location. +@mask - mask parameter for access fop. +call_stub_t * +fop_access_stub (call_frame_t *frame, + fop_access_t fn, + loc_t *loc, + int32_t mask); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. also @loc->path will be + copied to a different location. +@size - size parameter to readlink fop. +call_stub_t * +fop_readlink_stub (call_frame_t *frame, + fop_readlink_t fn, + loc_t *loc, + size_t size); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. also @loc->path will be + copied to a different location. +@mode - mode parameter to mknod fop. +@rdev - rdev parameter to mknod fop. +call_stub_t * +fop_mknod_stub (call_frame_t *frame, + fop_mknod_t fn, + loc_t *loc, + mode_t mode, + dev_t rdev); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. 
+ NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. also @loc->path will be + copied to a different location. +@mode - mode parameter to mkdir fop. +call_stub_t * +fop_mkdir_stub (call_frame_t *frame, + fop_mkdir_t fn, + loc_t *loc, + mode_t mode); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. also @loc->path will be + copied to a different location. +call_stub_t * +fop_unlink_stub (call_frame_t *frame, + fop_unlink_t fn, + loc_t *loc); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. also @loc->path will be + copied to a different location. +call_stub_t * +fop_rmdir_stub (call_frame_t *frame, + fop_rmdir_t fn, + loc_t *loc); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@linkname - linkname parameter to symlink fop. +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. also @loc->path will be + copied to a different location. +call_stub_t * +fop_symlink_stub (call_frame_t *frame, + fop_symlink_t fn, + const char *linkname, + loc_t *loc); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@oldloc - pointer to location structure. + NOTE: @oldloc will be copied to a different location, with inode_ref() to + @oldloc->inode and @oldloc->parent, if not NULL. 
also @oldloc->path will + be copied to a different location, if not NULL. +@newloc - pointer to location structure. + NOTE: @newloc will be copied to a different location, with inode_ref() to + @newloc->inode and @newloc->parent, if not NULL. also @newloc->path will + be copied to a different location, if not NULL. +call_stub_t * +fop_rename_stub (call_frame_t *frame, + fop_rename_t fn, + loc_t *oldloc, + loc_t *newloc); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@oldloc - pointer to location structure. + NOTE: @oldloc will be copied to a different location, with inode_ref() to + @oldloc->inode and @oldloc->parent, if not NULL. also @oldloc->path will be + copied to a different location. +@newpath - newpath parameter to link fop. +call_stub_t * +fop_link_stub (call_frame_t *frame, + fop_link_t fn, + loc_t *oldloc, + const char *newpath); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. also @loc->path will be + copied to a different location. +@flags - flags parameter to create fop. +@mode - mode parameter to create fop. +@fd - file descriptor parameter to create fop. + NOTE: @fd is stored with a fd_ref(). +call_stub_t * +fop_create_stub (call_frame_t *frame, + fop_create_t fn, + loc_t *loc, + int32_t flags, + mode_t mode, fd_t *fd); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@flags - flags parameter to open fop. +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. also @loc->path will be + copied to a different location. 
+call_stub_t * +fop_open_stub (call_frame_t *frame, + fop_open_t fn, + loc_t *loc, + int32_t flags, + fd_t *fd); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@fd - file descriptor parameter to readv fop. + NOTE: @fd is stored with a fd_ref(). +@size - size parameter to readv fop. +@off - offset parameter to readv fop. +call_stub_t * +fop_readv_stub (call_frame_t *frame, + fop_readv_t fn, + fd_t *fd, + size_t size, + off_t off); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@fd - file descriptor parameter to writev fop. + NOTE: @fd is stored with a fd_ref(). +@vector - vector parameter to writev fop. + NOTE: @vector is iov_dup()ed while creating stub. and frame->root->req_refs + dictionary is dict_ref()ed. +@count - count parameter to writev fop. +@off - off parameter to writev fop. +call_stub_t * +fop_writev_stub (call_frame_t *frame, + fop_writev_t fn, + fd_t *fd, + struct iovec *vector, + int32_t count, + off_t off); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@fd - file descriptor parameter to flush fop. + NOTE: @fd is stored with a fd_ref(). +call_stub_t * +fop_flush_stub (call_frame_t *frame, + fop_flush_t fn, + fd_t *fd); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@fd - file descriptor parameter to fsync fop. + NOTE: @fd is stored with a fd_ref(). +@datasync - datasync parameter to fsync fop. +call_stub_t * +fop_fsync_stub (call_frame_t *frame, + fop_fsync_t fn, + fd_t *fd, + int32_t datasync); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. 
+ NOTE: @loc will be copied to a different location, with inode_ref() to @loc->inode and + @loc->parent, if not NULL. also @loc->path will be copied to a different location. +@fd - file descriptor parameter to opendir fop. + NOTE: @fd is stored with a fd_ref(). +call_stub_t * +fop_opendir_stub (call_frame_t *frame, + fop_opendir_t fn, + loc_t *loc, + fd_t *fd); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@fd - file descriptor parameter to getdents fop. + NOTE: @fd is stored with a fd_ref(). +@size - size parameter to getdents fop. +@off - off parameter to getdents fop. +@flag - flag parameter to getdents fop. +call_stub_t * +fop_getdents_stub (call_frame_t *frame, + fop_getdents_t fn, + fd_t *fd, + size_t size, + off_t off, + int32_t flag); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@fd - file descriptor parameter to setdents fop. + NOTE: @fd is stored with a fd_ref(). +@flags - flags parameter to setdents fop. +@entries - entries parameter to setdents fop. +call_stub_t * +fop_setdents_stub (call_frame_t *frame, + fop_setdents_t fn, + fd_t *fd, + int32_t flags, + dir_entry_t *entries, + int32_t count); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@fd - file descriptor parameter to fsyncdir fop. + NOTE: @fd is stored with a fd_ref(). +@datasync - datasync parameter to fsyncdir fop. +call_stub_t * +fop_fsyncdir_stub (call_frame_t *frame, + fop_fsyncdir_t fn, + fd_t *fd, + int32_t datasync); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. 
also @loc->path will be + copied to a different location. +call_stub_t * +fop_statfs_stub (call_frame_t *frame, + fop_statfs_t fn, + loc_t *loc); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. also @loc->path will be + copied to a different location. +@dict - dict parameter to setxattr fop. + NOTE: stub creation procedure stores @dict pointer with dict_ref() to it. +call_stub_t * +fop_setxattr_stub (call_frame_t *frame, + fop_setxattr_t fn, + loc_t *loc, + dict_t *dict, + int32_t flags); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. also @loc->path will be + copied to a different location. +@name - name parameter to getxattr fop. +call_stub_t * +fop_getxattr_stub (call_frame_t *frame, + fop_getxattr_t fn, + loc_t *loc, + const char *name); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. also @loc->path will be + copied to a different location. +@name - name parameter to removexattr fop. + NOTE: name string will be copied to a different location while creating stub. +call_stub_t * +fop_removexattr_stub (call_frame_t *frame, + fop_removexattr_t fn, + loc_t *loc, + const char *name); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@fd - file descriptor parameter to lk fop. 
+ NOTE: @fd is stored with a fd_ref(). +@cmd - command parameter to lk fop. +@lock - lock parameter to lk fop. + NOTE: lock will be copied to a different location while creating stub. +call_stub_t * +fop_lk_stub (call_frame_t *frame, + fop_lk_t fn, + fd_t *fd, + int32_t cmd, + struct flock *lock); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@fd - fd parameter to gf_lk fop. + NOTE: @fd is fd_ref()ed while creating stub, if not NULL. +@cmd - cmd parameter to gf_lk fop. +@lock - lock parameter to gf_lk fop. + NOTE: @lock is copied to a different memory location while creating + stub. +call_stub_t * +fop_gf_lk_stub (call_frame_t *frame, + fop_gf_lk_t fn, + fd_t *fd, + int32_t cmd, + struct flock *lock); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@fd - file descriptor parameter to readdir fop. + NOTE: @fd is stored with a fd_ref(). +@size - size parameter to readdir fop. +@off - offset parameter to readdir fop. +call_stub_t * +fop_readdir_stub (call_frame_t *frame, + fop_readdir_t fn, + fd_t *fd, + size_t size, + off_t off); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@loc - pointer to location structure. + NOTE: @loc will be copied to a different location, with inode_ref() to + @loc->inode and @loc->parent, if not NULL. also @loc->path will be + copied to a different location. +@flags - flags parameter to checksum fop. +call_stub_t * +fop_checksum_stub (call_frame_t *frame, + fop_checksum_t fn, + loc_t *loc, + int32_t flags); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@inode - inode parameter to @fn. + NOTE: @inode pointer is stored with a inode_ref(). 
+@buf - buf parameter to @fn. + NOTE: @buf is copied to a different memory location, if not NULL. +@dict - dict parameter to @fn. + NOTE: @dict pointer is stored with dict_ref(). +call_stub_t * +fop_lookup_cbk_stub (call_frame_t *frame, + fop_lookup_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + inode_t *inode, + struct stat *buf, + dict_t *dict); +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@buf - buf parameter to @fn. + NOTE: @buf is copied to a different memory location, if not NULL. +call_stub_t * +fop_stat_cbk_stub (call_frame_t *frame, + fop_stat_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + struct stat *buf); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@buf - buf parameter to @fn. + NOTE: @buf is copied to a different memory location, if not NULL. +call_stub_t * +fop_fstat_cbk_stub (call_frame_t *frame, + fop_fstat_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + struct stat *buf); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@buf - buf parameter to @fn. + NOTE: @buf is copied to a different memory location, if not NULL. +call_stub_t * +fop_chmod_cbk_stub (call_frame_t *frame, + fop_chmod_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + struct stat *buf); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@buf - buf parameter to @fn. + NOTE: @buf is copied to a different memory location, if not NULL. 
+call_stub_t * +fop_fchmod_cbk_stub (call_frame_t *frame, + fop_fchmod_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + struct stat *buf); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@buf - buf parameter to @fn. + NOTE: @buf is copied to a different memory location, if not NULL. +call_stub_t * +fop_chown_cbk_stub (call_frame_t *frame, + fop_chown_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + struct stat *buf); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@buf - buf parameter to @fn. + NOTE: @buf is copied to a different memory location, if not NULL. +call_stub_t * +fop_fchown_cbk_stub (call_frame_t *frame, + fop_fchown_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + struct stat *buf); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@buf - buf parameter to @fn. + NOTE: @buf is copied to a different memory location, if not NULL. +call_stub_t * +fop_truncate_cbk_stub (call_frame_t *frame, + fop_truncate_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + struct stat *buf); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@buf - buf parameter to @fn. + NOTE: @buf is copied to a different memory location, if not NULL. 
+call_stub_t * +fop_ftruncate_cbk_stub (call_frame_t *frame, + fop_ftruncate_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + struct stat *buf); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@buf - buf parameter to @fn. + NOTE: @buf is copied to a different memory location, if not NULL. +call_stub_t * +fop_utimens_cbk_stub (call_frame_t *frame, + fop_utimens_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + struct stat *buf); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +call_stub_t * +fop_access_cbk_stub (call_frame_t *frame, + fop_access_cbk_t fn, + int32_t op_ret, + int32_t op_errno); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@path - path parameter to @fn. + NOTE: @path is copied to a different memory location, if not NULL. +call_stub_t * +fop_readlink_cbk_stub (call_frame_t *frame, + fop_readlink_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + const char *path); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@inode - inode parameter to @fn. + NOTE: @inode pointer is stored with a inode_ref(). +@buf - buf parameter to @fn. + NOTE: @buf is copied to a different memory location, if not NULL. +call_stub_t * +fop_mknod_cbk_stub (call_frame_t *frame, + fop_mknod_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + inode_t *inode, + struct stat *buf); + + +@frame - call frame which has to be used to resume the call at call_resume(). 
+@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@inode - inode parameter to @fn. + NOTE: @inode pointer is stored with a inode_ref(). +@buf - buf parameter to @fn. + NOTE: @buf is copied to a different memory location, if not NULL. +call_stub_t * +fop_mkdir_cbk_stub (call_frame_t *frame, + fop_mkdir_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + inode_t *inode, + struct stat *buf); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +call_stub_t * +fop_unlink_cbk_stub (call_frame_t *frame, + fop_unlink_cbk_t fn, + int32_t op_ret, + int32_t op_errno); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +call_stub_t * +fop_rmdir_cbk_stub (call_frame_t *frame, + fop_rmdir_cbk_t fn, + int32_t op_ret, + int32_t op_errno); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@inode - inode parameter to @fn. + NOTE: @inode pointer is stored with a inode_ref(). +@buf - buf parameter to @fn. + NOTE: @buf is copied to a different memory location, if not NULL. +call_stub_t * +fop_symlink_cbk_stub (call_frame_t *frame, + fop_symlink_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + inode_t *inode, + struct stat *buf); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@buf - buf parameter to @fn. + NOTE: @buf is copied to a different memory location, if not NULL. 
+call_stub_t * +fop_rename_cbk_stub (call_frame_t *frame, + fop_rename_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + struct stat *buf); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@inode - inode parameter to @fn. + NOTE: @inode pointer is stored with a inode_ref(). +@buf - buf parameter to @fn. + NOTE: @buf is copied to a different memory location, if not NULL. +call_stub_t * +fop_link_cbk_stub (call_frame_t *frame, + fop_link_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + inode_t *inode, + struct stat *buf); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@fd - fd parameter to @fn. + NOTE: @fd pointer is stored with a fd_ref(). +@inode - inode parameter to @fn. + NOTE: @inode pointer is stored with a inode_ref(). +@buf - buf parameter to @fn. + NOTE: @buf is copied to a different memory location, if not NULL. +call_stub_t * +fop_create_cbk_stub (call_frame_t *frame, + fop_create_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + fd_t *fd, + inode_t *inode, + struct stat *buf); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@fd - fd parameter to @fn. + NOTE: @fd pointer is stored with a fd_ref(). +call_stub_t * +fop_open_cbk_stub (call_frame_t *frame, + fop_open_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + fd_t *fd); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@vector - vector parameter to @fn. 
+ NOTE: @vector is copied to a different memory location, if not NULL. also + frame->root->rsp_refs is dict_ref()ed. +@stbuf - stbuf parameter to @fn. + NOTE: @stbuf is copied to a different memory location, if not NULL. +call_stub_t * +fop_readv_cbk_stub (call_frame_t *frame, + fop_readv_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + struct iovec *vector, + int32_t count, + struct stat *stbuf); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@stbuf - stbuf parameter to @fn. + NOTE: @stbuf is copied to a different memory location, if not NULL. +call_stub_t * +fop_writev_cbk_stub (call_frame_t *frame, + fop_writev_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + struct stat *stbuf); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +call_stub_t * +fop_flush_cbk_stub (call_frame_t *frame, + fop_flush_cbk_t fn, + int32_t op_ret, + int32_t op_errno); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +call_stub_t * +fop_fsync_cbk_stub (call_frame_t *frame, + fop_fsync_cbk_t fn, + int32_t op_ret, + int32_t op_errno); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@fd - fd parameter to @fn. + NOTE: @fd pointer is stored with a fd_ref(). +call_stub_t * +fop_opendir_cbk_stub (call_frame_t *frame, + fop_opendir_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + fd_t *fd); + + +@frame - call frame which has to be used to resume the call at call_resume(). 
+@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@entries - entries parameter to @fn. +@count - count parameter to @fn. +call_stub_t * +fop_getdents_cbk_stub (call_frame_t *frame, + fop_getdents_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + dir_entry_t *entries, + int32_t count); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +call_stub_t * +fop_setdents_cbk_stub (call_frame_t *frame, + fop_setdents_cbk_t fn, + int32_t op_ret, + int32_t op_errno); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +call_stub_t * +fop_fsyncdir_cbk_stub (call_frame_t *frame, + fop_fsyncdir_cbk_t fn, + int32_t op_ret, + int32_t op_errno); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@buf - buf parameter to @fn. + NOTE: @buf is copied to a different memory location, if not NULL. +call_stub_t * +fop_statfs_cbk_stub (call_frame_t *frame, + fop_statfs_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + struct statvfs *buf); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +call_stub_t * +fop_setxattr_cbk_stub (call_frame_t *frame, + fop_setxattr_cbk_t fn, + int32_t op_ret, + int32_t op_errno); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. 
+@op_errno - op_errno parameter to @fn. +@value - value dictionary parameter to @fn. + NOTE: @value pointer is stored with a dict_ref(). +call_stub_t * +fop_getxattr_cbk_stub (call_frame_t *frame, + fop_getxattr_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + dict_t *value); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +call_stub_t * +fop_removexattr_cbk_stub (call_frame_t *frame, + fop_removexattr_cbk_t fn, + int32_t op_ret, + int32_t op_errno); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@lock - lock parameter to @fn. + NOTE: @lock is copied to a different memory location while creating + stub. +call_stub_t * +fop_lk_cbk_stub (call_frame_t *frame, + fop_lk_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + struct flock *lock); + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@lock - lock parameter to @fn. + NOTE: @lock is copied to a different memory location while creating + stub. +call_stub_t * +fop_gf_lk_cbk_stub (call_frame_t *frame, + fop_gf_lk_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + struct flock *lock); + + +@frame - call frame which has to be used to resume the call at call_resume(). +@fn - procedure to call during call_resume(). +@op_ret - op_ret parameter to @fn. +@op_errno - op_errno parameter to @fn. +@entries - entries parameter to @fn. +call_stub_t * +fop_readdir_cbk_stub (call_frame_t *frame, + fop_readdir_cbk_t fn, + int32_t op_ret, + int32_t op_errno, + gf_dirent_t *entries); + + +@frame - call frame which has to be used to resume the call at call_resume(). 
+@fn - procedure to call during call_resume().
+@op_ret - op_ret parameter to @fn.
+@op_errno - op_errno parameter to @fn.
+@file_checksum - file_checksum parameter to @fn.
+                 NOTE: file_checksum will be copied to a different memory location
+                       while creating stub.
+@dir_checksum - dir_checksum parameter to @fn.
+                NOTE: dir_checksum will be copied to a different memory location
+                      while creating stub.
+call_stub_t *
+fop_checksum_cbk_stub (call_frame_t *frame,
+                       fop_checksum_cbk_t fn,
+                       int32_t op_ret,
+                       int32_t op_errno,
+                       uint8_t *file_checksum,
+                       uint8_t *dir_checksum);
+
+resuming a call:
+---------------
+  a paused call is resumed from its call stub through the call_resume() API.
+
+  void call_resume (call_stub_t *stub);
+
+  stub - call stub created while pausing the call.
+
+  NOTE: call_resume() will decrease the reference count of any fd_t, dict_t and inode_t that it finds
+        in stub->args.<operation>.<fd_t-or-inode_t-or-dict_t>. so, if any fd_t, dict_t or
+        inode_t pointers are assigned at stub->args.<operation>.<fd_t-or-inode_t-or-dict_t> after
+        the fop_<operation>_stub() call, they must be <fd_t-or-inode_t-or-dict_t>_ref()ed.
+
+  call_resume() does not STACK_DESTROY() for any fop.
+
+  if stub->fn is NULL, call_resume() does STACK_WIND() or STACK_UNWIND() using stub->frame.
+
+  return - call_resume() fails only if stub is NULL, in which case errno is set to EINVAL.
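The pause/resume pattern described above can be sketched with a toy model. This is not the GlusterFS implementation: toy_fd_t, toy_stub_t, toy_open_cbk_stub() and toy_call_resume() are hypothetical stand-ins for fd_t, call_stub_t, fop_open_cbk_stub() and call_resume(). The sketch assumes only what the notes state: creating the stub takes a reference on the fd, and resuming calls the saved procedure and drops that reference.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for fd_t: only a reference count. */
typedef struct { int refcount; } toy_fd_t;

/* Hypothetical callback type, modelled on the open callback arguments. */
typedef void (*toy_open_cbk_t) (int32_t op_ret, int32_t op_errno, toy_fd_t *fd);

/* Hypothetical stand-in for call_stub_t: saved procedure plus arguments. */
typedef struct {
        toy_open_cbk_t fn;
        int32_t        op_ret;
        int32_t        op_errno;
        toy_fd_t      *fd;
} toy_stub_t;

/* Records the last callback invocation so the example can be checked. */
int32_t last_op_ret = -1;

void
record_cbk (int32_t op_ret, int32_t op_errno, toy_fd_t *fd)
{
        (void) op_errno;
        (void) fd;
        last_op_ret = op_ret;
}

/* Pause: save the arguments and take a reference on fd (like fd_ref()). */
toy_stub_t *
toy_open_cbk_stub (toy_open_cbk_t fn, int32_t op_ret, int32_t op_errno,
                   toy_fd_t *fd)
{
        toy_stub_t *stub = calloc (1, sizeof (*stub));
        if (!stub)
                return NULL;
        stub->fn       = fn;
        stub->op_ret   = op_ret;
        stub->op_errno = op_errno;
        stub->fd       = fd;
        if (fd)
                fd->refcount++;
        return stub;
}

/* Resume: call the saved procedure, drop the reference, free the stub.
 * Returns -1 if stub is NULL, 0 otherwise. */
int
toy_call_resume (toy_stub_t *stub)
{
        if (!stub)
                return -1;
        if (stub->fn)
                stub->fn (stub->op_ret, stub->op_errno, stub->fd);
        if (stub->fd)
                stub->fd->refcount--;
        free (stub);
        return 0;
}
```

A caller would create the stub where the call has to be paused, keep it in a queue, and hand it to toy_call_resume() later. The reference taken at creation time is what keeps the fd valid while the stub is queued.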
diff --git a/doc/hacker-guide/hacker-guide.tex b/doc/hacker-guide/hacker-guide.tex
new file mode 100644
index 000000000..72c44df1a
--- /dev/null
+++ b/doc/hacker-guide/hacker-guide.tex
@@ -0,0 +1,312 @@
+\documentclass[12pt]{book}
+\usepackage{graphicx}
+% \usepackage{fancyhdr}
+
+% \pagestyle{fancy}
+\begin{document}
+
+% \headheight 117pt
+% \rhead{\includegraphics{zr-logo.eps}}
+
+\author{Z Research}
+\title{GlusterFS 1.3 Hacker's Guide}
+\date{June 1, 2007}
+
+\maketitle
+\frontmatter
+\tableofcontents
+
+\mainmatter
+\chapter{Introduction}
+
+\section{Coding guidelines}
+GlusterFS uses GNU Arch for version control. To get the latest source do:
+\begin{verbatim}
+  $ tla register-archive http://arch.sv.gnu.org/archives/gluster
+  $ tla -A gluster@sv.gnu.org get glusterfs--mainline--2.4
+\end{verbatim}
+\noindent
+GlusterFS follows the GNU coding
+standards\footnote{http://www.gnu.org/prep/standards\_toc.html} for the
+most part.
+
+\chapter{Major components}
+\section{libglusterfs}
+\texttt{libglusterfs} contains supporting code used by all the other components.
+The important files here are:
+
+\texttt{dict.c}: This is an implementation of a serializable dictionary type. It is
+used by the protocol code to send requests and replies. It is also used to pass options
+to translators.
+
+\texttt{logging.c}: This is a thread-safe logging library. The log messages go to a
+file (default \texttt{/usr/local/var/log/glusterfs/*}).
+
+\texttt{protocol.c}: This file implements the GlusterFS on-the-wire
+protocol. The protocol itself is a simple ASCII protocol, designed to
+be easy to parse and be human readable.
+ +A sample GlusterFS protocol block looks like this: +\begin{verbatim} + Block Start header + 0000000000000023 callid + 00000001 type + 00000016 op + xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx human-readable name + 00000000000000000000000000000ac3 block size + <...> block + Block End +\end{verbatim} + +\texttt{stack.h}: This file defines the \texttt{STACK\_WIND} and +\texttt{STACK\_UNWIND} macros which are used to implement the parallel +stack that is maintained for inter-xlator calls. See the \textsl{Taking control +of the stack} section below for more details. + +\texttt{spec.y}: This contains the Yacc grammar for the GlusterFS +specification file, and the parsing code. + + +Draw diagrams of trees +Two rules: +(1) directory structure is same +(2) file can exist only on one node + +\section{glusterfs-fuse} +\section{glusterfsd} +\section{transport} +\section{scheduler} +\section{xlator} + +\chapter{xlators} +\section{Taking control of the stack} +One can think of STACK\_WIND/UNWIND as a very specific RPC mechanism. + +% \includegraphics{stack.eps} + +\section{Overview of xlators} + +\flushleft{\LARGE\texttt{cluster/}} +\vskip 2ex +\flushleft{\Large\texttt{afr}} +\vskip 2ex +\flushleft{\Large\texttt{stripe}} +\vskip 2ex +\flushleft{\Large\texttt{unify}} + +\vskip 4ex +\flushleft{\LARGE\texttt{debug/}} +\vskip 2ex +\flushleft{\Large\texttt{trace}} +\vskip 2ex +The trace xlator simply logs all fops and mops, and passes them through to its child. + +\vskip 4ex +\flushleft{\LARGE\texttt{features/}} +\flushleft{\Large\texttt{posix-locks}} +\vskip 2ex +This xlator implements \textsc{posix} record locking semantics over +any kind of storage. 
+
+\vskip 4ex
+\flushleft{\LARGE\texttt{performance/}}
+
+\flushleft{\Large\texttt{io-threads}}
+\vskip 2ex
+\flushleft{\Large\texttt{read-ahead}}
+\vskip 2ex
+\flushleft{\Large\texttt{stat-prefetch}}
+\vskip 2ex
+\flushleft{\Large\texttt{write-behind}}
+\vskip 2ex
+
+\vskip 4ex
+\flushleft{\LARGE\texttt{protocol/}}
+\vskip 2ex
+
+\flushleft{\Large\texttt{client}}
+\vskip 2ex
+
+\flushleft{\Large\texttt{server}}
+\vskip 2ex
+
+\vskip 4ex
+\flushleft{\LARGE\texttt{storage/}}
+\flushleft{\Large\texttt{posix}}
+\vskip 2ex
+The \texttt{posix} xlator is the one which actually makes calls to the
+on-disk filesystem. Currently this is the only storage xlator available. However,
+plans to develop other storage xlators, such as one for Amazon's S3 service, are
+on the roadmap.
+
+\chapter{Writing a simple xlator}
+\noindent
+In this chapter we're going to write a rot13 xlator. ``Rot13'' is a
+simple substitution cipher which obscures a text by replacing each
+letter with the letter thirteen places down the alphabet. So `a' (0)
+would become `n' (13), `b' would become `o', and so on. Rot13 applied to
+a piece of ciphertext yields the plaintext again, because rot13 is its
+own inverse, since:
+
+\[
+x_c \equiv x + 13 \pmod{26}
+\]
+\[
+x_c + 13 \equiv x + 13 + 13 \equiv x \pmod{26}
+\]
+
+First we include the requisite headers.
+
+\begin{verbatim}
+#include <ctype.h>
+#include <sys/uio.h>
+
+#include "glusterfs.h"
+#include "xlator.h"
+#include "logging.h"
+
+/*
+ * This is a rot13 ``encryption'' xlator. It rot13's data when
+ * writing to disk and rot13's it back when reading it.
+ * This xlator is meant as an example, not for production
+ * use ;) (hence no error-checking)
+ */
+
+\end{verbatim}
+
+Then we write the rot13 function itself. For simplicity, we only transform lower case
+letters. Any other byte is passed through as it is.
+
+\begin{verbatim}
+/* We only handle lower case letters for simplicity */
+static void
+rot13 (char *buf, int len)
+{
+  int i;
+  for (i = 0; i < len; i++) {
+    /* rotate 'a'..'z' by thirteen places;
+       every other byte is left untouched */
+    if (buf[i] >= 'a' && buf[i] <= 'z')
+      buf[i] = (buf[i] - 'a' + 13) % 26 + 'a';
+  }
+}
+\end{verbatim}
+
+Next comes a utility function that applies rot13 to every buffer of an
+I/O vector.
+
+\begin{verbatim}
+static void
+rot13_iovec (struct iovec *vector, int count)
+{
+  int i;
+  for (i = 0; i < count; i++) {
+    rot13 (vector[i].iov_base, vector[i].iov_len);
+  }
+}
+\end{verbatim}
+
+\begin{verbatim}
+static int32_t
+rot13_readv_cbk (call_frame_t *frame,
+                 call_frame_t *prev_frame,
+                 xlator_t *this,
+                 int32_t op_ret,
+                 int32_t op_errno,
+                 struct iovec *vector,
+                 int32_t count)
+{
+  rot13_iovec (vector, count);
+
+  STACK_UNWIND (frame, op_ret, op_errno, vector, count);
+  return 0;
+}
+
+static int32_t
+rot13_readv (call_frame_t *frame,
+             xlator_t *this,
+             dict_t *ctx,
+             size_t size,
+             off_t offset)
+{
+  STACK_WIND (frame,
+              rot13_readv_cbk,
+              FIRST_CHILD (this),
+              FIRST_CHILD (this)->fops->readv,
+              ctx, size, offset);
+  return 0;
+}
+
+static int32_t
+rot13_writev_cbk (call_frame_t *frame,
+                  call_frame_t *prev_frame,
+                  xlator_t *this,
+                  int32_t op_ret,
+                  int32_t op_errno)
+{
+  STACK_UNWIND (frame, op_ret, op_errno);
+  return 0;
+}
+
+static int32_t
+rot13_writev (call_frame_t *frame,
+              xlator_t *this,
+              dict_t *ctx,
+              struct iovec *vector,
+              int32_t count,
+              off_t offset)
+{
+  rot13_iovec (vector, count);
+
+  STACK_WIND (frame,
+              rot13_writev_cbk,
+              FIRST_CHILD (this),
+              FIRST_CHILD (this)->fops->writev,
+              ctx, vector, count, offset);
+  return 0;
+}
+
+\end{verbatim}
+
+Every xlator must define two functions and two external symbols. The functions are
+\texttt{init} and \texttt{fini}, and the symbols are \texttt{fops} and \texttt{mops}.
+The \texttt{init} function is called when the xlator is loaded by GlusterFS, and
+contains code for the xlator to initialize itself. Note that if an xlator is present
+multiple times in the spec tree, the \texttt{init} function will be called each time
+the xlator is loaded.
+
+\begin{verbatim}
+int32_t
+init (xlator_t *this)
+{
+  /* reject both "no child" and "more than one child" */
+  if (!this->children || this->children->next) {
+    gf_log ("rot13", GF_LOG_ERROR,
+            "FATAL: rot13 should have exactly one child");
+    return -1;
+  }
+
+  gf_log ("rot13", GF_LOG_DEBUG, "rot13 xlator loaded");
+  return 0;
+}
+\end{verbatim}
+
+\begin{verbatim}
+
+void
+fini (xlator_t *this)
+{
+  return;
+}
+
+struct xlator_fops fops = {
+  .readv = rot13_readv,
+  .writev = rot13_writev
+};
+
+struct xlator_mops mops = {
+};
+
+\end{verbatim}
+
+\end{document}
+
diff --git a/doc/hacker-guide/posix.txt b/doc/hacker-guide/posix.txt
new file mode 100644
index 000000000..d0132abfe
--- /dev/null
+++ b/doc/hacker-guide/posix.txt
@@ -0,0 +1,59 @@
+---------------
+* storage/posix
+---------------
+
+- SET_FS_ID
+
+  This is so that all filesystem checks are done with the user's
+  uid/gid and not GlusterFS's uid/gid.
+
+- MAKE_REAL_PATH
+
+  This macro concatenates the base directory of the posix volume
+  ('option directory') with the given path.
+
+- need_xattr in lookup
+
+  If this flag is passed, lookup returns an xattr dictionary that contains
+  the file's create time, the file's contents, and the version number
+  of the file.
+
+  This is a hack to increase small file performance. If an application
+  wants to read a small file, it can finish its job with just a lookup
+  call instead of a lookup followed by read.
+
+- getdents/setdents
+
+  These are used by unify to set and get directory entries.
+
+- ALIGN_BUF
+
+  Macro to align an address to a page boundary (4K).
+
+- priv->export_statfs
+
+  In some cases, two exported volumes may reside on the same
+  partition on the server.
Sending statvfs info for both + the volumes will lead to erroneous df output at the client, + since free space on the partition will be counted twice. + + In such cases, user can disable exporting statvfs info + on one of the volumes by setting this option. + +- xattrop + + This fop is used by replicate to set version numbers on files. + +- getxattr/setxattr hack to read/write files + + A key, GLUSTERFS_FILE_CONTENT_STRING, is handled in a special way by + getxattr/setxattr. A getxattr with the key will return the entire + content of the file as the value. A setxattr with the key will write + the value as the entire content of the file. + +- posix_checksum + + This calculates a simple XOR checksum on all entry names in a + directory that is used by unify to compare directory contents. + + diff --git a/doc/hacker-guide/replicate.txt b/doc/hacker-guide/replicate.txt new file mode 100644 index 000000000..284f373fb --- /dev/null +++ b/doc/hacker-guide/replicate.txt @@ -0,0 +1,206 @@ +--------------- +* cluster/replicate +--------------- + +Before understanding replicate, one must understand two internal FOPs: + +GF_FILE_LK: + This is exactly like fcntl(2) locking, except the locks are in a + separate domain from locks held by applications. + +GF_DIR_LK (loc_t *loc, char *basename): + This allows one to lock a name under a directory. For example, + to lock /mnt/glusterfs/foo, one would use the call: + + GF_DIR_LK ({loc_t for "/mnt/glusterfs"}, "foo") + + If one wishes to lock *all* the names under a particular directory, + supply the basename argument as NULL. + + The locks can either be read locks or write locks; consult the + function prototype for more details. + +Both these operations are implemented by the features/locks (earlier +known as posix-locks) translator. + +-------------- +* Basic design +-------------- + +All FOPs can be classified into four major groups: + + - inode-read + Operations that read an inode's data (file contents) or metadata (perms, etc.). 
+ + access, getxattr, fstat, readlink, readv, stat. + + - inode-write + Operations that modify an inode's data or metadata. + + chmod, chown, truncate, writev, utimens. + + - dir-read + Operations that read a directory's contents or metadata. + + readdir, getdents, checksum. + + - dir-write + Operations that modify a directory's contents or metadata. + + create, link, mkdir, mknod, rename, rmdir, symlink, unlink. + + Some of these make a subgroup in that they modify *two* different entries: + link, rename, symlink. + + - Others + Other operations. + + flush, lookup, open, opendir, statfs. + +------------ +* Algorithms +------------ + +Each of the four major groups has its own algorithm: + + ---------------------- + - inode-read, dir-read + ---------------------- + + = Send a request to the first child that is up: + - if it fails: + try the next available child + - if we have exhausted all children: + return failure + + ------------- + - inode-write + ------------- + + All operations are done in parallel unless specified otherwise. + + (1) Send a GF_FILE_LK request on all children for a write lock on + the appropriate region + (for metadata operations: entire file (0, 0) + for writev: (offset, offset+size of buffer)) + + - If a lock request fails on a child: + unlock all children + try to acquire a blocking lock (F_SETLKW) on each child, serially. + + If this fails (due to ENOTCONN or EINVAL): + Consider this child as dead for rest of transaction. + + (2) Mark all children as "pending" on all (alive) children + (see below for meaning of "pending"). + + - If it fails on any child: + mark it as dead (in transaction local state). + + (3) Perform operation on all (alive) children. + + - If it fails on any child: + mark it as dead (in transaction local state). + + (4) Unmark all successful children as not "pending" on all nodes. + + (5) Unlock region on all (alive) children. 
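The five inode-write steps above can be sketched as a small in-memory simulation. Everything here is hypothetical (child_t, its fields, and replicate_write() are illustration names, not replicate's real data structures), and step (1) is simplified: a child that cannot be locked is treated as dead straight away, without the serial blocking-lock retry.

```c
#include <assert.h>
#include <stdbool.h>

#define NCHILD 3

/* Hypothetical in-memory model of one replicate child. */
typedef struct {
        bool up;               /* false: lock requests fail (e.g. ENOTCONN) */
        bool fail_op;          /* true: the write itself fails on this child */
        bool locked;
        int  data;             /* stands in for the file contents */
        int  pending[NCHILD];  /* pending[j]: writes child j may have missed */
} child_t;

/* Runs one inode-write transaction; returns how many children succeeded. */
int
replicate_write (child_t c[NCHILD], int value)
{
        bool alive[NCHILD];
        int  i, j, ok = 0;

        /* (1) take the write lock; a child that cannot be locked is
         *     considered dead for the rest of the transaction */
        for (i = 0; i < NCHILD; i++) {
                alive[i] = c[i].up;
                c[i].locked = c[i].up;
        }

        /* (2) mark all children as pending on all alive children */
        for (i = 0; i < NCHILD; i++)
                if (alive[i])
                        for (j = 0; j < NCHILD; j++)
                                c[i].pending[j]++;

        /* (3) perform the operation; a failing child becomes dead */
        for (i = 0; i < NCHILD; i++) {
                if (!alive[i])
                        continue;
                if (c[i].fail_op) {
                        alive[i] = false;
                } else {
                        c[i].data = value;
                        ok++;
                }
        }

        /* (4) unmark the successful children on every surviving child */
        for (i = 0; i < NCHILD; i++)
                if (alive[i])
                        for (j = 0; j < NCHILD; j++)
                                if (alive[j])
                                        c[i].pending[j]--;

        /* (5) unlock every child that was locked in step (1) */
        for (i = 0; i < NCHILD; i++)
                c[i].locked = false;

        return ok;
}
```

After a transaction in which one child fails, the surviving children keep a non-zero pending count for the failed child. That leftover count is the journal entry that later identifies which copy is stale.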
+
+  -----------
+  - dir-write
+  -----------
+
+  The algorithm for dir-write is the same as above, except that instead of holding
+  GF_FILE_LK locks we hold a GF_DIR_LK lock on the name being operated upon.
+  In case of link-type calls, we hold locks on both the operand names.
+
+-----------
+* "pending"
+-----------
+
+  The "pending" number is like a journal entry. A pending entry is an
+  array of 32-bit integers stored in network byte-order as the extended
+  attribute of an inode (which can be a directory as well).
+
+  There are three keys corresponding to three types of pending operations:
+
+    - AFR_METADATA_PENDING
+        There are some metadata operations pending on this inode (perms, ctime/mtime,
+        xattr, etc.).
+
+    - AFR_DATA_PENDING
+        There is some data pending on this inode (writev).
+
+    - AFR_ENTRY_PENDING
+        There are some directory operations pending on this directory
+        (create, unlink, etc.).
+
+-----------
+* Self heal
+-----------
+
+  - On lookup, gather extended attribute data:
+    - If entry is a regular file:
+      - If an entry is present on one child and not on others:
+        - create entry on others.
+      - If entries exist but have different metadata (perms, etc.):
+        - consider the entry with the highest AFR_METADATA_PENDING number as
+          definitive and replicate its attributes on the other children.
+
+    - If entry is a directory:
+      - Consider the entry with the highest AFR_ENTRY_PENDING number as
+        definitive and replicate its contents on all children.
+
+    - If any two entries have non-matching types (i.e., one is file and
+      other is directory):
+      - Announce to the user via log that a split-brain situation has been
+        detected, and do nothing.
+
+  - On open, gather extended attribute data:
+    - Consider the file with the highest AFR_DATA_PENDING number as
+      the definitive one and replicate its contents on all other
+      children.
+
+  During all self heal operations, appropriate locks must be held on all
+  regions/entries being affected.
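The "highest pending number wins" rule used by self heal can be sketched as follows. This is an illustration under assumptions, not replicate's code: decode_be32() and pick_heal_source() are hypothetical names, each child's pending value is reduced to a single 32-bit integer, and ties go to the lowest child index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one 32-bit integer stored in network byte order (big-endian). */
uint32_t
decode_be32 (const unsigned char *p)
{
        return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
               ((uint32_t) p[2] << 8)  |  (uint32_t) p[3];
}

/* Returns the index of the child with the highest pending number,
 * or -1 if nchild is 0.  xattr[i] points to child i's 4-byte value.
 * Ties go to the lowest index (an arbitrary choice for this sketch). */
int
pick_heal_source (const unsigned char *const xattr[], size_t nchild)
{
        int      source = -1;
        uint32_t best = 0;
        size_t   i;

        for (i = 0; i < nchild; i++) {
                uint32_t pending = decode_be32 (xattr[i]);
                if (source < 0 || pending > best) {
                        best = pending;
                        source = (int) i;
                }
        }
        return source;
}
```

Decoding byte by byte keeps the sketch independent of the host's endianness, which matters because the attribute is stored in network byte order whatever machine wrote it.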
+
+---------------
+* Inode scaling
+---------------
+
+Inode scaling is necessary because of the following situation:
+  - If an inode number is returned for a directory (by lookup) which was
+    previously the inode number of a file (as per FUSE's table), then
+    FUSE gets horribly confused (consult a FUSE expert for more details).
+
+To avoid such a situation, we distribute the 64-bit inode space equally
+among all children of replicate.
+
+To illustrate:
+
+If c1, c2, c3 are children of replicate, they each get 1/3 of the available
+inode space:
+
+Child:         c1  c2  c3  c1  c2  c3  c1  c2  c3  c1  c2 ...
+Inode number:   1   2   3   4   5   6   7   8   9  10  11 ...
+
+Thus, if lookup on c1 returns an inode number "2", it is scaled to "4"
+(which is the second inode number in c1's space). In general, with n
+children, inode number i on the k-th child (k = 1..n) is scaled to
+(i - 1) * n + k.
+
+This way we ensure that there is never a collision of inode numbers from
+two different children.
+
+This reduction of inode space doesn't really reduce the usability of
+replicate since even if we assume replicate has 1024 children (which would be a
+highly unusual scenario), each child still has a 54-bit inode space.
+
+2^54 ~ 1.8 * 10^16
+
+which is much larger than any real world requirement.
+
+
+==============================================
+$ Last updated: Sun Oct 12 23:17:01 IST 2008 $
+$ Author: Vikas Gorur <vikas@zresearch.com>  $
+==============================================
+
diff --git a/doc/hacker-guide/write-behind.txt b/doc/hacker-guide/write-behind.txt
new file mode 100644
index 000000000..498e95480
--- /dev/null
+++ b/doc/hacker-guide/write-behind.txt
@@ -0,0 +1,45 @@
+basic working
+--------------
+
+  write-behind is basically a translator that tells the application its write requests are finished before they have actually finished.
+
+  on a regular translator tree without write-behind, control flow is like this:
+
+  1. application makes a write() system call.
+  2. VFS ==> FUSE ==> /dev/fuse.
+  3. fuse-bridge initiates a glusterfs writev() call.
+  4.
writev() is STACK_WIND()ed up to the client-protocol or storage translator.
+  5. client-protocol, on receiving the reply from the server, starts STACK_UNWIND() towards the fuse-bridge.
+
+  on a translator tree with write-behind, control flow is like this:
+
+  1. application makes a write() system call.
+  2. VFS ==> FUSE ==> /dev/fuse.
+  3. fuse-bridge initiates a glusterfs writev() call.
+  4. writev() is STACK_WIND()ed up to the write-behind translator.
+  5. write-behind adds the write buffer to its internal queue and does a STACK_UNWIND() towards the fuse-bridge.
+
+  the write call is now complete from the application's perspective. after STACK_UNWIND()ing towards the fuse-bridge, write-behind initiates a fresh writev() call to its child translator, whose replies will be consumed by write-behind itself. write-behind _doesn't_ cache the write buffer, unless 'option flush-behind on' is specified in the volume specification file.
+
+windowing
+---------
+
+  with respect to write-behind, each write-buffer has three flags: 'stack_wound', 'write_behind' and 'got_reply'.
+
+  stack_wound: if set, indicates that write-behind has initiated STACK_WIND() towards the child translator.
+
+  write_behind: if set, indicates that write-behind has done STACK_UNWIND() towards the fuse-bridge.
+
+  got_reply: if set, indicates that write-behind has received the reply from the child translator for a writev() STACK_WIND(). a request will be destroyed by write-behind only if this flag is set.
+
+  currently pending write requests = aggregate size of requests with write_behind = 1 and got_reply = 0.
+
+  window size limits the aggregate size of currently pending write requests. once the pending requests' size has reached the window size, write-behind blocks writev() calls from the fuse-bridge.
+  blocking is only from the application's perspective. write-behind does STACK_WIND() to the child translator straight away, but holds back the STACK_UNWIND() towards the fuse-bridge.
STACK_UNWIND() is done only once write-behind gets enough replies to accommodate the currently blocked request.
+
+flush behind
+------------
+
+  if 'option flush-behind on' is specified in the volume specification file, then write-behind sends aggregate write requests to the child translator, instead of regular per-request STACK_WIND()s.
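The windowing rule from the 'windowing' section can be sketched as follows. The flag names come from the text, but wb_request_t, wb_pending_size() and wb_must_block() are hypothetical names for this illustration and are not taken from the write-behind source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one queued write request and its three flags. */
typedef struct {
        size_t size;
        bool   stack_wound;   /* STACK_WIND() towards the child was initiated */
        bool   write_behind;  /* STACK_UNWIND() towards fuse-bridge was done */
        bool   got_reply;     /* the child replied to the writev() */
} wb_request_t;

/* Aggregate size of requests already unwound but not yet replied to. */
size_t
wb_pending_size (const wb_request_t *reqs, size_t count)
{
        size_t total = 0, i;

        for (i = 0; i < count; i++)
                if (reqs[i].write_behind && !reqs[i].got_reply)
                        total += reqs[i].size;
        return total;
}

/* True if the next writev() has to wait: the pending size has reached
 * the window size. */
bool
wb_must_block (const wb_request_t *reqs, size_t count, size_t window_size)
{
        return wb_pending_size (reqs, count) >= window_size;
}
```

A request that has been unwound and already answered no longer counts against the window, so every reply from the child frees room for a blocked writev().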