From 2855dff243f20a78cd8cc4e7cd581a9c558b2e69 Mon Sep 17 00:00:00 2001
From: Raghavendra Bhat
Date: Tue, 23 Sep 2014 13:02:56 +0530
Subject: doc: documentation of inode and dentry management

Change-Id: Ica510752d011596e8ecff5ea13c4b2bbf76ba186
BUG: 1145475
Signed-off-by: Raghavendra Bhat
Reviewed-on: http://review.gluster.org/8815
Tested-by: Gluster Build System
Reviewed-by: Vijay Bellur
---
 doc/hacker-guide/en-US/markdown/inode.md | 226 +++++++++++++++++++++++++++++++
 1 file changed, 226 insertions(+)
 create mode 100644 doc/hacker-guide/en-US/markdown/inode.md

diff --git a/doc/hacker-guide/en-US/markdown/inode.md b/doc/hacker-guide/en-US/markdown/inode.md
new file mode 100644
index 00000000000..a340ab9ca8e
--- /dev/null
+++ b/doc/hacker-guide/en-US/markdown/inode.md
@@ -0,0 +1,226 @@
#Inode and dentry management in GlusterFS

##Background
Filesystems internally refer to files and directories via inodes. Inodes
are unique identifiers of the entities stored in a filesystem. Whenever an
application has to operate on a file/directory (read/modify), the filesystem
maps that file/directory to the right inode and starts referring to that
inode whenever an operation has to be performed on the file/directory.

In GlusterFS a new inode is created whenever a new file/directory is created
OR when a successful lookup is done on a file/directory for the first time.
Inodes in GlusterFS are maintained by the inode table, which is initialized
when the filesystem daemon is started (both for the brick process and the
mount process). Below are some important data structures for inode
management.

##Data structure (inode-table)
```
struct _inode_table {
        pthread_mutex_t    lock;
        size_t             hashsize;    /* bucket size of inode hash and dentry hash */
        char              *name;        /* name of the inode table, just for gf_log() */
        inode_t           *root;        /* root directory inode, with inode
                                           number and gfid 1 */
        xlator_t          *xl;          /* xlator to be called to do purge and
                                           the xlator which maintains the inode table */
        uint32_t           lru_limit;   /* maximum LRU cache size */
        struct list_head  *inode_hash;  /* buckets for inode hash table */
        struct list_head  *name_hash;   /* buckets for dentry hash table */
        struct list_head   active;      /* list of inodes currently active (in an fop) */
        uint32_t           active_size; /* count of inodes in active list */
        struct list_head   lru;         /* list of inodes recently used.
                                           lru.next most recent */
        uint32_t           lru_size;    /* count of inodes in lru list */
        struct list_head   purge;       /* list of inodes to be purged soon */
        uint32_t           purge_size;  /* count of inodes in purge list */

        struct mem_pool   *inode_pool;  /* memory pool for inodes */
        struct mem_pool   *dentry_pool; /* memory pool for dentries */
        struct mem_pool   *fd_mem_pool; /* memory pool for fd_t */
        int                ctxcount;    /* number of slots in inode->ctx */
};
```

##Life cycle of the inode table

A new inode table is allocated with:

```
inode_table_new (size_t lru_limit, xlator_t *xl)
```

Usually the top xlators in the graph, such as protocol/server (for bricks),
fuse and nfs (for fuse and nfs mounts), and libgfapi do inode management.
Hence they are the ones which allocate a new inode table by calling the
above function (a minimal sketch of this follows below).

Each xlator graph in glusterfs maintains an inode table. So in fuse clients,
whenever there is a graph change due to add brick/remove brick or the
addition/removal of some other xlators, a new graph is created, which in
turn creates a new inode table.
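To make the allocation step concrete, here is a minimal sketch of a
top-level xlator creating its inode table during its init path. The xlator
itself is hypothetical and the lru limit value is only illustrative; this is
not the actual protocol/server, fuse or gfapi code, although they follow the
same pattern and keep the table reachable from the xlator's itable field.

```
/*
 * Minimal sketch (hypothetical xlator, illustrative lru limit): a top-level
 * xlator allocates its inode table once, at init time, and keeps a pointer
 * to it so that later fops can be resolved against it.
 */
#include "xlator.h"
#include "inode.h"

int
init (xlator_t *this)
{
        inode_table_t *itable = NULL;

        /* 16384 matches the brick-side lru limit mentioned later in this
           document; clients effectively use an unlimited lru list */
        itable = inode_table_new (16384, this);
        if (!itable)
                return -1;

        /* keep the table reachable from the xlator for later resolution */
        this->itable = itable;

        return 0;
}
```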
An allocated inode table is destroyed only when the filesystem daemon is
killed or the filesystem is unmounted.

##What the inode table contains

The inode table mainly contains a hash table for maintaining inodes. In
general a file/directory is considered to exist if there is a corresponding
inode present in the inode table. If an inode for a file/directory cannot be
found in the inode table, glusterfs tries to resolve it by sending a lookup
on the entry for which the inode is needed. If the lookup is successful, a
new inode corresponding to the entry is added to the hash table present in
the inode table. Thus an inode present in the hash table means it is an
existing file/directory within the filesystem. The inode table also stores
the size of the hash table (as of now it is hard coded to 14057; the hash
value of an inode is calculated using its gfid).

Apart from the hash table, the inode table also maintains 3 important lists
of inodes:

1) Active list:
The active list contains all the active inodes (i.e. inodes which are
currently part of some fop).
2) Lru list:
The least recently used inodes list. A limit can be set for the size of the
lru list. For bricks it is 16384 and for clients it is infinity.
3) Purge list:
The list of all the inodes which have to be purged (i.e. inodes which have
to be deleted from the inode table due to unlink/rmdir/forget).

Finally, the inode table also contains the mem-pools used for allocating
inodes and dentries, so that frequent malloc/calloc and free of these data
structures can be avoided.

##Data structure (inode)
```
struct _inode {
        inode_table_t     *table;       /* the table this inode belongs to */
        uuid_t             gfid;        /* unique identifier of the inode */
        gf_lock_t          lock;
        uint64_t           nlookup;
        uint32_t           fd_count;    /* Open fd count */
        uint32_t           ref;         /* reference count on this inode */
        ia_type_t          ia_type;     /* what kind of file */
        struct list_head   fd_list;     /* list of open files on this inode */
        struct list_head   dentry_list; /* list of directory entries for this inode */
        struct list_head   hash;        /* hash table pointers */
        struct list_head   list;        /* active/lru/purge */

        struct _inode_ctx *_ctx;        /* place holder for keeping the
                                           information about the inode by
                                           different xlators */
};
```

As said above, inodes are the internal way of identifying files/directories.
An inode uniquely represents a file/directory. A new inode is created
whenever a create/mkdir/symlink/mknod operation is performed. Apart from
that, a new inode is created upon the successful fresh lookup of a
file/directory. Say the filesystem contained some file "a" within root and
the filesystem was unmounted. Now when glusterfs is mounted and some
operation is performed on "/a", glusterfs tries to get the inode for the
entry "a" with the parent inode as root. But, since glusterfs just came up,
it will not be able to find the inode for "a" and will send a lookup on
"/a". If the lookup operation succeeds (i.e. the root of glusterfs contains
an entry called "a"), then a new inode for "/a" is created and added to the
inode table.

Depending upon the situation, an inode can be in one of the 3 lists
maintained by the inode table. If some fop is happening on the inode, then
the inode will be present in the active inodes list maintained by the inode
table. Active inodes are those inodes whose refcount is greater than zero.
Whenever some operation comes on a file/directory and the resolver tries to
find the inode for it, it increments the refcount of the inode before
returning the inode.
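As a rough illustration of that resolution step, the sketch below resolves a
gfid against an inode table. The helper name is hypothetical and the real
resolvers in fuse and protocol/server are considerably more involved;
inode_find() and inode_unref() are the actual libglusterfs calls.

```
/*
 * Hypothetical helper, not the real resolver code: look up an inode by its
 * gfid in the inode table.  If the inode is already present, inode_find()
 * returns it with a reference held; if it is absent, a lookup fop has to be
 * wound down the graph and its callback has to inode_link() the new inode
 * (see the life cycle section below).
 */
#include "inode.h"

static inode_t *
resolve_gfid (inode_table_t *itable, uuid_t gfid)
{
        inode_t *inode = NULL;

        inode = inode_find (itable, gfid);
        if (!inode) {
                /* not in the table yet: the caller must trigger a lookup */
                return NULL;
        }

        /* the caller must inode_unref() this inode once the fop is done */
        return inode;
}
```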
The refcount of an inode can also be incremented explicitly by calling the
function below:

```
inode_ref (inode_t *inode)
```

Any xlator which wants to operate on an inode as part of some fop (or wants
the inode in the callback) should hold a ref on the inode. Once the fop is
completed, before sending the reply of the fop to the layers above, the
inode has to be unrefed. When the refcount of an inode becomes zero, it is
removed from the active inodes list and put into the LRU list maintained by
the inode table. In short, if some fop is happening on a file/directory, the
corresponding inode will be in the active list; otherwise it will be in the
LRU list.

##Life cycle of an inode

A new inode is created whenever a new file/directory/symlink is created OR a
successful lookup of an existing entry is done. The xlators which do inode
management (as of now protocol/server, fuse, nfs, gfapi) perform the
inode_link operation upon a successful lookup or the successful creation of
a new entry:

```
inode_link (inode_t *inode, inode_t *parent, const char *name,
            struct iatt *buf);
```

inode_link actually adds the inode to the inode table (to be precise, it
adds the inode to the hash table maintained by the inode table; the hash
value is calculated based on the gfid). It copies the gfid into the inode
(the gfid is present in the iatt structure) and creates a dentry with the
new name.

An inode is removed from the inode table and eventually destroyed when an
unlink or rmdir operation is performed on a file/directory, or when the lru
limit of the inode table has been exceeded.

##Data structure (dentry)
```
struct _dentry {
        struct list_head   inode_list;  /* list of dentries of inode */
        struct list_head   hash;        /* hash table pointers */
        inode_t           *inode;       /* inode of this directory entry */
        char              *name;        /* name of the directory entry */
        inode_t           *parent;      /* directory of the entry */
};
```

A dentry is the presence of an entry for a file/directory within its parent
directory. A dentry usually points to the inode to which it belongs. In
glusterfs a dentry contains the following fields:
1) A hook using which it can add itself to the list of dentries maintained
by the inode to which it points.
2) A hash table pointer.
3) A pointer to the inode to which it belongs.
4) The name of the directory entry.
5) A pointer to the inode of the parent directory in which the dentry is
present.

A new dentry is created when a new file/directory/symlink is created or a
hard link to an existing file is created:

```
__dentry_create (inode_t *inode, inode_t *parent, const char *name);
```

A dentry holds a refcount on the parent directory so that the parent inode
is never removed from the active inodes list and put onto the lru list (if
the lru limit of the lru list is exceeded, there is a chance of the parent
inode being destroyed; to avoid that, dentries hold a reference to the
parent inode). A dentry is removed whenever an unlink/rmdir is performed on
a file/directory, or when the lru limit has been exceeded and the oldest
inodes are purged out of the inode table, during which all the dentries of
those inodes are removed.

Whenever an unlink/rmdir comes on a file/directory, the corresponding inode
should be removed from the inode table. So upon unlink/rmdir, the inode is
moved to the purge list maintained by the inode table and destroyed from
there. To be more specific, for an inode to be destroyed, its refcount and
its nlookup count should both become 0, as summarised in the sketch below.
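The two conditions can be stated in code form. The helper below is only an
illustrative paraphrase of the rule, not a function from libglusterfs (the
real code inspects these counters with the inode table's lock held).

```
/*
 * Illustrative paraphrase of the destruction rule (not libglusterfs code):
 * an inode can be destroyed only when no xlator holds a reference on it and
 * every lookup done on it has been forgotten.
 */
#include "inode.h"

static int
inode_is_destroyable (inode_t *inode)
{
        return (inode->ref == 0) && (inode->nlookup == 0);
}
```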
For the refcount to become 0, the inode should not be part of any fop and
there should not be any open fds on it. If the inode belongs to a directory,
then there should not be any fop happening on the directory and it should
not contain any dentries within it. For the nlookup count to become zero, a
forget has to be sent on the inode with the nlookup count set to 0 as an
argument. For fuse clients, forget is sent by the kernel itself whenever an
unlink/rmdir is performed. But for brick processes, upon unlink/rmdir, the
protocol/server itself has to do inode_forget (a sketch of this brick-side
cleanup is given at the end of this document). Whenever the inode has to be
deleted due to file removal or the lru limit being exceeded, the inode is
retired (i.e. all the dentries of the inode are deleted and the inode is
moved to the purge list maintained by the inode table) and the nlookup count
is set to 0 via the inode_forget api. The inode table then prunes all the
inodes from the purge list, destroying the inode contexts maintained by each
xlator.

Unlinking of the dentry is done via inode_unlink:

```
void
inode_unlink (inode_t *inode, inode_t *parent, const char *name);
```

If the inode has multiple hard links, then the unlink operation performed by
the application results just in the removal of the dentry with the name
provided by the application. For the inode to be removed, all the dentries
of the inode should be unlinked.
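To tie the pieces together, here is a rough sketch of the brick-side cleanup
described above, loosely modelled on what protocol/server has to do once the
last dentry of a file has been unlinked. The helper name and the surrounding
error handling are illustrative; inode_unlink(), inode_forget() and
inode_unref() are the actual libglusterfs calls.

```
/*
 * Hypothetical brick-side cleanup after a successful unlink of the last
 * (or only) hard link.  There is no kernel to send a forget on a brick, so
 * the server has to drop the lookups itself; once the last reference also
 * goes away, the inode moves to the purge list and is pruned by the inode
 * table.
 */
#include "inode.h"

static void
unlink_cleanup (inode_t *inode, inode_t *parent, const char *name)
{
        /* remove the dentry <parent, name> pointing to this inode */
        inode_unlink (inode, parent, name);

        /* an nlookup argument of 0 means "forget all lookups" */
        inode_forget (inode, 0);

        /* drop our reference; with ref == 0 and nlookup == 0 the inode can
           now be retired and destroyed */
        inode_unref (inode);
}
```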