From 601bfa2719d8c9be40982b8a6526c21cd0ea4966 Mon Sep 17 00:00:00 2001
From: Kaushal M
Date: Wed, 20 Jan 2016 13:09:23 +0530
Subject: Rename in_progress to under_review

`in_progress` is a vague term, which could either mean that the feature
review is in progress, or that the feature implementation is in
progress. Renaming to `under_review` gives a much better indication that
the feature is under review and implementation hasn't begun yet.

Refer to https://review.gluster.org/13187 for the discussion which led
to this change.

Change-Id: I3f48e15deb4cf5486d7b8cac4a7915f9925f38f5
Signed-off-by: Kaushal M
Reviewed-on: http://review.gluster.org/13264
Reviewed-by: Raghavendra Talur
Tested-by: Raghavendra Talur
---
 under_review/stat-xattr-cache.md | 197 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 197 insertions(+)
 create mode 100644 under_review/stat-xattr-cache.md

diff --git a/under_review/stat-xattr-cache.md b/under_review/stat-xattr-cache.md
new file mode 100644
index 0000000..e00399d
--- /dev/null
+++ b/under_review/stat-xattr-cache.md
@@ -0,0 +1,197 @@
Feature
-------

server-side md-cache

Summary
-------

Two years ago, Peter Portante noticed the extremely high number of
system calls required on the XFS brick per Swift object. Since then, he
and Ben England have observed several similar cases.

More recently, while looking at a **netmist** single-thread workload run
by a major banking customer to characterize Gluster performance, Ben
observed this [system call profile PER
FILE](https://s3.amazonaws.com/ben.england/netmist-and-gluster.pdf).
This is strong evidence of several problems with the POSIX translator:

- repeated polling of the **gfid** xattr with **sys\_lgetxattr**
- repeated **sys\_lstat** calls
- polling of xattrs that were *undefined*
- calling **sys\_llistxattr** to get the list of all xattrs AFTER all
  the other calls
- calling **sys\_lgetxattr** twice per xattr: once to find out how big
  the value is and once to get the value (a sketch of this pattern
  appears below, after Current status)
- one-at-a-time calls to get individual xattrs

All of the problems except for the last one could be solved through use
of a metadata cache associated with each inode. The last problem is not
solvable through the pure POSIX API at this time, although XFS offers an
**ioctl** that can get all xattrs at once (the cache could conceivably
determine whether the brick is XFS and exploit this where available).

Note that this becomes more and more costly as xattrs are added to the
system, and new Gluster features typically require state to be kept per
file, usually in one or more xattrs.

Owners
------

TBS

Current status
--------------

There is already an **md-cache** translator, so you would think that
problems like this would not occur, but clearly they do: that translator
typically sits on the client side of the protocol, above translators
such as AFR and DHT. The problems may be worse in cases where the
md-cache translator is not present at all (for example, SMB with the
gluster-vfs plugin, which requires the stat-prefetch volume parameter to
be set to *off*).
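
To make the per-file call pattern listed under Summary concrete, here is
a minimal standalone C sketch. It is not Gluster code: the file path is
a placeholder and the xattr names are used purely for illustration. It
contrasts the two-call **sys\_lgetxattr** probe (size first, then value,
one xattr at a time) with a single **sys\_llistxattr** pass that returns
every name at once.

```c
/*
 * Minimal sketch, not Gluster code: the file path and xattr names below
 * are placeholders for illustration. It contrasts the per-xattr
 * two-call pattern seen in the profile above with a single
 * llistxattr() pass.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* The two-call pattern: one lgetxattr() to learn the value size, then a
 * second one to fetch the value, repeated for every xattr of interest. */
static void fetch_one_by_one(const char *path, const char **names, int n)
{
    for (int i = 0; i < n; i++) {
        ssize_t size = lgetxattr(path, names[i], NULL, 0); /* call #1: size probe */
        if (size <= 0)
            continue;                                      /* absent or empty */
        char *value = malloc(size);
        if (value && lgetxattr(path, names[i], value, size) >= 0) /* call #2 */
            printf("%s = %zd bytes\n", names[i], size);
        free(value);
    }
}

/* One llistxattr() call returns every xattr name present on the file, so
 * "does xattr X exist?" can be answered without touching the brick again. */
static void list_once(const char *path)
{
    char names[4096];
    ssize_t len = llistxattr(path, names, sizeof(names));
    if (len < 0)
        return;
    for (char *p = names; p < names + len; p += strlen(p) + 1)
        printf("present: %s\n", p);
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "testfile";  /* placeholder file */
    const char *wanted[] = { "trusted.gfid", "trusted.glusterfs.dht" };

    fetch_one_by_one(path, wanted, 2);  /* two syscalls per xattr queried */
    list_once(path);                    /* one syscall for all names */
    return 0;
}
```

The listing pass answers every "does this xattr exist?" question with
one system call, which is what the proposed server-side cache would do
on first access to an inode.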

Related Feature Requests and Bugs
---------------------------------

- [Features/Smallfile Perf](../GlusterFS 3.7/Small File Performance.md)
- bugzillas TBS

Detailed Description
--------------------

This proposal has changed as a result of discussions in
\#gluster-meeting: instead of modifying the POSIX translator, we propose
to load the md-cache translator on the server side, above the POSIX
translator, and to add negative-caching capabilities to md-cache.

By "negative caching" we mean that md-cache can report that an xattr
does not exist without calling down the translator stack. How can it do
this? On the server side, the only path to the brick is through the
md-cache translator. When it encounters an xattr get request for a file
it has not seen before, the first step is to call down with llistxattr()
to find out which xattrs are stored for that file. From that point on,
until the file is evicted from the cache, any request from higher
translators for a non-existent xattr value is immediately answered with
ENODATA, without calling down to the POSIX translator. (A standalone
sketch of this lookup path appears after the User Experience section
below.)

We must ensure that the cache does not leak memory and that concurrent
access by multiple threads is free of races, but this seems like a
manageable problem and is certainly not a new one for Gluster translator
code.

Benefit to GlusterFS
--------------------

Most of the system calls and about 50% of the elapsed time could have
been removed from the small-file read profile above through use of this
cache. The benefit will become more visible as we transition to SSD
storage, where disk seek times no longer mask overheads such as this.

Scope
-----

This can be done locally in the glusterfsd process by inserting the
md-cache translator just above the POSIX translator, which is where the
vast majority of the stat, getxattr and setxattr calls are generated.

### Nature of proposed change

No new translators are required. We may require some existing
translators to call down the stack ("wind a FOP") instead of calling
sys\_\*xattr themselves where those calls are heavily used, so that they
can take advantage of the stat-xattr-cache.

It is *really important* that md-cache use listxattr() to determine up
front which xattrs are on disk, avoiding needless getxattr calls. At
present it does not do this.

### Implications on manageability

None. We need to make sure that the cache is big enough to support the
threads that use it, but not so big that it consumes a significant
percentage of memory. We may want to make the cache size and expiration
time tunable so that we can experiment in performance testing to
determine optimal parameters.

### Implications on presentation layer

Translators above the md-cache translator are not affected.

### Implications on persistence layer

None.

### Implications on 'GlusterFS' backend

None.

### Modification to GlusterFS metadata

None.

### Implications on 'glusterd'

None.

How To Test
-----------

We can strace a single-thread smallfile workload to verify that the
cache is filtering out excess system calls. We could also add counters
to the cache to measure the cache hit rate.

User Experience
---------------

Single-thread small-file creates should be faster, particularly on SSD
storage. Performance testing is needed to further quantify this.
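
For illustration only, here is a minimal, self-contained C sketch of the
negative-caching lookup described under Detailed Description. It is not
md-cache or translator code: the structure and function names are
invented for this example, it keeps a single in-memory entry rather than
a per-inode cache, and a real implementation would wind FOPs down the
stack and handle locking, eviction and invalidation.

```c
/*
 * Illustrative sketch only: not md-cache or translator code. The
 * structure and function names are invented, and a real implementation
 * would wind FOPs down the stack and handle locking, eviction and
 * invalidation.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/xattr.h>

#define MAX_XATTR_NAMES 64

struct xattr_cache_entry {         /* hypothetical per-inode cache record */
    char *names[MAX_XATTR_NAMES];  /* xattr names known to exist on disk */
    int   count;
    int   populated;               /* set once llistxattr() has been done */
};

/* First access: one llistxattr() call records which xattrs exist. */
static int cache_populate(struct xattr_cache_entry *e, const char *path)
{
    char buf[8192];
    ssize_t len = llistxattr(path, buf, sizeof(buf));
    if (len < 0)
        return -1;
    for (char *p = buf; p < buf + len && e->count < MAX_XATTR_NAMES;
         p += strlen(p) + 1)
        e->names[e->count++] = strdup(p);
    e->populated = 1;
    return 0;
}

/* Cached getxattr: names not in the list are answered with ENODATA
 * immediately; only names known to exist fall through to lgetxattr(). */
static ssize_t cached_getxattr(struct xattr_cache_entry *e, const char *path,
                               const char *name, void *value, size_t size)
{
    if (!e->populated && cache_populate(e, path) < 0)
        return lgetxattr(path, name, value, size);      /* fall back on error */

    for (int i = 0; i < e->count; i++)
        if (strcmp(e->names[i], name) == 0)
            return lgetxattr(path, name, value, size);  /* known to exist */

    errno = ENODATA;                /* negative cache hit: no syscall made */
    return -1;
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "testfile";  /* placeholder file */
    struct xattr_cache_entry e = { .count = 0 };
    char val[256];

    if (cached_getxattr(&e, path, "user.does-not-exist", val, sizeof(val)) < 0)
        printf("user.does-not-exist: %s\n", strerror(errno));

    for (int i = 0; i < e.count; i++)  /* the leak concern noted above applies */
        free(e.names[i]);
    return 0;
}
```

The point it demonstrates is the one made above: after a single
llistxattr() on first access, every request for an absent xattr can be
answered with ENODATA without any further system call.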

Dependencies
------------

None.

Documentation
-------------

None, except for tunables relating to cache size and expiration time.

Status
------

Not started.

Comments and Discussion
-----------------------

Jeff Darcy: I've been saying for ages that we should store xattrs in a
local DB and avoid local xattrs altogether. Besides performance, this
would also eliminate the need for special configuration of the
underlying local FS (to accommodate our highly unusual use of this
feature) and generally be good for platform independence. Not quite so
sure about other stat(2) information, but perhaps I could be persuaded.
In any case, this has led me to look into the relevant code on a few
occasions. Unfortunately, there are \*many\* places that directly call
sys\_\*xattr instead of winding fops: glusterd (for replace-brick),
changelog, quota, snapshots, and others. I think this feature is still
very worthwhile, but all of the "cheating" we've tolerated over the
years is going to make it more difficult.

Ben England: A local DB might be a good option, but it could also become
a bottleneck unless you have a DB instance per brick (local) filesystem.
One problem that a DB would solve is getting all the metadata in one
query; at present the POSIX API requires you to get one xattr at a time.
If we implement a caching layer that hides whether a DB or xattrs are
being used, we can make it easier to experiment with a DB (LevelDB?); a
rough sketch of such an interface appears below. On your second point,
while it's true that there are many sites that call sys\_\*xattr
directly, only a few of them really generate a lot of system calls. For
example, some of these calls are made only for the mountpoint. From a
performance perspective, as long as we can intercept the vast majority
of the sys\_\*xattr calls with this caching layer, IMHO we can tolerate
a few exceptions in glusterd, etc. However, from a CORRECTNESS
standpoint, we have to be careful that calls bypassing the caching layer
don't cause cache contents to become stale (out-of-date, inconsistent
with the on-disk brick filesystem contents).
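
As a rough illustration of the storage-neutral layer suggested in the
comment above, here is a small C sketch (all names are invented; this is
not an existing Gluster API) in which callers reach metadata only
through a table of function pointers, so an xattr-backed brick, a local
DB, or a caching layer in front of either could be substituted without
changing the callers.

```c
/*
 * Rough sketch of a storage-neutral metadata interface (invented names;
 * not an existing Gluster API). Callers go through function pointers,
 * so the backend behind them can be xattr-based, DB-based, or a caching
 * layer in front of either.
 */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

struct md_store_ops {
    /* fetch one key's value */
    ssize_t (*get)(void *backend, const char *path, const char *key,
                   void *value, size_t size);
    /* fetch every key name in a single query */
    ssize_t (*list)(void *backend, const char *path, char *names,
                    size_t size);
};

struct md_store {
    const struct md_store_ops *ops;  /* xattr-backed, DB-backed, or cached */
    void *backend;                   /* backend-private state (unused here) */
};

/* One possible backend: map the interface straight onto sys_*xattr. */
static ssize_t xattr_get(void *backend, const char *path, const char *key,
                         void *value, size_t size)
{
    (void)backend;
    return lgetxattr(path, key, value, size);
}

static ssize_t xattr_list(void *backend, const char *path, char *names,
                          size_t size)
{
    (void)backend;
    return llistxattr(path, names, size);
}

static const struct md_store_ops xattr_backend_ops = {
    .get  = xattr_get,
    .list = xattr_list,
};

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "testfile";  /* placeholder file */
    struct md_store store = { .ops = &xattr_backend_ops, .backend = NULL };
    char names[4096];

    ssize_t len = store.ops->list(store.backend, path, names, sizeof(names));
    if (len < 0) {
        perror("list");
        return 1;
    }
    for (char *p = names; p < names + len; p += strlen(p) + 1)
        printf("key: %s\n", p);
    return 0;
}
```

The xattr-backed implementation shown is just one possible backend; a
DB-backed one would implement the same two operations against its own
store, and a cache could sit in front of either.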