path: root/Feature Planning/GlusterFS 4.0/stat-xattr-cache.md
author: Raghavendra Talur <raghavendra.talur@gmail.com> 2015-08-20 15:09:31 +0530
committer: Humble Devassy Chirammal <humble.devassy@gmail.com> 2015-08-31 02:27:22 -0700
commit: 9e9e3c5620882d2f769694996ff4d7e0cf36cc2b (patch)
tree: 3a00cbd0cc24eb7df3de9b2eeeb8d42ee9175f88 /Feature Planning/GlusterFS 4.0/stat-xattr-cache.md
parent: f6055cdb4dedde576ed8ec55a13814a69dceefdc (diff)
Create basic directory structure
All new feature specs go into the in_progress directory. Once signed off, a spec should be moved to the done directory. For now, this change moves all the Gluster 4.0 feature specs to in_progress. All other specs are under done/release-version. Further cleanup will be done incrementally.

Change-Id: Id272d301ba8c434cbf7a9a966ceba05fe63b230d
BUG: 1206539
Signed-off-by: Raghavendra Talur <rtalur@redhat.com>
Reviewed-on: http://review.gluster.org/11969
Reviewed-by: Humble Devassy Chirammal <humble.devassy@gmail.com>
Reviewed-by: Prashanth Pai <ppai@redhat.com>
Tested-by: Humble Devassy Chirammal <humble.devassy@gmail.com>
Diffstat (limited to 'Feature Planning/GlusterFS 4.0/stat-xattr-cache.md')
-rw-r--r--  Feature Planning/GlusterFS 4.0/stat-xattr-cache.md  197
1 file changed, 0 insertions, 197 deletions
diff --git a/Feature Planning/GlusterFS 4.0/stat-xattr-cache.md b/Feature Planning/GlusterFS 4.0/stat-xattr-cache.md
deleted file mode 100644
index e00399d..0000000
--- a/Feature Planning/GlusterFS 4.0/stat-xattr-cache.md
+++ /dev/null
@@ -1,197 +0,0 @@
-Feature
--------
-
-server-side md-cache
-
-Summary
--------
-
-Two years ago, Peter Portante noticed the extremely high number of
-system calls issued to the XFS brick for each Swift object. Since then,
-he and Ben England have observed several similar cases.
-
-More recently, while looking at a **netmist** single-thread workload run
-by a major banking customer to characterize Gluster performance, Ben
-observed this [system call profile PER FILE](https://s3.amazonaws.com/ben.england/netmist-and-gluster.pdf).
-This is strong evidence of several problems with the POSIX translator:
-
-- repeated polling with **sys\_lgetxattr** of the **gfid** xattr
-- repeated **sys\_lstat** calls
-- polling of xattrs that were *undefined*
-- calling **sys\_llistxattr** to get the list of all xattrs AFTER all
-  other calls
-- calling **sys\_lgetxattr** two times, once to find out how big the
-  value is and once to get the value (see the sketch below)
-- one-at-a-time calls to get individual xattrs
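-
-To make the cost concrete, here is a minimal, self-contained sketch of
-the double-call pattern noted above, using the stock Linux xattr API
-(the brick path used here is a made-up example):
-
-```c
-#include <stdio.h>
-#include <stdlib.h>
-#include <sys/xattr.h>
-
-int main(void)
-{
-    const char *path = "/bricks/brick1/somefile"; /* hypothetical brick path */
-    const char *name = "trusted.gfid";
-
-    /* First syscall: probe the value size only (NULL buffer). */
-    ssize_t size = lgetxattr(path, name, NULL, 0);
-    if (size < 0) {
-        perror("lgetxattr (size probe)");
-        return 1;
-    }
-
-    /* Second syscall: fetch the value itself.  A generously sized
-     * buffer would have needed only one call in the common case. */
-    char *value = malloc(size);
-    if (!value || lgetxattr(path, name, value, size) < 0) {
-        perror("lgetxattr (value fetch)");
-        free(value);
-        return 1;
-    }
-
-    printf("%s is %zd bytes\n", name, size);
-    free(value);
-    return 0;
-}
-```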
-
-All of the problems except for the last one could be solved through the
-use of a metadata cache associated with each inode. The last problem is
-not solvable with a pure POSIX API at this time, although XFS offers an
-**ioctl** that can retrieve all xattrs at once (the cache could
-conceivably determine whether the brick is XFS and exploit this where
-available).
-
-Note that as xattrs are added to the system this cost keeps growing, and
-new Gluster features typically require per-file state, usually kept in
-one or more xattrs.
-
-Owners
-------
-
-TBS
-
-Current status
---------------
-
-There is already an **md-cache** translator, so you would think that
-problems like this would not occur, but clearly they do. This
-translator is typically on the client side of the protocol and is
-typically stacked above translators such as AFR and DHT. The problems
-may be worse in cases where the md-cache translator is not present
-(example: SMB with the gluster-vfs plugin, which requires the
-stat-prefetch volume parameter to be set to *off*).
-
-Related Feature Requests and Bugs
----------------------------------
-
-- [Features/Smallfile Perf](../GlusterFS 3.7/Small File Performance.md)
-- bugzillas TBS
-
-Detailed Description
---------------------
-
-This proposal has changed as a result of discussions in
-\#gluster-meeting: instead of modifying the POSIX translator, we
-propose to load the md-cache translator on the server side, directly
-above the POSIX translator, and to add negative caching capabilities
-to the md-cache translator.
-
-By "negative caching" we mean that md-cache can tell you if the xattr
-does not exist without calling down the translator stack. How can it do
-this? In the server side, the only path to the brick is through the
-md-cache translator. When it encounters a xattr get request for a file
-it has not seen before, the first step is to call down with llistxattr()
-to find out what xattrs are stored for that file. From that point on
-until the file is evicted from the cache, any request for non-existent
-xattr values from higher translators will immediately be returned with
-ENODATA, without calling down to POSIX translator.
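-
-A minimal sketch of this lookup path, assuming an invented per-inode
-structure (this is not md-cache's actual code, only an illustration of
-the negative-caching idea):
-
-```c
-#include <errno.h>
-#include <stdio.h>
-#include <string.h>
-#include <sys/xattr.h>
-
-/* Invented per-inode cache entry: holds the result of one llistxattr().
- * Buffer growth/retry on truncation is omitted for brevity. */
-struct xattr_cache {
-    char    names[8192];  /* NUL-separated list of xattr names */
-    ssize_t names_len;    /* -1 until the list has been loaded */
-};
-
-/* Fill the cache with a single llistxattr() call on first access. */
-static int xattr_cache_fill(struct xattr_cache *c, const char *path)
-{
-    c->names_len = llistxattr(path, c->names, sizeof(c->names));
-    return (c->names_len < 0) ? -1 : 0;
-}
-
-/* Return 0 if 'name' is known to exist on the brick, -ENODATA if it is
- * known not to exist; no further syscalls are needed either way. */
-static int xattr_cache_probe(const struct xattr_cache *c, const char *name)
-{
-    const char *p = c->names;
-
-    while (p < c->names + c->names_len) {
-        if (strcmp(p, name) == 0)
-            return 0;           /* positive hit: caller may fetch the value */
-        p += strlen(p) + 1;     /* entries are NUL-separated */
-    }
-    return -ENODATA;            /* negative hit: answered from the cache */
-}
-
-int main(void)
-{
-    struct xattr_cache c;
-    const char *path = "/bricks/brick1/somefile"; /* hypothetical path */
-
-    if (xattr_cache_fill(&c, path) < 0)
-        return 1;
-
-    /* Any later lookup of an absent xattr is answered without a syscall. */
-    if (xattr_cache_probe(&c, "trusted.glusterfs.made-up") == -ENODATA)
-        puts("negative cache hit");
-    return 0;
-}
-```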
-
-We must ensure that the cache neither leaks memory nor suffers race
-conditions when multiple threads access it concurrently, but this seems
-like a manageable problem and is certainly not a new one for Gluster
-translator code.
-
-Benefit to GlusterFS
---------------------
-
-Most of the system calls and about 50% of the elapsed time could have
-been removed from the above small-file read profile through the use of
-this cache. This benefit will become more visible as we transition to
-SSD storage, where disk seek times will no longer mask overheads such
-as this.
-
-Scope
------
-
-This can be done locally in the glusterfsd process by inserting the
-md-cache translator just above the POSIX translator, where the vast
-majority of the stat, getxattr and setxattr calls are generated.
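-
-For illustration, a hypothetical brick volfile fragment (volume names
-and the brick directory are invented) showing the intended stacking:
-
-```
-# md-cache loaded directly above the POSIX translator inside glusterfsd
-volume brick-posix
-    type storage/posix
-    option directory /bricks/brick1
-end-volume
-
-volume brick-md-cache
-    type performance/md-cache
-    subvolumes brick-posix
-end-volume
-```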
-
-### Nature of proposed change
-
-No new translators are required. Where their sys\_\*xattr calls are
-frequent, we may require some existing translators to call down the
-stack ("wind a FOP") instead of calling sys\_\*xattr themselves, so that
-they can take advantage of the stat-xattr-cache.
-
-It is *really important* that md-cache use listxattr() to determine
-immediately which xattrs are on disk, avoiding needless getxattr calls.
-At present it does not do this.
-
-### Implications on manageability
-
-None. We need to make sure that the cache is big enough to support the
-threads that use it, but not so big that it consumes a significant
-percentage of memory. We may want to make the cache size and expiration
-time tunables so that we can experiment in performance testing to
-determine optimal values.
-
-### Implications on presentation layer
-
-Translators above the md-cache translator are not affected.
-
-### Implications on persistence layer
-
-None.
-
-### Implications on 'GlusterFS' backend
-
-None
-
-### Modification to GlusterFS metadata
-
-None
-
-### Implications on 'glusterd'
-
-None
-
-How To Test
------------
-
-We can use strace of a single-thread smallfile workload to verify that
-the cache is filtering out excess system calls. We could also add
-counters to the cache to measure the cache hit rate.
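-
-For example (a sketch; which brick process to attach to depends on the
-test setup), per-syscall counts for one brick daemon could be gathered
-while the workload runs:
-
-```
-# Summarize system calls made by one glusterfsd brick process and its
-# threads; replace <brick-pid> with the PID of the brick under test.
-strace -c -f -p <brick-pid> -o /tmp/brick-syscall-profile.txt
-```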
-
-User Experience
----------------
-
-Single-thread small-file creates should be faster, particularly on SSD
-storage. Performance testing is needed to further quantify this.
-
-Dependencies
-------------
-
-None
-
-Documentation
--------------
-
-None, except for tunables relating to cache size and expiration time.
-
-Status
-------
-
-Not started.
-
-Comments and Discussion
------------------------
-
-Jeff Darcy: I've been saying for ages that we should store xattrs in a
-local DB and avoid on-disk xattrs altogether. Besides performance, this
-would also eliminate the need for special configuration of the
-underlying local FS (to accommodate our highly unusual use of this
-feature) and generally be good for platform independence. Not quite so
-sure about other stat(2) information, but perhaps I could be persuaded.
-In any case, this has led me to look into the relevant code on a few
-occasions. Unfortunately, there are \*many\* places that directly call
-sys\_\*xattr instead of winding fops - glusterd (for replace-brick),
-changelog, quota, snapshots, and others. I think this feature is still
-very worthwhile, but all of the "cheating" we've tolerated over the
-years is going to make it more difficult.
-
-Ben England: a local DB might be a good option but could also become a
-bottleneck, unless you have a DB instance per brick (local) filesystem.
-One problem that the DB would solve is getting all the metadata in one
-query - at present the POSIX API requires you to get one xattr at a
-time. If we implement a caching layer that hides whether a DB or xattrs
-are being used, we can make it easier to experiment with a DB (LevelDB?).
-On your 2nd point, while it's true that there are many sites that call
-sys\_\*xattr directly, only a few of these really generate a lot of
-system calls. For example, some of these calls are only for the
-mountpoint. From a performance perspective, as long as we can intercept
-the vast majority of the sys\_\*xattr calls with this caching layer,
-IMHO we can tolerate a few exceptions in glusterd, etc. However, from a
-CORRECTNESS standpoint, we have to be careful that calls bypassing the
-caching layer don't cause cache contents to become stale (out-of-date,
-inconsistent with the on-disk brick filesystem contents).