| author | Raghavendra Talur <raghavendra.talur@gmail.com> | 2015-08-20 15:09:31 +0530 |
|---|---|---|
| committer | Humble Devassy Chirammal <humble.devassy@gmail.com> | 2015-08-31 02:27:22 -0700 |
| commit | 9e9e3c5620882d2f769694996ff4d7e0cf36cc2b (patch) | |
| tree | 3a00cbd0cc24eb7df3de9b2eeeb8d42ee9175f88 /Feature Planning/GlusterFS 4.0/stat-xattr-cache.md | |
| parent | f6055cdb4dedde576ed8ec55a13814a69dceefdc (diff) | |
Create basic directory structure
All new feature specs go into the in_progress directory.
Once signed off, a spec should be moved to the done directory.

For now, this change moves all the GlusterFS 4.0 feature specs to
in_progress. All other specs are under done/release-version.
More cleanup will be done incrementally.
Change-Id: Id272d301ba8c434cbf7a9a966ceba05fe63b230d
BUG: 1206539
Signed-off-by: Raghavendra Talur <rtalur@redhat.com>
Reviewed-on: http://review.gluster.org/11969
Reviewed-by: Humble Devassy Chirammal <humble.devassy@gmail.com>
Reviewed-by: Prashanth Pai <ppai@redhat.com>
Tested-by: Humble Devassy Chirammal <humble.devassy@gmail.com>
Diffstat (limited to 'Feature Planning/GlusterFS 4.0/stat-xattr-cache.md')
| -rw-r--r-- | Feature Planning/GlusterFS 4.0/stat-xattr-cache.md | 197 |
|---|---|---|

1 file changed, 0 insertions, 197 deletions
diff --git a/Feature Planning/GlusterFS 4.0/stat-xattr-cache.md b/Feature Planning/GlusterFS 4.0/stat-xattr-cache.md
deleted file mode 100644
index e00399d..0000000
--- a/Feature Planning/GlusterFS 4.0/stat-xattr-cache.md
+++ /dev/null
@@ -1,197 +0,0 @@

Feature
-------

server-side md-cache

Summary
-------

Two years ago, Peter Portante noticed the extremely high number of
system calls on the XFS brick required per Swift object. Since then, he
and Ben England have observed several similar cases.

More recently, while looking at a **netmist** single-thread workload run
by a major banking customer to characterize Gluster performance, Ben
observed this [system call profile PER
FILE](https://s3.amazonaws.com/ben.england/netmist-and-gluster.pdf).
This is strong evidence of several problems with the POSIX translator:

- repeated polling with **sys\_lgetxattr** of the **gfid** xattr
- repeated **sys\_lstat** calls
- polling of xattrs that were *undefined*
- calling **sys\_llistxattr** to get the list of all xattrs AFTER all
  other calls
- calling **sys\_lgetxattr** two times, once to find out how big the
  value is and once to get the value!
- one-at-a-time calls to get individual xattrs

All of the problems except for the last one could be solved through use
of a metadata cache associated with each inode. The last problem is not
solvable in a pure POSIX API at this time, although XFS offers an
**ioctl** that can get all xattrs at once (the cache could conceivably
determine whether the brick was XFS or not and exploit this where
available).

Note that as xattrs are added to the system, this becomes more and more
costly, and as Gluster adds new features, these typically require that
state be kept associated with a file, usually in one or more xattrs.

Owners
------

TBS

Current status
--------------

There is already an **md-cache** translator, so you would think that
problems like this would not occur, but clearly they do -- this
translator is typically on the client side of the protocol and is
typically above such translators as AFR and DHT. The problems may be
worse in cases where the md-cache translator is not present (example:
SMB with the gluster-vfs plugin, which requires the stat-prefetch volume
parameter to be set to *off*).

Related Feature Requests and Bugs
---------------------------------

- [Features/Smallfile Perf](../GlusterFS 3.7/Small File Performance.md)
- bugzillas TBS

Detailed Description
--------------------

This proposal has changed as a result of discussions in
\#gluster-meeting: instead of modifying the POSIX translator, we
propose to use the md-cache translator in the server above the POSIX
translator, and add negative caching capabilities to the md-cache
translator.

By "negative caching" we mean that md-cache can tell you if an xattr
does not exist without calling down the translator stack. How can it do
this? On the server side, the only path to the brick is through the
md-cache translator. When it encounters an xattr get request for a file
it has not seen before, the first step is to call down with llistxattr()
to find out what xattrs are stored for that file. From that point on,
until the file is evicted from the cache, any request for non-existent
xattr values from higher translators will immediately be returned with
ENODATA, without calling down to the POSIX translator.
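The decision logic is small once the per-inode list of names is known.
Below is a minimal sketch of that negative-caching step in C; the
structure and function names (`xattr_cache_entry`, `cache_has_xattr`,
`cached_getxattr`) are hypothetical illustrations, not md-cache's actual
internals, and a direct `lgetxattr()` stands in for winding the FOP down
to the POSIX translator.

```c
/*
 * Hypothetical sketch of server-side negative caching for xattrs.
 * Names and structures are illustrative only, not md-cache's real API.
 */
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

#define MAX_XATTRS 64

/* Per-inode cache entry: the complete set of xattr names on disk. */
struct xattr_cache_entry {
        char   names[MAX_XATTRS][256];
        size_t count;   /* number of names learned via llistxattr() */
        int    primed;  /* non-zero once llistxattr() has been done  */
};

static int
cache_has_xattr(const struct xattr_cache_entry *e, const char *name)
{
        for (size_t i = 0; i < e->count; i++)
                if (strcmp(e->names[i], name) == 0)
                        return 1;
        return 0;
}

/*
 * getxattr through the cache: if the entry is primed and the name is
 * not in the cached list, answer ENODATA immediately -- no call down
 * to the POSIX translator (represented here by a plain lgetxattr()).
 */
static ssize_t
cached_getxattr(struct xattr_cache_entry *e, const char *path,
                const char *name, void *value, size_t size)
{
        if (e->primed && !cache_has_xattr(e, name)) {
                errno = ENODATA;        /* negative cache hit */
                return -1;
        }
        return lgetxattr(path, name, value, size);
}
```

The important property is the early return: a request for a name that is
not in the cached list costs no system call at all.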
We must ensure that memory leaks do not occur, and that race conditions
do not occur while multiple threads are accessing the cache, but this
seems like a manageable problem and is certainly not a new problem for
Gluster translator code.

Benefit to GlusterFS
--------------------

Most of the system calls and about 50% of the elapsed time could have
been removed from the above small-file read profile through use of this
cache. This benefit will be more visible as we transition to using SSD
storage, where disk seek times will not mask overheads such as this.

Scope
-----

This can be done local to the glusterfsd process by inserting the
md-cache translator just above the POSIX translator, where the vast
majority of the stat, getxattr and setxattr calls are generated.

### Nature of proposed change

No new translators are required. We may require some existing
translators to call down the stack ("wind a FOP") instead of calling
sys\_\*xattr themselves if these calls are heavily used, so that they
can take advantage of the stat-xattr-cache.

It is *really important* that the md-cache translator use listxattr() to
immediately determine which xattrs are on disk, avoiding needless
getxattr calls this way. At present it does not do this.
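As a rough illustration of that priming step (the helper name
`prime_xattr_list` is assumed here, not an existing md-cache function),
the sketch below shows how a single llistxattr() pass, one size probe
plus one fetch, yields every xattr name on a file as consecutive
NUL-terminated strings; that list is exactly what a per-inode cache
entry would record.

```c
/*
 * Sketch: learn all xattr names for a file with one llistxattr() pass.
 * llistxattr() returns the names as consecutive NUL-terminated strings.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/xattr.h>

static int
prime_xattr_list(const char *path)
{
        /* First call with size 0 asks how large the name buffer must be. */
        ssize_t len = llistxattr(path, NULL, 0);
        if (len < 0)
                return -errno;
        if (len == 0)
                return 0;               /* no xattrs at all */

        char *buf = malloc(len);
        if (!buf)
                return -ENOMEM;

        len = llistxattr(path, buf, len);
        if (len < 0) {
                free(buf);
                return -errno;
        }

        /* Walk the NUL-separated list; a real cache would record each name. */
        for (char *name = buf; name < buf + len; name += strlen(name) + 1)
                printf("found xattr: %s\n", name);

        free(buf);
        return 0;
}

int
main(int argc, char **argv)
{
        return argc > 1 ? -prime_xattr_list(argv[1]) : 0;
}
```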
### Implications on manageability

None. We need to make sure that the cache is big enough to support the
threads that use it, but not so big that it consumes a significant
percentage of memory. We may want to make the cache size and expiration
time tunables so that we can experiment in performance testing to
determine optimal parameters.

### Implications on presentation layer

Translators above the md-cache translator are not affected.

### Implications on persistence layer

None.

### Implications on 'GlusterFS' backend

None.

### Modification to GlusterFS metadata

None.

### Implications on 'glusterd'

None.

How To Test
-----------

We can use strace of a single-thread smallfile workload to verify that
the cache is filtering out excess system calls. We could include
counters in the cache to measure the cache hit rate.

User Experience
---------------

Single-thread small-file creates should be faster, particularly on SSD
storage. Performance testing is needed to further quantify this.

Dependencies
------------

None.

Documentation
-------------

None, except for tunables relating to cache size and expiration time.

Status
------

Not started.

Comments and Discussion
-----------------------

Jeff Darcy: I've been saying for ages that we should store xattrs in a
local DB and avoid local xattrs altogether. Besides performance, this
would also eliminate the need for special configuration of the
underlying local FS (to accommodate our highly unusual use of this
feature) and generally be good for platform independence. Not quite so
sure about other stat(2) information, but perhaps I could be persuaded.
In any case, this has led me to look into the relevant code on a few
occasions. Unfortunately, there are \*many\* places that directly call
sys\_\*xattr instead of winding fops -- glusterd (for replace-brick),
changelog, quota, snapshots, and others. I think this feature is still
very worthwhile, but all of the "cheating" we've tolerated over the
years is going to make it more difficult.

Ben England: a local DB might be a good option but could also become a
bottleneck, unless you have a DB instance per brick (local) filesystem.
One problem that the DB would solve is getting all the metadata in one
query - at present the POSIX API requires you to get one xattr at a
time. If we implement a caching layer that hides whether a DB or xattrs
are being used, we can make it easier to experiment with a DB (LevelDB?).
On your second point: while it's true that there are many sites that
call sys\_\*xattr directly, only a few of these really generate a lot of
system calls. For example, some of these calls are only for the
mountpoint. From a performance perspective, as long as we can intercept
the vast majority of the sys\_\*xattr calls with this caching layer,
IMHO we can tolerate a few exceptions in glusterd, etc. However, from a
CORRECTNESS standpoint, we have to be careful that calls bypassing the
caching layer don't cause cache contents to become stale (out-of-date,
inconsistent with the on-disk brick filesystem contents).
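To make the "one xattr at a time" cost concrete, here is a hedged sketch
of the access pattern the system call profile complained about and that
the cache is meant to absorb: two lgetxattr() calls per xattr, a size
probe followed by the real fetch, repeated for every name of interest.
The helper name `fetch_one_xattr` is illustrative only.

```c
/*
 * Sketch of the pattern this cache is meant to avoid: fetching xattrs
 * one at a time, with a size probe followed by a second call per xattr.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/xattr.h>

static void
fetch_one_xattr(const char *path, const char *name)
{
        /* Call 1: how big is the value? */
        ssize_t size = lgetxattr(path, name, NULL, 0);
        if (size < 0) {
                perror(name);           /* ENODATA if the xattr is absent */
                return;
        }

        /* Call 2: actually fetch it. */
        char *value = malloc(size);
        if (value && lgetxattr(path, name, value, size) >= 0)
                printf("%s: %zd bytes\n", name, size);
        free(value);
}

int
main(int argc, char **argv)
{
        /* Two syscalls per xattr, repeated for every name we care about. */
        for (int i = 2; i < argc; i++)
                fetch_one_xattr(argv[1], argv[i]);
        return 0;
}
```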