path: root/Feature Planning/GlusterFS 4.0
author     hchiramm <hchiramm@redhat.com>  2015-08-04 11:07:42 +0530
committer  hchiramm <hchiramm@redhat.com>  2015-08-04 11:07:42 +0530
commit     d7d3274c6f6cea46ad296fc6d1259ee9a4e9964f (patch)
tree       8db500bcea5190101703ab2ebc4b28587a6e994c /Feature Planning/GlusterFS 4.0
parent     146b7ef7a31997634b29302a6e345ff5d9d7497a (diff)
Adding Features and planning features to glusterfs-specs repo
As per the discussion (http://www.gluster.org/pipermail/gluster-users/2015-July/022918.html) the specs are part of this repo.
Signed-off-by: hchiramm <hchiramm@redhat.com>
Diffstat (limited to 'Feature Planning/GlusterFS 4.0')
-rw-r--r--  Feature Planning/GlusterFS 4.0/Better Brick Mgmt.md      180
-rw-r--r--  Feature Planning/GlusterFS 4.0/Compression Dedup.md      128
-rw-r--r--  Feature Planning/GlusterFS 4.0/Split Network.md          138
-rw-r--r--  Feature Planning/GlusterFS 4.0/caching.md                143
-rw-r--r--  Feature Planning/GlusterFS 4.0/code-generation.md        143
-rw-r--r--  Feature Planning/GlusterFS 4.0/composite-operations.md   438
-rw-r--r--  Feature Planning/GlusterFS 4.0/dht-scalability.md        171
-rw-r--r--  Feature Planning/GlusterFS 4.0/index.md                   82
-rw-r--r--  Feature Planning/GlusterFS 4.0/stat-xattr-cache.md       197
-rw-r--r--  Feature Planning/GlusterFS 4.0/volgen-rewrite.md         128
10 files changed, 1748 insertions, 0 deletions
diff --git a/Feature Planning/GlusterFS 4.0/Better Brick Mgmt.md b/Feature Planning/GlusterFS 4.0/Better Brick Mgmt.md
new file mode 100644
index 0000000..adfc781
--- /dev/null
+++ b/Feature Planning/GlusterFS 4.0/Better Brick Mgmt.md
@@ -0,0 +1,180 @@
+Goal
+----
+
+Easier (more autonomous) assignment of storage to specific roles
+
+Summary
+-------
+
+Managing bricks and arrangements of bricks (e.g. into replica sets)
+manually doesn't scale. Instead, we need more intuitive ways to group
+bricks together into pools, allocate space from those pools (creating
+new pools), and let users define volumes in terms of pools rather than
+individual bricks. We get to worry about how to arrange those bricks
+into an intelligent volume configuration, e.g. replicating between
+bricks that are the same size/speed/type but not on the same server.
+
+Because this smarter and/or finer-grain resource allocation (plus
+general technology evolution) is likely to result in many more bricks
+per server than we have now, we also need a brick-daemon infrastructure
+capable of handling that.
+
+Owners
+------
+
+Jeff Darcy <jdarcy@redhat.com>
+
+Current status
+--------------
+
+Proposed, waiting until summit for approval.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+[Features/data-classification](../GlusterFS 3.7/Data Classification.md)
+will drive the heaviest and/or most sophisticated use of this feature,
+and some of the underlying mechanisms were originally proposed there.
+
+Detailed Description
+--------------------
+
+To start with, we need to distinguish between the raw brick that the
+user allocates to GlusterFS and the pieces of that brick that result
+from our complicated storage allocation. Some documents refer to these
+as u-brick and s-brick respectively, though perhaps it's better to keep
+calling the former bricks and come up with a new name for the latter -
+slice, tile, pebble, etc. For now, let's stick with the u-brick/s-brick
+(generically, "x-brick") terminology. We can manipulate these objects in
+several ways (a sketch in Python follows the list below).
+
+- Group u-bricks together into an equivalent pool of s-bricks
+ (trivially 1:1).
+
+- Allocate space from a pool of s-bricks, creating a set of smaller
+ s-bricks. Note that the results of applying this repeatedly might be
+ s-bricks which are on the same u-brick but part of different
+ volumes.
+
+- Combine multiple s-bricks into one via some combination of
+ replication, erasure coding, distribution, tiering, etc.
+
+- Export an s-brick as a volume.
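+
+As a rough illustration of these relationships (not a proposed API -
+the names `UBrick`, `SBrick`, `Pool`, and `carve` below are invented
+for this sketch), a minimal Python model might look like:
+
+```python
+# Hypothetical object model: u-bricks grouped into a pool, s-bricks
+# carved out of the pool and handed to volumes.
+from dataclasses import dataclass, field
+from typing import List
+
+@dataclass
+class SBrick:
+    parent: "UBrick"          # the raw brick this slice lives on
+    size_gb: int
+
+@dataclass
+class UBrick:
+    host: str
+    path: str
+    size_gb: int
+    allocated_gb: int = 0
+
+    def carve(self, size_gb: int) -> SBrick:
+        # Repeated carving can leave s-bricks that share a u-brick but
+        # belong to different volumes.
+        assert self.allocated_gb + size_gb <= self.size_gb
+        self.allocated_gb += size_gb
+        return SBrick(self, size_gb)
+
+@dataclass
+class Pool:
+    ubricks: List[UBrick] = field(default_factory=list)
+
+    def allocate(self, size_gb: int, count: int) -> List[SBrick]:
+        # Pick s-bricks from distinct hosts so a later replica set never
+        # pairs two slices of the same server.
+        chosen, hosts = [], set()
+        for ub in sorted(self.ubricks, key=lambda b: b.allocated_gb):
+            if ub.host in hosts:
+                continue
+            chosen.append(ub.carve(size_gb))
+            hosts.add(ub.host)
+            if len(chosen) == count:
+                return chosen
+        raise RuntimeError("not enough independent u-bricks")
+
+pool = Pool([UBrick("srv1", "/b1", 1000), UBrick("srv2", "/b1", 1000)])
+replica_set = pool.allocate(size_gb=100, count=2)  # one s-brick per host
+```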
+
+These operations - especially combining - can be applied iteratively,
+creating successively more complex structures prior to the final export.
+To support this, the code we currently use to generate volfiles needs to
+be changed to generate similar definitions for the various levels of
+s-bricks. Combined with the need to support versioning of these files
+(for snapshots), this probably means a rewrite of the volgen code.
+Another type of configuration file we need to create is for the brick
+daemon. We would still run one glusterfsd process per u-brick, for
+several reasons:
+
+- Maximize compatibility with our current infrastructure for starting
+ and monitoring server processes.
+
+- Align the boundaries between actual and detected device failures.
+
+- Reduce the number of ports assigned, both for administrative
+ convenience and to avoid exhaustion.
+
+- Reduce context-switch and virtual-memory thrashing between too many
+ uncoordinated processes. Some day we might even add custom resource
+ control/scheduling between s-bricks within a process, which would be
+ impossible in separate processes.
+
+These new glusterfsd processes are going to require more complex
+volfiles, and more complex translator-graph code to consume those. They
+also need to be more parallel internally, so this feature depends on
+eliminating single-threaded bottlenecks such as our socket transport.
+
+Benefit to GlusterFS
+--------------------
+
+- Reduced administrative overhead for large/complex volume
+ configurations.
+
+- More flexible/sophisticated volume configurations, especially with
+ respect to other features such as tiering or internal enhancements
+ such as overlapping replica/erasure sets.
+
+- Improved performance.
+
+Scope
+-----
+
+### Nature of proposed change
+
+- New object model, exposed via both glusterd-level and user-level
+ commands on those objects.
+
+- Rewritten volfile infrastructure.
+
+- Significantly enhanced translator-graph infrastructure.
+
+- Multi-threaded transport.
+
+### Implications on manageability
+
+New commands will be needed to group u-bricks into pools, allocate
+s-bricks from pools, etc. There will also be new commands to view status
+of objects at various levels, and perhaps to set options on them. On the
+other hand, "volume create" will probably become simpler as the
+specifics of creating a volume are delegated downward to s-bricks.
+
+### Implications on presentation layer
+
+Surprisingly little.
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+The on-disk structures (.glusterfs and so on) currently associated with
+a brick become associated with an s-brick. The u-brick itself will
+contain little, probably just an enumeration of the s-bricks into which
+it has been divided.
+
+### Modification to GlusterFS metadata
+
+None.
+
+### Implications on 'glusterd'
+
+See detailed description.
+
+How To Test
+-----------
+
+New tests will be needed for grouping/allocation functions. In
+particular, negative tests for incorrect or impossible configurations
+will be needed. Once s-bricks have been aggregated back into volumes,
+most of the current volume-level tests will still apply. Related tests
+will also be developed as part of the data classification feature.
+
+User Experience
+---------------
+
+See "implications on manageability" etc.
+
+Dependencies
+------------
+
+This feature is so closely associated with data classification that the
+two can barely be considered separately.
+
+Documentation
+-------------
+
+Much of our "brick and volume management" documentation will require a
+thorough review, if not an actual rewrite.
+
+Status
+------
+
+Design still in progress.
+
+Comments and Discussion
+-----------------------
diff --git a/Feature Planning/GlusterFS 4.0/Compression Dedup.md b/Feature Planning/GlusterFS 4.0/Compression Dedup.md
new file mode 100644
index 0000000..7829018
--- /dev/null
+++ b/Feature Planning/GlusterFS 4.0/Compression Dedup.md
@@ -0,0 +1,128 @@
+Feature
+-------
+
+Compression / Deduplication
+
+Summary
+-------
+
+In the never-ending quest to increase storage efficiency (or conversely
+to decrease storage cost), we could compress and/or deduplicate data
+stored on bricks.
+
+Owners
+------
+
+Jeff Darcy <jdarcy@redhat.com>
+
+Current status
+--------------
+
+Just a vague idea so far.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+TBD
+
+Detailed Description
+--------------------
+
+Compression and deduplication for GlusterFS have been discussed many
+times. Deduplication across machines/bricks is a recognized Hard
+Problem, with uncertain benefits, and is thus considered out of scope.
+Deduplication within a brick is potentially achievable by using
+something like
+[lessfs](http://sourceforge.net/projects/lessfs/files/),
+which is itself a FUSE filesystem, so one fairly simple approach would
+be to integrate lessfs as a translator. There's no similar option for
+compression.
+
+In both cases, it's generally preferable to work on fully expanded files
+while they're open, and then compress/dedup when they're closed. Some of
+the bitrot or tiering infrastructure might be useful for moving files
+between these states, or detecting when such a change is needed. There
+are also some interesting interactions with quota, since we need to
+count the un-compressed un-deduplicated size of the file against quota
+(or do we?) and that's not what the underlying local file system will
+report.
+
+Benefit to GlusterFS
+--------------------
+
+Less \$\$\$/GB for our users.
+
+Scope
+-----
+
+### Nature of proposed change
+
+New translators, hooks into bitrot/tiering/quota, probably new daemons.
+
+### Implications on manageability
+
+Besides turning these options on or off, or setting parameters, there
+will probably need to be some way of reporting the real vs.
+compressed/deduplicated size of files/bricks/volumes.
+
+### Implications on presentation layer
+
+Should be none.
+
+### Implications on persistence layer
+
+If the DM folks ever get their <expletive deleted> together on this
+front, we might be able to use some of their stuff instead of lessfs.
+That worked so well for thin provisioning and snapshots.
+
+### Implications on 'GlusterFS' backend
+
+What's on the brick will no longer match the data that the user stored
+(and might some day retrieve). In the case of compression,
+reconstituting the user-visible version of the data should be a simple
+matter of decompressing via a well known algorithm. In the case of
+deduplication, the relevant data structures are much more complicated
+and reconstitution will be correspondingly more difficult.
+
+### Modification to GlusterFS metadata
+
+Some of the information tracking deduplicated blocks will probably be
+stored "privately" in .glusterfs or similar.
+
+### Implications on 'glusterd'
+
+TBD
+
+How To Test
+-----------
+
+TBD
+
+User Experience
+---------------
+
+Mostly unchanged, except for performance. As with erasure coding, a
+compressed/deduplicated slow tier will usually need to be paired with a
+simpler fast tier for overall performance to be acceptable.
+
+Dependencies
+------------
+
+External: lessfs, DM, whatever other technology we use to do the
+low-level work
+
+Internal: tiering/bitrot (perhaps changelog?) to track state and detect
+changes
+
+Documentation
+-------------
+
+TBD
+
+Status
+------
+
+Still just a vague idea.
+
+Comments and Discussion
+-----------------------
diff --git a/Feature Planning/GlusterFS 4.0/Split Network.md b/Feature Planning/GlusterFS 4.0/Split Network.md
new file mode 100644
index 0000000..95cf944
--- /dev/null
+++ b/Feature Planning/GlusterFS 4.0/Split Network.md
@@ -0,0 +1,138 @@
+Goal
+----
+
+Better support for multiple networks, especially front-end vs. back-end.
+
+Summary
+-------
+
+GlusterFS generally expects that all clients and servers use a common
+set of network names and/or addresses. For many users, having a separate
+network exclusively for servers is highly desirable for both performance
+reasons (segregating administrative traffic and/or second-hop NFS
+traffic from ongoing user I/O) and security reasons (limiting
+administrative access to the private network). While such configurations
+can already be created with routing/iptables trickery, full and explicit
+support would be a great improvement.
+
+Owners
+------
+
+Jeff Darcy <jdarcy@redhat.com>
+
+Current status
+--------------
+
+Proposed, awaiting summit for approval.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+One proposal for the high-level syntax and semantics was made [on the
+mailing
+list](http://www.gluster.org/pipermail/gluster-users/2014-November/019463.html).
+
+Detailed Description
+--------------------
+
+At the very least, we need to be able to define and keep track of
+multiple names/addresses for the same node, one used on the back-end
+network, e.g. for "peer probe" and NFS, and the other used on the
+front-end network by native-protocol clients. The association can be
+done via the node UUID, but we still need a way for the user to specify
+which name/address is to be used for which purpose.
+
+Future enhancements could include multiple front-end (client) networks,
+and network-specific access control.
+
+Benefit to GlusterFS
+--------------------
+
+More flexible network topologies, potentially enhancing
+performance and/or security for some deployments.
+
+Scope
+-----
+
+### Nature of proposed change
+
+The information in /var/lib/glusterd/peers/\* must be enhanced to
+include multiple names/addresses per peer, plus tags for roles
+associated with each address/name.
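+
+As a purely hypothetical sketch (the field names and role labels below
+are invented, and Python stands in for whatever on-disk format is
+chosen), a peer record with multiple tagged addresses could look like:
+
+```python
+# Invented representation of one entry under /var/lib/glusterd/peers/.
+peer = {
+    "uuid": "6ffa8b19-4a9e-4b44-9d9c-3e4b7f1c0001",   # made-up UUID
+    "addresses": [
+        {"name": "server1-backend.example.com",
+         "roles": ["management", "nfs", "self-heal", "rebalance"]},
+        {"name": "server1.example.com",
+         "roles": ["client"]},
+    ],
+}
+
+def address_for(peer, role):
+    """Return the first address tagged with the requested role."""
+    for addr in peer["addresses"]:
+        if role in addr["roles"]:
+            return addr["name"]
+    # Fall back to the first known address if no role matches.
+    return peer["addresses"][0]["name"]
+
+assert address_for(peer, "client") == "server1.example.com"
+```
+
+Volfile generation would then consult something like `address_for()`
+once per purpose when emitting each volfile.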
+
+The volfile-generation code must be enhanced to generate volfiles for
+each purpose - server, native client, NFS proxy, self-heal/rebalance -
+using the names/addresses appropriate to that purpose.
+
+### Implications on manageability
+
+CLI and GUI support must be added for viewing/changing the addresses
+associated with each server and the roles associated with each address.
+
+### Implications on presentation layer
+
+None. The changes should be transparent to users.
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+None.
+
+### Modification to GlusterFS metadata
+
+See [nature of proposed change](#nature-of-proposed-change).
+
+### Implications on 'glusterd'
+
+See [nature of proposed change](#nature-of-proposed-change).
+
+How To Test
+-----------
+
+Set up a physical configuration with separate front-end and back-end
+networks.
+
+Use the new CLI/GUI features to define addresses and roles split across
+the two networks.
+
+Mount a volume using each of the several volfiles that result, and
+generate some traffic.
+
+Verify that the traffic is actually on the network appropriate to that
+mount type.
+
+User Experience
+---------------
+
+By default, nothing changes. If and only if a user wants to set up a
+more "advanced" split-network configuration, they'll have new tools
+allowing them to do that without having to "step outside" to mess with
+routing tables etc.
+
+Dependencies
+------------
+
+None.
+
+Documentation
+-------------
+
+New documentation will be needed at both the conceptual and detail
+levels, describing how (and why?) to set up a split-network
+configuration.
+
+Status
+------
+
+In design.
+
+Comments and Discussion
+-----------------------
+
+Some use-cases in [Bug 764850](https://bugzilla.redhat.com/764850).
+Feedback requested. Please jump in.
+
+[Discussion on gluster-devel](https://mail.corp.redhat.com/zimbra/#16)
diff --git a/Feature Planning/GlusterFS 4.0/caching.md b/Feature Planning/GlusterFS 4.0/caching.md
new file mode 100644
index 0000000..2c21c0c
--- /dev/null
+++ b/Feature Planning/GlusterFS 4.0/caching.md
@@ -0,0 +1,143 @@
+Goal
+----
+
+Improved performance via client-side caching.
+
+Summary
+-------
+
+GlusterFS has historically taken a very conservative approach to
+client-side caching, due to the cost and difficulty of ensuring
+consistency across a truly distributed file system. However, this has
+often led to a competitive disadvantage vs. other file systems that
+cache more aggressively. While one could argue that expecting an
+application designed for a local FS or NFS to behave the same way on a
+distributed FS is unrealistic, or question whether competitors' caching
+is really safe, this nonetheless remains one of our users' top requests.
+
+For purposes of this discussion, pre-fetching into cache is considered
+part of caching itself. However, write-behind caching (buffering) is a
+separate feature, and is not in scope.
+
+Owners
+------
+
+Xavier Hernandez <xhernandez@datalab.es>
+
+Jeff Darcy <jdarcy@redhat.com>
+
+Current status
+--------------
+
+Proposed, waiting until summit for approval.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+[Features/FS-Cache](Features/FS-Cache) is about a looser
+(non-consistent) kind of caching integrated via FUSE. This feature is
+differentiated by being fully consistent, and implemented in GlusterFS
+itself.
+
+[IMCa](http://mvapich.cse.ohio-state.edu/static/media/publications/slide/imca_icpp08.pdf)
+describes a completely external approach to caching (both data and
+metadata) with GlusterFS.
+
+Detailed Description
+--------------------
+
+Retaining data in cache on a client after it's read is trivial.
+Pre-fetching into that same cache is barely more difficult. All of the
+hard parts are on the server.
+
+- Tracking which clients still have cached copies of which data (or
+ metadata).
+
+- Issuing and waiting for invalidation requests when a client changes
+ data cached elsewhere.
+
+- Handling failures of the servers tracking client state, and of
+ communication with clients that need to be invalidated.
+
+- Doing all of this without putting performance in the toilet.
+
+Invalidating cached copies is analogous to breaking locks, so the
+async-notification and "oplock" code already being developed for
+multi-protocol (SMB3/NFS4) support can probably be used here. More
+design is probably needed around scalable/performant tracking of client
+cache state by servers.
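+
+To make the tracking problem concrete, here is a deliberately naive
+sketch (the class and method names are invented, and real code would
+live in a server-side translator, not Python):
+
+```python
+# Toy bookkeeping: remember which clients cache which inode (by GFID)
+# and invalidate every other holder before acknowledging a write.
+from collections import defaultdict
+
+class CacheStateTracker:
+    def __init__(self, send_invalidate):
+        self.holders = defaultdict(set)      # gfid -> set of client ids
+        self.send_invalidate = send_invalidate
+
+    def on_read(self, gfid, client):
+        # The client will cache what it just read; remember that.
+        self.holders[gfid].add(client)
+
+    def on_write(self, gfid, writer):
+        # Fully consistent model: every other holder must drop its copy
+        # (and the server must wait for that) before the write completes.
+        for client in self.holders[gfid] - {writer}:
+            self.send_invalidate(client, gfid)
+        self.holders[gfid] = {writer}
+
+    def on_client_down(self, client):
+        # Failure handling: forget everything that client held.
+        for holders in self.holders.values():
+            holders.discard(client)
+```
+
+The hard parts listed above are exactly where this toy breaks down:
+persisting `holders` across server failures, bounding its size, and
+waiting on invalidations without stalling everyone.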
+
+Benefit to GlusterFS
+--------------------
+
+Much better performance for cache-friendly workloads.
+
+Scope
+-----
+
+### Nature of proposed change
+
+Some of the existing "performance" translators could be replaced by a
+single client-caching translator. There will also need to be a
+server-side helper translator to track client cache states and issue
+invalidation requests at the appropriate times. Such asynchronous
+(server-initiated) requests probably require transport changes, and
+[GF\_FOP\_IPC](http://review.gluster.org/#/c/8812/) might play a part as
+well.
+
+### Implications on manageability
+
+New commands will be needed to set cache parameters, force cache
+flushes, etc.
+
+### Implications on presentation layer
+
+None, except for integration with the same async/oplock infrastructure
+as used separately in SMB and NFS.
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+We will likely need some sort of database associated with each brick to
+maintain information about cache states.
+
+### Modification to GlusterFS metadata
+
+None.
+
+### Implications on 'glusterd'
+
+None.
+
+How To Test
+-----------
+
+We'll need new tests to verify that invalidations are in fact occurring,
+that we can't read stale/inconsistent data despite the increased caching
+on clients, etc.
+
+User Experience
+---------------
+
+See "implications on manageability" section.
+
+Dependencies
+------------
+
+Async-notification and oplock code from the Samba team.
+
+Documentation
+-------------
+
+TBD
+
+Status
+------
+
+Design in private review, hopefully available for public review soon.
+
+Comments and Discussion
+-----------------------
diff --git a/Feature Planning/GlusterFS 4.0/code-generation.md b/Feature Planning/GlusterFS 4.0/code-generation.md
new file mode 100644
index 0000000..5c25a13
--- /dev/null
+++ b/Feature Planning/GlusterFS 4.0/code-generation.md
@@ -0,0 +1,143 @@
+Goal
+----
+
+Reduce internal duplication of code by generating from templates.
+
+Summary
+-------
+
+The translator calling convention is based on long lists of
+operation-specific arguments instead of a common "control block"
+struct/union. As a result, many parts of our code are highly repetitive
+both internally and with respect to one another. As an example of
+internal redundancy, consider how many of the functions in
+[defaults.c](https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/defaults.c)
+look similar. As an example of external redundancy, consider how the
+[patch to add GF\_FOP\_IPC](http://review.gluster.org/#/c/8812/) has to
+make parallel changes to 17 files - defaults, stubs, syncops, RPC,
+io-threads, and so on. All of this duplication slows development of new
+features, and creates huge potential for errors as definitions that need
+to match don't. Indeed, during development of a code generator for NSR,
+several such inconsistencies have already been found.
+
+Owners
+------
+
+Jeff Darcy <jdarcy@redhat.com>
+
+Current status
+--------------
+
+Proposed, awaiting approval.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+Code generation was already used successfully in the first generation of
+[NSR](../GlusterFS 3.6/New Style Replication.md) and will continue to be
+used in the second.
+
+Detailed Description
+--------------------
+
+See Summary section above.
+
+Benefit to GlusterFS
+--------------------
+
+- Fewer bugs from inconsistencies between how similar operations are
+ handled within one translator, or how a single operation is handled
+ across many.
+
+- Greater ease of adding new operation types, or new translators which
+ implement similar functionality for many operations.
+
+Scope
+-----
+
+### Nature of proposed change
+
+The code-generation infrastructure itself consists of three parts (a
+toy sketch in Python follows the list):
+
+- A list of operations and their associated arguments (both original
+ and callback, with types).
+
+- A script to combine this list with a template to do the actual
+ generation.
+
+- Modifications to makefiles etc. to do generation during a build.
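+
+As a toy illustration of the approach (the fop table and template below
+are drastically simplified; they are not the actual generator nor the
+real fop signatures):
+
+```python
+# Expand one template per fop; defaults, stubs, syncops, etc. would
+# each be a different template over the same fop table.
+FOPS = {
+    "open":  "fd_t *fd, int32_t flags",
+    "write": "fd_t *fd, struct iovec *vector, int32_t count",
+    "stat":  "loc_t *loc",
+}
+
+TEMPLATE = """\
+int32_t
+default_{name} (call_frame_t *frame, xlator_t *this, {args})
+{{
+        STACK_WIND_TAIL (frame, FIRST_CHILD (this),
+                         FIRST_CHILD (this)->fops->{name}, {argnames});
+        return 0;
+}}
+"""
+
+def argnames(args):
+    # "fd_t *fd, int32_t flags" -> "fd, flags"
+    return ", ".join(a.split()[-1].lstrip("*") for a in args.split(","))
+
+for name, args in FOPS.items():
+    print(TEMPLATE.format(name=name, args=args, argnames=argnames(args)))
+```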
+
+The first and easiest target is auto-generated default functions. Stubs
+and syncops could follow pretty quickly. Other possibilities include:
+
+- GFAPI (both C and Python)
+
+- glupy
+
+- RPC (replace rpcgen?)
+
+- io-threads
+
+- changelog (the [full-data-logging
+ translator](https://forge.gluster.org/~jdarcy/glusterfs-core/jdarcys-glusterfs-data-logging)
+ on the forge already uses this technique)
+
+Even something as complicated as AFR/NSR/EC could use code generation to
+handle quorum checks more consistently, wrap operations in transactions,
+and so on. NSR already does; the others could.
+
+### Implications on manageability
+
+None.
+
+### Implications on presentation layer
+
+None.
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+None.
+
+### Modification to GlusterFS metadata
+
+None.
+
+### Implications on 'glusterd'
+
+None.
+
+How To Test
+-----------
+
+This change is not intended to introduce any change visible except to
+developers. Standard regression tests should be sufficient to verify
+that no such change has occurred.
+
+User Experience
+---------------
+
+None.
+
+Dependencies
+------------
+
+None.
+
+Documentation
+-------------
+
+Developer documentation should explain the format of the fop-description
+and template files. In particular developers need to know what variables
+are available for use in templates, and how to add new ones.
+
+Status
+------
+
+Patch available to generate default functions. Others to follow.
+
+Comments and Discussion
+-----------------------
diff --git a/Feature Planning/GlusterFS 4.0/composite-operations.md b/Feature Planning/GlusterFS 4.0/composite-operations.md
new file mode 100644
index 0000000..5cc29b4
--- /dev/null
+++ b/Feature Planning/GlusterFS 4.0/composite-operations.md
@@ -0,0 +1,438 @@
+Feature
+-------
+
+"Composite operations" is a term describing the elimination of round
+trips through a variety of techniques, some of which are borrowed - at
+least in spirit - from the NFS and SMB protocols.
+
+Why do we need this? All too frequently we encounter situations where
+Gluster is one or even two orders of magnitude slower than NFS or SMB
+to a local filesystem, particularly for small-file and
+metadata-intensive workloads (example: file browsing).
+You can argue that Gluster provides more functionality, so it should be
+slower, but we need to close the gap -- if Gluster was half the speed of
+NFS and provided much greater functionality plus scalability, users
+would be ok with some performance tradeoff.
+
+What is the root cause? Response time of Gluster APIs is much higher
+than response time of other protocols. A simple protocol trace can show
+you a root cause for this: excessive round-trips.
+
+There are several dimensions to this:
+
+- operations that require lookups on every brick (covered elsewhere)
+- excessive one-at-a-time access to xattrs and ACLs
+- client responsible for maintaining filesystem state instead of
+ server
+- SMB: case-insensitivity of Windows = no direct lookup by filename on
+ brick
+
+Summary
+-------
+
+example of previous success: eager-lock. When Gluster was first acquired
+by Red Hat and testing with 10-GbE interfaces began, we quickly noticed
+that sequential write performance was not what we expected. The Gluster
+protocol required a 5-step sequence for every write from client to
+server(s), in order to maintain consistency between replicas, which is
+loosely paraphrased here:
+
+- lock-replica-inode
+- pre-op (mark replicas dirty)
+- write
+- post-op
+- unlock-replica-inode
+
+The **cluster.eager-lock** feature was added to Gluster (3.4?) to allow
+the client to hang onto the lock. Combined with folding the post-op for
+the previous write and the pre-op for the current write into the write
+request itself, this took us from 5 RPCs per write down to ONE, and
+write performance improved significantly (how much TBS).
+
+Owners
+------
+
+TBS
+
+Current status
+--------------
+
+Some of the problems with round trips stem from lack of scalability in
+DHT protocol, and attributes of AFR protocol.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+- [Features/Smallfile Perf](../GlusterFS 3.7/Small File Performance.md) - small-file performance enhancement menu
+- [Features/dht-scalability](./dht-scalability.md) - new, scalable DHT
+- [Features/new-style-replication](../GlusterFS 3.6/New Style Replication.md) - client no longer does replication
+
+*Note: search the RHS bug list for small-file-related performance bugs and directory-browsing performance bugs; I haven't done that yet, and there are a LOT of them.*
+
+Detailed Description
+--------------------
+
+Here are the proposals:
+
+- READDIRPLUS generalization
+- lockless-CREATE
+- CREATE-AND-WRITE - allow CREATE op to transmit data and metadata
+ also
+- case-insensitivity feature - removes perf. penalty for SMB
+
+### READDIRPLUS used to prefetch xattrs
+
+Recent correction: For SMB and other protocols that have additional
+security metadata, READDIRPLUS can be used more effectively to prefetch
+xattr data, such as ACLs and Windows-specific security info. However,
+upper layers have to make use of this feature. We treat ACLs as a
+special case of an extended attribute, since ACLs are not currently
+returned by READDIRPLUS (can someone confirm this?). The current RPC
+request and response structures are
+[gfs3\_readdirp\_{req,rsp}](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/rpc/xdr/src/glusterfs3-xdr.x)
+in the linked source file. In both cases, the request structure's "dict"
+field can contain a list of extended attribute IDs (or names, not sure
+which).
+
+However, once these xattrs are prefetched, will md-cache translator in
+the client be able to hang onto them to prevent round-trips to the
+server? Is there any additional invalidation needed for the expanded
+role of md-cache?
+
+### eager-lock for directories
+
+This extension doesn't seem to impact APIs at all, but it does require a
+way to safely do a CREATE FOP that will either appear on all replicas or
+none (or allow self-healing to repair the difference in the directories
+in the correct way).
+
+If we have an NSR translator, this seems pretty straightforward. NSR
+only allows the client to talk to the "leader" server in the replica
+host set, and the leader then takes responsibility for propagating the
+change.
+
+With AFR, the situation is very different. In order to guarantee that a
+CREATE will succeed on all AFR subvolumes, the client must write-lock
+the parent directory. Otherwise some other client could create the same
+file at the same time on some but not all of the AFR subvolumes.
+
+But why unlock? Chances are that any immediate subsequent file create in
+that directory will be coming from the same client, so it makes sense
+for the client to hang onto the write lock for a short while, unless
+some other client wants it. This optimistic lock behavior is similar to
+the "eager-lock" feature in the AFR translator today. Doing this saves
+us not only the need to do a LOOKUP prior to CREATE, but also saves us
+the need to do a directory unlock per file!
+
+### CREATE-AND-WRITE
+
+This extension is similar to quick-read, where the OPEN FOP can return
+the file data if it's small enough. This extension adds the following
+features to the CREATE FOP:
+
+- optionally specify xattrs to associate with the file when it's
+  created
+- optionally specify write data (if it fits in 1 RPC)
+- optionally close the file (what RELEASE does today)
+- optionally fsync the file (for apps that require file persistence,
+  such as Swift)
+
+This is also similar to what the librados (Ceph) API allows users to do
+today; see [Ioctx.write\_full in the librados Python
+binding](http://ceph.com/docs/master/rados/api/python/#writing-reading-and-removing-objects).
+
+This avoids the need for the round-trip sequence:
+
+- lock inode for write
+- create
+- write
+- flush(directory)
+- set-xattr[1]
+- set-xattr[2]
+- ...
+- set-xattr[N]
+- release
+- unlock inode
+
+The existing protocol structure is in [structure
+gfs3\_create\_req](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/rpc/xdr/src/glusterfs3-xdr.x).
+We would allocate reserved bits from the "flags" field for the
+optional extensions. The xdata field in the request would contain a
+tagged sequence containing the optional parameter values.
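+
+To make the saving concrete, here is a hedged sketch in Python
+pseudo-client form - `vol.create_and_write()` is an imagined entry
+point, and the handle methods are stand-ins, not existing libgfapi
+calls:
+
+```python
+# Today: one round trip per step (plus locking), as in the list above.
+def create_small_file_today(vol, path, data, xattrs):
+    fd = vol.create(path, 0o644)            # create
+    fd.write(data)                          # write
+    for name, value in xattrs.items():
+        fd.setxattr(name, value)            # one trip per xattr
+    fd.fsync()                              # fsync
+    fd.close()                              # release
+
+# Proposed: one RPC carrying data, xattrs, and fsync/close flags.
+def create_small_file_composite(vol, path, data, xattrs):
+    vol.create_and_write(path, 0o644, data, xattrs,
+                         fsync=True, close=True)
+```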
+
+### case-insensitive volume support
+
+The SMB protocol is bridging a divide between an operating system,
+Windows, that supports case-insensitive lookup, and an operating system,
+Linux (POSIX) that supports only case-sensitive lookup inside Gluster
+bricks. If nothing is done to bridge this gap, file lookup and creation
+becomes very expensive in large directories (a few thousand files in
+size):
+
+- on CREATE, the client has to search the entire directory to
+ determine whether some other file with the same name (but a
+ different case mix) already exists. This requires locking the
+ directory. Furthermore, consistent hashing, which knows nothing
+ about case mix, can not predict which brick might contain the file,
+ since it might have been created with a different case mix. This is
+ a SCALABILITY issue.
+
+- on LOOKUP, the client has to search all bricks for the filename
+ since there is in general no way to predict which brick the
+ case-altered version of the filename might have hashed to. This is a
+ SCALABILITY issue. The entire contents of the directory on each
+ brick must be searched as well.
+
+- SMB does support "case-sensitive yes" smb.conf configuration option,
+ but this is user-hostile since Windows does not understand it.
+
+What happens when a Linux user-mode process such as glusterfsd (brick)
+tries to do a case-insensitive lookup on a filename using a local
+filesystem? XFS has a feature for this, but Gluster can't assume XFS,
+and the VFS supporting case-insensitivity is not going to happen. Yes,
+you can do a readdir on the directory and scan for a case-insensitive
+match, but that's O(N\^2) in the number of files you place into a
+directory.
+
+**Proposal**: only use lower-case filenames (or upper-case, it doesn't
+matter) at the brick filesystem, and record the original case mix (how
+the user specified the filename at create/rename time) in an xattr, call
+it 'original-case'.
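+
+A minimal sketch of the proposed mapping (the xattr name and the helper
+functions below are illustrative only, and `str.lower()` sidesteps the
+locale question raised next):
+
+```python
+# The brick stores only the folded name; the user's spelling is kept in
+# an xattr (name invented here).
+import os
+
+ORIG_CASE_XATTR = b"user.glusterfs.original-case"   # illustrative name
+
+def brick_name(filename):
+    # Fold once, at or above the DHT layer, so hashing and lookup agree.
+    return filename.lower()
+
+def create(brick_dir, filename):
+    folded = os.path.join(brick_dir, brick_name(filename))
+    flags = os.O_CREAT | os.O_EXCL | os.O_WRONLY
+    fd = os.open(folded, flags, 0o644)   # EEXIST if "A" collides with "a"
+    os.setxattr(folded, ORIG_CASE_XATTR, filename.encode())
+    return fd
+
+def readdir_name(brick_dir, folded):
+    # READDIR(PLUS) reports the case the user originally supplied.
+    try:
+        return os.getxattr(os.path.join(brick_dir, folded),
+                           ORIG_CASE_XATTR).decode()
+    except OSError:
+        return folded    # xattr missing (e.g. mid-rename): folded name
+```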
+
+**Issue**: (from Ira Cooper): what locales would be supported? SMB
+already had to deal with this.
+
+We could define a 'case-insensitive' volume parameter (default off), so
+that users who have no SMB clients do not experience this change in
+behavior.
+
+This mapping to lower-case filenames has to happen at or above DHT layer
+to avoid the scalability issue above. If this is not done by DHT (if it
+is done in VFS-gluster SMB plugin for example), then Gluster clients on
+a POSIX filesystem will not see the same filenames as Windows users, and
+this will lead to confusion.
+
+However, this has consequences for sharing file between SMB and non-SMB
+client - non-SMB client will pay performance penalty for
+case-insensitivity and will see case-insensitive behavior that is not
+strictly POSIX-compliant - for example if I create file "a" and then
+file "A" in same directory, the 2nd create will get EEXIST. That's the
+price you pay for having the two kinds of clients accessing the same
+volume - the most restricted client has to win.
+
+Changes required to DHT or equivalent:
+
+- READDIR(PLUS): report filenames as the user expects to see them,
+  using the original-case xattr. See the READDIRPLUS enhancement above
+  for how this can be done efficiently.
+- CREATE (or RENAME): map the filename within the brick to lower case
+  before creating, and record the original case mix using the
+  original-case xattr. See the CREATE-AND-WRITE enhancement above for
+  how this can be done efficiently.
+- LOOKUP: map the filename to lower case before attempting a lookup on
+ the brick.
+- RENAME: To prevent loss of file during a client-side crash, first
+ delete the case-mix xattr, then do the rename, then re-add the
+ case-mix xattr. If the case-mix xattr is not present, then the
+ lower-case filename is returned by READDIR(PLUS) but the file is not
+ lost.
+
+Since existing SMB users may want to take advantage of this change, we
+need a process for converting a Gluster volume to support
+case-insensitivity (a sketch of the per-brick conversion pass follows
+the list):
+
+- optional - use "find /your/brick/directory -not -type d -a -not
+ -path '/your/brick/directory/.glusterfs/\*' | tr '[A-Z]' '[a-z]' |
+ sort " command in parallel on every brick, and do sort -merge of
+ per-brick outputs followed by "uniq -d" to quickly determine if
+ there are case-insensitivity collisions on existing volume. This
+ would let user resolve such conflicts ahead of time without taking
+ down the volume.
+- shut down the volume
+- run a script on all bricks in parallel to convert it to
+ case-insensitive format - very fast because it runs on a local fs.
+ - rename the brick file to lower case and store an xattr with
+ original case.
+- turn volume lookup-unhashed to ON because files will not yet be on
+ the right brick.
+- set volume into case-insensitive state
+- start volume - it is now online but not in efficient state
+- rebalance (get DHT to place the files where they belong)
+ - If rebalance uncovers case-insensitive filename collisions (very
+ unlikely), the 2nd file is renamed to its original case-mix with
+ string 'case-collision-gfid' + hex gfid appended, and a counter
+ is incremented. A simple "find" command at each brick in
+ parallel executed with pdsh can locate all instances of such
+ files - the user then has to decide what they want to do with
+ them.
+- reset lookup-unhashed to default (auto)
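+
+The per-brick conversion pass mentioned above could be as simple as the
+following sketch (hypothetical xattr name; collision handling here just
+reports and skips, leaving resolution to the administrator):
+
+```python
+# Runs locally on each brick, in parallel across bricks.
+import os
+
+ORIG_CASE_XATTR = b"user.glusterfs.original-case"   # illustrative name
+
+def convert_brick(brick_root):
+    for dirpath, dirnames, filenames in os.walk(brick_root):
+        if ".glusterfs" in dirnames:
+            dirnames.remove(".glusterfs")     # skip internal metadata
+        for name in filenames:
+            src = os.path.join(dirpath, name)
+            os.setxattr(src, ORIG_CASE_XATTR, name.encode())
+            folded = name.lower()
+            if folded == name:
+                continue
+            dst = os.path.join(dirpath, folded)
+            if os.path.exists(dst):
+                print("collision:", src)      # admin resolves these
+                continue
+            os.rename(src, dst)
+```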
+
+Benefit to GlusterFS
+--------------------
+
+- READDIRPLUS optimizations could completely solve the performance
+ problems with file browsing in large directories, at least to the
+ point where Gluster performs similarly to NFS and SMB in general and
+ can't be blamed. (DHT v2 could also improve performance by not
+ requiring round trips to every brick to retrieve a directory).
+
+- lockless-CREATE - can improve small-file create performance
+ significantly by condensing 4 round-trips into 1. Small-file create
+ is the worst-performing feature in Gluster today. However, it won't
+ solve small-file create problems until we address other areas below.
+
+- CREATE-AND-WRITE - as you can see, at least 6 round trips (maybe
+ more) are combined into 1 round trip.
+
+The performance benefit increases as the Gluster client round-trip time
+to the servers increases. For example, these enhancements could make
+possible use of Gluster protocol over a WAN.
+
+Scope
+-----
+
+Still unsure. This impacts libgfapi - if we want applications to take
+advantage of these enhancements, we need to expose these APIs to
+applications somehow, and POSIX does not allow them AFAIK.
+
+CREATE-AND-WRITE impacts the translator interface. Translators must be
+able to pass down:
+
+- a list of xattr values (which translators in the stack can append
+ to).
+- a data buffer
+- flags to request optionally that file be fsynced and/or closed.
+
+The [fop\_create\_t
+params](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/libglusterfs/src/xlator.h)
+have both a "flags" parameter and a "xdata" parameter; this last
+parameter could be used to pass both data and xattrs in a tagged
+sequence format (not sure whether **dict\_t** supports this).
+
+### Nature of proposed change
+
+The Gluster code might need refactoring in create-related code to
+maximize common code between existing implementation, which won't go
+away, and the new implementation of these FOPS.
+
+However, I suspect that READDIRPLUS extensions may be possible to insert
+without disrupting existing code that much, may need some help on this
+one.
+
+### Implications on manageability
+
+The gluster volume profile command will have to be extended to get
+support for the new CREATE FOP if this is how we choose to implement it.
+
+These changes should be somewhat transparent to management layer
+otherwise.
+
+### Implications on presentation layer
+
+Swift-on-file Gluster-specific code would have to change to take
+advantage of this feature.
+
+NFS and SMB would have to change to exploit new features to reduce
+SMB-specific xattr and ACL access.
+
+The
+[libgfapi](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/api/src/glfs.h)
+implementation would have to expose these features.
+
+- **glfs\_readdirplus\_r** - it's not clear that struct dirent would
+ be able to handle xattrs, and there is no place to specify which
+ extended attributes we are interested in.
+- **glfs\_creat** - has no parameters to support xattrs or write data.
+ So we'd need a new entry point to do this.
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+None.
+
+### Modification to GlusterFS metadata
+
+None. We are repackaging how data gets passed in protocol, not what it
+means.
+
+### Implications on 'glusterd'
+
+None.
+
+How To Test
+-----------
+
+We have programs that can generate metadata-intensive workloads, such as
+smallfile benchmark or fio. For smallfile creates, we can use a modified
+version of the [parallel libgfapi
+benchmark](https://github.com/bengland2/parallel-libgfapi) (don't worry,
+I know the developer ;-) to verify that the response time for the new
+create-and-write API is better than before, or to verify that
+lockless-create improves response time.
+
+In the case of readdirplus extensions, we can test with simple libgfapi
+program coupled with a protocol trace or gluster volume profile output
+to see if it's working and has desired decrease in response time.
+
+User Experience
+---------------
+
+The impact of this operation should be functionally transparent to the
+end-user, but it should significantly improve Gluster performance to the
+point where throughput and response time are reasonably close (not
+equal) to NFS, SMB, etc on local filesystems. This is particularly true
+for small-file operations and directory browsing/listing.
+
+Dependencies
+------------
+
+This change will have a significant impact on translators; it is not easy.
+Because this is a non-trivial change, an incremental approach should be
+specified and followed, with each stage committed and regression tested
+separately. For example, we could break CREATE-and-WRITE proposal into 4
+pieces:
+
+- add libgfapi support, with ENOSUPPORT returned for unimplemented
+ features
+- add list of xattrs written at create time.
+- add write data
+- add close and fsync options
+
+Documentation
+-------------
+
+How do we document RPC protocol changes? For now, I'll try to use IDL .x
+file or whatever specifies the RPC itself.
+
+Status
+------
+
+Not designed yet.
+
+Comments and Discussion
+-----------------------
+
+### Jeff Darcy 16:20, 3 December 2014
+
+"SMB: case-insensitivity of Windows = no direct lookup by filename on
+brick"
+
+We did actually come up with a way to do the case-preserving and
+case-squashing lookups simultaneously before falling back to the global
+lookup, but AFAIK it's not implemented.
+
+READDIRPLUS extension: md-cache actually does pre-fetch some attributes
+associated with (Linux) ACLs and SELinux. Maybe it just needs to
+pre-fetch some others for SMB? Also, fetching into glusterfs process
+memory doesn't save us the context switch. For that we need dentry
+injection (or something like it) so that the information is available in
+the kernel by the time the user asks for it.
+
+"glfs\_creat - has no parameters to support xattrs"
+
+These are being added already because NSR reconciliation needs them (for
+many other calls already).
diff --git a/Feature Planning/GlusterFS 4.0/dht-scalability.md b/Feature Planning/GlusterFS 4.0/dht-scalability.md
new file mode 100644
index 0000000..83ef255
--- /dev/null
+++ b/Feature Planning/GlusterFS 4.0/dht-scalability.md
@@ -0,0 +1,171 @@
+Goal
+----
+
+More scalable DHT translator.
+
+Summary
+-------
+
+Current DHT inhibits scalability by requiring that directories be on all
+subvolumes. In addition to the extra message traffic this incurs during
+*mkdir*, it adds significant complexity keeping all of the directories
+consistent across operations like *create* and *rename*. What is
+proposed, in a nutshell, is that directories should only exist on one
+subvolume, which might contain "stubs" pointing to files and directories
+that can be accessed by GFID on the same or other subvolumes. In concert
+with this, the way we store layout information needs to change, so that
+at least the "fix-layout" part of rebalancing need not involve
+traversing every directory on every subvolume.
+
+Owners
+------
+
+Shyam Ranganathan <srangana@redhat.com>
+
+Raghavendra Gowdappa <rgowdapp@redhat.com>
+
+Current status
+--------------
+
+Proposed, awaiting summit for approval.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+[Features/thousand-node-glusterd](../GlusterFS 3.6/Thousand Node Gluster.md)
+will define new ways of managing maintenance activities, some of which
+are DHT-related. Also, the new "mon cluster" might be where we store
+layout information.
+
+[Features/data-classification](../GlusterFS 3.7/Data Classification.md)
+also affects layout storage and use.
+
+Detailed Description
+--------------------
+
+Under this scheme, path-based lookup becomes very different. Currently,
+we look up a path on the file's "hashed" subvol first (according to
+parent-directory layout and file GFID). If it's not there, we need to
+look elsewhere - in the worst case on **all** subvolumes. In the future,
+our first lookup should be at the parent directory's subvolume. If the
+file is not there, it's not linked anywhere (though it might still exist
+unlinked and accessible by GFID) and we can terminate immediately. If it
+is there, then that single copy of the parent directory will contain a
+"stub" giving the file's GFID and a hint for what subvolume it's on
+(similar to a current linkfile). That information can then be used to
+initiate a GFID-based lookup. Many other code paths, such as *rename*,
+can leverage this new infrastructure to avoid current problems with
+multiple directory entries and linkfiles all for the same actual file.
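+
+A toy model of that lookup path (all names invented; real code would be
+a translator, and the "subvolumes" here are just dictionaries):
+
+```python
+# Parent directory lives on exactly one subvolume and maps names to
+# (gfid, subvolume hint) stubs; files are then found by GFID.
+class Subvolume:
+    def __init__(self):
+        self.by_gfid = {}              # gfid -> inode (like .glusterfs)
+
+    def lookup_by_gfid(self, gfid):
+        return self.by_gfid.get(gfid)
+
+class Directory:
+    def __init__(self):
+        self.stubs = {}                # name -> (gfid, subvol index hint)
+
+def lookup(parent_dir, name, subvols):
+    stub = parent_dir.stubs.get(name)
+    if stub is None:
+        return None                    # not linked anywhere: stop here
+    gfid, hint = stub
+    # GFID-based second hop, trying the hinted subvolume first.
+    order = [subvols[hint]] + [s for i, s in enumerate(subvols) if i != hint]
+    for sv in order:
+        inode = sv.lookup_by_gfid(gfid)
+        if inode is not None:
+            return inode
+    return None
+```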
+
+A possible enhancement would be to include more information in stubs,
+allowing readdirp to operate only on the directory and avoid going to
+every subvolume for information about individual files. Also, some
+secondary issues such as hard links and garbage collection (of unlinked
+but still open files) remain TBD in the final design.
+
+With respect to layout storage, the basic idea is to store a fairly
+small number of actual layouts - default, user defined, or related to
+data classification - that are each shared across many directories.
+These layouts are stored as part of our configuration, and the xattrs on
+individual directories need only specify a shared layout ID (plus
+possibly some additional "tweak" parameters) instead of a full explicit
+layout. When we do any kind of rebalancing, we need only change the
+shared layouts and not the pointers on each directory.
+
+Benefit to GlusterFS
+--------------------
+
+Improved scalability and performance for all directory-entry operations.
+
+Improved reliability for conflicting directory-entry operations, and for
+layout repair.
+
+Almost instantaneous "fix-layout" completion.
+
+Scope
+-----
+
+### Nature of proposed change
+
+Due to the complexity of the changes involved, this will probably be a
+new translator developed using a similar model to that used for AFR2.
+While it's likely to share/borrow a significant amount of code from
+current DHT, the new version will go through most of its development
+lifecycle separately and then completely supplant the old version, as
+opposed to integrating individual changes. For testing of
+compatibility/migration, it is probably desirable for both versions to
+coexist in the source tree and packages, but not necessarily in the same
+process.
+
+### Implications on manageability
+
+New/different options, but otherwise no change.
+
+### Implications on presentation layer
+
+No change. At this level the new DHT translator should be a plug-in
+replacement for the old one.
+
+### Implications on persistence layer
+
+None, unless you count reduced xattr usage.
+
+### Implications on 'GlusterFS' backend
+
+This will fundamentally change the directory structure on our back end.
+A file that is currently visible within a brick as \$BRICK\_ROOT/a/b/c
+might now be visible only as \$GFID\_FOR\_B/c with neither of the parent
+directories even present on that brick. Even that "file" will actually
+be a stub containing only the file's own GFID; to get the contents, one
+would need to look up that GFID in .glusterfs on this or some other
+brick.
+
+Linkfiles will be gone, also subsumed by stubs.
+
+### Modification to GlusterFS metadata
+
+Explicit layouts will be replaced by IDs for shared layouts (in config
+storage).
+
+### Implications on 'glusterd'
+
+Minimal changes (mostly volfile generation).
+
+How To Test
+-----------
+
+Most existing DHT tests should suffice, except for those that depend on
+the details of how layouts are stored and formatted. Those will have to
+be modified to go through the extra layer of indirection to where the
+actual layouts live.
+
+User Experience
+---------------
+
+None, except for better performance and less lost data.
+
+Dependencies
+------------
+
+See "related features" section.
+
+Documentation
+-------------
+
+TBD. There should be very little at the user level, though hopefully
+this time we'll do better at documenting the things developers
+(including add-on tool developers) need to know.
+
+Status
+------
+
+Design in progress
+
+Design and some notes can be found here,
+<https://drive.google.com/open?id=15_TOW9jwzW4griAmk-rqg2cWF-LHiR_TJ8Jn0vOvYpU&authuser=0>
+
+Thread at gluster-devel opening this up for discussion here,
+<https://www.mail-archive.com/gluster-devel%40gluster.org/msg03036.html>
+
+Comments and Discussion
+-----------------------
diff --git a/Feature Planning/GlusterFS 4.0/index.md b/Feature Planning/GlusterFS 4.0/index.md
new file mode 100644
index 0000000..0a3d47d
--- /dev/null
+++ b/Feature Planning/GlusterFS 4.0/index.md
@@ -0,0 +1,82 @@
+GlusterFS 4.0 Release Planning
+------------------------------
+
+Tentative Dates:
+
+Feature proposal for GlusterFS 4.0
+----------------------------------
+
+This list has been seeded with features from <http://goo.gl/QyjfxM>
+which provides some rationale and context. Feel free to add more. Some
+of the individual feature pages are still incomplete, but should be
+completed before voting on the final 4.0 feature set.
+
+### Node Scaling Features
+
+- [Features/thousand-node-glusterd](../GlusterFS 3.6/Thousand Node Gluster.md):
+ Glusterd changes for higher scale.
+
+- [Features/dht-scalability](./dht-scalability.md):
+ a.k.a. DHT2
+
+- [Features/sharding-xlator](../GlusterFS 3.7/Sharding xlator.md):
+ Replacement for striping.
+
+- [Features/caching](./caching.md): Client-side caching, with coherency support.
+
+### Technology Scaling Features
+
+- [Features/data-classification](../GlusterFS 3.7/Data Classification.md):
+ Tiering, compliance, and more.
+
+- [Features/SplitNetwork](./Split Network.md):
+ Support for public/private (or other multiple) networks.
+
+- [Features/new-style-replication](../GlusterFS 3.6/New Style Replication.md):
+ Log-based, chain replication.
+
+- [Features/better-brick-mgmt](./Better Brick Mgmt.md):
+ Flexible resource allocation + daemon infrastructure to handle
+ (many) more bricks
+
+- [Features/compression-dedup](./Compression Dedup.md):
+ Compression and/or deduplication
+
+### Small File Performance Features
+
+- [Features/composite-operations](./composite-operations.md):
+ Reducing round trips by wrapping multiple ops in one message.
+
+- [Features/stat-xattr-cache](./stat-xattr-cache.md):
+ Caching stat/xattr information in (user-space) server memory.
+
+### Technical Debt Reduction
+
+- [Features/code-generation](./code-generation.md):
+ Code generation
+
+- [Features/volgen-rewrite](./volgen-rewrite.md):
+ Technical-debt reduction
+
+### Other Features
+
+- [Features/rest-api](../GlusterFS 3.7/rest-api.md):
+ Fully generic API sufficient to support all CLI operations.
+
+- Features/mgmt-plugins:
+ No more patching glusterd for every new feature.
+
+- Features/perf-monitoring:
+ Always-on performance monitoring and hotspot identification.
+
+Proposing New Features
+----------------------
+
+[New Feature Template](../Feature Template.md)
+
+Use the template to create a new feature page, and then link to it from the "Feature Proposals" section above.
+
+Release Criterion
+-----------------
+
+- TBD
diff --git a/Feature Planning/GlusterFS 4.0/stat-xattr-cache.md b/Feature Planning/GlusterFS 4.0/stat-xattr-cache.md
new file mode 100644
index 0000000..e00399d
--- /dev/null
+++ b/Feature Planning/GlusterFS 4.0/stat-xattr-cache.md
@@ -0,0 +1,197 @@
+Feature
+-------
+
+server-side md-cache
+
+Summary
+-------
+
+Two years ago, Peter Portante noticed the extremely high number of
+system calls on the XFS brick required per Swift object. Since then, he
+and Ben England have observed several similar cases.
+
+More recently, while looking at a **netmist** single-thread workload run
+by a major banking customer to characterize Gluster performance, Ben
+observed this [system call profile PER
+FILE](https://s3.amazonaws.com/ben.england/netmist-and-gluster.pdf) .
+This is strong evidence of several problems with POSIX translator:
+
+- repeated polling with **sys\_lgetxattr** of the **gfid** xattr
+- repeated **sys\_lstat** calls
+- polling of xattrs that were *undefined*
+- calling **sys\_llistxattr** to get the list of all xattrs AFTER all
+  other calls
+- calling **sys\_lgetxattr** two times, once to find out how big the
+  value is and once to get the value!
+- one-at-a-time calls to get individual xattrs
+
+All of the problems except for the last one could be solved through use
+of a metadata cache associated with each inode. The last problem is not
+solvable in a pure POSIX API at this time, although XFS offers an
+**ioctl** that can get all xattrs at once (the cache could conceivably
+determine whether the brick was XFS or not and exploit this where
+available).
+
+Note that as xattrs are added to the system, this becomes more and more
+costly, and as Gluster adds new features, these typically require that
+state be kept associated with a file, usually in one or more xattrs.
+
+Owners
+------
+
+TBS
+
+Current status
+--------------
+
+There is already a **md-cache** translator, so you would think that
+problems like this would not occur, but clearly they do -- this
+translator is typically on the client side of the protocol and is
+typically above such translators as AFR and DHT. The problems may be
+worse in cases where the md-cache translator is not present (example:
+SMB with the gluster-vfs plugin, which requires the stat-prefetch volume
+parameter to be set to *off*).
+
+Related Feature Requests and Bugs
+---------------------------------
+
+- [Features/Smallfile Perf](../GlusterFS 3.7/Small File Performance.md)
+- bugzillas TBS
+
+Detailed Description
+--------------------
+
+This proposal has changed as a result of discussions in
+\#gluster-meeting - instead of modifying the POSIX translator, we
+propose to use the md-cache translator in the server above the POSIX
+translator, and add negative caching capabilities to the md-cache
+translator.
+
+By "negative caching" we mean that md-cache can tell you if the xattr
+does not exist without calling down the translator stack. How can it do
+this? On the server side, the only path to the brick is through the
+md-cache translator. When it encounters a xattr get request for a file
+it has not seen before, the first step is to call down with llistxattr()
+to find out what xattrs are stored for that file. From that point on
+until the file is evicted from the cache, any request for non-existent
+xattr values from higher translators will immediately be returned with
+ENODATA, without calling down to POSIX translator.
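+
+A rough sketch of that behavior (the class below is hypothetical; the
+real work would happen inside the md-cache translator in C, not in
+Python):
+
+```python
+# One llistxattr on first touch, then negative answers from memory.
+import errno
+import os
+
+class XattrCache:
+    def __init__(self):
+        self.known = {}              # path -> {xattr name: value or None}
+
+    def getxattr(self, path, name):
+        if path not in self.known:
+            # First touch: one listxattr tells us everything that exists.
+            self.known[path] = {n: None for n in os.listxattr(path)}
+        entry = self.known[path]
+        if name not in entry:
+            # Negative hit: answer ENODATA without going to disk.
+            raise OSError(errno.ENODATA, os.strerror(errno.ENODATA), path)
+        if entry[name] is None:
+            entry[name] = os.getxattr(path, name)  # fill on first real use
+        return entry[name]
+```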
+
+We must ensure that memory leaks do not occur, and that race conditions
+do not occur while multiple threads are accessing the cache, but this
+seems like a manageable problem and is certainly not a new problem for
+Gluster translator code.
+
+Benefit to GlusterFS
+--------------------
+
+Most of the system calls and about 50% of the elapsed time could have
+been removed from the above small-file read profile through use of this
+cache. This benefit will be more visible as we transition to using SSD
+storage, where disk seek times will not mask overheads such as this.
+
+Scope
+-----
+
+This can be done local to the glusterfsd process by inserting md-cache
+translator just above the POSIX translator, where the vast majority of
+the stat, getxattr and setxattr calls are generated from.
+
+### Nature of proposed change
+
+No new translators are required. We may require some existing
+translators to call down the stack ("wind a FOP") instead of calling
+sys\_\*xattr themselves if these calls are heavily used, so that they
+can take advantage of the stat-xattr-cache.
+
+It is *really important* that the md-cache use listxattr() to
+immediately determine which xattrs are on disk, avoiding needless
+getxattr calls this way. At present it does not do this.
+
+### Implications on manageability
+
+None. We need to make sure that the cache is big enough to support the
+threads that use it, but not so big that it consumes a significant
+percentage of memory. We may want to make a cache size and expiration
+time a tunable so that we can experiment in performance testing to
+determine optimal parameters.
+
+### Implications on presentation layer
+
+Translators above the md-cache translator are not affected.
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+None
+
+### Modification to GlusterFS metadata
+
+None
+
+### Implications on 'glusterd'
+
+None
+
+How To Test
+-----------
+
+We can use strace of a single-thread smallfile workload to verify that
+the cache is filtering out excess system calls. We could include
+counters into the cache to measure the cache hit rate.
+
+User Experience
+---------------
+
+Single-thread small-file creates should be faster, particularly on SSD
+storage. Performance testing is needed to further quantify this.
+
+Dependencies
+------------
+
+None
+
+Documentation
+-------------
+
+None, except for tunables relating to cache size and expiration time.
+
+Status
+------
+
+Not started.
+
+Comments and Discussion
+-----------------------
+
+Jeff Darcy: I've been saying for ages that we should store xattrs in a
+local DB and avoid local xattrs altogether. Besides performance, this
+would also eliminate the need for special configuration of the
+underlying local FS (to accommodate our highly unusual use of this
+feature) and generally be good for platform independence. Not quite so
+sure about other stat(2) information, but perhaps I could be persuaded.
+In any case, this has led me to look into the relevant code on a few
+occasions. Unfortunately, there are \*many\* places that directly call
+sys\_\*xattr instead of winding fops - glusterd (for replace-brick),
+changelog, quota, snapshots, and others. I think this feature is still
+very worthwhile, but all of the "cheating" we've tolerated over the
+years is going to make it more difficult.
+
+Ben England: a local DB might be a good option but could also become a
+bottleneck, unless you have a DB instance per brick (local) filesystem.
+One problem that the DB would solve is getting all the metadata in one
+query - at present POSIX API requires you to get one xattr at a time. If
+we implement a caching layer that hides whether a DB or xattrs are being
+used, we can make it easier to experiment with a DB (level DB?). On your
+2nd point, While it's true that there are many sites that call
+sys\_\*xattr directory, only a few of these really generate a lot of
+system calls. For example, some of these calls are only for the
+mountpoint. From a performance perspective, as long as we can intercept
+the vast majority of the sys\_\*xattr calls with this caching layer,
+IMHO we can tolerate a few exceptions in glusterd, etc. However, from a
+CORRECTNESS standpoint, we have to be careful that calls bypassing the
+caching layer don't cause cache contents to become stale (out-of-date,
+inconsistent with the on-disk brick filesystem contents).
diff --git a/Feature Planning/GlusterFS 4.0/volgen-rewrite.md b/Feature Planning/GlusterFS 4.0/volgen-rewrite.md
new file mode 100644
index 0000000..4b954b3
--- /dev/null
+++ b/Feature Planning/GlusterFS 4.0/volgen-rewrite.md
@@ -0,0 +1,128 @@
+Feature
+-------
+
+Volgen rewrite
+
+Summary
+-------
+
+This module has become an important choke point for development of new
+features, as each new feature needs to make changes here. Many previous
+feature additions have been rushed in, by copying/pasting code or adding
+special-case checks, instead of refactoring. The result is a big
+hairball. Every new change that involves client translators has to deal
+with various permutations of replication/EC, striping/sharding,
+rebalance, self-heal, quota, snapshots, tiering, NFS, and so on. Each
+new change at this point is almost certain to introduce subtle errors
+that will only be caught when certain combinations of features and
+operations are attempted. There aren't even enough tests to cover even
+the basic combinations, and we'd need hundreds more to test the obscure
+ones.
+
+Owners
+------
+
+Jeff Darcy <jdarcy@redhat.com>
+
+Current status
+--------------
+
+Just a proposal so far.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+TBD
+
+Detailed Description
+--------------------
+
+Many of the problems stem from the fact that our internal volfiles need
+to be consistent with, but slightly different from, one another. Instead
+of generating them all separately, we should separate the generation
+into two phases (a toy sketch follows the list):
+
+*  Generate a "core" or "vanilla" graph containing all of the translators, option settings, etc. common to all of the special-purpose volfiles.
+
+*  For each special-purpose volfile, copy the core/vanilla graph (*not the code* that generated it) and modify the copy to get what's desired.
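+
+A toy sketch of the two phases (the graph representation and translator
+choices below are invented for illustration, not the proposed design):
+
+```python
+# Build one "vanilla" graph, then derive each special-purpose volfile
+# from a copy of it instead of from separate generator code.
+import copy
+
+def core_graph(volname, bricks):
+    graph = [{"type": "protocol/client", "brick": b} for b in bricks]
+    graph.append({"type": "cluster/replicate", "volname": volname})
+    return graph
+
+def fuse_graph(core):
+    g = copy.deepcopy(core)           # modify the copy, not the code
+    g.append({"type": "performance/md-cache"})
+    return g
+
+def nfs_graph(core):
+    g = copy.deepcopy(core)
+    g.append({"type": "nfs/server"})
+    return g
+
+core = core_graph("vol0", ["srv1:/b1", "srv2:/b1"])
+volfiles = {"fuse": fuse_graph(core), "nfs": nfs_graph(core)}
+```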
+
+Some of the other problems in this module stem from lower-level issues
+such as bad data- or control-structure choices (e.g. operating on a
+linear list of bricks instead of a proper graph), or complex
+object-lifecycle management (e.g. see
+<https://bugzilla.redhat.com/show_bug.cgi?id=1211749>). Some of these
+problems might be alleviated by using a higher-level language with
+complex data structures and garbage collection. An infrastructure
+already exists to do graph manipulation in Python, developed for HekaFS
+and subsequently used in several other places (it's already in our
+source tree).
+
+Benefit to GlusterFS
+--------------------
+
+More correct, and more \*verifiably\* correct, volfile generation even
+as the next dozen features are added. Also, accelerated development time
+for those next dozen features.
+
+Scope
+-----
+
+### Nature of proposed change
+
+Pretty much limited to what currently exists in glusterd-volgen.c
+
+### Implications on manageability
+
+None.
+
+### Implications on presentation layer
+
+None.
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+None.
+
+### Modification to GlusterFS metadata
+
+None.
+
+### Implications on 'glusterd'
+
+None, unless we decide to store volfiles in a different format (e.g.
+JSON) so we can use someone else's parser instead of rolling our own.
+
+How To Test
+-----------
+
+Practically every current test generates multiple volfiles, which will
+quickly smoke out any differences. Ideally, we'd add a bunch more tests
+(many of which might fail against current code) to verify correctness of
+results for particularly troublesome combinations of features.
+
+User Experience
+---------------
+
+None.
+
+Dependencies
+------------
+
+None.
+
+Documentation
+-------------
+
+None.
+
+Status
+------
+
+Still just a proposal.
+
+Comments and Discussion
+-----------------------