author     hchiramm <hchiramm@redhat.com>    2015-08-04 11:07:42 +0530
committer  hchiramm <hchiramm@redhat.com>    2015-08-04 11:07:42 +0530
commit     d7d3274c6f6cea46ad296fc6d1259ee9a4e9964f (patch)
tree       8db500bcea5190101703ab2ebc4b28587a6e994c /Feature Planning/GlusterFS 4.0
parent     146b7ef7a31997634b29302a6e345ff5d9d7497a (diff)
Adding Features and planning features to glusterfs-specs repo
As per the discussion (http://www.gluster.org/pipermail/gluster-users/2015-July/022918.html)
the specs are part of this repo.
Signed-off-by: hchiramm <hchiramm@redhat.com>
Diffstat (limited to 'Feature Planning/GlusterFS 4.0')
-rw-r--r--  Feature Planning/GlusterFS 4.0/Better Brick Mgmt.md       180
-rw-r--r--  Feature Planning/GlusterFS 4.0/Compression Dedup.md       128
-rw-r--r--  Feature Planning/GlusterFS 4.0/Split Network.md           138
-rw-r--r--  Feature Planning/GlusterFS 4.0/caching.md                 143
-rw-r--r--  Feature Planning/GlusterFS 4.0/code-generation.md         143
-rw-r--r--  Feature Planning/GlusterFS 4.0/composite-operations.md    438
-rw-r--r--  Feature Planning/GlusterFS 4.0/dht-scalability.md         171
-rw-r--r--  Feature Planning/GlusterFS 4.0/index.md                    82
-rw-r--r--  Feature Planning/GlusterFS 4.0/stat-xattr-cache.md        197
-rw-r--r--  Feature Planning/GlusterFS 4.0/volgen-rewrite.md          128
10 files changed, 1748 insertions, 0 deletions
diff --git a/Feature Planning/GlusterFS 4.0/Better Brick Mgmt.md b/Feature Planning/GlusterFS 4.0/Better Brick Mgmt.md new file mode 100644 index 0000000..adfc781 --- /dev/null +++ b/Feature Planning/GlusterFS 4.0/Better Brick Mgmt.md @@ -0,0 +1,180 @@ +Goal +---- + +Easier (more autonomous) assignment of storage to specific roles + +Summary +------- + +Managing bricks and arrangements of bricks (e.g. into replica sets) +manually doesn't scale. Instead, we need more intuitive ways to group +bricks together into pools, allocate space from those pools (creating +new pools), and let users define volumes in terms of pools rather than +individual bricks. We get to worry about how to arrange those bricks +into an intelligent volume configuration, e.g. replicating between +bricks that are the same size/speed/type but not on the same server. + +Because this smarter and/or finer-grain resource allocation (plus +general technology evolution) is likely to result in many more bricks +per server than we have now, we also need a brick-daemon infrastructure +capable of handling that. + +Owners +------ + +Jeff Darcy <jdarcy@redhat.com> + +Current status +-------------- + +Proposed, waiting until summit for approval. + +Related Feature Requests and Bugs +--------------------------------- + +[Features/data-classification](../GlusterFS 3.7/Data Classification.md) +will drive the heaviest and/or most sophisticated use of this feature, +and some of the underlying mechanisms were originally proposed there. + +Detailed Description +-------------------- + +To start with, we need to distinguish between the raw brick that the +user allocates to GlusterFS and the pieces of that brick that result +from our complicated storage allocation. Some documents refer to these +as u-brick and s-brick respectively, though perhaps it's better to keep +calling the former bricks and come up with a new name for the latter - +slice, tile, pebble, etc. For now, let's stick with the x-brick +terminology. We can manipulate these objects in several ways. + +- Group u-bricks together into an equivalent pool of s-bricks + (trivially 1:1). + +- Allocate space from a pool of s-bricks, creating a set of smaller + s-bricks. Note that the results of applying this repeatedly might be + s-bricks which are on the same u-brick but part of different + volumes. + +- Combine multiple s-bricks into one via some combination of + replication, erasure coding, distribution, tiering, etc. + +- Export an s-brick as a volume. + +These operations - especially combining - can be applied iteratively, +creating successively more complex structures prior to the final export. +To support this, the code we currently use to generate volfiles needs to +be changed to generate similar definitions for the various levels of +s-bricks. Combined with the need to support versioning of these files +(for snapshots), this probably means a rewrite of the volgen code. +Another type of configuration file we need to create is for a brick +daemon. We still run one glusterfsd process per u-brick, for various +reasons. + +- Maximize compatibility with our current infrastructure for starting + and monitoring server processes. + +- Align the boundaries between actual and detected device failures. + +- Reduce the number of ports assigned, both for administrative + convenience and to avoid exhaustion. + +- Reduce context-switch and virtual-memory thrashing between too many + uncoordinated processes. 
Some day we might even add custom resource + control/scheduling between s-bricks within a process, which would be + impossible in separate processes. + +These new glusterfsd processes are going to require more complex +volfiles, and more complex translator-graph code to consume those. They +also need to be more parallel internally, so this feature depends on +eliminating single-threaded bottlenecks such as our socket transport. + +Benefit to GlusterFS +-------------------- + +- Reduced administrative overhead for large/complex volume + configurations. + +- More flexible/sophisticated volume configurations, especially with + respect to other features such as tiering or internal enhancements + such as overlapping replica/erasure sets. + +- Improved performance. + +Scope +----- + +### Nature of proposed change + +- New object model, exposed via both glusterd-level and user-level + commands on those objects. + +- Rewritten volfile infrastructure. + +- Significantly enhanced translator-graph infrastructure. + +- Multi-threaded transport. + +### Implications on manageability + +New commands will be needed to group u-bricks into pools, allocate +s-bricks from pools, etc. There will also be new commands to view status +of objects at various levels, and perhaps to set options on them. On the +other hand, "volume create" will probably become simpler as the +specifics of creating a volume are delegated downward to s-bricks. + +### Implications on presentation layer + +Surprisingly little. + +### Implications on persistence layer + +None. + +### Implications on 'GlusterFS' backend + +The on-disk structures (.glusterfs and so on) currently associated with +a brick become associated with an s-brick. The u-brick itself will +contain little, probably just an enumeration of the s-bricks into which +it has been divided. + +### Modification to GlusterFS metadata + +None. + +### Implications on 'glusterd' + +See detailed description. + +How To Test +----------- + +New tests will be needed for grouping/allocation functions. In +particular, negative tests for incorrect or impossible configurations +will be needed. Once s-bricks have been aggregated back into volumes, +most of the current volume-level tests will still apply. Related tests +will also be developed as part of the data classification feature. + +User Experience +--------------- + +See "implications on manageability" etc. + +Dependencies +------------ + +This feature is so closely associated with data classification that the +two can barely be considered separately. + +Documentation +------------- + +Much of our "brick and volume management" documentation will require a +thorough review, if not an actual rewrite. + +Status +------ + +Design still in progress. + +Comments and Discussion +----------------------- diff --git a/Feature Planning/GlusterFS 4.0/Compression Dedup.md b/Feature Planning/GlusterFS 4.0/Compression Dedup.md new file mode 100644 index 0000000..7829018 --- /dev/null +++ b/Feature Planning/GlusterFS 4.0/Compression Dedup.md @@ -0,0 +1,128 @@ +Feature +------- + +Compression / Deduplication + +Summary +------- + +In the never-ending quest to increase storage efficiency (or conversely +to decrease storage cost), we could compress and/or deduplicate data +stored on bricks. + +Owners +------ + +Jeff Darcy <jdarcy@redhat.com> + +Current status +-------------- + +Just a vague idea so far. 
+ +Related Feature Requests and Bugs +--------------------------------- + +TBD + +Detailed Description +-------------------- + +Compression and deduplication for GlusterFS have been discussed many +times. Deduplication across machines/bricks is a recognized Hard +Problem, with uncertain benefits, and is thus considered out of scope. +Deduplication within a brick is potentially achievable by using +something like +[lessfs](http://sourceforge.net/projects/lessfs/files/ "wikilink"), +which is itself a FUSE filesystem, so one fairly simple approach would +be to integrate lessfs as a translator. There's no similar option for +compression. + +In both cases, it's generally preferable to work on fully expanded files +while they're open, and then compress/dedup when they're closed. Some of +the bitrot or tiering infrastructure might be useful for moving files +between these states, or detecting when such a change is needed. There +are also some interesting interactions with quota, since we need to +count the un-compressed un-deduplicated size of the file against quota +(or do we?) and that's not what the underlying local file system will +report. + +Benefit to GlusterFS +-------------------- + +Less \$\$\$/GB for our users. + +Scope +----- + +### Nature of proposed change + +New translators, hooks into bitrot/tiering/quota, probably new daemons. + +### Implications on manageability + +Besides turning these options on or off, or setting parameters, there +will probably need to be some way of reporting the real vs. +compressed/deduplicated size of files/bricks/volumes. + +### Implications on presentation layer + +Should be none. + +### Implications on persistence layer + +If the DM folks ever get their <expletive deleted> together on this +front, we might be able to use some of their stuff instead of lessfs. +That worked so well for thin provisioning and snapshots. + +### Implications on 'GlusterFS' backend + +What's on the brick will no longer match the data that the user stored +(and might some day retrieve). In the case of compression, +reconstituting the user-visible version of the data should be a simple +matter of decompressing via a well known algorithm. In the case of +deduplication, the relevant data structures are much more complicated +and reconstitution will be correspondingly more difficult. + +### Modification to GlusterFS metadata + +Some of the information tracking deduplicated blocks will probably be +stored "privately" in .glusterfs or similar. + +### Implications on 'glusterd' + +TBD + +How To Test +----------- + +TBD + +User Experience +--------------- + +Mostly unchanged, except for performance. As with erasure coding, a +compressed/deduplicated slow tier will usually need to be paired with a +simpler fast tier for overall performance to be acceptable. + +Dependencies +------------ + +External: lessfs, DM, whatever other technology we use to do the +low-level work + +Internal: tiering/bitrot (perhaps changelog?) to track state and detect +changes + +Documentation +------------- + +TBD + +Status +------ + +Still just a vague idea. + +Comments and Discussion +----------------------- diff --git a/Feature Planning/GlusterFS 4.0/Split Network.md b/Feature Planning/GlusterFS 4.0/Split Network.md new file mode 100644 index 0000000..95cf944 --- /dev/null +++ b/Feature Planning/GlusterFS 4.0/Split Network.md @@ -0,0 +1,138 @@ +Goal +---- + +Better support for multiple networks, especially front-end vs. back-end. 
+ +Summary +------- + +GlusterFS generally expects that all clients and servers use a common +set of network names and/or addresses. For many users, having a separate +network exclusively for servers is highly desirable for both performance +reasons (segregating administrative traffic and/or second-hop NFS +traffic from ongoing user I/O) and security reasons (limiting +administrative access to the private network). While such configurations +can already be created with routing/iptables trickery, full and explicit +support would be a great improvement. + +Owners +------ + +Jeff Darcy <jdarcy@redhat.com> + +Current status +-------------- + +Proposed, awaiting summit for approval. + +Related Feature Requests and Bugs +--------------------------------- + +One proposal for the high-level syntax and semantics was made [on the +mailing +list](http://www.gluster.org/pipermail/gluster-users/2014-November/019463.html). + +Detailed Description +-------------------- + +At the very least, we need to be able to define and keep track of +multiple names/addresses for the same node, one used on the back-end +network e.g. for "peer probe" and and NFS and the other used on the +front-end network by native-protocol clients. The association can be +done via the node UUID, but we still need a way for the user to specify +which name/address is to be used for which purpose. + +Future enhancements could include multiple front-end (client) networks, +and network-specific access control. + +Benefit to GlusterFS +-------------------- + +More flexible network network topologies, potentially enhancing +performance and/or security for some deployments. + +Scope +----- + +### Nature of proposed change + +The information in /var/lib/glusterd/peers/\* must be enhanced to +include multiple names/addresses per peer, plus tags for roles +associated with each address/name. + +The volfile-generation code must be enhanced to generate volfiles for +each purpose - server, native client, NFS proxy, self-heal/rebalance - +using the names/addresses appropriate to that purpose. + +### Implications on manageability + +CLI and GUI support must be added for viewing/changing the addresses +associated with each server and the roles associated with each address. + +### Implications on presentation layer + +None. The changes should be transparent to users. + +### Implications on persistence layer + +None. + +### Implications on 'GlusterFS' backend + +None. + +### Modification to GlusterFS metadata + +See [nature of proposed change](#Nature_of_proposed_change "wikilink"). + +### Implications on 'glusterd' + +See [nature of proposed change](#Nature_of_proposed_change "wikilink"). + +How To Test +----------- + +Set up a physical configuration with separate front-end and back-end +networks. + +Use the new CLI/GUI features to define addresses and roles split across +the two networks. + +Mount a volume using each of the several volfiles that result, and +generate some traffic. + +Verify that the traffic is actually on the network appropriate to that +mount type. + +User Experience +--------------- + +By default, nothing changes. If and only if a user wants to set up a +more "advanced" split-network configuration, they'll have new tools +allowing them to do that without having to "step outside" to mess with +routing tables etc. + +Dependencies +------------ + +None. + +Documentation +------------- + +New documentation will be needed at both the conceptual and detail +levels, describing how (and why?) to set up a split-network +configuration. 
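
Purely as an illustration for that documentation (none of the field names, the example hostnames, or the selection logic below are decided), a role-tagged peer record and the per-purpose address selection could look roughly like this:

```python
# Hypothetical shape of an enhanced peer record: one node UUID, several
# addresses, each tagged with the roles it serves.  Field names and
# hostnames are illustrative assumptions only.
PEER = {
    "uuid": "00000000-0000-0000-0000-000000000000",   # placeholder UUID
    "addresses": [
        {"hostname": "server1-backend.example.com",
         "roles": {"management", "nfs", "self-heal"}},
        {"hostname": "server1-frontend.example.com",
         "roles": {"client"}},
    ],
}


def address_for(peer, role):
    """Pick the address to place in the volfile generated for a given
    purpose (native client, NFS proxy, self-heal/rebalance, ...)."""
    for addr in peer["addresses"]:
        if role in addr["roles"]:
            return addr["hostname"]
    return peer["addresses"][0]["hostname"]   # untagged peers behave as today


print(address_for(PEER, "client"))    # -> server1-frontend.example.com
```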
+ +Status +------ + +In design. + +Comments and Discussion +----------------------- + +Some use-cases in [Bug 764850](https://bugzilla.redhat.com/764850). +Feedback requested. Please jump in. + +[Discussion on gluster-devel](https://mail.corp.redhat.com/zimbra/#16) diff --git a/Feature Planning/GlusterFS 4.0/caching.md b/Feature Planning/GlusterFS 4.0/caching.md new file mode 100644 index 0000000..2c21c0c --- /dev/null +++ b/Feature Planning/GlusterFS 4.0/caching.md @@ -0,0 +1,143 @@ +Goal Caching +------------ + +Improved performance via client-side caching. + +Summary +------- + +GlusterFS has historically taken a very conservative approach to +client-side caching, due to the cost and difficulty of ensuring +consistency across a truly distributed file system. However, this has +often led to a competitive disadvantage vs. other file systems that +cache more aggressively. While one could argue that expecting an +application designed for a local FS or NFS to behave the same way on a +distributed FS is unrealistic, or question whether competitors' caching +is really safe, this nonetheless remains one of our users' top requests. + +For purposes of this discussion, pre-fetching into cache is considered +part of caching itself. However, write-behind caching (buffering) is a +separate feature, and is not in scope. + +Owners +------ + +Xavier Hernandez <xhernandez@datalab.es> + +Jeff Darcy <jdarcy@redhat.com> + +Current status +-------------- + +Proposed, waiting until summit for approval. + +Related Feature Requests and Bugs +--------------------------------- + +[Features/FS-Cache](Features/FS-Cache "wikilink") is about a looser +(non-consistent) kind of caching integrated via FUSE. This feature is +differentiated by being fully consistent, and implemented in GlusterFS +itself. + +[IMCa](http://mvapich.cse.ohio-state.edu/static/media/publications/slide/imca_icpp08.pdf) +describes a completely external approach to caching (both data and +metadata) with GlusterFS. + +Detailed Description +-------------------- + +Retaining data in cache on a client after it's read is trivial. +Pre-fetching into that same cache is barely more difficult. All of the +hard parts are on the server. + +- Tracking which clients still have cached copies of which data (or + metadata). + +- Issuing and waiting for invalidation requests when a client changes + data cached elsewhere. + +- Handling failures of the servers tracking client state, and of + communication with clients that need to be invalidated. + +- Doing all of this without putting performance in the toilet. + +Invalidating cached copies is analogous to breaking locks, so the +async-notification and "oplock" code already being developed for +multi-protocol (SMB3/NFS4) support can probably be used here. More +design is probably needed around scalable/performant tracking of client +cache state by servers. + +Benefit to GlusterFS +-------------------- + +Much better performance for cache-friendly workloads. + +Scope +----- + +### Nature of proposed change + +Some of the existing "performance" translators could be replaced by a +single client-caching translator. There will also need to be a +server-side helper translator to track client cache states and issue +invalidation requests at the appropriate times. Such asynchronous +(server-initiated) requests probably require transport changes, and +[GF\_FOP\_IPC](http://review.gluster.org/#/c/8812/) might play a part as +well. 
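
To make the server-side bookkeeping concrete, here is a minimal sketch (not actual GlusterFS code; the class, method, and `transport.send_invalidate()` names are assumptions) of the state the helper translator would have to keep: which clients hold cached copies of which inode, and whom to invalidate when one of them writes.

```python
# Illustrative sketch of the server-side cache-state tracking described
# above.  Class, method, and transport names are assumptions, not real APIs.
from collections import defaultdict


class CacheTracker(object):
    def __init__(self, transport):
        self.transport = transport        # whatever sends async upcalls to clients
        self.holders = defaultdict(set)   # gfid -> set of client ids with cached copies

    def on_cached_read(self, gfid, client):
        # A client has just read (or pre-fetched) this inode and now caches it.
        self.holders[gfid].add(client)

    def on_write(self, gfid, writer):
        # Every other client holding a cached copy must be told to drop it
        # before the write is acknowledged -- this is the consistency step.
        for client in self.holders[gfid] - {writer}:
            self.transport.send_invalidate(client, gfid)
        self.holders[gfid] = {writer}
```

The hard parts listed under Detailed Description -- surviving failures of the server holding this state, and clients that cannot be reached -- are exactly what such a toy model leaves out.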
+ +### Implications on manageability + +New commands will be needed to set cache parameters, force cache +flushes, etc. + +### Implications on presentation layer + +None, except for integration with the same async/oplock infrastructure +as used separately in SMB and NFS. + +### Implications on persistence layer + +None. + +### Implications on 'GlusterFS' backend + +We will likely need some sort of database associated with each brick to +maintain information about cache states. + +### Modification to GlusterFS metadata + +None. + +### Implications on 'glusterd' + +None. + +How To Test +----------- + +We'll need new tests to verify that invalidations are in fact occurring, +that we can't read stale/inconsistent data despite the increased caching +on clients, etc. + +User Experience +--------------- + +See "implications on manageability" section. + +Dependencies +------------ + +Async-notification and oplock code from the Samba team. + +Documentation +------------- + +TBD + +Status +------ + +Design in private review, hopefully available for public review soon. + +Comments and Discussion +----------------------- diff --git a/Feature Planning/GlusterFS 4.0/code-generation.md b/Feature Planning/GlusterFS 4.0/code-generation.md new file mode 100644 index 0000000..5c25a13 --- /dev/null +++ b/Feature Planning/GlusterFS 4.0/code-generation.md @@ -0,0 +1,143 @@ +Goal +---- + +Reduce internal duplication of code by generating from templates. + +Summary +------- + +The translator calling convention is based on long lists of +operation-specific arguments instead of a common "control block" +struct/union. As a result, many parts of our code are highly repetitive +both internally and with respect to one another. As an example of +internal redundancy, consider how many of the functions in +[defaults.c](https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/defaults.c) +look similar. As an example of external redundancy, consider how the +[patch to add GF\_FOP\_IPC](http://review.gluster.org/#/c/8812/) has to +make parallel changes to 17 files - defaults, stubs, syncops, RPC, +io-threads, and so on. All of this duplication slows development of new +features, and creates huge potential for errors as definitions that need +to match don't. Indeed, during development of a code generator for NSR, +several such inconsistencies have already been found. + +Owners +------ + +Jeff Darcy <jdarcy@redhat.com> + +Current status +-------------- + +Proposed, awaiting approval. + +Related Feature Requests and Bugs +--------------------------------- + +Code generation was already used successfully in the first generation of +[NSR](../GlusterFS 3.6/New Style Replication.md) and will continue to be +used in the second. + +Detailed Description +-------------------- + +See Summary section above. + +Benefit to GlusterFS +-------------------- + +- Fewer bugs from inconsistencies between how similar operations are + handled within one translator, or how a single operation is handled + across many. + +- Greater ease of adding new operation types, or new translators which + implement similar functionality for many operations. + +Scope +----- + +### Nature of proposed change + +The code-generation infrastructure itself consists of three parts: + +- A list of operations and their associated arguments (both original + and callback, with types). + +- A script to combine this list with a template to do the actual + generation. + +- Modifications to makefiles etc. to do generation during a build. 
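
As a concrete (purely hypothetical) illustration of how the operation list and template combine, the sketch below emits default-function boilerplate for two made-up fop descriptions; the real fop table format, template syntax, and generated C are all still to be defined.

```python
#!/usr/bin/env python
# Hypothetical generator sketch: combine a fop description table with a
# template to emit default functions.  Table entries and the generated C
# are illustrative only.

FOPS = {
    # fop name -> list of (C type, argument name); types carry their own "*".
    "statfs": [("loc_t *", "loc"), ("dict_t *", "xdata")],
    "flush":  [("fd_t *", "fd"), ("dict_t *", "xdata")],
}

TEMPLATE = """\
int32_t
default_{name} (call_frame_t *frame, xlator_t *this, {args})
{{
        STACK_WIND_TAIL (frame, FIRST_CHILD (this),
                         FIRST_CHILD (this)->fops->{name}, {argnames});
        return 0;
}}
"""


def generate(fops, out):
    for name, args in sorted(fops.items()):
        out.write(TEMPLATE.format(
            name=name,
            args=", ".join("%s%s" % (ctype, argname) for ctype, argname in args),
            argnames=", ".join(argname for _, argname in args)))


if __name__ == "__main__":
    import sys
    generate(FOPS, sys.stdout)
```

Run against the full fop list, the same description table could just as easily drive the stub, syncop, and RPC templates discussed below.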
+ +The first and easiest target is auto-generated default functions. Stubs +and syncops could follow pretty quickly. Other possibilities include: + +- GFAPI (both C and Python) + +- glupy + +- RPC (replace rpcgen?) + +- io-threads + +- changelog (the [full-data-logging + translator](https://forge.gluster.org/~jdarcy/glusterfs-core/jdarcys-glusterfs-data-logging) + on the forge already uses this technique) + +Even something as complicated as AFR/NSR/EC could use code generation to +handle quorum checks more consistently, wrap operations in transactions, +and so on. NSR already does; the others could. + +### Implications on manageability + +None. + +### Implications on presentation layer + +None. + +### Implications on persistence layer + +None. + +### Implications on 'GlusterFS' backend + +None. + +### Modification to GlusterFS metadata + +None. + +### Implications on 'glusterd' + +None. + +How To Test +----------- + +This change is not intended to introduce any change visible except to +developers. Standard regression tests should be sufficient to verify +that no such change has occurred. + +User Experience +--------------- + +None. + +Dependencies +------------ + +None. + +Documentation +------------- + +Developer documentation should explain the format of the fop-description +and template files. In particular developers need to know what variables +are available for use in templates, and how to add new ones. + +Status +------ + +Patch available to generate default functions. Others to follow. + +Comments and Discussion +----------------------- diff --git a/Feature Planning/GlusterFS 4.0/composite-operations.md b/Feature Planning/GlusterFS 4.0/composite-operations.md new file mode 100644 index 0000000..5cc29b4 --- /dev/null +++ b/Feature Planning/GlusterFS 4.0/composite-operations.md @@ -0,0 +1,438 @@ +Feature +------- + +Composite operations is a term describing elimination of round trips +through a variety of techniques. Some of these techniques are borrowed +from NFS and SMB protocols in spirit at least. + +Why do we need this? All too frequently we encounter situations where +Gluster performance is an order of magnitude or even two orders of +magnitude slower than NFS or SMB to a local filesystem, particularly for +small-file and metadata-intensive workloads (example: file browsing). +You can argue that Gluster provides more functionality, so it should be +slower, but we need to close the gap -- if Gluster was half the speed of +NFS and provided much greater functionality plus scalability, users +would be ok with some performance tradeoff. + +What is the root cause? Response time of Gluster APIs is much higher +than response time of other protocols. A simple protocol trace can show +you a root cause for this: excessive round-trips. + +There are two dimensions to this: + +- operations that require lookups on every brick (covered elsewhere) +- excessive one-at-a-time access to xattrs and ACLs +- client responsible for maintaining filesystem state instead of + server +- SMB: case-insensitivity of Windows = no direct lookup by filename on + brick + +Summary +------- + +example of previous success: eager-lock. When Gluster was first acquired +by Red Hat and testing with 10-GbE interfaces began, we quickly noticed +that sequential write performance was not what we expected. 
The Gluster +protocol required a 5-step sequence for every write from client to +server(s), in order to maintain consistency between replicas, which is +loosely paraphrased here: + +- lock-replica-inode +- pre-op (mark replicas dirty) +- write +- post-op +- unlock-replica-inode + +The **cluster.eager-lock** feature was added to Gluster (3.4?) to allow +the client to hang onto the lock, and we combined post-op for previous +write with pre-op for current write and actual write request so that +instead of 5 RPCs per write we got down to ONE RPC per write, and write +performance improved significantly (how much TBS) + +Owners +------ + +TBS + +Current status +-------------- + +Some of the problems with round trips stem from lack of scalability in +DHT protocol, and attributes of AFR protocol. + +Related Feature Requests and Bugs +--------------------------------- + +- [Features/Smallfile Perf](../GlusterFS 3.7/Small File Performance.md) - small-file performance enhancement menu +- [Features/dht-scalability](./dht-scalability.md) - new, scalable DHT +- [Features/new-style-replication](..GlusterFS 3.6/New Style Replication.md)- client no longer does replication + +*Note : search RHS buglist for small-file-related performance bugs and directory browsing performance bugs, I haven't done that yet, there are a LOT of them* + +Detailed Description +-------------------- + +Here are the proposals: + +- READDIRPLUS generalization +- lockless-CREATE +- CREATE-AND-WRITE - allow CREATE op to transmit data and metadata + also +- case-insensitivity feature - removes perf. penalty for SMB + +### READDIRPLUS used to prefetch xattrs + +recent correction: For SMB and other protocols that have additional +security metadata, READDIRPLUS can be used more effectively to prefetch +xattr data, such as ACLs and Windows-specific security info. However, +upper layers have to make use of this feature. We treat ACLs as a +special case of an extended attribute, since ACLs are not currently +returned by READDIRPLUS (can someone confirm this?). The current RPC +request and response structure are in +[gfs3\_readdirp\_{req,rsp}](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/rpc/xdr/src/glusterfs3-xdr.x) +in the above source code URL. In both cases, the request structure field +"dict" can contain a list of extended attribute IDs (or names, not sure +which). + +However, once these xattrs are prefetched, will md-cache translator in +the client be able to hang onto them to prevent round-trips to the +server? Is there any additional invalidation needed for the expanded +role of md-cache? + +### eager-lock for directories + +This extension doesn't seem to impact APIs at all, but it does require a +way to safely do a CREATE FOP that will either appear on all replicas or +none (or allow self-healing to repair the difference in the directories +in the correct way). + +If we have an NSR translator, this seems pretty straightforward. NSR +only allows the client to talk to the "leader" server in the replica +host set, and the leader then takes responsibility for propagating the +change. + +With AFR, the situation is very different. In order to guarantee that a +CREATE will succeed on all AFR subvolumes, the client must write-lock +the parent directory. Otherwise some other client could create the same +file at the same time on some but not all of the AFR subvolumes. + +But why unlock? 
Chances are that any immediate subsequent file create in +that directory will be coming from the same client, so it makes sense +for the client to hang onto the write lock for a short while, unless +some other client wants it. This optimistic lock behavior is similar to +the "eager-lock" feature in the AFR translator today. Doing this saves +us not only the need to do a LOOKUP prior to CREATE, but also saves us +the need to do a directory unlock per file! + +### CREATE-AND-WRITE + +This extension is similar to quick-read, where the OPEN FOP can return +the file data if it's small enough. This extension adds the following +features to the CREATE FOP: + +- - optionally specify xattrs to associate with file when it's + created + - optionally specify write data (if it fits in 1 RPC) + - optionally close the file (what RELEASE does today) + - optionally fsync the file (for apps that require file + persistence such as Swift) + +This option is also similar to what librados (Ceph) API allows user to +do today, see [Ioctx.write\_full in librados python +binding](http://ceph.com/docs/master/rados/api/python/#writing-reading-and-removing-objects) + +This avoids the need for the round-trip sequence: + +- lock inode for write +- create +- write +- flush(directory) +- set-xattr[1] +- set-xattr[2] +- ... +- set-xattr[N] +- release +- unlock inode + +The existing protocol structure is in [structure +gfs3\_create\_req](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/rpc/xdr/src/glusterfs3-xdr.x) +. We would allocate reserved bits from the "flags" field for the +optional extensions. The xdata field in the request would contain a +tagged sequence containing the optional parameter values. + +### case-insensitive volume support + +The SMB protocol is bridging a divide between an operating system, +Windows, that supports case-insensitive lookup, and an operating system, +Linux (POSIX) that supports only case-sensitive lookup inside Gluster +bricks. If nothing is done to bridge this gap, file lookup and creation +becomes very expensive in large directories (a few thousand files in +size): + +- on CREATE, the client has to search the entire directory to + determine whether some other file with the same name (but a + different case mix) already exists. This requires locking the + directory. Furthermore, consistent hashing, which knows nothing + about case mix, can not predict which brick might contain the file, + since it might have been created with a different case mix. This is + a SCALABILITY issue. + +- on LOOKUP, the client has to search all bricks for the filename + since there is in general no way to predict which brick the + case-altered version of the filename might have hashed to. This is a + SCALABILITY issue. The entire contents of the directory on each + brick must be searched as well. + +- SMB does support "case-sensitive yes" smb.conf configuration option, + but this is user-hostile since Windows does not understand it. + +What happens when Linux user-mode process such as glusterfsd (brick) +tries to do a case-insensitive lookup on the filename using a local +filesystem? XFS has a feature for this, but Gluster can't assume XFS as +for VFS supporting case-insensitivity - it's not going to happen. Yes +you can do readdir on directory and scan for the case-insensitive match, +but it's O(N\^2) where N is number of files you place into a directory. 
+ +**Proposal**: only use lower-case filenames (or upper-case, it doesn't +matter) at the brick filesystem, and record the original case mix (how +the user specified the filename at create/rename time) in an xattr, call +it 'original-case'. + +**Issue**: (from Ira Cooper): what locales would be supported? SMB +already had to deal with this. + +We could define a 'case-insensitive' volume parameter (default off), so +that users who have no SMB clients do not experience this change in +behavior. + +This mapping to lower-case filenames has to happen at or above DHT layer +to avoid the scalability issue above. If this is not done by DHT (if it +is done in VFS-gluster SMB plugin for example), then Gluster clients on +a POSIX filesystem will not see the same filenames as Windows users, and +this will lead to confusion. + +However, this has consequences for sharing file between SMB and non-SMB +client - non-SMB client will pay performance penalty for +case-insensitivity and will see case-insensitive behavior that is not +strictly POSIX-compliant - for example if I create file "a" and then +file "A" in same directory, the 2nd create will get EEXIST. That's the +price you pay for having the two kinds of clients accessing the same +volume - the most restricted client has to win. + +Changes required to DHT or equivalent: + +- READDIR(PLUS): report filenames as the user expects to see them, + using the original-case xattr. see above READDIRPLUS enhancement for + how this can be done efficiently. +- CREATE (or RENAME):, map the filename within the brick to lower case + before creating, and records the original case mix using the + original-case xattr. See CREATE-AND-WRITE enhancement above for how + this can be done efficiently. +- LOOKUP: map the filename to lower case before attempting a lookup on + the brick. +- RENAME: To prevent loss of file during a client-side crash, first + delete the case-mix xattr, then do the rename, then re-add the + case-mix xattr. If the case-mix xattr is not present, then the + lower-case filename is returned by READDIR(PLUS) but the file is not + lost. + +Since existing SMB users may want to take advantage of this change, we +need a process for converting a Gluster volume to support +case-insensitivity: + +- optional - use "find /your/brick/directory -not -type d -a -not + -path '/your/brick/directory/.glusterfs/\*' | tr '[A-Z]' '[a-z]' | + sort " command in parallel on every brick, and do sort -merge of + per-brick outputs followed by "uniq -d" to quickly determine if + there are case-insensitivity collisions on existing volume. This + would let user resolve such conflicts ahead of time without taking + down the volume. +- shut down the volume +- run a script on all bricks in parallel to convert it to + case-insensitive format - very fast because it runs on a local fs. + - rename the brick file to lower case and store an xattr with + original case. +- turn volume lookup-unhashed to ON because files will not yet be on + the right brick. +- set volume into case-insensitive state +- start volume - it is now online but not in efficient state +- rebalance (get DHT to place the files where they belong) + - If rebalance uncovers case-insensitive filename collisions (very + unlikely), the 2nd file is renamed to its original case-mix with + string 'case-collision-gfid' + hex gfid appended, and a counter + is incremented. 
A simple "find" command at each brick in + parallel executed with pdsh can locate all instances of such + files - the user then has to decide what they want to do with + them. +- reset lookup-unhashed to default (auto) + +Benefit to GlusterFS +-------------------- + +- READDIRPLUS optimizations could completely solve the performance + problems with file browsing in large directories, at least to the + point where Gluster performs similarly to NFS and SMB in general and + can't be blamed. (DHT v2 could also improve performance by not + requiring round trips to every brick to retrieve a directory). + +- lockless-CREATE - can improve small-file create performance + significantly by condensing 4 round-trips into 1. Small-file create + is the worst-performing feature in Gluster today. However, it won't + solve small-file create problems until we address other areas below. + +- CREATE-AND-WRITE - as you can see, at least 6 round trips (maybe + more) are combined into 1 round trip. + +The performance benefit increases as the Gluster client round-trip time +to the servers increases. For example, these enhancements could make +possible use of Gluster protocol over a WAN. + +Scope +----- + +Still unsure. This impacts libgfapi - if we want applications to take +advantage of these enhancements, we need to expose these APIs to +applications somehow, and POSIX does not allow them AFAIK. + +CREATE-AND-WRITE impacts the translator interface. Translators must be +able to pass down: + +- a list of xattr values (which translators in the stack can append + to). +- a data buffer +- flags to request optionally that file be fsynced and/or closed. + +The [fop\_create\_t +params](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/libglusterfs/src/xlator.h) +have both a "flags" parameter and a "xdata" parameter; this last +parameter could be used to pass both data and xattrs in a tagged +sequence format (not sure whether **dict\_t** supports this). + +### Nature of proposed change + +The Gluster code might need refactoring in create-related code to +maximize common code between existing implementation, which won't go +away, and the new implementation of these FOPS. + +However, I suspect that READDIRPLUS extensions may be possible to insert +without disrupting existing code that much, may need some help on this +one. + +### Implications on manageability + +The gluster volume profile command will have to be extended to get +support for the new CREATE FOP if this is how we choose to implement it. + +These changes should be somewhat transparent to management layer +otherwise. + +### Implications on presentation layer + +Swift-on-file Gluster-specific code would have to change to take +advantage of this feature. + +NFS and SMB would have to change to exploit new features to reduce +SMB-specific xattr and ACL access. + +The +[libgfapi](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/api/src/glfs.h) +implementation would have to expose these features. + +- **glfs\_readdirplus\_r** - it's not clear that struct dirent would + be able to handle xattrs, and there is no place to specify which + extended attributes we are interested in. +- **glfs\_creat** - has no parameters to support xattrs or write data. + So we'd need a new entry point to do this. + +### Implications on persistence layer + +None. + +### Implications on 'GlusterFS' backend + +None. + +### Modification to GlusterFS metadata + +None. We are repackaging how data gets passed in protocol, not what it +means. 
+ +### Implications on 'glusterd' + +None. + +How To Test +----------- + +We have programs that can generate metadata-intensive workloads, such as +smallfile benchmark or fio. For smallfile creates, we can use a modified +version of the [parallel libgfapi +benchmark](https://github.com/bengland2/parallel-libgfapi) (don't worry, +I know the developer ;-) to verify that the response time for the new +create-and-write API is better than before, or to verify that +lockless-create improves response time. + +In the case of readdirplus extensions, we can test with simple libgfapi +program coupled with a protocol trace or gluster volume profile output +to see if it's working and has desired decrease in response time. + +User Experience +--------------- + +The impact of this operation should be functionally transparent to the +end-user, but it should significantly improve Gluster performance to the +point where throughput and response time are reasonably close (not +equal) to NFS, SMB, etc on local filesystems. This is particularly true +for small-file operations and directory browsing/listing. + +Dependencies +------------ + +This change will have significant impact on translators, it is not easy. +Because this is a non-trivial change, an incremental approach should be +specified and followed, with each stage committed and regression tested +separately. For example, we could break CREATE-and-WRITE proposal into 4 +pieces: + +- add libgfapi support, with ENOSUPPORT returned for unimplemented + features +- add list of xattrs written at create time. +- add write data +- add close and fsync options + +Documentation +------------- + +How do we document RPC protocol changes? For now, I'll try to use IDL .x +file or whatever specifies the RPC itself. + +Status +------ + +Not designed yet. + +Comments and Discussion +----------------------- + +### Jeff Darcy 16:20, 3 December 2014 + +Talk:Features/composite-operations + +"SMB: case-insensitivity of Windows = no direct lookup by filename on +brick" + +We did actually come up with a way to do the case-preserving and +case-squashing lookups simultaneously before falling back to the global +lookup, but AFAIK it's not implemented. + +READDIRPLUS extension: md-cache actually does pre-fetch some attributes +associated with (Linux) ACLs and SELinux. Maybe it just needs to +pre-fetch some others for SMB? Also, fetching into glusterfs process +memory doesn't save us the context switch. For that we need dentry +injection (or something like it) so that the information is available in +the kernel by the time the user asks for it. + +"glfs\_creat - has no parameters to support xattrs" + +These are being added already because NSR reconciliation needs them (for +many other calls already). diff --git a/Feature Planning/GlusterFS 4.0/dht-scalability.md b/Feature Planning/GlusterFS 4.0/dht-scalability.md new file mode 100644 index 0000000..83ef255 --- /dev/null +++ b/Feature Planning/GlusterFS 4.0/dht-scalability.md @@ -0,0 +1,171 @@ +Goal +---- + +More scalable DHT translator. + +Summary +------- + +Current DHT inhibits scalability by requiring that directories be on all +subvolumes. In addition to the extra message traffic this incurs during +*mkdir*, it adds significant complexity keeping all of the directories +consistent across operations like *create* and *rename*. 
What is +proposed, in a nutshell, is that directories should only exist on one +subvolume, which might contain "stubs" pointing to files and directories +that can be accessed by GFID on the same or other subvolumes. In concert +with this, the way we store layout information needs to change, so that +at least the "fix-layout" part of rebalancing need not involve +traversing every directory on every subvolume. + +Owners +------ + +Shyam Ranganathan <srangana@redhat.com> + +Raghavendra Gowdappa <rgowdapp@redhat.com> + +Current status +-------------- + +Proposed, awaiting summit for approval. + +Related Feature Requests and Bugs +--------------------------------- + +[Features/thousand-node-glusterd](../GLusterFS 3.6/Thousand Node Gluster.md) +will define new ways of managing maintenance activities, some of which +are DHT-related. Also, the new "mon cluster" might be where we store +layout information. + +[Features/data-classification](../GLusterFS 3.7/Data Classification.md) +also affects layout storage and use. + +Detailed Description +-------------------- + +Under this scheme, path-based lookup becomes very different. Currently, +we look up a path on the file's "hashed" subvol first (according to +parent-directory layout and file GFID). If it's not there, we need to +look elsewhere - in the worst case on **all** subvolumes. In the future, +our first lookup should be at the parent directory's subvolume. If the +file is not there, it's not linked anywhere (though it might still exist +unlinked and accessible by GFID) and we can terminate immediately. If it +is there, then that single copy of the parent directory will contain a +"stub" giving the file's GFID and a hint for what subvolume it's on +(similar to a current linkfile). That information can then be used to +initiate a GFID-based lookup. Many other code paths, such as *rename*, +can leverage this new infrastructure to avoid current problems with +multiple directory entries and linkfiles all for the same actual file. + +A possible enhancement would be to include more information in stubs, +allowing readdirp to operate only on the directory and avoid going to +every subvolume for information about individual files. Also, some +secondary issues such as hard links and garbage collection (of unlinked +but still open files) remain TBD in the final design. + +With respect to layout storage, the basic idea is to store a fairly +small number of actual layouts - default, user defined, or related to +data classification - that are each shared across many directories. +These layouts are stored as part of our configuration, and the xattrs on +individual directories need only specify a shared layout ID (plus +possibly some additional "tweak" parameters) instead of a full explicit +layout. When we do any kind of rebalancing, we need only change the +shared layouts and not the pointers on each directory. + +Benefit to GlusterFS +-------------------- + +Improved scalability and performance for all directory-entry operations. + +Improved reliability for conflicting directory-entry operations, and for +layout repair. + +Almost instantaneous "fix-layout" completion. + +Scope +----- + +### Nature of proposed change + +Due to the complexity of the changes involved, this will probably be a +new translator developed using a similar model to that used for AFR2. 
+While it's likely to share/borrow a significant amount of code from +current DHT, the new version will go through most of its development +lifecycle separately and then completely supplant the old version, as +opposed to integrating individual changes. For testing of +compatibility/migration, it is probably desirable for both versions to +coexist in the source tree and packages, but not necessarily in the same +process. + +### Implications on manageability + +New/different options, but otherwise no change. + +### Implications on presentation layer + +No change. At this level the new DHT translator should be a plug-in +replacement for the old one. + +### Implications on persistence layer + +None, unless you count reduced xattr usage. + +### Implications on 'GlusterFS' backend + +This will fundamentally change the directory structure on our back end. +A file that is currently visible within a brick as \$BRICK\_ROOT/a/b/c +might now be visible only as \$GFID\_FOR\_B/c with neither of the parent +directories even present on that brick. Even that "file" will actually +be a stub containing only the file's own GFID; to get the contents, one +would need to look up that GFID in .glusterfs on this or some other +brick. + +Linkfiles will be gone, also subsumed by stubs. + +### Modification to GlusterFS metadata + +Explicit layouts will be replaced by IDs for shared layouts (in config +storage). + +### Implications on 'glusterd' + +Minimal changes (mostly volfile generation). + +How To Test +----------- + +Most existing DHT tests should suffice, except for those that depend on +the details of how layouts are stored and formatted. Those will have to +be modified to go through the extra layer of indirection to where the +actual layouts live. + +User Experience +--------------- + +None, except for better performance and less lost data. + +Dependencies +------------ + +See "related features" section. + +Documentation +------------- + +TBD. There should be very little at the user level, though hopefully +this time we'll do better at documenting the things developers +(including add-on tool developers) need to know. + +Status +------ + +Design in progress + +Design and some notes can be found here, +<https://drive.google.com/open?id=15_TOW9jwzW4griAmk-rqg2cWF-LHiR_TJ8Jn0vOvYpU&authuser=0> + +Thread at gluster-devel opening this up for discussion here, +<https://www.mail-archive.com/gluster-devel%40gluster.org/msg03036.html> + +Comments and Discussion +----------------------- diff --git a/Feature Planning/GlusterFS 4.0/index.md b/Feature Planning/GlusterFS 4.0/index.md new file mode 100644 index 0000000..0a3d47d --- /dev/null +++ b/Feature Planning/GlusterFS 4.0/index.md @@ -0,0 +1,82 @@ +GlusterFS 4.0 Release Planning +------------------------------ + +Tentative Dates: + +Feature proposal for GlusterFS 4.0 +---------------------------------- + +This list has been seeded with features from <http://goo.gl/QyjfxM> +which provides some rationale and context. Feel free to add more. Some +of the individual feature pages are still incomplete, but should be +completed before voting on the final 4.0 feature set. + +### Node Scaling Features + +- [Features/thousand-node-glusterd](../GlusterFS 3.6/Thousand Node Gluster.md): + Glusterd changes for higher scale. + +- [Features/dht-scalability](./dht-scalability.md): + a.k.a. DHT2 + +- [Features/sharding-xlator](../GlusterFS 3.7/Sharding xlator.md): + Replacement for striping. + +- [Features/caching](./caching.md): Client-side caching, with coherency support. 
+ +### Technology Scaling Features + +- [Features/data-classification](../GlusterFS 3.7/Data Classification.md): + Tiering, compliance, and more. + +- [Features/SplitNetwork](./Split Network.md): + Support for public/private (or other multiple) networks. + +- [Features/new-style-replication](../GlusterFS 3.6/New Style Replication.md): + Log-based, chain replication. + +- [Features/better-brick-mgmt](./Better Brick Mgmt.md): + Flexible resource allocation + daemon infrastructure to handle + (many) more bricks + +- [Features/compression-dedup](./Compression Dedup.md): + Compression and/or deduplication + +### Small File Performance Features + +- [Features/composite-operations](./composite-operations.md): + Reducing round trips by wrapping multiple ops in one message. + +- [Features/stat-xattr-cache](./stat-xattr-cache.md): + Caching stat/xattr information in (user-space) server memory. + +### Technical Debt Reduction + +- [Features/code-generation](./code-generation.md): + Code generation + +- [Features/volgen-rewrite](./volgen-rewrite.md): + Technical-debt reduction + +### Other Features + +- [Features/rest-api](../GlusterFS 3.7/rest-api.md): + Fully generic API sufficient to support all CLI operations. + +- Features/mgmt-plugins: + No more patching glusterd for every new feature. + +- Features/perf-monitoring: + Always-on performance monitoring and hotspot identification. + +Proposing New Features +---------------------- + +[New Feature Template](../Feature Template.md) + +Use the template to create a new feature page, and then link to it from the "Feature Proposals" section above. + +Release Criterion +----------------- + +- TBD diff --git a/Feature Planning/GlusterFS 4.0/stat-xattr-cache.md b/Feature Planning/GlusterFS 4.0/stat-xattr-cache.md new file mode 100644 index 0000000..e00399d --- /dev/null +++ b/Feature Planning/GlusterFS 4.0/stat-xattr-cache.md @@ -0,0 +1,197 @@ +Feature +------- + +server-side md-cache + +Summary +------- + +Two years ago, Peter Portante noticed the extremely high number of +system calls on the XFS brick required per Swift object. Since then, he +and Ben England have observed several similar cases. + +More recently, while looking at a **netmist** single-thread workload run +by a major banking customer to characterize Gluster performance, Ben +observed this [system call profile PER +FILE](https://s3.amazonaws.com/ben.england/netmist-and-gluster.pdf) . +This is strong evidence of several problems with POSIX translator: + +- repeated polling with **sys\_lgetxattr** of the **gfid** xattr +- repeated **sys\_lstat** calls +- polling of xattrs that were *undefined* +- calling **sys\_llistattr** to get list of all xattrs AFTER all other + calls +- calling *'sys\_lgetxattr* two times, once to find out how big the + value is and once to get the value! +- one-at-a-time calls to get individual xattrs + +All of the problems except for the last one could be solved through use +of a metadata cache associated with each inode. The last problem is not +solvable in a pure POSIX API at this time, although XFS offers an +**ioctl** that can get all xattrs at once (the cache could conceivably +determine whether the brick was XFS or not and exploit this where +available). + +Note that as xattrs are added to the system, this becomes more and more +costly, and as Gluster adds new features, these typically require that +state be kept associated with a file, usually in one or more xattrs. 
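
To preview the remedy argued for under Detailed Description below, here is a minimal per-inode cache sketch (Python 3 on Linux, not GlusterFS code) showing how a single listxattr() up front lets every later request for an absent xattr be answered negatively without another system call.

```python
# Toy model of the proposed per-inode stat/xattr cache, negative caching
# included.  Python 3 on Linux; not GlusterFS code.
import errno
import os


class InodeMDCache(object):
    def __init__(self, path):
        self.path = path
        self.st = os.lstat(path)                       # one stat, reused afterwards
        self.present = set(os.listxattr(path, follow_symlinks=False))
        self.values = {}                               # xattr name -> cached value

    def getxattr(self, name):
        if name not in self.present:
            # Negative cache hit: answer ENODATA without touching the brick.
            raise OSError(errno.ENODATA, "no such xattr", self.path)
        if name not in self.values:                    # fetch once, then reuse
            self.values[name] = os.getxattr(self.path, name,
                                            follow_symlinks=False)
        return self.values[name]
```

A real translator would also need invalidation whenever something bypasses it, which is exactly the correctness concern raised in the comments at the end of this page.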
+ +Owners +------ + +TBS + +Current status +-------------- + +There is already a **md-cache** translator, so you would think that +problems like this would not occur, but clearly they do -- this +translator is typically on the client side of the protocol and is +typically above such translators as AFR and DHT. The problems may be +worse in cases where the md-cache translator is not present (example: +SMB with gluster-vfs plugin that requires stat-prefetch volume parameter +to be set to *off*. + +Related Feature Requests and Bugs +--------------------------------- + +- [Features/Smallfile Perf](../GlusterFS 3.7/Small File Performance.md) +- bugzillas TBS + +Detailed Description +-------------------- + +This proposal has changed as a result of discussions in +\#gluster-meeting - instead of modifying the POSIX translator, we +propose to use the md-cache translator in the server above the POSIX +translator, and add negative caching capabilities to the md-cache +translator. + +By "negative caching" we mean that md-cache can tell you if the xattr +does not exist without calling down the translator stack. How can it do +this? In the server side, the only path to the brick is through the +md-cache translator. When it encounters a xattr get request for a file +it has not seen before, the first step is to call down with llistxattr() +to find out what xattrs are stored for that file. From that point on +until the file is evicted from the cache, any request for non-existent +xattr values from higher translators will immediately be returned with +ENODATA, without calling down to POSIX translator. + +We must ensure that memory leaks do not occur, and that race conditions +do not occur while multiple threads are accessing the cache, but this +seems like a manageable problem and is certainly not a new problem for +Gluster translator code. + +Benefit to GlusterFS +-------------------- + +Most of the system calls and about 50% of the elapsed time could have +been removed from the above small-file read profile through use of this +cache. This benefit will be more visible as we transition to using SSD +storage, where disk seek times will not mask overheads such as this. + +Scope +----- + +This can be done local to the glusterfsd process by inserting md-cache +translator just above the POSIX translator, where the vast majority of +the stat, getxattr and setxattr calls are generated from. + +### Nature of proposed change + +No new translators are required. We may require some existing +translators to call down the stack ("wind a FOP") instead of calling +sys\_\*xattr themselves if these calls are heavily used, so that they +can take advantage of the stat-xattr-cache. + +It is *really important* that the md-cache use listxattr() to +immediately determine which xattrs are on disk, avoiding needless +getxattr calls this way. At present it does not do this. + +### Implications on manageability + +None. We need to make sure that the cache is big enough to support the +threads that use it, but not so big that it consumes a significant +percentage of memory. We may want to make a cache size and expiration +time a tunable so that we can experiment in performance testing to +determine optimal parameters. + +### Implications on presentation layer + +Translators above the md-cache translator are not affected + +### Implications on persistence layer + +None. 
+ +### Implications on 'GlusterFS' backend + +None + +### Modification to GlusterFS metadata + +None + +### Implications on 'glusterd' + +None + +How To Test +----------- + +We can use strace of a single-thread smallfile workload to verify that +the cache is filtering out excess system calls. We could include +counters into the cache to measure the cache hit rate. + +User Experience +--------------- + +single-thread small-file creates should be faster, particularly on SSD +storage. Performance testing is needed to further quantify this. + +Dependencies +------------ + +None + +Documentation +------------- + +None, except for tunables relating to cache size and expiration time. + +Status +------ + +Not started. + +Comments and Discussion +----------------------- + +Jeff Darcy: I've been saying for ages that we should store xattrs in a +local DB and avoid local xattrs altogether. Besides performance, this +would also eliminate the need for special configuration of the +underlying local FS (to accommodate our highly unusual use of this +feature) and generally be good for platform independence. Not quite so +sure about other stat(2) information, but perhaps I could be persuaded. +In any case, this has led me to look into the relevant code on a few +occasions. Unfortunately, there are \*many\* places that directly call +sys\_\*xattr instead of winding fops - glusterd (for replace-brick), +changelog, quota, snapshots, and others. I think this feature is still +very worthwhile, but all of the "cheating" we've tolerated over the +years is going to make it more difficult. + +Ben England: a local DB might be a good option but could also become a +bottleneck, unless you have a DB instance per brick (local) filesystem. +One problem that the DB would solve is getting all the metadata in one +query - at present POSIX API requires you to get one xattr at a time. If +we implement a caching layer that hides whether a DB or xattrs are being +used, we can make it easier to experiment with a DB (level DB?). On your +2nd point, While it's true that there are many sites that call +sys\_\*xattr directory, only a few of these really generate a lot of +system calls. For example, some of these calls are only for the +mountpoint. From a performance perspective, as long as we can intercept +the vast majority of the sys\_\*xattr calls with this caching layer, +IMHO we can tolerate a few exceptions in glusterd, etc. However, from a +CORRECTNESS standpoint, we have to be careful that calls bypassing the +caching layer don't cause cache contents to become stale (out-of-date, +inconsistent with the on-disk brick filesystem contents). diff --git a/Feature Planning/GlusterFS 4.0/volgen-rewrite.md b/Feature Planning/GlusterFS 4.0/volgen-rewrite.md new file mode 100644 index 0000000..4b954b3 --- /dev/null +++ b/Feature Planning/GlusterFS 4.0/volgen-rewrite.md @@ -0,0 +1,128 @@ +Feature +------- + +Volgen rewrite + +Summary +------- + +This module has become an important choke point for development of new +features, as each new feature needs to make changes here. Many previous +feature additions have been rushed in, by copying/pasting code or adding +special-case checks, instead of refactoring. The result is a big +hairball. Every new change that involves client translators has to deal +with various permutations of replication/EC, striping/sharding, +rebalance, self-heal, quota, snapshots, tiering, NFS, and so on. 
Each +new change at this point is almost certain to introduce subtle errors +that will only be caught when certain combinations of features and +operations are attempted. There aren't even enough tests to cover even +the basic combinations, and we'd need hundreds more to test the obscure +ones. + +Owners +------ + +Jeff Darcy <jdarcy@redhat.com> + +Current status +-------------- + +Just a proposal so far. + +Related Feature Requests and Bugs +--------------------------------- + +TBD + +Detailed Description +-------------------- + +Many of the problems stem from the fact that our internal volfiles need +to be consistent with, but slightly different from, one another. Instead +of generating them all separately, we should separate the generation +into two phases: + +* Generate a "core" or "vanilla" graph containing all of the translators, option settings, etc. common to all of the special-purpose volfiles. + +* For each special-purpose volfile, copy the core/vanilla graph (*not the code* that generated it) and modify the copy to get what's desired. + +Some of the other problems in this module stem from lower-level issues +such as bad data- or control-structure choices (e.g. operating on a +linear list of bricks instead of a proper graph), or complex +object-lifecycle management (e.g. see +<https://bugzilla.redhat.com/show_bug.cgi?id=1211749>). Some of these +problems might be alleviated by using a higher-level language with +complex data structures and garbage collection. An infrastructure +already exists to do graph manipulation in Python, developed for HekaFS +and subsequently used in several other places (it's already in our +source tree). + +Benefit to GlusterFS +-------------------- + +More correct, and more \*verifiably\* correct, volfile generation even +as the next dozen features are added. Also, accelerated development time +for those next dozen features. + +Scope +----- + +### Nature of proposed change + +Pretty much limited to what currently exists in glusterd-volgen.c + +### Implications on manageability + +None. + +### Implications on presentation layer + +None. + +### Implications on persistence layer + +None. + +### Implications on 'GlusterFS' backend + +None. + +### Modification to GlusterFS metadata + +None. + +### Implications on 'glusterd' + +None, unless we decide to store volfiles in a different format (e.g. +JSON) so we can use someone else's parser instead of rolling our own. + +How To Test +----------- + +Practically every current test generates multiple volfiles, which will +quickly smoke out any differences. Ideally, we'd add a bunch more tests +(many of which might fail against current code) to verify correctness of +results for particularly troublesome combinations of features. + +User Experience +--------------- + +None. + +Dependencies +------------ + +None. + +Documentation +------------- + +None. + +Status +------ + +Still just a proposal. + +Comments and Discussion +----------------------- |
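
To illustrate the two-phase generation proposed under Detailed Description above, here is a toy sketch (illustrative only; real volfile generation would build on the existing Python graph-manipulation infrastructure already in the source tree): build one core graph, then copy and specialize it for each purpose.

```python
# Toy two-phase volfile generation: phase 1 builds the shared "vanilla"
# graph, phase 2 copies that graph (not the code that built it) and adds
# the purpose-specific pieces.  Translator names here are only examples.
import copy


def core_graph(volname, bricks):
    return {"volume": volname,
            "graph": [{"type": "protocol/client", "remote-subvolume": b}
                      for b in bricks]}


def specialize(core, extra_xlators):
    g = copy.deepcopy(core)              # never mutate the shared core graph
    g["graph"].extend(extra_xlators)
    return g


core = core_graph("testvol", ["server1:/bricks/b1", "server2:/bricks/b1"])
fuse_client = specialize(core, [{"type": "performance/write-behind"}])
nfs_server = specialize(core, [{"type": "nfs/server"}])
print(fuse_client)
print(nfs_server)
```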