Diffstat (limited to 'under_review/composite-operations.md')
1 files changed, 438 insertions, 0 deletions
diff --git a/under_review/composite-operations.md b/under_review/composite-operations.md
new file mode 100644
@@ -0,0 +1,438 @@
+Composite operations is a term describing elimination of round trips
+through a variety of techniques. Some of these techniques are borrowed,
+in spirit at least, from the NFS and SMB protocols.
+Why do we need this? All too frequently we encounter situations where
+Gluster performance is an order of magnitude or even two orders of
+magnitude slower than NFS or SMB to a local filesystem, particularly for
+small-file and metadata-intensive workloads (example: file browsing).
+You can argue that Gluster provides more functionality, so it should be
+slower, but we need to close the gap -- if Gluster were half the speed of
+NFS and provided much greater functionality plus scalability, users
+would accept some performance tradeoff.
+What is the root cause? Response time of Gluster APIs is much higher
+than response time of other protocols. A simple protocol trace can show
+you a root cause for this: excessive round-trips.
+There are two dimensions to this:
+- operations that require lookups on every brick (covered elsewhere)
+- excessive one-at-a-time access to xattrs and ACLs
+- client responsible for maintaining filesystem state instead of
+- SMB: case-insensitivity of Windows = no direct lookup by filename on
+Example of previous success: eager-lock. When Gluster was first acquired
+by Red Hat and testing with 10-GbE interfaces began, we quickly noticed
+that sequential write performance was not what we expected. The Gluster
+protocol required a 5-step sequence for every write from client to
+server(s), in order to maintain consistency between replicas, which is
+loosely paraphrased here:
+- pre-op (mark replicas dirty)
+The **cluster.eager-lock** feature was added to Gluster (3.4?) to allow
+the client to hang onto the lock. We also combined the post-op for the
+previous write, the pre-op for the current write, and the actual write
+request into a single RPC, so that instead of 5 RPCs per write we got
+down to ONE RPC per write, and write performance improved significantly
+(how much TBS).
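+To make the arithmetic concrete, here is a toy model of the RPC counts
+described above. It is not Gluster code: the five step names (in
+particular "lock" and "unlock") and the bookkeeping for the first and
+last write are assumptions made for illustration.

```python
# Toy model of RPC counts for sequential replicated writes; not Gluster code.
# The five step names are assumed from the description of the write sequence.
STEPS_WITHOUT_EAGER_LOCK = ["lock", "pre-op", "write", "post-op", "unlock"]


def rpcs_for_writes(n_writes, eager_lock):
    """Return the number of client-to-server RPCs for n sequential writes."""
    if n_writes <= 0:
        return 0
    if not eager_lock:
        # Every write pays for the full sequence.
        return n_writes * len(STEPS_WITHOUT_EAGER_LOCK)
    # Assumed eager-lock behavior: one lock up front, one combined RPC per
    # write (post-op of previous write + pre-op + write), then a final
    # post-op and an unlock when the client gives the lock back.
    return 1 + n_writes + 2


def elapsed_seconds(n_writes, rtt_seconds, eager_lock):
    """Estimate elapsed time if each RPC costs exactly one round trip."""
    return rpcs_for_writes(n_writes, eager_lock) * rtt_seconds


if __name__ == "__main__":
    for eager in (False, True):
        print(eager, rpcs_for_writes(1000, eager), elapsed_seconds(1000, 0.0002, eager))
```

+Under these assumptions the per-write cost approaches one round trip as
+the number of sequential writes grows, which is the effect described
+above.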
+Some of the problems with round trips stem from lack of scalability in
+DHT protocol, and attributes of AFR protocol.
+Related Feature Requests and Bugs
+- [Features/Smallfile Perf](../GlusterFS 3.7/Small File Performance.md) - small-file performance enhancement menu
+- [Features/dht-scalability](./dht-scalability.md) - new, scalable DHT
+- [Features/new-style-replication](../GlusterFS 3.6/New Style Replication.md) - client no longer does replication
+*Note: search RHS buglist for small-file-related performance bugs and directory browsing performance bugs. I haven't done that yet; there are a LOT of them.*
+Here are the proposals:
+- READDIRPLUS generalization
+- CREATE-AND-WRITE - allow CREATE op to transmit data and metadata
+- case-insensitivity feature - removes perf. penalty for SMB
+### READDIRPLUS used to prefetch xattrs
+recent correction: For SMB and other protocols that have additional
+security metadata, READDIRPLUS can be used more effectively to prefetch
+xattr data, such as ACLs and Windows-specific security info. However,
+upper layers have to make use of this feature. We treat ACLs as a
+special case of an extended attribute, since ACLs are not currently
+returned by READDIRPLUS (can someone confirm this?). The current RPC
+request and response structure are in
+in the above source code URL. In both cases, the request structure field
+"dict" can contain a list of extended attribute IDs (or names, not sure
+However, once these xattrs are prefetched, will md-cache translator in
+the client be able to hang onto them to prevent round-trips to the
+server? Is there any additional invalidation needed for the expanded
+role of md-cache?
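+As a rough illustration of the round trips at stake, the following toy
+model compares one-at-a-time xattr fetches with a READDIRPLUS call that
+carries a list of requested xattr names. The class and function names
+are hypothetical, the whole directory is returned in a single call (a
+real directory would take several READDIRPLUS calls), and cache
+invalidation is not modeled.

```python
# Toy round-trip counter; not the Gluster RPC interface.
KEYS = ("system.posix_acl_access", "security.NTACL")  # example xattr names


class ToyServer:
    """Stands in for a brick; every method call counts as one round trip."""

    def __init__(self, files):
        self.files = files  # {filename: {xattr_name: value}}
        self.round_trips = 0

    def readdirplus(self, xattr_keys=()):
        # Returns every entry, plus any requested xattrs that exist.
        self.round_trips += 1
        return {
            name: {k: xattrs[k] for k in xattr_keys if k in xattrs}
            for name, xattrs in self.files.items()
        }

    def getxattr(self, name, key):
        self.round_trips += 1
        return self.files[name].get(key)


def list_one_at_a_time(server, keys):
    """List the directory, then fetch each xattr with its own request."""
    names = server.readdirplus()
    return {name: {k: server.getxattr(name, k) for k in keys} for name in names}


def list_with_prefetch(server, keys):
    """Ask READDIRPLUS for the xattrs up front; the result acts as the cache."""
    return server.readdirplus(keys)
```

+With 100 files and 2 xattrs per file, the first approach costs 201 round
+trips in this model and the second costs 1, provided the client-side
+cache can safely hold onto the prefetched values.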
+### eager-lock for directories
+This extension doesn't seem to impact APIs at all, but it does require a
+way to safely do a CREATE FOP that will either appear on all replicas or
+none (or allow self-healing to repair the difference in the directories
+in the correct way).
+If we have an NSR translator, this seems pretty straightforward. NSR
+only allows the client to talk to the "leader" server in the replica
+host set, and the leader then takes responsibility for propagating the
+With AFR, the situation is very different. In order to guarantee that a
+CREATE will succeed on all AFR subvolumes, the client must write-lock
+the parent directory. Otherwise some other client could create the same
+file at the same time on some but not all of the AFR subvolumes.
+But why unlock? Chances are that any immediate subsequent file create in
+that directory will be coming from the same client, so it makes sense
+for the client to hang onto the write lock for a short while, unless
+some other client wants it. This optimistic lock behavior is similar to
+the "eager-lock" feature in the AFR translator today. Doing this saves
+us not only the LOOKUP prior to CREATE, but also the directory unlock
+per file!
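+A minimal sketch of this optimistic directory lock, under stated
+assumptions: the class names are hypothetical, the lock server is a
+single in-process object rather than a set of AFR subvolumes, and the
+revocation callback is not counted as a round trip.

```python
# Toy model of an eagerly held directory write lock; not AFR code.
class ToyDirLockServer:
    def __init__(self):
        self.holder = None
        self.lock_rpcs = 0  # lock + unlock requests received from clients

    def lock(self, client):
        self.lock_rpcs += 1
        if self.holder is not None and self.holder is not client:
            # Another client wants the lock: ask the current holder to let go.
            self.holder.revoke()
        self.holder = client

    def unlock(self, client):
        self.lock_rpcs += 1
        if self.holder is client:
            self.holder = None


class ToyClient:
    def __init__(self, server, eager):
        self.server = server
        self.eager = eager
        self.holding = False
        self.created = []

    def create(self, name):
        if not self.holding:
            self.server.lock(self)
            self.holding = True
        self.created.append(name)  # stands in for the CREATE FOP itself
        if not self.eager:
            # Non-eager behavior: give the lock back after every create.
            self.server.unlock(self)
            self.holding = False

    def revoke(self):
        # Called by the server when some other client asks for the lock.
        self.holding = False
```

+In this model 100 creates from one non-eager client cost 200 lock and
+unlock requests, while an eager client needs a single lock request until
+another client contends for the directory.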
+This extension is similar to quick-read, where the OPEN FOP can return
+the file data if it's small enough. This extension adds the following
+features to the CREATE FOP:
+- - optionally specify xattrs to associate with file when it's
+ - optionally specify write data (if it fits in 1 RPC)
+ - optionally close the file (what RELEASE does today)
+ - optionally fsync the file (for apps that require file
+ persistence such as Swift)
+This option is also similar to what librados (Ceph) API allows user to
+do today, see [Ioctx.write\_full in librados python
+This avoids the need for the round-trip sequence:
+- lock inode for write
+- unlock inode
+The existing protocol structure is in [structure
+. We would allocate reserved bits from the "flags" field for the
+optional extensions. The xdata field in the request would contain a
+tagged sequence containing the optional parameter values.
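+One way to picture the flag bits and the tagged sequence is the sketch
+below. Everything in it is an assumption for illustration: the flag
+names and values, the tag numbers, the tag/length/value layout, and the
+size limit are hypothetical and are not taken from the existing protocol
+structure.

```python
import struct
from enum import IntFlag


class CreateExt(IntFlag):
    """Hypothetical bits carved out of the CREATE request's flags field."""
    HAS_XATTRS = 0x1
    HAS_DATA = 0x2
    CLOSE = 0x4
    FSYNC = 0x8


TAG_XATTR = 1  # payload: name, NUL byte, value
TAG_DATA = 2   # payload: bytes to write at offset 0
MAX_INLINE_DATA = 128 * 1024  # assumed "fits in 1 RPC" limit


def pack_tagged(items):
    """Encode (tag, payload) pairs as 1-byte tag + 4-byte length + payload."""
    out = bytearray()
    for tag, payload in items:
        out += struct.pack("!BI", tag, len(payload)) + payload
    return bytes(out)


def unpack_tagged(buf):
    """Decode the byte string produced by pack_tagged."""
    items, off = [], 0
    while off < len(buf):
        tag, length = struct.unpack_from("!BI", buf, off)
        off += 5
        items.append((tag, buf[off:off + length]))
        off += length
    return items


def build_create_request(xattrs, data=None, close=False, fsync=False):
    """Return (flags, xdata) for a hypothetical CREATE-AND-WRITE request."""
    flags, items = CreateExt(0), []
    for name, value in xattrs.items():
        flags |= CreateExt.HAS_XATTRS
        items.append((TAG_XATTR, name.encode("utf-8") + b"\0" + value))
    if data is not None:
        if len(data) > MAX_INLINE_DATA:
            raise ValueError("write data does not fit in one RPC")
        flags |= CreateExt.HAS_DATA
        items.append((TAG_DATA, data))
    if close:
        flags |= CreateExt.CLOSE
    if fsync:
        flags |= CreateExt.FSYNC
    return flags, pack_tagged(items)
```

+A request with no optional parameters yields flags of 0 and an empty
+xdata payload, so in this sketch a plain CREATE is unchanged.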
+### case-insensitive volume support
+The SMB protocol is bridging a divide between an operating system,
+Windows, that supports case-insensitive lookup, and an operating system,
+Linux (POSIX), that supports only case-sensitive lookup inside Gluster
+bricks. If nothing is done to bridge this gap, file lookup and creation
+become very expensive in large directories (a few thousand files in
+- on CREATE, the client has to search the entire directory to
+ determine whether some other file with the same name (but a
+ different case mix) already exists. This requires locking the
+ directory. Furthermore, consistent hashing, which knows nothing
 about case mix, cannot predict which brick might contain the file,
+ since it might have been created with a different case mix. This is
+ a SCALABILITY issue.
+- on LOOKUP, the client has to search all bricks for the filename
+ since there is in general no way to predict which brick the
+ case-altered version of the filename might have hashed to. This is a
+ SCALABILITY issue. The entire contents of the directory on each
+ brick must be searched as well.
+- SMB does support "case-sensitive yes" smb.conf configuration option,
+ but this is user-hostile since Windows does not understand it.
+What happens when a Linux user-mode process such as glusterfsd (brick)
+tries to do a case-insensitive lookup on the filename using a local
+filesystem? XFS has a feature for this, but Gluster can't assume XFS as
+for VFS supporting case-insensitivity - it's not going to happen. Yes,
+you can do readdir on the directory and scan for the case-insensitive
+match, but each scan is O(N), so populating a directory with N files
+costs O(N\^2).
+**Proposal**: only use lower-case filenames (or upper-case, it doesn't
+matter) at the brick filesystem, and record the original case mix (how
+the user specified the filename at create/rename time) in an xattr, call
+**Issue**: (from Ira Cooper): what locales would be supported? SMB
+already had to deal with this.
+We could define a 'case-insensitive' volume parameter (default off), so
+that users who have no SMB clients do not experience this change in
+This mapping to lower-case filenames has to happen at or above DHT layer
+to avoid the scalability issue above. If this is not done by DHT (if it
+is done in VFS-gluster SMB plugin for example), then Gluster clients on
+a POSIX filesystem will not see the same filenames as Windows users, and
+this will lead to confusion.
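+The idea of folding case before hashing can be sketched as follows.
+This is a toy model, not DHT code: crc32 stands in for the real hash,
+the xattr name is a hypothetical placeholder, and plain `str.lower()`
+ignores the locale question raised above.

```python
import zlib

ORIGINAL_CASE_XATTR = "user.original-case"  # hypothetical xattr name


def brick_name(user_name):
    """Name stored on the brick: the user's name folded to lower case."""
    return user_name.lower()


def pick_brick(user_name, n_bricks):
    """Hash the folded name so every case variant lands on the same brick."""
    return zlib.crc32(brick_name(user_name).encode("utf-8")) % n_bricks


class ToyVolume:
    def __init__(self, n_bricks):
        # Each brick maps folded filename -> xattr dict.
        self.bricks = [dict() for _ in range(n_bricks)]

    def create(self, user_name):
        brick = self.bricks[pick_brick(user_name, len(self.bricks))]
        key = brick_name(user_name)
        if key in brick:
            raise FileExistsError(user_name)  # "a" then "A" gets EEXIST
        brick[key] = {ORIGINAL_CASE_XATTR: user_name}

    def lookup(self, user_name):
        # One brick, one exact-match lookup; no directory scan needed.
        brick = self.bricks[pick_brick(user_name, len(self.bricks))]
        return brick.get(brick_name(user_name))

    def readdir(self):
        # Report names the way the user originally spelled them.
        return sorted(
            xattrs[ORIGINAL_CASE_XATTR]
            for brick in self.bricks
            for xattrs in brick.values()
        )
```

+Because the fold happens before the hash, both SMB and non-SMB clients
+in this sketch resolve any case variant with a single lookup on a
+single brick.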
+However, this has consequences for sharing files between SMB and non-SMB
+clients - a non-SMB client will pay a performance penalty for
+case-insensitivity and will see case-insensitive behavior that is not
+strictly POSIX-compliant - for example, if I create file "a" and then
+file "A" in the same directory, the 2nd create will get EEXIST. That's
+the price you pay for having the two kinds of clients accessing the same
+volume - the most restricted client has to win.
+Changes required to DHT or equivalent:
+- READDIR(PLUS): report filenames as the user expects to see them,
 using the original-case xattr. See the READDIRPLUS enhancement above for
 how this can be done efficiently.
+- CREATE (or RENAME): map the filename within the brick to lower case
 before creating, and record the original case mix using the
+ original-case xattr. See CREATE-AND-WRITE enhancement above for how
+ this can be done efficiently.
+- LOOKUP: map the filename to lower case before attempting a lookup on
+ the brick.
+- RENAME: To prevent loss of file during a client-side crash, first
+ delete the case-mix xattr, then do the rename, then re-add the
+ case-mix xattr. If the case-mix xattr is not present, then the
+ lower-case filename is returned by READDIR(PLUS) but the file is not
+Since existing SMB users may want to take advantage of this change, we
+need a process for converting a Gluster volume to support
+- optional - use "find /your/brick/directory -not -type d -a -not
+ -path '/your/brick/directory/.glusterfs/\*' | tr '[A-Z]' '[a-z]' |
 sort " command in parallel on every brick, and do "sort --merge" of
+ per-brick outputs followed by "uniq -d" to quickly determine if
+ there are case-insensitivity collisions on existing volume. This
+ would let user resolve such conflicts ahead of time without taking
+ down the volume.
+- shut down the volume
+- run a script on all bricks in parallel to convert it to
+ case-insensitive format - very fast because it runs on a local fs.
+ - rename the brick file to lower case and store an xattr with
+ original case.
+- turn volume lookup-unhashed to ON because files will not yet be on
+ the right brick.
+- set volume into case-insensitive state
+- start volume - it is now online but not in efficient state
+- rebalance (get DHT to place the files where they belong)
+ - If rebalance uncovers case-insensitive filename collisions (very
+ unlikely), the 2nd file is renamed to its original case-mix with
+ string 'case-collision-gfid' + hex gfid appended, and a counter
+ is incremented. A simple "find" command at each brick in
+ parallel executed with pdsh can locate all instances of such
+ files - the user then has to decide what they want to do with
+- reset lookup-unhashed to default (auto)
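+The optional collision pre-check in the first step above can also be
+sketched in a few lines. The function name is hypothetical, and the
+input is assumed to be one list of brick-relative file paths per brick
+(for example, the output of the find command with the brick prefix
+stripped).

```python
from collections import defaultdict


def find_case_collisions(per_brick_paths):
    """Return {folded_path: [original spellings]} for paths that collide.

    per_brick_paths is an iterable of path lists, one list per brick.
    """
    spellings = defaultdict(set)
    for paths in per_brick_paths:
        for path in paths:
            spellings[path.lower()].add(path)
    # A collision is two or more distinct spellings of the same folded path.
    return {
        folded: sorted(originals)
        for folded, originals in spellings.items()
        if len(originals) > 1
    }


if __name__ == "__main__":
    bricks = [
        ["dir/A.txt", "dir/b.txt"],
        ["dir/a.txt", "dir/b.txt"],
    ]
    print(find_case_collisions(bricks))
```

+Collecting the original spellings in a set means that an identical path
+present on several bricks (as it would be on replicas) is not reported
+as a collision; only differently cased spellings are.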
+Benefit to GlusterFS
+- READDIRPLUS optimizations could completely solve the performance
+ problems with file browsing in large directories, at least to the
+ point where Gluster performs similarly to NFS and SMB in general and
+ can't be blamed. (DHT v2 could also improve performance by not
+ requiring round trips to every brick to retrieve a directory).
+- lockless-CREATE - can improve small-file create performance
+ significantly by condensing 4 round-trips into 1. Small-file create
+ is the worst-performing feature in Gluster today. However, it won't
+ solve small-file create problems until we address other areas below.
+- CREATE-AND-WRITE - as you can see, at least 6 round trips (maybe
+ more) are combined into 1 round trip.
+The performance benefit increases as the Gluster client round-trip time
+to the servers increases. For example, these enhancements could make
+possible use of Gluster protocol over a WAN.
+Still unsure. This impacts libgfapi - if we want applications to take
+advantage of these enhancements, we need to expose these APIs to
+applications somehow, and POSIX does not allow them AFAIK.
+CREATE-AND-WRITE impacts the translator interface. Translators must be
+able to pass down:
+- a list of xattr values (which translators in the stack can append
+- a data buffer
+- flags to request optionally that file be fsynced and/or closed.
+have both a "flags" parameter and an "xdata" parameter; this last
+parameter could be used to pass both data and xattrs in a tagged
+sequence format (not sure whether **dict\_t** supports this).
+### Nature of proposed change
+The Gluster code might need refactoring in create-related code to
+maximize common code between existing implementation, which won't go
+away, and the new implementation of these FOPS.
+However, I suspect that the READDIRPLUS extensions can be inserted
+without disrupting existing code that much; may need some help on this
+### Implications on manageability
+The gluster volume profile command will have to be extended to get
+support for the new CREATE FOP if this is how we choose to implement it.
+These changes should be somewhat transparent to management layer
+### Implications on presentation layer
+Swift-on-file Gluster-specific code would have to change to take
+advantage of this feature.
+NFS and SMB would have to change to exploit new features to reduce
+SMB-specific xattr and ACL access.
+implementation would have to expose these features.
+- **glfs\_readdirplus\_r** - it's not clear that struct dirent would
+ be able to handle xattrs, and there is no place to specify which
+ extended attributes we are interested in.
+- **glfs\_creat** - has no parameters to support xattrs or write data.
+ So we'd need a new entry point to do this.
+### Implications on persistence layer
+### Implications on 'GlusterFS' backend
+### Modification to GlusterFS metadata
+None. We are repackaging how data gets passed in protocol, not what it
+### Implications on 'glusterd'
+How To Test
+We have programs that can generate metadata-intensive workloads, such as
+smallfile benchmark or fio. For smallfile creates, we can use a modified
+version of the [parallel libgfapi
+benchmark](https://github.com/bengland2/parallel-libgfapi) (don't worry,
+I know the developer ;-) to verify that the response time for the new
+create-and-write API is better than before, or to verify that
+lockless-create improves response time.
+In the case of readdirplus extensions, we can test with simple libgfapi
+program coupled with a protocol trace or gluster volume profile output
+to see if it's working and has desired decrease in response time.
+The impact of this operation should be functionally transparent to the
+end-user, but it should significantly improve Gluster performance to the
+point where throughput and response time are reasonably close (not
+equal) to NFS, SMB, etc on local filesystems. This is particularly true
+for small-file operations and directory browsing/listing.
+This change will have significant impact on translators; it is not easy.
+Because this is a non-trivial change, an incremental approach should be
+specified and followed, with each stage committed and regression tested
+separately. For example, we could break CREATE-and-WRITE proposal into 4
+- add libgfapi support, with ENOTSUP returned for unimplemented
+- add list of xattrs written at create time.
+- add write data
+- add close and fsync options
+How do we document RPC protocol changes? For now, I'll try to use IDL .x
+file or whatever specifies the RPC itself.
+Not designed yet.
+Comments and Discussion
+### Jeff Darcy 16:20, 3 December 2014
+"SMB: case-insensitivity of Windows = no direct lookup by filename on
+We did actually come up with a way to do the case-preserving and
+case-squashing lookups simultaneously before falling back to the global
+lookup, but AFAIK it's not implemented.
+READDIRPLUS extension: md-cache actually does pre-fetch some attributes
+associated with (Linux) ACLs and SELinux. Maybe it just needs to
+pre-fetch some others for SMB? Also, fetching into glusterfs process
+memory doesn't save us the context switch. For that we need dentry
+injection (or something like it) so that the information is available in
+the kernel by the time the user asks for it.
+"glfs\_creat - has no parameters to support xattrs"
+These are being added already because NSR reconciliation needs them (for
+many other calls already).