diff --git a/in_progress/composite-operations.md b/in_progress/composite-operations.md
new file mode 100644
index 0000000..5cc29b4
--- /dev/null
+++ b/in_progress/composite-operations.md
@@ -0,0 +1,438 @@
+Feature
+-------
+
+"Composite operations" is a term describing the elimination of round
+trips through a variety of techniques. Some of these techniques are
+borrowed, at least in spirit, from the NFS and SMB protocols.
+
+Why do we need this? All too frequently we encounter situations where
+Gluster is one or even two orders of magnitude slower than NFS or SMB
+on a local filesystem, particularly for small-file and
+metadata-intensive workloads (example: file browsing). You can argue
+that Gluster provides more functionality, so it should be slower, but
+we need to close the gap -- if Gluster were half the speed of NFS while
+providing much greater functionality plus scalability, users would
+accept some performance tradeoff.
+
+What is the root cause? The response time of Gluster APIs is much
+higher than that of other protocols. A simple protocol trace shows why:
+excessive round trips.
+
+There are several dimensions to this:
+
+- operations that require lookups on every brick (covered elsewhere)
+- excessive one-at-a-time access to xattrs and ACLs
+- client responsible for maintaining filesystem state instead of
+ server
+- SMB: case-insensitivity of Windows = no direct lookup by filename on
+ brick
+
+Summary
+-------
+
+An example of a previous success: eager-lock. When Gluster was first
+acquired by Red Hat and testing with 10-GbE interfaces began, we quickly
+noticed that sequential write performance was not what we expected. To
+maintain consistency between replicas, the Gluster protocol required a
+5-step sequence for every write from client to server(s), loosely
+paraphrased here:
+
+- lock-replica-inode
+- pre-op (mark replicas dirty)
+- write
+- post-op
+- unlock-replica-inode
+
+The **cluster.eager-lock** feature was added to Gluster (3.4?) to allow
+the client to hang onto the lock. The post-op for the previous write
+was combined with the pre-op for the current write and the actual write
+request, so instead of 5 RPCs per write we got down to ONE RPC per
+write, and write performance improved significantly (how much TBS).
+
+Owners
+------
+
+TBS
+
+Current status
+--------------
+
+Some of the problems with round trips stem from a lack of scalability
+in the DHT protocol and from attributes of the AFR protocol.
+
+Related Feature Requests and Bugs
+---------------------------------
+
+- [Features/Smallfile Perf](../GlusterFS 3.7/Small File Performance.md) - small-file performance enhancement menu
+- [Features/dht-scalability](./dht-scalability.md) - new, scalable DHT
+- [Features/new-style-replication](../GlusterFS 3.6/New Style Replication.md) - client no longer does replication
+
+*Note: search the RHS bug list for small-file-related performance bugs and directory-browsing performance bugs; I haven't done that yet, and there are a LOT of them.*
+
+Detailed Description
+--------------------
+
+Here are the proposals:
+
+- READDIRPLUS generalization
+- lockless-CREATE (see "eager-lock for directories" below)
+- CREATE-AND-WRITE - allow the CREATE op to also transmit data and
+  metadata
+- case-insensitivity feature - removes the performance penalty for SMB
+
+### READDIRPLUS used to prefetch xattrs
+
+Recent correction: for SMB and other protocols that have additional
+security metadata, READDIRPLUS can be used more effectively to prefetch
+xattr data, such as ACLs and Windows-specific security info. However,
+upper layers have to make use of this feature. We treat ACLs as a
+special case of an extended attribute, since ACLs are not currently
+returned by READDIRPLUS (can someone confirm this?). The current RPC
+request and response structures are defined in
+[gfs3\_readdirp\_{req,rsp}](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/rpc/xdr/src/glusterfs3-xdr.x).
+In both cases, the "dict" field of the request structure can contain a
+list of extended attribute IDs (or names, not sure which).
+
+However, once these xattrs are prefetched, will the md-cache translator
+on the client be able to hang onto them and avoid round trips to the
+server? Is any additional invalidation needed for this expanded role of
+md-cache?
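+
+As a concrete (and purely illustrative) sketch of the request side,
+here is how a client-side translator might build the dict that asks the
+bricks to return selected xattrs with each READDIRPLUS entry, assuming
+the convention that requested xattr names are passed as dict keys (as
+md-cache already does for some keys on LOOKUP). The key names below are
+examples only, not a proposed set.
+
+```
+/* Sketch only: build the dict a READDIRPLUS request could carry to ask
+ * the bricks to return selected xattrs with each directory entry. */
+#include <glusterfs/dict.h>     /* libglusterfs: dict_t, dict_new, dict_set_int8 */
+
+static dict_t *
+build_readdirp_xattr_request (void)
+{
+        dict_t *xdata = dict_new ();
+        if (!xdata)
+                return NULL;
+
+        /* POSIX ACLs -- md-cache already knows how to cache these */
+        dict_set_int8 (xdata, "system.posix_acl_access", 0);
+        dict_set_int8 (xdata, "system.posix_acl_default", 0);
+
+        /* Hypothetical Windows/SMB security metadata; the real key
+         * names would come from the Samba vfs_glusterfs module. */
+        dict_set_int8 (xdata, "security.NTACL", 0);
+
+        return xdata;   /* becomes the "dict" of gfs3_readdirp_req */
+}
+```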
+
+### eager-lock for directories
+
+This extension doesn't seem to impact APIs at all, but it does require a
+way to safely do a CREATE FOP that will either appear on all replicas or
+none (or allow self-healing to repair the difference in the directories
+in the correct way).
+
+If we have an NSR translator, this seems pretty straightforward. NSR
+only allows the client to talk to the "leader" server in the replica
+host set, and the leader then takes responsibility for propagating the
+change.
+
+With AFR, the situation is very different. In order to guarantee that a
+CREATE will succeed on all AFR subvolumes, the client must write-lock
+the parent directory. Otherwise some other client could create the same
+file at the same time on some but not all of the AFR subvolumes.
+
+But why unlock? Chances are that any immediate subsequent file create in
+that directory will be coming from the same client, so it makes sense
+for the client to hang onto the write lock for a short while, unless
+some other client wants it. This optimistic lock behavior is similar to
+the "eager-lock" feature in the AFR translator today. Doing this not
+only saves us the LOOKUP prior to CREATE, it also saves us a directory
+unlock round trip per file!
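+
+A rough illustration of that optimistic behavior follows. This is not
+AFR code; the lock state, the stand-in "RPCs", and the release-request
+flag are all hypothetical, and the point is only to show when the lock
+RPCs happen relative to the creates.
+
+```
+/* Illustration only: decision logic for an "eager" parent-directory
+ * write lock.  One LOCK RPC can cover many CREATEs. */
+#include <stdbool.h>
+#include <stdio.h>
+
+struct dir_lock_state {
+        bool held;               /* do we already hold the write lock? */
+        bool release_requested;  /* has another client asked for it?   */
+};
+
+/* Stand-ins for the real lock/unlock RPCs. */
+static void send_lock_rpc (const char *dir)   { printf ("LOCK   %s\n", dir); }
+static void send_unlock_rpc (const char *dir) { printf ("UNLOCK %s\n", dir); }
+
+static void
+create_file (struct dir_lock_state *st, const char *dir, const char *name)
+{
+        if (!st->held) {                 /* first create: take the lock */
+                send_lock_rpc (dir);
+                st->held = true;
+        }
+        printf ("CREATE %s/%s\n", dir, name);
+        /* Do NOT unlock yet: keep the lock for the next create, unless
+         * some other client has asked for it in the meantime. */
+        if (st->release_requested) {
+                send_unlock_rpc (dir);
+                st->held = false;
+                st->release_requested = false;
+        }
+}
+
+int
+main (void)
+{
+        struct dir_lock_state st = { false, false };
+        /* Three creates in the same directory cost one LOCK RPC. */
+        create_file (&st, "/dir", "a");
+        create_file (&st, "/dir", "b");
+        create_file (&st, "/dir", "c");
+        return 0;
+}
+```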
+
+### CREATE-AND-WRITE
+
+This extension is similar to quick-read, where the OPEN FOP can return
+the file data if it's small enough. It adds the following features to
+the CREATE FOP:
+
+- optionally specify xattrs to associate with the file when it's
+  created
+- optionally specify write data (if it fits in 1 RPC)
+- optionally close the file (what RELEASE does today)
+- optionally fsync the file (for apps that require file persistence,
+  such as Swift)
+
+This is also similar to what the librados (Ceph) API lets users do
+today; see [Ioctx.write\_full in the librados Python
+binding](http://ceph.com/docs/master/rados/api/python/#writing-reading-and-removing-objects).
+
+This avoids the need for the round-trip sequence:
+
+- lock inode for write
+- create
+- write
+- flush(directory)
+- set-xattr[1]
+- set-xattr[2]
+- ...
+- set-xattr[N]
+- release
+- unlock inode
+
+The existing protocol structure is defined in
+[gfs3\_create\_req](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/rpc/xdr/src/glusterfs3-xdr.x).
+We would allocate reserved bits from the "flags" field for the
+optional extensions. The xdata field in the request would contain a
+tagged sequence holding the optional parameter values.
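+
+A sketch of how that encoding might look follows. Neither the flag bits
+nor the dict keys below exist in the current protocol; they are made up
+here purely to illustrate the "reserved flag bits plus tagged xdata"
+idea.
+
+```
+/* Hypothetical encoding of the CREATE-AND-WRITE extras. */
+#include <stddef.h>
+#include <glusterfs/dict.h>    /* libglusterfs: dict_t, dict_set_static_bin */
+
+#define GF_CREATE_WITH_XATTRS  0x01000000u  /* xdata carries xattrs     */
+#define GF_CREATE_WITH_DATA    0x02000000u  /* xdata carries write data */
+#define GF_CREATE_AND_CLOSE    0x04000000u  /* release fd after create  */
+#define GF_CREATE_AND_FSYNC    0x08000000u  /* fsync before replying    */
+
+static int
+pack_create_and_write (dict_t *xdata, const void *buf, size_t len,
+                       const void *acl, size_t acl_len, unsigned *flags)
+{
+        /* initial file contents, if they fit in one RPC */
+        if (dict_set_static_bin (xdata, "create.data", (void *) buf, len))
+                return -1;
+        *flags |= GF_CREATE_WITH_DATA;
+
+        /* one of possibly several xattrs to set at create time */
+        if (dict_set_static_bin (xdata, "user.acl-blob", (void *) acl,
+                                 acl_len))
+                return -1;
+        *flags |= GF_CREATE_WITH_XATTRS;
+
+        /* no fd wanted back; make the file durable before replying */
+        *flags |= GF_CREATE_AND_CLOSE | GF_CREATE_AND_FSYNC;
+        return 0;
+}
+```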
+
+### case-insensitive volume support
+
+The SMB protocol bridges a divide between an operating system, Windows,
+that supports case-insensitive lookup, and an operating system, Linux
+(POSIX), that supports only case-sensitive lookup inside Gluster
+bricks. If nothing is done to bridge this gap, file lookup and creation
+become very expensive in large directories (a few thousand files in
+size):
+
+- on CREATE, the client has to search the entire directory to
+ determine whether some other file with the same name (but a
+ different case mix) already exists. This requires locking the
+ directory. Furthermore, consistent hashing, which knows nothing
+ about case mix, can not predict which brick might contain the file,
+ since it might have been created with a different case mix. This is
+ a SCALABILITY issue.
+
+- on LOOKUP, the client has to search all bricks for the filename
+ since there is in general no way to predict which brick the
+ case-altered version of the filename might have hashed to. This is a
+ SCALABILITY issue. The entire contents of the directory on each
+ brick must be searched as well.
+
+- Samba does support the "case sensitive = yes" smb.conf configuration
+  option, but this is user-hostile since Windows does not understand it.
+
+What happens when a Linux user-mode process such as glusterfsd (the
+brick) tries to do a case-insensitive lookup of a filename on a local
+filesystem? XFS has a feature for this, but Gluster can't assume XFS,
+and as for the VFS supporting case-insensitivity - it's not going to
+happen. Yes, you can do a readdir on the directory and scan for a
+case-insensitive match, but that scan is O(N), so populating a
+directory with N files this way is O(N\^2).
+
+**Proposal**: only use lower-case filenames (or upper-case, it doesn't
+matter) at the brick filesystem, and record the original case mix (how
+the user specified the filename at create/rename time) in an xattr, call
+it 'original-case'.
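+
+Below is a minimal sketch of this proposal, using plain syscalls on a
+brick path and ASCII-only case folding (which is exactly what the
+locale issue below is about). The xattr name "user.original-case" and
+the helper names are illustrative, not a design decision.
+
+```
+/* Sketch: create a file under its lower-cased name on the brick and
+ * record the user's case mix in an xattr. */
+#include <ctype.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/xattr.h>
+
+static void
+fold_case (const char *in, char *out, size_t outlen)
+{
+        size_t i;
+        for (i = 0; in[i] != '\0' && i + 1 < outlen; i++)
+                out[i] = (char) tolower ((unsigned char) in[i]);
+        out[i] = '\0';
+}
+
+/* Returns an fd for the new file, or -1 on error. */
+static int
+create_case_insensitive (const char *brick_dir, const char *user_name)
+{
+        char folded[256], path[4096];
+        int  fd;
+
+        fold_case (user_name, folded, sizeof (folded));
+        snprintf (path, sizeof (path), "%s/%s", brick_dir, folded);
+
+        fd = open (path, O_CREAT | O_EXCL | O_WRONLY, 0644);
+        if (fd < 0)
+                return -1;      /* EEXIST covers "a" vs "A" collisions */
+
+        /* remember how the user actually spelled the name */
+        fsetxattr (fd, "user.original-case", user_name,
+                   strlen (user_name), 0);
+        return fd;
+}
+```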
+
+**Issue**: (from Ira Cooper): what locales would be supported? SMB
+already had to deal with this.
+
+We could define a 'case-insensitive' volume parameter (default off), so
+that users who have no SMB clients do not experience this change in
+behavior.
+
+This mapping to lower-case filenames has to happen at or above the DHT
+layer to avoid the scalability issue above. If it is not done by DHT
+(if it is done in the Samba VFS plugin for Gluster, for example), then
+native Gluster (POSIX) clients will not see the same filenames as
+Windows users, and this will lead to confusion.
+
+However, this has consequences for sharing files between SMB and
+non-SMB clients - non-SMB clients will pay the performance penalty for
+case-insensitivity and will see case-insensitive behavior that is not
+strictly POSIX-compliant - for example, if I create file "a" and then
+file "A" in the same directory, the 2nd create will get EEXIST. That's
+the price you pay for having the two kinds of clients access the same
+volume - the most restricted client has to win.
+
+Changes required to DHT or equivalent:
+
+- READDIR(PLUS): report filenames as the user expects to see them,
+  using the original-case xattr. See the READDIRPLUS enhancement above
+  for how this can be done efficiently.
+- CREATE (or RENAME): map the filename to lower case before creating
+  it on the brick, and record the original case mix using the
+  original-case xattr. See the CREATE-AND-WRITE enhancement above for
+  how this can be done efficiently.
+- LOOKUP: map the filename to lower case before attempting a lookup on
+  the brick.
+- RENAME: to prevent loss of the file during a client-side crash, first
+  delete the original-case xattr, then do the rename, then re-add the
+  original-case xattr (see the sketch below). If the original-case
+  xattr is not present, READDIR(PLUS) returns the lower-case filename,
+  but the file is not lost.
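+
+A sketch of that RENAME ordering, again using plain syscalls on brick
+paths and the same illustrative "user.original-case" xattr name. A
+crash between any two steps leaves the file reachable under its
+lower-case name; at worst it temporarily loses its case-mix annotation.
+
+```
+/* Sketch of the crash-safe RENAME ordering described above. */
+#include <stdio.h>      /* rename */
+#include <string.h>
+#include <sys/xattr.h>
+
+static int
+rename_case_insensitive (const char *old_path,      /* lower-case, on brick */
+                         const char *new_path,      /* lower-case, on brick */
+                         const char *new_user_name) /* case mix from client */
+{
+        /* 1. drop the old annotation: READDIR(PLUS) now falls back to
+         *    the stored lower-case name, so nothing can be lost */
+        removexattr (old_path, "user.original-case");
+
+        /* 2. the actual rename */
+        if (rename (old_path, new_path) != 0)
+                return -1;
+
+        /* 3. record the new case mix */
+        return setxattr (new_path, "user.original-case", new_user_name,
+                         strlen (new_user_name), 0);
+}
+```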
+
+Since existing SMB users may want to take advantage of this change, we
+need a process for converting a Gluster volume to support
+case-insensitivity:
+
+- optional - run "find /your/brick/directory -not -type d -a -not
+  -path '/your/brick/directory/.glusterfs/\*' | tr '[A-Z]' '[a-z]' |
+  sort" in parallel on every brick, then do a "sort --merge" of the
+  per-brick outputs followed by "uniq -d" to quickly determine whether
+  there are case-insensitivity collisions on the existing volume. This
+  would let the user resolve such conflicts ahead of time without
+  taking down the volume.
+- shut down the volume
+- run a script on all bricks in parallel to convert each brick to the
+  case-insensitive format - very fast because it runs on a local fs:
+    - rename each brick file to lower case and store an xattr with the
+      original case.
+- turn the volume's lookup-unhashed option ON, because files will not
+  yet be on the right brick.
+- set volume into case-insensitive state
+- start volume - it is now online but not in efficient state
+- rebalance (get DHT to place the files where they belong)
+    - If rebalance uncovers case-insensitive filename collisions (very
+      unlikely), the 2nd file is renamed to its original case mix with
+      the string 'case-collision-gfid' + the hex gfid appended, and a
+      counter is incremented. A simple "find" command executed in
+      parallel at each brick with pdsh can locate all instances of such
+      files - the user then has to decide what to do with them.
+- reset lookup-unhashed to default (auto)
+
+Benefit to GlusterFS
+--------------------
+
+- READDIRPLUS optimizations could completely solve the performance
+ problems with file browsing in large directories, at least to the
+ point where Gluster performs similarly to NFS and SMB in general and
+ can't be blamed. (DHT v2 could also improve performance by not
+ requiring round trips to every brick to retrieve a directory).
+
+- lockless-CREATE - can improve small-file create performance
+ significantly by condensing 4 round-trips into 1. Small-file create
+ is the worst-performing feature in Gluster today. However, it won't
+ solve small-file create problems until we address other areas below.
+
+- CREATE-AND-WRITE - as you can see, at least 6 round trips (maybe
+ more) are combined into 1 round trip.
+
+The performance benefit increases as the Gluster client's round-trip
+time to the servers increases. For example, these enhancements could
+make it practical to use the Gluster protocol over a WAN.
+
+Scope
+-----
+
+Still unsure. This impacts libgfapi - if we want applications to take
+advantage of these enhancements, we need to expose these APIs to
+applications somehow, and POSIX does not provide equivalents AFAIK.
+
+CREATE-AND-WRITE impacts the translator interface. Translators must be
+able to pass down:
+
+- a list of xattr values (which translators in the stack can append
+ to).
+- a data buffer
+- flags to optionally request that the file be fsynced and/or closed.
+
+The [fop\_create\_t
+params](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/libglusterfs/src/xlator.h)
+include both a "flags" parameter and an "xdata" parameter; the latter
+could be used to pass both data and xattrs in a tagged-sequence format
+(not sure whether **dict\_t** supports this).
+
+### Nature of proposed change
+
+The create-related Gluster code might need refactoring to maximize the
+code shared between the existing implementation, which won't go away,
+and the new implementation of these FOPs.
+
+However, I suspect the READDIRPLUS extensions can be inserted without
+disrupting existing code very much; I may need some help on this one.
+
+### Implications on manageability
+
+The gluster volume profile command will have to be extended to support
+the new CREATE FOP, if that is how we choose to implement it.
+
+Otherwise, these changes should be largely transparent to the
+management layer.
+
+### Implications on presentation layer
+
+The Gluster-specific Swift-on-File code would have to change to take
+advantage of this feature.
+
+The NFS and SMB servers would have to change to exploit the new
+features, e.g. to reduce SMB-specific xattr and ACL access.
+
+The
+[libgfapi](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/api/src/glfs.h)
+implementation would have to expose these features.
+
+- **glfs\_readdirplus\_r** - it's not clear that struct dirent would
+ be able to handle xattrs, and there is no place to specify which
+ extended attributes we are interested in.
+- **glfs\_creat** - has no parameters to support xattrs or write data,
+  so we'd need a new entry point to do this; a hypothetical sketch
+  follows.
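+
+To make the discussion concrete, one possible shape for such an entry
+point is sketched below. glfs\_creat\_write(), its argument list, and
+the glfs\_kv struct are purely hypothetical - they are not part of
+libgfapi today.
+
+```
+/* Hypothetical extension to glfs.h -- NOT an existing libgfapi call.
+ * It would create the file, set the given xattrs, write the initial
+ * data, and optionally fsync and/or skip returning an open fd. */
+#include <glusterfs/api/glfs.h>
+#include <sys/types.h>
+
+struct glfs_kv {
+        const char *name;    /* xattr name  */
+        const void *value;   /* xattr value */
+        size_t      size;
+};
+
+#define GLFS_CREAT_FSYNC  0x1   /* persist data before returning        */
+#define GLFS_CREAT_CLOSE  0x2   /* caller wants no fd back (auto close) */
+
+glfs_fd_t *glfs_creat_write (glfs_t *fs, const char *path, int flags,
+                             mode_t mode,
+                             const struct glfs_kv *xattrs, int nxattrs,
+                             const void *data, size_t datalen,
+                             int creat_flags);
+```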
+
+### Implications on persistence layer
+
+None.
+
+### Implications on 'GlusterFS' backend
+
+None.
+
+### Modification to GlusterFS metadata
+
+None. We are repackaging how data gets passed in protocol, not what it
+means.
+
+### Implications on 'glusterd'
+
+None.
+
+How To Test
+-----------
+
+We have programs that can generate metadata-intensive workloads, such
+as the smallfile benchmark or fio. For small-file creates, we can use a
+modified version of the [parallel libgfapi
+benchmark](https://github.com/bengland2/parallel-libgfapi) (don't
+worry, I know the developer ;-) to verify that the response time of the
+new create-and-write API is better than before, or to verify that
+lockless-create improves response time.
+
+In the case of the readdirplus extensions, we can test with a simple
+libgfapi program coupled with a protocol trace or gluster volume
+profile output to confirm that it works and delivers the desired
+decrease in response time.
+
+User Experience
+---------------
+
+This feature should be functionally transparent to the end-user, but it
+should significantly improve Gluster performance to the point where
+throughput and response time are reasonably close (not equal) to NFS,
+SMB, etc. on local filesystems. This is particularly true for
+small-file operations and directory browsing/listing.
+
+Dependencies
+------------
+
+This change will have a significant impact on translators; it is not
+easy. Because it is a non-trivial change, an incremental approach
+should be specified and followed, with each stage committed and
+regression-tested separately. For example, we could break the
+CREATE-AND-WRITE proposal into 4 pieces:
+
+- add libgfapi support, with ENOTSUP returned for unimplemented
+  features
+- add the list of xattrs written at create time
+- add write data
+- add close and fsync options
+
+Documentation
+-------------
+
+How do we document RPC protocol changes? For now, I'll try to use the
+IDL .x file, or whatever specifies the RPC itself.
+
+Status
+------
+
+Not designed yet.
+
+Comments and Discussion
+-----------------------
+
+### Jeff Darcy 16:20, 3 December 2014
+
+
+"SMB: case-insensitivity of Windows = no direct lookup by filename on
+brick"
+
+We did actually come up with a way to do the case-preserving and
+case-squashing lookups simultaneously before falling back to the global
+lookup, but AFAIK it's not implemented.
+
+READDIRPLUS extension: md-cache actually does pre-fetch some attributes
+associated with (Linux) ACLs and SELinux. Maybe it just needs to
+pre-fetch some others for SMB? Also, fetching into glusterfs process
+memory doesn't save us the context switch. For that we need dentry
+injection (or something like it) so that the information is available in
+the kernel by the time the user asks for it.
+
+"glfs\_creat - has no parameters to support xattrs"
+
+These are being added already because NSR reconciliation needs them (for
+many other calls already).