in_progress/composite-operations.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438

Feature
-------

Composite operations is a term describing elimination of round trips
through a variety of techniques. Some of these techniques are borrowed
from NFS and SMB protocols in spirit at least.

Why do we need this? All too frequently we encounter situations where
Gluster performance is an order of magnitude or even two orders of
magnitude slower than NFS or SMB to a local filesystem, particularly for
small-file and metadata-intensive workloads (example: file browsing).
You can argue that Gluster provides more functionality, so it should be
slower, but we need to close the gap -- if Gluster was half the speed of
NFS and provided much greater functionality plus scalability, users
would be ok with some performance tradeoff.

What is the root cause? Response time of Gluster APIs is much higher
than response time of other protocols. A simple protocol trace can show
you a root cause for this: excessive round-trips.

There are two dimensions to this:

-   operations that require lookups on every brick (covered elsewhere)
-   excessive one-at-a-time access to xattrs and ACLs
-   client responsible for maintaining filesystem state instead of
    server
-   SMB: case-insensitivity of Windows = no direct lookup by filename on
    brick

Summary
-------

example of previous success: eager-lock. When Gluster was first acquired
by Red Hat and testing with 10-GbE interfaces began, we quickly noticed
that sequential write performance was not what we expected. The Gluster
protocol required a 5-step sequence for every write from client to
server(s), in order to maintain consistency between replicas, which is
loosely paraphrased here:

-   lock-replica-inode
-   pre-op (mark replicas dirty)
-   write
-   post-op
-   unlock-replica-inode

The **cluster.eager-lock** feature was added to Gluster (3.4?) to allow
the client to hang onto the lock, and we combined post-op for previous
write with pre-op for current write and actual write request so that
instead of 5 RPCs per write we got down to ONE RPC per write, and write
performance improved significantly (how much TBS)

Owners
------

TBS

Current status
--------------

Some of the problems with round trips stem from lack of scalability in
DHT protocol, and attributes of AFR protocol.

Related Feature Requests and Bugs
---------------------------------

-   [Features/Smallfile Perf](../GlusterFS 3.7/Small File Performance.md) - small-file performance enhancement menu
-   [Features/dht-scalability](./dht-scalability.md) - new, scalable DHT
-   [Features/new-style-replication](..GlusterFS 3.6/New Style Replication.md)- client no longer does replication

*Note : search RHS buglist for small-file-related performance bugs and directory browsing performance bugs, I haven't done that yet, there are a LOT of them*

Detailed Description
--------------------

Here are the proposals:

-   READDIRPLUS generalization
-   lockless-CREATE
-   CREATE-AND-WRITE - allow CREATE op to transmit data and metadata
    also
-   case-insensitivity feature - removes perf. penalty for SMB

### READDIRPLUS used to prefetch xattrs

recent correction: For SMB and other protocols that have additional
security metadata, READDIRPLUS can be used more effectively to prefetch
xattr data, such as ACLs and Windows-specific security info. However,
upper layers have to make use of this feature. We treat ACLs as a
special case of an extended attribute, since ACLs are not currently
returned by READDIRPLUS (can someone confirm this?). The current RPC
request and response structure are in
[gfs3\_readdirp\_{req,rsp}](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/rpc/xdr/src/glusterfs3-xdr.x)
in the above source code URL. In both cases, the request structure field
"dict" can contain a list of extended attribute IDs (or names, not sure
which).

However, once these xattrs are prefetched, will md-cache translator in
the client be able to hang onto them to prevent round-trips to the
server? Is there any additional invalidation needed for the expanded
role of md-cache?

### eager-lock for directories

This extension doesn't seem to impact APIs at all, but it does require a
way to safely do a CREATE FOP that will either appear on all replicas or
none (or allow self-healing to repair the difference in the directories
in the correct way).

If we have an NSR translator, this seems pretty straightforward. NSR
only allows the client to talk to the "leader" server in the replica
host set, and the leader then takes responsibility for propagating the
change.

With AFR, the situation is very different. In order to guarantee that a
CREATE will succeed on all AFR subvolumes, the client must write-lock
the parent directory. Otherwise some other client could create the same
file at the same time on some but not all of the AFR subvolumes.

But why unlock? Chances are that any immediate subsequent file create in
that directory will be coming from the same client, so it makes sense
for the client to hang onto the write lock for a short while, unless
some other client wants it. This optimistic lock behavior is similar to
the "eager-lock" feature in the AFR translator today. Doing this saves
us not only the need to do a LOOKUP prior to CREATE, but also saves us
the need to do a directory unlock per file!

### CREATE-AND-WRITE

This extension is similar to quick-read, where the OPEN FOP can return
the file data if it's small enough. This extension adds the following
features to the CREATE FOP:

-   -   optionally specify xattrs to associate with file when it's
        created
    -   optionally specify write data (if it fits in 1 RPC)
    -   optionally close the file (what RELEASE does today)
    -   optionally fsync the file (for apps that require file
        persistence such as Swift)

This option is also similar to what librados (Ceph) API allows user to
do today, see [Ioctx.write\_full in librados python
binding](http://ceph.com/docs/master/rados/api/python/#writing-reading-and-removing-objects)

This avoids the need for the round-trip sequence:

-   lock inode for write
-   create
-   write
-   flush(directory)
-   set-xattr[1]
-   set-xattr[2]
-   ...
-   set-xattr[N]
-   release
-   unlock inode

The existing protocol structure is in [structure
gfs3\_create\_req](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/rpc/xdr/src/glusterfs3-xdr.x)
. We would allocate reserved bits from the "flags" field for the
optional extensions. The xdata field in the request would contain a
tagged sequence containing the optional parameter values.

### case-insensitive volume support

The SMB protocol is bridging a divide between an operating system,
Windows, that supports case-insensitive lookup, and an operating system,
Linux (POSIX) that supports only case-sensitive lookup inside Gluster
bricks. If nothing is done to bridge this gap, file lookup and creation
becomes very expensive in large directories (a few thousand files in
size):

-   on CREATE, the client has to search the entire directory to
    determine whether some other file with the same name (but a
    different case mix) already exists. This requires locking the
    directory. Furthermore, consistent hashing, which knows nothing
    about case mix, can not predict which brick might contain the file,
    since it might have been created with a different case mix. This is
    a SCALABILITY issue.

-   on LOOKUP, the client has to search all bricks for the filename
    since there is in general no way to predict which brick the
    case-altered version of the filename might have hashed to. This is a
    SCALABILITY issue. The entire contents of the directory on each
    brick must be searched as well.

-   SMB does support "case-sensitive yes" smb.conf configuration option,
    but this is user-hostile since Windows does not understand it.

What happens when Linux user-mode process such as glusterfsd (brick)
tries to do a case-insensitive lookup on the filename using a local
filesystem? XFS has a feature for this, but Gluster can't assume XFS as
for VFS supporting case-insensitivity - it's not going to happen. Yes
you can do readdir on directory and scan for the case-insensitive match,
but it's O(N\^2) where N is number of files you place into a directory.

**Proposal**: only use lower-case filenames (or upper-case, it doesn't
matter) at the brick filesystem, and record the original case mix (how
the user specified the filename at create/rename time) in an xattr, call
it 'original-case'.

**Issue**: (from Ira Cooper): what locales would be supported? SMB
already had to deal with this.

We could define a 'case-insensitive' volume parameter (default off), so
that users who have no SMB clients do not experience this change in
behavior.

This mapping to lower-case filenames has to happen at or above DHT layer
to avoid the scalability issue above. If this is not done by DHT (if it
is done in VFS-gluster SMB plugin for example), then Gluster clients on
a POSIX filesystem will not see the same filenames as Windows users, and
this will lead to confusion.

However, this has consequences for sharing file between SMB and non-SMB
client - non-SMB client will pay performance penalty for
case-insensitivity and will see case-insensitive behavior that is not
strictly POSIX-compliant - for example if I create file "a" and then
file "A" in same directory, the 2nd create will get EEXIST. That's the
price you pay for having the two kinds of clients accessing the same
volume - the most restricted client has to win.

Changes required to DHT or equivalent:

-   READDIR(PLUS): report filenames as the user expects to see them,
    using the original-case xattr. see above READDIRPLUS enhancement for
    how this can be done efficiently.
-   CREATE (or RENAME):, map the filename within the brick to lower case
    before creating, and records the original case mix using the
    original-case xattr. See CREATE-AND-WRITE enhancement above for how
    this can be done efficiently.
-   LOOKUP: map the filename to lower case before attempting a lookup on
    the brick.
-   RENAME: To prevent loss of file during a client-side crash, first
    delete the case-mix xattr, then do the rename, then re-add the
    case-mix xattr. If the case-mix xattr is not present, then the
    lower-case filename is returned by READDIR(PLUS) but the file is not
    lost.

Since existing SMB users may want to take advantage of this change, we
need a process for converting a Gluster volume to support
case-insensitivity:

-   optional - use "find /your/brick/directory -not -type d -a -not
    -path '/your/brick/directory/.glusterfs/\*' | tr '[A-Z]' '[a-z]' |
    sort " command in parallel on every brick, and do sort -merge of
    per-brick outputs followed by "uniq -d" to quickly determine if
    there are case-insensitivity collisions on existing volume. This
    would let user resolve such conflicts ahead of time without taking
    down the volume.
-   shut down the volume
-   run a script on all bricks in parallel to convert it to
    case-insensitive format - very fast because it runs on a local fs.
    -   rename the brick file to lower case and store an xattr with
        original case.
-   turn volume lookup-unhashed to ON because files will not yet be on
    the right brick.
-   set volume into case-insensitive state
-   start volume - it is now online but not in efficient state
-   rebalance (get DHT to place the files where they belong)
    -   If rebalance uncovers case-insensitive filename collisions (very
        unlikely), the 2nd file is renamed to its original case-mix with
        string 'case-collision-gfid' + hex gfid appended, and a counter
        is incremented. A simple "find" command at each brick in
        parallel executed with pdsh can locate all instances of such
        files - the user then has to decide what they want to do with
        them.
-   reset lookup-unhashed to default (auto)

Benefit to GlusterFS
--------------------

-   READDIRPLUS optimizations could completely solve the performance
    problems with file browsing in large directories, at least to the
    point where Gluster performs similarly to NFS and SMB in general and
    can't be blamed. (DHT v2 could also improve performance by not
    requiring round trips to every brick to retrieve a directory).

-   lockless-CREATE - can improve small-file create performance
    significantly by condensing 4 round-trips into 1. Small-file create
    is the worst-performing feature in Gluster today. However, it won't
    solve small-file create problems until we address other areas below.

-   CREATE-AND-WRITE - as you can see, at least 6 round trips (maybe
    more) are combined into 1 round trip.

The performance benefit increases as the Gluster client round-trip time
to the servers increases. For example, these enhancements could make
possible use of Gluster protocol over a WAN.

Scope
-----

Still unsure. This impacts libgfapi - if we want applications to take
advantage of these enhancements, we need to expose these APIs to
applications somehow, and POSIX does not allow them AFAIK.

CREATE-AND-WRITE impacts the translator interface. Translators must be
able to pass down:

-   a list of xattr values (which translators in the stack can append
    to).
-   a data buffer
-   flags to request optionally that file be fsynced and/or closed.

The [fop\_create\_t
params](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/libglusterfs/src/xlator.h)
have both a "flags" parameter and a "xdata" parameter; this last
parameter could be used to pass both data and xattrs in a tagged
sequence format (not sure whether **dict\_t** supports this).

### Nature of proposed change

The Gluster code might need refactoring in create-related code to
maximize common code between existing implementation, which won't go
away, and the new implementation of these FOPS.

However, I suspect that READDIRPLUS extensions may be possible to insert
without disrupting existing code that much, may need some help on this
one.

### Implications on manageability

The gluster volume profile command will have to be extended to get
support for the new CREATE FOP if this is how we choose to implement it.

These changes should be somewhat transparent to management layer
otherwise.

### Implications on presentation layer

Swift-on-file Gluster-specific code would have to change to take
advantage of this feature.

NFS and SMB would have to change to exploit new features to reduce
SMB-specific xattr and ACL access.

The
[libgfapi](https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/api/src/glfs.h)
implementation would have to expose these features.

-   **glfs\_readdirplus\_r** - it's not clear that struct dirent would
    be able to handle xattrs, and there is no place to specify which
    extended attributes we are interested in.
-   **glfs\_creat** - has no parameters to support xattrs or write data.
    So we'd need a new entry point to do this.

### Implications on persistence layer

None.

### Implications on 'GlusterFS' backend

None.

### Modification to GlusterFS metadata

None. We are repackaging how data gets passed in protocol, not what it
means.

### Implications on 'glusterd'

None.

How To Test
-----------

We have programs that can generate metadata-intensive workloads, such as
smallfile benchmark or fio. For smallfile creates, we can use a modified
version of the [parallel libgfapi
benchmark](https://github.com/bengland2/parallel-libgfapi) (don't worry,
I know the developer ;-) to verify that the response time for the new
create-and-write API is better than before, or to verify that
lockless-create improves response time.

In the case of readdirplus extensions, we can test with simple libgfapi
program coupled with a protocol trace or gluster volume profile output
to see if it's working and has desired decrease in response time.

User Experience
---------------

The impact of this operation should be functionally transparent to the
end-user, but it should significantly improve Gluster performance to the
point where throughput and response time are reasonably close (not
equal) to NFS, SMB, etc on local filesystems. This is particularly true
for small-file operations and directory browsing/listing.

Dependencies
------------

This change will have significant impact on translators, it is not easy.
Because this is a non-trivial change, an incremental approach should be
specified and followed, with each stage committed and regression tested
separately. For example, we could break CREATE-and-WRITE proposal into 4
pieces:

-   add libgfapi support, with ENOSUPPORT returned for unimplemented
    features
-   add list of xattrs written at create time.
-   add write data
-   add close and fsync options

Documentation
-------------

How do we document RPC protocol changes? For now, I'll try to use IDL .x
file or whatever specifies the RPC itself.

Status
------

Not designed yet.

Comments and Discussion
-----------------------

### Jeff Darcy 16:20, 3 December 2014

Talk:Features/composite-operations

"SMB: case-insensitivity of Windows = no direct lookup by filename on
brick"

We did actually come up with a way to do the case-preserving and
case-squashing lookups simultaneously before falling back to the global
lookup, but AFAIK it's not implemented.

READDIRPLUS extension: md-cache actually does pre-fetch some attributes
associated with (Linux) ACLs and SELinux. Maybe it just needs to
pre-fetch some others for SMB? Also, fetching into glusterfs process
memory doesn't save us the context switch. For that we need dentry
injection (or something like it) so that the information is available in
the kernel by the time the user asks for it.

"glfs\_creat - has no parameters to support xattrs"

These are being added already because NSR reconciliation needs them (for
many other calls already).