glusterfs-quota.git/xlators/cluster/afr, branch test

storage/posix: implement batched fsync in a single thread

2013-07-23T13:11:12+00:00

The extra fsync()s issued by AFR transactions could potentially
"clog" all the io-threads, preventing unrelated operations from
making progress.

This patch assigns a dedicated thread to issue fsyncs, as
an experimental feature to understand the performance
characteristics of the approach.

As a basis, incoming individual fsync requests that fall within the
same @batch-fsync-delay-usec window of time are grouped into one
batch. In practice these windows can extend, because processing
the previous batch can take longer than @batch-fsync-delay-usec
while new requests keep getting batched.
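
The window-extension effect can be illustrated with a small simulation.
This is only a sketch under an assumed model of the fsync thread (idle
until a request arrives, wait one delay window, take everything queued,
then spend a fixed time processing); the function name, parameters and
the fixed processing time are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assign each request to a batch under an ASSUMED model of the
 * dedicated fsync thread. arrival_usec must be sorted ascending.
 * Returns the number of batches; batch_of[i] is the batch of request i.
 */
size_t assign_batches(const long *arrival_usec, size_t n,
                      long delay_usec, long process_usec,
                      size_t *batch_of)
{
    size_t next = 0;   /* first request not yet batched */
    size_t batch = 0;  /* index of the batch being formed */
    long now = 0;      /* simulated clock of the fsync thread */

    assert(delay_usec >= 0 && process_usec >= 0);

    while (next < n) {
        /* The thread idles until at least one request is queued. */
        if (arrival_usec[next] > now)
            now = arrival_usec[next];

        /* Wait one delay window so that more requests can join. */
        now += delay_usec;

        /* Everything that arrived by now goes into this batch. */
        while (next < n && arrival_usec[next] <= now) {
            assert(next == 0 ||
                   arrival_usec[next - 1] <= arrival_usec[next]);
            batch_of[next++] = batch;
        }

        /* Processing takes time; later arrivals pile up meanwhile. */
        now += process_usec;
        batch++;
    }
    return batch;
}
```

With a 200 usec delay and 1000 usec of processing per batch, requests
arriving at 250 and 900 usec land in the same batch even though they
are more than one window apart, because the thread was still busy with
the previous batch.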

The feature supports four modes (similar to the -S modes of fs_mark):

- syncfs: In this mode one syncfs() is issued per batch, instead
  of N fsync()s (one per file.)

- syncfs-single-fsync: In this mode one syncfs() is issued per
  batch (which, on Linux, guarantees the completion of write-out
  of dirty pages in the filesystem up to that point) and one single
  fsync() to synchronize or flush the controller/drive cache. This
  corresponds to -S 2 of fs_mark.

- syncfs-reverse-fsync: In this mode, one syncfs() is issued per
  batch, and all the open files in that batch are fsync()'ed in
  the reverse order of the queue. This corresponds to -S 4 of
  fs_mark.

- reverse-fsync: In this mode, no syncfs() is issued and all the
  files in the batch are fsync()'ed in the reverse order. This
  corresponds to -S 3 of fs_mark.
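
The modes listed above can be summarized as the sequence of calls made
for one batch. The sketch below only builds that sequence as a plan
and does not call syncfs() or fsync() itself. The enum, the function
name and the plan encoding are hypothetical. Two details are
assumptions not stated in the commit message: that syncfs() comes
before the fsync()s, and that the single fsync() in syncfs-single-fsync
targets the last file queued.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the modes described in the commit message. */
enum batch_mode {
    MODE_SYNCFS,
    MODE_SYNCFS_SINGLE_FSYNC,
    MODE_SYNCFS_REVERSE_FSYNC,
    MODE_REVERSE_FSYNC
};

/* Plan entry meaning "call syncfs() once"; entries >= 0 are queue
 * positions of files to fsync(). */
#define OP_SYNCFS (-1)

/*
 * Fill plan[] with the operations for one batch of nfiles queued
 * files and return how many operations were written.
 */
size_t plan_batch(enum batch_mode mode, size_t nfiles,
                  int *plan, size_t cap)
{
    size_t n = 0;

    assert(plan != NULL && cap >= nfiles + 1);
    if (nfiles == 0)
        return 0;

    /* Every mode except reverse-fsync issues one syncfs() per batch.
     * ASSUMPTION: it is issued before any fsync(). */
    if (mode != MODE_REVERSE_FSYNC)
        plan[n++] = OP_SYNCFS;

    /* ASSUMPTION: the single fsync() goes to the last queued file. */
    if (mode == MODE_SYNCFS_SINGLE_FSYNC)
        plan[n++] = (int)(nfiles - 1);

    /* The reverse modes fsync() every file, newest first. */
    if (mode == MODE_SYNCFS_REVERSE_FSYNC || mode == MODE_REVERSE_FSYNC) {
        for (size_t i = nfiles; i > 0; i--)
            plan[n++] = (int)(i - 1);
    }
    return n;
}
```

For a batch of three files, syncfs yields one operation,
syncfs-single-fsync two, syncfs-reverse-fsync four and reverse-fsync
three, which makes the cost difference between the modes easy to see.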

Change-Id: Ia1e170a810c780c8d80e02cf910accc4170c4cd4
BUG: 927146
Signed-off-by: Anand Avati 
Reviewed-on: http://review.gluster.org/4746
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/afr: Handle parallel hardlinks self-heal

2013-07-23T07:33:39+00:00

Change-Id: Ieda11870c65edae500140b6c061f15a7b3f264f3
BUG: 986905
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/5370
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

afr: customize client-pid=-1 xtime aggregation to tolerate a replica down

2013-07-15T08:24:19+00:00

Using the new 'pluggable policies' API of libxlator.

Change-Id: Ie7528182dff8fb42c6e8287a106d3057944df775
BUG: 847839
Original Author: Csaba Henk 
Signed-off-by: Avra Sengupta 
Reviewed-on: http://review.gluster.org/4904
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

libxlator: implement pluggable aggregation policies

2013-07-15T08:23:53+00:00

The API is described in libxlator.h.

Behavior remains the same for this commit; this
is a preparatory step for per-translator customization
of aggregation.

Change-Id: I5d42923af59b2fd78e1ff59c12763875b57c5190
BUG: 847839
Original Author: Csaba Henk 
Signed-off-by: Avra Sengupta 
Reviewed-on: http://review.gluster.org/4903
Tested-by: Gluster Build System 
Reviewed-by: Amar Tumballi 
Reviewed-by: Vijay Bellur

cluster/*: get logic to calculate min() of the 'stime' xattr

2013-07-15T04:06:42+00:00

* In both distribute and replicate (ignoring stripe for now),
  add logic to calculate the min() of stime values.

* What is 'stime', and why is this required?
  - stime means 'slave xtime'. It is mainly used to keep track of a
    slave node's sync status when distributed geo-replication is used.
    Calculating the min() of this stime is very important: after a
    crash, reboot or shutdown we have to restart crawling from the
    stime value seen from the mount point. Because that value is the
    min() across all the bricks, we don't miss syncing any files in
    those cases.
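
As an illustration of the min() aggregation described above, here is a
minimal sketch. The struct layout (seconds plus nanoseconds) and the
function names are assumptions made for the example; they are not the
on-disk xattr format or the code added by this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL in-memory form of one brick's stime value. */
struct stime {
    uint32_t sec;
    uint32_t nsec;
};

/* Order two stime values: seconds first, then nanoseconds. */
static int stime_cmp(const struct stime *a, const struct stime *b)
{
    if (a->sec != b->sec)
        return a->sec < b->sec ? -1 : 1;
    if (a->nsec != b->nsec)
        return a->nsec < b->nsec ? -1 : 1;
    return 0;
}

/*
 * Return the smallest stime among the per-brick values. Restarting
 * the crawl from this value cannot skip a file that some brick has
 * not yet reported as synced.
 */
struct stime stime_min(const struct stime *per_brick, size_t n)
{
    struct stime min;

    assert(per_brick != NULL && n > 0);
    min = per_brick[0];
    for (size_t i = 1; i < n; i++) {
        if (stime_cmp(&per_brick[i], &min) < 0)
            min = per_brick[i];
    }
    return min;
}
```

Taking the minimum is the conservative choice: some files may be
crawled again after a restart, but none are missed.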

Change-Id: I2be8d434326572be9d4986db665570a6181db1ee
BUG: 847839
Original Author: Amar Tumballi 
Signed-off-by: Avra Sengupta 
Reviewed-on: http://review.gluster.org/4893
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

afr : change the log level in lookup path to minimize incessant logging.

2013-07-09T06:26:23+00:00

Change the logging levels from WARNING to DEBUG in the lookup path to
minimize incessant logging in case of gfid mismatch errors.

Change-Id: I631b16df3249cf826606f547531f985dac696088
BUG: 959083
Signed-off-by: Ravishankar N 
Reviewed-on: http://review.gluster.org/4939
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

cluster/afr: Let two data-self-heals compete in new domain

2013-07-04T04:35:30+00:00

Problem:
At the moment data-self-heal acquires locks in the following
pattern. It takes a full file lock, then gets the xattrs of the file
on both replicas and decides sources/sinks based on the xattrs. It
then acquires a lock on 0-128k and unlocks the full file lock. It
syncs the 0-128k range from source to sink, acquires a lock on
128k+1 till 256k, unlocks 0-128k, syncs the 128k+1 till 256k block,
and so on. Finally it takes the full file lock again and unlocks the
final small range block. It decrements the pending counts and then
unlocks the full file lock.

This pattern of locks is chosen to prevent more than one self-heal
from being in progress. But if another self-heal tries to take a full
file lock while a self-heal is already in progress, it is put in the
blocked queue. Further inodelks from writes by the application are
also put in the blocked queue because of the way the locks xlator
grants inodelks. So writes are blocked until the self-heal is complete.

Here is the code:
xlators/features/locks/src/inodelk.c - line 225
if (__blocked_lock_conflict (dom, lock) && !(__owner_has_lock (dom, lock))) {
         ret = -EAGAIN;
         if (can_block == 0)
                 goto out;

         gettimeofday (&lock->blkd_time, NULL);
         list_add_tail (&lock->blocked_locks, &dom->blocked_inodelks);
}

This leads to hangs in applications.

Fix:
Since we want to prevent two parallel self-heals, we let them compete
in a separate "domain". Let's call the domain on which the locks were
taken in the previous approach the "data-domain".

In the new approach, when a self-heal is triggered:
- it acquires a full file lock in the new domain "self-heal-domain";
- it then performs data-self-heal using the locks in "data-domain"
  as before;
- finally it unlocks the full file lock in "self-heal-domain".

With this approach, the application's writevs don't have to wait
in the pending queue when more than one self-heal is triggered.
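
The effect of the separate domain can be shown with a toy model of the
blocking rule. This is a simplified sketch, not the locks xlator: the
struct, the function names and FULL_FILE_END are hypothetical, and only
the rule "a new lock waits behind a conflicting blocked lock unless its
owner already holds a lock in that domain" is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FULL_FILE_END (-1L) /* hypothetical: lock extends to end of file */

struct range_lock {
    const char *domain;
    int owner;
    long start;
    long end;     /* exclusive, or FULL_FILE_END */
    bool blocked; /* true if queued, false if granted */
};

/* Locks can only conflict inside one domain, on overlapping ranges. */
static bool same_domain_overlap(const struct range_lock *a,
                                const struct range_lock *b)
{
    if (strcmp(a->domain, b->domain) != 0)
        return false;
    bool a_before_b = a->end != FULL_FILE_END && a->end <= b->start;
    bool b_before_a = b->end != FULL_FILE_END && b->end <= a->start;
    return !a_before_b && !b_before_a;
}

static bool owner_has_granted_lock(const struct range_lock *table, size_t n,
                                   const struct range_lock *req)
{
    for (size_t i = 0; i < n; i++) {
        if (!table[i].blocked && table[i].owner == req->owner &&
            strcmp(table[i].domain, req->domain) == 0)
            return true;
    }
    return false;
}

/* Simplified rule: would req have to wait given the current table? */
bool must_wait(const struct range_lock *table, size_t n,
               const struct range_lock *req)
{
    assert(table != NULL || n == 0);
    bool owner_holds = owner_has_granted_lock(table, n, req);

    for (size_t i = 0; i < n; i++) {
        const struct range_lock *l = &table[i];
        if (l->owner == req->owner || !same_domain_overlap(l, req))
            continue;
        if (!l->blocked)
            return true; /* conflicts with a granted lock */
        if (!owner_holds)
            return true; /* queues behind an earlier blocked lock */
    }
    return false;
}
```

In the old scheme the second self-heal's blocked full file lock sits in
"data-domain", so an application write to an unrelated range must wait
behind it. In the new scheme that blocked lock sits in
"self-heal-domain", and the same write only conflicts with the range
currently being healed.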

Change-Id: Id79aef3dfa888945977fb9758374ac41c320d0d5
BUG: 967717
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/5100
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/afr: Refactor inodelk to handle multiple domains

2013-07-04T04:28:11+00:00

- afr_local_copy should not memdup locked nodes: that would mean
  self-heal appears to hold the lock on those nodes even before
  it actually takes the lock. So the memdup code is removed. The
  entry lock related copying (lockee info) is also not necessary
  for self-heal functionality, so that is removed as well. Since
  it is not a local_copy anymore, its name is changed.

- My editor changed tabs to spaces.

Change-Id: I8dfb92cb8338e9a967c06907a8e29a8404782d61
BUG: 967717
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/5099
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/afr: Provide an option to disable afr durability

2013-07-03T14:33:13+00:00

Change-Id: I40eec20ca6b3f857245a2438883822e251077ee9
BUG: 979365
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/5269
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/afr: post-op should complete before starting flush

2013-07-03T07:40:08+00:00

Problem:
At the moment afr-flush makes sure that a delayed post-op
is woken up but it does not wait for it to complete the
post-op before flush unwinds.
These are the steps that are happening:
1) flush fop comes on an fd which wakes up a delayed post-op
and continues with the flush fop.
2) post-op sends fsync on the wire.
3) flush completes and unwinds to fuse.
4) graph switch happens on the fuse mount disconnecting the
old graph's client connections to bricks.
5) xattrop after fsync fails with ENOTCONN because the
connections from old graph are taken down now.

Fix:
Wait for post-op to complete before starting to flush.
We could make flush act like fsync (i.e. wind flush as is,
but wait for post-op to complete before unwinding flush),
but it is better to send flush as the final fop. So the
wind of flush starts only after post-op is complete. fsync
had to be changed to accommodate this.
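
The ordering enforced by the fix described above can be sketched as a
small state machine. All names here are hypothetical and the model is
much simpler than AFR's real fop/callback machinery; it only shows
flush being parked until the post-op (fsync followed by xattrop) has
completed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum step { STEP_FSYNC, STEP_XATTROP, STEP_FLUSH };

/* HYPOTHETICAL per-fd state; trace[] records the order of fops sent. */
struct fd_ctx {
    bool delayed_post_op;   /* a post-op is parked on this fd */
    bool post_op_in_flight; /* post-op was woken but has not finished */
    bool flush_waiting;     /* a flush is parked behind the post-op */
    enum step trace[8];
    size_t ntrace;
};

static void record(struct fd_ctx *fd, enum step s)
{
    assert(fd->ntrace < sizeof(fd->trace) / sizeof(fd->trace[0]));
    fd->trace[fd->ntrace++] = s;
}

/* A flush arrives: wake any delayed post-op, but do not wind flush
 * until that post-op has completed. */
void flush_request(struct fd_ctx *fd)
{
    if (fd->delayed_post_op) {
        fd->delayed_post_op = false;
        fd->post_op_in_flight = true;
        record(fd, STEP_FSYNC); /* post-op starts with fsync */
    }
    if (fd->post_op_in_flight) {
        fd->flush_waiting = true; /* park the flush */
        return;
    }
    record(fd, STEP_FLUSH);
}

/* The post-op finishes: its xattrop is sent, then a parked flush is
 * finally wound as the last fop. */
void post_op_complete(struct fd_ctx *fd)
{
    record(fd, STEP_XATTROP);
    fd->post_op_in_flight = false;
    if (fd->flush_waiting) {
        fd->flush_waiting = false;
        record(fd, STEP_FLUSH);
    }
}
```

Because flush is the last fop recorded, a graph switch that follows the
flush can no longer tear down connections that the xattrop still needs.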

Change-Id: I93aa642647751969511718b0e137afbd067b388a
BUG: 980548
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/5274
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur