glusterfs.git/xlators/cluster/afr/src/afr-transaction.h, branch v7.8

cluster/afr: Allow lookup on root if it is from ADD_REPLICA_MOUNT

2018-12-18T10:30:19+00:00

Problem: When trying to convert a plain distribute volume to replica-3
or arbiter type it is failing with ENOTCONN error as the lookup on
the root will fail as there is no quorum.

Fix: Allow lookup on root if it is coming from the ADD_REPLICA_MOUNT
which is used while adding bricks to a volume. It will try to set the
pending xattrs for the newly added bricks to allow the heal to happen
in the right direction and avoid data loss scenarios.

Note: This fix will solve the problem of type conversion only in the
case where the volume was mounted at least once. The conversion of
non mounted volumes will still fail since the dht selfheal tries to
set the directory layout will fail as they do that with the PID
GF_CLIENT_PID_NO_ROOT_SQUASH set in the frame->root.

Change-Id: Ic511939981dad118cc946754341318b164954b3b
fixes: bz#1655854
Signed-off-by: karthik-us

afr: thin-arbiter 2 domain locking and in-memory state

2018-10-25T12:26:22+00:00

2 domain locking + xattrop for write-txn failures:
--------------------------------------------------
- A post-op wound on TA takes AFR_TA_DOM_NOTIFY range lock and
AFR_TA_DOM_MODIFY full lock, does xattrop on TA and releases
AFR_TA_DOM_MODIFY lock and stores in-memory which brick is bad.

- All further write txn failures are handled based on this in-memory
value without querying the TA.

- When shd heals the files, it does so by requesting full lock on
AFR_TA_DOM_NOTIFY domain. Client uses this as a cue (via upcall),
releases AFR_TA_DOM_NOTIFY range lock and invalidates its in-memory
notion of which brick is bad. The next write txn failure is wound on TA
to again update the in-memory state.

- Any incomplete write txns before the AFR_TA_DOM_NOTIFY upcall release
request is got is completed before the lock is released.

- Any write txns got after the release request are maintained in a ta_waitq.

- After the release is complete, the ta_waitq elements are spliced to a
separate queue which is then processed one by one.

- For fops that come in parallel when the in-memory bad brick is still
unknown, only one is wound to TA on wire. The other ones are maintained
in a ta_onwireq which is then processed after we get the response from
TA.

Change-Id: I32c7b61a61776663601ab0040e2f0767eca1fd64
updates: bz#1579788
Signed-off-by: Ravishankar N 
Signed-off-by: Ashish Pandey

Land clang-format changes

2018-09-12T11:52:48+00:00

Change-Id: I6f5d8140a06f3c1b2d196849299f8d483028d33b

afr: common thin-arbiter functions

2018-08-23T06:37:27+00:00

...that can be used by client and self-heal daemon, namely:

afr_ta_post_op_lock()
afr_ta_post_op_unlock()

Note: These are not yet consumed. They will be used in the write txn
changes patch which will introduce 2 domain locking.

updates: bz#1579788
Change-Id: I636d50f8fde00736665060e8f9ee4510d5f38795
Signed-off-by: Ravishankar N

afr: initial changes for thin arbiter

2018-04-30T06:41:11+00:00

1. Create thin arbiter index file during mount.
2. Set pending marker in thin arbiter id file in case of failure.

Change-Id: I269eb8d069f0323f1fc616175e5e5eb7b91d5f82
updates: #352
Signed-off-by: Ravishankar N

afr: add new value for read-hash-mode volume option

2018-03-29T07:37:04+00:00

Updates: #363

This new value (3) will try to wind read requests to the child of AFR
having the least amount of pending requests in its queue.

Change-Id: If6bda2aac9bf7aec3fc39622f78659313c4b6508
Signed-off-by: Ravishankar N

cluster/afr: Make AFR eager-locking similar to EC

2018-03-14T13:32:35+00:00

Problem:
1) Afr's eager-lock only works for data transactions.
2) When there are conflicting writes, write with conflicting region initiates
unlock of eager-lock leading to extra pre-ops and post-ops on the file. When
eager-lock goes off, it leads to extra fsyncs for random-write workload in afr.

Solution (that is modeled after EC):
In EC, when there is a conflicting write, it waits for the current write to
complete before it winds the conflicted write. This leads to better utilization
of network and disk, because we will not be doing extra xattrops and FSYNCs and
inodelk/unlock. Moved fd based counters to inode based counters.

I tried to model the solution based on EC's locking, but it is not similar to
AFR because we had to keep backward compatibility.

Lifecycle of lock:
==================
First transaction is added to inode->owners list and an inodelk will be sent on
the wire. All the next transactions will be put in inode->waiters list until
the first transaction completes inodelk and [f]xattrop completely.  Once
[f]xattrop also completes, all the requests in the inode->waiters list are
checked if it conflict with any of the existing locks which are in
inode->owners list and if not are added to inode->owners list and resumed with
doing transaction. When these transactions complete fop phase they will be
moved to inode->post_op list and resume the transactions that were paused
because of conflicts. Post-op and unlock will not be issued on the wire until
that is the last transaction on that inode. Last transaction when it has to
perform post-op can choose to sleep for deyed-post-op-secs value. During that
time if any other transaction comes, it will wake up the sleeping transaction
and takes over the ownership of the lock and the cycle continues. If the
dealyed-post-op-secs expire, then the timer thread will wakeup the sleeping
transaction and it will set lock->release to true and starts doing post-op and
then unlock. During this time if any other transactions come, they will be put
in inode->frozen list. Once the previous unlock comes it will move the frozen
list to waiters list and moves the first element from this waiters-list to
owners-list and attempts the lock and the cycle continues. This is the general
idea.  There is logic at the time of dealying and at the time of new
transaction or in flush fop to wakeup existing sleeping transactions or
choosing whether to delay a transaction etc, which is subjected to change based
on future enhancements etc.

Fixes: #418
BUG: 1549606
Change-Id: I88b570bbcf332a27c82d2767dfa82472f60055dc
Signed-off-by: Pranith Kumar K

cluster/afr: Remove unused code paths

2018-03-06T08:49:31+00:00

Removed
1) afr-v1 self-heal locks related code which is not used anymore
2) transaction has some data types that are not needed, so removed them
3) Never used lock tracing available in afr as gluster's network tracing does
the job. So removed that as well.
4) Changelog is always enabled and afr is always used with locks, so
__changelog_enabled, afr_lock_server_count etc functions can be deleted.
5) transaction.fop/done/resume always call the same functions, so no need
to have these variables.

BUG: 1549606
Change-Id: I370c146fec2892d40e674d232a5d7256e003c7f1
Signed-off-by: Pranith Kumar K

cluster/afr: Remove compound-fops usage in afr

2018-03-06T07:55:25+00:00

We are not seeing much improvement with this change. So removing the
feature so that it doesn't need to be maintained anymore.

Fixes: #414
Change-Id: Ic7969b151544daf2547bd262a9fa03f575626411
Signed-off-by: Pranith Kumar K

afr: propagate correct errno for fop failures in arbiter

2017-05-15T12:18:32+00:00

Problem:
If quorum is not met in fop cbk, arbiter sends an ENOTCONN error to the
upper xlators. In a VM workload with sharding enabled, this was leading
to the VM pausing when replace-brick was performed as described in the BZ.

Fix:
Move the fop cbk arbitration logic to afr_handle_quorum() because in
normal replica volumes, that is the function that has the quorum and
errno checks in the fop cbk path before doing a post-op.

Thanks to Pranith for suggesting this approach.

Change-Id: Ie6315db30c5e36326b71b90a01da824109e86796
BUG: 1449610
Signed-off-by: Ravishankar N 
Reviewed-on: https://review.gluster.org/17235
Smoke: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System