glusterfs.git/xlators/cluster/afr/src/afr-read-txn.c, branch v7.2

cluster/ta: Notify the clients only if there are pending heals

2019-07-24T11:01:54+00:00

Problem:
In case of thin arbiter, before index healer starts crawling the
indices at every heal-timeout interval, even if there is nothing to
be healed it will send an upcall notification to all the clients to
release any AFR_TA_DOM_NOTIFY locks that they hold. SHD will wait
for the upcall to return before proceeding with the heal even though
there is nothing to be healed. This will also invalidates the cached
information about the bricks states on the clients which leads to
extra calls on TA from clients for the next reads & writes if needed.
This will impact the IO performance.

Fix:
- Before sending the upcall to the clients, check for any pending heals
on TA without taking  any locks.
- If there is nothing marked bad on TA, then continue with the index
crawl to heal any dirty markings present on the files due to any post-op
failure.
- If there is a brick marked as bad on TA, then take the
AFR_TA_DOM_NOTIFY lock on TA from SHD, get the state on TA and
continue with the current healing process.

Change-Id: Ieb477bc6cb18bbdfd4e7a0453c5ed79b574ec9d6
fixes: bz#1729483
Signed-off-by: karthik-us

afr: thin-arbiter read txn fixes

2019-03-29T08:35:36+00:00

- Fixes afr_ta_read_txn() to handle inode refresh failures.
code-path.
- Fixes a double free issue of dict.

Note: This patch address post-merge review comments for commit
69532c141be160b3fea03c1579ae4ac13018dcdf

fixes: bz#1686398
Change-Id: Id5299b45b68569d47df6b73755918237a1592cb4
Signed-off-by: Ravishankar N

cluster/afr: Allow lookup on root if it is from ADD_REPLICA_MOUNT

2018-12-18T10:30:19+00:00

Problem: When trying to convert a plain distribute volume to replica-3
or arbiter type it is failing with ENOTCONN error as the lookup on
the root will fail as there is no quorum.

Fix: Allow lookup on root if it is coming from the ADD_REPLICA_MOUNT
which is used while adding bricks to a volume. It will try to set the
pending xattrs for the newly added bricks to allow the heal to happen
in the right direction and avoid data loss scenarios.

Note: This fix will solve the problem of type conversion only in the
case where the volume was mounted at least once. The conversion of
non mounted volumes will still fail since the dht selfheal tries to
set the directory layout will fail as they do that with the PID
GF_CLIENT_PID_NO_ROOT_SQUASH set in the frame->root.

Change-Id: Ic511939981dad118cc946754341318b164954b3b
fixes: bz#1655854
Signed-off-by: karthik-us

afr: thin-arbiter 2 domain locking and in-memory state

2018-10-25T12:26:22+00:00

2 domain locking + xattrop for write-txn failures:
--------------------------------------------------
- A post-op wound on TA takes AFR_TA_DOM_NOTIFY range lock and
AFR_TA_DOM_MODIFY full lock, does xattrop on TA and releases
AFR_TA_DOM_MODIFY lock and stores in-memory which brick is bad.

- All further write txn failures are handled based on this in-memory
value without querying the TA.

- When shd heals the files, it does so by requesting full lock on
AFR_TA_DOM_NOTIFY domain. Client uses this as a cue (via upcall),
releases AFR_TA_DOM_NOTIFY range lock and invalidates its in-memory
notion of which brick is bad. The next write txn failure is wound on TA
to again update the in-memory state.

- Any incomplete write txns before the AFR_TA_DOM_NOTIFY upcall release
request is got is completed before the lock is released.

- Any write txns got after the release request are maintained in a ta_waitq.

- After the release is complete, the ta_waitq elements are spliced to a
separate queue which is then processed one by one.

- For fops that come in parallel when the in-memory bad brick is still
unknown, only one is wound to TA on wire. The other ones are maintained
in a ta_onwireq which is then processed after we get the response from
TA.

Change-Id: I32c7b61a61776663601ab0040e2f0767eca1fd64
updates: bz#1579788
Signed-off-by: Ravishankar N 
Signed-off-by: Ashish Pandey

Land part 2 of clang-format changes

2018-09-12T12:22:45+00:00

Change-Id: Ia84cc24c8924e6d22d02ac15f611c10e26db99b4
Signed-off-by: Nigel Babu

afr: thin-arbiter read txn changes

2018-09-05T08:28:23+00:00

If both data bricks are up, read subvol will be based on read_subvols.

If only one data brick is up:
- First qeury the data-brick that is up. If it blames the other brick,
allow the reads.

- If if doesn't, query the TA to obtain the source of truth.

TODO: See if in-memory state can be maintained for read txns (BZ 1624358).

updates: bz#1579788
Change-Id: I61eec35592af3a1aaf9f90846d9a358b2e4b2fcc
Signed-off-by: Ravishankar N

afr: don't update readables if inode refresh failed on all children

2018-06-18T11:21:46+00:00

Problem:
If inode refresh failed on all children of afr due to ENOENT (say file
migrated by dht), it resets the readables to zero. Any inflight txn which
then later comes on the inode fails with EIO because no readable
children present for the inode.

Fix:
Don't update readables when inode refresh fails on *all* children of
afr. In that way any inflight txns will either proceed with its own inode
refresh if needed and fail it with the right errno or use the old value
of readables and continue with the txn.

Also, add quorum checks to the beginning of afr_transaction(). Otherwise, we
seem to be winding the lock and checking for quorum only in pre-op pahse.

Note: This should ideally fix BZ 1329505 since the stop gap fix for
it is has been reverted at https://review.gluster.org/#/c/20028.

Change-Id: Ia638c092d8d12dc27afb3cdad133394845061319
updates: bz#1584483
Signed-off-by: Ravishankar N

afr: add new value for read-hash-mode volume option

2018-03-29T07:37:04+00:00

Updates: #363

This new value (3) will try to wind read requests to the child of AFR
having the least amount of pending requests in its queue.

Change-Id: If6bda2aac9bf7aec3fc39622f78659313c4b6508
Signed-off-by: Ravishankar N

afr: add checks for allowing lookups

2017-11-18T00:38:47+00:00

Problem:
In an arbiter volume, lookup was being served from one of the sink
bricks (source brick was down). shard uses the iatt values from lookup cbk
to calculate the size and block count, which in this case were incorrect
values. shard_local_t->last_block was thus initialised to -1, resulting
in an infinite while loop in shard_common_resolve_shards().

Fix:
Use client quorum logic to allow or fail the lookups from afr if there
are no readable subvolumes. So in replica-3 or arbiter vols, if there is
no good copy or if quorum is not met, fail lookup with ENOTCONN.

With this fix, we are also removing support for quorum-reads xlator
option. So if quorum is not met, neither read nor write txns are allowed
and we fail the fop with ENOTCONN.

Change-Id: Ic65c00c24f77ece007328b421494eee62a505fa0
BUG: 1467250
Signed-off-by: Ravishankar N

afr: do not mention split-brain in log message in read_txn

2017-03-20T13:58:50+00:00

I am seeing a lot of messages in qe/customer logs where read_txn
complains that file is possibly in split-brain because of no readable
subvol being found, does inode refresh and then there is no split-brain
message post the inode refresh. This means that a lookup was not issued
on the indoe to populate 'readable' or it can mean one brick is source
for data and the other for metadata, making readable to be zero (because
readable=intersection of (data,metadata readable) since commit
7a1c1e290470149696.

Since we anyway log actual split-brains post inode-refresh, move this
message to DEBUG log level.

Change-Id: Idb88b8ea362515279dc9b246f06b6b646c6d8013
BUG: 1433838
Signed-off-by: Ravishankar N 
Reviewed-on: https://review.gluster.org/16879
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System