glusterfs.git/xlators/cluster/afr/src/afr.h, branch v3.10.2

afr: Avoid resetting event_gen when brick is always down

2017-01-13T02:54:02+00:00

Problem:
__afr_set_in_flight_sb_status(), which resets event_gen to zero, is
called if failed_subvols[i] is non-zero for any brick. But failed_subvols[i]
is true even if the brick was down *before* the transaction started.
Hence say if 1 brick is down in  a replica-3, every writev that comes
will trigger an inode refresh because of this resetting, as seen from
the no. of FSTATs in the profile info in the BZ.

Fix:
Reset event gen only if the brick was previously a valid read child and
the FOP failed on it the first time.

Also `s/afr_inode_read_subvol_reset/afr_inode_event_gen_reset` because
the function only resets event gen and not the data/metadata readable.

Change-Id: I603ae646cbde96995c35db77916e2ed80b602a91
BUG: 1409206
Signed-off-by: Ravishankar N 
Reviewed-on: http://review.gluster.org/16309
Smoke: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Pranith Kumar Karampuri 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

cluster/afr: Do not log of split-brain when there isn't one

2017-01-12T06:42:02+00:00

* Even on errors like ENOENT, AFR logs split-brain after
  read-txn refresh, introduced by commit a07ddd8f.
  This can be a cause of much panic and confusion and needs to be fixed.

* Also fixed this issue in write-txns.

* Fixed afr read txns to log about split-brain only after knowing that
  there is no split-brain choice configured.

* Removed code duplication

* Fixed incorrect passing of error code in afr_write_txn_refresh_done()
  (the function was passing -0 as errno to gf_msg().

Change-Id: I354f454ce5bf0e5f00bc27916eb597367cb7d927
BUG: 1411625
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/16362
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Ravishankar N 
Reviewed-by: Pranith Kumar Karampuri

afr: Ignore event_generation checks post inode refresh for write txns

2016-12-22T11:06:31+00:00

Before http://review.gluster.org/#/c/15673/, after inode refresh, we
failed read txns in case of EIO or event_generation being zero. For
write transactions, the check was only for EIO. 15673 re-factored the
code to fail both read and write when event_generation=0. This seems to
have caused a regression as explained in the BZ.

This patch restores that behaviour in afr_txn_refresh_done().

Change-Id: Ib8e116506badce6f58b55827dbe403d95069d744
BUG: 1406224
Signed-off-by: Ravishankar N 
Reviewed-on: http://review.gluster.org/16205
Reviewed-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System

afr, client: More mem-leak fixes in COMPOUND fop cbk

2016-12-05T01:28:40+00:00

Bugs found and fixed:
1. Use correct subvolume index in pre-op-writev compound cbk
2. Prevent use-after-free of local->compound_args members in
   compound fops cbk in protocol/client
3. Fix xdata and xattr leaks in client_process_response
4. Fix possible leak of xdata in client_pre_writev() in
   test mode.
5. Free req->compound_req_array.compound_req_array_val as well
   after freeing its members
6. Free tmp_rsp->flock.lk_owner.lk_owner_val in LK fop.

Change-Id: I15b646d7d4e0e5cd4ea3d2d6452c815cf2eaf68f
BUG: 1401218
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/16020
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Pranith Kumar Karampuri 
CentOS-regression: Gluster Build System

afr: Fix the EIO that can occur in afr_inode_refresh as a result

2016-11-29T04:10:24+00:00

     of cache invalidation(upcall).

Issue:
------
When a cache invalidation is recieved as a result of changing
pending xattr, the read_subvol is reset. Consider the below chain
of execution:

CHILD_DOWN
...
afr_readv
...
afr_inode_refresh
...
afr_inode_read_subvol_reset <- as a result of pending xattr set by
                               some other client GF_EVENT_UPCALL will
                               be sent
afr_refresh_done -> this results in an EIO, as the read subvol was
                    reset by the end of the afr_inode_refresh

Solution:
---------
When GF_EVENT_UPCALL is recieved, instead of resetting read_subvol,
set a variable need_refresh in inode_ctx, the next time some one
starts a txn, along with event gen, need_rrefresh also needs to
be checked.

Change-Id: Ifda21a7a8039b8874215e1afa4bdf20f7d991b58
BUG: 1396952
Signed-off-by: Poornima G 
Reviewed-on: http://review.gluster.org/15892
Reviewed-by: Ravishankar N 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/afr: Fix bugs in [f]inodelk/[f]entrylk

2016-11-26T15:34:59+00:00

Problems:
1) Inodelk is not taking quorum into account
2) finodelk, [f]entrylk are not implemented correctly
3) By default afr doesn't go for non-blocking parallel locks.

Fix:
Implemented a common framework which can be used by
[f]inodelk/[f]entrylk.  Used quorum for the same.

Change-Id: I239f13875a065298630d266941df10cfa3addc85
BUG: 1369077
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/15802
Tested-by: Krutika Dhananjay 
Reviewed-by: Krutika Dhananjay 
Smoke: Gluster Build System 
Reviewed-by: Ravishankar N 
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System

md-cache, afr: Reduce the window of stale read

2016-10-20T07:07:55+00:00

Problem:
Consider a replica setup, where one mount writes data to a
file and the other mount reads the file. In afr, read operations
are not transaction based, a brick(read subvolume) is chosen as
a part of lookup or other operations, read is always wound only
to the read subvolume, even if there was write from a different client
that failed on this brick. This stale read continues until there is
a lookup or any write operation from the mount point. Currently, this
is not a major issue, as a lookup is issued before every read and it will
switch the read subvolume to a correct one. But with the plan of
increasing md-cache timeout to 600s, the stale read problem will be
more pronounced, i.e. stale read can continue for 600s(or more if cascaded
with readdirp), as there will be no lookups.

Solution:
Afr doesn't have any built-in solution for stale read(without affecting
the performance). The solution that came up, was to use upcall. When a file
on any brick is marked bad for the first time, upcall sends a notification
to all the clients that had recently accessed the file. The solution has
2 parts:
- Identifying when a file is marked bad, on any of the bricks,
  for the first time
- Client side actions on recieving the notifications

Identifying when a file is marked bad on any of the bricks for the first time:
-----------------------------------------------------------------------------
The idea is to track xattrop in upcall. xattrop currently comes with 2 afr
xattrs - afr dirty bit and afr pending xattrs.
   Dirty xattr is set to 1 before every write, and is unset if write succeeds.
In certain scenarios, dirty xattr can be 0 and still the file could be bad
copy. Hence do not track dirty xattr.
   Pending xattr is set on the good copy, indicating the other bricks that have
bad copy. It is still not as simple as, notifying when any of the pending xattrs
change. It could lead to flood of notifcations, in case the other brick is
completely down or consistantly failing. Hence it is important to notify only
once, the first time a good copy is marked bad.

Client side actions on recieving pending xattr change, notification:
--------------------------------------------------------------------
md-cache will invalidate the cache of that file, so that further lookup is
passed down to afr and hence update the read subvolume. Invalidating only in
md-cache is not enough, consider the folling oder of opertaions:
- pending xattr invalidation - invalidate md-cache
- readdirp on the bad read subvolume - fill md-cache
- lookup (served from md-cache)
- read - wound to the old read subvol.
Hence, along with invalidating md-cache, it is very important to reset the
read subvolume for that file, in afr.

Design Credit: Anuradha Talur, Ravishankar N

1. xattrop doesn't carry info saying post op/pre op.
2. Pre xattrop will have 0 value for all pending xattrs,
   the cbk of pre xattrop carries the on-disk xattr value.
   Non zero indicated healing is required.
3. Post xattrop will have non zero value for any of the
   pending xattrs, if the fop failed on any of the bricks.

Change-Id: I469cbc111714c433984fe1c922be2ef113c25804
BUG: 1211863
Signed-off-by: Poornima G 
Reviewed-on: http://review.gluster.org/15398
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

afr: Consume compound fops in afr transaction

2016-09-01T17:22:39+00:00

Change-Id: Ib06ece3cce1b10d28d6d2953da28444f5c2457ad
BUG: 1290304
Signed-off-by: Anuradha Talur 
Reviewed-on: http://review.gluster.org/15014
Tested-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Krutika Dhananjay 
Reviewed-by: Pranith Kumar Karampuri

cluster/afr: Give option to do consistent-io

2016-08-22T20:55:42+00:00

Problem:
When tiering/rebalance does migrations and afr with 2-way replica is in
picture, migration can read stale data if the source brick goes down and writes
to the destination. After this deletion of the file leads to permanent loss of
data after migration.

Fix:
Rebalance/tiering should migrate only when the data is definitely not stale. So
introduce an option in afr called consistent-io which will be enabled in
migration daemons.

BUG: 1306398
Change-Id: I750f65091cc70a3ed4bf3c12f83d0949af43920a
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/13425
Reviewed-by: Anuradha Talur 
Reviewed-by: Krutika Dhananjay 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order

2016-08-22T09:38:36+00:00

When the bricks are brought offline and then online in cyclic
order while writes are in progress on a file, thanks to inode
refresh in write txns, AFR will mostly fail the write attempt
when the only good copy is offline. However, there is still a
remote possibility that the file will run into split-brain if
the brick that has the lone good copy goes offline *after* the
inode refresh but *before* the write txn completes (I call it
in-flight split-brain in the patch for ease of reference),
requiring intervention from admin to resolve the split-brain
before the IO can resume normally on the file. To get around this,
the patch does the following things:
i) retains the dirty xattrs on the file
ii) avoids marking the last of the good copies as bad (or accused)
    in case it is the one to go down during the course of a write.
iii) fails that particular write with the appropriate errno.

This way, we still have one good copy left despite the split-brain situation
which when it is back online, will be chosen as source to do the heal.

Change-Id: I9ca634b026ac830b172bac076437cc3bf1ae7d8a
BUG: 1363721
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/15080
Tested-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Ravishankar N 
Reviewed-by: Oleksandr Natalenko 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Pranith Kumar Karampuri