glusterfs.git/xlators/cluster, branch v3.9.1

ec: Invalidations in disperse volume should not update the stat

2017-01-17T14:55:46+00:00

Backport of http://review.gluster.org/16329

Issue:
In a disperse volume, a file's data is spread across the bricks, hence the
stat from one brick doesn't carry the valid size of the file. Therefore,
an upcall from one brick that updates md-cache results in the wrong size
being cached.

Fix:
If the notification is a cache invalidation, indicate to md-cache that
the attributes are invalid.
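
A minimal sketch of the idea, not the actual patch: when an upcall is a
cache invalidation, flag the stat it carries as invalid rather than
passing a one-brick size upwards. The type, enum and flag names here are
hypothetical stand-ins, not the GlusterFS definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the upcall types; not the GlusterFS ones. */
enum upcall_event { UPCALL_CACHE_INVALIDATION, UPCALL_OTHER };

#define FLAG_INVAL_ATTR 0x1u /* hypothetical "attributes are stale" flag */

struct upcall {
    enum upcall_event event;
    uint32_t flags;
    uint64_t stat_size; /* size as seen by a single brick */
};

/* Flag the stat carried by a cache invalidation as unusable, so that a
 * cache above this layer drops its attributes instead of storing the
 * size reported by one brick. Returns true if the upcall was modified. */
static bool
mark_invalidation_stat_invalid(struct upcall *up)
{
    assert(up != NULL);
    if (up->event != UPCALL_CACHE_INVALIDATION)
        return false;
    up->flags |= FLAG_INVAL_ATTR;
    return true;
}
```

The stat itself is left untouched in this sketch; only the flag tells the
consumer not to trust it.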

>Reviewed-on: http://review.gluster.org/16329
>Smoke: Gluster Build System 
>NetBSD-regression: NetBSD Build System 
>Reviewed-by: Xavier Hernandez 
>CentOS-regression: Gluster Build System 
>Reviewed-by: Pranith Kumar Karampuri 
(cherry picked from commit 95d07a3d2d68805d93d36a447436e27c48777939)

BUG: 1410688
Change-Id: Id89d2283478e70b62b435a8891fffc86d2be8cb2
Signed-off-by: Poornima G 
Reviewed-on: http://review.gluster.org/16341
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/afr: Remove backward compatibility for locks with v1

2017-01-17T14:46:53+00:00

When we have cascading locks with same lk-owner there is a possibility for
a deadlock to happen. One example is as follows:

Self-heal takes a lock in the data domain for a big name with 256 chars of
"aaaa...a" and starts heal in a 3-way replication when brick-0 is offline, so
healing from brick-1 to brick-2 is in progress. This lock is active on brick-1
and brick-2. Now brick-0 comes online and an operation wants to take a full
lock; the lock is granted on brick-0 and the operation waits for it on brick-1.
As part of entry healing, self-heal takes full locks on all the available
bricks and only then proceeds with healing the entry. This lock now waits on
brick-0 because the other operation already has a granted lock there. This
leads to a deadlock: the operation is waiting for heal to unlock "aaaa...",
whereas heal is waiting for the operation to unlock on brick-0.

Initially I thought this happens because healing tries to take a lock on all
the available bricks instead of just the bricks that are participating in the
heal. But I later realized that the same kind of deadlock can happen if a brick
goes down after the heal starts but comes back before it completes. So the
essential problem is the cascading locks with the same lk-owner, which were
added for backward compatibility with afr-v1 and can be safely removed now
that versions with afr-v1 are already EOL. This patch removes the
compatibility with v1, which requires cascading locks with the same lk-owner.
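
The deadlock described above is a cycle in the wait-for relation between
heal and the other operation. A toy model, assuming just those two actors
(the names and the array encoding are illustrative only, not AFR code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { HEAL, OPERATION, NUM_ACTORS };
enum { NO_ONE = -1 };

/* waits_for[a] is the actor whose unlock 'a' is blocked on, or NO_ONE.
 * Returns true if following the waits leads from some actor back to
 * itself, i.e. nobody in that chain can ever make progress. */
static bool
has_deadlock(const int *waits_for)
{
    assert(waits_for != NULL);
    for (int start = 0; start < NUM_ACTORS; start++) {
        int cur = waits_for[start];
        /* A chain longer than NUM_ACTORS steps cannot reveal anything new. */
        for (int steps = 0; steps < NUM_ACTORS && cur != NO_ONE; steps++) {
            if (cur == start)
                return true;
            cur = waits_for[cur];
        }
    }
    return false;
}
```

With heal blocked on brick-0 by the operation and the operation blocked on
brick-1 by heal, the array is { OPERATION, HEAL } and the check reports a
cycle; if heal never takes the second lock, the array is { NO_ONE, HEAL }
and it does not.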

In the next version we can make locking-scheme option a dummy and switch
completely to v2.

 >BUG: 1401404
 >Change-Id: Ic9afab8260f5ff4dff5329eb0429811bcb879079
 >Signed-off-by: Pranith Kumar K 
 >Reviewed-on: http://review.gluster.org/16024
 >Smoke: Gluster Build System 
 >Reviewed-by: Ravishankar N 
 >NetBSD-regression: NetBSD Build System 
 >CentOS-regression: Gluster Build System 

BUG: 1413062
Change-Id: I4f5d485d9e0646ad3dc384e5ec36682b0933c9d3
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/16413
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System

cluster/afr: Do not log of split-brain when there isn't one

2017-01-13T12:38:58+00:00

        Backport of: http://review.gluster.org/16362

* Even on errors like ENOENT, AFR logs split-brain after
  read-txn refresh, introduced by commit a07ddd8f.
  This can be a cause of much panic and confusion and needs to be fixed.

* Also fixed this issue in write-txns.

* Fixed afr read txns to log about split-brain only after knowing that
  there is no split-brain choice configured.

* Removed code duplication

* Fixed incorrect passing of error code in afr_write_txn_refresh_done()
  (the function was passing -0 as errno to gf_msg()).
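
A rough sketch of the logging rule the bullets above describe. It assumes
that split-brain surfaces as EIO after the refresh and that errors are
positive errno values; the function name is made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* err is a positive errno value, 0 meaning success.
 * Assumption: only EIO indicates split-brain; errors such as ENOENT
 * must not be reported as split-brain. */
static bool
should_log_split_brain(int err, bool sb_choice_configured)
{
    assert(err >= 0);
    if (err != EIO)
        return false;
    /* A configured split-brain choice resolves the read, so stay quiet. */
    return !sb_choice_configured;
}
```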

Change-Id: Ie40d2c498674a1fe8dc2c521b05e30c0bce85c02
BUG: 1412914
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/16388
Smoke: Gluster Build System 
Reviewed-by: Ravishankar N 
Reviewed-by: Pranith Kumar Karampuri 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

afr: Avoid resetting event_gen when brick is always down

2017-01-13T08:31:12+00:00

Problem:
__afr_set_in_flight_sb_status(), which resets event_gen to zero, is
called if failed_subvols[i] is non-zero for any brick. But failed_subvols[i]
is set even if the brick was down *before* the transaction started.
Hence, if one brick is down in a replica-3 volume, every writev that comes
in triggers an inode refresh because of this reset, as seen from
the number of FSTATs in the profile info in the BZ.

Fix:
Reset event gen only if the brick was previously a valid read child and
the FOP failed on it the first time.

Also `s/afr_inode_read_subvol_reset/afr_inode_event_gen_reset` because
the function only resets event gen and not the data/metadata readable.
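
The condition in the fix can be sketched as follows. This is an
illustration under assumptions, not the patched function: was_readable[]
stands for "the brick was a valid read child before this transaction" and
is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Decide whether event_gen needs a reset after a write transaction.
 * A brick that was already down (and hence not readable) before the
 * transaction fails every fop, but that tells us nothing new. */
static bool
should_reset_event_gen(const unsigned char *failed_subvols,
                       const unsigned char *was_readable,
                       size_t child_count)
{
    assert(failed_subvols != NULL && was_readable != NULL);
    for (size_t i = 0; i < child_count; i++) {
        if (failed_subvols[i] && was_readable[i])
            return true; /* a previously good copy just went bad */
    }
    return false;
}
```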

> Signed-off-by: Ravishankar N 
> Reviewed-on: http://review.gluster.org/16309
> Smoke: Gluster Build System 
> Reviewed-by: Pranith Kumar Karampuri 
> Tested-by: Pranith Kumar Karampuri 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
(cherry picked from commit 522640be476a3f97dac932f7046f0643ec0ec2f2)

Change-Id: I603ae646cbde96995c35db77916e2ed80b602a91
BUG: 1412886
Reviewed-on: http://review.gluster.org/16385
Tested-by: Ravishankar N 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Krutika Dhananjay 
Reviewed-by: Pranith Kumar Karampuri 
CentOS-regression: Gluster Build System

cluster/ec: Check xdata to avoid memory leak

2017-01-12T06:21:39+00:00

Problem: ec_writev_start calls ec_make_internal_fop_xdata
to set "yes" in xdata before ec_readv (an internal fop)
is called for head and tail. The second call to this function
overwrites the dict_t previously allocated to "xdata",
which results in a memory leak.

Solution: In ec_make_internal_fop_xdata, check whether *xdata
is NULL before allocating, to avoid overwriting *xdata.
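
The pattern, shown with a toy type instead of dict_t (struct toy_dict and
the function below are illustrative assumptions, not the EC code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for dict_t holding the single marker the fop needs. */
struct toy_dict {
    const char *internal_fop;
};

/* Allocate *xdata only when the caller has not prepared one already.
 * Without the NULL check, a second call would overwrite the pointer to
 * the first allocation and leak it. Returns 0 on success, -1 on ENOMEM. */
static int
make_internal_fop_xdata(struct toy_dict **xdata)
{
    assert(xdata != NULL);
    if (*xdata != NULL)
        return 0; /* reuse what the first call created */

    struct toy_dict *dict = malloc(sizeof(*dict));
    if (dict == NULL)
        return -1;
    dict->internal_fop = "yes";
    *xdata = dict;
    return 0;
}
```

Calling it once for the head read and once for the tail read then leaves a
single allocation for the caller to release.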


>Change-Id: I49b83923e11aff9b92d002e86424c0c2e1f5f74f
>BUG: 1400818
>Signed-off-by: Ashish Pandey 
>Reviewed-on: http://review.gluster.org/16007
>Reviewed-by: Xavier Hernandez 
>Reviewed-by: Pranith Kumar Karampuri 
>Tested-by: Pranith Kumar Karampuri 
>Smoke: Gluster Build System 
>NetBSD-regression: NetBSD Build System 
>CentOS-regression: Gluster Build System 

Change-Id: I49b83923e11aff9b92d002e86424c0c2e1f5f74f
BUG: 1400833
Signed-off-by: Ashish Pandey 
Reviewed-on: http://review.gluster.org/16006
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Xavier Hernandez

dht/md-cache: Filter invalidate if the file is made a linkto file

2017-01-05T06:10:12+00:00

Backport of http://review.gluster.org/15789

Upcall, as a part of setattr, sends an invalidation and the
invalidation carries the resulting stat value. An invalidation
is sent even when a file is converted to a linkto file, and as
a result the mountpoint shows the sticky bit in the stat of the
file.
eg: ---------T. 945 root root 0 Nov  8 10:14 hardlink.999

Fix:
When dht receives a notification of a sticky bit change, it updates
the flag to tell md-cache to send the subsequent lookup.

>Reviewed-on: http://review.gluster.org/15789
>Smoke: Gluster Build System 
>NetBSD-regression: NetBSD Build System 
>Reviewed-by: Niels de Vos 
>CentOS-regression: Gluster Build System 
>Reviewed-by: Susant Palai 
>Reviewed-by: Rajesh Joseph 
>(cherry picked from commit 4536f7bdf16f8286d67598eda9a46c029f0c0bf4)

Change-Id: Ic2fd7a5b196db0754f9b97072e644e6bf69da606
BUG: 1401376
Signed-off-by: Poornima G 
Reviewed-on: http://review.gluster.org/16022
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G

cluster/dht: Fix memory corruption while accessing regex stored in private

2017-01-03T11:36:10+00:00

If reconfigure is executed in parallel (or concurrently with dht_init),
there are races that can corrupt memory. One such race is modification
of regexes stored in conf (conf->rsync_regex_valid and
conf->extra_regex_valid) through dht_init_regex. With change [1], the
reconfigure codepath can get executed in parallel (with itself or with
dht_init) and this fix is needed.

Also, a reconfigure can race with any thread doing dht_layout_search,
resulting in dht_layout_search accessing a regex freed by reconfigure
(like in bz 1399134).

[1] http://review.gluster.org/15046
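
A generic sketch of serialising both the writers and the readers of a
stored regex with one mutex, using POSIX regcomp()/regexec()/regfree().
struct toy_conf and the function names are made up for illustration; the
actual patch may protect the regexes differently.

```c
#include <assert.h>
#include <pthread.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_conf {
    pthread_mutex_t lock; /* guards regex and regex_valid */
    regex_t regex;
    bool regex_valid;
};

/* Replace the stored regex (init/reconfigure path). The old regex is
 * freed and the new one compiled under the lock, so a concurrent
 * matcher never sees a freed or half-built regex_t. */
static int
toy_conf_set_regex(struct toy_conf *conf, const char *pattern)
{
    assert(conf != NULL && pattern != NULL);
    pthread_mutex_lock(&conf->lock);
    if (conf->regex_valid) {
        regfree(&conf->regex);
        conf->regex_valid = false;
    }
    int rc = regcomp(&conf->regex, pattern, REG_EXTENDED | REG_NOSUB);
    conf->regex_valid = (rc == 0);
    pthread_mutex_unlock(&conf->lock);
    return rc == 0 ? 0 : -1;
}

/* Match a name against the stored regex (lookup path). */
static bool
toy_conf_matches(struct toy_conf *conf, const char *name)
{
    assert(conf != NULL && name != NULL);
    bool matched = false;
    pthread_mutex_lock(&conf->lock);
    if (conf->regex_valid)
        matched = (regexec(&conf->regex, name, 0, NULL, 0) == 0);
    pthread_mutex_unlock(&conf->lock);
    return matched;
}
```

A conf can be set up with { .lock = PTHREAD_MUTEX_INITIALIZER,
.regex_valid = false } before the first call. Holding the lock across
regexec() keeps the sketch simple at the cost of serialising matchers.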

>Change-Id: I039422a65374cf0ccbe0073441f0e8c442ebf830
>BUG: 1399134
>Signed-off-by: Raghavendra G 
>Reviewed-on: http://review.gluster.org/15945
>Smoke: Gluster Build System 
>NetBSD-regression: NetBSD Build System 
>Reviewed-by: N Balachandran 
>CentOS-regression: Gluster Build System 
>Reviewed-by: Shyamsundar Ranganathan 

Change-Id: I039422a65374cf0ccbe0073441f0e8c442ebf830
BUG: 1399422
Signed-off-by: Raghavendra G 
(cherry picked from commit 64451d0f25e7cc7aafc1b6589122648281e4310a)
Reviewed-on: http://review.gluster.org/15949
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

cluster/afr: Fix missing name indices due to EEXIST error

2016-12-28T09:06:48+00:00

        Backport of: http://review.gluster.org/16286

PROBLEM:
Consider a volume with granular-entry-heal and sharding enabled. When
a replica is down and a shard is created as part of a write, the name
index is correctly created under indices/entry-changes/.
Now when a read on the same region triggers another MKNOD, the fop
fails on the online bricks with EEXIST. Because this is a symmetric
error, the failed_subvols[] array is reset to all zeroes.
As a result, GF_XATTROP_ENTRY_OUT_KEY is set before post-op, causing
the name index that was created by the previous MKNOD operation to be
wrongly deleted in THIS MKNOD operation.

FIX:
The ideal fix would have been for a transaction to delete the name
index ONLY if it knows it is the one that created the index in the first
place. This would involve gathering information from individual bricks as
to whether THIS xattrop created the index, aggregating their responses and,
based on the various possible combinations of responses, deciding whether to
delete the index or not. This is rather complex. A simpler fix is
for post-op to examine local->op_ret when there are no failed_subvols
to figure out whether to delete the name index or not. This can occasionally
lead to the creation of stale name indices, but they do not affect the IO path
or interfere with pending changelogs in any way, and self-heal deletes such
indices during its crawl of the "entry-changes" directory.

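The simpler rule from the FIX section might look like this. It is a sketch
under the assumption that a negative op_ret means the fop failed; the
function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Decide in post-op whether the name index may be deleted.
 * - Some brick missed the operation: the index is still needed.
 * - No brick is marked failed but the fop itself failed (a symmetric
 *   error such as EEXIST everywhere): keep the index, since an earlier
 *   transaction may have created it.
 * - Otherwise the entry is in sync and the index can go. */
static bool
should_delete_name_index(const unsigned char *failed_subvols,
                         size_t child_count, int op_ret)
{
    assert(failed_subvols != NULL);
    for (size_t i = 0; i < child_count; i++) {
        if (failed_subvols[i])
            return false;
    }
    return op_ret >= 0;
}
```
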
Change-Id: I8c5c08b7a208e840b5970fe5699dabdaf751a150
BUG: 1408785
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/16294
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Pranith Kumar Karampuri

afr: use accused matrix instead of readable matrix for deciding heals

2016-12-28T09:05:53+00:00

Problem:
afr_replies_interpret() used the 'readable' matrix to trigger client
side heals after inode refresh. But for arbiter, readable is always
zero. So when `dd` is run with a data brick down, spurious data heals
are triggered. These heals open an fd, causing eager lock to be
disabled (open fd count > 1) in afr transactions, leading to extra FXATTROPs.

Fix:
Use the accused matrix (derived from interpreting the afr pending
xattrs) to decide whether we can start heal or not.
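
As an illustration of deciding from accusations rather than readability,
assume a flattened matrix where pending[i * n + j] != 0 means brick i holds
a pending xattr blaming brick j. The layout and names are assumptions for
this sketch, not the AFR data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if any brick is accused by any pending xattr, i.e. a
 * heal is warranted. A brick that is merely never readable (such as an
 * arbiter) does not trigger a heal by that fact alone. */
static bool
heal_needed(const unsigned int *pending, size_t child_count)
{
    assert(pending != NULL);
    for (size_t i = 0; i < child_count; i++) {
        for (size_t j = 0; j < child_count; j++) {
            if (pending[i * child_count + j] != 0)
                return true; /* brick i accuses brick j */
        }
    }
    return false;
}
```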

> Reviewed-on: http://review.gluster.org/16277
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Smoke: Gluster Build System 
> Reviewed-by: Pranith Kumar Karampuri 
> Tested-by: Pranith Kumar Karampuri 
(cherry picked from commit 5a7c86e578f5bbd793126a035c30e6b052177a9f)

Change-Id: Ibbd56c9aed6026de6ec42422e60293702aaf55f9
BUG: 1408770
Signed-off-by: Ravishankar N 
Reviewed-on: http://review.gluster.org/16290
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

afr: Ignore event_generation checks post inode refresh for write txns

2016-12-26T15:06:12+00:00

Before http://review.gluster.org/#/c/15673/, after inode refresh, we
failed read txns in case of EIO or event_generation being zero. For
write transactions, the check was only for EIO. 15673 re-factored the
code to fail both read and write when event_generation=0. This seems to
have caused a regression as explained in the BZ.

This patch restores the earlier behaviour in afr_txn_refresh_done(): after
inode refresh, write transactions are failed only on EIO.
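
The restored rule, written out as a small predicate. Names are
illustrative and err is assumed to be a positive errno value:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum txn_type { TXN_READ, TXN_WRITE };

/* Should the transaction be failed once inode refresh is done?
 * Reads fail on EIO or when event_generation is zero; writes fail
 * only on EIO and ignore event_generation. */
static bool
txn_must_fail_after_refresh(enum txn_type type, int err,
                            int event_generation)
{
    assert(err >= 0);
    if (err == EIO)
        return true;
    return type == TXN_READ && event_generation == 0;
}
```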

> Reviewed-on: http://review.gluster.org/16205
> Reviewed-by: Pranith Kumar Karampuri 
> Smoke: Gluster Build System 
> CentOS-regression: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
(cherry picked from commit 7ee998b9041d594d93a4e2ef369892c185e80def)


Change-Id: Ib8e116506badce6f58b55827dbe403d95069d744
BUG: 1408171
Signed-off-by: Ravishankar N 
Reviewed-on: http://review.gluster.org/16271
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri