glusterfs.git/xlators/cluster, branch v3.6.3

ec: Special handling of anonymous fd

2015-03-30T07:22:32+00:00

Anonymous file descriptors need special handling because they
can be used in some non-standard ways (e.g. an anonymous fd can
be used without having been opened).

This caused NFS to fail on some operations because ec always
expected to have a previous successful opendir call (from patch
http://review.gluster.org/9098/).

This patch treats every anonymous fd as opened on all subvolumes.
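
The rule can be illustrated with a minimal sketch. It is written in Python
rather than the C of the ec xlator, and the names `Fd`, `open_mask` and
`subvols_opened_on` are hypothetical, chosen only for this illustration:

```python
from dataclasses import dataclass


@dataclass
class Fd:
    anonymous: bool      # True if the fd was never explicitly opened
    open_mask: int = 0   # bit i set => open/opendir succeeded on subvolume i


def subvols_opened_on(fd: Fd, num_subvols: int) -> int:
    """Return the bitmask of subvolumes on which the fd may be used."""
    all_subvols = (1 << num_subvols) - 1
    if fd.anonymous:
        # No open/opendir was ever issued, so no mask was recorded:
        # treat the fd as opened on every subvolume.
        return all_subvols
    return fd.open_mask & all_subvols
```

With this rule an anonymous fd on a 6-subvolume volume yields the full mask,
while a regular fd is still limited to the subvolumes where its open succeeded.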

This is a backport of http://review.gluster.org/9513/

Change-Id: I09dbbce2ffc1ae3a5bcbb328bed55b84f4f0b9f8
BUG: 1187526
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/9596
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Pranith Kumar Karampuri 
Reviewed-by: Dan Lambright 
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra Bhat

cluster/ec: Wait for all bricks to notify before notifying parent

2015-03-30T07:20:56+00:00

Backport of http://review.gluster.org/9523

This prevents spurious heals that can otherwise be triggered when the
parent is notified before all bricks have reported their state.

BUG: 1188471
Change-Id: Iaea335d59431d8d85a236963a365f5c791fc7c49
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/9552
Reviewed-by: Xavier Hernandez 
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra Bhat

cluster/ec: Handle CHILD UP/DOWN in all cases

2015-03-30T07:20:38+00:00

Backport of http://review.gluster.org/9396

Problem:
If all the bricks are down when the volume is mounted, the mount
command hangs.

Fix:
1. Ignore all CHILD_CONNECTING events coming from subvolumes.
2. On timer expiration (without enough up or down children) send
   CHILD_DOWN.
3. Once enough up or down subvolumes are detected, send the appropriate event.
   When the rest of the subvols go up/down without changing the overall
   ec-up/ec-down state, send CHILD_MODIFIED to the parents.
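
The three steps above can be sketched as a small state machine. This is a
Python model, not the ec code: the class `EcNotifier`, its fields, and the
thresholds used for "enough" (at least `fragments` children up, or more than
`nodes - fragments` children down) are assumptions made for illustration.

```python
class EcNotifier:
    def __init__(self, nodes, fragments):
        self.nodes = nodes          # total number of subvolumes
        self.fragments = fragments  # assumed minimum of up subvolumes for ec-up
        self.up = set()
        self.down = set()
        self.state = None           # last overall state sent: "UP", "DOWN" or None
        self.sent = []              # events propagated to the parent

    def _decide(self):
        if len(self.up) >= self.fragments:
            return "UP"
        # Before the first decision, wait until enough children are known down.
        if self.state is not None or len(self.down) > self.nodes - self.fragments:
            return "DOWN"
        return None

    def child_event(self, idx, event):
        if event == "CHILD_CONNECTING":
            return                  # step 1: ignored
        if event == "CHILD_UP":
            self.up.add(idx)
            self.down.discard(idx)
        elif event == "CHILD_DOWN":
            self.down.add(idx)
            self.up.discard(idx)
        else:
            return
        new_state = self._decide()
        if new_state is None:
            return                  # not enough information yet
        if new_state != self.state:
            self.state = new_state  # step 3: overall state changed
            self.sent.append("CHILD_" + new_state)
        else:
            self.sent.append("CHILD_MODIFIED")

    def timer_expired(self):
        if self.state is None:      # step 2: nothing decided in time
            self.state = "DOWN"
            self.sent.append("CHILD_DOWN")
```

In this model a mount with every brick down no longer waits forever: the
timer fires, CHILD_DOWN reaches the parent, and later brick events are
reported as CHILD_UP or CHILD_MODIFIED.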

BUG: 1188471
Change-Id: If92bd84107d49495cd104deb34601afe7f9b155c
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/9551
Reviewed-by: Xavier Hernandez 
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra Bhat

cluster/afr: serialize inode locks

2015-03-25T11:28:33+00:00

Backport of http://review.gluster.org/9372

Problem:
AFR winds inodelk calls in no particular order, so blocking inodelks
from two different mounts can deadlock: mount1 gets the lock on
brick-1 and blocks on brick-2, whereas mount2 gets the lock on
brick-2 and blocks on brick-1.

Fix:
Serialize the inodelks, whether they are blocking or non-blocking.

Non-blocking locks also need to be serialized. Otherwise there is a
chance that neither of the mounts that issued the same non-blocking
inodelk ends up holding the lock.
Ex:
Mount1 and Mount2 both request a full-length lock on file f1. AFR on Mount1
may acquire the lock on brick-1 but not on brick-2 because Mount2 already
holds the lock on brick-2, and vice versa. Since both mounts got only partial
locks, AFR treats each attempt as a failure to acquire the lock and unwinds
with EAGAIN.
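
The non-blocking case can be reproduced with a small single-threaded model.
This is a sketch in Python, not AFR code; `Brick`, `unordered_race` and
`lock_serially` are hypothetical names, and the interleaving is fixed by hand
so that the outcome is deterministic.

```python
class Brick:
    def __init__(self):
        self.owner = None

    def try_lock(self, mount):
        """Non-blocking lock: succeeds only if the brick is free or already ours."""
        if self.owner is None:
            self.owner = mount
            return True
        return self.owner == mount

    def unlock(self, mount):
        if self.owner == mount:
            self.owner = None


def unordered_race():
    """Unordered winds: each mount happens to reach a different brick first."""
    b1, b2 = Brick(), Brick()
    m1 = [b1.try_lock("mount1")]      # mount1 wins brick-1
    m2 = [b2.try_lock("mount2")]      # mount2 wins brick-2
    m1.append(b2.try_lock("mount1"))  # refused, mount2 holds brick-2
    m2.append(b1.try_lock("mount2"))  # refused, mount1 holds brick-1
    return m1, m2                     # both hold only a partial lock


def lock_serially(bricks, mount):
    """Take the lock brick by brick, in the same order for every mount."""
    taken = []
    for brick in bricks:
        if not brick.try_lock(mount):
            for held in taken:        # release whatever was acquired
                held.unlock(mount)
            return "EAGAIN"
        taken.append(brick)
    return "OK"
```

Because every mount contends for the first brick before touching the second,
the mount that wins the first brick can go on to take the rest, and the loser
fails immediately without holding anything.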

BUG: 1189023
Change-Id: If5dd502d9d25d12425749a8efcf08a1423b29255
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/9576
Reviewed-by: Krutika Dhananjay 
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra Bhat

cluster/afr: Make read child match check in afr optional

2015-03-25T08:58:53+00:00

Backport of: http://review.gluster.org/9917

The check introduced by commit
c57c455347a72ebf0085add49ff59aae26c7a70d causes a drop in readdirp
performance, so this patch makes the behavior configurable.

Change-Id: I4a19813cfc786504340264a5a5533a0c43a1d4a4
BUG: 1202673
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/9929
Reviewed-by: Atin Mukherjee 
Tested-by: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri 
Reviewed-by: Raghavendra Bhat

afr: remove stale index entries

2015-03-25T08:57:01+00:00

Backport of http://review.gluster.org/9714

Problem:
During pre-op phase, the index xlator
1. Creates the entry inside .glusterfs/indices/xattrop
2. Winds the xattrop fop to posix to mark dirty/pending changelogs.
If the brick crashes after 1, the xattrop entry becomes stale and never
gets removed by shd during subsequent crawls because there is nothing to
heal (changelogs are zero).

Though the stale entry does not get displayed in the output of the
'heal info' command, it nevertheless stays there forever unless a new
write transaction is performed on the file.

Fix:
During index self-heal, if the afr xattrs are found to be clean (indicated
by a return value of 2 from afr_shd_selfheal()), send a dummy post-op with
all 0s for the xattr values, which makes the index xlator unlink the
stale entry.
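
A rough Python model of this fix, under the assumption that the index xlator
keeps an entry only while some xattr value is non-zero. The function names
`index_xattrop` and `heal_index_entry` are hypothetical; only the return
value 2 comes from the description above.

```python
def index_xattrop(index_entries, gfid, xattr_values):
    """Model of the index xlator: the entry exists only while something is pending."""
    if any(xattr_values):
        index_entries.add(gfid)
    else:
        index_entries.discard(gfid)   # all zeros: unlink the entry


def heal_index_entry(index_entries, gfid, shd_selfheal, num_bricks):
    """Heal one index entry; clean up the entry if there was nothing to heal."""
    ret = shd_selfheal(gfid)
    if ret == 2:
        # xattrs are clean, so the entry is stale: send a dummy post-op of zeros.
        index_xattrop(index_entries, gfid, [0] * num_bricks)
    return ret
```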

Change-Id: Iffb171e40490abd8d44df09ccc058b5da67baafe
BUG: 1203081
Signed-off-by: Ravishankar N 
Reviewed-on: http://review.gluster.org/9920
Tested-by: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri 
Reviewed-by: Raghavendra Bhat

cluster/afr: Convert quota size from n/w to host order before use

2015-03-16T09:45:43+00:00

Backport of: http://review.gluster.org/9853

Change-Id: I83f1ab16a2dc54841e7beff3033333fba009b3a4
BUG: 1201622
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/9884
Reviewed-by: Ravishankar N 
Tested-by: Gluster Build System 
Reviewed-by: Anuradha Talur 
Reviewed-by: Pranith Kumar Karampuri 
Reviewed-by: Raghavendra Bhat

cluster/afr: Handle getxattr of quota-size key

2015-03-14T10:13:45+00:00

Backport of http://review.gluster.org/9820

Afr needs to query QUOTA_SIZE_KEY from all the subvolumes and return the
maximum of the values reported by the readable bricks.
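
As a sketch of the selection rule in Python (not the AFR implementation): the
replies are assumed to carry the size as a signed 64-bit integer in network
byte order, and `decode_quota_size` and `quota_size` are hypothetical helper
names.

```python
import struct


def decode_quota_size(raw: bytes) -> int:
    """Decode an (assumed) signed 64-bit size from network byte order."""
    (size,) = struct.unpack("!q", raw)
    return size


def quota_size(replies, readable):
    """Return the largest size reported by a readable brick.

    replies:  dict mapping brick index -> raw xattr value
    readable: set of brick indexes considered readable
    """
    sizes = [decode_quota_size(raw)
             for idx, raw in replies.items() if idx in readable]
    if not sizes:
        raise ValueError("no readable brick returned a quota size")
    return max(sizes)
```

A brick that is not readable is ignored even if it reports the largest value.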

BUG: 1201624
Change-Id: I41725a7323020c1480c38560dc5ae2c2e82d6d47
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/9873
Reviewed-by: Krutika Dhananjay 
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra Bhat

cluster/afr: Do not increment healed_count if no healing was performed

2015-03-14T10:08:48+00:00

Backport of: http://review.gluster.org/9713

PROBLEM:
When file modifications are happening while index heal is launched,
the index healer can pick up entries that appeared in indices/xattrop
transiently during the course of the operations on the mount point and
do not really need any heal. This causes the index healer to keep doing
index-heal in a loop as long as it finds such an entry, believing that
it successfully healed some gfids even when it didn't.

FIX:
afr_selfheal() now returns a 1 to indicate that it did not (need to)
heal a given gfid. afr_shd_selfheal() will not increment healed_count
whenever afr_selfheal() returns a 1.
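
The effect on the heal loop can be sketched as follows. This is a Python
model with hypothetical function names; it assumes that a return value of 0
means a heal was performed and that a sweep is repeated while the previous
one healed at least one gfid.

```python
def index_heal_sweep(gfids, selfheal):
    """One pass over the index; count only the entries that were really healed."""
    healed_count = 0
    for gfid in gfids:
        ret = selfheal(gfid)
        if ret == 0:                  # assumed: 0 means a heal was performed
            healed_count += 1
        # ret == 1 means nothing needed healing, so it is not counted
    return healed_count


def index_heal(list_index, selfheal, max_sweeps=10):
    """Repeat sweeps while they make progress; return the number of sweeps."""
    sweeps = 0
    while sweeps < max_sweeps:
        sweeps += 1
        if index_heal_sweep(list_index(), selfheal) == 0:
            break
    return sweeps
```

With a transient entry that never needs healing, a selfheal that returns 1
ends the loop after one sweep, whereas one that reports success every time
keeps the loop running until the cap.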

Change-Id: I9158c814419b635fac3dfe2fe40c94d1548ea4e8
BUG: 1194306
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/9852
Tested-by: Gluster Build System 
Reviewed-by: Ravishankar N 
Reviewed-by: Anuradha Talur 
Reviewed-by: Raghavendra Bhat

afr: stop encoding subvolume id in readdir d_off

2015-03-04T07:38:41+00:00

Backport of http://review.gluster.org/9332

The purpose of encoding d_off in AFR is to indicate the
selected subvolume for the first readdir, and continue all
further readdirs of the session on the same subvolume. This is
required because, unlike files, dir d_offs are specific to the
backend and cannot be re-used on another subvolume. The d_off
transformation encodes the subvolume id and prevents such
invalid use of d_offs on other servers.

However, this approach could be quite wasteful of precious d_off
bit-space. Unlike DHT, where server id can change from entry to
entry and thus encoding the server id in the transformed d_off
is necessary, we could take a slightly relaxed approach in AFR.
The approach is to save the subvolume where the last readdir
request was sent in the fd_ctx. This consumes constant space (i.e.
no per-entry cache), and serves the purpose of avoiding d_off
"misuse" (i.e. using d_off from one server on another).

The compromise here is NFS resuming readdir from a non-0 cookie
after an extended delay (either anonymous FD has been reclaimed,
or server has restarted). In such cases a subvolume is picked
freshly. To make this fresh picking more deterministic (i.e., to
pick the same subvolume whenever possible, even after reboots),
the function afr_hash_child (used by afr_read_subvol_select_by_policy)
is modified to skip all dynamic inputs (i.e. PID) for the case
of directories.
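
A minimal Python sketch of the approach described above. It is not the AFR
code: `hash_child`, `DirFdCtx` and `readdir_subvol` are hypothetical names,
CRC32 stands in for whatever hash afr_hash_child really uses, and subvolume
availability is ignored.

```python
import zlib


def hash_child(gfid: bytes, child_count: int, is_dir: bool, pid: int = 0) -> int:
    """Pick a subvolume index; directories ignore dynamic inputs such as the PID."""
    data = gfid
    if not is_dir:
        data += pid.to_bytes(8, "big", signed=True)
    return zlib.crc32(data) % child_count


class DirFdCtx:
    """Per-fd context: constant space, no per-entry cache."""
    def __init__(self):
        self.readdir_subvol = None


def readdir_subvol(fd_ctx, offset, gfid, child_count):
    """Return the subvolume that should serve a readdir at the given offset."""
    if offset == 0 or fd_ctx.readdir_subvol is None:
        # First readdir, or a resume on a fresh fd (e.g. a reclaimed
        # anonymous fd): pick deterministically from the gfid alone.
        fd_ctx.readdir_subvol = hash_child(gfid, child_count, is_dir=True)
    return fd_ctx.readdir_subvol
```

A resume from a non-zero offset on a brand-new fd lands on the same subvolume
as the original session as long as the inputs to the hash are unchanged, which
is the determinism the commit aims for.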

BUG: 1191537
Change-Id: I7e3bd8dfe346a9a8e428d7ddeada6cfb66e64e54
Signed-off-by: Anand Avati 
Reviewed-on: http://review.gluster.org/9638
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra Bhat