glusterfs.git/xlators/cluster/ec/src/ec.c, branch release-4.1

cluster/ec: Fix pre-op xattrop management

2018-05-25T02:06:11+00:00

Multiple pre-op xattrop can be simultaneously being processed. On the cbk
it was checked if the fop was waiting for some specific data (like size and
version) and, if so, it was assumed that this answer should contain that
data.

This is not true, since a fop can be waiting for some data, but it may come
from the xattrop of another fop.

This patch differentiates between needing some information and providing it.

This is related to parallel writes. Disabling them fixed the problem, but
also prevented concurrent reads. A change has been made so that disabling
parallel writes still allows parallel reads.

Backport of:
> BUG: 1578325

Fixes: bz#1582056
Change-Id: I74772ad6b80b7b37805da93d5ec3ae099e96b041
Signed-off-by: Xavi Hernandez

cluster/ec: Turn ON the stripe-cache option by default

2018-04-06T07:05:55+00:00

Change-Id: I0a290396c30c635b13ee73004d20259efb76a954
fixes: bz#1563945
Signed-off-by: Ashish Pandey

cluster/ec: send list-node-uuids request to all subvolumes

2018-03-28T18:19:25+00:00

The xattr trusted.glusterfs.list-node-uuids was only sent to a single
subvolume. This was returning null uuids from the other subvolumes as
if they were down.

This fix forces that xattr to be requested from all subvolumes.

Change-Id: If62eb39a6857258923ba625e153d4ad79018ea2f
fixes: bz#1561406
Signed-off-by: Xavi Hernandez

cluster/ec: Change default read policy to gfid-hash

2018-03-14T06:22:05+00:00

Problem:
Whenever we read data from file over NFS, NFS reads
more data then requested and caches it. Based on the
stat information it makes sure that the cached/pre-read
data is valid or not.

Consider 4 + 2 EC volume and all the bricks are on
differnt nodes.

In EC, with round-robin read policy, reads are sent on
different set of data bricks. This way, it balances the
read fops to go on all the bricks and avoid heating UP
(overloading) same set of bricks.

Due to small difference in clock speed, it is possible
that we get minor difference for atime, mtime or ctime
for different bricks. That might cause a different stat
returned to NFS based on which NFS will discard
cached/pre-read data which is actually not changed and
could be used.

Solution:
Change read policy for EC as gfid-hash. That will force
all the read to go to same set of bricks.

Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84
BUG: 1554743
Signed-off-by: Ashish Pandey

cluster/ec: avoid delays in self-heal

2018-03-14T03:12:27+00:00

Self-heal creates a thread per brick to sweep the index looking for
files that need to be healed. These threads are started before the
volume comes online, so nothing is done but waiting for the next
sweep. This happens once per minute.

When a replace brick command is executed, the new graph is loaded and
all index sweeper threads started. When all bricks have reported, a
getxattr request is sent to the root directory of the volume. This
causes a heal on it (because the new brick doesn't have good data),
and marks its contents as pending to be healed. This is done by the
index sweeper thread on the next round, one minute later.

This patch solves this problem by waking all index sweeper threads
after a successful check on the root directory.

Additionally, the index sweep thread scans the index directory
sequentially, but it might happen that after healing a directory entry
more index entries are created but skipped by the current directory
scan. This causes the remaining entries to be processed on the next
round, one minute later. The same can happen in the next round, so
the heal is running in bursts and taking a lot to finish, specially
on volumes with many directory levels.

This patch solves this problem by immediately restarting the index
sweep if a directory has been healed.

Change-Id: I58d9ab6ef17b30f704dc322e1d3d53b904e5f30e
BUG: 1547662
Signed-off-by: Xavi Hernandez

cluster/ec : EC options for GD2

2018-01-22T08:53:36+00:00

Updates #302

Change-Id: I31b4648f7b1a394fceece5cba8120c579c66edd9
Signed-off-by: Sunil Kumar Acharya

locks: added inodelk/entrylk contention upcall notifications

2018-01-16T10:37:22+00:00

The locks xlator now is able to send a contention notification to
the current owner of the lock.

This is only a notification that can be used to improve performance
of some client side operations that might benefit from extended
duration of lock ownership. Nothing is done if the lock owner decides
to ignore the message and to not release the lock. For forced
release of acquired resources, leases must be used.

Change-Id: I7f1ad32a0b4b445505b09908a050080ad848f8e0
Signed-off-by: Xavier Hernandez

cluster/ec: Fix possible shift overflow

2017-12-22T16:19:53+00:00

A coverity scan has revelaed a potential shift overflow while scanning
the bitmap of available subvolumes. The actual overflow cannot happen,
but I've changed to test used to control the limit to make it explicit.

Change-Id: Ieb55f010bbca68a1d86a93e47822f7c709a26e83
BUG: 789278
Signed-off-by: Xavier Hernandez

cluster/ec: Change [f]getxattr to parallel-dispatch-one

2017-12-22T09:25:29+00:00

At the moment in EC, [f]getxattr operations wait to acquire a lock
while other operations are in progress even when it is in the same mount with a
lock on the file/directory. This happens because [f]getxattr operations
follow the model where the operation is wound on 'k' of the bricks and are
matched to make sure the data returned is same on all of them. This consistency
check requires that no other operations are on-going while [f]getxattr
operations are wound to the bricks. We can perform [f]getxattr in
another way as well, where we find the good_mask from the lock that is already
granted and wind the operation on any one of the good bricks and unwind the
answer after adjusting size/blocks to the parent xlator. Since we are taking
into account good_mask, the reply we get will either be before or after a
possible on-going operation. Using this method, the operation doesn't need to
depend on completion of on-going operations which could be taking long time (In
case of some slow disks and writes are in progress etc). Thus we reduce the
time to serve [f]getxattr requests.

I changed [f]getxattr to dispatch-one and added extra logic in
ec_link_has_lock_conflict() to not have any conflicts for fops with
EC_MINIMUM_ONE as fop->minimum to achieve the effect described above.
Modified scripts to make sure READ fop is received in EC to trigger heals.

Updates gluster/glusterfs#368
Change-Id: I3b4ebf89181c336b7b8d5471b0454f016cdaf296
Signed-off-by: Pranith Kumar K

cluster/ec: Add default value for the redundancy option

2017-12-21T07:04:32+00:00

Updates #302

Change-Id: Ifccc6a64c2b2a3b3a0133e6c8042cdf61a0838ce
Signed-off-by: Kamal Mohanan