glusterfs.git/xlators/cluster/ec/src/ec-common.c, branch v3.12.12

cluster/ec: OpenFD heal implementation for EC

2018-02-02T06:50:35+00:00

Existing EC code doesn't try to heal the OpenFD to
avoid unnecessary healing of the data later.

Fix implements the healing of open FDs before
carrying out file operations on them by making an
attempt to open the FDs on required up nodes.

Backport of:
>BUG: 1431955

BUG: 1536334
Change-Id: Ib696f59c41ffd8d5678a484b23a00bb02764ed15
Signed-off-by: Sunil Kumar Acharya

cluster/ec: Non-disruptive upgrade on EC volume fails

2017-07-14T00:26:04+00:00

Problem:
Enabling optimistic changelog on EC volume was not
handling node down scenarios appropriately resulting
in volume data inaccessibility.

Solution:
Update dirty xattr appropriately on good bricks whenever
nodes are down. This would fix the metadata information
as part of heal and thus ensures data accessibility.

BUG: 1468261
Change-Id: I08b0d28df386d9b2b49c3de84b4aac1c729ac057
Signed-off-by: Sunil Kumar Acharya 
Reviewed-on: https://review.gluster.org/17703
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/ec: Get size of file in EC [f]xattrop

2017-07-13T08:10:11+00:00

Problem:
For allowing parallel writes we shouldn't depend on ia_size to be same for
all the bricks in each write_cbk(). But we need to make sure backend size
is correct on all the bricks and no crashes/manual modifications happened.

Fix:
At the time of get_size_version() we do 1 check to make sure size of the file
is same across the bricks. From then on the FOPs will give the status of the
fop, so we rely on this information to keep which bricks are good/bad.

Updates #251
Change-Id: I1df645347e2e9f2e09cfa4411b6cc305d7f4e4e5
Signed-off-by: Pranith Kumar K 
Reviewed-on: https://review.gluster.org/17741
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Xavier Hernandez

cluster/ec: Update xattr and heal size properly

2017-06-06T14:41:52+00:00

Problem-1 : Recursive healing of same file is happening
when IO is going on even after data heal completes.

Solution:
RCA: At the end of the write, when ec_update_size_version
gets called, we send it only on good bricks and not
on healing brick. Due to this, xattr on healing brick
will always remain out of sync and when the background
heal check source and sink, it finds this brick to be
healed and start healing from scratch. That involve
ftruncate and writing all of the data again.

To solve this, send xattrop on all the good bricks as
well as healing bricks.

Problem-2: The above fix exposes the data corruption
during heal. If the write on a file is going on and
heal finishes, we find that the file gets corrupted.

RCA:
The real problem happens in ec_rebuild_data(). Here we receive the
'size' argument which contains the real file size at the time of
starting self-heal and it's assigned to heal->total_size.

After that, a sequence of calls to ec_sync_heal_block() are done. Each
call ends up calling ec_manager_heal_block(), which does the actual work
of healing a block.

First a lock on the inode is taken in state EC_STATE_INIT using
ec_heal_inodelk(). When the lock is acquired, ec_heal_lock_cbk() is
called. This function calls ec_set_inode_size() to store the real size
of the inode (it uses heal->total_size).

The next step is to read the block to be healed. This is done using a
regular ec_readv(). One of the things this call does is to trim the
returned size if the file is smaller than the requested size.

In our case, when we read the last block of a file whose size was = 512
mod 1024 at the time of starting self-heal, ec_readv() will return only
the first 512 bytes, not the whole 1024 bytes.

This isn't a problem since the following ec_writev() sent from the heal
code only attempts to write the amount of data read, so it shouldn't
modify the remaining 512 bytes.

However ec_writev() also checks the file size. If we are writing the
last block of the file (determined by the size stored on the inode that
we have set to heal->total_size), any data beyond the (imposed) end of
file will be cleared with 0's. This causes the 512 bytes after the
heal->total_size to be cleared. Since the file was written after heal
started, the these bytes contained data, so the block written to the
damaged brick will be incorrect.

Solution:
Align heal->total_size to a multiple of the stripe size.

Thanks "Xavier Hernandez" 
to find out the root cause and to fix the issue.

Change-Id: I6c9f37b3ff9dd7f5dc1858ad6f9845c05b4e204e
BUG: 1428673
Signed-off-by: Ashish Pandey 
Reviewed-on: https://review.gluster.org/16985
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri 
Reviewed-by: Xavier Hernandez

cluster/ec : Don't count healing brick as healthy brick

2017-04-12T11:51:05+00:00

In ec_child_select, we should send fop on healing bricks
unconditionaly but to check  the number of healthy bricks
against fragments and minimum count, we should not count
these healing bricks.

Count bits of fop->mask before adding ealing brick to
fop->mask

Change-Id: I3fa80bdd5ca34ca070d610116b84154b917c5999
BUG: 1439527
Signed-off-by: Ashish Pandey 
Reviewed-on: https://review.gluster.org/17007
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Pranith Kumar Karampuri 
CentOS-regression: Gluster Build System

cluster/ec: Don't mark dirty on entry/meta ops in query-info

2017-03-07T17:52:29+00:00

We wanted to mark dirty for metadata/entry operations
whenever query-info is set and info is not yet there because we
are anyway sending xattrop over the network. But this is causing
25% regression from 3.8.8 so removing this optimization

Also fixed two small issues that we didn't find in the previous
patch
1) reconfigure failure was sending return value 0 for optimistic-changelog
2) ec->optimistic_changelog was set to true even before OPTION_INIT

BUG: 1408809
Change-Id: Iabb0b64bd4d3623688790e4b67e5c20b4da977a1
Signed-off-by: Pranith Kumar K 
Reviewed-on: https://review.gluster.org/16865
Reviewed-by: Xavier Hernandez 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System

cluster/ec: Introduce optimistic changelog in EC

2017-03-04T12:37:56+00:00

Problem: Fix to https://bugzilla.redhat.com/show_bug.cgi?id=1316873 has made
changes to set dirty flag before every update fop, data or metadata, and unset
it after successful operation. That makes some of the fops very slow such as
entry operations or metadata operations.

Solution: File data operations are the only operation which take some time and
setting dirty flag before a fop and unsetting it after serves the purpose as
probability of failure of a fop is high when the time duration is more. For all
the other operations, set dirty flag at the end of the fop, if any brick is
down and need heal.

Providing following option to choose between high performance or better heal
marking for metadata and entry fops.

Set/Unset dirty flag for every update fop at the start of the fop. If ON, this
option impacts performance of entry operations or metadata operations as it
will set dirty flag at the start and unset it at the end of ALL update fop. If
OFF and all the bricks are good, dirty flag will be set at the start only for
file fops For metadata and entry fops dirty flag will not be set at the start,
if all the bricks are good. This does not impact performance for metadata
operations and entry operation but has a very small window to miss marking
entry as dirty in case it is required to be healed.

Thanks to Xavi and Ashish for the design
Picked the .t file from Ashish' patch https://review.gluster.org/16298

BUG: 1408809
Change-Id: I3ce860063f0e2901e50754dcfc3e4ed22daf819f
Signed-off-by: Pranith Kumar K 
Reviewed-on: https://review.gluster.org/16821
Smoke: Gluster Build System 
Reviewed-by: Xavier Hernandez 
Tested-by: Xavier Hernandez 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

cluster/ec: Don't trigger data/metadata heal on Lookups

2017-02-27T03:06:55+00:00

Problem-1
If Lookup which doesn't take any locks observes version mismatch it can't be
trusted. If we launch a heal based on this information it will lead to
self-heals which will affect I/O performance in the cases where Lookup is
wrong. Considering self-heal-daemon and operations on the inode from client
which take locks can still trigger heal we can choose to not attempt a heal on
Lookup.

Problem-2:
Fixed spurious failure of
tests/bitrot/bug-1373520.t
For the issues above, what was happening was that ec_heal_inspect()
is preventing 'name' heal to happen

Problem-3:
tests/basic/ec/ec-background-heals.t
To be honest I don't know what the problem was, while fixing
the 2 problems above, I made some changes to ec_heal_inspect() and
ec_need_heal() after which when I tried to recreate the spurious
failure it just didn't happen even after a long time.

BUG: 1414287
Signed-off-by: Pranith Kumar K 
Change-Id: Ife2535e1d0b267712973673f6d474e288f3c6834
Reviewed-on: https://review.gluster.org/16468
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Xavier Hernandez 
CentOS-regression: Gluster Build System 
Reviewed-by: Ashish Pandey

cluster/ec: Change level of messages to DEBUG

2017-01-27T11:51:39+00:00

Heal failed or passed should not be logged as warning.
These can be observed from heal info if the heal is
happening or not. If we require to debug a case where
heal is not happening, we can set the level to DEBUG.

Change-Id: I347665c8c8b6223bb08a9f3dd5643a10ddc3b93e
BUG: 1417050
Signed-off-by: Ashish Pandey 
Reviewed-on: https://review.gluster.org/16473
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Xavier Hernandez 
CentOS-regression: Gluster Build System

cluster/disperse: Do not log fop failed for lockless fops

2017-01-20T07:39:40+00:00

Problem: Operation failed messages are getting logged
based on the callbacks of lockless fop's. If a fop does
not take a lock, it is possible that it will get some
out of sync xattr, iatts. We can not depend on these
callback to psay that the fop has failed.

Solution: Print failed messages only for locked fops.
However, heal would still be triggered.

Change-Id: I4427402c8c944c23f16073613caa03ea788bead3
BUG: 1414287
Signed-off-by: Ashish Pandey 
Reviewed-on: http://review.gluster.org/16435
Reviewed-by: Xavier Hernandez 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System