glusterfs.git/tests, branch v5.13

afr: prevent spurious entry heals leading to gfid split-brain

2020-04-06T04:46:50+00:00

Problem:
In a hyperconverged setup with granular-entry-heal enabled, if a file is
recreated while one of the bricks is down, and an index heal is triggered
(with the brick still down), entry-self heal was doing a spurious heal
with just the 2 good bricks. It was doing a post-op leading to removal
of the filename from .glusterfs/indices/entry-changes as well as
erroneous setting of afr xattrs on the parent. When the brick came up,
the xattrs were cleared, resulting in the renamed file not getting
healed and leading to gfid split-brain and EIO on the mount.

Fix:
Proceed with entry heal only when shd can connect to all bricks of the replica,
just like in data and metadata heal.

fixes: #1103
Change-Id: I916ae26ad1fabf259bc6362da52d433b7223b17e
Signed-off-by: Ravishankar N 
(cherry picked from commit 06453d77d056fbaa393a137ca277a20e38d2f67e)

afr: mark pending xattrs as a part of metadata heal

2020-04-02T04:35:50+00:00

...if pending xattrs are zero for all children.

Problem:
If there are no pending xattrs and a metadata heal needs to be
performed, it can be possible that we end up with xattrs inadvertendly
deleted from all bricks, as explained in the  BZ.

Fix:
After picking one among the sources as the good copy, mark pending xattrs on
all sources to blame the sinks. Now even if this metadata heal fails midway,
a subsequent heal will still choose one of the valid sources that it
picked previously.

Updates: #1067
Change-Id: If1b050b70b0ad911e162c04db4d89b263e2b8d7b
Signed-off-by: Ravishankar N 
(cherry picked from commit 2d5ba449e9200b16184b1e7fc84cabd015f1f779)

Cluster/afr: Don't treat all bricks having metadata pending as split-brain

2020-03-02T09:24:06+00:00

Problem:
We currently don't have a roll-back/undoing of post-ops if quorum is not met.
Though the FOP is still unwound with failure, the xattrs remain on the disk.
Due to these partial post-ops and partial heals (healing only when 2 bricks
are up), we can end up in metadata split-brain purely from the afr xattrs
point of view i.e each brick is blamed by atleast one of the others for
metadata. These scenarios are hit when there is frequent connect/disconnect
of the client/shd to the bricks.

Fix:
Pick a source based on the xattr values. If 2 bricks blame one, the blamed
one must be treated as sink. If there is no majority, all are sources. Once
we pick a source, self-heal will then do the heal instead of erroring out
due to split-brain.
This patch also adds restriction of all the bricks to be up to perform
metadata heal to avoid any metadata loss.

Removed the test case tests/bugs/replicate/bug-1468279-source-not-blaming-sinks.t
as it was doing metadata heal even when only 2 of 3 bricks were up.

Change-Id: I07a9d62f84ceda329dcab1f02a33aeed258dcb09
fixes: bz#1806931
Signed-off-by: karthik-us

protocol/client: Do not fallback to anon-fd if fd is not open

2020-03-02T08:08:17+00:00

If an open comes on a file when a brick is down and after the brick comes up,
a fop comes on the fd, client xlator would still wind the fop on anon-fd
leading to wrong behavior of the fops in some cases.

Example:
If lk fop is issued on the fd just after the brick is up in the scenario above,
lk fop will be sent on anon-fd instead of failing it on that client xlator.
This lock will never be freed upon close of the fd as flush on anon-fd is
invalid and is not wound below server xlator.

As a fix, failing the fop unless the fd has FALLBACK_TO_ANON_FD flag.

> Change-Id: I77692d056660b2858e323bdabdfe0a381807cccc
> fixes bz#1390914
> Signed-off-by: Pranith Kumar K 

Change-Id: I77692d056660b2858e323bdabdfe0a381807cccc
fixes bz#1808256
Signed-off-by: Mohit Agrawal

afr: wake up index healer threads

2020-02-27T07:16:20+00:00

Backport of https://review.gluster.org/#/c/glusterfs/+/23288/

...whenever shd is re-enabled after disabling or there is a change in
`cluster.heal-timeout`, without needing to restart shd or waiting for the
current `cluster.heal-timeout` seconds to expire.

See BZ 1743988 for more details.

Change-Id: Ia5ebd7c8e9f5b54cba3199c141fdd1af2f9b9bfe
fixes: bz#1807431
Reported-by: Glen Kiessling 
Signed-off-by: Ravishankar N

tests: fix spurious failure of bug-1402841.t-mt-dir-scan-race.t

2020-02-27T06:40:16+00:00

Problem:
Since commit 600ba94183333c4af9b4a09616690994fd528478, shd starts
healing as soon as it is toggled from disabled to enabled. This was
causing the following line in the .t to fail on a 'fast' machine (always
on my laptop and sometimes on the jenkins slaves).

EXPECT_NOT "^0$" get_pending_heal_count $V0

because by the time shd was disabled, the heal was already completed.

Fix:
Increase the no. of files to be healed and make it a variable called
FILE_COUNT, should we need to bump it up further because the machines
become even faster. Also created pending metadata heals to increase the
time taken to heal a file.

fixes: bz#1807748
Change-Id: I5a26b08e45b8c19bce3c01ce67bdcc28ed48198d
Signed-off-by: Ravishankar N 
(cherry picked from commit 724c657995a2e148243eeb78c68b620c6d7714a5)

cluster/ec: fix EIO error for concurrent writes on sparse files

2020-02-25T06:48:11+00:00

EC doesn't allow concurrent writes on overlapping areas, they are
serialized. However non-overlapping writes are serviced in parallel.
When a write is not aligned, EC first needs to read the entire chunk
from disk, apply the modified fragment and write it again.

The problem appears on sparse files because a write to an offset
implicitly creates data on offsets below it (so, in some way, they
are overlapping). For example, if a file is empty and we read 10 bytes
from offset 10, read() will return 0 bytes. Now, if we write one byte
at offset 1M and retry the same read, the system call will return 10
bytes (all containing 0's).

So if we have two writes, the first one at offset 10 and the second one
at offset 1M, EC will send both in parallel because they do not overlap.
However, the first one will try to read missing data from the first chunk
(i.e. offsets 0 to 9) to recombine the entire chunk and do the final write.
This read will happen in parallel with the write to 1M. What could happen
is that half of the bricks process the write before the read, and the
half do the read before the write. Some bricks will return 10 bytes of
data while the otherw will return 0 bytes (because the file on the brick
has not been expanded yet).

When EC tries to recombine the answers from the bricks, it can't, because
it needs more than half consistent answers to recover the data. So this
read fails with EIO error. This error is propagated to the parent write,
which is aborted and EIO is returned to the application.

The issue happened because EC assumed that a write to a given offset
implies that offsets below it exist.

This fix prevents the read of the chunk from bricks if the current size
of the file is smaller than the read chunk offset. This size is
correctly tracked, so this fixes the issue.

Also modifying ec-stripe.t file for Test #13 within it.
In this patch, if a file size is less than the offset we are writing, we
fill zeros in head and tail and do not consider it strip cache miss.
That actually make sense as we know what data that part holds and there is
no need of reading it from bricks.

Change-Id: Ic342e8c35c555b8534109e9314c9a0710b6225d6
Fixes: bz#1805053
Signed-off-by: Xavi Hernandez

cluster/ec: Prevent double pre-op xattrops

2020-02-24T07:55:55+00:00

Problem:
Race:
Thread-1                                    Thread-2
1) Does ec_get_size_version() to perform
pre-op fxattrop as part of write-1
                                           2) Calls ec_set_dirty_flag() in
                                              ec_get_size_version() for write-2.
					      This sets dirty[] to 1
3) Completes executing
ec_prepare_update_cbk leading to
ctx->dirty[] = '1'
					   4) Takes LOCK(inode->lock) to check if there are
					      any flags and sets dirty-flag because
				              lock->waiting_flag is 0 now. This leads to
					      fxattrop to increment on-disk dirty[] to '2'

At the end of the writes the file will be marked for heal even when it doesn't need heal.

Fix:
Perform ec_set_dirty_flag() and other checks inside LOCK() to prevent dirty[] to be marked
as '1' in step 2) above

Fixes: bz#1805050
Change-Id: Icac2ab39c0b1e7e154387800fbededc561612865
Signed-off-by: Pranith Kumar K

server: Mount fails after reboot 1/3 gluster nodes

2020-02-24T07:50:34+00:00

Problem: At the time of coming up one server node(1x3) after reboot
client is unmounted.The client is unmounted because a client
is getting AUTH_FAILED event and client call fini for the graph.The
client is getting AUTH_FAILED because brick is not attached with a
graph at that moment

Solution: To avoid the unmounting the client graph throw ENOENT error
          from server in case if brick is not attached with server at
          the time of authenticate clients.

> Credits: Xavi Hernandez 
> Change-Id: Ie6fbd73cbcf23a35d8db8841b3b6036e87682f5e
> Fixes: bz#1793852
> Signed-off-by: Mohit Agrawal 
> (cherry picked from commit f6421dff22a6ddaf14134f6894deae219948c89d)

Fixes: bz#1804512
Change-Id: Ie6fbd73cbcf23a35d8db8841b3b6036e87682f5e
Signed-off-by: Mohit Agrawal

cluster/ec: fix fd reopen

2020-02-20T08:26:40+00:00

Currently EC tries to reopen fd's that have been opened while a brick
was down. This is done as part of regular write operations, just after
having acquired the locks, and it's sent as a sub-fop of the main write
fop.

There were two problems:

1. The reopen was attempted on all UP bricks, even if a previous lock
didn't succeed. This is incorrect because most probably the open will
fail.

2. If reopen is sent and fails, the error is propagated to the main
operation, causing it to fail when it shouldn't.

To fix this, we only attempt reopens on bricks where the current fop
owns a lock, and we prevent any error to be propagated to the main
fop.

To implement this behaviour an argument used to indicate the minimum
number of required answers has overloaded to also include some flags. To
make the change consistent, it has been necessary to rename the
argument, which means that a lot of files have been changed. However
there are no functional changes.

This change has also uncovered a problem in discard code, which didn't
correctely process requests of small sizes because no real discard fop
was being processed, only a write of 0's on some region. In this case
some fields of the fop remained uninitialized or with incorrect values.
To fix this, a new function has been created to simulate success on a
fop and it's used in the discard case.

Thanks to Pranith for providing a test script that has also detected an
issue in this patch. This patch includes a small modification of this
script to force data to be written into bricks before stopping them.

Change-Id: If272343873369186c2fb8f43c1d9c52c3ea304ec
Fixes: bz#1805047
Signed-off-by: Xavi Hernandez