glusterfs.git/xlators/cluster/afr/src, branch release-6

afr: event gen changes

2020-07-08T02:19:40+00:00

The general idea of the changes is to prevent resetting event generation
to zero in the inode ctx, since event gen is something that should
follow 'causal order'.

Change #1:
For a read txn, in inode refresh cbk, if event_generation is
found zero, we are failing the read fop. This is not needed
because change in event gen is only a marker for the next inode refresh to
happen and should not be taken into account by the current read txn.

Change #2:
The event gen being zero above can happen if there is a racing lookup,
which resets even get (in afr_lookup_done) if there are non zero afr
xattrs. The resetting is done only to trigger an inode refresh and a
possible client side heal on the next lookup. That can be acheived by
setting the need_refresh flag in the inode ctx. So replaced all
occurences of resetting even gen to zero with a call to
afr_inode_need_refresh_set().

Change #3:
In both lookup and discover path, we are doing an inode refresh which is
not required since all 3 essentially do the same thing- update the inode
ctx with the good/bad copies from the brick replies. Inode refresh also
triggers background heals, but I think it is okay to do it when we call
refresh during the read and write txns and not in the lookup path.

The .ts which relied on inode refresh in lookup path to trigger heals are
now changed to do read txn so that inode refresh and the heal happens.

Change-Id: Iebf39a9be6ffd7ffd6e4046c96b0fa78ade6c5ec
Fixes: #1179
Signed-off-by: Ravishankar N 
Reported-by: Erik Jacobson 
(cherry picked from commit f0fcd909ad4535b60c9208d4804ebe6afe421a09)

cluster/afr: Prioritize ENOSPC over other errors

2020-07-08T01:29:06+00:00

Problem:
In a replicate/arbiter volume if file creations or writes fails on
quorum number of bricks and on one brick it is due to ENOSPC and
on other brick it fails for a different reason, it may fail with
errors other than ENOSPC in some cases.

Fix:
Prioritize ENOSPC over other lesser priority errors and do not set
op_errno in posix_gfid_set if op_ret is 0 to avoid receiving any
error_no which can be misinterpreted by __afr_dir_write_finalize().

Also removing the function afr_has_arbiter_fop_cbk_quorum() which
might consider a successful reply form a single brick as quorum
success in some cases, whereas we always need fop to be successful
on quorum number of bricks in arbiter configuration.

Change-Id: I106e267f8b9451f681022f1cccb410d9bc824c08
Fixes: #1254
Signed-off-by: karthik-us 
(cherry picked from commit fa63b45ca5edf172b1b89b28b5db3c5129cc57b6)

afr: more quorum checks in lookup and new entry marking

2020-07-08T01:26:32+00:00

Problem: See github issue for details.

Fix:
-In lookup if the entry exists in 2 out of 3 bricks, don't fail the
lookup with ENOENT just because there is an entrylk on the parent.
Consider quorum before deciding.

-If entry FOP does not succeed on quorum no. of bricks, do not perform
new entry mark.

Fixes: #1303
Change-Id: I56df8c89ad53b29fa450c7930a7b7ccec9f4a6c5
Signed-off-by: Ravishankar N 
(cherry picked from commit c4a6748f25d2c1ab3ebcf89952278ebf94c8d371)

afr: mark pending xattrs as a part of metadata heal

2020-04-20T07:40:10+00:00

...if pending xattrs are zero for all children.

Problem:
If there are no pending xattrs and a metadata heal needs to be
performed, it can be possible that we end up with xattrs inadvertendly
deleted from all bricks, as explained in the  BZ.

Fix:
After picking one among the sources as the good copy, mark pending xattrs on
all sources to blame the sinks. Now even if this metadata heal fails midway,
a subsequent heal will still choose one of the valid sources that it
picked previously.

Updates: #1067
Change-Id: If1b050b70b0ad911e162c04db4d89b263e2b8d7b
Signed-off-by: Ravishankar N 
(cherry picked from commit 2d5ba449e9200b16184b1e7fc84cabd015f1f779)

cluster/afr: fix race when bricks come up

2020-04-20T07:36:23+00:00

The was a problem when self-heal was sending lookups at the same time
that one of the bricks was coming up. In this case there was a chance
that the number of 'up' bricks changes in the middle of sending the
requests to subvolumes which caused a discrepancy in the expected
number of replies and the actual number of sent requests.

This discrepancy caused that AFR continued executing requests before
all requests were complete. Eventually, the frame of the pending
request was destroyed when the operation terminated, causing a use-
after-free issue when the answer was finally received.

In theory the same thing could happen in the reverse way, i.e. AFR
tries to wait for more replies than sent requests, causing a hang.

Backport of:
> Change-Id: I7ed6108554ca379d532efb1a29b2de8085410b70
> Signed-off-by: Xavi Hernandez 
> Fixes: bz#1808875

Change-Id: I7ed6108554ca379d532efb1a29b2de8085410b70
Signed-off-by: Xavi Hernandez 
Fixes: bz#1809439

afr: prevent spurious entry heals leading to gfid split-brain

2020-02-28T06:06:10+00:00

Problem:
In a hyperconverged setup with granular-entry-heal enabled, if a file is
recreated while one of the bricks is down, and an index heal is triggered
(with the brick still down), entry-self heal was doing a spurious heal
with just the 2 good bricks. It was doing a post-op leading to removal
of the filename from .glusterfs/indices/entry-changes as well as
erroneous setting of afr xattrs on the parent. When the brick came up,
the xattrs were cleared, resulting in the renamed file not getting
healed and leading to gfid split-brain and EIO on the mount.

Fix:
Proceed with entry heal only when shd can connect to all bricks of the replica,
just like in data and metadata heal.

fixes: bz#1804594
Change-Id: I916ae26ad1fabf259bc6362da52d433b7223b17e
Signed-off-by: Ravishankar N 
(cherry picked from commit 06453d77d056fbaa393a137ca277a20e38d2f67e)

cluster/thin-arbiter: Wait for TA connection before ta-file lookup

2020-02-26T11:18:54+00:00

Problem:
When we mount a ta volume, as soon as 2 data bricks are connected
we consider that the mount is done and then send a lookup/create on
ta file on ta node. However, this connection with ta node might not
have been completed.
Due to this delay, ta replica id file will not be created and we
will see ENOTCONN error in log file if we do lookup.

Solution:
As we know that this ta node could have a higher latency, we should
wait for reasonable time for connection to happen before sending
lookup/create on replica id file.

fixes: bz#1804546
Change-Id: I36f90865afe617e4e84cee57fec832a16f5dd6cc
(cherry picked from commit a7fa54ddea3fe429f143b37e4de06a93b49d776a)

Cluster/afr: Don't treat all bricks having metadata pending as split-brain

2020-02-25T07:06:51+00:00

Problem:
We currently don't have a roll-back/undoing of post-ops if quorum is not met.
Though the FOP is still unwound with failure, the xattrs remain on the disk.
Due to these partial post-ops and partial heals (healing only when 2 bricks
are up), we can end up in metadata split-brain purely from the afr xattrs
point of view i.e each brick is blamed by atleast one of the others for
metadata. These scenarios are hit when there is frequent connect/disconnect
of the client/shd to the bricks.

Fix:
Pick a source based on the xattr values. If 2 bricks blame one, the blamed
one must be treated as sink. If there is no majority, all are sources. Once
we pick a source, self-heal will then do the heal instead of erroring out
due to split-brain.
This patch also adds restriction of all the bricks to be up to perform
metadata heal to avoid any metadata loss.

Removed the test case tests/bugs/replicate/bug-1468279-source-not-blaming-sinks.t
as it was doing metadata heal even when only 2 of 3 bricks were up.

Change-Id: I07a9d62f84ceda329dcab1f02a33aeed258dcb09
fixes: bz#1805097
Signed-off-by: karthik-us

cluster/afr: Heal entries when there is a source & no healed_sinks

2019-10-17T10:52:54+00:00

Problem:
In a situation where B1 blames B2, B2 blames B1 and B3 doesn't blame
anything for entry heal, heal will not complete even though we have
clear source and sinks. This will happen because while doing
afr_selfheal_find_direction() only the bricks which are blamed by
non-accused bricks are considered as sinks. Later in
__afr_selfheal_entry_finalize_source() when it tries to mark all the
non-sources as sinks it fails to do so because there won't be any
healed_sinks marked, no witness present and there will be a source.

Fix:
If there is a source and no healed_sinks, then reset all the locked
sources to 0 and healed sinks to 1 to do conservative merge.

Change-Id: If40d8bc95d52a52b2730f55bdcf135109b421548
Fixes: bz#1760706
Signed-off-by: karthik-us

afr: support split-brain CLI for replica 3

2019-10-17T10:51:33+00:00

Ever since we added quorum checks for lookups in afr via commit
bd44d59741bb8c0f5d7a62c5b1094179dd0ce8a4, the split-brain resolution
commands would not work for replica 3 because there would be no
readables for the lookup fop.

The argument was that split-brains do not occur in replica 3 but we do
see (data/metadata) split-brain cases once in a while which indicate that there are
a few bugs/corner cases yet to be discovered and fixed.

Fortunately, commit  8016d51a3bbd410b0b927ed66be50a09574b7982 added
GF_CLIENT_PID_GLFS_HEALD as the pid for all fops made by glfsheal. If we
leverage this and allow lookups in afr when pid is GF_CLIENT_PID_GLFS_HEALD,
split-brain resolution commands will work for replica 3 volumes too.

Likewise, the check is added in shard_lookup as well to permit resolving
split-brains by specifying "/.shard/shard-file.xx" as the file name
(which previously used to fail with EPERM).

Change-Id: I3c543dea79caf7cfbc1633e9089cb1cdd2538ba9
Fixes: bz#1760792
Signed-off-by: Ravishankar N 
(cherry picked from commit 47dbd753187f69b3835d2e42fdbe7485874c4b3e)