diff options
author | Raghavendra G <rgowdapp@redhat.com> | 2018-02-08 17:12:41 +0530 |
---|---|---|
committer | Raghavendra G <rgowdapp@redhat.com> | 2018-05-07 06:04:10 +0000 |
commit | 5e356e3f470779433bfe6b0b368676062842b367 (patch) | |
tree | 79f5bcf2ed1b3b0b377cdb36176eb185103dc8f4 /tests | |
parent | 01a7d27bbaf12618fde56ca05cf9c953445493f3 (diff) |
cluster/dht: fixes to parallel renames to same destination codepath
Test case:
# while true; do uuid="`uuidgen`"; echo "some data" > "test$uuid"; mv
"test$uuid" "test" -f || break; echo "done:$uuid"; done
This script was run in parallel from multiple mountpoints
Along the course of getting the above usecase working, many issues
were found:
Issue 1:
=======
consider a case of rename (src, dst). We can encounter a situation
where,
* dst is a file present at the time of lookup
* dst is removed by the time rename fop reaches glusterfs
In this scenario, acquring inodelk on dst fails with ESTALE resulting
in failure of rename. However, as per POSIX irrespective of whether
dst is present or not, rename should be successful. Acquiring entrylk
provides synchronization even in races like this.
Algorithm:
1. Take inodelks on src and dst (if dst is present) on respective
cached subvols. These inodelks are done to preserve backward
compatibility with older clients, so that synchronization is
preserved when a volume is mounted by clients of different
versions. Once relevant older versions (3.10, 3.12, 3.13) reach
EOL, this code can be removed.
2. Ignore ENOENT/ESTALE errors of inodelk on dst.
3. protect namespace of src and dst. To protect namespace of a file,
take inodelk on parent on hashed subvol, then take entrylk on the
same subvol on parent with basename of file. inodelk on parent is
done to guard against changes to parent layout so that hashed
subvol won't change during rename.
4. <rest of rename continues>
5. unlock all locks
Issue 2:
========
linkfile creation in lookup codepath can race with a rename. Imagine
the following scenario:
* lookup finds a data-file with gfid - gfid-dst - without a
corresponding linkto file on hashed-subvol. It decides to create
linkto file with gfid - gfid-dst.
- Note that some codepaths of dht-rename deletes linkto file of
dst as first step. So, a lookup racing with an in-progress
rename can easily run into this situation.
* a rename (src-path:gfid-src, dst-path:gfid-dst) renames data-file
and hence gfid of data-file changes to gfid-src with path dst-path.
* lookup proceeds and creates linkto file - dst-path - with gfid -
dst-gfid - on hashed-subvol.
* rename tries to create a linkto file dst-path with src-gfid on
hashed-subvol, but it fails with EEXIST. But EEXIST is ignored
during linkto file creation.
Now we've ended with dst-path having different gfids - dst-gfid on
linkto file and src-gfid on data file. Future lookups on dst-path will
always fail with ESTALE, due to differing gfids.
The fix is to synchronize linkfile creation in lookup path with rename
using the same mechanism of protecting namespace explained in solution
of Issue 1. Once locks are acquired, before proceeding with linkfile
creation, we check whether conditions for linkto file creation are
still valid. If not, we skip linkto file creation.
Issue 3:
========
gfid of dst-path can change by the time locks are acquired. This
means, either another rename overwrote dst-path or dst-path was
deleted and recreated by a different client. When this happens,
cached-subvol for dst can change. If rename proceeds with old-gfid and
old-cached subvol, we'll end up in inconsistent state(s) like dst-path
with different gfids on different subvols, more than one data-file
being present etc.
Fix is to do the lookup with a new inode after protecting namespace of
dst. Post lookup, we've to compare gfids and correct local state
appropriately to be in sync with backend.
Issue 4:
========
During revalidate lookup, if following a linkto file doesn't lead to a
valid data-file, local->cached-subvol was not reset to NULL. This
means we would be operating on a stale state which can lead to
inconsistency. As a fix, reset it to NULL before proceeding with
lookup everywhere.
Issue 5:
========
Stale dentries left out in inode table on brick resulted in failures
of link fop even though the file/dentry didn't exist on backend fs. A
patch is submitted to fix this issue. Please check the dependency tree
of current patch on gerrit for details
In short, we fix the problem by not blindly trusting the
inode-table. Instead we validate whether dentry is present by doing
lookup on backend fs.
Change-Id: I832e5c47d232f90c4edb1fafc512bf19bebde165
updates: bz#1543279
BUG: 1543279
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Diffstat (limited to 'tests')
-rw-r--r-- | tests/bugs/distribute/bug-1543279.t | 65 | ||||
-rw-r--r-- | tests/include.rc | 3 |
2 files changed, 67 insertions, 1 deletions
diff --git a/tests/bugs/distribute/bug-1543279.t b/tests/bugs/distribute/bug-1543279.t new file mode 100644 index 00000000000..67cc0f50c2f --- /dev/null +++ b/tests/bugs/distribute/bug-1543279.t @@ -0,0 +1,65 @@ +#!/bin/bash + +. $(dirname $0)/../../include.rc +. $(dirname $0)/../../volume.rc +. $(dirname $0)/../../dht.rc + +TESTS_EXPECTED_IN_LOOP=44 +SCRIPT_TIMEOUT=600 + +rename_files() { + MOUNT=$1 + ITERATIONS=$2 + for i in $(seq 1 $ITERATIONS); do uuid="`uuidgen`"; echo "some data" > $MOUNT/test$uuid; mv $MOUNT/test$uuid $MOUNT/test -f || return $?; done +} + +run_test_for_volume() { + VOLUME=$1 + ITERATIONS=$2 + TEST_IN_LOOP $CLI volume start $VOLUME + + TEST_IN_LOOP glusterfs -s $H0 --volfile-id $VOLUME $M0 + TEST_IN_LOOP glusterfs -s $H0 --volfile-id $VOLUME $M1 + TEST_IN_LOOP glusterfs -s $H0 --volfile-id $VOLUME $M2 + TEST_IN_LOOP glusterfs -s $H0 --volfile-id $VOLUME $M3 + + rename_files $M0 $ITERATIONS & + M0_RENAME_PID=$! + + rename_files $M1 $ITERATIONS & + M1_RENAME_PID=$! + + rename_files $M2 $ITERATIONS & + M2_RENAME_PID=$! + + rename_files $M3 $ITERATIONS & + M3_RENAME_PID=$! + + TEST_IN_LOOP wait $M0_RENAME_PID + TEST_IN_LOOP wait $M1_RENAME_PID + TEST_IN_LOOP wait $M2_RENAME_PID + TEST_IN_LOOP wait $M3_RENAME_PID + + TEST_IN_LOOP $CLI volume stop $VOLUME + TEST_IN_LOOP $CLI volume delete $VOLUME + umount $M0 $M1 $M2 $M3 +} + +cleanup + +TEST glusterd +TEST pidof glusterd + +TEST $CLI volume create $V0 $H0:$B0/${V0}{0..8} force +run_test_for_volume $V0 200 + +TEST $CLI volume create $V0 replica 3 arbiter 1 $H0:$B0/${V0}{0..8} force +run_test_for_volume $V0 200 + +TEST $CLI volume create $V0 replica 3 $H0:$B0/${V0}{0..8} force +run_test_for_volume $V0 200 + +TEST $CLI volume create $V0 disperse 6 redundancy 2 $H0:$B0/${V0}{0..5} force +run_test_for_volume $V0 200 + +cleanup diff --git a/tests/include.rc b/tests/include.rc index f02bba1442b..5a99e027df8 100644 --- a/tests/include.rc +++ b/tests/include.rc @@ -1,6 +1,7 @@ M0=${M0:=/mnt/glusterfs/0}; # 0th mount point for FUSE M1=${M1:=/mnt/glusterfs/1}; # 1st mount point for FUSE M2=${M2:=/mnt/glusterfs/2}; # 2nd mount point for FUSE +M3=${M3:=/mnt/glusterfs/3}; # 3rd mount point for FUSE N0=${N0:=/mnt/nfs/0}; # 0th mount point for NFS N1=${N1:=/mnt/nfs/1}; # 1st mount point for NFS V0=${V0:=patchy}; # volume name to use in tests @@ -8,7 +9,7 @@ V1=${V1:=patchy1}; # volume name to use in tests GMV0=${GMV0:=master}; # master volume name to use in geo-rep tests GSV0=${GSV0:=slave}; # slave volume name to use in geo-rep tests B0=${B0:=/d/backends}; # top level of brick directories -WORKDIRS="$B0 $M0 $M1 $M2 $N0 $N1" +WORKDIRS="$B0 $M0 $M1 $M2 $M3 $N0 $N1" ROOT_GFID="00000000-0000-0000-0000-000000000001" DOT_SHARD_GFID="be318638-e8a0-4c6d-977d-7a937aa84806" |