cluster/dht: ignore EEXIST error in mkdir to avoid GFID mismatch

In dht_mkdir_cbk, EEXIST error is treated like a true error. Because of this the following sequence of events can happen, eventually resulting in GFID mismatch and (and possibly leaked locks and hang, in the presence of replicate.) The issue exists when many clients concurrently attempt creation of directory and subdirectory (e.g mkdir -p /mnt/gluster/dir1/subdir) 0. First mkdir happens by one client on the hashed subvolume. Only one client wins the race. Others racing mkdirs get EEXIST. Yet other "laggers" in the race encounter the just-created directory in lookup() on the hash dir. 1. At least one "lagger" lookup() notices that there are missing directories on other subvolumes (which the "winner" mkdir is yet to create), and starts off self-heal of the directory. 2. At least on some subvolumes, self-heal's mkdir wins the race against the "winner" mkdir and creates the directory first. This causes the "winner" mkdir to experience EEXIST error on those subvolumes. 3. On other subvolumes where "winner" mkdir won the race, self-heal experiences EEXIST error, but self-heal is properly translating that into a success (but mkdir code path is not -- which is the bug.) 4. Both mkdir and self-heal assign hash layouts to the just created directory. But self-heal distributes hash range across N (total) subvolumes, whereas mkdir distributes hash range across N - M (where M is the number of subvolumes where mkdir lost the race). Both the clients "cache" their respective layouts in the near future for all future creates inside them (evidence in logs) 5. During the creation of the subdirectory, two clients race again. Ideally winner performs mkdir() on the hashed subvolume and proceeds to create other dirs, loser experiences EEXIST error on the hashed subvolume and backs off. But in this case, because the two clients have different layout views of the parent directory (because of different hash splits and assignements), the hashed subvolumes for the new directory can end up being different. Therefore, both clients now win the race (they were never fighting against each other on a common server), assigning different GFIDs to the directory on their respective (different) subvolumes. Some of the remaining subvolumes get GFID1, others GFID2. Conclusion/Fix: Making mkdir translate EEXIST error as success (just the way self-heal is already rightly doing) will bring back truth to the design claim that concurrent mkdir/self-heals perform deterministic + idempotent operations. This will prevent the differing "hash views" by different clients and thereby also avoid GFID mismatch by forcing all clients to have a "fair race", because the hashed subvolume for all will be the same (and thereby avoiding leaked locks and hangs.) Change-Id: I84592fb9b8a3f739a07e2afb23b33758a0a9a157 BUG: 907072 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4459 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com>
author: Anand Avati <avati@redhat.com> 2013-02-02 18:59:10 -0800
committer: Anand Avati <avati@redhat.com> 2013-02-03 12:14:19 -0800
commit: 80d08f13b0fd6ee0d10f0569165982913339607d (patch)
tree: 904951edac6d0b38b9c597df4258893d1f988d67 /xlators/cluster/dht/src
parent: 34d8edf58ac0379d7495e57c9a846cdec3794b0d (diff)
1 files changed, 9 insertions, 0 deletions
diff --git a/xlators/cluster/dht/src/dht-common.c b/xlators/cluster/dht/src/dht-common.c
index 99cf6f787a3..802bd5a49ee 100644
--- a/xlators/cluster/dht/src/dht-common.c
+++ b/xlators/cluster/dht/src/dht-common.c
@@ -3876,6 +3876,15 @@ dht_mkdir_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                         ret = dht_layout_merge (this, layout, prev->this,
                                                 -1, ENOSPC, NULL);
                 } else {
+			if (op_ret == -1 && op_errno == EEXIST)
+				/* Very likely just a race between mkdir and
+				   self-heal (from lookup of a concurrent mkdir
+				   attempt).
+				   Ignore error for now. layout setting will
+				   anyways fail if this was a different (old)
+				   pre-existing different directory.
+				*/
+				op_ret = 0;
                         ret = dht_layout_merge (this, layout, prev->this,
                                                 op_ret, op_errno, NULL);
                 }
author	Anand Avati <avati@redhat.com>	2013-02-02 18:59:10 -0800
committer	Anand Avati <avati@redhat.com>	2013-02-03 12:14:19 -0800
commit	80d08f13b0fd6ee0d10f0569165982913339607d (patch)
tree	904951edac6d0b38b9c597df4258893d1f988d67 /xlators/cluster/dht/src
parent	34d8edf58ac0379d7495e57c9a846cdec3794b0d (diff)