md-cache, afr: Reduce the window of stale read

Problem:
Consider a replica setup, where one mount writes data to a file and
the other mount reads the file. In afr, read operations are not
transaction based. A brick (the read subvolume) is chosen as part of
lookup or other operations, and reads are always wound only to that
read subvolume, even if a write from a different client has failed on
this brick. This stale read continues until there is a lookup or any
write operation from the mount point. Currently, this is not a major
issue, as a lookup is issued before every read and it switches the
read subvolume to a correct one. But with the plan of increasing the
md-cache timeout to 600s, the stale read problem will be more
pronounced, i.e. a stale read can continue for 600s (or more if
cascaded with readdirp), as there will be no lookups.

Solution:
Afr doesn't have any built-in solution for stale reads (without
affecting performance). The solution that came up was to use upcall.
When a file on any brick is marked bad for the first time, upcall
sends a notification to all the clients that had recently accessed
the file. The solution has 2 parts:
- Identifying when a file is marked bad, on any of the bricks,
  for the first time
- Client side actions on receiving the notifications

Identifying when a file is marked bad on any of the bricks for the first time:
-----------------------------------------------------------------------------
The idea is to track xattrop in upcall. xattrop currently comes with
2 afr xattrs: the afr dirty bit and the afr pending xattrs.
The dirty xattr is set to 1 before every write, and is unset if the
write succeeds. In certain scenarios, the dirty xattr can be 0 and the
file can still be a bad copy. Hence the dirty xattr is not tracked.
The pending xattr is set on the good copy, indicating the other bricks
that have a bad copy. Even so, it is not as simple as notifying
whenever any of the pending xattrs change. That could lead to a flood
of notifications if the other brick is completely down or
consistently failing. Hence it is important to notify only once, the
first time a good copy is marked bad.

Client side actions on receiving the pending xattr change notification:
--------------------------------------------------------------------
md-cache will invalidate the cache of that file, so that a further
lookup is passed down to afr and hence updates the read subvolume.
Invalidating only in md-cache is not enough. Consider the following
order of operations:
- pending xattr invalidation: md-cache is invalidated
- readdirp on the bad read subvolume: md-cache is filled again
- lookup (served from md-cache)
- read: wound to the old read subvol.
Hence, along with invalidating md-cache, it is very important to
reset the read subvolume for that file in afr.

Design Credit: Anuradha Talur, Ravishankar N

1. xattrop doesn't carry info saying whether it is a post op or a
   pre op.
2. Pre xattrop will have 0 value for all pending xattrs; the cbk of
   pre xattrop carries the on-disk xattr value. A non-zero value
   indicates healing is required.
3. Post xattrop will have a non-zero value for any of the pending
   xattrs, if the fop failed on any of the bricks.

>Reviewed-on: http://review.gluster.org/15398
>Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
>Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
>Smoke: Gluster Build System <jenkins@build.gluster.org>
>NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
>CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
>Signed-off-by: Poornima G <pgurusid@redhat.com>

Change-Id: I469cbc111714c433984fe1c922be2ef113c25804
BUG: 1399450
Signed-off-by: Poornima G <pgurusid@redhat.com>
Reviewed-on: http://review.gluster.org/15958
Smoke: Gluster Build System <jenkins@build.gluster.org>
NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
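
The "notify only once" rule from the commit message can be illustrated with a small shell function. This is only a sketch under stated assumptions, not the upcall implementation: `should_notify` and its two arguments are hypothetical names, and it assumes the caller already knows the on-disk pending counter before the update and the increment carried by the xattrop.

```shell
#!/bin/bash
# Hypothetical helper, not part of GlusterFS.
#   $1 = on-disk pending counter before the xattrop (0 means no pending heal)
#   $2 = increment requested by the xattrop (0 for a pre-op, per note 2)
# Prints "yes" only on the transition from clean to bad, so repeated
# failures against the same brick do not produce repeated notifications.
should_notify() {
    local on_disk=$1
    local delta=$2
    if [ "$delta" -ne 0 ] && [ "$on_disk" -eq 0 ]; then
        echo yes
    else
        echo no
    fi
}

should_notify 0 0   # pre-op, nothing pending: no
should_notify 0 1   # first failed write on the other brick: yes
should_notify 1 1   # brick still failing: no
```

Keying the decision on the previous on-disk value, rather than on any change to the pending xattr, is what would keep a brick that is down or consistently failing from generating a flood of notifications.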
author: Poornima G <pgurusid@redhat.com> 2016-09-04 08:27:47 +0530
committer: Pranith Kumar Karampuri <pkarampu@redhat.com> 2016-12-01 21:45:41 -0800
commit: b80e1c607b3d3aeaea2f929716b676918dc74cad (patch)
tree: addc8e57f196ceb255f08731b5772ddddc2a1116 /tests
parent: ccecf4f069961ca5c7c392e8702883e17adfe767 (diff)
1 files changed, 44 insertions, 0 deletions
diff --git a/tests/bugs/md-cache/afr-stale-read.t b/tests/bugs/md-cache/afr-stale-read.t
new file mode 100755
index 00000000000..7cee5afe27e
--- /dev/null
+++ b/tests/bugs/md-cache/afr-stale-read.t
@@ -0,0 +1,44 @@
+#!/bin/bash
+
+. $(dirname $0)/../../include.rc
+#. $(dirname $0)/../../volume.rc
+
+cleanup;
+
+#Basic checks
+TEST glusterd
+TEST pidof glusterd
+TEST $CLI volume info
+
+TEST $CLI volume create $V0 replica 2 $H0:$B0/${V0}{1..2};
+
+TEST $CLI volume set $V0 features.cache-invalidation on
+TEST $CLI volume set $V0 features.cache-invalidation-timeout 600
+TEST $CLI volume set $V0 performance.cache-invalidation on
+TEST $CLI volume set $V0 performance.md-cache-timeout 600
+TEST $CLI volume set $V0 performance.cache-samba-metadata on
+TEST $CLI volume set $V0 cluster.self-heal-daemon off
+TEST $CLI volume set $V0 read-subvolume $V0-client-0
+TEST $CLI volume set $V0 performance.quick-read off
+
+TEST $CLI volume start $V0
+
+TEST glusterfs --volfile-id=/$V0 --volfile-server=$H0 $M0
+TEST glusterfs --volfile-id=/$V0 --volfile-server=$H0 $M1
+
+#Write some data from M0 and read it from M1,
+#so that M1 selects a read subvol, and caches the lookup
+TEST `echo "one" > $M0/file1`
+EXPECT "one" cat $M1/file1
+
+#Fail few writes from M0 on brick-0, as a result of this failure
+#upcall in brick-0 will invalidate the read subvolume of M1.
+TEST chattr +i $B0/${V0}1/file1
+TEST `echo "two" > $M0/file1`
+TEST `echo "three" > $M0/file1`
+TEST `echo "four" > $M0/file1`
+TEST `echo "five" > $M0/file1`
+
+EXPECT_WITHIN $MDC_TIMEOUT "five" cat $M1/file1
+TEST chattr -i $B0/${V0}1/file1
+cleanup;