author | Pranith Kumar K <pkarampu@redhat.com> | 2014-09-17 11:33:23 +0530 |
---|---|---|
committer | Vijay Bellur <vbellur@redhat.com> | 2014-09-29 23:30:31 -0700 |
commit | 86b4c0319d4275859575720eced3200583942cfb | |
tree | a2bca0671aa95e6b6cce68d0217a3b97a7167a20 | |
parent | 369f59a91e2aee13a6e12ef78e7188f29a819ff7 |
cluster/afr: Launch self-heal only when all the brick status is known
Problem:
A file goes into split-brain because its xattrs are erased incorrectly.
RCA:
The issue happens because index self-heal is triggered before all the
bricks are up. As a result, when the xattrs are erased, they are erased only
on the sink brick, and only for the brick that self-heal believes is up,
which leads to split-brain.
Example:
Let's say the xattrs before the heal started are:
brick 2:
trusted.afr.vol1-client-2=0x000000020000000000000000
trusted.afr.vol1-client-3=0x000000020000000000000000
brick 3:
trusted.afr.vol1-client-2=0x000010040000000000000000
trusted.afr.vol1-client-3=0x000000000000000000000000
If only brick-2 had come up when the self-heal was triggered, only
'trusted.afr.vol1-client-2' is erased, leading to the following xattrs:
brick 2:
trusted.afr.vol1-client-2=0x000000000000000000000000
trusted.afr.vol1-client-3=0x000000020000000000000000
brick 3:
trusted.afr.vol1-client-2=0x000010040000000000000000
trusted.afr.vol1-client-3=0x000000000000000000000000
So the file goes into split-brain.
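
To make the values above easier to read, here is a minimal sketch of how one of these xattr strings can be decoded. It assumes the usual AFR changelog layout of three big-endian 32-bit pending counters (data, metadata, entry). The names `struct afr_pending` and `decode_afr_xattr` are hypothetical helpers for illustration and are not part of the GlusterFS source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the three pending counters of one xattr. */
struct afr_pending {
        uint32_t data;
        uint32_t metadata;
        uint32_t entry;
};

/*
 * Decode a string such as "0x000000020000000000000000" (the form shown
 * in the commit message) into its three counters.
 * Returns 0 on success and -1 if the string is malformed.
 */
int
decode_afr_xattr (const char *hex, struct afr_pending *out)
{
        char     field[9];
        uint32_t vals[3];
        int      i;

        assert (hex != NULL && out != NULL);

        /* Expect "0x" followed by exactly 24 hex digits (12 bytes). */
        if (strncmp (hex, "0x", 2) != 0 || strlen (hex) != 2 + 24)
                return -1;
        if (strspn (hex + 2, "0123456789abcdefABCDEF") != 24)
                return -1;

        /* Each counter occupies 8 hex digits, most significant first. */
        for (i = 0; i < 3; i++) {
                memcpy (field, hex + 2 + 8 * i, 8);
                field[8] = '\0';
                vals[i] = (uint32_t) strtoul (field, NULL, 16);
        }

        out->data     = vals[0];
        out->metadata = vals[1];
        out->entry    = vals[2];
        return 0;
}
```

Under that assumed layout, `0x000000020000000000000000` decodes to a data-pending count of 2, and `0x000010040000000000000000` to a data-pending count of 0x1004, with metadata and entry counts of 0 in both cases. After the partial erase, brick 2 still holds a non-zero count against client-3 while brick 3 holds one against client-2.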
BUG: 1142612
Change-Id: I0c8b66e154f03b636db052c97745399a7cca265b
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/8756
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
-rw-r--r-- | xlators/cluster/afr/src/afr-common.c | 17 |
1 files changed, 15 insertions, 2 deletions
diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
index 3e745e2491e..42ff70937ac 100644
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ -3678,8 +3678,21 @@ afr_notify (xlator_t *this, int32_t event,
         if (propagate)
                 ret = default_notify (this, event, data);
 
-        if (call_psh && priv->shd.iamshd) {
-                afr_selfheal_childup (this, up_child);
+        if (!had_heard_from_all && have_heard_from_all && priv->shd.iamshd) {
+                /*
+                 * Since self-heal is supposed to be launched only after
+                 * the responses from all the bricks are collected,
+                 * launch self-heals now on all up subvols.
+                 */
+                for (i = 0; i < priv->child_count; i++)
+                        if (priv->child_up[i])
+                                afr_selfheal_childup (this, i);
+        } else if (have_heard_from_all && call_psh && priv->shd.iamshd) {
+                /*
+                 * Already heard from everyone. Just launch heal on now up
+                 * subvolume.
+                 */
+                afr_selfheal_childup (this, up_child);
         }
 out:
         return ret;
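
The gating condition in the hunk above can be exercised in isolation with a small simulation. This is only a sketch under stated assumptions: `struct sim_priv`, `heard_from_all` and `sim_notify` are hypothetical stand-ins, not the real `afr_private_t` or `afr_notify`, and a child-up event is assumed to play the role of `call_psh`. Instead of calling `afr_selfheal_childup`, the function returns the indices of the children on which heal would be launched.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CHILDREN 8

/* Hypothetical, stripped-down stand-in for the AFR private state. */
struct sim_priv {
        int  child_count;
        bool child_up[MAX_CHILDREN];    /* brick is currently up */
        bool heard_from[MAX_CHILDREN];  /* some up/down event was received */
        bool iamshd;                    /* running as the self-heal daemon */
};

/* True once every child has reported either up or down. */
static bool
heard_from_all (const struct sim_priv *priv)
{
        int i;

        for (i = 0; i < priv->child_count; i++)
                if (!priv->heard_from[i])
                        return false;
        return true;
}

/*
 * Record an up/down event for child idx and fill heal[] with the
 * children on which self-heal would be launched. Returns their count.
 */
int
sim_notify (struct sim_priv *priv, int idx, bool up, int heal[MAX_CHILDREN])
{
        bool had_heard_from_all;
        bool have_heard_from_all;
        int  i;
        int  n = 0;

        assert (idx >= 0 && idx < priv->child_count);

        had_heard_from_all = heard_from_all (priv);
        priv->heard_from[idx] = true;
        priv->child_up[idx] = up;
        have_heard_from_all = heard_from_all (priv);

        if (!priv->iamshd)
                return 0;

        if (!had_heard_from_all && have_heard_from_all) {
                /* Status of every brick just became known: heal all up ones. */
                for (i = 0; i < priv->child_count; i++)
                        if (priv->child_up[i])
                                heal[n++] = i;
        } else if (have_heard_from_all && up) {
                /* Already heard from everyone: heal only the new arrival. */
                heal[n++] = idx;
        }
        return n;
}
```

With two children, the first child-up event launches nothing because the second brick's status is still unknown. The second event launches heal on both up children. A later down/up cycle of one brick launches heal on that brick only.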