diff options
author | Poornima G <pgurusid@redhat.com> | 2016-09-04 08:27:47 +0530 |
---|---|---|
committer | Pranith Kumar Karampuri <pkarampu@redhat.com> | 2016-10-20 00:07:55 -0700 |
commit | 8d8eded58cd5431a7000a70337444b828cb400d8 (patch) | |
tree | 3b5562dec5bd3fdfdb5e0a6d81684efbd16f8478 /xlators/cluster/afr | |
parent | a482645865af56b8d34a17430649c3c646f3d450 (diff) |
md-cache, afr: Reduce the window of stale read
Problem:
Consider a replica setup, where one mount writes data to a
file and the other mount reads the file. In afr, read operations
are not transaction based, a brick(read subvolume) is chosen as
a part of lookup or other operations, read is always wound only
to the read subvolume, even if there was write from a different client
that failed on this brick. This stale read continues until there is
a lookup or any write operation from the mount point. Currently, this
is not a major issue, as a lookup is issued before every read and it will
switch the read subvolume to a correct one. But with the plan of
increasing md-cache timeout to 600s, the stale read problem will be
more pronounced, i.e. stale read can continue for 600s(or more if cascaded
with readdirp), as there will be no lookups.
Solution:
Afr doesn't have any built-in solution for stale read(without affecting
the performance). The solution that came up, was to use upcall. When a file
on any brick is marked bad for the first time, upcall sends a notification
to all the clients that had recently accessed the file. The solution has
2 parts:
- Identifying when a file is marked bad, on any of the bricks,
for the first time
- Client side actions on recieving the notifications
Identifying when a file is marked bad on any of the bricks for the first time:
-----------------------------------------------------------------------------
The idea is to track xattrop in upcall. xattrop currently comes with 2 afr
xattrs - afr dirty bit and afr pending xattrs.
Dirty xattr is set to 1 before every write, and is unset if write succeeds.
In certain scenarios, dirty xattr can be 0 and still the file could be bad
copy. Hence do not track dirty xattr.
Pending xattr is set on the good copy, indicating the other bricks that have
bad copy. It is still not as simple as, notifying when any of the pending xattrs
change. It could lead to flood of notifcations, in case the other brick is
completely down or consistantly failing. Hence it is important to notify only
once, the first time a good copy is marked bad.
Client side actions on recieving pending xattr change, notification:
--------------------------------------------------------------------
md-cache will invalidate the cache of that file, so that further lookup is
passed down to afr and hence update the read subvolume. Invalidating only in
md-cache is not enough, consider the folling oder of opertaions:
- pending xattr invalidation - invalidate md-cache
- readdirp on the bad read subvolume - fill md-cache
- lookup (served from md-cache)
- read - wound to the old read subvol.
Hence, along with invalidating md-cache, it is very important to reset the
read subvolume for that file, in afr.
Design Credit: Anuradha Talur, Ravishankar N
1. xattrop doesn't carry info saying post op/pre op.
2. Pre xattrop will have 0 value for all pending xattrs,
the cbk of pre xattrop carries the on-disk xattr value.
Non zero indicated healing is required.
3. Post xattrop will have non zero value for any of the
pending xattrs, if the fop failed on any of the bricks.
Change-Id: I469cbc111714c433984fe1c922be2ef113c25804
BUG: 1211863
Signed-off-by: Poornima G <pgurusid@redhat.com>
Reviewed-on: http://review.gluster.org/15398
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Smoke: Gluster Build System <jenkins@build.gluster.org>
NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
Diffstat (limited to 'xlators/cluster/afr')
-rw-r--r-- | xlators/cluster/afr/src/afr-common.c | 41 | ||||
-rw-r--r-- | xlators/cluster/afr/src/afr.h | 1 |
2 files changed, 40 insertions, 2 deletions
diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c index c8c6642d1e7..e77f0dd6891 100644 --- a/xlators/cluster/afr/src/afr-common.c +++ b/xlators/cluster/afr/src/afr-common.c @@ -32,7 +32,7 @@ #include "statedump.h" #include "inode.h" #include "events.h" - +#include "upcall-utils.h" #include "fd.h" #include "afr-inode-read.h" @@ -4231,6 +4231,14 @@ afr_ipc (call_frame_t *frame, xlator_t *this, int32_t op, dict_t *xdata) goto err; call_cnt = local->call_count; + + if (xdata) { + for (i = 0; i < priv->child_count; i++) { + if (dict_set_int8 (xdata, priv->pending_key[i], 0) < 0) + goto err; + } + } + for (i = 0; i < priv->child_count; i++) { if (!local->child_up[i]) continue; @@ -4468,6 +4476,10 @@ afr_notify (xlator_t *this, int32_t event, dict_t *output = NULL; gf_boolean_t had_quorum = _gf_false; gf_boolean_t has_quorum = _gf_false; + struct gf_upcall *up_data = NULL; + struct gf_upcall_cache_invalidation *up_ci = NULL; + inode_table_t *itable = NULL; + inode_t *inode = NULL; priv = this->private; @@ -4588,7 +4600,34 @@ afr_notify (xlator_t *this, int32_t event, case GF_EVENT_SOME_CHILD_DOWN: priv->last_event[idx] = event; break; + case GF_EVENT_UPCALL: + up_data = (struct gf_upcall *)data; + if (up_data->event_type != GF_UPCALL_CACHE_INVALIDATION) + break; + up_ci = (struct gf_upcall_cache_invalidation *)up_data->data; + + /* Since md-cache will be aggressively filtering + * lookups, the stale read issue will be more + * pronounced. Hence when a pending xattr is set notify + * all the md-cache clients to invalidate the existing + * stat cache and send the lookup next time */ + if (up_ci->dict) { + for (i = 0; i < priv->child_count; i++) { + if (dict_get (up_ci->dict, priv->pending_key[i])) { + ret = dict_set_int8 (up_ci->dict, + MDC_INVALIDATE_IATT , 0); + break; + } + } + } + itable = ((xlator_t *)this->graph->top)->itable; + /*Internal processes may not have itable for top xlator*/ + if (itable) + inode = inode_find (itable, up_data->gfid); + if (inode) + afr_inode_read_subvol_reset (inode, this); + break; default: propagate = 1; break; diff --git a/xlators/cluster/afr/src/afr.h b/xlators/cluster/afr/src/afr.h index ff136c0b093..93f4ba3dddc 100644 --- a/xlators/cluster/afr/src/afr.h +++ b/xlators/cluster/afr/src/afr.h @@ -23,7 +23,6 @@ #include "afr-self-heald.h" #include "afr-messages.h" -#define AFR_XATTR_PREFIX "trusted.afr" #define AFR_PATHINFO_HEADER "REPLICATE:" #define AFR_SH_READDIR_SIZE_KEY "self-heal-readdir-size" #define AFR_SH_DATA_DOMAIN_FMT "%s:self-heal" |