diff options
author | Anand Avati <avati@gluster.com> | 2011-08-16 12:56:21 +0530 |
---|---|---|
committer | Anand Avati <avati@gluster.com> | 2011-09-08 07:07:45 -0700 |
commit | f421ca932152c3a6102b19607d789dafdbe87ef1 (patch) | |
tree | b456b4f1c0393037846ccbed1af0fb533da767af /xlators/cluster/afr | |
parent | e6992c73dc39a8e9e17f83eb1a6d5ab63c31f511 (diff) |
cluster/afr: eager locking of FD writes
This patch is a change in the way write transactions hold a lock
which optimizes the case of sequential writes from a single writer.
Lock phase of a transaction has two sub-phases. First is an attempt
to acquire locks in parallel by broadcasting non-blocking lock
requests. If lock aquistion fails on any server, then the held locks
are unlocked and revert to a blocking locked mode sequentially on
one server after another.
The change in this patch is to make the initial broadcasting lock
request attempt to acquire lock on the entire file. If this fails,
we revert back to the sequential "regional" blocking lock as before.
In the case where such an "eager" lock is granted in the non-blocking
phase, it gives rise to an opportunity for optimization. i.e, if
the next write transaction on the same FD arrives before the unlock
phase of the first transaction, it "takes over" the full file lock.
Similarly if yet another transaction arrives before the unlock phase
of the "optimized" transaction, that in turn "takes over" the lock
as well. The actual unlock now happens at the end of the last
"optimzed" transaction.
Any operation which arrives before the unlock phase of the previous
transaction is a potential candidate to become an "optimized"
transaction. In cases where the previous transaction had aquired
lock as a "regional" blocking lock, and the next transaction comes
in before its unlock phase, then it would not be an "optimized"
transaction.
Implied assumption
------------------
Since two or more transactions can now operate within the same
large lock, there is a possibility that overlapping transactions
can arrive at oppoosite orders on the servers. However in the
larger picture this is not possible as write-behind already
ensures that no two overlapping writes on an inode are in transit
at the same time. Overlapping writes across clients are not a
problem as they compete at locks anyways.
Theoretical benefits and potential harms
----------------------------------------
In case of a single writer: The benefits are large for sequential
writes. In the best case the entire file write can happen with just
one lock and unlock per server, provided writes are coming in fast
enough and getting pipelined by write-behind soon enough (which is
usually the case). If the writes are not coming in fast enough, then
the optimization "kicks in" for only those subsets of writes which
are close enough to get "piggybacked". For random writes the benefits
are the same as well. In any case the overall performance is better
than or equal to the performance without this optimization for a single
writer.
In case of multiple writers: When multiple writers are not writing
concurrently, there is no negative performance impact. When multiple
writers are writing concurrently to the same region, there is no
negative impact either, as they were previously getting arbitrated
at the locks translator too. In the case of multiple writers writing
to different regions concurrently, there will be an increased number
of "failovers" from failed parallel non-blocking to sequential blocking
regional locks. This above "worst case" has a simple workaround that
as soon as we detect > 1 open-fd-count in lookup xattr, we can disable
this optimization on those fds.
Beneficial side-effects
-----------------------
There is another similar optimization in AFR for changelogs which goes
by the name of "changelog-piggybacking". That works in a similar way where
pending flags get 'taken over' or 'piggybacked' by the next transaction
if its 'pre-op' phase kicks in before the 'post-op' phase of the
previous transaction. It has been observed that this changelog-piggybacking
optimization gives a saving of about ~55% savings of xattr calls hitting
the wire, measured across various types of network interfaces. The side
effect of this eager-lock optimization is that it gives an almost 100%
saving of xattr calls by making the optimistic-changelog work much more
efficiently as it gives a wider overlap of the xattr phases of two
consecutive transactions.
Change-Id: I41c02eb3b64c14c68ef66a344610ec3f024cd59d
BUG: 3409
Reviewed-on: http://review.gluster.com/243
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@gluster.com>
Diffstat (limited to 'xlators/cluster/afr')
-rw-r--r-- | xlators/cluster/afr/src/afr-common.c | 23 | ||||
-rw-r--r-- | xlators/cluster/afr/src/afr-lk-common.c | 184 | ||||
-rw-r--r-- | xlators/cluster/afr/src/afr-transaction.c | 10 | ||||
-rw-r--r-- | xlators/cluster/afr/src/afr.c | 2 | ||||
-rw-r--r-- | xlators/cluster/afr/src/afr.h | 12 |
5 files changed, 171 insertions, 60 deletions
diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c index a435a38b16d..94938d97f61 100644 --- a/xlators/cluster/afr/src/afr-common.c +++ b/xlators/cluster/afr/src/afr-common.c @@ -363,6 +363,7 @@ afr_local_transaction_cleanup (afr_local_t *local, xlator_t *this) GF_FREE (local->transaction.child_errno); GF_FREE (local->child_errno); + GF_FREE (local->transaction.eager_lock); GF_FREE (local->transaction.basename); GF_FREE (local->transaction.new_basename); @@ -1264,6 +1265,22 @@ afr_fd_ctx_set (xlator_t *this, fd_t *fd) goto unlock; } + fd_ctx->lock_piggyback = GF_CALLOC (sizeof (*fd_ctx->lock_piggyback), + priv->child_count, + gf_afr_mt_char); + if (!fd_ctx->lock_piggyback) { + ret = -ENOMEM; + goto unlock; + } + + fd_ctx->lock_acquired = GF_CALLOC (sizeof (*fd_ctx->lock_acquired), + priv->child_count, + gf_afr_mt_char); + if (!fd_ctx->lock_acquired) { + ret = -ENOMEM; + goto unlock; + } + fd_ctx->up_count = priv->up_count; fd_ctx->down_count = priv->down_count; @@ -1504,6 +1521,12 @@ afr_cleanup_fd_ctx (xlator_t *this, fd_t *fd) if (fd_ctx->pre_op_piggyback) GF_FREE (fd_ctx->pre_op_piggyback); + if (fd_ctx->lock_piggyback) + GF_FREE (fd_ctx->lock_piggyback); + + if (fd_ctx->lock_acquired) + GF_FREE (fd_ctx->lock_acquired); + GF_FREE (fd_ctx); } diff --git a/xlators/cluster/afr/src/afr-lk-common.c b/xlators/cluster/afr/src/afr-lk-common.c index c059a6b9e5b..157d54d42aa 100644 --- a/xlators/cluster/afr/src/afr-lk-common.c +++ b/xlators/cluster/afr/src/afr-lk-common.c @@ -555,6 +555,7 @@ afr_unlock_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this, int32_t op_ret, int32_t op_errno) { afr_local_t *local = NULL; + int child_index = (long) cookie; local = frame->local; @@ -568,6 +569,8 @@ afr_unlock_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this, local->loc.path, strerror (op_errno)); } + local->transaction.eager_lock[child_index] = 0; + afr_unlock_common_cbk (frame, cookie, this, op_ret, op_errno); return 0; @@ -581,8 +584,13 @@ afr_unlock_inodelk (call_frame_t *frame, xlator_t *this) afr_local_t *local = NULL; afr_private_t *priv = NULL; struct gf_flock flock = {0,}; + struct gf_flock full_flock = {0,}; + struct gf_flock *flock_use = &flock; int call_count = 0; int i = 0; + int piggyback = 0; + afr_fd_ctx_t *fd_ctx = NULL; + local = frame->local; int_lock = &local->internal_lock; @@ -592,6 +600,8 @@ afr_unlock_inodelk (call_frame_t *frame, xlator_t *this) flock.l_len = int_lock->lk_flock.l_len; flock.l_type = F_UNLCK; + full_flock.l_type = F_UNLCK; + call_count = afr_locked_nodes_count (int_lock->inode_locked_nodes, priv->child_count); @@ -604,42 +614,70 @@ afr_unlock_inodelk (call_frame_t *frame, xlator_t *this) goto out; } + if (local->fd) + fd_ctx = afr_fd_ctx_get (local->fd, this); + for (i = 0; i < priv->child_count; i++) { - if (int_lock->inode_locked_nodes[i] & LOCKED_YES) { - if (local->fd) { - afr_trace_inodelk_in (frame, AFR_INODELK_TRANSACTION, - AFR_UNLOCK_OP, &flock, F_SETLK, i); + if ((int_lock->inode_locked_nodes[i] & LOCKED_YES) + != LOCKED_YES) + continue; - STACK_WIND_COOKIE (frame, afr_unlock_inodelk_cbk, - (void *) (long)i, - priv->children[i], - priv->children[i]->fops->finodelk, - this->name, local->fd, - F_SETLK, &flock); + if (local->fd) { + if (!local->transaction.eager_lock[i]) { + goto wind; + } + + piggyback = 0; + flock_use = &full_flock; + + LOCK (&local->fd->lock); + { + if (fd_ctx->lock_piggyback[i]) { + fd_ctx->lock_piggyback[i]--; + piggyback = 1; + } else { + fd_ctx->lock_acquired[i]--; + } + } + UNLOCK (&local->fd->lock); + if (piggyback) { + afr_unlock_inodelk_cbk (frame, (void *) (long) i, + this, 1, 0); if (!--call_count) break; + continue; + } - } else { - afr_trace_inodelk_in (frame, AFR_INODELK_TRANSACTION, - AFR_UNLOCK_OP, &flock, F_SETLK, i); + wind: + afr_trace_inodelk_in (frame, AFR_INODELK_TRANSACTION, + AFR_UNLOCK_OP, flock_use, F_SETLK, i); - STACK_WIND_COOKIE (frame, afr_unlock_inodelk_cbk, - (void *) (long)i, - priv->children[i], - priv->children[i]->fops->inodelk, - this->name, &local->loc, - F_SETLK, &flock); + STACK_WIND_COOKIE (frame, afr_unlock_inodelk_cbk, + (void *) (long)i, + priv->children[i], + priv->children[i]->fops->finodelk, + this->name, local->fd, + F_SETLK, flock_use); - if (!--call_count) - break; + if (!--call_count) + break; - } + } else { + afr_trace_inodelk_in (frame, AFR_INODELK_TRANSACTION, + AFR_UNLOCK_OP, &flock, F_SETLK, i); - } + STACK_WIND_COOKIE (frame, afr_unlock_inodelk_cbk, + (void *) (long)i, + priv->children[i], + priv->children[i]->fops->inodelk, + this->name, &local->loc, + F_SETLK, &flock); + if (!--call_count) + break; + } } - out: return 0; } @@ -1288,6 +1326,7 @@ afr_nonblocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this, afr_private_t *priv = NULL; int call_count = 0; int child_index = (long) cookie; + afr_fd_ctx_t *fd_ctx = NULL; local = frame->local; int_lock = &local->internal_lock; @@ -1303,7 +1342,7 @@ afr_nonblocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this, } UNLOCK (&frame->lock); - if (op_ret < 0 ) { + if (op_ret < 0) { if (op_errno == ENOSYS) { /* return ENOTSUP */ gf_log (this->name, GF_LOG_ERROR, @@ -1315,10 +1354,27 @@ afr_nonblocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this, int_lock->lock_op_errno = op_errno; local->op_errno = op_errno; } - } else if (op_ret == 0) { + } else { int_lock->inode_locked_nodes[child_index] |= LOCKED_YES; int_lock->inodelk_lock_count++; + + if (priv->eager_lock && local->fd) { + fd_ctx = afr_fd_ctx_get (local->fd, this); + local->transaction.eager_lock[child_index] = 1; + /* piggybacked */ + + if (op_ret == 1) { + /* piggybacked */ + } else if (op_ret == 0) { + /* lock acquired from server */ + LOCK (&local->fd->lock); + { + fd_ctx->lock_acquired[child_index]++; + } + UNLOCK (&local->fd->lock); + } + } } if (call_count == 0) { @@ -1358,6 +1414,9 @@ afr_nonblocking_inodelk (call_frame_t *frame, xlator_t *this) int i = 0; int ret = 0; struct gf_flock flock = {0,}; + struct gf_flock full_flock = {0,}; + struct gf_flock *flock_use = &flock; + int piggyback = 0; local = frame->local; int_lock = &local->internal_lock; @@ -1367,6 +1426,8 @@ afr_nonblocking_inodelk (call_frame_t *frame, xlator_t *this) flock.l_len = int_lock->lk_flock.l_len; flock.l_type = int_lock->lk_flock.l_type; + full_flock.l_type = int_lock->lk_flock.l_type; + initialize_inodelk_variables (frame, this); if (local->fd) { @@ -1398,49 +1459,72 @@ afr_nonblocking_inodelk (call_frame_t *frame, xlator_t *this) goto out; } + frame->root->lk_owner = (long) (local->fd); + /* Send non-blocking inodelk calls only on up children and where the fd has been opened */ for (i = 0; i < priv->child_count; i++) { - if (local->child_up[i] && fd_ctx->opened_on[i]) { - afr_trace_inodelk_in (frame, AFR_INODELK_NB_TRANSACTION, - AFR_LOCK_OP, &flock, F_SETLK, i); + if (!local->child_up[i] || !fd_ctx->opened_on[i]) + continue; - STACK_WIND_COOKIE (frame, afr_nonblocking_inodelk_cbk, - (void *) (long) i, - priv->children[i], - priv->children[i]->fops->finodelk, - this->name, local->fd, - F_SETLK, &flock); + if (!priv->eager_lock) + goto wind; + + flock_use = &full_flock; + piggyback = 0; + + LOCK (&local->fd->lock); + { + if (fd_ctx->lock_acquired[i]) { + fd_ctx->lock_piggyback[i]++; + piggyback = 1; + } + } + UNLOCK (&local->fd->lock); + if (piggyback) { + /* (op_ret == 1) => indicate piggybacked lock */ + afr_nonblocking_inodelk_cbk (frame, (void *) (long) i, + this, 1, 0); if (!--call_count) break; - + continue; } + wind: + afr_trace_inodelk_in (frame, AFR_INODELK_NB_TRANSACTION, + AFR_LOCK_OP, flock_use, F_SETLK, i); + STACK_WIND_COOKIE (frame, afr_nonblocking_inodelk_cbk, + (void *) (long) i, + priv->children[i], + priv->children[i]->fops->finodelk, + this->name, local->fd, + F_SETLK, flock_use); + + if (!--call_count) + break; } } else { call_count = internal_lock_count (frame, this, NULL); int_lock->lk_call_count = call_count; for (i = 0; i < priv->child_count; i++) { - if (local->child_up[i]) { - afr_trace_inodelk_in (frame, AFR_INODELK_NB_TRANSACTION, - AFR_LOCK_OP, &flock, F_SETLK, i); + if (!local->child_up[i]) + continue; + afr_trace_inodelk_in (frame, AFR_INODELK_NB_TRANSACTION, + AFR_LOCK_OP, &flock, F_SETLK, i); - STACK_WIND_COOKIE (frame, afr_nonblocking_inodelk_cbk, - (void *) (long) i, - priv->children[i], - priv->children[i]->fops->inodelk, - this->name, &local->loc, - F_SETLK, &flock); - - if (!--call_count) - break; + STACK_WIND_COOKIE (frame, afr_nonblocking_inodelk_cbk, + (void *) (long) i, + priv->children[i], + priv->children[i]->fops->inodelk, + this->name, &local->loc, + F_SETLK, &flock); - } + if (!--call_count) + break; } } - out: return ret; } diff --git a/xlators/cluster/afr/src/afr-transaction.c b/xlators/cluster/afr/src/afr-transaction.c index ac329958d2a..c0c409bbc96 100644 --- a/xlators/cluster/afr/src/afr-transaction.c +++ b/xlators/cluster/afr/src/afr-transaction.c @@ -377,13 +377,6 @@ afr_changelog_post_op_cbk (call_frame_t *frame, void *cookie, xlator_t *this, child_index = (long) cookie; - if (op_ret == 1) { - } - - if (op_ret == 0) { - __mark_pre_op_undone_on_fd (frame, this, child_index); - } - LOCK (&frame->lock); { call_count = --local->call_count; @@ -563,6 +556,7 @@ afr_changelog_post_op (call_frame_t *frame, xlator_t *this) afr_changelog_post_op_cbk (frame, (void *)(long)i, this, 1, 0, xattr[i]); } else { + __mark_pre_op_undone_on_fd (frame, this, i); STACK_WIND_COOKIE (frame, afr_changelog_post_op_cbk, (void *) (long) i, @@ -1134,8 +1128,6 @@ afr_lock (call_frame_t *frame, xlator_t *this) { afr_pid_save (frame); - frame->root->pid = (long) frame->root; - afr_set_lk_owner (frame, this); afr_set_lock_number (frame, this); diff --git a/xlators/cluster/afr/src/afr.c b/xlators/cluster/afr/src/afr.c index a662037e788..47224ec8caa 100644 --- a/xlators/cluster/afr/src/afr.c +++ b/xlators/cluster/afr/src/afr.c @@ -744,6 +744,8 @@ init (xlator_t *this) pthread_mutex_init (&priv->mutex, NULL); INIT_LIST_HEAD (&priv->saved_fds); + priv->eager_lock = _gf_true; + ret = 0; out: return ret; diff --git a/xlators/cluster/afr/src/afr.h b/xlators/cluster/afr/src/afr.h index 6f40ded1274..6a9907e974b 100644 --- a/xlators/cluster/afr/src/afr.h +++ b/xlators/cluster/afr/src/afr.h @@ -92,6 +92,7 @@ typedef struct _afr_private { pthread_mutex_t mutex; struct list_head saved_fds; /* list of fds on which locks have succeeded */ gf_boolean_t optimistic_change_log; + gf_boolean_t eager_lock; char vol_uuid[UUID_SIZE + 1]; int32_t *last_event; @@ -597,6 +598,8 @@ typedef struct _afr_local { struct { off_t start, len; + int *eager_lock; + char *basename; char *new_basename; @@ -636,6 +639,9 @@ typedef struct { unsigned int *opened_on; /* which subvolumes the fd is open on */ unsigned int *pre_op_piggyback; + unsigned int *lock_piggyback; + unsigned int *lock_acquired; + int flags; int32_t wbflags; uint64_t up_count; /* number of CHILD_UPs this fd has seen */ @@ -939,6 +945,10 @@ afr_transaction_local_init (afr_local_t *local, afr_private_t *priv) priv->child_count, gf_afr_mt_int32_t); + local->transaction.eager_lock = GF_CALLOC (sizeof (*local->transaction.eager_lock), + priv->child_count, + gf_afr_mt_int32_t); + local->internal_lock.transaction_lk_type = AFR_TRANSACTION_LK; return 0; @@ -947,5 +957,5 @@ afr_transaction_local_init (afr_local_t *local, afr_private_t *priv) int32_t afr_marker_getxattr (call_frame_t *frame, xlator_t *this, loc_t *loc, const char *name,afr_local_t *local, afr_private_t *priv ); - +afr_fd_ctx_t * afr_fd_ctx_get (fd_t *fd, xlator_t *this); #endif /* __AFR_H__ */ |