cluster/afr: eager locking of FD writes

This patch is a change in the way write transactions hold a lock which optimizes the case of sequential writes from a single writer. Lock phase of a transaction has two sub-phases. First is an attempt to acquire locks in parallel by broadcasting non-blocking lock requests. If lock aquistion fails on any server, then the held locks are unlocked and revert to a blocking locked mode sequentially on one server after another. The change in this patch is to make the initial broadcasting lock request attempt to acquire lock on the entire file. If this fails, we revert back to the sequential "regional" blocking lock as before. In the case where such an "eager" lock is granted in the non-blocking phase, it gives rise to an opportunity for optimization. i.e, if the next write transaction on the same FD arrives before the unlock phase of the first transaction, it "takes over" the full file lock. Similarly if yet another transaction arrives before the unlock phase of the "optimized" transaction, that in turn "takes over" the lock as well. The actual unlock now happens at the end of the last "optimzed" transaction. Any operation which arrives before the unlock phase of the previous transaction is a potential candidate to become an "optimized" transaction. In cases where the previous transaction had aquired lock as a "regional" blocking lock, and the next transaction comes in before its unlock phase, then it would not be an "optimized" transaction. Implied assumption ------------------ Since two or more transactions can now operate within the same large lock, there is a possibility that overlapping transactions can arrive at oppoosite orders on the servers. However in the larger picture this is not possible as write-behind already ensures that no two overlapping writes on an inode are in transit at the same time. Overlapping writes across clients are not a problem as they compete at locks anyways. Theoretical benefits and potential harms ---------------------------------------- In case of a single writer: The benefits are large for sequential writes. In the best case the entire file write can happen with just one lock and unlock per server, provided writes are coming in fast enough and getting pipelined by write-behind soon enough (which is usually the case). If the writes are not coming in fast enough, then the optimization "kicks in" for only those subsets of writes which are close enough to get "piggybacked". For random writes the benefits are the same as well. In any case the overall performance is better than or equal to the performance without this optimization for a single writer. In case of multiple writers: When multiple writers are not writing concurrently, there is no negative performance impact. When multiple writers are writing concurrently to the same region, there is no negative impact either, as they were previously getting arbitrated at the locks translator too. In the case of multiple writers writing to different regions concurrently, there will be an increased number of "failovers" from failed parallel non-blocking to sequential blocking regional locks. This above "worst case" has a simple workaround that as soon as we detect > 1 open-fd-count in lookup xattr, we can disable this optimization on those fds. Beneficial side-effects ----------------------- There is another similar optimization in AFR for changelogs which goes by the name of "changelog-piggybacking". That works in a similar way where pending flags get 'taken over' or 'piggybacked' by the next transaction if its 'pre-op' phase kicks in before the 'post-op' phase of the previous transaction. It has been observed that this changelog-piggybacking optimization gives a saving of about ~55% savings of xattr calls hitting the wire, measured across various types of network interfaces. The side effect of this eager-lock optimization is that it gives an almost 100% saving of xattr calls by making the optimistic-changelog work much more efficiently as it gives a wider overlap of the xattr phases of two consecutive transactions. Change-Id: I41c02eb3b64c14c68ef66a344610ec3f024cd59d BUG: 3409 Reviewed-on: http://review.gluster.com/240 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@gluster.com>
author: Anand Avati <avati@gluster.com> 2011-08-16 12:56:21 +0530
committer: Anand Avati <avati@gluster.com> 2011-09-08 07:07:30 -0700
commit: c83856797fd55fa59c885ba5efd3ac912fcb9a96 (patch)
tree: d26a920f5549f87144ac5890ad22766532bc69b1 /xlators/cluster
parent: 73ae561db9054c67ce120eb87fa955943bdc06bd (diff)
4 files changed, 182 insertions, 58 deletions
diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
index 9803744a0ec..caf72459336 100644
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ -763,6 +763,7 @@ afr_local_transaction_cleanup (afr_local_t *local, xlator_t *this)
 
         GF_FREE (local->transaction.child_errno);
         GF_FREE (local->child_errno);
+        GF_FREE (local->transaction.eager_lock);
 
         GF_FREE (local->transaction.basename);
         GF_FREE (local->transaction.new_basename);
@@ -2020,6 +2021,22 @@ afr_fd_ctx_set (xlator_t *this, fd_t *fd)
                         goto unlock;
                 }
 
+                fd_ctx->lock_piggyback = GF_CALLOC (sizeof (*fd_ctx->lock_piggyback),
+                                                    priv->child_count,
+                                                    gf_afr_mt_char);
+                if (!fd_ctx->lock_piggyback) {
+                        ret = -ENOMEM;
+                        goto unlock;
+                }
+
+                fd_ctx->lock_acquired = GF_CALLOC (sizeof (*fd_ctx->lock_acquired),
+                                                    priv->child_count,
+                                                    gf_afr_mt_char);
+                if (!fd_ctx->lock_acquired) {
+                        ret = -ENOMEM;
+                        goto unlock;
+                }
+
                 fd_ctx->up_count   = priv->up_count;
                 fd_ctx->down_count = priv->down_count;
 
@@ -2274,6 +2291,12 @@ afr_cleanup_fd_ctx (xlator_t *this, fd_t *fd)
                         GF_FREE (paused_call);
                 }
 
+                if (fd_ctx->lock_piggyback)
+                        GF_FREE (fd_ctx->lock_piggyback);
+
+                if (fd_ctx->lock_acquired)
+                        GF_FREE (fd_ctx->lock_acquired);
+
                 GF_FREE (fd_ctx);
         }
 
@@ -3593,6 +3616,14 @@ afr_transaction_local_init (afr_local_t *local, xlator_t *this)
         if (!local->child_errno)
                 goto out;
 
+        local->transaction.eager_lock =
+                GF_CALLOC (sizeof (*local->transaction.eager_lock),
+                           priv->child_count,
+                           gf_afr_mt_int32_t);
+
+        if (!local->transaction.eager_lock)
+                goto out;
+
         local->pending = GF_CALLOC (sizeof (*local->pending),
                                     priv->child_count,
                                     gf_afr_mt_int32_t);
diff --git a/xlators/cluster/afr/src/afr-lk-common.c b/xlators/cluster/afr/src/afr-lk-common.c
index 483d7250e43..c8dd8b635e6 100644
--- a/xlators/cluster/afr/src/afr-lk-common.c
+++ b/xlators/cluster/afr/src/afr-lk-common.c
@@ -567,7 +567,13 @@ afr_unlock_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                         local->loc.path, child_index, strerror (op_errno));
         }
 
+
         int_lock->inode_locked_nodes[child_index] &= LOCKED_NO;
+
+        if (op_ret == 1) {
+                local->transaction.eager_lock[child_index] = 0;
+        }
+
         afr_unlock_common_cbk (frame, cookie, this, op_ret, op_errno);
 
         return 0;
@@ -581,8 +587,13 @@ afr_unlock_inodelk (call_frame_t *frame, xlator_t *this)
         afr_local_t         *local    = NULL;
         afr_private_t       *priv     = NULL;
         struct gf_flock flock = {0,};
+        struct gf_flock full_flock = {0,};
+        struct gf_flock *flock_use = &flock;
         int call_count = 0;
         int i = 0;
+        int piggyback = 0;
+        afr_fd_ctx_t        *fd_ctx      = NULL;
+
 
         local    = frame->local;
         int_lock = &local->internal_lock;
@@ -592,9 +603,13 @@ afr_unlock_inodelk (call_frame_t *frame, xlator_t *this)
         flock.l_len   = int_lock->lk_flock.l_len;
         flock.l_type  = F_UNLCK;
 
+
         gf_log (this->name, GF_LOG_DEBUG, "attempting data unlock range %"PRIu64
                 " %"PRIu64" by %"PRIu64, flock.l_start, flock.l_len,
                 frame->root->lk_owner);
+
+        full_flock.l_type = F_UNLCK;
+
         call_count = afr_locked_nodes_count (int_lock->inode_locked_nodes,
                                              priv->child_count);
 
@@ -607,42 +622,69 @@ afr_unlock_inodelk (call_frame_t *frame, xlator_t *this)
                 goto out;
         }
 
+        if (local->fd)
+                fd_ctx = afr_fd_ctx_get (local->fd, this);
+
         for (i = 0; i < priv->child_count; i++) {
-                if (int_lock->inode_locked_nodes[i] & LOCKED_YES) {
-                        if (local->fd) {
-                                afr_trace_inodelk_in (frame, AFR_INODELK_TRANSACTION,
-                                                      AFR_UNLOCK_OP, &flock, F_SETLK, i);
+                if ((int_lock->inode_locked_nodes[i] & LOCKED_YES)
+                    != LOCKED_YES)
+                        continue;
 
-                                STACK_WIND_COOKIE (frame, afr_unlock_inodelk_cbk,
-                                                   (void *) (long)i,
-                                                   priv->children[i],
-                                                   priv->children[i]->fops->finodelk,
-                                                   this->name, local->fd,
-                                                   F_SETLK, &flock);
+                if (local->fd) {
+                        if (!local->transaction.eager_lock[i]) {
+                                goto wind;
+                        }
+
+                        piggyback = 0;
+                        flock_use = &full_flock;
 
+                        LOCK (&local->fd->lock);
+                        {
+                                if (fd_ctx->lock_piggyback[i]) {
+                                        fd_ctx->lock_piggyback[i]--;
+                                        piggyback = 1;
+                                }
+                        }
+                        UNLOCK (&local->fd->lock);
+
+                        if (piggyback) {
+                                afr_unlock_inodelk_cbk (frame, (void *) (long) i,
+                                                        this, 1, 0);
                                 if (!--call_count)
                                         break;
+                                continue;
+                        }
+
+                        fd_ctx->lock_acquired[i]--;
+                wind:
+                        afr_trace_inodelk_in (frame, AFR_INODELK_TRANSACTION,
+                                              AFR_UNLOCK_OP, &flock, F_SETLK, i);
 
-                        } else {
-                                afr_trace_inodelk_in (frame, AFR_INODELK_TRANSACTION,
-                                                      AFR_UNLOCK_OP, &flock, F_SETLK, i);
+                        STACK_WIND_COOKIE (frame, afr_unlock_inodelk_cbk,
+                                           (void *) (long)i,
+                                           priv->children[i],
+                                           priv->children[i]->fops->finodelk,
+                                           this->name, local->fd,
+                                           F_SETLK, &flock);
 
-                                STACK_WIND_COOKIE (frame, afr_unlock_inodelk_cbk,
-                                                   (void *) (long)i,
-                                                   priv->children[i],
-                                                   priv->children[i]->fops->inodelk,
-                                                   this->name, &local->loc,
-                                                   F_SETLK, &flock);
+                        if (!--call_count)
+                                break;
 
-                                if (!--call_count)
-                                        break;
+                } else {
+                        afr_trace_inodelk_in (frame, AFR_INODELK_TRANSACTION,
+                                              AFR_UNLOCK_OP, &flock, F_SETLK, i);
 
-                        }
+                        STACK_WIND_COOKIE (frame, afr_unlock_inodelk_cbk,
+                                           (void *) (long)i,
+                                           priv->children[i],
+                                           priv->children[i]->fops->inodelk,
+                                           this->name, &local->loc,
+                                           F_SETLK, &flock);
 
+                        if (!--call_count)
+                                break;
                 }
-
         }
-
 out:
         return 0;
 }
@@ -1288,7 +1330,11 @@ afr_nonblocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
         afr_local_t         *local    = NULL;
         int call_count  = 0;
         int child_index = (long) cookie;
+        afr_fd_ctx_t        *fd_ctx = NULL;
+        afr_private_t       *priv     = NULL;
+
 
+        priv = this->private;
         local    = frame->local;
         int_lock = &local->internal_lock;
 
@@ -1302,7 +1348,7 @@ afr_nonblocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
         }
         UNLOCK (&frame->lock);
 
-        if (op_ret < 0 ) {
+        if (op_ret < 0) {
                 if (op_errno == ENOSYS) {
                         /* return ENOTSUP */
                         gf_log (this->name, GF_LOG_ERROR,
@@ -1314,9 +1360,27 @@ afr_nonblocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                         int_lock->lock_op_errno      = op_errno;
                         local->op_errno              = op_errno;
                 }
-        } else if (op_ret == 0) {
-                int_lock->inode_locked_nodes[child_index] |= LOCKED_YES;
+        } else {
+                int_lock->inode_locked_nodes[child_index]
+                        |= LOCKED_YES;
                 int_lock->inodelk_lock_count++;
+
+                if (priv->eager_lock && local->fd) {
+                        fd_ctx = afr_fd_ctx_get (local->fd, this);
+                        local->transaction.eager_lock[child_index] = 1;
+                        /* piggybacked */
+
+                        if (op_ret == 1) {
+                                /* piggybacked */
+                        } else if (op_ret == 0) {
+                                /* lock acquired from server */
+                                LOCK (&local->fd->lock);
+                                {
+                                        fd_ctx->lock_acquired[child_index]++;
+                                }
+                                UNLOCK (&local->fd->lock);
+                        }
+                }
         }
 
         if (call_count == 0) {
@@ -1355,6 +1419,9 @@ afr_nonblocking_inodelk (call_frame_t *frame, xlator_t *this)
         int      i          = 0;
         int      ret        = 0;
         struct gf_flock flock = {0,};
+        struct gf_flock full_flock = {0,};
+        struct gf_flock *flock_use = &flock;
+        int     piggyback = 0;
 
         local    = frame->local;
         int_lock = &local->internal_lock;
@@ -1367,6 +1434,9 @@ afr_nonblocking_inodelk (call_frame_t *frame, xlator_t *this)
         gf_log (this->name, GF_LOG_DEBUG, "attempting data lock range %"PRIu64
                 " %"PRIu64" by %"PRIu64, flock.l_start, flock.l_len,
                 frame->root->lk_owner);
+
+        full_flock.l_type = int_lock->lk_flock.l_type;
+
         initialize_inodelk_variables (frame, this);
 
         if (local->fd) {
@@ -1400,22 +1470,45 @@ afr_nonblocking_inodelk (call_frame_t *frame, xlator_t *this)
                 /* Send non-blocking inodelk calls only on up children
                    and where the fd has been opened */
                 for (i = 0; i < priv->child_count; i++) {
-                        if (local->child_up[i] && local->fd_open_on[i]) {
-                                afr_trace_inodelk_in (frame, AFR_INODELK_NB_TRANSACTION,
-                                                      AFR_LOCK_OP, &flock, F_SETLK, i);
+                        if (!local->child_up[i] || !local->fd_open_on[i])
+                                continue;
 
-                                STACK_WIND_COOKIE (frame, afr_nonblocking_inodelk_cbk,
-                                                   (void *) (long) i,
-                                                   priv->children[i],
-                                                   priv->children[i]->fops->finodelk,
-                                                   this->name, local->fd,
-                                                   F_SETLK, &flock);
+                        if (!priv->eager_lock)
+                                goto wind;
+
+                        flock_use = &full_flock;
+                        piggyback = 0;
 
+                        LOCK (&local->fd->lock);
+                        {
+                                if (fd_ctx->lock_acquired[i]) {
+                                        fd_ctx->lock_piggyback[i]++;
+                                        piggyback = 1;
+                                }
+                        }
+                        UNLOCK (&local->fd->lock);
+
+                        if (piggyback) {
+                                /* (op_ret == 1) => indicate piggybacked lock */
+                                afr_nonblocking_inodelk_cbk (frame, (void *) (long) i,
+                                                             this, 1, 0);
                                 if (!--call_count)
                                         break;
-
+                                continue;
                         }
+                wind:
+                        afr_trace_inodelk_in (frame, AFR_INODELK_NB_TRANSACTION,
+                                              AFR_LOCK_OP, flock_use, F_SETLK, i);
 
+                        STACK_WIND_COOKIE (frame, afr_nonblocking_inodelk_cbk,
+                                           (void *) (long) i,
+                                           priv->children[i],
+                                           priv->children[i]->fops->finodelk,
+                                           this->name, local->fd,
+                                           F_SETLK, flock_use);
+
+                        if (!--call_count)
+                                break;
                 }
         } else {
                 call_count = internal_lock_count (frame, this);
@@ -1423,24 +1516,22 @@ afr_nonblocking_inodelk (call_frame_t *frame, xlator_t *this)
                 int_lock->lk_expected_count = call_count;
 
                 for (i = 0; i < priv->child_count; i++) {
-                        if (local->child_up[i]) {
-                                afr_trace_inodelk_in (frame, AFR_INODELK_NB_TRANSACTION,
-                                                      AFR_LOCK_OP, &flock, F_SETLK, i);
-
-                                STACK_WIND_COOKIE (frame, afr_nonblocking_inodelk_cbk,
-                                                   (void *) (long) i,
-                                                   priv->children[i],
-                                                   priv->children[i]->fops->inodelk,
-                                                   this->name, &local->loc,
-                                                   F_SETLK, &flock);
+                        if (!local->child_up[i])
+                                continue;
+                        afr_trace_inodelk_in (frame, AFR_INODELK_NB_TRANSACTION,
+                                              AFR_LOCK_OP, &flock, F_SETLK, i);
 
-                                if (!--call_count)
-                                        break;
+                        STACK_WIND_COOKIE (frame, afr_nonblocking_inodelk_cbk,
+                                           (void *) (long) i,
+                                           priv->children[i],
+                                           priv->children[i]->fops->inodelk,
+                                           this->name, &local->loc,
+                                           F_SETLK, &flock);
 
-                        }
+                        if (!--call_count)
+                                break;
                 }
         }
-
 out:
         return ret;
 }
diff --git a/xlators/cluster/afr/src/afr-transaction.c b/xlators/cluster/afr/src/afr-transaction.c
index b40e171030f..795fec255d3 100644
--- a/xlators/cluster/afr/src/afr-transaction.c
+++ b/xlators/cluster/afr/src/afr-transaction.c
@@ -377,13 +377,6 @@ afr_changelog_post_op_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
 
         child_index = (long) cookie;
 
-        if (op_ret == 1) {
-        }
-
-        if (op_ret == 0) {
-                __mark_pre_op_undone_on_fd (frame, this, child_index);
-        }
-
         LOCK (&frame->lock);
         {
                 call_count = --local->call_count;
@@ -610,6 +603,7 @@ afr_changelog_post_op (call_frame_t *frame, xlator_t *this)
                                 afr_changelog_post_op_cbk (frame, (void *)(long)i,
                                                            this, 1, 0, xattr[i]);
                         } else {
+                                __mark_pre_op_undone_on_fd (frame, this, i);
                                 STACK_WIND_COOKIE (frame,
                                                    afr_changelog_post_op_cbk,
                                                    (void *) (long) i,
diff --git a/xlators/cluster/afr/src/afr.h b/xlators/cluster/afr/src/afr.h
index 06e1a664e23..bd0a08842b2 100644
--- a/xlators/cluster/afr/src/afr.h
+++ b/xlators/cluster/afr/src/afr.h
@@ -130,6 +130,7 @@ typedef struct _afr_private {
         pthread_mutex_t  mutex;
         struct list_head saved_fds;   /* list of fds on which locks have succeeded */
         gf_boolean_t     optimistic_change_log;
+        gf_boolean_t     eager_lock;
 
         char                   vol_uuid[UUID_SIZE + 1];
         int32_t                *last_event;
@@ -652,6 +653,8 @@ typedef struct _afr_local {
         struct {
                 off_t start, len;
 
+                int *eager_lock;
+
                 char *basename;
                 char *new_basename;
 
@@ -701,6 +704,9 @@ typedef struct {
         afr_fd_open_status_t *opened_on; /* which subvolumes the fd is open on */
         unsigned int *pre_op_piggyback;
 
+        unsigned int *lock_piggyback;
+        unsigned int *lock_acquired;
+
         int flags;
         int32_t wbflags;
         uint64_t up_count;   /* number of CHILD_UPs this fd has seen */
@@ -985,4 +991,6 @@ int
 afr_open_fd_fix (call_frame_t *frame, xlator_t *this, gf_boolean_t pause_fop);
 int
 afr_set_elem_count_get (unsigned char *elems, int child_count);
+afr_fd_ctx_t *
+afr_fd_ctx_get (fd_t *fd, xlator_t *this);
 #endif /* __AFR_H__ */
author	Anand Avati <avati@gluster.com>	2011-08-16 12:56:21 +0530
committer	Anand Avati <avati@gluster.com>	2011-09-08 07:07:30 -0700
commit	c83856797fd55fa59c885ba5efd3ac912fcb9a96 (patch)
tree	d26a920f5549f87144ac5890ad22766532bc69b1 /xlators/cluster
parent	73ae561db9054c67ce120eb87fa955943bdc06bd (diff)