<feed xmlns='http://www.w3.org/2005/Atom'>
<title>glusterfs.git, branch v3.4.0alpha3</title>
<subtitle></subtitle>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/'/>
<entry>
<title>glusterd: big lock - a coarse-grained locking to prevent races</title>
<updated>2013-04-17T12:48:50+00:00</updated>
<author>
<name>Krishnan Parthasarathi</name>
<email>kparthas@redhat.com</email>
</author>
<published>2013-04-15T10:26:57+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=92729add67e2e7b8c7589c2dfab0bde071a7faf2'/>
<id>92729add67e2e7b8c7589c2dfab0bde071a7faf2</id>
<content type='text'>
There are primarily three lists in the glusterd process that are
accessed concurrently: priv-&gt;volumes, priv-&gt;peers and
volinfo-&gt;bricks_list.

Big-lock approach
-----------------
WHAT IS IT?
Big lock is a coarse-grained lock which protects all three
lists, mentioned above, from racy access.

HOW DOES IT WORK?
At any given point in time, glusterd's thread(s) are in execution
_iff_ there is a preceding, inbound network event. Of course, the
sigwaiter thread and timer thread are exceptions.
A network event is an external trigger to glusterd, via the epoll
thread, in the form of POLLIN and POLLERR.
As long as we take the big-lock at all such entry points and yield
it when we are done, we are guaranteed that all the network events
accessing the global lists are serialised.

This amounts to holding the big lock at
- all the handlers of all the actors in glusterd. (POLLIN)
- all the cbks in glusterd. (POLLIN)
- rpc_notify (DISCONNECT event), if we access/modify
  one of the three lists. (POLLERR)

In the case of synctask'ized volume operations, we must remember that
holding the big lock for the entire duration of the handler could block
other non-synctask rpc actors from executing. For example, volume-start
would block in PMAP SIGNIN if done incorrectly. To prevent this, we need
to yield the big lock when we yield the synctask, and reacquire it when
the synctask wakes up.
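
A minimal sketch of this wrapping pattern, built on the synclock
primitives described in the "synctask: introduce synclocks" entry in
this feed (the handler and helper names here are illustrative, not the
actual glusterd symbols):

    int do_actor_work (void *req);    /* hypothetical stand-in for the real work */

    static synclock_t big_lock;       /* assumed to be synclock_init()'ed at startup */

    int
    some_actor_handler (void *req)
    {
            int ret = -1;

            synclock_lock (&amp;big_lock);      /* entry point: POLLIN/POLLERR event */
            {
                    /* priv-&gt;volumes, priv-&gt;peers and volinfo-&gt;bricks_list
                       may be accessed safely in here */
                    ret = do_actor_work (req);
            }
            synclock_unlock (&amp;big_lock);

            return ret;
    }

A synclock (rather than a plain pthread mutex) is used so that a
contending synctask yields itself instead of blocking its syncenv
thread while waiting for the big lock.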

BUG: 948686
Change-Id: I429832f1fed67bcac0813403d58346558a403ce9
Signed-off-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4835
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</content>
</entry>
<entry>
<title>glusterd: Fixed spurious wakeups in glusterd syncops</title>
<updated>2013-04-17T12:46:43+00:00</updated>
<author>
<name>Krishnan Parthasarathi</name>
<email>kparthas@redhat.com</email>
</author>
<published>2013-04-10T11:42:01+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=47c118e22d9d6fb6662fe96841ed4fe3089739b5'/>
<id>47c118e22d9d6fb6662fe96841ed4fe3089739b5</id>
<content type='text'>
glusterd syncops perform a barrier_wake whenever rpc_clnt_submit returns -1.
This is based on the wrong assumption that the cbkfn wasn't called,
and it results in one more wakeup than there ought to be.
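
A self-contained toy illustration of the over-counting (this is not
glusterd code; submit() stands in for rpc_clnt_submit() under the
assumption that the callback does get invoked even when -1 is returned):

    #include &lt;stdio.h&gt;

    static int wakes;

    static void barrier_wake (void) { wakes++; }

    /* the callback runs even though the submission "fails" with -1 */
    static int submit (void (*cbk) (void)) { cbk (); return -1; }

    int
    main (void)
    {
            if (submit (barrier_wake))
                    barrier_wake ();    /* the buggy extra wake */

            printf ("wakes = %d, expected 1\n", wakes);    /* prints 2 */
            return 0;
    }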

BUG: 948686
Change-Id: I839fd218a81255fe50c2047d67461d45360e894d
Signed-off-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4834
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</content>
</entry>
<entry>
<title>syncenv: be robust against spurious wake()s</title>
<updated>2013-04-17T12:46:09+00:00</updated>
<author>
<name>Krishnan Parthasarathi</name>
<email>kparthas@redhat.com</email>
</author>
<published>2013-04-15T10:25:28+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=1787debc1b6640e15a02ccac4699b92affb2bb14'/>
<id>1787debc1b6640e15a02ccac4699b92affb2bb14</id>
<content type='text'>
In the current implementation, when the callers of synctasks perform
a spurious wake() of a sleeping synctask (i.e., an extra wake() soon
after a wake() which already woke up a yielded synctask), there is
now a possibility of two sync threads picking up the same synctask.
This can result in a crash. The fix is to change -&gt;slept = 0|1 and
the synctask's membership in the runqueue atomically.

Today we dequeue a task from the runqueue in syncenv_task(), but
reset -&gt;slept = 0 much later in synctask_switchto() in an unlocked
manner -- which is safe, when there are no spurious wake()s.

However, this opens a race window: if a second wake() happens
after the dequeue but before -&gt;slept is set to 0, the same synctask
gets queued in the runqueue once again and picked up by a different
thread.

This has been diagnosed as the cause of the crashes in the regression
tests of http://review.gluster.org/4784. However, that patch still has
a spurious wake() [the trigger for this bug] which is yet to be fixed.
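
A minimal sketch of the fix (the struct layout and names below are
assumptions modelled on the description above, not the exact syncenv
code): the dequeue and the reset of -&gt;slept happen under one lock, so
a concurrent wake() either still finds the task on the runqueue or
finds slept == 0 and does not queue it again.

    #include &lt;pthread.h&gt;
    #include &lt;stddef.h&gt;

    struct toy_task {
            struct toy_task *next;     /* simple singly-linked runqueue */
            int              slept;
    };

    struct toy_env {
            pthread_mutex_t  mutex;
            struct toy_task *runq;
    };

    static struct toy_task *
    toy_env_task (struct toy_env *env)
    {
            struct toy_task *task = NULL;

            pthread_mutex_lock (&amp;env-&gt;mutex);
            {
                    task = env-&gt;runq;
                    if (task) {
                            env-&gt;runq   = task-&gt;next;
                            task-&gt;next  = NULL;
                            task-&gt;slept = 0;    /* atomic with the dequeue */
                    }
            }
            pthread_mutex_unlock (&amp;env-&gt;mutex);

            return task;
    }

A wake() in this scheme would re-queue the task only when it observes
slept == 1 under the same mutex, which is what makes the two updates
effectively atomic.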

BUG: 948686
Change-Id: I51858e887cad2680e46fb973629f8465f4429363
Original-author: Anand Avati &lt;avati@redhat.com&gt;
Signed-off-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4833
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
</content>
</entry>
<entry>
<title>tests: fix dependency on sleep in bug-874498.t</title>
<updated>2013-04-17T11:36:27+00:00</updated>
<author>
<name>Krishnan Parthasarathi</name>
<email>kparthas@redhat.com</email>
</author>
<published>2013-04-15T10:22:53+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=63098d9ff8dcfc08fd2ed83c62c4ffb63fc2126f'/>
<id>63098d9ff8dcfc08fd2ed83c62c4ffb63fc2126f</id>
<content type='text'>
With the introduction of http://review.gluster.org/4784, there are
delays which break bug-874498.t, which wrongly depends on healing
finishing within 2 seconds.

Fix this by using 'EXPECT_WITHIN 60' instead of sleep 2.

BUG: 874498
Change-Id: I7131699908e63b024d2dd71395b3e94c15fe925c
Original-author: Anand Avati &lt;avati@redhat.com&gt;
Signed-off-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4832
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
</content>
</entry>
<entry>
<title>tests: fix further issues with bug-874498.t</title>
<updated>2013-04-17T08:53:43+00:00</updated>
<author>
<name>Krishnan Parthasarathi</name>
<email>kparthas@redhat.com</email>
</author>
<published>2013-04-15T10:21:14+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=28da431e5bedba64380cc3886cbab03c0d7a3cfd'/>
<id>28da431e5bedba64380cc3886cbab03c0d7a3cfd</id>
<content type='text'>
The failure of bug-874498.t seems to be due to a "bug" in glustershd.
The situation arises when both subvolumes of a replica are
"local" to glustershd; in such cases glustershd is sensitive
to the order in which the subvols come up.

The core of the issue itself is that, without the patch (#4784),
self-heal daemon completes the processing of index and no entries
are left inside the xattrop index after a few seconds of volume
start force. However with the patch, the stale "backing file"
(against which index performs link()) is left. The likely reason
is that an "INDEX" based crawl is not happening against the subvol
when this patch is applied.

Before the #4784 patch, the order in which subvols came up was:

  [2013-04-09 22:55:35.117679] I [client-handshake.c:1456:client_setvolume_cbk] 0-patchy-client-0: Connected to 10.3.129.13:49156, attached to remote volume '/d/backends/brick1'.
  ...
  [2013-04-09 22:55:35.118399] I [client-handshake.c:1456:client_setvolume_cbk] 0-patchy-client-1: Connected to 10.3.129.13:49157, attached to remote volume '/d/backends/brick2'.

However, with the patch, the order is reversed:

  [2013-04-09 22:53:34.945370] I [client-handshake.c:1456:client_setvolume_cbk] 0-patchy-client-1: Connected to 10.3.129.13:49153, attached to remote volume '/d/backends/brick2'.
  ...
  [2013-04-09 22:53:34.950966] I [client-handshake.c:1456:client_setvolume_cbk] 0-patchy-client-0: Connected to 10.3.129.13:49152, attached to remote volume '/d/backends/brick1'.

The index in brick2 has the list of files/gfid to heal. It appears
to be the case that when brick1 is the first subvol to be detected
as coming up, somehow an INDEX based crawl is clearing all the
index entries in brick2, but if brick2 comes up as the first subvol,
then the backing file is left stale.

Also, doing a "gluster volume heal full" seems to leave stale
backing files behind too, as the crawl is performed on the namespace
and the backing file is never encountered there to get cleared out.

So the interim (possibly permanent) fix is to have the script issue
a regular self-heal command (and not a "full" one).

The failure of the script itself is non-critical. The data files are
all healed, and it is just the backing file which is left behind. The
stale backing file also gets cleared in the next index-based healing,
whether triggered manually or after 10 minutes.

BUG: 874498
Change-Id: I601e9adec46bb7f8ba0b1ba09d53b83bf317ab6a
Original-author: Anand Avati &lt;avati@redhat.com&gt;
Signed-off-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4831
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</content>
</entry>
<entry>
<title>synctask: introduce synclocks for co-operative locking</title>
<updated>2013-04-17T08:53:20+00:00</updated>
<author>
<name>Krishnan Parthasarathi</name>
<email>kparthas@redhat.com</email>
</author>
<published>2013-04-15T10:11:21+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=563b608126e812482a25464df7c70079fb0ba2c0'/>
<id>563b608126e812482a25464df7c70079fb0ba2c0</id>
<content type='text'>
This patch introduces synclocks - co-operative locks for synctasks.
Synctasks yield themselves when a lock cannot be acquired at the time
of the lock call, and the unlocker will wake the yielded locker at
the time of unlock.

The implementation is safe in a multi-threaded syncenv framework.

It is also safe to share the lock with non-synctasks, i.e., the
same lock can be used for synchronization between a synctask and
a regular thread. In such a situation, waiting synctasks will yield
themselves while non-synctasks will sleep on a cond variable. The
unlocker (which could be either a synctask or a regular thread) will
wake up any type of lock waiter (synctask or regular).

Usage:

    Declaration and Initialization
    ------------------------------

    synclock_t lock;

    ret = synclock_init (&amp;lock);
    if (ret) {
        /* lock could not be allocated */
    }

    Locking and non-blocking lock attempt
    -------------------------------------

    ret = synclock_trylock (&amp;lock);
    if (ret &amp;&amp; (errno == EBUSY)) {
        /* lock is held by someone else */
        return;
    }

    synclock_lock (&amp;lock);
    {
        /* critical section */
    }
    synclock_unlock (&amp;lock);

BUG: 763820
Change-Id: I23066f7b66b41d3d9fb2311fdaca333e98dd7442
Signed-off-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Original-author: Anand Avati &lt;avati@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4830
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</content>
</entry>
<entry>
<title>glusterd: fix segfault on volume status detail</title>
<updated>2013-04-16T16:43:23+00:00</updated>
<author>
<name>Lars Ellenberg</name>
<email>lars@linbit.com</email>
</author>
<published>2013-03-01T23:59:15+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=4c8bb7c4b0471fe2a5095639f0fd44f50ba28dc8'/>
<id>4c8bb7c4b0471fe2a5095639f0fd44f50ba28dc8</id>
<content type='text'>
If for some reason glusterd_get_brick_root() fails,
it frees the gf_strdup'ed *mount_point in its own error path,
and returns -1.

Unfortunately, it has already assigned that pointer value
to the output argument, so the caller,
glusterd_add_brick_detail(), sees a non-NULL pointer
and calls free() again: segfault.

This could be fixed with a one-liner (*mount_point = NULL)
in the error path, but I think glusterd_get_brick_root()
should only assign to the output argument once all checks have passed,
so I use a local temporary pointer, which makes the patch a bit bigger.
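
A sketch of that safer shape (hypothetical names; the real function
resolves the brick's mount point rather than just duplicating a path):

    #include &lt;stdlib.h&gt;
    #include &lt;string.h&gt;

    /* trivial stand-in for the real validation of the candidate root */
    static int
    looks_like_mount_root (const char *path)
    {
            return (path &amp;&amp; path[0] == '/');
    }

    static int
    get_root (const char *brick_path, char **mount_point)
    {
            char *mnt = strdup (brick_path);    /* stand-in for gf_strdup() */

            if (!mnt)
                    return -1;

            if (!looks_like_mount_root (mnt)) {
                    free (mnt);     /* freed locally on the error path ...     */
                    return -1;      /* ... while *mount_point was never set, so
                                       the caller cannot free it a second time */
            }

            *mount_point = mnt;     /* publish only after all checks have passed */
            return 0;
    }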

Change-Id: I3f3035f01e80a5e9bdf2da895e4cf7baa3dfbd2f
BUG: 919352
Signed-off-by: Lars Ellenberg &lt;lars@linbit.com&gt;
Reviewed-on: http://review.gluster.org/4646
Reviewed-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-on: http://review.gluster.org/4841
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
</content>
</entry>
<entry>
<title>glusterd: allow multiple instances of glusterd on one machine</title>
<updated>2013-04-16T11:48:58+00:00</updated>
<author>
<name>Krishnan Parthasarathi</name>
<email>kparthas@redhat.com</email>
</author>
<published>2013-04-16T05:06:31+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=29d7563416c0d94cf36d7e05493332aacebfa0e0'/>
<id>29d7563416c0d94cf36d7e05493332aacebfa0e0</id>
<content type='text'>
This is needed to support automated testing of cluster-communication
features such as probing and quorum.  In order to use this, you need to
do the following preparatory steps.

* Copy /var/lib/glusterd to another directory for each virtual host

* Ensure that each virtual host has a different UUID in its glusterd.info

Now you can start each copy of glusterd with the following xlator-options.

* management.transport.socket.bind-address=$ip_address

* management.working-directory=$unique_working_directory

You can use 127.x.y.z addresses for binding without needing to assign
them to interfaces explicitly.  Note that you must use addresses, not
names, because of some stuff in the socket code that's not worth fixing
just for this usage, but after that you can use names in /etc/hosts
instead.

At this point you can issue CLI commands to a specific glusterd using
the --remote-host option.  So far probe, volume create/start/stop,
mount, and basic I/O all seem to work as expected with multiple
instances.

Change-Id: I1beabb44cff8763d2774bc208b2ffcda27c1a550
BUG: 913555
Original-author: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Signed-off-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4838
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</content>
</entry>
<entry>
<title>tests/cluster.rc: support for virtual multi-server glusterd</title>
<updated>2013-04-16T03:48:26+00:00</updated>
<author>
<name>Jeff Darcy</name>
<email>jdarcy@redhat.com</email>
</author>
<published>2013-04-15T14:32:56+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=695d173b13f583de846720b66fc201bd84969330'/>
<id>695d173b13f583de846720b66fc201bd84969330</id>
<content type='text'>
Since http://review.gluster.org/4556 glusterd is capable of running
many instances of itself on a single system. This patch exploits
that feature and enhances the regression test framework to expose
handy primitives so that test cases may be written to test glusterd
in a cluster.

Usage:

1. Include "$(dirname)/../cluster.rc" to get access to the extensions

2. Call launch_cluster $N where $N is the count of virtual servers

Calling launch_cluster starts $N glusterds, which bind to $N different
IPs, and dynamically defines these primitives:

 - Variables $H1 .. $Hn assigned to hostnames of each "server".

 - Variables $CLI_1 .. $CLI_n assigned as commands to run CLI commands
   on the corresponding N'th server.

 - Variables $B1 .. $Bn assigned to the backend directories on each
   "server".

 - Function kill_glusterd, which accepts a parameter - index number of
   glusterd to be killed.

 - Variables $glusterd_1 .. $glusterd_n assigned to the command lines
   to restart the corresponding glusterd, if it was previously killed.

The current set of primitives and functions was implemented with the goal
of satisfying ./tests/bugs/bug-913555.t. The API will be made richer as
we add more cluster test cases.

Change-Id: I6e79c58098ed0862cf75a0b56e4ce384ec2e4eb2
BUG: 913555
Original-author: Anand Avati &lt;avati@redhat.com&gt;
Signed-off-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4836
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
</content>
</entry>
<entry>
<title>rpc-transport: fix glusterd crash when rdma.so missing</title>
<updated>2013-04-12T18:31:38+00:00</updated>
<author>
<name>Rajesh Amaravathi</name>
<email>rajesh@redhat.com</email>
</author>
<published>2013-03-21T11:10:16+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=b934b278be2a26a79b3715618ec4c368feb55ad9'/>
<id>b934b278be2a26a79b3715618ec4c368feb55ad9</id>
<content type='text'>
Add checks before trying to delete vol_opt from the list and free it.

Change-Id: I2858f58518394beb8f74fa477be81d7bdd38304f
BUG: 924215
Signed-off-by: Rajesh Amaravathi &lt;rajesh@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4704
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Anand Avati &lt;avati@redhat.com&gt;
Reviewed-on: http://review.gluster.org/4819
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
</content>
</entry>
</feed>
