glusterfs.git/xlators/features/marker/utils/syncdaemon/resource.py, branch v3.4.2qa2

geo-replication: catch select.error on select()

2012-11-29T00:23:49+00:00

tailer() in resource.py does not correctly catch exceptions from
select(). select() can raise an instance of the select.error class, but
the current except clause only catches ValueError (and binds the
caught instance to a name called selecterror).

The geo-rep log contains a call trace like this:
> E [syncdutils:190:log_raise_exception] : FAIL:
> Traceback (most recent call last):
> File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 216, in twrap
> tf(*aa)
> File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 123, in tailer
> poe, _ ,_ = select([po.stderr for po in errstore], [], [], 1)
> File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 276, in select
> return eintr_wrap(oselect.select, oselect.error, *a)
> File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 269, in eintr_wrap
> return func(*a)
> error: (9, 'Bad file descriptor')
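
A minimal sketch of the idea behind the fix, not the actual tailer() code: list both exception classes in a tuple so that select.error is caught as well. The helper name poll_readable is hypothetical. On Python 3.3 and later select.error is an alias of OSError; the sketch runs there, while the trace above comes from Python 2.

```python
import os
import select


def poll_readable(fds, timeout=1):
    """Return the readable subset of fds, or None if select() itself failed.

    Hypothetical helper for illustration; not the tailer() from resource.py.
    """
    try:
        readable, _, _ = select.select(fds, [], [], timeout)
    except (ValueError, select.error):
        # The tuple catches both classes. The Python 2 form
        # "except ValueError, name" catches only ValueError and binds it to name.
        return None
    return readable


# Provoke the "Bad file descriptor" case from the trace with a closed pipe end.
closed_r, closed_w = os.pipe()
os.close(closed_r)
os.close(closed_w)
bad_result = poll_readable([closed_r], timeout=0)

# A healthy pipe with pending data is reported as readable.
good_r, good_w = os.pipe()
os.write(good_w, b"x")
good_result = poll_readable([good_r], timeout=0)
```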

BUG: 880308
Change-Id: I2babe42918950d0e9ddb3d08fa21aa3548ccf7c5
Signed-off-by: Niels de Vos 
Reviewed-on: http://review.gluster.org/4233
Reviewed-by: Peter Portante 
Reviewed-by: Csaba Henk 
Tested-by: Gluster Build System

geo-rep/gsyncd: work around rsync argument overflow

2012-09-07T23:50:08+00:00

Instead of passing the files to be synced as args to rsync, have rsync
read them on stdin with '-0 --files-from=-'.
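
A rough sketch of what such an invocation could look like, assuming the file names are relative to a source directory. The function name and its parameters are hypothetical, and gsyncd passes further rsync options that are not shown here. Only '-0' and '--files-from=-' are taken from the message above.

```python
import os


def build_rsync_invocation(files, source_dir, dest):
    """Return (argv, stdin_payload) for an rsync run that reads names on stdin.

    Hypothetical helper; the real gsyncd command line has more options.
    """
    # The file list is no longer part of argv, so its length is not bounded
    # by the kernel's limit on argument size.
    argv = ["rsync", "-0", "--files-from=-", source_dir, dest]
    # With -0 every name is terminated by a NUL byte, so names containing
    # spaces or newlines survive intact.
    payload = b"".join(os.fsencode(name) + b"\0" for name in files)
    return argv, payload


argv, payload = build_rsync_invocation(["a b.txt", "dir/c"], ".", "host:/dst")
```

The payload would then be handed to the child process on stdin, for example with subprocess.run(argv, input=payload).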

Change-Id: Ic3f71a0269941ce50051af8adfad183a52a79b01
BUG: 855306
Signed-off-by: Csaba Henk 
Reviewed-on: http://review.gluster.org/3917
Tested-by: Gluster Build System 
Reviewed-by: Amar Tumballi 
Reviewed-by: Anand Avati

geo-rep / gsyncd: add support for sending xtimes through rsync

2012-07-19T17:15:57+00:00

Note that in said mode metadata synchronization is best effort:
rsync syncs metadata last, so if rsync is interrupted between the
xattr sync and metadata sync stages, the file will still be considered
in sync.

Change-Id: I1c75eab33b0a1000abf3ad36b2d484a89eeda1bd
BUG: 841062
Signed-off-by: Csaba Henk 
Reviewed-on: http://review.gluster.com/3683
Tested-by: Gluster Build System 
Reviewed-by: Venky Shankar

geo-rep / gsyncd: rsync option cleanups, fixes

2012-07-18T11:56:13+00:00

- add two tunables for rsync: "rsync-options" and "rsync-ssh-options"
- always pass "--no-implied-dirs" to rsync

Change-Id: I3d67a4cba8cabd681edac80e6b1fb8ea322008bd
BUG: 841062
Signed-off-by: Csaba Henk 
Reviewed-on: http://review.gluster.com/3682
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

geo-rep / gsyncd: fixes to communication with child processes

2012-07-15T04:09:41+00:00

Due to not using the proper Python keyword, it was possible for the
errhandler thread to run into an empty select.

Signed-off-by: Csaba Henk 
BUG: 764678
Change-Id: I3c39e718e72545c27d50fd73aa6daf54062331b0
Reviewed-on: http://review.gluster.com/3560
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

geo-rep / gsyncd: sanitize error log of external commands

2012-07-15T04:08:25+00:00

If a command invoked by gsyncd fails, gsyncd logs what the command
wrote to its stderr. So far the log nondeterministically
broke lines at random places. Now put some effort into reconstructing
the original lines and keeping the log faithful.
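
One way to reconstruct lines from arbitrarily sized stderr chunks is to hold back the trailing partial line until its newline arrives. This is a sketch of that general technique, not the code from this commit; the class name LineAssembler is hypothetical.

```python
class LineAssembler:
    """Reassemble complete lines from chunks that may split lines anywhere."""

    def __init__(self):
        self._pending = b""

    def feed(self, chunk):
        """Add a chunk and return the lines completed by it (without newlines)."""
        data = self._pending + chunk
        # Everything before the last newline is complete; the rest is kept back.
        *complete, self._pending = data.split(b"\n")
        return complete

    def flush(self):
        """Return the held-back partial line, if any, when the stream ends."""
        rest, self._pending = self._pending, b""
        return [rest] if rest else []


assembler = LineAssembler()
first = assembler.feed(b"rsync: fail")
second = assembler.feed(b"ed\nsecond li")
third = assembler.feed(b"ne\ntail")
leftover = assembler.flush()
```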

BUG: 764678
Change-Id: I16fcc75d3e0f624c10c71d9b37c937ca677087cc
Signed-off-by: Csaba Henk 
Reviewed-on: http://review.gluster.com/3561
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

gsyncd / geo-rep : failover/failback

2012-06-13T15:39:06+00:00

This commit is based on Venky Shankar's
original implementation. Let us first quote Venky's
description, then we summarize changes to his work.

------
First version of failover/failback.

Failback mechanism uses two exclusive modes:
  * blind-sync
    This mode works with xtime pairs (both master and slave) to
    identify candidates to sync the original master from the slave

  * wrapup-sync
    This mode is similar to the normal working of gsyncd except
    that orphaned entities in the gluster volume are not assigned
    xtimes. This prevents un-necessary transfer of data for such
    entities.

Modes can be enabled via:

  gluster volume geo-replication M S config special_sync_mode blind
  gluster volume geo-replication M S config special_sync_mode wrapup

To turn off the special modes (i.e. to revert to normal gsyncd behaviour) use:

  gluster volume geo-replication colon-d0 192.168.1.2::colon-d config \!special_sync_mode
------

Code has been refactored to meet the following goals:

- make checkpointing work with special sync modes
- move sync mode related conditionals out of the crawl
  loop and make all decisions at startup time
- be intrusive to the crawl loop to the smallest possible degree
  (we will have to change/revisit it for other reasons,
  and the complexity of that should not increase)

So, xtime parsing/updating/evaluation that is specific to
particular special modes is represented as mixin classes;
basic operation logic is in an abstract base class.
On startup, the special-sync-mode tunable is dynamically dispatched
to the corresponding mixin and the actual master class is
derived from the chosen mixin and the ABC.
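
The dispatch described above can be sketched roughly as follows. All class names, the needs_sync() method and the comparison rules inside the mixins are placeholders invented for illustration; they do not reproduce gsyncd's actual xtime logic.

```python
from abc import ABC, abstractmethod


class CrawlBase(ABC):
    """Basic operation logic; the xtime policy is supplied by a mixin."""

    @abstractmethod
    def needs_sync(self, master_xtime, slave_xtime):
        """Decide whether an entry has to be synced."""

    def crawl(self, entries):
        return [name for name, m, s in entries if self.needs_sync(m, s)]


class NormalMixin:
    def needs_sync(self, master_xtime, slave_xtime):
        # Placeholder rule, not gsyncd's real comparison.
        return slave_xtime is None or master_xtime > slave_xtime


class BlindMixin:
    def needs_sync(self, master_xtime, slave_xtime):
        # Placeholder rule, not gsyncd's real comparison.
        return None not in (master_xtime, slave_xtime) and master_xtime != slave_xtime


class WrapupMixin(NormalMixin):
    """Placeholder; reuses the normal rule in this sketch."""


MODE_MIXINS = {None: NormalMixin, "blind": BlindMixin, "wrapup": WrapupMixin}


def make_master_class(special_sync_mode=None):
    """Derive the concrete master class once, at startup."""
    mixin = MODE_MIXINS[special_sync_mode]
    # The mixin comes first so that its needs_sync() overrides the abstract one.
    return type("Master", (mixin, CrawlBase), {})


Master = make_master_class("blind")
selected = Master().crawl([("a", 2, 1), ("b", 1, 1), ("c", 3, None)])
```

Because the class is chosen once, the crawl loop itself carries no mode-specific conditionals.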

Change-Id: Ic9b8448f31ad4239a8200dc689f7d713662a67de
BUG: 830497
Signed-off-by: Csaba Henk 
Reviewed-on: http://review.gluster.com/3541
Tested-by: Gluster Build System 
Reviewed-by: Venky Shankar

geo-rep / gsyncd: further cleanup refinements

2012-05-25T01:22:52+00:00

- Regarding issue of leftover ssh control dirs:

  If master side worker is stuck in connection establishment
  phase, have the monitor kill it softly (i.e. first by SIGTERM,
  to let it clean up). This is trickier than it sounds at first,
  because if worker is stuck in waiting for a RePCe answer
  (in threading.Condition().wait()), then SIGTERM is ignored
  (more precisely, Python holds it back for the wait and resends it to
  itself when wait is over).

  So instead of signalling the worker only, we send TERM to the
  whole process group -- that brings down the ssh connection, which
  wakes up the waiting worker, which then can clean up. Only problem
  is that monitor is also in the process group and it should not commit
  suicide. That is taken care of by setting up a one-time SIGTERM
  handler in the monitor.

- Regarding slave gsyncd stuck in chdir:

  Slave gsyncd is usually well behaved: if master does not send
  keepalives, it takes care to exit. However, if a hang occurs
  in early phase, when slave is to change to the gluster mountpoint,
  no timeout is set up for that (and unlike on master side, neither
  is there an external actor like the monitor to do that).

  So, to manage this scenario, we do the chdir in a (supposedly)
  short lived thread, and in the main thread we wait for the termination
  of this thread. If that does not happen within the time limit, main
  thread calls for cleanup and exit. (This logic explicitly takes the
  appropriate action in the cases when chdir succeeds or when it hangs;
  but what about the remaining case, when chdir fails? Well, in that case
  the chdir thread's exception handler will put the process on the
  cleanup and exit route.)
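
Both items above can be sketched in a few lines, under the assumption of a POSIX system. The function names are hypothetical and this is not the gsyncd code. Unlike the description above, the sketch reports a failed call to the caller instead of starting cleanup from the thread's exception handler. The helper run_in_own_session() exists only so the group-wide signal can be tried out without hitting unrelated processes.

```python
import os
import signal
import threading


def terminate_group_sparing_self():
    """Send SIGTERM to our whole process group, but survive it once."""
    def once(signum, frame):
        # One-time handler: swallow this SIGTERM, then restore the default.
        signal.signal(signal.SIGTERM, signal.SIG_DFL)
    signal.signal(signal.SIGTERM, once)
    os.killpg(os.getpgrp(), signal.SIGTERM)


def call_with_deadline(func, timeout):
    """Run func in a short-lived thread; return 'ok', 'failed' or 'timeout'."""
    outcome = {}

    def runner():
        try:
            func()
            outcome["result"] = "ok"
        except Exception:
            outcome["result"] = "failed"

    # Daemon thread: a call that hangs forever must not block interpreter exit.
    thread = threading.Thread(target=runner, daemon=True)
    thread.start()
    thread.join(timeout)
    return outcome.get("result", "timeout")


def run_in_own_session(func):
    """Demo helper: run func in a forked child that leads its own session."""
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            os.setsid()  # new session and process group, containing only us
            func()
            status = 0
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0


survived = run_in_own_session(terminate_group_sparing_self)
chdir_outcome = call_with_deadline(lambda: os.chdir(os.getcwd()), 5)
```

A caller would treat 'timeout' and 'failed' alike and go on to cleanup and exit.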

Change-Id: I6ad6faa9c7b1c37084d171d1e1a756abaff9eba8
BUG: 786291
Signed-off-by: Csaba Henk 
Reviewed-on: http://review.gluster.com/3376
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

geo-rep / gsyncd: add "--super" to rsync invocation

2012-05-25T01:05:47+00:00

This forces rsync to perform supposedly privileged operations on
unprivileged slaves (like chown(2)).

For consistent behavior (with gsyncd's "chown" RPC call that's
being used for symlinks and directories), we also pass
"--numeric-ids" to rsync.

Also took the chance to retire gsyncd's "--rsync-extra" option
which was there for debugging purposes (related to a resolved
issue).

Change-Id: I4ee4d0d3a8c4e0f6746d34d7722c8a567a67491c
BUG: 822121
Signed-off-by: Csaba Henk 
Reviewed-on: http://review.gluster.com/3426
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

geo-rep / gsyncd: fix cleanup of temporary mounts

2012-05-22T01:32:13+00:00

[This is a "forward port" of fafd5c17, http://review.gluster.com/2908]

The "finally" clause that was meant to clean up after the
temp mount did not cover the case of getting signalled
(e.g. by monitor, upon worker timing out).

So here we "outsource" the cleanup to an ephemeral child process.
Child calls setsid(2) so it won't be bothered by internal process
management. We use a pipe in between worker and the cleanup child;
when child sees the worker end getting closed, it performs the cleanup.
Worker end can get closed either because worker closes it (normal case),
or because worker has terminated (faulty case) -- thus as bonus, we get
a nice uniform handling with no need to differentiate between normal and
faulty cases.
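
The pipe-based scheme can be sketched as follows, assuming a POSIX system. The function name and the cleanup callback are hypothetical. In gsyncd the cleanup concerns the temporary mount, whereas the usage below merely removes a scratch directory.

```python
import os
import tempfile


def spawn_cleanup_child(cleanup):
    """Fork a detached child that runs cleanup() once the worker end closes.

    Returns (child_pid, worker_fd); the worker keeps worker_fd open while alive.
    """
    read_fd, worker_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            os.setsid()            # leave the worker's session and process group
            os.close(worker_fd)    # otherwise the child would never see EOF
            while os.read(read_fd, 4096):
                pass               # b"" means every holder of the write end is gone
            cleanup()
            status = 0
        finally:
            os._exit(status)
    os.close(read_fd)
    return pid, worker_fd


scratch = tempfile.mkdtemp()
child_pid, worker_fd = spawn_cleanup_child(lambda: os.rmdir(scratch))
existed_while_open = os.path.isdir(scratch)

# Closing the descriptor (or the worker dying) is what triggers the cleanup.
os.close(worker_fd)
os.waitpid(child_pid, 0)
exists_after_close = os.path.exists(scratch)
```

The kernel closes the descriptor when the worker dies, so the child observes the same end-of-file in the normal and the faulty case.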

The faulty case that was seen IRL -- i.e., users of maintenance mounts
hang in chdir(2) to mount point -- can be simulated for testing purposes
by applying the following patch:

  diff --git a/xlators/mount/fuse/src/fuse-bridge.c b/xlators/mount/fuse/src/fuse-bridge.c
  index acd3c68..1ce5dc1 100644
  --- a/xlators/mount/fuse/src/fuse-bridge.c
  +++ b/xlators/mount/fuse/src/fuse-bridge.c
  @@ -2918,7 +2918,7 @@ fuse_init (xlator_t *this, fuse_in_header_t *finh, void *msg)
           if (fini->minor < 9)
                   *priv->msg0_len_p = sizeof(*finh) + FUSE_COMPAT_WRITE_IN_SIZE;
   #endif
  -        ret = send_fuse_obj (this, finh, &fino);
  +        ret = priv->client_pid_set ? 0 : send_fuse_obj (this, finh, &fino);
           if (ret == 0)
                   gf_log ("glusterfs-fuse", GF_LOG_INFO,
                           "FUSE inited with protocol versions:"

Change-Id: I14bad56a60a7fa82d0104fa4b9a20f4e42a7186f
BUG: 786291
Signed-off-by: Csaba Henk 
Reviewed-on: http://review.gluster.com/3259
Tested-by: Gluster Build System 
Reviewed-by: Jeff Darcy 
Reviewed-by: Anand Avati