glusterfs.git/xlators/cluster/afr/src/pump.c, branch v3.6.8

cluster/afr: Do not increment healed_count if no healing was performed

2015-03-14T10:08:48+00:00

        Backport of: http://review.gluster.org/9713

PROBLEM:
When file modifications are happening while index heal is launched,
index healer could pick up entries which appeared in indices/xattrop
transiently during the course of the operations on the mount point, and
do not really need any heal. This will cause index healer to keep doing
index-heal in a loop as long as it finds this entry, by believing that
it did successfully heal some gfids even when it didn't.

FIX:
afr_selfheal() now returns a 1 to indicate that it did not (need to)
heal a given gfid. afr_shd_selfheal() will not increment healed_count
whenever afr_selfheal() returns a 1.

Change-Id: I9158c814419b635fac3dfe2fe40c94d1548ea4e8
BUG: 1194306
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/9852
Tested-by: Gluster Build System 
Reviewed-by: Ravishankar N 
Reviewed-by: Anuradha Talur 
Reviewed-by: Raghavendra Bhat

cluster/afr: Perform gfid heal inside locks.

2014-09-12T08:14:38+00:00

        Backport of http://review.gluster.org/8512

Problem:
Allowing lookup with 'gfid-req' will lead to assigning gfid at posix layer.
When two mounts perform lookup in parallel that can lead to both bricks getting
different gfids leading to gfid-mismatch/EIO for the lookup.

Fix:
Perform gfid heal inside lock.

BUG: 1136823
Change-Id: I1059fcd38338348b7e96a0bd4c35b0f49bf77aec
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/8589
Reviewed-by: Krutika Dhananjay 
Tested-by: Gluster Build System 
Reviewed-by: Ravishankar N 
Reviewed-by: Vijay Bellur

cluster/afr: refactor

2014-03-22T12:25:57+00:00

- Remove client side self-healing completely (opendir, openfd, lookup)
- Re-work readdir-failover to work reliably in case of NFS
- Remove unused/dead lock recovery code
- Consistently use xdata in both calls and callbacks in all FOPs
- Per-inode event generation, used to force inode ctx refresh
- Implement dirty flag support (in place of pending counts)
- Eliminate inode ctx structure, use read subvol bits + event_generation
- Implement inode ctx refreshing based on event generation
- Provide backward compatibility in transactions
- remove unused variables and functions
- make code more consistent in style and pattern
- regularize and clean up inode-write transaction code
- regularize and clean up dir-write transaction code
- regularize and clean up common FOPs
- reorganize transaction framework code
- skip setting xattrs in pending dict if nothing is pending
- re-write self-healing code using syncops
- re-write simpler self-heal-daemon

Change-Id: I1e4080c9796c8a2815c2dab4be3073f389d614a8
BUG: 1021686
Signed-off-by: Anand Avati 
Reviewed-on: http://review.gluster.org/6010
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

core: add @xdata parameter to syncop_[f]removexattr()

2014-02-13T19:17:05+00:00

To be used in afr metadata self-heal

Change-Id: I8dac4b19d61e331702427eeb5b606aab3d20b328
BUG: 1021686
Signed-off-by: Anand Avati 
Reviewed-on: http://review.gluster.org/6941
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra Bhat

cluster/afr: goto statements may cause exit before memory is freed.

2014-02-10T16:38:57+00:00

Memory is allocated for pump_priv and for pump_priv->resume_path, but if
an error is detected the references to that memory go out of scope and
the memory is never freed.

This patch assures that the memory is freed on error.

Patchset 2: These are Kaleb's recommended changes which, compared to my
            original fix, are more comprehensive and provide a more
            complete resolution to the memory leakage bugs in this
            function.  The bug reported by Coverity was limited to a
            single memory allocation.

BUG: 789278
CID: 1124737

Change-Id: Ie239e3b5d28d97308bf948efec6a92f107bc648b
Signed-off-by: Christopher R. Hertel 
Reviewed-on: http://review.gluster.org/6929
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/afr: Fix memory leak.

2014-02-08T16:21:21+00:00

Change-Id: I811d104684905a5a9a794cde8e925bd1a97f6546
BUG: 789278
Signed-off-by: Poornima 
Reviewed-on: http://review.gluster.org/6906
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

syncop: Change return value of syncop

2014-01-20T07:05:15+00:00

Problem:
We found a day-1 bug when syncop_xxx() infra is used inside a synctask with
compilation optimization (CFLAGS -O2).

Detailed explanation of the Root cause:
We found the bug in 'gf_defrag_migrate_data' in rebalance operation:

Lets look at interesting parts of the function:

int
gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc,
                        dict_t *migrate_data)
{
.....
code section - [ Loop ]
        while ((ret = syncop_readdirp (this, fd, 131072, offset, NULL,
                                       &entries)) != 0) {
.....
code section - [ ERRNO-1 ] (errno of readdirp is stored in readdir_operrno by a
thread)
                /* Need to keep track of ENOENT errno, that means, there is no
                   need to send more readdirp() */
                readdir_operrno = errno;
.....
code section - [ SYNCOP-1 ] (syncop_getxattr is called by a thread)
                        ret = syncop_getxattr (this, &entry_loc, &dict,
                                               GF_XATTR_LINKINFO_KEY);
code section - [ ERRNO-2]   (checking for failures of syncop_getxattr(). This
may not always be executed in same thread which executed [SYNCOP-1])
                        if (ret < 0) {
                                if (errno != ENODATA) {
                                        loglevel = GF_LOG_ERROR;
                                        defrag->total_failures += 1;
.....
}

the function above could be executed by thread(t1) till [SYNCOP-1] and code
from [ERRNO-2] can be executed by a different thread(t2) because of the way
syncop-infra schedules the tasks.

when the code is compiled with -O2 optimization this is the assembly code that
is generated:
 [ERRNO-1]
1165                        readdir_operrno = errno; <<---- errno gets expanded
as *(__errno_location())
   0x00007fd149d48b60 <+496>:        callq  0x7fd149d410c0 
   0x00007fd149d48b72 <+514>:        mov    %rax,0x50(%rsp) <<------ Address
returned by __errno_location() is stored in a special location in stack for
later use.
   0x00007fd149d48b77 <+519>:        mov    (%rax),%eax
   0x00007fd149d48b79 <+521>:        mov    %eax,0x78(%rsp)
....
 [ERRNO-2]
1281                                        if (errno != ENODATA) {
   0x00007fd149d492ae <+2366>:        mov    0x50(%rsp),%rax <<-----  Because
it already stored the address returned by __errno_location(), it just
dereferences the address to get the errno value. BUT THIS CODE NEED NOT BE
EXECUTED BY SAME THREAD!!!
   0x00007fd149d492b3 <+2371>:        mov    $0x9,%ebp
   0x00007fd149d492b8 <+2376>:        mov    (%rax),%edi
   0x00007fd149d492ba <+2378>:        cmp    $0x3d,%edi

The problem is that __errno_location() value of t1 and t2 are different. So
[ERRNO-2] ends up reading errno of t1 instead of errno of t2 even though t2 is
executing [ERRNO-2] code section.

When code is compiled without any optimization for [ERRNO-2]:
1281                                        if (errno != ENODATA) {
   0x00007fd58e7a326f <+2237>:        callq  0x7fd58e797300
<<--- As it is calling __errno_location() again it gets the
location from t2 so it works as intended.
   0x00007fd58e7a3274 <+2242>:        mov    (%rax),%eax
   0x00007fd58e7a3276 <+2244>:        cmp    $0x3d,%eax
   0x00007fd58e7a3279 <+2247>:        je     0x7fd58e7a32a1


Fix:
Make syncop_xxx() return (-errno) value as the return value in
case of errors and all the functions which make syncop_xxx() will need to use
(-ret) to figure out the reason for failure in case of syncop_xxx() failures.

Change-Id: I314d20dabe55d3e62ff66f3b4adb1cac2eaebb57
BUG: 1040356
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/6475
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

synctask: minor enhancements

2013-08-28T22:52:24+00:00

- Enhance syncenv_new() to accept scaling parameters of syncproc.
  Previously the scaling parameters were hardcoded and decided at
  compile time.

- New API synctask_create() which returns the created synctask. This
  is similar to synctask_new which only returned the status of whether
  a synctask could be created or not.

  The meaning of NULL cbk in synctask_create() means the task is
  "joinable". Until synctask_join() is called on such a synctask,
  the task is not reaped and resources are not destroyed. The
  task would be in a zombie state after synctask_fn returns and
  before synctask_join() is called.

Change-Id: I368ec9037de9510d2ba951f0aad86aaf18d9a6b6
BUG: 986775
Signed-off-by: Anand Avati 
Reviewed-on: http://review.gluster.org/5365
Tested-by: Gluster Build System 
Reviewed-by: Brian Foster

cluster/afr: Let two data-self-heals compete in new domain

2013-07-04T04:35:30+00:00

Problem:
At the moment data-self-heal acquires locks in following
pattern. It takes full file lock then gets xattrs on files on both
replicas. Decides sources/sinks based on the xattrs. Now it acquires
lock from 0-128k then unlocks the full file lock. Syncs 0-128k range
from source to sink now acquires lock 128k+1 till 256k then unlocks
0-128k, syncs 128k+1 till 256k block... so on finally it takes full file
lock again then unlocks the final small range block.
It decrements pending counts and then unlocks the full file lock.

     This pattern of locks is chosen to avoid more than 1 self-heal
to be in progress. BUT if another self-heal tries to take a full
file lock while a self-heal is already in progress it will be put in
blocked queue, further inodelks from writes by the application will
also be put in blocked queue because of the way locks xlator grants
inodelks. So until the self-heal is complete writes are blocked.

Here is the code:
xlators/features/locks/src/inodelk.c - line 225
if (__blocked_lock_conflict (dom, lock) && !(__owner_has_lock (dom, lock))) {
         ret = -EAGAIN;
         if (can_block == 0)
                 goto out;

         gettimeofday (&lock->blkd_time, NULL);
         list_add_tail (&lock->blocked_locks, &dom->blocked_inodelks);
}

This leads to hangs in applications.

Fix:
Since we want to prevent two parallel self-heals. We let them compete
in a separate "domain". Lets call the domain on which the locks have
been taken on in previous approach as "data-domain".

In the new approach When a self-heal is triggered,
it acquires a full lock in the new domain "self-heal-domain".
    After this it performs data-self-heal using the locks in
    "data-domain" as before.
unlock the full file lock in "self-heal-domain"

With this approach, application's writevs don't have to wait
in pending queue when more than 1 self-heal is triggered.

Change-Id: Id79aef3dfa888945977fb9758374ac41c320d0d5
BUG: 967717
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/5100
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

pump: Set self-heal readdir size in pump

2013-04-04T00:33:15+00:00

Problem:
In Pump entry self-heal happens for each directory during the
first opendir using conservative merge. But in entry-self-heal
readdir is issued with '0' size. So entry self-heal is not
creating any files. After pump thinks entry self-heal is complete
it proceeds to heal each of the file in the directory it just
healed. Fortunately most of the times it chooses source-brick
in pump as read-child for readdir. This happens because readchild is the
subvolume on which lookup succeeds first. In pump lookup succeeds
faster in local process than on the destination brick process most
of the times. For all the entries pump finds in readdir it does a
lookup. During this lookup the entry on the destination brick is
created and healed. This is the reason why replace-brick
succeeds whenever read-child for the directory is chosen as the
source-brick.  Which is most of the times. When read-child is chosen
as the destination brick, readdir returns no entries so replace-brick
completes without syncing the whole data.

Fix:
Set readdir-size in pump so that entry self-heal happens with
64k size. This ensures that entry self-heal triggered from
opendir actually creates the files on the destination brick.

Change-Id: I65ea45d3c2735a9578f3aa34eff771b6563241ca
BUG: 909800
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/4712
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati