glusterfs.git/xlators/cluster/dht/src/dht-selfheal.c, branch v3.6.5

dht: support heterogeneous brick sizes

2014-07-12T16:20:52+00:00

Calculation of layouts now considers the size of each brick, so that
smaller bricks don't get an "unfair" share of allocations and start
returning ENOSPC while the larger bricks still have plenty of space.
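
As a rough illustration of size-weighted layout calculation, the sketch
below splits the 32-bit hash ring in proportion to per-brick weights.  It
is not the code from this patch: struct hash_range, weighted_layout() and
the use of coarse uint32_t weights (for example, brick size in GiB) are
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout entry: one contiguous slice of the 32-bit hash ring. */
struct hash_range {
        uint32_t start;
        uint32_t stop;          /* inclusive */
};

/*
 * Split [0, 0xffffffff] across n bricks in proportion to weight[i].
 * Weights are assumed to be brick sizes in some coarse unit (e.g. GiB),
 * all non-zero, with a sum below 2^32 so the arithmetic cannot overflow.
 * Returns 0 on success, -1 on invalid input.
 */
int
weighted_layout (const uint32_t *weight, size_t n, struct hash_range *out)
{
        uint64_t total = 0;
        uint64_t acc   = 0;     /* sum of the weights handled so far */
        uint64_t next  = 0;     /* first hash value not yet assigned */
        uint64_t end   = 0;
        size_t   i     = 0;

        assert (weight != NULL && out != NULL);

        if (n == 0)
                return -1;
        for (i = 0; i < n; i++) {
                if (weight[i] == 0)
                        return -1;
                total += weight[i];
        }
        if (total > UINT32_MAX)
                return -1;

        for (i = 0; i < n; i++) {
                acc += weight[i];
                /* Boundary after brick i: floor (2^32 * acc / total). */
                end = (acc << 32) / total;
                out[i].start = (uint32_t) next;
                out[i].stop  = (uint32_t) (end - 1);
                next = end;
        }
        return 0;
}
```

With weights {1, 3}, the first brick receives a quarter of the ring and
the second the remaining three quarters, so the smaller brick is asked to
hold proportionally fewer files.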

The observation has been made that some clients might get ENOTCONN when
trying to fetch disk-size information, and end up calculating layouts
differently.  The following meta-observations can be made.

(1) This scenario is extremely unlikely in configurations with AFR.

(2) The most likely consequence of this scenario is that some files will
be placed sub-optimally by the client with the obsolete (non-weighted)
layout.  They'll still be found anyway, so this isn't a show stopper.

(3) Without this patch it's *guaranteed* that some files will be placed
sub-optimally, because any layout that fails to account for brick sizes
is sub-optimal.

(4) We shouldn't be doing fix-layout from two nodes simultaneously
anyway.  That's inefficient at best.  Any instances of such behavior are
separate bugs, which should be fixed separately.

(5) In the most extreme edge case, two nodes doing weighted and
non-weighted layout fixes could race and end up creating an internally
inconsistent layout.  This condition is still transient; it will be
detected and repaired automatically the next time anyone fetches the
layout.  (If it's not, that's also a preexisting bug that can show up in
other contexts.)

In conclusion, it's not the purpose of this patch to fix bugs elsewhere
in DHT.  Its purpose is to make life incrementally better for users who
add new hardware with larger disks etc. than the older equipment.  It's
only one part of an ongoing process to improve layout management and
repair, all the way up to support for multiple hash rings or tiering.

Change-Id: I05eb6f9eface9cdaf8622e0260c8c7f29020447f
BUG: 1114680
Signed-off-by: Jeff Darcy 
Reviewed-on: http://review.gluster.org/8093
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra G 
Reviewed-by: Shyamsundar Ranganathan 
Reviewed-by: Vijay Bellur

DHT/Logging

2014-07-12T16:16:54+00:00

Changed the log level of a message from none to debug, because messages
logged at level none appear in the log file without a log level.

Change-Id: I463d1095d69bbd0036958282da13cb8e0226f34f
BUG: 1116797
Signed-off-by: Nithya Balachandran 
Reviewed-on: http://review.gluster.org/8253
Reviewed-by: Krutika Dhananjay 
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/dht: Added logging of new layout for dir-selfheal

2014-07-03T16:22:37+00:00

Added a log message that prints the new layout which will be used
for directory self-heal.

It prints:

a) Subvolume name
b) Error --> Is needed because layout healing depends on
             the error and having it in log will help in
             debugging
c) Start     Starting of the layout range
d) Stop      Ending of the layout range

Change-Id: I48c9c697716a899165ed29b737362a75c62e09b3
BUG: 1113066
Signed-off-by: Venkatesh Somyajulu 
Reviewed-on: http://review.gluster.org/8173
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/dht: Do layout self healing of directory for nameless lookup

2014-06-17T12:39:22+00:00

Problem: Currently, in the nameless lookup code path, layout healing
         is not done even when layout anomalies are detected at the
         end of the lookup, because there is no code to heal them.
         This allows a race between mkdir and lookup.

         Assume a mkdir is in progress from another mount point,
         say M1. The directory has been created on some nodes, but
         the layout is not set yet.

         Now a nameless lookup arrives from M2. The lookup succeeds
         because the directory is present on some of the nodes, but
         it does not heal the layout. If a create follows the lookup
         fop, file creation fails because the layout is absent.

Fix:     Included the layout self-heal code in the nameless lookup
         path. At the end of the lookup, the layout is computed as
         it would have been in a named lookup, but it is set only
         on those nodes where the directory is present.
         A subsequent create fop is therefore much more likely to
         find a subvolume with the proper hash range, which narrows
         the race window.

Other:  Whenever a directory is created, we have to choose a brick
        from which we start allocating the layout in a circular fashion.
        To calculate this starting brick, I have changed the hash input
        from the name of the directory to the gfid of the directory.

        To compute where a given file belongs, we still use the name
        of the file. The hash computed from the file name falls
        within one of the directory's hash ranges.

        Hashing a file name acts as a consumer of the layout, and
        setting the directory layout based on gfid acts as a producer;
        the two are independent of each other.
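
        The circular allocation can be pictured with the sketch below.
        It is an illustration only: start_brick(), brick_for_slice() and
        the simple multiply-and-add fold over the 16 gfid bytes are
        hypothetical and are not the hash function DHT uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical: fold a 16-byte gfid into the index of the brick that
 * receives the first slice of the layout.  NOT DHT's real hash function.
 */
size_t
start_brick (const unsigned char gfid[16], size_t nbricks)
{
        uint32_t h = 0;
        size_t   i = 0;

        assert (nbricks > 0);
        for (i = 0; i < 16; i++)
                h = h * 31u + gfid[i];
        return (size_t) (h % nbricks);
}

/* Slice k of the layout goes to brick (start + k) mod nbricks. */
size_t
brick_for_slice (size_t start, size_t k, size_t nbricks)
{
        assert (nbricks > 0);
        return (start + k) % nbricks;
}
```

        Because only the gfid feeds start_brick(), renaming a directory
        would not change where its layout starts in this sketch, while
        file placement still depends only on the file name.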

Change-Id: I3808c55082cd1b5c72d2c77cbbc063f55aa38bee
BUG: 1095888
Signed-off-by: Venkatesh Somyajulu 
Reviewed-on: http://review.gluster.org/7493
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra Bhat 
Reviewed-by: Vijay Bellur

cluster/dht: Bring option to choose gfid or name based hashing

2014-06-17T12:26:27+00:00

Change-Id: I11794eb2adceb88e75864aede450e904431a6273
BUG: 1095888
Signed-off-by: Venkatesh Somyajulu 
Reviewed-on: http://review.gluster.org/8049
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

Cluster/DHT: New logging framework

2014-06-16T13:25:51+00:00

Moved all relevant DHT gf_log calls to the new logging
framework.

Change-Id: I3af3cfe0416e332774a6c4ff6a091d006c400af2
BUG: 1075611
Signed-off-by: Nithya Balachandran 
Reviewed-on: http://review.gluster.org/7929
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/dht: Set quota limit key in dht_selfheal of dirs.

2014-01-23T05:39:57+00:00

Also fixed check in dht_is_subvol_in_layout to check if the
layouts are zero'ed out.

Change-Id: I4bf8ebf66d3ef1946309b6c9aac9e79bf8a6d495
BUG: 969461
Signed-off-by: shishir gowda 
Signed-off-by: Varun Shastry 
Reviewed-on: http://review.gluster.org/6392
Reviewed-by: Raghavendra G 
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

syncop: Change return value of syncop

2014-01-20T07:05:15+00:00

Problem:
We found a day-1 bug when syncop_xxx() infra is used inside a synctask with
compilation optimization (CFLAGS -O2).

Detailed explanation of the Root cause:
We found the bug in 'gf_defrag_migrate_data' in rebalance operation:

Let's look at the interesting parts of the function:

int
gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc,
                        dict_t *migrate_data)
{
.....
code section - [ Loop ]
        while ((ret = syncop_readdirp (this, fd, 131072, offset, NULL,
                                       &entries)) != 0) {
.....
code section - [ ERRNO-1 ] (errno of readdirp is stored in readdir_operrno by a
thread)
                /* Need to keep track of ENOENT errno, that means, there is no
                   need to send more readdirp() */
                readdir_operrno = errno;
.....
code section - [ SYNCOP-1 ] (syncop_getxattr is called by a thread)
                        ret = syncop_getxattr (this, &entry_loc, &dict,
                                               GF_XATTR_LINKINFO_KEY);
code section - [ ERRNO-2]   (checking for failures of syncop_getxattr(). This
may not always be executed in same thread which executed [SYNCOP-1])
                        if (ret < 0) {
                                if (errno != ENODATA) {
                                        loglevel = GF_LOG_ERROR;
                                        defrag->total_failures += 1;
.....
}

The function above could be executed by thread (t1) up to [SYNCOP-1], and the
code from [ERRNO-2] onward can be executed by a different thread (t2), because
of the way the syncop infra schedules tasks.

When the code is compiled with -O2 optimization, this is the assembly code that
is generated:
 [ERRNO-1]
1165                        readdir_operrno = errno; <<---- errno gets expanded
as *(__errno_location())
   0x00007fd149d48b60 <+496>:        callq  0x7fd149d410c0 
   0x00007fd149d48b72 <+514>:        mov    %rax,0x50(%rsp) <<------ Address
returned by __errno_location() is stored in a special location in stack for
later use.
   0x00007fd149d48b77 <+519>:        mov    (%rax),%eax
   0x00007fd149d48b79 <+521>:        mov    %eax,0x78(%rsp)
....
 [ERRNO-2]
1281                                        if (errno != ENODATA) {
   0x00007fd149d492ae <+2366>:        mov    0x50(%rsp),%rax <<-----  Because
it already stored the address returned by __errno_location(), it just
dereferences the address to get the errno value. BUT THIS CODE NEED NOT BE
EXECUTED BY SAME THREAD!!!
   0x00007fd149d492b3 <+2371>:        mov    $0x9,%ebp
   0x00007fd149d492b8 <+2376>:        mov    (%rax),%edi
   0x00007fd149d492ba <+2378>:        cmp    $0x3d,%edi

The problem is that __errno_location() returns different addresses for t1 and
t2. So [ERRNO-2] ends up reading the errno of t1 instead of the errno of t2,
even though t2 is executing the [ERRNO-2] code section.

When code is compiled without any optimization for [ERRNO-2]:
1281                                        if (errno != ENODATA) {
   0x00007fd58e7a326f <+2237>:        callq  0x7fd58e797300
<<--- As it is calling __errno_location() again it gets the
location from t2 so it works as intended.
   0x00007fd58e7a3274 <+2242>:        mov    (%rax),%eax
   0x00007fd58e7a3276 <+2244>:        cmp    $0x3d,%eax
   0x00007fd58e7a3279 <+2247>:        je     0x7fd58e7a32a1


Fix:
Make syncop_xxx() return (-errno) as its return value in case of errors.
All functions that call syncop_xxx() must then use (-ret) to determine the
reason for a syncop_xxx() failure.
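
A minimal sketch of the resulting calling convention follows.  The names
fake_syncop_getxattr() and is_real_failure() are hypothetical stand-ins,
not functions from the syncop infra; the point is only that the caller
inspects the return value and never reads errno.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical stand-in for a syncop_xxx() call: on failure it returns the
 * negated error code instead of returning -1 and setting errno.
 */
int
fake_syncop_getxattr (int simulated_errno)
{
        assert (simulated_errno >= 0);
        if (simulated_errno != 0)
                return -simulated_errno;
        return 0;
}

/*
 * Caller side: classify the failure from ret alone.  Since errno is never
 * consulted, it does not matter which thread resumes the synctask.
 * Returns 1 for a failure that should be counted, 0 otherwise.
 */
int
is_real_failure (int ret)
{
        if (ret >= 0)
                return 0;
        return (-ret != ENODATA);
}
```

A caller in the style of gf_defrag_migrate_data would then test
"if (ret < 0) { if (-ret != ENODATA) ..." rather than comparing errno.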

Change-Id: I314d20dabe55d3e62ff66f3b4adb1cac2eaebb57
BUG: 1040356
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/6475
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

gNFS: Incorrect NFS ACL encoding for XFS

2013-09-29T23:54:37+00:00

Problem:
Incorrect NFS ACL encoding causes "system.posix_acl_default"
setxattr failure on bricks on XFS file system. XFS (potentially
others?) doesn't understand when the 0x10 prefix is added to the
ACL type field for default ACLs (which the Linux NFS client adds)
which causes setfacl()->setxattr() to fail silently. NFS client
adds NFS_ACL_DEFAULT(0x1000) for default ACL.

FIX:
Mask the prefix (added by NFS client) OFF, so the setfacl is not
rejected when it hits the FS.
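
The masking step could look roughly like the sketch below.  The constant
value 0x1000 is taken from the description above; the macro name
SKETCH_NFS_ACL_DEFAULT and the helper strip_nfs_default_flag() are
hypothetical and do not come from the gNFS sources.

```c
#include <assert.h>
#include <stdint.h>

/* Flag the NFS client adds to the ACL type field for default ACLs. */
#define SKETCH_NFS_ACL_DEFAULT 0x1000u

/*
 * Hypothetical helper: clear the default-ACL flag so that only the plain
 * ACL type reaches the backend file system.
 */
uint32_t
strip_nfs_default_flag (uint32_t tag)
{
        uint32_t cleaned = tag & ~SKETCH_NFS_ACL_DEFAULT;

        assert ((cleaned & SKETCH_NFS_ACL_DEFAULT) == 0);
        return cleaned;
}
```

Values that never carried the flag pass through unchanged, so the helper
is safe to apply to both access and default ACL entries.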

Original patch by: "Richard Wareing"

Change-Id: I17ad27d84f030cdea8396eb667ee031f0d41b396
BUG: 1009210
Signed-off-by: Santosh Kumar Pradhan 
Reviewed-on: http://review.gluster.org/5980
Reviewed-by: Amar Tumballi 
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

cluster/dht: assign layout onto missing directories too

2013-09-09T21:58:09+00:00

The current self-healing algorithm is ignoring missing directories
for assigning new layout. When lookup() is racing against mkdir()
or when self-healing a half-done mkdir(), the layout assignment split
must happen based on the final number of directories, and not the
currently existing number of directories (because we finish mkdir()
of missing directories before hash layout assignment).

Without this fix, concurrent mkdir() and lookup() will step on
each other's toes, create a messed-up layout on disk, and end up
with different in-memory layouts.

Once two clients have different in-memory layouts, creation of a
subdirectory will not arbitrate on the same hashed subvolume and will
result in a GFID mismatch for the sub-directory.

Change-Id: Ia47acad67c265060405984c822b4d37512b9dbb3
BUG: 907072
Signed-off-by: Anand Avati 
Reviewed-on: http://review.gluster.org/5849
Tested-by: Gluster Build System 
Reviewed-by: Amar Tumballi 
Reviewed-by: Peter Portante 
Tested-by: Peter Portante