glusterfs.git/xlators/cluster/dht/src/dht-helper.c, branch v3.5.0beta3

cluster/dht: set op_errno correctly during migration.

2014-01-28T16:46:43+00:00

Change-Id: I65acedf92c1003975a584a2ac54527e9a2a1e52f
BUG: 1010241
Signed-off-by: Raghavendra G 
Reviewed-on: http://review.gluster.org/6219
Reviewed-by: Shyamsundar Ranganathan 
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati 
Reviewed-on: http://review.gluster.org/6817
Reviewed-by: Vijay Bellur

core: fix errno for non-existent GFID

2013-11-26T18:29:23+00:00

When clients refer to a GFID which does not exist, the errno to
be returned is ESTALE (and not ENOENT). Even though ENOENT might
look "proper" most of the time, as the application eventually expects
ENOENT even if a parent directory does not exist, not returning
ESTALE prevents resolvers (FUSE and GFAPI) from retrying resolution
in uncached mode. This can result in spurious ENOENTs during
concurrent path modification operations.
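
A minimal sketch of the retry behaviour described above, assuming a
resolver callback that returns 0 or a negative errno. The names
resolve_fn, resolve_with_retry and stub_resolve are hypothetical and
are not the FUSE or GFAPI resolver code:

```c
#include <errno.h>

/* Hypothetical resolver callback: 0 on success, negative errno on failure. */
typedef int (*resolve_fn)(const char *path, int use_cache);

/* Try the cached resolution first; only ESTALE triggers an uncached retry.
 * An ENOENT from the first attempt is returned to the caller unchanged. */
static int resolve_with_retry(resolve_fn resolve, const char *path)
{
    int ret = resolve(path, 1);

    if (ret == -ESTALE)
        ret = resolve(path, 0);
    return ret;
}

/* Demo stub: the cached lookup refers to a stale GFID, the uncached one works. */
static int stub_calls;

static int stub_resolve(const char *path, int use_cache)
{
    (void)path;
    stub_calls++;
    return use_cache ? -ESTALE : 0;
}
```

If the stub returned -ENOENT for the cached lookup instead, the retry
would never happen, which is the spurious failure this commit avoids.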

Change-Id: I7a06ea6d6a191739f2e9c6e333a1969615e05936
BUG: 1032894
Signed-off-by: Anand Avati 
Reviewed-on: http://review.gluster.org/6318
Tested-by: Gluster Build System 
Reviewed-by: Amar Tumballi 
Reviewed-by: Brian Foster 
Reviewed-by: Vijay Bellur

cluster/dht - rebalance: handle the rebalance @ inode level (!fd level)

2013-11-13T19:45:18+00:00

* migrate all the fds on an inode to the newer subvol after rebalance
* use the migration-in-progress flag in the inode, so all the operations
  on the inode can make use of it

Change-Id: Ib807a46e927a1062688fc15119c916797c52a350
BUG: 1013456
Signed-off-by: Amar Tumballi 
Reviewed-on: http://review.gluster.org/5891
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

cluster/dht: Prevent dht_access from going into a loop.

2013-07-16T06:56:34+00:00

If access fails with ENOTCONN, do not wind to the same subvol.
We wind to first-up-subvol if access fails with ENOTCONN.
In a few cases, if dht has only 1 subvolume and access fails with
ENOTCONN, we go into an infinite loop of winding to the same subvol.

The fix is to check if we previously wound to the same subvol, and
fail if first-up-subvol is the same.
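
The check can be sketched as follows. This is an illustration only:
struct subvol and pick_retry_subvol are hypothetical names, not the
actual dht_access code or its data structures.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a DHT subvolume and its connection state. */
struct subvol {
    const char *name;
    int up;
};

/* Return the subvol to retry on, or NULL if the fop should fail.
 * prev is the subvol the failed access was wound to. */
static const struct subvol *
pick_retry_subvol(const struct subvol *subvols, size_t count,
                  const struct subvol *prev, int op_errno)
{
    size_t i;

    if (op_errno != ENOTCONN)
        return NULL;

    for (i = 0; i < count; i++) {
        if (!subvols[i].up)
            continue;
        /* First up subvol found: retrying on the one that just
         * failed would loop forever, so fail instead. */
        return (&subvols[i] == prev) ? NULL : &subvols[i];
    }
    return NULL;
}
```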

Change-Id: Ib5d3ce7d33e8ea09147905a7df1ed280874fa549
BUG: 983431
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/5319
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

cluster/dht: Do not open fd in migration check/complete for non fd ops

2013-05-31T12:15:44+00:00

If local->fd == NULL, then in dht_migration_check_complete, do not issue
the open call. Let the layout get updated, and proceed with invoking the
registered target_fn.

If local->fd == NULL, do not call dht_rebalance_in_progress_check for
the truncate fop, but proceed with truncate2.

Change-Id: Ia5a5d40bcea7bfb320ef7096af1e035b8847d4ff
BUG: 960055
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4958
Reviewed-by: Amar Tumballi 
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/dht: getxattr linkto as root:root

2013-05-31T12:12:58+00:00

In path-based ops like truncate, we use getxattr instead of the
fgetxattr call. These can fail with permission denied errors,
as linkto file creation and the setattr of ownership are not atomic,
and the setattr may have failed (e.g. subvols down).

The fix is to perform getxattr as root:root, as it is an internal
fop. fgetxattr bypasses the access check, as it already has a valid
open fd.

Change-Id: Ie221c9172e3c1c7ed4e50c8782d362826910756f
BUG: 957074
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4890
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/dht: Don't do extra unref in dht-migration checks

2013-05-17T03:47:03+00:00

Problem:
syncop_open used to perform a ref in syncop_open_cbk, so the extra
unref was needed. Now syncop_open_cbk does not take a ref, so there is
no need to do the extra unref.

Fix:
remove the extra fd_unref and let dht_local_wipe do the final unref.

Change-Id: Ibe8f9a678d456a0c7bff175306068b5cd297ecc4
BUG: 961615
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/4974
Reviewed-by: Raghavendra Bhat 
Reviewed-by: Krishnan Parthasarathi 
Reviewed-by: Amar Tumballi 
Tested-by: Gluster Build System

dht: improve transform/detransform of d_off (and be ext4 safe)

2013-04-01T20:13:01+00:00

The scheme to encode brick d_off and brick id into global d_off has
two approaches. Since brick d_off and global d_off are both 64 bits
wide, we need to be careful about how the brick id is encoded.

Filesystems like XFS always give a d_off which fits within 32bits. So
we have another 32bits (actually 31, in this scheme, as seen ahead) to
encode the brick id - which is typically plenty.

Filesystems like the recent EXT4 utilize up to 63 low bits in d_off,
as the d_off is calculated based on a hash function value. This leaves
us no "unused" bits to encode the brick id.

However both these filesystems (EXT4 more importantly) are "tolerant" in
terms of the accuracy of the value presented back in seekdir(), i.e. a
seekdir(val) actually seeks to the entry which has the "closest" true
offset.

This "two-prong" scheme exploits this behavior - which seems to be the
best middle ground amongst various approaches and has all the advantages
of the old approach:

- Works against XFS and EXT4, the two most common filesystems out there.
  (which wasn't an "advantage" of the old approach as it is broken against
   EXT4)

- Probably works against most of the others as well. The ones which would
  NOT work are those which return HUGE d_offs _and_ are NOT tolerant to
  seekdir() to "closest" true offset.

- Nothing to "remember in memory" or evict "old entries".

- Works fine across NFS server reboots and also NFS head failover.

- Tolerant to seekdir() to arbitrary locations.

Algorithm:

Each d_off can be encoded in either of the two schemes. There is no
requirement to encode all d_offs of a directory or a reply-set in
the same scheme.

The topmost bit of the 64 bits is used to specify the "type" of encoding
of this particular d_off. If the topmost bit (bit-63) is 1, it indicates
that the encoding scheme holds a HUGE d_off. If the topmost bit is 0,
it indicates that the "small" d_off encoding scheme is used.

The goal of the "small" d_off encoding is to stay as dense as possible
towards the lower bits even in the global d_off.

The goal of the HUGE d_off encoding is to stay as accurate (close) as
possible to the "true" d_off after a round of encoding and decoding.

If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick
ID (call it "n").
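
As a sketch, ROOF(Log2(N)) can be computed without floating point by
counting how many doublings are needed to reach N. The helper name
brick_id_bits is hypothetical and does not appear in the patch:

```c
#include <stdint.h>

/* Number of bits needed to hold a brick id in the range [0, n_subvols).
 * Equals ROOF(Log2(N)) for N >= 1, so a single subvolume needs 0 bits. */
static unsigned brick_id_bits(uint64_t n_subvols)
{
    unsigned bits = 0;

    while (bits < 64 && ((uint64_t)1 << bits) < n_subvols)
        bits++;
    return bits;
}
```

For example, 3 or 4 subvolumes need 2 bits, while 5 through 8 need 3.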

SMALL d_off
===========

Encoding
--------
    If the top n + 1 bits are free in a brick offset, then we leave the
top bit as 0 and set the remaining bits based on the old formula:

   hi_mask = 0xffffffffffffffff

   hi_mask = ~(hi_mask >> (n + 1))

   if ((hi_mask & d_off_brick) != 0)
       do_large_d_off_encoding ()

   d_off_global = (d_off_brick * N) + brick_id

Decoding
--------
    If the top bit in the global offset is 0, it indicates that this
is the encoding formula used. So decoding such a global offset will
be like the old formula:

   if ((d_off_global & 0x8000000000000000) != 0)
      do_large_d_off_decoding()

   d_off_brick = (d_off_global / N)

   brick_id = d_off_global % N
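
A round trip through the "small" scheme can be sketched in C as below.
Because encoding multiplies the brick offset by N and adds the brick
id, integer division recovers the offset and the remainder recovers
the id. The function names are hypothetical, and the sketch assumes
n + 1 < 64 and brick_id < N:

```c
#include <stdbool.h>
#include <stdint.h>

/* Encode with the "small" scheme. Returns false when any of the top
 * (n + 1) bits of d_off_brick are in use, in which case the caller
 * would fall back to the HUGE scheme. */
static bool small_encode(uint64_t d_off_brick, uint64_t brick_id,
                         uint64_t n_subvols, unsigned n,
                         uint64_t *d_off_global)
{
    uint64_t hi_mask = ~(UINT64_MAX >> (n + 1));

    if ((d_off_brick & hi_mask) != 0)
        return false;

    /* Cannot overflow into bit 63: d_off_brick < 2^(63 - n), N <= 2^n. */
    *d_off_global = d_off_brick * n_subvols + brick_id;
    return true;
}

/* Decode a global offset whose top bit is 0. */
static void small_decode(uint64_t d_off_global, uint64_t n_subvols,
                         uint64_t *d_off_brick, uint64_t *brick_id)
{
    *d_off_brick = d_off_global / n_subvols;
    *brick_id = d_off_global % n_subvols;
}
```

With N = 3 (so n = 2), brick offset 1000 on brick 2 encodes to 3002 and
decodes back to (1000, 2).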

HUGE d_off
==========

Encoding
--------
   If the top n + 1 bits are NOT free in a given brick offset, then we
set the top bit as 1 in the global offset. The low n bits are replaced
by brick_id.

    low_mask = 0xffffffffffffffff << n   // where n is ROOF(Log2(N))

    d_off_global = (0x8000000000000000 | (d_off_brick & low_mask)) + brick_id

    if (d_off_global == 0xffffffffffffffff)
        discard_entry();

Decoding
--------
    If the top bit in the global offset is set to 1, it indicates that
the HUGE encoding formula above was used. So decoding would look like:

    hi_mask = (0xffffffffffffffff << n)
    low_mask = ~(hi_mask)

    d_off_brick = (d_off_global & hi_mask & 0x7fffffffffffffff)

    brick_id = d_off_global & low_mask

    If "losing" the low n bits in this decoding of d_off_brick looks
"scary", we need to realize that till recently EXT4 used to only
return what can now be expressed as (d_off_global >> 32). The extra
31 bits of hash added by EXT4 recently only decrease the probability
of a collision, and do not eliminate it completely, anyway. In a way,
the "lost" n bits are made up for by decreasing the probability of
collision by sharding the files into N bricks / EXT4 directories
    -- call it "hash hedging", if you will :-)
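
The HUGE scheme can be sketched the same way. The names HUGE_FLAG,
huge_encode and huge_decode are hypothetical, and the sketch assumes
n < 64 and brick_id < 2^n. Note that the decoded brick offset has its
low n bits zeroed, so it is "close to" rather than equal to the
original:

```c
#include <stdbool.h>
#include <stdint.h>

#define HUGE_FLAG UINT64_C(0x8000000000000000)

/* Encode with the HUGE scheme: set bit 63 and replace the low n bits
 * of the brick offset with the brick id. Returns false for the one
 * value the scheme discards (all bits set). */
static bool huge_encode(uint64_t d_off_brick, uint64_t brick_id,
                        unsigned n, uint64_t *d_off_global)
{
    uint64_t keep_mask = UINT64_MAX << n; /* clears the low n bits */
    uint64_t g = (HUGE_FLAG | (d_off_brick & keep_mask)) + brick_id;

    if (g == UINT64_MAX)
        return false; /* discard_entry() in the description above */

    *d_off_global = g;
    return true;
}

/* Decode a global offset whose top bit is 1. */
static void huge_decode(uint64_t d_off_global, unsigned n,
                        uint64_t *d_off_brick, uint64_t *brick_id)
{
    uint64_t hi_mask = UINT64_MAX << n;
    uint64_t low_mask = ~hi_mask;

    *d_off_brick = d_off_global & hi_mask & ~HUGE_FLAG;
    *brick_id = d_off_global & low_mask;
}
```

With n = 2, brick offset 0x7123456789abcdef on brick 3 encodes to
0xf123456789abcdef and decodes to offset 0x7123456789abcdec, which
differs from the original by less than 2^n.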

Thanks-to: Zach Brown 
Change-Id: Ieba9a7071829d51860b7c131982f12e0136b9855
BUG: 838784
Signed-off-by: Anand Avati 
Reviewed-on: http://review.gluster.org/4711
Reviewed-by: Jeff Darcy 
Tested-by: Gluster Build System

dht: make DHT xattr names configurable

2013-03-21T22:04:31+00:00

This is necessary to support "DHT over DHT" configurations, so that the
upper and lower instances of DHT don't step all over each other.  Why
would we even consider such a thing?  Because it gives us the ability to
do data tiering and rack-aware placement, either by themselves or as
complements to other functionality such as erasure codes or
deduplication which save space but cost performance.  By setting up the
top-level DHT to place data into one of several lower-level DHT pools
based on policy instead of pure elastic hashing, we get better
performance for 90% of accesses and better storage efficiency for 90% of
data, all for relatively low effort.

Change-Id: I72e65c29edfc80babf39f7a2a00090f4588c4070
BUG: 924265
Signed-off-by: Jeff Darcy 
Reviewed-on: http://review.gluster.org/4694
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

cluster/distribute: Reopen fds in migration internally as root:root

2013-02-15T04:26:58+00:00

Though linkfile_create and rebalance dst file create sent a setattr
with correct ownership, there is still a race window where the linkfile
open (client open due to migration) will fail, as its ownership will be
root:root.

Change-Id: I056092da6102319efa3bb9f1795f8c97db2a3fed
BUG: 884597
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4513
Tested-by: Gluster Build System 
Reviewed-by: Amar Tumballi 
Reviewed-by: Anand Avati