glusterfs.git/xlators/cluster, branch v3.4.0beta2

cluster/afr: Don't queue transactions during open-fd fix

2013-05-09T03:52:37+00:00

Before Anonymous fds are available, afr had to queue up
transactions if the file is not opened on one of its
subvolumes. This happens until the attempt to open the
file either succeeds or fails. These attempts happen
until the file is successfully opened on the subvolume.
Now client xlator uses anonymous fds to perform the fops
if the fd used for the fop is not 'opened'.
Fops will be successful even when the file is not opened
so there is no need to queue up the transactions anymore in afr.
Open is attempted on the subvolume where it is not
opened independent of the fop.

Change-Id: I6d59293023e2de41c606395028c8980b83faca3f
BUG: 953887
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/4868
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/afr: Turn on eager-lock for fd DATA transactions

2013-05-07T12:00:05+00:00

Problem:
With the present implementation, eager-lock is issued for
any fd fop. eager-lock is being transferred to metadata
transactions. But the lk-owner is set to local->fd address
only for DATA transactions, but for METADATA transactions
it is frame->root. Because of this unlock on the eager-lock fails
and rebalance hangs.

Fix:
Enable eager-lock for fd DATA transactions

This is a backport of change If30df7486a0b2f5e4150d3259d1261f81473ce8a
http://review.gluster.org/#/c/4588/

BUG: 916226
Change-Id: Id41ac17f467c37e7fd8863e0c19932d7b16344f8
Signed-off-by: Emmanuel Dreyfus 
Reviewed-on: http://review.gluster.org/4899
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/afr: Preserve mtime in self-heal

2013-04-12T07:22:01+00:00

Problem:
Data self-heal may choose sink iatt to set mtimes.
This happens because after syncing of data is done
self-heal does one more xattrops/fstat to determine
sources sinks to set the inode-ctx. Since this is done
after data syncing and erase of xattrs, old source and
old sink are now sources, but the mtimes of them differ.
Old code just takes the first source from the list and
update mtimes, which could be sink before the self-heal
started.

Fix:
Set mtime from 'sources before syncing'.

Change-Id: Id769e1b99aa4f041eaee775f64cbf2c57b799723
BUG: 918437
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/4658
Reviewed-by: Jeff Darcy 
Tested-by: Gluster Build System 
Reviewed-on: http://review.gluster.org/4663

cluster/distribute: Ignore non-participating subvols for layout checks

2013-04-11T17:41:01+00:00

Backporting fix http://review.gluster.org/#/c/4668/

When subvols-per-directory is < available subvols, then there are layouts
which are not populated. This leads to incorrect identification of holes or
overlaps. We need to ignore layouts, which have err == 0, and start == stop.
In the current scenario (start == stop == 0).

Additionally, in layout-merge, treat missing xattrs as err = 0. In case of
missing layouts, anomalies will reset them.

For any other valid subvoles, err != 0 in case of layouts being zeroed out.

Also reverted back dht_selfheal_dir_xattr, which does layout calculation only
on subvols which have errors.

BUG: 921408
Change-Id: I75a8edcb92af5b53b3253c9addd7a812e9242836
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4800
Reviewed-by: Amar Tumballi 
Reviewed-by: Jeff Darcy 
Tested-by: Gluster Build System

dht: improve transform/detransform of d_off (and be ext4 safe)

2013-04-11T17:40:44+00:00

Backporting  Avati's fix http://review.gluster.org/4711

The scheme to encode brick d_off and brick id into global d_off has
two approaches. Since both brick d_off and global d_off are both 64-bit
wide, we need to be careful about how the brick id is encoded.

Filesystems like XFS always give a d_off which fits within 32bits. So
we have another 32bits (actually 31, in this scheme, as seen ahead) to
encode the brick id - which is typically plenty.

Filesystems like the recent EXT4 utilize the upto 63 low bits in d_off,
as the d_off is calculated based on a hash function value. This leaves
us no "unused" bits to encode the brick id.

However both these filesystmes (EXT4 more importantly) are "tolerant" in
terms of the accuracy of the value presented back in seekdir(). i.e, a
seekdir(val) actually seeks to the entry which has the "closest" true
offset.

This "two-prong" scheme exploits this behavior - which seems to be the
best middle ground amongst various approaches and has all the advantages
of the old approach:

- Works against XFS and EXT4, the two most common filesystems out there.
  (which wasn't an "advantage" of the old approach as it is borken against
   EXT4)

- Probably works against most of the others as well. The ones which would
  NOT work are those which return HUGE d_offs _and_ NOT tolerant to
  seekdir() to "closest" true offset.

- Nothing to "remember in memory" or evict "old entries".

- Works fine across NFS server reboots and also NFS head failover.

- Tolerant to seekdir() to arbitrary locations.

Algorithm:

Each d_off can be encoded in either of the two schemes. There is no
requirement to encode all d_offs of a directory or a reply-set in
the same scheme.

The topmost bit of the 64 bits is used to specify the "type" of encoding
of this particular d_off. If the topmost bit (bit-63) is 1, it indicates
that the encoding scheme holds a HUGE d_off. If the topmost bit is is 0,
it indicates that the "small" d_off encoding scheme is used.

The goal of the "small" d_off encoding is to stay as dense as possible
towards the lower bits even in the global d_off.

The goal of the HUGE d_off encoding is to stay as accurate (close) as
possible to the "true" d_off after a round of encoding and decoding.

If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick
ID (call it "n").

SMALL d_off
===========

Encoding
--------
    If the top n + 1 bits are free in a brick offset, then we leave the
top bit as 0 and set the remaining bits based on the old formula:

   hi_mask = 0xffffffffffffffff

   hi_mask = ~(hi_mask >> (n + 1))

   if ((hi_mask & d_off_brick) != 0)
       do_large_d_off_encoding ()

   d_off_global = (d_off_brick * N) + brick_id

Decoding
--------
    If the top bit in the global offset is 0, it indicates that this
is the encoding formula used. So decoding such a global offset will
be like the old formula:

   if ((d_off_global & 0x8000000000000000) != 0)
      do_large_d_off_decoding()

   d_off_brick = (d_off_global % N)

   brick_id = d_off_global / N

HUGE d_off
==========

Encoding
--------
   If the top n + 1 bits are NOT free in a given brick offset, then we
set the top bit as 1 in the global offset. The low n bits are replaced
by brick_id.

    low_mask = 0xffffffffffffffff << n   // where n is ROOF(Log2(N))

    d_off_global = (0x8000000000000000 | d_off_brick & low_mask) + brick_id

    if (d_off_global == 0xffffffffffffffff)
        discard_entry();

Decoding
--------
    If the top bit in the global offset is set 1, it indicates that
the encoding formula used is above. So decoding would look like:

    hi_mask = (0xffffffffffffffff << n)
    low_mask = ~(hi_mask)

    d_off_brick = (global_d_off & hi_mask & 0x7fffffffffffffff)

    brick_id = global_d_off & low_mask

    If "losing" the low n bits in this decoding of d_off_brick looks
"scary", we need to realize that till recently EXT4 used to only
return what can now be expressed as (d_off_global >> 32). The extra
31 bits of hash added by EXT recently, only decreases the probability
of a collision, and not eliminate it completely, anyways. In a way,
the "lost" n bits are made up by decreasing the probability of
collision by sharding the files into N bricks / EXT directories
    -- call it "hash hedging", if you will :-)

Change-Id: I9551c581c3f3d4c9e719764881036d554f60c557
Thanks-to: Zach Brown 
BUG: 838784
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4799
Reviewed-by: Amar Tumballi 
Reviewed-by: Jeff Darcy 
Tested-by: Gluster Build System

cluster/distribute: Fix layout overlaps due to spread-count in selfheal path

2013-03-09T15:36:34+00:00

We needed to zero out the layout range, before we re-calculate the range.
When spread-count is issued, we would end up with stale ranges in the layout.

Replaced dht_selfheal_dir_xattr with dht_fix_dir_xattr, which correctly resets
the un-used (after re-cal) layouts.

Change-Id: I1a900d15df07335f59356bd23182ccec34381ab2
BUG: 884455
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4648
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

cluster/afr: do complete split-brain check in all the fd based fops

2013-03-05T10:22:46+00:00

fd based operations such as readv checked only for data split brain
instead of complete split-brain (i.e both data + metadata) assuming that
open would have done the complete split-brain check. However open-behind
would have unwound open, without winding to afr thus preventing the complete
split-brain check and some appliations will be able to read the contents
of the file even though the file has metadata split-brain. So let all
the fd based fops do a defensive check of complete split-brain.

Change-Id: I0ea52f782b371ce73e8e1c61f9def438fce1bd28
BUG: 846240
Signed-off-by: Raghavendra Bhat 
Reviewed-on: http://review.gluster.org/4620
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/distribute: Reopen fds in migration internally as root:root

2013-03-04T10:42:30+00:00

Though linkfile_create and rebalance dst file create sent a setattr
with correct ownership, there is still a race window where the linkfile
open (client open due to migration) will fail, as its ownership will be
root:root.

BUG: 884597
Change-Id: Iba73681eae4f280d39ee6c9a40009e195768bee7
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4612
Tested-by: Gluster Build System 
Reviewed-by: Jeff Darcy

cluster/distribute: Prevent spurious multiple defrag crawls

2013-03-04T10:42:10+00:00

In dht_notify, we used to create a thread to start defrag
crawls after we had heard from all child subvols.
This was in-correct, as a later event, could also trigger the
crawl again(due to the fact that all subvols had responded).

The fix is to make sure, the thread is started only once after
all subvols have responded the first time

BUG: 916449
Change-Id: I1619344fbb1cb51d5e1db38d8a29821fa870fa8b
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4610
Tested-by: Gluster Build System 
Reviewed-by: Jeff Darcy

cluster/distribute: Preserve file size during rebalance migration

2013-03-04T10:42:00+00:00

If holes are encountered, then we do not write these to the dst,
which sometimes causes file size to be lesser than src. Data is not
corrupted, as when non-zero reads are received, we do write that data.

Calling a truncrate to give file size to prevent it from being
truncated to less than src in case the file end has holes.

Thanks to Brian Foster for providing the test case

BUG: 915554
Change-Id: I7e1e0c475118b073c3ebb87e93220c1ec22e8b7d
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4609
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur