glusterfs.git/xlators/cluster, branch v3.3.2qa2

cluster/dht: Correct min_free_disk behaviour

2013-04-17T11:53:57+00:00

Problem:
Files were being created on a subvol that had less than min_free_disk
available, even when other subvols with more space were available.

Solution:
Changed the logic to look for a subvol which has more space available.
In cases where all the subvols have less than min_free_disk available,
the one with the most space and at least one free inode is chosen.
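
A minimal sketch of the selection rule described above, not the actual DHT
code. The struct, the function name pick_subvol() and the assumption that
min_free_disk is expressed in bytes are all hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-subvolume stats; not the real DHT structures. */
struct subvol_stat {
    uint64_t avail_bytes;   /* free space on the brick */
    uint64_t avail_inodes;  /* free inodes on the brick */
};

/*
 * Return the index of the subvolume to create a file on, or -1 if no
 * subvolume has a free inode. "hashed" is the subvolume the name hashes to.
 */
int pick_subvol(const struct subvol_stat *sv, size_t count,
                size_t hashed, uint64_t min_free_bytes)
{
    assert(sv != NULL && hashed < count);

    /* Hashed subvolume still has headroom: keep the normal placement. */
    if (sv[hashed].avail_bytes >= min_free_bytes &&
        sv[hashed].avail_inodes > 0)
        return (int)hashed;

    /* Otherwise take the subvolume with the most space and a free inode.
     * This also covers the case where every subvolume is below the limit. */
    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (sv[i].avail_inodes == 0)
            continue;
        if (best < 0 || sv[i].avail_bytes > sv[best].avail_bytes)
            best = (int)i;
    }
    return best;
}
```

With three subvols where the hashed one is below the limit and the largest
one has no free inodes, the sketch falls back to the next largest.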

Known Issue: Cannot ensure that the first file created right after the
min_free_disk value is crossed on a brick will be created on another
brick, because the disk usage stat takes some time to update in the
gluster process. Will fix that as part of another bug.

Change-Id: Icaba552db053ad8b00be0914b1f4853fb7661bd3
BUG: 874554
Signed-off-by: Raghavendra Talur 
Signed-off-by: Varun Shastry 
Reviewed-on: http://review.gluster.org/4839
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

dht: improve transform/detransform of d_off (and be ext4 safe)

2013-04-16T17:35:59+00:00

Backporting Avati's fix http://review.gluster.org/4711

There are two schemes to encode the brick d_off and the brick id into
the global d_off. Since the brick d_off and the global d_off are both
64 bits wide, we need to be careful about how the brick id is encoded.

Filesystems like XFS always give a d_off which fits within 32bits. So
we have another 32bits (actually 31, in this scheme, as seen ahead) to
encode the brick id - which is typically plenty.

Filesystems like recent EXT4 use up to 63 low bits in d_off, as the
d_off is calculated from a hash function value. This leaves us no
"unused" bits to encode the brick id.

However, both these filesystems (EXT4 more importantly) are "tolerant"
in terms of the accuracy of the value presented back in seekdir(), i.e.
a seekdir(val) actually seeks to the entry which has the "closest" true
offset.

This "two-prong" scheme exploits this behavior - which seems to be the
best middle ground amongst various approaches and has all the advantages
of the old approach:

- Works against XFS and EXT4, the two most common filesystems out there.
  (which wasn't an "advantage" of the old approach as it is broken against
   EXT4)

- Probably works against most of the others as well. The ones which would
  NOT work are those which return HUGE d_offs _and_ are NOT tolerant to
  seekdir() to the "closest" true offset.

- Nothing to "remember in memory" or evict "old entries".

- Works fine across NFS server reboots and also NFS head failover.

- Tolerant to seekdir() to arbitrary locations.

Algorithm:

Each d_off can be encoded in either of the two schemes. There is no
requirement to encode all d_offs of a directory or a reply-set in
the same scheme.

The topmost bit of the 64 bits is used to specify the "type" of encoding
of this particular d_off. If the topmost bit (bit-63) is 1, it indicates
that the encoding scheme holds a HUGE d_off. If the topmost bit is 0,
it indicates that the "small" d_off encoding scheme is used.

The goal of the "small" d_off encoding is to stay as dense as possible
towards the lower bits even in the global d_off.

The goal of the HUGE d_off encoding is to stay as accurate (close) as
possible to the "true" d_off after a round of encoding and decoding.

If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick
ID (call it "n").
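
As an illustration, n can be computed without floating point. This is a
sketch only; the function name brick_id_bits() is made up for this
example and is not taken from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Smallest n such that 2^n >= subvol_count, i.e. ROOF(Log2(N)). */
unsigned brick_id_bits(uint64_t subvol_count)
{
    assert(subvol_count > 0);

    unsigned n = 0;
    /* Stop at 64 so the shift below never reaches the type width. */
    while (n < 64 && (UINT64_C(1) << n) < subvol_count)
        n++;
    return n;
}
```

For example, 3 or 4 subvolumes need n = 2 bits, while 5 to 8 need n = 3.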

SMALL d_off
===========

Encoding
--------
    If the top n + 1 bits are free in a brick offset, then we leave the
top bit as 0 and set the remaining bits based on the old formula:

   hi_mask = 0xffffffffffffffff

   hi_mask = ~(hi_mask >> (n + 1))

   if ((hi_mask & d_off_brick) != 0)
       do_large_d_off_encoding ()

   d_off_global = (d_off_brick * N) + brick_id

Decoding
--------
    If the top bit in the global offset is 0, it indicates that this
is the encoding formula used. So decoding such a global offset will
be like the old formula:

   if ((d_off_global & 0x8000000000000000) != 0)
      do_large_d_off_decoding()

   d_off_brick = d_off_global / N

   brick_id = d_off_global % N
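
A small self-contained sketch of the SMALL scheme, assuming 64-bit
unsigned offsets. Decoding inverts d_off_global = (d_off_brick * N) +
brick_id, so the quotient by N is the brick offset and the remainder is
the brick id. The function names are illustrative, not the ones used in
DHT:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOP_BIT UINT64_C(0x8000000000000000)

/* True when the top n + 1 bits of the brick offset are all zero. */
bool fits_small(uint64_t d_off_brick, unsigned n)
{
    assert(n < 63);
    uint64_t hi_mask = ~(UINT64_MAX >> (n + 1));
    return (d_off_brick & hi_mask) == 0;
}

/* Caller must have checked fits_small() first. */
uint64_t small_encode(uint64_t d_off_brick, uint64_t brick_id, uint64_t N)
{
    assert(N > 0 && brick_id < N);
    return d_off_brick * N + brick_id;
}

/* Only valid for global offsets whose top bit is 0. */
void small_decode(uint64_t d_off_global, uint64_t N,
                  uint64_t *d_off_brick, uint64_t *brick_id)
{
    assert(N > 0 && (d_off_global & TOP_BIT) == 0);
    *d_off_brick = d_off_global / N;
    *brick_id = d_off_global % N;
}
```

With N = 3, brick offset 1000 on brick 2 encodes to 3002 and decodes back
to (1000, 2) without any loss.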

HUGE d_off
==========

Encoding
--------
   If the top n + 1 bits are NOT free in a given brick offset, then we
set the top bit as 1 in the global offset. The low n bits are replaced
by brick_id.

    low_mask = 0xffffffffffffffff << n   // where n is ROOF(Log2(N))

    d_off_global = (0x8000000000000000 | d_off_brick & low_mask) + brick_id

    if (d_off_global == 0xffffffffffffffff)
        discard_entry();

Decoding
--------
    If the top bit in the global offset is set 1, it indicates that
the encoding formula used is above. So decoding would look like:

    hi_mask = (0xffffffffffffffff << n)
    low_mask = ~(hi_mask)

    d_off_brick = (d_off_global & hi_mask & 0x7fffffffffffffff)

    brick_id = d_off_global & low_mask
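
The HUGE scheme can be sketched the same way. The code below assumes
64-bit unsigned offsets and a brick offset that uses at most 63 bits;
huge_encode() and huge_decode() are hypothetical names. It shows that a
round trip keeps the brick id exactly and returns the brick offset with
its low n bits cleared:

```c
#include <assert.h>
#include <stdint.h>

#define TOP_BIT UINT64_C(0x8000000000000000)

/* Returns 0 and stores the global offset, or -1 if the entry is to be
 * discarded because the result is the all-ones value. */
int huge_encode(uint64_t d_off_brick, uint64_t brick_id, unsigned n,
                uint64_t *d_off_global)
{
    assert(n < 63 && brick_id < (UINT64_C(1) << n));

    uint64_t keep_mask = UINT64_MAX << n;   /* clears the low n bits */
    uint64_t g = (TOP_BIT | (d_off_brick & keep_mask)) + brick_id;

    if (g == UINT64_MAX)
        return -1;
    *d_off_global = g;
    return 0;
}

/* Only valid for global offsets whose top bit is 1. */
void huge_decode(uint64_t d_off_global, unsigned n,
                 uint64_t *d_off_brick, uint64_t *brick_id)
{
    assert(n < 63 && (d_off_global & TOP_BIT) != 0);

    uint64_t hi_mask = UINT64_MAX << n;
    uint64_t low_mask = ~hi_mask;

    *d_off_brick = d_off_global & hi_mask & ~TOP_BIT;
    *brick_id = d_off_global & low_mask;
}
```

With n = 2, brick offset 0x7123456789abcdef on brick 3 decodes to
0x7123456789abcdec on brick 3: only the two low bits differ.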

    If "losing" the low n bits in this decoding of d_off_brick looks
"scary", we need to realize that till recently EXT4 used to only
return what can now be expressed as (d_off_global >> 32). The extra
31 bits of hash added by EXT4 recently only decrease the probability
of a collision and do not eliminate it completely anyway. In a way,
the "lost" n bits are made up for by decreasing the probability of
collision by sharding the files into N bricks / EXT4 directories
    -- call it "hash hedging", if you will :-)

Change-Id: I9551c581c3f3d4c9e719764881036d554f60c557
Thanks-to: Zach Brown 
BUG: 838784
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4799
Reviewed-by: Amar Tumballi 
Reviewed-by: Jeff Darcy 
Tested-by: Gluster Build System 
Reviewed-on: http://review.gluster.org/4822

cluster/afr: Try for all locks before failing in rename

2013-04-10T08:26:09+00:00

Change-Id: If0e917e5d4914f6807b4a96f81668a467b15d0df
BUG: 922809
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/4689
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/afr: Preserve mtime in self-heal

2013-04-10T07:41:19+00:00

Problem:
Data self-heal may choose the sink's iatt to set mtimes.
This happens because after syncing of data is done,
self-heal does one more xattrops/fstat to determine the
sources and sinks to set in the inode-ctx. Since this is done
after data syncing and erasing of xattrs, the old source and
old sink are both sources now, but their mtimes differ.
The old code just takes the first source from the list and
updates mtimes from it, and that source could have been a sink
before the self-heal started.

Fix:
Set mtime from 'sources before syncing'.

Change-Id: Id769e1b99aa4f041eaee775f64cbf2c57b799723
BUG: 918437
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/4658
Reviewed-by: Jeff Darcy 
Tested-by: Gluster Build System 
Reviewed-on: http://review.gluster.org/4664
Reviewed-by: Vijay Bellur

cluster/afr: Filter O_TRUNC in afr-fix-open

2012-11-05T10:27:42+00:00

RCA:
When open was done while a brick is down, afr opens the file after
the brick comes back up. If this happens after the self-heal on the file
is completed by self-heald etc., the file will end up in a truncated state.

Fix:
Filter O_TRUNC in afr-fix-open, because afr_open turns O_TRUNC
into a truncate transaction, so there will be a pending changelog for
the subvolume on which open fails.
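
The filtering itself amounts to masking one flag out of the saved open
flags. A minimal sketch, where fix_open_flags() is a hypothetical helper
and not the function touched by this patch:

```c
#include <assert.h>
#include <fcntl.h>

/* Flags to use when re-opening an fd on a brick that has come back up. */
int fix_open_flags(int original_flags)
{
    int flags = original_flags & ~O_TRUNC;

    /* Postcondition: a replayed open can never truncate the file. */
    assert((flags & O_TRUNC) == 0);
    return flags;
}
```

All other flags, such as the access mode and O_APPEND, pass through
unchanged.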

Testing:
Had to simulate the race by stopping fix-open until self-heald completes
self-heal on the file after the brick comes online.

Change-Id: If99eb3eb272dea0ed8c7b754dce675eb6efaf802
BUG: 841840
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/4147
Reviewed-by: Jeff Darcy 
Tested-by: Gluster Build System

Fixed some general typing errors.

2012-09-27T18:22:34+00:00

Eg: changed recieved to received

Change-Id: I360fcb99c97c8a0222e373fee20ea2fccfb938db
BUG: 860543
Signed-off-by: Varun Shastry 
Reviewed-on: http://review.gluster.org/3999
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

Self-heald: Fix inode leak

2012-08-30T10:25:50+00:00

RCA:
There is an inode leak because inode_link returns the
linked inode after taking a reference on it. That reference
needs to be unreffed.

Fix:
Added the code to perform unrefs. In addition to that,
updated the loc inode with the linked inode because that is
the best practice. The code to update the input inode's
gfid can be removed later; it's already removed in master.

Tests:
Checked that opendir comes with a loc with a valid inode.
Checked that re-opendir happens successfully. Tested that index and
full self-heal work fine with the fix.

BUG: 826580
Change-Id: I0c68192ff98f76152ed112b393d497b8fee93355
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/3518
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra Bhat 
Reviewed-by: Vijay Bellur

dht/rebalance: set the correct ownership on the dst file.

2012-08-30T09:45:51+00:00

Currently, the dst file created has root:root ownership until
migration is completed. During this phase, open fails on the dst
file if uid/gid is non-root.
Setting the dst file to the correct ownership fixes the issue.

Change-Id: Icfec89eb10dc866cdee38dab17695fe21174ef99
BUG: 852361
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/3862
Tested-by: Gluster Build System 
Reviewed-by: Amar Tumballi

afr: Avoid excessive logging in self-heal.

2012-08-17T09:22:02+00:00

- (Excessive) Logging has been very useful as 'bread-crumbs' in
  many a root-cause analyses. This patch aims at avoiding logging when
  the information could be reconstructed using the xattrs, statedump,
  and/or "volume heal" CLI commands.

Change-Id: I8f646cbee44e98495ea6963f9dfcae95375c8900
BUG: 844804
Signed-off-by: Krishnan Parthasarathi 
Reviewed-on: http://review.gluster.com/3827
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/afr: Handle child_up & fd not opened case in xaction

2012-08-17T08:14:38+00:00

RCA:
When an fd is opened while a brick is down, after the brick
comes back up afr issues open on the other brick. It can
fail for a number of reasons (ENOENT etc). While the system
is in that state, inode/entrylks and pre-op happen only on the
brick that is up and where the fd is opened for fd-fops. Post-op
should consider only the bricks where both pre-op and fop succeeded
as success, and the rest of them as failures. The existing code
marks only the children that are down as failures, as opposed to
child_down & fd-not-opened. This makes the changelog appear as
success on the subvolume where we did not do any fop, leading to no
pending changelog but differences in data/metadata for reg-files.

Fix:
Mark non-participants of fop as failure. This is tracked in
transaction.pre_op[].
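
A sketch of the marking rule, assuming simple per-child boolean arrays.
The helper mark_failures() and its parameters are hypothetical and only
mirror the idea of transaction.pre_op[]:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * A child counts as a success only if it took part in the pre-op and the
 * fop succeeded on it. Everything else is recorded as a failure.
 * Returns the number of children marked as failed.
 */
size_t mark_failures(const bool *pre_op, const bool *fop_ok,
                     bool *failed, size_t child_count)
{
    assert(pre_op != NULL && fop_ok != NULL && failed != NULL);

    size_t nfailed = 0;
    for (size_t i = 0; i < child_count; i++) {
        failed[i] = !(pre_op[i] && fop_ok[i]);
        if (failed[i])
            nfailed++;
    }
    return nfailed;
}
```

A child that is up but never saw the pre-op (for example because its fd
was not opened) is thus treated the same way as a child that is down.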

Tests:
Simulated the scenario using err-gen on top of one of the client
xlators, which fails all fops always. Performed fops, and the changelog
represented pending fops on the brick with err-gen loaded. Tested
the case of brick down and performed entry/metadata/data operations
to confirm they still work as expected.

Change-Id: I41905936126b19abba56ca581c0301a894507e1a
BUG: 844987
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.com/3776
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur