glusterfs.git/xlators, branch v3.3.2qa4

NFS is picking up geo-rep's already open (read-only) file descriptor

2013-07-05T13:31:14+00:00

Add anonymous member to fd_t and use it instead of over-loading pid for
geo-rep and self heal

Change-Id: I4d6b29a044a8ed4b8f69ff6e3f35ee227739b2af
Signed-off-by: Kaleb S. KEITHLEY 
BUG: 874272
Reviewed-on: http://review.gluster.org/4185
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra Bhat 
Reviewed-by: Vijay Bellur 
Reviewed-on: http://review.gluster.org/5283

cluster/afr: detect in-progress creation in lookup and return ENOENT

2013-06-19T05:49:01+00:00

        Port of http://review.gluster.org/4625

If any subvol returned ENOENT while the parent entrylk lock was held,
yield and return ENOENT for the entire lookup.

This is how the issue happens:

Multiple clients A, B and C are attempting 'mkdir -p /mnt/a/b/c'

1 Client A is in the middle of mkdir(/a). It has acquired lock.
  It has performed mkdir(/a) on one subvol, and second one is still
  in progress
2 Client B performs a lookup, sees directory /a on one,
  ENOENT on the other, succeeds lookup.
3 Client B performs lookup on /a/b on both subvols, both return ENOENT
  (one subvol because /a/b does not exist, another because /a
  itself does not exist)
4 Client B proceeds to mkdir /a/b. It obtains entrylk on inode=/a with
  basename=b on one subvol, but fails on other subvol as /a is yet to
  be created by Client A.
5 Client A finishes mkdir of /a on other subvol
6 Client C also attempts to create /a/b, lookup returns ENOENT on
  both subvols.
7 Client C tries to obtain entrylk on inode=/a with basename=b,
  obtains on one subvol (where B had failed), and waits for B to unlock
  on other subvol.
8 Client B finishes mkdir() on one subvol with GFID-1 and completes
  transaction and unlocks
9 Client C gets the lock on the second subvol. At this stage the second
  subvol already has /a/b created by Client B, but Client C does not
  check for that in the middle of the mkdir transaction
10 Client C attempts mkdir /a/b on both subvols. It succeeds on
   ONLY ONE (where Client B could not get lock because of
   missing parent /a dir) with GFID-2, and gets EEXIST from ONE subvol.
This way we have /a/b in GFID mismatch. One subvol got GFID-1 because
Client B performed transaction on only one subvol (because entrylk()
could not be obtained on second subvol because of missing parent dir --
caused by premature/speculative succeeding of lookup() on /a when locks
are detected). The other subvol gets GFID-2 from Client C because, while
it was waiting for entrylk() on both subvols, Client B was in the
middle of performing mkdir() on only one subvol, and Client C does not
"expect" this when it is between the lock() and pre-op()/op() phases of the
transaction.
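
The rule this change introduces can be sketched as a small decision
function. This is an illustrative sketch only: lookup_verdict(), its
parameters, and passing lock detection in as a plain flag are hypothetical
and are not taken from the AFR sources.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not the real AFR code.
 * subvol_errno[i] is 0 if subvolume i found the entry, else an errno value.
 * parent_entrylk_held says whether an entrylk on the parent was detected
 * while the lookup ran. Returns 0 if the lookup may succeed, else an errno.
 */
int lookup_verdict(const int *subvol_errno, size_t n, bool parent_entrylk_held)
{
    bool any_success = false;
    bool any_enoent = false;

    assert(subvol_errno != NULL && n > 0);

    for (size_t i = 0; i < n; i++) {
        if (subvol_errno[i] == 0)
            any_success = true;
        else if (subvol_errno[i] == ENOENT)
            any_enoent = true;
    }

    /* A creation is in progress: do not speculatively succeed. */
    if (any_enoent && parent_entrylk_held)
        return ENOENT;

    /* Otherwise one good reply is enough; with none, report the first error. */
    return any_success ? 0 : subvol_errno[0];
}
```

With replies {0, ENOENT} and the lock detected, the sketch returns ENOENT,
which is what would stop Client B in step 2 from treating /a as fully
created.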

Change-Id: I40107d4638ffdcb7b1ff4748c8e5ea92e62697e8
BUG: 860210
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/5173
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/dht: Linkfiles creation with correct uid/gid

2013-05-16T15:49:39+00:00

If renames are done with different uid/gid (non-owners), then we would
end up with incorrect uid/gid.

The fix is to create linkfiles, and heal the uid/gid, as root:root. This
preserves our notion of creating as root:root and healing the uid/gid as
root:root in all paths. Additionally, we need to consider the uid/gid from
only the src_cached subvol, and not from linkfiles.

Rename is also done as root:root if done on a linkfile, as setattr of ownership
on the linkfile is done after the rename.

BUG: 884597
Change-Id: Ifaacd8dba0f39cb909761ffc8fe7e06cd44ec8de
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/5025
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/dht: Create linkfile with file uid/gid

2013-05-16T15:47:44+00:00

Currently, linkfile creation happens as root.

Use the uid/gid returned from _cbk (link/rename) to set the correct ownership of
the link files.

Change-Id: I5345cff193d5095442ca446fbe5ea05f2c2d86a3
Signed-off-by: shishir gowda 
BUG: 884597
Reviewed-on: http://review.gluster.org/5024
Reviewed-by: Vijay Bellur 
Tested-by: Gluster Build System

libglusterfs/statedump: move options file and statedumps from /tmp

2013-05-14T17:47:09+00:00

Change-Id: I6b107b9a668b0521b955dba8895cbbeaf9e7cb02
BUG: 764890
Signed-off-by: Raghavendra Bhat 
Reviewed-on: http://review.gluster.org/5005
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

geo-rep: retire old style ssh setup

2013-04-27T16:36:58+00:00

Users are still using geo-rep with the old, deprecated, insecure, unsupported
ssh setup. Not their fault -- the implementation of the new method had the
following characteristics:
- old method is possible, but with default settings it's not working
- it can be made operational by fiddling with "remote-gsyncd" tunable
- with default setting, an unhelpful, actually misleading error message is
  produced
- the UI gave no hint to the changes in the ssh setup

http://review.gluster.org/4392 tried to fix these; what it accomplished was
unrestricted support to the bad practice (by making the default old setup
operational).

From now on:
- we disable the old method by reserving the "remote-gsyncd" tunable
- if the old method is attempted, give a hint about what to do

Change-Id: Icade94725d8d8d2d4c89cab992d4226351637b86
BUG: 895656
Signed-off-by: Csaba Henk 
Reviewed-on: http://review.gluster.org/4892
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

glusterd: replace obsolete /usr/local reference for remote ssh/gsyncd

2013-04-27T16:36:39+00:00

See https://bugzilla.redhat.com/show_bug.cgi?id=895656
    https://bugzilla.redhat.com/show_bug.cgi?id=764679 (GLUSTER-2947)
    https://bugzilla.redhat.com/show_bug.cgi?id=764623 (GLUSTER-2891)

The comments in the bzs are a bit obtuse and/or vague. As near as I
can make out we had, for a while, a "convenience symlink" to or from
/usr/local/libexec/gsyncd, which no longer exists.

And, lacking any comments in the code, I gather this is some sort of
fallback or failsafe logic: if the first, normal attempt to invoke gsyncd
fails then an attempt is made to ssh to the box and invoke it.

In any event, there's nothing in /usr/local/... so it's unquestionably
wrong to try to invoke anything there.

[Backporting Kaleb's patch]

BUG: 895656
Change-Id: I3b7ac7a049b91ce101b930599294830147cc60ad
Signed-off-by: Kaleb S. KEITHLEY 
Signed-off-by: Csaba Henk 
Reviewed-on: http://review.gluster.org/4891
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

distribute: Fix fds being leaked during rebalance

2013-04-26T07:35:00+00:00

This patch is a backport of two patches from the master branch which fix the
leak of fds during a rebalance process.

The patches are:
* libglusterfs/syncop: do not hold ref on the fd in cbk
  (e979c0de9dde14fe18d0ad7298c6da9cc878bbab)
* cluster/distribute: Remove suprious fd_unref call
  (5d29e598665456b2b7250fdca14de7409098877a)

Change-Id: Icea1d0b32cb3670f7decc24261996bca3fe816dc
BUG: 928631
Signed-off-by: Kaushal M 
Reviewed-on: http://review.gluster.org/4888
Reviewed-by: Vijay Bellur 
Tested-by: Gluster Build System

cluster/dht: Correct min_free_disk behaviour

2013-04-17T11:53:57+00:00

Problem:
Files were being created in subvol which had less than min_free_disk available
even in the cases where other subvols with more space were available.

Solution:
Changed the logic to look for a subvol which has more space available. In cases
where all the subvols have less than min_free_disk available, the one with the
most space and at least one inode available is chosen.
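
A minimal sketch of this selection logic follows. It is not the DHT
implementation: struct subvol_space, pick_subvol(), expressing
min_free_disk as a percentage, and preferring the hashed subvol while it
still has enough room are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-subvolume usage snapshot; field names are illustrative. */
struct subvol_space {
    double        free_pct;    /* free disk space, in percent */
    unsigned long free_inodes; /* free inodes on the brick */
};

/*
 * Returns the index of the subvol to create the file on, or -1 if no
 * subvol has a free inode. "hashed" is the subvol the name hashes to.
 */
int pick_subvol(const struct subvol_space *s, size_t n, size_t hashed,
                double min_free_pct)
{
    int best = -1;

    assert(s != NULL && n > 0 && hashed < n);

    /* Assumption: keep the hashed subvol while it is above the threshold. */
    if (s[hashed].free_pct >= min_free_pct && s[hashed].free_inodes > 0)
        return (int)hashed;

    /* Otherwise take the subvol with the most free space and a free inode. */
    for (size_t i = 0; i < n; i++) {
        if (s[i].free_inodes == 0)
            continue;
        if (best < 0 || s[i].free_pct > s[best].free_pct)
            best = (int)i;
    }
    return best;
}
```

The same loop covers both cases from the description: when some subvol is
above the threshold the fullest-free one wins, and when every subvol is
below it the least-full one that still has an inode is picked.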

Known Issue: We cannot ensure that the first file created right after the
min-free-disk value is crossed on a brick will be created on another brick,
because the disk usage stat takes some time to update in the gluster process.
This will be fixed as part of another bug.

Change-Id: Icaba552db053ad8b00be0914b1f4853fb7661bd3
BUG: 874554
Signed-off-by: Raghavendra Talur 
Signed-off-by: Varun Shastry 
Reviewed-on: http://review.gluster.org/4839
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

dht: improve transform/detransform of d_off (and be ext4 safe)

2013-04-16T17:35:59+00:00

Backporting Avati's fix http://review.gluster.org/4711

The scheme to encode brick d_off and brick id into global d_off has
two approaches. Since brick d_off and global d_off are both 64 bits
wide, we need to be careful about how the brick id is encoded.

Filesystems like XFS always give a d_off which fits within 32bits. So
we have another 32bits (actually 31, in this scheme, as seen ahead) to
encode the brick id - which is typically plenty.

Filesystems like the recent EXT4 utilize up to 63 low bits in d_off,
as the d_off is calculated from a hash function value. This leaves
us no "unused" bits to encode the brick id.

However, both these filesystems (EXT4 more importantly) are "tolerant" in
terms of the accuracy of the value presented back in seekdir(), i.e., a
seekdir(val) actually seeks to the entry which has the "closest" true
offset.

This "two-prong" scheme exploits this behavior - which seems to be the
best middle ground amongst various approaches and has all the advantages
of the old approach:

- Works against XFS and EXT4, the two most common filesystems out there.
  (which wasn't an "advantage" of the old approach as it is broken against
   EXT4)

- Probably works against most of the others as well. The ones which would
  NOT work are those which return HUGE d_offs _and_ are NOT tolerant to
  seekdir() to the "closest" true offset.

- Nothing to "remember in memory" or evict "old entries".

- Works fine across NFS server reboots and also NFS head failover.

- Tolerant to seekdir() to arbitrary locations.

Algorithm:

Each d_off can be encoded in either of the two schemes. There is no
requirement to encode all d_offs of a directory or a reply-set in
the same scheme.

The topmost bit of the 64 bits is used to specify the "type" of encoding
of this particular d_off. If the topmost bit (bit-63) is 1, it indicates
that the encoding scheme holds a HUGE d_off. If the topmost bit is 0,
it indicates that the "small" d_off encoding scheme is used.

The goal of the "small" d_off encoding is to stay as dense as possible
towards the lower bits even in the global d_off.

The goal of the HUGE d_off encoding is to stay as accurate (close) as
possible to the "true" d_off after a round of encoding and decoding.

If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick
ID (call it "n").
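
As a small illustration of ROOF(Log2(N)), the number of bits can be
computed without floating point. The function name brick_id_bits() is
hypothetical and is not taken from the DHT sources.

```c
#include <assert.h>
#include <stdint.h>

/* Bits needed to hold a brick id in [0, n_subvols): ceil(log2(n_subvols)). */
unsigned brick_id_bits(uint64_t n_subvols)
{
    unsigned bits = 0;

    assert(n_subvols > 0);

    /* Find the smallest "bits" such that 2^bits >= n_subvols. */
    while (bits < 64 && ((uint64_t)1 << bits) < n_subvols)
        bits++;
    return bits;
}
```

For example, 3 or 4 subvolumes need 2 bits, while 5 subvolumes need 3.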

SMALL d_off
===========

Encoding
--------
    If the top n + 1 bits are free in a brick offset, then we leave the
top bit as 0 and set the remaining bits based on the old formula:

   hi_mask = 0xffffffffffffffff

   hi_mask = ~(hi_mask >> (n + 1))

   if ((hi_mask & d_off_brick) != 0)
       do_large_d_off_encoding ()

   d_off_global = (d_off_brick * N) + brick_id

Decoding
--------
    If the top bit in the global offset is 0, it indicates that this
is the encoding formula used. So decoding such a global offset will
be like the old formula:

   if ((d_off_global & 0x8000000000000000) != 0)
      do_large_d_off_decoding()

   d_off_brick = d_off_global / N

   brick_id = d_off_global % N
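
The "small" scheme can be sketched in C as follows. This is a sketch
under assumptions, not the DHT code: all function names are hypothetical.
Decoding inverts the encoding formula, so dividing by N recovers the brick
offset and the remainder is the brick id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ceil(log2(n_subvols)): bits needed for a brick id. */
static unsigned id_bits(uint64_t n_subvols)
{
    unsigned bits = 0;
    while (bits < 64 && ((uint64_t)1 << bits) < n_subvols)
        bits++;
    return bits;
}

/* True if the top n + 1 bits of the brick offset are free. */
bool small_fits(uint64_t d_off_brick, uint64_t n_subvols)
{
    unsigned n = id_bits(n_subvols);
    uint64_t hi_mask;

    assert(n_subvols > 0 && n < 63);
    hi_mask = ~(UINT64_MAX >> (n + 1));
    return (d_off_brick & hi_mask) == 0;
}

/* Encode; the caller must have checked small_fits() first. */
uint64_t small_encode(uint64_t d_off_brick, uint64_t brick_id,
                      uint64_t n_subvols)
{
    assert(brick_id < n_subvols && small_fits(d_off_brick, n_subvols));
    return d_off_brick * n_subvols + brick_id;
}

/* Decode helpers; valid only when the top bit of d_off_global is 0. */
uint64_t small_brick_off(uint64_t d_off_global, uint64_t n_subvols)
{
    assert((d_off_global & 0x8000000000000000ULL) == 0);
    return d_off_global / n_subvols;
}

uint64_t small_brick_id(uint64_t d_off_global, uint64_t n_subvols)
{
    assert((d_off_global & 0x8000000000000000ULL) == 0);
    return d_off_global % n_subvols;
}
```

With N = 3, brick offset 1000 on brick 2 encodes to 3002 and decodes back
to (1000, 2). Because the top n + 1 bits were free, the product cannot
reach bit 63, so the top bit of the result stays 0.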

HUGE d_off
==========

Encoding
--------
   If the top n + 1 bits are NOT free in a given brick offset, then we
set the top bit as 1 in the global offset. The low n bits are replaced
by brick_id.

    hi_mask = 0xffffffffffffffff << n   // where n is ROOF(Log2(N))

    d_off_global = (0x8000000000000000 | (d_off_brick & hi_mask)) + brick_id

    if (d_off_global == 0xffffffffffffffff)
        discard_entry();

Decoding
--------
    If the top bit in the global offset is set to 1, it indicates that
the HUGE encoding formula above was used. So decoding would look like:

    hi_mask = (0xffffffffffffffff << n)
    low_mask = ~(hi_mask)

    d_off_brick = (d_off_global & hi_mask & 0x7fffffffffffffff)

    brick_id = d_off_global & low_mask
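
A round trip through the HUGE scheme can be sketched in C as follows.
This is illustrative only: the function names and the TOP_BIT macro are
hypothetical, and the sketch assumes the brick offset uses at most the low
63 bits.

```c
#include <assert.h>
#include <stdint.h>

#define TOP_BIT 0x8000000000000000ULL

/* ceil(log2(n_subvols)): bits needed for a brick id. */
static unsigned id_bits(uint64_t n_subvols)
{
    unsigned bits = 0;
    while (bits < 64 && ((uint64_t)1 << bits) < n_subvols)
        bits++;
    return bits;
}

/*
 * Set the top bit and replace the low n bits of the brick offset with the
 * brick id. A result of 0xffffffffffffffff must be discarded by the caller.
 */
uint64_t huge_encode(uint64_t d_off_brick, uint64_t brick_id,
                     uint64_t n_subvols)
{
    unsigned n = id_bits(n_subvols);
    uint64_t hi_mask;

    assert(n_subvols > 0 && n < 63 && brick_id < n_subvols);
    hi_mask = UINT64_MAX << n;
    return (TOP_BIT | (d_off_brick & hi_mask)) + brick_id;
}

/* Brick offset with the top bit and the low n bits cleared. */
uint64_t huge_brick_off(uint64_t d_off_global, uint64_t n_subvols)
{
    uint64_t hi_mask = UINT64_MAX << id_bits(n_subvols);

    assert(d_off_global & TOP_BIT);
    return d_off_global & hi_mask & ~TOP_BIT;
}

/* Brick id stored in the low n bits. */
uint64_t huge_brick_id(uint64_t d_off_global, uint64_t n_subvols)
{
    uint64_t low_mask = ~(UINT64_MAX << id_bits(n_subvols));

    assert(d_off_global & TOP_BIT);
    return d_off_global & low_mask;
}
```

With N = 4 (n = 2), brick offset 0x7123456789abcdef on brick 1 encodes to
0xf123456789abcded. Decoding yields brick id 1 and brick offset
0x7123456789abcdec, which differs from the original only in the low 2 bits.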

    If "losing" the low n bits in this decoding of d_off_brick looks
"scary", we need to realize that till recently EXT4 used to only
return what can now be expressed as (d_off_global >> 32). The extra
31 bits of hash added by EXT4 recently only decrease the probability
of a collision, and do not eliminate it completely, anyway. In a way,
the "lost" n bits are made up for by decreasing the probability of
collision by sharding the files into N bricks / EXT4 directories
    -- call it "hash hedging", if you will :-)

Change-Id: I9551c581c3f3d4c9e719764881036d554f60c557
Thanks-to: Zach Brown 
BUG: 838784
Signed-off-by: shishir gowda 
Reviewed-on: http://review.gluster.org/4799
Reviewed-by: Amar Tumballi 
Reviewed-by: Jeff Darcy 
Tested-by: Gluster Build System 
Reviewed-on: http://review.gluster.org/4822