<feed xmlns='http://www.w3.org/2005/Atom'>
<title>glusterfs.git/xlators/cluster, branch v3.7.0beta2</title>
<subtitle></subtitle>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/'/>
<entry>
<title>dht/tier/rebalancer: Fix reset of tiering client pid</title>
<updated>2015-05-10T08:06:27+00:00</updated>
<author>
<name>Joseph Fernandes</name>
<email>josferna@redhat.com</email>
</author>
<published>2015-05-03T07:11:40+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=46d353f9e4b9c6a0a330ccab914d1668ce5dced7'/>
<id>46d353f9e4b9c6a0a330ccab914d1668ce5dced7</id>
<content type='text'>
In the patch http://review.gluster.org/#/c/9657
the client pid set by the tiering migration was getting
overwritten in dht_start_rebalance_task(). Corrected this by setting it
in dht_setxattr() before calling dht_start_rebalance_task(),
and removed the reset from dht_start_rebalance_task().
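
The shape of the fix, as a self-contained C sketch (the struct and the
pid constants are invented stand-ins, not the real dht types):

#include &lt;stdio.h&gt;

/* Invented stand-ins for frame-&gt;root-&gt;pid and the pid values. */
enum { TIER_PID = -6, REBALANCE_PID = -3 };
struct frame_root { int pid; };

/* Before: the task reset the pid unconditionally, clobbering the
 * value the tiering migration code had already set. */
static void start_rebalance_task_old (struct frame_root *root)
{
        root-&gt;pid = REBALANCE_PID;
}

/* After: the caller (dht_setxattr in the real code) decides the pid
 * once, and the task no longer touches it. */
static void start_rebalance_task_new (struct frame_root *root)
{
        (void) root;
}

int main (void)
{
        struct frame_root root = { TIER_PID }; /* set by tiering */
        start_rebalance_task_old (&amp;root);
        printf ("old: pid=%d (tier pid lost)\n", root.pid);
        root.pid = TIER_PID;
        start_rebalance_task_new (&amp;root);
        printf ("new: pid=%d (tier pid kept)\n", root.pid);
        return 0;
}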

&gt;    http://review.gluster.org/#/c/10502/
&gt;    Cherry picked from commit a5fe0f594d41e1a11661d9074bb19e9c2e2c4776
&gt;    Change-Id: I37cfa111f83a4e5d498042575c93799f60b49870
&gt;    BUG: 1217937
&gt;    Signed-off-by: Joseph Fernandes &lt;josferna@redhat.com&gt;
&gt;    Reviewed-on: http://review.gluster.org/10502
&gt;    Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
&gt;    Reviewed-by: Susant Palai &lt;spalai@redhat.com&gt;
&gt;    Reviewed-by: Dan Lambright &lt;dlambrig@redhat.com&gt;

Signed-off-by: Joseph Fernandes &lt;josferna@redhat.com&gt;
Reviewed-on: http://review.gluster.org/10502
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Susant Palai &lt;spalai@redhat.com&gt;
Reviewed-by: Dan Lambright &lt;dlambrig@redhat.com&gt;
Signed-off-by: Joseph Fernandes &lt;josferna@redhat.com&gt;

Conflicts:
	xlators/cluster/dht/src/dht-common.c
	xlators/cluster/dht/src/tier.c

Change-Id: Id513114c9a880c6196162dd4b35bbf1155a8cd09
BUG: 1219027
Reviewed-on: http://review.gluster.org/10609
Reviewed-by: N Balachandran &lt;nbalacha@redhat.com&gt;
Tested-by: NetBSD Build System
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</content>
</entry>
<entry>
<title>dht: make lookup-unhashed=auto do something actually useful</title>
<updated>2015-05-10T04:55:09+00:00</updated>
<author>
<name>Jeff Darcy</name>
<email>jdarcy@redhat.com</email>
</author>
<published>2014-05-07T19:31:30+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=243d61575c093c03b9beb014bf9d097646836e95'/>
<id>243d61575c093c03b9beb014bf9d097646836e95</id>
<content type='text'>
The key concept here is to determine whether a directory is "clean" by
comparing its last-known-good topology to the current one for the
volume.  These are stored as "commit hashes" on the directory and the
volume root respectively.  The volume's commit hash changes whenever a
brick is added or removed, and a fix-layout is done.  A directory's
commit hash changes only when a full rebalance (not just fix-layout)
is done on it.  If all bricks are present and have a directory
commit hash that matches the volume commit hash, then we can assume
that every file is in its "proper" place. Therefore, if we look for
a file in that proper place and don't find it, we can assume it's not
on any other subvolume and *safely* skip the global (broadcast to all)
lookup.
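
A minimal sketch of the resulting check, with invented names (the real
dht code tracks this per layout; treat this as illustration only):

#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* Skip the broadcast lookup only if every brick is up and every
 * directory commit hash matches the volume commit hash. */
static int
can_skip_broadcast_lookup (int all_bricks_up, uint32_t vol_hash,
                           const uint32_t *dir_hashes, size_t n)
{
        size_t i;

        if (!all_bricks_up)
                return 0;
        for (i = 0; i &lt; n; i++)
                if (dir_hashes[i] != vol_hash)
                        return 0;
        return 1; /* the file is in its hashed place, or nowhere */
}

int main (void)
{
        uint32_t hashes[3] = { 7, 7, 7 };
        printf ("skip=%d\n", can_skip_broadcast_lookup (1, 7, hashes, 3));
        return 0;
}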

Change-Id: Id6ce4593ba1f7daffa74cfab591cb45960629ae3
BUG: 1220064
Reviewed-on-master: http://review.gluster.org/#/c/7702/
Signed-off-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Signed-off-by: Shyam &lt;srangana@redhat.com&gt;
Reviewed-on: http://review.gluster.org/10729
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</content>
</entry>
<entry>
<title>ec: Fix failures with missing files</title>
<updated>2015-05-10T00:30:19+00:00</updated>
<author>
<name>Xavier Hernandez</name>
<email>xhernandez@datalab.es</email>
</author>
<published>2015-01-07T11:29:48+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=72f80aeba1268ed4836c10aee5fa41b6a04194e9'/>
<id>72f80aeba1268ed4836c10aee5fa41b6a04194e9</id>
<content type='text'>
      Backport of http://review.gluster.com/9407

When a file does not exist on a brick but does on others, there
could be problems trying to access it, because there were some loc_t
structures with a null 'pargfid' but with 'name' set. This forced
inode resolution based on &lt;pargfid&gt;/name instead of &lt;gfid&gt;, which
would be the correct one. To solve this problem, 'name' is always
set to NULL when 'pargfid' is not present.
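
A self-contained sketch of that rule (the types are simplified
stand-ins for glusterfs' loc_t and gfid, not the real definitions):

#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

typedef unsigned char gfid_sketch_t[16];
struct loc_sketch { gfid_sketch_t gfid; gfid_sketch_t pargfid;
                    const char *name; };

static int gfid_is_null (const gfid_sketch_t g)
{
        static const gfid_sketch_t zero;
        return memcmp (g, zero, sizeof (zero)) == 0;
}

/* Without a parent gfid, 'name' must not drive resolution, so drop
 * it and fall back to the entry's own gfid. */
static void loc_sanitize (struct loc_sketch *loc)
{
        if (gfid_is_null (loc-&gt;pargfid))
                loc-&gt;name = NULL;
}

int main (void)
{
        struct loc_sketch loc = { { 1 }, { 0 }, "file.txt" };
        loc_sanitize (&amp;loc);
        printf ("%s\n", loc.name ? loc.name : "(resolve by gfid)");
        return 0;
}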

Another problem was caused by incorrect management of errors
during incremental locking. The only error allowed during
incremental locking was ENOTCONN, but a file missing on a brick can
be reported as ESTALE. This caused an EIO on the operation.

This patch ignores errors during incremental locking. At the end of
the operation it checks whether enough bricks were successfully
locked to continue.
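
In sketch form (names invented, and the threshold is my assumption;
the real check lives in ec's locking code and uses its own counters):

/* Individual lock errors (ENOTCONN, ESTALE, ...) are ignored; the
 * operation proceeds only if enough bricks ended up locked. */
static int enough_locks (int locked_bricks, int fragments)
{
        return locked_bricks &gt;= fragments;
}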

BUG: 1220011
Change-Id: I4a1e6235d80e20ef7ef12daba0807b859ee5c435
Signed-off-by: Xavier Hernandez &lt;xhernandez@datalab.es&gt;
Reviewed-on: http://review.gluster.org/10701
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
</content>
</entry>
<entry>
<title>core: use reference counting for mem_acct structures</title>
<updated>2015-05-09T21:27:36+00:00</updated>
<author>
<name>Jeff Darcy</name>
<email>jdarcy@redhat.com</email>
</author>
<published>2015-04-28T08:40:00+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=a3af10a801a40fe990ee5db63c6dd6cb97713e4c'/>
<id>a3af10a801a40fe990ee5db63c6dd6cb97713e4c</id>
<content type='text'>
When freeing memory, our memory-accounting code expects to be able to
dereference from the (previously) allocated block to its owning
translator.  However, as we have already found once in option
validation and twice in logging, that translator might itself have
been freed, and the dereference attempt causes one of our daemons to
crash with SIGSEGV.  This patch attempts to fix that as follows:

 * We no longer embed a struct mem_acct directly in a struct xlator,
   but instead allocate it separately.

 * Allocated memory blocks now contain a pointer to the mem_acct
   instead of the xlator.

 * The mem_acct structure contains a reference count, manipulated in
   both the normal and translator allocate/free code using atomic
   increments and decrements.

 * Because it's now a separate structure, we can defer freeing the
   mem_acct until its reference count reaches zero (either way); see
   the sketch after this list.

 * Some unit tests were disabled, because they embedded their own
   copies of the implementation for what they were supposedly testing.
   Life's too short to spend time fixing tests that seem designed to
   impede progress by requiring a certain implementation as well as
   behavior.
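
A self-contained sketch of the new lifetime rule, using C11 atomics
(names simplified; the real mem_acct also carries per-type counters):

#include &lt;stdatomic.h&gt;
#include &lt;stdlib.h&gt;

/* mem_acct now lives on its own, outside the xlator. */
struct mem_acct_sketch {
        _Atomic unsigned int refcnt;
        /* per-type accounting would follow here */
};

/* Each allocated block points at the mem_acct, not the xlator. */
struct block_header_sketch { struct mem_acct_sketch *acct; };

static void mem_acct_ref (struct mem_acct_sketch *a)
{
        atomic_fetch_add (&amp;a-&gt;refcnt, 1);
}

static void mem_acct_unref (struct mem_acct_sketch *a)
{
        /* Freeing is deferred until the last reference is gone, so a
         * block freed after its xlator has died still works. */
        if (atomic_fetch_sub (&amp;a-&gt;refcnt, 1) == 1)
                free (a);
}

int main (void)
{
        struct mem_acct_sketch *a = malloc (sizeof (*a));
        atomic_init (&amp;a-&gt;refcnt, 1);  /* the xlator's own reference */
        mem_acct_ref (a);             /* one outstanding allocation */
        mem_acct_unref (a);           /* xlator torn down           */
        mem_acct_unref (a);           /* last block freed: count 0  */
        return 0;
}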

Change-Id: Id929b11387927136f78626901729296b6c0d0fd7
BUG: 1219026
Signed-off-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Reviewed-on: http://review.gluster.org/10417
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Krishnan Parthasarathi &lt;kparthas@redhat.com&gt;
Reviewed-by: Niels de Vos &lt;ndevos@redhat.com&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
Reviewed-on: http://review.gluster.org/10723
Tested-by: NetBSD Build System
</content>
</entry>
<entry>
<title>glusterd: support for tier volumes 'detach start' and 'detach commit'</title>
<updated>2015-05-09T15:35:14+00:00</updated>
<author>
<name>Dan Lambright</name>
<email>dlambrig@redhat.com</email>
</author>
<published>2015-04-13T01:42:12+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=99b778cbe179104af602b6eedebd31695bec16ae'/>
<id>99b778cbe179104af602b6eedebd31695bec16ae</id>
<content type='text'>
        Backport of http://review.gluster.org/10108

These commands work in a manner analogous to rebalancing when removing a
brick. The existing migration daemon detects "detach start" and switches
to moving data off the hot tier. While in this state, all lookups are
directed to the cold tier.

gluster v detach-tier &lt;vol&gt; start
gluster v detach-tier &lt;vol&gt; commit

The status and stop CLI commands will be submitted separately.

&gt;Change-Id: I24fda5cc3ba74f5fb8aa9a3234ad51f18b80a8a0
&gt;BUG: 1205540
&gt;Signed-off-by: Dan Lambright &lt;dlambrig@redhat.com&gt;
&gt;Signed-off-by: root &lt;root@localhost.localdomain&gt;
&gt;Signed-off-by: Dan Lambright &lt;dlambrig@redhat.com&gt;
&gt;Reviewed-on: http://review.gluster.org/10108
&gt;Reviewed-by: Kaleb KEITHLEY &lt;kkeithle@redhat.com&gt;

Change-Id: I212d748d077fb5870ee84b316c653acbafbea3f7
BUG: 1220047
Signed-off-by: Mohammed Rafi KC &lt;rkavunga@redhat.com&gt;
Reviewed-on: http://review.gluster.org/10708
Reviewed-by: Dan Lambright &lt;dlambrig@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</content>
</entry>
<entry>
<title>cluster/dht: change log level of developer logs to DEBUG</title>
<updated>2015-05-09T10:30:51+00:00</updated>
<author>
<name>Vijay Bellur</name>
<email>vbellur@redhat.com</email>
</author>
<published>2015-04-17T06:30:48+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=961bef293f52f84f584ac4bb02ac50117d12e118'/>
<id>961bef293f52f84f584ac4bb02ac50117d12e118</id>
<content type='text'>
Backport of : http://review.gluster.org/10281

A few log messages in dht directory self-heal at log level INFO are useful
only for developers, and they tend to cause excessive noise in our
log files. Hence the log level of such messages is moved to DEBUG.
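
Illustrative call-site shape only (the message text here is made up;
the change is just the level argument at such sites):

/* before */
gf_log (this-&gt;name, GF_LOG_INFO, "found anomalies in %s", loc-&gt;path);
/* after */
gf_log (this-&gt;name, GF_LOG_DEBUG, "found anomalies in %s", loc-&gt;path);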

Change-Id: I8a543f4ddeb5c20b2978a0f7b18d8baccc935a54
BUG: 1217949
Signed-off-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
Reviewed-on: http://review.gluster.org/10281
Reviewed-by: N Balachandran &lt;nbalacha@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-on: http://review.gluster.org/10704
Tested-by: NetBSD Build System
Reviewed-by: Raghavendra Talur &lt;rtalur@redhat.com&gt;
</content>
</entry>
<entry>
<title>cluster/afr : Prevent inode-evict during split-brain resolution</title>
<updated>2015-05-09T08:54:56+00:00</updated>
<author>
<name>Anuradha</name>
<email>atalur@redhat.com</email>
</author>
<published>2015-05-09T04:55:08+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=719c927592cfdb0de88243769d477ca211a2b494'/>
<id>719c927592cfdb0de88243769d477ca211a2b494</id>
<content type='text'>
        Backport of: http://review.gluster.org/#/c/10134/

1) Provided a setfattr command to set the timeout for the split-brain
choice (see the usage sketch after this list).

2) If split-brain inspection/resolution is being done
from the mount for a file, ref the inode when
split-brain-choice is set.
The inode is unconditionally unref-ed once the timeout,
set by the user (or the default), expires.

3) Updated the doc and testcase to reflect the changes.
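
Usage sketch from the mount point (the attribute names and the
timeout unit follow my reading of the updated doc; treat the exact
names as assumptions):

# pick a replica to inspect the split-brained file from
setfattr -n replica.split-brain-choice -v &lt;brick-name&gt; &lt;path-to-file&gt;
# new: bound how long the choice (and the inode ref) is held, in minutes
setfattr -n replica.split-brain-choice-timeout -v 5 &lt;path-to-file&gt;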

Change-Id: I15c9037dee28855f21e680e7e3632e1f48dba4e1
BUG: 1219388
Reviewed-on: http://review.gluster.org/10134
Reviewed-by: Krutika Dhananjay &lt;kdhananj@redhat.com&gt;
Reviewed-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
Signed-off-by: Anuradha &lt;atalur@redhat.com&gt;
Reviewed-on: http://review.gluster.org/10679
</content>
</entry>
<entry>
<title>cluster/ec: Change meaning of trusted.ec.dirty</title>
<updated>2015-05-09T03:12:31+00:00</updated>
<author>
<name>Pranith Kumar K</name>
<email>pkarampu@redhat.com</email>
</author>
<published>2015-04-26T08:58:00+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=a327b9fa9d50a66e7bb3a887ac569be914132d10'/>
<id>a327b9fa9d50a66e7bb3a887ac569be914132d10</id>
<content type='text'>
- With this change, the xattr indicates whether the file needs to be healed
  or not, with separate values for data/entry and metadata changes
  (see the sketch after this list).
- inode ref leaks and dict_set_dynstr related leaks fixed
- Added support for choosing trylock or blocking lock in data heal,
  depending on whether it was triggered by a heal-cmd execution.
- Made fixes to pass regression runs
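
A hedged sketch of the new meaning (the two-counter layout is an
illustration, not the actual on-disk xattr encoding):

#include &lt;stdint.h&gt;

/* trusted.ec.dirty now answers "does this file need heal?", with
 * separate values for data/entry changes and metadata changes. */
struct ec_dirty_sketch {
        uint64_t data;     /* non-zero: data/entry heal needed */
        uint64_t metadata; /* non-zero: metadata heal needed   */
};

static int needs_heal (const struct ec_dirty_sketch *d)
{
        return d-&gt;data != 0 || d-&gt;metadata != 0;
}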

Change-Id: I9d8def4c2badde18a76b7898816fecfac113737a
BUG: 1216303
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
Reviewed-on: http://review.gluster.org/10385
Reviewed-on: http://review.gluster.org/10693
Tested-by: NetBSD Build System
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
</content>
</entry>
<entry>
<title>cluster/ec: data heal implementation for ec</title>
<updated>2015-05-08T22:05:30+00:00</updated>
<author>
<name>Pranith Kumar K</name>
<email>pkarampu@redhat.com</email>
</author>
<published>2015-04-25T10:28:09+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=f54b232b3cc61ee9ca76288958537b53de64de53'/>
<id>f54b232b3cc61ee9ca76288958537b53de64de53</id>
<content type='text'>
Data self-heal (a sketch of the locking sequence follows the steps):
1) Take inode lock in domain 'this-&gt;name:self-heal' on the 0-0 range (full
   file), so that no other process tries to self-heal at the same time.
2) Take inode lock in domain 'this-&gt;name' on the 0-0 range (full file).
3) Perform fxattrop+fstat and get the xattrs on all the bricks.
4) Choose as sources the bricks (ec-&gt;fragments of them) that share the same
   version.
5) Truncate sinks.
6) Unlock the lock taken in 2).
7) For each block: take a full-file lock, read from the sources, write to the
   sinks, unlock.
8) Take a full-file lock and check that the file is still a sane copy, i.e.
   it did not become unusable while bricks were offline. Update mtime to the
   pre-heal value.
9) xattrop with -ve values of 'dirty' and the difference between the highest
   version and its own value for the version xattr.
10) Unlock the lock acquired in 8).
11) Unlock the lock acquired in 1).
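
The locking sequence above, as a runnable trace (lock_range and
unlock_range are invented stand-ins for ec's inodelk calls):

#include &lt;stdio.h&gt;

static void lock_range (const char *domain, int start, int len)
{
        printf ("lock   %s [%d,+%d]\n", domain, start, len);
}

static void unlock_range (const char *domain, int start, int len)
{
        printf ("unlock %s [%d,+%d]\n", domain, start, len);
}

int main (void)
{
        lock_range ("this-&gt;name:self-heal", 0, 0);   /* step 1  */
        lock_range ("this-&gt;name", 0, 0);             /* step 2  */
        /* steps 3-5: fxattrop+fstat, pick sources, truncate sinks */
        unlock_range ("this-&gt;name", 0, 0);           /* step 6  */
        /* step 7: per-block lock, copy sources to sinks, unlock  */
        lock_range ("this-&gt;name", 0, 0);             /* step 8  */
        /* step 9: xattrop of -ve 'dirty' and the version delta   */
        unlock_range ("this-&gt;name", 0, 0);           /* step 10 */
        unlock_range ("this-&gt;name:self-heal", 0, 0); /* step 11 */
        return 0;
}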

Change-Id: I6f4d42cd5423c767262c9d7bb5ca7767adb3e5fd
BUG: 1216303
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
Reviewed-on: http://review.gluster.org/10384
Reviewed-on: http://review.gluster.org/10692
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
</content>
</entry>
<entry>
<title>cluster/ec: metadata/name/entry heal implementation for ec</title>
<updated>2015-05-08T22:03:59+00:00</updated>
<author>
<name>Pranith Kumar K</name>
<email>pkarampu@redhat.com</email>
</author>
<published>2015-04-16T03:55:31+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=fae1e70ff3309d2b64febaafc70abcaa2771ecf0'/>
<id>fae1e70ff3309d2b64febaafc70abcaa2771ecf0</id>
<content type='text'>
Metadata self-heal:
1) Take inode lock in domain 'this-&gt;name' on 0-0 range (full file)
2) perform lookup and get the xattrs on all the bricks
3) Choose the brick with highest version as source
4) Setattr uid/gid/permissions
5) removexattr stale xattrs
6) Setxattr existing/new xattrs
7) xattrop with -ve values of 'dirty' and difference of highest and its own
   version values for version xattr
8) unlock lock acquired in 1)

Entry self-heal:
1) take directory lock in domain 'this-&gt;name:self-heal' on 'NULL' to prevent
   more than one self-heal
2) we take directory lock in domain 'this-&gt;name' on 'NULL'
3) Perform lookup on version, dirty and remember the values
4) unlock lock acquired in 2)
5) readdir on all the bricks and trigger name heals
6) xattrop with -ve values of 'dirty' and difference of highest and its own
   version values for version xattr
7) unlock lock acquired in 1)

Name heal:
1) Take 'name' lock in 'this-&gt;name' on 'NULL'
2) Perform lookup on 'name' and get the stat and xattr structures
3) Build a gfid_db where, for each gfid, we know which subvolumes/bricks have
   a file with 'name' (see the sketch after this list)
4) Delete all the stale files, i.e. those that do not exist on more than
   ec-&gt;redundancy bricks
5) On all the subvolumes/bricks with a missing entry, create 'name' with the
   same type, gfid, permissions, etc.
6) Unlock the lock acquired in 1)
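
A hedged sketch of the gfid_db idea from step 3 (all names invented):

#include &lt;stdint.h&gt;

/* For each gfid found under 'name', remember which subvolumes hold
 * it. A file is stale -- deletable -- only when it exists on no
 * more than ec-&gt;redundancy bricks. */
struct gfid_db_entry_sketch {
        unsigned char gfid[16];
        uint32_t      subvol_mask; /* bit i set: brick i has it */
        int           count;       /* bricks holding this gfid  */
};

static int is_stale (const struct gfid_db_entry_sketch *e, int redundancy)
{
        return e-&gt;count &lt;= redundancy;
}
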
Known limitation: with the present design, name heal conservatively
preserves the 'name' when it cannot decide whether to delete it. This can
happen in the following scenario:
1) We have a 3=2+1 (bricks: A, B, C) ec volume and 1 brick is down (let's
   say A)
2) rename d1/f1 -&gt; d2/f2 is performed, but the rename succeeds only on one
   of the bricks (let's say B)
3) Name self-heal on d1 and d2 would then re-create the file on both,
   resulting in d1/f1 and d2/f2.

Because we wanted to prevent data loss in the case above, the following
scenario is not healable, i.e. it needs manual intervention:
1) We have a 3=2+1 (bricks: A, B, C) ec volume and 1 brick is down (let's
   say A)
2) We have two hard links, d1/a and d2/b, and another file d3/c, all from
   before the brick went down
3) rename d3/c -&gt; d2/b is performed
4) Name self-heal on d2/b does not heal, because d2/b with the older gfid
   will not be deleted. One could ask why not delete the link when there is
   more than one hard link, but that leads to a data-loss issue similar to
   the one described earlier:
Scenario:
1) We have a 3=2+1 (bricks: A, B, C) ec volume and 1 brick is down (let's
   say A)
2) We have two hard links: d1/a and d2/b
3) rename d1/a -&gt; d3/c and d2/b -&gt; d4/d are performed, and both operations
   succeed only on one of the bricks (let's say B)
4) Name self-heals on the 'names' above, which can run in parallel, can each
   decide to delete the file thinking it has 2 links; but after all the
   self-heals do their unlinks, we are left with data loss.

Change-Id: I3a68218a47bb726bd684604efea63cf11cfd11be
BUG: 1216303
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
Reviewed-on: http://review.gluster.org/10298
Reviewed-on: http://review.gluster.org/10691
Tested-by: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Tested-by: NetBSD Build System
</content>
</entry>
</feed>
