glusterfs.git/xlators/cluster/dht, branch v3.7.0beta2

dht/tier/rebalancer: Fix reset of tiering client pid

2015-05-10T08:06:27+00:00

In the patch http://review.gluster.org/#/c/9657
the client pid set by tiering migration was getting overwritten
in dht_start_rebalance_task(). The pid is now set
in dht_setxattr() before calling dht_start_rebalance_task(),
and the assignment is removed from dht_start_rebalance_task().

>    http://review.gluster.org/#/c/10502/
>    Cherry picked from commit a5fe0f594d41e1a11661d9074bb19e9c2e2c4776
>    Change-Id: I37cfa111f83a4e5d498042575c93799f60b49870
>    BUG: 1217937
>    Signed-off-by: Joseph Fernandes 
>    Reviewed-on: http://review.gluster.org/10502
>    Tested-by: Gluster Build System 
>    Reviewed-by: Susant Palai 
>    Reviewed-by: Dan Lambright 

Signed-off-by: Joseph Fernandes 
Reviewed-on: http://review.gluster.org/10502
Tested-by: Gluster Build System 
Reviewed-by: Susant Palai 
Reviewed-by: Dan Lambright 
Signed-off-by: Joseph Fernandes 

Conflicts:
	xlators/cluster/dht/src/dht-common.c
	xlators/cluster/dht/src/tier.c

Change-Id: Id513114c9a880c6196162dd4b35bbf1155a8cd09
BUG: 1219027
Reviewed-on: http://review.gluster.org/10609
Reviewed-by: N Balachandran 
Tested-by: NetBSD Build System
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

dht: make lookup-unhashed=auto do something actually useful

2015-05-10T04:55:09+00:00

The key concept here is to determine whether a directory is "clean" by
comparing its last-known-good topology to the current one for the
volume.  These are stored as "commit hashes" on the directory and the
volume root respectively.  The volume's commit hash changes whenever a
brick is added or removed, and a fix-layout is done.  A directory's
commit hash changes only when a full rebalance (not just fix-layout)
is done on it.  If all bricks are present and have a directory
commit hash that matches the volume commit hash, then we can assume
that every file is in its "proper" place. Therefore, if we look for
a file in that proper place and don't find it, we can assume it's not
on any other subvolume and *safely* skip the global (broadcast to all)
lookup.
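
A minimal sketch of the "clean directory" check described above. It is
illustrative only and not the DHT code: struct brick_state and
can_skip_global_lookup() are hypothetical names, and the commit hash is
assumed to fit in a 32-bit integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-brick view of one directory; not a real DHT type. */
struct brick_state {
    bool     present;         /* brick is reachable */
    uint32_t dir_commit_hash; /* commit hash stored on the directory */
};

/*
 * Returns true when the directory is "clean": every brick is present and
 * its directory commit hash equals the volume commit hash. Only then is
 * it safe to skip the broadcast lookup after a miss on the hashed brick.
 */
bool can_skip_global_lookup(const struct brick_state *bricks, size_t count,
                            uint32_t vol_commit_hash)
{
    assert(bricks != NULL || count == 0);

    if (count == 0)
        return false; /* nothing known, stay conservative */

    for (size_t i = 0; i < count; i++) {
        if (!bricks[i].present)
            return false; /* a missing brick might hold the file */
        if (bricks[i].dir_commit_hash != vol_commit_hash)
            return false; /* topology changed since the last rebalance */
    }
    return true;
}
```

Any doubt (a brick down, a stale hash) falls back to the global lookup,
so the optimization can only skip work when skipping is safe.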

Change-Id: Id6ce4593ba1f7daffa74cfab591cb45960629ae3
BUG: 1220064
Reviewed-on-master: http://review.gluster.org/#/c/7702/
Signed-off-by: Jeff Darcy 
Signed-off-by: Shyam 
Reviewed-on: http://review.gluster.org/10729
Tested-by: Gluster Build System 
Reviewed-by: Krishnan Parthasarathi 
Reviewed-by: Vijay Bellur

core: use reference counting for mem_acct structures

2015-05-09T21:27:36+00:00

When freeing memory, our memory-accounting code expects to be able to
dereference from the (previously) allocated block to its owning
translator.  However, as we have already found once in option
validation and twice in logging, that translator might itself have
been freed and the dereference attempt causes one of our daemons to
crash with SIGSEGV.  This patch attempts to fix that as follows:

 * We no longer embed a struct mem_acct directly in a struct xlator,
   but instead allocate it separately.

 * Allocated memory blocks now contain a pointer to the mem_acct
   instead of the xlator.

 * The mem_acct structure contains a reference count, manipulated in
   both the normal and translator allocate/free code using atomic
   increments and decrements.

 * Because it's now a separate structure, we can defer freeing the
   mem_acct until its reference count reaches zero (either way).

 * Some unit tests were disabled, because they embedded their own
   copies of the implementation for what they were supposedly testing.
   Life's too short to spend time fixing tests that seem designed to
   impede progress by requiring a certain implementation as well as
   behavior.
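
The scheme in the list above can be sketched as follows. This is a
simplified illustration using C11 atomics, not the libglusterfs code:
struct mem_acct_sketch, struct block_header and the acct_* functions are
hypothetical names, and the real accounting fields are omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the separately allocated accounting record. */
struct mem_acct_sketch {
    atomic_int refcnt; /* one ref for the owner plus one per live block */
};

/* Header placed in front of every block; points at the record,
 * not at the translator that owns it. */
struct block_header {
    struct mem_acct_sketch *acct;
    size_t                  size;
};

/* Create a record holding the owner's (translator's) reference. */
struct mem_acct_sketch *acct_new(void)
{
    struct mem_acct_sketch *a = calloc(1, sizeof(*a));
    if (a)
        atomic_init(&a->refcnt, 1);
    return a;
}

/* Drop one reference; returns 1 if this call freed the record. */
int acct_unref(struct mem_acct_sketch *a)
{
    int prev = atomic_fetch_sub(&a->refcnt, 1);
    assert(prev > 0);
    if (prev == 1) {
        free(a);
        return 1;
    }
    return 0;
}

/* Allocate a block that pins the record until the block is freed. */
void *acct_alloc(struct mem_acct_sketch *a, size_t size)
{
    struct block_header *h = malloc(sizeof(*h) + size);
    if (!h)
        return NULL;
    atomic_fetch_add(&a->refcnt, 1);
    h->acct = a;
    h->size = size;
    return h + 1;
}

/* Free a block; returns 1 if the record was released as well. */
int acct_free(void *ptr)
{
    struct block_header *h = (struct block_header *)ptr - 1;
    struct mem_acct_sketch *a = h->acct;
    free(h);
    return acct_unref(a);
}
```

With this layout the owner can call acct_unref() while blocks are still
outstanding; the record stays valid until the last acct_free().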

Change-Id: Id929b11387927136f78626901729296b6c0d0fd7
BUG: 1219026
Signed-off-by: Jeff Darcy 
Reviewed-on: http://review.gluster.org/10417
Tested-by: Gluster Build System 
Reviewed-by: Krishnan Parthasarathi 
Reviewed-by: Niels de Vos 
Reviewed-by: Pranith Kumar Karampuri 
Reviewed-on: http://review.gluster.org/10723
Tested-by: NetBSD Build System

glusterd: support for tier volumes 'detach start' and 'detach commit'

2015-05-09T15:35:14+00:00

        Back port of http://review.gluster.org/10108

These commands work in a manner analogous to rebalancing when removing a
brick. The existing migration daemon detects "detach start" and switches
to moving data off the hot tier. While in this state all lookups are
directed to the cold tier.

gluster v detach-tier <VOLNAME> start
gluster v detach-tier <VOLNAME> commit

The status and stop cli commands shall be submitted separately.

>Change-Id: I24fda5cc3ba74f5fb8aa9a3234ad51f18b80a8a0
>BUG: 1205540
>Signed-off-by: Dan Lambright 
>Signed-off-by: root 
>Signed-off-by: Dan Lambright 
>Reviewed-on: http://review.gluster.org/10108
>Reviewed-by: Kaleb KEITHLEY 

Change-Id: I212d748d077fb5870ee84b316c653acbafbea3f7
BUG: 1220047
Signed-off-by: Mohammed Rafi KC 
Reviewed-on: http://review.gluster.org/10708
Reviewed-by: Dan Lambright 
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

cluster/dht: change log level of developer logs to DEBUG

2015-05-09T10:30:51+00:00

Backport of : http://review.gluster.org/10281

A few log messages in dht directory self heal at log level INFO are useful
only for developers, and these logs tend to cause excessive logging in our
log files. Hence moving the log level of such logs to DEBUG.

Change-Id: I8a543f4ddeb5c20b2978a0f7b18d8baccc935a54
BUG: 1217949
Signed-off-by: Vijay Bellur 
Reviewed-on: http://review.gluster.org/10281
Reviewed-by: N Balachandran 
Tested-by: Gluster Build System 
Reviewed-on: http://review.gluster.org/10704
Tested-by: NetBSD Build System
Reviewed-by: Raghavendra Talur

cluster/tier: don't use hot tier until subvolumes ready

2015-05-08T12:17:00+00:00

This is a backport of fix 10435 to Gluster 3.7.

When we attach a tier, the hot tier becomes the hashed
subvolume. But directories may not yet have been replicated by
the fix layout process. Hence lookups to those directories
will fail on the hot subvolume. We should only go to the hashed
subvolume once the layout has been fixed. This is known if the
layout for the parent directory does not have an error. If
there is an error, the cold tier is considered the hashed
subvolume. The exception to this rule is ENOTCONN, in which
case we do not know where the file is and must abort.
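
A minimal sketch of the selection rule in the paragraph above. The enum
and tier_pick_hashed() are hypothetical names, not the tier translator's
API, and the sketch assumes the disconnected case is reported with the
standard errno ENOTCONN.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical result type; the real code works with xlator pointers. */
enum tier_subvol { TIER_COLD, TIER_HOT, TIER_NONE };

/*
 * parent_layout_err is 0 when the parent directory's layout is intact,
 * otherwise the errno recorded for it.
 */
enum tier_subvol tier_pick_hashed(int parent_layout_err)
{
    assert(parent_layout_err >= 0);

    if (parent_layout_err == 0)
        return TIER_HOT;  /* layout fixed: hot tier is the hashed subvol */
    if (parent_layout_err == ENOTCONN)
        return TIER_NONE; /* file location unknown: caller must abort */
    return TIER_COLD;     /* layout not fixed yet: fall back to cold tier */
}
```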

Note we may revalidate a lookup for a directory even if the
inode has not yet been populated by FUSE. This case can
happen in tiering (where one tier has completed a lookup
but the other has not, in which case we revalidate one tier
when we call lookup the second time). Such inodes are
still invalid and should not be consulted for validation.

> http://review.gluster.org/#/c/10435/
> Change-Id: Ia2bc62e1d807bd70590bd2a8300496264d73c523
> BUG: 1214289
> Signed-off-by: Dan Lambright 
> Reviewed-on: http://review.gluster.org/10435
> Tested-by: Gluster Build System 
> Reviewed-by: Raghavendra G 
> Reviewed-by: N Balachandran 
> Signed-off-by: Dan Lambright 

Change-Id: Ia2bc62e1d807bd70590bd2a8300496264d73c523
BUG: 1219547
Signed-off-by: Dan Lambright 
Reviewed-on: http://review.gluster.org/10649
Tested-by: NetBSD Build System
Reviewed-by: Joseph Fernandes
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

features/shard: Implement [f]truncate fops

2015-05-08T11:49:54+00:00

        Backport of: http://review.gluster.org/10631
To-Do:
* Make ftruncate work even in the absence of path
* Aggregate and update ia_blocks appropriately when a file is
  truncated to a lower size.

Change-Id: Icd424430066233ba61a030e72fdddf692d2b3f22
BUG: 1214247
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/10638
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

geo-rep: rename handling in dht volume

2015-05-08T11:49:22+00:00

Background:

GlusterFS changelogs are stored on each brick and record the changes
that happened on that brick. Geo-rep runs on all the nodes of the master
and processes changelogs "independently".
Changelogs are processed at brick level, but all the fops are replayed
on the "slave mount" point.

Problem:
With a DHT volume, "internal fops" are NOT recorded in the changelog.
In the rename case, the rename is recorded in the "hashed" brick's changelog.
(DHT's internal fops, like creating a linkto file or unlinking it, are NOT recorded.)
This leads to inconsistent rename operations.

For example,
Distribute volume created with Two bricks B1, B2.

//Consider master volume mounted @ /mnt/master
and following operations executed:
cd /mnt/master
touch f1 // f1 falls on B1 Hash
mv f1 f2 // f2 falls on B2 Hash

// Here, Changelogs are recorded as below:
@B1
CREATE f1
@B2
RENAME f1 f2

Here, a race exists between bricks B1 and B2. Say B2's changelog is
processed first. The source file f1 is "NOT PRESENT" on the slave, so the
current implementation goes ahead and creates f2.
We have this problem when the rename falls on another brick and
the file is unlinked on the master.

A similar issue exists in the following case too (multiple renames):
CREATE f1
RENAME f1 f2
RENAME f2 f1

Solution:

Instead of carrying out "changelogging" at the "HASHED volume",
carry it out at the "CACHED volume".
This way rename operations are carried out where the actual files are present.
So the changelog is recorded as:
@B1
CREATE f1
RENAME f1 f2
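
The difference between the two logging schemes can be illustrated with
a toy replay model. This is a sketch only and not gsyncd: the slave
namespace is modelled as a small set of names, all identifiers are
hypothetical, and a RENAME whose source is missing creates the
destination, as described for the current implementation above.

```c
#include <assert.h>
#include <string.h>

#define MAX_FILES 8
#define NAME_LEN  16

/* Toy model of the slave namespace: a small set of file names. */
struct slave_ns {
    char names[MAX_FILES][NAME_LEN];
    int  count;
};

enum op_type { OP_CREATE, OP_RENAME };

/* One changelog record; dst is unused for OP_CREATE. */
struct log_entry {
    enum op_type type;
    const char  *src;
    const char  *dst;
};

/* Return the index of name in the namespace, or -1 if absent. */
int ns_find(const struct slave_ns *ns, const char *name)
{
    for (int i = 0; i < ns->count; i++)
        if (strcmp(ns->names[i], name) == 0)
            return i;
    return -1;
}

/* Add name unless it is already present. */
void ns_add(struct slave_ns *ns, const char *name)
{
    if (ns_find(ns, name) >= 0)
        return;
    assert(ns->count < MAX_FILES && strlen(name) < NAME_LEN);
    strcpy(ns->names[ns->count++], name);
}

/* Remove the entry at idx by moving the last entry into its slot. */
void ns_remove(struct slave_ns *ns, int idx)
{
    ns->count--;
    if (idx != ns->count)
        strcpy(ns->names[idx], ns->names[ns->count]);
}

/* Replay one brick's changelog, in order, against the slave. */
void replay(struct slave_ns *ns, const struct log_entry *log, int n)
{
    for (int i = 0; i < n; i++) {
        if (log[i].type == OP_CREATE) {
            ns_add(ns, log[i].src);
            continue;
        }
        int idx = ns_find(ns, log[i].src);
        if (idx >= 0)
            ns_remove(ns, idx);
        /* If the source is missing, the destination is still created. */
        ns_add(ns, log[i].dst);
    }
}
```

Replaying B2's "RENAME f1 f2" before B1's "CREATE f1" leaves both f1
and f2 on the slave, while replaying the single B1 log of the new scheme
leaves only f2, which matches the master.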

credit: sarumuga@redhat.com

PS: Some races, such as the one below, are _NOT_ fixed by this patch:

* f1 and f2 exist. B1 and B2 are their respective cached subvols. For
both files hashed-subvol == cached-subvol
* mv f1 f2 on master.
* B1 has change-log entry of rename f1 f2
* rebalance migrates f2 from B1 to B2
* mv f2 f1 on master.
* B2 has change-log entry of rename f2 f1

Since the changelog entries (rename f1 f2) and (rename f2 f1) are processed
independently by gsyncds, which of f1 and f2 survives on the slave
is subject to a race. Note that on the master it is file f1, with name f1,
that survived. On the slave it can be either file f1 with name f1 or file f2
with name f2, depending on which gsyncd wins the race to process its changelog.

BUG: 1219412
Change-Id: I43725d69635e2ce065135691ef629014e8df7d50
Original-Author: Nithya Balachandran 
Reviewed-on: http://review.gluster.org/10410
Signed-off-by: Saravanakumar Arumugam 
Reviewed-on: http://review.gluster.org/10628
Tested-by: Gluster Build System 
Tested-by: NetBSD Build System
Reviewed-by: Kotresh HR 
Reviewed-by: Raghavendra G 
Reviewed-by: Vijay Bellur

gluster/dht: tiered volumes may not allow access to files undergoing migration

2015-05-08T08:55:41+00:00

This is a backport of fix 10324 to Gluster 3.7.

If a read IO occurs against a file that has reached rebalance
phase 2, we redirect the IO to the destination. For tiered
volumes, when we try to reopen the file (on the destination),
the lower level DHT receives the open call and fails; it does
not have a "cached subvol". Fix is to "teach" the lower level
DHT of the new location by sending a locate before the open.

> http://review.gluster.org/#/c/10324/
> Change-Id: Ia4acb0035ff1da15f6a8f9ed54f43c76e8b98f5f
> BUG: 1214048
> Signed-off-by: Dan Lambright 
> Signed-off-by: root 
> Signed-off-by: Dan Lambright 
> Reviewed-on: http://review.gluster.org/10324
> Tested-by: NetBSD Build System
> Tested-by: Gluster Build System 
> Reviewed-by: Raghavendra G 
> Tested-by: Raghavendra G 
> Signed-off-by: Dan Lambright 

Change-Id: Ia4acb0035ff1da15f6a8f9ed54f43c76e8b98f5f
BUG: 1219608
Signed-off-by: Dan Lambright 
Reviewed-on: http://review.gluster.org/10654
Tested-by: Gluster Build System 
Tested-by: NetBSD Build System
Reviewed-by: Joseph Fernandes
Reviewed-by: Vijay Bellur

Restore build on non Linux systems

2015-05-07T18:52:10+00:00

This change broke the build on NetBSD, FreeBSD, and MacOS X:
http://review.gluster.org/10526/

We restore the build with two fixes:
- Use POSIX-compliant sysconf(_SC_NPROCESSORS_ONLN) to get the
  number of processors, instead of Linux specific get_nprocs().
  That lets us remove the Linux-specific #include <sys/sysinfo.h>.
- Only define MAX() if it is not already defined. NetBSD defines
  it in a system header that is already included.
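
A minimal sketch of the two fixes above. online_cpu_count() is a
hypothetical helper name and the fallback to 1 on error is an
assumption, not necessarily what the patch does.
_SC_NPROCESSORS_ONLN is available on Linux, the BSDs and OS X.

```c
#include <assert.h>
#include <unistd.h>

/* Only define MAX() when no system header has already provided it. */
#ifndef MAX
#define MAX(a, b) (((a) > (b)) ? (a) : (b))
#endif

/*
 * Portable replacement for the Linux-only get_nprocs().
 * sysconf() returns -1 on error, so clamp the result to at least 1.
 */
long online_cpu_count(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    long result = MAX(n, 1L);

    assert(result >= 1); /* callers can size thread pools with this */
    return result;
}
```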

Backport of: I62341c670598670e47ea2f69ab94864f96588b18
BUG: 1212676
Signed-off-by: Emmanuel Dreyfus 

Change-Id: I0f098153e76954bb85b5dca3f054a069e31dd94c
Reviewed-on: http://review.gluster.org/10653
Reviewed-by: Shyamsundar Ranganathan 
Reviewed-by: Kaleb KEITHLEY 
Tested-by: Vijay Bellur