glusterfs.git/tests/bugs/replicate, branch release-3.10

mount/fuse: use fstat in getattr implementation if any opened fd is available

2018-01-03T14:54:19+00:00

The restriction of using fds opened by the same Pid means fds cannot
be shared across threads of multithreaded application. Note that fops
from kernel have different Pid for different threads. Imagine
following sequence of operations:

* Turn off performance.open-behind
* Thread t1 opens an fd - fd1 - on file "file". Let's assume nodeid of
  "file" is "nodeid-file".
* Thread t2 does RENAME ("newfile", "file"). Let's assume nodeid of
  "newfile" as "nodeid-newfile".
* t2 proceeds to do fstat (fd1)

The above set of operations can sometimes result in ESTALE/ENOENT
errors. RENAME overwrites "file" with "newfile" changing its nodeid
from "nodeid-file" to "nodeid-newfile" and post RENAME, "nodeid-file" is
removed from the backend. If fstat carries nodeid-file as argument,
which can happen if lookup has not refreshed the nodeid of "file" and
since t2 doesn't have an fd opened, fuse_getattr_resume uses STAT
which will fail as "nodeid-file" no longer exists.

Since the above set of operations and sharing of fds across
multiple threads are valid, this is a bug.

The fix is to use any fd opened on the inode. In this specific example
fuse_getattr_resume will find fd1 and winds down the call as fstat
(fd1) which won't fail.

Cross-checked with "Miklos Szeredi"  for
any security issues with this solution and he approves the solution.

Thanks to "Miklos Szeredi"  for all the
pointers and discussions.

>Change-Id: I88dd29b3607cd2594eee9d72a1637b5346c8d49c
>BUG: 1510401
>Signed-off-by: Raghavendra G 

(cherry picked from commit 8b57378e5596f287a7b9d106dd6fb56a624b42ee)
Change-Id: I88dd29b3607cd2594eee9d72a1637b5346c8d49c
BUG: 1529086
Signed-off-by: Raghavendra G

afr: auto-resolve split-brains for zero-byte files

2017-10-02T12:34:23+00:00

Backport of https://review.gluster.org/#/c/18283/

Problems:
As described in BZ 1491670, renaming hardlinks can result in data/mdata
split-brain of the DHT link-to files (T files) without any mismatch of
data and metadata.

As described in BZ 1486063, for a zero-byte file with only dirty bits
set, arbiter brick will likely be chosen as the source brick.

Fix:
For zero byte files in split-brain, pick first brick as
a) data source if file size is zero on all bricks.
b) metadata source if metadata is the same on all bricks

In arbiter case, if file size is zero on all bricks and there are no
pending afr xattrs, pick 1st brick as data source.

Change-Id: I0270a9a2f97c3b21087e280bb890159b43975e04
BUG: 1496321
Signed-off-by: Ravishankar N 
Reported-by: Rahul Hinduja 
Reported-by: Mabi

posix: add sanity checks for removing the gfid symlink for directories

2017-09-23T12:19:56+00:00

Backport of https://review.gluster.org/17945

...during mkdir and rmdir. Otherwise, during entry self-heal, the
directory could be left out without a .glusterfs symlink causing fops like
opendir, readdir to fail.  The only chance the missing symlink will be
created is when a fresh lookup comes on it.

Change-Id: I2e1cf1bce8962ea80187edd8f6d73e0a09cf9f8e
BUG: 1491966
Signed-off-by: Ravishankar N

afr: mark non sources as sinks in metadata heal

2017-07-19T21:32:48+00:00

Problem:
In a 3 way replica, when the source brick does not have pending xattrs
for the sinks, but the 2 sinks blame each other, metadata heal was not
happpening because we were not setting all non-sources as sinks.

Fix: Mark all non-sources as sinks, like it is done in data and entry
heal.

> Reviewed-on: https://review.gluster.org/17717
> Smoke: Gluster Build System 
> Reviewed-by: Pranith Kumar Karampuri 
> CentOS-regression: Gluster Build System 
(cherry picked from commit 77c1ed5fd299914e91ff034d78ef6e3600b9151c)

Change-Id: I534978940f5087302e307fcc810a48ffe898ce08
BUG: 1471612
Signed-off-by: Ravishankar N 
Reviewed-on: https://review.gluster.org/17782
Smoke: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Pranith Kumar Karampuri 
CentOS-regression: Gluster Build System

afr: don't do a post-op on a brick if op failed

2017-04-27T10:46:55+00:00

Problem:
In afr-v2, self-blaming xattrs are not there by design. But if the FOP
failed on a brick due to an error other than ENOTCONN (or even due to
ENOTCONN, but we regained connection before postop was wound), we wind
the post-op also on the failed brick, leading to setting self-blaming
xattrs on that brick. This can lead to undesired results like healing of
files in split-brain etc.

Fix:
If a fop failed on a brick on which pre-op was successful, do not
perform post-op on it. This also produces the desired effect of not
resetting the dirty xattr on the brick, which is how it should be
because if the fop failed on a brick, there is no reason to clear the
dirty bit which actually serves as an indication of the failure.

> Reviewed-on: https://review.gluster.org/16976
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Pranith Kumar Karampuri 
(cherry picked from commit 10dad995c989e9d77c341135d7c48817baba966c)

Change-Id: I5f1caf4d1b39f36cf8093ccef940118638caa9c4
BUG: 1443501
Signed-off-by: Ravishankar N 
Reviewed-on: https://review.gluster.org/17083
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/afr: Undo pending xattrs only on the up bricks

2017-03-28T12:03:19+00:00

Problem:
While doing conservative merge, even if a brick is down, it will reset
the pending xattr on that. When that brick comes up, as part of the
heal, it will consider this brick as the source and removes the entries
on the other bricks, which leads to data loss.

Fix:
Undo pending only for the bricks which are up.

> Change-Id: I18436fa0bb1faa5f60531b357dea3f6b20446303
> BUG: 1433571
> Signed-off-by: karthik-us 
> Reviewed-on: https://review.gluster.org/16913
> Reviewed-by: Pranith Kumar Karampuri 
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Ravishankar N 
(cherry picked from commit f91596e6566c605e70a31a60523d11f78a097c3c)

Change-Id: I51dbdc53e84051ec73308df9d4cf27726fc29dc7
BUG: 1436203
Signed-off-by: karthik-us 
Reviewed-on: https://review.gluster.org/16955
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri 
Reviewed-by: Ravishankar N

afr: all children of AFR must be up to resolve s-brain

2017-02-15T12:31:45+00:00

Problem:
The various split-brain resolution policies (favorite-child-policy based,
CLI based and mount (get/setfattr) based) attempt to resolve split-brain
even when not all bricks of replica are up. This can be a problem when
say in a replica 3, the only good copy is down and the other 2 bricks
are up and blame each other (i.e. split-brain). We end up healing the
file in such a  case and allow I/O on it.

Fix:
A decision on whether the file is in split-brain or not must be taken
only if we are able to examine the afr xattrs of *all* bricks of a given
replica.

Signed-off-by: Ravishankar N 
> Reviewed-on: https://review.gluster.org/16476
> Smoke: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Pranith Kumar Karampuri 

(cherry picked from commit 0e03336a9362e5717e561f76b0c543e5a197b31b)
Change-Id: Icddb1268b380005799990f5379ef957d84639ef9
BUG: 1420982
Reviewed-on: https://review.gluster.org/16587
Tested-by: Ravishankar N 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

core: run many bricks within one glusterfsd process

2017-02-02T00:54:58+00:00

This patch adds support for multiple brick translator stacks running in
a single brick server process.  This reduces our per-brick memory usage
by approximately 3x, and our appetite for TCP ports even more.  It also
creates potential to avoid process/thread thrashing, and to improve QoS
by scheduling more carefully across the bricks, but realizing that
potential will require further work.

Multiplexing is controlled by the "cluster.brick-multiplex" global
option.  By default it's off, and bricks are started in separate
processes as before.  If multiplexing is enabled, then *compatible*
bricks (mostly those with the same transport options) will be started in
the same process.

Backport of:
> Change-Id: I45059454e51d6f4cbb29a4953359c09a408695cb
> BUG: 1385758
> Reviewed-on: https://review.gluster.org/14763

Change-Id: I4bce9080f6c93d50171823298fdf920258317ee8
BUG: 1418091
Signed-off-by: Jeff Darcy 
Reviewed-on: https://review.gluster.org/16496
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

tier : Tier as a service

2017-01-17T04:49:47+00:00

tierd is implemented by separating from rebalance process.

The commands affected:

1) Attach tier will trigger this process instead of old one
2) tier start and tier start force will also trigger this process.
3) volume status [tier] will show tier daemon as a process instead
of task and normal tier status and tier detach status works.
4) tier stop implemented.
5) detach tier implemented separately along with new detach tier
status
6) volume tier volname status will work using the changes.
7) volume set works

This patch has separated the tier translator from the legacy
DHT rebalance code. It now sends the RPCs from the CLI
to glusterd separate to the DHT rebalance code.
The daemon is now a service, similar to the snapshot daemon,
and can be viewed using the volume status command.

The code for the validation and commit phase are the same
as the earlier tier validation code in DHT rebalance.

The “brickop” phase has been changed so that the status
command can use this framework.

The service management framework is now used.
DHT rebalance does not use this framework.

This service framework takes care of :

*) spawning the daemon, killing it and other such processes.
*) volume set options , which are written on the volfile.
*) restart and reconfigure functions. Restart is to restart
the daemon at two points
        1)after gluster goes down and comes up.
        2) to stop detach tier.
*) reconfigure is used to make immediate volfile changes.
By doing this, we don’t restart the daemon.
it has the code to rewrite the volfile for topological
changes too (which comes into place during add and remove brick).

With this patch the log, pid, and volfile are separated
and put into respective directories.

Change-Id: I3681d0d66894714b55aa02ca2a30ac000362a399
BUG: 1313838
Signed-off-by: hari gowtham 
Reviewed-on: http://review.gluster.org/13365
Smoke: Gluster Build System 
Tested-by: hari gowtham 
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Dan Lambright 
Reviewed-by: Atin Mukherjee

cluster/afr: Fix missing name indices due to EEXIST error

2016-12-27T11:53:04+00:00

PROBLEM:
Consider a volume with  granular-entry-heal and sharding enabled. When
a replica is down and a shard is created as part of a write, the name
index is correctly created under indices/entry-changes/.
Now when a read on the same region triggers another MKNOD, the fop
fails on the online bricks with EEXIST. By virtue of this being a
symmetric error, the failed_subvols[] array is reset to all zeroes.
Because of this, before post-op, the GF_XATTROP_ENTRY_OUT_KEY will be
set, causing the name index, which was created in the previous MKNOD
operation, to be wrongly deleted in THIS MKNOD operation.

FIX:
The ideal fix would have been for a transaction to delete the name
index ONLY if it knows it is the one that created the index in the first
place. This would involve gathering information as to whether THIS xattrop
created the index from individual bricks, aggregating their responses and
based on the various posisble combinations of responses, decide whether to
delete the index or not. This is rather complex. Simpler fix would be
for post-op to examine local->op_ret in the event of no failed_subvols
to figure out whether to delete the name index or not. This can occasionally
lead to creation of stale name indices but they won't be affecting the IO path
or mess with pending changelogs in any way and self-heal in its crawl of
"entry-changes" directory would take care to delete such indices.

Change-Id: Ic1b5257f4dc9c20cb740a866b9598cf785a1affa
BUG: 1408712
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/16286
Smoke: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System