<feed xmlns='http://www.w3.org/2005/Atom'>
<title>glusterfs.git/xlators/cluster/afr/src/afr-common.c, branch release-3.11</title>
<subtitle></subtitle>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/'/>
<entry>
<title>cluster/afr: Implement quorum for lk fop</title>
<updated>2017-06-20T13:49:23+00:00</updated>
<author>
<name>Pranith Kumar K</name>
<email>pkarampu@redhat.com</email>
</author>
<published>2017-06-12T16:36:18+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=6945c9b3dc3afd290ee12cd5f90a6c538b2aaf14'/>
<id>6945c9b3dc3afd290ee12cd5f90a6c538b2aaf14</id>
<content type='text'>
Problem:
At the moment, with a replica 3 or arbiter setup, we return success to
the application even when lk succeeds on just one brick, which is
wrong.

Fix:
Consider quorum-number of successes as success when quorum is enabled.

 &gt;BUG: 1461792
 &gt;Change-Id: I5789e6eb5defb68f8a0eb9cd594d316f5cdebaea
 &gt;Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
 &gt;Reviewed-on: https://review.gluster.org/17524
 &gt;Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
 &gt;NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
 &gt;CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
 &gt;Reviewed-by: Ravishankar N &lt;ravishankar@redhat.com&gt;

BUG: 1462661
Change-Id: I5789e6eb5defb68f8a0eb9cd594d316f5cdebaea
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17578
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Shyamsundar Ranganathan &lt;srangana@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Problem:
At the moment, with a replica 3 or arbiter setup, we return success to
the application even when lk succeeds on just one brick, which is
wrong.

Fix:
Consider quorum-number of successes as success when quorum is enabled.
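
For illustration, a minimal C sketch of the quorum check described above;
the structure and function names are assumptions, not the actual AFR code:

#include &lt;stdbool.h&gt;

struct lk_reply {
    bool success;                  /* did lk succeed on this brick? */
};

/* Return success only when at least quorum_count bricks succeeded,
 * instead of declaring success on the first successful brick. */
static bool
lk_has_quorum(const struct lk_reply *replies, int child_count,
              int quorum_count)
{
    int wins = 0;

    for (int i = 0; i &lt; child_count; i++) {
        if (replies[i].success)
            wins++;
    }
    return wins &gt;= quorum_count;
}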

 &gt;BUG: 1461792
 &gt;Change-Id: I5789e6eb5defb68f8a0eb9cd594d316f5cdebaea
 &gt;Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
 &gt;Reviewed-on: https://review.gluster.org/17524
 &gt;Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
 &gt;NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
 &gt;CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
 &gt;Reviewed-by: Ravishankar N &lt;ravishankar@redhat.com&gt;

BUG: 1462661
Change-Id: I5789e6eb5defb68f8a0eb9cd594d316f5cdebaea
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17578
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Shyamsundar Ranganathan &lt;srangana@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>afr: add errno to afr_inode_refresh_done()</title>
<updated>2017-06-06T12:55:00+00:00</updated>
<author>
<name>Ravishankar N</name>
<email>ravishankar@redhat.com</email>
</author>
<published>2017-06-05T04:10:51+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=cced21a21528cebee7220b732fd26191ea5a83c3'/>
<id>cced21a21528cebee7220b732fd26191ea5a83c3</id>
<content type='text'>
Backport of https://review.gluster.org/17413 and
https://review.gluster.org/17436

Problem:
When parallel `rm -rf`s were being done from cifs clients, opendir might
fail on some replicas with ENOENT. DHT ignores partial opendir failures
in dht_fd_cbk() and winds readdirs on those replicas. Afr inode refresh
(as a part of readdirp read_txn) sees in its fd context that the state
of the fds is *not* AFR_FD_OPENED and bails out to
afr_inode_refresh_done() without doing a refresh. When this happens, the
errno is set to EIO due to lack of readable subvols, and spurious
split-brain messages are logged.

Fix:
Introduce an errno argument to afr_inode_refresh_do() to bail out with
the right error value when inode refresh is not performed.

Change-Id: I075707fbb73fd93a923b77b923a96aac79e847f9
BUG: 1457616
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17434
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Backport of https://review.gluster.org/17413 and
https://review.gluster.org/17436

Problem:
When parallel `rm -rf`s were being done from cifs clients, opendir might
fail on some replicas with ENOENT. DHT ignores partial opendir failures
in dht_fd_cbk() and winds readdirs on those replicas. Afr inode refresh
(as a part of readdirp read_txn) sees in its fd context that the state
of the fds is *not* AFR_FD_OPENED and bails out to
afr_inode_refresh_done() without doing a refresh. When this happens, the
errno is set to EIO due to lack of readable subvols, and spurious
split-brain messages are logged.

Fix:
Introduce an errno argument to afr_inode_refresh_do() to bail out with
the right error value when inode refresh is not performed.
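
Illustrative sketch only (the real afr_inode_refresh_do()/afr_inode_refresh_done()
signatures differ); it just shows threading the real errno through the bail-out
path instead of defaulting to EIO:

#include &lt;errno.h&gt;
#include &lt;stdbool.h&gt;

typedef void (*refresh_done_fn)(void *frame, int op_errno);

static void
inode_refresh_do(void *frame, bool fds_opened, int open_errno,
                 refresh_done_fn done)
{
    if (!fds_opened) {
        /* Bail out with the error seen while opening the fds (e.g.
         * ENOENT from opendir), not a blanket EIO, so no spurious
         * split-brain messages are logged. */
        done(frame, open_errno ? open_errno : ENOENT);
        return;
    }
    /* ... perform the actual refresh, then report success ... */
    done(frame, 0);
}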

Change-Id: I075707fbb73fd93a923b77b923a96aac79e847f9
BUG: 1457616
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17434
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cluster/afr: Return the list of node_uuids for the subvolume</title>
<updated>2017-05-19T13:20:57+00:00</updated>
<author>
<name>karthik-us</name>
<email>ksubrahm@redhat.com</email>
</author>
<published>2017-04-19T12:34:46+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=74aa9ab2f2f6b2514847457101642b359823fde5'/>
<id>74aa9ab2f2f6b2514847457101642b359823fde5</id>
<content type='text'>
Problem:
AFR was returning the node uuid of the first node for every file if
the replica set was healthy, which was resulting in only one node
migrating all the files.

Fix:
With this patch AFR returns the list of node_uuids to the upper layer,
so that it can decide which node should migrate which files, resulting
in improved performance. Ordering of node uuids will be maintained based
on the ordering of the bricks. If a brick is down, then the node uuid
for that brick will be set to all zeros.

&gt;Reviewed-on: https://review.gluster.org/17084
&gt; Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
&gt; Tested-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
&gt; Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
&gt; NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
&gt; CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
(cherry picked from commit 0a50167c0a8f950f5a1c76442b6c9abea466200d)

Change-Id: I73ee0f9898ae473584fdf487a2980d7a6db22f31
BUG: 1451573
Signed-off-by: karthik-us &lt;ksubrahm@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17336
Tested-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Shyamsundar Ranganathan &lt;srangana@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Problem:
AFR was returning the node uuid of the first node for every file if
the replica set was healthy, which was resulting in only one node
migrating all the files.

Fix:
With this patch AFR returns the list of node_uuids to the upper layer,
so that it can decide which node should migrate which files, resulting
in improved performance. Ordering of node uuids will be maintained based
on the ordering of the bricks. If a brick is down, then the node uuid
for that brick will be set to all zeros.
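
A minimal illustrative sketch of the idea (the real code uses libglusterfs
uuid helpers; names here are assumptions): one uuid per brick, in brick
order, zeroed when the brick is down:

#include &lt;string.h&gt;

#define UUID_SIZE 16

struct brick_state {
    int up;                                /* is the brick reachable? */
    unsigned char node_uuid[UUID_SIZE];
};

static void
fill_node_uuid_list(const struct brick_state *bricks, int count,
                    unsigned char out[][UUID_SIZE])
{
    for (int i = 0; i &lt; count; i++) {
        if (bricks[i].up)
            memcpy(out[i], bricks[i].node_uuid, UUID_SIZE);
        else
            memset(out[i], 0, UUID_SIZE);  /* all zeros for a down brick */
    }
}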

&gt;Reviewed-on: https://review.gluster.org/17084
&gt; Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
&gt; Tested-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
&gt; Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
&gt; NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
&gt; CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
(cherry picked from commit 0a50167c0a8f950f5a1c76442b6c9abea466200d)

Change-Id: I73ee0f9898ae473584fdf487a2980d7a6db22f31
BUG: 1451573
Signed-off-by: karthik-us &lt;ksubrahm@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17336
Tested-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Shyamsundar Ranganathan &lt;srangana@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>afr: include quorum type and count when dumping afr priv</title>
<updated>2017-05-12T13:36:40+00:00</updated>
<author>
<name>Ravishankar N</name>
<email>ravishankar@redhat.com</email>
</author>
<published>2017-05-05T16:49:36+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=3769fe7bf3c2674cf3ed233e79afd039e1ce6ddc'/>
<id>3769fe7bf3c2674cf3ed233e79afd039e1ce6ddc</id>
<content type='text'>
Squash of  https://review.gluster.org/17196 and
           https://review.gluster.org/17215

Dump the client quorum type ('auto', 'fixed' or 'none'). If it is 'fixed',
also dump the quorum-count. This information will be available in the client
statedump and in
/&lt;fuse_mount&gt;/.meta/graphs/active/testvol-replicate-X/private.

Also added a test-case.

Change-Id: I91367c5250b26efb35e5f7d7c397def09cc77cbc
BUG: 1449921
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17243
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Shyamsundar Ranganathan &lt;srangana@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Squash of  https://review.gluster.org/17196 and
           https://review.gluster.org/17215

Dump the client quorum type ('auto', 'fixed' or 'none'). If it is 'fixed',
also dump the quorum-count. This information will be available in the client
statedump and in
/&lt;fuse_mount&gt;/.meta/graphs/active/testvol-replicate-X/private.
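
For illustration, a rough sketch of what the dump adds (not the actual
statedump code; structure and field names are assumptions):

#include &lt;stdio.h&gt;

enum quorum_type { QUORUM_NONE, QUORUM_AUTO, QUORUM_FIXED };

struct afr_priv_sketch {
    enum quorum_type quorum;
    int quorum_count;                /* meaningful only for 'fixed' */
};

static void
dump_quorum_info(const struct afr_priv_sketch *priv, FILE *out)
{
    const char *name = (priv-&gt;quorum == QUORUM_AUTO)  ? "auto" :
                       (priv-&gt;quorum == QUORUM_FIXED) ? "fixed" : "none";

    fprintf(out, "quorum-type=%s\n", name);
    if (priv-&gt;quorum == QUORUM_FIXED)
        fprintf(out, "quorum-count=%d\n", priv-&gt;quorum_count);
}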

Also added a test-case.

Change-Id: I91367c5250b26efb35e5f7d7c397def09cc77cbc
BUG: 1449921
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17243
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Shyamsundar Ranganathan &lt;srangana@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>afr: send the correct iatt values in fsync cbk</title>
<updated>2017-05-12T13:35:15+00:00</updated>
<author>
<name>Ravishankar N</name>
<email>ravishankar@redhat.com</email>
</author>
<published>2017-05-08T21:01:39+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=a9c09c5fe97c6861fa3b500b690ebd5df063d0be'/>
<id>a9c09c5fe97c6861fa3b500b690ebd5df063d0be</id>
<content type='text'>
Problem:
afr unwinds the fsync fop with an iatt buffer from one of its children
on whom fsync was successful. But that child might not be a valid read
subvolume for that inode because of pending heals or because it happens
to be the arbiter brick etc. Thus we end up sending the wrong iatt to
mdcache which will in turn serve it to the application on a subsequent
stat call as reported in the BZ.

Fix:
Pick a child on whom the fsync was successful *and* that is readable as
indicated in the inode context.

&gt; Reviewed-on: https://review.gluster.org/17227
&gt; CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
&gt; Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
&gt; NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
&gt; Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
(cherry picked from commit 1a8fa910ccba7aa941f673302c1ddbd7bd818e39)

Change-Id: Ie8647289219cebe02dde4727e19a729b3353ebcf
BUG: 1449924
RCA'ed-by: Miklós Fokin &lt;miklos.fokin@appeartv.com&gt;
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17244
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Shyamsundar Ranganathan &lt;srangana@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Problem:
afr unwinds the fsync fop with an iatt buffer from one of its children
on whom fsync was successful. But that child might not be a valid read
subvolume for that inode because of pending heals or because it happens
to be the arbiter brick etc. Thus we end up sending the wrong iatt to
mdcache which will in turn serve it to the application on a subsequent
stat call as reported in the BZ.

Fix:
Pick a child on whom the fsync was successful *and* that is readable as
indicated in the inode context.
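
For illustration, a sketch of the selection rule (the arrays stand in for
afr's per-child success and readable state; names are assumptions):

#include &lt;stdbool.h&gt;

/* Return the index of a child that both completed fsync successfully and
 * is marked readable for this inode, or -1 if no such child exists. */
static int
pick_iatt_source(const bool *fsync_ok, const bool *readable, int children)
{
    for (int i = 0; i &lt; children; i++) {
        if (fsync_ok[i] &amp;&amp; readable[i])
            return i;
    }
    return -1;   /* caller falls back to any successful child */
}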

&gt; Reviewed-on: https://review.gluster.org/17227
&gt; CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
&gt; Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
&gt; NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
&gt; Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
(cherry picked from commit 1a8fa910ccba7aa941f673302c1ddbd7bd818e39)

Change-Id: Ie8647289219cebe02dde4727e19a729b3353ebcf
BUG: 1449924
RCA'ed-by: Miklós Fokin &lt;miklos.fokin@appeartv.com&gt;
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17244
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Shyamsundar Ranganathan &lt;srangana@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Halo Replication feature for AFR translator</title>
<updated>2017-05-08T05:37:07+00:00</updated>
<author>
<name>Kevin Vigor</name>
<email>kvigor@fb.com</email>
</author>
<published>2017-03-21T15:23:25+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=b6cc5261d5809aa509eecd082aefb7a0a14ca74b'/>
<id>b6cc5261d5809aa509eecd082aefb7a0a14ca74b</id>
<content type='text'>
	Backport of https://review.gluster.org/16177
		    https://review.gluster.org/17174

Merged both of these patches to make sure the IPv6 changes don't make it into 3.11 at all.

Summary:
Halo Geo-replication is a feature which allows Gluster or NFS clients to write
locally to their region (as defined by a latency "halo" or threshold if you
like), and have their writes asynchronously propagate from their origin to the
rest of the cluster.  Clients can also write synchronously to the cluster
simply by specifying a halo-latency which is very large (e.g. 10 seconds),
which will include all bricks.

In other words, it allows clients to decide at mount time if they desire
synchronous or asynchronous IO into a cluster and the cluster can support both
of these modes to any number of clients simultaneously.

There are a few new volume options due to this feature:
  halo-shd-latency:  The threshold below which self-heal daemons will
  consider children (bricks) connected.

  halo-nfsd-latency: The threshold below which NFS daemons will consider
  children (bricks) connected.

  halo-latency: The threshold below which all other clients will
  consider children (bricks) connected.

  halo-min-replicas: The minimum number of replicas which are to
  be enforced regardless of latency specified in the above 3 options.
  If the number of children falls below this threshold the next
  best (chosen by latency) shall be swapped in.

New FUSE mount options:
  halo-latency &amp; halo-min-replicas: As described above.

This feature combined with multi-threaded SHD support (D1271745) results in
some pretty cool geo-replication possibilities.

Operational Notes:
- Global consistency is guaranteed for synchronous clients; this is provided by
  the existing entry-locking mechanism.
- Asynchronous clients, on the other hand, are merely consistent to their region.
  Writes &amp; deletes will be protected via entry-locks as usual preventing
  concurrent writes into files which are undergoing replication.  Read operations
  on the other hand should never block.
- Writes are allowed from _any_ region and propagated from the origin to all
  other regions.  The takeaway from this is that care should be taken to ensure
  multiple writers do not write the same files resulting in a gfid split-brain
  which will require resolution via split-brain policies (majority, mtime &amp;
  size).  The recommended method for preventing this is using the nfs-auth feature
  to define which region for each share has RW permissions; tiers not in the origin
  region should have RO perms.

TODO:
- Synchronous clients (including the SHD) should choose clients from their own
  region as preferred sources for reads.  Most of the plumbing is in place for
  this via the child_latency array.
- Better GFID split brain handling &amp; better dentry-type split brain handling
  (i.e. create a trash can and move the offending files into it).
- Tagging in addition to latency as a means of defining which children you wish
  to synchronously write to

Test Plan:
- The usual suspects, clang, gcc w/ address sanitizer &amp; valgrind
- Prove tests

Reviewers: jackl, dph, cjh, meyering

Reviewed By: meyering

Subscribers: ethanr

Differential Revision: https://phabricator.fb.com/D1272053

Tasks: 4117827

 &gt;Change-Id: I694a9ab429722da538da171ec528406e77b5e6d1
 &gt;BUG: 1428061
 &gt;Signed-off-by: Kevin Vigor &lt;kvigor@fb.com&gt;
 &gt;Reviewed-on: http://review.gluster.org/16099
 &gt;Reviewed-on: https://review.gluster.org/16177
 &gt;Tested-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
 &gt;Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
 &gt;NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
 &gt;CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
 &gt;Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;

BUG: 1448416
Change-Id: I694a9ab429722da538da171ec528406e77b5e6d1
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17192
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Kaushal M &lt;kaushal@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
	Backport of https://review.gluster.org/16177
		    https://review.gluster.org/17174

Merged both of these patches to make sure the IPv6 changes don't make it into 3.11 at all.

Summary:
Halo Geo-replication is a feature which allows Gluster or NFS clients to write
locally to their region (as defined by a latency "halo" or threshold if you
like), and have their writes asynchronously propagate from their origin to the
rest of the cluster.  Clients can also write synchronously to the cluster
simply by specifying a halo-latency which is very large (e.g. 10 seconds),
which will include all bricks.

In other words, it allows clients to decide at mount time if they desire
synchronous or asynchronous IO into a cluster and the cluster can support both
of these modes to any number of clients simultaneously.

There are a few new volume options due to this feature:
  halo-shd-latency:  The threshold below which self-heal daemons will
  consider children (bricks) connected.

  halo-nfsd-latency: The threshold below which NFS daemons will consider
  children (bricks) connected.

  halo-latency: The threshold below which all other clients will
  consider children (bricks) connected.

  halo-min-replicas: The minimum number of replicas which are to
  be enforced regardless of latency specified in the above 3 options.
  If the number of children falls below this threshold the next
  best (chosen by latency) shall be swapped in.
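
For illustration, a rough C sketch of the halo membership rule implied by the
options above (not the actual AFR halo code; names are assumptions): a child is
in the halo when its latency is below the threshold, and the lowest-latency
outsiders are swapped in until halo-min-replicas is met.

static int
children_in_halo(const double *latency_ms, int child_count,
                 double halo_latency_ms, int min_replicas, int *in_halo)
{
    int n = 0;

    /* A child counts as "in the halo" when its latency is below the
     * configured threshold. */
    for (int i = 0; i &lt; child_count; i++) {
        in_halo[i] = latency_ms[i] &lt; halo_latency_ms;
        if (in_halo[i])
            n++;
    }

    /* Too few children under the threshold: swap in the best remaining
     * ones (lowest latency first) until min_replicas is satisfied. */
    while (n &lt; min_replicas &amp;&amp; n &lt; child_count) {
        int best = -1;
        for (int i = 0; i &lt; child_count; i++) {
            if (!in_halo[i] &amp;&amp;
                (best &lt; 0 || latency_ms[i] &lt; latency_ms[best]))
                best = i;
        }
        if (best &lt; 0)
            break;
        in_halo[best] = 1;
        n++;
    }
    return n;
}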

New FUSE mount options:
  halo-latency &amp; halo-min-replicas: As described above.

This feature combined with multi-threaded SHD support (D1271745) results in
some pretty cool geo-replication possibilities.

Operational Notes:
- Global consistency is guaranteed for synchronous clients; this is provided by
  the existing entry-locking mechanism.
- Asynchronous clients, on the other hand, are merely consistent to their region.
  Writes &amp; deletes will be protected via entry-locks as usual preventing
  concurrent writes into files which are undergoing replication.  Read operations
  on the other hand should never block.
- Writes are allowed from _any_ region and propagated from the origin to all
  other regions.  The takeaway from this is that care should be taken to ensure
  multiple writers do not write the same files resulting in a gfid split-brain
  which will require resolution via split-brain policies (majority, mtime &amp;
  size).  The recommended method for preventing this is using the nfs-auth feature
  to define which region for each share has RW permissions; tiers not in the origin
  region should have RO perms.

TODO:
- Synchronous clients (including the SHD) should choose clients from their own
  region as preferred sources for reads.  Most of the plumbing is in place for
  this via the child_latency array.
- Better GFID split brain handling &amp; better dentry-type split brain handling
  (i.e. create a trash can and move the offending files into it).
- Tagging in addition to latency as a means of defining which children you wish
  to synchronously write to

Test Plan:
- The usual suspects, clang, gcc w/ address sanitizer &amp; valgrind
- Prove tests

Reviewers: jackl, dph, cjh, meyering

Reviewed By: meyering

Subscribers: ethanr

Differential Revision: https://phabricator.fb.com/D1272053

Tasks: 4117827

 &gt;Change-Id: I694a9ab429722da538da171ec528406e77b5e6d1
 &gt;BUG: 1428061
 &gt;Signed-off-by: Kevin Vigor &lt;kvigor@fb.com&gt;
 &gt;Reviewed-on: http://review.gluster.org/16099
 &gt;Reviewed-on: https://review.gluster.org/16177
 &gt;Tested-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
 &gt;Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
 &gt;NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
 &gt;CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
 &gt;Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;

BUG: 1448416
Change-Id: I694a9ab429722da538da171ec528406e77b5e6d1
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17192
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Kaushal M &lt;kaushal@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>afr: all children of AFR must be up to resolve s-brain</title>
<updated>2017-02-10T01:37:00+00:00</updated>
<author>
<name>Ravishankar N</name>
<email>ravishankar@redhat.com</email>
</author>
<published>2017-01-30T04:24:16+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=0e03336a9362e5717e561f76b0c543e5a197b31b'/>
<id>0e03336a9362e5717e561f76b0c543e5a197b31b</id>
<content type='text'>
Problem:
The various split-brain resolution policies (favorite-child-policy based,
CLI based and mount (get/setfattr) based) attempt to resolve split-brain
even when not all bricks of the replica are up. This can be a problem
when, say in a replica 3, the only good copy is down and the other 2
bricks are up and blame each other (i.e. split-brain). We end up healing
the file in such a case and allowing I/O on it.

Fix:
A decision on whether the file is in split-brain or not must be taken
only if we are able to examine the afr xattrs of *all* bricks of a given
replica.

Change-Id: Icddb1268b380005799990f5379ef957d84639ef9
BUG: 1417522
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
Reviewed-on: https://review.gluster.org/16476
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Problem:
The various split-brain resolution policies (favorite-child-policy based,
CLI based and mount (get/setfattr) based) attempt to resolve split-brain
even when not all bricks of the replica are up. This can be a problem
when, say in a replica 3, the only good copy is down and the other 2
bricks are up and blame each other (i.e. split-brain). We end up healing
the file in such a case and allowing I/O on it.

Fix:
A decision on whether the file is in split-brain or not must be taken
only if we are able to examine the afr xattrs of *all* bricks of a given
replica.
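
A minimal sketch of the guard (illustrative only; names are assumptions):

#include &lt;stdbool.h&gt;

/* Split-brain resolution may proceed only when every brick of the
 * replica is up, i.e. the afr xattrs of all copies can be examined. */
static bool
may_resolve_split_brain(const bool *child_up, int child_count)
{
    for (int i = 0; i &lt; child_count; i++) {
        if (!child_up[i])
            return false;   /* a down brick might hold the only good copy */
    }
    return true;
}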

Change-Id: Icddb1268b380005799990f5379ef957d84639ef9
BUG: 1417522
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
Reviewed-on: https://review.gluster.org/16476
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>afr: Avoid resetting event_gen when brick is always down</title>
<updated>2017-01-13T02:54:02+00:00</updated>
<author>
<name>Ravishankar N</name>
<email>ravishankar@redhat.com</email>
</author>
<published>2016-12-30T09:27:17+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=522640be476a3f97dac932f7046f0643ec0ec2f2'/>
<id>522640be476a3f97dac932f7046f0643ec0ec2f2</id>
<content type='text'>
Problem:
__afr_set_in_flight_sb_status(), which resets event_gen to zero, is
called if failed_subvols[i] is non-zero for any brick. But failed_subvols[i]
is true even if the brick was down *before* the transaction started.
Hence, say, if 1 brick is down in a replica-3, every writev that comes
will trigger an inode refresh because of this resetting, as seen from
the no. of FSTATs in the profile info in the BZ.

Fix:
Reset event gen only if the brick was previously a valid read child and
the FOP failed on it the first time.

Also `s/afr_inode_read_subvol_reset/afr_inode_event_gen_reset` because
the function only resets event gen and not the data/metadata readable.

Change-Id: I603ae646cbde96995c35db77916e2ed80b602a91
BUG: 1409206
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
Reviewed-on: http://review.gluster.org/16309
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
Tested-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Problem:
__afr_set_in_flight_sb_status(), which resets event_gen to zero, is
called if failed_subvols[i] is non-zero for any brick. But failed_subvols[i]
is true even if the brick was down *before* the transaction started.
Hence, say, if 1 brick is down in a replica-3, every writev that comes
will trigger an inode refresh because of this resetting, as seen from
the no. of FSTATs in the profile info in the BZ.

Fix:
Reset event gen only if the brick was previously a valid read child and
the FOP failed on it the first time.
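
Illustrative sketch of the new condition (not the actual
__afr_set_in_flight_sb_status() code; names are assumptions):

#include &lt;stdbool.h&gt;

struct inode_ctx_sketch {
    unsigned event_gen;
};

/* Reset event_gen (forcing an inode refresh) only when a brick that was
 * a valid read child before the transaction failed during it -- not when
 * it was already down beforehand. */
static void
maybe_reset_event_gen(struct inode_ctx_sketch *ctx, const bool *was_readable,
                      const bool *failed_now, int child_count)
{
    for (int i = 0; i &lt; child_count; i++) {
        if (was_readable[i] &amp;&amp; failed_now[i]) {
            ctx-&gt;event_gen = 0;
            return;
        }
    }
    /* Brick was down all along: keep event_gen and avoid spurious FSTATs. */
}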

Also `s/afr_inode_read_subvol_reset/afr_inode_event_gen_reset` because
the function only resets event gen and not the data/metadata readable.

Change-Id: I603ae646cbde96995c35db77916e2ed80b602a91
BUG: 1409206
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
Reviewed-on: http://review.gluster.org/16309
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
Tested-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ec: Invalidations in disperse volume should not update the stat</title>
<updated>2017-01-06T05:12:18+00:00</updated>
<author>
<name>Poornima G</name>
<email>pgurusid@redhat.com</email>
</author>
<published>2017-01-05T10:06:02+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=95d07a3d2d68805d93d36a447436e27c48777939'/>
<id>95d07a3d2d68805d93d36a447436e27c48777939</id>
<content type='text'>
Issue:
In a disperse volume, the file is present across bricks, hence the stat
from one brick doesn't carry the valid size of the file. Therefore an
upcall from one brick updating the md-cache results in the wrong size
being cached.

Fix:
If the notification is a cache invalidation, then indicate to md-cache
that the attributes are invalid.

BUG: 1410375
Change-Id: Id89d2283478e70b62b435a8891fffc86d2be8cb2
Signed-off-by: Poornima G &lt;pgurusid@redhat.com&gt;
Reviewed-on: http://review.gluster.org/16329
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Xavier Hernandez &lt;xhernandez@datalab.es&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Issue:
In a disperse volume, the file is present across bricks, hence the stat
from one brick doesn't carry the valid size of the file. Therefore an
upcall from one brick updating the md-cache results in the wrong size
being cached.

Fix:
If the notification is a cache invalidation, then indicate to md-cache
that the attributes are invalid.
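
A minimal sketch of the idea (illustrative only; the real upcall structures
differ and the names here are assumptions):

#include &lt;stdbool.h&gt;

struct upcall_sketch {
    bool is_cache_invalidation;
    bool attrs_valid;          /* may md-cache trust the attached iatt? */
};

/* A single brick's iatt does not reflect the real file size in a
 * disperse volume, so only tell md-cache that its attributes are stale. */
static void
ec_filter_upcall(struct upcall_sketch *up)
{
    if (up-&gt;is_cache_invalidation)
        up-&gt;attrs_valid = false;   /* md-cache must re-fetch via lookup */
}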

BUG: 1410375
Change-Id: Id89d2283478e70b62b435a8891fffc86d2be8cb2
Signed-off-by: Poornima G &lt;pgurusid@redhat.com&gt;
Reviewed-on: http://review.gluster.org/16329
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Xavier Hernandez &lt;xhernandez@datalab.es&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>afr: use accused matrix instead of readable matrix for deciding heals</title>
<updated>2016-12-27T06:34:02+00:00</updated>
<author>
<name>Ravishankar N</name>
<email>ravishankar@redhat.com</email>
</author>
<published>2016-12-23T07:11:13+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=5a7c86e578f5bbd793126a035c30e6b052177a9f'/>
<id>5a7c86e578f5bbd793126a035c30e6b052177a9f</id>
<content type='text'>
Problem:
afr_replies_interpret() used the 'readable' matrix to trigger client
side heals after inode refresh. But for arbiter, readable is always
zero. So when `dd` is run with a data brick down, spurious data heals
are triggered. These heals open an fd, causing eager lock to be
disabled (open fd count &gt;1) in afr transactions, leading to extra
FXATTROPS.

Fix:
Use the accused matrix (derived from interpreting the afr pending
xattrs) to decide whether we can start heal or not.

Change-Id: Ibbd56c9aed6026de6ec42422e60293702aaf55f9
BUG: 1408395
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
Reviewed-on: http://review.gluster.org/16277
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
Tested-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Problem:
afr_replies_interpret() used the 'readable' matrix to trigger client
side heals after inode refresh. But for arbiter, readable is always
zero. So when `dd` is run with a data brick down, spurious data heals
are triggered. These heals open an fd, causing eager lock to be
disabled (open fd count &gt;1) in afr transactions, leading to extra
FXATTROPS.

Fix:
Use the accused matrix (derived from interpreting the afr pending
xattrs) to decide whether we can start heal or not.
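
For illustration, a sketch keyed off the accused matrix (names are
assumptions, not the actual afr_replies_interpret() code):

#include &lt;stdbool.h&gt;

/* Launch a client-side heal only if some brick is blamed by the afr
 * pending xattrs; 'readable' is not consulted, since it is always zero
 * for the arbiter and caused spurious heals. */
static bool
need_heal_from_accused(const bool *accused, int child_count)
{
    for (int i = 0; i &lt; child_count; i++) {
        if (accused[i])
            return true;
    }
    return false;
}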

Change-Id: Ibbd56c9aed6026de6ec42422e60293702aaf55f9
BUG: 1408395
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
Reviewed-on: http://review.gluster.org/16277
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
Tested-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
