glusterfs.git/libglusterfs/src/glusterfs.h, branch v3.10.7

mount/fuse: Make event-history feature configurable

2017-10-02T12:41:57+00:00

... and disable it by default.

        Backport of:
        > Change-Id: Ia533788d309c78688a315dc8cd04d30fad9e9485
        > Reviewed-on: https://review.gluster.org/18242
        > BUG: 1467614
        > cherry-picked from commit 956d43d6e89d40ee683547003b876f1f456f03b6

This is because having it disabled seems to improve performance.
This could be due to the lock contention by the different epoll threads
on the circular buff lock in the fop cbks just before writing their response
to /dev/fuse.

Just to provide some data - wrt ovirt-gluster hyperconverged
environment, I saw an increase in IOPs by 12K with event-history
disabled for randrom read workload.

Usage:
mount -t glusterfs -o event-history=on $HOSTNAME:$VOLNAME $MOUNTPOINT
OR
glusterfs --event-history=on --volfile-server=$HOSTNAME --volfile-id=$VOLNAME $MOUNTPOINT

Change-Id: Ia533788d309c78688a315dc8cd04d30fad9e9485
BUG: 1495430
Signed-off-by: Krutika Dhananjay

cluster/ec: Node uuid xattr support update for EC

2017-09-13T06:41:52+00:00

Problem:
The change in EC to return list of node uuids for
GF_XATTR_NODE_UUID_KEY was causing problems with
geo-rep.

Fix:
This patch will allow to get the single node uuid
as it was doing before with the key
"GF_XATTR_NODE_UUID_KEY", and will also allow to get
the list of node uuids by using a new key
"GF_XATTR_LIST_NODE_UUIDS_KEY". This will solve
the problem with geo-rep and any other features which
were depending on this.

>BUG: 1462790
>Change-Id: I2d9214a9658d4a41a3d6de08600884d2bda5f3eb
>Signed-off-by: Sunil Kumar Acharya 
>Reviewed-on: https://review.gluster.org/17594
>Smoke: Gluster Build System 
>Reviewed-by: Xavier Hernandez 
>Reviewed-by: Pranith Kumar Karampuri 
>CentOS-regression: Gluster Build System 
>Signed-off-by: Sunil Kumar Acharya 

BUG: 1487647
Change-Id: I2d9214a9658d4a41a3d6de08600884d2bda5f3eb
Signed-off-by: Sunil Kumar Acharya 
Reviewed-on: https://review.gluster.org/17667
Smoke: Gluster Build System 
Reviewed-by: Xavier Hernandez 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

cluster/afr: Returning single and list of node uuids from AFR

2017-07-01T15:55:31+00:00

Problem:
The change in afr to return list of node uuids was causing problems
with geo-rep.

Fix:
This patch will allow to get the single node uuid as it was doing
before with the key "GF_XATTR_NODE_UUID_KEY", and will also allow
to get the list of node uuids by using a new key
"GF_XATTR_LIST_NODE_UUIDS_KEY". This will solve the problem with
geo-rep and any other feature which were depending on this.

> Change-Id: I09885dac6dfca127be94b708470c8c2941356f9a
> BUG: 1462790
> Signed-off-by: karthik-us 
> Reviewed-on: https://review.gluster.org/17576
> Smoke: Gluster Build System 
> Reviewed-by: Ravishankar N 
> Reviewed-by: Kotresh HR 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Jeff Darcy 

(cherry picked from commit 475ec9928ef96b63a0bfa859a9ae68709275033c)

Change-Id: I5e741a48a426ee9a3cc69612051e0e9bcf33b500
BUG: 1464078
Signed-off-by: karthik-us 
Reviewed-on: https://review.gluster.org/17603
Reviewed-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System

glusterfsd: send PARENT_UP on brick attach

2017-05-31T05:59:07+00:00

With brick multiplexing being enabled, if a brick is instance attached to a
process then a PARENT_UP event is needed so that it reaches right till
posix layer and then from posix CHILD_UP event is sent back to all the
children.

>Reviewed-on: https://review.gluster.org/17225
>NetBSD-regression: NetBSD Build System 
>Smoke: Gluster Build System 
>CentOS-regression: Gluster Build System 
>Reviewed-by: Jeff Darcy 
>(cherry picked from commit 86ad032949cb80b6ba3df9dc8268243529d4eb84)

Change-Id: Ic341086adb3bbbde0342af518e1b273dd2f669b9
BUG: 1450728
Signed-off-by: Atin Mukherjee 
Reviewed-on: https://review.gluster.org/17288
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra Talur

core: run many bricks within one glusterfsd process

2017-02-02T00:54:58+00:00

This patch adds support for multiple brick translator stacks running in
a single brick server process.  This reduces our per-brick memory usage
by approximately 3x, and our appetite for TCP ports even more.  It also
creates potential to avoid process/thread thrashing, and to improve QoS
by scheduling more carefully across the bricks, but realizing that
potential will require further work.

Multiplexing is controlled by the "cluster.brick-multiplex" global
option.  By default it's off, and bricks are started in separate
processes as before.  If multiplexing is enabled, then *compatible*
bricks (mostly those with the same transport options) will be started in
the same process.

Backport of:
> Change-Id: I45059454e51d6f4cbb29a4953359c09a408695cb
> BUG: 1385758
> Reviewed-on: https://review.gluster.org/14763

Change-Id: I4bce9080f6c93d50171823298fdf920258317ee8
BUG: 1418091
Signed-off-by: Jeff Darcy 
Reviewed-on: https://review.gluster.org/16496
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

tier : Tier as a service

2017-01-17T04:49:47+00:00

tierd is implemented by separating from rebalance process.

The commands affected:

1) Attach tier will trigger this process instead of old one
2) tier start and tier start force will also trigger this process.
3) volume status [tier] will show tier daemon as a process instead
of task and normal tier status and tier detach status works.
4) tier stop implemented.
5) detach tier implemented separately along with new detach tier
status
6) volume tier volname status will work using the changes.
7) volume set works

This patch has separated the tier translator from the legacy
DHT rebalance code. It now sends the RPCs from the CLI
to glusterd separate to the DHT rebalance code.
The daemon is now a service, similar to the snapshot daemon,
and can be viewed using the volume status command.

The code for the validation and commit phase are the same
as the earlier tier validation code in DHT rebalance.

The “brickop” phase has been changed so that the status
command can use this framework.

The service management framework is now used.
DHT rebalance does not use this framework.

This service framework takes care of :

*) spawning the daemon, killing it and other such processes.
*) volume set options , which are written on the volfile.
*) restart and reconfigure functions. Restart is to restart
the daemon at two points
        1)after gluster goes down and comes up.
        2) to stop detach tier.
*) reconfigure is used to make immediate volfile changes.
By doing this, we don’t restart the daemon.
it has the code to rewrite the volfile for topological
changes too (which comes into place during add and remove brick).

With this patch the log, pid, and volfile are separated
and put into respective directories.

Change-Id: I3681d0d66894714b55aa02ca2a30ac000362a399
BUG: 1313838
Signed-off-by: hari gowtham 
Reviewed-on: http://review.gluster.org/13365
Smoke: Gluster Build System 
Tested-by: hari gowtham 
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Dan Lambright 
Reviewed-by: Atin Mukherjee

ec: Invalidations in disperse volume should not update the stat

2017-01-06T05:12:18+00:00

Issue:
In disperse volume, the file is present across bricks, hence the stat
from one brick doesn't carry the valid size of the file. Therefore
the upcall from one brick updating the md-cache results in wrong size
being updated.

Fix:
If the notification is cache invalidation then, indicate md-cache that
the attributes is invalid.

BUG: 1410375
Change-Id: Id89d2283478e70b62b435a8891fffc86d2be8cb2
Signed-off-by: Poornima G 
Reviewed-on: http://review.gluster.org/16329
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Xavier Hernandez 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

md-cache, afr: Reduce the window of stale read

2016-10-20T07:07:55+00:00

Problem:
Consider a replica setup, where one mount writes data to a
file and the other mount reads the file. In afr, read operations
are not transaction based, a brick(read subvolume) is chosen as
a part of lookup or other operations, read is always wound only
to the read subvolume, even if there was write from a different client
that failed on this brick. This stale read continues until there is
a lookup or any write operation from the mount point. Currently, this
is not a major issue, as a lookup is issued before every read and it will
switch the read subvolume to a correct one. But with the plan of
increasing md-cache timeout to 600s, the stale read problem will be
more pronounced, i.e. stale read can continue for 600s(or more if cascaded
with readdirp), as there will be no lookups.

Solution:
Afr doesn't have any built-in solution for stale read(without affecting
the performance). The solution that came up, was to use upcall. When a file
on any brick is marked bad for the first time, upcall sends a notification
to all the clients that had recently accessed the file. The solution has
2 parts:
- Identifying when a file is marked bad, on any of the bricks,
  for the first time
- Client side actions on recieving the notifications

Identifying when a file is marked bad on any of the bricks for the first time:
-----------------------------------------------------------------------------
The idea is to track xattrop in upcall. xattrop currently comes with 2 afr
xattrs - afr dirty bit and afr pending xattrs.
   Dirty xattr is set to 1 before every write, and is unset if write succeeds.
In certain scenarios, dirty xattr can be 0 and still the file could be bad
copy. Hence do not track dirty xattr.
   Pending xattr is set on the good copy, indicating the other bricks that have
bad copy. It is still not as simple as, notifying when any of the pending xattrs
change. It could lead to flood of notifcations, in case the other brick is
completely down or consistantly failing. Hence it is important to notify only
once, the first time a good copy is marked bad.

Client side actions on recieving pending xattr change, notification:
--------------------------------------------------------------------
md-cache will invalidate the cache of that file, so that further lookup is
passed down to afr and hence update the read subvolume. Invalidating only in
md-cache is not enough, consider the folling oder of opertaions:
- pending xattr invalidation - invalidate md-cache
- readdirp on the bad read subvolume - fill md-cache
- lookup (served from md-cache)
- read - wound to the old read subvol.
Hence, along with invalidating md-cache, it is very important to reset the
read subvolume for that file, in afr.

Design Credit: Anuradha Talur, Ravishankar N

1. xattrop doesn't carry info saying post op/pre op.
2. Pre xattrop will have 0 value for all pending xattrs,
   the cbk of pre xattrop carries the on-disk xattr value.
   Non zero indicated healing is required.
3. Post xattrop will have non zero value for any of the
   pending xattrs, if the fop failed on any of the bricks.

Change-Id: I469cbc111714c433984fe1c922be2ef113c25804
BUG: 1211863
Signed-off-by: Poornima G 
Reviewed-on: http://review.gluster.org/15398
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

glusterfsd/main: fix OOM adjustment for older kernels

2016-10-11T12:18:05+00:00

Milind Changire reported that GlusterFS fails to build on RHEL5
because linux/oom.h is unavailable.

Milind's initial patch disables OOM adjustment completely
for those environments that do not have this header. However,
I'd take another approach that:

1) checks for linux/oom.h in compile-time and defines necessary
constants if the header is not present;
2) checks for available OOM API in /proc in run-time and uses it
accordingly.

This allows OOM to be adjusted properly on RHEL5 (the kernel is pretty new
to present /proc API for that) as well as RHEL6 (the kernel has many thing
backported including new /proc API).

Change-Id: I1bc610586872d208430575c149a7d0c54bd82370
BUG: 1379769
Signed-off-by: Oleksandr Natalenko 
Reviewed-on: http://review.gluster.org/15587
Tested-by: Oleksandr Natalenko 
Reviewed-by: Niels de Vos 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

dht, md-cache, upcall: Add invalidation of IATT when the layout changes

2016-08-31T06:08:54+00:00

Issue:
dht_layout is built as a part of lookup only. The layout can be
modified by rebalance process. Since every IO fop is preceded
by a lookup, there are very less issues of stale layout. But
with enhancements of aggressive caching of stats in md-cache,
the lookup will reduce and expose the stale layout issue often.

Solution:
Since stale layout is already an issue on dht, there is already
a plan to fix this at the dht layer, but this fix is not currently
planned for any release. Until this fix comes out, we can have
a workaround where, the upcall will send a notification to md-cache
when a layout xattr is changed. As a part of layout change notification
the existing cache is invalidated and the next lookup will fetch the
latest layout.

This is not a foolproof solution as the window between the layout change
and the next lookup(after invalidation of stat), where there will be stale
layout. But until the final fix comes in, this reduces the stale layout
window.

Change-Id: Iacf871a38b35880c1fc0bc68fe7ce291265e71d4
BUG: 1369638
Signed-off-by: Poornima G 
Reviewed-on: http://review.gluster.org/15300
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System 
Reviewed-by: Raghavendra G