<feed xmlns='http://www.w3.org/2005/Atom'>
<title>glusterfs.git/libglusterfs/src/glusterfs/glusterfs.h, branch v7.1</title>
<subtitle></subtitle>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/'/>
<entry>
<title>afr: make heal info lockless</title>
<updated>2019-12-16T05:38:25+00:00</updated>
<author>
<name>Ravishankar N</name>
<email>ravishankar@redhat.com</email>
</author>
<published>2019-11-07T09:48:30+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=ac108947b0f25293c154707b70ea01eb3774f542'/>
<id>ac108947b0f25293c154707b70ea01eb3774f542</id>
<content type='text'>
Changes in locks xlator:
Added support for per-domain inodelk count requests.
The caller needs to set the GLUSTERFS_MULTIPLE_DOM_LK_CNT_REQUESTS key in
the dict and then set one key per domain, named
'GLUSTERFS_INODELK_DOM_PREFIX:&lt;domain name&gt;'.
In the response dict, the xlator will send the per-domain count as the
value for each of these keys.
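
A minimal caller-side sketch of the request (assuming the standard dict_t
helpers from libglusterfs; the domain name and key-buffer handling are
illustrative, not the exact code from the patch):

    dict_t *req = dict_new();
    char key[256];

    /* ask the locks xlator for per-domain inodelk counts */
    dict_set_uint32(req, GLUSTERFS_MULTIPLE_DOM_LK_CNT_REQUESTS, 1);

    /* one key per domain of interest */
    snprintf(key, sizeof(key), "%s:%s", GLUSTERFS_INODELK_DOM_PREFIX,
             "my-heal-domain");
    dict_set_uint32(req, key, 1);
    /* the response dict carries the count as the value of this same key */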

Changes in AFR:
Replaced afr_selfheal_locked_inspect() with afr_lockless_inspect(). Logic has
been added to make the latter behave the same as the former, so the current
heal info output behaviour is not broken.

fixes: bz#1783858
Change-Id: Ie9e83c162aa77f44a39c2ba7115de558120ada4d
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
(cherry picked from commit d7e049160a9dea988ded5816491c2234d40ab6b3)
</content>
</entry>
<entry>
<title>ctime: Set mdata xattr on legacy files</title>
<updated>2019-08-19T11:27:14+00:00</updated>
<author>
<name>Kotresh HR</name>
<email>khiremat@redhat.com</email>
</author>
<published>2019-06-24T07:36:49+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=8d2aebf93baed6f8555cd02545d6f95da59cc7f3'/>
<id>8d2aebf93baed6f8555cd02545d6f95da59cc7f3</id>
<content type='text'>
Problem:
Files created before ctime was enabled do not have the
"trusted.glusterfs.mdata" xattr (which stores the time attributes).
Upon a fop that modifies either ctime or mtime, the xattr
gets created with the latest ctime, mtime and atime, which is
incorrect. It should update only the corresponding time
attribute and take the rest from the backend.

Solution:
Creating the xattr with values from the brick is not possible, as
each brick of a replica set would have different times.
So create the xattr upon a successful lookup if it does not
already exist.

Note To Reviewers:
The time attributes used to set the xattr are taken from a successful
lookup. Instead of sending the whole iatt over the wire via
setxattr, a structure called mdata_iatt is sent. The mdata_iatt
contains only the time attributes.
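
As a rough sketch, a time-only structure of this kind could look like the
following (field names and layout are illustrative and may not match the
exact definition in glusterfs.h):

    #include &lt;stdint.h&gt;

    typedef struct mdata_iatt {
        int64_t ia_atime;        /* last access time */
        int64_t ia_mtime;        /* last modification time */
        int64_t ia_ctime;        /* last status change time */
        uint32_t ia_atime_nsec;
        uint32_t ia_mtime_nsec;
        uint32_t ia_ctime_nsec;
    } mdata_iatt_t;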

Backport of:
 &gt; Patch:  https://review.gluster.org/22936
 &gt; Change-Id: I5e535631ddef04195361ae0364336410a2895dd4
 &gt; BUG: 1593542
 &gt; Signed-off-by: Kotresh HR &lt;khiremat@redhat.com&gt;

Change-Id: I5e535631ddef04195361ae0364336410a2895dd4
updates: bz#1739430
Signed-off-by: Kotresh HR &lt;khiremat@redhat.com&gt;
</content>
</entry>
<entry>
<title>glusterd/svc: update pid of mux volumes from the shd process</title>
<updated>2019-07-24T10:29:17+00:00</updated>
<author>
<name>Mohammed Rafi KC</name>
<email>rkavunga@redhat.com</email>
</author>
<published>2019-06-24T06:30:20+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=47fcbc4c055a7880d2926e918ae1e1f57c7db20d'/>
<id>47fcbc4c055a7880d2926e918ae1e1f57c7db20d</id>
<content type='text'>
For a normal volume, we update the pid from the process itself,
either while daemonizing or at the end of init if it is running in
no-daemon mode. Along with updating the pid we also lock the
pidfile, to make sure that the process is running fine.

With brick mux, we were updating the pidfile from glusterd
after an attach/detach request.

There are two problems with this approach:
1) We are not holding a pidlock for any file other than the parent
   process's pidfile.
2) There is a chance of race conditions with attach/detach.
   For example, an shd start and a volume stop could race. Say
   we are starting an shd and it is attached to a volume.
   While we are trying to link the pidfile to the running process,
   it could already have been deleted by the thread doing the volume stop.
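
The fix, per the subject line, is for the shd process itself to record and
lock the pidfile. A rough illustration of that model using plain POSIX
calls (a hypothetical helper, not the actual glusterd/shd code):

    #include &lt;fcntl.h&gt;
    #include &lt;stdio.h&gt;
    #include &lt;unistd.h&gt;

    /* the daemon writes and locks its own pidfile; the held lock is the
       proof that the process is still alive */
    static int
    write_and_lock_pidfile(const char *path)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd &lt; 0)
            return -1;
        if (lockf(fd, F_TLOCK, 0) &lt; 0) {
            close(fd);               /* another instance holds the lock */
            return -1;
        }
        ftruncate(fd, 0);
        dprintf(fd, "%d\n", getpid());
        return fd;                   /* keep fd open to keep the lock */
    }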

Backport of : https://review.gluster.org/#/c/glusterfs/+/22935/
&gt;Change-Id: I29a00352102877ce09ea3f376ca52affceb5cf1a
&gt;Updates: bz#1722541
&gt;Signed-off-by: Mohammed Rafi KC &lt;rkavunga@redhat.com&gt;

Change-Id: I29a00352102877ce09ea3f376ca52affceb5cf1a
Updates: bz#1732668
Signed-off-by: Mohammed Rafi KC &lt;rkavunga@redhat.com&gt;
</content>
</entry>
<entry>
<title>features/shard: Fix block-count accounting upon truncate to lower size</title>
<updated>2019-06-04T07:30:12+00:00</updated>
<author>
<name>Krutika Dhananjay</name>
<email>kdhananj@redhat.com</email>
</author>
<published>2019-05-08T07:30:51+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=400b66d568ad18fefcb59949d1f8368d487b9a80'/>
<id>400b66d568ad18fefcb59949d1f8368d487b9a80</id>
<content type='text'>
The way delta_blocks is computed in shard is incorrect when a file
is truncated to a lower size. The accounting only considers the change
in size of the last of the truncated shards.

FIX:

Get the block-count of each shard just before it is unlinked at posix,
via xdata. The sum of these, plus the change in size of the last shard
(from the actual truncate), is used to compute delta_blocks, which is
used in the xattrop for the size update.
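
A rough sketch of that accounting (a hypothetical helper illustrating the
idea, not the shard xlator code):

    #include &lt;stdint.h&gt;

    static int64_t
    compute_delta_blocks(const int64_t *shard_blocks, int first_unlinked,
                         int last_shard, int64_t last_kept_pre_blocks,
                         int64_t last_kept_post_blocks)
    {
        /* blocks of the shards removed entirely, as reported by posix
           via xdata just before each unlink */
        int64_t unlinked_blocks = 0;
        for (int i = first_unlinked; i &lt;= last_shard; i++)
            unlinked_blocks += shard_blocks[i];

        /* the shard that now holds EOF is truncated, not unlinked */
        int64_t last_shard_delta = last_kept_post_blocks - last_kept_pre_blocks;

        /* net change applied through the xattrop for the size update */
        return last_shard_delta - unlinked_blocks;
    }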

Change-Id: I9128a192e9bf8c3c3a959e96b7400879d03d7c53
fixes: bz#1705884
Signed-off-by: Krutika Dhananjay &lt;kdhananj@redhat.com&gt;
</content>
</entry>
<entry>
<title>core: Log level changes do not take effect on a running client process</title>
<updated>2019-04-15T04:30:43+00:00</updated>
<author>
<name>Mohit Agrawal</name>
<email>moagrawal@redhat.com</email>
</author>
<published>2019-04-04T04:26:11+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=798aadbe51a9a02dd98a0f861cc239ecf7c8ed57'/>
<id>798aadbe51a9a02dd98a0f861cc239ecf7c8ed57</id>
<content type='text'>
Problem: commit c34e4161f3cb6539ec83a9020f3d27eb4759a975 set the log-level
         per xlator during reconfigure only for a brick process, not for
         the client process.

Solution: Change the per-xlator log-level only if brick_mux is enabled. To
          detect brick multiplexing, introduce a brick_mux flag in
          ctx-&gt;cmd_args.
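
A hedged sketch of that guard (surrounding code and field usage are
illustrative; the actual reconfigure path differs):

    /* apply the new log-level per xlator only under brick mux,
       otherwise fall back to the process-wide level */
    if (ctx-&gt;cmd_args.brick_mux)
        this-&gt;loglevel = new_loglevel;
    else
        ctx-&gt;log.loglevel = new_loglevel;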

Note: There are two other changes done with this patch:
      1) Ignore the client-log-level option when attaching a brick to an
         already running brick process if brick_mux is enabled.
      2) Add a log message that prints the pid of the running process, to
         make debugging easier.

Change-Id: I39e85de778e150d0685cd9a79425ce8b4783f9c9
Signed-off-by: Mohit Agrawal &lt;moagrawal@redhat.com&gt;
Fixes: bz#1696046
</content>
</entry>
<entry>
<title>graph.c: remove extra gettimeofday() - reuse the graph dob.</title>
<updated>2019-04-15T02:25:26+00:00</updated>
<author>
<name>Yaniv Kaul</name>
<email>ykaul@redhat.com</email>
</author>
<published>2019-03-25T17:07:16+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=a67929e18a39d5c79a4a7e2b78015523626a367e'/>
<id>a67929e18a39d5c79a4a7e2b78015523626a367e</id>
<content type='text'>
It was written just before the fill_void() call.

Note that there was a possible overflow if the hostname was too long
(unrelated to this patch), but it is now also fixed, as we use a smaller
buffer for the hostname. This, in turn, forces us to check whether
gethostname() failed and to explicitly add the terminating null to it.
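
A small sketch of that pattern (the buffer size and fallback are
illustrative, not the exact code in graph.c):

    #include &lt;string.h&gt;
    #include &lt;unistd.h&gt;

    char hostname[256] = {0};

    if (gethostname(hostname, sizeof(hostname)) != 0)
        strncpy(hostname, "localhost", sizeof(hostname) - 1);
    /* gethostname() may not null-terminate on truncation */
    hostname[sizeof(hostname) - 1] = '\0';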

Change-Id: I45fbc0a8e105f1247f3cbf61befac06fabbaea06
updates: bz#1193929
Signed-off-by: Yaniv Kaul &lt;ykaul@redhat.com&gt;
</content>
</entry>
<entry>
<title>libglusterfs: define macros needed for cloudsync</title>
<updated>2019-04-04T19:43:40+00:00</updated>
<author>
<name>Anuradha Talur</name>
<email>atalur@commvault.com</email>
</author>
<published>2018-10-25T21:23:10+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=1a59cf8c6354ad970edc96f7b74a834672ab2df6'/>
<id>1a59cf8c6354ad970edc96f7b74a834672ab2df6</id>
<content type='text'>
Change-Id: Iec5ce7f17fbf899f881a58cd20c4c967e3b71668
fixes: bz#1642168
Signed-off-by: Anuradha Talur &lt;atalur@commvault.com&gt;
</content>
</entry>
<entry>
<title>mgmt/shd: Implement multiplexing in self heal daemon</title>
<updated>2019-04-01T03:44:23+00:00</updated>
<author>
<name>Mohammed Rafi KC</name>
<email>rkavunga@redhat.com</email>
</author>
<published>2019-02-25T04:35:32+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=bc3694d7cfc868a2ed6344ea123faf19fce28d13'/>
<id>bc3694d7cfc868a2ed6344ea123faf19fce28d13</id>
<content type='text'>
Problem:

The shd daemon is per node, which means it creates a graph
with all volumes on it. While this is great for utilizing
resources, it is not so good in terms of performance and manageability.

Self-heal daemons don't have the capability to automatically
reconfigure their graphs, so each time any configuration change
happens to a volume (replicate/disperse), we need to restart
shd to bring the change into the graph.

Because of this, all ongoing heals for all other volumes have to be
stopped in the middle and restarted all over again.

Solution:

This change makes shd a per-volume daemon, so that a graph
is generated for each volume.

When we want to start/reconfigure shd for a volume, we first search
for an existing shd running on the node; if there is none, we
start a new process. If an shd daemon is already running, we
simply detach the volume's graph and reattach the updated
graph for that volume. This won't touch any of the ongoing operations
for any other volumes on the shd daemon.

Example of an shd graph when it is per volume

                           graph
                     -----------------------
                     |     debug-iostat    |
                     -----------------------
                    /         |             \
                   /          |              \
              ---------    ---------      ----------
              | AFR-1 |    | AFR-2 |      |  AFR-3 |
               ---------    ---------      ----------

A running shd daemon with 3 volumes will look like this:

                           graph
                     -----------------------
                     |     debug-iostat    |
                     -----------------------
                    /           |           \
                   /            |            \
              ------------   ------------  ------------
              | volume-1 |   | volume-2 |  | volume-3 |
              ------------   ------------  ------------

Change-Id: Idcb2698be3eeb95beaac47125565c93370afbd99
fixes: bz#1659708
Signed-off-by: Mohammed Rafi KC &lt;rkavunga@redhat.com&gt;
</content>
</entry>
<entry>
<title>core: implement a global thread pool</title>
<updated>2019-02-18T02:58:24+00:00</updated>
<author>
<name>Xavi Hernandez</name>
<email>xhernandez@redhat.com</email>
</author>
<published>2019-01-24T17:44:06+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=dddcf52020004d98f688ebef968de51d76cbf9a6'/>
<id>dddcf52020004d98f688ebef968de51d76cbf9a6</id>
<content type='text'>
This patch implements a thread pool that is wait-free for adding jobs to
the queue and uses a very small locked region to get jobs. This makes it
possible to decrease contention drastically. It's based on the wfcqueue
structure provided by the urcu library.

It automatically enables more threads when load demands it, and stops
them when not needed. There's a maximum number of threads that can be
used. This value can be configured.

Depending on the workload, the maximum number of threads plays an
important role, so it needs to be configured for optimal performance.
Currently the thread pool doesn't self-adjust the maximum to the
workload, so this configuration needs to be changed manually.

For this reason, the global thread pool has been made optional, so that
volumes can still use the thread pool provided by io-threads.

To enable it for bricks, the following option needs to be set:

   config.global-threading = on

This option has no effect if bricks are already running. A restart is
required to activate it. It's recommended to also enable the following
option when running bricks with the global thread pool:

   performance.iot-pass-through = on

To enable it for a FUSE mount point, the option '--global-threading'
must be added to the mount command. To change it, an umount and remount
is needed. It's recommended to disable the following option when using
global threading on a mount point:

   performance.client-io-threads = off

To enable it for services managed by glusterd, glusterd needs to be
started with option '--global-threading'. In this case all daemons, like
self-heal, will be using the global thread pool.

Currently it can only be enabled for bricks, FUSE mounts and glusterd
services.
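
A hedged usage sketch tying the above together (the volume name, server
name and mount path are illustrative; the option names are the ones
described in this message):

   # bricks: set the volume options, then restart the bricks
   gluster volume set myvol config.global-threading on
   gluster volume set myvol performance.iot-pass-through on

   # FUSE client using the global thread pool
   glusterfs --global-threading --volfile-server=server1 \
             --volfile-id=myvol /mnt/myvol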

The maximum number of threads for clients and bricks can be configured
using the following options:

   config.client-threads
   config.brick-threads

These options can be applied online and their effect is immediate most of
the time. If one of them is set to 0, the maximum number of threads
will be calculated as #cores * 2.

Some distributions ship a very old userspace-rcu library (version 0.7).
For this reason, some header files from version 0.10 have been copied
into contrib/userspace-rcu and are used if the detected version is 0.7
or older.

An additional change has been made to io-threads to prevent threads from
being started when iot-pass-through is set.

Change-Id: I09d19e246b9e6d53c6247b29dfca6af6ee00a24b
updates: #532
Signed-off-by: Xavi Hernandez &lt;xhernandez@redhat.com&gt;
</content>
</entry>
<entry>
<title>mount/fuse: expose auto-invalidation as a mount option</title>
<updated>2019-02-02T03:07:35+00:00</updated>
<author>
<name>Raghavendra Gowdappa</name>
<email>rgowdapp@redhat.com</email>
</author>
<published>2019-01-29T02:35:07+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=a229ee1c8cdf8e0ac1abaeb60cabe6ab08f60546'/>
<id>a229ee1c8cdf8e0ac1abaeb60cabe6ab08f60546</id>
<content type='text'>
Auto-invalidation is necessary when the same (meta)data is shared/accessed
across multiple mounts. However, if (meta)data is not shared, all
relevant I/O goes through the cache of a single mount and hence is
always coherent with the (meta)data on the bricks. So fuse-auto-invalidation
can be disabled for this case, which gives a huge performance boost for
workloads that write data and then immediately read the data they just
wrote.

From glusterfs --help,

&lt;snip&gt;
      --auto-invalidation[=BOOL]   controls whether fuse-kernel can
                             auto-invalidate attribute, dentry and page-cache.
                             Disable this only if same files/directories are
                             not accessed across two different mounts
                             concurrently [default: "on"]
&lt;/snip&gt;
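
A hedged usage sketch (server name, volume name and mount path are
illustrative; the boolean value follows the help text above):

   # single-client workload: disable FUSE auto-invalidation on the mount
   glusterfs --auto-invalidation=no --volfile-server=server1 \
             --volfile-id=myvol /mnt/myvol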

Details on how disabling auto-invalidation helped reduce pgbench
init times can be found at [1]. The time taken for a pgbench init of scale
8000 was 8340s, an improvement of 86% (59280s vs 8340s)
with auto-invalidation turned off along with other
optimizations. Disabling auto-invalidation alone contributed a 56%
improvement, reducing the total time taken by 33260s.

[1] https://www.spinics.net/lists/gluster-devel/msg25907.html

Change-Id: I0ed730dba9064bd9c576ad1800170a21e100e1ce
Signed-off-by: Raghavendra Gowdappa &lt;rgowdapp@redhat.com&gt;
updates: bz#1664934
</content>
</entry>
</feed>
