glusterfs.git/libglusterfs, branch v3.11.0rc1

glusterfsd: send PARENT_UP on brick attach

2017-05-16T00:32:25+00:00

With brick multiplexing being enabled, if a brick is instance attached to a
process then a PARENT_UP event is needed so that it reaches right till
posix layer and then from posix CHILD_UP event is sent back to all the
children.

>Reviewed-on: https://review.gluster.org/17225
>NetBSD-regression: NetBSD Build System 
>Smoke: Gluster Build System 
>CentOS-regression: Gluster Build System 
>Reviewed-by: Jeff Darcy 
>(cherry picked from commit 86ad032949cb80b6ba3df9dc8268243529d4eb84)

Change-Id: Ic341086adb3bbbde0342af518e1b273dd2f669b9
BUG: 1450729
Signed-off-by: Atin Mukherjee 
Reviewed-on: https://review.gluster.org/17289
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

core: make the per glusterfs_ctx_t timer-wheel refcounted

2017-05-12T13:32:32+00:00

xlators can use a 'global' timer-wheel for scheduling events. This
timer-wheel is managed per glusterfs_ctx_t, but does not need to be
allocated for every graph. When an xlator wants to use the timer-wheel,
it will be instanciated on demand, and provided to xlators that request
it later on.

By adding a reference counter to the glusterfs_ctx_t for the
timer-wheel, the threads and structures can be cleaned up when the last
xlator does not have a need for it anymore. In general, the xlators
request the timer-wheel in init(), and they should return it in fini().

Because the timer-wheel is managed per glusterfs_ctx_t, the functions
can be added to ctx.c and do not need to live in their very minimal
tw.[ch] files.


>Reported-by: Poornima G 
>Signed-off-by: Niels de Vos 
>Reviewed-on: https://review.gluster.org/17068
>NetBSD-regression: NetBSD Build System 
>CentOS-regression: Gluster Build System 
>Smoke: Gluster Build System 
>Reviewed-by: Amar Tumballi 
>Reviewed-by: Zhou Zhengping 
>Reviewed-by: Kaleb KEITHLEY 
>(cherry picked from commit 73fcf3a874b2049da31d01b8363d1ac85c9488c2)

Change-Id: I19d225b39aaa272d9005ba7adc3104c3764f1572
BUG: 1450267
Reviewed-on: https://review.gluster.org/17262
Tested-by: Poornima G 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Niels de Vos

glusterd: socketfile & pidfile related fixes for brick multiplexing feature

2017-05-10T14:05:52+00:00

Problem: While brick-muliplexing is on after restarting glusterd, CLI is
         not showing pid of all brick processes in all volumes.

Solution: While brick-mux is on all local brick process communicated through one
          UNIX socket but as per current code (glusterd_brick_start) it is trying
          to communicate with separate UNIX socket for each volume which is populated
          based on brick-name and vol-name.Because of multiplexing design only one
          UNIX socket is opened so it is throwing poller error and not able to
          fetch correct status of brick process through cli process.
          To resolve the problem write a new function glusterd_set_socket_filepath_for_mux
          that will call by glusterd_brick_start to validate about the existence of socketpath.
          To avoid the continuous EPOLLERR erros in  logs update socket_connect code.

Test:     To reproduce the issue followed below steps
          1) Create two distributed volumes(dist1 and dist2)
          2) Set cluster.brick-multiplex is on
          3) kill glusterd
          4) run command gluster v status
          After apply the patch it shows correct pid for all volumes

> BUG: 1444596
> Change-Id: I5d10af69dea0d0ca19511f43870f34295a54a4d2
> Signed-off-by: Mohit Agrawal 
> Reviewed-on: https://review.gluster.org/17101
> Smoke: Gluster Build System 
> Reviewed-by: Prashanth Pai 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Atin Mukherjee 
> (cherry picked from commit 21c7f7baccfaf644805e63682e5a7d2a9864a1e6)

Change-Id: Ia95b9d36e50566b293a8d6350f8316dafc27033b
BUG: 1449004
Signed-off-by: Mohit Agrawal 
Reviewed-on: https://review.gluster.org/17212
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Atin Mukherjee 
Reviewed-by: Prashanth Pai 
CentOS-regression: Gluster Build System

Halo Replication feature for AFR translator

2017-05-08T05:37:07+00:00

	Backport of https://review.gluster.org/16177
		    https://review.gluster.org/17174

Merged both these patches to make sure IPV6 changes don't make it to 3.11 at all.

Summary:
Halo Geo-replication is a feature which allows Gluster or NFS clients to write
locally to their region (as defined by a latency "halo" or threshold if you
like), and have their writes asynchronously propagate from their origin to the
rest of the cluster.  Clients can also write synchronously to the cluster
simply by specifying a halo-latency which is very large (e.g. 10seconds) which
will include all bricks.

In other words, it allows clients to decide at mount time if they desire
synchronous or asynchronous IO into a cluster and the cluster can support both
of these modes to any number of clients simultaneously.

There are a few new volume options due to this feature:
  halo-shd-latency:  The threshold below which self-heal daemons will
  consider children (bricks) connected.

  halo-nfsd-latency: The threshold below which NFS daemons will consider
  children (bricks) connected.

  halo-latency: The threshold below which all other clients will
  consider children (bricks) connected.

  halo-min-replicas: The minimum number of replicas which are to
  be enforced regardless of latency specified in the above 3 options.
  If the number of children falls below this threshold the next
  best (chosen by latency) shall be swapped in.

New FUSE mount options:
  halo-latency & halo-min-replicas: As descripted above.

This feature combined with multi-threaded SHD support (D1271745) results in
some pretty cool geo-replication possibilities.

Operational Notes:
- Global consistency is gaurenteed for synchronous clients, this is provided by
  the existing entry-locking mechanism.
- Asynchronous clients on the other hand and merely consistent to their region.
  Writes & deletes will be protected via entry-locks as usual preventing
  concurrent writes into files which are undergoing replication.  Read operations
  on the other hand should never block.
- Writes are allowed from _any_ region and propagated from the origin to all
  other regions.  The take away from this is care should be taken to ensure
  multiple writers do not write the same files resulting in a gfid split-brain
  which will require resolution via split-brain policies (majority, mtime &
  size).  Recommended method for preventing this is using the nfs-auth feature to
  define which region for each share has RW permissions, tiers not in the origin
  region should have RO perms.

TODO:
- Synchronous clients (including the SHD) should choose clients from their own
  region as preferred sources for reads.  Most of the plumbing is in place for
  this via the child_latency array.
- Better GFID split brain handling & better dent type split brain handling
  (i.e. create a trash can and move the offending files into it).
- Tagging in addition to latency as a means of defining which children you wish
  to synchronously write to

Test Plan:
- The usual suspects, clang, gcc w/ address sanitizer & valgrind
- Prove tests

Reviewers: jackl, dph, cjh, meyering

Reviewed By: meyering

Subscribers: ethanr

Differential Revision: https://phabricator.fb.com/D1272053

Tasks: 4117827

 >Change-Id: I694a9ab429722da538da171ec528406e77b5e6d1
 >BUG: 1428061
 >Signed-off-by: Kevin Vigor 
 >Reviewed-on: http://review.gluster.org/16099
 >Reviewed-on: https://review.gluster.org/16177
 >Tested-by: Pranith Kumar Karampuri 
 >Smoke: Gluster Build System 
 >NetBSD-regression: NetBSD Build System 
 >CentOS-regression: Gluster Build System 
 >Reviewed-by: Pranith Kumar Karampuri 

BUG: 1448416
Change-Id: I694a9ab429722da538da171ec528406e77b5e6d1
Signed-off-by: Pranith Kumar K 
Reviewed-on: https://review.gluster.org/17192
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Kaushal M

SELinux : implementation of SELinux translator

2017-05-04T12:20:26+00:00

The patch implement a part of SELinux translator to support setting
SELinux contexts on files in a glusterfs volume.

URL: https://github.com/gluster/glusterfs-specs/blob/master/accepted/SELinux-client-support.md

Upstream reference :
>Change-Id: Id8916bd8e064ccf74ba86225ead95f86dc5a1a25
>BUG: 1318100
>Fixes : #55
>Signed-off-by: Manikandan Selvaganesh 
>Signed-off-by: Jiffin Tony Thottan 
>Signed-off-by: Niels de Vos 
>Reviewed-on: https://review.gluster.org/13762
>Smoke: Gluster Build System 
>NetBSD-regression: NetBSD Build System 
>CentOS-regression: Gluster Build System 
>Reviewed-by: Manikandan Selvaganesh 
>Reviewed-by: Atin Mukherjee 

Change-Id: Id8916bd8e064ccf74ba86225ead95f86dc5a1a25
BUG: 1447597
Signed-off-by: Jiffin Tony Thottan 
Reviewed-on: https://review.gluster.org/17159
Reviewed-by: Niels de Vos 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

libglusterfs: Correct op-version for 3.11

2017-05-03T13:02:43+00:00

Change-Id: Ida58626b3c25ed97f3b0b94b3fa5be58d607fede
BUG: 1447543
Signed-off-by: Kaushal M 
Reviewed-on: https://review.gluster.org/17155
Smoke: Gluster Build System 
Reviewed-by: Niels de Vos 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

dht: send lookup on old name inside rename with bname and pargfid

2017-04-29T14:28:38+00:00

Inside rename, a lookup is done on the source name to make sure that
the file is there. But we used to do a gfid based lookup and hence,
even if the source name was renamed to a new name from some other client,
lookup will be successful as server3_3_lookup will fetch the new path
based on the gfid.

So even if the source file does not exist any more rename will carry on,
and as server3_3_link(destination is hashed to a different brick other
than source cached scenario) also does gfid based resolve, it wont
detect that the source name does not exist and hardlink creation will be
successful (since gfid based resolve will get the new dentry).

To solve this problem, do a name based lookup inside rename. So that
rename will fail right away if the source does not exist.

Change-Id: Ieba8bdd6675088dbf18de90ed4622df043d163bd
BUG: 1412135
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/16375
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: N Balachandran 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G

cluster/dht: rebalance perf enhancement

2017-04-29T14:28:19+00:00

Problem: Throttle settings "normal" and "aggressive" for rebalance
did not have performance difference.

normal mode spawns $(no. of cores - 4)/2 threads and aggressive
spawns $(no. of cores - 4) threads. Though aggressive mode has twice
the number of threads compared to that of normal mode, there was no
performance gain when switched to aggressive mode from normal mode.

RCA:
During the course of debugging the above problem, we tried assigning
migration job to migration threads spawned by rebalance, rather than
synctasks(as there is more overhead associated to manage the task
queue and threads). This gave us a significant improvement over rebalance
under synctasks. This patch does not really gurantee that there will be a
clear performance difference between normal and aggressive mode, but this
patch certainly maximized the disk utilization for 1GBfiles run.

Results:

Test enviroment:
Gluster Config:
Number of Bricks: 2 (one brick per disk(RAID-6 12 disk))
Bricks:
Brick1: server1:/brick/test1/1
Brick2: server2:/brick/test1/1
Options Reconfigured:
performance.readdir-ahead: on
server.event-threads: 4
client.event-threads: 4

1000 files with 1GB each were created/renamed such that all files will have
server1 as cached and server2 as hashed, so that all files will be migrated.

Test machines had 24 cores each.

Results  with/without synctask based migration:
-----------------------------------------------

mode                    normal(10threads)          aggressive(20threads)

timetaken               0:55:30 (h:m:s)            0:56:3 (h:m:s)
withsynctask

timetaken
with migrator           0:38:3 (h:m:s)             0:23:41 (h:m:s)
threads

From above table it can be seen that, there is a clear 2x perf gain between
rebalance with synctask vs rebalance with migrator threads.

Additionally this patch modifies the code so that caller will have the exact error
number returned by dht_migrate_file(earlier the errno meaning was overloaded). This
will help avoiding scenarios where migration failure due to ENOENT, can result in
rebalance abort/failure.

Change-Id: I8904e2fb147419d4a51c1267be11a08ffd52168e
BUG: 1420166
Signed-off-by: Susant Palai 
Reviewed-on: https://review.gluster.org/16427
Smoke: Gluster Build System 
Reviewed-by: N Balachandran 
Reviewed-by: Raghavendra G 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

libglusterfs/stack.h: reduce duplication of code

2017-04-27T14:54:40+00:00

* Use STACK_UNWIND_STRICT everywhere.
* Provide STACK_WIND_COMMON as both STACK_WIND_COOKIE
  and STACK_WIND differ by just 1 line and 1 option.

Updates gluster/glusterfs#137

Change-Id: Ifbb6b9c4702b02f4a02834824f509fd10c78f0ce
Signed-off-by: Amar Tumballi 
Reviewed-on: https://review.gluster.org/16915
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Jeff Darcy

feature/dht: Directory synchronization

2017-04-26T09:00:34+00:00

Design doc: https://review.gluster.org/16876

Directory creation is now synchronized with blocking inodelk of the
parent on the hashed subvolume followed by the entrylk on the hashed
subvolume between dht_mkdir, dht_rmdir, dht_rename_dir and lookup
selfheal mkdir.

To maintain internal consistency of directories across all subvols of
dht, we need locks. Specifically we are interested in:

 1. Consistency of layout of a directory. Only one writer should modify
    the layout at a time. A writer (layout setting during directory heal
    as part of lookup) shouldn't modify the layout while there are
    readers (all other fops like create, mkdir etc., which consume
    layout) and readers shouldn't read the layout while a writer is in
    progress. Readers can read the layout simultaneously. Writer takes
    a WRITE inodelk on the directory (whose layout is being modified)
    across ALL subvols. Reader takes a READ inodelk on the directory
    (whose layout is being read) on ANY subvol.

 2. Consistency of directory namespace across subvols. The path and
    associated gfid should be same on all subvols. A gfid should not be
    associated with more than one path on any subvol. All fops that can
    change directory names (mkdir, rmdir, renamedir, directory creation
    phase in lookup-heal) takes an entrylk on hashed subvol of the
    directory.

 NOTE1: In point 2 above, since dht takes entrylk on hashed subvol of a
        directory, the transaction itself is a consumer of layout on
        parent directory. So, the transaction is a reader of parent
        layout and does an inodelk on parent directory just like any
        other layout reader. So a mkdir (dir/subdir) would:

     > Acquire a READ inodelk on "dir" on any subvol.
     > Acquire an entrylk (dir, "subdir") on hashed subvol of "subdir".
     > creates directory on hashed subvol and possibly on non-hashed subvols.
     > UNLOCK (entrylk)
     > UNLOCK (inodelk)

 NOTE2: mkdir fop while setting the layout of the directory being created
        is considered as a reader, but NOT a writer. The reason is for
        a fop which can consume the layout of a directory to come either
        of the following conditions has to be true:

     > mkdir syscall from application has to complete. In this case no
       need of synchronization.
     > A lookup issued on the directory racing with mkdir has to complete.
       Since layout setting by a lookup is considered as a writer, only
       one of either mkdir or lookup will set the layout.

Code re-organization:
   All the lock related routines are moved to "dht-lock.c" file.
   New wrapper function is introduced to take blocking inodelk
   followed by entrylk 'dht_protect_namespace'

Updates #191
Change-Id: I01569094dfbe1852de6f586475be79c1ba965a31
Signed-off-by: Kotresh HR 
BUG: 1443373
Reviewed-on: https://review.gluster.org/15472
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G 
Smoke: Gluster Build System