<feed xmlns='http://www.w3.org/2005/Atom'>
<title>glusterfs.git/rpc/rpc-lib/src/rpc-clnt.h, branch v5.4</title>
<subtitle></subtitle>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/'/>
<entry>
<title>Land clang-format changes</title>
<updated>2018-09-12T11:52:48+00:00</updated>
<author>
<name>Gluster Ant</name>
<email>bugzilla-bot@gluster.org</email>
</author>
<published>2018-09-12T11:52:48+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=45a71c0548b6fd2c757aa2e7b7671a1411948894'/>
<id>45a71c0548b6fd2c757aa2e7b7671a1411948894</id>
<content type='text'>
Change-Id: I6f5d8140a06f3c1b2d196849299f8d483028d33b
</content>
</entry>
<entry>
<title>rpc/clnt: Don't let consumers manage "connected" state</title>
<updated>2018-06-04T07:27:42+00:00</updated>
<author>
<name>Raghavendra G</name>
<email>rgowdapp@redhat.com</email>
</author>
<published>2018-06-03T05:54:18+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=3894f4262d53d1c1c593a78b21d72ba1103c86cd'/>
<id>3894f4262d53d1c1c593a78b21d72ba1103c86cd</id>
<content type='text'>
The state management of "connected" in rpc is ad hoc as far as
responsibility goes. Note that there is nothing wrong with the
functionality itself. The rpc layer manages this state in the
disconnect codepath and exposes an api for consumers to manage it.
Note that the rpc layer never sets "connected" to true by itself,
which forces consumers to use this api to get a working rpc
connection. The situation is best captured by a comment from Jeff
Darcy in glusterfsd/src/gf-attach.c:

-/*
- * In a sane world, the generic RPC layer would be capable of tracking
- * connection status by itself, with no help from us.  It might invoke our
- * callback if we had registered one, but only to provide information.  Sadly,
- * we don't live in that world.  Instead, the callback *must* exist and *must*
- * call rpc_clnt_{set,unset}_connected, because that's the only way those
- * fields get set (with RPC both above and below us on the stack).  If we don't
- * do that, then rpc_clnt_submit doesn't think we're connected even when we
- * are.  It calls the socket code to reconnect, but the socket code tracks this
- * stuff in a sane way so it knows we're connected and returns EINPROGRESS.
- * Then we're stuck, connected but unable to use the connection.  To make it
- * work, we define and register this trivial callback.
- */

Also, consumers of rpc learn about the state of the connection only
through the notifications sent by rpc-clnt. So, consumers have no
extra information with which to manage the state, and letting them
manage it is counterintuitive. This patch cleans that up and instead
moves the responsibility for managing the rpc layer's state into the
rpc layer itself.
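
A minimal sketch of what such a consumer callback looked like under the
old model (assuming the rpc_clnt_notify_t callback signature from
rpc-clnt.h; illustrative, not the exact code in gf-attach.c):

#include "rpc-clnt.h"

static int
connection_notify (struct rpc_clnt *rpc, void *mydata,
                   rpc_clnt_event_t event, void *data)
{
    /* Mirror the transport's view of the connection into rpc_clnt's
       "connected" field; without this, rpc_clnt_submit never sees the
       connection as usable. */
    if (event == RPC_CLNT_CONNECT)
        rpc_clnt_set_connected (&amp;rpc-&gt;conn);
    else if (event == RPC_CLNT_DISCONNECT)
        rpc_clnt_unset_connected (&amp;rpc-&gt;conn);
    return 0;
}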

Change-Id: I31e641a60795fc480ca753917f4b2579f1e05094
Signed-off-by: Raghavendra G &lt;rgowdapp@redhat.com&gt;
Fixes: bz#1585585
</content>
</entry>
<entry>
<title>build: address linkage issues</title>
<updated>2018-03-05T14:25:17+00:00</updated>
<author>
<name>Kaleb S. KEITHLEY</name>
<email>kkeithle@redhat.com</email>
</author>
<published>2018-03-02T22:04:49+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=2bb17551a597b382d77bb5ebc2671b45565cd542'/>
<id>2bb17551a597b382d77bb5ebc2671b45565cd542</id>
<content type='text'>
We have the following undefined symbol errors from protocol/server.so:

  glusterfs_mgmt_pmap_signout
  glusterfs_autoscale_threads
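
(Illustrative: undefined references like these can be confirmed with
nm on the built module; the path below is an assumption, not the exact
build-tree location.)

  nm -D protocol/server.so | grep ' U '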

See https://review.gluster.org/19225 (bz#1532238)
and https://review.gluster.org/19657 (bz#1550895)

(why are there two different bzs for the same bug?)

IMO this is a cleaner solution, i.e. moving the above two functions
to libgfrpc (.../rpc/rpc-lib/...).

I would also, for (foolish) consistency's sake, like to see
glusterfs_mgmt_pmap_signin() moved from glusterfsd to libgfrpc as
well.

This works on f28/rawhide, with its new, more restrictive run-time
link semantics. The smoke and regression tests on earlier fedora and
centos will confirm that it works on those platforms too.

Change-Id: I9cfbd1cc15e7ebd9fc31b56ac791287fa2c584de
BUG: 1550895
Signed-off-by: Kaleb S. KEITHLEY &lt;kkeithle@redhat.com&gt;
</content>
</entry>
<entry>
<title>rpc/*: auth-header changes</title>
<updated>2018-01-17T06:00:39+00:00</updated>
<author>
<name>Amar Tumballi</name>
<email>amarts@redhat.com</email>
</author>
<published>2017-11-06T18:37:12+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=75b063d76d78b5d1e0e53a1be37dc5ad9200f7b2'/>
<id>75b063d76d78b5d1e0e53a1be37dc5ad9200f7b2</id>
<content type='text'>
Introduce another authentication header which can now send more data.
This is useful because this data can be common for all the fops, and
we don't need to change all the signatures.

As part of this, rpc-clnt.c is made a little more modular to support
multiple authentication structures.

The stack.h changes are a placeholder for ctime etc.; they can be
moved later based on need.
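
A hypothetical sketch of the idea (the type and field names below are
invented for illustration and are not the structures this patch adds):

#include &lt;stdint.h&gt;
#include &lt;time.h&gt;

/* A richer auth blob that rides along with every request, so data
   common to all fops (e.g. ctime) needn't change each fop signature. */
typedef struct {
    uint32_t        flags;   /* which optional fields are present */
    struct timespec ctime;   /* example of data shared by all fops */
} auth_extra_params_t;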

updates #384

Change-Id: I6111c13cfd2ec92e2b4e9295896bf62a8a33b2c7
Signed-off-by: Amar Tumballi &lt;amarts@redhat.com&gt;
</content>
</entry>
<entry>
<title>glusterfs: Use gcc builtin ATOMIC operator to increase/decreate refcount.</title>
<updated>2017-12-12T09:05:56+00:00</updated>
<author>
<name>Mohit Agrawal</name>
<email>moagrawa@redhat.com</email>
</author>
<published>2017-10-20T07:09:29+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=430484c92ab5a6234958d1143e0bb14aeb0cd1c0'/>
<id>430484c92ab5a6234958d1143e0bb14aeb0cd1c0</id>
<content type='text'>
Problem: In the glusterfs code base we call mutex_lock/unlock to take
         or drop a reference on an object. Sometimes this can also be
         a source of lock contention.

Solution: There is no need to use a mutex to increase/decrease a ref
          counter; instead of a mutex, use the gcc builtin ATOMIC
          operations.

Test:   I have not yet measured how much performance glusterfs itself
        gains from this patch, but I tested the same approach with the
        small program below (in both mutex and atomic variants) and
        saw a clear difference.

#include &lt;pthread.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

static int64_t glob;        /* shared counter updated by all threads */
static int numOuterLoops;

static void *
threadFunc(void *arg)
{
    int j;

    for (j = 0; j &lt; numOuterLoops; j++) {
            __atomic_add_fetch (&amp;glob, 1, __ATOMIC_ACQ_REL);
    }
    return NULL;
}

int
main(int argc, char *argv[])
{
    int s, j;
    int numThreads;
    pthread_t *thread;
    int64_t n = 0;

    if (argc &lt; 3) {
        printf(" Please provide 2 args: Num of threads &amp;&amp; Outer Loop\n");
        exit (-1);
    }
    numThreads = atoi(argv[1]);
    numOuterLoops = atoi(argv[2]);

    printf("\tthreads: %d; outer loops: %d;\n",
            numThreads, numOuterLoops);

    thread = calloc(numThreads, sizeof(pthread_t));
    if (thread == NULL) {
        printf ("calloc error so exit\n");
        exit (-1);
    }

    __atomic_store (&amp;glob, &amp;n, __ATOMIC_RELEASE);
    for (j = 0; j &lt; numThreads; j++) {
        s = pthread_create(&amp;thread[j], NULL, threadFunc, NULL);
        if (s != 0) {
            printf ("pthread_create failed so exit\n");
            exit (-1);
        }
    }

    for (j = 0; j &lt; numThreads; j++) {
        s = pthread_join(thread[j], NULL);
        if (s != 0) {
            printf ("pthread_join failed so exit\n");
            exit (-1);
        }
    }
    printf("glob value is %ld\n",
           (long) __atomic_load_n (&amp;glob, __ATOMIC_RELAXED));

    exit(0);
}

   time ./thr_count 800 800000
   threads: 800; outer loops: 800000;
   glob value is 640000000

   real    1m10.288s
   user    0m57.269s
   sys     3m31.565s

   time ./thr_count_atomic 800 800000
   threads: 800; outer loops: 800000;
   glob value is 640000000

   real    0m20.313s
   user    1m20.558s
   sys     0m0.028
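
The mutex variant used for the comparison is not shown in the message;
a minimal sketch (assuming a pthread_mutex_t guarding the same counter)
would replace threadFunc with:

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *
threadFunc(void *arg)
{
    int j;

    for (j = 0; j &lt; numOuterLoops; j++) {
        pthread_mutex_lock(&amp;lock);      /* serialize every increment */
        glob++;
        pthread_mutex_unlock(&amp;lock);
    }
    return NULL;
}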

Change-Id: Ie5030a52ea264875e002e108dd4b207b15ab7cc7
Signed-off-by: Mohit Agrawal &lt;moagrawa@redhat.com&gt;
</content>
</entry>
<entry>
<title>rpc: Eliminate conn-&gt;lock contention by using more granular locks</title>
<updated>2017-11-28T13:02:06+00:00</updated>
<author>
<name>Mohit Agrawal</name>
<email>moagrawa@redhat.com</email>
</author>
<published>2017-08-14T04:45:45+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=04fb94c99160ae1c6dea02624fefd47eb48da810'/>
<id>04fb94c99160ae1c6dea02624fefd47eb48da810</id>
<content type='text'>
rpc_clnt_submit() acquires conn-&gt;lock before the call to
rpc_transport_submit_request() and the subsequent queuing of the frame
into the saved_frames list. However, as part of handling
RPC_TRANSPORT_MSG_RECEIVED and RPC_TRANSPORT_MSG_SENT notifications in
rpc_clnt_notify(), conn-&gt;lock is again used to atomically update the
conn-&gt;last_received and conn-&gt;last_sent event timestamps.

So when conn-&gt;lock is acquired as part of submitting a request, a
parallel POLLIN notification gets blocked at the rpc layer until the
request submission completes and the lock is released.

To get around this, this patch calls clock_gettime (instead of
gettimeofday) to update the event timestamps in conn-&gt;last_received
and conn-&gt;last_sent; the update no longer needs mutex_lock because
clock_gettime is a thread-safe call.

Note: Running fio on a VM after applying the patch shows improved
      iops.
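
A minimal sketch of the lock-free timestamp update (assuming the
timestamp fields are kept as struct timespec after this change;
concurrent readers may see a partially written value, which this
approach accepts):

#include &lt;time.h&gt;

/* Update an event timestamp without holding conn-&gt;lock. */
static void
update_event_timestamp (struct timespec *stamp)
{
    struct timespec ts = {0, 0};

    clock_gettime (CLOCK_REALTIME, &amp;ts);
    *stamp = ts;   /* previously done under conn-&gt;lock via gettimeofday */
}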

Change-Id: I347b5031d61c426b276bc5e07136a7172645d763
BUG: 1467614
Signed-off-by: Krutika Dhananjay &lt;kdhananj@redhat.com&gt;
</content>
</entry>
<entry>
<title>rpc: use GF_ATOMIC_INC to generate rpc_clnt's callid</title>
<updated>2017-05-08T04:08:29+00:00</updated>
<author>
<name>Zhou Zhengping</name>
<email>johnzzpcrystal@gmail.com</email>
</author>
<published>2017-05-07T09:29:26+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=e0aaaccefdfda345b981b86dea23b2ffa52e9377'/>
<id>e0aaaccefdfda345b981b86dea23b2ffa52e9377</id>
<content type='text'>
Change-Id: I57ad970411db1ccd3d2c56c504c7da9cc221051f
BUG: 1448692
Signed-off-by: Zhou Zhengping &lt;johnzzpcrystal@gmail.com&gt;
Reviewed-on: https://review.gluster.org/17198
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Niels de Vos &lt;ndevos@redhat.com&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Raghavendra G &lt;rgowdapp@redhat.com&gt;
</content>
</entry>
<entry>
<title>Halo Replication feature for AFR translator</title>
<updated>2017-05-02T10:23:53+00:00</updated>
<author>
<name>Kevin Vigor</name>
<email>kvigor@fb.com</email>
</author>
<published>2017-03-21T15:23:25+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=07cc8679cdf3b29680f4f105d0222da168d8bfc1'/>
<id>07cc8679cdf3b29680f4f105d0222da168d8bfc1</id>
<content type='text'>
Summary:
Halo Geo-replication is a feature which allows Gluster or NFS clients to write
locally to their region (as defined by a latency "halo" or threshold if you
like), and have their writes asynchronously propagate from their origin to the
rest of the cluster.  Clients can also write synchronously to the cluster
simply by specifying a halo-latency which is very large (e.g. 10 seconds), which
will include all bricks.

In other words, it allows clients to decide at mount time if they desire
synchronous or asynchronous IO into a cluster and the cluster can support both
of these modes to any number of clients simultaneously.

There are a few new volume options due to this feature:
  halo-shd-latency:  The threshold below which self-heal daemons will
  consider children (bricks) connected.

  halo-nfsd-latency: The threshold below which NFS daemons will consider
  children (bricks) connected.

  halo-latency: The threshold below which all other clients will
  consider children (bricks) connected.

  halo-min-replicas: The minimum number of replicas which are to
  be enforced regardless of latency specified in the above 3 options.
  If the number of children falls below this threshold the next
  best (chosen by latency) shall be swapped in.

New FUSE mount options:
  halo-latency &amp; halo-min-replicas: As described above.
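
An illustrative mount invocation using these options (hypothetical
server and volume names; exact option syntax may vary by release):

  mount -t glusterfs -o halo-latency=10,halo-min-replicas=2 \
        server1:/myvol /mnt/myvol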

This feature combined with multi-threaded SHD support (D1271745) results in
some pretty cool geo-replication possibilities.

Operational Notes:
- Global consistency is guaranteed for synchronous clients; this is provided by
  the existing entry-locking mechanism.
- Asynchronous clients, on the other hand, are merely consistent to their region.
  Writes &amp; deletes will be protected via entry-locks as usual, preventing
  concurrent writes into files which are undergoing replication.  Read operations,
  on the other hand, should never block.
- Writes are allowed from _any_ region and propagated from the origin to all
  other regions.  The takeaway from this is that care should be taken to ensure
  multiple writers do not write the same files resulting in a gfid split-brain
  which will require resolution via split-brain policies (majority, mtime &amp;
  size).  Recommended method for preventing this is using the nfs-auth feature to
  define which region for each share has RW permissions, tiers not in the origin
  region should have RO perms.

TODO:
- Synchronous clients (including the SHD) should choose clients from their own
  region as preferred sources for reads.  Most of the plumbing is in place for
  this via the child_latency array.
- Better GFID split brain handling &amp; better dentry-type split brain handling
  (i.e. create a trash can and move the offending files into it).
- Tagging in addition to latency as a means of defining which children you wish
  to synchronously write to

Test Plan:
- The usual suspects, clang, gcc w/ address sanitizer &amp; valgrind
- Prove tests

Reviewers: jackl, dph, cjh, meyering

Reviewed By: meyering

Subscribers: ethanr

Differential Revision: https://phabricator.fb.com/D1272053

Tasks: 4117827

Change-Id: I694a9ab429722da538da171ec528406e77b5e6d1
BUG: 1428061
Signed-off-by: Kevin Vigor &lt;kvigor@fb.com&gt;
Reviewed-on: http://review.gluster.org/16099
Reviewed-on: https://review.gluster.org/16177
Tested-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
</content>
</entry>
<entry>
<title>rpc/clnt: remove locks while notifying CONNECT/DISCONNECT</title>
<updated>2017-03-01T14:35:48+00:00</updated>
<author>
<name>Raghavendra G</name>
<email>rgowdapp@redhat.com</email>
</author>
<published>2017-02-28T07:43:59+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=773f32caf190af4ee48818279b6e6d3c9f2ecc79'/>
<id>773f32caf190af4ee48818279b6e6d3c9f2ecc79</id>
<content type='text'>
Locking during notify was introduced as part of commit
aa22f24f5db7659387704998ae01520708869873 [1]. The fix was introduced
to fix out-of-order CONNECT/DISCONNECT events from rpc-clnt to parent
xlators [2]. However, as part of handling DISCONNECT, protocol/client
unwinds the saved frames (with failure) that are waiting for
responses. This saved_frames_unwind can be a costly operation and
hence ideally shouldn't be included in the critical section of
notifylock, as it unnecessarily delays reconnection to the same
brick. Also, it's not good practice to pass control to other xlators
while holding a lock, as it can lead to deadlocks. So, this patch
removes locking in rpc-clnt while notifying parent xlators.

To fix [2], two changes are present in this patch:

* notify DISCONNECT before cleaning up rpc connection (same as commit
  a6b63e11b7758cf1bfcb6798, patch [3]).
* protocol/client uses rpc_clnt_cleanup_and_start, which cleans up rpc
  connection and does a start while handling a DISCONNECT event from
  rpc. Note that patch [3] was reverted as rpc_clnt_start called in
  quick_reconnect path of protocol/client didn't invoke connect on
  transport as the connection was not cleaned up _yet_ (as cleanup was
  moved post notification in rpc-clnt). This resulted in clients never
  attempting to connect to bricks.

Note that one of the neater ways to fix [2] (without using locks) is
to introduce generation numbers to map CONNECTs and DISCONNECTs across
epochs and ignore DISCONNECT events that don't belong to the current
epoch. However, this approach is a bit complex to implement and
requires time. So, the current patch is a hacky stop-gap fix till we
come up with a cleaner solution.
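
A hypothetical sketch of that generation-number idea (the fields and
helpers below do not exist in rpc-clnt; they only illustrate the epoch
check):

#include &lt;stdint.h&gt;

/* Tag the connection with an epoch on every successful connect and
   drop DISCONNECT notifications that carry a stale epoch. */
static uint64_t current_epoch;   /* per-connection, hypothetical */

static void
on_connect (void)
{
    current_epoch++;             /* a new epoch begins */
}

static int
should_deliver_disconnect (uint64_t event_epoch)
{
    /* a DISCONNECT from an older epoch arrived out of order; drop it */
    return (event_epoch == current_epoch);
}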

[1] http://review.gluster.org/15916
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1386626
[3] http://review.gluster.org/15681

Change-Id: I62daeee8bb1430004e28558f6eb133efd4ccf418
Signed-off-by: Raghavendra G &lt;rgowdapp@redhat.com&gt;
BUG: 1427012
Reviewed-on: https://review.gluster.org/16784
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Milind Changire &lt;mchangir@redhat.com&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
</content>
</entry>
<entry>
<title>core: run many bricks within one glusterfsd process</title>
<updated>2017-01-31T00:13:58+00:00</updated>
<author>
<name>Jeff Darcy</name>
<email>jdarcy@redhat.com</email>
</author>
<published>2016-12-08T21:24:15+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=1a95fc3036db51b82b6a80952f0908bc2019d24a'/>
<id>1a95fc3036db51b82b6a80952f0908bc2019d24a</id>
<content type='text'>
This patch adds support for multiple brick translator stacks running
in a single brick server process.  This reduces our per-brick memory usage by
approximately 3x, and our appetite for TCP ports even more.  It also creates
potential to avoid process/thread thrashing, and to improve QoS by scheduling
more carefully across the bricks, but realizing that potential will require
further work.

Multiplexing is controlled by the "cluster.brick-multiplex" global option.  By
default it's off, and bricks are started in separate processes as before.  If
multiplexing is enabled, then *compatible* bricks (mostly those with the same
transport options) will be started in the same process.
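
For example, multiplexing would be toggled cluster-wide with the usual
syntax for global options (illustrative):

  gluster volume set all cluster.brick-multiplex on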

Change-Id: I45059454e51d6f4cbb29a4953359c09a408695cb
BUG: 1385758
Signed-off-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
Reviewed-on: https://review.gluster.org/14763
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Vijay Bellur &lt;vbellur@redhat.com&gt;
</content>
</entry>
</feed>
