glusterfs.git/rpc/rpc-lib/src/rpc-clnt.c, branch v8.2

rpc: fix missing unref on reconnect

2019-10-02T08:23:17+00:00

When a protocol client connects to a brick, it first contacts
glusterd to get the brick's port, then reconnects to glusterfsd. The
reconnect cancels the reconnect timer and starts a new one. However,
cancelling the timer does not drop the rpc ref that was taken for it,
which leaks a reference.

Fix this by unref-ing the rpc when the reconnect timer is cancelled.
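
As a rough illustration of the rule described above (a pending timer owns
one ref, and cancelling it must give that ref back), here is a minimal
sketch. All names (toy_rpc, toy_timer_arm, toy_timer_cancel,
toy_reconnect) are hypothetical stand-ins, not the real rpc-clnt API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an rpc object; not the real rpc_clnt type. */
struct toy_rpc {
    int refcount;
    bool timer_armed; /* true while a reconnect timer holds a ref */
};

static void toy_rpc_ref(struct toy_rpc *rpc)
{
    rpc->refcount++;
}

static void toy_rpc_unref(struct toy_rpc *rpc)
{
    assert(rpc->refcount > 0);
    rpc->refcount--;
}

/* Arming the timer takes one ref on behalf of the timer. */
static void toy_timer_arm(struct toy_rpc *rpc)
{
    if (rpc->timer_armed)
        return;
    toy_rpc_ref(rpc);
    rpc->timer_armed = true;
}

/* Cancelling a pending timer must drop the ref taken in toy_timer_arm().
 * Returns true if a timer was actually cancelled. */
static bool toy_timer_cancel(struct toy_rpc *rpc)
{
    if (!rpc->timer_armed)
        return false;
    rpc->timer_armed = false;
    toy_rpc_unref(rpc);
    return true;
}

/* Reconnect: cancel the old timer (dropping its ref), then start a new one. */
static void toy_reconnect(struct toy_rpc *rpc)
{
    toy_timer_cancel(rpc);
    toy_timer_arm(rpc);
}
```

With this pairing, repeated reconnects keep the refcount stable instead of
growing by one on every cancel-and-rearm cycle.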

Change-Id: Ice89dcd93cb283a0c7250c369cc8961d52fb2022
Fixes: bz#1538900
BUG: 1538900
Signed-off-by: Zhang Huan

graph/cleanup: Fix race in graph cleanup

2019-09-05T16:14:44+00:00

We were unconditionally cleaning up the graph when we got
child_down followed by parent_down. But this is prone to a
race condition when some of the bricks are already disconnected.
In this case, we might free the graph even before the last
child_down is executed in the client xlator code, because the
child_down event has already been received.

To fix this race, we have introduced a check to see whether all client
xlators have cleared their reconnect chain and called child_down
for the last time.
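
The check can be pictured roughly as below. This is only a sketch under
assumed names (toy_client, toy_graph_can_cleanup); the real patch tracks
this state inside the client xlator and graph code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-client state; field names are illustrative only. */
struct toy_client {
    bool reconnect_pending;     /* reconnect chain not yet cleared */
    bool final_child_down_sent; /* last child_down already delivered */
};

/* The graph may be freed only after parent_down was seen and every
 * client has both cleared its reconnect chain and sent its final
 * child_down. */
static bool toy_graph_can_cleanup(const struct toy_client *clients,
                                  size_t count, bool parent_down_seen)
{
    assert(clients != NULL || count == 0);
    if (!parent_down_seen)
        return false;
    for (size_t i = 0; i < count; i++) {
        if (clients[i].reconnect_pending ||
            !clients[i].final_child_down_sent)
            return false;
    }
    return true;
}
```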

Change-Id: I7d02813bc366dac733a836e0cd7b14a6fac52042
fixes: bz#1727329
Signed-off-by: Mohammed Rafi KC

Revert "rpc: implement reconnect back-off strategy"

2019-05-21T08:36:32+00:00

This reverts commit 59841f7e1ff0511b04884015441a181a56d07bea.

This revert is done as a 'possible' fix for frequent regression
failures, which are also random in nature (i.e., different tests fail
in different runs).

Why exactly this patch? Because it seemed the most probable
candidate among those merged in the last 15 days, after which
regressions have been failing more often.

Updates: bz#1711827
Change-Id: I35333162fcd4064f9609525ca93c666053c6d959

rpc: implement reconnect back-off strategy

2019-05-11T14:25:53+00:00

When a connection failure happens, gluster tries to reconnect every 3
seconds. In some cases the failure is spurious, so a delay of 3 seconds
could be unnecessarily long.

This patch implements a back-off strategy that tries a reconnect after
as little as one tenth of a second. If this fails, the delay is doubled
until it's around 3 seconds. After that, the reconnect is attempted
every 3 seconds as before.
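
A minimal sketch of such a schedule, assuming millisecond units and a hard
cap at 3000 ms. The constants and the function name toy_next_delay_ms are
assumptions for illustration; the message only says the delay doubles
until it is "around" 3 seconds.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BACKOFF_START_MS 100u /* first retry after 1/10 s */
#define TOY_BACKOFF_MAX_MS 3000u  /* steady-state retry interval */

/* Given the delay used for the previous attempt (0 for the first
 * attempt), return the delay to use for the next one. */
static uint32_t toy_next_delay_ms(uint32_t current_ms)
{
    assert(current_ms <= TOY_BACKOFF_MAX_MS);
    if (current_ms == 0)
        return TOY_BACKOFF_START_MS;
    /* Doubling would reach or pass the cap: settle on the cap. */
    if (current_ms >= TOY_BACKOFF_MAX_MS / 2)
        return TOY_BACKOFF_MAX_MS;
    return current_ms * 2;
}
```

Under these assumptions the delays run 100, 200, 400, 800, 1600, 3000 ms
and then stay at 3000 ms.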

Change-Id: Icb3fbe20d618f50cbbb599dce542b4e871c22149
Updates: bz#1193929
Signed-off-by: Xavier Hernandez

rpc: Remove duplicate code

2019-03-28T05:35:25+00:00

rpc_clnt_disable() and rpc_clnt_disconnect() have the same code.
Removed rpc_clnt_disconnect() and changed its callers to call
rpc_clnt_disable() instead.

updates: bz#1193929
Change-Id: I965f57cc1d5af36d266810125558b6f5e5f279d4
Signed-off-by: Pranith Kumar K

rpc/transport: Missing a ref on dict while creating transport object

2019-03-20T13:24:44+00:00

While creating the rpc_transport object, we store a dictionary without
taking a ref on it, but we do an unref during cleanup of the
transport object.

So the rpc layer expects the caller to take a ref on the dictionary
before passing it in. This leads to a lot of confusion
across the code base and to ref leaks.

Semantically, this is not correct. It is the rpc layer's responsibility
to take a ref when storing the dict and to drop it during cleanup.

Listed below are the issues and leaks across the code base caused by
this confusion. They are currently present in upstream master.

1) changelog_rpc_client_init

2) quota_enforcer_init

3) rpcsvc_create_listeners: when there are two transports, such as tcp,rdma.

4) quotad_aggregator_init

5) glusterd: init

6) nfs3_init_state

7) server: init

8) client: init

This patch does the cleanup according to the semantics.
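
The intended ownership rule (whoever stores the dict takes the ref, and
drops it on cleanup) can be sketched as follows. The toy_dict and
toy_transport types and their functions are hypothetical stand-ins, not
the real dict_t or rpc_transport_t API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted dictionary. */
struct toy_dict {
    int refcount;
};

/* Hypothetical transport object that keeps a pointer to its options. */
struct toy_transport {
    struct toy_dict *options;
};

static struct toy_dict *toy_dict_new(void)
{
    struct toy_dict *d = calloc(1, sizeof(*d));
    if (d)
        d->refcount = 1; /* the creator owns the first ref */
    return d;
}

static struct toy_dict *toy_dict_ref(struct toy_dict *d)
{
    d->refcount++;
    return d;
}

static void toy_dict_unref(struct toy_dict *d)
{
    assert(d->refcount > 0);
    if (--d->refcount == 0)
        free(d);
}

/* The transport takes its own ref when it stores the dict ... */
static struct toy_transport *toy_transport_new(struct toy_dict *options)
{
    struct toy_transport *t = calloc(1, sizeof(*t));
    if (!t)
        return NULL;
    t->options = toy_dict_ref(options);
    return t;
}

/* ... and drops exactly that ref during cleanup. */
static void toy_transport_destroy(struct toy_transport *t)
{
    toy_dict_unref(t->options);
    free(t);
}
```

With this pairing, the caller keeps and releases only its own ref and does
not need to know that the transport stored the dict.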

Change-Id: I46373af9630373eb375ee6de0e6f2bbe2a677425
updates: bz#1659708
Signed-off-by: Mohammed Rafi KC

clnt/rpc: ref leak during disconnect.

2019-02-12T07:05:58+00:00

During disconnect cleanup, we are not cancelling the reconnect
timer, which causes a ref leak each time a disconnect
happens.

Change-Id: I9d05d1f368d080e04836bf6a0bb018bf8f7b5b8a
updates: bz#1659708
Signed-off-by: Mohammed Rafi KC

rpc-clnt: reduce transport connect log for EINPROGRESS

2019-01-07T03:19:34+00:00

quotad and ganesha.nfsd print many log messages such as:

[rpc-clnt.c:1739:rpc_clnt_submit ] 0--quota: error returned while attempting to connect to host: (null), port 0

Change-Id: Ic0c815400619e4a87a772a51b19822920228c1ef
Updates: bz#1596787
Signed-off-by: Kinglong Mee

libglusterfs: Move devel headers under glusterfs directory

2018-12-05T21:47:04+00:00

libglusterfs devel package headers are referenced in code using the
include semantics for a program's own headers. While this works, it
could be better, especially when dealing with out-of-tree xlator builds
or out-of-tree devel package usage in general.

Towards this, the following changes were made:
- moved all devel headers under a glusterfs directory
- included these headers using system header notation <> in all
  code outside of libglusterfs
- included these headers using own program notation "" within
  libglusterfs

This change, although big, only moves the headers around and
makes including them from other sources correct.

This helps us correctly include libglusterfs includes without
namespace conflicts.

Change-Id: Id2a98854e671a7ee5d73be44da5ba1a74252423b
Updates: bz#1193929
Signed-off-by: ShyamsundarR

rpcsvc: provide each request handler thread its own queue

2018-11-29T01:19:12+00:00

A single global per-program queue is shared by all request handler
threads and event threads, which can lead to high contention. Reduce
the contention by giving each request handler thread its own
private queue.

Thanks to "Manoj Pillai" for the idea of pairing a
single queue with a fixed request-handler-thread and event-thread,
which significantly reduced the performance regression caused by
queuing overhead.
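
To make the idea concrete, here is a minimal sketch of per-handler queues,
each with its own lock, where an event thread always feeds the same queue.
The names, the fixed sizes, and the modulo mapping are all assumptions for
illustration and are not taken from rpcsvc.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define TOY_NUM_HANDLERS 4
#define TOY_QUEUE_CAP 16

/* One private queue per request handler thread, with its own lock. */
struct toy_queue {
    pthread_mutex_t lock;
    int items[TOY_QUEUE_CAP]; /* request ids, as a stand-in */
    size_t head;
    size_t count;
};

struct toy_program {
    struct toy_queue queues[TOY_NUM_HANDLERS];
};

static void toy_program_init(struct toy_program *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < TOY_NUM_HANDLERS; i++) {
        pthread_mutex_init(&p->queues[i].lock, NULL);
        p->queues[i].head = 0;
        p->queues[i].count = 0;
    }
}

/* Assumed mapping: an event thread is pinned to one handler queue. */
static size_t toy_pick_queue(size_t event_thread_id)
{
    return event_thread_id % TOY_NUM_HANDLERS;
}

/* Called by an event thread; returns 0 on success, -1 if full. */
static int toy_enqueue(struct toy_program *p, size_t event_thread_id, int req)
{
    struct toy_queue *q = &p->queues[toy_pick_queue(event_thread_id)];
    int ret = -1;

    pthread_mutex_lock(&q->lock);
    if (q->count < TOY_QUEUE_CAP) {
        q->items[(q->head + q->count) % TOY_QUEUE_CAP] = req;
        q->count++;
        ret = 0;
    }
    pthread_mutex_unlock(&q->lock);
    return ret;
}

/* Called by handler thread handler_id; returns 0 on success, -1 if empty. */
static int toy_dequeue(struct toy_program *p, size_t handler_id, int *req)
{
    struct toy_queue *q = &p->queues[handler_id % TOY_NUM_HANDLERS];
    int ret = -1;

    pthread_mutex_lock(&q->lock);
    if (q->count > 0) {
        *req = q->items[q->head];
        q->head = (q->head + 1) % TOY_QUEUE_CAP;
        q->count--;
        ret = 0;
    }
    pthread_mutex_unlock(&q->lock);
    return ret;
}
```

Because each lock is shared only by one handler and the event threads
mapped to it, threads working on different queues never contend.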

Thanks to "Xavi Hernandez" for discussion on
how to communicate the event-thread death to request-handler-thread.

Thanks to "Karan Sandha" for voluntarily running
the perf benchmarks to confirm that the performance regression
introduced by ping-timer-fixes is fixed with this patch, and for
patiently running many iterations of regression tests while RCAing
the issue.

Thanks to "Milind Changire" for patiently running
the many iterations of perf benchmarking tests while RCAing the
regression caused by ping-timer-expiry fixes.

Change-Id: I578c3fc67713f4234bd3abbec5d3fbba19059ea5
Fixes: bz#1644629
Signed-off-by: Raghavendra Gowdappa