<feed xmlns='http://www.w3.org/2005/Atom'>
<title>glusterfs.git/rpc/rpc-lib/src/rpc-clnt.c, branch v3.10.2</title>
<subtitle></subtitle>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/'/>
<entry>
<title>rpc: bump up conn-&gt;cleanup_gen in rpc_clnt_reconnect_cleanup</title>
<updated>2017-03-27T13:58:29+00:00</updated>
<author>
<name>Atin Mukherjee</name>
<email>amukherj@redhat.com</email>
</author>
<published>2017-03-18T10:59:10+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=8ab106ab27c427acda2a040d9f81cb8b7d573921'/>
<id>8ab106ab27c427acda2a040d9f81cb8b7d573921</id>
<content type='text'>
Commit 086436a introduced a generation number (cleanup_gen) to ensure that
the rpc layer doesn't end up cleaning up the connection object if the
application layer has already destroyed it. Bumping up cleanup_gen was
done only in rpc_clnt_connection_cleanup (). However, the same is needed
in rpc_clnt_reconnect_cleanup () too: without it, if the object gets
destroyed through the reconnect event in the application layer, the rpc
layer will still try to delete the object, resulting in a double free
and a crash.

Peer probing an invalid host/IP was the basic test to catch this issue.
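
The guard is a simple comparison of generations. A minimal, self-contained
sketch (hypothetical types and names, not the actual rpc-clnt structures) of
the idea:

    #include &lt;stdint.h&gt;
    #include &lt;stdio.h&gt;

    struct conn {
        uint64_t cleanup_gen;   /* bumped by every cleanup path */
    };

    /* rpc_clnt_reconnect_cleanup () must bump the generation just like
     * rpc_clnt_connection_cleanup () already does. */
    static void reconnect_cleanup (struct conn *c)
    {
        c-&gt;cleanup_gen++;
    }

    /* A deferred rpc-layer teardown only proceeds if the generation it
     * captured when it was scheduled is still current. */
    static void deferred_destroy (struct conn *c, uint64_t scheduled_gen)
    {
        if (scheduled_gen != c-&gt;cleanup_gen) {
            printf ("stale generation, skipping destroy\n");
            return;             /* this is what avoids the double free */
        }
        printf ("destroying connection\n");
    }

    int main (void)
    {
        struct conn c = { .cleanup_gen = 0 };
        uint64_t gen = c.cleanup_gen;   /* captured at scheduling time */

        reconnect_cleanup (&amp;c);     /* reconnect path tears it down first */
        deferred_destroy (&amp;c, gen); /* the stale teardown now backs off   */
        return 0;
    }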

&gt;Reviewed-on: https://review.gluster.org/16914
&gt;Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
&gt;NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
&gt;Reviewed-by: Milind Changire &lt;mchangir@redhat.com&gt;
&gt;CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
&gt;Reviewed-by: Jeff Darcy &lt;jeff@pl.atyp.us&gt;
&gt;(cherry picked from commit 39e09ad1e0e93f08153688c31433c38529f93716)

Change-Id: Id5332f3239cb324cead34eb51cf73d426733bd46
BUG: 1434399
Signed-off-by: Atin Mukherjee &lt;amukherj@redhat.com&gt;
Reviewed-on: https://review.gluster.org/16936
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Shyamsundar Ranganathan &lt;srangana@redhat.com&gt;
</content>
</entry>
<entry>
<title>rpc/clnt: remove locks while notifying CONNECT/DISCONNECT</title>
<updated>2017-03-06T16:20:33+00:00</updated>
<author>
<name>Raghavendra G</name>
<email>rgowdapp@redhat.com</email>
</author>
<published>2017-02-28T07:43:59+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=fab2c6d574742e6c356d6b364d720540fc354fe8'/>
<id>fab2c6d574742e6c356d6b364d720540fc354fe8</id>
<content type='text'>
Locking during notify was introduced as part of commit
aa22f24f5db7659387704998ae01520708869873 [1]. That fix addressed
out-of-order CONNECT/DISCONNECT events from rpc-clnt to parent
xlators [2]. However, as part of handling DISCONNECT, protocol/client
unwinds (with failure) the saved frames that were waiting for
responses. This saved_frames_unwind can be a costly operation and
hence ideally shouldn't sit inside the critical section of notifylock,
as it unnecessarily delays the reconnection to the same brick. Also, it
is not good practice to pass control to other xlators while holding a
lock, as it can lead to deadlocks. So, this patch removes the locking
in rpc-clnt while notifying parent xlators (a sketch of the resulting
ordering follows the references below).

To fix [2], this patch makes two changes:

* notify DISCONNECT before cleaning up the rpc connection (same as
  commit a6b63e11b7758cf1bfcb6798, patch [3]).
* protocol/client uses rpc_clnt_cleanup_and_start, which cleans up the
  rpc connection and starts it again while handling a DISCONNECT event
  from rpc. Note that patch [3] was reverted because rpc_clnt_start,
  called in the quick_reconnect path of protocol/client, didn't invoke
  connect on the transport, as the connection was not cleaned up _yet_
  (cleanup had been moved to after the notification in rpc-clnt). This
  resulted in clients never attempting to connect to bricks.

Note that one of the neater ways to fix [2] (without using locks) is
to introduce generation numbers to map CONNECTs and DISCONNECTs across
epochs and ignore DISCONNECT events that don't belong to the current
epoch. However, that approach is a bit complex to implement and
requires time. So, the current patch is a hacky stop-gap fix till we
come up with a cleaner solution.

[1] http://review.gluster.org/15916
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1386626
[3] http://review.gluster.org/15681
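
A rough before/after sketch (hypothetical helpers, not the real rpc-clnt
code) of moving the unwind and the parent notification out of the critical
section:

    #include &lt;pthread.h&gt;
    #include &lt;stdio.h&gt;

    /* Stand-ins for the real rpc-clnt / protocol-client pieces. */
    static void saved_frames_unwind       (void) { printf ("frames unwound\n");      }
    static void notify_parents_disconnect (void) { printf ("DISCONNECT notified\n"); }
    static void connection_cleanup        (void) { printf ("connection cleaned\n");  }

    static pthread_mutex_t notifylock = PTHREAD_MUTEX_INITIALIZER;

    /* Before: the potentially slow unwind and the call up into parent
     * xlators both ran while holding notifylock. */
    static void handle_disconnect_before (void)
    {
        pthread_mutex_lock (&amp;notifylock);
        saved_frames_unwind ();
        notify_parents_disconnect ();
        pthread_mutex_unlock (&amp;notifylock);
    }

    /* After: notify the parents first, with no lock held; only then
     * unwind and clean up the connection. */
    static void handle_disconnect_after (void)
    {
        notify_parents_disconnect ();
        saved_frames_unwind ();
        connection_cleanup ();
    }

    int main (void)
    {
        handle_disconnect_before ();
        handle_disconnect_after ();
        return 0;
    }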

&gt;Change-Id: I62daeee8bb1430004e28558f6eb133efd4ccf418
&gt;Signed-off-by: Raghavendra G &lt;rgowdapp@redhat.com&gt;
&gt;BUG: 1427012
&gt;Reviewed-on: https://review.gluster.org/16784
&gt;Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
&gt;Reviewed-by: Milind Changire &lt;mchangir@redhat.com&gt;
&gt;NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
&gt;CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
(cherry picked from commit 773f32caf190af4ee48818279b6e6d3c9f2ecc79)

Change-Id: I62daeee8bb1430004e28558f6eb133efd4ccf418
Signed-off-by: Raghavendra G &lt;rgowdapp@redhat.com&gt;
BUG: 1428670
Reviewed-on: https://review.gluster.org/16835
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Shyamsundar Ranganathan &lt;srangana@redhat.com&gt;
</content>
</entry>
<entry>
<title>socket: socket disconnect should wait for poller thread exit</title>
<updated>2016-12-22T04:49:19+00:00</updated>
<author>
<name>Rajesh Joseph</name>
<email>rjoseph@redhat.com</email>
</author>
<published>2016-12-13T09:58:42+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=af6769675acbbfd780fa2ece8587502d6d579372'/>
<id>af6769675acbbfd780fa2ece8587502d6d579372</id>
<content type='text'>
When SSL is enabled, or if the "transport.socket.own-thread" option is
set, socket_poller runs as a separate thread. Currently, during a
disconnect or a PARENT_DOWN scenario, we don't wait for this thread to
terminate. PARENT_DOWN will disconnect the socket layer and clean up
the resources used by socket_poller.

Therefore, before the disconnect, we should wait for the poller thread
to exit.
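
The mechanism is an ordinary thread join. A small generic sketch (plain
pthreads, not the actual socket transport code) of the idea:

    #include &lt;pthread.h&gt;
    #include &lt;stdio.h&gt;
    #include &lt;unistd.h&gt;

    static void *socket_poller (void *arg)
    {
        (void) arg;
        /* ... event loop that touches transport resources ... */
        sleep (1);
        return NULL;
    }

    int main (void)
    {
        pthread_t poller;

        pthread_create (&amp;poller, NULL, socket_poller, NULL);

        /* Disconnect path: wait for the poller to exit before freeing
         * anything it may still be using. */
        pthread_join (poller, NULL);
        printf ("poller exited, safe to tear down the transport\n");
        return 0;
    }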

Change-Id: I71f984b47d260ffd979102f180a99a0bed29f0d6
BUG: 1404181
Signed-off-by: Rajesh Joseph &lt;rjoseph@redhat.com&gt;
Reviewed-on: http://review.gluster.org/16141
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Kaushal M &lt;kaushal@redhat.com&gt;
Reviewed-by: Raghavendra Talur &lt;rtalur@redhat.com&gt;
Reviewed-by: Raghavendra G &lt;rgowdapp@redhat.com&gt;
</content>
</entry>
<entry>
<title>rpc: fix for race between rpc and protocol/client</title>
<updated>2016-12-05T14:09:48+00:00</updated>
<author>
<name>Rajesh Joseph</name>
<email>rjoseph@redhat.com</email>
</author>
<published>2016-12-02T19:40:51+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=aa22f24f5db7659387704998ae01520708869873'/>
<id>aa22f24f5db7659387704998ae01520708869873</id>
<content type='text'>
It is possible that the notification thread which notifies the
protocol/client layer about the disconnection is put to sleep and,
meanwhile, a fuse thread or a timer thread initiates and completes a
reconnection to the brick. The notification thread is then woken up
and the protocol/client layer updates its flags to indicate that the
network is disconnected. No reconnection is initiated, because
reconnection is the rpc-lib layer's responsibility and its flags
indicate that the connection is up.

Fix: Serialize the connect and disconnect notifications
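
A minimal sketch (hypothetical names, not the actual rpc-clnt code) of the
serialization: both notifications take the same mutex, so the flag update for
a stale DISCONNECT can no longer interleave with a CONNECT completing on
another thread:

    #include &lt;pthread.h&gt;
    #include &lt;stdbool.h&gt;
    #include &lt;stdio.h&gt;

    static pthread_mutex_t notifylock = PTHREAD_MUTEX_INITIALIZER;
    static bool connected = false;

    static void notify_connect (void)
    {
        pthread_mutex_lock (&amp;notifylock);
        connected = true;       /* protocol/client marks itself online */
        pthread_mutex_unlock (&amp;notifylock);
    }

    static void notify_disconnect (void)
    {
        pthread_mutex_lock (&amp;notifylock);
        connected = false;      /* serialized with notify_connect ()   */
        pthread_mutex_unlock (&amp;notifylock);
    }

    int main (void)
    {
        notify_disconnect ();
        notify_connect ();
        printf ("connected = %d\n", connected);
        return 0;
    }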

Credit: Raghavendra Talur &lt;rtalur@redhat.com&gt;
Change-Id: I8ff5d1a3283b47f5c26848a42016a40bc34ffc1d
BUG: 1386626
Signed-off-by: Rajesh Joseph &lt;rjoseph@redhat.com&gt;
Reviewed-on: http://review.gluster.org/15916
Reviewed-by: Raghavendra G &lt;rgowdapp@redhat.com&gt;
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Raghavendra Talur &lt;rtalur@redhat.com&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
</content>
</entry>
<entry>
<title>Revert "rpc: Fix the race between notification and reconnection"</title>
<updated>2016-11-16T09:25:31+00:00</updated>
<author>
<name>Pranith Kumar Karampuri</name>
<email>pkarampu@redhat.com</email>
</author>
<published>2016-11-13T13:53:15+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=922af37e33acb59c6f8beca40a9314593b484d1a'/>
<id>922af37e33acb59c6f8beca40a9314593b484d1a</id>
<content type='text'>
This reverts commit a6b63e11b7758cf1bfcb67985e25ec02845f0995.

Nithya and Rajesh found that the mount sometimes fails after this
patch was merged, so it is being reverted.

BUG: 1386626
Change-Id: I959a5b6c7da61368cf4c67c98193c6e8fdd1755d
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
Reviewed-on: http://review.gluster.org/15838
Reviewed-by: N Balachandran &lt;nbalacha@redhat.com&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
</content>
</entry>
<entry>
<title>rpc: Fix the race between notification and reconnection</title>
<updated>2016-10-25T06:42:20+00:00</updated>
<author>
<name>Pranith Kumar K</name>
<email>pkarampu@redhat.com</email>
</author>
<published>2016-10-19T10:20:50+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=a6b63e11b7758cf1bfcb67985e25ec02845f0995'/>
<id>a6b63e11b7758cf1bfcb67985e25ec02845f0995</id>
<content type='text'>
Problem:
There was a hang because an unlock on an entry failed with ENOTCONN.
The client thinks the connection is down, whereas the server thinks
it is up.

This is the race we are seeing:
1) The connection from the client to the brick disconnects.
2) Saved frames unwind is called, which unwinds all frames that were
   wound before the disconnect.
3) The client reconnects to the brick and setvolume completes.
4) The disconnect notification for the connection in 1) arrives only
   now and calls client_rpc_notify(), which marks the connection
   offline even though the connection is up.

This happens because I/O can retrigger the connection before the
disconnect notification is sent to the higher layers in rpc.

Fix:
Notify the higher layers that a disconnect happened and only then go
ahead with the reconnect logic.

For the logs which point to the information above check:
https://bugzilla.redhat.com/show_bug.cgi?id=1386626#c1

Thanks to Raghavendra G for suggesting the correct fix.
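
A tiny sketch (hypothetical helpers, not the actual rpc-clnt code) of the
reordering in the disconnect path: the notification goes up before any
reconnect is triggered, so a completed reconnect can never be overwritten by
a stale DISCONNECT event:

    #include &lt;stdio.h&gt;

    static void notify_higher_layers_disconnect (void)
    {
        printf ("DISCONNECT delivered to protocol/client\n");
    }

    static void reconnect (void)
    {
        printf ("reconnect attempted\n");
    }

    /* Disconnect handler: notify first, reconnect second. */
    static void on_disconnect (void)
    {
        notify_higher_layers_disconnect ();
        reconnect ();
    }

    int main (void)
    {
        on_disconnect ();
        return 0;
    }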

BUG: 1386626
Change-Id: I3c84ba1f17010bd69049fa88ec5f0ae431f8cda9
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
Reviewed-on: http://review.gluster.org/15681
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Niels de Vos &lt;ndevos@redhat.com&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Raghavendra G &lt;rgowdapp@redhat.com&gt;
</content>
</entry>
<entry>
<title>rpc: fix unused variable warnings/errors</title>
<updated>2016-08-29T12:00:49+00:00</updated>
<author>
<name>Kaleb S. KEITHLEY</name>
<email>kkeithle@redhat.com</email>
</author>
<published>2016-08-22T16:11:24+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=00ee35093917f57b97b523decee9c58050d35d32'/>
<id>00ee35093917f57b97b523decee9c58050d35d32</id>
<content type='text'>
http://review.gluster.org/14085 fixes the "leak" - via the generated
rpc/xdr headers - of the pragmas that mask these warnings.

However, 14085 won't pass the smoke test until all the warnings are
fixed.
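
For illustration, the usual shape of such a fix (a generic example, not the
actual code touched by this patch): drop the dead variable, or actually use
its value:

    /* Before: 'ret' is assigned but never read, so the compiler warns
     * once the masking pragmas are gone. */
    static int before (void)
    {
        int ret = 0;
        ret = -1;       /* warning: variable 'ret' set but not used */
        return 0;
    }

    /* After: the dead variable is removed. */
    static int after (void)
    {
        return 0;
    }

    int main (void)
    {
        return before () + after ();
    }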

Change-Id: I20d91091bee0bf8f198a307ebba4b284bc3817ff
BUG: 1369124
Signed-off-by: Kaleb S. KEITHLEY &lt;kkeithle@redhat.com&gt;
Reviewed-on: http://review.gluster.org/15240
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
</content>
</entry>
<entry>
<title>changelog/rpc: Fix rpc_clnt_t mem leaks</title>
<updated>2016-07-22T15:12:52+00:00</updated>
<author>
<name>Kotresh HR</name>
<email>khiremat@redhat.com</email>
</author>
<published>2016-03-07T06:15:07+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=637ce9e2e27e9f598a4a6c5a04cd339efaa62076'/>
<id>637ce9e2e27e9f598a4a6c5a04cd339efaa62076</id>
<content type='text'>
PROBLEM:
   1. Freeing up the rpc_clnt object might lead to crashes. It was not
      necessary to free the rpc-clnt object until now, because all the
      existing use cases need to reconnect on disconnects. Hence the
      timer code was not taking a ref on the rpc-clnt object.

      Glusterd had some use cases that led to crashes due to the
      ping-timer, and only the code paths that involve the ping-timer
      were fixed.

      Now that changelog has a use case where the rpc-clnt needs to be
      freed, we need to fix the timer code to take refs.

   2. In changelog, because of issue 1, only mydata was being freed,
      which is incorrect, and there are races where the rpc-clnt object
      would access the freed mydata, leading to crashes.

      Since the changelog xlator resides on the brick side and is a
      long-living process, if multiple libgfchangelog consumers
      register to changelog and disconnect/reconnect multiple times,
      it results in a leak of the 'rpc-clnt' object on every
      connect/disconnect.

SOLUTION:
   1. Handle ref/unref of 'rpc_clnt' structure in timer
      functions properly.
   2. In changelog, unref 'rpc_clnt' in RPC_CLNT_DISCONNECT
      after disabling timers and free mydata on RPC_CLNT_DESTROY.

RPC SETUP IN CHANGELOG:
   1. The changelog xlator starts an rpc server, say 'changelog_rpc_server'.
   2. libgfchangelog starts an rpc server of its own, say 'libgfchangelog_rpc_server'.
   3. libgfchangelog starts an rpc client and connects to 'changelog_rpc_server'.
   4. In return, changelog_rpc_server starts an rpc client and connects back
      to 'libgfchangelog_rpc_server'.

REF/UNREF HANDLING IN TIMER FUNCTIONS (a sketch follows these steps):
Let's say the rpc clnt refcount = 1
   1. Take a ref before registering a callback to the timer queue
           &gt;&gt;&gt;&gt;  rpc_clnt_ref (say the ref count becomes = 2)
   2. Register a callback to the timer, say 'callback1'
   3. If the registration fails:
           &gt;&gt;&gt;&gt; rpc_clnt_unref (ref count = 1)
   4. On timer expiration, 'callback1' gets called, so unref the rpc clnt
      at the end of 'callback1'. This corresponds to the ref taken in step 1
           &gt;&gt;&gt;&gt; rpc_clnt_unref (ref count = 1)
   5. The cycle from step 1 to step 4 continues until a timer cancel event happens
   6. On timer cancel of, say, 'callback1':
           If the timer cancel fails:
                 Do nothing; step 4 would have unref'd
           If the timer cancel succeeds:
                 &gt;&gt;&gt;&gt; rpc_clnt_unref (ref count = 1)
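
A compact, runnable sketch (toy types, not the actual gf_timer/rpc_clnt code)
of the ref/unref pattern described above:

    #include &lt;stdbool.h&gt;
    #include &lt;stdio.h&gt;

    struct rpc_clnt { int refcount; };

    static void rpc_clnt_ref   (struct rpc_clnt *c) { c-&gt;refcount++; }
    static void rpc_clnt_unref (struct rpc_clnt *c)
    {
        if (--c-&gt;refcount == 0)
            printf ("rpc_clnt destroyed\n");
    }

    /* Step 4: the callback drops the ref taken at registration time. */
    static void callback1 (struct rpc_clnt *c)
    {
        printf ("timer fired\n");
        rpc_clnt_unref (c);
    }

    /* Steps 1-3: take a ref, register, and undo the ref on failure. */
    static bool register_timer (struct rpc_clnt *c, bool fail)
    {
        rpc_clnt_ref (c);       /* step 1 */
        if (fail) {
            rpc_clnt_unref (c); /* step 3 */
            return false;
        }
        return true;            /* step 2 succeeded */
    }

    int main (void)
    {
        struct rpc_clnt clnt = { .refcount = 1 };

        if (register_timer (&amp;clnt, false))
            callback1 (&amp;clnt);  /* step 4 */

        rpc_clnt_unref (&amp;clnt); /* owner's last unref frees the object */
        return 0;
    }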

Change-Id: I91389bc511b8b1a17824941970ee8d2c29a74a09
BUG: 1316178
Signed-off-by: Kotresh HR &lt;khiremat@redhat.com&gt;
Reviewed-on: http://review.gluster.org/13658
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Raghavendra G &lt;rgowdapp@redhat.com&gt;
</content>
</entry>
<entry>
<title>glusterd/rpc : Discard duplicate Disconnect events</title>
<updated>2016-03-22T19:24:53+00:00</updated>
<author>
<name>Atin Mukherjee</name>
<email>amukherj@redhat.com</email>
</author>
<published>2016-03-20T13:01:00+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=1081584d4c2d26e56fea623ecfadd305c6e3d3bc'/>
<id>1081584d4c2d26e56fea623ecfadd305c6e3d3bc</id>
<content type='text'>
If a peer rpc disconnect event has already been processed, skip any
further ones: processing them is pure overhead and can sometimes lead
to a crash, for example due to a double free.
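
A minimal sketch (hypothetical fields, not the actual glusterd peer context)
of discarding duplicate disconnect events with a simple flag:

    #include &lt;stdbool.h&gt;
    #include &lt;stdio.h&gt;

    struct peer_ctx { bool disconnect_handled; };

    static void handle_disconnect (struct peer_ctx *peer)
    {
        if (peer-&gt;disconnect_handled)
            return;                     /* duplicate event, discard it */

        peer-&gt;disconnect_handled = true;
        printf ("cleaning up peer state once\n");
    }

    int main (void)
    {
        struct peer_ctx peer = { .disconnect_handled = false };

        handle_disconnect (&amp;peer);     /* does the cleanup */
        handle_disconnect (&amp;peer);     /* safely ignored   */
        return 0;
    }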

Change-Id: Iec589ce85daf28fd5b267cb6fc82a4238e0e8adc
BUG: 1318546
Signed-off-by: Atin Mukherjee &lt;amukherj@redhat.com&gt;
Reviewed-on: http://review.gluster.org/13790
Smoke: Gluster Build System &lt;jenkins@build.gluster.com&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.com&gt;
Reviewed-by: Jeff Darcy &lt;jdarcy@redhat.com&gt;
</content>
</entry>
<entry>
<title>rpc: Connect back only if rpc is not disabled</title>
<updated>2016-03-08T13:53:07+00:00</updated>
<author>
<name>Kotresh HR</name>
<email>khiremat@redhat.com</email>
</author>
<published>2016-03-02T06:10:24+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=325e62cc01a836058622d6ca8e534c352f954848'/>
<id>325e62cc01a836058622d6ca8e534c352f954848</id>
<content type='text'>
This is to fix a regression caused by the patch below:
http://review.gluster.org/#/c/13456/

As discussed on http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/14284,
patch #13456 caused a regression wherein, if there are any pending rpc
invocations, we end up accessing a freed object. This patch fixes it by
allowing a reconnect during rpc submit only if rpc is not disabled.
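
A small sketch (hypothetical names, not the actual rpc_clnt_submit code) of
the guard: the submit path only re-arms the connection when the client has
not been disabled:

    #include &lt;stdbool.h&gt;
    #include &lt;stdio.h&gt;

    struct rpc_clnt { bool disabled; bool connected; };

    static void rpc_clnt_reconnect (struct rpc_clnt *rpc)
    {
        rpc-&gt;connected = true;
        printf ("reconnect triggered\n");
    }

    static int rpc_submit (struct rpc_clnt *rpc)
    {
        if (!rpc-&gt;connected) {
            /* Only reconnect if the rpc has not been disabled; a
             * disabled rpc may already be on its way to being freed. */
            if (!rpc-&gt;disabled)
                rpc_clnt_reconnect (rpc);
            else
                return -1;      /* refuse, avoiding a use-after-free */
        }
        printf ("request submitted\n");
        return 0;
    }

    int main (void)
    {
        struct rpc_clnt rpc = { .disabled = true, .connected = false };

        if (rpc_submit (&amp;rpc) != 0)
            printf ("submit rejected: rpc is disabled\n");
        return 0;
    }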

Change-Id: I4ef4dd52bd42368bb89129f98bc973e46c6a39f4
BUG: 1295107
Signed-off-by: Kotresh HR &lt;khiremat@redhat.com&gt;
Reviewed-on: http://review.gluster.org/13592
Smoke: Gluster Build System &lt;jenkins@build.gluster.com&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Raghavendra G &lt;rgowdapp@redhat.com&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.com&gt;
</content>
</entry>
</feed>
