glusterfs.git/xlators/protocol/client/src/client.h, branch v3.7.10

protocol/client : porting log messages to new framework

2015-06-17T12:11:43+00:00

        Backport of http://review.gluster.org/#/c/10042/

Cherry picked from 379dbbfd683d2b0e1704c098b1f020567328122c
> Change-Id: I9bf2ca08fef969e566a64475d0f7a16d37e66eeb
> BUG: 1194640
> Signed-off-by: Manikandan Selvaganesh 
> Reviewed-on: http://review.gluster.org/10042
> Tested-by: Gluster Build System 
> Reviewed-by: Raghavendra G 
> Tested-by: Raghavendra G 

Change-Id: I9bf2ca08fef969e566a64475d0f7a16d37e66eeb
BUG: 1217722
Signed-off-by: Manikandan Selvaganesh 
Reviewed-on: http://review.gluster.org/11240
Tested-by: Gluster Build System 
Tested-by: NetBSD Build System 
Reviewed-by: Raghavendra G 
Tested-by: Raghavendra G

cluster/dht: Change the subvolume encoding in d_off to be a "global"

2015-03-18T11:47:41+00:00

position in the graph rather than relative (local) to a particular
translator.

Encoding the volume in this way allows a single translator to manage
which brick is currently being scanned for directory entries. Using a
single translator minimizes allocated bits in the d_off. It also allows
multiple DHT translators in the same graph to have a common frame of
reference (the graph position) for which brick is being read. Multiple
DHT translators are needed for the Tiering feature.

The fix builds off a previous change (9332) which removed subvolume
encoding from AFR. The fix makes an equivalent change to the EC
translator.

More background can be found in fix 9332 and gluster-dev discussions [1].

DHT and AFR/EC are responsible (as before) for choosing which brick to
enumerate directory entries in over the readdir lifecycle.

The client translator receiving the readdir fop encodes the dht_t. It
is referred to as the "leaf node" in the graph and corresponds to the
brick being scanned.

When DHT decodes the d_off, it translates the leaf node to a local
subvolume, which represents the next node in the graph leading to
the brick.
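
A minimal sketch of the encode/decode idea described above, not the
actual GlusterFS bit layout. It assumes the top bits of a 64-bit d_off
carry the global leaf index and the remaining bits carry the brick's own
offset. LEAF_BITS, doff_encode() and doff_decode() are hypothetical
names introduced only for illustration.

```c
#include <stdint.h>

/* Hypothetical layout: the top LEAF_BITS bits of the 64-bit d_off hold the
 * global leaf (brick) index; the remaining bits hold the brick's offset. */
#define LEAF_BITS 8
#define OFF_BITS  (64 - LEAF_BITS)
#define OFF_MASK  ((UINT64_C(1) << OFF_BITS) - 1)

/* Packs a brick offset and a leaf index into one d_off.
 * Returns 0 on success, -1 if either value does not fit. */
static int doff_encode(uint64_t brick_off, unsigned leaf, uint64_t *out)
{
    if (brick_off > OFF_MASK || leaf >= (1u << LEAF_BITS))
        return -1;
    *out = ((uint64_t)leaf << OFF_BITS) | brick_off;
    return 0;
}

/* Splits a d_off back into the brick offset and the leaf index. */
static void doff_decode(uint64_t d_off, uint64_t *brick_off, unsigned *leaf)
{
    *leaf = (unsigned)(d_off >> OFF_BITS);
    *brick_off = d_off & OFF_MASK;
}
```

In this sketch a DHT-like caller would map the decoded leaf index to
whichever of its own subvolumes leads to that brick.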

Tracking of leaf nodes is done in common utility functions. Leaf node
counts and positional information are updated on a graph switch.

[1] www.gluster.org/pipermail/gluster-devel/2015-January/043592.html

Change-Id: Iaf0ea86d7046b1ceadbad69d88707b243077ebc8
BUG: 1190734
Signed-off-by: Dan Lambright 
Reviewed-on: http://review.gluster.org/9688
Reviewed-by: Xavier Hernandez 
Reviewed-by: Krishnan Parthasarathi 
Reviewed-by: Vijay Bellur 
Tested-by: Vijay Bellur

protocol/client: defer cleanup of private until RPC notifications are handled.

2015-03-02T09:16:51+00:00

This fix is required for glfs_fini to be able to perform fini on client
xlators in a graph. We are deferring freeing of client xlator's private
until all RPC related resources are destroyed. This guarantees that
client xlator would free RPC related resources provided its private
structures are still accessible via its this pointer.

'Weak' property: If there are no epoll threads executing after calling
fini() on a client xlator, then all its RPC related resources are
guaranteed to be freed. We can now free the corresponding 'this'
pointer.

Change-Id: Ie00b14dda096ac128e1c37e0032f07d17fd701ce
BUG: 1093594
Signed-off-by: Krishnan Parthasarathi 
Reviewed-on: http://review.gluster.org/9680
Reviewed-by: Rajesh Joseph 
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

protocol/client: sequence CHILD_UP, CHILD_DOWN etc notifications

2015-02-07T21:25:26+00:00

... from all bricks in the volume

This patch is important in the context of MT epoll. With MT epoll,
notification events from client xlators could reach cluster xlators like
afr, dht, ec, stripe etc. in different orders.

For example, in a distributed replicate volume of 2 bricks, namely Brick1
and Brick2, the following network events are observed by a mount
process.

- connection to Brick1 is broken.
- connection to Brick1 has been restored.

- connection to Brick2 is broken.
- connection to Brick2 has been restored.

Without establishing a total ordering of events, we can't guarantee that
cluster xlators like afr, dht perceive them in the same order.  While we
would expect afr (say) to perceive it as only one of Brick1 and Brick2
going down at any given time, it is possible for the notification of
Brick2 going offline to race with the notification of Brick1 coming back
online.
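
One way to obtain such a total order, sketched under assumptions: every
notifying thread delivers its event while holding one shared lock, so
deliveries cannot interleave. This is an illustration only and is not
taken from the patch; struct event_log, event_log_init() and
notify_parent() are hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>

enum child_event { CHILD_DOWN, CHILD_UP };

#define MAX_EVENTS 64

/* Hypothetical parent-side view: the sequence of events as a cluster
 * xlator would perceive them. */
struct event_log {
    pthread_mutex_t lock;              /* shared by every notifying thread */
    size_t count;
    int brick[MAX_EVENTS];
    enum child_event event[MAX_EVENTS];
};

static void event_log_init(struct event_log *log)
{
    pthread_mutex_init(&log->lock, NULL);
    log->count = 0;
}

/* Delivers one notification. Delivery happens entirely under the shared
 * lock, so two notifications never interleave and every observer sees the
 * same single order. Returns 0 on success, -1 if the log is full. */
static int notify_parent(struct event_log *log, int brick, enum child_event ev)
{
    int ret = -1;

    pthread_mutex_lock(&log->lock);
    if (log->count < MAX_EVENTS) {
        log->brick[log->count] = brick;
        log->event[log->count] = ev;
        log->count++;
        ret = 0;
    }
    pthread_mutex_unlock(&log->lock);
    return ret;
}
```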

Change-Id: I78f5a52bfb05593335d0e9ad53ebfff98995593d
BUG: 1104462
Signed-off-by: Krishnan Parthasarathi 
Reviewed-on: http://review.gluster.org/9591
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

epoll: Adding the ability to configure epoll threads

2015-02-07T21:23:03+00:00

Add the ability to configure the number of event threads
for various gluster services.

With the multi-threaded epoll patch, more than one thread can wait on
socket activity and process it. This thread count is currently static;
this commit makes it dynamic.

The services that use the IO path, i.e. brick processes and any client
process (nfs, FUSE, gfapi, heal, rebalance, etc.), gain 2 settable
parameters to control the number of threads that process events. These
settings are,
  - client.event-threads
  - server.event-threads

The client setting affects the client graph consumers, and the
server setting affects the brick processes. These are processed
and inited/reconfigured using the client/server protocol xlators.

Other services (say glusterd) would need to extend similar
configuration settings to take advantage of multi threaded event
processing.

At present glusterd is not enabled with this commit, as it does not
stand to gain from this multi-threading (as I understand it).

Change-Id: Id8422fc57a9f95a135158eb6477ccf9d3c9ea4d9
BUG: 1104462
Signed-off-by: Shyam 
Reviewed-on: http://review.gluster.org/9488
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

protocol/client: Prevent "Dereference after NULL check" errors.

2015-01-16T09:03:41+00:00

Fixes 46 defects marked as "Dereference after NULL check" errors
in coverity scan for client xlator.
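
A minimal sketch of the defect class and its usual fix, not code from
the patch. The flagged pattern checks a pointer for NULL and then
dereferences it on a path the check did not exclude. struct xl, struct
conf and is_connected() are hypothetical names.

```c
#include <stddef.h>

struct conf { int connected; };
struct xl   { struct conf *private; };

/* Flagged pattern (shown as a comment only):
 *
 *     if (!this)
 *         goto out;
 *     ...
 * out:
 *     log_error(this->name);    <- 'this' may be NULL on this path
 *
 * Fixed pattern: return (or skip the dereference) on every path where the
 * pointer was found to be NULL. */
static int is_connected(const struct xl *this)
{
    const struct conf *conf = NULL;

    if (this == NULL)
        return 0;

    conf = this->private;
    if (conf == NULL)
        return 0;

    return conf->connected;
}
```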

Change-Id: I0b4c991a3995ce74d7885fc5470ec7f5c589b411
BUG: 789278
Signed-off-by: Vijay Bellur 
Reviewed-on: http://review.gluster.org/9287
Tested-by: Gluster Build System 
Reviewed-by: Krishnan Parthasarathi 
Reviewed-by: Niels de Vos

rdma: client connection establishment takes more time

2014-11-18T08:50:13+00:00

For an rdma-only volume, establishing the client connection to the
server takes more than three seconds.

A tcp,rdma volume has 2 ports, one for tcp and one for rdma. The tcp
port is stored under the brick name and the rdma port under
"brickname.rdma" during pmap_signin. During the handshake, when an rdma
client asks for the brick port, it is not aware of the server transport
type, so it appends '.rdma' to the brick name. For a tcp,rdma volume
there is an entry with '.rdma', but for an rdma-only volume the lookup
fails. The client then tries again without appending '.rdma', using the
flag variable need_different_port, and succeeds, but the reconnection
happens only after 3 seconds.

With this patch, an rdma-only volume also appends '.rdma' during
pmap_signin, so the handshake gets the correct port on the first try.
Since no retry is needed, the need_different_port flag variable is
removed.
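
A sketch of the naming rule described above, assuming the port map is
keyed by a plain string. pmap_key() is a hypothetical helper, not a
GlusterFS function. If both sign-in and lookup build the key with the
same rule, the first lookup succeeds.

```c
#include <stdio.h>
#include <string.h>

/* Builds the key under which a brick's port is registered or looked up:
 * the brick name, with ".rdma" appended for the rdma transport.
 * Returns 0 on success, -1 if the key does not fit in buf. */
static int pmap_key(char *buf, size_t len, const char *brick, int is_rdma)
{
    int n = snprintf(buf, len, "%s%s", brick, is_rdma ? ".rdma" : "");

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```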

Change-Id: Ie8e3a7f532d4104829dbe995e99b35e95571466c
BUG: 1153569
Signed-off-by: Mohammed Rafi KC 
Reviewed-on: http://review.gluster.org/8934
Tested-by: Gluster Build System 
Reviewed-by: Krishnan Parthasarathi 
Reviewed-by: Raghavendra G 
Tested-by: Raghavendra G

rpc: implement server.manage-gids for group resolving on the bricks

2014-05-09T19:22:39+00:00

The new volume option 'server.manage-gids' can be enabled in
environments where a user belongs to more than the current absolute
maximum of 93 groups. This option triggers the following behavior:

1. The AUTH_GLUSTERFS structure sent by GlusterFS clients (fuse, nfs or
   libgfapi) will contain only one (1) auxiliary group, instead of
   a full list. This reduces network usage and prevents problems in
   encoding the AUTH_GLUSTERFS structure which should fit in 400 bytes.
2. The single group in the RPC Calls received by the server is replaced
   by resolving the groups server-side. Permission checks and similar in
   lower xlators are applied against the full list of groups to which
   the user belongs, and not the single auxiliary group that the client
   sent.

Change-Id: I9e540de13e3022f8b63ff893ecba511129a47b91
BUG: 1053579
Signed-off-by: Niels de Vos 
Reviewed-on: http://review.gluster.org/7501
Tested-by: Gluster Build System 
Reviewed-by: Santosh Pradhan 
Reviewed-by: Harshavardhana 
Reviewed-by: Anand Avati

protocol/client: conn-id should be unique when lk-heal is off

2014-02-17T16:09:38+00:00

Problem:
It was observed that in some cases the client disconnects and
re-connects before the server xlator can detect that a disconnect
happened, so the server still uses the previous fdtable and ltable.
Between the disconnect and the re-connect, an 'unlock' fop may fail
because the fds are marked 'bad' in the client xlator upon disconnect.
As a result, stale locks remain on the brick, which leads to hangs,
self-heals not happening, etc.

For the exact bug RCA please look at
https://bugzilla.redhat.com/show_bug.cgi?id=1049932#c0

Fix:
When lk-heal is not enabled make sure connection-id is different for
every setvolume. This will make sure that a previous connection's
resources are not re-used in server xlator.
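
A minimal sketch of the idea, assuming the connection id is a string
derived from a stable process identifier. The exact format used by the
client xlator is not given here; struct conn_state and make_conn_id()
are hypothetical names.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct conn_state {
    const char *process_uuid;  /* stable for the life of the client process */
    int lk_heal;               /* non-zero when lk-heal is enabled */
    uint64_t setvolume_count;  /* incremented on every setvolume */
};

/* Writes the connection id for the next setvolume into buf. With lk-heal
 * off, a per-setvolume counter is appended so that no two handshakes share
 * an id. With lk-heal on, the id stays the same across reconnects.
 * Returns 0 on success, -1 if the id does not fit in buf. */
static int make_conn_id(struct conn_state *st, char *buf, size_t len)
{
    int n;

    st->setvolume_count++;
    if (st->lk_heal)
        n = snprintf(buf, len, "%s", st->process_uuid);
    else
        n = snprintf(buf, len, "%s-%llu", st->process_uuid,
                     (unsigned long long)st->setvolume_count);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```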

Change-Id: Id844aaa76dfcf2740db72533bca53c23b2fe5549
BUG: 1049932
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/6669
Tested-by: Gluster Build System 
Reviewed-by: Krishnan Parthasarathi 
Reviewed-by: Vijay Bellur

libglusterfs: Add monotonic clocking counter for timer thread

2013-10-15T07:14:57+00:00

gettimeofday() returns the current wall clock time and timezone.
Using this function in order to measure the passage of time
(how long an operation took) therefore seems like a no-brainer.

Wall clock timestamps suffer from some limitations:

a. They have a low resolution: “High-performance” timing, by
definition, requires clock resolutions of microseconds
or better.

b. They can jump forwards and backwards in time: Computer
clocks all tick at slightly different rates, which causes
the time to drift. Most systems have NTP enabled, which
periodically adjusts the system clock to keep it in sync
with “actual” time. The adjustment can cause the clock to
suddenly jump forward (artificially inflating your timing
numbers) or jump backwards (causing your timing calculations
to go negative or hugely positive). In such cases the timer
thread could go into an infinite loop.

From 'man gettimeofday':
----------
..
..
The time returned by gettimeofday() is affected by discontinuous
jumps in the system time (e.g., if the system administrator manually
changes the system time).  If you need a monotonically increasing
clock, see clock_gettime(2).
..
..
----------

Rationale:

For calculating interval timing for the timer thread, all that’s
needed is a clock that acts as a simple counter incrementing
at a stable rate.

To avoid the jumps caused by using "wall time", this counter
must be monotonic: it can never “tick” backwards, ever.
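
A minimal sketch of interval timing on a monotonic clock, using the
clock_gettime(2) call that the man page excerpt points to. It is an
illustration, not the code added by this change; monotonic_ns() and
timer_expired() are hypothetical helper names.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Nanoseconds read from CLOCK_MONOTONIC, which is not affected by
 * discontinuous jumps in the system time. Returns 0 on failure. */
static uint64_t monotonic_ns(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return 0;
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}

/* Returns non-zero once at least timeout_ns has elapsed between start_ns
 * and now_ns. Both values must come from the same monotonic clock. */
static int timer_expired(uint64_t start_ns, uint64_t now_ns,
                         uint64_t timeout_ns)
{
    return now_ns - start_ns >= timeout_ns;
}
```

Because the counter never moves backwards, the subtraction in
timer_expired() cannot go negative or wrap to a huge value.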

Change-Id: I701d31e71a85a73d21a6c5cd15583e7a5a645eeb
BUG: 1017993
Signed-off-by: Harshavardhana 
Reviewed-on: http://review.gluster.org/6070
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati