glusterfs.git/xlators/performance, branch exp

performance/write-behind: Add more fops for checking dependency with

2016-12-12T13:05:46+00:00

cached writes

Fops like readdirp, link, fallocate, discard, zerofill return iatt of
files in their responses. This iatt can be cached by md-cache. Hence
it is important that write-behind maintains relative ordering of these
fops with cached writes. Failure to do so, can result in md-cache
storing stale iatts and returning the same to applications.

Change-Id: Icfe12ad807e42fe9e52a9f63e47ce63f511c6946
BUG: 1390050
Signed-off-by: Raghavendra G 
Reviewed-on: http://review.gluster.org/15757
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

dht/md-cache: Filter invalidate if the file is made a linkto file

2016-12-02T10:46:58+00:00

Upcall as a part of setattr, sends an invalidation and the
invalidation carries the resulting stat value. When a file
is converted to linkto files, even then an invalidation
is set and as a result the mountpoint shows the sticky
bit in the stat of the file.
eg: ---------T. 945 root root 0 Nov  8 10:14 hardlink.999

Fix:
When dht recieves a notification of sticky bit change, it updates
the flag, to indicate md-cache to send the subsequent lookup.

Change-Id: Ic2fd7a5b196db0754f9b97072e644e6bf69da606
BUG: 1392713
Signed-off-by: Poornima G 
Reviewed-on: http://review.gluster.org/15789
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Niels de Vos 
CentOS-regression: Gluster Build System 
Reviewed-by: Susant Palai 
Reviewed-by: Rajesh Joseph

io-cache: Fix a read hang

2016-11-23T13:11:07+00:00

Issue:
=====
In certain cases, there was no unwind of read
from read-ahead xlator, thus resulting in hang.

RCA:
====
In certain cases, ioc_readv() issues STACK_WIND_TAIL() instead
of STACK_WIND(). One such case is when inode_ctx for that file
is not present (can happen if readdirp was called, and populates
md-cache and serves all the lookups from cache).

Consider the following graph:
...
io-cache (parent)
   |
readdir-ahead
   |
read-ahead
...

Below is the code snippet of ioc_readv calling STACK_WIND_TAIL:
ioc_readv()
{
...
 if (!inode_ctx)
   STACK_WIND_TAIL (frame, FIRST_CHILD (frame->this),
                    FIRST_CHILD (frame->this)->fops->readv, fd,
                    size, offset, flags, xdata);
   /* Ideally, this stack_wind should wind to readdir-ahead:readv()
      but it winds to read-ahead:readv(). See below for
      explaination.
    */
...
}

STACK_WIND_TAIL (frame, obj, fn, ...)
{
  frame->this = obj;
  /* for the above mentioned graph, frame->this will be readdir-ahead
   * frame->this = FIRST_CHILD (frame->this) i.e. readdir-ahead, which
   * is as expected
   */
  ...
  THIS = obj;
  /* THIS will be read-ahead instead of readdir-ahead!, as obj expands
   * to "FIRST_CHILD (frame->this)" and frame->this was pointing
   * to readdir-ahead in the previous statement.
   */
  ...
  fn (frame, obj, params);
  /* fn will call read-ahead:readv() instead of readdir-ahead:readv()!
   * as fn expands to "FIRST_CHILD (frame->this)->fops->readv" and
   * frame->this was pointing ro readdir-ahead in the first statement
   */
  ...
}

Thus, the readdir-ahead's readv() implementation will be skipped, and
ra_readv() will be called with frame->this = "readdir-ahead" and
this = "read-ahead". This can lead to corruption / hang / other problems.
But in this perticular case, when 'frame->this' and 'this' passed
to ra_readv() doesn't match, it causes ra_readv() to call ra_readv()
again!. Thus the logic of read-ahead readv() falls apart and leads to
hang.

Solution:
=========
Ideally, STACK_WIND_TAIL() should be modified as:
STACK_WIND_TAIL (frame, obj, fn, ...)
{
  next_xl = obj /* resolve obj as the variables passed in obj macro
                   can be overwritten in the further instrucions */
  next_xl_fn = fn /* resolve fn and store in a tmp variable, before
                     modifying any variables */
  frame->this = next_xl;
  ...
  THIS = next_xl;
  ...
  next_xl_fn (frame, next_xl, params);
  ...
}
But for this solution, knowing the type of variable 'next_xl_fn' is
a challenge and is not easy. Hence just modifying all the existing
callers to pass "FIRST_CHILD (this)" as obj, instead of
"FIRST_CHILD (frame->this)".

Change-Id: I179ffe3d1f154bc5a1935fd2ee44e912eb0fbb61
BUG: 1388292
Signed-off-by: Poornima G 
Reviewed-on: http://review.gluster.org/15901
Smoke: Gluster Build System 
Reviewed-by: Raghavendra G 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

performance/io-threads: Exit threads in fini() as well

2016-11-22T07:01:18+00:00

Problem:
io-threads starts the thread in 'init()' but doesn't clean them up
on 'fini()'. It relies on PARENT_DOWN to exit threads but there can
be cases where event before PARENT_UP the graph init code can think
of issuing fini(). This code path is hit when glfs_init() is called
on a volume that is in 'stopped' state. It leads to a crash in ganesha
process, because the io-thread tries to access freed memory.

Fix:
Ideal fix would be to wait for all fops in io-thread list to be completed on
PARENT_DOWN, and have fini() do cleanup of threads. Because there is no proper
documentation about how PARENT_DOWN/fini are supposed to be used,
we are getting different kinds of sequences in different higher level protocols.
So for now cleaning up in both PARENT_DOWN and fini(). Fuse doesn't call fini()
gfapi is not calling PARENT_DOWN in some cases, so for now I don't see
another way out.

BUG: 1396793
Change-Id: I9c9154e7d57198dbaff0f30d3ffc25f6d8088aec
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/15888
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Raghavendra G

afr,dht,ec: Replace GF_EVENT_CHILD_MODIFIED with event SOME_DESCENDENT_DOWN/UP

2016-11-21T09:32:05+00:00

Currently these are few events related to child_up/down:
GF_EVENT_CHILD_UP :  Issued when any of the protocol client
connects.
GF_EVENT_CHILD_MODIFIED : Issued by afr/dht/ec
GF_EVENT_CHILD_DOWN : Issued when any of the protocol client
disconnects.
These events get modified at the dht/afr/ec layers. Here is a
brief on the same.

DHT:
- All the subvolumes reported once, and atleast one child came
  up, then GF_EVENT_CHILD_UP is issued
- connect GF_EVENT_CHILD_UP is issued
- disconnect GF_EVENT_CHILD_MODIFIED is issued
- All the subvolumes disconnected, GF_EVENT_CHILD_DOWN is issued

AFR:
- First subvolume came up, then GF_EVENT_CHILD_UP is issued
- Subsequent subvolumes coming up, results in GF_EVENT_CHILD_MODIFIED
- Any of the subvolumes go down, then GF_EVENT_SOME_CHILD_DOWN is issued
- Last up subvolume goes down, then GF_EVENT_CHILD_DOWN is issued

Until the patch [1] introduced GF_EVENT_SOME_CHILD_UP,
GF_EVENT_CHILD_MODIFIED was issued by afr/dht when any of the subvolumes
go up or down.

Now with md-cache changes, there is a necessity to differentiate between
child up and down. Hence, introducing GF_EVENT_SOME_DESCENDENT_DOWN/UP and
getting rid of GF_EVENT_CHILD_MODIFIED.

[1] http://review.gluster.org/12573

Change-Id: I704140b6598f7ec705493251d2dbc4191c965a58
BUG: 1396038
Signed-off-by: Poornima G 
Reviewed-on: http://review.gluster.org/15764
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Smoke: Gluster Build System 
Reviewed-by: N Balachandran 
Reviewed-by: Pranith Kumar Karampuri 
Reviewed-by: Rajesh Joseph

performance/open-behind: Avoid deadlock in statedump

2016-11-10T06:42:57+00:00

Problem:
open-behind is taking fd->lock then inode->lock where as statedump is taking
inode->lock then fd->lock, so it is leading to deadlock

In open-behind, following code exists:
void
ob_fd_free (ob_fd_t *ob_fd)
{
        loc_wipe (&ob_fd->loc); <<--- this takes (inode->lock)
.......
}

int
ob_wake_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
             int op_ret, int op_errno, fd_t *fd_ret, dict_t *xdata)
{
	.......
        LOCK (&fd->lock); <<---- fd->lock
        {
	.......
                __fd_ctx_del (fd, this, NULL);
                ob_fd_free (ob_fd); <<<---------------
        }
        UNLOCK (&fd->lock);
.......
}
=================================================================
In statedump this code exists:
inode_dump (inode_t *inode, char *prefix)
{
.......
	ret = TRY_LOCK(&inode->lock); <<---- inode->lock
.......
	fd_ctx_dump (fd, prefix); <<<-----
.......
}
fd_ctx_dump (fd_t *fd, char *prefix)
{
.......
        LOCK (&fd->lock); <<<------------------ this takes fd-lock
        {
.......
}

Fix:
Make sure open-behind doesn't call ob_fd_free() inside fd->lock

BUG: 1393259
Change-Id: I4abdcfc5216270fa1e2b43f7b73445f49e6d6e6e
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/15808
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Poornima G 
Reviewed-by: Raghavendra G

performance/write-behind: fix flush stuck by former failed writes

2016-11-02T04:36:35+00:00

the issue is happened in this case:
assume a file is opened with fd1 and fd2.
1. some WRITE opto fd1 got error, they were add back to 'todo' queue
   because of those error.
2. fd2 closed, a FLUSH op is send to write-behind.
3. FLUSH can not be unwind because it's not a legal waiter for those
   failed write(as func __wb_request_waiting_on() say). and those failed
   WRITE also can not be ended if fd1 is not closed. fd2 stuck in close
   syscall.

to resolve this issue, we can change the way we determine 2 requests is
'conflict': flush/fsync is not conflict with those write that is not
belonged to them. so __wb_pick_winds() can wind the FLUSH op.

below is some information when the stuck issue happen:
glusterdump logs:
[xlator.performance.write-behind.wb_inode]
path=/ltp-F9eG0ZSOME/rw-buffered-16436
inode=0x7fdbe8039b9c
window_conf=1048576
window_current=249856
transit-size=0
dontsync=0

[.WRITE]
request-ptr=0x7fdbe8020200
refcount=1
wound=no
generation-number=4
req->op_ret=-1
req->op_errno=116
sync-attempts=3
sync-in-progress=no
size=131072
offset=1220608
lied=-1
append=0
fulfilled=0
go=0

[.WRITE]
request-ptr=0x7fdbe8068c30
refcount=1
wound=no
generation-number=5
req->op_ret=-1
req->op_errno=116
sync-attempts=2
sync-in-progress=no
size=118784
offset=1351680
lied=-1
append=0
fulfilled=0
go=0

[.FLUSH]
request-ptr=0x7fdbe8021cd0
refcount=1
wound=no
generation-number=6
req->op_ret=0
req->op_errno=0
sync-attempts=0

gdb detail about above 3 requests:
(gdb) print *((wb_request_t *)0x7fdbe8021cd0)
$2 = {all = {next = 0x7fdbe803a608, prev = 0x7fdbe8068c30}, todo = {next
= 0x7fdbe803a618, prev = 0x7fdbe8068c40}, lie = {next = 0x7fdbe8021cf0,
    prev = 0x7fdbe8021cf0}, winds = {next = 0x7fdbe8021d00, prev =
0x7fdbe8021d00}, unwinds = {next = 0x7fdbe8021d10, prev =
0x7fdbe8021d10}, wip = {
    next = 0x7fdbe8021d20, prev = 0x7fdbe8021d20}, stub =
0x7fdbe80224dc, write_size = 0, orig_size = 0, total_size = 0, op_ret =
0, op_errno = 0,
  refcount = 1, wb_inode = 0x7fdbe803a5f0, fop = GF_FOP_FLUSH, lk_owner
= {len = 8, data = "W\322T\f\271\367y$", '\000' },
  iobref = 0x0, gen = 6, fd = 0x7fdbe800f0dc, wind_count = 0, ordering =
{size = 0, off = 0, append = 0, tempted = 0, lied = 0, fulfilled = 0,
    go = 0}}
(gdb) print *((wb_request_t *)0x7fdbe8020200)
$3 = {all = {next = 0x7fdbe8068c30, prev = 0x7fdbe803a608}, todo = {next
= 0x7fdbe8068c40, prev = 0x7fdbe803a618}, lie = {next = 0x7fdbe8068c50,
    prev = 0x7fdbe803a628}, winds = {next = 0x7fdbe8020230, prev =
0x7fdbe8020230}, unwinds = {next = 0x7fdbe8020240, prev =
0x7fdbe8020240}, wip = {
    next = 0x7fdbe8020250, prev = 0x7fdbe8020250}, stub =
0x7fdbe8062c3c, write_size = 131072, orig_size = 4096, total_size = 0,
op_ret = -1,
  op_errno = 116, refcount = 1, wb_inode = 0x7fdbe803a5f0, fop =
GF_FOP_WRITE, lk_owner = {len = 8, data = '\000' },
  iobref = 0x7fdbe80311a0, gen = 4, fd = 0x7fdbe805c89c, wind_count = 3,
ordering = {size = 131072, off = 1220608, append = 0, tempted = -1,
    lied = -1, fulfilled = 0, go = 0}}
(gdb) print *((wb_request_t *)0x7fdbe8068c30)
$4 = {all = {next = 0x7fdbe8021cd0, prev = 0x7fdbe8020200}, todo = {next
= 0x7fdbe8021ce0, prev = 0x7fdbe8020210}, lie = {next = 0x7fdbe803a628,
    prev = 0x7fdbe8020220}, winds = {next = 0x7fdbe8068c60, prev =
0x7fdbe8068c60}, unwinds = {next = 0x7fdbe8068c70, prev =
0x7fdbe8068c70}, wip = {
    next = 0x7fdbe8068c80, prev = 0x7fdbe8068c80}, stub =
0x7fdbe806746c, write_size = 118784, orig_size = 4096, total_size = 0,
op_ret = -1,
  op_errno = 116, refcount = 1, wb_inode = 0x7fdbe803a5f0, fop =
GF_FOP_WRITE, lk_owner = {len = 8, data = '\000' },
  iobref = 0x7fdbe8052b10, gen = 5, fd = 0x7fdbe805c89c, wind_count = 2,
ordering = {size = 118784, off = 1351680, append = 0, tempted = -1,
    lied = -1, fulfilled = 0, go = 0}}

you can see they are all on 'todo' queue, and FLUSH op fd is not the
same WRITE op fd.

Change-Id: Id687f9cd3b9f281e1a97c83f1ce981ede272b8ab
BUG: 1372211
Signed-off-by: Ryan Ding 
Reviewed-on: http://review.gluster.org/15380
Tested-by: Raghavendra G 
Reviewed-by: Raghavendra G 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Smoke: Gluster Build System

md-cache: Invalidate cache entry for open() with O_TRUNC

2016-10-27T05:46:32+00:00

When a file is opened with O_TRUNC flag set, its size gets
set to '0'. This case needs to be handled in md-cache to
avoid sending incorrect cached stat.

Change-Id: I95d1f8a6634734898883ede010c3e7b0b7eb97d9
BUG: 1382266
Signed-off-by: Soumya Koduri 
Reviewed-on: http://review.gluster.org/15618
Smoke: Gluster Build System 
Reviewed-by: jiffin tony Thottan 
Tested-by: jiffin tony Thottan 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Kaleb KEITHLEY

performance/io-threads: Exit all threads on PARENT_DOWN

2016-10-23T09:32:05+00:00

Problem:
When glfs_fini() is called on a volume where client.io-threads is enabled,
fini() will free up iothread xl's private structure but there would be some
threads that are sleeping which would wakeup after the timedwait completes
leading to accessing already free'd memory.

Fix:
As part of parent-down, exit all sleeping threads.

BUG: 1381830
Change-Id: I0bb8d90241112c355fb22ee3fbfd7307f475b339
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/15620
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Raghavendra G

performance/write-behind: remove the request from liability queue in

2016-10-17T04:53:10+00:00

wb_fulfill_request

Before this patch, a request is removed from liability queue only when
ref count of request hits 0. Though, wb_fulfill_request does an unref,
it need not be the last unref and hence the request may survive in
liability queue till the last unref. Let,

T1: the time at which wb_fulfill_request is invoked
T2: the time at which last unref is done on request

Let's consider a case of T2 > T1. In the time window between T1 and
T2, any other request (waiter) conflicting with request in liability
queue (blocker - basically a write which has been lied) is blocked
from winding. If T2 happens to be when wb_do_unwinds is invoked, no
further processing of request list happens and "waiter" would get
blocked forever. An example imaginary sequence of events is given
below:

1. A write request w1 is picked up for unwinding in __wb_pick_unwinds
   (but unwind is not done _yet_ and hence reference
   remains). However, w1 is moved to liability queue. Let's call this
   invocation of wb_process_queue by wb_writev as PQ1.

2. A flush (f1) request hits write behind. Since the liability queue
   of inode is not empty, f1 is not picked for unwinding. Let's call
   the invocation of wb_process_queue by wb_flush as PQ2.

3. PQ2 continues and picks w1 for fulfilling and invokes
   wb_fulfill. As part of successful wb_fulfill_cbk,
   wb_fulfill_request (w1) is invoked. But, w1 is not freed (and hence
   not removed from liability queue) as w1 is not unwound _yet_ and a
   ref remains (PQ1 has not invoked wb_do_unwinds _yet_).

4. wb_fulfill_cbk (triggered by PQ2) invokes a wb_process_queue (let's
   say PQ3). f1 is not resumed in PQ3 as w1 is still in liability
   queue. At this time, PQ2 and PQ3 are complete.

5. PQ1 continues, unwinds w1 and does last unref on w1 and w1 is freed
   (and removed from liability queue). Since PQ1 didn't invoke
   wb_fulfill on any other write requests, there won't be any future
   codepaths that would invoke wb_process_queue and f1 is stuck
   forever.

With this fix, w1 is removed from liability queue in step 3 above and
PQ3 resumes f1 in step 4 (as there are no requests conflicting with f1
in liability queue during execution of PQ3).

Signed-off-by: Raghavendra G 
BUG: 1379655
Change-Id: Idacda1fcd520ac27f30224f8dfe8360dba6ac6cb
Reviewed-on: http://review.gluster.org/15579
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Smoke: Gluster Build System