glusterfs.git/xlators/performance, branch release-3.8

performance/read-ahead: prevent stale data being returned to application.

2017-05-11T09:49:52+00:00

Assume that fd is shared by two application threads/processes.

T0 read is triggered from app-thread t1 and read call passes through
   write-behind.
T1 app-thread t2 issues a write. The page on which the read from t1 is
   waiting is marked stale.
T2 write-behind caches the write and reports it to the application as
   complete.
T3 app-thread t2 issues a read to the same region. Since there is already
   a page for that region (created as part of the read at T0), this read
   request waits on that page to be filled (though it is stale, which
   is a bug).
T4 read (triggered at T0) completes from brick (with the write still
   pending). Now both read requests from t1 and t2 are served this data
   (though the data is stale from app-thread t2's perspective, which is
   a bug).
T5 write is flushed to brick by write-behind.

The fix is to not serve data from a stale page, but instead to initiate a
fresh read to the back-end.
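
The rule stated in the fix can be sketched as a small decision function.
This is only an illustration under simplifying assumptions, not the
read-ahead xlator's code: page_t, its two flags, read_action_t and
ra_serve_read() are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified page descriptor (not the xlator's real type). */
typedef struct {
    bool ready;   /* data from the brick has arrived */
    bool stale;   /* a write touched this region after the read was issued */
} page_t;

typedef enum {
    SERVE_FROM_PAGE,   /* page is valid and filled: serve cached data */
    WAIT_ON_PAGE,      /* page is valid but still being filled */
    FRESH_READ         /* no usable page: read from the back-end */
} read_action_t;

/* Decide how to serve a read for the region covered by `page`.
 * A stale page is never used, whether or not its data has arrived. */
static read_action_t
ra_serve_read(const page_t *page)
{
    if (page == NULL || page->stale)
        return FRESH_READ;
    return page->ready ? SERVE_FROM_PAGE : WAIT_ON_PAGE;
}
```

In the timeline above, the read at T3 would find a stale, unfilled page and
take the FRESH_READ branch instead of waiting on it.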

>Change-Id: Id6af733464fa41bb4e81fd29c7451c73d06453fb
>BUG: 1414242
>Signed-off-by: Raghavendra G 
>Reviewed-on: https://review.gluster.org/7447
>Smoke: Gluster Build System 
>CentOS-regression: Gluster Build System 
>Reviewed-by: Csaba Henk 
>NetBSD-regression: NetBSD Build System 
>Reviewed-by: Zhou Zhengping 
>Reviewed-by: Amar Tumballi 

(cherry picked from commit 2ff39c5cbea6fbda0d7a442f55e6dc2a72efb171)
Change-Id: Id6af733464fa41bb4e81fd29c7451c73d06453fb
BUG: 1449314
Signed-off-by: Raghavendra G 
Reviewed-on: https://review.gluster.org/17223
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Niels de Vos 
Smoke: Gluster Build System

performance/io-threads: Exit threads in fini() as well

2017-01-18T02:46:34+00:00

Problem:
io-threads starts its threads in 'init()' but doesn't clean them up
in 'fini()'. It relies on PARENT_DOWN to exit the threads, but there can
be cases where, even before PARENT_UP, the graph init code decides to
issue fini(). This code path is hit when glfs_init() is called
on a volume that is in 'stopped' state. It leads to a crash in the ganesha
process, because the io-thread tries to access freed memory.

Fix:
The ideal fix would be to wait for all fops in the io-thread list to be
completed on PARENT_DOWN, and have fini() do the cleanup of threads. Because
there is no proper documentation about how PARENT_DOWN/fini are supposed to
be used, we are getting different kinds of sequences in different
higher-level protocols. So for now, clean up in both PARENT_DOWN and fini().
Fuse doesn't call fini() and gfapi does not call PARENT_DOWN in some cases,
so for now I don't see another way out.
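
Cleaning up from two entry points is only safe if the second call is
harmless. A minimal sketch of that idea follows. It is an assumption about
how such a teardown can be structured, not the patch itself: iot_state_t,
iot_stop_threads(), on_parent_down() and on_fini() are hypothetical names,
and the actual thread signalling is reduced to a placeholder.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the xlator's private state. */
typedef struct {
    int  live_threads;     /* worker threads still running */
    bool threads_stopped;  /* set once teardown has run */
    int  teardown_runs;    /* how many times the real work was done */
} iot_state_t;

/* Stop the workers exactly once, no matter how many callers ask. */
static void
iot_stop_threads(iot_state_t *st)
{
    if (st->threads_stopped)
        return;            /* second caller: nothing left to do */
    st->live_threads = 0;  /* placeholder for signalling and joining workers */
    st->threads_stopped = true;
    st->teardown_runs++;
}

/* Both lifecycle hooks funnel into the same idempotent teardown. */
static void on_parent_down(iot_state_t *st) { iot_stop_threads(st); }
static void on_fini(iot_state_t *st)        { iot_stop_threads(st); }
```

With this shape, a protocol that only calls fini(), one that only delivers
PARENT_DOWN, and one that does both all end with the threads stopped once.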

 >BUG: 1396793
 >Change-Id: I9c9154e7d57198dbaff0f30d3ffc25f6d8088aec
 >Signed-off-by: Pranith Kumar K 
 >Reviewed-on: http://review.gluster.org/15888
 >Smoke: Gluster Build System 
 >CentOS-regression: Gluster Build System 
 >NetBSD-regression: NetBSD Build System 
 >Reviewed-by: Raghavendra G 
 >(cherry picked from commit 25817a8c868b6c1b8149117f13e4216a99e453aa)

BUG: 1412941
Change-Id: I5e36a7d253f2ef8abce507eced1eb7073cff930c
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/16397
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Smoke: Gluster Build System

performance/io-threads: Exit all threads on PARENT_DOWN

2017-01-17T15:58:14+00:00

Problem:
When glfs_fini() is called on a volume where client.io-threads is enabled,
fini() frees the io-threads xlator's private structure, but some threads may
still be sleeping. They wake up when their timedwait completes and then
access the already freed memory.

Fix:
As part of parent-down, exit all sleeping threads.

Please note that the upstream patch differs a little from this one,
because the least-prio-throttling feature is removed from master, 3.9.

 >BUG: 1381830
 >Change-Id: I0bb8d90241112c355fb22ee3fbfd7307f475b339
 >Signed-off-by: Pranith Kumar K 
 >Reviewed-on: http://review.gluster.org/15620
 >Smoke: Gluster Build System 
 >CentOS-regression: Gluster Build System 
 >NetBSD-regression: NetBSD Build System 
 >Reviewed-by: Raghavendra G 
 >(cherry picked from commit d7a5ca16911caca03cec1112d4be56a9cda2ee30)

BUG: 1412941
Change-Id: I6341156251279b24ab2323cedf1b9722e42da671
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/16396
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

performance/open-behind: Avoid deadlock in statedump

2016-11-11T07:22:30+00:00

Problem:
open-behind takes fd->lock and then inode->lock, whereas statedump takes
inode->lock and then fd->lock, which leads to a deadlock.

In open-behind, the following code exists:
void
ob_fd_free (ob_fd_t *ob_fd)
{
        loc_wipe (&ob_fd->loc); <<--- this takes (inode->lock)
.......
}

int
ob_wake_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
             int op_ret, int op_errno, fd_t *fd_ret, dict_t *xdata)
{
	.......
        LOCK (&fd->lock); <<---- fd->lock
        {
	.......
                __fd_ctx_del (fd, this, NULL);
                ob_fd_free (ob_fd); <<<---------------
        }
        UNLOCK (&fd->lock);
.......
}
=================================================================
In statedump this code exists:
inode_dump (inode_t *inode, char *prefix)
{
.......
	ret = TRY_LOCK(&inode->lock); <<---- inode->lock
.......
	fd_ctx_dump (fd, prefix); <<<-----
.......
}
fd_ctx_dump (fd_t *fd, char *prefix)
{
.......
        LOCK (&fd->lock); <<<------------------ this takes fd-lock
        {
.......
}

Fix:
Make sure open-behind doesn't call ob_fd_free() while holding fd->lock.
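
One common way to achieve this is to detach the object while holding the
lock and free it after unlocking. The sketch below illustrates that pattern
with plain pthread mutexes. It is not the open-behind code: inode_sk_t,
fd_sk_t, ob_ctx_t, ob_ctx_free() and ob_wake_done() are hypothetical,
simplified stand-ins.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical, simplified types; the real fd_t/inode_t are richer. */
typedef struct { pthread_mutex_t lock; int refs; } inode_sk_t;
typedef struct { inode_sk_t *inode; } ob_ctx_t;
typedef struct { pthread_mutex_t lock; ob_ctx_t *ctx; } fd_sk_t;

/* Takes inode->lock, as loc_wipe() does in the message above. */
static void
ob_ctx_free(ob_ctx_t *ctx)
{
    if (ctx == NULL)
        return;
    pthread_mutex_lock(&ctx->inode->lock);
    ctx->inode->refs--;
    pthread_mutex_unlock(&ctx->inode->lock);
    free(ctx);
}

/* Detach the context under fd->lock, free it after unlocking. */
static void
ob_wake_done(fd_sk_t *fd)
{
    ob_ctx_t *ctx = NULL;

    pthread_mutex_lock(&fd->lock);
    ctx = fd->ctx;
    fd->ctx = NULL;
    pthread_mutex_unlock(&fd->lock);

    ob_ctx_free(ctx);   /* inode->lock is taken with fd->lock released */
}
```

No path in this sketch holds fd->lock while acquiring inode->lock, so it
cannot form a cycle with a dumper that takes the locks in the other order.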

 >BUG: 1393259
 >Change-Id: I4abdcfc5216270fa1e2b43f7b73445f49e6d6e6e
 >Signed-off-by: Pranith Kumar K 
 >Reviewed-on: http://review.gluster.org/15808
 >Smoke: Gluster Build System 
 >NetBSD-regression: NetBSD Build System 
 >CentOS-regression: Gluster Build System 
 >Reviewed-by: Poornima G 
 >Reviewed-by: Raghavendra G 

BUG: 1393682
Change-Id: I45a0fbed683ef6acb7900df87534927f332fdaaa
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/15818
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G

md-cache: Invalidate cache entry for open() with O_TRUNC

2016-11-04T03:19:47+00:00

When a file is opened with O_TRUNC flag set, its size gets
set to '0'. This case needs to be handled in md-cache to
avoid sending incorrect cached stat.

This is a backport of the mainline patch below:
http://review.gluster.org/#/c/15618/
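
The check described above can be sketched in a few lines. This is an
illustration, not md-cache's code: cached_stat_t and mdc_on_open() are
hypothetical names, and the cache is reduced to a single validity flag.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical cached-attribute record; not md-cache's real structure. */
typedef struct {
    bool  valid;   /* cached stat may be served */
    off_t size;    /* cached file size */
} cached_stat_t;

/* Called when an open() with `flags` succeeds. O_TRUNC changes the file
 * size, so the cached stat can no longer be trusted. */
static void
mdc_on_open(cached_stat_t *c, int flags)
{
    if (flags & O_TRUNC)
        c->valid = false;
}
```

After an invalidation, the next stat would have to be fetched from the brick
instead of being answered with the old size.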

> Change-Id: I95d1f8a6634734898883ede010c3e7b0b7eb97d9
> BUG: 1382266
> Signed-off-by: Soumya Koduri 
> Reviewed-on: http://review.gluster.org/15618
> Smoke: Gluster Build System 
> Reviewed-by: jiffin tony Thottan 
> Tested-by: jiffin tony Thottan 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Reviewed-by: Kaleb KEITHLEY 
> (cherry picked from commit 6ca5d6382f03685b31b12accb095093cf1486603)

Change-Id: I92349f5b48aef07f3790db7aae25bfa2ddb5947e
BUG: 1391450
Signed-off-by: Soumya Koduri 
Reviewed-on: http://review.gluster.org/15771
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Kaleb KEITHLEY

performance/write-behind: fix flush stuck by former failed writes

2016-11-03T09:28:13+00:00

The issue happens in this case:
assume a file is opened with fd1 and fd2.
1. Some WRITE ops to fd1 got errors; they were added back to the 'todo'
   queue because of those errors.
2. fd2 is closed, and a FLUSH op is sent to write-behind.
3. The FLUSH cannot be unwound because it is not a legal waiter for those
   failed writes (as the function __wb_request_waiting_on() says), and
   those failed WRITEs cannot complete as long as fd1 is not closed. fd2
   is stuck in the close syscall.

To resolve this issue, we can change the way we determine whether two
requests conflict: a flush/fsync does not conflict with writes that do
not belong to its fd, so __wb_pick_winds() can wind the FLUSH op.
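
The changed conflict rule can be sketched as a predicate. This is a
simplified illustration, not the write-behind code: req_t, req_kind_t and
blocks() are hypothetical, and every case other than the flush/fsync
exemption is treated as a conflict here, whereas the real check also
considers other conditions.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { REQ_WRITE, REQ_FLUSH, REQ_FSYNC } req_kind_t;

/* Hypothetical, reduced request: only the fields the rule needs. */
typedef struct {
    req_kind_t kind;
    int        fd_id;   /* identifies the fd the request was issued on */
} req_t;

/* Does the queued write `blocker` hold back `waiter`? */
static bool
blocks(const req_t *blocker, const req_t *waiter)
{
    bool is_sync = (waiter->kind == REQ_FLUSH || waiter->kind == REQ_FSYNC);

    if (is_sync && blocker->fd_id != waiter->fd_id)
        return false;  /* writes from another fd are not this flush's concern */
    return true;       /* simplification: anything else counts as a conflict */
}
```

In the scenario above, the failed writes belong to fd1 and the FLUSH to fd2,
so the predicate returns false and the FLUSH is free to be wound.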

Below is some information from when the stuck issue happened:
glusterdump logs:
[xlator.performance.write-behind.wb_inode]
path=/ltp-F9eG0ZSOME/rw-buffered-16436
inode=0x7fdbe8039b9c
window_conf=1048576
window_current=249856
transit-size=0
dontsync=0

[.WRITE]
request-ptr=0x7fdbe8020200
refcount=1
wound=no
generation-number=4
req->op_ret=-1
req->op_errno=116
sync-attempts=3
sync-in-progress=no
size=131072
offset=1220608
lied=-1
append=0
fulfilled=0
go=0

[.WRITE]
request-ptr=0x7fdbe8068c30
refcount=1
wound=no
generation-number=5
req->op_ret=-1
req->op_errno=116
sync-attempts=2
sync-in-progress=no
size=118784
offset=1351680
lied=-1
append=0
fulfilled=0
go=0

[.FLUSH]
request-ptr=0x7fdbe8021cd0
refcount=1
wound=no
generation-number=6
req->op_ret=0
req->op_errno=0
sync-attempts=0

gdb details of the above 3 requests:
(gdb) print *((wb_request_t *)0x7fdbe8021cd0)
$2 = {all = {next = 0x7fdbe803a608, prev = 0x7fdbe8068c30}, todo = {next
= 0x7fdbe803a618, prev = 0x7fdbe8068c40}, lie = {next = 0x7fdbe8021cf0,
    prev = 0x7fdbe8021cf0}, winds = {next = 0x7fdbe8021d00, prev =
0x7fdbe8021d00}, unwinds = {next = 0x7fdbe8021d10, prev =
0x7fdbe8021d10}, wip = {
    next = 0x7fdbe8021d20, prev = 0x7fdbe8021d20}, stub =
0x7fdbe80224dc, write_size = 0, orig_size = 0, total_size = 0, op_ret =
0, op_errno = 0,
  refcount = 1, wb_inode = 0x7fdbe803a5f0, fop = GF_FOP_FLUSH, lk_owner
= {len = 8, data = "W\322T\f\271\367y$", '\000' },
  iobref = 0x0, gen = 6, fd = 0x7fdbe800f0dc, wind_count = 0, ordering =
{size = 0, off = 0, append = 0, tempted = 0, lied = 0, fulfilled = 0,
    go = 0}}
(gdb) print *((wb_request_t *)0x7fdbe8020200)
$3 = {all = {next = 0x7fdbe8068c30, prev = 0x7fdbe803a608}, todo = {next
= 0x7fdbe8068c40, prev = 0x7fdbe803a618}, lie = {next = 0x7fdbe8068c50,
    prev = 0x7fdbe803a628}, winds = {next = 0x7fdbe8020230, prev =
0x7fdbe8020230}, unwinds = {next = 0x7fdbe8020240, prev =
0x7fdbe8020240}, wip = {
    next = 0x7fdbe8020250, prev = 0x7fdbe8020250}, stub =
0x7fdbe8062c3c, write_size = 131072, orig_size = 4096, total_size = 0,
op_ret = -1,
  op_errno = 116, refcount = 1, wb_inode = 0x7fdbe803a5f0, fop =
GF_FOP_WRITE, lk_owner = {len = 8, data = '\000' },
  iobref = 0x7fdbe80311a0, gen = 4, fd = 0x7fdbe805c89c, wind_count = 3,
ordering = {size = 131072, off = 1220608, append = 0, tempted = -1,
    lied = -1, fulfilled = 0, go = 0}}
(gdb) print *((wb_request_t *)0x7fdbe8068c30)
$4 = {all = {next = 0x7fdbe8021cd0, prev = 0x7fdbe8020200}, todo = {next
= 0x7fdbe8021ce0, prev = 0x7fdbe8020210}, lie = {next = 0x7fdbe803a628,
    prev = 0x7fdbe8020220}, winds = {next = 0x7fdbe8068c60, prev =
0x7fdbe8068c60}, unwinds = {next = 0x7fdbe8068c70, prev =
0x7fdbe8068c70}, wip = {
    next = 0x7fdbe8068c80, prev = 0x7fdbe8068c80}, stub =
0x7fdbe806746c, write_size = 118784, orig_size = 4096, total_size = 0,
op_ret = -1,
  op_errno = 116, refcount = 1, wb_inode = 0x7fdbe803a5f0, fop =
GF_FOP_WRITE, lk_owner = {len = 8, data = '\000' },
  iobref = 0x7fdbe8052b10, gen = 5, fd = 0x7fdbe805c89c, wind_count = 2,
ordering = {size = 118784, off = 1351680, append = 0, tempted = -1,
    lied = -1, fulfilled = 0, go = 0}}

You can see they are all on the 'todo' queue, and the FLUSH op's fd is
not the same as the WRITE ops' fd.

> Change-Id: Id687f9cd3b9f281e1a97c83f1ce981ede272b8ab
> BUG: 1372211
> Signed-off-by: Ryan Ding 

Change-Id: Id687f9cd3b9f281e1a97c83f1ce981ede272b8ab
BUG: 1390838
Signed-off-by: Ryan Ding 
Reviewed-on: http://review.gluster.org/15762
Tested-by: Raghavendra G 
Reviewed-by: Raghavendra G 
Smoke: Gluster Build System 
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System

performance/write-behind: remove the request from liability queue in wb_fulfill_request

2016-10-18T05:37:09+00:00

Before this patch, a request is removed from the liability queue only when
the ref count of the request hits 0. Though wb_fulfill_request does an
unref, it need not be the last unref, and hence the request may survive in
the liability queue till the last unref. Let,

T1: the time at which wb_fulfill_request is invoked
T2: the time at which last unref is done on request

Let's consider a case of T2 > T1. In the time window between T1 and
T2, any other request (waiter) conflicting with request in liability
queue (blocker - basically a write which has been lied) is blocked
from winding. If T2 happens to be when wb_do_unwinds is invoked, no
further processing of request list happens and "waiter" would get
blocked forever. An example imaginary sequence of events is given
below:

1. A write request w1 is picked up for unwinding in __wb_pick_unwinds
   (but unwind is not done _yet_ and hence reference
   remains). However, w1 is moved to liability queue. Let's call this
   invocation of wb_process_queue by wb_writev as PQ1.

2. A flush (f1) request hits write behind. Since the liability queue
   of inode is not empty, f1 is not picked for unwinding. Let's call
   the invocation of wb_process_queue by wb_flush as PQ2.

3. PQ2 continues and picks w1 for fulfilling and invokes
   wb_fulfill. As part of successful wb_fulfill_cbk,
   wb_fulfill_request (w1) is invoked. But, w1 is not freed (and hence
   not removed from liability queue) as w1 is not unwound _yet_ and a
   ref remains (PQ1 has not invoked wb_do_unwinds _yet_).

4. wb_fulfill_cbk (triggered by PQ2) invokes a wb_process_queue (let's
   say PQ3). f1 is not resumed in PQ3 as w1 is still in liability
   queue. At this time, PQ2 and PQ3 are complete.

5. PQ1 continues, unwinds w1 and does last unref on w1 and w1 is freed
   (and removed from liability queue). Since PQ1 didn't invoke
   wb_fulfill on any other write requests, there won't be any future
   codepaths that would invoke wb_process_queue and f1 is stuck
   forever.

With this fix, w1 is removed from liability queue in step 3 above and
PQ3 resumes f1 in step 4 (as there are no requests conflicting with f1
in liability queue during execution of PQ3).
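
The difference between "removed at last unref" and "removed at fulfill" can
be sketched with a reduced request type. This is an illustration under
simplifying assumptions, not the write-behind code: wreq_t, wreq_unref(),
wreq_fulfill() and flush_can_proceed() are hypothetical, and the liability
queue is modelled as a single flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced request; the real one lives on several lists. */
typedef struct {
    int  refcount;
    bool on_liability;   /* still counted as a lied, unfulfilled write */
} wreq_t;

/* Drop one reference; returns true when this was the last one. */
static bool
wreq_unref(wreq_t *r)
{
    assert(r->refcount > 0);
    r->refcount--;
    if (r->refcount == 0) {
        r->on_liability = false;  /* final cleanup still clears it */
        return true;
    }
    return false;
}

/* The write has reached the brick: leave the liability queue now,
 * even if other holders (e.g. a pending unwind) still keep a ref. */
static void
wreq_fulfill(wreq_t *r)
{
    r->on_liability = false;
    wreq_unref(r);
}

/* A flush may proceed only when no lied write is outstanding. */
static bool
flush_can_proceed(const wreq_t *w)
{
    return !w->on_liability;
}
```

In the sequence above, w1 would still hold one reference after step 3, yet
a check like flush_can_proceed() already succeeds when PQ3 runs in step 4.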

> Signed-off-by: Raghavendra G 
> BUG: 1379655
> Change-Id: Idacda1fcd520ac27f30224f8dfe8360dba6ac6cb
> Reviewed-on: http://review.gluster.org/15579
> CentOS-regression: Gluster Build System 
> NetBSD-regression: NetBSD Build System 
> Smoke: Gluster Build System 
(cherry picked from commit a8b2a981881221925bb5edfe7bb65b25ad855c04)

Signed-off-by: Raghavendra G 
BUG: 1385620
Change-Id: Idacda1fcd520ac27f30224f8dfe8360dba6ac6cb
Reviewed-on: http://review.gluster.org/15658
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

performance/open-behind: Pass O_DIRECT flags for anon fd reads when required

2016-09-26T04:42:23+00:00

        Backport of: http://review.gluster.org/15537
        cherry-picked from a412a4f50d8ca2ae68dbfa93b80757889150ce99

Writes are already passing the correct flags at the time of open().

Also, make io-cache honor direct-io for anon-fds with
O_DIRECT flag during reads.

Change-Id: I9eb89c3bda34f9287861eb3b53c3d6a7b967c105
BUG: 1375959
Signed-off-by: Krutika Dhananjay 
Reviewed-on: http://review.gluster.org/15552
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

performance/decompounder: Add graph for decompounder xlator

2016-06-14T15:16:51+00:00

This xlator will fall below protocol/server.
This is a mandatory xlator without any options.

Observed that the callbacks for the decompounder translator
were not added, which was causing volume start
to fail.
Added cbks for decompounder.
master:
http://review.gluster.org/#/c/13968/

Change-Id: I3e16a566376338d9c6d36d6fbc7bf295fda9f3a6
BUG: 1346222
Signed-off-by: Ashish Pandey 
Reviewed-on: http://review.gluster.org/14729
Reviewed-by: Ravishankar N 
Smoke: Gluster Build System 
Reviewed-by: Anuradha Talur 
CentOS-regression: Gluster Build System 
Reviewed-by: Niels de Vos 
NetBSD-regression: NetBSD Build System

readdir-ahead: Prefetch xattrs needed by md-cache

2016-05-10T16:42:38+00:00

Problem:
Negative cache feature implementation in md-cache requires xattrs
returned by posix to be intercepted for every call that can possibly
return xattrs. This includes readdirp(). This is crucial to treat
missing keys in cache as a case of negative entry (returns ENODATA).

md-cache puts names of xattrs that it wants to cache in xdata and
passes it down to posix which returns the specified xattrs in the
callback. This is done in lookup() and readdirp(). Hence, a xattr
that is cached can be invalidated during readdirp_cbk too.

This is based on the assumption that readdirp() will always return
all xattrs that md-cache is interested in. However, this is not the
case when readdirp() call is served from readdir-ahead's cache.
The readdir-ahead xlator will pre-fetch dentries during opendir_cbk
and readdirp. These internal readdirp() calls made by the readdir-ahead
xlator do not set xdata in their requests. Hence, no xattrs are
fetched and stored in its internal cache.

This causes metadata loss in gluster-swift. md-cache returns ENODATA
during getxattr() call even though the xattr for that object exists on
the brick. On receiving ENODATA, gluster-swift will create new metadata
and do setxattr(). This results in loss of information stored in
existing xattr.

Fix:
During opendir, md-cache will communicate to readdir-ahead asking it
to store the names of xattrs it's interested in so that readdir-ahead
can fetch those in all subsequent internal readdirp() calls issued by
it. These stored xattr names are invalidated/updated on the next
real readdirp() call issued by the application. This readdirp() call will
have xdata set correctly by the md-cache xlator.
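
The store-and-refresh behaviour can be sketched as follows. This is an
illustration only, not the readdir-ahead code: rda_ctx_t, rda_store_keys(),
rda_prefetch_requests() and the fixed-size limits are hypothetical, and the
real xlators pass the names in xdata dictionaries rather than string arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_KEYS 8
#define MAX_KEY_LEN 64

/* Hypothetical per-directory context kept by the prefetcher. */
typedef struct {
    char   keys[MAX_KEYS][MAX_KEY_LEN];
    size_t nkeys;
} rda_ctx_t;

/* Remember the xattr names requested by the caller (opendir or a real
 * readdirp). Replaces whatever was stored before. Names that do not fit
 * are skipped. */
static void
rda_store_keys(rda_ctx_t *ctx, const char **names, size_t n)
{
    ctx->nkeys = 0;
    for (size_t i = 0; i < n && ctx->nkeys < MAX_KEYS; i++) {
        if (strlen(names[i]) >= MAX_KEY_LEN)
            continue;
        strcpy(ctx->keys[ctx->nkeys++], names[i]);
    }
}

/* Would an internal prefetch readdirp ask the brick for `name`? */
static int
rda_prefetch_requests(const rda_ctx_t *ctx, const char *name)
{
    for (size_t i = 0; i < ctx->nkeys; i++)
        if (strcmp(ctx->keys[i], name) == 0)
            return 1;
    return 0;
}
```

Calling rda_store_keys() at opendir and again on each application-issued
readdirp keeps the prefetch requests aligned with what the cache above
currently wants.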

Cherry picked from commit 0c73e7050c4d30ace0c39cc9b9634e9c1b448cfb:
> BUG: 1333023
> Change-Id: I32d46f93a99d4ec34c741f3c52b0646d141614f9
> Reviewed-on: http://review.gluster.org/14214
> Tested-by: Prashanth Pai 
> NetBSD-regression: NetBSD Build System 
> CentOS-regression: Gluster Build System 
> Tested-by: Gluster Build System 
> Smoke: Gluster Build System 
> Reviewed-by: Raghavendra G 
> Tested-by: Raghavendra G 

BUG: 1334699
Change-Id: I32d46f93a99d4ec34c741f3c52b0646d141614f9
Signed-off-by: Prashanth Pai 
Reviewed-on: http://review.gluster.org/14281
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Niels de Vos 
Smoke: Gluster Build System