glusterfs.git/tests/bugs, branch v7.7

features/shard: Aggregate file size, block-count before unwinding removexattr

2020-07-13T06:29:35+00:00

Posix translator returns pre and postbufs in the dict in {F}REMOVEXATTR fops.
These iatts are further cached at layers like md-cache.
Shard translator, in its current state, simply returns these values without
updating the aggregated file size and block-count.

This patch fixes this problem.

Change-Id: I4b2dd41ede472c5829af80a67401ec5a6376d872
Fixes: #1243
Signed-off-by: Krutika Dhananjay 
(cherry picked from commit 32519525108a2ac6bcc64ad931dc8048d33d64de)

cluster/afr: Prioritize ENOSPC over other errors

2020-06-22T05:42:20+00:00

Problem:
In a replicate/arbiter volume if file creations or writes fails on
quorum number of bricks and on one brick it is due to ENOSPC and
on other brick it fails for a different reason, it may fail with
errors other than ENOSPC in some cases.

Fix:
Prioritize ENOSPC over other lesser priority errors and do not set
op_errno in posix_gfid_set if op_ret is 0 to avoid receiving any
error_no which can be misinterpreted by __afr_dir_write_finalize().

Also removing the function afr_has_arbiter_fop_cbk_quorum() which
might consider a successful reply form a single brick as quorum
success in some cases, whereas we always need fop to be successful
on quorum number of bricks in arbiter configuration.

Change-Id: I106e267f8b9451f681022f1cccb410d9bc824c08
Fixes: #1254
Signed-off-by: karthik-us 
(cherry picked from commit fa63b45ca5edf172b1b89b28b5db3c5129cc57b6)

afr: more quorum checks in lookup and new entry marking

2020-06-18T05:57:18+00:00

Problem: See github issue for details.

Fix:
-In lookup if the entry exists in 2 out of 3 bricks, don't fail the
lookup with ENOENT just because there is an entrylk on the parent.
Consider quorum before deciding.

-If entry FOP does not succeed on quorum no. of bricks, do not perform
new entry mark.

Fixes: #1303
Change-Id: I56df8c89ad53b29fa450c7930a7b7ccec9f4a6c5
Signed-off-by: Ravishankar N 
(cherry picked from commit c4a6748f25d2c1ab3ebcf89952278ebf94c8d371)

features/shard: Aggregate size, block-count in iatt before unwinding setxattr

2020-06-15T12:12:00+00:00

Posix translator returns pre and postbufs in the dict in {F}SETXATTR fops.
These iatts are further cached at layers like md-cache.
Shard translator, in its current state, simply returns these values without
updating the aggregated file size and block-count.

This patch fixes this problem.

Change-Id: I4da0eceb4235b91546df79270bcc0af8cd64e9ea
Fixes: #1243
Signed-off-by: Krutika Dhananjay 
(cherry picked from commit 29ec66c6ab77e2d6893c6e213a3d1fb148702c99)

open-behind: rewrite of internal logic

2020-06-15T10:53:24+00:00

There was a critical flaw in the previous implementation of open-behind.

When an open is done in the background, it's necessary to take a
reference on the fd_t object because once we "fake" the open answer,
the fd could be destroyed. However as long as there's a reference,
the release function won't be called. So, if the application closes
the file descriptor without having actually opened it, there will
always remain at least 1 reference, causing a leak.

To avoid this problem, the previous implementation didn't take a
reference on the fd_t, so there were races where the fd could be
destroyed while it was still in use.

To fix this, I've implemented a new xlator cbk that gets called from
fuse when the application closes a file descriptor.

The whole logic of handling background opens have been simplified and
it's more efficient now. Only if the fop needs to be delayed until an
open completes, a stub is created. Otherwise no memory allocations are
needed.

Correctly handling the close request while the open is still pending
has added a bit of complexity, but overall normal operation is simpler.

Change-Id: I6376a5491368e0e1c283cc452849032636261592
Fixes: #1225
Signed-off-by: Xavi Hernandez

features/utime: Don't access frame after stack-wind

2020-04-07T11:22:36+00:00

Problem:
frame is accessed after stack-wind. This can lead to crash
if the cbk frees the frame.

Fix:
Use new frame for the wind instead.

Fixes: #832
Change-Id: I64754609f1114b0bbd4d1336fa81a56f2cca6e03
Signed-off-by: Pranith Kumar K

write-behind: fix data corruption

2020-04-07T11:20:05+00:00

There was a bug in write-behind that allowed a previous completed write
to overwrite the overlapping region of data from a future write.

Suppose we want to send three writes (W1, W2 and W3). W1 and W2 are
sequential, and W3 writes at the same offset of W2:

    W2.offset = W3.offset = W1.offset + W1.size

Both W1 and W2 are sent in parallel. W3 is only sent after W2 completes.
So W3 should *always* overwrite the overlapping part of W2.

Suppose write-behind processes the requests from 2 concurrent threads:

    Thread 1                    Thread 2

    
                                
    wb_enqueue_tempted(W1)
    /* W1 is assigned gen X */
                                wb_enqueue_tempted(W2)
                                /* W2 is assigned gen X */

                                wb_process_queue()
                                  __wb_preprocess_winds()
                                    /* W1 and W2 are sequential and all
                                     * other requisites are met to merge
                                     * both requests. */
                                    __wb_collapse_small_writes(W1, W2)
                                    __wb_fulfill_request(W2)

                                  __wb_pick_unwinds() -> W2
                                  /* In this case, since the request is
                                   * already fulfilled, wb_inode->gen
                                   * is not updated. */

                                wb_do_unwinds()
                                  STACK_UNWIND(W2)

                                /* The application has received the
                                 * result of W2, so it can send W3. */
                                

                                wb_enqueue_tempted(W3)
                                /* W3 is assigned gen X */

                                wb_process_queue()
                                  /* Here we have W1 (which contains
                                   * the conflicting W2) and W3 with
                                   * same gen, so they are interpreted
                                   * as concurrent writes that do not
                                   * conflict. */
                                  __wb_pick_winds() -> W3

                                wb_do_winds()
                                  STACK_WIND(W3)

    wb_process_queue()
      /* Eventually W1 will be
       * ready to be sent */
      __wb_pick_winds() -> W1
      __wb_pick_unwinds() -> W1
        /* Here wb_inode->gen is
         * incremented. */

    wb_do_unwinds()
      STACK_UNWIND(W1)

    wb_do_winds()
      STACK_WIND(W1)

So, as we can see, W3 is sent before W1, which shouldn't happen.

The problem is that wb_inode->gen is only incremented for requests that
have not been fulfilled but, after a merge, the request is marked as
fulfilled even though it has not been sent to the brick. This allows
that future requests are assigned to the same generation, which could
be internally reordered.

Solution:

Increment wb_inode->gen before any unwind, even if it's for a fulfilled
request.

Special thanks to Stefan Ring for writing a reproducer that has been
crucial to identify the issue.

Change-Id: Id4ab0f294a09aca9a863ecaeef8856474662ab45
Signed-off-by: Xavi Hernandez 
Fixes: #884

afr: mark pending xattrs as a part of metadata heal

2020-04-07T11:09:25+00:00

...if pending xattrs are zero for all children.

Problem:
If there are no pending xattrs and a metadata heal needs to be
performed, it can be possible that we end up with xattrs inadvertendly
deleted from all bricks, as explained in the  BZ.

Fix:
After picking one among the sources as the good copy, mark pending xattrs on
all sources to blame the sinks. Now even if this metadata heal fails midway,
a subsequent heal will still choose one of the valid sources that it
picked previously.

Updates: #1067
Change-Id: If1b050b70b0ad911e162c04db4d89b263e2b8d7b
Signed-off-by: Ravishankar N 
(cherry picked from commit 2d5ba449e9200b16184b1e7fc84cabd015f1f779)

glusterd: Brick process fails to come up with brickmux on

2020-03-17T06:04:40+00:00

Issue:
1- In a cluster of 3 Nodes N1, N2, N3. Create 3 volumes vol1,
vol2, vol3 with 3 bricks (one from each node)
2- Set cluster.brick-multiplex on
3- Start all 3 volumes
4- Check if all bricks on a node are running on same port
5- Kill N1
6- Set performance.readdir-ahead for volumes vol1, vol2, vol3
7- Bring N1 up and check volume status
8- All bricks processes not running on N1.

Root Cause -
Since, There is a diff in volfile versions in N1 as compared
to N2 and N3 therefore glusterd_import_friend_volume() is called.
glusterd_import_friend_volume() copies the new_volinfo and deletes
old_volinfo and then calls glusterd_start_bricks().
glusterd_start_bricks() looks for the volfiles and sends an rpc
request to glusterfs_handle_attach(). Now, since the volinfo
has been deleted by glusterd_delete_stale_volume()
from priv->volumes list before glusterd_start_bricks() and
glusterd_create_volfiles_and_notify_services() and
glusterd_list_add_order is called after glusterd_start_bricks(),
therefore the attach RPC req gets an empty volfile path
and that causes the brick to crash.

Fix- Call glusterd_list_add_order() and
glusterd_create_volfiles_and_notify_services before
glusterd_start_bricks() cal is made in glusterd_import_friend_volume

> Change-Id: Idfe0e8710f7eb77ca3ddfa1cabeb45b2987f41aa
> Bug: bz#1773856
> Signed-off-by: Mohammed Rafi KC 
(cherry picked from commit 45e81aae791da9d013aba2286af44826227c05ec)

Change-Id: Idfe0e8710f7eb77ca3ddfa1cabeb45b2987f41aa
fixes: bz#1808964
Signed-off-by: Sanju Rakonde

afr: prevent spurious entry heals leading to gfid split-brain

2020-02-25T17:07:04+00:00

Problem:
In a hyperconverged setup with granular-entry-heal enabled, if a file is
recreated while one of the bricks is down, and an index heal is triggered
(with the brick still down), entry-self heal was doing a spurious heal
with just the 2 good bricks. It was doing a post-op leading to removal
of the filename from .glusterfs/indices/entry-changes as well as
erroneous setting of afr xattrs on the parent. When the brick came up,
the xattrs were cleared, resulting in the renamed file not getting
healed and leading to gfid split-brain and EIO on the mount.

Fix:
Proceed with entry heal only when shd can connect to all bricks of the replica,
just like in data and metadata heal.

fixes: bz#1804591
Change-Id: I916ae26ad1fabf259bc6362da52d433b7223b17e
Signed-off-by: Ravishankar N 
(cherry picked from commit 06453d77d056fbaa393a137ca277a20e38d2f67e)