glusterfs.git/xlators/cluster/dht/src/dht-common.h, branch v5.11

Land clang-format changes

2018-09-12T11:52:48+00:00

Change-Id: I6f5d8140a06f3c1b2d196849299f8d483028d33b

cluster/dht: Rework the debug xattr to get hashed subvol

2018-09-07T08:55:32+00:00

The earlier implementation required the file to already exist
when trying to get the hashed subvol. The reworked implementation
allows a user to get the hashed subvol for any filename, whether
it exists or not.

Usage: getfattr -n "dht.file.hashed-subvol." 

Eg:To get the hashed subvol for file-1 inside dir-1

getfattr -n "dht.file.hashed-subvol.file-1" /mnt/gluster/dir1

credit: rgowdapp@redhat.com

Change-Id: Iae20bd5f56d387ef48c1c0a4ffa9f692866bf739
fixes: bz#1624244
Signed-off-by: N Balachandran

dht: remove useless argument from dht_iatt_merge

2018-07-18T08:16:01+00:00

The last using of the subvol argument has been removed at 4e1ec35ef4f7
("core: fill 'ia_ino' from 'ia_gfid' in 'storage/posix' ......")
7 years ago (2011-06-16).

Change-Id: I9788d79e2e40cc153cf2960e28c7c1c1033dc8f7
fixes: bz#1601683
Signed-off-by: Kinglong Mee

dht: Inconsistent permission for directories after brick stop/start

2018-07-11T03:13:24+00:00

Problem: Inconsistent access permissions on directories after
         bringing back the down sub-volumes, in case of directories dht_setattr
         first wind a call on MDS once call is finished on MDS then wind a call
         on NON-MDS.At the time of revalidating dht just compare the uid/gid with
         stbuf uid/gid and if anyone differs set a flag to heal the same.

Solution: Add a condition to compare permission also in dht_revalidate_cbk
          to set a flag to call dht_dir_attr_heal.

BUG: 1584517
Change-Id: I3e039607148005015b5d93364536158380d4c5aa
fixes: bz#1584517
Signed-off-by: Mohit Agrawal

cluster/dht: Do not try to use up the readdirp buffer

2018-06-29T11:10:47+00:00

DHT attempts to use up the entire buffer in readdirp before
unwinding in an attempt to reduce the number of calls.
However, this has 2 disadvantages:
1. This can cause a stack overflow when parallel readdir
is enabled. If the buffer only has a little space,rda can send back
only one or two entries. If those entries are stripped out by
dht_readdirp_cbk (linkto files for example) it will once again
wind down to rda in an attempt to fill the buffer before unwinding to FUSE.
This process can continue for several iterations, causing the stack
to grow and eventually overflow, causing the process to crash.
2. If parallel readdir is disabled, dht could send readdirp
calls with small buffers to the bricks, thus increasing the
number of network calls.

We are therefore reverting to the existing behaviour.
Please note, this only mitigates the stack overflow, it does
not prevent it from happening. This is still possible if
a subvol has thousands of linkto files for instance.

Change-Id: I291bc181c5249762d0c4fe27fa4fc2631166adf5
fixes: bz#1593548
Signed-off-by: N Balachandran

cluster/dht: Remove EIO from dht_inode_missing

2018-05-17T06:49:08+00:00

Removed EIO from the list of errnos that triggered
a migrate check task.

Change-Id: I7f89c7a16056421588f1af2377cebe6affddcb47
fixes: bz#1578823
Signed-off-by: N Balachandran

cluster/dht: Debug virtual xattrs for dht

2018-05-09T03:46:36+00:00

Provide a virtual xattr with which to query
the hashed subvol for a file.

Change-Id: Ic7abd031f875da4b9084841ea7c25d6c8a851992
fixes: bz#1574421
Signed-off-by: N Balachandran

cluster/dht: fixes to parallel renames to same destination codepath

2018-05-07T06:04:10+00:00

Test case:
 # while true; do uuid="`uuidgen`"; echo "some data" > "test$uuid"; mv
   "test$uuid" "test" -f || break; echo "done:$uuid"; done

 This script was run in parallel from multiple mountpoints

Along the course of getting the above usecase working, many issues
were found:

Issue 1:
=======
consider a case of rename (src, dst). We can encounter a situation
where,
* dst is a file present at the time of lookup
* dst is removed by the time rename fop reaches glusterfs

In this scenario, acquring inodelk on dst fails with ESTALE resulting
in failure of rename. However, as per POSIX irrespective of whether
dst is present or not, rename should be successful. Acquiring entrylk
provides synchronization even in races like this.

Algorithm:
1. Take inodelks on src and dst (if dst is present) on respective
   cached subvols. These inodelks are done to preserve backward
   compatibility with older clients, so that synchronization is
   preserved when a volume is mounted by clients of different
   versions. Once relevant older versions (3.10, 3.12, 3.13) reach
   EOL, this code can be removed.
2. Ignore ENOENT/ESTALE errors of inodelk on dst.
3. protect namespace of src and dst. To protect namespace of a file,
   take inodelk on parent on hashed subvol, then take entrylk on the
   same subvol on parent with basename of file. inodelk on parent is
   done to guard against changes to parent layout so that hashed
   subvol won't change during rename.
4. 
5. unlock all locks

Issue 2:
========
linkfile creation in lookup codepath can race with a rename. Imagine
the following scenario:
* lookup finds a data-file with gfid - gfid-dst - without a
  corresponding linkto file on hashed-subvol. It decides to create
  linkto file with gfid - gfid-dst.
    - Note that some codepaths of dht-rename deletes linkto file of
      dst as first step. So, a lookup racing with an in-progress
      rename can easily run into this situation.
* a rename (src-path:gfid-src, dst-path:gfid-dst) renames data-file
  and hence gfid of data-file changes to gfid-src with path dst-path.
* lookup proceeds and creates linkto file - dst-path - with gfid -
  dst-gfid - on hashed-subvol.
* rename tries to create a linkto file dst-path with src-gfid on
  hashed-subvol, but it fails with EEXIST. But EEXIST is ignored
  during linkto file creation.

Now we've ended with dst-path having different gfids - dst-gfid on
linkto file and src-gfid on data file. Future lookups on dst-path will
always fail with ESTALE, due to differing gfids.

The fix is to synchronize linkfile creation in lookup path with rename
using the same mechanism of protecting namespace explained in solution
of Issue 1. Once locks are acquired, before proceeding with linkfile
creation, we check whether conditions for linkto file creation are
still valid. If not, we skip linkto file creation.

Issue 3:
========
gfid of dst-path can change by the time locks are acquired. This
means, either another rename overwrote dst-path or dst-path was
deleted and recreated by a different client. When this happens,
cached-subvol for dst can change. If rename proceeds with old-gfid and
old-cached subvol, we'll end up in inconsistent state(s) like dst-path
with different gfids on different subvols, more than one data-file
being present etc.

Fix is to do the lookup with a new inode after protecting namespace of
dst. Post lookup, we've to compare gfids and correct local state
appropriately to be in sync with backend.

Issue 4:
========
During revalidate lookup, if following a linkto file doesn't lead to a
valid data-file, local->cached-subvol was not reset to NULL. This
means we would be operating on a stale state which can lead to
inconsistency. As a fix, reset it to NULL before proceeding with
lookup everywhere.

Issue 5:
========
Stale dentries left out in inode table on brick resulted in failures
of link fop even though the file/dentry didn't exist on backend fs. A
patch is submitted to fix this issue. Please check the dependency tree
of current patch on gerrit for details

In short, we fix the problem by not blindly trusting the
inode-table. Instead we validate whether dentry is present by doing
lookup on backend fs.

Change-Id: I832e5c47d232f90c4edb1fafc512bf19bebde165
updates: bz#1543279
BUG: 1543279
Signed-off-by: Raghavendra G

cluster/dht: log error only if layout healing is required

2018-05-07T05:37:06+00:00

selfhealing of directory is invoked on two conditions:
1. no layout on disk or layout has some anomalies (holes/overlaps)
2. mds xattr is not set on the directory

When dht_selfheal_directory is called with a correct layout just to
set mds xattr, we see error msgs complaining about "not able to form
layout on directory", which is misleading as the layout is
correct. So, log this msg only if layout has anomalies.

Change-Id: I4af25246fc3a2450c2426e9902d1a5b372eab125
updates: bz#1543279
BUG: 1543279
Signed-off-by: Raghavendra G

cluster/dht: store the 'reaction' on failures per lock

2018-02-23T03:19:58+00:00

Currently its passed in dht_blocking_inode(entry)lk, which would be a
global value for all the locks passed in the argument. This would
be a limitation for cases where we want to ignore failures on only few
locks and fail for others.

Change-Id: I02cfbcaafb593ad8140c0e5af725c866b630fb6b
BUG: 1543279
Signed-off-by: Raghavendra G