| Commit message (Collapse) | Author | Age | Files | Lines |
|
| |
Two glusterfs clients return inconsistent errnos when the bricks of the volume
were down. Consider two gluster mounts. Mount 1 was done when the bricks were
online. Mount 2 was done after the bricks were killed (using the 'glusterfs'
command instead of the mount script).
For any request, mount 1 will return ENOTCONN, whereas mount 2 will return
ENOENT.
This happens because for the 2nd mount, fuse would send a lookup on '/' for
any request, as it hadn't been done yet. The client xlator returns ENOTCONN,
but dht_lookup_dir_cbk changed this to ENOENT unconditionally when
aggregating. So, fuse returned ENOENT, even though the errno should have been
ENOTCONN.
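The aggregation problem described above can be sketched as follows. This is only an illustration, not the actual dht_lookup_dir_cbk code: the helper name aggregate_lookup_errno and the rule "ENOTCONN is kept once seen" are assumptions made for the example.

```c
#include <errno.h>

/* Hypothetical helper (not from the glusterfs tree): fold the errno
 * reported by one subvolume into the aggregate for the whole lookup.
 * Assumption: once any child reports ENOTCONN, that errno is kept so a
 * disconnected brick is not masked by a generic ENOENT. */
int aggregate_lookup_errno(int aggregate, int child_errno)
{
    if (child_errno == 0)
        return aggregate;            /* this child succeeded */
    if (aggregate == ENOTCONN || child_errno == ENOTCONN)
        return ENOTCONN;             /* connectivity errors win */
    return child_errno;              /* otherwise report the latest failure */
}
```

With a rule like this, a mount made while the bricks are down would surface ENOTCONN from the first lookup on '/', matching what the already-connected mount reports.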
backporting http://review.gluster.org/6072
BUG: 1019095
Change-Id: Iaa40dffefddfcaf1ab7736f5423d7f9d2ece1363
Original-author: Kaushal M <kaushal@redhat.com>
Signed-off-by: shishir gowda <gowda.shishir@gmail.com>
Reviewed-on: http://review.gluster.org/6471
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Harshavardhana <harsha@harshavardhana.net>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
BUG: 1057846
Change-Id: I19051c19a54c8aab37eb7cb32dde9f7e9e77c073
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/6854
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
In some code paths neither loc->gfid nor loc->inode->gfid
is populated which leads to EINVAL for linkfile setattr
in dht_linkfile_attr_heal.
Fix:
Populate loc->gfid before dht_linkfile_attr_heal.
BUG: 971805
Change-Id: I8e4b7510ee5c38aa9ccf5283c7165c7df25ec62b
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/6691
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Backport of http://review.gluster.org/4971
If unlink of linkfile returns ENOENT, do not fail unlink.
Proceed with unlinking of cached file.
Change-Id: If7cec92b40c39d68dd9c3606c6c2c3a6bd67d27b
BUG: 966848
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/6586
Reviewed-by: Harshavardhana <harsha@harshavardhana.net>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
| |
This reverts commit 837422858c2e4ab447879a4141361fd382645406
Change-Id: I0909f26ce088454bb14b3694b489c672286a4ae6
Reviewed-on: http://review.gluster.org/6575
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
| |
Change-Id: I74818a03f7c5d7891561515af2fa35ea3775255c
BUG: 1032894
Signed-off-by: Vijay Bellur <vbellur@redhat.com>
Reviewed-on: http://review.gluster.org/6582
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Change-Id: Ib727948c6e21b19fd509f258ff0aea1c5d1a84d1
BUG: 966845
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/5056
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-on: http://review.gluster.org/6517
Reviewed-by: Shishir Gowda <gowda.shishir@gmail.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We were wrongly detecting holes/overlaps for already accounted
errors. Additionally, sort should also handle zeroed-out layouts.
Change-Id: Ic3d13e1d735b914f9acc01fe919bc90656baea48
BUG: 1003851
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/5762
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/6469
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, we send GF_READDIR_SKIP_DIRS for all subvolumes if
first_subvol != first_up_subvolume.
Also, first_up_subvolume can change within the life of a call and
cbk. Save the first_up_subvol in dht_local for checks.
Backporting fix http://review.gluster.org/5577
BUG: 996474
Change-Id: I67b5bbe781e12812557b569b7d0a0beba4224159
Signed-off-by: shishir gowda <gowda.shishir@gmail.com>
Reviewed-on: http://review.gluster.org/6468
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Additionally, update op_errno to the latest failure. If failures are
found in complete_check, the error returned would be EUCLEAN instead
of the right failure (in this case ENOENT).
Change-Id: Ib813867f4b817af651627b9ea07b0b09fa2b26ce
BUG: 966852
Original-author: shishir gowda <sgowda@redhat.com>
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/6495
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Creating the linkfile could have failed, but we don't care about the linkfile
for setting the layout in the inode ctx (could be EEXIST etc.).
So ignore @inode in cbk and pick it up from local->loc.inode.
Backporting http://review.gluster.org/6319
BUG: 1032859
Change-Id: Ic95e303a4c060900d041820d4faa68d1c4685b6a
Original-author: Anand Avati <avati@redhat.com>
Signed-off-by: shishir gowda <gowda.shishir@gmail.com>
Reviewed-on: http://review.gluster.org/6470
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
| |
The earlier disk space check did not reliably prevent migration
when the destination had less available space than the source.
The scenario we need to avoid is stated below:
During rebalance `migrate-data`, the destination subvol experiences
a `reduction` in 'blocks' of free space while, at the same time, the source
subvol gains certain 'blocks' of free space. A valid check is
necessary here to avoid an erroneous move to a destination where
space could be scarce.
This patch fixes the check by subtracting the
necessary file blocks from the destination and adding those blocks
to the source.
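A minimal sketch of such a check, assuming free space is already expressed in the same block unit on both sides. The function name migration_keeps_balance and the exact comparison (destination after the move must not end up with less free space than the source after the move) are assumptions for illustration, not the patch itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: project the free space on both sides as if the
 * file had already moved, then compare the projected values. */
bool migration_keeps_balance(uint64_t src_free_blocks,
                             uint64_t dst_free_blocks,
                             uint64_t file_blocks)
{
    if (dst_free_blocks < file_blocks)
        return false;                          /* file does not fit at all */

    uint64_t dst_after = dst_free_blocks - file_blocks; /* destination loses blocks */
    uint64_t src_after = src_free_blocks + file_blocks; /* source regains blocks */

    return dst_after >= src_after;             /* skip the move otherwise */
}
```

For example, with 100 free blocks on the source, 150 on the destination and a 50-block file, the projection is 100 versus 150 in favour of the source, so the move would be skipped.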
backporting fix http://review.gluster.org/5961
BUG: 982919
Original-author: Harshavardhana <harsha@harshavardhana.net>
Signed-off-by: shishir gowda <gowda.shishir@gmail.com>
Change-Id: If5808eaa89e66d7bcaeee7268fe3fe5b1b56f51d
Signed-off-by: shishir gowda <gowda.shishir@gmail.com>
Reviewed-on: http://review.gluster.org/6461
Reviewed-by: Harshavardhana <harsha@harshavardhana.net>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
| |
xattr name can legally be NULL. Handle that case without crashing.
Change-Id: Ie214cb05ccd52565dc247a9234ad83ae799d3866
BUG: 1036879
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/6420
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
@key can legally be NULL. Handle that case without crashing.
Change-Id: Iaae293caa7eeb24afc9cd2580799173e2ce00911
BUG: 1036879
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/6402
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Creating the linkfile could have failed, but we don't care about the linkfile
for setting the layout in the inode ctx (could be EEXIST etc.).
So ignore @inode in cbk and pick it up from local->loc.inode.
Change-Id: I2952799d7ae0d3441b84b2ca2981afd75d7576e2
BUG: 1032859
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/6358
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When clients refer to a GFID which does not exist, the errno to
be returned is ESTALE (and not ENOENT). Even though ENOENT might
look "proper" most of the time, as the application eventually expects
ENOENT even if a parent directory does not exist, not returning
ESTALE causes resolvers (FUSE and GFAPI) not to retry resolution
in uncached mode. This can result in spurious ENOENTs during
concurrent path modification operations.
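The distinction can be sketched as two small decisions. Both function names are hypothetical and the code is only a model of the behaviour described above, not the resolver code in FUSE or GFAPI.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical: errno a server-side lookup reports for a missing object.
 * A reference by GFID (a cached handle) that no longer exists is stale;
 * a reference by name that does not exist is simply absent. */
int errno_for_missing_object(bool referenced_by_gfid)
{
    return referenced_by_gfid ? ESTALE : ENOENT;
}

/* Hypothetical: a resolver retries in uncached mode only on ESTALE. */
bool should_retry_uncached(int op_errno)
{
    return op_errno == ESTALE;
}
```

If the server answered ENOENT for a missing GFID, should_retry_uncached() would return false and the stale cached path would be reported to the application as a spurious ENOENT.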
Change-Id: I7a06ea6d6a191739f2e9c6e333a1969615e05936
BUG: 1032894
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/6322
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As described in https://bugzilla.redhat.com/show_bug.cgi?id=1005526
eager-locks are broken on release-3.4, at least for NetBSD. This
change disables them by default, leaving the admin the possibility
to explicitly enable the feature if needed.
BUG: 1005526
Change-Id: I6f1b393865b103ec56ad5eb5143f59bb8672f19c
Signed-off-by: Emmanuel Dreyfus <manu@netbsd.org>
Reviewed-on: http://review.gluster.org/6020
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently rebalance/remove-brick ops display the migration failed count even
for files which failed due to space issues (not enough space for the file, or
migration leading to cluster imbalance).
These will now be counted as skipped, and rebalance/remove-brick status
will display the additional counter.
BUG: 989846
Change-Id: I4efa7ce69dd43680ff47181afed0c561954c5080
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/5977
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, when selecting an alternative subvolume when the hashed
subvol has exceeded min-free-disk/inodes, we do not check if
layouts have errors (including decommissioning). This leads
to data being written to those subvolumes, and in case of
decommissioning, will lead to data loss.
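A sketch of the selection with the missing check added. The struct, the function name and the "most free space wins" policy are assumptions for illustration; only the rule that subvolumes with layout errors (including decommissioned ones) must be skipped comes from the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-subvolume view used only for this example. */
struct subvol_sketch {
    int      layout_err;   /* non-zero: layout error or decommissioned */
    uint64_t free_bytes;   /* currently available space */
};

/* Return the index of the eligible subvolume with the most free space,
 * or -1 if none qualifies. Subvolumes with layout errors are never
 * candidates, however much space they have. */
int pick_alternative_subvol(const struct subvol_sketch *subvols,
                            size_t count, uint64_t min_free_bytes)
{
    int best = -1;

    for (size_t i = 0; i < count; i++) {
        if (subvols[i].layout_err != 0)
            continue;                       /* skip errored/decommissioned */
        if (subvols[i].free_bytes < min_free_bytes)
            continue;                       /* below min-free-disk */
        if (best < 0 ||
            subvols[i].free_bytes > subvols[(size_t)best].free_bytes)
            best = (int)i;
    }
    return best;
}
```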
BUG: 982919
> Original-Author: shishir gowda <sgowda@redhat.com>
> Reviewed-on: http://review.gluster.org/5299
Change-Id: If301a86cf3ca5fad6529bd2e61382f9901663ba0
Signed-off-by: Amar Tumballi <amarts@redhat.com>
Reviewed-on: http://review.gluster.org/5888
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Else this results in a missing frame causing a hang
Change-Id: Ib5f3dc6a3999449faa2853cee2944af2fb065a20
BUG: 1002399
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/5878
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The current self-healing algorithm is ignoring missing directories
for assigning new layout. When lookup() is racing against mkdir()
or when self-healing a half-done mkdir(), the layout assignment split
must happen based on the final number of directories, and not the
currently existing number of directories (because we finish mkdir()
of missing directories before hash layout assignment).
Without this fix, concurrent mkdir() and lookup() will step on
each other's feet, create a messed up layout on disk, and end up
with different in-memory layouts.
Once two clients have different in-memory layouts, creation of
subdirectory will not arbitrate on the same hashed subvolume and will
result in GFID mismatch of the sub-directory.
Change-Id: Ia47acad67c265060405984c822b4d37512b9dbb3
BUG: 907072
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/5871
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
The NFS xlator never does an open on a file for performing writes, and
afr does not perform changelog wakeup for this fd, so operations
which do metadata operations as soon as the data operations are
completed perceive a delay of 'post-op-delay-secs'.
Fix:
Perform changelog wakeup on anon-fd if the fd with same pid is
not present in inode-list.
Note:
This approach is a short-term fix. A proper fix needs a new domain
for taking metadata locks so that data/metadata locks don't compete
with each other.
BUG: 966018
Change-Id: Ia9188a253e7943801b665e1b9205e2f551952d87
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/5067
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
This is a backport of Ia5a5d40bcea7bfb320ef7096af1e035b8847d4ff
BUG: 960055
Change-Id: Ibf3547a775d7ca2f3a097c880cdf38ffafb322da
Signed-off-by: Emmanuel Dreyfus <manu@netbsd.org>
Reviewed-on: http://review.gluster.org/5139
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
| |
This is support for discovering a filename in a given directory
which has a case insensitive match of a given name. It is implemented
as a virtual extended attribute on the directory where the required
filename is specified in the key.
E.g:
sh# getfattr -e "text" -n user.glusterfs.get_real_filename:FiLe-B /mnt/samba/patchy
getfattr: Removing leading '/' from absolute path names
# file: mnt/samba/patchy
user.glusterfs.get_real_filename:FiLe-B="file-b"
In reality, there can be multiple "answers" as the backend filesystem is
case sensitive and there can be multiple files which match under
strcasecmp(). In this case we pick the first matched file from the first
responding server.
If a matching file does not exist, we return ENOENT (and NOT ENODATA).
This way the caller can differentiate between "unsupported" glusterfs
API and file not existing.
This API is used by Samba VFS to perform efficient discovery of the real
filename without doing a full scan at the Samba level.
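The matching rule (first entry that compares equal under strcasecmp()) can be sketched as below. The function name and the idea of matching against an in-memory list of directory entries are assumptions for the example; the real lookup happens on the brick while reading the directory.

```c
#include <stddef.h>
#include <strings.h>   /* strcasecmp() */

/* Hypothetical helper: return the first name in `names` that equals
 * `wanted` ignoring case, or NULL when nothing matches (the caller
 * would then report ENOENT, not ENODATA). */
const char *first_case_insensitive_match(const char *const *names,
                                         size_t count,
                                         const char *wanted)
{
    for (size_t i = 0; i < count; i++) {
        if (strcasecmp(names[i], wanted) == 0)
            return names[i];      /* first match wins, later ones ignored */
    }
    return NULL;
}
```

With entries "file-a", "file-b" and "FILE-B", a request for "FiLe-B" returns "file-b" because it is encountered first.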
Change-Id: I53054c4067cba69e585fd0bbce004495bc6e39e8
BUG: 953694
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/5163
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Do not let inode linking happen only in lookup(). While
that works, it is inefficient.
Change-Id: I51bbfb6255ec4324ab17ff00566375f49d120c06
BUG: 953694
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/5162
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
When taking blocking entrylks, afr orders the entrylks based on
uuid_compare of gfids of parent dirs, if they are equal then it orders
them based on the basenames. While this approach works fine, the
implementation assumes loc->gfids to be populated at the time of
the comparison, but loc may have gfid in loc->inode->gfid instead
of loc->gfid which was leading to order mismatches and dead-locks.
Fix:
Implemented loc_gfid which gives gfid by checking both loc->gfid,
loc->inode->gfid. Used this for ordering the blocking entrylks.
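A minimal model of the helper described in the fix. The struct definitions are simplified stand-ins (the real loc_t and inode_t carry many more fields and use uuid_t), and loc_gfid_sketch is a hypothetical name, so treat this as a sketch of the lookup order only.

```c
#include <string.h>

#define GFID_LEN 16

/* Simplified stand-ins for inode_t and loc_t, for illustration only. */
struct inode_sketch { unsigned char gfid[GFID_LEN]; };
struct loc_sketch {
    unsigned char        gfid[GFID_LEN];
    struct inode_sketch *inode;
};

/* Returns non-zero when all GFID_LEN bytes are zero. */
int gfid_is_null(const unsigned char *gfid)
{
    static const unsigned char zero[GFID_LEN];
    return memcmp(gfid, zero, GFID_LEN) == 0;
}

/* Copy the gfid into `out`: prefer loc->gfid, fall back to
 * loc->inode->gfid, and leave `out` all-zero when neither is set. */
void loc_gfid_sketch(const struct loc_sketch *loc, unsigned char *out)
{
    if (!gfid_is_null(loc->gfid))
        memcpy(out, loc->gfid, GFID_LEN);
    else if (loc->inode && !gfid_is_null(loc->inode->gfid))
        memcpy(out, loc->inode->gfid, GFID_LEN);
    else
        memset(out, 0, GFID_LEN);
}
```

Ordering the entrylks on the value returned by such a helper gives every client the same order regardless of which of the two fields happened to be populated.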
Change-Id: I2743fcaff3d670fbeb6b8e0a496f106a3585dde1
BUG: 965987
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/5063
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch backports the following changes from the master branch
99fe09f glusterd: Moved the volume entry table to a separate file.
e306d08 glusterd: Changing the volume entry table's representation.
eac54f6 glusterd: Added option description, and validation function fields.
bcb4235 glusterd: Added validation function for performance cache max and min size.
8897d08 glusterd: Added validation function for quota-timeout.
4579609 glusterd: Added validation function for stripe-block-size.
6788bad glusterd: Fix some options in vme table
549231d glusterd: Added the validation function for subvols-per-directory
9636e63 glusterd: Added description for nfs.transport-type option in volume set help.
Change-Id: I4a64ad94f17df4b45a3a32262a83e2c35fb5f7da
BUG: 907311
Signed-off-by: Kaushal M <kaushal@redhat.com>
Reviewed-on: http://review.gluster.org/4956
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Before anonymous fds were available, afr had to queue up
transactions if the file was not opened on one of its
subvolumes. This happened until the attempt to open the
file either succeeded or failed. These attempts were repeated
until the file was successfully opened on the subvolume.
Now client xlator uses anonymous fds to perform the fops
if the fd used for the fop is not 'opened'.
Fops will be successful even when the file is not opened
so there is no need to queue up the transactions anymore in afr.
Open is attempted on the subvolume where it is not
opened independent of the fop.
Change-Id: I6d59293023e2de41c606395028c8980b83faca3f
BUG: 953887
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4868
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
With the present implementation, eager-lock is issued for
any fd fop. eager-lock is being transferred to metadata
transactions. But the lk-owner is set to local->fd address
only for DATA transactions, but for METADATA transactions
it is frame->root. Because of this unlock on the eager-lock fails
and rebalance hangs.
Fix:
Enable eager-lock for fd DATA transactions
This is a backport of change If30df7486a0b2f5e4150d3259d1261f81473ce8a
http://review.gluster.org/#/c/4588/
BUG: 916226
Change-Id: Id41ac17f467c37e7fd8863e0c19932d7b16344f8
Signed-off-by: Emmanuel Dreyfus <manu@netbsd.org>
Reviewed-on: http://review.gluster.org/4899
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
Data self-heal may choose sink iatt to set mtimes.
This happens because after syncing of data is done
self-heal does one more xattrops/fstat to determine
sources sinks to set the inode-ctx. Since this is done
after data syncing and erase of xattrs, old source and
old sink are now sources, but the mtimes of them differ.
The old code just takes the first source from the list and
updates mtimes, which could have been a sink before the self-heal
started.
Fix:
Set mtime from 'sources before syncing'.
Change-Id: Id769e1b99aa4f041eaee775f64cbf2c57b799723
BUG: 918437
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4658
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-on: http://review.gluster.org/4663
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Backporting fix http://review.gluster.org/#/c/4668/
When subvols-per-directory is < available subvols, then there are layouts
which are not populated. This leads to incorrect identification of holes or
overlaps. We need to ignore layouts which have err == 0 and start == stop
(in the current scenario, start == stop == 0).
Additionally, in layout-merge, treat missing xattrs as err = 0. In case of
missing layouts, anomalies will reset them.
For any other valid subvols, err != 0 in case of layouts being zeroed out.
Also reverted dht_selfheal_dir_xattr, which does layout calculation only
on subvols which have errors.
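The "ignore unpopulated layouts" rule can be sketched like this. The struct and function names are hypothetical, the entries are assumed to be already sorted by start, and entries with err != 0 are assumed to be accounted for elsewhere, so this is an illustration of the rule rather than the dht_layout_anomalies code.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one layout entry. */
struct layout_entry_sketch { int err; uint32_t start; uint32_t stop; };

/* An entry with err == 0 and start == stop carries no range. */
int layout_entry_is_unpopulated(const struct layout_entry_sketch *e)
{
    return e->err == 0 && e->start == e->stop;
}

/* Count holes and overlaps over entries sorted by start, skipping
 * unpopulated entries and entries that already carry an error. */
void count_layout_anomalies(const struct layout_entry_sketch *list,
                            size_t count,
                            uint32_t *holes, uint32_t *overlaps)
{
    uint64_t next = 0;   /* first hash value not yet covered */

    *holes = 0;
    *overlaps = 0;
    for (size_t i = 0; i < count; i++) {
        if (list[i].err != 0 || layout_entry_is_unpopulated(&list[i]))
            continue;
        if (list[i].start > next)
            (*holes)++;
        else if (list[i].start < next)
            (*overlaps)++;
        if ((uint64_t)list[i].stop + 1 > next)
            next = (uint64_t)list[i].stop + 1;
    }
    if (next <= UINT32_MAX)
        (*holes)++;      /* tail of the hash space is uncovered */
}
```

Without the skip, a zeroed entry (start == stop == 0) placed ahead of the first real range would be misread as covering hash value 0 and reported as an overlap.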
BUG: 921408
Change-Id: I75a8edcb92af5b53b3253c9addd7a812e9242836
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4800
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
| |
Backporting Avati's fix http://review.gluster.org/4711
The scheme to encode brick d_off and brick id into global d_off has
two approaches. Since both brick d_off and global d_off are both 64-bit
wide, we need to be careful about how the brick id is encoded.
Filesystems like XFS always give a d_off which fits within 32bits. So
we have another 32bits (actually 31, in this scheme, as seen ahead) to
encode the brick id - which is typically plenty.
Filesystems like the recent EXT4 utilize up to the 63 low bits in d_off,
as the d_off is calculated based on a hash function value. This leaves
us no "unused" bits to encode the brick id.
However, both these filesystems (EXT4 more importantly) are "tolerant" in
terms of the accuracy of the value presented back in seekdir(), i.e., a
seekdir(val) actually seeks to the entry which has the "closest" true
offset.
This "two-prong" scheme exploits this behavior - which seems to be the
best middle ground amongst various approaches and has all the advantages
of the old approach:
- Works against XFS and EXT4, the two most common filesystems out there.
(which wasn't an "advantage" of the old approach as it is broken against
EXT4)
- Probably works against most of the others as well. The ones which would
NOT work are those which return HUGE d_offs _and_ are NOT tolerant to
seekdir() to the "closest" true offset.
- Nothing to "remember in memory" or evict "old entries".
- Works fine across NFS server reboots and also NFS head failover.
- Tolerant to seekdir() to arbitrary locations.
Algorithm:
Each d_off can be encoded in either of the two schemes. There is no
requirement to encode all d_offs of a directory or a reply-set in
the same scheme.
The topmost bit of the 64 bits is used to specify the "type" of encoding
of this particular d_off. If the topmost bit (bit-63) is 1, it indicates
that the encoding scheme holds a HUGE d_off. If the topmost bit is 0,
it indicates that the "small" d_off encoding scheme is used.
The goal of the "small" d_off encoding is to stay as dense as possible
towards the lower bits even in the global d_off.
The goal of the HUGE d_off encoding is to stay as accurate (close) as
possible to the "true" d_off after a round of encoding and decoding.
If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick
ID (call it "n").
SMALL d_off
===========
Encoding
--------
If the top n + 1 bits are free in a brick offset, then we leave the
top bit as 0 and set the remaining bits based on the old formula:
hi_mask = 0xffffffffffffffff
hi_mask = ~(hi_mask >> (n + 1))
if ((hi_mask & d_off_brick) != 0)
    do_large_d_off_encoding ()
d_off_global = (d_off_brick * N) + brick_id
Decoding
--------
If the top bit in the global offset is 0, it indicates that this
is the encoding formula used. So decoding such a global offset will
be like the old formula:
if ((d_off_global & 0x8000000000000000) != 0)
    do_large_d_off_decoding()
d_off_brick = d_off_global / N
brick_id = d_off_global % N
HUGE d_off
==========
Encoding
--------
If the top n + 1 bits are NOT free in a given brick offset, then we
set the top bit as 1 in the global offset. The low n bits are replaced
by brick_id.
low_mask = 0xffffffffffffffff << n // where n is ROOF(Log2(N))
d_off_global = (0x8000000000000000 | (d_off_brick & low_mask)) + brick_id
if (d_off_global == 0xffffffffffffffff)
    discard_entry();
Decoding
--------
If the top bit in the global offset is set 1, it indicates that
the encoding formula used is above. So decoding would look like:
hi_mask = (0xffffffffffffffff << n)
low_mask = ~(hi_mask)
d_off_brick = (global_d_off & hi_mask & 0x7fffffffffffffff)
brick_id = global_d_off & low_mask
If "losing" the low n bits in this decoding of d_off_brick looks
"scary", we need to realize that till recently EXT4 used to only
return what can now be expressed as (d_off_global >> 32). The extra
31 bits of hash added by EXT4 recently only decrease the probability
of a collision, and do not eliminate it completely, anyway. In a way,
the "lost" n bits are made up by decreasing the probability of
collision by sharding the files into N bricks / EXT4 directories
-- call it "hash hedging", if you will :-)
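The two schemes can be put together in a short C sketch. The function names are hypothetical and this is a transcription of the pseudocode in this message, not the code from the patch. It assumes N >= 1 and brick_id < N. In the small scheme the decoder divides by N to recover the brick offset and takes the remainder as the brick id, which is the inverse of d_off_brick * N + brick_id.

```c
#include <stdint.h>

#define D_OFF_TOP_BIT 0x8000000000000000ULL

/* n = ROOF(Log2(N)): bits needed to hold a brick id in [0, N). */
unsigned d_off_brick_bits(uint64_t n_bricks)
{
    unsigned n = 0;
    while ((1ULL << n) < n_bricks)
        n++;
    return n;
}

/* Encode; returns 0 on success, -1 when the entry must be discarded. */
int d_off_encode(uint64_t d_off_brick, uint64_t brick_id,
                 uint64_t n_bricks, uint64_t *d_off_global)
{
    unsigned n = d_off_brick_bits(n_bricks);
    uint64_t hi_mask = ~(UINT64_MAX >> (n + 1));   /* top n + 1 bits */

    if ((hi_mask & d_off_brick) == 0) {
        /* SMALL: top n + 1 bits are free, result keeps bit 63 clear. */
        *d_off_global = d_off_brick * n_bricks + brick_id;
        return 0;
    }

    /* HUGE: set bit 63 and overwrite the low n bits with the brick id. */
    uint64_t low_mask = UINT64_MAX << n;
    *d_off_global = (D_OFF_TOP_BIT | (d_off_brick & low_mask)) + brick_id;
    return (*d_off_global == UINT64_MAX) ? -1 : 0;
}

/* Decode a global d_off back into brick offset and brick id. */
void d_off_decode(uint64_t d_off_global, uint64_t n_bricks,
                  uint64_t *d_off_brick, uint64_t *brick_id)
{
    if ((d_off_global & D_OFF_TOP_BIT) == 0) {
        *d_off_brick = d_off_global / n_bricks;     /* SMALL */
        *brick_id    = d_off_global % n_bricks;
        return;
    }

    unsigned n = d_off_brick_bits(n_bricks);        /* HUGE */
    uint64_t hi_mask = UINT64_MAX << n;
    *d_off_brick = d_off_global & hi_mask & ~D_OFF_TOP_BIT;
    *brick_id    = d_off_global & ~hi_mask;
}
```

With N = 3 (so n = 2), brick offset 1000 on brick 2 encodes to 3002 and decodes exactly. A 63-bit offset such as 0x7123456789abcdef on brick 1 encodes to 0xf123456789abcded and decodes to 0x7123456789abcdec: the low two bits are lost, which is the tolerated inaccuracy discussed above.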
Change-Id: I9551c581c3f3d4c9e719764881036d554f60c557
Thanks-to: Zach Brown <zab@redhat.com>
BUG: 838784
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4799
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We needed to zero out the layout range, before we re-calculate the range.
When spread-count is issued, we would end up with stale ranges in the layout.
Replaced dht_selfheal_dir_xattr with dht_fix_dir_xattr, which correctly resets
the un-used (after re-cal) layouts.
Change-Id: I1a900d15df07335f59356bd23182ccec34381ab2
BUG: 884455
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4648
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
fd based operations such as readv checked only for data split-brain
instead of complete split-brain (i.e. both data + metadata), assuming that
open would have done the complete split-brain check. However, open-behind
would have unwound open without winding to afr, thus preventing the complete
split-brain check, and some applications will be able to read the contents
of the file even though the file has metadata split-brain. So let all
the fd based fops do a defensive check of complete split-brain.
Change-Id: I0ea52f782b371ce73e8e1c61f9def438fce1bd28
BUG: 846240
Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
Reviewed-on: http://review.gluster.org/4620
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Though linkfile_create and rebalance dst file create sent a setattr
with correct ownership, there is still a race window where the linkfile
open (client open due to migration) will fail, as its ownership will be
root:root.
BUG: 884597
Change-Id: Iba73681eae4f280d39ee6c9a40009e195768bee7
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4612
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In dht_notify, we used to create a thread to start defrag
crawls after we had heard from all child subvols.
This was incorrect, as a later event could also trigger the
crawl again (due to the fact that all subvols had responded).
The fix is to make sure the thread is started only once, after
all subvols have responded the first time.
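The "start only once" rule can be modelled with a flag that is set when the crawl is first triggered. The struct, the fixed MAX_SUBVOLS_SKETCH bound and the function name are assumptions for this example, and the sketch is single-threaded; the real notify path would need to guard the state with a lock.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SUBVOLS_SKETCH 16

/* Hypothetical notify state, for illustration only. */
struct defrag_notify_sketch {
    size_t subvol_count;
    bool   heard_from[MAX_SUBVOLS_SKETCH];
    bool   crawl_started;
};

/* Record an event from `child`. Returns true exactly once: on the
 * first event after which every child has been heard from. Later
 * events return false, so the crawl thread is not started again. */
bool on_child_event(struct defrag_notify_sketch *st, size_t child)
{
    if (st->subvol_count > MAX_SUBVOLS_SKETCH || child >= st->subvol_count)
        return false;

    st->heard_from[child] = true;
    for (size_t i = 0; i < st->subvol_count; i++) {
        if (!st->heard_from[i])
            return false;            /* still waiting for some child */
    }
    if (st->crawl_started)
        return false;                /* already triggered earlier */

    st->crawl_started = true;
    return true;
}
```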
BUG: 916449
Change-Id: I1619344fbb1cb51d5e1db38d8a29821fa870fa8b
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4610
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If holes are encountered, then we do not write these to the dst,
which sometimes causes the file size to be less than src. Data is not
corrupted, as when non-zero reads are received, we do write that data.
Call truncate with the source file size to prevent the dst from ending up
shorter than src in case the file end has holes.
Thanks to Brian Foster for providing the test case.
BUG: 915554
Change-Id: I7e1e0c475118b073c3ebb87e93220c1ec22e8b7d
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4609
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
After fix http://review.gluster.org/4282 (libglusterfs/syncop: do not
hold ref on the fd in cbk) was pushed, syncop_open does not take a ref anymore.
BUG: 910661
Change-Id: Idedff91270966e6e70e71ee83785c0228e238d31
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4608
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, linkfile creation happens as root.
Use the uid/gid returned from _cbk (link/rename) to set the correct ownership of
the link files.
Also added test/dht.rc to implement common dht functions.
BUG: 884597
Change-Id: I6bc0e04f62d4716fc033681e5678e852a1be7a2f
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4607
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since directories have presence on all subvolumes, there is
no definite meaning of ->hashed_subvol or ->cached_subvol.
The getxattr() code path chooses ->cached_subvol for the pathinfo
extended attribute. While this makes sense for files, it makes
less sense for directories. Further, if a hashed or a cached
subvolume is down, and there's a getxattr request for a
directory, we return with an errno.
This patch changes pathinfo extended attribute contents by
aggregating information from all subvolumes that are up.
Change-Id: I58adb741d63ccfd1d0239af75eb65f26f0fb384d
Signed-off-by: Venky Shankar <vshankar@redhat.com>
BUG: 856455
Reviewed-on: http://review.gluster.org/4047
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
Change-Id: I1c9541058c7d07786539a3266ca125a6a15287d8
BUG: 859835
Signed-off-by: Anand Avati <avati@redhat.com>
Original-author: Kacper Kowalik (Xarthisius) <xarthisius.kk@gmail.com>
Signed-off-by: Kacper Kowalik (Xarthisius) <xarthisius.kk@gmail.com>
Reviewed-on: http://review.gluster.org/3967
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Typically this lock was not needed in practice, but with
http://review.gluster.org/3842, this code gets executed in multiple
threads for different servers and we lose a count. This results in
a leaked lock and a hang for a future transaction.
Change-Id: I377ed20e44f2a45cff522289dfef181f0653eca2
BUG: 765564
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4480
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
| |
This method deals with the case where swapping might gain a bigger overlap
for the xlator currently under consideration, but sacrifices even more from
the xlator we're swapping with. For example:
A = 0x00000000 - 0x44444443 (new 0x00000000 - 0x55555554)
B = 0x44444444 - 0x77777776 (new 0x55555555 - 0xaaaaaaa9)
C = 0x77777777 - 0xffffffff (new 0xaaaaaaaa - 0xffffffff)
Here, the new range for B has a bigger overlap with the old C than with the
old B (0x33333333 vs. 0x22222222 to be precise) so looking only at that
might lead us to swap. However, such a swap turns the new C's overlap from
0x55555556 (vs. old C) to *zero* (vs. old B). In other words, we've gained
0x11111111 for B but lost 0x55555556 for C, so it's a bad idea.
The new algorithm accounts for all effects of the swap, so it not only avoids
bad swaps but can make some good ones that would have been missed previously.
For example, if swapping a range X with a later range Y would not increase the
overlap for X we would previously have skipped it even if the swap would
increase Y's overlap without affecting X's. This is the normal case when we're
adding a new brick (which initially has zero overlap with any old range) so
finding more good swaps is probably even more important than avoiding bad ones.
Also, the logic in dht_overlap_calc was completely broken before, causing
integer overflows instead of providing correct values, so no matter what
higher-level algorithm was in place the GIGO effect would have resulted in
bad decisions.
Change-Id: If61ed513cfcb931916c6b51da293e3efbaaf385f
BUG: 853258
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-on: http://review.gluster.org/3908
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
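The arithmetic in the example above can be checked with a short Python sketch. This is an illustration of the idea, not the code in dht_overlap_calc: `overlap` and `swap_gain` are hypothetical helpers, and ranges are modelled as inclusive `(start, end)` tuples. Python integers do not overflow; an equivalent C version would need a type wider than 32 bits for `end - start + 1`, because the full range holds 2^32 values.

```python
def overlap(a, b):
    """Number of hash values shared by two inclusive (start, end) ranges."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return end - start + 1 if start <= end else 0


# Old and new ranges from the commit message.
old = {
    "A": (0x00000000, 0x44444443),
    "B": (0x44444444, 0x77777776),
    "C": (0x77777777, 0xFFFFFFFF),
}
new = {
    "A": (0x00000000, 0x55555554),
    "B": (0x55555555, 0xAAAAAAA9),
    "C": (0xAAAAAAAA, 0xFFFFFFFF),
}


def swap_gain(x, y):
    """Net change in total overlap if the new ranges of x and y are swapped.

    Counts the effect on both sides of the swap, not only on x.
    """
    before = overlap(new[x], old[x]) + overlap(new[y], old[y])
    after = overlap(new[x], old[y]) + overlap(new[y], old[x])
    return after - before
```

For the ranges above, `swap_gain("B", "C")` is negative (a gain of 0x11111111 for B against a loss of 0x55555556 for C), so the swap is rejected.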
|
|
|
|
|
|
|
|
|
|
| |
Change-Id: I7049c0c64e36a9dfa4cc0e0b34de7ec111d2f6c1
BUG: 908302
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4076
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There is no need for the delayed post-op to wait until
the next fop phase on the fd completes. The changelog and
locks are inherited by the time the next fop phase is attempted,
so the wakeup can happen just before the fop phase is started.
Change-Id: I0b8e591f591b0f7565eb55265ab51f476ed2b165
BUG: 908302
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4073
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
Files were being created in a subvol which had less than
min_free_disk available, even when other subvols with
more space were available.
Solution:
Changed the logic to look for a subvol which has more
space available.
In cases where all the subvols have less than
min_free_disk available, the one with the most space and
at least one available inode is chosen.
Known Issue: Cannot ensure that the first file that is
created right after the min-free-disk value is crossed on a
brick will get created on another brick, because the disk
usage stat takes some time to update in the gluster process.
Will fix that as part of another bug.
Change-Id: If3ae0bf5a44f8739ce35b3ee3f191009ddd44455
BUG: 858488
Signed-off-by: Raghavendra Talur <rtalur@redhat.com>
Reviewed-on: http://review.gluster.org/4420
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
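The selection rule described above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the DHT implementation: `Subvol`, `pick_subvol` and the field names are hypothetical, free space is expressed as a percentage, and the sketch ignores the hashed subvolume entirely.

```python
from collections import namedtuple

# Hypothetical per-subvolume statistics.
Subvol = namedtuple("Subvol", "name free_percent free_inodes")


def pick_subvol(subvols, min_free_disk):
    """Pick a subvolume for a new file.

    Prefer the subvolume with the most free space among those at or
    above min_free_disk. If none qualifies, fall back to the one with
    the most free space that still has at least one free inode.
    """
    above = [s for s in subvols if s.free_percent >= min_free_disk]
    if above:
        return max(above, key=lambda s: s.free_percent)

    with_inodes = [s for s in subvols if s.free_inodes >= 1]
    if with_inodes:
        return max(with_inodes, key=lambda s: s.free_percent)

    return None  # nowhere to place the file
```

The known issue in the message still applies to a sketch like this: the result is only as fresh as the statistics passed in.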
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
* 'CONNECTING' is treated as CHILD_UP.
* Notifications (default_notify()) are sent for all the events individually
while mounting.
Solution:
* Consider a child up only after the CHILD_UP event is received.
* Send a single notification for all the children's events, only
while mounting.
Change-Id: I1b7de127e12f5bfb8f80702dbdce02019e138bc8
BUG: 885072
Signed-off-by: Varun Shastry <vshastry@redhat.com>
Reviewed-on: http://review.gluster.org/4356
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
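A rough Python sketch of the two rules in the solution above. It is an illustration only, not the translator's notify code: `ChildEvents` and its members are hypothetical, events are plain strings, and two details are assumptions rather than facts from the message: that the single mount-time notification is sent once every child has reported CHILD_UP or CHILD_DOWN, and that later events are forwarded individually.

```python
class ChildEvents:
    """Hypothetical tracker for per-child events."""

    def __init__(self, children):
        self.last_event = {child: None for child in children}
        self.mounted = False
        self.sent = []  # notifications forwarded to the parent

    def is_up(self, child):
        # CONNECTING alone never makes a child up.
        return self.last_event[child] == "CHILD_UP"

    def on_event(self, child, event):
        self.last_event[child] = event
        if self.mounted:
            # Assumption: after mounting, forward events one by one.
            self.sent.append(event)
            return
        definite = ("CHILD_UP", "CHILD_DOWN")
        if all(e in definite for e in self.last_event.values()):
            # Every child has reported a definite state: notify once.
            any_up = any(self.is_up(c) for c in self.last_event)
            self.sent.append("CHILD_UP" if any_up else "CHILD_DOWN")
            self.mounted = True
```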
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In dht_mkdir_cbk, an EEXIST error is treated like a true error. Because
of this, the following sequence of events can happen, eventually
resulting in a GFID mismatch (and possibly leaked locks and a hang,
in the presence of replicate).
The issue exists when many clients concurrently attempt creation of a
directory and subdirectory (e.g. mkdir -p /mnt/gluster/dir1/subdir)
0. First mkdir happens by one client on the hashed subvolume. Only
one client wins the race. Other racing mkdirs get EEXIST. Yet
other "laggers" in the race encounter the just-created directory
in lookup() on the hash dir.
1. At least one "lagger" lookup() notices that there are missing
directories on other subvolumes (which the "winner" mkdir is yet
to create), and starts off self-heal of the directory.
2. At least on some subvolumes, self-heal's mkdir wins the race
against the "winner" mkdir and creates the directory first. This
causes the "winner" mkdir to experience EEXIST error on those
subvolumes.
3. On other subvolumes where the "winner" mkdir won the race, self-heal
experiences an EEXIST error, but self-heal properly translates
that into a success (the mkdir code path does not, which is the
bug).
4. Both mkdir and self-heal assign hash layouts to the just created
directory. But self-heal distributes hash range across N (total)
subvolumes, whereas mkdir distributes hash range across N - M
(where M is the number of subvolumes where mkdir lost the race).
Both clients "cache" their respective layouts and, in the near
future, use them for all creates inside the directory (evidence in logs).
5. During the creation of the subdirectory, two clients race again.
Ideally the winner performs mkdir() on the hashed subvolume and proceeds
to create other dirs, while the loser experiences an EEXIST error on the
hashed subvolume and backs off. But in this case, because the two clients
have different layout views of the parent directory (because of
different hash splits and assignments), the hashed subvolumes for
the new directory can end up being different. Therefore, both clients
now win the race (they were never fighting against each other on a
common server), assigning different GFIDs to the directory on their
respective (different) subvolumes. Some of the remaining subvolumes
get GFID1, others GFID2.
Conclusion/Fix:
Making mkdir translate an EEXIST error into success (just the way self-heal
already rightly does) will restore the truth of the design claim
that concurrent mkdir/self-heals perform deterministic + idempotent
operations. This will prevent differing "hash views" on different
clients and thereby also avoid GFID mismatches by forcing all clients
to have a "fair race", because the hashed subvolume will be the same
for all of them (thereby also avoiding leaked locks and hangs).
Change-Id: I84592fb9b8a3f739a07e2afb23b33758a0a9a157
BUG: 907072
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4459
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
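The fix above amounts to treating EEXIST from the remaining subvolumes as success when aggregating mkdir replies. A minimal Python sketch follows, assuming (as step 5 suggests) that the hashed subvolume's reply is handled separately and that EEXIST there still means the race was lost. `mkdir_outcome` and `layout_members` are hypothetical helpers, not GlusterFS functions, and replies are modelled as `(op_ret, op_errno)` tuples.

```python
import errno


def mkdir_outcome(hashed_reply, other_replies):
    """Combine per-subvolume mkdir replies into one (op_ret, op_errno)."""
    op_ret, op_errno = hashed_reply
    if op_ret == -1:
        # Lost the race (EEXIST) or hit a real error on the hashed subvolume.
        return (-1, op_errno)
    for op_ret, op_errno in other_replies:
        # EEXIST here means self-heal or another client created the
        # directory first, which is as good as success.
        if op_ret == -1 and op_errno != errno.EEXIST:
            return (-1, op_errno)
    return (0, 0)


def layout_members(replies_by_subvol):
    """Subvolumes that should receive a share of the hash range."""
    return sorted(
        name
        for name, (op_ret, op_errno) in replies_by_subvol.items()
        if op_ret == 0 or op_errno == errno.EEXIST
    )
```

With EEXIST counted as success, a mkdir that lost the race on M subvolumes still spreads the layout across all N of them, matching what self-heal computes.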
|
|
|
|
|
|
|
|
|
|
| |
Change-Id: Iaf119f839cb2113b8f8efb7bf7636d471b6541bf
BUG: 866440
Signed-off-by: Venkatesh Somyajula <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/4385
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Default_fops uses stack_wind_tail, which winds without creating a frame; as a
result, the wrong subvol is returned in the cookie. To avoid this problem,
we get the subvol by passing the cookie.
Change-Id: I51ee79b22c89e4fb0b89e9a0bc3ac96c5b469f8f
BUG: 893338
Signed-off-by: Varun Shastry <vshastry@redhat.com>
Reviewed-on: http://review.gluster.org/4388
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
Tested-by: Anand Avati <avati@redhat.com>
|