| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Implement support for the fallocate file operation. fallocate
allocates blocks for a particular inode such that future writes
to the associated region of the file are guaranteed not to fail
with ENOSPC.
This patch adds fallocate support to the following areas:
- libglusterfs
- mount/fuse
- io-stats
- performance/md-cache,open-behind
- quota
- cluster/afr,dht,stripe
- rpc/xdr
- protocol/client,server
- io-threads
- marker
- storage/posix
- libgfapi
BUG: 949242
Change-Id: Ice8e61351f9d6115c5df68768bc844abbf0ce8bd
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-on: http://review.gluster.org/4969
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
In some code paths neither loc->gfid nor loc->inode->gfid
is populated which leads to EINVAL for linkfile setattr
in dht_linkfile_attr_heal.
Fix:
Populate loc->gfid before dht_linkfile_attr_heal.
Change-Id: I062770e6f6eaead304eff1dae81f8588a3b97eed
BUG: 971805
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/5178
Reviewed-by: Shishir Gowda <sgowda@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
Change-Id: Ie2b2dc54aa0d500c35752c72d3b562bcc05b1fc2
BUG: 969336
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/5123
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If unlink of linkfile returns ENOENT, do not fail unlink.
Proceed with unlinking of cached file.
Change-Id: If7cec92b40c39d68dd9c3606c6c2c3a6bd67d27b
BUG: 966848
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4971
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
When taking blocking entrylks, afr orders the entrylks based on
uuid_compare of gfids of parent dirs, if they are equal then it orders
them based on the basenames. While this approach works fine, the
implementation assumes loc->gfids to be populated at the time of
the comparison, but loc may have gfid in loc->inode->gfid instead
of loc->gfid which was leading to order mismatches and dead-locks.
Fix:
Implemented loc_gfid which gives gfid by checking both loc->gfid,
loc->inode->gfid. Used this for ordering the blocking entrylks.
Change-Id: Ib0db36bbaf0df09fa87c3c3bb6a834db74fc2154
BUG: 965987
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/5062
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Assign local = frame->local before dereferencing
local->linkfile.linkfile_cbk. Additionally, fail if
op_ret is non_zero.
Change-Id: I96a2f34ba29887da9ccaae38a644431cf7c43265
BUG: 966858
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/5141
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
Lookups in discovery fail with ENOENT so local->inode
is never set. dht_layout_set logs the callstack when
the function is called in that state.
Fix:
Don't set layout when lookups fail in discovery.
Change-Id: I5d588314c89e3575fcf7796d57847e35fd20f89a
BUG: 965434
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/5055
Reviewed-by: Shishir Gowda <sgowda@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We cannot heal in dht_discover, as it is a gfid based lookup, and not
path based. So, returning error here would lead to app's to see failure.
Also, update the layout in inode_ctx even if it has anomalies. Let
subsequent heals fix the issue.
Change-Id: I2358aadacf9a24e20a22ab0a6055c38c5eb6485c
BUG: 960348
Original-author: shishir gowda <sgowda@redhat.com>
Signed-off-by: shishir gowda <sgowda@redhat.com>
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-on: http://review.gluster.org/4959
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem: If it is fresh volume with no files created the xattrop directory is
not present. If crawl happens in that time, lookup for xattrop
directory will fail and it results in printing a misleading log.
Fix: Changed misleading log.
Change-Id: Iae5da3e8423564d64096f88abdaf8c98e4935840
BUG: 928575
Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/4993
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If linkfile create fails with EEXISTS, then check if the file
is a linkfile for the same file. If not, return the error
Change-Id: Iab42db54422dea69de0049b5196365e65edadd91
BUG: 966858
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/5060
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If renames are done with different uid/gid (non-owners), then we would
end up with incorrect uid/gid.
The fix is to create linkfiles, and heal the uid/gid as root:root. This
preserves our notion of creation as root:root and heal the uid/gid as
root:root in all paths. Additionally, we need to consider uid/gid from
only src_cached subvol, and not from linkfiles.
rename is also done as root:root if done on linkfile, as setattr of ownership
on linkfile is done after the rename
Change-Id: Icb5d431dc42da9c02dfae81980e3fe769a47a274
BUG: 884597
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4682
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
if local->fd == NULL, then in dht_migration_check_complete, do not do
open call. Let the layout get updated, and proceed with invoking the
registered target_fn.
if local->fd == NULL, do not call dht_rebalance_in_progress_check for
truncate fop, but proceed with truncate2.
Change-Id: Ia5a5d40bcea7bfb320ef7096af1e035b8847d4ff
BUG: 960055
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4958
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In path based op's like truncate, we use getxattr instead of
fgetxattr call. These can fail with permission denied issues
as linkto file creation, and setattr of ownership is not atomic,
and in cases where setattr failed (subvols down..)
The fix is to perform getxattr as root:root as it is a internal
fop. fgetxattr, bypass the access check, as it already has a valid
open fd.
Change-Id: Ie221c9172e3c1c7ed4e50c8782d362826910756f
BUG: 957074
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4890
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Additionally, update op_errno to the lasted failure. If failures
found in complete_check, error returned would be EUCLEAN instead
of the right failure (in this case ENOENT)
Change-Id: Ib813867f4b817af651627b9ea07b0b09fa2b26ce
BUG: 966852
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4989
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
mem_get function: Log message related to mem_pool
calloc is removed as its been calculated in mempool
'stats'. This messgae is consuming nearly half of
the total log messages in DEBUG mode.
dht_hash_compute function: Changed log level from DEBUG
to TRACE.
client_fdctx_destroy function: Changed log level from DEBUG
to TRACE.
Change-Id: Ic948db0419e76df4e95ebd0cabaf66eadbaada6b
BUG: 966851
Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/5086
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is support for discovering a filename in a given directory
which has a case insensitive match of a given name. It is implemented
as a virtual extended attribute on the directory where the required
filename is specified in the key.
E.g:
sh# getfattr -e "text" -n user.glusterfs.get_real_filename:FiLe-B /mnt/samba/patchy
getfattr: Removing leading '/' from absolute path names
# file: mnt/samba/patchy
user.glusterfs.get_real_filename:FiLe-B="file-b"
In reality, there can be multiple "answers" as the backend filesystem is
case sensitive and there can be multiple files which can strcasecamp()
successfully. In this case we pick the first matched file from the first
responding server.
If a matching file does not exist, we return ENOENT (and NOT ENODATA).
This way the caller can differentiate between "unsupported" glusterfs
API and file not existing.
This API is used by Samba VFS to perform efficient discovery of the real
filename without doing a full scan at the Samba level.
Change-Id: I53054c4067cba69e585fd0bbce004495bc6e39e8
BUG: 953694
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4941
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Do not let inode linking to happen only in lookup(). While
that works, it is inefficient.
Change-Id: I51bbfb6255ec4324ab17ff00566375f49d120c06
BUG: 953694
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4931
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
| |
Change-Id: Ib727948c6e21b19fd509f258ff0aea1c5d1a84d1
BUG: 966845
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/5056
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
syncop_open used to perform a ref in syncop_open_cbk so the extra
unref was needed but now syncop_open_cbk does not take a ref so no
need to do extra unref.
Fix:
remove the extra fd_unref and let dht_local_wipe do the final unref.
Change-Id: Ibe8f9a678d456a0c7bff175306068b5cd297ecc4
BUG: 961615
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4974
Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com>
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
See https://git.gluster.org/gluster-swift.git for the new location of
the Gluster-Swift code.
With this patch, no OpenStack Swift related RPMs are constructed.
This patch also removes the unused code that references the
user.ufo-test xattr key in the DHT translator.
Change-Id: I2da32642cbd777737a41c5f9f6d33f059c85a2c1
BUG: 961902 (https://bugzilla.redhat.com/show_bug.cgi?id=961902)
Signed-off-by: Peter Portante <peter.portante@redhat.com>
Reviewed-on: http://review.gluster.org/4970
Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>
Reviewed-by: Luis Pabon <lpabon@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
gfid-self-heal always assigns the gfid(GFID-1) it gets from lookup.
Between the time of lookup to triggering the gfid-self-heal the
entry could have changed. Now lets say there is a case where
one of the files of the replica subolumes already has a gfid
(GFID-2) and the other does not. In that case healing should
happen with GFID-2 instead of GFID-1.
Fix:
Missing-entry-self-heal already handles all these cases. So removed
separate handling of gfid-self-heal.
Change-Id: Ie96261e9036c8f3cb4cad89347f9bf7b681cdc1a
BUG: 767585
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/2670
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since removexattr() fails to remove "security.selinux" in a system
where SELinux is enforcing, xattr self-healing fails.
As a consequence of this, user extended attributes are not being healed.
Added a check in afr to prune SELinux xattr from the dictionary
used for removing xattrs from the sink.
Minor changes in tests and md-cache as well.
Signed-off-by: Vijay Bellur <vbellur@redhat.com>
Change-Id: I854bfc0098dde812ce2afe64b125ee40c04bdeb1
BUG: 957877
Reviewed-on: http://review.gluster.org/4905
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Change-Id: Ifa42762adde8b55ef1e2b51a59c93cebd983343f
BUG: 912581
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4792
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When subvols-per-directory is < available subvols, then there are layouts
which are not populated. This leads to incorrect identification of holes or
overlaps. We need to ignore layouts, which have err == 0, and start == stop.
In the current scenario (start == stop == 0).
Additionally, in layout-merge, treat missing xattrs as err = 0. In case of
missing layouts, anomalies will reset them.
For any other valid subvoles, err != 0 in case of layouts being zeroed out.
Also reverted back dht_selfheal_dir_xattr, which does layout calculation only
on subvols which have errors.
Change-Id: I9f57062722c9e8a26285e10675c31a78921115a1
BUG: 921408
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4668
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Today there is a non-obvious dependence of eager-locking on
write-behind. The reason is that eager-locking works as long
as the inheriting transaction has no overlaps with any of the
transactions already in progress. While write-behind provides
non-overlapping writes as a side-effect most of times (and only
guarantees it when strict-write-ordering option is enabled,
which is not on by default) eager-lock needs the behavior
as a guarantee. This is leading to complex and unwanted checks
for the presence of write-behind in the graph, for the simple
task of checking for overlaps.
This patch removes the interdependence between eager-locking
and write-behind by making eager-locking do its own overlap checks
with in-progress writes.
Change-Id: Iccba1185aeb5f1e7f060089c895a62840787133f
BUG: 912581
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4782
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For example:
If a new entry creation fop fails with EEXIST or a delete entry fop
fails with ENOENT, on all the subvols the fop is wound, then no
change took place to the directory. So we can treat that case as no
change happened to the directory.
Change-Id: I3b3a7931954da2166a9cba19ff9f76f37739d751
BUG: 860210
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4626
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
FIX:
In missing entry self heal, once the source directories are determined
after the lookup and if file is not present on any of the brick which
contains the souce directory, the entry is removed from the directory.
So log message should give information of "Purging of entry".
Change-Id: I4d3deb602e0812dc1c9c8ba0a466716d81dede7e
BUG: 947312
Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/4753
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
In Pump entry self-heal happens for each directory during the
first opendir using conservative merge. But in entry-self-heal
readdir is issued with '0' size. So entry self-heal is not
creating any files. After pump thinks entry self-heal is complete
it proceeds to heal each of the file in the directory it just
healed. Fortunately most of the times it chooses source-brick
in pump as read-child for readdir. This happens because readchild is the
subvolume on which lookup succeeds first. In pump lookup succeeds
faster in local process than on the destination brick process most
of the times. For all the entries pump finds in readdir it does a
lookup. During this lookup the entry on the destination brick is
created and healed. This is the reason why replace-brick
succeeds whenever read-child for the directory is chosen as the
source-brick. Which is most of the times. When read-child is chosen
as the destination brick, readdir returns no entries so replace-brick
completes without syncing the whole data.
Fix:
Set readdir-size in pump so that entry self-heal happens with
64k size. This ensures that entry self-heal triggered from
opendir actually creates the files on the destination brick.
Change-Id: I65ea45d3c2735a9578f3aa34eff771b6563241ca
BUG: 909800
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4712
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These functions keep changing as new functionality is added, so copying
and pasting the code is not a good solution. This way ensures that all
fields get initialized properly no matter how much new stuff we throw in.
Change-Id: I9e9b043d2d305d31e80cf5689465555b70312756
BUG: 924488
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-on: http://review.gluster.org/4710
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
if any subvol returned ENOENT while parent entrylk lock was held,
yield and return ENOENT for the entire lookup.
This is how the issue happens:
Multiple clients A, B and C are attempting 'mkdir -p /mnt/a/b/c'
1 Client A is in the middle of mkdir(/a). It has acquired lock.
It has performed mkdir(/a) on one subvol, and second one is still
in progress
2 Client B performs a lookup, sees directory /a on one,
ENOENT on the other, succeeds lookup.
3 Client B performs lookup on /a/b on both subvols, both return ENOENT
(one subvol because /a/b does not exist, another because /a
itself does not exist)
4 Client B proceeds to mkdir /a/b. It obtains entrylk on inode=/a with
basename=b on one subvol, but fails on other subvol as /a is yet to
be created by Client A.
5 Client A finishes mkdir of /a on other subvol
6 Client C also attempts to create /a/b, lookup returns ENOENT on
both subvols.
7 Client C tries to obtain entrylk on on inode=/a with basename=b,
obtains on one subvol (where B had failed), and waits for B to unlock
on other subvol.
8 Client B finishes mkdir() on one subvol with GFID-1 and completes
transaction and unlocks
9 Client C gets the lock on the second subvol, At this stage second
subvol already has /a/b created from Client B, but Client C does not
check that in the middle of mkdir transaction
10 Client C attempts mkdir /a/b on both subvols. It succeeds on
ONLY ONE (where Client B could not get lock because of
missing parent /a dir) with GFID-2, and gets EEXIST from ONE subvol.
This way we have /a/b in GFID mismatch. One subvol got GFID-1 because
Client B performed transaction on only one subvol (because entrylk()
could not be obtained on second subvol because of missing parent dir --
caused by premature/speculative succeeding of lookup() on /a when locks
are detected). Other subvol gets GFID-2 from Client C because while
it was waiting for entrylk() on both subvols, Client B was in the
middle of creating mkdir() on only one subvol, and Client C does not
"expect" this when it is between lock() and pre-op()/op() phase of the
transaction.
Original-author: Anand Avati <avati@redhat.com>
Change-Id: Idca475dbbc2a51e09da6fa0f9e1e37148caef208
BUG: 860210
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4625
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
xattrs are first removed from sink followed by setting
source xattrs.
Change-Id: I181cb5b785b667bbfc6e40787a2183a8f45de06b
BUG: 906646
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-on: http://review.gluster.org/4656
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Here are the logs of a file on which we saw EIO because of size mismatch:
[root@lizzie ~]# grep 38f18204 /var/log/glusterfs/mnt-x-.log
Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4
for offset: 0, len: 7680
Cleared unstable write flag for 38f18204-2840-408e-ae65-c01f4106b8c4:
offset 0 length 7680
Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for
offset: 7680, len: 71680
Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for
offset: 79360, len: 15716
fsync completed on 38f18204-2840-408e-ae65-c01f4106b8c4 for
offset 0 length 7680 with changelog status: -1 -1
According to these logs fsync did not happen after writev with
offset: 79360, len: 15716. Which is the reason for this problem.
In total 3 writes came. lets call them w1, w2, w3
w1 does pre_op so pre_op_done[0], pre_op_done[1] counts become 1 and 1
then is_piggyback_post_op() is called for w1 and it returns *false*
w1's fsync is fired
Now w2 and w3 come and see that pre_op_done[0], pre_op_done[1] are both 1,
so pre_op_piggyback[0] and pre_op_piggyback[1] are both incremented twice,
once by w2, one more time by w3 and become 2, 2 ------- Step-A
Now fsync of w1 is complete and it goes ahead with post op and decrements
pre_op_done[0], pre_op_done[1] to 0, 0
Now w2, w3 writevs complete and is_piggyback_post_op will return *true* for
both w2, w3.
So fsync is not fired for both w2, w3
this patch prevents Step-A from happening.
Change-Id: I8b6af1f1875b2cf5f718caa3c16ee7ff3dc96b5c
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4752
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The scheme to encode brick d_off and brick id into global d_off has
two approaches. Since both brick d_off and global d_off are both 64-bit
wide, we need to be careful about how the brick id is encoded.
Filesystems like XFS always give a d_off which fits within 32bits. So
we have another 32bits (actually 31, in this scheme, as seen ahead) to
encode the brick id - which is typically plenty.
Filesystems like the recent EXT4 utilize the upto 63 low bits in d_off,
as the d_off is calculated based on a hash function value. This leaves
us no "unused" bits to encode the brick id.
However both these filesystmes (EXT4 more importantly) are "tolerant" in
terms of the accuracy of the value presented back in seekdir(). i.e, a
seekdir(val) actually seeks to the entry which has the "closest" true
offset.
This "two-prong" scheme exploits this behavior - which seems to be the
best middle ground amongst various approaches and has all the advantages
of the old approach:
- Works against XFS and EXT4, the two most common filesystems out there.
(which wasn't an "advantage" of the old approach as it is borken against
EXT4)
- Probably works against most of the others as well. The ones which would
NOT work are those which return HUGE d_offs _and_ NOT tolerant to
seekdir() to "closest" true offset.
- Nothing to "remember in memory" or evict "old entries".
- Works fine across NFS server reboots and also NFS head failover.
- Tolerant to seekdir() to arbitrary locations.
Algorithm:
Each d_off can be encoded in either of the two schemes. There is no
requirement to encode all d_offs of a directory or a reply-set in
the same scheme.
The topmost bit of the 64 bits is used to specify the "type" of encoding
of this particular d_off. If the topmost bit (bit-63) is 1, it indicates
that the encoding scheme holds a HUGE d_off. If the topmost bit is is 0,
it indicates that the "small" d_off encoding scheme is used.
The goal of the "small" d_off encoding is to stay as dense as possible
towards the lower bits even in the global d_off.
The goal of the HUGE d_off encoding is to stay as accurate (close) as
possible to the "true" d_off after a round of encoding and decoding.
If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick
ID (call it "n").
SMALL d_off
===========
Encoding
--------
If the top n + 1 bits are free in a brick offset, then we leave the
top bit as 0 and set the remaining bits based on the old formula:
hi_mask = 0xffffffffffffffff
hi_mask = ~(hi_mask >> (n + 1))
if ((hi_mask & d_off_brick) != 0)
do_large_d_off_encoding ()
d_off_global = (d_off_brick * N) + brick_id
Decoding
--------
If the top bit in the global offset is 0, it indicates that this
is the encoding formula used. So decoding such a global offset will
be like the old formula:
if ((d_off_global & 0x8000000000000000) != 0)
do_large_d_off_decoding()
d_off_brick = (d_off_global % N)
brick_id = d_off_global / N
HUGE d_off
==========
Encoding
--------
If the top n + 1 bits are NOT free in a given brick offset, then we
set the top bit as 1 in the global offset. The low n bits are replaced
by brick_id.
low_mask = 0xffffffffffffffff << n // where n is ROOF(Log2(N))
d_off_global = (0x8000000000000000 | d_off_brick & low_mask) + brick_id
if (d_off_global == 0xffffffffffffffff)
discard_entry();
Decoding
--------
If the top bit in the global offset is set 1, it indicates that
the encoding formula used is above. So decoding would look like:
hi_mask = (0xffffffffffffffff << n)
low_mask = ~(hi_mask)
d_off_brick = (global_d_off & hi_mask & 0x7fffffffffffffff)
brick_id = global_d_off & low_mask
If "losing" the low n bits in this decoding of d_off_brick looks
"scary", we need to realize that till recently EXT4 used to only
return what can now be expressed as (d_off_global >> 32). The extra
31 bits of hash added by EXT recently, only decreases the probability
of a collision, and not eliminate it completely, anyways. In a way,
the "lost" n bits are made up by decreasing the probability of
collision by sharding the files into N bricks / EXT directories
-- call it "hash hedging", if you will :-)
Thanks-to: Zach Brown <zab@redhat.com>
Change-Id: Ieba9a7071829d51860b7c131982f12e0136b9855
BUG: 838784
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4711
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Introduce AFR_CALL_RESUME macro which cleans up frame->local, like
how AFR_STACK_UNWIND etc. do.
Therefore fix leak in afr_fsync() path.
Change-Id: I3855d8e7e84dbc44e05f507563b7f722bf9621b8
BUG: 927146
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4745
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Added extra fsync to data self-heal code to make sure the
data reached disk before erasing the changelogs
Change-Id: I9e7e6e55cdc49de2b991705d1638946464a9d4f9
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4744
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
1) pre_op_piggyback should always be decremented.
2) Move fsync resume to just after post_op.
3) fsync stub should be created from afr's local
not from the final response.
Change-Id: I220bb532eb03bea584292f4dd2e816ad0c3e0cf7
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4741
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
AFR now provides a stronger guarantee that fsync() returns only
after completely finishing all the deferred/delayed POST-OP on that
open file.
To acheive this we make a stub out of the returning fsync and
register it with the "delayed" frame in afr_changelog_wake_resume().
The delayed frame, after getting woken up and finishing the POST-OP
will call_resume() the registered stub (which UNWINDs the fsync) at
the time of frame destruction.
This provides a guarantee that an application's (or FUSE) fsync()
returns only after finishing up all the previous transactions,
including delayed POST-OPs and UNLOCK.
Change-Id: Iaa955457e2f25088a144fde37ad0444277b5cf49
BUG: 927146
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4737
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The changelogging scheme of AFR stores information about the state
of all replicas in all replicas (in the extended attribute of the
respective files on each server) in the form of 'pending counts'
of operations (effectively "dirty flags"). These xattrs are blindly
trusted while performing self-heal, and therefore utmost care has
to be taken while updating and maintaing them.
The most critical updation is the clearing of the pending counts
corresponding to the *other* server in the changelog of a given
server. Before clearing the pending count, we need durability
guarantee of the write which was performed on the other server.
To obtain such a guarantee, it may be necessary to explicitly
introduce an fsync() phase (if the file itself wasn't already
opened with O_SYNC).
This patch introduces the detection of unstable stable writes on
a file and issues explicit fsync() on the servers before performing
the POST-OP clearing of pending flags.
Change-Id: I2171b86a74ec91e40e5877eef0a4e7379578ecf7
BUG: 927146
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4721
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is necessary to support "DHT over DHT" configurations, so that the
upper and lower instances of DHT don't step all over each other. Why
would we even consider such a thing? Because it gives us the ability to
do data tiering and rack-aware placement, either by themselves or as
complements to other functionality such as erasure codes or
deduplication which save space but cost performance. By setting up the
top-level DHT to place data into one of several lower-level DHT pools
based on policy instead of pure elastic hashing, we get better
performance for 90% of accesses and better storage efficiency for 90% of
data, all for relatively low effort.
Change-Id: I72e65c29edfc80babf39f7a2a00090f4588c4070
BUG: 924265
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-on: http://review.gluster.org/4694
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Change-Id: Icee9772f1f1bf5336eb82a4dc13e198424cd4a65
BUG: 921996
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4699
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
| |
Change-Id: Id6f156957e58aad06bf2602f880c7e4102b80fd1
BUG: 764890
Signed-off-by: JulesWang <w.jq0722@gmail.com>
Reviewed-on: http://review.gluster.org/4679
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
Data self-heal may choose sink iatt to set mtimes.
This happens because after syncing of data is done
self-heal does one more xattrops/fstat to determine
sources sinks to set the inode-ctx. Since this is done
after data syncing and erase of xattrs, old source and
old sink are now sources, but the mtimes of them differ.
Old code just takes the first source from the list and
update mtimes, which could be sink before the self-heal
started.
Fix:
Set mtime from 'sources before syncing'.
Change-Id: Id769e1b99aa4f041eaee775f64cbf2c57b799723
BUG: 918437
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4658
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We needed to zero out the layout range, before we re-calculate the range.
When spread-count is issued, we would end up with stale ranges in the layout.
Replaced dht_selfheal_dir_xattr with dht_fix_dir_xattr, which correctly resets
the un-used (after re-cal) layouts.
Change-Id: I1a900d15df07335f59356bd23182ccec34381ab2
BUG: 884455
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4647
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Change-Id: I03be3cb23684de4ab36cf2953002708466edd580
BUG: 765433
Signed-off-by: Csaba Henk <csaba@redhat.com>
Reviewed-on: http://review.gluster.org/4601
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
With the present implementation, eager-lock is issued for
any fd fop. eager-lock is being transferred to metadata
transactions. But the lk-owner is set to local->fd address
only for DATA transactions, but for METADATA transactions
it is frame->root. Because of this unlock on the eager-lock fails
and rebalance hangs.
Fix:
Enable eager-lock for fd DATA transactions
Change-Id: If30df7486a0b2f5e4150d3259d1261f81473ce8a
BUG: 916226
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4588
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Change-Id: Ie2259023b9001311a2032792639c3093054f6750
BUG: 896431
Signed-off-by: Avra Sengupta <asengupt@redhat.com>
Reviewed-on: http://review.gluster.org/4552
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In dht_notify, we used to create a thread to start defrag
crawls after we had heard from all child subvols.
This was in-correct, as a later event, could also trigger the
crawl again(due to the fact that all subvols had responded).
The fix is to make sure, the thread is started only once after
all subvols have responded the first time
Change-Id: Ifc2978b9dc866af2395b79911eca50ab38ff9457
BUG: 916449
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4587
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Change-Id: Ia2e6bce80710d103da9d78afdb389ea162b00686
BUG: 912564
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4590
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
'gluster volume rebalance' command will be enhanced to support passing of these
options/pattern.
<pattern> is comma separated list as show below. The Precedence is from right
to left.
e.g- "*avi,*pdf:10MB,*:1KB"
The precedence is as follows:
migrate all files with size equal or greater than 1KB "*:1KB"
migrate all pdf files with size equal or greater than 10MB "*pdf:10MB"
migrate all avi files "*avi"
With this option, it is possible to choose which files to migrate.
Change-Id: I6d6d6a015bcbacf1debae2f278a2d92306fb055d
BUG: 896456
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4366
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If holes are encountered, then we do not write these to the dst,
which sometimes causes file size to be lesser than src. Data is not
corrupted, as when non-zero reads are received, we do write that data.
Calling a truncrate to give file size to prevent it from being
truncated to less than src in case the file end has holes.
Thanks to Brian Foster for providing the test case
Change-Id: I3cdd143b63ec8d797273d76189dff8b05eb9e551
BUG: 915554
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4574
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|