| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since removexattr() fails to remove "security.selinux" in a system
where SELinux is enforcing, xattr self-healing fails.
As a consequence of this, user extended attributes are not being healed.
Added a check in afr to prune SELinux xattr from the dictionary
used for removing xattrs from the sink.
Minor changes in tests and md-cache as well.
Signed-off-by: Vijay Bellur <vbellur@redhat.com>
Change-Id: I854bfc0098dde812ce2afe64b125ee40c04bdeb1
BUG: 957877
Reviewed-on: http://review.gluster.org/4905
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Change-Id: Ifa42762adde8b55ef1e2b51a59c93cebd983343f
BUG: 912581
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4792
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When subvols-per-directory is < available subvols, then there are layouts
which are not populated. This leads to incorrect identification of holes or
overlaps. We need to ignore layouts, which have err == 0, and start == stop.
In the current scenario (start == stop == 0).
Additionally, in layout-merge, treat missing xattrs as err = 0. In case of
missing layouts, anomalies will reset them.
For any other valid subvoles, err != 0 in case of layouts being zeroed out.
Also reverted back dht_selfheal_dir_xattr, which does layout calculation only
on subvols which have errors.
Change-Id: I9f57062722c9e8a26285e10675c31a78921115a1
BUG: 921408
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4668
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Today there is a non-obvious dependence of eager-locking on
write-behind. The reason is that eager-locking works as long
as the inheriting transaction has no overlaps with any of the
transactions already in progress. While write-behind provides
non-overlapping writes as a side-effect most of times (and only
guarantees it when strict-write-ordering option is enabled,
which is not on by default) eager-lock needs the behavior
as a guarantee. This is leading to complex and unwanted checks
for the presence of write-behind in the graph, for the simple
task of checking for overlaps.
This patch removes the interdependence between eager-locking
and write-behind by making eager-locking do its own overlap checks
with in-progress writes.
Change-Id: Iccba1185aeb5f1e7f060089c895a62840787133f
BUG: 912581
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4782
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For example:
If a new entry creation fop fails with EEXIST or a delete entry fop
fails with ENOENT, on all the subvols the fop is wound, then no
change took place to the directory. So we can treat that case as no
change happened to the directory.
Change-Id: I3b3a7931954da2166a9cba19ff9f76f37739d751
BUG: 860210
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4626
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
FIX:
In missing entry self heal, once the source directories are determined
after the lookup and if file is not present on any of the brick which
contains the souce directory, the entry is removed from the directory.
So log message should give information of "Purging of entry".
Change-Id: I4d3deb602e0812dc1c9c8ba0a466716d81dede7e
BUG: 947312
Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/4753
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
In Pump entry self-heal happens for each directory during the
first opendir using conservative merge. But in entry-self-heal
readdir is issued with '0' size. So entry self-heal is not
creating any files. After pump thinks entry self-heal is complete
it proceeds to heal each of the file in the directory it just
healed. Fortunately most of the times it chooses source-brick
in pump as read-child for readdir. This happens because readchild is the
subvolume on which lookup succeeds first. In pump lookup succeeds
faster in local process than on the destination brick process most
of the times. For all the entries pump finds in readdir it does a
lookup. During this lookup the entry on the destination brick is
created and healed. This is the reason why replace-brick
succeeds whenever read-child for the directory is chosen as the
source-brick. Which is most of the times. When read-child is chosen
as the destination brick, readdir returns no entries so replace-brick
completes without syncing the whole data.
Fix:
Set readdir-size in pump so that entry self-heal happens with
64k size. This ensures that entry self-heal triggered from
opendir actually creates the files on the destination brick.
Change-Id: I65ea45d3c2735a9578f3aa34eff771b6563241ca
BUG: 909800
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4712
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These functions keep changing as new functionality is added, so copying
and pasting the code is not a good solution. This way ensures that all
fields get initialized properly no matter how much new stuff we throw in.
Change-Id: I9e9b043d2d305d31e80cf5689465555b70312756
BUG: 924488
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-on: http://review.gluster.org/4710
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
if any subvol returned ENOENT while parent entrylk lock was held,
yield and return ENOENT for the entire lookup.
This is how the issue happens:
Multiple clients A, B and C are attempting 'mkdir -p /mnt/a/b/c'
1 Client A is in the middle of mkdir(/a). It has acquired lock.
It has performed mkdir(/a) on one subvol, and second one is still
in progress
2 Client B performs a lookup, sees directory /a on one,
ENOENT on the other, succeeds lookup.
3 Client B performs lookup on /a/b on both subvols, both return ENOENT
(one subvol because /a/b does not exist, another because /a
itself does not exist)
4 Client B proceeds to mkdir /a/b. It obtains entrylk on inode=/a with
basename=b on one subvol, but fails on other subvol as /a is yet to
be created by Client A.
5 Client A finishes mkdir of /a on other subvol
6 Client C also attempts to create /a/b, lookup returns ENOENT on
both subvols.
7 Client C tries to obtain entrylk on on inode=/a with basename=b,
obtains on one subvol (where B had failed), and waits for B to unlock
on other subvol.
8 Client B finishes mkdir() on one subvol with GFID-1 and completes
transaction and unlocks
9 Client C gets the lock on the second subvol, At this stage second
subvol already has /a/b created from Client B, but Client C does not
check that in the middle of mkdir transaction
10 Client C attempts mkdir /a/b on both subvols. It succeeds on
ONLY ONE (where Client B could not get lock because of
missing parent /a dir) with GFID-2, and gets EEXIST from ONE subvol.
This way we have /a/b in GFID mismatch. One subvol got GFID-1 because
Client B performed transaction on only one subvol (because entrylk()
could not be obtained on second subvol because of missing parent dir --
caused by premature/speculative succeeding of lookup() on /a when locks
are detected). Other subvol gets GFID-2 from Client C because while
it was waiting for entrylk() on both subvols, Client B was in the
middle of creating mkdir() on only one subvol, and Client C does not
"expect" this when it is between lock() and pre-op()/op() phase of the
transaction.
Original-author: Anand Avati <avati@redhat.com>
Change-Id: Idca475dbbc2a51e09da6fa0f9e1e37148caef208
BUG: 860210
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4625
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
xattrs are first removed from sink followed by setting
source xattrs.
Change-Id: I181cb5b785b667bbfc6e40787a2183a8f45de06b
BUG: 906646
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-on: http://review.gluster.org/4656
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Here are the logs of a file on which we saw EIO because of size mismatch:
[root@lizzie ~]# grep 38f18204 /var/log/glusterfs/mnt-x-.log
Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4
for offset: 0, len: 7680
Cleared unstable write flag for 38f18204-2840-408e-ae65-c01f4106b8c4:
offset 0 length 7680
Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for
offset: 7680, len: 71680
Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for
offset: 79360, len: 15716
fsync completed on 38f18204-2840-408e-ae65-c01f4106b8c4 for
offset 0 length 7680 with changelog status: -1 -1
According to these logs fsync did not happen after writev with
offset: 79360, len: 15716. Which is the reason for this problem.
In total 3 writes came. lets call them w1, w2, w3
w1 does pre_op so pre_op_done[0], pre_op_done[1] counts become 1 and 1
then is_piggyback_post_op() is called for w1 and it returns *false*
w1's fsync is fired
Now w2 and w3 come and see that pre_op_done[0], pre_op_done[1] are both 1,
so pre_op_piggyback[0] and pre_op_piggyback[1] are both incremented twice,
once by w2, one more time by w3 and become 2, 2 ------- Step-A
Now fsync of w1 is complete and it goes ahead with post op and decrements
pre_op_done[0], pre_op_done[1] to 0, 0
Now w2, w3 writevs complete and is_piggyback_post_op will return *true* for
both w2, w3.
So fsync is not fired for both w2, w3
this patch prevents Step-A from happening.
Change-Id: I8b6af1f1875b2cf5f718caa3c16ee7ff3dc96b5c
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4752
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The scheme to encode brick d_off and brick id into global d_off has
two approaches. Since both brick d_off and global d_off are both 64-bit
wide, we need to be careful about how the brick id is encoded.
Filesystems like XFS always give a d_off which fits within 32bits. So
we have another 32bits (actually 31, in this scheme, as seen ahead) to
encode the brick id - which is typically plenty.
Filesystems like the recent EXT4 utilize the upto 63 low bits in d_off,
as the d_off is calculated based on a hash function value. This leaves
us no "unused" bits to encode the brick id.
However both these filesystmes (EXT4 more importantly) are "tolerant" in
terms of the accuracy of the value presented back in seekdir(). i.e, a
seekdir(val) actually seeks to the entry which has the "closest" true
offset.
This "two-prong" scheme exploits this behavior - which seems to be the
best middle ground amongst various approaches and has all the advantages
of the old approach:
- Works against XFS and EXT4, the two most common filesystems out there.
(which wasn't an "advantage" of the old approach as it is borken against
EXT4)
- Probably works against most of the others as well. The ones which would
NOT work are those which return HUGE d_offs _and_ NOT tolerant to
seekdir() to "closest" true offset.
- Nothing to "remember in memory" or evict "old entries".
- Works fine across NFS server reboots and also NFS head failover.
- Tolerant to seekdir() to arbitrary locations.
Algorithm:
Each d_off can be encoded in either of the two schemes. There is no
requirement to encode all d_offs of a directory or a reply-set in
the same scheme.
The topmost bit of the 64 bits is used to specify the "type" of encoding
of this particular d_off. If the topmost bit (bit-63) is 1, it indicates
that the encoding scheme holds a HUGE d_off. If the topmost bit is is 0,
it indicates that the "small" d_off encoding scheme is used.
The goal of the "small" d_off encoding is to stay as dense as possible
towards the lower bits even in the global d_off.
The goal of the HUGE d_off encoding is to stay as accurate (close) as
possible to the "true" d_off after a round of encoding and decoding.
If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick
ID (call it "n").
SMALL d_off
===========
Encoding
--------
If the top n + 1 bits are free in a brick offset, then we leave the
top bit as 0 and set the remaining bits based on the old formula:
hi_mask = 0xffffffffffffffff
hi_mask = ~(hi_mask >> (n + 1))
if ((hi_mask & d_off_brick) != 0)
do_large_d_off_encoding ()
d_off_global = (d_off_brick * N) + brick_id
Decoding
--------
If the top bit in the global offset is 0, it indicates that this
is the encoding formula used. So decoding such a global offset will
be like the old formula:
if ((d_off_global & 0x8000000000000000) != 0)
do_large_d_off_decoding()
d_off_brick = (d_off_global % N)
brick_id = d_off_global / N
HUGE d_off
==========
Encoding
--------
If the top n + 1 bits are NOT free in a given brick offset, then we
set the top bit as 1 in the global offset. The low n bits are replaced
by brick_id.
low_mask = 0xffffffffffffffff << n // where n is ROOF(Log2(N))
d_off_global = (0x8000000000000000 | d_off_brick & low_mask) + brick_id
if (d_off_global == 0xffffffffffffffff)
discard_entry();
Decoding
--------
If the top bit in the global offset is set 1, it indicates that
the encoding formula used is above. So decoding would look like:
hi_mask = (0xffffffffffffffff << n)
low_mask = ~(hi_mask)
d_off_brick = (global_d_off & hi_mask & 0x7fffffffffffffff)
brick_id = global_d_off & low_mask
If "losing" the low n bits in this decoding of d_off_brick looks
"scary", we need to realize that till recently EXT4 used to only
return what can now be expressed as (d_off_global >> 32). The extra
31 bits of hash added by EXT recently, only decreases the probability
of a collision, and not eliminate it completely, anyways. In a way,
the "lost" n bits are made up by decreasing the probability of
collision by sharding the files into N bricks / EXT directories
-- call it "hash hedging", if you will :-)
Thanks-to: Zach Brown <zab@redhat.com>
Change-Id: Ieba9a7071829d51860b7c131982f12e0136b9855
BUG: 838784
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4711
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Introduce AFR_CALL_RESUME macro which cleans up frame->local, like
how AFR_STACK_UNWIND etc. do.
Therefore fix leak in afr_fsync() path.
Change-Id: I3855d8e7e84dbc44e05f507563b7f722bf9621b8
BUG: 927146
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4745
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Added extra fsync to data self-heal code to make sure the
data reached disk before erasing the changelogs
Change-Id: I9e7e6e55cdc49de2b991705d1638946464a9d4f9
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4744
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
1) pre_op_piggyback should always be decremented.
2) Move fsync resume to just after post_op.
3) fsync stub should be created from afr's local
not from the final response.
Change-Id: I220bb532eb03bea584292f4dd2e816ad0c3e0cf7
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4741
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
AFR now provides a stronger guarantee that fsync() returns only
after completely finishing all the deferred/delayed POST-OP on that
open file.
To acheive this we make a stub out of the returning fsync and
register it with the "delayed" frame in afr_changelog_wake_resume().
The delayed frame, after getting woken up and finishing the POST-OP
will call_resume() the registered stub (which UNWINDs the fsync) at
the time of frame destruction.
This provides a guarantee that an application's (or FUSE) fsync()
returns only after finishing up all the previous transactions,
including delayed POST-OPs and UNLOCK.
Change-Id: Iaa955457e2f25088a144fde37ad0444277b5cf49
BUG: 927146
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4737
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The changelogging scheme of AFR stores information about the state
of all replicas in all replicas (in the extended attribute of the
respective files on each server) in the form of 'pending counts'
of operations (effectively "dirty flags"). These xattrs are blindly
trusted while performing self-heal, and therefore utmost care has
to be taken while updating and maintaing them.
The most critical updation is the clearing of the pending counts
corresponding to the *other* server in the changelog of a given
server. Before clearing the pending count, we need durability
guarantee of the write which was performed on the other server.
To obtain such a guarantee, it may be necessary to explicitly
introduce an fsync() phase (if the file itself wasn't already
opened with O_SYNC).
This patch introduces the detection of unstable stable writes on
a file and issues explicit fsync() on the servers before performing
the POST-OP clearing of pending flags.
Change-Id: I2171b86a74ec91e40e5877eef0a4e7379578ecf7
BUG: 927146
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4721
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is necessary to support "DHT over DHT" configurations, so that the
upper and lower instances of DHT don't step all over each other. Why
would we even consider such a thing? Because it gives us the ability to
do data tiering and rack-aware placement, either by themselves or as
complements to other functionality such as erasure codes or
deduplication which save space but cost performance. By setting up the
top-level DHT to place data into one of several lower-level DHT pools
based on policy instead of pure elastic hashing, we get better
performance for 90% of accesses and better storage efficiency for 90% of
data, all for relatively low effort.
Change-Id: I72e65c29edfc80babf39f7a2a00090f4588c4070
BUG: 924265
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-on: http://review.gluster.org/4694
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Change-Id: Icee9772f1f1bf5336eb82a4dc13e198424cd4a65
BUG: 921996
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4699
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
| |
Change-Id: Id6f156957e58aad06bf2602f880c7e4102b80fd1
BUG: 764890
Signed-off-by: JulesWang <w.jq0722@gmail.com>
Reviewed-on: http://review.gluster.org/4679
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
Data self-heal may choose sink iatt to set mtimes.
This happens because after syncing of data is done
self-heal does one more xattrops/fstat to determine
sources sinks to set the inode-ctx. Since this is done
after data syncing and erase of xattrs, old source and
old sink are now sources, but the mtimes of them differ.
Old code just takes the first source from the list and
update mtimes, which could be sink before the self-heal
started.
Fix:
Set mtime from 'sources before syncing'.
Change-Id: Id769e1b99aa4f041eaee775f64cbf2c57b799723
BUG: 918437
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4658
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We needed to zero out the layout range, before we re-calculate the range.
When spread-count is issued, we would end up with stale ranges in the layout.
Replaced dht_selfheal_dir_xattr with dht_fix_dir_xattr, which correctly resets
the un-used (after re-cal) layouts.
Change-Id: I1a900d15df07335f59356bd23182ccec34381ab2
BUG: 884455
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4647
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Change-Id: I03be3cb23684de4ab36cf2953002708466edd580
BUG: 765433
Signed-off-by: Csaba Henk <csaba@redhat.com>
Reviewed-on: http://review.gluster.org/4601
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
With the present implementation, eager-lock is issued for
any fd fop. eager-lock is being transferred to metadata
transactions. But the lk-owner is set to local->fd address
only for DATA transactions, but for METADATA transactions
it is frame->root. Because of this unlock on the eager-lock fails
and rebalance hangs.
Fix:
Enable eager-lock for fd DATA transactions
Change-Id: If30df7486a0b2f5e4150d3259d1261f81473ce8a
BUG: 916226
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4588
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Change-Id: Ie2259023b9001311a2032792639c3093054f6750
BUG: 896431
Signed-off-by: Avra Sengupta <asengupt@redhat.com>
Reviewed-on: http://review.gluster.org/4552
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In dht_notify, we used to create a thread to start defrag
crawls after we had heard from all child subvols.
This was in-correct, as a later event, could also trigger the
crawl again(due to the fact that all subvols had responded).
The fix is to make sure, the thread is started only once after
all subvols have responded the first time
Change-Id: Ifc2978b9dc866af2395b79911eca50ab38ff9457
BUG: 916449
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4587
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Change-Id: Ia2e6bce80710d103da9d78afdb389ea162b00686
BUG: 912564
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4590
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
'gluster volume rebalance' command will be enhanced to support passing of these
options/pattern.
<pattern> is comma separated list as show below. The Precedence is from right
to left.
e.g- "*avi,*pdf:10MB,*:1KB"
The precedence is as follows:
migrate all files with size equal or greater than 1KB "*:1KB"
migrate all pdf files with size equal or greater than 10MB "*pdf:10MB"
migrate all avi files "*avi"
With this option, it is possible to choose which files to migrate.
Change-Id: I6d6d6a015bcbacf1debae2f278a2d92306fb055d
BUG: 896456
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4366
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If holes are encountered, then we do not write these to the dst,
which sometimes causes file size to be lesser than src. Data is not
corrupted, as when non-zero reads are received, we do write that data.
Calling a truncrate to give file size to prevent it from being
truncated to less than src in case the file end has holes.
Thanks to Brian Foster for providing the test case
Change-Id: I3cdd143b63ec8d797273d76189dff8b05eb9e551
BUG: 915554
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4574
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Before Anonymous fds are available, afr had to queue up
transactions if the file is not opened on one of its
subvolumes. This happens until the attempt to open the
file either succeeds or fails. These attempts happen
until the file is successfully opened on the subvolume.
Now client xlator uses anonymous fds to perform the fops
if the fd used for the fop is not 'opened'.
Fops will be successful even when the file is not opened
so there is no need to queue up the transactions anymore in afr.
Open is attempted on the subvolume where it is not
opened independent of the fop.
Change-Id: Id1a4b4ebe6f89f9efe8f6a8247918b91247d0819
BUG: 913051
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4568
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Change-Id: I0cf8a59c81e30603aadc45393bbd49d80ae1621f
BUG: 912657
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4542
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
fd based operations such as readv checked only for data split brain
instead of complete split-brain (i.e both data + metadata) assuming that
open would have done the complete split-brain check. However open-behind
would have unwound open, without winding to afr thus preventing the complete
split-brain check and some appliations will be able to read the contents
of the file even though the file has metadata split-brain. So let all
the fd based fops do a defensive check of complete split-brain.
Change-Id: Ia90b35f2b08426dfcad804b7f8105278c86fbd2d
BUG: 846240
Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
Reviewed-on: http://review.gluster.org/4548
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is to support the common "write to temp file then rename" idiom. In our
case this causes us to create a linkfile during the rename in (N-1)/N cases,
with a significant impact on performance e.g. for UFO which uses this idiom.
This patch allows the user to specify up to two regular expressions that
separate the permanent and transient parts of a temp-file name:
rsync-hash-regex reimplements the existing RSYNC_FRIENDLY_NAME
pattern where the temporary name for EXAMPLE is .EXAMPLE.suffix
extra-hash-regex can be used for a second pattern, active concurrently,
and can include alternatives via the POSIX extended-regex syntax
Change-Id: Ic1a6ed19324bc31fefe91ee34b8478877a9c5d2c
BUG: 912564
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-on: http://review.gluster.org/4116
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
provide space between path and next string, so that we can grep
for the correct path in error logs.
Change-Id: Ide53e341864dce432406a58e8cbb2ff1480cfbde
Signed-off-by: Amar Tumballi <amarts@redhat.com>
BUG: 815194
Reviewed-on: http://review.gluster.org/4531
Reviewed-by: Anand Avati <avati@redhat.com>
Tested-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Though linkfile_create and rebalance dst file create sent a setattr
with correct ownership, there is still a race window where the linkfile
open (client open due to migration) will fail, as its ownership will be
root:root.
Change-Id: I056092da6102319efa3bb9f1795f8c97db2a3fed
BUG: 884597
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4513
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After fix http://review.gluster.org/4282 (libglusterfsterfs/syncop: do not
hold ref on the fd in cbk) was pushed, syncop_open does not take a ref anymore.
Change-Id: I5543986a4f2d965eef42ca979b39ac75439cec49
BUG: 910661
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4509
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, linkfile creation happens as root.
use uid/gid returned from _cbk (link/rename) to set the correct ownership of
the link files.
Also added test/dht.rc to implement common dht functions
Change-Id: Iecdcf52d89fed8286670ce430b9d19deebd27b8c
BUG: 884597
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4304
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since directories have presence on all subvolumes there is
no definite meaning of ->hashed_subvol or ->cached_subvol.
getxattr() code path chooses ->cached_subvol for pathinfo
extended attribute. While this makes sense of files, it makes
less sense for directories. Further if a hashed or a cached
subvolume is down, and there's a getxattr request for a
directory, we return with an errno.
This patch changes pathinfo extended attribute contents by
aggregating information from all subvolumes that are up.
Change-Id: I58adb741d63ccfd1d0239af75eb65f26f0fb384d
Signed-off-by: Venky Shankar <vshankar@redhat.com>
BUG: 856455
Reviewed-on: http://review.gluster.org/4047
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
Change-Id: I1c9541058c7d07786539a3266ca125a6a15287d8
BUG: 859835
Signed-off-by: Anand Avati <avati@redhat.com>
Original-author: Kacper Kowalik (Xarthisius) <xarthisius.kk@gmail.com>
Signed-off-by: Kacper Kowalik (Xarthisius) <xarthisius.kk@gmail.com>
Reviewed-on: http://review.gluster.org/3967
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Typically this lock was not needed in practice, but with
http://review.gluster.org/3842, this code gets executed in multiple
threads for different servers and we lose a count. This results in
leaked lock and a hang for a future transaction.
Change-Id: I377ed20e44f2a45cff522289dfef181f0653eca2
BUG: 765564
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4480
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This method deals with the case where swapping might gain a bigger overlap
for the xlator currently under consideration, but sacrifices even more from
the xlator we're swapping with. For example:
A = 0x00000000 - 0x44444443 (new 0x00000000 - 0x55555554)
B = 0x44444444 - 0x77777776 (new 0x55555555 - 0xaaaaaaa9)
C = 0x77777777 - 0xffffffff (new 0xaaaaaaaa - 0xffffffff)
Here, the new range for B has a bigger overlap with the old C than with the
old B (0x33333333 vs. 0x22222222 to be precise) so looking only at that
might lead us to swap. However, such a swap turns the new C's overlap from
0x55555556 (vs. old C) to *zero* (vs. old B). In other words, we've gained
0x11111111 for B but lost 0x55555556 for C, so it's a bad idea.
The new algorithm accounts for all effects of the swap, so it not only avoids
bad swaps but can make some good ones that would have been missed previously.
For example, if swapping a range X with a later range Y would not increase the
overlap for X we would previously have skipped it even if the swap would
increase Y's overlap without affecting X's. This is the normal case when we're
adding a new brick (which initially has zero overlap with any old range) so
finding more good swaps is probably even more important than avoiding bad ones.
Also, the logic in dht_overlap_calc was completely broken before, causing
integer overflows instead of providing correct values, so no matter what
higher-level algorithm was in place the GIGO effect would have resulted in
bad decisions.
Change-Id: If61ed513cfcb931916c6b51da293e3efbaaf385f
BUG: 853258
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-on: http://review.gluster.org/3908
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
Change-Id: I7049c0c64e36a9dfa4cc0e0b34de7ec111d2f6c1
BUG: 908302
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4076
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There is no necessity for the delayed-post-op to wait until
the next fop phase on the fd completes. Change-log,
locks are inherited by the time next fop phase is attempted
so the wakeup can happen just before the fop phase is started.
Change-Id: I0b8e591f591b0f7565eb55265ab51f476ed2b165
BUG: 908302
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4073
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
Files were being created in subvol which had less than
min_free_disk available even in the cases where other
subvols with more space were available.
Solution:
Changed the logic to look for subvol which has more
space available.
In cases where all the subvols have lesser than
Min_free_disk available , the one with max space and
atleast one inode is available.
Known Issue: Cannot ensure that first file that is
created right after min-free-value is crossed on a
brick will get created in other brick because disk
usage stat takes some time to update in glusterprocess.
Will fix that as part of another bug.
Change-Id: If3ae0bf5a44f8739ce35b3ee3f191009ddd44455
BUG: 858488
Signed-off-by: Raghavendra Talur <rtalur@redhat.com>
Reviewed-on: http://review.gluster.org/4420
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
* 'CONNECTING' is taken as CHILD_UP.
* Sending notifications (default_notify()) for all the events individually
while mounting.
Solution:
* Consider Child up only after the event CHILD_UP is received.
* Send a single notification for all the children's events only
while mounting.
Change-Id: I1b7de127e12f5bfb8f80702dbdce02019e138bc8
BUG: 885072
Signed-off-by: Varun Shastry <vshastry@redhat.com>
Reviewed-on: http://review.gluster.org/4356
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In dht_mkdir_cbk, EEXIST error is treated like a true error. Because
of this the following sequence of events can happen, eventually
resulting in GFID mismatch and (and possibly leaked locks and hang,
in the presence of replicate.)
The issue exists when many clients concurrently attempt creation of
directory and subdirectory (e.g mkdir -p /mnt/gluster/dir1/subdir)
0. First mkdir happens by one client on the hashed subvolume. Only
one client wins the race. Others racing mkdirs get EEXIST. Yet
other "laggers" in the race encounter the just-created directory
in lookup() on the hash dir.
1. At least one "lagger" lookup() notices that there are missing
directories on other subvolumes (which the "winner" mkdir is yet
to create), and starts off self-heal of the directory.
2. At least on some subvolumes, self-heal's mkdir wins the race
against the "winner" mkdir and creates the directory first. This
causes the "winner" mkdir to experience EEXIST error on those
subvolumes.
3. On other subvolumes where "winner" mkdir won the race, self-heal
experiences EEXIST error, but self-heal is properly translating
that into a success (but mkdir code path is not -- which is the
bug.)
4. Both mkdir and self-heal assign hash layouts to the just created
directory. But self-heal distributes hash range across N (total)
subvolumes, whereas mkdir distributes hash range across N - M
(where M is the number of subvolumes where mkdir lost the race).
Both the clients "cache" their respective layouts in the near
future for all future creates inside them (evidence in logs)
5. During the creation of the subdirectory, two clients race again.
Ideally winner performs mkdir() on the hashed subvolume and proceeds
to create other dirs, loser experiences EEXIST error on the hashed
subvolume and backs off. But in this case, because the two clients
have different layout views of the parent directory (because of
different hash splits and assignements), the hashed subvolumes for
the new directory can end up being different. Therefore, both clients
now win the race (they were never fighting against each other on a
common server), assigning different GFIDs to the directory on their
respective (different) subvolumes. Some of the remaining subvolumes
get GFID1, others GFID2.
Conclusion/Fix:
Making mkdir translate EEXIST error as success (just the way self-heal
is already rightly doing) will bring back truth to the design claim
that concurrent mkdir/self-heals perform deterministic + idempotent
operations. This will prevent the differing "hash views" by different
clients and thereby also avoid GFID mismatch by forcing all clients
to have a "fair race", because the hashed subvolume for all will be
the same (and thereby avoiding leaked locks and hangs.)
Change-Id: I84592fb9b8a3f739a07e2afb23b33758a0a9a157
BUG: 907072
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4459
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
Change-Id: Iaf119f839cb2113b8f8efb7bf7636d471b6541bf
BUG: 866440
Signed-off-by: Venkatesh Somyajula <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/4385
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Default_fops uses stack_wind_tail. It winds without creating the frame leading
into wrong subvol return in the cookie. To avoid the problem caused by the
same, we're getting the subvol by passing the cookie.
Change-Id: I51ee79b22c89e4fb0b89e9a0bc3ac96c5b469f8f
BUG: 893338
Signed-off-by: Varun Shastry <vshastry@redhat.com>
Reviewed-on: http://review.gluster.org/4388
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
Tested-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Do not do fd_ref in cbks of the fops which return a fd (such as
open, opendir, create).
Change-Id: Ic2f5b234c5c09c258494f4fb5d600a64813823ad
BUG: 885008
Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
Reviewed-on: http://review.gluster.org/4282
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When one of the subvolume is down, then lock request is not attempted
on that subvolume and move on to the next subvolume.
/* skip over children that are down */
while ((child_index < priv->child_count)
&& !local->child_up[child_index])
child_index++;
In the above case if there are 2 subvolumes and 2nd subvolume is down (subvolume
1 from afr's view), then after attempting lock on 1st child (i.e subvolume 0)
child index is calculated to be 1. But since the 2nd child is down child_index
is incremented to 2 as per the above logic and lock request is STACK_WINDed to
the child with child_index 2. Since there are only 2 children for afr the child
(i.e the xlator_t pointer) for child_index will be NULL. The process crashes
when it dereference the NULL xlator object.
Change-Id: Icd9b5ad28bac1b805e6e80d53c12d296526bedf5
BUG: 765564
Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
Reviewed-on: http://review.gluster.org/4438
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|