| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Posix translator returns pre and postbufs in the dict in {F}REMOVEXATTR fops.
These iatts are further cached at layers like md-cache.
Shard translator, in its current state, simply returns these values without
updating the aggregated file size and block-count.
This patch fixes this problem.
Change-Id: I4b2dd41ede472c5829af80a67401ec5a6376d872
Fixes: #1243
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
(cherry picked from commit 32519525108a2ac6bcc64ad931dc8048d33d64de)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
In a replicate/arbiter volume if file creations or writes fails on
quorum number of bricks and on one brick it is due to ENOSPC and
on other brick it fails for a different reason, it may fail with
errors other than ENOSPC in some cases.
Fix:
Prioritize ENOSPC over other lesser priority errors and do not set
op_errno in posix_gfid_set if op_ret is 0 to avoid receiving any
error_no which can be misinterpreted by __afr_dir_write_finalize().
Also removing the function afr_has_arbiter_fop_cbk_quorum() which
might consider a successful reply form a single brick as quorum
success in some cases, whereas we always need fop to be successful
on quorum number of bricks in arbiter configuration.
Change-Id: I106e267f8b9451f681022f1cccb410d9bc824c08
Fixes: #1254
Signed-off-by: karthik-us <ksubrahm@redhat.com>
(cherry picked from commit fa63b45ca5edf172b1b89b28b5db3c5129cc57b6)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem: See github issue for details.
Fix:
-In lookup if the entry exists in 2 out of 3 bricks, don't fail the
lookup with ENOENT just because there is an entrylk on the parent.
Consider quorum before deciding.
-If entry FOP does not succeed on quorum no. of bricks, do not perform
new entry mark.
Fixes: #1303
Change-Id: I56df8c89ad53b29fa450c7930a7b7ccec9f4a6c5
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
(cherry picked from commit c4a6748f25d2c1ab3ebcf89952278ebf94c8d371)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Posix translator returns pre and postbufs in the dict in {F}SETXATTR fops.
These iatts are further cached at layers like md-cache.
Shard translator, in its current state, simply returns these values without
updating the aggregated file size and block-count.
This patch fixes this problem.
Change-Id: I4da0eceb4235b91546df79270bcc0af8cd64e9ea
Fixes: #1243
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
(cherry picked from commit 29ec66c6ab77e2d6893c6e213a3d1fb148702c99)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There was a critical flaw in the previous implementation of open-behind.
When an open is done in the background, it's necessary to take a
reference on the fd_t object because once we "fake" the open answer,
the fd could be destroyed. However as long as there's a reference,
the release function won't be called. So, if the application closes
the file descriptor without having actually opened it, there will
always remain at least 1 reference, causing a leak.
To avoid this problem, the previous implementation didn't take a
reference on the fd_t, so there were races where the fd could be
destroyed while it was still in use.
To fix this, I've implemented a new xlator cbk that gets called from
fuse when the application closes a file descriptor.
The whole logic of handling background opens have been simplified and
it's more efficient now. Only if the fop needs to be delayed until an
open completes, a stub is created. Otherwise no memory allocations are
needed.
Correctly handling the close request while the open is still pending
has added a bit of complexity, but overall normal operation is simpler.
Change-Id: I6376a5491368e0e1c283cc452849032636261592
Fixes: #1225
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
frame is accessed after stack-wind. This can lead to crash
if the cbk frees the frame.
Fix:
Use new frame for the wind instead.
Fixes: #832
Change-Id: I64754609f1114b0bbd4d1336fa81a56f2cca6e03
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There was a bug in write-behind that allowed a previous completed write
to overwrite the overlapping region of data from a future write.
Suppose we want to send three writes (W1, W2 and W3). W1 and W2 are
sequential, and W3 writes at the same offset of W2:
W2.offset = W3.offset = W1.offset + W1.size
Both W1 and W2 are sent in parallel. W3 is only sent after W2 completes.
So W3 should *always* overwrite the overlapping part of W2.
Suppose write-behind processes the requests from 2 concurrent threads:
Thread 1 Thread 2
<received W1>
<received W2>
wb_enqueue_tempted(W1)
/* W1 is assigned gen X */
wb_enqueue_tempted(W2)
/* W2 is assigned gen X */
wb_process_queue()
__wb_preprocess_winds()
/* W1 and W2 are sequential and all
* other requisites are met to merge
* both requests. */
__wb_collapse_small_writes(W1, W2)
__wb_fulfill_request(W2)
__wb_pick_unwinds() -> W2
/* In this case, since the request is
* already fulfilled, wb_inode->gen
* is not updated. */
wb_do_unwinds()
STACK_UNWIND(W2)
/* The application has received the
* result of W2, so it can send W3. */
<received W3>
wb_enqueue_tempted(W3)
/* W3 is assigned gen X */
wb_process_queue()
/* Here we have W1 (which contains
* the conflicting W2) and W3 with
* same gen, so they are interpreted
* as concurrent writes that do not
* conflict. */
__wb_pick_winds() -> W3
wb_do_winds()
STACK_WIND(W3)
wb_process_queue()
/* Eventually W1 will be
* ready to be sent */
__wb_pick_winds() -> W1
__wb_pick_unwinds() -> W1
/* Here wb_inode->gen is
* incremented. */
wb_do_unwinds()
STACK_UNWIND(W1)
wb_do_winds()
STACK_WIND(W1)
So, as we can see, W3 is sent before W1, which shouldn't happen.
The problem is that wb_inode->gen is only incremented for requests that
have not been fulfilled but, after a merge, the request is marked as
fulfilled even though it has not been sent to the brick. This allows
that future requests are assigned to the same generation, which could
be internally reordered.
Solution:
Increment wb_inode->gen before any unwind, even if it's for a fulfilled
request.
Special thanks to Stefan Ring for writing a reproducer that has been
crucial to identify the issue.
Change-Id: Id4ab0f294a09aca9a863ecaeef8856474662ab45
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
Fixes: #884
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
...if pending xattrs are zero for all children.
Problem:
If there are no pending xattrs and a metadata heal needs to be
performed, it can be possible that we end up with xattrs inadvertendly
deleted from all bricks, as explained in the BZ.
Fix:
After picking one among the sources as the good copy, mark pending xattrs on
all sources to blame the sinks. Now even if this metadata heal fails midway,
a subsequent heal will still choose one of the valid sources that it
picked previously.
Updates: #1067
Change-Id: If1b050b70b0ad911e162c04db4d89b263e2b8d7b
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
(cherry picked from commit 2d5ba449e9200b16184b1e7fc84cabd015f1f779)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Issue:
1- In a cluster of 3 Nodes N1, N2, N3. Create 3 volumes vol1,
vol2, vol3 with 3 bricks (one from each node)
2- Set cluster.brick-multiplex on
3- Start all 3 volumes
4- Check if all bricks on a node are running on same port
5- Kill N1
6- Set performance.readdir-ahead for volumes vol1, vol2, vol3
7- Bring N1 up and check volume status
8- All bricks processes not running on N1.
Root Cause -
Since, There is a diff in volfile versions in N1 as compared
to N2 and N3 therefore glusterd_import_friend_volume() is called.
glusterd_import_friend_volume() copies the new_volinfo and deletes
old_volinfo and then calls glusterd_start_bricks().
glusterd_start_bricks() looks for the volfiles and sends an rpc
request to glusterfs_handle_attach(). Now, since the volinfo
has been deleted by glusterd_delete_stale_volume()
from priv->volumes list before glusterd_start_bricks() and
glusterd_create_volfiles_and_notify_services() and
glusterd_list_add_order is called after glusterd_start_bricks(),
therefore the attach RPC req gets an empty volfile path
and that causes the brick to crash.
Fix- Call glusterd_list_add_order() and
glusterd_create_volfiles_and_notify_services before
glusterd_start_bricks() cal is made in glusterd_import_friend_volume
> Change-Id: Idfe0e8710f7eb77ca3ddfa1cabeb45b2987f41aa
> Bug: bz#1773856
> Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
(cherry picked from commit 45e81aae791da9d013aba2286af44826227c05ec)
Change-Id: Idfe0e8710f7eb77ca3ddfa1cabeb45b2987f41aa
fixes: bz#1808964
Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
In a hyperconverged setup with granular-entry-heal enabled, if a file is
recreated while one of the bricks is down, and an index heal is triggered
(with the brick still down), entry-self heal was doing a spurious heal
with just the 2 good bricks. It was doing a post-op leading to removal
of the filename from .glusterfs/indices/entry-changes as well as
erroneous setting of afr xattrs on the parent. When the brick came up,
the xattrs were cleared, resulting in the renamed file not getting
healed and leading to gfid split-brain and EIO on the mount.
Fix:
Proceed with entry heal only when shd can connect to all bricks of the replica,
just like in data and metadata heal.
fixes: bz#1804591
Change-Id: I916ae26ad1fabf259bc6362da52d433b7223b17e
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
(cherry picked from commit 06453d77d056fbaa393a137ca277a20e38d2f67e)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem: At the time of coming up one server node(1x3) after reboot
client is unmounted.The client is unmounted because a client
is getting AUTH_FAILED event and client call fini for the graph.The
client is getting AUTH_FAILED because brick is not attached with a
graph at that moment
Solution: To avoid the unmounting the client graph throw ENOENT error
from server in case if brick is not attached with server at
the time of authenticate clients.
> Credits: Xavi Hernandez <xhernandez@redhat.com>
> Change-Id: Ie6fbd73cbcf23a35d8db8841b3b6036e87682f5e
> Fixes: bz#1793852
> Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
> (cherry picked from commit > f6421dff22a6ddaf14134f6894deae219948c89d)
Change-Id: Ie6fbd73cbcf23a35d8db8841b3b6036e87682f5e
Fixes: bz#1794019
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Null character string is a valid xattr value in file system. But for
those xattrs processed by md-cache, it does not update its entries if
value is null('\0'). This results in ENODATA when those xattrs are
queried afterwards via getxattr() causing failures in basic operations
like create, copy etc in a specially configured Samba setup for Mac OS
clients.
On the other side snapview-server is internally setting empty string("")
as value for xattrs received as part of listxattr() and are not intended
to be cached. Therefore we try to maintain that behaviour using an
additional dictionary key to prevent updation of entries in getxattr()
and fgetxattr() callbacks in md-cache.
Credits: Poornima G <pgurusid@redhat.com>
Change-Id: I7859cbad0a06ca6d788420c2a495e658699c6ff7
Fixes: bz#1785228
Signed-off-by: Anoop C S <anoopcs@redhat.com>
(cherry picked from commit b4b683736367d93daad08a5ee6ca95778c07c5a4)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
In a situation where B1 blames B2, B2 blames B1 and B3 doesn't blame
anything for entry heal, heal will not complete even though we have
clear source and sinks. This will happen because while doing
afr_selfheal_find_direction() only the bricks which are blamed by
non-accused bricks are considered as sinks. Later in
__afr_selfheal_entry_finalize_source() when it tries to mark all the
non-sources as sinks it fails to do so because there won't be any
healed_sinks marked, no witness present and there will be a source.
Fix:
If there is a source and no healed_sinks, then reset all the locked
sources to 0 and healed sinks to 1 to do conservative merge.
Change-Id: If40d8bc95d52a52b2730f55bdcf135109b421548
Fixes: bz#1760699
Signed-off-by: karthik-us <ksubrahm@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Ever since we added quorum checks for lookups in afr via commit
bd44d59741bb8c0f5d7a62c5b1094179dd0ce8a4, the split-brain resolution
commands would not work for replica 3 because there would be no
readables for the lookup fop.
The argument was that split-brains do not occur in replica 3 but we do
see (data/metadata) split-brain cases once in a while which indicate that there are
a few bugs/corner cases yet to be discovered and fixed.
Fortunately, commit 8016d51a3bbd410b0b927ed66be50a09574b7982 added
GF_CLIENT_PID_GLFS_HEALD as the pid for all fops made by glfsheal. If we
leverage this and allow lookups in afr when pid is GF_CLIENT_PID_GLFS_HEALD,
split-brain resolution commands will work for replica 3 volumes too.
Likewise, the check is added in shard_lookup as well to permit resolving
split-brains by specifying "/.shard/shard-file.xx" as the file name
(which previously used to fail with EPERM).
Change-Id: I3c543dea79caf7cfbc1633e9089cb1cdd2538ba9
Fixes: bz#1760791
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
(cherry picked from commit 47dbd753187f69b3835d2e42fdbe7485874c4b3e)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If heal from next brick starts after the first brick completes heal, then
opendir on the brick can change atime leading to failure of the test. When
ctime is disabled it is better to just check mtime to be same after heal.
Backport of:
> BUG: 1751134
> Change-Id: Ia03e30fd547e6bbe85c1e299845ffa122f3a2692
> Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
(cherry picked from commit 0e37cdf271a48d3e58c212e95664a2aa34da3940)
fixes: bz#1769320
Change-Id: Ia03e30fd547e6bbe85c1e299845ffa122f3a2692
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After add-brick and rebalance, the ctime xattr is not present
on rebalanced directories on new brick. This patch fixes the
same.
Note that ctime still doesn't support consistent time across
distribute sub-volume.
This patch also fixes the in-memory inconsistency of time attributes
when metadata is self healed.
Backport of:
> Patch: https://review.gluster.org/23127/
> Change-Id: Ia20506f1839021bf61d4753191e7dc34b31bb2df
> BUG: 1734026
> Signed-off-by: Kotresh HR <khiremat@redhat.com>
(cherry picked from commit 304640e55c0f3c6d15f4e230dc6376e4f5020fea)
Change-Id: Ia20506f1839021bf61d4753191e7dc34b31bb2df
Signed-off-by: Kotresh HR <khiremat@redhat.com>
fixes: bz#1752429
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We were not passing xattr_req when doing a name self heal
as well as a meta data heal. Because of this, some xdata
was missing which causes i/o errors
Backport of > https://review.gluster.org/#/c/glusterfs/+/23024/
>Change-Id: Ibfb1205a7eb0195632dc3820116ffbbb8043545f
>Fixes: bz#1728770
>Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
Fixes: bz#1749305
Change-Id: Ibfb1205a7eb0195632dc3820116ffbbb8043545f
Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
(cherry picked from commit d026f0bcfd301712e4f0671ccf238f43f2e6dd30)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
Since commit 600ba94183333c4af9b4a09616690994fd528478, shd starts
healing as soon as it is toggled from disabled to enabled. This was
causing the following line in the .t to fail on a 'fast' machine (always
on my laptop and sometimes on the jenkins slaves).
EXPECT_NOT "^0$" get_pending_heal_count $V0
because by the time shd was disabled, the heal was already completed.
Fix:
Increase the no. of files to be healed and make it a variable called
FILE_COUNT, should we need to bump it up further because the machines
become even faster. Also created pending metadata heals to increase the
time taken to heal a file.
fixes: bz#1749155
Change-Id: I5a26b08e45b8c19bce3c01ce67bdcc28ed48198d
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
(cherry picked from commit 724c657995a2e148243eeb78c68b620c6d7714a5)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
...whenever shd is re-enabled after disabling or there is a change in
`cluster.heal-timeout`, without needing to restart shd or waiting for the
current `cluster.heal-timeout` seconds to expire.
See BZ 1743988 for more details.
Change-Id: Ia5ebd7c8e9f5b54cba3199c141fdd1af2f9b9bfe
fixes: bz#1747301
Reported-by: Glen Kiessling <glenk1973@hotmail.com>
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
(cherry picked from commit 600ba94183333c4af9b4a09616690994fd528478)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem: sometime ./tests/bugs/glusterd/bug-1595320.t is failing is
failing at the time of checking brick_process after sending
a kill signal to brick process
Solution: Wait sometime after just sending a kill signal to brick
process to make sure brick process is stopped
> Change-Id: Iee9e91284618abfc62a550d47e4f9117785def58
> Fixes: bz#1743200
> Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
> (cherry picked from commit 8f1620ad7f5d3d040fee55c5f873349800e2268d)
Change-Id: Iee9e91284618abfc62a550d47e4f9117785def58
Fixes: bz#1745422
Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
|
|
|
|
|
|
|
| |
Fixes: bz#1741041
Change-Id: I29e338bac62104233a6f80212df8d0fb016affda
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
(cherry picked from commit 8e9c53ebf16705b9a1db2fc486dc24a5cb244ddd)
|
|
|
|
|
|
|
| |
Change-Id: I0cebaaf55c09eb1fb77a274268ff564e871b743b
fixes bz#1740316
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
(cherry picked from commit 51237eda7c4b3846d08c5d24d1e3fe9b7ffba1d4)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problems:
1. When checking for type and gfid mismatch, if the type or gfid
is unknown because of missing gfid handle and the gfid xattr
it will be reported as type or gfid mismatch and the heal will
not complete.
2. If the source selected during entry heal has null gfid the same
will be sent to afr_lookup_and_heal_gfid(). In this function when
we try to assign the gfid on the bricks where it does not exist,
we are considering the same gfid and try to assign that on those
bricks. This will fail in posix_gfid_set() since the gfid sent
is null.
Fix:
If the gfid sent to afr_lookup_and_heal_gfid() is null choose a
valid gfid before proceeding to assign the gfid on the bricks
where it is missing.
In afr_selfheal_detect_gfid_and_type_mismatch(), do not report
type/gfid mismatch if the type/gfid is unknown or not set.
Change-Id: Ia06552e4dc4a9f89cb7f5302833604bd21bbf7da
fixes: bz#1729481
Signed-off-by: karthik-us <ksubrahm@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
tests/bugs/replicate/bug-1717819-metadata-split-brain-detection.t fails
intermittently in test cases #49 & #50, which compare the values of the
user set xattr values after enabling the heal. We are not waiting for
the heal to complete before comparing those values, which might lead
those tests to fail.
Fix:
Wait till the HEAL-TIMEOUT before comparing the xattr values.
Also cheking for the shd to come up and the bricks to connect to the shd
process in another case.
Change-Id: I0e245b328da9df23ce70c5300278fad1c1d9f7ff
fixes: bz#1729895
Signed-off-by: karthik-us <ksubrahm@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem: test case is failing after just starting the volume at the
time of running stat command on mount point and client is
getting error transport endpoint is not conencted
Solution: To avoid the error make sure all brick instance should be
up and mount point should be active
> Change-Id: I49553a04d5b13e155ee02f4a1888a07fe3ee2ff5
> fixes: bz#1721590
> Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
> (cherry picked from commit 283b77805cca3027e333a11c9b00ac611662c9ee)
Change-Id: I49553a04d5b13e155ee02f4a1888a07fe3ee2ff5
fixes: bz#1728182
Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The feature is not supported and is moved out of the codebase from
glusterfs-5.x release. Doesn't make sense to keep the code to
support it.
For those who want to upgrade from an version supporting it to higher
version, please do a 'gluster volume reset $VOL encryption reset' and
then continue with the upgrade process.
updates: bz#1648169
Change-Id: I8cf822c0d7195940bd37f6af2432a3cac68d44d1
Signed-off-by: Amar Tumballi <amarts@redhat.com>
|
|
|
|
|
|
|
|
|
| |
... with out which volume creation fails with "volume create: <xyz>: failed:
Failed to create volume files"
Fixes: bz#1716812
Change-Id: I2f4c2c6d5290f066b54e1c1db19e25db9937bedb
Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
|
|
|
|
|
|
|
|
| |
$SRC/glusterfs/bugs/nfs/showmount-many-clients.t
Change-Id: I48758cc66fcb55f48c4a8a0a738b06867f6814a1
Signed-off-by: Aravinda VK <avishwan@redhat.com>
Updates: bz#1193929
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
We currently don't have a roll-back/undoing of post-ops if quorum is not met.
Though the FOP is still unwound with failure, the xattrs remain on the disk.
Due to these partial post-ops and partial heals (healing only when 2 bricks
are up), we can end up in metadata split-brain purely from the afr xattrs
point of view i.e each brick is blamed by atleast one of the others for
metadata. These scenarios are hit when there is frequent connect/disconnect
of the client/shd to the bricks.
Fix:
Pick a source based on the xattr values. If 2 bricks blame one, the blamed
one must be treated as sink. If there is no majority, all are sources. Once
we pick a source, self-heal will then do the heal instead of erroring out
due to split-brain.
This patch also adds restriction of all the bricks to be up to perform
metadata heal to avoid any metadata loss.
Removed the test case tests/bugs/replicate/bug-1468279-source-not-blaming-sinks.t
as it was doing metadata heal even when only 2 of 3 bricks were up.
Change-Id: I07a9d62f84ceda329dcab1f02a33aeed258dcb09
fixes: bz#1717819
Signed-off-by: karthik-us <ksubrahm@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Long tale of double unref! But do read...
In cases where a shard base inode is evicted from lru list while still
being part of fsync list but added back soon before its unlink, there
could be an extra inode_unref() leading to premature inode destruction
leading to crash.
One such specific case is the following -
Consider features.shard-deletion-rate = features.shard-lru-limit = 2.
This is an oversimplified example but explains the problem clearly.
First, a file is FALLOCATE'd to a size so that number of shards under
/.shard = 3 > lru-limit.
Shards 1, 2 and 3 need to be resolved. 1 and 2 are resolved first.
Resultant lru list:
1 -----> 2
refs on base inode - (1) + (1) = 2
3 needs to be resolved. So 1 is lru'd out. Resultant lru list -
2 -----> 3
refs on base inode - (1) + (1) = 2
Note that 1 is inode_unlink()d but not destroyed because there are
non-zero refs on it since it is still participating in this ongoing
FALLOCATE operation.
FALLOCATE is sent on all participant shards. In the cbk, all of them are
added to fync_list.
Resulting fsync list -
1 -----> 2 -----> 3 (order doesn't matter)
refs on base inode - (1) + (1) + (1) = 3
Total refs = 3 + 2 = 5
Now an attempt is made to unlink this file. Background deletion is triggered.
The first $shard-deletion-rate shards need to be unlinked in the first batch.
So shards 1 and 2 need to be resolved. inode_resolve fails on 1 but succeeds
on 2 and so it's moved to tail of list.
lru list now -
3 -----> 2
No change in refs.
shard 1 is looked up. In lookup_cbk, it's linked and added back to lru list
at the cost of evicting shard 3.
lru list now -
2 -----> 1
refs on base inode: (1) + (1) = 2
fsync list now -
1 -----> 2 (again order doesn't matter)
refs on base inode - (1) + (1) = 2
Total refs = 2 + 2 = 4
After eviction, it is found 3 needs fsync. So fsync is wound, yet to be ack'd.
So it is still inode_link()d.
Now deletion of shards 1 and 2 completes. lru list is empty. Base inode unref'd and
destroyed.
In the next batched deletion, 3 needs to be deleted. It is inode_resolve()able.
It is added back to lru list but base inode passed to __shard_update_shards_inode_list()
is NULL since the inode is destroyed. But its ctx->inode still contains base inode ptr
from first addition to lru list for no additional ref on it.
lru list now -
3
refs on base inode - (0)
Total refs on base inode = 0
Unlink is sent on 3. It completes. Now since the ctx contains ptr to base_inode and the
shard is part of lru list, base shard is unref'd leading to a crash.
FIX:
When shard is readded back to lru list, copy the base inode pointer as is into its inode ctx,
even if it is NULL. This is needed to prevent double unrefs at the time of deleting it.
Change-Id: I99a44039da2e10a1aad183e84f644d63ca552462
Updates: bz#1696136
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Following files are fixed.
tests/bugs/distribute/overlap.py
tests/utils/changelogparser.py
tests/utils/create-files.py
tests/utils/gfid-access.py
tests/utils/libcxattr.py
Change-Id: I3db857cc19e19163d368d913eaec1269fbc37140
updates: bz#1193929
Signed-off-by: Kotresh HR <khiremat@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The way delta_blocks is computed in shard is incorrect, when a file
is truncated to a lower size. The accounting only considers change
in size of the last of the truncated shards.
FIX:
Get the block-count of each shard just before an unlink at posix in
xdata. Their summation plus the change in size of last shard
(from an actual truncate) is used to compute delta_blocks which is
used in the xattrop for size update.
Change-Id: I9128a192e9bf8c3c3a959e96b7400879d03d7c53
fixes: bz#1705884
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
storage.reserve-size option will take size as input
instead of percentage. If set, priority will be given to
storage.reserve-size over storage.reserve. Default value
of this option is 0.
fixes: bz#1651445
Change-Id: I7a7342c68e436e8bf65bd39c567512ee04abbcea
Signed-off-by: Sheetal Pamecha <sheetal.pamecha08@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
| |
While processing a cleanup_and_exit function, we are
accessing a graph object. But this has not been protected
under a lock. Because a parallel cleanup of a graph is quite
possible which might lead to an invalid memory access
Change-Id: Id05ca70d5b57e172b0401d07b6a1f5386c044e79
fixes: bz#1708926
Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem: In commit ac70f66c5805e10b3a1072bd467918730c0aeeb4 I
missed one condition to populate volume dictionary in
multiple threads while brick_multiplex is enabled.Due
to that glusterd is not sending volume dictionary for
all volumes to peer.
Solution: Update the condition in code as well as update test case
also to avoid the issue
Change-Id: I06522dbdfee4f7e995d9cc7b7098fdf35340dc52
fixes: bz#1711250
Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The handler functions are pointed to dummy functions.
The switch case handling for tier also have been moved to
point default case to avoid issues, if reintroduced.
The tier changes in DHT still remain as such.
updates: bz#1693692
Change-Id: I80d80c9a3eb862b4440a36b31ae82b2e9d92e4dc
Signed-off-by: Hari Gowtham <hgowtham@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
EC was ignoring lock contention notifications received while a lock was
being acquired. When a lock is partially acquired (some bricks have
granted the lock but some others not yet) we can receive notifications
from acquired bricks, which should be honored, since we may not receive
more notifications after that.
Since EC was ignoring them, once the lock was acquired, it was not
released until the eager-lock timeout, causing unnecessary delays on
other clients.
This fix takes into consideration the notifications received before
having completed the full lock acquisition. After that, the lock will
be releaed as soon as possible.
Fixes: bz#1708156
Change-Id: I2a306dbdb29fb557dcab7788a258bd75d826cc12
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Consider the following case -
1. A file gets FALLOCATE'd such that > "shard-lru-limit" number of
shards are created.
2. And then it is deleted after that.
The unique thing about FALLOCATE is that unlike WRITE, all of the
participant shards are resolved and created and fallocated in a single
batch. This means, in this case, after the first "shard-lru-limit"
number of shards are resolved and added to lru list, as part of
resolution of the remaining shards, some of the existing shards in lru
list will need to be evicted. So these evicted shards will be
inode_unlink()d as part of eviction. Now once the fop gets to the actual
FALLOCATE stage, the lru'd-out shards get added to fsync list.
2 things to note at this point:
i. the lru'd out shards are only part of fsync list, so each holds 1 ref
on base shard
ii. and the more recently used shards are part of both fsync and lru list.
So each of these shards holds 2 refs on base inode - one for being
part of fsync list, and the other for being part of lru list.
FALLOCATE completes successfully and then this very file is deleted, and
background shard deletion launched. Here's where the ref counts get mismatched.
First as part of inode_resolve()s during the deletion, the lru'd-out inodes
return NULL, because they are inode_unlink()'d by now. So these inodes need to
be freshly looked up. But as part of linking them in lookup_cbk (precisely in
shard_link_block_inode()), inode_link() returns the lru'd-out inode object.
And its inode ctx is still valid and ctx->base_inode valid from the last
time it was added to list.
But shard_common_lookup_shards_cbk() passes NULL in the place of base_pointer
to __shard_update_shards_inode_list(). This means, as part of adding the lru'd out
inode back to lru list, base inode is not ref'd since its NULL.
Whereas post unlinking this shard, during shard_unlink_block_inode(),
ctx->base_inode is accessible and is unref'd because the shard was found to be part
of LRU list, although the matching ref didn't occur. This at some point leads to
base_inode refcount becoming 0 and it getting destroyed and released back while some
of its associated shards are continuing to be unlinked in parallel and the client crashes
whenever it is accessed next.
Fix is to pass base shard correctly, if available, in shard_link_block_inode().
Also, the patch fixes the ret value check in tests/bugs/shard/shard-fallocate.c
Change-Id: Ibd0bc4c6952367608e10701473cbad3947d7559f
Updates: bz#1696136
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ISSUE: gluster volume stop succeeds even if quorum is not met.
Fix: Add GD_OP_STOP_VOLUME to gluster_validate_quorum in
glusterd_mgmt_v3_pre_validate ().
Since the volume stop command has been ported from synctask to mgmt_v3,
the quorum check was missed out.
Change-Id: I7a634ad89ec2e286ea262d7952061efad5360042
fixes: bz#1690753
Signed-off-by: Vishal Pandey <vpandey@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I was looking at a downstream failure of bug-1319374-THIS-crash.t when I
saw the compiler was throwing a warning while running the test:
tests/bugs/gfapi/bug-1319374.c:17:61: warning: implicit declaration of function ‘strerror’; did you mean ‘perror’? [-Wimplicit-function-declaration]
fprintf(stderr, "\nglfs_new: returned NULL (%s)\n", strerror(errno));
^~~~~~~~
perror
So I compiled the .c with -Wall and saw a lot many more warnings, all
due of a missing header. This patch fixes it.
fixes: bz#1708163
Change-Id: I8b6dd8e1404178a3d99b2d92d01f4575f5203e58
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
At the time of a glusterd restart, while doing a handshake
there is a possibility that multiple shd manager might get
executed. Because of this, there is a chance that multiple
shd get spawned during a glusterd restart
Change-Id: Ie20798441e07d7d7a93b7d38dfb924cea178a920
fixes: bz#1707081
Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
|
|
|
|
|
|
| |
Change-Id: Iceefe22af754096c599dc570d4894d14fce4deae
Updates: bz#1193929
Signed-off-by: Xavier Hernandez <xhernandez@redhat.com>
|
|
|
|
|
|
|
|
| |
now the enhanced test covers most of the code in auth.login and auth.addr module.
updates: bz#1693692
Change-Id: I1f43c7dc414e2e4d443a93e9a37051359fd46ea4
Signed-off-by: Amar Tumballi <amarts@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem: If any custom xattrs are set on the directory before
add a brick, xattrs are not healed on the directory
after adding a brick.
Solution: xattr are not healed because dht_selfheal_dir_mkdir_lookup_cbk
checks the value of MDS and if MDS value is not negative
selfheal code path does not take reference of MDS xattrs.Change the
condition to take reference of MDS xattr so that custom xattrs are
populated on newly added brick
Updates: bz#1702299
Change-Id: Id14beedb98cce6928055f294e1594b22132e811c
Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem: statedump is not capturing information related to glusterd
Solution: statdump is not capturing glusterd info because
trav->dumpops is null in gf_proc_dump_single_xlator_info ()
where trav is glusterd xlator object. trav->dumpops is null
because we missed to define dumpops in xlator_api of glusterd.
defining dumpops in xlator_api of glusterd fixes the issue.
fixes: bz#1703629
Change-Id: If85429ecb1ef580aced8d5b88d09fc15258bfc4c
Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
|
|
|
|
|
|
|
| |
updates: bz#1693692
Change-Id: I848e622d7b8562e864f0e208aafdc21d9cb757d3
Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently EC tries to reopen fd's that have been opened while a brick
was down. This is done as part of regular write operations, just after
having acquired the locks, and it's sent as a sub-fop of the main write
fop.
There were two problems:
1. The reopen was attempted on all UP bricks, even if a previous lock
didn't succeed. This is incorrect because most probably the open will
fail.
2. If reopen is sent and fails, the error is propagated to the main
operation, causing it to fail when it shouldn't.
To fix this, we only attempt reopens on bricks where the current fop
owns a lock, and we prevent any error to be propagated to the main
fop.
To implement this behaviour an argument used to indicate the minimum
number of required answers has overloaded to also include some flags. To
make the change consistent, it has been necessary to rename the
argument, which means that a lot of files have been changed. However
there are no functional changes.
This change has also uncovered a problem in discard code, which didn't
correctely process requests of small sizes because no real discard fop
was being processed, only a write of 0's on some region. In this case
some fields of the fop remained uninitialized or with incorrect values.
To fix this, a new function has been created to simulate success on a
fop and it's used in the discard case.
Thanks to Pranith for providing a test script that has also detected an
issue in this patch. This patch includes a small modification of this
script to force data to be written into bricks before stopping them.
Change-Id: If272343873369186c2fb8f43c1d9c52c3ea304ec
Fixes: bz#1699866
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
|
|
|
|
|
|
| |
Fixes: bz#1542072
Change-Id: Ia5fa1df81bbaec3a84653d136a331c76b457f42c
Signed-off-by: Milan Zink <zeten30@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem: At the time of handshaking glusterd populate volume
data in a dictionary.While no. of volumes are configured
more than 1500 glusterd takes more than 10 min to generated
the data.Due to taking more time rpc request times out and
rpc start bailing of call frames.
Solution: To optimize the code done below changes
1) Spawn multiple threads to populate volumes data in bulk
in separate dictionary and introduce an option
glusterd.brick-dict-thread-count to configure no. of threads
to populate volume data.
2) Populate tier data only while volume type is tier
3) Compare snap data only while snap_count is non zero
Fixes: bz#1699339
Change-Id: I38dc71970c049217f9d1a06fc0aaf4c26eab18f5
Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When eager-lock lock acquisition fails because of say network failures, the
local is not being removed from owners_list, this leads to accumulation of
waiting frames and the application will hang because the waiting frames are
under the assumption that another transaction is in the process of acquiring
lock because owner-list is not empty. Handled this case as well in this patch.
Added asserts to make it easier to find these problems in future.
fixes bz#1696599
Change-Id: I3101393265e9827755725b1f2d94a93d8709e923
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|