| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Two issues were found:
1. in wb_readdirp_cbk, inode should be unrefed after wb_inode is
unlocked. Otherwise, inode and hence the context wb_inode can be freed
by the time we try to unlock wb_inode
2. wb_readdirp_mark_end iterates over a list of wb_inodes of children
of a directory. But inodes could've been freed and hence the list
might be corrupted. To fix, take a reference on inode before adding it
to invalidate_list of parent.
Change-Id: I911b0e0b2060f7f41ded0b05db11af6f9b7c09c5
Signed-off-by: Raghavendra Gowdappa <rgowdapp@redhat.com>
Updates: bz#1671556
|
|
|
|
|
|
|
| |
Change-Id: I7be9a5f48dcad1b136c479c58b1dca1e0488166d
Signed-off-by: Raghavendra Gowdappa <rgowdapp@redhat.com>
Fixes: bz#1671556
(cherry picked from commit 6175cb10cd5f59f3c7ae4100bc78f359b68ca3e9)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The "struct iatt" in iatt.h is using int64_t types for storing
the atime, mtime and ctime. Therefore the struct 'struct md_cache' in
md-cache.c should also use these types to avoid an integer overflow.
This can happen e.g. if someone uses a very high default-retention-period
in the WORM-Xlator.
Change-Id: I605268d300ab622b9c8ab30e459dc00d9340aad1
fixes: bz#1678726
Signed-off-by: David Spisla <david.spisla@iternity.com>
(cherry picked from commit 15423e14f16dd1a15ee5e5cbbdbdd370e57ed59f)
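The overflow described above can be illustrated with a small Python sketch. It assumes, purely for illustration, that the narrower field was 32 bits wide (the commit message does not state the old width); the helper names are hypothetical and ctypes only emulates fixed-width C integers.

```python
import ctypes


def store_as_uint32(seconds: int) -> int:
    """Emulate storing a timestamp in a 32-bit unsigned field (assumed width)."""
    return ctypes.c_uint32(seconds).value


def store_as_int64(seconds: int) -> int:
    """Emulate storing a timestamp in an int64_t field, as struct iatt does."""
    return ctypes.c_int64(seconds).value


# Hypothetical timestamp roughly 200 years after the epoch, e.g. the result of
# a very high retention period. It does not fit in 32 bits.
far_future = 200 * 365 * 24 * 3600

narrow = store_as_uint32(far_future)  # silently wraps modulo 2**32
wide = store_as_int64(far_future)     # preserved exactly
```

With the 64-bit field the value round-trips unchanged, while the 32-bit emulation wraps to a much earlier time.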
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a fop to create an entry fails on one of the data bricks,
we mark the pending changelog on the entry on brick for which
it was successful. This is done as part of post op phase to
make sure that entry gets healed even if it gets renamed to
some other path where its parent was not marked as bad.
As it happens as part of post op, we should consider thin-arbiter
to check if the brick, which was successful, is the good brick or not.
This will avoid split brain and other issues.
>Change-Id: I12686675be98f02f70a5186b3ed748c541514d53
>Signed-off-by: Ashish Pandey <aspandey@redhat.com>
Change-Id: I12686675be98f02f70a5186b3ed748c541514d53
updates: bz#1672314
Signed-off-by: Ashish Pandey <aspandey@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
PROBLEM:
A lot of the earlier changes in the management of shards in lru, fsync
lists assumed that if a given shard exists in fsync list, it must be
part of lru list as well. This was found to be not true.
Consider this - a file is FALLOCATE'd to a size which would make the
number of participant shards to be greater than the lru list size.
In this case, some of the resolved shards that are to participate in
this fop will be evicted from lru list to give way to the rest of the
shards. And once FALLOCATE completes, these shards are added to fsync
list but without a ref. After the fop completes, these shard inodes
are unref'd and destroyed while their inode ctxs are still part of
fsync list. Now when an FSYNC is called on the base file and the
fsync-list traversed, the client crashes due to illegal memory access.
FIX:
Hold a ref on the shard inode when adding to fsync list as well.
And unref under following conditions:
1. when the shard is evicted from lru list
2. when the base file is fsync'd
3. when the shards are deleted.
Change-Id: Iab460667d091b8388322f59b6cb27ce69299b1b2
fixes: bz#1669382
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
(cherry picked from commit 72922c1fd69191b220f79905a23395c3a87f86ce)
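The ref-holding rule in the FIX section can be sketched with a toy refcounted object in Python. This is not the shard translator's code: the Inode class, fsync_list and both helper functions are hypothetical stand-ins, and "destroyed" models freed memory.

```python
class Inode:
    """Toy refcounted inode; 'destroyed' stands in for freed memory."""

    def __init__(self, name):
        self.name = name
        self.refs = 0
        self.destroyed = False

    def ref(self):
        self.refs += 1
        return self

    def unref(self):
        self.refs -= 1
        if self.refs == 0:
            self.destroyed = True


fsync_list = []


def add_to_fsync_list(inode):
    # The list takes its own ref, so the entry outlives the fop's ref.
    fsync_list.append(inode.ref())


def fsync_base_file():
    # Traverse the list, then drop the list's ref on each shard.
    touched = []
    while fsync_list:
        inode = fsync_list.pop()
        assert not inode.destroyed  # would be the illegal access otherwise
        touched.append(inode.name)
        inode.unref()
    return touched


shard = Inode("shard-7").ref()  # ref held by the in-flight fop
add_to_fsync_list(shard)
shard.unref()                   # fop completes and drops its ref
alive_after_fop = not shard.destroyed
synced = fsync_base_file()
```

Without the extra ref in add_to_fsync_list(), the unref at the end of the fop would destroy the shard while it is still on the list.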
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
...when ctime is zero. ia_type and ia_gfid always need to be non-zero
for things to work correctly.
Problem:
Commit c9bde3021202f1d5c5a2d19ac05a510fc1f788ac zeroed out the iatt
buffer in the cbks of modification fops before unwinding if the ctime in
the buffer was zero. This was causing the fops to fail: noticeable when
AFR's 'consistent-metadata' option was enabled. (AFR zeros out the ctime
when the option is set. See commit
4c4624c9bad2edf27128cb122c64f15d7d63bbc8).
Fixes:
-Do not zero out the ia_type and ia_gfid of the iatt buff under any
circumstance.
-Also, fixed _rda_inode_ctx_update_iatts() to always update these values from
the incoming buf when ctime is zero. Otherwise we end up with zero
ia_type and ia_gfid the first time the function is called *and* the
incoming buf has ctime set to zero.
fixes: bz#1665145
Reported-By:Michael Hanselmann <public@hansmi.ch>
Change-Id: Ib72228892d42c3513c19fc6dfb543f2aa3489eca
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
(cherry picked from commit 09db11b0c020bc79d493c6d7e7ea4f3beb000c68)
|
|
|
|
|
|
|
|
|
|
|
|
| |
rm -rf <dir> fails on dirs which contain linkto files
that point to themselves because dht incorrectly thought
that they were cached files after looking them up.
The fix now treats them as invalid linkto files
and deletes them.
Change-Id: I376c72a5309714ee339c74485e02cfb4e29be643
fixes: bz#1671611
Signed-off-by: N Balachandran <nbalacha@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The inode LRU mechanism is moot in fuse xlator (ie. there is no
limit for the LRU list), as fuse inodes are referenced from
kernel context, and thus they can only be dropped on request of
the kernel. This might result in a high number of passive
inodes which are useless for the glusterfs client, causing a
significant memory overhead.
This change tries to remedy this by extending the LRU semantics
and allowing to set a finite limit on the fuse inode LRU.
A brief history of problem:
When gluster's inode table was designed, fuse didn't have any
'invalidate' method, which means, userspace application could
never ask kernel to send a 'forget()' fop, instead had to wait
for kernel to send it based on kernel's parameters. Inode table
remembers the number of times kernel has cached the inode based
on the 'nlookup' parameter. And 'nlookup' field is not used by
any other entry points (like server-protocol, gfapi etc).
Hence the inode_table of fuse module always has to have lru-limit
as '0', which means no limit. GlusterFS always had to keep all
inodes in memory as kernel would have had a reference to it.
Again, the reason for this is, kernel's glusterfs inode reference
was pointer of 'inode_t' structure in glusterfs. As it is a
pointer, we could never free it (to prevent segfault, or memory
corruption).
Solution:
In the inode table, handle the prune case of inodes with 'nlookup'
differently, and call a 'invalidator' method, which in this case is
fuse_invalidate(), and it sends the request to kernel for getting
the forget request.
When the kernel sends the forget, it means, it has dropped all
the reference to the inode, and it will send the forget with the
'nlookup' parameter too. We just need to make sure to reduce the
'nlookup' value we have when we get forget. That automatically
causes the relevant prune to happen.
Credits: Csaba Henk, Xavier Hernandez, Raghavendra Gowdappa, Nithya B
fixes: bz#1623107
Change-Id: Ifee0737b23b12b1426c224ec5b8f591f487d83a2
Signed-off-by: Amar Tumballi <amarts@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
PROBLEM:
When multiple sharded files are deleted in quick succession, multiple
issues were observed:
1. misleading logs corresponding to a sharded file where while one log
message said the shards corresponding to the file were deleted
successfully, this was followed by multiple logs suggesting the very
same operation failed. This was because of multiple synctasks
attempting to clean up shards of the same file and only one of them
succeeding (the one that gets ENTRYLK successfully), and the rest of
them logging failure.
2. multiple synctasks to do background deletion would be launched, one
for each deleted file, but all of them could readdir entries from
.remove_me at the same time and potentially contend for ENTRYLK on
.shard for each of the entry names. This is undesirable and wasteful.
FIX:
Background deletion will now follow a state machine. In the event that
there are multiple attempts to launch synctask for background deletion,
one for each file deleted, only the first task is launched. And if while
this task is doing the cleanup, more attempts are made to delete other
files, the state of the synctask is adjusted so that it restarts the
crawl even after reaching end-of-directory to pick up any files it may
have missed in the previous iteration.
This patch also fixes uninitialized lk-owner during syncop_entrylk()
which was leading to multiple background deletion synctasks entering
the critical section at the same time and leading to illegal memory access
of base inode in the second synctask after it was destroyed post shard deletion
by the first synctask.
Change-Id: Ib33773d27fb4be463c7a8a5a6a4b63689705324e
updates: bz#1665803
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
(cherry picked from commit c0c2022e7d7097e96270a74f37813eda0c4e6339)
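The state machine described in the FIX section can be sketched as follows. This is a single-threaded Python illustration, not the shard translator's implementation: the state names, the BackgroundDeleter class and the on_pass hook (which simulates a deletion arriving while the task runs) are all assumptions made for the example.

```python
from enum import Enum


class State(Enum):
    IDLE = 0      # no background task exists
    RUNNING = 1   # one task is crawling
    RERUN = 2     # task is crawling and must crawl again when done


class BackgroundDeleter:
    def __init__(self, pending):
        self.pending = pending  # stands in for the entries under .remove_me
        self.state = State.IDLE
        self.launches = 0
        self.deleted = []

    def request(self):
        """Called once per deleted file; True means the caller launches the task."""
        if self.state is State.IDLE:
            self.state = State.RUNNING
            self.launches += 1
            return True
        self.state = State.RERUN  # existing task will restart its crawl
        return False

    def run(self, on_pass=None):
        while True:
            while self.pending:  # crawl until end-of-directory
                self.deleted.append(self.pending.pop(0))
            if on_pass:
                on_pass(self)
            if self.state is State.RERUN:
                self.state = State.RUNNING
                continue
            self.state = State.IDLE
            return


pending = ["file-a"]
deleter = BackgroundDeleter(pending)
first_launch = deleter.request()
late = {"second_launch": None}


def delete_during_crawl(d):
    # Simulate one more file being deleted right after the first pass ended.
    if late["second_launch"] is None:
        d.pending.append("file-b")
        late["second_launch"] = d.request()


deleter.run(on_pass=delete_during_crawl)
```

Only one task is ever launched, yet the entry that arrived after end-of-directory is still cleaned up because the crawl restarts.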
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
excessive logging
... of the kind
"[2018-12-26 05:22:44.195019] E [MSGID: 133010]
[shard.c:2253:shard_common_lookup_shards_cbk] 0-volume1-shard: Lookup
on shard 785 failed. Base file gfid = cd938e64-bf06-476f-a5d4-d580a0d37416
[No such file or directory]"
shard_common_lookup_shards_cbk() has a specific check to ignore ENOENT error without
logging them during specific fops. But because background deletion is done in a new
frame (with local->fop being GF_FOP_NULL), the ENOENT check is skipped and the
absence of shards gets logged every time.
To fix this, local->fop is initialized to GF_FOP_UNLINK during background deletion.
Change-Id: I0ca8d3b3bfbcd354b4a555eee520eb0479bcda35
updates: bz#1665803
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
(cherry picked from commit aa28fe32364e39981981d18c784e7f396d56153f)
|
|
|
|
|
|
|
|
|
|
| |
To avoid use_after_free, reset lease_ctx->timer back to NULL
after the structure has been freed.
Change-Id: Icd213ec809b8af934afdb519c335a4680a1d6cdc
updates: bz#1651323
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
(cherry picked from commit a9b0003c717087ff168bc143c70559162e53e0d5)
|
|
|
|
|
|
|
|
|
| |
Replaced "recieve" with "receive".
Change-Id: I58a3d3d4a0093df4743de9fae4d8ff152d4b216c
fixes: bz#1662200
Signed-off-by: N Balachandran <nbalacha@redhat.com>
(cherry picked from commit a11c5c66321dd8411373a68cc163c981c7d083df)
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
io-cache xlator has been skipping xdata references when the
data needs to be read into page cache. This patch fixes the same.
Note: similar changes may be needed for other fops as well
which are handled by io-cache.
Change-Id: I28d73d4ba471d13eb55d0fd0b5197d222df77a2a
updates: bz#1651323
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
(cherry picked from commit b3d88a0904131f6851f4185e43f815ecc3353ab5)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a single brick is added to the volume and the
newly added brick is the first to respond to a
dht_revalidate call, its stbuf will not be merged
into local->stbuf as the brick does not yet have
a layout. The is_permission_different check therefore
fails to detect that an attr heal is required as it
only considers the stbuf values from existing bricks.
To fix this, merge all stbuf values into local->stbuf
and use local->prebuf to store the correct directory
attributes.
Change-Id: Ic9e8b04a1ab9ed1248b6b056e3450bbafe32e1bc
fixes: bz#1660736
Signed-off-by: N Balachandran <nbalacha@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Removed all references to dict_t xdata_from_req which is
allocated but not used anywhere. It is also not cleaned up
and hence causes a memory leak.
fixes: bz#1659676
Change-Id: I2edb857696191e872ad12a12efc36999626bacc7
Signed-off-by: N Balachandran <nbalacha@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
gluster-blockd sometimes segfaults with the following backtrace:
Core was generated by `/usr/sbin/gluster-blockd --glfs-lru-count 5 --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fbb9cd639b9 in shard_unlink_block_inode (local=local@entry=0x7fbb80000a78, shard_block_num=<optimized out>) at shard.c:2929
2929 base_ictx->fsync_count--;
(gdb) bt
#0 0x00007fbb9cd639b9 in shard_unlink_block_inode (local=local@entry=0x7fbb80000a78, shard_block_num=<optimized out>) at shard.c:2929
#1 0x00007fbb9cd64311 in shard_unlink_shards_do_cbk (frame=frame@entry=0x7fbb9010a768, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, preparent=preparent@entry=0x7fbb7470dcf8,
postparent=postparent@entry=0x7fbb7470dd90, xdata=xdata@entry=0x0) at shard.c:2987
A fix for this has already been provided through a Coverity report.
Backport of:
> Change-Id: Ic5d302a5e32d375acf8adc412763ab94e6dabc3d
> Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
> (cherry picked from commit 145e180517054626d07892219fdee689b703c218)
Change-Id: I699a039e9c5115eb3376190dd8014427d12a293b
Updates: bz#1659563
Signed-off-by: Niels de Vos <ndevos@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
2 domain locking + xattrop for write-txn failures:
--------------------------------------------------
- A post-op wound on TA takes AFR_TA_DOM_NOTIFY range lock and
AFR_TA_DOM_MODIFY full lock, does xattrop on TA and releases
AFR_TA_DOM_MODIFY lock and stores in-memory which brick is bad.
- All further write txn failures are handled based on this in-memory
value without querying the TA.
- When shd heals the files, it does so by requesting full lock on
AFR_TA_DOM_NOTIFY domain. Client uses this as a cue (via upcall),
releases AFR_TA_DOM_NOTIFY range lock and invalidates its in-memory
notion of which brick is bad. The next write txn failure is wound on TA
to again update the in-memory state.
- Any write txns still incomplete when the AFR_TA_DOM_NOTIFY upcall release
request is received are completed before the lock is released.
- Any write txns received after the release request are maintained in a ta_waitq.
- After the release is complete, the ta_waitq elements are spliced to a
separate queue which is then processed one by one.
- For fops that come in parallel when the in-memory bad brick is still
unknown, only one is wound to TA on wire. The other ones are maintained
in a ta_onwireq which is then processed after we get the response from
TA.
Change-Id: I32c7b61a61776663601ab0040e2f0767eca1fd64
updates: bz#1648205
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
Signed-off-by: Ashish Pandey <aspandey@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* The scrubber was comparing the checksum of the file that it
calculated (by reading the file) with the on disk signature
(stored via xattr) wrongly. It was using strlen to calculate
the signature, while the actual length of the signature is
given by the brick. Just use the actual length that the brick
provides instead of trying to calculate the signature length via
strlen API.
* In posix, gfid2path was using the same string that contains the
list of all the xattrs of file to save the value of the gfid2path
xattr as well. This causes confusion when gfid2path xattr is queried
by scrubber for getting the actual path of a corrupted file. Use
separate string to fetch the value of the xattr instead of the string
that contains the list of xattrs.
Backport of:
> Patch: https://review.gluster.org/21752/
> BUG: 1654805
> Change-Id: I2d664ab524d2b312233476cb35863dde3122e9a9
(cherry picked from commit f77fb6d568616592ab25501c402c140d15235ca9)
Change-Id: I2d664ab524d2b312233476cb35863dde3122e9a9
fixes: bz#1654370
Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
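One way a strlen-based length can go wrong for binary signatures is shown in the Python sketch below. The byte values and function names are hypothetical, and the commit message does not say which failure was actually observed; the sketch only shows why a length taken from the first NUL byte is unreliable for binary data.

```python
def c_strlen(buf: bytes) -> int:
    """Emulate C strlen(): count bytes up to the first NUL."""
    idx = buf.find(b"\x00")
    return len(buf) if idx == -1 else idx


def match_with_strlen(stored: bytes, computed: bytes) -> bool:
    n = c_strlen(stored)  # too short if the digest contains a 0x00 byte
    return stored[:n] == computed[:n]


def match_with_length(stored: bytes, stored_len: int, computed: bytes) -> bool:
    # Use the length reported alongside the xattr value instead.
    return stored[:stored_len] == computed[:stored_len]


# Hypothetical 5-byte signatures that differ only after an embedded NUL byte.
stored = bytes.fromhex("a1b200c3d4")
computed = bytes.fromhex("a1b200ffff")

strlen_says_equal = match_with_strlen(stored, computed)
length_says_equal = match_with_length(stored, len(stored), computed)
```

The strlen-based comparison only looks at the first two bytes and reports a match, while the length-based comparison detects the difference.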
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
If parent dir is in split-brain or has dirty xattrs set, and the file
has gfid missing on one of the bricks, then name heal won't assign the
gfid.
Fix:
Use the brick we select the gfid from as the 'source'.
Note: Problem was found while trying to debug a split-brain issue on
Cynthia Zhou's setup.
fixes: bz#1655545
Change-Id: Id088d4f0fb017aa35122de426654194e581ed742
Reported-by: Cynthia Zhou <cynthia.zhou@nokia-sbell.com>
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
(cherry picked from commit 4d58730c0cd6ab5db39aec8a15276f7bd3371b04)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Internal fops (with frame->root->pid < 0) are used to heal
or move data and maintain data integrity. That is, they do not
modify client data which holds the lease. Hence there is no need to recall
the lease for such fops.
Note: Like for locks, we would need rebalance and self-heal
daemon process to heal lease state as well.
Change-Id: I8988693fef8d00e17c19dcc842e2238f9eb5ab48
updates: bz#1651323
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
(cherry picked from commit 080aa5b9e9d998552e23f7c33aed3afb0ca93c34)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When the glusterfs server recalls the lease, it expects
client to flush data and unlock the lease. If not it sets
a timer (starting from the time it sends RECALL request) and post
timeout, it revokes it.
Here we could have a race wherein the client did send an UNLK
lease request, but because of network delay it may have reached
the server after the lease was revoked. To handle such situations, treat
such requests as noop and return success.
Change-Id: I166402d10273f4f115ff04030ecbc14676a01663
updates: bz#1651323
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
(cherry picked from commit c2e758b54d8a3f778e3e63db0000bb8b63de9b25)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
From testing & code-reading, found a couple of places where
we incorrectly unref the inode resulting in use_after_free
crash or ref leaks. This patch addresses a couple of them.
a) When we try to grant the very first lease for an inode,
inode_ref is taken in __add_lease. This ref should be active
till all the leases granted to that inode are released (i.e,
till lease_cnt > 0). In addition even after lease_cnt becomes '0',
the inode should be active till all the blocked fops are resumed.
Hence release this ref, after resuming all those fops. To avoid
granting new leases while resuming those fops, defined a new boolean
(blocked_fops_resuming) to flag it in the lease_ctx.
b) 'new_lease_inode' which creates new lease_inode_entry and
takes ref on inode, is used while adding that entry to
client_list and recall_list.
Use its counterpart function '__destroy_lease_inode' which does unref
while removing those entries from those lists.
c) inode ref is also taken when added to timer->data. Unref the same
after processing timer->data.
Change-Id: Ie77c78ff4a971e0d9a66178597fb34faf39205fb
updates: bz#1651323
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
(cherry picked from commit b7aec05aa965202ab73120acf0da4c32fe0cf16c)
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
afr_open stores the fd as part of its local->cont.open struct
but when it calls ftruncate (if open flags contain O_TRUNC), the
corresponding cbk function (afr_open_ftruncate_cbk) is
incorrectly referencing uninitialized local->fd. This patch fixes
the same.
Change-Id: Icbdedbd1b8cfea11d8f41b6e5c4cb4b44d989aba
updates: bz#1651322
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
(cherry picked from commit fda594875c4cdb2a22e27aa13f5c66bee032ccb5)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With this change when SHD starts the index crawl it requests
all the clients to release the AFR_TA_DOM_NOTIFY lock so that
clients will know the in-memory state is no longer valid and
any new operations need to query the thin-arbiter if required.
When SHD completes healing all the files without any failure, it
will again take the AFR_TA_DOM_NOTIFY lock and gets the xattrs on
TA to see whether any new failures have happened by that time.
If there are new failures marked on TA, SHD will start the crawl
immediately to heal those failures as well. If there are no new
failures, then SHD will take the AFR_TA_DOM_MODIFY lock and unsets
the xattrs on TA, so that both the data bricks will be considered
as good thereafter.
>Change-Id: I037b89a0823648f314580ba0716d877bd5ddb1f1
>fixes: bz#1579788
>Signed-off-by: karthik-us <ksubrahm@redhat.com>
(cherry picked from commit 5784a00f997212d34bd52b2303e20c097240d91c)
Change-Id: I037b89a0823648f314580ba0716d877bd5ddb1f1
fixes: bz#1648205
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There was a problem in commit 7f81067 that caused infinite loop when
full heal was triggered.
The previous commit was made to prevent self-heal from going idle after a
replace brick operation. One of the changes consisted of setting a
flag to force an immediate scan of the dirty directory if a heal on
a directory succeeded (assuming it could have generated newer entries).
However that change was causing an issue with a full self-heal, since
every time an already healed directory was checked and it returned
successfully, it was also setting the flag, forcing self-heal to start
over again.
This patch fixes this issue by only setting the flag if the heal is not
full. It's assumed that a full self-heal will already traverse all
entries automatically, so there's no need to force a new scan later.
>Change-Id: Id12dbfc04e622b18183e796cc6cc87ccc30a6d55
>fixes: bz#1636631
>Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
(cherry picked from commit 7150c51ad75ccba22045a35fc31e5037612d1ad4)
Change-Id: Id12dbfc04e622b18183e796cc6cc87ccc30a6d55
fixes: bz#1651525
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
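The loop and its fix can be sketched in Python as below. The function names, the max_sweeps cap and the "buggy" switch are hypothetical and exist only so the example terminates and can show both behaviours; this is not the disperse self-heal code.

```python
def should_force_rescan(heal_succeeded: bool, full_heal: bool) -> bool:
    """Fixed rule: a successful heal forces a rescan only for non-full heals."""
    return heal_succeeded and not full_heal


def run_sweeps(directories, full_heal, buggy=False, max_sweeps=5):
    """Count how many sweeps happen before the crawler settles (or hits the cap)."""
    sweeps = 0
    rescan = True
    while rescan and sweeps < max_sweeps:
        sweeps += 1
        rescan = False
        for _ in directories:
            heal_succeeded = True  # an already healed directory returns success
            if buggy:
                rescan = rescan or heal_succeeded
            else:
                rescan = rescan or should_force_rescan(heal_succeeded, full_heal)
    return sweeps


fixed_full = run_sweeps(["d1", "d2"], full_heal=True)
buggy_full = run_sweeps(["d1", "d2"], full_heal=True, buggy=True)
```

With the old rule a full heal keeps restarting until the artificial cap is reached; with the fixed rule it finishes after a single sweep.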
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As lookup is not a locked fop, we can not trust the
data received in this to be same.
Changing the log level to DEBUG in case lookup finds any
difference.
(cherry picked from commit 9be6bf3d90e3783b3ba559c93d41b933f8d53f03)
Change-Id: I39499c44688a2455c7c6c69a798762d045d21b39
updates: bz#1644622
Signed-off-by: Ashish Pandey <aspandey@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With the commit febf5ed4848, during the volume create op,
we are setting volinfo->caps to 0, only if any of the bricks
belong to the same node and brickinfo->vg[0] is null.
Previously, we used to set volinfo->caps to 0, when
either brick doesn't belong to the same node or brickinfo->vg[0]
is null.
With this patch, we set volinfo->caps to 0, when either brick
doesn't belong to the same node or brickinfo->vg[0] is null.
(as we did earlier, before commit febf5ed4848).
> BUG: bz#1635820
> Change-Id: I00a97415786b775fb088ac45566ad52b402f1a49
> Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
(cherry picked from commit aae1c402b74fd02ed2f6473b896f108d82aef8e3)
fixes: bz#1647968
Change-Id: I00a97415786b775fb088ac45566ad52b402f1a49
Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, there are a few places where a user-controlled string
(like filename, program parameter etc) can be passed as 'fmt' for
printf(), which can lead to a segfault if the user's string contains '%s'
or '%d'.
While fixing it, it makes sense to check explicitly for such issues
across the codebase and to make the format calls properly.
Fixes: CVE-2018-14661
Fixes: bz#1647666
Change-Id: Ib547293f2d9eb618594cbff0df3b9c800e88bde4
Signed-off-by: Amar Tumballi <amarts@redhat.com>
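The pattern behind this class of bug can be shown with a Python analogue. It is only an analogy: in C the misuse is undefined behaviour and may crash, whereas Python's % operator raises an exception. Both function names and the sample filename are hypothetical.

```python
def log_unsafe(user_text: str) -> str:
    # User-controlled text used AS the format string (the bug pattern).
    return user_text % ()


def log_safe(user_text: str) -> str:
    # Constant format string; user text is only ever an argument.
    return "%s" % (user_text,)


filename = "report-%s-%d.log"  # hypothetical user-supplied name

try:
    log_unsafe(filename)
    unsafe_failed = False
except (TypeError, ValueError):
    unsafe_failed = True

safe_result = log_safe(filename)
```

The safe variant returns the filename unchanged, conversion specifiers included, because they are never interpreted.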
|
|
|
|
|
|
|
|
|
| |
Fixes CID 1396581
Change-Id: Ic04091b5783a75d8e1e605a9c1c28b77fea048d3
updates: bz#1647962
Signed-off-by: Vijay Bellur <vbellur@redhat.com>
Signed-off-by: Susant Palai <spalai@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the current scheme of glusterfs where lock migration is
experimental, (ideally) the rebalance process which is migrating
the file should request for a metalock. Hence, the metalock count
should not be more than one for an inode. In future, if there is a
need for meta-lock from other clients, this patch can be reverted.
Since pl_metalk is called as part of setxattr operation, any client
process(non-rebalance) residing outside trusted network can exhaust
memory of the server node by issuing setxattr repetitively on the
metalock key. The current patch makes sure that more than
one metalock cannot be granted on an inode.
Fixes CVE-2018-14660
updates: bz#1647962
Change-Id: Ie1e697766388718804a9551bc58351808fe71069
Signed-off-by: Susant Palai <spalai@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Server stack needs to have all the sort of validation, assuming
clients can be compromised. It is possible for a compromised
client to send basenames with paths with '/', and with that
create files without permission on server. By sanitizing the basename,
and not allowing anything other than actual directory as the parent
for any entry creation, we can prevent clients from
exploiting the server.
Fixes: CVE-2018-14651
Fixes: bz#1647663
Change-Id: I5dc0da0da2713452ff2b65ac2ddbccf1a267dc20
Signed-off-by: Amar Tumballi <amarts@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
Currently for replica volume, even if only one brick is UP
SHD will keep crawling index entries even if it can not
heal anything.
In thin-arbiter volume which is also a replica 2 volume,
this causes inode lock contention which in turn sends
upcall to all the clients to release notify locks, even
if it can not do anything for healing.
This slows down client performance and defeats the
purpose of keeping in memory information about bad brick.
Solution: Before starting heal or even crawling, check if
sufficient number of children are UP and available to check
and heal entries.
(cherry picked from commit f73b4476b15f9d6d3dc3c8e20c9742aacd857f9f)
Change-Id: I011c9da3b37cae275f791affd56b8f1c1ac9255d
updates: bz#1644645
Signed-off-by: Ashish Pandey <aspandey@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With commit 44e4db, we are not allowing user to create a volume
using glusterd's working directory as a brick or any sub directory
under glusterd's working directory as a brick. This has broken
shared-storage since the volume "gluster-shared-storage" is
created using the bricks under glusterd's working directory.
With this patch, we let the "gluster-shared-storage" volume
to use bricks under glusterd's working directory.
> BUG: bz#1647029
> Change-Id: Ifcbcf4576eea12cf46f199dea287b29bd3ec3bfd
> Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
(cherry picked from commit bdb4ca184913c82ccf9552298f5d5b597794f2aa)
fixes: bz#1647801
Change-Id: Ifcbcf4576eea12cf46f199dea287b29bd3ec3bfd
Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
By allowing clients to take a dump in a file on the brick process, we are
allowing compromised clients to create io-stats dumps on server,
which can exhaust all the available inodes.
Fixes: CVE-2018-14659
Fixes: bz#1647665
Change-Id: I32bfde9d4fe646d819a45e627805b928cae2e1ca
Signed-off-by: Amar Tumballi <amarts@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
As the key size in xdr can be anything, it can be bigger than the
'NAME_MAX' allowed in the structure, which can allow for service denial
attacks.
Fixes: CVE-2018-14653
Fixes: bz#1647664
Change-Id: I2dc5e99af27ddf44c12c94b07e51adb8674cce80
Signed-off-by: Amar Tumballi <amarts@redhat.com>
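The kind of bounds check this implies can be sketched in Python. The constant value (255, the usual Linux NAME_MAX) and the function name are assumptions; the commit message does not say exactly where or how the check is placed.

```python
NAME_MAX = 255  # assumed value; typical on Linux


def validate_key(key: bytes) -> bytes:
    """Reject a wire-supplied key that would not fit a NAME_MAX-sized field."""
    if len(key) > NAME_MAX:
        raise ValueError("key length %d exceeds NAME_MAX" % len(key))
    return key


accepted = validate_key(b"k" * NAME_MAX)

try:
    validate_key(b"k" * (NAME_MAX + 1))
    oversized_rejected = False
except ValueError:
    oversized_rejected = True
```

Checking the length before copying is what keeps an oversized key from overrunning the fixed-size destination.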
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
'getspec' operation is not used between 'client' and 'server' ever since
we have off-loaded volfile management to glusterd, ie, at least 7 years.
No reason to keep the dead code! The removed option had no meaning,
as glusterd didn't provide a way to set (or unset) this option. So,
no regression should be observed from any of the existing glusterfs
deployment, supported or unsupported.
Updates: CVE-2018-14653
Updates: bz#1647664
Change-Id: I4a2e0f673c5bcd4644976a61dbd2d37003a428eb
Signed-off-by: Amar Tumballi <amarts@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
posix_update_utime_in_mdata() unconditionally logs an error if
consistent time attributes features is not enabled. This log
does not add any value, prints an incorrect errno & floods
the log file. Hence nuking this log message in this patch.
Backport of:
> Patch: https://review.gluster.org/21520/
> BUG: 1644129
> Change-Id: I9a1f9e7ada3366d2830f18d81f16a1461040092e
> Signed-off-by: Kotresh HR <khiremat@redhat.com>
fixes: bz#1644526
Change-Id: I9a1f9e7ada3366d2830f18d81f16a1461040092e
Signed-off-by: Kotresh HR <khiremat@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
For lease operation, we allocate and store child nodes
data in lease structure. Use the same in afr_lease_cbk()
while checking for the quorum.
Change-Id: If1fdd5a0798888afd39ad3df57d96487baf9d1e6
updates: #350
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Patch in master: https://review.gluster.org/#/c/glusterfs/+/21534/
Problem:
A compromised client can set arbitrary values for the GF_XATTROP_ENTRY_IN_KEY
and GF_XATTROP_ENTRY_OUT_KEY during xattrop fop. These values are
consumed by index as a filename to be created/deleted according to the key.
Thus it is possible to create/delete random files even outside the gluster
volume boundary.
Fix:
Index expects the filename to be a basename, i.e. it must not contain any
pathname components like "/" or "../". Enforce this.
Fixes: CVE-2018-14654
Fixes: bz#1646204
Change-Id: I35f2a39257b5917d17283d0a4f575b92f783f143
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
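The basename rule in the Fix section can be sketched as a small Python predicate. The function name and the exact set of rejected inputs (empty names, "." and "..", embedded NUL) are assumptions for the example, not the checks the index translator performs.

```python
def is_plain_basename(name: str) -> bool:
    """Accept only a single path component with no way to escape its directory."""
    if not name or name in (".", ".."):
        return False
    return "/" not in name and "\x00" not in name


checks = {
    "6b1f-entry": is_plain_basename("6b1f-entry"),
    "../../etc/passwd": is_plain_basename("../../etc/passwd"),
    "sub/entry": is_plain_basename("sub/entry"),
    "..": is_plain_basename(".."),
    "": is_plain_basename(""),
}
```

Anything containing a separator is rejected outright rather than normalized, which keeps the created or deleted file inside the intended directory.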
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Glusterfs leases expects lease_id to be set and sent
for each fop to determine conflict resolution with the
existing lease.
In case it is not set (most likely if there is an older
client in a mixed cluster), it makes sense to consider
it as a conflicting fop and recall the lease.
Also fixed the return status check for __remove_lease(),
wherein non-negative value is considered as success case.
Change-Id: I5bcfba4f7c71a5af7cdedeb03436d0b818e85783
updates: #350
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
(cherry picked from commit cf5b13896d65b6916634976a3a5f61ddeefbc19c)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Backport of:
> Change-Id: Ic15ca41444dd04684a9458bd4a526b1d3e160499
> Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
> (cherry picked from commit e627977)
> BUG: 1605056
In __shard_update_shards_inode_list(), previously shard translator
was not holding a ref on the base inode whenever a shard was added to
the lru list. But if the base shard is forgotten and destroyed either
by fuse due to memory pressure or due to the file being deleted at some
point by a different client with this client still containing stale
shards in its lru list, the client would crash at the time of locking
lru_base_inode->lock owing to illegal memory access.
So now the base shard is ref'd into the inode ctx of every shard that
is added to lru list until it gets lru'd out.
The patch also handles the case where none of the shards associated
with a file that is about to be deleted are part of the LRU list and
where an unlink at the beginning of the operation destroys the base
inode (because there are no refkeepers) and hence all of the shards
that are about to be deleted will be resolved without the existence
of a base shard in-memory. This, if not handled properly, could lead
to a crash.
Change-Id: Ic15ca41444dd04684a9458bd4a526b1d3e160499
updates: bz#1641440
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Backport of:
> Change-Id: I84a5e54d214b6c47ed85671a880bb1c767a29f4d
> Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
> (cherry picked from commit 15c9976)
> BUG: 1638453
PROBLEM:
tests/bugs/shard/bug-1251824.t fails occasionally with EIO due to gfid
mismatch across replicas on the same shard when dd is executed.
CAUSE:
Turns out this is due to a race between posix_mknod() and posix_lookup().
posix mknod does 3 operations, among other things:
1. creation of the entry itself under its parent directory
2. setting the gfid xattr on the file, and
3. creating the gfid link under .glusterfs.
Consider a case where the thread doing posix_mknod() (initiated by shard)
has executed steps 1 and 2 and is on its way to executing 3. And a
parallel LOOKUP from another thread on noting that loc->inode->gfid is NULL,
tries to perform gfid_heal where it attempts to create the gfid link
under .glusterfs and succeeds. As a result, posix_gfid_set() through
MKNOD (step 3) fails with EEXIST.
In the older code, MKNOD under such conditions was NOT being treated
as a failure. But commit e37ee6d changes this behavior by failing MKNOD,
causing the entry creation to be undone in posix_mknod() (it's another
matter that the stale gfid handle gets left behind if lookup has gone
ahead and gfid-healed it).
All of this happens on only one replica while on the other MKNOD succeeds.
Now if a parallel write causes shard translator to send another MKNOD
of the same shard (shortly after AFR releases entrylk from the first
MKNOD), the file is created on the other replica too, although with a
new gfid (since the "gfid-req" that is passed now is a new UUID). This leads
to a gfid-mismatch across the replicas.
FIX:
The solution is to not fail MKNOD (or any other entry fop for that matter
that does posix_gfid_set()) if the .glusterfs link creation fails with EEXIST.
Change-Id: I84a5e54d214b6c47ed85671a880bb1c767a29f4d
fixes: bz#1641429
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
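The idea in the FIX section, tolerating EEXIST when creating the handle link, can be sketched as follows. This is a minimal illustration and not the posix translator's code: `handle_link_status` and `link_handle_tolerant` are hypothetical helpers, and only the POSIX `link()` call is real.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Map the outcome of a link() call to success or failure.
 * EEXIST means another thread already created the handle, so it is
 * treated as success instead of failing the whole entry operation. */
static int handle_link_status(int link_ret, int link_errno)
{
    if (link_ret == 0)
        return 0;
    if (link_errno == EEXIST)
        return 0;
    return -1;
}

/* Hypothetical wrapper: create hard link `handle` pointing at `real_path`.
 * Returns 0 on success or if the handle already exists, -1 otherwise. */
static int link_handle_tolerant(const char *real_path, const char *handle)
{
    int ret;
    int saved_errno = 0;

    assert(real_path != NULL && handle != NULL);
    ret = link(real_path, handle);
    if (ret != 0)
        saved_errno = errno; /* capture errno right after the failing call */
    return handle_link_status(ret, saved_errno);
}
```

Any error other than EEXIST (for example ENOENT or ENOSPC) is still reported as a failure.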
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As dict_unserialize does not null terminate the value,
using snprintf adds garbage characters to the buffer
used to create the filename.
The code also used this->name in the filename which
will be the same for all bricks for a volume. The
files were thus overwritten if a node contained
multiple bricks for a volume. The code now uses
the conf->unique instead if available.
Change-Id: I2c72534b32634b87961d3b3f7d53c5f2ca2c068c
fixes: bz#1640392
Signed-off-by: N Balachandran <nbalacha@redhat.com>
(cherry picked from commit 219cd649fdbd7bfd6c2268a0a4f66bcc15918e31)
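One common way to format a value that is not NUL-terminated is to pass its length as a precision to `%.*s`. The sketch below shows that technique only; it is not necessarily what the patch does, and `build_name` with its parameters is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "<prefix>-<value>" into out, where value is NOT NUL-terminated
 * and only value_len bytes of it are valid. The "%.*s" precision stops
 * snprintf from reading beyond value_len bytes.
 * Returns what snprintf returns: the length the full string needs. */
static int build_name(char *out, size_t out_size, const char *prefix,
                      const char *value, int value_len)
{
    assert(out != NULL && prefix != NULL && value != NULL);
    assert(value_len >= 0);
    return snprintf(out, out_size, "%s-%.*s", prefix, value_len, value);
}
```

With a plain `%s`, snprintf would keep reading past the valid bytes until it happened to find a zero byte, which is where the garbage characters come from.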
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Backport of https://review.gluster.org/#/c/glusterfs/+/21135/
Problem:
When a directory has dirty xattrs due to failed post-ops or when
replace/reset brick is performed, AFR does a conservative merge as
expected, but heal-info reports it as split-brain because there are no
clear sources.
Fix:
Modify pending flag to contain information about pending heals and
split-brains. For directories, if the split-brain flag is not set, just show
them as needing heal and not being in split-brain.
Change-Id: I09ef821f6887c87d315ae99e6b1de05103cd9383
fixes: bz#1638163
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
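A flag that carries both pieces of information could look roughly like the sketch below. The flag names, the enum and `toy_dir_heal_status` are assumptions made for illustration; the actual AFR flag layout is not shown in this message.

```c
#include <assert.h>

/* Hypothetical bit flags packed into one "pending" value. */
enum {
    TOY_PENDING_HEAL = 1u << 0, /* entry has pending heals */
    TOY_SPLIT_BRAIN  = 1u << 1  /* entry is in split-brain */
};

enum toy_status { TOY_CLEAN, TOY_NEEDS_HEAL, TOY_IN_SPLIT_BRAIN };

/* Decide what to report for a directory: split-brain only when that
 * bit is explicitly set, otherwise a pending heal is just a heal. */
static enum toy_status toy_dir_heal_status(unsigned int flags)
{
    assert((flags & ~(unsigned int)(TOY_PENDING_HEAL | TOY_SPLIT_BRAIN)) == 0);
    if (flags & TOY_SPLIT_BRAIN)
        return TOY_IN_SPLIT_BRAIN;
    if (flags & TOY_PENDING_HEAL)
        return TOY_NEEDS_HEAL;
    return TOY_CLEAN;
}
```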
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Backport of https://review.gluster.org/#/c/glusterfs/+/21380/
Problem:
In an arbiter volume, if there is a pending data heal of a file only on
arbiter brick, self-heal takes inodelks twice due to a code-bug but unlocks
it only once, leaving behind a stale lock on the brick. This causes
the next write to the file to hang.
Fix:
Fix the code-bug to take the lock only once. This bug was introduced in master
with commit eb472d82a083883335bc494b87ea175ac43471ff.
Thanks to Pranith Kumar K <pkarampu@redhat.com> for finding the RCA.
fixes: bz#1638159
Change-Id: I15ad969e10a6a3c4bd255e2948b6be6dcddc61e1
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
No default value was specified for `export-statfs-size` in the posix
option table. Glusterd2 sets the default value to `off` since the
option type is `bool`. Posix treats the option as `on` if it is
not specified in the volfile (that is, the effective default is `on`).
This patch sets the default value to `on`.
> Change-Id: I5c6341183be9b62a78fdbc94621220f9284e1382
> updates: #302
> Signed-off-by: Aravinda VK <avishwan@redhat.com>
(cherry picked from commit 07088d95e450f847722e5decbfa5da18a0dbd9de)
Change-Id: Ib6b3accdb9921376c16040bd2312b99b0226a26f
Fixes: bz#1636842
Signed-off-by: Aravinda VK <avishwan@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For both Virt and block workloads the file is opened multiple times
leading to dynamically setting eager-lock to off for the workload.
Instead of depending on the number-of-open-fds, if we change the
logic to depend on number of inodelks, then it will give better
performance than the earlier logic. When there is an eager-lock
and number of inodelks is more than 1 we know that there is a
conflicting lock, so depend on that information to decide whether
to let the current transaction go through delayed-post-op or not.
Locks xlator doesn't have implementation to query number of locks in
fxattrop in releases older than 3.10 so to keep things backward
compatible in 3.12, data transactions will use the new logic whereas
fxattrop transactions will use the old logic. I am planning to send one
more patch which makes metadata domain locks also depend on
inodelk-count.
Profile info for a dd of 500MB to a file with another fd opened
on the file using exec 250>filename
Without this patch:
%-latency    Avg-latency    Min-Latency     Max-Latency   No. of calls   Fop
     0.14       67.41 us       16.72 us      3870.82 us            892   FINODELK
     0.59      279.87 us       95.71 us      2085.89 us            898   FXATTROP
     3.46      366.43 us       81.75 us      6952.79 us           4000   WRITE
    95.79   148733.99 us    50568.12 us    919127.86 us            273   FSYNC
With this patch:
%-latency    Avg-latency    Min-Latency     Max-Latency   No. of calls   Fop
     0.00       51.01 us       38.07 us        80.16 us              4   FINODELK
     0.00      235.43 us      235.43 us       235.43 us              1   TRUNCATE
     0.00      125.07 us       56.80 us       193.33 us              2   GETXATTR
     0.00      135.86 us       62.13 us       209.59 us              2   INODELK
     0.00      197.88 us      155.39 us       253.90 us              4   FXATTROP
     0.00      450.59 us      394.28 us       506.89 us              2   XATTROP
     0.00       56.96 us       19.06 us       406.59 us             23   FLUSH
    37.81   273648.93 us       48.43 us   6017657.05 us             44   LOOKUP
    62.18     4951.86 us       93.80 us   1143154.75 us           3999   WRITE
postgresql benchmark performance changed from ~1130 TPS to ~2300 TPS
randio fio job inside oVirt-based VM went from ~600 IOPS to ~2000 IOPS
fixes bz#1635972
Change-Id: If7f7388d2f08cf7f17ca517a4ea222560661dc36
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
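The decision rule described above (more than one inodelk while holding an eager-lock implies a conflicting lock) can be written as a tiny predicate. This is a sketch of the stated rule only; `toy_use_delayed_post_op` and its parameters are hypothetical and do not reflect AFR's real data structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Return true when the transaction may go through delayed post-op.
 * inodelk_count is the number of inodelks reported for the file;
 * a count above 1 while we hold the eager-lock means somebody else
 * is contending, so the post-op should not be delayed. */
static bool toy_use_delayed_post_op(bool eager_lock_held, int inodelk_count)
{
    assert(inodelk_count >= 0);
    if (!eager_lock_held)
        return false;
    return inodelk_count <= 1;
}
```

Note that the number of open fds does not appear in the predicate at all, which is the change in approach the message describes.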
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
When eager-lock is disabled because multiple fds are open and app
writes come on conflicting regions, the number of locks grows very
fast leading to all the CPU being spent just in locking and unlocking
by traversing huge queues in locks xlator for granting locks.
Fix:
Reduce the number of locks in transit by bundling the writes in the
same lock and disable delayed piggy-back when we learn that multiple
fds are open on the file. This will reduce the size of queues in the
locks xlator. This also reduces the number of network calls like
inodelk/fxattrop.
Please note that this problem can still happen if eager-lock is
disabled as the writes will not be bundled in the same lock.
fixes bz#1635975
Change-Id: I8fd1cf229aed54ce5abd4e6226351a039924dd91
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Till now, glusterd was generating the volfile path for the snapshot
volume's bricks like this.
/snaps/<snap name>/<brick volfile>
But in reality, the path to the brick volfile for a snapshot volume is
/snaps/<snap name>/<snap volume name>/<brick volfile>
The above workaround was used to distinguish between a mount command used
to mount the snapshot volume, and a brick of the snapshot volume, so that
based on what is actually happening, glusterd can return the proper volfile
(client volfile for the former and the brick volfile for the latter). But,
this was causing problems for snapshot restore when brick multiplexing is
enabled. Because, with brick multiplexing, it tries to find the volfile
and sends GETSPEC rpc call to glusterd using the 2nd style of path i.e.
/snaps/<snap name>/<snap volume name>/<brick volfile>
So, when the snapshot brick (which is multiplexed) sends a GETSPEC rpc
request to glusterd for obtaining the brick volume file, glusterd was
returning the client volume file of the snapshot volume instead of the
brick volume file.
Change-Id: I28b2dfa5d9b379fe943db92c2fdfea879a6a594e
fixes: bz#1636162
Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
(cherry picked from commit 83a89296a3d12a3fc2a643c0630be5ce659204ea)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The char pointer mdc_xattr_str in function mdc_xattr_list_populate
is malloc'd, and doing a strcat into a malloc'd region can
overflow the allocation, depending on the prior contents of the
memory region.
Added a NUL terminator at the start of the malloc'd region to prevent
the overflow, so that it is treated as an empty string.
Change-Id: If0decab669551581230a8ede4c44c319ff04bac9
Updates: bz#1635373
Signed-off-by: ShyamsundarR <srangana@redhat.com>
(cherry picked from commit d00a2a1b398346bbdc5ac9b3ba4b09fb1ce1e699)
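The pattern behind this fix can be shown in isolation: `malloc` returns uninitialized memory, so the buffer must be turned into an empty string before the first `strcat`. The helper below is a hypothetical example, not the md-cache code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Join two strings with a comma into a freshly malloc'd buffer.
 * Returns NULL on allocation failure; the caller frees the result. */
static char *join_with_comma(const char *a, const char *b)
{
    size_t size;
    char *buf;

    assert(a != NULL && b != NULL);
    size = strlen(a) + 1 + strlen(b) + 1; /* a + ',' + b + NUL */
    buf = malloc(size);
    if (buf == NULL)
        return NULL;

    buf[0] = '\0'; /* make the uninitialized buffer an empty string */
    strcat(buf, a);
    strcat(buf, ",");
    strcat(buf, b);
    return buf;
}
```

Without the `buf[0] = '\0'` line, the first `strcat` would start appending wherever it happened to find a zero byte in the old memory contents, possibly beyond the end of the allocation.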
|