| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The decision as to which node would migrate a file
was based on the gfid of the file. Files were divided
among the nodes for the replica/disperse set. However,
if a brick was down when rebalance started, the nodeuuids
would be saved as NULL and a set of files would not be migrated.
Now, if the nodeuuid is NULL, the first non-null entry in
the set is the node responsible for migrating the file.
Change-Id: I72554c107792c7d534e0f25640654b6f8417d373
fixes: bz#1564198
Signed-off-by: N Balachandran <nbalacha@redhat.com>
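The selection rule described in this message can be sketched as follows. This is a minimal Python illustration, not the DHT C code: the function name, the modulo mapping from gfid to a slot in the set, and the use of None for a NULL node uuid are all assumptions made for the example.

```python
import uuid


def pick_migrator(gfid, node_uuids):
    """Return the node uuid responsible for migrating a file, or None.

    node_uuids has one entry per brick of the replica/disperse set;
    None stands in for a NULL uuid saved while that brick was down.
    """
    if not node_uuids:
        return None
    # Hypothetical mapping: divide files among the nodes by gfid.
    slot = uuid.UUID(gfid).int % len(node_uuids)
    chosen = node_uuids[slot]
    if chosen is not None:
        return chosen
    # Fallback from the commit message: the first non-null entry
    # in the set becomes responsible for the file.
    for candidate in node_uuids:
        if candidate is not None:
            return candidate
    return None
```

With this fallback, a file whose slot maps to a NULL entry is still claimed by exactly one node, as long as every node sees the same list.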
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Note 1) we're not supposed to be using #!/usr/bin/env python, see
https://fedoraproject.org/wiki/Packaging:Guidelines?rd=Packaging/Guidelines#Shebang_lines
Note 2) we're also not supposed to be using #!/usr/bin/python,
see https://fedoraproject.org/wiki/Changes/Avoid_usr_bin_python_in_RPM_Build#Quick_Opt-Out
The previous patch (https://review.gluster.org/19767) tried to do too
much in one patch, so it was abandoned.
This patch does two things:
1) minor cleanup of configure(.ac) to explicitly use python2
2) change all the shebang lines to #!/usr/bin/python2 and add them
where they were missing based on warnings emitted during rpmbuild.
In a follow-up patch python2 will eventually be changed to python3.
Before that, python2-isms (e.g. print, string.join(), etc.) need to be
converted to python3. Some of those can be rewritten in version-agnostic
python, e.g. print statements become print() with "from __future__ import
print_function". The python 2to3 utility will be used for some of those.
Aravinda has also given guidance on these changes in the comments to the
first patch.
updates: #411
Change-Id: I471730962b2526022115a1fc33629fb078b74338
Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com>
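As an illustration of the version-agnostic rewrites mentioned above, here is a small sketch that runs unchanged on python2 and python3. The function names and the "bricks" data are hypothetical and not taken from the GlusterFS scripts.

```python
from __future__ import print_function


def format_bricks(bricks):
    # str.join() works on both versions; the python2-only spelling
    # was string.join(bricks, ", ").
    return ", ".join(bricks)


def report(bricks, stream=None):
    # print() as a function is available on python2 through the
    # __future__ import above; file=None means sys.stdout.
    print("bricks:", format_bricks(bricks), file=stream)
```

The `__future__` import has to be the first statement of the module, which is why it sits directly under the shebang line in practice.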
|
|
|
|
|
|
|
|
|
|
| |
dht_opendir should wind the open to all subvols
whether or not local->subvols is set. This is
because dht_readdirp winds the calls to all subvols.
Change-Id: I67a96b06dad14a08967c3721301e88555aa01017
updates: bz#1564198
Signed-off-by: N Balachandran <nbalacha@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
Add pass-through option in performance translators. Set the option in
GF_OPTION_INIT() and GF_OPTION_RECONF().
Updates: #304
Change-Id: If1537450147d154905831e36f7162a32866d7ad6
Signed-off-by: Varsha Rao <varao@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem: The storage.reserve option does not work correctly when
disk space is allocated through fallocate.
Solution: posix_disk_space_check_thread_proc calls
posix_disk_space_check every 5 seconds to monitor disk space and set
the flag in posix priv. Within that 5-second window a user can create a
big file with fallocate that reaches the posix reserve limit, and no
error is shown on the terminal even though the limit has been reached.
To resolve this, call posix_disk_space for every fallocate fop
instead of relying on the thread that runs every 5 seconds.
BUG: 1560411
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
Change-Id: I39ba9390e2e6d084eedbf3bcf45cd6d708591577
|
|
|
|
|
|
| |
Change-Id: I6745428fd9d4e402bf2cad52cee8ab46b7fd822f
fixes: bz#1560319
Signed-off-by: Kinglong Mee <mijinlong@open-fs.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In context of Cloudsync:
A data modification fop, e.g. a write, that lands in POSIX on the
assumption that the file is local while the file is actually remote
can be dangerous. Of course we don't want to take an inodelk for
every read/write operation to check the archival status or to
coordinate with an upload or a download of a file. To avoid the inodelk,
we check the status of the file in POSIX itself before we resume the
fop. This avoids the race described above.
Now, e.g., if a write reaches POSIX for a file which is actually remote,
POSIX can check the status of the file and learn that the file is
remote. It can error out with the status "remote", and the cloudsync
xlator will retry the same operation once it has finished downloading
the file.
This patch includes the setxattr changes for the post-processing of an
upload, i.e. truncating the file and setting the remote xattr
"trusted.glusterfs.cs.remote" to indicate that the file is REMOTE.
A file has no xattr if it is LOCAL, one remote xattr if it is REMOTE,
and a combination of the REMOTE and DOWNLOADING xattrs if it is being
downloaded. There is healing logic for these xattrs to recover from
crash inconsistencies.
Fixes: #387
Change-Id: Ie93c2d41aa8d6a798a39bdbef9d1669f057e5fdb
Signed-off-by: Susant Palai <spalai@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
The various synchronization steps in dht_rename for handling
directories and files are necessary only if there is more than one
child.
Change-Id: Ie21ad419125504ca2f391b1ae2e5c1d166fee247
fixes: bz#1563511
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
spec-files:
https://review.gluster.org/#/c/18854/
Overview:
* Cloudsync maintains three file states in its inode-ctx, i.e.
1 - LOCAL,
2 - REMOTE,
3 - DOWNLOADING.
* A data modifying fop is allowed only if the state is LOCAL.
If the state is REMOTE or DOWNLOADING, the client will download the
file or wait for a download initiated by another client to finish.
* Multiple downloads and uploads from different clients are synchronized
by inodelk.
* In POSIX a state check is done (part of a different commit) before
allowing the fop to continue. If the state is remote/downloading, the
fop is unwound with EREMOTE. The client will then download the file
and continue with the fop again.
* Basic Algo for fop (let's say write fop):
- If LOCAL -> resume fop
- If REMOTE ->
- INODELK
- STAT (this gets the state and heals it if needed)
- DOWNLOAD
- resume fop
Note:
* Developers will need to write plugins for download, based on the
remote store they choose. In phase-1, support will be added for
one remote store per volume. In future, more options for multiple
remote stores will be explored.
TODOs:
- Implement stat/lookup/readdirp to return size info from xattr
- Make plugins configurable
- Implement unlink fop
- Add metrics collection
- Add sharding support
Design Contributions:
Aravinda V K <avishwan@redhat.com>
Amar Tumballi <amarts@redhat.com>
Ram Ankireddypalle <areddy@commvault.com>
Susant Palai <spalai@redhat.com>
updates: #387
Change-Id: Iddf711ee7ab4e946ae3e472ff62791a7b85e6d4b
Signed-off-by: Susant Palai <spalai@redhat.com>
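The basic algorithm for a write fop listed in the overview can be sketched like this. It is a single-process Python model under stated assumptions: `CloudFile`, `download()` and the `threading.Lock` standing in for inodelk are hypothetical, and the real STAT step also heals the on-disk state, which is not modelled here.

```python
import threading

LOCAL, REMOTE, DOWNLOADING = "LOCAL", "REMOTE", "DOWNLOADING"


class CloudFile:
    """Toy stand-in for the per-inode cloudsync context."""

    def __init__(self, state=LOCAL):
        self.state = state
        self.inodelk = threading.Lock()  # stands in for INODELK
        self.downloads = 0

    def stat(self):
        # The real STAT fetches the state and heals it if needed.
        return self.state

    def download(self):
        # A plugin for the chosen remote store would run here.
        self.state = DOWNLOADING
        self.downloads += 1
        self.state = LOCAL


def write(cloud_file, data, log):
    """If LOCAL resume the fop; otherwise INODELK, STAT, DOWNLOAD, resume."""
    if cloud_file.state != LOCAL:
        with cloud_file.inodelk:
            # Re-check under the lock: another client may already
            # have finished the download while we waited.
            if cloud_file.stat() != LOCAL:
                cloud_file.download()
    log.append(data)  # "resume fop"
```

The re-check under the lock is what keeps two clients from downloading the same file twice.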
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
NFS client gets "Invalid argument" when writing file through nfs-ganesha.
1. With quota disabled;
nfs client mount nfs-ganesha share, and do 'll' in the testing directory.
2. Enable quota;
getfattr: Removing leading '/' from absolute path names
trusted.gfid=0xe2edaac0eca8420ebbbcba7e56bbd240
trusted.gfid2path.b3250af8fa558e66=0x39663134343566662d653530332d343831352d396635312d3236633565366332633137642f7465737466696c653932
trusted.glusterfs.quota.9f1445ff-e503-4815-9f51-26c5e6c2c17d.contri.3=0x00000000000002000000000000000001
Notice: testfile92 without trusted.pgfid xattr.
3. restart glusterfs volume by "gluster volume stop/start gvtest"
4. echo somedata > testfile92
5. ll testfile92
-rw-r--r-- 1 root root 0 Mar 6 21:43 testfile92
BUG: 1560319
Change-Id: Iaa4dd1e891c99069fb85b7b11bb0482cbf2303b1
fixes: bz#1560319
Signed-off-by: Kinglong Mee <mijinlong@open-fs.com>
|
|
|
|
|
|
|
| |
Change-Id: I4648816af908539efdc2528608aa2ebf7f0d0e2f
fixes: bz#1559004
BUG: 1559004
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
| |
Change-Id: I0a290396c30c635b13ee73004d20259efb76a954
fixes: bz#1563945
Signed-off-by: Ashish Pandey <aspandey@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The gluster-block project needs a dependency check to see if all the bricks
are online before bringing up the relevant gluster-block services. The
patch https://review.gluster.org/#/c/19785/ attempts to write the
script, but a brick should be marked as online only when
pmap_signin is completed.
This is perfectly fine without brick multiplexing, but with brick
multiplexing this patch still doesn't eliminate the race completely, as
the attach_req call is asynchronous and glusterd immediately marks the
port as registered.
Change-Id: I81db54b88f7315e1b24e0234beebe00de6429f9d
Fixes: bz#1563273
Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
We seem to be winding the FOP if pre-op did not succeed on quorum bricks
and then failing the FOP with EROFS since the fop did not meet quorum.
This essentially masks the actual error due to which pre-op failed. (See
BZ).
Fix:
Skip FOP phase if pre-op quorum is not met and go to post-op.
Fixes: 1561129
Change-Id: Ie58a41e8fa1ad79aa06093706e96db8eef61b6d9
fixes: bz#1561129
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
glusterd maintains a boolean flag 'port_registered' which is used to determine
if a brick has completed its portmap sign-in process. This flag is (re)set in
the pmap_signin and pmap_signout events. With brick multiplexing, this flag
identifies whether the very first brick with which the process was
spawned has completed its sign-in process. However, on a glusterd
restart, when a brick is already identified as running, glusterd does a
pmap_registry_bind to ensure its portmap table is updated, but this flag isn't
set. That is fine without brick multiplexing, but with it, if
the very first brick which came up as part of the process is replaced,
the subsequent brick attach will fail. One way to validate this
is to create and start a volume, remove the first brick and then
add-brick a new one. The add-brick operation will take a very long time, and
after that the volume status will show all bricks apart from
the new brick as down.
The solution is to set brickinfo->port_registered to true for all the
running bricks when brick multiplexing is enabled.
Change-Id: Ib0662d99d0fa66b1538947fd96b43f1cbc04e4ff
Fixes: bz#1560957
Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
|
|
|
|
|
|
|
|
| |
Options levels for Changelog Xlator
Change-Id: Idd246717e38096c44258a990a0939f82e5fc9654
Updates: #430
Signed-off-by: Aravinda VK <avishwan@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Lookup-optimize has been shown to improve create
performance. The code has been in the project for several
years and is considered stable.
Enabling this by default in order to test this in the
upstream regression runs.
Change-Id: Iab792979ee34f0af4713931e0b5b399c23f65313
updates: bz#1557435
BUG: 1557435
Signed-off-by: N Balachandran <nbalacha@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For transactions where there's no volname involved (e.g. gluster v
status), the originator node starts with the staging phase, which
means no unlock event is triggered in op-sm, resulting in a
txn_opinfo dictionary leak.
Credits : cynthia.zhou@nokia-sbell.com
Change-Id: I92fffbc2e8e1b010f489060f461be78aa2b86615
Fixes: bz#1550339
Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
|
|
|
|
|
|
| |
Change-Id: I97a70d29365b0a454241ac5f5cae56d93eefd73a
Fixes: bz#1563334
Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
|
|
|
|
|
|
|
|
|
| |
On shd, we shouldn't treat any brick as down based
on latency; otherwise self-heal will never happen.
fixes: bz#1562717
Change-Id: Ica07fcc4fae91a6bfd9c9a670e2be464704d94b7
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
We are setting mgmt_v3_timer->timer to NULL after mgmt_v3_timer is deleted,
which is unnecessary, so the statement is removed.
This issue was caught while running glusterd with ASAN.
Change-Id: Ied1f91590a2c64ec1af36d4de9c3febd6cf94bb9
Fixes: bz#1562907
Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
|
|
|
|
|
|
|
| |
Updates #412
Change-Id: Ida53d8b630feabb856a3551fa888f92382ade768
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Set the levels for DHT options based on
https://review.gluster.org/#/c/19466/
Change-Id: I51b31a706a0b9517404e83224c89de145fd5d7e1
updates: #430
Signed-off-by: N Balachandran <nbalacha@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Usage: Use 'reader-thread-count=<NUM>' as a command line option to
set the thread count at the time of mounting the volume.
The next task is to make these threads auto-scale based on the load,
instead of having the user remount the volume every time to change
the thread count.
Updates #412
Change-Id: I94aa1505e5ae6a133683d473e0e4e0edd139b76b
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With lookup-optimize enabled, gf_defrag_settle_hash in rebalance
sometimes flips the on-disk layout on volume root post the
migration of all files in the directory.
This is sometimes seen when attempting to fix the layout of a
directory multiple times before calling gf_defrag_settle_hash.
dht_fix_layout_of_directory generates a new layout in memory but
updates it in the inode ctx before it is set on disk. The layout
may be different the second time around due to
dht_selfheal_layout_maximize_overlap. If the layout is then not
written to the disk, the inode now contains the wrong layout.
gf_defrag_settle_hash does not check the correctness of the layout
in the inode before updating the commit-hash and writing it to the
disk thus changing the layout of the directory.
Change-Id: Ie1407d92982518f2a0c40ec70ad370b34a87b4d4
updates: bz#1557435
Signed-off-by: N Balachandran <nbalacha@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit a60fc2ddc03134fb23c5ed5c0bcb195e1649416b.
This commit was causing multiple tests to time out when brick
multiplexing is enabled. With further debugging, it's found that even
though the volume stop transaction is converted into mgmt_v3 to allow
the remote nodes to follow the synctask framework to process the command,
there are other callers of glusterd_brick_stop () which are not synctask
based.
Change-Id: I7aee687abc6bfeaa70c7447031f55ed4ccd64693
updates: bz#1545048
|
|
|
|
|
|
|
|
|
|
| |
Updates: #363
This new value (3) will try to wind read requests to the child of AFR
having the least amount of pending requests in its queue.
Change-Id: If6bda2aac9bf7aec3fc39622f78659313c4b6508
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
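A minimal sketch of the "least pending requests" choice described above. The function and argument names are hypothetical; the per-child counters and the readable mask are assumed inputs, not AFR's actual data structures.

```python
def pick_read_child(pending, readable):
    """Return the index of the readable child with the fewest pending requests.

    pending[i]  -- outstanding requests queued on child i
    readable[i] -- True if child i may serve the read
    Returns None when no child is readable; ties go to the lowest index.
    """
    best = None
    for child, count in enumerate(pending):
        if not readable[child]:
            continue
        if best is None or count < pending[best]:
            best = child
    return best
```

For example, with pending counts [5, 2, 7] and all children readable, the read is wound to child 1.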
|
|
|
|
|
|
|
|
|
|
|
|
| |
The xattr trusted.glusterfs.list-node-uuids was only sent to a single
subvolume. This was returning null uuids from the other subvolumes as
if they were down.
This fix forces that xattr to be requested from all subvolumes.
Change-Id: If62eb39a6857258923ba625e153d4ad79018ea2f
fixes: bz#1561406
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
|
|
|
|
|
|
|
|
| |
log message describe the actual test
Change-Id: I1ea7300a6b186032a65236492d6d2a6eef0ab983
fixes: bz#1560441
Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem: There's a race between the last glusterfs_handle_terminate()
response sent to glusterd and the kill that happens immediately if the
terminated brick is the last brick.
Solution: With brick multiplexing, when the terminated brick is the last brick
of the brick process, glusterd kills the process instead of glusterfsd
killing itself. The gf_attach utility is changed accordingly.
Change-Id: I386c19ca592536daa71294a13d9fc89a26d7e8c0
fixes: bz#1545048
BUG: 1545048
Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
|
|
|
|
|
|
|
|
|
| |
ENOSPC returned by a file migration is no longer
considered a rebalance failure.
Change-Id: I21cf3a8acdc827bc478e138d6cb5db649d53a28c
fixes: bz#1553598
Signed-off-by: N Balachandran <nbalacha@redhat.com>
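The change in how ENOSPC is classified can be sketched as below. The counter names ("migrated", "skipped", "failed") and both functions are assumptions for illustration; the message itself only states that ENOSPC no longer counts as a rebalance failure.

```python
import errno


def record_migration(stats, err):
    """Account for one file migration result given its errno (0 = success)."""
    if err == 0:
        stats["migrated"] += 1
    elif err == errno.ENOSPC:
        # No space on the target: not counted as a rebalance failure.
        stats["skipped"] += 1
    else:
        stats["failed"] += 1


def rebalance_failed(stats):
    # Only genuine errors mark the whole rebalance as failed.
    return stats["failed"] > 0
```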
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem: If a lookup is done on a newly added brick for a path on which the
limit has been reached, the lookup fails to heal the directory tree due to quota.
Solution: Tag the lookup as an internal fop and ignore it in quota.
Since marking a fop as internal does not usually give enough contextual
information, new flags are introduced to pass the contextual info.
dict_check_flag and dict_set_flag are added to aid flag operations.
A flag is a single bit in a bit array (currently limited to 256 bits).
Change-Id: Ifb6a68bcaffedd425dd0f01f7db24edd5394c095
fixes: bz#1505355
BUG: 1505355
Signed-off-by: Sanoj Unnikrishnan <sunnikri@redhat.com>
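The "single bit in a bit array" idea can be sketched in a few lines. This is a Python model only: `new_flags`, `set_flag` and `check_flag` are hypothetical helpers and do not reproduce the signatures of dict_set_flag and dict_check_flag; the bit order within a byte is likewise an assumption.

```python
FLAG_BITS = 256  # limit stated in the commit message


def new_flags():
    # 256 bits packed into 32 bytes, all cleared.
    return bytearray(FLAG_BITS // 8)


def _check_range(bit):
    if not 0 <= bit < FLAG_BITS:
        raise ValueError("flag %r out of range" % (bit,))


def set_flag(flags, bit):
    _check_range(bit)
    flags[bit // 8] |= 1 << (bit % 8)


def check_flag(flags, bit):
    _check_range(bit)
    return bool(flags[bit // 8] & (1 << (bit % 8)))
```

A caller would set one bit to say, for instance, "this lookup is part of a heal" and the quota layer would test that bit before enforcing limits.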
|
|
|
|
|
|
|
| |
Updates: #425
Change-Id: Iea5198821f4eabc46bc63529afa4a92d4b4c2be0
Signed-off-by: Poornima G <pgurusid@redhat.com>
|
|
|
|
|
|
| |
Change-Id: Iefc5a00d36436b23181871fa365f27b8d90cff0a
fixes: bz#1560441
Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
|
|
|
|
|
|
| |
Change-Id: I8f9c594cf56331d54eb4884335699744685ef20d
fixes: bz#1560441
Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
|
|
|
|
|
|
|
| |
Updates: #429
Change-Id: Ic2e64422055f1838d5d453643c739ef1e9319cfe
Signed-off-by: Poornima G <pgurusid@redhat.com>
|
|
|
|
|
|
|
| |
Updates: #427
Change-Id: Ib1f45016ac75d7bc2755db0dd4b68ce1d95d26c3
Signed-off-by: Poornima G <pgurusid@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
alert-time, soft timeout, hard timeout, default soft limit
and deem-statfs are settable through the volume set command,
hence they are marked as settable.
Other options are used only via quota commands.
Updates #302
Change-Id: I02d258cc3aa7fe58ccbadd59441cce64cfd9ba6e
Signed-off-by: Sanoj Unnikrishnan <sunnikri@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Provide correct error message for changelog end time check
Updated error message to print "wrong result for end".
Original patch by Keith Schincke <kschinck@redhat.com>
from https://review.gluster.org/#/c/8121/
Change-Id: Ia3458cbac7784bfc71c05da67391a3f8259f18f0
BUG: 1559126
Signed-off-by: Niklas Hambüchen <mail@nh2.me>
|
|
|
|
|
|
|
|
| |
`find_library()` doesn't consider LD_LIBRARY_PATH on Python < 3.6.
Change-Id: Iee26085cb5d14061001f19f032c2664d69a378a8
BUG: 1450593
Signed-off-by: Niklas Hambüchen <mail@nh2.me>
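One way to work around the limitation noted above is to search LD_LIBRARY_PATH by hand before falling back to `ctypes.util.find_library()`. The sketch below shows that approach under assumptions: the helper name is hypothetical, it only handles the `lib<name>.so[.version]` naming used on Linux, and it is not necessarily what the patch itself does.

```python
import ctypes.util
import os


def find_library_with_ld_path(name):
    """Return a path or soname for `name`, honouring LD_LIBRARY_PATH."""
    prefix = "lib" + name + ".so"
    search = os.environ.get("LD_LIBRARY_PATH", "")
    for directory in search.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue
        for entry in sorted(os.listdir(directory)):
            # Accept libfoo.so as well as versioned libfoo.so.0 names.
            if entry == prefix or entry.startswith(prefix + "."):
                return os.path.join(directory, entry)
    # Fall back to the standard lookup (ldconfig cache etc.).
    return ctypes.util.find_library(name)
```

The returned value can be passed to `ctypes.CDLL()` in both cases.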
|
|
|
|
|
|
|
|
| |
Ownthread feature needs enabling for glusterfs4_0_fop_prog
Change-Id: Idce63eb094ae0fdfcddbd52d0dee25aa0e074926
BUG: 1559075
Signed-off-by: Milind Changire <mchangir@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When the self-heal daemon is doing a full sweep it uses readdirp to
get extra stat information from each file. This information is
obtained in two steps by the posix xlator: first the directory is
read to get the entries and then each entry is stated to get additional
info. Between these two steps, it's possible that the file is removed
by the user, so we'll get an error, leaving stat info empty.
EC's heal daemon was using the gfid blindly, causing an assert failure
when protocol/client was trying to encode the gfid.
To fix the problem a check has been added. If we detect a null gfid, we
simply ignore it and continue healing.
Change-Id: I2e4acdcecd0b6951055e50d1c37d686a2186a228
BUG: 1558016
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
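The added check amounts to filtering out entries whose gfid is null before they are queued for heal. A minimal sketch, assuming entries arrive as (name, gfid) pairs and that an all-zero UUID represents a null gfid; the function name is hypothetical.

```python
import uuid

NULL_GFID = uuid.UUID(int=0)


def entries_to_heal(entries):
    """Yield (name, gfid) pairs, skipping entries without a usable gfid.

    An entry can lose its stat info if the file is removed between the
    directory read and the per-entry stat, leaving a null gfid behind.
    """
    for name, gfid in entries:
        if gfid is None or gfid == NULL_GFID:
            continue  # ignore it and continue healing the rest
        yield name, gfid
```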
|
|
|
|
|
|
| |
BUG: 1557932
Change-Id: I3783e41b3812267bc10c0d05d062a31396ce135b
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
When dd runs on a sharded replicate volume, all the writes on shards happen
through anon-fd. When the writes don't come quickly enough, the old anon-fd closes
and a new fd gets created to serve the new writes. open-fd-count is decremented
only after the fd is closed as part of fd_destroy(). So even when one fd is on
the way to being closed, a new fd will be created, and during this short period it
appears as though there are multiple fds opened on the file. AFR thinks another
application opened the same file and switches off eager-lock, leading to
extra latency.
Fix:
Have a different option called active-fd whose life cycle starts at
fd_bind() and ends just before fd_destroy()
BUG: 1557932
Change-Id: I2e221f6030feeedf29fbb3bd6554673b8a5b9c94
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
shard_post_lookup_fsync_handler() goes over the list of inode-ctx that need to
be fsynced, and in the cbk each inode-ctx is removed from the list. When the
first member of the list is removed, it tries to modify the list head's memory
with the latest next/prev, and when this happens there is no guarantee that the
list head, which is in the stack memory of shard_post_lookup_fsync_handler(), is
still valid.
Fix:
Do list_del_init() in the loop before winding fsync.
BUG: 1557876
Change-Id: If429d3634219e1a435bd0da0ed985c646c59c2ca
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Performing pause/resume on geo-replication as the wrong user
(a user other than the one the session was set up with) always returns
success. This further leads to snapshot creation failure, as it detects
an active geo-replication session.
Change-Id: I6e96e8dd3e861348b057475387f0093cb903ae88
BUG: 1550936
Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem: TLS verification fails when using an intermediate CA
if mgmt SSL is enabled.
Solution: There are two main causes of the TLS verification failure:
1) ssl_api is not called to set cert_depth
2) The current code does not allow setting the certificate depth
while MGMT SSL is enabled.
After applying this patch, to set the certificate depth the user
needs to set the option transport.socket.ssl-cert-depth <depth>
in /var/lib/glusterd/secure-access instead of in
/etc/glusterfs/glusterd.vol. When secure_mgmt is set in ctx,
the value of cert-depth is checked and saved
in ctx. If the user does not provide a value for cert-depth,
the default value of 1 is used.
BUG: 1555154
Change-Id: I89e9a9e1026e37efb5c20f9ec62b1989ef644f35
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
|
|
|
|
|
|
|
|
| |
Memory cleanup of the same pointer twice inside gd_mgmt_v3_unlock_timer_cbk
was causing glusterd to crash.
Change-Id: I9147241d995780619474047b1010317a89b9965a
BUG: 1550339
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
1) Afr's eager-lock only works for data transactions.
2) When there are conflicting writes, the write with the conflicting region
initiates an unlock of the eager-lock, leading to extra pre-ops and post-ops on
the file. When eager-lock goes off, it leads to extra fsyncs for random-write
workloads in afr.
Solution (that is modeled after EC):
In EC, when there is a conflicting write, it waits for the current write to
complete before it winds the conflicted write. This leads to better utilization
of network and disk, because we will not be doing extra xattrops, FSYNCs and
inodelk/unlock. Moved fd based counters to inode based counters.
I tried to model the solution on EC's locking, but it is not identical in
AFR because we had to keep backward compatibility.
Lifecycle of lock:
==================
The first transaction is added to the inode->owners list and an inodelk is sent
on the wire. All subsequent transactions are put in the inode->waiters list
until the first transaction completes both inodelk and [f]xattrop. Once
[f]xattrop also completes, each request in the inode->waiters list is
checked for conflicts with the existing locks in the inode->owners list; those
that do not conflict are added to the inode->owners list and resume their
transaction. When these transactions complete the fop phase, they are
moved to the inode->post_op list, and the transactions that were paused
because of conflicts are resumed. Post-op and unlock are not issued on the wire
until the last transaction on that inode. When the last transaction has to
perform post-op, it can choose to sleep for the delayed-post-op-secs value. If
any other transaction comes in during that time, it wakes up the sleeping
transaction, takes over the ownership of the lock, and the cycle continues. If
delayed-post-op-secs expires, the timer thread wakes up the sleeping
transaction, which sets lock->release to true and starts doing post-op and
then unlock. Any transactions that come in during this time are put
in the inode->frozen list. Once the previous unlock completes, the frozen
list is moved to the waiters list, the first element of the waiters list is
moved to the owners list, the lock is attempted, and the cycle continues. This
is the general idea. There is logic at the time of delaying, at the time of a
new transaction, and in the flush fop to wake up existing sleeping transactions
or to choose whether to delay a transaction, etc., which is subject to change
based on future enhancements.
Fixes: #418
BUG: 1549606
Change-Id: I88b570bbcf332a27c82d2767dfa82472f60055dc
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
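The owners/waiters part of the lifecycle above can be modelled in a few lines. This is a deliberately simplified Python sketch: the class and method names are hypothetical, transactions are reduced to byte ranges, and the delayed post-op, the frozen list and the actual inodelk/[f]xattrop calls are left out.

```python
class Txn:
    """A write transaction covering bytes [start, start + length)."""

    def __init__(self, name, start, length):
        self.name = name
        self.start = start
        self.end = start + length

    def conflicts(self, other):
        # Two half-open ranges overlap if each starts before the other ends.
        return self.start < other.end and other.start < self.end


class InodeLock:
    def __init__(self):
        self.owners = []
        self.waiters = []
        self.post_op = []
        self.acquired = False  # inodelk and [f]xattrop completed

    def submit(self, txn):
        if not self.owners and not self.acquired:
            self.owners.append(txn)  # first txn sends the inodelk
            return
        self.waiters.append(txn)
        if self.acquired:
            self._admit()

    def lock_acquired(self):
        self.acquired = True
        self._admit()

    def fop_done(self, txn):
        self.owners.remove(txn)
        self.post_op.append(txn)
        self._admit()  # resume transactions paused by conflicts

    def _admit(self):
        still_waiting = []
        for txn in self.waiters:
            if any(txn.conflicts(owner) for owner in self.owners):
                still_waiting.append(txn)
            else:
                self.owners.append(txn)
        self.waiters = still_waiting
```

In this model a conflicting write simply stays in the waiters list until the owner it overlaps with finishes its fop phase, instead of forcing an unlock.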
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
Whenever we read data from a file over NFS, NFS reads
more data than requested and caches it. Based on the
stat information it checks whether the cached/pre-read
data is still valid.
Consider a 4 + 2 EC volume where all the bricks are on
different nodes.
In EC, with the round-robin read policy, reads are sent to
different sets of data bricks. This balances the
read fops across all the bricks and avoids heating up
(overloading) the same set of bricks.
Due to small differences in clock speed, it is possible
to get minor differences in atime, mtime or ctime
from different bricks. That can cause a different stat to be
returned to NFS, based on which NFS will discard
cached/pre-read data which has actually not changed and
could be used.
Solution:
Change the read policy for EC to gfid-hash. That forces
all the reads to go to the same set of bricks.
Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84
BUG: 1554743
Signed-off-by: Ashish Pandey <aspandey@redhat.com>
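The difference between the two read policies can be sketched as follows. This is an illustrative Python model, not EC's implementation: the function name, the use of `zlib.crc32` as the gfid hash, and the "consecutive bricks from a start index" selection are all assumptions.

```python
import zlib


def bricks_for_read(gfid, total, needed, policy, counter):
    """Pick `needed` brick indexes out of `total` for one read.

    round-robin: the start index advances on every call (counter is
    any iterator of increasing integers), so reads of the same file
    hit different brick sets and may see slightly different times.
    gfid-hash: the start index depends only on the gfid, so all reads
    of one file go to the same set of bricks.
    """
    if policy == "round-robin":
        start = next(counter) % total
    else:
        start = zlib.crc32(gfid.encode("ascii")) % total
    return [(start + i) % total for i in range(needed)]
```

For the 4 + 2 volume in the example, `total` is 6 and `needed` is 4.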
|