summaryrefslogtreecommitdiffstats
path: root/xlators/cluster/dht/src/dht-common.h
Commit message (Collapse)AuthorAgeFilesLines
* dht - Remove "tier" code (part 2)Barak Sason Rofman2020-10-011-103/+0
| | | | | | | | | | | | | | Part 1 of this patch https://review.gluster.org/#/c/glusterfs/+/24328/ Following part 1, this patch complety removes all traces of "tier" feature in dht. This is based in the work done in https://review.gluster.org/#/c/glusterfs/+/23935/ Change-Id: I7fba1ab7249719301ca578b4a6f4acac748da145 updates: #1097 Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
* dht: Ongoing IO is failing on non-distribute volumes after just add-brickMohit Agrawal2020-09-281-0/+2
| | | | | | | | | | | | | Problem: On a non-distributed volumes linux kernel untar is failed after running add-brick operation Solution: 1) Save hashed subvol as a MDS in case while MDS has not been populated Fixes: #1328 Change-Id: I9967e136da008c6367973a7346637617dfa8f934 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* Revert "dht: optimize rebalance crawl path"Pranith Kumar K2020-09-031-11/+0
| | | | | | | | | | | Based on the discussion on the issue, it is decided that it is better to not have this implementation of the feature. This reverts commit 3af9443c770837abe4f54db399623380ab9767a7. Change-Id: I4e3bf18fc376cdb0cf29f1d98a915deca17c3496 Updates: #1422 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/dht: simplify and cleanup internal time managementDmitry Antipov2020-08-211-2/+2
| | | | | | | | | | Prefer time_t and gf_time() over 'struct timeval' and gettimeofday() where microseconds are not really used, drop unneeded 'struct timeval' to 'struct timespec' conversion in dht_file_counter_thread(). Change-Id: Ibd802f79b8848df3f6175ca1fd82e93532bba38d Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Updates: #1002
* dht: optimize rebalance crawl pathSusant Palai2020-07-311-0/+11
| | | | | | | | | | | | | | | | | | | | | | For distribute only volumes we can use the information for local subvolumes to avoid syncop calls which goes through the whole stack to fetch stat and entries. A separate function gf_defrag_fix_layout_puredist is introduced. TODO: A glusterd flag needs to be introduced in case we want to fall back to run the old way. Perf numbers: DirSize - 1Million Old New %diff Depth - 100 (Run 1) 353 74 +377% Depth - 100 (Run 2) 348 72 +377~% Depth - 50 246 122 +100% Depth - 3 174 114 +52% Change-Id: I67cc136cebd34092fd775e69f74c2d5b33d3156d Fixes: #1242 Signed-off-by: Susant Palai <spalai@redhat.com>
* dht - fixing xattr inconsistencyBarak Sason Rofman2020-07-221-4/+0
| | | | | | | | | | | | | | | | | The scenario of setting an xattr to a dir, killing one of the bricks, removing the xattr, bringing back the brick results in xattr inconsistency - The downed brick will still have the xattr, but the rest won't. This patch add a mechanism that will remove the extra xattrs during lookup. This patch is a modification to a previous patch based on comments that were made after merge: https://review.gluster.org/#/c/glusterfs/+/24613/ fixes: #1324 Change-Id: Ifec0b7aea6cd40daa8b0319b881191cf83e031d1 Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
* dht: Do opendir selectively in gf_defrag_process_dirSusant Palai2020-05-121-0/+2
| | | | | | | | | | | | | | | | Currently opendir is done from the cluster view. Hence, even if one opendir is successful, the opendir operation as a whole is considered successful. But since in gf_defrag_get_entry we fetch entries selectively from local_subvols, we need to opendir individually on those local subvols and keep track of fds separately. Otherwise it is possible that opendir failed on one of the subvol and we wind readdirp call on the fd to the corresponding subvol, which will ultimately result in EINVAL error. fixes: #1218 Change-Id: I50dd88b9597852a15579f4ee325918979417f570 Signed-off-by: Susant Palai <spalai@redhat.com>
* dht xlator: integer handling issuenik-redhat2020-04-291-1/+1
| | | | | | | | | | | | | | Issue: The ret value is passed to the function instead of the proper errno value Fix: Passing the errno generated to the log function CID: 1415824 : Improper use of negative value CID: 1420205 : Improper use of negative value Change-Id: Iaa7407ebd03eda46a2c027695e6bf0f598b371b2 Updates: #1060 Signed-off-by: nik-redhat <nladha@redhat.com>
* dht: Handle setxattr and rm race for directory in rebalanceSusant Palai2020-04-281-0/+3
| | | | | | | | | | | | | | Problem: Selfheal as part of directory does not return an error if the layout setxattr fails. This is because the actual lookup fop must have been successful to proceed for layout heal. Hence, we could not tell if fix-layout failed in rebalance. Solution: We can check this information in the layout structure that whether all the xlators have returned error. fixes: #1200 Change-Id: I3e5f2a36c0d934c21476a73a9a5473d8e490cde7 Signed-off-by: Susant Palai <spalai@redhat.com>
* dht - Remove "tier" code (part 1)v9devBarak Sason Rofman2020-04-171-18/+0
| | | | | | | | | | | | | This patch is removing some of the "tier" code in dht xlator, as it is no longer being used. Not all of the not-needed code is removed at once, so reviewing is easier. Follow up patches removing additional unused code will follow. This is based in the work done in https://review.gluster.org/#/c/glusterfs/+/23935/ Change-Id: I3cb6a0c5d8f14afcd87cf021ef8f74b91c0f908a updates: #1097 Signed-off-by: Barak Sason Rofman <bsaonro@redhat.com>
* dht - Reducing methods scopeBarak Sason Rofman2020-02-131-17/+8
| | | | | | | | | | 1. Reduced methods scope in the following: inode read&write, layout, linkfile, shard 2. Removed dead code @ dht-linkkile.c:174-228 & dht-shard.c:44 Change-Id: I2d08a10c7b074fccdb0c020845cad60c6ea32db5 updates: bz#1193929 Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
* dht: Fix stale-layout and create issueSusant Palai2020-02-091-0/+6
| | | | | | | | | | | | | | | | | Problem: With lookup-optimize set to on by default, a client with stale-layout can create a new file on a wrong subvol. This will lead to possible duplicate files if two different clients attempt to create the same file with two different layouts. Solution: Send in-memory layout to be cross checked at posix before commiting a "create". In case of a mismatch, sync the client layout with that of the server and attempt the create fop one more time. test: Manual, testcase(attached) fixes: bz#1786679 Change-Id: Ife0941f105113f1c572f4363cbcee65e0dd9bd6a Signed-off-by: Susant Palai <spalai@redhat.com>
* geo-rep: Fix for "Transport End Point not connected" issueHarpreet Kaur2020-01-311-0/+4
| | | | | | | | | | | | | | | | | | | | | | problem: Geo-rep gsyncd process mounts the master and slave volume on master nodes and slave nodes respectively and starts the sync. But it doesn't wait for the mount to be in ready state to accept I/O. The gluster mount is considered to be ready when all the distribute sub-volumes is up. If the all the distribute subvolumes are not up, it can cause ENOTCONN error, when lookup on file comes and file is on the subvol that is down. solution: Added a Virtual Xattr "dht.subvol.status" which returns "1" if all subvols are up and "0" if all subvols are not up. Geo-rep then uses this virtual xattr after a fresh mount, to check whether all subvols are up or not and then starts the I/O. fixes: bz#1664335 Change-Id: If3ad01d728b1372da7c08ccbe75a45bdc1ab2a91 Signed-off-by: Harpreet Kaur <hlalwani@redhat.com> Signed-off-by: Kotresh HR <khiremat@redhat.com>
* DHT - Reduce methods scope (dht-common.c)Barak Sason Rofman2019-12-171-53/+5
| | | | | | | | | | | Methods that should have been static were defined as global, and the other way around. This patch fixes the issue in order to enforce encapsulation. updates: bz#1776757 Change-Id: I3eb5781849c5e597c1dd347e03f356c00db62a39 Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
* dht-common.h/dht-helper.c: exctract the LOCK() from DHT_UPDATE_TIMEYaniv Kaul2019-12-171-14/+10
| | | | | | | | | | | | | Currently, the code (and only place) that is using this macro is in dht_inode_ctx_time_update() where it is called 3 times in a row, which is essentially 3 cycles of LOCK/UNLOCK on the same lock. Instead, extract the LOCK()/UNLOCK() part of the macro and wrap those calls with it. Change-Id: I6312b985e3d97517857b55f342440accc4063db6 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* dht: Rebalance causing IO Error - File descriptor in bad stateMohit Agrawal2019-10-151-0/+19
| | | | | | | | | | | | | | | | Problem : When a file is migrated, dht attempts to re-open all open fds on the new cached subvol. Earlier, if dht had not opened the fd, the client xlator would be unable to find the remote fd and would fall back to using an anon fd for the fop. That behavior changed with https://review.gluster.org/#/c/glusterfs/+/15804, causing fops to fail with EBADFD if the fd was not available on the cached subvol. The client xlator returns EBADFD if the remote fd is not found but dht only checks for EBADF before re-opening fds on the new cached subvol. Solution: Handle EBADFD at dht code path to avoid the issue Change-Id: I43c51995cdd48d05b12e4b2889c8dbe2bb2a72d8 Fixes: bz#1758579
* dht-common.h: reorder variables to reduce padding.Yaniv Kaul2019-07-151-73/+81
| | | | | | | | | Manually added '-Wpadded' to get warnings on padding, and reordered structs to reduce most of them. Change-Id: I0c505fcb3dfef76399ac9d5d33bfb235354532de updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* multiple files: another attempt to remove includesYaniv Kaul2019-06-141-2/+0
| | | | | | | | | | | | | | | | | | There are many include statements that are not needed. A previous more ambitious attempt failed because of *BSD plafrom (see https://review.gluster.org/#/c/glusterfs/+/21929/ ) Now trying a more conservative reduction. It does not solve all circular deps that we have, but it does reduce some of them. There is just too much to handle reasonably (dht-common.h includes dht-lock.h which includes dht-common.h ...), but it does reduce the overall number of lines of include we need to look at in the future to understand and fix the mess later one. Change-Id: I550cd001bdefb8be0fe67632f783c0ef6bee3f9f updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* cluster/dht: Do not use gfid-req in fresh lookupN Balachandran2019-02-021-1/+3
| | | | | | | | | | | | Fuse sets a random gfid-req value for a fresh lookup. Posix lookup will set this gfid on entries with missing gfids causing a GFID mismatch for directories. DHT will now ignore the Fuse provided gfid-req and use the GFID returned from other subvols to heal the missing gfid. Change-Id: I5f541978808f246ba4542564251e341ec490db14 fixes: bz#1670259 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* libglusterfs: Move devel headers under glusterfs directoryShyamsundarR2018-12-051-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | libglusterfs devel package headers are referenced in code using include semantics for a program, this while it works can be better especially when dealing with out of tree xlator builds or in general out of tree devel package usage. Towards this, the following changes are done, - moved all devel headers under a glusterfs directory - Included these headers using system header notation <> in all code outside of libglusterfs - Included these headers using own program notation "" within libglusterfs This change although big, is just moving around the headers and making it correct when including these headers from other sources. This helps us correctly include libglusterfs includes without namespace conflicts. Change-Id: Id2a98854e671a7ee5d73be44da5ba1a74252423b Updates: bz#1193929 Signed-off-by: ShyamsundarR <srangana@redhat.com>
* dht: coverity fixesSusant Palai2018-10-051-1/+1
| | | | | | | | | | CID: 1356541 Issue: Dereference null return value CID: 1382411 Issue: Dereference after null check CID: 1391409 Issue: Unchecked return value Change-Id: Id3d4feb4e88df424003cc8e8a1540e77bbe030e3 Updates: bz#789278 Signed-off-by: Susant Palai <spalai@redhat.com>
* dht: Operate internal fops with negative pidSusant Palai2018-09-201-0/+1
| | | | | | | | | | | | | | | | | | | With root-squash on, all root credentials are converted to a random uid, gid(65535). And ideally this does not carry the necessary permission bits to carry out the operation. But posix-acl will allow operations from this inode as long as its ctx has the ngroup information and ngroup has the owner group information. The problem we ran into recently was somehow posix-acl xlator did not cache the ngroup info and some of the dht internal fops(layout setxattr) failed with root-squash enabled. DHT internal fops now use a negative pid to pretend that the operation is from an internal client so posix-acl allows them to pass Change-Id: I5bb8d068389bf4c94629d668a16015a95ccb53ab fixes: bz#1624796 Signed-off-by: Susant Palai <spalai@redhat.com>
* dht: utilize the framework to pass-through xlator tasksAmar Tumballi2018-09-191-0/+20
| | | | | | | | | | | | | | | | | | | | | Also fixes the issue caused due to not converting back the fn function to after getting its address. We wanted the value of the field, not the address of the pt_fop field. With this patch, DHT will always be started in pass-through mode if the number of subvols is just 1. Fixes some tests to make sure DHT is in full config (ie, subvols > 1). - increased timeout of brick-mux test as it was bordering on 300 seconds. - Also change the volume type to supported 'replica 3' from 'replica 2'. - also no DHT tests should assume presence of DHT when there is just 1 brick in volume Credits: Nithya B <nbalacha@redhat.com> fixes: #405 Change-Id: I8e55239ce58d6ac6ae1901e2e384be1ecbd33d6e Signed-off-by: Amar Tumballi <amarts@redhat.com>
* Land clang-format changesGluster Ant2018-09-121-1102/+1113
| | | | Change-Id: I6f5d8140a06f3c1b2d196849299f8d483028d33b
* cluster/dht: Rework the debug xattr to get hashed subvolN Balachandran2018-09-071-1/+2
| | | | | | | | | | | | | | | | | | | The earlier implementation required the file to already exist when trying to get the hashed subvol. The reworked implementation allows a user to get the hashed subvol for any filename, whether it exists or not. Usage: getfattr -n "dht.file.hashed-subvol.<filename>" <parent dir> Eg:To get the hashed subvol for file-1 inside dir-1 getfattr -n "dht.file.hashed-subvol.file-1" /mnt/gluster/dir1 credit: rgowdapp@redhat.com Change-Id: Iae20bd5f56d387ef48c1c0a4ffa9f692866bf739 fixes: bz#1624244 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* dht: remove useless argument from dht_iatt_mergeKinglong Mee2018-07-181-2/+1
| | | | | | | | | | The last using of the subvol argument has been removed at 4e1ec35ef4f7 ("core: fill 'ia_ino' from 'ia_gfid' in 'storage/posix' ......") 7 years ago (2011-06-16). Change-Id: I9788d79e2e40cc153cf2960e28c7c1c1033dc8f7 fixes: bz#1601683 Signed-off-by: Kinglong Mee <mijinlong@open-fs.com>
* dht: Inconsistent permission for directories after brick stop/startMohit Agrawal2018-07-111-0/+3
| | | | | | | | | | | | | | | | Problem: Inconsistent access permissions on directories after bringing back the down sub-volumes, in case of directories dht_setattr first wind a call on MDS once call is finished on MDS then wind a call on NON-MDS.At the time of revalidating dht just compare the uid/gid with stbuf uid/gid and if anyone differs set a flag to heal the same. Solution: Add a condition to compare permission also in dht_revalidate_cbk to set a flag to call dht_dir_attr_heal. BUG: 1584517 Change-Id: I3e039607148005015b5d93364536158380d4c5aa fixes: bz#1584517 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* cluster/dht: Do not try to use up the readdirp bufferN Balachandran2018-06-291-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | DHT attempts to use up the entire buffer in readdirp before unwinding in an attempt to reduce the number of calls. However, this has 2 disadvantages: 1. This can cause a stack overflow when parallel readdir is enabled. If the buffer only has a little space,rda can send back only one or two entries. If those entries are stripped out by dht_readdirp_cbk (linkto files for example) it will once again wind down to rda in an attempt to fill the buffer before unwinding to FUSE. This process can continue for several iterations, causing the stack to grow and eventually overflow, causing the process to crash. 2. If parallel readdir is disabled, dht could send readdirp calls with small buffers to the bricks, thus increasing the number of network calls. We are therefore reverting to the existing behaviour. Please note, this only mitigates the stack overflow, it does not prevent it from happening. This is still possible if a subvol has thousands of linkto files for instance. Change-Id: I291bc181c5249762d0c4fe27fa4fc2631166adf5 fixes: bz#1593548 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Remove EIO from dht_inode_missingN Balachandran2018-05-171-3/+1
| | | | | | | | | Removed EIO from the list of errnos that triggered a migrate check task. Change-Id: I7f89c7a16056421588f1af2377cebe6affddcb47 fixes: bz#1578823 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Debug virtual xattrs for dhtN Balachandran2018-05-091-0/+6
| | | | | | | | | Provide a virtual xattr with which to query the hashed subvol for a file. Change-Id: Ic7abd031f875da4b9084841ea7c25d6c8a851992 fixes: bz#1574421 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: fixes to parallel renames to same destination codepathRaghavendra G2018-05-071-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Test case: # while true; do uuid="`uuidgen`"; echo "some data" > "test$uuid"; mv "test$uuid" "test" -f || break; echo "done:$uuid"; done This script was run in parallel from multiple mountpoints Along the course of getting the above usecase working, many issues were found: Issue 1: ======= consider a case of rename (src, dst). We can encounter a situation where, * dst is a file present at the time of lookup * dst is removed by the time rename fop reaches glusterfs In this scenario, acquring inodelk on dst fails with ESTALE resulting in failure of rename. However, as per POSIX irrespective of whether dst is present or not, rename should be successful. Acquiring entrylk provides synchronization even in races like this. Algorithm: 1. Take inodelks on src and dst (if dst is present) on respective cached subvols. These inodelks are done to preserve backward compatibility with older clients, so that synchronization is preserved when a volume is mounted by clients of different versions. Once relevant older versions (3.10, 3.12, 3.13) reach EOL, this code can be removed. 2. Ignore ENOENT/ESTALE errors of inodelk on dst. 3. protect namespace of src and dst. To protect namespace of a file, take inodelk on parent on hashed subvol, then take entrylk on the same subvol on parent with basename of file. inodelk on parent is done to guard against changes to parent layout so that hashed subvol won't change during rename. 4. <rest of rename continues> 5. unlock all locks Issue 2: ======== linkfile creation in lookup codepath can race with a rename. Imagine the following scenario: * lookup finds a data-file with gfid - gfid-dst - without a corresponding linkto file on hashed-subvol. It decides to create linkto file with gfid - gfid-dst. - Note that some codepaths of dht-rename deletes linkto file of dst as first step. So, a lookup racing with an in-progress rename can easily run into this situation. * a rename (src-path:gfid-src, dst-path:gfid-dst) renames data-file and hence gfid of data-file changes to gfid-src with path dst-path. * lookup proceeds and creates linkto file - dst-path - with gfid - dst-gfid - on hashed-subvol. * rename tries to create a linkto file dst-path with src-gfid on hashed-subvol, but it fails with EEXIST. But EEXIST is ignored during linkto file creation. Now we've ended with dst-path having different gfids - dst-gfid on linkto file and src-gfid on data file. Future lookups on dst-path will always fail with ESTALE, due to differing gfids. The fix is to synchronize linkfile creation in lookup path with rename using the same mechanism of protecting namespace explained in solution of Issue 1. Once locks are acquired, before proceeding with linkfile creation, we check whether conditions for linkto file creation are still valid. If not, we skip linkto file creation. Issue 3: ======== gfid of dst-path can change by the time locks are acquired. This means, either another rename overwrote dst-path or dst-path was deleted and recreated by a different client. When this happens, cached-subvol for dst can change. If rename proceeds with old-gfid and old-cached subvol, we'll end up in inconsistent state(s) like dst-path with different gfids on different subvols, more than one data-file being present etc. Fix is to do the lookup with a new inode after protecting namespace of dst. Post lookup, we've to compare gfids and correct local state appropriately to be in sync with backend. Issue 4: ======== During revalidate lookup, if following a linkto file doesn't lead to a valid data-file, local->cached-subvol was not reset to NULL. This means we would be operating on a stale state which can lead to inconsistency. As a fix, reset it to NULL before proceeding with lookup everywhere. Issue 5: ======== Stale dentries left out in inode table on brick resulted in failures of link fop even though the file/dentry didn't exist on backend fs. A patch is submitted to fix this issue. Please check the dependency tree of current patch on gerrit for details In short, we fix the problem by not blindly trusting the inode-table. Instead we validate whether dentry is present by doing lookup on backend fs. Change-Id: I832e5c47d232f90c4edb1fafc512bf19bebde165 updates: bz#1543279 BUG: 1543279 Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: log error only if layout healing is requiredRaghavendra G2018-05-071-0/+1
| | | | | | | | | | | | | | | | selfhealing of directory is invoked on two conditions: 1. no layout on disk or layout has some anomalies (holes/overlaps) 2. mds xattr is not set on the directory When dht_selfheal_directory is called with a correct layout just to set mds xattr, we see error msgs complaining about "not able to form layout on directory", which is misleading as the layout is correct. So, log this msg only if layout has anomalies. Change-Id: I4af25246fc3a2450c2426e9902d1a5b372eab125 updates: bz#1543279 BUG: 1543279 Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: store the 'reaction' on failures per lockRaghavendra G2018-02-231-7/+9
| | | | | | | | | | | Currently its passed in dht_blocking_inode(entry)lk, which would be a global value for all the locks passed in the argument. This would be a limitation for cases where we want to ignore failures on only few locks and fail for others. Change-Id: I02cfbcaafb593ad8140c0e5af725c866b630fb6b BUG: 1543279 Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: avoid overwriting client writes during migrationSusant Palai2018-02-021-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | For more details on this issue see https://github.com/gluster/glusterfs/issues/308 Solution: This is a restrictive solution where a file will not be migrated if a client writes to it during the migration. This does not check if the writes from the rebalance and the client actually do overlap. If dht_writev_cbk finds that the file is being migrated (PHASE1) it will set an xattr on the destination file indicating the file was updated by a non-rebalance client. Rebalance checks if any other client has written to the dst file and aborts the file migration if it finds the xattr. updates gluster/glusterfs#308 Change-Id: I73aec28bc9dbb8da57c7425ec88c6b6af0fbc9dd Signed-off-by: Susant Palai <spalai@redhat.com> Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Fixed leak in dht_populate_inode_for_dentryN Balachandran2018-02-021-1/+1
| | | | | | | | | | Fixed an issue in dht_populate_inode_for_dentry where a layout is set in the inode without checking if it is already set. This overwrites the value each time without freeing the already existing layout. Change-Id: I651bf539a0b82b4ddc4c355890c16a8e91f5f1fd BUG: 1541264 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht: Change datatype of search_unhashed variableVarsha Rao2018-01-101-1/+1
| | | | | | | | | | Variable search_unhashed is of type boolean, change it to integer type. This fixes the warning increment of a boolean expression. BUG: 1531987 Change-Id: Ibf153f6a9ad704da38bff346b6a21a71323ed9bb Signed-off-by: Varsha Rao <varao@redhat.com>
* cluster/dht: Add migration checks to dht_(f)xattropN Balachandran2017-12-261-0/+9
| | | | | | | | | | | | The dht_(f)xattrop implementation did not implement migration phase1/phase2 checks which could cause issues with rebalance on sharded volumes. This does not solve the issue where fops may reach the target out of order. Change-Id: I2416fc35115e60659e35b4b717fd51f20746586c BUG: 1471031 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* glusterfs: Use gcc builtin ATOMIC operator to increase/decreate refcount.Mohit Agrawal2017-12-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: In glusterfs code base we call mutex_lock/unlock to take reference/dereference for a object.Sometime it could be reason for lock contention also. Solution: There is no need to use mutex to increase/decrease ref counter, instead of using mutex use gcc builtin ATOMIC operation. Test: I have not observed yet how much performance gain after apply this patch specific to glusterfs but i have tested same with below small program(mutex and atomic both) and get good difference. static int numOuterLoops; static void * threadFunc(void *arg) { int j; for (j = 0; j < numOuterLoops; j++) { __atomic_add_fetch (&glob, 1,__ATOMIC_ACQ_REL); } return NULL; } int main(int argc, char *argv[]) { int opt, s, j; int numThreads; pthread_t *thread; int verbose; int64_t n = 0; if (argc < 2 ) { printf(" Please provide 2 args Num of threads && Outer Loop\n"); exit (-1); } numThreads = atoi(argv[1]); numOuterLoops = atoi (argv[2]); if (1) { printf("\tthreads: %d; outer loops: %d;\n", numThreads, numOuterLoops); } thread = calloc(numThreads, sizeof(pthread_t)); if (thread == NULL) { printf ("calloc error so exit\n"); exit (-1); } __atomic_store (&glob, &n, __ATOMIC_RELEASE); for (j = 0; j < numThreads; j++) { s = pthread_create(&thread[j], NULL, threadFunc, NULL); if (s != 0) { printf ("pthread_create failed so exit\n"); exit (-1); } } for (j = 0; j < numThreads; j++) { s = pthread_join(thread[j], NULL); if (s != 0) { printf ("pthread_join failed so exit\n"); exit (-1); } } printf("glob value is %ld\n",__atomic_load_n (&glob,__ATOMIC_RELAXED)); exit(0); } time ./thr_count 800 800000 threads: 800; outer loops: 800000; glob value is 640000000 real 1m10.288s user 0m57.269s sys 3m31.565s time ./thr_count_atomic 800 800000 threads: 800; outer loops: 800000; glob value is 640000000 real 0m20.313s user 1m20.558s sys 0m0.028 Change-Id: Ie5030a52ea264875e002e108dd4b207b15ab7cc7 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* dht/crypt/tier: Fix use of booleans as integersShyamsundarR2017-12-061-1/+1
| | | | | | BUG: 1520974 Change-Id: I19ea40c888e88a7a4ac271168ed1820c2075be93 Signed-off-by: ShyamsundarR <srangana@redhat.com>
* Tier: Stop tierd for detach starthari gowtham2017-12-011-3/+10
| | | | | | | | | | | | | | | | | | | Problem: tierd was stopped only after detach commit This makes the detach take a longer time. The detach demotes the files to the cold brick and if the promotion frequency is hit, then the tierd starts to promote files to hot tier again. Fix: stop tierd after detach start so the files get demoted faster. Note: the is_tier_enabled was not maintained properly. That has been fixed too. some code clean up has been done. Signed-off-by: hari gowtham <hgowtham@redhat.com> Change-Id: I532f7410cea04fbb960105483810ea3560ca149b BUG: 1446381
* cluster/dht: Serialize mds update code path with lookup unwind in selfhealMohit Agrawal2017-11-221-4/+10
| | | | | | | | | | | | | | | | Problem: Sometime test case ./tests/bugs/bug-1371806_1.t is failing on centos due to race condition between fresh lookup and setxattr fop. Solution: In selfheal code path we do save mds on inode_ctx, it was not serialize with lookup unwind. Due to this behavior after lookup unwind if mds is not saved on inode_ctx and if any subsequent setxattr fop call it has failed with ENOENT because no mds has found on inode ctx.To resolve it save mds on inode ctx has been serialize with lookup unwind. BUG: 1498966 Change-Id: I8d4bb40a6cbf0cec35d181ec0095cc7142b02e29 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* cluster/dht: make rebalance use truncate incaseSusant Palai2017-11-221-2/+1
| | | | | | | | | .. the brick file system does not support fallocate. Change-Id: Id76cda2d8bb3b223b779e5e7a34f17c8bfa6283c BUG: 1488103 Signed-off-by: Susant Palai <spalai@redhat.com>
* core: make gf_boolean_t a C99 bool instead of an enumJeff Darcy2017-11-031-0/+1
| | | | | | | | | | | | This reduces the space used from four bytes to one, and allows new code to use familiar C99 types/values interoperably with our old cruft. It does *not* change current declarations or code; that will be left for a separate - much larger - patch. Updates: #80 Change-Id: I5baedd17d3fb05b38f0d8b8bb9dd62824475842e Signed-off-by: Jeff Darcy <jdarcy@fb.com>
* cluster/dht: Don't store the entire uuid for subvolsN Balachandran2017-10-101-6/+12
| | | | | | | | | | | | Comparing the uuid string of the local node against that stored in the local_subvol information is inefficient, especially as it is done for every file to be migrated. The code has now been changed to set the value of info to 1 if the nodeuuid is that of the node making the comparison so this becomes an integer comparison. Change-Id: I7491d59caad3b71dbf5facc94dcde0cd53962775 BUG: 1451434 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/dht : User xattrs are not healed after brick stop/startMohit Agrawal2017-10-041-11/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: In a distributed volume custom extended attribute value for a directory does not display correct value after stop/start or added newly brick. If any extended(acl) attribute value is set for a directory after stop/added the brick the attribute(user|acl|quota) value is not updated on brick after start the brick. Solution: First store hashed subvol or subvol(has internal xattr) on inode ctx and consider it as a MDS subvol.At the time of update custom xattr (user,quota,acl, selinux) on directory first check the mds from inode ctx, if mds is not present on inode ctx then throw EINVAL error to application otherwise set xattr on MDS subvol with internal xattr value of -1 and then try to update the attribute on other non MDS volumes also.If mds subvol is down in that case throw an error "Transport endpoint is not connected". In dht_dir_lookup_cbk| dht_revalidate_cbk|dht_discover_complete call dht_call_dir_xattr_heal to heal custom extended attribute. In case of gnfs server if hashed subvol has not found based on loc then wind a call on all subvol to update xattr. Fix: 1) Save MDS subvol on inode ctx 2) Check if mds subvol is present on inode ctx 3) If mds subvol is down then call unwind with error ENOTCONN and if it is up then set new xattr "GF_DHT_XATTR_MDS" to -1 and wind a call on other subvol. 4) If setxattr fop is successful on non-mds subvol then increment the value of internal xattr to +1 5) At the time of directory_lookup check the value of new xattr GF_DHT_XATTR_MDS 6) If value is not 0 in dht_lookup_dir_cbk(other cbk) functions then call heal function to heal user xattr 7) syncop_setxattr on hashed_subvol to reset the value of xattr to 0 if heal is successful on all subvol. Test : To reproduce the issue followed below steps 1) Create a distributed volume and create mount point 2) Create some directory from mount point mkdir tmp{1..5} 3) Kill any one brick from the volume 4) Set extended attribute from mount point on directory setfattr -n user.foo -v "abc" ./tmp{1..5} It will throw error " Transport End point is not connected " for those hashed subvol is down 5) Start volume with force option to start brick process 6) Execute getfattr command on mount point for directory 7) Check extended attribute on brick getfattr -n user.foo <volume-location>/tmp{1..5} It shows correct value for directories for those xattr fop were executed successfully. Note: The patch will resolve xattr healing problem only for fuse mount not for nfs mount. BUG: 1371806 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> Change-Id: I4eb137eace24a8cb796712b742f1d177a65343d5
* cluster/dht: EBADF handling for fremovexattr and fsetxattrN Balachandran2017-08-091-0/+6
| | | | | | | | | | | | | Add EBADF handling for dht_fremovexattr and dht_fsetxattr. Change-Id: Ide0d5812dae79655d2565157e5baabcd753b4309 BUG: 1476665 Signed-off-by: N Balachandran <nbalacha@redhat.com> Reviewed-on: https://review.gluster.org/17999 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: Check for open fd only on EBADFN Balachandran2017-08-081-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | DHT fd based fops used to check if the fd was open on the cached subvol before winding the call. However, this introduced a performance regression of about 30% for reads. This check was introduced to handle cases where files were migrated while IOs were happening. As this is not the common case, dht will now check if the fd is open on the cached subvol only if the call fails with EBADF. This will prevent a performance hit where a rebalance is not running. Change-Id: I2035a858d63c3fcd22bb634055bbb0ad01686808 BUG: 1476665 Signed-off-by: N Balachandran <nbalacha@redhat.com> Reviewed-on: https://review.gluster.org/17976 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Susant Palai <spalai@redhat.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/rebalance: Fix hardlink migration failuresSusant Palai2017-07-131-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A brief about how hardlink migration works: - Different hardlinks (to the same file) may hash to different bricks, but their cached subvol will be same. Rebalance picks up the first hardlink, calculates it's hash(call it TARGET) and set the hashed subvolume as an xattr on the data file. - Now all the hardlinks those come after this will fetch that xattr and will create linkto files on TARGET (all linkto files for the hardlinks will be hardlink to each other on TARGET). - When number of hardlinks on source is equal to the number of hardlinks on TARGET, the data migration will happen. RACE:1 Since rebalance is multi-threaded, the first lookup (which decides where the TARGET subvol should be), can be called by two hardlink migration parallely and they may end up creating linkto files on two different TARGET subvols. Hence, hardlinks won't be migrated. Fix: Rely on the xattr response of lookup inside gf_defrag_handle_hardlink since it is executed under synclock. RACE:2 The linkto files on TARGET can be created by other clients also if they are doing lookup on the hardlinks. Consider a scenario where you have 100 hardlinks. When rebalance is migrating 99th hardlink, as a result of continuous lookups from other client, linkcount on TARGET is equal to source linkcount. Rebalance will migrate data on the 99th hardlink itself. On 100th hardlink migration, hardlink will have TARGET as cached subvolume. If it's hash is also the same, then a migration will be triggered from TARGET to TARGET leading to data loss. Fix: Make sure before the final data migration, source is not same as destination. RACE:3 Since a hardlink can be migrating to a non-hashed subvolume, a lookup from other client or even the rebalance it self, might delete the linkto file on TARGET leading to hardlinks never getting migrated. This will be addressed in a different patch in future. Change-Id: If0f6852f0e662384ee3875a2ac9d19ac4a6cea98 BUG: 1469964 Signed-off-by: Susant Palai <spalai@redhat.com> Reviewed-on: https://review.gluster.org/17755 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: N Balachandran <nbalacha@redhat.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: Use size to calculate estimatesN Balachandran2017-07-101-0/+5
| | | | | | | | | | | | | | | | | | | The earlier approach of using the number of files to determine when the rebalance would complete did not work well when file sizes differed widely. The new approach now gets the total data size and uses that information to determine how long the rebalance is expected to take. Change-Id: I84e80a0893efab72ff06130e4596fa71c9c8c868 BUG: 1467209 Signed-off-by: N Balachandran <nbalacha@redhat.com> Reviewed-on: https://review.gluster.org/17668 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: MOHIT AGRAWAL <moagrawa@redhat.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: Check if fd is opened on dst subvolN Balachandran2017-06-281-0/+67
| | | | | | | | | | | | | | | | | | | | If an fd is opened on a file, the file is migrated and the cached subvol is updated in the inode_ctx before an fd based fop is sent, the fop is sent to the dst subvol on which the fd is not opened. This causes the FOP to fail with EBADF. Now, every fd based fop will check to see that the fd has been opened on the dst subvol before winding it down. Change-Id: Id92ef5eb7a5b5226688e2d2868b15e383f5f240e BUG: 1465075 Signed-off-by: N Balachandran <nbalacha@redhat.com> Reviewed-on: https://review.gluster.org/17630 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Reviewed-by: Susant Palai <spalai@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>