summaryrefslogtreecommitdiffstats
path: root/tests/basic/afr
Commit message (Collapse)AuthorAgeFilesLines
* cluster/thin-arbiter: Consider thin-arbiter before marking new entry changelogAshish Pandey2019-02-181-6/+16
| | | | | | | | | | | | | | | | | | | If a fop to create an entry fails on one of the data brick, we mark the pending changelog on the entry on brick for which it was successful. This is done as part of post op phase to make sure that entry gets healed even if it gets renamed to some other path where its parent was not marked as bad. As it happens as part of post op, we should consider thin-arbiter to check if the brick, which was successful, is the good brick or not. This will avoide split brain and other issues. >Change-Id: I12686675be98f02f70a5186b3ed748c541514d53 >Signed-off-by: Ashish Pandey <aspandey@redhat.com> Change-Id: I12686675be98f02f70a5186b3ed748c541514d53 updates: bz#1672314 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* afr: thin-arbiter 2 domain locking and in-memory stateRavishankar N2018-12-121-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 2 domain locking + xattrop for write-txn failures: -------------------------------------------------- - A post-op wound on TA takes AFR_TA_DOM_NOTIFY range lock and AFR_TA_DOM_MODIFY full lock, does xattrop on TA and releases AFR_TA_DOM_MODIFY lock and stores in-memory which brick is bad. - All further write txn failures are handled based on this in-memory value without querying the TA. - When shd heals the files, it does so by requesting full lock on AFR_TA_DOM_NOTIFY domain. Client uses this as a cue (via upcall), releases AFR_TA_DOM_NOTIFY range lock and invalidates its in-memory notion of which brick is bad. The next write txn failure is wound on TA to again update the in-memory state. - Any incomplete write txns before the AFR_TA_DOM_NOTIFY upcall release request is got is completed before the lock is released. - Any write txns got after the release request are maintained in a ta_waitq. - After the release is complete, the ta_waitq elements are spliced to a separate queue which is then processed one by one. - For fops that come in parallel when the in-memory bad brick is still unknown, only one is wound to TA on wire. The other ones are maintained in a ta_onwireq which is then processed after we get the response from TA. Change-Id: I32c7b61a61776663601ab0040e2f0767eca1fd64 updates: bz#1648205 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/afr: Use 2 domain locking in SHD for thin-arbiterkarthik-us2018-11-291-0/+49
| | | | | | | | | | | | | | | | | | | | | | | | With this change when SHD starts the index crawl it requests all the clients to release the AFR_TA_DOM_NOTIFY lock so that clients will know the in memory state is no more valid and any new operations needs to query the thin-arbiter if required. When SHD completes healing all the files without any failure, it will again take the AFR_TA_DOM_NOTIFY lock and gets the xattrs on TA to see whether there are any new failures happened by that time. If there are new failures marked on TA, SHD will start the crawl immediately to heal those failures as well. If there are no new failures, then SHD will take the AFR_TA_DOM_MODIFY lock and unsets the xattrs on TA, so that both the data bricks will be considered as good there after. >Change-Id: I037b89a0823648f314580ba0716d877bd5ddb1f1 >fixes: bz#1579788 >Signed-off-by: karthik-us <ksubrahm@redhat.com> (cherry picked from commit 5784a00f997212d34bd52b2303e20c097240d91c) Change-Id: I037b89a0823648f314580ba0716d877bd5ddb1f1 fixes: bz#1648205
* afr: thin-arbiter read txn changesRavishankar N2018-09-051-0/+60
| | | | | | | | | | | | | | | | If both data bricks are up, read subvol will be based on read_subvols. If only one data brick is up: - First qeury the data-brick that is up. If it blames the other brick, allow the reads. - If if doesn't, query the TA to obtain the source of truth. TODO: See if in-memory state can be maintained for read txns (BZ 1624358). updates: bz#1579788 Change-Id: I61eec35592af3a1aaf9f90846d9a358b2e4b2fcc Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Delegate name-heal when possiblePranith Kumar K2018-09-041-0/+112
| | | | | | | | | | | | | | | | | | | | | | | | | Problem: When name-self-heal is triggered on the mount, it blocks lookup until name-self-heal completes. But that can lead to hangs when lot of clients are accessing a directory which needs name heal and all of them trigger heals waiting for other clients to complete heal. Fix: When a name-heal is needed but quorum number of names have the file and pending xattrs exist on the parent, then better to delegate the heal to SHD which will be completed as part of entry-heal of the parent directory. We could also do the same for quorum-number of names not present but we don't have any known use-case where this is a frequent occurrence so not changing that part at the moment. When there is a gfid mismatch or missing gfid it is important to complete the heal so that next rename doesn't assume everything is fine and perform a rename etc fixes bz#1622821 Change-Id: I8b002c85dffc6eb6f2833e742684a233daefeb2c Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Delegate metadata heal with pending xattrs to SHDPranith Kumar K2018-08-281-10/+18
| | | | | | | | | | | | | | | | | | | Problem: When metadata-self-heal is triggered on the mount, it blocks lookup until metadata-self-heal completes. But that can lead to hangs when lot of clients are accessing a directory which needs metadata heal and all of them trigger heals waiting for other clients to complete heal. Fix: Only when the heal is needed but the pending xattrs are not set, trigger metadata heal that could block lookup. This is the only case where different clients may give different metadata to the clients without heals, which should be avoided. Updates bz#1622821 Change-Id: I6089e9fda0770a83fb287941b229c882711f4e66 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* performance/readdir-ahead: keep stats of cached dentries in sync with ↵Krutika Dhananjay2018-08-181-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | modifications PROBLEM: Stats of dentries that are readdirp'd ahead can become stale due to fops like writes, truncate etc that modify the file pointed by dentries. When a readdir is finally wound at offset corresponding to these entries, the iatts that are returned to the application come from readdir-ahead's cache, which are stale by now. This problem gets further aggravated when caching translators/modules cache and continue to serve this stale information. FIX: * Store the iatt in context of the inode pointed by dentry. * Whenever the inode pointed by dentry undergoes modification, in cbk of modification fop, update the iatt stored in inode-ctx to reflect the modification. * When serving a readdirp response from application, update iatts of dentries with the iatts stored in the context of inodes pointed by these dentries. * Some fops don't have valid iatts in their responses. For eg., write response whose data is still cached in write-behind will have zeroed out stat. In this case keep only ia_type and ia_gfid and reset rest of the iatt members to zero. - fuse-bridge in this case just sends "entry" information back to kernel and attr is not sent. - gfapi sets entry->inode to NULL and zeroes out the entire stat * There is one tiny race between the entry creation and a readdirp on its parent dir, which could cause the inode-ctx setting and inode ctx reading to happen on two different inode objects. To prevent this, when entry->inode doesn't eqaul to linked_inode, - fuse-bridge is made to send only "entry" information without attributes - gfapi sets entry->inode to NULL and zeroes out the entire stat. Change-Id: Ia27ff49a61922e88c73a1547ad8aacc9968a69df BUG: 1390050 Updates: bz#1390050 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
* tests: Add thin-arbiter.rc for writing tests for thin-arbiterPranith Kumar K2018-08-181-0/+44
| | | | | | fixes bz#1615789 Change-Id: I1f42e78fec5ddaf2a425dc4b82c9a20472aa146d Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* tests: Fix for gfid-mismatch-resolution-with-fav-child-policy.t failurekarthik-us2018-08-141-0/+1
| | | | | | | | | | | | | | | | | | | This test was retried once on build https://build.gluster.org/job/regression-on-demand-multiplex/174/ (logs for the first try is not available with this build) Test case was failing in line #47 where it was was checking for the heal count to be 0. Line #51 had passed that means file got the gfid split brain resolved, and both the bricks had same gfids. At line #54 it again failed which checks for the md5sum on both the bricks. At this point the md5sum of the brick where the file got impunged had the md5sum same as the newly created empty file. This means the data heal has not happened for the file. At line #64 enabling granular-entry-heal faild, but without the logs it is not possible to debug this issue. Change-Id: I56d854dbb9e188cafedfd24a9d463603ae79bd06 fixes: bz#1615331 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* tests: fix replace-brick-self-heal.t failureRavishankar N2018-08-131-1/+1
| | | | | | | | Please see BZ for details. Change-Id: Id9273432874bc6a452ac96b2b8c7a61ea6c5b98d Fixes: bz#1615239 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* tests: potential fixes for tests/basic/afr/add-brick-self-heal.tRavishankar N2018-08-131-0/+7
| | | | | | | | Please see bug description for details. Change-Id: Ieb6bce6d1d5c4c31f1878dd1a1c3d007d8ff81d5 fixes: bz#1614654 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* tests: Set heal-timeout to 5 secondsPranith Kumar K2018-08-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Shd keeps doing heals in a loop until it heals at least one entry in the previous run. A heal is termed successful only if it heals both metadata and entry/data heal i.e. the entry needs to be completely healed by just that healer. In tests/basic/afr/granular-esh/replace-brick.t test, brick-0 is old and brick-1 is new. After replace-brick only root-gfid will be present in brick-0's index 1) shd-thread corresponding to brick-0 does metadata heal, this creates root-gfid in brick-0's 'dirty' index. 2) Both healer threads corresponding to brick-0 and brick-1 now try to heal root-gfid and brick-1 gets the heal-domain lock. brick-0's shd-thread will experience a failure and it goes back to waiting for 10 minutes (cluster.heal-timeout). 3) When brick-1's healer-thread completes healing root-gfid it creates 5 files which create indices in brick-0, so until brick-0 doesn't trigger one more heal, heal won't happen. $HEAL_TIMEOUT is set at 120 seconds, which is lesser than cluster.heal-timeout, so decreasing this to 5 seconds so that the next heal is triggered which will do the heals. fixes bz#1613807 Change-Id: I881133fc28880d8615fbc4558a0dfa0dc63d7798 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* Revert "performance/readdir-ahead: Invalidate cached dentries if they're ↵Raghavendra G2018-08-031-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | modified while in cache" This reverts commit 7131de81f72dda0ef685ed60d0887c6e14289b8c. With the latest master, I created a single brick volume and some files inside it. [root@rhgs313-6 ~]# umount -f /mnt/fuse1; mount -t glusterfs -s 192.168.122.6:/thunder /mnt/fuse1; ls -l /mnt/fuse1/; echo "Trying again"; ls -l /mnt/fuse1 umount: /mnt/fuse1: not mounted total 0 ----------. 0 root root 0 Jan 1 1970 file-1 ----------. 0 root root 0 Jan 1 1970 file-2 ----------. 0 root root 0 Jan 1 1970 file-3 ----------. 0 root root 0 Jan 1 1970 file-4 ----------. 0 root root 0 Jan 1 1970 file-5 d---------. 0 root root 0 Jan 1 1970 subdir Trying again total 3 -rw-r--r--. 1 root root 33 Aug 3 14:06 file-1 -rw-r--r--. 1 root root 33 Aug 3 14:06 file-2 -rw-r--r--. 1 root root 33 Aug 3 14:06 file-3 -rw-r--r--. 1 root root 33 Aug 3 14:06 file-4 -rw-r--r--. 1 root root 33 Aug 3 14:06 file-5 d---------. 0 root root 0 Jan 1 1970 subdir [root@rhgs313-6 ~]# Conversation can be followed on gluster-devel on thread with subj: tests/bugs/distribute/bug-1122443.t - spurious failure. git-bisected pointed this patch as culprit. Change-Id: I1eb46f6c196f44fde8ce991840a0e724e6f50862 Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Updates: bz#1390050
* performance/md-cache: update cache only from fops issued after previous ↵Raghavendra G2018-08-021-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | invalidation Invalidations are triggered mainly by two codepaths - upcall and write-behind unwinding a cached write with zeroed out stat. For the case of upcall, following race can happen: * stat s1 is fetched from brick * invalidation is detected on brick * invalidation is propagated to md-cache and cache is invalidated * s1 updates md-cache with a stale state For the case of write-behind, imagine following sequence of operations, * A stat s1 was issued from application thread t1 when size of file was s1 * stat s1 completes on brick stack, but yet to reach md-cache * A write w1 from application thread t2 extends file to size s2 is cached in write-behind and response is unwound with zeroed out stat * md-cache while handling write-cbk, invalidates cache * md-cache receives response for s1, updates cache with stale stat with size of s1 overwriting invalidation state Fix is to remember when s1 was incident on md-cache and update cache with results of s1 only if the it was incident after invalidation of cache. This patch identified some bugs in regression tests which is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1608158. As a stop gap measure I am marking following tests as bad basic/afr/split-brain-resolution.t bugs/bug-1368312.t bugs/replicate/bug-1238398-split-brain-resolution.t bugs/replicate/bug-1417522-block-split-brain-resolution.t bugs/replicate/bug-1438255-do-not-mark-self-accusing-xattrs.t Change-Id: Ia4bb9dd36494944e2d91e9e71a79b5a3974a8c77 Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Updates: bz#1512691
* build: remove bundled arg-standaloneNiels de Vos2018-07-282-4/+0
| | | | | | | | | | | | | | libargp or argp-standalone is available on all commonly used distributions. There is no need to bundle an unmaintained version of argp-standalone in this repository anymore. FreeBSD places the argp.h file in /usr/local/include when argp-standalone is installed. This path is not added to CPPFLAGS by default, so thats done in configure.ac as well. Change-Id: I384a53ab0a008ec9d48fd83afeaf8fbc197e91ee Fixes: bz#1609337 Signed-off-by: Niels de Vos <ndevos@redhat.com>
* performance/readdir-ahead: Invalidate cached dentries if they're modified ↵Krutika Dhananjay2018-07-281-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | while in cache PROBLEM: Entries that are readdirp'd ahead can undergo modification in terms of writes, truncates which could modify their iatts. When a readdir is finally wound at offset corresponding to these entries, the iatts that are returned to the application come from readdir-ahead's cache, which are stale by now. This problem gets further aggravated when caching translators/modules cache and continue to serve this stale information. FIX: Whenever a dentry undergoes modification, in the cbk of the modification fop, a "dirty" flag (default 0) is set in its inode ctx. When it's time for readdir-ahead to serve these entries, it will read the inode ctx and check if the entry is "dirty", and if it is, set the entry's attrs to all zeroes, as an indicator to fuse, md-cache etc not to cache these attributes. Also there is one tiny race between the entry creation and a readdirp on its parent dir, which could cause the inode-ctx setting and inode ctx reading to happen on two different inode objects. To prevent this, fuse-bridge is made to drop entries for which dentry->inode is not the same as linked inode, in readdirp cbk. Change-Id: If7396507632b5268442ca580473d5155fee9cbef BUG: 1390050 Updates: bz#1390050 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
* afr,ec: Print if the subvolume is up in statedumpPranith Kumar K2018-07-031-0/+28
| | | | | | fixes bz#1597156 Change-Id: I323eb9190e40b12df216698dcdba74a6d336beeb Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cli: change volume create syntax of arbiter volumeAmar Tumballi2018-07-031-4/+9
| | | | | | | fixes: bz#1596524 updates: gluster/glusterd2#515 Change-Id: I8a46fa2fd1fd2b0e9fbcecd3bb18d348aed9c6a9 Signed-off-by: Amar Tumballi <amarts@redhat.com>
* tests: remove tarissue.t from BAD_TESTRavishankar N2018-06-281-3/+0
| | | | | | | | | | BZ 1337791 marked this .t as bad based on the tar version being a likely suspect. Undoing this to check as it passes on the latest jenkins slaves. Change-Id: Ia581064a9c620351d3fe7aeef95d2644337952e1 fixes: bz#1595492 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reported-by: Yaniv Kaul <ykaul@redhat.com>
* afr: add new value for read-hash-mode volume optionRavishankar N2018-03-291-0/+56
| | | | | | | | | | Updates: #363 This new value (3) will try to wind read requests to the child of AFR having the least amount of pending requests in its queue. Change-Id: If6bda2aac9bf7aec3fc39622f78659313c4b6508 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Switch to active-fd-count for open-fd checksPranith Kumar K2018-03-211-0/+20
| | | | | | BUG: 1557932 Change-Id: I3783e41b3812267bc10c0d05d062a31396ce135b Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/afr: Remove compound-fops usage in afrPranith Kumar K2018-03-061-37/+0
| | | | | | | | | We are not seeing much improvement with this change. So removing the feature so that it doesn't need to be maintained anymore. Fixes: #414 Change-Id: Ic7969b151544daf2547bd262a9fa03f575626411 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* tests: bring option of per test timeoutAmar Tumballi2018-02-151-0/+2
| | | | | | | | | | | | | | This uses 'timeout' command with 300 seconds default. Right now, there is just 1 test which takes more than that in a properly setup machine. Ideally best case is set the default to something like 30 seconds, and if a test is supposed to take more than that, owner should add a timeout line to test knowingly. That way, it makes test writers think about a time limit too. Change-Id: I747005ce1f208aeb2ecbf899e8feea487ecd21a0 Signed-off-by: Amar Tumballi <amarts@redhat.com>
* afr: don't treat all cases all bricks being blamed as split-brainRavishankar N2018-02-011-0/+16
| | | | | | | | | | | | | | | | | | | | | | | Problem: We currently don't have a roll-back/undoing of post-ops if quorum is not met. Though the FOP is still unwound with failure, the xattrs remain on the disk. Due to these partial post-ops and partial heals (healing only when 2 bricks are up), we can end up in split-brain purely from the afr xattrs point of view i.e each brick is blamed by atleast one of the others. These scenarios are hit when there is frequent connect/disconnect of the client/shd to the bricks while I/O or heal are in progress. Fix: Instead of undoing the post-op, pick a source based on the xattr values. If 2 bricks blame one, the blamed one must be treated as sink. If there is no majority, all are sources. Once we pick a source, self-heal will then do the heal instead of erroring out due to split-brain. Change-Id: I3d0224b883eb0945785ade0e9697a1c828aec0ae BUG: 1539358 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* tests: Renable basic/afr/split-brain-favorite-child-policy.tNigel Babu2017-11-291-6/+0
| | | | | | | | | This test was failing due to an infra issue. The infra issue is now fixed. BUG: 1517961 Change-Id: I09dfab9c0a3ebe73c738222e6269d9e35c85eddb Signed-off-by: Nigel Babu <nigelb@redhat.com>
* tests: mark currently failing regression tests as known issuesAmar Tumballi2017-11-281-0/+6
| | | | | | Change-Id: If6c36dc6c395730dfb17b5b4df6f24629d904926 BUG: 1517961 Signed-off-by: Amar Tumballi <amarts@redhat.com>
* afr: add checks for allowing lookupsRavishankar N2017-11-181-23/+0
| | | | | | | | | | | | | | | | | | | | | | Problem: In an arbiter volume, lookup was being served from one of the sink bricks (source brick was down). shard uses the iatt values from lookup cbk to calculate the size and block count, which in this case were incorrect values. shard_local_t->last_block was thus initialised to -1, resulting in an infinite while loop in shard_common_resolve_shards(). Fix: Use client quorum logic to allow or fail the lookups from afr if there are no readable subvolumes. So in replica-3 or arbiter vols, if there is no good copy or if quorum is not met, fail lookup with ENOTCONN. With this fix, we are also removing support for quorum-reads xlator option. So if quorum is not met, neither read nor write txns are allowed and we fail the fop with ENOTCONN. Change-Id: Ic65c00c24f77ece007328b421494eee62a505fa0 BUG: 1467250 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* tests: Update tier CLI in .t filesN Balachandran2017-10-301-1/+1
| | | | | | | | Update .t tier tests to use the new tier CLI. Change-Id: I0e7f1769071108d8266fc86378c4466bcaf96e7d BUG: 1505253 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/afr: Fail open on split-brainPranith Kumar K2017-10-261-0/+38
| | | | | | | | | | | | | | | | | Problem: Append on a file with split-brain succeeds. Open is intercepted by open-behind, when write comes on the file, open-behind does open+write. Open succeeds because afr doesn't fail it. Then write succeeds because write-behind intercepts it. Flush is also intercepted by write-behind, so the application never gets to know that the write failed. Fix: Fail open on split-brain, so that when open-behind does open+write open fails which leads to write failure. Application will know about this failure. Change-Id: I4bff1c747c97bb2925d6987f4ced5f1ce75dbc15 BUG: 1294051 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* Disable failing NetBSD testsNigel Babu2017-10-131-0/+1
| | | | | BUG: 1501390 Change-Id: I9a04c094783ec33e617baeae3d0e0cbedb1d6c3b
* feature/posix: Enabled gfid2path by defaultKotresh HR2017-08-291-1/+1
| | | | | | | | | | | | | | | | Enable gfid2path feature by default. The basic performance tests are carried out and it doesn't show significant depreciation. The results are updated in issue. Updates: #139 Change-Id: I5f1949a608d0827018ef9d548d5d69f3bb7744fd Signed-off-by: Kotresh HR <khiremat@redhat.com> Reviewed-on: https://review.gluster.org/17950 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Aravinda VK <avishwan@redhat.com> Reviewed-by: Amar Tumballi <amarts@redhat.com>
* glusterd: do not create .glusterfs/indicesPrashanth Pai2017-08-231-2/+1
| | | | | | | | | | | | | | | | | | | | | glusterd shouldn't concern itself with creating directories specific to certain xlators. The index xlator will now proceed creating './glusterfs/indices' dir only if the parent '.glusterfs' directory exists, which still fixes the original problem reported i.e 'volume start force' command shouldn't create brick path if it doesn't exist (BUG 1457202) This reverts most of the changes done by the commit b58a15948fb3fc37b6c0b70171482f50ed957f42 Change-Id: I7fc52ad64dce220e336c218fb4d85933ca2e61c0 Signed-off-by: Prashanth Pai <ppai@redhat.com> Reviewed-on: https://review.gluster.org/18003 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Jeff Darcy <jeff@pl.atyp.us> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: GFID split-brain resolution with existing CLIkarthik-us2017-07-182-3/+168
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Currently there is no way for the admin from CLI to resolve gfid split-brain based on some policy like choice of the brick, mtime or size. Fix: With the existing CLI options based on size, mtime, and choice of brick, we do lookup on the parent for the specified file. As part of the lookup, if we find gfid mismatch, we resolve them based on the policy and return. If the file is not in gfid split- brain, then we check for the data and metadata split-brain in the getxattr code path, and resolve if any. This will work provided absolute path to the file with the CLI and not with gfid of the file. Hence the source-brick policy without any file path will also not resolve the gfid split-brain since it uses the gfid of the files. But it can resolve any other type of split-brains and skip the gfid mismatch resolution with the usual error message. Reverting the change https://review.gluster.org/17290. This patch resolves the issue. Fixes gluster/glusterfs#135 Change-Id: Iaeba6fc32f184a34255d03be87cda02773130a09 BUG: 1459530 Signed-off-by: karthik-us <ksubrahm@redhat.com> Reviewed-on: https://review.gluster.org/17485 Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: Implement quorum for lk fopPranith Kumar K2017-06-191-0/+255
| | | | | | | | | | | | | | | | | | | Problem: At the moment when we have replica 3 or arbiter setup, even when lk succeeds on just one brick we give success to application which is wrong Fix: Consider quorum-number of successes as success when quorum is enabled. BUG: 1461792 Change-Id: I5789e6eb5defb68f8a0eb9cd594d316f5cdebaea Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: https://review.gluster.org/17524 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com>
* index: Do not proceed with init if brick is not mountedRavishankar N2017-06-191-0/+5
| | | | | | | | | | | | | | | | | | | | | ..or else when a volume start force is given, we end up creating /brick-path/.glusterfs/indices folder and various subdirs under it and eventually starting the brick process. As a part of this patch, glusterd_get_index_basepath() is added in glusterd, who will then use it to create the basepath during volume-create, add-brick, replace-brick and reset-brick. It also uses this function to set the 'index-base' xlator option for the index translator. Change-Id: Id018cf3cb6f1e2e35b5c4cf438d1e939025cb0fc BUG: 1457202 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: https://review.gluster.org/17426 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Atin Mukherjee <amukherj@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: gfid-mismatch-resolution-with-fav-child-policy.t to bad testsRavishankar N2017-05-151-0/+3
| | | | | | | | | | | | | | | | | | | | | | | gfid-mismatch-resolution-with-fav-child-policy.t does a `TEST ls $M0/f3` (line #170) to trigger healing of a file in gfid split-brain in a rep-3 volume. But the code to trigger name heal of gfid split-brain file is not yet there. The test is passing due a lookup/ stat on $M0 which triggers a background entry self heal (which has the code to heal gfid split-brain files) which may or may not complete the heal before line 170. If it doesn't, lookup on f3 is failing with EIO. Add the .t to bad tests until Karthik's patch for CLI based gfid split-brain resolution fixes name heal also. Change-Id: Iba6e9d81db386bc406aff1ecb6a18851f09bf7c0 BUG: 1450730 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: https://review.gluster.org/17290 Reviewed-by: Atin Mukherjee <amukherj@redhat.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: GFID split brain resolution with favorite-child-policykarthik-us2017-04-201-0/+228
| | | | | | | | | | | | | | | | | | | | | | | | | Problem: Currently the automatic split brain resolution with favorite child policy is not resolving the GFID split brains. Fix: When there is a GFID split brain and the favorite child policy is set to size/mtime/ctime/majority, based on the policy decide on the source and sinks. Delete the entry from the sinks and recreate it from the source. Mark the appropriate pending attributes and resolve the GFID split brain. When the heal takes place it will complete the pending heals and reset the attributes. Change-Id: Ie30e5373f94ca6f276745d9c3ad662b8acca6946 BUG: 1430719 Signed-off-by: karthik-us <ksubrahm@redhat.com> Reviewed-on: https://review.gluster.org/16878 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* glusterd: disallow increasing replica count for arbiter volumesRavishankar N2017-03-061-0/+6
| | | | | | | | | | | | | | | | | | | | | Problem: add-brick command to increase replica count in an arbiter volume succeeds, causing undesirable effects like the 4th brick being loaded with the arbiter xlator, the 3rd one losing the arbiter xlator (when the brick process is restarted), arbitration logic in afr going for a toss etc. Fix: Arbiter configuration should always be a replica 3 volume (of which 3rd brick is arbiter). Hence disallow increasing replica count for arbiter volume configurations. Change-Id: I9fe4edac880d0f711e6d44324ad5562974e53e51 BUG: 1429200 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: https://review.gluster.org/16845 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Atin Mukherjee <amukherj@redhat.com>
* tests: Fix and enable split-brain-healing mtime checkNiklas Hambüchen2017-03-011-6/+36
| | | | | | | | | | | | | | | | | | | | | This test was commented out with the belief that it depended on utimensat() support, but in fact it was not necessary because `stat -c %Y` only outputs second resolution. Simply commenting in the test made it fail because it checked the values *before* the heal, while intended was to check them *after* the heal. This commit fixes that. Change-Id: I4194ac645b365a1f906a3ac9bcbbdb1f05000e27 BUG: 1422074 Signed-off-by: Niklas Hambüchen <mail@nh2.me> Reviewed-on: https://review.gluster.org/16789 Reviewed-by: Ravishankar N <ravishankar@redhat.com> Tested-by: Ravishankar N <ravishankar@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Niklas Hambüchen
* core: run many bricks within one glusterfsd processJeff Darcy2017-01-3021-27/+31
| | | | | | | | | | | | | | | | | | | | | | | This patch adds support for multiple brick translator stacks running in a single brick server process. This reduces our per-brick memory usage by approximately 3x, and our appetite for TCP ports even more. It also creates potential to avoid process/thread thrashing, and to improve QoS by scheduling more carefully across the bricks, but realizing that potential will require further work. Multiplexing is controlled by the "cluster.brick-multiplex" global option. By default it's off, and bricks are started in separate processes as before. If multiplexing is enabled, then *compatible* bricks (mostly those with the same transport options) will be started in the same process. Change-Id: I45059454e51d6f4cbb29a4953359c09a408695cb BUG: 1385758 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: https://review.gluster.org/14763 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* readdir-ahead : Perform STACK_UNWIND outside of mutex locksPoornima G2017-01-091-0/+0
| | | | | | | | | | | | | | | | | | Currently STACK_UNWIND is performnd within ctx->lock. If readdir-ahead is loaded as a child of dht, then there can be scenarios where the function calling STACK_UNWIND becomes re-entrant. Its a good practice to not call STACK_WIND/UNWIND within local mutex's Change-Id: If4e869849d99ce233014a8aad7c4d5eef8dc2e98 BUG: 1401812 Signed-off-by: Poornima G <pgurusid@redhat.com> Reviewed-on: http://review.gluster.org/16068 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
* glusterd: Fail add-brick on replica count change, if brick is downkarthik-us2017-01-061-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: 1. Have a replica 2 volume with bricks b1 and b2 2. Before setting the layout, b1 goes down 3. Set the layout write some data, which gets populated on b2 4. b2 goes down then b1 comes up 5. Add another brick b3, and heal will take place from b1 to b3, which basically have no data 6. Write some data. Both b1 and b3 will mark b2 for pending writes 7. b1 goes down, and b2 comes up 8. b2 gets heald from b1. During heal it removes the data which is already in b2, considering that as stale data. This leads to data loss. Solution: 1. In glusterd stage-op, while adding bricks, check whether the replica count is being increased 2. If yes, then check whether any of the bricks are down at that time 3. If yes, then fail the add-brick to avoid such data loss 4. Else continue the normal operation. This check will work enen when we convert plain distribute volume to replicate Test: 1. Create a replica 2 volume 2. Kill one brick from the volume 3. Try adding a brick to the volume 4. It should fail with all bricks are not up error 5. Cretae a distribute volume and kill one of the brick 6. Try to convert it to replicate volume, by adding bricks. 7. This should also fail. Change-Id: I9c8d2ab104263e4206814c94c19212ab914ed07c BUG: 1406411 Signed-off-by: karthik-us <ksubrahm@redhat.com> Reviewed-on: http://review.gluster.org/16330 Tested-by: Ravishankar N <ravishankar@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: N Balachandran <nbalacha@redhat.com> Reviewed-by: Atin Mukherjee <amukherj@redhat.com>
* tests: Fix split-brain-favorite-child-policy.t failuresRavishankar N2017-01-021-4/+14
| | | | | | | | | | | | | | | | | | | | | | | | | Problem: In CentOS-7, the file was receving an extra removexattr(security.ima) FOP which changed its ctime, breaking the assumption that a particular brick had the latest ctime based on the writevs done in the .t Fix: 1. Compare the ctime of both files in the backend and pick the one with the latest ctime for the fav-child policy. Also unmount the volume before comparing, to avoid any further FOPS on the file that can possibly modify the timestamps. 2. Added floating point handling in stat function. Thanks to Pranith for the helping debugging the regex. Change-Id: I06041a0f39a29d2593b867af8685d65c7cd99150 BUG: 1408757 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/16288 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* glfsheal: Explicitly enable self-heal xlator optionsRavishankar N2016-12-151-0/+3
| | | | | | | | | | | | | | | | Enable data, metadata and entry self-heal as xlator-options so that glfs-heal.c can heal split-brain files even if they are disabled on the volume via volume set commands. Change-Id: Ic191a1017131db1ded94d97c932079d7bfd79457 BUG: 1234054 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/11333 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: CLI for granular entry heal enablement/disablementKrutika Dhananjay2016-11-286-7/+149
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When there are already existing non-granular indices created that are yet to be healed, if granular-entry-heal option is toggled from 'off' to 'on', AFR self-heal whenever it kicks in, will try to look for granular indices in 'entry-changes'. Because of the absence of name indices, granular entry healing logic will fail to heal these directories, and worse yet unset pending extended attributes with the assumption that are no entries that need heal. To get around this, a new CLI is introduced which will invoke glfsheal program to figure whether at the time an attempt is made to enable granular entry heal, there are pending heals on the volume OR there are one or more bricks that are down. If either of them is true, the command will be failed with the appropriate error. New CLI: gluster volume heal <VOL> granular-entry-heal {enable,disable} Change-Id: I1f4fe8162813b9068e198965d94169fee4adc099 BUG: 1370410 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/15747 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Atin Mukherjee <amukherj@redhat.com>
* cluster/afr: Fix bugs in [f]inodelk/[f]entrylkPranith Kumar K2016-11-261-0/+87
| | | | | | | | | | | | | | | | | | | | | | Problems: 1) Inodelk is not taking quorum into account 2) finodelk, [f]entrylk are not implemented correctly 3) By default afr doesn't go for non-blocking parallel locks. Fix: Implemented a common framework which can be used by [f]inodelk/[f]entrylk. Used quorum for the same. Change-Id: I239f13875a065298630d266941df10cfa3addc85 BUG: 1369077 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/15802 Tested-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
* features/index: Delete granular entry indices of already healed directories ↵Krutika Dhananjay2016-11-241-0/+76
| | | | | | | | | | | | | | | | | | | | | | during crawl If granular name indices are already in existence for a volume, and before they are healed, granular entry heal be disabled, a crawl on indices/xattrop will clear the changelogs on these directories. When their corresponding entry-changes indices are crawled subsequently, if it is found that the directories don't need heal anymore, the granular indices are not cleaned up. This patch fixes that problem by ensuring that the zero-xattrop also deletes the stale indices at the level of index translator. Change-Id: Ifbaa6bec2a14e3041addfee4054131babbf4d35e BUG: 1370410 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/15880 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* afr,ec: Heal device files with correct major, minor numbersPranith Kumar K2016-10-261-10/+18
| | | | | | | | | | | | | | Thanks a lot to xiaoping.wu@nokia.com from Nokia for the bug and the fix. BUG: 1384297 Change-Id: Ie443237e85d34633b5dd30f85eaa2ac34e45754c Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/15728 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* compound fops: Fix file corruption issueKrutika Dhananjay2016-10-241-0/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. Address of a local variable @args is copied into state->req in server3_3_compound (). But even after the function has gone out of scope, in server_compound_resume () this pointer is accessed and dereferenced. This patch fixes that. 2. Compound fops, by virtue of NOT having a vector sizer (like the one writev has), ends up having both the header and the data (in case one of its member fops is WRITEV) in the same hdr_iobuf. This buffer was not being preserved through the lifetime of the compound fop, causing it to be overwritten by a parallel write fop, even when the writev associated with the currently executing compound fop is yet to hit the desk, thereby corrupting the file's data. This is fixed by associating the hdr_iobuf with the iobref so its memory remains valid through the lifetime of the fop. 3. Also fixed a use-after-free bug in protocol/client in compound fops cbk, missed by Linux but caught by NetBSD. Finally, big thanks to Pranith Kumar K and Raghavendra Gowdappa for their help in debugging this file corruption issue. Change-Id: I6d5c04f400ecb687c9403a17a12683a96c2bf122 BUG: 1378778 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/15654 NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* tests: change EXPECT_WITHIN timeoutsAnuradha Talur2016-08-281-8/+8
| | | | | | | | | | | | | | | | | Use defined HEAL and PROCESS_UP timeouts rather than hard code them in self-heald.t. Change-Id: I21586811904c8417b7208bb643f14dff20dc4832 BUG: 1370074 Signed-off-by: Anuradha Talur <atalur@redhat.com> Reviewed-on: http://review.gluster.org/15316 Reviewed-by: Ravishankar N <ravishankar@redhat.com> Tested-by: Ravishankar N <ravishankar@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>