path: root/xlators/cluster
Commit message (Author, Age, Files, Lines)
* cluster/afr: Prioritize ENOSPC over other errors (karthik-us, 2020-06-22, 2 files, -47/+5)

    Problem: In a replicate/arbiter volume, if file creations or writes
    fail on a quorum of bricks, with ENOSPC on one brick and a different
    reason on another, the fop may fail with an error other than ENOSPC
    in some cases.

    Fix: Prioritize ENOSPC over other, lower-priority errors, and do not
    set op_errno in posix_gfid_set if op_ret is 0, to avoid propagating
    an errno that could be misinterpreted by __afr_dir_write_finalize().
    Also remove the function afr_has_arbiter_fop_cbk_quorum(), which
    might consider a successful reply from a single brick as quorum
    success in some cases, whereas the fop must always succeed on a
    quorum of bricks in an arbiter configuration.

    Change-Id: I106e267f8b9451f681022f1cccb410d9bc824c08
    Fixes: #1254
    Signed-off-by: karthik-us <ksubrahm@redhat.com>
    (cherry picked from commit fa63b45ca5edf172b1b89b28b5db3c5129cc57b6)
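
    A minimal generic-C sketch of the errno-prioritization pattern
    described above (the helper names are illustrative, not the actual
    AFR functions):

        #include <errno.h>

        /* When replies from different bricks carry different errors,
         * keep the one with the highest priority so ENOSPC is never
         * masked by a lesser error. */
        static int
        errno_priority(int err)
        {
            switch (err) {
            case ENOSPC:
                return 3; /* the application must learn the fs is full */
            case EDQUOT:
                return 2;
            default:
                return 1;
            }
        }

        static int
        pick_final_errno(int old_errno, int new_errno)
        {
            if (old_errno == 0)
                return new_errno;
            return errno_priority(new_errno) > errno_priority(old_errno)
                       ? new_errno : old_errno;
        }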
* afr: more quorum checks in lookup and new entry marking (Ravishankar N, 2020-06-18, 3 files, -11/+25)

    Problem: See the github issue for details.

    Fix:
    - In lookup, if the entry exists in 2 out of 3 bricks, don't fail
      the lookup with ENOENT just because there is an entrylk on the
      parent. Consider quorum before deciding.
    - If an entry FOP does not succeed on a quorum of bricks, do not
      perform the new-entry marking.

    Fixes: #1303
    Change-Id: I56df8c89ad53b29fa450c7930a7b7ccec9f4a6c5
    Signed-off-by: Ravishankar N <ravishankar@redhat.com>
    (cherry picked from commit c4a6748f25d2c1ab3ebcf89952278ebf94c8d371)
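
    A minimal sketch of the quorum rule involved (generic C; the helper
    name is an assumption):

        #include <stdbool.h>

        /* An entry fop counts as successful only when it succeeded on
         * at least a quorum of replicas; for replica 3 that is 2. */
        static bool
        has_quorum(int success_count, int replica_count)
        {
            return success_count >= replica_count / 2 + 1;
        }

        /* has_quorum(2, 3) == true: the lookup must not return ENOENT,
         * and new-entry marking is done only when this holds. */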
* cluster/ec: Return correct error code and log message (Ashish Pandey, 2020-06-15, 1 file, -2/+9)

    If a readdir is sent on an FD whose opendir failed, that FD is
    useless and we return it with an error. For now, we return EINVAL
    without logging any message in the log file. Return a correct error
    code and also log a message to make debugging easier.

    fixes: #1220
    Change-Id: Iaf035254b9c5aa52fa43ace72d328be622b06169
    (cherry picked from commit af70cb5eedd80207cd184e69f2a4fb252b72d070)
* afr: event gen changes (Ravishankar N, 2020-05-09, 3 files, -78/+25)

    The general idea of the changes is to prevent resetting event
    generation to zero in the inode ctx, since event gen is something
    that should follow 'causal order'.

    Change #1: For a read txn, in the inode refresh cbk, if
    event_generation is found to be zero, we fail the read fop. This is
    not needed, because a change in event gen is only a marker for the
    next inode refresh to happen and should not be taken into account by
    the current read txn.

    Change #2: The event gen being zero above can happen if there is a
    racing lookup, which resets event gen (in afr_lookup_done) if there
    are non-zero afr xattrs. The resetting is done only to trigger an
    inode refresh and a possible client-side heal on the next lookup.
    That can be achieved by setting the need_refresh flag in the inode
    ctx. So all occurrences of resetting event gen to zero are replaced
    with a call to afr_inode_need_refresh_set().

    Change #3: In both the lookup and discover paths, we do an inode
    refresh which is not required, since all 3 essentially do the same
    thing: update the inode ctx with the good/bad copies from the brick
    replies. Inode refresh also triggers background heals, but it is
    okay to do that when refresh is called during the read and write
    txns and not in the lookup path. The .t tests which relied on inode
    refresh in the lookup path to trigger heals are changed to do a read
    txn so that inode refresh and the heal happen.

    Change-Id: Iebf39a9be6ffd7ffd6e4046c96b0fa78ade6c5ec
    Fixes: #1179
    Signed-off-by: Ravishankar N <ravishankar@redhat.com>
    Reported-by: Erik Jacobson <erik.jacobson at hpe.com>
    (cherry picked from commit f0fcd909ad4535b60c9208d4804ebe6afe421a09)
* afr: mark pending xattrs as a part of metadata heal (Ravishankar N, 2020-04-07, 1 file, -1/+61)

    ...if pending xattrs are zero for all children.

    Problem: If there are no pending xattrs and a metadata heal needs to
    be performed, we can end up with xattrs inadvertently deleted from
    all bricks, as explained in the BZ.

    Fix: After picking one among the sources as the good copy, mark
    pending xattrs on all sources to blame the sinks. Now even if this
    metadata heal fails midway, a subsequent heal will still choose one
    of the valid sources that it picked previously.

    Updates: #1067
    Change-Id: If1b050b70b0ad911e162c04db4d89b263e2b8d7b
    Signed-off-by: Ravishankar N <ravishankar@redhat.com>
    (cherry picked from commit 2d5ba449e9200b16184b1e7fc84cabd015f1f779)
* cluster/ec: Change handling of heal failure to avoid crash (Ashish Pandey, 2020-03-17, 2 files, -13/+13)

    Problem: ec_getxattr_heal_cbk was called with NULL as its second
    argument when the heal was failing. The function dereferenced its
    "cookie" argument, which caused a crash.

    Solution: The cookie is changed to carry the value that was supposed
    to be stored in fop->data, so even when fop is NULL in the error
    case, there is no NULL dereference.

    Thanks to Xavi for the suggestion about the fix.

    Change-Id: I0798000d5cadb17c3c2fbfa1baf77033ffc2bb8c
    updates: #1061
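
    A hedged sketch of the cookie pattern the fix uses (generic C; the
    types and names are illustrative, not the actual EC code):

        #include <errno.h>
        #include <stddef.h>
        #include <stdio.h>

        /* Instead of reading fop->data inside the callback (fop can be
         * NULL on failure), the caller passes the needed value through
         * the opaque cookie, which is valid on both the success and the
         * error path. */
        typedef void (*heal_cbk_t)(void *cookie, void *fop, int op_ret,
                                   int op_errno);

        static void
        getxattr_heal_cbk(void *cookie, void *fop, int op_ret, int op_errno)
        {
            (void)fop; /* safe even when fop == NULL */
            if (op_ret < 0)
                fprintf(stderr, "heal failed (errno=%d), ctx=%p\n",
                        op_errno, cookie);
        }

        static void
        launch_heal(heal_cbk_t cbk, void *ctx)
        {
            /* error path: no fop object exists, but ctx still arrives */
            cbk(ctx, NULL, -1, EIO);
        }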
* cluster/afr: fix race when bricks come up (Xavi Hernandez, 2020-03-16, 3 files, -6/+9)

    There was a problem when self-heal was sending lookups at the same
    time that one of the bricks was coming up. In this case there was a
    chance that the number of 'up' bricks changed in the middle of
    sending the requests to subvolumes, which caused a discrepancy
    between the expected number of replies and the actual number of sent
    requests.

    This discrepancy caused AFR to continue executing requests before
    all requests were complete. Eventually, the frame of the pending
    request was destroyed when the operation terminated, causing a
    use-after-free issue when the answer was finally received.

    In theory the same thing could happen in the reverse way, i.e. AFR
    tries to wait for more replies than sent requests, causing a hang.

    Backport of:
    > Change-Id: I7ed6108554ca379d532efb1a29b2de8085410b70
    > Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
    > Fixes: bz#1808875

    Change-Id: I7ed6108554ca379d532efb1a29b2de8085410b70
    Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
    Fixes: bz#1809438
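
    A minimal sketch, assuming a mutex-protected bitmask of up children,
    of the snapshot pattern that avoids this discrepancy (illustrative,
    not the actual AFR code):

        #include <pthread.h>
        #include <stdint.h>

        struct replica_state {
            pthread_mutex_t lock;
            uint32_t up_mask; /* bit i set => child i is up; may change */
        };

        /* Take one consistent snapshot of which children are up before
         * winding, so the reply count expected later always matches the
         * number of requests actually sent, even if a brick comes up in
         * the middle of the loop. */
        static uint32_t
        snapshot_up_mask(struct replica_state *st)
        {
            uint32_t mask;
            pthread_mutex_lock(&st->lock);
            mask = st->up_mask; /* freeze the view for this transaction */
            pthread_mutex_unlock(&st->lock);
            return mask;
        }

        /* The caller counts the bits in the snapshot once, sends exactly
         * that many requests, and waits for exactly that many replies. */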
* cluster/ec: skip updating ctx->loc again when ec_fix_open/opendir (Kinglong Mee, 2020-03-16, 2 files, -10/+14)

    ec_manager_open/opendir memsets ctx->loc, which causes a
    memory/inode leak, and ec_fheal uses ctx->loc outside fd->lock, so
    loc_copy may copy bad data while another path memsets it.

    This patch skips updating ctx->loc when it is already initialized.
    With it, ctx->loc is filled once and never updated.

    Change-Id: I3bf5ffce4caf4c1c667f7acaa14b451d37a3550a
    fixes: bz#1806843
    Signed-off-by: Kinglong Mee <mijinlong@horiscale.com>
* afr: prevent spurious entry heals leading to gfid split-brain (Ravishankar N, 2020-02-25, 5 files, -15/+7)

    Problem: In a hyperconverged setup with granular-entry-heal enabled,
    if a file is recreated while one of the bricks is down, and an index
    heal is triggered (with the brick still down), entry self-heal was
    doing a spurious heal with just the 2 good bricks. It was doing a
    post-op leading to removal of the filename from
    .glusterfs/indices/entry-changes, as well as erroneous setting of
    afr xattrs on the parent. When the brick came up, the xattrs were
    cleared, resulting in the renamed file not getting healed and
    leading to gfid split-brain and EIO on the mount.

    Fix: Proceed with entry heal only when shd can connect to all bricks
    of the replica, just like in data and metadata heal.

    fixes: bz#1804591
    Change-Id: I916ae26ad1fabf259bc6362da52d433b7223b17e
    Signed-off-by: Ravishankar N <ravishankar@redhat.com>
    (cherry picked from commit 06453d77d056fbaa393a137ca277a20e38d2f67e)
* cluster/thin-arbiter: Wait for TA connection before ta-file lookup (Ashish Pandey, 2020-02-18, 1 file, -19/+21)

    Problem: When we mount a ta volume, as soon as the 2 data bricks are
    connected we consider the mount done and then send a lookup/create
    for the ta file on the ta node. However, the connection with the ta
    node might not have completed yet. Due to this delay, the ta replica
    id file will not be created, and we will see an ENOTCONN error in
    the log file when we do the lookup.

    Solution: Since we know this ta node could have a higher latency, we
    should wait a reasonable time for the connection to happen before
    sending the lookup/create on the replica id file.

    fixes: bz#1804058
    Change-Id: I36f90865afe617e4e84cee57fec832a16f5dd6cc
    (cherry picked from commit a7fa54ddea3fe429f143b37e4de06a93b49d776a)
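
    A minimal sketch of such a bounded wait (generic POSIX C; the struct
    and field names are assumptions, not the actual thin-arbiter code):

        #include <errno.h>
        #include <pthread.h>
        #include <stdbool.h>
        #include <time.h>

        struct ta_conn {
            pthread_mutex_t lock;
            pthread_cond_t cond;
            bool connected; /* set by the notify path when TA comes up */
        };

        /* Wait up to timeout_sec for the thin-arbiter connection before
         * sending the lookup/create on the replica id file, instead of
         * firing it as soon as the two data bricks connect. */
        static bool
        wait_for_ta(struct ta_conn *ta, int timeout_sec)
        {
            struct timespec deadline;
            bool ok;

            clock_gettime(CLOCK_REALTIME, &deadline);
            deadline.tv_sec += timeout_sec;

            pthread_mutex_lock(&ta->lock);
            while (!ta->connected) {
                if (pthread_cond_timedwait(&ta->cond, &ta->lock,
                                           &deadline) == ETIMEDOUT)
                    break; /* give up: the fop may fail with ENOTCONN */
            }
            ok = ta->connected;
            pthread_mutex_unlock(&ta->lock);
            return ok;
        }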
* tests: Fix spurious self-heald.t failure (Pranith Kumar K, 2020-02-13, 1 file, -23/+15)

    Problem: heal-info code assumes that all indices in the xattrop
    directory definitely need heal. There is one corner case: the very
    first xattrop on a file adds the gfid to the 'xattrop' index in the
    fop path, and in the _cbk path it is removed because the fop is a
    zero-xattr xattrop in the success case. These gfids could be read by
    heal-info and shown as needing heal.

    Fix: Check the pending flag to see if the file definitely needs heal
    or not, instead of relying on which index is being crawled at the
    moment.

    fixes: bz#1802449
    Change-Id: I79f00dc7366fedbbb25ec4bec838dba3b34c7ad5
    Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
    (cherry picked from commit d27df94016b5526c18ee964d4a47508326329dda)
* afr: expose cluster.optimistic-change-log to CLI. (Ravishankar N, 2020-01-09, 1 file, -0/+2)

    Backport of https://review.gluster.org/#/c/glusterfs/+/23960/

    This volume option was not made available to the `gluster volume
    set` CLI.

    Reported-by: epolakis (https://github.com/kinsu) in
    https://github.com/gluster/glusterfs/issues/781

    fixes: bz#1788785
    Change-Id: I7141bdd4e53ee99e22b354edde8d023bfc0b2cd7
    Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: make heal info lockless (Ravishankar N, 2019-12-16, 3 files, -203/+211)

    Changes in the locks xlator: Added support for per-domain inodelk
    count requests. The caller needs to set the
    GLUSTERFS_MULTIPLE_DOM_LK_CNT_REQUESTS key in the dict and then set
    each key with the name 'GLUSTERFS_INODELK_DOM_PREFIX:<domain name>'.
    In the response dict, the xlator will send the per-domain count as
    the value for each of these keys.

    Changes in AFR: Replaced afr_selfheal_locked_inspect() with
    afr_lockless_inspect(). Logic has been added to make the latter
    behave the same as the former, thus not breaking the current heal
    info output behaviour.

    fixes: bz#1783858
    Change-Id: Ie9e83c162aa77f44a39c2ba7115de558120ada4d
    Signed-off-by: Ravishankar N <ravishankar@redhat.com>
    (cherry picked from commit d7e049160a9dea988ded5816491c2234d40ab6b3)
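
    A small sketch of how the per-domain keys described above could be
    built (illustrative only; the dict calls themselves are elided):

        #include <stdio.h>

        /* The caller first sets the request key, then one key per lock
         * domain using the documented prefix. */
        #define MULTI_DOM_REQ "GLUSTERFS_MULTIPLE_DOM_LK_CNT_REQUESTS"
        #define DOM_PREFIX "GLUSTERFS_INODELK_DOM_PREFIX"

        static int
        build_domain_key(char *buf, size_t len, const char *domain)
        {
            /* e.g. "GLUSTERFS_INODELK_DOM_PREFIX:myvol-replicate-0" */
            return snprintf(buf, len, "%s:%s", DOM_PREFIX, domain);
        }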
* cluster/dht: Correct fd processing loop (N Balachandran, 2019-11-26, 1 file, -22/+62)

    The fd processing loops in the dht_migration_complete_check_task and
    dht_rebalance_inprogress_task functions were unsafe and could cause
    an open to be sent on an already freed fd. This has been fixed.

    Change-Id: I0a3c7d2fba314089e03dfd704f9dceb134749540
    fixes: bz#1769315
    Signed-off-by: N Balachandran <nbalacha@redhat.com>
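
    A hedged sketch of a safe variant of such a loop: pin every fd with
    a reference while the lock is held, then re-open outside the lock
    (generic C; the types and helpers are assumptions, not the actual
    dht structures):

        #include <pthread.h>
        #include <stdlib.h>

        struct fd_entry {
            struct fd_entry *next;
            int refcount; /* entry is freed when this drops to zero */
        };

        struct fd_table {
            pthread_mutex_t lock;
            struct fd_entry *head;
        };

        static int
        reopen_all(struct fd_table *tbl,
                   void (*reopen)(struct fd_entry *),
                   void (*unref)(struct fd_entry *))
        {
            struct fd_entry *fd, **pinned;
            int count = 0, i = 0;

            pthread_mutex_lock(&tbl->lock);
            for (fd = tbl->head; fd; fd = fd->next)
                count++;
            if (count == 0) {
                pthread_mutex_unlock(&tbl->lock);
                return 0;
            }
            pinned = calloc(count, sizeof(*pinned));
            if (!pinned) {
                pthread_mutex_unlock(&tbl->lock);
                return -1;
            }
            for (fd = tbl->head; fd; fd = fd->next) {
                fd->refcount++; /* safe: table lock is held */
                pinned[i++] = fd;
            }
            pthread_mutex_unlock(&tbl->lock);

            for (i = 0; i < count; i++) {
                reopen(pinned[i]); /* safe: we hold a reference */
                unref(pinned[i]);
            }
            free(pinned);
            return 0;
        }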
* cluster/afr: Heal entries when there is a source & no healed_sinks (karthik-us, 2019-11-14, 1 file, -0/+15)

    Problem: In a situation where B1 blames B2, B2 blames B1, and B3
    doesn't blame anything for entry heal, the heal will not complete
    even though we have a clear source and sinks. This happens because
    in afr_selfheal_find_direction() only the bricks which are blamed by
    non-accused bricks are considered as sinks. Later, in
    __afr_selfheal_entry_finalize_source(), when it tries to mark all
    the non-sources as sinks, it fails to do so because there won't be
    any healed_sinks marked, no witness present, and there will be a
    source.

    Fix: If there is a source and no healed_sinks, then reset all the
    locked sources to 0 and healed sinks to 1 to do a conservative
    merge.

    Change-Id: If40d8bc95d52a52b2730f55bdcf135109b421548
    Fixes: bz#1760699
    Signed-off-by: karthik-us <ksubrahm@redhat.com>
* afr: support split-brain CLI for replica 3 (Ravishankar N, 2019-11-13, 1 file, -1/+2)

    Ever since we added quorum checks for lookups in afr via commit
    bd44d59741bb8c0f5d7a62c5b1094179dd0ce8a4, the split-brain resolution
    commands would not work for replica 3 because there would be no
    readables for the lookup fop.

    The argument was that split-brains do not occur in replica 3, but we
    do see (data/metadata) split-brain cases once in a while, which
    indicate that there are a few bugs/corner cases yet to be discovered
    and fixed.

    Fortunately, commit 8016d51a3bbd410b0b927ed66be50a09574b7982 added
    GF_CLIENT_PID_GLFS_HEALD as the pid for all fops made by glfsheal.
    If we leverage this and allow lookups in afr when the pid is
    GF_CLIENT_PID_GLFS_HEALD, the split-brain resolution commands will
    work for replica 3 volumes too.

    Likewise, the check is added in shard_lookup as well, to permit
    resolving split-brains by specifying "/.shard/shard-file.xx" as the
    file name (which previously used to fail with EPERM).

    Change-Id: I3c543dea79caf7cfbc1633e9089cb1cdd2538ba9
    Fixes: bz#1760791
    Signed-off-by: Ravishankar N <ravishankar@redhat.com>
    (cherry picked from commit 47dbd753187f69b3835d2e42fdbe7485874c4b3e)
* dht: Rebalance causing IO Error - File descriptor in bad state (Mohit Agrawal, 2019-11-13, 5 files, -17/+116)

    Problem: When a file is migrated, dht attempts to re-open all open
    fds on the new cached subvol. Earlier, if dht had not opened the fd,
    the client xlator would be unable to find the remote fd and would
    fall back to using an anonymous fd for the fop. That behavior
    changed with https://review.gluster.org/#/c/glusterfs/+/15804,
    causing fops to fail with EBADFD if the fd was not available on the
    cached subvol. The client xlator returns EBADFD if the remote fd is
    not found, but dht only checked for EBADF before re-opening fds on
    the new cached subvol.

    Solution: Handle EBADFD in the dht code path to avoid the issue.

    > Change-Id: I43c51995cdd48d05b12e4b2889c8dbe2bb2a72d8
    > Fixes: bz#1758579
    > (cherry picked from commit 9314a9fbf487614c736cf6c4c1b93078d37bb9df)
    > (Reviewed on upstream link https://review.gluster.org/23518)

    Change-Id: I43c51995cdd48d05b12e4b2889c8dbe2bb2a72d8
    Fixes: bz#1761910
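
    A minimal sketch of the errno check implied by the fix (EBADFD is
    Linux-specific; the helper name is an assumption):

        #include <errno.h>
        #include <stdbool.h>

        /* Treat both EBADF and EBADFD from the client xlator as
         * "remote fd missing", and trigger a re-open on the new cached
         * subvolume instead of failing the fop. */
        static bool
        fd_needs_reopen(int op_errno)
        {
            return op_errno == EBADF || op_errno == EBADFD;
        }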
* ctime/rebalance: Heal ctime xattr on directory during rebalance (Kotresh HR, 2019-09-16, 3 files, -4/+7)

    After add-brick and rebalance, the ctime xattr is not present on
    rebalanced directories on the new brick. This patch fixes the same.
    Note that ctime still doesn't support consistent time across
    distribute sub-volumes.

    This patch also fixes the in-memory inconsistency of time attributes
    when metadata is self-healed.

    Backport of:
    > Patch: https://review.gluster.org/23127/
    > Change-Id: Ia20506f1839021bf61d4753191e7dc34b31bb2df
    > BUG: 1734026
    > Signed-off-by: Kotresh HR <khiremat@redhat.com>

    (cherry picked from commit 304640e55c0f3c6d15f4e230dc6376e4f5020fea)
    Change-Id: Ia20506f1839021bf61d4753191e7dc34b31bb2df
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
    fixes: bz#1752429
* afr/lookup: Pass xattr_req in while doing a selfheal in lookup (Mohammed Rafi KC, 2019-09-11, 3 files, -5/+16)

    We were not passing xattr_req when doing a name self-heal as well as
    a metadata heal. Because of this, some xdata was missing, which
    caused I/O errors.

    Backport of
    > https://review.gluster.org/#/c/glusterfs/+/23024/
    > Change-Id: Ibfb1205a7eb0195632dc3820116ffbbb8043545f
    > Fixes: bz#1728770
    > Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>

    Fixes: bz#1749305
    Change-Id: Ibfb1205a7eb0195632dc3820116ffbbb8043545f
    Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
    (cherry picked from commit d026f0bcfd301712e4f0671ccf238f43f2e6dd30)
* [RFC] change get_real_filename implementation to use ENOATTR instead of ENOENT (Michael Adam, 2019-09-03, 1 file, -4/+4)

    get_real_filename is implemented as a virtual extended attribute to
    help Samba implement the case-insensitive but case-preserving SMB
    protocol more efficiently. It is implemented as a getxattr call on
    the parent directory with the virtual key
    "get_real_filename:<entryname>": it looks for a spelling with
    different case for the provided file/dir name (<entryname>) and
    returns this correct spelling as a result if the entry is found.

    Originally (05aaec645a6262d431486eb5ac7cd702646cfcfb), the
    implementation used the ENOENT errno to return the authoritative
    answer that <entryname> does not exist in any case folding.

    This implementation is actually a violation or misuse of the defined
    API for the getxattr call, which returns ENOENT for the case that
    the dir the call is made against does not exist, and ENOATTR (or its
    synonym ENODATA) for the case that the xattr does not exist.

    This was not a problem until the gluster fuse-bridge was changed to
    map ENOENT to ESTALE in 59629f1da9dca670d5dcc6425f7f89b3e96b46bf,
    after which the getxattr call for get_real_filename returned ESTALE
    instead of ENOENT, breaking the expectation in Samba. It is an
    independent problem that ESTALE should not leak out to user space
    but is intended to trigger retries between fuse and gluster.
    Nevertheless, the semantics seem to be incorrect here and should be
    changed.

    This patch changes the implementation of the get_real_filename
    virtual xattr to correctly return ENOATTR instead of ENOENT if the
    file/directory being looked up is not found.

    The Samba glusterfs_fuse vfs module, which takes advantage of
    get_real_filename over a fuse mount, will receive a corresponding
    change to map ENOATTR to ENOENT. Without this change, it will still
    work correctly, but the performance optimization for nonexistent
    files is lost. On the other hand, this change removes the
    distinction between the old not-implemented case and the implemented
    case, so a Samba changed to treat ENOATTR like ENOENT will no longer
    work correctly against old servers that don't implement
    get_real_filename, i.e. existing files will be reported as
    non-existing.

    Change-Id: I971b427ab8410636d5d201157d9af70e0d075b67
    fixes: bz#1745914
    Signed-off-by: Michael Adam <obnox@samba.org>
    (cherry picked from commit dc1b87fcfef08c9497b0c02b2410c9d18bbc2dba)
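
    A minimal sketch of the corrected semantics (generic C;
    directory_exists and find_case_insensitive are assumed helpers, not
    the actual implementation):

        #include <errno.h>
        #include <stdbool.h>
        #include <string.h>

        #ifndef ENOATTR
        #define ENOATTR ENODATA /* ENOATTR is a synonym for ENODATA */
        #endif

        /* Assumed helpers, not defined here. */
        bool directory_exists(const char *path);
        const char *find_case_insensitive(const char *dir, const char *name);

        /* ENOENT only when the parent directory itself is missing;
         * ENOATTR when no case-folded match for <entryname> exists. */
        static int
        get_real_filename(const char *parent, const char *entry, char *out,
                          size_t outlen)
        {
            const char *match;

            if (!directory_exists(parent))
                return -ENOENT; /* the object of getxattr is missing */

            match = find_case_insensitive(parent, entry);
            if (!match)
                return -ENOATTR; /* the (virtual) xattr does not exist */

            strncpy(out, match, outlen - 1);
            out[outlen - 1] = '\0';
            return 0;
        }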
* afr: wake up index healer threads (Ravishankar N, 2019-08-30, 5 files, -11/+25)

    ...whenever shd is re-enabled after disabling, or there is a change
    in `cluster.heal-timeout`, without needing to restart shd or waiting
    for the current `cluster.heal-timeout` seconds to expire.

    See BZ 1743988 for more details.

    Change-Id: Ia5ebd7c8e9f5b54cba3199c141fdd1af2f9b9bfe
    fixes: bz#1747301
    Reported-by: Glen Kiessling <glenk1973@hotmail.com>
    Signed-off-by: Ravishankar N <ravishankar@redhat.com>
    (cherry picked from commit 600ba94183333c4af9b4a09616690994fd528478)
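
    A minimal sketch of the wake-up mechanism such a change implies,
    assuming the healer sleeps in a pthread_cond_timedwait loop
    (illustrative, not the actual shd code):

        #include <pthread.h>
        #include <stdbool.h>

        struct healer {
            pthread_mutex_t lock;
            pthread_cond_t cond;
            int heal_timeout; /* seconds; updated by volume set */
            bool rerun;       /* set to force an immediate crawl */
        };

        /* Re-enabling shd or changing cluster.heal-timeout signals the
         * condition so the thread re-reads its configuration right
         * away instead of finishing the old timed sleep. */
        static void
        healer_wake(struct healer *h, int new_timeout)
        {
            pthread_mutex_lock(&h->lock);
            h->heal_timeout = new_timeout;
            h->rerun = true;
            pthread_cond_signal(&h->cond); /* interrupt the timed wait */
            pthread_mutex_unlock(&h->lock);
        }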
* cluster/ec: Update lock->good_mask on parent fop failure (Pranith Kumar K, 2019-08-22, 2 files, -0/+4)

    When discard/truncate performs a write fop, it should do so after
    updating lock->good_mask to make sure readv happens on the correct
    mask.

    fixes: bz#1739424
    Change-Id: Idfef0bbcca8860d53707094722e6ba3f81c583b7
    Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: Fix reopen flags to avoid misbehavior (Pranith Kumar K, 2019-08-22, 2 files, -3/+8)

    Problem: When a file needs to be re-opened, the O_APPEND, O_EXCL and
    O_CREAT flags are not filtered in EC.

    - O_APPEND should be filtered because EC doesn't send O_APPEND below
      EC for open, to make sure writes happen on the individual
      fragments instead of at the end of the file.
    - O_EXCL should be filtered because shd could have created the file,
      so the open should succeed even when the file exists.
    - O_CREAT should be filtered because the open happens with the gfid
      as parameter, so the open fop would create just the gfid, which
      will lead to problems.

    Fix: Filter out these flags in reopen.

    Change-Id: Ia280470fcb5188a09caa07bf665a2a94bce23bc4
    Fixes: bz#1739426
    Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
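
    A minimal sketch of the flag filtering described above (the function
    name is an assumption):

        #include <fcntl.h>

        /* When EC re-opens an already-existing file by gfid, strip the
         * flags that only make sense for the application's original
         * open. */
        static int
        ec_filter_reopen_flags(int flags)
        {
            /* O_APPEND: EC positions writes per fragment itself.
             * O_EXCL:   shd may already have created the file.
             * O_CREAT:  the reopen path opens by gfid, so creation
             *           would be wrong. */
            return flags & ~(O_APPEND | O_EXCL | O_CREAT);
        }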
* cluster/ec: Always read from good-mask (Pranith Kumar K, 2019-08-22, 2 files, -5/+25)

    There are cases where fop->mask may have fop->healing added, and
    readv shouldn't be wound on fop->healing. To avoid this, always wind
    readv to lock->good_mask.

    updates: bz#1739424
    Change-Id: I2226ef0229daf5ff315d51e868b980ee48060b87
    Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: fix EIO error for concurrent writes on sparse files (Xavi Hernandez, 2019-08-22, 1 file, -9/+17)

    EC doesn't allow concurrent writes on overlapping areas; they are
    serialized. However, non-overlapping writes are serviced in
    parallel. When a write is not aligned, EC first needs to read the
    entire chunk from disk, apply the modified fragment and write it
    again.

    The problem appears on sparse files, because a write to an offset
    implicitly creates data on offsets below it (so, in some way, they
    are overlapping). For example, if a file is empty and we read 10
    bytes from offset 10, read() will return 0 bytes. Now, if we write
    one byte at offset 1M and retry the same read, the system call will
    return 10 bytes (all containing 0's).

    So if we have two writes, the first one at offset 10 and the second
    one at offset 1M, EC will send both in parallel because they do not
    overlap. However, the first one will try to read missing data from
    the first chunk (i.e. offsets 0 to 9) to recombine the entire chunk
    and do the final write. This read will happen in parallel with the
    write to 1M. What could happen is that half of the bricks process
    the write before the read, and the other half do the read before the
    write. Some bricks will return 10 bytes of data while the others
    will return 0 bytes (because the file on the brick has not been
    expanded yet).

    When EC tries to recombine the answers from the bricks, it can't,
    because it needs more than half consistent answers to recover the
    data. So this read fails with an EIO error. This error is propagated
    to the parent write, which is aborted, and EIO is returned to the
    application.

    The issue happened because EC assumed that a write to a given offset
    implies that offsets below it exist. This fix prevents the read of
    the chunk from bricks if the current size of the file is smaller
    than the read chunk offset. This size is correctly tracked, so this
    fixes the issue.

    Also modifying the ec-stripe.t file for Test #13 within it. In this
    patch, if a file's size is less than the offset we are writing to,
    we fill zeros in the head and tail and do not consider it a stripe
    cache miss. That actually makes sense, as we know what data that
    part holds and there is no need to read it from the bricks.

    Change-Id: Ic342e8c35c555b8534109e9314c9a0710b6225d6
    Fixes: bz#1739427
    Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
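
    A minimal sketch of the size guard the fix describes (names are
    illustrative):

        #include <stdbool.h>
        #include <stdint.h>

        /* Before issuing the read-modify-write read for an unaligned
         * chunk, compare the chunk offset with the tracked file size.
         * Beyond EOF the content is known to be zeros (a hole), so no
         * brick read, and no race with a parallel write, is needed. */
        static bool
        chunk_read_needed(uint64_t chunk_offset, uint64_t current_size)
        {
            return chunk_offset < current_size;
        }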
* cluster/ec: inherit healing from lock when it has info (Kinglong Mee, 2019-08-22, 1 file, -2/+3)

    If the lock has info, the fop should inherit the healing mask from
    it. Otherwise, the fop cannot inherit the right healing when
    changed_flags is zero.

    Change-Id: Ife80c9169d2c555024347a20300b0583f7e8a87f
    updates: bz#1739424
    Signed-off-by: Kinglong Mee <mijinlong@horiscale.com>
* afr: restore timestamp of parent dir during entry-heal (Ravishankar N, 2019-08-21, 1 file, -0/+2)

    Fixes: bz#1741041
    Change-Id: I29e338bac62104233a6f80212df8d0fb016affda
    Signed-off-by: Ravishankar N <ravishankar@redhat.com>
    (cherry picked from commit 8e9c53ebf16705b9a1db2fc486dc24a5cb244ddd)
* cluster/dht: Fixed a memleak in dht_rename_cbk (N Balachandran, 2019-08-09, 1 file, -11/+33)

    Fixed a memleak in dht_rename_cbk when creating a linkto file.

    Change-Id: I705adef3cb79e33806520fc2b15558e90e2c211c
    fixes: bz#1739337
    Signed-off-by: N Balachandran <nbalacha@redhat.com>
* cluster/ta: Notify the clients only if there are pending heals (karthik-us, 2019-07-24, 4 files, -22/+69)

    Problem: In the case of thin arbiter, before the index healer starts
    crawling the indices at every heal-timeout interval, it sends an
    upcall notification to all the clients to release any
    AFR_TA_DOM_NOTIFY locks that they hold, even if there is nothing to
    be healed. SHD will wait for the upcall to return before proceeding
    with the heal, even though there is nothing to be healed. This also
    invalidates the cached information about the brick states on the
    clients, which leads to extra calls on TA from the clients for the
    subsequent reads & writes if needed. This impacts IO performance.

    Fix:
    - Before sending the upcall to the clients, check for any pending
      heals on TA without taking any locks.
    - If there is nothing marked bad on TA, then continue with the index
      crawl to heal any dirty markings present on the files due to any
      post-op failure.
    - If there is a brick marked as bad on TA, then take the
      AFR_TA_DOM_NOTIFY lock on TA from SHD, get the state on TA and
      continue with the current healing process.

    Change-Id: Ieb477bc6cb18bbdfd4e7a0453c5ed79b574ec9d6
    fixes: bz#1729483
    Signed-off-by: karthik-us <ksubrahm@redhat.com>
* cluster/afr: Fix incorrect reporting of gfid & type mismatch (karthik-us, 2019-07-20, 2 files, -2/+23)

    Problems:
    1. When checking for type and gfid mismatch, if the type or gfid is
       unknown because of a missing gfid handle and gfid xattr, it will
       be reported as a type or gfid mismatch and the heal will not
       complete.
    2. If the source selected during entry heal has a null gfid, the
       same will be sent to afr_lookup_and_heal_gfid(). In this
       function, when we try to assign the gfid on the bricks where it
       does not exist, we use that same (null) gfid, which will fail in
       posix_gfid_set().

    Fix: If the gfid sent to afr_lookup_and_heal_gfid() is null, choose
    a valid gfid before proceeding to assign the gfid on the bricks
    where it is missing. In afr_selfheal_detect_gfid_and_type_mismatch(),
    do not report a type/gfid mismatch if the type/gfid is unknown or
    not set.

    Change-Id: Ia06552e4dc4a9f89cb7f5302833604bd21bbf7da
    fixes: bz#1729481
    Signed-off-by: karthik-us <ksubrahm@redhat.com>
* cluster/ec: Prevent double pre-op xattrops (Pranith Kumar K, 2019-06-22, 1 file, -6/+7)

    Problem: Race between Thread-1 and Thread-2:
    1) Thread-1 does ec_get_size_version() to perform the pre-op
       fxattrop as part of write-1.
    2) Thread-2 calls ec_set_dirty_flag() in ec_get_size_version() for
       write-2. This sets dirty[] to 1.
    3) ec_prepare_update_cbk completes, leading to ctx->dirty[] = '1'.
    4) LOCK(inode->lock) is taken to check if there are any flags, and
       the dirty flag is set because lock->waiting_flag is 0 now. This
       leads to an fxattrop that increments the on-disk dirty[] to '2'.

    At the end of the writes, the file will be marked for heal even when
    it doesn't need heal.

    Fix: Perform ec_set_dirty_flag() and the other checks inside LOCK()
    to prevent dirty[] from being marked as '1' in step 2) above.

    Updates bz#1593224
    Change-Id: Icac2ab39c0b1e7e154387800fbededc561612865
    Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
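
    A minimal sketch of the shape of the fix: the check and the set
    happen under one lock, so only the first writer issues the pre-op
    (generic C; types and names are illustrative):

        #include <pthread.h>
        #include <stdbool.h>

        struct ec_inode_ctx {
            pthread_mutex_t lock;
            bool dirty_set; /* a pre-op xattrop is already in flight */
        };

        static bool
        mark_dirty_once(struct ec_inode_ctx *ctx)
        {
            bool do_xattrop = false;

            pthread_mutex_lock(&ctx->lock);
            if (!ctx->dirty_set) { /* check-and-set under one lock */
                ctx->dirty_set = true;
                do_xattrop = true; /* only the first writer pre-ops */
            }
            pthread_mutex_unlock(&ctx->lock);
            return do_xattrop;
        }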
* ec-heal: check file's gfid when deleting stale name (Kinglong Mee, 2019-06-20, 1 file, -1/+11)

    A name-less lookup does not contain the parent's stat, so it is hard
    to check whether the looked-up file is at the right path. This patch
    changes it to a name lookup and checks the file's gfid against the
    expected gfid. If the gfids differ, mark the entry as ESTALE.

    fixes: bz#1702131
    Change-Id: I2de20b10d680eed1e2fb1d3830b3b3dec4520dbf
    Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
* afr/read: Implement latency based read child selection (Mohammed Rafi KC, 2019-06-20, 3 files, -27/+98)

    Network latency is an important factor when selecting a read
    subvolume, so this patch adds two new policies:

    1) Measure the latency of a child during a GF_DUMP rpc call, then
       use this latency to pick the read subvol with the least latency.
    2) A hybrid mode that calculates an effective latency by multiplying
       the number of outstanding pending read requests by the latency,
       and chooses the least one.

    Change-Id: Ia49c8a08ab61f7dcdad8b8950aa4d338e7accf97
    fixes: #520
    Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
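
    A minimal sketch of the two policies (generic C; struct and function
    names are assumptions, not the actual AFR code):

        #include <stdint.h>

        struct child_stats {
            uint64_t latency_usec;  /* measured via GF_DUMP ping timing */
            uint64_t pending_reads; /* outstanding read requests */
            int up;                 /* child is connected and readable */
        };

        /* Policy 1: least measured latency. Policy 2 (hybrid): least
         * pending_reads * latency, so an idle-but-slower child can win
         * over a busy fast one. */
        static int
        pick_read_child(const struct child_stats *c, int n, int hybrid)
        {
            int best = -1;
            uint64_t best_score = UINT64_MAX;

            for (int i = 0; i < n; i++) {
                uint64_t score;
                if (!c[i].up)
                    continue;
                score = hybrid ? c[i].pending_reads * c[i].latency_usec
                               : c[i].latency_usec;
                if (score < best_score) {
                    best_score = score;
                    best = i;
                }
            }
            return best; /* chosen read child, or -1 if none is up */
        }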
* cluster/dht: Strip out dht xattrs (N Balachandran, 2019-06-19, 1 file, -0/+2)

    Some internal DHT xattrs were not being removed when calling
    getxattr in pass-through mode. This has been fixed.

    Change-Id: If7e3dbc7b495db88a566bd560888e3e9c167defa
    fixes: bz#1721435
    Signed-off-by: N Balachandran <nbalacha@redhat.com>
* afr/fini: Free local_pool data during an afr fini (Mohammed Rafi KC, 2019-06-17, 1 file, -0/+6)

    We should free the mem_pool local_pool during an afr_fini.
    Otherwise this leads to a memory leak for shd.

    Change-Id: I805a34a88077bf7b886c28b403798bf9eeeb1c0b
    Updates: bz#1716695
    Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* clang-scan: resolve warning (Amar Tumballi, 2019-06-15, 1 file, -4/+0)

    dht-common.c: because there was a 'goto err' before assigning the
    'local' variable, there is a possibility of a NULL dereference. As
    the check which was done would never be true, removed the check.

    glusterd-geo-rep.c: found a possible path where 'slave_host' could
    be NULL when it gets passed to strcmp(), which expects a valid
    string. Added a NULL check.

    Updates: bz#1622665
    Change-Id: I64c280bc1beac9a2b109e8fa88f2a5ce8b823c3a
    Signed-off-by: Amar Tumballi <amarts@redhat.com>
* multiple files: another attempt to remove includes (Yaniv Kaul, 2019-06-14, 31 files, -95/+7)

    There are many include statements that are not needed.

    A previous, more ambitious attempt failed because of *BSD platforms
    (see https://review.gluster.org/#/c/glusterfs/+/21929/). Now trying
    a more conservative reduction.

    It does not solve all the circular deps we have, but it does reduce
    some of them. There is just too much to handle reasonably
    (dht-common.h includes dht-lock.h which includes dht-common.h ...),
    but it does reduce the overall number of include lines we need to
    look at in the future to understand and fix the mess later on.

    Change-Id: I550cd001bdefb8be0fe67632f783c0ef6bee3f9f
    updates: bz#1193929
    Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* Cluster/afr: Don't treat all bricks having metadata pending as split-brain (karthik-us, 2019-06-10, 2 files, -3/+3)

    Problem: We currently don't have a roll-back/undoing of post-ops if
    quorum is not met. Though the FOP is still unwound with failure, the
    xattrs remain on the disk. Due to these partial post-ops and partial
    heals (healing only when 2 bricks are up), we can end up in metadata
    split-brain purely from the afr xattrs point of view, i.e. each
    brick is blamed by at least one of the others for metadata. These
    scenarios are hit when there is frequent connect/disconnect of the
    client/shd to the bricks.

    Fix: Pick a source based on the xattr values. If 2 bricks blame one,
    the blamed one must be treated as a sink. If there is no majority,
    all are sources. Once we pick a source, self-heal will then do the
    heal instead of erroring out due to split-brain. This patch also
    adds the restriction that all bricks must be up to perform metadata
    heal, to avoid any metadata loss.

    Removed the test case
    tests/bugs/replicate/bug-1468279-source-not-blaming-sinks.t as it
    was doing metadata heal even when only 2 of 3 bricks were up.

    Change-Id: I07a9d62f84ceda329dcab1f02a33aeed258dcb09
    fixes: bz#1717819
    Signed-off-by: karthik-us <ksubrahm@redhat.com>
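
    A minimal sketch of the source-picking rule for replica 3
    (illustrative; not the actual AFR heal code):

        #include <stdbool.h>

        #define REPLICAS 3

        /* blame[i][j] is true when brick i's afr xattrs blame brick j.
         * A brick blamed by 2 of the others becomes a sink; when no
         * brick collects majority blame, all bricks stay sources, and
         * heal proceeds instead of erroring out as split-brain. */
        static void
        pick_sources(const bool blame[REPLICAS][REPLICAS],
                     bool source[REPLICAS])
        {
            for (int j = 0; j < REPLICAS; j++) {
                int blamed_by = 0;
                for (int i = 0; i < REPLICAS; i++)
                    if (i != j && blame[i][j])
                        blamed_by++;
                source[j] = (blamed_by < 2); /* majority blame => sink */
            }
        }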
* ec/fini: Fix race between xlator cleanup and on going async fop (Mohammed Rafi KC, 2019-06-08, 6 files, -15/+56)

    Problem: While we process a cleanup, there is a chance of a race
    with async operations, for example ec_launch_replace_heal. This can
    lead to an invalid memory access.

    Solution: Just like we track ongoing heal fops, we can also track
    fops like ec_launch_replace_heal, so that we can decide when to send
    a PARENT_DOWN request.

    Change-Id: I055391c5c6c34d58aef7336847f3b570cb831298
    fixes: bz#1703948
    Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* cluster/dht: Fix directory perms during selfheal (N Balachandran, 2019-06-05, 1 file, -3/+5)

    Fixed a bug in the revalidate code path that wiped out directory
    permissions if no mds subvol was found.

    Change-Id: I8b4239ffee7001493c59d4032a2d3062586ea115
    fixes: bz#1716830
    Signed-off-by: N Balachandran <nbalacha@redhat.com>
* Fix some "Null pointer dereference" coverity issuesXavi Hernandez2019-05-262-0/+13
| | | | | | | | | | | | | | | | | | | | | | This patch fixes the following CID's: * 1124829 * 1274075 * 1274083 * 1274128 * 1274135 * 1274141 * 1274143 * 1274197 * 1274205 * 1274210 * 1274211 * 1288801 * 1398629 Change-Id: Ia7c86cfab3245b20777ffa296e1a59748040f558 Updates: bz#789278 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/ec: honor contention notifications for partially acquired locks (Xavi Hernandez, 2019-05-25, 1 file, -1/+1)

    EC was ignoring lock contention notifications received while a lock
    was being acquired. When a lock is partially acquired (some bricks
    have granted the lock but some others not yet), we can receive
    notifications from acquired bricks, which should be honored, since
    we may not receive more notifications after that.

    Since EC was ignoring them, once the lock was acquired it was not
    released until the eager-lock timeout, causing unnecessary delays on
    other clients.

    This fix takes into consideration the notifications received before
    having completed the full lock acquisition. After that, the lock
    will be released as soon as possible.

    Fixes: bz#1708156
    Change-Id: I2a306dbdb29fb557dcab7788a258bd75d826cc12
    Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/dht: Lookup all files when processing directory (N Balachandran, 2019-05-23, 1 file, -6/+6)

    A rebalance process currently only looks up files that it is
    supposed to migrate. This could cause issues when lookup-optimize is
    enabled, as the dir layout can be updated with the commit hash
    before all files are looked up. This is especially problematic if
    one of the rebalance processes fails to complete, as clients will
    try to access files whose linkto files might not have been created.

    Each process will now look up every file in the directory it is
    processing.
    Pros: Less likely that files will be inaccessible.
    Cons: More lookup requests sent to the bricks and a potential
    performance hit.

    Note: this does not handle races such as when a layout is updated on
    disk just as the create fop is sent by the client.

    Change-Id: I22b55846effc08d3b827c3af9335229335f67fb8
    fixes: bz#1711764
    Signed-off-by: N Balachandran <nbalacha@redhat.com>
* ec/fini: Fix race with ec_fini and ec_notify (Mohammed Rafi KC, 2019-05-21, 3 files, -0/+13)

    During a graph cleanup, we first send a PARENT_DOWN and wait for a
    child down to ultimately free the xlator and the graph.

    In the ec xlator, we clean up the threads when we get a PARENT_DOWN
    event. But a racing event like CHILD_UP or an xl_op event may
    trigger healing threads after the threads' cleanup. So there is a
    chance that the threads might access a freed private variable.

    Change-Id: I252d10181bb67b95900c903d479de707a8489532
    fixes: bz#1703948
    Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* afr/frame: Destroy frame after afr_selfheal_entry_granular (Mohammed Rafi KC, 2019-05-21, 1 file, -3/+8)

    In the function afr_selfheal_entry_granular, we are not destroying
    the frame after completing the heal. This leads to a crash when we
    execute a statedump operation, which tries to access the xlator
    object: if that xlator object was freed as part of the graph
    destroy, this results in an invalid memory access.

    Change-Id: I0a5e78e704ef257c3ac0087eab2c310e78fbe36d
    fixes: bz#1708926
    Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* ec/shd: Cleanup self heal daemon resources during ec fini (Mohammed Rafi KC, 2019-05-13, 5 files, -13/+122)

    We were not properly cleaning up self-heal daemon resources during
    ec fini. With shd multiplexing, it is absolutely necessary to clean
    up all the resources during ec fini.

    Change-Id: Iae4f1bce7d8c2e1da51ac568700a51088f3cc7f2
    fixes: bz#1703948
    Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* afr: thin-arbiter lock release fixes (Ravishankar N, 2019-05-10, 3 files, -47/+93)

    - Pass the fop state instead of the afr local to
      afr_ta_dom_lock_check_and_release().
    - Avoid afr_lock_release_synctask() being called simultaneously from
      the notify code path and the transaction (post-op) code path due
      to races.
    - Check if the post-op on TA is valid based on event_gen checks.
    - Invalidate in-memory information when we get a TA child down.

    Note: This patch addresses some pending review comments of commit
    053b1309dc8fbc05fcde5223e734da9f694cf5cc
    (https://review.gluster.org/#/c/glusterfs/+/20095/).

    fixes: bz#1698449
    Change-Id: I2ccd7e1b53362f9f3fed8680aecb23b5011eb18c
    Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: log before attempting data self-heal. (Ravishankar N, 2019-05-08, 1 file, -0/+3)

    I was working on a blog about troubleshooting AFR issues and I
    wanted to copy the messages logged by self-heal for my blog. I then
    realized that AFR-v2 is not logging *before* attempting data heal,
    while it logs it for metadata and entry heals:

    I [MSGID: 108026] [afr-self-heal-entry.c:883:afr_selfheal_entry_do] 0-testvol-replicate-0: performing entry selfheal on d120c0cf-6e87-454b-965b-0d83a4c752bb
    I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed entry selfheal on d120c0cf-6e87-454b-965b-0d83a4c752bb. sources=[0] 2 sinks=1
    I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed data selfheal on a9b5f183-21eb-4fb3-a342-287d3a7dddc5. sources=[0] 2 sinks=1
    I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-testvol-replicate-0: performing metadata selfheal on a9b5f183-21eb-4fb3-a342-287d3a7dddc5
    I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed metadata selfheal on a9b5f183-21eb-4fb3-a342-287d3a7dddc5. sources=[0] 2 sinks=1

    Adding it in this patch. Now there is a 'performing' and a
    corresponding 'Completed' message for every type of heal.

    fixes: bz#1707746
    Change-Id: I0b954cf1e17b48280aefa76640b5119b92133d61
    Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* dht: Custom xattrs are not healed in case of add-brick (root, 2019-05-08, 1 file, -8/+1)

    Problem: If any custom xattrs are set on a directory before adding a
    brick, the xattrs are not healed on the directory after adding the
    brick.

    Solution: The xattrs are not healed because
    dht_selfheal_dir_mkdir_lookup_cbk checks the value of MDS, and if
    the MDS value is not negative, the selfheal code path does not take
    a reference to the MDS xattrs. Change the condition to take a
    reference to the MDS xattrs, so that custom xattrs are populated on
    the newly added brick.

    Updates: bz#1702299
    Change-Id: Id14beedb98cce6928055f294e1594b22132e811c
    Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
* afr: fix Coverity CID 1398627 (Rinku Kothiya, 2019-05-07, 1 file, -2/+9)

    Fixed the coverity error "Unchecked return value (CHECKED_RETURN)"
    by checking the return value of afr_set_pending_dict and logging an
    error message if it fails.

    updates: bz#789278
    Change-Id: Iab7da6b4f3cd0622b95b8e1c412b007a330467e5
    Signed-off-by: Rinku Kothiya <rkothiya@redhat.com>