summaryrefslogtreecommitdiffstats
path: root/xlators/cluster/afr/src/afr-common.c
Commit message (Collapse)AuthorAgeFilesLines
* cluster/afr: Fix dict-leak in pre-opPranith Kumar K2018-03-031-8/+8
| | | | | | | | | | | | | | | | At the time of pre-op, pre_op_xdata is populted with the xattrs we get from the disk and at the time of post-op it gets over-written without unreffing the previous value stored leading to a leak. This is a regression we missed in https://review.gluster.org/#/q/ba149bac92d169ae2256dbc75202dc9e5d06538e >BUG: 1550078 >Change-Id: I0456f9ad6f77ce6248b747964a037193af3a3da7 >Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> >(cherry picked from commit e7b79c59590c203c65f7ac8548b30d068c232d33) BUG: 1550808 Change-Id: I0456f9ad6f77ce6248b747964a037193af3a3da7
* core: fix some of the dict_{get,set} with proper APIsAmar Tumballi2018-01-171-3/+2
| | | | | | | updates #220 Change-Id: I6e25dbb69b2c7021e00073e8f025d212db7de0be Signed-off-by: Amar Tumballi <amarts@redhat.com>
* cluster/afr: Fixing the flaws in arbiter becoming source patchkarthik-us2018-01-131-114/+152
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Setting the write_subvol value to read_subvol in case of metadata transaction during pre-op (commit 19f9bcff4aada589d4321356c2670ed283f02c03) might lead to the original problem of arbiter becoming source. Scenario: 1) All bricks are up and good 2) 2 writes w1 and w2 are in progress in parallel 3) ctx->read_subvol is good for all the subvolumes 4) w1 succeeds on brick0 and fails on brick1, yet to do post-op on the disk 5) read/lookup comes on the same file and refreshes read_subvols back to all good 6) metadata transaction happens which makes ctx->write_subvol to be assigned with ctx->read_subvol which is all good 7) w2 succeeds on brick1 and fails on brick0 and this will update the brick in reverse order leading to arbiter becoming source Fix: Instead of setting the ctx->write_subvol to ctx->read_subvol in the pre-op statge, if there is a metadata transaction, check in the function __afr_set_in_flight_sb_status() if it is a data/metadata transaction. Use the value of ctx->write_subvol if it is a data transactions and ctx->read_subvol value for other transactions. With this patch we assign the value of ctx->write_subvol in the afr_transaction_perform_fop() with the on disk value, instead of assigning it in the afr_changelog_pre_op() with the in memory value. Change-Id: Id2025a7e965f0578af35b1abaac793b019c43cc4 BUG: 1482064 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* afr: volume option fixes for GD2Ravishankar N2017-11-271-1/+0
| | | | | | | | | This patch takes care of volume options exposed via the CLI. Updates #302 Change-Id: I6fd1645604928f6b9700e2425af4147cc6446a3a Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: coverity fixesRavishankar N2017-11-241-8/+10
| | | | | | | | | | | | | | | | | | | | | | 1.afr_discover_do: COPY_PASTE_ERROR 2.afr_fav_child_reset_sink_xattrs_cbk: REVERSE_INULL 3.afr_fop_lock_proceed: UNUSED_VALUE 4.afr_local_init: CHECKED_RETURN 5.afr_set_split_brain_choice: REVERSE_INULL 6.__afr_inode_write_finalize: FORWARD_NULL 7.afr_refresh_heal_done: REVERSE_INULL 8.afr_xl_op:UNUSED_VALUE 9.afr_changelog_populate_xdata: DEADCODE 10.set_afr_pending_xattrs_option: RESOURCE_LEAK Note: RESOURCE_LEAK complaints about afr_fgetxattr_pathinfo_cbk, afr_getxattr_list_node_uuids_cbk and afr_getxattr_pathinfo_cbk seem to be false alarms. Change-Id: Ia4ca1478b5e2922084732d14c1e7b1b03ad5ac45 BUG: 789278 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* afr: add checks for allowing lookupsRavishankar N2017-11-181-85/+159
| | | | | | | | | | | | | | | | | | | | | | Problem: In an arbiter volume, lookup was being served from one of the sink bricks (source brick was down). shard uses the iatt values from lookup cbk to calculate the size and block count, which in this case were incorrect values. shard_local_t->last_block was thus initialised to -1, resulting in an infinite while loop in shard_common_resolve_shards(). Fix: Use client quorum logic to allow or fail the lookups from afr if there are no readable subvolumes. So in replica-3 or arbiter vols, if there is no good copy or if quorum is not met, fail lookup with ENOTCONN. With this fix, we are also removing support for quorum-reads xlator option. So if quorum is not met, neither read nor write txns are allowed and we fail the fop with ENOTCONN. Change-Id: Ic65c00c24f77ece007328b421494eee62a505fa0 BUG: 1467250 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Fix for arbiter becoming sourcekarthik-us2017-11-181-1/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: When eager-lock is on, and two writes happen in parallel on a FD we were observing the following behaviour: - First write fails on one data brick - Since the post-op is not yet happened, the inode refresh will get both the data bricks as readable and set it in the inode context - In flight split brain check see both the data bricks as readable and allows the second write - Second write fails on the other data brick - Now the post-op happens and marks both the data bricks as bad and arbiter will become source for healing Fix: Adding one more variable called write_suvol in inode context and it will have the in memory representation of the writable subvols. Inode refresh will not update this value and its lifetime is pre-op through unlock in the afr transaction. Initially the pre-op will set this value same as read_subvol in inode context and then in the in flight split brain check we will use this value instead of read_subvol. After all the checks we will update the value of this and set the read_subvol same as this to avoid having incorrect value in that. Change-Id: I2ef6904524ab91af861d59690974bbc529ab1af3 BUG: 1482064 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* Coverity Issue: PW.INCLUDE_RECURSION in several filesGirjesh Rajoria2017-11-091-3/+0
| | | | | | | | | | | | | | Coverity ID: 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 423, 424, 425, 426, 427, 428, 429, 436, 437, 438, 439, 440, 441, 442, 443 Issue: Event include_recursion Removed redundant, recursive includes from the files. Change-Id: I920776b1fa089a2d4917ca722d0075a9239911a7 BUG: 789278 Signed-off-by: Girjesh Rajoria <grajoria@redhat.com>
* cluster/afr: Honor default timeout of 5min for analyzing split-brain fileskarthik-us2017-10-301-1/+5
| | | | | | | | | | | | | | | | | | | Problem: After setting split-brain-choice option to analyze the file to resolve the split brain using the command "setfattr -n replica.split-brain-choice -v "choiceX" <path-to-file>" should allow to access the file from mount for default timeout of 5mins. But the timeout was not honored and was able to access the file even after the timeout. Fix: Call the inode_invalidate() in afr_set_split_brain_choice_cbk() so that it will triger the cache invalidate after resetting the timer and the split brain choice. So the next calls to access the file will fail with EIO. Change-Id: I698cb833676b22ff3e4c6daf8b883a0958f51a64 BUG: 1503519 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* cluster/afr: Fail open on split-brainPranith Kumar K2017-10-261-8/+69
| | | | | | | | | | | | | | | | | Problem: Append on a file with split-brain succeeds. Open is intercepted by open-behind, when write comes on the file, open-behind does open+write. Open succeeds because afr doesn't fail it. Then write succeeds because write-behind intercepts it. Flush is also intercepted by write-behind, so the application never gets to know that the write failed. Fix: Fail open on split-brain, so that when open-behind does open+write open fails which leads to write failure. Application will know about this failure. Change-Id: I4bff1c747c97bb2925d6987f4ced5f1ce75dbc15 BUG: 1294051 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* xlator/cluster/afr:coverity Issue "UNUSED_VALUE" in afr_get_split_brain_statusSubha sree Mohankumar2017-10-051-1/+7
| | | | | | | | | | | Issue: Event value_overwrite:Overwriting previous write to "ret" with value "-1". Fix : An "If" condition is added to check the value of "ret". Change-Id: I7b6bd4f20f73fa85eb8a5169644e275c7b56af51 BUG: 789278 Signed-off-by: Subha sree Mohankumar <smohanku@redhat.com>
* cluster/afr: Sending subvol up/down events when subvol comes up or goes downkarthik-us2017-09-201-0/+2
| | | | | | Change-Id: I6580351b245d5f868e9ddc6a4eb4dd6afa3bb6ec BUG: 1493539 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* afr: discover/lookup heal fixesRavishankar N2017-09-041-50/+51
| | | | | | | | | | | | Addresses review comments in commit 468ca877807625817b72921d1e9585036687b640 Change-Id: I04b1bd3b00abfd6758798d6272954e36a24249a9 BUG: 1473636 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: https://review.gluster.org/18187 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: check validity of afr_replyRavishankar N2017-08-311-9/+14
| | | | | | | | | | | | | | | | | ...in various self-heal code paths. Originally found by Pranith in __afr_selfheal_name_impunge () Also change __afr_selfheal_assign_gfid() to send lookup only on those bricks that don't have a gfid matching that of the source. Change-Id: I70a2ccd750a2af92c5fc36e0eefb2b6125404b4a BUG: 1482923 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: https://review.gluster.org/18065 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* afr: heal metadata in discover code pathRavishankar N2017-08-161-31/+55
| | | | | | | | | | | | | | | | | | | | During graph switch, if fuse sends nameless (gfid) lookups, afr takes the discover code path to serve it. If there are pending metadata heals, they do not happen unless an inode refresh happens as a part of discover (which is not guaranteed to happen always). This patch fixes it by attempting metadata heal as a part of discover, just like how it is done in lookup code path. Also removed creating superfluous heal frames when launching heal. Change-Id: I49868649361ebe5d70b6ea150f4686169b6c3070 BUG: 1473636 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: https://review.gluster.org/17850 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Karthik U S <ksubrahm@redhat.com>
* cluster/afr: GFID split-brain resolution with existing CLIkarthik-us2017-07-181-1/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Currently there is no way for the admin from CLI to resolve gfid split-brain based on some policy like choice of the brick, mtime or size. Fix: With the existing CLI options based on size, mtime, and choice of brick, we do lookup on the parent for the specified file. As part of the lookup, if we find gfid mismatch, we resolve them based on the policy and return. If the file is not in gfid split- brain, then we check for the data and metadata split-brain in the getxattr code path, and resolve if any. This will work provided absolute path to the file with the CLI and not with gfid of the file. Hence the source-brick policy without any file path will also not resolve the gfid split-brain since it uses the gfid of the files. But it can resolve any other type of split-brains and skip the gfid mismatch resolution with the usual error message. Reverting the change https://review.gluster.org/17290. This patch resolves the issue. Fixes gluster/glusterfs#135 Change-Id: Iaeba6fc32f184a34255d03be87cda02773130a09 BUG: 1459530 Signed-off-by: karthik-us <ksubrahm@redhat.com> Reviewed-on: https://review.gluster.org/17485 Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: Implement quorum for lk fopPranith Kumar K2017-06-191-18/+38
| | | | | | | | | | | | | | | | | | | Problem: At the moment when we have replica 3 or arbiter setup, even when lk succeeds on just one brick we give success to application which is wrong Fix: Consider quorum-number of successes as success when quorum is enabled. BUG: 1461792 Change-Id: I5789e6eb5defb68f8a0eb9cd594d316f5cdebaea Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: https://review.gluster.org/17524 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com>
* afr: update errno check in afr_inode_refresh_doRavishankar N2017-06-051-1/+1
| | | | | | | | | | | | | Addresses review comment in https://review.gluster.org/#/c/17413 Change-Id: Ic247729e5e92a5bb0148543764e0b30790444004 BUG: 1456582 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: https://review.gluster.org/17436 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: add errno to afr_inode_refresh_done()Ravishankar N2017-05-311-7/+16
| | | | | | | | | | | | | | | | | | | | | | | | | Problem: When parellel `rm -rf`s were being done from cifs clients, opendir might fail on some replicas with ENOENT. DHT ignores partial opendir failures in dht_fd_cbk() and winds readdirs on those replicas. Afr inode refresh (as a part of readdirp read_txn) sees in its fd context that the state of the fds is *not* AFR_FD_OPENED and bails out to afr_inode_refresh_done() without doing a refresh. When this happens, the errno is set as EIO due to lack of readable subvols, logging split-brain messages in the logs. Fix: Introduce an errno argument to afr_inode_refresh_do() to bail out with the right error value when inode refresh is not performed. Change-Id: I075707fbb73fd93a923b77b923a96aac79e847f9 BUG: 1456582 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: https://review.gluster.org/17413 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: Return the list of node_uuids for the subvolumekarthik-us2017-05-171-0/+49
| | | | | | | | | | | | | | | | | | | | | | | | Problem: AFR was returning the node uuid of the first node for every file if the replica set was healthy, which was resulting in only one node migrating all the files. Fix: With this patch AFR returns the list of node_uuids to the upper layer, so that they can decide on which node to migrate which files, resulting in improved performance. Ordering of node uuids will be maintained based on the ordering of the bricks. If a brick is down, then the node uuid for that will be set to all zeros. Change-Id: I73ee0f9898ae473584fdf487a2980d7a6db22f31 BUG: 1366817 Signed-off-by: karthik-us <ksubrahm@redhat.com> Reviewed-on: https://review.gluster.org/17084 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* afr: send the correct iatt values in fsync cbkRavishankar N2017-05-111-25/+43
| | | | | | | | | | | | | | | | | | | | | | | | Problem: afr unwinds the fsync fop with an iatt buffer from one of its children on whom fsync was successful. But that child might not be a valid read subvolume for that inode because of pending heals or because it happens to be the arbiter brick etc. Thus we end up sending the wrong iatt to mdcache which will in turn serve it to the application on a subsequent stat call as reported in the BZ. Fix: Pick a child on whom the fsync was successful *and* that is readable as indicated in the inode context. Change-Id: Ie8647289219cebe02dde4727e19a729b3353ebcf BUG: 1449329 RCA'ed-by: Miklós Fokin <miklos.fokin@appeartv.com> Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: https://review.gluster.org/17227 CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org>
* afr: fixes to quorum-type in afr_priv_dump()Ravishankar N2017-05-101-0/+2
| | | | | | | | | | | | | | | | | | Include the 'none' option as well in the output. This fixes the bug in commit 335555d256d444f4952ce239168f72b393370f01. Also added a test-case. This is a Signed-off-by: Ravishankar N <ravishankar@redhat.com> Change-Id: I479a14ae69ecae5a03e85e73ed50c19b483df603 BUG: 1448804 Reviewed-on: https://review.gluster.org/17215 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* afr: include quorum type and count when dumping afr privRavishankar N2017-05-081-0/+6
| | | | | | | | | | | | | | | | Dump the client quorum type ('auto' or 'fixed'). If it is 'fixed', also dump the quorum-count. This information will be available in the client statedump and in /<fuse_mount>/.meta/graphs/active/testvol-replicate-X/private. Change-Id: Idbd6e2acbd622d4e6cfabf511e649a6da0e42384 BUG: 1448804 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: https://review.gluster.org/17196 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* Halo Replication feature for AFR translatorKevin Vigor2017-05-021-54/+310
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Summary: Halo Geo-replication is a feature which allows Gluster or NFS clients to write locally to their region (as defined by a latency "halo" or threshold if you like), and have their writes asynchronously propagate from their origin to the rest of the cluster. Clients can also write synchronously to the cluster simply by specifying a halo-latency which is very large (e.g. 10seconds) which will include all bricks. In other words, it allows clients to decide at mount time if they desire synchronous or asynchronous IO into a cluster and the cluster can support both of these modes to any number of clients simultaneously. There are a few new volume options due to this feature: halo-shd-latency: The threshold below which self-heal daemons will consider children (bricks) connected. halo-nfsd-latency: The threshold below which NFS daemons will consider children (bricks) connected. halo-latency: The threshold below which all other clients will consider children (bricks) connected. halo-min-replicas: The minimum number of replicas which are to be enforced regardless of latency specified in the above 3 options. If the number of children falls below this threshold the next best (chosen by latency) shall be swapped in. New FUSE mount options: halo-latency & halo-min-replicas: As descripted above. This feature combined with multi-threaded SHD support (D1271745) results in some pretty cool geo-replication possibilities. Operational Notes: - Global consistency is gaurenteed for synchronous clients, this is provided by the existing entry-locking mechanism. - Asynchronous clients on the other hand and merely consistent to their region. Writes & deletes will be protected via entry-locks as usual preventing concurrent writes into files which are undergoing replication. Read operations on the other hand should never block. - Writes are allowed from _any_ region and propagated from the origin to all other regions. The take away from this is care should be taken to ensure multiple writers do not write the same files resulting in a gfid split-brain which will require resolution via split-brain policies (majority, mtime & size). Recommended method for preventing this is using the nfs-auth feature to define which region for each share has RW permissions, tiers not in the origin region should have RO perms. TODO: - Synchronous clients (including the SHD) should choose clients from their own region as preferred sources for reads. Most of the plumbing is in place for this via the child_latency array. - Better GFID split brain handling & better dent type split brain handling (i.e. create a trash can and move the offending files into it). - Tagging in addition to latency as a means of defining which children you wish to synchronously write to Test Plan: - The usual suspects, clang, gcc w/ address sanitizer & valgrind - Prove tests Reviewers: jackl, dph, cjh, meyering Reviewed By: meyering Subscribers: ethanr Differential Revision: https://phabricator.fb.com/D1272053 Tasks: 4117827 Change-Id: I694a9ab429722da538da171ec528406e77b5e6d1 BUG: 1428061 Signed-off-by: Kevin Vigor <kvigor@fb.com> Reviewed-on: http://review.gluster.org/16099 Reviewed-on: https://review.gluster.org/16177 Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: all children of AFR must be up to resolve s-brainRavishankar N2017-02-091-10/+22
| | | | | | | | | | | | | | | | | | | | | | | | Problem: The various split-brain resolution policies (favorite-child-policy based, CLI based and mount (get/setfattr) based) attempt to resolve split-brain even when not all bricks of replica are up. This can be a problem when say in a replica 3, the only good copy is down and the other 2 bricks are up and blame each other (i.e. split-brain). We end up healing the file in such a case and allow I/O on it. Fix: A decision on whether the file is in split-brain or not must be taken only if we are able to examine the afr xattrs of *all* bricks of a given replica. Change-Id: Icddb1268b380005799990f5379ef957d84639ef9 BUG: 1417522 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: https://review.gluster.org/16476 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: Avoid resetting event_gen when brick is always downRavishankar N2017-01-121-11/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: __afr_set_in_flight_sb_status(), which resets event_gen to zero, is called if failed_subvols[i] is non-zero for any brick. But failed_subvols[i] is true even if the brick was down *before* the transaction started. Hence say if 1 brick is down in a replica-3, every writev that comes will trigger an inode refresh because of this resetting, as seen from the no. of FSTATs in the profile info in the BZ. Fix: Reset event gen only if the brick was previously a valid read child and the FOP failed on it the first time. Also `s/afr_inode_read_subvol_reset/afr_inode_event_gen_reset` because the function only resets event gen and not the data/metadata readable. Change-Id: I603ae646cbde96995c35db77916e2ed80b602a91 BUG: 1409206 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/16309 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* ec: Invalidations in disperse volume should not update the statPoornima G2017-01-051-2/+1
| | | | | | | | | | | | | | | | | | | | | | Issue: In disperse volume, the file is present across bricks, hence the stat from one brick doesn't carry the valid size of the file. Therefore the upcall from one brick updating the md-cache results in wrong size being updated. Fix: If the notification is cache invalidation then, indicate md-cache that the attributes is invalid. BUG: 1410375 Change-Id: Id89d2283478e70b62b435a8891fffc86d2be8cb2 Signed-off-by: Poornima G <pgurusid@redhat.com> Reviewed-on: http://review.gluster.org/16329 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: use accused matrix instead of readable matrix for deciding healsRavishankar N2016-12-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | Problem: afr_replies_interpret() used the 'readable' matrix to trigger client side heals after inode refresh. But for arbiter, readable is always zero. So when `dd` is run with a data brick down, spurious data heals are are triggered. These heals open an fd, causing eager lock to be disabled (open fd count >1) in afr transactions, leading to extra FXATTROPS Fix: Use the accused matrix (derived from interpreting the afr pending xattrs) to decide whether we can start heal or not. Change-Id: Ibbd56c9aed6026de6ec42422e60293702aaf55f9 BUG: 1408395 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/16277 NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: Ignore event_generation checks post inode refresh for write txnsRavishankar N2016-12-221-1/+1
| | | | | | | | | | | | | | | | | | | Before http://review.gluster.org/#/c/15673/, after inode refresh, we failed read txns in case of EIO or event_generation being zero. For write transactions, the check was only for EIO. 15673 re-factored the code to fail both read and write when event_generation=0. This seems to have caused a regression as explained in the BZ. This patch restores that behaviour in afr_txn_refresh_done(). Change-Id: Ib8e116506badce6f58b55827dbe403d95069d744 BUG: 1406224 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/16205 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
* cluster/afr: Fix per-txn optimistic changelog initialisationKrutika Dhananjay2016-12-121-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Incorrect initialisation of local->optimistic_change_log was leading to skipped pre-op and post-op even when a brick didn't participate in the txn because it was down. The result - missing granular name index resulting in some entries never getting healed. FIX: Initialise local->optimistic_change_log just before pre-op. Also fixed granular entry heal to create the granular name index in pre-op as opposed to post-op. This is to prevent loss of granular information when during an entry txn, the good (src) brick goes offline before the post-op is done. This would cause self-heal to do conservative merge (since dirty xattr is the only information available), which when granular-entry-heal is enabled, expects granular indices, the lack of which can lead to loss of data in the worst case. Change-Id: Ia3ad716d6fb1821555f02180e86e8711a79f958d BUG: 1402730 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/16075 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: Remove backward compatibility for locks with v1Pranith Kumar K2016-12-071-36/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we have cascading locks with same lk-owner there is a possibility for a deadlock to happen. One example is as follows: self-heal takes a lock in data-domain for big name with 256 chars of "aaaa...a" and starts heal in a 3-way replication when brick-0 is offline and healing from brick-1 to brick-2 is in progress. So this lock is active on brick-1 and brick-2. Now brick-0 comes online and an operation wants to take full lock and the lock is granted at brick-0 and it is waiting for lock on brick-1. As part of entry healing it takes full locks on all the available bricks and then proceeds with healing the entry. Now this lock will start waiting on brick-0 because some other operation already has a granted lock on it. This leads to a deadlock. Operation is waiting for unlock on "aaaa..." by heal where as heal is waiting for the operation to unlock on brick-0. Initially I thought this is happening because healing is trying to take a lock on all the available bricks instead of just the bricks that are participating in heal. But later realized that same kind of deadlock can happen if a brick goes down after the heal starts but comes back before it completes. So the essential problem is the cascading locks with same lk-owner which were added for backward compatibility with afr-v1 which can be safely removed now that versions with afr-v1 are already EOL. This patch removes the compatibility with v1 which requires cascading locks with same lk-owner. In the next version we can make locking-scheme option a dummy and switch completely to v2. BUG: 1401404 Change-Id: Ic9afab8260f5ff4dff5329eb0429811bcb879079 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/16024 Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: Serialize conflicting locks on all subvolsPranith Kumar K2016-12-061-32/+50
| | | | | | | | | | | | | | | | | | | | | | | Problem: 1) When a blocking lock is issued and the parallel lock phase fails on all subvolumes with EAGAIN, it is not switching to serialized locking phase. 2) When quorum is enabled and locks fail partially it is better to give errno returned by brick rather than the default quorum errno. Fix: Handled this error case and changed op_errno to reflect the actual errno in case of quorum error. BUG: 1369077 Change-Id: Ifac2e4a13686e9fde601873012700966d56a7f31 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/15984 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com>
* afr, client: More mem-leak fixes in COMPOUND fop cbkKrutika Dhananjay2016-12-041-12/+0
| | | | | | | | | | | | | | | | | | | | | | Bugs found and fixed: 1. Use correct subvolume index in pre-op-writev compound cbk 2. Prevent use-after-free of local->compound_args members in compound fops cbk in protocol/client 3. Fix xdata and xattr leaks in client_process_response 4. Fix possible leak of xdata in client_pre_writev() in test mode. 5. Free req->compound_req_array.compound_req_array_val as well after freeing its members 6. Free tmp_rsp->flock.lk_owner.lk_owner_val in LK fop. Change-Id: I15b646d7d4e0e5cd4ea3d2d6452c815cf2eaf68f BUG: 1401218 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/16020 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* selfheal: fix memory leak on client side healing queueMateusz Slupny2016-12-021-1/+4
| | | | | | | | | | | | | Change-Id: I2beaba829710565a3246f7449a5cd21755cf5f7d BUG: 1399592 Signed-off-by: Mateusz Slupny <mateusz.slupny@appeartv.com> Reviewed-on: http://review.gluster.org/15968 Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Ravishankar N <ravishankar@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org>
* afr: fix auto-quorumJeff Darcy2016-11-281-37/+0
| | | | | | | | | | | | | | | | | | | | | | | | | (1) afr_have_quorum is dead code. It was copied to afr_has_quorum, and everything else uses that, but the original was never deleted (until now). (2) Auto-quorum should be default for any N>2. Leaving quorum disabled is BAD, but apparently deemed acceptable for N=2 because there's no real quorum in that case. For any larger number (including arbiter configurations) there is such a thing as real quorum and we should use it by default. Note that for N=3 the answers we get from "N % 2" (the old check) and "N > 2" (the new one) are the same. (3) The special case for even N in afr_has_quorum has been simplified and explained more thoroughly in a comment. Change-Id: I48b33c15093512fecf516b26dcf09afecb7ae33b Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/15873 Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Vijay Bellur <vbellur@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: Fix the EIO that can occur in afr_inode_refresh as a resultPoornima G2016-11-281-16/+74
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | of cache invalidation(upcall). Issue: ------ When a cache invalidation is recieved as a result of changing pending xattr, the read_subvol is reset. Consider the below chain of execution: CHILD_DOWN ... afr_readv ... afr_inode_refresh ... afr_inode_read_subvol_reset <- as a result of pending xattr set by some other client GF_EVENT_UPCALL will be sent afr_refresh_done -> this results in an EIO, as the read subvol was reset by the end of the afr_inode_refresh Solution: --------- When GF_EVENT_UPCALL is recieved, instead of resetting read_subvol, set a variable need_refresh in inode_ctx, the next time some one starts a txn, along with event gen, need_rrefresh also needs to be checked. Change-Id: Ifda21a7a8039b8874215e1afa4bdf20f7d991b58 BUG: 1396952 Signed-off-by: Poornima G <pgurusid@redhat.com> Reviewed-on: http://review.gluster.org/15892 Reviewed-by: Ravishankar N <ravishankar@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: allow I/O when favorite-child-policy is enabledRavishankar N2016-11-271-3/+202
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Currently, I/O on a split-brained file fails even when the favorite-child-policy is set until the self-heal is complete. Fix: If a valid 'source' is found using the set favorite-child-policy, inspect and reset the afr pending xattrs on the 'sinks' (inside appropriate locks), refresh the inode and then proceed with the read or write transaction. The resetting itself happens in the self-heal code and hence can also happen in the client side background-heal or by the shd's index-heal in addition to the txn code path explained above. When it happens in via heal, we also add checks in undo-pending to not reset the sink xattrs again. Change-Id: Ic8c1317720cb26bd114b6fe6af4e58c73b864626 BUG: 1386188 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reported-by: Simon Turcotte-Langevin <simon.turcotte-langevin@ubisoft.com> Reviewed-on: http://review.gluster.org/15673 Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: Fix bugs in [f]inodelk/[f]entrylkPranith Kumar K2016-11-261-325/+361
| | | | | | | | | | | | | | | | | | | | | | Problems: 1) Inodelk is not taking quorum into account 2) finodelk, [f]entrylk are not implemented correctly 3) By default afr doesn't go for non-blocking parallel locks. Fix: Implemented a common framework which can be used by [f]inodelk/[f]entrylk. Used quorum for the same. Change-Id: I239f13875a065298630d266941df10cfa3addc85 BUG: 1369077 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/15802 Tested-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
* afr,dht,ec: Replace GF_EVENT_CHILD_MODIFIED with event SOME_DESCENDENT_DOWN/UPPoornima G2016-11-211-22/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently these are few events related to child_up/down: GF_EVENT_CHILD_UP : Issued when any of the protocol client connects. GF_EVENT_CHILD_MODIFIED : Issued by afr/dht/ec GF_EVENT_CHILD_DOWN : Issued when any of the protocol client disconnects. These events get modified at the dht/afr/ec layers. Here is a brief on the same. DHT: - All the subvolumes reported once, and atleast one child came up, then GF_EVENT_CHILD_UP is issued - connect GF_EVENT_CHILD_UP is issued - disconnect GF_EVENT_CHILD_MODIFIED is issued - All the subvolumes disconnected, GF_EVENT_CHILD_DOWN is issued AFR: - First subvolume came up, then GF_EVENT_CHILD_UP is issued - Subsequent subvolumes coming up, results in GF_EVENT_CHILD_MODIFIED - Any of the subvolumes go down, then GF_EVENT_SOME_CHILD_DOWN is issued - Last up subvolume goes down, then GF_EVENT_CHILD_DOWN is issued Until the patch [1] introduced GF_EVENT_SOME_CHILD_UP, GF_EVENT_CHILD_MODIFIED was issued by afr/dht when any of the subvolumes go up or down. Now with md-cache changes, there is a necessity to differentiate between child up and down. Hence, introducing GF_EVENT_SOME_DESCENDENT_DOWN/UP and getting rid of GF_EVENT_CHILD_MODIFIED. [1] http://review.gluster.org/12573 Change-Id: I704140b6598f7ec705493251d2dbc4191c965a58 BUG: 1396038 Signed-off-by: Poornima G <pgurusid@redhat.com> Reviewed-on: http://review.gluster.org/15764 CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: N Balachandran <nbalacha@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Rajesh Joseph <rjoseph@redhat.com>
* features/shard: Fill loc.pargfid too for named lookups on individual shardsKrutika Dhananjay2016-11-081-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On a sharded volume when a brick is replaced while IO is going on, named lookup on individual shards as part of read/write was failing with ENOENT on the replaced brick, and as a result AFR initiated name heal in lookup callback. But since pargfid was empty (which is what this patch attempts to fix), the resolution of the shards by protocol/server used to fail and the following pattern of logs was seen: Brick-logs: [2016-11-08 07:41:49.387127] W [MSGID: 115009] [server-resolve.c:566:server_resolve] 0-rep-server: no resolution type for (null) (LOOKUP) [2016-11-08 07:41:49.387157] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-rep-server: 91833: LOOKUP(null) (00000000-0000-0000-0000-000000000000/16d47463-ece5-4b33-9c93-470be918c0f6.82) ==> (Invalid argument) [Invalid argument] Client-logs: [2016-11-08 07:41:27.497687] W [MSGID: 114031] [client-rpc-fops.c:2930:client3_3_lookup_cbk] 2-rep-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument] [2016-11-08 07:41:27.497755] W [MSGID: 114031] [client-rpc-fops.c:2930:client3_3_lookup_cbk] 2-rep-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument] [2016-11-08 07:41:27.498500] W [MSGID: 114031] [client-rpc-fops.c:2930:client3_3_lookup_cbk] 2-rep-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument] [2016-11-08 07:41:27.499680] E [MSGID: 133010] Also, this patch makes AFR by itself choose a non-NULL pargfid even if its ancestors fail to initialize all pargfid placeholders. Change-Id: I5f85b303ede135baaf92e87ec8e09941f5ded6c1 BUG: 1392445 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/15788 CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org>
* md-cache, afr: Reduce the window of stale readPoornima G2016-10-201-1/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Consider a replica setup, where one mount writes data to a file and the other mount reads the file. In afr, read operations are not transaction based, a brick(read subvolume) is chosen as a part of lookup or other operations, read is always wound only to the read subvolume, even if there was write from a different client that failed on this brick. This stale read continues until there is a lookup or any write operation from the mount point. Currently, this is not a major issue, as a lookup is issued before every read and it will switch the read subvolume to a correct one. But with the plan of increasing md-cache timeout to 600s, the stale read problem will be more pronounced, i.e. stale read can continue for 600s(or more if cascaded with readdirp), as there will be no lookups. Solution: Afr doesn't have any built-in solution for stale read(without affecting the performance). The solution that came up, was to use upcall. When a file on any brick is marked bad for the first time, upcall sends a notification to all the clients that had recently accessed the file. The solution has 2 parts: - Identifying when a file is marked bad, on any of the bricks, for the first time - Client side actions on recieving the notifications Identifying when a file is marked bad on any of the bricks for the first time: ----------------------------------------------------------------------------- The idea is to track xattrop in upcall. xattrop currently comes with 2 afr xattrs - afr dirty bit and afr pending xattrs. Dirty xattr is set to 1 before every write, and is unset if write succeeds. In certain scenarios, dirty xattr can be 0 and still the file could be bad copy. Hence do not track dirty xattr. Pending xattr is set on the good copy, indicating the other bricks that have bad copy. It is still not as simple as, notifying when any of the pending xattrs change. It could lead to flood of notifcations, in case the other brick is completely down or consistantly failing. Hence it is important to notify only once, the first time a good copy is marked bad. Client side actions on recieving pending xattr change, notification: -------------------------------------------------------------------- md-cache will invalidate the cache of that file, so that further lookup is passed down to afr and hence update the read subvolume. Invalidating only in md-cache is not enough, consider the folling oder of opertaions: - pending xattr invalidation - invalidate md-cache - readdirp on the bad read subvolume - fill md-cache - lookup (served from md-cache) - read - wound to the old read subvol. Hence, along with invalidating md-cache, it is very important to reset the read subvolume for that file, in afr. Design Credit: Anuradha Talur, Ravishankar N 1. xattrop doesn't carry info saying post op/pre op. 2. Pre xattrop will have 0 value for all pending xattrs, the cbk of pre xattrop carries the on-disk xattr value. Non zero indicated healing is required. 3. Post xattrop will have non zero value for any of the pending xattrs, if the fop failed on any of the bricks. Change-Id: I469cbc111714c433984fe1c922be2ef113c25804 BUG: 1211863 Signed-off-by: Poornima G <pgurusid@redhat.com> Reviewed-on: http://review.gluster.org/15398 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: Prevent dict_set() on NULL dictPranith Kumar K2016-10-151-1/+2
| | | | | | | | | | | | | | | | | In afr lookup when NULL dict is received in lookup, afr is supposed to set all the xattrs it requires in a new dict it creates, but for 'link-count' it is trying to set to the dict that is passed in lookup which can be NULL sometimes. This is leading to error logs. Fixed the same in this patch. BUG: 1385104 Change-Id: I679af89cfc410cbc35557ae0691763a05eb5ed0e Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/15646 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com>
* afr: Implement IPC fopPoornima G2016-09-291-0/+120
| | | | | | | | | | | | | | | | | | | | | Currently ipc() is not implemented in afr. md-cache and upcall uses ipc to register the list of xattrs, [1] for more details. For the ipc op GF_IPC_TARGET_UPCALL, it has to be wound to all the replica subvolumes. ipc() is failed when any of the subvolumes fails with other than ENOTCONN or all of the subvolumes are down. [1] http://review.gluster.org/#/c/15002/ Change-Id: I0f651330eafda64e4d922043fe53bd0014536247 BUG: 1211863 Signed-off-by: Poornima G <pgurusid@redhat.com> Reviewed-on: http://review.gluster.org/15378 Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: Ignore gluster internal (virtual) xattrs in metadata heal checkRavishankar N2016-09-261-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | Problem: In arbiter configuration, posix-xlator in the arbiter brick always sets the GF_CONTENT_KEY in the response dict with a value 0. If the file size on the data bricks is more than quick-read's max-file-size (64kb default), those bricks don't set the key. Because of this difference in the no. of dict elements, afr triggers metadata heal in lookup code path, in turn leading to extra lookups+inodelks. Fix: Changed afr dict comparison logic to ignore all virtual xattrs and the on-disk ones that we should not be healing. Also removed is_virtual_xattr() function. The original callers to this function (upcall) don't seem to need it anymore. Change-Id: I05730bdd39d8fb0b9a49a5fc9c0bb01f0d3bb308 BUG: 1378684 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/15548 NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Smoke: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: add replication eventsRavishankar N2016-09-061-2/+14
| | | | | | | | | | | | | | | | | | | | | Added the following events for the eventing framework: "EVENT_AFR_QUORUM_MET", --> Sent when quorum is met. "EVENT_AFR_QUORUM_FAIL" -->Sent when quorum is lost. "EVENT_AFR_SUBVOL_UP" -->Sent when afr witnesses the first up subvolume. "EVENT_AFR_SUBVOLS_DOWN"-->Sent when all children of an afr subvol are down. "EVENT_AFR_SPLIT_BRAIN" -->Sent when self-heal detects split-brain in heal path (not read/write path). Change-Id: I937c61ca1ce78b5922ade73c7bfa3051df59c513 BUG: 1371485 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/15349 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* afr: Consume compound fops in afr transactionAnuradha Talur2016-09-011-0/+55
| | | | | | | | | | | | | Change-Id: Ib06ece3cce1b10d28d6d2953da28444f5c2457ad BUG: 1290304 Signed-off-by: Anuradha Talur <atalur@redhat.com> Reviewed-on: http://review.gluster.org/15014 Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: Give option to do consistent-ioPranith Kumar K2016-08-221-32/+161
| | | | | | | | | | | | | | | | | | | | | | | Problem: When tiering/rebalance does migrations and afr with 2-way replica is in picture, migration can read stale data if the source brick goes down and writes to the destination. After this deletion of the file leads to permanent loss of data after migration. Fix: Rebalance/tiering should migrate only when the data is definitely not stale. So introduce an option in afr called consistent-io which will be enabled in migration daemons. BUG: 1306398 Change-Id: I750f65091cc70a3ed4bf3c12f83d0949af43920a Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/13425 Reviewed-by: Anuradha Talur <atalur@redhat.com> Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: Prevent split-brain when bricks are brought off and on in ↵Krutika Dhananjay2016-08-221-6/+119
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cyclic order When the bricks are brought offline and then online in cyclic order while writes are in progress on a file, thanks to inode refresh in write txns, AFR will mostly fail the write attempt when the only good copy is offline. However, there is still a remote possibility that the file will run into split-brain if the brick that has the lone good copy goes offline *after* the inode refresh but *before* the write txn completes (I call it in-flight split-brain in the patch for ease of reference), requiring intervention from admin to resolve the split-brain before the IO can resume normally on the file. To get around this, the patch does the following things: i) retains the dirty xattrs on the file ii) avoids marking the last of the good copies as bad (or accused) in case it is the one to go down during the course of a write. iii) fails that particular write with the appropriate errno. This way, we still have one good copy left despite the split-brain situation which when it is back online, will be chosen as source to do the heal. Change-Id: I9ca634b026ac830b172bac076437cc3bf1ae7d8a BUG: 1363721 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/15080 Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Oleksandr Natalenko <oleksandr@natalenko.name> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: some coverity fixesRavishankar N2016-07-261-92/+136
| | | | | | | | | | | | | | | Thanks to Krutika for a cleaner way to track inode refs in afr_set_split_brain_choice(). Change-Id: I2d968d05b815ad764b7e3f8aa9ad95a792b3c1df BUG: 1355604 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/14895 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/afr: Unwind with xdata in inode-write fopsPranith Kumar K2016-06-011-1/+1
| | | | | | | | | | | | | | | When there is a failure afr was not unwinding xdata to xlators above. xdata need not be NULL on failures. So it is important to send it to parent xlators. Change-Id: Ic36aac10a79fa91121961932dd1920cb1c2c3a4c BUG: 1340623 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/14567 Smoke: Gluster Build System <jenkins@build.gluster.com> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>