summaryrefslogtreecommitdiffstats
path: root/xlators/cluster/afr/src/afr-common.c
Commit message (Collapse)AuthorAgeFilesLines
* afr: launch index heal on local subvols up on a child-up eventRavishankar N2015-09-091-17/+11
| | | | | | | | | | | | | | | | | | | | | | | Backport of http://review.gluster.org/#/c/11912/ Problem: When a replica's child goes down and comes up, the index heal is triggered only on the child that just came up. This does not serve the intended purpose as the list of files that need to be healed to this child is actually captured on the other child of the replica. Fix: Launch index-heal on all local children of the replica xlator which just received a child up. Note that afr_selfheal_childup() eventually calls afr_shd_index_healer() which will not run the heal on non-local children. Change-Id: I524fda17c28864758b35679cfb232f81f8374571 BUG: 1256245 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/11994 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com> Tested-by: Raghavendra Bhat <raghavendra@redhat.com>
* cluster/afr: Pick gfid from poststat during fresh lookup for read child ↵Krutika Dhananjay2015-07-171-27/+41
| | | | | | | | | | | | | | calculation Backport of : http://review.gluster.org/11373 Change-Id: I03f11af082a0decf4ea084480b67e9e156964c76 BUG: 1235601 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/11408 Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Raghavendra Bhat <raghavendra@redhat.com>
* afr: honour selfheal enable/disable volume set optionsRavishankar N2015-06-151-1/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | Backport of http://review.gluster.org/11012 Note: http://review.gluster.org/9459 is not backported to 3.6 but the change it makes to afr_get_heal_info() (i.e. handling ret values) is needed for heal info to work correctly and tests/basic/afr/client-side-heal.t to pass. -------------------------- afr-v1 had the following volume set options that are used to enable/ disable self-heals from happening in AFR xlator when loaded in the client graph: cluster.metadata-self-heal cluster.data-self-heal cluster.entry-self-heal In afr-v2, these 3 heals can happen from the client if there is an inode refresh. This patch allows such heals to proceed only if the corresponding volume set options are set to true. -------------------------- Change-Id: Iebf863758d902fd2f95be320c6791d4e15f634e7 BUG: 1230259 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/11170 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anuradha Talur <atalur@redhat.com> Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com>
* cluster/afr: Treat op_ret >= 0 as success in afr_final_errno()Krutika Dhananjay2015-06-031-1/+1
| | | | | | | | | | | | Backport of: http://review.gluster.org/10946 Change-Id: I6ca36fee5dc46f2a2afe04f3359c4102f45c3ae0 BUG: 1225745 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/10961 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com>
* Tests: fix spurious failure in read-subvol-entry.tEmmanuel Dreyfus2015-05-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | read-subvol-entry.t tests that if a brick has pending operations, it is not used for readdir operations. On NetBSD this test exhibits spurious failures, with the wrong brick being used to perform readdir. It happens because when afr_replies_interpret() looks at xattr for pending attributes, it uses alternative bahvior whether it is working on a directory or another object. The decision is based on inode->ia_type, which may be IA_INVAL at that time if we come there from: afr_replies_interpret.() afr_xattrs_are_equal() afr_lookup_metadata_heal_chec() afr_lookup_entry_heal() afr_lookup_cbk() Using replies[i].poststat.ia_type, which is correctly set, works around the problem. Resubmitted as is after rebase. This is a backport of Id9ccdd8604f79a69db5f1902697f8913acac50ad BUG: 1138897 Change-Id: I73f5e04dec86e648a28363f417559b0cbf80324d Signed-off-by: Emmanuel Dreyfus <manu@netbsd.org> Reviewed-on: http://review.gluster.org/10178 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com> Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com>
* cluster/afr: serialize inode locksPranith Kumar K2015-03-251-59/+185
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Backport of http://review.gluster.org/9372 Problem: Afr winds inodelk calls without any order, so blocking inodelks from two different mounts can lead to dead lock when mount1 gets the lock on brick-1 and blocked on brick-2 where as mount2 gets lock on brick-2 and blocked on brick-1 Fix: Serialize the inodelks whether they are blocking inodelks or non-blocking inodelks. Non-blocking locks also need to be serialized. Otherwise there is a chance that both the mounts which issued same non-blocking inodelk may endup not acquiring the lock on any-brick. Ex: Mount1 and Mount2 request for full length lock on file f1. Mount1 afr may acquire the partial lock on brick-1 and may not acquire the lock on brick-2 because Mount2 already got the lock on brick-2, vice versa. Since both the mounts only got partial locks, afr treats them as failure in gaining the locks and unwinds with EAGAIN errno. BUG: 1189023 Change-Id: If5dd502d9d25d12425749a8efcf08a1423b29255 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/9576 Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com>
* cluster/afr: Handle getxattr of quota-size keyPranith Kumar K2015-03-141-59/+0
| | | | | | | | | | | | | | | Backport of http://review.gluster.org/9820 Afr needs to query QUOTA_SIZE_KEY from all the subvolumes and return the value which is maximum of the readable bricks. BUG: 1201624 Change-Id: I41725a7323020c1480c38560dc5ae2c2e82d6d47 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/9873 Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com>
* afr: stop encoding subvolume id in readdir d_offAnand Avati2015-03-031-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Backport of http://review.gluster.org/9332 The purpose of encoding d_off in AFR is to indicate the selected subvolume for the first readdir, and continue all further readdirs of the session on the same subvolume. This is required because, unlike files, dir d_offs are specific to the backend and cannot be re-used on another subvolume. The d_off transformation encodes the subvolume id and prevents such invalid use of d_offs on other servers. However, this approach could be quite wasteful of precious d_off bit-space. Unlike DHT, where server id can change from entry to entry and thus encoding the server id in the transformed d_off is necessary, we could take a slightly relaxed approach in AFR. The approach is to save the subvolume where the last readdir request was sent in the fd_ctx. This consumes constant space (i.e no per-entry cache), and serves the purpose of avoiding d_off "misuse" (i.e using d_off from one server on another). The compromise here is NFS resuming readdir from a non-0 cookie after an extended delay (either anonymous FD has been reclaimed, or server has restarted). In such cases a subvolume is picked freshly. To make this fresh picking more deterministic (i.e, to pick the same subvolume whenever possible, even after reboots), the function afr_hash_child (used by afr_read_subvol_select_by_policy) is modified to skip all dynamic inputs (i.e PID) for the case of directories. BUG: 1191537 Change-Id: I7e3bd8dfe346a9a8e428d7ddeada6cfb66e64e54 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/9638 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com>
* afr : Fixes to 59ba78ae1461651e290ce72013786d828545d4c1Anuradha2015-02-111-0/+4
| | | | | | | | | | | | | | | | | | | A bug was found while reviewing http://review.gluster.org/9377/ (inode was not being unreffed after use). It is fixed on master as a part of the following change - http://review.gluster.org/#/c/9377/7/xlators/cluster/afr/src/afr-common.c . Fixing the same issue in 3.6 with this patch. Change-Id: Ibdc4be49d88613ccb1f80349eb4d368710c0c24b BUG: 1173528 Signed-off-by: Anuradha <atalur@redhat.com> Reviewed-on: http://review.gluster.org/9450 Reviewed-by: Niels de Vos <ndevos@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com>
* cluster/afr: When parent and entry read subvols are different, set ↵Krutika Dhananjay2015-02-111-4/+52
| | | | | | | | | | | | | | | | | entry->inode to NULL Backport of: http://review.gluster.org/#/c/9477 That way a lookup would be forced on the entry, and its attributes will always be selected from its read subvol. Change-Id: Iaba25e2cd5f83e983fc8b1a1f48da3850808e6b8 BUG: 1186119 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/9562 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com>
* afr : glfs-heal implementationAnuradha2015-01-061-1/+312
| | | | | | | | | | | | | Backport of http://review.gluster.org/6529 and http://review.gluster.org/9119 Change-Id: Ie420efcb399b5119c61f448b421979c228b27b15 BUG: 1173528 Signed-off-by: Anuradha <atalur@redhat.com> Reviewed-on: http://review.gluster.org/9335 Reviewed-by: Ravishankar N <ravishankar@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com>
* features/marker: Filter internal xattrs in lookupPranith Kumar K2014-11-161-3/+3
| | | | | | | | | | | | | | Backport of http://review.gluster.org/9061 Afr should ignore quota-size-key as part of self-heal but should heal quota-limit key. BUG: 1163569 Change-Id: I93d203002eac4fe20b70730c27c852d783c16d7f Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/9110 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Launch self-heal only when all the brick status is knownPranith Kumar K2014-09-291-2/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: File goes into split-brain because of wrong erasing of xattrs. RCA: The issue happens because index self-heal is triggered even before all the bricks are up. So what ends up happening while erasing the xattrs is, xattrs are erased only on the sink brick for the brick that it thinks is up leading to split-brain Example: lets say the xattrs before heal started are: brick 2: trusted.afr.vol1-client-2=0x000000020000000000000000 trusted.afr.vol1-client-3=0x000000020000000000000000 brick 3: trusted.afr.vol1-client-2=0x000010040000000000000000 trusted.afr.vol1-client-3=0x000000000000000000000000 if only brick-2 came up at the time of triggering the self-heal only 'trusted.afr.vol1-client-2' is erased leading to the following xattrs: brick 2: trusted.afr.vol1-client-2=0x000000000000000000000000 trusted.afr.vol1-client-3=0x000000020000000000000000 brick 3: trusted.afr.vol1-client-2=0x000010040000000000000000 trusted.afr.vol1-client-3=0x000000000000000000000000 So the file goes into split-brain. BUG: 1142612 Change-Id: I0c8b66e154f03b636db052c97745399a7cca265b Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/8756 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Fix spurious metadata self-healsv3.6.0beta2Pranith Kumar K2014-09-241-3/+4
| | | | | | | | | | | | | | | Backport of http://review.gluster.org/8709 - Added logging for metadata and data self-heals which helped in debugging this issue. - Added checks to skip self-heals when no sinks are available to heal BUG: 1145987 Change-Id: Ide03af4f531a1280ec8ad95b627285df4d7bc42d Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/8832 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: perform list-xattr during lookupRavishankar N2014-09-191-5/+193
| | | | | | | | | | | | | Detect and heal mismatching user extended attributes during lookup. Backport of: http://review.gluster.org/8558 Change-Id: Id03c9746f083ffd3014711d0b3a2e5a71a45eed4 BUG: 1144274 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/8773 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr : Mark pending changelog xattrs for new creationsAnuradha2014-09-181-0/+35
| | | | | | | | | | | | | | | | | Backport of: http://review.gluster.org/8555 Based on type of file, set appropriate pending changelogs for new entries. Change-Id: Icf9af866fe9a9e511210e8ad097e968e2307d8ee BUG: 1141787 Reviewed-on: http://review.gluster.org/8555 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Signed-off-by: Anuradha <atalur@redhat.com> Reviewed-on: http://review.gluster.org/8748 Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Propagate EIO on inode's type mismatchKrutika Dhananjay2014-09-161-7/+13
| | | | | | | | | | | | | | | | Backport of: http://review.gluster.org/8574, and http://review.gluster.org/8586 Original author of the test script: Pranith Kumar K <pkarampu@redhat.com> Change-Id: I0c32bdd8e666f8175c0a8fbf940934e6ce469931 BUG: 1136830 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/8706 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Fix dict_t leaksKrutika Dhananjay2014-09-161-13/+20
| | | | | | | | | | | | | | | Backport of: http://review.gluster.org/#/c/8557/ dict_t objects that are ref'd in alloca'd "replies" in afr_replies_copy() are not unref'd before "replies" go out of scope. Change-Id: I9bb45bc673ec13292ac96dda060aceb48739ebe8 BUG: 1136831 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/8704 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Handle EAGAIN properly in inodelkPranith Kumar K2014-09-151-14/+125
| | | | | | | | | | | | | | | | | Problem: When one of the brick is taken down and brough back up in a replica pair, locks on that brick will be allowed. Afr returns inodelk success even when one of the bricks already has the lock taken. Fix: If any brick returns EAGAIN return failure to parent xlator. BUG: 1142020 Change-Id: Iee3f5990be75e10f8accec9bc3856e3f76d1593c Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/8744 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Perform gfid heal inside locks.Pranith Kumar K2014-09-121-3/+18
| | | | | | | | | | | | | | | | | | | | | Backport of http://review.gluster.org/8512 Problem: Allowing lookup with 'gfid-req' will lead to assigning gfid at posix layer. When two mounts perform lookup in parallel that can lead to both bricks getting different gfids leading to gfid-mismatch/EIO for the lookup. Fix: Perform gfid heal inside lock. BUG: 1136823 Change-Id: I1059fcd38338348b7e96a0bd4c35b0f49bf77aec Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/8589 Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* features/changelog: Capture "correct" internal FOPsVenky Shankar2014-09-081-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes changelog capturing internal FOPs in a cascaded setup, where the intermediate master would record internal FOPs (generated by DHT on link()/rename()). This is due to I/O happening on the intermediate slave on geo-replication's auxillary mount with client-pid -1. Currently, the internal FOP capturing logic depends on client pid being non-negative and the presence of a special key in dictionary. Due to this, internal FOPs on an inter-mediate master would be recorded in the changelog. Checking client-pid being non-negative was introduced to capture AFR self-heal traffic in changelog, thereby breaking cascading setups. By coincidence, AFR self-heal daemon uses -1 as frame->root->pid thereby making is hard to differentiate b/w geo-rep's auxillary mount and self-heal daemon. BUG: 1138952 Change-Id: Ia08a2cfa3b02bb785f343794f5b2695d44398c4c Original-Author: Venky Shankar <vshankar@redhat.com> Signed-off-by: Kotresh H R <khiremat@redhat.com> Reviewed-on: http://review.gluster.org/8347 Reviewed-by: Venky Shankar <vshankar@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com> Reviewed-on: http://review.gluster.org/8638
* cluster/afr: Fix mem-leakPranith Kumar K2014-08-191-2/+5
| | | | | | | | | | | | | | | | | | | | Backport of http://review.gluster.org/8457 Problem: local->xattr_req is already reffed with xattr_req that comes in lookup fop. But when afr_lookup_xattr_req_prepare is called local->xattr_req is over-written with dict_new() which leads to ref leak on the dict which came in lookup fop Fix: Create local->xattr_req only when it is NULL BUG: 1128801 Change-Id: I4a1065add7700317e0cd3d0dda0a91e12d77e340 Reviewed-on: http://review.gluster.org/8460 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* Fix resolution issues across fuse/server/afrPranith Kumar K2014-06-141-10/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problems with fuse/server: Fuse loc touch up sets loc->name even when pargfid is not known. Server lookup does (pargfid, name) based lookup when name is set ignoring the gfid. Because of this server resolver finds that the lookup came on (null-pargfid, name) and fails the lookup with EINVAL. Fix: Don't set loc->name in loc_touchup if the pargfid is not known. Did the same even for server-resolver Problem with afr: Lets say there is a directory hierarchy a/b/c/d on the mount and the user is cd'ed into the directory. Bring down one of the bricks of replica and remove all directories/files to simulate disk replacement on that brick. Now this brick is brought back up. Creates on the cd'ed directory fail with ESTALE. Basically before sending a create of 'f' inside 'd', fuse sends a lookup to make sure the file is not present. On one of the bricks 'd' is present and 'f' is not so it sends ENOENT as response. On the new brick 'd' itself is not present. So it sends ESTALE. In afr ESTALE is considered to be special errno on witnessing which lookup has to fail. And ESTALE is given more priority than ENOENT. Due to these reasons lookup fails with ESTALE rather than ENOENT. Since lookup didn't fail with ENOENT, 'create' can't be issued so the command is failed with ESTALE. Solution: Afr needs to consider ESTALE errno normally and ENOENT needs to be given more priority so that operations like create can proceed even when only one of the brick is up and running. Whenever client xlator identifies that gfid-changed, it sets that information in lookup xdata. Afr uses this information to fail the lookup with ESTALE so that top xlator can send fresh lookup. Change-Id: Ica6ce01baef08620154050a635e6f97d51029ef6 BUG: 1106408 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/8015 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: move messages to new logging frameworkRavishankar N2014-05-171-8/+11
| | | | | | | | | | | | | | Change important (from a diagnostics point of view) log messages to use the gf_msg() framework. Change-Id: I0a58184bbb78989db149e67f07c140a21c781bc2 BUG: 1075611 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/7784 Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: Fix bugs in quorum implementationPranith Kumar K2014-05-141-0/+14
| | | | | | | | | | | | | - Have common place to perform quorum fop wind check - Check if fop succeeded in a way that matches quorum to avoid marking changelog in split-brain. BUG: 1066996 Change-Id: Ibc5b80e01dc206b2abbea2d29e26f3c60ff4f204 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/7600 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Ravishankar N <ravishankar@redhat.com>
* cluster/afr: Mem leak fixes found in valgrind for iozonePranith Kumar K2014-04-091-0/+4
| | | | | | | | | | Change-Id: I869d191dc3470b2208c17343bbf772f01ef744cb BUG: 1085511 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/7424 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Remove eager-lock stub on finodelk failurePranith Kumar K2014-04-021-0/+19
| | | | | | | | | | | | | | | | | | | | Problem: For write fops afr's transaction eager-lock init adds transactions that can share eager-lock to fdctx list. But if eager-lock finodelk fop fails the stub remains in the list. This could later lead to corruption of the list and lead to infinite loop on the list leading to a mount hang. Fix: Remove the stub when finodelk fails. Change-Id: I0ed4bc6b62f26c5e891c1181a6871ee6e4f4f5fd BUG: 1063190 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/6944 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: refactorAnand Avati2014-03-221-2610/+1613
| | | | | | | | | | | | | | | | | | | | | | | | | | | | - Remove client side self-healing completely (opendir, openfd, lookup) - Re-work readdir-failover to work reliably in case of NFS - Remove unused/dead lock recovery code - Consistently use xdata in both calls and callbacks in all FOPs - Per-inode event generation, used to force inode ctx refresh - Implement dirty flag support (in place of pending counts) - Eliminate inode ctx structure, use read subvol bits + event_generation - Implement inode ctx refreshing based on event generation - Provide backward compatibility in transactions - remove unused variables and functions - make code more consistent in style and pattern - regularize and clean up inode-write transaction code - regularize and clean up dir-write transaction code - regularize and clean up common FOPs - reorganize transaction framework code - skip setting xattrs in pending dict if nothing is pending - re-write self-healing code using syncops - re-write simpler self-heal-daemon Change-Id: I1e4080c9796c8a2815c2dab4be3073f389d614a8 BUG: 1021686 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/6010 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Don't accept heal commands until graph is upPranith Kumar K2014-01-081-0/+4
| | | | | | | | | | Change-Id: Icca6c23b6a5965f462db8b65af3eb2e141c7cd39 BUG: 1049355 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/6658 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: avoid race due to afr_is_transaction_running()Ravishankar N2013-12-241-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: ------------------------------------------ afr_lookup_perform_self_heal() { if(afr_is_transaction_running()) goto out else afr_launch_self_heal(); } ------------------------------------------ When 2 clients simultaneously access a file in split-brain, one of them acquires the inode lock and proceeds with afr_launch_self_heal (which eventually fails and sets "sh-failed" in the callback.) The second client meanwhile bails out of afr_lookup_perform_self_heal() because afr_is_transaction_running() returns true due to the lock obtained by client-1. Consequetly in client-2, "sh-failed" does not get set in the dict, causing quick-read translator to *not* invalidate the inode, thereby serving data randomly from one of the bricks. Fix: If a possible split-brain is detected on lookup, forcefully traverse the afr_launch_self_heal() code path in afr_lookup_perform_self_heal(). Change-Id: I316f9f282543533fd3c958e4b63ecada42c2a14f BUG: 870565 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/6578 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Varun Shastry <vshastry@redhat.com>
* cluster/afr: Add foreground self-heal launch capability through lookupPranith Kumar K2013-12-161-16/+19
| | | | | | | | | | | | Also renamed allow-sh-for-running-transaction -> attempt-self-heal Change-Id: I134cc79e663b532e625ffc342c59e49e71644ab3 BUG: 1039544 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/6463 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: venkatesh somyajulu <vsomyaju@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Remove 'max' from the logPranith Kumar K2013-10-171-1/+1
| | | | | | | | | | | | This patch avoids giving more info to the user about the internal heuristic employed in afr, for quota sizes. Change-Id: Ice3a164399f09b6967500ec0c17dc340e7ae9aba BUG: 1016683 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/6098 Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr : Implementation of command "gluster volume heal vn statistics"Venkatesh Somyajulu2013-10-141-3/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "gluster volume heal volumename statistics" command gives the summary of the afr crawl done based on the entries present in the xattrop directory. Whenever afr crawls are attempted, the beginning time of crawl, end time of crawl, no of files healed, heal-failed count and number of files in split brain are shown along with the type of the crawl. If crawl is already in progress then it will give the number of files healed, heal failed count and number of files in split-brain from the beginning of the crawl and instead of telling the end time of the crawl, "CRAWL IN PROGRESS" message will be shown. Output format: command: "gluster volume heal volume-name statistics" Output: Gathering afr crawl statistics crawl statistics on volume volume-name has been successful ------------------------------------------------ Crawl statistics for brick no 0 Hostname of brick 192.168.122.248 Starting time of crawl: Wed Jul 10 15:52:38 2013 Ending time of crawl: Wed Jul 10 15:52:38 2013 Type of crawl: INDEX No. of entries healed: 0 No. of entries in split-brain: 0 No. of heal failed entries: 0 Starting time of crawl: Wed Jul 10 15:52:38 2013 Ending time of crawl: Wed Jul 10 15:52:38 2013 Type of crawl: INDEX No. of entries healed: 0 No. of entries in split-brain: 0 No. of heal failed entries: 0 ------------------------------------------------ Crawl statistics for brick no 1 Hostname of brick 192.168.122.1 Starting time of crawl: Wed Jul 10 15:52:42 2013 Ending time of crawl: Wed Jul 10 15:52:42 2013 Type of crawl: INDEX No. of entries healed: 0 No. of entries in split-brain: 0 No. of heal failed entries: 0 Starting time of crawl: Wed Jul 10 15:52:42 2013 Ending time of crawl: Wed Jul 10 15:52:42 2013 Type of crawl: INDEX No. of entries healed: 0 No. of entries in split-brain: 0 No. of heal failed entries: 0 -------------------------------------------------- Change-Id: I10bf9d10b005741db9973fb1352e0dd59ed99aa9 BUG: 949400 Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com> Reviewed-on: http://review.gluster.org/4790 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Handle quota size xattr separately in lookupPranith Kumar K2013-10-101-0/+87
| | | | | | | | | | | | | | | | | | Quota size xattrs are not maintained by afr. There is a possibility that they differ even when both the directory changelog xattrs suggest everything is fine. So if there is at least one 'source' check among the sources which has the maximum quota size. Otherwise check among all the available ones for maximum quota size. This way if there is a source and stale copies it always votes for the 'source'. Change-Id: Ia222379cbafa7043dd03f533c105860f2c7b8b0d BUG: 1016683 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/6052 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Varun Shastry <vshastry@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Have common inode-write-fop cbkPranith Kumar K2013-09-181-495/+7
| | | | | | | | | | Change-Id: Ia7b324b86d6a7051d187106d7a060155e77defc5 BUG: 910217 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5238 Reviewed-by: Ravishankar N <ravishankar@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Improvement in logging of self heal completion statusVenkatesh Somyajulu2013-08-291-0/+3
| | | | | | | | | | | Additional information for source and sinks are added. Change-Id: I1704956ff86ac3ae36744efe7499c1d1c43faeaf BUG: 968301 Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com> Reviewed-on: http://review.gluster.org/5638 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Add special handling for failure postopsPranith Kumar K2013-08-281-2/+6
| | | | | | | | | | | | | | | | Idea is to not leave the file in FOOL-FOOL scenario in case on all the bricks data transaction failed with EDQUOT to avoid increasing un-necessary load of self-heals in the system. For directory transactions don't leave pending changelog in case the failures are seen on all the subvolumes. Change-Id: I38a5561d1d581a78347a76a4a509514e4a0c3fb7 BUG: 969461 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5709 Reviewed-by: Anand Avati <avati@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* afr: treat appending writes as stable writes.Anand Avati2013-08-131-0/+2
| | | | | | | | | | | | | Durability of appending writes is implicit in the file size. Therefore performing an explicit fsync() is unnecessary in such cases as self-heal can check for the size of file when pending changelog is not unambiguous. Change-Id: I05446180a91d20e0dbee5de5a7085b87d57f178a BUG: 927146 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/5501 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: check for non-zero call_count before doing a stack windRavishankar N2013-08-071-0/+5
| | | | | | | | | | | | | | | | | | | | When one of the bricks of a 1x2 replicate volume is down, writes to the volume is causing a race between afr_flush_wrapper() and afr_flush_cbk(). The latter frees up the call_frame's local variables in the unwind, while the former accesses them in the for loop and sending a stack wind the second time. This causes the FUSE mount process (glusterfs) toa receive a SIGSEGV when the corresponding unwind is hit. This patch adds the call_count check which was removed when afr_flush_wrapper() was introduced in commit 29619b4e Change-Id: I87d12ef39ea61cc4c8244c7f895b7492b90a7042 BUG: 988182 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/5393 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Disable eager-lock if open-fd-count > 1Pranith Kumar K2013-08-021-0/+25
| | | | | | | | | | | | | | | | | Lets say mount1 has eager-lock(full-lock) and after the eager-lock is taken mount2 opened the same file, it won't be able to perform any data operations until mount1 releases eager-lock. To avoid such scenario do not enable eager-lock for transaction if open-fd-count is > 1. Delaying of changelog piggybacking is avoided in this situation. Change-Id: I51b45d6a7c216a78860aff0265a0b8dabc6423a5 BUG: 910217 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5432 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: venkatesh somyajulu <vsomyaju@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Print self-heal log when self-heal succeedsPranith Kumar K2013-07-311-0/+3
| | | | | | | | | Change-Id: I95e47e589419dc6a032cbd8ba01964b6c176c2d5 BUG: 927146 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5408 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Handle REPLICATE_TRASH_DIR from old bricksPranith Kumar K2013-07-261-0/+7
| | | | | | | | | Change-Id: Ib99f79d3fa607c818dbc62006516480f598d8add BUG: 886998 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4640 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* afr : change the log level in lookup path to minimize incessant logging.Ravishankar N2013-07-081-3/+3
| | | | | | | | | | | | | Change the logging levels from WARNING to DEBUG in the lookup path to minimize incessant logging in case of gfid mismatch errors. Change-Id: I631b16df3249cf826606f547531f985dac696088 BUG: 959083 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/4939 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Refactor inodelk to handle multiple domainsPranith Kumar K2013-07-031-7/+28
| | | | | | | | | | | | | | | | | | - afr_local_copy should not be memduping locked nodes, that would mean that lock is taken in self-heal on those nodes even before it actually takes the lock. So removed memdup code. Even entry lock related copying (lockee info) is also not necessary for self-heal functionality, so removing that as well. Since it is not local_copy anymore changed its name. - My editor changed tabs to spaces. Change-Id: I8dfb92cb8338e9a967c06907a8e29a8404782d61 BUG: 967717 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5099 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: post-op should complete before starting flushPranith Kumar K2013-07-031-18/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: At the moment afr-flush makes sure that a delayed post-op is woken up but it does not wait for it to complete the post-op before flush unwinds. These are the steps that are happening: 1) flush fop comes on an fd which wakes up a delayed post-op and continues with the flush fop. 2) post-op sends fsync on the wire. 3) flush completes and unwinds to fuse. 4) graph switch happens on the fuse mount disconnecting the old graph's client connections to bricks. 5) xattrop after fsync fails with ENOTCONN because the connections from old graph are taken down now. Fix: Wait for post-op to complete before starting to flush. We could make flush act similar to fsync (i.e.) wind flush as is but wait for post-op to complete before unwinding flush, but it is better to send flush as the final fop. So wind of flush will start after post-op is complete. Had to change fsync to accommodate this change. Change-Id: I93aa642647751969511718b0e137afbd067b388a BUG: 980548 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5274 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* nfs: Remove afr split-brain handling in nfsPranith Kumar K2013-06-251-5/+0
| | | | | | | | | | | | | | | | We added this code as an interim fix until afr can handle split-brains even when opens are not issued. Afr code has matured to reject fd based fops when there are split-brains so we can remove it. Change-Id: Ib337f78eccee86469a5eaabed1a547a2cea2bdcf BUG: 974972 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5227 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* glusterfs: discard (hole punch) supportBrian Foster2013-06-131-0/+241
| | | | | | | | | | | | | | | | Add support for the DISCARD file operation. Discard punches a hole in a file in the provided range. Block de-allocation is implemented via fallocate() (as requested via fuse and passed on to the brick fs) but a separate fop is created within gluster to emphasize the fact that discard changes file data (the discarded region is replaced with zeroes) and must invalidate caches where appropriate. BUG: 963678 Change-Id: I34633a0bfff2187afeab4292a15f3cc9adf261af Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-on: http://review.gluster.org/5090 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* gluster: add fallocate fop supportBrian Foster2013-06-131-0/+244
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement support for the fallocate file operation. fallocate allocates blocks for a particular inode such that future writes to the associated region of the file are guaranteed not to fail with ENOSPC. This patch adds fallocate support to the following areas: - libglusterfs - mount/fuse - io-stats - performance/md-cache,open-behind - quota - cluster/afr,dht,stripe - rpc/xdr - protocol/client,server - io-threads - marker - storage/posix - libgfapi BUG: 949242 Change-Id: Ice8e61351f9d6115c5df68768bc844abbf0ce8bd Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-on: http://review.gluster.org/4969 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* afr: let eager-locking do its own overlap checksAnand Avati2013-04-051-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | Today there is a non-obvious dependence of eager-locking on write-behind. The reason is that eager-locking works as long as the inheriting transaction has no overlaps with any of the transactions already in progress. While write-behind provides non-overlapping writes as a side-effect most of times (and only guarantees it when strict-write-ordering option is enabled, which is not on by default) eager-lock needs the behavior as a guarantee. This is leading to complex and unwanted checks for the presence of write-behind in the graph, for the simple task of checking for overlaps. This patch removes the interdependence between eager-locking and write-behind by making eager-locking do its own overlap checks with in-progress writes. Change-Id: Iccba1185aeb5f1e7f060089c895a62840787133f BUG: 912581 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4782 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: detect in-progress creation in lookup and return ENOENTPranith Kumar K2013-04-021-0/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | if any subvol returned ENOENT while parent entrylk lock was held, yield and return ENOENT for the entire lookup. This is how the issue happens: Multiple clients A, B and C are attempting 'mkdir -p /mnt/a/b/c' 1 Client A is in the middle of mkdir(/a). It has acquired lock. It has performed mkdir(/a) on one subvol, and second one is still in progress 2 Client B performs a lookup, sees directory /a on one, ENOENT on the other, succeeds lookup. 3 Client B performs lookup on /a/b on both subvols, both return ENOENT (one subvol because /a/b does not exist, another because /a itself does not exist) 4 Client B proceeds to mkdir /a/b. It obtains entrylk on inode=/a with basename=b on one subvol, but fails on other subvol as /a is yet to be created by Client A. 5 Client A finishes mkdir of /a on other subvol 6 Client C also attempts to create /a/b, lookup returns ENOENT on both subvols. 7 Client C tries to obtain entrylk on on inode=/a with basename=b, obtains on one subvol (where B had failed), and waits for B to unlock on other subvol. 8 Client B finishes mkdir() on one subvol with GFID-1 and completes transaction and unlocks 9 Client C gets the lock on the second subvol, At this stage second subvol already has /a/b created from Client B, but Client C does not check that in the middle of mkdir transaction 10 Client C attempts mkdir /a/b on both subvols. It succeeds on ONLY ONE (where Client B could not get lock because of missing parent /a dir) with GFID-2, and gets EEXIST from ONE subvol. This way we have /a/b in GFID mismatch. One subvol got GFID-1 because Client B performed transaction on only one subvol (because entrylk() could not be obtained on second subvol because of missing parent dir -- caused by premature/speculative succeeding of lookup() on /a when locks are detected). Other subvol gets GFID-2 from Client C because while it was waiting for entrylk() on both subvols, Client B was in the middle of creating mkdir() on only one subvol, and Client C does not "expect" this when it is between lock() and pre-op()/op() phase of the transaction. Original-author: Anand Avati <avati@redhat.com> Change-Id: Idca475dbbc2a51e09da6fa0f9e1e37148caef208 BUG: 860210 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4625 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>