summaryrefslogtreecommitdiffstats
path: root/xlators/cluster/afr
Commit message (Collapse)AuthorAgeFilesLines
* cluster/afr: Avoid order mismatch in blocking entrylksPranith Kumar K2013-06-051-6/+9
| | | | | | | | | | | | | | | | | | | | | | | Problem: When taking blocking entrylks, afr orders the entrylks based on uuid_compare of gfids of parent dirs, if they are equal then it orders them based on the basenames. While this approach works fine, the implementation assumes loc->gfids to be populated at the time of the comparison, but loc may have gfid in loc->inode->gfid instead of loc->gfid which was leading to order mismatches and dead-locks. Fix: Implemented loc_gfid which gives gfid by checking both loc->gfid, loc->inode->gfid. Used this for ordering the blocking entrylks. Change-Id: Ib0db36bbaf0df09fa87c3c3bb6a834db74fc2154 BUG: 965987 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5062 Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Removed misleading log from afr_start_crawlVenkatesh Somyajulu2013-05-311-2/+2
| | | | | | | | | | | | | | | Problem: If it is fresh volume with no files created the xattrop directory is not present. If crawl happens in that time, lookup for xattrop directory will fail and it results in printing a misleading log. Fix: Changed misleading log. Change-Id: Iae5da3e8423564d64096f88abdaf8c98e4935840 BUG: 928575 Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com> Reviewed-on: http://review.gluster.org/4993 Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: Club missing entry, missing gfid self-healsPranith Kumar K2013-05-071-45/+8
| | | | | | | | | | | | | | | | | | | | | Problem: gfid-self-heal always assigns the gfid(GFID-1) it gets from lookup. Between the time of lookup to triggering the gfid-self-heal the entry could have changed. Now lets say there is a case where one of the files of the replica subolumes already has a gfid (GFID-2) and the other does not. In that case healing should happen with GFID-2 instead of GFID-1. Fix: Missing-entry-self-heal already handles all these cases. So removed separate handling of gfid-self-heal. Change-Id: Ie96261e9036c8f3cb4cad89347f9bf7b681cdc1a BUG: 767585 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/2670 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Avoid self-healing extended attribute used by SELinux.Vijay Bellur2013-04-301-0/+8
| | | | | | | | | | | | | | | | | | Since removexattr() fails to remove "security.selinux" in a system where SELinux is enforcing, xattr self-healing fails. As a consequence of this, user extended attributes are not being healed. Added a check in afr to prune SELinux xattr from the dictionary used for removing xattrs from the sink. Minor changes in tests and md-cache as well. Signed-off-by: Vijay Bellur <vbellur@redhat.com> Change-Id: I854bfc0098dde812ce2afe64b125ee40c04bdeb1 BUG: 957877 Reviewed-on: http://review.gluster.org/4905 Reviewed-by: Venky Shankar <vshankar@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Added documentation for eager-lock checkPranith Kumar K2013-04-221-0/+17
| | | | | | | | | Change-Id: Ifa42762adde8b55ef1e2b51a59c93cebd983343f BUG: 912581 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4792 Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* afr: let eager-locking do its own overlap checksAnand Avati2013-04-053-2/+87
| | | | | | | | | | | | | | | | | | | | | | | | Today there is a non-obvious dependence of eager-locking on write-behind. The reason is that eager-locking works as long as the inheriting transaction has no overlaps with any of the transactions already in progress. While write-behind provides non-overlapping writes as a side-effect most of times (and only guarantees it when strict-write-ordering option is enabled, which is not on by default) eager-lock needs the behavior as a guarantee. This is leading to complex and unwanted checks for the presence of write-behind in the graph, for the simple task of checking for overlaps. This patch removes the interdependence between eager-locking and write-behind by making eager-locking do its own overlap checks with in-progress writes. Change-Id: Iccba1185aeb5f1e7f060089c895a62840787133f BUG: 912581 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4782 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: Treat all dir fop failure as success in changelogPranith Kumar K2013-04-033-2/+35
| | | | | | | | | | | | | | | For example: If a new entry creation fop fails with EEXIST or a delete entry fop fails with ENOENT, on all the subvols the fop is wound, then no change took place to the directory. So we can treat that case as no change happened to the directory. Change-Id: I3b3a7931954da2166a9cba19ff9f76f37739d751 BUG: 860210 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4626 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Made afr_sh_purge_entry_common message log more clear.Venkatesh Somyajulu2013-04-031-3/+1
| | | | | | | | | | | | | | | FIX: In missing entry self heal, once the source directories are determined after the lookup and if file is not present on any of the brick which contains the souce directory, the entry is removed from the directory. So log message should give information of "Purging of entry". Change-Id: I4d3deb602e0812dc1c9c8ba0a466716d81dede7e BUG: 947312 Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com> Reviewed-on: http://review.gluster.org/4753 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* pump: Set self-heal readdir size in pumpPranith Kumar K2013-04-031-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: In Pump entry self-heal happens for each directory during the first opendir using conservative merge. But in entry-self-heal readdir is issued with '0' size. So entry self-heal is not creating any files. After pump thinks entry self-heal is complete it proceeds to heal each of the file in the directory it just healed. Fortunately most of the times it chooses source-brick in pump as read-child for readdir. This happens because readchild is the subvolume on which lookup succeeds first. In pump lookup succeeds faster in local process than on the destination brick process most of the times. For all the entries pump finds in readdir it does a lookup. During this lookup the entry on the destination brick is created and healed. This is the reason why replace-brick succeeds whenever read-child for the directory is chosen as the source-brick. Which is most of the times. When read-child is chosen as the destination brick, readdir returns no entries so replace-brick completes without syncing the whole data. Fix: Set readdir-size in pump so that entry self-heal happens with 64k size. This ensures that entry self-heal triggered from opendir actually creates the files on the destination brick. Change-Id: I65ea45d3c2735a9578f3aa34eff771b6563241ca BUG: 909800 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4712 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: detect in-progress creation in lookup and return ENOENTPranith Kumar K2013-04-021-0/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | if any subvol returned ENOENT while parent entrylk lock was held, yield and return ENOENT for the entire lookup. This is how the issue happens: Multiple clients A, B and C are attempting 'mkdir -p /mnt/a/b/c' 1 Client A is in the middle of mkdir(/a). It has acquired lock. It has performed mkdir(/a) on one subvol, and second one is still in progress 2 Client B performs a lookup, sees directory /a on one, ENOENT on the other, succeeds lookup. 3 Client B performs lookup on /a/b on both subvols, both return ENOENT (one subvol because /a/b does not exist, another because /a itself does not exist) 4 Client B proceeds to mkdir /a/b. It obtains entrylk on inode=/a with basename=b on one subvol, but fails on other subvol as /a is yet to be created by Client A. 5 Client A finishes mkdir of /a on other subvol 6 Client C also attempts to create /a/b, lookup returns ENOENT on both subvols. 7 Client C tries to obtain entrylk on on inode=/a with basename=b, obtains on one subvol (where B had failed), and waits for B to unlock on other subvol. 8 Client B finishes mkdir() on one subvol with GFID-1 and completes transaction and unlocks 9 Client C gets the lock on the second subvol, At this stage second subvol already has /a/b created from Client B, but Client C does not check that in the middle of mkdir transaction 10 Client C attempts mkdir /a/b on both subvols. It succeeds on ONLY ONE (where Client B could not get lock because of missing parent /a dir) with GFID-2, and gets EEXIST from ONE subvol. This way we have /a/b in GFID mismatch. One subvol got GFID-1 because Client B performed transaction on only one subvol (because entrylk() could not be obtained on second subvol because of missing parent dir -- caused by premature/speculative succeeding of lookup() on /a when locks are detected). Other subvol gets GFID-2 from Client C because while it was waiting for entrylk() on both subvols, Client B was in the middle of creating mkdir() on only one subvol, and Client C does not "expect" this when it is between lock() and pre-op()/op() phase of the transaction. Original-author: Anand Avati <avati@redhat.com> Change-Id: Idca475dbbc2a51e09da6fa0f9e1e37148caef208 BUG: 860210 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4625 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: sync xattrs removed on source to sink(s)Venky Shankar2013-04-021-12/+85
| | | | | | | | | | | | | xattrs are first removed from sink followed by setting source xattrs. Change-Id: I181cb5b785b667bbfc6e40787a2183a8f45de06b BUG: 906646 Signed-off-by: Venky Shankar <vshankar@redhat.com> Reviewed-on: http://review.gluster.org/4656 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster/afr: prevent piggyback on stale pre_opPranith Kumar K2013-04-021-33/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Here are the logs of a file on which we saw EIO because of size mismatch: [root@lizzie ~]# grep 38f18204 /var/log/glusterfs/mnt-x-.log Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 0, len: 7680 Cleared unstable write flag for 38f18204-2840-408e-ae65-c01f4106b8c4: offset 0 length 7680 Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 7680, len: 71680 Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 79360, len: 15716 fsync completed on 38f18204-2840-408e-ae65-c01f4106b8c4 for offset 0 length 7680 with changelog status: -1 -1 According to these logs fsync did not happen after writev with offset: 79360, len: 15716. Which is the reason for this problem. In total 3 writes came. lets call them w1, w2, w3 w1 does pre_op so pre_op_done[0], pre_op_done[1] counts become 1 and 1 then is_piggyback_post_op() is called for w1 and it returns *false* w1's fsync is fired Now w2 and w3 come and see that pre_op_done[0], pre_op_done[1] are both 1, so pre_op_piggyback[0] and pre_op_piggyback[1] are both incremented twice, once by w2, one more time by w3 and become 2, 2 ------- Step-A Now fsync of w1 is complete and it goes ahead with post op and decrements pre_op_done[0], pre_op_done[1] to 0, 0 Now w2, w3 writevs complete and is_piggyback_post_op will return *true* for both w2, w3. So fsync is not fired for both w2, w3 this patch prevents Step-A from happening. Change-Id: I8b6af1f1875b2cf5f718caa3c16ee7ff3dc96b5c BUG: 927146 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4752 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster/afr: fix fd leak with unsafe call_resume()Anand Avati2013-03-282-1/+17
| | | | | | | | | | | | | | Introduce AFR_CALL_RESUME macro which cleans up frame->local, like how AFR_STACK_UNWIND etc. do. Therefore fix leak in afr_fsync() path. Change-Id: I3855d8e7e84dbc44e05f507563b7f722bf9621b8 BUG: 927146 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4745 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: fsync before erase xattrs in data self-healPranith Kumar K2013-03-281-1/+75
| | | | | | | | | | | | Added extra fsync to data self-heal code to make sure the data reached disk before erasing the changelogs Change-Id: I9e7e6e55cdc49de2b991705d1638946464a9d4f9 BUG: 927146 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4744 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: piggyback and fsync resume changesPranith Kumar K2013-03-282-15/+18
| | | | | | | | | | | | | | 1) pre_op_piggyback should always be decremented. 2) Move fsync resume to just after post_op. 3) fsync stub should be created from afr's local not from the final response. Change-Id: I220bb532eb03bea584292f4dd2e816ad0c3e0cf7 BUG: 927146 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4741 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: fsync() guarantees POST-OP completionAnand Avati2013-03-273-10/+56
| | | | | | | | | | | | | | | | | | | | | | | | AFR now provides a stronger guarantee that fsync() returns only after completely finishing all the deferred/delayed POST-OP on that open file. To acheive this we make a stub out of the returning fsync and register it with the "delayed" frame in afr_changelog_wake_resume(). The delayed frame, after getting woken up and finishing the POST-OP will call_resume() the registered stub (which UNWINDs the fsync) at the time of frame destruction. This provides a guarantee that an application's (or FUSE) fsync() returns only after finishing up all the previous transactions, including delayed POST-OPs and UNLOCK. Change-Id: Iaa955457e2f25088a144fde37ad0444277b5cf49 BUG: 927146 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4737 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: ensure DATA operations are made durable before POST-OPAnand Avati2013-03-274-22/+314
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The changelogging scheme of AFR stores information about the state of all replicas in all replicas (in the extended attribute of the respective files on each server) in the form of 'pending counts' of operations (effectively "dirty flags"). These xattrs are blindly trusted while performing self-heal, and therefore utmost care has to be taken while updating and maintaing them. The most critical updation is the clearing of the pending counts corresponding to the *other* server in the changelog of a given server. Before clearing the pending count, we need durability guarantee of the write which was performed on the other server. To obtain such a guarantee, it may be necessary to explicitly introduce an fsync() phase (if the file itself wasn't already opened with O_SYNC). This patch introduces the detection of unstable stable writes on a file and issues explicit fsync() on the servers before performing the POST-OP clearing of pending flags. Change-Id: I2171b86a74ec91e40e5877eef0a4e7379578ecf7 BUG: 927146 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4721 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* nfs, afr: Fail lookup only on split-brainPranith Kumar K2013-03-201-0/+4
| | | | | | | | | | | Change-Id: Icee9772f1f1bf5336eb82a4dc13e198424cd4a65 BUG: 921996 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4699 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: Preserve mtime in self-healPranith Kumar K2013-03-121-14/+25
| | | | | | | | | | | | | | | | | | | | | | | Problem: Data self-heal may choose sink iatt to set mtimes. This happens because after syncing of data is done self-heal does one more xattrops/fstat to determine sources sinks to set the inode-ctx. Since this is done after data syncing and erase of xattrs, old source and old sink are now sources, but the mtimes of them differ. Old code just takes the first source from the list and update mtimes, which could be sink before the self-heal started. Fix: Set mtime from 'sources before syncing'. Change-Id: Id769e1b99aa4f041eaee775f64cbf2c57b799723 BUG: 918437 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4658 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster xlators: s/-1/GF_CLIENT_PID_GSYNCD/Csaba Henk2013-03-031-2/+2
| | | | | | | | | Change-Id: I03be3cb23684de4ab36cf2953002708466edd580 BUG: 765433 Signed-off-by: Csaba Henk <csaba@redhat.com> Reviewed-on: http://review.gluster.org/4601 Reviewed-by: Venky Shankar <vshankar@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: Turn on eager-lock for fd DATA transactionsPranith Kumar K2013-03-012-21/+7
| | | | | | | | | | | | | | | | | | | | Problem: With the present implementation, eager-lock is issued for any fd fop. eager-lock is being transferred to metadata transactions. But the lk-owner is set to local->fd address only for DATA transactions, but for METADATA transactions it is frame->root. Because of this unlock on the eager-lock fails and rebalance hangs. Fix: Enable eager-lock for fd DATA transactions Change-Id: If30df7486a0b2f5e4150d3259d1261f81473ce8a BUG: 916226 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4588 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Don't queue transactions during open-fd fixPranith Kumar K2013-02-226-385/+167
| | | | | | | | | | | | | | | | | | | | | Before Anonymous fds are available, afr had to queue up transactions if the file is not opened on one of its subvolumes. This happens until the attempt to open the file either succeeds or fails. These attempts happen until the file is successfully opened on the subvolume. Now client xlator uses anonymous fds to perform the fops if the fd used for the fop is not 'opened'. Fops will be successful even when the file is not opened so there is no need to queue up the transactions anymore in afr. Open is attempted on the subvolume where it is not opened independent of the fop. Change-Id: Id1a4b4ebe6f89f9efe8f6a8247918b91247d0819 BUG: 913051 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4568 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: do complete split-brain check in all the fd based fopsRaghavendra Bhat2013-02-194-19/+33
| | | | | | | | | | | | | | | | | fd based operations such as readv checked only for data split brain instead of complete split-brain (i.e both data + metadata) assuming that open would have done the complete split-brain check. However open-behind would have unwound open, without winding to afr thus preventing the complete split-brain check and some appliations will be able to read the contents of the file even though the file has metadata split-brain. So let all the fd based fops do a defensive check of complete split-brain. Change-Id: Ia90b35f2b08426dfcad804b7f8105278c86fbd2d BUG: 846240 Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com> Reviewed-on: http://review.gluster.org/4548 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* Use proper libtool option -avoid-version instead of bogus -avoidversionAnand Avati2013-02-071-2/+2
| | | | | | | | | | Change-Id: I1c9541058c7d07786539a3266ca125a6a15287d8 BUG: 859835 Signed-off-by: Anand Avati <avati@redhat.com> Original-author: Kacper Kowalik (Xarthisius) <xarthisius.kk@gmail.com> Signed-off-by: Kacper Kowalik (Xarthisius) <xarthisius.kk@gmail.com> Reviewed-on: http://review.gluster.org/3967 Tested-by: Gluster Build System <jenkins@build.gluster.com>
* afr: serialize modification of {entrylk,inodelk}_lock_countAnand Avati2013-02-071-53/+54
| | | | | | | | | | | | | | Typically this lock was not needed in practice, but with http://review.gluster.org/3842, this code gets executed in multiple threads for different servers and we lose a count. This results in leaked lock and a hang for a future transaction. Change-Id: I377ed20e44f2a45cff522289dfef181f0653eca2 BUG: 765564 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4480 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: Avoid priv->eager_lock value update racePranith Kumar K2013-02-064-4/+6
| | | | | | | | | | Change-Id: I7049c0c64e36a9dfa4cc0e0b34de7ec111d2f6c1 BUG: 908302 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4076 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Perform wakeup just before fopPranith Kumar K2013-02-062-13/+21
| | | | | | | | | | | | | | | There is no necessity for the delayed-post-op to wait until the next fop phase on the fd completes. Change-log, locks are inherited by the time next fop phase is attempted so the wakeup can happen just before the fop phase is started. Change-Id: I0b8e591f591b0f7565eb55265ab51f476ed2b165 BUG: 908302 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4073 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: added logging of changelog for split-brain in glustershd.log fileVenkatesh Somyajula2013-02-034-10/+68
| | | | | | | | | | Change-Id: Iaf119f839cb2113b8f8efb7bf7636d471b6541bf BUG: 866440 Signed-off-by: Venkatesh Somyajula <vsomyaju@redhat.com> Reviewed-on: http://review.gluster.org/4385 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: if a subvolume is down wind the lock request to nextRaghavendra Bhat2013-01-291-15/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | When one of the subvolume is down, then lock request is not attempted on that subvolume and move on to the next subvolume. /* skip over children that are down */ while ((child_index < priv->child_count) && !local->child_up[child_index]) child_index++; In the above case if there are 2 subvolumes and 2nd subvolume is down (subvolume 1 from afr's view), then after attempting lock on 1st child (i.e subvolume 0) child index is calculated to be 1. But since the 2nd child is down child_index is incremented to 2 as per the above logic and lock request is STACK_WINDed to the child with child_index 2. Since there are only 2 children for afr the child (i.e the xlator_t pointer) for child_index will be NULL. The process crashes when it dereference the NULL xlator object. Change-Id: Icd9b5ad28bac1b805e6e80d53c12d296526bedf5 BUG: 765564 Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com> Reviewed-on: http://review.gluster.org/4438 Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: wakeup delayed post op on fsyncPranith Kumar K2013-01-291-5/+3
| | | | | | | | | Change-Id: I5d84ef72615f9d71b4af210976e2449de6e02326 BUG: 888174 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4446 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Change order of unwind, resume for writevPranith Kumar K2013-01-291-31/+87
| | | | | | | | | | | | | | | | | | Generally inode-write fops do transaction.unwind then transaction.resume, but writev needs to make sure that delayed post-op frame is placed in fdctx before unwind happens. This prevents the race of flush doing the changelog wakeup first in fuse thread and then this writev placing its delayed post-op frame in fdctx. This helps flush make sure all the delayed post-ops are completed. Change-Id: Ia78ca556f69cab3073c21172bb15f34ff8c3f4be BUG: 888174 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4428 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: before checking lock_count of internal lock make sure its notRaghavendra Bhat2013-01-281-12/+13
| | | | | | | | | | | | | | | entrylk when the expected lock count is equal to the attempted lock count, then before deciding that lock is failed on all the nodes, make sure the lock type is checked properly. Change-Id: I1f362d54320cb6ec5654c5c69915c0f61c91d8c7 BUG: 765564 Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com> Reviewed-on: http://review.gluster.org/4436 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* replicate: fix lock counting in blocking lock pathAnand Avati2013-01-262-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As of http://review.gluster.org/2828, the blocking lock code path's condition for checking completion of locking atempt is broken. The condition - if ((child_index == priv->child_count) || ...) and if ((child_index == priv->child_count) && ...) which is retained to check completion of blocking lock attempts for DATA/METADATA transaction will _always_ fail because a few lines above we have - child_index = cookie % priv->child_count; So child_index will never equal priv->child_count. This leaves the correctness at the mercy of the next part of the conditional - .. (int_lock->lock_count == int_lock->lk_expected_count) .. This "works" as long as no server went down during the transaction. If a server goes down in the middle of the transaction, then this condition also fails, and the code wraps around and starts a blocking lock attempt loop all the way again from from the first server. This results in double locks getting acquired on those servers, and eventually the second condition gets hit (first condition is _never_ hit) and we come out of locking phase. During unlock phase we perform only one unlock per server leaving the other lock "leaked" forever. Change-Id: I7189cdf3f70901b04647516fe1d1e189f36cc8dd BUG: 765564 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4433 Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* afr: Modified book-keeping structures for entrylksKrishnan Parthasarathi2013-01-236-460/+512
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * There are upto 3 entry lockees that may be needed to perform entrylk'ing in posix dir-write operations. * For eg, rmdir ("/a/b") needs to acquire locks on two entities, - entrylk ("/a", "b") - entrylk ("/a/b", null) * Changed existing entrylk/rename/selfheal (entrylk) transactions to use the new book-keeping structures * Fixed few issues in afr_trace_entry_lk{in,out} functions. Tracing is now aware of the new entry lockee structure. Implementation notes: * Changed 'cookie' sent in stack_wind to encode lockee_entity_no and subvol_no. cookie is a non-negative integer such that 0 <= cookie < replica_count, When more than one lock is being acquired across the subvolumes, cookie % replica_count gives the subvol_no cookie / replica_count gives the lockee_entity_no. Change-Id: Idbf41803387a7d59a0f7fcb1453d91cea74da153 BUG: 765564 Signed-off-by: Krishnan Parthasarathi <kp@gluster.com> Reviewed-on: http://review.gluster.org/2828 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Remove strict-readdir implementationPranith Kumar K2013-01-231-201/+0
| | | | | | | | | | | Leaving option frame-work un-changed for backward compatibility. Change-Id: I40bce1ec360801307e67f09e53b0721f64efab37 BUG: 886998 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4309 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* self-heald: Remove stale index even in heal infoPranith Kumar K2013-01-221-35/+45
| | | | | | | | | Change-Id: Ic1c9559aec59c1fb9dfede4aba8895f3b86f32f1 BUG: 861015 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4098 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster/afr: Link inode only on lookupPranith Kumar K2013-01-211-22/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | Problem: When "gluster volume heal <volname> info is executed, crawl's process_entry is not going to populate iatt structure so the iatt's gfid will be empty. So inode_links are failing. Fix: inode_link should be done only after lookup i.e. when heal is performed. So moved the inode_link related code to just after the lookup which is triggered when self-heal is done. Tests: The testcase that gives this issue does not give the inode-link failures anymore. glustershd heal, info commands are working as expected. Wrote basic automation tests for proactive-self-heal-daemon https://github.com/pranithk/gluster-tests/blob/master/afr/proactive-self-heal.sh Change-Id: Ic112bf104a4d553a64d3d8559f681a25ae1a5362 BUG: 861015 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4090 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Disable delayed post op when eager-lock is offPranith Kumar K2013-01-181-0/+3
| | | | | | | | | | | | | | | | | | | | | Problem: When eager-lock is disabled, inodelks for write-fops on same fd conflict with each other. If eager-lock is disabled but delayed post-op is enabled then each write fop's inodelk unlock waits for post-op-delay-secs. So the conflicting write fop acquires inodelk after post-op-delay-secs. This results in post-op-delay-secs delay for every write fop on the fd for sequential writes (Ex: dd). Fix: Disable delayed-post-op when eager-lock is off. Change-Id: I87ea4c8d1c7bb269b9b174388ae50f37e82629b7 BUG: 895235 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4391 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Fail readv on data-split-brainPranith Kumar K2013-01-183-0/+23
| | | | | | | | | | | | | | | | | | Problem: Afr prevents opens on a file in split-brian but the fd that is already open still has the capability to perform both reads and writes to the file. Fix: Fail readvs on a file with EIO. Change-Id: I8e07f24c36fab800499b36ab374f984b743332cd BUG: 873962 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4199 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* afr: conditionally prioritize EIO errors over ENOENTBrian Foster2013-01-183-7/+12
| | | | | | | | | | | | | | | | | | | | | | | | The most important errno logic historically only prioritized ESTALE over ENOENT. Commit c8c0942d added EIO prioritization over ENOENT to ensure that split-brain was reported when it occurs in conjunction with bricks missing the file entry. The unintended side effect of this change is that (non split-brain) EIO errors reported from the bricks themselves are now reported to the client when the expectation is that afr should squash said errors in favor of marking the file inconsistent. The high-level problem is that EIO is overloaded with different meanings from different contexts. This commit adds an eio parameter to the errno priority logic to conditionally flag when EIO is of higher priority and should be propagated to the client. BUG: 892730 Change-Id: Ib692a8a1f1737ef190d57894f392ec53ffb33aab Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-on: http://review.gluster.org/4376 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* afr: replace afr_more_important_error with afr_most_important_errorBrian Foster2013-01-173-26/+18
| | | | | | | | | | | | | | | | | | | | | afr_more_important_error() is written to return whether a new errno should override an existing errno for high-level operations that could span multiple sub-operations. It specifically prioritizes ESTALE over EIO over ENOENT, and otherwise defaults to the latest error passed having priority. This change preserves current behavior, but rewrites the logic to return the higher priority error of the existing and new errno. The purpose of the change is to make the logic a bit more clear and set the stage for future changes to make the logic flexible based on context. BUG: 892730 Change-Id: Id1aa48855dfb0507abc9d1ef22f2259b30472576 Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-on: http://review.gluster.org/4375 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: Pre-op should be undone for non-piggyback post-opPranith Kumar K2013-01-161-2/+6
| | | | | | | | | | | | | | | | | | | | | | Problem: When fop fails post-op is always performed over the network irrespective of whether pre-op is piggybacked or not. Decrementing Pre-op-done count even for the piggybacked ones is wrong. I have added an assert for pre_op_done to be non-zero and when dd of=a if=/dev/urandom bs=5M count=1000 is executed and a brick is taken down, the mount is crashing. Fix: Decrement pre-op-done count only when the post-op is not piggybacked. Change-Id: Ie837251a43bfb437f0fada191302eeee60be1601 BUG: 863939 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4310 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* replicate: don't clear changelog for un-healed replicasJeff Darcy2013-01-161-6/+44
| | | | | | | | | | Change-Id: Iebfa6770a688e89c051666b46977862188061738 BUG: 802417 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4034 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: check for the -ve values returned from dict_serialized_lengthRaghavendra Bhat2012-12-171-1/+1
| | | | | | | | | Change-Id: I9fa7744b02791180ccb93adef10c363a1b38aa31 BUG: 838204 Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com> Reviewed-on: http://review.gluster.org/4319 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Remember type of split-brain in inode-ctxPranith Kumar K2012-12-115-117/+114
| | | | | | | | | | | | | Along with this change, fixed the race of setting the split-brain status in inode-ctx after unwinding the fop from self-heal in case of back-ground self-heal. Change-Id: Ifc829300df485f50f139443802e8b6dc7038b4ad BUG: 873962 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4198 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Empty string should not be default option valPranith Kumar K2012-12-053-5/+6
| | | | | | | | | | | | Glusterd does not allow empty string as default value. Changed afr option values to disallow empty string as value. Change-Id: I92a2d658907dbc6101e1139dd91f548acb5506f5 BUG: 859927 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4271 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: mark new entry changelog for create/mknod failuresPranith Kumar K2012-12-045-67/+228
| | | | | | | | | | | | | | | | | | | | | | Problem: When create/mknod fails on some of the nodes, appropriate pending data/metadata changelogs are not assigned. This was not considered to be an issue because entry self-heal would do the assigning of appropriate changelog after creating new entries. But using the combination of rebalance and remove brick we can construct a case where a file with same name and gfid can be created in a dir with different data and link-to xattr without any changelog. Fix: When a create/mknod failure is observed mark the appropriate changelog on the new file created. Change-Id: I4c32cbf5594a13fb14deaf97ff30b2fff11cbfd6 BUG: 858212 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4207 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* afr: use data trylock mode in read/write self-heal trigger pathsBrian Foster2012-12-041-1/+8
| | | | | | | | | | | | | | | | | | | | | | Self-heal data lock contention between clients and glustershd instances can lead to long wait and user response times if the client ends up pending its lock on glustershd self-heal of a large file. We have reports of guest vm instances going completely unresponsive during self-heal of virtual disk images. Optimize the read/write self-heal trigger codepath (i.e., afr_open_fd_fix()) to trylock for self-heal and skip the self-heal otherwise to minimize the likelihood of a running/active guest of competing with glustershd on arrival of a brick. Note that lock contention is still possible from the client (e.g., via lookup). BUG: 874045 Change-Id: I406443c061ff6acd2a851179626b78352caa5c03 Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-on: http://review.gluster.org/4258 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* afr: support self-heal data trylock mechanismBrian Foster2012-12-044-8/+15
| | | | | | | | | | | | | | | Introduce a block flag to support an optional blocking or non-blocking mode in the self-heal data locking mechanism. All callers are modified to use blocking mode, which is the current default behavior (no change in behavior is introduced by this commit). BUG: 874045 Change-Id: Ib7ff9984578fa11de4e3b6981508100cdddd37cd Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-on: http://review.gluster.org/4257 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* afr: make flush non-transactionalBrian Foster2012-12-043-139/+38
| | | | | | | | | | | | | | | | | | Flush is historically a transaction to ensure all previous writes were complete. This is no longer required as write-behind has learned to make flush a barrier operation (re: conversation w/ Avati). Flush taking a full file lock causes VMs running on afr volumes to stall when a migration occurs and self-heal is in progress. Make afr_flush() a non-transactional operation. BUG: 874045 Change-Id: If2db83823e280c86b1b29b41361eed7081601632 Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-on: http://review.gluster.org/4261 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>