path: root/xlators/cluster/afr/src/afr-transaction.c
Commit message | Author | Age | Files | Lines
* cluster/afr: Remove eager-lock stub on finodelk failure | Pranith Kumar K | 2014-04-02 | 1 | -6/+2
  Problem: For write fops, afr's transaction eager-lock init adds transactions that can share the eager-lock to the fdctx list. But if the eager-lock finodelk fop fails, the stub remains in the list. This could later corrupt the list and cause an infinite loop over it, leading to a mount hang.

  Fix: Remove the stub when finodelk fails.

  Change-Id: I0ed4bc6b62f26c5e891c1181a6871ee6e4f4f5fd
  BUG: 1063190
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/6944
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Ravishankar N <ravishankar@redhat.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: refactor | Anand Avati | 2014-03-22 | 1 | -847/+610
  - Remove client side self-healing completely (opendir, openfd, lookup)
  - Re-work readdir-failover to work reliably in case of NFS
  - Remove unused/dead lock recovery code
  - Consistently use xdata in both calls and callbacks in all FOPs
  - Per-inode event generation, used to force inode ctx refresh
  - Implement dirty flag support (in place of pending counts)
  - Eliminate inode ctx structure, use read subvol bits + event_generation
  - Implement inode ctx refreshing based on event generation
  - Provide backward compatibility in transactions
  - remove unused variables and functions
  - make code more consistent in style and pattern
  - regularize and clean up inode-write transaction code
  - regularize and clean up dir-write transaction code
  - regularize and clean up common FOPs
  - reorganize transaction framework code
  - skip setting xattrs in pending dict if nothing is pending
  - re-write self-healing code using syncops
  - re-write simpler self-heal-daemon

  Change-Id: I1e4080c9796c8a2815c2dab4be3073f389d614a8
  BUG: 1021686
  Signed-off-by: Anand Avati <avati@redhat.com>
  Reviewed-on: http://review.gluster.org/6010
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* libglusterfs: Add monotonic clocking counter for timer thread | Harshavardhana | 2013-10-15 | 1 | -2/+2
  gettimeofday() returns the current wall clock time and timezone. Using it to measure the passage of time (how long an operation took) therefore seems like a no-brainer. However, wall-clock time suffers from some limitations:

  a. It has a low resolution: “high-performance” timing, by definition, requires clock resolutions in the microseconds or better.
  b. It can jump forwards and backwards in time: computer clocks all tick at slightly different rates, which causes the time to drift. Most systems have NTP enabled, which periodically adjusts the system clock to keep it in sync with “actual” time. The adjustment can cause the clock to suddenly jump forward (artificially inflating your timing numbers) or jump backwards (causing your timing calculations to go negative or hugely positive). In such cases the timer thread could go into an infinite loop.

  From 'man gettimeofday':
  ----------
  .. ..
  The time returned by gettimeofday() is affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the system time). If you need a monotonically increasing clock, see clock_gettime(2).
  .. ..
  ----------

  Rationale: For calculating interval timing for the timer thread, all that’s needed is a clock that acts as a simple counter incrementing at a stable rate. To avoid the jumps caused by using "wall time", this counter must be monotonic, i.e. it can never “tick” backwards.

  Change-Id: I701d31e71a85a73d21a6c5cd15583e7a5a645eeb
  BUG: 1017993
  Signed-off-by: Harshavardhana <harsha@harshavardhana.net>
  Reviewed-on: http://review.gluster.org/6070
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
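The rationale above can be illustrated with a minimal sketch of interval timing on POSIX `clock_gettime(CLOCK_MONOTONIC)`. The helper names `monotonic_ns` and `elapsed_ns` are hypothetical and are not taken from the libglusterfs patch; this only shows the general technique, assuming a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: read the monotonic clock as nanoseconds.
 * CLOCK_MONOTONIC does not jump when the wall clock is set or stepped. */
uint64_t monotonic_ns(void)
{
    struct timespec ts;
    int ret = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(ret == 0);
    (void)ret;
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Hypothetical helper: interval between two monotonic readings.
 * Because the clock never ticks backwards, end >= start always holds. */
uint64_t elapsed_ns(uint64_t start, uint64_t end)
{
    assert(end >= start);
    return end - start;
}
```

A timer loop would take one reading before waiting and one after, and compare `elapsed_ns()` against the requested timeout instead of subtracting two `gettimeofday()` results.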
* cluster/afr: Add special handling for failure postops | Pranith Kumar K | 2013-08-28 | 1 | -0/+56
  The idea is to not leave the file in a FOOL-FOOL scenario when the data transaction failed with EDQUOT on all the bricks, to avoid adding unnecessary self-heal load to the system. For directory transactions, don't leave a pending changelog when the failures are seen on all the subvolumes.

  Change-Id: I38a5561d1d581a78347a76a4a509514e4a0c3fb7
  BUG: 969461
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/5709
  Reviewed-by: Anand Avati <avati@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: Don't delay post op in cases of failures | Pranith Kumar K | 2013-08-28 | 1 | -7/+23
  Change-Id: Ib0c3af6babc61dc3ed45252582876e2f243d6446
  BUG: 958118
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/5635
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Disable eager-lock if open-fd-count > 1 | Pranith Kumar K | 2013-08-02 | 1 | -2/+34
  Let's say mount1 has the eager-lock (full-lock) and, after the eager-lock is taken, mount2 opens the same file. mount2 won't be able to perform any data operations until mount1 releases the eager-lock. To avoid this scenario, do not enable eager-lock for the transaction if open-fd-count is > 1. Delaying of changelog piggybacking is avoided in this situation.

  Change-Id: I51b45d6a7c216a78860aff0265a0b8dabc6423a5
  BUG: 910217
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/5432
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: venkatesh somyajulu <vsomyaju@redhat.com>
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* storage/posix: implement batched fsync in a single thread | Anand Avati | 2013-07-23 | 1 | -1/+10
  Because of the extra fsync()s issued by AFR transactions, they could potentially "clog" all the io-threads, denying unrelated operations from making progress.

  This patch assigns a dedicated thread to issue fsyncs, as an experimental feature to understand performance characteristics with the approach.

  As a basis, incoming individual fsync requests are grouped into batches, falling in the same @batch-fsync-delay-usec window of time. These windows can extend in practice, as processing of the previous batch can take longer than @batch-fsync-delay-usec while new requests are getting batched.

  The feature supports four modes (similar to the -S modes of fs_mark):

  - syncfs: In this mode one syncfs() is issued per batch, instead of N fsync()s (one per file.)
  - syncfs-single-fsync: In this mode one syncfs() is issued per batch (which, on Linux, guarantees the completion of write-out of dirty pages in the filesystem up to that point) and one single fsync() to synchronize or flush the controller/drive cache. This corresponds to -S 2 of fsmark.
  - syncfs-reverse-fsync: In this mode, one syncfs() is issued per batch, and all the open files in that batch are fsync()'ed in the reverse order of the queue. This corresponds to -S 4 of fsmark.
  - reverse-fsync: In this mode, no syncfs() is issued and all the files in the batch are fsync()'ed in the reverse order. This corresponds to -S 3 of fsmark.

  Change-Id: Ia1e170a810c780c8d80e02cf910accc4170c4cd4
  BUG: 927146
  Signed-off-by: Anand Avati <avati@redhat.com>
  Reviewed-on: http://review.gluster.org/4746
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
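As a rough illustration of how the modes listed above differ, the following sketch counts the syncfs() and fsync() calls each mode would issue for one batch, going only by the descriptions in the commit message. The enum, struct, and function names are hypothetical and do not come from the storage/posix code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the modes described in the commit message. */
enum batch_fsync_mode {
    BATCH_SYNCFS,
    BATCH_SYNCFS_SINGLE_FSYNC,
    BATCH_SYNCFS_REVERSE_FSYNC,
    BATCH_REVERSE_FSYNC
};

/* Number of system calls one batch would cost. */
struct sync_plan {
    size_t syncfs_calls;
    size_t fsync_calls;
};

struct sync_plan plan_batch(enum batch_fsync_mode mode, size_t files_in_batch)
{
    struct sync_plan plan = {0, 0};

    assert(files_in_batch > 0);
    switch (mode) {
    case BATCH_SYNCFS:               /* one syncfs(), no fsync() */
        plan.syncfs_calls = 1;
        break;
    case BATCH_SYNCFS_SINGLE_FSYNC:  /* one syncfs() plus a single fsync() */
        plan.syncfs_calls = 1;
        plan.fsync_calls = 1;
        break;
    case BATCH_SYNCFS_REVERSE_FSYNC: /* one syncfs(), then fsync() every file */
        plan.syncfs_calls = 1;
        plan.fsync_calls = files_in_batch;
        break;
    case BATCH_REVERSE_FSYNC:        /* no syncfs(), fsync() every file */
        plan.fsync_calls = files_in_batch;
        break;
    }
    return plan;
}
```

For a batch of 8 files, for example, `plan_batch(BATCH_SYNCFS, 8)` costs one call in total while `plan_batch(BATCH_REVERSE_FSYNC, 8)` costs eight, which is the trade-off the experimental feature is meant to measure.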
* cluster/afr: Refactor inodelk to handle multiple domains | Pranith Kumar K | 2013-07-03 | 1 | -274/+291
  - afr_local_copy should not be memduping locked nodes; that would mean that the lock is taken in self-heal on those nodes even before it actually takes the lock. So removed the memdup code. Even the entry lock related copying (lockee info) is not necessary for self-heal functionality, so removing that as well. Since it is not local_copy anymore, changed its name.
  - My editor changed tabs to spaces.

  Change-Id: I8dfb92cb8338e9a967c06907a8e29a8404782d61
  BUG: 967717
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/5099
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Provide an option to disable afr durability | Pranith Kumar K | 2013-07-03 | 1 | -3/+11
  Change-Id: I40eec20ca6b3f857245a2438883822e251077ee9
  BUG: 979365
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/5269
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: post-op should complete before starting flush | Pranith Kumar K | 2013-07-03 | 1 | -2/+2
  Problem: At the moment afr-flush makes sure that a delayed post-op is woken up, but it does not wait for it to complete the post-op before flush unwinds. These are the steps that are happening:

  1) flush fop comes on an fd, which wakes up a delayed post-op and continues with the flush fop.
  2) post-op sends fsync on the wire.
  3) flush completes and unwinds to fuse.
  4) graph switch happens on the fuse mount, disconnecting the old graph's client connections to bricks.
  5) xattrop after fsync fails with ENOTCONN because the connections from the old graph are taken down now.

  Fix: Wait for post-op to complete before starting to flush. We could make flush act similar to fsync (i.e. wind flush as is but wait for post-op to complete before unwinding flush), but it is better to send flush as the final fop. So the wind of flush will start after post-op is complete. Had to change fsync to accommodate this change.

  Change-Id: I93aa642647751969511718b0e137afbd067b388a
  BUG: 980548
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/5274
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Handle NULL fdctx in fsync | Pranith Kumar K | 2013-06-27 | 1 | -1/+4
  Problem: If fdctx is NULL in afr_fsync, the process crashes because of a NULL dereference.

  Fix: If fdctx is NULL, always report a witnessed unstable write so that fsyncs are done properly. Also handled fdctx being NULL in afr_delayed_changelog_post_op; otherwise the fsync stub was never resumed and the mount hung.

  Change-Id: Icacc900e9be63c29db3325cb0e19cc250adebaac
  BUG: 978794
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/5258
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Fix fd/memory leak on fsync | Pranith Kumar K | 2013-06-24 | 1 | -1/+1
  Change-Id: I764883811e30ca9d9c249ad00b6762101083a2fe
  BUG: 976800
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/5248
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster/afr: Perform delayed changelog wakeups for anon fd | Pranith Kumar K | 2013-06-20 | 1 | -0/+3
  Problem: The NFS xlator never does an open on a file for performing writes, and afr does not perform a changelog wakeup for this fd. So operations that do metadata operations as soon as the data operations are completed perceive a delay of 'post-op-delay-secs'.

  Fix: Perform changelog wakeup on the anon-fd if an fd with the same pid is not present in the inode-list.

  Note: This approach is a short-term fix. A proper fix needs a new domain for taking metadata locks so that data/metadata locks don't compete with each other.

  Change-Id: I253afb289eadf30c7951e56fb2c4840d7132f5e4
  BUG: 966018
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/5066
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Added documentation for eager-lock check | Pranith Kumar K | 2013-04-22 | 1 | -0/+17
  Change-Id: Ifa42762adde8b55ef1e2b51a59c93cebd983343f
  BUG: 912581
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4792
  Reviewed-by: Vijay Bellur <vbellur@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
* afr: let eager-locking do its own overlap checks | Anand Avati | 2013-04-05 | 1 | -0/+78
  Today there is a non-obvious dependence of eager-locking on write-behind. The reason is that eager-locking works as long as the inheriting transaction has no overlaps with any of the transactions already in progress. While write-behind provides non-overlapping writes as a side-effect most of the time (and only guarantees it when the strict-write-ordering option is enabled, which is not on by default), eager-lock needs the behavior as a guarantee.

  This is leading to complex and unwanted checks for the presence of write-behind in the graph, for the simple task of checking for overlaps.

  This patch removes the interdependence between eager-locking and write-behind by making eager-locking do its own overlap checks with in-progress writes.

  Change-Id: Iccba1185aeb5f1e7f060089c895a62840787133f
  BUG: 912581
  Signed-off-by: Anand Avati <avati@redhat.com>
  Reviewed-on: http://review.gluster.org/4782
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
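The kind of overlap check described above can be sketched as a comparison of half-open byte ranges. This is only an illustration of the idea: `struct write_range`, `ranges_overlap`, and `can_inherit_eager_lock` are hypothetical names, and the real patch may represent and compare ranges differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one write: bytes [start, start + len). */
struct write_range {
    uint64_t start;
    uint64_t len;
};

/* Two half-open ranges overlap when each one starts before the other ends. */
bool ranges_overlap(const struct write_range *a, const struct write_range *b)
{
    return a->start < b->start + b->len && b->start < a->start + a->len;
}

/* An incoming write may share the eager-lock only if it overlaps none of
 * the writes that are still in progress. */
bool can_inherit_eager_lock(const struct write_range *in_progress, size_t count,
                            const struct write_range *incoming)
{
    assert(in_progress != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (ranges_overlap(&in_progress[i], incoming))
            return false;
    }
    return true;
}
```

With in-progress writes at offsets 0 (7680 bytes) and 7680 (71680 bytes), an incoming write at offset 79360 does not overlap and could inherit the lock, whereas one at offset 7000 would have to wait.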
* cluster/afr: Treat all dir fop failure as success in changelog | Pranith Kumar K | 2013-04-03 | 1 | -1/+1
  For example: if a new-entry creation fop fails with EEXIST, or a delete-entry fop fails with ENOENT, on all the subvols where the fop is wound, then no change took place in the directory. So we can treat that case as if no change happened to the directory.

  Change-Id: I3b3a7931954da2166a9cba19ff9f76f37739d751
  BUG: 860210
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4626
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: prevent piggyback on stale pre_op | Pranith Kumar K | 2013-04-02 | 1 | -33/+2
  Here are the logs of a file on which we saw EIO because of size mismatch:

      [root@lizzie ~]# grep 38f18204 /var/log/glusterfs/mnt-x-.log
      Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 0, len: 7680
      Cleared unstable write flag for 38f18204-2840-408e-ae65-c01f4106b8c4: offset 0 length 7680
      Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 7680, len: 71680
      Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 79360, len: 15716
      fsync completed on 38f18204-2840-408e-ae65-c01f4106b8c4 for offset 0 length 7680 with changelog status: -1 -1

  According to these logs, fsync did not happen after the writev with offset: 79360, len: 15716, which is the reason for this problem.

  In total 3 writes came. Let's call them w1, w2, w3.

  - w1 does pre_op, so the pre_op_done[0], pre_op_done[1] counts become 1 and 1. Then is_piggyback_post_op() is called for w1 and it returns *false*, so w1's fsync is fired.
  - Now w2 and w3 come and see that pre_op_done[0], pre_op_done[1] are both 1, so pre_op_piggyback[0] and pre_op_piggyback[1] are both incremented twice, once by w2 and once more by w3, and become 2, 2. (Step-A)
  - Now the fsync of w1 is complete and it goes ahead with post op and decrements pre_op_done[0], pre_op_done[1] to 0, 0.
  - Now the w2, w3 writevs complete and is_piggyback_post_op will return *true* for both w2, w3. So fsync is not fired for either w2 or w3.

  This patch prevents Step-A from happening.

  Change-Id: I8b6af1f1875b2cf5f718caa3c16ee7ff3dc96b5c
  BUG: 927146
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4752
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster/afr: fix fd leak with unsafe call_resume() | Anand Avati | 2013-03-28 | 1 | -1/+1
  Introduce AFR_CALL_RESUME macro which cleans up frame->local, like how AFR_STACK_UNWIND etc. do. Therefore fix leak in afr_fsync() path.

  Change-Id: I3855d8e7e84dbc44e05f507563b7f722bf9621b8
  BUG: 927146
  Signed-off-by: Anand Avati <avati@redhat.com>
  Reviewed-on: http://review.gluster.org/4745
  Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: piggyback and fsync resume changes | Pranith Kumar K | 2013-03-28 | 1 | -8/+14
  1) pre_op_piggyback should always be decremented.
  2) Move fsync resume to just after post_op.
  3) fsync stub should be created from afr's local, not from the final response.

  Change-Id: I220bb532eb03bea584292f4dd2e816ad0c3e0cf7
  BUG: 927146
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4741
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: fsync() guarantees POST-OP completion | Anand Avati | 2013-03-27 | 1 | -4/+26
  AFR now provides a stronger guarantee that fsync() returns only after completely finishing all the deferred/delayed POST-OPs on that open file.

  To achieve this, we make a stub out of the returning fsync and register it with the "delayed" frame in afr_changelog_wake_resume(). The delayed frame, after getting woken up and finishing the POST-OP, will call_resume() the registered stub (which UNWINDs the fsync) at the time of frame destruction.

  This provides a guarantee that an application's (or FUSE's) fsync() returns only after finishing up all the previous transactions, including delayed POST-OPs and UNLOCK.

  Change-Id: Iaa955457e2f25088a144fde37ad0444277b5cf49
  BUG: 927146
  Signed-off-by: Anand Avati <avati@redhat.com>
  Reviewed-on: http://review.gluster.org/4737
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: ensure DATA operations are made durable before POST-OP | Anand Avati | 2013-03-27 | 1 | -21/+260
  The changelogging scheme of AFR stores information about the state of all replicas in all replicas (in the extended attributes of the respective files on each server) in the form of 'pending counts' of operations (effectively "dirty flags"). These xattrs are blindly trusted while performing self-heal, and therefore utmost care has to be taken while updating and maintaining them.

  The most critical update is the clearing of the pending counts corresponding to the *other* server in the changelog of a given server. Before clearing the pending count, we need a durability guarantee for the write which was performed on the other server. To obtain such a guarantee, it may be necessary to explicitly introduce an fsync() phase (if the file itself wasn't already opened with O_SYNC).

  This patch introduces the detection of unstable writes on a file and issues an explicit fsync() on the servers before performing the POST-OP clearing of pending flags.

  Change-Id: I2171b86a74ec91e40e5877eef0a4e7379578ecf7
  BUG: 927146
  Signed-off-by: Anand Avati <avati@redhat.com>
  Reviewed-on: http://review.gluster.org/4721
  Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
  Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: Turn on eager-lock for fd DATA transactions | Pranith Kumar K | 2013-03-01 | 1 | -20/+5
  Problem: With the present implementation, eager-lock is issued for any fd fop, so the eager-lock is being transferred to metadata transactions. But the lk-owner is set to the local->fd address only for DATA transactions; for METADATA transactions it is frame->root. Because of this, unlock on the eager-lock fails and rebalance hangs.

  Fix: Enable eager-lock only for fd DATA transactions.

  Change-Id: If30df7486a0b2f5e4150d3259d1261f81473ce8a
  BUG: 916226
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4588
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Avoid priv->eager_lock value update race | Pranith Kumar K | 2013-02-06 | 1 | -3/+3
  Change-Id: I7049c0c64e36a9dfa4cc0e0b34de7ec111d2f6c1
  BUG: 908302
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4076
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Perform wakeup just before fop | Pranith Kumar K | 2013-02-06 | 1 | -13/+14
  There is no need for the delayed-post-op to wait until the next fop phase on the fd completes. The changelog and locks are inherited by the time the next fop phase is attempted, so the wakeup can happen just before the fop phase is started.

  Change-Id: I0b8e591f591b0f7565eb55265ab51f476ed2b165
  BUG: 908302
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4073
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* afr: Modified book-keeping structures for entrylks | Krishnan Parthasarathi | 2013-01-23 | 1 | -3/+5
  * There are up to 3 entry lockees that may be needed to perform entrylk'ing in posix dir-write operations.
  * For example, rmdir ("/a/b") needs to acquire locks on two entities:
    - entrylk ("/a", "b")
    - entrylk ("/a/b", null)
  * Changed existing entrylk/rename/selfheal (entrylk) transactions to use the new book-keeping structures.
  * Fixed a few issues in the afr_trace_entry_lk{in,out} functions. Tracing is now aware of the new entry lockee structure.

  Implementation notes:
  * Changed the 'cookie' sent in stack_wind to encode lockee_entity_no and subvol_no. cookie is a non-negative integer such that 0 <= cookie < replica_count. When more than one lock is being acquired across the subvolumes,
    cookie % replica_count gives the subvol_no
    cookie / replica_count gives the lockee_entity_no.

  Change-Id: Idbf41803387a7d59a0f7fcb1453d91cea74da153
  BUG: 765564
  Signed-off-by: Krishnan Parthasarathi <kp@gluster.com>
  Reviewed-on: http://review.gluster.org/2828
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
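The cookie arithmetic in the implementation notes can be sketched as a pair of helpers. The decode side follows the two formulas given in the commit message; the encode side (`lockee_no * replica_count + subvol_no`) is the inverse implied by them and is an assumption, as are the names `cookie_encode`, `cookie_decode`, and `struct lock_cookie`.

```c
#include <assert.h>

/* Hypothetical decoded form of the cookie. */
struct lock_cookie {
    int lockee_no; /* which lockee entity (for example, 0 or 1 for rmdir) */
    int subvol_no; /* which replica subvolume */
};

/* Assumed inverse of the decode formulas from the commit message. */
int cookie_encode(int lockee_no, int subvol_no, int replica_count)
{
    assert(replica_count > 0);
    assert(lockee_no >= 0);
    assert(subvol_no >= 0 && subvol_no < replica_count);
    return lockee_no * replica_count + subvol_no;
}

/* cookie % replica_count gives subvol_no,
 * cookie / replica_count gives lockee_entity_no. */
struct lock_cookie cookie_decode(int cookie, int replica_count)
{
    struct lock_cookie out;

    assert(replica_count > 0);
    assert(cookie >= 0);
    out.subvol_no = cookie % replica_count;
    out.lockee_no = cookie / replica_count;
    return out;
}
```

With a replica count of 2, the second lockee on the second subvolume encodes to cookie 3, and decoding 3 yields lockee 1 and subvolume 1 again.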
* cluster/afr: Disable delayed post op when eager-lock is off | Pranith Kumar K | 2013-01-18 | 1 | -0/+3
  Problem: When eager-lock is disabled, inodelks for write-fops on the same fd conflict with each other. If eager-lock is disabled but delayed post-op is enabled, then each write fop's inodelk unlock waits for post-op-delay-secs. So the conflicting write fop acquires the inodelk after post-op-delay-secs. This results in a post-op-delay-secs delay for every write fop on the fd for sequential writes (Ex: dd).

  Fix: Disable delayed-post-op when eager-lock is off.

  Change-Id: I87ea4c8d1c7bb269b9b174388ae50f37e82629b7
  BUG: 895235
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4391
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Pre-op should be undone for non-piggyback post-op | Pranith Kumar K | 2013-01-16 | 1 | -2/+6
  Problem: When a fop fails, post-op is always performed over the network irrespective of whether the pre-op is piggybacked or not. Decrementing the pre-op-done count even for the piggybacked ones is wrong. I added an assert that pre_op_done is non-zero; when dd of=a if=/dev/urandom bs=5M count=1000 is executed and a brick is taken down, the mount crashes.

  Fix: Decrement the pre-op-done count only when the post-op is not piggybacked.

  Change-Id: Ie837251a43bfb437f0fada191302eeee60be1601
  BUG: 863939
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4310
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: mark new entry changelog for create/mknod failures | Pranith Kumar K | 2012-12-04 | 1 | -31/+36
  Problem: When create/mknod fails on some of the nodes, appropriate pending data/metadata changelogs are not assigned. This was not considered to be an issue because entry self-heal would assign the appropriate changelog after creating new entries. But using the combination of rebalance and remove-brick, we can construct a case where a file with the same name and gfid can be created in a dir with different data and link-to xattr without any changelog.

  Fix: When a create/mknod failure is observed, mark the appropriate changelog on the new file created.

  Change-Id: I4c32cbf5594a13fb14deaf97ff30b2fff11cbfd6
  BUG: 858212
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4207
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* afr: make flush non-transactional | Brian Foster | 2012-12-04 | 1 | -3/+0
  Flush is historically a transaction to ensure all previous writes were complete. This is no longer required as write-behind has learned to make flush a barrier operation (re: conversation w/ Avati).

  Flush taking a full file lock causes VMs running on afr volumes to stall when a migration occurs and self-heal is in progress. Make afr_flush() a non-transactional operation.

  BUG: 874045
  Change-Id: If2db83823e280c86b1b29b41361eed7081601632
  Signed-off-by: Brian Foster <bfoster@redhat.com>
  Reviewed-on: http://review.gluster.org/4261
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: check transaction type for eager-lock after it is set | Pranith Kumar K | 2012-11-21 | 1 | -6/+6
  Problem: The eager-locking lk-owner decision is taken before the transaction type is set. The default transaction type is DATA, so all transactions are treated as DATA transactions at the time of the eager-locking decision.

  Fix: Move the code that takes the lk-owner decision after the transaction type is set.

  Test: Checked in gdb that the transaction type is set properly at the time of the lk-owner decision.

  Change-Id: I7607c7ff4f88c7ced5416a1cddb6586cf45d88f9
  BUG: 861335
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
  Reviewed-on: http://review.gluster.org/4220
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* Preventing client crashing as the callings of GF_CALLOC has been failed. | linbaiye | 2012-10-11 | 1 | -9/+14
  On the rare occasions when a GF_CALLOC call fails, the glusterfs client crashes with a segmentation fault. We should return as soon as the variables of the transaction's local cannot be allocated.

  Change-Id: Ia3798b8349d832b23c7825e64dbad93ebe29cd1b
  BUG: 861335
  Signed-off-by: linbaiye <linbaiye@gmail.com>
  Reviewed-on: http://review.gluster.org/4005
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: check transaction type for eager-lock after it is set | Pranith Kumar K | 2012-10-10 | 1 | -6/+6
  Problem: The eager-locking lk-owner decision is taken before the transaction type is set. The default transaction type is DATA, so all transactions are treated as DATA transactions at the time of the eager-locking decision.

  Fix: Move the code that takes the lk-owner decision after the transaction type is set.

  Test: Checked in gdb that the transaction type is set properly at the time of the lk-owner decision.

  Change-Id: Ib1c886866f28788aed67622982e86d667b2cdb80
  BUG: 864786
  Signed-off-by: Pranith Kumar K <pranithk@gluster.com>
  Reviewed-on: http://review.gluster.org/4053
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Wake up post-op on non-co-operative transaction | Pranith Kumar K | 2012-09-25 | 1 | -0/+23
  Problem: The problem is observed when a kernel untar is done: one file is untarred every second. The reason is that the setattr lock is blocked on the previous fd data-transaction's full-lock (because of eager-lock). Because of post-op-delay, the post-op (xattrop + unlock) of the previous data-transaction happens after 1 sec. Until then the setattr is blocked, resulting in performance problems in untar.

  Fix: Whenever a loc data or metadata transaction comes, it should wake up the previous post-op on the same process' fd.

  Tests: The performance problem in untar went away. I put a breakpoint in client_finodelk for a 2G file dd and the inodelk is hit only 4 times. This confirms that the change does not affect post-op-delay negatively.

  Change-Id: Ice3c2a1211f4dca6520a19bc4ba6cb9efb2902ad
  BUG: 845754
  Signed-off-by: Pranith Kumar K <pranithk@gluster.com>
  Reviewed-on: http://review.gluster.org/3975
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Handle child_up & fd not opened case in xaction | Pranith Kumar K | 2012-08-01 | 1 | -7/+8
  RCA: When an fd is opened while a brick is down, after the brick comes back up afr issues open on the other brick. It can fail for a number of reasons (ENOENT etc). While the system is in that state, inode/entrylks and pre-op happen only on the brick that is up and on which the fd is opened for fd-fops. post-op should consider only the bricks where both pre-op and fop succeeded as success, and the rest of them as failures. The code currently marks only the children that are down as failures, as opposed to child_down & fd-not-opened. This makes the changelog appear as success on the subvolume where we did not do any fop, leading to no changelog but differences in data/metadata for reg-files.

  Fix: Mark non-participants of the fop as failures. This is tracked in transaction.pre_op[].

  Tests: Simulated the scenario using err-gen on top of one of the client xlators, which fails all fops always. Performed fops, and the changelog represented pending fops on the brick with err-gen loaded. Tested the case of a brick being down and performed entry/metadata/data operations to confirm they still work as expected.

  Change-Id: I41905936126b19abba56ca581c0301a894507e1a
  BUG: 844987
  Signed-off-by: Pranith Kumar K <pranithk@gluster.com>
  Reviewed-on: http://review.gluster.com/3765
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* remove useless if-before-free (and free-like) functions | Jim Meyering | 2012-07-13 | 1 | -2/+1
  See comments in http://bugzilla.redhat.com/839925 for the code to perform this change.

  Signed-off-by: Jim Meyering <meyering@redhat.com>
  BUG: 839925
  Change-Id: I10e4ecff16c3749fe17c2831c516737e08a3205a
  Reviewed-on: http://review.gluster.com/3661
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: post-op-delay support | Anand Avati | 2012-07-04 | 1 | -1/+146
  post-op-delay introduces an artificial delay between the OP and POST-OP-CHANGELOG phases of a write transaction, to increase the probability that changelog-piggyback and eager-locking work efficiently.

  Also enable eager-locking by default.

  Change-Id: I865ca4b68512c44818719c7e388952f15d53e6c2
  BUG: 836033
  Signed-off-by: Anand Avati <avati@redhat.com>
  Reviewed-on: http://review.gluster.com/3621
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Pranith Kumar Karampuri <pranithk@gluster.com>
* cluster/afr: cleanup lk_owner and PID mess  Anand Avati  2012-07-04  1  -18/+27
  Historically PID (frame->root->pid) was used by the locks translator to identify a locker (and make decisions about which locks contend or cooperate/merge). Since the introduction of the lock_owner parameter the usage of PID (for locks) was deprecated and is now unused. This patch nukes the usage of PID in AFR.

  The usage of lk_owner has also ended up being a mess, because of the differentiation required between ->lk() and ->inodelk() (->lk() needs to be identified by the process (roughly) and ->inodelk() needs to be identified by the transaction) and also because of optimizations like eager locking (locks are no longer identified by the transaction as they now get inherited by the next transaction).

  The scheme (and technique) now is:

  - All FOPs (the third phase of the transaction) happen with the lk_owner which is set by the topmost layer (FUSE, NFS etc.)
  - All entrylks are issued with lk_owner set to the frame->root address.
  - Inodelks which will not be subject to eager locking are issued with lk_owner set to frame->root.
  - Inodelks which are subject to eager locking are issued with lk_owner set to the address of fd_t (which are the only type of frames which get subject to the eager locking optimization)
  - At the start of the transaction, the transaction frame's lk_owner is set to either frame->root or fd_t (and is never left unmodified) depending on the type of transaction.
  - Just before the third phase (FOP phase) the set lk_owner is "saved" away and overwritten by the lk_owner submitted by the top layer (FUSE or NFS)
  - Right after the third phase, the saved lk_owner is "restored" to resume the transaction into the POST-OP and eventually UNLOCK using the same lk_owner which was used during the LOCK phase.

  Change-Id: I6ab8e4d6b65ae4185fa85ad3fded8e9188b2f929
  BUG: 836033
  Signed-off-by: Anand Avati <avati@redhat.com>
  Reviewed-on: http://review.gluster.com/3620
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Pranith Kumar Karampuri <pranithk@gluster.com>
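The save/overwrite/restore scheme listed above can be followed phase by phase in a short Python model. This is a sketch, not AFR code: the Frame class, the integer "addresses", and both function names are hypothetical, and entrylk transactions are represented simply by passing eager_lock=False.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    root: int      # stands in for the frame->root address
    lk_owner: int  # lk_owner set by the top layer (FUSE, NFS, ...)


def transaction_lk_owner(frame, fd_addr, eager_lock):
    """Pick the lk_owner for LOCK, PRE-OP, POST-OP and UNLOCK.

    Eager-locked inodelks use the fd_t address; everything else
    (entrylks, non-eager inodelks) uses the frame->root address.
    """
    return fd_addr if eager_lock else frame.root


def run_transaction(frame, fd_addr, eager_lock):
    """Return the lk_owner in effect during each phase of the transaction."""
    seen = {}
    top_layer_owner = frame.lk_owner

    frame.lk_owner = transaction_lk_owner(frame, fd_addr, eager_lock)
    seen["lock"] = frame.lk_owner
    seen["pre-op"] = frame.lk_owner

    saved = frame.lk_owner            # "save" just before the FOP phase
    frame.lk_owner = top_layer_owner  # FOP runs with the top layer's owner
    seen["fop"] = frame.lk_owner

    frame.lk_owner = saved            # "restore" right after the FOP phase
    seen["post-op"] = frame.lk_owner
    seen["unlock"] = frame.lk_owner
    return seen


phases = run_transaction(Frame(root=0x1000, lk_owner=0xABC), fd_addr=0x2000, eager_lock=True)
```

The point of the model is that LOCK and UNLOCK always see the same owner, while the FOP in the middle sees whatever the top layer submitted.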
* cluster/afr: decrement changelog when fdctx is absent  Pranith Kumar K  2012-05-24  1  -0/+2
  Change-Id: I9d76ddbd2cf8e4e8e4ad70529ba3a70178489a68
  BUG: 765194
  Signed-off-by: Pranith Kumar K <pranithk@gluster.com>
  Reviewed-on: http://review.gluster.com/3435
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Enforce order in pre/post op  Pranith Kumar K  2012-05-18  1  -72/+86
  The xattrop order in pre/post op on all the subvols is client-0, client-1... client-n where n is (replica-count - 1). This order can lead to invalid split-brains if the brick dies in the middle of xattrops.

  Example: transaction completed pre-op, so on all the subvolumes xattrs have '1' changelog. Now post-op is sent to both the subvols. On subvol-0 change-log of client-0 is decremented to 0, before decrementing change-log of client-1 to 0 the brick dies. This change-log status on subvol-0 gives the meaning that a change is done on subvol-0 successfully but on subvol-1 it failed. Which is not what happened. Changes done when the subvol-0 was down will lead to pending change-log on subvol-1 for subvol-0. Which is correct. When the subvol-0 is brought back up, the change-log will be in split-brain state even when it is not a legitimate split-brain.

  If the brick dies in the middle of xattrops it should remain fool. Pre-op should perform xattrop of the local change-log first and post-op should perform xattrop of the local change-log last. In case of optimistic changelogs txn_changelog should be done last on local if it succeeds, first if it fails.

  Change-Id: Ib6eeb20cdc49b0b1fd2f454f25a9c8e08388c6e7
  BUG: 765194
  Signed-off-by: Pranith Kumar K <pranithk@gluster.com>
  Reviewed-on: http://review.gluster.com/3226
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
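The example in this commit message can be replayed with a small Python model of one subvolume's changelog. It is a sketch under assumptions: the changelog is a plain list of counters indexed by client number, each key is updated in a separate step so that a crash can land between steps, and all function names are hypothetical. "Accuses itself" is used here as a reading of what the message calls remaining "fool".

```python
def xattrop_order(local, replica_count, phase):
    """Order in which changelog keys are updated on subvolume `local`.

    Pre-op touches the local key first; post-op touches it last.
    """
    others = [i for i in range(replica_count) if i != local]
    if phase == "pre-op":
        return [local] + others
    return others + [local]


def post_op_with_crash(changelog, order, crash_after):
    """Decrement keys in `order`, with the brick dying after `crash_after` steps."""
    result = list(changelog)
    for key in order[:crash_after]:
        result[key] -= 1
    return result


def accuses_itself(local, changelog):
    """True if the subvolume still holds a pending count against itself."""
    return changelog[local] != 0


after_pre_op = [1, 1]  # subvol-0's view after a completed pre-op

# Old order (client-0, client-1): brick dies after the first decrement.
old_state = post_op_with_crash(after_pre_op, [0, 1], crash_after=1)

# New order (local key last): same crash point.
new_order = xattrop_order(local=0, replica_count=2, phase="post-op")
new_state = post_op_with_crash(after_pre_op, new_order, crash_after=1)
```

With the old order subvol-0 is left blaming only subvol-1, which is the bogus state the message describes; with the local key decremented last, subvol-0 still carries a pending count against itself after the crash.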
* license: dual license under GPLV2 and LGPLV3+  Kaleb KEITHLEY  2012-05-10  1  -14/+5
  Note that the license was not changed in any of the following:
    .../argp-standalone/...
    .../booster/...
    .../cli/...
    .../contrib/...
    .../extras/...
    .../glusterfsd/...
    .../glusterfs-hadoop/...
    .../mod_clusterfs/...
    .../scheduler/...
    .../swift/...

  The license was not changed in any of the non-building xlators.

  The license was not changed in any of the xlators that seemed, to me, to be clearly server-side only, e.g. protocol/server.

  Note too that copyright was changed along with the license; I did not change the copyright in files where the license did not change.

  If you find any errors or omissions please don't hesitate to let me know.

The complete list of files with the license change is: libglusterfs/src/byte-order.h libglusterfs/src/call-stub.c libglusterfs/src/call-stub.h libglusterfs/src/checksum.c libglusterfs/src/checksum.h libglusterfs/src/circ-buff.c libglusterfs/src/circ-buff.h libglusterfs/src/common-utils.c libglusterfs/src/common-utils.h libglusterfs/src/compat-errno.c libglusterfs/src/compat-errno.h libglusterfs/src/compat.c libglusterfs/src/compat.h libglusterfs/src/daemon.c libglusterfs/src/daemon.h libglusterfs/src/defaults.c libglusterfs/src/defaults.h libglusterfs/src/dict.c libglusterfs/src/dict.h libglusterfs/src/event-history.c libglusterfs/src/event-history.h libglusterfs/src/event.c libglusterfs/src/event.h libglusterfs/src/fd-lk.c libglusterfs/src/fd-lk.h libglusterfs/src/fd.c libglusterfs/src/fd.h libglusterfs/src/gf-dirent.c libglusterfs/src/gf-dirent.h libglusterfs/src/globals.c libglusterfs/src/globals.h libglusterfs/src/glusterfs.h libglusterfs/src/graph-print.c libglusterfs/src/graph-utils.h libglusterfs/src/graph.c libglusterfs/src/hashfn.c libglusterfs/src/hashfn.h libglusterfs/src/iatt.h libglusterfs/src/inode.c libglusterfs/src/inode.h libglusterfs/src/iobuf.c libglusterfs/src/iobuf.h libglusterfs/src/latency.c libglusterfs/src/latency.h libglusterfs/src/list.h libglusterfs/src/lkowner.h libglusterfs/src/locking.h libglusterfs/src/logging.c libglusterfs/src/logging.h libglusterfs/src/mem-pool.c libglusterfs/src/mem-pool.h libglusterfs/src/mem-types.h libglusterfs/src/options.c libglusterfs/src/options.h libglusterfs/src/rbthash.c libglusterfs/src/rbthash.h libglusterfs/src/run.c libglusterfs/src/run.h libglusterfs/src/scheduler.c libglusterfs/src/scheduler.h libglusterfs/src/stack.c libglusterfs/src/stack.h libglusterfs/src/statedump.c libglusterfs/src/statedump.h libglusterfs/src/syncop.c libglusterfs/src/syncop.h libglusterfs/src/syscall.c libglusterfs/src/syscall.h libglusterfs/src/timer.c libglusterfs/src/timer.h libglusterfs/src/trie.c 
libglusterfs/src/trie.h libglusterfs/src/xlator.c libglusterfs/src/xlator.h libglusterfsclient/src/libglusterfsclient-dentry.c libglusterfsclient/src/libglusterfsclient-internals.h libglusterfsclient/src/libglusterfsclient.c libglusterfsclient/src/libglusterfsclient.h rpc/rpc-lib/src/auth-glusterfs.c rpc/rpc-lib/src/auth-null.c rpc/rpc-lib/src/auth-unix.c rpc/rpc-lib/src/protocol-common.h rpc/rpc-lib/src/rpc-clnt.c rpc/rpc-lib/src/rpc-clnt.h rpc/rpc-lib/src/rpc-transport.c rpc/rpc-lib/src/rpc-transport.h rpc/rpc-lib/src/rpcsvc-auth.c rpc/rpc-lib/src/rpcsvc-common.h rpc/rpc-lib/src/rpcsvc.c rpc/rpc-lib/src/rpcsvc.h rpc/rpc-lib/src/xdr-common.h rpc/rpc-lib/src/xdr-rpc.c rpc/rpc-lib/src/xdr-rpc.h rpc/rpc-lib/src/xdr-rpcclnt.c rpc/rpc-lib/src/xdr-rpcclnt.h rpc/rpc-transport/rdma/src/name.c rpc/rpc-transport/rdma/src/name.h rpc/rpc-transport/rdma/src/rdma.c rpc/rpc-transport/rdma/src/rdma.h rpc/rpc-transport/socket/src/name.c rpc/rpc-transport/socket/src/name.h rpc/rpc-transport/socket/src/socket.c rpc/rpc-transport/socket/src/socket.h xlators/cluster/afr/src/afr-common.c xlators/cluster/afr/src/afr-dir-read.c xlators/cluster/afr/src/afr-dir-read.h xlators/cluster/afr/src/afr-dir-write.c xlators/cluster/afr/src/afr-dir-write.h xlators/cluster/afr/src/afr-inode-read.c xlators/cluster/afr/src/afr-inode-read.h xlators/cluster/afr/src/afr-inode-write.c xlators/cluster/afr/src/afr-inode-write.h xlators/cluster/afr/src/afr-lk-common.c xlators/cluster/afr/src/afr-mem-types.h xlators/cluster/afr/src/afr-open.c xlators/cluster/afr/src/afr-self-heal-algorithm.c xlators/cluster/afr/src/afr-self-heal-algorithm.h xlators/cluster/afr/src/afr-self-heal-common.c xlators/cluster/afr/src/afr-self-heal-common.h xlators/cluster/afr/src/afr-self-heal-data.c xlators/cluster/afr/src/afr-self-heal-entry.c xlators/cluster/afr/src/afr-self-heal-metadata.c xlators/cluster/afr/src/afr-self-heal.h xlators/cluster/afr/src/afr-self-heald.c xlators/cluster/afr/src/afr-self-heald.h 
xlators/cluster/afr/src/afr-transaction.c xlators/cluster/afr/src/afr-transaction.h xlators/cluster/afr/src/afr.c xlators/cluster/afr/src/afr.h xlators/cluster/afr/src/pump.c xlators/cluster/afr/src/pump.h xlators/cluster/dht/src/dht-common.c xlators/cluster/dht/src/dht-common.h xlators/cluster/dht/src/dht-diskusage.c xlators/cluster/dht/src/dht-hashfn.c xlators/cluster/dht/src/dht-helper.c xlators/cluster/dht/src/dht-inode-read.c xlators/cluster/dht/src/dht-inode-write.c xlators/cluster/dht/src/dht-layout.c xlators/cluster/dht/src/dht-linkfile.c xlators/cluster/dht/src/dht-mem-types.h xlators/cluster/dht/src/dht-rebalance.c xlators/cluster/dht/src/dht-rename.c xlators/cluster/dht/src/dht-selfheal.c xlators/cluster/dht/src/dht.c xlators/cluster/dht/src/nufa.c xlators/cluster/dht/src/switch.c xlators/cluster/stripe/src/stripe-helpers.c xlators/cluster/stripe/src/stripe-mem-types.h xlators/cluster/stripe/src/stripe.c xlators/cluster/stripe/src/stripe.h xlators/features/index/src/index-mem-types.h ¹ xlators/features/index/src/index.c ¹ xlators/features/index/src/index.h ¹ xlators/performance/io-cache/src/io-cache.c xlators/performance/io-cache/src/io-cache.h xlators/performance/io-cache/src/ioc-inode.c xlators/performance/io-cache/src/ioc-mem-types.h xlators/performance/io-cache/src/page.c xlators/performance/io-threads/src/io-threads.c xlators/performance/io-threads/src/io-threads.h xlators/performance/io-threads/src/iot-mem-types.h xlators/performance/md-cache/src/md-cache-mem-types.h xlators/performance/md-cache/src/md-cache.c xlators/performance/quick-read/src/quick-read-mem-types.h xlators/performance/quick-read/src/quick-read.c xlators/performance/quick-read/src/quick-read.h xlators/performance/read-ahead/src/page.c xlators/performance/read-ahead/src/read-ahead-mem-types.h xlators/performance/read-ahead/src/read-ahead.c xlators/performance/read-ahead/src/read-ahead.h xlators/performance/symlink-cache/src/symlink-cache.c 
xlators/performance/write-behind/src/write-behind-mem-types.h xlators/performance/write-behind/src/write-behind.c xlators/protocol/auth/addr/src/addr.c ¹ xlators/protocol/auth/login/src/login.c ¹ xlators/protocol/client/src/client-callback.c xlators/protocol/client/src/client-handshake.c xlators/protocol/client/src/client-helpers.c xlators/protocol/client/src/client-lk.c xlators/protocol/client/src/client-mem-types.h xlators/protocol/client/src/client.c xlators/protocol/client/src/client.h xlators/protocol/client/src/client3_1-fops.c ¹ Copyright only, license reverted to original Change-Id: If560e826c61b6b26f8b9af7bed6e4bcbaeba31a8 BUG: 820551 Signed-off-by: Kaleb KEITHLEY <kkeithle@redhat.com> Reviewed-on: http://review.gluster.com/3304 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vijay@gluster.com>
* cluster/afr: Perform Flush with lk-owner given by parent xlator.  Pranith Kumar K  2012-05-07  1  -0/+5
  Lk-owner of posix-lk and flush should be same, flush can't clear posix-lks without that lk-owner.

  Change-Id: If775abb5741a0beb00c419b54d023fbd429e3cb7
  BUG: 810502
  Signed-off-by: Pranith Kumar K <pranithk@gluster.com>
  Reviewed-on: http://review.gluster.com/3221
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Vijay Bellur <vijay@gluster.com>
* cluster/afr: increment change log with correct byte order  Pranith Kumar K  2012-04-16  1  -2/+5
  Change-Id: Id2af3e61ad659ff6d168161673e5e1e19f36bdb5
  BUG: 765194
  Signed-off-by: Pranith Kumar K <pranithk@gluster.com>
  Reviewed-on: http://review.gluster.com/3149
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Reviewed-by: Vijay Bellur <vijay@gluster.com>
* core: adding extra data for fops  Amar Tumballi  2012-03-22  1  -24/+47
  With this change, the xlator APIs will have a dictionary as an extra argument, which is passed between all the layers. This can be utilized for overloading in some of the operations.

  Change-Id: I58a8186b3ef647650280e63f3e5e9b9de7827b40
  Signed-off-by: Amar Tumballi <amarts@redhat.com>
  BUG: 782265
  Reviewed-on: http://review.gluster.com/2960
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Enable eager-lock  Pranith Kumar K  2012-03-17  1  -1/+2
  Eager-lock is disabled by default. Use cluster.eager-lock on/off to change the config. write-behind on and eager-lock off is not a supported configuration.

  In afr, when eager-lock is enabled the inode lock on fd is taken using the fd address as the lk-owner. So the lock is interchangeable between the inode-locks on the same fd.

  Change-Id: I7eef1ecd510f8028f5395dee882782da53c0de3f
  BUG: 802515
  Signed-off-by: Pranith Kumar K <pranithk@gluster.com>
  Reviewed-on: http://review.gluster.com/2925
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Reviewed-by: Anand Avati <avati@redhat.com>
* Revert "afr: [Un]Set the 'right' lkowner for [f]{inode|entry}_lk and the 'enclosed' fop."  Vijay Bellur  2012-03-03  1  -0/+2
  This reverts commit 2e80fdbeb6abbb23ff6789c2b98c82704883af0a.

  Change-Id: I417fd43e4195d63e5b8b83dd3beb712887130e1e
  Reviewed-on: http://review.gluster.com/2860
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Vijay Bellur <vijay@gluster.com>
* afr: [Un]Set the 'right' lkowner for [f]{inode|entry}_lk and the 'enclosed' fop.  Krishnan Parthasarathi  2012-03-01  1  -2/+0
  afr 'mangles' the lkowner in order to ensure [f]inodelk/[f]entrylk fops from the same application contend. But other fops that are 'visible' to the application should operate with the lkowner provided by fuse for correct functioning of posix-locks xlator.

  Change-Id: I7e71f35ae7df2a070f1f46d4fc77eed26a717673
  BUG: 790743
  Signed-off-by: Krishnan Parthasarathi <kp@gluster.com>
  Reviewed-on: http://review.gluster.com/2752
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Vijay Bellur <vijay@gluster.com>
* cluster/afr: Perform xattrop with all afr-keys  Pranith Kumar K  2012-01-27  1  -11/+7
  Self-heal does not happen if the file has change log xattr only for one of the subvol keys. This patch makes sure that xattrop is done for all the afr subvol keys after a new entry is created in entry-self-heal.

  1) Added matrix create/cleanup functions
  2) Impunging a new file does multiple xattrops on the source subvol, one per sink. The code can do a single xattrop after the entry is created on all the sinks.
  3) Missing entry self-heal uses one frame per sink to heal the file. This leads to multiple xattrops on the source subvol. That code is changed now to use one frame which will create the file on all subvols.

  Change-Id: I65a42f9779b03f7efae283479f8653fb2cb8046b
  BUG: 762680
  Signed-off-by: Pranith Kumar K <pranithk@gluster.com>
  Reviewed-on: http://review.gluster.com/2503
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
  Reviewed-by: Krishnan Parthasarathi <kp@gluster.com>
* core: GFID filehandle based backend and anonymous FDs  Anand Avati  2012-01-20  1  -6/+35
  1. What
  --------
  This change introduces an infrastructure change in the filesystem which lets filesystem operations address objects (inodes) just by their GFID. Thus far GFID has been a unique identifier of a user-visible inode. But in terms of addressability the only mechanism thus far has been the backend filesystem path, which could be derived from the GFID only if it was cached in the inode table along with the entire set of dentry ancestry leading up to the root.

  This change essentially decouples addressability from the namespace. It is no longer necessary to be aware of the parent directory to address a file or directory.

  2. Why
  -------
  The biggest use case for such a feature is NFS for generating persistent filehandles. So far the technique for generating filehandles in NFS has been to encode path components so that the appropriate inode_t can be repopulated into the inode table by means of a recursive lookup of each component top-down.

  Another use case is the ability to perform more intelligent self-healing and rebalancing of inodes with hardlinks and also to detect renames.

  A derived feature from GFID filehandles is anonymous FDs. An anonymous FD is an internal USABLE "fd_t" which does not map to a user opened file descriptor or to an internal ->open()'d fd. The ability to address a file by the GFID eliminates the need to have a persistent ->open()'d fd for the purpose of avoiding the namespace. This improves NFS read/write performance significantly by eliminating open/close calls and also fixes some of today's limitations (like keeping an FD open longer than necessary resulting in disk space leakage)

  3. How
  -------
  At each storage/posix translator level, every file is hardlinked inside a hidden .glusterfs directory (under the top level export) with the name as the ascii-encoded standard UUID format string. For reasons of performance and scalability there is a two-tier classification of those hardlinks under directories with the initial parts of the UUID string as the directory names.

  For directories (which cannot be hardlinked), the approach is to use a symlink which dereferences the parent GFID path along with basename of the directory. The parent GFID dereference will in turn be a dereference of the grandparent with the parent's basename, and so on recursively up to the root export.

  4. Development
  ---------------
  4a. To leverage the ability to address an inode by its GFID, the technique is to perform a "nameless lookup". This means, to populate a loc_t structure as:

    loc_t {
       pargfid: NULL
       parent: NULL
       name: NULL
       path: NULL
       gfid: GFID to be looked up [out parameter]
       inode: inode_new () result [in parameter]
    }

  and performing such lookup will return in its callback an inode_t populated with the right contexts and a struct iatt which can be used to perform an inode_link () on the inode (without a parent and basename). The inode will now be hashed and linked in the inode table and findable via inode_find().

  A fundamental change moving forward is that the primary fields in a loc_t structure are now going to be (pargfid, name) and (gfid) depending on the kind of FOP. So far path had been the primary field for operations. The remaining fields only serve as hints/helpers.

  4b. If read/write is to be performed on an inode_t, the approach so far has been to: fd_create(), STACK_WIND(open, fd), fd_bind (in callback) and then perform STACK_WIND(read, fd) etc. With anonymous fds now you can do fd_anonymous (inode), STACK_WIND (read, fd). This results in a great boost in performance in the inbuilt NFS server.

  5. Misc
  -------
  The inode_ctx_put[2] has been renamed to inode_ctx_set[2] to be consistent with the rest of the codebase.

  Change-Id: Ie4629edf6bd32a595f4d7f01e90c0a01f16fb12f
  BUG: 781318
  Reviewed-on: http://review.gluster.com/669
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Anand Avati <avati@gluster.com>
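The two-tier .glusterfs layout described in section 3 of this commit can be illustrated with a short Python helper that maps a GFID to its hardlink path. This is a sketch, not the storage/posix code: the function name and the export path are hypothetical, and the choice of two hex characters per tier is an assumption, since the message only says the directory names come from "the initial parts of the UUID string".

```python
import uuid
from pathlib import PurePosixPath


def gfid_handle_path(export_root, gfid):
    """Build the hidden hardlink path for a GFID under the export root.

    Assumes the first tier is the first two characters of the UUID string
    and the second tier is the next two; the leaf is the full UUID string.
    """
    gfid_str = str(gfid)  # standard lowercase 8-4-4-4-12 form
    return (
        PurePosixPath(export_root)
        / ".glusterfs"
        / gfid_str[0:2]
        / gfid_str[2:4]
        / gfid_str
    )


example_gfid = uuid.UUID("6e3f0a1c-0000-4000-8000-000000000001")
handle = gfid_handle_path("/export/brick1", example_gfid)
```

Because the path depends only on the GFID, a lookup needs neither the parent directory nor the basename, which is the decoupling the message describes.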
* cluster/afr: Double the call count if transaction is for rename  Pranith Kumar K  2011-12-13  1  -4/+18
  In rename the changelog modification needs to happen both on old parent-dir and new parent-dir, so 2 stack winds are done per brick.

  Change-Id: I43f34661e397c4288162213944529e18b7724b1d
  BUG: 766603
  Signed-off-by: Pranith Kumar K <pranithk@gluster.com>
  Reviewed-on: http://review.gluster.com/783
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Vijay Bellur <vijay@gluster.com>
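The accounting implied by this message can be written down in a few lines of Python. This is a minimal sketch under an assumption: the expected number of changelog callbacks is taken to be one per participating brick, doubled for rename; the function name and the list of booleans are hypothetical.

```python
def changelog_call_count(child_up, is_rename):
    """Number of changelog callbacks to wait for before moving on.

    A rename updates the changelog on both the old and the new parent
    directory, so each participating brick receives two winds.
    """
    winds_per_brick = 2 if is_rename else 1
    return sum(1 for up in child_up if up) * winds_per_brick


rename_count = changelog_call_count([True, True], is_rename=True)
```

If the count were not doubled, the transaction would consider the changelog phase complete after only half of the rename's winds had returned.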
* cluster/afr: Update read-child if it becomes stale  Pranith Kumar K  2011-11-28  1  -25/+19
  Change-Id: I00c714a89575023f6dbdd3430dcbf191e5d08019
  BUG: 3650
  Reviewed-on: http://review.gluster.com/740
  Tested-by: Gluster Build System <jenkins@build.gluster.com>
  Reviewed-by: Vijay Bellur <vijay@gluster.com>