summaryrefslogtreecommitdiffstats
path: root/xlators/cluster
Commit message (Collapse)AuthorAgeFilesLines
* glusterfs: zerofill supportM. Mohan Kumar2013-11-108-0/+588
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for a new ZEROFILL fop. Zerofill writes zeroes to a file in the specified range. This fop will be useful when a whole file needs to be initialized with zero (could be useful for zero filled VM disk image provisioning or during scrubbing of VM disk images). Client/application can issue this FOP for zeroing out. Gluster server will zero out required range of bytes ie server offloaded zeroing. In the absence of this fop, client/application has to repetitively issue write (zero) fop to the server, which is very inefficient method because of the overheads involved in RPC calls and acknowledgements. WRITESAME is a SCSI T10 command that takes a block of data as input and writes the same data to other blocks and this write is handled completely within the storage and hence is known as offload . Linux ,now has support for SCSI WRITESAME command which is exposed to the user in the form of BLKZEROOUT ioctl. BD Xlator can exploit BLKZEROOUT ioctl to implement this fop. Thus zeroing out operations can be completely offloaded to the storage device , making it highly efficient. The fop takes two arguments offset and size. It zeroes out 'size' number of bytes in an opened file starting from 'offset' position. This patch adds zerofill support to the following areas: - libglusterfs - io-stats - performance/md-cache,open-behind - quota - cluster/afr,dht,stripe - rpc/xdr - protocol/client,server - io-threads - marker - storage/posix - libgfapi Client applications can exloit this fop by using glfs_zerofill introduced in libgfapi.FUSE support to this fop has not been added as there is no system call for this fop. Changes from previous version 3: * Removed redundant memory failure log messages Changes from previous version 2: * Rebased and fixed build error Changes from previous version 1: * Rebased for latest master TODO : * Add zerofill support to trace xlator * Expose zerofill capability as part of gluster volume info Here is a performance comparison of server offloaded zeofill vs zeroing out using repeated writes. [root@llmvm02 remote]# time ./offloaded aakash-test log 20 real 3m34.155s user 0m0.018s sys 0m0.040s [root@llmvm02 remote]# time ./manually aakash-test log 20 real 4m23.043s user 0m2.197s sys 0m14.457s [root@llmvm02 remote]# time ./offloaded aakash-test log 25; real 4m28.363s user 0m0.021s sys 0m0.025s [root@llmvm02 remote]# time ./manually aakash-test log 25 real 5m34.278s user 0m2.957s sys 0m18.808s The argument log is a file which we want to set for logging purpose and the third argument is size in GB . As we can see there is a performance improvement of around 20% with this fop. Change-Id: I081159f5f7edde0ddb78169fb4c21c776ec91a18 BUG: 1028673 Signed-off-by: Aakash Lal Das <aakash@linux.vnet.ibm.com> Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com> Reviewed-on: http://review.gluster.org/5327 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Remove 'max' from the logPranith Kumar K2013-10-171-1/+1
| | | | | | | | | | | | This patch avoids giving more info to the user about the internal heuristic employed in afr, for quota sizes. Change-Id: Ice3a164399f09b6967500ec0c17dc340e7ae9aba BUG: 1016683 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/6098 Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Vijay Bellur <vbellur@redhat.com>
* afr: check for split-brain before proceeding with fopsRavishankar N2013-10-162-12/+33
| | | | | | | | | | | Bail out of fops if split brain has been detected during lookup Change-Id: Id387dbb1a25eec4a121dedceadc6069bdea24b5d BUG: 1010834 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/5988 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* dht: dht_lookup_dir_cbk should set op_errno as local->op_errnoKaushal M2013-10-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | Two glusterfs clients return inconsistent errnos when the bricks of the volume were down. Consider two gluster mounts. Mount 1 was done when the bricks were online. Mount 2 was done after the bricks were killed, (using the 'glusterfs' command instead of the mount script). For any request, mount 1 will return ENOTCONN, where as mount 2 will return ENOENT. This happens because for the 2nd mount, a fuse would send a lookup on '/' for any request, as it hadn't been done yet. The client xlator returns ENOTCONN, but the dht_lookup_dir_cbk changed this to ENOENT unconditionally when aggregating. So, fuse returned ENOENT, even though the errno should have been ENOTCONN. Change-Id: I4b7a6d84ce5153045a807fccc01485afe0377117 BUG: 1019095 Signed-off-by: Kaushal M <kaushal@redhat.com> Reviewed-on: http://review.gluster.org/6072 Reviewed-by: Anand Avati <avati@redhat.com> Tested-by: Anand Avati <avati@redhat.com>
* libglusterfs: Add monotonic clocking counter for timer threadHarshavardhana2013-10-152-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | gettimeofday() returns the current wall clock time and timezone. Using these functions in order to measure the passage of time (how long an operation took) therefore seems like a no-brainer. This time suffer's from some limitations: a. They have a low resolution: “High-performance” timing by definition, requires clock resolutions into the microseconds or better. b. They can jump forwards and backwards in time: Computer clocks all tick at slightly different rates, which causes the time to drift. Most systems have NTP enabled which periodically adjusts the system clock to keep them in sync with “actual” time. The adjustment can cause the clock to suddenly jump forward (artificially inflating your timing numbers) or jump backwards (causing your timing calculations to go negative or hugely positive). In such cases timer thread could go into an infinite loop. From 'man gettimeofday': ---------- .. .. The time returned by gettimeofday() is affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the system time). If you need a monotonically increasing clock, see clock_gettime(2). .. .. ---------- Rationale: For calculating interval timing for Timer thread, all that’s needed should be clock as a simple counter that increments at a stable rate. This is necessary to avoid the jumps which are caused by using "wall time", this counter must be monotonic that can never “tick” backwards, ever. Change-Id: I701d31e71a85a73d21a6c5cd15583e7a5a645eeb BUG: 1017993 Signed-off-by: Harshavardhana <harsha@harshavardhana.net> Reviewed-on: http://review.gluster.org/6070 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: [Feature] Command implementation to get heal-countVenkatesh Somyajulu2013-10-142-7/+132
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently to know the number of files to be healed, either user has to go to backend and check the number of entries present in indices/xattrop directory. But if a volume consists of large number of bricks, going to each backend and counting the number of entries is a time-taking task. Otherwise user can give gluster volume heal vol-name info command but with this approach if no. of entries are very hugh in the indices/ xattrop directory, it will comsume time. So as a feature, new command is implemented. Command 1: gluster volume heal vn statistics heal-count This command will get the number of entries present in every brick of a volume. The output displays only entries count. Command 2: gluster volume heal vn statistics heal-count replica 192.168.122.1:/home/user/brickname Here if we are concerned with just one replica. So providing any one of the brick of a replica will get the number of entries to be healed for that replica only. Example: Replicate volume with replica count 2. Backend status: -------------- [root@dhcp-0-17 xattrop]# ls -lia | wc -l 1918 NOTE: Out of 1918, 2 entries are <xattrop-gfid> dummy entries so actual no. of entries to be healed are 1916. [root@dhcp-0-17 xattrop]# pwd /home/user/2ty/.glusterfs/indices/xattrop Command output: -------------- Gathering count of entries to be healed on volume volume3 has been successful Brick 192.168.122.1:/home/user/22iu Status: Brick is Not connected Entries count is not available Brick 192.168.122.1:/home/user/2ty Number of entries: 1916 Change-Id: I72452f3de50502dc898076ec74d434d9e77fd290 BUG: 1015990 Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com> Reviewed-on: http://review.gluster.org/6044 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr : Implementation of command "gluster volume heal vn statistics"Venkatesh Somyajulu2013-10-146-45/+523
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "gluster volume heal volumename statistics" command gives the summary of the afr crawl done based on the entries present in the xattrop directory. Whenever afr crawls are attempted, the beginning time of crawl, end time of crawl, no of files healed, heal-failed count and number of files in split brain are shown along with the type of the crawl. If crawl is already in progress then it will give the number of files healed, heal failed count and number of files in split-brain from the beginning of the crawl and instead of telling the end time of the crawl, "CRAWL IN PROGRESS" message will be shown. Output format: command: "gluster volume heal volume-name statistics" Output: Gathering afr crawl statistics crawl statistics on volume volume-name has been successful ------------------------------------------------ Crawl statistics for brick no 0 Hostname of brick 192.168.122.248 Starting time of crawl: Wed Jul 10 15:52:38 2013 Ending time of crawl: Wed Jul 10 15:52:38 2013 Type of crawl: INDEX No. of entries healed: 0 No. of entries in split-brain: 0 No. of heal failed entries: 0 Starting time of crawl: Wed Jul 10 15:52:38 2013 Ending time of crawl: Wed Jul 10 15:52:38 2013 Type of crawl: INDEX No. of entries healed: 0 No. of entries in split-brain: 0 No. of heal failed entries: 0 ------------------------------------------------ Crawl statistics for brick no 1 Hostname of brick 192.168.122.1 Starting time of crawl: Wed Jul 10 15:52:42 2013 Ending time of crawl: Wed Jul 10 15:52:42 2013 Type of crawl: INDEX No. of entries healed: 0 No. of entries in split-brain: 0 No. of heal failed entries: 0 Starting time of crawl: Wed Jul 10 15:52:42 2013 Ending time of crawl: Wed Jul 10 15:52:42 2013 Type of crawl: INDEX No. of entries healed: 0 No. of entries in split-brain: 0 No. of heal failed entries: 0 -------------------------------------------------- Change-Id: I10bf9d10b005741db9973fb1352e0dd59ed99aa9 BUG: 949400 Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com> Reviewed-on: http://review.gluster.org/4790 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Handle quota size xattr separately in lookupPranith Kumar K2013-10-101-0/+87
| | | | | | | | | | | | | | | | | | Quota size xattrs are not maintained by afr. There is a possibility that they differ even when both the directory changelog xattrs suggest everything is fine. So if there is at least one 'source' check among the sources which has the maximum quota size. Otherwise check among all the available ones for maximum quota size. This way if there is a source and stale copies it always votes for the 'source'. Change-Id: Ia222379cbafa7043dd03f533c105860f2c7b8b0d BUG: 1016683 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/6052 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Varun Shastry <vshastry@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Change Self-heal domain separator to ':'Pranith Kumar K2013-10-032-1/+3
| | | | | | | | | | | | | | | | | | | | '-' can be present in a volume. This may lead to domain collisions in future. Tests: Checked in gdb that domain comes with ':' separator: Breakpoint 1, pl_common_inodelk (frame=0x7fdabcce51a4, this=0x8bde20, volume=0x8b50d0 "r2-replicate-0:self-heal", inode=0x7fdab822f0e8, cmd=6, flock=0x7fdabc76eee4, loc=0x7fdabc76ede4, fd=0x0, xdata=0x7fdabc6e0ab0) at inodelk.c:597 Change-Id: I4456ae35ac8bf21e6361c34e9ad437f744a2e84b BUG: 993981 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/6025 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* Logging : Improved the log message on unlock failure in afr_unlock_inodelk_cbk.Anuradha2013-09-301-2/+6
| | | | | | | | | | | "unlock failed on 1 unlock" seems meaningless in the log message. Improved it. Change-Id: If67d3f9d4aa5310d0b6728a6c89fa58a5cc93d12 BUG: 1012947 Signed-off-by: Anuradha <atalur@redhat.com> Reviewed-on: http://review.gluster.org/6012 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* gNFS: Incorrect NFS ACL encoding for XFSSantosh Kumar Pradhan2013-09-292-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | Problem: Incorrect NFS ACL encoding causes "system.posix_acl_default" setxattr failure on bricks on XFS file system. XFS (potentially others?) doesn't understand when the 0x10 prefix is added to the ACL type field for default ACLs (which the Linux NFS client adds) which causes setfacl()->setxattr() to fail silently. NFS client adds NFS_ACL_DEFAULT(0x1000) for default ACL. FIX: Mask the prefix (added by NFS client) OFF, so the setfacl is not rejected when it hits the FS. Original patch by: "Richard Wareing" Change-Id: I17ad27d84f030cdea8396eb667ee031f0d41b396 BUG: 1009210 Signed-off-by: Santosh Kumar Pradhan <spradhan@redhat.com> Reviewed-on: http://review.gluster.org/5980 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* core: block unused signals in created threadsAnand Avati2013-09-251-2/+2
| | | | | | | | | | | | | | | Block all signal except those which are set for explicit handling in glusterfs_signals_setup(). Since thread spawning code in libglusterfs and xlators can get called from application threads when used through libgfapi, it is necessary to do this blocking. Change-Id: Ia320f80521a83d2edcda50b9ad414583a0175281 BUG: 1011662 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/5995 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* Revert "cluster/dht: Return success in dht_discover if layout issues"shishir gowda2013-09-243-70/+35
| | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit a3e593f9f17cb1e68db97bb5a0d8074793a33964 which was bought into fix dht_layout_anomalies error handling. We still see applications error'ing out due to graph switches. To fix the above issue - We cannot heal in dht_discover, as it is a gfid based lookup, and not path based. So, returning error here would lead to app's to see failure. Also, update the layout in inode_ctx even if it has anomalies. Let subsequent heals fix the issue. Conflicts: xlators/cluster/dht/src/dht-common.c Signed-off-by: shishir gowda <sgowda@redhat.com> Change-Id: I68c1056c3587e04a02344548546ddd06034489c5 BUG: 960348 Reviewed-on: http://review.gluster.org/5443 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: Fix anomaly checkshishir gowda2013-09-231-3/+11
| | | | | | | | | | | | | We were wrongly detecting holes/overlaps for already accounted errors. Additionally, sort should also handle zero'ed out layout Change-Id: Ic3d13e1d735b914f9acc01fe919bc90656baea48 BUG: 1003851 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/5762 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* distribute: Rebalance should provide even disk space distributionHarshavardhana2013-09-191-15/+30
| | | | | | | | | | | | | | | | | | | | | | | | | Earlier disk space check had an issue which didn't provide the needed functionality to avoid migration when the destination had lesser available space, scenario we need to avoid is stated below : During rebalance `migrate-data` - Destination subvol experiences a `reduction` in 'blocks' of free space, at the same time source subvol gains certain 'blocks' of free space. A valid check is necessary here to avoid errorneous move to destination where the space could be scantily available. This patch provides a proper fix in place by subtracting necessary file blocks from destination and adding those blocks to source. Change-Id: I9c7840716a4256ef614ffc0fbfd9f2b456ac28c8 BUG: 982919 Signed-off-by: Harshavardhana <harsha@harshavardhana.net> Reviewed-on: http://review.gluster.org/5961 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Shishir Gowda <sgowda@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cli/glusterd: improve rebalance fix-layout status reportingRavishankar N2013-09-192-0/+6
| | | | | | | | | | | | | | | | | | | Problem: Currenly the CLI rebalance status command output does not indicate the 'type' of rebalance, i.e. whether a full rebalance or only a fix-layout was carried out. Fix: After the rebalance status of all peers is received by the originator glusterd, alter it to reflect the type of rebalance before passing it on to the CLI process. Change-Id: I1940ffda0d36e25e5b33c84a0ea210394cc9e1d3 BUG: 1004744 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/5826 Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Have common inode-write-fop cbkPranith Kumar K2013-09-184-672/+581
| | | | | | | | | | Change-Id: Ia7b324b86d6a7051d187106d7a060155e77defc5 BUG: 910217 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5238 Reviewed-by: Ravishankar N <ravishankar@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* Revert "cluster/distribute: Rebalance should also verify free inodes"Anand Avati2013-09-171-22/+5
| | | | | | | | | | | | | | | This reverts commit 215fea41a96479312a5ab8783c13b30ab9fe00fa Realized soon after merging, that this patch is actually useless. Checking for inode utilization on dst relative to src is of no use because dst is already storing linkfile and inode is already consumed no matter what. We might as well perform the rebalance and save on an inode on the src node. Change-Id: I7b8b8d35d9063a710e1bee1c7c6c7739fcc45ad4 Reviewed-on: http://review.gluster.org/5960 Reviewed-by: Harshavardhana <harsha@harshavardhana.net> Tested-by: Anand Avati <avati@redhat.com>
* cluster/distribute: Rebalance should also verify free inodesHarshavardhana2013-09-171-5/+22
| | | | | | | | | | | | | | | | | | | | | Currently during `MIGRATE_DATA` we never verified about the total inode usage among new and old bricks. Such checks are available for disk space usage but it is also needed for inodes during data migration. Such a check leads to uniform outcome of file distribution upon rebalance. Patch provides: - Check dst_inodes < src_inodes, a friendly `warning` message to indicate we have to ignore such an attempt - Rename __dht_check_free_space() --> __dht_check_free_space_and_inodes() Change-Id: I7bc4dd8b507883f0fb836300e99f0bb083493f5f BUG: 982919 Signed-off-by: Harshavardhana <harsha@harshavardhana.net> Reviewed-on: http://review.gluster.org/5948 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: assign layout onto missing directories tooAnand Avati2013-09-091-4/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | The current self-healing algorithm is ignoring missing directories for assigning new layout. When lookup() is racing against mkdir() or when self-healing a half-done mkdir(), the layout assignment split must happen based on the final number of directories, and not the currently existing number of directories (because we finish mkdir() of missing directories before hash layout assignment). Without this fix, concurrent mkdir() and lookup() will step on each others feet, create a messed up layout on disk, and end up with different in-memory layouts. Once two clients have different in-memory layouts, creation of subdirectory will not arbitrate on the same hashed subvolume and will result in GFID mismatch of the sub-directory. Change-Id: Ia47acad67c265060405984c822b4d37512b9dbb3 BUG: 907072 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/5849 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Peter Portante <pportant@redhat.com> Tested-by: Peter Portante <pportant@redhat.com>
* cluster/afr: Set size based source only when sizes are unequalPranith Kumar K2013-09-031-0/+13
| | | | | | | | | Change-Id: I18583f14edf1011401be15744371e2a6b79d75cc BUG: 1003842 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5763 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* afr: make NOP truncate/ftruncate efficientAnand Avati2013-09-032-0/+17
| | | | | | | | | | | | | If truncate/ftruncate is called with the offset as the current size of file, then skip the durability fsync and unwind quickly. Change-Id: I0baec68d96c6d4d8217d33bd9738f7ed0d1b40c5 BUG: 958118 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/5737 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: Improvement in logging of self heal completion statusVenkatesh Somyajulu2013-08-296-30/+338
| | | | | | | | | | | Additional information for source and sinks are added. Change-Id: I1704956ff86ac3ae36744efe7499c1d1c43faeaf BUG: 968301 Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com> Reviewed-on: http://review.gluster.org/5638 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Reset attempted count before attempting blocking lockPranith Kumar K2013-08-291-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: internal_lock->lk_attempted_count keeps track of the number of blocking locks attempted. lk_expected_count keeps track of the number locks expected. Here are the sequence of steps that happen which lead to the illution that a full file lock is achieved, even without attempting any lock. 2 mounts are doing dd on same file. Both of them witness a brick going down and coming back up again. Both of the mounts issue self-heal 1) Both mount-1, mount-2 attempt full file locks in self-heal domain. lets say mount-1 got the lock, mount-2 attempts blocking lock. 2) mount-1 attempts full file lock in data domain. It goes into blocking mode because some other writes are in progress. Eventually it gets the lock. But this results in lk_attempted_count to be still as 2 and will not be reset. It completes syncing the data. 3) mount-1 before unlocking final small range lock attempts full file lock in data domain to figure out the source/sink. This will be put into blocked mode again because some other writes are in progress. But this time seeing the stale value of lk_attempted_count being equal to lk_expected_count, blocking_lock phase thinks it completed locking without acquiring a single lock :-O. 4) mount-1 reads xattrs without any lock but since it does not modify the xattrs, no harm is done by this phase. It tries to do unlocks and the unlocks will fail because the locks are never taken in data domain. mount-1 also unlocks self-heal domain locks. Our beloved mount-2 now gets the chance to cause horror :-(. 5) mount-2 gets the full range blocking lock in self-heal domain. Please note that this sets lk_attempted_count to 2. 6) mount-2 attempts full range lock in data domain, since there are still writes on going, it switches to blocking mode. But since lk_attempted_count is 2 which is same as lk_expected_count, blocking phase locks thinks it actually got the full range locks even though not a single lock request went out the wire. 7) mount-2 reads the change-log xattrs, which would give the number of operations in progress (lets call this 'X'). It does the syncing and at the end of the sync decrements the changelog by 'X'. But since that 'X' was introduced by 'X' number of transactions that are in progress, they also decrement the changelog by 'X'. Effectively for 'X' operations 'X' number of pre-ops are done but 2 times 'X' number of post-ops are done resulting in -ve changelog numbers. Fix: Reset the lk_attempted_count and inode locks array that is used to remember locks that are granted. Change-Id: Ic0a79cd16f32392ea7c790511343c73592bbe6bd BUG: 1002698 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5736 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: unlock before aborting transactionAnand Avati2013-08-291-0/+2
| | | | | | | | | | | Else this results in a missing frame causing a hang Change-Id: Ib5f3dc6a3999449faa2853cee2944af2fb065a20 BUG: 1002399 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/5731 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/stripe: enable coalesce mode by defaultBrian Foster2013-08-281-4/+4
| | | | | | | | | | | | It has been available for a while now and is probably the sane default due to the more efficient layout and performance benefit. BUG: 1001207 Change-Id: I6275f9741866c0afd6e685f8dc5867a86485fd20 Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-on: http://review.gluster.org/5624 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Add special handling for failure postopsPranith Kumar K2013-08-284-26/+71
| | | | | | | | | | | | | | | | Idea is to not leave the file in FOOL-FOOL scenario in case on all the bricks data transaction failed with EDQUOT to avoid increasing un-necessary load of self-heals in the system. For directory transactions don't leave pending changelog in case the failures are seen on all the subvolumes. Change-Id: I38a5561d1d581a78347a76a4a509514e4a0c3fb7 BUG: 969461 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5709 Reviewed-by: Anand Avati <avati@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* stripe: remove unused param, handle mem alloc failureKaleb S. KEITHLEY2013-08-281-2/+2
| | | | | | | | | Change-Id: I9c27b1edab111031ca8eea9cc49480ea01e39089 BUG: 1002207 Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com> Reviewed-on: http://review.gluster.org/5716 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* synctask: minor enhancementsAnand Avati2013-08-281-4/+1
| | | | | | | | | | | | | | | | | | | | | | | - Enhance syncenv_new() to accept scaling parameters of syncproc. Previously the scaling parameters were hardcoded and decided at compile time. - New API synctask_create() which returns the created synctask. This is similar to synctask_new which only returned the status of whether a synctask could be created or not. The meaning of NULL cbk in synctask_create() means the task is "joinable". Until synctask_join() is called on such a synctask, the task is not reaped and resources are not destroyed. The task would be in a zombie state after synctask_fn returns and before synctask_join() is called. Change-Id: I368ec9037de9510d2ba951f0aad86aaf18d9a6b6 BUG: 986775 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/5365 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* cluster/afr: Don't delay post op in cases of failuresPranith Kumar K2013-08-283-11/+36
| | | | | | | | | | Change-Id: Ib0c3af6babc61dc3ed45252582876e2f243d6446 BUG: 958118 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5635 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* core: remove GLUSTERFS_CREATE_MODE_KEY usageAmar Tumballi2013-08-231-9/+0
| | | | | | | | | Change-Id: I23b8cb7223b91a55af1cd4214f61bbe0e87351f6 BUG: 952029 Signed-off-by: Amar Tumballi <amarts@redhat.com> Reviewed-on: http://review.gluster.org/5683 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* core: changes to support gfid-accessAmar Tumballi2013-08-211-6/+0
| | | | | | | | | Change-Id: I38d2fdc47e4b805deafca6805e54807976ffdb7e Signed-off-by: Amar Tumballi <amarts@redhat.com> BUG: 952029 Reviewed-on: http://review.gluster.org/5496 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* Revert "fuse: auxiliary gfid mount support"Amar Tumballi2013-08-211-0/+6
| | | | | | | | | | | | | | | | | This reverts commit 4c0f4c8a89039b1fa1c9c015fb6f273268164c20. Conflicts: xlators/mount/fuse/src/fuse-bridge.c For build issues added CREATE_MODE_KEY definition in: libglusterfs/src/glusterfs.h Change-Id: I8093c2a0b5349b01e1ee6206025edbdbee43055e BUG: 952029 Signed-off-by: Amar Tumballi <amarts@redhat.com> Reviewed-on: http://review.gluster.org/5495 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: Del GF_READDIR_SKIP_DIRS key from dict for first_upshishir gowda2013-08-182-4/+12
| | | | | | | | | | | | | | | | Currently, we sent GF_READDIR_SKIP_DIRS for all subvolumes if first_subvol != first_up_subvolume. Also first_up_subvolume can change with-in the life of a call and cbk. Saving the first_up_subvol in dht_local for checks. Change-Id: I6e369e63f29c9761993f2a66ed768c424bb44d27 BUG: 996474 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/5577 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: Add largest file is source policyAnand Avati2013-08-142-29/+93
| | | | | | | | | | | | | | For Write Once Read Many times type of work-load choosing largest file to be the source will always resolve fool-fool scenarios correctly. In other cases we fsync() the files and will have a reliable 'wise man'. Change-Id: Ic4dbea8d06db6d578fbcb866fb65ee2d066ac7ba BUG: 958118 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/5519 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* afr: treat appending writes as stable writes.Anand Avati2013-08-133-1/+29
| | | | | | | | | | | | | Durability of appending writes is implicit in the file size. Therefore performing an explicit fsync() is unnecessary in such cases as self-heal can check for the size of file when pending changelog is not unambiguous. Change-Id: I05446180a91d20e0dbee5de5a7085b87d57f178a BUG: 927146 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/5501 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: skip directory inspection when entry self-heal is offAnand Avati2013-08-111-1/+1
| | | | | | | | | | | | When user has explicitly configured to disable entry self-heal in the client, it is wrong to do the healing in opendir. So skip it. This is especially useful to reduce opendir() times after graph switches. Change-Id: Ic6eb9ff2334a5b8417f2f35410a366a536bad5df Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/5528 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Unwind frame on error in readdir[p]Pranith Kumar K2013-08-081-2/+2
| | | | | | | | | Change-Id: I5701bf115e0aa1adb4fb52f5418534910a2268d4 BUG: 994959 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5531 Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* afr: check for non-zero call_count before doing a stack windRavishankar N2013-08-071-0/+5
| | | | | | | | | | | | | | | | | | | | When one of the bricks of a 1x2 replicate volume is down, writes to the volume is causing a race between afr_flush_wrapper() and afr_flush_cbk(). The latter frees up the call_frame's local variables in the unwind, while the former accesses them in the for loop and sending a stack wind the second time. This causes the FUSE mount process (glusterfs) toa receive a SIGSEGV when the corresponding unwind is hit. This patch adds the call_count check which was removed when afr_flush_wrapper() was introduced in commit 29619b4e Change-Id: I87d12ef39ea61cc4c8244c7f895b7492b90a7042 BUG: 988182 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/5393 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Disable eager-lock if open-fd-count > 1Pranith Kumar K2013-08-024-5/+98
| | | | | | | | | | | | | | | | | Lets say mount1 has eager-lock(full-lock) and after the eager-lock is taken mount2 opened the same file, it won't be able to perform any data operations until mount1 releases eager-lock. To avoid such scenario do not enable eager-lock for transaction if open-fd-count is > 1. Delaying of changelog piggybacking is avoided in this situation. Change-Id: I51b45d6a7c216a78860aff0265a0b8dabc6423a5 BUG: 910217 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5432 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: venkatesh somyajulu <vsomyaju@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* dht: make linkfile creation mode explicitly get setAnand Avati2013-07-311-0/+9
| | | | | | | | | | | | | | | | Because of posix default_acl on parent directory, the mode of linkfile can get masked with the mode in the default acl. This breaks DHT integrity. So let the mode get explicitly reset after mknod(). Change-Id: Ia7328e1ee7b4430bda308f9da293dba78405e081 BUG: 990410 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/5440 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Raghavendra Talur <rtalur@redhat.com>
* cluster/dht: Fix non-regular file ownership during rebalanceshishir gowda2013-07-312-3/+10
| | | | | | | | | | | | | | | | | | | Currently non-regular files were created as root:root, and their ownership never healed during rebalance process. Also, in dht_linkfile_attr_heal, we have to heal the linkfiles as default, as currently linkfiles are created as root:root. That check existed, as earlier linkfiles were created as frame->root->uid/gid Change-Id: I6cd88361b81bdd500e15bc47b623f5db8eec88e9 BUG: 990154 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/5434 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Print self-heal log when self-heal succeedsPranith Kumar K2013-07-315-32/+120
| | | | | | | | | Change-Id: I95e47e589419dc6a032cbd8ba01964b6c176c2d5 BUG: 927146 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5408 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: Treat migration failures due to space constraints as skippedshishir gowda2013-07-302-5/+30
| | | | | | | | | | | | | | | | Currently rebalance/remove-brick op's display migration failed count even for files which failed due to space issues (not enough space for file, or migration leading to cluster imbalance) These will now be counted as skipped, and rebalance/remove-brick status will display the additional counter Change-Id: I674904d380b5f8300e9ca9e6af557c3d30d6cff4 BUG: 989846 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/5399 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: Allow non-local clients to function with nufa volumes.Vijay Bellur2013-07-291-6/+17
| | | | | | | | | | | | | nufa fails to init if a local brick is not found as of today. With this patch, if a local brick is not found, nufa switches over to dht mode of operations. Change-Id: I50ac1af37621b1e776c8c00a772b8e3dfb3691df BUG: 980838 Signed-off-by: Vijay Bellur <vbellur@redhat.com> Reviewed-on: http://review.gluster.org/5414 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: Handle REPLICATE_TRASH_DIR from old bricksPranith Kumar K2013-07-263-13/+52
| | | | | | | | | Change-Id: Ib99f79d3fa607c818dbc62006516480f598d8add BUG: 886998 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4640 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: mark linkfile creation/deletion as internal fopshishir gowda2013-07-241-9/+39
| | | | | | | | | | | | | | | | | | | Currently dht creates/deletes linkfiles for various ops like rename/linking and when layout changes. dht_linkfile_create already sends a key GLUSTERFS_INTERNAL_FOP_KEY in dict to identify this as an internal fop and not user based op. Enhancing rename related links/unlinks to send this key too. Marker/changelog or other xlators can now identify these as internal fops and handle them accordingly Change-Id: Ib1ca789e6dbce48703c55ad3f4f88f7cd2df3d06 BUG: 987428 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/5335 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* storage/posix: implement batched fsync in a single threadAnand Avati2013-07-231-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because of the extra fsync()s issued by AFR transaction, they could potentially "clog" all the io-threads denying unrelated operations from making progress. This patch assigns a dedicated thread to issues fsyncs, as an experimental feature to understand performance characteristics with the approach. As a basis, incoming individual fsync requests are grouped into batches, falling in the same @batch-fsync-delay-usec window of time. These windows can extend in practice, as processing of the previous batch can take longer than @batch-fsync-delay-usec while new requests are getting batched. The feature support three modes (similar to the -S modes of fs_mark) - syncfs: In this mode one syncfs() is issued per batch, instead of N fsync()s (one per file.) - syncfs-single-fsync: In this mode one syncfs() is issued per batch (which, on Linux, guarantees the completion of write-out of dirty pages in the filesystem up to that point) and one single fsync() to synchronize or flush the controller/drive cache. This corresponds to -S 2 of fsmark. - syncfs-reverse-fsync: In this mode, one syncfs() is issued per batch, and all the open files in that batch are fsync()'ed in the reverse order of the queue. This corresponds to -S 4 of fsmark. - reverse-fsync: In this mode, no syncfs() is issued and all the files in the batch are fsync()'ed in the reverse order. This corresponds to -S 3 of fsmark. Change-Id: Ia1e170a810c780c8d80e02cf910accc4170c4cd4 BUG: 927146 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4746 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: Handle parallel hardlinks self-healPranith Kumar K2013-07-231-0/+29
| | | | | | | | | Change-Id: Ieda11870c65edae500140b6c061f15a7b3f264f3 BUG: 986905 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5370 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* fuse: auxiliary gfid mount supportRaghavendra G2013-07-191-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * files can be accessed directly through their gfid and not just through their paths. For eg., if the gfid of a file is f3142503-c75e-45b1-b92a-463cf4c01f99, that file can be accessed using <gluster-mount>/.gfid/f3142503-c75e-45b1-b92a-463cf4c01f99 .gfid is a virtual directory used to seperate out the namespace for accessing files through gfid. This way, we do not conflict with filenames which can be qualified as uuids. * A new file/directory/symlink can be created with a pre-specified gfid. A setxattr done on parent directory with fuse_auxgfid_newfile_args_t initialized with appropriate fields as value to key "glusterfs.gfid.newfile" results in the entry <parent>/bname whose gfid is set to args.gfid. The contents of the structure should be in network byte order. struct auxfuse_symlink_in { char linkpath[]; /* linkpath is a null terminated string */ } __attribute__ ((__packed__)); struct auxfuse_mknod_in { unsigned int mode; unsigned int rdev; unsigned int umask; } __attribute__ ((__packed__)); struct auxfuse_mkdir_in { unsigned int mode; unsigned int umask; } __attribute__ ((__packed__)); typedef struct { unsigned int uid; unsigned int gid; char gfid[UUID_CANONICAL_FORM_LEN + 1]; /* a null terminated gfid string * in canonical form. */ unsigned int st_mode; char bname[]; /* bname is a null terminated string */ union { struct auxfuse_mkdir_in mkdir; struct auxfuse_mknod_in mknod; struct auxfuse_symlink_in symlink; } __attribute__ ((__packed__)) args; } __attribute__ ((__packed__)) fuse_auxgfid_newfile_args_t; An initial consumer of this feature would be geo-replication to create files on slave mount with same gfids as that on master. It will also help gsyncd to access files directly through their gfids. gsyncd in its newer version will be consuming a changelog (of master) containing operations on gfids and sync corresponding files to slave. * Also, bring in support to heal gfids with a specific value. fuse-bridge sends across a gfid during a lookup, which storage translators assign to an inode (file/directory etc) if there is no gfid associated it. This patch brings in support to specify that gfid value from an application, instead of relying on random gfid generated by fuse-bridge. gfids can be healed through setxattr interface. setxattr should be done on parent directory. The key used is "glusterfs.gfid.heal" and the value should be the following structure whose contents should be in network byte order. typedef struct { char gfid[UUID_CANONICAL_FORM_LEN + 1]; /* a null terminated gfid * string in canonical form */ char bname[]; /* a null terminated basename */ } __attribute__((__packed__)) fuse_auxgfid_heal_args_t; This feature can be used for upgrading older geo-rep setups where gfids of files are different on master and slave to newer setups where they should be same. One can delete gfids on slave using setxattr -x and .glusterfs and issue stat on all the files with gfids from master. Thanks to "Amar Tumballi" <amarts@redhat.com> and "Csaba Henk" <csaba@redhat.com> for their inputs. Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Change-Id: Ie8ddc0fb3732732315c7ec49eab850c16d905e4e BUG: 952029 Reviewed-on: http://review.gluster.com/#/c/4702 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Amar Tumballi <amarts@redhat.com> Reviewed-on: http://review.gluster.org/4702 Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>