summaryrefslogtreecommitdiffstats
path: root/xlators/cluster
Commit message (Collapse)AuthorAgeFilesLines
* cluster/ec: Prevent Null dereference in dht-renamePranith Kumar K2015-06-121-1/+1
| | | | | | | | | | Change-Id: I3059f3b577f550c92fb77c6b6b44defd0584cd2e BUG: 1230647 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/11178 Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* tier/dht: Fixing non atomic promotion/demotion w.r.t to frequency periodJoseph Fernandes2015-06-111-40/+59
| | | | | | | | | | | | | | | | | | | | | This fixes the ping-pong issue i.e files getting demoted immediately after promition, caused by off-sync promotion/demotion processes. The solution is do promotion/demotion refering to the system time. To have the fix working all the file serving nodes should have thier system time synchronized with each other either manually or using a NTP Server. NOTE: The ping-pong issue can re-appear even with this fix, if the admin have different promotion freq period and demotion freq period, but this would be under the control of the admin. Change-Id: I1b33a5881d0cac143662ddb48e5b7b653aeb1271 BUG: 1218717 Signed-off-by: Joseph Fernandes <josferna@redhat.com> Reviewed-on: http://review.gluster.org/11110 Reviewed-by: Dan Lambright <dlambrig@redhat.com> Tested-by: Dan Lambright <dlambrig@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/tier: account for reordered layoutsDan Lambright2015-06-112-14/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For a tiered volume the cold subvolume is always at a fixed position in the graph. DHT's layout array, on the other hand, may have the cold subvolume in either the first or second index, therefore code cannot make any assumptions. The fix searches the layout for the correct position dynamically rather than statically. The bug manifested itself in NFS, in which a newly attached subvolume had not received an existing directory. This case is a "stale entry" and marked as such in the layout for that directory. The code did not see this, because it looked at the wrong index in the layout array. The fix also adds the check for decomissioned bricks, and fixes a problem in detach tier related to starting the rebalance process: we never received the right defrag command and it did not get directed to the tier translator. Change-Id: I77cdf9fbb0a777640c98003188565a79be9d0b56 BUG: 1214289 Signed-off-by: Dan Lambright <dlambrig@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com> Reviewed-by: Joseph Fernandes <josferna@redhat.com> Reviewed-by: Mohammed Rafi KC <rkavunga@redhat.com> Reviewed-on: http://review.gluster.org/11092
* features/changelog: Avoid setattr fop logging during renameSaravanakumar Arumugam2015-06-113-23/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: When a file is renamed and the (renamed)file's Hashing falls into a different brick, DHT creates a special file(linkto file) in the brick(Hashed subvolume) and carries out setattr operation on that file. Currently, Changelog records this(setattr) operation in Hashed subvolume. glusterfind in turn records this operation as MODIFY operation. So, there is a NEW entry in Cached subvolume and MODIFY entry in Hashed subvolume for the same file. Solution: Avoid logging setattr operation carried out, by marking the operation as internal fop using xdata. In changelog translator, check whether setattr is set as internal fop and skip accordingly. Change-Id: I21b09afb5a638b88a4ccb822442216680b7b74fd BUG: 1230007 Signed-off-by: Saravanakumar Arumugam <sarumuga@redhat.com> Reviewed-on: http://review.gluster.org/11137 Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Tested-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/ec: Wind unlock fops at all costPranith Kumar K2015-06-102-8/+45
| | | | | | | | | | | | | | | | | Problem: While files are being created if more than redundancy number of bricks go down, then unlock for these fops do not go to the bricks. This will lead to stale locks leading to hangs. Fix: Wind unlock fops at all costs. Change-Id: I50a87e8b4d6d2dde5bf7405b82e3aeecd95ad00e BUG: 1220348 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/11152 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
* tier/volume set: Validate volume set option for tierMohammed Rafi KC2015-06-101-0/+6
| | | | | | | | | | | | | | | | Volume set option related to tier volume can only be set for tier volume, also currently all volume set i for tier option accepts a non-negative integer. This patch validate both condition. Change-Id: I3611af048ff4ab193544058cace8db205ea92336 BUG: 1216960 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Signed-off-by: Dan Lambright <dlambrig@redhat.com> Reviewed-on: http://review.gluster.org/10751 Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Joseph Fernandes
* cluster/ec: Prevent double unwindPranith Kumar K2015-06-084-13/+12
| | | | | | | | | | | | | | | | | | | | | | Problem: 1) ec_access/ec_readlink_/ec_readdir[p] _cbks are trying to recover only from ENOTCONN. 2) When the fop succeeds it unwinds right away. But when its ec_fop_manager resumes, if the number of bricks that are up is less than ec->fragments, the the state machine will resume with -EC_STATE_REPORT which unwinds again. This will lead to crashes. Fix: - If fop fails retry on other subvols, as ESTALE/ENOENT/EBADFD etc are also recoverable. - unwind success/failure in _cbks Change-Id: I2cac3c2f9669a4e6160f1ff4abc39f0299303222 BUG: 1228952 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/11111 Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: Do not attempt entry self-heal if the last lookup on entry ↵Krutika Dhananjay2015-06-081-2/+9
| | | | | | | | | | | | | | | | | | | failed on src Test bug-948686.t was causing shd to dump core due to gfid being NULL. This was due to the volume being stopped while index heal's in progress, causing afr_selfheal_unlocked_lookup_on() to fail sometimes on the src brick with ENOTCONN. And when afr_selfheal_newentry_mark() copies the gfid off the src iatt, it essentially copies null gfid. This was causing the assertion as part of xattrop in protocol/client to fail. Change-Id: I237a0d6b1849e4c48d7645a2cc16d9bc1441ef95 BUG: 1229172 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/11119 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* Changing log level to DEBUG in case of ENOENTAshish Pandey2015-06-081-2/+3
| | | | | | | | | Change-Id: I264e47ca679d8b57cd8c80306c07514e826f92d8 BUG: 1193388 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: http://review.gluster.org/10784 Tested-by: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
* cluster/ec: Don't handle EC_XATTR_DIRTY in responsePranith Kumar K2015-06-062-27/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | Problem: ec_update_size_version expects all the keys it did xattrop with to come in response so that it can set the values again in ec_update_size_version_done. But EC_XATTR_DIRTY is not combined so the value won't be present in the response. So ctx->post/pre_dirty are not updated in ec_update_size_version_done. So these values are still non-zero. When ec_unlock_now is called as part of flush's unlock phase it again tries to perform same xattrop for EC_XATTR_DIRTY. But ec_update_size_version is not expected to be called in unlock phase of flush because ec_flush_size_version should have reset everything to zero and unlock is never invoked from ec_update_size_version_done for flush/fsync/fsyncdir. This leads to stale lock which leads to hang. Fix: EC_XATTR_DIRTY is removed in ex_xattrop_cbk and is never combined with other answers. So remove handling of this in the response. Change-Id: If0ea3efec3235a6e312465d8838585fbe752c7ea BUG: 1227654 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/11078 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* afr: honour selfheal enable/disable volume set optionsRavishankar N2015-06-032-3/+11
| | | | | | | | | | | | | | | | | | | | afr-v1 had the following volume set options that are used to enable/ disable self-heals from happening in AFR xlator when loaded in the client graph: cluster.metadata-self-heal cluster.data-self-heal cluster.entry-self-heal In afr-v2, these 3 heals can happen from the client if there is an inode refresh. This patch allows such heals to proceed only if the corresponding volume set options are set to true. Change-Id: I8d97d6020611152e73a269f3fdb607652c66cc86 BUG: 1226507 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/11012 Tested-by: NetBSD Build System <jenkins@build.gluster.org> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/dht: fix incorrect dst subvol info in inode_ctxNithya Balachandran2015-06-026-88/+182
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Stashing additional information in the inode_ctx to help decide whether the migration information is stale, which could happen if a file was migrated several times but FOPs only detected the P1 migration phase. If no FOP detects the P2 phase, the inode ctx1 is never reset. We now save the src subvol as well as the dst subvol in the inode ctx. The src subvol is the subvol on which the FOP was sent when the mig info was set in the inode ctx. This information is considered stale if: 1. The subvol on which the current FOP is sent is the same as the dst subvol in the ctx 2. The subvol on which the current FOP is sent is not the same as the src subvol in the ctx This does not handle the case where the same file might have been renamed such that the src subvol is the same but the dst subvol is different. However, that is unlikely to happen very often. Change-Id: I05a2e9b107ee64750c7ca629aee03b03a02ef75f BUG: 1142423 Signed-off-by: Nithya Balachandran <nbalacha@redhat.com> Reviewed-on: http://review.gluster.org/10834 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Tested-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: pass a destination subvol to fop2 variants to avoid races.Raghavendra G2015-06-025-179/+206
| | | | | | | | | | | | | | | | | | | The destination subvol used in the fop2 variants is either stored in inode-ctx1 or local->cached_subvol. However, it is not guaranteed that a value stored in these locations before invocation of fop2 is still present after the invocation as these locations are shared among different concurrent operations. So, to preserve the atomicity of "check dst-subvol and invoke fop2 variant if dst-subvol found", we pass down the dst-subvol to fop2 variant. This patch also fixes error handling in some fop2 variants. Change-Id: Icc226228a246d3f223e3463519736c4495b364d2 BUG: 1142423 Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Reviewed-on: http://review.gluster.org/10943 Tested-by: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: N Balachandran <nbalacha@redhat.com>
* cluster/ec: Fix incorrect check for iatt differencesXavier Hernandez2015-06-021-5/+19
| | | | | | | | | | | | | | | | | | | A previous patch (http://review.gluster.org/10974) introduced a bug that caused that some metadata differences could not be detected in some circumstances. This could cause that self-heal is not triggered and the file not repaired. We also need to consider all differences for lookup requests, even if there isn't any lock. Special handling of differences in lookup is already done in lookup specific code. Change-Id: I3766b0f412b3201ae8a04664349578713572edc6 BUG: 1225793 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/11018 Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* tiering:static function called from a non static inline functionMohammed Rafi KC2015-06-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | gcc v5.1.1 throws warning for calling a static function from a non-static inline function. <snippet from compiler warning> CC tier.lo tier.c:610:15: warning: 'tier_migrate_using_query_file' is static but used in inline function 'tier_migrate_files_using_qfile' which is not static ret = tier_migrate_using_query_file ((void *)query_cbk_args); ^ tier.c:585:47: warning: 'tier_process_brick_cbk' is static but used in inline function 'tier_build_migration_qfile' which is not static ret = dict_foreach (args->brick_list, tier_process_brick_cbk, ^ tier.c:565:176: warning: 'demotion_qfile' is static but used in inline function 'tier_build_migration_qfile' which is not static tier.c:565:158: warning: 'promotion_qfile' is static but used in inline function 'tier_build_migration_qfile' which is not static tier.c:563:58: warning: 'demotion_qfile' is static but used in inline function 'tier_build_migration_qfile' which is not static tier.c:563:40: warning: 'promotion_qfile' is static but used in inline function 'tier_build_migration_qfile' which is not static ret = remove (GET_QFILE_PATH (is_promotion)); ^ CCLD tier.la </snip> Change-Id: I46046feeb79ab4e2724b0ba6b02c9ec8b121ff4e BUG: 1226881 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Reviewed-on: http://review.gluster.org/11032 Tested-by: NetBSD Build System <jenkins@build.gluster.org> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Niels de Vos <ndevos@redhat.com> Reviewed-by: Anoop C S <achiraya@redhat.com> Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>
* stripe: fix use-after-freeJeff Darcy2015-06-021-4/+10
| | | | | | | | | | | | | | | | Pretty much a classic case. STRIPE_STACK_UNWIND frees the "local" structure. In the "virtual xattr" path, used for lock recovery among other things, we were calling STRIPE_STACK_UNWIND and then continuing to clean up "our" parts of the just-freed structure. Oops. Change-Id: Ifa961b89cd21a2893de39a9eea243d184f9eac46 BUG: 1222317 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/11037 Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Niels de Vos <ndevos@redhat.com>
* cluster/tier: make attach/detach work with new rebalance logicDan Lambright2015-06-022-23/+28
| | | | | | | | | | | | | | | The new rebalance performance improvements added new datastructures which were not initialized in the tier case. Function dht_find_local_subvol_cbk() needs to accept a list built by lower level DHT translators in order to build the local subvolumes list. Change-Id: Iab03fc8e7fadc22debc08cd5bc781b9e3e270497 BUG: 1222088 Signed-off-by: Dan Lambright <dlambrig@redhat.com> Reviewed-on: http://review.gluster.org/10795 Tested-by: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com>
* dht: Add lookup-optimize configuration option for DHTShyam2015-06-024-16/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently with commit 4eaaf5 a mixed version cluster would have issues if lookup-uhashed is set to auto, as older clients would fail to validate the layouts if newer clients (i.e 3.7 or upwards) create directories. Also, in a mixed version cluster rebalance daemon would set commit hash for some subvolumes and not for the others. This commit fixes this problem by moving the enabling of the functionality introduced in the above mentioned commit to a new dht option. This option also has a op_version of 3_7_1 thereby preventing it from being set in a mixed version cluster. It brings in the following changes, - Option can be set only if min version of the cluster is 3.7.1 or more - Rebalance and mkdir update the layout with the commit hashes only if this option is set, hence ensuring rebalance works in a mixed version cluster, and also directories created by newer clients do not cause layout errors when read by older clients - This option also supersedes lookup-unhased, to enable the optimization for lookups more deterministic and not conflict with lookup-unhashed settings. Option added is cluster.lookup-optimize, which is a boolean. Usage: # gluster volume set VOLNAME cluster.lookup-optimize on Change-Id: Ifd1d4ce3f6438fcbcd60ffbfdbfb647355ea1ae0 BUG: 1222126 Signed-off-by: Shyam <srangana@redhat.com> Reviewed-on: http://review.gluster.org/10797 Tested-by: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Kaushal M <kaushal@redhat.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Tested-by: Raghavendra G <rgowdapp@redhat.com>
* DHT/permissoin: Let setattr consume stat built from lookup in heal pathSusant Palai2015-06-011-2/+0
| | | | | | | | | | | | | | | | setattr call post mkdir(selfheal) ends up using the mode bits returned by mkdir,which miss the required suid, sgid and sticky bit. Hence, the fix is to use the mode bits from local->stbuf which was used to create the missing directories. Change-Id: I478708c80e28edc6509b784b0ad83952fc074a5b BUG: 1110262 Signed-off-by: Susant Palai <spalai@redhat.com> Reviewed-on: http://review.gluster.org/8208 Tested-by: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Tested-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: maintain start state of rebalance daemon across graph switch.Dan Lambright2015-06-011-2/+9
| | | | | | | | | | | | | | When we did a graph switch on a rebalance daemon, a second call to gf_degrag_start() was done. This lead to multiple threads doing migration. When multiple threads try to move the same file there can be deadlocks. Change-Id: I931ca7fe600022f245e3dccaabb1ad004f732c56 BUG: 1226005 Signed-off-by: Dan Lambright <dlambrig@redhat.com> Reviewed-on: http://review.gluster.org/10977 Tested-by: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com>
* cluster/ec: Ignore differences in non locked inodesXavier Hernandez2015-05-305-28/+100
| | | | | | | | | | | | | | | | | | | | | | | | When ec combines iatt structures from multiple bricks, it checks for equality in important fields. This is ok for iatt related to inodes involved in the operation that have been locked before starting execution. However some fops return iatt information from other inodes. For example a rename locks source and destination parent directories, but it also returns an iatt from the entry itself. In these cases we ignore differences in some fields to avoid false detection of inconsistencies and trigger unnecessary self-heals. Another issue is solved in this patch that caused that the real size of the file stored into the inode context was lost during self-heal. Change-Id: I8b8eca30b2a6c39c7b9bbd3b3b6ba95228fcc041 BUG: 1225793 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/10974 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: NetBSD Build System
* build: do not #include "config.h" in each fileNiels de Vos2015-05-2940-198/+0
| | | | | | | | | | | | | | | | | | Instead of including config.h in each file, and have the additional config.h included from the compiler commandline (-include option). When a .c file tests for a certain #define, and config.h was not included, incorrect assumtions were made. With this change, it can not happen again. BUG: 1222319 Change-Id: I4f9097b8740b81ecfe8b218d52ca50361f74cb64 Signed-off-by: Niels de Vos <ndevos@redhat.com> Reviewed-on: http://review.gluster.org/10808 Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* tiering/rebalance: Use separate pid/socket file for tieringMohammed Rafi KC2015-05-281-2/+3
| | | | | | | | | | | | | | When promotion/demotion daemon starts, it uses the same pidfile as rebalance. This patch will introduce a different pid file for the same. Change-Id: Ic484c53f51e00ae6b2d697748a9600b14829e23b BUG: 1221970 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Reviewed-on: http://review.gluster.org/10792 Reviewed-by: Atin Mukherjee <amukherj@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System
* tier: Do not allow detach-tier commands on a non-tiered volumeMohammed Rafi KC2015-05-281-1/+2
| | | | | | | | | | | | Change-Id: Ic92d25db68e40ef4a4388ef42affd1b3ee5a7ec6 BUG: 1221270 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Reviewed-on: http://review.gluster.org/10773 Reviewed-by: Atin Mukherjee <amukherj@redhat.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Reviewed-by: Kaushal M <kaushal@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System
* xlators/cluster/dht: Fix Explicit null dereferenced (CID 1291727).Günther Deschner2015-05-281-1/+1
| | | | | | | | | | | | | Coverity CID 1291727. Guenther Change-Id: I95f01b638f74370f0ef04383f0f9d5799abe31f5 BUG: 789278 Signed-off-by: Guenther Deschner <gd@samba.org> Reviewed-on: http://review.gluster.org/10300 Reviewed-by: Dan Lambright <dlambrig@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/dht: Fix dht_setxattr to follow files under migrationNithya Balachandran2015-05-282-17/+375
| | | | | | | | | | | | | | | | If a file is under migration, any xattrs created on it are lost post migration of the file. This is because the xattrs are set only on the cached subvol of the source and as the source is under migration, it becomes a linkto file post migration. Change-Id: Ib8e233b519cf954e7723c6e26b38fa8f9b8c85c0 BUG: 1193636 Signed-off-by: Nithya Balachandran <nbalacha@redhat.com> Reviewed-on: http://review.gluster.org/10212 Tested-by: NetBSD Build System Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Tested-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: Don't rely on linkto xattr to find destination subvol during ↵Raghavendra G2015-05-281-101/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | phase 2 of migration. linkto xattr on source file cannot be relied to find where the data file currently resides. This can happen if there are multiple migrations before phase 2 detection by a client. For eg., * migration (M1, node1, node2) starts. * application writes some data. DHT correctly stores the state in inode context that phase-1 of migration is in progress * migration M1 completes * migration (M2, node2, node3) is triggered and completed * application resumes writes to the file. DHT identifies it as phase-2 of migration. However, linkto xattr on node1 points to node2, but the file is on node3. A lookup correctly identifies node3 as cached subvol TBD: When we identify phase-2 of a previous migration (say M1), there might be a migration in progress - say (M3, node3, node4). In this case we need to send writes to both (node3, node4) not just node3. Also, the inode state needs to correctly indicate that its in phase-1 of migration. I'll send this as a different patch. Change-Id: I1a861f766258170af2f6c0935468edb6be687b95 BUG: 1142423 Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Reviewed-on: http://review.gluster.org/10805 Tested-by: NetBSD Build System
* afr: allow readdir to proceed for directories in split-brainRavishankar N2015-05-281-18/+22
| | | | | | | | | | | | | | | | | | | Problem: afr_read_txn() bails out if read_subvol==-1. This meant that for directories that were in entry split-brain, FOPS like readdir, access, stat etc were not allowed. Fix: Except for getxattr, all other FOPS are wound on the first up child of afr. Change-Id: Iacec8fbb1e75c4d2094baa304f62331c81a6f670 BUG: 1221481 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reviewed-on: http://review.gluster.org/10776 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Anuradha Talur <atalur@redhat.com> Tested-by: NetBSD Build System
* cluster/afr: Treat op_ret >= 0 as success in afr_final_errno()Krutika Dhananjay2015-05-271-1/+1
| | | | | | | | | | Change-Id: I7ec29428b7f7ef249014f948a5d616bfb8aaf80d BUG: 1225491 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/10946 Tested-by: NetBSD Build System Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/ec: Forced unlock when lock contention is detectedXavier Hernandez2015-05-2713-999/+1139
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | EC uses an eager lock mechanism to optimize multiple read/write requests on the same entry or inode. This increases performance but can have adverse results when other clients try to access the same entry/inode. To solve this, this patch adds a functionality to detect when this happens and force an earlier release to not block other clients. The method consists on requesting GF_GLUSTERFS_INODELK_COUNT and GF_GLUSTERFS_ENTRYLK_COUNT for all fops that take a lock. When this count is greater than one, the lock is marked to be released. All fops already waiting for this lock will be executed normally before releasing the lock, but new requests that also require it will be blocked and restarted after the lock has been released and reacquired again. Another problem was that some operations did correctly lock the parent of an entry when needed, but got the size and version xattrs from the entry instead of the parent. This patch solves this problem by binding all queries of size and version to each lock and replacing all entrylk calls by inodelk ones to remove concurrent updates on directory metadata. This also allows rename to correctly update source and destination directories. Change-Id: I2df0b22bc6f407d49f3cbf0733b0720015bacfbd BUG: 1165041 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/10852 Tested-by: NetBSD Build System Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/ec: Fix use after free crashPranith Kumar K2015-05-214-23/+47
| | | | | | | | | | | | | | | | | ec_heal creates ec_fop_data but doesn't run ec_manager. ec_fop_data_allocate adds this fop to ec->pending_fops, because ec_manager is not run on this heal fop it is never removed from ec->pending_fops. When it is accessed after free it leads to crash. It is better to not to add HEAL fops to ec->pending_fops because we don't want graph switch to hang the mount because of a BIG file/directory heal. BUG: 1188145 Change-Id: I8abdc92f06e0563192300ca4abca3909efcca9c3 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/10868 Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com>
* cluster/ec: Correctly cleanup delayed locksXavier Hernandez2015-05-207-65/+163
| | | | | | | | | | | | | | | | | When a delayed lock is pending, a graph switch doesn't correctly terminate it. This means that the update of version and size xattrs is lost, causing EIO errors. This patch handles GF_EVENT_PARENT_DOWN event to correctly finish pending udpdates before completing the graph switch. Change-Id: I394f3b8d41df8d83cdd36636aeb62330f30a66d5 BUG: 1188145 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/10787 Tested-by: NetBSD Build System Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/ec: Handle lookup failures while op in progressPranith Kumar K2015-05-192-12/+23
| | | | | | | | | | Change-Id: Ia1834ec23d5de615526d4d4e4d2e32aff155b7f7 BUG: 1211962 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/10806 Tested-by: NetBSD Build System Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
* cluster/tier: load libgfdb.so properly in all casesDan Lambright2015-05-161-1/+1
| | | | | | | | | | | | | We should load libgfdb.so.0, not libgfdb.so Change-Id: I7a0d64018ccd9893b1685de391e99b5392bd1879 BUG: 1222092 Signed-off-by: Dan Lambright <dlambrig@redhat.com> Reviewed-on: http://review.gluster.org/10796 Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com> Reviewed-by: Joseph Fernandes Reviewed-by: Niels de Vos <ndevos@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/ec: Prevent unnecessary self-healsPranith Kumar K2015-05-151-2/+11
| | | | | | | | | | | | | | | | When a blocking lock is requested, lock request is succeeded even when ec->fragment number of locks are acquired successfully in non-blocking locking phase. This will lead to fop succeeding only on the bricks where the locks are acquired, leading to the necessity of self-heals. To prevent these un-necessary self-heals, if the remaining locks fail with EAGAIN in non-blocking lock phase try blocking locking phase instead. Change-Id: I940969e39acc620ccde2a876546cea77f7e130b6 BUG: 1221145 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/10770 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
* dht/rebalance : Fixed rebalance failureNithya Balachandran2015-05-142-3/+17
| | | | | | | | | | | | | | | | | | | | | | | | The rebalance process determines the local subvols for the node it is running on and only acts on files in those subvols. If a dist-rep or dist-disperse volume is created on 2 nodes by dividing the bricks equally across the nodes, one process might determine it has no local_subvols. When trying to update the commit hash, the function attempts to lock all local subvols. On the node with no local_subvols the dht inode lock operation fails, in turn causing the rebalance to fail. In a dist-rep volume with 2 nodes, if brick 0 of each replica set is on node1 and brick 1 is on node2, node2 will find that it has no local subvols. Change-Id: I7d73b5b4bf1c822eae6df2e6f79bd6a1606f4d1c BUG: 1221696 Signed-off-by: Nithya Balachandran <nbalacha@redhat.com> Reviewed-on: http://review.gluster.org/10786 Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com> Reviewed-by: Susant Palai <spalai@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* dht : null dereference coverity fix.Manikandan Selvaganesh2015-05-141-1/+2
| | | | | | | | | | | | | CID : 1124521 Change-Id: Ie524935d636195cb6894074095b9b98fe28dbc2c BUG: 789278 Signed-off-by: Manikandan Selvaganesh <mselvaga@redhat.com> Reviewed-on: http://review.gluster.org/10348 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Sakshi Bansal Reviewed-by: N Balachandran <nbalacha@redhat.com> Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com>
* cluster/afr : Do not copy dict when it is NULLAnuradha2015-05-131-1/+1
| | | | | | | | | | | | | | | | In afr_lookup_xattr_req_prepare(), dict_copy was done even though source dict was NULL. Change-Id: I85a5d2823ba021e7f78c1ce13402a0f16b08cb51 BUG: 1220332 Signed-off-by: Anuradha <atalur@redhat.com> Reviewed-on: http://review.gluster.org/10755 Reviewed-by: Ravishankar N <ravishankar@redhat.com> Reviewed-by: Prashanth Pai <ppai@redhat.com> Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Susant Palai <spalai@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* dht/rebalance: Change log_level to DEBUGSusant Palai2015-05-121-2/+3
| | | | | | | | | | Change-Id: I646367581d8ee8a9e5966ee302b19273a0c780ff BUG: 1220329 Signed-off-by: Susant Palai <spalai@redhat.com> Reviewed-on: http://review.gluster.org/10756 Reviewed-by: N Balachandran <nbalacha@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com>
* dht: make lookup-unhashed=auto do something actually usefulJeff Darcy2015-05-106-57/+577
| | | | | | | | | | | | | | | | | | | | | | | | The key concept here is to determine whether a directory is "clean" by comparing its last-known-good topology to the current one for the volume. These are stored as "commit hashes" on the directory and the volume root respectively. The volume's commit hash changes whenever a brick is added or removed, and a fix-layout is done. A directory's commit hash changes only when a full rebalance (not just fix-layout) is done on it. If all bricks are present and have a directory commit hash that matches the volume commit hash, then we can assume that every file is in its "proper" place. Therefore, if we look for a file in that proper place and don't find it, we can assume it's not on any other subvolume and *safely* skip the global (broadcast to all) lookup. Change-Id: Id6ce4593ba1f7daffa74cfab591cb45960629ae3 BUG: 1219637 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Signed-off-by: Shyam <srangana@redhat.com> Reviewed-on: http://review.gluster.org/7702 Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* ec: Fix failures with missing filesXavier Hernandez2015-05-095-283/+164
| | | | | | | | | | | | | | | | | | | | | | | | | | When a file does not exist on a brick but it does on others, there could be problems trying to access it because there was some loc_t structures with null 'pargfid' but 'name' was set. This forced inode resolution based on <pargfid>/name instead of <gfid> which would be the correct one. To solve this problem, 'name' is always set to NULL when 'pargfid' is not present. Another problem was caused by an incorrect management of errors while doing incremental locking. The only allowed error during an incremental locking was ENOTCONN, but missing files on a brick can be returned as ESTALE. This caused an EIO on the operation. This patch doesn't care of errors during an incremental locking. At the end of the operation it will check if there are enough successfully locked bricks to continue or not. Change-Id: I9360ebf8d819d219cea2d173c09bd37679a6f15a BUG: 1176062 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/9407 Tested-by: NetBSD Build System Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* core: use reference counting for mem_acct structuresJeff Darcy2015-05-092-12/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When freeing memory, our memory-accounting code expects to be able to dereference from the (previously) allocated block to its owning translator. However, as we have already found once in option validation and twice in logging, that translator might itself have been freed and the dereference attempt causes on of our daemons to crash with SIGSEGV. This patch attempts to fix that as follows: * We no longer embed a struct mem_acct directly in a struct xlator, but instead allocate it separately. * Allocated memory blocks now contain a pointer to the mem_acct instead of the xlator. * The mem_acct structure contains a reference count, manipulated in both the normal and translator allocate/free code using atomic increments and decrements. * Because it's now a separate structure, we can defer freeing the mem_acct until its reference count reaches zero (either way). * Some unit tests were disabled, because they embedded their own copies of the implementation for what they were supposedly testing. Life's too short to spend time fixing tests that seem designed to impede progress by requiring a certain implementation as well as behavior. Change-Id: Id929b11387927136f78626901729296b6c0d0fd7 BUG: 1211749 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/10417 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-by: Niels de Vos <ndevos@redhat.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/dht: change log level of developer logs to DEBUGVijay Bellur2015-05-091-2/+2
| | | | | | | | | | | | | A few log messages in dht directory self heal at log level INFO are useful only for developers and these logs tend to casue excessive logs in our log files. Hence moving the log level of such logs to DEBUG. Change-Id: I8a543f4ddeb5c20b2978a0f7b18d8baccc935a54 BUG: 1217949 Signed-off-by: Vijay Bellur <vbellur@redhat.com> Reviewed-on: http://review.gluster.org/10281 Reviewed-by: N Balachandran <nbalacha@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* glusterd: add counter support for tiered volumesDan Lambright2015-05-083-1/+22
| | | | | | | | | | | | | | | | This fix adds support to view the number of promoted or demoted files from the cli. The mechanism is isolmorphic to checking the status of volumes being rebalanced. gluster volume rebalance <vol> tier status Change-Id: I1b11ca27355ceec36c488967c23531202030e205 BUG: 1213063 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Signed-off-by: Dan Lambright <dlambrig@redhat.com> Reviewed-on: http://review.gluster.org/10292 Reviewed-by: Atin Mukherjee <amukherj@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/ec: Change meaning of trusted.ec.dirtyPranith Kumar K2015-05-087-208/+555
| | | | | | | | | | | | | | | | - With this change, the xattr will represent if the file needs to be healed or not. It will have different values for data/entry and metadata changes. - inode ref leaks and dict_set_dynstr related leaks fixed - Added support for trylock/lock based on heal-cmd execution or not in data heal. - Made fixes to pass regression runs Change-Id: I9d8def4c2badde18a76b7898816fecfac113737a BUG: 1215265 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/10385 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/ec: data heal implementation for ecPranith Kumar K2015-05-083-8/+788
| | | | | | | | | | | | | | | | | | | | | | | | | Data self-heal: 1) Take inode lock in domain 'this->name:self-heal' on 0-0 range (full file), So that no other processes try to do self-heal at the same time. 2) Take inode lock in domain 'this->name' on 0-0 range (full file), 3) perform fxattrop+fstat and get the xattrs on all the bricks 3) Choose the brick with ec->fragment number of same version as source 4) Truncate sinks 5) Unlock lock taken in 2) 5) For each block take full file lock, Read from sources write to the sinks, Unlock 6) Take full file lock and see if the file is still sane copy i.e. File didn't become unusable while the bricks are offline. Update mtime to before healing 7) xattrop with -ve values of 'dirty' and difference of highest and its own version values for version xattr 8) unlock lock acquired in 6) 9) unlock lock acquired in 1) Change-Id: I6f4d42cd5423c767262c9d7bb5ca7767adb3e5fd BUG: 1215265 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/10384 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/ec: metadata/name/entry heal implementation for ecPranith Kumar K2015-05-082-0/+1058
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Metadata self-heal: 1) Take inode lock in domain 'this->name' on 0-0 range (full file) 2) perform lookup and get the xattrs on all the bricks 3) Choose the brick with highest version as source 4) Setattr uid/gid/permissions 5) removexattr stale xattrs 6) Setxattr existing/new xattrs 7) xattrop with -ve values of 'dirty' and difference of highest and its own version values for version xattr 8) unlock lock acquired in 1) Entry self-heal: 1) take directory lock in domain 'this->name:self-heal' on 'NULL' to prevent more than one self-heal 2) we take directory lock in domain 'this->name' on 'NULL' 3) Perform lookup on version, dirty and remember the values 4) unlock lock acquired in 2) 5) readdir on all the bricks and trigger name heals 6) xattrop with -ve values of 'dirty' and difference of highest and its own version values for version xattr 7) unlock lock acquired in 1) Name heal: 1) Take 'name' lock in 'this->name' on 'NULL' 2) Perform lookup on 'name' and get stat and xattr structures 3) Build gfid_db where for each gfid we know what subvolumes/bricks have a file with 'name' 4) Delete all the stale files i.e. the file does not exist on more than ec->redundancy number of bricks 5) On all the subvolumes/bricks with missing entry create 'name' with same type,gfid,permissions etc. 6) Unlock lock acquired in 1) Known limitation: At the moment with present design, it conservatively preserves the 'name' in case it can not decide whether to delete it. this can happen in the following scenario: 1) we have 3=2+1 (bricks: A, B, C) ec volume and 1 brick is down (Lets say A) 2) rename d1/f1 -> d2/f2 is performed but the rename is successful only on one of the bricks (Lets say B) 3) Now name self-heal on d1 and d2 would re-create the file on both d1 and d2 resulting in d1/f1 and d2/f2. Because we wanted to prevent data loss in the case above, the following scenario is not healable, i.e. it needs manual intervention: 1) we have 3=2+1 (bricks: A, B, C) ec volume and 1 brick is down (Lets say A) 2) We have two hard links: d1/a, d2/b and another file d3/c even before the brick went down 3) rename d3/c -> d2/b is performed 4) Now name self-heal on d2/b doesn't heal because d2/b with older gfid will not be deleted. One could think why not delete the link if there is more than 1 hardlink, but that leads to similar data loss issue I described earlier: Scenario: 1) we have 3=2+1 (bricks: A, B, C) ec volume and 1 brick is down (Lets say A) 2) We have two hard links: d1/a, d2/b 3) rename d1/a -> d3/c, d2/b -> d4/d is performed and both the operations are successful only on one of the bricks (Lets say B) 4) Now name self-heal on the 'names' above which can happen in parallel can decide to delete the file thinking it has 2 links but after all the self-heals do unlinks we are left with data loss. Change-Id: I3a68218a47bb726bd684604efea63cf11cfd11be BUG: 1213358 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/10298 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* dht: Fixing dereference after null checkarao2015-05-081-13/+38
| | | | | | | | | | | | | | CID: 1223229 The 'loc' ptr is not checked before dereferencing, which is handled. Change-Id: Icf668150bde190e6f1b9f58a038099338516efe8 BUG: 789278 Signed-off-by: arao <arao@redhat.com> Reviewed-on: http://review.gluster.org/9666 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Tested-by: Raghavendra G <rgowdapp@redhat.com>
* features/shard: Implement [f]truncate fopsKrutika Dhananjay2015-05-071-2/+2
| | | | | | | | | | | | | | To-Do: * Make ftruncate work even in the absence of path * Aggregate and update ia_blocks appropriately when a file is truncated to a lower size. Change-Id: Ifd24c2f5e80d2c3bc921261f5481251df8948126 BUG: 1207615 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> Reviewed-on: http://review.gluster.org/10631 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* Restore build on non Linux systemsEmmanuel Dreyfus2015-05-072-4/+7
| | | | | | | | | | | | | | | | | | | This change broke the build on NetBSD, FreeBSD, and MacOS X: http://review.gluster.org/10526/ We restore the build with two fixes: - Use POSIX-compliant sysconf(_SC_NPROCESSORS_ONLN) to get the number of processors, instead of Linux specific get_nprocs(). That let us remove Linux-specific #include <sys/sysinfo.h> - Only define MAX() if it is not already defined. NetBSD defines it in <sys/param.h> which is already included BUG: 1129939 Change-Id: I62341c670598670e47ea2f69ab94864f96588b18 Signed-off-by: Emmanuel Dreyfus <manu@netbsd.org> Reviewed-on: http://review.gluster.org/10652 Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com> Tested-by: Vijay Bellur <vbellur@redhat.com>