glusterfs.git -

	Commit message	Author	Age	Files	Lines
*	afr: prevent winding inodelks twice for arbiter volumes	Ravishankar N	2018-10-12	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Backport of https://review.gluster.org/#/c/glusterfs/+/21380/ Problem: In an arbiter volume, if there is a pending data heal of a file only on the arbiter brick, self-heal takes inodelks twice due to a code-bug but unlocks it only once, leaving behind a stale lock on the brick. This causes the next write to the file to hang. Fix: Fix the code-bug to take the lock only once. This bug was introduced in master with commit eb472d82a083883335bc494b87ea175ac43471ff Thanks to Pranith Kumar K <pkarampu@redhat.com> for finding the RCA. fixes: bz#1637989 Change-Id: I15ad969e10a6a3c4bd255e2948b6be6dcddc61e1 BUG: 1637989 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	cluster/afr: Delegate metadata heal with pending xattrs to SHD	Pranith Kumar K	2018-10-12	3	-38/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: When metadata-self-heal is triggered on the mount, it blocks lookup until metadata-self-heal completes. But that can lead to hangs when lot of clients are accessing a directory which needs metadata heal and all of them trigger heals waiting for other clients to complete heal. Fix: Only when the heal is needed but the pending xattrs are not set, trigger metadata heal that could block lookup. This is the only case where different clients may give different metadata to the clients without heals, which should be avoided. Updates bz#1625588 Change-Id: I6089e9fda0770a83fb287941b229c882711f4e66 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/afr: Delegate name-heal when possible	Pranith Kumar K	2018-10-12	2	-27/+85
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: When name-self-heal is triggered on the mount, it blocks lookup until name-self-heal completes. But that can lead to hangs when lot of clients are accessing a directory which needs name heal and all of them trigger heals waiting for other clients to complete heal. Fix: When a name-heal is needed but quorum number of names have the file and pending xattrs exist on the parent, then better to delegate the heal to SHD which will be completed as part of entry-heal of the parent directory. We could also do the same for quorum-number of names not present but we don't have any known use-case where this is a frequent occurrence so not changing that part at the moment. When there is a gfid mismatch or missing gfid it is important to complete the heal so that next rename doesn't assume everything is fine and perform a rename etc fixes bz#1625588 Change-Id: I8b002c85dffc6eb6f2833e742684a233daefeb2c Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	dht: Fill first_up_subvol before use in dht_opendir	Poornima G	2018-10-10	1	-0/+5
\| \| \| \| \| \| \| \|	Reported by: Sam McLeod Change-Id: Ic8f9b46b173796afd70aff1042834b03ac3e80b2 BUG: 1512371 Signed-off-by: Poornima G <pgurusid@redhat.com>
*	afr: fix incorrect reporting of directory split-brain	Ravishankar N	2018-10-10	7	-16/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Backport of https://review.gluster.org/#/c/glusterfs/+/21135/ Problem: When a directory has dirty xattrs due to failed post-ops or when replace/reset brick is performed, AFR does a conservative merge as expected, but heal-info reports it as split-brain because there are no clear sources. Fix: Modify pending flag to contain information about pending heals and split-brains. For directories, if the split-brain flag is not set, just show them as needing heal and not being in split-brain. Fixes: bz#1633625 Change-Id: I09ef821f6887c87d315ae99e6b1de05103cd9383 BUG: 1633625 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	glusterd: volume inode/fd status broken with brick mux	hari gowtham	2018-09-18	4	-35/+50
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	backport of:https://review.gluster.org/#/c/19846/6 Problem: The values for inode/fd was populated from the ctx received from the server xlator. Without brickmux, every brick from a volume belonged to a single brick from the volume. So searching the server and populating it worked. With brickmux, a number of bricks can be confined to a single process. These bricks can be from different volumes too (if we use the max-bricks-per-process option). If they are from different volumes, using the server xlator to populate causes problem. Fix: Use the brick to validate and populate the inode/fd status. >Signed-off-by: hari gowtham <hgowtham@redhat.com> >Change-Id: I2543fa5397ea095f8338b518460037bba3dfdbfd >fixes: bz#1566067 Change-Id: I2543fa5397ea095f8338b518460037bba3dfdbfd BUG: 1569336 fixes: bz#1569336 Signed-off-by: hari gowtham <hgowtham@redhat.com>
*	protocol: don't use alloca	Amar Tumballi	2018-09-06	1	-47/+30
\| \| \| \| \| \| \| \| \| \| \|	The current use of alloca can cause issues when strings larger than the allocated buffer are passed to the xdr. Hence it makes sense to allow XDR decode functions to deal with memory allocations, which we can free later. BUG: 1625654 Change-Id: I3a05553f5702de9575c244649ca0e5ac9abaac94 Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	posix: disable open/read/write on special files	Amar Tumballi	2018-09-06	1	-0/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the file system, the responsibility w.r.to the block and char device files is related to only support for 'creating' them (using mknod(2)). Once the device files are created, the read/write syscalls for the specific devices are handled by the device driver registered for the specific major number, and depending on the minor number, it knows where to read from. Hence, we are at risk of reading contents from devices which are handled by the host kernel on server nodes. By disabling open/read/write on the device file, we would be safe with the bypass one can achieve from client side (using gfapi) BUG: 1625648 Change-Id: I48c776b0af1cbd2a5240862826d3d8918601e47f Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	io-stats: dump io-stats info in /var/run/gluster	Amar Tumballi	2018-09-06	1	-9/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It wouldn't make sense to allow the iostats file to be written in any directory. While the formatting makes sure we try to append io-stats-name to the file name, so the chance of overwriting an existing file is slim, in any case it makes sense to restrict dumping to one directory. Below are the sample commands, and files created for the corresponding values: $ setfattr -n trusted.io-stats-dump -v file-for-dump $M0 In this case, the file would be in /var/run/gluster/file-for-dump $ setfattr -n trusted.io-stats-dump -v /dir1/dir2/file-for-dump $M0 In this case, the dump file is in /var/run/gluster/dir1-dir2-file-for-dump Note that the value passed for this virtual xattr would be treated as a file, and even if the value has '/' in it, it would be changed to '-' for sanity. BUG: 1625660 Change-Id: Id9ae6a40a190b8937c51662e6e1c2a0f6c86a0e0 Signed-off-by: Amar Tumballi <amarts@redhat.com>
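The mapping described in the commit message above can be sketched as follows. This is a minimal illustration, not the io-stats code: `dump_path` is a hypothetical helper, stripping the leading '/' is an assumption inferred from the second sample, and the io-stats-name suffix mentioned in the message is left out.

```python
DUMP_DIR = "/var/run/gluster"  # directory named in the commit message


def dump_path(xattr_value: str) -> str:
    """Hypothetical helper: map a trusted.io-stats-dump value to a dump file path."""
    # Assumption: a leading '/' is dropped, as the second sample suggests.
    name = xattr_value.lstrip("/")
    # Every remaining '/' becomes '-', so the value is always a plain file name.
    name = name.replace("/", "-")
    return f"{DUMP_DIR}/{name}"


print(dump_path("file-for-dump"))
print(dump_path("/dir1/dir2/file-for-dump"))
```

Because the result is always a single file name under one directory, a value such as `/etc/passwd` can no longer point the dump at an arbitrary location.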
*	server-protocol: don't allow '../' path in 'name'	Amar Tumballi	2018-09-06	2	-0/+18
\| \| \| \| \| \| \| \| \| \| \| \|	This will prevent any arbitrary file creation through glusterfs by modifying the client bits. Also check for the similar flaw inside posix too, so we prevent any changes in layers in-between. BUG: 1625664 Signed-off-by: Amar Tumballi <amarts@redhat.com> Change-Id: Id9fe0ef6e86459e8ed85ab947d977f058c5ae06e
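A minimal sketch of the kind of check this commit describes, assuming it amounts to rejecting any entry name that contains the `../` sequence. `name_is_safe` is a hypothetical name; the actual server and posix code paths are not shown in this log.

```python
def name_is_safe(name: str) -> bool:
    """Hypothetical check: refuse names that could climb out of the parent directory."""
    # Assumption: a plain substring test for "../" is what is being enforced.
    return "../" not in name


for candidate in ("file.txt", "../etc/passwd", "dir/../../secret"):
    print(candidate, name_is_safe(candidate))
```

Running the same test in both the protocol layer and the storage layer, as the message suggests, means a modified client cannot bypass it by targeting a layer in between.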
*	posix: remove not supported get/set content	Amar Tumballi	2018-09-05	3	-183/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	getting and setting a file's content using extended attribute worked great as a GET/PUT alternative when an object storage is supported on top of Gluster. But it needs application changes, and also, it skips some caching layers. It is not used over years, and not supported any more. Remove the dead code. Fixes: bz#1625286 Change-Id: Ide3b3f1f644f6ca58558bbe45561f346f96b95b7 BUG: 1625286 Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	cluster/afr: Fix dict-leak in pre-op	Pranith Kumar K	2018-08-17	3	-20/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	At the time of pre-op, pre_op_xdata is populated with the xattrs we get from the disk and at the time of post-op it gets over-written without unreffing the previous value stored leading to a leak. This is a regression we missed in https://review.gluster.org/#/q/ba149bac92d169ae2256dbc75202dc9e5d06538e Originally: > Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> > (cherry picked from commit e7b79c59590c203c65f7ac8548b30d068c232d33) Change-Id: I0456f9ad6f77ce6248b747964a037193af3a3da7 Fixes: bz#1613512 Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	glusterd: _is_prefix should handle 0-length paths	Kaushal M	2018-08-17	1	-0/+9
\| \| \| \| \| \| \| \| \|	If one of the paths given to _is_prefix is 0-length, then it is not a prefix of the other. Hence, _is_prefix should return false. Change-Id: I54aa577a64a58940ec91872d0d74dc19cff9106d BUG: 1599788 Signed-off-by: Kaushal M <kaushal@redhat.com>
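The rule in this commit can be illustrated with a small sketch. It assumes the check is symmetric (either path may be the prefix) and matches only on path-component boundaries; both are assumptions about `_is_prefix`, and `is_prefix` below is a hypothetical stand-in rather than the glusterd function.

```python
def is_prefix(path1: str, path2: str) -> bool:
    """Hypothetical stand-in: is one path a prefix of the other?"""
    # Rule from the commit: a 0-length path is never a prefix of anything.
    if len(path1) == 0 or len(path2) == 0:
        return False
    shorter, longer = sorted((path1, path2), key=len)
    if not longer.startswith(shorter):
        return False
    # Assumption: only count matches that end on a path-component boundary.
    return (
        len(longer) == len(shorter)
        or shorter.endswith("/")
        or longer[len(shorter)] == "/"
    )


print(is_prefix("", "/bricks/b1"))
print(is_prefix("/bricks/b1", "/bricks/b1/sub"))
print(is_prefix("/bricks/b1", "/bricks/b10"))
```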
*	posix: check before removing stale symlink	Ravishankar N	2018-07-19	1	-4/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Backport of https://review.gluster.org/#/c/20509/ BZ 1564071 complains of directories with missing gfid symlinks and corresponding "Found stale gfid handle" messages in the logs. Hence add a check to see if the symlink points to an actual directory before removing it. Note: Removing stale symlinks was added via commit 3e9a9c029fac359477fb26d9cc7803749ba038b2 Change-Id: I5d91fab8e5f3a621a9ecad4a1f9c898a3c2d346a BUG: 1603093 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	afr: don't update readables if inode refresh failed on all children	Ravishankar N	2018-07-11	4	-21/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Backport of: https://review.gluster.org/#/c/20029/ 3.12 still supports quorum-reads, hence modified afr_inode_refresh_done() to support that. If inode refresh failed on all children of afr due to ENOENT (say file migrated by dht), it resets the readables to zero. Any inflight txn which then later comes on the inode fails with EIO because no readable children are present for the inode. Fix: Don't update readables when inode refresh fails on all children of afr. In that way any inflight txns will either proceed with its own inode refresh if needed and fail it with the right errno or use the old value of readables and continue with the txn. Also, add quorum checks to the beginning of afr_transaction(). Otherwise, we seem to be winding the lock and checking for quorum only in pre-op phase. Note: This should ideally fix BZ 1329505 since the stop gap fix for it has been reverted at https://review.gluster.org/#/c/20028. Change-Id: I82990769f01be918a073fec83fc67ba4b3be24b1 BUG: 1599247 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	afr: heal gfids when file is not present on all bricks	Ravishankar N	2018-07-11	5	-12/+51
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Backport of https://review.gluster.org/#/c/20271/ (only change is in .t) commit 20fa80057eb430fd72b4fa31b9b65598b8ec1265 introduced a regression wherein if a file is present in only 1 brick of replica and doesn't have a gfid associated with it, it doesn't get healed upon the next lookup from the client. Fix it. Change-Id: I7d1111dcb45b1b8b8340a7d02558f05df70aa599 BUG: 1598121 fixes: bz#1598121 Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit eb472d82a083883335bc494b87ea175ac43471ff)
*	afr: fix bug-1363721.t failure	Ravishankar N	2018-07-09	3	-1/+70
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Backport of https://review.gluster.org/#/c/20036/ Note: We need to update inode context's write_subvol even in case of compound fops. This is not there in master and 4.1 since compound FOPS was removed in it. Problem: In the .t, when the only good brick was brought down, writes on the fd were still succeeding on the bad bricks. The inflight split-brain check was marking the write as failure but since the write succeeded on all the bad bricks, afr_txn_nothing_failed() was set to true and we were unwinding writev with success to DHT and then catching the failure in post-op in the background. Fix: Don't wind the FOP phase if the write_subvol (which is populated with readable subvols obtained in pre-op cbk) does not have at least 1 good brick which was up when the transaction started. Change-Id: I4a1fef4569609c31cffeaef591a64c10870e8d0b BUG: 1598720 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	afr: add quorum checks in pre-op	Ravishankar N	2018-07-06	1	-29/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Backport of https://review.gluster.org/#/c/19781/ Problem: We seem to be winding the FOP if pre-op did not succeed on quorum bricks and then failing the FOP with EROFS since the fop did not meet quorum. This essentially masks the actual error due to which pre-op failed. (See BZ). Fix: Skip FOP phase if pre-op quorum is not met and go to post-op. Change-Id: Ie58a41e8fa1ad79aa06093706e96db8eef61b6d9 BUG: 1597154 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	afr: capture the correct errno in post-op quorum check	Ravishankar N	2018-07-05	1	-8/+8
\| \| \| \| \| \| \| \| \| \|	If the post-op phase of txn did not meet quorum checks, use that errno to unwind the FOP rather than blindly setting ENOTCONN. Change-Id: I0cb0c8771ec75a45f9a25ad4cd8601103deddf0c BUG: 1597120 Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 440a048f24b006c80af3d7bcd0a1f13fe3459d87)
*	cluster/dht: act as passthrough for renames on single child DHT	Raghavendra G	2018-07-05	1	-7/+15
\| \| \| \| \| \| \| \| \| \| \|	Various synchronization present in dht_rename while handling directories and files is necessary only if we have more than only one child. Change-Id: Ie21ad419125504ca2f391b1ae2e5c1d166fee247 fixes: bz#1563513 Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
*	afr: add quorum checks in post-op	Ravishankar N	2018-07-04	1	-0/+29
\| \| \| \| \| \| \| \| \| \| \|	afr relies on pending changelog xattrs to identify source and sinks and the setting of these xattrs happen in post-op. So if post-op fails, we need to unwind the write txn with a failure. Change-Id: I0f019ac03890108324ee7672883d774918b20be1 BUG: 1597120 Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit a40a87ec3b226ae86a6ed8f4af25b45965a20cad)
*	glusterd: gluster v status is showing wrong status for glustershd	Sanju Rakonde	2018-07-04	1	-3/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we restart the bricks, connect and disconnect events happen for glustershd. glusterd uses two threads to handle disconnect and connect events from glustershd. When we restart the bricks we'll get both disconnect and connect events. So both the threads will compete for the big lock. We want the disconnect event to finish before the connect event. But if the connect thread gets the big lock first, it sets svc->online to true, and then the disconnect thread will set svc->online to false. So, glustershd will be disconnected from glusterd and wrong status is shown. After killing shd, glusterd sleeps for 1 second. To avoid the problem, if glusterd releases the lock before sleep and acquires it after sleep, the disconnect thread will get a chance to handle the glusterd_svc_common_rpc_notify before the other thread completes the connect event. >Change-Id: Ie82e823fdfc936feb7c0ae10599297b050ee9986 >Signed-off-by: Sanju Rakonde <srakonde@redhat.com> Change-Id: Ie82e823fdfc936feb7c0ae10599297b050ee9986 fixes: bz#1582443 Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
*	afr: don't treat all cases all bricks being blamed as split-brain	Ravishankar N	2018-07-04	2	-9/+48
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: We currently don't have a roll-back/undoing of post-ops if quorum is not met. Though the FOP is still unwound with failure, the xattrs remain on the disk. Due to these partial post-ops and partial heals (healing only when 2 bricks are up), we can end up in split-brain purely from the afr xattrs point of view i.e. each brick is blamed by at least one of the others. These scenarios are hit when there is frequent connect/disconnect of the client/shd to the bricks while I/O or heal are in progress. Fix: Instead of undoing the post-op, pick a source based on the xattr values. If 2 bricks blame one, the blamed one must be treated as sink. If there is no majority, all are sources. Once we pick a source, self-heal will then do the heal instead of erroring out due to split-brain. Change-Id: I3d0224b883eb0945785ade0e9697a1c828aec0ae BUG: 1597123 Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 0e6e8216823c2d9dafb81aae0f6ee3497c23d140)
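The two rules stated in the fix above can be sketched for a three-brick replica. The blame matrix is a simplified, hypothetical stand-in for the pending changelog xattrs, `pick_sources` is not an AFR function, and cases beyond the two stated rules are out of scope here.

```python
def pick_sources(blames):
    """Hypothetical sketch: blames[i][j] is True when brick i blames brick j."""
    n = len(blames)
    # Count how many *other* bricks blame each brick.
    accusers = [
        sum(1 for i in range(n) if i != j and blames[i][j]) for j in range(n)
    ]
    # Rule 1: a brick blamed by 2 bricks is a sink.
    sinks = {j for j in range(n) if accusers[j] >= 2}
    # Rule 2: with no majority against any brick, all bricks are sources.
    if not sinks:
        return list(range(n))
    return [j for j in range(n) if j not in sinks]


# Bricks 0 and 1 blame brick 2, and brick 2 blames brick 0.
majority = [
    [False, False, True],
    [False, False, True],
    [True, False, False],
]
# A cycle: 0 blames 1, 1 blames 2, 2 blames 0. Every brick is blamed once.
cycle = [
    [False, True, False],
    [False, False, True],
    [True, False, False],
]
print(pick_sources(majority))
print(pick_sources(cycle))
```

In the cycle case every brick is blamed by exactly one other, which is the xattr-only split-brain the problem statement describes; under the second rule it resolves to all bricks being sources rather than an error.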
*	storage/posix: Fix posix_symlinks_match()	Pranith Kumar K	2018-07-04	1	-3/+13
\| \| \| \| \| \| \| \| \| \| \|	1) snprintf into linkname_expected should happen with PATH_MAX 2) comparison should happen with linkname_actual with complete string linkname_expected fixes bz#1595528 Change-Id: Ic3b3c362dc6c69c046b9a13e031989be47ecff14 BUG: 1595528 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/dht: Remove EIO from dht_inode_missing	N Balachandran	2018-07-04	2	-4/+2
\| \| \| \| \| \| \| \| \| \| \|	Removed EIO from the list of errnos that triggered a migrate check task. (cherry picked from commit c925962b91c67c8cd2391df7dd0251e0cbf66648) Change-Id: I7f89c7a16056421588f1af2377cebe6affddcb47 BUG: 1579673 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	glusterfs: access trusted peer group via remote-host command	Mohit Agrawal	2018-06-25	1	-5/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: In SSL environment the user is able to access volume via remote-host command without adding node in a trusted pool Solution: Change the list of rpc program in glusterd.c at the time of initialization while SSL is enabled > Change-Id: I987e433b639e68ad17b77b6452df1e22dbe0f199 > cherry picked from commit 234d611160840899bcfd5ab1c17a6253673d38ed BUG: 1593526 fixes: bz#1593526 Change-Id: I705253e032239e92ecad1c6a9b7e423a022132b5 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
*	storage/posix: Handle ENOSPC correctly in zero_fill	Pranith Kumar K	2018-06-25	1	-1/+22
\| \| \| \| \| \| \|	Change-Id: Icc521d86cc510f88b67d334b346095713899087a BUG: 1591187 fixes: bz#1591187 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	protocol/server: Fix xdata leak in seek fop	Pranith Kumar K	2018-06-12	1	-2/+1
\| \| \| \| \| \| \|	Change-Id: I6125283ed22c04564f0b77bb7a50579a83e02eb0 fixes: bz#1590133 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> (cherry picked from commit fd5b48ea0afd907deb08604415bee14ab65f378b)
*	glusterd/geo-rep: Fix glusterd crash	Kotresh HR	2018-06-11	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Using strdup instead of gf_strdup crashes during free if mempool is being used. gf_free checks the magic number in the header which will not be taken care of if strdup is used. Backport of: > Patch: https://review.gluster.org/19993/ > Change-Id: Iab36496554b838a036af9d863e3f5fd07fd9780e > Signed-off-by: Kotresh HR <khiremat@redhat.com> (cherry picked from commit 57632e3c1a33187d1d23f101f83cd8759142acac) fixes: bz#1577868 Change-Id: Iab36496554b838a036af9d863e3f5fd07fd9780e Signed-off-by: Kotresh HR <khiremat@redhat.com>
*	cluster/dht: Fix dht_rename lock order	N Balachandran	2018-05-09	1	-18/+47
\| \| \| \| \| \| \| \| \| \|	Fixed dht_order_rename_lock to use the same inodelk ordering as that of the dht selfheal locks (dictionary order of lock subvolumes). Change-Id: Ia3f8353b33ea2fd3bc1ba7e8e777dda6c1d33e0d BUG: 1570475 Signed-off-by: N Balachandran <nbalacha@redhat.com>
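A minimal sketch of why a single agreed ordering matters, assuming "dictionary order" means sorting by subvolume name. The subvolume names are made up and `lock_order` is a hypothetical helper, not `dht_order_rename_lock`.

```python
def lock_order(subvol_names):
    """Hypothetical helper: order in which inodelks would be requested."""
    # Assumption: dictionary order is a plain lexicographic sort of the names.
    return sorted(subvol_names)


# Two operations touching the same subvolumes, listed in different orders.
rename_locks = lock_order(["vol-replicate-1", "vol-replicate-0"])
selfheal_locks = lock_order(["vol-replicate-0", "vol-replicate-1"])
print(rename_locks == selfheal_locks)
```

When every code path takes its locks in the same order, two clients cannot each hold one lock while waiting for the other, which is the classic lock-ordering deadlock.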
*	server/auth: add option for strict authentication	Mohammed Rafi KC	2018-04-22	6	-12/+81
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When this option is enabled, we will check for a matching username and password, if not found then the connection will be rejected. This also does a checksum validation of volfile The option is invalid when SSL/TLS is in use, at which point the SSL/TLS certificate user name is used to validate and hence authorize the right user. This expects TLS allow rules to be setup correctly rather than the default *. This option is not settable, as a result this cannot be enabled for volumes using the CLI. This is used with the shared storage volume, to restrict access to the same in non-SSL/TLS environments to the gluster peers only. Tested: ./tests/bugs/protocol/bug-1321578.t ./tests/features/ssl-authz.t - Ran tests on volumes with and without strict auth checking (as brick vol file needed to be edited to test, or rather to enable the option) - Ran tests on volumes to ensure existing mounts are disconnected when we enable strict checking Change-Id: I2ac4f0cfa5b59cc789cc5a265358389b04556b59 fixes: bz#1570430 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Signed-off-by: ShyamsundarR <srangana@redhat.com>
*	shared storage: Prevent mounting shared storage from non-trusted client	Mohammed Rafi KC	2018-04-22	1	-0/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	gluster shared storage is a volume used for internal storage for various features including ganesha, geo-rep, snapshot. So this volume should not be exposed to the client, as it is a special volume for internal use. This fix won't generate a non-trusted volfile for the shared storage volume. Change-Id: I8ffe30ae99ec05196d75466210b84db311611a4c updates: bz#1570430 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
*	cluster/dht: Handle file migrations when brick down	N Balachandran	2018-04-18	1	-5/+51
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The decision as to which node would migrate a file was based on the gfid of the file. Files were divided among the nodes for the replica/disperse set. However, if a brick was down when rebalance started, the nodeuuids would be saved as NULL and a set of files would not be migrated. Now, if the nodeuuid is NULL, the first non-null entry in the set is the node responsible for migrating the file. Change-Id: I72554c107792c7d534e0f25640654b6f8417d373 fixes: bz#1566820 Signed-off-by: N Balachandran <nbalacha@redhat.com> (cherry picked from commit 1f0765242a689980265c472646c64473a92d94c0) Change-Id: Id1a6e847b0191b6a40707bea789a2a35ea3d9f68
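The fallback described above can be sketched as follows. The modulo over a gfid-derived integer is an assumption standing in for however the files are actually divided among nodes, and `responsible_node` is a hypothetical name.

```python
def responsible_node(gfid_hash, node_uuids):
    """Hypothetical sketch: pick the node that should migrate a file."""
    # Assumption: files are spread over the set by a gfid-derived index.
    chosen = node_uuids[gfid_hash % len(node_uuids)]
    if chosen is not None:
        return chosen
    # The slot is NULL (brick was down when rebalance started):
    # fall back to the first non-null entry in the set.
    for uuid in node_uuids:
        if uuid is not None:
            return uuid
    return None  # every entry is NULL; nobody can migrate the file


uuids = ["node-a", None, "node-c"]  # made-up identifiers; the middle brick was down
print(responsible_node(0, uuids))
print(responsible_node(1, uuids))
print(responsible_node(2, uuids))
```

With this fallback, files whose slot is NULL are still picked up by some node instead of being silently skipped.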
*	cluster/dht: Wind open to all subvols	N Balachandran	2018-04-18	1	-10/+5
\| \| \| \| \| \| \| \| \| \| \|	dht_opendir should wind the open to all subvols whether or not local->subvols is set. This is because dht_readdirp winds the calls to all subvols. Change-Id: I67a96b06dad14a08967c3721301e88555aa01017 updates: bz#1566820 Signed-off-by: N Balachandran <nbalacha@redhat.com> (cherry picked from commit c4251edec654b4e0127577e004923d9729bc323d)
*	cluster/afr: Fixing the flaws in arbiter becoming source patch	Ravishankar N	2018-04-18	7	-179/+276
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Backport of https://review.gluster.org/19045 Problem: Setting the write_subvol value to read_subvol in case of metadata transaction during pre-op (commit 19f9bcff4aada589d4321356c2670ed283f02c03) might lead to the original problem of arbiter becoming source. Scenario: 1) All bricks are up and good 2) 2 writes w1 and w2 are in progress in parallel 3) ctx->read_subvol is good for all the subvolumes 4) w1 succeeds on brick0 and fails on brick1, yet to do post-op on the disk 5) read/lookup comes on the same file and refreshes read_subvols back to all good 6) metadata transaction happens which makes ctx->write_subvol to be assigned with ctx->read_subvol which is all good 7) w2 succeeds on brick1 and fails on brick0 and this will update the brick in reverse order leading to arbiter becoming source Fix: Instead of setting the ctx->write_subvol to ctx->read_subvol in the pre-op stage, if there is a metadata transaction, check in the function __afr_set_in_flight_sb_status() if it is a data/metadata transaction. Use the value of ctx->write_subvol if it is a data transaction and ctx->read_subvol value for other transactions. With this patch we assign the value of ctx->write_subvol in the afr_transaction_perform_fop() with the on disk value, instead of assigning it in the afr_changelog_pre_op() with the in memory value. Change-Id: Id2025a7e965f0578af35b1abaac793b019c43cc4 BUG: 1566131 Signed-off-by: karthik-us <ksubrahm@redhat.com> Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	cluster/afr: Fix for arbiter becoming source	karthik-us	2018-04-18	4	-6/+102
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Backport of https://review.gluster.org/#/c/18049/ Problem: When eager-lock is on, and two writes happen in parallel on a FD we were observing the following behaviour: - First write fails on one data brick - Since the post-op has not yet happened, the inode refresh will get both the data bricks as readable and set it in the inode context - In flight split brain check sees both the data bricks as readable and allows the second write - Second write fails on the other data brick - Now the post-op happens and marks both the data bricks as bad and arbiter will become source for healing Fix: Adding one more variable called write_subvol in inode context and it will have the in memory representation of the writable subvols. Inode refresh will not update this value and its lifetime is pre-op through unlock in the afr transaction. Initially the pre-op will set this value same as read_subvol in inode context and then in the in flight split brain check we will use this value instead of read_subvol. After all the checks we will update the value of this and set the read_subvol same as this to avoid having incorrect value in that. Change-Id: I2ef6904524ab91af861d59690974bbc529ab1af3 BUG: 1566131 Signed-off-by: karthik-us <ksubrahm@redhat.com>
*	features/index: Choose different base file on EMLINK error	Pranith Kumar K	2018-04-12	1	-18/+9
\| \| \| \| \| \| \|	Change-Id: I4648816af908539efdc2528608aa2ebf7f0d0e2f fixes: bz#1565655 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> (cherry picked from commit bb12f2109a01856e8184e13cf984210d20155b13)
*	cluster/dht: Skipped files are not treated as errors	N Balachandran	2018-04-06	1	-5/+9
\| \| \| \| \| \| \| \| \| \| \| \| \|	For skipped files, use a return value of 1 to prevent error messages being logged. > Change-Id: I18de31ac1a64d4460e88dea7826c3ba03c895861 > BUG: 1553598 > Signed-off-by: N Balachandran <nbalacha@redhat.com> Change-Id: I18de31ac1a64d4460e88dea7826c3ba03c895861 BUG: 1555161 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	cluster/afr: Prevent ping-event handling on shd	Pranith Kumar K	2018-04-06	1	-0/+2
\| \| \| \| \| \| \| \| \| \|	On shd, we shouldn't treat any brick down based on latency, otherwise self-heal will never happen fixes: 1562723 Change-Id: Ica07fcc4fae91a6bfd9c9a670e2be464704d94b7 BUG: 1562723 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/ec: send list-node-uuids request to all subvolumes	Xavi Hernandez	2018-04-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The xattr trusted.glusterfs.list-node-uuids was only sent to a single subvolume. This was returning null uuids from the other subvolumes as if they were down. This fix forces that xattr to be requested from all subvolumes. Backport of: > BUG: 1561406 Change-Id: If62eb39a6857258923ba625e153d4ad79018ea2f BUG: 1561731 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	cluster/ec: Change default read policy to gfid-hash	Ashish Pandey	2018-04-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Whenever we read data from file over NFS, NFS reads more data than requested and caches it. Based on the stat information it makes sure that the cached/pre-read data is valid or not. Consider 4 + 2 EC volume and all the bricks are on different nodes. In EC, with round-robin read policy, reads are sent on different set of data bricks. This way, it balances the read fops to go on all the bricks and avoid heating UP (overloading) same set of bricks. Due to small difference in clock speed, it is possible that we get minor difference for atime, mtime or ctime for different bricks. That might cause a different stat returned to NFS based on which NFS will discard cached/pre-read data which is actually not changed and could be used. Solution: Change read policy for EC as gfid-hash. That will force all the read to go to same set of bricks. >Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84 >BUG: 1554743 >Signed-off-by: Ashish Pandey <aspandey@redhat.com> Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84 BUG: 1558352 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
*	cluster/ec: avoid delays in self-heal	Xavi Hernandez	2018-04-06	4	-48/+93
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Self-heal creates a thread per brick to sweep the index looking for files that need to be healed. These threads are started before the volume comes online, so nothing is done but waiting for the next sweep. This happens once per minute. When a replace brick command is executed, the new graph is loaded and all index sweeper threads started. When all bricks have reported, a getxattr request is sent to the root directory of the volume. This causes a heal on it (because the new brick doesn't have good data), and marks its contents as pending to be healed. This is done by the index sweeper thread on the next round, one minute later. This patch solves this problem by waking all index sweeper threads after a successful check on the root directory. Additionally, the index sweep thread scans the index directory sequentially, but it might happen that after healing a directory entry more index entries are created but skipped by the current directory scan. This causes the remaining entries to be processed on the next round, one minute later. The same can happen in the next round, so the heal is running in bursts and taking a long time to finish, especially on volumes with many directory levels. This patch solves this problem by immediately restarting the index sweep if a directory has been healed. Backport of: > BUG: 1547662 Change-Id: I58d9ab6ef17b30f704dc322e1d3d53b904e5f30e BUG: 1555201 Signed-off-by: Xavi Hernandez <jahernan@redhat.com>
*	glusterfsd: Memleak in glusterfsd process while brick mux is on	Mohit Agrawal	2018-04-06	21	-120/+204
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: At the time of stopping the volume while brick multiplex is enabled memory is not cleanup from all server side xlators. Solution: To cleanup memory for all server side xlators call fini in glusterfs_handle_terminate after send GF_EVENT_CLEANUP notification to top xlator. > BUG: 1544090 > Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> > (cherry picked from commit 7c3cc485054e4ede1efb358552135b432fb7047a) >Note: Run all test-cases in separate build (https://review.gluster.org/19574) > with same patch after enable brick mux forcefully, all test cases are > passed. BUG: 1549473 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> Change-Id: Ia10dc7f2605aa50f2b90b3fe4eb380ba9299e2fc
*	glusterd: import volumes in separate synctask	Atin Mukherjee	2018-04-06	6	-69/+343
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With brick multiplexing, to attach a brick to an existing brick process the prerequisite is to have the compatible brick finish its initialization and portmap sign in and hence the thread might have to go to a sleep and context switch the synctask to allow the brick process to communicate with glusterd. In normal code path, this works fine as glusterd_restart_bricks () is launched through a separate synctask. In case there's a mismatch of the volume when glusterd restarts, glusterd_import_friend_volume is invoked and then it tries to call glusterd_start_bricks () from the main thread which eventually may land into the similar situation. Now since this is not done through a separate synctask, the 1st brick will never be able to get its turn to finish all of its handshaking and as a consequence to it, all the bricks will fail to get attached to it. Solution : Execute import volume and glusterd restart bricks in separate synctask. Importing snaps had to be also done through synctask as there's a dependency: the parent volume needs to be available for the importing snap functionality to work. >mainline patch : https://review.gluster.org/#/c/19357/ https://review.gluster.org/#/c/19536/ Change-Id: I290b244d456afcc9b913ab30be4af040d340428c BUG: 1543708 Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
*	cluster/dht: ENOSPC will not fail rebalance	N Balachandran	2018-04-02	1	-9/+3
\| \| \| \| \| \| \| \| \|	ENOSPC returned by a file migration is no longer considered a rebalance failure. Change-Id: I21cf3a8acdc827bc478e138d6cb5db649d53a28c BUG: 1555161 Signed-off-by: N Balachandran <nbalacha@redhat.com>
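The behaviour described above can be illustrated with a minimal sketch. It is not the dht code: the struct, the function names and the separate "skipped" counter are all hypothetical, chosen only to show an ENOSPC result being kept out of the failure count.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-run counters; not the structures glusterfs uses. */
struct rebalance_tally {
    unsigned migrated; /* files moved successfully */
    unsigned skipped;  /* assumption: ENOSPC results are counted here */
    unsigned failed;   /* any other migration error */
};

/* Record one migration result. `err` is 0 on success or a positive errno. */
static void tally_migration(struct rebalance_tally *t, int err)
{
    assert(t != NULL);
    if (err == 0)
        t->migrated++;
    else if (err == ENOSPC)
        t->skipped++; /* out of space on the target: not a rebalance failure */
    else
        t->failed++;
}

/* The run as a whole fails only if some migration failed for another reason. */
static int rebalance_failed(const struct rebalance_tally *t)
{
    assert(t != NULL);
    return t->failed > 0;
}
```

With this sketch, a run that hits only ENOSPC on some files still reports success, while any other errno (EIO, for example) marks the run as failed.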
*	glusterd: optimize glusterd import volumes code path	Atin Mukherjee	2018-03-08	1	-5/+7
\| \| \| \| \| \| \| \| \| \| \| \| \|	If a version mismatch was detected for one of the volumes, glusterd ended up updating all the volumes, which is overkill. >mainline patch : https://review.gluster.org/#/c/19358/ Change-Id: I6df792db391ce3a1697cfa9260f7dbc3f59aa62d BUG: 1543709 Signed-off-by: Atin Mukherjee <amukherj@redhat.com> (cherry picked from commit bb34b07fd2ec5e6c3eed4fe0cdf33479dbf5127b)
*	cluster/afr: Fail open on split-brain	Pranith Kumar K	2018-03-08	11	-95/+208
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Append on a file in split-brain succeeds. Open is intercepted by open-behind; when a write comes on the file, open-behind does open+write. Open succeeds because afr doesn't fail it. Then write succeeds because write-behind intercepts it. Flush is also intercepted by write-behind, so the application never gets to know that the write failed. Fix: Fail open on split-brain, so that when open-behind does open+write, the open fails, which leads to a write failure. The application will then know about this failure. Change-Id: I4bff1c747c97bb2925d6987f4ced5f1ce75dbc15 BUG: 1544635 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> (cherry picked from commit 786343abca3474ff01aa1017210112d97cbc4843)
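The effect of the fix on a deferred open+write can be shown with a toy model. Everything here is hypothetical (the `toy_file` struct, the function names, and the choice of EIO as the error, which the commit message does not specify); it only illustrates that once open refuses a split-brain file, the combined open+write reports the error to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a replicated file; not an afr data structure. */
struct toy_file {
    bool split_brain; /* replicas disagree and no good copy is known */
    bool opened;
};

/* Open fails on split-brain. EIO is an assumption made for this sketch. */
static int toy_open(struct toy_file *f)
{
    assert(f != NULL);
    if (f->split_brain)
        return -EIO;
    f->opened = true;
    return 0;
}

/* Models a deferred open followed by a write: the write is attempted
 * only if the open succeeded, so an open failure reaches the caller. */
static int toy_open_then_write(struct toy_file *f)
{
    int ret = toy_open(f);
    if (ret != 0)
        return ret; /* write fails because the open failed */
    return 0;       /* the write itself is not modelled */
}
```

In this model a healthy file returns 0 from `toy_open_then_write`, while a file flagged as split-brain returns the open error and is never marked opened.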
*	glusterd/store: handle the case of fsid being set to 0	Amar Tumballi	2018-03-06	1	-0/+19
\| \| \| \| \| \| \| \| \| \| \| \| \|	Generally this happens when a system is upgraded from a version that doesn't have fsid details to a version with fsid values. Without this change, after the upgrade, people would see reduced 'df' output, causing a lot of confusion. Debugging Credits: Nithya B <nbalacha@redhat.com> Change-Id: Id718127ddfb69553b32770b25021290bd0e7c49a BUG: 1517260 Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	cluster/dht: Handle single dht child in dht_lookup	N Balachandran	2018-03-05	1	-0/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch limits itself to only handling the case where no file (data or linkto) exists on the subvol. Additional cases to be handled: 1. A linkto file was found on the only child subvol. This currently calls dht_lookup_everywhere which eventually deletes it. It can be deleted directly as it will not be pointing to a valid subvol. 2. Directory lookups - locking might be unnecessary in some cases. > Change-Id: I940ba34531f2aaee1d36fd9ca45ecfd46be662a4 > BUG: 1546620 > Signed-off-by: N Balachandran <nbalacha@redhat.com> Change-Id: I940ba34531f2aaee1d36fd9ca45ecfd46be662a4 BUG: 1548270 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	cluster/dht: Ignore ENODATA from getxattr for posix acls	N Balachandran	2018-03-05	1	-7/+8
\| \| \| \| \| \| \| \| \| \| \| \| \|	dht_migrate_file no longer prints an error if getxattr for posix acls fails with ENODATA/ENOATTR. > Change-Id: Id9ecf6852cb5294c1c154b28d609889ea3420e1c > BUG: 1546954 > Signed-off-by: N Balachandran <nbalacha@redhat.com> Change-Id: Id9ecf6852cb5294c1c154b28d609889ea3420e1c BUG: 1548078 Signed-off-by: N Balachandran <nbalacha@redhat.com>
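A minimal sketch of the check described above, assuming a helper that decides whether a getxattr error for a posix acl key is worth logging. The function name is hypothetical and this is not the dht_migrate_file code; the ENOATTR fallback is needed because some platforms (Linux among them) do not define ENOATTR in errno.h.

```c
#include <assert.h>
#include <errno.h>

/* Assumption for this sketch: where ENOATTR is not defined, treat it
 * as an alias of ENODATA. */
#ifndef ENOATTR
#define ENOATTR ENODATA
#endif

/* Hypothetical helper: returns 1 if an error from getxattr on a posix
 * acl key should be reported, 0 if it can be silently ignored.
 * `err` is 0 on success or a positive errno value. */
static int acl_getxattr_error_is_reportable(int err)
{
    assert(err >= 0);
    if (err == 0)
        return 0; /* no error at all */
    if (err == ENODATA || err == ENOATTR)
        return 0; /* the file simply has no acl set */
    return 1;     /* anything else is a real error */
}
```

A caller would log only when the helper returns 1, so a file without acls migrates without producing an error message.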