glusterfs.git -

	Commit message (Collapse)	Author	Age	Files	Lines
*	cluster/ec: Change handling of heal failure to avoid crash	Ashish Pandey	2020-02-25	2	-13/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: ec_getxattr_heal_cbk was called with NULL as second argument in case heal was failing. This function was dereferencing "cookie" argument which caused crash. Solution: Cookie is changed to carry the value that was supposed to be stored in fop->data, so even in the case when fop is NULL in error case, there won't be any NULL dereference. Thanks to Xavi for the suggestion about the fix. Change-Id: I0798000d5cadb17c3c2fbfa1baf77033ffc2bb8c fixes: bz#1805057
*	cluster/ec: Update lock->good_mask on parent fop failure	Pranith Kumar K	2020-02-25	2	-0/+4
\| \| \| \| \| \| \| \| \| \|	When discard/truncate performs write fop, it should do so after updating lock->good_mask to make sure readv happens on the correct mask fixes: bz#1805056 Change-Id: Idfef0bbcca8860d53707094722e6ba3f81c583b7 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/ec: Fix reopen flags to avoid misbehavior	Pranith Kumar K	2020-02-25	2	-3/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: when a file needs to be re-opened O_APPEND and O_EXCL flags are not filtered in EC. - O_APPEND should be filtered because EC doesn't send O_APPEND below EC for open to make sure writes happen on the individual fragments instead of at the end of the file. - O_EXCL should be filtered because shd could have created the file so even when file exists open should succeed - O_CREAT should be filtered because open happens with gfid as parameter. So open fop will create just the gfid which will lead to problems. Fix: Filter out these two flags in reopen. Change-Id: Ia280470fcb5188a09caa07bf665a2a94bce23bc4 Fixes: bz#1805055 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/ec: Always read from good-mask	Pranith Kumar K	2020-02-25	2	-5/+25
\| \| \| \| \| \| \| \| \| \|	There are cases where fop->mask may have fop->healing added and readv shouldn't be wound on fop->healing. To avoid this always wind readv to lock->good_mask fixes: bz#1805054 Change-Id: I2226ef0229daf5ff315d51e868b980ee48060b87 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/ec: fix EIO error for concurrent writes on sparse files	Xavi Hernandez	2020-02-25	1	-9/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	EC doesn't allow concurrent writes on overlapping areas, they are serialized. However non-overlapping writes are serviced in parallel. When a write is not aligned, EC first needs to read the entire chunk from disk, apply the modified fragment and write it again. The problem appears on sparse files because a write to an offset implicitly creates data on offsets below it (so, in some way, they are overlapping). For example, if a file is empty and we read 10 bytes from offset 10, read() will return 0 bytes. Now, if we write one byte at offset 1M and retry the same read, the system call will return 10 bytes (all containing 0's). So if we have two writes, the first one at offset 10 and the second one at offset 1M, EC will send both in parallel because they do not overlap. However, the first one will try to read missing data from the first chunk (i.e. offsets 0 to 9) to recombine the entire chunk and do the final write. This read will happen in parallel with the write to 1M. What could happen is that half of the bricks process the write before the read, and the half do the read before the write. Some bricks will return 10 bytes of data while the otherw will return 0 bytes (because the file on the brick has not been expanded yet). When EC tries to recombine the answers from the bricks, it can't, because it needs more than half consistent answers to recover the data. So this read fails with EIO error. This error is propagated to the parent write, which is aborted and EIO is returned to the application. The issue happened because EC assumed that a write to a given offset implies that offsets below it exist. This fix prevents the read of the chunk from bricks if the current size of the file is smaller than the read chunk offset. This size is correctly tracked, so this fixes the issue. Also modifying ec-stripe.t file for Test #13 within it. In this patch, if a file size is less than the offset we are writing, we fill zeros in head and tail and do not consider it strip cache miss. That actually make sense as we know what data that part holds and there is no need of reading it from bricks. Change-Id: Ic342e8c35c555b8534109e9314c9a0710b6225d6 Fixes: bz#1805053 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	cluster/ec: skip updating ctx->loc again when ec_fix_open/opendir	Kinglong Mee	2020-02-25	2	-10/+14
\| \| \| \| \| \| \| \| \| \| \| \| \|	The ec_manager_open/opendir memsets ctx->loc which causes memory/inode leak, and ec_fheal uses ctx->loc out of fd->lock that loc_copy may copy bad data when memset it. This patch skips updating ctx->loc when it is initilizaed. With it, ctx->loc is filled once, and never updated. Change-Id: I3bf5ffce4caf4c1c667f7acaa14b451d37a3550a fixes: bz#1805052 Signed-off-by: Kinglong Mee <mijinlong@horiscale.com>
*	cluster/ec: inherit healing from lock when it has info	Kinglong Mee	2020-02-25	1	-2/+3
\| \| \| \| \| \| \| \| \|	If lock has info, fop should inherit healing mask from it. Otherwise, fop cannot inherit right healing when changed_flags is zero. Change-Id: Ife80c9169d2c555024347a20300b0583f7e8a87f fixes: bz#1805051 Signed-off-by: Kinglong Mee <mijinlong@horiscale.com>
*	cluster/dht: Correct fd processing loop	N Balachandran	2020-02-25	1	-22/+62
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	backport of https://review.gluster.org/#/c/glusterfs/+/23506/ The fd processing loops in the dht_migration_complete_check_task and the dht_rebalance_inprogress_task functions were unsafe and could cause an open to be sent on an already freed fd. This has been fixed. Change-Id: I0a3c7d2fba314089e03dfd704f9dceb134749540 Fixes: bz#1804522 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	cluster/ec: Prevent double pre-op xattrops	Pranith Kumar K	2020-02-24	1	-6/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Race: Thread-1 Thread-2 1) Does ec_get_size_version() to perform pre-op fxattrop as part of write-1 2) Calls ec_set_dirty_flag() in ec_get_size_version() for write-2. This sets dirty[] to 1 3) Completes executing ec_prepare_update_cbk leading to ctx->dirty[] = '1' 4) Takes LOCK(inode->lock) to check if there are any flags and sets dirty-flag because lock->waiting_flag is 0 now. This leads to fxattrop to increment on-disk dirty[] to '2' At the end of the writes the file will be marked for heal even when it doesn't need heal. Fix: Perform ec_set_dirty_flag() and other checks inside LOCK() to prevent dirty[] to be marked as '1' in step 2) above Fixes: bz#1805050 Change-Id: Icac2ab39c0b1e7e154387800fbededc561612865 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/ec: Reopen shouldn't happen with O_TRUNC	Pranith Kumar K	2020-02-24	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Doing re-open with O_TRUNC will truncate the fragment even when it is not needed needing extra heals Fix: At the time of re-open don't use O_TRUNC. fixes: bz#1805049 Change-Id: Idc6408968efaad897b95a5a52481c66e843d3fb8 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	server: Mount fails after reboot 1/3 gluster nodes	Mohit Agrawal	2020-02-24	2	-17/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: At the time of coming up one server node(1x3) after reboot client is unmounted.The client is unmounted because a client is getting AUTH_FAILED event and client call fini for the graph.The client is getting AUTH_FAILED because brick is not attached with a graph at that moment Solution: To avoid the unmounting the client graph throw ENOENT error from server in case if brick is not attached with server at the time of authenticate clients. > Credits: Xavi Hernandez <xhernandez@redhat.com> > Change-Id: Ie6fbd73cbcf23a35d8db8841b3b6036e87682f5e > Fixes: bz#1793852 > Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> > (cherry picked from commit f6421dff22a6ddaf14134f6894deae219948c89d) Fixes: bz#1804512 Change-Id: Ie6fbd73cbcf23a35d8db8841b3b6036e87682f5e Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
*	cluster/ec: fix fd reopen	Xavi Hernandez	2020-02-20	14	-274/+328
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently EC tries to reopen fd's that have been opened while a brick was down. This is done as part of regular write operations, just after having acquired the locks, and it's sent as a sub-fop of the main write fop. There were two problems: 1. The reopen was attempted on all UP bricks, even if a previous lock didn't succeed. This is incorrect because most probably the open will fail. 2. If reopen is sent and fails, the error is propagated to the main operation, causing it to fail when it shouldn't. To fix this, we only attempt reopens on bricks where the current fop owns a lock, and we prevent any error to be propagated to the main fop. To implement this behaviour an argument used to indicate the minimum number of required answers has overloaded to also include some flags. To make the change consistent, it has been necessary to rename the argument, which means that a lot of files have been changed. However there are no functional changes. This change has also uncovered a problem in discard code, which didn't correctely process requests of small sizes because no real discard fop was being processed, only a write of 0's on some region. In this case some fields of the fop remained uninitialized or with incorrect values. To fix this, a new function has been created to simulate success on a fop and it's used in the discard case. Thanks to Pranith for providing a test script that has also detected an issue in this patch. This patch includes a small modification of this script to force data to be written into bricks before stopping them. Change-Id: If272343873369186c2fb8f43c1d9c52c3ea304ec Fixes: bz#1805047 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	system/posix-acl: update ctx only if iatt is non-NULL	Homma	2019-12-06	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We need to safe-guard against possible zero'ing out of iatt structure in acl ctx, which can cause many issues. Backport of: > fixes: bz#1668286 > Change-Id: Ie81a57d7453a6624078de3be8c0845bf4d432773 > Signed-off-by: Amar Tumballi <amarts@redhat.com> (cherry-picked from commit 6bf9637a93011298d032332ca93009ba4e377e46) fixes: bz#1767305 Change-Id: Ie81a57d7453a6624078de3be8c0845bf4d432773 Signed-off-by: Alan Orth <alan.orth@gmail.com>
*	Detach iot_worker to release its resources	Liguang Li	2019-11-05	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	When iot_worker terminates, its resources have not been reaped, which will consumes lots of memory. Detach iot_worker to automically release its resources back to the system. Change-Id: I71fabb2940e76ad54dc56b4c41aeeead2644b8bb fixes: bz#1718734 Signed-off-by: Liguang Li <liguang.lee6@gmail.com> Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	cluster/afr: Heal entries when there is a source & no healed_sinks	karthik-us	2019-10-11	1	-0/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: In a situation where B1 blames B2, B2 blames B1 and B3 doesn't blame anything for entry heal, heal will not complete even though we have clear source and sinks. This will happen because while doing afr_selfheal_find_direction() only the bricks which are blamed by non-accused bricks are considered as sinks. Later in __afr_selfheal_entry_finalize_source() when it tries to mark all the non-sources as sinks it fails to do so because there won't be any healed_sinks marked, no witness present and there will be a source. Fix: If there is a source and no healed_sinks, then reset all the locked sources to 0 and healed sinks to 1 to do conservative merge. Change-Id: If40d8bc95d52a52b2730f55bdcf135109b421548 Fixes: bz#1760710 Signed-off-by: karthik-us <ksubrahm@redhat.com>
*	afr/lookup: Pass xattr_req in while doing a selfheal in lookup	Mohammed Rafi KC	2019-09-05	3	-5/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We were not passing xattr_req when doing a name self heal as well as a meta data heal. Because of this, some xdata was missing which causes i/o errors Backport of>https://review.gluster.org/#/c/glusterfs/+/23024/ >Change-Id: Ibfb1205a7eb0195632dc3820116ffbbb8043545f >Fixes: bz#1728770 >Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Change-Id: I740ccdbfcf2667fe4a850c12ecdc4b9eeed08293 Fixes: bz#1749352 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
*	protocol/client: propagte GF_EVENT_CHILD_PING only for connections to brick	Raghavendra G	2019-08-09	1	-4/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Two reasons: * ping responses from glusterd may not be relevant for Halo replication. Instead, it might be interested in only knowing whether the brick itself is responsive. * When a brick is killed, propagating GF_EVENT_CHILD_PING of ping response from glusterd results in GF_EVENT_DISCONNECT spuriously propagated to parent xlators. These DISCONNECT events are from the connections client establishes with glusterd as part of its reconnect logic. Without GF_EVENT_CHILD_PING, the last event propagated to parent xlators would be the first DISCONNECT event from brick and hence subsequent DISCONNECTS to glusterd are not propagated as protocol/client prevents same event being propagated to parent xlators consecutively. propagating GF_EVENT_CHILD_PING for ping responses from glusterd would change the last_sent_event to GF_EVENT_CHILD_PING and hence protocol/client cannot prevent subsequent DISCONNECT events Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Fixes: bz#1739336 Change-Id: I50276680c52f05ca9e12149a3094923622d6eaef (cherry picked from commit 5d66eafec581fb3209af74595784be8854ca40a4)
*	glusterd: add GF_TRANSPORT_BOTH_TCP_RDMA in glusterd_get_gfproxy_client_volfile	Atin Mukherjee	2019-07-18	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	... with out which volume creation fails with "volume create: <xyz>: failed: Failed to create volume files" >Fixes: bz#1716812 >Change-Id: I2f4c2c6d5290f066b54e1c1db19e25db9937bedb >Signed-off-by: Atin Mukherjee <amukherj@redhat.com> Fixes: bz#1721106 Change-Id: I2f4c2c6d5290f066b54e1c1db19e25db9937bedb Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
*	upcall: Avoid sending notifications for invalid inodes	Soumya Koduri	2019-07-04	1	-1/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For nameless LOOKUPs, server creates a new inode which shall remain invalid until the fop is successfully processed post which it is linked to the inode table. But incase if there is an already linked inode for that entry, it discards that newly created inode which results in upcall notification. This may result in client being bombarded with unnecessary upcalls affecting performance if the data set is huge. This issue can be avoided by looking up and storing the upcall context in the original linked inode (if exists), thus saving up on those extra callbacks. This is backport of below mainline fix - > fixes: bz#1718338 > patch url: https://review.gluster.org/#/c/glusterfs/+/22840/ > Signed-off-by: Soumya Koduri <skoduri@redhat.com> Change-Id: I044a1737819bb40d1a049d2f53c0566e746d2a17 fixes: bz#1720634 Signed-off-by: Soumya Koduri <skoduri@redhat.com>
*	cluster/ec: honor contention notifications for partially acquired locks	Xavi Hernandez	2019-06-28	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	EC was ignoring lock contention notifications received while a lock was being acquired. When a lock is partially acquired (some bricks have granted the lock but some others not yet) we can receive notifications from acquired bricks, which should be honored, since we may not receive more notifications after that. Since EC was ignoring them, once the lock was acquired, it was not released until the eager-lock timeout, causing unnecessary delays on other clients. This fix takes into consideration the notifications received before having completed the full lock acquisition. After that, the lock will be releaed as soon as possible. Backport of: > BUG: bz#1708156 > Change-Id: I2a306dbdb29fb557dcab7788a258bd75d826cc12 > Signed-off-by: Xavi Hernandez <xhernandez@redhat.com> Fixes: bz#1717282 Change-Id: I2a306dbdb29fb557dcab7788a258bd75d826cc12 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com> Signed-off-by: Hari Gowtham <hgowtham@redhat.com>
*	ec: fix truncate lock to cover the write in tuncate clean	Kinglong Mee	2019-05-08	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \|	ec_truncate_clean does writing under the lock granted for truncate, but the lock is calculated by ec_adjust_offset_up, so that, the write in ec_truncate_clean is out of lock. Updates: bz#1699500 Change-Id: I15ed1b0807d75c5eb817323f1c227e97d03e0e7c Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> (cherry picked from commit 0e1223491e964096384edfae5032ed0d50d028ad)
*	performance/write-behind: remove request from wip list in wb_writev_cbk	Raghavendra G	2019-05-08	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There is a race in the way O_DIRECT writes are handled. Assume two overlapping write requests w1 and w2. * w1 is issued and is in wb_inode->wip queue as the response is still pending from bricks. Also wb_request_unref in wb_do_winds is not yet invoked. list_for_each_entry_safe (req, tmp, tasks, winds) { list_del_init (&req->winds); if (req->op_ret == -1) { call_unwind_error_keep_stub (req->stub, req->op_ret, req->op_errno); } else { call_resume_keep_stub (req->stub); } wb_request_unref (req); } * w2 is issued and wb_process_queue is invoked. w2 is not picked up for winding as w1 is still in wb_inode->wip. w1 is added to todo list and wb_writev for w2 returns. * response to w1 is received and invokes wb_request_unref. Assume wb_request_unref in wb_do_winds (see point 1) is not invoked yet. Since there is one more refcount, wb_request_unref in wb_writev_cbk of w1 doesn't remove w1 from wip. * wb_process_queue is invoked as part of wb_writev_cbk of w1. But, it fails to wind w2 as w1 is still in wip. * wb_requet_unref is invoked on w1 as part of wb_do_winds. w1 is removed from all queues including w1. * After this point there is no invocation of wb_process_queue unless new request is issued from application causing w2 to be hung till the next request. This bug is similar to bz 1626780 and bz 1379655. Change-Id: Iaa47437613591699d4c8ad18bc0b32de6affcc31 Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Fixes: bz#1707198 (cherry picked from commit 6454132342c0b549365d92bcf3572ecd914f7fa8)
*	cluster/afr: Remove local from owners_list on failure of lock-acquisition	Pranith Kumar K	2019-05-08	4	-18/+14
\| \| \| \| \| \| \| \| \| \| \| \| \|	When eager-lock lock acquisition fails because of say network failures, the local is not being removed from owners_list, this leads to accumulation of waiting frames and the application will hang because the waiting frames are under the assumption that another transaction is in the process of acquiring lock because owner-list is not empty. Handled this case as well in this patch. Added asserts to make it easier to find these problems in future. fixes bz#1699736 Change-Id: I3101393265e9827755725b1f2d94a93d8709e923 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/dht: Request linkto xattrs in dht_rmdir opendir	N Balachandran	2019-04-10	1	-1/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If parallel-readdir is enabled, the rda xlator is loaded below dht in the graph and proactively lists and caches entries when an opendir is performed. dht_rmdir checks if the directory being deleted contains stale linkto files by performing a readdirp on its child subvols. However, as the entries are actually read in during the opendir operation which does not request the linkto xattr,no linkto xattrs are present for the entries causing dht to incorrectly identify them as data files and fail the rmdir operation with ENOTEMPTY. DHT now always adds the linkto xattr in the list of xattrs requested in the opendir. Change-Id: I0711198e66c59146282eb8b88084170bedfb4018 fixes: bz#1695399 Signed-off-by: N Balachandran <nbalacha@redhat.com> (cherry picked from commit 110006bbcd5bb3e814b4cfe7d74cb41891ac3b0c)
*	glusterd: fix txn-id mem leak	Atin Mukherjee	2019-04-09	2	-6/+36
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit ensures the following: 1. Don't send commit op request to the remote nodes when gluster v status all is executed as for the status all transaction the local commit gets the name of the volumes and remote commit ops are technically a no-op. So no need for additional rpc requests. 2. In op state machine flow, if the transaction is in staged state and op_info.skip_locking is true, then no need to set the txn id in the priv->glusterd_txn_opinfo dictionary which never gets freed. Fixes: bz#1694612 Change-Id: Ib6a9300ea29633f501abac2ba53fb72ff648c822 Signed-off-by: Atin Mukherjee <amukherj@redhat.com> (cherry picked from commit 34e010d64905b7387de57840d3fb16a326853c9b)
*	client-rpc: Fix the payload being sent on the wire	Poornima G	2019-04-08	6	-240/+304
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The fops allocate 3 kind of payload(buffer) in the client xlator: - fop payload, this is the buffer allocated by the write and put fop - rsphdr paylod, this is the buffer required by the reply cbk of some fops like lookup, readdir. - rsp_paylod, this is the buffer required by the reply cbk of fops like readv etc. Currently, in the lookup and readdir fop the rsphdr is sent as payload, hence the allocated rsphdr buffer is also sent on the wire, increasing the bandwidth consumption on the wire. With this patch, the issue is fixed. Fixes: bz#1673058 Change-Id: Ie8158921f4db319e60ad5f52d851fa5c9d4a269b Signed-off-by: Poornima G <pgurusid@redhat.com>
*	cluster/dht: Fix lookup selfheal and rmdir race	N Balachandran	2019-04-08	1	-9/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A race between the lookup selfheal and rmdir can cause directories to be healed only on non-hashed subvols. This can prevent the directory from being listed from the mount point and in turn causes rm -rf to fail with ENOTEMPTY. Fix: Update the layout information correctly and reduce the call count only after processing the response. Change-Id: I812779aaf3d7bcf24aab1cb158cb6ed50d212451 fixes: bz#1695403 Signed-off-by: N Balachandran <nbalacha@redhat.com> (cherry picked from commit b0f1d782fc45313fce4e1c0e74127401d5342d05)
*	cluster/afr: Send truncate on arbiter brick from SHD	karthik-us	2019-03-12	1	-15/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: In an arbiter volume configuration SHD will not send any writes onto the arbiter brick even if there is data pending marker for the arbiter brick. If we have a arbiter setup on the geo-rep master and there are data pending markers for the files on arbiter brick, SHD will not mark any data changelog during healing. While syncing the data from master to slave, if the arbiter-brick is considered as ACTIVE, then there is a chance that slave will miss out some data. If the arbiter brick is being newly added or replaced there is a chance of slave missing all the data during sync. Fix: If there is data pending marker for the arbiter brick, send truncate on the arbiter brick during heal, so that it will record truncate as the data transaction in changelog. Change-Id: I3242ba6cea6da495c418ef860d9c3359c5459dec fixes: bz#1687687 Signed-off-by: karthik-us <ksubrahm@redhat.com>
*	core: make compute_cksum function op_version compatible	Sanju Rakonde	2019-03-11	1	-4/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: commit 5a152a changed the mechansim of computing the checksum. In heterogeneous cluster, peers are running into rejected state because we have different cksum computation mechansims in upgraded and non-upgraded nodes. Solution: add a check for op-version so that all the nodes in the cluster follow the same mechanism for computing the cksum. fixes: bz#1684569 > Change-Id: I1508f000e8c9895588b6011b8b6cc0eda7102193 > BUG: bz#1685120 > Signed-off-by: Sanju Rakonde <srakonde@redhat.com> Change-Id: I1508f000e8c9895588b6011b8b6cc0eda7102193 Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
*	performance/write-behind: fix use-after-free in readdirp	Raghavendra Gowdappa	2019-02-22	1	-18/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Two issues were found: 1. in wb_readdirp_cbk, inode should unrefed after wb_inode is unlocked. Otherwise, inode and hence the context wb_inode can be freed by the type we try to unlock wb_inode 2. wb_readdirp_mark_end iterates over a list of wb_inodes of children of a directory. But inodes could've been freed and hence the list might be corrupted. To fix take a reference on inode before adding it to invalidate_list of parent. Change-Id: I911b0e0b2060f7f41ded0b05db11af6f9b7c09c5 Signed-off-by: Raghavendra Gowdappa <rgowdapp@redhat.com> Updates: bz#1671556
*	performance/write-behind: handle call-stub leaks	Raghavendra Gowdappa	2019-02-22	1	-0/+8
\| \| \| \| \| \| \|	Change-Id: I7be9a5f48dcad1b136c479c58b1dca1e0488166d Signed-off-by: Raghavendra Gowdappa <rgowdapp@redhat.com> Fixes: bz#1671556 (cherry picked from commit 6175cb10cd5f59f3c7ae4100bc78f359b68ca3e9)
*	md-cache: Adapt integer data types to avoid integer overflow	David Spisla	2019-02-20	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The "struct iatt" in iatt.h is using int64_t types for storing the atime, mtime and ctime. Therefore the struct 'struct md_cache' in md-cache.c should also use this types to avoid an integer overflow. This can happen e.g. if someone uses a very high default-retention-period in the WORM-Xlator. Change-Id: I605268d300ab622b9c8ab30e459dc00d9340aad1 fixes: bz#1678726 Signed-off-by: David Spisla <david.spisla@iternity.com> (cherry picked from commit 15423e14f16dd1a15ee5e5cbbdbdd370e57ed59f)
*	cluster/thin-arbiter: Consider thin-arbiter before marking new entry changelog	Ashish Pandey	2019-02-18	4	-19/+87
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a fop to create an entry fails on one of the data brick, we mark the pending changelog on the entry on brick for which it was successful. This is done as part of post op phase to make sure that entry gets healed even if it gets renamed to some other path where its parent was not marked as bad. As it happens as part of post op, we should consider thin-arbiter to check if the brick, which was successful, is the good brick or not. This will avoide split brain and other issues. >Change-Id: I12686675be98f02f70a5186b3ed748c541514d53 >Signed-off-by: Ashish Pandey <aspandey@redhat.com> Change-Id: I12686675be98f02f70a5186b3ed748c541514d53 updates: bz#1672314 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
*	features/shard: Ref shard inode while adding to fsync list	Krutika Dhananjay	2019-02-04	1	-8/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	PROBLEM: Lot of the earlier changes in the management of shards in lru, fsync lists assumed that if a given shard exists in fsync list, it must be part of lru list as well. This was found to be not true. Consider this - a file is FALLOCATE'd to a size which would make the number of participant shards to be greater than the lru list size. In this case, some of the resolved shards that are to participate in this fop will be evicted from lru list to give way to the rest of the shards. And once FALLOCATE completes, these shards are added to fsync list but without a ref. After the fop completes, these shard inodes are unref'd and destroyed while their inode ctxs are still part of fsync list. Now when an FSYNC is called on the base file and the fsync-list traversed, the client crashes due to illegal memory access. FIX: Hold a ref on the shard inode when adding to fsync list as well. And unref under following conditions: 1. when the shard is evicted from lru list 2. when the base file is fsync'd 3. when the shards are deleted. Change-Id: Iab460667d091b8388322f59b6cb27ce69299b1b2 fixes: bz#1669382 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> (cherry picked from commit 72922c1fd69191b220f79905a23395c3a87f86ce)
*	readdir-ahead: do not zero-out iatt in fop cbk	Ravishankar N	2019-02-04	1	-20/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...when ctime is zero. ia_type and ia_gfid always need to be non-zero for things to work correctly. Problem: Commit c9bde3021202f1d5c5a2d19ac05a510fc1f788ac zeroed out the iatt buffer in the cbks of modification fops before unwinding if the ctime in the buffer was zero. This was causing the fops to fail: noticeable when AFR's 'consistent-metadata' option was enabled. (AFR zeros out the ctime when the option is set. See commit 4c4624c9bad2edf27128cb122c64f15d7d63bbc8). Fixes: -Do not zero out the ia_type and ia_gfid of the iatt buff under any circumstance. -Also, fixed _rda_inode_ctx_update_iatts() to always update these values from the incoming buf when ctime is zero. Otherwise we end up with zero ia_type and ia_gfid the first time the function is called and the incoming buf has ctime set to zero. fixes: bz#1665145 Reported-By:Michael Hanselmann <public@hansmi.ch> Change-Id: Ib72228892d42c3513c19fc6dfb543f2aa3489eca Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 09db11b0c020bc79d493c6d7e7ea4f3beb000c68)
*	cluster/dht: Delete invalid linkto files in rmdir	N Balachandran	2019-02-04	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \|	rm -rf <dir> fails on dirs which contain linkto files that point to themselves because dht incorrectly thought that they were cached files after looking them up. The fix now treats them as invalid linkto files and deletes them. Change-Id: I376c72a5309714ee339c74485e02cfb4e29be643 fixes: bz#1671611 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	fuse: add --lru-limit option	Amar Tumballi	2019-01-16	3	-51/+86
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The inode LRU mechanism is moot in fuse xlator (ie. there is no limit for the LRU list), as fuse inodes are referenced from kernel context, and thus they can only be dropped on request of the kernel. This might results in a high number of passive inodes which are useless for the glusterfs client, causing a significant memory overhead. This change tries to remedy this by extending the LRU semantics and allowing to set a finite limit on the fuse inode LRU. A brief history of problem: When gluster's inode table was designed, fuse didn't have any 'invalidate' method, which means, userspace application could never ask kernel to send a 'forget()' fop, instead had to wait for kernel to send it based on kernel's parameters. Inode table remembers the number of times kernel has cached the inode based on the 'nlookup' parameter. And 'nlookup' field is not used by no other entry points (like server-protocol, gfapi etc). Hence the inode_table of fuse module always has to have lru-limit as '0', which means no limit. GlusterFS always had to keep all inodes in memory as kernel would have had a reference to it. Again, the reason for this is, kernel's glusterfs inode reference was pointer of 'inode_t' structure in glusterfs. As it is a pointer, we could never free it (to prevent segfault, or memory corruption). Solution: In the inode table, handle the prune case of inodes with 'nlookup' differently, and call a 'invalidator' method, which in this case is fuse_invalidate(), and it sends the request to kernel for getting the forget request. When the kernel sends the forget, it means, it has dropped all the reference to the inode, and it will send the forget with the 'nlookup' parameter too. We just need to make sure to reduce the 'nlookup' value we have when we get forget. That automatically cause the relevant prune to happen. Credits: Csaba Henk, Xavier Hernandez, Raghavendra Gowdappa, Nithya B fixes: bz#1623107 Change-Id: Ifee0737b23b12b1426c224ec5b8f591f487d83a2 Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	features/shard: Fix launch of multiple synctasks for background deletion	Krutika Dhananjay	2019-01-15	2	-71/+128
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	PROBLEM: When multiple sharded files are deleted in quick succession, multiple issues were observed: 1. misleading logs corresponding to a sharded file where while one log message said the shards corresponding to the file were deleted successfully, this was followed by multiple logs suggesting the very same operation failed. This was because of multiple synctasks attempting to clean up shards of the same file and only one of them succeeding (the one that gets ENTRYLK successfully), and the rest of them logging failure. 2. multiple synctasks to do background deletion would be launched, one for each deleted file but all of them could readdir entries from .remove_me at the same time could potentially contend for ENTRYLK on .shard for each of the entry names. This is undesirable and wasteful. FIX: Background deletion will now follow a state machine. In the event that there are multiple attempts to launch synctask for background deletion, one for each file deleted, only the first task is launched. And if while this task is doing the cleanup, more attempts are made to delete other files, the state of the synctask is adjusted so that it restarts the crawl even after reaching end-of-directory to pick up any files it may have missed in the previous iteration. This patch also fixes uninitialized lk-owner during syncop_entrylk() which was leading to multiple background deletion synctasks entering the critical section at the same time and leading to illegal memory access of base inode in the second syntcask after it was destroyed post shard deletion by the first synctask. Change-Id: Ib33773d27fb4be463c7a8a5a6a4b63689705324e updates: bz#1665803 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> (cherry picked from commit c0c2022e7d7097e96270a74f37813eda0c4e6339)
*	features/shard: Assign fop id during background deletion to prevent ↵	Krutika Dhananjay	2019-01-14	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	excessive logging ... of the kind "[2018-12-26 05:22:44.195019] E [MSGID: 133010] [shard.c:2253:shard_common_lookup_shards_cbk] 0-volume1-shard: Lookup on shard 785 failed. Base file gfid = cd938e64-bf06-476f-a5d4-d580a0d37416 [No such file or directory]" shard_common_lookup_shards_cbk() has a specific check to ignore ENOENT error without logging them during specific fops. But because background deletion is done in a new frame (with local->fop being GF_FOP_NULL), the ENOENT check is skipped and the absence of shards gets logged everytime. To fix this, local->fop is initialized to GF_FOP_UNLINK during background deletion. Change-Id: I0ca8d3b3bfbcd354b4a555eee520eb0479bcda35 updates: bz#1665803 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> (cherry picked from commit aa28fe32364e39981981d18c784e7f396d56153f)
*	leases: Reset lease_ctx->timer post deletion	Soumya Koduri	2019-01-09	1	-0/+1
\| \| \| \| \| \| \| \| \| \|	To avoid use_after_free, reset lease_ctx->timer back to NULL after the structure has been freed. Change-Id: Icd213ec809b8af934afdb519c335a4680a1d6cdc updates: bz#1651323 Signed-off-by: Soumya Koduri <skoduri@redhat.com> (cherry picked from commit a9b0003c717087ff168bc143c70559162e53e0d5)
*	core: Fixed typos in nl-cache and logging-guidelines.md	N Balachandran	2019-01-09	1	-2/+2
\| \| \| \| \| \| \| \| \|	Replaced "recieve" with "receive". Change-Id: I58a3d3d4a0093df4743de9fae4d8ff152d4b216c fixes: bz#1662200 Signed-off-by: N Balachandran <nbalacha@redhat.com> (cherry picked from commit a11c5c66321dd8411373a68cc163c981c7d083df)
*	io-cache: xdata needs to be passed for readv operations	Soumya Koduri	2018-12-30	2	-2/+16
\| \| \| \| \| \| \| \| \| \| \| \| \|	io-cache xlator has been skipping xdata references when the date needs to be read into page cache. This patch fixes the same. Note: similar changes may be needed for other fops as well which are handled by io-cache. Change-Id: I28d73d4ba471d13eb55d0fd0b5197d222df77a2a updates: bz#1651323 Signed-off-by: Soumya Koduri <skoduri@redhat.com> (cherry picked from commit b3d88a0904131f6851f4185e43f815ecc3353ab5)
*	cluster/dht: sync brick root perms on add brick	N Balachandran	2018-12-26	1	-16/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a single brick is added to the volume and the newly added brick is the first to respond to a dht_revalidate call, its stbuf will not be merged into local->stbuf as the brick does not yet have a layout. The is_permission_different check therefore fails to detect that an attr heal is required as it only considers the stbuf values from existing bricks. To fix this, merge all stbuf values into local->stbuf and use local->prebuf to store the correct directory attributes. Change-Id: Ic9e8b04a1ab9ed1248b6b056e3450bbafe32e1bc fixes: bz#1660736 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	performance/rda: Fixed dict_t memory leak	N Balachandran	2018-12-26	1	-8/+0
\| \| \| \| \| \| \| \| \| \| \|	Removed all references to dict_t xdata_from_req which is allocated but not used anywhere. It is also not cleaned up and hence causes a memory leak. fixes: bz#1659676 Change-Id: I2edb857696191e872ad12a12efc36999626bacc7 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	shard: prevent segfault in shard_unlink_block_inode()	Sunny Kumar	2018-12-20	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	gluster-blockd sometimes segfaults with the following backtrace: Core was generated by `/usr/sbin/gluster-blockd --glfs-lru-count 5 --log-level INFO'. Program terminated with signal 11, Segmentation fault. #0 0x00007fbb9cd639b9 in shard_unlink_block_inode (local=local@entry=0x7fbb80000a78, shard_block_num=<optimized out>) at shard.c:2929 2929 base_ictx->fsync_count--; (gdb) bt #0 0x00007fbb9cd639b9 in shard_unlink_block_inode (local=local@entry=0x7fbb80000a78, shard_block_num=<optimized out>) at shard.c:2929 #1 0x00007fbb9cd64311 in shard_unlink_shards_do_cbk (frame=frame@entry=0x7fbb9010a768, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, preparent=preparent@entry=0x7fbb7470dcf8, postparent=postparent@entry=0x7fbb7470dd90, xdata=xdata@entry=0x0) at shard.c:2987 A fix for this has already been provided through a Converity report. Backport of: > Change-Id: Ic5d302a5e32d375acf8adc412763ab94e6dabc3d > Signed-off-by: Sunny Kumar <sunkumar@redhat.com> > (cherry picked from commit 145e180517054626d07892219fdee689b703c218) Change-Id: I699a039e9c5115eb3376190dd8014427d12a293b Updates: bz#1659563 Signed-off-by: Niels de Vos <ndevos@redhat.com>
*	afr: thin-arbiter 2 domain locking and in-memory state	Ravishankar N	2018-12-12	6	-76/+679
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	2 domain locking + xattrop for write-txn failures: -------------------------------------------------- - A post-op wound on TA takes AFR_TA_DOM_NOTIFY range lock and AFR_TA_DOM_MODIFY full lock, does xattrop on TA and releases AFR_TA_DOM_MODIFY lock and stores in-memory which brick is bad. - All further write txn failures are handled based on this in-memory value without querying the TA. - When shd heals the files, it does so by requesting full lock on AFR_TA_DOM_NOTIFY domain. Client uses this as a cue (via upcall), releases AFR_TA_DOM_NOTIFY range lock and invalidates its in-memory notion of which brick is bad. The next write txn failure is wound on TA to again update the in-memory state. - Any incomplete write txns before the AFR_TA_DOM_NOTIFY upcall release request is got is completed before the lock is released. - Any write txns got after the release request are maintained in a ta_waitq. - After the release is complete, the ta_waitq elements are spliced to a separate queue which is then processed one by one. - For fops that come in parallel when the in-memory bad brick is still unknown, only one is wound to TA on wire. The other ones are maintained in a ta_onwireq which is then processed after we get the response from TA. Change-Id: I32c7b61a61776663601ab0040e2f0767eca1fd64 updates: bz#1648205 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Signed-off-by: Ashish Pandey <aspandey@redhat.com>
*	features/bitrot: compare the signature with proper length	Raghavendra Bhat	2018-12-12	2	-8/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* The scrubber was comparing the checksum of the file that it calculated (by reading the file) with the on disk signature (stored via xattr) wrongly. It was using strlen to calculate the signature, while the actual length of the signature is given by the brick. Just use the actual length that the brick provides instead of trying to calculate the signature length via strlen API. * In posix, gfid2path was using the same string that contains the list of all the xattrs of file to save the value of the gfid2path xattr as well. This causes confusion when gfid2path xattr is queried by scrubber for getting the actual path of a corrupted file. Use separate string to fetch the value of the xattr instead of the string that contains the list of xattrs. Backport of: > Patch: https://review.gluster.org/21752/ > BUG: 1654805 > Change-Id: I2d664ab524d2b312233476cb35863dde3122e9a9 (cherry picked from commit f77fb6d568616592ab25501c402c140d15235ca9) Change-Id: I2d664ab524d2b312233476cb35863dde3122e9a9 fixes: bz#1654370 Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
*	afr: assign gfid during name heal when no 'source' is present.	Ravishankar N	2018-12-12	4	-52/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: If parent dir is in split-brain or has dirty xattrs set, and the file has gfid missing on one of the bricks, then name heal won't assign the gfid. Fix: Use the brick we select the gfid from as the 'source'. Note: Problem was found while trying to debug a split-brain issue on Cynthia Zhou's setup. fixes: bz#1655545 Change-Id: Id088d4f0fb017aa35122de426654194e581ed742 Reported-by: Cynthia Zhou <cynthia.zhou@nokia-sbell.com> Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 4d58730c0cd6ab5db39aec8a15276f7bd3371b04)
*	leases: Do not conflict with internal fops	Soumya Koduri	2018-12-12	1	-0/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Internal fops (with frame->root->pid < 0) are used to heal or move data and maintains data integrity. That is they do not modify client data which holds the lease. Hence no need to recall Lease for such fops. Note: Like for locks, we would need rebalance and self-heal daemon process to heal lease state as well. Change-Id: I8988693fef8d00e17c19dcc842e2238f9eb5ab48 updates: bz#1651323 Signed-off-by: Soumya Koduri <skoduri@redhat.com> (cherry picked from commit 080aa5b9e9d998552e23f7c33aed3afb0ca93c34)
*	lease: Treat unlk request as noop if lease not found	Soumya Koduri	2018-12-12	1	-2/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When the glusterfs server recalls the lease, it expects client to flush data and unlock the lease. If not it sets a timer (starting from the time it sends RECALL request) and post timeout, it revokes it. Here we could have a race where in client did send UNLK lease request but because of network delay it may have reached after server revokes it. To handle such situations, treat such requests as noop and return sucesss. Change-Id: I166402d10273f4f115ff04030ecbc14676a01663 updates: bz#1651323 Signed-off-by: Soumya Koduri <skoduri@redhat.com> (cherry picked from commit c2e758b54d8a3f778e3e63db0000bb8b63de9b25)