path: root/xlators/performance
Each entry shows: commit message, author, age (date), files changed, lines changed (-removed/+added).
* xlators: prefer libglusterfs time API (Dmitry Antipov, 2020-09-07; 7 files changed, -34/+16)
  Prefer timespec_now_realtime() and gf_time() over clock_gettime() and time(),
  use gf_tvdiff() and gf_tsdiff() where appropriate, drop the unused
  time_elapsed() and leftovers in 'struct posix_private'.
  Change-Id: Ie1f0229df5b03d0862193ce2b7fb91d27b0981b6
  Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
  Updates: #1002
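  A minimal standalone sketch of the simplification these time-API commits
  describe (illustration only, not GlusterFS source; gf_time() and the
  gf_tvdiff()/gf_tsdiff() helpers are assumed to be thin wrappers over the
  libc calls shown here):

      /* Measuring a seconds-granularity interval does not need the
       * nanosecond-capable clock_gettime() machinery. */
      #include <stdio.h>
      #include <time.h>

      int main(void)
      {
          /* Old pattern: struct timespec for a whole-seconds timeout. */
          struct timespec before, after;
          clock_gettime(CLOCK_REALTIME, &before);
          /* ... work whose age we want to check ... */
          clock_gettime(CLOCK_REALTIME, &after);
          long elapsed_old = after.tv_sec - before.tv_sec;

          /* Simplified pattern: plain time_t is enough (the patches use
           * gf_time() here, assumed to behave like time(NULL)). */
          time_t start = time(NULL);
          /* ... same work ... */
          long elapsed_new = (long)(time(NULL) - start);

          printf("%ld %ld\n", elapsed_old, elapsed_new);
          return 0;
      }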
* open-behind: implement create fop (Xavi Hernandez, 2020-09-07; 1 file changed, -0/+52)
  Open-behind didn't implement the create fop. Because of this, created files
  were not accounted in the number of open fds, which could cause future opens
  to be delayed when they shouldn't be. This patch implements the create fop.
  It also fixes a problem when destroying the stack: when frame->local was not
  NULL, STACK_DESTROY() tried to mem_put() it, which is not correct.
  Fixes: #1440
  Change-Id: Ic982bad07d4af30b915d7eb1fbcef7a847a45869
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* performance/md-cache: simplify and cleanup internal time management (Dmitry Antipov, 2020-08-25; 1 file changed, -39/+31)
  Since this xlator measures time intervals in seconds, timespec_now() may be
  replaced with the simpler gf_time(). Consistently use time_t and uint32_t for
  timeouts, add better error checking in mdc_reconfigure(), and adjust comments
  and messages as well.
  Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
  Change-Id: I757c988e52db9d92348a900a43c617022a3d62af
  Updates: #1002
* performance/quick-read: simplify and cleanup internal time management (Dmitry Antipov, 2020-08-22; 2 files changed, -21/+7)
  Since this xlator measures time intervals in seconds, gettimeofday() may be
  replaced with the simpler gf_time().
  Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
  Change-Id: I5962771acbe8553dca51970183a55786a5289828
  Updates: #1002
* performance/io-cache: simplify and cleanup internal time management (Dmitry Antipov, 2020-08-21; 3 files changed, -38/+18)
  Since this xlator measures time intervals in seconds, gettimeofday() may be
  replaced with the simpler gf_time(). Also simplify ioc_inode_need_revalidate()
  and make it static.
  Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
  Change-Id: Iaf13ecbf527589286ab3331c37429dd04bf6fa2c
  Updates: #1002
* open-behind: fix call_frame leak (Xavi Hernandez, 2020-08-20; 1 file changed, -4/+10)
  When an open was delayed, a copy of the frame was created because the current
  frame was used to unwind the "fake" open. When the open was actually sent, the
  frame was correctly destroyed. However, if the file was closed before needing
  to send the open, the frame was not destroyed. This patch correctly destroys
  the frame in all cases.
  Change-Id: I8c00fc7f15545c240e8151305d9e4cf06d653926
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
  Fixes: #1440
* NetBSD build fixes (Emmanuel Dreyfus, 2020-06-30; 1 file changed, -2/+2)
  - Make sure -largp is used at link time.
  - PTHREAD_MUTEX_ADAPTIVE_NP is not available; use PTHREAD_MUTEX_DEFAULT
    instead.
  - Avoid non-POSIX [[ ]] in scripts.
  - Do not check whether lock.spinlock is NULL, since it is not a pointer
    (it is not a pointer on Linux either).
  Change-Id: I5e04a7c552d24f8a473c2b837828d1bddfa7e128
  Fixes: #1347
  Type: Bug
  Signed-off-by: Emmanuel Dreyfus <manu@netbsd.org>
* Indicate timezone offsets in timestamps (Csaba Henk, 2020-06-15; 2 files changed, -10/+5)
  Logs and other output carrying timestamps will now have timezone offsets
  indicated, e.g.:

      [2020-03-12 07:01:05.584482 +0000] I [MSGID: 106143] [glusterd-pmap.c:388:pmap_registry_remove] 0-pmap: removing brick (null) on port 49153

  To this end:
  - gf_time_fmt() now inserts the timezone offset via the %z strftime(3)
    template.
  - A new utility function has been added, gf_time_fmt_tv(), that takes a
    struct timeval pointer (*tv) instead of a time_t value to specify the
    time. If tv->tv_usec is negative, gf_time_fmt_tv(... tv ...) is
    equivalent to gf_time_fmt(... tv->tv_sec ...). Otherwise it also
    inserts tv->tv_usec into the formatted string.
  - Building timestamps of usec precision has been converted to
    gf_time_fmt_tv, which is necessary because the method of appending a
    period and the usec value to the end of the timestamp does not work if
    the timestamp has a zone offset, but it's also beneficial in terms of
    eliminating repetition.
  - The buffer passed to gf_time_fmt/gf_time_fmt_tv has been unified to be
    of GF_TIMESTR_SIZE (256). We need slightly larger buffer space to
    accommodate the zone offset, and it's preferable to use a buffer which
    is undisputedly large enough.

  This change does *not* do the following:
  - Retaining a method of timestamp creation without timezone offset. To my
    understanding we don't need such backward compatibility, as the code
    just emits timestamps to logs and other diagnostic texts and doesn't do
    any later processing on them that would rely on their format. An
    exception to this, i.e. a case where a timestamp is built for internal
    use, is graph.c:fill_uuid(). As far as I can see, what matters in that
    case is the uniqueness of the produced string, not the format.
  - Implementing a single-token (space-free) timestamp format. While some
    timestamp formats used to be single-token, now all of them will include
    a space preceding the offset indicator. Again, I did not see a use case
    where this could be significant in terms of representation.
  - Moving the codebase to a single unified timestamp format and dropping
    the fmt argument of gf_time_fmt/gf_time_fmt_tv. While the gf_timefmt_FT
    format is almost ubiquitous, there are a few cases where different
    formats are used. I'm not convinced there is any reason to not use
    gf_timefmt_FT in those cases too, but I did not want to make a decision
    in this regard.
  Change-Id: I0af73ab5d490cca7ed8d07a2ce7ac22a6df2920a
  Updates: #837
  Signed-off-by: Csaba Henk <csaba@redhat.com>
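  A minimal standalone sketch of how a %z strftime(3) template yields the
  offset-bearing timestamps shown above, with the microseconds inserted
  before the offset (illustration only, not the gf_time_fmt_tv()
  implementation; the 256-byte buffer mirrors the GF_TIMESTR_SIZE value
  quoted above):

      #include <stdio.h>
      #include <sys/time.h>
      #include <time.h>

      int main(void)
      {
          char buf[256];
          struct timeval tv;
          struct tm tm;

          gettimeofday(&tv, NULL);
          localtime_r(&tv.tv_sec, &tm);

          /* "%F %T" -> "2020-03-12 07:01:05" */
          size_t len = strftime(buf, sizeof(buf), "%F %T", &tm);
          /* Microseconds must go before the zone offset; appending them
           * after "+0000" would no longer read as one timestamp. */
          len += snprintf(buf + len, sizeof(buf) - len, ".%06ld",
                          (long)tv.tv_usec);
          /* " %z" -> " +0000" */
          strftime(buf + len, sizeof(buf) - len, " %z", &tm);

          printf("[%s]\n", buf);
          return 0;
      }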
* open-behind: rewrite of internal logic (Xavi Hernandez, 2020-06-04; 2 files changed, -823/+485)
  There was a critical flaw in the previous implementation of open-behind.

  When an open is done in the background, it's necessary to take a reference on
  the fd_t object because once we "fake" the open answer, the fd could be
  destroyed. However, as long as there's a reference, the release function won't
  be called. So, if the application closes the file descriptor without having
  actually opened it, there will always remain at least 1 reference, causing a
  leak.

  To avoid this problem, the previous implementation didn't take a reference on
  the fd_t, so there were races where the fd could be destroyed while it was
  still in use.

  To fix this, I've implemented a new xlator cbk that gets called from fuse when
  the application closes a file descriptor.

  The whole logic of handling background opens has been simplified and it's more
  efficient now. A stub is created only if the fop needs to be delayed until an
  open completes; otherwise no memory allocations are needed.

  Correctly handling the close request while the open is still pending has added
  a bit of complexity, but overall normal operation is simpler.
  Change-Id: I6376a5491368e0e1c283cc452849032636261592
  Fixes: #1225
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* md-cache: fix several NULL dereferences (Xavi Hernandez, 2020-04-23; 1 file changed, -66/+129)
  This patch addresses the following CIDs from Coverity Scan:
  - 1425196
  - 1425197
  - 1425198
  - 1425199
  - 1525200
  Change-Id: Iddcfea449d3dd56d4dfcc39f4c3c608518e611e4
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
  Updates: #1060
* md-cache: avoid clearing cache when not necessary (Xavi Hernandez, 2020-04-16; 1 file changed, -72/+93)
  mdc_inode_xatt_set() blindly cleared the current cache when the dict was not
  NULL, even if no xattr had been requested. This patch fixes this by only
  calling mdc_inode_xatt_set() when we have explicitly requested something to
  cache.
  Change-Id: Idc91a4693f1ff39f7059acde26682ccc361b947d
  Fixes: #1140
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* write-behind: fix data corruption (Xavi Hernandez, 2020-04-03; 1 file changed, -2/+2)
  There was a bug in write-behind that allowed a previously completed write to
  overwrite the overlapping region of data from a future write.

  Suppose we want to send three writes (W1, W2 and W3). W1 and W2 are
  sequential, and W3 writes at the same offset as W2:

      W2.offset = W3.offset = W1.offset + W1.size

  Both W1 and W2 are sent in parallel. W3 is only sent after W2 completes, so
  W3 should *always* overwrite the overlapping part of W2.

  Suppose write-behind processes the requests from two concurrent threads,
  interleaved in the following order:

      <received W1>
      <received W2>
      wb_enqueue_tempted(W1)   /* W1 is assigned gen X */
      wb_enqueue_tempted(W2)   /* W2 is assigned gen X */
      wb_process_queue()
        __wb_preprocess_winds()
          /* W1 and W2 are sequential and all other requisites are met to
           * merge both requests. */
          __wb_collapse_small_writes(W1, W2)
          __wb_fulfill_request(W2)
        __wb_pick_unwinds() -> W2
          /* In this case, since the request is already fulfilled,
           * wb_inode->gen is not updated. */
      wb_do_unwinds()
        STACK_UNWIND(W2)
      /* The application has received the result of W2, so it can send W3. */
      <received W3>
      wb_enqueue_tempted(W3)   /* W3 is assigned gen X */
      wb_process_queue()
        /* Here we have W1 (which contains the conflicting W2) and W3 with
         * the same gen, so they are interpreted as concurrent writes that
         * do not conflict. */
        __wb_pick_winds() -> W3
      wb_do_winds()
        STACK_WIND(W3)
      wb_process_queue()       /* Eventually W1 will be ready to be sent. */
        __wb_pick_winds() -> W1
        __wb_pick_unwinds() -> W1
          /* Here wb_inode->gen is incremented. */
      wb_do_unwinds()
        STACK_UNWIND(W1)
      wb_do_winds()
        STACK_WIND(W1)

  So, as we can see, W3 is sent before W1, which shouldn't happen.

  The problem is that wb_inode->gen is only incremented for requests that have
  not been fulfilled, but after a merge the request is marked as fulfilled even
  though it has not been sent to the brick. This allows future requests to be
  assigned the same generation, so they can be internally reordered.

  Solution: increment wb_inode->gen before any unwind, even if it's for a
  fulfilled request.

  Special thanks to Stefan Ring for writing a reproducer that has been crucial
  to identify the issue.
  Change-Id: Id4ab0f294a09aca9a863ecaeef8856474662ab45
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
  Fixes: #884
* open-behind: fix missing fd reference (Xavi Hernandez, 2020-03-17; 1 file changed, -11/+16)
  Open-behind was not keeping any reference on fds pending to be opened. This
  made it possible for a concurrent close and an entry fop (unlink, rename, ...)
  to cause destruction of the fd while it was still being used.
  Change-Id: Ie9e992902cf2cd7be4af1f8b4e57af9bd6afd8e9
  Fixes: bz#1810934
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* multiple: fix bad type cast (Xavi Hernandez, 2020-01-10; 3 files changed, -9/+13)
  When using inode_ctx_get() or inode_ctx_set(), a 'uint64_t *' is expected. In
  many cases, the value to retrieve or store is a pointer, which will be of
  smaller size on some architectures (for example 32-bit ones). In this case,
  directly passing the address of the pointer cast to a 'uint64_t *' is wrong
  and can cause memory corruption.
  Change-Id: Iae616da9dda528df6743fa2f65ae5cff5ad23258
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
  Fixes: bz#1785611
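  A minimal standalone sketch of why that cast corrupts memory on 32-bit
  builds (illustration only; store_ctx() is a hypothetical stand-in for an
  inode_ctx_set()-style helper that writes through a 'uint64_t *'):

      #include <stdint.h>
      #include <stdio.h>

      static void store_ctx(uint64_t *slot, uint64_t value)
      {
          *slot = value; /* always writes 8 bytes */
      }

      int main(void)
      {
          struct { void *ptr; int guard; } s = { NULL, 42 };

          /* Broken pattern: on a 32-bit build sizeof(s.ptr) == 4, so writing
           * a uint64_t through this cast also clobbers 'guard':
           *
           *     store_ctx((uint64_t *)&s.ptr, (uint64_t)(uintptr_t)s.ptr);
           */

          /* Safe pattern: cast the value, not the address of the pointer. */
          uint64_t value = (uint64_t)(uintptr_t)s.ptr;
          store_ctx(&value, value);
          s.ptr = (void *)(uintptr_t)value;

          printf("guard is still %d\n", s.guard);
          return 0;
      }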
* xlator/performance/io-threads: structure logging (yatipadia, 2020-01-06; 2 files changed, -23/+33)
  Convert gf_msg() to gf_smsg().
  Change-Id: I35c6f62c346a75ecb22cd3a4346ad4dc48f09a91
  Updates: #657
* xlator/io-cache: structure logging (yatipadia, 2020-01-06; 4 files changed, -100/+112)
  Convert all gf_msg() to gf_smsg().
  Updates: #657
  Change-Id: I72215b2518df78174dda8a7bc8de6f21fe1ba10f
  Signed-off-by: yatipadia <ypadia@redhat.com>
* md-cache.c: move cache-swift-metadata to off by default (Yaniv Kaul, 2019-12-20; 1 file changed, -1/+1)
  This causes mdc_xattr_list_populate() NOT to add "user.swift.metadata" as an
  xattr in the list of attrs we look at in some paths of the code.
  This is documented at https://github.com/gluster/glusterfs/issues/775
  Change-Id: Ie3d676c74a2f333beeacc302e253efe9f9942d1a
  updates: bz#1193929
  Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* To fix readdir-ahead memory leak (HuangShujun, 2019-12-10; 1 file changed, -0/+1)
  The glusterfs client process leaks memory if several files are created under
  one folder and the folder is then deleted. According to the statedump, the
  ref count of readdir-ahead is bigger than zero in the inode table.

  Readdir-ahead gets the parent inode via inode_parent() in
  rda_mark_inode_dirty() on each rda_writev_cbk(). The ref count of the parent
  folder's inode is increased in inode_parent(), but readdir-ahead does not
  unref it later. The correction is to unref the parent inode at the end of
  rda_mark_inode_dirty().
  Fixes: bz#1779055
  Signed-off-by: HuangShujun <549702281@qq.com>
  Change-Id: Iee68ab1089cbc2fbc4185b93720fb1f66ee89524
* performance/open-behind: seek fop should open_and_resume (Pranith Kumar K, 2019-10-11; 1 file changed, -0/+27)
  fixes: bz#1760187
  Change-Id: I4c6ad13194d4fc5c7705e35bf9a27fce504b51f9
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* performance/read-ahead: update expected offset before unwinding read response (Raghavendra G, 2019-10-11; 1 file changed, -2/+2)
  With the current code there is a window of time between unwinding the response
  to a read request and updating the internal offset to account for the read
  just done. If a new sequential read request comes in during this window, it is
  incorrectly identified as a non-sequential read. The fix is to update the file
  offset to account for a read request before sending back its response.
  Change-Id: Iff0c59c769e1eb15f262257763026657e2d4785d
  Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
  Fixes: bz#1753843
* xlators: fixes logically dead code. (yatipadia, 2019-10-03; 1 file changed, -5/+0)
  This patch addresses CID-1124388.

  Problem: when we reach the "out" section in ra_priv_dump(), if the condition
  (ret && conf) holds true, then the value of "add_section" will always be
  true, so the condition (add_section == _gf_false) is dead code.

  Fix: "add_section" has no use in the whole block and made part of the block
  logically dead code, hence it has been removed.
  Change-Id: Id7e0105fc9a5ca5b2c2d098c665e6e32ecc6b62b
  updates: bz#789278
  Signed-off-by: yatipadia <ypadia@redhat.com>
* perf/write-behind: Clear frame->local on conflict error (N Balachandran, 2019-09-25; 1 file changed, -0/+4)
  WB saves the wb_inode in frame->local for the truncate and ftruncate fops.
  This value is not cleared in case of error on a conflicting write request.
  FRAME_DESTROY finds a non-null frame->local and tries to free it using
  mem_put. However, wb_inode is allocated using GF_CALLOC, causing the process
  to crash.
  credit: vpolakis@gmail.com
  Change-Id: I217f61470445775e05145aebe44c814731c1b8c5
  Fixes: bz#1753592
  Signed-off-by: N Balachandran <nbalacha@redhat.com>
* performance/md-cache: Do not skip caching of null character xattr values (Anoop C S, 2019-08-20; 1 file changed, -19/+12)
  A null character string is a valid xattr value in a file system. But for
  xattrs processed by md-cache, entries were not updated if the value was
  null ('\0'). This resulted in ENODATA when those xattrs were queried
  afterwards via getxattr(), causing failures in basic operations like create
  and copy in a specially configured Samba setup for Mac OS clients.

  On the other side, snapview-server internally sets an empty string ("") as
  the value for xattrs received as part of listxattr() that are not intended
  to be cached. Therefore we try to maintain that behaviour using an
  additional dictionary key to prevent updating entries in the getxattr() and
  fgetxattr() callbacks in md-cache.
  Credits: Poornima G <pgurusid@redhat.com>
  Change-Id: I7859cbad0a06ca6d788420c2a495e658699c6ff7
  Fixes: bz#1726205
  Signed-off-by: Anoop C S <anoopcs@redhat.com>
* [core] fix return of local in __nlc_inode_ctx_get (Rinku Kothiya, 2019-07-25; 1 file changed, -22/+14)
  __nlc_inode_ctx_get() assigns a value to nlc_pe_p which is never used by its
  parent function or any of its predecessors, hence remove the assignment and
  also that function argument, as it is not being used anywhere.
  fixes: bz#1732496
  Change-Id: I5b950e1e251bd50a646616da872a4efe9d2ff8c9
  Signed-off-by: Rinku Kothiya <rkothiya@redhat.com>
* Detach iot_worker to release its resources (Liguang Li, 2019-07-15; 1 file changed, -0/+1)
  When an iot_worker thread terminates, its resources are not reaped, which
  consumes a lot of memory. Detach iot_worker threads so that their resources
  are automatically released back to the system.
  fixes: bz#1729107
  Change-Id: I71fabb2940e76ad54dc56b4c41aeeead2644b8bb
  Signed-off-by: Liguang Li <liguang.lee6@gmail.com>
* quick-read: rename cache-invalidation key to avoid redundant keys (Atin Mukherjee, 2019-07-08; 1 file changed, -4/+5)
  With the group-metadata-cache group profile settings, turning on the
  performance.cache-invalidation option enables the cache-invalidation feature
  of both the md-cache and quick-read xlators. While the intent of
  group-metadata-cache is to set md-cache's cache-invalidation feature,
  quick-read also gets affected by the same key. The md-cache feature and its
  profile have existed since release-3.9, but quick-read cache-invalidation was
  introduced in release-4; due to this op-version mismatch, applying this group
  profile on any cluster which is >= glusterfs-4 breaks backward compatibility
  with old clients.

  The proposed fix here is to rename the key in quick-read to
  'quick-read-cache-invalidation' so that both features have distinct
  identification. This by itself brings in a backward compatibility challenge:
  if this feature is enabled in an existing cluster and the cluster is upgraded
  to a version where this change exists, it will lead to an unidentified old
  key. But as a workaround we can always ask users upgrading to release-7 to
  turn off this option, upgrade the cluster, and turn it back on with the new
  key. This needs to be documented once the patch is accepted.
  Fixes: bz#1698042
  Change-Id: I30422ba6496208e21191a8d78ad29b2e21078664
  Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
  Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
* md-cache: only update generation for inode at upcall and NULL stat (Kinglong Mee, 2019-06-19; 1 file changed, -20/+36)
  1. For parallel writes from nfs-ganesha, two fops get two generations, but
     the fop replies may be returned out of order.
  2. The inode md-cache timeout should not increase conf->generation.

  With this patch:
  1. A fop only gets the generation from the inode md-cache or conf; it does
     not increase it.
  2. The generation is increased at upcall invalidation, at estale/enoent
     error invalidation, and on a reply with zeroed-out stat from
     write-behind.
  Change-Id: I897ecaa143fd18bc024c1948c7d1a6f831fd53da
  Updates: bz#1683594
  Signed-off-by: Kinglong Mee <mijinlong@open-fs.com>
* multiple files: another attempt to remove includes (Yaniv Kaul, 2019-06-14; 1 file changed, -4/+0)
  There are many include statements that are not needed.

  A previous, more ambitious attempt failed because of *BSD platforms (see
  https://review.gluster.org/#/c/glusterfs/+/21929/ ). Now trying a more
  conservative reduction.

  It does not solve all the circular deps that we have, but it does reduce some
  of them. There is just too much to handle reasonably (dht-common.h includes
  dht-lock.h which includes dht-common.h ...), but it does reduce the overall
  number of lines of include we need to look at in the future to understand
  and fix the mess later on.
  Change-Id: I550cd001bdefb8be0fe67632f783c0ef6bee3f9f
  updates: bz#1193929
  Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* libglusterfs: cleanup iovec functions (Xavi Hernandez, 2019-06-11; 4 files changed, -35/+8)
  This patch cleans some iovec code and creates two additional helper functions
  to simplify management of iovec structures.

      iov_range_copy(struct iovec *dst, uint32_t dst_count, uint32_t dst_offset,
                     struct iovec *src, uint32_t src_count, uint32_t src_offset,
                     uint32_t size);

  This function copies up to 'size' bytes from 'src' at offset 'src_offset' to
  'dst' at 'dst_offset'. It returns the number of bytes copied.

      iov_skip(struct iovec *iovec, uint32_t count, uint32_t size);

  This function removes the initial 'size' bytes from 'iovec' and returns the
  updated number of iovec vectors remaining.

  The signature of iov_subset() has also been modified to make it safer and
  easier to use. The new signature is:

      iov_subset(struct iovec *src, int src_count, uint32_t start,
                 uint32_t size, struct iovec **dst, int32_t dst_count);

  This function creates a new iovec array containing the subset of the 'src'
  vector starting at 'start' with size 'size'. The resulting array is allocated
  if '*dst' is NULL, or copied to '*dst' if it fits (based on 'dst_count'). It
  returns the number of iovec vectors used.

  A new set of functions to iterate through an iovec array has been created.
  They can be used to simplify the implementation of other iovec-based helper
  functions.
  Change-Id: Ia5fe57e388e23392a8d6cdab17670e337cadd587
  Updates: bz#1193929
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
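  A minimal standalone sketch of the iov_skip() semantics described above
  (illustration only, not the libglusterfs implementation; it only follows the
  signature and behaviour quoted in the commit message):

      #include <stdint.h>
      #include <stdio.h>
      #include <sys/uio.h>

      /* Drop the first 'size' bytes from 'iov' and return how many vectors
       * remain in use. */
      static uint32_t
      iov_skip_sketch(struct iovec *iov, uint32_t count, uint32_t size)
      {
          uint32_t i = 0, j;

          /* Skip vectors that are entirely consumed by 'size'. */
          while (i < count && size >= iov[i].iov_len) {
              size -= iov[i].iov_len;
              i++;
          }
          /* Trim the partially consumed vector, if any. */
          if (i < count && size > 0) {
              iov[i].iov_base = (char *)iov[i].iov_base + size;
              iov[i].iov_len -= size;
          }
          /* Shift the surviving vectors to the front of the array. */
          for (j = 0; i + j < count; j++)
              iov[j] = iov[i + j];

          return count - i;
      }

      int main(void)
      {
          char a[] = "abcd", b[] = "efgh";
          struct iovec iov[2] = { { a, 4 }, { b, 4 } };

          uint32_t left = iov_skip_sketch(iov, 2, 6); /* skip "abcdef" */
          printf("%u vector(s) left, first is \"%.*s\"\n", left,
                 (int)iov[0].iov_len, (char *)iov[0].iov_base);
          return 0;
      }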
* lcov: improve line coverage (Amar Tumballi, 2019-06-03; 1 file changed, -39/+19)
  upcall: remove extra variable assignment and use just one initialization.
  open-behind: reduce the overall number of lines in functions not frequently
  called.
  selinux: reduce some lines in init failure cases.
  updates: bz#1693692
  Change-Id: I7c1de94f2ec76a5bfe1f48a9632879b18e5fbb95
  Signed-off-by: Amar Tumballi <amarts@redhat.com>
* performance/write-behind: remove request from wip list in wb_writev_cbk (Raghavendra G, 2019-05-06; 1 file changed, -0/+6)
  There is a race in the way O_DIRECT writes are handled. Assume two
  overlapping write requests w1 and w2.

  * w1 is issued and is in the wb_inode->wip queue as the response is still
    pending from the bricks. Also, wb_request_unref in wb_do_winds is not yet
    invoked:

        list_for_each_entry_safe (req, tmp, tasks, winds) {
            list_del_init (&req->winds);

            if (req->op_ret == -1) {
                call_unwind_error_keep_stub (req->stub, req->op_ret,
                                             req->op_errno);
            } else {
                call_resume_keep_stub (req->stub);
            }

            wb_request_unref (req);
        }

  * w2 is issued and wb_process_queue is invoked. w2 is not picked up for
    winding as w1 is still in wb_inode->wip. w1 is added to the todo list and
    wb_writev for w2 returns.
  * The response to w1 is received and invokes wb_request_unref. Assume
    wb_request_unref in wb_do_winds (see point 1) is not invoked yet. Since
    there is one more refcount, wb_request_unref in wb_writev_cbk of w1
    doesn't remove w1 from wip.
  * wb_process_queue is invoked as part of wb_writev_cbk of w1. But it fails
    to wind w2 as w1 is still in wip.
  * wb_request_unref is invoked on w1 as part of wb_do_winds. w1 is removed
    from all queues, including wip.
  * After this point there is no invocation of wb_process_queue unless a new
    request is issued from the application, causing w2 to hang until the next
    request.

  This bug is similar to bz 1626780 and bz 1379655.
  Change-Id: Iaa47437613591699d4c8ad18bc0b32de6affcc31
  Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
  Fixes: bz#1705865
* performance/decompounder: remove the translator as the feature is not used anymore (Amar Tumballi, 2019-04-29; 7 files changed, -989/+1)
  updates: bz#1693692
  Change-Id: Id5932b11e115ca6da1c2bfff7ae1460787109e06
  Signed-off-by: Amar Tumballi <amarts@redhat.com>
* io-threads.c: Potentially skip a lock. (Yaniv Kaul, 2019-03-12; 1 file changed, -12/+13)
  Before going into the lock, verify stub_cnt != 0; otherwise, skip this code.

  Unrelated, switch a CALLOC to MALLOC, as we initialize all members right
  away. This allocation is also done under the lock, so it should help a bit
  as well.

  Compile-tested only!
  updates: bz#1193929
  Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
  Change-Id: Ie2fe6adff41ae4969abff95eff945b54e1a01d32
* performance/readdir-ahead: fix deadlock (Raghavendra Gowdappa, 2019-03-07; 1 file changed, -1/+2)
  This deadlock happens while processing the dentry corresponding to the
  current directory (.) in rda_fill_readdirp. In this case the following order
  is followed:

      LOCK(directory_fd_ctx->lock);
      rda_inode_ctx_get_iatt -> LOCK(directory_inode->lock);

  However, in rda_mark_inode_dirty the following lock order is followed:

      LOCK(directory_inode->lock);
      LOCK(directory_fd_ctx->lock);

  These two codepaths, when executed concurrently, resulted in a deadlock. The
  current patch fixes this by not locking the directory inode and fd-ctx in
  rda_fill_readdirp. This is fine as the directory inode's stat won't change
  due to writes to files within the directory.
  Change-Id: Ic93a67a0dac8229bb0d79582e526a512e6f2569c
  Fixes: bz#1674412
  Signed-off-by: Raghavendra Gowdappa <rgowdapp@redhat.com>
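  A minimal standalone sketch of the AB/BA lock-order inversion described
  above (illustration only, not GlusterFS source; fd_lock and inode_lock are
  hypothetical stand-ins for the fd-ctx and inode locks):

      #include <pthread.h>
      #include <stdio.h>

      static pthread_mutex_t fd_lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_mutex_t inode_lock = PTHREAD_MUTEX_INITIALIZER;

      /* Order used by the old rda_fill_readdirp path. */
      static void fill_readdirp(void)
      {
          pthread_mutex_lock(&fd_lock);    /* LOCK(fd_ctx->lock) */
          pthread_mutex_lock(&inode_lock); /* LOCK(inode->lock)  */
          pthread_mutex_unlock(&inode_lock);
          pthread_mutex_unlock(&fd_lock);
      }

      /* Order used by rda_mark_inode_dirty. */
      static void mark_inode_dirty(void)
      {
          pthread_mutex_lock(&inode_lock); /* LOCK(inode->lock)  */
          pthread_mutex_lock(&fd_lock);    /* LOCK(fd_ctx->lock) */
          pthread_mutex_unlock(&fd_lock);
          pthread_mutex_unlock(&inode_lock);
      }

      int main(void)
      {
          /* Run serially here so the program terminates. Run the two
           * functions from concurrent threads and each can end up holding
           * the lock the other needs next, i.e. a deadlock. */
          fill_readdirp();
          mark_inode_dirty();
          puts("no deadlock when serialized");
          return 0;
      }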
* io-threads: Prioritize fops with NO_ROOT_SQUASH pid (Susant Palai, 2019-03-05; 1 file changed, -1/+3)
  A 30% regression was observed in the mkdir path with commit
  b139bc58eb504adf5ef81658896c9283ae21f390. On analysis it was found that the
  io-threads xlator deprioritizes fops with any negative pid.

  Some context on the no-root-squash pid requirement: the DHT xlator does some
  of its internal fops with root privileges. This is needed so that operations
  like layout healing are not abandoned because a non-root user is operating.
  If the root-squash option is enabled, the layout set operation loses its
  root privilege as the server xlator converts the uid and pid to random
  numbers. Hence, the above mentioned commit converted the pid to
  GF_CLIENT_PID_NO_ROOT_SQUASH to continue fops as root.

  Combining the above, I am proposing not to deprioritize fops with the
  no-root-squash pid.
  Change-Id: I54d056c01b25729304a77f9242fbaff39c5672ba
  fixes: bz#1676430
  Signed-off-by: Susant Palai <spalai@redhat.com>
* md-cache: Adapt integer data types to avoid integer overflow (David Spisla, 2019-02-20; 1 file changed, -3/+3)
  The "struct iatt" in iatt.h uses int64_t types for storing atime, mtime and
  ctime. Therefore 'struct md_cache' in md-cache.c should also use these types
  to avoid an integer overflow. This can happen e.g. if someone uses a very
  high default-retention-period in the WORM xlator.
  Change-Id: I605268d300ab622b9c8ab30e459dc00d9340aad1
  fixes: bz#1678726
  Signed-off-by: David Spisla <david.spisla@iternity.com>
* performance/write-behind: handle call-stub leaks (Raghavendra Gowdappa, 2019-02-19; 1 file changed, -0/+8)
  Change-Id: I7be9a5f48dcad1b136c479c58b1dca1e0488166d
  Signed-off-by: Raghavendra Gowdappa <rgowdapp@redhat.com>
  Fixes: bz#1674406
* performance/write-behind: fix use-after-free in readdirp (Raghavendra Gowdappa, 2019-02-19; 1 file changed, -18/+22)
  Two issues were found:
  1. In wb_readdirp_cbk, the inode should be unrefed after wb_inode is
     unlocked. Otherwise, the inode and hence the context wb_inode can be
     freed by the time we try to unlock wb_inode.
  2. wb_readdirp_mark_end iterates over a list of wb_inodes of the children
     of a directory. But the inodes could have been freed and hence the list
     might be corrupted. To fix this, take a reference on the inode before
     adding it to the invalidate_list of the parent.
  Change-Id: I911b0e0b2060f7f41ded0b05db11af6f9b7c09c5
  Signed-off-by: Raghavendra Gowdappa <rgowdapp@redhat.com>
  Updates: bz#1674406
* core: implement a global thread pool (Xavi Hernandez, 2019-02-18; 1 file changed, -5/+7)
  This patch implements a thread pool that is wait-free for adding jobs to the
  queue and uses a very small locked region to get jobs. This makes it possible
  to decrease contention drastically. It's based on the wfcqueue structure
  provided by the urcu library.

  It automatically enables more threads when load demands it, and stops them
  when not needed. There's a maximum number of threads that can be used. This
  value can be configured. Depending on the workload, the maximum number of
  threads plays an important role, so it needs to be configured for optimal
  performance. Currently the thread pool doesn't self-adjust the maximum for
  the workload, so this configuration needs to be changed manually.

  For this reason, the global thread pool has been made optional, so that
  volumes can still use the thread pool provided by io-threads.

  To enable it for bricks, the following option needs to be set:

      config.global-threading = on

  This option has no effect if bricks are already running. A restart is
  required to activate it. It's recommended to also enable the following
  option when running bricks with the global thread pool:

      performance.iot-pass-through = on

  To enable it for a FUSE mount point, the option '--global-threading' must be
  added to the mount command. To change it, an umount and remount is needed.
  It's recommended to disable the following option when using global threading
  on a mount point:

      performance.client-io-threads = off

  To enable it for services managed by glusterd, glusterd needs to be started
  with option '--global-threading'. In this case all daemons, like self-heal,
  will be using the global thread pool.

  Currently it can only be enabled for bricks, FUSE mounts and glusterd
  services.

  The maximum number of threads for clients and bricks can be configured using
  the following options:

      config.client-threads
      config.brick-threads

  These options can be applied online and their effect is immediate most of
  the time. If one of them is set to 0, the maximum number of threads will be
  calculated as #cores * 2.

  Some distributions use a very old userspace-rcu library (version 0.7). For
  this reason, some header files from version 0.10 have been copied into
  contrib/userspace-rcu and are used if the detected version is 0.7 or older.

  An additional change has been made to io-threads to prevent threads from
  being started when iot-pass-through is set.
  Change-Id: I09d19e246b9e6d53c6247b29dfca6af6ee00a24b
  updates: #532
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* md-cache.c: minor reduction of work under lock. (Yaniv Kaul, 2019-02-18; 1 file changed, -4/+3)
  Take the time before taking the lock, not under lock.

  Compile-tested only!
  updates: bz#1193929
  Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
  Change-Id: I6cd05d8556a9bcc015e1be53f6ba46854e52a380
* performance/md-cache: change the op-version of "global-cache-invalidation" (Raghavendra Gowdappa, 2019-02-12; 1 file changed, -1/+1)
  Since release-6 is not done yet, this option can be introduced with
  GD_OP_VERSION_6_0.
  Change-Id: I8a0867e5b8b23d0d485704a2fc7a3efc4a90f637
  Signed-off-by: Raghavendra Gowdappa <rgowdapp@redhat.com>
  updates: bz#1664934
* performance/md-cache: introduce an option to control invalidation of inodes (Raghavendra Gowdappa, 2019-02-11; 1 file changed, -10/+44)
  Explicit invalidation by calling inode_invalidate is necessary when the same
  (meta)data is shared/accessed across multiple mounts. Without an explicit
  inode_invalidate call, caches in the mount which didn't witness writes
  wouldn't be aware of changes, as writes wouldn't have passed through them.
  However, if (meta)data is not shared, all relevant I/O goes through the
  cache of a single mount and hence is always coherent with the (meta)data on
  the bricks. So, explicit inode invalidation can be disabled for this case,
  which gives a huge performance boost for workloads that write data and then
  immediately read the data they just wrote. Note that otherwise, local writes
  (which pass through the cache) will change ctime and cause unnecessary
  invalidations.

  The name of the option that controls this behavior is
  "performance.global-cache-invalidation". This option is global and it purges
  caches both in the glusterfs and kernel stacks for native FUSE mounts. For
  non-native FUSE mounts, it purges the cache only from the glusterfs stack.
  It is effective only when performance.stat-prefetch is on.

  Note that there is a similar option "performance.cache-invalidation", but
  the scope of that option is limited to quick-read and md-cache.
  Change-Id: I462bb4b65ff9aae1f6ba76f50b1f2f94fb10323b
  Signed-off-by: Raghavendra Gowdappa <rgowdapp@redhat.com>
  updates: bz#1664934
* core: make gf_thread_create() easier to use (Xavi Hernandez, 2019-02-01; 1 file changed, -6/+1)
  This patch creates a specific function to set the thread name using a string
  format and a variable argument list, like printf(). This function is used to
  set the thread name from gf_thread_create(), which now accepts a variable
  argument list to create the full name. It's not necessary anymore to use a
  local array to build the name of the thread. This is done automatically.
  Change-Id: Idd8d01fd462c227359b96e98699f8c6d962dc17c
  Updates: bz#1193929
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* readdir-ahead: do not zero-out iatt in fop cbk (Ravishankar N, 2019-01-31; 1 file changed, -20/+4)
  ...when ctime is zero. ia_type and ia_gfid always need to be non-zero for
  things to work correctly.

  Problem: commit c9bde3021202f1d5c5a2d19ac05a510fc1f788ac zeroed out the iatt
  buffer in the cbks of modification fops before unwinding if the ctime in the
  buffer was zero. This was causing the fops to fail: noticeable when AFR's
  'consistent-metadata' option was enabled. (AFR zeroes out the ctime when the
  option is set; see commit 4c4624c9bad2edf27128cb122c64f15d7d63bbc8.)

  Fixes:
  - Do not zero out the ia_type and ia_gfid of the iatt buf under any
    circumstance.
  - Also, fixed _rda_inode_ctx_update_iatts() to always update these values
    from the incoming buf when ctime is zero. Otherwise we end up with zero
    ia_type and ia_gfid the first time the function is called *and* the
    incoming buf has ctime set to zero.
  fixes: bz#1670253
  Reported-by: Michael Hanselmann <public@hansmi.ch>
  Change-Id: Ib72228892d42c3513c19fc6dfb543f2aa3489eca
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* performance/readdir-ahead: Fix deadlock in readdir ahead. (Mohammed Rafi KC, 2019-01-23; 1 file changed, -4/+14)
  This patch fixes a lock contention in the readdir-ahead xlator. There are
  two issues: one is the processing of the "." and ".." entries while holding
  an fd_ctx lock; the other is destroying the stack inside an fd_ctx lock.
  Change-Id: Id0bf83a3d9fea6b40015b8d167525c59c6cfa25e
  updates: bz#1659708
  Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* fix 32-bit-build-smoke warnings (Iraj Jamali, 2019-01-11; 3 files changed, -5/+5)
  fixes: bz#1622665
  Change-Id: I777d67b1b62c284c62a02277238ad7538eef001e
  Signed-off-by: Iraj Jamali <ijamali@redhat.com>
* performance/md-cache: Fix a crash when statfs caching is enabled (Vijay Bellur, 2019-01-11; 1 file changed, -2/+2)
  mem_put() in STACK_UNWIND_STRICT causes a crash if frame->local is not null,
  as md-cache obtains local from CALLOC. Changed two occurrences of
  STACK_UNWIND_STRICT to MDC_STACK_UNWIND, as the latter macro does not rely
  on STACK_UNWIND_STRICT for cleaning up frame->local.
  fixes: bz#1632503
  Change-Id: I1b3edcb9372a164ef73119e99a49e747765d7166
  Signed-off-by: Vijay Bellur <vbellur@redhat.com>
* performance/io-threads: Improve debuggability in statedump (Vijay Bellur, 2019-01-07; 1 file changed, -6/+23)
  The statedump from io-threads lacked information to understand the number of
  running threads and the number of requests in each priority queue. This
  patch addresses that.

  Sample statedump output with this patch:

      current_high_priority_threads=7
      current_normal_priority_threads=9
      current_low_priority_threads=0
      current_least_priority_threads=0
      fast_priority_queue_length=32
      normal_priority_queue_length=45

  Also, changed the wording for the least priority queue in
  iot_get_pri_meaning().
  Change-Id: Ic5f6391a15cc28884383f5185fce1cb52e0d10a5
  fixes: bz#1664124
  Signed-off-by: Vijay Bellur <vbellur@redhat.com>
* multiple-files: clang-scan fixes (Amar Tumballi, 2018-12-31; 1 file changed, -1/+1)
  updates: bz#1622665
  Change-Id: I9f3a75ed9be3d90f37843a140563c356830ef945
  Signed-off-by: Amar Tumballi <amarts@redhat.com>
* core: Fixed typos in nl-cache and logging-guidelines.md (N Balachandran, 2018-12-26; 1 file changed, -2/+2)
  Replaced "recieve" with "receive".
  Change-Id: I58a3d3d4a0093df4743de9fae4d8ff152d4b216c
  fixes: bz#1662089
  Signed-off-by: N Balachandran <nbalacha@redhat.com>