summaryrefslogtreecommitdiffstats
path: root/xlators
Commit message (Collapse)AuthorAgeFilesLines
...
* cluster/ec: fix EIO error for concurrent writes on sparse filesXavi Hernandez2019-08-221-9/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | EC doesn't allow concurrent writes on overlapping areas, they are serialized. However non-overlapping writes are serviced in parallel. When a write is not aligned, EC first needs to read the entire chunk from disk, apply the modified fragment and write it again. The problem appears on sparse files because a write to an offset implicitly creates data on offsets below it (so, in some way, they are overlapping). For example, if a file is empty and we read 10 bytes from offset 10, read() will return 0 bytes. Now, if we write one byte at offset 1M and retry the same read, the system call will return 10 bytes (all containing 0's). So if we have two writes, the first one at offset 10 and the second one at offset 1M, EC will send both in parallel because they do not overlap. However, the first one will try to read missing data from the first chunk (i.e. offsets 0 to 9) to recombine the entire chunk and do the final write. This read will happen in parallel with the write to 1M. What could happen is that half of the bricks process the write before the read, and the half do the read before the write. Some bricks will return 10 bytes of data while the otherw will return 0 bytes (because the file on the brick has not been expanded yet). When EC tries to recombine the answers from the bricks, it can't, because it needs more than half consistent answers to recover the data. So this read fails with EIO error. This error is propagated to the parent write, which is aborted and EIO is returned to the application. The issue happened because EC assumed that a write to a given offset implies that offsets below it exist. This fix prevents the read of the chunk from bricks if the current size of the file is smaller than the read chunk offset. This size is correctly tracked, so this fixes the issue. Also modifying ec-stripe.t file for Test #13 within it. In this patch, if a file size is less than the offset we are writing, we fill zeros in head and tail and do not consider it strip cache miss. That actually make sense as we know what data that part holds and there is no need of reading it from bricks. Change-Id: Ic342e8c35c555b8534109e9314c9a0710b6225d6 Fixes: bz#1739427 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/ec: inherit healing from lock when it has infoKinglong Mee2019-08-221-2/+3
| | | | | | | | | If lock has info, fop should inherit healing mask from it. Otherwise, fop cannot inherit right healing when changed_flags is zero. Change-Id: Ife80c9169d2c555024347a20300b0583f7e8a87f updates: bz#1739424 Signed-off-by: Kinglong Mee <mijinlong@horiscale.com>
* afr: restore timestamp of parent dir during entry-healRavishankar N2019-08-211-0/+2
| | | | | | | Fixes: bz#1741041 Change-Id: I29e338bac62104233a6f80212df8d0fb016affda Signed-off-by: Ravishankar N <ravishankar@redhat.com> (cherry picked from commit 8e9c53ebf16705b9a1db2fc486dc24a5cb244ddd)
* features/shard: Send correct size when reads are sent beyond file sizeKrutika Dhananjay2019-08-211-0/+2
| | | | | | | Change-Id: I0cebaaf55c09eb1fb77a274268ff564e871b743b fixes bz#1740316 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com> (cherry picked from commit 51237eda7c4b3846d08c5d24d1e3fe9b7ffba1d4)
* event: rename event_XXX with gf_ prefixedXiubo Li2019-08-216-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I hit one crash issue when using the libgfapi. In the libgfapi it will call glfs_poller() --> event_dispatch() in file api/src/glfs.c:721, and the event_dispatch() is defined by libgluster locally, the problem is the name of event_dispatch() is the extremly the same with the one from libevent package form the OS. For example, if a executable program Foo, which will also use and link the libevent and the libgfapi at the same time, I can hit the crash, like: kernel: glfs_glfspoll[68486]: segfault at 1c0 ip 00007fef006fd2b8 sp 00007feeeaffce30 error 4 in libevent-2.0.so.5.1.9[7fef006ed000+46000] The link for Foo is: lib_foo_LADD = -levent $(GFAPI_LIBS) It will crash. This is because the glfs_poller() is calling the event_dispatch() from the libevent, not the libglsuter. The gfapi link info : GFAPI_LIBS = -lacl -lgfapi -lglusterfs -lgfrpc -lgfxdr -luuid If I link Foo like: lib_foo_LADD = $(GFAPI_LIBS) -levent It will works well without any problem. And if Foo call one private lib, such as handler_glfs.so, and the handler_glfs.so will link the GFAPI_LIBS directly, while the Foo won't and it will dlopen(handler_glfs.so), then the crash will be hit everytime. The link info will be: foo_LADD = -levent libhandler_glfs_LIBADD = $(GFAPI_LIBS) I can avoid the crash temporarily by linking the GFAPI_LIBS in Foo too like: foo_LADD = $(GFAPI_LIBS) -levent libhandler_glfs_LIBADD = $(GFAPI_LIBS) But this is ugly since the Foo won't use any APIs from the GFAPI_LIBS. And in some cases when the --as-needed link option is added(on many dists it is added as default), then the crash is back again, the above workaround won't work. Backport of: > https://review.gluster.org/#/c/glusterfs/+/23110/ > Change-Id: I38f0200b941bd1cff4bf3066fca2fc1f9a5263aa > Fixes: #699 > Signed-off-by: Xiubo Li <xiubli@redhat.com> Change-Id: I38f0200b941bd1cff4bf3066fca2fc1f9a5263aa updates: bz#1740519 Signed-off-by: Xiubo Li <xiubli@redhat.com> (cherry picked from commit 799edc73c3d4f694c365c6a7c27c9ab8eed5f260)
* locks/fencing: Address hang while lock preemptionSusant Palai2019-08-213-20/+29
| | | | | | | | | | | | | | | | | The fop_wind_count can go negative when fencing is enabled on unwind path of the IO leading to hang. Also changed code so that fop_wind_count needs to be maintained only till fencing is enabled on the file. > updates: bz#1717824 > Change-Id: Icd04b42bc16cd3d50eaa581ee57233910194f480 > signed-off-by: Susant Palai <spalai@redhat.com> (backport of https://review.gluster.org/#/c/glusterfs/+/23088/) fixes: bz#1740077 Change-Id: Icd04b42bc16cd3d50eaa581ee57233910194f480 Signed-off-by: Susant Palai <spalai@redhat.com>
* features/utime: always update ctime at setattrKinglong Mee2019-08-192-13/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | For the nfs EXCLUSIVE mode create may sets a later time to mtime (at verifier), it should not set to ctime for storage.ctime does not allowed set ctime to a earlier time. /* Earlier, mdata was updated only if the existing time is less * than the time to be updated. This would fail the scenarios * where mtime can be set to any time using the syscall. Hence * just updating without comparison. But the ctime is not * allowed to changed to older date. */ According to kernel's setattr, always set ctime at setattr, and doesnot set ctime from mtime at storage.ctime. Backport of: > Patch: https://review.gluster.org/23154 > Change-Id: I5cfde6cb7f8939da9617506e3dc80bd840e0d749 > BUG: 1737288 > Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Change-Id: I5cfde6cb7f8939da9617506e3dc80bd840e0d749 fixes: bz#1739437 Signed-off-by: Kotresh HR <khiremat@redhat.com>
* posix/ctime: Fix race during lookup ctime xattr healKotresh HR2019-08-191-18/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Ctime heals the ctime xattr ("trusted.glusterfs.mdata") in lookup if it's not present. In a multi client scenario, there is a race which results in updating the ctime xattr to older value. e.g. Let c1 and c2 be two clients and file1 be the file which doesn't have the ctime xattr. Let the ctime of file1 be t1. (from backend, ctime heals time attributes from backend when not present). Now following operations are done on mount c1 -> ls -l /mnt/file1 | c2 -> ls -l /mnt/file1;echo "append" >> /mnt/file1; The race is that the both c1 and c2 didn't fetch the ctime xattr in lookup, so both of them tries to heal ctime to time 't1'. If c2 wins the race and appends the file before c1 heals it, it sets the time to 't1' and updates it to 't2' (because of append). Now c1 proceeds to heal and sets it to 't1' which is incorrect. Solution: Compare the times during heal and only update the larger time. This is the general approach used in ctime feature but got missed with healing legacy files. Backport of: > Patch: https://review.gluster.org/23131 > BUG: 1734299 > Change-Id: I930bda192c64c3d49d0aed431ce23d3bc57e51b7 > Signed-off-by: Kotresh HR <khiremat@redhat.com> fixes: bz#1739436 Change-Id: I930bda192c64c3d49d0aed431ce23d3bc57e51b7 Signed-off-by: Kotresh HR <khiremat@redhat.com>
* features/utime: Fix mem_put crashPranith Kumar K2019-08-191-1/+3
| | | | | | | | | | | | | | | | | | | | Problem: When frame->local is not null FRAME_DESTROY calls mem_put on it. Since the stub is already destroyed in call_resume(), it leads to crash Fix: Set frame->local to NULL before calling call_resume() Backport of: > Patch: https://review.gluster.org/23091 > BUG: 1593542 > Change-Id: I0f8adf406f4cefdb89d7624ba7a9d9c2eedfb1de > Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> fixes: bz#1739430 Change-Id: I0f8adf406f4cefdb89d7624ba7a9d9c2eedfb1de Signed-off-by: Kotresh HR <khiremat@redhat.com>
* ctime: Set mdata xattr on legacy filesKotresh HR2019-08-196-56/+228
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: The files which were created before ctime enabled would not have "trusted.glusterfs.mdata"(stores time attributes) xattr. Upon fops which modifies either ctime or mtime, the xattr gets created with latest ctime, mtime and atime, which is incorrect. It should update only the corresponding time attribute and rest from backend Solution: Creating xattr with values from brick is not possible as each brick of replica set would have different times. So create the xattr upon successful lookup if the xattr is not created Note To Reviewers: The time attributes used to set xattr is got from successful lookup. Instead of sending the whole iatt over the wire via setxattr, a structure called mdata_iatt is sent. The mdata_iatt contains only time attributes. Backport of: > Patch: https://review.gluster.org/22936 > Change-Id: I5e535631ddef04195361ae0364336410a2895dd4 > BUG: 1593542 > Signed-off-by: Kotresh HR <khiremat@redhat.com> Change-Id: I5e535631ddef04195361ae0364336410a2895dd4 updates: bz#1739430 Signed-off-by: Kotresh HR <khiremat@redhat.com>
* posix : add posix_set_ctime() in posix_ftruncate()Jiffin Tony Thottan2019-08-091-0/+2
| | | | | | | | | | | >Backport of https://review.gluster.org/#/c/glusterfs/+/22948/ >Change-Id: I0cb5320fea71306e0283509ae47024f23874b53b >fixes: bz#1723761 Change-Id: I0cb5320fea71306e0283509ae47024f23874b53b fixes: bz#1739399 Signed-off-by: Jiffin Tony Thottan <jthottan@redhat.com> (cherry picked from commit 7d8be567f2f904fc74d0990ebce2e8afbedab918)
* cluster/dht: Fixed a memleak in dht_rename_cbkN Balachandran2019-08-091-11/+33
| | | | | | | | | Fixed a memleak in dht_rename_cbk when creating a linkto file. Change-Id: I705adef3cb79e33806520fc2b15558e90e2c211c fixes: bz#1739337 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* protocol/client: propagte GF_EVENT_CHILD_PING only for connections to brickRaghavendra G2019-08-091-4/+12
| | | | | | | | | | | | | | | | | | | | | | | | Two reasons: * ping responses from glusterd may not be relevant for Halo replication. Instead, it might be interested in only knowing whether the brick itself is responsive. * When a brick is killed, propagating GF_EVENT_CHILD_PING of ping response from glusterd results in GF_EVENT_DISCONNECT spuriously propagated to parent xlators. These DISCONNECT events are from the connections client establishes with glusterd as part of its reconnect logic. Without GF_EVENT_CHILD_PING, the last event propagated to parent xlators would be the first DISCONNECT event from brick and hence subsequent DISCONNECTS to glusterd are not propagated as protocol/client prevents same event being propagated to parent xlators consecutively. propagating GF_EVENT_CHILD_PING for ping responses from glusterd would change the last_sent_event to GF_EVENT_CHILD_PING and hence protocol/client cannot prevent subsequent DISCONNECT events Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Fixes: bz#1739334 Change-Id: I50276680c52f05ca9e12149a3094923622d6eaef (cherry picked from commit 5d66eafec581fb3209af74595784be8854ca40a4)
* glusterd: don't log a warning message for tier-enabled keySanju Rakonde2019-07-291-2/+5
| | | | | | | | | | | | | | We are logging a warning message saying unknown-key for tier-enabled kay. although the tier xlator is deprecated, this key is left behind for handling the peer rejection issues in a heterogeneous cluster. We need not to log if this key is not found/recognised. updates: bz#1727008 Change-Id: Ia68661898a618f99a240ca8d8a124ff6a65ebe9d Signed-off-by: Sanju Rakonde <srakonde@redhat.com> (cherry picked from commit 7d270d05f835d41e32572501246b50181dc9be56)
* features/snapview-server: obtain the list of snapshots inside the lockRaghavendra Bhat2019-07-241-1/+1
| | | | | | | | | The current list of snapshots from priv->dirents is obtained outside the lock. Change-Id: I8876ec0a38308da5db058397382fbc82cc7ac177 Fixes: bz#1731512 (cherry picked from commit 8e795617fd6f5193d0d52a336059ce1a28108c0e)
* cluster/ta: Notify the clients only if there are pending healskarthik-us2019-07-244-22/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: In case of thin arbiter, before index healer starts crawling the indices at every heal-timeout interval, even if there is nothing to be healed it will send an upcall notification to all the clients to release any AFR_TA_DOM_NOTIFY locks that they hold. SHD will wait for the upcall to return before proceeding with the heal even though there is nothing to be healed. This will also invalidates the cached information about the bricks states on the clients which leads to extra calls on TA from clients for the next reads & writes if needed. This will impact the IO performance. Fix: - Before sending the upcall to the clients, check for any pending heals on TA without taking any locks. - If there is nothing marked bad on TA, then continue with the index crawl to heal any dirty markings present on the files due to any post-op failure. - If there is a brick marked as bad on TA, then take the AFR_TA_DOM_NOTIFY lock on TA from SHD, get the state on TA and continue with the current healing process. Change-Id: Ieb477bc6cb18bbdfd4e7a0453c5ed79b574ec9d6 fixes: bz#1729483 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* glusterd/svc: update pid of mux volumes from the shd processMohammed Rafi KC2019-07-249-35/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | For a normal volume, we are updating the pid from a the process while we do a daemonization or at the end of the init if it is no-daemon mode. Along with updating the pid we also lock the file, to make sure that the process is running fine. With brick mux, we were updating the pidfile from gluterd after an attach/detach request. There are two problems with this approach. 1) We are not holding a pidlock for any file other than parent process. 2) There is a chance for possible race conditions with attach/detach. For example, shd start and a volume stop could race. Let's say we are starting an shd and it is attached to a volume. While we trying to link the pid file to the running process, this would have deleted by the thread that doing a volume stop. Backport of : https://review.gluster.org/#/c/glusterfs/+/22935/ >Change-Id: I29a00352102877ce09ea3f376ca52affceb5cf1a >Updates: bz#1722541 >Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Change-Id: I29a00352102877ce09ea3f376ca52affceb5cf1a Updates: bz#1732668 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* graph/shd: Use top down approach while cleaning xlatorMohammed Rafi KC2019-07-249-1/+12
| | | | | | | | | | | | | | | | | | | We were cleaning xlator from botton to top, which might lead to problems when upper xlators trying to access the xlator object loaded below. One such scenario is when fd_unref happens as part of the fini call which might lead to calling the releasedir to lower xlator. This will lead to invalid mem access Backport of:https://review.gluster.org/#/c/glusterfs/+/22968/ >Change-Id: I8a6cb619256fab0b0c01a2d564fc88287c4415a0 >Updates: bz#1716695 >Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Change-Id: I8a6cb619256fab0b0c01a2d564fc88287c4415a0 Updates: bz#1730229 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* graph/shd: Use glusterfs_graph_deactivate to free the xl recMohammed Rafi KC2019-07-241-0/+3
| | | | | | | | | | | | | | | | | We were using glusterfs_graph_fini to free the xl rec from glusterfs_process_volfp as well as glusterfs_graph_cleanup. Instead we can use glusterfs_graph_deactivate, which is does fini as well as other common rec free. Backportof : https://review.gluster.org/22904 >Change-Id: Ie4a5f2771e5254aa5ed9f00c3672a6d2cc8e4bc1 >Updates: bz#1716695 >Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com> Change-Id: Ie4a5f2771e5254aa5ed9f00c3672a6d2cc8e4bc1 Updates: bz#1730229 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* glusterd: conditionally clear txn_opinfo in stage opAtin Mukherjee2019-07-201-2/+10
| | | | | | | | | | | | | | ...otherwise this leads to a crash when volume status is run on a heterogeneous mode. > Fixes: bz#1723658 > Change-Id: I0d39f412b2e5e9d3ef0a3462b90b38bb5364b09d > Signed-off-by: Atin Mukherjee <amukherj@redhat.com> > (cherry picked from commit a72452fcf90679b28baec12d2769cbaa982bb4e4) Fixes: bz#1728127 Change-Id: I0d39f412b2e5e9d3ef0a3462b90b38bb5364b09d Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
* cluster/afr: Fix incorrect reporting of gfid & type mismatchkarthik-us2019-07-202-2/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | Problems: 1. When checking for type and gfid mismatch, if the type or gfid is unknown because of missing gfid handle and the gfid xattr it will be reported as type or gfid mismatch and the heal will not complete. 2. If the source selected during entry heal has null gfid the same will be sent to afr_lookup_and_heal_gfid(). In this function when we try to assign the gfid on the bricks where it does not exist, we are considering the same gfid and try to assign that on those bricks. This will fail in posix_gfid_set() since the gfid sent is null. Fix: If the gfid sent to afr_lookup_and_heal_gfid() is null choose a valid gfid before proceeding to assign the gfid on the bricks where it is missing. In afr_selfheal_detect_gfid_and_type_mismatch(), do not report type/gfid mismatch if the type/gfid is unknown or not set. Change-Id: Ia06552e4dc4a9f89cb7f5302833604bd21bbf7da fixes: bz#1729481 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* glusterd: do not mark skip_locking as true for geo-rep operationsSanju Rakonde2019-07-191-2/+7
| | | | | | | | | | | | | | | | | | We need to send the commit req to peers in case of geo-rep operations even though it is a no volname operation. In commit phase peers try to set the txn_opinfo which will fail because it is a no volname operation where we don't require a commit phase. We mark skip_locking as true for no volname operations, but we have to give an exception to geo-rep operations, so that they can set txn_opinfo in commit phase. Please refer to detailed RCA at the bug: 1730543 fixes: bz#1730543 Change-Id: I9f2478b12a281f6e052035c0563c40543493a3fc Signed-off-by: Sanju Rakonde <srakonde@redhat.com> (cherry picked from commit b917974ee922d7a2e079692ad7d6f61f900b37b2)
* features/snapview-server: use the same volfile server for gfapi optionsRaghavendra Bhat2019-07-182-4/+42
| | | | | | | | | | | | snapview server xlator makes use of "localhost" as the volfile server while initing the new glfs instance to talk to a snapshot. While localhost is fine, better use the same volfile server that was used to start the snapshot daemon containing the snapview-server xlator. Change-Id: I4485d39b0e3d066f481adc6958ace53ea33237f7 fixes: bz#1728391 Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com> (cherry picked from commit 36f6e6df0ff15d0464b869803710adca2b65e8ba)
* glusterd: Show the correct brick status in get-stateMohit Agrawal2019-07-153-2/+37
| | | | | | | | | | | | | | Problem: get-state does not show correct brick status if brick status is not Started, it always shows started if any value is set brickinfo->status Solution: Check the value of brickinfo->status to show correct status in get-state Change-Id: I12a79619024c2cf59f338220d144f2f034059b3b fixes: bz#1726905 Signed-off-by: Mohit Agrawal <moagrawal@redhat.com> (cherry picked from commit af989db23d1db00e087f2b9d3dfc43b13ef17153)
* glusterd/thin-arbiter: Thin-arbiter integration with GD1Vishal Pandey2019-07-049-23/+730
| | | | | | | | | | | | | | | | | | | | | | | | | gluster volume create <VOLNAME> replica 2 thin-arbiter 1 <host1>:<brick1> <host2>:<brick2> <thin-arbiter-host>:<path-to-store-replica-id-file> [force] The changes have been made in a way that the last brick in the bricks list will be treated as the thin-arbiter. GD1 will be manipulated to consider replica count to be as 2 and continue creating the volume like any other replica 2 volume but since thin-arbiter volumes need ta-brick client xlator entries for each subvolume in fuse volfile, volfile generation is modified in a way to inject these entries seperately in the volfile for every subvolume. Few more additions - 1- Save the volinfo with new fields ta_bricks list and thin_arbiter_count. 2- Introduce a new option client.ta-brick-port to add remote-port to ta-brick xlator entry in fuse volfiles. The option can be set using the following CLI syntax - gluster volume set <VOLNAME> client.ta-brick-port <PORTNO.> 3- Volume Info will contain a Thin-Arbiter-path entry to distinguish from other replicate volumes. Change-Id: Ib434e2313b29716f32476c6c211d282c4ef39406 Updates #687 Signed-off-by: Vishal Pandey <vpandey@redhat.com> (cherry picked from commit 9b223b15ab69fce4076de036ee162f36a058bcd2)
* glusterd/shd: Change shd logfile to a unique nameMohammed Rafi KC2019-06-246-33/+39
| | | | | | | | | | | | | | | | With the shd mux changes, shd was havinga a logfile with volname of the first started volume. This was creating a lot confusion, as other volumes data is also logging to a logfile which has a different vol name. With this changes the logfile will be changed to a unique name ie "/var/log/glusterfs/glustershd.log". This was the same logfile name before the shd mux Change-Id: I2b94c1f0b2cf3c9493505dddf873687755a46dda fixes: bz#1721601 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* shd/mux: Fix race between mux_proc unlink and stopMohammed Rafi KC2019-06-241-0/+3
| | | | | | | | | | | | | | There is a small race window, where we have a shd proc without having a connection. That is when we stopped the last shd running on a process. The list was removed outside of a lock just after stopping the process. So there is a window where we stopped the process, but the shd proc list contains the entry. Change-Id: Id82a82509e5cd72acac24e8b7b87197626525441 fixes: bz#1722541 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* cluster/ec: Prevent double pre-op xattropsPranith Kumar K2019-06-221-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Race: Thread-1 Thread-2 1) Does ec_get_size_version() to perform pre-op fxattrop as part of write-1 2) Calls ec_set_dirty_flag() in ec_get_size_version() for write-2. This sets dirty[] to 1 3) Completes executing ec_prepare_update_cbk leading to ctx->dirty[] = '1' 4) Takes LOCK(inode->lock) to check if there are any flags and sets dirty-flag because lock->waiting_flag is 0 now. This leads to fxattrop to increment on-disk dirty[] to '2' At the end of the writes the file will be marked for heal even when it doesn't need heal. Fix: Perform ec_set_dirty_flag() and other checks inside LOCK() to prevent dirty[] to be marked as '1' in step 2) above Updates bz#1593224 Change-Id: Icac2ab39c0b1e7e154387800fbededc561612865 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* posix/ctime: Fix ctime upgrade issueKotresh HR2019-06-211-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: On a EC volume, during upgrade from the older version where ctime feature is not enabled(or not present) to the newer version where the ctime feature is available (enabled default), the self heal hangs and doesn't complete. Cause: The ctime feature has both client side code (utime) and server side code (posix). The feature is driven from client. Only if the client side sets the time in the frame, should the server side sets the time attributes in xattr. But posix setattr/fseattr was not doing that. When one of the server nodes is updated, since ctime is enabled by default, it starts setting xattr on setattr/fseattr on the updated node/brick. On a EC volume the first two updated nodes(bricks) are not a problem because there are 4 other bricks with consistent data. However once the third brick is updated, the new attribute(mdata xattr) will cause an inconsistency on metadata on 3 bricks, which prevents the file to be repaired. Fix: Don't create mdata xattr with utimes/utimensat system call. Only update if already present. Change-Id: Ieacedecb8a738bb437283ef3e0f042fd49dc4c8c fixes: bz#1720201 Signed-off-by: Kotresh HR <khiremat@redhat.com>
* WORM-Xlator: Avoid performing fsetxattr if fd is NULLDavid Spisla2019-06-211-0/+7
| | | | | | | | | | | If worm_create_cbk receives an error (op_ret == -1) fd will be NULL and therefore performing fsetxattr would lead to a segfault and the brick process crashes. To avoid this we allow setting fsetxattr only if op_ret >= 0 . If an error happens we explicitly unwind Change-Id: Ie7f8a198add93e5cd908eb7029cffc834c3b58a6 fixes: bz#1717757 Signed-off-by: David Spisla <david.spisla@iternity.com>
* ec-heal: check file's gfid when deleting stale nameKinglong Mee2019-06-201-1/+11
| | | | | | | | | | | | A name-less lookup does not contain parent's stat, It is hard to check the lookuped file is at the right path. This patch changes to a name lookup, and check file's gfid with expected gfid. If the gfid is different, mark it estale. fixes: bz#1702131 Change-Id: I2de20b10d680eed1e2fb1d3830b3b3dec4520dbf Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
* afr/read: Implement latency based read child selectionMohammed Rafi KC2019-06-203-27/+98
| | | | | | | | | | | | | | | | | Network latency is an important factor selecting a read subvolume. So this patch is adding two new policy. 1) We measure the latency of a child during a GF_DUMP rpc call. Then use this latency to pick a read subvol having the least latency. 2) Second one is an hybrid mode where it calculates the effective latency by multiplying outstanding pending read request and latency, and choose the least one. Change-Id: Ia49c8a08ab61f7dcdad8b8950aa4d338e7accf97 fixes: #520 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* posix: fix crash in posix_cs_set_stateSusant Palai2019-06-202-3/+9
| | | | | | Fixes: bz#1721474 Change-Id: Ic2a53fa3d1e9e23424c6898e0986f80d52c5e3f6 Signed-off-by: Susant Palai <spalai@redhat.com>
* encryption/crypt: remove from volume fileAmar Tumballi2019-06-202-34/+0
| | | | | | | | | | | | | | The feature is not supported and is moved out of the codebase from glusterfs-5.x release. Doesn't make sense to keep the code to support it. For those who want to upgrade from an version supporting it to higher version, please do a 'gluster volume reset $VOL encryption reset' and then continue with the upgrade process. updates: bz#1648169 Change-Id: I8cf822c0d7195940bd37f6af2432a3cac68d44d1 Signed-off-by: Amar Tumballi <amarts@redhat.com>
* md-cache: only update generation for inode at upcall and NULL statKinglong Mee2019-06-191-20/+36
| | | | | | | | | | | | | | | 1. For parallel writes from nfs-ganesha, two fops with two generations, but the fops reply maybe returned disordered. 2. The inode md-cache timeout should not increase conf->generation. With this patch, 1, Fop only gets generation from inode md-cache or conf, does not increase it. 2. The generation is increased at upcall invalidate, estal/enoent error invalidate, reply with zeroed out stat from write-behind. Change-Id: I897ecaa143fd18bc024c1948c7d1a6f831fd53da Updates: bz#1683594 Signed-off-by: Kinglong Mee <mijinlong@open-fs.com>
* cluster/dht: Strip out dht xattrsN Balachandran2019-06-191-0/+2
| | | | | | | | | | Some internal DHT xattrs were not being removed when calling getxattr in pass-through mode. This has been fixed. Change-Id: If7e3dbc7b495db88a566bd560888e3e9c167defa fixes: bz#1721435 Signed-off-by: N Balachandran <nbalacha@redhat.com>
* glusterd-volgen.c: remove BD xlator from the graphYaniv Kaul2019-06-188-49/+1
| | | | | | | | | | | | | The BD xlator was removed some time ago. Remove it from the graph. We can also remove the caps settings - only the BD xlator was using it. Lastly, remove the caps (which only BD was using) and the document describing the translator. Change-Id: Id0adcb2952f4832a5dc6301e726874522e07935d updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* core: fedora 30 compiler warningsSheetalPamecha2019-06-184-9/+7
| | | | | | | | warning: ā€˜%s’ directive argument is null [-Wformat-overflow=] Change-Id: I69b8d47f0002c58b00d1cc947fac6f1c64e0b295 updates: bz#1193929 Signed-off-by: SheetalPamecha <spamecha@redhat.com>
* glusterd: log error message only when rsp.op_ret is negativeSanju Rakonde2019-06-171-1/+1
| | | | | | | | | | | | | | | | Problem: commit d42221bec9 added a log message based on rsp.op_ret check. but while running subdir-mount.t, this message is seen even on successful mounts. Solution: in __server_getspec(), return value of sys_read() is assigned to ret, which will be a non-negative number in when sys_read() is success. This non-zero value is assigned to rsp.op_ret. We should log an error only when rsp.op_ret is negative. fixes: bz#1718848 Change-Id: Ieef8ba33c2c7b4a97d4aef17543f58e66fd3b341 Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
* glusterd: add GF_TRANSPORT_BOTH_TCP_RDMA in glusterd_get_gfproxy_client_volfileAtin Mukherjee2019-06-171-0/+1
| | | | | | | | | ... with out which volume creation fails with "volume create: <xyz>: failed: Failed to create volume files" Fixes: bz#1716812 Change-Id: I2f4c2c6d5290f066b54e1c1db19e25db9937bedb Signed-off-by: Atin Mukherjee <amukherj@redhat.com>
* uss: Fix tar issue with ctime and uss enabledKotresh HR2019-06-171-9/+13
| | | | | | | | | | | | | | | | | | | Problem: If ctime and uss enabled, tar still complains with 'file changed as we read it' Cause: To clear nfs cache (gluster-nfs), the ctime was incremented in snap-view client on stat cbk. Fix: The ctime should not be incremented manually. Since gluster-nfs is planning to be deprecated, this code is being removed to fix the issue. Change-Id: Iae7f100c20fce880a50b008ba716077350281404 fixes: bz#1720290 Signed-off-by: Kotresh HR <khiremat@redhat.com>
* afr/fini: Free local_pool data during an afr finiMohammed Rafi KC2019-06-171-0/+6
| | | | | | | | | We should free the mem_pool local_pool during an afr_fini. Otherwise this will lead to mem leak for shd Change-Id: I805a34a88077bf7b886c28b403798bf9eeeb1c0b Updates: bz#1716695 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* clang-scan: resolve warningAmar Tumballi2019-06-152-4/+6
| | | | | | | | | | | | | | | dht-common.c: because there was a 'goto err' before assigning the 'local' variable, there is possibility of NULL dereference. As the check which was done wouldn't ever be true, removed the check. glusterd-geo-rep.c: a possible path where 'slave_host' could be NULL when it gets passed to strcmp() is found. strcmp() expects a valid string. Add a NULL check. Updates: bz#1622665 Change-Id: I64c280bc1beac9a2b109e8fa88f2a5ce8b823c3a Signed-off-by: Amar Tumballi <amarts@redhat.com>
* multiple files: another attempt to remove includesYaniv Kaul2019-06-1461-182/+26
| | | | | | | | | | | | | | | | | | There are many include statements that are not needed. A previous more ambitious attempt failed because of *BSD plafrom (see https://review.gluster.org/#/c/glusterfs/+/21929/ ) Now trying a more conservative reduction. It does not solve all circular deps that we have, but it does reduce some of them. There is just too much to handle reasonably (dht-common.h includes dht-lock.h which includes dht-common.h ...), but it does reduce the overall number of lines of include we need to look at in the future to understand and fix the mess later one. Change-Id: I550cd001bdefb8be0fe67632f783c0ef6bee3f9f updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* gnfs: support inode dumpXie Changlong2019-06-141-0/+16
| | | | | | | | So, we will get more debug info. fixes: #679 Change-Id: I3588e204ad25c20b69271c1a4ee17d0d158bd794 Signed-off-by: Xie Changlong <xiechanglong@cmss.chinamobile.com>
* upcall: Avoid sending notifications for invalid inodesSoumya Koduri2019-06-141-1/+18
| | | | | | | | | | | | | | | | | | | For nameless LOOKUPs, server creates a new inode which shall remain invalid until the fop is successfully processed post which it is linked to the inode table. But incase if there is an already linked inode for that entry, it discards that newly created inode which results in upcall notification. This may result in client being bombarded with unnecessary upcalls affecting performance if the data set is huge. This issue can be avoided by looking up and storing the upcall context in the original linked inode (if exists), thus saving up on those extra callbacks. Change-Id: I044a1737819bb40d1a049d2f53c0566e746d2a17 fixes: bz#1718338 Signed-off-by: Soumya Koduri <skoduri@redhat.com>
* libglusterfs: cleanup iovec functionsXavi Hernandez2019-06-116-50/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch cleans some iovec code and creates two additional helper functions to simplify management of iovec structures. iov_range_copy(struct iovec *dst, uint32_t dst_count, uint32_t dst_offset, struct iovec *src, uint32_t src_count, uint32_t src_offset, uint32_t size); This function copies up to 'size' bytes from 'src' at offset 'src_offset' to 'dst' at 'dst_offset'. It returns the number of bytes copied. iov_skip(struct iovec *iovec, uint32_t count, uint32_t size); This function removes the initial 'size' bytes from 'iovec' and returns the updated number of iovec vectors remaining. The signature of iov_subset() has also been modified to make it safer and easier to use. The new signature is: iov_subset(struct iovec *src, int src_count, uint32_t start, uint32_t size, struct iovec **dst, int32_t dst_count); This function creates a new iovec array containing the subset of the 'src' vector starting at 'start' with size 'size'. The resulting array is allocated if '*dst' is NULL, or copied to '*dst' if it fits (based on 'dst_count'). It returns the number of iovec vectors used. A new set of functions to iterate through an iovec array have been created. They can be used to simplify the implementation of other iovec-based helper functions. Change-Id: Ia5fe57e388e23392a8d6cdab17670e337cadd587 Updates: bz#1193929 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* Cluster/afr: Don't treat all bricks having metadata pending as split-brainkarthik-us2019-06-102-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | Problem: We currently don't have a roll-back/undoing of post-ops if quorum is not met. Though the FOP is still unwound with failure, the xattrs remain on the disk. Due to these partial post-ops and partial heals (healing only when 2 bricks are up), we can end up in metadata split-brain purely from the afr xattrs point of view i.e each brick is blamed by atleast one of the others for metadata. These scenarios are hit when there is frequent connect/disconnect of the client/shd to the bricks. Fix: Pick a source based on the xattr values. If 2 bricks blame one, the blamed one must be treated as sink. If there is no majority, all are sources. Once we pick a source, self-heal will then do the heal instead of erroring out due to split-brain. This patch also adds restriction of all the bricks to be up to perform metadata heal to avoid any metadata loss. Removed the test case tests/bugs/replicate/bug-1468279-source-not-blaming-sinks.t as it was doing metadata heal even when only 2 of 3 bricks were up. Change-Id: I07a9d62f84ceda329dcab1f02a33aeed258dcb09 fixes: bz#1717819 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* features/shard: Fix extra unref when inode object is lru'd out and added backKrutika Dhananjay2019-06-091-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Long tale of double unref! But do read... In cases where a shard base inode is evicted from lru list while still being part of fsync list but added back soon before its unlink, there could be an extra inode_unref() leading to premature inode destruction leading to crash. One such specific case is the following - Consider features.shard-deletion-rate = features.shard-lru-limit = 2. This is an oversimplified example but explains the problem clearly. First, a file is FALLOCATE'd to a size so that number of shards under /.shard = 3 > lru-limit. Shards 1, 2 and 3 need to be resolved. 1 and 2 are resolved first. Resultant lru list: 1 -----> 2 refs on base inode - (1) + (1) = 2 3 needs to be resolved. So 1 is lru'd out. Resultant lru list - 2 -----> 3 refs on base inode - (1) + (1) = 2 Note that 1 is inode_unlink()d but not destroyed because there are non-zero refs on it since it is still participating in this ongoing FALLOCATE operation. FALLOCATE is sent on all participant shards. In the cbk, all of them are added to fync_list. Resulting fsync list - 1 -----> 2 -----> 3 (order doesn't matter) refs on base inode - (1) + (1) + (1) = 3 Total refs = 3 + 2 = 5 Now an attempt is made to unlink this file. Background deletion is triggered. The first $shard-deletion-rate shards need to be unlinked in the first batch. So shards 1 and 2 need to be resolved. inode_resolve fails on 1 but succeeds on 2 and so it's moved to tail of list. lru list now - 3 -----> 2 No change in refs. shard 1 is looked up. In lookup_cbk, it's linked and added back to lru list at the cost of evicting shard 3. lru list now - 2 -----> 1 refs on base inode: (1) + (1) = 2 fsync list now - 1 -----> 2 (again order doesn't matter) refs on base inode - (1) + (1) = 2 Total refs = 2 + 2 = 4 After eviction, it is found 3 needs fsync. So fsync is wound, yet to be ack'd. So it is still inode_link()d. Now deletion of shards 1 and 2 completes. lru list is empty. Base inode unref'd and destroyed. In the next batched deletion, 3 needs to be deleted. It is inode_resolve()able. It is added back to lru list but base inode passed to __shard_update_shards_inode_list() is NULL since the inode is destroyed. But its ctx->inode still contains base inode ptr from first addition to lru list for no additional ref on it. lru list now - 3 refs on base inode - (0) Total refs on base inode = 0 Unlink is sent on 3. It completes. Now since the ctx contains ptr to base_inode and the shard is part of lru list, base shard is unref'd leading to a crash. FIX: When shard is readded back to lru list, copy the base inode pointer as is into its inode ctx, even if it is NULL. This is needed to prevent double unrefs at the time of deleting it. Change-Id: I99a44039da2e10a1aad183e84f644d63ca552462 Updates: bz#1696136 Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
* ec/fini: Fix race between xlator cleanup and on going async fopMohammed Rafi KC2019-06-086-15/+56
| | | | | | | | | | | | | | | | Problem: While we process a cleanup, there is a chance for a race between async operations, for example ec_launch_replace_heal. So this can lead to invalid mem access. Solution: Just like we track on going heal fops, we can also track fops like ec_launch_replace_heal, so that we can decide when to send a PARENT_DOWN request. Change-Id: I055391c5c6c34d58aef7336847f3b570cb831298 fixes: bz#1703948 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>