path: root/xlators
Commit message (Collapse)AuthorAgeFilesLines
* posix: log aio_error return codes in posix_fs_health_checkMohit Agrawal2019-08-221-3/+2
  Problem: Sometimes a brick goes down because the health check thread fails, but the error codes returned by the aio system calls are not logged. As per the aio_error man page, it returns a positive error number if the asynchronous I/O operation failed.
  Solution: Log the aio_error return codes in the error message.
  Change-Id: I2496b1bc16e602b0fd3ad53e211de11ec8c641ef
  Fixes: bz#1744519
  Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
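A minimal sketch of the idea, not the code from the patch: the value returned by `aio_error()` is carried into the log text instead of being dropped. The helper name `describe_aio_error` and the message wording are hypothetical; only the `aio_error()` return convention (0 on success, `EINPROGRESS` while pending, a positive error number on failure) comes from the man page.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format the value returned by aio_error() so the
 * error code ends up in the log line. A caller would use it roughly as
 *     ret = aio_error(&aiocb);
 *     describe_aio_error(ret, "aio_write", buf, sizeof(buf));
 */
int
describe_aio_error(int aio_ret, const char *op, char *buf, size_t len)
{
    assert(op != NULL && buf != NULL);

    if (aio_ret == 0)
        return snprintf(buf, len, "%s: completed successfully", op);
    if (aio_ret == EINPROGRESS)
        return snprintf(buf, len, "%s: still in progress", op);

    /* Any other positive value is an errno-style failure code. */
    return snprintf(buf, len, "%s: failed, aio_error=%d (%s)", op, aio_ret,
                    strerror(aio_ret));
}
```

With the numeric code in the message, a brick going down on a failed health check can be traced to the specific I/O error.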
* ctime: Fix incorrect realtime passed to frame->root->ctimeKotresh HR2019-08-221-1/+1
  On systems that don't support "timespec_get" (e.g., centos6), the code was using "clock_gettime" with "CLOCK_MONOTONIC" to get the unix epoch time, which is incorrect. This patch introduces "timespec_now_realtime", which uses "clock_gettime" with "CLOCK_REALTIME" and fixes the issue.
  Change-Id: I57be35ce442d7e05319e82112b687eb4f28d7612
  Signed-off-by: Kotresh HR <khiremat@redhat.com>
  fixes: bz#1743652
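The difference between the two clocks can be sketched as follows. The function names `sketch_now_realtime` and `sketch_now_monotonic` are hypothetical stand-ins, not the helper added by the patch; only `clock_gettime`, `CLOCK_REALTIME` and `CLOCK_MONOTONIC` are real POSIX names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Wall-clock time: seconds since the unix epoch, suitable for ctime. */
int
sketch_now_realtime(struct timespec *ts)
{
    assert(ts != NULL);
    return clock_gettime(CLOCK_REALTIME, ts);
}

/* Monotonic time: counts from an unspecified starting point, so it is only
 * meaningful for measuring intervals, never as a timestamp. */
int
sketch_now_monotonic(struct timespec *ts)
{
    assert(ts != NULL);
    return clock_gettime(CLOCK_MONOTONIC, ts);
}
```

Only the realtime value is comparable with `time(NULL)`, which is why a monotonic reading stored as ctime is wrong.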
* nlm: check if nlm4 is initialized in nlm_privXie Changlong2019-08-221-3/+5
  Otherwise, gnfs will crash with the following steps:
  1) gluster v set <VOL> nfs.disable off
  2) gluster v set <VOL> nfs.nlm off
  3) kill -SIGUSR1 <GNFS_PID>
  4) gnfs crashes with SIGSEGV as follows:
     nlm_priv (this=this@entry=0x7f1ad00173b0) at nlm4.c:2742
     0x00007f1acf89d29d in nfs_priv (this=0x7f1ad00173b0) at nfs.c:1662
     0x00007f1ae2941085 in gf_proc_dump_single_xlator_info (trav=trav@entry=0x7f1ad00173b0) at statedump.c:502
     0x00007f1ae29410b8 in gf_proc_dump_per_xlator_info (top=top@entry=0x7f1ad00173b0) at statedump.c:519
  fixes: bz#1739360
  Change-Id: Ib9b207a4ccb3226dbc2c449b77de348cbc9a3d3c
  Signed-off-by: Xie Changlong <xiechanglong@cmss.chinamobile.com>
* storage/posix - Fixing a coverity issueBarak Sason2019-08-211-0/+1
  Fixed a resource leak of variable 'pfd'.
  CID: 1400673
  Updates: bz#789278
  Change-Id: I78e1e8a89e0604b56e35a75c25d436b35db096c3
  Signed-off-by: Barak Sason <bsasonro@redhat.com>
* features/utime - fixing a coverity issueBarak Sason2019-08-211-2/+2
  - Modified the op_errno init value to a non-negative value in order to avoid using a negative value where it's not allowed.
  - In the method "STACK_UNWIND_STRICT", modified the 3rd argument in order to represent the correct value to use (changed from -1 to ret).
  CID: 1403650
  Updates: bz#789278
  Change-Id: I608031d5af13832e05e180e746b1b5280b54f559
  Signed-off-by: Barak Sason <bsasonro@redhat.com>
* features/cloudsync - fix a coverity issueBarak Sason2019-08-211-4/+1
  All assignments to op_errno in this method were to the same value, ENOMEM. Removed the repeated assignments and assigned it as the init value. This also prevents the problem of sending a negative value of op_errno to the CS_STACK_UNWIND method.
  CID: 1394645 - https://scan6.coverity.com/reports.htm#v44018/p10714/fileInstanceId=92065749&defectInstanceId=28018364&mergedDefectId=1394645
  Updates: bz#789278
  Change-Id: If765a9216500a38f9392617aaf06583ce36e3262
  Signed-off-by: Barak Sason <bsasonro@redhat.com>
* storage/posix - fixing a coverity issueBarak Sason2019-08-212-4/+21
  CID: 1394644 & 1394639
  Updates: bz#789278
  Added logging in case method calls fail.
  Change-Id: Ib833a5f68d37b98287b84c325637bc688937f647
  Signed-off-by: Barak Sason <bsasonro@redhat.com>
* ctime: Fix ctime issue with utime family of syscallsKotresh HR2019-08-204-52/+68
  When atime|mtime is updated via the utime family of syscalls, ctime is not updated. This patch fixes that.
  Change-Id: I7f86d8f8a1e06a332c3449b5bbdbf128c9690f25
  fixes: bz#1738786
  Signed-off-by: Kotresh HR <khiremat@redhat.com>
* performance/md-cache: Do not skip caching of null character xattr valuesAnoop C S2019-08-202-20/+23
  A null character string is a valid xattr value in a file system. But for those xattrs processed by md-cache, it does not update its entries if the value is null ('\0'). This results in ENODATA when those xattrs are queried afterwards via getxattr(), causing failures in basic operations like create, copy etc. in a specially configured Samba setup for Mac OS clients.
  On the other side, snapview-server internally sets an empty string ("") as the value for xattrs received as part of listxattr(), and these are not intended to be cached. Therefore we try to maintain that behaviour using an additional dictionary key to prevent updating of entries in the getxattr() and fgetxattr() callbacks in md-cache.
  Credits: Poornima G <pgurusid@redhat.com>
  Change-Id: I7859cbad0a06ca6d788420c2a495e658699c6ff7
  Fixes: bz#1726205
  Signed-off-by: Anoop C S <anoopcs@redhat.com>
* features/locks: avoid use after freed of frame for blocked lockKinglong Mee2019-08-205-8/+14
  A fop that contains a blocked lock may use freed frame info when another unlock fop has unwound the blocked lock. Because the blocked lock is added to the block list while holding the inode lock (or other lock), the fop that contains the blocked lock should not use it once it is out of the inode lock.
  Change-Id: Icb309a1cc78380dc982b26d50c18d67e4f2c8915
  fixes: bz#1737291
  Signed-off-by: Kinglong Mee <mijinlong@horiscale.com>
* logging: Structured logging reference PRAravinda VK2019-08-2013-333/+326
  To convert the existing `gf_msg` to `gf_smsg`:

  - Define `_STR` of the respective Message ID as below (in `*-messages.h`)

        #define PC_MSG_REMOTE_OP_FAILED_STR "remote operation failed."

  - Change `gf_msg` to use `gf_smsg`. Convert values into fields and add any missing fields. Note: `errno` and `error` fields will be added automatically to the log message in case errnum is specified.

  Example:

        gf_smsg(
            this->name,                             // Name or log domain
            GF_LOG_WARNING,                         // Log Level
            rsp.op_errno,                           // Error number
            PC_MSG_REMOTE_OP_FAILED,                // Message ID
            "path=%s", local->loc.path,             // Key Value 1
            "gfid=%s", loc_gfid_utoa(&local->loc),  // Key Value 2
            NULL                                    // Log End
        );

  Key value pairs formatting help:

        gf_slog(
            this->name,                             // Name or log domain
            GF_LOG_WARNING,                         // Log Level
            rsp.op_errno,                           // Error number
            PC_MSG_REMOTE_OP_FAILED,                // Message ID
            "op=CREATE",                            // Static Key and Value
            "path=%s", local->loc.path,             // Format for Value
            "brick-%d-status=%s", brkidx, brkstatus, // Use format for key and val
            NULL                                    // Log End
        );

  Before:

        [2019-07-03 08:16:18.226819] W [MSGID: 114031] [client-rpc-fops_v2.c \
        :2633:client4_0_lookup_cbk] 0-gv3-client-0: remote operation failed. \
        Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint \
        is not connected]

  After:

        [2019-07-29 07:50:15.773765] W [MSGID: 114031] \
        [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-gv1-client-0: \
        remote operation failed. [{path=/f1}, \
        {gfid=00000000-0000-0000-0000-000000000000}, \
        {errno=107}, {error=Transport endpoint is not connected}]

  To add a new `gf_smsg`, add a Message ID in the respective `*-messages.h` file and then follow the steps mentioned above.

  Change-Id: I4e7d37f27f106ab398e991d931ba2ac7841a44b1
  Updates: #657
  Signed-off-by: Aravinda VK <avishwan@redhat.com>
* mount/fuse - Fixing a coverity issueBarak Sason2019-08-201-0/+1
  Fixed a resource leak of the dict_value and newkey variables.
  CID: 1398630
  Updates: bz#789278
  Change-Id: I589fdc0aecaeb4f446cd29f95bad36ccd7a35beb
  Signed-off-by: Barak Sason <bsasonro@redhat.com>
* posix: In brick_mux brick is crashed while start/stop volume in loopMohit Agrawal2019-08-205-4/+66
  Problem: In a brick_mux environment, a brick sometimes crashes while a volume is stopped/started in a loop. The brick crashes in the janitor task at the time of accessing priv. If posix priv is cleaned up before the janitor task is called, the janitor task crashes.
  Solution: To avoid the crash in a brick_mux environment, introduce a new flag janitor_task_stop in posix_private and, before sending the CHILD_DOWN event, wait for janitor_task_done to update the flag.
  Change-Id: Id9fa5d183a463b2b682774ab5cb9868357d139a4
  fixes: bz#1730409
  Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
* protocol/client - fixing a coverity issueBarak Sason2019-08-201-3/+3
  Moved the null pointer check up in order to avoid a seg-fault.
  CID: 1404258
  Updates: bz#789278
  Change-Id: Ib97e05302bfeb8fe38d6ce9870b9740cb576e492
  Signed-off-by: Barak Sason <bsasonro@redhat.com>
* storage/posix - Moved pointer validity check in order to avoid possible seg-faultBarak Sason2019-08-201-3/+3
  CID: 1124831
  Updates: bz#789278
  Change-Id: Ia6550be3742849809cf3e0a4a39d9d6e77003b35
  Signed-off-by: Barak Sason <bsasonro@redhat.com>
* libglusterfs: remove dependency of rpcAmar Tumballi2019-08-165-15/+32
  Goal: 'libglusterfs' files shouldn't have any dependency outside of the tree. The header files especially shouldn't have an '#include' from outside the tree.
  Fixes:
  * Had to introduce libglusterd so that methods and structures required only for mgmt/glusterd and cli/ are separated from 'libglusterfs/'
  * Remove rpc/xdr/gen from the build, which was used mainly so the dependency for libglusterfs could be properly satisfied.
  * Move rpcsvc_auth_data to client_t.h, so all dependencies could be handled.
  Updates: bz#1636297
  Change-Id: I0e80243a5a3f4615e6fac6e1b947ad08a9363fce
  Signed-off-by: Amar Tumballi <amarts@redhat.com>
* mount.glusterfs: make fcache-keep-open option take a valuePhilip Spencer2019-08-161-8/+14
  Fixes: bz#1158130
  Change-Id: Ifdeaed7c9fbe85f7ce421f7c89cbe7265e45f77c
  Signed-off-by: Amar Tumballi <amarts@redhat.com>
* afr: restore timestamp of parent dir during entry-healRavishankar N2019-08-141-0/+2
  Fixes: bz#1734370
  Change-Id: I29e338bac62104233a6f80212df8d0fb016affda
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* client-handshake.c: minor changes and removal of dead code.Yaniv Kaul2019-08-142-287/+54
  - Removal of quite a bit of dead code.
  - Use dict_set_str_sizen and friends where applicable.
  - Moved some functions to be static and initialize values right away.
  Change-Id: Ic25b5da4028198694a0e24796dea375661eb66b9
  updates: bz#1193929
  Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* posix: don't expect timer wheel to be initedRaghavendra Talur2019-08-141-1/+1
  Adding a timer to the timer wheel should be done only after getting the timer wheel from the ctx using the function glusterfs_ctx_tw_get(). The function inits the wheel if not already done.
  Change-Id: I9692f84b822a02a9dc14725b7c11d26a2a634e94
  Updates: #703
  Signed-off-by: Raghavendra Talur <rtalur@redhat.com>
* glusterd: create separate logdirs for cluster.rc instancesN Balachandran2019-08-1415-130/+180
  Create a separate logdir for each host instance created by cluster.rc. This makes it easier to determine the files belonging to a particular instance.
  Change-Id: Ic8321f83f98995412b7d5f095b3d3f0391767a8b
  Fixes: bz#1733042
  Signed-off-by: N Balachandran <nbalacha@redhat.com>
* fuse: Set limit on invalidate queue sizeN Balachandran2019-08-143-16/+55
  If the glusterfs fuse client process is unable to process the invalidate requests quickly enough, the number of such requests quickly grows large enough to use a significant amount of memory. We are now introducing another option to set an upper limit on these to prevent runaway memory usage.
  Change-Id: Iddfff1ee2de1466223e6717f7abd4b28ed947788
  Fixes: bz#1732717
  Signed-off-by: N Balachandran <nbalacha@redhat.com>
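A generic sketch of capping a pending-request queue, under stated assumptions: the struct and function names are hypothetical, a limit of 0 is taken to mean "unlimited", and new requests are simply refused once the cap is reached. How the real fuse client reacts at the limit is not described in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for pending invalidate requests. */
struct inval_queue {
    size_t count; /* requests currently queued */
    size_t limit; /* upper bound; 0 is assumed to mean "no limit" */
};

/* Try to account for one more queued request. Returns false when the
 * queue is already at its limit, so memory use stays bounded. */
bool
inval_queue_try_add(struct inval_queue *q)
{
    assert(q != NULL);
    if (q->limit != 0 && q->count >= q->limit)
        return false;
    q->count++;
    return true;
}

/* Called when a queued request has been processed. */
void
inval_queue_done(struct inval_queue *q)
{
    assert(q != NULL);
    if (q->count > 0)
        q->count--;
}
```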
* cluster/ec: Fix coverity issue.Ashish Pandey2019-08-131-1/+1
  Change-Id: I727287784a15d89441865de7f438002e4a370250
  fixes: bz#1738763
  Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* features/shard: Send correct size when reads are sent beyond file sizeKrutika Dhananjay2019-08-121-0/+2
  Change-Id: I0cebaaf55c09eb1fb77a274268ff564e871b743b
  fixes bz#1738419
  Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
* fuse: rate limit reading from fuse device upon receiving EPERMCsaba Henk2019-08-083-0/+31
  Fixes: bz#1644322
  Change-Id: I53e8fa362cd8c7d04fb1c4abb606a9abb642c592
  Signed-off-by: Csaba Henk <csaba@redhat.com>
* cluster/ec: Update lock->good_mask on parent fop failurePranith Kumar K2019-08-072-0/+4
  When discard/truncate performs a write fop, it should do so after updating lock->good_mask to make sure readv happens on the correct mask.
  fixes bz#1727081
  Change-Id: Idfef0bbcca8860d53707094722e6ba3f81c583b7
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/dht: Log hashes in hexN Balachandran2019-08-064-15/+13
  Log layout hash ranges in hex to make it easier to compare them to the on-disk xattrs.
  Change-Id: Ib75c2508bf8e0ab7f5ae26d0443ef02b792b7307
  Fixes: bz#1697293
  Signed-off-by: N Balachandran <nbalacha@redhat.com>
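As an illustration of hex formatting for a hash range, assuming 32-bit hash values: the helper name `format_layout_range` and the exact output layout are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: render a [start, stop] hash range as zero-padded
 * hex so it can be compared by eye with a hex dump of an xattr. */
int
format_layout_range(char *buf, size_t len, uint32_t start, uint32_t stop)
{
    assert(buf != NULL);
    return snprintf(buf, len, "0x%08" PRIx32 " - 0x%08" PRIx32, start, stop);
}
```

For example, the range 0 to 2147483647 renders as `0x00000000 - 0x7fffffff`, whereas the decimal form gives no visual hint that it is half of the 32-bit space.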
* features/utime: always update ctime at setattrKinglong Mee2019-08-062-13/+2
  An nfs EXCLUSIVE mode create may set a later time as mtime (at verifier). It should not be set as ctime, because storage.ctime does not allow setting ctime to an earlier time.

    /* Earlier, mdata was updated only if the existing time is less
     * than the time to be updated. This would fail the scenarios
     * where mtime can be set to any time using the syscall. Hence
     * just updating without comparison. But the ctime is not
     * allowed to changed to older date.
     */

  Following the kernel's setattr, always set ctime at setattr, and do not set ctime from mtime at storage.ctime.
  Change-Id: I5cfde6cb7f8939da9617506e3dc80bd840e0d749
  fixes: bz#1737288
  Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
* glusterd/shd: Return null proc if process is not running.Mohammed Rafi KC2019-08-054-18/+65
  We were returning the first proc entry even if it was not running. This was on the assumption that the process could have just started and not yet updated the pidfile. Now that we have introduced states for the process, we can take the decision based on that.
  Change-Id: Ibfc11c966b0db599a8d6a08d8b975233b2bbfb8c
  Fixes: bz#1728766
  Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
* multiple files: reduce minor work under RCU_READ_LOCKYaniv Kaul2019-08-0512-240/+261
  1. Try to unlock faster - in error paths.
  2. Remove memory allocations - do them before the lock.
  Change-Id: I1e9ddd80b99de45ad0f557d62a5f28951dfd54c8
  updates: bz#1193929
  Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* storage/posix: set the op_errno to proper errno during gfid setRaghavendra Bhat2019-08-041-0/+1
  In posix_gfid_set, the proper error is not captured in one of the failure cases.
  Change-Id: I1c13f0691a15d6893f1037b3a5fe385a99657e00
  Fixes: bz#1736482
  Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
* locks/fencing: Address hang while lock preemptionSusant Palai2019-08-023-20/+29
  The fop_wind_count can go negative when fencing is enabled on the unwind path of the IO, leading to a hang.
  Also changed the code so that fop_wind_count needs to be maintained only till fencing is enabled on the file.
  updates: bz#1717824
  Change-Id: Icd04b42bc16cd3d50eaa581ee57233910194f480
  Signed-off-by: Susant Palai <spalai@redhat.com>
* Multiple files: get trivial stuff done before lockYaniv Kaul2019-08-016-22/+26
  Initializing a dictionary, for example, seems to be perfectly fine to do before taking a lock.
  Change-Id: Ib29516c4efa8f0e2b526d512beab488fcd16d2e7
  updates: bz#1193929
  Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* posix/ctime: Fix race during lookup ctime xattr healKotresh HR2019-08-011-18/+58
  Problem: Ctime heals the ctime xattr ("trusted.glusterfs.mdata") in lookup if it's not present. In a multi-client scenario, there is a race which results in updating the ctime xattr to an older value.
  e.g. Let c1 and c2 be two clients and file1 be the file which doesn't have the ctime xattr. Let the ctime of file1 be t1 (from the backend; ctime heals time attributes from the backend when not present).
  Now the following operations are done on the mount:
    c1 -> ls -l /mnt/file1  |  c2 -> ls -l /mnt/file1; echo "append" >> /mnt/file1;
  The race is that both c1 and c2 didn't fetch the ctime xattr in lookup, so both of them try to heal ctime to time 't1'. If c2 wins the race and appends to the file before c1 heals it, it sets the time to 't1' and updates it to 't2' (because of the append). Now c1 proceeds to heal and sets it to 't1', which is incorrect.
  Solution: Compare the times during heal and only update to the larger time. This is the general approach used in the ctime feature but was missed when healing legacy files.
  fixes: bz#1734299
  Change-Id: I930bda192c64c3d49d0aed431ce23d3bc57e51b7
  Signed-off-by: Kotresh HR <khiremat@redhat.com>
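The "only update to the larger time" rule can be sketched like this. The names `ts_cmp` and `heal_keep_larger` are hypothetical, and the sketch works on a plain `struct timespec` rather than on the mdata xattr.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Three-way comparison of two timestamps: <0, 0 or >0. */
int
ts_cmp(const struct timespec *a, const struct timespec *b)
{
    if (a->tv_sec != b->tv_sec)
        return (a->tv_sec < b->tv_sec) ? -1 : 1;
    if (a->tv_nsec != b->tv_nsec)
        return (a->tv_nsec < b->tv_nsec) ? -1 : 1;
    return 0;
}

/* Heal step: overwrite the stored time only if the candidate is newer.
 * Returns true when the stored value was updated. */
bool
heal_keep_larger(struct timespec *stored, const struct timespec *candidate)
{
    assert(stored != NULL && candidate != NULL);
    if (ts_cmp(candidate, stored) > 0) {
        *stored = *candidate;
        return true;
    }
    return false;
}
```

In the c1/c2 scenario above, c1's late heal carries t1 while the stored value is already t2, so the comparison rejects it and t2 survives.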
* cluster/ec: Create heal task with heal process idAshish Pandey2019-07-301-1/+19
  Problem: ec_data_undo_pending calls syncop_fxattrop->SYNCOP without a frame. In this case SYNCOP gets the frame of the task. However, when we create a synctask for heal we provide the frame as NULL. Now, if the read-only feature is ON, it will receive the process ID of the shd as 0 and will consider it as not an internal process. This will prevent healing of a file, with a "Read-only file system" error message in the log.
  Solution: While launching heal, create a synctask using a frame and set the process id of the SHD, which is -6.
  Change-Id: I37195399c85de322cbcac75633888922c4e3db4a
  Fixes: bz#1734252
* cluster/ec: Fix reopen flags to avoid misbehaviorPranith Kumar K2019-07-302-3/+8
  Problem: When a file needs to be re-opened, O_APPEND and O_EXCL flags are not filtered in EC.
  - O_APPEND should be filtered because EC doesn't send O_APPEND below EC for open, to make sure writes happen on the individual fragments instead of at the end of the file.
  - O_EXCL should be filtered because shd could have created the file, so even when the file exists open should succeed.
  - O_CREAT should be filtered because open happens with gfid as parameter. So the open fop will create just the gfid, which will lead to problems.
  Fix: Filter out these flags in reopen.
  Change-Id: Ia280470fcb5188a09caa07bf665a2a94bce23bc4
  Fixes: bz#1733935
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
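Masking flags before a reopen can be sketched as below. The function name `sketch_reopen_flags` is hypothetical, and masking all three flags listed above is an assumption based on the bullet list, not a copy of the patch.

```c
#include <assert.h>
#include <fcntl.h>

/* Hypothetical helper: strip the flags that must not be replayed when a
 * file that is already open gets re-opened. */
int
sketch_reopen_flags(int flags)
{
    int filtered = flags & ~(O_APPEND | O_EXCL | O_CREAT);

    /* Postcondition: none of the filtered bits survive. */
    assert((filtered & (O_APPEND | O_EXCL | O_CREAT)) == 0);
    return filtered;
}
```

The access mode and any other flags pass through unchanged, so a descriptor opened with `O_RDWR | O_APPEND` is reopened with plain `O_RDWR`.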
* event: rename event_XXX with gf_ prefixedXiubo Li2019-07-296-10/+10
  I hit a crash when using libgfapi. libgfapi calls glfs_poller() --> event_dispatch() in file api/src/glfs.c:721, and event_dispatch() is defined locally by libgluster. The problem is that the name event_dispatch() is exactly the same as the one from the libevent package from the OS.
  For example, if an executable program Foo uses and links libevent and libgfapi at the same time, I can hit the crash, like:

    kernel: glfs_glfspoll[68486]: segfault at 1c0 ip 00007fef006fd2b8 sp 00007feeeaffce30 error 4 in libevent-2.0.so.5.1.9[7fef006ed000+46000]

  The link for Foo is:
    lib_foo_LADD = -levent $(GFAPI_LIBS)
  It will crash. This is because glfs_poller() is calling event_dispatch() from libevent, not from libgluster.
  The gfapi link info:
    GFAPI_LIBS = -lacl -lgfapi -lglusterfs -lgfrpc -lgfxdr -luuid
  If I link Foo like:
    lib_foo_LADD = $(GFAPI_LIBS) -levent
  it works well without any problem.
  And if Foo calls one private lib, such as handler_glfs.so, and handler_glfs.so links GFAPI_LIBS directly, while Foo doesn't and instead does dlopen(handler_glfs.so), then the crash is hit every time. The link info will be:
    foo_LADD = -levent
    libhandler_glfs_LIBADD = $(GFAPI_LIBS)
  I can avoid the crash temporarily by linking GFAPI_LIBS in Foo too, like:
    foo_LADD = $(GFAPI_LIBS) -levent
    libhandler_glfs_LIBADD = $(GFAPI_LIBS)
  But this is ugly since Foo won't use any APIs from GFAPI_LIBS. And in some cases, when the --as-needed link option is added (on many dists it is added by default), the crash is back again and the above workaround won't work.
  Fixes: #699
  Change-Id: I38f0200b941bd1cff4bf3066fca2fc1f9a5263aa
  Signed-off-by: Xiubo Li <xiubli@redhat.com>
* glusterd: write voldir once in glusterd-store and don't attempt again.Yaniv Kaul2019-07-291-29/+16
  glusterd_store_brickinfos() calls the function glusterd_store_brickinfo() for each brick. In it, we call:
    ret = glusterd_store_create_brick_dir(volinfo);
  However, volinfo is the same for all those bricks, so there is no need to call it again and again (each call tries to mkdir that dir). We can do it once above the loops in glusterd_store_brickinfos().
  While at it, combine two similar functions that write additional dirs.
  Change-Id: I5858cf7783f088ea13a8fa20115118efa816f4cb
  updates: bz#1193929
  Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* cluster/ec: Always read from good-maskPranith Kumar K2019-07-262-5/+25
  There are cases where fop->mask may have fop->healing added, and readv shouldn't be wound on fop->healing. To avoid this, always wind readv to lock->good_mask.
  fixes bz#1727081
  Change-Id: I2226ef0229daf5ff315d51e868b980ee48060b87
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* fuse: add missing GF_FREE to fuse_interruptCsaba Henk2019-07-251-1/+4
  Change-Id: Id7e003e4a53d0a0057c1c84e1cd704c80a6cb015
  Fixes: bz#1728047
  Signed-off-by: Csaba Henk <csaba@redhat.com>
* quiesce: add missing fopsAmar Tumballi2019-07-251-0/+30
  Updates: bz#1693692
  Change-Id: I4f005e7168c201709a85db443d643b81e6d3d282
  Signed-off-by: Amar Tumballi <amarts@redhat.com>
* [core] fix return of local in __nlc_inode_ctx_getRinku Kothiya2019-07-251-22/+14
  __nlc_inode_ctx_get assigns a value to nlc_pe_p which is never used by its parent function or any of its predecessors. Hence remove the assignment and also that function argument, as it is not being used anywhere.
  fixes: bz#1732496
  Change-Id: I5b950e1e251bd50a646616da872a4efe9d2ff8c9
  Signed-off-by: Rinku Kothiya <rkothiya@redhat.com>
* cluster/ec: fix EIO error for concurrent writes on sparse filesXavi Hernandez2019-07-241-9/+17
  EC doesn't allow concurrent writes on overlapping areas; they are serialized. However, non-overlapping writes are serviced in parallel. When a write is not aligned, EC first needs to read the entire chunk from disk, apply the modified fragment and write it again.
  The problem appears on sparse files because a write to an offset implicitly creates data on offsets below it (so, in some way, they are overlapping). For example, if a file is empty and we read 10 bytes from offset 10, read() will return 0 bytes. Now, if we write one byte at offset 1M and retry the same read, the system call will return 10 bytes (all containing 0's).
  So if we have two writes, the first one at offset 10 and the second one at offset 1M, EC will send both in parallel because they do not overlap. However, the first one will try to read missing data from the first chunk (i.e. offsets 0 to 9) to recombine the entire chunk and do the final write. This read will happen in parallel with the write to 1M. What could happen is that half of the bricks process the write before the read, and the other half do the read before the write. Some bricks will return 10 bytes of data while the others will return 0 bytes (because the file on the brick has not been expanded yet).
  When EC tries to recombine the answers from the bricks, it can't, because it needs more than half consistent answers to recover the data. So this read fails with an EIO error. This error is propagated to the parent write, which is aborted, and EIO is returned to the application.
  The issue happened because EC assumed that a write to a given offset implies that offsets below it exist.
  This fix prevents the read of the chunk from bricks if the current size of the file is smaller than the read chunk offset. This size is correctly tracked, so this fixes the issue.
  Also modifying the ec-stripe.t file for Test #13 within it. In this patch, if the file size is less than the offset we are writing, we fill zeros in head and tail and do not consider it a stripe cache miss. That actually makes sense, as we know what data that part holds and there is no need to read it from the bricks.
  Change-Id: Ic342e8c35c555b8534109e9314c9a0710b6225d6
  Fixes: bz#1730715
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
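The sparse-file behaviour described in the commit above (a read past end of file returns 0 bytes, and the same read returns zeros once a later offset has been written) can be reproduced on a local file with plain POSIX calls. This is a sketch of that local behaviour only, not of EC itself. The function name `sparse_read_demo` and the temporary path under `/tmp` are assumptions.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Read 10 bytes at offset 10 from an empty file, write one byte at
 * offset 1M, then repeat the read. Stores both read results and whether
 * the second read returned only zero bytes. Returns 0 on success. */
int
sparse_read_demo(ssize_t *before, ssize_t *after, int *all_zero)
{
    char path[] = "/tmp/sparse-demo-XXXXXX";
    char buf[10];
    ssize_t i;
    int fd;

    assert(before != NULL && after != NULL && all_zero != NULL);

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the file disappears once fd is closed */

    /* Empty file: offset 10 is past EOF, so nothing is read. */
    *before = pread(fd, buf, sizeof(buf), 10);

    /* Writing at 1M implicitly creates the range below it as a hole. */
    if (pwrite(fd, "x", 1, 1024 * 1024) != 1) {
        close(fd);
        return -1;
    }

    /* The same read now falls inside the hole and yields zeros. */
    *after = pread(fd, buf, sizeof(buf), 10);
    *all_zero = 1;
    for (i = 0; i < *after; i++) {
        if (buf[i] != 0)
            *all_zero = 0;
    }

    close(fd);
    return 0;
}
```

Bricks that see the read before the write answer like the first `pread`, and bricks that see it after answer like the second, which is the inconsistency EC could not recombine.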
* core: use more restrictive mode while creating the directoriesSanju Rakonde2019-07-2310-41/+41
  fixes: bz#1724024
  Change-Id: I539fb7248b2cfc037ec29f1413ea648f9ec21ef2
  Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
* features/utime: Fix mem_put crashPranith Kumar K2019-07-221-1/+3
  Problem: When frame->local is not null, FRAME_DESTROY calls mem_put on it. Since the stub is already destroyed in call_resume(), it leads to a crash.
  Fix: Set frame->local to NULL before calling call_resume().
  fixes: bz#1593542
  Change-Id: I0f8adf406f4cefdb89d7624ba7a9d9c2eedfb1de
  Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* (multiple files) use dict_allocate_and_serialize() where applicable.Yaniv Kaul2019-07-226-110/+28
  This function does length, allocation and serialization for you.
  Change-Id: I142a259952a2fe83dd719442afaefe4a43a8e55e
  updates: bz#1193929
  Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* ctime: Set mdata xattr on legacy filesKotresh HR2019-07-226-56/+228
  Problem: Files which were created before ctime was enabled would not have the "trusted.glusterfs.mdata" xattr (which stores time attributes). Upon fops which modify either ctime or mtime, the xattr gets created with the latest ctime, mtime and atime, which is incorrect. It should update only the corresponding time attribute and take the rest from the backend.
  Solution: Creating the xattr with values from the brick is not possible, as each brick of a replica set would have different times. So create the xattr upon successful lookup if the xattr is not already created.
  Note To Reviewers: The time attributes used to set the xattr are obtained from a successful lookup. Instead of sending the whole iatt over the wire via setxattr, a structure called mdata_iatt is sent. The mdata_iatt contains only time attributes.
  Change-Id: I5e535631ddef04195361ae0364336410a2895dd4
  fixes: bz#1593542
  Signed-off-by: Kotresh HR <khiremat@redhat.com>
* dht: log getxattr failure for node-uuid at "DEBUG"Susant Palai2019-07-181-2/+5
  There are two ways to fetch node-uuid information from dht.
  1 - #define GF_XATTR_LIST_NODE_UUIDS_KEY "trusted.glusterfs.list-node-uuids"
      This key is used by AFR.
  2 - #define GF_REBAL_FIND_LOCAL_SUBVOL "glusterfs.find-local-subvol"
      This key is used for non-afr volume types.
  We do two getxattr operations: first on the #1 key, followed by the #2 key if getxattr on the #1 key fails.
  Since the parent function "dht_init_local_subvols_and_nodeuuids" logs the failure, moving the log-level to DEBUG in dht_find_local_subvol_cbk.
  fixes: bz#1730175
  Change-Id: I4d88244dc26587b111ca5b00d4c00118efdaac14
  Signed-off-by: Susant Palai <spalai@redhat.com>
* cluster/ec: skip updating ctx->loc again when ec_fix_open/opendirKinglong Mee2019-07-172-10/+14
  ec_manager_open/opendir memsets ctx->loc, which causes a memory/inode leak, and ec_fheal uses ctx->loc outside of fd->lock, so loc_copy may copy bad data while it is being memset.
  This patch skips updating ctx->loc when it is already initialized. With it, ctx->loc is filled once and never updated.
  Change-Id: I3bf5ffce4caf4c1c667f7acaa14b451d37a3550a
  fixes: bz#1729772
  Signed-off-by: Kinglong Mee <mijinlong@horiscale.com>
* cluster/ec: inherit healing from lock when it has infoKinglong Mee2019-07-161-2/+3
  If the lock has info, the fop should inherit the healing mask from it. Otherwise, the fop cannot inherit the right healing mask when changed_flags is zero.
  Change-Id: Ife80c9169d2c555024347a20300b0583f7e8a87f
  fixes: bz#1727081
  Signed-off-by: Kinglong Mee <mijinlong@horiscale.com>