summaryrefslogtreecommitdiffstats
path: root/xlators/cluster/dht/src/dht-common.c
Commit message (Collapse)AuthorAgeFilesLines
* porting: Port for FreeBSD rebased from Mike Ma's effortsHarshavardhana2014-07-021-0/+1
| | | | | | | | | | | | | | | | | | | - Provides a working Gluster Management Daemon, CLI - Provides a working GlusterFS server, GlusterNFS server - Provides a working GlusterFS client - execinfo port from FreeBSD is moved into ./contrib/libexecinfo for ease of portability on NetBSD. (FreeBSD 10 and OSX provide execinfo natively) - More portability cleanups for Darwin, FreeBSD and NetBSD - Provides a new rc script for FreeBSD Change-Id: I8dff336f97479ca5a7f9b8c6b730051c0f8ac46f BUG: 1111774 Original-Author: Mike Ma <mikemandarine@gmail.com> Signed-off-by: Harshavardhana <harsha@harshavardhana.net> Reviewed-on: http://review.gluster.org/8141 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>
* dht: pass xdata to xlators above.Krishnan Parthasarathi2014-06-291-1/+1
| | | | | | | | | | Change-Id: I96e9feb88443fcd7da40c33c0e8c4e2645b1fcf3 BUG: 1096047 Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-on: http://review.gluster.org/7872 Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: handle ESTALE appropriately in rmdir codepath.Raghavendra G2014-06-231-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | Till we separated the scenario of a file/directory not existing from parent not existing [1], we used to include a subvolume in the layout of a directory even if it is not present on that subvolume. This was done to allow a lookup racing with mkdir to create correct layout. However, there are other scenarios as well where a directory is not present. One such situation is trying to create a directory after an add-brick. Since there is no guarantee that all the ancestors are created after an add-brick (and hence directory cannot be created), the newly added brick should not be part of the layout. However, we used to consider newly added brick as part of layout (even before we do fix-layout of all the ancestors) and this was the root cause of [2]. With [1], this issue got fixed and hence [2] got fixed too. However, [1] is not complete in the sense we didn't modify rmdir codepath appropriately. This patch fixes that gap. [1] http://review.gluster.org/6322 [2] https://bugzilla.redhat.com/show_bug.cgi?id=1006809 Change-Id: I79ab96bb8abb6f3d90bb6e235a1c465e1be0fd19 BUG: 1032894 Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Reviewed-on: http://review.gluster.org/8142 Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Vijay Bellur <vbellur@redhat.com>
* Cluster/DHT : Logging changesNithya Balachandran2014-06-181-27/+21
| | | | | | | | | | | Removed trailing spaces from the code Change-Id: I427c9a01b514824f903e301863c2c29071db6483 BUG: 1075611 Signed-off-by: Nithya Balachandran <nbalacha@redhat.com> Reviewed-on: http://review.gluster.org/8096 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* features/quota: Make dht_statfs_cbk more fool proof from quota_deem_statfsVarun Shastry2014-06-181-16/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: The function depends on the fact that if quota-deem-statfs option is enabled, all of the subvolumes send their xdata with quota-deem-statfs flag ON. But, this may not be true in case of errors in some of the subvolumes. There is a decision/policy made which assumes quota-deem-statfs to be ON if at least ONE of the subvolumes sends the flag ON. By this, df reports quota modified statfs values if *at least ONE* of the bricks sends the quota-deem-statfs flag ON. This can be visualized with the below "Transition Diagram/State Machine". Event: Each Quota deem statfs status from the individual bricks Action: Decision taken on the calculation of the statvfs received State: Whether quota deem statfs is ON or OFF (0: OFF, 1: ON) Input: Event from individual bricks ___ ___ / \ OFF* / \ (OFF|ON)* | | | | \ / ON \ / -----> 0 ----------------> 1 The below Transition Function depicts the relation between the statfs calculation based on the events received. State Event action ------------------------------------- OFF OFF OFF OFF ON REPLACE ON OFF NEGLECT ON ON COMPARE Change-Id: I0e8fb7d3945a3ca3dde0bb99de6cd397e27a3162 BUG: 1048786 Signed-off-by: Varun Shastry <vshastry@redhat.com> Reviewed-on: http://review.gluster.org/6652 Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Tested-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/dht: Do layout self healing of directory for nameless lookupVenkatesh Somyajulu2014-06-171-0/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Currently in the nameless lookup code path, if at the end of the lookup, even if it detects that layout anamolies are there, layout healing will not be done as there is no code to heal it. So there can be race between mkdir and lookup. Assume mkdir is going on from some other mount point, Say, M1. Directories are created on some nodes but layout is not set yet. Now from M2, nameless lookup goes, lookup will be success full as the directory is present on some of the nodes, but it won't heal layout. Now if create goes after lookup fop, because layout is absent, file creation will fail. Fix: Included the code of layout self-heal in the nameless lookup path. At the end of lookup, layout will be computed as it would have been in the named lookup, but it will be set to those node only, where directory is present. So after that if create fop goes, the probabiliy to get the subvolume with proper hash-range is high now, so reduces the race window. Other: Whenever a directory is created, we have to choose a brick from which we start allocating layout in a circular fashion. To calculate this starting brick, I have changed the candidate from name of the directory to gfid of the directory But to compute where a given file belongs, we will still use the name of the file. Hash computed from the name of the file should belong to any one of the directory-hash-range Calculation of hash for a file is acting as a consumer and the setting of directory layout based on gfid is acting as a producer, which are independent from each other. Change-Id: I3808c55082cd1b5c72d2c77cbbc063f55aa38bee BUG: 1095888 Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com> Reviewed-on: http://review.gluster.org/7493 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* Cluster/DHT: New logging frameworkNithya Balachandran2014-06-161-285/+444
| | | | | | | | | | | | Moved all relevant DHT gf_log calls to the new logging framework. Change-Id: I3af3cfe0416e332774a6c4ff6a091d006c400af2 BUG: 1075611 Signed-off-by: Nithya Balachandran <nbalacha@redhat.com> Reviewed-on: http://review.gluster.org/7929 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* DHT/readdirp: Directory not shown/healed on mount point if existsSusant Palai2014-06-161-5/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | on single brick(non first up subvolume). Problem: If snapshot is taken, when mkdir has succeeded only on hashed_subvolume, then after restoring snapshot the directory is not shown on mount point. Why: dht_readdirp takes only those directory entries in to account, which are present on first_up_subvolume. Hence, if the "hashed subvolume" is not same as first_up_subvolume, it wont be listed on mount point and also not healed. Solution: Case 1: (Rebalance not running)If hashed subvolume is NULL or down then filter in first_up_subvolume. Other wise the corresponding hashed subvolume will take care of the directory entry. Case 2: If readdirp_optimize option is turned on then read from first_up_subvol Change-Id: Idaad28f1c9f688dbfb1a8a3ab8b244510c02365e BUG: 1092433 Signed-off-by: Susant Palai <spalai@redhat.com> Reviewed-on: http://review.gluster.org/7599 Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* DHT/Mkdir: Fail mkdir with ENOENT, if parent is not availableSusant Palai2014-05-281-3/+3
| | | | | | | | | | Change-Id: I54227bcafb6d0d8cf716a679d2a34be7fc916898 BUG: 1078847 Signed-off-by: Susant Palai <spalai@redhat.com> Reviewed-on: http://review.gluster.org/7306 Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/afr: refactorAnand Avati2014-03-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | - Remove client side self-healing completely (opendir, openfd, lookup) - Re-work readdir-failover to work reliably in case of NFS - Remove unused/dead lock recovery code - Consistently use xdata in both calls and callbacks in all FOPs - Per-inode event generation, used to force inode ctx refresh - Implement dirty flag support (in place of pending counts) - Eliminate inode ctx structure, use read subvol bits + event_generation - Implement inode ctx refreshing based on event generation - Provide backward compatibility in transactions - remove unused variables and functions - make code more consistent in style and pattern - regularize and clean up inode-write transaction code - regularize and clean up dir-write transaction code - regularize and clean up common FOPs - reorganize transaction framework code - skip setting xattrs in pending dict if nothing is pending - re-write self-healing code using syncops - re-write simpler self-heal-daemon Change-Id: I1e4080c9796c8a2815c2dab4be3073f389d614a8 BUG: 1021686 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/6010 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: If hashed_subvol is NULL, do not failSusant Palai2014-02-081-3/+3
| | | | | | | | | | | | | | | | | | | | | Problem: With the current implementation we are allowing unlink of a file if hashed subvol is down and cached subvol is up. For the above op to work we should have the info of hashed_subvol. But incase we do remount of the volume we will have a zeroed layout for the disconnected subvol(start=0, stop=0, err=ENOTCONN) which will result into hashed_subvol being NULL and failing unlink op. Solution: Dont fail if hashed_subvol is NULL. Check cached subvol and unlink in cached subvol. The linkto file in the hashed subvol can be remove later. Change-Id: Ic1982c15c8942a1adcb47ed0017d2d5ace5c9241 BUG: 983416 Signed-off-by: Susant Palai <spalai@redhat.com> Reviewed-on: http://review.gluster.org/6851 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: Abandoned memory if a call failsChristopher R. Hertel2014-02-031-0/+3
| | | | | | | | | | | | | | | | | | | | | If the call to dict_set_dynstr() fails, the memory indicated by xattr_buf will not have been stored in the dictionary, so it must be freed. Patch set 2: Added a missed call to GF_FREE(). Fixed a formatting consistency issue. Patch set 3: Cleaned a minor style nit. BUG: 789278 CID: 1124786 Change-Id: Id1f85bd2cbfac0b8727a3f6901f0a50ba921817d Signed-off-by: Christopher R. Hertel <crh@redhat.com> Reviewed-on: http://review.gluster.org/6826 Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* dht: do not remove linkfile if file exist in cached sub volumeVijaykumar M2014-02-021-19/+109
| | | | | | | | | | | | | | | | Currently with rmdir, if a directory contains only the linkfiles we remove all the linkfiles and this is causing the problem when the cached sub volume is down and end-up with duplicate files showing on the mount point. Solution: Before removing a linkfile check if the files exists in cached subvolume. Change-Id: Iedffd0d9298ec8bb95d5ce27c341c9ade81f0d3c BUG: 1042725 Signed-off-by: Vijaykumar M <vmallika@redhat.com> Reviewed-on: http://review.gluster.org/6500 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* quota: filter glusterfs quota xattrsSusant Palai2014-01-221-0/+6
| | | | | | | | | Change-Id: I86ebe02735ee88598640240aa888e02b48ecc06c BUG: 1040423 Signed-off-by: Susant Palai <spalai@redhat.com> Reviewed-on: http://review.gluster.org/6490 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
* pathinfo: Provide user namespace access.Vijaykumar M2013-12-161-2/+2
| | | | | | | | | | | | | | | | | Locality can be now queried by unprivileged users with key "glusterfs.pathinfo". Setting both "glusterfs.pathinfo" and "trusted.glusterfs.pathinfo" on disk is prevented with this patch. Original Author: Vijay Bellur <vbellur@redhat.com> Change-Id: I4f7a0db8ad59165c4aeda04b23173255157a8b79 Signed-off-by: Vijaykumar M <vmallika@redhat.com> Reviewed-on: http://review.gluster.org/5101 Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* glusterd/geo-rep: more glusterd and cli fixes for geo-rep.Ajeet Jha2013-12-121-7/+1
| | | | | | | | | | | | | | | | | | | | | | | | -> handle option validation cases in reset case. -> Creating valid conf path when glusterd restarts. -> Reading the gsyncd worker thread status and displaying it. -> Displaying status-detail per worker. -> Fetch checkpoint info in geo-rep status. -> use-tarssh value validation added. misc: misc geo-rep fixes based on cluster, logrotate etc.. -> cluster/dht: fix 'stime' getxattr getting overwritten. -> cluster/afr: return max of 'stime' values in subvol. -> geo-rep-logrotate: Sending SIGHUP to geo-rep auxiliary. -> cluster/dht: fix convoluted logic while aggregating. -> cluster/*: fix 'stime' min/max fetch logic. Change-Id: I811acea0bbd6194797a3e55d89295d1ea021ac85 BUG: 1036552 Signed-off-by: Ajeet Jha <ajha@redhat.com> Reviewed-on: http://review.gluster.org/6405 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@gmail.com> Reviewed-by: Anand Avati <avati@redhat.com>
* dht: Set status to FAILED when rebalance stops due to brick going downKaushal M2013-12-101-2/+4
| | | | | | | | | | Change-Id: I98da41342127b1690d887a5bc025e4c9dd504894 BUG: 1038452 Signed-off-by: Kaushal M <kaushal@redhat.com> Reviewed-on: http://review.gluster.org/6435 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Shishir Gowda <gowda.shishir@gmail.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: handle NULL check before strlen/strcmp in fgetxattrAnand Avati2013-12-021-0/+1
| | | | | | | | | | | | @key can legally be NULL. Handle that case without crashing. Change-Id: Iaae293caa7eeb24afc9cd2580799173e2ce00911 BUG: 1036879 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/6395 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: set layout in inode ctx even if linkfile failsAnand Avati2013-11-261-1/+1
| | | | | | | | | | | | | | Creating linkfile could have failed, but we dont care about linkfile for setting layout in the inode ctx (could be EEXIST etc.) So ignore @inode in cbk and pick it up from local->loc.inode Change-Id: I2952799d7ae0d3441b84b2ca2981afd75d7576e2 BUG: 1032859 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/6319 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* features/quota: Improvements to quotaRaghavendra G2013-11-261-4/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Two stages of quota enforcement is done: Soft and hard quota Upon reaching soft quota limit on the directory it logs/alerts in the quota daemon log (ie DEFAULT_LOG_DIR/quotad.log) and no more writes allowed after hard quota limit. After reaching the soft-limit the daemon alerts the user/admin repeatively for every 'alert-time', which is configurable. * Quota enforcer is moved to server-side. It takes care of enforcing quota. Since enforcer doesn't have the cluster view, it relies on another service called quota-aggregator. Aggregator, on query can return the size of a directory based on the cluster view. Enforcer is always loaded in the server graph and is by passed if the feature is not enabled. Options specific to enforcer: server-quota - Specifies whether the feature is on/off. It is used to by pass the quota if turned off. deem-statfs - If set to on, it takes quota limits into consideration while estimating fs size. (df command). The algorithm followed is, i. Adjust statvfs based on limit configured on root. ii. If limit is set on the inode passed, use size/limits on that inode to populate statvfs. Otherwise, use size/limits configured on root. iii. Upon statvfs, update the ctx->size on the inode. iv. Don't let DHT aggregate, instead take the maximum of the usages from the subvols of the DHT, since each of it contains the complete information. Enforcer also makes use of gfid-to-path conversion functionality to work correctly when a client like nfs predominently relies on nameless lookups. * Quota Aggregator acts as a thin client to provide cluster view Its a lightweight *gluster client* process with no mount point, started upon enabling quota or restarting the volume. This is a single process run on each brick, which can answer queries on all volumes in the cluster. Its volfile stored in GLUSTERD_DEFAULT_WORKING_DIR/quotad/quotad.vol. Credits: Raghavendra Bhat <rabhat@redhat.com> Varun Shastry <vshastry@redhat.com> Shishir Gowda <sgowda@redhat.com> Kruthika Dhananjay <kdhananj@redhat.com> Brian Foster <bfoster@redhat.com> Krishnan Parthasarathi <kparthas@redhat.com> Change-Id: Id1cb25b414951da34c665a55f77385d482e0f9de BUG: 969461 Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Reviewed-on: http://review.gluster.org/5952 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* dht: dht_lookup_dir_cbk should set op_errno as local->op_errnoKaushal M2013-10-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | Two glusterfs clients return inconsistent errnos when the bricks of the volume were down. Consider two gluster mounts. Mount 1 was done when the bricks were online. Mount 2 was done after the bricks were killed, (using the 'glusterfs' command instead of the mount script). For any request, mount 1 will return ENOTCONN, where as mount 2 will return ENOENT. This happens because for the 2nd mount, a fuse would send a lookup on '/' for any request, as it hadn't been done yet. The client xlator returns ENOTCONN, but the dht_lookup_dir_cbk changed this to ENOENT unconditionally when aggregating. So, fuse returned ENOENT, even though the errno should have been ENOTCONN. Change-Id: I4b7a6d84ce5153045a807fccc01485afe0377117 BUG: 1019095 Signed-off-by: Kaushal M <kaushal@redhat.com> Reviewed-on: http://review.gluster.org/6072 Reviewed-by: Anand Avati <avati@redhat.com> Tested-by: Anand Avati <avati@redhat.com>
* gNFS: Incorrect NFS ACL encoding for XFSSantosh Kumar Pradhan2013-09-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Problem: Incorrect NFS ACL encoding causes "system.posix_acl_default" setxattr failure on bricks on XFS file system. XFS (potentially others?) doesn't understand when the 0x10 prefix is added to the ACL type field for default ACLs (which the Linux NFS client adds) which causes setfacl()->setxattr() to fail silently. NFS client adds NFS_ACL_DEFAULT(0x1000) for default ACL. FIX: Mask the prefix (added by NFS client) OFF, so the setfacl is not rejected when it hits the FS. Original patch by: "Richard Wareing" Change-Id: I17ad27d84f030cdea8396eb667ee031f0d41b396 BUG: 1009210 Signed-off-by: Santosh Kumar Pradhan <spradhan@redhat.com> Reviewed-on: http://review.gluster.org/5980 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* core: block unused signals in created threadsAnand Avati2013-09-251-2/+2
| | | | | | | | | | | | | | | Block all signal except those which are set for explicit handling in glusterfs_signals_setup(). Since thread spawning code in libglusterfs and xlators can get called from application threads when used through libgfapi, it is necessary to do this blocking. Change-Id: Ia320f80521a83d2edcda50b9ad414583a0175281 BUG: 1011662 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/5995 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* Revert "cluster/dht: Return success in dht_discover if layout issues"shishir gowda2013-09-241-37/+15
| | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit a3e593f9f17cb1e68db97bb5a0d8074793a33964 which was bought into fix dht_layout_anomalies error handling. We still see applications error'ing out due to graph switches. To fix the above issue - We cannot heal in dht_discover, as it is a gfid based lookup, and not path based. So, returning error here would lead to app's to see failure. Also, update the layout in inode_ctx even if it has anomalies. Let subsequent heals fix the issue. Conflicts: xlators/cluster/dht/src/dht-common.c Signed-off-by: shishir gowda <sgowda@redhat.com> Change-Id: I68c1056c3587e04a02344548546ddd06034489c5 BUG: 960348 Reviewed-on: http://review.gluster.org/5443 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* core: changes to support gfid-accessAmar Tumballi2013-08-211-6/+0
| | | | | | | | | Change-Id: I38d2fdc47e4b805deafca6805e54807976ffdb7e Signed-off-by: Amar Tumballi <amarts@redhat.com> BUG: 952029 Reviewed-on: http://review.gluster.org/5496 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* Revert "fuse: auxiliary gfid mount support"Amar Tumballi2013-08-211-0/+6
| | | | | | | | | | | | | | | | | This reverts commit 4c0f4c8a89039b1fa1c9c015fb6f273268164c20. Conflicts: xlators/mount/fuse/src/fuse-bridge.c For build issues added CREATE_MODE_KEY definition in: libglusterfs/src/glusterfs.h Change-Id: I8093c2a0b5349b01e1ee6206025edbdbee43055e BUG: 952029 Signed-off-by: Amar Tumballi <amarts@redhat.com> Reviewed-on: http://review.gluster.org/5495 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: Del GF_READDIR_SKIP_DIRS key from dict for first_upshishir gowda2013-08-181-4/+11
| | | | | | | | | | | | | | | | Currently, we sent GF_READDIR_SKIP_DIRS for all subvolumes if first_subvol != first_up_subvolume. Also first_up_subvolume can change with-in the life of a call and cbk. Saving the first_up_subvol in dht_local for checks. Change-Id: I6e369e63f29c9761993f2a66ed768c424bb44d27 BUG: 996474 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/5577 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* fuse: auxiliary gfid mount supportRaghavendra G2013-07-191-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * files can be accessed directly through their gfid and not just through their paths. For eg., if the gfid of a file is f3142503-c75e-45b1-b92a-463cf4c01f99, that file can be accessed using <gluster-mount>/.gfid/f3142503-c75e-45b1-b92a-463cf4c01f99 .gfid is a virtual directory used to seperate out the namespace for accessing files through gfid. This way, we do not conflict with filenames which can be qualified as uuids. * A new file/directory/symlink can be created with a pre-specified gfid. A setxattr done on parent directory with fuse_auxgfid_newfile_args_t initialized with appropriate fields as value to key "glusterfs.gfid.newfile" results in the entry <parent>/bname whose gfid is set to args.gfid. The contents of the structure should be in network byte order. struct auxfuse_symlink_in { char linkpath[]; /* linkpath is a null terminated string */ } __attribute__ ((__packed__)); struct auxfuse_mknod_in { unsigned int mode; unsigned int rdev; unsigned int umask; } __attribute__ ((__packed__)); struct auxfuse_mkdir_in { unsigned int mode; unsigned int umask; } __attribute__ ((__packed__)); typedef struct { unsigned int uid; unsigned int gid; char gfid[UUID_CANONICAL_FORM_LEN + 1]; /* a null terminated gfid string * in canonical form. */ unsigned int st_mode; char bname[]; /* bname is a null terminated string */ union { struct auxfuse_mkdir_in mkdir; struct auxfuse_mknod_in mknod; struct auxfuse_symlink_in symlink; } __attribute__ ((__packed__)) args; } __attribute__ ((__packed__)) fuse_auxgfid_newfile_args_t; An initial consumer of this feature would be geo-replication to create files on slave mount with same gfids as that on master. It will also help gsyncd to access files directly through their gfids. gsyncd in its newer version will be consuming a changelog (of master) containing operations on gfids and sync corresponding files to slave. * Also, bring in support to heal gfids with a specific value. fuse-bridge sends across a gfid during a lookup, which storage translators assign to an inode (file/directory etc) if there is no gfid associated it. This patch brings in support to specify that gfid value from an application, instead of relying on random gfid generated by fuse-bridge. gfids can be healed through setxattr interface. setxattr should be done on parent directory. The key used is "glusterfs.gfid.heal" and the value should be the following structure whose contents should be in network byte order. typedef struct { char gfid[UUID_CANONICAL_FORM_LEN + 1]; /* a null terminated gfid * string in canonical form */ char bname[]; /* a null terminated basename */ } __attribute__((__packed__)) fuse_auxgfid_heal_args_t; This feature can be used for upgrading older geo-rep setups where gfids of files are different on master and slave to newer setups where they should be same. One can delete gfids on slave using setxattr -x and .glusterfs and issue stat on all the files with gfids from master. Thanks to "Amar Tumballi" <amarts@redhat.com> and "Csaba Henk" <csaba@redhat.com> for their inputs. Signed-off-by: Raghavendra G <rgowdapp@redhat.com> Change-Id: Ie8ddc0fb3732732315c7ec49eab850c16d905e4e BUG: 952029 Reviewed-on: http://review.gluster.com/#/c/4702 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Amar Tumballi <amarts@redhat.com> Reviewed-on: http://review.gluster.org/4702 Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* dht: fix dht_discover_cbk doing a wrong layout set.shishir gowda2013-07-171-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | with the sequence of operations are like below, we have issues with current code (MP == mountpoint): T0,MP1# mkdir /abcd (Succeeds on hash_subvol) T1,MP2# mkdir /abcd (Gets EEXIST as dir exists in hash_subvol) T2,MP2# mkdir /.gfid/<abcd's gfid>/xyz (lookup happens on abcd's gfid, calls dht_discover) T3,MP1# (Completes mkdir(), goes to dir_selfheal to set the layouts). T4,MP2# (dht_discover_cbk gets success for lookup as the entry existed, as layout is not yet written, it says normalize done, found holes). T5,MP2# (as layout anomaly is not considered an issue in this patch, dht_layout_set happens on inode, with all xlators pointing to 0s) T6,MP1# (completes mkdir call, inode has proper layouts) T7,MP2# mkdir /.gfid/<abcd's gfid>/xyz fails with ENOENT (with log saying no subvol found for hash value of xyz. Porting Amar's fix from down-stream beta branch. Change-Id: Ibdc37ee614c96158a1330af19cad81a39bee2651 BUG: 982913 Original-author: Amar Tumballi <amarts@redhat.com> Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/5302 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* libxlator: implement pluggable aggregation policiesAvra Sengupta2013-07-151-1/+3
| | | | | | | | | | | | | | | | | The API is described in libxlator.h. Behavior remains the same for this commit; this is a preparatory step for per-translator customization of aggregation. Change-Id: I5d42923af59b2fd78e1ff59c12763875b57c5190 BUG: 847839 Original Author: Csaba Henk <csaba@redhat.com> Signed-off-by: Avra Sengupta <asengupt@redhat.com> Reviewed-on: http://review.gluster.org/4903 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: node-uuid for directories winds to all subvolumesAvra Sengupta2013-07-151-2/+6
| | | | | | | | | | | | | | | this works similar to pathinfo now except that the request is sent to all subvolumes of dht. Underlying replica selects it's subvolume in a round-robin fashion till one of them returns successfully. Change-Id: Ie46c5f7090d04d8c2e487b209916ae6791e94624 BUG: 847839 Original Author: Venky Shankar <vshankar@redhat.com> Signed-off-by: Avra Sengupta <asengupt@redhat.com> Reviewed-on: http://review.gluster.org/5225 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/*: get logic to calculate min() of the 'stime' xattrAvra Sengupta2013-07-141-0/+5
| | | | | | | | | | | | | | | | | | | | | | * in both distribute and replicate (ignoring stripe for now), add logic to calculate the min() of stime values. * What is a 'stime' ? Why is this required: - stime means 'slave xtime', mainly used to keep track of slave node's sync status when distributed geo-replication is used. Logic of calculating 'min()' for this stime is very important as in case of crashes/reboots/shutdown, we will have to 'restart' with crawling from stime time value from the mount point, which gives the 'min()' of all the bricks, which means, we don't miss syncing any files in the above cases. Change-Id: I2be8d434326572be9d4986db665570a6181db1ee BUG: 847839 Original Author: Amar Tumballi <amarts@redhat.com> Signed-off-by: Avra Sengupta <asengupt@redhat.com> Reviewed-on: http://review.gluster.org/4893 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: If linkfile unlink fails with ENOTCONN, do not failshishir gowda2013-07-111-1/+2
| | | | | | | | | | | | | | Currently if linkfile fails with ENOENT, we do not fail. We also need to treat failures with ENOTCONN as success, as if cached subvol is up, rm of a file should succeed. A stale linkfile will get removed later Change-Id: I71d136847933351ed9e2c939bda4a69bc96a3cfc BUG: 983416 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/5317 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: Ignore subvols with error in min-free-disk/inodesshishir gowda2013-07-101-2/+4
| | | | | | | | | | | | | | | Currently when selecting a alternative subvolume when hashed subvol has exceeded min-free-disk/inodes, we do not check if layouts have errors (including decommissioning). This leads to data being written to those subvolumes, and in case of decommissioning, will lead to data loss. Change-Id: Ie0c6cf4a29d7c53d8a6d8a8c1bd595cf58a0012a BUG: 982919 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/5299 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: Fix unused-but-set-variable warningsPranith Kumar K2013-06-061-4/+0
| | | | | | | | | | Change-Id: Ie2b2dc54aa0d500c35752c72d3b562bcc05b1fc2 BUG: 969336 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5123 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: Ignore ENOENT errors for unlink of linkfilesshishir gowda2013-06-061-2/+2
| | | | | | | | | | | | | If unlink of linkfile returns ENOENT, do not fail unlink. Proceed with unlinking of cached file. Change-Id: If7cec92b40c39d68dd9c3606c6c2c3a6bd67d27b BUG: 966848 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4971 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: Set layout when inode is presentPranith Kumar K2013-06-041-1/+2
| | | | | | | | | | | | | | | | | Problem: Lookups in discovery fail with ENOENT so local->inode is never set. dht_layout_set logs the callstack when the function is called in that state. Fix: Don't set layout when lookups fail in discovery. Change-Id: I5d588314c89e3575fcf7796d57847e35fd20f89a BUG: 965434 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/5055 Reviewed-by: Shishir Gowda <sgowda@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/dht: Return success in dht_discover if layout issuesJeff Darcy2013-06-031-13/+32
| | | | | | | | | | | | | | | | | We cannot heal in dht_discover, as it is a gfid based lookup, and not path based. So, returning error here would lead to app's to see failure. Also, update the layout in inode_ctx even if it has anomalies. Let subsequent heals fix the issue. Change-Id: I2358aadacf9a24e20a22ab0a6055c38c5eb6485c BUG: 960348 Original-author: shishir gowda <sgowda@redhat.com> Signed-off-by: shishir gowda <sgowda@redhat.com> Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4959 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* dht,posix: support for case discoveryAnand Avati2013-05-251-0/+73
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is support for discovering a filename in a given directory which has a case insensitive match of a given name. It is implemented as a virtual extended attribute on the directory where the required filename is specified in the key. E.g: sh# getfattr -e "text" -n user.glusterfs.get_real_filename:FiLe-B /mnt/samba/patchy getfattr: Removing leading '/' from absolute path names # file: mnt/samba/patchy user.glusterfs.get_real_filename:FiLe-B="file-b" In reality, there can be multiple "answers" as the backend filesystem is case sensitive and there can be multiple files which can strcasecamp() successfully. In this case we pick the first matched file from the first responding server. If a matching file does not exist, we return ENOENT (and NOT ENODATA). This way the caller can differentiate between "unsupported" glusterfs API and file not existing. This API is used by Samba VFS to perform efficient discovery of the real filename without doing a full scan at the Samba level. Change-Id: I53054c4067cba69e585fd0bbce004495bc6e39e8 BUG: 953694 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4941 Reviewed-by: Vijay Bellur <vbellur@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* object-storage: final removal of ufo codePeter Portante2013-05-101-54/+0
| | | | | | | | | | | | | | | | | | | See https://git.gluster.org/gluster-swift.git for the new location of the Gluster-Swift code. With this patch, no OpenStack Swift related RPMs are constructed. This patch also removes the unused code that references the user.ufo-test xattr key in the DHT translator. Change-Id: I2da32642cbd777737a41c5f9f6d33f059c85a2c1 BUG: 961902 (https://bugzilla.redhat.com/show_bug.cgi?id=961902) Signed-off-by: Peter Portante <peter.portante@redhat.com> Reviewed-on: http://review.gluster.org/4970 Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com> Reviewed-by: Luis Pabon <lpabon@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* dht: make DHT xattr names configurableJeff Darcy2013-03-211-50/+74
| | | | | | | | | | | | | | | | | | | | This is necessary to support "DHT over DHT" configurations, so that the upper and lower instances of DHT don't step all over each other. Why would we even consider such a thing? Because it gives us the ability to do data tiering and rack-aware placement, either by themselves or as complements to other functionality such as erasure codes or deduplication which save space but cost performance. By setting up the top-level DHT to place data into one of several lower-level DHT pools based on policy instead of pure elastic hashing, we get better performance for 90% of accesses and better storage efficiency for 90% of data, all for relatively low effort. Change-Id: I72e65c29edfc80babf39f7a2a00090f4588c4070 BUG: 924265 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4694 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster xlators: s/-1/GF_CLIENT_PID_GSYNCD/Csaba Henk2013-03-031-2/+2
| | | | | | | | | Change-Id: I03be3cb23684de4ab36cf2953002708466edd580 BUG: 765433 Signed-off-by: Csaba Henk <csaba@redhat.com> Reviewed-on: http://review.gluster.org/4601 Reviewed-by: Venky Shankar <vshankar@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/distribute: Prevent spurious multiple defrag crawlsshishir gowda2013-02-281-9/+14
| | | | | | | | | | | | | | | | | | In dht_notify, we used to create a thread to start defrag crawls after we had heard from all child subvols. This was in-correct, as a later event, could also trigger the crawl again(due to the fact that all subvols had responded). The fix is to make sure, the thread is started only once after all subvols have responded the first time Change-Id: Ifc2978b9dc866af2395b79911eca50ab38ff9457 BUG: 916449 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4587 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster/dht: Create linkfile with file uid/gidshishir gowda2013-02-131-3/+13
| | | | | | | | | | | | | | | | Currently, linkfile creation happens as root. use uid/gid returned from _cbk (link/rename) to set the correct ownership of the link files. Also added test/dht.rc to implement common dht functions Change-Id: Iecdcf52d89fed8286670ce430b9d19deebd27b8c BUG: 884597 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4304 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: pathinfo xattr changes for directoriesVenky Shankar2013-02-081-92/+223
| | | | | | | | | | | | | | | | | | | | Since directories have presence on all subvolumes there is no definite meaning of ->hashed_subvol or ->cached_subvol. getxattr() code path chooses ->cached_subvol for pathinfo extended attribute. While this makes sense of files, it makes less sense for directories. Further if a hashed or a cached subvolume is down, and there's a getxattr request for a directory, we return with an errno. This patch changes pathinfo extended attribute contents by aggregating information from all subvolumes that are up. Change-Id: I58adb741d63ccfd1d0239af75eb65f26f0fb384d Signed-off-by: Venky Shankar <vshankar@redhat.com> BUG: 856455 Reviewed-on: http://review.gluster.org/4047 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: ignore EEXIST error in mkdir to avoid GFID mismatchAnand Avati2013-02-031-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In dht_mkdir_cbk, EEXIST error is treated like a true error. Because of this the following sequence of events can happen, eventually resulting in GFID mismatch and (and possibly leaked locks and hang, in the presence of replicate.) The issue exists when many clients concurrently attempt creation of directory and subdirectory (e.g mkdir -p /mnt/gluster/dir1/subdir) 0. First mkdir happens by one client on the hashed subvolume. Only one client wins the race. Others racing mkdirs get EEXIST. Yet other "laggers" in the race encounter the just-created directory in lookup() on the hash dir. 1. At least one "lagger" lookup() notices that there are missing directories on other subvolumes (which the "winner" mkdir is yet to create), and starts off self-heal of the directory. 2. At least on some subvolumes, self-heal's mkdir wins the race against the "winner" mkdir and creates the directory first. This causes the "winner" mkdir to experience EEXIST error on those subvolumes. 3. On other subvolumes where "winner" mkdir won the race, self-heal experiences EEXIST error, but self-heal is properly translating that into a success (but mkdir code path is not -- which is the bug.) 4. Both mkdir and self-heal assign hash layouts to the just created directory. But self-heal distributes hash range across N (total) subvolumes, whereas mkdir distributes hash range across N - M (where M is the number of subvolumes where mkdir lost the race). Both the clients "cache" their respective layouts in the near future for all future creates inside them (evidence in logs) 5. During the creation of the subdirectory, two clients race again. Ideally winner performs mkdir() on the hashed subvolume and proceeds to create other dirs, loser experiences EEXIST error on the hashed subvolume and backs off. But in this case, because the two clients have different layout views of the parent directory (because of different hash splits and assignements), the hashed subvolumes for the new directory can end up being different. Therefore, both clients now win the race (they were never fighting against each other on a common server), assigning different GFIDs to the directory on their respective (different) subvolumes. Some of the remaining subvolumes get GFID1, others GFID2. Conclusion/Fix: Making mkdir translate EEXIST error as success (just the way self-heal is already rightly doing) will bring back truth to the design claim that concurrent mkdir/self-heals perform deterministic + idempotent operations. This will prevent the differing "hash views" by different clients and thereby also avoid GFID mismatch by forcing all clients to have a "fair race", because the hashed subvolume for all will be the same (and thereby avoiding leaked locks and hangs.) Change-Id: I84592fb9b8a3f739a07e2afb23b33758a0a9a157 BUG: 907072 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4459 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com>
* cluster/dht: stack wind with cookiev3.4.0qa8Varun Shastry2013-01-311-16/+22
| | | | | | | | | | | | | | | Default_fops uses stack_wind_tail. It winds without creating the frame leading into wrong subvol return in the cookie. To avoid the problem caused by the same, we're getting the subvol by passing the cookie. Change-Id: I51ee79b22c89e4fb0b89e9a0bc3ac96c5b469f8f BUG: 893338 Signed-off-by: Varun Shastry <vshastry@redhat.com> Reviewed-on: http://review.gluster.org/4388 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com> Tested-by: Anand Avati <avati@redhat.com>
* cluster/distribute: If cached_subvol is down, return ENOTCONN in lookupshishir gowda2013-01-211-1/+10
| | | | | | | | | | | | | When we follow a linkfile, and the lookup returns a ENOTCONN error, return the error, as the cached subvol is down, and lookup_everywhere wont succeed, but actually ends up clearing the linkfile, and clearing the namespace. Change-Id: I772bf71531bc646e8fb62d3e8549a5fe0f3896da BUG: 893378 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4383 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: update ctx-time only if we receive the new iattVarun Shastry2013-01-171-4/+4
| | | | | | | | | | | | | | | 1. Used local->postparent(contains merged iatt of all succesful calls) instead of postparent for dht ctx time update. 2. dht_inode_ctx_time_update avoided in case of opret -1. Change-Id: Ie04a7842a41c241f911b6a3f76267b996d27fb43 BUG: 881013 Signed-off-by: Varun Shastry <vshastry@redhat.com> Reviewed-on: http://review.gluster.org/4338 Reviewed-by: Shishir Gowda <sgowda@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: fail fix-layout if any of the subvol is downshishir gowda2012-11-291-7/+15
| | | | | | | | | | | | | | | | If any subvolume is down, and a layout is re-written and hash values change, entry names in the downed subvol can be reused in the other subvol which got the same hash range. when the downed subvol is brought back up, duplicate entried might appear Also separated handling of ENOSPC and ENOTCONN error. Change-Id: I5ed93990425a4cee70df2dab7c7c119fdc87ad56 BUG: 860663 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4000 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>