summaryrefslogtreecommitdiffstats
path: root/xlators/cluster/dht
Commit message (Collapse)AuthorAgeFilesLines
...
* dht: improve transform/detransform of d_off (and be ext4 safe)Anand Avati2013-04-011-5/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The scheme to encode brick d_off and brick id into global d_off has two approaches. Since both brick d_off and global d_off are both 64-bit wide, we need to be careful about how the brick id is encoded. Filesystems like XFS always give a d_off which fits within 32bits. So we have another 32bits (actually 31, in this scheme, as seen ahead) to encode the brick id - which is typically plenty. Filesystems like the recent EXT4 utilize the upto 63 low bits in d_off, as the d_off is calculated based on a hash function value. This leaves us no "unused" bits to encode the brick id. However both these filesystmes (EXT4 more importantly) are "tolerant" in terms of the accuracy of the value presented back in seekdir(). i.e, a seekdir(val) actually seeks to the entry which has the "closest" true offset. This "two-prong" scheme exploits this behavior - which seems to be the best middle ground amongst various approaches and has all the advantages of the old approach: - Works against XFS and EXT4, the two most common filesystems out there. (which wasn't an "advantage" of the old approach as it is borken against EXT4) - Probably works against most of the others as well. The ones which would NOT work are those which return HUGE d_offs _and_ NOT tolerant to seekdir() to "closest" true offset. - Nothing to "remember in memory" or evict "old entries". - Works fine across NFS server reboots and also NFS head failover. - Tolerant to seekdir() to arbitrary locations. Algorithm: Each d_off can be encoded in either of the two schemes. There is no requirement to encode all d_offs of a directory or a reply-set in the same scheme. The topmost bit of the 64 bits is used to specify the "type" of encoding of this particular d_off. If the topmost bit (bit-63) is 1, it indicates that the encoding scheme holds a HUGE d_off. If the topmost bit is is 0, it indicates that the "small" d_off encoding scheme is used. The goal of the "small" d_off encoding is to stay as dense as possible towards the lower bits even in the global d_off. The goal of the HUGE d_off encoding is to stay as accurate (close) as possible to the "true" d_off after a round of encoding and decoding. If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick ID (call it "n"). SMALL d_off =========== Encoding -------- If the top n + 1 bits are free in a brick offset, then we leave the top bit as 0 and set the remaining bits based on the old formula: hi_mask = 0xffffffffffffffff hi_mask = ~(hi_mask >> (n + 1)) if ((hi_mask & d_off_brick) != 0) do_large_d_off_encoding () d_off_global = (d_off_brick * N) + brick_id Decoding -------- If the top bit in the global offset is 0, it indicates that this is the encoding formula used. So decoding such a global offset will be like the old formula: if ((d_off_global & 0x8000000000000000) != 0) do_large_d_off_decoding() d_off_brick = (d_off_global % N) brick_id = d_off_global / N HUGE d_off ========== Encoding -------- If the top n + 1 bits are NOT free in a given brick offset, then we set the top bit as 1 in the global offset. The low n bits are replaced by brick_id. low_mask = 0xffffffffffffffff << n // where n is ROOF(Log2(N)) d_off_global = (0x8000000000000000 | d_off_brick & low_mask) + brick_id if (d_off_global == 0xffffffffffffffff) discard_entry(); Decoding -------- If the top bit in the global offset is set 1, it indicates that the encoding formula used is above. So decoding would look like: hi_mask = (0xffffffffffffffff << n) low_mask = ~(hi_mask) d_off_brick = (global_d_off & hi_mask & 0x7fffffffffffffff) brick_id = global_d_off & low_mask If "losing" the low n bits in this decoding of d_off_brick looks "scary", we need to realize that till recently EXT4 used to only return what can now be expressed as (d_off_global >> 32). The extra 31 bits of hash added by EXT recently, only decreases the probability of a collision, and not eliminate it completely, anyways. In a way, the "lost" n bits are made up by decreasing the probability of collision by sharding the files into N bricks / EXT directories -- call it "hash hedging", if you will :-) Thanks-to: Zach Brown <zab@redhat.com> Change-Id: Ieba9a7071829d51860b7c131982f12e0136b9855 BUG: 838784 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4711 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* dht: make DHT xattr names configurableJeff Darcy2013-03-2111-125/+192
| | | | | | | | | | | | | | | | | | | | This is necessary to support "DHT over DHT" configurations, so that the upper and lower instances of DHT don't step all over each other. Why would we even consider such a thing? Because it gives us the ability to do data tiering and rack-aware placement, either by themselves or as complements to other functionality such as erasure codes or deduplication which save space but cost performance. By setting up the top-level DHT to place data into one of several lower-level DHT pools based on policy instead of pure elastic hashing, we get better performance for 90% of accesses and better storage efficiency for 90% of data, all for relatively low effort. Change-Id: I72e65c29edfc80babf39f7a2a00090f4588c4070 BUG: 924265 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4694 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* dht: fix a typoJulesWang2013-03-181-1/+1
| | | | | | | | | Change-Id: Id6f156957e58aad06bf2602f880c7e4102b80fd1 BUG: 764890 Signed-off-by: JulesWang <w.jq0722@gmail.com> Reviewed-on: http://review.gluster.org/4679 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/distribute: Fix layout overlaps due to spread-count in selfheal pathshishir gowda2013-03-071-50/+12
| | | | | | | | | | | | | | | | We needed to zero out the layout range, before we re-calculate the range. When spread-count is issued, we would end up with stale ranges in the layout. Replaced dht_selfheal_dir_xattr with dht_fix_dir_xattr, which correctly resets the un-used (after re-cal) layouts. Change-Id: I1a900d15df07335f59356bd23182ccec34381ab2 BUG: 884455 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4647 Reviewed-by: Amar Tumballi <amarts@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster xlators: s/-1/GF_CLIENT_PID_GSYNCD/Csaba Henk2013-03-031-2/+2
| | | | | | | | | Change-Id: I03be3cb23684de4ab36cf2953002708466edd580 BUG: 765433 Signed-off-by: Csaba Henk <csaba@redhat.com> Reviewed-on: http://review.gluster.org/4601 Reviewed-by: Venky Shankar <vshankar@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* glusterd: Added the validation function for subvols-per-directoryAvra Sengupta2013-02-281-0/+2
| | | | | | | | | Change-Id: Ie2259023b9001311a2032792639c3093054f6750 BUG: 896431 Signed-off-by: Avra Sengupta <asengupt@redhat.com> Reviewed-on: http://review.gluster.org/4552 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/distribute: Prevent spurious multiple defrag crawlsshishir gowda2013-02-281-9/+14
| | | | | | | | | | | | | | | | | | In dht_notify, we used to create a thread to start defrag crawls after we had heard from all child subvols. This was in-correct, as a later event, could also trigger the crawl again(due to the fact that all subvols had responded). The fix is to make sure, the thread is started only once after all subvols have responded the first time Change-Id: Ifc2978b9dc866af2395b79911eca50ab38ff9457 BUG: 916449 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4587 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster/dht: print hash function munging logs in DEBUG modeAnand Avati2013-02-271-2/+2
| | | | | | | | | Change-Id: Ia2e6bce80710d103da9d78afdb389ea162b00686 BUG: 912564 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4590 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/distribute: Add filter to support file patterns to be migratedshishir gowda2013-02-273-1/+120
| | | | | | | | | | | | | | | | | | | | | | | 'gluster volume rebalance' command will be enhanced to support passing of these options/pattern. <pattern> is comma separated list as show below. The Precedence is from right to left. e.g- "*avi,*pdf:10MB,*:1KB" The precedence is as follows: migrate all files with size equal or greater than 1KB "*:1KB" migrate all pdf files with size equal or greater than 10MB "*pdf:10MB" migrate all avi files "*avi" With this option, it is possible to choose which files to migrate. Change-Id: I6d6d6a015bcbacf1debae2f278a2d92306fb055d BUG: 896456 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4366 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/distribute: Preserve file size during rebalance migrationshishir gowda2013-02-251-0/+6
| | | | | | | | | | | | | | | | | | If holes are encountered, then we do not write these to the dst, which sometimes causes file size to be lesser than src. Data is not corrupted, as when non-zero reads are received, we do write that data. Calling a truncrate to give file size to prevent it from being truncated to less than src in case the file end has holes. Thanks to Brian Foster for providing the test case Change-Id: I3cdd143b63ec8d797273d76189dff8b05eb9e551 BUG: 915554 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4574 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* dht: Enable mem-accounting for nufaPranith Kumar K2013-02-191-0/+9
| | | | | | | | | Change-Id: I0cf8a59c81e30603aadc45393bbd49d80ae1621f BUG: 912657 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4542 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* distribute: add hash-name-regex optionJeff Darcy2013-02-185-23/+125
| | | | | | | | | | | | | | | | | | | | | This is to support the common "write to temp file then rename" idiom. In our case this causes us to create a linkfile during the rename in (N-1)/N cases, with a significant impact on performance e.g. for UFO which uses this idiom. This patch allows the user to specify up to two regular expressions that separate the permanent and transient parts of a temp-file name: rsync-hash-regex reimplements the existing RSYNC_FRIENDLY_NAME pattern where the temporary name for EXAMPLE is .EXAMPLE.suffix extra-hash-regex can be used for a second pattern, active concurrently, and can include alternatives via the POSIX extended-regex syntax Change-Id: Ic1a6ed19324bc31fefe91ee34b8478877a9c5d2c BUG: 912564 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4116 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: improvement in rebalance logsAmar Tumballi2013-02-171-2/+2
| | | | | | | | | | | | provide space between path and next string, so that we can grep for the correct path in error logs. Change-Id: Ide53e341864dce432406a58e8cbb2ff1480cfbde Signed-off-by: Amar Tumballi <amarts@redhat.com> BUG: 815194 Reviewed-on: http://review.gluster.org/4531 Reviewed-by: Anand Avati <avati@redhat.com> Tested-by: Anand Avati <avati@redhat.com>
* cluster/distribute: Reopen fds in migration internally as root:rootshishir gowda2013-02-141-1/+10
| | | | | | | | | | | | | | | Though linkfile_create and rebalance dst file create sent a setattr with correct ownership, there is still a race window where the linkfile open (client open due to migration) will fail, as its ownership will be root:root. Change-Id: I056092da6102319efa3bb9f1795f8c97db2a3fed BUG: 884597 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4513 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/distribute: Remove suprious fd_unref callshishir gowda2013-02-141-2/+0
| | | | | | | | | | | | | After fix http://review.gluster.org/4282 (libglusterfsterfs/syncop: do not hold ref on the fd in cbk) was pushed, syncop_open does not take a ref anymore. Change-Id: I5543986a4f2d965eef42ca979b39ac75439cec49 BUG: 910661 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4509 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: Create linkfile with file uid/gidshishir gowda2013-02-134-4/+101
| | | | | | | | | | | | | | | | Currently, linkfile creation happens as root. use uid/gid returned from _cbk (link/rename) to set the correct ownership of the link files. Also added test/dht.rc to implement common dht functions Change-Id: Iecdcf52d89fed8286670ce430b9d19deebd27b8c BUG: 884597 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4304 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: pathinfo xattr changes for directoriesVenky Shankar2013-02-082-92/+224
| | | | | | | | | | | | | | | | | | | | Since directories have presence on all subvolumes there is no definite meaning of ->hashed_subvol or ->cached_subvol. getxattr() code path chooses ->cached_subvol for pathinfo extended attribute. While this makes sense of files, it makes less sense for directories. Further if a hashed or a cached subvolume is down, and there's a getxattr request for a directory, we return with an errno. This patch changes pathinfo extended attribute contents by aggregating information from all subvolumes that are up. Change-Id: I58adb741d63ccfd1d0239af75eb65f26f0fb384d Signed-off-by: Venky Shankar <vshankar@redhat.com> BUG: 856455 Reviewed-on: http://review.gluster.org/4047 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* Use proper libtool option -avoid-version instead of bogus -avoidversionAnand Avati2013-02-071-3/+3
| | | | | | | | | | Change-Id: I1c9541058c7d07786539a3266ca125a6a15287d8 BUG: 859835 Signed-off-by: Anand Avati <avati@redhat.com> Original-author: Kacper Kowalik (Xarthisius) <xarthisius.kk@gmail.com> Signed-off-by: Kacper Kowalik (Xarthisius) <xarthisius.kk@gmail.com> Reviewed-on: http://review.gluster.org/3967 Tested-by: Gluster Build System <jenkins@build.gluster.com>
* dht: better layout-optimization algorithmJeff Darcy2013-02-072-22/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This method deals with the case where swapping might gain a bigger overlap for the xlator currently under consideration, but sacrifices even more from the xlator we're swapping with. For example: A = 0x00000000 - 0x44444443 (new 0x00000000 - 0x55555554) B = 0x44444444 - 0x77777776 (new 0x55555555 - 0xaaaaaaa9) C = 0x77777777 - 0xffffffff (new 0xaaaaaaaa - 0xffffffff) Here, the new range for B has a bigger overlap with the old C than with the old B (0x33333333 vs. 0x22222222 to be precise) so looking only at that might lead us to swap. However, such a swap turns the new C's overlap from 0x55555556 (vs. old C) to *zero* (vs. old B). In other words, we've gained 0x11111111 for B but lost 0x55555556 for C, so it's a bad idea. The new algorithm accounts for all effects of the swap, so it not only avoids bad swaps but can make some good ones that would have been missed previously. For example, if swapping a range X with a later range Y would not increase the overlap for X we would previously have skipped it even if the swap would increase Y's overlap without affecting X's. This is the normal case when we're adding a new brick (which initially has zero overlap with any old range) so finding more good swaps is probably even more important than avoiding bad ones. Also, the logic in dht_overlap_calc was completely broken before, causing integer overflows instead of providing correct values, so no matter what higher-level algorithm was in place the GIGO effect would have resulted in bad decisions. Change-Id: If61ed513cfcb931916c6b51da293e3efbaaf385f BUG: 853258 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/3908 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: Correct min_free_disk behaviourRaghavendra Talur2013-02-042-27/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Files were being created in subvol which had less than min_free_disk available even in the cases where other subvols with more space were available. Solution: Changed the logic to look for subvol which has more space available. In cases where all the subvols have lesser than Min_free_disk available , the one with max space and atleast one inode is available. Known Issue: Cannot ensure that first file that is created right after min-free-value is crossed on a brick will get created in other brick because disk usage stat takes some time to update in glusterprocess. Will fix that as part of another bug. Change-Id: If3ae0bf5a44f8739ce35b3ee3f191009ddd44455 BUG: 858488 Signed-off-by: Raghavendra Talur <rtalur@redhat.com> Reviewed-on: http://review.gluster.org/4420 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: ignore EEXIST error in mkdir to avoid GFID mismatchAnand Avati2013-02-031-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In dht_mkdir_cbk, EEXIST error is treated like a true error. Because of this the following sequence of events can happen, eventually resulting in GFID mismatch and (and possibly leaked locks and hang, in the presence of replicate.) The issue exists when many clients concurrently attempt creation of directory and subdirectory (e.g mkdir -p /mnt/gluster/dir1/subdir) 0. First mkdir happens by one client on the hashed subvolume. Only one client wins the race. Others racing mkdirs get EEXIST. Yet other "laggers" in the race encounter the just-created directory in lookup() on the hash dir. 1. At least one "lagger" lookup() notices that there are missing directories on other subvolumes (which the "winner" mkdir is yet to create), and starts off self-heal of the directory. 2. At least on some subvolumes, self-heal's mkdir wins the race against the "winner" mkdir and creates the directory first. This causes the "winner" mkdir to experience EEXIST error on those subvolumes. 3. On other subvolumes where "winner" mkdir won the race, self-heal experiences EEXIST error, but self-heal is properly translating that into a success (but mkdir code path is not -- which is the bug.) 4. Both mkdir and self-heal assign hash layouts to the just created directory. But self-heal distributes hash range across N (total) subvolumes, whereas mkdir distributes hash range across N - M (where M is the number of subvolumes where mkdir lost the race). Both the clients "cache" their respective layouts in the near future for all future creates inside them (evidence in logs) 5. During the creation of the subdirectory, two clients race again. Ideally winner performs mkdir() on the hashed subvolume and proceeds to create other dirs, loser experiences EEXIST error on the hashed subvolume and backs off. But in this case, because the two clients have different layout views of the parent directory (because of different hash splits and assignements), the hashed subvolumes for the new directory can end up being different. Therefore, both clients now win the race (they were never fighting against each other on a common server), assigning different GFIDs to the directory on their respective (different) subvolumes. Some of the remaining subvolumes get GFID1, others GFID2. Conclusion/Fix: Making mkdir translate EEXIST error as success (just the way self-heal is already rightly doing) will bring back truth to the design claim that concurrent mkdir/self-heals perform deterministic + idempotent operations. This will prevent the differing "hash views" by different clients and thereby also avoid GFID mismatch by forcing all clients to have a "fair race", because the hashed subvolume for all will be the same (and thereby avoiding leaked locks and hangs.) Change-Id: I84592fb9b8a3f739a07e2afb23b33758a0a9a157 BUG: 907072 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4459 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com>
* cluster/dht: stack wind with cookieVarun Shastry2013-01-313-29/+45
| | | | | | | | | | | | | | | Default_fops uses stack_wind_tail. It winds without creating the frame leading into wrong subvol return in the cookie. To avoid the problem caused by the same, we're getting the subvol by passing the cookie. Change-Id: I51ee79b22c89e4fb0b89e9a0bc3ac96c5b469f8f BUG: 893338 Signed-off-by: Varun Shastry <vshastry@redhat.com> Reviewed-on: http://review.gluster.org/4388 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com> Tested-by: Anand Avati <avati@redhat.com>
* libglusterfs/syncop: do not hold ref on the fd in cbkRaghavendra Bhat2013-01-301-6/+6
| | | | | | | | | | | | | * Do not do fd_ref in cbks of the fops which return a fd (such as open, opendir, create). Change-Id: Ic2f5b234c5c09c258494f4fb5d600a64813823ad BUG: 885008 Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com> Reviewed-on: http://review.gluster.org/4282 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/distribute: get_layout should account only available subvolsshishir gowda2013-01-231-4/+3
| | | | | | | | | | | | | | | | The earlier logic used to check if (layout-spread-count <= subvol_cnt - decommissioned bricks). With this if a subvol was down, and layout-spread was > upsubvols, a mkdir ended up creating holes in the layout. The fix is to consider only the combination of subvols which are usable (not down or not decommissioned). Change-Id: I61ad3bcaf4589f5a75f7887cfa595c98311ae3bb BUG: 902610 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4412 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* glusterd/cli: Updated the options descriptions for "volume set help"Avra Sengupta2013-01-211-4/+24
| | | | | | | | | Change-Id: I0db00b7334bb9707ab48bd661ac03a3ad818d6e4 BUG: 893458 Signed-off-by: Avra Sengupta <asengupt@redhat.com> Reviewed-on: http://review.gluster.org/4393 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* core: fixes for gcc's '-pedantic' flag buildAvra Sengupta2013-01-211-1/+1
| | | | | | | | | | | | | * warnings on 'void *' arguments * warnings on empty initializations * warnings on empty array (array[0]) Change-Id: Iae440f54cbd59580eb69f3ecaed5a9926c0edf95 BUG: 875913 Signed-off-by: Avra Sengupta <asengupt@redhat.com> Reviewed-on: http://review.gluster.org/4219 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/distribute: If cached_subvol is down, return ENOTCONN in lookupshishir gowda2013-01-211-1/+10
| | | | | | | | | | | | | When we follow a linkfile, and the lookup returns a ENOTCONN error, return the error, as the cached subvol is down, and lookup_everywhere wont succeed, but actually ends up clearing the linkfile, and clearing the namespace. Change-Id: I772bf71531bc646e8fb62d3e8549a5fe0f3896da BUG: 893378 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4383 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: update ctx-time only if we receive the new iattVarun Shastry2013-01-172-5/+8
| | | | | | | | | | | | | | | 1. Used local->postparent(contains merged iatt of all succesful calls) instead of postparent for dht ctx time update. 2. dht_inode_ctx_time_update avoided in case of opret -1. Change-Id: Ie04a7842a41c241f911b6a3f76267b996d27fb43 BUG: 881013 Signed-off-by: Varun Shastry <vshastry@redhat.com> Reviewed-on: http://review.gluster.org/4338 Reviewed-by: Shishir Gowda <sgowda@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: Add "afr.readdir-failover=off" option the rebalance processshishir gowda2012-12-161-7/+27
| | | | | | | | | | | | | | | | | By failing over readdir (default behaviour), rebalance could get duplicate files, as readdir would re-read from offset 0. Rebalance should not attempt to migrate these files again. Additionally, we need to handle these cases as failure in rebalance crawl. No test case provided, as we cannot determine the read child in afr. Change-Id: If07508b4f92dacc17e0f695b48a866c7c66004be BUG: 859387 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4300 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/distribute: re-set layouts to prevent overlapsshishir gowda2012-12-114-8/+55
| | | | | | | | | | | | | | | | | When subvols-per-directory option is used, with bricks addition/removal the layouts might get distributed to other subvols, which were not part of the layout before. We need to clean up layouts on old subvolumes, to prevent overlaps. Also, we need to make sure if layout-cnt is never less than subvolume-cnt. Change-Id: I00994a092ca0c99aedcc41bd9412d43460f88a04 BUG: 884455 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4281 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* dht: support auto-NUFA optionJeff Darcy2012-12-041-9/+67
| | | | | | | | | | | | | | | Many people have asked for behavior like the old NUFA, which builds and seems to run but was previously impossible to enable/configure in a standard way. This change allows NUFA to be enabled instead of DHT from the command line, with automatic selection of the local subvolume on each host. Change-Id: I0065938db3922361fd450a6c1919a4cbbf6f202e BUG: 882278 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4234 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* fix memory leaksRaghavendra Bhat2012-12-042-1/+3
| | | | | | | | | | | | | * write-behind: free the inode context in wb_forget * distribute: in readdirp callback put the allocated context to the inode * distribute: check if the layout is NULL before accessing it in layout_unref Change-Id: I7698f81b85b99d06bf6b01fc1a6e51e1593b5e27 BUG: 790709 Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com> Reviewed-on: http://review.gluster.org/4250 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: fail fix-layout if any of the subvol is downshishir gowda2012-11-295-35/+47
| | | | | | | | | | | | | | | | If any subvolume is down, and a layout is re-written and hash values change, entry names in the downed subvol can be reused in the other subvol which got the same hash range. when the downed subvol is brought back up, duplicate entried might appear Also separated handling of ENOSPC and ENOTCONN error. Change-Id: I5ed93990425a4cee70df2dab7c7c119fdc87ad56 BUG: 860663 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4000 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: Heal dir uid/gidshishir gowda2012-11-294-1/+105
| | | | | | | | | | | | Identify mismatching uid/gid in lookup, and trigger a syncop heal. uid/gid of subvol with latest ctime is trusted (local->prebuf). Change-Id: Ib5c4bc438e7f4b1f33080e73593f40f400e997f0 BUG: 862967 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/3964 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/dht: send ACCESS call on dir to first_up_subvol if cached is downshishir gowda2012-11-291-0/+11
| | | | | | | | | Change-Id: I4f518a969bbe3a11075e7c9ae10bd21bf059d5f3 BUG: 867253 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/4240 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/distribute: send getxattr on LOCKINFO to only cached subvolumes.Raghavendra2012-11-271-1/+3
| | | | | | | | | | | | lk is sent to only cached subvolume. Hence there is no point in sending LOCKINFO to other children (even in case of directories). Change-Id: Ia20fc358dfa84cee9a52d1f613564ff6f25aa0c9 BUG: 808400 Signed-off-by: Raghavendra <raghavendra@gluster.com> Reviewed-on: http://review.gluster.org/4123 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* libglusterfs: Implement float percentagePranith Kumar K2012-11-234-20/+21
| | | | | | | | | Change-Id: Ia7ea63471f0bbd74686873f5f6f183475880f1a0 BUG: 839595 Signed-off-by: Pranith Kumar K <pranithk@gluster.com> Reviewed-on: http://review.gluster.org/4162 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: dump the layout information of directories onlyRaghavendra Bhat2012-11-191-9/+18
| | | | | | | | | | | | | | | testcase: The changes are for removing gf_log from statedump related sections in dht and using pthread_mutex_trylock in statedump sections. Changes are internal. So tests were done by attaching gdb to the process and executing by manually changing the values of some of the pointers. Change-Id: I41fa76c1812b462cb76f5bbf2fd14de080e73895 BUG: 843822 Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com> Reviewed-on: http://review.gluster.org/4117 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/dht: ignore empty ->hashed_subvol during lookupVenky Shankar2012-10-171-9/+1
| | | | | | | | | | | | | | | ->hashed_subvol is not valid (== NULL) when the subvolume the entity hashes to is down. For directories, we need not rely on ->hashed_subvol as we aggregate information from all subvolumes. So, during lookup, NULL ->hashed_subvol is ingored but logged. Change-Id: I306e4e274fe29d60ff028add4a6c3bcd67b2f314 Signed-off-by: Venky Shankar <vshankar@redhat.com> BUG: 856459 Reviewed-on: http://review.gluster.org/4046 Reviewed-by: Anand Avati <avati@redhat.com> Tested-by: Anand Avati <avati@redhat.com>
* cluster/distribute: Always return the latest time in struct iatt.shishir gowda2012-10-166-50/+264
| | | | | | | | | | | | | | | | | | | save the a/c/mtime in inode_ctx, and dht_inode_ctx_update checks the passed iatte, and updates the stat's time, and inode_ctx's time accordingly. For preparent times, only the iatt stat to be returned is updated, not the ctx. With this, update, WIPE is removed, as we would always be passing back the latest mtime, and hence cache times will be relevant. TODO-handle rename WIPE calls Change-Id: I8e4c738cd830f3fafeef789c9181f9c242ac96a2 BUG: 857791 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/3737 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* build: split CPPFLAGS from CFLAGSJeff Darcy2012-10-031-2/+3
| | | | | | | | | | | | | | | | | Automake provides a separate variable for preprocessor flags (*_CPPFLAGS). They are already uses in a few places, so make it consistent and use it everywhere. Note that cflags obtained from pkg-config often are cppflags, which is why LIBXML2_CFLAGS moves with into AM_CPPFLAGS, for example. Change-Id: I15feed1d18b2ca497371271c4b5876d5ec6289dd BUG: 862082 Original-author: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4029 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* build: remove useless explicit -fPIC -shared fromJeff Darcy2012-10-031-2/+2
| | | | | | | | | | | | | | | | | | | | CFLAGS libtool will automatically add "-fPIC" to the compiler command line as needed, so there is no need to specify it separately. "-shared" is normally a linker flag and has an odd effect when used with libtool --mode=compile, namely that it inhibits production of static objects. For that however, using AC_DISABLE_STATIC is a lot simpler. Change-Id: Ic4cba0fad18ffd985cf07f8d6951a976ae59a48f BUG: 862082 Original-author: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4027 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* build: remove -nostartfiles flagJeff Darcy2012-10-021-1/+1
| | | | | | | | | | | | | | | The "-nostartfiles" is a discouraged option and is documented to potentially result in undesired behavior. Since I see no reason why it should be in glusterfs, remove it. Change-Id: I56f2b08874516ebad91447b2583ca2fb776bb7ab BUG: 862082 Original-author: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4018 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* build: consolidate common compilation flags into one variableJeff Darcy2012-10-011-1/+1
| | | | | | | | | | | | | | | Some -D flags are present in all files, so collect them. This adds -D${GF_HOST_OS} to some compiler command lines, but this should not be a problem. Change-Id: I1aeb346143d4984c9cc4f2750c465ce09af1e6ca BUG: 862082 Original-author: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4013 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* Fixed some general typing errors.Varun Shastry2012-09-271-1/+1
| | | | | | | | | | | | Eg: changed recieved to received Change-Id: I360fcb99c97c8a0222e373fee20ea2fccfb938db BUG: 860543 Signed-off-by: Varun Shastry <vshastry@redhat.com> Reviewed-on: http://review.gluster.org/3998 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* dht: improve dht_fix_layout_of_directory for better re-assignmentAnand Avati2012-09-121-145/+78
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jeff Darcy wrote: > AFAICT, the fix-layout code doesn't do the same rotation that the > new-directory code does. Therefore, the new bricks always claim > completely predictable hash ranges for every directory, leading to > either a 0-1-2-3 pattern or a 1-0-2-3 pattern. In other words, a > file whose hash falls into the second quarter of the range will always > be assigned to brick 2, and a file whose hash falls into the fourth > quarter will always be assigned to brick 3. The rest will be split > according to the original pattern. Put still another way, instead of > same-named files in different directories being spread across N bricks, > they might be spread across only two bricks (bad) or totally > concentrated on one brick (worse) regardless of N. The current dht_fix_layout_of_directory() code, in an attempt to maximize overlap of new layout with existing layout (to minimize movement of data) fails to do a good job of randomizing new assignment even when it could do a better job. In an example where we expand from 2 nodes to 4 nodes, the current possibilities are limited in the following way - (theoretical hash range: 00 - 99) OLD 1 ----- server1: 00 - 49 server2: 50 - 99 NEW 1 ----- server1: 00 - 24 server2: 50 - 74 server3: 25 - 49 server4: 75 - 99 OLD 2 ----- server1: 50 - 99 server2: 00 - 49 NEW 2 ------ server1: 50 - 74 server2: 00 - 24 server3: 25 - 49 server4: 75 - 99 The above shows that when add-brick from 2 bricks to 4 bricks, server3 and server4 always get the _same_ hash range no matter what the original hash range assignment was. The fix in this patch is first do the standard new directory assignment to a directory (with rotation etc.) and then do the reassignment to maximize overlap. This way newly added servers still get random ranges and existing servers have a probability of getting either of the quarters which were part of its half previously. The same principles hold for all add-brick from M to M+N. Change-Id: I0cbbf3bfa334645728072d66aaaa80120d0b295f BUG: 853258 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/3883 Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/dht: handle percent option for 'min-free-disk'Amar Tumballi2012-09-071-0/+11
| | | | | | | | | | | | | | | * with the init option cleanups, setting of 'conf->disk_unit' was reset, which made it not set the '%' in the option. * bring a global check, which makes the option assume its percent, as long as value is < 100. Change-Id: I00bd1395a309cdc596a2b2b80304c6d98696a24a Signed-off-by: Amar Tumballi <amarts@redhat.com> BUG: 852889 Reviewed-on: http://review.gluster.org/3918 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/distribute: remove gf_log() from statedump functionsAmar Tumballi2012-09-061-3/+0
| | | | | | | | | Change-Id: I83cccab6819d6a74e96c2717ca539fa1568cac89 Signed-off-by: Amar Tumballi <amarts@redhat.com> BUG: 843822 Reviewed-on: http://review.gluster.org/3912 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* libglusterfs/dict: make 'dict_t' a opaque objectAmar Tumballi2012-09-061-14/+11
| | | | | | | | | | | | | | | * ie, don't dereference dict_t pointer, instead use APIs everywhere * other than dict_t only 'data_t' should be the valid export from dict.h * added 'dict_foreach_fnmatch()' API * changed dict_lookup() to use data_t, instead of data_pair_t Change-Id: I400bb0dd55519a7c5d2a107e67c8e7a7207228dc Signed-off-by: Amar Tumballi <amarts@redhat.com> BUG: 850917 Reviewed-on: http://review.gluster.org/3829 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* dht/rebalance: set the correct ownership on the dst file.shishir gowda2012-08-281-0/+8
| | | | | | | | | | | | | | | Currently, the dst file created has root:root ownership, till migration is completed. During this phase, open fails on the dst file if uid/gid is non-root. Setting the dst_file to the correct ownership fixes the issue Change-Id: Icfec89eb10dc866cdee38dab17695fe21174ef99 BUG: 852361 Signed-off-by: shishir gowda <sgowda@redhat.com> Reviewed-on: http://review.gluster.org/3861 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Amar Tumballi <amarts@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>