Commit message | Author | Age | Files | Lines
* object-storage: use tox for unit tests; fix em too | Peter Portante | 2013-04-05 | 19 | -76/+91
| | | | | | | | | | | | | | | | | | | | | | | | Add the ability to use tox for unit tests, since it helps us solve the problem of supporting multiple branches that require different versions of dependencies, and allows us to possibly support multiple versions of python in the future. Also fix the code to work with pre-grizzly environments, by not requiring the constraints backport. Also fixed the xattr support to work with both pyxattr and xattr modules. And fixed the ring tests to also work without a live /etc/swift directory. BUG: 948657 (https://bugzilla.redhat.com/show_bug.cgi?id=948657) Change-Id: I2be79c8ef8916bb6552ef957094f9186a963a068 Signed-off-by: Peter Portante <peter.portante@redhat.com> Reviewed-on: http://review.gluster.org/4781 Reviewed-by: Alex Wheeler <wheelear@gmail.com> Tested-by: Alex Wheeler <wheelear@gmail.com> Reviewed-by: Anand Avati <avati@redhat.com>
* object-storage: Import missing sys and errno modules. | Mohammed Junaid | 2013-04-04 | 2 | -2/+98
Import the missing modules and implement a unit test case for the Glusterfs module. Thanks to Paul Smith for pointing it out.

Change-Id: Ib04202aa0ae05c4da2ebbf11f87d6accc778f827
BUG: 905946
Signed-off-by: Mohammed Junaid <junaid@redhat.com>
Reviewed-on: http://review.gluster.org/4758
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
* gsync: Display additional information in status command | sarvotham s pai | 2013-04-04 | 2 | -4/+52
Added code to display extra information when the status command is executed. The information now shown is:
1 Number of files synced
2 Crawl time
3 Total sync time
4 Bytes synced

Bytes synced is taken from the rsync output. The --stats option of rsync gives extra information about the sync. The stats output has a field called "Total transferred file size", which states the number of bytes synced. This information is parsed from stdout using regular expressions. The bytes-synced figure can be used to calculate throughput.

Change-Id: Id9bba9fff45ee7049bb8257c6fd918e5237e05b1
BUG: 947774
Signed-off-by: sarvotham s pai <spai@redhat.com>
Reviewed-on: http://review.gluster.org/4749
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
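The parsing step described in the gsync commit above can be sketched roughly as follows. This is only an illustration, not the gsyncd code: the function names, the sample output and the assumption that the value may carry thousands separators and ends in "bytes" are all hypothetical.

```python
import re

# Matches the rsync --stats field named in the commit message.
# Assumption: the number may contain thousands separators and is followed by "bytes".
_TRANSFERRED_RE = re.compile(
    r"^Total transferred file size:\s*([\d,]+)\s*bytes", re.MULTILINE
)


def parse_bytes_synced(rsync_stdout):
    """Return the transferred byte count from rsync --stats output, or None."""
    match = _TRANSFERRED_RE.search(rsync_stdout)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def throughput(bytes_synced, sync_seconds):
    """Bytes per second; 0.0 when no time has elapsed."""
    return bytes_synced / sync_seconds if sync_seconds > 0 else 0.0


# Hypothetical excerpt of rsync --stats output, used only for illustration.
SAMPLE = """Number of files: 3
Number of files transferred: 2
Total file size: 4096 bytes
Total transferred file size: 2048 bytes
"""
```

Anchoring the pattern to the start of a line keeps it from matching the similar "Total file size" field.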
* tests: Remove grep process entries from pidgrep | Raghavendra Talur | 2013-04-04 | 1 | -1/+1
Problem: We were picking the process with the lowest PID from the ps|grep result. However, the lowest PID need not belong to the oldest process, because PIDs can be recycled.

Solution: Removed the grep process entries from the ps output using grep -v grep.

Change-Id: I2b9687a05a34cf6358f773183770d69a3fb9eb10
BUG: 858488
Signed-off-by: Raghavendra Talur <rtalur@redhat.com>
Reviewed-on: http://review.gluster.org/4765
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: Treat all dir fop failure as success in changelog | Pranith Kumar K | 2013-04-03 | 3 | -2/+35
For example: if a new entry creation fop fails with EEXIST, or a delete entry fop fails with ENOENT, on all the subvols to which the fop is wound, then no change took place in the directory. So we can treat that case as if nothing happened to the directory.

Change-Id: I3b3a7931954da2166a9cba19ff9f76f37739d751
BUG: 860210
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4626
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
* posix: fix dangerous "sharing" of fd in readdir between two requests | Anand Avati | 2013-04-03 | 1 | -2/+17
| | | | | | | | | | | | | | | | | | | | | | | posix_fill_readdir() is a multi-step function which performs many readdir() calls, and expects the directory cursor to have not "seeked away" elsewhere between two successive iterations. Usually this is not a problem as each opendir() from an application has its own backend fd, and there is nobody else to "seek away" the directory cursor. However in case of NFS's use of anonymous fd, the same fd_t is shared between all NFS readdir requests, and two readdir loops can be executing in parallel on the same dir dragging away the cursor in a chaotic manner. The fix in this patch is to lock on the fd around the loop. Another approach could be to reimplement posix_fill_readdir() with a single getdents() call, but that's for another day. Change-Id: Ia42e9c7fbcde43af4c0d08c20cc0f7419b98bd3f BUG: 948086 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4774 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
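The fix described in the posix commit above (hold a lock on the shared fd around the whole multi-step readdir loop) can be illustrated with a small model. This is a sketch under assumptions, not the posix translator code: `SharedDirHandle` and `fill_readdir` are hypothetical stand-ins for the shared fd and for posix_fill_readdir().

```python
import threading


class SharedDirHandle:
    """Stand-in for one directory handle shared by several requests."""

    def __init__(self, entries):
        self._entries = list(entries)
        self._cursor = 0
        self.lock = threading.Lock()

    def seek(self, offset):
        self._cursor = offset

    def read_one(self):
        """Return the entry at the cursor and advance, or None at the end."""
        if self._cursor >= len(self._entries):
            return None
        entry = self._entries[self._cursor]
        self._cursor += 1
        return entry


def fill_readdir(handle, offset, count):
    """Seek, then read up to `count` entries, holding the lock for the whole loop."""
    batch = []
    with handle.lock:
        # Nobody else can move the cursor between the seek and the last read.
        handle.seek(offset)
        for _ in range(count):
            entry = handle.read_one()
            if entry is None:
                break
            batch.append(entry)
    return batch
```

Taking the lock only around each individual read would not be enough here, because the loop depends on the cursor staying put between iterations.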
* cluster/afr: Made afr_sh_purge_entry_common message log more clear. | Venkatesh Somyajulu | 2013-04-03 | 1 | -3/+1
FIX: In missing entry self-heal, once the source directories are determined after the lookup, if the file is not present on any of the bricks which contain the source directory, the entry is removed from the directory. So the log message should say "Purging of entry".

Change-Id: I4d3deb602e0812dc1c9c8ba0a466716d81dede7e
BUG: 947312
Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/4753
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
* dict: Put "goto out" in dict_unserialize to avoid process crash | Venkatesh Somyajulu | 2013-04-03 | 1 | -0/+1
Problem: In the dictionary unserialization function (dict_unserialize), if (buf + vallen) > (orig_buf + size), the subsequent memdup fails.

Fix: Put "goto out" whenever this condition is met.

Change-Id: I662628a936596dbb47825aad47d7dbab2879eb07
BUG: 947824
Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/4767
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
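The bounds check in the dict commit above amounts to refusing to read a value whose declared length runs past the end of the buffer. Here is a minimal sketch of that idea. The wire layout used (big-endian pair count, then key length, value length, NUL-terminated key, value) is a simplified assumption for illustration, and `serialize_pairs` / `unserialize_pairs` are hypothetical names, not the libglusterfs API.

```python
import struct


def serialize_pairs(pairs):
    """Pack a dict of str -> bytes into the simplified layout."""
    out = [struct.pack(">i", len(pairs))]
    for key, value in pairs.items():
        raw_key = key.encode("ascii")
        out.append(struct.pack(">ii", len(raw_key), len(value)))
        out.append(raw_key + b"\0" + value)
    return b"".join(out)


def unserialize_pairs(buf):
    """Parse the layout; bail out with ValueError rather than read past the end."""
    size = len(buf)
    if size < 4:
        raise ValueError("buffer too small for pair count")
    (count,) = struct.unpack_from(">i", buf, 0)
    if count < 0:
        raise ValueError("negative pair count")
    pos = 4
    pairs = {}
    for _ in range(count):
        if pos + 8 > size:
            raise ValueError("truncated key/value lengths")
        keylen, vallen = struct.unpack_from(">ii", buf, pos)
        pos += 8
        if keylen < 0 or vallen < 0:
            raise ValueError("negative length")
        if pos + keylen + 1 > size:  # key plus its NUL terminator
            raise ValueError("key overruns buffer")
        key = buf[pos:pos + keylen].decode("ascii")
        pos += keylen + 1
        if pos + vallen > size:  # analogue of (buf + vallen) > (orig_buf + size)
            raise ValueError("value overruns buffer")
        pairs[key] = buf[pos:pos + vallen]
        pos += vallen
    return pairs
```

Raising on the first failed check plays the role of the "goto out": parsing stops immediately instead of continuing with a length that cannot be trusted.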
* pump: Set self-heal readdir size in pump | Pranith Kumar K | 2013-04-03 | 1 | -0/+1
Problem: In pump, entry self-heal happens for each directory during the first opendir, using conservative merge. But in entry self-heal, readdir is issued with size '0', so entry self-heal does not create any files. After pump thinks entry self-heal is complete, it proceeds to heal each file in the directory it just healed. Fortunately, most of the time it chooses the source brick in pump as the read-child for readdir. This happens because the read-child is the subvolume on which lookup succeeds first, and in pump lookup usually succeeds faster in the local process than on the destination brick process. For every entry pump finds in readdir, it does a lookup. During this lookup the entry on the destination brick is created and healed. This is why replace-brick succeeds whenever the source brick is chosen as the read-child for the directory, which is most of the time. When the destination brick is chosen as the read-child, readdir returns no entries, so replace-brick completes without syncing the whole data.

Fix: Set readdir-size in pump so that entry self-heal happens with 64k size. This ensures that entry self-heal triggered from opendir actually creates the files on the destination brick.

Change-Id: I65ea45d3c2735a9578f3aa34eff771b6563241ca
BUG: 909800
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4712
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
* build: enable fusermount by default | Anand Avati | 2013-04-03 | 2 | -11/+12
| | | | | | | | | | | | | | | | | | The fusermount available in gluster is customized to ensure mounting with SELinux happens properly, i.e - to have a separate thread for fuse_thread_proc which can process getxattr requests and in parallel perform sys_mount() in a different thread, thereby avoiding a deadlock. However our build and packaging defaults to not including our fusermount. This patch reverses the defaults. Change-Id: I793af4c2f56aeac46efae3db30e7c64ee7c18850 BUG: 811217 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4773 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* protocol/client: Print valid loc identifiers | Pranith Kumar K | 2013-04-03 | 3 | -36/+43
| | | | | | | | | | Change-Id: I45f91105862a2484b8906a7a63b98ab4aaf80d05 BUG: 924643 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4683 Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* dht: make nufa/switch call dht's init/fini | Jeff Darcy | 2013-04-03 | 8 | -1046/+806
| | | | | | | | | | | | | These functions keep changing as new functionality is added, so copying and pasting the code is not a good solution. This way ensures that all fields get initialized properly no matter how much new stuff we throw in. Change-Id: I9e9b043d2d305d31e80cf5689465555b70312756 BUG: 924488 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4710 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* core: add dispatch table for init/fini | Jeff Darcy | 2013-04-03 | 2 | -10/+27
| | | | | | | | | | | | | This adds a layer of indirection so that derivative translators such as NUFA and switch can refer to the parent's init/fini (in both cases DHT's) without having to create stub functions. Change-Id: I1af1fea70a9ddd2aa20485af7ae65f9660f19dd6 BUG: 924490 Signed-off-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-on: http://review.gluster.org/4709 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/distribute: Start rebalance with option readdir-optimize on | shishir gowda | 2013-04-03 | 1 | -0/+1
With readdir-optimize set to on, we instruct the posix layer to ignore directory entries from any subvolume other than the first. DHT already discards directories returned from non-first subvolumes. By making posix itself ignore them, we make directory crawls faster.

Change-Id: Ia1faf2dedec0c615c0632c3c063e846f5742ede6
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4613
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
* build: require /usr/bin/fusermount when not carrying our own version | Niels de Vos | 2013-04-03 | 1 | -0/+3
| | | | | | | | | | | | | | | | | | The fuse.so from glusterfs-fuse will try to call /usr/bin/fusermount. This obviously fails when the fuse package is not installed and fusermount is not available. In order to prevent this problem, the glusterfs-fuse package should require /usr/bin/fusermount so that it gets automatically pulled in when glusterfs-fuse is installed with yum. BUG: 947830 Change-Id: I20fe836a1aaf751dbc04d9ec4ba5ea50573c71c5 Signed-off-by: Niels de Vos <ndevos@redhat.com> Reviewed-on: http://review.gluster.org/4768 Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* rpc-lib: Move defaulting to socket message to debug level | Raghavendra Talur | 2013-04-02 | 1 | -1/+1
| | | | | | | | | | | | | | | | | | Problem: For every gluster cli operation from command line rpc init process is required. During init process we print "no transport type set, defaulting to socket" message at WARNING level, which is not necessary. Solution: Moved the log level to DEBUG. Change-Id: I73f4644264368b0f6c11a77ef66595018784ce79 BUG: 928204 Signed-off-by: Raghavendra Talur <rtalur@redhat.com> Reviewed-on: http://review.gluster.org/4727 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cli: add more meaningful error messages | Bala.FA | 2013-04-02 | 1 | -84/+151
| | | | | | | | | Change-Id: I6e88e6763fa537f4705427b4673d86e6443c2c98 BUG: 928648 Signed-off-by: Bala.FA <barumuga@redhat.com> Reviewed-on: http://review.gluster.org/4747 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* glusterd: add more specific log messages | Bala.FA | 2013-04-02 | 1 | -8/+8
| | | | | | | | | Change-Id: I57fbdd83f3098e64886c3dd690a1ae04fc37442d BUG: 928648 Signed-off-by: Bala.FA <barumuga@redhat.com> Reviewed-on: http://review.gluster.org/4739 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* Tests: Change rebalance status verification | Pranith Kumar K | 2013-04-02 | 3 | -4/+4
| | | | | | | | | | | | | | According to the comment at the following URL https://bugzilla.redhat.com/show_bug.cgi?id=916226#c2 "success:" can come even before rebalance is completed. Changed it to check for "completed" instead. Change-Id: Ibe9d3b75493240f30261ac2a1280f32ef32886da BUG: 916226 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4614 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: detect in-progress creation in lookup and return ENOENT | Pranith Kumar K | 2013-04-02 | 1 | -0/+57
If any subvol returned ENOENT while the parent entrylk lock was held, yield and return ENOENT for the entire lookup.

This is how the issue happens. Multiple clients A, B and C are attempting 'mkdir -p /mnt/a/b/c':

1 Client A is in the middle of mkdir(/a). It has acquired the lock. It has performed mkdir(/a) on one subvol, and the second one is still in progress.
2 Client B performs a lookup, sees directory /a on one subvol and ENOENT on the other, and the lookup succeeds.
3 Client B performs lookup on /a/b on both subvols; both return ENOENT (one subvol because /a/b does not exist, the other because /a itself does not exist).
4 Client B proceeds to mkdir /a/b. It obtains entrylk on inode=/a with basename=b on one subvol, but fails on the other subvol as /a is yet to be created by Client A.
5 Client A finishes mkdir of /a on the other subvol.
6 Client C also attempts to create /a/b; lookup returns ENOENT on both subvols.
7 Client C tries to obtain entrylk on inode=/a with basename=b, obtains it on one subvol (where B had failed), and waits for B to unlock on the other subvol.
8 Client B finishes mkdir() on one subvol with GFID-1, completes the transaction and unlocks.
9 Client C gets the lock on the second subvol. At this stage the second subvol already has /a/b created by Client B, but Client C does not check that in the middle of the mkdir transaction.
10 Client C attempts mkdir /a/b on both subvols. It succeeds on ONLY ONE (where Client B could not get the lock because of the missing parent /a dir) with GFID-2, and gets EEXIST from ONE subvol.

This way we have /a/b in GFID mismatch. One subvol got GFID-1 because Client B performed the transaction on only one subvol (because entrylk() could not be obtained on the second subvol because of the missing parent dir -- caused by premature/speculative succeeding of lookup() on /a when locks are detected).

The other subvol gets GFID-2 from Client C because, while it was waiting for entrylk() on both subvols, Client B was in the middle of creating mkdir() on only one subvol, and Client C does not "expect" this when it is between the lock() and pre-op()/op() phases of the transaction.

Original-author: Anand Avati <avati@redhat.com>
Change-Id: Idca475dbbc2a51e09da6fa0f9e1e37148caef208
BUG: 860210
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4625
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
* Adds missing functions to ring.py, and more thorough tests. | Alex Wheeler | 2013-04-02 | 2 | -14/+81
Situation: The function get_part_nodes is being called by Openstack-Swift's proxy/controllers/base.py:
https://github.com/openstack/swift/blob/1.7.4/swift/proxy/controllers/base.py#L410
https://github.com/openstack/swift/blob/1.7.6/swift/proxy/controllers/base.py#L447
As this was not implemented in the current GlusterFS version of ring.py, it was calling swift's original get_part_nodes, which would often return the incorrect node, resulting in the incorrect drive being associated with a request.

There is another function that the original ring.py implements -- get_other_nodes, which has to do with replication. Since GlusterFS is handling replication, this function should never be called. However, in the interest of completeness, that function is also being replaced.

Code changes: The two functions, get_part_nodes and get_other_nodes, have been implemented to override the default functions, and get_nodes has been updated to store information about the account being operated on in self vars, as the two new functions are not called with that info and get_nodes appears to always be called first. The code has been refactored so that all of them call _get_part_nodes, much like swift has refactored its code.

Reason for implementing it this way: I didn't see a better way to do it, but am open to suggestions.

Test cases: A mock ring is created with two different devices, test and iops.
test_first_device: Ensure that the first device, test, is returned for both get_nodes and get_part_nodes, and that get_more_nodes returns volume_not_in_ring.
test_invalid_device: Ensure that a request for a non-existent device returns volume_not_in_ring.
test_second_device: Same as test_first_device, but for the second device, iops, instead of test.
test_second_device_with_reseller_prefix: Test that calling with the reseller prefix, AUTH_, will still return the correct data.

Change-Id: I2f3d526934a47b01e9c065d0edf0fbf06f300369
BUG: 924792
Signed-off-by: Alex Wheeler <wheelear@gmail.com>
Reviewed-on: http://review.gluster.org/4748
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
* rpc/nfs: cleanup legacy code of general options | Rajesh Amaravathi | 2013-04-02 | 5 | -454/+268
| | | | | | | | | | | | | | | | | | Removing the code which handles "general" options. Since it is no longer possible to set general options which apply for all volumes by default, this was redundant. This cleanup of general options code also solves a bug wherein with nfs.addr-namelookup on, nfs.rpc-auth-reject wouldn't work on ip addresses Change-Id: Iba066e32f9a0255287c322ef85ad1d04b325d739 BUG: 921072 Signed-off-by: Rajesh Amaravathi <rajesh@redhat.com> Reviewed-on: http://review.gluster.org/4691 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* synctask: introduce synclocks for co-operative locking | Anand Avati | 2013-04-02 | 2 | -1/+167
This patch introduces synclocks - co-operative locks for synctasks. Synctasks yield themselves when a lock cannot be acquired at the time of the lock call, and the unlocker will wake the yielded locker at the time of unlock.

The implementation is safe in a multi-threaded syncenv framework. It is also safe for sharing the lock with non-synctasks, i.e. the same lock can be used for synchronization between a synctask and a regular thread. In such a situation, waiting synctasks will yield themselves while non-synctasks will sleep on a cond variable. The unlocker (which could be either a synctask or a regular thread) will wake up any type of lock waiter (synctask or regular).

Usage:

Declaration and Initialization
------------------------------

    synclock_t lock;

    ret = synclock_init (&lock);
    if (ret) {
        /* lock could not be allocated */
    }

Locking and non-blocking lock attempt
-------------------------------------

    ret = synclock_trylock (&lock);
    if (ret && (errno == EBUSY)) {
        /* lock is held by someone else */
        return;
    }

    synclock_lock (&lock);
    {
        /* critical section */
    }
    synclock_unlock (&lock);

Change-Id: I081873edb536ddde69a20f4a7dc6558ebf19f5b2
BUG: 763820
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4717
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Raghavendra G <raghavendra@gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster/afr: sync xattrs removed on source to sink(s) | Venky Shankar | 2013-04-02 | 5 | -12/+227
| | | | | | | | | | | | | xattrs are first removed from sink followed by setting source xattrs. Change-Id: I181cb5b785b667bbfc6e40787a2183a8f45de06b BUG: 906646 Signed-off-by: Venky Shankar <vshankar@redhat.com> Reviewed-on: http://review.gluster.org/4656 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* cluster/afr: prevent piggyback on stale pre_op | Pranith Kumar K | 2013-04-02 | 1 | -33/+2
Here are the logs of a file on which we saw EIO because of size mismatch:

[root@lizzie ~]# grep 38f18204 /var/log/glusterfs/mnt-x-.log
Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 0, len: 7680
Cleared unstable write flag for 38f18204-2840-408e-ae65-c01f4106b8c4: offset 0 length 7680
Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 7680, len: 71680
Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 79360, len: 15716
fsync completed on 38f18204-2840-408e-ae65-c01f4106b8c4 for offset 0 length 7680 with changelog status: -1 -1

According to these logs, fsync did not happen after the writev with offset: 79360, len: 15716, which is the reason for this problem.

In total 3 writes came. Let's call them w1, w2, w3.

w1 does pre_op, so the pre_op_done[0], pre_op_done[1] counts become 1 and 1. Then is_piggyback_post_op() is called for w1 and it returns *false*, and w1's fsync is fired.

Now w2 and w3 come and see that pre_op_done[0], pre_op_done[1] are both 1, so pre_op_piggyback[0] and pre_op_piggyback[1] are both incremented twice, once by w2 and once more by w3, and become 2, 2. ------- Step-A

Now the fsync of w1 is complete and it goes ahead with post op and decrements pre_op_done[0], pre_op_done[1] to 0, 0.

Now the w2, w3 writevs complete and is_piggyback_post_op will return *true* for both w2 and w3. So fsync is not fired for either w2 or w3.

This patch prevents Step-A from happening.

Change-Id: I8b6af1f1875b2cf5f718caa3c16ee7ff3dc96b5c
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4752
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* dht: improve transform/detransform of d_off (and be ext4 safe) | Anand Avati | 2013-04-01 | 1 | -5/+82
The scheme to encode brick d_off and brick id into global d_off has two approaches. Since brick d_off and global d_off are both 64 bits wide, we need to be careful about how the brick id is encoded.

Filesystems like XFS always give a d_off which fits within 32 bits. So we have another 32 bits (actually 31, in this scheme, as seen ahead) to encode the brick id - which is typically plenty.

Filesystems like the recent EXT4 utilize up to 63 low bits in d_off, as the d_off is calculated based on a hash function value. This leaves us no "unused" bits to encode the brick id.

However, both these filesystems (EXT4 more importantly) are "tolerant" in terms of the accuracy of the value presented back in seekdir(), i.e. a seekdir(val) actually seeks to the entry which has the "closest" true offset.

This "two-prong" scheme exploits this behavior - which seems to be the best middle ground amongst various approaches and has all the advantages of the old approach:

- Works against XFS and EXT4, the two most common filesystems out there. (which wasn't an "advantage" of the old approach as it is broken against EXT4)
- Probably works against most of the others as well. The ones which would NOT work are those which return HUGE d_offs _and_ are NOT tolerant to seekdir() to the "closest" true offset.
- Nothing to "remember in memory" or evict "old entries".
- Works fine across NFS server reboots and also NFS head failover.
- Tolerant to seekdir() to arbitrary locations.

Algorithm: Each d_off can be encoded in either of the two schemes. There is no requirement to encode all d_offs of a directory or a reply-set in the same scheme.

The topmost bit of the 64 bits is used to specify the "type" of encoding of this particular d_off. If the topmost bit (bit-63) is 1, it indicates that the encoding scheme holds a HUGE d_off. If the topmost bit is 0, it indicates that the "small" d_off encoding scheme is used.

The goal of the "small" d_off encoding is to stay as dense as possible towards the lower bits even in the global d_off.

The goal of the HUGE d_off encoding is to stay as accurate (close) as possible to the "true" d_off after a round of encoding and decoding.

If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick ID (call it "n").

SMALL d_off
===========

Encoding
--------
If the top n + 1 bits are free in a brick offset, then we leave the top bit as 0 and set the remaining bits based on the old formula:

    hi_mask = 0xffffffffffffffff
    hi_mask = ~(hi_mask >> (n + 1))

    if ((hi_mask & d_off_brick) != 0)
        do_large_d_off_encoding ()

    d_off_global = (d_off_brick * N) + brick_id

Decoding
--------
If the top bit in the global offset is 0, it indicates that this is the encoding formula used. So decoding such a global offset will be like the old formula (the inverse of the encoding above):

    if ((d_off_global & 0x8000000000000000) != 0)
        do_large_d_off_decoding()

    d_off_brick = (d_off_global / N)
    brick_id = d_off_global % N

HUGE d_off
==========

Encoding
--------
If the top n + 1 bits are NOT free in a given brick offset, then we set the top bit as 1 in the global offset. The low n bits are replaced by brick_id.

    low_mask = 0xffffffffffffffff << n // where n is ROOF(Log2(N))

    d_off_global = (0x8000000000000000 | d_off_brick & low_mask) + brick_id

    if (d_off_global == 0xffffffffffffffff)
        discard_entry();

Decoding
--------
If the top bit in the global offset is set to 1, it indicates that the encoding formula used is the one above. So decoding would look like:

    hi_mask = (0xffffffffffffffff << n)
    low_mask = ~(hi_mask)

    d_off_brick = (global_d_off & hi_mask & 0x7fffffffffffffff)
    brick_id = global_d_off & low_mask

If "losing" the low n bits in this decoding of d_off_brick looks "scary", we need to realize that till recently EXT4 used to only return what can now be expressed as (d_off_global >> 32). The extra 31 bits of hash added by EXT4 recently only decrease the probability of a collision and do not eliminate it completely anyway. In a way, the "lost" n bits are made up by decreasing the probability of collision by sharding the files into N bricks / EXT directories -- call it "hash hedging", if you will :-)

Thanks-to: Zach Brown <zab@redhat.com>
Change-Id: Ieba9a7071829d51860b7c131982f12e0136b9855
BUG: 838784
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4711
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
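The two-prong d_off scheme from the dht commit above can be tried out with a short model. This is a sketch of the formulas in the commit message, not the DHT source: the function names are hypothetical, Python integers are masked to 64 bits by hand, and the small decoding is written as the inverse of the small encoding (the quotient gives the brick offset, the remainder gives the brick id).

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
TOP_BIT = 0x8000000000000000


def brick_bits(n_subvols):
    """ROOF(Log2(N)): bits needed to hold a brick id below n_subvols."""
    return (n_subvols - 1).bit_length()


def encode_d_off(d_off_brick, brick_id, n_subvols):
    """Return the global d_off, or None when the entry has to be discarded."""
    if not 0 <= brick_id < n_subvols:
        raise ValueError("brick_id out of range")
    n = brick_bits(n_subvols)
    hi_mask = ~(MASK64 >> (n + 1)) & MASK64
    if (d_off_brick & hi_mask) == 0:
        # SMALL: top n + 1 bits are free, so the result keeps bit 63 clear.
        return d_off_brick * n_subvols + brick_id
    # HUGE: set bit 63 and overwrite the low n bits with the brick id.
    low_mask = (MASK64 << n) & MASK64
    d_off_global = (TOP_BIT | (d_off_brick & low_mask)) + brick_id
    if d_off_global == MASK64:
        return None
    return d_off_global


def decode_d_off(d_off_global, n_subvols):
    """Return (d_off_brick, brick_id) for a global d_off."""
    if d_off_global & TOP_BIT:
        n = brick_bits(n_subvols)
        hi_mask = (MASK64 << n) & MASK64
        low_mask = ~hi_mask & MASK64
        return d_off_global & hi_mask & 0x7FFFFFFFFFFFFFFF, d_off_global & low_mask
    return d_off_global // n_subvols, d_off_global % n_subvols
```

With 4 subvolumes, a small offset such as 1000 on brick 3 round-trips exactly, while a huge offset comes back with its low 2 bits cleared, which is the loss of accuracy that the commit message argues seekdir() tolerates.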
* cli: Made volume top help string clear | M S Vishwanath Bhat | 2013-04-01 | 1 | -4/+3
The nfs option is not applicable for read-perf and write-perf.
The nfs option and the brick option cannot be used in the same command.

Change-Id: I920ba0de011df0cc5e0adca6597aaea9372fe592
BUG: 924335
Signed-off-by: M S Vishwanath Bhat <vbhat@redhat.com>
Reviewed-on: http://review.gluster.org/4706
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Reviewed-by: Raghavendra Talur <rtalur@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* mgmt/glusterd: Enable write-behind in nfs | Pranith Kumar K | 2013-04-01 | 1 | -1/+1
We observed that, without write-behind in the graph, the number of write requests, and thus of inodelks, increases very rapidly to thousands.

Change-Id: Id71c9c2b0a4c9601a4644a58a933221c62dab0c0
BUG: 928341
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4734
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* rpc: disable root-squash dynamically upon volume set command | Raghavendra Bhat | 2013-04-01 | 2 | -2/+68
| | | | | | | | | Change-Id: I2ba9ca339ffbe07cb74833165a46a941225b623d BUG: 927616 Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com> Reviewed-on: http://review.gluster.org/4722 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* storage/posix: honor O_SYNC and O_DSYNC sent in @flags of writev() | Anand Avati | 2013-03-29 | 2 | -5/+10
| | | | | | | | | | | | | | | | Historic bug - posix_writev() has been inspecting pfd->flushwrites for performing fsync() after write, instead of @flags for O_SYNC|O_DSYNC. pfd->flushwrites was never set anywhere and is unused completely. This is behavior from the time before anonymous FD where open() had @wbflags param. This is a leftover from that cleanup. Change-Id: Id9bfe562a60db4eb3bd0a7705bdba91f2df2f3ec BUG: 916372 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4738 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
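The behavior described in the storage/posix commit above (decide whether to sync from the flags passed with the write, not from per-fd state) can be sketched as follows. This is an illustration only, assuming a Unix-like platform with os.pwrite; `write_honoring_flags` and `demo` are hypothetical names, and the injectable `sync` callable exists purely so the sketch can be exercised without depending on real fsync() behavior.

```python
import os
import tempfile

# Not every platform defines both constants; fall back to 0 so the mask still works.
SYNC_FLAGS = getattr(os, "O_SYNC", 0) | getattr(os, "O_DSYNC", 0)


def write_honoring_flags(fd, data, offset, flags, sync=os.fsync):
    """Write at `offset`, then sync if the caller's flags ask for O_SYNC or O_DSYNC."""
    written = os.pwrite(fd, data, offset)
    synced = False
    if flags & SYNC_FLAGS:
        sync(fd)
        synced = True
    return written, synced


def demo():
    """Issue one plain write and one O_SYNC write against a temporary file."""
    sync_calls = []
    with tempfile.TemporaryFile() as handle:
        fd = handle.fileno()
        plain = write_honoring_flags(fd, b"abc", 0, os.O_WRONLY, sync=sync_calls.append)
        durable = write_honoring_flags(
            fd, b"def", 3, os.O_WRONLY | os.O_SYNC, sync=sync_calls.append
        )
        content = os.pread(fd, 6, 0)
    return plain, durable, len(sync_calls), content
```

Only the second write in `demo()` triggers the sync callable, because only its flags carry O_SYNC.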
* cluster/afr: fix fd leak with unsafe call_resume() | Anand Avati | 2013-03-28 | 2 | -1/+17
| | | | | | | | | | | | | | Introduce AFR_CALL_RESUME macro which cleans up frame->local, like how AFR_STACK_UNWIND etc. do. Therefore fix leak in afr_fsync() path. Change-Id: I3855d8e7e84dbc44e05f507563b7f722bf9621b8 BUG: 927146 Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-on: http://review.gluster.org/4745 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* cluster/afr: fsync before erase xattrs in data self-heal | Pranith Kumar K | 2013-03-28 | 1 | -1/+75
| | | | | | | | | | | | Added extra fsync to data self-heal code to make sure the data reached disk before erasing the changelogs Change-Id: I9e7e6e55cdc49de2b991705d1638946464a9d4f9 BUG: 927146 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4744 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* glusterfs.spec.in: sync with fedora glusterfs.spec | Kaleb S. KEITHLEY | 2013-03-28 | 2 | -0/+13
| | | | | | | | | | | | | add --without ufo Change-Id: If1b77003ded537f9664fa6ad677d48d118516c64 Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com> BUG: 819130 Reviewed-on: http://review.gluster.org/4742 Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: Luis Pabon <lpabon@redhat.com> Reviewed-by: Luis Pabon <lpabon@redhat.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: piggyback and fsync resume changes | Pranith Kumar K | 2013-03-28 | 3 | -16/+21
| | | | | | | | | | | | | | 1) pre_op_piggyback should always be decremented. 2) Move fsync resume to just after post_op. 3) fsync stub should be created from afr's local not from the final response. Change-Id: I220bb532eb03bea584292f4dd2e816ad0c3e0cf7 BUG: 927146 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/4741 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Anand Avati <avati@redhat.com>
* cluster/afr: fsync() guarantees POST-OP completion | Anand Avati | 2013-03-27 | 3 | -10/+56
| AFR now provides a stronger guarantee that fsync() returns only after
| completely finishing all the deferred/delayed POST-OPs on that open file.
|
| To achieve this, we make a stub out of the returning fsync and register
| it with the "delayed" frame in afr_changelog_wake_resume(). The delayed
| frame, after getting woken up and finishing the POST-OP, will
| call_resume() the registered stub (which UNWINDs the fsync) at the time
| of frame destruction.
|
| This provides a guarantee that an application's (or FUSE's) fsync()
| returns only after finishing up all the previous transactions, including
| delayed POST-OPs and UNLOCK.
|
| Change-Id: Iaa955457e2f25088a144fde37ad0444277b5cf49
| BUG: 927146
| Signed-off-by: Anand Avati <avati@redhat.com>
| Reviewed-on: http://review.gluster.org/4737
| Tested-by: Gluster Build System <jenkins@build.gluster.com>
| Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/afr: ensure DATA operations are made durable before POST-OP | Anand Avati | 2013-03-27 | 5 | -24/+316
| The changelogging scheme of AFR stores information about the state of
| all replicas in all replicas (in the extended attributes of the
| respective files on each server) in the form of 'pending counts' of
| operations (effectively "dirty flags").
|
| These xattrs are blindly trusted while performing self-heal, and
| therefore utmost care has to be taken while updating and maintaining
| them.
|
| The most critical update is the clearing of the pending counts
| corresponding to the *other* server in the changelog of a given server.
| Before clearing the pending count, we need a durability guarantee for
| the write which was performed on the other server. To obtain such a
| guarantee, it may be necessary to explicitly introduce an fsync() phase
| (if the file itself wasn't already opened with O_SYNC).
|
| This patch introduces the detection of unstable writes on a file and
| issues an explicit fsync() on the servers before performing the POST-OP
| clearing of pending flags.
|
| Change-Id: I2171b86a74ec91e40e5877eef0a4e7379578ecf7
| BUG: 927146
| Signed-off-by: Anand Avati <avati@redhat.com>
| Reviewed-on: http://review.gluster.org/4721
| Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
| Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
| Tested-by: Gluster Build System <jenkins@build.gluster.com>
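The ordering this commit describes (write, make durable, only then clear the pending count) can be sketched as a toy model. This is not AFR code: the `pending` dict is a hypothetical stand-in for the changelog xattrs, and `write_then_clear_pending` is an illustrative name, with a local file playing the role of the other server's copy.

```python
import os


def write_then_clear_pending(path: str, data: bytes, pending: dict, peer: str) -> None:
    """Write data, make it durable, and only then clear the peer's pending count."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # PRE-OP: mark the peer's copy as having an operation in flight.
        pending[peer] = pending.get(peer, 0) + 1
        # DATA operation (a real implementation would loop on short writes).
        os.write(fd, data)
        # Durability step: without O_SYNC the write may still sit in cache.
        os.fsync(fd)
        # POST-OP: it is now safe to clear the pending count.
        pending[peer] -= 1
    finally:
        os.close(fd)
```

If the fsync were skipped, a crash after the POST-OP could leave a cleared pending count next to data that never reached disk, which is the case the commit message warns about.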
* libglusterfs/dict: fix infinite loop in dict_keys_join() | Vijaykumar koppad | 2013-03-27 | 1 | -2/+4
| - missing "pairs = next" caused infinite loop
|
| Change-Id: I9171be5bec051de6095e135d616534ab49cd4797
| BUG: 905871
| Signed-off-by: Vijaykumar Koppad <vijaykumar.koppad@gmail.com>
| Reviewed-on: http://review.gluster.org/4723
| Reviewed-by: Venky Shankar <vshankar@redhat.com>
| Tested-by: Gluster Build System <jenkins@build.gluster.com>
| Reviewed-by: Anand Avati <avati@redhat.com>
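The bug class fixed here is a list walk whose cursor never advances. A minimal Python sketch of the pattern follows; it is not the GlusterFS C code, and `Pair` and `keys_join` are hypothetical names standing in for the dict's pair list and dict_keys_join().

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pair:
    """Hypothetical stand-in for one key/value pair in a linked list."""
    key: str
    next: Optional["Pair"] = None


def keys_join(head: Optional[Pair], sep: str = " ") -> str:
    """Join the keys of a singly linked list of pairs."""
    keys = []
    pairs = head
    while pairs is not None:
        nxt = pairs.next
        keys.append(pairs.key)
        # Without this assignment the loop would revisit the same pair
        # forever; this mirrors the "pairs = next" step the commit adds.
        pairs = nxt
    return sep.join(keys)
```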
* glusterd: Removed fd leaks in glusterfs_start utility function | Krishnan Parthasarathi | 2013-03-25 | 6 | -119/+104
| PROBLEM:
| The FILE* associated with the pidfile was leaked if
| pmap_registry_search on the brickinfo's path failed.
|
| FIX:
| Eliminates the use of the FILE* that was leaked. Uses the
| glusterd_is_service_running utility function in place of the earlier
| attempt to check for the same.
|
| Change-Id: I94082bd5a94b8a6340f8cc11726d3264e364efe6
| BUG: 916549
| Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
| Reviewed-on: http://review.gluster.org/4596
| Tested-by: Gluster Build System <jenkins@build.gluster.com>
| Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
| Reviewed-by: Anand Avati <avati@redhat.com>
* config: better (i.e. more portable) test for libxml2 | Kaleb S. KEITHLEY | 2013-03-25 | 4 | -15/+8
| Over the weekend I tried to build on MacOS X¹ and ran into the
| following issues:
|
| 1) The recent change to autogen.sh to test for pkg-config falls down.
| 2) After removing the pkg-config test in autogen.sh, w/o pkg-config the
|    PKG_CHECK_MODULES macro invocation in configure[.ac] falls down.
|    N.B. Solaris users run into this too, even though there's a (broken)
|    pkg-config package that can be installed.
| 3) There are other problems in the code related to fuse that are beyond
|    the scope of this.
|
| It seems that pkg-config is only a requirement for the definition of the
| PKG_CHECK_MODULES macro used to detect libxml2. Since this seems to be
| inherently unportable (at least to MacOS X and Solaris), I'd like to:
|
| A) Change the use of the PKG_CHECK_MODULES macro to the more portable
|    AM_PATH_XML2 macro provided by the libxml2 package in
|    /usr/.../share/aclocal/libxml.m4
| B) Revisit the decision to add the check for pkg-config in autogen.sh
|    in BZ 921817.
|
| For now this is just an RFC. If people are agreeable I'll reenter this
| change against BZ 921817.
|
| ¹Mountain Lion 10.8.3, XCode 4.6.1
|
| Change-Id: I237b1ed8919088345b8fd943423b2a6ad289981b
| BUG: 921817
| Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com>
| Reviewed-on: http://review.gluster.org/4720
| Reviewed-by: Justin Clift <jclift@redhat.com>
| Tested-by: Justin Clift <jclift@redhat.com>
| Tested-by: Gluster Build System <jenkins@build.gluster.com>
| Reviewed-by: Anand Avati <avati@redhat.com>
* glusterd: Simplify glusterd_service_stop() | Krishnan Parthasarathi | 2013-03-25 | 1 | -70/+11
| Change-Id: I396d250a3299ad1f7fce4bd14389b0c2756b6cb0
| BUG: 764890
| Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
| Reviewed-on: http://review.gluster.org/4718
| Tested-by: Gluster Build System <jenkins@build.gluster.com>
| Reviewed-by: Anand Avati <avati@redhat.com>
* glusterfsd: dump the in-memory graph rather than volfile | Anand Avati | 2013-03-24 | 5 | -14/+108
| Currently we print the volfile in the logfile verbatim, as received
| from the server. However, we perform pre-processing on the graph we
| receive from the server, such as adding the ACL translator, applying
| --xlator-option cli params, etc.
|
| So print the serialized in-memory graph as the "volfile" in the log.
| This can be very handy for double-checking whether a certain
| --xlator-option param actually got applied, and in general it shows a
| "truer" representation of the real graph actually used.
|
| Change-Id: I0221dc56e21111b48a1ee3e5fe17a5ef820dc0c6
| BUG: 924504
| Signed-off-by: Anand Avati <avati@redhat.com>
| Reviewed-on: http://review.gluster.org/4708
| Tested-by: Gluster Build System <jenkins@build.gluster.com>
| Reviewed-by: Amar Tumballi <amarts@redhat.com>
* mount: Added the xlator-option to mount.glusterfs script. | Avra Sengupta | 2013-03-22 | 2 | -0/+56
| Now all xlator-options can be set from the mount command as well.
|
| Example:
| mount -t glusterfs Hostname:/Volume_Name Mount_Point -o "xlator-option=xyz=123, xlator-option=abc=999"
|
| Change-Id: If52d994986839d1c969e3e2e01b2e1a29a3140b7
| BUG: 920583
| Signed-off-by: Avra Sengupta <asengupt@redhat.com>
| Reviewed-on: http://review.gluster.org/4660
| Reviewed-by: Amar Tumballi <amarts@redhat.com>
| Reviewed-by: Shishir Gowda <sgowda@redhat.com>
| Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
| Tested-by: Gluster Build System <jenkins@build.gluster.com>
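As a rough illustration of how the `-o` string in the example above could be turned into `--xlator-option` arguments, here is a Python sketch. The real mount.glusterfs is a shell script and its parsing is not shown in this log, so the function name `xlator_args` and the exact argument form it emits are assumptions.

```python
from typing import List


def xlator_args(mount_opts: str) -> List[str]:
    """Collect xlator-option entries from a comma-separated -o string.

    Hypothetical helper: the output form "--xlator-option=KEY=VALUE"
    is assumed for illustration.
    """
    args = []
    for opt in mount_opts.split(","):
        # Tolerate the space after the comma seen in the example.
        name, sep, value = opt.strip().partition("=")
        if name == "xlator-option" and sep and value:
            args.append("--xlator-option=" + value)
    return args
```

Other mount options (for example `ro`) are simply skipped by this sketch rather than passed through.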
* glusterfsd: Fixed fd leak due to use of tmpfile() | Krishnan Parthasarathi | 2013-03-22 | 3 | -32/+69
| Change-Id: I3c2dc070ebe967100170e39f3545acacc6016d61
| BUG: 924075
| Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
| Reviewed-on: http://review.gluster.org/4703
| Tested-by: Gluster Build System <jenkins@build.gluster.com>
| Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* glusterd: Improve error logging when a brick from an old volume gets re-used | Niels de Vos | 2013-03-22 | 1 | -2/+7
| The error message when creating a volume that contains a brick with
| certain xattrs set on a parent directory is unclear. Users do not
| understand '... or a prefix of it is already part of a volume'. Most
| users check the final directory that is used for a brick, but not its
| parents.
|
| It would be helpful to present the user with the actual directory that
| is preventing the volume from using the brick.
|
| BUG: 923917
| Change-Id: I815ad32a992eb0e41ee8fca6ee9327400d042c45
| Signed-off-by: Niels de Vos <ndevos@redhat.com>
| Reviewed-on: http://review.gluster.org/4701
| Tested-by: Gluster Build System <jenkins@build.gluster.com>
| Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
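Reporting "the actual directory" amounts to walking from the brick path up through its parents and returning the first one that is already claimed. A hedged Python sketch follows; `find_claimed_ancestor` and the `is_claimed` predicate are hypothetical, with the predicate standing in for whatever xattr check glusterd performs.

```python
import os
from typing import Callable, Optional


def find_claimed_ancestor(path: str, is_claimed: Callable[[str], bool]) -> Optional[str]:
    """Return the brick path or nearest parent for which is_claimed() is true."""
    current = os.path.abspath(path)
    while True:
        if is_claimed(current):
            return current
        parent = os.path.dirname(current)
        if parent == current:
            # Reached the filesystem root without finding a claimed directory.
            return None
        current = parent
```

With the offending directory in hand, an error message can name it directly instead of the generic "or a prefix of it" wording.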
* nfs: ACCESS - reply only what was asked for | Anand Avati | 2013-03-22 | 3 | -8/+12
| Set only those bits which were requested by the client. Some clients,
| like AIX, do not like the fact that we return the EXEC bit set in the
| ACCESS reply even though they asked only for the LOOKUP bit.
|
| Change-Id: I3c2fd5dce030ea5ddae0511497cafa078c4d76d6
| BUG: 924481
| Signed-off-by: Anand Avati <avati@redhat.com>
| Reviewed-on: http://review.gluster.org/4707
| Tested-by: Gluster Build System <jenkins@build.gluster.com>
| Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
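Replying with only the requested bits comes down to masking the permitted set with the requested set. A small Python sketch, using the access bit values defined for the NFSv3 ACCESS procedure in RFC 1813; `access_reply` is a hypothetical name and not the function changed by this commit.

```python
# Access bits as defined for the NFSv3 ACCESS procedure (RFC 1813).
ACCESS3_READ = 0x0001
ACCESS3_LOOKUP = 0x0002
ACCESS3_MODIFY = 0x0004
ACCESS3_EXTEND = 0x0008
ACCESS3_DELETE = 0x0010
ACCESS3_EXECUTE = 0x0020


def access_reply(requested: int, permitted: int) -> int:
    """Return only those permitted bits that the client actually asked about."""
    return requested & permitted
```

A client that asks only about LOOKUP therefore never sees EXECUTE in the reply, even when execute access would have been granted.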
* Added autogen.sh check for presence of tar | Justin Clift | 2013-03-22 | 1 | -0/+6
| Change-Id: I95313699edcf7bc2696505fcb475a4a67c1800cf
| BUG: 924891
| Signed-off-by: Justin Clift <jclift@redhat.com>
| Reviewed-on: http://review.gluster.org/4716
| Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>
| Tested-by: Gluster Build System <jenkins@build.gluster.com>
| Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
* glusterfs.spec.in: sync with fedora glusterfs.spec | Kaleb S. KEITHLEY | 2013-03-22 | 1 | -18/+32
| remove hard-coded reference to swift version and dist tarballs
|
| minor improvements for rpm builds for regression tests, including
| adding a cache on build.gluster.org to avoid random failures due to
| transient network or DNS failures causing curl or git to fail.
|
| BUG: 819130
| Change-Id: I4f1213056ae2987dd2202f9cfbb3ed4f16ffc7cf
| Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com>
| Reviewed-on: http://review.gluster.org/4674
| Tested-by: Gluster Build System <jenkins@build.gluster.com>
| Reviewed-by: Justin Clift <jclift@redhat.com>
| Tested-by: Justin Clift <jclift@redhat.com>
| Reviewed-by: Anand Avati <avati@redhat.com>
* socket: Make non-ssl sockets perform non-blocking connect() | Krishnan Parthasarathi | 2013-03-22 | 1 | -0/+12
| Change-Id: Icb60cf7ad3ea7ca0eeb12fd19b95a6b340857bb2
| BUG: 920916
| Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
| Reviewed-on: http://review.gluster.org/4670
| Tested-by: Gluster Build System <jenkins@build.gluster.com>
| Reviewed-by: Anand Avati <avati@redhat.com>
* dht: make DHT xattr names configurable | Jeff Darcy | 2013-03-21 | 13 | -125/+233
| This is necessary to support "DHT over DHT" configurations, so that the
| upper and lower instances of DHT don't step all over each other.
|
| Why would we even consider such a thing? Because it gives us the
| ability to do data tiering and rack-aware placement, either by
| themselves or as complements to other functionality such as erasure
| codes or deduplication which save space but cost performance. By
| setting up the top-level DHT to place data into one of several
| lower-level DHT pools based on policy instead of pure elastic hashing,
| we get better performance for 90% of accesses and better storage
| efficiency for 90% of data, all for relatively low effort.
|
| Change-Id: I72e65c29edfc80babf39f7a2a00090f4588c4070
| BUG: 924265
| Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
| Reviewed-on: http://review.gluster.org/4694
| Tested-by: Gluster Build System <jenkins@build.gluster.com>
| Reviewed-by: Anand Avati <avati@redhat.com>
* object-storage: Removed the redundant REMOTE_CLUSTER option. | Mohammed Junaid | 2013-03-21 | 3 | -19/+3
| The Gluster cli uses the remote-host option to connect to glusterd,
| and by default it uses localhost. So the UFO code will use the
| remote-host option every time to connect to glusterd.
|
| Change-Id: I5a684d3c43fe9bdc9cc0b7c472a9d8145f9e1fd4
| BUG: 878663
| Signed-off-by: Mohammed Junaid <junaid@redhat.com>
| Reviewed-on: http://review.gluster.org/4690
| Reviewed-by: Peter Portante <pportant@redhat.com>
| Tested-by: Gluster Build System <jenkins@build.gluster.com>
| Reviewed-by: Anand Avati <avati@redhat.com>