| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
PROBLEM:
performance.nfs.* option values (which are of type boolean) are
not validated during the stage phase of 'volume set'.
The result - nfs graph generation fails during commit phase,
AFTER the option and its (invalid) value have been placed in
volinfo->dict.
CAUSE:
nfsperfxl_option_handler() - the function that validates the values of
performance.nfs.* options - never receives the (key,value) pair that
needs to be set, for validation during 'volume set' stage.
FIX:
In build_nfs_graph(), copy the (mod_)dict containing the (option,value)
parameters into set_dict before attempting to build the client graph
for the volume on which the operation is being performed.
Of course, an easier way out would be to simply do a 'volume reset' and
pretend nothing wrong happened!
Change-Id: I56b17d0239d58a9e0b7798933a3c8451e2675b69
BUG: 949930
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
Reviewed-on: http://review.gluster.org/4814
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In a single node cluster, it is possible to deadlock on the "big
lock", while restarting bricks. In glusterd_restart_bricks, we perform a
glusterd_brick_connect, where we release the big lock in anticipation
that glusterd_brick_rpc_notify could run in the same C stack (and
deadlocking). So, in the restart code path, we could unlock before we
have performed a lock on the big lock.
To fix this, we need to take the big lock in the
glusterd_launch_synctask 'thread' as well.
Change-Id: I1abea1ca82b55c784b8a810a8194f254b32b1dcc
BUG: 948686
Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
Reviewed-on: http://review.gluster.org/4837
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are primarily three lists that are part of glusterd process,
that are concurrently accessed. Namely, priv->volumes, priv->peers
and volinfo->bricks_list.
Big-lock approach
-----------------
WHAT IS IT?
Big lock is a coarse-grained lock which protects all three
lists, mentioned above, from racy access.
HOW DOES IT WORK?
At any given point in time, glusterd's thread(s) are in execution
_iff_ there is a preceding, inbound network event. Of course, the
sigwaiter thread and timer thread are exceptions.
A network event is an external trigger to glusterd, via the epoll
thread, in the form of POLLIN and POLLERR.
As long as we take the big-lock at all such entry points and yield
it when we are done, we are guaranteed that all the network events,
accessing the global lists, are serialised.
This amounts to holding the big lock at
- all the handlers of all the actors in glusterd. (POLLIN)
- all the cbks in glusterd. (POLLIN)
- rpc_notify (DISCONNECT event), if we access/modify
one of the three lists. (POLLERR)
In the case of synctask'ized volume operations, we must remember that,
if we held the big lock for the entire duration of the handler,
we may block other non-synctask rpc actors from executing.
For eg, volume-start would block in PMAP SIGNIN, if done incorrectly.
To prevent this, we need to yield the big lock, when we yield the
synctask, and reacquire on waking up of the synctask.
Change-Id: Ib929f9905b55fb6c3fc27fefb497a26dba058e4f
BUG: 948686
Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
Reviewed-on: http://review.gluster.org/4784
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
| |
BUG: 951549
Change-Id: I3de5bd86d4238a60a0a85ba2e15d9c131969b210
Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com>
Reviewed-on: http://review.gluster.org/4816
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
glusterd syncops perform a barrier_wake whenever rpc_clnt_submit returned -1.
This is based on the wrong assumption that the cbkfn wasn't called.
This would result in one more wakeup than there ought to be.
Change-Id: I591e67c267f0e26d1145bf8fb5feeb2c13a751a1
BUG: 948686
Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
Reviewed-on: http://review.gluster.org/4802
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch incorporates all the changes suggested on the behaviour of
'volume create' command in http://review.gluster.org/#change,4214
(comment #14, to be precise).
Change-Id: Iaac524a59738b177415595b18aa8a136090d3d25
BUG: 948729
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
Reviewed-on: http://review.gluster.org/4740
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Till now running glusterfs processes were allowed to run in valgrind
mode only when built with debug mode enabled.
Change-Id: I11e07ea2a4da4f82f70cdded6258a22d65d6db64
BUG: 922877
Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
Reviewed-on: http://review.gluster.org/4688
Reviewed-by: Anand Avati <avati@redhat.com>
Tested-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When subvols-per-directory is < available subvols, then there are layouts
which are not populated. This leads to incorrect identification of holes or
overlaps. We need to ignore layouts, which have err == 0, and start == stop.
In the current scenario (start == stop == 0).
Additionally, in layout-merge, treat missing xattrs as err = 0. In case of
missing layouts, anomalies will reset them.
For any other valid subvoles, err != 0 in case of layouts being zeroed out.
Also reverted back dht_selfheal_dir_xattr, which does layout calculation only
on subvols which have errors.
Change-Id: I9f57062722c9e8a26285e10675c31a78921115a1
BUG: 921408
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4668
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Today there is a non-obvious dependence of eager-locking on
write-behind. The reason is that eager-locking works as long
as the inheriting transaction has no overlaps with any of the
transactions already in progress. While write-behind provides
non-overlapping writes as a side-effect most of times (and only
guarantees it when strict-write-ordering option is enabled,
which is not on by default) eager-lock needs the behavior
as a guarantee. This is leading to complex and unwanted checks
for the presence of write-behind in the graph, for the simple
task of checking for overlaps.
This patch removes the interdependence between eager-locking
and write-behind by making eager-locking do its own overlap checks
with in-progress writes.
Change-Id: Iccba1185aeb5f1e7f060089c895a62840787133f
BUG: 912581
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4782
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
enabling this option has an effect on pathinfo xattr
request returning <node-uuid>:<path> instead of the
default - which is <hostname>:<path>.
Change-Id: Ice1b38abf8e5df1568bab6d79ec0d53dfa520332
BUG: 765380
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-on: http://review.gluster.org/4567
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Added code to display extra information when status command
is executed.
Information shown now are
1 Number of files synced
2 crawl time
3 total sync time
4 bytes synced
bytes synced is taken from rsync output .
--stats option of rsync gives extra infor
mation about the sync.In stats output there
is a field called Total transferred file
size which states the ammount of bytes synced .
This information is parsed from stdout output
using regular expressions.Bytes synced information
can be used to calculate throughput.
Change-Id: Id9bba9fff45ee7049bb8257c6fd918e5237e05b1
BUG: 947774
Signed-off-by: sarvotham s pai <spai@redhat.com>
Reviewed-on: http://review.gluster.org/4749
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For example:
If a new entry creation fop fails with EEXIST or a delete entry fop
fails with ENOENT, on all the subvols the fop is wound, then no
change took place to the directory. So we can treat that case as no
change happened to the directory.
Change-Id: I3b3a7931954da2166a9cba19ff9f76f37739d751
BUG: 860210
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4626
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
posix_fill_readdir() is a multi-step function which performs many
readdir() calls, and expects the directory cursor to have not
"seeked away" elsewhere between two successive iterations. Usually
this is not a problem as each opendir() from an application has its
own backend fd, and there is nobody else to "seek away" the directory
cursor. However in case of NFS's use of anonymous fd, the same fd_t
is shared between all NFS readdir requests, and two readdir loops can
be executing in parallel on the same dir dragging away the cursor in
a chaotic manner.
The fix in this patch is to lock on the fd around the loop. Another
approach could be to reimplement posix_fill_readdir() with a single
getdents() call, but that's for another day.
Change-Id: Ia42e9c7fbcde43af4c0d08c20cc0f7419b98bd3f
BUG: 948086
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4774
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
FIX:
In missing entry self heal, once the source directories are determined
after the lookup and if file is not present on any of the brick which
contains the souce directory, the entry is removed from the directory.
So log message should give information of "Purging of entry".
Change-Id: I4d3deb602e0812dc1c9c8ba0a466716d81dede7e
BUG: 947312
Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/4753
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
In Pump entry self-heal happens for each directory during the
first opendir using conservative merge. But in entry-self-heal
readdir is issued with '0' size. So entry self-heal is not
creating any files. After pump thinks entry self-heal is complete
it proceeds to heal each of the file in the directory it just
healed. Fortunately most of the times it chooses source-brick
in pump as read-child for readdir. This happens because readchild is the
subvolume on which lookup succeeds first. In pump lookup succeeds
faster in local process than on the destination brick process most
of the times. For all the entries pump finds in readdir it does a
lookup. During this lookup the entry on the destination brick is
created and healed. This is the reason why replace-brick
succeeds whenever read-child for the directory is chosen as the
source-brick. Which is most of the times. When read-child is chosen
as the destination brick, readdir returns no entries so replace-brick
completes without syncing the whole data.
Fix:
Set readdir-size in pump so that entry self-heal happens with
64k size. This ensures that entry self-heal triggered from
opendir actually creates the files on the destination brick.
Change-Id: I65ea45d3c2735a9578f3aa34eff771b6563241ca
BUG: 909800
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4712
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
Change-Id: I45f91105862a2484b8906a7a63b98ab4aaf80d05
BUG: 924643
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4683
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These functions keep changing as new functionality is added, so copying
and pasting the code is not a good solution. This way ensures that all
fields get initialized properly no matter how much new stuff we throw in.
Change-Id: I9e9b043d2d305d31e80cf5689465555b70312756
BUG: 924488
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-on: http://review.gluster.org/4710
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With readdir-optimize set to on, we instruct the posix layer to ignore
directory entries from not first subvolume. DHT discards directories
returned from non first subvolume. By making posix itself ignore it,
we are making directory crawls faster
Change-Id: Ia1faf2dedec0c615c0632c3c063e846f5742ede6
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4613
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Change-Id: I57fbdd83f3098e64886c3dd690a1ae04fc37442d
BUG: 928648
Signed-off-by: Bala.FA <barumuga@redhat.com>
Reviewed-on: http://review.gluster.org/4739
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
if any subvol returned ENOENT while parent entrylk lock was held,
yield and return ENOENT for the entire lookup.
This is how the issue happens:
Multiple clients A, B and C are attempting 'mkdir -p /mnt/a/b/c'
1 Client A is in the middle of mkdir(/a). It has acquired lock.
It has performed mkdir(/a) on one subvol, and second one is still
in progress
2 Client B performs a lookup, sees directory /a on one,
ENOENT on the other, succeeds lookup.
3 Client B performs lookup on /a/b on both subvols, both return ENOENT
(one subvol because /a/b does not exist, another because /a
itself does not exist)
4 Client B proceeds to mkdir /a/b. It obtains entrylk on inode=/a with
basename=b on one subvol, but fails on other subvol as /a is yet to
be created by Client A.
5 Client A finishes mkdir of /a on other subvol
6 Client C also attempts to create /a/b, lookup returns ENOENT on
both subvols.
7 Client C tries to obtain entrylk on on inode=/a with basename=b,
obtains on one subvol (where B had failed), and waits for B to unlock
on other subvol.
8 Client B finishes mkdir() on one subvol with GFID-1 and completes
transaction and unlocks
9 Client C gets the lock on the second subvol, At this stage second
subvol already has /a/b created from Client B, but Client C does not
check that in the middle of mkdir transaction
10 Client C attempts mkdir /a/b on both subvols. It succeeds on
ONLY ONE (where Client B could not get lock because of
missing parent /a dir) with GFID-2, and gets EEXIST from ONE subvol.
This way we have /a/b in GFID mismatch. One subvol got GFID-1 because
Client B performed transaction on only one subvol (because entrylk()
could not be obtained on second subvol because of missing parent dir --
caused by premature/speculative succeeding of lookup() on /a when locks
are detected). Other subvol gets GFID-2 from Client C because while
it was waiting for entrylk() on both subvols, Client B was in the
middle of creating mkdir() on only one subvol, and Client C does not
"expect" this when it is between lock() and pre-op()/op() phase of the
transaction.
Original-author: Anand Avati <avati@redhat.com>
Change-Id: Idca475dbbc2a51e09da6fa0f9e1e37148caef208
BUG: 860210
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4625
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Removing the code which handles "general" options.
Since it is no longer possible to set general options which
apply for all volumes by default, this was redundant.
This cleanup of general options code also solves a bug wherein
with nfs.addr-namelookup on, nfs.rpc-auth-reject wouldn't work
on ip addresses
Change-Id: Iba066e32f9a0255287c322ef85ad1d04b325d739
BUG: 921072
Signed-off-by: Rajesh Amaravathi <rajesh@redhat.com>
Reviewed-on: http://review.gluster.org/4691
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
xattrs are first removed from sink followed by setting
source xattrs.
Change-Id: I181cb5b785b667bbfc6e40787a2183a8f45de06b
BUG: 906646
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-on: http://review.gluster.org/4656
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Here are the logs of a file on which we saw EIO because of size mismatch:
[root@lizzie ~]# grep 38f18204 /var/log/glusterfs/mnt-x-.log
Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4
for offset: 0, len: 7680
Cleared unstable write flag for 38f18204-2840-408e-ae65-c01f4106b8c4:
offset 0 length 7680
Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for
offset: 7680, len: 71680
Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for
offset: 79360, len: 15716
fsync completed on 38f18204-2840-408e-ae65-c01f4106b8c4 for
offset 0 length 7680 with changelog status: -1 -1
According to these logs fsync did not happen after writev with
offset: 79360, len: 15716. Which is the reason for this problem.
In total 3 writes came. lets call them w1, w2, w3
w1 does pre_op so pre_op_done[0], pre_op_done[1] counts become 1 and 1
then is_piggyback_post_op() is called for w1 and it returns *false*
w1's fsync is fired
Now w2 and w3 come and see that pre_op_done[0], pre_op_done[1] are both 1,
so pre_op_piggyback[0] and pre_op_piggyback[1] are both incremented twice,
once by w2, one more time by w3 and become 2, 2 ------- Step-A
Now fsync of w1 is complete and it goes ahead with post op and decrements
pre_op_done[0], pre_op_done[1] to 0, 0
Now w2, w3 writevs complete and is_piggyback_post_op will return *true* for
both w2, w3.
So fsync is not fired for both w2, w3
this patch prevents Step-A from happening.
Change-Id: I8b6af1f1875b2cf5f718caa3c16ee7ff3dc96b5c
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4752
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The scheme to encode brick d_off and brick id into global d_off has
two approaches. Since both brick d_off and global d_off are both 64-bit
wide, we need to be careful about how the brick id is encoded.
Filesystems like XFS always give a d_off which fits within 32bits. So
we have another 32bits (actually 31, in this scheme, as seen ahead) to
encode the brick id - which is typically plenty.
Filesystems like the recent EXT4 utilize the upto 63 low bits in d_off,
as the d_off is calculated based on a hash function value. This leaves
us no "unused" bits to encode the brick id.
However both these filesystmes (EXT4 more importantly) are "tolerant" in
terms of the accuracy of the value presented back in seekdir(). i.e, a
seekdir(val) actually seeks to the entry which has the "closest" true
offset.
This "two-prong" scheme exploits this behavior - which seems to be the
best middle ground amongst various approaches and has all the advantages
of the old approach:
- Works against XFS and EXT4, the two most common filesystems out there.
(which wasn't an "advantage" of the old approach as it is borken against
EXT4)
- Probably works against most of the others as well. The ones which would
NOT work are those which return HUGE d_offs _and_ NOT tolerant to
seekdir() to "closest" true offset.
- Nothing to "remember in memory" or evict "old entries".
- Works fine across NFS server reboots and also NFS head failover.
- Tolerant to seekdir() to arbitrary locations.
Algorithm:
Each d_off can be encoded in either of the two schemes. There is no
requirement to encode all d_offs of a directory or a reply-set in
the same scheme.
The topmost bit of the 64 bits is used to specify the "type" of encoding
of this particular d_off. If the topmost bit (bit-63) is 1, it indicates
that the encoding scheme holds a HUGE d_off. If the topmost bit is is 0,
it indicates that the "small" d_off encoding scheme is used.
The goal of the "small" d_off encoding is to stay as dense as possible
towards the lower bits even in the global d_off.
The goal of the HUGE d_off encoding is to stay as accurate (close) as
possible to the "true" d_off after a round of encoding and decoding.
If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick
ID (call it "n").
SMALL d_off
===========
Encoding
--------
If the top n + 1 bits are free in a brick offset, then we leave the
top bit as 0 and set the remaining bits based on the old formula:
hi_mask = 0xffffffffffffffff
hi_mask = ~(hi_mask >> (n + 1))
if ((hi_mask & d_off_brick) != 0)
do_large_d_off_encoding ()
d_off_global = (d_off_brick * N) + brick_id
Decoding
--------
If the top bit in the global offset is 0, it indicates that this
is the encoding formula used. So decoding such a global offset will
be like the old formula:
if ((d_off_global & 0x8000000000000000) != 0)
do_large_d_off_decoding()
d_off_brick = (d_off_global % N)
brick_id = d_off_global / N
HUGE d_off
==========
Encoding
--------
If the top n + 1 bits are NOT free in a given brick offset, then we
set the top bit as 1 in the global offset. The low n bits are replaced
by brick_id.
low_mask = 0xffffffffffffffff << n // where n is ROOF(Log2(N))
d_off_global = (0x8000000000000000 | d_off_brick & low_mask) + brick_id
if (d_off_global == 0xffffffffffffffff)
discard_entry();
Decoding
--------
If the top bit in the global offset is set 1, it indicates that
the encoding formula used is above. So decoding would look like:
hi_mask = (0xffffffffffffffff << n)
low_mask = ~(hi_mask)
d_off_brick = (global_d_off & hi_mask & 0x7fffffffffffffff)
brick_id = global_d_off & low_mask
If "losing" the low n bits in this decoding of d_off_brick looks
"scary", we need to realize that till recently EXT4 used to only
return what can now be expressed as (d_off_global >> 32). The extra
31 bits of hash added by EXT recently, only decreases the probability
of a collision, and not eliminate it completely, anyways. In a way,
the "lost" n bits are made up by decreasing the probability of
collision by sharding the files into N bricks / EXT directories
-- call it "hash hedging", if you will :-)
Thanks-to: Zach Brown <zab@redhat.com>
Change-Id: Ieba9a7071829d51860b7c131982f12e0136b9855
BUG: 838784
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4711
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We observed that the number of write requests thus inodelks
are increasing very rapidly to thousands without write-behind
in the graph.
Change-Id: Id71c9c2b0a4c9601a4644a58a933221c62dab0c0
BUG: 928341
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4734
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Historic bug - posix_writev() has been inspecting pfd->flushwrites for
performing fsync() after write, instead of @flags for O_SYNC|O_DSYNC.
pfd->flushwrites was never set anywhere and is unused completely. This
is behavior from the time before anonymous FD where open() had @wbflags
param. This is a leftover from that cleanup.
Change-Id: Id9bfe562a60db4eb3bd0a7705bdba91f2df2f3ec
BUG: 916372
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4738
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Introduce AFR_CALL_RESUME macro which cleans up frame->local, like
how AFR_STACK_UNWIND etc. do.
Therefore fix leak in afr_fsync() path.
Change-Id: I3855d8e7e84dbc44e05f507563b7f722bf9621b8
BUG: 927146
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4745
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Added extra fsync to data self-heal code to make sure the
data reached disk before erasing the changelogs
Change-Id: I9e7e6e55cdc49de2b991705d1638946464a9d4f9
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4744
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
1) pre_op_piggyback should always be decremented.
2) Move fsync resume to just after post_op.
3) fsync stub should be created from afr's local
not from the final response.
Change-Id: I220bb532eb03bea584292f4dd2e816ad0c3e0cf7
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4741
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
AFR now provides a stronger guarantee that fsync() returns only
after completely finishing all the deferred/delayed POST-OP on that
open file.
To acheive this we make a stub out of the returning fsync and
register it with the "delayed" frame in afr_changelog_wake_resume().
The delayed frame, after getting woken up and finishing the POST-OP
will call_resume() the registered stub (which UNWINDs the fsync) at
the time of frame destruction.
This provides a guarantee that an application's (or FUSE) fsync()
returns only after finishing up all the previous transactions,
including delayed POST-OPs and UNLOCK.
Change-Id: Iaa955457e2f25088a144fde37ad0444277b5cf49
BUG: 927146
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4737
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The changelogging scheme of AFR stores information about the state
of all replicas in all replicas (in the extended attribute of the
respective files on each server) in the form of 'pending counts'
of operations (effectively "dirty flags"). These xattrs are blindly
trusted while performing self-heal, and therefore utmost care has
to be taken while updating and maintaing them.
The most critical updation is the clearing of the pending counts
corresponding to the *other* server in the changelog of a given
server. Before clearing the pending count, we need durability
guarantee of the write which was performed on the other server.
To obtain such a guarantee, it may be necessary to explicitly
introduce an fsync() phase (if the file itself wasn't already
opened with O_SYNC).
This patch introduces the detection of unstable stable writes on
a file and issues explicit fsync() on the servers before performing
the POST-OP clearing of pending flags.
Change-Id: I2171b86a74ec91e40e5877eef0a4e7379578ecf7
BUG: 927146
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4721
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
PROBLEM:
The FILE* associated with the pidfile was leaked if
pmap_registry_search on the brickinfo' path failed.
FIX:
Eliminates the use of the FILE* that was leaked. Uses
glusterd_is_service_running utility function in place
of the earlier attempt to check for the same.
Change-Id: I94082bd5a94b8a6340f8cc11726d3264e364efe6
BUG: 916549
Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
Reviewed-on: http://review.gluster.org/4596
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Over the weekend I tried to build on MacOS X¹ and ran into the following
issues:
1) The recent change to autogen.sh to test for pkg-config falls down.
2) After removing the pkg-config test in autogen.sh, w/o pkg-config the
PKG_CHECK_MODULES macro invocation in configure[.ac] falls down. N.B.
Solaris users run into this too, even through there's a (broken)
pkg-config package that can be installed.
3) There are other problems in the code related to fuse that are beyond the
scope of this.
It seems that pkg-config is only a requirement for the definition of the
PKG_CHECK_MODULES macro used to detect libxml2. Since this seems to be
inherently unportable — at least to MacOS X and Solaris — I'd like to:
A) Change the use of the PKG_CHECK_MODULES macro to the more portable
AM_PATH_XML2 macro provided by the libxml2 package in
/usr/.../share/aclocal/libxml.m4
2) Revisit the decision to add the check for pkg-config in autogen.sh in
BZ 921817.
For now this is just an rfc. If people are agreeable I'll reenter this
change against BZ 921817.
¹Mountain Lion 10.8.3, XCode 4.6.1
Change-Id: I237b1ed8919088345b8fd943423b2a6ad289981b
BUG: 921817
Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com>
Reviewed-on: http://review.gluster.org/4720
Reviewed-by: Justin Clift <jclift@redhat.com>
Tested-by: Justin Clift <jclift@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Change-Id: I396d250a3299ad1f7fce4bd14389b0c2756b6cb0
BUG: 764890
Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
Reviewed-on: http://review.gluster.org/4718
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now all xlator-options can be set from the mount command as well.
Example :
mount -t glusterfs Hostname:/Volume_Name Mount_Point -o "xlator-option=xyz=123, xlator-option=abc=999"
Change-Id: If52d994986839d1c969e3e2e01b2e1a29a3140b7
BUG: 920583
Signed-off-by: Avra Sengupta <asengupt@redhat.com>
Reviewed-on: http://review.gluster.org/4660
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Shishir Gowda <sgowda@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The error message when creating a volume that contains a brick with
certain xatts set on a parent directory is unclear.
Users do not understand '... or a prefix of it is already part of
a volume'. Most users check the final directory that is used for
a brick, but not its parents.
It would be helpful to present the user with the actual directory that
is preventing the volume to use the brick.
BUG: 923917
Change-Id: I815ad32a992eb0e41ee8fca6ee9327400d042c45
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Reviewed-on: http://review.gluster.org/4701
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Set only those bits which were requested by the client. Some clients,
like AIX, do not like the fact that we are returning the EXEC bit
set in the ACCESS reply even though it only asked for LOOKUP bit.
Change-Id: I3c2fd5dce030ea5ddae0511497cafa078c4d76d6
BUG: 924481
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4707
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is necessary to support "DHT over DHT" configurations, so that the
upper and lower instances of DHT don't step all over each other. Why
would we even consider such a thing? Because it gives us the ability to
do data tiering and rack-aware placement, either by themselves or as
complements to other functionality such as erasure codes or
deduplication which save space but cost performance. By setting up the
top-level DHT to place data into one of several lower-level DHT pools
based on policy instead of pure elastic hashing, we get better
performance for 90% of accesses and better storage efficiency for 90% of
data, all for relatively low effort.
Change-Id: I72e65c29edfc80babf39f7a2a00090f4588c4070
BUG: 924265
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-on: http://review.gluster.org/4694
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
... so it's easy to figure out errno caused it. As of now
it's only due to ENOSPC. Logging is done in the error handling
routine, so any further changes that require unlinking of the
timestamp file due to some error condition(s) are logged.
Change-Id: Ia59338e2e32b2adbbd1d56aa260018270f1abae9
BUG: 853911
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-on: http://review.gluster.org/4649
Reviewed-by: Csaba Henk <csaba@redhat.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The issue could be fixed with .validate=GF_OPT_VALIDATE_MIN. But
adding max value is more robust.
Change-Id: Ia69c6f86855dbd34a26e20391e77bfa0f796a200
BUG: 923573
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4698
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
stable writes can be "made stable" by simply setting O_SYNC (or
O_DSYNC, accordingly) in the write flags or fd->flags. Performing
fsync() at the end of the write is extremely inefficient and completely
messes up eager-locking logic in AFR.
Change-Id: I4d954c133641e246b2ab4df874bad0282667561f
BUG: 916372
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4591
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Change-Id: Icee9772f1f1bf5336eb82a4dc13e198424cd4a65
BUG: 921996
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4699
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
| |
Change-Id: Id6f156957e58aad06bf2602f880c7e4102b80fd1
BUG: 764890
Signed-off-by: JulesWang <w.jq0722@gmail.com>
Reviewed-on: http://review.gluster.org/4679
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
| |
Signed-off-by: shishir gowda <sgowda@redhat.com>
Change-Id: I9d49537c2c7b51d5598b80627d61f060aaec8549
BUG: 921437
Reviewed-on: http://review.gluster.org/4671
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Users are still using geo-rep with the old, deprecated, insecure, unsupported
ssh setup. Not their fault -- the implementation of the new method had the
following charasteristics:
- old method is possible, but with default settings it's not working
- it can be made operational by fiddling with "remote-gsyncd" tunable
- with default setting, an unhelpful, actually misleading error message is
produced
- the UI gave no hint to the changes in the ssh setup
http://review.gluster.org/4392 tried to fix these; what it accomplished was
unrestricted support to the bad practice (by making the default old setup
operational).
From this on:
- we disable the old method by reserving the "remote-gsyncd" tunable
- if the old method is attempted, give a hint what to do
Change-Id: Icade94725d8d8d2d4c89cab992d4226351637b86
BUG: 895656
Signed-off-by: Csaba Henk <csaba@redhat.com>
Reviewed-on: http://review.gluster.org/4602
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
ENOATTR returned by
getxattr -n <NotAnExistingAttribute> <file>
was being logged at ERROR level.
Solution:
Moved logging to DEBUG level.
Change-Id: I982a577a4c231faa958ea71abdb272f8d5ffd70c
BUG: 918052
Signed-off-by: Raghavendra Talur <rtalur@redhat.com>
Reviewed-on: http://review.gluster.org/4628
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem:
Data self-heal may choose sink iatt to set mtimes.
This happens because after syncing of data is done
self-heal does one more xattrops/fstat to determine
sources sinks to set the inode-ctx. Since this is done
after data syncing and erase of xattrs, old source and
old sink are now sources, but the mtimes of them differ.
Old code just takes the first source from the list and
update mtimes, which could be sink before the self-heal
started.
Fix:
Set mtime from 'sources before syncing'.
Change-Id: Id769e1b99aa4f041eaee775f64cbf2c57b799723
BUG: 918437
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4658
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
PROBLEM:
During 'volume delete', when glusterd fails to erase all information
about a volume from the backend store (for instance because rmdir()
failed on non-empty directories), not only does volume delete fail on
that node, but also subsequent attempts to restart glusterd fail
because the volume store is left in an inconsistent state.
FIX:
Rename the volume directory path to a new location
<working-dir>/trash/<volume-id>.deleted, and then go on to clean up its
contents. The volume is considered deleted once rename() succeeds,
irrespective of whether the cleanup succeeds or not.
Change-Id: Iaf18e1684f0b101808bd5e1cd53a5d55790541a8
BUG: 889630
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
Reviewed-on: http://review.gluster.org/4639
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Kaushal M <kaushal@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|
|
|
|
|
|
|
|
|
|
| |
Change-Id: I2911d3ac80825310f84c5ba6bd7890e65e1ee219
BUG: 865700
Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
Reviewed-on: http://review.gluster.org/4624
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If for some reason glusterd_get_brick_root() fails,
it frees the gf_strdup'ed *mount_point in its own error path,
and returns -1.
Unfortunately it already had assigned that pointer value
to the output argument, the caller function
glusterd_add_brick_detail() sees a non-NULL pointer,
and free() again: segfault.
Could be fixed with a one-liner (*mount_point = NULL)
in the error path, but I think glusterd_get_brick_root()
should only assign to the output argument once all checks passed,
so I use a local temporary pointer, which increases the patch a bit.
Change-Id: I3f3035f01e80a5e9bdf2da895e4cf7baa3dfbd2f
BUG: 919352
Signed-off-by: Lars Ellenberg <lars@linbit.com>
Reviewed-on: http://review.gluster.org/4646
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
|