glusterfs.git/xlators/cluster, branch v7.0rc0

cluster/ec: Update lock->good_mask on parent fop failure

2019-08-22T05:56:58+00:00

When discard/truncate performs write fop, it should do so
after updating lock->good_mask to make sure readv happens
on the correct mask

fixes: bz#1739424
Change-Id: Idfef0bbcca8860d53707094722e6ba3f81c583b7
Signed-off-by: Pranith Kumar K

cluster/ec: Fix reopen flags to avoid misbehavior

2019-08-22T05:56:58+00:00

Problem:
when a file needs to be re-opened O_APPEND and O_EXCL
flags are not filtered in EC.

- O_APPEND should be filtered because EC doesn't send O_APPEND below EC for
open to make sure writes happen on the individual fragments instead of at the
end of the file.

- O_EXCL should be filtered because shd could have created the file so even
when file exists open should succeed

- O_CREAT should be filtered because open happens with gfid as parameter. So
open fop will create just the gfid which will lead to problems.

Fix:
Filter out these two flags in reopen.

Change-Id: Ia280470fcb5188a09caa07bf665a2a94bce23bc4
Fixes: bz#1739426
Signed-off-by: Pranith Kumar K

cluster/ec: Always read from good-mask

2019-08-22T05:56:58+00:00

There are cases where fop->mask may have fop->healing added
and readv shouldn't be wound on fop->healing. To avoid this
always wind readv to lock->good_mask

updates: bz#1739424
Change-Id: I2226ef0229daf5ff315d51e868b980ee48060b87
Signed-off-by: Pranith Kumar K

cluster/ec: fix EIO error for concurrent writes on sparse files

2019-08-22T05:56:58+00:00

EC doesn't allow concurrent writes on overlapping areas, they are
serialized. However non-overlapping writes are serviced in parallel.
When a write is not aligned, EC first needs to read the entire chunk
from disk, apply the modified fragment and write it again.

The problem appears on sparse files because a write to an offset
implicitly creates data on offsets below it (so, in some way, they
are overlapping). For example, if a file is empty and we read 10 bytes
from offset 10, read() will return 0 bytes. Now, if we write one byte
at offset 1M and retry the same read, the system call will return 10
bytes (all containing 0's).

So if we have two writes, the first one at offset 10 and the second one
at offset 1M, EC will send both in parallel because they do not overlap.
However, the first one will try to read missing data from the first chunk
(i.e. offsets 0 to 9) to recombine the entire chunk and do the final write.
This read will happen in parallel with the write to 1M. What could happen
is that half of the bricks process the write before the read, and the
half do the read before the write. Some bricks will return 10 bytes of
data while the otherw will return 0 bytes (because the file on the brick
has not been expanded yet).

When EC tries to recombine the answers from the bricks, it can't, because
it needs more than half consistent answers to recover the data. So this
read fails with EIO error. This error is propagated to the parent write,
which is aborted and EIO is returned to the application.

The issue happened because EC assumed that a write to a given offset
implies that offsets below it exist.

This fix prevents the read of the chunk from bricks if the current size
of the file is smaller than the read chunk offset. This size is
correctly tracked, so this fixes the issue.

Also modifying ec-stripe.t file for Test #13 within it.
In this patch, if a file size is less than the offset we are writing, we
fill zeros in head and tail and do not consider it strip cache miss.
That actually make sense as we know what data that part holds and there is
no need of reading it from bricks.

Change-Id: Ic342e8c35c555b8534109e9314c9a0710b6225d6
Fixes: bz#1739427
Signed-off-by: Xavi Hernandez

cluster/ec: inherit healing from lock when it has info

2019-08-22T05:56:23+00:00

If lock has info, fop should inherit healing mask from it.
Otherwise, fop cannot inherit right healing when changed_flags is zero.

Change-Id: Ife80c9169d2c555024347a20300b0583f7e8a87f
updates: bz#1739424
Signed-off-by: Kinglong Mee

afr: restore timestamp of parent dir during entry-heal

2019-08-21T11:45:24+00:00

Fixes: bz#1741041
Change-Id: I29e338bac62104233a6f80212df8d0fb016affda
Signed-off-by: Ravishankar N 
(cherry picked from commit 8e9c53ebf16705b9a1db2fc486dc24a5cb244ddd)

cluster/dht: Fixed a memleak in dht_rename_cbk

2019-08-09T11:40:56+00:00

Fixed a memleak in dht_rename_cbk when creating
a linkto file.

Change-Id: I705adef3cb79e33806520fc2b15558e90e2c211c
fixes: bz#1739337
Signed-off-by: N Balachandran

cluster/ta: Notify the clients only if there are pending heals

2019-07-24T11:01:54+00:00

Problem:
In case of thin arbiter, before index healer starts crawling the
indices at every heal-timeout interval, even if there is nothing to
be healed it will send an upcall notification to all the clients to
release any AFR_TA_DOM_NOTIFY locks that they hold. SHD will wait
for the upcall to return before proceeding with the heal even though
there is nothing to be healed. This will also invalidates the cached
information about the bricks states on the clients which leads to
extra calls on TA from clients for the next reads & writes if needed.
This will impact the IO performance.

Fix:
- Before sending the upcall to the clients, check for any pending heals
on TA without taking  any locks.
- If there is nothing marked bad on TA, then continue with the index
crawl to heal any dirty markings present on the files due to any post-op
failure.
- If there is a brick marked as bad on TA, then take the
AFR_TA_DOM_NOTIFY lock on TA from SHD, get the state on TA and
continue with the current healing process.

Change-Id: Ieb477bc6cb18bbdfd4e7a0453c5ed79b574ec9d6
fixes: bz#1729483
Signed-off-by: karthik-us

cluster/afr: Fix incorrect reporting of gfid & type mismatch

2019-07-20T07:35:44+00:00

Problems:
1. When checking for type and gfid mismatch, if the type or gfid
is unknown because of missing gfid handle and the gfid xattr
it will be reported as type or gfid mismatch and the heal will
not complete.

2. If the source selected during entry heal has null gfid the same
will be sent to afr_lookup_and_heal_gfid(). In this function when
we try to assign the gfid on the bricks where it does not exist,
we are considering the same gfid and try to assign that on those
bricks. This will fail in posix_gfid_set() since the gfid sent
is null.

Fix:
If the gfid sent to afr_lookup_and_heal_gfid() is null choose a
valid gfid before proceeding to assign the gfid on the bricks
where it is missing.

In afr_selfheal_detect_gfid_and_type_mismatch(), do not report
type/gfid mismatch if the type/gfid is unknown or not set.

Change-Id: Ia06552e4dc4a9f89cb7f5302833604bd21bbf7da
fixes: bz#1729481
Signed-off-by: karthik-us

cluster/ec: Prevent double pre-op xattrops

2019-06-22T05:06:15+00:00

Problem:
Race:
Thread-1                                    Thread-2
1) Does ec_get_size_version() to perform
pre-op fxattrop as part of write-1
                                           2) Calls ec_set_dirty_flag() in
                                              ec_get_size_version() for write-2.
					      This sets dirty[] to 1
3) Completes executing
ec_prepare_update_cbk leading to
ctx->dirty[] = '1'
					   4) Takes LOCK(inode->lock) to check if there are
					      any flags and sets dirty-flag because
				              lock->waiting_flag is 0 now. This leads to
					      fxattrop to increment on-disk dirty[] to '2'

At the end of the writes the file will be marked for heal even when it doesn't need heal.

Fix:
Perform ec_set_dirty_flag() and other checks inside LOCK() to prevent dirty[] to be marked
as '1' in step 2) above

Updates bz#1593224
Change-Id: Icac2ab39c0b1e7e154387800fbededc561612865
Signed-off-by: Pranith Kumar K