<feed xmlns='http://www.w3.org/2005/Atom'>
<title>glusterfs.git/xlators/cluster/ec/src, branch v4.0dev</title>
<subtitle></subtitle>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/'/>
<entry>
<title>libglusterfs: Name threads on creation</title>
<updated>2017-07-19T14:16:19+00:00</updated>
<author>
<name>Raghavendra Talur</name>
<email>rtalur@redhat.com</email>
</author>
<published>2017-07-18T06:06:19+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=33db9aff1deaa028f30516e49fdb1e8d6e31bb73'/>
<id>33db9aff1deaa028f30516e49fdb1e8d6e31bb73</id>
<content type='text'>
Set names on threads at creation for easier debugging.

Output of top -H -p &lt;PID-OF-GLUSTERFSD&gt;
Before:
19773 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19774 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19775 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19776 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19777 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19778 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19779 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19780 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19781 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19782 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19783 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19784 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19785 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.01 glusterfsd
19786 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.01 glusterfsd
19787 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.01 glusterfsd
19789 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19790 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
25178 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
 5398 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
 7881 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd

After:
19773 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19774 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glustertimer
19775 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterfsd
19776 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glustermemsweep
19777 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glustersproc0
19778 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glustersproc1
19779 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterepoll0
19780 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusteridxwrker
19781 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusteriotwr0
19782 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterbrssign
19783 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterbrswrker
19784 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterclogecon
19785 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.01 glusterclogd0
19786 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.01 glusterclogd1
19787 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.01 glusterclogd2
19789 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterposixjan
19790 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterposixfsy
25178 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterepoll1
 5398 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterepoll2
 7881 root      20   0 1301.3m  12.6m   8.4m S  0.0  0.1   0:00.00 glusterposixhc

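As an illustration only (not part of this commit; the thread body and the
name used here are just examples), a thread can be named right after
creation on Linux with pthread_setname_np(), which accepts at most 15
characters plus the terminating NUL:

#define _GNU_SOURCE
#include &lt;pthread.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

static void *worker(void *arg)
{
    (void)arg;
    /* ... thread body ... */
    return NULL;
}

int main(void)      /* build: cc -pthread example.c */
{
    pthread_t tid;
    int ret;

    if (pthread_create(&amp;tid, NULL, worker, NULL) != 0)
        return 1;

    /* Names like "glusterepoll0" make the thread identifiable in the
     * output of top -H and in /proc/&lt;pid&gt;/task/&lt;tid&gt;/comm. */
    ret = pthread_setname_np(tid, "glusterepoll0");
    if (ret != 0)
        fprintf(stderr, "pthread_setname_np: %s\n", strerror(ret));

    pthread_join(tid, NULL);
    return 0;
}
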
Change-Id: Id5f333755c1ba168a2ffaa4fce6e71c375e10703
BUG: 1254002
Updates: #271
Signed-off-by: Raghavendra Talur &lt;rtalur@redhat.com&gt;
Reviewed-on: https://review.gluster.org/11926
Reviewed-by: Prashanth Pai &lt;ppai@redhat.com&gt;
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Niels de Vos &lt;ndevos@redhat.com&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
</content>
</entry>
<entry>
<title>cluster/ec: Non-disruptive upgrade on EC volume fails</title>
<updated>2017-07-14T00:26:04+00:00</updated>
<author>
<name>Sunil Kumar Acharya</name>
<email>sheggodu@redhat.com</email>
</author>
<published>2017-07-05T11:11:38+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=d2650feb4bfadf3fb0cdb90236bc78c33b5ea451'/>
<id>d2650feb4bfadf3fb0cdb90236bc78c33b5ea451</id>
<content type='text'>
Problem:
Enabling optimistic changelog on an EC volume was not
handling node-down scenarios appropriately, resulting
in volume data inaccessibility.

Solution:
Update the dirty xattr appropriately on good bricks whenever
nodes are down. This fixes the metadata information as part
of heal and thus ensures data accessibility.

BUG: 1468261
Change-Id: I08b0d28df386d9b2b49c3de84b4aac1c729ac057
Signed-off-by: Sunil Kumar Acharya &lt;sheggodu@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17703
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
</content>
</entry>
<entry>
<title>cluster/ec: Get size of file in EC [f]xattrop</title>
<updated>2017-07-13T08:10:11+00:00</updated>
<author>
<name>Pranith Kumar K</name>
<email>pkarampu@redhat.com</email>
</author>
<published>2017-07-06T11:10:07+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=f367671d451ae0fc3e178d26cae1880d59eb6ebd'/>
<id>f367671d451ae0fc3e178d26cae1880d59eb6ebd</id>
<content type='text'>
Problem:
To allow parallel writes we shouldn't depend on ia_size being the same for
all the bricks in each write_cbk(). But we need to make sure the backend size
is correct on all the bricks and that no crashes/manual modifications happened.

Fix:
At the time of get_size_version() we do one check to make sure the size of the
file is the same across the bricks. From then on the FOPs will give the status
of each fop, so we rely on this information to track which bricks are good/bad.

Updates #251
Change-Id: I1df645347e2e9f2e09cfa4411b6cc305d7f4e4e5
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17741
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Xavier Hernandez &lt;xhernandez@datalab.es&gt;
</content>
</entry>
<entry>
<title>cluster/ec : Don't try to heal when no sink is UP</title>
<updated>2017-07-07T06:13:13+00:00</updated>
<author>
<name>Ashish Pandey</name>
<email>aspandey@redhat.com</email>
</author>
<published>2017-07-04T10:48:20+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=0ae38df6403942a2438404d46a6e05b503db3485'/>
<id>0ae38df6403942a2438404d46a6e05b503db3485</id>
<content type='text'>
Problem:
4 + 2 EC volume configuration.
If an untar of the Linux kernel source is going on and we kill
a brick, indices will be created for the files/dirs which need
to be healed. ec_shd_index_sweep spawns threads to
scan these entries and start heal. If in the middle
of this we kill one more brick, we end up in a
situation where we cannot heal an entry as only
"ec-&gt;fragment" bricks are UP.
However, the scan will continue and it will
trigger the heal for those entries.

Solution:
When a heal is triggered for an entry, check whether it
*CAN* be healed or not. If not, come out with ENOTCONN.

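As an idea-level sketch only (hypothetical names, not the actual EC code):
healing needs more UP bricks than the number of data fragments, because
reconstructing the data uses "fragments" bricks and at least one additional
UP brick must exist to act as a sink; otherwise the entry is skipped with
ENOTCONN.

#include &lt;errno.h&gt;

/* Hypothetical sketch: 'up_bricks' is the number of bricks currently
 * UP, 'fragments' the number of data fragments needed to reconstruct
 * the file (4 in a 4 + 2 configuration). */
static int can_heal_entry(int up_bricks, int fragments)
{
    /* With only 'fragments' bricks UP we can still read, but there
     * is no UP brick left to heal, so report ENOTCONN and skip. */
    if (up_bricks &lt;= fragments)
        return -ENOTCONN;
    return 0;
}
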
Change-Id: I305be7701c289f36bd7bde22491b71074771424f
BUG: 1464359
Signed-off-by: Ashish Pandey &lt;aspandey@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17692
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
Reviewed-by: Sunil Kumar Acharya &lt;sheggodu@redhat.com&gt;
Reviewed-by: Xavier Hernandez &lt;xhernandez@datalab.es&gt;
</content>
</entry>
<entry>
<title>cluster/ec: correctly handle end of file for seek</title>
<updated>2017-07-06T06:17:44+00:00</updated>
<author>
<name>Xavier Hernandez</name>
<email>xhernandez@datalab.es</email>
</author>
<published>2017-05-09T17:40:21+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=eb96dd45f8e583c6bad84bf32ca17e2bb01dd38f'/>
<id>eb96dd45f8e583c6bad84bf32ca17e2bb01dd38f</id>
<content type='text'>
When a SEEK_HOLE was issued near the end of the file, sometimes an
offset beyond the end of the file was returned. Another problem was that
using some offsets greater than the end of the file succeeded
instead of failing with ENXIO.

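For reference, the application-visible semantics this change restores
(standard Linux lseek() behaviour, shown here outside of gluster) are:
SEEK_HOLE never returns an offset beyond end-of-file, and SEEK_DATA at or
beyond end-of-file fails with ENXIO.

#define _GNU_SOURCE
#include &lt;errno.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/stat.h&gt;
#include &lt;unistd.h&gt;

int main(int argc, char *argv[])
{
    struct stat st;
    int fd;

    if (argc &lt; 2 || (fd = open(argv[1], O_RDONLY)) == -1)
        return 1;
    fstat(fd, &amp;st);

    /* The implicit hole at end-of-file means SEEK_HOLE returns at
     * most st.st_size, never an offset beyond it. */
    off_t hole = lseek(fd, 0, SEEK_HOLE);
    printf("first hole at %lld, file size %lld\n",
           (long long)hole, (long long)st.st_size);

    /* Asking for data at or beyond end-of-file must fail. */
    errno = 0;
    if (lseek(fd, st.st_size, SEEK_DATA) == -1 &amp;&amp; errno == ENXIO)
        printf("SEEK_DATA at EOF correctly fails with ENXIO\n");

    close(fd);
    return 0;
}
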
Change-Id: I238d2884ba02fd19a78116b0f8f8e8d6338fb3f5
BUG: 1449348
Signed-off-by: Xavier Hernandez &lt;xhernandez@datalab.es&gt;
Reviewed-on: https://review.gluster.org/17228
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Amar Tumballi &lt;amarts@redhat.com&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
</content>
</entry>
<entry>
<title>ec: Increase notification in all the cases</title>
<updated>2017-06-28T09:52:23+00:00</updated>
<author>
<name>Ashish Pandey</name>
<email>aspandey@redhat.com</email>
</author>
<published>2017-06-22T11:36:40+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=630d3d8c8466228e1764c1c0962b9db40548fffb'/>
<id>630d3d8c8466228e1764c1c0962b9db40548fffb</id>
<content type='text'>
Problem:
"gluster v heal &lt;volname&gt; info" takes a
long time to respond when a brick is down.

RCA:
The heal info command does a virtual mount.
EC waits for 10 seconds before sending the UP call to the upper xlator,
to get a notification (DOWN or UP) from all the bricks.

Currently, we are increasing ec-&gt;xl_notify_count based on
the current status of the brick. So, if a DOWN event notification
has come and the brick is already down, we are not increasing
ec-&gt;xl_notify_count in ec_handle_down.

Solution:
Handle a DOWN event as a notification irrespective of
the current status of the brick.

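A generic sketch of the counting idea (hypothetical structure and names,
not the real ec_handle_down()/ec_handle_up()): every child's first
notification is counted, whether or not the event changes the brick's
state, so the xlator can notify its parent as soon as all children have
reported instead of waiting out the 10-second timer.

#define MAX_CHILDREN 16

/* Hypothetical sketch of per-child notification bookkeeping. */
typedef struct {
    int nodes;                    /* number of child bricks        */
    int notify_count;             /* children that notified so far */
    int up[MAX_CHILDREN];         /* current state per child       */
    int notified[MAX_CHILDREN];   /* child has sent any event yet  */
} notify_sketch_t;

static int handle_child_event(notify_sketch_t *ec, int child, int is_up)
{
    /* Count the notification even if the state did not change, e.g.
     * a DOWN event for a brick that is already marked down. */
    if (!ec-&gt;notified[child]) {
        ec-&gt;notified[child] = 1;
        ec-&gt;notify_count++;
    }
    ec-&gt;up[child] = is_up;

    /* Once every child has reported, the xlator can propagate its
     * own state upwards immediately. */
    return (ec-&gt;notify_count == ec-&gt;nodes);
}
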
Change-Id: I0acac0db7ec7622d4c0584692e88ad52f45a910f
BUG: 1464091
Signed-off-by: Ashish Pandey &lt;aspandey@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17606
Tested-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Xavier Hernandez &lt;xhernandez@datalab.es&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
</content>
</entry>
<entry>
<title>cluster/ec: Node uuid xattr support update for EC</title>
<updated>2017-06-23T02:56:35+00:00</updated>
<author>
<name>Sunil Kumar Acharya</name>
<email>sheggodu@redhat.com</email>
</author>
<published>2017-06-21T11:07:09+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=0c0bc42ddfef4f05b50c3d1510e93ef3ec292a56'/>
<id>0c0bc42ddfef4f05b50c3d1510e93ef3ec292a56</id>
<content type='text'>
Problem:
The change in EC to return a list of node uuids for
GF_XATTR_NODE_UUID_KEY was causing problems with
geo-rep.

Fix:
This patch allows getting the single node uuid,
as before, with the key
"GF_XATTR_NODE_UUID_KEY", and also allows getting
the list of node uuids by using a new key,
"GF_XATTR_LIST_NODE_UUIDS_KEY". This solves
the problem with geo-rep and any other features which
depend on this.

BUG: 1462790
Change-Id: I2d9214a9658d4a41a3d6de08600884d2bda5f3eb
Signed-off-by: Sunil Kumar Acharya &lt;sheggodu@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17594
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Xavier Hernandez &lt;xhernandez@datalab.es&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
</content>
</entry>
<entry>
<title>cluster/ec: lk shouldn't be a transaction</title>
<updated>2017-06-16T08:18:51+00:00</updated>
<author>
<name>Pranith Kumar K</name>
<email>pkarampu@redhat.com</email>
</author>
<published>2017-06-13T18:05:40+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=26ca39ccf0caf0d55c88b05396883dd10ab66dc4'/>
<id>26ca39ccf0caf0d55c88b05396883dd10ab66dc4</id>
<content type='text'>
Problem:
When an application sends a blocking lock, the lk fop actually waits under
inodelk.  This can lead to a dead-lock.
1) Let's say app-1 takes an exclusive-fcntl-lock on the file
2) app-2 attempts an exclusive-fcntl-lock on the file, which goes to the
   blocking stage. Note: app-2 is blocked inside a transaction which holds
   an inode-lock
3) app-1 tries to perform a write which needs the inode-lock, so it gets
   blocked on app-2 to unlock the inodelk, while app-2 is blocked on app-1
   to unlock the fcntl-lock

Fix:
The correct way to fix this issue and make fcntl locks perform well would be
to introduce 2-phase locking for fcntl locks:
1) Implement a try-lock phase where the locks xlator will not merge the lk
   call with existing calls until a commit-lock phase.
2) If in the try-lock phase we get a quorum number of successes without any
   EAGAIN error, then send a commit-lock which will merge the locks.
3) In case there are any errors, unlock should just delete the lock-object
   which was tried earlier and shouldn't touch the committed locks.

Unfortunately this is a sizeable feature and needs to be thought through for
any corner cases.  Until then, remove the transaction from the lk call.

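For context, the "blocking exclusive fcntl lock" that applications send
corresponds, on the client side, to an F_SETLKW request (standard POSIX
usage, shown here outside of gluster):

#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(int argc, char *argv[])
{
    int fd;

    if (argc &lt; 2 || (fd = open(argv[1], O_RDWR)) == -1)
        return 1;

    struct flock fl = {
        .l_type   = F_WRLCK,    /* exclusive (write) lock   */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,          /* 0 means the whole file   */
    };

    /* F_SETLKW blocks until the lock is granted; this is the request
     * that previously had to wait inside the EC transaction. */
    if (fcntl(fd, F_SETLKW, &amp;fl) == -1) {
        perror("fcntl(F_SETLKW)");
        return 1;
    }
    printf("got exclusive lock on %s\n", argv[1]);

    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &amp;fl);
    close(fd);
    return 0;
}
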
BUG: 1455049
Change-Id: I18a782903ba0eb43f1e6526fb0cf8c626c460159
Signed-off-by: Pranith Kumar K &lt;pkarampu@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17542
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Ashish Pandey &lt;aspandey@redhat.com&gt;
Reviewed-by: Xavier Hernandez &lt;xhernandez@datalab.es&gt;
</content>
</entry>
<entry>
<title>cluster/ec: Update xattr and heal size properly</title>
<updated>2017-06-06T14:41:52+00:00</updated>
<author>
<name>Ashish Pandey</name>
<email>aspandey@redhat.com</email>
</author>
<published>2017-04-03T07:16:29+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=88c67b72b1d5843d11ce7cba27dd242bd0c23c6a'/>
<id>88c67b72b1d5843d11ce7cba27dd242bd0c23c6a</id>
<content type='text'>
Problem-1: Recursive healing of the same file keeps happening
when IO is going on, even after data heal completes.

Solution:
RCA: At the end of the write, when ec_update_size_version
gets called, we send it only on the good bricks and not
on the healing brick. Due to this, the xattr on the healing
brick will always remain out of sync, and when the background
heal checks source and sink, it finds this brick needs to be
healed and starts healing from scratch. That involves
ftruncate and writing all of the data again.

To solve this, send xattrop on all the good bricks as
well as healing bricks.

Problem-2: The above fix exposes data corruption
during heal. If a write on a file is going on when
heal finishes, we find that the file gets corrupted.

RCA:
The real problem happens in ec_rebuild_data(). Here we receive the
'size' argument which contains the real file size at the time of
starting self-heal and it's assigned to heal-&gt;total_size.

After that, a sequence of calls to ec_sync_heal_block() are done. Each
call ends up calling ec_manager_heal_block(), which does the actual work
of healing a block.

First a lock on the inode is taken in state EC_STATE_INIT using
ec_heal_inodelk(). When the lock is acquired, ec_heal_lock_cbk() is
called. This function calls ec_set_inode_size() to store the real size
of the inode (it uses heal-&gt;total_size).

The next step is to read the block to be healed. This is done using a
regular ec_readv(). One of the things this call does is to trim the
returned size if the file is smaller than the requested size.

In our case, when we read the last block of a file whose size was = 512
mod 1024 at the time of starting self-heal, ec_readv() will return only
the first 512 bytes, not the whole 1024 bytes.

This isn't a problem since the following ec_writev() sent from the heal
code only attempts to write the amount of data read, so it shouldn't
modify the remaining 512 bytes.

However, ec_writev() also checks the file size. If we are writing the
last block of the file (determined by the size stored on the inode that
we have set to heal-&gt;total_size), any data beyond the (imposed) end of
file will be cleared with 0's. This causes the 512 bytes after
heal-&gt;total_size to be cleared. Since the file was written after heal
started, these bytes contained data, so the block written to the
damaged brick will be incorrect.

Solution:
Align heal-&gt;total_size to a multiple of the stripe size.

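The alignment itself is simple round-up arithmetic; a minimal sketch
(hypothetical helper name):

#include &lt;stdint.h&gt;

/* Round 'size' up to the next multiple of 'stripe_size'.  With a
 * 1024-byte stripe, a heal size of 512 becomes 1024, so the heal no
 * longer zeroes out data written after the heal started. */
static uint64_t align_to_stripe(uint64_t size, uint64_t stripe_size)
{
    return ((size + stripe_size - 1) / stripe_size) * stripe_size;
}
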
Thanks to "Xavier Hernandez" &lt;xhernandez@datalab.es&gt;
for finding the root cause and fixing the issue.

Change-Id: I6c9f37b3ff9dd7f5dc1858ad6f9845c05b4e204e
BUG: 1428673
Signed-off-by: Ashish Pandey &lt;aspandey@redhat.com&gt;
Reviewed-on: https://review.gluster.org/16985
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt;
Reviewed-by: Xavier Hernandez &lt;xhernandez@datalab.es&gt;
</content>
</entry>
<entry>
<title>core: fix spelling errors</title>
<updated>2017-06-02T11:50:43+00:00</updated>
<author>
<name>Kaleb S. KEITHLEY</name>
<email>kkeithle@redhat.com</email>
</author>
<published>2017-06-01T10:56:22+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=07fd39479bdd6502f7781894be06eb66aaa8ef10'/>
<id>07fd39479bdd6502f7781894be06eb66aaa8ef10</id>
<content type='text'>
fixes for various minor spelling errors and typos

Reported-by: Patrick Matthäi &lt;pmatthaei@debian.org&gt;
Change-Id: Ic1be36f82e3d822bbdc9559878bd79520fc0fcd5
BUG: 1457808
Signed-off-by: Kaleb S. KEITHLEY &lt;kkeithle@redhat.com&gt;
Reviewed-on: https://review.gluster.org/17442
NetBSD-regression: NetBSD Build System &lt;jenkins@build.gluster.org&gt;
CentOS-regression: Gluster Build System &lt;jenkins@build.gluster.org&gt;
Reviewed-by: Niels de Vos &lt;ndevos@redhat.com&gt;
Smoke: Gluster Build System &lt;jenkins@build.gluster.org&gt;
</content>
</entry>
</feed>
