glusterfs.git/xlators/cluster/ec/src/ec-dir-write.c, branch v3.7.2

cluster/ec: Fix all EIO errors in EC

2015-05-28T11:12:06+00:00

        Backport of http://review.gluster.org/10770
        Backport of http://review.gluster.org/10806
        Backport of http://review.gluster.org/10787
        Backport of http://review.gluster.org/10868
        Backport of http://review.gluster.com/10852

 - When a blocking lock is requested, lock request is succeeded even when
ec->fragment number of locks are acquired successfully in non-blocking locking
phase. This will lead to fop succeeding only on the bricks where the locks are
acquired, leading to the necessity of self-heals. To prevent these un-necessary
self-heals, if the remaining locks fail with EAGAIN in non-blocking lock phase
try blocking locking phase instead.

 -  Handle lookup failures while op in progress

 - cluster/ec: Correctly cleanup delayed locks
When a delayed lock is pending, a graph switch doesn't correctly
terminate it. This means that the update of version and size xattrs
is lost, causing EIO errors. This patch handles GF_EVENT_PARENT_DOWN
event to correctly finish pending udpdates before completing the
graph switch.

 - Fix use after free crash
ec_heal creates ec_fop_data but doesn't run ec_manager. ec_fop_data_allocate
adds this fop to ec->pending_fops, because ec_manager is not run on this heal
fop it is never removed from ec->pending_fops. When it is accessed after free
it leads to crash. It is better to not to add HEAL fops to ec->pending_fops
because we don't want graph switch to hang the mount because of a BIG
file/directory heal.

- Forced unlock when lock contention is detected
EC uses an eager lock mechanism to optimize multiple read/write
requests on the same entry or inode. This increases performance
but can have adverse results when other clients try to access the
same entry/inode. To solve this, this patch adds a functionality
to detect when this happens and force an earlier release to not
block other clients.

The method consists on requesting GF_GLUSTERFS_INODELK_COUNT and
GF_GLUSTERFS_ENTRYLK_COUNT for all fops that take a lock. When this
count is greater than one, the lock is marked to be released. All
fops already waiting for this lock will be executed normally before
releasing the lock, but new requests that also require it will be
blocked and restarted after the lock has been released and reacquired
again.

Another problem was that some operations did correctly lock the
parent of an entry when needed, but got the size and version xattrs
from the entry instead of the parent.

This patch solves this problem by binding all queries of size and
version to each lock and replacing all entrylk calls by inodelk ones
to remove concurrent updates on directory metadata.  This also allows
rename to correctly update source and destination directories.

BUG: 1225279
Change-Id: I02a6084b138dd38e018a462347cd9ce38610c7ef
Reviewed-on: http://review.gluster.org/10926
Tested-by: NetBSD Build System
Tested-by: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/ec: add separate versions for data/entry, metadata

2015-05-08T17:29:38+00:00

Adding 64 bits in "version" key of extended attributes. First 64 bits (Left)
represents Data version. Last 64 bits (right) represents Meta Data version.

Note: 3.7 and 3.6 version ec can't co-exist with this change because xattrop in
3.6 will fail with ERANGE as the buffer passed to it will be '8' bytes where as
the value will be 16 bytes in 3.7. Where as 3.7 version clients can work with
old version files. For upgrades we need to tell users to complete heals and
then upgrade

BUG: 1215265
Change-Id: Ib85114680cb7e75b8371c984d9f7b6401c1ffb93
Signed-off-by: Ashish Pandey 
Reviewed-on: http://review.gluster.org/10312
Reviewed-on: http://review.gluster.org/10626
Tested-by: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

cluster/ec: Refactor inode-writev

2015-04-07T05:35:07+00:00

All _cbk() functions in inode-write.c do same things, i.e. store
op_ret/op_errno, stat structures if they are available and combine them.  Moved
this common operation into one function ec_inode_write_cbk() and made all the
other _cbk() functions to use this instead.

Change-Id: I2387b9f2d9598ced6299a26ea1900e9deb9fadc4
BUG: 1199767
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/9981
Tested-by: Gluster Build System 
Reviewed-by: Dan Lambright

cluster/ec: Refactor ec-dir-write

2015-03-24T07:16:23+00:00

- Also fixed iatt_combine to go over all the valid iatts

Change-Id: I1d52d705ed0437f602357acde3e479cedb748681
BUG: 1199767
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/9827
Tested-by: Gluster Build System 
Reviewed-by: Xavier Hernandez 
Reviewed-by: Vijay Bellur

ec: Fix posix compliance failures

2015-01-29T03:49:29+00:00

This patch solves some problems that caused dispersed volumes to not
pass posix smoke tests:

* Problems in open/create with O_WRONLY
    Opening files with -w- permissions using O_WRONLY returned an EACCES
    error because internally O_WRONLY was replaced with O_RDWR.

* Problems with entrylk on renames.
    When source and destination were the same, ec tried to acquire
    the same entrylk twice, causing a deadlock.

* Overwrite of a variable when reordering locks.
    On a rename, if the second lock needed to be placed at the beggining
    of the list, the 'lock' variable was overwritten and later its timer
    was cancelled, cancelling the incorrect one.

* Handle O_TRUNC in open.
    When O_TRUNC was received in an open call, it was blindly propagated
    to child subvolumes. This caused a discrepancy between real file
    size and the size stored into trusted.ec.size xattr. This has been
    solved by removing O_TRUNC from open and later calling ftruncate.

Change-Id: I20c3d6e1c11be314be86879be54b728e01013798
BUG: 1161886
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/9420
Reviewed-by: Dan Lambright 
Tested-by: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Pranith Kumar Karampuri

ec: Remove O_APPEND from flags on create and open.

2015-01-09T17:46:08+00:00

Allowing O_APPEND flag to pass through to the brick files
corrupts fragment contents because writes are not stored on
the desired place.

Write fop has been modified so that it uses current file
size as its write offset. This guarantees that all writes,
even those comming from different file descriptors and
clients, will write to the end of the file.

Change-Id: I9f721f12217a98231fe52e344166d1c94172c272
BUG: 1161621
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/9079
Tested-by: Gluster Build System 
Reviewed-by: Dan Lambright 
Reviewed-by: Vijay Bellur

ec: Fix return errors when not enough bricks

2014-12-05T11:39:07+00:00

Changes introduced by this patch:

* Fix an incorrect error propagation when the state of the life
  cycle of a fop returns an error.

* Fix incorrect unlocking of failed locks.

* Return ENOTCONN if there aren't enough bricks online.

* In readdir(p) check that the fd has been successfully open by
  a previous opendir.

Change-Id: Ib44f25a1297849ebcbab839332f3b6359f275ebe
BUG: 1162805
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/9098
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

ec: Fix self-healing issues.

2014-12-04T19:34:38+00:00

Three problems have been detected:

1. Self healing is executed in background, allowing the fop that
   detected the problem to continue without blocks nor delays.

   While this is quite interesting to avoid unnecessary delays,
   it can cause spurious failures of self-heal because it may
   try to recover a file inside a directory that a previous
   self-heal has not recovered yet, causing the file self-heal
   to fail.

2. When a partial self-heal is being executed on a directory,
   if a full self-heal is attempted, it won't be executed
   because another self-heal is already in process, so the
   directory won't be fully repaired.

3. Information contained in loc's of some fop's is not enough
   to do a complete self-heal.

To solve these problems, I've made some changes:

* Improved ec_loc_from_loc() to add all available information
  to a loc.

* Before healing an entry, it's parent is checked and partially
  healed if necessary to avoid failures.

* All heal requests received for the same inode while another
  self-heal is being processed are queued. When the first heal
  completes, all pending requests are answered using the results
  of the first heal (without full execution), unless the first
  heal was a partial heal. In this case all partial heals are
  answered, and the first full heal is processed normally.

* An special virtual xattr (not physically stored on bricks)
  named 'trusted.ec.heal' has been created to allow synchronous
  self-heal of files.

  Now, the recommended way to heal an entire volume is this:

    find  -d -exec getfattr -h -n trusted.ec.heal {} \;

Some minor changes:

* ec_loc_prepare() has been renamed to ec_loc_update().

* All loc management functions return 0 on success and -1 on
  error.

* Do not delay fop unlocks if heal is needed.

* Added basic ec xattrs initially on create, mkdir and mknod
  fops.

* Some coding style changes

Change-Id: I2a5fd9c57349a153710880d6ac4b1fa0c1475985
BUG: 1161588
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/9072
Tested-by: Gluster Build System 
Reviewed-by: Dan Lambright

ec: Change license

2014-12-03T18:45:26+00:00

Change-Id: Iae90ade2421898417b53dec0417a610cf306c44b
BUG: 1168167
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/9201
Reviewed-by: Kaleb KEITHLEY 
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

ec: Fix rebalance issues

2014-10-27T14:18:40+00:00

Some issues in ec xlator made that rebalance didn't complete
successfully and generated some warnings and errors in the
log. The most critical error was a race condition that caused
false corruption detection when two specific operations were
executed sequentially and they shared the same lock.

This explains the problem:

1. A setxattr is issued.
2. setxattr: ec locks the inode before updating the xattr.
3. setxattr: The xattr is updated.
4. setxattr: Upper xlator is notified that the operation completed.
5. setxattr: A background task is initiated to update the version
             of the file.
6. A stat is issued on the same file.
7. stat: Since the lock is already acquired, it's reused.
8. stat: A lookup is issued to determine version and size
         information of the file.

At this point, operations 5 and 8 can interfere. This can make that
lookup sees different information on each brick, determining that
some bricks are corrupted and incorrectly excluding them from the
operation and initiating a self-heal. In some cases this false
detection combined with self-heal could lead to invalid updates of
the trusted.ec.size xattr, leaving the file smaller than it should
be.

This only happens if the first operation does not perform a lookup,
because chained operations reuse the information returned by the
previous one, avoiding this kind of problems.

To solve this, now the background update is executed atomically with
the posterior unlock. This avoids some reuses of the lock while
updating. However this reduces performance because the window in
which new requests can reuse the lock is much smaller now. This has
been alleviated by using the same technique implemented in AFR (i.e.
waiting some time before releasing the lock).

Some minor changes also introduced in this patch:

* Bug in management of 'trusted.glusterfs.pathinfo' that was writing
  beyond the allocated space.
* Uninitialized variable.
* trusted.ec.config was not created for regular files created with
  mknod.
* An invalid state was used in access fop.

Change-Id: Idfaf69578ed04dbac97a62710326729715b9b395
BUG: 1152902
Signed-off-by: Xavier Hernandez 
Reviewed-on: http://review.gluster.org/8947
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur