glusterfs.git/libglusterfs/src, branch v3.7.0beta2

features/bit-rot-stub: versioning of objects in write/truncate fop instead of open

2015-05-10T15:14:33+00:00

* This patch brings in the changes where object versioning is done in write and
  truncate fops instead of tracking them in open and create fops. This model
  works for both regular and anonymous fds. It also removes the race associated
  with open calls, create and lookups.

  This patch follows the below method for object versioning and notifications:

  Before sending writev on the fd, increase the ongoing
  version first. This makes anonymous fd write similar to the regular
  fd write by having the ongoing version increased before doing the
  write.

  Do the following steps to do versioning:
  1) For anonymous fds, set the fd context (so that release is invoked) and add
     the fd context to the list maintained in the inode context.
     For regular fds, the above would already have been done in open itself.
  2) Increase the on-disk ongoing version.
  3) Increase the in-memory ongoing version and mark the inode as non-dirty.
  4) Once versioning is successfully done, send the write operation. If
     versioning fails, then fail the write fop.
  5) In writev_cbk, mark the inode as modified.
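
  The ordering above can be sketched roughly as follows. This is a minimal
  illustration, not the bit-rot-stub code: struct obj_state, version_object()
  and do_write() are hypothetical names, and the disk_ok flag only simulates
  whether the on-disk version update succeeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-object state; not the real bit-rot-stub structures. */
struct obj_state {
    uint64_t disk_version; /* ongoing version persisted on disk */
    uint64_t mem_version;  /* ongoing version cached in memory */
    bool     dirty;        /* versioning still pending */
    bool     modified;     /* set once a write has completed */
    bool     fd_tracked;   /* fd context is on the inode's list */
};

/* Track the fd if needed, then bump the on-disk and in-memory versions.
 * disk_ok simulates the outcome of the on-disk update (assumption). */
int
version_object(struct obj_state *o, bool anonymous_fd, bool disk_ok)
{
    assert(o != NULL);
    if (anonymous_fd)
        o->fd_tracked = true; /* regular fds were tracked at open time */
    if (!disk_ok)
        return -1;            /* on-disk versioning failed */
    o->disk_version++;
    o->mem_version++;
    o->dirty = false;
    return 0;
}

/* Issue the write only when versioning succeeded. */
int
do_write(struct obj_state *o, bool anonymous_fd, bool disk_ok)
{
    if (version_object(o, anonymous_fd, disk_ok) != 0)
        return -1;            /* fail the write fop */
    /* ... the actual write would be sent here ... */
    o->modified = true;       /* what the write callback would record */
    return 0;
}
```

  The point of the sketch is only the ordering: the version is bumped before
  the write is issued, and a versioning failure fails the write.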

> Change-Id: I7104391bbe076d8fc49b68745d2ec29a6e92476c
> BUG: 1207979
> Signed-off-by: Raghavendra Bhat 
> Reviewed-on: http://review.gluster.org/10233
> Tested-by: Gluster Build System 
> Reviewed-by: Vijay Bellur 

Change-Id: I4bb86989b5fab02b9ed2950798b1a80e566f1024
BUG: 1220041
Signed-off-by: Raghavendra Bhat 
Reviewed-on: http://review.gluster.org/10722
Reviewed-by: Gaurav Kumar Garg 
Tested-by: NetBSD Build System
Tested-by: Gluster Build System

features/bitrot: Throttle filesystem scrubber

2015-05-10T12:29:31+00:00

This patch introduces a multithreaded filesystem scrubber based
on the throttling option configured for a particular volume. The
implementation "logically" breaks scanning and scrubbing with
the number of scrubber threads auto-configured depending upon
the throttle configuration. Scanning (crawling) is left single
threaded (per brick) with entries scrubbed in bulk. On reaching
this "bulk" watermark, scanner waits until entries are scrubbed.
Bricks for a particular volume have a set of thread(s) assigned
for scrubbing, with entries for each brick scrubbed in a round
robin fashion to avoid scrub "stalls" when a brick (out of N
bricks) is under active scrubbing.
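
The round-robin selection described above might look roughly like the
sketch below. It is single-threaded and purely illustrative: struct
scrub_queue and scrub_pick_next() are hypothetical names, and the real
scrubber threads, locking and entry queues are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-volume scrub state; names are illustrative only. */
struct scrub_queue {
    size_t    nbricks;
    size_t    next;     /* brick to try first on the next pick */
    unsigned *pending;  /* pending[i] = entries queued for brick i */
};

/* Pick the next brick that has queued entries, in round-robin order,
 * and consume one of its entries. Returns the brick index, or -1 when
 * nothing is queued on any brick. */
int
scrub_pick_next(struct scrub_queue *q)
{
    assert(q != NULL && q->pending != NULL);
    for (size_t i = 0; i < q->nbricks; i++) {
        size_t b = (q->next + i) % q->nbricks;
        if (q->pending[b] > 0) {
            q->pending[b]--;
            q->next = (b + 1) % q->nbricks; /* start after this brick next time */
            return (int)b;
        }
    }
    return -1;
}
```

Because the search always resumes after the brick served last, a brick
with a long backlog cannot starve the others.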

This mechanism helps us implement "pause/resume" with ease: all
one needs to do is clean up the scrubber threads and let the main
scanner thread "wait" until scrubbing is resumed (where the
scrubber thread(s) are spawned again), therefore continuing
where we left off (unless we restart the daemons, in which case the
crawl starts from the root directory again, but I guess that's OK).

[
    NOTE:

    Throttling is optional for the signer daemon, without which
    it runs full throttle. However, passing "-DBR_RATE_LIMIT_SIGNER"
    predefined in CFLAGS enables CPU throttling (during checksum
    calculation) thereby avoiding high CPU usage.
]

Subsequent patches would introduce CPU throttling during hash
calculation for scrubber.

> Change-Id: I5701dd6cd4dff27ca3144ac5e3798a2216b39d4f
> BUG: 1207020
> Signed-off-by: Venky Shankar 
> Reviewed-on: http://review.gluster.org/10511
> Tested-by: Gluster Build System 
> Reviewed-by: Vijay Bellur 

Change-Id: I5a125b2d0ac7dafd3e278b7fe4c6c9dd07af76dd
Signed-off-by: Venky Shankar 
BUG: 1220041
Reviewed-on: http://review.gluster.org/10720
Tested-by: Gluster Build System 
Reviewed-by: Gaurav Kumar Garg

features/bitrot: Follow xattr naming conventions

2015-05-10T12:28:43+00:00

Instead of "trusted.glusterfs.bit-rot.*" use "trusted.bit-rot.*"

NOTE:
With this patch, data on existing volumes would be re-signed
(which should be OK since we do not expect many users as of
now :-))

> Change-Id: I926c7bca266a9c8f2cb35d57c4d0359aa5cecfa0
> BUG: 1170075
> Signed-off-by: Venky Shankar 
> Reviewed-on: http://review.gluster.org/10181
> Tested-by: NetBSD Build System
> Tested-by: Gluster Build System 
> Reviewed-by: Vijay Bellur 

Change-Id: I3c18d7dc2db4beaca6e8d8d231b4171a7b18795f
Signed-off-by: Venky Shankar 
BUG: 1220041
Reviewed-on: http://review.gluster.org/10718
Tested-by: Gluster Build System 
Reviewed-by: Gaurav Kumar Garg

core: Global timer-wheel

2015-05-10T12:27:40+00:00

Instantiate a process wide global instance of the timer wheel
data structure. Spawning glusterfs* process with option arg
"--global-timer-wheel" instantiates a global instance of
timer-wheel under global context (->ctx).

Translators can make use of this process wide instance [via a
call to glusterfs_global_timer_wheel()] instead of maintaining
an instance of their own and possibly consuming more memory.
The Linux kernel too has a single instance of the timer wheel, which
subsystems such as IO, networking, etc. make use of.

The bitrot daemon would be an early consumer of this: bitrot translator
instances for multiple volumes would track objects belonging to
their respective bricks in this global expiry tracking data
structure. This is also a first step to move GlusterFS timer
mechanism to use timer-wheel.
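
To illustrate the idea of one shared expiry structure, here is a toy
single-level wheel with one process-wide instance. This is a sketch
only: struct wheel, shared_wheel(), wheel_add() and wheel_tick() are
hypothetical names and do not reflect the timer-wheel code in the tree
or the signature of glusterfs_global_timer_wheel(). Timers are just
counted per slot, with no callbacks and no locking.

```c
#include <assert.h>
#include <stddef.h>

#define WHEEL_SLOTS 8 /* assumed size, chosen only for the example */

struct wheel {
    unsigned long now;                  /* current tick */
    unsigned      pending[WHEEL_SLOTS]; /* timers due in each slot */
};

/* One instance shared by every caller in the process. */
static struct wheel process_wheel;

struct wheel *
shared_wheel(void)
{
    return &process_wheel;
}

/* Arm a timer that fires `delay` ticks from now (1 <= delay < WHEEL_SLOTS). */
void
wheel_add(struct wheel *w, unsigned delay)
{
    assert(w != NULL && delay >= 1 && delay < WHEEL_SLOTS);
    w->pending[(w->now + delay) % WHEEL_SLOTS]++;
}

/* Advance one tick and return how many timers expired on it. */
unsigned
wheel_tick(struct wheel *w)
{
    assert(w != NULL);
    w->now++;
    unsigned slot = (unsigned)(w->now % WHEEL_SLOTS);
    unsigned fired = w->pending[slot];
    w->pending[slot] = 0;
    return fired;
}
```

Two callers that obtain the wheel through shared_wheel() add to the
same slots, so only one structure is kept per process.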

> Change-Id: Ie882df607e07acaced846ea269ebf1ece306d6ae
> BUG: 1170075
> Signed-off-by: Venky Shankar 
> Reviewed-on: http://review.gluster.org/10380
> Tested-by: NetBSD Build System
> Reviewed-by: Vijay Bellur 
> Tested-by: Gluster Build System 

Change-Id: I35c840daa9996a059699f8ea5af54c76ede7e09c
Signed-off-by: Venky Shankar 
Signed-off-by: Gaurav Kumar Garg 
BUG: 1220041
Reviewed-on: http://review.gluster.org/10716
Tested-by: Gluster Build System

CTR/Libgfdb: Log typo fix

2015-05-10T09:30:18+00:00

Log typo fix for CTR Xlator and Libgfdb

Change-Id: Ia39069a5ce9c48bbee937f1b5c5d749a30c9ac56
BUG: 1220100
Signed-off-by: Joseph Fernandes 
Reviewed-on: http://review.gluster.org/10742
Reviewed-by: N Balachandran 
Tested-by: Gluster Build System

dht: make lookup-unhashed=auto do something actually useful

2015-05-10T04:55:09+00:00

The key concept here is to determine whether a directory is "clean" by
comparing its last-known-good topology to the current one for the
volume.  These are stored as "commit hashes" on the directory and the
volume root respectively.  The volume's commit hash changes whenever a
brick is added or removed, and a fix-layout is done.  A directory's
commit hash changes only when a full rebalance (not just fix-layout)
is done on it.  If all bricks are present and have a directory
commit hash that matches the volume commit hash, then we can assume
that every file is in its "proper" place. Therefore, if we look for
a file in that proper place and don't find it, we can assume it's not
on any other subvolume and *safely* skip the global (broadcast to all)
lookup.
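
The check described above can be sketched as follows. This is an
illustration under assumptions, not the dht code: struct brick_dir and
can_skip_global_lookup() are hypothetical names, and the commit hash is
assumed to fit in a 32-bit integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one brick's copy of a directory. */
struct brick_dir {
    bool     present;     /* brick is up and answered */
    uint32_t commit_hash; /* directory commit hash stored on that brick */
};

/* True when a miss on the hashed subvolume can be trusted, i.e. every
 * brick is present and its directory commit hash matches the volume's.
 * Only then may the broadcast lookup be skipped. */
bool
can_skip_global_lookup(const struct brick_dir *bricks, size_t n,
                       uint32_t volume_commit_hash)
{
    assert(bricks != NULL || n == 0);
    if (n == 0)
        return false;
    for (size_t i = 0; i < n; i++) {
        if (!bricks[i].present ||
            bricks[i].commit_hash != volume_commit_hash)
            return false; /* directory may not be "clean" */
    }
    return true;
}
```

A single missing brick or stale hash falls back to the global lookup,
which keeps the optimization safe.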

Change-Id: Id6ce4593ba1f7daffa74cfab591cb45960629ae3
BUG: 1220064
Reviewed-on-master: http://review.gluster.org/#/c/7702/
Signed-off-by: Jeff Darcy 
Signed-off-by: Shyam 
Reviewed-on: http://review.gluster.org/10729
Tested-by: Gluster Build System 
Reviewed-by: Krishnan Parthasarathi 
Reviewed-by: Vijay Bellur

bitrot/scrub: fix induced throttling in syncop_ftw_throttle()

2015-05-10T03:13:16+00:00

Failing to reset scanning counter causes "incorrect" delay of around
50 seconds per directory entry. This causes scrubber to run extremely
slowly.

[
    NOTE: This is a temporary fix. With the introduction of token
          bucket based throttling, inducing throttle via sleep()
          call would be unneeded.
]
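
The general pattern behind this kind of bug can be sketched as below.
This is not the syncop_ftw_throttle() code: struct throttle and
throttle_entry() are hypothetical, and the limit is an assumed
parameter. The caller is expected to sleep whenever 1 is returned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical throttle state; the limit is an assumed value. */
struct throttle {
    unsigned scanned; /* entries seen since the last pause */
    unsigned limit;   /* pause after this many entries */
    unsigned pauses;  /* number of pauses requested so far */
};

/* Account for one directory entry. Returns 1 when the caller should
 * pause before continuing, 0 otherwise. */
int
throttle_entry(struct throttle *t)
{
    assert(t != NULL && t->limit > 0);
    if (++t->scanned < t->limit)
        return 0;
    /* Without this reset the counter stays at or above the limit, so
     * every later entry would trigger a pause as well. */
    t->scanned = 0;
    t->pauses++;
    return 1;
}
```

With a limit of 3, seven entries lead to two pauses; dropping the reset
would turn that into a pause on each of the last five entries.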

Also, fix logging messages in scrubber to log brick and full path
of the object which is identified/marked as corrupted.

> Change-Id: Id501bd15dcdbd8a09613f80f9d84050304740027
> BUG: 1170075
> Signed-off-by: Venky Shankar 
> Reviewed-on: http://review.gluster.org/10375
> Tested-by: NetBSD Build System
> Tested-by: Gluster Build System 
> Reviewed-by: Raghavendra Bhat 
> Reviewed-by: Gaurav Kumar Garg 

Change-Id: I78f227f52f12549d62ecb35cbb70121424f7c2a7
BUG: 1220041
Reviewed-on: http://review.gluster.org/10714
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

core: use reference counting for mem_acct structures

2015-05-09T21:27:36+00:00

When freeing memory, our memory-accounting code expects to be able to
dereference from the (previously) allocated block to its owning
translator.  However, as we have already found once in option
validation and twice in logging, that translator might itself have
been freed and the dereference attempt causes one of our daemons to
crash with SIGSEGV.  This patch attempts to fix that as follows:

 * We no longer embed a struct mem_acct directly in a struct xlator,
   but instead allocate it separately.

 * Allocated memory blocks now contain a pointer to the mem_acct
   instead of the xlator.

 * The mem_acct structure contains a reference count, manipulated in
   both the normal and translator allocate/free code using atomic
   increments and decrements.

 * Because it's now a separate structure, we can defer freeing the
   mem_acct until its reference count reaches zero (either way).

 * Some unit tests were disabled, because they embedded their own
   copies of the implementation for what they were supposedly testing.
   Life's too short to spend time fixing tests that seem designed to
   impede progress by requiring a certain implementation as well as
   behavior.
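
A rough sketch of the scheme in the list above, using C11 atomics. The
names (struct acct, union blk_header, acct_new(), acct_alloc(),
acct_free(), acct_unref()) are hypothetical and much simpler than the
real mem_acct code; the sketch only shows blocks pointing at a
separately allocated, reference-counted record.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the accounting record. */
struct acct {
    atomic_uint refcnt;
};

/* Header in front of every block: points at the record, not the owner. */
union blk_header {
    struct acct *acct;
    max_align_t  align; /* keeps the user pointer suitably aligned */
};

struct acct *
acct_new(void)
{
    struct acct *a = malloc(sizeof(*a));
    if (a != NULL)
        atomic_init(&a->refcnt, 1); /* reference held by the owner */
    return a;
}

/* Drop one reference; returns 1 if this freed the record. */
int
acct_unref(struct acct *a)
{
    assert(a != NULL);
    if (atomic_fetch_sub(&a->refcnt, 1) == 1) {
        free(a);
        return 1;
    }
    return 0;
}

void *
acct_alloc(struct acct *a, size_t size)
{
    assert(a != NULL);
    union blk_header *h = malloc(sizeof(*h) + size);
    if (h == NULL)
        return NULL;
    atomic_fetch_add(&a->refcnt, 1); /* each live block holds a reference */
    h->acct = a;
    return h + 1;
}

/* Free a block; returns 1 if the record was freed along with it. */
int
acct_free(void *ptr)
{
    union blk_header *h = (union blk_header *)ptr - 1;
    struct acct *a = h->acct;
    free(h);
    return acct_unref(a);
}
```

The owner can drop its reference while blocks are still outstanding;
the record then goes away with the last block, and in the opposite
order it goes away with the owner.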

Change-Id: Id929b11387927136f78626901729296b6c0d0fd7
BUG: 1219026
Signed-off-by: Jeff Darcy 
Reviewed-on: http://review.gluster.org/10417
Tested-by: Gluster Build System 
Reviewed-by: Krishnan Parthasarathi 
Reviewed-by: Niels de Vos 
Reviewed-by: Pranith Kumar Karampuri 
Reviewed-on: http://review.gluster.org/10723
Tested-by: NetBSD Build System

cluster/afr : Prevent inode-evict during split-brain resolution

2015-05-09T08:54:56+00:00

Backport of: http://review.gluster.org/#/c/10134/

1) Provided a setfattr command to set the timeout for split-brain
choice.

2) If split-brain inspection/resolution is being done
from the mount for a file, ref the inode when
split-brain-choice is set.
This inode will be unconditionally unref-ed after the timeout
(in seconds) set by the user, or after the default timeout otherwise.

3) Updated the doc and testcase to reflect the changes.

Change-Id: I15c9037dee28855f21e680e7e3632e1f48dba4e1
BUG: 1219388
Reviewed-on: http://review.gluster.org/10134
Reviewed-by: Krutika Dhananjay 
Reviewed-by: Ravishankar N 
Tested-by: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri 
Signed-off-by: Anuradha 
Reviewed-on: http://review.gluster.org/10679

cluster/ec: data heal implementation for ec

2015-05-08T22:05:30+00:00

Data self-heal:
1) Take inode lock in domain 'this->name:self-heal' on 0-0 range (full file),
   so that no other processes try to do self-heal at the same time.
2) Take inode lock in domain 'this->name' on 0-0 range (full file).
3) Perform fxattrop+fstat and get the xattrs on all the bricks.
4) Choose as sources a set of ec->fragment number of bricks with the same version.
5) Truncate sinks.
6) Unlock lock taken in 2).
7) For each block: take full file lock, read from sources, write to the sinks, unlock.
8) Take full file lock and check that the file is still a sane copy, i.e. the file
   didn't become unusable while the bricks were offline.
   Restore mtime to its value from before healing.
9) xattrop with -ve values of 'dirty' and difference of highest and its own
   version values for version xattr.
10) Unlock lock acquired in 8).
11) Unlock lock acquired in 1).
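
The source-selection step can be sketched as below. This is an
illustration, not the ec heal code: pick_sources() and its parameters
are hypothetical, `fragments` stands in for the ec->fragment number
mentioned above, and preferring the highest qualifying version is an
assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Find a version held by at least `fragments` bricks (the highest such
 * version, by assumption) and mark those bricks as sources; all other
 * bricks are sinks. Returns 0 on success, -1 when no version has
 * enough copies to heal from. */
int
pick_sources(const uint64_t *versions, size_t nbricks, size_t fragments,
             int *is_source, uint64_t *chosen)
{
    assert(versions != NULL && is_source != NULL && chosen != NULL);
    assert(fragments > 0);
    int found = 0;
    for (size_t i = 0; i < nbricks; i++) {
        size_t same = 0;
        for (size_t j = 0; j < nbricks; j++)
            if (versions[j] == versions[i])
                same++;
        if (same >= fragments && (!found || versions[i] > *chosen)) {
            *chosen = versions[i];
            found = 1;
        }
    }
    for (size_t i = 0; i < nbricks; i++)
        is_source[i] = found && versions[i] == *chosen;
    return found ? 0 : -1;
}
```

When no version is shared by enough bricks, the function reports
failure and marks no sources, which corresponds to a file that cannot
be healed from the available fragments.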

Change-Id: I6f4d42cd5423c767262c9d7bb5ca7767adb3e5fd
BUG: 1216303
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/10384
Reviewed-on: http://review.gluster.org/10692
Tested-by: Gluster Build System