glusterfs.git/xlators/features/bit-rot/src/bitd/bit-rot.c, branch v3.7.1

features/bitrot: refactor brick connection logic

2015-05-31T04:15:04+00:00

    Backport of http://review.gluster.org/10763

Brick connection was bloated (and not implemented efficiently) with
calls which were not required to be called under lock. This resulted
in starvation of lock by critical code paths. This eventally did not
scale when the number of bricks per volume increases (add-brick and
the likes).

Also, this patch cleans up some of the weird reconnection logic that
added more to the starvation of resources and cleans up uncontrolled
growing of log files.

Change-Id: I05e737f2a9742944a4a543327d167de2489236a4
BUG: 1226146
Original-Author: Raghavendra Bhat 
Signed-off-by: Venky Shankar 
Reviewed-on: http://review.gluster.org/10986
Tested-by: Gluster Build System 
Tested-by: NetBSD Build System

features/bitrot: reimplement scrubbing frequency

2015-05-31T04:14:10+00:00

This patch reimplments existing scrub-frequency mechanism used
to schedule scrubber runs. Existing mechanism uses periodic
sleeps (waking up periodically on minimum granularity) and
performing a number of tracking checks based on counters and
sleep times. This patch does away with all the nifty counters
and uses timer-wheel to schedule scrub runs.

Scheduling changes are peformed by merely calculating the new
expiry time and calling mod_timer() [mod_timer_pending() in
some cases] making the code more debuggable and easier to
follow. This also introduces "hourly" scrubbing tunable as an
aid for testing scrubbing during development/testing cycle.

One could also implement on-demand scrubbing with ease: by
invoking mod_timer() with an expiry of one (1) second, thereby
scheduling a scrub run the very next second.

Change-Id: I6c7c5f0c6c9f886bf574d88c04cde14b76e60a8b
BUG: 1224647
Signed-off-by: Venky Shankar 
Reviewed-on: http://review.gluster.org/10902
Tested-by: Gluster Build System 
Tested-by: NetBSD Build System

features/bitrot: stub improvements and fixes

2015-05-31T04:12:48+00:00

This patch refactors the signing trigger mechanism used by bitrot
daemon as a "catch up" meachanism to sign files which _missed_
signing on the last run either due to bitrot being disabled and
enabled again or if bitrot is enabled for a volume with existing
data.

Existing implementation relies on overloading writev() to trigger
signing which just by the looks sounded dangerous and I hated it
to the core. This change moves all that business to the setxattr
interface thereby keeping the writev path strictly for client
IO.

Why not use IPC fop to trigger signing?
There's a need to access the object's inode to perform various
maintainance operations. inode is not _directly_ accessible in
the IPC fop (although, it can be found via inode_grep() for the
object's GFID - the inode just needs to be pinned in memory,
which is the case if there's an active fd on the inode). This
patch relies on good old technique of overloading fsetxattr()
to do the job instead of using IPC fop.

There are some pretty nice cleanups along the lines of memory
deallocations, unncessary allocations and redundant ref()ing
of structures (such as fd's) provided by this patch. All in
all - much improved code navigation.

Change-Id: Id93fe90b1618802d1a95a5072517dac342b96cb8
BUG: 1225709
Signed-off-by: Venky Shankar 
Reviewed-on: http://review.gluster.org/10953
Tested-by: Gluster Build System 
Tested-by: NetBSD Build System

features/bit-rot-stub: versioning of objects in write/truncate fop instead of open

2015-05-10T15:14:33+00:00

* This patch brings in the changes where object versioning is done in write and
  truncate fops instead of tracking them in open and create fops. This model
  works for both regular and anonymous fds. It also removes the race associated
  with open calls, create and lookups.

  This patch follows the below method for object versioning and notifications:

  Before sending writev on the fd, increase the ongoing
  version first. This makes anonymous fd write similar to the regular
  fd write by having the ongoing version increased before doing the
  write.

  Do following steps to do versioning:
  1) For anonymous fds set the fd context (so that release is invoked) and add
     the fd context to the list maintained in the inode context.
     For regular fds the above think would have been done in open itself.
  2) Increase the on-disk ongoing version
  3) Increase the in memory ongoing version and mark inode as non-dirty
  3) Once versioning is successfully done send write operation. If
     versioning fails, then fail the write fop.
  5) In writev_cbk mark inode as modified.

> Change-Id: I7104391bbe076d8fc49b68745d2ec29a6e92476c
> BUG: 1207979
> Signed-off-by: Raghavendra Bhat 
> Reviewed-on: http://review.gluster.org/10233
> Tested-by: Gluster Build System 
> Reviewed-by: Vijay Bellur 

Change-Id: I4bb86989b5fab02b9ed2950798b1a80e566f1024
BUG: 1220041
Signed-off-by: Raghavendra Bhat 
Reviewed-on: http://review.gluster.org/10722
Reviewed-by: Gaurav Kumar Garg 
Tested-by: NetBSD Build System
Tested-by: Gluster Build System

features/bitrot: scrubber should crawl based on the scrubber frequency value

2015-05-10T13:03:46+00:00

Currently scrubber is crawling all the files continuously. It should
crawl files based on the scrubber frequency which user have set.

By default scrubber crawling frequency value will be biweekly.

Change-Id: I5762a92c1e700134cfe4283d1f631904adbfe31d
BUG: 1220068
Signed-off-by: Gaurav Kumar Garg 
Reviewed-on: http://review.gluster.org/10739
Tested-by: NetBSD Build System
Tested-by: Gluster Build System 
Reviewed-by: Niels de Vos

features/bitrot: Scrubber pause/resume

2015-05-10T12:29:45+00:00

With logical scan/scrub split, pausing filesystem scrubber is an
override to the thread throttling mechanism, which effectively
throttles "down" number of scrubber threads to zero. This causes
scanner to wait until threads are spawned again (when resumed)
thereby continuing where it left off (since the file tree walk
stack is effectively preserved when the main scanner thread
is waiting for scrubbers to consume scanned entries).

The only catch is when scrubber daemon restarts: file tree walk
stack is lost and scrubbing initiates from root. This is probably
OK for now (can be changed later to persist parent directory
information before entering pause state).

> Change-Id: I5109a749b7fccd0f5367765078f46e6522dd32a1
> BUG: 1208131
> Signed-off-by: Venky Shankar 
> Reviewed-on: http://review.gluster.org/10521
> Reviewed-by: Vijay Bellur 
> Tested-by: Vijay Bellur 

Change-Id: I9b60f2ce24ca3787423a45ec7d502f89215fe45f
Signed-off-by: Venky Shankar 
BUG: 1220041
Reviewed-on: http://review.gluster.org/10721
Tested-by: Gluster Build System 
Reviewed-by: Gaurav Kumar Garg

features/bitrot: Throttle filesystem scrubber

2015-05-10T12:29:31+00:00

This patch introduces multithreaded filesystem scrubber based
on throttling option configured for a particular volume. The
implementation "logically" breaks scanning and scrubbing with
the number of scrubber threads auto-configured depending upon
the throttle configuration. Scanning (crawling) is left single
threaded (per brick) with entries scrubbed in bulk. On reaching
this "bulk" watermark, scanner waits until entries are scrubbed.
Bricks for a particular volume have a set of thread(s) assigned
for scrubbing, with entries for each brick scrubbed in a round
robin fashion to avoid scrub "stalls" when a brick (out of N
bricks) is under active scrubbing.

This mechanism helps us implement "pause/resume" with ease: all
one need to do is to cleanup scrubber threads and let the main
scanner thread "wait" untill scrubbing is resumed (where the
scrubber thread(s) are spawned again), therefore continuing
where we left off (unless we restart the deamons, where crawl
initiates from root directory again, but I guess that's OK).

[
    NOTE:

    Throttling is optional for the signer daemon, without which
    it runs full throttle. However, passing "-DBR_RATE_LIMIT_SIGNER"
    predefined in CFLAGS enables CPU throttling (during checksum
    calculation) thereby avoiding high CPU usage.
]

Subsequent patches would introduce CPU throttling during hash
calculation for scrubber.

> Change-Id: I5701dd6cd4dff27ca3144ac5e3798a2216b39d4f
> BUG: 1207020
> Signed-off-by: Venky Shankar 
> Reviewed-on: http://review.gluster.org/10511
> Tested-by: Gluster Build System 
> Reviewed-by: Vijay Bellur 

Change-Id: I5a125b2d0ac7dafd3e278b7fe4c6c9dd07af76dd
Signed-off-by: Venky Shankar 
BUG: 1220041
Reviewed-on: http://review.gluster.org/10720
Tested-by: Gluster Build System 
Reviewed-by: Gaurav Kumar Garg

features/bit-rot: Token Bucket based throttling

2015-05-10T12:29:05+00:00

BitRot daemons (signer & scrubber) are disk/cpu hoggers when left
running full throttle. Checksum calculations (especially SHA family
of hash routines) can be quite CPU intensive. Moreover periodic
disk scans performed by scrubber followed by reading data blocks
for hash calculation (which is also done by signer) generate lot
of heavy IO request(s). This causes interference with actual client
operations (be it a regular client or filesystems daemons such as
self-heal, etc..) and results in degraded system performance.

This patch introduces throttling based on Token Bucket Filtering[1].
It's a well known algorithm for checking (and ensuring) that data
transmission conform to defined limits and generally used in packet
switched networks. Linux control groups (Cgroups) uses a variant[2]
of this algorithm to provide block device IO throttling (cgroup
subsys "blkio": blk-iothrottle).

So, why not just live with Cgroups?
Cgroups is linux specific. We need to have a throttling mechanism
for other supported UNIXes. Moreover, having our own implementation
gives much more finer control in terms of tuning it for our needs
(plus the simplicity of the alogorithm itself).

Ideally, throttling should be a part of server stack (either as a
separate translator or integrated with io-threads) since that's
the point of entry for IO request(s) from *all* client(s). That
way one could selectively throttle IO request(s) based on client
PIDs (frame->root->pid), e.g., self-heal daemon, bitrot, etc..
(*actual* clients can run full throttle). This implementation
avoids that deliberately (there needs to be a much more smarter
queueing mechanism) and throttles CPU usage for hash calculations.

This patch is just the infrastructure part with no interfaces
exposed to set various throttling values. The tunable selected
here (basically hardcoded) avoids 100% CPU usage during hash
calculation (with some bursts cycles). We'd need much more
intensive test(s) to assign values for various throttling
options (lazy/normal/aggressive).

[1] https://en.wikipedia.org/wiki/Token_bucket
[2] http://en.wikipedia.org/wiki/Token_bucket#Hierarchical_token_bucket

> Change-Id: Icc49af80eeab6adb60166d0810e69ef37cfe2fd8
> BUG: 1207020
> Signed-off-by: Venky Shankar 
> Reviewed-on: http://review.gluster.org/10307
> Reviewed-by: Vijay Bellur 
> Tested-by: Vijay Bellur 

Change-Id: I034ba1095aa3bfc3212a67a63ffb931431b372f6
Signed-off-by: Venky Shankar 
BUG: 1220041
Reviewed-on: http://review.gluster.org/10719
Tested-by: NetBSD Build System
Tested-by: Gluster Build System 
Reviewed-by: Gaurav Kumar Garg

features/bitrot: Follow xattr naming conventions

2015-05-10T12:28:43+00:00

Instead of "trusted.glusterfs.bit-rot.*" use "trusted.bit-rot.*"

NOTE:
With this patch, data on existing volumes would be resigned
(which should be OK as of now since we do not expect many
users as of now :-))

> Change-Id: I926c7bca266a9c8f2cb35d57c4d0359aa5cecfa0
> BUG: 1170075
> Signed-off-by: Venky Shankar 
> Reviewed-on: http://review.gluster.org/10181
> Tested-by: NetBSD Build System
> Tested-by: Gluster Build System 
> Reviewed-by: Vijay Bellur 

Change-Id: I3c18d7dc2db4beaca6e8d8d231b4171a7b18795f
Signed-off-by: Venky Shankar 
BUG: 1220041
Reviewed-on: http://review.gluster.org/10718
Tested-by: Gluster Build System 
Reviewed-by: Gaurav Kumar Garg

features/bitrot: Use global timer wheel

2015-05-10T12:28:23+00:00

> Change-Id: I761927ea263b4144b851881f25791fda5b794f59
> BUG: 1170075
> Signed-off-by: Venky Shankar 
> Reviewed-on: http://review.gluster.org/10381
> Tested-by: NetBSD Build System
> Tested-by: Gluster Build System 
> Reviewed-by: Raghavendra Bhat 
> Reviewed-by: Vijay Bellur 

Change-Id: I4aa7c0d8b42b4c8d14a1d810e54c2de4d52b4389
Signed-off-by: Venky Shankar 
BUG: 1220041
Reviewed-on: http://review.gluster.org/10717
Tested-by: Gluster Build System 
Reviewed-by: Gaurav Kumar Garg