Feature
=======

BitRot Detection

1 Summary
=========

BitRot detection is a technique used to identify an “insidious” type of
disk error where data is silently corrupted with no indication from the
disk to the storage software layer that an error has occurred. BitRot
detection is exceptionally useful when using JBOD (which has no way of
knowing that the data is corrupted on disk) rather than RAID (esp. RAID6,
which has a performance penalty for certain kinds of workloads).

2 Use cases
===========

-   Archival/Compliance
-   OpenStack Cinder
-   Gluster health

See
[this thread](http://supercolony.gluster.org/pipermail/gluster-devel/2014-December/043248.html)
for an elaborate discussion of the use cases.

3 Owners
========

Venky Shankar <vshankar@redhat.com, yknev.shankar@gmail.com>  
Raghavendra Bhat <rabhat@redhat.com>  
Vijay Bellur <vbellur@redhat.com>  

4 Current Status
================

The initial approach is described [here](http://goo.gl/TSjLJn). The document
goes into some detail on why one could end up with "rotten" data and the
approaches taken by block-level filesystems to detect and recover from bitrot.
Some of the design goals are carried forward and made to fit GlusterFS.

Status as of 11th Feb 2015:

Done

-   Object notification
-   Object expiry tracking using timer-wheel

In Progress

-   BitRot server stub
-   BitRot Daemon

5 Detailed Description
======================

**NOTE: Points marked with [NIS] are "Not in Scope" for 3.7 release.**

The basic idea is to maintain file data/metadata checksums as an
extended attribute. Checksum granularity is per file for now; however,
this can be extended to per-block checksums over "block-size" chunks. A
BitRot daemon per brick is responsible for checksum maintenance for files
local to the brick. Distributing the work this way enables scale and
effective use of cluster resources (memory, disk, etc.).
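
As a rough illustration of a per-file checksum stored together with its hash
type, the following is a minimal Python sketch, not the BitD implementation.
The one-byte hash-type prefix, the `HASH_SHA256` constant and the function
names are all assumptions made for this example.

```python
import hashlib

# Hypothetical identifier for the hash type; the real encoding is not
# specified in this document.
HASH_SHA256 = 1


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 digest of a regular file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def pack_signature(digest, hash_type=HASH_SHA256):
    """Prefix the digest with one byte naming the hash type (assumed layout)."""
    return bytes([hash_type]) + digest


def unpack_signature(blob):
    """Split a stored signature back into (hash_type, digest)."""
    return blob[0], blob[1:]
```

On Linux, a value produced by `pack_signature()` could then be written with
`os.setxattr(path, name, value)`; the attribute name to use is left open here.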

BitD (BitRot Daemon)

-   Daemon per brick takes care of maintaining checksums for data local
    to the brick.
-   Checksums are SHA256 (default) hashes
    -   Of file data (regular files only)
    -   "Rolling" metadata checksum of extended attributes (GlusterFS
        xattrs) **[NIS]**
    -   Master checksum: checksum of checksums (data + metadata)
        **[NIS]**
    -   The hash type is persisted alongside the checksum and can be
        tuned per file type

-   Checksum maintenance is "lazy"
    -   Not inline with the data path (inline checksumming is expensive)
    -   The filesystem notifies BitD of the list of changed files,
        although a single filesystem scan is needed to establish the
        current state. BitD is built over the existing journaling
        infrastructure (a.k.a. changelog)
    -   Laziness is governed by policies that determine when to
        (re)calculate checksum. IOW, checksum is calculated when a file
        is considered "stable"
        -   Release+Expiry: on a file descriptor release followed by
            inactivity for "X" seconds.

-   Filesystem scan
    -   Required once after stop/start or for initial data set
    -   Xtime based scan (marker framework)
    -   Considerations
        -   Parallelize crawl
        -   Sort by inode \# to reduce disk seeks
        -   Integrate with libgfchangelog
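
The Release+Expiry policy above can be sketched as follows. This is an
illustration in Python under stated assumptions, not the daemon's code: the
`StabilityTracker` class and its method names are hypothetical, file
identifiers are assumed to be strings, and a plain dictionary stands in for
the timer-wheel used for expiry tracking.

```python
class StabilityTracker:
    """Tracks released files and reports those inactive for a full expiry window."""

    def __init__(self, expiry_seconds):
        self.expiry_seconds = expiry_seconds
        self._released_at = {}  # file identifier -> time of last release

    def on_release(self, obj, now):
        # A file descriptor release (re)starts the inactivity window.
        self._released_at[obj] = now

    def on_modify(self, obj):
        # New activity: the file is not a candidate until it is released again.
        self._released_at.pop(obj, None)

    def pop_stable(self, now):
        # Files whose inactivity window has elapsed are considered "stable"
        # and are handed over for (re)calculation of their checksum.
        stable = [o for o, t in self._released_at.items()
                  if now - t >= self.expiry_seconds]
        for o in stable:
            del self._released_at[o]
        return sorted(stable)
```

A caller would feed release and modification notifications into the tracker
and periodically call `pop_stable()` with the current time to obtain the files
that are ready to be signed.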

Detection

-   Upon file/data access (expensive)
    -   open() or read() (disabled by default)
-   Data scrubbing
    -   Filesystem checksum validation
        -   "Bad" file marking
    -   Deep: validate data checksum
    -   Timestamp of last validity - used for replica repair **[NIS]**
    -   Repair **[NIS]**
    -   Shallow: validate metadata checksum **[NIS]**
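
A deep scrub, as listed above, amounts to recomputing each file's data
checksum and comparing it with the stored one. The Python sketch below is
only an illustration under assumptions: the stored checksums are held in a
dictionary (standing in for extended attributes), "bad" file marking is a set
of paths, and the function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deep_scrub(stored, bad_files):
    """Validate every file in `stored` (path -> expected hex digest).

    Paths whose current checksum differs from the stored one are added to
    `bad_files`, which is also returned.
    """
    for path, expected in stored.items():
        if sha256_of(path) != expected:
            bad_files.add(path)
    return bad_files
```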

Repair/Recover strategies **[NIS]**

-   Mirrored file data
    -   self-heal
-   Erasure Codes (ec xlator)

It would also be beneficial to use the built-in bitrot capabilities of
backend filesystems such as btrfs. For such cases, it's better to
hand over the bulk of the work to the backend filesystem and have a
minimalistic implementation on the daemon side. This area needs to be
explored more (i.e., ongoing and not for 3.7).

6 Benefit to GlusterFS
======================

With the ability to detect silent corruption (and even backend tinkering
with a file), GlusterFS can avoid reading bad data, avoid using it as a
truthful source to heal other copies, and avoid replicating it to a remote
backup node, where it would damage a good copy. Scrubbing allows proactive
detection of corrupt files so that they can be repaired before access.

7 Design and CLI specification
==============================

-   [Design document](http://goo.gl/Mjy4mD)
-   [CLI specification](http://goo.gl/2o12Fn)

8 Scope
=======

8.1. Nature of proposed change
------------------------------

The most basic change is the introduction of a server-side daemon (per
brick) to maintain file data checksums. Changes to changelog and its
consumer library would be needed to support the requirements of the
bitrot daemon.

8.2. Implications on manageability
----------------------------------

Introduction of new CLI commands to enable bitrot detection, trigger
scrub, query file status, etc.

8.3. Implications on presentation layer
---------------------------------------

N/A

8.4. Implications on persistence layer
--------------------------------------

Introduction of new extended attributes.

8.5. Implications on 'GlusterFS' backend
----------------------------------------

As in 8.4

8.6. Modification to GlusterFS metadata
---------------------------------------

BitRot related extended attributes

8.7. Implications on 'glusterd'
-------------------------------

Supporting changes to CLI.

9 How To Test
=============

10 User Experience
==================

Refer to Section \#7

11 Dependencies
===============

Enhancement to the changelog translator (and libgfchangelog) is the most
significant change. Other dependencies include glusterd.

12 Documentation
================

TBD

13 Status
=========

-   Initial set of patches merged
-   Bug fixing/enhancement in progress

14 Comments and Discussion
==========================

More than welcome :-)

-   [BitRot tracker Bug](https://bugzilla.redhat.com/1170075)
-   [BitRot hash computation](https://bugzilla.redhat.com/914874)