Feature
-------

Snapshot of Gluster Volume

Summary
-------

Gluster volume snapshot will provide a point-in-time copy of a GlusterFS
volume. This is an online snapshot, so the file-system and its associated
data continue to be available to clients while the snapshot is being taken.

A snapshot of a GlusterFS volume will create another read-only volume
which is a point-in-time copy of the original volume. Users can use this
read-only volume to recover any file(s) they need. Snapshot will also
provide a restore feature which helps the user recover an entire volume.
The restore operation will replace the original volume with the snapshot
volume.

Owner(s)
--------

Rajesh Joseph <rjoseph@redhat.com>

Copyright
---------

Copyright (c) 2013-2014 Red Hat, Inc. <http://www.redhat.com>

This feature is licensed under your choice of the GNU Lesser General
Public License, version 3 or any later version (LGPLv3 or later), or the
GNU General Public License, version 2 (GPLv2), in all cases as published
by the Free Software Foundation.

Current status
--------------

Gluster volume snapshot support is provided in GlusterFS 3.6.

Detailed Description
--------------------

The GlusterFS snapshot feature will provide a crash-consistent
point-in-time copy of Gluster volume(s). This is an online snapshot, so the
file-system and its associated data continue to be available to clients
while the snapshot is being taken. As of now we are not planning to provide
application-level crash consistency; that means if a snapshot is restored
then applications need to rely on journals or other techniques to recover
or clean up some of the operations performed on the GlusterFS volume.

A GlusterFS volume is made up of multiple bricks spread across multiple
nodes. Each brick translates to a directory path on a given file-system.
The current snapshot design is based on the thinly provisioned LVM2
snapshot feature. Therefore, as a prerequisite, the Gluster bricks should
be on thinly provisioned LVM. For a single LVM volume, taking a snapshot is
straightforward for the admin, but this is compounded in a GlusterFS volume
whose bricks are spread across multiple LVMs on multiple nodes. The Gluster
volume snapshot feature aims to provide a set of interfaces with which the
admin can snap and manage the snapshots of Gluster volumes.

A Gluster volume snapshot is essentially a snapshot of all the bricks in
the volume, so ideally all the bricks should be snapped at the same time.
With real-life latencies (processor and network) this may not always hold
true, so we need to make sure that the file-system is in a consistent state
during the snapshot. Therefore we barrier a few operations so that the
file-system remains in a healthy state while the snapshot is taken.

For details about the barrier, see [Server Side
Barrier](http://www.gluster.org/community/documentation/index.php/Features/Server-side_Barrier_feature).

Benefit to GlusterFS
--------------------

Snapshots of a GlusterFS volume give users:

-   A point-in-time checkpoint from which to recover/fail back.
-   Read-only snaps that can be the source of backups.

Scope
-----

### Nature of proposed change

The Gluster CLI will be modified to provide new commands for snapshot
management. The entire snapshot core implementation will be done in
glusterd.

Apart from this, snapshot will also make use of a quiescing xlator. This
will be a server-side translator which quiesces all fops that can modify
disk state. The quiescing will be done till the snapshot operation is
complete.

### Implications on manageability

Snapshot will provide a new set of CLI commands to manage snapshots. REST
APIs are not planned for this release.

### Implications on persistence layer

Snapshot will create a new volume per snapshot. These volumes are stored in
the /var/lib/glusterd/snaps folder. Apart from this, each volume will have
additional snapshot-related information stored in a snap\_list.info file in
its respective vol folder.

### Implications on 'glusterd'

Snapshot information and snapshot volume details are stored in
persistent stores.

How To Test
-----------

For testing this feature one needs multiple thinly provisioned volumes, or
else LVM volumes need to be created on loopback devices.

Details of how to create a thin volume can be found at the following link:
<https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/thinly_provisioned_volume_creation.html>

Each brick needs to be on an independent LVM volume, and these LVMs should
be thinly provisioned. From these bricks create a Gluster volume, which can
then be used for snapshot testing.
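
As a hedged illustration, the sketch below sets up one thinly provisioned
brick on a loopback device and builds a single-brick test volume from it.
The file paths, loop device, VG/LV names, host name, and sizes are all
hypothetical; adjust them to your setup and repeat the brick steps for each
additional brick.

        # Sketch only: hypothetical names and sizes.
        truncate -s 2G /tmp/brick1.img          # backing file for the loop device
        losetup /dev/loop1 /tmp/brick1.img      # expose it as a block device
        pvcreate /dev/loop1
        vgcreate snap_vg1 /dev/loop1
        lvcreate -L 1G -T snap_vg1/thinpool                  # thin pool
        lvcreate -V 1G -T snap_vg1/thinpool -n thin_brick1   # thin LV for the brick
        mkfs.xfs /dev/snap_vg1/thin_brick1
        mkdir -p /bricks/brick1
        mount /dev/snap_vg1/thin_brick1 /bricks/brick1
        mkdir /bricks/brick1/b1

        # Create and start a Gluster volume on the thin brick (hypothetical host name).
        gluster volume create testvol host1:/bricks/brick1/b1
        gluster volume start testvol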

See the User Experience section for the various snapshot commands.

User Experience
---------------

##### Snapshot creation

        gluster snapshot create <snapname> <volname(s)> [description <description>] [force]

This command will create a snapshot of the volume identified by volname.
snapname is a mandatory field and the name should be unique in the entire
cluster. Users can also provide an optional description to be saved along
with the snap (max 1024 characters). The force keyword is used when some
bricks of the original volume are down and you still want to take the
snapshot.
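
For example, with a hypothetical volume vol1, the following would create a
snapshot named snap1 with a description:

        gluster snapshot create snap1 vol1 description "pre-upgrade checkpoint"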

##### Listing of available snaps

        gluster snapshot list [snap-name] [vol <volname>]

This command lists all the snapshots taken, or only those of a specified
volume. If snap-name is provided then it will list the details of that
snap.
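
For instance, to list only the snapshots of a hypothetical volume vol1:

        gluster snapshot list vol vol1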

##### Configuring the snapshot behavior

        gluster snapshot config [vol-name]

This command will display the existing config values for a volume. If the
volume name is not provided then the config values of all the volumes are
displayed.

        gluster snapshot config [vol-name] [<snap-max-limit> <count>] [<snap-max-soft-limit> <percentage>] [force]

The above command can be used to change the existing config values. If
vol-name is provided then the config value of that volume is changed, else
it will set/change the system limit.

The system limit is the default config value for all volumes. A
volume-specific limit cannot cross the system limit. If a volume-specific
limit is not provided then the system limit is used.

If either of these limits is decreased and the current snap count of the
system/volume is more than the new limit then the command will fail. If the
user still wants to decrease the limit then the force option should be
used.

**snap-max-limit**: Maximum snapshot limit for a volume. Snapshot creation
will fail if the snap count reaches this limit.

**snap-max-soft-limit**: Soft snapshot limit for a volume, represented as a
percentage value. Snapshots can still be created after the snap count
reaches this limit, but an auto-deletion is triggered and the oldest snaps
are deleted.
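
As a hedged example with a hypothetical volume vol1, the following would
display its current config values and then set a hard limit of 50 snapshots
and a soft limit of 80%, using the syntax documented above:

        gluster snapshot config vol1
        gluster snapshot config vol1 snap-max-limit 50 snap-max-soft-limit 80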

##### Status of snapshots

        gluster snapshot status ([snap-name] | [volume <vol-name>])

Shows the status of all the snapshots or the specified snapshot. The
status will include the brick details, LVM details, process details,
etc.
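
For example, to show the status of all snapshots or of a hypothetical
snapshot snap1:

        gluster snapshot status
        gluster snapshot status snap1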

##### Activating a snap volume

By default a newly created snapshot will be in an inactive state. Use the
following command to activate a snapshot.

        gluster snapshot activate <snap-name>

##### Deactivating a snap volume

        gluster snapshot deactivate <snap-name>

The above command will deactivate an active snapshot.
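
For example, a hypothetical snapshot snap1 could be activated before
mounting it for file recovery and deactivated afterwards:

        gluster snapshot activate snap1
        gluster snapshot deactivate snap1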

##### Deleting snaps

        gluster snapshot delete <snap-name>

This command will delete the specified snapshot.

##### Restoring snaps

        gluster snapshot restore <snap-name>

This command restores an already taken snapshot of one or more volumes.
Snapshot restore is an offline activity; therefore, if any volume which is
part of the given snap is online then the restore operation will fail.

Once the snapshot is restored it will be deleted from the list of
snapshots.
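
Since restore is an offline activity, a typical sequence (with hypothetical
names vol1 and snap1) would be to stop the volume, restore the snapshot,
and start the volume again:

        gluster volume stop vol1
        gluster snapshot restore snap1
        gluster volume start vol1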

Dependencies
------------

To provide support for a crash-consistent snapshot feature, the Gluster
core components themselves should be crash-consistent. As of now Gluster as
a whole is not crash-consistent. In this section we identify those Gluster
components which are not crash-consistent.

**Geo-Replication**: Geo-replication provides a master-slave
synchronization option for Gluster. Geo-replication maintains state
information for completing the sync operation. Therefore, ideally, when a
snapshot is taken, snapshots of both the master and the slave should be
taken, and the two snapshots should be in a mutually consistent state.

Geo-replication makes use of the change-log to do the sync. By default the
change-log is stored in the .glusterfs folder in every brick, but the
change-log path is configurable. If the change-log is part of the brick
then the snapshot will contain the change-log changes as well; if it is
not, it needs to be saved separately during a snapshot.

The following cases should be considered for making the change-log
crash-consistent:

-   The change-log is part of the brick of the same volume.
-   The change-log is outside the brick. As of now there is no size limit
    on the change-log files, so we need to answer the following questions:
    -   The time taken to make a copy of the entire change-log, which will
        affect the overall time of the snapshot operation.
    -   The location where it can be copied, which will impact the disk
        usage of the target disk or file-system.
-   Some parts of the change-log are present in the brick and some are
    outside the brick. This situation arises when the change-log path is
    changed in between.
-   The change-log is saved in another volume and this volume forms a CG
    with the volume about to be snapped.

**Note**: Considering the above points we have decided not to support
change-log stored outside the bricks.

For this release, automatic snapshot of both master and slave sessions is
not supported. If required, the user needs to explicitly take snapshots of
both master and slave. The following steps need to be followed while taking
a snapshot of a master-slave setup (a sketch of the sequence follows the
list):

-   Stop geo-replication manually.
-   Snapshot all the slaves first.
-   When the slave snapshot is done, initiate the master snapshot.
-   When both snapshots are complete, geo-synchronization can be started
    again.
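
A hedged sketch of that sequence, assuming a hypothetical master volume
mastervol geo-replicated to slavehost::slavevol:

        # On the master cluster: stop geo-replication manually.
        gluster volume geo-replication mastervol slavehost::slavevol stop

        # On the slave cluster: snapshot the slave volume first.
        gluster snapshot create snap_slave slavevol

        # On the master cluster: snapshot the master volume.
        gluster snapshot create snap_master mastervol

        # Restart geo-synchronization once both snapshots are complete.
        gluster volume geo-replication mastervol slavehost::slavevol start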

**Gluster Quota**: Quota enables an admin to specify a per-directory quota.
Quota makes use of the marker translator to enforce quotas. As of now the
marker framework is not completely crash-consistent. As part of the
snapshot feature we need to address the following issues:

-   If a snapshot is taken while the contribution size of a file is
    being updated then you might end up with a snapshot where there is a
    mismatch between the actual size of the file and the contribution of
    the file. These inconsistencies can only be rectified when a
    look-up is issued on the snapshot volume for the same file. As a
    workaround the admin needs to issue an explicit file-system crawl to
    rectify the problem.
-   For NFS, quota makes use of pgfid to build a path from a gfid and
    enforce quota. As of now the pgfid update is not crash-consistent.
-   Quota saves its configuration on the file-system under the
    /var/lib/glusterd folder. As part of the snapshot feature we need to
    save this file.

**NFS**: NFS uses a single graph to represent all the volumes in the
system, and to make the snapshot volumes accessible they should be added to
this graph. This brings in another restriction: all snapshot names should
be unique, and additionally a snap name should not clash with any other
volume name.

To handle this situation we have decided to use an internal UUID as the
snap name and to keep a mapping of this UUID and the user-given snap name
in an internal structure.

Another restriction with NFS is that starting a newly created volume
(snapshot volume) will restart the NFS server. Therefore we decided that
when a snapshot is taken it will be in a stopped state; later, when the
snapshot volume is needed, it can be started explicitly.

**DHT**: The DHT xlator decides which node to look on for a file/directory.
Some of the DHT fops are not atomic in nature, e.g. rename (of both files
and directories), and these operations are also not transactional. That
means if a crash happens the data on the server might be in an inconsistent
state. Depending on the time of the snapshot and which DHT operation is in
what state, the snapshot can be inconsistent.

**AFR**: AFR is the high-availability module in Gluster. AFR keeps track of
the fresh and correct copy of data using extended attributes. Therefore it
is important that these extended attributes are written to disk before the
snapshot is taken. To make sure of this, the snapshot module will issue an
explicit sync after the barrier/quiescing.

The other issue with the current AFR is that it writes the volume name to
the extended attributes of all the files. AFR uses this for self-healing.
When a snapshot is taken of such a volume, the snapshotted volume will also
have the same volume name. Therefore AFR needs to create a mapping of the
real volume name and the extended-attribute entry name in the volfile, so
that the correct name can be referred to during self-heal.

Another dependency on AFR is that there is currently no direct API or
callback function which tells whether AFR self-healing has completed on a
volume. This feature is required to heal a snapshot volume before a
restore.

Documentation
-------------

Status
------

In development

Comments and Discussion
-----------------------

<Follow here>