Server-side barrier feature
===========================

-   Author(s): Varun Shastry, Krishnan Parthasarathi
-   Date: Jan 28 2014
-   Bugzilla: <https://bugzilla.redhat.com/1060002>
-   Document ID: BZ1060002
-   Document Version: 1
-   Obsoletes: NA

Abstract
--------

The snapshot feature needs a mechanism in GlusterFS by which
acknowledgements to file operations (FOPs) are held back until snapshots
of all the bricks of the volume have been taken.

The barrier feature would stop holding back FOPs after a configurable
'barrier-timeout' interval, in seconds. This is to prevent an accidental
lockdown of the volume.

This mechanism should have the following properties:

-   It should keep 'barriering' transparent to the applications.
-   It should not acknowledge FOPs that fall into the barrier class. A
    barrier class FOP is one that, if acknowledged to the application,
    could make the snapshot of the volume inconsistent.

The following 'unlink' example shows how a FOP is classified as barrier
class.

Consider the following sequence of events on a replicate volume with two
bricks, b1 and b2, assuming the unlink FOP is not barriered.

                         b1               b2
    time           ----------------------------------
     |        t1      snapshot
     |        t2      unlink /a        unlink /a
     \/       t3      mkdir /a         mkdir /a
              t4                       snapshot

As a result, /a is stored as a file in the snapshot of b1 but as a
directory in the snapshot of b2. This leads to a split-brain in AFR,
that is, to an inconsistent volume.
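
The timeline above can be replayed on a toy model to see why holding
back the unlink acknowledgement matters. The sketch below is
illustrative only and is not GlusterFS code: the dict-based bricks and
the `replay` function are hypothetical, and it assumes that /a starts
out as a file and that the application issues `mkdir /a` only after the
unlink has been acknowledged.

```python
def replay(barrier_enabled):
    """Replay t1..t4 on two toy bricks and return both snapshots.

    Each brick is modelled as a dict mapping a path to "file" or "dir".
    """
    b1 = {"/a": "file"}
    b2 = {"/a": "file"}

    # t1: the snapshot of b1 is taken.
    snap_b1 = dict(b1)

    # t2: unlink /a is applied on both bricks.
    for brick in (b1, b2):
        del brick["/a"]

    # With barriering enabled, the unlink acknowledgement is held back.
    unlink_acked = not barrier_enabled

    # t3: the application issues mkdir /a only once the unlink is acked.
    if unlink_acked:
        for brick in (b1, b2):
            brick["/a"] = "dir"

    # t4: the snapshot of b2 is taken.
    snap_b2 = dict(b2)
    return snap_b1, snap_b2
```

In this model, `replay(False)` returns `({'/a': 'file'}, {'/a': 'dir'})`,
the file/directory conflict described above. `replay(True)` returns
`({'/a': 'file'}, {})`: the two snapshots still differ, but they no
longer disagree on the type of /a, because the mkdir was never issued.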

Copyright
---------

Copyright (c) 2014 Red Hat, Inc. <http://www.redhat.com>

This feature is licensed under your choice of the GNU Lesser General
Public License, version 3 or any later version (LGPLv3 or later), or the
GNU General Public License, version 2 (GPLv2), in all cases as published
by the Free Software Foundation.

Introduction
------------

The volume snapshot feature snapshots a volume by snapshotting the
individual bricks that are available, using lvm-snapshot technology. As
part of using lvm-snapshot, the design requires bricks to be free from a
small set of modifications (FOPs in the barrier class) to avoid
inconsistency. This is where server-side barriering of FOPs comes into
the picture.

Terminology
-----------

-   barrier(ing) - To make barrier class FOPs temporarily inactive or
    disabled.
-   available - A brick is said to be available when the corresponding
    glusterfsd process is running and serving file operations.
-   FOP - File Operation

High Level Design
-----------------

### Architecture/Design Overview

-   Server-side barriering, for Snapshot, must be enabled/disabled on
    the bricks of a volume in a synchronous manner, i.e., any command
    using it would be blocked until barriering is enabled/disabled.
    The brick process would provide this mechanism via an RPC.
-   The barrier translator would be placed immediately above the
    io-threads translator in the server/brick stack.
-   The barrier translator would queue FOPs when enabled. On disable,
    the translator dequeues all the queued FOPs while continuing to
    serve new FOPs from the application. By default, barriering is
    disabled.
-   The barrier feature would stop blocking the acknowledgements of FOPs
    after a configurable 'barrier-timeout' interval, in seconds. This is
    to prevent an accidental lockdown of the volume.
-   Operations that fall into the barrier class are listed below. Any
    FOP not listed below does not fall into this category and hence is
    not barriered.
    -   rmdir
    -   unlink
    -   rename
    -   [f]truncate
    -   fsync
    -   write with O\_SYNC flag
    -   [f]removexattr

### Design Feature

The following timeline diagram depicts the messages exchanged between
glusterd and a brick while barriering is enabled and disabled. The
diagram assumes that the enable operation is synchronous and the disable
operation is asynchronous. See below for alternatives.

            glusterd (snapshot)                       barrier @ brick
            ------------------                        ---------------
    t1           |                                            |
    t2           |                                continue to pass through
                 |                                     all the fops
    t3     send 'enable'                                      |
    t4           |                                * starts barriering the fops
                 |                                * send back the ack
    t5    receive the ack                                     |
                 |                                            |
    t6           |    <take snap>                             |
                 |         .                                  |
                 |         .                                  |
                 |         .                                  |
                 |    </take snap>                            |
                 |                                            |
    t7     send disable                                       |
         (does not wait for the ack)                          |
    t8           |                               release all the held fops
                 |                                 and no more barriering
                 |                                            |
    t9           |                               continue in PASS_THROUGH mode

Glusterd would send an RPC (described in the API section) to enable
barriering on a brick by setting the option feature.barrier to 'ON' in
the barrier translator. This would be performed on all the bricks
present on that node that belong to the set of volumes being
snapshotted.

Barriering can be disabled in either synchronous or asynchronous mode.
The choice is left to the consumer of this feature.
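
The enable, snapshot, disable sequence from the timeline can be
sketched from the consumer's side as follows. Everything here is
hypothetical: `FakeBrick`, `set_barrier` and `snapshot_volume` are
stand-ins invented for illustration, `set_barrier` only models the RPC
that sets feature.barrier, and both enable and disable are treated as
synchronous calls for simplicity.

```python
class FakeBrick:
    """Stand-in for a brick process; records the requests it receives."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.barrier_on = False

    def set_barrier(self, on):
        # Hypothetical stand-in for the RPC that sets feature.barrier.
        # Returning from this call models receipt of the ack.
        self.barrier_on = on
        self.log.append((self.name, "enable" if on else "disable"))


def snapshot_volume(bricks, take_snapshot):
    """Enable barriering on every brick, snapshot them, then disable."""
    enabled = []
    try:
        # t3..t5: enable barriering one brick at a time, waiting for
        # each ack before moving on.
        for brick in bricks:
            brick.set_barrier(True)
            enabled.append(brick)
        # t6: all bricks are barriered, so the snapshots can be taken.
        return [take_snapshot(brick) for brick in bricks]
    finally:
        # t7: lift the barrier even if enabling or snapshotting failed.
        for brick in enabled:
            brick.set_barrier(False)
```

Disabling in a `finally` block is a design choice of this sketch: it
keeps a failed snapshot from leaving bricks barriered until the timeout
fires.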

On disable, all the queued FOPs will be dequeued. At the same time,
subsequent barrier request(s) will be served.

The enabled/disabled state of the barrier option is persisted in the
volfile. This is to make the feature available to consumers in
asynchronous mode, like any other (configurable) feature.

The barrier feature also has a timeout option, based on which dequeuing
is triggered if the consumer fails to send the disable request.
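
One way to picture the timeout is a timer that is armed on enable and
cancelled on disable; if it fires first, barriering is lifted without
the consumer's help. The sketch below uses Python's `threading.Timer`
to show the idea. The class `TimedBarrier` and its methods are
hypothetical, and the real translator's timer mechanism may differ.

```python
import threading


class TimedBarrier:
    """Barrier flag that switches itself off after a timeout."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.timed_out = threading.Event()  # set when the timeout fired
        self._lock = threading.Lock()
        self._enabled = False
        self._generation = 0                # identifies the current enable
        self._timer = None

    def is_enabled(self):
        with self._lock:
            return self._enabled

    def enable(self):
        with self._lock:
            if self._enabled:
                return
            self._enabled = True
            self._generation += 1
            self._timer = threading.Timer(
                self.timeout_seconds, self._expire, args=(self._generation,)
            )
            self._timer.daemon = True
            self._timer.start()

    def disable(self):
        with self._lock:
            self._enabled = False
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None

    def _expire(self, generation):
        with self._lock:
            # Ignore a timer that lost the race against disable().
            if not self._enabled or generation != self._generation:
                return
            self._enabled = False
            self._timer = None
        self.timed_out.set()
```

A consumer that enables barriering and never sends the disable request
would see `is_enabled()` turn false once `timeout_seconds` has elapsed;
in a full implementation this is also the point at which the queued
FOPs would be released.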

Low-level details of Barrier translator working
-----------------------------------------------

The translator operates in one of two states, namely QUEUEING and
PASS\_THROUGH.

When barriering is enabled, the translator moves to the QUEUEING state.
Thereafter it queues outgoing FOPs in the callback path.

When barriering is disabled, the translator moves to the PASS\_THROUGH
state, in which it does not queue FOPs. Additionally, the queued FOPs
are 'released' when the translator moves from the QUEUEING to the
PASS\_THROUGH state.

It has a translator-global queue (a doubly linked list, see
libglusterfs/src/list.h) where the FOPs are queued in the form of call
stubs (see libglusterfs/src/call-stub.[ch]).
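
The two-state behaviour and the queue can be modelled in a few lines.
This is a single-threaded sketch under stated assumptions, not the
translator's implementation: the `Barrier` class is hypothetical, a
Python `deque` stands in for the doubly linked list, and a plain
callable stands in for a call stub that delivers the acknowledgement.

```python
from collections import deque
from enum import Enum


class State(Enum):
    PASS_THROUGH = "PASS_THROUGH"
    QUEUEING = "QUEUEING"


class Barrier:
    def __init__(self):
        self.state = State.PASS_THROUGH  # barriering is disabled by default
        self.queue = deque()             # held acknowledgements, in order

    def enable(self):
        self.state = State.QUEUEING

    def disable(self):
        # Move to PASS_THROUGH first so new callbacks are not queued,
        # then release everything that was held, oldest first.
        self.state = State.PASS_THROUGH
        held, self.queue = self.queue, deque()
        for ack in held:
            ack()

    def on_callback(self, is_barrier_fop, ack):
        """Called in the callback path, after the FOP has completed."""
        if self.state is State.QUEUEING and is_barrier_fop:
            self.queue.append(ack)       # hold the acknowledgement back
        else:
            ack()                        # acknowledge immediately
```

While the barrier is in the QUEUEING state, the acknowledgement of a
barrier class FOP is appended to the queue and FOPs outside the barrier
class are acknowledged at once; `disable()` then delivers the held
acknowledgements in the order they were queued.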

When a FOP has succeeded but the barrier translator fails to queue it
in the callback, the barrier translator would disable barriering and
release any queued FOPs. It would inform the consumer about this failure
on the subsequent disable request.

Interfaces
----------

### Application Programming Interface

-   An RPC procedure is added at the brick side, which allows any client
    [sic] to set the feature.barrier option of the barrier translator
    with a given value.
-   Glusterd would use this to turn server-side barriering on for a
    brick.

Performance Considerations
--------------------------

-   The barriering of FOPs may be perceived as a performance degradation
    by the applications. Since this is a hard requirement for snapshot,
    the onus is on the snapshot feature to reduce the window for which
    barriering is enabled.

### Scalability

-   In glusterd, each brick operation is executed in a serial manner.
    So, the latency of enabling barriering is a function of the number
    of bricks present on the node that belong to the set of volumes
    being snapshotted. This is not a scalability limitation of the
    mechanism of enabling barriering but a limitation of the brick
    operations mechanism in glusterd.

Migration Considerations
------------------------

The barrier translator is introduced with op-version 4. It is a
server-side translator and does not impact older clients even when this
feature is enabled.

Installation and deployment
---------------------------

-   The barrier xlator is not packaged with the glusterfs-server rpm.
    With this change, it has to be added to the rpm.