summaryrefslogtreecommitdiffstats
path: root/doc/legacy/hacker-guide/replicate.txt
blob: ad5b352a829c2bc508c38a705f9f6f08dd0bc5f4 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
---------------
* cluster/replicate
---------------

Before understanding replicate, one must understand two internal FOPs:

GF_FILE_LK:
  This is exactly like fcntl(2) locking, except the locks are in a
  separate domain from locks held by applications.

GF_DIR_LK (loc_t *loc, char *basename):
  This allows one to lock a name under a directory. For example,
  to lock /mnt/glusterfs/foo, one would use the call:

  GF_DIR_LK ({loc_t for "/mnt/glusterfs"}, "foo")

  If one wishes to lock *all* the names under a particular directory,
  supply the basename argument as NULL.

  The locks can either be read locks or write locks; consult the
  function prototype for more details.

Both these operations are implemented by the features/locks (earlier
known as posix-locks) translator.

--------------
* Basic design
--------------

All FOPs can be classified into four major groups:

 - inode-read
   Operations that read an inode's data (file contents) or metadata (perms, etc.).

   access, getxattr, fstat, readlink, readv, stat.

 - inode-write
   Operations that modify an inode's data or metadata.

   chmod, chown, truncate, writev, utimens.

 - dir-read
   Operations that read a directory's contents or metadata.

   readdir, getdents, checksum.

 - dir-write
   Operations that modify a directory's contents or metadata.

   create, link, mkdir, mknod, rename, rmdir, symlink, unlink.

   Some of these make a subgroup in that they modify *two* different entries:
        link, rename, symlink.

 - Others
   Other operations.

   flush, lookup, open, opendir, statfs.

------------
* Algorithms
------------

Each of the four major groups has its own algorithm:

 ----------------------
 - inode-read, dir-read
 ----------------------

 = Send a request to the first child that is up:
   - if it fails:
       try the next available child
   - if we have exhausted all children:
       return failure

 -------------
 - inode-write
 -------------

 All operations are done in parallel unless specified otherwise.

 (1) Send a GF_FILE_LK request on all children for a write lock on
     the appropriate region
            (for metadata operations: entire file (0, 0)
             for writev: (offset, offset+size of buffer))

     - If a lock request fails on a child:
         unlock all children
         try to acquire a blocking lock (F_SETLKW) on each child, serially.

	 If this fails (due to ENOTCONN or EINVAL):
           Consider this child as dead for rest of transaction.

 (2) Mark all children as "pending" on all (alive) children
     (see below for meaning of "pending").

     - If it fails on any child:
         mark it as dead (in transaction local state).

 (3) Perform operation on all (alive) children.

     - If it fails on any child:
         mark it as dead (in transaction local state).

 (4) Unmark all successful children as not "pending" on all nodes.

 (5) Unlock region on all (alive) children.

 -----------
 - dir-write
 -----------

 The algorithm for dir-write is same as above except instead of holding
 GF_FILE_LK locks we hold a GF_DIR_LK lock on the name being operated upon.
 In case of link-type calls, we hold locks on both the operand names.

-----------
* "pending"
-----------

 The "pending" number is like a journal entry. A pending entry is an
 array of 32-bit integers stored in network byte-order as the extended
 attribute of an inode (which can be a directory as well).

 There are three keys corresponding to three types of pending operations:

 - AFR_METADATA_PENDING
     There are some metadata operations pending on this inode (perms, ctime/mtime,
     xattr, etc.).

 - AFR_DATA_PENDING
     There is some data pending on this inode (writev).

 - AFR_ENTRY_PENDING
     There are some directory operations pending on this directory
     (create, unlink, etc.).

-----------
* Self heal
-----------

 - On lookup, gather extended attribute data:
   - If entry is a regular file:
     - If an entry is present on one child and not on others:
       - create entry on others.
     - If entries exist but have different metadata (perms, etc.):
       - consider the entry with the highest AFR_METADATA_PENDING number as
         definitive and replicate its attributes on children.

   - If entry is a directory:
     - Consider the entry with the highest AFR_ENTRY_PENDING number as
       definitive and replicate its contents on all children.

   - If any two entries have non-matching types (i.e., one is file and
     other is directory):
     - Announce to the user via log that a split-brain situation has been
       detected, and do nothing.

 - On open, gather extended attribute data:
   - Consider the file with the highest AFR_DATA_PENDING number as
     the definitive one and replicate its contents on all other
     children.

 During all self heal operations, appropriate locks must be held on all
 regions/entries being affected.

---------------
* Inode scaling
---------------

Inode scaling is necessary because if a situation arises where:
  - An inode number is returned for a directory (by lookup) which was
    previously the inode number of a file (as per FUSE's table), then
    FUSE gets horribly confused (consult a FUSE expert for more details).

To avoid such a situation, we distribute the 64-bit inode space equally
among all children of replicate.

To illustrate:

If c1, c2, c3 are children of replicate, they each get 1/3 of the available
inode space:

Child:        c1   c2   c3   c1   c2   c3   c1   c2   c3   c1   c2 ...
Inode number: 1    2    3    4    5    6    7    8    9    10   11 ...

Thus, if lookup on c1 returns an inode number "2", it is scaled to "4"
(which is the second inode number in c1's space).

This way we ensure that there is never a collision of inode numbers from
two different children.

This reduction of inode space doesn't really reduce the usability of
replicate since even if we assume replicate has 1024 children (which would be a
highly unusual scenario), each child still has a 54-bit inode space.

2^54 ~ 1.8 * 10^16

which is much larger than any real world requirement.


==============================================
$ Last updated: Sun Oct 12 23:17:01 IST 2008 $
$ Author: Vikas Gorur <vikas@gluster.com>  $
==============================================