Glossary
========
**Brick**
: A Brick is the basic unit of storage in GlusterFS, represented by an export
directory on a server in the trusted storage pool.
A brick is expressed by combining a server with an export directory in the following format:
`SERVER:EXPORT`
For example:
`myhostname:/exports/myexportdir/`
**Volume**
: A volume is a logical collection of bricks. Most of the gluster
management operations happen on the volume.
**Subvolume**
: A brick after being processed by at least one translator, or, in other words,
a set of one or more translators stacked together, is called a subvolume.
**Volfile**
: Volume (vol) files are configuration files that determine the behavior of the
GlusterFS trusted storage pool. A volume file is a textual representation of a
collection of modules (also known as translators) that together implement the
various functions required. The collection of modules is arranged in a graph-like
fashion. For example, a replicated volume's volfile, among other things, would have a
section describing the replication translator and its tunables.
This section describes how the volume replicates data written to it.
Further, a client process that serves a mount point interprets its volfile
and loads the translators described in it. While serving I/O, it passes each
request through the collection of modules in the order specified in the volfile.
At a high level, GlusterFS has three entities: the server, the client, and the management daemon.
Each of these entities has its own volume files.
Volume files for servers and clients are generated by the management daemon
after the volume is created.
Server and client vol files are located in the /var/lib/glusterd/vols/VOLNAME directory.
The management daemon's vol file is named glusterd.vol and is located in the /etc/glusterfs/
directory.
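As a minimal, illustrative sketch, a replication section in a client volfile
might look like the following (the volume and subvolume names here are hypothetical):
```
volume myvol-replicate-0
    type cluster/replicate
    subvolumes myvol-client-0 myvol-client-1
end-volume
```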
**glusterd**
: The daemon/service that manages volumes and cluster membership. It is required to
run on all the servers in the trusted storage pool.
**Cluster**
: A trusted pool of linked computers working together, resembling a single computing resource.
In GlusterFS, a cluster is also referred to as a trusted storage pool.
**Client**
: Any machine that mounts a GlusterFS volume. Any application that uses the
libgfapi access mechanism can also be treated as a client in the GlusterFS context.
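For illustration, a client typically mounts a volume using the native FUSE
client; the server hostname, volume name, and mount point below are placeholders:
```console
mount -t glusterfs server1:/myvol /mnt/glusterfs
```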
**Server**
: The machine (virtual or bare metal) that hosts the bricks in which data is stored.
**Block Storage**
: Block special files, or block devices, correspond to devices through which the system moves
data in the form of blocks. These device nodes often represent addressable devices such as
hard disks, CD-ROM drives, or memory regions. GlusterFS requires a filesystem (like XFS) that
supports extended attributes.
**Filesystem**
: A method of storing and organizing computer files and their data.
Essentially, it organizes these files into a database for the
storage, organization, manipulation, and retrieval by the computer's
operating system.
Source: [Wikipedia][]
**Distributed File System**
: A file system that allows multiple clients to concurrently access data which is spread across
servers/bricks in a trusted storage pool. Data sharing among multiple locations is fundamental
to all distributed file systems.
**Virtual File System (VFS)**
: VFS is a kernel software layer which handles all system calls related to the standard Linux file system.
It provides a common interface to several kinds of file systems.
**POSIX**
: Portable Operating System Interface (for Unix) is the name of a
family of related standards specified by the IEEE to define the
application programming interface (API), along with shell and
utilities interfaces for software compatible with variants of the
Unix operating system. Gluster exports a fully POSIX-compliant file
system.
**Extended Attributes**
: Extended file attributes (abbreviated xattr) are a filesystem feature
that enables users and programs to associate metadata with files and directories.
**FUSE**
: Filesystem in Userspace (FUSE) is a loadable kernel module for
Unix-like computer operating systems that lets non-privileged users
create their own filesystems without editing kernel code. This is
achieved by running filesystem code in user space while the FUSE
module provides only a "bridge" to the actual kernel interfaces.
Source: [Wikipedia][1]
**GFID**
: Each file/directory on a GlusterFS volume has a unique 128-bit number
associated with it called the GFID. This is analogous to inode in a
regular filesystem.
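On a brick, a file's GFID is stored in the trusted.gfid extended attribute.
As a sketch, it can be inspected with getfattr (the brick path below is a placeholder):
```console
getfattr -n trusted.gfid -e hex /bricks/brick1/path/to/file
```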
**InfiniBand**
: InfiniBand is a switched fabric computer network communications link
used in high-performance computing and enterprise data centers.
**Metadata**
: Metadata is data providing information about one or more other
pieces of data.
**Namespace**
: Namespace is an abstract container or environment created to hold a
logical grouping of unique identifiers or symbols. Each Gluster
volume exposes a single namespace as a POSIX mount point that
contains every file in the cluster.
**Node**
: A server or computer that hosts one or more bricks.
**Open Source**
: Open source describes practices in production and development that
promote access to the end product's source materials. Some consider
open source a philosophy, others consider it a pragmatic
methodology.
Before the term open source became widely adopted, developers and
producers used a variety of phrases to describe the concept; open
source gained hold with the rise of the Internet, and the attendant
need for massive retooling of the computing source code.
Opening the source code enabled a self-enhancing diversity of
production models, communication paths, and interactive communities.
Subsequently, a new, three-word phrase "open source software" was
born to describe the environment that the new copyright, licensing,
domain, and consumer issues created.
Source: [Wikipedia][2]
**Petabyte**
: A petabyte (derived from the SI prefix peta- ) is a unit of
information equal to one quadrillion (short scale) bytes, or 1000
terabytes. The unit symbol for the petabyte is PB. The prefix peta-
(P) indicates a power of 1000:
1 PB = 1,000,000,000,000,000 B = 1000^5 B = 10^15 B.
The term "pebibyte" (PiB), using a binary prefix, is used for the
corresponding power of 1024.
Source: [Wikipedia][3]
**Quorum**
: The configuration of quorum in a trusted storage pool determines the
number of server failures that the trusted storage pool can sustain.
If an additional failure occurs, the trusted storage pool becomes
unavailable.
**Quota**
: Quota allows you to set limits on usage of disk space by directories or
by volumes.
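As a hedged example, directory quotas are typically managed with the gluster
CLI (the volume name and path here are placeholders):
```console
gluster volume quota myvol enable
gluster volume quota myvol limit-usage /projects 10GB
```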
**RAID**
: Redundant Array of Inexpensive Disks (RAID) is a technology that
provides increased storage reliability through redundancy, combining
multiple low-cost, less-reliable disk drive components into a
logical unit where all drives in the array are interdependent.
**RDMA**
: Remote direct memory access (RDMA) is a direct memory access from the
memory of one computer into that of another without involving either
one's operating system. This permits high-throughput, low-latency
networking, which is especially useful in massively parallel computer
clusters.
**Rebalance**
: A process of fixing the layout and redistributing data in a volume when a
brick is added or removed.
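As an illustrative sketch, a rebalance is typically driven with the gluster
CLI (the volume name is a placeholder):
```console
gluster volume rebalance myvol fix-layout start
gluster volume rebalance myvol start
gluster volume rebalance myvol status
```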
**RRDNS**
: Round Robin Domain Name Service (RRDNS) is a method to distribute
load across application servers. RRDNS is implemented by creating
multiple A records with the same name and different IP addresses in
the zone file of a DNS server.
**Samba**
: Samba allows file and print sharing between computers running Windows and
computers running Linux. It is an implementation of several services and
protocols including SMB and CIFS.
**Self-Heal**
: The self-heal daemon runs in the background, identifies inconsistencies
in files and directories in a replicated volume, and then resolves
(heals) them. This healing process is usually required when one or more
bricks of a volume go down and then come back up later.
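For illustration, pending heals can be listed and triggered with the gluster
CLI (the volume name is a placeholder):
```console
gluster volume heal myvol info
gluster volume heal myvol
```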
**Split-brain**
: A situation where data on two or more bricks in a replicated
volume starts to diverge in terms of content or metadata. In this state,
one cannot determine programmatically which set of data is "right" and
which is "wrong".
**Translator**
: Translators (also called xlators) are stackable modules where each
module has a very specific purpose. Translators are stacked in a
hierarchical structure called a graph. A translator receives data
from its parent translator, performs the necessary operations, and then
passes the data down to its child translator in the hierarchy.
**Trusted Storage Pool**
: A storage pool is a trusted network of storage servers. When you
start the first server, the storage pool consists of that server
alone.
**Scale-Up Storage**
: Increases the capacity of the storage device in a single dimension.
For example, adding additional disk capacity to an existing trusted storage pool.
**Scale-Out Storage**
: Scale-out systems are designed to grow in both capacity and performance.
For example, adding more systems of the same size, or adding servers to a trusted storage pool,
increases the CPU, disk capacity, and throughput of the trusted storage pool.
**Userspace**
: Applications running in user space don’t directly interact with
hardware, instead using the kernel to moderate access. Userspace
applications are generally more portable than applications in kernel
space. Gluster is a user space application.
**Geo-Replication**
: Geo-replication provides a continuous, asynchronous, and incremental
replication service from one site to another over Local Area Networks
(LANs), Wide Area Networks (WANs), and across the Internet.
**N-way Replication**
: Local synchronous data replication which is typically deployed across campus
or Amazon Web Services Availability Zones.
**Distributed Hash Table Terminology**
**Hashed subvolume**
: The Distributed Hash Table translator subvolume to which the file or directory name hashes.
**Cached subvolume**
: The Distributed Hash Table translator subvolume where the file content actually resides.
For directories, the concept of a cached subvolume is not relevant; the term is loosely used
to mean subvolumes that are not the hashed subvolume.
**Linkto-file**
: For a newly created file, the hashed and cached subvolumes are the same.
When directory entry operations like rename (which can change the name, and hence the hashed
subvolume, of the file) are performed on the file, instead of moving the entire data in the file
to a new hashed subvolume, a file is created with the same name on the newly hashed subvolume.
The purpose of this file is only to act as a pointer to the node where the data is present.
In the extended attributes of this file, the name of the cached subvolume is stored.
This file on the newly hashed-subvolume is called a linkto-file.
The linkto file is relevant only for non-directory entities.
**Directory Layout**
: The directory layout specifies, for each directory, which hash ranges
correspond to which subvolumes.
**Properties of directory layouts:**
: The layouts are created at the time of directory creation and are persisted as extended attributes
of the directory.
A subvolume is not included in the layout if it was offline at the time of directory creation,
and no directory entries (such as files and directories) of that directory are created on
that subvolume. The subvolume becomes part of the layout only after a fix-layout is completed
as part of running the rebalance command. If a subvolume is down during access (after directory creation),
access to any files that hash to that subvolume fails.
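As a sketch, a directory's layout can be inspected on a brick via the
trusted.glusterfs.dht extended attribute (the brick path is a placeholder):
```console
getfattr -n trusted.glusterfs.dht -e hex /bricks/brick1/dir
```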
**Fix Layout**
: A command that is executed during the rebalance process.
The rebalance process itself comprises two stages:
1. Fix the layouts of directories to accommodate any subvolumes that are added or removed.
   This stage also heals the directories, checks whether the layout is non-contiguous,
   persists the layout in extended attributes if needed, and ensures that the directories
   have the same attributes across all the subvolumes.
2. Migrate the data from the cached subvolume to the hashed subvolume.
[Wikipedia]: http://en.wikipedia.org/wiki/Filesystem
[1]: http://en.wikipedia.org/wiki/Filesystem_in_Userspace
[2]: http://en.wikipedia.org/wiki/Open_source
[3]: http://en.wikipedia.org/wiki/Petabyte