### Sharding xlator (Stripe 2.0)

GlusterFS has never had a clear answer for very large files (those which can grow
beyond a single brick). The stripe xlator allows this, but at a cost in
flexibility: servers can only be added in multiples of stripe-count x
replica-count, and mixing striped and unstriped files is not possible in an
"elegant" way. This is also a big limiting factor for the big data/Hadoop use
case, where very large files are the norm (and where you want to split a file
even if it could fit within a single server).

The proposed solution is to replace the current stripe xlator with a new Shard
xlator. Unlike the stripe xlator, Shard is not a cluster xlator; it is placed on
top of DHT. Initially all files will be created as normal files, up to a certain
configurable size. The first block (default 4MB) will be stored like a normal
file. Each further block will be stored in a separate file, named by the GFID
and block index, in a separate namespace (like /.shard/GFID1.1,
/.shard/GFID1.2 ... /.shard/GFID1.N). File I/O at a particular offset will write
to the appropriate "piece file", creating it if necessary. The aggregated file
size and block count will be stored in an xattr on the original (first-block)
file.
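
To make the naming scheme concrete, here is a rough sketch of what the
brick-side layout could look like for a 9MB file with the default 4MB block
size (paths and the abbreviated GFID are hypothetical, and the exact xattr key
names the shard xlator writes are an assumption here):

		# First block lives where the file normally would:
		/bricks/brick1/vm.img                    # bytes 0-4MB
		# Remaining blocks live under /.shard, named <GFID>.<block-index>,
		# and may be placed on any brick by DHT:
		/bricks/brick2/.shard/f512c341-....1     # bytes 4MB-8MB
		/bricks/brick1/.shard/f512c341-....2     # bytes 8MB-9MB

		# Aggregated size and block count are kept in xattrs on the
		# first-block file; dump them on the brick with getfattr:
		getfattr -d -m . -e hex /bricks/brick1/vm.img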

The advantages of such a model:

- Data blocks are distributed by DHT in the "normal" way.
- Servers can be added in any number (even one at a time), and DHT's rebalance
  will spread the "piece files" out evenly.
- Self-healing of a large file is now distributed across smaller files on more
  servers.
- The piece-file naming scheme is immune to renames and hard links.

Source: https://gist.github.com/avati/af04f1030dcf52e16535#sharding-xlator-stripe-20

## Usage:

Shard translator is disabled by default. To enable it on a given volume, execute:

		gluster volume set <VOLNAME> features.shard on
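
Whether the option is in effect can be verified from the volume's info output,
where it shows up under the reconfigured options, e.g. (output abridged):

		gluster volume info <VOLNAME>
		# ...
		# Options Reconfigured:
		# features.shard: on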

The default shard block size is 4MB. To modify it, execute:

		gluster volume set <VOLNAME> features.shard-block-size <value>
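
For example, to use 64MB shards on a hypothetical volume named testvol:

		gluster volume set testvol features.shard-block-size 64MB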

When a file is created in a volume with sharding enabled, the volume's shard
block size at that time is persisted in an xattr on the file's first block. The
file keeps that block size even if shard-block-size is reconfigured on the
volume later.
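
One way to see the persisted block size is to inspect the file's xattrs
directly on a brick (the key name trusted.glusterfs.shard.block-size below is
an assumption about the on-disk attribute the shard xlator uses):

		getfattr -n trusted.glusterfs.shard.block-size -e hex /bricks/brick1/vm.img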

If you want to stop using sharding on a volume, it is advisable to create a new
volume without sharding and copy the contents of the existing volume into it.

## Note:

* Shard translator is still a beta feature in 3.7.0 and will possibly be fully
  supported in one of the 3.7.x releases.
* It is advisable to use the shard translator on volumes with replication
  enabled, for fault tolerance.

## TO-DO:

* Complete implementation of zerofill, discard and fallocate fops.
* Introduce caching and its invalidation within the shard translator to store
  the size and block count of sharded files.
* Make shard translator work for non-Hadoop and non-VM use cases where there are
  multiple clients operating on the same file.
* Serialize appending writes.
* Manage recovery of size and block count better in the face of faults during
  ongoing inode write fops.
* Anything else that could crop up later :)