diff options
author | Krutika Dhananjay <kdhananj@redhat.com> | 2015-05-05 07:57:10 +0530 |
---|---|---|
committer | Humble Devassy Chirammal <humble.devassy@gmail.com> | 2015-05-05 00:16:44 -0700 |
commit | 41c3c1c1e4ccc425ac2edb31d7fa970ff464ebdd (patch) | |
tree | 9775e8868cc9faeb1869d703c20a6384f2c8e45a | |
parent | 25e8e74eb7b81ccd114a9833371a3f72d284c48d (diff) |
doc: Feature doc for shard translator
Source: https://gist.github.com/avati/af04f1030dcf52e16535#sharding-xlator-stripe-20
Change-Id: I127685bacb6cf924bb41a3b782ade1460afb91d6
BUG: 1200082
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
Reviewed-on: http://review.gluster.org/10537
Reviewed-by: Humble Devassy Chirammal <humble.devassy@gmail.com>
Tested-by: Humble Devassy Chirammal <humble.devassy@gmail.com>
-rw-r--r-- | doc/features/shard.md | 68 |
1 files changed, 68 insertions, 0 deletions
diff --git a/doc/features/shard.md b/doc/features/shard.md new file mode 100644 index 00000000000..3588531a2b4 --- /dev/null +++ b/doc/features/shard.md @@ -0,0 +1,68 @@ +### Sharding xlator (Stripe 2.0) + +GlusterFS's answer to very large files (those which can grow beyond a +single brick) has never been clear. There is a stripe xlator which allows you to +do that, but that comes at a cost of flexibility - you can add servers only in +multiple of stripe-count x replica-count, mixing striped and unstriped files is +not possible in an "elegant" way. This also happens to be a big limiting factor +for the big data/Hadoop use case where super large files are the norm (and where +you want to split a file even if it could fit within a single server.) + +The proposed solution for this is to replace the current stripe xlator with a +new Shard xlator. Unlike the stripe xlator, Shard is not a cluster xlator. It is +placed on top of DHT. Initially all files will be created as normal files, even +up to a certain configurable size. The first block (default 4MB) will be stored +like a normal file. However further blocks will be stored in a file, named by +the GFID and block index in a separate namespace (like /.shard/GFID1.1, +/.shard/GFID1.2 ... /.shard/GFID1.N). File IO happening to a particular offset +will write to the appropriate "piece file", creating it if necessary. The +aggregated file size and block count will be stored in the xattr of the original +(first block) file. + +The advantage of such a model: + +- Data blocks are distributed by DHT in a "normal way". +- Adding servers can happen in any number (even one at a time) and DHT's + rebalance will spread out the "piece files" evenly. +- Self-healing of a large file is now more distributed into smaller files across + more servers. +- piece file naming scheme is immune to renames and hardlinks. + +Source: https://gist.github.com/avati/af04f1030dcf52e16535#sharding-xlator-stripe-20 + +## Usage: + +Shard translator is disabled by default. To enable it on a given volume, execute +<code> +gluster volume set <VOLNAME> features.shard on +</code> + +The default shard block size is 4MB. To modify it, execute +<code> +gluster volume set <VOLNAME> features.shard-block-size <value> +</code> + +When a file is created in a volume with sharding disabled, its block size is +persisted in its xattr on the first block. This property of the file will remain +even if the shard-block-size for the volume is reconfigured later. + +If you want to disable sharding on a volume, it is advisable to create a new +volume without sharding and copy out contents of this volume into the new +volume. + +## Note: +* Shard translator is still a beta feature in 3.7.0 and will be possibly fully + supported in one of the 3.7.x releases. +* It is advisable to use shard translator in volumes with replication enabled + for fault tolerance. + +## TO-DO: +* Complete implementation of zerofill, discard and fallocate fops. +* Introduce caching and its invalidation within shard translator to store size + and block count of shard'ed files. +* Make shard translator work for non-Hadoop and non-VM use cases where there are + multiple clients operating on the same file. +* Serialize appending writes. +* Manage recovery of size and block count better in the face of faults during + ongoing inode write fops. +* Anything else that could crop up later :) |