author | Vijay Bellur <vbellur@redhat.com> | 2013-07-25 17:12:21 +0530 |
---|---|---|
committer | Anand Avati <avati@redhat.com> | 2013-07-25 05:07:11 -0700 |
commit | e9c583598b8ad58bbda15759067ff57eca619e95 (patch) | |
tree | 3653ddaca4c2975bed44b17fc59f8ef64afcf2a1 /doc/features | |
parent | 131d78dd36bac795d50fee3b04969b5ea9cb613c (diff) |
doc: Create a features folder.
Moved rdma and bd documents to doc/features. Added a new
document on rebalance.
Change-Id: I04269202adc9605754fc29876433c88480b822a3
BUG: 811311
Signed-off-by: Vijay Bellur <vbellur@redhat.com>
Reviewed-on: http://review.gluster.org/5395
Reviewed-by: Anand Avati <avati@redhat.com>
Tested-by: Anand Avati <avati@redhat.com>
Diffstat (limited to 'doc/features')
-rw-r--r-- | doc/features/bd.txt | 130 | ||||
-rw-r--r-- | doc/features/rdma-cm-in-3.4.0.txt | 9 | ||||
-rw-r--r-- | doc/features/rebalance.md | 74 |
3 files changed, 213 insertions, 0 deletions
diff --git a/doc/features/bd.txt b/doc/features/bd.txt
new file mode 100644
index 00000000000..c1ba006ef8c
--- /dev/null
+++ b/doc/features/bd.txt
@@ -0,0 +1,130 @@
+Sections
+1. Introduction
+2. Advantages
+3. Creating BD backend volume
+4. BD volume file
+5. Using BD backend gluster volume
+6. Limitations
+7. TODO
+
+1. Introduction
+===============
+Block Device translator(BD xlator) represented as storage/bd_map in
+volume file adds a new backend 'block' to GlusterFS. It enables
+GlusterFS to export block devices as regular files to the client.
+Currently BD xlator supports exporting of 'Volume Group(VG)' as
+directory and Logical Volumes(LV) within that VG as regular files to the
+client.
+
+The eventual goal of this work is to support thin provisioning,
+snapshot, copy etc of VM images seamlessly in GlusterFS storage
+environment
+
+The immediate goal of this translator is to use LVs to store
+VM images and expose them as files to QEMU/KVM. Given VG is represented
+as directory and its logical volumes as files.
+
+BD xlator uses lvm2-devel APIs for getting the list of VGs and LVs in
+the system and lvm binaries (such as lvcreate, lvresize etc) to perform
+the required LV operations.
+
+2. Advantages
+=============
+By exporting LVs as regular files, it becomes possible to:
+* Associate each VM to a LV so that there is no file system overhead.
+* Use file system commands like cp to take copy of VM images
+* Create linked clones of VM by doing LV snapshot at server
+side
+* Implement thin provisioning by developing a qcow2 translator
+
+3. Creating BD backend volume
+=============================
+New parameter "device vg" in volume create command is used to create BD
+backend gluster volumes.
+
+For example
+ $ gluster volume create my-volume device vg hostname:/my-vg
+
+creates gluster volume 'my-volume' with BD backend which uses the VG
+'my-vg' to store data. VG 'my-vg' should exist before creating this
+gluster volume.
+
+4. BD volume file
+=================
+BD backend volume file specifies which VG to export to the client. The
+section in the volume file that describes BD xlator looks like this.
+
+volume my-volume-bd_map
+type storage/bd_map
+option device vg
+option export volume-group
+end-volume
+
+option device=vg specifies that it should use VG as block backend. option
+export=volume-group specifies that it should export VG "volume-group"
+to the client.
+
+5. Using BD backend gluster volume
+==================================
+Mount
+-----
+ $ mount -t glusterfs hostname:/my-volume /media/bd
+ $ cd /media/bd
+
+From the mount point:
+--------------------
+* Creating a new file (ie LV) involves two steps
+ $ touch lv1
+ $ truncate -s <size> lv1
+ or
+ $ qemu-img create -f <format> gluster:/hostname/my-volume/path-to-image <size>
+
+* Cloning an LV
+ $ ln lv1 lv2
+
+* Snapshotting an LV
+ $ ln -s lv1 lv1-ss
+
+* Passing it to QEMU as one of the drives
+ $ qemu -drive file=<mount>/<file>,if=<if-type>
+
+* GlusterFS is one of the supported QEMU block drivers, the URI format
+ is
+ gluster[+transport]://[server[:port]]/my-volume/image[?socket=...]
+ ie
+ $ qemu -drive file=gluster:/hostname/my-volume/path-to-image,if=<if-type>
+
+Using Gluster CLI:
+-----------------
+* To create a new image of required size
+ $ gluster bd create my-volume:/path-to-image <size>
+
+* To delete an existing image
+ $ gluster bd delete my-volume:/path-to-image
+
+* To clone (full clone) an image
+ $ gluster bd clone my-volume:/path-to-image new-image
+
+* To take a snapshot of an image
+ $ gluster bd snapshot my-volume:/path-to-image snapshot-image <size>
+
+All gluster BD commands need size to specified in terms of KB, MB, etc.
+
+6. Limitations
+==============
+* No support to create multiple bricks
+* Image creation should be used with truncate to get proper size or use
+ qemu-img create
+* cp command can't be used to copy images, instead use ln command or
+ gluster bd clone command
+* ln -s command throws an error even if snapshot is successful
+* When ln command used on BD volumes, target file's inode is different
+ from target's
+* Creation/deletion of directories, xattr operations, mknod and readlink
+ operations are not supported.
+
+7. TODO
+=======
+Add support for exporting LUNs also as a regular files.
+Add support for xattr and multi brick support
+Include support for device mapper thin targets
diff --git a/doc/features/rdma-cm-in-3.4.0.txt b/doc/features/rdma-cm-in-3.4.0.txt
new file mode 100644
index 00000000000..fd953e56b3f
--- /dev/null
+++ b/doc/features/rdma-cm-in-3.4.0.txt
@@ -0,0 +1,9 @@
+Following is the impact of http://review.gluster.org/#change,149.
+
+New userspace packages needed:
+librdmacm
+librdmacm-devel
+
+rdmacm needs an IPoIB address for connection establishment. This requirement results in following issues:
+* Because of bug #890502, we've to probe the peer on an IPoIB address. This imposes a restriction that all volumes created in the future have to communicate over IPoIB address (irrespective of whether they use gluster's tcp or rdma transport).
+* Currently client has an independence to choose b/w tcp and rdma transports while communicating with the server (by creating volumes with transport-type tcp,rdma). This independence was a byproduct of our ability use the normal channel used with transport-type tcp for rdma connectiion establishment handshake too. However, with new requirement of IPoIB address for connection establishment, we loose this independence (till we bring in multi-network support - where a brick can be identified by a set of ip-addresses and we can choose different pairs of ip-addresses for communication based on our requirements - in glusterd).
diff --git a/doc/features/rebalance.md b/doc/features/rebalance.md
new file mode 100644
index 00000000000..29b993008d2
--- /dev/null
+++ b/doc/features/rebalance.md
@@ -0,0 +1,74 @@
+## Background
+
+
+For a more detailed description, view Jeff Darcy's blog post [here]
+(http://hekafs.org/index.php/2012/03/glusterfs-algorithms-distribution/)
+
+GlusterFS uses the distribute translator (DHT) to aggregate space of multiple servers. DHT distributes files among its subvolumes using a consistent hashing method providing 32-bit hashes. Each DHT subvolume is given a range in the 32-bit hash space. A hash value is calculated for every file using a combination of its name. The file is then placed in the subvolume with the hash range that contains the hash value.
+
+## What is rebalance?
+
+The rebalance process migrates files between the DHT subvolumes when necessary.
+
+## When is rebalance required?
+
+Rebalancing is required for two main cases.
+
+1. Addition/Removal of bricks
+
+2. Renaming of a file
+
+## Addition/Removal of bricks
+
+Whenever the number or order of DHT subvolumes change, the hash range given to each subvolume is recalculated. When this happens, already existing files on the volume will need to be moved to the correct subvolume based on their hash. Rebalance does this activity.
+
+Addition of bricks which increase the size of a volume will increase the number of DHT subvolumes and lead to recalculation of hash ranges (This doesn't happen when bricks are added to a volume to increase redundancy, i.e. increase replica count of a volume). This will require an explicit rebalance command to be issued to migrate the files.
+
+Removal of bricks which decrease the size of a volumes also causes the hash ranges of DHT to be recalculated. But we don't need to issue an explicit rebalance command in this case, as rebalance is done automatically by the remove-brick process if needed.
+
+## Renaming of a file
+
+Renaming of file will cause its hash to change. The file now needs to be moved to the correct subvolume based on its new hash. Rebalance does this.
+
+## How does rebalance work?
+
+At a high level, the rebalance process consists of the following 3 steps:
+
+1. Crawl the volume to access all files
+2. Calculate the hash for the file
+3. If needed move the migrate the file to the correct subvolume.
+
+
+The rebalance process has been optimized by making it distributed across the trusted storage pool. With distributed rebalance, a rebalance process is launched on each peer in the cluster. Each rebalance process will crawl files on only those bricks of the volume which are present on it, and migrate the files which need migration to the correct brick. This speeds up the rebalance process considerably.
+
+## What will happen if rebalance is not run?
+
+### Addition of bricks
+
+With the current implementation of add-brick, when the size of a volume is augmented by adding new bricks, the new bricks are not put into use immediately i.e., the hash ranges there not recalculated immediately. This means that the files will still be placed only onto the existing bricks, leaving the newly added storage space unused. Starting a rebalance process on the volume will cause the hash ranges to be recalculated with the new bricks included, which allows the newly added storage space to be used.
+
+### Renaming a file
+
+When a file rename causes the file to be hashed to a new subvolume, DHT writes a link file on the new subvolume leaving the actual file on the original subvolume. A link file is an empty file, which has an extended attribute set that points to the subvolume on which the actual file exists. So, when a client accesses the renamed file, DHT first looks for the file in the hashed subvolume and gets the link file. DHT understands the link file, and gets the actual file from the subvolume pointed to by the link file. This leads to a slight reduction in performance. A rebalance will move the actual file to the hashed subvolume, allowing clients to access the file directly once again.
+
+## Are clients affected during a rebalance process?
+
+The rebalance process is transparent to applications on the clients. Applications which have open files on the volume will not be affected by the rebalance process, even if the open file requires migration. The DHT translator on the client will hide the migration from the applications.
+
+##How are open files migrated?
+
+(A more technical description of the algorithm used can be seen in the commit message of commit a07bb18c8adeb8597f62095c5d1361c5bad01f09.)
+
+To achieve migration of open files, two things need to be assured of,
+a) any writes or changes happening to the file during migration are correctly synced to destination subvolume after the migration is complete.
+b) any further changes should be made to the destination subvolume
+
+Both of these requirements require sending notificatoins to clients. Clients are notified by overloading an attribute used in every callback functions. DHT understands these attributes in the callbacks and can be notified if a file is being migrated or not.
+
+During rebalance, a file will be in two phases
+
+1. Migration in process - In this phase the file is being migrated by the rebalance process from the source subvolume to the destination subvolume. The rebalance process will set a 'in-migration' attribute on the file, which will notify the clients' DHT translator. The clients' DHT translator will then take care to send any further changes to the destination subvolume as well. This way we satisfy the first requirement
+
+2. Migration completed - Once the file has been migrated, the rebalance process will set a 'migration-complete' attribute on the file. The clients will be notified of the completion and all further operations on the file will happen on the destination subvolume.
+
+The DHT translator handles the above and allows the applications on the clients to continue working on a file under migration.
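
Note on rebalance.md: the placement rule it describes (each DHT subvolume owns a range of the 32-bit hash space, and a file lives on the subvolume whose range contains the hash of its name) can be illustrated with a small sketch. This is not GlusterFS code. The equal-sized ranges, the use of CRC32 as the hash, and the names `assign_ranges`, `file_hash`, `locate` and `files_to_migrate` are all assumptions made for illustration; the real DHT hash function and layout logic differ.

```python
import zlib

HASH_SPACE = 2**32  # 32-bit hash space, as described in rebalance.md


def assign_ranges(subvolumes):
    """Split the hash space into equal contiguous ranges (assumed layout)."""
    n = len(subvolumes)
    step = HASH_SPACE // n
    ranges = []
    for i, name in enumerate(subvolumes):
        start = i * step
        # The last subvolume absorbs any remainder so the space is fully covered.
        end = HASH_SPACE - 1 if i == n - 1 else (i + 1) * step - 1
        ranges.append((start, end, name))
    return ranges


def file_hash(filename):
    """Stand-in 32-bit hash of the file name (CRC32, not the hash DHT uses)."""
    return zlib.crc32(filename.encode("utf-8")) & 0xFFFFFFFF


def locate(filename, ranges):
    """Return the subvolume whose hash range contains the file's hash."""
    h = file_hash(filename)
    for start, end, name in ranges:
        if start <= h <= end:
            return name
    raise ValueError("hash outside every range")


def files_to_migrate(filenames, old_ranges, new_ranges):
    """Files whose subvolume changes once the ranges are recalculated."""
    return [f for f in filenames
            if locate(f, old_ranges) != locate(f, new_ranges)]


if __name__ == "__main__":
    old = assign_ranges(["subvol-0", "subvol-1"])
    new = assign_ranges(["subvol-0", "subvol-1", "subvol-2"])  # after add-brick
    files = [f"vm-image-{i}.img" for i in range(10)]
    for f in files_to_migrate(files, old, new):
        print(f, locate(f, old), "->", locate(f, new))
```

Under these assumptions, the files printed are the ones a rebalance would have to move after the hash ranges are recalculated. A rename changes the input to `file_hash`, which is why it can also move a file to a different subvolume.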