-rw-r--r-- | doc/features/bd-xlator.md | 406 |
1 files changed, 406 insertions, 0 deletions
diff --git a/doc/features/bd-xlator.md b/doc/features/bd-xlator.md
new file mode 100644
index 00000000000..743e78b1eb0
--- /dev/null
+++ b/doc/features/bd-xlator.md
@@ -0,0 +1,406 @@
+#Block device translator
+
+Block device translator (BD xlator) is a translator added to GlusterFS that provides a block backend for GlusterFS. It replaces the existing bd_map translator, which provided similar but very limited functionality. GlusterFS expects the underlying brick to be formatted with a POSIX-compatible file system. BD xlator changes that and allows bricks that are raw block devices, such as LVM logical volumes, which need not have any file system on them. Hence with BD xlator it becomes possible to build a GlusterFS volume composed of bricks that are logical volumes (LV).
+
+##bd
+
+BD xlator maps underlying LVs to files, so the LVs appear as files to GlusterFS clients. Though a BD volume externally appears very similar to the usual Posix volume, not all operations are supported or possible for the files on a BD volume. Only those operations that make sense for a block device are supported, and the exact semantics are described in subsequent sections.
+
+While a Posix volume takes a file system directory as a brick, a BD volume needs a volume group (VG) as a brick. In the usual use case, a file created on a BD volume results in an LV being created in the brick VG. In addition to a VG, a BD volume also needs a file system directory, which is specified at volume creation time. This directory is necessary for supporting the notion of directories and a directory hierarchy on the BD volume. Metadata about LVs (size, mapping info) is stored in this directory.
+
+BD xlator was mainly developed to use block devices directly as VM images when GlusterFS is used as storage for KVM virtualization.
+Some of the salient points of BD xlator are:
+
+* Since BD supports file-level snapshots and clones by leveraging the snapshot and clone capabilities of LVM, it can be used to fully off-load snapshot and cloning operations from QEMU to the storage (GlusterFS) itself.
+
+* BD understands dm-thin LVs and hence can support files that are backed by thinly provisioned LVs. This capability of BD xlator translates to having thinly provisioned raw VM images.
+
+* BD enables thin LVs from a thin pool to be used from multiple nodes that have visibility to the GlusterFS BD volume. Thus a thin pool can be used as a VM image repository, allowing access/visibility to it from multiple nodes.
+
+* BD supports true zerofill by using the BLKZEROOUT ioctl on underlying block devices. Thus BD allows SCSI WRITESAME to be used on the underlying block device if the device supports it.
+
+Though BD xlator is primarily intended to be used with block devices, it does provide full Posix xlator compatibility for files that are created on a BD volume but are not backed by or mapped to a block device. Such files, which don't have a block device mapping, exist in the Posix directory that is specified during BD volume creation. BD xlator is available from the GlusterFS-3.5 release.
+
+###Compiling BD translator
+
+BD xlator needs the lvm2 development library. The `--enable-bd-xlator` option can be passed to the `./configure` script to explicitly enable the BD translator. The following snippet from the output of the configure script shows that BD xlator is enabled for compilation.
+
+
+#####GlusterFS configure summary
+
+    …
+    Block Device xlator  : yes
+
+
+###Creating a BD volume
+
+BD supports hosting both linear LVs and thin LVs within the same volume. However, separate examples are provided below. As noted above, the prerequisite for a BD volume is a VG. Here the VG is created from a loop device, but it can be created from any other device too.
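
The configure summary line quoted in the compile section can also be checked from a script before any volume is created. The following is a minimal sketch, not part of GlusterFS: the function name `bd_xlator_enabled` is hypothetical, and it assumes the configure output was saved to a file and still contains the `Block Device xlator` line in the form shown above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with GlusterFS): read configure output on
# stdin and succeed only if the summary reports the BD xlator as enabled.
# Whitespace around the colon is tolerated because the summary is aligned.
bd_xlator_enabled() {
    grep -Eq '^[[:space:]]*Block Device xlator[[:space:]]*:[[:space:]]*yes[[:space:]]*$'
}

# Possible usage, assuming the output was captured first:
#   ./configure --enable-bd-xlator 2>&1 | tee configure.out
#   bd_xlator_enabled < configure.out && echo "BD xlator will be built"
```

Matching the whole line rather than just the word `yes` avoids a false positive from other summary entries that also report `yes`.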
+
+
+1. Creating BD volume with linear LV backend
+
+* Create a loop device
+
+
+    [root@node ~]# dd if=/dev/zero of=bd-loop count=1024 bs=1M
+
+    [root@node ~]# losetup /dev/loop0 bd-loop
+
+
+* Prepare a brick by creating a VG
+
+    [root@node ~]# pvcreate /dev/loop0
+
+    [root@node ~]# vgcreate bd-vg /dev/loop0
+
+
+* Create the BD volume
+
+* Create a POSIX directory first
+
+
+    [root@node ~]# mkdir /bd-meta
+
+It is recommended that this directory be created on an LV in the brick VG itself so that both data and metadata live together on the same device.
+
+
+* Create and mount the volume
+
+    [root@node ~]# gluster volume create bd node:/bd-meta?bd-vg force
+
+
+The general syntax for specifying the brick is `host:/posix-dir?volume-group-name`, where "?" is the separator.
+
+
+
+    [root@node ~]# gluster volume start bd
+    [root@node ~]# gluster volume info bd
+    Volume Name: bd
+    Type: Distribute
+    Volume ID: cb042d2a-f435-4669-b886-55f5927a4d7f
+    Status: Started
+    Xlator 1: BD
+    Capability 1: offload_copy
+    Capability 2: offload_snapshot
+    Number of Bricks: 1
+    Transport-type: tcp
+    Bricks:
+    Brick1: node:/bd-meta
+    Brick1 VG: bd-vg
+
+
+
+    [root@node ~]# mount -t glusterfs node:/bd /mnt
+
+* Create a file that is backed by an LV
+
+    [root@node ~]# ls /mnt
+
+    [root@node ~]#
+
+Since the volume is empty now, so is the underlying VG.
+
+    [root@node ~]# lvdisplay bd-vg
+    [root@node ~]#
+
+Creating a file that is mapped to an LV is a two-step operation. First the file is created on the mount point, and then a specific extended attribute is set to map the file to an LV.
+
+    [root@node ~]# touch /mnt/lv
+    [root@node ~]# setfattr -n "user.glusterfs.bd" -v "lv" /mnt/lv
+
+An LV has now been created in the brick VG, and the file /mnt/lv maps to this LV. Any read/write to this file ends up as a read/write to the underlying LV.
+
+    [root@node ~]# lvdisplay bd-vg
+      --- Logical volume ---
+      LV Path                /dev/bd-vg/6ff0f25f-2776-4d19-adfb-df1a3cab8287
+      LV Name                6ff0f25f-2776-4d19-adfb-df1a3cab8287
+      VG Name                bd-vg
+      LV UUID                PjMPcc-RkD5-RADz-6ixG-UYsk-oclz-vL0nv6
+      LV Write Access        read/write
+      LV Creation host, time node, 2013-11-26 16:15:45 +0530
+      LV Status              available
+      # open                 0
+      LV Size                4.00 MiB
+      Current LE             1
+      Segments               1
+      Allocation             inherit
+      Read ahead sectors     0
+      Block device           253:6
+
+The file is created with the default LV size of 1 LE, which is 4 MiB in this case.
+
+    [root@node ~]# ls -lh /mnt/lv
+    -rw-r--r--. 1 root root 4.0M Nov 26 16:15 /mnt/lv
+
+truncate can be used to set the required file size.
+
+    [root@node ~]# truncate /mnt/lv -s 256M
+    [root@node ~]# lvdisplay bd-vg
+      --- Logical volume ---
+      LV Path                /dev/bd-vg/6ff0f25f-2776-4d19-adfb-df1a3cab8287
+      LV Name                6ff0f25f-2776-4d19-adfb-df1a3cab8287
+      VG Name                bd-vg
+      LV UUID                PjMPcc-RkD5-RADz-6ixG-UYsk-oclz-vL0nv6
+      LV Write Access        read/write
+      LV Creation host, time node, 2013-11-26 16:15:45 +0530
+      LV Status              available
+      # open                 0
+      LV Size                256.00 MiB
+      Current LE             64
+      Segments               1
+      Allocation             inherit
+      Read ahead sectors     0
+      Block device           253:6
+
+
+    [root@node ~]# ls -lh /mnt/lv
+    -rw-r--r--. 1 root root 256M Nov 26 16:15 /mnt/lv
+
+The LV size has now been set to 256 MiB (64 LEs of 4 MiB each).
+
+The size of the file/LV can also be specified at creation/mapping time itself, like this:
+
+    setfattr -n "user.glusterfs.bd" -v "lv:256MB" /mnt/lv
+
+2. Creating BD volume with thin LV backend
+
+* Create a loop device
+
+
+    [root@node ~]# dd if=/dev/zero of=bd-loop-thin count=1024 bs=1M
+
+    [root@node ~]# losetup /dev/loop0 bd-loop-thin
+
+
+* Prepare a brick by creating a VG and thin pool
+
+
+    [root@node ~]# pvcreate /dev/loop0
+
+    [root@node ~]# vgcreate bd-vg-thin /dev/loop0
+
+
+* Create a thin pool
+
+
+    [root@node ~]# lvcreate --thin bd-vg-thin -L 1000M
+
+    Rounding up size to full physical extent 4.00 MiB
+    Logical volume "lvol0" created
+
+lvdisplay shows the thin pool
+
+    [root@node ~]# lvdisplay bd-vg-thin
+      --- Logical volume ---
+      LV Name                lvol0
+      VG Name                bd-vg-thin
+      LV UUID                HVa3EM-IVMS-QG2g-oqU6-1UxC-RgqS-g8zhVn
+      LV Write Access        read/write
+      LV Creation host, time node, 2013-11-26 16:39:06 +0530
+      LV Pool transaction ID 0
+      LV Pool metadata       lvol0_tmeta
+      LV Pool data           lvol0_tdata
+      LV Pool chunk size     64.00 KiB
+      LV Zero new blocks     yes
+      LV Status              available
+      # open                 0
+      LV Size                1000.00 MiB
+      Allocated pool data    0.00%
+      Allocated metadata     0.88%
+      Current LE             250
+      Segments               1
+      Allocation             inherit
+      Read ahead sectors     auto
+      Block device           253:9
+
+* Create the BD volume
+
+* Create a POSIX directory first
+
+
+    [root@node ~]# mkdir /bd-meta-thin
+
+* Create and mount the volume
+
+    [root@node ~]# gluster volume create bd-thin node:/bd-meta-thin?bd-vg-thin force
+
+    [root@node ~]# gluster volume start bd-thin
+
+
+    [root@node ~]# gluster volume info bd-thin
+    Volume Name: bd-thin
+    Type: Distribute
+    Volume ID: 27aa7eb0-4ffa-497e-b639-7cbda0128793
+    Status: Started
+    Xlator 1: BD
+    Capability 1: thin
+    Capability 2: offload_copy
+    Capability 3: offload_snapshot
+    Number of Bricks: 1
+    Transport-type: tcp
+    Bricks:
+    Brick1: node:/bd-meta-thin
+    Brick1 VG: bd-vg-thin
+
+
+    [root@node ~]# mount -t glusterfs node:/bd-thin /mnt
+
+* Create a file that is backed by a thin LV
+
+
+    [root@node ~]# ls /mnt
+
+    [root@node ~]#
+
+Creating a file that is mapped to a thin LV is a two-step operation.
+First the file is created on the mount point, and then a specific extended attribute is set to map the file to a thin LV.
+
+    [root@node ~]# touch /mnt/thin-lv
+
+    [root@node ~]# setfattr -n "user.glusterfs.bd" -v "thin:256MB" /mnt/thin-lv
+
+Now /mnt/thin-lv is a thinly provisioned file that is backed by a thin LV, and its size has been set to 256 MiB.
+
+    [root@node ~]# lvdisplay bd-vg-thin
+      --- Logical volume ---
+      LV Name                lvol0
+      VG Name                bd-vg-thin
+      LV UUID                HVa3EM-IVMS-QG2g-oqU6-1UxC-RgqS-g8zhVn
+      LV Write Access        read/write
+      LV Creation host, time node, 2013-11-26 16:39:06 +0530
+      LV Pool transaction ID 1
+      LV Pool metadata       lvol0_tmeta
+      LV Pool data           lvol0_tdata
+      LV Pool chunk size     64.00 KiB
+      LV Zero new blocks     yes
+      LV Status              available
+      # open                 0
+      LV Size                1000.00 MiB
+      Allocated pool data    0.00%
+      Allocated metadata     0.98%
+      Current LE             250
+      Segments               1
+      Allocation             inherit
+      Read ahead sectors     auto
+      Block device           253:9
+
+      --- Logical volume ---
+      LV Path                /dev/bd-vg-thin/081b01d1-1436-4306-9baf-41c7bf5a2c73
+      LV Name                081b01d1-1436-4306-9baf-41c7bf5a2c73
+      VG Name                bd-vg-thin
+      LV UUID                coxpTY-2UZl-9293-8H2X-eAZn-wSp6-csZIeB
+      LV Write Access        read/write
+      LV Creation host, time node, 2013-11-26 16:43:19 +0530
+      LV Pool name           lvol0
+      LV Status              available
+      # open                 0
+      LV Size                256.00 MiB
+      Mapped size            0.00%
+      Current LE             64
+      Segments               1
+      Allocation             inherit
+      Read ahead sectors     auto
+      Block device           253:10
+
+
+As can be seen above, creating the file resulted in the creation of a thin LV in the brick.
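
The two-step mapping used in both walkthroughs (create the file, then set `user.glusterfs.bd`) can be wrapped in a small script. The sketch below is illustrative only: the function names `bd_xattr_value` and `bd_create_file` and the `DRY_RUN` variable are hypothetical, and it assumes the value formats shown in this document (`lv`, `lv:256MB`, `thin:256MB`). Whether `thin` is accepted without a size is not stated here, so treat that case as untested.

```shell
#!/bin/sh
# Hypothetical helpers, not part of GlusterFS.

# Build the value for the user.glusterfs.bd xattr.
# $1: backing type, "lv" (linear) or "thin"; $2: optional size, e.g. 256MB.
bd_xattr_value() {
    case "$1" in
        lv|thin) ;;
        *) echo "unknown BD type: $1" >&2; return 1 ;;
    esac
    if [ -n "$2" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

# Create a file on the mounted BD volume and map it to an LV.
# $1: path on the mount; $2: type; $3: optional size.
# With DRY_RUN=1 (the default) the commands are printed instead of executed.
bd_create_file() {
    value=$(bd_xattr_value "$2" "$3") || return 1
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "touch $1"
        echo "setfattr -n user.glusterfs.bd -v $value $1"
    else
        touch "$1" && setfattr -n user.glusterfs.bd -v "$value" "$1"
    fi
}
```

For example, `bd_create_file /mnt/thin-lv thin 256MB` prints the same `touch` and `setfattr` commands used in the thin LV walkthrough; setting `DRY_RUN=0` would run them against a mounted BD volume.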
+
+For more information on the BD translator with respect to snapshots and clones,
+refer to http://raobharata.wordpress.com/2013/11/27/glusterfs-block-device-translator/
+
+###Improvements to the BD translator:
+
+The first version of BD xlator (block backend) had a few limitations:
+
+* Creation of directories not supported
+* Supports only a single brick
+* Does not use extended attributes (and client gfid) like posix xlator
+* Creation of special files (symbolic links, device nodes etc) not
+  supported
+
+The basic limitation of not allowing directory creation prevented
+oVirt/VDSM from consuming BD xlator as part of a Gluster domain, since VDSM
+creates multi-level directories when GlusterFS is used as the storage
+backend for storing VM images.
+
+To overcome these limitations, a new BD xlator with the following
+improvements was implemented.
+
+* New hybrid BD xlator that handles both regular files and block device
+  files
+* The volume will have both POSIX and BD bricks. Regular files are
+  created on POSIX bricks, block devices are created on the BD brick (VG)
+* BD xlator leverages the existing POSIX xlator for most POSIX calls and
+  hence sits above the POSIX xlator
+* A block device file is differentiated from a regular file by an extended
+  attribute
+* The xattr 'user.glusterfs.bd' (BD_XATTR) plays a role in mapping a
+  posix file to a Logical Volume (LV).
+* When a client sends a request to set BD_XATTR on a posix file, a new
+  LV is created and mapped to the posix file. So every block device will
+  have a representative file in the POSIX brick with 'user.glusterfs.bd'
+  (BD_XATTR) set.
+* Thereafter, all operations on this file result in LV-related
+  operations.
+
+For example, opening a file that has BD_XATTR set results in opening
+the LV block device, and reading it results in reading the corresponding LV
+block device.
+
+When BD xlator gets a request to set BD_XATTR via a setxattr call, it
+creates an LV, and information about this LV is placed in the xattr of the
+posix file.
+The xattr "user.glusterfs.bd" is used to identify that a posix file
+is mapped to BD.
+
+Usage:
+Server side:
+
+    [root@host1 ~]# gluster volume create bdvol host1:/storage/vg1_info?vg1 host2:/storage/vg2_info?vg2
+
+This creates a distributed gluster volume 'bdvol' with Volume Group vg1
+using posix brick /storage/vg1_info on host1 and Volume Group vg2 using
+/storage/vg2_info on host2.
+
+
+    [root@host1 ~]# gluster volume start bdvol
+
+Client side:
+
+    [root@node ~]# mount -t glusterfs host1:/bdvol /media
+    [root@node ~]# touch /media/posix
+
+This creates a regular posix file 'posix' in either the host1:/vg1 or the host2:/vg2 brick.
+
+    [root@node ~]# mkdir /media/image
+
+    [root@node ~]# touch /media/image/lv1
+
+
+This also creates a regular posix file 'lv1' in either the host1:/vg1 or the
+host2:/vg2 brick.
+
+    [root@node ~]# setfattr -n "user.glusterfs.bd" -v "lv" /media/image/lv1
+
+    [root@node ~]#
+
+
+The above setxattr results in creating a new LV in the corresponding brick's VG
+and sets 'user.glusterfs.bd' to the value 'lv:<default-extent-size>'.
+
+
+    [root@node ~]# truncate -s5G /media/image/lv1
+
+
+This results in resizing LV 'lv1' to 5G.
+
+The new BD xlator code is placed in the `xlators/storage/bd` directory.
+
+The volume UUID is also added to the VG as a tag so that the same VG cannot be
+used for other bricks/volumes. After deleting a gluster volume, one has to
+manually remove the associated tag using
+`vgchange <vg-name> --deltag trusted.glusterfs.volume-id:<volume-id>`
+
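
The manual tag cleanup described above can be scripted so that the tag string is not typed by hand. This is a sketch under assumptions: the function name `bd_untag_command` is hypothetical, and the tag format `trusted.glusterfs.volume-id:<volume-id>` is taken from the text above rather than verified against a running system. The function only prints the `vgchange --deltag` command so it can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical helper, not part of GlusterFS: print the vgchange command that
# removes the volume-id tag left on a VG after its gluster volume is deleted.
# $1: volume group name; $2: volume ID of the deleted volume.
bd_untag_command() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: bd_untag_command <vg-name> <volume-id>" >&2
        return 1
    fi
    # Tag format assumed from the documentation text above.
    printf 'vgchange --deltag trusted.glusterfs.volume-id:%s %s\n' "$2" "$1"
}

# Possible usage with the volume ID reported by "gluster volume info":
#   bd_untag_command bd-vg cb042d2a-f435-4669-b886-55f5927a4d7f
```

Before removing anything, the tags currently set on a VG can be listed with `vgs -o vg_name,vg_tags` to confirm the exact tag string.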