author    M. Mohan Kumar <mohan@in.ibm.com>    2013-11-13 22:44:42 +0530
committer Anand Avati <avati@redhat.com>       2013-11-13 11:38:42 -0800
commit    48c40e1a42efe1b59126406084821947d139dd0e (patch)
tree      74959ecda9b9bd56c85e0e32991c11c06b022296 /xlators/mgmt/glusterd/src/glusterd-volume-ops.c
parent    15a8ecd9b3eedf80881bd3dba81f16b7d2cb7c97 (diff)
bd: posix/multi-brick support to BD xlator
Current BD xlator (block backend) has a few limitations such as
* Creation of directories not supported
* Supports only single brick
* Does not use extended attributes (and client gfid) like posix xlator
* Creation of special files (symbolic links, device nodes, etc.) not
supported
The basic limitation of not allowing directory creation blocks oVirt/VDSM
from consuming the BD xlator as part of a Gluster storage domain, since
VDSM creates multi-level directories when GlusterFS is used as the storage
backend for storing VM images.
To overcome these limitations, a new BD xlator with the following
improvements is proposed:
* New hybrid BD xlator that handles both regular files and block device
files
* The volume will have both POSIX and BD bricks. Regular files are
created on POSIX bricks, block devices are created on the BD brick (VG)
* BD xlator leverages the existing POSIX xlator for most POSIX calls and
hence sits above the POSIX xlator
* A block device file is differentiated from a regular file by an extended
attribute
* The xattr 'user.glusterfs.bd' (BD_XATTR) plays a role in mapping a
posix file to a Logical Volume (LV)
* When a client sends a request to set BD_XATTR on a posix file, a new
LV is created and mapped to the posix file. So every block device will
have a representative file in the POSIX brick with 'user.glusterfs.bd'
(BD_XATTR) set.
* Hereafter, all operations on this file result in LV-related
operations.
For example, opening a file that has BD_XATTR set results in opening
the LV block device, and reading the file results in reading the
corresponding LV block device.
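For instance, ordinary POSIX I/O on such a file is served from the LV; an
illustrative command against the client mount shown under Usage below (the
destination path is hypothetical):
[root@node ~]# dd if=/media/image/lv1 of=/tmp/lv1.img bs=1M
The read path here goes to the LV block device, not to a regular posix file.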
When the BD xlator gets a request to set BD_XATTR via a setxattr call, it
creates an LV, and information about this LV is placed in the xattr of the
posix file. The xattr "user.glusterfs.bd" is used to identify that a posix
file is mapped to a BD.
Usage:
Server side:
[root@host1 ~]# gluster volume create bdvol host1:/storage/vg1_info?vg1 host2:/storage/vg2_info?vg2
It creates a distributed gluster volume 'bdvol' with Volume Group vg1
using posix brick /storage/vg1_info in host1 and Volume Group vg2 using
/storage/vg2_info in host2.
[root@host1 ~]# gluster volume start bdvol
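The '?<vg>' notation works for add-brick as well (see the changelog below),
so a brick backed by another VG could later be added along these lines;
host3 and vg3 are illustrative:
[root@host1 ~]# gluster volume add-brick bdvol host3:/storage/vg3_info?vg3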
Client side:
[root@node ~]# mount -t glusterfs host1:/bdvol /media
[root@node ~]# touch /media/posix
It creates a regular posix file 'posix' in either the host1:/vg1 or host2:/vg2 brick
[root@node ~]# mkdir /media/image
[root@node ~]# touch /media/image/lv1
It also creates a regular posix file 'lv1' in either the host1:/vg1 or
host2:/vg2 brick
[root@node ~]# setfattr -n "user.glusterfs.bd" -v "lv" /media/image/lv1
[root@node ~]#
The above setxattr results in creating a new LV in the corresponding
brick's VG, and it sets 'user.glusterfs.bd' to the value
'lv:<default-extent-size>'
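Assuming the xlator exposes BD_XATTR to clients via getxattr, the mapping
could be inspected back with getfattr (an illustrative check, not part of
the original example):
[root@node ~]# getfattr -n user.glusterfs.bd /media/image/lv1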
[root@node ~]# truncate -s5G /media/image/lv1
This resizes the LV 'lv1' to 5G.
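On the server, the resize can be cross-checked with standard LVM tools;
since LVs are named by the file's gfid (see the changelog below), the 5G LV
shows up under that gfid in the brick's VG. A sketch, assuming the file
landed on host1's vg1:
[root@host1 ~]# lvs vg1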
The new BD xlator code is placed in the xlators/storage/bd directory.
A volume-uuid tag is also added to the VG so that the same VG can't be used
for other bricks/volumes. After deleting a gluster volume, one has to
manually remove the associated tag using vgchange <vg-name> --deltag
trusted.glusterfs.volume-id:<volume-id>
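For example, assuming the brick used VG vg1 (the actual volume-id must be
substituted):
[root@host1 ~]# vgchange --deltag trusted.glusterfs.volume-id:<volume-id> vg1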
Changes from previous version V5:
* Removed support for delayed deleting of LVs
Changes from previous version V4:
* Consolidated the patches
* Removed usage of BD_XATTR_SIZE and consolidated it in BD_XATTR.
Changes from previous version V3:
* Added support in FUSE to support full/linked clone
* Added support to merge snapshots and provide information about origin
* bd_map xlator removed
* iatt structure used in inode_ctx. iatt is cached and updated during
fsync/flush
* aio support
* Type and capabilities of volume are exported through getxattr
Changes from version 2:
* Used inode_context for caching BD size and for checking whether a loc/fd
is a BD or not.
* Added GlusterFS server offloaded copy and snapshot through setfattr
FOP. As part of this, libgfapi is modified.
* BD xlator supports stripe
* During unlinking, if an LV file is already opened, it is added to a
delete list and bd_del_thread tries to delete from this list when the last
reference to that file is closed.
Changes from previous version:
* gfid is used as name of LV
* ? is used to specify VG name for creating BD volume in volume
create, add-brick. gluster volume create volname host:/path?vg
* open-behind issue is fixed
* A replicate brick can be added dynamically and LVs from source brick
are replicated to destination brick
* A distribute brick can be added dynamically and rebalance operation
distributes existing LVs/files to the new brick
* Thin provisioning support added.
* bd_map xlator support retained
* setfattr -n user.glusterfs.bd -v "lv" creates a regular LV and
setfattr -n user.glusterfs.bd -v "thin" creates thin LV
* Capability and backend information added to gluster volume info (and
--xml) so that management tools can exploit the BD xlator.
* tracing support for bd xlator added
TODO:
* Add support to display snapshots for a given LV
* Display posix filename for list-origin instead of gfid
Change-Id: I00d32dfbab3b7c806e0841515c86c3aa519332f2
BUG: 1028672
Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Reviewed-on: http://review.gluster.org/4809
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
Diffstat (limited to 'xlators/mgmt/glusterd/src/glusterd-volume-ops.c')
-rw-r--r--   xlators/mgmt/glusterd/src/glusterd-volume-ops.c   149
1 file changed, 149 insertions(+), 0 deletions(-)
diff --git a/xlators/mgmt/glusterd/src/glusterd-volume-ops.c b/xlators/mgmt/glusterd/src/glusterd-volume-ops.c
index 034004dbd9e..a2bd7334c93 100644
--- a/xlators/mgmt/glusterd/src/glusterd-volume-ops.c
+++ b/xlators/mgmt/glusterd/src/glusterd-volume-ops.c
@@ -12,6 +12,10 @@
 #include "config.h"
 #endif
 
+#ifdef HAVE_BD_XLATOR
+#include <lvm2app.h>
+#endif
+
 #include "common-utils.h"
 #include "syscall.h"
 #include "cli1-xdr.h"
@@ -26,6 +30,7 @@
 #define glusterd_op_start_volume_args_get(dict, volname, flags) \
         glusterd_op_stop_volume_args_get (dict, volname, flags)
 
+
 int
 __glusterd_handle_create_volume (rpcsvc_request_t *req)
 {
@@ -599,6 +604,101 @@ glusterd_handle_cli_statedump_volume (rpcsvc_request_t *req)
                                             __glusterd_handle_cli_statedump_volume);
 }
 
+#ifdef HAVE_BD_XLATOR
+/*
+ * Validates if given VG in the brick exists or not. Also checks if VG has
+ * GF_XATTR_VOL_ID_KEY tag set to avoid using same VG for multiple bricks.
+ * Tag is checked only during glusterd_op_stage_create_volume. Tag is set
+ * during glusterd_validate_and_create_brickpath().
+ * @brick - brick info, @check_tag - check for VG tag or not
+ * @msg - Error message to return to caller
+ */
+int
+glusterd_is_valid_vg (glusterd_brickinfo_t *brick, int check_tag, char *msg)
+{
+        lvm_t                     handle     = NULL;
+        vg_t                      vg         = NULL;
+        char                     *vg_name    = NULL;
+        int                       retval     = 0;
+        char                     *p          = NULL;
+        char                     *ptr        = NULL;
+        struct dm_list           *dm_lvlist  = NULL;
+        struct dm_list           *dm_seglist = NULL;
+        struct lvm_lv_list       *lv_list    = NULL;
+        struct lvm_property_value prop       = {0, };
+        struct lvm_lvseg_list    *seglist    = NULL;
+        struct dm_list           *taglist    = NULL;
+        struct lvm_str_list      *strl       = NULL;
+
+        handle = lvm_init (NULL);
+        if (!handle) {
+                sprintf (msg, "lvm_init failed, could not validate vg");
+                return -1;
+        }
+        if (*brick->vg == '\0') { /* BD xlator has vg in brick->path */
+                p = gf_strdup (brick->path);
+                vg_name = strtok_r (p, "/", &ptr);
+        } else
+                vg_name = brick->vg;
+
+        vg = lvm_vg_open (handle, vg_name, "r", 0);
+        if (!vg) {
+                sprintf (msg, "no such vg: %s", vg_name);
+                retval = -1;
+                goto out;
+        }
+        if (!check_tag)
+                goto next;
+
+        taglist = lvm_vg_get_tags (vg);
+        if (!taglist)
+                goto next;
+
+        dm_list_iterate_items (strl, taglist) {
+                if (!strncmp(strl->str, GF_XATTR_VOL_ID_KEY,
+                             strlen (GF_XATTR_VOL_ID_KEY))) {
+                        sprintf (msg, "VG %s is already part of"
+                                 " a brick", vg_name);
+                        retval = -1;
+                        goto out;
+                }
+        }
+next:
+
+        brick->caps = CAPS_BD;
+
+        dm_lvlist = lvm_vg_list_lvs (vg);
+        if (!dm_lvlist)
+                goto out;
+
+        dm_list_iterate_items (lv_list, dm_lvlist) {
+                dm_seglist = lvm_lv_list_lvsegs (lv_list->lv);
+                dm_list_iterate_items (seglist, dm_seglist) {
+                        prop = lvm_lvseg_get_property (seglist->lvseg,
+                                                       "segtype");
+                        if (!prop.is_valid || !prop.value.string)
+                                continue;
+                        if (!strcmp (prop.value.string, "thin-pool")) {
+                                brick->caps |= CAPS_THIN;
+                                gf_log (THIS->name, GF_LOG_INFO, "Thin Pool "
+                                        "\"%s\" will be used for thin LVs",
+                                        lvm_lv_get_name (lv_list->lv));
+                                break;
+                        }
+                }
+        }
+
+        retval = 0;
+out:
+        if (vg)
+                lvm_vg_close (vg);
+        lvm_quit (handle);
+        if (p)
+                GF_FREE (p);
+        return retval;
+}
+#endif
+
 /* op-sm */
 int
 glusterd_op_stage_create_volume (dict_t *dict, char **op_errstr)
@@ -712,6 +812,11 @@ glusterd_op_stage_create_volume (dict_t *dict, char **op_errstr)
         }
 
         if (!uuid_compare (brick_info->uuid, MY_UUID)) {
+                if (brick_info->vg[0]) {
+                        ret = glusterd_is_valid_vg (brick_info, 1, msg);
+                        if (ret)
+                                goto out;
+                }
                 ret = glusterd_validate_and_create_brickpath (brick_info,
                                                 volume_uuid, op_errstr,
                                                 is_force);
@@ -809,6 +914,7 @@ glusterd_op_stage_start_volume (dict_t *dict, char **op_errstr)
         uuid_t                                  volume_id = {0,};
         char                                    volid[50] = {0,};
         char                                    xattr_volid[50] = {0,};
+        int                                     caps = 0;
 
         this = THIS;
         GF_ASSERT (this);
@@ -847,6 +953,7 @@ glusterd_op_stage_start_volume (dict_t *dict, char **op_errstr)
                 }
         }
 
+
         list_for_each_entry (brickinfo, &volinfo->bricks, brick_list) {
                 ret = glusterd_resolve_brick (brickinfo);
                 if (ret) {
@@ -902,8 +1009,24 @@ glusterd_op_stage_start_volume (dict_t *dict, char **op_errstr)
                         ret = -1;
                         goto out;
                 }
+#ifdef HAVE_BD_XLATOR
+                if (brickinfo->vg[0])
+                        caps = CAPS_BD | CAPS_THIN;
+
+                /* Check for VG/thin pool if its BD volume */
+                if (brickinfo->vg[0]) {
+                        ret = glusterd_is_valid_vg (brickinfo, 0, msg);
+                        if (ret)
+                                goto out;
+                        /* if anyone of the brick does not have thin support,
+                           disable it for entire volume */
+                        caps &= brickinfo->caps;
+                } else
+                        caps = 0;
+#endif
         }
 
+        volinfo->caps = caps;
         ret = 0;
 out:
         if (ret && (msg[0] != '\0')) {
@@ -1315,6 +1438,8 @@ glusterd_op_create_volume (dict_t *dict, char **op_errstr)
         char                 *str        = NULL;
         char                 *username   = NULL;
         char                 *password   = NULL;
+        int                   caps       = 0;
+        char                  msg[1024] __attribute__((unused)) = {0, };
 
         this = THIS;
         GF_ASSERT (this);
@@ -1477,6 +1602,7 @@ glusterd_op_create_volume (dict_t *dict, char **op_errstr)
         if (count)
                 brick = strtok_r (brick_list+1, " \n", &saveptr);
 
+        caps = CAPS_BD | CAPS_THIN;
         while ( i <= count) {
 
                 ret = glusterd_brickinfo_new_from_brick (brick, &brickinfo);
@@ -1489,6 +1615,27 @@ glusterd_op_create_volume (dict_t *dict, char **op_errstr)
                                 brickinfo->hostname, brickinfo->path);
                         goto out;
                 }
+
+#ifdef HAVE_BD_XLATOR
+                if (!uuid_compare (brickinfo->uuid, MY_UUID)) {
+                        if (brickinfo->vg[0]) {
+                                ret = glusterd_is_valid_vg (brickinfo, 0, msg);
+                                if (ret) {
+                                        gf_log (this->name, GF_LOG_ERROR, "%s",
+                                                msg);
+                                        goto out;
+                                }
+
+                                /* if anyone of the brick does not have thin
+                                   support, disable it for entire volume */
+                                caps &= brickinfo->caps;
+
+                        } else
+                                caps = 0;
+                }
+#endif
+
                 list_add_tail (&brickinfo->brick_list, &volinfo->bricks);
                 brick = strtok_r (NULL, " \n", &saveptr);
                 i++;
@@ -1496,6 +1643,8 @@ glusterd_op_create_volume (dict_t *dict, char **op_errstr)
 
         gd_update_volume_op_versions (volinfo);
 
+        volinfo->caps = caps;
+
         ret = glusterd_store_volinfo (volinfo, GLUSTERD_VOLINFO_VER_AC_INCREMENT);
         if (ret) {
                 glusterd_store_delete_volume (volinfo);