Diffstat (limited to 'doc')
| -rw-r--r-- | doc/Makefile.am | 5 |
| -rw-r--r-- | doc/admin-guide/en-US/markdown/admin_managing_snapshots.md | 66 |
| -rw-r--r-- | doc/features/bd.txt | 130 |
| -rw-r--r-- | doc/gluster.8 | 23 |
| -rw-r--r-- | doc/glusterd.vol | 8 |
| -rw-r--r-- | doc/hacker-guide/en-US/markdown/adding-fops.md | 18 |
| -rw-r--r-- | doc/hacker-guide/en-US/markdown/posix.md | 59 |
| -rw-r--r-- | doc/legacy/hacker-guide/adding-fops.txt | 33 |
| -rw-r--r-- | doc/legacy/hacker-guide/posix.txt | 59 |
| -rw-r--r-- | doc/logging.txt | 11 |
| -rw-r--r-- | doc/split-brain.md | 251 |
11 files changed, 403 insertions, 260 deletions
diff --git a/doc/Makefile.am b/doc/Makefile.am
index 0f9feb250..1103b607d 100644
--- a/doc/Makefile.am
+++ b/doc/Makefile.am
@@ -1,9 +1,6 @@
-EXTRA_DIST = glusterfs.8 mount.glusterfs.8 glusterd.vol gluster.8 \
+EXTRA_DIST = glusterfs.8 mount.glusterfs.8 gluster.8 \
  glusterd.8 glusterfsd.8
 
-voldir = $(sysconfdir)/glusterfs
-vol_DATA = glusterd.vol
-
 man8_MANS = glusterfs.8 mount.glusterfs.8 gluster.8 glusterd.8 glusterfsd.8
 
 CLEANFILES =
diff --git a/doc/admin-guide/en-US/markdown/admin_managing_snapshots.md b/doc/admin-guide/en-US/markdown/admin_managing_snapshots.md
new file mode 100644
index 000000000..e76ee9151
--- /dev/null
+++ b/doc/admin-guide/en-US/markdown/admin_managing_snapshots.md
@@ -0,0 +1,66 @@
+Managing GlusterFS Volume Snapshots
+==========================
+
+This section describes how to perform common GlusterFS volume snapshot
+management operations
+
+Pre-requisites
+=====================
+
+GlusterFS volume snapshot feature is based on thinly provisioned LVM snapshot.
+To make use of snapshot feature GlusterFS volume should fulfill following
+pre-requisites:
+
+* Each brick should be on an independent thinly provisioned LVM.
+* Brick LVM should not contain any other data other than brick.
+* None of the brick should be on a thick LVM.
+
+
+Snapshot Management
+=====================
+
+
+**Snapshot creation**
+
+*gluster snapshot create \<vol-name\> \[-n \<snap-name\>\] \[-d \<description\>\]*
+
+This command will create a snapshot of a GlusterFS volume. User can provide a snap-name and a description to identify the snap. The description cannot be more than 1024 characters.
+
+Volume should be present and it should be in started state.
+
+**Restoring snaps**
+
+*gluster snapshot restore -v \<vol-name\> \<snap-name\>*
+
+This command restores an already taken snapshot of a GlusterFS volume. Snapshot restore is an offline activity therefore if the volume is online then the restore operation will fail.
+
+Once the snapshot is restored it will be deleted from the list of snapshot.
+
+**Deleting snaps**
+
+*gluster snapshot delete \<volname\>\ -s \<snap-name\> \[force\]*
+
+This command will delete the specified snapshot.
+
+**Listing of available snaps**
+
+*gluster snapshot list \[\<volname\> \[-s \<snap-name>\]\]*
+
+This command is used to list all snapshots taken, or for a specified volume. If snap-name is provided then it will list the details of that snap.
+
+**Configuring the snapshot behavior**
+
+*gluster snapshot config \[\<vol-name | all\>\]*
+
+This command will display existing config values for a volume. If volume name is not provided then config values of all the volume is displayed. System config is displayed irrespective of volume name.
+
+*gluster snapshot config \<vol-name | all\> \[\<snap-max-hard-limit\> \<count\>\] \[\<snap-max-soft-limit\> \<percentage\>\]*
+
+The above command can be used to change the existing config values. If vol-name is provided then config value of that volume is changed, else it will set/change the system limit.
+
+The system limit is the default value of the config for all the volume. Volume specific limit cannot cross the system limit. If a volume specific limit is not provided then system limit will be considered.
+
+If any of this limit is decreased and the current snap count of the system/volume is more than the limit then the command will fail. If user still want to decrease the limit then force option should be used.
+
+
+
diff --git a/doc/features/bd.txt b/doc/features/bd.txt
deleted file mode 100644
index c1ba006ef..000000000
--- a/doc/features/bd.txt
+++ /dev/null
@@ -1,130 +0,0 @@
-Sections
-1. Introduction
-2. Advantages
-3. Creating BD backend volume
-4. BD volume file
-5. Using BD backend gluster volume
-6. Limitations
-7. TODO
-
-1. Introduction
-===============
-Block Device translator(BD xlator) represented as storage/bd_map in
-volume file adds a new backend 'block' to GlusterFS. It enables
-GlusterFS to export block devices as regular files to the client.
-Currently BD xlator supports exporting of 'Volume Group(VG)' as
-directory and Logical Volumes(LV) within that VG as regular files to the
-client.
-
-The eventual goal of this work is to support thin provisioning,
-snapshot, copy etc of VM images seamlessly in GlusterFS storage
-environment
-
-The immediate goal of this translator is to use LVs to store
-VM images and expose them as files to QEMU/KVM. Given VG is represented
-as directory and its logical volumes as files.
-
-BD xlator uses lvm2-devel APIs for getting the list of VGs and LVs in
-the system and lvm binaries (such as lvcreate, lvresize etc) to perform
-the required LV operations.
-
-2. Advantages
-=============
-By exporting LVs as regular files, it becomes possible to:
-* Associate each VM to a LV so that there is no file system overhead.
-* Use file system commands like cp to take copy of VM images
-* Create linked clones of VM by doing LV snapshot at server
-side
-* Implement thin provisioning by developing a qcow2 translator
-
-3. Creating BD backend volume
-=============================
-New parameter "device vg" in volume create command is used to create BD
-backend gluster volumes.
-
-For example
- $ gluster volume create my-volume device vg hostname:/my-vg
-
-creates gluster volume 'my-volume' with BD backend which uses the VG
-'my-vg' to store data. VG 'my-vg' should exist before creating this
-gluster volume.
-
-4. BD volume file
-=================
-BD backend volume file specifies which VG to export to the client. The
-section in the volume file that describes BD xlator looks like this.
-
-volume my-volume-bd_map
-type storage/bd_map
-option device vg
-option export volume-group
-end-volume
-
-option device=vg specifies that it should use VG as block backend. option
-export=volume-group specifies that it should export VG "volume-group"
-to the client.
-
-5. Using BD backend gluster volume
-==================================
-Mount
------
- $ mount -t glusterfs hostname:/my-volume /media/bd
- $ cd /media/bd
-
-From the mount point:
---------------------
-* Creating a new file (ie LV) involves two steps
- $ touch lv1
- $ truncate -s <size> lv1
- or
- $ qemu-img create -f <format> gluster:/hostname/my-volume/path-to-image <size>
-
-* Cloning an LV
- $ ln lv1 lv2
-
-* Snapshotting an LV
- $ ln -s lv1 lv1-ss
-
-* Passing it to QEMU as one of the drives
- $ qemu -drive file=<mount>/<file>,if=<if-type>
-
-* GlusterFS is one of the supported QEMU block drivers, the URI format
- is
- gluster[+transport]://[server[:port]]/my-volume/image[?socket=...]
- ie
- $ qemu -drive file=gluster:/hostname/my-volume/path-to-image,if=<if-type>
-
-Using Gluster CLI:
------------------
-* To create a new image of required size
- $ gluster bd create my-volume:/path-to-image <size>
-
-* To delete an existing image
- $ gluster bd delete my-volume:/path-to-image
-
-* To clone (full clone) an image
- $ gluster bd clone my-volume:/path-to-image new-image
-
-* To take a snapshot of an image
- $ gluster bd snapshot my-volume:/path-to-image snapshot-image <size>
-
-All gluster BD commands need size to specified in terms of KB, MB, etc.
-
-6. Limitations
-==============
-* No support to create multiple bricks
-* Image creation should be used with truncate to get proper size or use
- qemu-img create
-* cp command can't be used to copy images, instead use ln command or
- gluster bd clone command
-* ln -s command throws an error even if snapshot is successful
-* When ln command used on BD volumes, target file's inode is different
- from target's
-* Creation/deletion of directories, xattr operations, mknod and readlink
- operations are not supported.
-
-7. TODO
-=======
-Add support for exporting LUNs also as a regular files.
-Add support for xattr and multi brick support
-Include support for device mapper thin targets
diff --git a/doc/gluster.8 b/doc/gluster.8
index b23e2c891..3c78fb8b1 100644
--- a/doc/gluster.8
+++ b/doc/gluster.8
@@ -8,7 +8,7 @@
 .\" cases as published by the Free Software Foundation.
 .\"
 .\"
-.TH Gluster 8 "Gluster command line utility" "22 November 2012" "Gluster Inc."
+.TH Gluster 8 "Gluster command line utility" "07 March 2011" "Gluster Inc."
 .SH NAME
 gluster - Gluster Console Manager (command line utility)
 .SH SYNOPSIS
@@ -36,11 +36,9 @@ The Gluster Console Manager is a command line utility for elastic volume managem
 \fB\ volume info [all|<VOLNAME>] \fR
 Display information about all volumes, or the specified volume.
 .TP
-\fB\ volume create <NEW-VOLNAME> [device vg] [stripe <COUNT>] [replica <COUNT>] [transport <tcp|rdma|tcp,rdma>] <NEW-BRICK> ... \fR
+\fB\ volume create <NEW-VOLNAME> [stripe <COUNT>] [replica <COUNT>] [transport <tcp|rdma|tcp,rdma>] <NEW-BRICK> ... \fR
 Create a new volume of the specified type using the specified bricks and transport type (the default transport type is tcp).
 To create a volume with both transports (tcp and rdma), give 'transport tcp,rdma' as an option.
-device vg parameter specifies volume should use block backend instead of regular posix backend. In this case NEW-BRICK should specify an existing Volume Group and there can be only one brick for Block backend volumes. \fR
-Refer Block backend section for more details
 .TP
 \fB\ volume delete <VOLNAME> \fR
 Delete the specified volume.
@@ -59,9 +57,6 @@ Set the volume options.
 .TP
 \fB\ volume help \fR
 Display help for the volume command.
-.SS "Block backend"
-.TP
-By specifying "device vg" in volume create, a volume capable of exporting block devices(ie Volume Groups (VG)) is created. As of now exporting only VG is supported. While creating block backend volume the "VG" (mentioned in NEW-BRICK) must exist (ie created with vgcreate). VG is exported as a directory and all LVs under that VG will be exported as files. Please refer BD commands section for Block backend related commands
 .SS "Brick Commands"
 .PP
 .TP
@@ -110,20 +105,6 @@ Display the status of peers.
 .TP
 \fB\ peer help \fR
 Display help for the peer command.
-.SS "BD commands"
-.TP
-\fB\ bd create <VOLNAME:/path-to-image> <size> \fR
-Creates a new image of given size in the volume. Size can be suffixed with MB, GB etc, if nothing specified MB is taken as default.
-.TP
-\fB\ bd delete <VOLNAME:/path-to-image> \fR
-Deletes a image in the volume
-.TP
-\fB\ bd clone <VOLNAME:/path-to-image> <new-image> \fR
-Clones an existing image (full clone)
-.TP
-\fB\ bd snapshot <VOLNAME:/path-to-image> <new-image> <size> \fR
-Creates a linked clone of an existing image with given size. Size can be suffixed with MB, GB etc, if nothing specified MB is taken as default.
-
 .SS "Other Commands"
 .TP
 \fB\ help \fR
diff --git a/doc/glusterd.vol b/doc/glusterd.vol
deleted file mode 100644
index de17d8fd8..000000000
--- a/doc/glusterd.vol
+++ /dev/null
@@ -1,8 +0,0 @@
-volume management
- type mgmt/glusterd
- option working-directory /var/lib/glusterd
- option transport-type socket,rdma
- option transport.socket.keepalive-time 10
- option transport.socket.keepalive-interval 2
- option transport.socket.read-fail-log off
-end-volume
diff --git a/doc/hacker-guide/en-US/markdown/adding-fops.md b/doc/hacker-guide/en-US/markdown/adding-fops.md
new file mode 100644
index 000000000..3f72ed3e2
--- /dev/null
+++ b/doc/hacker-guide/en-US/markdown/adding-fops.md
@@ -0,0 +1,18 @@
+Adding a new FOP
+================
+
+Steps to be followed when adding a new FOP to GlusterFS:
+
+1. Edit `glusterfs.h` and add a `GF_FOP_*` constant.
+2. Edit `xlator.[ch]` and:
+ * add the new prototype for fop and callback.
+ * edit `xlator_fops` structure.
+3. Edit `xlator.c` and add to fill_defaults.
+4. Edit `protocol.h` and add struct necessary for the new FOP.
+5. Edit `defaults.[ch]` and provide default implementation.
+6. Edit `call-stub.[ch]` and provide stub implementation.
+7. Edit `common-utils.c` and add to gf_global_variable_init().
+8. Edit client-protocol and add your FOP.
+9. Edit server-protocol and add your FOP.
+10. Implement your FOP in any translator for which the default implementation
+ is not sufficient.
diff --git a/doc/hacker-guide/en-US/markdown/posix.md b/doc/hacker-guide/en-US/markdown/posix.md
new file mode 100644
index 000000000..84c813e55
--- /dev/null
+++ b/doc/hacker-guide/en-US/markdown/posix.md
@@ -0,0 +1,59 @@
+storage/posix translator
+========================
+
+Notes
+-----
+
+### `SET_FS_ID`
+
+This is so that all filesystem checks are done with the user's
+uid/gid and not GlusterFS's uid/gid.
+
+### `MAKE_REAL_PATH`
+
+This macro concatenates the base directory of the posix volume
+('option directory') with the given path.
+
+### `need_xattr` in lookup
+
+If this flag is passed, lookup returns a xattr dictionary that contains
+the file's create time, the file's contents, and the version number
+of the file.
+
+This is a hack to increase small file performance. If an application
+wants to read a small file, it can finish its job with just a lookup
+call instead of a lookup followed by read.
+
+### `getdents`/`setdents`
+
+These are used by unify to set and get directory entries.
+
+### `ALIGN_BUF`
+
+Macro to align an address to a page boundary (4K).
+
+### `priv->export_statfs`
+
+In some cases, two exported volumes may reside on the same
+partition on the server. Sending statvfs info for both
+the volumes will lead to erroneous df output at the client,
+since free space on the partition will be counted twice.
+
+In such cases, user can disable exporting statvfs info
+on one of the volumes by setting this option.
+
+### `xattrop`
+
+This fop is used by replicate to set version numbers on files.
+
+### `getxattr`/`setxattr` hack to read/write files
+
+A key, `GLUSTERFS_FILE_CONTENT_STRING`, is handled in a special way by
+`getxattr`/`setxattr`. A getxattr with the key will return the entire
+content of the file as the value. A `setxattr` with the key will write
+the value as the entire content of the file.
+
+### `posix_checksum`
+
+This calculates a simple XOR checksum on all entry names in a
+directory that is used by unify to compare directory contents.
diff --git a/doc/legacy/hacker-guide/adding-fops.txt b/doc/legacy/hacker-guide/adding-fops.txt
deleted file mode 100644
index e70dbbdc8..000000000
--- a/doc/legacy/hacker-guide/adding-fops.txt
+++ /dev/null
@@ -1,33 +0,0 @@
- HOW TO ADD A NEW FOP TO GlusterFS
- =================================
-
-Steps to be followed when adding a new FOP to GlusterFS:
-
-1. Edit glusterfs.h and add a GF_FOP_* constant.
-
-2. Edit xlator.[ch] and:
- 2a. add the new prototype for fop and callback.
- 2b. edit xlator_fops structure.
-
-3. Edit xlator.c and add to fill_defaults.
-
-4. Edit protocol.h and add struct necessary for the new FOP.
-
-5. Edit defaults.[ch] and provide default implementation.
-
-6. Edit call-stub.[ch] and provide stub implementation.
-
-7. Edit common-utils.c and add to gf_global_variable_init().
-
-8. Edit client-protocol and add your FOP.
-
-9. Edit server-protocol and add your FOP.
-
-10. Implement your FOP in any translator for which the default implementation
- is not sufficient.
-
-==========================================
-Last updated: Mon Oct 27 21:35:49 IST 2008
-
-Author: Vikas Gorur <vikas@gluster.com>
-==========================================
diff --git a/doc/legacy/hacker-guide/posix.txt b/doc/legacy/hacker-guide/posix.txt
deleted file mode 100644
index 7958af2ea..000000000
--- a/doc/legacy/hacker-guide/posix.txt
+++ /dev/null
@@ -1,59 +0,0 @@
----------------
-* storage/posix
----------------
-
-- SET_FS_ID
-
- This is so that all filesystem checks are done with the user's
- uid/gid and not GlusterFS's uid/gid.
-
-- MAKE_REAL_PATH
-
- This macro concatenates the base directory of the posix volume
- ('option directory') with the given path.
-
-- need_xattr in lookup
-
- If this flag is passed, lookup returns a xattr dictionary that contains
- the file's create time, the file's contents, and the version number
- of the file.
-
- This is a hack to increase small file performance. If an application
- wants to read a small file, it can finish its job with just a lookup
- call instead of a lookup followed by read.
-
-- getdents/setdents
-
- These are used by unify to set and get directory entries.
-
-- ALIGN_BUF
-
- Macro to align an address to a page boundary (4K).
-
-- priv->export_statfs
-
- In some cases, two exported volumes may reside on the same
- partition on the server. Sending statvfs info for both
- the volumes will lead to erroneous df output at the client,
- since free space on the partition will be counted twice.
-
- In such cases, user can disable exporting statvfs info
- on one of the volumes by setting this option.
-
-- xattrop
-
- This fop is used by replicate to set version numbers on files.
-
-- getxattr/setxattr hack to read/write files
-
- A key, GLUSTERFS_FILE_CONTENT_STRING, is handled in a special way by
- getxattr/setxattr. A getxattr with the key will return the entire
- content of the file as the value. A setxattr with the key will write
- the value as the entire content of the file.
-
-- posix_checksum
-
- This calculates a simple XOR checksum on all entry names in a
- directory that is used by unify to compare directory contents.
-
-
diff --git a/doc/logging.txt b/doc/logging.txt
index d1e568a31..b4ee45996 100644
--- a/doc/logging.txt
+++ b/doc/logging.txt
@@ -55,11 +55,12 @@ gf_syslog (GF_ERR_DEV, LOG_ERR, "error reading configuration file");
 The logs are sent in CEE format (http://cee.mitre.org/) to syslog.
 Its targeted to rsyslog syslog server.
 
-This log framework can be disabled either at compile time or run time
+This log framework is enabled at compile time by default. This can be
+disabled by passing '--disable-syslog' to ./configure or '--without
+syslog' to rpmbuild
 
-- for compile time by passing '--disable-syslog' to ./configure or
- '--without syslog' to rpmbuild (or)
-- for run time by having a file /var/log/glusterd/logger.conf and
- restarting gluster services
+Even though its enabled at compile time, its required to have
+/etc/glusterfs/logger.conf file to make it into effect before starting
+gluster services
 
 Currently all gluster logs are sent with error code GF_ERR_DEV.
diff --git a/doc/split-brain.md b/doc/split-brain.md
new file mode 100644
index 000000000..b0d938e26
--- /dev/null
+++ b/doc/split-brain.md
@@ -0,0 +1,251 @@
+Steps to recover from File split-brain.
+======================================
+
+Quick Start:
+============
+1. Get the path of the file that is in split-brain:
+> It can be obtained either by
+> a) The command `gluster volume heal info split-brain`.
+> b) Identify the files for which file operations performed
+ from the client keep failing with Input/Output error.
+
+2. Close the applications that opened this file from the mount point.
+In case of VMs, they need to be powered-off.
+
+3. Decide on the correct copy:
+> This is done by observing the afr changelog extended attributes of the file on
+the bricks using the getfattr command; then identifying the type of split-brain
+(data split-brain, metadata split-brain, entry split-brain or split-brain due to
+gfid-mismatch); and finally determining which of the bricks contains the 'good copy'
+of the file.
+> `getfattr -d -m . -e hex <file-path-on-brick>`.
+It is also possible that one brick might contain the correct data while the
+other might contain the correct metadata.
+
+4. Reset the relevant extended attribute on the brick(s) that contains the
+'bad copy' of the file data/metadata using the setfattr command.
+> `setfattr -n <attribute-name> -v <attribute-value> <file-path-on-brick>`
+
+5. Trigger self-heal on the file by performing lookup from the client:
+> `ls -l <file-path-on-gluster-mount>`
+
+Detailed Instructions for steps 3 through 5:
+===========================================
+To understand how to resolve split-brain we need to know how to interpret the
+afr changelog extended attributes.
+
+Execute `getfattr -d -m . -e hex <file-path-on-brick>`
+
+* Example:
+[root@store3 ~]# getfattr -d -e hex -m. brick-a/file.txt
+\#file: brick-a/file.txt
+security.selinux=0x726f6f743a6f626a6563745f723a66696c655f743a733000
+trusted.afr.vol-client-2=0x000000000000000000000000
+trusted.afr.vol-client-3=0x000000000200000000000000
+trusted.gfid=0x307a5c9efddd4e7c96e94fd4bcdcbd1b
+
+The extended attributes with `trusted.afr.<volname>-client-<subvolume-index>`
+are used by afr to maintain changelog of the file.The values of the
+`trusted.afr.<volname>-client-<subvolume-index>` are calculated by the glusterfs
+client (fuse or nfs-server) processes. When the glusterfs client modifies a file
+or directory, the client contacts each brick and updates the changelog extended
+attribute according to the response of the brick.
+
+'subvolume-index' is nothing but (brick number - 1) in
+`gluster volume info <volname>` output.
+
+* Example:
+[root@pranithk-laptop ~]# gluster volume info vol
+ Volume Name: vol
+ Type: Distributed-Replicate
+ Volume ID: 4f2d7849-fbd6-40a2-b346-d13420978a01
+ Status: Created
+ Number of Bricks: 4 x 2 = 8
+ Transport-type: tcp
+ Bricks:
+ brick-a: pranithk-laptop:/gfs/brick-a
+ brick-b: pranithk-laptop:/gfs/brick-b
+ brick-c: pranithk-laptop:/gfs/brick-c
+ brick-d: pranithk-laptop:/gfs/brick-d
+ brick-e: pranithk-laptop:/gfs/brick-e
+ brick-f: pranithk-laptop:/gfs/brick-f
+ brick-g: pranithk-laptop:/gfs/brick-g
+ brick-h: pranithk-laptop:/gfs/brick-h
+
+In the example above:
+```
+Brick           | Replica set | Brick subvolume index
+----------------------------------------------------------------------------
+-/gfs/brick-a   | 0           | 0
+-/gfs/brick-b   | 0           | 1
+-/gfs/brick-c   | 1           | 2
+-/gfs/brick-d   | 1           | 3
+-/gfs/brick-e   | 2           | 4
+-/gfs/brick-f   | 2           | 5
+-/gfs/brick-g   | 3           | 6
+-/gfs/brick-h   | 3           | 7
+```
+
+Each file in a brick maintains the changelog of itself and that of the files
+present in all the other bricks in it's replica set as seen by that brick.
+
+In the example volume given above, all files in brick-a will have 2 entries,
+one for itself and the other for the file present in it's replica pair, i.e.brick-b:
+trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a)
+trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for brick-b as seen by brick-a
+
+Likewise, all files in brick-b will have:
+trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b
+trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for itself (brick-b)
+
+The same can be extended for other replica pairs.
+
+Interpreting Changelog (roughly pending operation count) Value:
+Each extended attribute has a value which is 24 hexa decimal digits.
+First 8 digits represent changelog of data. Second 8 digits represent changelog
+of metadata. Last 8 digits represent Changelog of directory entries.
+
+Pictorially representing the same, we have:
+```
+0x 000003d7 00000001 00000000
+       |        |        |
+       |        |         \_ changelog of directory entries
+       |         \_ changelog of metadata
+        \ _ changelog of data
+```
+
+
+For Directories metadata and entry changelogs are valid.
+For regular files data and metadata changelogs are valid.
+For special files like device files etc metadata changelog is valid.
+When a file split-brain happens it could be either data split-brain or
+meta-data split-brain or both. When a split-brain happens the changelog of the
+file would be something like this:
+
+* Example:(Lets consider both data, metadata split-brain on same file).
+[root@pranithk-laptop vol]# getfattr -d -m . -e hex /gfs/brick-?/a
+getfattr: Removing leading '/' from absolute path names
+\#file: gfs/brick-a/a
+trusted.afr.vol-client-0=0x000000000000000000000000
+trusted.afr.vol-client-1=0x000003d70000000100000000
+trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57
+\#file: gfs/brick-b/a
+trusted.afr.vol-client-0=0x000003b00000000100000000
+trusted.afr.vol-client-1=0x000000000000000000000000
+trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57
+
+###Observations:
+
+####According to changelog extended attributes on file /gfs/brick-a/a:
+The first 8 digits of trusted.afr.vol-client-0 are all
+zeros (0x00000000................), and the first 8 digits of
+trusted.afr.vol-client-1 are not all zeros (0x000003d7................).
+So the changelog on /gfs/brick-a/a implies that some data operations succeeded
+on itself but failed on /gfs/brick-b/a.
+
+The second 8 digits of trusted.afr.vol-client-0 are
+all zeros (0x........00000000........), and the second 8 digits of
+trusted.afr.vol-client-1 are not all zeros (0x........00000001........).
+So the changelog on /gfs/brick-a/a implies that some metadata operations succeeded
+on itself but failed on /gfs/brick-b/a.
+
+####According to Changelog extended attributes on file /gfs/brick-b/a:
+The first 8 digits of trusted.afr.vol-client-0 are not all
+zeros (0x000003b0................), and the first 8 digits of
+trusted.afr.vol-client-1 are all zeros (0x00000000................).
+So the changelog on /gfs/brick-b/a implies that some data operations succeeded
+on itself but failed on /gfs/brick-a/a.
+
+The second 8 digits of trusted.afr.vol-client-0 are not
+all zeros (0x........00000001........), and the second 8 digits of
+trusted.afr.vol-client-1 are all zeros (0x........00000000........).
+So the changelog on /gfs/brick-b/a implies that some metadata operations succeeded
+on itself but failed on /gfs/brick-a/a.
+
+Since both the copies have data, metadata changes that are not on the other
+file, it is in both data and metadata split-brain.
+
+Deciding on the correct copy:
+-----------------------------
+The user may have to inspect stat,getfattr output of the files to decide which
+metadata to retain and contents of the file to decide which data to retain.
+Continuing with the example above, lets say we want to retain the data
+of /gfs/brick-a/a and metadata of /gfs/brick-b/a.
+
+Resetting the relevant changelogs to resolve the split-brain:
+-------------------------------------------------------------
+For resolving data-split-brain:
+We need to change the changelog extended attributes on the files as if some data
+operations succeeded on /gfs/brick-a/a but failed on /gfs/brick-b/a. But
+/gfs/brick-b/a should NOT have any changelog which says some data operations
+succeeded on /gfs/brick-b/a but failed on /gfs/brick-a/a. We need to reset the
+data part of the changelog on trusted.afr.vol-client-0 of /gfs/brick-b/a.
+
+For resolving metadata-split-brain:
+We need to change the changelog extended attributes on the files as if some
+metadata operations succeeded on /gfs/brick-b/a but failed on /gfs/brick-a/a.
+But /gfs/brick-a/a should NOT have any changelog which says some metadata
+operations succeeded on /gfs/brick-a/a but failed on /gfs/brick-b/a.
+We need to reset metadata part of the changelog on
+trusted.afr.vol-client-1 of /gfs/brick-a/a
+
+So, the intended changes are:
+On /gfs/brick-b/a:
+For trusted.afr.vol-client-0
+0x000003b00000000100000000 to 0x000000000000000100000000
+(Note that the metadata part is still not all zeros)
+Hence execute
+`setfattr -n trusted.afr.vol-client-0 -v 0x000000000000000100000000 /gfs/brick-b/a`
+
+On /gfs/brick-a/a:
+For trusted.afr.vol-client-1
+0x0000000000000000ffffffff to 0x000003d70000000000000000
+(Note that the data part is still not all zeros)
+Hence execute
+`setfattr -n trusted.afr.vol-client-1 -v 0x000003d70000000000000000 /gfs/brick-a/a`
+
+Thus after the above operations are done, the changelogs look like this:
+[root@pranithk-laptop vol]# getfattr -d -m . -e hex /gfs/brick-?/a
+getfattr: Removing leading '/' from absolute path names
+\#file: gfs/brick-a/a
+trusted.afr.vol-client-0=0x000000000000000000000000
+trusted.afr.vol-client-1=0x000003d70000000000000000
+trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57
+
+\#file: gfs/brick-b/a
+trusted.afr.vol-client-0=0x000000000000000100000000
+trusted.afr.vol-client-1=0x000000000000000000000000
+trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57
+
+
+Triggering Self-heal:
+---------------------
+Perform `ls -l <file-path-on-gluster-mount>` to trigger healing.
+
+Fixing Directory entry split-brain:
+----------------------------------
+Afr has the ability to conservatively merge different entries in the directories
+when there is a split-brain on directory.
+If on one brick directory 'd' has entries '1', '2' and has entries '3', '4' on
+the other brick then afr will merge all of the entries in the directory to have
+'1', '2', '3', '4' entries in the same directory.
+(Note: this may result in deleted files to re-appear in case the split-brain
+happens because of deletion of files on the directory)
+Split-brain resolution needs human intervention when there is at least one entry
+which has same file name but different gfid in that directory.
+Example:
+On brick-a the directory has entries '1' (with gfid g1), '2' and on brick-b
+directory has entries '1' (with gfid g2) and '3'.
+These kinds of directory split-brains need human intervention to resolve.
+The user needs to remove either file '1' on brick-a or the file '1' on brick-b
+to resolve the split-brain. In addition, the corresponding gfid-link file also
+needs to be removed.The gfid-link files are present in the .glusterfs folder
+in the top-level directory of the brick. If the gfid of the file is
+0x307a5c9efddd4e7c96e94fd4bcdcbd1b (the trusted.gfid extended attribute got
+from the getfattr command earlier),the gfid-link file can be found at
+> /gfs/brick-a/.glusterfs/30/7a/307a5c9efddd4e7c96e94fd4bcdcbd1b
+
+####Word of caution:
+Before deleting the gfid-link, we have to ensure that there are no hard links
+to the file present on that brick. If hard-links exist,they must be deleted as
+well.
