author     Humble Devassy Chirammal <hchiramm@redhat.com>  2015-07-23 18:07:08 +0530
committer  Kaleb KEITHLEY <kkeithle@redhat.com>            2015-08-05 04:49:00 -0700
commit     a0919d638a889f03a5bd804cf4c3a63084680fce (patch)
tree       72d519341747cfc82afcc06c313f9b92018bece0 /doc/features/geo-replication
parent     3403370ebeaf16567b79022c6ac48b2e0cd50db5 (diff)
Removing glusterfs features doc from glusterfs repo
According to the new workflow defined for glusterfs documentation, the
features are maintained at https://github.com/gluster/glusterfs-specs.
Removing the features dir from the doc repo as per
https://www.mail-archive.com/gluster-users@gluster.org/msg21168.html

Change-Id: I2f1219c49ef933f6b89a55f2238787d8565d80b6
BUG: 1206539
Signed-off-by: Humble Devassy Chirammal <hchiramm@redhat.com>
Reviewed-on: http://review.gluster.org/11749
Tested-by: NetBSD Build System <jenkins@build.gluster.org>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Prashanth Pai <ppai@redhat.com>
Reviewed-by: Humble Devassy Chirammal <humble.devassy@gmail.com>
Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>
Diffstat (limited to 'doc/features/geo-replication')
-rw-r--r--  doc/features/geo-replication/distributed-geo-rep.md   71
-rw-r--r--  doc/features/geo-replication/libgfchangelog.md        119
2 files changed, 0 insertions, 190 deletions
diff --git a/doc/features/geo-replication/distributed-geo-rep.md b/doc/features/geo-replication/distributed-geo-rep.md
deleted file mode 100644
index 0a3183d6269..00000000000
--- a/doc/features/geo-replication/distributed-geo-rep.md
+++ /dev/null
@@ -1,71 +0,0 @@
-Introduction
-============
-
-This document goes through the new design of distributed geo-replication, its features, and the nature of the changes involved. First, we list some of the important features.
-
- - Distributed asynchronous replication
- - Fast and versatile change detection
- - Replica failover
- - Hardlink synchronization
- - Effective handling of deletes and renames
- - Configurable sync engine (rsync, tar+ssh)
- - Adaptive to a wide variety of workloads
- - GFID synchronization
-
-Geo-replication makes use of the new *journaling* infrastructure (a.k.a. changelog) to achieve the performance and feature improvements mentioned above. To learn more about changelogging and the helper library (*libgfchangelog*), refer to the document doc/features/geo-replication/libgfchangelog.md.
-
-Data Replication
-----------------
-
-Geo-replication is responsible for incrementally replicating data from the master node to the slave. But isn't that similar to what AFR does? Yes, but here the slave is geographically distant from the master. Geo-replication follows the eventually consistent replication model, which implies that at any point in time the slave may lag behind the master, but will eventually catch up. Replication performance depends on two crucial factors:
- - Network latency
- - Change detection
-
-Network latency is not under our direct control for many reasons, but there is always a best effort. Therefore, geo-replication offloads the data replication part to common UNIX file transfer utilities. We chose the granddaddy of file transfers, [rsync(1)] [1], as the default synchronization engine, as it is best known for its delta-transfer algorithm, efficient use of the network, and lightning-fast transfers (not to mention its flexibility). But what about small-file performance? Due to its checksumming algorithm, rsync has more overhead for small files -- the overhead of checksumming outweighs the bytes to be transferred. Therefore, geo-replication can also use tar piped over ssh to transfer a large number of small files. Tests have shown a great improvement over standard rsync. However, the sync engine does not yet adapt dynamically to the file type and needs to be chosen manually via a configuration option.
-
-OTOH, change detection is fully under the application's control. Earlier (before release 3.5), geo-replication performed a file system crawl to identify changes in the file system. This was not an unintelligent *check-every-single-inode* crawl of the file system; the crawl logic was based on *xtime*. xtime is an extended attribute maintained by the *marker* translator for each inode on the master, and it follows an upward-recursive marking pattern. Geo-replication would traverse a directory only when this simple condition held:
-
-> xtime(master) > xtime(slave)
-
-E.g.:
-
->           MASTER                    SLAVE
->
->             /\                        /\
->           d0  dir0                  d0  dir0
->          /  \                      /  \
->        d1    dir1                d1    dir1
->       /                         /
->     d2                        d2
->    /                         /
-> file0                     file0
-
-Consider the directory tree above. Assume that master and slave were in sync and the following operation happens on the master:
-```
-touch /d0/d1/d2/file0
-```
-This would trigger an xtime marking (xtime being the current timestamp) from the leaf (*file0*) up to the root (*/*), i.e. an *xattr* update on *file0*, *d2*, *d1*, *d0* and finally */*. The geo-replication daemon would crawl the file system based on the condition mentioned before, and hence would only crawl the **left** part of the directory tree (as the **right** part would have equal xtimes).
-
-Although the above crawling algorithm is fast, it still has to crawl a good part of the file system. Also, to decide whether to crawl a particular subdirectory, geo-rep needs to compare xtimes -- which is basically a **getxattr()** call on both the master and the slave (remember, the *slave* is over a WAN).
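-
-To make the crawl decision concrete, here is a minimal sketch of the xtime comparison, assuming the xtime xattr holds two 32-bit network-byte-order integers (seconds, nanoseconds). The xattr key below is a placeholder, and in practice the slave's xtime would be fetched through a mount of the slave volume rather than a plain local path:
-
-```
-/* Sketch only: the xattr key is hypothetical and the on-disk layout
- * (two 32-bit big-endian integers: seconds, nanoseconds) is assumed. */
-#include <stdint.h>
-#include <arpa/inet.h>
-#include <sys/xattr.h>
-
-#define XTIME_KEY "trusted.glusterfs.<volume-uuid>.xtime" /* placeholder */
-
-static int get_xtime(const char *path, uint32_t out[2])
-{
-    uint32_t raw[2];
-
-    if (lgetxattr(path, XTIME_KEY, raw, sizeof(raw)) != sizeof(raw))
-        return -1;
-    out[0] = ntohl(raw[0]); /* seconds */
-    out[1] = ntohl(raw[1]); /* nanoseconds */
-    return 0;
-}
-
-/* Descend into a directory only when xtime(master) > xtime(slave). */
-static int needs_crawl(const char *master_path, const char *slave_path)
-{
-    uint32_t m[2], s[2];
-
-    if (get_xtime(master_path, m) || get_xtime(slave_path, s))
-        return 1; /* no xtime available: crawl to be safe */
-    return m[0] > s[0] || (m[0] == s[0] && m[1] > s[1]);
-}
-```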
-
-Therefore, in 3.5 the need arose to take crawling to the next level. Geo-replication now uses the changelogging infrastructure to identify changes in the filesystem; there is no crawl involved at all. Changelog-based detection is notification based: the geo-replication daemon registers itself with the changelog consumer library (*libgfchangelog*) and invokes a set of APIs to get the list of changes in the filesystem, then replays them onto the slave. No crawl and no extended attribute comparison is involved.
-
-Distributed Geo-Replication
----------------------------
-Geo-replication (also known as gsyncd or geo-rep) was non-distributed before release 3.5. The node on which the geo-rep start command was executed was responsible for replicating data to the slave. If this node went offline for some reason (reboot, crash, etc.), replication ceased. So one of the main development efforts for release 3.5 was to *distributify* geo-replication: a geo-rep daemon running on each node (one per brick) is responsible for replicating the data **local** to each brick. This results in full parallelism and effective use of cluster/network resources.
-
-With release 3.5, the geo-rep start command spawns a geo-replication daemon on each node in the master cluster (one per brick). The geo-rep *status* command shows the geo-rep session status from each master node. Similarly, *stop* gracefully tears down the session on all nodes.
-
-What else is synced?
---------------------
 - GFID: Synchronizing the inode number (GFID) between the master and the slave helps in synchronizing hardlinks.
 - Purges: Deletes are handled effectively as there is no entry comparison between master and slave. With changelog replay, geo-rep performs the unlink operation without having to resort to an expensive **readdir()** over the WAN.
 - Renames: With earlier geo-replication, because of the path-based nature of the crawl, a rename was actually a delete and a create on the slave, followed by a data transfer (not to mention the inode number change). Now, with changelogging, it is an actual **rename()** call on the slave.
-
-Replica Failover
-----------------
-One of the basic volume configurations is a replicated volume (synchronous replication). Having geo-replication sync data from all replicas would waste network bandwidth and could possibly corrupt data on the slave (though that's unlikely). Therefore, geo-rep on such volume configurations works in **ACTIVE** and **PASSIVE** modes. The geo-rep daemon on one of the replicas is responsible for replicating data (**ACTIVE**), while the other geo-rep daemon is essentially idle (**PASSIVE**).
-
-In the event of the *ACTIVE* node going offline, the *PASSIVE* node identifies this event (there is a lag of at most 60 seconds for this identification) and switches to *ACTIVE*, thereby taking over the role of replicating data from where the earlier *ACTIVE* node left off. This guarantees uninterrupted data replication even across node reboots/failures.
-
-[1]:http://rsync.samba.org
diff --git a/doc/features/geo-replication/libgfchangelog.md b/doc/features/geo-replication/libgfchangelog.md
deleted file mode 100644
index 1dd0d24253a..00000000000
--- a/doc/features/geo-replication/libgfchangelog.md
+++ /dev/null
@@ -1,119 +0,0 @@
-libgfchangelog: "GlusterFS changelog" consumer library
-======================================================
-
-This document puts forward the need for a GlusterFS changelog consumer library (a.k.a. libgfchangelog) for consuming changelogs produced by the changelog translator. Further, it describes the proposed design and the API exposed by it. A brief explanation of the changelog translator can also be found as a commit message in the upstream source tree; the review link can be [accessed here] [1].
-
-The initial consumer of changelogs would be Geo-Replication (release 3.5). Possible future consumers include backup utilities, GlusterFS self-heal, bit-rot detection, and AV scanners. All these utilities have one thing in common: they need a list of changed entities (created/modified/deleted) in the file system. Therefore, the need arises to provide such functionality in the form of a shared library that applications can link against and query for changes (see the API section). There is no plan as of now to provide language bindings, but for shell-script friendliness a 'gfind' command-line utility (dynamically linked with libgfchangelog) would be helpful. As of now, development of this utility has not commenced.
-
-The next section gives a brief introduction to how changelogs are organized and managed. Then we propose a couple of designs for libgfchangelog. The API set is not covered in this document (maybe later).
-
-Changelogs
-==========
-
-Changelogs can be thought of as a running history for an entity in the file system from the time the entity came into existence. The goal is to capture all possible transitions the entity underwent until the time it got purged. The transition namespace is broken up into three categories, with each category represented by a specific changelog format. Changes are recorded in a flat file in the filesystem and are rolled over after a specific time interval. All three categories are recorded in a single changelog file (sequentially), with a type for each entry. Having a single file reduces disk seeks and fragmentation, and means fewer files to deal with. The strategy for pruning old logs is still undecided.
-
-
-Changelog Transition Namespace
-------------------------------
-
-As mentioned before, the transition namespace is categorized into three types:
- - TYPE-I : Data operation
- - TYPE-II : Metadata operation
- - TYPE-III : Entry operation
-
-One could visualize the transition of a file system entity as a state machine transitioning from one type to another. For TYPE-I and TYPE-II operations there is no state transition as such, but a TYPE-III operation involves a state change from the file system's perspective. We can now classify file operations (fops) into one of the three types (a small classification sketch follows this list):
- - Data operation: write(), writev(), truncate(), ftruncate()
- - Metadata operation: setattr(), fsetattr(), setxattr(), fsetxattr(), removexattr(), fremovexattr()
- - Entry operation: create(), mkdir(), mknod(), symlink(), link(), rename(), unlink(), rmdir()
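-
-As a rough sketch (not the changelog translator's actual implementation), the classification could look like the following; the single-letter tags match the record prefixes (D/M/E) that appear in the changelog snippets later in this document:
-
-```
-#include <string.h>
-
-/* Illustrative only: map a fop name to its changelog record type. */
-typedef enum { CL_DATA = 'D', CL_METADATA = 'M', CL_ENTRY = 'E' } cl_type_t;
-
-static cl_type_t classify_fop(const char *fop)
-{
-    static const char *entry_fops[] = { "create", "mkdir", "mknod", "symlink",
-                                        "link", "rename", "unlink", "rmdir", NULL };
-    static const char *meta_fops[]  = { "setattr", "fsetattr", "setxattr", "fsetxattr",
-                                        "removexattr", "fremovexattr", NULL };
-
-    for (const char **f = entry_fops; *f; f++)
-        if (strcmp(fop, *f) == 0)
-            return CL_ENTRY;
-    for (const char **f = meta_fops; *f; f++)
-        if (strcmp(fop, *f) == 0)
-            return CL_METADATA;
-    return CL_DATA; /* write(), writev(), truncate(), ftruncate() */
-}
-```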
-
-Changelog Entry Format
-----------------------
-
-In order to record the type of operation the entity underwent, a type identifier is used. Normally, the entity on which the operation is performed would be identified by its pathname, which is the most common way of addressing in a file system, but we chose to use the GlusterFS internal file identifier (GFID) instead (GlusterFS supports a GFID-based backend, the pathname field may not always be valid, and for other reasons which are out of scope of this document). Therefore, the record format for the three types of operation can be summarized as follows:
-
- - TYPE-I : GFID of the file
- - TYPE-II : GFID of the file
- - TYPE-III : GFID + FOP + MODE + UID + GID + PARGFID/BNAME [PARGFID/BNAME]
-
-GFIDs are analogous to inodes. TYPE-I and TYPE-II fops record the GFID of the entity on which the operation was performed, thereby recording that there was a data/metadata change on the inode. TYPE-III fops record, at minimum, a set of six or seven fields (depending on the type of operation), which is sufficient to identify what kind of operation the entity underwent. Normally this record includes the GFID of the entity, the type of file operation (an integer -- an enumerated value used in GlusterFS), the parent GFID, and the basename (analogous to parent inode and basename).
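-
-As a rough mental model (not an on-disk structure), a TYPE-III record could be pictured as follows; the field names are hypothetical and simply mirror the list above:
-
-```
-#include <stdint.h>
-
-/* Illustrative only: an in-memory view of a TYPE-III (entry) record. */
-struct entry_record {
-    char     gfid[37];        /* canonical GFID of the entity */
-    int      fop;             /* enumerated file-operation number */
-    uint32_t mode, uid, gid;  /* attributes recorded at entry creation */
-    char    *pargfid_bname;   /* PARGFID/BNAME */
-    char    *pargfid_bname2;  /* second PARGFID/BNAME (renames only) */
-};
-```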
-
-Changelogs can be in either ASCII or binary format, the difference being the format of the records that are persisted. In a binary changelog the GFIDs are recorded in their native format, i.e. as 16-byte records, and the fop number as a 4-byte integer. In an ASCII changelog, the GFIDs are stored in their canonical form and the fop number is stringified and persisted. The null character is used as the record separator. This makes it hard to read changelogs from the command line, but the packed format is needed to support file names with spaces and special characters. Below is a snippet of a changelog alongside its hexdump.
-
-```
-00000000 47 6c 75 73 74 65 72 46 53 20 43 68 61 6e 67 65 |GlusterFS Change|
-00000010 6c 6f 67 20 7c 20 76 65 72 73 69 6f 6e 3a 20 76 |log | version: v|
-00000020 31 2e 31 20 7c 20 65 6e 63 6f 64 69 6e 67 20 3a |1.1 | encoding :|
-00000030 20 32 0a 45 61 36 39 33 63 30 34 65 2d 61 66 39 | 2.Ea693c04e-af9|
-00000040 65 2d 34 62 61 35 2d 39 63 61 37 2d 31 63 34 61 |e-4ba5-9ca7-1c4a|
-00000050 34 37 30 31 30 64 36 32 00 32 33 00 33 33 32 36 |47010d62.23.3326|
-00000060 31 00 30 00 30 00 66 36 35 34 32 33 32 65 2d 61 |1.0.0.f654232e-a|
-00000070 34 32 62 2d 34 31 62 33 2d 62 35 61 61 2d 38 30 |42b-41b3-b5aa-80|
-00000080 33 62 33 64 61 34 35 39 33 37 2f 6c 69 62 76 69 |3b3da45937/libvi|
-00000090 72 74 5f 64 72 69 76 65 72 5f 6e 65 74 77 6f 72 |rt_driver_networ|
-000000a0 6b 2e 73 6f 00 44 61 36 39 33 63 30 34 65 2d 61 |k.so.Da693c04e-a|
-000000b0 66 39 65 2d 34 62 61 35 2d 39 63 61 37 2d 31 63 |f9e-4ba5-9ca7-1c|
-000000c0 34 61 34 37 30 31 30 64 36 32 00 45 36 65 39 37 |4a47010d62.E6e97|
-```
-
-As you can see, there is an *entry* operation (a journal record starting with an "E"). The fields recorded for this operation are:
 - GFID : a693c04e-af9e-4ba5-9ca7-1c4a47010d62
 - FOP : 23 (create)
 - Mode : 33261
 - UID : 0
 - GID : 0
 - PARGFID/BNAME: f654232e-a42b-41b3-b5aa-803b3da45937/libvirt_driver_network.so
-
-**NOTE**: In case of a rename operation, there would be an additional record (for the target PARGFID/BNAME).
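-
-Since the records are packed with null separators, a consumer walks the buffer field by field rather than line by line. Here is a minimal sketch of that walk; grouping the fields back into records requires knowing the per-type field count (e.g. six or seven fields for a TYPE-III record), which the sketch leaves out:
-
-```
-#include <stdio.h>
-#include <string.h>
-
-/* Print the null-separated fields of an ASCII changelog buffer
- * (the caller is assumed to have skipped the header line). */
-static void walk_fields(const char *buf, size_t len)
-{
-    const char *p = buf, *end = buf + len;
-
-    while (p < end) {
-        size_t flen = strnlen(p, (size_t)(end - p));
-
-        printf("field: %.*s\n", (int)flen, p);
-        p += flen + 1; /* step past the null separator */
-    }
-}
-```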
-
-libgfchangelog
---------------
-
-NOTE: Changelogs generated by the changelog translator are rolled over [with the timestamp as the suffix] after a specific interval, after which a new changelog is started. The current changelog [the changelog file without the timestamp suffix] should never be processed unless it has been rolled over. The rolled-over logs should be treated as read-only.
-
-Capturing the changes performed on a file system is useful for applications that otherwise rely on a file system scan (crawl) to figure out such information. Backup utilities, automatic file healing in a replicated environment, bit-rot detection and the like are some of the end-user applications that require a set of changed entities in a file system to act on. The goal of libgfchangelog is to provide the application (consumer) with a fast and easy-to-use common query interface (API). The consumer need not worry about the changelog format, the nomenclature of the changelog files, etc.
-
-Next, we list the functionality and some of the features.
-
-Functionality
--------------
-
-Changelog processing: Reading changelog file(s) and converting the entries into a human-readable (or application-understandable) format (in the case of the binary log format).
-Book-keeping: Keeping track of how much of the changelog the application has consumed (i.e. the changes during the time slice start-time -> end-time).
-Serving API requests: Updating the consumer by providing the set of changes.
-
-Processing could be done in two ways:
-
-* Pre-processing (pre-processing from the library's POV):
-Once a changelog file is rolled over (by the changelog translator), a set of post-processing operations is performed. These operations could include converting a binary log file to an understandable format, collating a bunch of logs into a larger sampling period, or just keeping a private copy of the changelog (in ASCII format). Extra disk space is consumed to store this private copy. The library would then be free to consume these logs and serve application requests.
-
-* On-demand:
-The processing of the changelogs is triggered when an application requests changes. The downside is the additional time spent decoding the logs and accumulating data at request time (but no additional disk space is used over the time period).
-
-After processing, the changelog is ready to be consumed by the application. The function of processing is to convert the logs into a human/application-readable format (an example is shown below):
-
-```
-E a7264fe2-dd6b-43e1-8786-a03b42cc2489 CREATE 33188 0 0 00000000-0000-0000-0000-000000000001%2Fservices1
-M a7264fe2-dd6b-43e1-8786-a03b42cc2489 NULL
-M 00000000-0000-0000-0000-000000000001 NULL
-D a7264fe2-dd6b-43e1-8786-a03b42cc2489
-```
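-
-As a rough illustration of how a consumer might pick such processed records apart, here is a minimal parsing sketch. The field layout (type tag, GFID, then per-type fields) is inferred from the sample above; all names are hypothetical:
-
-```
-#include <stdio.h>
-#include <string.h>
-
-/* Parse one line of the processed changelog format shown above:
- * E <gfid> <FOP> <mode> <uid> <gid> <pargfid/bname>, M <gfid> ..., D <gfid>. */
-static void parse_line(char *line)
-{
-    char *save = NULL;
-    char *type = strtok_r(line, " ", &save);
-    char *gfid = strtok_r(NULL, " ", &save);
-
-    if (!type || !gfid)
-        return;
-    switch (type[0]) {
-    case 'E': { /* entry op: fop, mode, uid, gid, pargfid/bname follow */
-        char *fop  = strtok_r(NULL, " ", &save);
-        char *rest = strtok_r(NULL, "\n", &save);
-        if (fop)
-            printf("entry op %s on %s (%s)\n", fop, gfid, rest ? rest : "");
-        break;
-    }
-    case 'M': /* metadata change on the inode */
-        printf("metadata change on %s\n", gfid);
-        break;
-    case 'D': /* data change on the inode */
-        printf("data change on %s\n", gfid);
-        break;
-    }
-}
-```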
-
-Features
---------
-
-The following points mention some of the features that the library could provide.
-
 - Consumers could choose the update type when they register with the library. The 'types' could be:
   - Streaming: The consumer is updated via a stream of changes, i.e. the library would just replay the logs.
   - Consolidated: The consumer is provided with a consolidated view of the changelog, e.g. if <gfid> had a DATA and a METADATA operation, it would be presented as a single update. Similarly for ENTRY operations.
   - Raw: This mode provides the consumer with the pathnames of the changelog files themselves (after processing). The changelogs should be strictly treated as read-only. This gives the consumer the flexibility to extract updates in their own preferred way (e.g. using command-line tools like sed, awk, sort | uniq, etc.).
 - Applications may choose to adopt a synchronous (blocking) or an asynchronous (callback) notification mechanism.
 - Provide a unified view of changelogs from multiple peers (replication scenario) or a global changelog view of the entire cluster.
-
-
-**The first cut of the library supports the following** (a minimal consumption sketch follows the list):
 - Raw access mode
 - Synchronous programming model
 - Per-brick changelog consumption, i.e. no unified/globally aggregated changelog
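-
-To make the first cut concrete, here is a minimal sketch of a raw-mode, synchronous consumer. The function names follow the consumer API in the glusterfs source tree; the exact signatures below are best-effort assumptions (declared inline for the sketch), not a reference:
-
-```
-#include <stdio.h>
-#include <unistd.h>
-#include <sys/types.h>
-
-/* Assumed prototypes of the libgfchangelog consumer API (see the
- * glusterfs source for the authoritative declarations). */
-int     gf_changelog_register (char *brick, char *scratch_dir,
-                               char *log_file, int log_level, int max_reconnects);
-ssize_t gf_changelog_scan (void);
-ssize_t gf_changelog_next_change (char *buf, size_t maxlen);
-int     gf_changelog_done (char *file);
-
-int main (void)
-{
-    char path[4096];
-
-    /* Register interest in one brick's changelogs (per-brick consumption). */
-    if (gf_changelog_register ("/bricks/b1", "/tmp/scratch",
-                               "/tmp/consumer.log", 7, 5) < 0)
-        return 1;
-
-    for (;;) {
-        /* Synchronous model: poll for rolled-over changelogs. */
-        if (gf_changelog_scan () <= 0) {
-            sleep (5);
-            continue;
-        }
-
-        /* Raw access mode: the library hands back pathnames of processed
-         * changelog files; treat them strictly as read-only. */
-        while (gf_changelog_next_change (path, sizeof (path)) > 0) {
-            printf ("consume: %s\n", path); /* read records, act on them */
-            gf_changelog_done (path);       /* book-keeping: mark as consumed */
-        }
-    }
-    return 0;
-}
-```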
-
-[1]:http://review.gluster.org/5127