author | Vikas Gorur <vikas@zresearch.com> | 2009-02-18 17:36:07 +0530 |
---|---|---|
committer | Vikas Gorur <vikas@zresearch.com> | 2009-02-18 17:36:07 +0530 |
commit | 77adf4cd648dce41f89469dd185deec6b6b53a0b (patch) | |
tree | 02e155a5753b398ee572b45793f889b538efab6b /doc/replicate.lyx | |
parent | f3b2e6580e5663292ee113c741343c8a43ee133f (diff) |
Added all files
Diffstat (limited to 'doc/replicate.lyx')
-rw-r--r-- | doc/replicate.lyx | 797 |
1 files changed, 797 insertions, 0 deletions
diff --git a/doc/replicate.lyx b/doc/replicate.lyx new file mode 100644 index 00000000000..2bbcb652aaa --- /dev/null +++ b/doc/replicate.lyx @@ -0,0 +1,797 @@ +#LyX 1.4.2 created this file. For more info see http://www.lyx.org/ +\lyxformat 245 +\begin_document +\begin_header +\textclass article +\language english +\inputencoding auto +\fontscheme default +\graphics default +\paperfontsize default +\spacing single +\papersize default +\use_geometry false +\use_amsmath 1 +\cite_engine basic +\use_bibtopic false +\paperorientation portrait +\secnumdepth 3 +\tocdepth 3 +\paragraph_separation skip +\defskip medskip +\quotes_language english +\papercolumns 1 +\papersides 1 +\paperpagestyle default +\tracking_changes false +\output_changes false +\end_header + +\begin_body + +\begin_layout Title + +\size larger +Automatic File Replication (replicate) in GlusterFS +\end_layout + +\begin_layout Author +Vikas Gorur +\family typewriter +\size larger +<vikas@zresearch.com> +\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Standard + + +\backslash +hrule +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Section* +Overview +\end_layout + +\begin_layout Standard +This document describes the design and usage of the replicate translator in GlusterFS. + This document is valid for the 1.4.x releases, and not earlier ones. +\end_layout + +\begin_layout Standard +The replicate translator of GlusterFS aims to keep identical copies of a file + on all its subvolumes, as far as possible. + It tries to do this by performing all filesystem mutation operations (writing + data, creating files, changing ownership, etc.) on all its subvolumes in + such a way that if an operation succeeds on at least one subvolume, all + other subvolumes can later be brought up to date.
+\end_layout + +\begin_layout Standard +In the rest of the document the terms +\begin_inset Quotes eld +\end_inset + +subvolume +\begin_inset Quotes erd +\end_inset + + and +\begin_inset Quotes eld +\end_inset + +server +\begin_inset Quotes erd +\end_inset + + are used interchangeably, trusting that it will cause no confusion to the + reader. +\end_layout + +\begin_layout Section* +Usage +\end_layout + +\begin_layout Standard +A sample volume declaration for replicate looks like this: +\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Standard + + +\backslash +begin{verbatim} +\end_layout + +\begin_layout Standard + +volume replicate +\end_layout + +\begin_layout Standard + + type cluster/replicate +\end_layout + +\begin_layout Standard + + # options, see below for description +\end_layout + +\begin_layout Standard + + subvolumes brick1 brick2 +\end_layout + +\begin_layout Standard + +end-volume +\end_layout + +\begin_layout Standard + + +\backslash +end{verbatim} +\end_layout + +\begin_layout Standard + +\end_layout + +\begin_layout Standard + +\end_layout + +\begin_layout Standard + +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Standard +This defines a replicate volume with two subvolumes, brick1 and brick2. + For replicate to work properly, it is essential that its subvolumes support +\series bold +extended attributes +\series default +. + This means that you should choose a backend filesystem that supports extended + attributes, like XFS, ReiserFS, or Ext3. +\end_layout + +\begin_layout Standard +The storage volumes used as backend for replicate +\emph on +must +\emph default + have a posix-locks volume loaded above them.
+\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Standard + + +\backslash +begin{verbatim} +\end_layout + +\begin_layout Standard + +volume brick1 +\end_layout + +\begin_layout Standard + + type features/posix-locks +\end_layout + +\begin_layout Standard + + subvolumes brick1-ds +\end_layout + +\begin_layout Standard + +end-volume +\end_layout + +\begin_layout Standard + + +\backslash +end{verbatim} +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Section* +Design +\end_layout + +\begin_layout Subsection* +Read algorithm +\end_layout + +\begin_layout Standard +All operations that do not modify the file or directory are sent to all + the subvolumes and the first successful reply is returned to the application. +\end_layout + +\begin_layout Standard +The read() system call (reading data from a file) is an exception. + For read() calls, replicate tries to do load balancing by sending all reads from + a particular file to a particular server. +\end_layout + +\begin_layout Standard +The read algorithm is also affected by the option read-subvolume; see below + for details. +\end_layout + +\begin_layout Subsection* +Classes of file operations +\end_layout + +\begin_layout Standard +replicate divides all filesystem write operations into three classes: +\end_layout + +\begin_layout Itemize + +\series bold +data: +\series default +Operations that modify the contents of a file (write, truncate). +\end_layout + +\begin_layout Itemize + +\series bold +metadata: +\series default +Operations that modify attributes of a file or directory (permissions, ownership +, etc.). +\end_layout + +\begin_layout Itemize + +\series bold +entry: +\series default +Operations that create or delete directory entries (mkdir, create, rename, + rmdir, unlink, etc.). 
+\end_layout + +\begin_layout Subsection* +Locking and Change Log +\end_layout + +\begin_layout Standard +To ensure consistency across subvolumes, replicate holds a lock whenever a modificatio +n is being made to a file or directory. + By default, replicate considers the first subvolume as the sole lock server. + However, the number of lock servers can be increased up to the total number + of subvolumes. +\end_layout + +\begin_layout Standard +The change log is a set of extended attributes associated with files and + directories that replicate maintains. + The change log keeps track of the changes made to files and directories + (data, metadata, entry) so that the self-heal algorithm knows which copy + of a file or directory is the most recent one. +\end_layout + +\begin_layout Subsection* +Write algorithm +\end_layout + +\begin_layout Standard +The algorithm for all write operations (data, metadata, entry) is: +\end_layout + +\begin_layout Enumerate +Lock the file (or directory) on all of the lock servers (see options below). +\end_layout + +\begin_layout Enumerate +Write change log entries on all servers. +\end_layout + +\begin_layout Enumerate +Perform the operation. +\end_layout + +\begin_layout Enumerate +Erase change log entries. +\end_layout + +\begin_layout Enumerate +Unlock the file (or directory) on all of the lock servers. +\end_layout + +\begin_layout Standard +The above algorithm is a simplified version intended for general users. + Please refer to the source code for the full details. +\end_layout + +\begin_layout Subsection* +Self-Heal +\end_layout + +\begin_layout Standard +replicate automatically tries to fix any inconsistencies it detects among different + copies of a file. + It uses information in the change log to determine which copy is the +\begin_inset Quotes eld +\end_inset + +correct +\begin_inset Quotes erd +\end_inset + + version.
+\end_layout + +\begin_layout Standard +Self-heal is triggered when a file or directory is first +\begin_inset Quotes eld +\end_inset + +accessed +\begin_inset Quotes erd +\end_inset + +, that is, the first time any operation is attempted on it. + The self-heal algorithm does the following things: +\end_layout + +\begin_layout Standard +If the entry being accessed is a directory: +\end_layout + +\begin_layout Itemize +The contents of the +\begin_inset Quotes eld +\end_inset + +correct +\begin_inset Quotes erd +\end_inset + + version are replicated on all subvolumes, by deleting entries and creating + entries as necessary. +\end_layout + +\begin_layout Standard +If the entry being accessed is a file: +\end_layout + +\begin_layout Itemize +If the file does not exist on some subvolumes, it is created. +\end_layout + +\begin_layout Itemize +If there is a mismatch in the size of the file, or ownership, or permission, + it is fixed. +\end_layout + +\begin_layout Itemize +If the change log indicates that some copies need updating, they are updated. +\end_layout + +\begin_layout Subsection* +Split-brain +\end_layout + +\begin_layout Standard +It may happen that one replicate client can access only some of the servers in + a cluster and another replicate client can access the remaining servers. + Or it may happen that in a cluster of two servers, one server goes down + and comes back up, but the other goes down immediately. + Both these scenarios result in a +\begin_inset Quotes eld +\end_inset + +split-brain +\begin_inset Quotes erd +\end_inset + +. +\end_layout + +\begin_layout Standard +In a split-brain situation, there will be two or more copies of a file, + all of which are +\begin_inset Quotes eld +\end_inset + +correct +\begin_inset Quotes erd +\end_inset + + in some sense. + replicate without manual intervention has no way of knowing what to do, since + it cannot consider any single copy as definitive, nor does it know of any + meaningful way to merge the copies.
+\end_layout + +\begin_layout Standard +If replicate detects that a split-brain has happened on a file, it disallows opening + of that file. + You will have to manually resolve the conflict by deleting all but one + copy of the file. + Alternatively you can set an automatic split-brain resolution policy by + using the `favorite-child' option (see below). +\end_layout + +\begin_layout Section* +Translator Options +\end_layout + +\begin_layout Standard +replicate accepts the following options: +\end_layout + +\begin_layout Subsection* +read-subvolume (default: none) +\end_layout + +\begin_layout Standard +The value of this option must be the name of a subvolume. + If given, all read operations are sent to only the specified subvolume, + instead of being balanced across all subvolumes. +\end_layout + +\begin_layout Subsection* +favorite-child (default: none) +\end_layout + +\begin_layout Standard +The value of this option must be the name of a subvolume. + If given, the specified subvolume will be preferentially used in resolving + conflicts ( +\begin_inset Quotes eld +\end_inset + +split-brain +\begin_inset Quotes erd +\end_inset + +). + This means if a discrepancy is noticed in the attributes or content of + a file, the copy on the `favorite-child' will be considered the definitive + version and its contents will +\emph on +overwrite +\emph default +the contents of all other copies. + Use this option with caution! It is possible to +\emph on +lose data +\emph default + with this option. + If you are in doubt, do not specify this option. +\end_layout + +\begin_layout Subsection* +Self-heal options +\end_layout + +\begin_layout Standard +Setting any of these options to +\begin_inset Quotes eld +\end_inset + +off +\begin_inset Quotes erd +\end_inset + + prevents that kind of self-heal from being done on a file or directory. + For example, if metadata self-heal is turned off, permissions and ownership + are no longer fixed automatically. 
+\end_layout + +\begin_layout Subsubsection* +data-self-heal (default: on) +\end_layout + +\begin_layout Standard +Enable/disable self-healing of file contents. +\end_layout + +\begin_layout Subsubsection* +metadata-self-heal (default: off) +\end_layout + +\begin_layout Standard +Enable/disable self-healing of metadata (permissions, ownership, modification + times). +\end_layout + +\begin_layout Subsubsection* +entry-self-heal (default: on) +\end_layout + +\begin_layout Standard +Enable/disable self-healing of directory entries. +\end_layout + +\begin_layout Subsection* +Change Log options +\end_layout + +\begin_layout Standard +If any of these options is turned off, it disables writing of change log + entries for that class of file operations. + That is, steps 2 and 4 of the write algorithm (see above) are not done. + Note that if the change log is not written, the self-heal algorithm cannot + determine the +\begin_inset Quotes eld +\end_inset + +correct +\begin_inset Quotes erd +\end_inset + + version of a file and hence self-heal will only be able to fix +\begin_inset Quotes eld +\end_inset + +obviously +\begin_inset Quotes erd +\end_inset + + wrong things (such as a file existing on only one node). +\end_layout + +\begin_layout Subsubsection* +data-change-log (default: on) +\end_layout + +\begin_layout Standard +Enable/disable writing of change log for data operations. +\end_layout + +\begin_layout Subsubsection* +metadata-change-log (default: on) +\end_layout + +\begin_layout Standard +Enable/disable writing of change log for metadata operations. +\end_layout + +\begin_layout Subsubsection* +entry-change-log (default: on) +\end_layout + +\begin_layout Standard +Enable/disable writing of change log for entry operations. +\end_layout + +\begin_layout Subsection* +Locking options +\end_layout + +\begin_layout Standard +These options let you specify the number of lock servers to use for each + class of file operations. 
+ The default values are satisfactory in most cases. + If you are extra paranoid, you may want to increase the values. + However, be very cautious if you set the data- or entry- lock server counts + to zero, since this can result in +\emph on +lost data. + +\emph default + For example, if you set the data-lock-server-count to zero, and two application +s write to the same region of a file, there is a possibility that none of + your servers will have all the data. + In other words, the copies will be +\emph on +inconsistent +\emph default +, and +\emph on +incomplete +\emph default +. + Do not set data- and entry- lock server counts to zero unless you absolutely + know what you are doing and agree to not hold GlusterFS responsible for + any lost data. +\end_layout + +\begin_layout Subsubsection* +data-lock-server-count (default: 1) +\end_layout + +\begin_layout Standard +Number of lock servers to use for data operations. +\end_layout + +\begin_layout Subsubsection* +metadata-lock-server-count (default: 0) +\end_layout + +\begin_layout Standard +Number of lock servers to use for metadata operations. +\end_layout + +\begin_layout Subsubsection* +entry-lock-server-count (default: 1) +\end_layout + +\begin_layout Standard +Number of lock servers to use for entry operations. +\end_layout + +\begin_layout Section* +Known Issues +\end_layout + +\begin_layout Subsection* +Self-heal of file with more than one link (hard links): +\end_layout + +\begin_layout Standard +Consider two servers, A and B. + Assume A is down, and the user creates a file `new' as a hard link to a + file `old'. + When A comes back up, replicate will see that the file `new' does not exist on + A, and self-heal will create the file and copy the contents from B. + However, now on server A the file `new' is not a link to the file `old' + but an entirely different file. +\end_layout + +\begin_layout Standard +We know of no easy way to fix this problem, but we will try to fix it in + forthcoming releases. 
+\end_layout + +\begin_layout Subsection* +File re-opening after a server comes back up: +\end_layout + +\begin_layout Standard +If a server A goes down and comes back up, any files which were opened while + A was down and are still open will not have their writes replicated on + A. + In other words, data replication only happens on those servers which were + alive when the file was opened. +\end_layout + +\begin_layout Standard +This is a rather tricky issue but we hope to fix it very soon. +\end_layout + +\begin_layout Section* +Frequently Asked Questions +\end_layout + +\begin_layout Subsection* +1. + How can I force self-heal to happen? +\end_layout + +\begin_layout Standard +You can force self-heal to happen on your cluster by running a script or + a command that accesses every file. + A simple way to do it would be: +\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Standard + +\end_layout + +\begin_layout Standard + + +\backslash +begin{verbatim} +\end_layout + +\begin_layout Standard + +$ ls -lR +\end_layout + +\begin_layout Standard + + +\backslash +end{verbatim} +\end_layout + +\begin_layout Standard + +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Standard +Run the command in all directories which you want to forcibly self-heal. +\end_layout + +\begin_layout Subsection* +2. + Which backend filesystem should I use for replicate? +\end_layout + +\begin_layout Standard +You can use any backend filesystem that supports extended attributes. + We know of users successfully using XFS, ReiserFS, and Ext3. +\end_layout + +\begin_layout Subsection* +3. + What can I do to improve replicate performance? +\end_layout + +\begin_layout Standard +Try loading performance translators such as io-threads, write-behind, io-cache, + and read-ahead depending on your workload. + If you are willing to sacrifice correctness in corner cases, you can experiment + with the lock-server-count and the change-log options (see above).
+ As warned earlier, be very careful! +\end_layout + +\begin_layout Subsection* +4. + How can I selectively replicate files? +\end_layout + +\begin_layout Standard +There is no support for selective replication in replicate itself. + You can achieve selective replication by loading the unify translator over + replicate, and using the switch scheduler. + Configure unify with two subvolumes, one of them being replicate. + Using the switch scheduler, schedule all files for which you need replication + to the replicate subvolume. + Consult unify and switch documentation for more details. +\end_layout + +\begin_layout Section* +Contact +\end_layout + +\begin_layout Standard +If you need more assistance on replicate, contact us on the mailing list <gluster-user +s@gluster.org> (visit gluster.org for details on how to subscribe). +\end_layout + +\begin_layout Standard +Send your comments and suggestions about this document to <vikas@zresearch.com>. +\end_layout + +\end_body +\end_document
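The five-step write algorithm and the change-log-driven self-heal described in the Design section of the added document can be sketched as a toy in-memory model. This is only an illustration under stated assumptions, not the translator's code: `Subvolume`, `replicated_write` and `self_heal` are hypothetical names, and modelling a change log entry as "the set of subvolume names that may still be missing this write" is an assumption made here, since the document does not describe the extended-attribute layout.

```python
from dataclasses import dataclass, field


@dataclass
class Subvolume:
    """Hypothetical stand-in for one replicate subvolume (server)."""
    name: str
    up: bool = True
    files: dict = field(default_factory=dict)      # path -> contents
    changelog: dict = field(default_factory=dict)  # path -> names possibly stale (assumed model)
    locks: set = field(default_factory=set)        # paths currently locked here


def replicated_write(subvols, lock_servers, path, contents):
    """Toy version of the five-step write algorithm."""
    live = [s for s in subvols if s.up]
    if not live:
        raise OSError("no subvolume is reachable")
    held = [s for s in lock_servers if s.up]
    for s in held:                          # 1. lock on the lock servers
        s.locks.add(path)
    try:
        everyone = {s.name for s in subvols}
        for s in live:                      # 2. write change log entries
            s.changelog[path] = set(everyone)
        for s in live:                      # 3. perform the operation
            s.files[path] = contents
        done = {s.name for s in live}
        for s in live:                      # 4. erase entries for servers that succeeded
            s.changelog[path] -= done
            if not s.changelog[path]:
                del s.changelog[path]
    finally:
        for s in held:                      # 5. unlock
            s.locks.discard(path)


def self_heal(subvols, path):
    """Copy the 'correct' version onto every reachable subvolume."""
    live = [s for s in subvols if s.up]
    holders = [s for s in live if path in s.files]
    if not holders:
        return
    blamed = set()
    for s in live:
        blamed |= s.changelog.get(path, set())
    sources = [s for s in holders if s.name not in blamed]
    if not sources:
        # Every copy is blamed by some other copy: the split-brain case.
        raise RuntimeError("split-brain on %r: manual resolution needed" % path)
    source = sources[0]
    healed = {s.name for s in live}
    for s in live:
        s.files[path] = source.files[path]
        remaining = s.changelog.pop(path, set()) - healed
        if remaining:                       # keep blame for servers still down
            s.changelog[path] = remaining
```

In this model a write made while one server is down leaves a change log entry on the surviving server naming the absent one, which is what lets `self_heal` pick the surviving copy as the source later. If each of two servers takes a write while the other is down, both copies end up blamed and `self_heal` refuses to choose, mirroring the second split-brain scenario in the text.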