author     M S Vishwanath Bhat <vishwanath@gluster.com>    2012-02-24 13:18:56 +0530
committer  Vijay Bellur <vijay@gluster.com>                2012-03-07 23:18:29 -0800
commit     5fdd65f5f4f5df1d28b0fb4f7efed226d5db1b3c
tree       377a94774c5cd9f55b16ba6fcd1c7b5ec51bfa3b /glusterfs-hadoop/README
parent     e1ab347720f25ed2e7db633a7202f7b873f4b90a
renaming hdfs -> glusterfs-hadoop
Change-Id: Ibb937af1231f6bbed9a2d4eaeabc6e9d4000887f
BUG: 797064
Signed-off-by: M S Vishwanath Bhat <vishwanath@gluster.com>
Reviewed-on: http://review.gluster.com/2811
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vijay@gluster.com>
Diffstat (limited to 'glusterfs-hadoop/README')
 -rw-r--r--  glusterfs-hadoop/README | 182
 1 file changed, 182 insertions(+), 0 deletions(-)
diff --git a/glusterfs-hadoop/README b/glusterfs-hadoop/README
new file mode 100644
index 000000000..3026f11c0
--- /dev/null
+++ b/glusterfs-hadoop/README
@@ -0,0 +1,182 @@

GlusterFS Hadoop Plugin
=======================

INTRODUCTION
------------

This document describes how to use GlusterFS (http://www.gluster.org/) as a backing store with Hadoop.


REQUIREMENTS
------------

* Supported OS is GNU/Linux
* GlusterFS and Hadoop installed on all machines in the cluster
* Java Runtime Environment (JRE)
* Maven (needed if you are building the plugin from source)
* JDK (needed if you are building the plugin from source)

NOTE: The plugin relies on two *nix command-line utilities to function properly. They are:

* mount: used to mount GlusterFS volumes.
* getfattr: used to fetch extended attributes of a file.

Make sure both are installed on all hosts in the cluster and that their locations are in the $PATH
environment variable.


INSTALLATION
------------

** NOTE: The example below is for Hadoop version 0.20.2 ($GLUSTER_HOME/glusterfs-hadoop/0.20.2) **

* Building the plugin from source [Maven (http://maven.apache.org/) and a JDK are required to build the plugin]

  Change to the glusterfs-hadoop directory in the GlusterFS source tree and build the plugin.

  # cd $GLUSTER_HOME/glusterfs-hadoop/0.20.2
  # mvn package

  On a successful build the plugin will be present in the `target` directory.
  (NOTE: the version number is part of the plugin's file name.)

  # ls target/
  classes  glusterfs-0.20.2-0.1.jar  maven-archiver  surefire-reports  test-classes
           ^^^^^^^^^^^^^^^^^^^^^^^^

  Copy the plugin to the lib/ directory in your $HADOOP_HOME dir.

  # cp target/glusterfs-0.20.2-0.1.jar $HADOOP_HOME/lib

  Copy the sample configuration file that ships with this source (conf/core-site.xml) to the conf
  directory in your $HADOOP_HOME dir.

  # cp conf/core-site.xml $HADOOP_HOME/conf

* Installing the plugin from RPM

  See the plugin documentation for installing from RPM.


CLUSTER INSTALLATION
--------------------

  In case it is tedious to do the above step(s) on all hosts in the cluster, use the build-and-deploy.py
  script to build the plugin in one place and deploy it (along with the configuration file) on all other
  hosts.

  The script should be run on the host that is the hadoop master [JobTracker].

* STEPS (you would have done steps 1 and 2 anyway while deploying Hadoop)

  1. Edit the conf/slaves file in your hadoop distribution; one line for each slave.
  2. Set up password-less ssh between the hadoop master and the slave(s).
  3. Edit conf/core-site.xml with all glusterfs-related configuration (see CONFIGURATION).
  4. Run the following:
     # cd $GLUSTER_HOME/glusterfs-hadoop/0.20.2/tools
     # python ./build-and-deploy.py -b -d /path/to/hadoop/home -c

     This will build the plugin and copy it (and the config file) to all slaves (mentioned in
     $HADOOP_HOME/conf/slaves).

  Script options:
    -b : build the plugin
    -d : location of the hadoop directory
    -c : deploy core-site.xml
    -m : deploy mapred-site.xml
    -h : deploy hadoop-env.sh


CONFIGURATION
-------------

  All plugin configuration is done in a single XML file (core-site.xml) with <name><value> tags in each
  <property> block.
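
  As an illustration, a filled-in core-site.xml using the tunables explained below might look like this
  (the hostname, port, volume name, and mount point are placeholders; substitute values for your own
  deployment):

  <configuration>
    <property>
      <name>fs.glusterfs.impl</name>
      <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
    </property>
    <property>
      <name>fs.default.name</name>
      <value>glusterfs://gfs-server:9000</value>
    </property>
    <property>
      <name>fs.glusterfs.volname</name>
      <value>volume-dist-rep</value>
    </property>
    <property>
      <name>fs.glusterfs.mount</name>
      <value>/mnt/glusterfs</value>
    </property>
    <property>
      <name>fs.glusterfs.server</name>
      <value>gfs-server</value>
    </property>
    <property>
      <name>quick.slave.io</name>
      <value>Off</value>
    </property>
  </configuration>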

  A brief explanation of the tunables and the values they accept (change them wherever needed) is given
  below.

  name:  fs.glusterfs.impl
  value: org.apache.hadoop.fs.glusterfs.GlusterFileSystem

         The default FileSystem API to use (there is little reason to modify this).

  name:  fs.default.name
  value: glusterfs://server:port

         The default name that hadoop uses to represent a file as a URI (typically a server:port tuple).
         Use any host in the cluster as the server and any port number. This option has to be in
         server:port format for hadoop to create the file URI, but it is not otherwise used by the plugin.

  name:  fs.glusterfs.volname
  value: volume-dist-rep

         The volume to mount.

  name:  fs.glusterfs.mount
  value: /mnt/glusterfs

         The directory the plugin will use to mount (FUSE mount) the volume.

  name:  fs.glusterfs.server
  value: 192.168.1.36, hackme.zugzug.org

         To mount a volume the plugin needs to know the hostname or the IP of a GlusterFS server in the
         cluster. Mention it here.

  name:  quick.slave.io
  value: [On/Off], [Yes/No], [1/0]

         NOTE: This option is currently untested.

         This is a performance tunable. Hadoop schedules jobs to the hosts that hold the part of the file
         the job needs. The job then does I/O on the file (via FUSE in the case of GlusterFS). When this
         option is set, the plugin will try to do I/O directly from the backing filesystem (ext3, ext4,
         etc.) the file resides on, improving read performance and making the job run faster.


USAGE
-----

  Once configured, start the Hadoop Map/Reduce daemons:

  # cd $HADOOP_HOME
  # ./bin/start-mapred.sh

  If the map/reduce job/task trackers are up, all I/O will be done to GlusterFS.
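
  As a quick smoke test (a sketch: the examples jar name below is the one shipped with Hadoop 0.20.2, and
  the input/output paths are arbitrary), run a stock MapReduce job and check that its output shows up on
  the FUSE mount:

  # cd $HADOOP_HOME
  # ./bin/hadoop fs -mkdir input
  # ./bin/hadoop fs -put conf/core-site.xml input
  # ./bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input output
  # ls /mnt/glusterfs/

  The files written by the job should be visible as regular files under the mount point configured in
  fs.glusterfs.mount.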


FOR HACKERS
-----------

* Source Layout

** version specific: glusterfs-hadoop/<version> **
./src
./src/main
./src/main/java
./src/main/java/org
./src/main/java/org/apache
./src/main/java/org/apache/hadoop
./src/main/java/org/apache/hadoop/fs
./src/main/java/org/apache/hadoop/fs/glusterfs
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFSBrickClass.java
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFSXattr.java            <--- Fetches/parses extended attributes of a file
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFUSEInputStream.java    <--- Input stream (instantiated during open() calls; quick read from the backing FS)
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFSBrickRepl.java
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFUSEOutputStream.java   <--- Output stream (instantiated during creat() calls)
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFileSystem.java         <--- Entry point for the plugin (extends Hadoop's FileSystem class)
./src/test
./src/test/java
./src/test/java/org
./src/test/java/org/apache
./src/test/java/org/apache/hadoop
./src/test/java/org/apache/hadoop/fs
./src/test/java/org/apache/hadoop/fs/glusterfs
./src/test/java/org/apache/hadoop/fs/glusterfs/AppTest.java                  <--- Your test cases go here (if any :-))
./tools/build-deploy-jar.py                                                  <--- Build and deployment script
./conf
./conf/core-site.xml                                                         <--- Sample configuration file
./pom.xml                                                                    <--- Build file (used by maven)

** toplevel: glusterfs-hadoop/ **
./COPYING                                                                    <--- License
./README                                                                     <--- This file
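
* How the plugin locates file data

  GlusterFSXattr.java obtains a file's brick layout by reading an extended attribute off the FUSE mount
  with getfattr (which is why getfattr must be on $PATH). A sketch of that query from the shell; the xattr
  key shown (trusted.glusterfs.pathinfo) and the exact shape of the answer are assumptions and vary across
  GlusterFS releases:

  # getfattr -n trusted.glusterfs.pathinfo /mnt/glusterfs/path/to/file
  # file: mnt/glusterfs/path/to/file
  trusted.glusterfs.pathinfo="(<REPLICATE:vol-replicate-0> server1:/export/brick1/file server2:/export/brick1/file)"

  The plugin parses the server and brick paths out of this value so Hadoop can schedule tasks close to the
  data.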
