Diffstat (limited to 'glusterfs-hadoop/README')
-rw-r--r--  glusterfs-hadoop/README  182
1 files changed, 182 insertions, 0 deletions
diff --git a/glusterfs-hadoop/README b/glusterfs-hadoop/README
new file mode 100644
index 000000000..3026f11c0
--- /dev/null
+++ b/glusterfs-hadoop/README
@@ -0,0 +1,182 @@
+GlusterFS Hadoop Plugin
+=======================
+
+INTRODUCTION
+------------
+
+This document describes how to use GlusterFS (http://www.gluster.org/) as a backing store with Hadoop.
+
+
+REQUIREMENTS
+------------
+
+* Supported OS is GNU/Linux
+* GlusterFS and Hadoop installed on all machines in the cluster
+* Java Runtime Environment (JRE)
+* Maven (needed if you are building the plugin from source)
+* JDK (needed if you are building the plugin from source)
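+
+A quick way to check the build prerequisites on a host (a minimal sketch; each command supports a version flag):
+
+  # java -version
+  # mvn -version
+  # glusterfs --version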
+
+NOTE: The plugin relies on two *nix command-line utilities to function properly. They are:
+
+* mount: used to mount GlusterFS volumes.
+* getfattr: used to fetch extended attributes of a file.
+
+Make sure both are installed on all hosts in the cluster and that their locations are in the $PATH
+environment variable.
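+
+A quick way to confirm both utilities are present and resolvable via $PATH:
+
+  # type mount getfattr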
+
+
+INSTALLATION
+------------
+
+** NOTE: The examples below are for Hadoop version 0.20.2 ($GLUSTER_HOME/hdfs/0.20.2) **
+
+* Building the plugin from source [Maven (http://maven.apache.org/) and a JDK are required to build the plugin]
+
+  Change to the glusterfs-hadoop directory in the GlusterFS source tree and build the plugin:
+
+ # cd $GLUSTER_HOME/hdfs/0.20.2
+ # mvn package
+
+  On a successful build, the plugin jar will be present in the `target` directory.
+  (NOTE: the version number is part of the plugin file name)
+
+ # ls target/
+ classes glusterfs-0.20.2-0.1.jar maven-archiver surefire-reports test-classes
+ ^^^^^^^^^^^^^^^^^^
+
+  Copy the plugin to the lib/ directory under $HADOOP_HOME:
+
+ # cp target/glusterfs-0.20.2-0.1.jar $HADOOP_HOME/lib
+
+  Copy the sample configuration file that ships with this source (conf/core-site.xml) to the conf
+  directory under $HADOOP_HOME:
+
+ # cp conf/core-site.xml $HADOOP_HOME/conf
+
+* Installing the plugin from RPM
+
+ See the plugin documentation for installing from RPM.
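+
+  Installation is typically a single rpm invocation (the package file name below is illustrative;
+  use the RPM you actually obtained):
+
+  # rpm -ivh glusterfs-hadoop-0.20.2-0.1.x86_64.rpm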
+
+
+CLUSTER INSTALLATION
+--------------------
+
+  In case it is tedious to do the above step(s) on all hosts in the cluster, use the build-and-deploy.py script to
+  build the plugin in one place and deploy it (along with the configuration file) on all other hosts.
+
+  This script should be run on the host that is the Hadoop master [Job Tracker].
+
+* STEPS (you would have done steps 1 and 2 anyway while deploying Hadoop)
+
+  1. Edit the conf/slaves file in your Hadoop distribution, one line for each slave (see the example
+     after the option list below).
+  2. Set up password-less SSH between the Hadoop master and the slave(s).
+  3. Edit conf/core-site.xml with all GlusterFS-related configuration (see CONFIGURATION).
+  4. Run the following:
+ # cd $GLUSTER_HOME/hdfs/0.20.2/tools
+ # python ./build-and-deploy.py -b -d /path/to/hadoop/home -c
+
+  This will build the plugin and copy it (and the config file) to all slaves listed in $HADOOP_HOME/conf/slaves.
+
+ Script options:
+ -b : build the plugin
+ -d : location of hadoop directory
+ -c : deploy core-site.xml
+ -m : deploy mapred-site.xml
+ -h : deploy hadoop-env.sh
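+
+  For example, the conf/slaves file and the password-less SSH setup from steps 1 and 2 might look
+  like this (hostnames are illustrative):
+
+  # cat conf/slaves
+  slave1.example.com
+  slave2.example.com
+
+  # ssh-keygen -t rsa
+  # ssh-copy-id slave1.example.com
+  # ssh-copy-id slave2.example.com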
+
+
+CONFIGURATION
+-------------
+
+ All plugin configuration is done in a single XML file (core-site.xml) with <name><value> tags in each <property>
+ block.
+
+  A brief explanation of the tunables and the values they accept (change them wherever needed) is given below.
+
+ name: fs.glusterfs.impl
+ value: org.apache.hadoop.fs.glusterfs.GlusterFileSystem
+
+ The default FileSystem API to use (there is little reason to modify this).
+
+ name: fs.default.name
+ value: glusterfs://server:port
+
+     The default name that Hadoop uses to represent files as URIs (typically a server:port tuple). Use any host
+     in the cluster as the server and any port number. This option has to be in server:port format for Hadoop
+     to create file URIs, but it is not otherwise used by the plugin.
+
+ name: fs.glusterfs.volname
+ value: volume-dist-rep
+
+ The volume to mount.
+
+
+ name: fs.glusterfs.mount
+ value: /mnt/glusterfs
+
+ This is the directory that the plugin will use to mount (FUSE mount) the volume.
+
+ name: fs.glusterfs.server
+ value: 192.168.1.36, hackme.zugzug.org
+
+     To mount a volume, the plugin needs to know the hostname or the IP of a GlusterFS server in the cluster.
+     Specify it here.
+
+ name: quick.slave.io
+ value: [On/Off], [Yes/No], [1/0]
+
+     NOTE: This option is currently untested.
+
+     This is a performance tunable. Hadoop schedules jobs on the hosts that hold the relevant part of the
+     file's data, and the job then does I/O on the file (via FUSE in the case of GlusterFS). When this option
+     is set, the plugin will try to do I/O directly against the backend filesystem (ext3, ext4, etc.) on which
+     the file resides, so read performance improves and jobs run faster.
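+
+  Putting the above together, a minimal core-site.xml sketch (the server address, port, and volume
+  name are illustrative; substitute your own):
+
+  <configuration>
+    <property>
+      <name>fs.glusterfs.impl</name>
+      <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
+    </property>
+    <property>
+      <name>fs.default.name</name>
+      <value>glusterfs://192.168.1.36:9000</value>
+    </property>
+    <property>
+      <name>fs.glusterfs.volname</name>
+      <value>volume-dist-rep</value>
+    </property>
+    <property>
+      <name>fs.glusterfs.mount</name>
+      <value>/mnt/glusterfs</value>
+    </property>
+    <property>
+      <name>fs.glusterfs.server</name>
+      <value>192.168.1.36</value>
+    </property>
+  </configuration>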
+
+
+USAGE
+-----
+
+  Once configured, start the Hadoop Map/Reduce daemons:
+
+ # cd $HADOOP_HOME
+ # ./bin/start-mapred.sh
+
+  Once the Map/Reduce job and task trackers are up, all I/O will go to GlusterFS.
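+
+  A quick sanity check that the trackers are running and that file operations go through the plugin
+  (jps ships with the JDK; the listing should reflect the contents of the GlusterFS volume):
+
+  # jps | grep -i tracker
+  # ./bin/hadoop fs -ls /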
+
+
+FOR HACKERS
+-----------
+
+* Source Layout
+
+** version specific: hdfs/<version> **
+./src
+./src/main
+./src/main/java
+./src/main/java/org
+./src/main/java/org/apache
+./src/main/java/org/apache/hadoop
+./src/main/java/org/apache/hadoop/fs
+./src/main/java/org/apache/hadoop/fs/glusterfs
+./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFSBrickClass.java
+./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFSXattr.java <--- Fetch/Parse Extended Attributes of a file
+./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFUSEInputStream.java <--- Input Stream (instantiated during open() calls; quick read from the backend FS)
+./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFSBrickRepl.java
+./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFUSEOutputStream.java <--- Output Stream (instantiated during creat() calls)
+./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFileSystem.java <--- Entry Point for the plugin (extends Hadoop FileSystem class)
+./src/test
+./src/test/java
+./src/test/java/org
+./src/test/java/org/apache
+./src/test/java/org/apache/hadoop
+./src/test/java/org/apache/hadoop/fs
+./src/test/java/org/apache/hadoop/fs/glusterfs
+./src/test/java/org/apache/hadoop/fs/glusterfs/AppTest.java <--- Your test cases go here (if any :-))
+./tools/build-deploy-jar.py <--- Build and Deployment Script
+./conf
+./conf/core-site.xml <--- Sample configuration file
+./pom.xml <--- Build file (used by Maven)
+
+** toplevel: hdfs/ **
+./COPYING <--- License
+./README <--- This file