<?xml version='1.0' encoding='UTF-8'?>
<!-- This document was created with Syntext Serna Free. --><!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
<!ENTITY % BOOK_ENTITIES SYSTEM "Administration_Guide.ent">
%BOOK_ENTITIES;
]>
<chapter id="chap-Administration_Guide-Hadoop">
  <title>Managing Hadoop Compatible Storage </title>
  <para>GlusterFS provides compatibility with Apache Hadoop by implementing the standard file system
APIs that Hadoop uses, offering a new storage option for Hadoop deployments. Existing
MapReduce-based applications can use GlusterFS seamlessly, and this functionality opens up data
within Hadoop deployments to any file-based or object-based application.

 </para>
  <section id="sect-Administration_Guide-Hadoop-Introduction-Architecture_Overview">
    <title>Architecture Overview </title>
    <para>The following diagram illustrates Hadoop integration with GlusterFS:
<mediaobject>
        <imageobject>
          <imagedata fileref="images/Hadoop_Architecture.png"/>
        </imageobject>
      </mediaobject>
  </para>
  </section>
  <section id="sect-Administration_Guide-Hadoop-Introduction-Advantages">
    <title>Advantages </title>
    <para>
The following are the advantages of Hadoop Compatible Storage with GlusterFS:

   
  </para>
    <itemizedlist>
      <listitem>
        <para>Provides simultaneous file-based and object-based access within Hadoop.
</para>
      </listitem>
      <listitem>
        <para>Eliminates the centralized metadata server.
</para>
      </listitem>
      <listitem>
        <para>Provides compatibility with existing MapReduce applications; no application rewrite is required.
</para>
      </listitem>
      <listitem>
        <para>Provides a fault tolerant file system.
</para>
      </listitem>
    </itemizedlist>
  </section>
  <section>
    <title>Preparing to Install Hadoop Compatible Storage</title>
    <para>This section describes the prerequisites for installing Hadoop Compatible Storage and lists the
dependencies that are installed with it.

</para>
    <section id="sect-Administration_Guide-Hadoop-Preparation">
      <title>Prerequisites</title>
      <para>The following are the prerequisites for installing Hadoop Compatible
Storage:

  </para>
      <itemizedlist>
        <listitem>
          <para>Hadoop 0.20.2 is installed, configured, and running on all machines in the cluster.
</para>
        </listitem>
        <listitem>
          <para>Java Runtime Environment
</para>
        </listitem>
        <listitem>
          <para>Maven (required only if you are building the plugin from source)
</para>
        </listitem>
        <listitem>
          <para>JDK (required only if you are building the plugin from source)
</para>
        </listitem>
        <listitem>
          <para>getfattr command-line utility</para>
        </listitem>
      </itemizedlist>
    </section>
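Before installing, you can sanity-check these prerequisites from the command line. The following is a minimal sketch; the binary names <command>hadoop</command>, <command>java</command>, and <command>getfattr</command> are assumptions based on the list above, so adjust them for your environment:

```python
#!/usr/bin/env python3
# Minimal sketch: verify that the prerequisite tools listed above are on PATH.
# The binary names below are assumptions; adjust them for your setup
# (add "mvn" and "javac" only if you build the plugin from source).
import shutil

def missing_prereqs(required, which=shutil.which):
    """Return the commands from `required` that are not found on PATH."""
    return [cmd for cmd in required if which(cmd) is None]

if __name__ == "__main__":
    missing = missing_prereqs(["hadoop", "java", "getfattr"])
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All prerequisites found")
```

Run the script on each server in the cluster; an empty result means the listed tools were found.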
  </section>
  <section>
    <title>Installing and Configuring Hadoop Compatible Storage</title>
    <para>This section describes how to install and configure Hadoop Compatible Storage in your storage
environment and verify that it is functioning correctly.

</para>
    <orderedlist>
      <para>To install and configure Hadoop compatible storage:</para>
      <listitem>
        <para>Download the <filename>glusterfs-hadoop-0.20.2-0.1.x86_64.rpm</filename> file to each server in your cluster. You can download the file from <ulink url="http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3-beta-2/glusterfs-hadoop-0.20.2-0.1.x86_64.rpm"/>.

</para>
      </listitem>
      <listitem>
        <para>To install Hadoop Compatible Storage on all servers in your cluster, run the following command:
</para>
        <para><command># rpm -ivh --nodeps glusterfs-hadoop-0.20.2-0.1.x86_64.rpm</command>
</para>
        <para>The following files will be extracted:
 </para>
        <itemizedlist>
          <listitem>
            <para>/usr/local/lib/glusterfs-<replaceable>Hadoop-version-gluster_plugin_version</replaceable>.jar </para>
          </listitem>
          <listitem>
            <para> /usr/local/lib/conf/core-site.xml</para>
          </listitem>
        </itemizedlist>
      </listitem>
      <listitem>
        <para>(Optional) To install Hadoop Compatible Storage in a different location, run the following
command:
</para>
        <para><command># rpm -ivh --nodeps --prefix /usr/local/glusterfs/hadoop glusterfs-hadoop-0.20.2-0.1.x86_64.rpm</command>
</para>
      </listitem>
      <listitem>
        <para>Edit the <filename>conf/core-site.xml</filename> file. The following is the sample <filename>conf/core-site.xml</filename> file:
</para>
        <para><programlisting>&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;fs.glusterfs.impl&lt;/name&gt;
    &lt;value&gt;org.apache.hadoop.fs.glusterfs.GlusterFileSystem&lt;/value&gt;
&lt;/property&gt;

&lt;property&gt;
   &lt;name&gt;fs.default.name&lt;/name&gt;
   &lt;value&gt;glusterfs://fedora1:9000&lt;/value&gt;
&lt;/property&gt;

&lt;property&gt;
   &lt;name&gt;fs.glusterfs.volname&lt;/name&gt;
   &lt;value&gt;hadoopvol&lt;/value&gt;
&lt;/property&gt;  
 
&lt;property&gt;
   &lt;name&gt;fs.glusterfs.mount&lt;/name&gt;
   &lt;value&gt;/mnt/glusterfs&lt;/value&gt;
&lt;/property&gt;

&lt;property&gt;
   &lt;name&gt;fs.glusterfs.server&lt;/name&gt;
   &lt;value&gt;fedora2&lt;/value&gt;
&lt;/property&gt;

&lt;property&gt;
   &lt;name&gt;quick.slave.io&lt;/name&gt;
   &lt;value&gt;Off&lt;/value&gt;
&lt;/property&gt;
&lt;/configuration&gt;
</programlisting></para>
        <para>The following are the configurable fields:
</para>
        <para><informaltable frame="none">
            <tgroup cols="3">
              <colspec colnum="1" colname="c0" colsep="0"/>
              <colspec colnum="2" colname="c1" colsep="0"/>
              <colspec colnum="3" colname="c2" colsep="0"/>
              <thead>
                <row>
                  <entry>Property Name </entry>
                  <entry>Default Value </entry>
                  <entry>Description </entry>
                </row>
              </thead>
              <tbody>
                <row>
                  <entry>fs.default.name </entry>
                  <entry>glusterfs://fedora1:9000</entry>
                  <entry>Any hostname in the cluster as the server, and any unused port number. </entry>
                </row>
                <row>
                  <entry>fs.glusterfs.volname </entry>
                  <entry>hadoopvol </entry>
                  <entry>GlusterFS volume to mount. </entry>
                </row>
                <row>
                  <entry>fs.glusterfs.mount </entry>
                  <entry>/mnt/glusterfs</entry>
                  <entry>The directory used to FUSE-mount the volume.</entry>
                </row>
                <row>
                  <entry>fs.glusterfs.server </entry>
                  <entry>fedora2</entry>
                  <entry>Any hostname or IP address on the cluster except the client/master. </entry>
                </row>
                <row>
                  <entry>quick.slave.io </entry>
                  <entry>Off </entry>
                  <entry>Performance tuning option. If this option is set to On, the plugin tries to perform I/O directly on the disk file system (such as ext3 or ext4) on which the file resides, which improves read performance and makes jobs run faster. <note>
                      <para>This option has not been widely tested.</para>
                    </note></entry>
                </row>
              </tbody>
            </tgroup>
          </informaltable></para>
      </listitem>
      <listitem>
        <para>Create soft links in Hadoop’s library and configuration directories to the files installed in
Step 2 (or Step 3), using the following commands:
</para>
        <para><command># ln -s <replaceable>&lt;target location&gt;</replaceable> <replaceable>&lt;link location&gt;</replaceable></command>
</para>
        <para>For example,
</para>
        <para><command># ln -s /usr/local/lib/glusterfs-0.20.2-0.1.jar <replaceable>$HADOOP_HOME</replaceable>/lib/glusterfs-0.20.2-0.1.jar</command>
</para>
        <para><command># ln -s /usr/local/lib/conf/core-site.xml <replaceable>$HADOOP_HOME</replaceable>/conf/core-site.xml</command></para>
      </listitem>
      <listitem>
        <para>(Optional) Instead of repeating the above steps on each server, you can run the following command on the
Hadoop master to build the plugin and deploy it along with the core-site.xml file:
</para>
        <para><command># build-deploy-jar.py -d <replaceable>$HADOOP_HOME</replaceable> -c </command></para>
      </listitem>
    </orderedlist>
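As an alternative to editing <filename>conf/core-site.xml</filename> by hand, an equivalent file can be generated from a short script. The following is a hedged sketch: the property names match the sample configuration above, while the host, port, volume, and mount-point values are placeholders you must replace with your own:

```python
# Sketch: write a core-site.xml equivalent to the sample configuration above.
# Property names come from that sample; the host, port, volume, and mount
# values below are placeholders for your environment.
import xml.etree.ElementTree as ET

def build_core_site(props):
    """Build a Hadoop-style <configuration> element from a name->value dict."""
    conf = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return conf

props = {
    "fs.glusterfs.impl": "org.apache.hadoop.fs.glusterfs.GlusterFileSystem",
    "fs.default.name": "glusterfs://fedora1:9000",
    "fs.glusterfs.volname": "hadoopvol",
    "fs.glusterfs.mount": "/mnt/glusterfs",
    "fs.glusterfs.server": "fedora2",
    "quick.slave.io": "Off",
}

if __name__ == "__main__":
    ET.ElementTree(build_core_site(props)).write(
        "core-site.xml", encoding="utf-8", xml_declaration=True)
```

Generating the file this way keeps the per-server values in one place and avoids hand-editing mistakes such as stray whitespace inside property values.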
  </section>
  <section>
    <title>Starting and Stopping the Hadoop MapReduce Daemon</title>
    <para>To start and stop the MapReduce daemon:</para>
    <itemizedlist>
      <listitem>
        <para>To start the MapReduce daemon manually, enter the following command:
</para>
        <para><command># <replaceable>$HADOOP_HOME</replaceable>/bin/start-mapred.sh</command>
</para>
      </listitem>
      <listitem>
        <para>To stop the MapReduce daemon manually, enter the following command:
</para>
        <para><command># <replaceable>$HADOOP_HOME</replaceable>/bin/stop-mapred.sh </command></para>
      </listitem>
    </itemizedlist>
    <para><note>
        <para>You must start the Hadoop MapReduce daemon on all servers.
</para>
      </note></para>
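Since the daemon must be started on every server, the per-host start and stop commands can be scripted. The sketch below only builds the command lines and does not execute anything; the hostnames and the <filename>HADOOP_HOME</filename> path are placeholder assumptions:

```python
# Sketch: build the per-server ssh command lines that start or stop the
# MapReduce daemon on every node, per the note above. Hostnames and the
# HADOOP_HOME path are placeholders; nothing is executed here.
def daemon_commands(action, hosts, hadoop_home):
    """Return one ssh argv per host invoking start-mapred.sh or stop-mapred.sh."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    script = "%s/bin/%s-mapred.sh" % (hadoop_home, action)
    return [["ssh", host, script] for host in hosts]

if __name__ == "__main__":
    for argv in daemon_commands("start", ["fedora1", "fedora2"], "/opt/hadoop"):
        print(" ".join(argv))
```

Each generated argv could be handed to a process runner (for example Python's subprocess module) once the hostnames and path reflect your cluster.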
  </section>
</chapter>