The Apache Hadoop website gives instructions on how to install Hadoop 2.2.0. However, as with many open source projects, it is not well documented, and some of the instructions simply do not work. I spent quite a lot of time figuring out how to install Hadoop 2.2 on my Linux Debian 7 wheezy machine. Here are the steps I took:
Step-by-step instructions to install Hadoop 2.2.0 on Debian wheezy Linux
Prerequisites
- Linux OS. I use Debian 7 wheezy
- SSH server. $ sudo apt-get install openssh-server
- Java JDK. If you don't have it, follow the instructions at https://wiki.debian.org/JavaPackage, summarized in the steps below. I installed the Sun Java JDK 1.7 downloaded from Oracle.
- Add a “contrib” component to /etc/apt/sources.list, for example:
# Debian 7 "Wheezy"
deb http://http.debian.net/debian/ wheezy main contrib
- Update the list of available packages and install the java-package package:
# apt-get update && apt-get install java-package && exit
- Download the desired Java JDK/JRE binary distribution (Oracle). Choose the tar.gz or self-extracting archive; do not choose the RPM!
- Use java-package to create a Debian package, for example:
$ make-jpkg jdk-7u45-linux-x64.tar.gz
- Install the binary package created:
$ su
# dpkg -i oracle-j2sdk1.7_1.7.0+update45_amd64.deb
By default, the Debian alternatives system will automatically make the best installed version of Java the default. If the symlinks have been set manually, they will be preserved by the tools: update-alternatives tries hard to respect explicit configuration from the local admin, and manually created symlinks count as explicit configuration. To reset the alternative symlinks to their default values, use the --auto option.
# update-alternatives --auto java
If you'd like to override the default, perhaps to use a specific version, use --config and manually select the desired version.
# update-alternatives --display java
# update-alternatives --config java
Choose the appropriate number for the desired alternative.
The appropriate java binary will automatically be in PATH by virtue of the /usr/bin/java alternative symlink.
You can also use the update-java-alternatives tool from the java-common package, which lets you update all alternatives belonging to one runtime or development kit at a time.
# update-java-alternatives -l
# update-java-alternatives -s j2sdk1.7-oracle
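To double-check which JDK is now the default, you can run:
$ java -version
It should report version 1.7.0_45; the exact build string depends on the update you installed.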
Add Hadoop Group and User:
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hadoopuser
$ sudo adduser hadoopuser sudo
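Switch to the new account before the next step, since the SSH key pair below must belong to hadoopuser, not to your regular user:
$ su - hadoopuser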
Setup SSH Keys for Password-less Login
$ ssh-keygen -t rsa -P ''
...
Your identification has been saved in /home/hadoopuser/.ssh/id_rsa.
Your public key has been saved in /home/hadoopuser/.ssh/id_rsa.pub.
...
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost
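The first ssh localhost will prompt you to accept the host key, but it should not ask for a password. If it does, the permissions on the key files are probably too open for sshd; tightening them usually fixes it:
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys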
Download Hadoop 2.2.0
$ cd ~
$ wget http://www.trieuvan.com/apache/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
$ sudo tar vxzf hadoop-2.2.0.tar.gz -C /usr/local
$ cd /usr/local
$ sudo mv hadoop-2.2.0 hadoop
$ sudo chown -R hadoopuser:hadoop hadoop
Alternatively, you can build Hadoop 2.2.0 from source. See the build instructions for details:
http://svn.apache.org/repos/asf/hadoop/common/trunk/BUILDING.txt and my other post on how to build native Hadoop libraries: http://drweiwang.com/build-hadoop-native-libraries/
Setup Hadoop Environment Variables
$ cd ~
$ nano .bashrc
Append the following to the end of your shell profile, ~/.bashrc:
#Hadoop variables
export JAVA_HOME=/usr/lib/jvm/j2sdk1.7-oracle/
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
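These variables take effect only in new shells; to load them into the current session, source the file:
$ source ~/.bashrc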
Now you need to set the JAVA_HOME value in the hadoop-env.sh script as well; it is the one environment variable that Hadoop requires to be set there.
$ cd /usr/local/hadoop/etc/hadoop
$ nano hadoop-env.sh
Find the line that has export JAVA_HOME, which is the first export statement in the hadoop-env.sh file, and change its value to:
export JAVA_HOME=/usr/lib/jvm/j2sdk1.7-oracle/
Hadoop should now be installed. Log back in as hadoopuser and check the Hadoop version:
$ hadoop version
Hadoop 2.2.0
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.2.0.jar
Configure Hadoop
Edit core-site.xml, which defines the default file system, i.e. the address where the HDFS namenode listens:
$ cd /usr/local/hadoop/etc/hadoop
$ nano core-site.xml
Paste the following between the <configuration></configuration> tags:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
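Note that fs.default.name is the old, deprecated name for this property; Hadoop 2.x prefers fs.defaultFS, although both still work in 2.2.0. If you want to use the current name instead:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>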
Modify yarn-site.xml by adding the following between the <configuration></configuration> tags (note that the class property key uses the service name mapreduce_shuffle, with an underscore):
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
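These two properties register YARN's shuffle handler as an auxiliary service on the node manager; MapReduce reduce tasks fetch map output through it, so jobs will fail at the shuffle phase without it.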
Next, configure mapred-site.xml. Hadoop ships it only as a template, so rename it first:
$ mv mapred-site.xml.template mapred-site.xml
$ nano mapred-site.xml
and then add the following inside the <configuration></configuration> tags:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
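mapreduce.framework.name defaults to local, so without this property MapReduce jobs would run in the local runner instead of on YARN.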
Now we need to configure HDFS. Make two directories to hold the namenode and datanode data:
$ cd ~
$ mkdir -p mydata/hdfs/namenode
$ mkdir -p mydata/hdfs/datanode
$ cd /usr/local/hadoop/etc/hadoop
$ nano hdfs-site.xml
Paste the following between the <configuration></configuration> tags:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/hadoopuser/mydata/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/hadoopuser/mydata/hdfs/datanode</value>
</property>
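dfs.replication is set to 1 because this is a single-node setup; the default factor of 3 would leave every block permanently under-replicated with only one datanode.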
The namenode needs to be formatted first:
hadoopuser@debian7-64:~$ hdfs namenode -format
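Formatting creates a fresh file system in the dfs.namenode.name.dir directory. Do this only once: reformatting later generates a new namespace ID, and the existing datanode will refuse to connect until its data directory is cleared.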
Start Hadoop Service
Everything should be all set. We can now start the Hadoop services. The old start-all.sh and stop-all.sh scripts from version 1.x have been superseded by start-dfs.sh and start-yarn.sh (and their stop-dfs.sh and stop-yarn.sh counterparts) in version 2.2.0.
hadoopuser@debian7-64:~$ start-dfs.sh
....
hadoopuser@debian7-64:~$ start-yarn.sh
....
hadoopuser@debian7-64:~$ jps
If everything started successfully, jps should list Java processes like the following (your process IDs will differ):
7313 NodeManager
7026 SecondaryNameNode
6692 NameNode
7213 ResourceManager
7470 Jps
6786 DataNode
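You can also check the web interfaces: in Hadoop 2.2 the NameNode UI is served on http://localhost:50070 and the ResourceManager UI on http://localhost:8088 by default.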
Run Hadoop Example
hadoopuser@debian7-64:/usr/local/hadoop$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 5
Number of Maps = 2
Samples per Map = 5
.......
Wrote input for Map #0
Wrote input for Map #1
Starting Job
.......
14/01/14 22:11:09 INFO mapreduce.Job: map 0% reduce 0%
14/01/14 22:11:16 INFO mapreduce.Job: map 100% reduce 0%
14/01/14 22:11:22 INFO mapreduce.Job: map 100% reduce 100%
14/01/14 22:11:23 INFO mapreduce.Job: Job job_1389755304371_0002 completed successfully
14/01/14 22:11:23 INFO mapreduce.Job: Counters: 43
File System Counters
    FILE: Number of bytes read=50
    FILE: Number of bytes written=239029
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=538
    HDFS: Number of bytes written=215
    HDFS: Number of read operations=11
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=3
Job Counters
    Launched map tasks=2
    Launched reduce tasks=1
    Data-local map tasks=2
    Total time spent by all maps in occupied slots (ms)=10579
    Total time spent by all reduces in occupied slots (ms)=3592
Map-Reduce Framework
    Map input records=2
    Map output records=4
    Map output bytes=36
    Map output materialized bytes=56
    Input split bytes=302
    Combine input records=0
    Combine output records=0
    Reduce input groups=2
    Reduce shuffle bytes=56
    Reduce input records=4
    Reduce output records=0
    Spilled Records=8
    Shuffled Maps =2
    Failed Shuffles=0
    Merged Map outputs=2
    GC time elapsed (ms)=70
    CPU time spent (ms)=1430
    Physical memory (bytes) snapshot=620412928
    Virtual memory (bytes) snapshot=1804890112
    Total committed heap usage (bytes)=467140608
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters
    Bytes Read=236
File Output Format Counters
    Bytes Written=97
Job Finished in 22.021 seconds
Estimated value of Pi is 3.60000000000000000000
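As one more smoke test, you can exercise HDFS directly; the paths below are just examples:
$ hdfs dfs -mkdir -p /user/hadoopuser
$ hdfs dfs -put /etc/hosts /user/hadoopuser/
$ hdfs dfs -ls /user/hadoopuser
When you are done, stop the services in the reverse of the start order:
$ stop-yarn.sh
$ stop-dfs.sh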