Sources
Default configuration values
Single node cluster setup
For Ubuntu-based distributions.
Download Hadoop
Download the current version from Apache Hadoop Releases. At the time of writing, the current version is 2.7.1 (2015-08-18).
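One way to fetch it (a sketch; mirror URLs vary, but archive.apache.org keeps old releases; the tarball is assumed to end up under /vagrant/installators/ for the untar step below):
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz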
Install Hadoop
Java
sudo apt-get install openjdk-7-jdk -y
Untar
mkdir ~/hadoop
cd ~/hadoop
tar -xf /vagrant/installators/hadoop-2.7.1.tar.gz
Add to .bashrc or similar
# Hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/home/vagrant/hadoop/hadoop-2.7.1
export PATH=$HADOOP_HOME/bin:$PATH
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
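Reload the shell config and verify the environment (a quick check):
source ~/.bashrc
hadoop version   # should report Hadoop 2.7.1
java -version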
Set JAVA_HOME in the hadoop-env.sh config file
vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# insert
export JAVA_HOME=${JAVA_HOME}
# or explicitly
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Set up ssh
mkdir -p ~/.ssh
cd ~/.ssh
ssh-keygen # and Enter, Enter... no password and defaults
cp id_rsa.pub authorized_keys # authorize myself
ssh localhost # should work
Run example job
Calculate Pi (no input required; the arguments are the number of map tasks and the number of samples per map)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 4 100
Pseudo-distributed mode setup
Prepare directory for HDFS data
cd ~/hadoop
mkdir data
chmod 777 data
Add configuration properties
vi $HADOOP_HOME/etc/hadoop/core-site.xml
# add:
# (fs.defaultFS replaces the deprecated fs.default.name)
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/vagrant/hadoop/data</value>
</property>
vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
# add:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml
vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
# add (mapred.job.tracker is a Hadoop 1 property; on 2.x the framework name below is what matters):
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
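The official single-node guide also sets the MapReduce shuffle service in yarn-site.xml, which YARN needs in order to run the jobs started below:
vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
# add:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>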
Format namenode
hdfs namenode -format
Start daemons
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
jps
# Output should be like:
# 24050 NameNode
# 24828 Jps
# 24410 SecondaryNameNode
# 24185 DataNode
# 24590 ResourceManager
# 24725 NodeManager
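The web UIs are another sanity check; in Hadoop 2.x the NameNode serves on port 50070 and the ResourceManager on 8088 by default:
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070   # NameNode UI, expect 200
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088    # ResourceManager UI, expect 200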
Prepare data
cd $HADOOP_HOME
# use bin/hadoop fs... or hdfs dfs...
hdfs dfs -ls / # lists nothing (but no error either), since HDFS was just formatted
echo "This is a test." > test.txt
hdfs dfs -mkdir /data
hdfs dfs -mkdir /data/in
hdfs dfs -put test.txt /data/in
Run example wordcount job
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /data/in /data/out
hdfs dfs -cat /data/out/part-r-00000
# Output should be like:
# This 1
# a 1
# is 1
# test. 1
Run your own code
Create example .java file
Code from official Hadoop tutorial.
# copy paste or edit the code
vi WordCount.java
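For reference, the WordCount from the official MapReduce tutorial looks like this (reproduced from memory, so compare it against the tutorial page):
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}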
Compile against the Hadoop classpath and package into a jar
javac -classpath `yarn classpath` WordCount.java
jar cvf wc.jar WordCount*.class
Clean up
hdfs dfs -rm -r /data/out
Run
hadoop jar wc.jar WordCount /data/in/test.txt /data/out
Get output
mkdir -p output
hdfs dfs -get /data/out/part-r-00000 output/
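Then inspect it locally:
cat output/part-r-00000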
Set up Hadoop using Vagrant and Docker
Prerequisites
- VirtualBox installed
- Vagrant installed
mkdir hadoop-dev
cd hadoop-dev
Vagrantfile
Create Vagrantfile
Vagrant.configure(2) do |config|
  config.vm.box = "chef/centos-7.0"
  config.vm.hostname = "test-hadoop.lttr.cz"
  config.vm.network "private_network", ip: "2.2.2.2"
  config.vm.provision "docker" do |d|
    # --name = name of the container
    # -v = specify a volume (synced folder)
    # -p = forward ports
    # HDFS, MapReduce, YARN and other ports
    d.run "sequenceiq/hadoop-docker",
      args: "--name hadoop-docker
        -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 -p 50090:50090
        -p 19888:19888
        -p 8030:8030 -p 8031:8031 -p 8032:8032 -p 8033:8033 -p 8040:8040 -p 8042:8042 -p 8088:8088
        -p 49707:49707 -p 2122:2122
        -v /vagrant/data:/var/data"
  end
  config.vm.provider "virtualbox" do |vb|
    vb.memory = "2048"
  end
end
vagrant up
This installs Docker via the Vagrant Docker provisioner, then downloads and starts the given Docker image. It will take a while.
SSH into the Vagrant VM and continue as root
vagrant ssh
sudo -i
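Check that the container is running:
docker ps   # should list a container named hadoop-docker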
Using container with Hadoop
Get a TTY in the container
docker exec -it hadoop-docker /bin/bash   # the name set by --name in the Vagrantfile
cd $HADOOP_PREFIX
bin/hadoop fs -put ...
bin/hadoop jar ...
bin/hdfs dfs -cat ...
bin/hadoop fs -get ...
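For example, the earlier wordcount flow works the same inside the container (a sketch; the examples jar version depends on the image, hence the wildcard):
# from $HADOOP_PREFIX
echo "This is a test." > test.txt
bin/hadoop fs -mkdir -p /data/in
bin/hadoop fs -put test.txt /data/in
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /data/in /data/out
bin/hdfs dfs -cat /data/out/part-r-00000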
Set up workspace for Hadoop with Maven
Set up project
Generate testing project with basic structure
mvn archetype:generate -DgroupId=cz.lttr.hadoop -DartifactId=test-hadoop -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
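Change into the generated project before editing pom.xml:
cd test-hadoop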
Add encoding to pom.xml
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
Add Hadoop dependencies to pom.xml. (The old hadoop-core artifact stopped at 1.x and must not be mixed with 2.7.1 artifacts; hadoop-client already pulls in hadoop-common and hadoop-hdfs, but listing them explicitly does no harm.)
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.7.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <version>2.7.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.7.1</version>
</dependency>
Edit source code with Eclipse
Prepare project for Eclipse
mvn eclipse:eclipse
Then import as an Existing Maven Project in Eclipse.
Write some code.
Package and deploy
Generate the .jar file
mvn package
Test the jar file (the Hadoop classes must be on the classpath, so run it through the hadoop launcher rather than plain java)
hadoop jar target/test-hadoop-1.0-SNAPSHOT.jar cz.lttr.hadoop.WordCount /data/in /data/out
Tips
Show dependency tree of the project jar files
mvn dependency:tree