Downloads and docs

Default configuration values

Single node cluster setup

For Ubuntu-based distributions.

Download Hadoop

Download the current version from the Apache Hadoop Releases page.

At the time of writing, the current version is 2.7.1 (2015-08-18).

Install Hadoop


sudo apt-get install openjdk-7-jdk -y


mkdir ~/hadoop
cd ~/hadoop
tar -xf /vagrant/installators/hadoop-2.7.1.tar.gz

Add to .bashrc or similar

# Hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/home/vagrant/hadoop/hadoop-2.7.1
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
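
A quick sanity check of these exports can save time later. The following is a minimal sketch, not part of Hadoop: `check_hadoop_env` is a hypothetical helper that only verifies that the directories and the `tools.jar` file the exports rely on actually exist.

```bash
# Hypothetical helper: verify the paths that the exports above depend on.
# Arguments default to the current JAVA_HOME and HADOOP_HOME values.
check_hadoop_env() {
  local java_home="${1:-${JAVA_HOME:-}}"
  local hadoop_home="${2:-${HADOOP_HOME:-}}"
  local status=0
  if [ ! -d "$java_home" ]; then
    echo "JAVA_HOME does not point to a directory: $java_home"
    status=1
  fi
  if [ ! -d "$hadoop_home/bin" ]; then
    echo "HADOOP_HOME has no bin directory: $hadoop_home"
    status=1
  fi
  if [ ! -f "$java_home/lib/tools.jar" ]; then
    echo "tools.jar not found under: $java_home/lib"
    status=1
  fi
  return "$status"
}
```

After reloading the shell configuration, `check_hadoop_env && echo ok` should print `ok` if everything is in place.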

Alter java home in conf file

vi $HADOOP_HOME/etc/hadoop/
# insert
# or explicitly
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Set up ssh

mkdir -p ~/.ssh
cd ~/.ssh
ssh-keygen # and Enter, Enter... no password and defaults
cat id_rsa.pub >> authorized_keys # authorize myself (id_rsa.pub is the default key name)
ssh localhost # should work
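
If `ssh localhost` still asks for a password, the usual suspects are a duplicated or missing key line and overly open permissions. The sketch below assumes an existing public key file; `authorize_key` is a hypothetical helper name, not a standard tool. It appends the key only once and tightens permissions on the directory and the `authorized_keys` file.

```bash
# Hypothetical helper: authorize_key SSH_DIR PUBLIC_KEY_FILE
authorize_key() {
  local ssh_dir="$1" pub_key="$2"
  mkdir -p "$ssh_dir"
  chmod 700 "$ssh_dir"
  touch "$ssh_dir/authorized_keys"
  # -x matches whole lines, -F treats the key as a fixed string:
  # append only if the key is not listed yet
  if ! grep -qxF -- "$(cat "$pub_key")" "$ssh_dir/authorized_keys"; then
    cat "$pub_key" >> "$ssh_dir/authorized_keys"
  fi
  chmod 600 "$ssh_dir/authorized_keys"
}
```

Typical usage would be `authorize_key ~/.ssh ~/.ssh/id_rsa.pub`; running it twice leaves a single entry.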

Run example job

Calculate PI (no input required)

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 4 100

Pseudo-distributed mode setup

Prepare directory for HDFS data

cd ~/hadoop
mkdir data
chmod 777 data

Add configuration properties

vi $HADOOP_HOME/etc/hadoop/core-site.xml
# add:
vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
# add:
cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml
vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
# add:
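
As an illustration of what these three files can contain, here is a minimal sketch that writes one property per file. The values are an assumption taken from the official single-node guide (`fs.defaultFS`, `dfs.replication`, `mapreduce.framework.name`), not necessarily the ones used here, and both function names are hypothetical. The functions overwrite existing files, so try them on a scratch directory first.

```bash
# Hypothetical helper: write_site_xml FILE PROPERTY_NAME VALUE
write_site_xml() {
  cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

# Hypothetical helper: write the assumed pseudo-distributed settings.
# The target directory defaults to $HADOOP_HOME/etc/hadoop.
write_pseudo_distributed_conf() {
  local conf_dir="${1:-$HADOOP_HOME/etc/hadoop}"
  write_site_xml "$conf_dir/core-site.xml"   fs.defaultFS             hdfs://localhost:9000
  write_site_xml "$conf_dir/hdfs-site.xml"   dfs.replication          1
  write_site_xml "$conf_dir/mapred-site.xml" mapreduce.framework.name yarn
}
```

The official guide additionally sets `yarn.nodemanager.aux-services` to `mapreduce_shuffle` in yarn-site.xml when jobs run on YARN. Pointing HDFS at the data directory prepared earlier would need a further property such as `hadoop.tmp.dir`, which this sketch leaves out.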

Format namenode

hdfs namenode -format

Start application

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
jps
# Output should be like:
# 24050 NameNode
# 24828 Jps
# 24410 SecondaryNameNode
# 24185 DataNode
# 24590 ResourceManager
# 24725 NodeManager
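
Comparing the process list by eye gets tedious. The sketch below reads `jps`-style output on standard input and prints every expected daemon that is missing; `missing_daemons` is a hypothetical helper, and the list of daemons is taken from the sample output above.

```bash
# Hypothetical helper: print expected daemons that do not appear in the
# jps output supplied on stdin (one "PID Name" pair per line).
missing_daemons() {
  local jps_output daemon
  jps_output="$(cat)"
  for daemon in NameNode SecondaryNameNode DataNode ResourceManager NodeManager; do
    # Compare against the whole second field, so that "NameNode"
    # is not satisfied by a "SecondaryNameNode" line.
    if ! printf '%s\n' "$jps_output" |
        awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
      echo "$daemon"
    fi
  done
}
```

Usage: `jps | missing_daemons`. Empty output means all five daemons are running.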

Prepare data

# use bin/hadoop fs... or hdfs dfs...
hdfs dfs -ls / # list nothing (but also no error), hdfs was formatted
echo "This is a test." > test.txt
hdfs dfs -mkdir /data
hdfs dfs -mkdir /data/in
hdfs dfs -put test.txt /data/in

Run example wordcount job

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /data/in /data/out
hdfs dfs -cat /data/out/part-r-00000
# Output should be like:
# This   1
# a      1
# is     1
# test.  1
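
To cross-check a result like this without a cluster, the same count can be approximated with standard tools. This is a rough local sketch, not what the Hadoop job does internally: `local_wordcount` is a hypothetical helper that splits on whitespace, sorts in byte order and prints tab-separated word and count pairs.

```bash
# Hypothetical helper: count whitespace-separated words read from stdin.
local_wordcount() {
  tr -s '[:space:]' '\n' |   # one word per line
    grep -v '^$' |           # drop empty lines
    LC_ALL=C sort |          # byte order, uppercase before lowercase
    uniq -c |                # prefix each distinct word with its count
    awk '{ printf "%s\t%s\n", $2, $1 }'
}
```

Running `local_wordcount < test.txt` on the test file created earlier should list the same four words with a count of 1 each.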

Run your own code

Create example .java file

Code from official Hadoop tutorial.

# copy paste or edit the code

Compile properly

javac -classpath "$(yarn classpath)" WordCount.java
jar cvf wc.jar WordCount*.class

Clean up

hdfs dfs -rm -r /data/out


hadoop jar wc.jar WordCount /data/in/test.txt /data/out
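
The clean-up step matters because a MapReduce job fails when its output directory already exists. The two steps can be combined in a small wrapper; this is a sketch with a hypothetical function name, `run_clean`, that uses only the `hdfs dfs -test -e`, `hdfs dfs -rm -r` and `hadoop jar` commands.

```bash
# Hypothetical helper: run_clean JAR MAIN_CLASS INPUT OUTPUT
# Removes the HDFS output directory if present, then starts the job.
run_clean() {
  local jar="$1" main_class="$2" input="$3" output="$4"
  # "hdfs dfs -test -e" exits with 0 when the path exists
  if hdfs dfs -test -e "$output"; then
    hdfs dfs -rm -r "$output"
  fi
  hadoop jar "$jar" "$main_class" "$input" "$output"
}
```

For the job above this would be `run_clean wc.jar WordCount /data/in/test.txt /data/out`.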

Get output

hdfs dfs -get /data/out/part-r-00000 output/

Set up Hadoop using Vagrant and Docker


Prerequisites:

  • VirtualBox installed
  • Vagrant installed

Create project directory

mkdir hadoop-dev
cd hadoop-dev


Create Vagrantfile

Vagrant.configure(2) do |config|
  config.vm.box = "chef/centos-7.0"
  config.vm.hostname = ""
  config.vm.network "private_network", ip: ""
  config.vm.provision "docker" do |d|
    d.run "sequenceiq/hadoop-docker",
      # --name = name of the container
      # -v = specify volume (synced folder)
      # -p = forward ports
      # HDFS, MapReduce, YARN and other ports
      args: "--name hadoop-docker
        -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 -p 50090:50090
        -p 19888:19888
        -p 8030:8030 -p 8031:8031 -p 8032:8032 -p 8033:8033 -p 8040:8040 -p 8042:8042 -p 8088:8088
        -p 49707:49707 -p 2122:2122
        -v /vagrant/data:/var/data"
  end
  config.vm.provider "virtualbox" do |vb|
    vb.memory = "2048"
  end
end

Start the VM

vagrant up

This installs Docker through the Vagrant provisioner, then pulls and runs the given Docker image. It will take a while.
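
Every `-p` flag in the Vagrantfile maps a host port to the same container port, so the list can be generated instead of typed. A small sketch, assuming that one-to-one mapping is what you want; `port_args` is a hypothetical helper whose output can be pasted into the `args` string.

```bash
# Hypothetical helper: turn a list of ports into "-p PORT:PORT" flags.
port_args() {
  local port args=""
  for port in "$@"; do
    args="$args -p $port:$port"
  done
  # strip the leading space left by the first iteration
  printf '%s\n' "${args# }"
}
```

For example, `port_args 50070 8088` prints `-p 50070:50070 -p 8088:8088`.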

SSH into the Vagrant VM and continue as root

vagrant ssh
sudo -i

Using container with Hadoop

Open a shell in the container

docker exec -it CONTAINER /bin/bash
bin/hadoop fs -put ...
bin/hadoop jar ...
bin/hdfs dfs -cat ...
bin/hadoop fs -get ...

Set up workspace for Hadoop with Maven

Set up project

Generate testing project with basic structure

mvn archetype:generate -DgroupId=cz.lttr.hadoop -DartifactId=test-hadoop -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

Add encoding to pom.xml


Add Hadoop dependencies to pom.xml


Edit source code with Eclipse

Prepare project for Eclipse

mvn eclipse:eclipse

Then import the existing Maven project in Eclipse.

Write some code.

Package and deploy

Generate .jar file

mvn package

Test the jar file

java -cp target/test-hadoop-1.0-SNAPSHOT.jar cz.lttr.hadoop.WordCount


Show dependency tree of the project jar files

mvn dependency:tree