1 Setup quick guide

1.1 Set up the first master node

  • Installing Java and SSH
# For Ubuntu 11
sudo apt-add-repository ppa:flexiondotorg/java
sudo apt-get update
sudo apt-get install sun-java6-jdk

# if the installation is inside a VM and behind a proxy
# In addition to configuring the proxies, tell sudo to preserve the environment with the -E flag
export http_proxy=http://<proxy>:<port>
export https_proxy=http://<proxy>:<port>
sudo -E apt-add-repository ppa:flexiondotorg/java

Check that Java is installed: java -version
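
The commands above only install Java; if sshd is not already present (it usually is on EC2 images), the SSH packages can be installed as well. A minimal sketch, assuming the standard Ubuntu openssh packages:

sudo apt-get install openssh-server openssh-client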

  • Adding an ubuntu user if necessary. For example, to create a new user (ubuntu) in a group (hadoop):
sudo addgroup hadoop
sudo adduser -ingroup hadoop ubuntu

In the following we use the default ubuntu:ubuntu user and group of an Amazon EC2 instance.

  • Set up passphraseless SSH
su - ubuntu
ssh-keygen -t rsa -P ""

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
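
To verify that passphraseless login now works (a quick sanity check, not part of the original steps), connect to localhost once and accept the host key:

ssh localhost
exit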
  • Download hadoop-0.20.205.0.tar.gz, unpack it, and give ownership (chown) of all files in the work directory to the user (ubuntu).
wget https://archive.apache.org/dist/hadoop/core/hadoop-0.20.205.0/hadoop-0.20.205.0.tar.gz
sudo tar xzf hadoop-0.20.205.0.tar.gz
sudo chown -R ubuntu hadoop-0.20.205.0
  • Creating a directory for HDFS (used as hadoop.tmp.dir).
sudo mkdir -p /home/ubuntu/myhdfs
sudo chown ubuntu:ubuntu /home/ubuntu/myhdfs
  • Basic Hadoop configuration
    • Adding the following in the hadoop-env.sh
    export JAVA_HOME=/usr/lib/jvm/java-6-sun
    export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
    • Adding the following between the <configuration> ... </configuration> tags in conf/core-site.xml.
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/home/ubuntu/myhdfs</value>
      <!-- For default : mkdir -p /tmp/hadoop-username/dfs -->
    </property> 
    
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:54310</value>
    </property>
    • Adding in file conf/mapred-site.xml:
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
    </property>
    • Adding in file conf/hdfs-site.xml:
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  • Format the HDFS. This is only needed the first time during setup.
hadoop-0.20.205.0/bin/hadoop namenode -format
  • Start Hadoop
hadoop-0.20.205.0/bin/start-all.sh

Use bin/hadoop fsck / to check whether all data nodes are up.
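
Another quick check (assuming the Sun JDK's jps tool is on the PATH) is to list the running Hadoop daemons; on this single-node setup you would expect NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker:

jps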

1.2 Add a new slave node

  • Installing SSH, Java 6 and Hadoop on the new slave node.

  • Adding a hadoop user if necessary.

Copy the master's public key to the slave. On the master node, type the following:

ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@159.xxx.xxx.xxx
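
If ssh-copy-id is not available, the same effect can be achieved by appending the public key manually (a sketch, assuming the slave already has an ~/.ssh directory):

cat $HOME/.ssh/id_rsa.pub | ssh ubuntu@159.xxx.xxx.xxx 'cat >> ~/.ssh/authorized_keys'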

The configuration below uses files which are all located in hadoop_home/conf.

  • Download and unzip Hadoop.

  • Hadoop configuration
    • Do the same in file hadoop-env.sh as we did for the master node.
    • Copying core-site.xml, hdfs-site.xml and mapred-site.xml from the master to the slave node. In this case, there are no custom settings on the new node. The localhost entries should be replaced with the IP address of the master node.
  • Editing the conf/slaves file of the master node, appending the IP address or hostname of the slave node at the end of the file:

localhost
159.xxx.xxx.xxx
  • Restarting the master node with bin/start-all.sh. After a moment, the new node will be initialized and appear in the web admin interface of the master.
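
A minimal restart sequence on the master (assuming the hadoop-0.20.205.0 directory used above):

hadoop-0.20.205.0/bin/stop-all.sh
hadoop-0.20.205.0/bin/start-all.sh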

1.3 Add and remove a configured slave node

  • Add node
    • Firstly, add the node's IP address or hostname to the Hadoop_Home/conf/slaves file located on the master node, and make sure that the node is not listed in the exclude file.
    • A new node can be started without affecting the existing running cluster. Running bin/hadoop-daemon.sh start datanode and bin/hadoop-daemon.sh start tasktracker on the new node will start the data storage and task tracker processes, thereby adding it to the cluster (see the command sketch after this list).
  • Remove node
    • Firstly, the node must be added to the exclude file. Next, the command bin/hadoop dfsadmin -refreshNodes must be run on the master server. This forces the master to repopulate the list of valid nodes from the slaves and exclude files.
    • The master server will begin a decommissioning process on that node. Decommissioning a node is meant to prevent data loss and lasts until all blocks on the node are replicated.
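
As a command sketch for both cases (paths are relative to the Hadoop install directory; the exclude file is whatever the dfs.hosts.exclude property points to):

# On the new node: start the data node and task tracker daemons
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker

# On the master: after adding a node to the exclude file, re-read the node lists
bin/hadoop dfsadmin -refreshNodes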

1.4 Troubleshooting

DFS Format

bin/hadoop namenode -format # DFS format command
  • Problem: Format aborted
Format aborted in /home/hduser/hadooptmp/dfs/name
11/10/25 04:29:40 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1

If the name node is already shut down, go to the dfs directory (under hadoop.tmp.dir) and manually delete all files. After this, run the format command again.
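
With the hadoop.tmp.dir used in this guide, the cleanup would look roughly as follows (an assumption based on the configuration above; double-check the path before deleting anything):

rm -rf /home/ubuntu/myhdfs/dfs
bin/hadoop namenode -format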

  • Problem: safe mode

The name node is stuck in safe mode (org.apache.hadoop.dfs.SafeModeException). Use the following command to turn off safe mode.

bin/hadoop dfsadmin -safemode leave
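
To check the current safe mode status before and after, the same dfsadmin tool can be used:

bin/hadoop dfsadmin -safemode get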

2 Development

2.1 Development with Eclipse plugin

Following various Hadoop tutorials, I ran into the following problems with the Hadoop Eclipse plugin.

  • Problem: failed on connection

The possible error messages:

Error:null

Error: Call to localhost/127.0.0.1:54310 failed on local exception: java.io.EOFException

Error: Call to localhost/127.0.0.1:54310 failed on connection exception: 
java.net.ConnectException: Connection Refused. 

Fix:

Install Cygwin, then append the Cygwin directories to the Path environment variable: ;c:\cygwin\bin;c:\cygwin\usr\bin. Restart Eclipse and the problem should be fixed.

  • Problem: Permission denied
Permission denied: user=xxx\xxxxx, access=WRITE, inode="":hduser:supergroup:rwxr-xr-

Fix:

Change the permissions. This must be done in the DFS, not in the local file system:

hadoop fs -chmod -R ugo+rwx /user 
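
A narrower alternative (the user name below is a placeholder for the actual Windows user shown in the error message) is to give that user its own HDFS home directory instead of opening /user to everyone:

# run as the HDFS superuser (e.g. hduser)
hadoop fs -mkdir /user/xxxxx
hadoop fs -chown xxxxx /user/xxxxx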

3 Hadoop tutorial

  • VM image hadoop-appliance-0.18.0.vmx: http://developer.yahoo.com/hadoop/tutorial/index.html
  • http://v-lad.org/Tutorials/Hadoop/00%20-%20Intro.html
