Hadoop Virtual Cluster Appliance



Welcome to the Hadoop virtual cluster appliance tutorial. The purpose of this tutorial is to introduce you to an extension of the Grid appliance that integrates the Hadoop distributed file system and map/reduce framework for data-intensive parallel computing. This highlights a complementary application of the Grid Appliance.


Introduction

Quoting the Hadoop web site, "Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data". This open-source platform provides a scalable distributed file system (HDFS) and implements a map/reduce programming environment. Hadoop is very similar in nature and goals to the Google file system and map/reduce framework, and both Yahoo! and Google have projects around Hadoop: check out the Hadoop and distributed computing at Yahoo! blog and this joint IBM/Google announcement for examples of where Hadoop is being used.

The Hadoop virtual cluster appliance extends our baseline Grid appliance to enable easy deployment of virtual Hadoop clusters. Setting up and configuring a complex distributed system like Hadoop is non-trivial and requires a substantial investment of time; this appliance makes it simpler to deploy your own Hadoop cluster and quickly assess the potential of this exciting technology in your own local environment.

This tutorial assumes you have gone through the introductory presentations describing the Grid appliance and read through the documentation describing the overall Hadoop architecture and goals. At the end of the tutorial you will have bootstrapped a small Hadoop virtual cluster and run a simple map/reduce Hadoop streaming job.

Installing the hadoop package

Here are the basic steps you need to go through to deploy your own Hadoop virtual cluster for this tutorial:

  1. Download a Grid Appliance (v 2.04 or higher). (Alternatively, you can create your own Grid Appliance by following the steps mentioned in Testing Grid Appliance).
  2. Follow the instructions to create your own GroupVPN and appliance pool (Deploying independent appliance pools - PlanetLab). (Alternatively, you may also use our public Grid appliance pool for testing purposes, but be aware that if you do so your nodes will not be secure and remote users will be able to ssh into them).
  3. Install the grid-appliance-hadoop package using the following commands:
    sudo bash
    echo "deb http://archive.canonical.com/ lucid partner" >> /etc/apt/sources.list
    apt-get update
    apt-get install grid-appliance-hadoop
    • A window will appear with information about the installation/configuration of the Java JRE. Press the Tab key and then Enter. Another window will then ask whether you agree with the DLJ license terms; press Tab to select Yes and then Enter if you do. (These loose ends are to be cleaned up in the 10.4 package.)
    • A default 'hadoopuser' account will be created automatically after the installation (an optional sanity check is sketched after this list).
  4. Repeat the above steps for all the nodes you want in the virtual cluster. Alternatively, you can create a custom appliance image that already includes the Hadoop package by following the instructions in the "Customizing the Grid appliance image for local deployments" tutorial, and then replicate the customized image to create your virtual cluster.
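
Optionally, you can verify the installation on each node with standard Debian/Ubuntu tools, for example:
    dpkg -l grid-appliance-hadoop   # the package should be listed as installed (status "ii")
    id hadoopuser                   # the default user created by the package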

Configuring the Hadoop cluster

To configure the Hadoop cluster, you first need to select a node that will serve as the cluster's namenode. We will refer to this appliance as the "server" VM, and to the other nodes as "workers".

Log in to the "server" appliance. The "server" VM needs to discover the names of the "worker" VMs, generate the conf/core-site.xml and conf/slaves configuration files, and propagate core-site.xml to the workers. All of this is handled by the following commands (you can inspect the generated files afterward, as sketched below):

su - hadoopuser
. ./.bashrc
cd $HADOOP_HOME
./bin/gather-workers.sh
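
Once gather-workers.sh finishes, you can sanity-check the files it generated. The exact contents depend on your GroupVPN setup, so the comments below are only indicative:
    cat conf/slaves         # should list one discovered "worker" host per line
    cat conf/core-site.xml  # fs.default.name should point at this "server" node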


Hadoop file system (HDFS) example

In this example, you will store a file in HDFS from the "server" appliance, retrieve its contents from a worker node, and optionally copy it back out of HDFS (see the last step below).

  • First, format the Hadoop file system namenode with the following command (only run it in the "server" appliance):
    cd $HADOOP_HOME
    ./bin/hadoop namenode -format
    • Note: the namenode format command needs to be run only during the first setup.
  • Now start all the Hadoop daemons on the server and worker node(s) with the following command (you will need to type "yes" when prompted by ssh):
    ./bin/start-all.sh
  • Copy the README.txt file to HDFS with the command:
    ./bin/hadoop dfs -copyFromLocal README.txt README.txt
  • Check that the README.txt file is in the HDFS file system by issuing:
    ./bin/hadoop dfs -ls
  • Inspect its contents with:
    ./bin/hadoop dfs -cat README.txt
  • Run the following commands from a worker VM; you should see the same listing and contents of README.txt:
    cd $HADOOP_HOME
    ./bin/hadoop dfs -ls
    ./bin/hadoop dfs -cat README.txt
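  • Optionally, you can also copy the file back out of HDFS to the local file system and compare it with the original (the local target path below is just an example):
    ./bin/hadoop dfs -copyToLocal README.txt /tmp/README.from-hdfs.txt
    diff README.txt /tmp/README.from-hdfs.txt   # no output means the contents match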


Map/reduce example

This example illustrates the use of Hadoop's "streaming" feature to run Python-based map and reduce scripts. In this simple example, map/reduce is used to count the number of even and odd numbers in a set of input files. It is based on the example described in this blog. Another example of Hadoop streaming can be found in this Wiki.

  • First, go inside the demos directory:
    cd $HADOOP_HOME/demos
  • Take a look at the mapper.py and reducer.py scripts. The mapper.py script scans through an input file and determines, for each number it finds, whether it is even or odd (a rough shell equivalent of this logic is sketched after the sample output below). For example, if you run:
    ./mapper.py < demoinputs/input1.txt
  • The input file demoinputs/input1.txt contains the numbers:
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10
 11
 12
 13
 14
 15
 16

and the result of running mapper.py on it will be the following:

 1 1
 0 2
 1 3
 0 4
 1 5
 0 6
 1 7
 0 8
 1 9
 0 10
 1 11
 0 12
 1 13
 0 14
 1 15
 0 16  
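
For reference, the mapping logic amounts to emitting the parity of each number (1 for odd, 0 for even) followed by the number itself. A rough shell equivalent of that logic is shown below purely as an illustration; the bundled mapper.py remains the authoritative version:
    # Illustration only: emit "<parity> <number>" for each integer read from the input,
    # where parity is 1 for odd numbers and 0 for even numbers.
    while read n; do
        echo "$((n % 2)) $n"
    done < demoinputs/input1.txt
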
  • The reducer.py script computes the count and the sum of the even numbers and of the odd numbers. For example, if you run:
    cat demoinputs/input1.txt | ./mapper.py | ./reducer.py

You will obtain:

even    count:8 sum:72
odd     count:8 sum:64

where 72 = 2+4+6+8+10+12+14+16 and 64 = 1+3+5+7+9+11+13+15.

Now take a look at the script mapred_python_streaming_example.sh. This script accomplishes the following (a sketch of the underlying streaming command appears at the end of this example):

  1. It copies the demoinputs directory to HDFS; there are two input files within it
  2. It starts a map-reduce streaming job on the Hadoop cluster, using mapper.py and reducer.py as the mapper and reducer tasks, respectively
  3. It displays the results stored in the output directory (demooutputs) created by this job
  • To run this demonstration:
    ./mapred_python_streaming_example.sh

You should see as a result:

even    count:16 sum:944
odd     count:16 sum:928
  • If you want to delete the demoinputs and demooutputs directories from HDFS, run:
    $HADOOP_HOME/bin/hadoop dfs -rmr demoinputs
    $HADOOP_HOME/bin/hadoop dfs -rmr demooutputs
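
For reference, the heart of mapred_python_streaming_example.sh is a Hadoop streaming job submission roughly along the following lines. Treat this as a sketch rather than a drop-in replacement: the streaming jar's location and name depend on the Hadoop version bundled with the appliance, and the script itself remains the authoritative version.
    # Sketch of a typical Hadoop streaming invocation, run from $HADOOP_HOME (paths are illustrative)
    ./bin/hadoop dfs -copyFromLocal demos/demoinputs demoinputs
    ./bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
        -input demoinputs -output demooutputs \
        -mapper mapper.py -reducer reducer.py \
        -file demos/mapper.py -file demos/reducer.py
    ./bin/hadoop dfs -cat demooutputs/part-*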

Additional resources

There are several tutorials and examples on the Hadoop web site and elsewhere. Some starting points:

Follow this link for detailed information on HDFS

Follow this link for a comprehensive Hadoop map/reduce tutorial

Michael Noll's Wiki has excellent examples showing how to set up and run Hadoop standalone and cluster systems on Ubuntu
