Welcome to the Hadoop virtual cluster appliance tutorial. The purpose of this tutorial is to introduce you to the usage of the an extension of the Grid appliance which integrates the Hadoop file system and map/reduce framework for data-intensive parallel computing. This highlights a complementary application of the Grid Appliance.
The tutorial contains the following sections:
Quoting the Hadoop web site, "Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data". This open-source platform supports a scalable distributed file system (HDFS) and implements a map/reduce programming environment. Hadoop is very similar in nature and goals to the Google file system and map/reduce framework, and both Yahoo! and Google have projects around Hadoop: check out the Hadoop and distributed computing at Yahoo! blog and this joint IBM/Google announcement for examples of where Hadoop is being used.
The Hadoop virtual cluster appliance extends or baseline Grid appliance to enable easy deployment of virtual Hadoop clusters. The set up and configuration of complex distributed systems like Hadoop is non-trivial and requires substantial investment of time; this appliance makes it simpler to deploy one's own Hadoop cluster to quickly assess the potential of this exciting technology in one's own local environment.
The tutorial assumes you have gone through the introductory presentations describing the Grid appliance and read through the documentation describing the overall Hadoop architecture and goals. At the end of the tutorial you will have bootstrapped a small Hadoop virtual cluster and run a simple map/reduce Hadoop streaming job.
Here are the basic steps you need to go through to deploy your own Hadoop virtual cluster for this tutorial:
- Install VM software in the resources you will use. At the minimum, one physical machine with 1GB RAM is sufficient. VMware Player or Server are recommended.
- Download and install the Hadoop appliance image.
- Configure virtual appliance floppies for your resource pool using our Web interface.
- Deploy Hadoop "server" appliance and one or more Hadoop "worker" appliances. The "server" appliance runs the HDFS namenode and map/reduce job tracker, the "worker" appliances run the HDFS data nodes and map/reduce task trackers.
- Configure and test the Hadoop deployment with demo applications.
Setting up the baseline Grid appliance pool
- Download the Hadoop virtual appliance image
- Follow the tutorial "Deploying independent appliance pools with PlanetLab as bootstrap" to set up a baseline pool with one server and one or more workers.
- The Hadoop virtual appliance image essentially consists of the baseline Grid appliance image described in this tutorial with an additional module (the opt.vmdk virtual disk) with Hadoop-specific software and configuration.
- Alternatively, if you do not want to use our public PlanetLab infrastructure and deploy your pool in local resources, you may follow the tutorial "Deploying independent appliance pools with your own bootstrap infrastructure"
- Check that the condor_status command works from the appliance and that the output of this command displays as many worker nodes as you have deployed in step 2.
Configuring the Hadoop cluster
Log in to the "server" appliance as user "griduser". To configure the Hadoop cluster, the "server" VM needs to discover the names of the "worker" VMs, generate the conf/hadoop-site.xml and conf/slaves configuration files, and propagate the hadoop-site.xml configuration to the workers. All this functionality can be achieved with the following command:
Hadoop file system (HDFS) example
In this example, you will store a file into HDFS in the server node and retrieve its content from a worker node.
First, format the Hadoop file system namenode with the following command (in the "server" node):
./bin/hadoop namenode -format
Formatting the namenode as described above a one-time operation.
Now start all the Hadoop daemons on the server and worker node(s) with the following command:
Copy the README.txt file to HDFS with the command:
./bin/hadoop dfs -copyFromLocal README.txt README.txt
Check that the README.txt file is in the HDFS file system by issuing:
./bin/hadoop dfs -ls
Inspect its contents with:
./bin/hadoop dfs -cat README.txt
Repeat the two steps abot (cd Hadoop/hadoop; /.bin/hadoop dfs -ls; ./bin/haddop dfs -cat README.txt) from a worker VM; you should see the same contents of README.txt from the worker node(s).
This example illustrates the use of Hadoop's "streaming" feature to run Python-based map and reduce scripts. In this simple example, map/reduce is used to count the number of even and odd numbers in a set of input files. It is based on the example described in this blog. Another example of Hadoop streaming can be found in this Wiki.
First, go inside the demos disrectory:
Take a look at the mapper.py and reducer.py scripts. Mapper.py scans through an input file and determines whether each number it finds is even or odd. For example, if you run:
./mapper.py < demoinputs/input1.txt
The result of processing the input file demoinputs/input1.txt:
will be the following:
The reduce.py script computes the sum of even and odd numbers. For example, if you run:
cat demoinputs/input1.txt | ./mapper.py | ./reduce.py
You will obtain:
even count:8 sum:72
odd count:8 sum:64
Where 72=2+4+6+8+10+12+14+16 and 64=1+3+5+7+9+11+13+15
Now take a look at the script mapred_python_streaming_example.sh. This script accomplishes the following:
1. It copies the demoinputs directory to HDFS; there are two input files within it
2. It starts a map-reduce streaming job on the Hadoop cluster, using mapper.py and reducer.py as the mapper and reducer tasks, respectively
3. It shows the results of the output directory (demooutputs) created by this job
To run this demonstration:
You should see as a result:
even count:16 sum:944
odd count:16 sum:928
If you want to delete the demo inputs and outputs directories from HDFS:
$HADOOP_HOME/bin/hadoop dfs -rmr demoinputs
$HADOOP_HOME/bin/hadoop dfs -rmr demooutputs
Additional resources and future plans
There are several tutorials and examples in the Hadoop web site and elsewhere. Some starting points:
There are many interesting things yet to be done with this appliance. Join our user's group and contact us if you are interested in contributing. Here are things yet to be done:
- Integrate with Hadoop-on-demand to fully leverage Condor to deploy Hadoop on demand
- Integrate resource discovery with IPOP's distributed hash table for simple configuration
- Expand documentation to include how to customize the appliance with larger virtual disks
- Incorporation of additional examples
- Performance analysis
In addition to the Grid appliance developers mentioned in "About us", this appliance had contributions from Ketan Deshpande, Kiran Kulkarni and Mukund Ingale at the U. Florida, who configured and tested Hadoop in the opt.vmdk module and implemented configuration scripts.