Hadoop Quickstart: Build a Cluster In The Cloud In 20 Minutes Or Less

Editor’s note: The tips Bryan Halfpap provides below really work. I stood up a working Hadoop Cluster in under 20 minutes, from cold iron to production ready, using just his guidance and a Rackspace account. bg

I’ve been working with Apache Hadoop in my lab, spending much of that time with CDH3 (the Cloudera Distribution including Apache Hadoop). As part of examining the best way to move from test, evaluation and prototyping to production systems, I’ve been looking at cloud-based cluster capabilities. Crucial Point CTO Bob Gourley and I recently set up clusters in Amazon EC2 and Rackspace, and we were both impressed with the ability to get powerful systems up and running fast at very low cost.

With this post I’ll share some of the tips and techniques we used to get these clusters up and running. They flow from the guide provided by Cloudera (see the link at the end of this post) but include a few other important tips that will help you get operational fast. I also provide more explanatory context to help you understand what is going on as you execute these instructions. My only assumption is that you understand the basic Linux command-line environment and have access to systems (perhaps in your own environment or, as we did, at Amazon or Rackspace). Please note that this guide is written for 64-bit Red Hat-flavored systems (CentOS, Fedora, etc.) and will need to be tailored for other operating systems.

Also note that we did things this way because we want to learn and to teach, and we thought many of you would like to see what is going on as well. There are faster ways than this: configuration management systems let you plan, start, manage and operate clusters through a GUI and then manage them over the lifecycle of a production environment (see, for example, the free Cloudera Manager). We will provide more on tools like that in the future. For now we want to show you how to stand up your environment with CDH3 from the command line. This is the fun way. We will walk you through the enterprise way in the near term.

Our first step flows from an understanding that Hadoop utilizes a lot of Java, and for that, you want to be sure that you are using the Sun/Oracle version. This ensures that any problems you come across aren’t in the Java implementation itself.

Bring up the Linux system you wish to install CDH/Hadoop on and run the command below at the command prompt. It lists any installed “OpenJDK” packages. OpenJDK is an open-source implementation of Java that we will need to remove.

rpm -qa | grep openjdk

If that returns a list of openjdk packages, you need to remove them before going further. The command below removes every installed package whose name starts with “java”, which takes OpenJDK off the system. The pattern is quoted so that the shell passes it to yum unchanged instead of expanding it against files in the current directory. We want to do this because the Sun/Oracle Java is much more stable and will ensure that all required features are present in Java.

yum remove 'java*'
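
If you would rather see exactly which packages are affected before removing anything, the check and the removal can be combined into a small script. This is a minimal sketch and not part of Cloudera's guide; filter_openjdk is a helper name I made up for illustration, and the script only prints the removal command rather than running it.

```shell
#!/bin/bash
# filter_openjdk is a hypothetical helper, not a standard tool.
# It reads package names on stdin and keeps only the OpenJDK ones.
filter_openjdk() {
    grep -i 'openjdk' || true   # "|| true": finding no match is not an error here
}

# Only query the RPM database when rpm is actually present.
if command -v rpm >/dev/null 2>&1; then
    found=$(rpm -qa | filter_openjdk)
    if [ -n "$found" ]; then
        echo "OpenJDK packages found:"
        echo "$found"
        # $found is left unquoted so each package becomes its own word.
        echo "Remove them with: yum remove" $found
    else
        echo "No OpenJDK packages installed; nothing to remove."
    fi
fi
```

Naming the packages explicitly is narrower than the wildcard, which also matches any other package whose name begins with “java”.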

Once the packages have been removed, download the Sun/Oracle Java packages, starting with the Java Development Kit (JDK).

For 64-bit versions:

wget http://download.oracle.com/otn-pub/java/jdk/7u2-b13/jdk-7u2-linux-x64.rpm

Now we download the Java Runtime Environment (JRE).

For 64-bit versions:

wget http://download.oracle.com/otn-pub/java/jdk/7u2-b13/jre-7u2-linux-x64.rpm
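
Before installing, it can be worth confirming that both downloads really are RPM packages, since a failed or redirected download can leave something else saved under the .rpm name. The sketch below looks for the four-byte RPM magic number (ed ab ee db) at the start of each file. The helper name is_rpm is my own, and the script assumes the files sit in the current directory.

```shell
#!/bin/bash
# is_rpm FILE: succeed if FILE exists and starts with the RPM magic bytes.
# (is_rpm is a hypothetical helper name used for this illustration.)
is_rpm() {
    [ -f "$1" ] || return 1
    local magic
    # Dump the first four bytes as hex and strip the whitespace od adds.
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "edabeedb" ]
}

for f in jdk-7u2-linux-x64.rpm jre-7u2-linux-x64.rpm; do
    if is_rpm "$f"; then
        echo "$f looks like an RPM package"
    else
        echo "$f is missing or is not an RPM package"
    fi
done
```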

Now that we have the packages downloaded, it is time to install them. Use the Red Hat package manager (rpm) to install these files with the commands below.

rpm -i jre-7u2-linux-x64.rpm
rpm -i jdk-7u2-linux-x64.rpm

At this point we are ready to start downloading the Cloudera Distribution including Apache Hadoop (CDH). Users of recent Red Hat-flavored distributions (Fedora 12+, the latest versions of CentOS) can safely use the Red Hat/CentOS 6 repository from Cloudera to install Hadoop. If you are running earlier versions of those flavors, use the CentOS 5 repo.

For Latest Versions: http://archive.cloudera.com/redhat/6/x86_64/cdh/cloudera-cdh3.repo

For Legacy Versions: http://archive.cloudera.com/redhat/cdh/cloudera-cdh3.repo

You may notice that this is a repo file — it’ll add a repository to your package manager and help you keep your CDH software up to date. Simply add it to the yum repository like this (as root):

cd /etc/yum.repos.d/
wget http://archive.cloudera.com/redhat/6/x86_64/cdh/cloudera-cdh3.repo
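
A quick sanity check after the download can save confusion later: if wget saved an error page instead of the repo definition, yum will not see the Cloudera packages. The sketch below only checks that the file has the general shape of a yum repo file (a [section] header plus a baseurl or mirrorlist line). The helper name repo_file_ok is hypothetical, and the check says nothing about whether the repository itself is reachable.

```shell
#!/bin/bash
# repo_file_ok FILE: succeed if FILE looks like a yum .repo definition.
# (repo_file_ok is a made-up helper name for this illustration.)
repo_file_ok() {
    [ -f "$1" ] &&
        grep -Eq '^\[[^]]+\]' "$1" &&
        grep -Eq '^(baseurl|mirrorlist)[[:space:]]*=' "$1"
}

repo=/etc/yum.repos.d/cloudera-cdh3.repo
if repo_file_ok "$repo"; then
    echo "$repo looks like a valid repo file"
else
    echo "$repo is missing or malformed"
fi
```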

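Then update your package information from the repo with yum. Be aware that this command also upgrades any installed packages that have newer versions available, so it may take a few minutes on a fresh system.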

yum update

We use the repositories because it’s easier to keep CDH up-to-date with the package manager rather than building from source every time there is an update. Maintenance and setup are easier, and it’s easier to reinstall should we need to do that. Adding the repositories also allows you to install additional packages with yum with a few simple commands, rather than having to search for, then download and install any additional software add-ons for your Hadoop ecosystem.

Now you should be ready for the next step: downloading and installing Cloudera’s Hadoop. Do this with yum by using the following command:

yum install hadoop-0.20

This will install the base of Hadoop and allow you to install the other packages that Hadoop requires to operate, namely the jobtracker, tasktracker, namenode, and datanode. All of those trackers and nodes are required in order to operate a Hadoop cluster. Choose which ones this machine will run, or install all of them at once if you are running in “pseudo-distributed mode”. Pseudo-distributed mode treats one computer as if it were an entire cluster, and is great for testing prototype jobs or doing development work.

To install the packages, simply follow this syntax:

yum install hadoop-0.20-[NAME OF THE PACKAGE]
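
For a machine that needs all four daemons, the pattern above can be looped. This is a minimal sketch: the package names are built from the daemon names listed earlier, hadoop_packages is a helper name of my own, and RUN_INSTALL is a hypothetical opt-in switch (not a yum feature) so that the script only prints the yum command unless you ask it to run.

```shell
#!/bin/bash
# hadoop_packages DAEMON...: print one CDH3 package name per daemon.
# (hadoop_packages is a made-up helper name for this illustration.)
hadoop_packages() {
    local d
    for d in "$@"; do
        printf 'hadoop-0.20-%s\n' "$d"
    done
}

pkgs=$(hadoop_packages namenode datanode jobtracker tasktracker)

# RUN_INSTALL is a hypothetical switch: set it to 1 to really install.
if [ "${RUN_INSTALL:-0}" = "1" ]; then
    yum install -y $pkgs      # unquoted on purpose: one argument per package
else
    echo "Would run: yum install -y" $pkgs
fi
```

Run it once as-is to review the command, then again as root with RUN_INSTALL=1 set in the environment to perform the install.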

If you want to run Hadoop on just one node, install all of the packages, then use the following configuration for Hadoop in the /etc/hadoop-0.20/conf folder: http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed
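
For reference, the pseudo-distributed settings on that Apache page amount to three small files. The sketch below writes them into a scratch directory so you can review them before copying anything into /etc/hadoop-0.20/conf. The property values (hdfs://localhost:9000, localhost:9001, and a replication factor of 1) are the ones given in the linked Apache 0.20.2 quickstart. The function name write_pseudo_conf and the OUT_DIR variable are my own, and the configuration shipped with your CDH packages may use different ports.

```shell
#!/bin/bash
# write_pseudo_conf DIR: write the three pseudo-distributed config files into DIR.
# (write_pseudo_conf and OUT_DIR are hypothetical names for this illustration.)
write_pseudo_conf() {
    local dir=$1
    mkdir -p "$dir" || return 1

    # Where clients find the HDFS namenode.
    cat > "$dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

    # One machine means one copy of each block.
    cat > "$dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

    # Where tasktrackers and clients find the jobtracker.
    cat > "$dir/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF
}

OUT_DIR=${OUT_DIR:-${TMPDIR:-/tmp}/hadoop-pseudo-conf}
write_pseudo_conf "$OUT_DIR" && echo "Wrote config files to $OUT_DIR"
```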

Below is a simple bash script for pseudo-distributed installs of CDH. This script will start the Hadoop services for you:

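#!/bin/bash
# Start the HDFS daemons first (namenode, then datanode),
# then the MapReduce daemons (jobtracker, then tasktracker).
service hadoop-0.20-namenode start
service hadoop-0.20-datanode start
service hadoop-0.20-jobtracker start
service hadoop-0.20-tasktracker start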

Copy and paste that code into a file named “start-hadoop”, save it, and mark the file executable with the chmod command:

chmod +x start-hadoop

Then run it to see if everything starts up fine with:

./start-hadoop
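
One way to see whether everything actually came up is to ask each init script for its status. This sketch assumes the CDH init scripts accept the conventional status action, which I have not confirmed for every package. The names check_daemons and SERVICE_CMD are hypothetical, introduced so the loop can be tried out without touching real services.

```shell
#!/bin/bash
# SERVICE_CMD is a hypothetical override hook; it defaults to the real "service".
SERVICE_CMD=${SERVICE_CMD:-service}

# check_daemons: report the status of the four Hadoop daemons.
# Returns non-zero if any of them is not running.
check_daemons() {
    local failed=0 d
    for d in namenode datanode jobtracker tasktracker; do
        if "$SERVICE_CMD" "hadoop-0.20-$d" status >/dev/null 2>&1; then
            echo "hadoop-0.20-$d: running"
        else
            echo "hadoop-0.20-$d: not running"
            failed=1
        fi
    done
    return "$failed"
}

if command -v "$SERVICE_CMD" >/dev/null 2>&1; then
    check_daemons || echo "At least one daemon is down."
fi
```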

If you did everything correctly, you should be able to run the following commands:

hadoop fs -copyFromLocal [filename] [filename in hdfs]

hadoop fs -cat [filename in hdfs]

The names in brackets are the name of a local file you wish to import and the name it should have in HDFS (the Hadoop Distributed File System); the second command then displays the imported file. If these commands fail, you may need to repeat one of the steps above or make sure you installed all the components you need.
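
The two commands above can be wrapped into a repeatable smoke test. This is a sketch under a few assumptions: the function name hdfs_roundtrip and the default file name are my own, and the current user must be allowed to write to the chosen HDFS path (a relative path resolves to that user's HDFS home directory).

```shell
#!/bin/bash
# hdfs_roundtrip [HDFS_PATH]: copy a small file into HDFS, read it back,
# remove it again, and succeed only if the content survived the trip.
# (hdfs_roundtrip is a hypothetical helper name for this illustration.)
hdfs_roundtrip() {
    local dest=${1:-hdfs-smoke-test-$$.txt}
    local src out
    src=$(mktemp) || return 1
    echo "hello hadoop" > "$src"

    if ! hadoop fs -copyFromLocal "$src" "$dest"; then
        rm -f "$src"
        return 1
    fi
    out=$(hadoop fs -cat "$dest")
    hadoop fs -rm "$dest" >/dev/null 2>&1   # clean up the HDFS copy
    rm -f "$src"                            # clean up the local copy

    [ "$out" = "hello hadoop" ]
}

# Only run the test when the hadoop command is installed.
if command -v hadoop >/dev/null 2>&1; then
    if hdfs_roundtrip; then
        echo "HDFS round trip succeeded"
    else
        echo "HDFS round trip failed"
    fi
fi
```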

If you have any questions or if you’ve found an error in this tutorial, please let me know on Twitter (@Crypt0s).

Full documentation, including steps to install on Debian-based systems, can be found at Cloudera’s website.

Note: Keep checking CTOvision.com for more Hadoop updates, including posts on software for Hadoop like Whirr. Whirr is another great capability, enabling the fast standup of cloud clusters by use of a simple properties file you update with your cloud provider info. We also have posts coming on the use of Cloudera Manager. Also, as previously mentioned, if you would like fast tips for standing up the cloud-based servers to run these Hadoop clusters on, see our Quickstart guides to Amazon and Rackspace servers.