Quick Hadoop Startup in a Virtual Environment

Posted on: Jan 28, 2016

A fully-featured Hadoop environment has a number of pieces that need to be integrated. Vagrant and Ansible are just the tools to make things easier.

When getting started with Hadoop, it is useful to have a test environment where you can quickly try out programs on a small scale before submitting them to a real cluster (or before setting that cluster up). There are instructions on the Hadoop website for running Hadoop as a single Java process. However, I've found that running this way hides a lot of how Hadoop really works for large-scale applications, which can slow your understanding of the kinds of problems that must be solved to make an implementation correct and performant on a real cluster.

The same page also describes a pseudo-distributed mode. Don't let the "pseudo" throw you off: in this mode, exactly the same Hadoop processes run, in the same way they do on a full cluster. The only thing missing, besides the number of parallel resources, is the setup for High Availability. That's important for a real cluster, but it doesn't affect the way Hadoop jobs are written. So in my view, pseudo-distributed mode makes a great test environment.

The instructions for pseudo-distributed mode still include a lot of files to edit and commands to run. To make it easier to get up and running, I've created a virtual environment using Vagrant and Ansible that will do the installation automatically.
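
To make those file edits a little less abstract, here is a minimal Java sketch of the settings the pseudo-distributed instructions revolve around. The key names and the hdfs://localhost:9000 convention come from the standard Apache docs; treat this as an illustration of what ends up in core-site.xml, hdfs-site.xml, and mapred-site.xml, not as part of the automated setup itself.

import org.apache.hadoop.conf.Configuration;

public class ShowPseudoDistributedConfig {
    public static void main(String[] args) {
        // Configuration loads core-site.xml, hdfs-site.xml, etc. from the
        // classpath; these are the files the pseudo-distributed instructions
        // have you edit (and which the automated setup fills in for you).
        Configuration conf = new Configuration();

        // fs.defaultFS points clients at the Name Node; in pseudo-distributed
        // mode this is typically hdfs://localhost:9000 (set in core-site.xml).
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));

        // dfs.replication is usually lowered to 1, since a single Data Node
        // has nowhere to put extra copies (set in hdfs-site.xml).
        System.out.println("dfs.replication = " + conf.get("dfs.replication"));

        // mapreduce.framework.name = yarn hands job scheduling to YARN
        // instead of the single-process local runner (set in mapred-site.xml).
        System.out.println("mapreduce.framework.name = " + conf.get("mapreduce.framework.name"));
    }
}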

Vagrant and Ansible have good support and good docs for running on multiple platforms, so I'll assume we're starting in an environment where they are both available (as well as some virtualization software; Vagrant defaults to VirtualBox). I'll focus first on the Hadoop components that are being installed and set up, then show how the automated installation and setup works.

Hadoop Components

Hadoop Distributed File System

All of the Hadoop components we'll be using come in the standard tarball. First, we need to get the Hadoop Distributed File System (HDFS) up and running. HDFS provides the storage (both input and output) for Hadoop jobs; most Hadoop jobs start by reading one or more HDFS files and finish by leaving one or more HDFS files behind.
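
To give a feel for how a job sees HDFS, here is a short Java sketch using the standard FileSystem API: it writes a small input file into HDFS and reads it back. The /user/vagrant path is just a placeholder for whatever user the VM runs jobs as.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml, so the path below resolves
        // to a location inside HDFS rather than on the local disk.
        FileSystem fs = FileSystem.get(new Configuration());

        // Write a small file, the way a job's input usually starts out.
        Path input = new Path("/user/vagrant/input/sample.txt"); // placeholder path
        try (FSDataOutputStream out = fs.create(input, true)) {
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back, the way you would inspect a job's output.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(input), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}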

HDFS is divided into two main components, the Name Node and the Data Node. (There is a third component, the Journal Manager, that is used in High Availability setups.) The Name Node manages the HDFS equivalent of a file allocation table: for every file written to HDFS, it keeps track of where the pieces are located. Like a regular file system, HDFS divides each file into "blocks"; however, the blocks are generally distributed across the network and replicated, both for improved performance and to protect against drive failure. The amount of replication is configurable per file; the standard default is 3.
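
Because replication is a per-file property, you can inspect or change it from client code. This sketch uses the standard FileSystem calls; the file path is again just a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/vagrant/input/sample.txt"); // placeholder path

        // Ask the Name Node how many copies of each block it maintains
        // for this particular file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("current replication: " + status.getReplication());

        // Request a different replication factor for just this file; the
        // Name Node schedules the extra copies on other Data Nodes. In our
        // single-Data-Node VM the file simply stays under-replicated,
        // because there is nowhere to put the additional copies.
        fs.setReplication(file, (short) 3);
    }
}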

The HDFS Data Nodes handle storing and retrieving blocks. They are provided with storage, typically on top of an existing local file system (e.g. EXT3 or XFS). The Data Nodes register themselves with the Name Node, which keeps track of how much total space is available and of the health of each Data Node. This allows the Name Node to detect when a Data Node fails and to make additional copies of the blocks that node held, keeping up the configured replication factor.
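
The Name Node's aggregate view of that Data Node storage is visible to clients as well. This sketch, using the standard FileSystem.getStatus() call, prints the totals the Name Node has collected from its registered Data Nodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class ClusterCapacity {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // The Name Node aggregates the storage reports it receives from
        // every registered Data Node; getStatus() exposes those totals.
        FsStatus status = fs.getStatus();
        System.out.println("capacity : " + status.getCapacity() + " bytes");
        System.out.println("used     : " + status.getUsed() + " bytes");
        System.out.println("remaining: " + status.getRemaining() + " bytes");
    }
}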

The Data Nodes also provide direct access to blocks for HDFS clients. While a client must first go to the Name Node to determine which Data Nodes have the blocks of interest, it can then read from or write to those blocks by going directly to the Data Nodes. This prevents the Name Node from becoming a bottleneck.
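
You can watch that two-step lookup from client code. The sketch below asks the Name Node for the block locations of a file (placeholder path again) and prints the Data Node hosts a client would then contact directly; in our single-node VM every block will, of course, report the same host.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereAreMyBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/vagrant/input/sample.txt"); // placeholder path

        // This call goes to the Name Node; the answer tells the client which
        // Data Nodes to contact directly for each block of the file.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
    }
}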

In a real cluster there is one Name Node (two when running High Availability) and as many Data Nodes as there are servers. For this pseudo-distributed installation, we still need both a Name Node and a Data Node, but they will be run in the same virtual machine.

Yet Another Resource Negotiator

Now that we have a distributed file system, we can set up something to schedule and run the actual jobs. For this example, I am setting up the "next generation" job scheduler YARN. It's still called "NextGen" in the docs, but it has been around for quite a while now in Hadoop terms.

Like HDFS, YARN needs two components, in this case a Resource Manager and a Node Manager. (It has other components as well: a History Server, which stores job history, and a Proxy Server, which provides a network proxy for viewing application status and logs from outside the cluster.)

The Resource Manager accepts applications, schedules them, and tracks their status. The Node Manager registers with the Resource Manager and provides its local CPU and memory for scheduling. For a real cluster, there is one Resource Manager (two for High Availability) and as many Node Managers as there are servers. For this example, we will run a Resource Manager and a single Node Manager in our single virtual machine.
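
To see the Resource Manager's side of this, the sketch below uses the standard YarnClient API to print how many Node Managers have registered and what resources each one has offered; on our pseudo-distributed VM it should report exactly one node.

import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterView {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration()); // reads yarn-site.xml
        yarn.start();

        // The Resource Manager knows how many Node Managers have registered;
        // in the pseudo-distributed VM this should be exactly one.
        System.out.println("node managers: "
                + yarn.getYarnClusterMetrics().getNumNodeManagers());

        // Each report shows the CPU and memory a Node Manager has offered
        // to the Resource Manager for scheduling.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " capability=" + node.getCapability()
                    + " used=" + node.getUsed());
        }

        yarn.stop();
    }
}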