Table of Contents
Meet Hadoop
MapReduce
The Hadoop Distributed Filesystem
Hadoop I/O
Developing a MapReduce Application
How MapReduce Works
MapReduce Types and Formats
MapReduce Features
Setting Up a Hadoop Cluster
Administering Hadoop
Pig
Hive
HBase
ZooKeeper
Sqoop
Case Studies
Hadoop got its start in Nutch. A few of us were attempting to build an open
source web search engine and having trouble managing computations running on
even a handful of computers. Once Google published its GFS and MapReduce
papers, the route became clear. They’d devised systems to solve precisely the
problems we were having with Nutch. So we started, two of us, half-time, to try to recreate these systems as a part of Nutch.
We managed to get Nutch limping along on 20 machines, but it soon became clear
that to handle the Web’s massive scale, we’d need to run it on thousands of
machines and, moreover, that the job was bigger than two half-time developers
could handle.
Around that time, Yahoo! got interested, and quickly put together a team that I
joined. We split off the distributed computing part of Nutch, naming it Hadoop. With
the help of Yahoo!, Hadoop soon grew into a technology that could truly scale to the
Web.
In 2006, Tom White started contributing to Hadoop. I already knew Tom through an
excellent article he’d written about Nutch, so I knew he could present complex ideas
in clear prose. I soon learned that he could also develop software that was as
pleasant to read as his prose.
From the beginning, Tom’s contributions to Hadoop showed his concern for users
and for the project. Unlike most open source contributors, Tom is not primarily
interested in tweaking the system to better meet his own needs, but rather in making
it easier for anyone to use.
Given this, I was very pleased when I learned that Tom
intended to write a book about Hadoop. Who could be better qualified? Now you
have the opportunity to learn about Hadoop from a master, not only of the
technology, but also of common sense and plain talk.
—Doug Cutting Shed in the Yard, California
Meet Hadoop
We live in the data age. It’s not easy to measure the total volume of data stored
electronically, but an IDC estimate put the size of the “digital universe” at 0.18
zettabytes in 2006 and forecast a tenfold growth to 1.8 zettabytes by 2011. A
zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million
petabytes, or one billion terabytes. That’s roughly the same order of magnitude as
one disk drive for every person in the world.
This flood of data is coming from many sources. Consider the following:
• The New York Stock Exchange generates about one terabyte of new trade data
per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of
storage.
• One genealogy site stores around 2.5 petabytes of data.
• The Internet Archive stores around 2 petabytes of data, and is growing at a rate
of 20 terabytes per month.
• The Large Hadron Collider near Geneva, Switzerland, will produce about 15
petabytes of data per year.
Data Storage and Analysis
The problem is simple: while the storage capacities of hard drives have increased
massively over the years, access speeds (the rate at which data can be read from
drives) have not kept up. One typical drive from 1990 could store 1,370 MB of data
and had a transfer speed of 4.4 MB/s, so you could read all the data from a full
drive in around five minutes. Over 20 years later, one terabyte drives are the norm,
but the transfer speed is around 100 MB/s, so it takes more than two and a half
hours to read all the data off the disk.
This is a long time to read all data on a single drive—and writing is even slower. The
obvious way to reduce the time is to read from multiple disks at once. Imagine if we
had 100 drives, each holding one hundredth of the data. Working in parallel, we
could read the data in under two minutes.
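The read times quoted above follow directly from dividing capacity by transfer rate. The short sketch below reproduces that arithmetic; the class and method names are illustrative only, and one terabyte is taken as 1,000,000 MB.

```java
// Back-of-the-envelope check of the full-drive read times quoted in the text.
// Class and method names are illustrative; 1 TB is taken as 1,000,000 MB.
class DriveReadTime {

  /** Seconds needed to stream capacityMb megabytes at rateMbPerSec MB/s. */
  static double secondsToRead(double capacityMb, double rateMbPerSec) {
    return capacityMb / rateMbPerSec;
  }

  public static void main(String[] args) {
    // 1990 drive: 1,370 MB at 4.4 MB/s
    double oldDrive = secondsToRead(1_370, 4.4);
    // Current drive: 1 TB at 100 MB/s
    double terabyteDrive = secondsToRead(1_000_000, 100);
    // The same terabyte split over 100 drives that are read in parallel
    double hundredDrives = secondsToRead(1_000_000 / 100.0, 100);

    System.out.println("1990 drive, minutes: " + oldDrive / 60);
    System.out.println("1 TB drive, hours:   " + terabyteDrive / 3600);
    System.out.println("100 drives, seconds: " + hundredDrives);
  }
}
```

The three results come to roughly 5.2 minutes, 2.8 hours, and 100 seconds, which matches "around five minutes", "more than two and a half hours", and "under two minutes".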
Only using one hundredth of a disk may seem wasteful. But we can store one
hundred datasets, each of which is one terabyte, and provide shared access to
them. We can imagine that the users of such a system would be happy to share
access in return for shorter analysis times, and, statistically, that their analysis jobs
would be likely to be spread over time, so they wouldn’t interfere with each other too
much.
There’s more to being able to read and write data in parallel to or from multiple
disks, though.
The first problem to solve is hardware failure: as soon as you start using many
pieces of hardware, the chance that one will fail is fairly high. A common way of
avoiding data loss is through replication: redundant copies of the data are kept by
the system so that in the event of failure, there is another copy available. This is how
RAID works, for instance, although Hadoop’s filesystem, the Hadoop Distributed
Filesystem (HDFS), takes a slightly different approach, as you shall see later.
The second problem is that most analysis tasks need to be able to combine the data
in some way; data read from one disk may need to be combined with the data from
any of the other 99 disks. Various distributed systems allow data to be combined
from multiple sources, but doing this correctly is notoriously challenging. MapReduce
provides a programming model that abstracts the problem from disk reads and
writes.
This, in a nutshell, is what Hadoop provides: a reliable shared storage and
analysis system. The storage is provided by HDFS and analysis by MapReduce.
There are other parts to Hadoop, but these capabilities are its kernel.
Comparison with Other Systems
The approach taken by MapReduce may seem like a brute-force approach. The
premise is that the entire dataset—or at least a good portion of it—is processed for
each query. But this is its power. MapReduce is a batch query processor, and the
ability to run an ad hoc query against your whole dataset and get the results in a
reasonable time is transformative. It changes the way you think about data, and
unlocks data that was previously archived on tape or disk. It gives people the
opportunity to innovate with data. Questions that took too long to get answered
before can now be answered, which in turn leads to new questions and new insights.
For example, Mailtrust, Rackspace’s mail division, used Hadoop for processing
email logs. One ad hoc query they wrote was to find the geographic distribution of
their users. In their words:
This data was so useful that we’ve scheduled the MapReduce job to run monthly
and we will be using this data to help us decide which Rackspace data centers to
place new mail servers in as we grow.
By bringing several hundred gigabytes of data together and having the tools to
analyze it, the Rackspace engineers were able to gain an understanding of the data
that they otherwise would never have had, and, furthermore, they were able to use
what they had learned to improve the service for their customers. You can read
more about how Rackspace uses Hadoop in Chapter 16.
Why can’t we use databases with lots of disks to do large-scale batch analysis? Why
is MapReduce needed?
The answer to these questions comes from another trend in disk drives: seek time is
improving more slowly than transfer rate. Seeking is the process of moving the disk’s
head to a particular place on the disk to read or write data. It characterizes the
latency of a disk operation, whereas the transfer rate corresponds to a disk’s
bandwidth.
If the data access pattern is dominated by seeks, it will take longer to read or write
large portions of the dataset than streaming through it, which operates at the transfer
rate. On the other hand, for updating a small proportion of records in a database, a
traditional B-Tree (the data structure used in relational databases, which is limited
by the rate it can perform seeks) works well. For updating the majority of a database,
a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the
database.
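A rough estimate makes the seek-versus-streaming trade-off concrete. Every figure in the sketch below is an assumption chosen for illustration and does not come from the text: a 10 ms average seek, one seek per updated record, 100-byte records, and a one-terabyte dataset (taken as 1,000,000 MB) that a streaming rewrite reads once and writes once at 100 MB/s. Real databases cache and batch their updates, so treat the result as an order-of-magnitude comparison only.

```java
// Illustrative estimate only: all constants are assumptions, not measurements.
class SeekVersusStream {
  static final double SEEK_SECONDS = 0.010;         // assumed average seek time
  static final double TRANSFER_MB_PER_SEC = 100.0;  // transfer rate used in the text

  /** Time to update records one at a time, paying one seek per record. */
  static double seekBoundSeconds(long recordsToUpdate) {
    return recordsToUpdate * SEEK_SECONDS;
  }

  /** Time to read the whole dataset and write it back out sequentially. */
  static double streamingRewriteSeconds(double datasetMb) {
    return 2 * datasetMb / TRANSFER_MB_PER_SEC;
  }

  public static void main(String[] args) {
    double datasetMb = 1_000_000;           // 1 TB
    long onePercent = 100_000_000L;         // 1% of 10^10 hypothetical 100-byte records

    System.out.println("Seek-bound update of 1%, hours: "
        + seekBoundSeconds(onePercent) / 3600);
    System.out.println("Streaming rewrite of 100%, hours: "
        + streamingRewriteSeconds(datasetMb) / 3600);
  }
}
```

Under these assumptions, touching one percent of the records by seeking takes hundreds of hours, while rewriting the entire dataset sequentially takes under six. That gap is the reason a sort/merge rebuild wins once a large share of the data changes.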
In many ways, MapReduce can be seen as a complement to an RDBMS. (The
differences between the two systems are shown in Table 1-1.) MapReduce is a good
fit for problems that need to analyze the whole dataset, in a batch fashion,
particularly for ad hoc analysis. An RDBMS is good for point queries or updates,
where the dataset has been indexed to deliver low-latency retrieval and update
times of a relatively small amount of data. MapReduce suits applications where the
data is written once, and read many times, whereas a relational database is good for
datasets that are continually updated.
Table 1-1. RDBMS compared to MapReduce
            Traditional RDBMS            MapReduce
Data size   Gigabytes                    -
Access      Interactive and batch        Batch
Updates     Read and write many times    Write once, read many times
Structure   Static schema                Dynamic schema
Integrity   High                         -
Scaling     Nonlinear                    Linear
Another difference between MapReduce and an RDBMS is the amount of structure
in the datasets that they operate on. Structured data is data that is organized into
entities that have a defined format, such as XML documents or database tables that
conform to a particular predefined schema. This is the realm of the RDBMS.
Semi-structured data, on the other hand, is looser, and though there may be a schema, it
is often ignored, so it may be used only as a guide to the structure of the data: for
example, a spreadsheet, in which the structure is the grid of cells, although the cells
themselves may hold any form of data. Unstructured data does not have any
particular internal structure: for example, plain text or image data. MapReduce works
well on unstructured or semi-structured data, since it is designed to interpret the data
at processing time. In other words, the input keys and values for MapReduce are not
an intrinsic property of the data, but they are chosen by the person analyzing the
data.
Relational data is often normalized to retain its integrity and remove redundancy.
Normalization poses problems for MapReduce, since it makes reading a record a
non-local operation, and one of the central assumptions that MapReduce makes is
that it is possible to perform (high-speed) streaming reads and writes.
A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same
client may appear many times), and this is one reason that logfiles of all kinds are
particularly well-suited to analysis with MapReduce.
MapReduce is a linearly scalable programming model. The programmer writes two
functions—a map function and a reduce function—each of which defines a mapping
from one set of key-value pairs to another. These functions are oblivious to the size
of the data or the cluster that they are operating on, so they can be used unchanged
for a small dataset and for a massive one. More important, if you double the size of
the input data, a job will run twice as slow. But if you also double the size of the
cluster, a job will run as fast as the original one. This is not generally true of SQL
queries.
Over time, however, the differences between relational databases and MapReduce
systems are likely to blur, both as relational databases start incorporating some of
the ideas from MapReduce (such as Aster Data’s and Greenplum’s databases) and,
from the other direction, as higher-level query languages built on MapReduce (such
as Pig and Hive) make MapReduce systems more approachable to traditional
database programmers.
A Brief History of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely
used text search library. Hadoop has its origins in Apache Nutch, an open source
web search engine, itself a part of the Lucene project.
The Origin of the Name “Hadoop”
The name Hadoop is not an acronym; it’s a made-up name. The project’s creator,
Doug Cutting, explains how the name came about:
"The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell
and pronounce, meaningless, and not used elsewhere: those are my naming
criteria. Kids are good at generating such. Googol is a kid’s term."
Subprojects and “contrib” modules in Hadoop also tend to have names that are
unrelated to their function, often with an elephant or other animal theme (“Pig,” for
example). Smaller components are given more descriptive (and therefore more
mundane) names. This is a good principle, as it means you can generally work out what
something does from its name. For example, the jobtracker keeps track of
MapReduce jobs.
Hadoop at Yahoo!
Building Internet-scale search engines requires huge amounts of data and therefore
large numbers of machines to process it. Yahoo! Search consists of four primary
components: the Crawler, which downloads pages from web servers; the WebMap,
which builds a graph of the known Web; the Indexer, which builds a reverse index to
the best pages; and the Runtime, which answers users’ queries. The WebMap is a
graph that consists of roughly 1 trillion (10^12) edges, each representing a web link,
and 100 billion (10^11) nodes, each representing a distinct URL. Creating and analyzing
such a large graph requires a large number of computers running for many days. In
early 2005, the infrastructure for the WebMap, named Dreadnaught, needed to be
redesigned to scale up to more nodes. Dreadnaught had successfully scaled from
20 to 600 nodes, but required a complete redesign to scale out further. Dreadnaught
is similar to MapReduce in many ways, but provides more flexibility and less
structure. In particular, each fragment in a Dreadnaught job can send output to each
of the fragments in the next stage of the job, but the sort was all done in library code.
In practice, most of the WebMap phases were pairs that corresponded to
MapReduce. Therefore, the WebMap applications would not require extensive
refactoring to fit into MapReduce.
Eric Baldeschwieler (Eric14) created a small team and we started designing and
prototyping a new framework written in C++ modeled after GFS and MapReduce to
replace Dreadnaught. Although the immediate need was for a new framework for
WebMap, it was clear that standardization of the batch platform across Yahoo!
Search was critical and by making the framework general enough to support other
users, we could better leverage investment in the new platform.
At the same time, we were watching Hadoop, which was part of Nutch, and its
progress. In January 2006, Yahoo! hired Doug Cutting, and a month later we
decided to abandon our prototype and adopt Hadoop. The advantage of Hadoop
over our prototype and design was that it was already working with a real application
(Nutch) on 20 nodes. That allowed us to bring up a research cluster two months later
and start helping real customers use the new framework much sooner than we could
have otherwise. Another advantage, of course, was that since Hadoop was already
open source, it was easier (although far from easy!) to get permission from Yahoo!’s
legal department to work in open source. So we set up a 200-node cluster for the
researchers in early 2006 and put the WebMap conversion plans on hold while we
supported and improved Hadoop for the research users.
Here’s a quick timeline of how things have progressed:
• 2004: Initial versions of what is now Hadoop Distributed Filesystem and
MapReduce implemented by Doug Cutting and Mike Cafarella.
• December 2005: Nutch ported to the new framework. Hadoop runs reliably
on 20 nodes.
• January 2006: Doug Cutting joins Yahoo!.
• February 2006: Apache Hadoop project officially started to support the
standalone development of MapReduce and HDFS.
• February 2006: Adoption of Hadoop by Yahoo! Grid team.
• April 2006: Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
• May 2006: Yahoo! set up a Hadoop research cluster (300 nodes).
• May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware
than April benchmark).
• October 2006: Research cluster reaches 600 nodes.
• December 2006: Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes
in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
• January 2007: Research cluster reaches 900 nodes.
• April 2007: Two research clusters of 1,000 nodes each.
• April 2008: Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes.
• October 2008: Loading 10 terabytes of data per day on to research clusters.
• March 2009: 17 clusters with a total of 24,000 nodes.
• April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400
nodes) and the 100 terabyte sort in 173 minutes (on 3,400 nodes).
Apache Hadoop and the Hadoop Ecosystem
Although Hadoop is best known for MapReduce and its distributed filesystem
(HDFS, renamed from NDFS), the term is also used for a family of related projects
that fall under the umbrella of infrastructure for distributed computing and large-scale
data processing.
The Hadoop projects that are covered in this book are described briefly here:
Common
A set of components and interfaces for distributed filesystems and general I/O
(serialization, Java RPC, persistent data structures).
Avro
A serialization system for efficient, cross-language RPC, and persistent data
storage.
MapReduce
A distributed data processing model and execution environment that runs on
large clusters of commodity machines.
HDFS
A distributed filesystem that runs on large clusters of commodity machines.
Pig
A data flow language and execution environment for exploring very large
datasets. Pig runs on HDFS and MapReduce clusters.
Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides
a query language based on SQL (which the runtime engine translates
to MapReduce jobs) for querying the data.
HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying
storage, and supports both batch-style computations using MapReduce and point
queries (random reads).
ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides
primitives such as distributed locks that can be used for building distributed
applications.
Sqoop
A tool for efficiently moving data between relational databases and HDFS.
MapReduce
MapReduce is a programming model for data processing. The model is simple, yet
not too simple to express useful programs in. Hadoop can run MapReduce programs
written in various languages; in this chapter, we shall look at the same program
expressed in Java, Ruby, Python, and C++. Most important, MapReduce programs
are inherently parallel, thus putting very large-scale data analysis into the hands of
anyone with enough machines at their disposal. MapReduce comes into its own for
large datasets, so let’s start by looking at one.
A Weather Dataset
For our example, we will write a program that mines weather data. Weather sensors
collecting data every hour at many locations across the globe gather a large volume
of log data, which is a good candidate for analysis with MapReduce, since it is
semi-structured and record-oriented.
Data Format
The data we will use is from the National Climatic Data Center (NCDC, http://www The data is stored using a line-oriented ASCII format, in which
each line is a record. The format supports a rich set of meteorological elements,
many of which are optional or with variable data lengths. For simplicity, we shall
focus on the basic elements, such as temperature, which are always present and
are of fixed width.
Example 2-1 shows a sample line with some of the salient fields highlighted. The line
has been split into multiple lines to show each field: in the real file, fields are packed
into one line with no delimiters.
Example 2-1. Format of a National Climatic Data Center record
# USAF weather station identifier
# WBAN weather station identifier
# observation date
# observation time
# latitude (degrees x 1000)
+028783 # longitude (degrees x 1000)
# elevation (meters)
# wind direction (degrees)
# quality code
# sky ceiling height (meters)
# quality code
# visibility distance (meters)
# quality code
# air temperature (degrees Celsius x 10)
# quality code
# dew point temperature (degrees Celsius x 10)
# quality code
# atmospheric pressure (hectopascals x 10)
# quality code
Data files are organized by date and weather station. There is a directory for each
year from 1901 to 2001, each containing a gzipped file for each weather station with
its readings for that year. For example, here are the first entries for 1990:
% ls raw/1990 | head
Since there are tens of thousands of weather stations, the whole dataset is made up
of a large number of relatively small files. It’s generally easier and more efficient to
process a smaller number of relatively large files, so the data was preprocessed so
that each year’s readings were concatenated into a single file.
Analyzing the Data with Unix Tools
What’s the highest recorded global temperature for each year in the dataset? We will
answer this first without using Hadoop, as this information will provide a performance
baseline, as well as a useful means to check our results.
The classic tool for processing line-oriented data is awk. Example 2-2 is a small
script to calculate the maximum temperature for each year.
Example 2-2. A program for finding the maximum recorded temperature by year
from NCDC weather records
#!/usr/bin/env bash
# Print the maximum valid air temperature found in each gzipped year file.
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp !=9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done
The script loops through the compressed year files, first printing the year, and then
processing each file using awk. The awk script extracts two fields from the data: the
air temperature and the quality code. The air temperature value is turned into an
integer by adding 0. Next, a test is applied to see if the temperature is valid (the
value 9999 signifies a missing value in the NCDC dataset) and if the quality code
indicates that the reading is not suspect or erroneous. If the reading is OK, the value
is compared with the maximum value seen so far, which is updated if a new
maximum is found. The END block is executed after all the lines in the file have
been processed, and it prints the maximum value.
Here is the beginning of a run:
% ./
The temperature values in the source file are scaled by a factor of 10, so this works
out as a maximum temperature of 31.7°C for 1901 (there were very few readings at
the beginning of the century, so this is plausible). The complete run for the century
took 42 minutes in one run on a single EC2 High-CPU Extra Large Instance.
To speed up the processing, we need to run parts of the program in parallel. In
theory, this is straightforward: we could process different years in different
processes, using all the available hardware threads on a machine. There are a few
problems with this, however.
First, dividing the work into equal-size pieces isn’t always easy or obvious. In this
case, the file size for different years varies widely, so some processes will finish
much earlier than others. Even if they pick up further work, the whole run is
dominated by the longest file. A better approach, although one that requires more
work, is to split the input into fixed-size chunks and assign each chunk to a process.
Second, combining the results from independent processes may need further
processing. In this case, the result for each year is independent of other years and
may be combined by concatenating all the results, and sorting by year. If using the
fixed-size chunk approach, the combination is more delicate. For this example, data
for a particular year will typically be split into several chunks, each processed
independently. We’ll end up with the maximum temperature for each chunk, so the
final step is to look for the highest of these maximums, for each year.
Third, you are still limited by the processing capacity of a single machine. If the best
time you can achieve is 20 minutes with the number of processors you have, then
that’s it. You can’t make it go faster. Also, some datasets grow beyond the capacity
of a single machine. When we start using multiple machines, a whole host of other
factors come into play, mainly falling in the category of coordination and reliability.
Who runs the overall job? How do we deal with failed processes?
So, though it’s feasible to parallelize the processing, in practice it’s messy. Using a
framework like Hadoop to take care of these issues is a great help.
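The combination step for the fixed-size chunk approach described above can be sketched in a few lines. The class name, the representation of a partial result as a (year, maximum) pair, and the sample values are all assumptions made for illustration. The sketch covers only the final merge and does not address how chunks are distributed or how failed processes are retried.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: merges per-chunk maxima into one maximum per year.
class ChunkedMax {

  /** Keeps the highest partial maximum per year; the TreeMap sorts by year. */
  static Map<String, Integer> combine(List<Map.Entry<String, Integer>> chunkMaxima) {
    Map<String, Integer> maxByYear = new TreeMap<>();
    for (Map.Entry<String, Integer> partial : chunkMaxima) {
      // merge() stores the value if the year is new, otherwise keeps the larger one
      maxByYear.merge(partial.getKey(), partial.getValue(), Integer::max);
    }
    return maxByYear;
  }

  public static void main(String[] args) {
    // Made-up partial results, as if four chunks had been processed independently
    List<Map.Entry<String, Integer>> chunkMaxima = new ArrayList<>();
    chunkMaxima.add(new AbstractMap.SimpleEntry<>("1950", 0));
    chunkMaxima.add(new AbstractMap.SimpleEntry<>("1949", 111));
    chunkMaxima.add(new AbstractMap.SimpleEntry<>("1950", 22));
    chunkMaxima.add(new AbstractMap.SimpleEntry<>("1949", 78));

    System.out.println(combine(chunkMaxima)); // prints {1949=111, 1950=22}
  }
}
```

Because taking a maximum is associative and commutative, the partial results can arrive in any order and be merged in any grouping without changing the answer.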
Analyzing the Data with Hadoop
To take advantage of the parallel processing that Hadoop provides, we need to
express our query as a MapReduce job. After some local, small-scale testing, we will
be able to run it on a cluster of machines.
Map and Reduce
MapReduce works by breaking the processing into two phases: the map phase and
the reduce phase. Each phase has key-value pairs as input and output, the types of
which may be chosen by the programmer. The programmer also specifies two
functions: the map function and the reduce function.
Figure 2-1. MapReduce logical data flow
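The two-phase flow can be imitated in memory without Hadoop, which may help when reading the Hadoop code. The sketch below is an assumption-laden stand-in, not the framework's API: the class and method names are invented, the input is a simplified tab-separated "year, temperature" line rather than a real NCDC record, and the grouping step merely plays the role of the sort-and-group work the framework does between the two phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory imitation of map -> group by key -> reduce. Not Hadoop code.
class MiniMapReduce {
  static final int MISSING = 9999;

  /** Map: emit (year, temperature) for one "year<TAB>temperature" line, or nothing if missing. */
  static List<Map.Entry<String, Integer>> map(String line) {
    String[] fields = line.split("\t");
    int temperature = Integer.parseInt(fields[1]);
    List<Map.Entry<String, Integer>> output = new ArrayList<>();
    if (temperature != MISSING) {
      output.add(new AbstractMap.SimpleEntry<>(fields[0], temperature));
    }
    return output;
  }

  /** Group: collect all values per key, with keys in sorted order. */
  static SortedMap<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
    SortedMap<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> pair : pairs) {
      grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
    }
    return grouped;
  }

  /** Reduce: pick the maximum of all values seen for one key. */
  static int reduce(String year, List<Integer> temperatures) {
    int max = Integer.MIN_VALUE;
    for (int t : temperatures) {
      max = Math.max(max, t);
    }
    return max;
  }

  /** Runs the three steps in sequence over the given input lines. */
  static SortedMap<String, Integer> run(List<String> lines) {
    List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
    for (String line : lines) {
      mapped.addAll(map(line));
    }
    SortedMap<String, Integer> result = new TreeMap<>();
    for (Map.Entry<String, List<Integer>> e : group(mapped).entrySet()) {
      result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
    }
    return result;
  }

  public static void main(String[] args) {
    // Made-up readings in tenths of a degree Celsius; 9999 marks a missing value
    List<String> lines = Arrays.asList(
        "1950\t0", "1950\t22", "1950\t-11", "1949\t111", "1949\t9999", "1949\t78");
    System.out.println(run(lines)); // prints {1949=111, 1950=22}
  }
}
```

The map and reduce methods never see the size of the input or where it lives, which is the property that lets the same two functions run unchanged on a cluster.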
Java MapReduce
Having run through how the MapReduce program works, the next step is to express
it in code. We need three things: a map function, a reduce function, and some code
to run the job. The map function is represented by the Mapper class, which declares
an abstract map() method. Example 2-3 shows the implementation of our map
method.
Example 2-3. Mapper for maximum temperature example
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // Emit only readings that are present and whose quality code is acceptable
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}
The Mapper class is a generic type, with four formal type parameters that specify the
input key, input value, output key, and output value types of the map function. For
the present example, the input key is a long integer offset, the input value is a line of
text, the output key is a year, and the output value is an air temperature (an integer).
Rather than use built-in Java types, Hadoop provides its own set of basic types that
are optimized for network serialization. These are found in the org.apache.hadoop.io
package. Here we use LongWritable, which corresponds to a
Java Long, Text (like Java String), and IntWritable (like Java Integer).
The map() method is passed a key and a value. We convert the Text value
containing the line of input into a Java String, then use its substring() method to
extract the columns we are interested in.
The map() method also provides an instance of Context to write the output to. In this
case, we write the year as a Text object (since we are just using it as a key), and the
temperature is wrapped in an IntWritable. We write an output record only if the temperature is present and the quality code indicates the temperature reading is OK.
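The field extraction in the mapper can be tried in isolation, without a Hadoop installation. The sketch below reuses the column offsets from Example 2-3 (year at 15 to 19, signed temperature at 87 to 92, quality code at 92). Because no real record is available here, it builds a synthetic 93-character line with only those three fields filled in. The class, its method names, and the synthetic line are assumptions made for illustration.

```java
import java.util.Arrays;

// Stand-alone version of the mapper's parsing logic. Names are illustrative.
class RecordFields {
  static final int MISSING = 9999;

  /** Four-character year at columns 15..18 (zero-based). */
  static String year(String line) {
    return line.substring(15, 19);
  }

  /** Signed air temperature in tenths of a degree Celsius at columns 87..91. */
  static int airTemperature(String line) {
    // Skip an explicit '+' sign, as the mapper does, before parsing the digits
    int start = line.charAt(87) == '+' ? 88 : 87;
    return Integer.parseInt(line.substring(start, 92));
  }

  /** True if the reading is present and its quality code is one of 0, 1, 4, 5, 9. */
  static boolean isValid(String line) {
    String quality = line.substring(92, 93);
    return airTemperature(line) != MISSING && quality.matches("[01459]");
  }

  /** Builds a synthetic line: zeros everywhere except the three fields of interest. */
  static String syntheticLine(String year, String signedTemperature, char quality) {
    char[] chars = new char[93];
    Arrays.fill(chars, '0');
    year.getChars(0, 4, chars, 15);               // 4-character year
    signedTemperature.getChars(0, 5, chars, 87);  // sign plus 4 digits
    chars[92] = quality;
    return new String(chars);
  }

  public static void main(String[] args) {
    String line = syntheticLine("1901", "+0317", '1');
    System.out.println(year(line) + " " + airTemperature(line) + " " + isValid(line));
    // prints: 1901 317 true
  }
}
```

A value of 317 corresponds to 31.7°C, since the temperatures are stored scaled by a factor of 10.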
The reduce function is similarly defined using a Reducer, as illustrated in Example 2-4.
Example 2-4. Reducer for maximum temperature example
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Track the highest temperature seen for this year
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
Again, four formal type parameters are used to specify the input and output types,
this time for the reduce function. The input types of the reduce function must match
the output types of the map function: Text and IntWritable. And in this case, the
output types of the reduce function are Text and IntWritable, for a year and its
maximum temperature, which we find by iterating through the temperatures and
comparing each with a record of the highest found so far.
The third piece of code runs the MapReduce job (see Example 2-5).
Example 2-5. Application to find the maximum temperature in the weather dataset
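import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}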
A Job object forms the specification of the job. It gives you control over how the job
is run. When we run this job on a Hadoop cluster, we will package the code into a
JAR file (which Hadoop will distribute around the cluster). Rather than explicitly
specify the name of the JAR file, we can pass a class in the Job’s setJarByClass()
method, which Hadoop will use to locate the relevant JAR file by looking for the JAR
file containing this class.
Having constructed a Job object, we specify the input and output paths. An input
path is specified by calling the static addInputPath() method on FileInputFormat, and
it can be a single file, a directory (in which case, the input forms all the files in that
directory), or a file pattern. As the name suggests, addInputPath() can be called
more than once to use input from multiple paths.
The output path (of which there is only one) is specified by the static
setOutputPath() method on FileOutputFormat. It specifies a directory where the output files
from the reducer functions are written. The directory shouldn’t exist before running
the job, as Hadoop will complain and not run the job. This precaution is to prevent
data loss.
Next, we specify the map and reduce types to use via the setMapperClass() and
setReducerClass() methods.
The setOutputKeyClass() and setOutputValueClass() methods control the output
types for the map and the reduce functions, which are often the same, as they are in
our case. If they are different, then the map output types can be set using the
methods setMapOutputKeyClass() and setMapOutputValueClass().
The input types are controlled via the input format, which we have not explicitly set
since we are using the default TextInputFormat.
After setting the classes that define the map and reduce functions, we are ready to
run the job. The waitForCompletion() method on Job submits the job and waits for it
to finish. The method’s boolean argument is a verbose flag, so in this case the job
writes information about its progress to the console.
The return value of the waitForCompletion() method is a boolean indicating success
(true) or failure (false), which we translate into the program’s exit code of 0 or 1.
A test run
After writing a MapReduce job, it’s normal to try it out on a small dataset to flush out
any immediate problems with the code. First install Hadoop in standalone mode—
there are instructions for how to do this in Appendix A. This is the mode in which
Hadoop runs using the local filesystem with a local job runner. Then install and
compile the examples using the instructions on the book’s website.
Let’s test it on the five-line sample discussed earlier (the output has been slightly reformatted to fit the page):
% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop MaxTemperature input/ncdc/sample.txt output
11/09/15 21:35:14 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/09/15 21:35:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
11/09/15 21:35:14 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/09/15 21:35:14 INFO input.FileInputFormat: Total input paths to process : 1
11/09/15 21:35:14 WARN snappy.LoadSnappy: Snappy native library not loaded
11/09/15 21:35:14 INFO mapreduce.JobSubmitter: number of
11/09/15 21:35:15 INFO mapreduce.Job: Running job: job_local_0001
11/09/15 21:35:15 INFO mapred.LocalJobRunner: Waiting for map tasks
11/09/15 21:35:15 INFO mapred.LocalJobRunner: Starting task: attempt_local_0001_m_000000_0
11/09/15 21:35:15 INFO mapred.Task: Using ResourceCalculatorPlugin : null
11/09/15 21:35:15 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
11/09/15 21:35:15 INFO mapred.MapTask: 100
11/09/15 21:35:15 INFO mapred.MapTask: soft limit at 83886080
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=10
Total committed heap usage (bytes)=379723776
File Input Format Counters
Bytes Read=529
File Output Format Counters
Bytes Written=29
When the hadoop command is invoked with a classname as the first argument, it
launches a JVM to run the class. It is more convenient to use hadoop than straight
java since the former adds the Hadoop libraries (and their dependencies) to the
classpath and picks up the Hadoop configuration, too. To add the application
classes to the classpath, we’ve defined an environment variable called
HADOOP_CLASSPATH, which the hadoop script picks up.
When running in local (standalone) mode, the programs in this book all assume that
you have set the HADOOP_CLASSPATH in this way. The commands should be
run from the directory that the example code is installed in.
The output from running the job provides some useful information. For example, we
can see that the job was given an ID of job_local_0001, and it ran one map task and
one reduce task (with the IDs attempt_local_0001_m_000000_0 and
attempt_local_0001_r_000000_0). Knowing the job and task IDs can be very useful
when debugging MapReduce jobs.
The last section of the output, titled “Counters,” shows the statistics that Hadoop
generates for each job it runs. These are very useful for checking whether the
amount of data processed is what you expected. For example, we can follow the
number of records that went through the system: five map inputs produced five map
outputs, then five reduce inputs in two groups produced two reduce outputs.
The output was written to the output directory, which contains one output file per
reducer. The job had a single reducer, so we find a single file, named part-r-00000:
% cat output/part-r-00000
1949 111
1950 22
This result is the same as when we went through it by hand earlier. We interpret this
as saying that the maximum temperature recorded in 1949 was 11.1°C, and in 1950
it was 2.2°C.
Figure 2-3. MapReduce data flow with a single reduce task
Figure 2-4. MapReduce data flow with multiple reduce tasks
Finally, it’s also possible to have zero reduce tasks. This can be appropriate when
you don’t need the shuffle since the processing can be carried out entirely in parallel.
In this case, the only off-node data transfer is when the map tasks write to HDFS
(see Figure 2-5).
Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it
pays to minimize the data transferred between map and reduce tasks. Hadoop
allows the user to specify a combiner function to be run on the map output—the
combiner function’s output forms the input to the reduce function. Since the
combiner function is an optimization, Hadoop does not provide a guarantee of how
many times it will call it for a particular map output record, if at all. In other words,
calling the combiner function zero, one, or many times should produce the same
output from the reducer.
Figure 2-5. MapReduce data flow with no reduce tasks
The combiner function doesn’t replace the reduce function. (How could it? The
reduce function is still needed to process records with the same key from different
maps.) But it can help cut down the amount of data shuffled between the maps and
the reduces, and for this reason alone it is always worth considering whether you
can use a combiner function in your MapReduce job.
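To make the "zero, one, or many times" requirement concrete, here is a minimal sketch in plain Java (it does not use the Hadoop API). The class name CombinerSketch and the temperature readings are made up for illustration. It checks that taking the maximum of per-map maxima gives the same result as taking the maximum over all values, and contrasts this with a mean, which does not have that property and so could not be used as a combiner unchanged.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: plain Java, not Hadoop API code.
final class CombinerSketch {
    // Largest value in the list; this plays the role of the reduce function.
    static int max(List<Integer> values) {
        int best = Integer.MIN_VALUE;
        for (int v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    // Arithmetic mean, used to show a function that is NOT safe as a combiner.
    static double mean(List<? extends Number> values) {
        double sum = 0;
        for (Number v : values) {
            sum += v.doubleValue();
        }
        return sum / values.size();
    }

    public static void main(String[] args) {
        // Hypothetical readings for one year, split across two map tasks.
        List<Integer> map1 = Arrays.asList(0, 20, 10);
        List<Integer> map2 = Arrays.asList(25, 15);
        List<Integer> all = Arrays.asList(0, 20, 10, 25, 15);

        // Combiner applied once per map, then the reducer sees only two values.
        int combined = max(Arrays.asList(max(map1), max(map2)));
        System.out.println("max with combiner:    " + combined);
        System.out.println("max without combiner: " + max(all));

        // The same trick changes the answer for a mean.
        double meanOfMeans = mean(Arrays.asList(mean(map1), mean(map2)));
        System.out.println("mean of means: " + meanOfMeans);
        System.out.println("true mean:     " + mean(all));
    }
}
```

The two maxima agree, while the mean of means differs from the true mean, which is why the choice of combiner depends on the function being computed.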
Specifying a combiner function
Going back to the Java MapReduce program, the combiner function is defined using
the Reducer class, and for this application, it is the same implementation as the
reducer function in MaxTemperatureReducer. The only change we need to make is
to set the combiner class on the Job (see Example 2-7).
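Example 2-7. Application to find the maximum temperature, using a combiner function for efficiency
public class MaxTemperatureWithCombiner {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCombiner <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperatureWithCombiner.class);
    job.setJobName("Max temperature");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    // The only change: reuse the reducer implementation as the combiner.
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}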
Hadoop Streaming
Hadoop provides an API to MapReduce that allows you to write your map and
reduce functions in languages other than Java. Hadoop Streaming uses Unix
standard streams as the interface between Hadoop and your program, so you can
use any language that can read standard input and write to standard output to write
your MapReduce program.
The Hadoop Distributed Filesystem
When a dataset outgrows the storage capacity of a single physical machine, it
becomes necessary to partition it across a number of separate machines.
Filesystems that manage the storage across a network of machines are called
distributed filesystems. Since they are network-based, all the complications of
network programming kick in, thus making distributed filesystems more complex
than regular disk filesystems. For example, one of the biggest challenges is making
the filesystem tolerate node failure without suffering data loss.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop
Distributed Filesystem. (You may sometimes see references to “DFS”—informally or
in older documentation or configurations—which is the same thing.) HDFS is
Hadoop’s flagship filesystem and is the focus of this chapter, but Hadoop actually
has a general-purpose filesystem abstraction, so we’ll see along the way how
Hadoop integrates with other storage systems (such as the local filesystem and
Amazon S3).
The Design of HDFS
HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware. Let’s examine this statement
in more detail:
Very large files
“Very large” in this context means files that are hundreds of megabytes, gigabytes,
or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a
write-once, read-many-times pattern. A dataset is typically generated or copied from
source, then various analyses are performed on that dataset over time. Each
analysis will involve a large proportion, if not all, of the dataset, so the time to read
the whole dataset is more important than the latency in reading the first record.
Commodity hardware
Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed
to run on clusters of commodity hardware (commonly available hardware that can be obtained
from multiple vendors) for which the chance of node failure across the cluster is
high, at least for large clusters. HDFS is designed to carry on working without a
noticeable interruption to the user in the face of such failure.
It is also worth examining the applications for which using HDFS does not work so
well. While this may change in the future, these are areas where HDFS is not a good
fit today:
Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds
range, will not work well with HDFS. Remember, HDFS is optimized for delivering
a high throughput of data, and this may be at the expense of latency. HBase
(Chapter 13) is currently a better choice for low-latency access.
Lots of small files
Since the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on the
namenode. As a rule of thumb, each file, directory, and block takes about 150
bytes. So, for example, if you had one million files, each taking one block, you
would need at least 300 MB of memory. While storing millions of files is feasible,
billions is beyond the capability of current hardware.
Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always made at the
end of the file. There is no support for multiple writers, or for modifications at
arbitrary offsets in the file. (These might be supported in the future, but they are
likely to be relatively inefficient.)
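The rule of thumb for namenode memory in the "lots of small files" case can be checked with a few lines of arithmetic. The sketch below is plain Java; the class and method names are hypothetical, and it deliberately ignores directories, so it should be read as a rough lower bound rather than a sizing tool.

```java
// Illustrative only: back-of-the-envelope namenode memory estimate.
final class NamenodeMemoryEstimate {
    // Rule of thumb from the text: about 150 bytes per file, directory, or block.
    static final long BYTES_PER_OBJECT = 150;

    // Each file contributes one file object plus one object per block.
    // Directories are ignored here (an assumption made to keep the sketch small).
    static long estimateBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile;
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(1_000_000L, 1);
        // One million single-block files: 2,000,000 objects * 150 bytes.
        System.out.println(bytes / 1_000_000 + " MB");
    }
}
```

For one million single-block files this gives 300 MB, matching the figure quoted above; scaling the file count to billions shows why namenode memory becomes the limit.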
HDFS Concepts
A disk has a block size, which is the minimum amount of data that it can read or
write. Filesystems for a single disk build on this by dealing with data in blocks, which
are an integral multiple of the disk block size. Filesystem blocks are typically a few
kilobytes in size, while disk blocks are normally 512 bytes. This is generally
transparent to the filesystem user who is simply reading or writing a file—of
whatever length. However, there are tools to perform filesystem maintenance, such
as df and fsck, that operate on the filesystem block level.
HDFS, too, has the concept of a block, but it is a much larger unit—64 MB by
default. Like in a filesystem for a single disk, files in HDFS are broken into block-sized
chunks, which are stored as independent units. Unlike a filesystem for a single
disk, a file in HDFS that is smaller than a single block does not occupy a full block’s
worth of underlying storage. When unqualified, the term “block” in this book refers to
a block in HDFS.
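As a small illustration of how a file maps onto blocks, the following plain-Java sketch splits a file length into block-sized chunks. The class and method names are hypothetical, and the 64 MB constant is simply the default quoted above. Note that the final chunk is only as long as the remaining data, which mirrors the point that a short file does not consume a whole block of underlying storage.

```java
// Illustrative only: how a file length divides into block-sized chunks.
final class BlockSplit {
    static final long MB = 1024L * 1024L;
    static final long DEFAULT_BLOCK_SIZE = 64 * MB; // default quoted in the text

    // Returns the length of each chunk; the last one may be shorter than blockSize.
    static long[] blockLengths(long fileLength, long blockSize) {
        int count = (int) ((fileLength + blockSize - 1) / blockSize);
        long[] lengths = new long[count];
        for (int i = 0; i < count; i++) {
            long remaining = fileLength - (long) i * blockSize;
            lengths[i] = Math.min(blockSize, remaining);
        }
        return lengths;
    }

    public static void main(String[] args) {
        for (long len : blockLengths(150 * MB, DEFAULT_BLOCK_SIZE)) {
            System.out.println(len / MB + " MB");
        }
    }
}
```

A 150 MB file yields chunks of 64 MB, 64 MB, and 22 MB, and a 1 MB file yields a single 1 MB chunk.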
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize
the cost of seeks. By making a block large enough, the time to transfer the data
from the disk can be made to be significantly larger than the time to seek to the
start of the block. Thus the time to transfer a large file made of multiple blocks
operates at the disk transfer rate.
A quick calculation shows that if the seek time is around 10 ms, and the transfer
rate is 100 MB/s, then to make the seek time 1% of the transfer time, we need
to make the block size around 100 MB. The default is actually 64 MB, although
many HDFS installations use 128 MB blocks. This figure will continue to be
revised upward as transfer speeds grow with new generations of disk drives.
This argument shouldn’t be taken too far, however. Map tasks in MapReduce
normally operate on one block at a time, so if you have too few tasks (fewer
than nodes in the cluster), your jobs will run slower than they could otherwise.
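The quick calculation in this sidebar can be written out explicitly. The sketch below is plain Java with hypothetical names; it takes the seek time, the transfer rate, and the target ratio of seek time to transfer time, and returns the block size that meets the target.

```java
// Illustrative only: block size needed so that seek time is a given fraction of transfer time.
final class BlockSizeEstimate {
    // seekMs: seek time in milliseconds
    // transferMbPerSec: sustained transfer rate in MB/s
    // seekFraction: desired seek time as a fraction of transfer time (0.01 for 1%)
    static double blockSizeMb(double seekMs, double transferMbPerSec, double seekFraction) {
        double transferSeconds = (seekMs / 1000.0) / seekFraction;
        return transferSeconds * transferMbPerSec;
    }

    public static void main(String[] args) {
        // 10 ms seek, 100 MB/s transfer, seek time 1% of transfer time.
        System.out.println(blockSizeMb(10, 100, 0.01) + " MB");
    }
}
```

With the figures from the text, the transfer must take about one second, which at 100 MB/s corresponds to a block of roughly 100 MB.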
Having a block abstraction for a distributed filesystem brings several benefits. The
first benefit is the most obvious: a file can be larger than any single disk in the
network. There’s nothing that requires the blocks from a file to be stored on the
same disk, so they can take advantage of any of the disks in the cluster. In fact, it
would be possible, if unusual, to store a single file on an HDFS cluster whose blocks
filled all the disks in the cluster.
Second, making the unit of abstraction a block rather than a file simplifies the
storage subsystem. Simplicity is something to strive for in all systems, but is
especially important for a distributed system in which the failure modes are so
varied. The storage subsystem deals with blocks, simplifying storage management
(since blocks are a fixed size, it is easy to calculate how many can be stored on a
given disk) and eliminating metadata concerns (blocks are just a chunk of data to be
stored—file metadata such as permissions information does not need to be stored
with the blocks, so another system can handle metadata separately).
Furthermore, blocks fit well with replication for providing fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is
replicated to a small number of physically separate machines (typically three). If a
block becomes unavailable, a copy can be read from another location in a way that
is transparent to the client. A block that is no longer available due to corruption or
machine failure can be replicated from its alternative locations to other live machines
to bring the replication factor back to the normal level. Similarly, some applications
may choose to set a high replication factor for the blocks in a popular file to spread
the read load on the cluster.
Like its disk filesystem cousin, HDFS’s fsck command understands blocks. For
example, running:
% hadoop fsck / -files -blocks
will list the blocks that make up each file in the filesystem. (See “Filesystem check
(fsck)”.)
Namenodes and Datanodes
An HDFS cluster has two types of node operating in a master-worker pattern: a
namenode (the master) and a number of datanodes (workers). The namenode
manages the filesystem namespace. It maintains the filesystem tree and the
metadata for all the files and directories in the tree. This information is stored
persistently on the local disk in the form of two files: the namespace image and the
edit log. The namenode also knows the datanodes on which all the blocks for a
given file are located; however, it does not store block locations persistently, since
this information is reconstructed from datanodes when the system starts.
Datanodes are the workhorses of the filesystem. They store and retrieve blocks
when they are told to (by clients or the namenode), and they report back to the
namenode periodically with lists of blocks that they are storing. Without the
namenode, the filesystem cannot be used. In fact, if the machine running the
namenode was obliterated, all the files on the filesystem would be lost since there
would be no way of knowing how to reconstruct the files from the blocks on the
datanodes. For this reason, it is important to make the namenode resilient to failure,
and Hadoop provides two mechanisms for this.
It is also possible to run a secondary namenode, which despite its name does not
act as a namenode. Its main role is to periodically merge the namespace image with
the edit log to prevent the edit log from becoming too large. The secondary
namenode usually runs on a separate physical machine, since it requires plenty of
CPU and as much memory as the namenode to perform the merge. It keeps a copy
of the merged namespace image, which can be used in the event of the namenode
failing. However, the state of the secondary namenode lags that of the primary, so in
the event of total failure of the primary, data loss is almost certain. The usual course
of action in this case is to copy the namenode’s metadata files that are on NFS to
the secondary and run it as the new primary.
HDFS Federation
The namenode keeps a reference to every file and block in the filesystem in
memory, which means that on very large clusters with many files, memory becomes
the limiting factor for scaling. HDFS Federation, introduced in the 0.23 release
series, allows a cluster to scale by adding namenodes, each of which manages a
portion of the filesystem namespace. For example, one namenode might manage all
the files rooted under /user, say, and a second namenode might handle files under
The Command-Line Interface
We’re going to have a look at HDFS by interacting with it from the command line.
There are many other interfaces to HDFS, but the command line is one of the
simplest and, to many developers, the most familiar.
We are going to run HDFS on one machine, so first follow the instructions for setting
up Hadoop in pseudo-distributed mode in Appendix A. Later you’ll see how to run on
a cluster of machines to give us scalability and fault tolerance.
There are two properties that we set in the pseudo-distributed configuration that deserve further explanation. The first is set to hdfs://localhost/, which
is used to set a default filesystem for Hadoop. Filesystems are specified by a URI,
and here we have used an hdfs URI to configure Hadoop to use HDFS by default.
The HDFS daemons will use this property to determine the host and port for the
HDFS namenode. We’ll be running it on localhost, on the default HDFS port, 8020.
And HDFS clients will use this property to work out where the namenode is running
so they can connect to it.
We set the second property, dfs.replication, to 1 so that HDFS doesn’t replicate
filesystem blocks by the default factor of three. When running with a single
datanode, HDFS can’t replicate blocks to three datanodes, so it would perpetually
warn about blocks being under-replicated. This setting solves that problem.
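To illustrate how a host and port can be derived from a default-filesystem URI such as hdfs://localhost/, here is a minimal sketch using only java.net.URI. It is not Hadoop's own parsing code; the class and method names are hypothetical, and the fallback to port 8020 simply encodes the default mentioned above.

```java
import java.net.URI;

// Illustrative only: deriving a namenode host and port from a filesystem URI.
final class DefaultFsUri {
    // Default HDFS port named in the text.
    static final int DEFAULT_NAMENODE_PORT = 8020;

    static String namenodeAddress(String defaultFs) {
        URI uri = URI.create(defaultFs);
        // java.net.URI reports -1 when the URI does not specify a port.
        int port = uri.getPort() == -1 ? DEFAULT_NAMENODE_PORT : uri.getPort();
        return uri.getHost() + ":" + port;
    }

    public static void main(String[] args) {
        System.out.println(namenodeAddress("hdfs://localhost/"));
        System.out.println(namenodeAddress("hdfs://localhost:9000/"));
    }
}
```

The first call falls back to the default port, while the second uses the port given in the URI.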
Basic Filesystem Operations
The filesystem is ready to be used, and we can do all of the usual filesystem
operations such as reading files, creating directories, moving files, deleting data, and
listing directories. You can type hadoop fs -help to get detailed help on every command.
Start by copying a file from the local filesystem to HDFS:
% hadoop fs -copyFromLocal input/docs/quangle.txt hdfs://localhost/user/tom/quangle.txt
This command invokes Hadoop’s filesystem shell command fs, which supports a
number of subcommands—in this case, we are running -copyFromLocal. The local
file quangle.txt is copied to the file /user/tom/quangle.txt on the HDFS instance
running on localhost. In fact, we could have omitted the scheme and host of the URI
and picked up the default, hdfs://localhost, as specified in core-site.xml:
% hadoop fs -copyFromLocal input/docs/quangle.txt /user/tom/quangle.txt
We could also have used a relative path and copied the file to our home directory in
HDFS, which in this case is /user/tom:
% hadoop fs -copyFromLocal input/docs/quangle.txt quangle.txt
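The three forms of the destination used above (full URI, absolute path, and relative path) all name the same file. The following sketch shows one way to think about that resolution using java.net.URI; it is an analogy rather than Hadoop's actual implementation, and the class name, method name, and parameters are hypothetical.

```java
import java.net.URI;

// Illustrative only: resolving a path against a default filesystem and home directory.
final class PathResolution {
    static URI resolve(String defaultFs, String homeDir, String path) {
        // Make sure the home directory ends with "/" so relative names resolve inside it.
        String home = homeDir.endsWith("/") ? homeDir : homeDir + "/";
        URI base = URI.create(defaultFs).resolve(home);
        // Absolute URIs are returned unchanged; absolute paths replace the base path;
        // relative paths are appended to the home directory.
        return base.resolve(path);
    }

    public static void main(String[] args) {
        String fs = "hdfs://localhost/";
        String home = "/user/tom";
        System.out.println(resolve(fs, home, "hdfs://localhost/user/tom/quangle.txt"));
        System.out.println(resolve(fs, home, "/user/tom/quangle.txt"));
        System.out.println(resolve(fs, home, "quangle.txt"));
    }
}
```

All three calls produce hdfs://localhost/user/tom/quangle.txt.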
Let’s copy the file back to the local filesystem and check whether it’s the same:
% hadoop fs -copyToLocal quangle.txt quangle.copy.txt
% md5 input/docs/quangle.txt quangle.copy.txt
MD5 (input/docs/quangle.txt) = a16f231da6b05e2ba7a339320e7dacd9
MD5 (quangle.copy.txt) = a16f231da6b05e2ba7a339320e7dacd9
The MD5 digests are the same, showing that the file survived its trip to HDFS and is
back intact.
Finally, let’s look at an HDFS file listing. We create a directory first just to see how it
is displayed in the listing:
% hadoop fs -mkdir books
% hadoop fs -ls .
Found 2
- tom supergroup 0 2009-04-02 22:41 /user/tom/books
1 tom supergroup 118 2009-04-02 22:29 /user/tom/quangle.txt
File Permissions in HDFS
HDFS has a permissions model for files and directories that is much like POSIX.
There are three types of permission: the read permission (r), the write permission
(w), and the execute permission (x). The read permission is required to read files or
list the contents of a directory. The write permission is required to write a file, or for a
directory, to create or delete files or directories in it. The execute permission is
ignored for a file since you can’t execute a file on HDFS (unlike POSIX), and for a
directory it is required to access its children.
Each file and directory has an owner, a group, and a mode. The mode is made up of
the permissions for the user who is the owner, the permissions for the users who are
members of the group, and the permissions for users who are neither the owners
nor members of the group.
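Since the mode is made up of three sets of r, w, and x permissions, it can be rendered in the familiar nine-character form. The sketch below assumes the POSIX convention of encoding the mode as three octal digits (owner, group, others); that encoding and the names used here are assumptions for illustration, not something taken from the HDFS API.

```java
// Illustrative only: rendering an octal mode as an rwx string, POSIX style.
final class ModeString {
    // Renders the low nine bits of mode, for example 0755 as "rwxr-xr-x".
    static String render(int mode) {
        StringBuilder sb = new StringBuilder();
        // Walk the owner, group, and others triplets from most to least significant.
        for (int shift = 6; shift >= 0; shift -= 3) {
            int bits = (mode >> shift) & 7;
            sb.append((bits & 4) != 0 ? 'r' : '-');
            sb.append((bits & 2) != 0 ? 'w' : '-');
            sb.append((bits & 1) != 0 ? 'x' : '-');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(0755)); // a typical directory mode
        System.out.println(render(0644)); // a typical file mode
    }
}
```

Here 0755 renders as rwxr-xr-x and 0644 as rw-r--r--; for a directory, the x bits are the ones that control access to its children.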
Hadoop Filesystems
Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a
filesystem in Hadoop, and there are several concrete implementations, which are
described in Table 3-1.
Table 3-1. Hadoop filesystems (Java implementations are all under org.apache.hadoop)
Local (fs.LocalFileSystem)
A filesystem for a locally connected disk with client-side checksums. Use RawLocalFileSystem for a local filesystem with no checksums. See “LocalFileSystem”.
HDFS (hdfs.DistributedFileSystem)
Hadoop’s distributed filesystem. HDFS is designed to work efficiently in conjunction with MapReduce.
HFTP (hdfs.HftpFileSystem)
A filesystem providing read-only access to HDFS over HTTP. (Despite its name, HFTP has no connection with FTP.) Often used with distcp (see “Parallel Copying with distcp”) to copy data between HDFS clusters running different versions.
HSFTP (hdfs.HsftpFileSystem)
A filesystem providing read-only access to HDFS over HTTPS. (Again, this has no connection with FTP.)
WebHDFS (hdfs.web.WebHdfsFileSystem; URI scheme webhdfs)
A filesystem providing secure read-write access to HDFS over HTTP. WebHDFS is intended as a replacement for HFTP and HSFTP.
Hadoop Archives
A filesystem layered on another filesystem for archiving files. Hadoop Archives are typically used for archiving files in HDFS to reduce the namenode’s memory usage. See “Hadoop Archives”.
CloudStore (KosmosFileSystem; URI scheme kfs)
CloudStore (formerly Kosmos filesystem) is a distributed filesystem like HDFS or Google’s GFS, written in C++.
FTP (fs.ftp.FTPFileSystem)
A filesystem backed by an FTP server.
Amazon S3
A filesystem backed by Amazon S3.
Amazon S3, block-based (fs.s3.S3FileSystem)
A filesystem backed by Amazon S3, which stores files in blocks (much like HDFS) to overcome S3’s 5 GB file size limit.
Distributed RAID (hdfs.DistributedRaidFileSystem)
A “RAID” version of HDFS designed for archival storage. For each file in HDFS, a (smaller) parity file is created, which allows the HDFS replication to be reduced from three to two, which reduces disk usage by 25% to 30%, while keeping the probability of data loss the same. Distributed RAID requires that you run a RaidNode daemon on the cluster.
View (viewfs.ViewFileSystem)
A client-side mount table for other Hadoop filesystems. Commonly used to create mount points for federated namenodes (see “HDFS Federation”).
Hadoop provides many interfaces to its filesystems, and it generally uses the URI
scheme to pick the correct filesystem instance to communicate with. For example,
the filesystem shell that we met in the previous section operates with all Hadoop
filesystems. To list the files in the root directory of the local filesystem, type:
% hadoop fs -ls file:///
Hadoop is written in Java, and all Hadoop filesystem interactions are mediated
through the Java API. The filesystem shell, for example, is a Java application that
uses the Java FileSystem class to provide filesystem operations. The other
filesystem interfaces are discussed briefly in this section. These interfaces are most
commonly used with HDFS, since the other filesystems in Hadoop typically have
existing tools to access the underlying filesystem (FTP clients for FTP, S3 tools for
S3, etc.), but many of them will work with any Hadoop filesystem.
There are two ways of accessing HDFS over HTTP: directly, where the HDFS
daemons serve HTTP requests to clients; and via a proxy (or proxies), which
accesses HDFS on the client’s behalf using the usual DistributedFileSystem API.
The two ways are illustrated in Figure 3-1.
Figure 3-1. Accessing HDFS over HTTP directly, and via a bank of HDFS proxies
Data Flow
Anatomy of a File Read
To get an idea of how data flows between the client interacting with HDFS, the
namenode and the datanodes, consider Figure 3-2, which shows the main
sequence of events when reading a file.
Figure 3-2. A client reading data from HDFS
The client opens the file it wishes to read by calling open() on the FileSystem object,
which for HDFS is an instance of DistributedFileSystem (step 1 in Figure 3-2).
DistributedFileSystem calls the namenode, using RPC, to determine the locations of
the first few blocks in the file (step 2). For each block, the namenode
returns the addresses of the datanodes that have a copy of that block. Furthermore,
the datanodes are sorted according to their proximity to the client (according to the
topology of the cluster’s network; see “Network Topology and Hadoop”). If the
client is itself a datanode (in the case of a MapReduce task, for instance), then it will
read from the local datanode, if it hosts a copy of the block (see also Figure 2-2).
The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn
wraps a DFSInputStream, which manages the datanode and namenode I/O.
The client then calls read() on the stream (step 3). DFSInputStream, which has
stored the datanode addresses for the first few blocks in the file, then connects to
the first (closest) datanode for the first block in the file. Data is streamed from the
datanode back to the client, which calls read() repeatedly on the stream (step 4).
When the end of the block is reached, DFSInputStream will close the connection to
the datanode, then find the best datanode for the next block (step 5). This happens
transparently to the client, which from its point of view is just reading a continuous stream.
Blocks are read in order with the DFSInputStream opening new connections to
datanodes as the client reads through the stream. It will also call the namenode to
retrieve the datanode locations for the next batch of blocks as needed. When the
client has finished reading, it calls close() on the FSDataInputStream (step 6).
During reading, if the DFSInputStream encounters an error while communicating
with a datanode, then it will try the next closest one for that block. It will also
remember datanodes that have failed so that it doesn’t needlessly retry them for
later blocks. The DFSInputStream also verifies checksums for the data transferred to
it from the datanode. If a corrupted block is found, it is reported to the namenode
before the DFSInputStream attempts to read a replica of the block from another datanode.
One important aspect of this design is that the client contacts datanodes directly to
retrieve data and is guided by the namenode to the best datanode for each block.
This design allows HDFS to scale to a large number of concurrent clients, since the
data traffic is spread across all the datanodes in the cluster. The namenode
meanwhile merely has to service block location requests (which it stores in memory,
making them very efficient) and does not, for example, serve data, which would
quickly become a bottleneck as the number of clients grew.
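The error handling during a read (try the next closest datanode, and remember failed datanodes so that they are not retried for later blocks) can be sketched as follows. This is a simplified simulation in plain Java, not the DFSInputStream implementation; the class name, the method names, and the use of a set of "reachable" node names to stand in for the network are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: choosing a replica to read from, skipping known-bad datanodes.
final class ReplicaReadSketch {
    // Datanodes that failed earlier in this read; they are not retried for later blocks.
    private final Set<String> deadNodes = new HashSet<>();

    // replicas: datanodes holding the block, ordered closest first.
    // reachable: stand-in for the network, naming the datanodes that currently respond.
    String chooseDatanode(List<String> replicas, Set<String> reachable) {
        for (String node : replicas) {
            if (deadNodes.contains(node)) {
                continue; // already failed once; do not retry needlessly
            }
            if (reachable.contains(node)) {
                return node; // closest live replica
            }
            deadNodes.add(node); // remember the failure and try the next closest
        }
        throw new IllegalStateException("no live replica for block");
    }

    Set<String> deadNodes() {
        return deadNodes;
    }

    public static void main(String[] args) {
        ReplicaReadSketch reader = new ReplicaReadSketch();
        Set<String> reachable = new HashSet<>(Arrays.asList("dn2", "dn3"));
        System.out.println(reader.chooseDatanode(Arrays.asList("dn1", "dn2", "dn3"), reachable));
        System.out.println(reader.chooseDatanode(Arrays.asList("dn1", "dn3"), reachable));
        System.out.println(reader.deadNodes());
    }
}
```

For the first block the unreachable dn1 is skipped in favour of dn2; for the second block dn1 is not even attempted, and dn3 is chosen directly.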
Network Topology and Hadoop
What does it mean for two nodes in a local network to be “close” to each other? In
the context of high-volume data processing, the limiting factor is the rate at which we
can transfer data between nodes—bandwidth is a scarce commodity. The idea is to
use the bandwidth between two nodes as a measure of distance.
Processes on the same node
Different nodes on the same rack
Nodes on different racks in the same data center
Nodes in different data centers
For example, imagine a node n1 on rack r1 in data center d1. This can be
represented as /d1/r1/n1. Using this notation, here are the distances for the four scenarios:
distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)
distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)
At the time of this writing, Hadoop is not suited for running across data centers.
This is illustrated schematically in Figure 3-3. (Mathematically inclined readers
will notice that this is an example of a distance metric.)
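The four distances listed above are consistent with treating the network as a tree and taking the distance between two nodes to be the sum of their distances to their closest common ancestor. The sketch below implements that reading in plain Java; the class and method names are hypothetical, and this is a reconstruction from the listed values rather than Hadoop's own code.

```java
// Illustrative only: tree distance between two /datacenter/rack/node paths.
final class NetworkDistance {
    static int distance(String a, String b) {
        // Drop the leading "/" and split into path components.
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        // Count the leading components the two paths share.
        int common = 0;
        while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
            common++;
        }
        // Hops from each node up to the closest common ancestor.
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1"));
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2"));
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3"));
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4"));
    }
}
```

Running it reproduces the values 0, 2, 4, and 6 for the four scenarios.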
Finally, it is important to realize that Hadoop cannot divine your network
topology for you. It needs some help; we’ll cover how to configure topology in
“Network Topology”. By default though, it assumes that the network is flat (a
single-level hierarchy) or, in other words, that all nodes are on a single rack in a
single data center. For small clusters, this may actually be the case, and no
further configuration is required.
Figure 3-3. Network distance in Hadoop
Anatomy of a File Write
Next we’ll look at how files are written to HDFS. Although the process is quite detailed, it is instructive to understand the data flow since it clarifies HDFS’s coherency model.
The case we’re going to consider is the case of creating a new file, writing data to it,
then closing the file. See Figure 3-4.
The client creates the file by calling create() on DistributedFileSystem (step 1 in
Figure 3-4). DistributedFileSystem makes an RPC call to the namenode to create a
new file in the filesystem’s namespace, with no blocks associated with it (step 2).
The namenode performs various checks to make sure the file doesn’t already exist,
and that the client has the right permissions to create the file. If these checks pass,
the namenode makes a record of the new file; otherwise, file creation fails and the
client is thrown an IOException. The DistributedFileSystem returns an
FSDataOutputStream for the client to start writing data to. Just as in the read case,
FSDataOutputStream wraps a DFSOutputStream, which handles communication
with the datanodes and namenode.
As the client writes data (step 3), DFSOutputStream splits it into packets, which it
writes to an internal queue, called the data queue. The data queue is consumed by
the DataStreamer, whose responsibility it is to ask the namenode to allocate new
blocks by picking a list of suitable datanodes to store the replicas. The list of
datanodes forms a pipeline—we’ll assume the replication level is three, so there are
three nodes in the pipeline. The DataStreamer streams the packets to the first
datanode in the pipeline, which stores the packet and forwards it to the second
datanode in the pipeline. Similarly, the second datanode stores the packet and
forwards it to the third (and last) datanode in the pipeline (step 4).
Figure 3-4. A client writing data to HDFS
DFSOutputStream also maintains an internal queue of packets that are waiting to be
acknowledged by datanodes, called the ack queue. A packet is removed from the
ack queue only when it has been acknowledged by all the datanodes in the pipeline
(step 5).
If a datanode fails while data is being written to it, then the following actions are
taken, which are transparent to the client writing the data. First the pipeline is closed,
and any packets in the ack queue are added to the front of the data queue so that
datanodes that are downstream from the failed node will not miss any packets. The
current block on the good datanodes is given a new identity, which is communicated
to the namenode, so that the partial block on the failed datanode will be deleted if
the failed datanode recovers later on. The failed datanode is removed from the
pipeline and the remainder of the block’s data is written to the two good datanodes
in the pipeline. The namenode notices that the block is under-replicated, and it
arranges for a further replica to be created on another node. Subsequent blocks are
then treated as normal.
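The interplay between the data queue and the ack queue, including what happens to unacknowledged packets when the pipeline fails, can be sketched with two deques. This is a simplified model in plain Java, not the DFSOutputStream implementation; the class name, the method names, and the use of integers as packets are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: the data queue and ack queue kept by the writing client.
final class WriteQueuesSketch {
    final Deque<Integer> dataQueue = new ArrayDeque<>(); // packets waiting to be sent
    final Deque<Integer> ackQueue = new ArrayDeque<>();  // packets sent but not yet acknowledged

    // The client writes data, which is split into packets and queued.
    void write(int packet) {
        dataQueue.addLast(packet);
    }

    // The streamer sends the next packet down the pipeline; it now awaits acknowledgment.
    void sendNext() {
        ackQueue.addLast(dataQueue.removeFirst());
    }

    // Every datanode in the pipeline has acknowledged the oldest outstanding packet.
    void ackOldest() {
        ackQueue.removeFirst();
    }

    // On pipeline failure, unacknowledged packets go back to the front of the data
    // queue in their original order, so downstream datanodes do not miss any of them.
    void onPipelineFailure() {
        while (!ackQueue.isEmpty()) {
            dataQueue.addFirst(ackQueue.removeLast());
        }
    }

    public static void main(String[] args) {
        WriteQueuesSketch w = new WriteQueuesSketch();
        for (int p = 1; p <= 4; p++) {
            w.write(p);
        }
        w.sendNext();
        w.sendNext();
        w.sendNext();
        w.ackOldest();         // packet 1 fully acknowledged
        w.onPipelineFailure(); // packets 2 and 3 must be resent before packet 4
        System.out.println("data queue: " + w.dataQueue + ", ack queue: " + w.ackQueue);
    }
}
```

After the simulated failure, packets 2 and 3 sit ahead of packet 4 in the data queue and the ack queue is empty, so they are resent to the remaining datanodes in order.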
It’s possible, but unlikely, that multiple datanodes fail while a block is being written.
As long as dfs.replication.min replicas (default one) are written, the write will
succeed, and the block will be asynchronously replicated across the cluster until its
target replication factor is reached (dfs.replication, which defaults to three).
When the client has finished writing data, it calls close() on the stream (step 6). This
action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete
(step 7). The namenode already knows which blocks the file is made up of (via
DataStreamer asking for block allocations), so it only has to wait for blocks to be
minimally replicated before returning successfully.
Replica Placement
Once the replica locations have been chosen, a pipeline is built, taking network
topology into account. For a replication factor of 3, the pipeline might look like
Figure 3-5.
Figure 3-5. A typical replica pipeline
Keeping an HDFS Cluster Balanced
When copying data into HDFS, it’s important to consider cluster balance. HDFS
works best when the file blocks are evenly spread across the cluster, so you want to
ensure that distcp doesn’t disrupt this. Going back to the 1,000 GB example, by
specifying -m 1 a single map would do the copy, which—apart from being slow and
not using the cluster resources efficiently—would mean that the first replica of each
block would reside on the node running the map (until the disk filled up). The second
and third replicas would be spread across the cluster, but this one node would be
unbalanced. By having more maps than nodes in the cluster, this problem is
avoided; for this reason, it’s best to start by running distcp with the default of 20
maps per node.
However, it’s not always possible to prevent a cluster from becoming unbalanced.
Perhaps you want to limit the number of maps so that some of the nodes can be
used by other jobs. In this case, you can use the balancer to subsequently even out
the block distribution across the cluster.
Hadoop Archives
HDFS stores small files inefficiently, since each file is stored in a block, and block
metadata is held in memory by the namenode. Thus, a large number of small files
can eat up a lot of memory on the namenode. (Note, however, that small files do
not take up any more disk space than is required to store the raw contents of
the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB
of disk space, not 128 MB.)
Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS
blocks more efficiently, thereby reducing namenode memory usage while still
allowing transparent access to files. In particular, Hadoop Archives can be used as
input to MapReduce.
Using Hadoop Archives
A Hadoop Archive is created from a collection of files using the archive tool. The tool
runs a MapReduce job to process the input files in parallel, so you need a running
MapReduce cluster to use it. Here are some files in HDFS that we would like
to archive:
% hadoop fs -lsr /my/files
-rw-r--r--   1 tom supergroup          1 2009-04-09 19:13 /my/files/a
drwxr-xr-x   - tom supergroup          0 2009-04-09 19:13 /my/files/dir
-rw-r--r--   1 tom supergroup          1 2009-04-09 19:13 /my/files/dir/b
Now we can run the archive command:
% hadoop archive -archiveName files.har /my/files /my
The first option is the name of the archive, here files.har. HAR files always have a
.har extension, which is mandatory for reasons we shall see later. Next come the
files to put in the archive. Here we are archiving only one source tree, the files in
/my/files in HDFS, but the tool accepts multiple source trees. The final argument is
the output directory for the HAR file. Let’s see what the archive has created:
% hadoop fs -ls /my
Found 2 items
drwxr-xr-x   - tom supergroup          0 2009-04-09 19:13 /my/files
drwxr-xr-x   - tom supergroup          0 2009-04-09 19:13 /my/files.har
% hadoop fs -ls /my/files.har
Found 3 items
-rw-r--r--  10 tom supergroup        165 2009-04-09 19:13 /my/files.har/_index
-rw-r--r--  10 tom supergroup         23 2009-04-09 19:13 /my/files.har/_masterindex
-rw-r--r--   1 tom supergroup          2 2009-04-09 19:13 /my/files.har/part-0
The directory listing shows what a HAR file is made of: two index files and a
collection of part files—just one in this example. The part files contain the contents of
a number of the original files concatenated together, and the indexes make it
possible to look up the part file that an archived file is contained in, and its offset and
length. All these details are hidden from the application, however, which uses the
har URI scheme to interact with HAR files, using a HAR filesystem that is layered on
top of the underlying filesystem (HDFS in this case). The following command
recursively lists the files in the archive:
% hadoop fs -lsr har:///my/files.har
drw-r--r--   - tom supergroup          0 2009-04-09 19:13 /my/files.har/my
drw-r--r--   - tom supergroup          0 2009-04-09 19:13 /my/files.har/my/files
-rw-r--r--  10 tom supergroup          1 2009-04-09 19:13 /my/files.har/my/files/a
drw-r--r--   - tom supergroup          0 2009-04-09 19:13 /my/files.har/my/files/dir
-rw-r--r--  10 tom supergroup          1 2009-04-09 19:13 /my/files.har/my/files/dir/b
This is quite straightforward if the filesystem that the HAR file is on is the default
filesystem. On the other hand, if you want to refer to a HAR file on a different
filesystem, then you need to use a different form of the path URI from the normal one. These
two commands have the same effect, for example:
% hadoop fs -lsr har:///my/files.har/my/files/dir
% hadoop fs -lsr har://hdfs-localhost:8020/my/files.har/my/files/dir
Notice in the second form that the scheme is still har to signify a HAR filesystem, but
the authority is hdfs to specify the underlying filesystem’s scheme, followed by a
dash and the HDFS host (localhost) and port (8020). We can now see why HAR files
have to have a .har extension. The HAR filesystem translates the har URI into a URI
for the underlying filesystem, by looking at the authority and path up to and including
the component ending in .har; in this case, that is
hdfs://localhost:8020/my/files.har. The remaining part of the path is the path of the
file in the archive: /my/files/dir.
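The translation just described can be approximated with java.net.URI. This is a minimal sketch under stated assumptions, not Hadoop's HAR filesystem code: the class HarUriSketch, the method translate(), and the defaultFs parameter (standing in for the configured default filesystem) are all hypothetical.

```java
import java.net.URI;

/** Hypothetical helper illustrating the har URI translation; not Hadoop code. */
public class HarUriSketch {

    /**
     * Splits a har URI into { archive URI on the underlying filesystem,
     * path of the file inside the archive }. The defaultFs argument is used
     * when the har URI carries no authority (the har:///... form).
     */
    public static String[] translate(String harUri, String defaultFs) {
        URI uri = URI.create(harUri);
        if (!"har".equals(uri.getScheme()) || uri.getPath() == null) {
            throw new IllegalArgumentException("Not a har URI: " + harUri);
        }
        String underlying = defaultFs;
        String authority = uri.getAuthority();
        if (authority != null) {
            // The authority has the form <scheme>-<host>:<port>, e.g. hdfs-localhost:8020
            int dash = authority.indexOf('-');
            if (dash < 0) {
                throw new IllegalArgumentException("Unexpected authority: " + authority);
            }
            underlying = authority.substring(0, dash) + "://" + authority.substring(dash + 1);
        }
        String[] parts = uri.getPath().split("/");
        StringBuilder archive = new StringBuilder(underlying);
        boolean found = false;
        int i = 1; // parts[0] is empty because the path starts with '/'
        for (; i < parts.length && !found; i++) {
            archive.append('/').append(parts[i]);
            found = parts[i].endsWith(".har"); // stop after the .har component
        }
        if (!found) {
            throw new IllegalArgumentException("No .har component in: " + harUri);
        }
        StringBuilder inner = new StringBuilder();
        for (; i < parts.length; i++) {
            inner.append('/').append(parts[i]);
        }
        return new String[] { archive.toString(), inner.length() == 0 ? "/" : inner.toString() };
    }

    public static void main(String[] args) {
        String[] result = translate(
            "har://hdfs-localhost:8020/my/files.har/my/files/dir", "hdfs://localhost:8020");
        System.out.println(result[0]); // archive location on the underlying filesystem
        System.out.println(result[1]); // path within the archive
    }
}
```

The sketch relies on the .har suffix to find where the archive path ends, which is the reason the extension is mandatory.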
To delete a HAR file, you need to use the recursive form of delete, since from the
underlying filesystem’s point of view the HAR file is a directory:
% hadoop fs -rmr /my/files.har
There are a few limitations to be aware of with HAR files. Creating an archive
creates a copy of the original files, so you need as much disk space as the files you
are archiving to create the archive (although you can delete the originals once you
have created the archive). There is currently no support for archive compression,
although the files that go into the archive can be compressed (HAR files are like tar
files in this respect).
Archives are immutable once they have been created. To add or remove files, you
must re-create the archive. In practice, this is not a problem for files that don’t
change after being written, since they can be archived in batches on a regular basis,
such as daily or weekly.
As noted earlier, HAR files can be used as input to MapReduce. However, there is
no archive-aware InputFormat that can pack multiple files into a single MapReduce
split, so processing lots of small files, even in a HAR file, can still be inefficient.
Finally, if you are hitting namenode memory limits even after taking steps to
minimize the number of small files in the system, then consider using HDFS
Federation to scale the namespace.
Hadoop I/O
Hadoop comes with a set of primitives for data I/O. Some of these are techniques
that are more general than Hadoop, such as data integrity and compression, but
deserve special consideration when dealing with multiterabyte datasets. Others are
Hadoop tools or APIs that form the building blocks for developing distributed
systems, such as serialization frameworks and on-disk data structures.
Data Integrity
Users of Hadoop rightly expect that no data will be lost or corrupted during storage
or processing. However, since every I/O operation on the disk or network carries
with it a small chance of introducing errors into the data that it is reading or writing,
when the volumes of data flowing through the system are as large as the ones
Hadoop is capable of handling, the chance of data corruption occurring is high.
The usual way of detecting corrupted data is by computing a checksum for the data
when it first enters the system, and again whenever it is transmitted across a
channel that is unreliable and hence capable of corrupting the data. The data is
deemed to be corrupt if the newly generated checksum doesn’t exactly match the
original. This technique doesn’t offer any way to fix the data, merely error
detection. (And this is a reason for not using low-end hardware; in particular, be sure
to use ECC memory.) Note that it is possible that it’s the checksum that is corrupt,
not the data, but this is very unlikely, since the checksum is much smaller than the data.
A commonly used error-detecting code is CRC-32 (cyclic redundancy check), which
computes a 32-bit integer checksum for input of any size.
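As an illustration of detecting corruption with CRC-32, the following sketch uses the JDK's java.util.zip.CRC32 class. The class name Crc32Sketch and the sample data are made up for the example; this is not how Hadoop invokes its checksum code.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Illustrative use of the JDK's CRC-32 implementation. */
public class Crc32Sketch {

    /** Computes the CRC-32 checksum of the given bytes as an unsigned 32-bit value. */
    public static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] original = "some block of data".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(original);       // computed when the data enters the system

        byte[] received = original.clone();
        received[0] ^= 0x01;                    // simulate a single flipped bit in transit

        // The data is deemed corrupt if the new checksum differs from the stored one.
        System.out.println("intact copy matches:    " + (checksum(original) == stored));
        System.out.println("corrupted copy matches: " + (checksum(received) == stored));
    }
}
```

A mismatch only says that something changed; as the text notes, it gives no way of repairing the data.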
Data Integrity in HDFS
HDFS transparently checksums all data written to it and by default verifies
checksums when reading data. A separate checksum is created for every
io.bytes.per.checksum bytes of data. The default is 512 bytes, and since a CRC-32
checksum is 4 bytes long, the storage overhead is less than 1%.
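The per-chunk scheme and its storage overhead can be sketched as follows. The class ChunkChecksums and its methods are hypothetical, and the constant simply mirrors the 512-byte default mentioned above; the real HDFS implementation differs.

```java
import java.util.zip.CRC32;

/** Hypothetical sketch of per-chunk checksumming; not the HDFS implementation. */
public class ChunkChecksums {
    // Assumed chunk size, mirroring the 512-byte default described in the text.
    public static final int BYTES_PER_CHECKSUM = 512;

    /** Computes one 4-byte CRC-32 value for every chunk of BYTES_PER_CHECKSUM bytes. */
    public static int[] compute(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        int[] sums = new int[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int offset = i * BYTES_PER_CHECKSUM;
            int length = Math.min(BYTES_PER_CHECKSUM, data.length - offset);
            crc.reset();
            crc.update(data, offset, length);
            sums[i] = (int) crc.getValue(); // keep the low 32 bits
        }
        return sums;
    }

    /** Returns the index of the first chunk whose checksum no longer matches, or -1. */
    public static int firstCorruptChunk(byte[] data, int[] expected) {
        int[] actual = compute(data);
        for (int i = 0; i < expected.length; i++) {
            if (i >= actual.length || actual[i] != expected[i]) {
                return i;
            }
        }
        return -1;
    }

    /** Checksum bytes as a percentage of the data bytes. */
    public static double overheadPercent(int dataLength) {
        int chunks = (dataLength + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        return 100.0 * chunks * Integer.BYTES / dataLength;
    }

    public static void main(String[] args) {
        byte[] data = new byte[4096];
        int[] sums = compute(data);
        System.out.printf("overhead: %.2f%%%n", overheadPercent(data.length));
        data[1000] ^= 0x40; // corrupt one byte in the second chunk
        System.out.println("first corrupt chunk: " + firstCorruptChunk(data, sums));
    }
}
```

With 4 checksum bytes per 512 data bytes the overhead is 4/512, or about 0.78%, which is where the "less than 1%" figure comes from. Checksumming per chunk also narrows a detected error down to one chunk.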
Datanodes are responsible for verifying the data they receive before storing the
data and its checksum. This applies to data that they receive from clients and from
other datanodes during replication. A client writing data sends it to a pipeline of
datanodes (as explained in Chapter 3), and the last datanode in the pipeline verifies
the checksum. If it detects an error, the client receives a ChecksumException, a
subclass of IOException, which it should handle in an application-specific manner,
by retrying the operation, for example. When clients read data from datanodes,
they verify checksums as well, comparing them with the ones stored at the
datanode. Each datanode keeps a persistent log of checksum verifications, so it
knows the last time each of its blocks was verified. When a client successfully
verifies a block, it tells the datanode, which updates its log. Keeping statistics such
as these is valuable in detecting bad disks.
Aside from block verification on client reads, each datanode runs a
DataBlockScanner in a background thread that periodically verifies all the blocks
stored on the datanode. This is to guard against corruption due to “bit rot” in the
physical storage media. See “Datanode block scanner” for details on how to access
the scanner reports.
Since HDFS stores replicas of blocks, it can “heal” corrupted blocks by copying one
of the good replicas to produce a new, uncorrupt replica. The way this works is that if
a client detects an error when reading a block, it reports the bad block and the
datanode it was trying to read from to the namenode before throwing a
ChecksumException. The namenode marks the block replica as corrupt, so it
doesn’t direct clients to it, or try to copy this replica to another datanode. It then
schedules a copy of the block to be replicated on another datanode, so its
replication factor is back at the expected level. Once this has happened, the corrupt
replica is deleted.
It is possible to disable verification of checksums by passing false to the
setVerifyChecksum() method on FileSystem, before using the open() method to read a file.
The same effect is possible from the shell by using the -ignoreCrc option with the get or the equivalent -copyToLocal command. This feature is useful if you have a
corrupt file that you want to inspect so you can decide what to do with it. For
example, you might want to see whether it can be salvaged before you delete it.
The Hadoop LocalFileSystem performs client-side checksumming. This means that
when you write a file called filename, the filesystem client transparently creates a
hidden file, .filename.crc, in the same directory containing the checksums for each
chunk of the file. Like HDFS, the chunk size is controlled by the
io.bytes.per.checksum property, which defaults to 512 bytes. The chunk size is
stored as metadata in the .crc file, so the file can be read back correctly even if the
setting for the chunk size has changed. Checksums are verified when the file is
read, and if an error is detected, LocalFileSystem throws a ChecksumException.
Checksums are fairly cheap to compute (in Java, they are implemented in native
code), typically adding a few percent overhead to the time to read or write a file. For
most applications, this is an acceptable price to pay for data integrity. It is, however,
possible to disable checksums: typically when the underlying filesystem supports
checksums natively. This is accomplished by using RawLocalFileSystem in place of
LocalFileSystem. To do this globally in an application, it suffices to remap the
implementation for file URIs by setting the property fs.file.impl to the value
org.apache.hadoop.fs.RawLocalFileSystem. Alternatively, you can directly create a
RawLocalFileSystem instance, which may be useful if you want to disable
checksum verification for only some reads; for example:
Configuration conf = ...
FileSystem fs = new RawLocalFileSystem();
fs.initialize(null, conf);
Serialization
Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the
reverse process of turning a byte stream back into a series of structured objects.
Serialization appears in two quite distinct areas of distributed data processing: for
interprocess communication and for persistent storage.
In Hadoop, interprocess communication between nodes in the system is
implemented using remote procedure calls (RPCs). The RPC protocol uses
serialization to render the message into a binary stream to be sent to the remote
node, which then deserializes the binary stream into the original message. In
general, it is desirable that an RPC serialization format is:
Compact
A compact format makes the best use of network bandwidth, which is the most
scarce resource in a data center.
Fast
Interprocess communication forms the backbone for a distributed system, so it is
essential that there is as little performance overhead as possible for the
serialization and deserialization process.
Extensible
Protocols change over time to meet new requirements, so it should be
straightforward to evolve the protocol in a controlled manner for clients and
servers. For example, it should be possible to add a new argument to a method
call, and have the new servers accept messages in the old format (without the
new argument) from old clients.
Interoperable
For some systems, it is desirable to be able to support clients that are written in
different languages to the server, so the format needs to be designed to make
this possible.
On the face of it, the data format chosen for persistent storage would have different
requirements from a serialization framework. After all, the lifespan of an RPC is less
than a second, whereas persistent data may be read years after it was written. As it
turns out, the four desirable properties of an RPC’s serialization format are also
crucial for a persistent storage format. We want the storage format to be compact (to
make efficient use of storage space), fast (so the overhead in reading or writing
terabytes of data is minimal), extensible (so we can transparently read data written
in an older format), and interoperable (so we can read or write persistent data using
different languages).
Hadoop uses its own serialization format, Writables, which is certainly compact and
fast, but not so easy to extend or use from languages other than Java. Since
Writables are central to Hadoop (most MapReduce programs use them for their key
and value types), we look at them in some depth in the next three sections, before
looking at serialization frameworks in general, and then Avro (a serialization system
that was designed to overcome some of the limitations of Writables) in more detail.
The Writable Interface
The Writable interface defines two methods: one for writing its state to a DataOutput
binary stream, and one for reading its state from a DataInput binary stream:
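import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public interface Writable {
  void write(DataOutput out) throws IOException;     // serialize this object's state
  void readFields(DataInput in) throws IOException;  // restore state from the stream
}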
Let’s look at a particular Writable to see what we can do with it. We will use
IntWritable, a wrapper for a Java int. We can create one and set its value using the
set() method:
IntWritable writable = new IntWritable();
writable.set(163);
Equivalently, we can use the constructor that takes the integer value:
IntWritable writable = new IntWritable(163);
To examine the serialized form of the IntWritable, we write a small helper method
that wraps a java.io.ByteArrayOutputStream in a java.io.DataOutputStream (an
implementation of java.io.DataOutput) to capture the bytes in the serialized stream:
public static byte[] serialize(Writable writable) throws IOException {
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  DataOutputStream dataOut = new DataOutputStream(out);
  writable.write(dataOut);  // the Writable writes its own state to the stream
  dataOut.close();
  return out.toByteArray();
}
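To try the same round trip without Hadoop on the classpath, the sketch below uses stand-ins built only on the JDK. SimpleWritable and IntBox are hypothetical types that imitate the shape of Writable and IntWritable; they are not Hadoop classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class IntBoxDemo {
    /** Hypothetical stand-in with the same two methods as Writable. */
    public interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Hypothetical wrapper for an int, loosely modelled on IntWritable. */
    public static class IntBox implements SimpleWritable {
        private int value;
        public IntBox() { }
        public IntBox(int value) { this.value = value; }
        public int get() { return value; }
        public void set(int value) { this.value = value; }
        @Override public void write(DataOutput out) throws IOException { out.writeInt(value); }
        @Override public void readFields(DataInput in) throws IOException { value = in.readInt(); }
    }

    /** Captures the serialized bytes of a writable in a byte array. */
    public static byte[] serialize(SimpleWritable writable) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream dataOut = new DataOutputStream(out)) {
            writable.write(dataOut);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Populates an existing writable from previously serialized bytes. */
    public static void deserialize(SimpleWritable writable, byte[] bytes) {
        try (DataInputStream dataIn = new DataInputStream(new ByteArrayInputStream(bytes))) {
            writable.readFields(dataIn);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hexadecimal rendering of a byte array, two digits per byte. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new IntBox(163));
        System.out.println(bytes.length + " bytes: " + hex(bytes)); // 4 bytes: 000000a3
        IntBox copy = new IntBox();
        deserialize(copy, bytes);
        System.out.println(copy.get()); // 163
    }
}
```

DataOutput.writeInt() writes an int as four bytes in big-endian order, so 163 appears as 000000a3.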
Writable wrappers for Java primitives
There are Writable wrappers for all the Java primitive types (see Table 4-7) except
char (which can be stored in an IntWritable). All have a get() and a set() method for
retrieving and storing the wrapped value.
Figure 4-1. Writable class hierarchy
Table 4-7. Writable wrapper classes for Java primitives
Serialization Frameworks
Although most MapReduce programs use Writable key and value types, this isn’t
mandated by the MapReduce API. In fact, any types can be used; the only
requirement is that there be a mechanism that translates to and from a binary
representation of each type.
To support this, Hadoop has an API for pluggable serialization frameworks. A
serialization framework is represented by an implementation of Serialization (in the
org.apache.hadoop.io.serializer package). WritableSerialization, for example, is the
implementation of Serialization for Writable types.
A Serialization defines a mapping from types to Serializer instances (for turning an
object into a byte stream) and Deserializer instances (for turning a byte stream into
an object).
Set the io.serializations property to a comma-separated list of classnames to register
Serialization implementations. Its default value includes
org.apache.hadoop.io.serializer.WritableSerialization and the Avro specific and reflect serializations, which
means that only Writable or Avro objects can be serialized or deserialized out of the
box. Hadoop includes a class called JavaSerialization that uses Java Object
Serialization. Although it makes it convenient to be able to use standard Java types
in MapReduce programs, like Integer or String, Java Object Serialization is not as
efficient as Writables, so it’s not worth making this trade-off (see the following
sidebar).
Why Not Use Java Object Serialization?
Java comes with its own serialization mechanism, called Java Object Serialization
(often referred to simply as “Java Serialization”), that is tightly integrated with the
language, so it’s natural to ask why this wasn’t used in Hadoop. Here’s what Doug
Cutting said in response to that question:
Why didn’t I use Serialization when we first started Hadoop? Because it
looked big and hairy and I thought we needed something lean and mean,
where we had precise control over exactly how objects are written and read,
since that is central to Hadoop. With Serialization you can get some control,
but you have to fight for it.
The logic for not using RMI was similar. Effective, high-performance interprocess communications are critical to Hadoop. I felt like we’d need to
precisely control how things like connections, timeouts and buffers are
handled, and RMI gives you little control over those.
The problem is that Java Serialization doesn’t meet the criteria for a serialization
format listed earlier: compact, fast, extensible, and interoperable.
Java Serialization is not compact: it writes the classname of each object being
written to the stream; this is true of classes that implement java.io.Serializable or
java.io.Externalizable. Subsequent instances of the same class write a reference
handle to the first occurrence, which occupies only 5 bytes. However, reference
handles don’t work well with random access, since the referent class may occur at
any point in the preceding stream—that is, there is state stored in the stream. Even
worse, reference handles play havoc with sorting records in a serialized stream,
since the first record of a particular class is distinguished and must be treated as a
special case.
All these problems are avoided by not writing the classname to the stream at all,
which is the approach that Writable takes. This makes the assumption that the client
knows the expected type. The result is that the format is considerably more compact
than Java Serialization, and random access and sorting work as expected since
each record is independent of the others (so there is no stream state).
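The size difference is easy to observe with the JDK alone. The sketch below is illustrative: the class SerializedSizeComparison and its methods are hypothetical names, and DataOutputStream.writeInt() stands in for what a Writable wrapper for an int would write.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

/** Compares stream sizes for Java Object Serialization and raw int output. */
public class SerializedSizeComparison {

    /** Size in bytes of one Java Object Serialization stream holding the given objects. */
    public static int javaSerializedSize(Object... objects) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (Object o : objects) {
                out.writeObject(o);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    /** Size in bytes of the given ints written with DataOutputStream.writeInt(). */
    public static int rawIntSize(int... values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int v : values) {
                out.writeInt(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        int one = javaSerializedSize(163);
        int two = javaSerializedSize(163, 164);
        System.out.println("Java serialization, one Integer: " + one + " bytes");
        System.out.println("Java serialization, second Integer adds: " + (two - one) + " bytes");
        System.out.println("writeInt, one int: " + rawIntSize(163) + " bytes");
        System.out.println("writeInt, two ints: " + rawIntSize(163, 164) + " bytes");
    }
}
```

The second Integer costs less than the first because the class description is replaced by a reference handle, which is exactly the stream state that makes random access and sorting awkward.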
Java Serialization is a general-purpose mechanism for serializing graphs of objects,
so it necessarily has some overhead for serialization and deserialization operations.
What’s more, the deserialization procedure creates a new instance for each object
deserialized from the stream. Writable objects, on the other hand, can be (and often
are) reused. For example, for a MapReduce job, which at its core serializes and
deserializes billions of records of just a handful of different types, the savings gained
by not having to allocate new objects are significant.
In terms of extensibility, Java Serialization has some support for evolving a type, but
it is brittle and hard to use effectively (Writables have no support: programmers
have to manage type evolution themselves).
In principle, other languages could interpret the Java Serialization stream protocol
(defined by the Java Object Serialization Specification), but in practice there are no
widely used implementations in other languages, so it is a Java-only solution. The
situation is the same for Writables.
Serialization IDL
There are a number of other serialization frameworks that approach the problem in a
different way: rather than defining types through code, you define them in a
language-neutral, declarative fashion, using an interface description language
(IDL). The system can then generate types for different languages, which is good for
interoperability. They also typically define versioning schemes that make type
evolution straightforward.
Hadoop’s own Record I/O (found in the org.apache.hadoop.record package) has an
IDL that is compiled into Writable objects, which makes it convenient for generating
types that are compatible with MapReduce. For whatever reason, however, Record
I/O was not widely used, and has been deprecated in favor of Avro.
Apache Thrift and Google Protocol Buffers are both popular serialization
frameworks, and they are commonly used as a format for persistent binary data.
There is limited support for these as MapReduce formats; however, they are used
internally in parts of Hadoop for RPC and data exchange.
File-Based Data Structures
For some applications, you need a specialized data structure to hold your data. For
doing MapReduce-based processing, putting each blob of binary data into its own
file doesn’t scale, so Hadoop developed a number of higher-level containers for
these situations.
Imagine a logfile, where each log record is a new line of text. If you want to log
binary types, plain text isn’t a suitable format. Hadoop’s SequenceFile class fits the
bill in this situation, providing a persistent data structure for binary key-value pairs.
To use it as a logfile format, you would choose a key, such as timestamp
represented by a LongWritable, and the value is a Writable that represents the
quantity being logged. SequenceFiles also work well as containers for smaller files.
HDFS and MapReduce are optimized for large files, so packing files into a
SequenceFile makes storing and processing the smaller files more efficient.
Writing a SequenceFile
To create a SequenceFile, use one of its createWriter() static methods, which
returns a SequenceFile.Writer instance. There are several overloaded versions, but
they all require you to specify a stream to write to (either an FSDataOutputStream or
a FileSystem and Path pairing), a Configuration object, and the key and value
types. Optional arguments include the compression type and codec, a Progressable
callback to be informed of write progress, and a Metadata instance to be stored in
the SequenceFile header. The keys and values stored in a SequenceFile do not
necessarily need to be Writable. Any types that can be serialized and deserialized
by a Serialization may be used. Once you have a SequenceFile.Writer, you then
write key-value pairs, using the append() method. Then when you’ve finished, you
call the close() method (SequenceFile.Writer implements java.io.Closeable).
Example 4-14 shows a short program to write some key-value pairs to a
SequenceFile, using the API just described.
Example 4-14. Writing a SequenceFile
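import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
public class SequenceFileWriteDemo {
  private static final String[] DATA = {
    "One, two, buckle my shoe",
    "Three, four, shut the door",
    "Five, six, pick up sticks",
    "Seven, eight, lay them straight",
    "Nine, ten, a big fat hen"
  };
  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);
    IntWritable key = new IntWritable();  // reused for every record
    Text value = new Text();              // reused for every record
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path,
          key.getClass(), value.getClass());
      for (int i = 0; i < 100; i++) {
        key.set(100 - i);                 // keys count down from 100 to 1
        value.set(DATA[i % DATA.length]);
        System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
        writer.append(key, value);
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}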
The keys in the sequence file are integers counting down from 100 to 1, represented
as IntWritable objects. The values are Text objects. Before each record is appended
to the SequenceFile.Writer, we call the getLength() method to discover the current
position in the file. (We will use this information about record boundaries in the next
section when we read the file nonsequentially.) We write the position out to the
console, along with the key and value pairs. The result of running it is shown here:
% hadoop SequenceFileWriteDemo numbers.seq
100	One, two, buckle my shoe
99	Three, four, shut the door
98	Five, six, pick up sticks
97	Seven, eight, lay them straight
96	Nine, ten, a big fat hen
95	One, two, buckle my shoe
94	Three, four, shut the door
93	Five, six, pick up sticks
92	Seven, eight, lay them straight
91	Nine, ten, a big fat hen
One, two, buckle my shoe
Three, four, shut the door
Five, six, pick up sticks
Seven, eight, lay them straight
Nine, ten, a big fat hen
One, two, buckle my shoe
Three, four, shut the door
Five, six, pick up sticks
Seven, eight, lay them straight
Nine, ten, a big fat hen
Reading a SequenceFile
Reading sequence files from beginning to end is a matter of creating an instance of
SequenceFile.Reader and iterating over records by repeatedly invoking one of the
next() methods. Which one you use depends on the serialization framework you are
using. If you are using Writable types, you can use the next() method that takes a
key and a value argument, and reads the next key and value in the stream into these
variables:
public boolean next(Writable key, Writable val)
The return value is true if a key-value pair was read and false if the end of the file
has been reached. For other, non-Writable serialization frameworks (such as Apache
Thrift), you should use these two methods:
public Object next(Object key) throws IOException
public Object getCurrentValue(Object val) throws IOException
In this case, you need to make sure that the serialization you want to use has been
set in the io.serializations property; see “Serialization Frameworks”.
If the next() method returns a non-null object, a key-value pair was read from the
stream, and the value can be retrieved using the getCurrentValue() method.
Otherwise, if next() returns null, the end of the file has been reached.
The program in Example 4-15 demonstrates how to read a sequence file that has
Writable keys and values. Note how the types are discovered from the
SequenceFile.Reader via calls to getKeyClass() and getValueClass(), then ReflectionUtils is
used to create an instance for the key and an instance for the value. By using this
technique, the program can be used with any sequence file that has Writable keys
and values.
Example 4-15. Reading a SequenceFile
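import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
public class SequenceFileReadDemo {
  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);
    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(fs, path, conf);
      Writable key = (Writable)
          ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable)
          ReflectionUtils.newInstance(reader.getValueClass(), conf);
      long position = reader.getPosition();
      while (reader.next(key, value)) {
        String syncSeen = reader.syncSeen() ? "*" : "";  // mark records that follow a sync point
        System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
        position = reader.getPosition(); // beginning of next record
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}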
Another feature of the program is that it displays the position of the sync points in the
sequence file. A sync point is a point in the stream that can be used to
resynchronize with a record boundary if the reader is “lost”—for example, after
seeking to an arbitrary position in the stream. Sync points are recorded by
SequenceFile.Writer, which inserts a special entry to mark the sync point every few
records as a sequence file is being written. Such entries are small enough to incur
only a modest storage overhead—less than 1%. Sync points always align with
record boundaries.
Running the program in Example 4-15 shows the sync points in the sequence file as
asterisks. The first one occurs at position 2021 (the second one occurs at position
4075, but is not shown in the output):
% hadoop SequenceFileReadDemo numbers.seq
100	One, two, buckle my shoe
99	Three, four, shut the door
98	Five, six, pick up sticks
97	Seven, eight, lay them straight
96	Nine, ten, a big fat hen
95	One, two, buckle my shoe
94	Three, four, shut the door
93	Five, six, pick up sticks
92	Seven, eight, lay them straight
91	Nine, ten, a big fat hen
One, two, buckle my shoe
One, two, buckle my shoe
Three, four, shut the door
Five, six, pick up sticks
Seven, eight, lay them straight
Nine, ten, a big fat hen
One, two, buckle my shoe
Three, four, shut the door
Five, six, pick up sticks
Seven, eight, lay them straight
Nine, ten, a big fat hen
There are two ways to seek to a given position in a sequence file. The first is the
seek() method, which positions the reader at the given point in the file. For example,
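seeking to a record boundary works as expected:
reader.seek(position); // position: the offset at which the record with key 95 starts
assertThat(reader.next(key, value), is(true));
assertThat(((IntWritable) key).get(), is(95));
But if the position in the file is not at a record boundary, the reader fails when the
next() method is called:
reader.seek(position + 1); // one byte past the record boundary
reader.next(key, value); // fails with IOException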
The second way to find a record boundary makes use of sync points. The sync(long
position) method on SequenceFile.Reader positions the reader at the next sync
point after position. (If there are no sync points in the file after this position, then the
reader will be positioned at the end of the file.) Thus, we can call sync() with any
position in the stream—a nonrecord boundary, for example—and the reader will
reestablish itself at the next sync point so reading can continue:
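reader.sync(position + 1); // position + 1: one byte past the start of the record with key 95
assertThat(reader.getPosition(), is(2021L));
assertThat(reader.next(key, value), is(true));
assertThat(((IntWritable) key).get(), is(59));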
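The idea behind resynchronizing at a sync point can be modelled on an in-memory byte array. The sketch below is a deliberately simplified, hypothetical format (length-prefixed records, with an escape value and a fixed 16-byte marker as the sync entry); it is not the actual SequenceFile layout, and in this model sync() returns the offset of the record that follows the sync entry.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Simplified, hypothetical model of sync points; not the SequenceFile format. */
public class SyncPointSketch {
    static final int SYNC_ESCAPE = -1;          // written where a record length is expected
    static final byte[] MARKER = new byte[16];  // fixed marker, assumed unique in the file
    static {
        Arrays.fill(MARKER, (byte) 0x5A);
    }

    private static void writeInt(ByteArrayOutputStream out, int value) {
        out.write(ByteBuffer.allocate(4).putInt(value).array(), 0, 4);
    }

    /** Writes length-prefixed records, with a sync entry before every interval-th record. */
    public static byte[] write(String[] records, int interval) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < records.length; i++) {
            if (i > 0 && i % interval == 0) {
                writeInt(out, SYNC_ESCAPE);
                out.write(MARKER, 0, MARKER.length);
            }
            byte[] payload = records[i].getBytes(StandardCharsets.UTF_8);
            writeInt(out, payload.length);
            out.write(payload, 0, payload.length);
        }
        return out.toByteArray();
    }

    /**
     * Scans forward from position for the next sync entry and returns the offset of
     * the record that follows it, or file.length if there is no further sync entry.
     */
    public static int sync(byte[] file, int position) {
        int entrySize = 4 + MARKER.length;
        for (int i = Math.max(position, 0); i + entrySize <= file.length; i++) {
            boolean escape = ByteBuffer.wrap(file, i, 4).getInt() == SYNC_ESCAPE;
            if (escape && Arrays.equals(Arrays.copyOfRange(file, i + 4, i + entrySize), MARKER)) {
                return i + entrySize;
            }
        }
        return file.length;
    }

    /** Reads the record that starts at the given offset (which must be a record boundary). */
    public static String readRecord(byte[] file, int offset) {
        int length = ByteBuffer.wrap(file, offset, 4).getInt();
        return new String(file, offset + 4, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] file = write(new String[] {"r0", "r1", "r2", "r3", "r4", "r5"}, 2);
        int position = sync(file, 3); // offset 3 is in the middle of record r0
        System.out.println(readRecord(file, position)); // r2, the first record after a sync entry
    }
}
```

Because the marker can be found by scanning from any offset, a reader dropped at an arbitrary position can always recover a record boundary, at the cost of skipping the records before the next sync entry.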
SequenceFile.Writer has a method called sync() for inserting a sync point at the
current position in the stream. This is not to be confused with the identically named
but otherwise unrelated sync() method defined by the Syncable interface for
synchronizing buffers to the underlying device.
Sync points come into their own when using sequence files as input to MapReduce,
since they permit the file to be split, so different portions of it can be processed independently by separate map tasks.
Displaying a SequenceFile with the command-line interface
The hadoop fs command has a -text option to display sequence files in textual form.
It looks at a file’s magic number so that it can attempt to detect the type of the file
and appropriately convert it to text. It can recognize gzipped files and sequence files;
otherwise, it assumes the input is plain text.
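Detection by magic number can be sketched as follows. This is an illustrative helper, not the code behind hadoop fs -text: the class MagicSniffer and its Kind enum are hypothetical. It relies on two facts: gzip streams start with the bytes 0x1f 0x8b, and sequence files start with the bytes SEQ.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: classifies a file by its leading bytes.
class MagicSniffer {
    enum Kind { GZIP, SEQUENCE_FILE, PLAIN_TEXT }

    static Kind detect(byte[] head) {
        // gzip streams begin with the two bytes 0x1f 0x8b
        if (head.length >= 2 && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b) {
            return Kind.GZIP;
        }
        // sequence files begin with the three bytes 'S', 'E', 'Q'
        if (head.length >= 3 && head[0] == 'S' && head[1] == 'E' && head[2] == 'Q') {
            return Kind.SEQUENCE_FILE;
        }
        // anything else is treated as plain text
        return Kind.PLAIN_TEXT;
    }

    public static void main(String[] args) {
        System.out.println(detect("SEQ".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(detect(new byte[] {(byte) 0x1f, (byte) 0x8b, 8}));
        System.out.println(detect("hello".getBytes(StandardCharsets.US_ASCII)));
    }
}
```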
For sequence files, this command is really useful only if the keys and values have a
meaningful string representation (as defined by the toString() method). Also, if you
have your own key or value classes, then you will need to make sure they are on
Hadoop’s classpath.
Running it on the sequence file we created in the previous section gives the
following output:
% hadoop fs -text numbers.seq | head
One, two, buckle my shoe
Three, four, shut the door
Five, six, pick up sticks
Seven, eight, lay them straight
Nine, ten, a big fat hen
One, two, buckle my shoe
Three, four, shut the door
Five, six, pick up sticks
Seven, eight, lay them straight
Nine, ten, a big fat hen
Sorting and merging SequenceFiles
The most powerful way of sorting (and merging) one or more sequence files is to
use MapReduce. MapReduce is inherently parallel and will let you specify the
number of reducers to use, which determines the number of output partitions. For
example, by specifying one reducer, you get a single output file. We can use the sort
example that comes with Hadoop by specifying that the input and output are
sequence files, and by setting the key and value types:
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort -r 1 \
  -inFormat org.apache.hadoop.mapred.SequenceFileInputFormat \
  -outFormat org.apache.hadoop.mapred.SequenceFileOutputFormat \
  -outKey \
  numbers.seq sorted
% hadoop fs -text sorted/part-00000 | head
Nine, ten, a big fat hen
Seven, eight, lay them straight
Five, six, pick up sticks
Three, four, shut the door
One, two, buckle my shoe
Nine, ten, a big fat hen
Seven, eight, lay them straight
Five, six, pick up sticks
Three, four, shut the door
One, two, buckle my shoe
Sorting is covered in more detail in “Sorting”.
As an alternative to using MapReduce for sort/merge, there is a SequenceFile.Sorter
class that has a number of sort() and merge() methods. These functions predate
MapReduce and are lower-level functions than MapReduce (for example, to get
parallelism, you need to partition your data manually), so in general MapReduce is
the preferred approach to sort and merge sequence files.
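To give a feel for what a lower-level merge involves, here is a minimal sketch of merging already-sorted inputs with a priority queue. It is not SequenceFile.Sorter's implementation: the class SortedMerge is hypothetical and works on in-memory lists of integers rather than on files.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Toy k-way merge over in-memory sorted lists; not SequenceFile.Sorter.
class SortedMerge {
    // One cursor per sorted input: the current head element plus the rest.
    private static final class Cursor {
        int head;
        final Iterator<Integer> rest;

        Cursor(Iterator<Integer> it) {
            rest = it;
            head = it.next();
        }
    }

    static List<Integer> merge(List<List<Integer>> sortedInputs) {
        PriorityQueue<Cursor> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.head, b.head));
        for (List<Integer> input : sortedInputs) {
            if (!input.isEmpty()) {
                queue.add(new Cursor(input.iterator()));
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor c = queue.poll();      // smallest head across all inputs
            merged.add(c.head);
            if (c.rest.hasNext()) {       // advance this input and re-queue it
                c.head = c.rest.next();
                queue.add(c);
            }
        }
        return merged;
    }
}
```

The merge itself runs in a single thread; any parallelism would have to come from partitioning the inputs beforehand, which is the manual step the text refers to.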
The SequenceFile format
A sequence file consists of a header followed by one or more records (see Figure 4-2). The first three bytes of a sequence file are the bytes SEQ, which act as a magic
number, followed by a single byte representing the version number. The header
contains other fields including the names of the key and value classes, compression
details, user-defined metadata, and the sync marker. Recall that the sync marker is
used to allow a reader to synchronize to a record boundary from any position in the
file. Each file has a randomly generated sync marker, whose value is stored in the
header. Sync markers appear between records in the sequence file. They are
designed to incur less than a 1% storage overhead, so they don’t necessarily appear
between every pair of records (such is the case for short records).
Figure 4-2. The internal structure of a sequence file with no compression and
record compression
The internal format of the records depends on whether compression is enabled, and
if it is, whether it is record compression or block compression.
If no compression is enabled (the default), then each record is made up of the record
length (in bytes), the key length, the key, and then the value. The length fields are
written as four-byte integers adhering to the contract of the writeInt() method. Keys and values are serialized using the Serialization defined for
the class being written to the sequence file.
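The uncompressed record layout can be illustrated with the JDK's DataOutputStream, whose writeInt() emits a four-byte big-endian integer. This is a sketch under assumptions, not the SequenceFile.Writer code: the class RecordLayout is hypothetical, the key and value are taken as already-serialized byte arrays, and the record length is assumed to be the combined size of key and value.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative layout only; real sequence files serialize keys and values
// with the Serialization registered for their classes.
class RecordLayout {
    static byte[] encode(byte[] key, byte[] value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(key.length + value.length); // record length (assumed: key + value bytes)
            out.writeInt(key.length);                // key length
            out.write(key);                          // key bytes
            out.write(value);                        // value bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);       // not expected for an in-memory stream
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] record = encode(new byte[] {0, 0, 0, 100}, new byte[] {'a', 'b', 'c'});
        System.out.println(record.length);
    }
}
```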
The format for record compression is almost identical to no compression, except the
value bytes are compressed using the codec defined in the header. Note that keys
are not compressed.
Block compression compresses multiple records at once; it is therefore more
compact than and should generally be preferred over record compression because it
has the opportunity to take advantage of similarities between records. (See Figure 4-3.) Records are added to a block until it reaches a minimum size in bytes, defined by
the io.seqfile.compress.blocksize property: the default is 1 million bytes. A sync
marker is written before the start of every block. The format of a block is a field
indicating the number of records in the block, followed by four compressed fields: the
key lengths, the keys, the value lengths, and the values.
A MapFile is a sorted SequenceFile with an index to permit lookups by key. MapFile
can be thought of as a persistent form of java.util.Map (although it doesn’t implement
this interface), which is able to grow beyond the size of a Map that is kept in
Figure 4-3. The internal structure of a sequence file with block compression
Writing a MapFile
Writing a MapFile is similar to writing a SequenceFile: you create an instance of
MapFile.Writer, then call the append() method to add entries in order. (Attempting to
add entries out of order will result in an IOException.) Keys must be instances of
WritableComparable, and values must be Writable—contrast this to SequenceFile,
which can use any serialization framework for its entries.
The program in Example 4-16 creates a MapFile, and writes some entries to it. It is
very similar to the program in Example 4-14 for creating a SequenceFile.
Example 4-16. Writing a MapFile
public class MapFileWriteDemo {
  private static final String[] DATA = {
    "One, two, buckle my shoe",
    "Three, four, shut the door",
    "Five, six, pick up sticks",
    "Seven, eight, lay them straight",
    "Nine, ten, a big fat hen"
  };
  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    IntWritable key = new IntWritable();
    Text value = new Text();
    MapFile.Writer writer = null;
    try {
      writer = new MapFile.Writer(conf, fs, uri,
          key.getClass(), value.getClass());
      // Keys run from 1 to 1024, so entries are appended in sorted order
      for (int i = 0; i < 1024; i++) {
        key.set(i + 1);
        value.set(DATA[i % DATA.length]);
        writer.append(key, value);
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}
Let’s use this program to build a MapFile:
% hadoop MapFileWriteDemo
If we look at the MapFile, we see it’s actually a directory containing two files called
data and index:
% ls -l
total 104
-rw-r--r-- 1 tom 47898 Jul 22:06 data
-rw-r--r-- 1 tom   251 Jul 22:06 index
Both files are SequenceFiles. The data file contains all of the entries, in order:
% hadoop fs -text | head
One, two, buckle my shoe
Three, four, shut the door
Five, six, pick up sticks
Seven, eight, lay them straight
Nine, ten, a big fat hen
One, two, buckle my shoe
Three, four, shut the door
Five, six, pick up sticks
Seven, eight, lay them straight
Nine, ten, a big fat hen
The index file contains a fraction of the keys, and contains a mapping from the key
to that key’s offset in the data file:
% hadoop fs -text
As we can see from the output, by default only every 128th key is included in the
index, although you can change this value either by setting the
property or by calling the setIndexInterval() method on the MapFile.Writer instance.
A reason to increase the index interval would be to decrease the amount of memory
that the MapFile needs to store the index. Conversely, you might decrease the
interval to improve the time for random selection (since fewer records need to be
skipped on average) at the expense of memory usage.
Since the index is only a partial index of keys, MapFile is not able to provide
methods to enumerate, or even count, all the keys it contains. The only way to
perform these operations is to read the whole file.
Reading a MapFile
Iterating through the entries in order in a MapFile is similar to the procedure for a
SequenceFile: you create a MapFile.Reader, then call the next() method until it
returns false, signifying that no entry was read because the end of the file was reached:
public boolean next(WritableComparable key, Writable val) throws IOException
A random access lookup can be performed by calling the get() method:
public Writable get(WritableComparable key, Writable val) throws IOException
The return value is used to determine if an entry was found in the MapFile; if it’s null,
then no value exists for the given key. If key was found, then the value for that key is
read into val, as well as being returned from the method call.
It might be helpful to understand how this is implemented. Here is a snippet of code
that retrieves an entry for the MapFile we created in the previous section:
Text value = new Text();
reader.get(new IntWritable(496), value);
assertThat(value.toString(), is("One, two, buckle my shoe"));
For this operation, the MapFile.Reader reads the index file into memory (this is
cached so that subsequent random access calls will use the same in-memory
index). The reader then performs a binary search on the in-memory index to find the
key in the index that is less than or equal to the search key, 496. In this example, the
index key found is 385, with value 18030, which is the offset in the data file. Next the
reader seeks to this offset in the data file and reads entries until the key is greater
than or equal to the search key, 496. In this case, a match is found and the value is
read from the data file. Overall, a lookup takes a single disk seek and a scan through
up to 128 entries on disk. For a random-access read, this is actually very efficient.
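The lookup just described (binary search over a sparse index, then a short forward scan) can be modeled in memory. This is a toy sketch, not MapFile.Reader's code: the class SparseIndexLookup is hypothetical, array positions stand in for byte offsets in the data file, and the data mirrors the 1,024 entries written by MapFileWriteDemo.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory model of a MapFile lookup; not MapFile.Reader's actual code.
class SparseIndexLookup {
    static final int INTERVAL = 128; // every 128th key goes into the index
    static final String[] DATA = {
        "One, two, buckle my shoe", "Three, four, shut the door",
        "Five, six, pick up sticks", "Seven, eight, lay them straight",
        "Nine, ten, a big fat hen"
    };

    final int[] keys = new int[1024];                       // the "data file": keys 1..1024
    final List<Integer> indexPositions = new ArrayList<>(); // positions of indexed entries

    SparseIndexLookup() {
        for (int i = 0; i < keys.length; i++) {
            keys[i] = i + 1;
            if (i % INTERVAL == 0) {
                indexPositions.add(i);
            }
        }
    }

    /** Binary search for the last indexed entry whose key is <= searchKey; -1 if none. */
    int floorIndexPosition(int searchKey) {
        int lo = 0, hi = indexPositions.size() - 1, best = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (keys[indexPositions.get(mid)] <= searchKey) {
                best = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return best < 0 ? -1 : indexPositions.get(best);
    }

    /** Scans forward from the indexed position; returns the value, or null if absent. */
    String get(int searchKey) {
        int pos = floorIndexPosition(searchKey);
        if (pos < 0) {
            return null;
        }
        for (int i = pos; i < keys.length && keys[i] <= searchKey; i++) {
            if (keys[i] == searchKey) {
                return DATA[i % DATA.length];
            }
        }
        return null;
    }
}
```

In this model, looking up key 496 lands on the indexed key 385 and then scans forward to 496, matching the walk-through above. A larger INTERVAL shrinks the index but lengthens the scan.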
MapFile variants
Hadoop comes with a few variants on the general key-value MapFile interface:
SetFile is a specialization of MapFile for storing a set of Writable keys. The
keys must be added in sorted order.
ArrayFile is a MapFile where the key is an integer representing the index of
the element in the array, and the value is a Writable value.
BloomMapFile is a MapFile which offers a fast version of the get() method,
especially for sparsely populated files. The implementation uses a dynamic
bloom filter for testing whether a given key is in the map. The test is very fast
since it is in-memory, but it has a non-zero probability of false positives, in
which case the regular get() method is called.
There are two tuning parameters: io.mapfile.bloom.size for the (approximate)
number of entries in the map (default 1,048,576), and
file.bloom.error.rate for the desired maximum error rate (default 0.005, which
is 0.5%).
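The membership test behind BloomMapFile can be sketched with a small fixed-size Bloom filter. This is a minimal illustration, not Hadoop's dynamic bloom filter: the class TinyBloom, its hash mixing constants, and its sizing parameters are all assumptions made for the example.

```java
import java.util.BitSet;

// Minimal fixed-size Bloom filter; BloomMapFile uses its own dynamic filter.
class TinyBloom {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    TinyBloom(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derives the i-th bit position for a key (arbitrary mixing constants).
    private int slot(int key, int i) {
        int h = key * 0x9E3779B9 + i * 0x85EBCA6B;
        h ^= h >>> 16;
        return Math.floorMod(h, size);
    }

    void add(int key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(slot(key, i));
        }
    }

    /** false means definitely absent; true means possibly present. */
    boolean mightContain(int key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(slot(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would consult mightContain() first and fall back to the regular lookup only when it returns true, so keys that are definitely absent never touch the disk.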
Converting a SequenceFile to a MapFile
One way of looking at a MapFile is as an indexed and sorted SequenceFile. So it’s
quite natural to want to be able to convert a SequenceFile into a MapFile. We
covered how to sort a SequenceFile in “Sorting and merging SequenceFiles”, so
here we look at how to create an index for a SequenceFile. The program in Example
4-17 hinges around the static utility method fix() on MapFile, which re-creates the
index for a MapFile.
Example 4-17. Re-creating the index for a MapFile
public class MapFileFixer {
  public static void main(String[] args) throws Exception {
    String mapUri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(mapUri), conf);
    Path map = new Path(mapUri);
    Path mapData = new Path(map, MapFile.DATA_FILE_NAME);
    // Get key and value types from data sequence file
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, mapData, conf);
    Class keyClass = reader.getKeyClass();
    Class valueClass = reader.getValueClass();
    reader.close();
    // Create the map file index file
    long entries = MapFile.fix(fs, map, keyClass, valueClass, false, conf);
    System.out.printf("Created MapFile %s with %d entries\n", map, entries);
  }
}
The fix() method is usually used for re-creating corrupted indexes, but since it
creates a new index from scratch, it’s exactly what we need here. The recipe is as
follows:
1. Sort the sequence file numbers.seq into a new directory that will become the
MapFile (if the sequence file is already sorted, then you can skip this step.
Instead, copy it to the data file in that directory, then go to step 3).
2. Rename the MapReduce output to be the data file:
hadoop fs -mv
3. Create the index file:
hadoop MapFileFixer
Created MapFile with 100 entries
The MapFile now exists and can be used.
Developing a MapReduce Application
In Chapter 2, we introduced the MapReduce model. In this chapter, we look at the
practical aspects of developing a MapReduce application in Hadoop.
Writing a program in MapReduce has a certain flow to it. You start by writing your
map and reduce functions, ideally with unit tests to make sure they do what you
expect. Then you write a driver program to run a job, which can run from your IDE
using a small subset of the data to check that it is working. If it fails, then you can
use your IDE’s debugger to find the source of the problem. With this information, you
can expand your unit tests to cover this case and improve your mapper or reducer
as appropriate to handle such input correctly.
When the program runs as expected against the small dataset, you are ready to
unleash it on a cluster. Running against the full dataset is likely to expose some
more issues, which you can fix as before, by expanding your tests and mapper or
reducer to handle the new cases. Debugging failing programs in the cluster is a
challenge, so we look at some common techniques to make it easier.
After the program is working, you may wish to do some tuning, first by running
through some standard checks for making MapReduce programs faster and then by
doing task profiling. Profiling distributed programs is not trivial, but Hadoop has
hooks to aid the process.
Before we start writing a MapReduce program, we need to set up and configure the
development environment. And to do that, we need to learn a bit about how Hadoop
does configuration.
The Configuration API
Components in Hadoop are configured using Hadoop’s own configuration API. An
instance of the Configuration class (found in the org.apache.hadoop.conf package)
represents a collection of configuration properties and their values. Each property is
named by a String, and the type of a value may be one of several types, including
Java primitives such as boolean, int, long, float, and other useful types such as
String, Class, and collections of String.
Configurations read their properties from resources—XML files with a simple
structure for defining name-value pairs. See Example 5-1.
Example 5-1. A simple configuration file, configuration-1.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>size-weight</name>
    <value>${size},${weight}</value>
    <description>Size and weight</description>
  </property>
</configuration>
Assuming this configuration file is in a file called configuration-1.xml, we can access
its properties using a piece of code like this:
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
assertThat(conf.get("color"), is("yellow"));
assertThat(conf.getInt("size", 0), is(10));
assertThat(conf.get("breadth", "wide"), is("wide"));
There are a couple of things to note: type information is not stored in the XML file;
instead, properties can be interpreted as a given type when they are read. Also, the
get() methods allow you to specify a default value, which is used if the property is
not defined in the XML file, as in the case of breadth here.
Combining Resources
Things get interesting when more than one resource is used to define a
configuration. This is used in Hadoop to separate out the default properties for the
system, defined internally in a file called core-default.xml, from the site-specific
overrides, in core-site.xml. The file in Example 5-2 defines the size and weight properties.
Example 5-2. A second configuration file, configuration-2.xml
<?xml version="1.0"?>
Resources are added to a Configuration in order:
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
conf.addResource("configuration-2.xml");
Properties defined in resources that are added later override the earlier definitions.
So the size property takes its value from the second configuration file, configuration-2.xml:
assertThat(conf.getInt("size", 0), is(12));
However, properties that are marked as final cannot be overridden in later
definitions. The weight property is final in the first configuration file, so the attempt to
override it in the second fails, and it takes the value from the first:
assertThat(conf.get("weight"), is("heavy"));
Attempting to override final properties usually indicates a configuration error, so this
results in a warning message being logged to aid diagnosis. Administrators mark
properties as final in the daemon’s site files that they don’t want users to change in
their client-side configuration files or job submission parameters.
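The override rules (later resources win, final properties cannot be changed) can be modeled with a small class. This is a toy sketch and not Hadoop's Configuration class: LayeredConfig and its methods are hypothetical, and properties are set one at a time rather than loaded from XML.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy model of layered resources; not Hadoop's Configuration class.
class LayeredConfig {
    private final Map<String, String> values = new LinkedHashMap<>();
    private final Set<String> finals = new HashSet<>();

    /** Applies one property from a resource; returns false if a final property blocked it. */
    boolean set(String name, String value, boolean isFinal) {
        if (finals.contains(name)) {
            System.err.println("WARN: attempt to override final property " + name);
            return false;
        }
        values.put(name, value);
        if (isFinal) {
            finals.add(name);
        }
        return true;
    }

    /** Values are stored as strings; the caller supplies a default for missing names. */
    String get(String name, String defaultValue) {
        return values.getOrDefault(name, defaultValue);
    }

    /** The type is imposed at read time, not stored with the property. */
    int getInt(String name, int defaultValue) {
        String v = values.get(name);
        return v == null ? defaultValue : Integer.parseInt(v.trim());
    }
}
```

Applying size=10 and a final weight=heavy, then size=12 and another value for weight, leaves size at 12 and weight at heavy, with a warning logged for the blocked override.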
Variable Expansion
Configuration properties can be defined in terms of other properties, or system properties. For example, the property size-weight in the first configuration file is defined
as ${size},${weight}, and these properties are expanded using the values found in
the configuration:
assertThat(conf.get("size-weight"), is("12,heavy"));
System properties take priority over properties defined in resource files:
System.setProperty("size", "14");
assertThat(conf.get("size-weight"), is("14,heavy"));
This feature is useful for overriding properties on the command line by using
-Dproperty=value JVM arguments.
Note that while configuration properties can be defined in terms of system
properties, unless system properties are redefined using configuration properties,
they are not accessible through the configuration API. Hence:
System.setProperty("length", "2");
assertThat(conf.get("length"), is((String) null));
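Variable expansion with system properties taking priority can be sketched using a regular expression. This is an illustrative model, not the expansion code in Hadoop's Configuration: VarExpander is a hypothetical class, it makes a single pass (no nested references), and it leaves unknown references untouched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Single-pass ${name} expansion; a simplified model of the behavior described above.
class VarExpander {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Expands ${name} references, preferring system properties over 'props'. */
    static String expand(String raw, Map<String, String> props) {
        Matcher m = VAR.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String value = System.getProperty(name);   // system properties win
            if (value == null) {
                value = props.get(name);               // then configuration properties
            }
            String replacement = value != null ? value : m.group(); // keep unknown refs
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("size", "12");
        props.put("weight", "heavy");
        System.out.println(expand("${size},${weight}", props));
    }
}
```

Note that the lookup only consults system properties while expanding a reference; asking the map directly for a name that exists only as a system property still finds nothing, which mirrors the length example above.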
Configuring the Development Environment
The first step is to download the version of Hadoop that you plan to use and unpack
it on your development machine (this is described in Appendix A). Then, in your favorite IDE, create a new project and add all the JAR files from the top level of the
unpacked distribution and from the lib directory to the classpath. You will then be
able to compile Java Hadoop programs and run them in local (standalone) mode
within the IDE.
Managing Configuration
When developing Hadoop applications, it is common to switch between running the
application locally and running it on a cluster. In fact, you may have several clusters
you work with, or you may have a local “pseudo-distributed” cluster that you like to
test on. One way to accommodate these variations is to have Hadoop configuration
files containing the connection settings for each cluster you run against, and specify
which one you are using when you run Hadoop applications or tools. As a matter of
best practice, it’s recommended to keep these files outside Hadoop’s installation
directory tree, as this makes it easy to switch between Hadoop versions without
duplicating or losing settings.
For the purposes of this book, we assume the existence of a directory called conf
that contains three configuration files: hadoop-local.xml, hadoop-localhost.xml, and
hadoop-cluster.xml (these are available in the example code for this book). Note that
there is nothing special about the names of these files—they are just convenient
ways to package up some configuration settings.
The hadoop-local.xml file contains the default Hadoop configuration for the default
filesystem and the jobtracker:
<?xml version="1.0"?>
The settings in hadoop-localhost.xml point to a namenode and a jobtracker both running on localhost:
<?xml version="1.0"?>
Finally, hadoop-cluster.xml contains details of the cluster’s namenode and
jobtracker addresses. In practice, you would name the file after the name of the
cluster, rather than “cluster” as we have here:
<?xml version="1.0"?>
You can add other configuration properties to these files as needed. For example, if
you wanted to set your Hadoop username for a particular cluster, you could do it in
the appropriate file.
Setting User Identity
The user identity that Hadoop uses for permissions in HDFS is determined by
running the whoami command on the client system. Similarly, the group names
are derived from the output of running groups.
If, however, your Hadoop user identity is different from the name of your user
account on your client machine, then you can explicitly set your Hadoop
username and group names by setting the hadoop.job.ugi property. The
username and group names are specified as a comma-separated list of strings
(e.g., preston,directors,inventors would set the username to preston and the
group names to directors and inventors).
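The comma-separated format is simple to take apart. The sketch below is only an illustration of the syntax, not Hadoop's parsing code; the class UgiString is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Splits a "user,group1,group2" string into its parts (illustrative only).
class UgiString {
    final String user;
    final List<String> groups;

    UgiString(String ugi) {
        String[] parts = ugi.split(",");
        user = parts[0];                                        // first item is the username
        groups = Arrays.asList(parts).subList(1, parts.length); // the rest are group names
    }

    public static void main(String[] args) {
        UgiString ugi = new UgiString("preston,directors,inventors");
        System.out.println(ugi.user + " " + ugi.groups);
    }
}
```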
You can set the user identity that the HDFS web interface runs as by setting
dfs.web.ugi using the same syntax. By default, it is webuser,webgroup, which is
not a super user, so system files are not accessible through the web interface.
Notice that, by default, there is no authentication with this system. See “Security” for how to use Kerberos authentication with Hadoop.
With this setup, it is easy to use any configuration with the -conf command-line
switch. For example, the following command shows a directory listing on the HDFS
server running in pseudo-distributed mode on localhost:
% hadoop fs -conf conf/hadoop-localhost.xml -ls .
Found 2 items
- tom supergroup 0 2009-04-08 10:32 /user/tom/input
- tom supergroup 0 2009-04-08 13:09 /user/tom/output
If you omit the -conf option, then you pick up the Hadoop configuration in the conf
subdirectory under $HADOOP_INSTALL. Depending on how you set this up, this
may be for a standalone setup or a pseudo-distributed cluster.
Tools that come with Hadoop support the -conf option, but it’s also straightforward to
make your programs (such as programs that run MapReduce jobs) support it, too,
using the Tool interface.
GenericOptionsParser, Tool, and ToolRunner
Hadoop comes with a few helper classes for making it easier to run jobs from the
command line. GenericOptionsParser is a class that interprets common Hadoop
command-line options and sets them on a Configuration object for your application
to use as desired. You don’t usually use GenericOptionsParser directly, as it’s more
convenient to implement the Tool interface and run your application with the
ToolRunner, which uses GenericOptionsParser internally:
public interface Tool extends Configurable {
  int run(String [] args) throws Exception;
}
Example 5-3 shows a very simple implementation of Tool, for printing the keys and
values of all the properties in the Tool’s Configuration object.
Example 5-3. An example Tool implementation for printing the properties in a
Configuration
public class ConfigurationPrinter extends Configured implements Tool {
  static {
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
    Configuration.addDefaultResource("mapred-default.xml");
  }
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    for (Entry<String, String> entry: conf) {
      System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
    }
    return 0;
  }
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);
    System.exit(exitCode);
  }
}
We make ConfigurationPrinter a subclass of Configured, which is an implementation
of the Configurable interface. All implementations of Tool need to implement
Configurable (since Tool extends it), and subclassing Configured is often the easiest
way to achieve this. The run() method obtains the Configuration using Configurable’s
getConf() method and then iterates over it, printing each property to standard output.
The static block makes sure that the HDFS and MapReduce configurations are
picked up in addition to the core ones (which Configuration knows about already).
ConfigurationPrinter’s main() method does not invoke its own run() method directly.
Instead, we call ToolRunner’s static run() method, which takes care of creating a
Configuration object for the Tool, before calling its run() method. ToolRunner also
uses a GenericOptionsParser to pick up any standard options specified on the
command line and set them on the Configuration instance. We can see the effect of
picking up the properties specified in conf/hadoop-localhost.xml by running the
following command:
localhost.xml \ | grep mapred.job.tracker=
Which Properties Can I Set?
ConfigurationPrinter is a useful tool for telling you what a property is set to in
your environment.
You can also see the default settings for all the public properties in Hadoop by
looking in the docs directory of your Hadoop installation for HTML files called
core-default.html, hdfs-default.html and mapred-default.html. Each
property has a description that explains what it is for and what values it can be
set to.
Writing a Unit Test
The map and reduce functions in MapReduce are easy to test in isolation, which is a
consequence of their functional style. For known inputs, they produce known
outputs. However, since outputs are written to a Context (or an OutputCollector in
the old API), rather than simply being returned from the method call, the Context
needs to be replaced with a mock so that its outputs can be verified. There are
several Java mock object frameworks that can help build mocks; here we use
Mockito, which is noted for its clean syntax, although any mock framework should
work just as well.
The test for the mapper is shown in Example 5-4.
import org.junit.*;
public class MaxTemperatureMapperTest {
@Test
public void processesValidRecord() throws IOException, InterruptedException {
MaxTemperatureMapper mapper = new MaxTemperatureMapper();
Text value = new
23550FM-12+0382" + // Year ^^^^
"99999V0203201N00261220001CN9999999N900111+99999999999"); // Temperature
MaxTemperatureMapper.Context context =
mock(MaxTemperatureMapper.Context.class);
mapper.map(null, value, context);
verify(context).write(new Text("1950"), new IntWritable(-11));
}
}
The test is very simple: it passes a weather record as input to the mapper, then
checks the output is the year and temperature reading. The input key is ignored by
the mapper, so we can pass in anything, including null as we do here. To create a
mock Context, we call Mockito’s mock() method (a static import), passing the class
of the type we want to mock. Then we invoke the mapper’s map() method, which
executes the code being tested. Finally, we verify that the mock object was called
with the correct method and arguments, using Mockito’s verify() method (again,
statically imported). Here we verify that Context’s write() method was called with a
Text object representing the year (1950) and an IntWritable representing the
temperature (−1.1°C).
Proceeding in a test-driven fashion, we create a Mapper implementation that passes
the test (see Example 5-5). Since we will be evolving the classes in this chapter,
each is put in a different package indicating its version for ease of exposition. For
example, v1.MaxTemperatureMapper is version 1 of MaxTemperatureMapper. In
reality, of course, you would evolve classes without repackaging them.
Example 5-5. First version of a Mapper that passes MaxTemperatureMapperTest
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature = Integer.parseInt(line.substring(87, 92));
    context.write(new Text(year), new IntWritable(airTemperature));
  }
}
This is a very simple implementation, which pulls the year and temperature fields
from the line and writes them to the Context. Let’s add a test for missing values,
which in the raw data are represented by a temperature of +9999:
public void ignoresMissingTemperatureRecord() throws
IOException, InterruptedException {
MaxTemperatureMapper mapper = new MaxTemperatureMapper();
Text value = new
23550FM-12+0382" + // Year ^^^^
99999999"); // Temperature ^^^^^
MaxTemperatureMapper.Context context =
mock(MaxTemperatureMapper.Context.class);
mapper.map(null, value, context);
verify(context, never()).write(any(Text.class), any(IntWritable.class));
}
Since records with missing temperatures should be filtered out, this test uses
Mockito to verify that the write() method on the Context is never called for any Text
key or IntWritable value.
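What the mock verifies can be made concrete with a hand-written stand-in that simply records calls. This sketch does not use Hadoop or Mockito: RecordingSink and ToyMapper are hypothetical types, and the "YYYY:TTTTT" line format is made up for the example rather than being the NCDC layout.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a mocked Context: it records every write so a test can inspect them.
class RecordingSink {
    final List<String> writes = new ArrayList<>();

    void write(String key, int value) {
        writes.add(key + "=" + value);
    }
}

// Toy mapper over a made-up "YYYY:TTTTT" line; not the NCDC record format.
class ToyMapper {
    static void map(String line, RecordingSink sink) {
        String year = line.substring(0, 4);
        String temp = line.substring(5);
        if (!temp.equals("+9999")) {          // skip the missing-value marker
            sink.write(year, Integer.parseInt(temp));
        }
    }
}
```

A test for a valid record checks that exactly one write was recorded with the expected key and value; a test for a missing reading checks that the list of writes is empty, which is the same assertion that verify(context, never()) expresses.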
Run against the version 1 mapper, this test fails with a NumberFormatException, as parseInt() cannot parse
integers with a leading plus sign, so we fix up the implementation (version 2) to
handle missing values:
@Override
public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  String line = value.toString();
  String year = line.substring(15, 19);
  String temp = line.substring(87, 92);
  if (!missing(temp)) {
    int airTemperature = Integer.parseInt(temp);
    context.write(new Text(year), new IntWritable(airTemperature));
  }
}
private boolean missing(String temp) {
  return temp.equals("+9999");
}
With the test for the mapper passing, we move on to writing the reducer.
The reducer has to find the maximum value for a given key. Here’s a simple test for
this feature:
@Test
public void returnsMaximumIntegerInValues() throws
    IOException, InterruptedException {
  MaxTemperatureReducer reducer = new MaxTemperatureReducer();
  Text key = new Text("1950");
  List<IntWritable> values = Arrays.asList(
      new IntWritable(10), new IntWritable(5));
  MaxTemperatureReducer.Context context =
      mock(MaxTemperatureReducer.Context.class);
  reducer.reduce(key, values, context);
  verify(context).write(key, new IntWritable(10));
}
We construct a list of some IntWritable values and then verify that
MaxTemperatureReducer picks the largest. The code in Example 5-6 is for an
implementation of MaxTemperatureReducer that passes the test. Notice that we
haven’t tested the case of an empty values iterator, but arguably we don’t need to,
since MapReduce would never call the reducer in this case, as every key produced
by a mapper has a value.
Example 5-6. Reducer for maximum temperature example
public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values,
      Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
Running Locally on Test Data
Now that we’ve got the mapper and reducer working on controlled inputs, the next
step is to write a job driver and run it on some test data on a development machine.
Running a Job in a Local Job Runner
Using the Tool interface introduced earlier in the chapter, it’s easy to write a driver to
run our MapReduce job for finding the maximum temperature by year (see
MaxTemperatureDriver in Example 5-7).
Example 5-7. Application to find the maximum temperature
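public class MaxTemperatureDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }
    Job job = new Job(getConf(), "Max temperature");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
    System.exit(exitCode);
  }
}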
MaxTemperatureDriver implements the Tool interface, so we get the benefit of being
able to set the options that GenericOptionsParser supports. The run() method
constructs a Job object based on the tool’s configuration, which it uses to launch a job.
Among the possible job configuration parameters, we set the input and output file
paths, the mapper, reducer and combiner classes, and the output types (the input
types are determined by the input format, which defaults to TextInputFormat and has
LongWritable keys and Text values). It’s also a good idea to set a name for the job
(Max temperature), so that you can pick it out in the job list during execution and
after it has completed. By default, the name is the name of the JAR file, which is
normally not particularly descriptive.
Now we can run this application against some local files. Hadoop comes with a local
job runner, a cut-down version of the MapReduce execution engine for running MapReduce jobs in a single JVM. It’s designed for testing and is very convenient for use
in an IDE, since you can run it in a debugger to step through the code in your
mapper and reducer.
The local job runner is enabled by a configuration setting. Normally,
mapred.job.tracker is a host:port pair to specify the address of the jobtracker, but
when it has the special value of local, the job is run in-process without accessing an
external jobtracker.
From the command line, we can run the driver by typing:
% hadoop v2.MaxTemperatureDriver -conf conf/hadoop-local.xml \
  input/ncdc/micro output
Equivalently, we could use the -fs and -jt options provided by GenericOptionsParser:
% hadoop v2.MaxTemperatureDriver -fs file:/// -jt local input/ncdc/micro output
This command executes MaxTemperatureDriver using input from the local
input/ncdc/micro directory, producing output in the local output directory. Note that
although we’ve set -fs so we use the local filesystem (file:///), the local job runner will
actually work fine against any filesystem, including HDFS (and it can be handy to do
this if you have a few files that are on HDFS).
When we run the program, it fails and prints the following exception:
java.lang.NumberFormatException: For input string: "+0000"
Fixing the mapper
This exception shows that the map method still can’t parse positive temperatures. (If
the stack trace hadn’t given us enough information to diagnose the fault, we could
run the test in a local debugger, since it runs in a single JVM.) Earlier, we made it
handle the special case of missing temperature, +9999, but not the general case of
any positive temperature. With more logic going into the mapper, it makes sense to
factor out a parser class to encapsulate the parsing logic; see Example 5-8 (now on
version 3).
Example 5-8. A class for parsing weather records in NCDC format
public class NcdcRecordParser {
  private static final int MISSING_TEMPERATURE = 9999;

  private String year;
  private int airTemperature;
  private String quality;

  public void parse(String record) {
    year = record.substring(15, 19);
    String airTemperatureString;
    // Remove leading plus sign as parseInt doesn't like them
    if (record.charAt(87) == '+') {
      airTemperatureString = record.substring(88, 92);
    } else {
      airTemperatureString = record.substring(87, 92);
    }
    airTemperature = Integer.parseInt(airTemperatureString);
    quality = record.substring(92, 93);
  }

  public void parse(Text record) {
    parse(record.toString());
  }

  public boolean isValidTemperature() {
    return airTemperature != MISSING_TEMPERATURE && quality.matches("[01459]");
  }

  public String getYear() {
    return year;
  }

  public int getAirTemperature() {
    return airTemperature;
  }
}
The resulting mapper is much simpler (see Example 5-9). It just calls the parser’s
parse() method, which parses the fields of interest from a line of input, checks
whether a valid temperature was found using the isValidTemperature() query
method, and if it was, retrieves the year and the temperature using the getter
methods on the parser. Notice that we check the quality status field as well as
missing temperatures in isValidTemperature() to filter out poor temperature readings.
Another benefit of creating a parser class is that it makes it easy to write related
mappers for similar jobs without duplicating code. It also gives us the opportunity to
write unit tests directly against the parser, for more targeted testing.
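As a minimal sketch of what such a targeted check might look like, the following stand-alone class mirrors the field offsets of Example 5-8 but works on plain String values, so it runs without Hadoop on the classpath. The class name RecordParserSketch, the record() helper that builds a synthetic 93-character line, and the accepted quality codes are assumptions made for illustration; they are not part of the book's example code.

```java
import java.util.Arrays;

public class RecordParserSketch {
  private static final int MISSING_TEMPERATURE = 9999;

  private String year;
  private int airTemperature;
  private String quality;

  // Same offsets as Example 5-8, applied to a plain String.
  public void parse(String record) {
    year = record.substring(15, 19);
    // Java 6 and earlier reject a leading '+' in Integer.parseInt, so skip it.
    int start = record.charAt(87) == '+' ? 88 : 87;
    airTemperature = Integer.parseInt(record.substring(start, 92));
    quality = record.substring(92, 93);
  }

  public boolean isValidTemperature() {
    // Assumed set of acceptable quality codes.
    return airTemperature != MISSING_TEMPERATURE && quality.matches("[01459]");
  }

  public String getYear() { return year; }

  public int getAirTemperature() { return airTemperature; }

  // Hypothetical helper: a synthetic line with only the three fields of interest set.
  static String record(String year, String temperature, char quality) {
    char[] line = new char[93];
    Arrays.fill(line, '0');
    year.getChars(0, 4, line, 15);
    temperature.getChars(0, 5, line, 87);
    line[92] = quality;
    return new String(line);
  }

  public static void main(String[] args) {
    RecordParserSketch parser = new RecordParserSketch();
    parser.parse(record("1950", "+0011", '1'));
    System.out.println(parser.getYear() + " " + parser.getAirTemperature()
        + " valid=" + parser.isValidTemperature());
    parser.parse(record("1950", "+9999", '1'));
    System.out.println("missing value valid=" + parser.isValidTemperature());
  }
}
```

Because the parser has no dependency on the MapReduce context, checks like these can run in milliseconds alongside the mapper tests.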
Example 5-9. A Mapper that uses a utility class to parse records
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    parser.parse(value);
    if (parser.isValidTemperature()) {
      context.write(new Text(parser.getYear()),
          new IntWritable(parser.getAirTemperature()));
    }
  }
}
With these changes, the test passes.
Testing the Driver
Apart from the flexible configuration options offered by making your application implement Tool, you also make it more testable because it allows you to inject an
arbitrary Configuration. You can take advantage of this to write a test that uses a
local job runner to run a job against known input data, which checks that the output
is as expected. There are two approaches to doing this. The first is to use the local
job runner and run the job against a test file on the local filesystem. The code in
Example 5-10 gives an idea of how to do this.
Example 5-10. A test for MaxTemperatureDriver that uses a local, in-process job runner
@Test
public void test() throws Exception {
  Configuration conf = new Configuration();
  conf.set("fs.default.name", "file:///");
  conf.set("mapred.job.tracker", "local");
  Path input = new Path("input/ncdc/micro");
  Path output = new Path("output");
  FileSystem fs = FileSystem.getLocal(conf);
  fs.delete(output, true); // delete old output
  MaxTemperatureDriver driver = new MaxTemperatureDriver();
  driver.setConf(conf);
  int exitCode = driver.run(new String[] {
      input.toString(), output.toString() });
  assertThat(exitCode, is(0));
  checkOutput(conf, output);
}
The test explicitly sets fs.default.name and mapred.job.tracker so it uses the local
filesystem and the local job runner. It then runs the MaxTemperatureDriver via its
Tool interface against a small amount of known data. At the end of the test, the
checkOutput() method is called to compare the actual output with the expected
output, line by line.
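The listing for checkOutput() is not shown here. A minimal sketch of a line-by-line comparison, under the assumption that both outputs fit in memory, could look like the following. The class OutputComparisonSketch, its firstDifference() method, and the sample lines are hypothetical and use only the Java standard library rather than Hadoop's FileSystem API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class OutputComparisonSketch {

  // Returns the 1-based number of the first differing line, or 0 if the lists match.
  static int firstDifference(List<String> expected, List<String> actual) {
    int common = Math.min(expected.size(), actual.size());
    for (int i = 0; i < common; i++) {
      if (!expected.get(i).equals(actual.get(i))) {
        return i + 1;
      }
    }
    // One list is a prefix of the other: the first extra line is the difference.
    return expected.size() == actual.size() ? 0 : common + 1;
  }

  public static void main(String[] args) throws IOException {
    // Illustrative sample lines, written to temporary files and read back.
    Path expected = Files.createTempFile("expected", ".txt");
    Path actual = Files.createTempFile("actual", ".txt");
    Files.write(expected, Arrays.asList("1949\t111", "1950\t22"), StandardCharsets.UTF_8);
    Files.write(actual, Arrays.asList("1949\t111", "1950\t22"), StandardCharsets.UTF_8);

    int line = firstDifference(
        Files.readAllLines(expected, StandardCharsets.UTF_8),
        Files.readAllLines(actual, StandardCharsets.UTF_8));
    System.out.println(line == 0 ? "outputs match" : "first difference at line " + line);
  }
}
```

Reporting the first differing line number, rather than a bare pass or fail, makes a regression easier to locate in a long output file.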
The second way of testing the driver is to run it using a “mini-” cluster. Hadoop has a
pair of testing classes, called MiniDFSCluster and MiniMRCluster, which provide a
programmatic way of creating in-process clusters. Unlike the local job runner, these
allow testing against the full HDFS and MapReduce machinery. Bear in mind, too,
that tasktrackers in a mini-cluster launch separate JVMs to run tasks in, which can
make debugging more difficult.
Mini-clusters are used extensively in Hadoop’s own automated test suite, but they
can be used for testing user code, too. Hadoop’s ClusterMapReduceTestCase
abstract class provides a useful base for writing such a test: it handles the details of
starting and stopping the in-process HDFS and MapReduce clusters in its setUp()
and tearDown() methods, and generates a suitable configuration object that is set up
to work with them. Subclasses need only populate data in HDFS (perhaps by
copying from a local file), run a MapReduce job, then confirm the output is as
expected. Refer to the MaxTemperatureDriverMiniTest class in the example code
that comes with this book for the listing.
Tests like this serve as regression tests, and are a useful repository of input edge
cases and their expected results. As you encounter more test cases, you can
simply add them to the input file and update the file of expected output accordingly.
Running on a Cluster
Now that we are happy with the program running on a small test dataset, we are
ready to try it on the full dataset on a Hadoop cluster. Chapter 9 covers how to set
up a fully distributed cluster, although you can also work through this section on a
pseudo-distributed cluster.
We don’t need to make any modifications to the program to run on a cluster rather
than on a single machine, but we do need to package the program as a JAR file to
send to the cluster. This is conveniently achieved using Ant, using a task such as
this (you can find the complete build file in the example code):
<jar destfile="hadoop-examples.jar" basedir="${classes.dir}"/>
If you have a single job per JAR, then you can specify the main class to run in the
JAR file’s manifest. If the main class is not in the manifest, then it must be
specified on the command line (as you will see shortly). Also, any dependent JAR
files should be packaged in a lib subdirectory in the JAR file. (This is analogous to
a Java Web application archive, or WAR file, except in that case the JAR files go
in a WEB-INF/lib subdirectory in the WAR file.)
Launching a Job
To launch the job, we need to run the driver, specifying the cluster that we want to
run the job on with the -conf option (we could equally have used the -fs and -jt options):
% hadoop jar hadoop-examples.jar v3.MaxTemperatureDriver -conf conf/hadoop-cluster.xml \
  input/ncdc/all max-temp
The waitForCompletion() method on Job launches the job and polls for progress,
writing a line summarizing the map and reduce progress whenever either
changes. Here’s the output (some lines have been removed for clarity):
09/04/11 08:15:52 INFO mapred.FileInputFormat: Total input paths to process : 101
09/04/11 08:15:53 INFO mapred.JobClient: Running job: job_200904110811_0002
09/04/11 08:15:54 INFO mapred.JobClient: map 0% reduce 0%
09/04/11 08:16:06 INFO mapred.JobClient: map 28% reduce 0%
09/04/11 08:16:07 INFO mapred.JobClient: map 30% reduce 0%
09/04/11 08:21:36 INFO mapred.JobClient: map 100% reduce 100%
09/04/11 08:21:38 INFO mapred.JobClient: Job complete: job_200904110811_0002
09/04/11 08:21:38 INFO mapred.JobClient: Counters: 19
09/04/11 08:21:38 INFO mapred.JobClient:   Job Counters
09/04/11 08:21:38 INFO mapred.JobClient:     Launched reduce tasks=32
09/04/11 08:21:38 INFO mapred.JobClient:     Rack-local map tasks=82
09/04/11 08:21:38 INFO mapred.JobClient:     Launched map tasks=127
09/04/11 08:21:38 INFO mapred.JobClient:     Data-local map tasks=45
09/04/11 08:21:38 INFO mapred.JobClient:   FileSystemCounters
09/04/11 08:21:38 INFO mapred.JobClient:     FILE_BYTES_READ=12667214
09/04/11 08:21:38 INFO mapred.JobClient:   Map-Reduce Framework
09/04/11 08:21:38 INFO mapred.JobClient:     Reduce input groups=100
09/04/11 08:21:38 INFO mapred.JobClient:     Combine output records=4489
09/04/11 08:21:38 INFO mapred.JobClient:     Map input records=1209901509
09/04/11 08:21:38 INFO mapred.JobClient:     Reduce shuffle bytes=19140
09/04/11 08:21:38 INFO mapred.JobClient:     Reduce output records=100
09/04/11 08:21:38 INFO mapred.JobClient:     Spilled Records=9481
09/04/11 08:21:38 INFO mapred.JobClient:     Map output bytes=10282306995
09/04/11 08:21:38 INFO mapred.JobClient:     Map input bytes=274600205558
09/04/11 08:21:38 INFO mapred.JobClient:     Combine input records=1142482941
09/04/11 08:21:38 INFO mapred.JobClient:     Map output records=1142478555
09/04/11 08:21:38 INFO mapred.JobClient:     Reduce input records=103
The output includes more useful information. Before the job starts, its ID is printed:
this is needed whenever you want to refer to the job, in logfiles for example, or when
interrogating it via the hadoop job command. When the job is complete, its statistics
(known as counters) are printed out. These are very useful for confirming that the job
did what you expected. For example, for this job we can see that around 275 GB of
input data was analyzed (“Map input bytes”), read from around 34 GB of
compressed files on HDFS (“HDFS_BYTES_READ”). The input was broken into 101
gzipped files of reasonable size, so there was no problem with not being able to split them.
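As a quick sanity check on counters such as these, the arithmetic can be sketched in a few lines. The class and method names below are hypothetical, the counter values are copied from the job output above, and gigabytes are taken to be decimal (10^9 bytes), which is an assumption consistent with the "around 275 GB" figure.

```java
public class CounterArithmeticSketch {

  // Converts a byte count to decimal gigabytes (10^9 bytes).
  static double gigabytes(long bytes) {
    return bytes / 1e9;
  }

  // Ratio of two counters, for example records entering and leaving the combiner.
  static double ratio(long numerator, long denominator) {
    return (double) numerator / denominator;
  }

  public static void main(String[] args) {
    long mapInputBytes = 274600205558L;       // "Map input bytes"
    long combineInputRecords = 1142482941L;   // "Combine input records"
    long combineOutputRecords = 4489L;        // "Combine output records"

    System.out.printf("Map input: %.1f GB%n", gigabytes(mapInputBytes));
    System.out.printf("Combiner reduced the record count by a factor of %.0f%n",
        ratio(combineInputRecords, combineOutputRecords));
  }
}
```

The second figure shows how effective the combiner is for this job: more than a billion map output records shrink to a few thousand before the shuffle.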
Job, Task, and Task Attempt IDs
A job ID is composed of the time that the jobtracker (not the job)
started and an incrementing counter maintained by the jobtracker to uniquely
identify the job to that instance of the jobtracker. So the job with this ID:
job_200904110811_0002
is the second (0002, job IDs are 1-based) job run by the jobtracker that
started at 08:11 on April 11, 2009. The counter is formatted with leading zeros
to make job IDs sort nicely, in directory listings for example. However, when
the counter reaches 10000 it is not reset, resulting in longer job IDs (which
don’t sort so well).
Tasks belong to a job, and their IDs are formed by replacing the job prefix of a
job ID with a task prefix, and adding a suffix to identify the task within the job.
For example:
task_200904110811_0002_m_000003
is the fourth (000003, task IDs are 0-based) map (m) task of the job with ID
job_200904110811_0002. The task IDs are created for a job when it is
initialized, so they do not necessarily dictate the order that the tasks will be
executed in.
Tasks may be executed more than once, due to failure or speculative execution,
so each execution, known as a task attempt, is given its own ID. For example:
attempt_200904110811_0002_m_000003_0
is the first (0, attempt IDs are 0-based) attempt at running task
task_200904110811_0002_m_000003. Task attempts are allocated during the
job run as needed, so their ordering represents the order that they were created
for tasktrackers to run.
The final count in the task attempt ID is incremented by 1,000 if the job is
restarted after the jobtracker is restarted and recovers its running jobs (although
this behavior is disabled by default; see “Jobtracker Failure”).
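The structure of these IDs can be illustrated with a small parser. This is a sketch only: the class TaskAttemptIdSketch is hypothetical and is not Hadoop's own ID API, and it assumes task attempt IDs of the form attempt_<jobtracker start>_<job counter>_<m or r>_<task number>_<attempt number>, from which the job and task IDs are rebuilt.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaskAttemptIdSketch {
  // attempt_<jobtracker start>_<job counter>_<m or r>_<task number>_<attempt number>
  private static final Pattern ATTEMPT_ID =
      Pattern.compile("attempt_(\\d{12})_(\\d+)_([mr])_(\\d+)_(\\d+)");

  final String jobtrackerStart; // yyyyMMddHHmm
  final int jobNumber;          // 1-based
  final boolean mapTask;        // true for "m", false for "r"
  final int taskNumber;         // 0-based
  final int attemptNumber;      // 0-based

  private TaskAttemptIdSketch(Matcher m) {
    jobtrackerStart = m.group(1);
    jobNumber = Integer.parseInt(m.group(2));
    mapTask = m.group(3).equals("m");
    taskNumber = Integer.parseInt(m.group(4));
    attemptNumber = Integer.parseInt(m.group(5));
  }

  static TaskAttemptIdSketch parse(String id) {
    Matcher m = ATTEMPT_ID.matcher(id);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a task attempt ID: " + id);
    }
    return new TaskAttemptIdSketch(m);
  }

  // %04d pads to four digits but lets counters of 10000 and above grow longer.
  String jobId() {
    return String.format("job_%s_%04d", jobtrackerStart, jobNumber);
  }

  String taskId() {
    return String.format("task_%s_%04d_%s_%06d",
        jobtrackerStart, jobNumber, mapTask ? "m" : "r", taskNumber);
  }

  public static void main(String[] args) {
    TaskAttemptIdSketch id = parse("attempt_200904110811_0002_m_000003_0");
    System.out.println(id.jobId());   // job_200904110811_0002
    System.out.println(id.taskId());  // task_200904110811_0002_m_000003
  }
}
```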
The MapReduce Web UI
Hadoop comes with a web UI for viewing information about your jobs. It is useful for
following a job’s progress while it is running, as well as finding job statistics and logs
after the job has completed. You can find the UI at http://jobtracker-host:50030/.
The jobtracker page
A screenshot of the home page is shown in Figure 5-1. The first section of the page
gives details of the Hadoop installation, such as the version number and when it was
compiled, and the current state of the jobtracker (in this case, running), and when it
was started.
Next is a summary of the cluster, which has measures of cluster capacity and
utilization. This shows the number of maps and reduces currently running on the
cluster, the total number of job submissions, the number of tasktracker nodes
currently available, and the cluster’s capacity in terms of the number of map and
reduce slots available across the cluster (“Map Task Capacity” and “Reduce Task
Capacity”), and the number of available slots per node, on average. The number of
tasktrackers that have been blacklisted by the jobtracker is listed as well
(blacklisting is discussed in “Tasktracker Failure”).
Below the summary, there is a section about the job scheduler that is running (here
the default). You can click through to see job queues.
Further down, we see sections for running, (successfully) completed, and failed jobs.
Each of these sections has a table of jobs, with a row per job that shows the job’s ID,
owner, name (as set in the Job constructor or setJobName() method, both of which
internally set the mapred.job.name property) and progress information.
Finally, at the foot of the page, there are links to the jobtracker’s logs and the jobtracker’s history: information on all the jobs that the jobtracker has run. The main
display shows only 100 jobs (configurable via the
mapred.jobtracker.completeuserjobs.maximum property) before consigning them to
the history page. Note also that the job history is persistent, so you can find jobs
here from previous runs of the jobtracker.
Figure 5-1. Screenshot of the jobtracker page
Job History
Job history refers to the events and configuration for a completed job. It is retained
whether the job was successful or not, in an attempt to provide interesting
information for the user running a job.
Job history files are stored on the local filesystem of the jobtracker in a history subdirectory of the logs directory. It is possible to set the location to an arbitrary Hadoop
filesystem via the hadoop.job.history.location property. The jobtracker’s history files
are kept for 30 days before being deleted by the system.
A second copy is also stored for the user in the _logs/history subdirectory of the job’s
output directory. This location may be overridden by setting
hadoop.job.history.user.location. By setting it to the special value none, no user job
history is saved, although job history is still saved centrally. A user’s job history files
are never deleted by the system.
The history log includes job, task, and attempt events, all of which are stored in a
plain-text file. The history for a particular job may be viewed through the web UI, or
via the command line, using hadoop job -history (which you point at the job’s output directory).
The job page
Clicking on a job ID brings you to a page for the job, illustrated in Figure 5-2. At the
top of the page is a summary of the job, with basic information such as job owner
and name, and how long the job has been running for. The job file is the
consolidated configuration file for the job, containing all the properties and their
values that were in effect during the job run. If you are unsure of what a particular
property was set to, you can click through to inspect the file.
While the job is running, you can monitor its progress on this page, which
periodically updates itself. Below the summary is a table that shows the map
progress and the reduce progress. “Num Tasks” shows the total number of map and
reduce tasks for this job (a row for each). The other columns then show the state of
these tasks: “Pending” (waiting to run), “Running,” “Complete” (successfully run), and
“Killed” (tasks that have failed; this column would be more accurately labeled
“Failed”). The final column shows the total number of failed and killed task attempts
for all the map or reduce tasks for the job (task attempts may be marked as killed if
they are a speculative execution duplicate, if the tasktracker they are running on
dies, or if they are killed by a user). See “Task Failure” for background on task failure.
Further down the page, you can find completion graphs for each task that show their
progress graphically. The reduce completion graph is divided into the three phases
of the reduce task: copy (when the map outputs are being transferred to the
reduce’s tasktracker), sort (when the reduce inputs are being merged), and reduce
(when the reduce function is being run to produce the final output).
In the middle of the page is a table of job counters. These are dynamically updated
during the job run, and provide another useful window into the job’s progress and
general health.
Retrieving the Results
Once the job is finished, there are various ways to retrieve the results. Each reducer
produces one output file, so there are 30 part files named part-r-00000 to part-r-00029 in the max-temp directory.
As their names suggest, a good way to think of these “part” files is as parts of the
max-temp “file.” If the output is large (which it isn’t in this case), then it is important to
have multiple parts so that more than one reducer can work in parallel. Usually, if a
file is in this partitioned form, it can still be used easily enough: as the input to
another MapReduce job, for example. In some cases, you can exploit the structure
of multiple partitions to do a map-side join (see “Map-Side Joins”) or a
MapFile lookup (see “An application: Partitioned MapFile lookups”).
This job produces a very small amount of output, so it is convenient to copy it from
HDFS to our development machine. The -getmerge option to the hadoop fs
command is useful here, as it gets all the files in the directory specified in the source
pattern and merges them into a single file on the local filesystem:
% hadoop fs -getmerge max-temp max-temp-local
% sort max-temp-local | tail
We sorted the output, as the reduce output partitions are unordered (owing to the
hash partition function). Doing a bit of postprocessing of data from MapReduce is
very common, as is feeding it into analysis tools, such as R, a spreadsheet, or even
a relational database.
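To see why the merged file is not globally sorted, consider a sketch of hash partitioning. It follows the arithmetic of Hadoop's default HashPartitioner (mask off the sign bit of the hash code, then take the remainder), but uses String.hashCode() as a stand-in for the hash of a Text key; the class name, the sample years, and the partition count are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HashPartitionSketch {

  // Mask off the sign bit so the remainder is never negative.
  static int partitionFor(String key, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  // Assigns sorted keys to partitions, then concatenates the partitions in
  // part-file order, which is the order a merge of the part files would produce.
  static List<String> mergedOrder(List<String> sortedKeys, int numPartitions) {
    List<List<String>> parts = new ArrayList<>();
    for (int i = 0; i < numPartitions; i++) {
      parts.add(new ArrayList<>());
    }
    for (String key : sortedKeys) {
      parts.get(partitionFor(key, numPartitions)).add(key);
    }
    List<String> merged = new ArrayList<>();
    for (List<String> part : parts) {
      merged.addAll(part);
    }
    return merged;
  }

  public static void main(String[] args) {
    List<String> years = Arrays.asList("1901", "1902", "1903", "1904", "1905", "1906");
    for (String year : years) {
      System.out.println(year + " -> partition " + partitionFor(year, 3));
    }
    System.out.println("merged order: " + mergedOrder(years, 3));
  }
}
```

Each partition is sorted internally, but neighboring keys can land in different partitions, so concatenating the parts interleaves the key ranges.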
Another way of retrieving the output if it is small is to use the -cat option to print the
output files to the console:
% hadoop fs -cat max-temp/*
On closer inspection, we see that some of the results don’t look plausible. For
instance, the maximum temperature for 1951 (not shown here) is 590°C! How do we
find out what’s causing this? Is it corrupt input data or a bug in the program?
Debugging a Job
The time-honored way of debugging programs is via print statements, and this is
certainly possible in Hadoop. However, there are complications to consider: with
programs running on tens, hundreds, or thousands of nodes, how do we find and
examine the output of the debug statements, which may be scattered across these
nodes? For this particular case, where we are looking for (what we think is) an
unusual case, we can use a debug statement to log to standard error, in conjunction
with a message to update the task’s status message to prompt us to look in the error
log. The web UI makes this easy, as we will see.
We also create a custom counter to count the total number of records with
implausible temperatures in the whole dataset. This gives us valuable information
about how to deal with the condition—if it turns out to be a common occurrence,
then we might need to learn more about the condition and how to extract the
temperature in these cases, rather than simply dropping the record. In fact, when
trying to debug a job, you should always ask yourself if you can use a counter to get
the information you need to find out what’s happening. Even if you need to use
logging or a status message, it may be useful to use a counter to gauge the extent of
the problem.
If the amount of log data you produce in the course of debugging is large, then
you’ve got a couple of options. The first is to write the information to the map’s
output, rather than to standard error, for analysis and aggregation by the reduce.
This approach usually necessitates structural changes to your program, so start
with the other techniques first. Alternatively, you can write a program (in MapReduce
of course) to analyze the logs produced by your job.
We add our debugging to the mapper (version 4), as opposed to the reducer, as we
want to find out what the source data causing the anomalous output looks like:
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  enum Temperature {
    OVER_100
  }

  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    parser.parse(value);
    if (parser.isValidTemperature()) {
      int airTemperature = parser.getAirTemperature();
      if (airTemperature > 1000) {
        System.err.println("Temperature over 100 degrees for input: " + value);
        context.setStatus("Detected possibly corrupt record: see logs.");
        context.getCounter(Temperature.OVER_100).increment(1);
      }
      context.write(new Text(parser.getYear()), new IntWritable(airTemperature));
    }
  }
}
If the temperature is over 100°C (represented by 1000, since temperatures are in
tenths of a degree), we print a line to standard error with the suspect line, as well as
updating the map’s status message using the setStatus() method on Context
directing us to look in the log. We also increment a counter, which in Java is
represented by a field of an enum type. In this program, we have defined a single
field OVER_100 as a way to count the number of records with a temperature of over
100°C. With this modification, we recompile the code, re-create the JAR file, then
rerun the job, and while it’s running go to the tasks page.
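The idea of an enum-keyed counter can be sketched without Hadoop. The class below is a hypothetical in-memory stand-in: it counts within a single JVM only, whereas the framework aggregates counter values across all the tasks of a job. The threshold uses the same tenths-of-a-degree convention as the mapper.

```java
import java.util.EnumMap;
import java.util.Map;

public class EnumCounterSketch {
  enum Temperature { OVER_100 }

  private final Map<Temperature, Long> counters = new EnumMap<>(Temperature.class);

  void increment(Temperature counter) {
    counters.merge(counter, 1L, Long::sum);
  }

  long get(Temperature counter) {
    return counters.getOrDefault(counter, 0L);
  }

  // Readings are in tenths of a degree Celsius, so 1000 means 100.0 degrees.
  void record(int tenthsOfDegree) {
    if (tenthsOfDegree > 1000) {
      increment(Temperature.OVER_100);
    }
  }

  public static void main(String[] args) {
    EnumCounterSketch counts = new EnumCounterSketch();
    int[] readings = {317, 1000, 1957, 5900};  // illustrative values only
    for (int reading : readings) {
      counts.record(reading);
    }
    System.out.println("OVER_100=" + counts.get(Temperature.OVER_100));  // OVER_100=2
  }
}
```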
The tasks page
The job page has a number of links for looking at the tasks in a job in more detail. For
example, by clicking on the “map” link, you are brought to a page that lists
information for all of the map tasks on one page. You can also see just the
completed tasks. The screenshot in Figure 5-3 shows a portion of this page for the
job run with our debugging statements. Each row in the table is a task, and it
provides such information as the start and end times for each task, any errors
reported back from the tasktracker, and a link to view the counters for an individual task.
The “Status” column can be helpful for debugging, since it shows a task’s latest
status message. Before a task starts, it shows its status as “initializing,” then once it
starts reading records it shows the split information for the split it is reading as a
filename with a byte offset and length. You can see the status we set for debugging
for task task_200904110811_0003_m_000044, so let’s click through to the logs
page to find the associated debug message. (Notice, too, that there is an extra
counter for this task, since our user counter has a nonzero count for this task.)
The task details page
From the tasks page, you can click on any task to get more information about it. The
task details page, shown in Figure 5-4, shows each task attempt. In this case, there
was one task attempt, which completed successfully. The table provides further
useful data, such as the node the task attempt ran on, and links to task logfiles and
counters. The “Actions” column contains links for killing a task attempt. By default,
this is disabled, making the web UI a read-only interface. Set
webinterface.private.actions to true to enable the actions links.
Figure 5-3. Screenshot of the tasks page
Figure 5-4. Screenshot of the task details page
By setting webinterface.private.actions to true, you also allow anyone with access to
the HDFS web interface to delete files. The dfs.web.ugi property determines the user
that the HDFS web UI runs as, thus controlling which files may be viewed and deleted.
For map tasks, there is also a section showing which nodes the input split was
located on.
By following one of the links to the logfiles for the successful task attempt (you can
see the last 4 KB or 8 KB of each logfile, or the entire file), we can find the suspect
input record that we logged (the line is wrapped and truncated to fit on the page):
Temperature over 100 degrees for input:
0335999999433181957042302005+37950+139117SAO +0004RJSN
V020113590031500703569999994 33201957010100005+35317+139650SAO
This record seems to be in a different format to the others. For one thing, there are
spaces in the line, which are not described in the specification.
When the job has finished, we can look at the value of the counter we defined to see
how many records over 100°C there are in the whole dataset. Counters are
accessible via the web UI or the command line:
% hadoop job -counter job_200904110811_0003 'v4.MaxTemperatureMapper$Temperature' \
  OVER_100
3
The -counter option takes the job ID, counter group name (which is the fully qualified
classname here), and the counter name (the enum name). There are only three malformed records in the entire dataset of over a billion records. Throwing out bad
records is standard for many big data problems, although we need to be careful in
this case, since we are looking for an extreme value—the maximum temperature
rather than an aggregate measure. Still, throwing away three records is probably not
going to change the result.
Handling malformed data
Capturing input data that causes a problem is valuable, as we can use it in a test to
check that the mapper does the right thing:
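@Test
public void parsesMalformedTemperature() throws IOException,
    InterruptedException {
  MaxTemperatureMapper mapper = new MaxTemperatureMapper();
  // Suspect record captured from the task log (truncated)
  Text value = new Text("0335999999433181957042302005+37950+139117SAO +0004" +
      "RJSN V02011359003150070356999999433201957010100005+353");
  MaxTemperatureMapper.Context context =
      mock(MaxTemperatureMapper.Context.class);
  Counter counter = mock(Counter.class);
  when(context.getCounter(MaxTemperatureMapper.Temperature.MALFORMED))
      .thenReturn(counter);

  mapper.map(null, value, context);

  verify(context, never()).write(any(Text.class), any(IntWritable.class));
  verify(counter).increment(1);
}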
The record that was causing the problem is of a different format to the other lines
we’ve seen. Example 5-11 shows a modified program (version 5) using a parser that
ignores each line with a temperature field that does not have a leading sign (plus or
minus). We’ve also introduced a counter to measure the number of records that we
are ignoring for this reason.
Example 5-11. Mapper for maximum temperature example
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  enum Temperature {
    MALFORMED
  }

  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    parser.parse(value);
    if (parser.isValidTemperature()) {
      int airTemperature = parser.getAirTemperature();
      context.write(new Text(parser.getYear()), new IntWritable(airTemperature));
    } else if (parser.isMalformedTemperature()) {
      System.err.println("Ignoring possibly corrupt input: " + value);
      context.getCounter(Temperature.MALFORMED).increment(1);
    }
  }
}
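The parser change behind isMalformedTemperature() is not listed in this section. One way the check could work, as a sketch under stated assumptions, is to classify the five-character temperature field by its leading sign. The class SignedFieldSketch and its Kind enum are hypothetical, and the length and digit checks go beyond the rule described in the text, which mentions only the missing sign.

```java
public class SignedFieldSketch {
  enum Kind { VALID, MISSING, MALFORMED }

  private static final int MISSING_TEMPERATURE = 9999;

  // Classifies a five-character temperature field such as "+0317" or "-0011".
  static Kind classify(String field) {
    if (field.length() != 5) {
      return Kind.MALFORMED;
    }
    char sign = field.charAt(0);
    if (sign != '+' && sign != '-') {
      return Kind.MALFORMED;  // no leading sign: treat the record as malformed
    }
    String digits = field.substring(1);
    if (!digits.matches("\\d{4}")) {
      return Kind.MALFORMED;
    }
    int magnitude = Integer.parseInt(digits);
    return (sign == '+' && magnitude == MISSING_TEMPERATURE) ? Kind.MISSING : Kind.VALID;
  }

  public static void main(String[] args) {
    String[] fields = {"+0317", "-0011", "+9999", "01957"};
    for (String field : fields) {
      System.out.println(field + " -> " + classify(field));
    }
  }
}
```

Keeping the three outcomes distinct lets the mapper count malformed records separately from records that are merely missing a reading.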
Hadoop Logs
Hadoop produces logs in various places, for various audiences. These are
summarized in Table 5-2.
Table 5-2. Types of Hadoop logs

Logs: System daemon logs
Primary audience: Administrators
Description: Each Hadoop daemon produces a logfile (using log4j) and another file that combines standard out and error. Written in the directory defined by the HADOOP_LOG_DIR environment variable.
Further information: “System logfiles” and “Logging”

Logs: HDFS audit logs
Primary audience: Administrators
Description: A log of all HDFS requests, turned off by default. Written to the namenode’s log, although this is configurable.
Further information: “Audit Logging”

Logs: MapReduce job history logs
Primary audience: Users
Description: A log of the events (such as task completion) that occur in the course of running a job. Saved centrally on the jobtracker, and in the job’s output directory in a _logs/history subdirectory.
Further information: “Job History”

Logs: MapReduce task logs
Primary audience: Users
Description: Each tasktracker child process produces a logfile using log4j (called syslog), a file for data sent to standard out (stdout), and a file for standard error (stderr). Written in the userlogs subdirectory of the directory defined by the HADOOP_LOG_DIR environment variable.
Further information: This section.

As we have seen in the previous section, MapReduce task logs are accessible
through the web UI, which is the most convenient way to view them. You can also
find the logfiles on the local filesystem of the tasktracker that ran the task attempt, in
a directory named by the task attempt. If task JVM reuse is enabled (“Task JVM Reuse”), then each logfile accumulates the logs for the entire JVM run, so multiple task
attempts will be found in each logfile. The web UI hides this by showing only the
portion that is relevant for the task attempt being viewed.
It is straightforward to write to these logfiles. Anything written to standard output, or
standard error, is directed to the relevant logfile. (Of course, in Streaming, standard
output is used for the map or reduce output, so it will not show up in the standard
output log.)
In Java, you can write to the task’s syslog file if you wish by using the Apache
Commons Logging API. This is shown in Example 5-12.
Example 5-12. An identity mapper that writes to standard output and also uses the
Apache Commons Logging API
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapreduce.Mapper;

public class LoggingIdentityMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
    extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  private static final Log LOG = LogFactory.getLog(LoggingIdentityMapper.class);

  @Override
  public void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    // Log to stdout file
    System.out.println("Map key: " + key);
    // Log to syslog file
    LOG.info("Map key: " + key);
    if (LOG.isDebugEnabled()) {
      LOG.debug("Map value: " + value);
    }
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
The default log level is INFO, so DEBUG level messages do not appear in the
syslog task log file. However, sometimes you want to see these messages. To do
this, set mapred.map.child.log.level or mapred.reduce.child.log.level, as appropriate
(from 0.22). For example, in this case we could set it for the mapper to see the map
values in the log as follows:
% hadoop jar hadoop-examples.jar LoggingDriver -conf conf/hadoop-cluster.xml \
  -D mapred.map.child.log.level=DEBUG input/ncdc/sample.txt logging-out
There are some controls for managing retention and size of task logs. By default,
logs are deleted after a minimum of 24 hours (set using the
mapred.userlog.retain.hours property). You can also set a cap on the maximum size
of each logfile using the mapred.userlog.limit.kb property, which is 0 by default,
meaning there is no cap.
Sometimes you may need to debug a problem that you
suspect is occurring in the JVM running a Hadoop command, rather than on the
cluster. You can send DEBUG level logs to the console by using an invocation like this:
% HADOOP_ROOT_LOGGER=DEBUG,console hadoop fs -text /foo/bar
Remote Debugging
When a task fails and there is not enough information logged to diagnose the error,
you may want to resort to running a debugger for that task. This is hard to arrange
when running the job on a cluster, as you don’t know which node is going to process
which part of the input, so you can’t set up your debugger ahead of the failure.
However, there are a few other options available:
Reproduce the failure locally
Often the failing task fails consistently on a particular input. You can try to reproduce
the problem locally by downloading the file that the task is failing on and running the
job locally, possibly using a debugger such as Java’s VisualVM.
Use JVM debugging options
A common cause of failure is a Java out of memory error in the task JVM. You can
set mapred.child.java.opts to include -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dumps to produce a heap dump which can be
examined afterwards with tools like jhat or the Eclipse Memory Analyzer. Note that
the JVM options should be added to the existing memory settings specified by mapred.child.java.opts.
Use task profiling
Java profilers give a lot of insight into the JVM, and Hadoop provides a mechanism
to profile a subset of the tasks in a job.
Use IsolationRunner
Older versions of Hadoop provided a special task runner called IsolationRunner that
could rerun failed tasks in situ on the cluster. Unfortunately, it is no longer available
in recent versions.
In some cases it’s useful to keep the intermediate files for a failed task attempt for
later inspection, particularly if supplementary dump or profile files are created in the
task’s working directory. You can set keep.failed.task.files to true to keep a failed
task’s files.
You can keep the intermediate files for successful tasks, too, which may be handy if
you want to examine a task that isn’t failing. In this case, set the property
keep.task.files.pattern to a regular expression that matches the IDs of the tasks you
want to keep.
To examine the intermediate files, log into the node that the task failed on and look
for the directory for that task attempt. It will be under one of the local MapReduce
directories, as set by the mapred.local.dir property (covered in more detail in
“Important Hadoop Daemon Properties” ). If this property is a comma-separated list
of directories (to spread load across the physical disks on a machine), then you may
need to look in all of the directories before you find the directory for that particular
task attempt. The task attempt directory is in the following location:
Tuning a Job
After a job is working, the question many developers ask is, “Can I make it run
faster?”
There are a few Hadoop-specific “usual suspects” that are worth checking to see if
they are responsible for a performance problem. You should run through the
checklist in Table 5-3 before you start trying to profile or optimize at the task level.
Profiling Tasks
Like debugging, profiling a job running on a distributed system like MapReduce
presents some challenges. Hadoop allows you to profile a fraction of the tasks in a
job, and, as each task completes, pulls down the profile information to your machine
for later analysis with standard profiling tools.
Of course, it’s possible, and somewhat easier, to profile a job running in the local job
runner. And provided you can run with enough input data to exercise the map and
reduce tasks, this can be a valuable way of improving the performance of your
mappers and reducers. There are a couple of caveats, however. The local job
runner is a very different environment from a cluster, and the data flow patterns are
very different. Optimizing the CPU performance of your code may be pointless if
your MapReduce job is I/O-bound (as many jobs are). To be sure that any tuning is
effective, you should compare the new execution time with the old running on a real
cluster. Even this is easier said than done, since job execution times can vary due to
resource contention with other jobs and the decisions the scheduler makes to do
with task placement. To get a good idea of job execution time under these
circumstances, perform a series of runs (with and without the change) and check
whether any improvement is statistically significant.
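One way to make such a comparison concrete is Welch's t statistic, which compares the means of two sets of run times without assuming equal variances. The sketch below is only an illustration: the class name, method names, and the run times in main() are made up, and the resulting t value still has to be compared against the critical value of the t distribution for the appropriate degrees of freedom.

```java
import java.util.Arrays;

/** Compares two sets of job run times (in seconds) using Welch's t statistic. */
class RunTimeComparison {

    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    /** Unbiased sample variance (divides by n - 1). */
    static double variance(double[] xs) {
        double m = mean(xs);
        double sumSq = 0;
        for (double x : xs) {
            sumSq += (x - m) * (x - m);
        }
        return sumSq / (xs.length - 1);
    }

    /** Difference of the means divided by its estimated standard error. */
    static double welchT(double[] before, double[] after) {
        double se = Math.sqrt(variance(before) / before.length
                + variance(after) / after.length);
        return (mean(before) - mean(after)) / se;
    }

    public static void main(String[] args) {
        // Hypothetical run times, five runs of each variant
        double[] before = {612, 598, 640, 605, 621};
        double[] after = {601, 615, 597, 633, 609};
        System.out.println("t = " + welchT(before, after));
    }
}
```

A value of t close to zero means the difference between the two variants is small relative to the run-to-run noise.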
It’s unfortunately true that some problems (such as excessive memory use) can be
reproduced only on the cluster, and in these cases the ability to profile in situ is
indispensable.
The HPROF profiler
There are a number of configuration properties to control profiling, which are also
exposed via convenience methods on JobConf. The following modification to
MaxTemperatureDriver (version 6) will enable remote HPROF profiling. HPROF is a
profiling tool that comes with the JDK that, although basic, can give valuable information about a program’s CPU and heap usage:
Configuration conf = getConf();
conf.setBoolean("mapred.task.profile", true);
conf.set("mapred.task.profile.params", "-agentlib:hprof=cpu=samples,"
+ "heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s");
conf.set("mapred.task.profile.maps", "0-2");
conf.set("mapred.task.profile.reduces", ""); // no reduces
Job job = new Job(conf, "Max temperature");
The first line enables profiling, which by default is turned off. (Instead of using
mapred.task.profile you can also use the JobContext.TASK_PROFILE constant in
the new API.)
Next we set the profile parameters, which are the extra command-line arguments to
pass to the task’s JVM. (When profiling is enabled, a new JVM is allocated for each
task, even if JVM reuse is turned on; see “Task JVM Reuse” .) The default
parameters specify the HPROF profiler; here we set an extra HPROF option,
depth=6, to give more stack trace depth than the HPROF default. (These parameters
are set through the mapred.task.profile.params property.)
Finally, we specify which tasks we want to profile. We normally only want profile
information from a few tasks, so we use the properties mapred.task.profile.maps and
mapred.task.profile.reduces to specify the range of (map or reduce) task IDs that we
want profile information for. We’ve set the maps property to 0-2 (which is actually the
default), which means map tasks with IDs 0, 1, and 2 are profiled. A set of ranges is
permitted, using a notation that allows open ranges. For example, 0-1,4,6- would
specify all tasks except those with IDs 2, 3, and 5. The tasks to profile can also be
controlled using the JobContext.NUM_MAP_PROFILES constant for map tasks, and
JobContext.NUM_REDUCE_PROFILES for reduce tasks.
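The range notation can be illustrated with a small parser. This is a minimal sketch, not Hadoop's own implementation: TaskRanges and isIncluded() are hypothetical names, and the treatment of an open start (as 0) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for range strings such as "0-1,4,6-". */
class TaskRanges {
    // Each entry is an inclusive {start, end} pair
    private final List<int[]> ranges = new ArrayList<>();

    TaskRanges(String spec) {
        for (String part : spec.split(",")) {
            part = part.trim();
            if (part.isEmpty()) {
                continue;                      // an empty spec selects no tasks
            }
            int dash = part.indexOf('-');
            if (dash < 0) {                    // single ID, e.g. "4"
                int id = Integer.parseInt(part);
                ranges.add(new int[] {id, id});
            } else {                           // closed or open range, e.g. "0-1" or "6-"
                String lo = part.substring(0, dash);
                String hi = part.substring(dash + 1);
                int start = lo.isEmpty() ? 0 : Integer.parseInt(lo);
                int end = hi.isEmpty() ? Integer.MAX_VALUE : Integer.parseInt(hi);
                ranges.add(new int[] {start, end});
            }
        }
    }

    /** True if the given task ID falls inside any of the parsed ranges. */
    boolean isIncluded(int taskId) {
        for (int[] r : ranges) {
            if (taskId >= r[0] && taskId <= r[1]) {
                return true;
            }
        }
        return false;
    }
}
```

With the spec 0-1,4,6- this selects IDs 0, 1, 4, and everything from 6 upward, which matches the description of "all tasks except 2, 3, and 5".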
When we run a job with the modified driver, the profile output turns up at the end of
the job in the directory we launched the job from. Since we are only profiling a few
tasks, we can run the job on a subset of the dataset.
Here’s a snippet of one of the mapper’s profile files, which shows the CPU sampling
information:
CPU SAMPLES BEGIN (total = 1002) Sat Apr 11 11:17:52 2009
rank   self  count  trace  method
   1  3.49%     35 307969  java.lang.Object.<init>
   2  3.39%     34 307954  java.lang.Object.<init>
   3  3.19%     32 307945  java.util.regex.Matcher.<init>
   4  3.19%     32 307963  java.lang.Object.<init>
   5  3.19%     32 307973  java.lang.Object.<init>
Cross-referencing the trace number 307973 gives us the stacktrace from the same
file:
TRACE 307973: (thread=200001)
So it looks like the mapper is spending 3% of its time constructing IntWritable
objects. This observation suggests that it might be worth reusing the Writable
instances being output (version 7, see Example 5-13).
Example 5-13. Reusing the Text and IntWritable output objects
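public class MaxTemperatureMapper
  extends Mapper<LongWritable, Text, Text, IntWritable> {
  enum Temperature {
    MALFORMED
  }
  private NcdcRecordParser parser = new NcdcRecordParser();
  // Output objects are created once and reused across map() calls
  private Text year = new Text();
  private IntWritable temp = new IntWritable();
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    parser.parse(value);
    if (parser.isValidTemperature()) {
      year.set(parser.getYear());
      temp.set(parser.getAirTemperature());
      context.write(year, temp);
    } else if (parser.isMalformedTemperature()) {
      System.err.println("Ignoring possibly corrupt input: " + value);
      context.getCounter(Temperature.MALFORMED).increment(1);
    }
  }
}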
However, we know if this is significant only if we can measure an improvement when
running the job over the whole dataset. Running each variant five times on an
otherwise quiet 11-node cluster showed no statistically significant difference in job
execution time. Of course, this result holds only for this particular combination of
code, data, and hardware, so you should perform similar benchmarks to see
whether such a change is significant for your setup.
Other profilers
The mechanism for retrieving profile output is HPROF-specific, so if you use another
profiler you will need to manually retrieve the profiler’s output from tasktrackers for
analysis.
If the profiler is not installed on all the tasktracker machines, consider using the Distributed Cache (“Distributed Cache” ) for making the profiler binary available on the
required machines.
MapReduce Workflows
So far in this chapter, you have seen the mechanics of writing a program using MapReduce. We haven’t yet considered how to turn a data processing problem into the
MapReduce model.
The data processing you have seen so far in this book is to solve a fairly simple
problem (finding the maximum recorded temperature for given years). When the
processing gets more complex, this complexity is generally manifested by having
more MapReduce jobs, rather than having more complex map and reduce functions.
In other words, as a rule of thumb, think about adding more jobs, rather than adding
complexity to jobs.
For more complex problems, it is worth considering a higher-level language than
MapReduce, such as Pig, Hive, Cascading, Cascalog, or Crunch. One immediate
benefit is that it frees you up from having to do the translation into MapReduce jobs,
allowing you to concentrate on the analysis you are performing.
Finally, the book Data-Intensive Text Processing with MapReduce by Jimmy Lin
and Chris Dyer (Morgan & Claypool Publishers, 2010) is a
great resource for learning more about MapReduce algorithm design, and is highly
recommended.
Apache Oozie
If you need to run a complex workflow, or one on a tight production schedule, or you
have a large number of connected workflows with data dependencies between them,
Apache Oozie fits the bill in any or all of these cases. It has
been designed to manage the executions of thousands of dependent workflows,
each composed of possibly thousands of constituent actions at the level of an
individual MapReduce job.
Oozie has two main parts: a workflow engine that stores and runs workflows
composed of Hadoop jobs, and a coordinator engine that runs workflow jobs based
on predefined schedules and data availability. The latter property is especially
powerful since it allows a workflow job to wait until its input data has been produced
by a dependent workflow; also, it makes rerunning failed workflows more tractable,
since no time is wasted running successful parts of a workflow. Anyone who has
managed a complex batch system knows how difficult it can be to catch up from jobs
missed due to downtime or failure, and will appreciate this feature.
Unlike JobControl, which runs on the client machine submitting the jobs, Oozie runs
as a service in the cluster, and clients submit workflow definitions for immediate or
later execution. In Oozie parlance, a workflow is a DAG of action nodes and
control-flow nodes. An action node performs a workflow task, like moving files in
HDFS, running a MapReduce, Streaming, Pig, or Hive job, performing a Sqoop import,
or running an arbitrary shell script or Java program. A control-flow node governs the
workflow execution between actions by allowing such constructs as conditional logic
(so different execution branches may be followed depending on the result of an
earlier action node) or parallel execution. When the workflow completes, Oozie can
make an HTTP callback to the client to inform it of the workflow status. It is also
possible to receive callbacks every time the workflow enters or exits an action node.
Defining an Oozie workflow
Workflow definitions are written in XML using the Hadoop Process Definition Language, the specification for which can be found on the Oozie website. Example 5-14
shows a simple Oozie workflow definition for running a single MapReduce job.
Example 5-14. Oozie workflow definition to run the maximum temperature
MapReduce job
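<workflow-app xmlns="uri:oozie:workflow:0.1" name="max-temp-workflow">
  <start to="max-temp-mr"/>
  <action name="max-temp-mr">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="${nameNode}/user/${wf:user()}/output"/>
      </prepare>
      <configuration>
        ...
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>MapReduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>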
This workflow has three control-flow nodes and one action node: a start control
node, a map-reduce action node, a kill control node, and an end control node.
All workflows must have one start and one end node. When the workflow job
starts it transitions to the node specified by the start node (the max-temp-mr
action in this example). A workflow job succeeds when it transitions to the end
node. However, if the workflow job transitions to a kill node, then it is considered
to have failed and reports an error message as specified by the message element
in the workflow definition.
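The transition rules just described can be modelled in a few lines. The following is a toy model only, not Oozie's implementation: WorkflowSketch, Node, and maxTemp() are hypothetical names, and the outcome of each action is supplied by the caller rather than by running a real job.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Toy model of workflow transitions between start, action, kill, and end nodes. */
class WorkflowSketch {
    enum Kind { ACTION, KILL, END }

    static final class Node {
        final Kind kind;
        final String okTo;      // next node if the action succeeds
        final String errorTo;   // next node if the action fails

        Node(Kind kind, String okTo, String errorTo) {
            this.kind = kind;
            this.okTo = okTo;
            this.errorTo = errorTo;
        }
    }

    private final Map<String, Node> nodes = new HashMap<>();
    private final String startTo;   // the node named by the start node

    WorkflowSketch(String startTo) {
        this.startTo = startTo;
    }

    void add(String name, Node node) {
        nodes.put(name, node);
    }

    /** Follows transitions from the start node; returns true if the end node is reached. */
    boolean run(Predicate<String> actionSucceeds) {
        String current = startTo;
        while (true) {
            Node node = nodes.get(current);
            if (node.kind == Kind.END) {
                return true;                // workflow job succeeded
            }
            if (node.kind == Kind.KILL) {
                return false;               // workflow job failed
            }
            current = actionSucceeds.test(current) ? node.okTo : node.errorTo;
        }
    }

    /** Mirrors the node names used in the maximum temperature workflow. */
    static WorkflowSketch maxTemp() {
        WorkflowSketch wf = new WorkflowSketch("max-temp-mr");
        wf.add("max-temp-mr", new Node(Kind.ACTION, "end", "fail"));
        wf.add("fail", new Node(Kind.KILL, null, null));
        wf.add("end", new Node(Kind.END, null, null));
        return wf;
    }
}
```

Calling maxTemp().run(name -> true) follows the ok transition to the end node, whereas a failing action follows the error transition to the kill node.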
The bulk of this workflow definition file specifies the map-reduce action. The first
two elements, job-tracker and name-node, are used to specify the jobtracker to
submit the job to, and the namenode (actually a Hadoop filesystem URI) for input
and output data. Both are parameterized so that the workflow definition is not tied
to a particular cluster (which makes it easy to test).
The optional prepare element runs before the MapReduce job, and is used for
directory deletion (and creation too, if needed, although that is not shown here). By
ensuring that the output directory is in a consistent state before running a job, Oozie
can safely rerun the action if the job fails.
The MapReduce job to run is specified in the configuration element using nested
elements for specifying the Hadoop configuration name-value pairs. You can view
the MapReduce configuration section as a declarative replacement for the driver
classes that we have used elsewhere in this book for running MapReduce programs
(such as Example 2-6).
There are two non-standard Hadoop properties, mapred.input.dir and
mapred.output.dir, which are used to set the FileInputFormat input paths and
FileOutputFormat output path, respectively.
We have taken advantage of JSP Expression Language (EL) functions in several
places in the workflow definition. Oozie provides a set of functions for interacting
with the workflow; ${wf:user()}, for example, returns the name of the user who
started the current workflow job, and we use it to specify the correct filesystem path.
The Oozie specification lists all the EL functions that Oozie supports.
Packaging and deploying an Oozie workflow application
A workflow application is made up of the workflow definition plus all the associated
resources (such as MapReduce JAR files, Pig scripts, and so on), needed to run it.
Applications must adhere to a simple directory structure, and are deployed to HDFS
so that they can be accessed by Oozie. For this workflow application, we’ll put all of
the files in a base directory called max-temp-workflow, as shown diagrammatically
here:
max-temp-workflow/
├── lib/
│   └── hadoop-examples.jar
└── workflow.xml
The workflow definition file workflow.xml must appear in the top-level of this
directory. JAR files containing the application’s MapReduce classes are placed in
the lib directory.
Workflow applications that conform to this layout can be built with any suitable build
tool, like Ant or Maven; you can find an example in the code that accompanies this
book. Once an application has been built, it should be copied to HDFS using regular
Hadoop tools. Here is the appropriate command for this application:
% hadoop fs -put hadoop-examples/target/max-temp-workflow max-temp-workflow
Running an Oozie workflow job
Next let’s see how to run a workflow job for the application we just uploaded. For this
we use the oozie command line tool, a client program for communicating with an
Oozie server. For convenience we export the OOZIE_URL environment variable to
tell the oozie command which Oozie server to use (we’re using one running locally):
% export OOZIE_URL="http://localhost:11000/oozie"
There are lots of sub-commands for the oozie tool (type oozie help to get a list), but
we’re going to call the job subcommand with the -run option to run the workflow job:
% oozie job -config ch05/src/main/resources/ -run
job: 0000009-120119174508294-oozie-tom-W
The -config option specifies a local Java properties file containing definitions for the
parameters in the workflow XML file (in this case nameNode and jobTracker), as
well as oozie.wf.application.path, which tells Oozie the location of the workflow
application in HDFS. Here are the contents of the properties file:
To get information about the status of the workflow job we use the -info option, using
the job ID that was printed by the run command earlier (type oozie jobs to get a list of
all jobs).
% oozie job -info 0000009-120119174508294-oozie-tom-W
The output shows the status: RUNNING, KILLED, or SUCCEEDED. You can find all
this information via Oozie’s web UI too, available at http://localhost:11000/oozie.
When the job has succeeded we can inspect the results in the usual way:
% hadoop fs -cat output/part-*
1949 111
1950 22
This example only scratched the surface of writing Oozie workflows. The documentation on Oozie’s website has information about creating more complex workflows,
as well as writing and running coordinator jobs.
How MapReduce Works
In this chapter, we look at how MapReduce in Hadoop works in detail. This
knowledge provides a good foundation for writing more advanced MapReduce
programs, which we will cover in the following two chapters.
Anatomy of a MapReduce Job Run
You can run a MapReduce job with a single method call: submit() on a Job object
(note that you can also call waitForCompletion(), which will submit the job if it hasn’t
been submitted already, then wait for it to finish).1 This method call conceals a great
deal of processing behind the scenes. This section uncovers the steps Hadoop takes
to run a job. We saw in Chapter 5 that the way Hadoop executes a MapReduce
program depends on a couple of configuration settings. In releases of Hadoop up to
and including the 0.20 release series, mapred.job.tracker determines the means of
execution. If this configuration property is set to local, the default, then the local job
runner is used. This runner runs the whole job in a single JVM. It’s designed for
testing and for running MapReduce programs on small datasets.
Alternatively, if mapred.job.tracker is set to a colon-separated host and port pair,
then the property is interpreted as a jobtracker address, and the runner submits the
job to the jobtracker at that address. The whole process is described in detail in the
next section.
In Hadoop 0.23.0 a new MapReduce implementation was introduced.
The new implementation (called MapReduce 2) is built on a system called YARN,
described in “YARN (MapReduce 2)”. For now, just note that the framework that is
used for execution is set by the mapreduce.framework.name property, which takes
the values local (for the local job runner), classic (for the “classic” MapReduce
framework, also called MapReduce 1, which uses a jobtracker and tasktrackers), and
yarn (for the new framework).
1. In the old MapReduce API you can call JobClient.submitJob(conf) or
JobClient.runJob(conf).
Classic MapReduce (MapReduce 1)
A job run in classic MapReduce is illustrated in Figure 6-1. At the highest level,
there are four independent entities:
• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker is a Java
application whose main class is JobTracker.
• The tasktrackers, which run the tasks that the job has been split into.
Tasktrackers are Java applications whose main class is TaskTracker.
• The distributed filesystem (normally HDFS), which is used for sharing job files
between the other entities.
Job Submission
The submit() method on Job creates an internal JobSubmitter instance and calls
submitJobInternal() on it (step 1 in Figure 6-1). Having submitted the job,
waitForCompletion() polls the job’s progress once a second and reports the progress to the
console if it has changed since the last report. When the job is complete, if it was
successful, the job counters are displayed. Otherwise, the error that caused the job
to fail is logged to the console.
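The polling behavior can be sketched without any Hadoop classes. In this illustration the progress source is a plain DoubleSupplier standing in for the job, and ProgressPoller and poll() are hypothetical names; the real client polls once a second, whereas the interval here is a parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleSupplier;

/** Polls a progress source and records a report only when the value has changed. */
class ProgressPoller {

    /** Polls until progress reaches 1.0; returns the lines that would be printed. */
    static List<String> poll(DoubleSupplier progress, long intervalMillis) {
        List<String> reports = new ArrayList<>();
        double last = -1;
        while (true) {
            double now = progress.getAsDouble();
            if (now != last) {                      // report only when something changed
                reports.add("progress " + Math.round(now * 100) + "%");
                last = now;
            }
            if (now >= 1.0) {
                return reports;                     // job is complete
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop polling
                return reports;
            }
        }
    }
}
```

Repeated identical readings produce a single report, which is why the console does not fill with duplicate progress lines while a job is waiting.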
Figure 6-1. How Hadoop runs a MapReduce job using the classic framework
The job submission process implemented by JobSubmitter does the following:
• Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker)
(step 2).
• Checks the output specification of the job. For example, if the output directory has
not been specified or it already exists, the job is not submitted and an error is thrown
to the MapReduce program.
• Computes the input splits for the job. If the splits cannot be computed, because the
input paths don’t exist, for example, then the job is not submitted and an error is
thrown to the MapReduce program.
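The output specification check can be sketched against the local filesystem. This is an illustration only: it uses java.nio.file rather than Hadoop's FileSystem API, and OutputCheck and checkOutput() are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Refuses to proceed when the output directory is unset or already present. */
class OutputCheck {

    /** Throws IllegalStateException if the output directory is unset or already exists. */
    static void checkOutput(Path outputDir) {
        if (outputDir == null) {
            throw new IllegalStateException("Output directory not set");
        }
        if (Files.exists(outputDir)) {
            throw new IllegalStateException(
                    "Output directory " + outputDir + " already exists");
        }
    }
}
```

Failing before submission means that an existing result directory is never overwritten by accident, and the error surfaces before any cluster resources are used.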
Job Initialization
When the JobTracker receives a call to its submitJob() method, it puts it into an
internal queue from where the job scheduler will pick it up and initialize it.
Initialization involves creating an object to represent the job being run, which
encapsulates its tasks, and bookkeeping information to keep track of the tasks’
status and progress (step 5).
To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the client from the shared filesystem (step 6). It then creates one map task
for each split. The number of reduce tasks to create is determined by the
mapred.reduce.tasks property in the Job, which is set by the setNumReduceTasks()
method, and the scheduler simply creates this number of reduce tasks to be run.
Tasks are given IDs at this point.
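As a rough sketch of this step, the following creates one map task per input split and a fixed number of reduce tasks. TaskPlanner and planTasks() are hypothetical names, and the ID format shown is illustrative rather than Hadoop's exact naming scheme.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Builds the list of tasks for a job from its split count and reduce count. */
class TaskPlanner {

    /** Returns illustrative task IDs: one map per split, then numReduces reduces. */
    static List<String> planTasks(String jobId, int numSplits, int numReduces) {
        List<String> tasks = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            tasks.add(String.format(Locale.ROOT, "%s_m_%06d", jobId, i));
        }
        for (int i = 0; i < numReduces; i++) {
            tasks.add(String.format(Locale.ROOT, "%s_r_%06d", jobId, i));
        }
        return tasks;
    }
}
```

The number of map tasks is driven entirely by the input (the number of splits), while the number of reduce tasks is a configuration choice.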
In addition to the map and reduce tasks, two further tasks are created: a job setup
task and a job cleanup task. These are run by tasktrackers and are used to run code
to set up the job before any map tasks run, and to clean up after all the reduce tasks
are complete. The OutputCommitter that is configured for the job determines the
code to be run, and by default this is a FileOutputCommitter. For the job setup task it
will create the final output directory for the job and the temporary working space for
the task output, and for the job cleanup task it will delete the temporary working
space for the task output. The commit protocol is described in more detail in “Output
Committers”.
Task Assignment
Tasktrackers run a simple loop that periodically sends heartbeat method calls to the
jobtracker. Heartbeats tell the jobtracker that a tasktracker is alive, but they also
double as a channel for messages. As a part of the heartbeat, a tasktracker will
indicate whether it is ready to run a new task, and if it is, the jobtracker will allocate it
a task, which it communicates to the tasktracker using the heartbeat return value
(step 7).
Before it can choose a task for the tasktracker, the jobtracker must choose a job to
select the task from. There are various scheduling algorithms as explained later in
this chapter (see “Job Scheduling” ), but the default one simply maintains a priority
list of jobs. Having chosen a job, the jobtracker now chooses a task for the job.
Tasktrackers have a fixed number of slots for map tasks and for reduce tasks: for example, a tasktracker may be able to run two map tasks and two reduce tasks
simultaneously. (The precise number depends on the number of cores and the
amount of memory on the tasktracker; see “Memory” .) The default scheduler fills
empty map task slots before reduce task slots, so if the tasktracker has at least one
empty map task slot, the jobtracker will select a map task; otherwise, it will select a
reduce task.
To choose a reduce task, the jobtracker simply takes the next in its list of
yet-to-be-run reduce tasks, since there are no data locality considerations. For a map task,
however, it takes account of the tasktracker’s network location and picks a task
whose input split is as close as possible to the tasktracker. In the optimal case, the
task is data-local, that is, running on the same node that the split resides on.
Alternatively, the task may be rack-local: on the same rack, but not the same node,
as the split. Some tasks are neither data-local nor rack-local and retrieve their data
from a different rack from the one they are running on. You can tell the proportion of
each type of task by looking at a job’s counters.
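The two selection rules (map slots before reduce slots, then the closest split for a map task) can be sketched as follows. This is a simplified model under stated assumptions: each split is taken to live on a single host, and LocalityScheduler, Split, chooseTaskType(), and pickMapTask() are hypothetical names, not the jobtracker's real classes.

```java
import java.util.List;

/** Sketch of slot filling and locality-aware map task selection. */
class LocalityScheduler {
    enum TaskType { MAP, REDUCE, NONE }

    // Declared from closest to farthest, so a lower ordinal means better locality
    enum Locality { DATA_LOCAL, RACK_LOCAL, OFF_RACK }

    /** Location of a map task's input split (single replica, for brevity). */
    static final class Split {
        final String host;
        final String rack;

        Split(String host, String rack) {
            this.host = host;
            this.rack = rack;
        }
    }

    /** Map slots are filled before reduce slots. */
    static TaskType chooseTaskType(int freeMapSlots, int freeReduceSlots) {
        if (freeMapSlots > 0) {
            return TaskType.MAP;
        }
        return freeReduceSlots > 0 ? TaskType.REDUCE : TaskType.NONE;
    }

    static Locality classify(Split split, String trackerHost, String trackerRack) {
        if (split.host.equals(trackerHost)) {
            return Locality.DATA_LOCAL;
        }
        if (split.rack.equals(trackerRack)) {
            return Locality.RACK_LOCAL;
        }
        return Locality.OFF_RACK;
    }

    /** Returns the index of the pending split closest to the tasktracker, or -1 if none. */
    static int pickMapTask(List<Split> pending, String trackerHost, String trackerRack) {
        int best = -1;
        Locality bestLocality = null;
        for (int i = 0; i < pending.size(); i++) {
            Locality l = classify(pending.get(i), trackerHost, trackerRack);
            if (bestLocality == null || l.compareTo(bestLocality) < 0) {
                best = i;
                bestLocality = l;
            }
        }
        return best;
    }
}
```

Ties are broken in favor of the earliest pending split, which is a choice made for this sketch rather than a documented scheduler behavior.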
Task Execution
Now that the tasktracker has been assigned a task, the next step is for it to run the
task. First, it localizes the job JAR by copying it from the shared filesystem to the
tasktracker’s filesystem. It also copies any files needed from the distributed cache by
the application to the local disk; see “Distributed Cache” (step 8). Second, it creates
a local working directory for the task, and un-jars the contents of the JAR into this
directory. Third, it creates an instance of TaskRunner to run the task.
TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step
10), so that any bugs in the user-defined map and reduce functions don’t affect the
tasktracker (by causing it to crash or hang, for example). It is, however, possible to
reuse the JVM between tasks; see “Task JVM Reuse” .
The child process communicates with its parent through the umbilical interface. This
way it informs the parent of the task’s progress every few seconds until the task is
complete.
Each task can perform setup and cleanup actions, which are run in the same JVM as
the task itself, and are determined by the OutputCommitter for the job (see “Output
Committers” ). The cleanup action is used to commit the task, which in the case of
file-based jobs means that its output is written to the final location for that task. The
commit protocol ensures that when speculative execution is enabled (“Speculative
Execution” ), only one of the duplicate tasks is committed and the other is aborted.
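The "only one attempt commits" guarantee can be illustrated with an atomic first-writer-wins check. This is a sketch of the idea, not of the OutputCommitter API; CommitArbiter and canCommit() are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Grants the right to commit to the first attempt that asks for each task. */
class CommitArbiter {
    // task ID -> ID of the attempt that won the right to commit
    private final ConcurrentMap<String, String> committed = new ConcurrentHashMap<>();

    /** Returns true if this attempt may commit; false means it should abort. */
    boolean canCommit(String taskId, String attemptId) {
        String winner = committed.putIfAbsent(taskId, attemptId);
        return winner == null || winner.equals(attemptId);
    }
}
```

Because putIfAbsent() is atomic, two speculative attempts of the same task that finish at nearly the same moment cannot both be granted the commit.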
Streaming and Pipes
Both Streaming and Pipes run special map and reduce tasks for the purpose of
launching the user-supplied executable and communicating with it (Figure 6-2).
In the case of Streaming, the Streaming task communicates with the process (which
may be written in any language) using standard input and output streams. The Pipes
task, on the other hand, listens on a socket and passes the C++ process a port
number in its environment, so that on startup, the C++ process can establish a
persistent socket connection back to the parent Java Pipes task.
In both cases, during execution of the task, the Java process passes input key-value
pairs to the external process, which runs it through the user-defined map or reduce
function and passes the output key-value pairs back to the Java process. From the
tasktracker’s point of view, it is as if the tasktracker child process ran the map or
reduce code itself.
Progress and Status Updates
MapReduce jobs are long-running batch jobs, taking anything from minutes to hours
to run. Because this is a significant length of time, it’s important for the user to get
feedback on how the job is progressing. A job and each of its tasks have a status,
which includes such things as the state of the job or task (e.g., running, successfully
completed, failed), the progress of maps and reduces, the values of the job’s
counters, and a status message or description (which may be set by user code).
These statuses change over the course of the job, so how do they get
communicated back to the client?
When a task is running, it keeps track of its progress, that is, the proportion of the
task completed. For map tasks, this is the proportion of the input that has been
processed. For reduce tasks, it’s a little more complex, but the system can still
estimate the proportion of the reduce input processed. It does this by dividing the
total progress into three parts, corresponding to the three phases of the shuffle (see
“Shuffle and Sort”). For example, if the task has run the reducer on half its input,
then the task’s progress is ⅚, since it has completed the copy and sort phases (⅓
each) and is halfway through the reduce phase (⅙).
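The arithmetic can be written down directly. In this small sketch, ReduceProgress and overall() are hypothetical names, and each of the three phases is given an equal one-third weight as described above.

```java
/** Computes a reduce task's overall progress from its phase and progress within it. */
class ReduceProgress {
    enum Phase { COPY, SORT, REDUCE }

    /** Overall progress in [0, 1], given the current phase and the fraction of it done. */
    static double overall(Phase phase, double fractionOfPhase) {
        // Completed phases each contribute a full third; the current one a partial third
        return (phase.ordinal() + fractionOfPhase) / Phase.values().length;
    }
}
```

For a task halfway through the reduce phase this evaluates (2 + 0.5) / 3, which is the five-sixths figure given in the text.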
Figure 6-2. The relationship of the Streaming and Pipes executable to the
tasktracker and its child
What Constitutes Progress in MapReduce?
Progress is not always measurable, but nevertheless it tells Hadoop that a task is
doing something. For example, a task writing output records is making progress,
even though it cannot be expressed as a percentage of the total number that will be
written, since the latter figure may not be known, even by the task producing the
output.
Progress reporting is important, as it means Hadoop will not fail a task that’s making
progress. All of the following operations constitute progress:
• Reading an input record (in a mapper or reducer)
• Writing an output record (in a mapper or reducer)
• Setting the status description on a reporter (using Reporter’s setStatus() method)
• Incrementing a counter (using Reporter’s incrCounter() method)
• Calling Reporter’s progress() method
Tasks also have a set of counters that count various events as the task runs (we saw
an example in “A test run” ), either those built into the framework, such as the
number of map output records written, or ones defined by users.
If a task reports progress, it sets a flag to indicate that the status change should be
sent to the tasktracker. The flag is checked in a separate thread every three
seconds, and if set it notifies the tasktracker of the current task status. Meanwhile,
the tasktracker is sending heartbeats to the jobtracker every five seconds (this is a
minimum, as the heartbeat interval is actually dependent on the size of the cluster:
for larger clusters, the interval is longer), and the status of all the tasks being run by
the tasktracker is sent in the call. Counters are sent less frequently than every five
seconds, because they can be relatively high-bandwidth.
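The flag-and-periodic-check pattern can be sketched as follows. This is an illustration of the mechanism only: StatusReporter, setStatus(), and flushIfChanged() are hypothetical names, and the list simply records what would have been sent to the tasktracker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Records status changes and forwards only the latest one on each periodic check. */
class StatusReporter {
    private final AtomicBoolean changed = new AtomicBoolean(false);
    private volatile String status = "";

    // Stands in for the messages sent to the tasktracker
    final List<String> sentToTasktracker = new ArrayList<>();

    /** Called by the task whenever it reports progress. */
    void setStatus(String newStatus) {
        status = newStatus;
        changed.set(true);      // mark that there is something new to send
    }

    /** Called periodically by a separate thread; sends only if the flag was set. */
    void flushIfChanged() {
        if (changed.getAndSet(false)) {
            sentToTasktracker.add(status);
        }
    }
}
```

In a running task, flushIfChanged() could be driven by a ScheduledExecutorService using scheduleAtFixedRate() with a three-second period. Several status changes between two checks collapse into a single update, which keeps the traffic to the tasktracker bounded.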
The jobtracker combines these updates to produce a global view of the status of all
the jobs being run and their constituent tasks. Finally, as mentioned earlier, the Job
receives the latest status by polling the jobtracker every second. Clients can also use
Job’s getStatus() method to obtain a JobStatus instance, which contains all of the
status information for the job.
The method calls are illustrated in Figure 6-3.
Job Completion
When the jobtracker receives a notification that the last task for a job is complete
(this will be the special job cleanup task), it changes the status for the job to
“successful.” Then, when the Job polls for status, it learns that the job has completed
successfully, so it prints a message to tell the user and then returns from the
waitForCompletion() method.
The jobtracker also sends an HTTP job notification if it is configured to do so. This
can be configured by clients wishing to receive callbacks, via the
job.end.notification.url property.
Last, the jobtracker cleans up its working state for the job and instructs tasktrackers
to do the same (so intermediate output is deleted, for example).
YARN (MapReduce 2)
For very large clusters in the region of 4000 nodes and higher, the MapReduce
system described in the previous section begins to hit scalability bottlenecks, so in
2010 a group at Yahoo! began to design the next generation of MapReduce. The
result was YARN, short for Yet Another Resource Negotiator (or if you prefer
recursive acronyms, YARN Application Resource Negotiator).
You can read more about the motivation for and development of YARN in Arun C
Murthy’s post, The Next Generation of Apache Hadoop MapReduce.
Figure 6-3. How status updates are propagated through the MapReduce 1 system
YARN meets the scalability shortcomings of “classic” MapReduce by splitting the responsibilities of the jobtracker into separate entities. The jobtracker takes care of
both job scheduling (matching tasks with tasktrackers) and task progress monitoring
(keeping track of tasks and restarting failed or slow tasks, and doing task
bookkeeping such as maintaining counter totals).
YARN separates these two roles into two independent daemons: a resource
manager to manage the use of resources across the cluster, and an application
master to manage the lifecycle of applications running on the cluster. The idea is that
an application master negotiates with the resource manager for cluster resources—
described in terms of a number of containers each with a certain memory limit—then
runs application-specific processes in those containers. The containers are overseen
by node managers running on cluster nodes, which ensure that the application does
not use more resources than it has been allocated. At the time of writing, memory is
the only resource that is managed, and node managers will kill any containers that
exceed their allocated memory limits.
In contrast to the jobtracker, each instance of an application—here a MapReduce job
—has a dedicated application master, which runs for the duration of the application.
This model is actually closer to the original Google MapReduce paper, which
describes how a master process is started to coordinate map and reduce tasks
running on a set of workers.
As described, YARN is more general than MapReduce, and in fact MapReduce is
just one type of YARN application. There are a few other YARN applications—such
as a distributed shell that can run a script on a set of nodes in the cluster—and
at ByYarn). The beauty of YARN’s design is
that different YARN applications can co-exist on the same cluster—so a MapReduce
application can run at the same time as an MPI application, for example—which
brings great benefits for manageability and cluster utilization.
Furthermore, it is even possible for users to run different versions of MapReduce on
the same YARN cluster, which makes the process of upgrading MapReduce more
manageable. (Note that some parts of MapReduce, like the job history server and the
shuffle handler, as well as YARN itself, still need to be upgraded across the cluster.)
MapReduce on YARN involves more entities than classic MapReduce. They are:
• The client, which submits the MapReduce job.
• The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
• The YARN node managers, which launch and monitor the compute containers
on machines in the cluster.
• The MapReduce application master, which coordinates the tasks running the
MapReduce job. The application master and the MapReduce tasks run in containers
that are scheduled by the resource manager and managed by the node
managers.
• The distributed filesystem (normally HDFS, covered in Chapter 3), which is
used for sharing job files between the other entities.
Job Submission
Jobs are submitted in MapReduce 2 using the same user API as MapReduce 1 (step
1). MapReduce 2 has an implementation of ClientProtocol that is activated when
mapre is set to yarn. The submission process is very similar
to the classic implementation. The new job ID is retrieved from the resource
manager (rather than the jobtracker), although in the nomenclature of YARN it is an
application ID (step 2). The job client checks the output specification of the job;
computes input splits (although there is an option to generate them on the cluster, pute-splits-in-cluster, which can be beneficial for jobs
with many splits); and copies job resources (including the job JAR, configuration, and
split information) to HDFS (step 3). Finally, the job is submitted by calling
submitApplication() on the resource manager (step 4).
Job Initialization
When the resource manager receives a call to its submitApplication(), it hands off the
request to the scheduler. The scheduler allocates a container, and the resource
manager then launches the application master’s process there, under the node
manager’s management (steps 5a and 5b).
The application master for MapReduce jobs is a Java application whose main class
is MRAppMaster. It initializes the job by creating a number of bookkeeping objects to
keep track of the job’s progress, as it will receive progress and completion reports
from the tasks (step 6). Next, it retrieves the input splits computed in the client from
the shared filesystem (step 7). It then creates a map task object for each split, and a
number of reduce task objects determined by the mapreduce.job.reduces property.
The next thing the application master does is decide how to run the tasks that make
up the MapReduce job. If the job is small, the application master may choose to run
them in the same JVM as itself, since it judges the overhead of allocating new
containers and running tasks in them as outweighing the gain to be had in running
them in parallel, compared to running them sequentially on one node. (This is
different to MapReduce 1, where small jobs are never run on a single tasktracker.)
Such a job is said to be uberized, or run as an uber task.
What qualifies as a small job? By default, one that has fewer than 10 mappers, only
one reducer, and an input size that is less than the size of one HDFS block. (These
values may be changed for a job by setting mapreduce.job.ubertask.maxmaps,
mapreduce.job.ubertask.maxreduces, and mapreduce.job.ubertask.maxbytes.) It’s
also possible to disable uber tasks entirely (by setting
mapreduce.job.ubertask.enable to false).
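The default test for a small job can be restated as a few lines of plain Java. This is only a sketch of the rule as described above: the class and method names are hypothetical, the block size is passed in by the caller, and treating a map-only job (zero reducers) as eligible is an assumption.

```java
// Illustrative only: a plain-Java restatement of the default "small job" test.
// The class and method names are hypothetical, not part of Hadoop.
final class UberTaskCheck {
    static final int MAX_MAPS_EXCLUSIVE = 10; // "fewer than 10 mappers"
    static final int MAX_REDUCES = 1;         // "only one reducer"

    // blockSizeBytes is supplied by the caller; the text only says "one HDFS block".
    static boolean isSmallJob(int maps, int reduces, long inputBytes, long blockSizeBytes) {
        return maps < MAX_MAPS_EXCLUSIVE
                && reduces <= MAX_REDUCES      // assumption: map-only jobs also qualify
                && inputBytes < blockSizeBytes;
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024; // assumed block size for the demo
        System.out.println(isSmallJob(3, 1, 10L * 1024 * 1024, block));
        System.out.println(isSmallJob(12, 1, 10L * 1024 * 1024, block));
    }
}
```

All three conditions must hold at once, so a job with few mappers but a large input still gets its own containers.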
Before any tasks can be run, the job setup method is called (for the job’s
OutputCommitter), to create the job’s output directory. In contrast to MapReduce 1,
where it is called in a special task that is run by the tasktracker, in the YARN
implementation the method is called directly by the application master.
Task Assignment
If the job does not qualify for running as an uber task, then the application master
requests containers for all the map and reduce tasks in the job from the resource
manager (step 8). The requests, which are piggybacked on heartbeat calls, include
information about each map task’s data locality, in particular the hosts and
corresponding racks that the input split resides on. The scheduler uses this
information to make scheduling decisions (just like a jobtracker’s scheduler does): it
attempts to place tasks on data-local nodes in the ideal case, but if this is not
possible the scheduler prefers rack-local placement to non-local placement.
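The placement preference (data-local, then rack-local, then anywhere) can be illustrated with a small standalone sketch. Nothing here is scheduler code: the class, the node names, and the rack map are hypothetical, and a real scheduler weighs many other factors.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the locality preference; not Hadoop's scheduler.
final class LocalityPreference {
    // Declared best-first so that a lower ordinal means a better placement.
    enum Locality { DATA_LOCAL, RACK_LOCAL, OFF_RACK }

    static Locality classify(String node, Set<String> splitHosts, Map<String, String> rackOf) {
        if (splitHosts.contains(node)) {
            return Locality.DATA_LOCAL;
        }
        String rack = rackOf.get(node);
        for (String host : splitHosts) {
            if (rack != null && rack.equals(rackOf.get(host))) {
                return Locality.RACK_LOCAL;
            }
        }
        return Locality.OFF_RACK;
    }

    // Returns the free node with the best locality, or null if no node is free.
    static String pickNode(List<String> freeNodes, Set<String> splitHosts, Map<String, String> rackOf) {
        String best = null;
        Locality bestLocality = null;
        for (String node : freeNodes) {
            Locality locality = classify(node, splitHosts, rackOf);
            if (bestLocality == null || locality.ordinal() < bestLocality.ordinal()) {
                best = node;
                bestLocality = locality;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = new HashMap<>();
        rackOf.put("n1", "/rack1");
        rackOf.put("n2", "/rack1");
        rackOf.put("n3", "/rack2");
        Set<String> splitHosts = new HashSet<>(Arrays.asList("n1"));
        // n1 holds the split but is busy, so the choice is between n3 and n2.
        System.out.println(pickNode(Arrays.asList("n3", "n2"), splitHosts, rackOf));
    }
}
```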
Requests also specify memory requirements for tasks. By default both map and
reduce tasks are allocated 1024 MB of memory, but this is configurable by setting
mapre and mapreduce.reduce.memory.mb.
The way memory is allocated is different to MapReduce 1, where tasktrackers have
a fixed number of “slots”, set at cluster configuration time, and each task runs in a
single slot. Slots have a maximum memory allowance, which again is fixed for a
cluster, and which leads both to problems of underutilization when tasks use less
memory (since other waiting tasks are not able to take advantage of the unused
memory) and problems of job failure when a task can’t complete since it can’t get
enough memory to run correctly.
In YARN, resources are more fine-grained, so both these problems can be avoided.
In particular, applications may request a memory capability that is anywhere
between the minimum allocation and a maximum allocation, and which must be a
multiple of the minimum allocation. Default memory allocations are scheduler-specific, and for the capacity scheduler the default minimum is 1024 MB (set by
yarn.scheduler.capacity.minimum-allocation-mb), and the default maximum is 10240
MB (set by yarn.scheduler.capacity.maximum-allocation-mb). Thus, tasks can
request any memory allocation between 1 and 10 GB (inclusive), in multiples of 1
GB (the scheduler will round to the nearest multiple if needed), by setting and mapreduce.reduce.memory.mb appropriately.
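To make the arithmetic concrete, here is a minimal sketch of how a memory request could be fitted to the allowed range. The class is hypothetical. It rounds up, so that a task is never given less than it asked for; the text says only that the scheduler rounds to a multiple, so the direction of rounding is an assumption.

```java
// Hypothetical helper: fits a memory request to [minMb, maxMb] in multiples of minMb.
final class MemoryRequestNormalizer {
    static int normalize(int requestedMb, int minMb, int maxMb) {
        if (minMb <= 0 || maxMb < minMb) {
            throw new IllegalArgumentException("need 0 < minMb <= maxMb");
        }
        int atLeastMin = Math.max(requestedMb, minMb);
        // Integer ceiling division, then scale back up to a multiple of minMb.
        int roundedUp = ((atLeastMin + minMb - 1) / minMb) * minMb;
        return Math.min(roundedUp, maxMb);
    }

    public static void main(String[] args) {
        // Defaults quoted in the text for the capacity scheduler.
        int min = 1024;
        int max = 10240;
        System.out.println(normalize(1500, min, max));
        System.out.println(normalize(20000, min, max));
    }
}
```

With the defaults above, a 1500 MB request becomes 2048 MB, and anything over the maximum is capped at 10240 MB.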
Task Execution
Once a task has been assigned a container by the resource manager’s scheduler,
the application master starts the container by contacting the node manager (steps 9a
and 9b). The task is executed by a Java application whose main class is YarnChild.
Before it can run the task it localizes the resources that the task needs, including the
job configuration and JAR file, and any files from the distributed cache (step 10).
Finally, it runs the map or reduce task (step 11).
The YarnChild runs in a dedicated JVM, for the same reason that tasktrackers
spawn new JVMs for tasks in MapReduce 1: to isolate user code from long-running
system daemons. Unlike MapReduce 1, however, YARN does not support JVM
reuse so each task runs in a new JVM.
Streaming and Pipes programs work in the same way as MapReduce 1. The YarnChild
launches the Streaming or Pipes process and communicates with it using
standard input/output or a socket (respectively), as shown in Figure 6-2 (except the
child and subprocesses run on node managers, not tasktrackers).
Progress and Status Updates
When running under YARN, the task reports its progress and status (including counters) back to its application master every three seconds (over the umbilical interface),
which has an aggregate view of the job. The process is illustrated in Figure 6-5.
Contrast this to MapReduce 1, where progress updates flow from the child through
the tasktracker to the jobtracker for aggregation.
The client polls the application master every second (set via
gressmonitor.pollinterval) to receive progress updates, which are usually displayed
to the user.
Job Completion
As well as polling the application master for progress, every five seconds the client
checks whether the job has completed when using the waitForCompletion() method
on Job. The polling interval can be set via the mapreduce.client.completion.pollinterval
configuration property. Notification of job completion via an HTTP callback is
also supported, as in MapReduce 1. In MapReduce 2 the application master
initiates the callback. On job completion the application master and the task
containers clean up their working state, and the OutputCommitter’s job cleanup
method is called. Job information is archived by the job history server to enable later
interrogation by users if desired.
In the real world, user code is buggy, processes crash, and machines fail. One of the
major benefits of using Hadoop is its ability to handle such failures and allow your job
to complete.
Failures in Classic MapReduce
In the MapReduce 1 runtime there are three failure modes to consider: failure of the
running task, failure of the tasktracker, and failure of the jobtracker. Let’s look at each
in turn.
Task Failure
Consider first the case of the child task failing. The most common way that this
happens is when user code in the map or reduce task throws a runtime exception. If
this happens, the child JVM reports the error back to its parent tasktracker, before it
exits. The error ultimately makes it into the user logs. The tasktracker marks the task
attempt as failed, freeing up a slot to run another task.
For Streaming tasks, if the Streaming process exits with a nonzero exit code, it is
marked as failed. This behavior is governed by the
property (the default is true).
Another failure mode is the sudden exit of the child JVM—perhaps there is a JVM
bug that causes the JVM to exit for a particular set of circumstances exposed by the
MapReduce user code. In this case, the tasktracker notices that the process has
exited and marks the attempt as failed.
Hanging tasks are dealt with differently. The tasktracker notices that it hasn’t
received a progress update for a while and proceeds to mark the task as failed. The
child JVM process will be automatically killed after this period. The timeout period
after which tasks are considered failed is normally 10 minutes and can be configured
on a per-job basis (or a cluster basis) by setting the mapred.task.timeout property to
a value in milliseconds.
Setting the timeout to a value of zero disables the timeout, so long-running tasks are
never marked as failed. In this case, a hanging task will never free up its slot, and
over time there may be cluster slowdown as a result. This approach should therefore
be avoided, and making sure that a task is reporting progress periodically will suffice
(see “What Constitutes Progress in MapReduce?” ).
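The timeout rule, including the special meaning of zero, can be sketched as a single predicate. The class and method are hypothetical and only restate the rule described above; how the tasktracker actually tracks progress timestamps is not shown.

```java
// Hypothetical sketch of the hung-task test; a timeout of 0 disables it.
final class TaskTimeoutCheck {
    static final long DEFAULT_TIMEOUT_MILLIS = 10L * 60 * 1000; // 10 minutes

    static boolean isConsideredFailed(long lastProgressMillis, long nowMillis, long timeoutMillis) {
        if (timeoutMillis == 0) {
            return false; // timeout disabled: the task is never marked as failed
        }
        return nowMillis - lastProgressMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        long lastProgress = 0L;
        long elevenMinutesLater = 11L * 60 * 1000;
        System.out.println(isConsideredFailed(lastProgress, elevenMinutesLater, DEFAULT_TIMEOUT_MILLIS));
        System.out.println(isConsideredFailed(lastProgress, elevenMinutesLater, 0L));
    }
}
```

The second call shows why a zero timeout is risky: a task that has genuinely hung is never detected, so its slot is never released.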
When the jobtracker is notified of a task attempt that has failed (by the tasktracker’s
heartbeat call), it will reschedule execution of the task. The jobtracker will try to avoid
rescheduling the task on a tasktracker where it has previously failed. Furthermore, if
a task fails four times (or more), it will not be retried further. This value is
configurable: the maximum number of attempts to run a task is controlled by the property for map tasks and mapred.reduce.max.attempts
for reduce tasks. By default, if any task fails four times (or whatever the maximum
number of attempts is configured to), the whole job fails.
For some applications, it is undesirable to abort the job if a few tasks fail, as it may
be possible to use the results of the job despite some failures. In this case, the
maximum percentage of tasks that are allowed to fail without triggering job failure
can be set for the job. Map tasks and reduce tasks are controlled independently,
using the and mapred.max.reduce.failures.percent properties.
A task attempt may also be killed, which is different from it failing. A task attempt
may be killed because it is a speculative duplicate (for more, see “Speculative
Execution” on page 213), or because the tasktracker it was running on failed, and
the jobtracker marked all the task attempts running on it as killed. Killed task
attempts do not count against the number of attempts to run the task (as set by and mapred.reduce.max.attempts), since it wasn’t the
task’s fault that an attempt was killed.
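The bookkeeping described in the last few paragraphs can be summarized in a short sketch: only failed attempts count toward the attempt limit, and the job fails once the share of failed tasks exceeds the permitted percentage. The types and method names are hypothetical and are not the jobtracker's own code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of attempt accounting; not the jobtracker's implementation.
final class AttemptAccounting {
    enum Outcome { SUCCEEDED, FAILED, KILLED }

    // Killed attempts are deliberately ignored: only real failures are counted.
    static int failedAttempts(List<Outcome> attempts) {
        int count = 0;
        for (Outcome outcome : attempts) {
            if (outcome == Outcome.FAILED) {
                count++;
            }
        }
        return count;
    }

    // A task is given up on once it has failed maxAttempts times (4 by default).
    static boolean taskHasFailed(List<Outcome> attempts, int maxAttempts) {
        return failedAttempts(attempts) >= maxAttempts;
    }

    // With maxFailuresPercent = 0, a single failed task fails the whole job.
    static boolean jobHasFailed(int failedTasks, int totalTasks, int maxFailuresPercent) {
        return (long) failedTasks * 100 > (long) maxFailuresPercent * totalTasks;
    }

    public static void main(String[] args) {
        List<Outcome> attempts = Arrays.asList(
                Outcome.FAILED, Outcome.KILLED, Outcome.FAILED, Outcome.KILLED, Outcome.FAILED);
        System.out.println(taskHasFailed(attempts, 4)); // three failures, two kills
        System.out.println(jobHasFailed(6, 100, 5));
    }
}
```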
Users may also kill or fail task attempts using the web UI or the command line (type
hadoop job to see the options). Jobs may also be killed by the same mechanisms.
Tasktracker Failure
Failure of a tasktracker is another failure mode. If a tasktracker fails by crashing, or
running very slowly, it will stop sending heartbeats to the jobtracker (or send them
very infrequently). The jobtracker will notice a tasktracker that has stopped sending
heartbeats (if it hasn’t received one for 10 minutes, configured via the
mapred.tasktracker.expiry.interval property, in milliseconds) and remove it from its pool of
tasktrackers to schedule tasks on. The jobtracker arranges for map tasks that were
run and completed successfully on that tasktracker to be rerun if they belong to
incomplete jobs, since their intermediate output residing on the failed tasktracker’s
local filesystem may not be accessible to the reduce task. Any tasks in progress are
also rescheduled.
A tasktracker can also be blacklisted by the jobtracker, even if the tasktracker has
not failed. If more than four tasks from the same job fail on a particular tasktracker
(set by mapred.max.tracker.failures), then the jobtracker records this as a fault. A
tasktracker is blacklisted if the number of faults is over some minimum threshold
(four, set by mapred.max.tracker.blacklists) and is significantly higher than the
average number of faults for tasktrackers in the cluster.
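The two-part blacklisting condition can be written as a small predicate. This is a sketch under stated assumptions: the text does not define "significantly higher", so the margin over the cluster average is a caller-supplied parameter here (the demo uses 50%, which is purely illustrative), and the strict comparison against the minimum threshold follows the wording "over" rather than any confirmed implementation.

```java
// Hypothetical sketch of the blacklisting condition; the margin is an assumption.
final class BlacklistCheck {
    static double average(int[] faultsPerTracker) {
        if (faultsPerTracker.length == 0) {
            return 0.0;
        }
        long sum = 0;
        for (int faults : faultsPerTracker) {
            sum += faults;
        }
        return (double) sum / faultsPerTracker.length;
    }

    // marginFraction = 0.5 means "more than 50% above the cluster average".
    static boolean shouldBlacklist(int faults, double clusterAverage, int minFaults, double marginFraction) {
        boolean overMinimum = faults > minFaults;
        boolean wellAboveAverage = faults > clusterAverage * (1.0 + marginFraction);
        return overMinimum && wellAboveAverage;
    }

    public static void main(String[] args) {
        int[] faults = {0, 0, 1, 7};
        double avg = average(faults);
        System.out.println(shouldBlacklist(7, avg, 4, 0.5));
    }
}
```

Comparing against the cluster average keeps a cluster-wide problem (a bad job that fails everywhere) from blacklisting every tasktracker at once.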
Blacklisted tasktrackers are not assigned tasks, but they continue to communicate
with the jobtracker. Faults expire over time (at the rate of one per day), so
tasktrackers get the chance to run jobs again simply by leaving them running.
Alternatively, if there is an underlying fault that can be fixed (by replacing hardware,
for example), the tasktracker will be removed from the jobtracker’s blacklist after it
restarts and rejoins the cluster.
Jobtracker Failure
Failure of the jobtracker is the most serious failure mode. Hadoop has no
mechanism for dealing with failure of the jobtracker—it is a single point of failure—so
in this case the job fails. However, this failure mode has a low chance of occurring,
since the chance of a particular machine failing is low. The good news is that the
situation is improved in YARN, since one of its design goals is to eliminate single
points of failure in MapReduce.
After restarting a jobtracker, any jobs that were running at the time it was stopped
will need to be re-submitted. There is a configuration option that attempts to recover
any running jobs (mapred.jobtracker.restart.recover, turned off by default); however, it
is known not to work reliably, so it should not be used.
Failures in YARN
For MapReduce programs running on YARN, we need to consider the failure of any
of the following entities: the task, the application master, the node manager, and the
resource manager.
Task Failure
Failure of the running task is similar to the classic case. Runtime exceptions and
sudden exits of the JVM are propagated back to the application master and the task
attempt is marked as failed. Likewise, hanging tasks are noticed by the application
master by the absence of a ping over the umbilical channel (the timeout is set by
mapreduce.task.timeout), and again the task attempt is marked as failed.
The configuration properties for determining when a task is considered to be failed
are the same as the classic case: a task is marked as failed after four attempts (set
by for map tasks and mapreduce.reduce.maxattempts
for reduce tasks). A job is failed if more than percent of the map tasks in the job fail, or
more than mapreduce.reduce.failures.maxpercent percent of the reduce tasks fail.
Application Master Failure
Just like MapReduce tasks are given several attempts to succeed (in the face of
hardware or network failures) applications in YARN are tried multiple times in the
event of failure. By default, applications are marked as failed if they fail once, but
this can be increased by setting the property
An application master sends periodic heartbeats to the resource manager, and in the
event of application master failure, the resource manager will detect the failure and
start a new instance of the master running in a new container (managed by a node
manager). In the case of the MapReduce application master, it can recover the state
of the tasks that had already been run by the (failed) application so they don’t have
to be rerun. By default, recovery is not enabled, so failed application masters will not
setting ery.enable to true.
The client polls the application master for progress reports, so if its application
master fails the client needs to locate the new instance. During job initialization the
client asks the resource manager for the application master’s address, and then
caches it, so it doesn’t overload the resource manager with a request every time
it needs to poll the application master. If the application master fails, however, the
client will experience a timeout when it issues a status update, at which point the
client will go back to the resource manager to ask for the new application master’s
address.
Node Manager Failure
If a node manager fails, then it will stop sending heartbeats to the resource manager,
and the node manager will be removed from the resource manager’s pool of
available nodes. The property yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms, which defaults to 600000 (10 minutes), determines the minimum time
the resource manager waits before considering a node manager that has sent no
heartbeat in that time as failed.
Any task or application master running on the failed node manager will be recovered
using the mechanisms described in the previous two sections.
Node managers may be blacklisted if the number of failures for the application is
high. Blacklisting is done by the application master, and for MapReduce the
application master will try to reschedule tasks on different nodes if more than three
tasks fail on a node manager. The threshold may be set with
mapreduce.job.maxtaskfailures.per.tracker.
Resource Manager Failure
Failure of the resource manager is serious, since without it neither jobs nor task containers can be launched. The resource manager was designed from the outset to be
able to recover from crashes, by using a checkpointing mechanism to save its state
to persistent storage, although at the time of writing the latest release did not have a
complete implementation.
After a crash, a new resource manager instance is brought up (by an administrator)
and it recovers from the saved state. The state consists of the node managers in the
system as well as the running applications. (Note that tasks are not part of the
resource manager’s state, since they are managed by the application. Thus the
amount of state to be stored is much more manageable than that of the jobtracker.)
The storage used by the resource manager is configurable via the yarn.resourceman property. The default is
org.apache.hadoop.yarn.server.resourcemanager.recovery.MemStore, which keeps the store in memory, and is therefore not
highly available. However, there is a ZooKeeper-based store in the works that will
support reliable recovery from resource manager failures in the future.
Job Scheduling
Early versions of Hadoop had a very simple approach to scheduling users’ jobs: they
ran in order of submission, using a FIFO scheduler. Typically, each job would use
the whole cluster, so jobs had to wait their turn. Although a shared cluster offers
great potential for offering large resources to many users, the problem of sharing
resources fairly between users requires a better scheduler. Production jobs need to
complete in a timely manner, while allowing users who are making smaller ad hoc
queries to get results back in a reasonable time.
Later on, the ability to set a job’s priority was added, via the mapred.job.priority
property or the setJobPriority() method on JobClient (both of which take one of the
values VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW). When the job scheduler
is choosing the next job to run, it selects one with the highest priority. However, with
the FIFO scheduler, priorities do not support preemption, so a high-priority job can
still be blocked by a long-running low priority job that started before the high-priority
job was scheduled.
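The selection rule (highest priority first, then order of submission) can be modelled with a standard priority queue. The classes below are hypothetical stand-ins, not Hadoop's scheduler; the enum simply lists the five priority values from best to worst so that its natural ordering matches the rule.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical model of FIFO scheduling with priorities; not Hadoop's scheduler.
final class FifoPriorityDemo {
    // Declared highest-first, so natural enum order puts VERY_HIGH at the head.
    enum Priority { VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW }

    static final class Job {
        final String name;
        final Priority priority;
        final long submitOrder;

        Job(String name, Priority priority, long submitOrder) {
            this.name = name;
            this.priority = priority;
            this.submitOrder = submitOrder;
        }
    }

    static PriorityQueue<Job> newQueue() {
        Comparator<Job> byPriority = Comparator.comparing((Job j) -> j.priority);
        // Ties on priority are broken by submission order, which gives FIFO behavior.
        return new PriorityQueue<>(byPriority.thenComparingLong(j -> j.submitOrder));
    }

    public static void main(String[] args) {
        PriorityQueue<Job> queue = newQueue();
        queue.add(new Job("nightly-report", Priority.LOW, 1));
        queue.add(new Job("prod-etl", Priority.HIGH, 2));
        queue.add(new Job("adhoc-query", Priority.NORMAL, 3));
        while (!queue.isEmpty()) {
            System.out.println(queue.poll().name);
        }
    }
}
```

The queue decides only which waiting job starts next. It cannot interrupt a job that is already running, which is the lack of preemption described above.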
MapReduce in Hadoop comes with a choice of schedulers. The default in
MapReduce 1 is the original FIFO queue-based scheduler, and there are also
multiuser schedulers called the Fair Scheduler and the Capacity Scheduler.
MapReduce 2 comes with the Capacity Scheduler (the default), and the FIFO
scheduler.
The Fair Scheduler
The Fair Scheduler aims to give every user a fair share of the cluster capacity over
time. If a single job is running, it gets all of the cluster. As more jobs are submitted,
free task slots are given to the jobs in such a way as to give each user a fair share of
the cluster. A short job belonging to one user will complete in a reasonable time even
while another user’s long job is running, and the long job will still make progress.
Jobs are placed in pools, and by default, each user gets their own pool. A user who
submits more jobs than a second user will not get any more cluster resources than
the second, on average. It is also possible to define custom pools with guaranteed
minimum capacities defined in terms of the number of map and reduce slots, and to
set weightings for each pool.
The Fair Scheduler supports preemption, so if a pool has not received its fair share
for a certain period of time, then the scheduler will kill tasks in pools running over
capacity in order to give the slots to the pool running under capacity.
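A back-of-the-envelope version of the fair share calculation is sketched below. It is a deliberate simplification with hypothetical names: every user has one pool, pools have equal weight, there are no guaranteed minimums, and jobs within a pool split that pool's slots evenly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified, hypothetical fair-share arithmetic: equal weights, no minimum shares.
final class FairShareSketch {
    // Returns, for each user with running jobs, the number of slots each of their jobs gets.
    static Map<String, Double> slotsPerJobByUser(Map<String, Integer> runningJobsByUser, int totalSlots) {
        Map<String, Double> result = new LinkedHashMap<>();
        int activePools = 0;
        for (int jobs : runningJobsByUser.values()) {
            if (jobs > 0) {
                activePools++;
            }
        }
        if (activePools == 0) {
            return result;
        }
        double poolShare = (double) totalSlots / activePools;
        for (Map.Entry<String, Integer> entry : runningJobsByUser.entrySet()) {
            if (entry.getValue() > 0) {
                result.put(entry.getKey(), poolShare / entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> jobs = new LinkedHashMap<>();
        jobs.put("alice", 3);
        jobs.put("bob", 1);
        System.out.println(slotsPerJobByUser(jobs, 120));
    }
}
```

With 120 slots, each user's pool gets 60, so submitting three jobs rather than one does not earn the first user any more of the cluster.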
The Fair Scheduler is a “contrib” module. To enable it, place its JAR file on Hadoop’s
classpath, by copying it from Hadoop’s contrib/fairscheduler directory to the lib directory. Then set the mapred.jobtracker.taskScheduler property to:
The Fair Scheduler will work without further configuration, but to take full advantage
of its features and learn how to configure it (including its web interface), refer to README
in the src/contrib/fairscheduler directory of the distribution.
The Capacity Scheduler
The Capacity Scheduler takes a slightly different approach to multiuser scheduling. A
cluster is made up of a number of queues (like the Fair Scheduler’s pools), which
may be hierarchical (so a queue may be the child of another queue), and each
queue has an allocated capacity. This is like the Fair Scheduler, except that within
each queue, jobs are scheduled using FIFO scheduling (with priorities). In effect, the
Capacity Scheduler allows users or organizations (defined using queues) to simulate
a separate MapReduce cluster with FIFO scheduling for each user or organization.
The Fair Scheduler, by contrast, (which actually also supports FIFO job scheduling
within pools as an option, making it like the Capacity Scheduler) enforces fair sharing
within each pool, so running jobs share the pool’s resources.
Shuffle and Sort
MapReduce makes the guarantee that the input to every reducer is sorted by key.
The process by which the system performs the sort—and transfers the map outputs
to the reducers as inputs—is known as the shuffle.6 In this section, we look at how
the shuffle works, as a basic understanding would be helpful, should you need to
optimize a MapReduce program. The shuffle is an area of the codebase where
refinements and improvements are continually being made, so the following
description necessarily conceals many details (and may change over time; this is for
version 0.20). In many ways, the shuffle is the heart of MapReduce and is where the
“magic” happens.
The Map Side
When the map function starts producing output, it is not simply written to disk. The
process is more involved, and takes advantage of buffering writes in memory and
doing some presorting for efficiency reasons. Figure 6-6 shows what happens. Each
map task has a circular memory buffer that it writes the output to. The buffer is 100
MB by default, a size which can be tuned by changing the io.sort.mb property. When
the contents of the buffer reach a certain threshold size (io.sort.spill.percent,
default 0.80, or 80%), a background thread will start to spill the contents to disk. Map
outputs will continue to be written to the buffer while the spill takes place, but if the
buffer fills up during this time, the map will block until the spill is complete.
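The numbers involved are easy to work out. The sketch below computes the spill threshold from the two settings and gives a rough estimate of how many spill files a map task would write. It is an approximation with hypothetical names: it ignores the bookkeeping data that shares the buffer with the records, so real spill counts can be higher.

```java
// Rough, hypothetical estimate of spill behavior from the buffer settings.
final class SpillEstimate {
    static long spillThresholdBytes(int ioSortMb, double spillPercent) {
        return Math.round(ioSortMb * 1024.0 * 1024.0 * spillPercent);
    }

    // Approximate number of spill files for a given volume of map output.
    static long estimatedSpills(long mapOutputBytes, int ioSortMb, double spillPercent) {
        long threshold = spillThresholdBytes(ioSortMb, spillPercent);
        return (mapOutputBytes + threshold - 1) / threshold; // integer ceiling division
    }

    public static void main(String[] args) {
        // Defaults quoted in the text: a 100 MB buffer and a 0.80 threshold.
        System.out.println(spillThresholdBytes(100, 0.80));
        System.out.println(estimatedSpills(200L * 1024 * 1024, 100, 0.80));
    }
}
```

With the defaults, spilling starts once about 80 MB is buffered, so a map task producing 200 MB of output would be expected to spill roughly three times.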
Spills are written in round-robin fashion to the directories specified by the
mapred.local.dir property, in a job-specific subdirectory.
Figure 6-6. Shuffle and sort in MapReduce
Before it writes to disk, the thread first divides the data into partitions corresponding
to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner
function, it is run on the output of the sort. Running the combiner function makes for
a more compact map output, so there is less data to write to local disk and to
transfer to the reducer.
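A toy version of one spill illustrates the three steps: assign each record to a partition, keep each partition sorted by key, and combine values that share a key. The sketch is hypothetical and heavily simplified. It uses a hash-based partition function of the kind a default hash partitioner applies, and it folds sorting and combining into a single sorted map with a summing merge, whereas the real process sorts first and then runs the combiner over the sorted output.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of a single spill: partition, sort by key, and combine by summing.
final class SpillSketch {
    // Masking with Integer.MAX_VALUE keeps the result non-negative.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Returns one sorted, combined map per partition (index = reducer number).
    static List<TreeMap<String, Integer>> spill(List<Map.Entry<String, Integer>> buffer, int numReducers) {
        List<TreeMap<String, Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeMap<String, Integer>());
        }
        for (Map.Entry<String, Integer> record : buffer) {
            TreeMap<String, Integer> partition = partitions.get(partitionFor(record.getKey(), numReducers));
            // TreeMap keeps keys sorted; merge() stands in for a summing combiner.
            partition.merge(record.getKey(), record.getValue(), Integer::sum);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> buffer = new ArrayList<>();
        buffer.add(new AbstractMap.SimpleEntry<>("b", 1));
        buffer.add(new AbstractMap.SimpleEntry<>("a", 1));
        buffer.add(new AbstractMap.SimpleEntry<>("b", 1));
        System.out.println(spill(buffer, 2));
    }
}
```

Three input records leave the spill as two, which is the saving in disk writes and network transfer that the combiner is there to provide.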
Each time the memory buffer reaches the spill threshold, a new spill file is created,
so after the map task has written its last output record there could be several spill
files. Before the task is finished, the spill files are merged into a single partitioned
and sorted output file. The configuration property io.sort.factor controls the maximum
number of streams to merge at once; the default is 10.
If there are at least three spill files (set by the min.num.spills.for.combine property)
then the combiner is run again before the output file is written. Recall that combiners
may be run repeatedly over the input without affecting the final result. If there are
only one or two spills, then the potential reduction in map output size is not worth the
overhead in invoking the combiner, so it is not run again for this map output.
It is often a good idea to compress the map output as it is written to disk, since doing
so makes it faster to write to disk, saves disk space, and reduces the amount of data
to transfer to the reducer. By default, the output is not compressed, but it is easy to
enable by setting to true. The compression library to
use is specified by; see “Compression” for
more on compression formats.
The output file’s partitions are made available to the reducers over HTTP. The maximum number of worker threads used to serve the file partitions is controlled by the
tasktracker.http.threads property—this setting is per tasktracker, not per map task
slot. The default of 40 may need increasing for large clusters running large jobs. In
MapReduce 2, this property is not applicable since the maximum number of threads
used is set automatically based on the number of processors on the machine. (MapReduce 2 uses Netty, which by default allows up to twice as many threads as there
are processors.)
The Reduce Side
Let’s turn now to the reduce part of the process. The map output file is sitting on the
local disk of the machine that ran the map task (note that although map outputs
always get written to local disk, reduce outputs may not be), but now it is needed by
the machine that is about to run the reduce task for the partition. Furthermore, the
reduce task needs the map output for its particular partition from several map tasks
across the cluster. The map tasks may finish at different times, so the reduce task
starts copying their outputs as soon as each completes. This is known as the copy
phase of the reduce task. The reduce task has a small number of copier threads so
that it can fetch map outputs in parallel. The default is five threads, but this number
can be changed by setting the mapred.reduce.parallel.copies property.
How do reducers know which machines to fetch map output from?
As map tasks complete successfully, they notify their parent tasktracker of the status
update, which in turn notifies the jobtracker. (In MapReduce 2, the tasks notify their
application master directly.) These notifications are transmitted over the heartbeat
communication mechanism described earlier. Therefore, for a given job, the
jobtracker (or application master) knows the mapping between map outputs and
hosts. A thread in the reducer periodically asks the master for map output hosts until
it has retrieved them all.
Hosts do not delete map outputs from disk as soon as the first reducer has retrieved
them, as the reducer may subsequently fail. Instead, they wait until they are told to
delete them by the jobtracker (or application master), which is after the job has
completed.
The map outputs are copied to the reduce task JVM’s memory if they are small
enough (the buffer’s size is controlled by mapred.job.shuffle.input.buffer.percent,
which specifies the proportion of the heap to use for this purpose); otherwise, they
are copied to disk. When the in-memory buffer reaches a threshold size (controlled
by mapred.job.shuffle.merge.percent), or reaches a threshold number of map
outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a
combiner is specified, it will be run during the merge to reduce the amount of data
written to disk.
As the copies accumulate on disk, a background thread merges them into larger,
sorted files. This saves some time merging later on. Note that any map outputs that
were compressed (by the map task) have to be decompressed in memory in order to
perform a merge on them.
When all the map outputs have been copied, the reduce task moves into the sort
phase (which should properly be called the merge phase, as the sorting was carried
out on the map side), which merges the map outputs, maintaining their sort ordering.
This is done in rounds. For example, if there were 50 map outputs, and the merge
factor was 10 (the default, controlled by the io.sort.factor property, just like in the
map’s merge), then there would be 5 rounds. Each round would merge 10 files into
one, so at the end there would be five intermediate files.
Rather than have a final round that merges these five files into a single sorted file,
the merge saves a trip to disk by directly feeding the reduce function in what is the
last phase: the reduce phase. This final merge can come from a mixture of in-memory and on-disk segments.
The number of files merged in each round is actually more subtle than this example
suggests. The goal is to merge the minimum number of files to get to the merge
factor for the final round. So if there were 40 files, the merge would not merge 10 files
in each of the four rounds to get 4 files. Instead, the first round would merge only 4
files, and the subsequent three rounds would merge the full 10 files. The 4 merged
files, and the 6 (as yet unmerged) files make a total of 10 files for the final round. The
process is illustrated in Figure 6-7.
Note that this does not change the number of rounds; it’s just an optimization to
minimize the amount of data that is written to disk, since the final round always
merges directly into the reduce.
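The round-sizing rule described above can be sketched in a few lines of plain Java. This is only an illustration of the rule as stated in the text, not Hadoop's merge code: the class and method names are hypothetical, and the formula for the first round is an assumption chosen so that the final round receives exactly the merge factor's worth of files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the merge-round rule; not part of the Hadoop API.
final class MergePlanSketch {

    /**
     * Returns how many files each intermediate round merges. The final round,
     * which feeds the reduce function directly, is not included.
     */
    static List<Integer> intermediateRounds(int files, int factor) {
        if (files < 1 || factor < 2) {
            throw new IllegalArgumentException("need files >= 1 and factor >= 2");
        }
        List<Integer> rounds = new ArrayList<>();
        int remaining = files;
        boolean first = true;
        while (remaining > factor) {
            int width = factor;
            if (first) {
                // Assumed rule: merge just enough files in the first round so
                // that later full-width rounds leave exactly `factor` files.
                int mod = (remaining - 1) % (factor - 1);
                if (mod != 0) {
                    width = mod + 1;
                }
                first = false;
            }
            rounds.add(width);
            remaining = remaining - width + 1; // `width` files become one file
        }
        return rounds;
    }

    public static void main(String[] args) {
        // 40 files with a merge factor of 10, as in the example in the text.
        System.out.println(intermediateRounds(40, 10)); // prints [4, 10, 10, 10]
    }
}
```

With 40 files and a factor of 10 the sketch merges 4 files first and then 10 files in each of three rounds, which leaves 4 merged and 6 unmerged files for the final round.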
During the reduce phase, the reduce function is invoked for each key in the sorted
output. The output of this phase is written directly to the output filesystem, typically
HDFS. In the case of HDFS, since the tasktracker node (or node manager) is also
running a datanode, the first block replica will be written to the local disk.
Task JVM Reuse
Hadoop runs tasks in their own Java Virtual Machine to isolate them from other running tasks. The overhead of starting a new JVM for each task can take around a
second, which for jobs that run for a minute or so is insignificant. However, jobs that
have a large number of very short-lived tasks (these are usually map tasks), or that
have lengthy initialization, can see performance gains when the JVM is reused for
subsequent tasks.
Note that, with task JVM reuse enabled, tasks are not run concurrently in a single
JVM; rather, the JVM runs tasks sequentially. Tasktrackers can, however, run more
than one task at a time, but this is always done in separate JVMs. The properties for
controlling the tasktrackers’ number of map task slots and reduce task slots are
discussed in “Memory”.
The property for controlling task JVM reuse is mapred.job.reuse.jvm.num.tasks: it
specifies the maximum number of tasks to run for a given job for each JVM
launched; the default is 1 (see Table 6-5). No distinction is made between map and
reduce tasks; however, tasks from different jobs are always run in separate JVMs.
The method setNumTasksToExecutePerJvm() on JobConf can also be used to
configure this property.
Table 6-5. Task JVM Reuse properties
Property name | Type | Default value | Description
mapred.job.reuse.jvm.num.tasks | int | 1 | The maximum number of tasks to run for a given job for each JVM on a tasktracker. A value of -1 indicates no limit: the same JVM may be used for all tasks for a job.
Tasks that are CPU-bound may also benefit from task JVM reuse by taking
advantage of runtime optimizations applied by the HotSpot JVM. After running for a
while, the HotSpot JVM builds up enough information to detect performance-critical
sections in the code and dynamically translates the Java byte codes of these hot
spots into native machine code. This works well for long-running processes, but
JVMs that run for seconds or a few minutes may not gain the full benefit of HotSpot.
In these cases, it is worth enabling task JVM reuse.
Another place where a shared JVM is useful is for sharing state between the tasks of
a job. By storing reference data in a static field, tasks get rapid access to the shared data.
Skipping Bad Records
Large datasets are messy. They often have corrupt records. They often have records
that are in a different format. They often have missing fields. In an ideal world, your
code would cope gracefully with all of these conditions. In practice, it is often
expedient to ignore the offending records. Depending on the analysis being
performed, if only a small percentage of records are affected, then skipping them
may not significantly affect the result. However, if a task trips up when it encounters
a bad record—by throwing a runtime exception—then the task fails. Failing tasks are
retried (since the failure may be due to hardware failure or some other reason
outside the task’s control), but if a task fails four times, then the whole job is marked
as failed. If it is the data that is causing the task to throw an exception, rerunning the
task won’t help, since it will fail in exactly the same way each time.
The best way to handle corrupt records is in your mapper or reducer code. You can
detect the bad record and ignore it, or you can abort the job by throwing an
exception. You can also count the total number of bad records in the job using
counters to see how widespread the problem is.
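As a rough illustration of the detect, ignore, and count approach, the following plain-Java sketch parses integer records and tallies the ones it cannot parse. It deliberately avoids the Hadoop API: the class, the badRecords field (standing in for a job counter), and the sample records are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for mapper logic; plain Java, not the Hadoop API.
final class BadRecordSketch {
    long badRecords = 0;          // plays the role of a job counter
    int max = Integer.MIN_VALUE;  // running result over the good records

    /** Processes one record, skipping and counting it if it is malformed. */
    void process(String record) {
        try {
            max = Math.max(max, Integer.parseInt(record.trim()));
        } catch (NumberFormatException e) {
            badRecords++; // ignore the offending record, but keep a tally
        }
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("12", "x7", "", " 31");
        BadRecordSketch sketch = new BadRecordSketch();
        for (String record : records) {
            sketch.process(record);
        }
        System.out.println("max=" + sketch.max + ", bad=" + sketch.badRecords);
    }
}
```

Comparing the tally with the total number of records shows whether skipping is harmless or whether the input needs cleaning first.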
In rare cases, though, you can’t handle the problem because there is a bug in a
third-party library that you can’t work around in your mapper or reducer. In these
cases, you can use Hadoop’s optional skipping mode for automatically skipping bad records.
When skipping mode is enabled, tasks report the records being processed back to
the tasktracker. When the task fails, the tasktracker retries the task, skipping the
records that caused the failure. Because of the extra network traffic and bookkeeping
to maintain the failed record ranges, skipping mode is turned on for a task only after
it has failed twice. Thus, for a task consistently failing on a bad record, the tasktracker
runs the following task attempts with these outcomes:
1. Task fails.
2. Task fails.
3. Skipping mode is enabled. Task fails, but the failed record is stored by the tasktracker.
4. Skipping mode is still enabled. Task succeeds by skipping the bad record that
failed in the previous attempt.
MapReduce Types and Formats
MapReduce has a simple model of data processing: inputs and outputs for the map
and reduce functions are key-value pairs. This chapter looks at the MapReduce
model in detail and, in particular, how data in various formats, from simple text to
structured binary objects, can be used with this model.
MapReduce Types
The map and reduce functions in Hadoop MapReduce have the following general
form:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
In general, the map input key and value types (K1 and V1) are different from the map
output types (K2 and V2). However, the reduce input must have the same types as
the map output, although the reduce output types may be different again (K3 and
V3). The Java API mirrors this general form:
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  public class Context extends MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // ...
  }
  protected void map(KEYIN key, VALUEIN value,
      Context context) throws IOException, InterruptedException {
    // ...
  }
}
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  public class Context extends ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // ...
  }
  protected void reduce(KEYIN key, Iterable<VALUEIN> values,
      Context context) throws IOException, InterruptedException {
    // ...
  }
}
The context objects are used for emitting key-value pairs, so they are parameterized
by the output types, so that the signature of the write() method is:
public void write(KEYOUT key, VALUEOUT value)
throws IOException, InterruptedException
Since Mapper and Reducer are separate classes, the type parameters have different
scopes, and the actual type argument of KEYIN (say) in the Mapper may be different
from the type argument of the parameter of the same name (KEYIN) in the Reducer. For
instance, in the maximum temperature example from earlier chapters, KEYIN is
replaced by LongWritable for the Mapper, and by Text for the Reducer.
Similarly, even though the map output types and the reduce input types must match,
this is not enforced by the Java compiler.
The type parameters are named differently to the abstract types (KEYIN versus K1,
and so on), but the form is the same.
If a combine function is used, then it is the same form as the reduce function (and is
an implementation of Reducer), except its output types are the intermediate key and
value types (K2 and V2), so they can feed the reduce function:
map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
Often the combine and reduce functions are the same, in which case, K3 is the same
as K2, and V3 is the same as V2.
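To make the type flow concrete, here is a small in-memory sketch in plain Java in which a single function serves as both combiner and reducer, so K3 = K2 and V3 = V2. It does not use the Hadoop API. The class name, the "year,temperature" line format, and the sample values are assumptions made for illustration, with K1 = Long, V1 = String, K2 = String, and V2 = Integer.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the type flow; not the Hadoop API.
final class TypeFlowSketch {

    // map: (K1, V1) -> list(K2, V2). Lines are assumed to be "year,temperature".
    static List<Map.Entry<String, Integer>> map(Long offset, String line) {
        String[] fields = line.split(",");
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        out.add(new SimpleEntry<>(fields[0], Integer.parseInt(fields[1])));
        return out;
    }

    // combine and reduce share one form here: (K2, list(V2)) -> list(K2, V2).
    static List<Map.Entry<String, Integer>> maxOf(String year, List<Integer> temps) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        out.add(new SimpleEntry<>(year, Collections.max(temps)));
        return out;
    }

    // Groups pairs by key (a stand-in for sort and shuffle), then applies maxOf.
    static List<Map.Entry<String, Integer>> applyMax(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : pairs) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            out.addAll(maxOf(group.getKey(), group.getValue()));
        }
        return out;
    }

    // Each inner list plays the role of one input split handled by one map task.
    static Map<String, Integer> run(List<List<String>> splits) {
        List<Map.Entry<String, Integer>> combined = new ArrayList<>();
        for (List<String> split : splits) {
            List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
            long offset = 0;
            for (String line : split) {
                mapped.addAll(map(offset, line));
                offset += line.length() + 1;
            }
            combined.addAll(applyMax(mapped)); // combine step, per map task
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : applyMax(combined)) { // reduce step
            result.put(kv.getKey(), kv.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> splits = new ArrayList<>();
        splits.add(Arrays.asList("1950,0", "1950,22"));
        splits.add(Arrays.asList("1950,-11", "1949,111", "1949,78"));
        System.out.println(run(splits)); // prints {1949=111, 1950=22}
    }
}
```

Because taking a maximum is associative and commutative, applying it first per map task and then again across tasks gives the same answer as applying it once, which is why the same function can fill both roles.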
The partition function operates on the intermediate key and value types (K2 and V2),
and returns the partition index. In practice, the partition is determined solely by the
key (the value is ignored):
partition: (K2, V2) → integer
Or in Java:
public abstract class Partitioner<KEY, VALUE> {
  public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
Table 7-1. Configuration of MapReduce types in the new API
Job setter method | Types configured
Properties for configuring types:
setInputFormatClass() | input types K1 and V1
setMapOutputKeyClass() | intermediate key type K2
setMapOutputValueClass() | intermediate value type V2
setOutputKeyClass() (property mapreduce.job.output.key.class) | output key type K3
Properties that must be consistent with the types:
The Default MapReduce Job
What happens when you run MapReduce without setting a mapper or a reducer?
Let’s try it by running this minimal MapReduce program:
public class MinimalMapReduce extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }
    Job job = new Job(getConf());
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MinimalMapReduce(), args);
    System.exit(exitCode);
  }
}
The only configuration that we set is an input path and an output path. We run it over
a subset of our weather data with the following:
% hadoop MinimalMapReduce "input/ncdc/all/190{1,2}.gz" output
We do get some output: one file named part-r-00000 in the output directory. Here’s
what the first few lines look like (truncated to fit the page):
Each line is an integer followed by a tab character, followed by the original weather
data record. Admittedly, it’s not a very useful program, but understanding how it produces its output does provide some insight into the defaults that Hadoop uses when
running MapReduce jobs. Example 7-1 shows a program that has exactly the same
effect as MinimalMapReduce, but explicitly sets the job settings to their defaults.
Example 7-1. A minimal MapReduce driver, with the defaults explicitly set
public class MinimalMapReduceWithDefaults extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }
    return job.waitForCompletion(true) ? 0 : 1;
  }
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MinimalMapReduceWithDefaults(), args);
    System.exit(exitCode);
  }
}
We’ve simplified the first few lines of the run() method, by extracting the logic for
printing usage and setting the input and output paths into a helper method. Almost all
MapReduce drivers take these two arguments (input and output), so reducing the
boilerplate code here is a good thing. Here are the relevant methods in the
JobBuilder class for reference:
public static Job parseInputAndOutput(Tool tool, Configuration conf,
    String[] args) throws IOException {
  if (args.length != 2) {
    printUsage(tool, "<input> <output>");
    return null;
  }
  Job job = new Job(conf);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  return job;
}
public static void printUsage(Tool tool, String extraArgsUsage) {
  System.err.printf("Usage: %s [genericOptions] %s\n\n",
      tool.getClass().getSimpleName(), extraArgsUsage);
}
Going back to MinimalMapReduceWithDefaults in Example 7-1, although there are
many other default job settings, the ones highlighted are those most central to
running a job. Let’s go through them in turn.
The default input format is TextInputFormat, which produces keys of type
LongWritable (the offset of the beginning of the line in the file) and values of type Text (the line
of text). This explains where the integers in the final output come from: they are the
line offsets.
The default mapper is just the Mapper class, which writes the input key and value
unchanged to the output:
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  protected void map(KEYIN key, VALUEIN value,
      Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
Mapper is a generic type, which allows it to work with any key or value types. In this
case, the map input and output key is of type LongWritable and the map input and
output value is of type Text.
The default partitioner is HashPartitioner, which hashes a record’s key to determine
which partition the record belongs in. Each partition is processed by a reduce task,
so the number of partitions is equal to the number of reduce tasks for the job:
public class HashPartitioner<K, V> extends Partitioner<K, V> {
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
The key’s hash code is turned into a nonnegative integer by bitwise ANDing it with
the largest integer value. It is then reduced modulo the number of partitions to find
the index of the partition that the record belongs in.
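The reason for masking with Integer.MAX_VALUE, rather than taking an absolute value, shows up at the extreme end of the int range. The following standalone sketch repeats the same arithmetic outside Hadoop; the class name and the choice of three partitions are hypothetical.

```java
// Hypothetical standalone copy of the partitioning arithmetic; not the Hadoop class.
final class PartitionSketch {

    /** Maps a key to a partition index in the range [0, numPartitions). */
    static int partition(Object key, int numPartitions) {
        // Clearing the sign bit yields a nonnegative value for every hash code.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // An Integer's hash code is its own value, so this key hashes to
        // Integer.MIN_VALUE, the one int whose absolute value is still negative.
        Integer awkwardKey = Integer.MIN_VALUE;

        System.out.println(partition(awkwardKey, 3));               // a valid index
        System.out.println(Math.abs(awkwardKey.hashCode()) % 3);    // a negative number
    }
}
```

Math.abs(Integer.MIN_VALUE) overflows and stays negative, so an absolute-value version could return an invalid partition index for such a key, whereas the masked version always stays in range.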
By default, there is a single reducer, and therefore a single partition, so the action of
the partitioner is irrelevant in this case since everything goes into one partition. However, it is important to understand the behavior of HashPartitioner when you have
more than one reduce task. Assuming the key’s hash function is a good one, the
records will be evenly allocated across reduce tasks, with all records sharing the
same key being processed by the same reduce task.
You may have noticed that we didn’t set the number of map tasks. The reason for
this is that the number is equal to the number of splits that the input is turned into,
which is driven by the size of the input and the file’s block size (if the file is in HDFS).
Choosing the Number of Reducers
The single reducer default is something of a gotcha for new users to Hadoop. Almost
all real-world jobs should set this to a larger number; otherwise, the job will be very
slow since all the intermediate data flows through a single reduce task. (Note that
when running under the local job runner, only zero or one reducers are supported.)
The optimal number of reducers is related to the total number of available reducer
slots in your cluster. The total number of slots is found by multiplying the number of
nodes in the cluster by the number of slots per node (which is determined by the
value of the mapred.tasktracker.reduce.tasks.maximum property, described in
“Environment Settings”).
One common setting is to have slightly fewer reducers than total slots, which gives
one wave of reduce tasks (and tolerates a few failures, without extending job
execution time). If your reduce tasks are very big, then it makes sense to have a
larger number of reducers (resulting in two waves, for example) so that the tasks are
more fine-grained, and failure doesn’t affect job execution time significantly.
The default reducer is Reducer, again a generic type, which simply writes all its input
to its output:
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  protected void reduce(KEYIN key, Iterable<VALUEIN> values,
      Context context) throws IOException, InterruptedException {
    for (VALUEIN value: values) {
      context.write((KEYOUT) key, (VALUEOUT) value);
    }
  }
}
For this job, the output key is LongWritable, and the output value is Text. In fact, all
the keys for this MapReduce program are LongWritable, and all the values are Text,
since these are the input keys and values, and the map and reduce functions are
both identity functions which by definition preserve type. Most MapReduce
programs, however, don’t use the same key or value types throughout, so you need
to configure the job to declare the types you are using, as described in the previous
section.
Records are sorted by the MapReduce system before being presented to the
reducer. In this case, the keys are sorted numerically, which has the effect of
interleaving the lines from the input files into one combined output file.
Input Formats
Hadoop can process many different types of data formats, from flat text files to databases. In this section, we explore the different formats available.
Input Splits and Records
As we saw in Chapter 2, an input split is a chunk of the input that is processed by a
single map. Each map processes a single split. Each split is divided into records, and
the map processes each record—a key-value pair—in turn. Splits and records are
logical: there is nothing that requires them to be tied to files.
FileInputFormat is the base class for all implementations of InputFormat that use
files as their data source.
Figure 7-2. InputFormat class hierarchy
Applications may impose a minimum split size: by setting this to a value larger
than the block size, they can force splits to be larger than a block. There is no good
reason for doing this when using HDFS, since doing so will increase the number of
blocks that are not local to a map task.
The maximum split size defaults to the maximum value that can be represented by a
Java long type. It has an effect only when it is less than the block size, forcing splits
to be smaller than a block.
The split size is calculated by the formula (see the computeSplitSize() method in
FileInputFormat):
max(minimumSize, min(maximumSize, blockSize))
By default:
minimumSize < blockSize < maximumSize
so the split size is blockSize. Various settings for these parameters and how they
affect the final split size are illustrated in Table 7-6.
Table 7-6. Examples of how to control the split size
Minimum split size | Maximum split size | Block size | Split size | Comment
1 (default) | Long.MAX_VALUE (default) | 64 MB (default) | 64 MB | By default, split size is the same as the default block size.
1 (default) | Long.MAX_VALUE (default) | 128 MB | 128 MB | The most natural way to increase the split size is to have larger blocks in HDFS, by setting the block size, or on a per-file basis at file construction time.
128 MB | Long.MAX_VALUE (default) | 64 MB (default) | 128 MB | Making the minimum split size greater than the block size increases the split size, but at the cost of locality.
1 (default) | 32 MB | 64 MB (default) | 32 MB | Making the maximum split size less than the block size decreases the split size.
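The split size formula is easy to check by hand. The sketch below evaluates it for the four combinations of settings discussed in this section; it is a standalone illustration, and the class name, the parameter order, and the MB constant are assumptions rather than Hadoop's own definitions.

```java
// Hypothetical standalone version of the split size formula; not the Hadoop class.
final class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    /** Evaluates max(minimumSize, min(maximumSize, blockSize)). */
    static long computeSplitSize(long blockSize, long minimumSize, long maximumSize) {
        return Math.max(minimumSize, Math.min(maximumSize, blockSize));
    }

    public static void main(String[] args) {
        // Defaults: the split size equals the block size.
        System.out.println(computeSplitSize(64 * MB, 1, Long.MAX_VALUE) / MB);        // 64
        // Larger blocks give larger splits.
        System.out.println(computeSplitSize(128 * MB, 1, Long.MAX_VALUE) / MB);       // 128
        // A minimum above the block size forces splits larger than a block.
        System.out.println(computeSplitSize(64 * MB, 128 * MB, Long.MAX_VALUE) / MB); // 128
        // A maximum below the block size forces splits smaller than a block.
        System.out.println(computeSplitSize(64 * MB, 1, 32 * MB) / MB);               // 32
    }
}
```

Note that the minimum wins whenever it exceeds the other two values, which is why raising it is a blunt instrument: it overrides the block size and so gives up data locality.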
Small files and CombineFileInputFormat
Hadoop works better with a small number of large files than a large number of small
files. One reason for this is that FileInputFormat generates splits in such a way that
each split is all or part of a single file. If the files are very small (“small” means
significantly smaller than an HDFS block) and there are a lot of them, then each map
task will process very little input, and there will be a lot of map tasks (one per file), each of
which imposes extra bookkeeping overhead. Compare a 1 GB file broken into
sixteen 64 MB blocks, and 10,000 or so 100 KB files. The 10,000 files use one map
each, and the job time can be tens or hundreds of times slower than the equivalent
one with a single input file and 16 map tasks.
The situation is alleviated somewhat by CombineFileInputFormat, which was
designed to work well with small files. Where FileInputFormat creates a split per file,
CombineFileInputFormat packs many files into each split so that each mapper has
more to process. Crucially, CombineFileInputFormat takes node and rack locality
into account when deciding which blocks to place in the same split, so it does not
compromise the speed at which it can process the input in a typical MapReduce job.
Of course, if possible, it is still a good idea to avoid the many small files case since
MapReduce works best when it can operate at the transfer rate of the disks in the
cluster, and processing many small files increases the number of seeks that are
needed to run a job. Also, storing large numbers of small files in HDFS is wasteful of
the namenode’s memory. One technique for avoiding the many small files case is to
merge small files into larger files by using a SequenceFile: the keys can act as
filenames (or a constant such as NullWritable, if not needed) and the values as file
contents. See Example 7-4. But if you already have a large number of small files in
HDFS, then CombineFileInputFormat is worth trying.
CombineFileInputFormat isn’t just good for small files; it can bring benefits when
processing large files, too. Essentially, CombineFileInputFormat decouples the
amount of data that a mapper consumes from the block size of the files in HDFS.
If your mappers can process each block in a matter of seconds, then you could use
CombineFileInputFormat with the maximum split size set to a small multiple of the
block size (by setting the mapred.max.split.size property in bytes) so that
each mapper processes more than one block. In return, the overall processing time
falls, since proportionally fewer mappers run, which reduces the overhead in task
bookkeeping and startup time associated with a large number of short-lived
mappers.
Since CombineFileInputFormat is an abstract class without any concrete classes
(unlike FileInputFormat), you need to do a bit more work to use it. (Hopefully,
common implementations will be added to the library over time.) For example, to
have the CombineFileInputFormat equivalent of TextInputFormat, you would create
a concrete subclass of CombineFileInputFormat and implement the
getRecordReader() method.
Preventing splitting
Some applications don’t want files to be split, so that a single mapper can process
each input file in its entirety. For example, a simple way to check if all the records in
a file are sorted is to go through the records in order, checking whether each record
is not less than the preceding one. Implemented as a map task, this algorithm will
work only if one map processes the whole file. There are a couple of ways to ensure
that an existing file is not split. The first (quick and dirty) way is to increase the
minimum split size to be larger than the largest file in your system. Setting it to its
maximum value, Long.MAX_VALUE, has this effect. The second is to subclass the
concrete subclass of FileInputFormat that you want to use, to override the
isSplitable() method to return false. For example, here’s a nonsplittable
TextInputFormat:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // never split: each file goes to a single mapper
  }
}
File information in the mapper
A mapper processing a file input split can find information about the split by calling
the getInputSplit() method on the Mapper’s Context object. When the input format
derives from FileInputFormat, the InputSplit returned by this method can be cast to a
FileSplit to access the file information listed in Table 7-7. In the old MapReduce API,
Streaming, and Pipes, the same file split information is made available through
properties which can be read from the mapper’s configuration. (In the old
MapReduce API this is achieved by implementing configure() in your Mapper
implementation to get access to the JobConf object.)
In addition to the properties in Table 7-7 all mappers and reducers have access to
the properties listed in “The Task Execution Environment”.
Table 7-7. File split properties
The path of the input file being processed (type Path/String)
The byte offset of the start of the split from the beginning of the file
The length of the split in bytes
Notes: The sortedness check described under “Preventing splitting” is how the mapper
in SortValidator.RecordStatsChecker is implemented. In the method name isSplitable(),
“splitable” has a single “t.” It is usually spelled “splittable,” which is the spelling
I have used in this book.
In the next section, you will see how to use a FileSplit when you need to access the
split’s filename.
Processing a whole file as a record
A related requirement that sometimes crops up is for mappers to have access to the
full contents of a file. Not splitting the file gets you part of the way there, but you also
need to have a RecordReader that delivers the file contents as the value of the
record. The listing for WholeFileInputFormat in Example 7-2 shows a way of doing
this.
Example 7-2. An InputFormat for reading a whole file as a record
public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException,
      InterruptedException {
    WholeFileRecordReader reader = new WholeFileRecordReader();
    reader.initialize(split, context);
    return reader;
  }
}
WholeFileInputFormat defines a format where the keys are not used, represented by
NullWritable, and the values are the file contents, represented by BytesWritable instances. It defines two methods. First, the format is careful to specify that input files
should never be split, by overriding isSplitable() to return false. Second, we
implement createRecordReader() to return a custom implementation of
RecordReader, which appears in Example 7-3.
Example 7-3. The RecordReader used by WholeFileInputFormat for reading a whole
file as a record
class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
  private FileSplit fileSplit;
  private Configuration conf;
  private BytesWritable value = new BytesWritable();
  private boolean processed = false;
  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    this.fileSplit = (FileSplit) split;
    this.conf = context.getConfiguration();
  }
  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!processed) {
      // Read the entire file into the single value for this split
      byte[] contents = new byte[(int) fileSplit.getLength()];
      Path file = fileSplit.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }
    return false;
  }
  @Override
  public NullWritable getCurrentKey() throws IOException, InterruptedException {
    return NullWritable.get();
  }
  @Override
  public BytesWritable getCurrentValue() throws IOException,
      InterruptedException {
    return value;
  }
  @Override
  public float getProgress() throws IOException {
    return processed ? 1.0f : 0.0f;
  }
  @Override
  public void close() throws IOException {
    // do nothing
  }
}
WholeFileRecordReader is responsible for taking a FileSplit and converting it into a
single record, with a null key and a value containing the bytes of the file. Because
there is only a single record, WholeFileRecordReader has either processed it or not,
so it maintains a boolean called processed. If, when the nextKeyValue() method is
called, the file has not been processed, then we open the file, create a byte array
whose length is the length of the file, and use the Hadoop IOUtils class to slurp the
file into the byte array. Then we set the array on the BytesWritable instance held in
the value field, and return true to signal that a record has been read.
To demonstrate how WholeFileInputFormat can be used, consider a MapReduce job
for packaging small files into sequence files, where the key is the original filename,
and the value is the content of the file. The listing is in Example 7-4.
Example 7-4. A MapReduce program for packaging a collection of small files as a
single SequenceFile
public class SmallFilesToSequenceFileConverter extends Configured
    implements Tool {
  static class SequenceFileMapper
      extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
    private Text filenameKey;
    @Override
    protected void setup(Context context) throws IOException,
        InterruptedException {
      InputSplit split = context.getInputSplit();
      Path path = ((FileSplit) split).getPath();
      filenameKey = new Text(path.toString());
    }
    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
        throws IOException, InterruptedException {
      context.write(filenameKey, value);
    }
  }
  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }
    // Formats as described in the text; key/value classes match the mapper output
    job.setInputFormatClass(WholeFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(BytesWritable.class);
    job.setMapperClass(SequenceFileMapper.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new SmallFilesToSequenceFileConverter(), args);
    System.exit(exitCode);
  }
}
Since the input format is a WholeFileInputFormat, the mapper only has to find the
filename for the input file split. It does this by casting the InputSplit from the context
to a FileSplit, which has a method to retrieve the file path. The path is stored in a
Text object for the key. The reducer is the identity (not explicitly set), and the output
format is a SequenceFileOutputFormat.
Here’s a run on a few small files. We’ve chosen to use two reducers, so we get two
output sequence files:
% hadoop jar hadoop-examples.jar SmallFilesToSequenceFileConverter \
  input/smallfiles output
Two part files are created, each of which is a sequence file, which we can inspect
with the -text option to the filesystem shell:
% hadoop fs -conf conf/hadoop-localhost.xml -text output/part-r-00000
61 61 61 61 61 61 61 61 6161
63 63 63 63 63 63 63 63 6363
% hadoop fs -conf conf/hadoop-localhost.xml -text output/part-r-00001
62 62 62 62 62 62 62 62 6262
64 64 64 64 64 64 64 64 6464
66 66 66 66 66 66 66 66 6666
The input files were named a, b, c, d, e, and f, and each contained 10 characters of
the corresponding letter (so, for example, a contained 10 “a” characters), except e,
which was empty. We can see this in the textual rendering of the sequence files,
which prints the filename followed by the hex representation of the file.
There’s at least one way we could improve this program. As mentioned earlier,
having one mapper per file is inefficient, so subclassing CombineFileInputFormat
instead of FileInputFormat would be a better approach. Also, for a related technique
of packing files into a Hadoop Archive, rather than a sequence file, see the section
“Hadoop Archives”.
Text Input
Hadoop excels at processing unstructured text. In this section, we discuss the
different InputFormats that Hadoop provides to process text.
TextInputFormat is the default InputFormat. Each record is a line of input. The key, a
LongWritable, is the byte offset within the file of the beginning of the line. The value
is the contents of the line, excluding any line terminators (newline, carriage return),
and is packaged as a Text object. So a file containing the following text:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
is divided into one split of four records. The records are interpreted as the following
key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
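The keys above can be reproduced with a few lines of plain Java that add up line lengths. This is a sketch under two assumptions that the text does not state: each line ends with a single newline byte, and the text is encoded as UTF-8. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of how byte-offset keys arise; not the Hadoop record reader.
final class LineOffsetSketch {

    /** Returns the byte offset at which each line starts within the file. */
    static long[] offsets(String[] lines) {
        long[] result = new long[lines.length];
        long position = 0;
        for (int i = 0; i < lines.length; i++) {
            result[i] = position;
            // Advance past the line's bytes plus one assumed newline byte.
            position += lines[i].getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        String[] lines = {
            "On the top of the Crumpetty Tree",
            "The Quangle Wangle sat,",
            "But his face you could not see,",
            "On account of his Beaver Hat."
        };
        System.out.println(Arrays.toString(offsets(lines))); // prints [0, 33, 57, 89]
    }
}
```

The first line is 32 bytes long, so with its newline the second line starts at offset 33, and so on down the file.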
Clearly, the keys are not line numbers. This would be impossible to implement in
general, in that a file is broken into splits, at byte, not line, boundaries. Splits are
processed independently. Line numbers are really a sequential notion: you have to
keep a count of lines as you consume them, so knowing the line number within a
split would be possible, but not within the file.
However, the offset within the file of each line is known by each split independently
of the other splits, since each split knows the size of the preceding splits and just
adds this on to the offsets within the split to produce a global file offset. The offset is
usually sufficient for applications that need a unique identifier for each line.
Combined with the file’s name, it is unique within the filesystem. Of course, if all the
lines are a fixed width, then calculating the line number is simply a matter of dividing
the offset by the width.
The Relationship Between Input Splits and HDFS Blocks
The logical records that FileInputFormats define do not usually fit neatly into HDFS
blocks. For example, a TextInputFormat’s logical records are lines, which will cross
HDFS block boundaries more often than not. This has no bearing on the functioning of
your program—lines are not missed or broken, for example—but it’s worth knowing
about, as it does mean that data-local maps (that is, maps that are running on the
same host as their input data) will perform some remote reads. The slight overhead
this causes is not normally significant.
Figure 7-3 shows an example. A single file is broken into lines, and the line
boundaries do not correspond with the HDFS block boundaries. Splits honor logical
record boundaries, in this case lines, so we see that the first split contains line 5,
even though it spans the first and second block. The second split starts at line 6.
Figure 7-3. Logical records and HDFS blocks for TextInputFormat
TextInputFormat’s keys, being simply the offset within the file, are not normally very
useful. It is common for each line in a file to be a key-value pair, separated by a
delimiter such as a tab character. For example, this is the output produced by
TextOutputFormat, Hadoop’s default OutputFormat. To interpret such files correctly,
KeyValueTextInputFormat is appropriate.
You can specify the separator via the
mapreduce.input.keyvaluelinerecordreader.key.value.separator property (a
differently named property serves the same purpose in the old API). It is a tab
character by default. Consider the following input file, where → represents a
(horizontal) tab character:
line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.
As in the TextInputFormat case, the input is in a single split comprising four
records, although this time the keys are the Text sequences before the tab in each
line:
(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
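The way a line is broken into a key and a value can be sketched without any Hadoop classes. The sketch below is an approximation, not the library code: the class name is hypothetical, it works on Java strings rather than Text, and it assumes that a line with no separator becomes a key with an empty value.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

final class KeyValueLineSplitter {

    /** Splits a line at the first occurrence of the separator character. */
    static Map.Entry<String, String> split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos == -1) {
            // No separator: treat the whole line as the key (an assumption of this sketch).
            return new SimpleImmutableEntry<>(line, "");
        }
        // Only the first separator counts; later ones stay inside the value.
        return new SimpleImmutableEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> record = split("line1\tOn the top of the Crumpetty Tree", '\t');
        System.out.println("(" + record.getKey() + ", " + record.getValue() + ")");
    }
}
```

Splitting at the first separator only means that values may themselves contain tab characters.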
With TextInputFormat and KeyValueTextInputFormat, each mapper receives a
variable number of lines of input. The number depends on the size of the split and
the length of the lines. If you want your mappers to receive a fixed number of lines of
input, then NLineInputFormat is the InputFormat to use. Like TextInputFormat, the
keys are the byte offsets within the file and the values are the lines themselves. N
refers to the number of lines of input that each mapper receives. With N set to one
(the default), each mapper receives exactly one line of input. A job configuration
property (named differently in the old API) controls the value of N. By way of example, consider these
four lines again:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
If, for example, N is two, then each split contains two lines. One mapper will receive
the first two key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
And another mapper will receive the second two key-value pairs:
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
The keys and values are the same as TextInputFormat produces. What is different is
the way the splits are constructed.
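To make the difference concrete, the following sketch computes the byte offset of each line and then groups the offsets into splits of N lines. It is a plain-Java model rather than Hadoop code: the class and method names are hypothetical, and it assumes every line ends with a single newline byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NLineSplitSketch {

    /** Byte offset at which each line starts, assuming one '\n' byte ends every line. */
    static long[] lineOffsets(List<String> lines) {
        long[] offsets = new long[lines.size()];
        long position = 0;
        for (int i = 0; i < lines.size(); i++) {
            offsets[i] = position;
            position += lines.get(i).getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    /** Groups the line offsets into splits of at most n lines each. */
    static List<List<Long>> splits(long[] offsets, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<Long>> result = new ArrayList<>();
        for (int start = 0; start < offsets.length; start += n) {
            List<Long> split = new ArrayList<>();
            for (int i = start; i < Math.min(start + n, offsets.length); i++) {
                split.add(offsets[i]);
            }
            result.add(split);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "On the top of the Crumpetty Tree",
            "The Quangle Wangle sat,",
            "But his face you could not see,",
            "On account of his Beaver Hat.");
        // With N = 2 the keys fall into two groups: [[0, 33], [57, 89]]
        System.out.println(splits(lineOffsets(lines), 2));
    }
}
```

The keys are the same offsets that TextInputFormat would produce; only the grouping into splits depends on N.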
Usually, having a map task for a small number of lines of input is inefficient (due to
the overhead in task setup), but there are applications that take a small amount of
input data and run an extensive (that is, CPU-intensive) computation for it, then emit
their output. Simulations are a good example. By creating an input file that specifies
input parameters, one per line, you can perform a parameter sweep: run a set of
simulations in parallel to find how a model varies as the parameter changes.
If you have long-running simulations, you may fall afoul of task timeouts. When a
task doesn’t report progress for more than 10 minutes, the tasktracker assumes
it has failed and aborts the process (see “Task Failure”). The best way to guard
against this is to report progress periodically, for example by writing a status
message or incrementing a counter. See “What Constitutes Progress in
MapReduce?”.
Another example is using Hadoop to bootstrap data loading from multiple data
sources, such as databases. You create a “seed” input file that lists the data
sources, one per line. Then each mapper is allocated a single data source, and it
loads the data from that source into HDFS. The job doesn’t need the reduce phase,
so the number of reducers should be set to zero (by calling setNumReduceTasks()
on Job). Furthermore, MapReduce jobs can be run to process the data loaded into
HDFS. See Appendix C for an example.
Most XML parsers operate on whole XML documents, so if a large XML document is
made up of multiple input splits, then it is a challenge to parse these individually. Of
course, you can process the entire XML document in one mapper (if it is not too
large) using the technique in “Processing a whole file as a record”.
Large XML documents that are composed of a series of “records” (XML document
fragments) can be broken into these records using simple string or regular-expression
matching to find start and end tags of records. This alleviates the
problem when the document is split by the framework, since the next start tag of a
record is easy to find by simply scanning from the start of the split, just like
TextInputFormat finds newline boundaries.
Hadoop comes with a class for this purpose called StreamXmlRecordReader (which
is in the org.apache.hadoop.streaming package, although it can be used outside of
Streaming). You can use it by setting your input format to StreamInputFormat and
specifying org.apache.hadoop.streaming.StreamXmlRecordReader as the record reader
class. The reader is configured
by setting job configuration properties to tell it the patterns for the start and end tags
(see the class documentation for details).
To take an example, Wikipedia provides dumps of its content in XML form, which are
appropriate for processing in parallel with MapReduce using this approach. The
data is contained in one large XML wrapper document, which contains a series of
elements, such as page elements that contain a page’s content and associated
metadata. Using StreamXmlRecordReader, the page elements can be interpreted as
records for processing by a mapper.
Binary Input
Hadoop MapReduce is not just restricted to processing textual data—it has support
for binary formats, too.
Hadoop’s sequence file format stores sequences of binary key-value pairs.
Sequence files are well suited as a format for MapReduce data since they are
splittable (they have sync points so that readers can synchronize with record
boundaries from an arbitrary point in the file, such as the start of a split), they support
compression as a part of the format, and they can store arbitrary types using a
variety of serialization frameworks.
To use data from sequence files as the input to MapReduce, you use
SequenceFileInputFormat. The keys and values are determined by the sequence
file, and you need to make sure that your map input types correspond. For example,
if your sequence file has IntWritable keys and Text values, like the one created in
Chapter 4, then the map signature would be Mapper<IntWritable, Text, K, V>, where
K and V are the types of the map’s output keys and values.
SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that
converts the sequence file’s keys and values to Text objects. The conversion is
performed by calling toString() on the keys and values. This format makes sequence
files suitable input for Streaming.
SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat that
retrieves the sequence file’s keys and values as opaque binary objects. They are
encapsulated as BytesWritable objects, and the application is free to interpret the
underlying byte array as it pleases. Combined with a process that creates sequence
files with SequenceFile.Writer’s appendRaw() method, this provides a way to use
any binary data types with MapReduce (packaged as a sequence file), although
plugging into Hadoop’s serialization mechanism is normally a cleaner alternative.
Multiple Inputs
Although the input to a MapReduce job may consist of multiple input files
(constructed by a combination of file globs, filters, and plain paths), all of the input is
interpreted by a single InputFormat and a single Mapper. What often happens,
however, is that over time, the data format evolves, so you have to write your
mapper to cope with all of your legacy formats. Or, you have data sources that
provide the same type of data but in different formats. This arises in the case of
performing joins of different datasets; see “Reduce-Side Joins”. For instance, one
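might be tab-separated plain text, the other a binary sequence file. Even if they are
in the same format, they may have different representations and, therefore, need to
be parsed differently.
The MultipleInputs class handles these cases by letting you specify the InputFormat
and Mapper to use on a per-path basis. For example, weather data from the Met Office
can be combined with the NCDC data as follows: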
MultipleInputs.addInputPath(job, ncdcInputPath,
TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metOfficeInputPath,
TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);
This code replaces the usual calls to FileInputFormat.addInputPath() and
job.setMapperClass(). Both the Met Office and NCDC data are text-based, so we use
TextInputFormat for each. But the line format of the two data sources is different, so we use two
different mappers. The MaxTemperatureMapper reads NCDC input data and
extracts the year and temperature fields. The MetOfficeMaxTemperatureMapper
reads Met Office input data and extracts the year and temperature fields. The
important thing is that the map outputs have the same types, since the reducers
(which are all of the same type) see the aggregated map outputs and are not aware
of the different mappers used to produce them.
The MultipleInputs class has an overloaded version of addInputPath() that doesn’t
take a mapper:
public static void addInputPath(Job job, Path path,
Class<? extends InputFormat> inputFormatClass)
This is useful when you only have one mapper (set using the Job’s
setMapperClass() method) but multiple input formats.
Database Input (and Output)
DBInputFormat is an input format for reading data from a relational database, using
JDBC. Because it doesn’t have any sharding capabilities, you need to be careful not
to overwhelm the database you are reading from by running too many mappers. For
this reason, it is best used for loading relatively small datasets, perhaps for joining
with larger datasets from HDFS, using MultipleInputs. The corresponding output
format is DBOutputFormat, which is useful for dumping job outputs (of modest size)
into a database. For an alternative way of moving data between relational databases
and HDFS, consider using Sqoop, which is described in Chapter 15. HBase’s
TableInputFormat is designed to allow a MapReduce program to operate on data
stored in an HBase table. TableOutputFormat is for writing MapReduce outputs into
an HBase table.
Output Formats
Hadoop has output data formats that correspond to the input formats covered in the
previous section. The OutputFormat class hierarchy appears in Figure 7-4.
Figure 7-4. OutputFormat class hierarchy
Text Output
The default output format, TextOutputFormat, writes records as lines of text. Its keys
and values may be of any type, since TextOutputFormat turns them to strings by
calling toString() on them. Each key-value pair is separated by a tab character,
although the separator can be changed (via the mapred.textoutputformat.separator
property in the old API). The counterpart to TextOutputFormat
for reading in this case is KeyValueTextInputFormat, since it breaks lines into
key-value pairs based on a configurable separator (see “KeyValueTextInputFormat”).
You can suppress the key or the value (or both, making this output format equivalent
to NullOutputFormat, which emits nothing) from the output using a NullWritable type.
This also causes no separator to be written, which makes the output suitable for
reading in using TextInputFormat.
Binary Output
As the name indicates, SequenceFileOutputFormat writes sequence files for its
output. This is a good choice of output if it forms the input to a further MapReduce
job, since it is compact and is readily compressed. Compression is controlled via the
static methods on SequenceFileOutputFormat, as described in “Using Compression
in Map-Reduce”. For an example of how to use SequenceFileOutputFormat, see
“Sorting” .
The counterpart to SequenceFileAsBinaryInputFormat writes keys and values in raw
binary format into a SequenceFile container.
MapFileOutputFormat writes MapFiles as output. The keys in a MapFile must be
added in order, so you need to ensure that your reducers emit keys in sorted
order. The reduce input keys are guaranteed to be sorted, but the output keys are
under the control of the reduce function, and there is nothing in the general
MapReduce contract that states that the reduce output keys have to be ordered in
any way. The extra constraint of sorted reduce output keys is needed only for
MapFileOutputFormat.
Multiple Outputs
FileOutputFormat and its subclasses generate a set of files in the output directory.
There is one file per reducer, and files are named by the partition number:
part-r-00000, part-r-00001, etc. There is sometimes a need to have more control over the
naming of the files or to produce multiple files per reducer. MapReduce comes with
the MultipleOutputs class to help you do this.
An example: Partitioning data
Consider the problem of partitioning the weather dataset by weather station. We
would like to run a job whose output is a file per station, with each file containing all
the records for that station.
One way of doing this is to have a reducer for each weather station. To arrange this,
we need to do two things. First, write a partitioner that puts records from the same
weather station into the same partition. Second, set the number of reducers on the
job to be the number of weather stations. The partitioner would look like this:
public class StationPartitioner extends Partitioner<LongWritable, Text> {
  private NcdcRecordParser parser = new NcdcRecordParser();
  public int getPartition(LongWritable key, Text value, int numPartitions) {
    parser.parse(value); // parse the record before reading the station ID
    return getPartition(parser.getStationId());
  }
  private int getPartition(String stationId) {
    ...
  }
}
The getPartition(String) method, whose implementation is not shown, turns the
station ID into a partition index. To do this, it needs a list of all the station IDs and
then just returns the index of the station ID in the list.
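One possible shape for that lookup is sketched below. It is an illustration only: the StationIndex class is hypothetical and stands apart from the partitioner, and the station IDs are placeholders. Throwing on an unknown ID is a choice made here to keep the problem visible.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StationIndex {

    // Maps each known station ID to its position in the list of known stations.
    private final Map<String, Integer> indexById = new HashMap<>();

    StationIndex(List<String> stationIds) {
        for (String id : stationIds) {
            // Duplicates keep their first index, so indexes stay contiguous.
            indexById.putIfAbsent(id, indexById.size());
        }
    }

    /** Number of partitions (and hence reducers) this scheme requires. */
    int size() {
        return indexById.size();
    }

    /** Returns the partition index for a station known in advance. */
    int partitionFor(String stationId) {
        Integer index = indexById.get(stationId);
        if (index == null) {
            // A station seen in the data but not in the list has no partition.
            throw new IllegalArgumentException("Unknown station: " + stationId);
        }
        return index;
    }

    public static void main(String[] args) {
        StationIndex index = new StationIndex(Arrays.asList("station-a", "station-b", "station-c"));
        System.out.println(index.partitionFor("station-b") + " of " + index.size());
    }
}
```

The need to build the full list before the job starts is exactly what makes this approach awkward.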
There are two drawbacks to this approach. The first is that since the number of partitions needs to be known before the job is run, so does the number of weather
stations. Although the NCDC provides metadata about its stations, there is no
guarantee that the IDs encountered in the data match those in the metadata. A
station that appears in the metadata but not in the data wastes a reducer slot.
Worse, a station that appears in the data but not in the metadata doesn’t get a
reducer slot—it has to be thrown away. One way of mitigating this problem would be
to write a job to extract the unique station IDs, but it’s a shame that we need an extra
job to do this.
The second drawback is more subtle. It is generally a bad idea to allow the number
of partitions to be rigidly fixed by the application, since it can lead to small or
uneven-sized partitions. Having many reducers doing a small amount of work isn’t an
efficient way of organizing a job: it’s much better to get reducers to do more work
and have fewer of them, as the overhead in running a task is then reduced.
Uneven-sized partitions can be difficult to avoid, too. Different weather stations will have
gathered a widely varying amount of data: compare a station that opened one year
ago to one that has been gathering data for one century. If a few reduce tasks take
significantly longer than the others, they will dominate the job execution time and
cause it to be longer than it needs to be.
There are two special cases when it does make sense to allow the application to set
the number of partitions (or equivalently, the number of reducers):
Zero reducers: This is a vacuous case: there are no partitions, as the application
needs to run only map tasks.
One reducer: It can be convenient to run small jobs to combine the output of previous
jobs into a single file. This should only be attempted when the amount of data is
small enough to be processed comfortably by one reducer.
It is much better to let the cluster drive the number of partitions for a job—the idea
being that the more cluster reduce slots are available the faster the job can
complete. This is why the default HashPartitioner works so well, as it works with any
number of partitions and ensures each partition has a good mix of keys leading to
more even-sized partitions.
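The idea behind hash partitioning can be sketched as follows. The class name is hypothetical and the keys are ordinary Java objects rather than Writables, so treat this as a model of the technique and not as the library source.

```java
final class HashPartitionSketch {

    /**
     * Maps a key to a partition in the range [0, numPartitions).
     * Masking with Integer.MAX_VALUE clears the sign bit, so a negative
     * hash code can never produce a negative partition number.
     */
    static int partition(Object key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always lands in the same partition for a given partition count.
        for (String key : new String[] {"station-a", "station-b", "station-c"}) {
            System.out.println(key + " -> " + partition(key, 4));
        }
    }
}
```

Because the partition count is just a parameter, the same function works whether the cluster offers four reduce slots or four hundred.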
If we go back to using HashPartitioner, each partition will contain multiple stations,
so to create a file per station, we need to arrange for each reducer to write multiple
files, which is where MultipleOutputs comes in.
MultipleOutputs allows you to write data to files whose names are derived from the
output keys and values, or in fact from an arbitrary string. This allows each reducer
(or mapper in a map-only job) to create more than a single file. File names are of the
form name-m-nnnnn for map outputs and name-r-nnnnn for reduce outputs, where
name is an arbitrary name that is set by the program, and nnnnn is an integer
designating the part number, starting from zero. The part number ensures that
outputs written from different partitions (mappers or reducers) do not collide in the
case of the same name.
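The naming scheme can be reproduced with a format string. The helper below is hypothetical and only mirrors the pattern described above, with a five-digit, zero-padded part number; it does not call into MultipleOutputs.

```java
import java.util.Locale;

final class OutputFileNames {

    /** Builds name-m-nnnnn for map outputs or name-r-nnnnn for reduce outputs. */
    static String fileName(String baseName, boolean mapOutput, int partNumber) {
        if (partNumber < 0) {
            throw new IllegalArgumentException("partNumber must not be negative");
        }
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "%s-%s-%05d", baseName, mapOutput ? "m" : "r", partNumber);
    }

    public static void main(String[] args) {
        System.out.println(fileName("station", false, 0)); // station-r-00000
        System.out.println(fileName("station", true, 12)); // station-m-00012
    }
}
```

Two reducers that write under the same name still produce distinct files, because their part numbers differ.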
The program in Example 7-5 shows how to use MultipleOutputs to partition the
dataset by station.
Example 7-5. Partitions whole dataset into files named by the station ID using
MultipleOutputs
public class PartitionByStationUsingMultipleOutputs extends Configured
    implements Tool {
  static class StationMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    private NcdcRecordParser parser = new NcdcRecordParser();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      parser.parse(value);
      context.write(new Text(parser.getStationId()), value);
    }
  }
  static class MultipleOutputsReducer
      extends Reducer<Text, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> multipleOutputs;
    @Override
    protected void setup(Context context)
        throws IOException, InterruptedException {
      multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values) {
        multipleOutputs.write(NullWritable.get(), value, key.toString());
      }
    }
    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      multipleOutputs.close();
    }
  }
  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }
    job.setMapperClass(StationMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setReducerClass(MultipleOutputsReducer.class);
    job.setOutputKeyClass(NullWritable.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(
        new PartitionByStationUsingMultipleOutputs(), args);
    System.exit(exitCode);
  }
}
In the reducer, where we generate the output, we construct an instance of
MultipleOutputs in the setup() method and assign it to an instance variable. We then
use the MultipleOutputs instance in the reduce() method to write to the output, in
place of the context. The write() method takes the key and value, as well as a name.
We use the station identifier for the name, so the overall effect is to produce output
files with the naming scheme station_identifier-r-nnnnn.
In one run, the first few output files were named as follows (other columns from the
directory listing have been dropped):
The base path specified in the write() method of MultipleOutputs is interpreted
relative to the output directory, and since it may contain file path separator
characters (/), it’s possible to create subdirectories of arbitrary depth. For example,
the following modification partitions the data by station and year so that each year’s
data is contained in a directory named by the station ID (such as 02907099999/1901/part-r-00000):
protected void reduce(Text key, Iterable<Text> values, Context context)
    throws IOException, InterruptedException {
  for (Text value : values) {
    // Assumes the reducer declares an NcdcRecordParser field named parser.
    parser.parse(value);
    String basePath = String.format("%s/%s/part",
        parser.getStationId(), parser.getYear());
    multipleOutputs.write(NullWritable.get(), value, basePath);
  }
}
MultipleOutputs delegates to the mapper’s OutputFormat, which in this example is a
TextOutputFormat, but more complex setups are possible. For example, you can
create named outputs, each with its own OutputFormat and key and value types
(which may differ from the output types of the mapper or reducer). Furthermore, the
mapper or reducer (or both) may write to multiple output files for each record
processed. Please consult the Java documentation for more information.
Lazy Output
FileOutputFormat subclasses will create output (part-r-nnnnn) files, even if they are
empty. Some applications prefer that empty files not be created, which is where
LazyOutputFormat helps. It is a wrapper output format that ensures that the output file is
created only when the first record is emitted for a given partition. To use it, call its
setOutputFormatClass() method with the JobConf and the underlying output format.
Streaming and Pipes support a -lazyOutput option to enable LazyOutputFormat.
MapReduce Features
This chapter looks at some of the more advanced features of MapReduce, including
counters and sorting and joining datasets.
There are often things you would like to know about the data you are analyzing but
that are peripheral to the analysis you are performing. For example, if you were
counting invalid records and discovered that the proportion of invalid records in the
whole dataset was very high, you might be prompted to check why so many records
were being marked as invalid—perhaps there is a bug in the part of the program that
detects invalid records? Or if the data were of poor quality and genuinely did have
very many invalid records, after discovering this, you might decide to increase the
size of the dataset so that the number of good records was large enough for
meaningful analysis.
Counters are a useful channel for gathering statistics about the job: for quality control
or for application-level statistics. They are also useful for problem diagnosis. If you
are tempted to put a log message into your map or reduce task, then it is often better
to see whether you can use a counter instead to record that a particular condition
occurred. In addition to counter values being much easier to retrieve than log output
for large distributed jobs, you get a record of the number of times that condition
occurred, which is more work to obtain from a set of logfiles.
Built-in Counters
Hadoop maintains some built-in counters for every job, which report various metrics
for your job. For example, there are counters for the number of bytes and records
processed, which allows you to confirm that the expected amount of input was consumed and the expected amount of output was produced.
Counters are divided into groups, and there are several groups for the built-in
counters, listed in Table 8-1.
Table 8-1. Built-in counter groups
MapReduce task counters (see Table 8-2):
org.apache.hadoop.mapred.Task$Counter (0.20)
org.apache.hadoop.mapreduce.TaskCounter (post 0.20)
Filesystem counters (see Table 8-3):
FileSystemCounters (0.20)
org.apache.hadoop.mapreduce.FileSystemCounter (post 0.20)
FileInputFormat counters (see Table 8-4):
org.apache.hadoop.mapred.FileInputFormat$Counter (0.20)
…Counter (post 0.20)
FileOutputFormat counters (see Table 8-5):
org.apache.hadoop.mapred.FileOutputFormat$Counter (0.20)
…Counter (post 0.20)
Job counters (see Table 8-6):
org.apache.hadoop.mapred.JobInProgress$Counter (0.20)
org.apache.hadoop.mapreduce.JobCounter (post 0.20)
Each group either contains task counters (which are updated as a task progresses)
or job counters (which are updated as a job progresses). We look at both types in
the following sections.
Task counters
Task counters gather information about tasks over the course of their execution, and
the results are aggregated over all the tasks in a job. For example, the
MAP_INPUT_RECORDS counter counts the input records read by each map task
and aggregates over all map tasks in a job, so that the final figure is the total
number of input records for the whole job.
Task counters are maintained by each task attempt, and periodically sent to the tasktracker and then to the jobtracker, so they can be globally aggregated. (This is
described in “Progress and Status Updates”. Note that the information flow is different in YARN, see “YARN (MapReduce 2)”.) Task counters are sent in full every
time, rather than sending the counts since the last transmission, since this guards
against errors due to lost messages. Furthermore, during a job run, counters may go
down if a task fails.
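A small model shows why sending full values is robust. It is a simplification with hypothetical names: the aggregator keeps the latest full value reported by each task attempt and sums them, and it discards the contribution of a failed attempt.

```java
import java.util.HashMap;
import java.util.Map;

final class CounterAggregator {

    // Latest full counter value reported by each task attempt.
    private final Map<String, Long> latestByAttempt = new HashMap<>();

    /** Records a full (not incremental) value; a lost earlier report does no harm. */
    void report(String attemptId, long fullValue) {
        latestByAttempt.put(attemptId, fullValue);
    }

    /** Discards the contribution of a failed attempt, so the total may go down. */
    void attemptFailed(String attemptId) {
        latestByAttempt.remove(attemptId);
    }

    /** Global value of the counter across all live attempts. */
    long total() {
        long sum = 0;
        for (long value : latestByAttempt.values()) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        CounterAggregator records = new CounterAggregator();
        records.report("attempt_1", 100);
        records.report("attempt_2", 40);
        // Suppose a report of 150 from attempt_1 is lost; the next full value repairs the total.
        records.report("attempt_1", 200);
        System.out.println(records.total());
        records.attemptFailed("attempt_2");
        System.out.println(records.total());
    }
}
```

With incremental updates, the lost message would have left the total permanently short; with full values, the next report overwrites the stale figure.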
Counter values are definitive only once a job has successfully completed. However,
some counters provide useful diagnostic information as a task is progressing, and it
can be useful to monitor them with the web UI. For example,
COMMITTED_HEAP_BYTES provides an indication of how memory usage varies
over the course of a particular task attempt.
The built-in task counters include those in the MapReduce task counters group (Table 8-2) and those in the file-related counters groups (Table 8-3, Table 8-4, Table 8-5).
Table 8-2. Built-in MapReduce task counters
Map input records (MAP_INPUT_RECORDS): The number of input records consumed by all the maps in the job. Incremented every time a record is read from a RecordReader and passed to the map’s map() method by the framework.
Map skipped records: The number of input records skipped by all the maps in the job. See “Skipping Bad Records”.
Map input bytes: The number of bytes of uncompressed input consumed by all the maps in the job. Incremented every time a record is read from a RecordReader and passed to the map’s map() method by the framework.
Split raw bytes: The number of bytes of input split objects read by maps. These objects represent the split metadata (that is, the offset and length within a file) rather than the split data itself, so the total size should be small.
Map output records (MAP_OUTPUT_RECORDS): The number of map output records produced by all the maps in the job. Incremented every time the collect() method is called on a map’s OutputCollector.
Map output bytes (MAP_OUTPUT_BYTES): The number of bytes of uncompressed output produced by all the maps in the job. Incremented every time the collect() method is called on a map’s OutputCollector.
Map output materialized bytes (MAP_OUTPUT_MATERIALIZED_BYTES): The number of bytes of map output actually written to disk. If map output compression is enabled, this is reflected in the counter value.
Combine input records (COMBINE_INPUT_RECORDS): The number of input records consumed by all the combiners (if any) in the job. Incremented every time a value is read from the combiner’s iterator over values. Note that this count is the number of values consumed by the combiner, not the number of distinct key groups (which would not be a useful metric, since there is not necessarily one group per key for a combiner; see “Combiner Functions”, and also “Shuffle and Sort”).
Combine output records (COMBINE_OUTPUT_RECORDS): The number of output records produced by all the combiners (if any) in the job. Incremented every time the collect() method is called on a combiner’s OutputCollector.
Reduce input groups (REDUCE_INPUT_GROUPS): The number of distinct key groups consumed by all the reducers in the job. Incremented every time the reducer’s reduce() method is called by the framework.
Reduce input records (REDUCE_INPUT_RECORDS): The number of input records consumed by all the reducers in the job. Incremented every time a value is read from the reducer’s iterator over values. If reducers consume all of their inputs, this count should be the same as the count for Map output records.
Reduce output records (REDUCE_OUTPUT_RECORDS): The number of reduce output records produced by all the reducers in the job. Incremented every time the collect() method is called on a reducer’s OutputCollector.
Reduce skipped groups: The number of distinct key groups skipped by all the reducers in the job. See “Skipping Bad Records”.
Reduce skipped records: The number of input records skipped by all the reducers in the job.
Reduce shuffle bytes: The number of bytes of map output copied by the shuffle to reducers.
Spilled records: The number of records spilled to disk in all map and reduce tasks in the job.
CPU milliseconds: The cumulative CPU time for a task in milliseconds.
Physical memory bytes: The physical memory being used by a task in bytes.
Virtual memory bytes: The virtual memory being used by a task in bytes.
Committed heap bytes: The total amount of memory available in the JVM in bytes, as reported by the JVM runtime.
GC time milliseconds: The elapsed time for garbage collection in tasks in milliseconds, as reported by GarbageCollectorMXBean.getCollectionTime().
Shuffled maps: The number of map output files transferred to reducers by the shuffle (see “Shuffle and Sort”). From 0.21.
Failed shuffle: The number of map output copy failures during the shuffle. From 0.21.
Merged map outputs: The number of map outputs that have been merged on the reduce side of the shuffle. From 0.21.
Table 8-3. Built-in filesystem task counters
Filesystem bytes read: The number of bytes read by each filesystem by map and reduce tasks. There is a counter for each filesystem: Filesystem may be Local, HDFS, S3, KFS, etc.
Filesystem bytes written: The number of bytes written by each filesystem by map and reduce tasks.
Table 8-4. Built-in FileInputFormat task counters
Bytes read: The number of bytes read by map tasks via the FileInputFormat.
Table 8-5. Built-in FileOutputFormat task counters
Bytes written: The number of bytes written by map tasks (for map-only jobs) or reduce tasks via the FileOutputFormat.
Job counters
Job counters (Table 8-6) are maintained by the jobtracker (or application master in
YARN), so they don’t need to be sent across the network, unlike all other counters,
including user-defined ones. They measure job-level statistics, not values that
change while a task is running. For example, TOTAL_LAUNCHED_MAPS counts
the number of map tasks that were launched over the course of a job (including ones
that failed).
Table 8-6. Built-in job counters
Launched map tasks: The number of map tasks that were launched. Includes tasks that were started speculatively.
Launched reduce tasks: The number of reduce tasks that were launched. Includes tasks that were started speculatively.
Launched uber tasks: The number of uber tasks (see “YARN (MapReduce 2)”) that were launched. From 0.23.
Maps in uber tasks: The number of maps in uber tasks. From 0.23.
Reduces in uber tasks: The number of reduces in uber tasks. From 0.23.
Failed map tasks: The number of map tasks that failed. See “Task Failure” for potential causes.
Failed reduce tasks: The number of reduce tasks that failed.
Failed uber tasks: The number of uber tasks that failed. From 0.23.
Data-local map tasks: The number of map tasks that ran on the same node as their input data.
Rack-local map tasks: The number of map tasks that ran on a node in the same rack as their input data, but that are not data-local.
Other local map tasks: The number of map tasks that ran on a node in a different rack to their input data. Inter-rack bandwidth is scarce, and Hadoop tries to place map tasks close to their input data, so this count should be low. See Figure 2-2.
Total time in map tasks: The total time taken running map tasks in milliseconds. Includes tasks that were started speculatively.
Total time in reduce tasks: The total time taken running reduce tasks in milliseconds. Includes tasks that were started speculatively.
Total time in map tasks waiting after reserving slots (FALLOW_SLOTS_MILLIS_MAPS): The total time spent waiting after reserving slots for map tasks in milliseconds. Slot reservation is a Capacity Scheduler feature for high-memory jobs; see “Task memory limits”. Not used by YARN-based MapReduce.
Total time in reduce tasks waiting after reserving slots (FALLOW_SLOTS_MILLIS_REDUCES): The total time spent waiting after reserving slots for reduce tasks in milliseconds. Slot reservation is a Capacity Scheduler feature for high-memory jobs; see “Task memory limits”. Not used by YARN-based MapReduce.
User-Defined Java Counters
MapReduce allows user code to define a set of counters, which are then
incremented as desired in the mapper or reducer. Counters are defined by a Java
enum, which serves to group related counters. A job may define an arbitrary number
of enums, each with an arbitrary number of fields. The name of the enum is the
group name, and the enum’s fields are the counter names. Counters are global: the
MapReduce framework aggregates them across all maps and reduces to produce a
grand total at the end of the job.
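The grouping rule can be modelled in plain Java. The sketch below is not the MapReduce API: the CounterSketch class is hypothetical, and it assumes the group name is the enum’s class name. It also accepts a group and counter given as plain strings, for counters whose names are only known at run time.

```java
import java.util.Map;
import java.util.TreeMap;

final class CounterSketch {

    enum Temperature { MISSING, MALFORMED }

    // Group name -> counter name -> value.
    private final Map<String, Map<String, Long>> groups = new TreeMap<>();

    /** Increments an enum counter; the enum's class name serves as the group name. */
    void increment(Enum<?> counter, long amount) {
        increment(counter.getDeclaringClass().getName(), counter.name(), amount);
    }

    /** Increments a counter identified by plain strings. */
    void increment(String group, String counter, long amount) {
        groups.computeIfAbsent(group, g -> new TreeMap<>()).merge(counter, amount, Long::sum);
    }

    /** Returns the current value, or zero if the counter was never incremented. */
    long get(String group, String counter) {
        Map<String, Long> counters = groups.get(group);
        if (counters == null) {
            return 0L;
        }
        return counters.getOrDefault(counter, 0L);
    }

    public static void main(String[] args) {
        CounterSketch counters = new CounterSketch();
        counters.increment(Temperature.MISSING, 1);
        counters.increment(Temperature.MALFORMED, 1);
        counters.increment("TemperatureQuality", "1", 1);
        System.out.println(counters.groups);
    }
}
```

Enum counters give compile-time checking of names, while string counters trade that safety for flexibility.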
We created some counters in Chapter 5 for counting malformed records in the
weather dataset. The program in Example 8-1 extends that example to count the
number of missing records and the distribution of temperature quality codes.
Example 8-1. Application to run the maximum temperature job, including counting
missing and malformed fields and quality codes
public class MaxTemperatureWithCounters extends Configured implements Tool {

  enum Temperature {
    MISSING,
    MALFORMED
  }

  static class MaxTemperatureMapperWithCounters extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private NcdcRecordParser parser = new NcdcRecordParser();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      parser.parse(value);
      if (parser.isValidTemperature()) {
        int airTemperature = parser.getAirTemperature();
        output.collect(new Text(parser.getYear()),
            new IntWritable(airTemperature));
      } else if (parser.isMalformedTemperature()) {
        System.err.println("Ignoring possibly corrupt input: " + value);
        reporter.incrCounter(Temperature.MALFORMED, 1);
      } else if (parser.isMissingTemperature()) {
        reporter.incrCounter(Temperature.MISSING, 1);
      }
      // dynamic counter
      reporter.incrCounter("TemperatureQuality", parser.getQuality(), 1);
    }
  }

  public int run(String[] args) throws IOException {
    JobConf conf = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (conf == null) {
      return -1;
    }
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MaxTemperatureMapperWithCounters.class);
    // Set the combiner and reducer for the maximum temperature job here.
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureWithCounters(), args);
    System.exit(exitCode);
  }
}
The best way to see what this program does is run it over the complete dataset:
% hadoop jar hadoop-examples.jar MaxTemperatureWithCounters
input/ncdc/all output-counters
When the job has successfully completed, it prints out the counters at the end
(this is done by JobClient’s runJob() method). Here are the ones we are
interested in:
09/04/20 06:33:36 INFO mapred.JobClient:   TemperatureQuality
09/04/20 06:33:36 INFO mapred.JobClient:     2=1246032
09/04/20 06:33:36 INFO mapred.JobClient:     1=973422173
09/04/20 06:33:36 INFO mapred.JobClient:   Air Temperature Records
09/04/20 06:33:36 INFO mapred.JobClient:     Missing=66136856
Dynamic counters
The code makes use of a dynamic counter—one that isn’t defined by a Java
enum. Since a Java enum’s fields are defined at compile time, you can’t create
new counters on the fly using enums. Here we want to count the distribution of
temperature quality codes, and though the format specification defines the values
that it can take, it is more convenient to use a dynamic counter to emit the
values that it actually takes. The method we use on the Reporter object takes a
group and counter name using String names:
public void incrCounter(String group, String counter, long amount)
The two ways of creating and accessing counters—using enums and using Strings—
are actually equivalent since Hadoop turns enums into Strings to send counters over
RPC. Enums are slightly easier to work with, provide type safety, and are suitable for
most jobs. For the odd occasion when you need to create counters dynamically, you
can use the String interface.
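The equivalence of the two forms can be sketched without Hadoop. In the hypothetical DynamicCounterSketch class below, the enum overload simply reduces to the String overload by using the enum's class name as the group and the field name as the counter; this mirrors the description above and is not Hadoop's actual implementation.

```java
import java.util.Map;
import java.util.TreeMap;

final class DynamicCounterSketch {
    enum Temperature { MISSING, MALFORMED }

    // group name -> counter name -> value; both kinds of counter are stored here.
    private final Map<String, Map<String, Long>> groups = new TreeMap<>();

    // String form: group and counter names can be chosen at run time.
    void increment(String group, String counter, long amount) {
        groups.computeIfAbsent(group, g -> new TreeMap<>())
              .merge(counter, amount, Long::sum);
    }

    // Enum form: reduced to the String form via the enum's class and field names.
    void increment(Enum<?> counter, long amount) {
        increment(counter.getDeclaringClass().getName(), counter.name(), amount);
    }

    long get(String group, String counter) {
        return groups.getOrDefault(group, Map.of()).getOrDefault(counter, 0L);
    }

    public static void main(String[] args) {
        DynamicCounterSketch counters = new DynamicCounterSketch();
        counters.increment(Temperature.MISSING, 1);
        counters.increment("TemperatureQuality", "1", 1);
        counters.increment("TemperatureQuality", "1", 1);
        System.out.println(counters.groups);
    }
}
```

The String form accepts any quality code that turns up in the data, whereas the enum form is limited to the fields declared at compile time.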
Readable counter names
By default, a counter’s name is the enum’s fully qualified Java classname. These
names are not very readable when they appear on the web UI, or in the console, so
Hadoop provides a way to change the display names using resource bundles. We’ve
done this here, so we see “Air Temperature Records” instead of
“Temperature$MISSING.” For dynamic counters, the group and counter names are
used for the display names, so this is not normally an issue.
The recipe to provide readable names is as follows. Create a properties file named
after the enum, using an underscore as a separator for nested classes. The
properties file should be in the same directory as the top-level class containing the
enum. The file is named MaxTemperatureWithCounters_Temperature.properties for
the counters in Example 8-1.
The properties file should contain a single property named CounterGroupName,
whose value is the display name for the whole group. Then each field in the enum
should have a corresponding property defined for it, whose name is the name of the
field suffixed with .name, and whose value is the display name for the counter. Here
are the contents of MaxTemperatureWithCounters_Temperature.properties:
CounterGroupName=Air Temperature Records
MISSING.name=Missing
MALFORMED.name=Malformed
Hadoop uses the standard Java localization mechanisms to load the correct
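properties for the locale you are running in, so, for example, you can create a
Chinese version of the properties in a file named
MaxTemperatureWithCounters_Temperature_zh_CN.properties, and they will be
used when running in the zh_CN locale. Refer to the documentation for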
java.util.PropertyResourceBundle for more information.
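As a rough illustration of the standard Java mechanism involved, the sketch below reads display names from a PropertyResourceBundle. It is not Hadoop's own lookup code: the class CounterDisplayNameSketch, the in-memory properties text, and the fallback to the raw field name are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.MissingResourceException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

final class CounterDisplayNameSketch {
    enum Temperature { MISSING, MALFORMED }

    // Look up "<FIELD>.name" in the bundle; fall back to the raw field name if absent.
    static String displayName(ResourceBundle bundle, Enum<?> counter) {
        try {
            return bundle.getString(counter.name() + ".name");
        } catch (MissingResourceException e) {
            return counter.name();
        }
    }

    // Stand-in for a properties file on the classpath (contents are illustrative).
    static ResourceBundle sampleBundle() {
        String properties =
            "CounterGroupName=Air Temperature Records\n" +
            "MISSING.name=Missing\n";
        try {
            return new PropertyResourceBundle(new StringReader(properties));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ResourceBundle bundle = sampleBundle();
        System.out.println(bundle.getString("CounterGroupName"));
        System.out.println(displayName(bundle, Temperature.MISSING));
        System.out.println(displayName(bundle, Temperature.MALFORMED));
    }
}
```

In a real job the bundle would be loaded by name from the classpath, which is where locale-specific variants of the file come into play.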
Retrieving counters
In addition to being available via the web UI and the command line (using hadoop
job -counter), you can retrieve counter values using the Java API. You can do this
while the job is running, although it is more usual to get counters at the end of a job
run, when they are stable. Example 8-2 shows a program that calculates the
proportion of records that have missing temperature fields.
Example 8-2. Application to calculate the proportion of records with missing
temperature fields
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class MissingTemperatureFields extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 1) {
      JobBuilder.printUsage(this, "<job ID>");
      return -1;
    }
    JobClient jobClient = new JobClient(new JobConf(getConf()));
    String jobID = args[0];
    RunningJob job = jobClient.getJob(JobID.forName(jobID));
    if (job == null) {
      System.err.printf("No job with ID %s found.\n", jobID);
      return -1;
    }
    if (!job.isComplete()) {
      System.err.printf("Job %s is not complete.\n", jobID);
      return -1;
    }

    Counters counters = job.getCounters();
    // User-defined counter, looked up by its enum
    long missing = counters.getCounter(
        MaxTemperatureWithCounters.Temperature.MISSING);
    // Built-in counter, looked up by group name and counter name
    long total = counters.findCounter("org.apache.hadoop.mapred.Task$Counter",
        "MAP_INPUT_RECORDS").getCounter();

    System.out.printf("Records with missing temperature fields: %.2f%%\n",
        100.0 * missing / total);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MissingTemperatureFields(), args);
    System.exit(exitCode);
  }
}
First we retrieve a RunningJob object from a JobClient, by calling the getJob()
method with the job ID. We check whether there is actually a job with the given ID.
There may not be, either because the ID was incorrectly specified or because the
jobtracker no longer has a reference to the job (only the last 100 jobs are kept in
memory, controlled by mapred.jobtracker.completeuserjobs.maximum, and all are
cleared out if the jobtracker is restarted).
After confirming that the job has completed, we call the RunningJob’s getCounters()
method, which returns a Counters object, encapsulating all the counters for a job.
The Counters class provides various methods for finding the names and values of
counters. We use the getCounter() method, which takes an enum, to find the number
of records that had a missing temperature field.
There are also findCounter() methods, all of which return a Counter object. We use
this form to retrieve the built-in counter for map input records. To do this, we refer to
the counter by its group name—the fully qualified Java classname for the enum—
and counter name (both strings).
Finally, we print the proportion of records that had a missing temperature field.
Here’s what we get for the whole weather dataset:
% hadoop jar hadoop-examples.jar MissingTemperatureFields
Records with missing temperature fields: 5.47%
User-Defined Streaming Counters
A Streaming MapReduce program can increment counters by sending a specially
formatted line to the standard error stream, which is co-opted as a control channel
in this case. The line must have the following format:
  reporter:counter:group,counter,amount
This snippet in Python shows how to increment the “Missing” counter in the “Temperature” group by one:
  sys.stderr.write("reporter:counter:Temperature,Missing,1\n")
In a similar way, a status message may be sent with a line formatted like this:
  reporter:status:message
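Because the control channel is just text on standard error, a Streaming program in any language can use it. The Java sketch below builds the two kinds of control line that Hadoop Streaming recognizes, reporter:counter:group,counter,amount and reporter:status:message; the class and method names are hypothetical helpers, not part of any Hadoop API.

```java
final class StreamingReporterSketch {
    // Control line that increments a counter: reporter:counter:<group>,<counter>,<amount>
    static String counterLine(String group, String counter, long amount) {
        return "reporter:counter:" + group + "," + counter + "," + amount;
    }

    // Control line that sets the task status: reporter:status:<message>
    static String statusLine(String message) {
        return "reporter:status:" + message;
    }

    public static void main(String[] args) {
        // Streaming reads these lines from the task's standard error stream.
        System.err.println(counterLine("Temperature", "Missing", 1));
        System.err.println(statusLine("processing input"));
    }
}
```

Each control line must be written on a line of its own, which is why the helpers return the text without a trailing newline and leave line termination to println().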
Note: The built-in counters’ enums are not currently a part of the public API, so
referring to them by group name and counter name, as Example 8-2 does, is the only
way to retrieve them. From release 0.21.0, counters are available via the
JobCounter and TaskCounter enums in the org.apache.hadoop.mapreduce package.
Sorting
The ability to sort data is at the heart of MapReduce. Even if your application isn’t
concerned with sorting per se, it may be able to use the sorting stage that
MapReduce provides to organize its data. In this section, we will examine different
ways of sorting datasets and how you can control the sort order in MapReduce.
We are going to sort the weather dataset by temperature. Storing temperatures as
Text objects doesn’t work for sorting purposes, since signed integers don’t sort
lexicographically. Instead, we are going to store the data using sequence files whose
IntWritable keys represent the temperature (and sort correctly), and whose Text
values are the lines of data.
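The problem with text keys is easy to reproduce in plain Java. The sketch below sorts the same temperatures once as strings and once as integers; SortOrderSketch is a hypothetical class, and Java String ordering is used as a stand-in for Text ordering, on the assumption that the two agree for ASCII digits and the minus sign.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SortOrderSketch {
    // Sort temperatures by their text form, character by character.
    static List<String> sortAsText(List<Integer> temps) {
        List<String> out = new ArrayList<>();
        for (int t : temps) {
            out.add(Integer.toString(t));
        }
        out.sort(Comparator.naturalOrder());
        return out;
    }

    // Sort the same temperatures by numeric value.
    static List<Integer> sortAsNumbers(List<Integer> temps) {
        List<Integer> out = new ArrayList<>(temps);
        out.sort(Comparator.naturalOrder());
        return out;
    }

    public static void main(String[] args) {
        List<Integer> temps = List.of(22, -10, 5, -1, 111);
        System.out.println("as text:    " + sortAsText(temps));
        System.out.println("as numbers: " + sortAsNumbers(temps));
    }
}
```

As text, "-1" sorts before "-10" and "111" before "22", so neither the negative nor the positive readings come out in temperature order.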
The MapReduce job in Example 8-3 is a map-only job that also filters the input to
remove records that don’t have a valid temperature reading. Each map creates a
single block-compressed sequence file as output. It is invoked with the following
command:
% hadoop jar hadoop-examples.jar SortDataPreprocessor input/ncdc/all \
  input/ncdc/all-seq
Example 8-3. A MapReduce program for transforming the weather data into
SequenceFile format
public class SortDataPreprocessor extends Configured implements Tool {
  static class CleanerMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private NcdcRecordParser parser = new NcdcRecordParser();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      parser.parse(value);
      if (parser.isValidTemperature()) {
        context.write(new IntWritable(parser.getAirTemperature()), value);
      }
    }
  }
  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }
    job.setMapperClass(CleanerMapper.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(0); // map-only job
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    return job.waitForCompletion(true) ? 0 : 1;
  }
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new SortDataPreprocessor(), args);
    System.exit(exitCode);
  }
}
Partial Sort
Example 8-4. A MapReduce program for sorting a SequenceFile with IntWritable
keys using the default HashPartitioner
public class SortByTemperatureUsingHashPartitioner extends
    Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    return job.waitForCompletion(true) ? 0 : 1;
  }
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(
        new SortByTemperatureUsingHashPartitioner(), args);
    System.exit(exitCode);
  }
}
Controlling Sort Order
The sort order for keys is controlled by a RawComparator, which is found as follows:
1. If the property mapred.output.key.comparator.class is set, either explicitly or by
   calling setSortComparatorClass() on Job, then an instance of that class is used. (In
   the old API the equivalent method is setOutputKeyComparatorClass() on JobConf.)
2. Otherwise, keys must be a subclass of WritableComparable, and the registered
   comparator for the key class is used.
3. If there is no registered comparator, then a RawComparator is used that
   deserializes the byte streams being compared into objects and delegates to the
   WritableComparable’s compareTo() method.
Suppose we run this program using 30 reducers:
% hadoop jar hadoop-examples.jar SortByTemperatureUsingHashPartitioner \
  -D mapred.reduce.tasks=30 input/ncdc/all-seq output-hashsort
This command produces 30 output files, each of which is sorted. However, there is
no easy way to combine the files (by concatenation, for example, in the case of
plain-text files) to produce a globally sorted file. For many applications, this doesn’t
matter. For example, having a partially sorted set of files is fine if you want to do
lookups.
An application: Partitioned MapFile lookups
To perform lookups by key, for instance, having multiple files works well. If we
change the output format to be a MapFileOutputFormat, as shown in Example 8-5,
then the output is 30 map files, which we can perform lookups against.
Example 8-5. A MapReduce program for sorting a SequenceFile and producing
MapFiles as output
public class SortByTemperatureToMapFile extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputFormatClass(MapFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    return job.waitForCompletion(true) ? 0 : 1;
  }
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new SortByTemperatureToMapFile(), args);
    System.exit(exitCode);
  }
}
MapFileOutputFormat provides a pair of convenience static methods for performing
lookups against MapReduce output; their use is shown in Example 8-6.
Example 8-6. Retrieve the first entry with a given key from a collection of MapFiles
public class LookupRecordByTemperature extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      JobBuilder.printUsage(this, "<path> <key>");
      return -1;
    }
    Path path = new Path(args[0]);
    IntWritable key = new IntWritable(Integer.parseInt(args[1]));

    Reader[] readers = MapFileOutputFormat.getReaders(path, getConf());
    Partitioner<IntWritable, Text> partitioner =
        new HashPartitioner<IntWritable, Text>();
    Text val = new Text();
    Writable entry =
        MapFileOutputFormat.getEntry(readers, partitioner, key, val);
    if (entry == null) {
      System.err.println("Key not found: " + key);
      return -1;
    }
    NcdcRecordParser parser = new NcdcRecordParser();
    parser.parse(val.toString());
    System.out.printf("%s\t%s\n", parser.getStationId(), parser.getYear());
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new LookupRecordByTemperature(), args);
    System.exit(exitCode);
  }
}
The getReaders() method opens a MapFile.Reader for each of the output files
created by the MapReduce job. The getEntry() method then uses the partitioner to
choose the reader for the key and finds the value for that key by calling Reader’s
get() method. If getEntry() returns null, it means no matching key was found.
Otherwise, it returns the value, which we translate into a station ID and year.
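The reason a lookup needs to open only one file can be shown with a framework-free sketch. Here each TreeMap stands in for one sorted output file, and partitionFor() uses the usual hash-partitioning formula (non-negative hash code modulo the number of partitions). PartitionedLookupSketch and its methods are hypothetical, and the sketch keeps only one value per key for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class PartitionedLookupSketch {
    // Hash partitioning: non-negative hash code modulo the number of partitions.
    static int partitionFor(Integer key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Write each record to the sorted "file" chosen by the partitioner.
    static List<TreeMap<Integer, String>> write(int[] keys, String[] values, int numPartitions) {
        List<TreeMap<Integer, String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new TreeMap<>());
        }
        for (int i = 0; i < keys.length; i++) {
            parts.get(partitionFor(keys[i], numPartitions)).put(keys[i], values[i]);
        }
        return parts;
    }

    // A lookup consults only the one partition the key could have been written to.
    static String lookup(List<TreeMap<Integer, String>> parts, int key) {
        return parts.get(partitionFor(key, parts.size())).get(key);
    }

    public static void main(String[] args) {
        int[] keys = { -100, 0, 250 };
        String[] values = { "record A", "record B", "record C" };
        List<TreeMap<Integer, String>> parts = write(keys, values, 3);
        System.out.println(lookup(parts, -100));
        System.out.println(lookup(parts, 7));
    }
}
```

The lookup is only correct if it uses the same partitioning function and the same number of partitions as the job that wrote the files.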
To see this in action, let’s find the first entry for a temperature of –10°C (remember
that temperatures are stored as integers representing tenths of a degree, which is
why we ask for a temperature of –100):
% hadoop jar hadoop-examples.jar LookupRecordByTemperature output-hashmapsort -100
We can also use the readers directly, in order to get all the records for a given key.
The array of readers that is returned is ordered by partition, so that the reader for a
given key may be found using the same partitioner that was used in the MapReduce
job.
Example 8-7. Retrieve all entries with a given key from a collection of MapFiles
public class LookupRecordsByTemperature extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      JobBuilder.printUsage(this, "<path> <key>");
      return -1;
    }
    Path path = new Path(args[0]);
    IntWritable key = new IntWritable(Integer.parseInt(args[1]));

    Reader[] readers = MapFileOutputFormat.getReaders(path, getConf());
    Partitioner<IntWritable, Text> partitioner =
        new HashPartitioner<IntWritable, Text>();
    Text val = new Text();

    // Pick the reader for the partition that the key was written to
    Reader reader = readers[partitioner.getPartition(key, val, readers.length)];
    Writable entry = reader.get(key, val);
    if (entry == null) {
      System.err.println("Key not found: " + key);
      return -1;
    }
    NcdcRecordParser parser = new NcdcRecordParser();
    IntWritable nextKey = new IntWritable();
    do {
      parser.parse(val.toString());
      System.out.printf("%s\t%s\n", parser.getStationId(), parser.getYear());
    } while (reader.next(nextKey, val) && key.equals(nextKey));
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new LookupRecordsByTemperature(), args);
    System.exit(exitCode);
  }
}
And here is a sample run to retrieve all readings of –10°C and count them:
% hadoop jar hadoop-examples.jar LookupRecordsByTemperature output-hashmapsort -100 \
  2> /dev/null | wc -l
Total Sort
How can you produce a globally sorted file using Hadoop? The naive answer is to
use a single partition. But this is incredibly inefficient for large files, since one
machine has to process all of the output, so you are throwing away the benefits of
the parallel architecture that MapReduce provides.
Instead, it is possible to produce a set of sorted files that, if concatenated, would
form a globally sorted file. The secret to doing this is to use a partitioner that
respects the total order of the output. For example, if we had four partitions, we could
put keys for temperatures less than –10°C in the first partition, those between –10°C
and 0°C in the second, those between 0°C and 10°C in the third, and those over
10°C in the fourth.
Although this approach works, you have to choose your partition sizes carefully to
ensure that they are fairly even so that job times aren’t dominated by a single
reducer. For the partitioning scheme just described, the relative sizes of the
partitions are as follows:
Temperature range    Proportion of records
< –10°C
[–10°C, 0°C)
[0°C, 10°C)
>= 10°C
These partitions are not very even. To construct more even partitions, we need to
have a better understanding of the temperature distribution for the whole dataset. It’s
fairly easy to write a MapReduce job to count the number of records that fall into a
collection of temperature buckets. For example, Figure 8-1 shows the distribution for
buckets of size 1°C, where each point on the plot corresponds to one bucket.
Figure 8-1. Temperature distribution for the weather dataset
While we could use this information to construct a very even set of partitions, the fact
that we needed to run a job that used the entire dataset to construct them is not
ideal. It’s possible to get a fairly even set of partitions by sampling the key space.
The idea behind sampling is that you look at a small subset of the keys to
approximate the key distribution, which is then used to construct partitions. Luckily,
we don’t have to write the code to do this ourselves, as Hadoop comes with a
selection of samplers.
The InputSampler class defines a nested Sampler interface whose implementations
return a sample of keys given an InputFormat and Job:
public interface Sampler<K, V> {
  K[] getSample(InputFormat<K, V> inf, Job job)
      throws IOException, InterruptedException;
}
This interface is not usually called directly by clients. Instead, the writePartitionFile()
static method on InputSampler is used, which creates a sequence file to store the
keys that define the partitions:
public static <K, V> void writePartitionFile(Job job, Sampler<K, V> sampler)
    throws IOException, ClassNotFoundException, InterruptedException
The sequence file is used by TotalOrderPartitioner to create partitions for the sort
job. Example 8-8 puts it all together.
Example 8-8. A MapReduce program for sorting a SequenceFile with IntWritable
keys using the TotalOrderPartitioner to globally sort the data
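public class SortByTemperatureUsingTotalOrderPartitioner extends
    Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    job.setPartitionerClass(TotalOrderPartitioner.class);

    InputSampler.Sampler<IntWritable, Text> sampler =
        new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);
    Path input = FileInputFormat.getInputPaths(job)[0];
    input = input.makeQualified(input.getFileSystem(getConf()));
    Path partitionFile = new Path(input, "_partitions");
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
    InputSampler.writePartitionFile(job, sampler);

    // Add to DistributedCache
    URI partitionUri = new URI(partitionFile.toString() + "#_partitions");
    DistributedCache.addCacheFile(partitionUri, job.getConfiguration());
    DistributedCache.createSymlink(job.getConfiguration());
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(
        new SortByTemperatureUsingTotalOrderPartitioner(), args);
    System.exit(exitCode);
  }
}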
We use a RandomSampler, which chooses keys with a uniform probability—here,
0.1. There are also parameters for the maximum number of samples to take and the
maximum number of splits to sample (here, 10,000 and 10, respectively; these
settings are the defaults when InputSampler is run as an application), and the
sampler stops when the first of these limits is met. Samplers run on the client,
making it important to limit the number of splits that are downloaded, so the sampler
runs quickly. In practice, the time taken to run the sampler is a small fraction of the
overall job time.
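The idea of sampling keys, deriving split points, and then routing each key to a partition can be sketched in a few lines of plain Java. This is a simplification and not the InputSampler or TotalOrderPartitioner code: TotalOrderSketch, its method names, and the fixed random seed are assumptions, and the sampling loop simply stops at the sample limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class TotalOrderSketch {
    // Keep each key with probability freq, stopping once maxSamples keys are collected.
    static List<Integer> sample(List<Integer> keys, double freq, int maxSamples, long seed) {
        Random random = new Random(seed);
        List<Integer> picked = new ArrayList<>();
        for (Integer key : keys) {
            if (picked.size() >= maxSamples) {
                break;
            }
            if (random.nextDouble() < freq) {
                picked.add(key);
            }
        }
        return picked;
    }

    // Choose numPartitions - 1 split points at evenly spaced ranks of the sorted sample.
    // Assumes a non-empty sample.
    static int[] boundaries(List<Integer> sample, int numPartitions) {
        List<Integer> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        int[] splits = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted.get(i * sorted.size() / numPartitions);
        }
        return splits;
    }

    // Partition i holds keys below splits[i]; the last partition takes the rest.
    // Assumes the split points are distinct and in ascending order.
    static int partitionFor(int key, int[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        Random data = new Random(1);
        for (int i = 0; i < 1000; i++) {
            keys.add(data.nextInt(600) - 300); // temperatures in tenths of a degree
        }
        int[] splits = boundaries(sample(keys, 0.1, 100, 42L), 4);
        System.out.println("split points: " + Arrays.toString(splits));
        System.out.println("partition for -100: " + partitionFor(-100, splits));
    }
}
```

Because every key in partition i is below every key in partition i + 1, sorting each partition independently and concatenating the results gives a globally sorted output.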
The partition file that InputSampler writes is called _partitions, which we have set to
be in the input directory (it will not be picked up as an input file since it starts with an
underscore). To share the partition file with the tasks running on the cluster, we add
it to the distributed cache (see “Distributed Cache”).
On one run, the sampler chose –5.6°C, 13.9°C, and 22.0°C as partition boundaries
(for four partitions), which translates into more even partition sizes than the earlier
choice of partitions:
Temperature range    Proportion of records
< –5.6°C
[–5.6°C, 13.9°C)
[13.9°C, 22.0°C)
>= 22.0°C
Your input data determines the best sampler for you to use. For example,
SplitSampler, which samples only the first n records in a split, is not so good for
sorted data because it doesn’t select keys from throughout the split.
On the other hand, IntervalSampler chooses keys at regular intervals through the
split and makes a better choice for sorted data. RandomSampler is a good
general-purpose sampler. If none of these suits your application (and remember that
the point of sampling is to produce partitions that are approximately equal in size),
you can write your own implementation of the Sampler interface.
One of the nice properties of InputSampler and TotalOrderPartitioner is that you are
free to choose the number of partitions. This choice is normally driven by the number
of reducer slots in your cluster (choose a number slightly fewer than the total, to
allow for failures). However, TotalOrderPartitioner will work only if the partition
boundaries are distinct: one problem with choosing a high number is that you may
get collisions if you have a small key space.
Here’s how we run it:
% hadoop jar hadoop-examples.jar SortByTemperatureUsingTotalOrderPartitioner \
  -D mapred.reduce.tasks=30 input/ncdc/all-seq output-totalsort
The program produces 30 output partitions, each of which is internally sorted; in
addition, for these partitions, all the keys in partition i are less than the keys in
partition i + 1.
Joins
If the join is performed by the mapper, it is called a map-side join, whereas if it is
performed by the reducer it is called a reduce-side join. If both datasets are too
large for either to be copied to each node in the cluster, then we can still join them
using MapReduce with a map-side or reduce-side join, depending on how the data is
structured. One common example of this case is a user database and a log of some
user activity (such as access logs). For a popular service, it is not feasible to
distribute the user database (or the logs) to all the MapReduce nodes.
Map-Side Joins
A map-side join between large inputs works by performing the join before the data
reaches the map function. For this to work, though, the inputs to each map must be
partitioned and sorted in a particular way. Each input dataset must be divided into
the same number of partitions, and it must be sorted by the same key (the join key)
in each source. All the records for a particular key must reside in the same partition.
This may sound like a strict requirement (and it is), but it actually fits the description
of the output of a MapReduce job.
Figure 8-2. Inner join of two datasets
A map-side join can be used to join the outputs of several jobs that had the same
number of reducers, the same keys, and output files that are not splittable (by being
smaller than an HDFS block, or by virtue of being gzip compressed, for example). In
the context of the weather example, if we ran a partial sort on the stations file by
station ID, and another, identical sort on the records, again by station ID, and with
the same number of reducers, then the two outputs would satisfy the conditions for
running a map-side join.
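The sorting and partitioning requirements exist because they let matching partitions be joined with a single merge pass. The sketch below shows such a pass over two key-sorted lists in plain Java. It is not CompositeInputFormat: MergeJoinSketch is a hypothetical class, each row is a {key, value} pair, and keys are assumed to be unique on the left side (one row per station, say).

```java
import java.util.ArrayList;
import java.util.List;

final class MergeJoinSketch {
    // Inner join of two inputs that are both sorted by key; each row is {key, value}.
    // Assumes keys are unique on the left side.
    static List<String> join(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int l = 0;
        int r = 0;
        while (l < left.size() && r < right.size()) {
            int cmp = left.get(l)[0].compareTo(right.get(r)[0]);
            if (cmp < 0) {
                l++; // left key has no further matches on the right
            } else if (cmp > 0) {
                r++; // right key has no match on the left
            } else {
                out.add(left.get(l)[0] + "\t" + left.get(l)[1] + "\t" + right.get(r)[1]);
                r++; // stay on the left row: later right rows may share the key
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> stations = List.of(
            new String[] { "A", "alpha" },
            new String[] { "B", "beta" });
        List<String[]> records = List.of(
            new String[] { "A", "1" },
            new String[] { "A", "2" },
            new String[] { "C", "3" });
        join(stations, records).forEach(System.out::println);
    }
}
```

Neither input is held in memory as a whole, and no record is compared more than a constant number of times. If either input were unsorted, or if a key's records were spread over different partitions, the single pass would miss matches.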
Use a CompositeInputFormat from the org.apache.hadoop.mapreduce.join package
to run a map-side join. The input sources and join type (inner or outer) for
CompositeInputFormat are configured through a join expression that is written
according to a simple grammar. The package documentation has details and
examples.
The org.apache.hadoop.examples.Join example is a general-purpose command-line
program for running a map-side join, since it allows you to run a MapReduce job for
any specified mapper and reducer over multiple inputs that are joined with a given
join operation.
Reduce-Side Joins
A reduce-side join is more general than a map-side join, in that the input datasets
don’t have to be structured in any particular way, but it is less efficient as both
datasets have to go through the MapReduce shuffle. The basic idea is that the
mapper tags each record with its source and uses the join key as the map output
key, so that the records with the same key are brought together in the reducer. We
use several ingredients to make this work in practice:
Multiple inputs
The input sources for the datasets have different formats, in general, so it is very
convenient to use the MultipleInputs class (see “Multiple Inputs”) to separate the
logic for parsing and tagging each source.
Secondary sort
As described, the reducer will see the records from both sources that have the same
key, but they are not guaranteed to be in any particular order. However, to perform
the join, it is important to have the data from one source before another. For the
weather data join, the station record must be the first of the values seen for each
key, so the reducer can fill in the weather records with the station name and emit
them straightaway. Of course, it would be possible to receive the records in any
order if we buffered them in memory, but this should be avoided, since the number of
records in any group may be very large and exceed the amount of memory available to the reducer.
To tag each record, we use TextPair from Chapter 4 for the keys, to store the station
ID, and the tag. The only requirement for the tag values is that they sort in such a
way that the station records come before the weather records. This can be achieved
by tagging station records as 0 and weather records as 1. The mapper classes to do
this are shown in Examples 8-12 and 8-13.
Example 8-12. Mapper for tagging station records for a reduce-side join
public class JoinStationMapper
    extends Mapper<LongWritable, Text, TextPair, Text> {
  private NcdcStationMetadataParser parser = new NcdcStationMetadataParser();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (parser.parse(value)) {
      // Tag "0" makes station records sort before weather records
      context.write(new TextPair(parser.getStationId(), "0"),
          new Text(parser.getStationName()));
    }
  }
}
Example 8-13. Mapper for tagging weather records for a reduce-side join
public class JoinRecordMapper
    extends Mapper<LongWritable, Text, TextPair, Text> {
  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    parser.parse(value);
    // Tag "1" makes weather records sort after the station record
    context.write(new TextPair(parser.getStationId(), "1"), value);
  }
}
The reducer knows that it will receive the station record first, so it extracts its name
from the value and writes it out as a part of every output record (Example 8-14).
Example 8-14. Reducer for joining tagged station records with tagged weather
records
public class JoinReducer extends Reducer<TextPair, Text, Text, Text> {

  @Override
  protected void reduce(TextPair key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Iterator<Text> iter = values.iterator();
    // Copy the first value (the station name); the framework reuses the object
    Text stationName = new Text(iter.next());
    while (iter.hasNext()) {
      Text record = iter.next();
      Text outValue = new Text(stationName.toString() + "\t" + record.toString());
      context.write(key.getFirst(), outValue);
    }
  }
}
The code assumes that every station ID in the weather records has exactly one
matching record in the station dataset. If this were not the case, we would need to
generalize the code to put the tag into the value objects, by using another TextPair.
The reduce() method would then be able to tell which entries were station names
and detect (and handle) missing or duplicate entries, before processing the weather
records.
Because objects in the reducer’s values iterator are re-used (for efficiency
purposes), it is vital that the code makes a copy of the first Text object from
the values iterator:
Text stationName = new Text(iter.next());
If the copy is not made, then the stationName reference will refer to the value
just read when it is turned into a string, which is a bug.
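The effect of object re-use is easy to reproduce outside Hadoop. In the sketch below, a hypothetical iterator hands back the same mutable StringBuilder on every call, which stands in for the reused Text object; ReusedValueSketch and its method names are illustrative only.

```java
import java.util.Iterator;
import java.util.List;

final class ReusedValueSketch {
    // Iterator that returns the same mutable holder each time, overwriting its contents.
    static Iterator<StringBuilder> reusingIterator(List<String> records) {
        StringBuilder holder = new StringBuilder();
        Iterator<String> source = records.iterator();
        return new Iterator<StringBuilder>() {
            @Override
            public boolean hasNext() {
                return source.hasNext();
            }

            @Override
            public StringBuilder next() {
                holder.setLength(0);
                holder.append(source.next());
                return holder;
            }
        };
    }

    // Returns {first value kept by reference, first value kept by copy} after reading on.
    // Assumes at least one record.
    static String[] firstAfterAdvancing(List<String> records) {
        Iterator<StringBuilder> iter = reusingIterator(records);
        StringBuilder first = iter.next();
        StringBuilder byReference = first;               // bug: aliases the shared holder
        StringBuilder byCopy = new StringBuilder(first); // fix: private copy of the contents
        while (iter.hasNext()) {
            iter.next();
        }
        return new String[] { byReference.toString(), byCopy.toString() };
    }

    public static void main(String[] args) {
        String[] result = firstAfterAdvancing(List.of("station name", "record 1", "record 2"));
        System.out.println("by reference: " + result[0]);
        System.out.println("by copy:      " + result[1]);
    }
}
```

The reference ends up showing the last record read, while the copy still holds the station name.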
Tying the job together is the driver class, shown in Example 8-15. The essential point
is that we partition and group on the first part of the key, the station ID, which we do
with a custom Partitioner (KeyPartitioner) and a custom group comparator,
FirstComparator (from TextPair).
Example 8-15. Application to join weather records with station names
public class JoinRecordWithStationName extends Configured implements Tool {

  public static class KeyPartitioner extends Partitioner<TextPair, Text> {
    @Override
    public int getPartition(TextPair key, Text value, int numPartitions) {
      return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 3) {
      JobBuilder.printUsage(this, "<ncdc input> <station input> <output>");
      return -1;
    }
    Job job = new Job(getConf(), "Join weather records with station names");
    job.setJarByClass(getClass());

    Path ncdcInputPath = new Path(args[0]);
    Path stationInputPath = new Path(args[1]);
    Path outputPath = new Path(args[2]);
    MultipleInputs.addInputPath(job, ncdcInputPath,
        TextInputFormat.class, JoinRecordMapper.class);
    MultipleInputs.addInputPath(job, stationInputPath,
        TextInputFormat.class, JoinStationMapper.class);
    FileOutputFormat.setOutputPath(job, outputPath);

    // Partition and group on the station ID only (the first part of the key)
    job.setPartitionerClass(KeyPartitioner.class);
    job.setGroupingComparatorClass(TextPair.FirstComparator.class);
    job.setMapOutputKeyClass(TextPair.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new JoinRecordWithStationName(), args);
    System.exit(exitCode);
  }
}
Running the program on the sample data yields the following output:
Side Data Distribution
Side data can be defined as extra read-only data needed by a job to process the
main dataset. The challenge is to make side data available to all the map or reduce
tasks (which are spread across the cluster) in a convenient and efficient fashion.
In addition to the distribution mechanisms described in this section, it is possible to
cache side-data in memory in a static field, so that tasks of the same job that run in
succession on the same tasktracker can share the data. “Task JVM Re-use” on page
216 describes how to enable this feature. If you take this approach, be aware of the
amount of memory that you are using, as it might affect the memory needed by the
shuffle (see “Shuffle and Sort”).
Using the Job Configuration
You can set arbitrary key-value pairs in the job configuration using the various setter
methods on Configuration (or JobConf in the old MapReduce API). This is very
useful if you need to pass a small piece of metadata to your tasks.
In the task you can retrieve the data from the configuration returned by Context’s
getConfiguration() method. (In the old API, it’s a little more involved: override the
configure() method in the Mapper or Reducer and use a getter method on the
JobConf object passed in to retrieve the data. It’s very common to store the data in
an instance field so it can be used in the map() or reduce() method.)
Usually, a primitive type is sufficient to encode your metadata, but for arbitrary
objects you can either handle the serialization yourself (if you have an existing
mechanism for turning objects to strings and back), or you can use Hadoop’s
Stringifier class. DefaultStringifier uses Hadoop’s serialization framework to serialize.
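To illustrate the "handle the serialization yourself" option, the sketch below turns an object into a string that could be stored as a configuration value and reads it back. It deliberately uses plain Java serialization and Base64 rather than Hadoop's serialization framework or the Stringifier classes; StringifySketch and its method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

final class StringifySketch {
    // Serialize the object and encode the bytes as text suitable for a property value.
    static String toConfigString(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Decode the text and deserialize the object on the task side.
    static Object fromConfigString(String encoded) {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ArrayList<String> stations = new ArrayList<>(List.of("station-1", "station-2"));
        String encoded = toConfigString(stations);
        System.out.println(encoded.length() + " characters");
        System.out.println(fromConfigString(encoded));
    }
}
```

The encoded string grows with the object, so this only suits small pieces of metadata.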
You shouldn’t use this mechanism for transferring more than a few kilobytes of data
because it can put pressure on the memory usage in the Hadoop daemons,
particularly in a system running hundreds of jobs. The job configuration is read by
the jobtracker, the tasktracker, and the child JVM, and each time the configuration
is read, all of its entries are read into memory, even if they are not used. User
properties are not used by the jobtracker or the tasktracker, so they just waste time
and memory.
Distributed Cache
Rather than serializing side data in the job configuration, it is preferable to distribute
datasets using Hadoop’s distributed cache mechanism. This provides a service for
copying files and archives to the task nodes in time for the tasks to use them when
they run. To save network bandwidth, files are normally copied to any particular
node once per job.
For tools that use GenericOptionsParser (this includes many of the programs in this
book; see “GenericOptionsParser, Tool, and ToolRunner”), you can specify the
files to be distributed as a comma-separated list of URIs as the argument to the
-files option. Files can be on the local filesystem, on HDFS, or on another
Hadoop-readable filesystem (such as S3). If no scheme is supplied, then the files are
assumed to be local. (This is true even if the default filesystem is not the local
filesystem.)
You can also copy archive files (JAR files, ZIP files, tar files, and gzipped tar files)
to your tasks, using the -archives option; these are unarchived on the task node.
The -libjars option will add JAR files to the classpath of the mapper and reducer
tasks. This is useful if you haven’t bundled library JAR files in your job JAR file.
Streaming doesn’t use the distributed cache for copying the streaming scripts
across the cluster. You specify a file to be copied using the -file option (note the
singular), which should be repeated for each file to be copied. Furthermore, files
specified using the -file option must be file paths only, not URIs, so they must be
accessible from the local filesystem of the client launching the Streaming job.
Streaming also accepts the -files and -archives options for copying files into the
distributed cache for use by your Streaming scripts.
Let’s see how to use the distributed cache to share a metadata file for station
names. The command we will run is:
input/ncdc/metadata/stations-fixed-width.txt input/ncdc/all output
This command will copy the local file stations-fixed-width.txt (no scheme is supplied,
so the path is automatically interpreted as a local file) to the task nodes, so we can
use it to look up station names. The listing for
MaxTemperatureByStationNameUsingDistributedCacheFile appears in Example 8-16.
Example 8-16. Application to find the maximum temperature by station, showing
station names from a lookup table passed as a distributed cache file
public class MaxTemperatureByStationNameUsingDistributedCacheFile
    extends Configured implements Tool {

  // Emits (station ID, temperature) pairs for valid readings.
  static class StationTemperatureMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private NcdcRecordParser parser = new NcdcRecordParser();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      parser.parse(value);
      if (parser.isValidTemperature()) {
        context.write(new Text(parser.getStationId()),
            new IntWritable(parser.getAirTemperature()));
      }
    }
  }

  // Finds the maximum temperature and replaces the station ID with its name.
  static class MaxTemperatureReducerWithStationLookup
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private NcdcStationMetadata metadata;

    @Override
    protected void setup(Context context)
        throws IOException, InterruptedException {
      // The cache file is symlinked into the task's working directory.
      metadata = new NcdcStationMetadata();
      metadata.initialize(new File("stations-fixed-width.txt"));
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      String stationName = metadata.getStationName(key.toString());
      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(new Text(stationName), new IntWritable(maxValue));
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(StationTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducerWithStationLookup.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(
        new MaxTemperatureByStationNameUsingDistributedCacheFile(), args);
    System.exit(exitCode);
  }
}
The program finds the maximum temperature by weather station, so the mapper
(StationTemperatureMapper) simply emits (station ID, temperature) pairs. For the
combiner, we reuse MaxTemperatureReducer (from Chapters 2 and 5) to pick the
maximum temperature for any given group of map outputs on the map side. The reducer (MaxTemperatureReducerWithStationLookup) is different from the combiner,
since in addition to finding the maximum temperature, it uses the cache file to look
up the station name. We use the reducer’s setup() method to retrieve the cache file
using its original name, relative to the working directory of the task. You can use the
distributed cache for copying files that do not fit in memory. Here’s a snippet of the
output, showing some maximum temperatures for a few weather stations:
How it works
When you launch a job, Hadoop copies the files specified by the -files, -archives and
-libjars options to the jobtracker’s filesystem (normally HDFS). Then, before a task is
run, the tasktracker copies the files from the jobtracker’s filesystem to a local disk—
the cache—so the task can access the files. The files are said to be localized at this
point. From the task’s point of view, the files are just there (and it doesn’t care that
they came from HDFS). In addition, files specified by -libjars are added to the task’s
classpath before it is launched.
The tasktracker also maintains a reference count for the number of tasks using each
file in the cache. Before the task has run, the file’s reference count is incremented by
one; then after the task has run, the count is decreased by one. Only when the count
reaches zero is it eligible for deletion, since no tasks are using it. Files are deleted to
make room for a new file when the cache exceeds a certain size—10 GB by default.
The cache size may be changed by setting the configuration property
local.cache.size, which is measured in bytes.
Although this design doesn’t guarantee that subsequent tasks from the same job
running on the same tasktracker will find the file in the cache, it is very likely that
they will, since tasks from a job are usually scheduled to run at around the same
time, so there isn’t the opportunity for enough other jobs to run and cause the
original task’s file to be deleted from the cache.
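The reference-counting scheme described above can be sketched as a toy model. This is not the tasktracker's actual code: the class RefCountedCache, its method names, and the oldest-first eviction order are assumptions made for illustration. The one rule taken from the text is that a file becomes eligible for deletion only when its count is zero and the cache is over its size limit.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of a reference-counted file cache (hypothetical, for illustration only).
public class RefCountedCache {

  private static final class CachedFile {
    final long size;
    int refCount;
    CachedFile(long size) { this.size = size; }
  }

  private final long capacityBytes;
  private long usedBytes;
  // Insertion-ordered, so the oldest localized files are considered first.
  private final Map<String, CachedFile> files = new LinkedHashMap<>();

  public RefCountedCache(long capacityBytes) {
    this.capacityBytes = capacityBytes;
  }

  /** Before a task runs: localize the file if needed and increment its count. */
  public void acquire(String name, long size) {
    CachedFile f = files.get(name);
    if (f == null) {
      f = new CachedFile(size);
      files.put(name, f);
      usedBytes += size;
    }
    f.refCount++;
    evictIfOverCapacity();
  }

  /** After a task has run: decrement the count; the file stays until space is needed. */
  public void release(String name) {
    CachedFile f = files.get(name);
    if (f != null && f.refCount > 0) {
      f.refCount--;
    }
  }

  /** Deletes unreferenced files, oldest first, until the cache fits its limit. */
  private void evictIfOverCapacity() {
    Iterator<Map.Entry<String, CachedFile>> it = files.entrySet().iterator();
    while (usedBytes > capacityBytes && it.hasNext()) {
      CachedFile f = it.next().getValue();
      if (f.refCount == 0) {
        usedBytes -= f.size;
        it.remove();
      }
    }
  }

  public boolean contains(String name) {
    return files.containsKey(name);
  }

  public static void main(String[] args) {
    RefCountedCache cache = new RefCountedCache(10);
    cache.acquire("stations.txt", 6);
    cache.release("stations.txt");          // count drops to zero, file is kept
    System.out.println(cache.contains("stations.txt"));
    cache.acquire("other.txt", 6);          // over the limit: unreferenced file is evicted
    System.out.println(cache.contains("stations.txt"));
  }
}
```

Note that a released file is not deleted immediately; it only becomes a candidate, which is why later tasks from the same job usually still find it.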
Files are localized under the ${mapred.local.dir}/taskTracker/archive directory on the
tasktrackers. Applications don’t have to know this, however, since the files are
symbolically linked from the task’s working directory.
The distributed cache API
Most applications don’t need to use the distributed cache API because they can use
the cache via GenericOptionsParser, as we saw in Example 8-16. However, some
applications may need to use more advanced features of the distributed cache, and
for this they can use its API directly. The API is in two parts: methods for putting data
into the cache (found in Job), and methods for retrieving data from the cache (found
in JobContext). Here are the pertinent methods in Job for putting data into the cache:
public void addCacheFile(URI uri)
public void addCacheArchive(URI uri)
public void setCacheFiles(URI[] files)
public void setCacheArchives(URI[] archives)
public void addFileToClassPath(Path file)
public void addArchiveToClassPath(Path archive)
public void createSymlink()
If you are using the old MapReduce API the same methods can be found in
Recall that there are two types of object that can be placed in the cache: files and
archives. Files are left intact on the task node, while archives are unarchived on the
task node. For each type of object, there are three methods: an addCacheXXXX()
method to add the file or archive to the distributed cache, a setCacheXXXXs()
method to set the entire list of files or archives to be added to the cache in a single
call (replacing those set in any previous calls), and an addXXXXToClassPath() to
add the file or archive to the MapReduce task’s classpath. Table 8-7 compares
these API methods to the GenericOptionsParser options described in Table 5-1.
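Table 8-7. Distributed cache API

Job API method: addCacheFile(URI uri), setCacheFiles(URI[] files)
GenericOptionsParser equivalent: -files
Description: Add files to the distributed cache to be copied to the task node.

Job API method: addCacheArchive(URI uri), setCacheArchives(URI[] archives)
GenericOptionsParser equivalent: -archives
Description: Add archives to the distributed cache to be copied to the task node and unarchived there.

Job API method: addFileToClassPath(Path file)
GenericOptionsParser equivalent: -libjars
Description: Add files to the distributed cache to be added to the MapReduce task’s classpath. The files are not unarchived, so this is a useful way to add JAR files to the classpath.

Job API method: addArchiveToClassPath(Path archive)
GenericOptionsParser equivalent: None
Description: Add archives to the distributed cache to be unarchived and added to the MapReduce task’s classpath. This can be useful when you want to add a directory of files to the classpath, since you can create an archive containing the files, although you can equally well create a JAR file and use addFileToClassPath().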
The URIs referenced in the add() or set() methods must be files in a shared
filesystem that exist when the job is run. On the other hand, the files specified as a
GenericOptionsParser option (e.g., -files) may refer to a local file, in which case they
get copied to the default shared filesystem (normally HDFS) on your behalf. This is
the key difference between using the Java API directly and using
GenericOptionsParser: the Java API does not copy the file specified in the add() or
set() method to the shared filesystem, whereas the GenericOptionsParser does.
The remaining distributed cache API method on Job is createSymlink(), which
creates symbolic links for all the files for the current job when they are localized on
the task node. The symbolic link name is set by the fragment identifier of the file’s
URI. For example, the file specified by the URI hdfs://namenode/foo/bar#myfile is
symlinked as myfile in the task’s working directory. (There’s an example of using this
API in Example 8-8.) If there is no fragment identifier, then no symbolic link is
created. Files added to the distributed cache using GenericOptionsParser are
automatically symlinked.
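The fragment rule is easy to check with the JDK's own java.net.URI class, independently of Hadoop. In this small sketch, the class SymlinkName and the method symlinkNameFor() are hypothetical names; the method simply returns the fragment identifier, which is the name the text says the symlink receives, or null when the URI has no fragment.

```java
import java.net.URI;

// Hypothetical helper showing how a symlink name follows from a cache URI.
public class SymlinkName {

  /** Returns the URI's fragment identifier, or null if there is none. */
  public static String symlinkNameFor(String cacheUri) {
    return URI.create(cacheUri).getFragment();
  }

  public static void main(String[] args) {
    // Fragment present: the file would be symlinked as "myfile".
    System.out.println(symlinkNameFor("hdfs://namenode/foo/bar#myfile"));
    // No fragment: null, so no symbolic link would be created.
    System.out.println(symlinkNameFor("hdfs://namenode/foo/bar"));
  }
}
```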
Symbolic links are not created for files in the distributed cache when using the local
job runner, so for this reason you may choose to use the getLocalCacheFiles() and
getLocalCacheArchives() methods (discussed below) if you want your jobs to work
both locally and on a cluster.
The second part of the distributed cache API is found on JobContext, and it is used
from the map or reduce task code when you want to access files from the distributed cache:
public Path[] getLocalCacheFiles() throws IOException;
public Path[] getLocalCacheArchives() throws IOException;
public Path[] getFileClassPaths();
public Path[] getArchiveClassPaths();
If the files from the distributed cache have symbolic links in the task’s working
directory, then you can access the localized file directly by name, as we did in
Example 8-16. It’s also possible to get a reference to files and archives in the cache
using the getLocalCacheFiles() and getLocalCacheArchives() methods.
In the case of archives, the paths returned are to the directory containing the
unarchived files. (For completeness, you can also retrieve the files and archives
added to the task classpath via getFileClassPaths() and getArchiveClassPaths().)
Note that files are returned as local Path objects. To read the files you can use a
Hadoop local FileSystem instance, retrieved using its getLocal() method.
Alternatively, you can use the java.io.File API, as shown in this updated setup()
method for MaxTemperatureReducerWithStationLookup:
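@Override
protected void setup(Context context)
    throws IOException, InterruptedException {
  metadata = new NcdcStationMetadata();
  Path[] localPaths = context.getLocalCacheFiles();
  if (localPaths == null || localPaths.length == 0) {
    throw new FileNotFoundException("Distributed cache file not found.");
  }
  // Localized paths are plain local paths, so build the File from the
  // path string; new File(URI) rejects a URI that has no file: scheme.
  File localFile = new File(localPaths[0].toString());
  metadata.initialize(localFile);
}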
MapReduce Library Classes
Hadoop comes with a library of mappers and reducers for commonly used functions.
Setting Up a Hadoop Cluster
This chapter explains how to set up Hadoop to run on a cluster of machines.
Running HDFS and MapReduce on a single machine is great for learning about
these systems, but to do useful work they need to run on multiple nodes.
There are a few options when it comes to getting a Hadoop cluster, from building
your own to running on rented hardware, or using an offering that provides Hadoop
as a service in the cloud. This chapter and the next give you enough information to
set up and operate your own cluster, but even if you are using a Hadoop service in
which a lot of the routine maintenance is done for you, these chapters still offer
valuable information about how Hadoop works from an operations point of view.
Cluster Specification
Hadoop is designed to run on commodity hardware. That means that you are not tied
to expensive, proprietary offerings from a single vendor; rather, you can choose
standardized, commonly available hardware from any of a large range of vendors to
build your cluster.
“Commodity” does not mean “low-end.” Low-end machines often have cheap
components, which have higher failure rates than more expensive (but still
commodity-class) machines. When you are operating tens, hundreds, or thousands
of machines, cheap components turn out to be a false economy, as the higher failure
rate incurs a greater maintenance cost. On the other hand, large database-class
machines are not recommended either, since they don’t score well on the
price/performance curve. And even though you would need fewer of them to build a
cluster of comparable performance to one built of mid-range commodity hardware,
when one did fail it would have a bigger impact on the cluster, since a larger
proportion of the cluster hardware would be unavailable.
Hardware specifications rapidly become obsolete, but for the sake of illustration, a
typical choice of machine for running a Hadoop datanode and tasktracker in
mid-2010 would have the following specifications:
2 quad-core 2-2.5GHz CPUs
16-24 GB ECC RAM
4 × 1TB SATA disks
Gigabit Ethernet
While the hardware specification for your cluster will assuredly be different, Hadoop
is designed to use multiple cores and disks, so it will be able to take full advantage of
more powerful hardware.
Why Not Use RAID?
HDFS clusters do not benefit from using RAID (Redundant Array of Independent
Disks) for datanode storage (although RAID is recommended for the namenode’s
disks, to protect against corruption of its metadata). The redundancy that RAID
provides is not needed, since HDFS handles it by replication between nodes.
Furthermore, RAID striping (RAID 0), which is commonly used to increase performance, turns out to be slower than the JBOD (Just a Bunch Of Disks)
configuration used by HDFS, which round-robins HDFS blocks between all disks.
The reason for this is that RAID 0 read and write operations are limited by the
speed of the slowest disk in the RAID array. In JBOD, disk operations are
independent, so the average speed of operations is greater than that of the
slowest disk. Disk performance often shows considerable variation in practice,
even for disks of the same model. In some benchmarking carried out on a Yahoo!
cluster, JBOD performed 10% faster than RAID 0 in one test (Gridmix), and 30%
better in another (HDFS write throughput).
Finally, if a disk fails in a JBOD configuration, HDFS can continue to operate
without the failed disk, whereas with RAID, failure of a single disk causes the
whole array (and hence the node) to become unavailable.
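The slowest-disk argument can be made concrete with a deliberately simplified model. The sketch below assumes that a RAID 0 stripe set moves data at the pace of its slowest member, and that JBOD disks each run at their own speed; the class name DiskThroughput and the per-disk figures are invented for illustration and are not measurements.

```java
// Simplified, hypothetical model of aggregate throughput; not a benchmark.
public class DiskThroughput {

  /** RAID 0 (striping): every disk is held to the speed of the slowest one. */
  public static int raid0(int[] mbPerSec) {
    int slowest = Integer.MAX_VALUE;
    for (int speed : mbPerSec) {
      slowest = Math.min(slowest, speed);
    }
    return slowest * mbPerSec.length;
  }

  /** JBOD: disks operate independently, so their speeds simply add up. */
  public static int jbod(int[] mbPerSec) {
    int total = 0;
    for (int speed : mbPerSec) {
      total += speed;
    }
    return total;
  }

  public static void main(String[] args) {
    // Four nominally identical disks with some variation (made-up numbers).
    int[] disks = {100, 95, 80, 105};
    System.out.println("RAID 0: " + raid0(disks) + " MB/s");
    System.out.println("JBOD:   " + jbod(disks) + " MB/s");
  }
}
```

Under this model the gap between the two figures grows with the spread in disk speeds, which is consistent with the observation that real disks of the same model vary considerably.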
The bulk of Hadoop is written in Java, and can therefore run on any platform with a
JVM, although there are enough parts that harbor Unix assumptions (the control
scripts, for example) to make it unwise to run on a non-Unix platform in production.
In fact, Windows operating systems are not supported production platforms.
ECC memory is strongly recommended, as several Hadoop users have reported
seeing many checksum errors when using non-ECC memory on Hadoop clusters.
How large should your cluster be? There isn’t an exact answer to this question, but the
beauty of Hadoop is that you can start with a small cluster (say, 10 nodes) and grow
it as your storage and computational needs grow. In many ways, a better question is
this: how fast does my cluster need to grow? You can get a good feel for this by
considering storage capacity.
For example, if your data grows by 1 TB a week, and you have three-way HDFS
replication, then you need an additional 3 TB of raw storage per week. Allow some
room for intermediate files and logfiles (around 30%, say), and this works out at
about one machine (2010 vintage) per week, on average. In practice, you wouldn’t
buy a new machine each week and add it to the cluster. The value of doing a
back-of-the-envelope calculation like this is that it gives you a feel for how big your
cluster should be: in this example, a cluster that holds two years of data needs 100
machines.
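The back-of-the-envelope calculation can be written out as follows. The class and method names are hypothetical, and the 4 TB of disk per machine is taken from the sample hardware specification earlier in the chapter (4 × 1 TB); the other inputs are the figures from the example.

```java
// Hypothetical sizing helper; all inputs are the example's assumptions.
public class ClusterGrowth {

  /** Raw storage consumed per week, including replication and working space. */
  public static double rawTbPerWeek(double dataTbPerWeek, int replication,
                                    double overheadFraction) {
    return dataTbPerWeek * replication * (1 + overheadFraction);
  }

  /** Number of machines needed to hold the given amount of raw storage. */
  public static long machinesNeeded(double rawTb, double tbPerMachine) {
    return (long) Math.ceil(rawTb / tbPerMachine);
  }

  public static void main(String[] args) {
    double perWeek = rawTbPerWeek(1.0, 3, 0.30);   // about 3.9 TB per week
    double twoYears = perWeek * 104;               // 104 weeks of growth
    System.out.printf("Raw storage per week: %.1f TB%n", perWeek);
    System.out.println("Machines after two years: " + machinesNeeded(twoYears, 4.0));
  }
}
```

With these inputs the result comes out just over 100 machines, in line with the rough figure quoted in the text.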
For a small cluster (on the order of 10 nodes), it is usually acceptable to run the
namenode and the jobtracker on a single master machine (as long as at least one
copy of the namenode’s metadata is stored on a remote filesystem). As the cluster
and the number of files stored in HDFS grow, the namenode needs more memory,
so the namenode and jobtracker should be moved onto separate machines.
The secondary namenode can be run on the same machine as the namenode, but
again for reasons of memory usage (the secondary has the same memory
requirements as the primary), it is best to run it on a separate piece of hardware,
especially for larger clusters. Machines running the namenodes should typically run
on 64-bit hardware to avoid the 3 GB limit on Java heap size in 32-bit architectures.
Network Topology
A common Hadoop cluster architecture consists of a two-level network topology, as
illustrated in Figure 9-1. Typically there are 30 to 40 servers per rack (only three per
rack are shown in the diagram), with a 1 Gb (gigabit) switch for the rack and an uplink
to a core switch or router (which is normally 1 Gb or better). The salient point is that
the aggregate bandwidth between nodes on the same rack is much greater than that
between nodes on different racks.
Figure 9-1. Typical two-level network architecture for a Hadoop cluster
Rack awareness
To get maximum performance out of Hadoop, it is important to configure Hadoop so
that it knows the topology of your network. If your cluster runs on a single rack, then
there is nothing more to do, since this is the default. However, for multirack clusters,
you need to map nodes to racks. By doing this, Hadoop will prefer within-rack
transfers (where there is more bandwidth available) to off-rack transfers when
placing MapReduce tasks on nodes. HDFS will be able to place replicas more
intelligently to trade off performance and resilience.
Network locations such as nodes and racks are represented in a tree, which reflects
the network “distance” between locations. The namenode uses the network location
when determining where to place block replicas (see “Network Topology and Hadoop” on page 71); the MapReduce scheduler uses network location to determine
where the closest replica is as input to a map task.
For the network in Figure 9-1, the rack topology is described by two network
locations, say, /switch1/rack1 and /switch1/rack2. Since there is only one top-level
switch in this cluster, the locations can be simplified to /rack1 and /rack2.
The Hadoop configuration must specify a map between node addresses and network
locations. The map is described by a Java interface, DNSToSwitchMapping, whose
signature is:
public interface DNSToSwitchMapping {
  public List<String> resolve(List<String> names);
}
The names parameter is a list of IP addresses, and the return value is a list of corresponding network location strings. The topology.node.switch.mapping.impl configuration property defines an implementation of the DNSToSwitchMapping interface that
the namenode and the jobtracker use to resolve worker node network locations.
For the network in our example, we would map node1, node2, and node3 to /rack1,
and node4, node5, and node6 to /rack2.
Most installations don’t need to implement the interface themselves, however, since
the default implementation is ScriptBasedMapping, which runs a user-defined script
to determine the mapping. The script’s location is controlled by a configuration
property. The script must accept a variable number of arguments
that are the hostnames or IP addresses to be mapped, and it must emit the
corresponding network locations to standard output, separated by whitespace.
If no script location is specified, the default behavior is to map all nodes to a single
network location, called /default-rack.
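To make the shape of the mapping concrete, here is a standalone sketch of a lookup with the same resolve() signature as the interface shown earlier. It does not implement Hadoop's DNSToSwitchMapping (so that it compiles without Hadoop on the classpath), the class name StaticRackMapping is hypothetical, and falling back to /default-rack for unknown nodes is an assumption of this sketch. The node-to-rack table is the one from the example network.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Hadoop-free illustration of a node-to-rack mapping.
public class StaticRackMapping {

  static final String DEFAULT_RACK = "/default-rack";

  private final Map<String, String> rackByNode = new HashMap<>();

  public StaticRackMapping() {
    for (String node : Arrays.asList("node1", "node2", "node3")) {
      rackByNode.put(node, "/rack1");
    }
    for (String node : Arrays.asList("node4", "node5", "node6")) {
      rackByNode.put(node, "/rack2");
    }
  }

  /** Returns one network location per input name, in the same order. */
  public List<String> resolve(List<String> names) {
    List<String> locations = new ArrayList<>(names.size());
    for (String name : names) {
      // Assumption: nodes missing from the table go to the default rack.
      locations.add(rackByNode.getOrDefault(name, DEFAULT_RACK));
    }
    return locations;
  }

  public static void main(String[] args) {
    StaticRackMapping mapping = new StaticRackMapping();
    System.out.println(mapping.resolve(Arrays.asList("node2", "node5", "node9")));
  }
}
```

A user-defined script plays the same role: it receives the names as arguments and prints the corresponding locations, one per name.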
Cluster Setup and Installation
Your hardware has arrived. The next steps are to get it racked up and install the
software needed to run Hadoop.
There are various ways to install and configure Hadoop. This chapter describes how
to do it from scratch using the Apache Hadoop distribution, and will give you the
background to cover the things you need to think about when setting up Hadoop.
To ease the burden of installing and maintaining the same software on each node, it
is normal to use an automated installation method like Red Hat Linux’s Kickstart or
Debian’s Fully Automatic Installation. These tools allow you to automate the
operating system installation by recording the answers to questions that are asked
during the installation process (such as the disk partition layout), as well as which
packages to install. Crucially, they also provide hooks to run scripts at the end of the
process, which are invaluable for doing final system tweaks and customization that is
not covered by the standard installer.
The following sections describe the customizations that are needed to run Hadoop.
These should all be added to the installation script.
Installing Java
Java 6 or later is required to run Hadoop. The latest stable Sun JDK is the preferred
option, although Java distributions from other vendors may work, too. The following
command confirms that Java was installed correctly:
% java -version
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
Creating a Hadoop User
It’s good practice to create a dedicated Hadoop user account to separate the
Hadoop installation from other services running on the same machine.
For small clusters, some administrators choose to make this user’s home directory
an NFS-mounted drive, to aid with SSH key distribution (see the following
discussion). The NFS server is typically outside the Hadoop cluster. If you use NFS,
it is worth considering autofs, which allows you to mount the NFS filesystem on
demand, when the system accesses it. Autofs provides some protection against the
NFS server failing and allows you to use replicated filesystems for failover. There are
other NFS gotchas to watch out for, such as synchronizing UIDs and GIDs. For help
setting up NFS on Linux, refer to the NFS HOWTO.
Installing Hadoop
Download Hadoop from the Apache Hadoop releases page
( core/releases.html), and unpack the contents of the
distribution in a sensible location, such as /usr/local (/opt is another standard choice).
Note that Hadoop is not installed in the hadoop user’s home directory, as that may
be an NFS-mounted directory:
% cd /usr/local
% sudo tar xzf hadoop-x.y.z.tar.gz
We also need to change the owner of the Hadoop files to be the hadoop user and group:
% sudo chown -R hadoop:hadoop hadoop-x.y.z
Some administrators like to install HDFS and MapReduce in separate locations on
the same system. At the time of this writing, only HDFS and MapReduce from the
same Hadoop release are compatible with one another; however, in future releases,
the compatibility requirements will be loosened.
Note that separate installations of HDFS and MapReduce can still share
configuration by using the --config option (when starting daemons) to refer to a
common configuration directory. They can also log to the same directory, as the
logfiles they produce are named in such a way as to avoid clashes.
Testing the Installation
Once you’ve created an installation script, you are ready to test it by installing it on
the machines in your cluster. This will probably take a few iterations as you discover
kinks in the install. When it’s working, you can proceed to configure Hadoop and give
it a test run. This process is documented in the following sections.
SSH Configuration
The Hadoop control scripts (but not the daemons) rely on SSH to perform
cluster-wide operations. For example, there is a script for stopping and starting all the
daemons in the cluster. Note that the control scripts are optional—cluster-wide
operations can be performed by other mechanisms, too (such as a distributed shell).
To work seamlessly, SSH needs to be set up to allow password-less login for the
hadoop user from machines in the cluster. The simplest way to achieve this is to
generate a public/private key pair, and place it in an NFS location that is shared
across the cluster.
First, generate an RSA key pair by typing the following in the hadoop user account:
% ssh-keygen -t rsa -f ~/.ssh/id_rsa
Even though we want password-less logins, keys without passphrases are not
considered good practice, so we specify a passphrase when prompted for one. We
shall use ssh-agent to avoid the need to enter a password for each connection. The
private key is in the file specified by the -f option, ~/.ssh/id_rsa, and the public key is
stored in a file with the same name with .pub appended, ~/.ssh/id_rsa.pub.
Next we need to make sure that the public key is in the ~/.ssh/authorized_keys file
on all the machines in the cluster that we want to connect to. If the hadoop user’s
home directory is an NFS filesystem, as described earlier, then the keys can be
shared across the cluster by typing:
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
If the home directory is not shared using NFS, then the public keys will need to be
shared by some other means.
Test that you can SSH from the master to a worker machine by making sure
ssh-agent is running, and then run ssh-add to store your passphrase. You should be
able to ssh to a worker without entering the passphrase again.
Hadoop Configuration
There are a handful of files for controlling the configuration of a Hadoop installation;
the most important ones are listed in Table 9-1. This section covers MapReduce 1,
which employs the jobtracker and tasktracker daemons.
Table 9-1. Hadoop configuration files
Bash script: Environment variables that are used in the scripts to run Hadoop.
Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
Configuration settings for HDFS daemons: the namenode, the secondary namenode, and the datanodes.
Configuration settings for MapReduce daemons: the jobtracker, and the tasktrackers.
Plain text: A list of machines (one per line) that each run a secondary namenode.
Plain text: A list of machines (one per line) that each run a datanode and a tasktracker.
Properties for controlling how metrics are published in Hadoop (see “Metrics” on page 350).
Properties for system logfiles, the namenode audit log, and the task log for the tasktracker child process (“Hadoop Logs” on page 173).
These files are all found in the conf directory of the Hadoop distribution. The
configuration directory can be relocated to another part of the filesystem (outside the
Hadoop installation).
3. See the ssh-agent man page for instructions on how to start it.
Configuration Management
Hadoop does not have a single, global location for configuration information. Instead,
each Hadoop node in the cluster has its own set of configuration files, and it is up to
administrators to ensure that they are kept in sync across the system. Hadoop
provides a rudimentary facility for synchronizing configuration using rsync (see
upcoming discussion); alternatively, there are parallel shell tools that can help do
this, like dsh or pdsh.
Hadoop is designed so that it is possible to have a single set of configuration files
that are used for all master and worker machines. The great advantage of this is
simplicity, both conceptually (since there is only one configuration to deal with) and
operationally (as the Hadoop scripts are sufficient to manage a single configuration
setup).
For some clusters, the one-size-fits-all configuration model breaks down. For
example, if you expand the cluster with new machines that have a different hardware
specification to the existing ones, then you need a different configuration for the new
machines to take advantage of their extra resources.
In these cases, you need to have the concept of a class of machine, and maintain a
separate configuration for each class. Hadoop doesn’t provide tools to do this, but
there are several excellent tools for doing precisely this type of configuration
management, such as Chef, Puppet, cfengine, and bcfg2.
For a cluster of any size, it can be a challenge to keep all of the machines in sync:
consider what happens if the machine is unavailable when you push out an update—
who ensures it gets the update when it becomes available? This is a big problem
and can lead to divergent installations, so even if you use the Hadoop control scripts
for managing Hadoop, it may be a good idea to use configuration management tools
for maintaining the cluster. These tools are also excellent for doing regular
maintenance, such as patching security holes and updating system packages.
Control scripts
Hadoop comes with scripts for running commands, and starting and stopping
daemons across the whole cluster. To use these scripts (which can be found in the
bin directory), you need to tell Hadoop which machines are in the cluster. There are
two files for this purpose, called masters and slaves, each of which contains a list of
the machine hostnames or IP addresses, one per line. The masters file is actually a
misleading name, in that it determines which machine or machines should run a
secondary namenode. The slaves file lists the machines that the datanodes and
tasktrackers should run on. Both masters and slaves files reside in the configuration
directory, although the slaves file may be placed elsewhere (and given another
name) by changing a configuration setting. Also, these files do not need to be
distributed to worker nodes, since they are used only by the control scripts running
on the namenode or jobtracker.
You don’t need to specify which machine (or machines) the namenode and
jobtracker run on in the masters file, as this is determined by the machine the
scripts are run on. (In fact, specifying these in the masters file would cause a
secondary namenode to run there, which isn’t always what you want.) For example,
the script that starts all the HDFS daemons in the cluster runs the
namenode on the machine the script is run on. In slightly more detail, it:
Starts a namenode on the local machine (the machine that the script is run on)
Starts a datanode on each machine listed in the slaves file
Starts a secondary namenode on each machine listed in the masters file
There is a similar script that starts all the MapReduce daemons in the cluster. More
specifically, it:
Starts a jobtracker on the local machine
Starts a tasktracker on each machine listed in the slaves file
Note that masters is not used by the MapReduce control scripts.
Also provided are corresponding scripts to stop the daemons started by each start
script.
These scripts start and stop Hadoop daemons using a separate daemon script. If
you use the aforementioned scripts, you shouldn’t call that script directly.
But if you need to control Hadoop daemons from another system or from your own
scripts, then it is a good integration point. Likewise, a companion script (with an “s”
in its name) is handy for starting the same daemon on a set of hosts.
Master node scenarios
Depending on the size of the cluster, there are various configurations for running the
master daemons: the namenode, secondary namenode, and jobtracker. On a small
cluster (a few tens of nodes), it is convenient to put them on a single machine;
however, as the cluster gets larger, there are good reasons to separate them.
The namenode has high memory requirements, as it holds file and block metadata
for the entire namespace in memory. The secondary namenode, while idle most of
the time, has a comparable memory footprint to the primary when it creates a
checkpoint. For filesystems with a large number of files, there may not be enough
physical memory on one machine to run both the primary and secondary namenode.
The secondary namenode keeps a copy of the latest checkpoint of the filesystem
metadata that it creates. Keeping this (stale) backup on a different node from the
namenode allows recovery in the event of loss (or corruption) of all the namenode’s
metadata files.
On a busy cluster running lots of MapReduce jobs, the jobtracker uses considerable
memory and CPU resources, so it should run on a dedicated node.
Whether the master daemons run on one or more nodes, the following
instructions apply:
Run the HDFS control scripts from the namenode machine. The masters file should
contain the address of the secondary namenode.
Run the MapReduce control scripts from the jobtracker machine.
When the namenode and jobtracker are on separate nodes, their slaves files need to
be kept in sync, since each node in the cluster should run a datanode and a
tasktracker.
Environment Settings
In this section, we consider how to set the variables in
By default, Hadoop allocates 1,000 MB (1 GB) of memory to each daemon it runs.
This is controlled by the HADOOP_HEAPSIZE setting. In addition,
the tasktracker launches separate child JVMs to run map and reduce tasks, so we
need to factor these into the total memory footprint of a worker machine.
The maximum number of map tasks that can run on a tasktracker at one time is controlled by a tasktracker property, which defaults to
two tasks. There is a corresponding property for reduce tasks,
mapred.tasktracker.reduce.tasks.maximum, which also defaults to two tasks. The tasktracker is
said to have two map slots and two reduce slots.
The memory given to each child JVM running a task can be changed by setting the child JVM options property. The default setting is -Xmx200m, which gives each
task 200 MB of memory. (Incidentally, you can provide extra JVM options here, too.
For example, you might enable verbose GC logging to debug GC.) The default
configuration therefore uses 2,800 MB of memory for a worker machine (see Table 9-2).
Table 9-2. Worker node memory
JVM                            Default memory used (MB)   Memory used for 8 processors, 400 MB per child (MB)
Datanode                       1,000                      1,000
Tasktracker                    1,000                      1,000
Tasktracker child map task     2 × 200                    7 × 400
Tasktracker child reduce task  2 × 200                    7 × 400
Total                          2,800                      7,600
The number of tasks that can be run simultaneously on a tasktracker is related to the
number of processors available on the machine. Because MapReduce jobs are
normally I/O-bound, it makes sense to have more tasks than processors to get better
utilization. The amount of oversubscription depends on the CPU utilization of jobs
you run, but a good rule of thumb is to have a factor of between one and two more
tasks (counting both map and reduce tasks) than processors.
For example, if you had 8 processors and you wanted to run 2 processes on each
processor, then you could set both the map task maximum and
mapred.tasktracker.reduce.tasks.maximum to 7 (not 8, since the datanode and the
tasktracker each take one slot). If you also increased the memory available to each
child task to 400 MB, then the total memory usage would be 7,600 MB (see Table 9-2).
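The arithmetic behind these totals can be checked with a short sketch. The function names are hypothetical, and the even split of the process budget between map and reduce slots is an assumption taken from the worked example, not a rule that Hadoop enforces.

```python
DAEMON_HEAP_MB = 1000  # default heap per Hadoop daemon, as stated in the text


def slots_per_type(processors, processes_per_processor, daemons=2):
    """Split the process budget evenly between map and reduce slots.

    The datanode and tasktracker each take one process from the budget.
    """
    budget = processors * processes_per_processor - daemons
    return budget // 2


def worker_memory_mb(map_slots, reduce_slots, child_heap_mb, daemons=2):
    """Total JVM memory for the daemons plus every task child JVM."""
    return daemons * DAEMON_HEAP_MB + (map_slots + reduce_slots) * child_heap_mb


# Default configuration: two map slots, two reduce slots, 200 MB per child.
print(worker_memory_mb(2, 2, 200))

# Eight processors, two processes each, 400 MB per child.
slots = slots_per_type(8, 2)
print(slots, worker_memory_mb(slots, slots, 400))
```

The two printed totals correspond to the two columns of the worker node memory table.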
Whether this Java memory allocation will fit into 8 GB of physical memory depends
on the other processes that are running on the machine. If you are running
Streaming or Pipes programs, this allocation will probably be inappropriate (and the
memory allocated to the child should be dialed down), since it doesn’t allow enough
memory for users’ (Streaming or Pipes) processes to run. The thing to avoid is
processes being swapped out, as this leads to severe performance degradation. The
precise memory settings are necessarily very cluster-dependent and can be
optimized over time with experience gained from monitoring the memory usage
across the cluster. Tools like Ganglia (“GangliaContext” on page 352) are good for
gathering this information. See “Task memory limits” on page 316 for more on how to
enforce task memory limits.
For the master nodes, each of the namenode, secondary namenode, and jobtracker
daemons uses 1,000 MB by default, a total of 3,000 MB.
How much memory does a namenode need?
A namenode can eat up memory, since a reference to every block of every file is
maintained in memory. It’s difficult to give a precise formula, since memory usage
depends on the number of blocks per file, the filename length, and the number of
directories in the filesystem; plus it can change from one Hadoop release to another.
The default of 1,000 MB of namenode memory is normally enough for a few million
files, but as a rule of thumb for sizing purposes you can conservatively allow 1,000
MB per million blocks of storage.
For example, a 200 node cluster with 4 TB of disk space per node, a block size of
128 MB and a replication factor of 3 has room for about 2 million blocks (or more):
200 × 4,000,000 MB ⁄ (128 MB × 3). So in this case, setting the namenode
memory to 2,000 MB would be a good starting point.
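The sizing rule above is easy to reproduce. The sketch below uses hypothetical function names and simply restates the rule of thumb of 1,000 MB per million blocks; it is an estimate for a starting point, not a formula for actual namenode memory use.

```python
def estimate_blocks(nodes, disk_mb_per_node, block_mb, replication):
    """Number of distinct blocks the cluster has room for."""
    return nodes * disk_mb_per_node // (block_mb * replication)


def namenode_heap_mb(blocks, mb_per_million_blocks=1000):
    """Rule-of-thumb heap size: 1,000 MB per million blocks."""
    return blocks / 1_000_000 * mb_per_million_blocks


# 200 nodes, 4 TB (4,000,000 MB) per node, 128 MB blocks, replication 3.
blocks = estimate_blocks(200, 4_000_000, 128, 3)
print(blocks)
print(round(namenode_heap_mb(blocks)))
```

The block count comes out slightly above 2 million, which is why 2,000 MB is described as a starting point rather than an upper bound.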
You can increase the namenode’s memory without changing the memory
allocated to other Hadoop daemons by setting HADOOP_NAMENODE_OPTS to include a JVM option for setting the memory size.
HADOOP_NAMENODE_OPTS allows you to pass extra options to the
namenode’s JVM. So, for example, if using a Sun JVM, -Xmx2000m would specify
that 2,000 MB of memory should be allocated to the namenode.
If you change the namenode’s memory allocation, don’t forget to do the same for
the secondary namenode (using the HADOOP_SECONDARYNAMENODE_OPTS
variable), since its memory requirements are comparable to the primary
namenode’s. You will probably also want to run the secondary namenode on a
different machine, in this case.
There are corresponding environment variables for the other Hadoop daemons, so
you can customize their memory allocations, if desired.
The location of the Java implementation to use is determined by the JAVA_HOME
setting in Hadoop’s environment settings or, if it is not set there, by the
JAVA_HOME shell environment variable. It’s a good idea to set the value in
Hadoop’s environment settings, so that it is
clearly defined in one place and to ensure that the whole cluster is using the same
version of Java.
System logfiles
System logfiles produced by Hadoop are stored in $HADOOP_INSTALL/logs by
default. This can be changed using the HADOOP_LOG_DIR setting. It’s a good idea to change this so that logfiles are kept out of the directory
that Hadoop is installed in, since this keeps logfiles in one place even after the
installation directory changes after an upgrade. A common choice is /var/log/hadoop,
set by including the following line in Hadoop’s environment settings:
export HADOOP_LOG_DIR=/var/log/hadoop
The log directory will be created if it doesn’t already exist (if not, confirm that the
Hadoop user has permission to create it). Each Hadoop daemon running on a
machine produces two logfiles. The first is the log output written via log4j. This file,
which ends in .log, should be the first port of call when diagnosing problems, since
most application log messages are written here. The standard Hadoop log4j
configuration uses a Daily Rolling File Appender to rotate logfiles. Old logfiles are
never deleted, so you should arrange for them to be periodically deleted or archived,
so as to not run out of disk space on the local node.
The second logfile is the combined standard output and standard error log. This
logfile, which ends in .out, usually contains little or no output, since Hadoop uses
log4j for logging. It is only rotated when the daemon is restarted, and only the last
five logs are retained. Old logfiles are suffixed with a number between 1 and 5, with
5 being the oldest file.
Logfile names (of both types) are a combination of the name of the user running the
daemon, the daemon name, and the machine hostname. For example,
hadoop-tom-datanode-sturges.local.log.2008-07-04 is the name of a logfile after it has been rotated.
This naming structure makes it possible to archive logs from all machines in the
cluster in a single directory, if needed, since the filenames are unique.
The username in the logfile name is actually the default for the
HADOOP_IDENT_STRING setting. If you wish to give the Hadoop
instance a different identity for the purposes of naming the logfiles, change
HADOOP_IDENT_STRING to be the identifier you want.
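As an illustration of the naming scheme, the following sketch builds a logfile name from its parts. The function and its parameters are hypothetical; the pattern is inferred from the single example name given above, so treat it as an assumption rather than a specification.

```python
def daemon_logfile(ident, daemon, hostname, suffix="log", date=None):
    """Build a logfile name from the identity string, daemon, and host.

    `ident` defaults to the username in Hadoop, per the text. `date` is the
    suffix added when the daily rolling appender rotates a .log file.
    """
    name = f"hadoop-{ident}-{daemon}-{hostname}.{suffix}"
    if date is not None:
        name += f".{date}"
    return name


print(daemon_logfile("tom", "datanode", "sturges.local", date="2008-07-04"))
print(daemon_logfile("tom", "datanode", "sturges.local", suffix="out"))
```

Because the hostname is part of the name, files produced this way on different machines do not collide when collected into one archive directory.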
SSH settings
The control scripts allow you to run commands on (remote) worker nodes from the
master node using SSH. It can be useful to customize the SSH settings, for various
reasons. For example, you may want to reduce the connection timeout (using the
ConnectTimeout option) so the control scripts don’t hang around waiting to see
whether a dead node is going to respond. Obviously, this can be taken too far. If the
timeout is too low, then busy nodes will be skipped, which is bad.
Another useful SSH setting is StrictHostKeyChecking, which can be set to no to
automatically add new host keys to the known hosts files. The default, ask, is to
prompt the user to confirm they have verified the key fingerprint, which is not a
suitable setting in a large cluster environment.
To pass extra options to SSH, define the HADOOP_SSH_OPTS environment
variable. See the ssh and ssh_config manual pages for more SSH
settings.
The Hadoop control scripts can distribute configuration files to all nodes of
the cluster using rsync. This is not enabled by default, but when the
HADOOP_MASTER setting is defined, worker daemons will rsync the tree
rooted at HADOOP_MASTER to the local node’s HADOOP_INSTALL whenever the
daemon starts up.
What if you have two masters—a namenode and a jobtracker on separate
machines? You can pick one as the source and the other can rsync from it, along
with all the workers. In fact, you could use any machine, even one outside the
Hadoop cluster, to rsync from.
Because HADOOP_MASTER is unset by default, there is a bootstrapping problem:
how do we make sure that an environment file with HADOOP_MASTER set is present on
worker nodes? For small clusters, it is easy to write a small script to copy the file from the master to all of the worker nodes. For larger clusters, tools like dsh
can do the copies in parallel. Alternatively, a suitable file can be created
as a part of the automated installation script (such as Kickstart).
Important Hadoop Daemon Properties
Hadoop has a bewildering number of configuration properties. In this section, we
address the ones that you need to define (or at least understand why the default is
appropriate) for any real-world working cluster. These properties are set in the
Hadoop site files: core-site.xml, hdfs-site.xml, and mapred-site.xml. Typical
examples of these files are shown in Example 9-1, Example 9-2, and Example 9-3.
Notice that most properties are marked as final, in order to prevent them from being
overridden by job configurations. You can learn more about how to write Hadoop’s
configuration files in “The Configuration API”.
Example 9-1. A typical core-site.xml configuration file
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
Example 9-2. A typical hdfs-site.xml configuration file
<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
Example 9-3. A typical mapred-site.xml configuration file
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<!-- Not marked as final so jobs can include JVM debugging options -->
</property>
To run HDFS, you need to designate one machine as a namenode. In this case, the
property is an HDFS filesystem URI, whose host is the namenode’s
hostname or IP address, and port is the port that the namenode will listen on for
RPCs. If no port is specified, the default of 8020 is used.
The property also doubles as specifying the default filesystem. The
default filesystem is used to resolve relative paths, which are handy to use since
they save typing (and avoid hardcoding knowledge of a particular namenode’s
address). For example, with the default filesystem defined in Example 9-1, the
relative URI /a/b is resolved to hdfs://namenode/a/b.
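The two roles of the default filesystem URI can be sketched as follows. The helper names are hypothetical, and the URI hdfs://namenode/ is assumed from the resolved path quoted above; this only mimics the behavior described in the text and is not Hadoop's own resolution code.

```python
from urllib.parse import urlsplit

DEFAULT_NAMENODE_PORT = 8020  # used when the URI does not specify a port


def namenode_address(default_fs_uri):
    """Return the (host, port) that the namenode's RPC server listens on."""
    parts = urlsplit(default_fs_uri)
    return parts.hostname, parts.port or DEFAULT_NAMENODE_PORT


def resolve(default_fs_uri, path):
    """Resolve a relative URI such as /a/b against the default filesystem."""
    parts = urlsplit(default_fs_uri)
    return f"{parts.scheme}://{parts.netloc}{path}"


print(namenode_address("hdfs://namenode/"))
print(resolve("hdfs://namenode/", "/a/b"))
```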
There are a few other configuration properties you should set for HDFS: those that
set the storage directories for the namenode and for datanodes. The namenode’s storage directory property specifies a list of directories where the namenode stores persistent filesystem metadata (the edit log and the filesystem image). A copy of each of the
metadata files is stored in each directory for redundancy. It’s common to configure it so that the namenode metadata is written to one or two local disks, and
a remote disk, such as an NFS-mounted directory. Such a setup guards against
failure of a local disk and failure of the entire namenode, since in both cases the files
can be recovered and used to start a new namenode. (The secondary namenode
takes only periodic checkpoints of the namenode, so it does not provide an up-to-date backup of the namenode.)
You should also set the datanode’s storage directory property, which specifies a list of directories for a
datanode to store its blocks. Unlike the namenode, which uses multiple directories
for redundancy, a datanode round-robins writes between its storage directories, so
for performance you should specify a storage directory for each local disk. Read
performance also benefits from having multiple disks for storage, because blocks
will be spread across them, and concurrent reads for distinct blocks will be
correspondingly spread across disks.
Finally, you should configure where the secondary namenode stores its checkpoints
of the filesystem. The fs.checkpoint.dir property specifies a list of directories where
the checkpoints are kept. Like the storage directories for the namenode, which keep
redundant copies of the namenode metadata, the checkpointed filesystem image is
stored in each checkpoint directory for redundancy.
Table 9-3 summarizes the important configuration properties for HDFS.
Table 9-3. Important HDFS daemon properties
Default filesystem property: The default filesystem. The URI defines the hostname and port that the namenode’s RPC server runs on. The default port is 8020. This property is set in core-site.xml.
Namenode storage directories property: The list of directories where the namenode stores its persistent metadata. The namenode stores a copy of the metadata in each directory in the list.
Datanode storage directories property: A list of directories where the datanode stores blocks. Each block is stored in only one of these directories.
fs.checkpoint.dir (comma-separated): A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list.
To run MapReduce, you need to designate one machine as a jobtracker, which on
small clusters may be the same machine as the namenode. To do this, set the
mapred.job.tracker property to the hostname or IP address and port that the
jobtracker will listen on. Note that this property is not a URI, but a host-port pair,
separated by a colon. The port number 8021 is a common choice.
During a MapReduce job, intermediate data and working files are written to
temporary local files. Since this data includes the potentially very large output of map
tasks, you need to ensure that the mapred.local.dir property, which controls the
location of local temporary storage, is configured to use disk partitions that are large
enough. The mapred.local.dir property takes a comma-separated list of directory
names, and you should use all available local disks to spread disk I/O. Typically, you
will use the same disks and partitions (but different directories) for MapReduce
temporary data as you use for datanode block storage, as governed by the datanode storage directory property, discussed earlier.
MapReduce uses a distributed filesystem to share files (such as the job JAR file)
with the tasktrackers that run the MapReduce tasks. The mapred.system.dir property
is used to specify a directory where these files can be stored. This directory is
resolved relative to the default filesystem, which is
usually HDFS.
Finally, you should set the map and reduce task maximum properties (the latter is
mapred.tasktracker.reduce.tasks.maximum) to reflect the number of available cores on
the tasktracker machines, and the child JVM options to reflect the amount of
memory available for the tasktracker child JVMs. See the discussion in “Memory”.
Table 9-4 summarizes the important configuration properties for MapReduce.
Table 9-4. Important MapReduce daemon properties
mapred.job.tracker (hostname and port; default local): The hostname and port that the jobtracker’s RPC server runs on. If set to the default value of local, then the jobtracker is run in-process on demand when you run a MapReduce job (you don’t need to start the jobtracker in this case, and in fact you will get an error if you try to start it in this mode).
mapred.local.dir (default under ${hadoop.tmp.dir}): A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.
mapred.system.dir (default ending in /mapred/system): The directory, relative to the default filesystem, where shared files are stored during a job run.
Map task maximum property (int): The number of map tasks that may be run on a tasktracker at any one time.
mapred.tasktracker.reduce.tasks.maximum (int): The number of reduce tasks that may be run on a tasktracker at any one time.
Child JVM options property: The JVM options used to launch the tasktracker child process that runs map and reduce tasks.
Map-only JVM options property: The JVM options used for the process that runs map tasks.
Reduce-only JVM options property: The JVM options used for the process that runs reduce tasks.
Hadoop Daemon Addresses and Ports
Hadoop daemons generally run both an RPC server (Table 9-5) for communication
between daemons and an HTTP server to provide web pages for human
consumption (Table 9-6). Each server is configured by setting the network address
and port number to listen on. By specifying the wildcard network address, Hadoop
will bind to all addresses on the machine. Alternatively, you can specify a single
address to bind to. A port number of 0 instructs the server to start on a free port: this
is generally discouraged, since it is incompatible with setting cluster-wide firewall
rules.
Table 9-5. RPC server properties
Default filesystem property: When set to an HDFS URI, this property determines the namenode’s RPC server address and port. The default port is 8020 if not specified.
dfs.datanode.ipc.address: The datanode’s RPC server address and port.
mapred.job.tracker: When set to a hostname and port, this property specifies the jobtracker’s RPC server address and port. A commonly used port is 8021.
Tasktracker RPC address property: The tasktracker’s RPC server address and port. This is used by the tasktracker’s child JVM to communicate with the tasktracker. Using any free port is acceptable in this case, as the server only binds to the loopback address. You should change this setting only if the machine has no loopback address.
In addition to an RPC server, datanodes run a TCP/IP server for block transfers. The
server address and port are set by the dfs.datanode.address property.
Table 9-6. HTTP server properties
Property name, default value, description
The jobtracker’s HTTP server address and port.
The tasktracker’s HTTP server address and port.
The namenode’s HTTP server address and port.
The datanode’s HTTP server address and port.
The secondary namenode’s HTTP server address and port.
There are also settings for controlling which network interfaces the datanodes and
tasktrackers report as their IP addresses (for HTTP and RPC servers). The relevant
properties are dfs.datanode.dns.interface and mapred.tasktracker.dns.interface, both
of which are set to default, which will use the default network interface. You can set
this explicitly to report the address of a particular interface (eth0, for example).
Other Hadoop Properties
This section discusses some other properties that you might consider setting.
Cluster membership
To aid the addition and removal of nodes in the future, you can specify a file
containing a list of authorized machines that may join the cluster as datanodes or
tasktrackers. The file is specified using the dfs.hosts (for datanodes) and
mapred.hosts (for tasktrackers) properties, as well as the corresponding
dfs.hosts.exclude and mapred.hosts.exclude files used for decommissioning.
Buffer size
Hadoop uses a buffer size of 4 KB (4,096 bytes) for its I/O operations. This is a conservative setting, and with modern hardware and operating systems, you will likely
see performance benefits by increasing it; 128 KB (131,072 bytes) is a common
choice. Set this using the io.file.buffer.size property in core-site.xml.
HDFS block size
The HDFS block size is 64 MB by default, but many clusters use 128 MB
(134,217,728 bytes) or even 256 MB (268,435,456 bytes) to ease memory pressure
on the namenode and to give mappers more data to work on. Set this using the
dfs.block.size property in hdfs-site.xml.
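Both io.file.buffer.size and dfs.block.size are given in bytes, so it helps to convert the sizes quoted above explicitly. This is a minimal sketch with a hypothetical helper name; it only performs the unit conversion.

```python
KB = 1024
MB = 1024 * KB


def to_bytes(amount, unit):
    """Convert a size in KB or MB to the byte count used in property values."""
    return amount * {"KB": KB, "MB": MB}[unit]


print(to_bytes(4, "KB"))     # default I/O buffer size
print(to_bytes(128, "KB"))   # common choice for io.file.buffer.size
print(to_bytes(128, "MB"))   # common choice for dfs.block.size
print(to_bytes(256, "MB"))
```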
Reserved storage space
By default, datanodes will try to use all of the space available in their storage
directories. If you want to reserve some space on the storage volumes for non-HDFS
use, then you can set dfs.datanode.du.reserved to the amount, in bytes, of space to
reserve.
Trash
Hadoop filesystems have a trash facility, in which deleted files are not actually
deleted, but rather are moved to a trash folder, where they remain for a minimum
period before being permanently deleted by the system. The minimum period in
minutes that a file will remain in the trash is set using the fs.trash.interval
configuration property in core-site.xml. By default, the trash interval is zero, which
disables trash.
As in many operating systems, Hadoop’s trash facility is a user-level feature,
meaning that only files that are deleted using the filesystem shell are put in the trash.
Files deleted programmatically are deleted immediately. It is possible to use the
trash programmatically, however, by constructing a Trash instance, then calling its
moveToTrash() method with the Path of the file intended for deletion. The method
returns a value indicating success; a value of false means either that trash is not
enabled or that the file is already in the trash.
When trash is enabled, each user has her own trash directory called .Trash in her
home directory. File recovery is simple: you look for the file in a subdirectory of
.Trash and move it out of the trash subtree.
HDFS will automatically delete files in trash folders, but other filesystems will not, so
you have to arrange for this to be done periodically. You can expunge the trash,
which will delete files that have been in the trash longer than their minimum period,
using the filesystem shell:
% hadoop fs -expunge
The Trash class exposes an expunge() method that has the same effect.
Job scheduler
Particularly in a multiuser MapReduce setting, consider changing the default FIFO
job scheduler to one of the more fully featured alternatives.
Reduce slow start
By default, schedulers wait until 5% of the map tasks in a job have completed before
scheduling reduce tasks for the same job. For large jobs this can cause problems
with cluster utilization, since they take up reduce slots while waiting for the map
tasks to complete. Setting mapred.reduce.slowstart.completed.maps to a higher
value, such as 0.80 (80%), can help improve throughput.
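To see what the threshold means for a concrete job, the sketch below computes how many map tasks must finish before reduce tasks become eligible for scheduling. The function name is hypothetical, the threshold is passed as a whole-number percentage to avoid floating-point surprises, and rounding up is an assumption rather than documented behavior.

```python
def maps_before_reduces(total_maps, slowstart_percent):
    """Map tasks that must complete before reduces are scheduled.

    Rounds up, so any nonzero threshold requires at least one finished map.
    """
    return (total_maps * slowstart_percent + 99) // 100


print(maps_before_reduces(1000, 5))    # default threshold of 5%
print(maps_before_reduces(1000, 80))   # raised threshold of 80%
```

With the default, a 1,000-map job can start holding reduce slots after only 50 maps have finished; at 80% the reduce slots stay free for other jobs until 800 maps are done.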
Task memory limits
On a shared cluster, it shouldn’t be possible for one user’s errant MapReduce
program to bring down nodes in the cluster. This can happen if the map or reduce
task has a memory leak, for example, because the machine on which the tasktracker
is running will run out of memory and may affect the other running processes.
Or consider the case where a user sets the child JVM’s maximum heap to a large value and
causes memory pressure on other running tasks, causing them to swap. Marking this
property as final on the cluster would prevent it being changed by users in their jobs,
but there are legitimate reasons to allow some jobs to use more memory, so this is
not always an acceptable solution. Furthermore, even locking down the child JVM options does not solve the problem, since tasks can spawn new processes which are not constrained in their memory usage. Streaming and Pipes jobs
do exactly that, for example.
To prevent cases like these, some way of enforcing a limit on a task’s memory usage
is needed. Hadoop provides two mechanisms for this. The simplest is via the Linux
ulimit command, which can be done at the operating system level (in the limits.conf
file, typically found in /etc/security), or by setting mapred.child.ulimit in the Hadoop
configuration. The value is specified in kilobytes, and should be comfortably larger
than the memory of the JVM set by the child JVM options; otherwise, the child JVM
might not start.
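A quick sanity check of a proposed ulimit against the child JVM heap might look like the sketch below. The function names and the headroom factor of 2 are hypothetical; the text only says the limit should be "comfortably larger" than the heap, so the factor is an assumption to tune. Only -Xmx values with a k, m, or g suffix are recognized.

```python
import re


def xmx_kb(java_opts):
    """Extract the -Xmx heap size from a JVM options string, in kilobytes."""
    match = re.search(r"-Xmx(\d+)([kKmMgG])", java_opts)
    if match is None:
        return None
    size, unit = int(match.group(1)), match.group(2).lower()
    return size * {"k": 1, "m": 1024, "g": 1024 * 1024}[unit]


def ulimit_is_comfortable(ulimit_kb, java_opts, headroom=2.0):
    """True if the ulimit (in KB) leaves the assumed headroom over the heap."""
    heap_kb = xmx_kb(java_opts)
    return heap_kb is not None and ulimit_kb >= heap_kb * headroom


print(xmx_kb("-Xmx200m"))
print(ulimit_is_comfortable(409600, "-Xmx200m"))
print(ulimit_is_comfortable(204800, "-Xmx200m"))
```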
The second mechanism is Hadoop’s task memory monitoring feature. The idea is that
an administrator sets a range of allowed virtual memory limits for tasks on the cluster,
and users specify the maximum memory requirements for their jobs in the job configuration. If a user doesn’t set memory requirements for their job, then the defaults are
used (mapred.job.reduce.memory.mb and its map counterpart).
This approach has a couple of advantages over the ulimit approach. First, it enforces
the memory usage of the whole task process tree, including spawned processes.
Second, it enables memory-aware scheduling, where tasks are scheduled on
tasktrackers which have enough free memory to run them. The Capacity Scheduler,
for example, will account for slot usage based on the memory settings, so that if a
job’s map memory requirement exceeds the amount of memory that defines a map slot,
then the scheduler will allocate more than one slot on a tasktracker to run each map
task for that job.
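The slot accounting described here amounts to a ceiling division. The sketch below uses a hypothetical function name, and rounding up to whole slots is an assumption about how a scheduler would count; the text only states that more than one slot is used when the requirement exceeds the slot size.

```python
def slots_for_task(task_memory_mb, slot_memory_mb):
    """Number of slots one task occupies, rounded up, with a minimum of one."""
    return max(1, (task_memory_mb + slot_memory_mb - 1) // slot_memory_mb)


print(slots_for_task(1024, 1024))  # fits in a single slot
print(slots_for_task(2048, 1024))  # needs two slots
print(slots_for_task(1500, 1024))  # partial slots round up
```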
To enable task memory monitoring you need to set all six of the properties in Table
9-7. The default values are all -1, which means the feature is disabled.
Table 9-7. MapReduce task memory monitoring properties
All six properties are of type int and default to -1.
Map slot size property: The amount of virtual memory, in MB, that defines a map slot. Map tasks that require more than this amount of memory will use more than one map slot.
Reduce slot size property: The amount of virtual memory, in MB, that defines a reduce slot. Reduce tasks that require more than this amount of memory will use more than one reduce slot.
Map task memory property: The amount of virtual memory, in MB, that a map task requires to run. If a map task exceeds this limit it may be terminated and marked as failed.
mapred.job.reduce.memory.mb: The amount of virtual memory, in MB, that a reduce task requires to run. If a reduce task exceeds this limit it may be terminated and marked as failed.
Map task memory maximum property: The maximum limit that users can set the map task memory property to.
Reduce task memory maximum property: The maximum limit that users can set mapred.job.reduce.memory.mb to.
YARN uses a different memory model to the one described here, and the
configuration options are different.
User Account Creation
Once you have a Hadoop cluster up and running, you need to give users access to
it. This involves creating a home directory for each user and setting ownership
permissions on it:
% hadoop fs -mkdir /user/username
% hadoop fs -chown username:username /user/username
This is a good time to set space limits on the directory. The following sets a 1 TB
limit on the given user directory:
% hadoop dfsadmin -setSpaceQuota 1t /user/username
YARN Configuration
YARN is the next-generation architecture for running MapReduce (and is described
in “YARN (MapReduce 2)”). It has a different set of daemons and configuration
options to classic MapReduce (also called MapReduce 1), and in this section we
shall look at these differences and how to run MapReduce on YARN.
Under YARN you no longer run a jobtracker or tasktrackers. Instead, there is a single
resource manager running on the same machine as the HDFS namenode (for small
clusters) or on a dedicated machine, and node managers running on each worker
node in the cluster.
The YARN start script (in the bin directory) starts the YARN daemons in the
cluster. This script will start a resource manager (on the machine the script is run
on), and a node manager on each machine listed in the slaves file.
YARN also has a job history server daemon that provides users with details of past
job runs, and a web app proxy server for providing a secure way for users to access
the UI provided by YARN applications. In the case of MapReduce, the web UI served
by the proxy provides information about the current job you are running, similar to
the one described in “The MapReduce Web UI”. By default the web app proxy server
runs in the same process as the resource manager, but it may be configured to run
as a standalone daemon.
YARN has its own set of configuration files, listed in Table 9-8; these are used in
addition to those in Table 9-1.
Table 9-8. YARN configuration files
YARN environment script (Bash script): Environment variables that are used in the scripts to run YARN.
yarn-site.xml (Hadoop configuration): Configuration settings for YARN daemons: the resource manager, the job history server, the webapp proxy server, and the node managers.
Important YARN Daemon Properties
When running MapReduce on YARN the mapred-site.xml file is still used for general
MapReduce properties, although the jobtracker and tasktracker-related properties
are not used. None of the properties in Table 9-4 are applicable to YARN, except for the child JVM options property (and the related
properties that apply only to map or reduce tasks, respectively).
The JVM options specified in this way are used to launch the YARN child process
that runs map or reduce tasks.
The configuration files in Example 9-4 show some of the important configuration
properties for running MapReduce on YARN.
Example 9-4. An example set of site configuration files for running MapReduce on YARN
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<!-- Not marked as final so jobs can include JVM debugging options -->
</property>
<?xml version="1.0"?>
<!-- yarn-site.xml -->
<name>yarn.nodemanager.resource.memory-mb</name> <value>8192</value>
The YARN resource manager address is controlled via
yarn.resourcemanager.address, which takes the form of a host-port pair. In a client configuration this
property is used to connect to the resource manager (using RPC), and in addition
the MapReduce framework property must be set to yarn for the client to use
YARN rather than the local job runner.
Although YARN does not honor mapred.local.dir, it has an equivalent property called
yarn.nodemanager.local-dirs, which allows you to specify which local disks to store
intermediate data on. It is specified by a comma-separated list of local directory
paths, which are used in a round-robin fashion.
YARN doesn’t have tasktrackers to serve map outputs to reduce tasks, so for this
function it relies on shuffle handlers, which are long-running auxiliary services
running in node managers. Since YARN is a general-purpose service, the shuffle
handlers need to be explicitly enabled in yarn-site.xml by setting the
yarn.nodemanager.aux-services property to mapreduce.shuffle.
Table 9-9 summarizes the important configuration properties for YARN.
Table 9-9. Important YARN daemon properties
yarn.resourcemanager.address (hostname and port): The hostname and port that the resource manager’s RPC server runs on.
yarn.nodemanager.local-dirs (comma-separated directory names; default /tmp/nm-local-dir): A list of directories where node managers allow containers to store intermediate data. The data is cleared out when the application ends.
yarn.nodemanager.aux-services (comma-separated service names): A list of auxiliary services run by the node manager. A service is implemented by the class defined by a corresponding class property. By default no auxiliary services are specified.
yarn.nodemanager.resource.memory-mb (default 8192): The amount of physical memory (in MB) which may be allocated to containers being run by the node manager.
Virtual-to-physical memory ratio property: The ratio of virtual to physical memory for containers. Virtual memory usage may exceed the allocation by this amount.
YARN treats memory in a more fine-grained manner than the slot-based model used
in the classic implementation of MapReduce. Rather than specifying a fixed
maximum number of map and reduce slots that may run on a tasktracker node at
once, YARN allows applications to request an arbitrary amount of memory (within
limits) for a task. In the YARN model, node managers allocate memory from a pool,
so the number of tasks that are running on a particular node depends on the sum of
their memory requirements, and not simply on a fixed number of slots.
The slot-based model can lead to cluster under-utilization, since the proportion of
map slots to reduce slots is fixed as a cluster-wide configuration. However, the
number of map versus reduce slots that are in demand changes over time: at the
beginning of a job only map slots are needed, while at the end of the job only reduce
slots are needed. On larger clusters with many concurrent jobs the variation in
demand for a particular type of slot may be less pronounced, but there is still
wastage. YARN avoids this problem by not distinguishing between the two types of slots.
The considerations for how much memory to dedicate to a node manager for running
containers are similar to those discussed in “Memory”. Each Hadoop daemon
uses 1,000 MB, so for a datanode and a node manager the total is 2,000 MB. Set
aside enough for other processes that are running on the machine, and the
remainder can be dedicated to the node manager’s containers, by setting the
configuration property yarn.nodemanager.resource.memory-mb to the total
allocation in MB. (The default is 8,192 MB.)
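The arithmetic in this paragraph can be written down as a small helper. It is only a sketch under stated assumptions: the function name and the machine size in the example are hypothetical, while the 1,000 MB per daemon figure comes from the text.

```python
def node_manager_memory_mb(machine_mb, other_processes_mb, daemons=2, daemon_mb=1000):
    """Memory left for containers after the Hadoop daemons and other processes.

    daemons=2 covers a datanode and a node manager at 1,000 MB each.
    """
    remainder = machine_mb - daemons * daemon_mb - other_processes_mb
    if remainder <= 0:
        raise ValueError("no memory left for containers")
    return remainder


# Hypothetical machine: 16,384 MB of RAM with 2,000 MB set aside for other processes.
print(node_manager_memory_mb(16384, 2000))  # 12384
```

The result is the value you would consider for yarn.nodemanager.resource.memory-mb on that machine.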
The next step is to determine how to set memory options for individual jobs. There
are two controls: the task JVM options, which allow you to set the JVM heap size
of the map or reduce task; and mapreduce.map.memory.mb (or
mapreduce.reduce.memory.mb), which is used to specify how much memory you
need for map (or reduce) task containers. The latter setting is used by the scheduler
when negotiating for resources in the cluster, and by the node manager, which runs
and monitors the task containers.
For example, suppose that the task JVM heap option is set to -Xmx800m, and mapreduce.map.memory.mb is left at its default value of 1,024 MB. When a map task is
run, the node manager will allocate a 1,024 MB container (decreasing the size of its
pool by that amount for the duration of the task) and launch the task JVM configured
with an 800 MB maximum heap size. Note that the JVM process will have a larger
memory footprint than the heap size, and the overhead will depend on such things
as the native libraries that are in use, the size of the permanent generation space,
and so on. The important thing is that the physical memory used by the JVM
process, including any processes that it spawns, such as Streaming or Pipes
processes, does not exceed its allocation (1,024 MB). If a container uses more
memory than it has been allocated, then it may be terminated by the node manager
and marked as failed.
Schedulers may impose a minimum or maximum on memory allocations. For
example, for the capacity scheduler the default minimum is 1024 MB (set by
yarn.scheduler.capacity.minimum-allocation-mb), and the default maximum is 10240
MB (set by yarn.scheduler.capacity.maximum-allocation-mb).
There are also virtual memory constraints that a container must meet. If a container’s
virtual memory usage exceeds a given multiple of the allocated physical memory,
then the node manager may terminate the process. The multiple is expressed by the
yarn.nodemanager.vmem-pmem-ratio property, which defaults to 2.1. In the example
above, the virtual memory threshold above which the task may be terminated is
2,150 MB, which is 2.1 × 1,024 MB.
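The two limits described above (physical allocation and the virtual memory multiple) can be checked with a few lines of arithmetic. This is a sketch, not the node manager's actual monitoring code; the function names and the usage figures in the example are hypothetical.

```python
def virtual_memory_limit_mb(container_mb, vmem_pmem_ratio=2.1):
    """Virtual memory threshold for a container of the given physical size."""
    return container_mb * vmem_pmem_ratio


def exceeds_limits(physical_used_mb, virtual_used_mb, container_mb, vmem_pmem_ratio=2.1):
    """True if the container is over either its physical or its virtual limit."""
    return (physical_used_mb > container_mb
            or virtual_used_mb > virtual_memory_limit_mb(container_mb, vmem_pmem_ratio))


print(virtual_memory_limit_mb(1024))       # about 2150 MB for a 1,024 MB container
print(exceeds_limits(900, 1800, 1024))     # within both limits
print(exceeds_limits(1100, 1800, 1024))    # over the physical allocation
```

A container flagged by a check like this is a candidate for termination by the node manager, as described in the text.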
When configuring memory parameters it’s very useful to be able to monitor a task’s
actual memory usage during a job run, and this is possible via MapReduce task
counters. The memory counters (see Table 8-2) provide snapshot values of memory usage and are therefore suitable for
observation during the course of a task attempt.
YARN Daemon Addresses and Ports
YARN daemons run one or more RPC servers and HTTP servers, details of which
are covered in Table 9-10 and Table 9-11.
Table 9-10. YARN RPC server properties
The resource manager’s RPC server address and port. This is used by the client (typically outside the cluster) to communicate with the resource manager.
The resource manager’s admin RPC server address and port. This is used by the admin client (invoked with yarn rmadmin, typically run outside the cluster) to communicate with the resource manager.
The resource manager scheduler’s RPC server address and port. This is used by (in-cluster) application masters to communicate with the resource manager.
The resource manager resource tracker’s RPC server address and port. This is used by the (in-cluster) node managers to communicate with the resource manager.
The node manager’s RPC server address and port. This is used by (in-cluster) application masters to communicate with node managers.
The node manager localizer’s RPC server address and port.
The job history server’s RPC server address and port. This is used by the client (typically outside the cluster) to query job history. This property is set in mapred-site.xml.
Table 9-11. YARN HTTP server properties
The resource manager’s HTTP server address and port.
The node manager’s HTTP server address and port.
The web app proxy server’s HTTP server address and port. If not set (the default) then the web app proxy server will run in the resource manager process.
The job history server’s HTTP server address and port. This property is set in mapred-site.xml.
The shuffle handler’s HTTP port number. This is used for serving map outputs, and is not a user-accessible web UI. This property is set in mapred-site.xml.
Kerberos and Hadoop
At a high level, there are three steps that a client must take to access a service when
using Kerberos, each of which involves a message exchange with a server:
1. Authentication. The client authenticates itself to the Authentication Server and
receives a timestamped Ticket-Granting Ticket (TGT).
2. Authorization. The client uses the TGT to request a service ticket from the
Ticket Granting Server.
3. Service Request. The client uses the service ticket to authenticate itself to the
server that is providing the service the client is using. In the case of Hadoop, this
might be the namenode or the jobtracker.
Together, the Authentication Server and the Ticket Granting Server form the Key
Distribution Center (KDC). The process is shown graphically in Figure 9-2.
Figure 9-2. The three-step Kerberos ticket exchange protocol
The authorization and service request steps are not user-level actions: the client performs these steps on the user’s behalf. The authentication step, however, is normally
carried out explicitly by the user using the kinit command, which will prompt for a
password. However, this doesn’t mean you need to enter your password every time
you run a job or access HDFS, since TGTs last for 10 hours by default (and can be
renewed for up to a week). It’s common to automate authentication at operating
system login time, thereby providing single sign-on to Hadoop.
In cases where you don’t want to be prompted for a password (for running an
unattended MapReduce job, for example), you can create a Kerberos keytab file
using the ktutil command. A keytab is a file that stores passwords and may be
supplied to kinit with the -t option.
Hadoop Benchmarks
Hadoop comes with several benchmarks that you can run very easily with minimal
setup cost. Benchmarks are packaged in the test JAR file, and you can get a list of
them, with descriptions, by invoking the JAR file with no arguments:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar
Most of the benchmarks show usage instructions when invoked with no arguments.
For example:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO
Usage: TestFDSIO -read | -write | -clean [-nrFiles N] [-fileSize MB] [-resFile
resultFileName] [-bufferSize Bytes]
Benchmarking HDFS with TestDFSIO
TestDFSIO tests the I/O performance of HDFS. It does this by using a MapReduce
job as a convenient way to read or write files in parallel. Each file is read or written in
a separate map task, and the output of the map is used for collecting statistics
relating to the file just processed. The statistics are accumulated in the reduce to
produce a summary.
The following command writes 10 files of 1,000 MB each:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
At the end of the run, the results are written to the console and also recorded in a
local file (which is appended to, so you can rerun the benchmark and not lose old results):
% cat TestDFSIO_results.log
----- TestDFSIO ----- : write
Date & time: Sun Apr 12 07:14:09 EDT 2009
Number of files: 10
Total MBytes processed: 10000
Throughput mb/sec: 7.796340865378244
Average IO rate mb/sec: 7.8862199783325195
IO rate std deviation: 0.9101254683525547
Test exec time sec: 163.387
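Because the results file is appended to on every run, it can be handy to pull the figures out programmatically. The following is a minimal sketch that parses one results record of the form shown above; the function name parse_results is hypothetical, and the sample text is copied from the listing.

```python
SAMPLE = """\
----- TestDFSIO ----- : write
Date & time: Sun Apr 12 07:14:09 EDT 2009
Number of files: 10
Total MBytes processed: 10000
Throughput mb/sec: 7.796340865378244
Average IO rate mb/sec: 7.8862199783325195
IO rate std deviation: 0.9101254683525547
Test exec time sec: 163.387
"""


def parse_results(text):
    """Turn one TestDFSIO results record into a dict of label -> string value."""
    result = {}
    for line in text.splitlines():
        if line.startswith("-----"):
            # Header line: the operation (read or write) follows the last colon.
            result["operation"] = line.rsplit(":", 1)[1].strip()
        elif ":" in line:
            # Split on the first colon only, so timestamps keep their colons.
            key, value = line.split(":", 1)
            result[key.strip()] = value.strip()
    return result


record = parse_results(SAMPLE)
print(record["operation"], float(record["Throughput mb/sec"]))
```

Values are kept as strings so that the caller decides which fields to convert to numbers.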
The files are written under the /benchmarks/TestDFSIO directory by default (this can
be changed by setting a system property), in a directory called
io_data. To run a read benchmark, use the -read argument. Note that these files
must already exist (having been written by TestDFSIO -write):
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
Here are the results for a real run:
----- TestDFSIO ----- : read
Date & time: Sun Apr 12 07:24:28 EDT 2009
Number of files: 10
Total MBytes processed: 10000
Throughput mb/sec: 80.25553361904304
Average IO rate mb/sec: 98.6801528930664
IO rate std deviation: 36.63507598174921
Test exec time sec: 47.624
When you’ve finished benchmarking, you can delete all the generated files from
HDFS using the -clean argument:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -clean
Hadoop in the Cloud
Although many organizations choose to run Hadoop in-house, it is also popular to
run Hadoop in the cloud on rented hardware or as a service. For instance, Cloudera
offers tools for running Hadoop in a public or private cloud, and Amazon has a
Hadoop cloud service called Elastic MapReduce. In this section, we look at running
Hadoop on Amazon EC2, which is a great way to try out your own Hadoop cluster on
a low-commitment, trial basis.
Hadoop on Amazon EC2
Amazon Elastic Compute Cloud (EC2) is a computing service that allows customers
to rent computers (instances) on which they can run their own applications. A
customer can launch and terminate instances on demand, paying by the hour for
active instances.
Administering Hadoop
The previous chapter was devoted to setting up a Hadoop cluster. In this chapter, we
look at the procedures to keep a cluster running smoothly.
Persistent Data Structures
As an administrator, it is invaluable to have a basic understanding of how the components of HDFS—the namenode, the secondary namenode, and the datanodes—
organize their persistent data on disk. Knowing which files are which can help you
diagnose problems or spot that something is awry.
Namenode directory structure
A newly formatted namenode creates the following directory structure:
current/VERSION
       /edits
       /fsimage
       /fstime
Recall from Chapter 9 that the namenode’s storage directory property is a list of directories, with the
same contents mirrored in each directory. This mechanism provides resilience,
particularly if one of the directories is an NFS mount, as is recommended.
The VERSION file is a Java properties file that contains information about the
version of HDFS that is running. Here are the contents of a typical file:
#Tue Mar 10 19:21:36 GMT 2009
namespaceID=134368441
cTime=0
The layoutVersion is a negative integer that defines the version of HDFS’s persistent
data structures. This version number has no relation to the release number of the
Hadoop distribution. Whenever the layout changes, the version number is
decremented (for example, the version after −18 is −19). When this happens, HDFS
needs to be upgraded, since a newer namenode (or datanode) will not operate if its
storage layout is an older version.
The namespaceID is a unique identifier for the filesystem, which is created when the
filesystem is first formatted. The namenode uses it to identify new datanodes, since
they will not know the namespaceID until they have registered with the namenode.
The cTime property marks the creation time of the namenode’s storage. For newly
formatted storage, the value is always zero, but it is updated to a timestamp
whenever the filesystem is upgraded.
The storageType indicates that this storage directory contains data structures for a
namenode.
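Since VERSION is a Java properties file, it is easy to read from a script. The sketch below parses the simple key=value form and applies the layout-version rule described above. The function names are hypothetical, and the storageType and layoutVersion values in the sample are illustrative assumptions (the −18 is borrowed from the text's example), not values recovered from a real file.

```python
def parse_version(text):
    """Parse the simple key=value lines of a VERSION file, skipping # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def needs_upgrade(stored_layout, software_layout):
    """Layout versions are negative and decrease over time, so storage is
    older than the software when its number is larger (closer to zero)."""
    return stored_layout > software_layout


SAMPLE = """#Tue Mar 10 19:21:36 GMT 2009
namespaceID=134368441
cTime=0
storageType=NAME_NODE
layoutVersion=-18
"""

props = parse_version(SAMPLE)
print(props["namespaceID"], needs_upgrade(int(props["layoutVersion"]), -19))
```

With storage at layout −18 and software expecting −19, the check reports that an upgrade is required.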
The other files in the namenode’s storage directory are edits, fsimage, and fstime.
These are all binary files, which use Hadoop Writable objects as their serialization
format (see “Serialization”). To understand what these files are for, we need to dig
into the workings of the namenode a little more.
The filesystem image and edit log
When a filesystem client performs a write operation (such as creating or moving a
file), it is first recorded in the edit log. The namenode also has an in-memory
representation of the filesystem metadata, which it updates after the edit log has
been modified. The in-memory metadata is used to serve read requests.
The edit log is flushed and synced after every write before a success code is
returned to the client. For namenodes that write to multiple directories, the write must
be flushed and synced to every copy before returning successfully. This ensures that
no operation is lost due to machine failure.
The fsimage file is a persistent checkpoint of the filesystem metadata. However, it is
not updated for every filesystem write operation, since writing out the fsimage file,
which can grow to be gigabytes in size, would be very slow. This does not
compromise resilience, however, because if the namenode fails, then the latest state
of its metadata can be reconstructed by loading the fsimage from disk into memory,
then applying each of the operations in the edit log.
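The recovery procedure (load the checkpoint, then replay the log) can be sketched with ordinary dictionaries. This is a toy model, not the namenode's real data structures: the operation tuples, the function names, and the paths are all hypothetical.

```python
def apply_edit(namespace, op):
    """Apply one logged operation to the in-memory namespace (a dict keyed by path)."""
    kind = op[0]
    if kind == "create":
        _, path, replication = op
        namespace[path] = {"replication": replication}
    elif kind == "rename":
        _, src, dst = op
        namespace[dst] = namespace.pop(src)
    elif kind == "delete":
        _, path = op
        del namespace[path]
    else:
        raise ValueError("unknown operation: %r" % (kind,))


def recover(fsimage, edits):
    """Rebuild the latest metadata: start from the checkpoint, then replay the edit log in order."""
    namespace = dict(fsimage)
    for op in edits:
        apply_edit(namespace, op)
    return namespace


fsimage = {"/a": {"replication": 3}}
edits = [("create", "/b", 2), ("rename", "/a", "/c"), ("delete", "/b")]
print(recover(fsimage, edits))  # {'/c': {'replication': 3}}
```

The longer the edit list, the longer the replay takes, which is the motivation for checkpointing.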
The fsimage file contains a serialized form of all the directory and file inodes in the
filesystem. Each inode is an internal representation of a file or directory’s metadata
and contains such information as the file’s replication level, modification and access
times, access permissions, block size, and the blocks a file is made up of. For
directories, the modification time, permissions, and quota metadata is stored. The
fsimage file does not record the datanodes on which the blocks are stored. Instead
the namenode keeps this mapping in memory, which it constructs by asking the
datanodes for their block lists when they join the cluster and periodically afterward to
ensure the namenode’s block mapping is up-to-date.
As described, the edits file would grow without bound. Though this state of affairs
would have no impact on the system while the namenode is running, if the
namenode were restarted, it would take a long time to apply each of the operations
in its (very long) edit log. During this time, the filesystem would be offline, which is
generally undesirable.
The solution is to run the secondary namenode, whose purpose is to produce checkpoints of the primary’s in-memory filesystem metadata. The checkpointing process
proceeds as follows (and is shown schematically in Figure 10-1):
1. The secondary asks the primary to roll its edits file, so new edits go to a new file.
2. The secondary retrieves fsimage and edits from the primary (using HTTP GET).
3. The secondary loads fsimage into memory, applies each operation from edits,
then creates a new consolidated fsimage file.
4. The secondary sends the new fsimage back to the primary (using HTTP POST).
5. The primary replaces the old fsimage with the new one from the secondary, and
the old edits file with the new one it started in step 1. It also updates the fstime
file to record the time that the checkpoint was taken.
At the end of the process, the primary has an up-to-date fsimage file and a shorter
edits file (it is not necessarily empty, as it may have received some edits while the
checkpoint was being taken). It is possible for an administrator to run this process
while the namenode is in safe mode, using the hadoop dfsadmin -saveNamespace command.
Figure 10-1. The checkpointing process
This procedure makes it clear why the secondary has similar memory requirements
to the primary (since it loads the fsimage into memory), which is the reason that the
secondary needs a dedicated machine on large clusters. The schedule for
checkpointing is controlled by two configuration parameters. The secondary
namenode checkpoints every hour (fs.checkpoint.period in seconds) or sooner if the
edit log has reached 64 MB (fs.checkpoint.size in bytes), which it checks every five
minutes.
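The scheduling rule amounts to a simple either-or test. Here is a minimal sketch, assuming the defaults quoted above (one hour and 64 MB); the function name checkpoint_due is hypothetical and this is not the secondary namenode's actual code.

```python
def checkpoint_due(seconds_since_last, edits_bytes,
                   period_seconds=3600, size_bytes=64 * 1024 * 1024):
    """A checkpoint is due when the period has elapsed or the edit log is large enough."""
    return seconds_since_last >= period_seconds or edits_bytes >= size_bytes


print(checkpoint_due(600, 10 * 1024 * 1024))    # False: neither limit reached
print(checkpoint_due(600, 70 * 1024 * 1024))    # True: edit log exceeds 64 MB
print(checkpoint_due(3600, 0))                  # True: an hour has passed
```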
Secondary namenode directory structure
A useful side effect of the checkpointing process is that the secondary has a
checkpoint at the end of the process, which can be found in a subdirectory called
previous.checkpoint. This can be used as a source for making (stale) backups of the
namenode’s metadata:
${fs.checkpoint.dir}/current/VERSION
                            /edits
                            /fsimage
                            /fstime
The layout of this directory and of the secondary’s current directory is identical to the
namenode’s. This is by design, since in the event of total namenode failure (when
there are no recoverable backups, even from NFS), it allows recovery from a
secondary namenode. This can be achieved either by copying the relevant storage
directory to a new namenode, or, if the secondary is taking over as the new primary
namenode, by using the -importCheckpoint option when starting the namenode
daemon. The -importCheckpoint option will load the namenode metadata from the
latest checkpoint in the directory defined by the fs.checkpoint.dir property, but only if
there is no metadata in the namenode’s own storage directory, so there is no risk of overwriting
precious metadata.
Datanode directory structure
Unlike namenodes, datanodes do not need to be explicitly formatted, since they
create their storage directories automatically on startup. Here are the key files and
directories:
${}/current/VERSION
           /blk_<id_1>
           /blk_<id_1>.meta
           /blk_<id_2>
           /blk_<id_2>.meta
           /...
           /blk_<id_64>
           /blk_<id_64>.meta
           /subdir0/
A datanode’s VERSION file is very similar to the namenode’s:
#Tue Mar 10 21:32:31 GMT 2009
storageID=DS-547717739-
cTime=0
The namespaceID, cTime, and layoutVersion are all the same as the values in the
namenode (in fact, the namespaceID is retrieved from the namenode when the
datanode first connects). The storageID is unique to the datanode (it is the same
across all storage directories) and is used by the namenode to uniquely identify the
datanode. The storageType identifies this directory as a datanode storage directory.
The other files in the datanode’s current storage directory are the files with the blk_
prefix. There are two types: the HDFS blocks themselves (which just consist of the
file’s raw bytes) and the metadata for a block (with a .meta suffix). A block file just
consists of the raw bytes of a portion of the file being stored; the metadata file is
made up of a header with version and type information, followed by a series of
checksums for sections of the block.
When the number of blocks in a directory grows to a certain size, the datanode
creates a new subdirectory in which to place new blocks and their accompanying
metadata. It creates a new subdirectory every time the number of blocks in a
directory reaches 64 (set by the dfs.datanode.numblocks configuration property).
The effect is to have a tree with high fan-out, so even for systems with a very large
number of blocks, the directories will only be a few levels deep. By taking this
measure, the datanode ensures that there is a manageable number of files per
directory, which avoids the problems that most operating systems encounter when
there are a large number of files (tens or hundreds of thousands) in a single
directory.
If the storage directory configuration property specifies multiple directories (on different
drives), blocks are written to each in a round-robin fashion. Note that blocks are not
replicated on each drive on a single datanode: block replication is across distinct
datanodes.
When the namenode starts, the first thing it does is load its image file (fsimage) into
memory and apply the edits from the edit log (edits). Once it has reconstructed a
consistent in-memory image of the filesystem metadata, it creates a new fsimage
file (effectively doing the checkpoint itself, without recourse to the secondary
namenode) and an empty edit log. Only at this point does the namenode start
listening for RPC and HTTP requests. However, the namenode is running in safe
mode, which means that it offers only a read-only view of the filesystem to clients.
Strictly speaking, in safe mode, only filesystem operations that access the filesystem
metadata (like producing a directory listing) are guaranteed to work. Reading a file
will work only if the blocks are available on the current set of datanodes in the
cluster; and file modifications (writes, deletes, or renames) will always fail.
Recall that the locations of blocks in the system are not persisted by the
namenode; this information resides with the datanodes, in the form of a list of the
blocks each is storing. During normal operation of the system, the namenode has a map
of block locations stored in memory. Safe mode is needed to give the datanodes
time to check in to the namenode with their block lists, so the namenode can be
informed of enough block locations to run the filesystem effectively. If the namenode
didn’t wait for enough datanodes to check in, then it would start the process of
replicating blocks to new datanodes, which would be unnecessary in most cases
(since it only needed to wait for the extra datanodes to check in), and would put a
great strain on the cluster’s resources. Indeed, while in safe mode, the namenode
does not issue any block replication or deletion instructions to datanodes.
Safe mode is exited when the minimal replication condition is reached, plus an
extension time of 30 seconds. The minimal replication condition is when 99.9% of
the blocks in the whole filesystem meet their minimum replication level (which
defaults to one, and is set by dfs.replication.min, see Table 10-1).
When you are starting a newly formatted HDFS cluster, the namenode does not go
into safe mode since there are no blocks in the system.
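The minimal replication condition can be expressed as a short predicate. This is a sketch of the rule as described, not the namenode's implementation; the function name and the replica counts are hypothetical, and the 30-second extension is not modeled.

```python
def minimal_replication_reached(replica_counts, min_replication=1, threshold=0.999):
    """replica_counts holds the number of reported replicas for each block.

    Returns True when the required proportion of blocks meets the minimum
    replication level. With no blocks at all (a newly formatted filesystem)
    there is nothing to wait for.
    """
    if not replica_counts:
        return True
    satisfied = sum(1 for count in replica_counts if count >= min_replication)
    return satisfied / len(replica_counts) >= threshold


print(minimal_replication_reached([3] * 1000))             # True
print(minimal_replication_reached([3] * 998 + [0, 0]))     # False: only 99.8% reported
print(minimal_replication_reached([]))                     # True
```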
Table 10-1. Safe mode properties
Property name (type; default value) and description
dfs.replication.min (default 1) The minimum number of replicas that have to be written for a write to be successful.
dfs.safemode.threshold.pct (float; default 0.999) The proportion of blocks in the system that meet the minimum replication level defined by dfs.replication.min before the namenode will exit safe mode. Setting this value to 0 or less forces the namenode not to start in safe mode. Setting this value to more than 1 means the namenode never exits safe mode.
(default 30,000) The time, in milliseconds, to extend safe mode by after the minimum replication condition defined by dfs.safemode.threshold.pct has been satisfied.
Entering and leaving safe mode
To see whether the namenode is in safe mode, you can use the dfsadmin command:
% hadoop dfsadmin -safemode get
Safe mode is ON
The front page of the HDFS web UI provides another indication of whether the
namenode is in safe mode. Sometimes you want to wait for the namenode to exit
safe mode before carrying out a command, particularly in scripts. The wait option
achieves this:
hadoop dfsadmin -safemode wait
# command to read or write a file
An administrator has the ability to make the namenode enter or leave safe mode at
any time. It is sometimes necessary to do this when carrying out maintenance on the
cluster or after upgrading a cluster to confirm that data is still readable. To enter safe
mode, use the following command:
% hadoop dfsadmin -safemode enter
Safe mode is ON
You can use this command when the namenode is still in safe mode while starting
up to ensure that it never leaves safe mode. Another way of making sure that the
namenode stays in safe mode indefinitely is to set the property
dfs.safemode.threshold.pct to a value over one.
You can make the namenode leave safe mode by using:
% hadoop dfsadmin -safemode leave
Safe mode is OFF
Audit Logging
HDFS has the ability to log all filesystem access requests, a feature that some
organizations require for auditing purposes. Audit logging is implemented using
log4j logging at the INFO level, and in the default configuration it is disabled, as the
log threshold is set to WARN in the log4j configuration.
You can enable audit logging by replacing WARN with INFO, and the result will be a
log line written to the namenode’s log for every HDFS event. Here’s an example for a
list status request on /user/tom:
2009-03-13 07:11:22,982 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
ugi=tom,staff,admin ip=/ cmd=listStatus src=/user/tom dst=null
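Audit lines are easy to post-process because the payload is a series of key=value pairs. The following sketch splits one such line into a dictionary; the function name is hypothetical, and the sample is the line shown above joined onto one line, with the client address left as it appears there.

```python
import re

LINE = ("2009-03-13 07:11:22,982 INFO "
        "org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: "
        "ugi=tom,staff,admin ip=/ cmd=listStatus src=/user/tom dst=null")


def parse_audit(line):
    """Return the key=value fields that follow the 'audit: ' logger prefix."""
    _, _, payload = line.partition("audit: ")
    # Each field is a word, an equals sign, then a run of non-space characters.
    return dict(re.findall(r"(\w+)=(\S*)", payload))


fields = parse_audit(LINE)
print(fields["cmd"], fields["src"])  # listStatus /user/tom
```

A script along these lines could, for example, count commands per user over a day's log.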
The dfsadmin tool is a multipurpose tool for finding information about the state of
HDFS, as well as performing administration operations on HDFS. It is invoked as
hadoop dfsadmin and requires superuser privileges. Some of the available
commands to dfsadmin are described in Table 10-2. Use the -help command to get
more information.
Table 10-2. dfsadmin commands
Command and description
-help: Shows help for a given command, or all commands if no command is specified.
Shows filesystem statistics (similar to those shown in the web UI) and information on connected datanodes.
-metasave: Dumps information to a file in Hadoop’s log directory about blocks that are being replicated or deleted, and a list of connected datanodes.
-safemode: Changes or queries the state of safe mode. See “Safe Mode” on page 342.
-saveNamespace: Saves the current in-memory filesystem image to a new fsimage file and resets the edits file. This operation may be performed only in safe mode.
Updates the set of datanodes that are permitted to connect to the namenode. See “Commissioning and Decommissioning Nodes” on page 357.
-upgradeProgress: Gets information on the progress of an HDFS upgrade or forces an upgrade to proceed. See “Upgrades” on page 360.
Removes the previous version of the datanodes’ and namenode’s storage directories. Used after an upgrade has been applied and the cluster is running successfully on the new version. See “Upgrades” on page 360.
Sets directory quotas. Directory quotas set a limit on the number of names (files or directories) in the directory tree. Directory quotas are useful for preventing users from creating large numbers of small files, a measure that helps preserve the namenode’s memory (recall that accounting information for every file, directory, and block in the filesystem is stored in memory).
Clears specified directory quotas.
Sets space quotas on directories. Space quotas set a limit on the size of files that may be stored in a directory tree. They are useful for giving users a limited amount of storage.
Clears specified space quotas.
-refreshServiceAcl: Refreshes the namenode’s service-level authorization policy file.
Filesystem check (fsck)
Hadoop provides an fsck utility for checking the health of files in HDFS. The tool
looks for blocks that are missing from all datanodes, as well as under- or over-replicated blocks. Here is an example of checking the whole filesystem for a small cluster:
% hadoop fsck /
......................Status: HEALTHY
 Total size: 511799225 B
 Total dirs:
 Total files:
 Total blocks (validated): 22 (avg. block size 23263601 B)
 Minimally replicated blocks: 22 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor:
 Average block replication:
 Corrupt blocks:
 Missing replicas: 0 (0.0 %)
 Number of data-nodes:
 Number of racks:
The filesystem under path '/' is HEALTHY
fsck recursively walks the filesystem namespace, starting at the given path (here the
filesystem root), and checks the files it finds. It prints a dot for every file it checks. To
check a file, fsck retrieves the metadata for the file’s blocks and looks for problems
or inconsistencies. Note that fsck retrieves all of its information from the namenode;
it does not communicate with any datanodes to actually retrieve any block data. Most
of the output from fsck is self-explanatory, but here are some of the conditions it
looks for:
Over-replicated blocks
These are blocks that exceed their target replication for the file they belong to.
Over-replication is not normally a problem, and HDFS will automatically delete
excess replicas.
Under-replicated blocks
These are blocks that do not meet their target replication for the file they belong
to. HDFS will automatically create new replicas of under-replicated blocks until
they meet the target replication. You can get information about the blocks being
replicated (or waiting to be replicated) using hadoop dfsadmin -metasave.
Misreplicated blocks
These are blocks that do not satisfy the block replica placement policy (see
“Replica Placement” on page 74). For example, for a replication level of three in a
multirack cluster, if all three replicas of a block are on the same rack, then the
block is mis-replicated since the replicas should be spread across at least two
racks for resilience. HDFS will automatically re-replicate misreplicated blocks so
that they satisfy the rack placement policy.
Corrupt blocks
These are blocks whose replicas are all corrupt. Blocks with at least one
noncorrupt replica are not reported as corrupt; the namenode will replicate the
noncorrupt replica until the target replication is met.
Missing replicas
These are blocks with no replicas anywhere in the cluster.
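The conditions above can be summarized as a small classification function. This is a simplified sketch of the definitions, not fsck's actual logic: the function name and the rack names are hypothetical, and the placement check assumes only the rule from the example (replicas of a block in a multirack cluster should span at least two racks).

```python
def classify_block(replicas, target_replication, cluster_racks):
    """replicas is a list of (rack, is_corrupt) tuples, one per stored replica."""
    if not replicas:
        return ["missing"]
    healthy_racks = [rack for rack, is_corrupt in replicas if not is_corrupt]
    if not healthy_racks:
        return ["corrupt"]            # every replica is corrupt
    findings = []
    if len(healthy_racks) > target_replication:
        findings.append("over-replicated")
    elif len(healthy_racks) < target_replication:
        findings.append("under-replicated")
    # Assumed placement rule: in a multirack cluster, use at least two racks.
    if cluster_racks > 1 and len(healthy_racks) > 1 and len(set(healthy_racks)) < 2:
        findings.append("mis-replicated")
    return findings or ["ok"]


print(classify_block([("/rack1", False)] * 3, 3, cluster_racks=2))   # ['mis-replicated']
print(classify_block([("/rack1", False), ("/rack2", False)], 3, 2))  # ['under-replicated']
print(classify_block([("/rack1", True)], 3, 2))                      # ['corrupt']
```

Note that a block with one good replica and one corrupt replica is reported as under-replicated rather than corrupt, matching the description above.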
Corrupt or missing blocks are the biggest cause for concern, as they mean data has
been lost. By default, fsck leaves files with corrupt or missing blocks, but you can tell
it to perform one of the following actions on them:
Move the affected files to the /lost+found directory in HDFS, using the -move
option. Files are broken into chains of contiguous blocks to aid any salvaging
efforts you may attempt.
Delete the affected files, using the -delete option. Files cannot be recovered after
being deleted.
Finding the blocks for a file. The fsck tool provides an easy way to find out which
blocks are in any particular file. For example:
% hadoop fsck /user/tom/part-00007 -files -blocks -racks
/user/tom/part-00007 25582428 bytes, 1 block(s): OK
0. blk_-3724870485760122836_1035 len=25582428 repl=3 [/default-rack/, /default-rack/, /default-rack/]
This says that the file /user/tom/part-00007 is made up of one block and shows the
datanodes where the blocks are located. The fsck options used are as follows:
• The -files option shows the line with the filename, size, number of blocks, and its
health (whether there are any missing blocks).
• The -blocks option shows information about each block in the file, one line per
block.
• The -racks option displays the rack location and the datanode addresses for
each block.
Running hadoop fsck without any arguments displays full usage instructions.
Datanode block scanner
Every datanode runs a block scanner, which periodically verifies all the blocks
stored on the datanode. This allows bad blocks to be detected and fixed before they
are read by clients. The DataBlockScanner maintains a list of blocks to verify and
scans them one by one for checksum errors. The scanner employs a throttling
mechanism to preserve disk bandwidth on the datanode.
Blocks are verified every three weeks to guard against disk errors over
time (this is controlled by the dfs.datanode.scan.period.hours property, which
defaults to 504 hours). Corrupt blocks are reported to the namenode to be fixed.
You can get a block verification report for a datanode by visiting the datanode’s web
interface at http://datanode:50075/blockScannerReport. Here’s an example of a
report, which should be self-explanatory:
Total Blocks
Verified in last hour
Verified in last day
Verified in last week
Verified in last four weeks
Verified in SCAN_PERIOD
Not yet verified
Verified since restart
By specifying the listblocks parameter, http://datanode:50075/blockScannerReport?listblocks,
the report is preceded by a list of all the blocks on the datanode along
with their latest verification status.
The first column is the block ID, followed by some key-value pairs. The status can
be one of failed or ok according to whether the last scan of the block detected a
checksum error. The type of scan is local if it was performed by the background
thread, remote if it was performed by a client or a remote datanode, or none if a scan
of this block has yet to be made. The last piece of information is the scan time, which
is displayed as the number of milliseconds since midnight 1 January 1970, and also
as a more readable value.
Over time, the distribution of blocks across datanodes can become unbalanced. An
unbalanced cluster can affect locality for MapReduce, and it puts a greater strain on
the highly utilized datanodes, so it’s best avoided.
The balancer program is a Hadoop daemon that re-distributes blocks by moving
them from over-utilized datanodes to under-utilized datanodes, while adhering to the
block replica placement policy that makes data loss unlikely by placing block replicas
on different racks (see “Replica Placement”). It moves blocks until the cluster is
deemed to be balanced, which means that the utilization of every datanode (ratio of
used space on the node to total capacity of the node) differs from the utilization of
the cluster (ratio of used space on the cluster to total capacity of the cluster) by no
more than a given threshold percentage. You can start the balancer with:
The -threshold argument specifies the threshold percentage that defines what it
means for the cluster to be balanced. The flag is optional; if it is omitted, the
threshold is 10%. At any one time, only one balancer may be running on the cluster.
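The definition of "balanced" given above can be sketched in a few lines. This is only an illustration of the rule as stated in the text, not the balancer's actual code; the function name and the (used, capacity) tuple layout are assumptions.

```python
def is_balanced(nodes, threshold_pct=10.0):
    """Check the balance rule described in the text.

    nodes: list of (used_bytes, capacity_bytes) pairs, one per datanode.
    The cluster counts as balanced when every node's utilization differs
    from the cluster-wide utilization by no more than threshold_pct.
    """
    total_used = sum(used for used, _ in nodes)
    total_capacity = sum(capacity for _, capacity in nodes)
    cluster_util = 100.0 * total_used / total_capacity
    return all(
        abs(100.0 * used / capacity - cluster_util) <= threshold_pct
        for used, capacity in nodes
    )

# Cluster utilization is 50%; node utilizations are 50%, 60%, and 40%.
print(is_balanced([(50, 100), (60, 100), (40, 100)]))  # within 10 points
# Cluster utilization is 50%; node utilizations are 90% and 10%.
print(is_balanced([(90, 100), (10, 100)]))             # 40 points away
```

Raising the threshold makes the test easier to satisfy, so the balancer would stop sooner; lowering it asks for a more even distribution at the cost of moving more blocks.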
The balancer runs until the cluster is balanced, it cannot move any more blocks, or it
loses contact with the namenode. It produces a logfile in the standard log directory,
where it writes a line for every iteration of redistribution that it carries out. Here is the
output from a short run on a small cluster:
Time Stamp   Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
5:23:42 PM   0           0 KB                 219.21 MB           150.29 MB
5:27:14 PM   1           195.24 MB            22.45 MB            150.29 MB
The cluster is balanced. Exiting...
The balancer is designed to run in the background without unduly taxing the cluster
or interfering with other clients using the cluster. It limits the bandwidth that it uses to
copy a block from one node to another. The default is a modest 1 MB/s, but this can
be changed by setting the dfs.balance.bandwidthPerSec property in hdfs-site.xml,
specified in bytes.
Monitoring is an important part of system administration. In this section, we look at
the monitoring facilities in Hadoop and how they can hook into external monitoring
systems.
The purpose of monitoring is to detect when the cluster is not providing the expected
level of service. The master daemons are the most important to monitor: the
namenodes (primary and secondary) and the jobtracker. Failure of datanodes and
tasktrackers is to be expected, particularly on larger clusters, so you should provide
extra capacity so that the cluster can tolerate having a small percentage of dead
nodes at any time.
In addition to the facilities described next, some administrators run test jobs on a
periodic basis as a test of the cluster’s health. There is a lot of work going on to add
more monitoring capabilities to Hadoop, which is not covered here. For example,
Chukwa is a data collection and monitoring system built on HDFS and MapReduce,
and excels at mining log data for finding large-scale trends.
All Hadoop daemons produce logfiles that can be very useful for finding out what is
happening in the system. “System logfiles” explains how to configure
these files.
Setting log levels
When debugging a problem, it is very convenient to be able to change the log level
temporarily for a particular component in the system.
Hadoop daemons have a web page for changing the log level for any log4j log name,
which can be found at /logLevel in the daemon’s web UI. By convention, log names
in Hadoop correspond to the classname doing the logging, although there are
exceptions to this rule, so you should consult the source code to find log names.
For example, to enable debug logging for the JobTracker class, we would visit the
jobtracker’s web UI at http://jobtracker-host:50030/logLevel and set the log name
org.apache.hadoop.mapred.JobTracker to level DEBUG.
The same thing can be achieved from the command line as follows:
hadoop daemonlog -setlevel jobtracker-host:50030 \
org.apache.hadoop.mapred.JobTracker DEBUG
Log levels changed in this way are reset when the daemon restarts, which is usually
what you want. However, to make a persistent change to a log level, simply change
the file in the configuration directory. In this case, the line to add is:
For HDFS, the rules are slightly different. If a datanode appears in both the include
and the exclude file, then it may connect, but only to be decommissioned. Table 10-4
summarizes the different combinations for datanodes. As for tasktrackers, an
unspecified or empty include file means all nodes are included.
Table 10-4. HDFS include and exclude file precedence
Node appears in include file  Node appears in exclude file  Interpretation
No                            No                            Node may not connect.
No                            Yes                           Node may not connect.
Yes                           No                            Node may connect.
Yes                           Yes                           Node may connect and will be decommissioned.
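The precedence rules for datanodes can be expressed as a small decision function. This is a minimal sketch of the rules as described in the text, not namenode code; both function names are hypothetical, and hosts are modeled as plain strings.

```python
def effectively_included(host, include_hosts):
    """An unspecified or empty include file means all nodes are included."""
    return not include_hosts or host in include_hosts

def datanode_admission(host, include_hosts, exclude_hosts):
    """Return the interpretation for a datanode, following the HDFS rules."""
    if not effectively_included(host, include_hosts):
        return "may not connect"
    if host in exclude_hosts:
        return "may connect and will be decommissioned"
    return "may connect"

include = {"node1", "node2"}
exclude = {"node2", "node3"}
for host in ("node1", "node2", "node3"):
    print(host, "->", datanode_admission(host, include, exclude))
```

Note that the include file is checked first, which is why a node that appears only in the exclude file is refused rather than decommissioned.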
To remove nodes from the cluster:
1. Add the network addresses of the nodes to be decommissioned to the exclude
file. Do not update the include file at this point.
2. Update the namenode with the new set of permitted datanodes, with this command:
hadoop dfsadmin -refreshNodes
3. Update the jobtracker with the new set of permitted tasktrackers using:
hadoop mradmin -refreshNodes
4. Go to the web UI and check whether the admin state has changed to
“Decommission In Progress” for the datanodes being decommissioned. They will start
copying their blocks to other datanodes in the cluster.
5. When all the datanodes report their state as “Decommissioned,” then all the blocks
have been replicated. Shut down the decommissioned nodes.
6. Remove the nodes from the include file, and run:
hadoop dfsadmin -refreshNodes
hadoop mradmin -refreshNodes
7. Remove the nodes from the slaves file.
Upgrading an HDFS and MapReduce cluster requires careful planning. The most important consideration is the HDFS upgrade. If the layout version of the filesystem has
changed, then the upgrade will automatically migrate the filesystem data and
metadata to a format that is compatible with the new version. As with any procedure
that involves data migration, there is a risk of data loss, so you should be sure that
both your data and metadata are backed up.
Part of the planning process should include a trial run on a small test cluster with a
copy of data that you can afford to lose. A trial run will allow you to familiarize
yourself with the process, customize it to your particular cluster configuration and
toolset, and iron out any snags before running the upgrade procedure on a
production cluster. A test cluster also has the benefit of being available to test client
upgrades on.
Upgrading a cluster when the filesystem layout has not changed is fairly
straightforward: install the new versions of HDFS and MapReduce on the cluster
(and on clients at the same time), shut down the old daemons, update configuration
files, then start up the new daemons and switch clients to use the new libraries. This
process is reversible, so rolling back an upgrade is also straightforward.
After every successful upgrade, you should perform a couple of final cleanup steps:
• Remove the old installation and configuration files from the cluster.
• Fix any deprecation warnings in your code and configuration.
HDFS data and metadata upgrades
If you use the procedure just described to upgrade to a new version of HDFS and it
expects a different layout version, then the namenode will refuse to run. A message
like the following will appear in its log:
File system image contains an old layout version -16.
An upgrade to version -18 is required.
Please restart NameNode with -upgrade option.
The most reliable way of finding out whether you need to upgrade the filesystem is
by performing a trial on a test cluster.
An upgrade of HDFS makes a copy of the previous version’s metadata and data.
Doing an upgrade does not double the storage requirements of the cluster, as the
datanodes use hard links to keep two references (for the current and previous
version) to the same block of data. This design makes it straightforward to roll back
to the previous version of the filesystem, should you need to. You should understand
that any changes made to the data on the upgraded system will be lost after the
rollback completes.
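To see why hard links avoid doubling storage, consider this small sketch. It is not how the datanode is implemented (that code is Java); the directory names mirror the text, and the block file name blk_1 is a made-up example. It assumes a filesystem that supports hard links.

```python
import os
import tempfile

def hard_link_demo():
    """Create one block file and hard-link it from a second directory."""
    with tempfile.TemporaryDirectory() as storage:
        previous = os.path.join(storage, "previous")
        current = os.path.join(storage, "current")
        os.mkdir(previous)
        os.mkdir(current)

        old_block = os.path.join(previous, "blk_1")  # hypothetical block file
        with open(old_block, "wb") as f:
            f.write(b"block data")

        new_block = os.path.join(current, "blk_1")
        os.link(old_block, new_block)  # a second name for the same data on disk

        same_file = os.path.samefile(old_block, new_block)
        link_count = os.stat(old_block).st_nlink
        return same_file, link_count

print(hard_link_demo())
```

Both directory entries refer to the same underlying file, so the block's bytes are stored once while being reachable from both the current and the previous version.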
You can keep only the previous version of the filesystem: you can’t roll back several
versions. Therefore, to carry out another upgrade to HDFS data and metadata, you
will need to delete the previous version, a process called finalizing the upgrade.
Once an upgrade is finalized, there is no procedure for rolling back to a previous
version.
In general, you can skip releases when upgrading (for example, you can upgrade
from release 0.18.3 to 0.20.0 without having to upgrade to a 0.19.x release first), but
in some cases, you may have to go through intermediate releases. The release
notes make it clear when this is required.
You should only attempt to upgrade a healthy filesystem. Before running the
upgrade, do a full fsck (see “Filesystem check (fsck)”). As an extra precaution, you
can keep a copy of the fsck output that lists all the files and blocks in the system, so
you can compare it with the output of running fsck after the upgrade.
It’s also worth clearing out temporary files before doing the upgrade, both from the
MapReduce system directory on HDFS and local temporary files.
With these preliminaries out of the way, here is the high-level procedure for
upgrading a cluster
when the filesystem layout needs to be migrated:
1. Make sure that any previous upgrade is finalized before proceeding with another
upgrade.
2. Shut down MapReduce and kill any orphaned task processes on the
tasktrackers.
3. Shut down HDFS and back up the namenode directories.
4. Install new versions of Hadoop HDFS and MapReduce on the cluster and on
clients.
5. Start HDFS with the -upgrade option.
6. Wait until the upgrade is complete.
7. Perform some sanity checks on HDFS.
8. Start MapReduce.
9. Roll back or finalize the upgrade (optional).
While running the upgrade procedure, it is a good idea to remove the Hadoop scripts
from your PATH environment variable. This forces you to be explicit about which
version of the scripts you are running. It can be convenient to define two
environment variables for the new installation directories; in the following
instructions, we have defined OLD_HADOOP_INSTALL and NEW_HADOOP_INSTALL.
Start the upgrade. To perform the upgrade, run the following command (this is step
5 in the high-level upgrade procedure):
% $NEW_HADOOP_INSTALL/bin/ -upgrade
This causes the namenode to upgrade its metadata, placing the previous version in
a new directory called previous:
${}/current/VERSION /edits
/fsimage /fstime
Similarly, datanodes upgrade their storage directories, preserving the old copy in a
directory called previous.
Wait until the upgrade is complete. The upgrade process is not instantaneous, but
you can check the progress of an upgrade using dfsadmin (step 6); upgrade events
also appear in the daemons’ logfiles:
% $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -upgradeProgress status
Upgrade for version -18 has been completed.
Upgrade is not finalized.
Check the upgrade. This shows that the upgrade is complete. At this stage, you
should run some sanity checks (step 7) on the filesystem (check files and blocks
using fsck, basic file operations). You might choose to put HDFS into safe mode
while you are running some of these checks (the ones that are read-only) to prevent
others from making changes.
Roll back the upgrade (optional). If you find that the new version is not working
correctly, you may choose to roll back to the previous version (step 9). This is only
possible if you have not finalized the upgrade. A rollback reverts the filesystem state
to before the upgrade was performed, so any changes made in the meantime will be
lost. In other words, it rolls back to the previous state of the filesystem, rather than
downgrading the current state of the filesystem to a former version.
First, shut down the new daemons:
Then start up the old version of HDFS with the -rollback option:
% $OLD_HADOOP_INSTALL/bin/ -rollback
This command gets the namenode and datanodes to replace their current storage
directories with their previous copies. The filesystem will be returned to its previous
state.
Finalize the upgrade (optional). When you are happy with the new version of
HDFS, you can finalize the upgrade (step 9) to remove the previous storage
directories. After an upgrade has been finalized, there is no way to roll back to the
previous version.
This step is required before performing another upgrade:
% $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -finalizeUpgrade
% $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -upgradeProgress status
There are no upgrades in progress.
HDFS is now fully upgraded to the new version.
Pig raises the level of abstraction for processing large datasets. MapReduce allows
you, the programmer, to specify a map function followed by a reduce function, but
working out how to fit your data processing into this pattern, which often requires
multiple MapReduce stages, can be a challenge. With Pig, the data structures are
much richer, typically being multivalued and nested, and the transformations
you can apply to the data are much more powerful. They include joins, for example,
which are not for the faint of heart in MapReduce.
Pig is made up of two pieces:
• The language used to express data flows, called Pig Latin.
• The execution environment to run Pig Latin programs. There are currently two
environments: local execution in a single JVM and distributed execution on a
Hadoop cluster.
A Pig Latin program is made up of a series of operations, or transformations, that are
applied to the input data to produce output. Taken as a whole, the operations
describe a data flow, which the Pig execution environment translates into an
executable representation and then runs. Under the covers, Pig turns the
transformations into a series of MapReduce jobs, but as a programmer you are
mostly unaware of this, which allows you to focus on the data rather than the nature
of the execution.
Pig is a scripting language for exploring large datasets. One criticism of MapReduce
is that the development cycle is very long. Writing the mappers and reducers,
compiling and packaging the code, submitting the job(s), and retrieving the results is
a time-consuming business, and even with Streaming, which removes the compile
and package step, the experience is still involved. Pig’s sweet spot is its ability to
process terabytes of data simply by issuing a half-dozen lines of Pig Latin from the
console. Indeed, it was created at Yahoo! to make it easier for researchers and
engineers to mine the huge datasets there. Pig is very supportive of a programmer
writing a query, since it provides several commands for introspecting the data
structures in your program, as it is written. Even more useful, it can perform a
sample run on a representative subset of your input data, so you can see whether
there are errors in the processing before unleashing it on the full dataset.
Pig was designed to be extensible. Virtually all parts of the processing path are
customizable: loading, storing, filtering, grouping, and joining can all be altered by
user-defined functions (UDFs). These functions operate on Pig’s nested data model, so
they can integrate very deeply with Pig’s operators. As another benefit, UDFs tend to
be more reusable than the libraries developed for writing MapReduce programs.
Pig isn’t suitable for all data processing tasks, however. Like MapReduce, it is
designed for batch processing of data. If you want to perform a query that touches
only a small amount of data in a large dataset, then Pig will not perform well, since it
is set up to scan the whole dataset, or at least large portions of it.
In some cases, Pig doesn’t perform as well as programs written in MapReduce.
However, the gap is narrowing with each release, as the Pig team implements
sophisticated algorithms for Pig’s relational operators. It’s fair to say
that unless you are willing to invest a lot of effort optimizing Java MapReduce code,
writing queries in Pig Latin will save you time.
Installing and Running Pig
Pig runs as a client-side application. Even if you want to run Pig on a Hadoop
cluster, there is nothing extra to install on the cluster: Pig launches jobs and interacts
with HDFS (or other Hadoop filesystems) from your workstation.
Installation is straightforward. Java 6 is a prerequisite (and on Windows, you will
need Cygwin). Download a stable release,
and unpack the tarball in a suitable place on your workstation:
% tar xzf pig-x.y.z.tar.gz
It’s convenient to add Pig’s binary directory to your command-line path. For example:
% export PIG_INSTALL=/home/tom/pig-x.y.z
% export PATH=$PATH:$PIG_INSTALL/bin
You also need to set the JAVA_HOME environment variable to point to a suitable
Java installation.
Try typing pig -help to get usage instructions.
Execution Types
Pig has two execution types or modes: local mode and MapReduce mode.
Local mode
In local mode, Pig runs in a single JVM and accesses the local filesystem. This mode
is suitable only for small datasets and when trying out Pig.
The execution type is set using the -x or -exectype option. To run in local mode, set
the option to local:
% pig -x local
This starts Grunt, the Pig interactive shell, which is discussed in more detail shortly.
MapReduce mode
In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on
a Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster.
MapReduce mode (with a fully distributed cluster) is what you use when you want to
run Pig on large datasets.
To use MapReduce mode, you first need to check that the version of Pig you downloaded is compatible with the version of Hadoop you are using. Pig releases will only
work against particular versions of Hadoop; this is documented in the release notes.
Pig honors the HADOOP_HOME environment variable for finding which Hadoop
client to run. However, if it is not set, Pig will use a bundled copy of the Hadoop
libraries. Note that these may not match the version of Hadoop running on your
cluster, so it is best to explicitly set HADOOP_HOME.
Next, you need to point Pig at the cluster’s namenode and jobtracker. If the
installation of Hadoop at HADOOP_HOME is already configured for this, then there
is nothing more to do. Otherwise, you can set HADOOP_CONF_DIR to a directory
containing the Hadoop site file (or files) that define and
Alternatively, you can set these two properties in the file in Pig’s conf
directory (or the directory specified by PIG_CONF_DIR). Here’s an example for a
pseudo-distributed setup:
Once you have configured Pig to connect to a Hadoop cluster, you can launch Pig,
setting the -x option to mapreduce, or omitting it entirely, as MapReduce mode is the
default:
% pig
2012-01-18 20:23:05,764 [main] INFO org.apache.pig.Main - Logging error messages to: /private/tmp/pig_1326946985762.log
2012-01-18 20:23:06,009 [main] INFO ne.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost/
2012-01-18 20:23:06,274 [main] INFO ne.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:8021
As you can see from the output, Pig reports the filesystem and jobtracker that it has
connected to.
Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and
MapReduce mode:
• Pig can run a script file that contains Pig commands. For example, pig script.pig
runs the commands in the local file script.pig. Alternatively, for very short scripts,
you can use the -e option to run a script specified as a string on the command
line.
• Grunt is an interactive shell for running Pig commands. Grunt is started when no
file is specified for Pig to run, and the -e option is not used. It is also possible to
run Pig scripts from within Grunt using run and exec.
• You can run Pig programs from Java using the PigServer class, much like you
can use JDBC to run SQL programs from Java. For programmatic access to
Grunt, use
Grunt has line-editing facilities like those found in GNU Readline (used in the bash
shell and many other command-line applications). For instance, the Ctrl-E key
combination will move the cursor to the end of the line. Grunt remembers command
history, too, and you can recall lines in the history buffer using Ctrl-P or Ctrl-N (for
previous and next) or, equivalently, the up or down cursor keys.
Another handy feature is Grunt’s completion mechanism, which will try to complete
Pig Latin keywords and functions when you press the Tab key. For example,
consider the following incomplete line:
grunt> a = foreach b ge
If you press the Tab key at this point, ge will expand to generate, a Pig Latin
keyword:
grunt> a = foreach b generate
You can customize the completion tokens by creating a file named autocomplete and
placing it on Pig’s classpath (such as in the conf directory in Pig’s install directory), or
in the directory you invoked Grunt from. The file should have one token per line, and
tokens must not contain any whitespace. Matching is case-sensitive. It can be very
handy to add commonly used file paths (especially because Pig does not perform
file-name completion) or the names of any user-defined functions you have created.
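The rules for the autocomplete file (one token per line, no whitespace inside a token, case-sensitive matching) can be mimicked with a short sketch. This is not Grunt's implementation; the function names are hypothetical, and the sample tokens, including the UDF-style name MAX_TEMP, are invented for illustration.

```python
def valid_tokens(lines):
    """Keep non-empty tokens that contain no whitespace, one per line."""
    tokens = []
    for line in lines:
        token = line.rstrip("\n")
        if token and not any(ch.isspace() for ch in token):
            tokens.append(token)
    return tokens

def complete(prefix, tokens):
    """Case-sensitive prefix match, as the text describes."""
    return sorted(t for t in tokens if t.startswith(prefix))

lines = ["input/ncdc/micro-tab/sample.txt\n", "my udf\n", "MAX_TEMP\n", "\n"]
tokens = valid_tokens(lines)
print(tokens)                   # "my udf" is dropped: it contains a space
print(complete("MAX", tokens))  # matches MAX_TEMP
print(complete("max", tokens))  # no match: case differs
```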
You can get a list of commands using the help command. When you’ve finished your
Grunt session, you can exit with the quit command.
Pig Latin Editors
PigPen is an Eclipse plug-in that provides an environment for developing Pig
programs. It includes a Pig script text editor, an example generator (equivalent to the
ILLUSTRATE command), and a button for running the script on a Hadoop cluster.
There is also an operator graph window, which shows a script in graph form, for
visualizing the data flow. For full installation and usage instructions, please refer to
the Pig wiki.
There are also Pig Latin syntax highlighters for other editors, including Vim and TextMate. Details are available on the Pig wiki.
An Example
Let’s look at a simple example by writing the program to calculate the maximum
recorded temperature by year for the weather dataset in Pig Latin (just like we did
using MapReduce in Chapter 2). The complete program is only a few lines long:
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
  AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
  (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
  MAX(filtered_records.temperature);
DUMP max_temp;
To explore what’s going on, we’ll use Pig’s Grunt interpreter, which allows us to enter
lines and interact with the program to understand what it’s doing. Start up Grunt in
local mode, then enter the first line of the Pig script:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
For simplicity, the program assumes that the input is tab-delimited text, with each
line having just year, temperature, and quality fields. (Pig actually has more flexibility
than this with regard to the input formats it accepts, as you’ll see later.) This line
describes the input data we want to process. The year:chararray notation describes
the field’s name and type; a chararray is like a Java string, and an int is like a Java
int. The LOAD operator takes a URI argument; here we are just using a local file, but
we could refer to an HDFS URI. The AS clause (which is optional) gives the fields
names to make it convenient to refer to them in subsequent statements.
The result of the LOAD operator, indeed any operator in Pig Latin, is a relation,
which is just a set of tuples. A tuple is just like a row of data in a database table, with
multiple fields in a particular order. In this example, the LOAD function produces a
set of (year, temperature, quality) tuples that are present in the input file. We write a
relation with one tuple per line, where tuples are represented as comma-separated
items in parentheses:
(1950,22,1)
(1950,-11,1)
Relations are given names, or aliases, so they can be referred to. This relation is
given the records alias. We can examine the contents of an alias using the DUMP
operator:
grunt> DUMP records;
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
We can also see the structure of a relation—the relation’s schema—using the
DESCRIBE operator on the relation’s alias:
grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}
This tells us that records has three fields, with aliases year, temperature, and quality,
which are the names we gave them in the AS clause. The fields have the types given
to them in the AS clause, too. We shall examine types in Pig in more detail later.
The second statement removes records that have a missing temperature (indicated
by a value of 9999) or an unsatisfactory quality reading. For this small dataset, no
records are filtered out:
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grunt> DUMP filtered_records;
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
The third statement uses the GROUP function to group the records relation by the
year field. Let’s use DUMP to see what it produces:
grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
We now have two rows, or tuples, one for each year in the input data. The first field
in each tuple is the field being grouped by (the year), and the second field is a bag of
tuples for that year. A bag is just an unordered collection of tuples, which in Pig Latin
is represented using curly braces.
By grouping the data in this way, we have created a row per year, so now all that
remains is to find the maximum temperature for the tuples in each bag. Before we do
this, let’s understand the structure of the grouped_records relation:
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray,filtered_records: {year: chararray,
temperature: int,quality: int}}
This tells us that the grouping field is given the alias group by Pig, and the second
field is the same structure as the filtered_records relation that was being grouped.
With this information, we can try the fourth transformation:
grunt> max_temp = FOREACH grouped_records GENERATE group,
>> MAX(filtered_records.temperature);
FOREACH processes every row to generate a derived set of rows, using a
GENERATE clause to define the fields in each derived row. In this example, the first
field is group, which is just the year. The second field is a little more complex. The
filtered_records.temperature reference is to the temperature field of the
filtered_records bag in the grouped_records relation. MAX is a built-in function for
calculating the maximum value of fields in a bag. In this case, it calculates the
maximum temperature for the fields in each filtered_records bag. Let’s check the
result:
grunt> DUMP max_temp;
So we’ve successfully calculated the maximum temperature for each year.
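For readers who think more readily in a general-purpose language, the same filter, group, and aggregate flow can be sketched in plain Python. This is only an analogy for what the Pig Latin statements express, not how Pig executes them; the input tuples are the four rows shown in the DUMP output above, and the function name is hypothetical.

```python
from collections import defaultdict

# The (year, temperature, quality) tuples from the sample walkthrough.
records = [("1950", 22, 1), ("1950", -11, 1), ("1949", 111, 1), ("1949", 78, 1)]

def max_temp_by_year(records):
    # FILTER: drop missing temperatures and unsatisfactory quality codes.
    filtered = [r for r in records if r[1] != 9999 and r[2] in (0, 1, 4, 5, 9)]

    # GROUP ... BY year: each year maps to a "bag" of its tuples.
    grouped = defaultdict(list)
    for year, temperature, quality in filtered:
        grouped[year].append((year, temperature, quality))

    # FOREACH ... GENERATE group, MAX(temperature).
    return sorted((year, max(t for _, t, _ in bag)) for year, bag in grouped.items())

print(max_temp_by_year(records))  # [('1949', 111), ('1950', 22)]
```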
Generating Examples
In this example, we’ve used a small sample dataset with just a handful of rows to
make it easier to follow the data flow and aid debugging. Creating a cut-down
dataset is an art, as ideally it should be rich enough to cover all the cases to exercise
your queries (the completeness property), yet be small enough to reason about by
the programmer (the conciseness property). Using a random sample doesn’t work
well in general, since join and filter operations tend to remove all random data,
leaving an empty result, which is not illustrative of the general data flow.
With the ILLUSTRATE operator, Pig provides a tool for generating a reasonably
complete and concise dataset. Here is the output from running ILLUSTRATE
(slightly reformatted to fit the page):
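-------------------------------------------------------------------------------
| records          | year:chararray | temperature:int | quality:int |
-------------------------------------------------------------------------------
|                  | 1949           | 78              | 1           |
|                  | 1949           | 111             | 1           |
|                  | 1949           | 9999            |             |
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
| filtered_records | year:chararray | temperature:int | quality:int |
-------------------------------------------------------------------------------
|                  | 1949           | 78              | 1           |
|                  | 1949           | 111             | 1           |
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
| grouped_records  | group:chararray | filtered_records:bag{:tuple(year:chararray, |
|                  |                 | temperature:int,quality:int)}               |
-------------------------------------------------------------------------------
|                  | 1949            | {(1949, 78, 1), (1949, 111, 1)}             |
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
| max_temp         | group:chararray | :int |
-------------------------------------------------------------------------------
|                  | 1949            | 111  |
-------------------------------------------------------------------------------
Notice that Pig used some of the original data (this is important to keep the
generated dataset realistic), as well as creating some new data. It noticed the
special value 9999 in the query and created a tuple containing this value to
exercise the FILTER statement.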
In summary, the output of ILLUSTRATE is easy to follow and can help you
understand what your query is doing.
Comparison with Databases
Having seen Pig in action, it might seem that Pig Latin is similar to SQL. The
presence of such operators as GROUP BY and DESCRIBE reinforces this
impression. However, there are several differences between the two languages,
and between Pig and RDBMSs in general.
The most significant difference is that Pig Latin is a data flow programming
language, whereas SQL is a declarative programming language. In other words,
a Pig Latin program is a step-by-step set of operations on an input relation, in
which each step is a single transformation. By contrast, SQL statements are a
set of constraints that, taken together, define the output. In many ways,
programming in Pig Latin is like working at the level of an RDBMS query
planner, which figures out how to turn a declarative statement into a system of
steps.
RDBMSs store data in tables, with tightly predefined schemas. Pig is more
relaxed about the data that it processes: you can define a schema at runtime,
but it’s optional. Essentially, it will operate on any source of tuples (although
the source should support being read in parallel, by being in multiple files, for
example), where a UDF is used to read the tuples from their raw
representation. The most common representation is a text file with
tab-separated fields, and Pig provides a built-in load function for this format. Unlike
with a traditional database, there is no data import process to load the data into
the RDBMS. The data is loaded from the filesystem (usually HDFS) as the first
step in the processing.
Pig’s support for complex, nested data structures differentiates it from SQL,
which operates on flatter data structures. Also, Pig’s ability to use UDFs and
streaming operators that are tightly integrated with the language and Pig’s
nested data structures makes Pig Latin more customizable than most SQL
dialects.
There are several features to support online, low-latency queries that RDBMSs
have that are absent in Pig, such as transactions and indexes. As mentioned
earlier, Pig does not support random reads or queries in the order of tens of
milliseconds. Nor does it support random writes to update small portions of
data; all writes are bulk, streaming writes, just like MapReduce.
Hive (covered in Chapter 12) sits between Pig and conventional RDBMSs. Like Pig,
Hive is designed to use HDFS for storage, but otherwise there are some significant
differences. Its query language, HiveQL, is based on SQL, and anyone who is
familiar with SQL would have little trouble writing queries in HiveQL. Like RDBMSs,
Hive mandates that all data be stored in tables, with a schema under its
management; however, it can associate a schema with preexisting data in HDFS,
so the load step is optional. Hive does not support low-latency queries, a
characteristic it shares with Pig.
Pig Latin
This section gives an informal description of the syntax and semantics of the Pig
Latin programming language. It is not meant to offer a complete reference to the
language, but there should be enough here for you to get a good understanding of
Pig Latin’s constructs.
A Pig Latin program consists of a collection of statements. A statement can be
thought of as an operation, or a command. For example, a GROUP operation is a
type of statement:
grouped_records = GROUP records BY year;
The command to list the files in a Hadoop filesystem is another example of a
statement:
ls /
Statements are usually terminated with a semicolon, as in the example of the
GROUP statement. In fact, this is an example of a statement that must be terminated
with a semicolon: it is a syntax error to omit it. The ls command, on the other hand,
does not have to be terminated with a semicolon. As a general guideline, statements
or commands for interactive use in Grunt do not need the terminating semicolon.
This group includes the interactive Hadoop commands, as well as the diagnostic
operators like DESCRIBE. It’s never an error to add a terminating semicolon, so if in
doubt, it’s sim-plest to add one.
Statements that have to be terminated with a semicolon can be split across multiple
lines for readability:
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
Pig Latin has two forms of comments. Double hyphens introduce single-line comments.
Everything from the first hyphen to the end of the line is ignored by the Pig Latin
interpreter:
-- My program
DUMP A; -- What's in A?
C-style comments are more flexible since they delimit the beginning and end of the
comment block with /* and */ markers. They can span lines or be embedded in a
single line:
A = LOAD 'input/pig/join/A';
B = LOAD 'input/pig/join/B';
C = JOIN A BY $0, /* ignored */ B BY $1;
Pig Latin has a list of keywords that have a special meaning in the language and
cannot be used as identifiers. These include the operators (LOAD, ILLUSTRATE),
commands (cat, ls), expressions (matches, FLATTEN), and functions (DIFF, MAX)—
all of which are covered in the following sections.
Pig Latin has mixed rules on case sensitivity. Operators and commands are not
case-sensitive (to make interactive use more forgiving); however, aliases and
function names are case-sensitive.
[5] You sometimes see these terms being used interchangeably in documentation on
Pig Latin. For example, “GROUP command,” “GROUP operation,” “GROUP statement.”
As a Pig Latin program is executed, each statement is parsed in turn. If there are
syntax errors, or other (semantic) problems such as undefined aliases, the
interpreter will halt and display an error message. The interpreter builds a logical
plan for every relational operation, which forms the core of a Pig Latin program. The
logical plan for the statement is added to the logical plan for the program so far,
then the interpreter moves on to the next statement.
It’s important to note that no data processing takes place while the logical plan of the
program is being constructed. For example, consider again the Pig Latin program
from the first example:
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
  AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
  (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
  MAX(filtered_records.temperature);
DUMP max_temp;
When the Pig Latin interpreter sees the first line containing the LOAD statement, it
confirms that it is syntactically and semantically correct, and adds it to the logical
plan, but it does not load the data from the file (or even check whether the file
exists). Indeed, where would it load it? Into memory? Even if it did fit into memory,
what would it do with the data? Perhaps not all the input data is needed (since later
statements filter it, for example), so it would be pointless to load it. The point is that it
makes no sense to start any processing until the whole flow is defined. Similarly, Pig
validates the GROUP and FOREACH...GENERATE statements, and adds them to
the logical plan without executing them. The trigger for Pig to start execution is the
DUMP statement. At that point, the logical plan is compiled into a physical plan and
executed.
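The deferred execution described above can be illustrated with a toy model in plain Java. This is only a sketch of the idea, not Pig's implementation: the LazyPlan class and its then() and dump() methods are hypothetical names invented for illustration. Steps are recorded as they are declared, and no data is touched until dump() is called, which plays the role of the DUMP statement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

/** Toy model of deferred execution (hypothetical, not Pig's API). */
final class LazyPlan {
    private final List<String> source;
    private final List<UnaryOperator<List<String>>> steps = new ArrayList<>();
    private int executions = 0; // how many times the plan has actually run

    LazyPlan(List<String> source) {
        this.source = source;
    }

    /** Adds a step to the plan; no data is processed here. */
    LazyPlan then(UnaryOperator<List<String>> step) {
        steps.add(step);
        return this;
    }

    int plannedSteps() {
        return steps.size();
    }

    int executions() {
        return executions;
    }

    /** Plays the role of DUMP: only now are the recorded steps applied. */
    List<String> dump() {
        executions++;
        List<String> data = source;
        for (UnaryOperator<List<String>> step : steps) {
            data = step.apply(data);
        }
        return data;
    }

    public static void main(String[] args) {
        LazyPlan plan = new LazyPlan(Arrays.asList("22", "9999", "-11"))
            .then(rows -> {
                // Drop the missing-value marker, like the FILTER step.
                List<String> kept = new ArrayList<>();
                for (String row : rows) {
                    if (!row.equals("9999")) {
                        kept.add(row);
                    }
                }
                return kept;
            });
        System.out.println("executions before dump: " + plan.executions());
        System.out.println(plan.dump());
        System.out.println("executions after dump: " + plan.executions());
    }
}
```

The point of the design is that then() is cheap and side-effect free, so the whole flow is known before any work starts.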
Multiquery execution
Since DUMP is a diagnostic tool, it will always trigger execution. However, the
STORE command is different. In interactive mode, STORE acts like DUMP and will
always trigger execution (this includes the run command), but in batch mode it will
not (this includes the exec command). The reason for this is efficiency. In batch
mode, Pig will parse the whole script to see if there are any optimizations that could
be made to limit the amount of data to be written to or read from disk. Consider the
following simple example:
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';
Relations B and C are both derived from A, so to save reading A twice, Pig can
run this script as a single MapReduce job by reading A once and writing two
output files from the job, one for each of B and C. This feature is called multiquery
execution.
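The saving from reading A only once can be sketched in plain Java. This is an in-memory illustration under stated assumptions, not how Pig plans MapReduce jobs: SinglePassSplit is a hypothetical class, rows are modeled as String arrays, and rows whose second field is null are assumed to pass neither filter, following SQL-style null semantics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy single-pass split (hypothetical): one scan of A feeds both outputs. */
final class SinglePassSplit {
    final List<String[]> b = new ArrayList<>(); // rows where field 1 equals "banana"
    final List<String[]> c = new ArrayList<>(); // rows where field 1 differs from "banana"
    int rowsRead = 0;

    void run(List<String[]> a) {
        for (String[] row : a) {
            rowsRead++; // each input row is read exactly once
            if (row[1] == null) {
                continue; // assumption: a null comparison satisfies neither filter
            }
            if (row[1].equals("banana")) {
                b.add(row);
            } else {
                c.add(row);
            }
        }
    }

    public static void main(String[] args) {
        List<String[]> a = Arrays.asList(
            new String[] {"1", "banana"},
            new String[] {"2", "apple"},
            new String[] {"3", "banana"});
        SinglePassSplit split = new SinglePassSplit();
        split.run(a);
        System.out.println("read=" + split.rowsRead
            + " b=" + split.b.size() + " c=" + split.c.size());
    }
}
```

Running the two filters separately would scan the input twice; here rowsRead equals the input size while both outputs are still produced.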
In previous versions of Pig that did not have multiquery execution, each STORE
statement in a script run in batch mode triggered execution, resulting in a job for
each STORE statement. It is possible to restore the old behavior by disabling
multiquery execution with the -M or -no_multiquery option to pig.
The physical plan that Pig prepares is a series of MapReduce jobs, which in local
mode Pig runs in the local JVM, and in MapReduce mode Pig runs on a Hadoop
cluster.
You can see the logical and physical plans created by Pig using the EXPLAIN
command on a relation (EXPLAIN max_temp; for example).
EXPLAIN will also show the MapReduce plan, which shows how the physical
operators are grouped into MapReduce jobs. This is a good way to find out how
many MapReduce jobs Pig will run for your query.
The relational operators that can be a part of a logical plan in Pig are summarized in
Table 11-1.
Table 11-1. Pig Latin relational operators
Category                  Operator            Description
Loading and storing       LOAD                Loads data from the filesystem or other storage into a relation
                          STORE               Saves a relation to the filesystem or other storage
                          DUMP                Prints a relation to the console
Filtering                 FILTER              Removes unwanted rows from a relation
                          DISTINCT            Removes duplicate rows from a relation
                          FOREACH...GENERATE  Adds or removes fields from a relation
                          MAPREDUCE           Runs a MapReduce job using a relation as input
                          STREAM              Transforms a relation using an external program
                          SAMPLE              Selects a random sample of a relation
Grouping and joining      JOIN                Joins two or more relations
                          COGROUP             Groups the data in two or more relations
                          GROUP               Groups the data in a single relation
                          CROSS               Creates the cross-product of two or more relations
Sorting                   ORDER               Sorts a relation by one or more fields
                          LIMIT               Limits the size of a relation to a maximum number of tuples
Combining and splitting   UNION               Combines two or more relations into one
                          SPLIT               Splits a relation into two or more relations
There are other types of statements that are not added to the logical plan. For
example, the diagnostic operators, DESCRIBE, EXPLAIN, and ILLUSTRATE are
provided to allow the user to interact with the logical plan, for debugging purposes
(see Table 11-2). DUMP is a sort of diagnostic operator, too, since it is used only to
allow interactive debugging of small result sets or in combination with LIMIT to
retrieve a few rows from a larger relation. The STORE statement should be used
when the size of the output is more than a few lines, as it writes to a file, rather than
to the console.
Table 11-2. Pig Latin diagnostic operators
Operator     Description
DESCRIBE     Prints a relation’s schema
EXPLAIN      Prints the logical and physical plans
ILLUSTRATE   Shows a sample execution of the logical plan, using a generated subset of the input
Pig Latin provides three statements, REGISTER, DEFINE and IMPORT, to make it
possible to incorporate macros and user-defined functions into Pig scripts (see Table 11-3).
Table 11-3. Pig Latin macro and UDF statements
Statement   Description
REGISTER    Registers a JAR file with the Pig runtime
DEFINE      Creates an alias for a macro, a UDF, streaming script, or a command specification
IMPORT      Imports macros defined in a separate file into a script
Since they do not process relations, commands are not added to the logical plan; instead, they are executed immediately. Pig provides commands to interact with
Hadoop filesystems (which are very handy for moving data around before or after
processing with Pig) and MapReduce, as well as a few utility commands (described
in Table 11-4).
Table 11-4. Pig Latin commands
Category            Command         Description
Hadoop Filesystem   cat             Prints the contents of one or more files
                    cd              Changes the current directory
                    copyFromLocal   Copies a local file or directory to a Hadoop filesystem
                    copyToLocal     Copies a file or directory on a Hadoop filesystem to the local filesystem
                    cp              Copies a file or directory to another directory
                    fs              Accesses Hadoop’s filesystem shell
                    ls              Lists files
                    mkdir           Creates a new directory
                    mv              Moves a file or directory to another directory
                    pwd             Prints the path of the current working directory
                    rm              Deletes a file or directory
                    rmf             Forcibly deletes a file or directory (does not fail if the file or
                                    directory does not exist)
Hadoop MapReduce    kill            Kills a MapReduce job
Utility             exec            Runs a script in a new Grunt shell in batch mode
                    help            Shows the available commands and options
                    quit            Exits the interpreter
                    run             Runs a script within the existing Grunt shell
                    set             Sets Pig options and MapReduce job properties
                    sh              Runs a shell command from within Grunt
The filesystem commands can operate on files or directories in any Hadoop
filesystem, and they are very similar to the hadoop fs commands (which is not
surprising, as both are simple wrappers around the Hadoop FileSystem interface).
You can access all of the Hadoop filesystem shell commands using Pig’s fs
command. For example, fs -ls will show a file listing, and fs -help will show help on all
the available commands.
Precisely which Hadoop filesystem is used is determined by the
property in the site file for Hadoop Core.
These commands are mostly self-explanatory, except set, which is used to set
options that control Pig’s behavior, including arbitrary MapReduce job properties.
The debug option is used to turn debug logging on or off from within a script (you
can also control the log level when launching Pig, using the -d or -debug option):
grunt> set debug on
Another useful option is the one that sets the job name, which gives a Pig job a
meaningful name, making it easier to pick out your Pig MapReduce jobs when running
on a shared Hadoop cluster. If Pig is running a script (rather than being an interactive
query from Grunt), its job name defaults to a value based on the script name.
There are two commands in Table 11-4 for running a Pig script, exec and run. The
difference is that exec runs the script in batch mode in a new Grunt shell, so any
aliases defined in the script are not accessible to the shell after the script has
completed. On the other hand, when running a script with run, it is as if the
contents of the script had been entered manually, so the command history of the
invoking shell contains all the statements from the script. Multiquery execution,
where Pig executes a batch of statements in one go, is only used by exec, not
run.
An expression is something that is evaluated to yield a value. Expressions can be
used in Pig as a part of a statement containing a relational operator. Pig has a rich
variety of expressions, many of which will be familiar from other programming
languages. They are listed in Table 11-5, with brief descriptions and examples. We
shall see examples of many of these expressions throughout the chapter.
Table 11-5. Pig Latin expressions
Category               Expression       Description                                               Examples
Constant               literal          Constant value (see also literals in Table 11-6)          1.0, 'a'
Field (by position)    $n               Field in position n (zero-based)
Field (by name)        f                Field named f
Field (disambiguate)   r::f             Field named f from relation r after grouping or joining
Projection             c.$n, c.f        Field in container c (relation, bag, or tuple),
                                        by position, by name
Map lookup             m#k              Value associated with key k in map m
Cast                   (t) f            Cast of field f to type t                                 (int) year
Arithmetic             x + y, x - y     Addition, subtraction                                     $1 + $2, $1 - $2
                       x * y, x / y     Multiplication, division                                  $1 * $2, $1 / $2
                       x % y            Modulo, the remainder of x divided by y                   $1 % $2
                       +x, -x           Unary positive, negation                                  +1, -1
Conditional            x ? y : z        Bincond/ternary, y if x evaluates to true, z otherwise    quality == 0 ? 0 : 1
Comparison             x == y, x != y   Equals, not equals                                        quality == 0, temperature != 9999
                       x > y, x < y     Greater than, less than                                   quality > 0, quality < 10
                       x >= y, x <= y   Greater than or equal to, less than or equal to           quality >= 1, quality <= …
                       x matches y      Pattern matching with regular expression                  quality matches …
                       x is null        Is null                                                   temperature is null
                       x is not null    Is not null                                               temperature is not null
Boolean                x or y           Logical or                                                q == 0 or q == 1
                       x and y          Logical and                                               q == 0 and r == 0
                       not x            Logical negation                                          not q matches …
Functional             fn(f1, f2, …)    Invocation of function fn on fields f1, f2, etc.
Flatten                FLATTEN(f)       Removal of a level of nesting from bags and tuples
So far you have seen some of the simple types in Pig, such as int and chararray.
Here we will discuss Pig’s built-in types in more detail.
Pig has four numeric types: int, long, float, and double, which are identical to their
Java counterparts. There is also a bytearray type, like Java’s byte array type for
representing a blob of binary data, and chararray, which, like java.lang.String,
represents textual data in UTF-16 format, although it can be loaded or stored in
UTF-8 format. Pig does not have types corresponding to Java’s boolean, byte, short, or
char primitive types. These are all easily represented using Pig’s int type, or
chararray for char.
The numeric, textual, and binary types are simple atomic types. Pig Latin also has
three complex types for representing nested structures: tuple, bag, and map. All of
Pig Latin’s types are listed in Table 11-6.
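Table 11-6. Pig Latin types
Category   Type        Description                                                  Literal example
Numeric    int         32-bit signed integer                                        1
           long        64-bit signed integer                                        1L
           float       32-bit floating-point number                                 1.0F
           double      64-bit floating-point number                                 1.0
Text       chararray   Character array in UTF-16 format                             'a'
Binary     bytearray   Byte array                                                   Not supported
Complex    tuple       Sequence of fields of any type                               (1,'pomegranate')
           bag         An unordered collection of tuples, possibly with duplicates  {(1,'pomegranate'),(2)}
           map         A set of key-value pairs. Keys must be character arrays;     ['a'#'pomegranate']
                       values may be any type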
The complex types are usually loaded from files or constructed using relational operators. Be aware, however, that the literal form in Table 11-6 is used when a constant
value is created from within a Pig Latin program. The raw form in a file is usually
different when using the standard PigStorage loader. For example, the
representation in a file of the bag in Table 11-6 would be {(1,pomegranate),(2)} (note
the lack of quotes), and with a suitable schema, this would be loaded as a relation
with a single field and row, whose value was the bag.
Pig provides built-in functions TOTUPLE, TOBAG and TOMAP, which are used for
turning expressions into tuples, bags and maps.
Although relations and bags are conceptually the same (an unordered collection of
tuples), in practice Pig treats them slightly differently. A relation is a top-level
construct, whereas a bag has to be contained in a relation. Normally, you don’t have
to worry about this, but there are a few restrictions that can trip up the uninitiated.
For example, it’s not possible to create a relation from a bag literal. So the following
statement fails:
A = {(1,2),(3,4)}; -- Error
The simplest workaround in this case is to load the data from a file using the LOAD
statement.
As another example, you can’t treat a relation like a bag and project a field into a
new relation ($0 refers to the first field of A, using the positional notation):
B = A.$0;
Instead, you have to use a relational operator to turn the relation A into relation B:
B = FOREACH A GENERATE $0;
It’s possible that a future version of Pig Latin will remove these inconsistencies and
treat relations and bags in the same way.
A relation in Pig may have an associated schema, which gives the fields in the
relation names and types. We’ve seen how an AS clause in a LOAD statement is
used to attach a schema to a relation:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: int,temperature: int,quality: int}
This time we’ve declared the year to be an integer, rather than a chararray, even
though the file it is being loaded from is the same. An integer may be more
appropriate if we needed to manipulate the year arithmetically (to turn it into a
timestamp, for example), whereas the chararray representation might be more
appropriate when it’s being used as a simple identifier. Pig’s flexibility in the degree
to which schemas are declared contrasts with schemas in traditional SQL
databases, which are declared before the data is loaded into the system. Pig is
designed for analyzing plain input files with no associated type information, so it is
quite natural to choose types for fields later than you would with an RDBMS.
It’s possible to omit type declarations completely, too:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year, temperature, quality);
grunt> DESCRIBE records;
records: {year: bytearray,temperature: bytearray,quality: bytearray}
In this case, we have specified only the names of the fields in the schema, year,
temperature, and quality. The types default to bytearray, the most general type, representing a binary string.
You don’t need to specify types for every field; you can leave some to default to
bytearray, as we have done for year in this declaration:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: bytearray,temperature: int,quality: int}
However, if you specify a schema in this way, you do need to specify every field.
Also, there’s no way to specify the type of a field without specifying the name. On the
other hand, the schema is entirely optional and can be omitted by not specifying an
AS clause:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt';
grunt> DESCRIBE records;
Schema for records unknown.
Fields in a relation with no schema can be referenced only using positional notation:
$0 refers to the first field in a relation, $1 to the second, and so on. Their types
default to bytearray:
grunt> projected_records = FOREACH records GENERATE $0, $1, $2;
grunt> DUMP projected_records;
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
grunt> DESCRIBE projected_records;
projected_records: {bytearray,bytearray,bytearray}
Although it can be convenient not to have to assign types to fields (particularly in the
first stages of writing a query), doing so can improve the clarity and efficiency of Pig
Latin programs, and is generally recommended. Declaring a schema as a part of the
query is flexible, but doesn’t lend itself to schema reuse. A set of Pig queries over
the same input data will often have the same schema repeated in each query. If the
query processes a large number of fields, this repetition can become hard to
maintain. The Apache HCatalog project
solves this problem by providing a table metadata service, based on Hive’s
metastore, so that Pig queries can reference schemas by name, rather than
specifying them in full each time.
Validation and nulls
An SQL database will enforce the constraints in a table’s schema at load time: for
example, trying to load a string into a column that is declared to be a numeric type
will fail. In Pig, if the value cannot be cast to the type declared in the schema, then it
will substitute a null value. Let’s see how this works if we have the following input for
the weather data, which has an “e” character in place of an integer:
Pig handles the corrupt line by producing a null for the offending value, which is displayed as the absence of a value when dumped to screen (and also when saved
using STORE):
grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>> AS (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
Pig produces a warning for the invalid field (not shown here), but does not halt its
processing. For large datasets, it is very common to have corrupt, invalid, or merely
unexpected data, and it is generally infeasible to incrementally fix every unparsable
record. Instead, we can pull out all of the invalid records in one go, so we can take
action on them, perhaps by fixing our program (because they indicate we have made
a mistake) or by filtering them out (because the data is genuinely unusable):
grunt> corrupt_records = FILTER records BY temperature is null;
grunt> DUMP corrupt_records;
Note the use of the is null operator, which is analogous to SQL. In practice, we
would include more information from the original record, such as an identifier and
the value that could not be parsed, to help our analysis of the bad data.
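The substitute-null-then-filter pattern described above can be sketched in plain Java. This is an illustration only, assuming raw field values arrive as strings; NullOnBadCast and its methods are hypothetical names and do not reflect Pig's internal casting code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (hypothetical) of "a failed cast becomes null" plus a null filter. */
final class NullOnBadCast {
    /** Returns the parsed integer, or null when the text is not a valid int. */
    static Integer toIntOrNull(String raw) {
        if (raw == null) {
            return null;
        }
        try {
            return Integer.valueOf(raw);
        } catch (NumberFormatException e) {
            return null; // processing continues instead of halting
        }
    }

    /** Returns the positions of values that failed to parse, like a filter on "is null". */
    static List<Integer> corruptRowIndexes(List<String> rawTemperatures) {
        List<Integer> corrupt = new ArrayList<>();
        for (int i = 0; i < rawTemperatures.size(); i++) {
            if (toIntOrNull(rawTemperatures.get(i)) == null) {
                corrupt.add(i);
            }
        }
        return corrupt;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("0", "22", "e", "111", "78");
        System.out.println(corruptRowIndexes(raw)); // prints [2]
    }
}
```

Collecting the positions of the bad values, rather than stopping at the first one, mirrors the advice to pull out all invalid records in one go.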
We can find the number of corrupt records using the following idiom for counting the
number of rows in a relation:
grunt> grouped = GROUP corrupt_records ALL;
grunt> all_grouped = FOREACH grouped GENERATE group, COUNT(corrupt_records);
grunt> DUMP all_grouped;
Another useful technique is to use the SPLIT operator to partition the data into
“good” and “bad” relations, which can then be analyzed separately:
grunt> SPLIT records INTO good_records IF temperature is not null,
>> bad_records IF temperature is null;
grunt> DUMP good_records;
grunt> DUMP bad_records;
Going back to the case in which temperature’s type was left undeclared, the corrupt
data cannot be easily detected, since it doesn’t surface as a null:
grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>> AS (year:chararray, temperature, quality:int);
grunt> DUMP records;
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grunt> grouped_records = GROUP filtered_records BY year;
grunt> max_temp = FOREACH grouped_records GENERATE group,
>> MAX(filtered_records.temperature);
grunt> DUMP max_temp;
What happens in this case is that the temperature field is interpreted as a bytearray,
so the corrupt field is not detected when the input is loaded. When passed to the
MAX function, the temperature field is cast to a double, since MAX works only with
numeric types. The corrupt field cannot be represented as a double, so it becomes a
null, which MAX silently ignores. The best approach is generally to declare types for
your data on loading, and look for missing or corrupt values in the relations
themselves before you do your main processing.
Sometimes corrupt data shows up as smaller tuples since fields are simply missing.
You can filter these out by using the SIZE function as follows:
grunt> A = LOAD 'input/pig/corrupt/missing_fields';
grunt> DUMP A;
grunt> B = FILTER A BY SIZE(TOTUPLE(*)) > 1;
grunt> DUMP B;
Schema merging
In Pig, you don’t declare the schema for every new relation in the data flow. In most
cases, Pig can figure out the resulting schema for the output of a relational operation
by considering the schema of the input relation.
How are schemas propagated to new relations? Some relational operators don’t
change the schema, so the relation produced by the LIMIT operator (which restricts
a relation to a maximum number of tuples), for example, has the same schema as
the relation it operates on. For other operators, the situation is more complicated.
UNION, for example, combines two or more relations into one, and tries to merge
the input relations’ schemas. If the schemas are incompatible, due to different types
or number of fields, then the schema of the result of the UNION is unknown.
You can find out the schema for any relation in the data flow using the DESCRIBE
operator. If you want to redefine the schema for a relation, you can use the
FOREACH...GENERATE operator with AS clauses to define the schema for some or
all of the fields of the input relation.
Functions in Pig come in four types:
Eval function
A function that takes one or more expressions and returns another expression. An
example of a built-in eval function is MAX, which returns the maximum value of the
entries in a bag. Some eval functions are aggregate functions, which means they
operate on a bag of data to produce a scalar value; MAX is an example of an
aggregate function. Furthermore, many aggregate functions are algebraic, which
means that the result of the function may be calculated incrementally. In MapReduce
terms, algebraic functions make use of the combiner and are much more efficient to
calculate. MAX is an algebraic function, whereas a function to calculate the median
of a collection of values is an example of a function that is not algebraic.
Filter function
A special type of eval function that returns a logical boolean result. As the name
suggests, filter functions are used in the FILTER operator to remove unwanted rows.
They can also be used in other relational operators that take boolean conditions
and, in general, expressions using boolean or conditional expressions. An example
of a built-in filter function is IsEmpty, which tests whether a bag or a map contains
any items.
Load function
A function that specifies how to load data into a relation from external storage.
Store function
A function that specifies how to save the contents of a relation to external storage.
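The distinction drawn above between algebraic and non-algebraic aggregate functions can be checked with a small plain-Java sketch. It is an illustration under stated assumptions rather than Pig or MapReduce code: AlgebraicSketch is a hypothetical class, each inner list stands in for the data seen by one combiner, and "median" here means the lower median of a non-empty list.

```java
import java.util.Arrays;
import java.util.List;

/** Toy comparison (hypothetical): max can be combined from partial results, median cannot. */
final class AlgebraicSketch {
    /** Maximum of a non-empty list. */
    static int max(List<Integer> values) {
        int m = values.get(0);
        for (int v : values) {
            m = Math.max(m, v);
        }
        return m;
    }

    /** Combiner-style computation: the max of the per-chunk maxima. */
    static int maxOfPartials(List<List<Integer>> chunks) {
        int m = Integer.MIN_VALUE;
        for (List<Integer> chunk : chunks) {
            m = Math.max(m, max(chunk));
        }
        return m;
    }

    /** Lower median of a non-empty list (an assumption made for this sketch). */
    static int median(List<Integer> values) {
        Integer[] sorted = values.toArray(new Integer[0]);
        Arrays.sort(sorted);
        return sorted[(sorted.length - 1) / 2];
    }

    public static void main(String[] args) {
        List<Integer> chunk1 = Arrays.asList(1, 2, 100);
        List<Integer> chunk2 = Arrays.asList(3, 4, 5);
        List<Integer> all = Arrays.asList(1, 2, 100, 3, 4, 5);

        // Partial maxima combine to the true maximum.
        System.out.println(maxOfPartials(Arrays.asList(chunk1, chunk2)) == max(all));

        // The median of the chunk medians differs from the true median here.
        int medianOfMedians = median(Arrays.asList(median(chunk1), median(chunk2)));
        System.out.println(medianOfMedians + " vs " + median(all));
    }
}
```

Because partial maxima can be merged safely, the work can be pushed into a combiner; no comparable shortcut exists for the median, which needs to see all the values.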
Often, load and store functions are implemented by the same type. For example,
PigStorage, which loads data from delimited text files, can store data in the same
format.
Pig comes with a collection of built-in functions, a selection of which are listed in
Table 11-7. The complete list of built-in functions, which includes a large number of
standard math and string functions, can be found in the documentation for each Pig
release.
Table 11-7. A selection of Pig’s built-in functions
Category     Function                  Description
Eval         AVG                       Calculates the average (mean) value of entries in a bag.
             CONCAT                    Concatenates byte arrays or character arrays together.
             COUNT                     Calculates the number of non-null entries in a bag.
             COUNT_STAR                Calculates the number of entries in a bag, including those that are null.
             DIFF                      Calculates the set difference of two bags. If the two arguments are not
                                       bags, then returns a bag containing both if they are equal; otherwise,
                                       returns an empty bag.
             MAX                       Calculates the maximum value of entries in a bag.
             MIN                       Calculates the minimum value of entries in a bag.
             SIZE                      Calculates the size of a type. The size of numeric types is always one;
                                       for character arrays, it is the number of characters; for byte arrays,
                                       the number of bytes; and for containers (tuple, bag, map), it is the
                                       number of entries.
             SUM                       Calculates the sum of the values of entries in a bag.
             TOBAG                     Converts one or more expressions to individual tuples which are then
                                       put in a bag.
             TOKENIZE                  Tokenizes a character array into a bag of its constituent words.
             TOMAP                     Converts an even number of expressions to a map of key-value pairs.
             TOP                       Calculates the top n tuples in a bag.
             TOTUPLE                   Converts one or more expressions to a tuple.
Filter       IsEmpty                   Tests if a bag or map is empty.
Load/Store   PigStorage                Loads or stores relations using a field-delimited text format. Each line
                                       is broken into fields using a configurable field delimiter (defaults to
                                       a tab character) to be stored in the tuple’s fields. It is the default
                                       storage when none is specified.
             BinStorage                Loads or stores relations from or to binary files. A Pig-specific format
                                       is used that uses Hadoop Writable objects.
             TextLoader                Loads relations from a plain-text format. Each line corresponds to a
                                       tuple whose single field is the line of text.
             JsonLoader, JsonStorage   Loads or stores relations from or to a (Pig-defined) JSON format. Each
                                       tuple is stored on one line.
             HBaseStorage              Loads or stores relations from or to HBase tables.
If the function you need is not available, you can write your own. Before you do that,
however, have a look in the Piggy Bank, a repository of Pig functions shared by the
Pig community. For example, there are load and store functions in the Piggy Bank
for Avro data files, CSV files, Hive RCFiles, SequenceFiles, and XML files. The Pig
website has instructions on how to browse and obtain the Piggy Bank functions. If
the Piggy Bank doesn’t have what you need, you can write your own function (and if
it is sufficiently general, you might consider contributing it to the Piggy Bank so that
others can benefit from it, too). These are known as user-defined functions, or UDFs.
Macros provide a way to package reusable pieces of Pig Latin code from within Pig
Latin itself. For example, we can extract the part of our Pig Latin program that
performs grouping on a relation then finds the maximum value in each group, by
defining a macro as follows:
DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
  A = GROUP $X by $group_key;
  $Y = FOREACH A GENERATE group, MAX($X.$max_field);
};
The macro, called max_by_group, takes three parameters: a relation, X, and two
field names, group_key and max_field. It returns a single relation, Y. Within the
macro body, parameters and return aliases are referenced with a $ prefix, such as
$X and $Y.
The macro is used as follows:
records = LOAD 'input/ncdc/micro-tab/sample.txt'
  AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
  (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
max_temp = max_by_group(filtered_records, year, temperature);
DUMP max_temp;
At runtime, Pig will expand the macro using the macro definition. After expansion,
the program looks like the following (the GROUP and FOREACH statements are the
expanded section):
records = LOAD 'input/ncdc/micro-tab/sample.txt'
  AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
  (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
macro_max_by_group_A_0 = GROUP filtered_records by (year);
max_temp = FOREACH macro_max_by_group_A_0 GENERATE group,
  MAX(filtered_records.(temperature));
DUMP max_temp;
You don’t normally see the expanded form, since Pig creates it internally. In
some cases, however, it is useful to see it when writing and debugging macros. You
can get Pig to perform macro expansion only (without executing the script) by passing
the -dryrun argument to pig.
Notice that the parameters that were passed to the macro (filtered_records, year,
and temperature) have been substituted for the names in the macro definition.
Aliases in the macro definition that don’t have a $ prefix, such as A in this example,
are local to the macro definition and are re-written at expansion time to avoid
conflicts with aliases in other parts of the program. In this case, A becomes
macro_max_by_group_A_0 in the expanded form.
To foster reuse, macros can be defined in separate files to Pig scripts, in which case
they need to be imported into any script that uses them. An import statement looks
like this:
IMPORT './ch11/src/main/pig/max_temp.macro';
User-Defined Functions
Pig’s designers realized that the ability to plug in custom code is crucial for all but the
most trivial data processing jobs. For this reason, they made it easy to define and
use user-defined functions. We only cover Java UDFs in this section, but be aware
that you can write UDFs in Python or JavaScript too, both of which are run using the
Java Scripting API.
A Filter UDF
Let’s demonstrate by writing a filter function for filtering out weather records that do
not have a temperature quality reading of satisfactory (or better). The idea is to
change this line:
filtered_records = FILTER records BY temperature != 9999 AND
  (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
to:
filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);
This achieves two things: it makes the Pig script more concise, and it encapsulates
the logic in one place so that it can be easily reused in other scripts. If we were just
writing an ad hoc query, then we probably wouldn’t bother to write a UDF. It’s when
you start doing the same kind of processing over and over again that you see
opportunities for reusable UDFs.
Filter UDFs are all subclasses of FilterFunc, which itself is a subclass of EvalFunc.
We’ll look at EvalFunc in more detail later, but for the moment just note that, in
essence, EvalFunc looks like the following class:
public abstract class EvalFunc<T> {
  public abstract T exec(Tuple input) throws IOException;
}
EvalFunc’s only abstract method, exec(), takes a tuple and returns a single value,
the (parameterized) type T. The fields in the input tuple consist of the expressions
passed to the function—in this case, a single integer. For FilterFunc, T is Boolean,
so the method should return true only for those tuples that should not be filtered out.
For the quality filter, we write a class, IsGoodQuality, that extends FilterFunc and implements the exec() method. See Example 11-1. The Tuple class is essentially a list
of objects with associated types. Here we are concerned only with the first field
(since the function only has a single argument), which we extract by index using the
get() method on Tuple. The field is an integer, so if it’s not null, we cast it and check
whether the value is one that signifies the temperature was a good reading, returning
the appropriate value, true or false.
Example 11-1. A FilterFunc UDF to remove records with unsatisfactory
temperature quality readings
package com.hadoopbook.pig;

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

public class IsGoodQuality extends FilterFunc {

  @Override
  public Boolean exec(Tuple tuple) throws IOException {
    // An empty or missing tuple cannot be a good reading.
    if (tuple == null || tuple.size() == 0) {
      return false;
    }
    try {
      Object object = tuple.get(0);
      if (object == null) {
        return false;
      }
      int i = (Integer) object;
      // Quality codes that signify a good temperature reading.
      return i == 0 || i == 1 || i == 4 || i == 5 || i == 9;
    } catch (ExecException e) {
      throw new IOException(e);
    }
  }
}
To use the new function, we first compile it and package it in a JAR file (the example
code that accompanies this book comes with build instructions for how to do this).
Then we tell Pig about the JAR file with the REGISTER operator, which is given the
local path to the filename (and is not enclosed in quotes):
grunt> REGISTER pig-examples.jar;
Finally, we can invoke the function:
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> com.hadoopbook.pig.IsGoodQuality(quality);
Pig resolves function calls by treating the function’s name as a Java classname and
attempting to load a class of that name. (This, incidentally, is why function names are
case-sensitive: because Java classnames are.) When searching for classes, Pig
uses a classloader that includes the JAR files that have been registered. When
running in distributed mode, Pig will ensure that your JAR files get shipped to the
cluster.
For the UDF in this example, Pig looks for a class with the name
com.hadoopbook.pig.IsGoodQuality, which it finds in the JAR file we registered.
Resolution of built-in functions proceeds in the same way, except for one difference:
Pig has a set of built-in package names that it searches, so the function call does not
have to be a fully qualified name. For example, the function MAX is actually
implemented by a class MAX in the package org.apache.pig.builtin. This is one of
the packages that Pig looks in, so we can write MAX rather than
org.apache.pig.builtin.MAX in our Pig programs.
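The lookup order described above can be pictured with a short Python sketch. It is only an illustration of the idea, not Pig's resolver: the resolve helper, the KNOWN_CLASSES set that stands in for the classloader, and the contents of the search list are all assumptions made for the example.

```python
# Hypothetical stand-in for the classes visible to the classloader
# (built-ins plus anything found in registered JAR files).
KNOWN_CLASSES = {
    "org.apache.pig.builtin.MAX",
    "com.hadoopbook.pig.IsGoodQuality",
}

# Packages searched when a function name is not fully qualified.
# The empty string means "try the name exactly as written".
SEARCH_PACKAGES = ["", "org.apache.pig.builtin"]


def resolve(function_name, packages=SEARCH_PACKAGES):
    """Return the first fully qualified class name that exists, or None."""
    for package in packages:
        candidate = f"{package}.{function_name}" if package else function_name
        if candidate in KNOWN_CLASSES:
            return candidate
    return None


print(resolve("MAX"))
print(resolve("IsGoodQuality"))
print(resolve("IsGoodQuality", SEARCH_PACKAGES + ["com.hadoopbook.pig"]))
```

The second call finds nothing because the package is not on the search list, and the third succeeds once the package is added, which is the effect the import list has in the text above. Because the lookup is a plain string comparison, the sketch is also case-sensitive.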
We can add our package name to the search path by invoking Grunt with this command-line argument: -Dudf.import.list=com.hadoopbook.pig. Or, we can shorten the
function name by defining an alias, using the DEFINE operator:
grunt> DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> isGood(quality);
Defining an alias is a good idea if you want to use the function several times in the
same script. It’s also necessary if you want to pass arguments to the constructor of
the UDF’s implementation class.
Leveraging types
The filter works when the quality field is declared to be of type int, but if the type
information is absent, then the UDF fails! This happens because the field is the
default type, bytearray, represented by the DataByteArray class. Because
DataByteArray is not an Integer, the cast fails.
The obvious way to fix this is to convert the field to an integer in the exec() method.
However, there is a better way, which is to tell Pig the types of the fields that the
function expects. The getArgToFuncMapping() method on EvalFunc is provided for
precisely this reason. We can override it to tell Pig that the first field should be an
integer:
  public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
    List<FuncSpec> funcSpecs = new ArrayList<FuncSpec>();
    funcSpecs.add(new FuncSpec(this.getClass().getName(),
        new Schema(new Schema.FieldSchema(null, DataType.INTEGER))));
    return funcSpecs;
  }
This method returns a FuncSpec object corresponding to each of the fields of the
tuple that are passed to the exec() method. Here there is a single field, and we
construct an anonymous FieldSchema (the name is passed as null, since Pig
ignores the name when doing type conversion). The type is specified using the
INTEGER constant on Pig’s DataType class.
With the amended function, Pig will attempt to convert the argument passed to the
function to an integer. If the field cannot be converted, then a null is passed for the
field. The exec() method always returns false if the field is null. For this application,
this behavior is appropriate, as we want to filter out records whose quality field is
unintelligible.
Here’s the final program using the new function:
-- max_temp_filter_udf.pig
REGISTER pig-examples.jar;
DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
records = LOAD 'input/ncdc/micro-tab/sample.txt'
  AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
  MAX(filtered_records.temperature);
DUMP max_temp;
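The effect of declaring the argument type can be sketched in Python. This is a rough model under stated assumptions, not Pig's conversion code: to_integer is a hypothetical helper that stands in for Pig's bytearray-to-int conversion, and the exact strings Pig accepts as integers may differ.

```python
GOOD_QUALITY_CODES = {0, 1, 4, 5, 9}


def to_integer(raw):
    """Convert a raw bytearray field to int, or None if it cannot be converted."""
    if raw is None:
        return None
    try:
        return int(raw.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return None


def is_good_quality(raw_quality):
    """Mirror the filter's decision: a null field is never a good reading."""
    quality = to_integer(raw_quality)
    if quality is None:
        return False
    return quality in GOOD_QUALITY_CODES


print(is_good_quality(b"1"), is_good_quality(b"2"), is_good_quality(b"abc"))
```

An unconvertible field becomes a null, and the filter then rejects the record instead of failing with a cast error.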
An Eval UDF
Writing an eval function is a small step up from writing a filter function. Consider a
UDF (see Example 11-2) for trimming the leading and trailing whitespace from
chararray values, just like the trim() method on java.lang.String. We will use this UDF
later in the chapter.
Example 11-2. An EvalFunc UDF to trim leading and trailing whitespace from
chararray values
public class Trim extends EvalFunc<String> {

  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0) {
      return null;
    }
    try {
      Object object = input.get(0);
      if (object == null) {
        return null;
      }
      return ((String) object).trim();
    } catch (ExecException e) {
      throw new IOException(e);
    }
  }

  public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
    List<FuncSpec> funcList = new ArrayList<FuncSpec>();
    funcList.add(new FuncSpec(this.getClass().getName(), new Schema(
        new Schema.FieldSchema(null, DataType.CHARARRAY))));
    return funcList;
  }
}
An eval function extends the EvalFunc class, parameterized by the type of the return
value (which is String for the Trim UDF).[7] The exec() and getArgToFuncMapping()
methods are straightforward, like the ones in the IsGoodQuality UDF. When you
write an eval function, you need to consider what the output’s schema looks like. In
the following statement, the schema of B is determined by the function udf:
B = FOREACH A GENERATE udf($0);
If udf creates tuples with scalar fields, then Pig can determine B’s schema through
reflection. For complex types such as bags, tuples, or maps, Pig needs more help,
and you should implement the outputSchema() method to give Pig the information
about the output schema.
The Trim UDF returns a string, which Pig translates as a chararray, as can be seen
from the following session:
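grunt> DUMP A;
( pomegranate)
(banana )
(apple)
(lychee )
grunt> DESCRIBE A;
A: {fruit: chararray}
grunt> B = FOREACH A GENERATE com.hadoopbook.pig.Trim(fruit);
grunt> DUMP B;
(pomegranate)
(banana)
(apple)
(lychee)
grunt> DESCRIBE B;
B: {chararray}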
A has chararray fields that have leading and trailing spaces. We create B from A by
applying the Trim function to the first field in A (named fruit). B’s fields are correctly
inferred to be of type chararray.
Dynamic Invokers
Sometimes you may want to use a function that is provided by a Java library, but
without going to the effort of writing a UDF. Dynamic invokers allow you to do this by
calling Java methods directly from a Pig script. The trade-off is that method calls are
made via reflection, which, when being called for every record in a large dataset, can
impose significant overhead. So for scripts that are run repeatedly a dedicated UDF
is normally preferred.
[7] Although not relevant for this example, eval functions that operate on a bag may
additionally implement Pig’s Algebraic or Accumulator interfaces for more efficient
processing of the bag in chunks.
The following snippet shows how we could define and use a trim UDF that uses the
Apache Commons Lang StringUtils class:
grunt> DEFINE trim
  InvokeForString('org.apache.commons.lang.StringUtils.trim', 'String');
grunt> B = FOREACH A GENERATE trim(fruit);
grunt> DUMP B;
(pomegranate)
(banana)
(apple)
(lychee)
The InvokeForString invoker is used since the return type of the method is a
String. (There are also InvokeForInt, InvokeForLong, InvokeForDouble, and
InvokeForFloat invokers.) The first argument to the invoker constructor is the
fully-qualified method to be invoked. The second is a space-separated list of the
method argument classes.
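The idea of naming a library method in a string and calling it by reflection has a close analogue in Python, sketched below. The make_invoker helper is hypothetical and is not part of Pig; it assumes the first component of the dotted name is an importable module and that the remaining components are attributes.

```python
import importlib


def make_invoker(qualified_name):
    """Look up a callable by its dotted name and return a wrapper around it."""
    parts = qualified_name.split(".")
    target = importlib.import_module(parts[0])  # first part is the module
    for attribute in parts[1:]:
        target = getattr(target, attribute)  # walk down to the callable

    def invoke(*args):
        return target(*args)

    return invoke


# Python's str.strip plays the role of StringUtils.trim here.
trim = make_invoker("builtins.str.strip")
fruits = [" pomegranate", "banana ", "apple", "lychee "]
trimmed = [trim(fruit) for fruit in fruits]
print(trimmed)
```

This sketch resolves the name once and reuses the callable. The per-record reflection cost mentioned in the text is a property of the Java invokers and is not modeled here.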
A Load UDF
We’ll demonstrate a custom load function that can read plain-text column ranges
as fields, very much like the Unix cut command. It is used as follows:
grunt> records = LOAD 'input/ncdc/micro/sample.txt'
USING com.hadoopbook.pig.CutLoadFunc('16-19,88-92,93-93')
AS (year:int, temperature:int, quality:int);
grunt> DUMP records;
The string passed to CutLoadFunc is the column specification; each comma-separated range defines a field, which is assigned a name and type in the AS
clause. Let’s examine the implementation of CutLoadFunc shown in Example 11-3.
Example 11-3. A LoadFunc UDF to load tuple fields as column ranges
public class CutLoadFunc extends LoadFunc {

  private static final Log LOG = LogFactory.getLog(CutLoadFunc.class);

  private final List<Range> ranges;
  private final TupleFactory tupleFactory = TupleFactory.getInstance();
  private RecordReader reader;

  public CutLoadFunc(String cutPattern) {
    ranges = Range.parse(cutPattern);
  }

  public void setLocation(String location, Job job)
      throws IOException {
    FileInputFormat.setInputPaths(job, location);
  }

  public InputFormat getInputFormat() {
    return new TextInputFormat();
  }

  public void prepareToRead(RecordReader reader, PigSplit split) {
    this.reader = reader;
  }

  public Tuple getNext() throws IOException {
    try {
      if (!reader.nextKeyValue()) {
        return null; // no more records in this split
      }
      Text value = (Text) reader.getCurrentValue();
      String line = value.toString();
      Tuple tuple = tupleFactory.newTuple(ranges.size());
      for (int i = 0; i < ranges.size(); i++) {
        Range range = ranges.get(i);
        if (range.getEnd() > line.length()) {
          LOG.warn(String.format(
              "Range end (%s) is longer than line length (%s)",
              range.getEnd(), line.length()));
          break; // leave this field and any later ones null
        }
        tuple.set(i, new DataByteArray(range.getSubstring(line)));
      }
      return tuple;
    } catch (InterruptedException e) {
      throw new ExecException(e);
    }
  }
}
In Pig, like in Hadoop, data loading takes place before the mapper runs, so it is important that the input can be split into portions that are independently handled by
each mapper.
From Pig 0.7.0 the load and store function interfaces have been overhauled to be
more closely aligned with Hadoop’s InputFormat and OutputFormat classes.
Functions written for previous versions of Pig will need rewriting (guidelines for doing
so are available from the Pig project). A LoadFunc
will typically use an existing underlying InputFormat to create records, with the
LoadFunc providing the logic for turning the records into Pig tuples.
CutLoadFunc is constructed with a string that specifies the column ranges to use for
each field. The logic for parsing this string and creating a list of internal Range
objects that encapsulates these ranges is contained in the Range class.
Pig calls setLocation() on a LoadFunc to pass the input location to the loader. Since
CutLoadFunc uses a TextInputFormat to break the input into lines, we just pass the
location to set the input path using a static method on FileInputFormat. Pig uses the
new MapReduce API, so we use the input and output formats and associated
classes from the org.apache.hadoop.mapreduce package.
Next, Pig calls the getInputFormat() method to create a RecordReader for each split,
just like in MapReduce. Pig passes each RecordReader to the prepareToRead()
method of CutLoadFunc, where we store a reference to it so that we can use it in the
getNext() method for iterating through the records.
The Pig runtime calls getNext() repeatedly, and the load function reads tuples from
the reader until the reader reaches the last record in its split. At this point, it returns
null to signal that there are no more tuples to be read.
It is the responsibility of the getNext() implementation to turn lines of the input file
into Tuple objects. It does this by means of a TupleFactory, a Pig class for creating
Tuple instances. The newTuple() method creates a new tuple with the required
number of fields, which is just the number of Range objects, and the fields are
populated using substrings of the line, which are determined by the Range objects.
We need to think about what to do if the line is shorter than the range asked for. One
option is to throw an exception and stop further processing. This is appropriate if
your application cannot tolerate incomplete or corrupt records. In many cases, it is
better to return a tuple with null fields and let the Pig script handle the incomplete
data as it sees fit. This is the approach we take here; by exiting the for loop if the
range end is past the end of the line, we leave the current field and any subsequent
fields in the tuple with their default value of null.
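The null-padding behavior described above can be sketched in Python. The Range class is not shown in this section, so the column convention used here (1-based, inclusive, as in the Unix cut command) is an assumption, and parse_ranges and cut_line are hypothetical helpers rather than the book's code.

```python
def parse_ranges(spec):
    """Parse a spec such as '16-19,88-92,93-93' into (start, end) pairs."""
    ranges = []
    for part in spec.split(","):
        start, end = part.split("-")
        ranges.append((int(start), int(end)))
    return ranges


def cut_line(line, ranges):
    """Build one tuple (as a list) per line; unfilled fields stay None."""
    fields = [None] * len(ranges)
    for i, (start, end) in enumerate(ranges):
        if end > len(line):
            break  # leave this field and all later ones as None
        fields[i] = line[start - 1:end]  # assumed 1-based, inclusive columns
    return fields


line = "0123456789"
print(cut_line(line, parse_ranges("1-3,5-6")))
print(cut_line(line, parse_ranges("1-3,9-12,5-6")))
```

In the second call the middle range runs past the end of the line, so that field and the one after it are left as None even though the last range would have fit.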
Using a schema
Let’s now consider the type of the fields being loaded. If the user has specified a
schema, then the fields need converting to the relevant types. However, this is
performed lazily by Pig, and so the loader should always construct tuples of type
bytearray, using the DataByteArray type. The loader function still has the
opportunity to do the conversion, however, by overriding getLoadCaster() to return a
custom implementation of the LoadCaster interface, which provides a collection of
conversion methods for this purpose:
public interface LoadCaster {
  public Integer bytesToInteger(byte[] b) throws IOException;
  public Long bytesToLong(byte[] b) throws IOException;
  public Float bytesToFloat(byte[] b) throws IOException;
  public Double bytesToDouble(byte[] b) throws IOException;
  public String bytesToCharArray(byte[] b) throws IOException;
  public Map<String, Object> bytesToMap(byte[] b) throws IOException;
  public Tuple bytesToTuple(byte[] b) throws IOException;
  public DataBag bytesToBag(byte[] b) throws IOException;
}
CutLoadFunc doesn’t override getLoadCaster() since the default implementation
returns Utf8StorageConverter, which provides standard conversions between UTF-8
encoded data and Pig data types.
In some cases, the load function itself can determine the schema. For example, if we
were loading self-describing data like XML or JSON, we could create a schema for
Pig by looking at the data. Alternatively, the load function may determine the schema
in another way, such as an external file, or by being passed information in its
constructor. To support such cases, the load function should implement the
LoadMetadata interface (in addition to the LoadFunc interface), so it can supply a
schema to the Pig runtime. Note, however, that if a user supplies a schema in the AS
clause of LOAD, then it takes precedence over the one specified through the
LoadMetadata interface.
A load function may additionally implement the LoadPushDown interface as a means
for finding out which columns the query is asking for. This can be a useful
optimization for column-oriented storage, so that the loader only loads the columns
that are needed by the query. There is no obvious way for CutLoadFunc to load only
a subset of columns, since it reads the whole line for each tuple, so we don’t use this
optimization.
Data Processing Operators
Loading and Storing Data
Throughout this chapter, we have seen how to load data from external storage for
processing in Pig. Storing the results is straightforward, too. Here’s an example of
using PigStorage to store tuples as plain-text values separated by a colon character:
grunt> STORE A INTO 'out' USING PigStorage(':');
grunt> cat out
Other built-in storage functions were described in Table 11-7.
Filtering Data
Once you have some data loaded into a relation, the next step is often to filter it to
remove the data that you are not interested in. By filtering early in the processing
pipeline, you minimize the amount of data flowing through the system, which can
improve efficiency.
We have already seen how to remove rows from a relation using the FILTER
operator with simple expressions and a UDF. The FOREACH...GENERATE operator
is used to act on every row in a relation. It can be used to remove fields or to
generate new ones. In this example, we do both:
(Joe,cherry,2)
(Ali,apple,3)
grunt> B = FOREACH A GENERATE $0, $2+1, 'Constant';
grunt> DUMP B;
Here we have created a new relation B with three fields. Its first field is a projection
of the first field ($0) of A. B’s second field is the third field of A ($2) with one added to
it. B’s third field is a constant field (every row in B has the same third field) with the
chararray value Constant.
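The projection can be mimicked in Python with a list comprehension. The sketch below uses only the two tuples of A that appear in the listing above; it models the semantics of the statement, not how Pig executes it.

```python
A = [("Joe", "cherry", 2), ("Ali", "apple", 3)]

# B = FOREACH A GENERATE $0, $2+1, 'Constant';
B = [(row[0], row[2] + 1, "Constant") for row in A]

for row in B:
    print(row)
```

Each output tuple keeps the first field, replaces the third with its value plus one, drops the second, and gains a constant field.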
The FOREACH...GENERATE operator has a nested form to support more complex
processing. In the following example, we compute various statistics for the weather
dataset:
-- year_stats.pig
REGISTER pig-examples.jar;
DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
records = LOAD 'input/ncdc/all/19{1,2,3,4,5}0*'
  USING com.hadoopbook.pig.CutLoadFunc('5-10,11-15,16-19,88-92,93-93')
  AS (usaf:chararray, wban:chararray, year:int, temperature:int, quality:int);
grouped_records = GROUP records BY year PARALLEL 30;
year_stats = FOREACH grouped_records {
  uniq_stations = DISTINCT records.usaf;
  good_records = FILTER records BY isGood(quality);
  GENERATE FLATTEN(group), COUNT(uniq_stations) AS station_count,
    COUNT(good_records) AS good_record_count, COUNT(records) AS record_count;
}
DUMP year_stats;
Using the cut UDF we developed earlier, we load various fields from the input
dataset into the records relation. Next we group records by year. Notice the
PARALLEL keyword for setting the number of reducers to use; this is vital when
running on a cluster. Then we process each group using a nested
FOREACH...GENERATE operator. The first nested statement creates a relation for
the distinct USAF identifiers for stations using the DISTINCT operator. The second
nested statement creates a relation for the records with “good” readings using the
FILTER operator and a UDF. The final nested statement is a GENERATE statement
(a nested FOREACH...GENERATE must always have a GENERATE statement as
the last nested statement) that generates the summary fields of interest using the
grouped records, as well as the relations created in the nested block.
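The per-group processing can be sketched in Python as follows. The records are made up for the illustration (they are not taken from the NCDC data), the tuple layout is simplified to (usaf, year, temperature, quality), and the set of good quality codes is the one used by the IsGoodQuality UDF.

```python
from collections import defaultdict

GOOD_QUALITY_CODES = {0, 1, 4, 5, 9}

# Made-up records: (usaf, year, temperature, quality)
records = [
    ("010010", 1950, 22, 1),
    ("010010", 1950, -11, 2),
    ("010014", 1950, 0, 1),
    ("010010", 1949, 111, 1),
]

# GROUP records BY year
grouped = defaultdict(list)
for record in records:
    grouped[record[1]].append(record)

# Nested FOREACH: per group, compute distinct stations, good records, totals.
year_stats = []
for year in sorted(grouped):
    bag = grouped[year]
    uniq_stations = {usaf for usaf, _, _, _ in bag}
    good_records = [r for r in bag if r[3] in GOOD_QUALITY_CODES]
    year_stats.append((year, len(uniq_stations), len(good_records), len(bag)))

for row in year_stats:
    print(row)
```

The body of the loop corresponds to the nested block: two intermediate relations are derived from the bag, and the final tuple plays the role of the GENERATE statement.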
Running it on a few years of data, we get the following:
The fields are year, number of unique stations, total number of good readings, and
total number of readings. We can see how the number of weather stations and
readings grew over time.
The STREAM operator allows you to transform data in a relation using an external
program or script. It is named by analogy with Hadoop Streaming, which provides a
similar capability for MapReduce (see “Hadoop Streaming”).
STREAM can use built-in commands with arguments. Here is an example that uses
the Unix cut command to extract the second field of each tuple in A. Note that the
command and its arguments are enclosed in backticks:
grunt> C = STREAM A THROUGH `cut -f 2`;
grunt> DUMP C;
The STREAM operator uses PigStorage to serialize and deserialize relations to and
from the program’s standard input and output streams. Tuples in A are converted to
tab-delimited lines that are passed to the script. The output of the script is read one
line at a time and split on tabs to create new tuples for the output relation C. You can
provide a custom serializer and deserializer, which implement PigToStream and
StreamToPig respectively (both in the org.apache.pig package), using the DEFINE
operator.
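The serialization round trip can be sketched in Python (3.7 or later, for the capture_output argument). This is an illustration of the idea rather than Pig's implementation: stream_through is a hypothetical helper, and a small inline Python program stands in for an external command such as cut -f 2 so that the sketch does not depend on Unix tools.

```python
import subprocess
import sys

# Stand-in for an external command such as `cut -f 2`: a tiny Python program
# that prints the second tab-separated field of every input line.
SECOND_FIELD_PROGRAM = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.rstrip('\\n').split('\\t')[1])\n"
)


def stream_through(relation, command):
    """Send tuples to `command` as tab-delimited lines and parse its output."""
    payload = "".join("\t".join(str(f) for f in row) + "\n" for row in relation)
    result = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return [tuple(line.split("\t")) for line in result.stdout.splitlines()]


A = [("Joe", "cherry", 2), ("Ali", "apple", 3)]
C = stream_through(A, [sys.executable, "-c", SECOND_FIELD_PROGRAM])
print(C)
```

Every field comes back as a string, which is why a STREAM statement is usually followed by an AS clause that assigns types to the output fields.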
Pig streaming is most powerful when you write custom processing scripts. The
following Python script filters out bad weather records:
#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
  (year, temp, q) = line.strip().split()
  if (temp != "9999" and re.match("[01459]", q)):
    print("%s\t%s" % (year, temp))
To use the script, you need to ship it to the cluster. This is achieved via a DEFINE
clause, which also creates an alias for the STREAM command. The STREAM
statement can then refer to the alias, as the following Pig script shows:
-- max_temp_filter_stream.pig
DEFINE is_good_quality ``
SHIP ('ch11/src/main/python/');
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = STREAM records THROUGH is_good_quality
  AS (year:chararray, temperature:int);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
  MAX(filtered_records.temperature);
DUMP max_temp;
Grouping and Joining Data
Joining datasets in MapReduce takes some work on the part of the programmer,
whereas Pig has very good built-in support for join operations, making it much more
approachable. Since the large datasets that are suitable for analysis by Pig (and
MapReduce in general) are not normalized, joins are used less frequently in Pig
than they are in SQL.
Let’s look at an example of an inner join. Consider the relations A and B:
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
We can join the two relations on the numerical (identity) field in each:
grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
This is a classic inner join, where each match between the two relations corresponds
to a row in the result. (It’s actually an equijoin since the join predicate is equality.)
The result’s fields are made up of all the fields of all the input relations.
You should use the general join operator if all the relations being joined are too large
to fit in memory. If one of the relations is small enough to fit in memory, there is a
special type of join called a fragment replicate join, which is implemented by
distributing the small input to all the mappers and performing a map-side join using
an in-memory lookup table against the (fragmented) larger relation. There is a
special syntax for telling Pig to use a fragment replicate join:
grunt> C = JOIN A BY $0, B BY $1 USING "replicated";
The first relation must be the large one, followed by one or more small ones (all of
which must fit in memory).
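A fragment replicate join amounts to a hash lookup against the small relation while the large one is scanned. The Python sketch below illustrates that idea under simplifying assumptions: everything runs in one process, replicated_join is a hypothetical helper, and the sample tuples are those of A and B, including (Hank,2), which appears in the CROSS output later in this section.

```python
from collections import defaultdict

A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]


def replicated_join(large, large_key, small, small_key):
    """Join by holding `small` in an in-memory table and streaming `large`."""
    lookup = defaultdict(list)
    for row in small:
        lookup[row[small_key]].append(row)
    joined = []
    for row in large:  # in Pig this scan would be split across the mappers
        for match in lookup.get(row[large_key], []):
            joined.append(row + match)
    return joined


# C = JOIN A BY $0, B BY $1 USING "replicated";
C = replicated_join(A, 0, B, 1)
for row in C:
    print(row)
```

No shuffle or sort of the large relation is needed, which is why the small relation must fit in memory on every mapper.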
Pig also supports outer joins using a syntax that is similar to SQL’s. For example:
grunt> C = JOIN A BY $0 LEFT OUTER, B BY $1;
grunt> DUMP C;
JOIN always gives a flat structure: a set of tuples. The COGROUP statement is
similar to JOIN, but creates a nested set of output tuples. This can be useful if you
want to exploit the structure in subsequent statements:
grunt> D = COGROUP A BY $0, B BY $1;
grunt> DUMP D;
COGROUP generates a tuple for each unique grouping key. The first field of each
tuple is the key, and the remaining fields are bags of tuples from the relations with a
matching key. The first bag contains the matching tuples from relation A with the
same key. Similarly, the second bag contains the matching tuples from relation B
with the same key.
If for a particular key a relation has no matching key, then the bag for that relation is
empty. For example, since no one has bought a scarf (with ID 1), the second bag in
the tuple for that row is empty. This is an example of an outer join, which is the
default type for COGROUP. It can be made explicit using the OUTER keyword,
making this COGROUP statement the same as the previous one:
grunt> D = COGROUP A BY $0 OUTER, B BY $1 OUTER;
Note on joins: there are more keywords that may be used in the USING clause of
JOIN, including "skewed" (for large datasets with a skewed keyspace) and "merge" (to
effect a merge join for inputs that are already sorted on the join key). See Pig’s
documentation for details on how to use these specialized joins.
You can suppress rows with empty bags by using the INNER keyword, which gives
the COGROUP inner join semantics. The INNER keyword is applied per relation, so
the following only suppresses rows when relation A has no match (dropping the
unknown product 0 here):
grunt> E = COGROUP A BY $0 INNER, B BY $1;
grunt> DUMP E;
We can flatten this structure to discover who bought each of the items in relation A:
grunt> F = FOREACH E GENERATE FLATTEN(A), B.$0;
grunt> DUMP F;
Using a combination of COGROUP, INNER, and FLATTEN (which removes nesting)
it’s possible to simulate an (inner) JOIN:
grunt> G = COGROUP A BY $0 INNER, B BY $1 INNER;
grunt> H = FOREACH G GENERATE FLATTEN($1), FLATTEN($2);
grunt> DUMP H;
This gives the same result as JOIN A BY $0, B BY $1.
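The relationship between COGROUP, INNER, and FLATTEN can be checked with a small Python model. The cogroup and flatten_both helpers are hypothetical, keys are sorted only to make the output predictable (Pig gives no ordering guarantee), and the sample tuples are those of A and B, including (Hank,2) from the CROSS output later in this section.

```python
A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]


def cogroup(a, a_key, b, b_key, a_inner=False, b_inner=False):
    """Return (key, bag_from_a, bag_from_b) for every key, like COGROUP."""
    keys = sorted({row[a_key] for row in a} | {row[b_key] for row in b})
    result = []
    for key in keys:
        bag_a = [row for row in a if row[a_key] == key]
        bag_b = [row for row in b if row[b_key] == key]
        if (a_inner and not bag_a) or (b_inner and not bag_b):
            continue  # INNER drops keys whose bag for that relation is empty
        result.append((key, bag_a, bag_b))
    return result


def flatten_both(cogrouped):
    """FLATTEN both bags: emit the cross-product of the two bags per key."""
    return [ra + rb
            for _, bag_a, bag_b in cogrouped
            for ra in bag_a
            for rb in bag_b]


D = cogroup(A, 0, B, 1)                              # outer on both sides
E = cogroup(A, 0, B, 1, a_inner=True)                # drops unknown product 0
G = cogroup(A, 0, B, 1, a_inner=True, b_inner=True)
H = flatten_both(G)
for row in H:
    print(row)
```

H contains one tuple per matching pair, which is what an inner equijoin on the same keys produces.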
If the join key is composed of several fields, you can specify them all in the BY
clauses of the JOIN or COGROUP statement. Make sure that the number of fields in
each BY clause is the same.
Here’s another example of a join in Pig, in a script for calculating the maximum temperature for every station over a time period controlled by the input:
-- max_temp_station_name.pig
REGISTER pig-examples.jar;
DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
stations = LOAD 'input/ncdc/metadata/stations-fixed-width.txt'
USING com.hadoopbook.pig.CutLoadFunc('1-6,8-12,14-42')
AS (usaf:chararray, wban:chararray, name:chararray);
trimmed_stations = FOREACH stations GENERATE usaf,
wban, com.hadoopbook.pig.Trim(name);
records = LOAD 'input/ncdc/all/191*'
USING com.hadoopbook.pig.CutLoadFunc('5-10,11-15,88-92,93-93')
AS (usaf:chararray, wban:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);
grouped_records = GROUP filtered_records BY (usaf, wban) PARALLEL 30;
max_temp = FOREACH grouped_records GENERATE FLATTEN(group),
  MAX(filtered_records.temperature);
max_temp_named = JOIN max_temp BY (usaf, wban), trimmed_stations BY (usaf, wban)
  PARALLEL 30;
max_temp_result = FOREACH max_temp_named GENERATE $0, $1, $5, $2;
STORE max_temp_result INTO 'max_temp_by_station';
We use the cut UDF we developed earlier to load one relation holding the station IDs
(USAF and WBAN identifiers) and names, and one relation holding all the weather
records, keyed by station ID. We group the filtered weather records by station ID and
aggregate by maximum temperature, before joining with the stations. Finally, we
project out the fields we want in the final result: USAF, WBAN, station name, maximum temperature.
Here are a few results for the 1910s:
This query could be made more efficient by using a fragment replicate join, as the
station metadata is small.
Pig Latin includes the cross-product operator (also known as the cartesian product),
which joins every tuple in a relation with every tuple in a second relation (and with
every tuple in further relations if supplied). The size of the output is the product of the
size of the inputs, potentially making the output very large:
grunt> I = CROSS A, B;
grunt> DUMP I;
(2,Tie,Joe,2)
(2,Tie,Hank,4)
(2,Tie,Ali,0)
(2,Tie,Eve,3)
(2,Tie,Hank,2)
(4,Coat,Joe,2)
(4,Coat,Hank,4)
(4,Coat,Ali,0)
(4,Coat,Eve,3)
(4,Coat,Hank,2)
(3,Hat,Joe,2)
(3,Hat,Hank,4)
(3,Hat,Ali,0)
(3,Hat,Eve,3)
(3,Hat,Hank,2)
(1,Scarf,Joe,2)
(1,Scarf,Hank,4)
(1,Scarf,Ali,0)
(1,Scarf,Eve,3)
(1,Scarf,Hank,2)
When dealing with large datasets, you should try to avoid operations that generate
intermediate representations that are quadratic (or worse) in size. Computing the
cross-product of the whole input dataset is rarely needed, if ever.
For example, at first blush one might expect that calculating pairwise document similarity in a corpus of documents would require every document pair to be generated
before calculating their similarity. However, if one starts with the insight that most
document pairs have a similarity score of zero (that is, they are unrelated), then we
can find a way to a better algorithm.
In this case, the key idea is to focus on the entities that we are using to calculate
similarity (terms in a document, for example) and make them the center of the
algorithm. In practice, we also remove terms that don’t help discriminate between
documents (stopwords), and this reduces the problem space still further. Using this
technique to analyze a set of roughly one million (10^6) documents generates on the
order of one billion (10^9) intermediate pairs,[9] rather than the one trillion (10^12)
produced by the naive approach (generating the cross-product of the input) or the
approach with no stopword removal.
[9] “Pairwise Document Similarity in Large Collections with MapReduce,” Elsayed,
Lin, and Oard (2008, College Park, MD: University of Maryland).
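The term-centered idea can be sketched in Python on a toy corpus. This is a simplified illustration, not the algorithm from the cited paper: the documents and stopword list are made up, and the score is an unnormalized product of term frequencies.

```python
from collections import defaultdict
from itertools import combinations

STOPWORDS = {"the", "a", "of"}  # illustrative only

documents = {
    "d1": "the cat sat",
    "d2": "the cat ran",
    "d3": "a dog barked",
}

# Step 1: build postings, term -> {doc_id: term frequency}, skipping stopwords.
postings = defaultdict(dict)
for doc_id, text in documents.items():
    for term in text.split():
        if term not in STOPWORDS:
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1

# Step 2: for each term, emit only the pairs of documents that share it.
similarity = defaultdict(int)
for term, doc_freqs in postings.items():
    for left, right in combinations(sorted(doc_freqs), 2):
        similarity[(left, right)] += doc_freqs[left] * doc_freqs[right]

print(dict(similarity))
```

Only one of the three possible document pairs is ever generated, because pairs that share no term never appear in any postings list. Without the stopword filter, "the" would still only pair d1 with d2 here, but on a real corpus common words pair almost every document with every other.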
Although COGROUP groups the data in two or more relations, the GROUP
statement groups the data in a single relation. GROUP supports grouping by more
than equality of keys: you can use an expression or user-defined function as the
group key. For example, consider the following relation A:
grunt> DUMP A;
(Joe,cherry)
(Ali,apple)
(Joe,banana)
(Eve,apple)
Let’s group by the number of characters in the second field:
grunt> B = GROUP A BY SIZE($1);
grunt> DUMP B;
GROUP creates a relation whose first field is the grouping field, which is given the
alias group. The second field is a bag containing the grouped fields with the same
schema as the original relation (in this case, A).
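Grouping by an expression can be modeled in Python with a dictionary keyed on the computed value. The sketch uses the four tuples of A shown above; the sorting of the result is only for predictable output and is not something Pig guarantees.

```python
from collections import defaultdict

A = [("Joe", "cherry"), ("Ali", "apple"), ("Joe", "banana"), ("Eve", "apple")]

# B = GROUP A BY SIZE($1);
groups = defaultdict(list)
for row in A:
    groups[len(row[1])].append(row)

# Each output tuple is (group key, bag of the original tuples).
B = sorted(groups.items())
for group, bag in B:
    print(group, bag)
```

Each bag holds complete tuples with the same schema as A, not just the field that was used to compute the key.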
There are also two special grouping operations: ALL and ANY. ALL groups all the
tuples in a relation in a single group, as if the GROUP function was a constant:
grunt> C = GROUP A ALL;
grunt> DUMP C;
Note that there is no BY in this form of the GROUP statement. The ALL grouping is
commonly used to count the number of tuples in a relation, as shown in “Validation
and nulls”. The ANY keyword is used to group the tuples in a relation randomly,
which can be useful for sampling.
Sorting Data
Relations are unordered in Pig. Consider a relation A:
grunt> DUMP A;
There is no guarantee which order the rows will be processed in. In particular, when
retrieving the contents of A using DUMP or STORE, the rows may be written in any
order. If you want to impose an order on the output, you can use the ORDER
operator to sort a relation by one or more fields. The default sort order compares
fields of the same type using the natural ordering, and different types are given an
arbitrary, but deterministic, ordering (a tuple is always “less than” a bag, for
example).
The following example sorts A by the first field in ascending order and by the second
field in descending order:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
Any further processing on a sorted relation is not guaranteed to retain its order. For
example:
grunt> C = FOREACH B GENERATE *;
Even though relation C has the same contents as relation B, its tuples may be
emitted in any order by a DUMP or a STORE. It is for this reason that it is usual to
perform the ORDER operation just before retrieving the output.
The LIMIT statement is useful for limiting the number of results, as a quick and dirty
way to get a sample of a relation; prototyping (the ILLUSTRATE command) should
be preferred for generating more representative samples of the data. LIMIT can be used
immediately after the ORDER statement to retrieve the first n tuples. Usually, LIMIT
will select any n tuples from a relation, but when used immediately after an ORDER
statement, the order is retained (in an exception to the rule that processing a
relation does not retain its order):
grunt> D = LIMIT B 2;
grunt> DUMP D;
If the limit is greater than the number of tuples in the relation, all tuples are returned
(so LIMIT has no effect).
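The combination of a mixed-direction sort and a limit can be sketched in Python. The tuples below are made up for the illustration, and the two-pass sort is simply one convenient way to express "first field ascending, second field descending" in Python; it says nothing about how Pig sorts.

```python
A = [(2, 3), (1, 2), (2, 4)]  # made-up tuples

# B = ORDER A BY $0, $1 DESC;
# Sort on the secondary key first; Python's sort is stable, so the
# second pass on the primary key keeps the secondary order within ties.
B = sorted(A, key=lambda row: row[1], reverse=True)
B = sorted(B, key=lambda row: row[0])

# D = LIMIT B 2;
D = B[:2]

print(B)
print(D)
```

A limit larger than the relation simply returns every tuple, which matches the behavior described above.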
Using LIMIT can improve the performance of a query because Pig tries to apply the
limit as early as possible in the processing pipeline, to minimize the amount of data
that needs to be processed. For this reason, you should always use LIMIT if you are
not interested in the entire output.
Combining and Splitting Data
Sometimes you have several relations that you would like to combine into one. For
this, the UNION statement is used. For example:
grunt> DUMP A;
(2,3)
grunt> DUMP B;
(z,x,8)
grunt> C = UNION A, B;
grunt> DUMP C;
C is the union of relations A and B, and since relations are unordered, the order of
the tuples in C is undefined. Also, it’s possible to form the union of two relations with
different schemas or with different numbers of fields, as we have done here. Pig
attempts to merge the schemas from the relations that UNION is operating on. In this
case, they are incompatible, so C has no schema:
grunt> DESCRIBE A;
{f0: int,f1: int}
grunt> DESCRIBE B;
{f0: chararray,f1: chararray,f2: int}
grunt> DESCRIBE C;
Schema for C unknown.
If the output relation has no schema, your script needs to be able to handle tuples
that vary in the number of fields and/or types. The SPLIT operator is the opposite of
UNION; it partitions a relation into two or more relations. See “Validation and nulls”
for an example of how to use it.
Pig in Practice
There are some practical techniques that are worth knowing about when you are
developing and running Pig programs. This section covers some of them.
When running in MapReduce mode it’s important that the degree of parallelism
matches the size of the dataset. By default, Pig sets the number of reducers by
looking at the size of the input, and using one reducer per 1GB of input, up to a
maximum of 999 reducers. You can override these parameters by setting
pig.exec.reducers.bytes.per.reducer (the default is 1000000000 bytes) and
pig.exec.reducers.max (default 999).
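The default rule can be written down as a one-line calculation. The sketch below is an approximation under stated assumptions: the text gives the per-reducer size and the cap, while rounding up and using at least one reducer are assumptions made here, and estimate_reducers is a hypothetical name.

```python
import math


def estimate_reducers(input_bytes,
                      bytes_per_reducer=1_000_000_000,
                      max_reducers=999):
    """One reducer per `bytes_per_reducer` of input, capped at `max_reducers`."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)  # assumed: round up
    return max(1, min(wanted, max_reducers))             # assumed: at least one


print(estimate_reducers(2_500_000_000))      # 2.5 GB of input
print(estimate_reducers(5_000_000_000_000))  # 5 TB of input hits the cap
```

Lowering bytes_per_reducer increases parallelism for the same input, while max_reducers bounds it regardless of input size.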
To explicitly set the number of reducers you want for each job, you can use a
PARALLEL clause for operators that run in the reduce phase. These include all the
grouping and joining operators (GROUP, COGROUP, JOIN, CROSS), as well as
DISTINCT and ORDER. The following line sets the number of reducers to 30 for the
GROUP:
grouped_records = GROUP records BY year PARALLEL 30;
Alternatively, you can set the default_parallel option, and it will take effect for all
subsequent jobs:
grunt> set default_parallel 30
A good setting for the number of reduce tasks is slightly fewer than the number of
reduce slots in the cluster. The number of map tasks is set by the size of the input
(with one map per HDFS block) and is not affected by the PARALLEL clause.
Parameter Substitution
If you have a Pig script that you run on a regular basis, then it’s quite common to
want to be able to run the same script with different parameters. For example, a
script that runs daily may use the date to determine which input files it runs over. Pig
supports parameter substitution, where parameters in the script are substituted with
values supplied at runtime. Parameters are denoted by identifiers prefixed with a $
character; for example, $input and $output are used in the following script to specify
the input and output paths:
-- max_temp_param.pig
records = LOAD '$input' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
  (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
  MAX(filtered_records.temperature);
STORE max_temp into '$output';
Parameters can be specified when launching Pig, using the -param option, one for
each parameter:
% pig -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
-param output=/tmp/out \
You can also put parameters in a file and pass them to Pig using the -param_file
option. For example, we can achieve the same result as the previous command by
placing the parameter definitions in a file:
# Input file
input=/user/tom/input/ncdc/micro-tab/sample.txt
# Output file
output=/tmp/out
The pig invocation then becomes:
% pig -param_file ch11/src/main/pig/max_temp_param.param \
You can specify multiple parameter files using -param_file repeatedly. You can also
use a combination of -param and -param_file options, and if any parameter is
defined in both a parameter file and on the command line, the last value on the
command line takes precedence.
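To make the substitution and precedence rules concrete, here is a small Python model of them. It is only a sketch: the function names are invented, the parameter-file parsing handles just name=value lines and # comments, and treating an undefined parameter as an error is an assumption of this model.

```python
import re

def parse_param_file(text):
    """Parse name=value lines, skipping blank lines and # comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        params[name.strip()] = value.strip()
    return params

def substitute(script, param_files=(), cli_params=()):
    """Replace each $name in script; later definitions override earlier ones."""
    params = {}
    for text in param_files:            # parameter files are applied first
        params.update(parse_param_file(text))
    for name, value in cli_params:      # then command-line values, in order
        params[name] = value
    # An undefined parameter raises KeyError (an assumption of this sketch).
    return re.sub(r"\$(\w+)", lambda m: params[m.group(1)], script)

param_file = (
    "# Input file\n"
    "input=/user/tom/input/ncdc/micro-tab/sample.txt\n"
    "# Output file\n"
    "output=/tmp/out\n"
)
script = "records = LOAD '$input';\nSTORE records INTO '$output';"
print(substitute(script, [param_file], [("output", "/tmp/other")]))
```

Because the command-line pair is applied after the file, the script is stored to /tmp/other rather than /tmp/out, which mirrors the precedence rule stated above.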
Dynamic parameters
For parameters that are supplied using the -param option, it is easy to make the
value dynamic by running a command or script. Many Unix shells support command
substitution for a command enclosed in backticks, and we can use this to make the
output directory date-based:
% pig -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
-param output=/tmp/`date "+%Y-%m-%d"`/out \
Parameter substitution processing
Parameter substitution occurs as a preprocessing step before the script is run. You
can see the substitutions that the preprocessor made by executing Pig with the
-dryrun option. In dry run mode, Pig performs parameter substitution (and macro
expansion) and generates a copy of the original script with substituted values, but
does not execute the script. You can inspect the generated script and check that the
substitutions look sane (because they are dynamically generated, for example)
before running it in normal mode. At the time of this writing, Grunt does not support
parameter substitution.
In “Information Platforms and the Rise of the Data Scientist,” Jeff Hammerbacher
describes Information Platforms as “the locus of their organization’s efforts to ingest,
process, and generate information,” and how they “serve to accelerate the process
of learning from empirical data.”
One of the biggest ingredients in the Information Platform built by Jeff’s team at
Facebook was Hive, a framework for data warehousing on top of Hadoop. Hive
grew from a need to manage and learn from the huge volumes of data that
Facebook was producing every day from its burgeoning social network. After trying a
few different systems, the team chose Hadoop for storage and processing, since it
was cost-effective and met their scalability needs.
Hive was created to make it possible for analysts with strong SQL skills (but meager
Java programming skills) to run queries on the huge volumes of data that Facebook
stored in HDFS. Today, Hive is a successful Apache project used by many
organizations as a general-purpose, scalable data processing platform.
Of course, SQL isn’t ideal for every big data problem—it’s not a good fit for building
complex machine learning algorithms, for example—but it’s great for many analyses,
and it has the huge advantage of being very well known in the industry. What’s more,
SQL is the lingua franca in business intelligence tools (ODBC is a common bridge,
for example), so Hive is well placed to integrate with these products.
This chapter is an introduction to using Hive. It assumes that you have working
knowledge of SQL and general database architecture; as we go through Hive’s
features, we’ll often compare them to the equivalent in a traditional RDBMS.
Installing Hive
In normal use, Hive runs on your workstation and converts your SQL query into a
series of MapReduce jobs for execution on a Hadoop cluster. Hive organizes data
into tables, which provide a means for attaching structure to data stored in HDFS.
Metadata (such as table schemas) is stored in a database called the metastore.
When starting out with Hive, it is convenient to run the metastore on your local machine. In this configuration, which is the default, the Hive table definitions that you
create will be local to your machine, so you can’t share them with other users.
Installation of Hive is straightforward. Java 6 is a prerequisite; and on Windows, you
will need Cygwin, too. You also need to have the same version of Hadoop installed
locally that your cluster is running. Of course, you may choose to run Hadoop locally,
either in standalone or pseudo-distributed mode, while getting started with Hive.
Which Versions of Hadoop Does Hive Work With?
Any given release of Hive is designed to work with multiple versions of Hadoop.
Generally, Hive works with the latest release of Hadoop, as well as supporting a
number of older versions. For example, Hive 0.5.0 is compatible with versions of
Hadoop between 0.17.x and 0.20.x (inclusive). You don’t need to do anything
special to tell Hive which version of Hadoop you are using, beyond making sure
that the hadoop executable is on the path or setting the HADOOP_HOME
environment variable.
Download a release and unpack the tarball
in a suitable place on your workstation:
% tar xzf hive-x.y.z-dev.tar.gz
It’s handy to put Hive on your path to make it easy to launch:
% export HIVE_INSTALL=/home/tom/hive-x.y.z-dev
% export PATH=$PATH:$HIVE_INSTALL/bin
Now type hive to launch the Hive shell:
% hive
hive>
The Hive Shell
The shell is the primary way that we will interact with Hive, by issuing commands in
HiveQL. HiveQL is Hive’s query language, a dialect of SQL. It is heavily influenced
by MySQL, so if you are familiar with MySQL you should feel at home using Hive.
When starting Hive for the first time, we can check that it is working by listing its
tables: there should be none. The command must be terminated with a semicolon to
tell Hive to execute it:
hive> SHOW TABLES;
Time taken: 10.425 seconds
Like SQL, HiveQL is generally case insensitive (except for string comparisons), so
show tables; works equally well here. The tab key will autocomplete Hive keywords
and functions.
For a fresh install, the command takes a few seconds to run since it is lazily creating
the metastore database on your machine. (The database stores its files in a directory
called metastore_db, which is relative to where you ran the hive command from.)
You can also run the Hive shell in non-interactive mode. The -f option runs the commands in the specified file, script.q, in this example:
% hive -f script.q
For short scripts, you can use the -e option to specify the commands inline, in which
case the final semicolon is not required:
% hive -e 'SELECT * FROM dummy'
Hive history file=/tmp/tom/hive_job_log_tom_201005042112_1906486281.txt
Time taken: 4.734 seconds
It’s useful to have a small table of data to test queries against, such as trying
out functions in SELECT expressions using literal data. Here’s one way of
populating a single row table:
% echo 'X' > /tmp/dummy.txt
% hive -e "CREATE TABLE dummy (value STRING); \
  LOAD DATA LOCAL INPATH '/tmp/dummy.txt' \
  OVERWRITE INTO TABLE dummy"
In both interactive and non-interactive mode, Hive will print information to standard
error—such as the time taken to run a query—during the course of operation. You
can suppress these messages using the -S option at launch time, which has the
effect of only showing the output result for queries:
% hive -S -e 'SELECT * FROM dummy'
Other useful Hive shell features include the ability to run commands on the host operating system by using a ! prefix to the command and the ability to access Hadoop
filesystems using the dfs command.
An Example
Let’s see how to use Hive to run a query on the weather dataset we explored in
earlier chapters. The first step is to load the data into Hive’s managed storage. Here
we’ll have Hive use the local filesystem for storage; later we’ll see how to store
tables in HDFS.
Just like an RDBMS, Hive organizes its data into tables. We create a table to hold
the weather data using the CREATE TABLE statement:
CREATE TABLE records (year STRING, temperature INT, quality INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
The first line declares a records table with three columns: year, temperature, and
quality. The type of each column must be specified, too: here the year is a string,
while the other two columns are integers.
So far, the SQL is familiar. The ROW FORMAT clause, however, is particular to
HiveQL. What this declaration is saying is that each row in the data file is
tab-delimited text. Hive expects there to be three fields in each row, corresponding to the
table columns, with fields separated by tabs, and rows by newlines.
Next we can populate Hive with the data. This is just a small sample, for exploratory
purposes:
LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample.txt'
OVERWRITE INTO TABLE records;
Running this command tells Hive to put the specified local file in its warehouse
directory. This is a simple filesystem operation. There is no attempt, for example, to
parse the file and store it in an internal database format, since Hive does not
mandate any particular file format. Files are stored verbatim: they are not modified
by Hive.
In this example, we are storing Hive tables on the local filesystem (the default
filesystem setting is left at its default value of file:///). Tables are stored as
directories under Hive’s warehouse directory, which is controlled by the
hive.metastore.warehouse.dir property and defaults to /user/hive/warehouse.
Thus, the files for the records table are found in the /user/hive/warehouse/records
directory on the local filesystem:
% ls /user/hive/warehouse/records/
sample.txt
In this case, there is only one file, sample.txt, but in general there can be more, and
Hive will read all of them when querying the table.
The OVERWRITE keyword in the LOAD DATA statement tells Hive to delete any
existing files in the directory for the table. If it is omitted, then the new files are simply
added to the table’s directory (unless they have the same names, in which case they
replace the old files).
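The effect of OVERWRITE on a table’s directory can be modeled with a dictionary that maps file names to contents. This is a toy model of the behavior just described, with a made-up function name; it does not touch a real filesystem or call Hive.

```python
def load_into_table(table_files, new_files, overwrite=False):
    """Return the table directory's contents after a load (toy model).

    table_files and new_files map file names to file contents.
    """
    # With OVERWRITE, existing files in the table directory are discarded.
    result = {} if overwrite else dict(table_files)
    # New files are added; a file with the same name replaces the old one.
    result.update(new_files)
    return result

existing = {"sample.txt": "old rows", "extra.txt": "more rows"}
incoming = {"sample.txt": "new rows"}
print(load_into_table(existing, incoming))                  # without OVERWRITE
print(load_into_table(existing, incoming, overwrite=True))  # with OVERWRITE
```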
Now that the data is in Hive, we can run a query against it:
hive> SELECT year, MAX(temperature)
FROM records
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR
quality = 9)
GROUP BY year;
1949 111
1950 22
This SQL query is unremarkable. It is a SELECT statement with a GROUP BY
clause for grouping rows into years, which uses the MAX() aggregate function to find
the maximum temperature for each year group. But the remarkable thing is that Hive
transforms this query into a MapReduce job, which it executes on our behalf, then
prints the results to the console. There are some nuances such as the SQL
constructs that Hive supports and the format of the data that we can query—and we
shall explore some of these in this chapter—but it is the ability to execute SQL
queries against our raw data that gives Hive its power.
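As a rough mental model of what the query asks for, the same filter, group, and maximum can be written in plain Python. This is not how Hive compiles the query, and the rows below are made up for illustration (they are not the contents of sample.txt); they are chosen only so that the result matches the output shown above.

```python
from collections import defaultdict

# Made-up (year, temperature, quality) rows, for illustration only.
rows = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
    ("1949", 9999, 1),   # missing reading, removed by the WHERE clause
    ("1950", 300, 2),    # quality code not in the accepted set
]

def max_temperature_by_year(rows):
    grouped = defaultdict(list)
    # Filter step: apply the WHERE clause and collect temperatures per year.
    for year, temperature, quality in rows:
        if temperature != 9999 and quality in (0, 1, 4, 5, 9):
            grouped[year].append(temperature)
    # Aggregate step: one MAX() per year group.
    return {year: max(temps) for year, temps in sorted(grouped.items())}

print(max_temperature_by_year(rows))
```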
Running Hive
In this section, we look at some more practical aspects of running Hive, including
how to set up Hive to run against a Hadoop cluster and a shared metastore. In doing
so, we’ll see Hive’s architecture in some detail.
Configuring Hive
Hive is configured using an XML configuration file like Hadoop’s. The file is called
hive-site.xml and is located in Hive’s conf directory. This file is where you can set
properties that you want to set every time you run Hive. The same directory
contains hive-default.xml, which documents the properties that Hive exposes and
their default values.
You can override the directory in which Hive looks for hive-site.xml by
passing the --config option to the hive command:
% hive --config /Users/tom/dev/hive-conf
Note that this option specifies the containing directory, not hive-site.xml itself. It can
be useful if you have multiple site files—for different clusters, say—that you switch
between on a regular basis. Alternatively, you can set the HIVE_CONF_DIR
environment variable to the configuration directory, for the same effect.
The hive-site.xml file is a natural place to put the cluster connection details: you can
specify the filesystem and jobtracker using the usual Hadoop properties, including
mapred.job.tracker (see Appendix A for more details on configuring Hadoop). If not
set, they default to the local filesystem and the local (in-process) job runner, just as
they do in Hadoop, which is very handy when trying out Hive on small trial datasets.
Metastore configuration settings are commonly found in hive-site.xml, too.
Hive also permits you to set properties on a per-session basis, by passing the
-hiveconf option to the hive command. For example, the following command sets the
cluster (to a pseudo-distributed cluster) for the duration of the session:
% hive -hiveconf -hiveconf
If you plan to have more than one Hive user sharing a Hadoop cluster, then
you need to make the directories that Hive uses writable by all users. The
following commands will create the directories and set their permissions
appropriately:
hadoop fs -mkdir /tmp
hadoop fs -chmod a+w /tmp
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod a+w /user/hive/warehouse
If all users are in the same group, then permissions g+w are sufficient on
the warehouse directory.
You can change settings from within a session, too, using the SET command. This
is useful for changing Hive or MapReduce job settings for a particular query. For
example, the following command ensures buckets are populated according to the
table definition:
hive> SET hive.enforce.bucketing=true;
To see the current value of any property, use SET with just the property name:
hive> SET hive.enforce.bucketing;
By itself, SET will list all the properties (and their values) set by Hive. Note that the
list will not include Hadoop defaults, unless they have been explicitly overridden in
one of the ways covered in this section. Use SET -v to list all the properties in the
system, including Hadoop defaults.
There is a precedence hierarchy to setting properties. In the following list, lower
numbers take precedence over higher numbers:
1. The Hive SET command
2. The command line -hiveconf option
3. hive-site.xml
4. hive-default.xml
5. hadoop-site.xml (or, equivalently, core-site.xml, hdfs-site.xml, and mapred-site.xml)
6. hadoop-default.xml (or, equivalently, core-default.xml, hdfs-default.xml, and
mapred-default.xml)
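A layered lookup like this can be modeled in Python with collections.ChainMap, which searches its mappings in order and returns the first match. The sketch below covers only the Hive layers, and every property value in it is illustrative rather than taken from a real installation.

```python
from collections import ChainMap

# Layers listed from highest to lowest precedence; all values are illustrative.
set_command = {"hive.enforce.bucketing": "true"}
hiveconf_option = {
    "hive.root.logger": "DEBUG,console",
    "hive.enforce.bucketing": "false",
}
hive_site = {"hive.metastore.warehouse.dir": "/data/hive/warehouse"}  # hypothetical path
hive_default = {
    "hive.metastore.warehouse.dir": "/user/hive/warehouse",
    "hive.enforce.bucketing": "false",
}

# ChainMap consults the first mapping first, so earlier layers win.
effective = ChainMap(set_command, hiveconf_option, hive_site, hive_default)

for name in sorted(effective):
    print(name, "=", effective[name])
```

Here the value from SET wins for hive.enforce.bucketing even though two lower layers also define it, and the site file overrides the default warehouse directory.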
You can find Hive’s error log on the local file system at /tmp/$USER/hive.log. It can
be very useful when trying to diagnose configuration problems or other types of
error. Hadoop’s MapReduce task logs are also a useful source for troubleshooting;
see “Hadoop Logs” for where to find them.
The logging configuration is in conf/, and you can edit this file to
change log levels and other logging-related settings. Often though, it’s more
convenient to set logging configuration for the session. For example, the following
handy invocation will send debug messages to the console:
% hive -hiveconf hive.root.logger=DEBUG,console
Hive Services
The Hive shell is only one of several services that you can run using the hive
command. You can specify the service to run using the --service option. Type
hive --service help to get a list of available service names; the most useful are
described below.
cli
The command line interface to Hive (the shell). This is the default service.
hiveserver
Runs Hive as a server exposing a Thrift service, enabling access from a range of
clients written in different languages. Applications using the Thrift, JDBC, and
ODBC connectors need to run a Hive server to communicate with Hive. Set the
HIVE_PORT environment variable to specify the port the server will listen on
(defaults to 10,000).
hwi
The Hive Web Interface. See “The Hive Web Interface (HWI)”.
jar
The Hive equivalent to hadoop jar, a convenient way to run Java applications that
includes both Hadoop and Hive classes on the classpath.
metastore
By default, the metastore is run in the same process as the Hive service. Using
this service, it is possible to run the metastore as a standalone (remote) process.
Set the METASTORE_PORT environment variable to specify the port the server
will listen on.
The Hive Web Interface (HWI)
As an alternative to the shell, you might want to try Hive’s simple web interface.
Start it using the following commands:
export ANT_LIB=/path/to/ant/lib
hive --service hwi
(You only need to set the ANT_LIB environment variable if Ant’s library is not
found in /opt/ant/lib on your system.) Then navigate to http://localhost:9999/hwi
in your browser. From there, you can browse Hive database schemas and
create sessions for issuing commands and queries.
It’s possible to run the web interface as a shared service to give users within an
organization access to Hive without having to install any client software. There
are more details on the Hive Web Interface on the Hive wiki at ence/display/Hive/HiveWebInterface.
Hive clients
If you run Hive as a server (hive --service hiveserver), then there are a number of
different mechanisms for connecting to it from applications. The relationship between
Hive clients and Hive services is illustrated in Figure 12-1.
Figure 12-1. Hive architecture
Thrift Client
The Hive Thrift Client makes it easy to run Hive commands from a wide range of
programming languages. Thrift bindings for Hive are available for C++, Java,
PHP, Python, and Ruby. They can be found in the src/service/src subdirectory in
the Hive distribution.
JDBC Driver
Hive provides a Type 4 (pure Java) JDBC driver, defined in the class
org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC URI of
the form jdbc:hive://host:port/dbname, a Java application will connect to a Hive
server running in a separate process at the given host and port. (The driver
makes calls to an interface implemented by the Hive Thrift Client using the Java
Thrift bindings.) You may alternatively choose to connect to Hive via JDBC in
embedded mode using the URI jdbc:hive://. In this mode, Hive runs in the same
JVM as the application invoking it, so there is no need to launch it as a
standalone server since it does not use the Thrift service or the Hive Thrift Client.
ODBC Driver
The Hive ODBC Driver allows applications that support the ODBC protocol to
connect to Hive. (Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.) The ODBC driver is still in development, so you
should refer to the latest instructions on the Hive wiki for how to build and run it.
The Metastore
The metastore is the central repository of Hive metadata. The metastore is divided
into two pieces: a service and the backing store for the data. By default, the
metastore service runs in the same JVM as the Hive service and contains an
embedded Derby database instance backed by the local disk. This is called the
embedded metastore configuration (see Figure 12-2).
Using an embedded metastore is a simple way to get started with Hive; however,
only one embedded Derby database can access the database files on disk at any
one time, which means you can only have one Hive session open at a time that
shares the same metastore. Trying to start a second session gives the error:
Failed to start database 'metastore_db'
when it attempts to open a connection to the metastore. The solution to supporting
multiple sessions (and therefore multiple users) is to use a standalone database.
This configuration is referred to as a local metastore, since the metastore service still
runs in the same process as the Hive service, but connects to a database running in
a separate process, either on the same machine or on a remote machine. Any
JDBC-compliant database may be used by setting the javax.jdo.option.*
configuration properties listed in Table 12-1.
MySQL is a popular choice for the standalone metastore. In this case,
javax.jdo.option.ConnectionDriverName is set to com.mysql.jdbc.Driver. (The user
name and password should be set, too, of course.) The JDBC driver JAR file for
MySQL (Connector/J) must be on Hive’s classpath, which is simply achieved by
placing it in Hive’s lib directory.
Figure 12-2. Metastore configurations
Going a step further, there’s another metastore configuration called a remote metastore, where one or more metastore servers run in separate processes to the Hive
service. This brings better manageability and security, since the database tier can be
completely firewalled off, and the clients no longer need the database credentials.
A Hive service is configured to use a remote metastore by setting
hive.metastore.local to false, and hive.metastore.uris to the metastore server URIs, separated
by commas if there is more than one. Metastore server URIs are of the form thrift://
host:port, where the port corresponds to the one set by METASTORE_PORT when
starting the metastore server (see “Hive Services”).
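The shape of a hive.metastore.uris value can be illustrated with a short Python sketch that splits the comma-separated list, checks each entry against the thrift://host:port form, and cycles through the servers in turn. The host names and the port number are hypothetical, and the helper is not part of Hive.

```python
from itertools import cycle
from urllib.parse import urlsplit

def parse_metastore_uris(value):
    """Split a comma-separated URI list into (host, port) pairs."""
    servers = []
    for uri in value.split(","):
        parts = urlsplit(uri.strip())
        if parts.scheme != "thrift" or parts.hostname is None or parts.port is None:
            raise ValueError(f"expected thrift://host:port, got {uri!r}")
        servers.append((parts.hostname, parts.port))
    return servers

# Hypothetical hosts and port, for illustration only.
servers = parse_metastore_uris("thrift://meta1:9083, thrift://meta2:9083")
next_server = cycle(servers)   # simple round-robin over the configured servers
print([next(next_server) for _ in range(3)])
```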
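Table 12-1. Important metastore configuration properties
hive.metastore.warehouse.dir
Default value: /user/hive/warehouse
The directory, relative to the default filesystem, where managed tables are stored.
hive.metastore.local
Type: boolean
Default value: true
Whether to use an embedded metastore server (true), or connect to a remote
instance (false). If false, then hive.metastore.uris must be set.
hive.metastore.uris
Type: comma-separated URIs
Default value: Not set
The URIs specifying the remote metastore servers to connect to. Clients connect in a
round-robin fashion if there are multiple remote servers.
javax.jdo.option.ConnectionURL
Default value: jdbc:derby:;databa
The JDBC URL of the metastore database.
javax.jdo.option.ConnectionDriverName
Default value: org.apache.derby.
The JDBC driver classname.
javax.jdo.option.ConnectionUserName
The JDBC user name.
javax.jdo.option.ConnectionPassword
The JDBC password.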
Comparison with Traditional Databases
While Hive resembles a traditional database in many ways (such as supporting an
SQL interface), its HDFS and MapReduce underpinnings mean that there are a
number of architectural differences that directly influence the features that Hive
supports, which in turn affects the uses that Hive can be put to.
Schema on Read Versus Schema on Write
In a traditional database, a table’s schema is enforced at data load time. If the data
being loaded doesn’t conform to the schema, then it is rejected. This design is
sometimes called schema on write, since the data is checked against the schema
when it is written into the database.
Hive, on the other hand, doesn’t verify the data when it is loaded, but rather when a
query is issued. This is called schema on read. There are trade-offs between the two
approaches. Schema on read makes for a very fast initial load, since the data does
not have to be read, parsed, and serialized to disk in the database’s internal format.
The load operation is just a file copy or move. It is more flexible, too: consider having
two schemas for the same underlying data, depending on the analysis being
performed.
Schema on write makes query time performance faster, since the database can
index columns and perform compression on the data. The trade-off, however, is that
it takes longer to load data into the database. Furthermore, there are many scenarios
where the schema is not known at load time, so there are no indexes to apply, since
the queries have not been formulated yet. These scenarios are where Hive shines.
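The contrast between the two approaches can be sketched in Python. In this toy model, schema on write parses and validates every row at load time, while schema on read stores the lines untouched and applies the schema only when queried. All function names are invented, and representing a whole non-conforming row as None is a simplification made for the sketch.

```python
SCHEMA = [("year", str), ("temperature", int), ("quality", int)]

def parse_row(line):
    """Apply SCHEMA to one tab-delimited line; raise ValueError if it does not fit."""
    fields = line.split("\t")
    if len(fields) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} fields, got {len(fields)}")
    return {name: convert(raw) for (name, convert), raw in zip(SCHEMA, fields)}

def load_schema_on_write(lines):
    """Validate every row at load time; one bad row rejects the load."""
    return [parse_row(line) for line in lines]

def load_schema_on_read(lines):
    """Loading is just a copy; nothing is parsed yet."""
    return list(lines)

def query_schema_on_read(stored_lines):
    """Parse at query time; rows that do not fit are yielded as None."""
    for line in stored_lines:
        try:
            yield parse_row(line)
        except ValueError:
            yield None

lines = ["1950\t22\t1", "1949\tn/a\t1"]
stored = load_schema_on_read(lines)          # always succeeds
print(list(query_schema_on_read(stored)))    # the problem surfaces at query time
try:
    load_schema_on_write(lines)
except ValueError as error:
    print("load rejected:", error)           # the problem surfaces at load time
```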
Updates, Transactions, and Indexes
Updates, transactions, and indexes are mainstays of traditional databases. Yet, until
recently, these features have not been considered a part of Hive’s feature set. This is
because Hive was built to operate over HDFS data using MapReduce, where
full-table scans are the norm and a table update is achieved by transforming the data
into a new table. For a data warehousing application that runs over large portions of
the dataset, this works well. However, there are workloads where updates (or insert
appends, at least) are needed, or where indexes would yield significant performance
gains. On the transactions front, Hive doesn’t define clear semantics for concurrent
access to tables, which means applications need to build their own application-level
concurrency or locking mechanism. The Hive team is actively working on
improvements in all these areas. Change is also coming from another direction:
HBase integration. HBase (Chapter 13) has different storage characteristics to
HDFS, such as the ability to do row updates and column indexing, so we can expect
to see these features used by Hive in future releases. It is already possible to access
HBase tables from Hive.
Data Types
Hive supports both primitive and complex data types. Primitives include numeric,
boolean, string, and timestamp types. The complex data types include arrays, maps,
and structs. Hive’s data types are listed in Table 12-3. Note that the literals shown
are those used from within HiveQL; they are not the serialized form used in the
table’s storage format (see “Storage Formats” ).
Table 12-3. Hive data types
Category | Type | Description | Literal examples
Primitive | TINYINT | 1-byte (8-bit) signed integer, from -128 to 127 |
Primitive | SMALLINT | 2-byte (16-bit) signed integer, from -32,768 to 32,767 |
Primitive | INT | 4-byte (32-bit) signed integer, from -2,147,483,648 to 2,147,483,647 |
Primitive | BIGINT | 8-byte (64-bit) signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
Primitive | FLOAT | 4-byte (32-bit) single-precision floating-point number |
Primitive | DOUBLE | 8-byte (64-bit) double-precision floating-point number |
Primitive | BOOLEAN | true/false value |
Primitive | STRING | Character string | 'a', "a"
Primitive | BINARY | Byte array | Not supported
Primitive | TIMESTAMP | Timestamp with nanosecond precision | 1325502245000, '2012-01-02
Complex | ARRAY | An ordered collection of fields. The fields must all be of the same type. | array(1, 2) (a)
Complex | MAP | An unordered collection of key-value pairs. Keys must be primitives; values may be any type. For a particular map, the keys must be the same type, and the values must be the same type. | map('a', 1, 'b', 2)
Complex | STRUCT | A collection of named fields. The fields may be of different types. | struct('a', 1, 1.0) (b)
(a) The literal forms for arrays, maps, and structs are provided as functions. That is,
array(), map(), and struct() are built-in Hive functions.
(b) The columns are named col1, col2, col3, etc.
Primitive types
Hive’s primitive types correspond roughly to Java’s, although some names are
influenced by MySQL’s type names (some of which, in turn, overlap with SQL-92). There
are four signed integral types: TINYINT, SMALLINT, INT, and BIGINT, which are
equivalent to Java’s byte, short, int, and long primitive types, respectively; they are
1-byte, 2-byte, 4-byte, and 8-byte signed integers.
Hive’s floating-point types, FLOAT and DOUBLE, correspond to Java’s float and
double, which are 32-bit and 64-bit floating-point numbers. Unlike some databases,
there is no option to control the number of significant digits or decimal places stored
for floating-point values. Hive supports a BOOLEAN type for storing true and false
values.
There is a single Hive data type for storing text, STRING, which is a variable-length
character string. Hive’s STRING type is like VARCHAR in other databases, although
there is no declaration of the maximum number of characters to store with STRING.
(The theoretical maximum size STRING that may be stored is 2GB, although in
practice it may be inefficient to materialize such large values. Sqoop has large object
support, see “Importing Large Objects”). The BINARY data type is for storing
variable-length binary data. The TIMESTAMP data type stores timestamps with
nanosecond precision. Hive comes with UDFs for converting between Hive
timestamps, Unix timestamps (seconds since the Unix epoch), and strings, which
makes most common date operations tractable. TIMESTAMP does not encapsulate
a timezone; however, the to_utc_timestamp and from_utc_timestamp functions make
it possible to do timezone conversions.
Primitive types form a hierarchy, which dictates the implicit type conversions that
Hive will perform. For example, a TINYINT will be converted to an INT, if an
expression expects an INT; however, the reverse conversion will not occur and Hive
will return an error unless the CAST operator is used.
The implicit conversion rules can be summarized as follows. Any integral numeric
type can be implicitly converted to a wider type. All the integral numeric types,
FLOAT, and (perhaps surprisingly) STRING can be implicitly converted to DOUBLE.
TINYINT, SMALLINT, and INT can all be converted to FLOAT. BOOLEAN types
cannot be converted to any other type.
You can perform explicit type conversion using CAST. For example, CAST('1' AS
INT) will convert the string '1' to the integer value 1. If the cast fails—as it does in
CAST('X' AS INT), for example—then the expression returns NULL.
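The rules in the last few paragraphs can be written down as a small Python predicate, which makes it easy to check a particular pair of types. This sketch encodes only the rules as summarized here; the function names are made up, and cast_to_int uses Python’s own string parsing, so it is meant to match the two CAST examples above rather than Hive’s behavior for every input.

```python
INTEGRAL = ["TINYINT", "SMALLINT", "INT", "BIGINT"]   # narrowest to widest

def can_convert_implicitly(source, target):
    """Return True if the summarized rules allow an implicit source -> target conversion."""
    if source == target:
        return True
    if source == "BOOLEAN":
        return False                      # BOOLEAN converts to nothing else
    if source in INTEGRAL and target in INTEGRAL:
        # Integral types widen but never narrow.
        return INTEGRAL.index(source) < INTEGRAL.index(target)
    if target == "DOUBLE":
        return source in INTEGRAL or source in ("FLOAT", "STRING")
    if target == "FLOAT":
        return source in ("TINYINT", "SMALLINT", "INT")
    return False

def cast_to_int(value):
    """Model CAST(value AS INT) for strings; None stands in for NULL on failure."""
    try:
        return int(value)
    except ValueError:
        return None

print(can_convert_implicitly("TINYINT", "INT"))    # widening is allowed
print(can_convert_implicitly("INT", "TINYINT"))    # narrowing needs CAST
print(cast_to_int("1"), cast_to_int("X"))
```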
Complex types
Hive has three complex types: ARRAY, MAP, and STRUCT. ARRAY and MAP are
like their namesakes in Java, while a STRUCT is a record type which encapsulates a
set of named fields. Complex types permit an arbitrary level of nesting. Complex type
declarations must specify the type of the fields in the collection, using an angled
bracket notation, as illustrated in this table definition which has three columns, one
for each complex type:
CREATE TABLE complex (
  col1 ARRAY<INT>,
  col2 MAP<STRING, INT>,
  col3 STRUCT<a:STRING, b:INT, c:DOUBLE>
);
If we load the table with one row of data for ARRAY, MAP, and STRUCT shown in
the “Literal examples” column in Table 12-3, then the following query demonstrates
the field accessor operators for each type:
hive> SELECT col1[0], col2['b'], col3.c FROM complex;
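For readers more used to a general-purpose language, the three accessors have close analogues in Python: list indexing, dictionary lookup, and attribute access on a named tuple. The sketch below builds one row from the literal examples; the struct field names a and b are assumed for illustration, since only the field c appears in the query.

```python
from collections import namedtuple

# Field names a and b are assumptions; only col3.c is used in the query.
Col3 = namedtuple("Col3", ["a", "b", "c"])

row = {
    "col1": [1, 2],               # array(1, 2)
    "col2": {"a": 1, "b": 2},     # map('a', 1, 'b', 2)
    "col3": Col3("a", 1, 1.0),    # struct('a', 1, 1.0)
}

# Python analogue of SELECT col1[0], col2['b'], col3.c
result = (row["col1"][0], row["col2"]["b"], row["col3"].c)
print(result)
```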
Operators and Functions
The usual set of SQL operators is provided by Hive: relational operators (such as x =
'a' for testing equality, x IS NULL for testing nullity, x LIKE 'a%' for pattern matching),
arithmetic operators (such as x + 1 for addition), and logical operators (such as x OR
y for logical OR). The operators match those in MySQL, which deviates from SQL-92
since || is logical OR, not string concatenation. Use the concat function for the latter
in both MySQL and Hive.
Hive comes with a large number of built-in functions (too many to list here), divided
into categories including mathematical and statistical functions, string functions, date
functions (for operating on string representations of dates), conditional functions,
aggregate functions, and functions for working with XML (using the xpath function) and
JSON.
You can retrieve a list of functions from the Hive shell by typing SHOW
FUNCTIONS. To get brief usage instructions for a particular function, use the
DESCRIBE command:
hive> DESCRIBE FUNCTION length;
length(str) - Returns the length of str
In the case when there is no built-in function that does what you want, you can write
your own.
A Hive table is logically made up of the data being stored and the associated
metadata describing the layout of the data in the table. The data typically resides in
HDFS, although it may reside in any Hadoop filesystem, including the local
filesystem or S3. Hive stores the metadata in a relational database—and not in
HDFS. In this section, we shall look in more detail at how to create tables, the
different physical storage formats that Hive offers, and how to import data into them.
Multiple Database/Schema Support
Many relational databases have a facility for multiple namespaces, which allow
users and applications to be segregated into different databases or schemas. Hive
supports the same facility, and provides commands such as CREATE DATABASE
dbname, USE dbname, and DROP DATABASE dbname. You can fully qualify a
table by writing dbname.tablename. If no database is specified, tables belong to
the default database.
Managed Tables and External Tables
When you create a table in Hive, by default Hive will manage the data, which means
that Hive moves the data into its warehouse directory. Alternatively, you may create
an external table, which tells Hive to refer to the data that is at an existing location
outside the warehouse directory.
The difference between the two types of table is seen in the LOAD and DROP
semantics. Let’s consider a managed table first.
When you load data into a managed table, it is moved into Hive’s warehouse
directory. For example:
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table;
will move the file hdfs://user/tom/data.txt into Hive’s warehouse directory for the
managed_table table, which is hdfs://user/hive/warehouse/managed_table.
If the table is later dropped, using:
DROP TABLE managed_table;
then the table, including its metadata and its data, is deleted. It bears repeating that
since the initial LOAD performed a move operation, and the DROP performed a
delete operation, the data no longer exists anywhere. This is what it means for Hive
to manage the data.
An external table behaves differently. You control the creation and deletion of the
data. The location of the external data is specified at table creation time:
CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
With the EXTERNAL keyword, Hive knows that it is not managing the data, so it
doesn’t move it to its warehouse directory. Indeed, it doesn’t even check if the
external location exists at the time it is defined. This is a useful feature, since it
means you can create the data lazily after creating the table.
When you drop an external table, Hive will leave the data untouched and only delete
the metadata.
So how do you choose which type of table to use? In most cases, there is not much
difference between the two (except of course for the difference in DROP semantics),
so it is just a matter of preference. As a rule of thumb, if you are doing all your
processing with Hive, then use managed tables, but if you wish to use Hive and
other tools on the same dataset, then use external tables. A common pattern is to
use an external table to access an initial dataset stored in HDFS (created by another
process), then use a Hive transform to move the data into a managed Hive table.
This works the other way around, too—an external table (not necessarily on HDFS)
can be used to export data from Hive for other applications to use.
Another reason for using external tables is when you wish to associate multiple
schemas with the same dataset.
Partitions and Buckets
Hive organizes tables into partitions, a way of dividing a table into coarse-grained
parts based on the value of a partition column, such as date. Using partitions can
make it faster to do queries on slices of the data.
Tables or partitions may further be subdivided into buckets, to give extra structure
to the data that may be used for more efficient queries. For example, bucketing by
user ID means we can quickly evaluate a user-based query by running it on a
randomized sample of the total set of users.
To take an example where partitions are commonly used, imagine log files where
each record includes a timestamp. If we partitioned by date, then records for the
same date would be stored in the same partition. The advantage to this scheme is
that queries that are restricted to a particular date or set of dates can be answered
much more efficiently since they only need to scan the files in the partitions that the
query pertains to. Notice that partitioning doesn’t preclude more wide-ranging
queries: it is still feasible to query the entire dataset across many partitions.
A table may be partitioned in multiple dimensions. For example, in addition to partitioning logs by date, we might also subpartition each date partition by country to
permit efficient queries by location.
Note: You can also use INSERT OVERWRITE DIRECTORY to export data to a Hadoop
filesystem, but unlike external tables you cannot control the output format, which is
Control-A separated text files. Complex data types are serialized using a JSON
representation.
Partitions are defined at table creation time using the PARTITIONED BY clause,
which takes a list of column definitions. (Partitions may, however, be added to or
removed from a table after creation using an ALTER TABLE statement.) For the
hypothetical log files example, we might define a table with records comprising a
timestamp and the log line itself:
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
When we load data into a partitioned table, the partition values are specified
explicitly:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs
PARTITION (dt='2001-01-01', country='GB');
At the filesystem level, partitions are simply nested subdirectories of the table
directory. After loading a few more files into the logs table, the directory structure
might look like this:
/user/hive/warehouse/logs/dt=2001-01-01/country=GB/file1
/user/hive/warehouse/logs/dt=2001-01-01/country=GB/file2
/user/hive/warehouse/logs/dt=2001-01-01/country=US/file3
/user/hive/warehouse/logs/dt=2001-01-02/country=GB/file4
/user/hive/warehouse/logs/dt=2001-01-02/country=US/file5
/user/hive/warehouse/logs/dt=2001-01-02/country=US/file6
The logs table has two date partitions, 2001-01-01 and 2001-01-02, corresponding to
subdirectories called dt=2001-01-01 and dt=2001-01-02; and two country subpartitions,
GB and US, corresponding to nested subdirectories called country=GB and
country=US. The data files reside in the leaf directories.
We can ask Hive for the partitions in a table using SHOW PARTITIONS:
hive> SHOW PARTITIONS logs;
dt=2001-01-01/country=GB
dt=2001-01-01/country=US
dt=2001-01-02/country=GB
dt=2001-01-02/country=US
One thing to bear in mind is that the column definitions in the PARTITIONED BY
clause are full-fledged table columns, called partition columns; however, the data
files do not contain values for these columns since they are derived from the
directory names.
You can use partition columns in SELECT statements in the usual way. Hive
performs input pruning to scan only the relevant partitions. For example:
SELECT ts, dt, line FROM logs WHERE country='GB';
will only scan file1, file2, and file4. Notice, too, that the query returns the values of
the dt partition column, which Hive reads from the directory names since they are not
in the data files.
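The idea of input pruning can be illustrated outside Hive. The following Python sketch is only a model, not Hive's implementation: the file list, the warehouse path, and the helper names partition_values and prune are assumptions made for illustration. It reads the partition column values from the key=value directory names and keeps only the files whose partition matches an equality predicate.

```python
# Hypothetical layout of a table partitioned by dt and then by country.
FILES = [
    "/user/hive/warehouse/logs/dt=2001-01-01/country=GB/file1",
    "/user/hive/warehouse/logs/dt=2001-01-01/country=GB/file2",
    "/user/hive/warehouse/logs/dt=2001-01-01/country=US/file3",
    "/user/hive/warehouse/logs/dt=2001-01-02/country=GB/file4",
    "/user/hive/warehouse/logs/dt=2001-01-02/country=US/file5",
    "/user/hive/warehouse/logs/dt=2001-01-02/country=US/file6",
]


def partition_values(path):
    """Read partition column values from the key=value directory names."""
    values = {}
    for part in path.split("/"):
        if "=" in part:
            key, value = part.split("=", 1)
            values[key] = value
    return values


def prune(paths, **predicate):
    """Keep only the files whose partition satisfies every equality test."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in predicate.items())
    ]


selected = prune(FILES, country="GB")
print([p.rsplit("/", 1)[1] for p in selected])
```

Only the selected files would need to be scanned; the values of dt and country come from the path, never from the file contents.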
There are two reasons why you might want to organize your tables (or partitions) into
buckets. The first is to enable more efficient queries. Bucketing imposes extra
structure on the table, which Hive can take advantage of when performing certain
queries. In particular, a join of two tables that are bucketed on the same columns—
which include the join columns—can be efficiently implemented as a map-side join.
The second reason to bucket a table is to make sampling more efficient. When
working with large datasets, it is very convenient to try out queries on a fraction of
your dataset while you are in the process of developing or refining them. We shall
see how to do efficient sampling at the end of this section.
First, let’s see how to tell Hive that a table should be bucketed. We use the
CLUSTERED BY clause to specify the columns to bucket on and the number of
buckets:
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
Here we are using the user ID to determine the bucket (which Hive does by hashing
the value and reducing modulo the number of buckets), so any particular bucket will
effectively have a random set of users in it.
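The bucket assignment just described can be sketched in a few lines of Python. This is an illustration rather than Hive's code: the function name bucket_for is hypothetical, and masking the hash to keep it non-negative is an assumption about how negative values are treated. The user rows are the ones that appear in this section's sample output.

```python
NUM_BUCKETS = 4


def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    """Bucket index for an INT key: hash is the integer itself, modulo the bucket count."""
    # Assumption: the hash is masked to a non-negative 31-bit value first.
    return (user_id & 0x7FFFFFFF) % num_buckets


users = [(0, "Nat"), (2, "Joe"), (4, "Ann")]
buckets = {b: [] for b in range(NUM_BUCKETS)}
for user_id, name in users:
    buckets[bucket_for(user_id)].append((user_id, name))

for index in sorted(buckets):
    print(index, buckets[index])
```

With four buckets, IDs 0 and 4 land in the same bucket, which is why they appear together when the first bucket is read.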
In the map-side join case, where the two tables are bucketed in the same way, a
mapper processing a bucket of the left table knows that the matching rows in the
right table are in its corresponding bucket, so it need only retrieve that bucket (which
is a small fraction of all the data stored in the right table) to effect the join. This
optimization works, too, if the number of buckets in the two tables are multiples of
each other—they do not have to have exactly the same number of buckets. The
HiveQL for joining two bucketed tables is shown in “Map joins”.
The data within a bucket may additionally be sorted by one or more columns. This
allows even more efficient map-side joins, since the join of each bucket becomes an
efficient merge-sort. The syntax for declaring that a table has sorted buckets is:
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
How can we make sure the data in our table is bucketed? While it’s possible to load
data generated outside Hive into a bucketed table, it’s often easier to get Hive to do
the bucketing, usually from an existing table. Hive does not check that the buckets in
the data files on disk are consistent with the buckets in the table definition (either in
number, or on the basis of bucketing columns). If there is a mismatch, then you may
get an error or undefined behavior at query time. For this reason, it is advisable to
get Hive to perform the bucketing.
Take an unbucketed users table:
hive> SELECT * FROM users;
0 Nat
To populate the bucketed table, we need to set the hive.enforce.bucketing property
to true, so that Hive knows to create the number of buckets declared in the table
definition. Then it is a matter of just using the INSERT command:
INSERT OVERWRITE TABLE bucketed_users
SELECT * FROM users;
Physically, each bucket is just a file in the table (or partition) directory. The file name
is not important, but bucket n is the nth file, when arranged in lexicographic order. In
fact, buckets correspond to MapReduce output file partitions: a job will produce as
many buckets (output files) as reduce tasks. We can see this by looking at the layout
of the bucketed_users table we just created. Running this command:
hive> dfs -ls /user/hive/warehouse/bucketed_users;
shows that four files were created, with the following names (the name is generated
by Hive and incorporates a timestamp, so it will change from run to run):
The first bucket contains the users with IDs 0 and 4, since for an INT the hash is the
integer itself, and the value is reduced modulo the number of buckets (4 in this case):
hive> dfs -cat /user/hive/warehouse/bucketed_users/*0_0;
We can see the same thing by sampling the table using the TABLESAMPLE clause,
which restricts the query to a fraction of the buckets in the table rather than the
whole table:
hive> SELECT * FROM bucketed_users
> TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
0 Nat
4 Ann
Bucket numbering is 1-based, so this query retrieves all the users from the first of
four buckets. For a large, evenly distributed dataset, approximately one quarter of
the table’s rows would be returned. It’s possible to sample a number of buckets by
specifying a different proportion (which need not be an exact multiple of the number
of buckets, since sampling is not intended to be a precise operation). For example,
this query returns half of the buckets:
hive> SELECT * FROM bucketed_users
> TABLESAMPLE(BUCKET 1 OUT OF 2 ON id);
0 Nat
4 Ann
2 Joe
Sampling a bucketed table is very efficient, since the query only has to read the
buckets that match the TABLESAMPLE clause. Contrast this with sampling a nonbucketed table, using the rand() function, where the whole input dataset is scanned,
even if a very small sample is needed:
hive> SELECT * FROM users
> TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand());
2 Joe
Storage Formats
There are two dimensions that govern table storage in Hive: the row format and the
file format. The row format dictates how rows, and the fields in a particular row, are
stored. In Hive parlance, the row format is defined by a SerDe, a portmanteau word
for a Serializer-Deserializer.
When acting as a deserializer, which is the case when querying a table, a SerDe will
deserialize a row of data from the bytes in the file to objects used internally by Hive
to operate on that row of data. When used as a serializer, which is the case when
performing an INSERT or CTAS (see “Importing Data”), the table’s SerDe will
serialize Hive’s internal representation of a row of data into the bytes that are written
to the output file.
The file format dictates the container format for fields in a row. The simplest format is
a plain text file, but there are row-oriented and column-oriented binary formats available, too.
The default storage format: Delimited text
When you create a table with no ROW FORMAT or STORED AS clauses, the
default format is delimited text, with a row per line. The default field delimiter is not a
tab character, but the Control-A character from the set of ASCII control codes (it has
ASCII code 1). The choice of Control-A, sometimes written as ^A in documentation,
came about since it is less likely to be a part of the field text than a tab character.
There is no means for escaping delimiter characters in Hive, so it is important to
choose ones that don’t occur in data fields. The default collection item delimiter is a
Control-B character, used to delimit items in an ARRAY or STRUCT, or key-value
pairs in a MAP. The default map key delimiter is a Control-C character, used to
delimit the key and value in a MAP. Rows in a table are delimited by a newline
character. The preceding description of delimiters is correct for the usual case of flat
data structures, where the complex types only contain primitive types. For nested
types, however, this isn’t the whole story, and in fact the level of the nesting
determines the delimiter.
For an array of arrays, for example, the delimiters for the outer array are
Control-B characters, as expected, but for the inner array they are Control-C
characters, the next delimiter in the list. If you are unsure which delimiters
Hive uses for a particular nested structure, you can run a command like:
CREATE TABLE nested
AS
SELECT array(array(1, 2), array(3, 4))
FROM dummy;
then use hexdump, or similar, to examine the delimiters in the output file. Hive
actually supports eight levels of delimiters, corresponding to ASCII codes 1, 2, ... 8,
but you can only override the first three.
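To make the default delimiters concrete, here is a small Python sketch that writes and reads one line of delimited text for a flat row. It is an assumption-laden illustration and not how LazySimpleSerDe is implemented: the column layout (a STRING, an ARRAY of STRING, and a MAP of STRING to STRING), the sample values, and the function names are all hypothetical.

```python
FIELD = "\x01"    # Control-A separates fields in a row
ITEM = "\x02"     # Control-B separates array items and map entries
MAP_KEY = "\x03"  # Control-C separates a map key from its value


def serialize_row(name, tags, attrs):
    """Encode (STRING, ARRAY<STRING>, MAP<STRING,STRING>) as one text line."""
    array_text = ITEM.join(tags)
    map_text = ITEM.join(k + MAP_KEY + v for k, v in attrs.items())
    return FIELD.join([name, array_text, map_text])


def deserialize_row(line):
    """Split a line back into its three columns."""
    name, array_text, map_text = line.split(FIELD)
    tags = array_text.split(ITEM) if array_text else []
    attrs = {}
    if map_text:
        for pair in map_text.split(ITEM):
            key, value = pair.split(MAP_KEY, 1)
            attrs[key] = value
    return name, tags, attrs


line = serialize_row("tom", ["a", "b"], {"k1": "v1", "k2": "v2"})
print(repr(line))
print(deserialize_row(line))
```

Because nothing is escaped in this sketch, a value containing one of the control characters would corrupt the row, which is the reason for choosing delimiters that do not occur in the data.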
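Thus, the statement:
CREATE TABLE ...;
is identical to the more explicit:
CREATE TABLE ...
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;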
Notice that the octal form of the delimiter characters can be used—001 for Control-A,
for instance. Internally, Hive uses a SerDe called LazySimpleSerDe for this delimited
format, along with the line-oriented MapReduce text input and output formats we saw
in Chapter 7. The “lazy” prefix comes about since it deserializes fields lazily—only as
they are accessed. However, it is not a compact format since fields are stored in a
verbose textual format, so a boolean value, for instance, is written as the literal string
true or false.
The simplicity of the format has a lot going for it, such as making it easy to process
with other tools, including MapReduce programs or Streaming, but there are more
compact and performant binary SerDe’s that you might consider using. Some are
listed in Table 12-4.
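Table 12-4. Hive SerDes
SerDe name | Java package | Description
LazySimpleSerDe | org.apache.hadoop.hive.serde2.lazy | The default SerDe. Delimited textual format, with lazy field access.
LazyBinarySerDe | org.apache.hadoop.hive.serde2.lazybinary | A more efficient version of LazySimpleSerDe. Binary format with lazy field access. Used internally for such things as temporary tables.
BinarySortableSerDe | org.apache.hadoop.hive.serde2.binarysortable | A binary SerDe like LazyBinarySerDe, but optimized for sorting at the expense of compactness (although it is still significantly more compact than LazySimpleSerDe).
ColumnarSerDe | org.apache.hadoop.hive.serde2.columnar | A variant of LazySimpleSerDe for column-based storage with RCFile.
RegexSerDe | org.apache.hadoop.hive.contrib.serde2 | A SerDe for reading textual data where columns are specified by a regular expression. Also writes data using a formatting expression. Useful for reading log files, but inefficient, so not suitable for general-purpose storage.
ThriftByteStreamTypedSerDe | org.apache.hadoop.hive.serde2.thrift | A SerDe for reading Thrift-encoded binary data.
HBaseSerDe | org.apache.hadoop.hive.hbase | A SerDe for storing data in an HBase table. HBase storage uses a Hive storage handler, which unifies (and generalizes) the roles of row format and file format. Storage handlers are specified using a STORED BY clause, which replaces the ROW FORMAT and STORED AS clauses.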
Binary storage formats: Sequence files and RCFiles
Hadoop’s sequence file format (“SequenceFile”) is a general-purpose binary format
for sequences of records (key-value pairs). You can use sequence files in Hive by
using the declaration STORED AS SEQUENCEFILE in the CREATE TABLE
statement. One of the main benefits of using sequence files is their support for
splittable compression. If you have a collection of sequence files that were created
outside Hive, then Hive will read them with no extra configuration. If, on the other
hand, you want tables populated from Hive to use compressed sequence files for
their storage, you need to set a few properties to enable compression:
hive> SET hive.exec.compress.output=true;
hive> SET mapred.output.compress=true;
hive> SET mapred.output.compression.type=BLOCK;
Sequence files are row-oriented. What this means is that the fields in each row are
stored together, as the contents of a single sequence file record.
Hive provides another binary storage format called RCFile, short for Record
Columnar File. RCFiles are similar to sequence files, except that they store data in a
column-oriented fashion. RCFile breaks up the table into row splits, then within each
split stores the values for each row in the first column, followed by the values for
each row in the second column, and so on. This is shown diagrammatically in Figure
12-3.
Figure 12-3. Row-oriented versus column-oriented storage
A column-oriented layout permits columns that are not accessed in a query to be
skipped. Consider a query of the table in Figure 12-3 that processes only column 2.
With row-oriented storage, like a sequence file, the whole row (stored in a sequence
file record) is loaded into memory, even though only the second column is actually
read. Lazy deserialization goes some way to save processing cycles by only
deserializing the column fields that are accessed, but it can’t avoid the cost of
reading each row’s bytes from disk.
With column-oriented storage, only the column 2 parts of the file (shaded in the
figure) need to be read into memory.
In general, column-oriented formats work well when queries access only a small
number of columns in the table. Conversely, row-oriented formats are appropriate
when a large number of columns of a single row are needed for processing at the
same time.
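The difference between the two layouts can be modelled with plain Python lists. This is a conceptual sketch only and says nothing about the actual on-disk encoding of sequence files or RCFiles; the sample rows, the split size, and the function names are assumptions made for illustration.

```python
rows = [
    (1, "a", 10.0),
    (2, "b", 20.0),
    (3, "c", 30.0),
    (4, "d", 40.0),
]


def row_oriented(table):
    """Keep the fields of each row together, one record per row."""
    return [list(r) for r in table]


def column_oriented(table, split_size):
    """Cut the table into row splits; inside each split store all values of
    column 1, then all values of column 2, and so on."""
    splits = []
    for start in range(0, len(table), split_size):
        chunk = table[start:start + split_size]
        splits.append([list(column) for column in zip(*chunk)])
    return splits


def read_column(splits, index):
    """Read one column by touching only that column's block in each split."""
    values = []
    for split in splits:
        values.extend(split[index])
    return values


splits = column_oriented(rows, split_size=2)
print(splits)
print(read_column(splits, 1))
```

Reading the second column from the row-oriented form would require visiting every record in full, whereas read_column skips the blocks for the other columns.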
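Use the following CREATE TABLE clauses to enable column-oriented storage in
Hive:
CREATE TABLE ...
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS RCFILE;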
An example: RegexSerDe
Let’s see how to use another SerDe for storage. We’ll use a contrib SerDe that uses
a regular expression for reading the fixed-width station metadata from a text file:
CREATE TABLE stations (usaf STRING, wban STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\d{6}) (\\d{5}) (.{29}) .*"
);
In previous examples, we have used the DELIMITED keyword to refer to delimited
text in the ROW FORMAT clause. In this example, we instead specify a SerDe with
the SERDE keyword and the fully qualified classname of the Java class that
implements the SerDe, org.apache.hadoop.hive.contrib.serde2.RegexSerDe.
SerDes can be configured with extra properties using the WITH
SERDEPROPERTIES clause. Here we set the input.regex property, which is specific
to RegexSerDe. input.regex is the regular expression pattern to be used during
deserialization to turn the line of text forming the row into a set of columns. Java
regular expression syntax is used (see the Javadoc for java.util.regex.Pattern), and
columns are formed from capturing groups of parentheses. In this example, there are
three capturing groups for usaf (a six-digit identifier), wban (a five-digit identifier),
and name (a fixed-width column of 29 characters).
Sometimes you need to use parentheses for regular expression constructs that you
don’t want to count as a capturing group, for example, the pattern (ab)+ for matching
a string of one or more ab characters. The solution is to use a noncapturing group,
which has a ? character after the first parenthesis. There are various noncapturing
group constructs (see the Java documentation), but in this example we could use
(?:ab)+ to avoid capturing the group as a Hive column.
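The way capturing groups become columns can be tried out with Python's re module, whose syntax agrees with Java's for this particular pattern. The sketch below is not the SerDe itself: the sample line is invented, the function name deserialize is hypothetical, and returning None for every column when a line does not match is an assumption about the behavior on bad input.

```python
import re

# Same pattern as the input.regex property, without the extra escaping
# that the HiveQL string literal requires.
INPUT_REGEX = re.compile(r"(\d{6}) (\d{5}) (.{29}) .*")


def deserialize(line):
    """Turn one line of fixed-width text into (usaf, wban, name)."""
    match = INPUT_REGEX.fullmatch(line)
    if match is None:
        return (None, None, None)  # assumed handling of non-matching lines
    return match.groups()


# Invented sample line: 6 digits, 5 digits, a name padded to 29 characters.
sample = "012345 67890 " + "EXAMPLE STATION".ljust(29) + " trailing metadata"
usaf, wban, name = deserialize(sample)
print(usaf, wban, repr(name))

# A noncapturing group does not add a column.
print(re.compile(r"(?:ab)+(\d)").groups)
```

Note that the name column keeps its trailing padding, since the group captures exactly 29 characters.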
To populate the table, we use a LOAD DATA statement as before:
LOAD DATA LOCAL INPATH "input/ncdc/metadata/stations-fixedwidth.txt" INTO TABLE stations;
Recall that LOAD DATA copies or moves the files to Hive’s warehouse directory (in
this case, it’s a copy since the source is the local filesystem). The table’s SerDe is
not used for the load operation.
When we retrieve data from the table, the SerDe is invoked for deserialization, as we
can see from this simple query, which correctly parses the fields for each row:
hive> SELECT * FROM stations LIMIT 4;
Importing Data
We’ve already seen how to use the LOAD DATA operation to import data into a Hive
table (or partition) by copying or moving files to the table’s directory. You can also
populate a table with data from another Hive table using an INSERT statement, or at
creation time using the CTAS construct, which is an abbreviation used to refer to
CREATE TABLE...AS SELECT.
Here’s an example of an INSERT statement:
INSERT OVERWRITE TABLE target
SELECT col1, col2
FROM source;
For partitioned tables, you can specify the partition to insert into by supplying a
PARTITION clause:
INSERT OVERWRITE TABLE target
PARTITION (dt='2010-01-01')
SELECT col1, col2
FROM source;
The OVERWRITE keyword is actually mandatory in both cases, and means that the
contents of the target table (for the first example) or the 2010-01-01 partition (for the
second example) are replaced by the results of the SELECT statement. At the time
of writing, Hive does not support adding records to an already-populated
nonpartitioned table or partition using an INSERT statement. Instead, you can
achieve the same effect using a LOAD DATA operation without the OVERWRITE
keyword.
You can specify the partition dynamically, by determining the partition value from the
SELECT statement:
INSERT OVERWRITE TABLE target
PARTITION (dt)
SELECT col1, col2, dt
FROM source;
This is known as a dynamic-partition insert. This feature is off by default, so you
need to enable it by setting hive.exec.dynamic.partition to true first. Unlike other
databases, Hive does not (currently) support a form of the INSERT statement for
inserting a collection of records specified in the query, in literal form. That is,
statements of the form INSERT INTO...VALUES... are not allowed.
Multitable insert
In HiveQL, you can turn the INSERT statement around and start with the FROM
clause, for the same effect:
FROM source
INSERT OVERWRITE TABLE target
SELECT col1, col2;
The reason for this syntax becomes clear when you see that it’s possible to have
multiple INSERT clauses in the same query. This so-called multitable insert is more
efficient than multiple INSERT statements, since the source table need only be
scanned once to produce the multiple, disjoint outputs.
Here’s an example that computes various statistics over the weather dataset:
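FROM records2
INSERT OVERWRITE TABLE stations_by_year
SELECT year, COUNT(DISTINCT station)
GROUP BY year
INSERT OVERWRITE TABLE records_by_year
SELECT year, COUNT(1)
GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year
SELECT year, COUNT(1)
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
GROUP BY year;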
There is a single source table (records2), but three tables to hold the results from
three different queries over the source.
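The efficiency argument for a multitable insert is that all the outputs are produced from one pass over the source. The Python sketch below mimics that single scan. It is an analogy rather than a description of the MapReduce plan Hive generates: the record layout (station, year, temperature, quality), the sample records, and the assumption that the first output counts distinct stations per year are all illustrative.

```python
from collections import defaultdict

GOOD_QUALITY = {0, 1, 4, 5, 9}
MISSING = 9999


def multitable_insert(records):
    """Scan the source once and fill three per-year result tables."""
    stations = defaultdict(set)
    record_counts = defaultdict(int)
    good_counts = defaultdict(int)
    for station, year, temperature, quality in records:
        stations[year].add(station)   # distinct stations per year
        record_counts[year] += 1      # all records per year
        if temperature != MISSING and quality in GOOD_QUALITY:
            good_counts[year] += 1    # good records per year
    stations_by_year = {year: len(s) for year, s in stations.items()}
    return stations_by_year, dict(record_counts), dict(good_counts)


records = [
    ("s1", 1949, 111, 1),
    ("s1", 1949, 9999, 1),
    ("s2", 1949, 78, 2),
    ("s1", 1950, 22, 0),
]
stations_by_year, records_by_year, good_records_by_year = multitable_insert(records)
print(stations_by_year, records_by_year, good_records_by_year)
```

Running three separate INSERT statements would correspond to looping over records three times.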
CREATE TABLE...AS SELECT
It’s often very convenient to store the output of a Hive query in a new table, perhaps
because it is too large to be dumped to the console or because there are further processing steps to carry out on the result.
The new table’s column definitions are derived from the columns retrieved by the
SELECT clause. In the following query, the target table has two columns named
col1 and col2 whose types are the same as the ones in the source table:
CREATE TABLE target
AS
SELECT col1, col2
FROM source;
A CTAS operation is atomic, so if the SELECT query fails for some reason, then the
table is not created.
Altering Tables
Since Hive uses the schema-on-read approach, it’s flexible in permitting a table’s
definition to change after the table has been created. The general caveat, however,
is that it is up to you, in many cases, to ensure that the data is changed to reflect the
new structure.
You can rename a table using the ALTER TABLE statement:
ALTER TABLE source RENAME TO target;
In addition to updating the table metadata, ALTER TABLE moves the underlying
table directory so that it reflects the new name. In the current example,
/user/hive/warehouse/source is renamed to /user/hive/warehouse/target.
(An external table’s underlying directory is not moved; only the metadata is updated.)
Hive allows you to change the definition for columns, add new columns, or even
replace all existing columns in a table with a new set.
For example, consider adding a new column:
ALTER TABLE target ADD COLUMNS (col3 STRING);
The new column col3 is added after the existing (nonpartition) columns. The data
files are not updated, so queries will return null for all values of col3 (unless of
course there were extra fields already present in the files). Since Hive does not
permit updating existing records, you will need to arrange for the underlying files to
be updated by another mechanism. For this reason, it is more common to create a
new table that defines new columns and populates them using a SELECT
statement.
Changing a column’s metadata, such as a column’s name or data type, is more
straightforward, assuming that the old data type can be interpreted as the new
data type.
To learn more about how to alter a table’s structure, including adding and
dropping partitions, changing and replacing columns, and changing table and
SerDe properties, see the DDL section of the Hive language manual.
Dropping Tables
The DROP TABLE statement deletes the data and metadata for a table. In the
case of external tables, only the metadata is deleted—the data is left untouched.
If you want to delete all the data in a table, but keep the table definition (like
DELETE or TRUNCATE in MySQL), then you can simply delete the data files.
For example:
hive> dfs -rmr /user/hive/warehouse/my_table;
Hive treats a lack of files (or indeed no directory for the table) as an empty table.
Another possibility, which achieves a similar effect, is to create a new, empty
table that has the same schema as the first, using the LIKE keyword:
CREATE TABLE new_table LIKE existing_table;
Querying Data
This section discusses how to use various forms of the SELECT statement to
retrieve data from Hive.
Sorting and Aggregating
Sorting data in Hive can be achieved by use of a standard ORDER BY clause, but
there is a catch. ORDER BY produces a result that is totally sorted, as expected,
but to do so it sets the number of reducers to one, making it very inefficient for
large datasets. When a globally sorted result is not required—and in many cases
it isn’t—then you can use Hive’s nonstandard extension, SORT BY instead.
SORT BY produces a sorted file per reducer. In some cases, you want to control
which reducer a particular row goes to, typically so you can perform some
subsequent aggregation. This is what Hive’s DISTRIBUTE BY clause does.
Here’s an example to sort the weather dataset by year and temperature, in such a
way to ensure that all the rows for a given year end up in the same reducer
partition:
hive> FROM records2
> SELECT year, temperature
> DISTRIBUTE BY year
> SORT BY year ASC, temperature DESC;
1950 -11
A follow-on query would be able to use the fact that each year’s temperatures were
grouped and sorted (in descending order) in the same file.
If the columns for SORT BY and DISTRIBUTE BY are the same, you can use
CLUSTER BY as a shorthand for specifying both.
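The combined effect of DISTRIBUTE BY and SORT BY can be pictured with a short Python sketch. It is a simplification under stated assumptions: routing by year modulo the number of reducers stands in for whatever partitioning function Hive really applies, and the function name and sample rows are hypothetical.

```python
def distribute_and_sort(rows, num_reducers):
    """Route rows to reducers by year, then sort each reducer's rows by
    year ascending and temperature descending."""
    reducers = [[] for _ in range(num_reducers)]
    for year, temperature in rows:
        # Stand-in partitioner: all rows for one year reach the same reducer.
        reducers[year % num_reducers].append((year, temperature))
    for part in reducers:
        part.sort(key=lambda row: (row[0], -row[1]))
    return reducers


rows = [(1950, 0), (1949, 111), (1950, 22), (1949, 78), (1950, -11)]
parts = distribute_and_sort(rows, num_reducers=2)
for index, part in enumerate(parts):
    print(index, part)
```

Each reducer's output is sorted, but the concatenation of the outputs is not globally sorted, which is the difference from ORDER BY.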
MapReduce Scripts
Using an approach like Hadoop Streaming, the TRANSFORM, MAP, and REDUCE
clauses make it possible to invoke an external script or program from Hive. Suppose
we want to use a script to filter out rows that don’t meet some condition, such as the
script in Example 12-1, which removes poor quality readings.
Example 12-1. Python script to filter out poor quality weather records
#!/usr/bin/env python
import re
import sys

# Pass through only readings with a real temperature and a good quality code.
for line in sys.stdin:
    (year, temp, q) = line.strip().split()
    if (temp != "9999" and re.match("[01459]", q)):
        print("%s\t%s" % (year, temp))
We can use the script as follows:
hive> ADD FILE /path/to/;
hive> FROM records2
SELECT TRANSFORM(year, temperature, quality)
AS year, temperature;
Before running the query, we need to register the script with Hive. This is so Hive
knows to ship the file to the Hadoop cluster.
The query itself streams the year, temperature, and quality fields as a tab-separated
line to the script, and parses the tab-separated output into year
and temperature fields to form the output of the query.
This example has no reducers. If we use a nested form for the query, we can specify
a map and a reduce function. This time we use the MAP and REDUCE keywords,
but SELECT TRANSFORM in both cases would have the same result. The source
for the script is shown in Example 2-11:
FROM records2
MAP year, temperature, quality
AS year, temperature) map_output
REDUCE year, temperature
AS year, temperature;
Joins
One of the nice things about using Hive, rather than raw MapReduce, is that it
makes performing commonly used operations very simple. Join operations are a
case in point, given how involved they are to implement in MapReduce.
Inner joins
The simplest kind of join is the inner join, where each match in the input tables
results in a row in the output. Consider two small demonstration tables: sales, which
lists the names of people and the ID of the item they bought; and things, which lists
the item ID and its name:
hive> SELECT * FROM sales;
hive> SELECT * FROM things;
We can perform an inner join on the two tables as follows:
hive> SELECT sales.*, things.*
> FROM sales JOIN things ON (sales.id = things.id);
2 2
2 2
3 3
4 4
The table in the FROM clause (sales) is joined with the table in the JOIN clause
(things), using the predicate in the ON clause. Hive only supports equijoins, which
means that only equality can be used in the join predicate, which here matches on
the id column in both tables.
Some databases, such as MySQL and Oracle, allow you to list the join tables in the
FROM clause and specify the join condition in the WHERE clause of a SELECT
statement. However, this syntax is not supported in Hive, so the following fails with a
parse error:
SELECT sales.*, things.*
FROM sales, things
WHERE sales.id = things.id;
Hive only allows a single table in the FROM clause, and joins must follow the SQL-92
JOIN clause syntax.
In Hive, you can join on multiple columns in the join predicate by specifying a series
of expressions, separated by AND keywords. You can also join more than two tables
by supplying additional JOIN...ON... clauses in the query. Hive is intelligent about
trying to minimize the number of MapReduce jobs to perform the joins.
A single join is implemented as a single MapReduce job, but multiple joins can be
performed in fewer than one MapReduce job per join if the same column is used in the
join condition. You can see how many MapReduce jobs Hive will use for any
particular query by prefixing it with the EXPLAIN keyword:
EXPLAIN
SELECT sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
The EXPLAIN output includes many details about the execution plan for the query,
including the abstract syntax tree, the dependency graph for the stages that Hive
will execute, and information about each stage. Stages may be MapReduce jobs or
operations such as file moves. For even more detail, prefix the query with EXPLAIN
EXTENDED.
Hive currently uses a rule-based query optimizer for determining how to execute a
query, but it’s likely that in the future a cost-based optimizer will be added.
Outer joins
Note: The order of the tables in the JOIN clauses is significant: it is generally best
to have the largest table last. The Hive documentation has more details, including
how to give hints to the Hive planner.
Outer joins allow you to find nonmatches in the tables being joined. In the current
example, when we performed an inner join, the row for Ali did not appear in the
output, since the ID of the item she purchased was not present in the things table. If
we change the join type to LEFT OUTER JOIN, then the query will return a row for
every row in the left table (sales), even if there is no corresponding row in the table
it is being joined to (things):
hive> SELECT sales.*, things.*
> FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
2 2
2 2
3 3
4 4
Notice that the row for Ali is now returned, and the columns from the things table are
NULL, since there is no match.
Hive supports right outer joins, which reverse the roles of the tables relative to the
left join. In this case, all items from the things table are included, even those that
weren’t purchased by anyone (a scarf):
hive> SELECT sales.*, things.*
> FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
Finally, there is a full outer join, where the output has a row for each row from both
tables in the join:
hive> SELECT sales.*, things.*
> FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
Semi joins
Hive doesn’t support IN subqueries (at the time of writing), but you can use a LEFT
SEMI JOIN to do the same thing.
Consider this IN subquery, which finds all the items in the things table that are in the
sales table:
SELECT * FROM things WHERE things.id IN (SELECT id from sales);
We can rewrite it as follows:
hive> SELECT * FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
There is a restriction that we must observe for LEFT SEMI JOIN queries: the right
table (sales) may only appear in the ON clause. It cannot be referenced in a
SELECT expression, for example.
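The semantics of a semi join can be sketched in plain Java, outside Hive. This is only an illustration under stated assumptions: the class name, the helper sampleThings(), and the item names are hypothetical. It shows why only left-table columns are available: a left row is emitted at most once when any matching right row exists, and nothing from the right row is carried along.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SemiJoinSketch {
  // Hypothetical contents for the things table: id -> name.
  static Map<Integer, String> sampleThings() {
    Map<Integer, String> things = new LinkedHashMap<>();
    things.put(1, "scarf");
    things.put(2, "tie");
    things.put(3, "hat");
    return things;
  }

  // Emits each left row (a thing) at most once if its id occurs among the sales ids.
  static List<String> leftSemiJoin(Map<Integer, String> things, List<Integer> salesIds) {
    Set<Integer> ids = new HashSet<>(salesIds);   // duplicate sales collapse to one id
    List<String> out = new ArrayList<>();
    for (Map.Entry<Integer, String> thing : things.entrySet()) {
      if (ids.contains(thing.getKey())) {
        out.add(thing.getValue());                // only left-table columns are emitted
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Item 2 is sold twice, item 0 does not exist in things.
    System.out.println(leftSemiJoin(sampleThings(), List.of(2, 2, 3, 0)));
  }
}
```

An inner join on the same data would return the tie twice, once per matching sale, which is the practical difference between the two rewrites of an IN subquery.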
Map joins
If one table is small enough to fit in memory, then Hive can load the smaller table
into memory to perform the join in each of the mappers. The syntax for specifying a
map join is a hint embedded in an SQL C-style comment:
SELECT /*+ MAPJOIN(things) */ sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
The job to execute this query has no reducers, so this query would not work for a
RIGHT or FULL OUTER JOIN, since absence of matching can only be detected in
an aggregating (reduce) step across all the inputs.
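To make the reasoning concrete, here is a minimal plain-Java sketch of what each mapper does in a map join. It is not Hive's implementation; the class, the Sale type, and the sample rows are hypothetical. Each mapper holds the whole small table in a hash map and sees only its own split of the large table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapJoinSketch {
  // Hypothetical row of the large (sales) table.
  static final class Sale {
    final String name;
    final int id;
    Sale(String name, int id) { this.name = name; this.id = id; }
  }

  // Hypothetical contents of the small (things) table, loaded by every mapper.
  static Map<Integer, String> sampleThings() {
    Map<Integer, String> things = new HashMap<>();
    things.put(1, "scarf");
    things.put(2, "tie");
    return things;
  }

  // One mapper: joins its split against the in-memory table with a hash lookup.
  static List<String> mapSplit(List<Sale> split, Map<Integer, String> thingsById) {
    List<String> out = new ArrayList<>();
    for (Sale sale : split) {
      String thing = thingsById.get(sale.id);
      if (thing != null) {
        out.add(sale.name + "," + sale.id + "," + thing);   // inner join match
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Map<Integer, String> things = sampleThings();
    System.out.println(mapSplit(List.of(new Sale("joe", 2)), things));   // mapper 1
    System.out.println(mapSplit(List.of(new Sale("ali", 0)), things));   // mapper 2
  }
}
```

Neither mapper can tell that the scarf (id 1) matched nothing anywhere, because each sees only its own split. Emitting unmatched rows of the in-memory table therefore needs a step that sees all inputs.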
Map joins can take advantage of bucketed tables (“Buckets”), since a mapper
working on a bucket of the left table only needs to load the corresponding buckets of
the right table to perform the join. The syntax for the join is the same as for the in-memory case above; however, you also need to enable the optimization with:
SET hive.optimize.bucketmapjoin=true;
A subquery is a SELECT statement that is embedded in another SQL statement.
Hive has limited support for subqueries, only permitting a subquery in the FROM
clause of a SELECT statement.
Other databases allow subqueries almost anywhere that an expression is valid, such
as in the list of values to retrieve from a SELECT statement or in the WHERE
clause. Many uses of subqueries can be rewritten as joins, so if you find yourself
writing a subquery where Hive does not support it, then see if it can be expressed as
a join. For example, an IN subquery can be written as a semi join, or an inner join
(see “Joins” ).
The following query finds the mean maximum temperature for every year and
weather station:
SELECT station, year, AVG(max_temperature)
FROM (
  SELECT station, year, MAX(temperature) AS max_temperature
  FROM records2
  WHERE temperature != 9999
    AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
  GROUP BY station, year
) mt
GROUP BY station, year;
The subquery is used to find the maximum temperature for each station/year combination, then the outer query uses the AVG aggregate function to find the average of
the maximum temperature readings for each station/year combination.
The outer query accesses the results of the subquery like it does a table, which is
why the subquery must be given an alias (mt). The columns of the subquery have to
be given unique names so that the outer query can refer to them.
A view is a sort of “virtual table” that is defined by a SELECT statement. Views can
be used to present data to users in a different way to the way it is actually stored on
disk. Often, the data from existing tables is simplified or aggregated in a particular
way that makes it convenient for further processing. Views may also be used to
restrict users’ access to particular subsets of tables that they are authorized to see.
In Hive, a view is not materialized to disk when it is created; rather, the view’s
SELECT statement is executed when the statement that refers to the view is run. If a
view performs extensive transformations on the base tables, or is used frequently,
then you may choose to manually materialize it by creating a new table that stores
the contents of the view (see “CREATE TABLE...AS SELECT” ).
We can use views to rework the query from the previous section for finding the mean
maximum temperature for every year and weather station. First, let’s create a view
for valid records, that is, records that have a particular quality value:
CREATE VIEW valid_records
AS
SELECT *
FROM records2
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9);
When we create a view, the query is not run; it is simply stored in the metastore.
Views are included in the output of the SHOW TABLES command, and you can see
more details about a particular view, including the query used to define it, by issuing
the DESCRIBE EXTENDED view_name command.
Next, let’s create a second view of maximum temperatures for each station and year.
It is based on the valid_records view:
CREATE VIEW max_temperatures (station, year, max_temperature)
AS
SELECT station, year, MAX(temperature)
FROM valid_records
GROUP BY station, year;
In this view definition, we list the column names explicitly. We do this since the maximum temperature column is an aggregate expression, and otherwise Hive would
create a column alias for us (such as _c2). We could equally well have used an AS
clause in the SELECT to name the column.
With the views in place, we can now use them by running a query:
SELECT station, year, AVG(max_temperature)
FROM max_temperatures
GROUP BY station, year;
The result of the query is the same as running the one that uses a subquery, and, in
particular, the number of MapReduce jobs that Hive creates is the same for both: two
in each case, one for each GROUP BY. This example shows that Hive can combine
a query on a view into a sequence of jobs that is equivalent to writing the query
without using a view. In other words, Hive won’t needlessly materialize a view even
at execution time.
Views in Hive are read-only, so there is no way to load or insert data into an
underlying base table via a view.
User-Defined Functions
Sometimes the query you want to write can’t be expressed easily (or at all) using the
built-in functions that Hive provides. Hive makes it easy to plug in your own
processing code as a user-defined function (UDF) and invoke it from a Hive query.
UDFs have to be written in Java, the language that Hive itself is written in. For other
languages, consider using a SELECT TRANSFORM query, which allows you to
stream data through a user-defined script.
There are three types of UDF in Hive: (regular) UDFs, UDAFs (user-defined
aggregate functions), and UDTFs (user-defined table-generating functions). They
differ in the numbers of rows that they accept as input and produce as output:
• A UDF operates on a single row and produces a single row as its output. Most
functions, such as mathematical functions and string functions, are of this type.
• A UDAF works on multiple input rows and creates a single output row. Aggregate
functions include such functions as COUNT and MAX.
• A UDTF operates on a single row and produces multiple rows (a table) as
its output.
Table-generating functions are less well known than the other two types, so let’s look
at an example. Consider a table with a single column, x, which contains arrays of
strings. It’s instructive to take a slight detour to see how the table is defined and
populated.
Notice that the ROW FORMAT clause specifies that the entries in the array are
delimited by Control-B characters. The example file that we are going to load has
the following contents, where ^B is a representation of the Control-B character to
make it suitable for printing:
After running a LOAD DATA command, the following query confirms that the data
was loaded correctly:
hive> SELECT * FROM arrays;
Next, we can use the explode UDTF to transform this table. This function emits a
row for each entry in the array, so in this case the type of the output column y is
STRING. The result is that the table is flattened into five rows:
hive> SELECT explode(x) AS y FROM arrays;
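The flattening performed by a table-generating function can be sketched in plain Java. This is an illustration only, not Hive code: the class name is hypothetical, and the sample data (two rows holding five strings in total) is assumed for the purpose of the example.

```java
import java.util.ArrayList;
import java.util.List;

class ExplodeSketch {
  // Emits one output row per array element: one input row in, many rows out.
  static List<String> explode(List<List<String>> rows) {
    List<String> out = new ArrayList<>();
    for (List<String> x : rows) {
      out.addAll(x);
    }
    return out;
  }

  public static void main(String[] args) {
    // Assumed contents of the single column x: two rows, five strings in total.
    List<List<String>> arrays = List.of(List.of("a", "b"), List.of("c", "d", "e"));
    System.out.println(explode(arrays));   // prints [a, b, c, d, e]
  }
}
```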
SELECT statements using UDTFs have some restrictions (such as not being able to
retrieve additional column expressions), which make them less useful in practice.
For this reason, Hive supports LATERAL VIEW queries, which are more
powerful. LATERAL VIEW queries are not covered here, but you may find out more
about them at https://cwiki
Writing a UDF
To illustrate the process of writing and using a UDF, we’ll write a simple UDF to trim
characters from the ends of strings. Hive already has a built-in function called trim,
so we’ll call ours strip. The code for the Strip Java class is shown in Example 12-2.
Example 12-2. A UDF for stripping characters from the ends of strings
package com.hadoopbook.hive;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {
  // Reused across calls to avoid allocating a new Text per row
  private Text result = new Text();

  public Text evaluate(Text str) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString()));
    return result;
  }

  public Text evaluate(Text str, String stripChars) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString(), stripChars));
    return result;
  }
}
A UDF must satisfy the following two properties:
A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
A UDF must implement at least one evaluate() method.
The evaluate() method is not defined by an interface since it may take an arbitrary
number of arguments, of arbitrary types, and it may return a value of arbitrary type.
Hive introspects the UDF to find the evaluate() method that matches the Hive
function that was invoked.
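The idea of locating an evaluate() method by introspection can be sketched with standard Java reflection. This is an assumption-laden illustration, not Hive's resolver: the classes EvaluateLookupSketch and StripLike are hypothetical, StripLike does not extend Hive's UDF class, and the lookup matches exact argument types only, with none of the type conversions a real resolver would need.

```java
import java.lang.reflect.Method;

class EvaluateLookupSketch {
  // Hypothetical stand-in for a UDF class with two evaluate() overloads.
  public static class StripLike {
    public String evaluate(String s) {
      return s == null ? null : s.trim();
    }

    public String evaluate(String s, String chars) {
      if (s == null) {
        return null;
      }
      int start = 0;
      int end = s.length();
      while (start < end && chars.indexOf(s.charAt(start)) >= 0) {
        start++;   // drop leading characters found in chars
      }
      while (end > start && chars.indexOf(s.charAt(end - 1)) >= 0) {
        end--;     // drop trailing characters found in chars
      }
      return s.substring(start, end);
    }
  }

  // Finds the evaluate() overload matching the argument types and invokes it.
  static Object call(Object udf, Object... args) {
    Class<?>[] types = new Class<?>[args.length];
    for (int i = 0; i < args.length; i++) {
      types[i] = args[i].getClass();
    }
    try {
      Method m = udf.getClass().getMethod("evaluate", types);
      return m.invoke(udf, args);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("No matching evaluate() method", e);
    }
  }

  public static void main(String[] args) {
    StripLike udf = new StripLike();
    System.out.println(call(udf, " bee "));          // one-argument overload
    System.out.println(call(udf, "banana", "ab"));   // two-argument overload
  }
}
```

Because no interface fixes the signature, adding another overload is just a matter of adding another public evaluate() method.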
The Strip class has two evaluate() methods. The first strips leading and trailing
whitespace from the input, while the second can strip any of a set of supplied
characters from the ends of the string. The actual string processing is delegated to
the StringUtils class from the Apache Commons project, which makes the only
noteworthy part of the code the use of Text from the Hadoop Writable library. Hive
actually supports Java primitives in UDFs (and a few other types like java.util.List
and java.util.Map), so a signature like:
public String evaluate(String str)
would work equally well. However, by using Text, we can take advantage of object
reuse, which can bring efficiency savings, and so is to be preferred in general.
To use the UDF in Hive, we need to package the compiled Java class in a JAR file
(you can do this by typing ant hive with the book’s example code) and register the
file with Hive:
ADD JAR /path/to/hive-examples.jar;
We also need to create an alias for the Java classname:
CREATE TEMPORARY FUNCTION strip AS 'com.hadoopbook.hive.Strip';
The TEMPORARY keyword here highlights the fact that UDFs are only defined for
the duration of the Hive session (they are not persisted in the metastore). In practice,
this means you need to add the JAR file, and define the function at the beginning of
each script or session.
As an alternative to calling ADD JAR, you can specify, at launch time, a path
where Hive looks for auxiliary JAR files to put on its classpath (including the
MapReduce classpath). This technique is useful for automatically adding your own
library of UDFs every time you run Hive.
There are two ways of specifying the path, either passing the --auxpath option to the
hive command:
% hive --auxpath /path/to/hive-examples.jar
or by setting the HIVE_AUX_JARS_PATH environment variable before invoking
Hive. The auxiliary path may be a comma-separated list of JAR file paths or a
directory containing JAR files.
The UDF is now ready to be used, just like a built-in function:
hive> SELECT strip(' bee ') FROM dummy;
hive> SELECT strip('banana', 'ab') FROM dummy;
Notice that the UDF’s name is not case-sensitive:
hive> SELECT STRIP(' bee ') FROM dummy;
Writing a UDAF
An aggregate function is more difficult to write than a regular UDF, since values are
aggregated in chunks (potentially across many Map or Reduce tasks), so the implementation has to be capable of combining partial aggregations into a final result. The
code to achieve this is best explained by example, so let’s look at the implementation
of a simple UDAF for calculating the maximum of a collection of integers (Example 12-3).
Example 12-3. A UDAF for calculating the maximum of a collection of integers
package com.hadoopbook.hive;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;

public class Maximum extends UDAF {
  public static class MaximumIntUDAFEvaluator implements UDAFEvaluator {
    private IntWritable result;

    public void init() {
      result = null;
    }

    public boolean iterate(IntWritable value) {
      if (value == null) {
        return true;
      }
      if (result == null) {
        result = new IntWritable(value.get());
      } else {
        result.set(Math.max(result.get(), value.get()));
      }
      return true;
    }

    public IntWritable terminatePartial() {
      return result;
    }

    public boolean merge(IntWritable other) {
      return iterate(other);
    }

    public IntWritable terminate() {
      return result;
    }
  }
}
The class structure is slightly different from the one for UDFs. A UDAF must be a
subclass of org.apache.hadoop.hive.ql.exec.UDAF (note the “A” in UDAF) and
contain one or more nested static classes implementing
org.apache.hadoop.hive.ql.exec.UDAFEvaluator. In this example, there is a single
nested class, MaximumIntUDAFEvaluator, but we could add more evaluators such
as MaximumLongUDAFEvaluator, MaximumFloatUDAFEvaluator, and so on, to
provide overloaded forms of the UDAF for finding the maximum of a collection of
longs, floats, and so on.
An evaluator must implement five methods, described in turn below (the flow is illustrated in Figure 12-4):
The init() method initializes the evaluator and resets its internal state. In
MaximumIntUDAFEvaluator, we set the IntWritable object holding the final result to
null. We use null to indicate that no values have been aggregated yet, which has the
desirable effect of making the maximum value of an empty set NULL.
The iterate() method is called every time there is a new value to be aggregated. The
evaluator should update its internal state with the result of performing the
aggregation. The arguments that iterate() takes correspond to those in the Hive
function from which it was called. In this example, there is only one argument. The
value is first checked to see if it is null, and if it is, it is ignored. Otherwise, the result
instance variable is set to value’s integer value (if this is the first value that has been
seen), or set to the larger of the current result and value (if one or more values have
already been seen). We return true to indicate that the input value was valid.
The terminatePartial() method is called when Hive wants a result for the partial
aggregation. The method must return an object that encapsulates the state of the
aggregation. In this case, an IntWritable suffices, since it encapsulates either the
maximum value seen or null if no values have been processed.
The merge() method is called when Hive decides to combine one partial aggregation with another. The method takes a single object whose type must correspond to
the return type of the terminatePartial() method. In this example, the merge() method
can simply delegate to the iterate() method, because the partial aggregation is
represented in the same way as a value being aggregated. This is not generally the
case (and we’ll see a more general example later), and the method should
implement the logic to combine the evaluator’s state with the state of the partial
aggregation.
Figure 12-4. Data flow with partial results for a UDAF
The terminate() method is called when the final result of the aggregation is needed.
The evaluator should return its state as a value. In this case, we return the result
instance variable.
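The order in which the five methods are called can be sketched with a small plain-Java driver. This is a simulation under stated assumptions rather than Hive's execution engine: the classes UdafFlowSketch and MaxEvaluator are hypothetical, Integer stands in for IntWritable, and each inner list plays the role of the values seen by one task.

```java
import java.util.List;

class UdafFlowSketch {
  // Plain-Java stand-in for the evaluator; Integer replaces IntWritable.
  static class MaxEvaluator {
    private Integer result;

    void init() {
      result = null;
    }

    boolean iterate(Integer value) {
      if (value != null) {
        result = (result == null) ? value : Math.max(result, value);
      }
      return true;
    }

    Integer terminatePartial() {
      return result;
    }

    boolean merge(Integer other) {
      return iterate(other);
    }

    Integer terminate() {
      return result;
    }
  }

  // Aggregates each partition separately, then merges the partial results.
  static Integer aggregate(List<List<Integer>> partitions) {
    MaxEvaluator finalEvaluator = new MaxEvaluator();
    finalEvaluator.init();
    for (List<Integer> partition : partitions) {
      MaxEvaluator partialEvaluator = new MaxEvaluator();
      partialEvaluator.init();
      for (Integer value : partition) {
        partialEvaluator.iterate(value);
      }
      finalEvaluator.merge(partialEvaluator.terminatePartial());
    }
    return finalEvaluator.terminate();
  }

  public static void main(String[] args) {
    System.out.println(aggregate(List.of(List.of(3, 7), List.of(5))));   // prints 7
  }
}
```

A partition with no values yields a null partial result, which merge() ignores, so the maximum of an entirely empty input stays null.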
Let’s exercise our new function:
hive> CREATE TEMPORARY FUNCTION maximum AS 'com.hadoopbook.hive.Maximum';
hive> SELECT maximum(temperature) FROM records;
A more complex UDAF
In Example 12-4, which calculates the mean of a collection of doubles, the merge()
method is different from iterate(), since it combines the
partial sums and partial counts by pairwise addition. Also, the return type of
terminatePartial() is PartialResult (which of course is never seen by the user calling the
function), while the return type of terminate() is DoubleWritable, the final result seen
by the user.
Example 12-4. A UDAF for calculating the mean of a collection of doubles
package com.hadoopbook.hive;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;

public class Mean extends UDAF {
  public static class MeanDoubleUDAFEvaluator implements UDAFEvaluator {
    public static class PartialResult {
      double sum;
      long count;
    }
    private PartialResult partial;

    public void init() {
      partial = null;
    }

    public boolean iterate(DoubleWritable value) {
      if (value == null) {
        return true;
      }
      if (partial == null) {
        partial = new PartialResult();
      }
      partial.sum += value.get();
      partial.count++;
      return true;
    }

    public PartialResult terminatePartial() {
      return partial;
    }

    public boolean merge(PartialResult other) {
      if (other == null) {
        return true;
      }
      if (partial == null) {
        partial = new PartialResult();
      }
      partial.sum += other.sum;
      partial.count += other.count;
      return true;
    }

    public DoubleWritable terminate() {
      if (partial == null) {
        return null;
      }
      return new DoubleWritable(partial.sum / partial.count);
    }
  }
}
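A small worked example shows why the partial state must carry both a sum and a count. The following plain-Java sketch is hypothetical (the class MeanMergeSketch and its sample values are not from Hive); it compares merging by pairwise addition with the tempting shortcut of averaging the partial means.

```java
class MeanMergeSketch {
  // Partial aggregation state: a running sum and the number of values seen.
  static final class Partial {
    double sum;
    long count;
  }

  static Partial aggregate(double... values) {
    Partial p = new Partial();
    for (double v : values) {
      p.sum += v;
      p.count++;
    }
    return p;
  }

  // Merge by pairwise addition of sums and counts; divide only at the end.
  static double mergedMean(Partial a, Partial b) {
    return (a.sum + b.sum) / (a.count + b.count);
  }

  // Shortcut that ignores how many values each partial result represents.
  static double meanOfMeans(Partial a, Partial b) {
    return (a.sum / a.count + b.sum / b.count) / 2;
  }

  public static void main(String[] args) {
    Partial first = aggregate(1, 2, 3);   // sum 6, count 3
    Partial second = aggregate(10);       // sum 10, count 1
    System.out.println(mergedMean(first, second));    // prints 4.0
    System.out.println(meanOfMeans(first, second));   // prints 6.0
  }
}
```

The true mean of 1, 2, 3, and 10 is 4.0, which only the sum-and-count merge reproduces. This is why terminatePartial() cannot simply return the running mean.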
HBase is a distributed column-oriented database built on top of HDFS. HBase is the
Hadoop application to use when you require real-time read/write random-access to
very large datasets.
Although there are countless strategies and implementations for
database storage and retrieval, most solutions, especially those of the relational
variety, are not built with very large scale and distribution in mind. Many vendors
offer replication and partitioning solutions to grow the database beyond the confines
of a single node, but these add-ons are generally an afterthought and are
complicated to install and maintain. They also come at some severe compromise to
the RDBMS feature set. Joins, complex queries, triggers, views, and foreign-key
constraints become prohibitively expensive to run on a scaled RDBMS or do not
work at all.
HBase comes at the scaling problem from the opposite direction. It is built from the
ground-up to scale linearly just by adding nodes. HBase is not relational and does
not support SQL, but given the proper problem space, it is able to do what an
RDBMS cannot: host very large, sparsely populated tables on clusters made from
commodity hardware.
The canonical HBase use case is the webtable, a table of crawled web pages and
their attributes (such as language and MIME type) keyed by the web page URL. The
webtable is large, with row counts that run into the billions. Batch analytic and
parsing MapReduce jobs are continuously run against the webtable deriving
statistics and adding new columns of verified MIME type and parsed text content for
later indexing by a search engine. Concurrently, the table is randomly accessed by
crawlers running at various rates updating random rows while random web pages
are served in real time as users click on a website’s cached-page feature.
The HBase project was started toward the end of 2006 by Chad Walters and Jim
Kellerman at Powerset. It was modeled after Google’s “Bigtable: A Distributed
Storage System for Structured Data,” which had just been published. In
February 2007, Mike Cafarella made a code drop of a mostly working system that
Jim Kellerman then carried forward.
The first HBase release was bundled as part of Hadoop 0.15.0 in October 2007. In
May 2010, HBase graduated from a Hadoop subproject to become an Apache Top
Level Project. Production users of HBase include Adobe, StumbleUpon, Twitter, and
groups at Yahoo!.
In this section, we provide a quick overview of core HBase concepts. At a minimum,
a passing familiarity will ease the digestion of all that follows.
Whirlwind Tour of the Data Model
Applications store data into labeled tables. Tables are made of rows and columns.
Table cells—the intersection of row and column coordinates—are versioned. By
default, their version is a timestamp auto-assigned by HBase at the time of cell
insertion. A cell’s content is an uninterpreted array of bytes. Table row keys are also
byte arrays, so theoretically anything can serve as a row key from strings to binary
representations of long or even serialized data structures. Table rows are sorted by
row key, the table’s primary key. The sort is byte-ordered. All table accesses are via
the table primary key.
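Byte ordering has practical consequences for row key design, which a short plain-Java sketch can show. This is an illustration, not HBase code: the class RowKeyOrderSketch and its comparator are written here for the example, on the assumption that keys compare as unsigned bytes, left to right, with a shorter key sorting before any longer key it prefixes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class RowKeyOrderSketch {
  // Unsigned, left-to-right comparison of two byte arrays.
  static int compareBytes(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int x = a[i] & 0xff;
      int y = b[i] & 0xff;
      if (x != y) {
        return x - y;
      }
    }
    return a.length - b.length;   // a prefix sorts before the longer key
  }

  // Returns the given keys in the order a byte-sorted table would hold them.
  static List<String> sortedKeys(String... keys) {
    TreeMap<byte[], String> rows = new TreeMap<>(RowKeyOrderSketch::compareBytes);
    for (String key : keys) {
      rows.put(key.getBytes(StandardCharsets.UTF_8), key);
    }
    return new ArrayList<>(rows.values());
  }

  public static void main(String[] args) {
    System.out.println(sortedKeys("row2", "row10", "row1"));   // prints [row1, row10, row2]
  }
}
```

Since row10 sorts before row2, keys with a numeric component are usually zero-padded to a fixed width, or stored in a fixed-width binary form, when numeric order matters.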
Row columns are grouped into column families. All column family members have a
common prefix, so, for example, the columns temperature:air and temperature:dew_point are both members of the temperature column family, whereas
station:identifier belongs to the station family. The column family prefix must be composed of printable characters. The qualifying tail, the column family qualifier, can be
made of any arbitrary bytes.
• For more detail than is provided here, see the HBase Architecture page on the
HBase wiki.
• As of this writing, there are at least two projects up on github that add secondary
indices to HBase.
• In HBase, by convention, the colon character (:) delimits the column family from
the column family qualifier. It is hardcoded.
A table’s column families must be specified up front as part of the table schema definition, but new column family members can be added on demand. For example, a
new column station:address can be offered by a client as part of an update, and its
value persisted, as long as the column family station is already in existence on the
targeted table. Physically, all column family members are stored together on the
filesystem. So, though earlier we described HBase as a column-oriented store, it
would be more accurate if it were described as a column-family-oriented store.
Because tunings and storage specifications are done at the column family level, it is
advised that all column family members have the same general access pattern and
size characteristics. In synopsis, HBase tables are like those in an RDBMS, only
cells are versioned, rows are sorted, and columns can be added on the fly by the
client as long as the column family they belong to preexists.
Tables are automatically partitioned horizontally by HBase into regions. Each region
comprises a subset of a table’s rows. A region is denoted by the table it belongs to,
its first row, inclusive, and last row, exclusive. Initially, a table comprises a single
region, but as the size of the region grows, after it crosses a configurable size
threshold, it splits at a row boundary into two new regions of approximately equal
size. Until this first split happens, all loading will be against the single server hosting
the original region. As the table grows, the number of its regions grows. Regions are
the units that get distributed over an HBase cluster. In this way, a table that is too big
for any one server can be carried by a cluster of servers with each node hosting a
subset of the table’s total regions. This is also the means by which the loading on a
table gets distributed. The online set of sorted regions comprises the table’s total
content.
Row updates are atomic, no matter how many row columns constitute the row-level
transaction. This keeps the locking model simple.
Just as HDFS and MapReduce are built of clients, slaves, and a coordinating
master— namenode and datanodes in HDFS and jobtracker and tasktrackers in
MapReduce—so is HBase modeled with an HBase master node orchestrating a
cluster of one or more regionserver slaves (see Figure 13-1). The HBase master is
responsible for bootstrapping a virgin install, for assigning regions to registered
regionservers, and for recovering regionserver failures. The master node is lightly
loaded. The regionservers carry zero or more regions and field client read/write
requests. They also manage region splits informing the HBase master about the new
daughter regions for it to manage the off-lining of parent region and assignment of
the replacement daughters.
Figure 13-1. HBase cluster members
HBase depends on ZooKeeper (Chapter 14) and by default it manages a ZooKeeper
instance as the authority on cluster state. HBase stores vitals in ZooKeeper, such as
the location of the root catalog table and the address of the current cluster Master.
Assignment of regions is mediated via ZooKeeper in case participating servers crash
mid-assignment. Hosting the assignment transaction state in ZooKeeper means that
recovery can pick up the assignment where the crashed server left off. At a
minimum, when bootstrapping a client connection to an HBase cluster, the client must be
passed the location of the ZooKeeper ensemble. Thereafter, the client navigates the
ZooKeeper hierarchy to learn cluster attributes such as server locations.
Regionserver slave nodes are listed in the HBase conf/regionservers file as you
would list datanodes and tasktrackers in the Hadoop conf/slaves file. Start and stop
scripts are like those in Hadoop, using the same SSH-based mechanism for running
remote commands. Cluster site-specific configuration is made in the HBase
conf/hbase-site.xml and conf/ files, which have the same format as that
of their equivalents up in the Hadoop parent project.
HBase persists data via the Hadoop filesystem API. Since there are multiple
implementations of the filesystem interface (one for the local filesystem, one for the
KFS filesystem, Amazon’s S3, and HDFS, the Hadoop Distributed Filesystem),
HBase can persist to any of these implementations. Most experience has
been with HDFS, although by default, unless told otherwise, HBase writes to the
local filesystem. The local filesystem is fine for experimenting with your initial HBase
install, but thereafter, usually the first configuration made in an HBase cluster
involves pointing HBase at the HDFS cluster to use.
HBase in operation
HBase, internally, keeps special catalog tables named -ROOT- and .META. within
which it maintains the current list, state, and location of all regions afloat on the
cluster. The -ROOT- table holds the list of .META. table regions. The .META. table
holds the list of all user-space regions. Entries in these tables are keyed by region
name, where a region name is made of the table name the region belongs to, the
region’s start row, its time of creation, and finally, an MD5 hash of all of the former
(i.e., a hash of tablename, start row, and creation timestamp.)
Row keys, as noted previously, are sorted, so finding the region that hosts a
particular row is a matter of a lookup to find the first entry whose key is greater than
or equal to that of the requested row key. As regions transition (are split,
disabled/enabled, deleted, redeployed by the region load balancer, or redeployed
due to a regionserver crash), the catalog tables are updated so the state of all
regions on the cluster is kept current.
Fresh clients connect to the ZooKeeper cluster first to learn the location of -ROOT-.
Clients consult -ROOT- to elicit the location of the .META. region whose scope
covers that of the requested row. The client then does a lookup against the found
.META. region to figure the hosting user-space region and its location. Thereafter,
the client interacts directly with the hosting regionserver.
To save on having to make three round-trips per row operation, clients cache all they
learn traversing -ROOT- and .META. caching locations as well as user-space region
start and stop rows so they can figure hosting regions themselves without having to
go back to the .META. table. Clients continue to use the cached entry as they work
until there is a fault. When this happens—the region has moved—the client consults
the .META. again to learn the new location. If, in turn, the consulted .META. region
has moved, then -ROOT- is reconsulted.
Writes arriving at a regionserver are first appended to a commit log and then are
added to an in-memory memstore. When a memstore fills, its content is flushed to
the filesystem.
The commit log is hosted on HDFS, so it remains available through a regionserver
crash. When the master notices that a regionserver is no longer reachable, usually
because the server’s znode has expired in ZooKeeper, it splits the dead
regionserver’s commit log by region. On reassignment, regions that were on the
dead regionserver, before they open for business, will pick up their just-split file of
not yet persisted edits and replay them to bring themselves up-to-date with the state
they had just before the failure.
When reading, the region’s memstore is consulted first. If sufficient versions are found
reading the memstore alone, the query completes there. Otherwise, flush files are
consulted in order, from newest to oldest, until versions sufficient to satisfy the query
are found, or until we run out of flush files.
A background process compacts flush files once their number has reached a
threshold, rewriting many files as one, because the fewer files a read consults, the
more performant it will be. On compaction, versions beyond the schema-configured
maximum, deletes, and expired cells are cleaned out. A separate process running in
the regionserver monitors flush file sizes, splitting the region when they grow in
excess of the configured maximum.
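The write, read, and compaction paths described above can be sketched as a toy in-memory store. This is a deliberately simplified model, not HBase's implementation: the class RegionStoreSketch is hypothetical, the flush threshold counts entries rather than bytes, each cell keeps a single version, and in-memory maps stand in for the commit log and flush files. Deletes and cell expiry are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class RegionStoreSketch {
  private final int flushThreshold;   // assumed: number of entries, not bytes
  private final List<String> commitLog = new ArrayList<>();
  private TreeMap<String, String> memstore = new TreeMap<>();
  private final List<TreeMap<String, String>> flushFiles = new ArrayList<>();   // oldest first

  RegionStoreSketch(int flushThreshold) {
    this.flushThreshold = flushThreshold;
  }

  void put(String row, String value) {
    commitLog.add(row + "=" + value);          // 1. append to the commit log
    memstore.put(row, value);                  // 2. add to the in-memory memstore
    if (memstore.size() >= flushThreshold) {   // 3. flush when the memstore fills
      flushFiles.add(memstore);
      memstore = new TreeMap<>();
    }
  }

  String get(String row) {
    if (memstore.containsKey(row)) {           // memstore is consulted first
      return memstore.get(row);
    }
    for (int i = flushFiles.size() - 1; i >= 0; i--) {   // then newest to oldest
      String value = flushFiles.get(i).get(row);
      if (value != null) {
        return value;
      }
    }
    return null;
  }

  // Rewrites all flush files as one; newer values overwrite older ones.
  void compact() {
    TreeMap<String, String> merged = new TreeMap<>();
    for (TreeMap<String, String> file : flushFiles) {
      merged.putAll(file);
    }
    flushFiles.clear();
    if (!merged.isEmpty()) {
      flushFiles.add(merged);
    }
  }

  int flushFileCount() {
    return flushFiles.size();
  }

  public static void main(String[] args) {
    RegionStoreSketch store = new RegionStoreSketch(2);
    store.put("r1", "v1");
    store.put("r2", "v2");    // memstore full: first flush
    store.put("r1", "v1b");
    store.put("r3", "v3");    // second flush
    System.out.println(store.get("r1") + " from " + store.flushFileCount() + " files");
    store.compact();
    System.out.println(store.get("r1") + " from " + store.flushFileCount() + " file");
  }
}
```

Reading newest to oldest is what lets a later write shadow an earlier one without rewriting the older file, and compaction bounds how many files a read may have to consult.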
Download a stable release from an Apache Download Mirror and unpack it on your
local filesystem. For example:
% tar xzf hbase-x.y.z.tar.gz
As with Hadoop, you first need to tell HBase where Java is located on your system. If
you have the JAVA_HOME environment variable set to point to a suitable Java
installation, then that will be used, and you don’t have to configure anything further.
Otherwise, you can set the Java installation that HBase uses by editing HBase’s
conf/, and specifying the JAVA_HOME variable (see Appendix A for
some examples) to point to version 1.6.0 of Java.
For convenience, add the HBase binary directory to your command-line path. For
example:
% export HBASE_HOME=/home/hbase/hbase-x.y.z
% export PATH=$PATH:$HBASE_HOME/bin
To get the list of HBase options, type:
% hbase
Usage: hbase <command>
where <command> is one of:
run the HBase shell
run an HBase HMaster node
run an HBase HRegionServer node
run a Zookeeper server
run an HBase REST server
run an HBase Thrift server
run an HBase Avro server
an hbase.rootdir
run the hbase 'fsck' tool
run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
Test Drive
To start a temporary instance of HBase that uses the /tmp directory on the local filesystem for persistence, type:
This will launch a standalone HBase instance that persists to the local filesystem; by
default, HBase will write to /tmp/hbase-${USERID}.
To administer your HBase instance, launch the HBase shell by typing:
% hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.89.0-SNAPSHOT, ra4ea1a9a7b074a2e5b7b24f761302d4ea28ed1b2, Sun Jul 18 15:01:50 PDT 2010
hbase(main):001:0>
This will bring up a JRuby IRB interpreter that has had some HBase-specific
commands added to it. Type help and then RETURN to see the list of shell
commands grouped into categories. Type help COMMAND_GROUP for help by
category or help COMMAND for help on a specific command and example usage.
Commands use Ruby formatting to specify lists and dictionaries. See the end of the
main help screen for a quick tutorial.
Now let us create a simple table, add some data, and then clean up.
To create a table named test with a single column family named data using defaults
for table and column family attributes, enter:
hbase(main):007:0> create 'test', 'data'
0 row(s) in 1.3066 seconds
If the previous command does not complete successfully, and the shell displays an
error and a stack trace, your install was not successful. Check the master logs under
the HBase logs directory—the default location for the logs directory is
${HBASE_HOME}/logs—for a clue as to where things went awry. See the help
output for examples of adding table and column family attributes when specifying a
schema.
To prove the new table was created successfully, run the list command. This will
output all tables in user space:
hbase(main):019:0> list
test
1 row(s) in 0.1485 seconds
To insert data into three different rows and columns in the data column family, and
then list the table content, do the following:
hbase(main):021:0> put 'test', 'row1', 'data:1', 'value1'
0 row(s) in 0.0454 seconds
hbase(main):022:0> put 'test', 'row2', 'data:2', 'value2'
0 row(s) in 0.0035 seconds
hbase(main):023:0> put 'test', 'row3', 'data:3', 'value3'
0 row(s) in 0.0090 seconds
hbase(main):024:0> scan 'test'
ROW     COLUMN+CELL
 row1   column=data:1, timestamp=1240148026198, value=value1
 row2   column=data:2, timestamp=1240148040035, value=value2
 row3   column=data:3, timestamp=1240148047497, value=value3
3 row(s) in 0.0825 seconds
Notice how we added three new columns without changing the schema.
To remove the table, you must first disable it before dropping it:
hbase(main):025:0> disable 'test'
09/04/19 06:40:13 INFO client.HBaseAdmin: Disabled test
0 row(s) in 6.0426 seconds
hbase(main):026:0> drop 'test'
09/04/19 06:40:17 INFO client.HBaseAdmin: Deleted test
0 row(s) in 0.0210 seconds
hbase(main):027:0> list
0 row(s) in 2.0645 seconds
Shut down your HBase instance by running:
% stop-hbase.sh
To learn how to set up a distributed HBase and point it at a running HDFS, see the
Getting Started section of the HBase documentation.
There are a number of client options for interacting with an HBase cluster.
HBase, like Hadoop, is written in Java. Example 13-1 shows how you would perform in
Java the shell operations listed previously in “Test Drive”.
Example 13-1. Basic table administration and access
public class ExampleClient {
  public static void main(String[] args) throws IOException {
    Configuration config = HBaseConfiguration.create();

    // Create table
    HBaseAdmin admin = new HBaseAdmin(config);
    HTableDescriptor htd = new HTableDescriptor("test");
    HColumnDescriptor hcd = new HColumnDescriptor("data");
    htd.addFamily(hcd);
    admin.createTable(htd);
    byte [] tablename = htd.getName();
    HTableDescriptor [] tables = admin.listTables();
    // Fail unless exactly one table exists and it is the one just created
    if (tables.length != 1 || !Bytes.equals(tablename, tables[0].getName())) {
      throw new IOException("Failed create of table");
    }

    // Run some operations -- a put, a get, and a scan -- against the table.
    HTable table = new HTable(config, tablename);
    byte [] row1 = Bytes.toBytes("row1");
    Put p1 = new Put(row1);
    byte [] databytes = Bytes.toBytes("data");
    p1.add(databytes, Bytes.toBytes("1"), Bytes.toBytes("value1"));
    table.put(p1);

    Get g = new Get(row1);
    Result result = table.get(g);
    System.out.println("Get: " + result);

    Scan scan = new Scan();
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result scannerResult: scanner) {
        System.out.println("Scan: " + scannerResult);
      }
    } finally {
      scanner.close();
    }

    // Drop the table: it must be disabled before it can be deleted
    admin.disableTable(tablename);
    admin.deleteTable(tablename);
  }
}
This class has a main method only. For the sake of brevity, we do not include the
package name or imports. In this class, we first ask the
org.apache.hadoop.hbase.HBaseConfiguration class to create a Configuration instance. It will
return a Configuration that has read HBase configuration from the hbase-site.xml and
hbase-default.xml files found on the program’s classpath. This Configuration is
subsequently used to create instances of HBaseAdmin and HTable, two classes
found in the org.apache.hadoop.hbase.client Java package. HBaseAdmin is used for
administering your HBase cluster, for adding and dropping tables. HTable is used to
access a specific table.
The Configuration instance points these classes at the cluster the code is to work
against. To create a table, we need to first create an instance of HBaseAdmin and
then ask it to create the table named test with a single column family named data. In
our example, our table schema is the default. Use methods on
org.apache.hadoop.hbase.HColumnDescriptor to change the table schema. The
code next asserts the table was actually created and then it moves to run operations
against the just-created table.
To operate on a table, we need an instance of org.apache.hadoop.hbase.client.HTable,
which we construct by passing it our Configuration instance and the name of the table
we want to operate on. After creating an HTable, we then create an instance of
org.apache.hadoop.hbase.client.Put to put a single cell value of value1 into a row
named row1 on the column named data:1. (The column name is specified in two
parts: the column family name as bytes, databytes in the code above, and then the
column family qualifier, specified as Bytes.toBytes("1").) Next we create an
org.apache.hadoop.hbase.client.Get, do a get of the just-added cell, and then use an
org.apache.hadoop.hbase.client.Scan to scan over the just-created table, printing
out what we find.
Finally, we clean up by first disabling the table and then deleting it. A table must be
disabled before it can be dropped.
HBase classes and utilities in the org.apache.hadoop.hbase.mapreduce package
facilitate using HBase as a source and/or sink in MapReduce jobs. The
TableInputFormat class makes splits on region boundaries so maps are handed a
single region to work on. The
TableOutputFormat will write the result of reduce into HBase. The RowCounter class
in Example 13-2 can be found in the HBase mapreduce package. It runs a map task
to count rows using TableInputFormat.
Example 13-2. A MapReduce application to count the number of rows in an HBase table
public class RowCounter {
  /** Name of this 'program'. */
  static final String NAME = "rowcounter";

  static class RowCounterMapper
      extends TableMapper<ImmutableBytesWritable, Result> {
    /** Counter enumeration to count the actual rows. */
    public static enum Counters {ROWS}

    public void map(ImmutableBytesWritable row, Result values,
        Context context)
        throws IOException {
      for (KeyValue value: values.list()) {
        if (value.getValue().length > 0) {
          // Count the row once, on its first non-empty value
          context.getCounter(Counters.ROWS).increment(1);
          break;
        }
      }
    }
  }

  public static Job createSubmittableJob(Configuration conf, String[] args)
      throws IOException {
    String tableName = args[0];
    Job job = new Job(conf, NAME + "_" + tableName);
    job.setJarByClass(RowCounter.class);
    // Columns are space delimited
    StringBuilder sb = new StringBuilder();
    final int columnoffset = 1;
    for (int i = columnoffset; i < args.length; i++) {
      if (i > columnoffset) {
        sb.append(" ");
      }
      sb.append(args[i]);
    }
    Scan scan = new Scan();
    scan.setFilter(new FirstKeyOnlyFilter());
    if (sb.length() > 0) {
      for (String columnName :sb.toString().split(" ")) {
        String [] fields = columnName.split(":");
        if(fields.length == 1) {
          scan.addFamily(Bytes.toBytes(fields[0]));
        } else {
          scan.addColumn(Bytes.toBytes(fields[0]), Bytes.toBytes(fields[1]));
        }
      }
    }
    // Second argument is the table name.
    job.setOutputFormatClass(NullOutputFormat.class);
    TableMapReduceUtil.initTableMapperJob(tableName, scan,
        RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    return job;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 1) {
      System.err.println("ERROR: Wrong number of parameters: " + args.length);
      System.err.println("Usage: RowCounter <tablename> [<column1> <column2>...]");
      System.exit(-1);
    }
    Job job = createSubmittableJob(conf, otherArgs);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
This class uses GenericOptionsParser for parsing command-line arguments. The
RowCounterMapper inner class extends the HBase TableMapper abstract class, a
specialization of org.apache.hadoop.mapreduce.Mapper that sets the map input
types passed by TableInputFormat.
The createSubmittableJob() method parses arguments added to the configuration
that were passed on the command line to specify the table and columns we are to
run RowCounter against. The column names are used to configure an instance of
org.apache.hadoop.hbase.client.Scan, a scan object that will be passed through to
TableInputFormat and used to constrain what our Mapper sees. Notice how we set
a filter, an instance of org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter, on the
scan. This filter instructs the server to short-circuit when running server-side, doing
no more than verifying that a row has an entry before returning. This speeds up the
row count.
The createSubmittableJob() method then invokes the
TableMapReduceUtil.initTableMapperJob() utility method, which, among other things
such as setting the map class to use, sets the input format to TableInputFormat. The
map is simple. It checks for empty values. If empty, it doesn’t count the row.
Otherwise, it increments Counters.ROWS by one.
Avro, REST, and Thrift
HBase ships with Avro, REST, and Thrift interfaces. These are useful when the interacting application is written in a language other than Java. In all cases, a Java server
hosts an instance of the HBase client brokering application Avro, REST, and Thrift
requests in and out of the HBase cluster. This extra work proxying requests and responses means these interfaces are slower than using the Java client directly.
To put up a stargate instance (stargate is the name for the HBase REST service),
start it using the following command:
% hbase-daemon.sh start rest
This will start a server instance, by default on port 8080, background it, and catch
any emissions by the server in logfiles under the HBase logs directory.
Clients can ask for the response to be formatted as JSON, Google’s protobufs, or as
XML, depending on how the client HTTP Accept header is set. See the REST wiki
page for documentation and examples of making REST client requests. To stop the
REST server, type:
% hbase-daemon.sh stop rest
Similarly, start a Thrift service by putting up a server to field Thrift clients by running
the following:
% hbase-daemon.sh start thrift
This will start the server instance, by default on port 9090, background it, and catch
any emissions by the server in logfiles under the HBase logs directory. The HBase
Thrift documentation notes the Thrift version used for generating classes. The HBase
Thrift IDL can be found at
src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift in the HBase source
code. To stop the Thrift server, type:
% hbase-daemon.sh stop thrift
The Avro server is started and stopped in the same manner as you’d start and stop
the Thrift or REST services. The Avro server by default uses port 9090 (the same as
the Thrift server, although you wouldn’t normally run both).
Although HDFS and MapReduce are powerful tools for processing batch operations
over large datasets, they do not provide ways to read or write individual records efficiently. In this example, we’ll explore using HBase as the tool to fill this gap. The
existing weather dataset described in previous chapters contains observations for
tens of thousands of stations over 100 years and this data is growing without bound.
In this example, we will build a simple web interface that allows a user to navigate
the different stations and page through their historical temperature observations in
time order. For the sake of this example, let us allow that the dataset is massive, that
the observations run to the billions, and that the rate at which temperature updates
arrive is significant—say hundreds to thousands of updates a second from around
the world across the whole range of weather stations. Also, let us allow that it is a
requirement that the web application must display the most up-to-date observation
within a second or so of receipt.
The first size requirement should preclude our use of a simple RDBMS instance and
make HBase a candidate store. The second latency requirement rules out plain
HDFS. A MapReduce job could build initial indices that allowed random-access over
all of the observation data, but keeping up this index as the updates arrived is not
what HDFS and MapReduce are good at.
In our example, there will be two tables:
stations
This table holds station data. Let the row key be the stationid. Let this table have a
column family info that acts as a key/value dictionary for station information. Let the
dictionary keys be the column names info:name, info:location, and info:description.
This table is static and the info family, in this case, closely mirrors a typical RDBMS
table design.
observations
This table holds temperature observations. Let the row key be a composite key of
stationid + reverse order timestamp. Give this table a column family data that will
contain one column airtemp with the observed temperature as the column value.
Our choice of schema is derived from how we want to most efficiently read from
HBase. Rows and columns are stored in increasing lexicographical order. Though
there are facilities for secondary indexing and regular expression matching, they
come at a performance penalty. It is vital that you understand how you want to most
efficiently query your data in order to most effectively store and access it.
For the stations table, the choice of stationid as key is obvious because we will
always access information for a particular station by its id. The observations table,
however, uses a composite key that adds the observation timestamp at the end. This
will group all observations for a particular station together, and by using a reverse
order timestamp (Long.MAX_VALUE - epoch) and storing it as binary, observations
for each station will be ordered with most recent observation first.
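The effect of the reverse-order timestamp can be sketched with nothing but the JDK. The class below is a hypothetical illustration, not HBase code: it assumes a 12-character station ID (the sample IDs are made up for the demonstration), builds keys of the form station ID followed by Long.MAX_VALUE - epoch as big-endian bytes, and sorts them as unsigned bytes to mimic a lexicographically ordered store.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Plain-JDK illustration; class and method names here are hypothetical.
class ReverseTimestampKeyDemo {
  static final int STATION_ID_LENGTH = 12; // assumed fixed-width station ID

  // Key layout: <station_id bytes><big-endian (Long.MAX_VALUE - epochMillis)>
  static byte[] makeKey(String stationId, long epochMillis) {
    byte[] id = stationId.getBytes(StandardCharsets.US_ASCII);
    if (id.length != STATION_ID_LENGTH) {
      throw new IllegalArgumentException("station ID must be 12 ASCII characters");
    }
    ByteBuffer buf = ByteBuffer.allocate(STATION_ID_LENGTH + Long.BYTES);
    buf.put(id);
    buf.putLong(Long.MAX_VALUE - epochMillis);
    return buf.array();
  }

  // Recover the original timestamp from the last eight bytes of the key.
  static long timestampOf(byte[] key) {
    return Long.MAX_VALUE - ByteBuffer.wrap(key).getLong(STATION_ID_LENGTH);
  }

  // Insert in arbitrary order, then read the labels back in unsigned byte order.
  static List<String> sortedLabels() {
    Comparator<byte[]> byteOrder = Arrays::compareUnsigned;
    TreeMap<byte[], String> rows = new TreeMap<>(byteOrder);
    rows.put(makeKey("011990-99999", 1_000L), "oldest");
    rows.put(makeKey("011990-99999", 3_000L), "newest");
    rows.put(makeKey("011990-99999", 2_000L), "middle");
    rows.put(makeKey("012650-99999", 5_000L), "other station");
    return new ArrayList<>(rows.values());
  }

  public static void main(String[] args) {
    System.out.println(sortedLabels());
  }
}
```

Rows for one station stay adjacent, and within a station the largest timestamp produces the smallest key, so a scan that starts at the station prefix meets the most recent observation first.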
In the shell, you would define your tables as follows:
hbase(main):036:0> create 'stations', {NAME => 'info', VERSIONS => 1}
0 row(s) in 0.1304 seconds
hbase(main):037:0> create 'observations', {NAME => 'data', VERSIONS => 1}
0 row(s) in 0.1332 seconds
In both cases, we are interested only in the latest version of a table cell, so set
VERSIONS to 1. The default is 3.
Loading Data
There are a relatively small number of stations, so their static data is easily inserted
using any of the available interfaces. However, let’s assume that there are billions of
individual observations to be loaded. This kind of import is normally an extremely
complex and long-running database operation, but MapReduce and HBase’s
distribution model allow us to make full use of the cluster. Copy the raw input data
onto HDFS and then run a MapReduce job that can read the input and write to HBase.
Example 13-3. A MapReduce application to import temperature data from
HDFS into an HBase table
public class HBaseTemperatureImporter extends Configured implements Tool {
  // Inner-class for map
  static class HBaseTemperatureMapper<K, V> extends MapReduceBase
      implements Mapper<LongWritable, Text, K, V> {
    private NcdcRecordParser parser = new NcdcRecordParser();
    private HTable table;

    public void map(LongWritable key, Text value,
        OutputCollector<K, V> output, Reporter reporter)
        throws IOException {
      parser.parse(value.toString());
      if (parser.isValidTemperature()) {
        byte[] rowKey = RowKeyConverter.makeObservationRowKey(
            parser.getStationId(), parser.getObservationDate().getTime());
        Put p = new Put(rowKey);
        p.add(HBaseTemperatureCli.DATA_COLUMNFAMILY,
            HBaseTemperatureCli.AIRTEMP_QUALIFIER,
            Bytes.toBytes(parser.getAirTemperature()));
        table.put(p);
      }
    }

    public void configure(JobConf jc) {
      super.configure(jc);
      // Create the HBase table client once up-front and keep it around
      // rather than create on each map invocation.
      try {
        this.table = new HTable(new HBaseConfiguration(jc), "observations");
      } catch (IOException e) {
        throw new RuntimeException("Failed HTable construction", e);
      }
    }

    public void close() throws IOException {
      super.close();
      table.close();
    }
  }

  public int run(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("Usage: HBaseTemperatureImporter <input>");
      return -1;
    }
    JobConf jc = new JobConf(getConf(), getClass());
    FileInputFormat.addInputPath(jc, new Path(args[0]));
    jc.setMapperClass(HBaseTemperatureMapper.class);
    jc.setNumReduceTasks(0);
    jc.setOutputFormat(NullOutputFormat.class);
    JobClient.runJob(jc);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new HBaseConfiguration(),
        new HBaseTemperatureImporter(), args);
    System.exit(exitCode);
  }
}
HBaseTemperatureImporter has an inner class named HBaseTemperatureMapper
that is like the MaxTemperatureMapper class from Chapter 5. The outer class
implements Tool and does the setup to launch the HBaseTemperatureMapper inner
class. HBaseTemperatureMapper takes the same input as MaxTemperatureMapper
and does the same parse, using the NcdcRecordParser introduced in Chapter 5, to
check for valid temperatures. But rather than add valid temperatures to the output
collector as MaxTemperatureMapper does, it adds them to the observations HBase
table, in the data:airtemp column. (We are using static defines for data and airtemp
imported from the HBaseTemperatureCli class described later.) In the configure()
method, we create an HTable instance once against the observations table and use
it afterward in map invocations talking to HBase. Finally, we call close on our HTable
instance to flush out any write buffers not yet cleared.
The row key used is created in the makeObservationRowKey() method on
RowKeyConverter from the station ID and observation time:
public class RowKeyConverter {
  private static final int STATION_ID_LENGTH = 12;

  /**
   * @return A row key whose format is: <station_id> <reverse_order_epoch>
   */
  public static byte[] makeObservationRowKey(String stationId,
      long observationTime) {
    byte[] row = new byte[STATION_ID_LENGTH + Bytes.SIZEOF_LONG];
    Bytes.putBytes(row, 0, Bytes.toBytes(stationId), 0, STATION_ID_LENGTH);
    long reverseOrderEpoch = Long.MAX_VALUE - observationTime;
    Bytes.putLong(row, STATION_ID_LENGTH, reverseOrderEpoch);
    return row;
  }
}
The conversion takes advantage of the fact that the station ID is a fixed-length string.
The Bytes class used in makeObservationRowKey() is from the HBase utility
package. It includes methods for converting between byte arrays and common Java
and Hadoop types. In makeObservationRowKey(), the Bytes.putLong() method is
used to fill the key byte array. The Bytes.SIZEOF_LONG constant is used for sizing
and positioning in the row key array.
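To make the byte layout concrete, here is a minimal JDK-only sketch of what writing a long into a key array at an offset involves. It assumes big-endian (most significant byte first) encoding, which is what keeps numeric order and byte order aligned for non-negative values; the class and its methods are hypothetical stand-ins written for illustration, not the HBase Bytes implementation.

```java
import java.nio.ByteBuffer;

// Hypothetical helper mirroring the idea of putLong()/toLong() on a byte array.
class BigEndianLongDemo {
  static final int SIZEOF_LONG = Long.BYTES; // 8

  // Write val into bytes[offset .. offset+8), most significant byte first.
  // Returns the offset just past the written value.
  static int putLong(byte[] bytes, int offset, long val) {
    for (int i = offset + SIZEOF_LONG - 1; i >= offset; i--) {
      bytes[i] = (byte) val; // lowest remaining byte goes to the rightmost free slot
      val >>>= 8;
    }
    return offset + SIZEOF_LONG;
  }

  // Read eight bytes starting at offset back into a long.
  static long toLong(byte[] bytes, int offset) {
    long val = 0;
    for (int i = offset; i < offset + SIZEOF_LONG; i++) {
      val = (val << 8) | (bytes[i] & 0xFF);
    }
    return val;
  }

  public static void main(String[] args) {
    int stationIdLength = 12;                          // same width as the row key above
    byte[] row = new byte[stationIdLength + SIZEOF_LONG];
    long reversed = Long.MAX_VALUE - 1279490510000L;   // arbitrary sample timestamp
    putLong(row, stationIdLength, reversed);
    // Both reads should agree with the value written.
    System.out.println(toLong(row, stationIdLength) == reversed);
    System.out.println(ByteBuffer.wrap(row).getLong(stationIdLength) == reversed);
  }
}
```

The second check reads the same bytes through java.nio.ByteBuffer, whose default byte order is also big-endian, as a cross-check on the hand-written loop.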
We can run the program with the following:
% hbase HBaseTemperatureImporter input/ncdc/all
Optimization notes
Watch for the phenomenon where an import walks in lock-step through the table with
all clients in concert pounding one of the table’s regions (and thus, a single node),
then moving on to the next, and so on, rather than evenly distributing the load over
all regions. This is usually brought on by some interaction between sorted input and
how the splitter works. Randomizing the ordering of your row keys prior to insertion
may help. In our example, given the distribution of stationid values and how
TextInputFormat makes splits, the upload should be sufficiently distributed.
Only obtain one HTable instance per task. There is a cost to instantiating an HTable,
so if you do this for each insert, you may have a negative impact on performance,
hence our setup of HTable in the configure() step.
If a table is new, it will have only one region and initially all updates will be to this
single region until it splits. This will happen even if row keys are randomly distributed.
This startup phenomenon means uploads run slowly at first, until there are sufficient
regions distributed so all cluster members are able to participate in the upload. Do
not confuse this phenomenon with the lock-step behavior described in the first note.
By default, each HTable.put(put) actually performs the insert without any buffering.
You can disable the HTable auto-flush feature using HTable.setAutoFlush(false)
and then set the size of the configurable write buffer. When the inserts committed fill
the write buffer, it is then flushed. Remember, though, that at the end of each task you
must call HTable.flushCommits() manually, or HTable.close(), which calls through to
HTable.flushCommits(), to ensure that nothing is left unflushed in the buffer. You
could do this in an override of the mapper’s close() method.
HBase includes TableInputFormat and TableOutputFormat to help with
MapReduce jobs that source and sink HBase (see Example 13-2). One way to write
the previous example would have been to use MaxTemperatureMapper from
Chapter 5 as is but add a reducer task that takes the output of the
MaxTemperatureMapper and feeds it to HBase via TableOutputFormat.
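The write-buffer behavior mentioned in these notes can be illustrated without a cluster. The following is a simplified, hypothetical model (it is not the HTable implementation and makes no network calls): puts accumulate until their combined size reaches a limit, a flush sends them as one batch, and close() flushes whatever remains.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a client-side write buffer; names are illustrative only.
class BufferedWriterSketch implements AutoCloseable {
  private final long bufferLimitBytes;
  private final List<byte[]> pending = new ArrayList<>();
  private long pendingBytes = 0;
  private int flushCount = 0;   // number of simulated batched sends
  private int written = 0;      // number of cells that reached the "server"

  BufferedWriterSketch(long bufferLimitBytes) {
    this.bufferLimitBytes = bufferLimitBytes;
  }

  // Buffer the cell; flush automatically once the buffer limit is reached.
  void put(byte[] cell) {
    pending.add(cell);
    pendingBytes += cell.length;
    if (pendingBytes >= bufferLimitBytes) {
      flushCommits();
    }
  }

  // Send everything buffered so far as a single batch.
  void flushCommits() {
    if (pending.isEmpty()) {
      return;
    }
    written += pending.size(); // a real client would issue a batched request here
    pending.clear();
    pendingBytes = 0;
    flushCount++;
  }

  // Closing flushes, so nothing is left behind in the buffer.
  @Override
  public void close() {
    flushCommits();
  }

  int written() { return written; }
  int flushCount() { return flushCount; }

  public static void main(String[] args) {
    BufferedWriterSketch writer = new BufferedWriterSketch(32);
    for (int i = 0; i < 10; i++) {
      writer.put(new byte[10]);
    }
    System.out.println("written before close: " + writer.written());
    writer.close();
    System.out.println("written after close: " + writer.written()
        + ", flushes: " + writer.flushCount());
  }
}
```

With a 32-byte limit and ten 10-byte cells, two cells are still sitting in the buffer when the loop ends, which is the situation the final flush in a mapper's close() is meant to cover.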
Web Queries
To implement the web application, we will use the HBase Java API directly. Here it
becomes clear how important your choice of schema and storage format is.
The simplest query will be to get the static station information. This type of query is
simple in a traditional database, but HBase gives you additional control and
flexibility. Using the info family as a key/value dictionary (column names as keys,
column values as values), the code would look like this:
public Map<String, String> getStationInfo(HTable table, String stationId)
    throws IOException {
  Get get = new Get(Bytes.toBytes(stationId));
  // Restrict the get to the info column family
  get.addFamily(INFO_COLUMNFAMILY);
  Result res = table.get(get);
  if (res == null) {
    return null;
  }
  Map<String, String> resultMap = new HashMap<String, String>();
  resultMap.put("name", getValue(res, INFO_COLUMNFAMILY, NAME_QUALIFIER));
  resultMap.put("location", getValue(res, INFO_COLUMNFAMILY,
      LOCATION_QUALIFIER));
  resultMap.put("description", getValue(res, INFO_COLUMNFAMILY,
      DESCRIPTION_QUALIFIER));
  return resultMap;
}

private static String getValue(Result res, byte [] cf, byte [] qualifier) {
  byte [] value = res.getValue(cf, qualifier);
  return value == null? "": Bytes.toString(value);
}
In this example, getStationInfo() takes an HTable instance and a station ID. To get
the station info, we use HTable.get(), passing a Get instance configured to get all the
column values for the row identified by the station ID in the defined column family,
INFO_COLUMNFAMILY.
The get() results are returned in a Result. It contains the row, and you can fetch cell
values by stipulating the column cell wanted. The getStationInfo() method
converts the Result into a more friendly Map of String keys and values.
We can already see how there is a need for utility functions when using HBase.
There are an increasing number of abstractions being built atop HBase to deal
with this low-level interaction, but it’s important to understand how this works and
how storage choices make a difference.
One of the strengths of HBase over a relational database is that you don’t have to
prespecify the columns. So, in the future, if each station now has at least these
three attributes but there are hundreds of optional ones, we can just insert them
without modifying the schema. Your application’s reading and writing code would
of course need to be changed. The example code might change in this case to
looping through Result rather than grabbing each value explicitly.
We will make use of HBase scanners for retrieval of observations in our web
application.
Here we are after a Map<ObservationTime, ObservedTemp> result. We will use a
NavigableMap<Long, Integer> because it is sorted and has a descendingMap()
method, so we can access observations in either ascending or descending order.
The code is in Example 13-4.
Example 13-4. Methods for retrieving a range of rows of weather station
observations from an HBase table
public NavigableMap<Long, Integer> getStationObservations(HTable table,
    String stationId, long maxStamp, int maxCount) throws IOException {
  byte[] startRow = RowKeyConverter.makeObservationRowKey(stationId, maxStamp);
  NavigableMap<Long, Integer> resultMap = new TreeMap<Long, Integer>();
  Scan scan = new Scan(startRow);
  scan.addColumn(DATA_COLUMNFAMILY, AIRTEMP_QUALIFIER);
  ResultScanner scanner = table.getScanner(scan);
  Result res = null;
  int count = 0;
  try {
    while ((res = scanner.next()) != null && count++ < maxCount) {
      byte[] row = res.getRow();
      byte[] value = res.getValue(DATA_COLUMNFAMILY, AIRTEMP_QUALIFIER);
      // Undo the reverse-order encoding to recover the observation time
      Long stamp = Long.MAX_VALUE -
          Bytes.toLong(row, row.length - Bytes.SIZEOF_LONG, Bytes.SIZEOF_LONG);
      Integer temp = Bytes.toInt(value);
      resultMap.put(stamp, temp);
    }
  } finally {
    scanner.close();
  }
  return resultMap;
}

public NavigableMap<Long, Integer> getStationObservations(HTable table,
    String stationId) throws IOException {
  return getStationObservations(table, stationId, Long.MAX_VALUE, 10);
}
The getStationObservations() method takes a station ID and a range defined by
maxStamp and a maximum number of rows (maxCount). Note that although the scan
visits rows newest first, the NavigableMap that is returned is a TreeMap keyed by
timestamp, so it iterates in ascending time order. If you want to read through
it newest first, you would make use of NavigableMap.descendingMap().
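As a small JDK-only illustration of the two views, the sketch below fills a TreeMap in the order a newest-first scan would deliver entries and prints both iteration orders. A TreeMap with natural ordering sorts by key regardless of insertion order; the timestamps and temperatures are made-up sample values.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative only: sample data standing in for scanned observations.
class ObservationOrderDemo {
  static NavigableMap<Long, Integer> sample() {
    NavigableMap<Long, Integer> observations = new TreeMap<>();
    // Inserted newest first, as a reverse-timestamp scan would return them.
    observations.put(3000L, 22);
    observations.put(2000L, 11);
    observations.put(1000L, -5);
    return observations;
  }

  public static void main(String[] args) {
    NavigableMap<Long, Integer> observations = sample();
    // Natural key order: smallest (oldest) timestamp first.
    System.out.println(observations.keySet());                 // [1000, 2000, 3000]
    // Reversed view of the same map: newest timestamp first.
    System.out.println(observations.descendingMap().keySet()); // [3000, 2000, 1000]
  }
}
```

descendingMap() returns a view rather than a copy, so switching direction costs nothing extra for a web page that needs to flip the sort order.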
HBase scanners are like cursors in a traditional database or Java iterators, except—
unlike the latter—they have to be closed after use. Scanners return rows in order.
Users obtain a scanner on an HBase table by calling HTable.getScanner(scan)
where the scan parameter is a configured instance of a Scan object. In the Scan
instance, you can pass the row at which to start and stop the scan, which columns in
a row to return in the row result, and optionally, a filter to run on the server side. The
ResultScanner interface, which is returned when you call HTable.getScanner(), is as
follows:
public interface ResultScanner extends Closeable, Iterable<Result> {
  public Result next() throws IOException;
  public Result [] next(int nbRows) throws IOException;
  public void close();
}
You can ask for the next row’s results or a number of rows. Each invocation of next()
involves a trip back to the regionserver, so grabbing a bunch of rows at once can
make for significant performance savings. The
hbase.client.scanner.caching configuration option is set to 1 by default. You can also
set how much to cache/prefetch on the Scan instance itself. Scanners will, under the
covers, fetch this many results at a time, bringing them client side, and returning to
the server to fetch the next batch only after the current batch has been exhausted.
Higher caching values will enable faster scanning but will eat up more memory in the
client. Also, avoid setting the caching so high that the time spent processing the
batch client-side exceeds the scanner lease period. If a client fails to check back with
the server before the scanner lease expires, the server will go ahead and garbage
collect resources consumed by the scanner server-side. The default scanner lease is
60 seconds, and can be changed through configuration. Clients
will see an UnknownScannerException if the scanner lease has expired.
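The trade-off between caching and round trips can be sketched with a toy scanner. This is a hypothetical, in-memory model rather than the HBase client: it assumes each refill of the client-side cache costs exactly one trip to the server and ignores any extra call a real client may make to detect the end of the scan.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a caching scanner; not the HBase client API.
class PrefetchingScannerSketch {
  private final int totalRows;  // rows available on the simulated server
  private final int caching;    // rows fetched per simulated round trip
  private final Deque<Integer> cache = new ArrayDeque<>();
  private int nextServerRow = 0;
  private int roundTrips = 0;

  PrefetchingScannerSketch(int totalRows, int caching) {
    if (caching < 1) {
      throw new IllegalArgumentException("caching must be at least 1");
    }
    this.totalRows = totalRows;
    this.caching = caching;
  }

  // Returns the next row number, or null when the scan is exhausted.
  Integer next() {
    if (cache.isEmpty() && nextServerRow < totalRows) {
      roundTrips++; // refill the cache with one simulated trip
      for (int i = 0; i < caching && nextServerRow < totalRows; i++) {
        cache.add(nextServerRow++);
      }
    }
    return cache.poll();
  }

  // Scan everything and report how many trips it took.
  static int roundTripsFor(int totalRows, int caching) {
    PrefetchingScannerSketch scanner = new PrefetchingScannerSketch(totalRows, caching);
    while (scanner.next() != null) {
      // consume rows
    }
    return scanner.roundTrips;
  }

  public static void main(String[] args) {
    System.out.println("caching=1: " + roundTripsFor(10, 1) + " trips");
    System.out.println("caching=4: " + roundTripsFor(10, 4) + " trips");
  }
}
```

In this model ten rows take ten trips with a caching value of 1 and three trips with a value of 4, at the price of holding up to four rows in client memory at a time.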
The advantage of storing things as Long.MAX_VALUE - stamp may not be clear in
the previous example. It has more use when you want to get the newest
observations for a given offset and limit, which is often the case in web applications.
If the observations were stored with the actual stamps, we would be able to get only
the oldest observations for a given offset and limit efficiently. Getting the newest
would mean getting all of them and then grabbing them off the end. One of the prime
reasons for moving from RDBMS to HBase is to allow for these types of “early-out”
scenarios.
HBase Versus RDBMS
HBase and other column-oriented databases are often compared to more
traditional and popular relational databases or RDBMSs. Although they differ
dramatically in their implementations and in what they set out to accomplish, the
fact that they are potential solutions to the same problems means that despite
their enormous differences, the comparison is a fair one to make.
As described previously, HBase is a distributed, column-oriented data storage
system. It picks up where Hadoop left off by providing random reads and writes on
top of HDFS. It has been designed from the ground up with a focus on scale in
every direction: tall in numbers of rows (billions), wide in numbers of columns
(millions), and to be horizontally partitioned and replicated across thousands of
commodity nodes automatically. The table schemas mirror the physical storage,
creating a system for efficient data structure serialization, storage, and retrieval.
The burden is on the application developer to make use of this storage and
retrieval in the right way.
Strictly speaking, an RDBMS is a database that follows Codd’s 12 Rules. Typical
RDBMSs are fixed-schema, row-oriented databases with ACID properties and a
sophisticated SQL query engine. The emphasis is on strong consistency,
referential integrity, abstraction from the physical layer, and complex queries
through the SQL language. You can easily create secondary indexes, perform
complex inner and outer joins, count, sum, sort, group, and page your data across
a number of tables, rows, and columns.
For a majority of small- to medium-volume applications, there is no substitute for
the ease of use, flexibility, maturity, and powerful feature set of available open
source RDBMS solutions like MySQL and PostgreSQL. However, if you need to
scale up in terms of dataset size, read/write concurrency, or both, you’ll soon find
that the conveniences of an RDBMS come at an enormous performance penalty
and make distribution inherently difficult. The scaling of an RDBMS usually
involves breaking Codd’s rules, loosening ACID restrictions, forgetting
conventional DBA wisdom, and on the way losing most of the desirable properties
that made relational databases so convenient in the first place.
Successful Service
Here is a synopsis of how the typical RDBMS scaling story runs. The following list
presumes a successful growing service:
Initial public launch
Move from local workstation to shared, remote hosted MySQL instance with a
well-defined schema.
Service becomes more popular; too many reads hitting the database
Add memcached to cache common queries. Reads are now no longer strictly
ACID; cached data must expire.
Service continues to grow in popularity; too many writes hitting the database
Scale MySQL vertically by buying a beefed up server with 16 cores, 128 GB of
RAM, and banks of 15 k RPM hard drives. Costly.
New features increase query complexity; now we have too many joins
Denormalize your data to reduce joins. (That’s not what they taught me in DBA
school!)
Rising popularity swamps the server; things are too slow
Stop doing any server-side computations.
Some queries are still too slow
Periodically prematerialize the most complex queries, try to stop joining in most
cases.
Reads are OK, but writes are getting slower and slower
Drop secondary indexes and triggers (no indexes?).
At this point, there are no clear solutions for how to solve your scaling problems. In
any case, you’ll need to begin to scale horizontally. You can attempt to build some
type of partitioning on your largest tables, or look into some of the commercial
solutions that provide multiple master capabilities.
Countless applications, businesses, and websites have successfully achieved
scalable, fault-tolerant, and distributed data systems built on top of RDBMSs and are
likely using many of the previous strategies. But what you end up with is something
that is no longer a true RDBMS, sacrificing features and conveniences for
compromises and complexities. Any form of slave replication or external caching
introduces weak consistency into your now denormalized data. The inefficiency of
joins and secondary indexes means almost all queries become primary key lookups.
A multiwriter setup likely means no real joins at all and distributed transactions are a
nightmare. There’s now an incredibly complex network topology to manage with an
entirely separate cluster for caching. Even with this system and the compromises
made, you will still worry about your primary master crashing and the daunting
possibility of having 10 times the data and 10 times the load in a few months.
Enter HBase, which has the following characteristics:
No real indexes
Rows are stored sequentially, as are the columns within each row. Therefore, no
issues with index bloat, and insert performance is independent of table size.
Automatic partitioning
As your tables grow, they will automatically be split into regions and distributed
across all available nodes.
Scale linearly and automatically with new nodes
Add a node, point it to the existing cluster, and run the regionserver. Regions will
automatically rebalance and load will spread evenly.
Commodity hardware
Clusters are built on $1,000–$5,000 nodes rather than $50,000 nodes. RDBMSs
are I/O hungry, requiring more costly hardware.
Fault tolerance
Lots of nodes means each is relatively insignificant. No need to worry about individual node downtime.
Batch processing
MapReduce integration allows fully parallel, distributed jobs against your data
with locality awareness.
If you stay up at night worrying about your database (uptime, scale, or speed), then
you should seriously consider making a jump from the RDBMS world to HBase.
Utilize a solution that was intended to scale rather than a solution based on stripping
down and throwing money at what used to work. With HBase, the software is free,
the hardware is cheap, and the distribution is intrinsic.
Use Case: HBase at is a real-time news aggregator and social sharing platform. With a
broad feature set, we started out with a complex implementation on top of
PostgreSQL. It’s a terrific product with a great community and a beautiful codebase.
We tried every trick in the book to keep things fast as we scaled, going so far as to
modify the PostgreSQL code directly to suit our needs. Originally taking advantage
of all RDBMS goodies, we found that eventually, one by one, we had to let them all
go. Along the way, our entire team became the DBA.
Very large items tables
At first, this was a single items table, but the high number of secondary indexes
made inserts and updates very slow. We started to divide items up into several
one-to-one link tables to store other information, separating static fields from dynamic
ones, grouping fields based on how they were queried, and denormalizing everything
along the way. Even with these changes, single updates required rewriting the entire
record, so tracking statistics on items was difficult to scale. The rewriting of records
and having to update indexes along the way are intrinsic properties of the RDBMS
we were using. They could not be decoupled. We partitioned our tables, which was
not too difficult because of the natural partition of time, but the complexity got out of
hand fast. We needed another solution!
Very large sort merges
Performing sorted merges of time-ordered lists is common in many Web 2.0 applications. An example SQL query might look like this:
SELECT id, stamp, type FROM streams
WHERE type IN ('type1','type2','type3','type4',...,'typeN')
ORDER BY stamp DESC LIMIT 10 OFFSET 0;
Assuming id is a primary key on streams, and that stamp and type have secondary
indexes, an RDBMS query planner treats this query as follows:
MERGE (
SELECT id, stamp, type FROM streams
WHERE type = 'type1' ORDER BY stamp DESC,
...,
SELECT id, stamp, type FROM streams
WHERE type = 'typeN' ORDER BY stamp DESC
) ORDER BY stamp DESC LIMIT 10 OFFSET 0;
The problem here is that we are after only the top 10 IDs, but the query planner
actually materializes an entire merge and then limits at the end. A simple heapsort
across each of the types would allow you to “early out” once you have the top 10. In
our case, each type could have tens of thousands of IDs in it, so materializing the
entire list and sorting it was extremely slow and unnecessary. We actually went so
far as to write a custom PL/Python script that performed a heapsort using a series of
queries like the following:
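SELECT id, stamp, type FROM streams
WHERE type = 'typeN'
ORDER BY stamp DESC LIMIT 1 OFFSET 0;
If we ended up taking from typeN (it was the next most recent in the heap), we would
run another query to fetch the next item for that type:
SELECT id, stamp, type FROM streams
WHERE type = 'typeN'
ORDER BY stamp DESC LIMIT 1 OFFSET 1;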
In nearly all cases, this outperformed the native SQL implementation and the query
planner’s strategy. In the worst cases for SQL, we were more than an order of
magnitude faster using the Python procedure. We found ourselves continually trying
to outsmart the query planner.
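The early-out idea can be sketched in plain Java, independent of any database. This is a minimal illustration and not the PL/Python procedure described above: it assumes each type's items are already available as a list sorted by stamp, newest first, and the names TopKMerge, Item, and topK are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class TopKMerge {
    // One entry of a per-type stream (hypothetical shape: id plus timestamp).
    static final class Item {
        final long id;
        final long stamp;
        Item(long id, long stamp) { this.id = id; this.stamp = stamp; }
    }

    // Cursor into one per-type list that is already sorted by stamp, newest first.
    private static final class Cursor {
        final List<Item> items;
        int pos;
        Cursor(List<Item> items) { this.items = items; }
        Item head() { return items.get(pos); }
    }

    // Returns the k newest items across all lists, stopping as soon as k are found.
    static List<Item> topK(List<List<Item>> sortedLists, int k) {
        // Max-heap ordered by the stamp at the head of each list.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
            (a, b) -> Long.compare(b.head().stamp, a.head().stamp));
        for (List<Item> list : sortedLists) {
            if (!list.isEmpty()) {
                heap.add(new Cursor(list));
            }
        }
        List<Item> result = new ArrayList<>();
        while (result.size() < k && !heap.isEmpty()) {
            Cursor c = heap.poll();      // the list whose head is the most recent
            result.add(c.head());
            c.pos++;                     // advance only the list we took from
            if (c.pos < c.items.size()) {
                heap.add(c);             // re-insert it with its new head
            }
        }
        return result;                   // the rest of each list is never visited
    }

    public static void main(String[] args) {
        List<List<Item>> lists = new ArrayList<>();
        lists.add(Arrays.asList(new Item(1, 90), new Item(2, 50)));
        lists.add(Arrays.asList(new Item(3, 80), new Item(4, 10)));
        for (Item item : topK(lists, 3)) {
            System.out.println(item.id + " @ " + item.stamp);
        }
    }
}
```

With N types and a limit of k, the loop touches at most k items plus one head per type, rather than materializing and sorting every item of every type.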
Life with HBase
Our RDBMS-based system was always capable of correctly implementing our
requirements; the issue was scaling. When you start to focus on scale and
performance rather than correctness, you end up short-cutting and optimizing for
your domain-specific use cases everywhere possible. Once you start implementing
your own solutions to your data problems, the overhead and complexity of an
RDBMS gets in your way. The abstraction from the storage layer and ACID
requirements are an enormous barrier and luxury that you cannot always afford
when building for scale. HBase is a distributed, column-oriented, sorted map store
and not much else. The only major part that is abstracted from the user is the
distribution, and that’s exactly what we don’t want to deal with. Business logic, on the
other hand, is very specialized and optimized. With HBase not trying to solve all of
our problems, we’ve been able to solve them better ourselves and rely on HBase for
scaling our storage, not our logic. It was an extremely liberating experience to be
able to focus on our applications and logic rather than the scaling of the data itself.
We currently have tables with hundreds of millions of rows and tens of thousands of
columns; the thought of storing billions of rows and millions of columns is exciting,
not scary.
In this section, we discuss some of the common issues users run into when running
an HBase cluster under load.
Up until HBase 0.20, HBase aligned its versioning with that of Hadoop. A particular
HBase version would run on any Hadoop that had a matching minor version, where
minor version in this context is considered the number between the periods (e.g., 20
is the minor version of HBase 0.20.5). HBase 0.20.5 would run on Hadoop
0.20.2, but HBase 0.19.5 would not run on Hadoop 0.20.0.
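For releases up to 0.20, the matching rule can be expressed as a small check. The following is only a sketch of the rule as stated here, not code from HBase or Hadoop; the class and method names are hypothetical.

```java
class VersionCheck {
    // Returns the number between the periods, e.g. 20 for "0.20.5".
    static int minorVersion(String version) {
        String[] parts = version.split("\\.");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Not a dotted version: " + version);
        }
        return Integer.parseInt(parts[1]);
    }

    // Pre-0.90 rule from the text: HBase runs on a Hadoop with the same minor version.
    static boolean minorVersionsMatch(String hbaseVersion, String hadoopVersion) {
        return minorVersion(hbaseVersion) == minorVersion(hadoopVersion);
    }

    public static void main(String[] args) {
        System.out.println(minorVersionsMatch("0.20.5", "0.20.2"));
        System.out.println(minorVersionsMatch("0.19.5", "0.20.0"));
    }
}
```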
With HBase 0.90, the version relationship was broken. The Hadoop release cycle
has slowed and no longer aligns with that of HBase developments. Also, the intent is
that now a particular HBase version can run on multiple versions of Hadoop. For
example, HBase 0.90.x will work with both Hadoop 0.20.x and 0.21.x.
This said, ensure you are running compatible versions of Hadoop and HBase. Check
the requirements section of your download. Incompatible versions will throw an exception complaining about the version mismatch, if you are lucky. If they cannot talk
to each other sufficiently to pass versions, you may see your HBase cluster hang
indefinitely, soon after startup. The mismatch exception or HBase hang can also
happen on upgrade if older versions of either HBase or Hadoop can still be found on
the classpath because of imperfect cleanup of the old software.
HBase’s use of HDFS is very different from how it’s used by MapReduce. In MapReduce, generally, HDFS files are opened, with their content streamed through a map
task and then closed. In HBase, data files are opened on cluster startup and kept
open so that we avoid paying the file open costs on each access. Because of this,
HBase tends to see issues not normally encountered by MapReduce clients:
Running out of file descriptors
Because we keep files open, on a loaded cluster, it doesn’t take long before we run
into system- and Hadoop-imposed limits. For instance, say we have a cluster that
has three nodes each running an instance of a datanode and a regionserver and
we’re running an upload into a table that is currently at 100 regions and 10 column
families. Allow that each column family has on average two flush files. Doing the
math, we can have 100 × 10 × 2, or 2,000, files open at any one time. Add to this
total miscellaneous other descriptors consumed by outstanding scanners and Java
libraries. Each open file consumes at least one descriptor over on the remote datanode. The default limit on the number of file descriptors per process is 1,024.
When we exceed the filesystem ulimit, we’ll see the complaint about Too many open
files in logs, but often you’ll first see indeterminate behavior in HBase. The fix
requires increasing the file descriptor ulimit count. You can verify that the HBase
process is running with sufficient file descriptors by looking at the first few lines of a
regionserver’s log. It emits vitals such as the JVM being used and environment
settings such as the file descriptor ulimit.
Running out of datanode threads
Similarly, the Hadoop datanode has an upper bound of 256 on the number of
threads it can run at any one time. Given the same table statistics quoted in the
preceding bullet, it’s easy to see how we can exceed this upper bound relatively
early, given that in the datanode as of this writing each open connection to a file
block consumes a thread. If you look in the datanode log, you’ll see a complaint like
xceiverCount 258 exceeds the limit of concurrent xcievers 256 but again, you’ll likely
see HBase act erratically before you encounter this log entry. Increase the
dfs.datanode.max.xcievers (note that the property name is misspelled) count in
HDFS and restart your cluster.
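The arithmetic behind both limits can be checked with a few lines of Java. This is a back-of-the-envelope sketch using the figures quoted above; it treats every open flush file as one descriptor and one datanode connection, which is a simplification, and the class and method names are hypothetical.

```java
class DescriptorEstimate {
    static final int DEFAULT_FD_LIMIT = 1024;      // default per-process descriptor limit quoted above
    static final int DEFAULT_XCEIVER_LIMIT = 256;  // datanode thread bound quoted above

    // Flush files that may be open at once: regions x column families x files per family.
    static int openFiles(int regions, int columnFamilies, int flushFilesPerFamily) {
        return regions * columnFamilies * flushFilesPerFamily;
    }

    static boolean exceeds(int openFiles, int limit) {
        return openFiles > limit;
    }

    public static void main(String[] args) {
        int files = openFiles(100, 10, 2);
        System.out.println("open files: " + files);
        System.out.println("over descriptor limit: " + exceeds(files, DEFAULT_FD_LIMIT));
        System.out.println("over xceiver limit: " + exceeds(files, DEFAULT_XCEIVER_LIMIT));
    }
}
```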
You must run HBase on an HDFS that has a working sync. Otherwise, you will lose
data. This means running HBase on Hadoop or later.
HBase runs a web server on the master to present a view on the state of your
running cluster. By default, it listens on port 60010. The master UI displays a list of
basic attributes such as software versions, cluster load, request rates, lists of cluster
tables, and participating regionservers. Click on a regionserver in the master UI and
you are taken to the web server running on the individual regionserver. It lists the
regions this server is carrying and basic metrics such as resources consumed and
request rates.
So far in this book, we have been studying large-scale data processing. This chapter
is different: it is about building general distributed applications using Hadoop’s distributed coordination service, called ZooKeeper.
Writing distributed applications is hard. It’s hard primarily because of partial failure.
When a message is sent across the network between two nodes and the network
fails, the sender does not know whether the receiver got the message. It may have
gotten through before the network failed, or it may not have. Or perhaps the
receiver’s process died. The only way that the sender can find out what happened is
to reconnect to the receiver and ask it. This is partial failure: when we don’t even
know if an operation failed.
ZooKeeper can’t make partial failures go away, since they are intrinsic to distributed
systems. It certainly does not hide partial failures, either. But what ZooKeeper does
do is give you a set of tools to build distributed applications that can safely handle
partial failures.
ZooKeeper also has the following characteristics:
ZooKeeper is simple
ZooKeeper is, at its core, a stripped-down filesystem that exposes a few simple
operations, and some extra abstractions such as ordering and notifications.
ZooKeeper is expressive
The ZooKeeper primitives are a rich set of building blocks that can be used to
build a large class of coordination data structures and protocols. Examples
include: distributed queues, distributed locks, and leader election among a group
of peers.
ZooKeeper is highly available
ZooKeeper runs on a collection of machines and is designed to be highly
available, so applications can depend on it. ZooKeeper can help you avoid
introducing single points of failure into your system, so you can build a reliable application.
ZooKeeper facilitates loosely coupled interactions
ZooKeeper interactions support participants that do not need to know about one
another. For example, ZooKeeper can be used as a rendezvous mechanism so
that processes that otherwise don’t know of each other’s existence (or network
details) can discover and interact with each other. Coordinating parties may not
even be contemporaneous, since one process may leave a message in
ZooKeeper that is read by another after the first has shut down.
ZooKeeper is a library
ZooKeeper provides an open source, shared repository of implementations and
recipes of common coordination patterns. Individual programmers are spared the
burden of writing common protocols themselves (which are often difficult to get
right). Over time, the community can add to and improve the libraries, which is to
everyone’s benefit.
ZooKeeper is highly performant, too. At Yahoo!, where it was created, the throughput
for a ZooKeeper cluster has been benchmarked at over 10,000 operations per
second for write-dominant workloads generated by hundreds of clients. For
workloads where reads dominate, which is the norm, the throughput is several times higher.
Installing and Running ZooKeeper
When trying out ZooKeeper for the first time, it’s simplest to run it in standalone
mode with a single ZooKeeper server. You can do this on a development machine,
for example. ZooKeeper requires Java 6 to run, so make sure you have it installed
first. You don’t need Cygwin to run ZooKeeper on Windows, since there are
Windows versions of the ZooKeeper scripts. (Windows is supported only as a
development platform, not as a production platform.) Download a stable release of
ZooKeeper and unpack the tarball in a suitable location:
% tar xzf zookeeper-x.y.z.tar.gz
ZooKeeper provides a few binaries to run and interact with the service, and it’s convenient to put the directory containing the binaries on your command-line path:
% export ZOOKEEPER_INSTALL=/home/tom/zookeeper-x.y.z
% export PATH=$PATH:$ZOOKEEPER_INSTALL/bin
Before running the ZooKeeper service, we need to set up a configuration file. The
configuration file is conventionally called zoo.cfg and placed in the conf subdirectory
(although you can also place it in /etc/zookeeper, or in the directory defined by the
ZOOCFGDIR environment variable, if set).
Here’s an example:
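A minimal standalone configuration might look like the following sketch. The property names are the three discussed in this section; the tickTime value and the dataDir path are illustrative assumptions rather than required settings.

```properties
# Basic time unit in milliseconds (2000 is an assumed, commonly used value)
tickTime=2000
# Local directory for persistent data (assumed path; pick one suited to your system)
dataDir=/var/lib/zookeeper
# Port on which the server listens for client connections
clientPort=2181
```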
This is a standard Java properties file, and the three properties defined in this
example are the minimum required for running ZooKeeper in standalone mode.
Briefly, tickTime is the basic time unit in ZooKeeper (specified in milliseconds),
dataDir is the local filesystem location where ZooKeeper stores persistent data, and
clientPort is the port that the ZooKeeper server listens on for client connections (2181 is a
common choice). You should change dataDir to an appropriate setting for your system.
With a suitable configuration defined, we are now ready to start a local ZooKeeper server:
% zkServer.sh start
To check whether ZooKeeper is running, send the ruok command (“Are you OK?”) to
the client port using nc (telnet works, too):
% echo ruok | nc localhost 2181
imok
That’s ZooKeeper saying, “I’m OK.” There are other commands, known as the “four-letter words,” for managing ZooKeeper, and they are listed in Table 14-1.
Table 14-1. ZooKeeper commands: the four-letter words
Category            Command  Description
Server status       ruok     Prints imok if the server is running and not in an error state.
                    conf     Prints the server configuration (from zoo.cfg).
                    envi     Prints the server environment, including ZooKeeper version, Java version, and other system properties.
                    srvr     Prints server statistics, including latency statistics, the number of znodes, and the server mode (standalone, leader, or follower).
                    stat     Prints server statistics and connected clients.
                    srst     Resets server statistics.
                    isro     Shows whether the server is in read-only (ro) mode (due to a network partition) or read-write (rw) mode.
Client connections  dump     Lists all the sessions and ephemeral znodes for the ensemble. You must connect to the leader (see srvr) for this command.
                    cons     Lists connection statistics for all the server’s clients.
                    crst     Resets connection statistics.
Watches             wchs     Lists summary information for the server’s watches.
                    wchc     Lists all the server’s watches by connection. Caution: may impact server performance for a large number of watches.
                    wchp     Lists all the server’s watches by znode path. Caution: may impact server performance for a large number of watches.
Monitoring          mntr     Lists server statistics in Java Properties format, suitable as a source for monitoring systems like Ganglia and Nagios.
In addition to the mntr command, ZooKeeper exposes statistics via JMX. For more
details, see the ZooKeeper documentation. There are
also monitoring tools and recipes in the src/contrib directory of the distribution.
An Example
Imagine a group of servers that provide some service to clients. We want clients to
be able to locate one of the servers, so they can use the service. One of the
challenges is maintaining the list of servers in the group.
The membership list clearly cannot be stored on a single node in the network, as the
failure of that node would mean the failure of the whole system (we would like the list
to be highly available). Suppose for a moment that we had a robust way of storing
the list. We would still have the problem of how to remove a server from the list if it
failed. Some process needs to be responsible for removing failed servers, but note
that it can’t be the servers themselves, since they are no longer running!
What we are describing is not a passive distributed data structure, but an active one,
and one that can change the state of an entry when some external event occurs.
ZooKeeper provides this service, so let’s see how to build this group membership
application (as it is known) with it.
Group Membership in ZooKeeper
One way of understanding ZooKeeper is to think of it as providing a high-availability
filesystem. It doesn’t have files and directories, but a unified concept of a node,
called a znode, which acts both as a container of data (like a file) and a container of
other znodes (like a directory). Znodes form a hierarchical namespace, and a natural
way to build a membership list is to create a parent znode with the name of the
group and child znodes with the name of the group members (servers). This is
shown in Figure 14-1.
Figure 14-1. ZooKeeper znodes
In this example, we won’t store data in any of the znodes, but in a real application,
you could imagine storing data about the members in their znodes.
Creating the Group
Let’s introduce ZooKeeper’s Java API by writing a program to create a znode for the
group, /zoo in this example. See Example 14-1.
Example 14-1. A program to create a znode representing a group in ZooKeeper
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.*;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs.Ids;
public class CreateGroup implements Watcher {
  private static final int SESSION_TIMEOUT = 5000;
  private ZooKeeper zk;
  private CountDownLatch connectedSignal = new CountDownLatch(1);
  public void connect(String hosts) throws IOException, InterruptedException {
    zk = new ZooKeeper(hosts, SESSION_TIMEOUT, this);
    connectedSignal.await();
  }
  public void process(WatchedEvent event) { // Watcher interface
    if (event.getState() == KeeperState.SyncConnected) {
      connectedSignal.countDown();
    }
  }
  public void create(String groupName) throws KeeperException, InterruptedException {
    String path = "/" + groupName;
    String createdPath = zk.create(path, null/*data*/, Ids.OPEN_ACL_UNSAFE,
        CreateMode.PERSISTENT);
    System.out.println("Created " + createdPath);
  }
  public void close() throws InterruptedException { zk.close(); }
  public static void main(String[] args) throws Exception {
    CreateGroup createGroup = new CreateGroup();
    createGroup.connect(args[0]);
    createGroup.create(args[1]);
    createGroup.close();
  }
}
When the main() method is run, it creates a CreateGroup instance and then calls its
connect() method. This method instantiates a new ZooKeeper object, the main class
of the client API and the one that maintains the connection between the client and
the ZooKeeper service. The constructor takes three arguments: the first is the host
address (and optional port, which defaults to 2181) of the ZooKeeper service; the
second is the session timeout in milliseconds (which we set to 5 seconds), explained
in more detail later; and the third is an instance of a Watcher object. The Watcher
object receives callbacks from ZooKeeper to inform it of various events. In this case,
CreateGroup is a Watcher, so we pass this to the ZooKeeper constructor.
When a ZooKeeper instance is created, it starts a thread to connect to the
ZooKeeper service. The call to the constructor returns immediately, so it is important
to wait for the connection to be established before using the ZooKeeper object. We
make use of Java’s CountDownLatch class (in the java.util.concurrent package) to
block until the ZooKeeper instance is ready. This is where the Watcher comes in.
The Watcher interface has a single method:
public void process(WatchedEvent event);
When the client has connected to ZooKeeper, the Watcher receives a call to its
process() method with an event indicating that it has connected. On receiving a connection event (represented by the Watcher.Event.KeeperState enum, with value
SyncConnected), we decrement the counter in the CountDownLatch, using its
countDown() method. The latch was created with a count of one, representing the number
of events that need to occur before it releases all waiting threads. After calling
countDown() once, the counter reaches zero and the await() method returns.
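The latch handshake can be tried in isolation, without a ZooKeeper server. The sketch below uses only java.util.concurrent; the LatchDemo class and its method names are hypothetical stand-ins for connect() and process(), a plain thread plays the part of the client's event thread, and a timed await() is used (unlike the listing) so that the demonstration cannot hang.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class LatchDemo {
    private final CountDownLatch connectedSignal = new CountDownLatch(1);

    // Plays the role of process(): called from another thread when the event arrives.
    void onConnected() {
        connectedSignal.countDown();
    }

    // Plays the role of connect(): blocks until the event arrives or the timeout expires.
    boolean awaitConnected(long timeoutMillis) throws InterruptedException {
        return connectedSignal.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    // Runs one simulated connection attempt; returns true if the latch was released in time.
    static boolean simulate(boolean fireEvent) {
        LatchDemo demo = new LatchDemo();
        if (fireEvent) {
            new Thread(demo::onConnected).start(); // stand-in for the event thread
        }
        try {
            return demo.awaitConnected(fireEvent ? 5000 : 100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("event delivered: " + simulate(true));
        System.out.println("no event: " + simulate(false));
    }
}
```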
The connect() method has now returned, and the next method to be invoked on the
CreateGroup is the create() method. In this method, we create a new ZooKeeper
znode using the create() method on the ZooKeeper instance. The arguments it takes
are the path (represented by a string), the contents of the znode (a byte array, null
here), an access control list (or ACL for short, which here is a completely open ACL,
allowing any client to read or write the znode), and the nature of the znode to be created.
Znodes may be ephemeral or persistent. An ephemeral znode will be deleted by the
ZooKeeper service when the client that created it disconnects, either by explicitly
disconnecting or if the client terminates for whatever reason. A persistent znode,
on the other hand, is not deleted when the client disconnects. We want the znode
representing a group to live longer than the lifetime of the program that creates it,
so we create a persistent znode.
The return value of the create() method is the path that was created by ZooKeeper.
We use it to print a message that the path was successfully created. We will see
how the path returned by create() may differ from the one passed into the method
when we look at sequential znodes.
To see the program in action, we need to have ZooKeeper running on the local
machine, and then we can type:
% java CreateGroup localhost zoo
Created /zoo
Joining a Group
The next part of the application is a program to register a member in a group. Each
member will run as a program and join a group. When the program exits, it should
be removed from the group, which we can do by creating an ephemeral znode that
represents it in the ZooKeeper namespace.
The JoinGroup program implements this idea, and its listing is in Example 14-2.
The logic for creating and connecting to a ZooKeeper instance has been refactored
into a base class, ConnectionWatcher, and appears in Example 14-3.
Example 14-2. A program that joins a group
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
public class JoinGroup extends ConnectionWatcher {
  public void join(String groupName, String memberName) throws KeeperException, InterruptedException {
    String path = "/" + groupName + "/" + memberName;
    String createdPath = zk.create(path, null/*data*/, Ids.OPEN_ACL_UNSAFE,
        CreateMode.EPHEMERAL);
    System.out.println("Created " + createdPath);
  }
  public static void main(String[] args) throws Exception {
    JoinGroup joinGroup = new JoinGroup();
    joinGroup.connect(args[0]);
    joinGroup.join(args[1], args[2]);
    // stay alive until process is killed or thread is interrupted
    Thread.sleep(Long.MAX_VALUE);
  }
}
Example 14-3. A helper class that waits for the connection to ZooKeeper to be established
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.*;
import org.apache.zookeeper.Watcher.Event.KeeperState;
public class ConnectionWatcher implements Watcher {
  private static final int SESSION_TIMEOUT = 5000;
  protected ZooKeeper zk;
  private CountDownLatch connectedSignal = new CountDownLatch(1);
  public void connect(String hosts) throws IOException, InterruptedException {
    zk = new ZooKeeper(hosts, SESSION_TIMEOUT, this);
    connectedSignal.await();
  }
  public void process(WatchedEvent event) {
    if (event.getState() == KeeperState.SyncConnected) {
      connectedSignal.countDown();
    }
  }
  public void close() throws InterruptedException { zk.close(); }
}
The code for JoinGroup is very similar to CreateGroup. It creates an ephemeral
znode as a child of the group znode in its join() method, then simulates doing work of
some kind by sleeping until the process is forcibly terminated. Later, you will see that
upon termination, the ephemeral znode is removed by ZooKeeper.
Listing Members in a Group
Now we need a program to find the members in a group (see Example 14-4).
Example 14-4. A program to list the members in a group
import java.util.List;
import org.apache.zookeeper.KeeperException;
public class ListGroup extends ConnectionWatcher {
  public void list(String groupName) throws KeeperException, InterruptedException {
    String path = "/" + groupName;
    try {
      List<String> children = zk.getChildren(path, false);
      if (children.isEmpty()) {
        System.out.printf("No members in group %s\n", groupName);
        System.exit(1);
      }
      for (String child : children) {
        System.out.println(child);
      }
    } catch (KeeperException.NoNodeException e) {
      System.out.printf("Group %s does not exist\n", groupName);
      System.exit(1);
    }
  }
  public static void main(String[] args) throws Exception {
    ListGroup listGroup = new ListGroup();
    listGroup.connect(args[0]);
    listGroup.list(args[1]);
    listGroup.close();
  }
}
In the list() method, we call getChildren() with a znode path and a watch flag to
retrieve a list of child paths for the znode, which we print out. Placing a watch on a
znode causes the registered Watcher to be triggered if the znode changes state.
Although we’re not using it here, watching a znode’s children would permit a
program to get notifications of members joining or leaving the group, or of the group
being deleted.
We catch KeeperException.NoNodeException, which is thrown in the case when the
group’s znode does not exist.
Let’s see ListGroup in action. As expected, the zoo group is empty, since we haven’t
added any members yet:
% java ListGroup localhost zoo
No members in group zoo
We can use JoinGroup to add some members. We launch them as background processes, since they don’t terminate on their own (due to the sleep statement):
% java JoinGroup localhost zoo duck &
% java JoinGroup localhost zoo cow &
% java JoinGroup localhost zoo goat &
% goat_pid=$!
The last line saves the process ID of the Java process running the program that
adds goat as a member. We need to remember the ID so that we can kill the process
in a moment, after checking the members:
% java ListGroup localhost zoo
goat
duck
cow
To remove a member, we kill its process:
% kill $goat_pid
And a few seconds later, it has disappeared from the group because the process’s
ZooKeeper session has terminated (the timeout was set to 5 seconds) and its
associated ephemeral node has been removed:
% java ListGroup localhost zoo
duck
cow
Let’s stand back and see what we’ve built here. We have a way of building up a list
of a group of nodes that are participating in a distributed system. The nodes may
have no knowledge of each other. A client that wants to use the nodes in the list to
perform some work, for example, can discover the nodes without them being aware
of the client’s existence.
Finally, note that group membership is not a substitute for handling network errors
when communicating with a node. Even if a node is a group member,
communications with it may fail, and such failures must be handled in the usual ways
(retrying, trying a different member of the group, and so on).
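As an illustration of trying a different member of the group, here is a minimal sketch in plain Java. It is not part of the group membership programs: the Failover class is hypothetical, a failed communication is modeled as a RuntimeException, and fakeRequest() stands in for a real network call.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class Failover {
    // Tries each member in turn and returns the first successful result.
    static <T> T callAny(List<String> members, Function<String, T> call) {
        RuntimeException last = null;
        for (String member : members) {
            try {
                return call.apply(member);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next member
            }
        }
        throw new IllegalStateException("No member could be reached", last);
    }

    // Hypothetical request: "goat" stands for a member that is listed but unreachable.
    static String fakeRequest(String member) {
        if (member.equals("goat")) {
            throw new RuntimeException("connection refused: " + member);
        }
        return "handled by " + member;
    }

    public static void main(String[] args) {
        List<String> members = Arrays.asList("goat", "duck", "cow");
        System.out.println(callAny(members, Failover::fakeRequest));
    }
}
```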
ZooKeeper command-line tools
ZooKeeper comes with a command-line tool for interacting with the ZooKeeper
namespace. We can use it to list the znodes under the /zoo znode as follows:
% localhost ls /zoo
Processing ls
WatchedEvent: Server state change. New state: SyncConnected
[duck, cow]
You can run the command without arguments to display usage instructions.
Deleting a Group
To round off the example, let’s see how to delete a group. The ZooKeeper class
provides a delete() method that takes a path and a version number. ZooKeeper will
delete a znode only if the version number specified is the same as the version
number of the znode it is trying to delete, an optimistic locking mechanism that
allows clients to detect conflicts over znode modification. You can bypass the version
check, however, by using a version number of –1 to delete the znode regardless of its
version number.
There is no recursive delete operation in ZooKeeper, so you have to delete child
znodes before parents. This is what we do in the DeleteGroup class, which will
remove a group and all its members (Example 14-5).
Example 14-5. A program to delete a group and its members
import java.util.List;
import org.apache.zookeeper.KeeperException;
public class DeleteGroup extends ConnectionWatcher {
  public void delete(String groupName) throws KeeperException, InterruptedException {
    String path = "/" + groupName;
    try {
      List<String> children = zk.getChildren(path, false);
      for (String child : children) {
        zk.delete(path + "/" + child, -1);
      }
      zk.delete(path, -1);
    } catch (KeeperException.NoNodeException e) {
      System.out.printf("Group %s does not exist\n", groupName);
      System.exit(1);
    }
  }
  public static void main(String[] args) throws Exception {
    DeleteGroup deleteGroup = new DeleteGroup();
    deleteGroup.connect(args[0]);
    deleteGroup.delete(args[1]);
    deleteGroup.close();
  }
}
Finally, we can delete the zoo group that we created earlier:
% java DeleteGroup localhost zoo
% java ListGroup localhost zoo
Group zoo does not exist
The ZooKeeper Service
ZooKeeper is a highly available, high-performance coordination service. In this
section, we look at the nature of the service it provides: its model, operations, and implementation.
Data Model
ZooKeeper maintains a hierarchical tree of nodes called znodes. A znode stores
data and has an associated ACL. ZooKeeper is designed for coordination (which
typically uses small data files), not high-volume data storage, so there is a limit of 1
MB on the amount of data that may be stored in any znode.
Data access is atomic. A client reading the data stored at a znode will never receive
only some of the data; either the data will be delivered in its entirety, or the read will
fail. Similarly, a write will replace all the data associated with a znode. ZooKeeper
guarantees that the write will either succeed or fail; there is no such thing as a
partial write, where only some of the data written by the client is stored. ZooKeeper
does not support an append operation. These characteristics contrast with HDFS,
which is designed for high-volume data storage, with streaming data access, and
provides an append operation.
Znodes are referenced by paths, which in ZooKeeper are represented as slash-delimited Unicode character strings, like filesystem paths in Unix. Paths must be
absolute, so they must begin with a slash character. Furthermore, they are
canonical, which means that each path has a single representation, and so paths do
not undergo resolution. For example, in Unix, a file with the path /a/b can
equivalently be referred to by the path /a/./b, since “.” refers to the current directory
at the point it is encountered in the path. In ZooKeeper, “.” does not have this special
meaning and is actually illegal as a path component (as is “..” for the parent of the
current directory).
Path components are composed of Unicode characters, with a few restrictions (these
are spelled out in the ZooKeeper reference documentation). The string “zookeeper”
is a reserved word and may not be used as a path component. In particular,
ZooKeeper uses the /zookeeper subtree to store management information, such as
information on quotas.
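The path rules just described can be collected into a small validity check. This is a sketch of the rules as stated in this section, not ZooKeeper's own validation code: the ZnodePaths class is hypothetical, it ignores the additional character restrictions spelled out in the reference documentation, and rejecting empty components (as in /a//b or a trailing slash) is an assumption of the sketch.

```java
class ZnodePaths {
    // Returns true if the path satisfies the rules described in the text.
    static boolean isValid(String path) {
        if (path == null || !path.startsWith("/")) {
            return false;                       // paths must be absolute
        }
        if (path.equals("/")) {
            return true;                        // the root znode
        }
        for (String component : path.substring(1).split("/", -1)) {
            if (component.isEmpty()             // assumption: no empty components
                    || component.equals(".")    // no special meaning, and illegal
                    || component.equals("..")
                    || component.equals("zookeeper")) { // reserved word
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String p : new String[] {"/a/b", "/a/./b", "a/b", "/zookeeper/quota"}) {
            System.out.println(p + " -> " + isValid(p));
        }
    }
}
```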
Note that paths are not URIs, and they are represented in the Java API by a
java.lang.String, rather than the Hadoop Path class (or by the java.net.URI class, for
that matter).
Znodes have some properties that are very useful for building distributed
applications, which we discuss in the following sections.
Ephemeral znodes
Znodes can be one of two types: ephemeral or persistent. A znode’s type is set at
creation time and may not be changed later. An ephemeral znode is deleted by
ZooKeeper when the creating client’s session ends. By contrast, a persistent znode
is not tied to the client’s session and is deleted only when explicitly deleted by a
client (not necessarily the one that created it). An ephemeral znode may not have
children, not even ephemeral ones.
Even though ephemeral nodes are tied to a client session, they are visible to all
clients (subject to their ACL policy, of course).
Ephemeral znodes are ideal for building applications that need to know when certain
distributed resources are available. The example earlier in this chapter uses
ephemeral znodes to implement a group membership service, so any process can
discover the members of the group at any particular time.
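The lifetime rules for ephemeral znodes can be modeled with an in-memory map. The following is a toy simulation only, with no networking or real sessions: the EphemeralSim class is hypothetical, session IDs are arbitrary nonzero numbers, and an owner of 0 is used here to mark a persistent znode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class EphemeralSim {
    private static final long PERSISTENT = 0L;            // owner value for persistent znodes
    private final Map<String, Long> owners = new TreeMap<>(); // path -> owning session

    // Creates a znode at an absolute path; session IDs must be nonzero.
    void create(String path, long sessionId, boolean ephemeral) {
        String parent = path.substring(0, path.lastIndexOf('/'));
        if (!parent.isEmpty()) {
            Long parentOwner = owners.get(parent);
            if (parentOwner == null) {
                throw new IllegalStateException("Parent does not exist: " + parent);
            }
            if (parentOwner != PERSISTENT) {
                throw new IllegalStateException("Ephemeral znodes may not have children: " + parent);
            }
        }
        owners.put(path, ephemeral ? sessionId : PERSISTENT);
    }

    // Ending a session deletes every ephemeral znode that the session created.
    void closeSession(long sessionId) {
        owners.values().removeIf(owner -> owner == sessionId);
    }

    // Names of the direct children of a znode, visible to any client.
    List<String> children(String parent) {
        List<String> names = new ArrayList<>();
        String prefix = parent + "/";
        for (String path : owners.keySet()) {
            if (path.startsWith(prefix) && path.indexOf('/', prefix.length()) < 0) {
                names.add(path.substring(prefix.length()));
            }
        }
        return names;
    }

    // Two members join the group, then the session that created "duck" ends.
    static List<String> demo() {
        EphemeralSim sim = new EphemeralSim();
        sim.create("/zoo", 1L, false);
        sim.create("/zoo/duck", 2L, true);
        sim.create("/zoo/cow", 3L, true);
        sim.closeSession(2L);
        return sim.children("/zoo");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```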
Sequence numbers
A sequential znode is given a sequence number by ZooKeeper as a part of its name.
If a znode is created with the sequential flag set, then the value of a monotonically
increasing counter (maintained by the parent znode) is appended to its name.
If a client asks to create a sequential znode with the name /a/b-, for example, then
the znode created may actually have the name /a/b-3. If, later on, another
sequential znode with the name /a/b- is created, then it will be given a unique name
with a larger value of the counter—for example, /a/b-5. In the Java API, the actual
path given to sequential znodes is communicated back to the client as the return
value of the create() call.
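The naming scheme can be mimicked with a counter per parent. This sketch is not ZooKeeper code: the SequentialNames class is hypothetical, it pads the counter to ten digits in the way the ZooKeeper service formats sequence numbers, and it hands out consecutive values, whereas the real counter may skip values.

```java
import java.util.HashMap;
import java.util.Map;

class SequentialNames {
    // One counter per parent znode, shared by all of its sequential children.
    private final Map<String, Integer> counters = new HashMap<>();

    // Returns the actual path for a sequential znode requested with the given prefix.
    String createSequential(String pathPrefix) {
        String parent = pathPrefix.substring(0, pathPrefix.lastIndexOf('/'));
        int next = counters.merge(parent, 1, Integer::sum) - 1; // 0, 1, 2, ...
        return pathPrefix + String.format("%010d", next);
    }

    public static void main(String[] args) {
        SequentialNames names = new SequentialNames();
        System.out.println(names.createSequential("/a/b-"));
        System.out.println(names.createSequential("/a/b-"));
        System.out.println(names.createSequential("/a/c-"));
    }
}
```

Because the counter belongs to the parent, /a/b- and /a/c- draw from the same sequence, which is what allows the names to be used to order events under a single parent.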
Sequence numbers can be used to impose a global ordering on events in a
distributed system, and may be used by the client to infer the ordering. In “A Lock
Service”, you will learn how to use sequential znodes to build a shared lock.
Watches allow clients to get notifications when a znode changes in some way.
Watches are set by operations on the ZooKeeper service, and are triggered by other
operations on the service. For example, a client might call the exists operation on a
znode, placing a watch on it at the same time. If the znode doesn’t exist, then the
exists operation will return false. If, some time later, the znode is created by a
second client, then the watch is triggered, notifying the first client of the znode’s
creation. You will see precisely which operations trigger others in the next section.
Watchers are triggered only once. To receive multiple notifications, a client needs to
reregister the watch. If the client in the previous example wishes to receive further
notifications for the znode’s existence (to be notified when it is deleted, for example),
it needs to call the exists operation again to set a new watch.
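The one-shot behavior is easy to see in a toy model. The sketch below is an in-memory simulation, not the ZooKeeper client API: the OneShotWatches class and its Watcher interface are hypothetical, and only the event names NodeCreated and NodeDeleted are borrowed from ZooKeeper.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class OneShotWatches {
    interface Watcher {
        void process(String path, String eventType);
    }

    private final Set<String> znodes = new HashSet<>();
    private final Map<String, List<Watcher>> watches = new HashMap<>();

    // Tests for existence and, if a watcher is given, sets a watch at the same time.
    boolean exists(String path, Watcher watcher) {
        if (watcher != null) {
            watches.computeIfAbsent(path, p -> new ArrayList<>()).add(watcher);
        }
        return znodes.contains(path);
    }

    void create(String path) {
        if (znodes.add(path)) {
            fire(path, "NodeCreated");
        }
    }

    void delete(String path) {
        if (znodes.remove(path)) {
            fire(path, "NodeDeleted");
        }
    }

    private void fire(String path, String eventType) {
        List<Watcher> triggered = watches.remove(path); // removal makes each watch one-shot
        if (triggered != null) {
            for (Watcher w : triggered) {
                w.process(path, eventType);
            }
        }
    }

    // Records the events a client sees, with or without re-registering its watch.
    static List<String> run(boolean reregister) {
        OneShotWatches zk = new OneShotWatches();
        List<String> events = new ArrayList<>();
        Watcher recorder = (path, type) -> events.add(type);
        zk.exists("/config", recorder);      // znode is absent, but the watch is set
        zk.create("/config");                // triggers and consumes the watch
        if (reregister) {
            zk.exists("/config", recorder);  // set a new watch
        }
        zk.delete("/config");                // seen only if the watch was set again
        return events;
    }

    public static void main(String[] args) {
        System.out.println("without re-registering: " + run(false));
        System.out.println("with re-registering: " + run(true));
    }
}
```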
There is an example in “A Configuration Service” demonstrating how to use watches
to update configuration across a cluster.
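The one-shot behavior can be illustrated with a small in-memory model. It is only a sketch: OneShotWatchRegistry is a hypothetical class, event types are passed as plain strings, and nothing here talks to a ZooKeeper server.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Toy in-memory model of one-shot watches; names are illustrative only. */
final class OneShotWatchRegistry {
    private final Map<String, List<Consumer<String>>> watches = new HashMap<>();

    /** Registers a watch on a path; it will fire at most once. */
    void watch(String path, Consumer<String> watcher) {
        watches.computeIfAbsent(path, p -> new ArrayList<>()).add(watcher);
    }

    /** Simulates a change to a path: fires and clears every watch set on it. */
    void change(String path, String eventType) {
        // Removing the list before firing is what makes each watch one-shot.
        List<Consumer<String>> fired = watches.remove(path);
        if (fired != null) {
            fired.forEach(w -> w.accept(eventType));
        }
    }
}
```

A second change to the same path notifies nobody unless the client has called watch() again in the meantime, which mirrors the need to reregister after each notification.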
There are nine basic operations in ZooKeeper, listed in Table 14-2.
Table 14-2. Operations in the ZooKeeper service
Operation           Description
create              Creates a znode (the parent znode must already exist)
delete              Deletes a znode (the znode must not have any children)
exists              Tests whether a znode exists and retrieves its metadata
getACL, setACL      Gets/sets the ACL for a znode
getChildren         Gets a list of the children of a znode
getData, setData    Gets/sets the data associated with a znode
sync                Synchronizes a client’s view of a znode with ZooKeeper
Update operations in ZooKeeper are conditional. A delete or setData operation has
to specify the version number of the znode that is being updated (which is found
from a previous exists call). If the version number does not match, the update will
fail. Updates are a nonblocking operation, so a client that loses an update (because
another process updated the znode in the meantime) can decide whether to try
again or take some other action, and it can do so without blocking the progress of
any other process.
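The version check behaves like an optimistic compare-and-set. The following is a minimal sketch under that assumption, using a hypothetical VersionedNode class rather than a real znode. It treats a version of -1 as "match any version", the convention used by the setData(path, data, -1) call in the configuration service example.

```java
/** Toy versioned register modelling conditional updates; not the ZooKeeper API. */
final class VersionedNode {
    private String data;
    private int version = 0;

    VersionedNode(String initial) {
        this.data = initial;
    }

    synchronized int version() {
        return version;
    }

    synchronized String data() {
        return data;
    }

    /**
     * Applies the update only if expectedVersion matches the current version.
     * An expectedVersion of -1 skips the check.
     */
    synchronized boolean setData(String newData, int expectedVersion) {
        if (expectedVersion != -1 && expectedVersion != version) {
            return false; // lost the race; the caller decides whether to retry
        }
        data = newData;
        version++;
        return true;
    }
}
```

A caller that receives false has lost the update to another writer and can reread the version and try again, or give up, without holding any lock in the meantime.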
Although ZooKeeper can be viewed as a filesystem, there are some filesystem
primitives that it does away with in the name of simplicity. Because files are small
and are written and read in their entirety, there is no need to provide open, close, or
seek operations.
The sync operation is not like fsync() in POSIX filesystems. As mentioned earlier,
writes in ZooKeeper are atomic, and a successful write operation is guaranteed to
have been written to persistent storage on a majority of ZooKeeper servers.
However, it is permissible for reads to lag the latest state of ZooKeeper service, and
the sync operation exists to allow a client to bring itself up-to-date. This topic is
covered in more detail in the section on “Consistency”.
There is another ZooKeeper operation, called multi, which batches together multiple
primitive operations into a single unit that either succeeds or fails in its entirety. The
situation where some of the primitive operations succeed and some fail can never arise.
Multi-update is very useful for building structures in ZooKeeper that maintain some
global invariant. One example is an undirected graph. Each vertex in the graph is
naturally represented as a znode in ZooKeeper, and to add or remove an edge we
need to update the two znodes corresponding to its vertices, since each has a
reference to the other. If we only used primitive ZooKeeper operations, it would be
possible for another client to observe the graph in an inconsistent state where one
vertex is connected to another but the reverse connection is absent. Batching the
updates on the two znodes into one multi operation ensures that the update is
atomic, so a pair of vertices can never have a dangling connection.
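The invariant can be modelled in plain Java. In this sketch a single in-process lock stands in for the multi operation, and AtomicGraph is a hypothetical class with no connection to the ZooKeeper client library. The point is only that both directions of an edge are validated and written as one unit.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy adjacency store whose edge updates are all-or-nothing. */
final class AtomicGraph {
    private final Map<String, Set<String>> adjacency = new HashMap<>();

    synchronized void addVertex(String v) {
        adjacency.putIfAbsent(v, new HashSet<>());
    }

    /** Adds both directions of the edge, or neither if a vertex is missing. */
    synchronized boolean addEdge(String a, String b) {
        // Validate every precondition before mutating anything.
        if (!adjacency.containsKey(a) || !adjacency.containsKey(b)) {
            return false;
        }
        adjacency.get(a).add(b);
        adjacency.get(b).add(a);
        return true;
    }

    /** True if a holds a reference to b; readers never see a half-applied edge. */
    synchronized boolean connected(String a, String b) {
        return adjacency.containsKey(a) && adjacency.get(a).contains(b);
    }
}
```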
There are two core language bindings for ZooKeeper clients, one for Java and one
for C; there are also contrib bindings for Perl, Python, and REST clients. For each
binding, there is a choice between performing operations synchronously or
asynchronously. We’ve already seen the synchronous Java API. Here’s the
signature for the exists operation, which returns a Stat object that encapsulates the
znode’s metadata, or null if the znode doesn’t exist:
public Stat exists(String path, Watcher watcher) throws
KeeperException, InterruptedException
The asynchronous equivalent, which is also found in the ZooKeeper class, looks like
this:
public void exists(String path, Watcher watcher, StatCallback cb, Object ctx)
In the Java API, all the asynchronous methods have void return types, since the
result of the operation is conveyed via a callback. The caller passes a callback
implementation, whose method is invoked when a response is received from
ZooKeeper. In this case, the callback is the StatCallback interface, which has the
following method:
public void processResult(int rc, String path, Object ctx, Stat stat);
The rc argument is the return code, corresponding to the codes defined by
KeeperException. A nonzero code represents an exception, in which case the stat parameter
will be null. The path and ctx arguments correspond to the equivalent arguments
passed by the client to the exists() method, and can be used to identify the request
for which this callback is a response. The ctx parameter can be an arbitrary object
that may be used by the client when the path does not give enough context to
disambiguate the request. If not needed, it may be set to null.
There are actually two C shared libraries. The single-threaded library, zookeeper_st,
supports only the asynchronous API and is intended for platforms where the pthread
library is not available or stable. Most developers will use the multithreaded library,
zookeeper_mt, as it supports both the synchronous and asynchronous APIs. For
details on how to build and use the C API, please refer to the README file in the
src/c directory of the ZooKeeper distribution.
Should I Use the Synchronous or Asynchronous API?
Both APIs offer the same functionality, so the one you use is largely a matter of
style. The asynchronous API is appropriate if you have an event-driven
programming model, for example.
The asynchronous API allows you to pipeline requests, which in some scenarios
can offer better throughput. Imagine that you want to read a large batch of znodes
and process them independently. Using the synchronous API, each read would
block until it returned, whereas with the asynchronous API, you can fire off all the
asynchronous reads very quickly and process the responses in a separate thread
as they come back.
Watch triggers
The read operations exists, getChildren, and getData may have watches set on
them, and the watches are triggered by write operations: create, delete, and setData.
ACL operations do not participate in watches. When a watch is triggered, a watch
event is generated, and the watch event’s type depends both on the watch and the
operation that triggered it:
A watch set on an exists operation will be triggered when
the znode being watched is created, deleted, or has its data updated.
A watch set on a getData operation will be triggered when the znode being watched
is deleted or has its data updated. No trigger can occur on creation, since the znode
must already exist for the getData operation to succeed.
A watch set on a getChildren operation will be triggered when a child of the znode
being watched is created or deleted, or when the znode itself is deleted. You can tell
whether the znode or its child was deleted by looking at the watch event type:
NodeDeleted shows the znode was deleted, and NodeChildrenChanged indicates
that it was a child that was deleted.
The combinations are summarized in Table 14-3.
Table 14-3. Watch creation operations and their corresponding triggers
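Watch creation   create znode   create child          delete znode   delete child          setData
exists           NodeCreated    -                     NodeDeleted    -                     NodeDataChanged
getData          -              -                     NodeDeleted    -                     NodeDataChanged
getChildren      -              NodeChildrenChanged   NodeDeleted    NodeChildrenChanged   -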
A watch event includes the path of the znode that was involved in the event, so for
NodeCreated and NodeDeleted events, you can tell which node was created or
deleted simply by inspecting the path. To discover which children have changed after
a NodeChildrenChanged event, you need to call getChildren again to retrieve the
new list of children. Similarly, to discover the new data for a NodeDataChanged
event, you need to call getData. In both of these cases, the state of the znodes may
have changed between receiving the watch event and performing the read operation,
so you should bear this in mind when writing applications.
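The trigger rules described in this section can be written down as a lookup function. This is an illustrative sketch: the WatchTriggers class and its enums are hypothetical and are not part of the ZooKeeper API, although the event names match those used in the text.

```java
import java.util.Optional;

/** Illustrative lookup of which watch event (if any) an operation triggers. */
final class WatchTriggers {
    enum WatchSetBy { EXISTS, GET_DATA, GET_CHILDREN }
    enum Op { CREATE_ZNODE, CREATE_CHILD, DELETE_ZNODE, DELETE_CHILD, SET_DATA }
    enum Event { NodeCreated, NodeDeleted, NodeDataChanged, NodeChildrenChanged }

    static Optional<Event> eventFor(WatchSetBy watch, Op op) {
        switch (op) {
            case CREATE_ZNODE:
                // Only an exists watch can be set on a znode that is not there yet.
                return watch == WatchSetBy.EXISTS
                        ? Optional.of(Event.NodeCreated) : Optional.empty();
            case DELETE_ZNODE:
                // Every kind of watch is triggered when the watched znode is deleted.
                return Optional.of(Event.NodeDeleted);
            case SET_DATA:
                return watch == WatchSetBy.GET_CHILDREN
                        ? Optional.empty() : Optional.of(Event.NodeDataChanged);
            case CREATE_CHILD:
            case DELETE_CHILD:
                return watch == WatchSetBy.GET_CHILDREN
                        ? Optional.of(Event.NodeChildrenChanged) : Optional.empty();
            default:
                return Optional.empty();
        }
    }
}
```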
A znode is created with a list of ACLs, which determines who can perform certain
operations on it. ACLs depend on authentication, the process by which the client
identifies itself to ZooKeeper. There are a few authentication schemes that
ZooKeeper provides:
digest
The client is authenticated by a username and password.
sasl
The client is authenticated using Kerberos.
ip
The client is authenticated by its IP address.
Clients may authenticate themselves after establishing a ZooKeeper session.
Authentication is optional, although a znode’s ACL may require an authenticated
client, in which case the client must authenticate itself to access the znode. Here is
an example of using the digest scheme to authenticate with a username and
password:
zk.addAuthInfo("digest", "tom:secret".getBytes());
An ACL is the combination of an authentication scheme, an identity for that scheme,
and a set of permissions. For example, if we wanted to give a client with a particular IP
address read access to a znode, we would set an ACL on the znode with
the ip scheme, an ID of that IP address, and READ permission. In Java, we would create
the ACL object as follows:
new ACL(Perms.READ, new Id("ip", ""));
The full set of permissions is listed in Table 14-4. Note that the exists operation is
not governed by an ACL permission, so any client may call exists to find the Stat for
a znode or to discover that a znode does not in fact exist.
Table 14-4. ACL permissions
ACL permission   Permitted operations
CREATE           create (a child znode)
READ             getChildren, getData
WRITE            setData
DELETE           delete (a child znode)
ADMIN            setACL
There are a number of predefined ACLs defined in the ZooDefs.Ids class, including
OPEN_ACL_UNSAFE, which gives all permissions (except ADMIN permission) to
everyone. In addition, ZooKeeper has a pluggable authentication mechanism, which
makes it possible to integrate third-party authentication systems if needed.
The ZooKeeper service can run in two modes. In standalone mode, there is a single
ZooKeeper server, which is useful for testing due to its simplicity (it can even be
embedded in unit tests), but provides no guarantees of high-availability or resilience.
In production, ZooKeeper runs in replicated mode, on a cluster of machines called
an ensemble. ZooKeeper achieves high-availability through replication, and can
provide a service as long as a majority of the machines in the ensemble are up. For
example, in a five-node ensemble, any two machines can fail and the service will still
work because a majority of three remain. Note that a six-node ensemble can also
tolerate only two machines failing, since with three failures the remaining three do
not constitute a majority of the six. For this reason, it is usual to have an odd
number of machines in an ensemble.
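The arithmetic behind this rule is short enough to write out. The sketch below uses a hypothetical QuorumMath class and shows why adding a sixth server to a five-node ensemble does not increase the number of failures it can survive.

```java
/** Majority-quorum arithmetic for an ensemble; class and method names are illustrative. */
final class QuorumMath {
    /** Smallest number of servers that forms a strict majority. */
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** Number of servers that can fail while a majority still remains. */
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 7; n++) {
            System.out.printf("ensemble=%d majority=%d tolerated failures=%d%n",
                    n, majority(n), toleratedFailures(n));
        }
    }
}
```

For five servers the majority is three and two failures are tolerated. For six servers the majority rises to four, so the tolerated failures stay at two.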
Conceptually, ZooKeeper is very simple: all it has to do is ensure that every
modification to the tree of znodes is replicated to a majority of the ensemble. If a
minority of the machines fail, then a minimum of one machine will survive with the
latest state. The other remaining replicas will eventually catch up with this state.
The implementation of this simple idea, however, is nontrivial. ZooKeeper uses a
protocol called Zab that runs in two phases, which may be repeated indefinitely:
Phase 1: Leader election
The machines in an ensemble go through a process of electing a distinguished
member, called the leader. The other machines are termed followers. This phase
is finished once a majority (or quorum) of followers have synchronized their state
with the leader.
Phase 2: Atomic broadcast
All write requests are forwarded to the leader, which broadcasts the update to the
followers. When a majority have persisted the change, the leader commits the
update, and the client gets a response saying the update succeeded. The
protocol for achieving consensus is designed to be atomic, so a change either
succeeds or fails. It resembles a two-phase commit.
If the leader fails, the remaining machines hold another leader election and continue
as before with the new leader. If the old leader later recovers, it then starts as a
follower. Leader election is very fast, around 200 ms according to one published
result, so performance does not noticeably degrade during an election. All machines
in the ensemble write updates to disk before updating their in-memory copy of the
znode tree. Read requests may be serviced from any machine, and since they
involve only a lookup from memory, they are very fast.
Understanding the basis of ZooKeeper’s implementation helps in understanding the
consistency guarantees that the service makes. The terms “leader” and “follower” for
the machines in an ensemble are apt, for they make the point that a follower may lag
the leader by a number of updates. This is a consequence of the fact that only a
majority and not all of the ensemble needs to have persisted a change before it is
committed. A good mental model for ZooKeeper is of clients connected to
ZooKeeper servers that are following the leader. A client may actually be connected
to the leader, but it has no control over this, and cannot even know if this is the case.
See Figure 14-2. Every update made to the znode tree is given a globally unique
identifier, called a zxid (which stands for “ZooKeeper transaction ID”). Updates are
ordered, so if zxid z1 is less than z2, then z1 happened before z2, according to
ZooKeeper, which is the single authority on ordering in the distributed system.
Figure 14-2. Reads are satisfied by followers, while writes are committed by
the leader
The following guarantees for data consistency flow from ZooKeeper’s design:
Sequential consistency
Updates from any particular client are applied in the order that they are sent. This
means that if a client updates the znode z to the value a, and in a later operation,
it updates z to the value b, then no client will ever see z with value a after it has
seen it with value b (if no other updates are made to z).
Atomicity
Updates either succeed or fail. This means that if an update fails, no client will
ever see it.
Single system image
A client will see the same view of the system regardless of the server it connects
to. This means that if a client connects to a new server during the same session,
it will not see an older state of the system than the one it saw with the previous
server. When a server fails and a client tries to connect to another in the
ensemble, a server that is behind the one that failed will not accept connections
from the client until it has caught up with the failed server.
Durability
Once an update has succeeded, it will persist and will not be undone. This means
updates will survive server failures.
Timeliness
The lag in any client’s view of the system is bounded, so it will not be out of date
by more than some multiple of tens of seconds. This means that rather than allow
a client to see data that is very stale, a server will shut down, forcing the client to
switch to a more up-to-date server.
A ZooKeeper client is configured with the list of servers in the ensemble. On startup,
it tries to connect to one of the servers in the list. If the connection fails, it tries
another server in the list, and so on, until it either successfully connects to one of
them or fails if all ZooKeeper servers are unavailable.
Once a connection has been made with a ZooKeeper server, the server creates a
new session for the client. A session has a timeout period that is decided on by the
application that creates it. If the server hasn’t received a request within the timeout
period, it may expire the session. Once a session has expired, it may not be
reopened, and any ephemeral nodes associated with the session will be lost.
Although session expiry is a comparatively rare event, since sessions are long-lived,
it is important for applications to handle it.
Sessions are kept alive by the client sending ping requests (also known as
heartbeats) whenever the session is idle for longer than a certain period. (Pings are
automatically sent by the ZooKeeper client library, so your code doesn’t need to
worry about maintaining the session.) The period is chosen to be low enough to
detect server failure (manifested by a read timeout) and reconnect to another server
within the session timeout period.
Failover to another ZooKeeper server is handled automatically by the ZooKeeper
client, and, crucially, sessions (and associated ephemeral znodes) are still valid after
another server takes over from the failed one.
During failover, the application will receive notifications of disconnections and connections to the service. Watch notifications will not be delivered while the client is
disconnected, but they will be delivered when the client successfully reconnects.
Also, if the application tries to perform an operation while the client is reconnecting to
another server, the operation will fail.
There are several time parameters in ZooKeeper. The tick time is the fundamental
period of time in ZooKeeper and is used by servers in the ensemble to define the
schedule on which their interactions run. Other settings are defined in terms of tick
time, or are at least constrained by it. The session timeout, for example, may not be
less than 2 ticks or more than 20. If you attempt to set a session timeout outside this
range, it will be modified to fall within the range.
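The clamping rule can be sketched as a small function. SessionTimeouts and negotiate are hypothetical names, and the sketch assumes only the bounds stated here (at least 2 ticks, at most 20 ticks), not any particular server configuration.

```java
/** Sketch of how a requested session timeout is forced into the allowed range. */
final class SessionTimeouts {
    /**
     * Returns the timeout actually granted, in milliseconds, for a request
     * made against a server with the given tick time.
     */
    static int negotiate(int requestedMs, int tickTimeMs) {
        int min = 2 * tickTimeMs;   // no less than 2 ticks
        int max = 20 * tickTimeMs;  // no more than 20 ticks
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        int tickTimeMs = 2000;
        for (int requested : new int[] {1000, 10000, 60000}) {
            System.out.printf("requested=%d ms granted=%d ms%n",
                    requested, negotiate(requested, tickTimeMs));
        }
    }
}
```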
A common tick time setting is 2 seconds (2,000 milliseconds). This translates to an
allowable session timeout of between 4 and 40 seconds. There are a few
considerations in selecting a session timeout.
A low session timeout leads to faster detection of machine failure. In the group membership example, the session timeout is the time it takes for a failed machine to be
removed from the group. Beware of setting the session timeout too low, however,
since a busy network can cause packets to be delayed and may cause inadvertent
session expiry. In such an event, a machine would appear to “flap”: leaving and then
rejoining the group repeatedly in a short space of time.
Applications that create more complex ephemeral state should favor longer session
timeouts, as the cost of reconstruction is higher. In some cases, it is possible to
design the application so it can restart within the session timeout period and avoid
session expiry. (This might be desirable to perform maintenance or upgrades.) Every
session is given a unique identity and password by the server, and if these are
passed to ZooKeeper while a connection is being made, it is possible to recover a
session (as long as it hasn’t expired). An application can therefore arrange a graceful
shutdown, whereby it stores the session identity and password to stable storage
before restarting the process, retrieving the stored session identity and password
and recovering the session.
You should view this feature as an optimization, which can help avoid expired
sessions. It does not remove the need to handle session expiry, which can still occur
if a machine fails unexpectedly, or even if an application is shut down gracefully but
does not restart before its session expires—for whatever reason.
As a general rule, the larger the ZooKeeper ensemble, the larger the session timeout
should be. Connection timeouts, read timeouts, and ping periods are all defined
internally as a function of the number of servers in the ensemble, so as the
ensemble grows, these periods decrease. Consider increasing the timeout if you
experience frequent connection loss. You can monitor ZooKeeper metrics—such as
request latency statistics—using JMX.
The ZooKeeper object transitions through different states in its lifecycle (see Figure
14-3). You can query its state at any time by using the getState() method:
public States getState()
States is an enum representing the different states that a ZooKeeper object may be
in. (Despite the enum’s name, an instance of ZooKeeper may only be in one state at
a time.) A newly constructed ZooKeeper instance is in the CONNECTING state,
while it tries to establish a connection with the ZooKeeper service. Once a
connection is established, it goes into the CONNECTED state.
Figure 14-3. ZooKeeper state transitions
A client using the ZooKeeper object can receive notifications of the state transitions
by registering a Watcher object. On entering the CONNECTED state, the watcher
receives a WatchedEvent whose KeeperState value is SyncConnected.
The ZooKeeper instance may disconnect and reconnect to the ZooKeeper service,
moving between the CONNECTED and CONNECTING states. If it disconnects, the
watcher receives a Disconnected event. Note that these state transitions are initiated
by the ZooKeeper instance itself, and it will automatically try to reconnect if the
connection is lost.
The ZooKeeper instance may transition to a third state, CLOSED, if either the close()
method is called or the session times out as indicated by a KeeperState of type
Expired. Once in the CLOSED state, the ZooKeeper object is no longer considered
to be alive (this can be tested using the isAlive() method on States) and cannot be
reused. To reconnect to the ZooKeeper service, the client must construct a new
ZooKeeper instance.
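The transitions described above can be summarized as a small state machine. This is a toy model for reasoning about the lifecycle, not the client library: ClientStateModel, its Trigger enum, and next() are hypothetical names, and only the three states discussed in the text are modelled.

```java
/** Toy model of the client lifecycle described in the text. */
final class ClientStateModel {
    enum State {
        CONNECTING, CONNECTED, CLOSED;

        /** A CLOSED client is no longer alive and cannot be reused. */
        boolean isAlive() {
            return this != CLOSED;
        }
    }

    enum Trigger { SYNC_CONNECTED, DISCONNECTED, EXPIRED, CLOSE }

    static State next(State current, Trigger trigger) {
        if (current == State.CLOSED) {
            return State.CLOSED; // terminal: a new instance must be constructed
        }
        switch (trigger) {
            case SYNC_CONNECTED:
                return State.CONNECTED;
            case DISCONNECTED:
                return State.CONNECTING; // the client retries automatically
            case EXPIRED:
            case CLOSE:
                return State.CLOSED;
            default:
                return current;
        }
    }
}
```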
Building Applications with ZooKeeper
Having covered ZooKeeper in some depth, let’s turn back to writing some useful
applications with it.
A Configuration Service
One of the most basic services that a distributed application needs is a configuration
service so that common pieces of configuration information can be shared by
machines in a cluster. At the simplest level, ZooKeeper can act as a highly available
store for configuration, allowing application participants to retrieve or update
configuration files. Using ZooKeeper watches, it is possible to create an active
configuration service, where interested clients are notified of changes in
configuration.
Let’s write such a service. We make a couple of assumptions that simplify the implementation (they could be removed with a little more work). First, the only
configuration values we need to store are strings, and keys are just znode paths, so
we use a znode to store each key-value pair. Second, there is a single client that
performs updates at any one time. Among other things, this model fits with the idea
of a master (such as the namenode in HDFS) that wishes to update information that
its workers need to follow.
We wrap the code up in a class called ActiveKeyValueStore:
public class ActiveKeyValueStore extends ConnectionWatcher {
  private static final Charset CHARSET = Charset.forName("UTF-8");

  public void write(String path, String value) throws InterruptedException,
      KeeperException {
    // Create the znode if it is absent; otherwise overwrite its data.
    Stat stat = zk.exists(path, false);
    if (stat == null) {
      zk.create(path, value.getBytes(CHARSET), Ids.OPEN_ACL_UNSAFE,
          CreateMode.PERSISTENT);
    } else {
      zk.setData(path, value.getBytes(CHARSET), -1); // -1 matches any version
    }
  }
}
The contract of the write() method is that a key with the given value is written to
ZooKeeper. It hides the difference between creating a new znode and updating an
ex-isting znode with a new value, by testing first for the znode using the exists
operation and then performing the appropriate operation. The other detail worth
mentioning is the need to convert the string value to a byte array, for which we just
use the getBytes() method with a UTF-8 encoding.
To illustrate the use of the ActiveKeyValueStore, consider a ConfigUpdater class that
updates a configuration property with a value. The listing appears in Example 14-6.
Example 14-6. An application that updates a property in ZooKeeper at random times
public class ConfigUpdater {
  public static final String PATH = "/config";

  private ActiveKeyValueStore store;
  private Random random = new Random();

  public ConfigUpdater(String hosts) throws IOException,
      InterruptedException {
    store = new ActiveKeyValueStore();
    store.connect(hosts); // connect() is inherited from ConnectionWatcher
  }

  public void run() throws InterruptedException, KeeperException {
    while (true) {
      String value = random.nextInt(100) + "";
      store.write(PATH, value);
      System.out.printf("Set %s to %s\n", PATH, value);
      TimeUnit.SECONDS.sleep(random.nextInt(10)); // pause for a random interval
    }
  }
  public static void main(String[] args) throws Exception {
    ConfigUpdater configUpdater = new ConfigUpdater(args[0]);
    configUpdater.run();
  }
}
The program is simple. A ConfigUpdater has an ActiveKeyValueStore that connects
to ZooKeeper in ConfigUpdater’s constructor. The run() method loops forever,
updating the /config znode at random times with random values.
Next, let’s look at how to read the /config configuration property. First, we add a read
method to ActiveKeyValueStore:
public String read(String path, Watcher watcher) throws
    InterruptedException, KeeperException {
  byte[] data = zk.getData(path, watcher, null/*stat*/);
  return new String(data, CHARSET);
}
The getData() method of ZooKeeper takes the path, a Watcher, and a Stat object.
The Stat object is filled in with values by getData(), and is used to pass information
back to the caller. In this way, the caller can get both the data and the metadata for a
znode, although in this case, we pass a null Stat because we are not interested in
the metadata. As a consumer of the service, ConfigWatcher (see Example 14-7)
creates an ActiveKeyValueStore, and after starting, calls the store’s read() method
(in its displayConfig() method) to pass a reference to itself as the watcher. It displays
the initial value of the configuration that it reads.
Example 14-7. An application that watches for updates of a property in ZooKeeper
and prints them to the console
public class ConfigWatcher implements Watcher {
  private ActiveKeyValueStore store;

  public ConfigWatcher(String hosts) throws IOException,
      InterruptedException {
    store = new ActiveKeyValueStore();
    store.connect(hosts); // connect() is inherited from ConnectionWatcher
  }

  public void displayConfig() throws InterruptedException, KeeperException {
    // Passing this as the watcher registers a fresh watch on every read.
    String value = store.read(ConfigUpdater.PATH, this);
    System.out.printf("Read %s as %s\n", ConfigUpdater.PATH, value);
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getType() == EventType.NodeDataChanged) {
      try {
        displayConfig();
      } catch (InterruptedException e) {
        System.err.println("Interrupted. Exiting.");
      } catch (KeeperException e) {
        System.err.printf("KeeperException: %s. Exiting.\n", e);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    ConfigWatcher configWatcher = new ConfigWatcher(args[0]);
    configWatcher.displayConfig();
    // stay alive until process is killed or thread is interrupted
    Thread.sleep(Long.MAX_VALUE);
  }
}
When the ConfigUpdater updates the znode, ZooKeeper causes the watcher to fire
with an event type of EventType.NodeDataChanged. ConfigWatcher acts on this
event in its process() method by reading and displaying the latest version of the
configuration.
Because watches are one-time signals, we tell ZooKeeper of the new watch each
time we call read() on ActiveKeyValueStore—this ensures we see future updates.
Furthermore, we are not guaranteed to receive every update, since between the
receipt of the watch event and the next read, the znode may have been updated,
possibly many times, and as the client has no watch registered during that period, it
is not notified. For the configuration service, this is not a problem because clients
care only about the latest value of a property, as it takes precedence over previous
values, but in general you should be aware of this potential limitation.
Let’s see the code in action. Launch the ConfigUpdater in one terminal window:
% java ConfigUpdater localhost
Set /config to 79
Set /config to 14
Set /config to 78
Then launch the ConfigWatcher in another window immediately afterward:
% java ConfigWatcher localhost
Read /config as 79
Read /config as 14
Read /config as 78
The Resilient ZooKeeper Application
The first of the Fallacies of Distributed Computing states that “The network is reliable.” As they stand, the programs so far have been assuming a reliable network, so
when they run on a real network, they can fail in several ways. Let’s examine
possible failure modes and what we can do to correct them so that our programs are
resilient in the face of failure.
Every ZooKeeper operation in the Java API declares two types of exception in its
throws clause: InterruptedException and KeeperException.
An InterruptedException is thrown if the operation is interrupted. There is a standard
Java mechanism for canceling blocking methods, which is to call interrupt() on the
thread from which the blocking method was called. A successful cancellation will
result in an InterruptedException. ZooKeeper adheres to this standard, so you can
cancel a ZooKeeper operation in this way. Classes or libraries that use ZooKeeper
should usually propagate the InterruptedException so that their clients can cancel
their operations.
An InterruptedException does not indicate a failure, but rather that the operation has
been canceled, so in the configuration application example, it is appropriate to propagate the exception, causing the application to terminate.
A KeeperException is thrown if the ZooKeeper server signals an error or if there is a
communication problem with the server. There are various subclasses of
KeeperException for different error cases. For example,
KeeperException.NoNodeException is a subclass of KeeperException that is thrown
if you try to perform an operation on a znode that doesn’t exist.
Every subclass of KeeperException has a corresponding code with information about
the type of error. For example, for KeeperException.NoNodeException the code is
KeeperException.Code.NONODE (an enum value).
There are two ways then to handle KeeperException: either catch KeeperException
and test its code to determine what remedying action to take, or catch the equivalent
KeeperException subclasses and perform the appropriate action in each catch block.
KeeperExceptions fall into three broad categories.
State exceptions. A state exception occurs when the operation fails because it
cannot be applied to the znode tree. State exceptions usually happen because
another process is mutating a znode at the same time. For example, a setData
operation with a version number will fail with a
KeeperException.BadVersionException if the znode is updated by another process
first, since the version number does not match. The programmer is usually aware
that this kind of conflict is possible and will code to deal with it.
Some state exceptions indicate an error in the program, such as
KeeperException.NoChildrenForEphemeralsException, which is thrown when trying to create a
child znode of an ephemeral znode.
Recoverable exceptions. Recoverable exceptions are those from which the
application can recover within the same ZooKeeper session. A recoverable
exception is manifested by KeeperException.ConnectionLossException, which
means that the connection to ZooKeeper has been lost. ZooKeeper will try to
reconnect, and in most cases the reconnection will succeed and ensure that the
session is intact.
However, ZooKeeper cannot tell whether the operation that failed with
KeeperException.ConnectionLossException was applied. This is an example of partial failure
(which we introduced at the beginning of the chapter). The onus is therefore on the
programmer to deal with the uncertainty, and the action that should be taken
depends on the application.
At this point, it is useful to make a distinction between idempotent and nonidempotent operations. An idempotent operation is one that may be applied one or more
times with the same result, such as a read request or an unconditional setData.
These can simply be retried.
A nonidempotent operation cannot be indiscriminately retried, as the effect of
applying it multiple times is not the same as applying it once. The program needs a
way of detecting whether its update was applied by encoding information in the
znode’s path name or its data. We shall discuss how to deal with failed
nonidempotent operations in “Recoverable exceptions”, when we look
at the implementation of a lock service.
Unrecoverable exceptions. In some cases, the ZooKeeper session becomes
invalid, perhaps because of a timeout or because the session was closed (both get
a KeeperException.SessionExpiredException), or perhaps because authentication
failed (KeeperException.AuthFailedException). In any case, all ephemeral nodes
associated with the session will be lost, so the application needs to rebuild its state
before reconnecting to ZooKeeper.
A reliable configuration service
Going back to the write() method in ActiveKeyValueStore, recall that it is composed
of an exists operation followed by either a create or a setData:
public void write(String path, String value) throws InterruptedException,
    KeeperException {
  Stat stat = zk.exists(path, false);
  if (stat == null) {
    zk.create(path, value.getBytes(CHARSET), Ids.OPEN_ACL_UNSAFE,
        CreateMode.PERSISTENT);
  } else {
    zk.setData(path, value.getBytes(CHARSET), -1);
  }
}
Taken as a whole, the write() method is idempotent, so we can afford to unconditionally retry it. Here’s a modified version of the write() method that retries in a loop.
It is set to try a maximum number of retries (MAX_RETRIES)
and sleeps for RETRY_PERIOD_SECONDS between each attempt:
public void write(String path, String value) throws
    InterruptedException, KeeperException {
  int retries = 0;
  while (true) {
    try {
      Stat stat = zk.exists(path, false);
      if (stat == null) {
        zk.create(path, value.getBytes(CHARSET), Ids.OPEN_ACL_UNSAFE,
            CreateMode.PERSISTENT);
      } else {
        zk.setData(path, value.getBytes(CHARSET), stat.getVersion());
      }
      return; // the write succeeded, so stop retrying
    } catch (KeeperException.SessionExpiredException e) {
      throw e; // the session cannot be recovered; let the caller start a new one
    } catch (KeeperException e) {
      if (retries++ == MAX_RETRIES) {
        throw e;
      }
      // sleep then retry
      TimeUnit.SECONDS.sleep(RETRY_PERIOD_SECONDS);
    }
  }
}
The code is careful not to retry KeeperException.SessionExpiredException, since
when a session expires, the ZooKeeper object enters the CLOSED state, from which
it can never reconnect (refer to Figure 14-3). We simply rethrow the exception and
let the caller create a new ZooKeeper instance, so that the whole write() method can
be retried. A simple way to create a new instance is to create a new ConfigUpdater
(which we’ve actually renamed ResilientConfigUpdater) to recover from an expired
session:
public static void main(String[] args) throws Exception {
  while (true) {
    try {
      ResilientConfigUpdater configUpdater =
          new ResilientConfigUpdater(args[0]);
      configUpdater.run();
    } catch (KeeperException.SessionExpiredException e) {
      // start a new session
    } catch (KeeperException e) {
      // already retried, so exit
      e.printStackTrace();
      break;
    }
  }
}
Another way of writing the code would be to have a single catch block, just for
KeeperException, and a test to see whether its code has the value
KeeperException.Code.SESSIONEXPIRED. Which method you use is a matter of
style, since they both behave in the same way.
An alternative way of dealing with session expiry would be to look for a KeeperState
of type Expired in the watcher (that would be the ConnectionWatcher in the example
here), and create a new connection when this is detected. This way, we would just
keep retrying in the write() method, even if we got a
KeeperException.SessionExpiredException, since the connection should eventually
be reestablished. Regardless of the precise mechanics of how we recover from an
expired session, the important point is that it is a different kind of failure from
connection loss and needs to be handled differently.
This is just one strategy for retry handling; there are many others, such as using
exponential backoff, where the period between retries is multiplied by a constant each
time. Hadoop Core includes a package of utilities for adding retry logic into your
code in a reusable way, and it may be helpful for building ZooKeeper applications.
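To make the exponential backoff idea concrete, here is a minimal sketch of how the sleep period could be computed for each retry. The class ExponentialBackoff and its parameters (an initial delay, a multiplier, and a cap on the delay) are hypothetical names chosen for this illustration; this is not the Hadoop retry utility mentioned above, and the cap is an extra assumption that keeps the delay from growing without bound:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that computes exponential backoff delays for a retry loop. */
final class ExponentialBackoff {
    private final long initialDelayMillis;
    private final double multiplier;
    private final long maxDelayMillis;

    ExponentialBackoff(long initialDelayMillis, double multiplier, long maxDelayMillis) {
        if (initialDelayMillis <= 0 || multiplier < 1.0 || maxDelayMillis < initialDelayMillis) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        this.initialDelayMillis = initialDelayMillis;
        this.multiplier = multiplier;
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Delay before retry number attempt (0-based), capped at maxDelayMillis. */
    long delayMillis(int attempt) {
        double delay = initialDelayMillis * Math.pow(multiplier, attempt);
        return (long) Math.min(delay, (double) maxDelayMillis);
    }

    /** The full list of delays for a bounded number of retries. */
    List<Long> schedule(int maxRetries) {
        List<Long> delays = new ArrayList<>();
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            delays.add(delayMillis(attempt));
        }
        return delays;
    }

    public static void main(String[] args) {
        ExponentialBackoff backoff = new ExponentialBackoff(1000, 2.0, 30000);
        System.out.println(backoff.schedule(6));
    }
}
```

With an initial delay of one second, a multiplier of 2, and a 30-second cap, the six delays are 1000, 2000, 4000, 8000, 16000, and 30000 milliseconds. In the retrying write() method, the fixed sleep would be replaced by a sleep of delayMillis(retries).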
A Lock Service
A distributed lock is a mechanism for providing mutual exclusion between a
collection of processes. At any one time, only a single process may hold the lock.
Distributed locks can be used for leader election in a large distributed system, where
the leader is the process that holds the lock at any point in time. znode has the
lowest sequence number. The ZooKeeper service is the arbiter of order, since it
assigns the sequence numbers.
The lock may be released simply by deleting the znode /leader/lock-1; alternatively, if
the client process dies, it will be deleted by virtue of it being an ephemeral znode.
The client that created /leader/lock-2 will then hold the lock, since it has the next
lowest sequence number. It will be notified that it has the lock by creating a watch
that fires when znodes go away.
The pseudocode for lock acquisition is as follows:
1. Create an ephemeral sequential znode named lock- under the lock znode and remember its actual path name (the return value of the create operation).
2. Get the children of the lock znode and set a watch.
3. If the path name of the znode created in 1 has the lowest number of the children returned in 2, then the lock has been acquired. Exit.
4. Wait for the notification from the watch set in 2 and go to step 2.
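The comparison at the heart of the third step can be sketched in plain Java. The class LockOrder below is a hypothetical helper written for this illustration; it performs no ZooKeeper calls and assumes that each child name ends in a hyphen followed by the sequence number that ZooKeeper appended (a zero-padded decimal counter), as in lock-0000000001:

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: decides whether our znode has the lowest sequence number. */
final class LockOrder {

    /** Parses the numeric suffix after the last '-', e.g. "lock-0000000002" gives 2. */
    static long sequenceNumber(String childName) {
        int dash = childName.lastIndexOf('-');
        return Long.parseLong(childName.substring(dash + 1));
    }

    /**
     * Returns true if no child has a lower sequence number than ours.
     * The children list is assumed to be what a getChildren call on the lock znode returned.
     */
    static boolean holdsLock(String myChild, List<String> children) {
        long mine = sequenceNumber(myChild);
        for (String child : children) {
            if (sequenceNumber(child) < mine) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("lock-0000000002", "lock-0000000001");
        System.out.println(holdsLock("lock-0000000001", children));
        System.out.println(holdsLock("lock-0000000002", children));
    }
}
```

The order in which the children are returned does not matter, which is why the sketch compares sequence numbers rather than list positions. A client for which holdsLock returns false would wait for the watch notification and then fetch the children again.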
A great strength of the Hadoop platform is its ability to work with data in several
different forms. HDFS can reliably store logs and other data from a plethora of
sources, and MapReduce programs can parse diverse ad hoc data formats,
extracting relevant information and combining multiple data sets into powerful
But to interact with data in storage repositories outside of HDFS, MapReduce
programs need to use external APIs to get to this data. Often, valuable data in an
organization is stored in relational database systems (RDBMS). Sqoop is an open-source tool that allows users to extract data from a relational database into Hadoop
for further processing. This processing can be done with MapReduce programs or
other higher-level tools such as Hive. (It’s even possible to use Sqoop to move data
from a relational database into HBase.) When the final results of an analytic pipeline
are available, Sqoop can export these results back to the database for consumption
by other clients.
In this chapter, we’ll take a look at how Sqoop works and how you can use it in your
data processing pipeline.
Getting Sqoop
Sqoop is available in a few places. The primary home of the project is This repository contains all the Sqoop source
code and documentation. Official releases are available at this site, as well as the
source code for the version currently under development. The repository itself
contains instructions for compiling the project. Alternatively, Cloudera’s Distribution
for Hadoop contains an installation package for Sqoop alongside compatible editions
of Hadoop and other tools like Hive.
If you download a release from Apache, it will be placed in a directory such as
/home/yourname/sqoop-x.y.z/. We’ll call this directory $SQOOP_HOME. You can
run Sqoop by running the executable script $SQOOP_HOME/bin/sqoop.
If you’ve installed a release from Cloudera, the package will have placed Sqoop’s
scripts in standard locations like /usr/bin/sqoop. You can run Sqoop by simply typing
sqoop at the command line. (Regardless of how you install Sqoop, we’ll refer to this
script as just sqoop from here on.)
Running Sqoop with no arguments does not do much of interest:
% sqoop
Try sqoop help for usage.
Sqoop is organized as a set of tools or commands. Without selecting a tool, Sqoop
does not know what to do. help is the name of one such tool; it can print out the list
of available tools, like this:
% sqoop help
usage: sqoop COMMAND [ARGS]
Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information
See 'sqoop help COMMAND' for information on a specific command.
As it explains, the help tool can also provide specific usage instructions on a
particular tool, by providing that tool’s name as an argument:
% sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]
Common arguments:
  --connect <jdbc-uri>     Specify JDBC connect string
  --driver <class-name>    Manually specify JDBC driver class to use
  --hadoop-home <dir>      Override $HADOOP_HOME
  --help                   Print usage instructions
  -P                       Read password from console
  --password <password>    Set authentication password
  --username <username>    Set authentication username
  --verbose                Print more information while working
An alternate way of running a Sqoop tool is to use a tool-specific script. This script
will be named sqoop-toolname. For example, sqoop-help, sqoop-import, etc. These
commands are identical to running sqoop help or sqoop import.
A Sample Import
After you install Sqoop, you can use it to import data to Hadoop.
Sqoop imports from databases. The list of databases that it has been tested with
includes MySQL, PostgreSQL, Oracle, SQL Server and DB2. For the examples in
this chapter we’ll use MySQL, which is easy-to-use and available for a large number
of platforms.
To install and configure MySQL, follow the documentation at
doc/refman/5.1/en/. Chapter 2 (“Installing and Upgrading MySQL”) in particular
should help. Users of Debian-based Linux systems (e.g., Ubuntu) can type sudo apt-get install mysql-client mysql-server. RedHat users can type sudo yum install mysql
Now that MySQL is installed, let’s log in and create a database (Example 15-1).
Example 15-1. Creating a new MySQL database schema
% mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 349
Server version: 5.1.37-1ubuntu5.4 (Ubuntu)
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> CREATE DATABASE hadoopguide;
Query OK, 1 row affected (0.02 sec)
mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO '%'@'localhost';
Query OK, 0 rows affected (0.00 sec)
mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO ''@'localhost';
Query OK, 0 rows affected (0.00 sec)
mysql> quit;
The password prompt above asks for your root user password. This is likely the
same as the password for the root shell login. If you are running Ubuntu or another
variant of Linux where root cannot directly log in, then enter the password you picked
at MySQL installation time.
In this session, we created a new database schema called hadoopguide, which we’ll
use throughout this chapter. We then allowed any local user to view and modify the
contents of the hadoopguide schema, and closed our session. Now let’s log back into
the database (not as root, but as yourself this time), and create a table to import into
HDFS (Example 15-2).
Example 15-2. Populating the database
% mysql hadoopguide
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 352
Server version: 5.1.37-1ubuntu5.4 (Ubuntu)
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> CREATE TABLE widgets(id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    -> widget_name VARCHAR(64) NOT NULL,
    -> price DECIMAL(10,2),
    -> design_date DATE,
    -> version INT,
    -> design_comment VARCHAR(100));
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'sprocket', 0.25, '2010-02-10',
    -> 1, 'Connects two gizmos');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'gizmo', 4.00, '2009-11-30', 4,
    -> NULL);
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'gadget', 99.99, '1983-08-13',
    -> 13, 'Our flagship product');
Query OK, 1 row affected (0.00 sec)
mysql> quit;
In the above listing, we created a new table called widgets. We’ll be using this
fictional product database in further examples in this chapter. The widgets table
contains several fields representing a variety of data types.
Now let’s use Sqoop to import this table into HDFS:
% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1
10/06/23 14:44:18 INFO tool.CodeGenTool: Beginning code generation
10/06/23 14:44:20 INFO mapred.JobClient: Running job: job_201006231439_0002
10/06/23 14:44:21 INFO mapred.JobClient:  map 0% reduce 0%
10/06/23 14:44:32 INFO mapred.JobClient:  map 100% reduce 0%
10/06/23 14:44:34 INFO mapred.JobClient: Job complete: job_201006231439_0002
Sqoop’s import tool will run a MapReduce job that connects to the MySQL database
and reads the table. By default, this will use four map tasks in parallel to speed up
the import process. Each task will write its imported results to a different file, but all in
a common directory. Since we knew that we had only three rows to import in this example, we specified that Sqoop should use a single map task (-m 1) so we get a
single file in HDFS.
We can inspect this file’s contents like so:
% hadoop fs -cat widgets/part-m-00000
1,sprocket,0.25,2010-02-10,1,Connects two gizmos
2,gizmo,4.00,2009-11-30,4,null
3,gadget,99.99,1983-08-13,13,Our flagship product
The connect string (jdbc:mysql://localhost/hadoopguide) shown in the example will
read from a database on the local machine. If a distributed Hadoop cluster is being
used, then localhost should not be specified in the connect string; map tasks not
running on the same machine as the database will fail to connect. Even if Sqoop is
run from the same host as the database server, the full hostname should be specified.
By default, Sqoop will generate comma-delimited text files for our imported data. Delimiters can be explicitly specified, as well as field enclosing and escape characters
to allow the presence of delimiters in the field contents. The command-line
arguments that specify delimiter characters, file formats, compression, and more
fine-grained control of the import process are described in the Sqoop User Guide
distributed with Sqoop, as well as in the online help (sqoop help import, or man
sqoop-import in CDH).
Text and binary file formats
Sqoop is capable of importing into a few different file formats. Text files (the default)
offer a human-readable representation of data, platform independence, and the
simplest structure. However, they cannot hold binary fields (such as database
columns of type VARBINARY) and cannot distinguish between null values and
String-based fields containing the value "null".
To handle these conditions, you can use either Sqoop’s SequenceFile-based
format, or its Avro-based format. Both Avro data files and SequenceFiles provide the
most precise representation of the imported data possible. They also allow data to be
compressed while retaining MapReduce’s ability to process different sections of the
same file in parallel. However, current versions of Sqoop cannot load either Avro or
SequenceFiles into Hive (although you can load Avro data files into Hive manually). A
final disadvantage of SequenceFiles is that they are Java-specific, whereas Avro
data files can be processed by a wide range of languages.
Generated Code
In addition to writing the contents of the database table to HDFS, Sqoop has also
provided you with a generated Java source file (widgets.java) written to the current
local directory. (After running the sqoop import command above, you can see this file
by running ls widgets.java.)
Code generation is a necessary part of Sqoop’s import process; as you’ll learn in
“Database Imports: A Deeper Look” on page 531, Sqoop uses generated code to
handle the deserialization of table-specific data from the database source before
writing it to HDFS.
The generated class (widgets) is capable of holding a single record retrieved from
the imported table. It can manipulate such a record in MapReduce or store it in a
SequenceFile in HDFS. (SequenceFiles written by Sqoop during the import
process will store each imported row in the “value” element of the SequenceFile’s
key-value pair format, using the generated class.)
It is likely that you don’t want to name your generated class widgets since each
instance of the class refers to only a single record. We can use a different Sqoop
tool to generate source code without performing an import; this generated code will
still examine the database table to determine the appropriate data types for each
field:
% sqoop codegen --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets --class-name Widget
The codegen tool simply generates code; it does not perform the full import. We
specified that we’d like it to generate a class named Widget; this will be written to
Widget.java. We also could have specified --class-name and other code generation
arguments during the import process we performed earlier. This tool can be used to
regenerate code if you accidentally remove the source file, or to generate code with
different settings than were used during the import.
If you’re working with records imported to SequenceFiles, it is inevitable that you’ll
need to use the generated classes (to deserialize data from the SequenceFile
storage). You can work with text file-based records without using generated code,
but as we’ll see in “Working with Imported Data”, Sqoop’s generated code can handle some tedious aspects of data processing for you.
Additional Serialization Systems
Recent versions of Sqoop support Avro-based serialization and schema generation
as well, allowing you to use Sqoop in your project without integrating with generated
code.
Database Imports: A Deeper Look
As mentioned earlier, Sqoop imports a table from a database by running a
MapReduce job that extracts rows from the table, and writes the records to HDFS.
How does MapReduce read the rows? This section explains how Sqoop works
under the hood.
At a high level, Figure 15-1 demonstrates how Sqoop interacts with both the
database source and Hadoop. Like Hadoop itself, Sqoop is written in Java. Java
provides an API called Java Database Connectivity, or JDBC, that allows
applications to access data stored in an RDBMS as well as inspect the nature of this
data. Most database vendors provide a JDBC driver that implements the JDBC API
and contains the necessary code to connect to their database server.
Figure 15-1. Sqoop’s import process
Based on the URL in the connect string used to access the database, Sqoop
attempts to predict which driver it should load. You may still need to download the
JDBC driver itself and install it on your Sqoop client. For cases where Sqoop does
not know which JDBC driver is appropriate, users can specify exactly how to load
the JDBC driver into Sqoop. This capability allows Sqoop to work with a wide variety
of database platforms.
Before the import can start, Sqoop uses JDBC to examine the table it is to import. It
retrieves a list of all the columns and their SQL data types. These SQL types
(VARCHAR, INTEGER, and so on) can then be mapped to Java data types (String,
Integer, etc.), which will hold the field values in MapReduce applications. Sqoop’s
code generator will use this information to create a table-specific class to hold a
record extracted from the table.
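The mapping step can be pictured as a lookup from JDBC type codes to Java type names. The following is a minimal sketch covering only the column types of the widgets table; the class SqlTypeMapping is a hypothetical name for this illustration and is not Sqoop’s own code generator, which handles many more types:

```java
import java.sql.Types;

/** Hypothetical sketch of a SQL-to-Java type lookup; not Sqoop's actual implementation. */
final class SqlTypeMapping {

    /**
     * Returns the Java type name used to hold a column of the given JDBC type,
     * or null if this sketch does not cover the type.
     */
    static String javaTypeFor(int sqlType) {
        switch (sqlType) {
            case Types.INTEGER:
                return "Integer";
            case Types.VARCHAR:
                return "String";
            case Types.DECIMAL:
                return "java.math.BigDecimal";
            case Types.DATE:
                return "java.sql.Date";
            default:
                return null; // type not covered by this sketch
        }
    }

    public static void main(String[] args) {
        System.out.println("price column (DECIMAL) maps to " + javaTypeFor(Types.DECIMAL));
        System.out.println("design_date column (DATE) maps to " + javaTypeFor(Types.DATE));
    }
}
```

The type codes come from java.sql.Types, which is the same set of constants that JDBC metadata calls report for each column.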
The Widget class from earlier, for example, contains the following methods that
retrieve each column from an extracted record:
public Integer get_id();
public String get_widget_name();
public java.math.BigDecimal get_price();
public java.sql.Date get_design_date();
public Integer get_version();
public String get_design_comment();
More critical to the import system’s operation, though, are the serialization methods
that form the DBWritable interface, which allow the Widget class to interact with
JDBC:
public void readFields(ResultSet __dbResults) throws SQLException;
public void write(PreparedStatement __dbStmt) throws SQLException;
JDBC’s ResultSet interface provides a cursor that retrieves records from a query;
the readFields() method here will populate the fields of the Widget object with the
columns from one row of the ResultSet’s data. The write() method shown above
allows Sqoop to insert new Widget rows into a table, a process called exporting.
Exports are discussed in “Performing an Export”.
The MapReduce job launched by Sqoop uses an InputFormat that can read sections
of a table from a database via JDBC. The DataDrivenDBInputFormat provided with
Hadoop partitions a query’s results over several map tasks.
Reading a table is typically done with a simple query such as:
SELECT col1,col2,col3,... FROM tableName
But often, better import performance can be gained by dividing this query across
multiple nodes. This is done using a splitting column. Using metadata about the
table, Sqoop will guess a good column to use for splitting the table (typically the
primary key for the table, if one exists). The minimum and maximum values for the
primary key column are retrieved, and then these are used in conjunction with a
target number of tasks to determine the queries that each map task should issue.
For example, suppose the widgets table had 100,000 entries, with the id column
containing values 0 through 99,999. When importing this table, Sqoop would
determine that id is the primary key column for the table. When starting the
MapReduce job, the DataDrivenDBInputFormat used to perform the import would
then issue a statement such as SELECT MIN(id), MAX(id) FROM widgets. These
values would then be used to interpolate over the entire range of data. Assuming we
specified that 5 map tasks should run in parallel (with -m 5), this would result in each
map task executing queries such as: SELECT id, widget_name, ... FROM widgets
WHERE id >= 0 AND id < 20000, SELECT id, widget_name, ... FROM widgets
WHERE id >= 20000 AND id < 40000, and so on.
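The interpolation described above amounts to simple arithmetic over the minimum and maximum values. Here is a minimal sketch that divides an inclusive integer range into contiguous half-open intervals; the class SplitQueries is a hypothetical name, and the real DataDrivenDBInputFormat is more general (it handles other column types and may place boundaries differently):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of range splitting; not the DataDrivenDBInputFormat implementation. */
final class SplitQueries {

    /**
     * Divides the inclusive range [min, max] of an integer column into at most
     * numSplits contiguous intervals and returns one WHERE condition per interval.
     * Assumes the range is small enough that span * numSplits does not overflow a long.
     */
    static List<String> whereClauses(String column, long min, long max, int numSplits) {
        if (numSplits < 1 || max < min) {
            throw new IllegalArgumentException("invalid range or split count");
        }
        List<String> clauses = new ArrayList<>();
        long span = max - min + 1; // number of values in the inclusive range
        for (int i = 0; i < numSplits; i++) {
            long lo = min + span * i / numSplits;
            long hi = min + span * (i + 1) / numSplits; // exclusive upper bound
            if (lo == hi) {
                continue; // more splits requested than there are values
            }
            clauses.add(column + " >= " + lo + " AND " + column + " < " + hi);
        }
        return clauses;
    }

    public static void main(String[] args) {
        for (String clause : whereClauses("id", 0, 99999, 5)) {
            System.out.println("SELECT id, widget_name, ... FROM widgets WHERE " + clause);
        }
    }
}
```

For the range 0 through 99,999 and five tasks, the first condition is id >= 0 AND id < 20000 and the last is id >= 80000 AND id < 100000. The intervals are equal in width, not in row count, which is why the distribution of values in the splitting column matters.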
The choice of splitting column is essential to efficiently parallelizing work. If the id
column were not uniformly distributed (perhaps there are no widgets with IDs
between 50,000 and 75,000), then some map tasks may have little or no work to
perform, whereas others have a great deal. Users can specify a particular splitting
column when running an import job, to tune the job to the data’s actual distribution. If
an import job is run as a single (sequential) task with -m 1, then this split process is
not performed.
After generating the deserialization code and configuring the InputFormat, Sqoop
sends the job to the MapReduce cluster. Map tasks execute the queries and
deserialize rows from the ResultSet into instances of the generated class, which are
either stored directly in SequenceFiles or transformed into delimited text before
being written to HDFS.
Controlling the Import
Sqoop does not need to import an entire table at a time. For example, a subset of
the table’s columns can be specified for import. Users can also specify a WHERE
clause to include in queries, which bounds the rows of the table to import. For
example, if widgets 0 through 99,999 were imported last month, but this month our
vendor catalog included 1,000 new types of widget, an import could be configured
with the clause WHERE id >= 100000; this will start an import job retrieving all the
new rows added to the source database since the previous import run. User-supplied
WHERE clauses are applied before task splitting is performed, and are pushed down
into the queries executed by each task.
Imports and Consistency
When importing data to HDFS, it is important that you ensure access to a consistent
snapshot of the source data. Map tasks reading from a database in parallel are
running in separate processes. Thus, they cannot share a single database
transaction. The best way to do this is to ensure that any processes that update
existing rows of a table are disabled during the import.
Direct-mode Imports
Sqoop’s architecture allows it to choose from multiple available strategies for
performing an import. Most databases will use the DataDrivenDBInputFormat-based
approach described above. Some databases offer specific tools designed to
extract data quickly. For example, MySQL’s mysqldump application can read from a
table with greater throughput than a JDBC channel. The use of these external tools
is referred to as direct mode in Sqoop’s documentation. Direct mode must be
specifically enabled by the user (via the --direct argument), as it is not as
general-purpose as the JDBC approach. (For example, MySQL’s direct mode cannot handle
large objects, that is, CLOB or BLOB columns, as Sqoop needs to use a JDBC-specific
API to load these columns into HDFS.)
For databases that provide such tools, Sqoop can use these to great effect. A direct-mode import from MySQL is usually much more efficient (in terms of map tasks and
time required) than a comparable JDBC-based import. Sqoop will still launch
multiple map tasks in parallel. These tasks will then spawn instances of the
mysqldump program and read its output. The effect is similar to a distributed
implementation of mk-parallel-dump from the Maatkit tool set. Sqoop can also
perform direct-mode imports from PostgreSQL.
Even when direct mode is used to access the contents of a database, the metadata
is still queried through JDBC.
Working with Imported Data
Once data has been imported to HDFS, it is ready for processing by custom
MapReduce programs. Text-based imports can be easily used in scripts run with
Hadoop Streaming or in MapReduce jobs run with the default TextInputFormat.
To use individual fields of an imported record, though, the field delimiters (and any
escape/enclosing characters) must be parsed and the field values extracted and
converted to the appropriate data types. For example, the id of the “sprocket” widget
is represented as the string "1" in the text file, but should be parsed into an Integer or
int variable in Java. The generated table class provided by Sqoop can automate this
process, allowing you to focus on the actual MapReduce job to run. Each auto-generated class has several overloaded methods named parse() that operate on the
data represented as Text, CharSequence, char[], or other common types.
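To show the kind of work that the generated parse() methods take off your hands, here is a hand-written sketch that converts one line of the text import into typed fields. The class WidgetRecord is hypothetical and is not the code that Sqoop generates; it assumes comma delimiters with no escaping or enclosing characters, and it treats the literal text null as a missing value:

```java
import java.math.BigDecimal;
import java.sql.Date;

/** Hypothetical hand-written parser for one imported widgets line; not Sqoop-generated code. */
final class WidgetRecord {
    final Integer id;
    final String widgetName;
    final BigDecimal price;
    final Date designDate;
    final Integer version;
    final String designComment;

    private WidgetRecord(Integer id, String widgetName, BigDecimal price,
                         Date designDate, Integer version, String designComment) {
        this.id = id;
        this.widgetName = widgetName;
        this.price = price;
        this.designDate = designDate;
        this.version = version;
        this.designComment = designComment;
    }

    /** Maps the text "null" to a Java null; a real string "null" is indistinguishable. */
    private static String nullable(String field) {
        return "null".equals(field) ? null : field;
    }

    /** Splits a comma-delimited line (no escaping assumed) and converts each field. */
    static WidgetRecord parse(String line) {
        String[] f = line.split(",", -1);
        if (f.length != 6) {
            throw new IllegalArgumentException("expected 6 fields: " + line);
        }
        String id = nullable(f[0]);
        String price = nullable(f[2]);
        String date = nullable(f[3]);
        String version = nullable(f[4]);
        return new WidgetRecord(
            id == null ? null : Integer.valueOf(id),
            nullable(f[1]),
            price == null ? null : new BigDecimal(price),
            date == null ? null : Date.valueOf(date),
            version == null ? null : Integer.valueOf(version),
            nullable(f[5]));
    }

    public static void main(String[] args) {
        WidgetRecord w = parse("1,sprocket,0.25,2010-02-10,1,Connects two gizmos");
        System.out.println(w.id + " " + w.widgetName + " " + w.price + " " + w.designDate);
    }
}
```

Even this small sketch has to make decisions about nulls, delimiters, and type conversion for every column, and it would break on a comment containing a comma. Using the generated class avoids maintaining such code by hand for each table.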
The MapReduce application called MaxWidgetId (available in the example code) will
find the widget with the highest ID.
The class can be compiled into a JAR file along with Widget.java. Both Hadoop (hadoop-core-version.jar) and Sqoop (sqoop-version.jar) will need to be on the
classpath for compilation. The class files can then be combined into a JAR file and
executed like so:
jar cvvf widgets.jar *.class
HADOOP_CLASSPATH=/usr/lib/sqoop/sqoop-version.jar hadoop jar \
> widgets.jar MaxWidgetId -libjars /usr/lib/sqoop/sqoop-version.jar
This command line ensures that Sqoop is on the classpath locally (via
$HADOOP_CLASSPATH) when the MaxWidgetId driver runs, as well as
when map tasks are running on the cluster (via the -libjars argument).
When run, the maxwidgets path in HDFS will contain a file named part-r-00000 with
the following expected result:
3,gadget,99.99,1983-08-13,13,Our flagship product
It is worth noting that in this example MapReduce program, a Widget object was
emitted from the mapper to the reducer; the auto-generated Widget class
implements the Writable interface provided by Hadoop, which allows the object to be
sent via Hadoop’s serialization mechanism, as well as written to and read from
SequenceFiles.
The MaxWidgetId example is built on the new MapReduce API. MapReduce
applications that rely on Sqoop-generated code can be built on the new or old APIs,
though some advanced features (such as working with large objects) are more
convenient to use in the new API.
With the generic Avro mapping the MapReduce program does not need to use
schema-specific generated code (although this is an option too, by using Avro’s
specific compiler—Sqoop does not do the code generation in this case). The
example code includes a program called MaxWidgetIdGenericAvro, which finds the
widget with the highest ID and writes out the result in an Avro data file.
Imported Data and Hive
As noted in Chapter 12, for many types of analysis, using a system like Hive to
handle relational operations can dramatically ease the development of the analytic
pipeline. Especially for data originally from a relational data source, using Hive
makes a lot of sense. Hive and Sqoop together form a powerful toolchain for
performing analysis.
Suppose we had another log of data in our system, coming from a web-based widget
purchasing system. This may return log files containing a widget id, a quantity, a
shipping address, and an order date.
Here is a snippet from an example log of this type:
1,15,120 Any St.,Los Angeles,CA,90210,2010-08-01
3,4,120 Any St.,Los Angeles,CA,90210,2010-08-01
2,5,400 Some Pl.,Cupertino,CA,95014,2010-07-30
2,7,88 Mile Rd.,Manhattan,NY,10005,2010-07-18
By using Hadoop to analyze this purchase log, we can gain insight into our sales
operation. By combining this data with the data extracted from our relational data
source (the widgets table), we can do better. In this example session, we will
compute which zip code is responsible for the most sales dollars, so we can better
focus our sales team’s operations. Doing this requires data from both the sales log
and the widgets table.
The above table should be in a local file named sales.log for this to work.
First, let’s load the sales data into Hive:
hive> CREATE TABLE sales(widget_id INT, qty INT,
    > street STRING, city STRING, state STRING,
    > zip INT, sale_date STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Time taken: 5.248 seconds
hive> LOAD DATA LOCAL INPATH "sales.log" INTO TABLE sales;
Copying data from file:/home/sales.log
Loading data to table sales
Time taken: 0.188 seconds
Sqoop can generate a Hive table based on a table from an existing relational data
source. Since we’ve already imported the widgets data to HDFS, we can generate
the Hive table definition and then load in the HDFS-resident data:
% sqoop create-hive-table --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets --fields-terminated-by ','
10/06/23 18:05:34 INFO hive.HiveImport: OK
10/06/23 18:05:34 INFO hive.HiveImport: Time taken: 3.22 seconds
10/06/23 18:05:35 INFO hive.HiveImport: Hive import complete.
% hive
hive> LOAD DATA INPATH "widgets" INTO TABLE widgets;
Loading data to table widgets
Time taken: 3.265 seconds
When creating a Hive table definition with a specific already-imported dataset in
mind, we need to specify the delimiters used in that dataset. Otherwise, Sqoop will
allow Hive to use its default delimiters (which are different from Sqoop’s default
delimiters).
Hive’s type system is less rich than that of most SQL systems. Many SQL types do
not have direct analogues in Hive. When Sqoop generates a Hive table definition for
an import, it uses the best Hive type available to hold a column’s values. This may
result in a decrease in precision. When this occurs, Sqoop will provide you with a
warning message, such as this one:
10/06/23 18:09:36 WARN hive.TableDefWriter: Column design_date had to be cast to a less precise type in Hive
This three-step process of importing data to HDFS, creating the Hive table, and then
loading the HDFS-resident data into Hive can be shortened to one step if you know
that you want to import straight from a database directly into Hive. During an import,
Sqoop can generate the Hive table definition and then load in the data. Had we not
already performed the import, we could have executed this command, which recreates the widgets table in Hive, based on the copy in MySQL:
% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1 --hive-import
The sqoop import tool run with the --hive-import argument will load the data directly
from the source database into Hive; it infers a Hive schema automatically based on
the schema for the table in the source database. Using this, you can get started
working with your data in Hive with only one command.
Regardless of which data import route we chose, we can now use the widgets data
set and the sales data set together to calculate the most profitable zip code. Let’s do
so, and also save the result of this query in another table for later:
hive> CREATE TABLE zip_profits (sales_vol DOUBLE, zip INT);
hive> INSERT OVERWRITE TABLE zip_profits
    > SELECT SUM(w.price * s.qty) AS sales_vol, s.zip FROM SALES s
    > JOIN widgets w ON (s.widget_id = w.id) GROUP BY s.zip;
3 Rows loaded to zip_profits
hive> SELECT * FROM zip_profits ORDER BY sales_vol DESC;
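As a cross-check on what the join and aggregation compute, the same arithmetic can be worked through in memory using the three widgets and four sales records shown earlier in this chapter. This is a minimal sketch only; the class ZipProfits is a hypothetical name, it uses BigDecimal where the Hive table uses DOUBLE, and it sorts by zip code rather than by sales volume:

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory version of the sales-by-zip join; not how Hive executes it. */
final class ZipProfits {

    /**
     * Joins each sale {widgetId, qty, zip} to its widget price and sums
     * price * qty per zip code. Sales with no matching widget are dropped,
     * as in an inner join.
     */
    static Map<Integer, BigDecimal> salesByZip(Map<Integer, BigDecimal> priceById, int[][] sales) {
        Map<Integer, BigDecimal> totals = new TreeMap<>();
        for (int[] sale : sales) {
            BigDecimal price = priceById.get(sale[0]);
            if (price == null) {
                continue;
            }
            BigDecimal amount = price.multiply(BigDecimal.valueOf(sale[1]));
            totals.merge(sale[2], amount, BigDecimal::add);
        }
        return totals;
    }

    /** Runs the computation over the sample widgets and sales log from this chapter. */
    static Map<Integer, BigDecimal> sampleTotals() {
        Map<Integer, BigDecimal> priceById = new HashMap<>();
        priceById.put(1, new BigDecimal("0.25"));  // sprocket
        priceById.put(2, new BigDecimal("4.00"));  // gizmo
        priceById.put(3, new BigDecimal("99.99")); // gadget
        int[][] sales = {
            {1, 15, 90210}, {3, 4, 90210}, {2, 5, 95014}, {2, 7, 10005}
        };
        return salesByZip(priceById, sales);
    }

    public static void main(String[] args) {
        System.out.println(sampleTotals());
    }
}
```

For the sample data, zip code 90210 accounts for 15 × 0.25 + 4 × 99.99 = 403.71, zip code 10005 for 28.00, and zip code 95014 for 20.00, so 90210 is the most profitable.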
Importing Large Objects
Most databases provide the capability to store large amounts of data in a single field.
Depending on whether this data is textual or binary in nature, it is usually
represented as a CLOB or BLOB column in the table. These “large objects” are often
handled specially by the database itself. In particular, most tables are physically laid
out on disk as in Figure 15-2. When scanning through rows to determine which rows
match the criteria for a particular query, this typically involves reading all columns of
each row from disk. If large objects were stored “inline” in this fashion, they would
adversely affect the performance of such scans. Therefore, large objects are often
stored externally from their rows, as in Figure 15-3. Accessing a large object often
requires “opening” it through the reference contained in the row.
Figure 15-2. Database tables are typically physically represented as an array of
rows, with all the columns in a row stored adjacent to one another
The difficulty of working with large objects in a database suggests that a system
such as Hadoop, which is much better suited to storing and processing large,
complex data objects, is an ideal repository for such information. Sqoop can extract
large objects from tables and store them in HDFS for further processing. As in a
database, MapReduce typically materializes every record before passing it along to
the mapper. If individual records are truly large, this can be very inefficient.
As shown earlier, records imported by Sqoop are laid out on disk in a fashion very
similar to a database’s internal structure: an array of records with all fields of a
record concatenated together. When running a MapReduce program over imported
records, each map task must fully materialize all fields of each record in its input
split. If the contents of a large object field are only relevant for a small subset of the
total number of records used as input to a MapReduce program, it would be
inefficient to fully materialize all these records. Furthermore, depending on the size
of the large object, full materialization in memory may be impossible.
Figure 15-3. Large objects are usually held in a separate area of storage; the
main row storage contains indirect references to the large objects
To overcome these difficulties, Sqoop will store imported large objects in a separate
file called a LobFile. The LobFile format can store individual records of very large
size (a 64-bit address space is used). Each record in a LobFile holds a single large
object. The LobFile format allows clients to hold a reference to a record without
accessing the record contents. When records are accessed, this is done through a java.io.InputStream (for binary objects) or java.io.Reader (for character-based objects).
When a record is imported, the “normal” fields will be materialized together in a text
file, along with a reference to the LobFile where a CLOB or BLOB column is stored.
For example, suppose our widgets table contained a BLOB field named schematic
holding the actual schematic diagram for each widget.
An imported record might then look like:
The externalLob(...) text is a reference to an externally stored large object, stored in
LobFile format (lf) in a file named lobfile0, with the specified byte offset and length
inside that file.
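To make the parts of such a reference concrete, here is a minimal sketch that splits an externalLob(...) string into format, file name, offset, and length. The LobReference class and the sample offset and length values are hypothetical and chosen only for illustration; this is not Sqoop's own BlobRef or ClobRef code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Sqoop: splits an externalLob(...) reference
// into the four parts described in the text.
class LobReference {
  private static final Pattern REFERENCE =
      Pattern.compile("externalLob\\((\\w+),([^,]+),(\\d+),(\\d+)\\)");

  final String format;  // storage format code, "lf" for LobFile
  final String file;    // file that holds the large object
  final long offset;    // byte offset of the object inside the file
  final long length;    // length of the object in bytes

  private LobReference(String format, String file, long offset, long length) {
    this.format = format;
    this.file = file;
    this.offset = offset;
    this.length = length;
  }

  static LobReference parse(String text) {
    Matcher m = REFERENCE.matcher(text.trim());
    if (!m.matches()) {
      throw new IllegalArgumentException("Not an external LOB reference: " + text);
    }
    return new LobReference(m.group(1), m.group(2),
        Long.parseLong(m.group(3)), Long.parseLong(m.group(4)));
  }

  public static void main(String[] args) {
    // The offset (100) and length (5011714) are made-up sample values.
    LobReference ref = parse("externalLob(lf,lobfile0,100,5011714)");
    System.out.println(ref.file + " offset=" + ref.offset + " length=" + ref.length);
  }
}
```

Holding only these four values is what lets a record refer to a large object without reading its contents.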
When working with this record, the Widget.get_schematic() method would return an
object of type BlobRef referencing the schematic column, but not actually containing
its contents. The BlobRef.getDataStream() method actually opens the LobFile and
returns an InputStream allowing you to access the schematic field’s contents.
When running a MapReduce job processing many Widget records, you might need
to access the schematic field of only a handful of records. This system allows you to
incur the I/O costs of accessing only the required large object entries, as individual
schematics may be several megabytes or more of data.
The BlobRef and ClobRef classes cache references to underlying LobFiles within a
map task. If you do access the schematic field of several sequentially ordered
records, they will take advantage of the existing file pointer’s alignment on the next
record body.
Performing an Export
In Sqoop, an import refers to the movement of data from a database system into
HDFS. By contrast, an export uses HDFS as the source of data and a remote
database as the destination. In the previous sections, we imported some data and
then performed some analysis using Hive. We can export the results of this analysis
to a database for consumption by other tools.
Before exporting a table from HDFS to a database, we must prepare the database to
receive the data by creating the target table. While Sqoop can infer which Java types
are appropriate to hold SQL data types, this translation does not work in both
directions (for example, there are several possible SQL column definitions that can
hold data in a Java String; this could be CHAR(64), VARCHAR(200), or something
else entirely). Consequently, you must determine which types are most appropriate.
We are going to export the zip_profits table from Hive. We need to create a table in
MySQL that has target columns in the same order, with the appropriate SQL types:
% mysql hadoopguide
mysql> CREATE TABLE sales_by_zip (volume DECIMAL(8,2), zip INTEGER);
Query OK, 0 rows affected (0.01 sec)
Then we run the export command:
% sqoop export --connect jdbc:mysql://localhost/hadoopguide -m 1 \
--table sales_by_zip --export-dir /user/hive/warehouse/zip_profits \
--input-fields-terminated-by '\0001'
10/07/02 16:16:50 INFO mapreduce.ExportJobBase: Transferred 41 bytes in
10.8947 seconds (3.7633 bytes/sec)
10/07/02 16:16:50 INFO mapreduce.ExportJobBase: Exported 3 records.
Finally, we can verify that the export worked by checking MySQL:
% mysql hadoopguide -e 'SELECT * FROM sales_by_zip'
+--------+-------+
| volume | zip   |
+--------+-------+
|    ... | 10005 |
| 403.71 | 90210 |
|    ... | 95014 |
+--------+-------+
When we created the zip_profits table in Hive, we did not specify any delimiters. So
Hive used its default delimiters: a Ctrl-A character (Unicode 0x0001) between
fields, and a newline at the end of each record. When we used Hive to access the
contents of this table (in a SELECT statement), Hive converted this to a tab-delimited
representation for display on the console. But when reading the tables directly from
files, we need to tell Sqoop which delimiters to use. Sqoop assumes records are
newline-delimited by default, but needs to be told about the Ctrl-A field delimiters.
The --input-fields-terminated-by argument to sqoop export specified this information.
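The effect of the default Hive delimiter can be seen in a few lines of plain Java. This is only a sketch: the HiveTextRecord class is hypothetical and the sample row uses made-up values for the two columns.

```java
// Hypothetical helper: splits one line of a Hive text table that uses the
// default Ctrl-A (code point 1) field delimiter.
class HiveTextRecord {
  static final String FIELD_DELIMITER = "\u0001";

  static String[] fields(String line) {
    // The limit of -1 keeps trailing empty fields instead of dropping them.
    return line.split(FIELD_DELIMITER, -1);
  }

  public static void main(String[] args) {
    // Sample row with made-up values for the sales_vol and zip columns.
    String line = "12.50" + FIELD_DELIMITER + "10001";
    String[] parts = fields(line);
    System.out.println("sales_vol=" + parts[0] + ", zip=" + parts[1]);
  }
}
```

A reader that assumed a tab or comma delimiter would see the whole line as a single field, which is why the export has to be told about Ctrl-A explicitly.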
Sqoop supports several escape sequences (which start with a '\' character) when
specifying delimiters.
In the example syntax above, the escape sequence is enclosed in 'single quotes' to
ensure that the shell processes it literally. Without the quotes, the leading backslash
itself may need to be escaped (for example, --input-fields-terminated-by \\0001). The
escape sequences supported by Sqoop are listed in Table 15-1.
Table 15-1. Escape sequences can be used to specify nonprintable characters
as field and record delimiters in Sqoop
Escape    Description
\r        carriage return
\0        NUL. This will insert NUL characters between fields or lines, or will
          disable enclosing/escaping if used for one of the --enclosed-by,
          --optionally-enclosed-by, or --escaped-by arguments.
\0ooo     The octal representation of a Unicode character’s code point. The actual
          character is specified by the octal value ooo.
\0xhhh    The hexadecimal representation of a Unicode character’s code point. This
          should be of the form \0xhhh, where hhh is the hex value. For example,
          --fields-terminated-by '\0x0D' specifies the carriage return character.
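As a rough illustration of how the octal and hexadecimal forms map to characters, the following sketch decodes a few delimiter specifications. The DelimiterEscapes class is hypothetical and covers only a subset of the forms; it is not Sqoop's own parsing code.

```java
// Hypothetical decoder for some of the escape forms described in Table 15-1;
// an illustration only, not Sqoop's own parsing code.
class DelimiterEscapes {
  static char decode(String spec) {
    if (spec.length() == 1) {
      return spec.charAt(0);                      // a literal character such as ","
    }
    if (spec.equals("\\r")) {
      return '\r';                                // carriage return
    }
    if (spec.equals("\\0")) {
      return '\0';                                // NUL
    }
    if (spec.startsWith("\\0x")) {
      // Hexadecimal code point: backslash, 0x, then hex digits.
      return (char) Integer.parseInt(spec.substring(3), 16);
    }
    if (spec.startsWith("\\0")) {
      // Octal code point: backslash, 0, then octal digits.
      return (char) Integer.parseInt(spec.substring(2), 8);
    }
    throw new IllegalArgumentException("Unsupported delimiter: " + spec);
  }

  public static void main(String[] args) {
    // Both forms name the Ctrl-A character used by Hive (code point 1).
    System.out.println((int) decode("\\0001"));
    System.out.println((int) decode("\\0x01"));
  }
}
```

The hexadecimal check comes before the octal one because both forms begin with a backslash followed by 0.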
Exports: A Deeper Look
The architecture of Sqoop’s export capability is very similar in nature to how Sqoop
performs imports. (See Figure 15-4.) Before performing the export, Sqoop picks a
strategy based on the database connect string. For most systems, Sqoop uses
JDBC. Sqoop then generates a Java class based on the target table definition. This
generated class has the ability to parse records from text files and insert values of
the appropriate types into a table (in addition to the ability to read the columns from
a ResultSet). A MapReduce job is then launched that reads the source data files
from HDFS, parses the records using the generated class, and executes the
chosen export strategy.
Figure 15-4. Exports are performed in parallel using MapReduce
The JDBC-based export strategy builds up batch INSERT statements that will each
add multiple records to the target table. Inserting many records per statement
performs much better than executing many single-row INSERT statements on most
database systems. Separate threads are used to read from HDFS and
communicate with the database, to ensure that I/O operations involving different
systems are overlapped as much as possible.
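The shape of a multi-row INSERT statement can be sketched as follows. The BatchInsertSketch class and its method are hypothetical and only build the SQL text with placeholders; Sqoop's actual export code may construct its statements differently.

```java
import java.util.Collections;

// Hypothetical illustration of a multi-row INSERT; not Sqoop's own code.
class BatchInsertSketch {
  // Builds one INSERT statement with a "(?, ?, ...)" group per row.
  static String multiRowInsert(String table, String[] columns, int rows) {
    String oneRow = "(" + String.join(", ", Collections.nCopies(columns.length, "?")) + ")";
    return "INSERT INTO " + table + " (" + String.join(", ", columns) + ") VALUES "
        + String.join(", ", Collections.nCopies(rows, oneRow));
  }

  public static void main(String[] args) {
    // Three rows for the sales_by_zip table created earlier.
    System.out.println(multiRowInsert("sales_by_zip", new String[] {"volume", "zip"}, 3));
  }
}
```

With a statement like this, three records reach the database in one round trip instead of three.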
For MySQL, Sqoop can employ a direct-mode strategy using mysqlimport. Each map
task spawns a mysqlimport process that it communicates with via a named FIFO on
the local filesystem. Data is then streamed into mysqlimport via the FIFO channel,
and from there into the database.
While most MapReduce jobs reading from HDFS pick the degree of parallelism
(number of map tasks) based on the number and size of the files to process,
Sqoop’s export system allows users explicit control over the number of tasks. The
performance of the export can be affected by the number of parallel writers to the
database, so Sqoop uses the CombineFileInputFormat class to group up the input
files into a smaller number of map tasks.
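The idea of packing many input files into a fixed number of tasks can be illustrated with a simple greedy assignment. This is a sketch under stated assumptions: the FileGrouping class is hypothetical, the file sizes are made up, and the strategy shown is not the algorithm CombineFileInputFormat itself uses.

```java
import java.util.Arrays;

// Hypothetical illustration: packs files into a fixed number of tasks by always
// adding the next file to the task with the least data so far.
class FileGrouping {
  static long[] groupSizes(long[] fileSizes, int numTasks) {
    long[] totals = new long[numTasks];
    for (long size : fileSizes) {
      int smallest = 0;
      for (int i = 1; i < numTasks; i++) {
        if (totals[i] < totals[smallest]) {
          smallest = i;
        }
      }
      totals[smallest] += size;
    }
    return totals;
  }

  public static void main(String[] args) {
    // Five files (sizes in MB, made-up values) packed into two map tasks.
    long[] totals = groupSizes(new long[] {64, 64, 128, 32, 16}, 2);
    System.out.println(Arrays.toString(totals));
  }
}
```

However many files there are, the number of writers to the database stays at the number of tasks the user asked for.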
Case Studies
Hadoop Usage at Last.fm
Last.fm: The Social Music Revolution
Founded in 2002, Last.fm is an Internet radio and music community website that
offers many services to its users, such as free music streams and downloads, music
and event recommendations, personalized charts, and much more. There are about
25 million people who use Last.fm every month, generating huge amounts of data
that need to be processed. One example of this is users transmitting information
indicating which songs they are listening to (this is known as “scrobbling”). This data
is processed and stored by Last.fm, so the user can access it directly (in the form of
charts), and it is also used to make decisions about users’ musical tastes and
compatibility, and artist and track similarity.
Hadoop at Last.fm
As Last.fm’s service developed and the number of users grew from thousands to millions, storing, processing, and managing all the incoming data became increasingly
challenging. Fortunately, Hadoop was quickly becoming stable enough and was enthusiastically adopted as it became clear how many problems it solved. It was first
used at Last.fm in early 2006 and was put into production a few months later. There
were several reasons for adopting Hadoop at Last.fm:
• The distributed filesystem provided redundant backups for the data stored on
it (e.g., web logs, user listening data) at no extra cost.
• Scalability was simplified through the ability to add cheap, commodity
hardware when required.
• The cost was right (free) at a time when Last.fm had limited financial resources.
• The open source code and active community meant that Last.fm could freely
modify Hadoop to add custom features and patches.
• Hadoop provided a flexible framework for running distributed computing algorithms with a relatively easy learning curve.
Hadoop has now become a crucial part of Last.fm’s infrastructure, currently
consisting of two Hadoop clusters spanning over 50 machines, 300 cores, and 100
TB of disk space. Hundreds of daily jobs are run on the clusters performing
operations, such as logfile analysis, evaluation of A/B tests, ad hoc processing, and
charts generation. This case study will focus on the process of generating charts, as
this was the first usage of Hadoop at Last.fm and illustrates the power and flexibility
that Hadoop provides over other approaches when working with very large datasets.
Generating Charts with Hadoop
Last.fm uses user-generated track listening data to produce many different types of
charts, such as weekly charts for tracks, per country and per user. A number of
Hadoop programs are used to process the listening data and generate these charts,
and these run on a daily, weekly, or monthly basis. Figure 16-1 shows an example of
how this data is displayed on the site; in this case, the weekly top tracks.
Figure 16-1. Last.fm top tracks chart
Listening data typically arrives at Last.fm from one of two sources:
• A user plays a track of her own (e.g., listening to an MP3 file on a PC or other
device), and this information is sent to Last.fm using either the official Last.fm
client application or one of many hundreds of third-party applications.
• A user tunes into one of Last.fm’s Internet radio stations and streams a song to
her computer. The Last.fm player or website can be used to access these
streams and extra functionality is made available to the user, allowing her to love,
skip, or ban each track that she listens to.
When processing the received data, we distinguish between a track listen submitted
by a user (the first source above, referred to as a scrobble from here on) and a track
listened to on the radio (the second source, referred to as
a radio listen from here on). This distinction is very important in order to prevent a
feedback loop in the Last.fm recommendation system, which is based only on
scrobbles. One of the most fundamental Hadoop jobs at Last.fm takes the incoming
listening data and summarizes it into a format that can be used for display purposes
on the Last.fm website as well as for input to other Hadoop programs. This is
achieved by the Track Statistics program, which is the example described in the
following sections.
The Track Statistics Program
When track listening data is submitted to Last.fm, it undergoes a validation and
conversion phase, the end result of which is a number of space-delimited text files
containing the user ID, the track ID, the number of times the track was scrobbled, the
number of times the track was listened to on the radio, and the number of times it
was skipped. Table 16-1 contains sample listening data, which is used in the
following examples as input to the Track Statistics program (the real data is
gigabytes in size and includes many more fields that have been omitted here for
simplicity’s sake).
Table 16-1. Listening data
These text files are the initial input provided to the Track Statistics program, which
consists of two jobs that calculate various values from this data and a third job that
merges the results (see Figure 16-2).
Figure 16-2. TrackStats jobs
The Unique Listeners job calculates the total number of unique listeners for a track
by counting the first listen by a user and ignoring all other listens by the same user.
The Sum job accumulates the total listens, scrobbles, radio listens, and skips for
each track by counting these values for all listens by all users. Although the input format of
these two jobs is identical, two separate jobs are needed, as the Unique Listeners
job is responsible for emitting values per track per user, and the Sum job emits
values per track. The final “Merge” job is responsible for merging the intermediate
output of the two other jobs into the final result. The end results of running the
program are the following values per track:
Number of unique listeners
Number of times the track was scrobbled
Number of times the track was listened to on the radio
Number of times the track was listened to in total
Number of times the track was skipped on the radio
Each job and its MapReduce phases are described in more detail next. Please note
that the provided code snippets have been simplified due to space constraints; for
download details for the full code listings, refer to the preface.
Calculating the number of unique listeners
The Unique Listeners job calculates, per track, the number of unique listeners.
UniqueListenersMapper. The UniqueListenersMapper processes the space-delimited raw listening data and emits the user ID associated with each track ID:
public void map(LongWritable position, Text rawLine,
    OutputCollector<IntWritable, IntWritable> output, Reporter reporter)
    throws IOException {
  String[] parts = (rawLine.toString()).split(" ");
  int scrobbles = Integer.parseInt(parts[TrackStatisticsProgram.COL_SCROBBLES]);
  int radioListens = Integer.parseInt(parts[TrackStatisticsProgram.COL_RADIO]);
  // if track somehow is marked with zero plays - ignore
  if (scrobbles <= 0 && radioListens <= 0) {
    return;
  }
  // if we get to here then user has listened to track,
  // so output user id against track id
  IntWritable trackId = new IntWritable(
      Integer.parseInt(parts[TrackStatisticsProgram.COL_TRACKID]));
  IntWritable userId = new IntWritable(
      Integer.parseInt(parts[TrackStatisticsProgram.COL_USERID]));
  output.collect(trackId, userId);
}
UniqueListenersReducer. The UniqueListenersReducer receives a list of user IDs
per track ID and puts these IDs into a Set to remove any duplicates. The size of this
set is then emitted (i.e., the number of unique listeners) for each track ID. Storing all
the reduce values in a Set runs the risk of running out of memory if there are many
values for a certain key. This hasn’t happened in practice, but to overcome this, an
extra MapReduce step could be introduced to remove all the duplicate values or a
secondary sort could be used (for more details, see “Secondary Sort”):
public void reduce(IntWritable trackId, Iterator<IntWritable> values,
    OutputCollector<IntWritable, IntWritable> output, Reporter reporter)
    throws IOException {
  Set<Integer> userIds = new HashSet<Integer>();
  // add all userIds to the set, duplicates automatically removed (set contract)
  while (values.hasNext()) {
    IntWritable userId = values.next();
    userIds.add(Integer.valueOf(userId.get()));
  }
  // output trackId -> number of unique listeners per track
  output.collect(trackId, new IntWritable(userIds.size()));
}
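If the user IDs for a track reached the reducer already sorted, as a secondary sort would arrange, duplicates could be counted without holding a Set in memory. The following is a minimal sketch of that idea in plain Java with no Hadoop dependency; the SortedDistinctCount class and the sample IDs are hypothetical.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical illustration: counts distinct values in a sorted stream by
// comparing each value with the previous one, using constant memory.
class SortedDistinctCount {
  static int countDistinct(Iterator<Integer> sortedIds) {
    int count = 0;
    Integer previous = null;
    while (sortedIds.hasNext()) {
      Integer current = sortedIds.next();
      if (!current.equals(previous)) {
        count++;            // first time this ID is seen
        previous = current;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Made-up user IDs for one track, already in sorted order.
    Iterator<Integer> ids = Arrays.asList(7, 7, 9, 12, 12, 12).iterator();
    System.out.println(countDistinct(ids));
  }
}
```

The trade-off is that correctness now depends on the input order, which the Set-based reducer does not require.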
Table 16-2 shows the sample input data for the job. The map output appears in Table 16-3 and the reduce output in Table 16-4.
Table 16-2. Job input
Table 16-3. Mapper output
Table 16-4. Reducer output
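The flow of the Unique Listeners job can also be followed in plain Java without Hadoop. This is only a sketch: the UniqueListenersSketch class is hypothetical, the input lines are made-up values, and the column order (user ID, track ID, scrobbles, radio listens, skips) is assumed from the description of the input files.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java walk-through of the Unique Listeners job (no Hadoop dependency).
class UniqueListenersSketch {
  // Assumed column order: user ID, track ID, scrobbles, radio listens, skips.
  static final int COL_USERID = 0;
  static final int COL_TRACKID = 1;
  static final int COL_SCROBBLES = 2;
  static final int COL_RADIO = 3;

  static Map<Integer, Integer> uniqueListeners(List<String> lines) {
    // Map phase: emit (trackId, userId) for every line with at least one play.
    Map<Integer, Set<Integer>> usersByTrack = new TreeMap<>();
    for (String line : lines) {
      String[] parts = line.split(" ");
      int scrobbles = Integer.parseInt(parts[COL_SCROBBLES]);
      int radioListens = Integer.parseInt(parts[COL_RADIO]);
      if (scrobbles <= 0 && radioListens <= 0) {
        continue;
      }
      int trackId = Integer.parseInt(parts[COL_TRACKID]);
      int userId = Integer.parseInt(parts[COL_USERID]);
      usersByTrack.computeIfAbsent(trackId, k -> new HashSet<>()).add(userId);
    }
    // Reduce phase: the size of each set is the unique listener count.
    Map<Integer, Integer> counts = new TreeMap<>();
    for (Map.Entry<Integer, Set<Integer>> entry : usersByTrack.entrySet()) {
      counts.put(entry.getKey(), entry.getValue().size());
    }
    return counts;
  }

  public static void main(String[] args) {
    // Made-up listening data: user, track, scrobbles, radio listens, skips.
    List<String> lines = Arrays.asList(
        "1 10 1 0 0", "1 10 0 2 1", "2 10 3 0 0", "2 11 0 0 0", "3 11 0 1 0");
    System.out.println(uniqueListeners(lines));
  }
}
```

In this sample, track 10 has two unique listeners and track 11 has one, because the line for user 2 on track 11 records zero plays and is skipped.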
Summing the track totals
The Sum job is relatively simple; it just adds up the values we are interested in for
each track.
SumMapper. The input data is again the raw text files, but in this case, it is handled
quite differently. The desired end result is a number of totals (unique listener count,
play count, scrobble count, radio listen count, skip count) associated with each track.
To simplify things, we use an intermediate TrackStats object generated using
Hadoop Record I/O, which implements WritableComparable (so it can be used as
output) to hold these values. The mapper creates a TrackStats object and sets the
values on it for each line in the file, except for the unique listener count, which is left
empty (it will be filled in by the final merge job):
public void map(LongWritable position, Text rawLine,
    OutputCollector<IntWritable, TrackStats> output, Reporter reporter)
    throws IOException {
  String[] parts = (rawLine.toString()).split(" ");
  int trackId = Integer.parseInt(parts[TrackStatisticsProgram.COL_TRACKID]);
  int scrobbles = Integer.parseInt(parts[TrackStatisticsProgram.COL_SCROBBLES]);
  int radio = Integer.parseInt(parts[TrackStatisticsProgram.COL_RADIO]);
  int skip = Integer.parseInt(parts[TrackStatisticsProgram.COL_SKIP]);
  // set number of listeners to 0 (this is calculated later)
  // and other values as provided in text file
  TrackStats trackstat = new TrackStats(0, scrobbles + radio, scrobbles, radio, skip);
  output.collect(new IntWritable(trackId), trackstat);
}
SumReducer. In this case, the reducer performs a very similar function to the
mapper: it sums the statistics per track and returns an overall total:
public void reduce(IntWritable trackId, Iterator<TrackStats> values,
    OutputCollector<IntWritable, TrackStats> output, Reporter reporter)
    throws IOException {
  TrackStats sum = new TrackStats(); // holds the totals for this track
  while (values.hasNext()) {
    TrackStats trackStats = (TrackStats) values.next();
    sum.setListeners(sum.getListeners() + trackStats.getListeners());
    sum.setPlays(sum.getPlays() + trackStats.getPlays());
    sum.setSkips(sum.getSkips() + trackStats.getSkips());
    sum.setScrobbles(sum.getScrobbles() + trackStats.getScrobbles());
    sum.setRadioPlays(sum.getRadioPlays() + trackStats.getRadioPlays());
  }
  output.collect(trackId, sum);
}
Table 16-5 shows the input data for the job (the same as for the Unique Listeners
job). The map output appears in Table 16-6 and the reduce output in Table 16-7.
Table 16-5. Job input
Table 16-6. Map output
Table 16-7. Reduce output
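The arithmetic of the Sum job can be traced with a short plain-Java sketch. The TrackTotalsSketch class, the int[] layout it uses in place of TrackStats, and the input lines are all hypothetical; the column order (user ID, track ID, scrobbles, radio listens, skips) is assumed from the description of the input files.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java walk-through of the Sum job (no Hadoop dependency).
class TrackTotalsSketch {
  // Result layout per track: {plays, scrobbles, radio listens, skips}.
  static Map<Integer, int[]> sumByTrack(List<String> lines) {
    Map<Integer, int[]> totals = new TreeMap<>();
    for (String line : lines) {
      String[] parts = line.split(" ");
      int trackId = Integer.parseInt(parts[1]);
      int scrobbles = Integer.parseInt(parts[2]);
      int radio = Integer.parseInt(parts[3]);
      int skips = Integer.parseInt(parts[4]);
      int[] sum = totals.computeIfAbsent(trackId, k -> new int[4]);
      sum[0] += scrobbles + radio;  // total plays
      sum[1] += scrobbles;
      sum[2] += radio;
      sum[3] += skips;
    }
    return totals;
  }

  public static void main(String[] args) {
    // Made-up listening data: user, track, scrobbles, radio listens, skips.
    List<String> lines = Arrays.asList(
        "1 10 1 0 0", "1 10 0 2 1", "2 10 3 0 0", "2 11 0 0 0", "3 11 0 1 0");
    for (Map.Entry<Integer, int[]> entry : sumByTrack(lines).entrySet()) {
      System.out.println(entry.getKey() + " -> " + Arrays.toString(entry.getValue()));
    }
  }
}
```

For track 10 in this sample, the three input lines add up to 6 plays, 4 scrobbles, 2 radio listens, and 1 skip.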
Merging the results
The final job needs to merge the output from the two previous jobs: the number of
unique listeners per track and the statistics per track. In order to be able to merge
these different inputs, two different mappers (one for each type of input) are used.
The two intermediate jobs are configured to write their results to different paths, and
the MultipleInputs class is used to specify which mapper will process which files. The
following code shows how the JobConf for the job is set up to do this:
MultipleInputs.addInputPath(conf, sumInputDir,
SequenceFileInputFormat.class, IdentityMapper.class);
MultipleInputs.addInputPath(conf, listenersInputDir,
SequenceFileInputFormat.class, MergeListenersMapper.class);
It is possible to use a single mapper to handle different inputs, but the example
solution is more convenient and elegant.
MergeListenersMapper. This mapper is used to process the UniqueListenerJob’s
output of unique listeners per track. It creates a TrackStats object in a similar manner
to the SumMapper, but this time, it fills in only the unique listener count per track and
leaves the other values empty:
public void map(IntWritable trackId, IntWritable uniqueListenerCount,
    OutputCollector<IntWritable, TrackStats> output, Reporter reporter)
    throws IOException {
  TrackStats trackStats = new TrackStats();
  trackStats.setListeners(uniqueListenerCount.get());
  output.collect(trackId, trackStats);
}
Table 16-8 shows some input for the mapper; the corresponding output is shown in
Table 16-9.
Table 16-8. MergeListenersMapper input
Table 16-9. MergeListenersMapper output
IdentityMapper. The IdentityMapper is configured to process the SumJob’s output
of TrackStats objects and, as no additional processing is required, it directly emits
the input data (see Table 16-10).
Table 16-10. IdentityMapper input and output
SumReducer. The two mappers above emit values of the same type: a TrackStats
object per track, with different values filled in. The final reduce phase can reuse the
SumReducer described earlier to create a TrackStats object per track, sum up all the
values, and emit it (see Table 16-11).
Table 16-11. Final SumReducer output
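Why the same summing step can serve as the merge is easiest to see with partially filled records: each side leaves the other side's fields at zero, so adding them field by field yields the complete record. The sketch below is plain Java with a hypothetical MergeSketch class, an int[] layout standing in for TrackStats, and made-up values.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java walk-through of the merge step (no Hadoop dependency).
class MergeSketch {
  // Record layout: {listeners, plays, scrobbles, radio listens, skips}.
  static int[] add(int[] a, int[] b) {
    int[] out = new int[a.length];
    for (int i = 0; i < a.length; i++) {
      out[i] = a[i] + b[i];
    }
    return out;
  }

  static Map<Integer, int[]> merge(Map<Integer, Integer> listeners,
                                   Map<Integer, int[]> totals) {
    Map<Integer, int[]> merged = new TreeMap<>();
    // Listener records carry only the listener count; other fields stay zero.
    for (Map.Entry<Integer, Integer> e : listeners.entrySet()) {
      merged.merge(e.getKey(), new int[] {e.getValue(), 0, 0, 0, 0}, MergeSketch::add);
    }
    // Sum records carry everything except the listener count.
    for (Map.Entry<Integer, int[]> e : totals.entrySet()) {
      merged.merge(e.getKey(), e.getValue(), MergeSketch::add);
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<Integer, Integer> listeners = new TreeMap<>();
    listeners.put(10, 2);                         // made-up listener count
    Map<Integer, int[]> totals = new TreeMap<>();
    totals.put(10, new int[] {0, 6, 4, 2, 1});    // made-up totals
    System.out.println(Arrays.toString(merge(listeners, totals).get(10)));
  }
}
```

Because addition with zero leaves a value unchanged, no special-case merge logic is needed beyond the sum itself.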
The final output files are then accumulated and copied to a server where a web
service makes the data available to the Last.fm website for display. An example of
this is shown in Figure 16-3, where the total number of listeners and plays are
displayed for a track.
Figure 16-3. TrackStats result
Hadoop has become an essential part of Last.fm’s infrastructure and is used to
generate and process a wide variety of datasets ranging from web logs to user
listening data. The example covered here has been simplified considerably in order
to get the key concepts across; in real-world usage the input data has a more
complicated structure and the code that processes it is more complex. Hadoop itself,
while mature enough for production use, is still in active development, and new
features and improvements are added by the Hadoop community every week. We at Last.fm are happy to be part of this community as a contributor of code and ideas,
and as end users of a great piece of open source technology.
—Adrian Woodhead and Marc de Palol
Hadoop and Hive at Facebook
Hadoop can be used to form core backend batch and near real-time computing infrastructures. It can also be used to store and archive massive datasets. In this case
study, we will explore backend data architectures and the role Hadoop can play in
them. We will describe hypothetical Hadoop configurations, potential uses of Hive—
an open source data warehousing and SQL infrastructure built on top of Hadoop—
and the different kinds of business and product applications that have been built
using this infrastructure.
Hadoop at Facebook
The amount of log and dimension data in Facebook that needs to be processed and
stored has exploded as the usage of the site has increased. A key requirement for
any data processing platform for this environment is the ability to scale rapidly.
Further, engineering resources being limited, the system should be very reliable and
easy to use and maintain.
Initially, data warehousing at Facebook was performed entirely on an Oracle
instance. After we started hitting scalability and performance problems, we
investigated whether there were open source technologies that could be used in our
environment. As part of this investigation, we deployed a relatively small Hadoop
instance and started publishing some of our core datasets into this instance.
Hadoop was attractive because Yahoo! was using it internally for its batch
processing needs and because we were familiar with the simplicity and scalability of
the MapReduce model as popularized by Google.
Our initial prototype was very successful: the engineers loved the ability to process
massive amounts of data in reasonable timeframes, an ability that we just did not
have before. They also loved being able to use their favorite programming language
for processing (using Hadoop streaming). Having our core datasets published in
one centralized data store was also very convenient. At around the same time, we
started developing Hive. This made it even easier for users to process data in the
Hadoop cluster by being able to express common computations in the form of SQL,
a language with which most engineers and analysts are familiar.
As a result, the cluster size and usage grew by leaps and bounds, and today
Facebook is running the second largest Hadoop cluster in the world. As of this
writing, we hold more than 2 PB of data in Hadoop and load more than 10 TB of data
into it every day. Our Hadoop instance has 2,400 cores and about 9 TB of memory
and runs at 100% utilization at many points during the day. We are able to scale out
this cluster rapidly in response to our growth, and we have been able to take
advantage of open source by modifying Hadoop where required to suit our needs.
We have contributed back to open source, both in the form of contributions to some
core components of Hadoop as well as by open-sourcing Hive, which is now a
Hadoop top-level project.
Use cases
There are at least four interrelated but distinct classes of uses for Hadoop at
Facebook:
• Producing daily and hourly summaries over large amounts of data. These
summaries are used for a number of different purposes within the company:
  - Reports based on these summaries are used by engineering and
    nonengineering functional teams to drive product decisions. These summaries
    include reports on growth of the users, page views, and average time spent on
    the site by the users.
  - Providing performance numbers about advertisement campaigns that are run
    on Facebook.
  - Backend processing for site features such as people you may like and applications you may like.
• Running ad hoc jobs over historical data. These analyses help answer questions
from our product groups and executive team.
• As a de facto long-term archival store for our log datasets.
• To look up log events by specific attributes (where logs are indexed by such
attributes), which is used to maintain the integrity of the site and protect users
against spambots.
Data architecture
Figure 16-4 shows the basic components of our architecture and the data flow within
these components.
Figure 16-4. Data warehousing architecture at Facebook
As shown in Figure 16-4, the following components are used in processing data:
• Log data is generated by web servers as well as internal services such as the
Search backend. We use Scribe, an open source log collection service
developed at Facebook that deposits hundreds of log datasets with daily volume
in tens of terabytes into a handful of NFS servers.
• A large fraction of this log data is copied into one central HDFS instance. Dimension data is also scraped from our internal MySQL databases and copied over
into HDFS daily.
• We use Hive, a Hadoop subproject developed at Facebook, to build a data warehouse over all the data collected in HDFS. Files in HDFS, including log data from
Scribe and dimension data from the MySQL tier, are made available as tables
with logical partitions. A SQL-like query language provided by Hive is used in
conjunction with MapReduce to create/publish a variety of summaries and
reports, as well as to perform historical analysis over these tables.
• Browser-based interfaces built on top of Hive allow users to compose and launch
Hive queries (which in turn launch MapReduce jobs) using just a few mouse
clicks.
• Traditional RDBMS: We use Oracle and MySQL databases to publish these summaries. The volume
of data here is relatively small, but the query rate is high and needs real-time
response.
• In-house ETL workflow software provides a common framework
for reliable batch processing across all data processing jobs.
Data from the NFS tier storing Scribe data is continuously replicated to the HDFS
cluster by copier jobs. The NFS devices are mounted on the Hadoop tier and the
copier processes run as map-only jobs on the Hadoop cluster. This makes it easy to
scale the copier processes and makes them fault-resilient. Currently, we copy over 6
TB per day from Scribe to HDFS in this manner. We also download up to 4 TB of
dimension data from our MySQL tier to HDFS every day. These are also
conveniently arranged on the Hadoop cluster, as map-only jobs that copy data out of
MySQL boxes.
Hadoop configuration
The central philosophy behind our Hadoop deployment is consolidation. We use a
single HDFS instance, and a vast majority of processing is done in a single
MapReduce cluster (running a single jobtracker). The reasons for this are fairly
straightforward:
• We can minimize the administrative overheads by operating a single cluster.
• Data does not need to be duplicated. All data is available in a single place for all
the use cases described previously.
• By using the same compute cluster across all departments, we get tremendous
efficiencies.
• Our users work in a collaborative environment, so requirements in terms of
quality of service are not onerous (yet).
We also have a single shared Hive metastore (using a MySQL database) that holds
metadata about all the Hive tables stored in HDFS.
Hypothetical Use Case Studies
In this section, we will describe some typical problems that are common for large
websites, which are difficult to solve through traditional warehousing technologies,
simply because the costs and scales involved are prohibitively high. Hadoop and
Hive can provide a more scalable and more cost-effective solution in such situations.
Advertiser insights and performance
One of the most common uses of Hadoop is to produce summaries from large
volumes of data. It is very typical of large ad networks, such as the Facebook ad
network, Google AdSense, and many others, to provide advertisers with standard
aggregated statistics about their ads that help the advertisers to tune their
campaigns effectively. Computing advertisement performance numbers on large
datasets is a very data-intensive operation, and the scalability and cost advantages
of Hadoop and Hive can really help in computing these numbers in a reasonable
time frame and at a reasonable cost.
Many ad networks provide standardized CPC- and CPM-based ad units to the
advertisers. The CPC ads are cost-per-click ads: the advertiser pays the ad network
amounts that are dependent on the number of clicks that the particular ad gets from
the users visiting the site. The CPM ads (short for cost per mille, that is, the cost per
thousand impressions), on the other hand, bill the advertisers amounts that are
proportional to the number of users who see the ad on the site. Apart from these
standardized ad units, in the last few years ads that have more dynamic content that
is tailored to each individual user have also become common in the online
advertisement industry. Yahoo! does this through SmartAds, whereas Facebook
provides its advertisers with Social Ads. The latter allows the advertisers to embed
information from a user’s network of friends; for example, a Nike ad may refer to a
friend of the user who recently fanned Nike and shared that information with his
friends on Facebook. In addition, Facebook also provides Engagement Ad units to
the advertisers, wherein the users can more effectively interact with the ad, be it by
commenting on it or by playing embedded videos. In general, a wide variety of ads
are provided to the advertisers by the online ad networks, and this variety also adds
yet another dimension to the various kinds of performance numbers that the
advertisers are interested in getting about their campaigns.
At the most basic level, advertisers are interested in knowing the total and the
number of unique users that have seen the ad or have clicked on it. For more
dynamic ads, they may even be interested in getting the breakdown of these
aggregated numbers by the kind of dynamic information shown in the ad unit or the
kind of engagement action undertaken by the users on the ad. For example, a
particular advertisement may have been shown 100,000 times to 30,000 unique
users. Similarly, a video embedded inside an Engagement Ad may have been
watched by 100,000 unique users. In addition, these performance numbers are
typically reported for each ad, campaign, and account. An account may have
multiple campaigns with each campaign running multiple ads on the network. Finally,
these numbers are typically reported for different time durations by the ad networks.
Typical durations are daily, rolling week, month to date, rolling month, and
sometimes even for the entire lifetime of the campaign. Moreover, advertisers also
look at the geographic breakdown of these numbers among other ways of slicing and
dicing this data, such as what percentage of the total viewers or clickers of a
particular ad are in the Asia Pacific region.
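The distinction between total and unique numbers matters most when the reporting window changes: totals can be added up across days, but unique-user counts cannot, because the same user may appear on several days. The following is a minimal Python sketch of that point. The log rows and the helper name window_stats are hypothetical and are not part of any reporting system described here.

```python
from datetime import date, timedelta

# Hypothetical impression log for a single ad: (day, user_id) pairs.
impressions = [
    (date(2008, 12, 1), "u1"),
    (date(2008, 12, 1), "u2"),
    (date(2008, 12, 1), "u1"),
    (date(2008, 12, 2), "u2"),
    (date(2008, 12, 3), "u3"),
]

def window_stats(log, end_day, days):
    """Return (total, unique) impressions for the window of `days` ending on end_day."""
    start_day = end_day - timedelta(days=days - 1)
    users = [user for day, user in log if start_day <= day <= end_day]
    return len(users), len(set(users))

daily = window_stats(impressions, date(2008, 12, 1), 1)
rolling_week = window_stats(impressions, date(2008, 12, 3), 7)
print("daily:", daily, "rolling week:", rolling_week)
```

In this toy log the three daily unique counts are 2, 1, and 1, yet the rolling-week unique count is 3 rather than 4, which is why the user dimension has to be carried through the pipeline instead of being summed from daily reports.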
As is evident, there are four predominant dimension hierarchies: the account,
campaign, and ad dimension; the time period; the type of interaction; and the user
dimension. The last of these is used to report unique numbers, whereas the other
three are the reporting dimensions. The user dimension is also used to create
aggregated geographic profiles for the viewers and clickers of ads. All this
information in totality allows the advertisers to tune their campaigns to improve their
effectiveness on any given ad network. Aside from the multidimensional nature of
this set of pipelines, the volumes of data processed and the rate at which this data is
growing on a daily basis make this difficult to scale without a technology like Hadoop
for large ad networks. As of this writing, for example, the ad log volume that is
processed for ad performance numbers at Facebook is approximately 1 TB per day
of (uncompressed) logs. This volume has seen a 30-fold increase since January
2008, when the volumes were in the range of 30 GB per day. Hadoop’s ability to
scale with hardware has been a major factor behind the ability of these pipelines to
keep up with this data growth with minor tweaking of job configurations. Typically,
these configuration changes involve increasing the number of reducers for the
Hadoop jobs that are processing the intensive portions of these pipelines. The
largest of these stages currently runs with 400 reducers (an eightfold increase
over the 50 reducers that were being used in January 2008).
Ad hoc analysis and product feedback
Apart from regular reports, another primary use case for a data warehousing solution
is to be able to support ad hoc analysis and product feedback solutions. Any typical
website, for example, makes product changes, and product managers or engineers
typically want to understand the impact of a new feature, based on user engagement
as well as on the click-through rate on that feature. The product team may even wish
to analyze the impact of the change by region and country, such as whether the
change increases the click-through rate of users in the US or reduces the
engagement of users in India. A lot
of this type of analysis could be done with Hadoop by using Hive and regular SQL.
The measurement of click-through rate can be easily expressed as a join of the
impressions and clicks for the particular link related to the feature. This information
can be joined with geographic information to compute the effect of product changes
on different regions. Subsequently, one can compute the average click-through rate for
different geographic regions by performing aggregations over them. All of these are
easily expressible in Hive using a couple of SQL queries (that would, in turn,
generate multiple Hadoop jobs). If only an estimate is required, the same queries
can be run on a sample of the users using the sampling functionality natively
supported by Hive. Some of this analysis needs custom map and reduce
scripts in conjunction with the Hive SQL, and these are also easy to plug into a Hive
query.
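The join-then-aggregate logic behind a regional click-through rate can be sketched outside Hive in a few lines of Python. This is only an illustration of the computation, not of how Hive executes it. The rows, the link identifier, and the function name ctr_by_region are hypothetical.

```python
from collections import Counter

# Hypothetical logs: (user_id, link_id) rows, plus a user -> region lookup.
impressions = [("u1", "new_feature"), ("u2", "new_feature"),
               ("u3", "new_feature"), ("u3", "new_feature")]
clicks = [("u1", "new_feature"), ("u3", "new_feature"), ("u3", "new_feature")]
user_region = {"u1": "US", "u2": "US", "u3": "IN"}

def ctr_by_region(impressions, clicks, user_region, link_id):
    """Join both logs with the region lookup and divide clicks by impressions."""
    shown = Counter(user_region[user] for user, link in impressions if link == link_id)
    clicked = Counter(user_region[user] for user, link in clicks if link == link_id)
    # A Counter returns 0 for regions that had impressions but no clicks.
    return {region: clicked[region] / count for region, count in shown.items()}

rates = ctr_by_region(impressions, clicks, user_region, "new_feature")
print(rates)
```

Running the same function before and after a product change, and comparing the per-region rates, corresponds to the kind of analysis the text describes.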
A good example of a more complex analysis is estimating the peak number of users
logging into the site per minute for the entire past year. This would involve sampling
page view logs (because the total page view data for a popular website is huge),
grouping it by time and then finding the number of new users at different time points
via a custom reduce script. This is a good example where both SQL and MapReduce
are required for solving the end user problem and something that is possible to
achieve easily with Hive.
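As a rough illustration of what such a custom reduce script might look like, the Python sketch below counts distinct users per minute from rows that have already been grouped by time. The tab-separated "minute, user_id" input format and the function name users_per_minute are assumptions made for this example; the actual scripts used at Facebook are not shown in this chapter.

```python
from itertools import groupby

def users_per_minute(lines):
    """Yield (minute, distinct_users) from 'minute<TAB>user_id' lines sorted by minute."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for minute, group in groupby(rows, key=lambda row: row[0]):
        # The group is consumed here, before groupby advances to the next key.
        yield minute, len({user_id for _, user_id in group})

# Hypothetical sampled page views, already sorted by minute.
sample = [
    "2008-12-01 09:00\tu1\n",
    "2008-12-01 09:00\tu2\n",
    "2008-12-01 09:00\tu1\n",
    "2008-12-01 09:01\tu3\n",
]
counts = list(users_per_minute(sample))
peak = max(counts, key=lambda item: item[1])
print(counts, peak)
```

When used as a streaming script, the same function could be fed from standard input. Because the input is a sample, the peak is an estimate; scaling it back up is only straightforward if the sample was drawn by user rather than by row.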
Data analysis
Hive and Hadoop can be easily used for training and scoring for data analysis
applications. These data analysis applications can span multiple domains such as
popular websites, bioinformatics companies, and oil exploration companies. A typical
example of such an application in the online ad network industry would be the
prediction of which features of an ad make it more likely to be noticed by the user.
The training phase typically would involve identifying the response metric and the
predictive features. In this case, a good metric to measure the effectiveness of an ad
could be its click-through rate. Some interesting features of the ad could be the
industry vertical that it belongs to, the content of the ad, the placement of the ad on
the page, and so on. Hive is useful for assembling training data and then feeding the
same into a data analysis engine (typically R or user programs written in
MapReduce). In this particular case, different ad performance numbers and features
can be structured as tables in Hive. One can easily sample this data (sampling is
required as R can only handle limited data volume) and perform the appropriate
aggregations and joins using Hive queries to assemble a response table that
contains the most important ad features that determine the effectiveness of an
advertisement. However, since sampling loses information, some of the more
important data analysis applications use parallel implementations of popular data
analysis kernels using the MapReduce framework.
Once the model has been trained, it may be deployed for scoring on a daily basis.
The bulk of the data analysis tasks do not perform daily scoring though. Many of
them are ad hoc in nature and require one-time analysis that can be used as input
into the product design process.
When we started using Hadoop, we very quickly became impressed by its scalability
and availability. However, we were worried about widespread adoption, primarily because of the complexity involved in writing MapReduce programs in Java (as well as
the cost of training users to write them). We were aware that a lot of engineers and
analysts in the company understood SQL as a tool to query and analyze data, and
that a lot of them were proficient in a number of scripting languages like PHP and
Python. As a result, it was imperative for us to develop software that could bridge
this gap between the languages that the users were proficient in and the languages
required to program Hadoop.
It was also evident that a lot of our datasets were structured and could be easily
partitioned. The natural consequence of these requirements was a system that
could model data as tables and partitions and that could also provide a SQL-like
language for query and analysis. Also essential was the ability to plug in customized
MapReduce programs written in the programming language of the user’s choice into
the query. This system was called Hive. Hive is a data warehouse infrastructure built
on top of Hadoop and serves as the predominant tool that is used to query the data
stored in Hadoop at Facebook. In the following sections, we describe this system in
more detail.
Data organization
Data is organized consistently across all datasets and is stored compressed,
partitioned, and sorted:
• Almost all datasets are stored as sequence files using the gzip codec. Older
  datasets are recompressed to use the bzip codec, which gives substantially more
  compression than gzip. Bzip is slower than gzip, but older data is accessed much
  less frequently, and this performance hit is well worth the savings in terms of disk
  space.
• Most datasets are partitioned by date. Individual partitions are loaded into Hive,
  which loads each partition into a separate HDFS directory. In most cases, this
  partitioning is based simply on datestamps associated with Scribe logfiles.
  However, in some cases, we scan data and collate it based on the timestamp
  available inside a log entry. Going forward, we are also going to be partitioning
  data on multiple attributes (for example, country and date).
• Each partition within a table is often sorted (and hash-partitioned) by unique
  ID (if one is present). This has a few key advantages:
    ◦ It is easy to run sampled queries on such datasets.
    ◦ We can build indexes on sorted data.
    ◦ Aggregates and joins involving unique IDs can be done very efficiently on
      such datasets.
Loading data into this long-term format is done by daily MapReduce jobs (and
is distinct from the near real-time data import processes).
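To see why hash-partitioning by unique ID makes sampled queries easy, consider the following Python sketch. It assigns each ID to one of a fixed number of buckets and samples by keeping a single bucket, so all rows for a given user are either kept or dropped together. The bucket count, the use of CRC32 as the hash, and the function names are assumptions for illustration; Hive's own hashing and file layout are not reproduced here.

```python
import zlib

NUM_BUCKETS = 32  # hypothetical bucket count

def bucket_of(user_id, num_buckets=NUM_BUCKETS):
    """Deterministically map an ID to a bucket (CRC32 stands in for the real hash)."""
    return zlib.crc32(user_id.encode("utf-8")) % num_buckets

def sample_rows(rows, bucket=0, num_buckets=NUM_BUCKETS):
    """Keep every row whose user_id (first field) falls in the chosen bucket."""
    return [row for row in rows if bucket_of(row[0], num_buckets) == bucket]

# Hypothetical (user_id, ad_id) rows.
rows = [(f"user{i}", f"ad{i % 3}") for i in range(1000)]
sampled = sample_rows(rows)
print(len(sampled), "of", len(rows), "rows kept")
```

If each bucket is stored as its own file, a sampled query only needs to read the files for the selected buckets instead of filtering the whole partition.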
Query language
The Hive Query language is very SQL-like. It has traditional SQL constructs like
joins, group bys, where, select, from clauses, and from clause subqueries. It tries to
convert SQL commands into a set of MapReduce jobs. Apart from the normal SQL
clauses, it has a bunch of other extensions, like the ability to specify custom mapper
and reducer scripts in the query itself, the ability to insert into multiple tables,
partitions, HDFS, or local files while doing a single scan of the data and the ability to
run the query on data samples rather than the full dataset (this ability is fairly useful
while testing queries). The Hive metastore stores the metadata for a table and
provides this metadata to the Hive compiler for converting SQL commands to
MapReduce jobs. Through partition pruning, map-side aggregations, and other
features, the compiler tries to create plans that can optimize the runtime for the
query.
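Partition pruning in particular is easy to picture: since each date partition lives in its own directory, a predicate on the partition column can be resolved against the partition list before any data is read. The Python sketch below illustrates the idea only. The warehouse paths and the function name prune are hypothetical, and the real logic lives inside the Hive compiler.

```python
# Hypothetical warehouse layout: one HDFS directory per date partition.
PARTITIONS = {
    "2008-11-29": "/warehouse/impression_logs/dateid=2008-11-29",
    "2008-11-30": "/warehouse/impression_logs/dateid=2008-11-30",
    "2008-12-01": "/warehouse/impression_logs/dateid=2008-12-01",
}

def prune(partitions, predicate):
    """Return only the directories whose partition key satisfies the predicate."""
    return [path for dateid, path in sorted(partitions.items()) if predicate(dateid)]

daily = prune(PARTITIONS, lambda d: d == "2008-12-01")
two_days = prune(PARTITIONS, lambda d: "2008-11-30" <= d <= "2008-12-01")
print(daily)
print(two_days)
```

A query restricted to one day therefore scans one directory, regardless of how many days of history the table holds.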
Data pipelines using Hive
Additionally, Hive’s ability to express data pipelines in SQL has provided
much-needed flexibility in putting these pipelines together easily and
expediently. This is especially useful for organizations and products that are
still evolving and growing. Many of the operations needed in processing data
pipelines are well-understood SQL operations like join, group by, and distinct
aggregations. With Hive’s ability to convert SQL into a series of Hadoop
MapReduce jobs, it becomes fairly easy to create and maintain these pipelines. We
illustrate these facets of Hive in this section by using an example of a hypothetical ad
network and showing how some typical aggregated reports needed by the
advertisers can be computed using Hive. As an example, assume that an online
ad network stores information on ads in a table named dim_ads and stores all the
impressions served for those ads in a table named impression_logs in Hive, with the
latter table being partitioned by date. The daily impression numbers (both unique and
total by campaign, which are routinely given by ad networks to the advertisers) for
2008-12-01 are then expressible as the following SQL in Hive:
SELECT a.campaign_id, count(1), count(DISTINCT b.user_id)
FROM dim_ads a JOIN impression_logs b ON(b.ad_id = a.ad_id)
WHERE b.dateid = '2008-12-01'
GROUP BY a.campaign_id;
This would also be the typical SQL statement that one could use in other RDBMSs
such as Oracle, DB2, and so on.
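Because the statement is plain SQL, it can be tried on a toy dataset without a cluster. The sketch below runs it through Python's built-in sqlite3 module, which stands in for Hive here. The column types, the sample rows, and the added ORDER BY (used only to make the output order predictable) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_ads (ad_id INTEGER, campaign_id INTEGER, account_id INTEGER);
    CREATE TABLE impression_logs (ad_id INTEGER, user_id TEXT, dateid TEXT);
    INSERT INTO dim_ads VALUES (1, 10, 100), (2, 10, 100), (3, 20, 100);
    INSERT INTO impression_logs VALUES
        (1, 'u1', '2008-12-01'), (1, 'u1', '2008-12-01'),
        (2, 'u2', '2008-12-01'), (3, 'u1', '2008-12-01'),
        (3, 'u3', '2008-11-30');
""")

# The query from the text: total and unique impressions per campaign for one day.
rows = conn.execute("""
    SELECT a.campaign_id, count(1), count(DISTINCT b.user_id)
    FROM dim_ads a JOIN impression_logs b ON (b.ad_id = a.ad_id)
    WHERE b.dateid = '2008-12-01'
    GROUP BY a.campaign_id
    ORDER BY a.campaign_id
""").fetchall()
print(rows)
```

Campaign 10 receives three impressions from two distinct users, and campaign 20 receives one impression; the row dated 2008-11-30 is excluded by the date predicate.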
In order to compute the daily impression numbers by ad and account from the same
joined data as earlier, Hive provides the ability to do multiple group bys
simultaneously as shown in the following query (SQL-like but not strictly SQL):
FROM (
  SELECT a.ad_id, a.campaign_id, a.account_id, b.user_id
  FROM dim_ads a JOIN impression_logs b ON (b.ad_id = a.ad_id)
  WHERE b.dateid = '2008-12-01') x
INSERT OVERWRITE DIRECTORY 'results_gby_adid'
  SELECT x.ad_id, count(1), count(DISTINCT x.user_id) GROUP BY x.ad_id
INSERT OVERWRITE DIRECTORY 'results_gby_campaignid'
  SELECT x.campaign_id, count(1), count(DISTINCT x.user_id) GROUP BY x.campaign_id
INSERT OVERWRITE DIRECTORY 'results_gby_accountid'
  SELECT x.account_id, count(1), count(DISTINCT x.user_id) GROUP BY x.account_id;
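The benefit of grouping several ways at once is that the joined data is scanned a single time. The following Python sketch mimics that behavior in memory: one loop over the joined rows feeds three sets of aggregates. The data and the function name multi_group_by are hypothetical, and the sketch says nothing about how Hive actually plans the query.

```python
from collections import defaultdict

# Hypothetical dimension table: ad_id -> (campaign_id, account_id).
dim_ads = {1: (10, 100), 2: (10, 100), 3: (20, 200)}
# Hypothetical impressions for one day: (ad_id, user_id).
impressions = [(1, "u1"), (1, "u1"), (2, "u2"), (3, "u1")]

def multi_group_by(dim_ads, impressions):
    """Compute (total, unique) per ad, campaign, and account in one scan."""
    levels = ("ad", "campaign", "account")
    totals = {level: defaultdict(int) for level in levels}
    users = {level: defaultdict(set) for level in levels}
    for ad_id, user_id in impressions:  # single pass over the joined data
        campaign_id, account_id = dim_ads[ad_id]
        keys = (("ad", ad_id), ("campaign", campaign_id), ("account", account_id))
        for level, key in keys:
            totals[level][key] += 1
            users[level][key].add(user_id)
    return {
        level: {key: (totals[level][key], len(users[level][key]))
                for key in totals[level]}
        for level in levels
    }

report = multi_group_by(dim_ads, impressions)
print(report)
```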
In one of the optimizations that is being added to Hive, the query can be converted
into a sequence of Hadoop MapReduce jobs that are able to scale with data skew.
Essentially, the join is converted into one MapReduce job and the three group bys
are converted into four MapReduce jobs, with the first one generating a partial
aggregate on unique_id (the user_id column in the queries above). This is
especially useful because the distribution of impression_logs over unique_id is
much more uniform than over ad_id
(typically in an ad network, a few ads dominate in that they are shown more
uniformly to the users). As a result, computing the partial aggregation by unique_id
allows the pipeline to distribute the work more uniformly to the reducers. The same
template can be used to compute performance numbers for different time periods by
simply changing the date predicate in the query.
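The two-stage idea can be sketched in Python as follows: a first stage counts impressions per (ad, user) pair, which spreads a dominant ad across many keys, and a second stage folds those partial results into per-ad totals and unique counts. This is a simplified illustration under assumed data, not the plan Hive generates. The function names are hypothetical.

```python
from collections import Counter, defaultdict

def partial_by_user(impressions):
    """Stage 1: count impressions per (ad_id, user_id) pair."""
    return Counter(impressions)

def final_by_ad(partials):
    """Stage 2: fold the partial counts into (total, unique) per ad_id."""
    result = defaultdict(lambda: [0, 0])
    for (ad_id, _user_id), cnt in partials.items():
        result[ad_id][0] += cnt  # total impressions
        result[ad_id][1] += 1    # each (ad, user) pair occurs once, so this counts users
    return {ad_id: tuple(values) for ad_id, values in result.items()}

# Hypothetical skewed log: one ad dominates the traffic.
impressions = [("hot_ad", f"u{i % 50}") for i in range(1000)] + [("small_ad", "u1")]
partials = partial_by_user(impressions)
stats = final_by_ad(partials)
print(len(partials), stats)
```

In this toy log, the 1,000 rows for the dominant ad collapse to 50 partial rows before the per-ad step, so the reducer handling that ad has far less to process.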
Computing the lifetime numbers can be trickier, though: using the strategy
described previously, one would have to scan all the partitions of the impression_logs
table. Therefore, in order to compute the lifetime numbers, a more viable strategy is
to store the lifetime counts on a per ad_id, unique_id grouping every day in a
partition of an intermediate table. The data in this table combined with the next day’s
impression_logs can be used to incrementally generate the lifetime ad performance
numbers. As an example, in order to get the impression numbers for 2008-12-01, the
intermediate table partition for 2008-11-30 is used.
The Hive queries that can be used to achieve this are as follows:
INSERT OVERWRITE TABLE lifetime_partial_imps PARTITION(dateid='2008-12-01')
SELECT x.ad_id, x.user_id, sum(x.cnt)
FROM (
  SELECT a.ad_id, a.user_id, a.cnt
  FROM lifetime_partial_imps a
  WHERE a.dateid = '2008-11-30'
  UNION ALL
  SELECT b.ad_id, b.user_id, 1 as cnt
  FROM impression_logs b
  WHERE b.dateid = '2008-12-01' ) x
GROUP BY x.ad_id, x.user_id;
This query computes the partial sums for 2008-12-01, which can be used for
computing the 2008-12-01 numbers as well as the 2008-12-02 numbers (not shown
here). The SQL is converted to a single Hadoop MapReduce job that essentially
computes the group by on the combined stream of inputs. This SQL can be followed
by the following Hive query, which computes the actual numbers for different
groupings (similar to the one in the daily pipelines):
FROM (
  SELECT a.ad_id, a.campaign_id, a.account_id, b.user_id, b.cnt
  FROM dim_ads a JOIN lifetime_partial_imps b ON (b.ad_id = a.ad_id)
  WHERE b.dateid = '2008-12-01') x
INSERT OVERWRITE DIRECTORY 'results_gby_adid'
  SELECT x.ad_id, sum(x.cnt), count(DISTINCT x.user_id) GROUP BY x.ad_id
INSERT OVERWRITE DIRECTORY 'results_gby_campaignid'
  SELECT x.campaign_id, sum(x.cnt), count(DISTINCT x.user_id) GROUP BY x.campaign_id
INSERT OVERWRITE DIRECTORY 'results_gby_accountid'
  SELECT x.account_id, sum(x.cnt), count(DISTINCT x.user_id) GROUP BY x.account_id;
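The incremental scheme behind these lifetime queries can be summarized in a short Python sketch: the previous day's partial counts are merged with the current day's raw impressions, and the lifetime totals and unique counts are then derived from the merged result without rescanning older logs. The sample data and the function names roll_forward and lifetime_by_ad are hypothetical.

```python
from collections import Counter

def roll_forward(previous_partial, todays_log):
    """Merge yesterday's (ad_id, user_id) -> cnt with today's raw impressions."""
    merged = Counter(previous_partial)
    merged.update(todays_log)  # each raw log row contributes a count of 1
    return merged

def lifetime_by_ad(partial):
    """Return ad_id -> (lifetime impressions, lifetime unique users)."""
    out = {}
    for (ad_id, _user_id), cnt in partial.items():
        total, unique = out.get(ad_id, (0, 0))
        out[ad_id] = (total + cnt, unique + 1)
    return out

# Hypothetical partition for 2008-11-30 and raw log for 2008-12-01.
partial_1130 = {(1, "u1"): 5, (1, "u2"): 1}
log_1201 = [(1, "u1"), (1, "u3"), (2, "u1")]

partial_1201 = roll_forward(partial_1130, log_1201)
lifetime = lifetime_by_ad(partial_1201)
print(dict(partial_1201), lifetime)
```

The merged result plays the role of the 2008-12-01 partition and would in turn be the input for 2008-12-02, so each day's job touches only one intermediate partition and one day of logs.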
Hive and Hadoop are batch processing systems that cannot serve the computed
data with the same latency as a usual RDBMS such as Oracle or MySQL. Therefore,
on many occasions, it is still useful to load the summaries generated through Hive
and Hadoop to a more traditional RDBMS for serving this data to users through
different BI tools or even through a web portal.
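As a small illustration of this last step, the Python sketch below loads a handful of precomputed summary rows into a relational table and serves a point lookup from it. SQLite (via Python's built-in sqlite3 module) stands in for the serving database. The table name campaign_daily, its columns, and the sample rows are hypothetical, and INSERT OR REPLACE is SQLite-specific syntax.

```python
import sqlite3

# Hypothetical summary rows from the batch job:
# (campaign_id, dateid, impressions, unique_users).
summary_rows = [(10, "2008-12-01", 3, 2), (20, "2008-12-01", 1, 1)]

conn = sqlite3.connect(":memory:")  # stand-in for the serving RDBMS
conn.execute("""
    CREATE TABLE campaign_daily (
        campaign_id INTEGER,
        dateid TEXT,
        impressions INTEGER,
        unique_users INTEGER,
        PRIMARY KEY (campaign_id, dateid))
""")
# Re-running a day's load overwrites that day's rows instead of duplicating them.
conn.executemany("INSERT OR REPLACE INTO campaign_daily VALUES (?, ?, ?, ?)", summary_rows)
conn.commit()

def lookup(conn, campaign_id, dateid):
    """Low-latency point lookup of the kind a BI tool or web portal would issue."""
    return conn.execute(
        "SELECT impressions, unique_users FROM campaign_daily "
        "WHERE campaign_id = ? AND dateid = ?",
        (campaign_id, dateid),
    ).fetchone()

print(lookup(conn, 10, "2008-12-01"))
```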