
Hadoop: The Definitive Guide
by Tom White
foreword by Doug Cutting

Beijing • Cambridge • Farnham • Köln • Sebastopol • Taipei • Tokyo
Copyright © 2009 Tom White. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions
are also available for most titles. For more information, contact our corporate/institutional
sales department: (800) 998-9938 or [email protected]

Editor: Mike Loukides
Production Editor: Loranah Dimant
Proofreader: Nancy Kotary
Indexer: Ellen Troutman Zaig
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano

Printing History:
June 2009: First Edition.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of
O’Reilly Media, Inc. Hadoop: The Definitive Guide, the image of an African elephant, and related trade
dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a
trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and author assume
no responsibility for errors or omissions, or for damages resulting from the use of the information
contained herein.

This book uses RepKover™, a durable and flexible lay-flat binding.
ISBN: 978-0-596-52197-4
For Eliane, Emilia, and Lottie

Table of Contents

Foreword . . . . . xiii
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
1. Meet Hadoop . . . . . 1
    Data!  1
    Data Storage and Analysis  3
    Comparison with Other Systems  4
    RDBMS  4
    Grid Computing  6
    Volunteer Computing  8
    A Brief History of Hadoop  9
    The Apache Hadoop Project  12

2. MapReduce . . . . . 15
    A Weather Dataset  15
    Data Format  15
    Analyzing the Data with Unix Tools  17
    Analyzing the Data with Hadoop  18
    Map and Reduce  18
    Java MapReduce  20
    Scaling Out  27
    Data Flow  27
    Combiner Functions  29
    Running a Distributed MapReduce Job  32
    Hadoop Streaming  32
    Ruby  33
    Python  35
    Hadoop Pipes  36
    Compiling and Running  38

3. The Hadoop Distributed Filesystem . . . . . 41
    The Design of HDFS  41
    HDFS Concepts  42
    Blocks  42
    Namenodes and Datanodes  44
    The Command-Line Interface  45
    Basic Filesystem Operations  45
    Hadoop Filesystems  47
    Interfaces  49
    The Java Interface  51
    Reading Data from a Hadoop URL  51
    Reading Data Using the FileSystem API  52
    Writing Data  56
    Directories  57
    Querying the Filesystem  58
    Deleting Data  62
    Data Flow  63
    Anatomy of a File Read  63
    Anatomy of a File Write  66
    Coherency Model  68
    Parallel Copying with distcp  70
    Keeping an HDFS Cluster Balanced  71
    Hadoop Archives  71
    Using Hadoop Archives  72
    Limitations  73

4. Hadoop I/O . . . . . 75
    Data Integrity  75
    Data Integrity in HDFS  75
    LocalFileSystem  76
    ChecksumFileSystem  77
    Compression  77
    Codecs  79
    Compression and Input Splits  83
    Using Compression in MapReduce  84
    Serialization  86
    The Writable Interface  87
    Writable Classes  89
    Implementing a Custom Writable  96
    Serialization Frameworks  101
    File-Based Data Structures  103
    SequenceFile  103
    MapFile  110

5. Developing a MapReduce Application . . . . . 115
    The Configuration API  116
    Combining Resources  117
    Variable Expansion  117
    Configuring the Development Environment  118
    Managing Configuration  118
    GenericOptionsParser, Tool, and ToolRunner  121
    Writing a Unit Test  123
    Mapper  124
    Reducer  126
    Running Locally on Test Data  127
    Running a Job in a Local Job Runner  127
    Testing the Driver  130
    Running on a Cluster  132
    Packaging  132
    Launching a Job  132
    The MapReduce Web UI  134
    Retrieving the Results  136
    Debugging a Job  138
    Using a Remote Debugger  144
    Tuning a Job  145
    Profiling Tasks  146
    MapReduce Workflows  149
    Decomposing a Problem into MapReduce Jobs  149
    Running Dependent Jobs  151

6. How MapReduce Works . . . . . 153
    Anatomy of a MapReduce Job Run  153
    Job Submission  153
    Job Initialization  155
    Task Assignment  155
    Task Execution  156
    Progress and Status Updates  156
    Job Completion  158
    Failures  159
    Task Failure  159
    Tasktracker Failure  161
    Jobtracker Failure  161
    Job Scheduling  161
    The Fair Scheduler  162
    Shuffle and Sort  163
    The Map Side  163
    The Reduce Side  164
    Configuration Tuning  166
    Task Execution  168
    Speculative Execution  169
    Task JVM Reuse  170
    Skipping Bad Records  171
    The Task Execution Environment  172

7. MapReduce Types and Formats . . . . . 175
    MapReduce Types  175
    The Default MapReduce Job  178
    Input Formats  184
    Input Splits and Records  185
    Text Input  196
    Binary Input  199
    Multiple Inputs  200
    Database Input (and Output)  201
    Output Formats  202
    Text Output  202
    Binary Output  203
    Multiple Outputs  203
    Lazy Output  210
    Database Output  210

8. MapReduce Features . . . . . 211
    Counters  211
    Built-in Counters  211
    User-Defined Java Counters  213
    User-Defined Streaming Counters  218
    Sorting  218
    Preparation  218
    Partial Sort  219
    Total Sort  223
    Secondary Sort  227
    Joins  233
    Map-Side Joins  233
    Reduce-Side Joins  235
    Side Data Distribution  238
    Using the Job Configuration  238
    Distributed Cache  239
    MapReduce Library Classes  243

9. Setting Up a Hadoop Cluster . . . . . 245
    Cluster Specification  245
    Network Topology  247
    Cluster Setup and Installation  249
    Installing Java  249
    Creating a Hadoop User  250
    Installing Hadoop  250
    Testing the Installation  250
    SSH Configuration  251
    Hadoop Configuration  251
    Configuration Management  252
    Environment Settings  254
    Important Hadoop Daemon Properties  258
    Hadoop Daemon Addresses and Ports  263
    Other Hadoop Properties  264
    Post Install  266
    Benchmarking a Hadoop Cluster  266
    Hadoop Benchmarks  267
    User Jobs  269
    Hadoop in the Cloud  269
    Hadoop on Amazon EC2  269

10. Administering Hadoop . . . . . 273
    HDFS  273
    Persistent Data Structures  273
    Safe Mode  278
    Audit Logging  280
    Tools  280
    Monitoring  285
    Logging  285
    Metrics  286
    Java Management Extensions  289
    Maintenance  292
    Routine Administration Procedures  292
    Commissioning and Decommissioning Nodes  293
    Upgrades  296

11. Pig . . . . . 301
    Installing and Running Pig  302
    Execution Types  302
    Running Pig Programs  304
    Grunt  304
    Pig Latin Editors  305
    An Example  305
    Generating Examples  307
    Comparison with Databases  308
    Pig Latin  309
    Structure  310
    Statements  311
    Expressions  314
    Types  315
    Schemas  317
    Functions  320
    User-Defined Functions  322
    A Filter UDF  322
    An Eval UDF  325
    A Load UDF  327
    Data Processing Operators  331
    Loading and Storing Data  331
    Filtering Data  331
    Grouping and Joining Data  334
    Sorting Data  338
    Combining and Splitting Data  339
    Pig in Practice  340
    Parallelism  340
    Parameter Substitution  341

12. HBase . . . . . 343
    HBasics  343
    Backdrop  344
    Concepts  344
    Whirlwind Tour of the Data Model  344
    Implementation  345
    Installation  348
    Test Drive  349
    Clients  350
    Java  351
    REST and Thrift  353
    Example  354
    Schemas  354
    Loading Data  355
    Web Queries  358
    HBase Versus RDBMS  361
    Successful Service  362
    HBase  363
    Use Case: HBase at streamy.com  363
    Praxis  365
    Versions  365
    Love and Hate: HBase and HDFS  366
    UI  367
    Metrics  367
    Schema Design  367

13. ZooKeeper . . . . . 369
    Installing and Running ZooKeeper  370
    An Example  371
    Group Membership in ZooKeeper  372
    Creating the Group  372
    Joining a Group  374
    Listing Members in a Group  376
    Deleting a Group  378
    The ZooKeeper Service  378
    Data Model  379
    Operations  380
    Implementation  384
    Consistency  386
    Sessions  388
    States  389
    Building Applications with ZooKeeper  391
    A Configuration Service  391
    The Resilient ZooKeeper Application  394
    A Lock Service  398
    More Distributed Data Structures and Protocols  400
    ZooKeeper in Production  401
    Resilience and Performance  401
    Configuration  402

14. Case Studies . . . . . 405
    Hadoop Usage at Last.fm  405
    Last.fm: The Social Music Revolution  405
    Hadoop at Last.fm  405
    Generating Charts with Hadoop  406
    The Track Statistics Program  407
    Summary  414
    Hadoop and Hive at Facebook  414
    Introduction  414
    Hadoop at Facebook  414
    Hypothetical Use Case Studies  417
    Hive  420
    Problems and Future Work  424
    Nutch Search Engine  425
    Background  425
    Data Structures  426
    Selected Examples of Hadoop Data Processing in Nutch  429
    Summary  438
    Log Processing at Rackspace  439
    Requirements/The Problem  439
    Brief History  440
    Choosing Hadoop  440
    Collection and Storage  440
    MapReduce for Logs  442
    Cascading  447
    Fields, Tuples, and Pipes  448
    Operations  451
    Taps, Schemes, and Flows  452
    Cascading in Practice  454
    Flexibility  456
    Hadoop and Cascading at ShareThis  457
    Summary  461
    TeraByte Sort on Apache Hadoop  461

A. Installing Apache Hadoop . . . . . 465

B. Cloudera’s Distribution for Hadoop . . . . . 471

C. Preparing the NCDC Weather Data . . . . . 475

Index . . . . . 479

Foreword

Hadoop got its start in Nutch. A few of us were attempting to build an open source
web search engine and having trouble managing computations running on even a
handful of computers. Once Google published its GFS and MapReduce papers, the
route became clear. They’d devised systems to solve precisely the problems we were
having with Nutch. So we started, two of us, half-time, to try to recreate these systems
as a part of Nutch.
We managed to get Nutch limping along on 20 machines, but it soon became clear that
to handle the Web’s massive scale, we’d need to run it on thousands of machines and,
moreover, that the job was bigger than two half-time developers could handle.
Around that time, Yahoo! got interested, and quickly put together a team that I joined.
We split off the distributed computing part of Nutch, naming it Hadoop. With the help
of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web.
In 2006, Tom White started contributing to Hadoop. I already knew Tom through an
excellent article he’d written about Nutch, so I knew he could present complex ideas
in clear prose. I soon learned that he could also develop software that was as pleasant
to read as his prose.
From the beginning, Tom’s contributions to Hadoop showed his concern for users and
for the project. Unlike most open source contributors, Tom is not primarily interested
in tweaking the system to better meet his own needs, but rather in making it easier for
anyone to use.
Initially, Tom specialized in making Hadoop run well on Amazon’s EC2 and S3 services. Then he moved on to tackle a wide variety of problems, including improving the
MapReduce APIs, enhancing the website, and devising an object serialization framework. In all cases, Tom presented his ideas precisely. In short order, Tom earned the
role of Hadoop committer and soon thereafter became a member of the Hadoop Project
Management Committee.
Tom is now a respected senior member of the Hadoop developer community. Though
he’s an expert in many technical corners of the project, his specialty is making Hadoop
easier to use and understand.
Given this, I was very pleased when I learned that Tom intended to write a book about
Hadoop. Who could be better qualified? Now you have the opportunity to learn about
Hadoop from a master—not only of the technology, but also of common sense and
plain talk.
—Doug Cutting
Shed in the Yard, California

Preface

Martin Gardner, the mathematics and science writer, once said in an interview:
    Beyond calculus, I am lost. That was the secret of my column’s success. It took me so
    long to understand what I was writing about that I knew how to write in a way most
    readers would understand.*

In many ways, this is how I feel about Hadoop. Its inner workings are complex, resting
as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.
But it doesn’t need to be like this. Stripped to its core, the tools that Hadoop provides
for building distributed systems—for data storage, data analysis, and coordination—
are simple. If there’s a common theme, it is about raising the level of abstraction—to
create building blocks for programmers who just happen to have lots of data to store,
or lots of data to analyze, or lots of machines to coordinate, and who don’t have the
time, the skill, or the inclination to become distributed systems experts to build the
infrastructure to handle it.
With such a simple and generally applicable feature set, it seemed obvious to me when
I started using it that Hadoop deserved to be widely used. However, at the time (in
early 2006), setting up, configuring, and writing programs to use Hadoop was an art.
Things have certainly improved since then: there is more documentation, there are
more examples, and there are thriving mailing lists to go to when you have questions.
And yet the biggest hurdle for newcomers is understanding what this technology is
capable of, where it excels, and how to use it. That is why I wrote this book.
The Apache Hadoop community has come a long way. Over the course of three years,
the Hadoop project has blossomed and spun off half a dozen subprojects. In this time,
the software has made great leaps in performance, reliability, scalability, and manageability. To gain even wider adoption, however, I believe we need to make Hadoop even
easier to use. This will involve writing more tools; integrating with more systems; and
writing new, improved APIs. I’m looking forward to being a part of this, and I hope
this book will encourage and enable others to do so, too.

* “The science of fun,” Alex Bellos, The Guardian, May 31, 2008, 2008/may/31/maths.science.

Administrative Notes
During discussion of a particular Java class in the text, I often omit its package name,
to reduce clutter. If you need to know which package a class is in, you can easily look
it up in Hadoop’s Java API documentation for the relevant subproject, linked to from
the Apache Hadoop home page at http://hadoop.apache.org/. Or, if you’re using an IDE,
its auto-complete mechanism can help.
Similarly, although it deviates from usual style guidelines, program listings that import
multiple classes from the same package may use the asterisk wildcard character to save
space (for example: import org.apache.hadoop.io.*).
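
For instance, here is a minimal, self-contained sketch of the two styles (it assumes the Hadoop core JAR is on the classpath; the class name ImportStyleExample and the variable names are illustrative only, not taken from the book’s examples):

    // The explicit style names each class in full:
    //   import org.apache.hadoop.io.IntWritable;
    //   import org.apache.hadoop.io.Text;
    // The book's listings use the wildcard form instead, to save space:
    import org.apache.hadoop.io.*;

    public class ImportStyleExample {
        public static void main(String[] args) {
            // Both classes resolve through the wildcard import above; their fully
            // qualified names are org.apache.hadoop.io.IntWritable and
            // org.apache.hadoop.io.Text.
            IntWritable count = new IntWritable(42);
            Text word = new Text("hadoop");
            System.out.println(word + " -> " + count.get()); // prints: hadoop -> 42
        }
    }

In your own code, an IDE’s organize-imports feature can expand such a wildcard back into explicit imports.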
The sample programs in this book are available for download from the website that
accompanies this book: http://www.hadoopbook.com/. You will also find instructions
there for obtaining the datasets that are used in examples throughout the book, as well
as further notes for running the programs in the book, and links to updates, additional
resources, and my blog.

What’s in This Book?
The rest of this book is organized as follows. Chapter 2 provides an introduction to
MapReduce. Chapter 3 looks at Hadoop filesystems, and in particular HDFS, in depth.
Chapter 4 covers the fundamentals of I/O in Hadoop: data integrity, compression,
serialization, and file-based data structures.
The next four chapters cover MapReduce in depth. Chapter 5 goes through the practical
steps needed to develop a MapReduce application. Chapter 6 looks at how MapReduce
is implemented in Hadoop, from the point of view of a user. Chapter 7 is about the
MapReduce programming model, and the various data formats that MapReduce can
work with. Chapter 8 is on advanced MapReduce topics, including sorting and joining
data.
Chapters 9 and 10 are for Hadoop administrators, and describe how to set up and
maintain a Hadoop cluster running HDFS and MapReduce.
Chapters 11, 12, and 13 present Pig, HBase, and ZooKeeper, respectively.
Finally, Chapter 14 is a collection of case studies contributed by members of the Apache
Hadoop community.

Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements
    such as variable or function names, databases, data types, environment variables,
    statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined
    by context.
This icon signifies a tip, suggestion, or general note.

This icon indicates a warning or caution.

Using Code Examples
This book is here to help you get your job done. In general, you may use the code in
this book in your programs and documentation. You do not need to contact us for
permission unless you’re reproducing a significant portion of the code. For example,
writing a program that uses several chunks of code from this book does not require
permission. Selling or distributing a CD-ROM of examples from O’Reilly books does
require permission. Answering a question by citing this book and quoting example
code does not require permission. Incorporating a significant amount of example code
from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Hadoop: The Definitive Guide, by Tom
White. Copyright 2009 Tom White, 978-0-596-52197-4.”
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us at [email protected]

Safari® Books Online
When you see a Safari® Books Online icon on the cover of your favorite
technology book, that means the book is available online through the
O’Reilly Network Safari Bookshelf.
Safari offers a solution that’s better than e-books. It’s a virtual library that lets you easily
search thousands of top tech books, cut and paste code samples, download chapters,
and find quick answers when you need the most accurate, current information. Try it
for free.

How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938