Hadoop The Definitive Guide, 2nd Edition.pdf -...



SECOND EDITION

Hadoop: The Definitive Guide

Tom White
foreword by Doug Cutting

Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo

Hadoop: The Definitive Guide, Second Edition
by Tom White

Copyright © 2011 Tom White. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles. For more information, contact our corporate/institutional sales department: (800) 998-9938 or [email protected].

Editor: Mike Loukides
Production Editor: Adam Zaremba
Proofreader: Diane Il Grande
Indexer: Jay Book Services
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano

Printing History:
  June 2009: First Edition.
  October 2010: Second Edition.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Hadoop: The Definitive Guide, the image of an African elephant, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-38973-4
[SB]
1285179414

For Eliane, Emilia, and Lottie

Table of Contents

Foreword
Preface
1. Meet Hadoop
  Data!
  Data Storage and Analysis
  Comparison with Other Systems
  RDBMS
  Grid Computing
  Volunteer Computing
  A Brief History of Hadoop
  Apache Hadoop and the Hadoop Ecosystem

2. MapReduce
  A Weather Dataset
  Data Format
  Analyzing the Data with Unix Tools
  Analyzing the Data with Hadoop
  Map and Reduce
  Java MapReduce
  Scaling Out
  Data Flow
  Combiner Functions
  Running a Distributed MapReduce Job
  Hadoop Streaming
  Ruby
  Python
  Hadoop Pipes
  Compiling and Running

3. The Hadoop Distributed Filesystem
  The Design of HDFS
  HDFS Concepts
  Blocks
  Namenodes and Datanodes
  The Command-Line Interface
  Basic Filesystem Operations
  Hadoop Filesystems
  Interfaces
  The Java Interface
  Reading Data from a Hadoop URL
  Reading Data Using the FileSystem API
  Writing Data
  Directories
  Querying the Filesystem
  Deleting Data
  Data Flow
  Anatomy of a File Read
  Anatomy of a File Write
  Coherency Model
  Parallel Copying with distcp
  Keeping an HDFS Cluster Balanced
  Hadoop Archives
  Using Hadoop Archives
  Limitations

4. Hadoop I/O
  Data Integrity
  Data Integrity in HDFS
  LocalFileSystem
  ChecksumFileSystem
  Compression
  Codecs
  Compression and Input Splits
  Using Compression in MapReduce
  Serialization
  The Writable Interface
  Writable Classes
  Implementing a Custom Writable
  Serialization Frameworks
  Avro
  File-Based Data Structures
  SequenceFile
  MapFile

5. Developing a MapReduce Application
  The Configuration API
  Combining Resources
  Variable Expansion
  Configuring the Development Environment
  Managing Configuration
  GenericOptionsParser, Tool, and ToolRunner
  Writing a Unit Test
  Mapper
  Reducer
  Running Locally on Test Data
  Running a Job in a Local Job Runner
  Testing the Driver
  Running on a Cluster
  Packaging
  Launching a Job
  The MapReduce Web UI
  Retrieving the Results
  Debugging a Job
  Using a Remote Debugger
  Tuning a Job
  Profiling Tasks
  MapReduce Workflows
  Decomposing a Problem into MapReduce Jobs
  Running Dependent Jobs

6. How MapReduce Works
  Anatomy of a MapReduce Job Run
  Job Submission
  Job Initialization
  Task Assignment
  Task Execution
  Progress and Status Updates
  Job Completion
  Failures
  Task Failure
  Tasktracker Failure
  Jobtracker Failure
  Job Scheduling
  The Fair Scheduler
  The Capacity Scheduler
  Shuffle and Sort
  The Map Side
  The Reduce Side
  Configuration Tuning
  Task Execution
  Speculative Execution
  Task JVM Reuse
  Skipping Bad Records
  The Task Execution Environment

7. MapReduce Types and Formats
  MapReduce Types
  The Default MapReduce Job
  Input Formats
  Input Splits and Records
  Text Input
  Binary Input
  Multiple Inputs
  Database Input (and Output)
  Output Formats
  Text Output
  Binary Output
  Multiple Outputs
  Lazy Output
  Database Output

8. MapReduce Features
  Counters
  Built-in Counters
  User-Defined Java Counters
  User-Defined Streaming Counters
  Sorting
  Preparation
  Partial Sort
  Total Sort
  Secondary Sort
  Joins
  Map-Side Joins
  Reduce-Side Joins
  Side Data Distribution
  Using the Job Configuration
  Distributed Cache
  MapReduce Library Classes

9. Setting Up a Hadoop Cluster
  Cluster Specification
  Network Topology
  Cluster Setup and Installation
  Installing Java
  Creating a Hadoop User
  Installing Hadoop
  Testing the Installation
  SSH Configuration
  Hadoop Configuration
  Configuration Management
  Environment Settings
  Important Hadoop Daemon Properties
  Hadoop Daemon Addresses and Ports
  Other Hadoop Properties
  User Account Creation
  Security
  Kerberos and Hadoop
  Delegation Tokens
  Other Security Enhancements
  Benchmarking a Hadoop Cluster
  Hadoop Benchmarks
  User Jobs
  Hadoop in the Cloud
  Hadoop on Amazon EC2

10. Administering Hadoop
  HDFS
  Persistent Data Structures
  Safe Mode
  Audit Logging
  Tools
  Monitoring
  Logging
  Metrics
  Java Management Extensions
  Maintenance
  Routine Administration Procedures
  Commissioning and Decommissioning Nodes
  Upgrades

11. Pig
  Installing and Running Pig
  Execution Types
  Running Pig Programs
  Grunt
  Pig Latin Editors
  An Example
  Generating Examples
  Comparison with Databases
  Pig Latin
  Structure
  Statements
  Expressions
  Types
  Schemas
  Functions
  User-Defined Functions
  A Filter UDF
  An Eval UDF
  A Load UDF
  Data Processing Operators
  Loading and Storing Data
  Filtering Data
  Grouping and Joining Data
  Sorting Data
  Combining and Splitting Data
  Pig in Practice
  Parallelism
  Parameter Substitution

12. Hive
  Installing Hive
  The Hive Shell
  An Example
  Running Hive
  Configuring Hive
  Hive Services
  The Metastore
  Comparison with Traditional Databases
  Schema on Read Versus Schema on Write
  Updates, Transactions, and Indexes
  HiveQL
  Data Types
  Operators and Functions
  Tables
  Managed Tables and External Tables
  Partitions and Buckets
  Storage Formats
  Importing Data
  Altering Tables
  Dropping Tables
  Querying Data
  Sorting and Aggregating
  MapReduce Scripts
  Joins
  Subqueries
  Views
  User-Defined Functions
  Writing a UDF
  Writing a UDAF

13. HBase
  HBasics
  Backdrop
  Concepts
  Whirlwind Tour of the Data Model
  Implementation
  Installation
  Test Drive
  Clients
  Java
  Avro, REST, and Thrift
  Example
  Schemas
  Loading Data
  Web Queries
  HBase Versus RDBMS
  Successful Service
  HBase
  Use Case: HBase at Streamy.com
  Praxis
  Versions
  HDFS
  UI
  Metrics
  Schema Design
  Counters
  Bulk Load

14. ZooKeeper
  Installing and Running ZooKeeper
  An Example
  Group Membership in ZooKeeper
  Creating the Group
  Joining a Group
  Listing Members in a Group
  Deleting a Group
  The ZooKeeper Service
  Data Model
  Operations
  Implementation
  Consistency
  Sessions
  States
  Building Applications with ZooKeeper
  A Configuration Service
  The Resilient ZooKeeper Application
  A Lock Service
  More Distributed Data Structures and Protocols
  ZooKeeper in Production
  Resilience and Performance
  Configuration

15. Sqoop
  Getting Sqoop
  A Sample Import
  Generated Code
  Additional Serialization Systems
  Database Imports: A Deeper Look
  Controlling the Import
  Imports and Consistency
  Direct-mode Imports
  Working with Imported Data
  Imported Data and Hive
  Importing Large Objects
  Performing an Export
  Exports: A Deeper Look
  Exports and Transactionality
  Exports and SequenceFiles

16. Case Studies
  Hadoop Usage at Last.fm
  Last.fm: The Social Music Revolution
  Hadoop at Last.fm
  Generating Charts with Hadoop
  The Track Statistics Program
  Summary
  Hadoop and Hive at Facebook
  Introduction
  Hadoop at Facebook
  Hypothetical Use Case Studies
  Hive
  Problems and Future Work
  Nutch Search Engine
  Background
  Data Structures
  Selected Examples of Hadoop Data Processing in Nutch
  Summary
  Log Processing at Rackspace
  Requirements/The Problem
  Brief History
  Choosing Hadoop
  Collection and Storage
  MapReduce for Logs
  Cascading
  Fields, Tuples, and Pipes
  Operations
  Taps, Schemes, and Flows
  Cascading in Practice
  Flexibility
  Hadoop and Cascading at ShareThis
  Summary
  TeraByte Sort on Apache Hadoop
  Using Pig and Wukong to Explore Billion-edge Network Graphs
  Measuring Community
  Everybody’s Talkin’ at Me: The Twitter Reply Graph
  Symmetric Links
  Community Extraction

A. Installing Apache Hadoop
B. Cloudera’s Distribution for Hadoop
C. Preparing the NCDC Weather Data

Index

Foreword

Hadoop got its start in Nutch. A few of us were attempting to build an open source web search engine and having trouble managing computations running on even a handful of computers. Once Google published its GFS and MapReduce papers, the route became clear. They’d devised systems to solve precisely the problems we were having with Nutch.
So we started, two of us, half-time, to try to re-create these systems as a part of Nutch.

We managed to get Nutch limping along on 20 machines, but it soon became clear that to handle the Web’s massive scale, we’d need to run it on thousands of machines and, moreover, that the job was bigger than two half-time developers could handle.

Around that time, Yahoo! got interested, and quickly put together a team that I joined. We split off the distributed computing part of Nutch, naming it Hadoop. With the help of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web.

In 2006, Tom White started contributing to Hadoop. I already knew Tom through an excellent article he’d written about Nutch, so I knew he could present complex ideas in clear prose. I soon learned that he could also develop software that was as pleasant to read as his prose.

From the beginning, Tom’s contributions to Hadoop showed his concern for users and for the project. Unlike most open source contributors, Tom is not primarily interested in tweaking the system to better meet his own needs, but rather in making it easier for anyone to use.

Initially, Tom specialized in making Hadoop run well on Amazon’s EC2 and S3 services. Then he moved on to tackle a wide variety of problems, including improving the MapReduce APIs, enhancing the website, and devising an object serialization framework. In all cases, Tom presented his ideas precisely. In short order, Tom earned the role of Hadoop committer and soon thereafter became a member of the Hadoop Project Management Committee.

Tom is now a respected senior member of the Hadoop developer community. Though he’s an expert in many technical corners of the project, his specialty is making Hadoop easier to use and understand.

Given this, I was very pleased when I learned that Tom intended to write a book about Hadoop. Who could be better qualified?
Now you have the opportunity to learn about Hadoop from a master—not only of the technology, but also of common sense and plain talk.

—Doug Cutting
Shed in the Yard, California

Preface

Martin Gardner, the mathematics and science writer, once said in an interview:

    Beyond calculus, I am lost. That was the secret of my column’s success. It took me so long to understand what I was writing about that I knew how to write in a way most readers would understand.*

In many ways, this is how I feel about Hadoop. Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.

But it doesn’t need to be like this. Stripped to its core, the tools that Hadoop provides for building distributed systems—for data storage, data analysis, and coordination—are simple. If there’s a common theme, it is about raising the level of abstraction—to create building blocks for programmers who just happen to have lots of data to store, or lots of data to analyze, or lots of machines to coordinate, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.

With such a simple and generally applicable feature set, it seemed obvious to me when I started using it that Hadoop deserved to be widely used. However, at the time (in early 2006), setting up, configuring, and writing programs to use Hadoop was an art. Things have certainly improved since then: there is more documentation, there are more examples, and there are thriving mailing lists to go to when you have questions. And yet the biggest hurdle for newcomers is understanding what this technology is capable of, where it excels, and how to use it. That is why I wrote this book.

The Apache Hadoop community has come a long way. Over the course of three years, the Hadoop project has blossomed and spun off half a dozen subprojects.
In this time, the software has made great leaps in performance, reliability, scalability, and manageability. To gain even wider adoption, however, I believe we need to make Hadoop even easier to use. This will involve writing more tools; integrating with more systems; and writing new, improved APIs. I’m looking forward to being a part of this, and I hope this book will encourage and enable others to do so, too.

* “The science of fun,” Alex Bellos, The Guardian, May 31, 2008, 2008/may/31/maths.science.

Administrative Notes

During discussion of a particular Java class in the text, I often omit its package name, to reduce clutter. If you need to know which package a class is in, you can easily look it up in Hadoop’s Java API documentation for the relevant subproject, linked to from the Apache Hadoop home page. Or if you’re using an IDE, its auto-complete mechanism can help.

Similarly, although it deviates from usual style guidelines, program listings that import multiple classes from the same package may use the asterisk wildcard character to save space (for example: import org.apache.hadoop.io.*).

The sample programs in this book are available for download from the website that accompanies this book. You will also find instructions there for obtaining the datasets that are used in examples throughout the book, as well as further notes for running the programs in the book, and links to updates, additional resources, and my blog.

What’s in This Book?

The rest of this book is organized as follows. Chapter 1 emphasizes the need for Hadoop and sketches the history of the project. Chapter 2 provides an introduction to MapReduce. Chapter 3 looks at Hadoop filesystems, and in particular HDFS, in depth. Chapter 4 covers the fundamentals of I/O in Hadoop: data integrity, compression, serialization, and file-based data structures.

The next four chapters cover MapReduce in depth.
Chapter 5 goes through the practical steps needed to develop a MapReduce application. Chapter 6 looks at how MapReduce is implemented in Hadoop, from the point of view of a user. Chapter 7 is about the MapReduce programming model, and the various data formats that MapReduce can work with. Chapter 8 is on advanced MapReduce topics, including sorting and joining data.

Chapters 9 and 10 are for Hadoop administrators, and describe how to set up and maintain a Hadoop cluster running HDFS and MapReduce.

Later chapters are dedicated to projects that build on Hadoop or are related to it. Chapters 11 and 12 present Pig and Hive, which are analytics platforms built on HDFS and MapReduce, whereas Chapters 13, 14, and 15 cover HBase, ZooKeeper, and Sqoop, respectively.

Finally, Chapter 16 is a collection of case studies contributed by members of the Apache Hadoop community.

What’s New in the Second Edition?

The second edition has two new chapters on Hive and Sqoop (Chapters 12 and 15), a new section covering Avro (in Chapter 4), an introduction to the new security features in Hadoop (in Chapter 9), and a new case study on analyzing massive network graphs using Hadoop (in Chapter 16).

This edition continues to describe the 0.20 release series of Apache Hadoop, since this was ...