Hadoop: The Definitive Guide, Fourth Edition, by Tom White


FOURTH EDITION

Hadoop: The Definitive Guide
Tom White

Hadoop: The Definitive Guide, Fourth Edition
by Tom White

Copyright © 2010 Tom White. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles. For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editors: Mike Loukides and Meghan Blanchette
Production Editor: Matt Hacker
Copyeditor: FIX ME
Proofreader: FIX ME!
Indexer: FIX ME
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Rebecca Demarest

April 2015: Fourth Edition

Revision History for the Fourth Edition:
2014-12-18: Early Release revision
See for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly Media, Inc. !!FILL THIS IN!! and related trade dress are trademarks of O'Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-491-90163-2 [?]

For Eliane, Emilia, and Lottie

Table of Contents

Foreword
Preface

1. Meet Hadoop
   Data!
   Data Storage and Analysis
   Querying All Your Data
   Beyond Batch
   Comparison with Other Systems
   Relational Database Management Systems
   Grid Computing
   Volunteer Computing
   A Brief History of Apache Hadoop
   What's in This Book?

Part I. Hadoop Fundamentals

2. MapReduce
   A Weather Dataset
   Data Format
   Analyzing the Data with Unix Tools
   Analyzing the Data with Hadoop
   Map and Reduce
   Java MapReduce
   Scaling Out
   Data Flow
   Combiner Functions
   Running a Distributed MapReduce Job
   Hadoop Streaming
   Ruby
   Python

3. The Hadoop Distributed Filesystem
   The Design of HDFS
   HDFS Concepts
   Blocks
   Namenodes and Datanodes
   Block Caching
   HDFS Federation
   HDFS High Availability
   The Command-Line Interface
   Basic Filesystem Operations
   Hadoop Filesystems
   Interfaces
   The Java Interface
   Reading Data from a Hadoop URL
   Reading Data Using the FileSystem API
   Writing Data
   Directories
   Querying the Filesystem
   Deleting Data
   Data Flow
   Anatomy of a File Read
   Anatomy of a File Write
   Coherency Model
   Parallel Copying with distcp
   Keeping an HDFS Cluster Balanced

4. YARN
   Anatomy of a YARN Application Run
   Resource Requests
   Application Lifespan
   Building YARN Applications
   YARN Compared to MapReduce 1
   Scheduling in YARN
   The FIFO Scheduler
   The Capacity Scheduler
   The Fair Scheduler
   Delay Scheduling
   Dominant Resource Fairness
   Further Reading

5. Hadoop I/O
   Data Integrity
   Data Integrity in HDFS
   LocalFileSystem
   ChecksumFileSystem
   Compression
   Codecs
   Compression and Input Splits
   Using Compression in MapReduce
   Serialization
   The Writable Interface
   Writable Classes
   Implementing a Custom Writable
   Serialization Frameworks
   File-Based Data Structures
   SequenceFile
   MapFile
   Other File Formats and Column-Oriented Formats

Part II. MapReduce

6. Developing a MapReduce Application
   The Configuration API
   Combining Resources
   Variable Expansion
   Setting Up the Development Environment
   Managing Configuration
   GenericOptionsParser, Tool, and ToolRunner
   Writing a Unit Test with MRUnit
   Mapper
   Reducer
   Running Locally on Test Data
   Running a Job in a Local Job Runner
   Testing the Driver
   Running on a Cluster
   Packaging a Job
   Launching a Job
   The MapReduce Web UI
   Retrieving the Results
   Debugging a Job
   Hadoop Logs
   Remote Debugging
   Tuning a Job
   Profiling Tasks
   MapReduce Workflows
   Decomposing a Problem into MapReduce Jobs
   JobControl
   Apache Oozie

7. How MapReduce Works
   Anatomy of a MapReduce Job Run
   Job Submission
   Job Initialization
   Task Assignment
   Task Execution
   Progress and Status Updates
   Job Completion
   Failures
   Task Failure
   Application Master Failure
   Node Manager Failure
   Resource Manager Failure
   Shuffle and Sort
   The Map Side
   The Reduce Side
   Configuration Tuning
   Task Execution
   The Task Execution Environment
   Speculative Execution
   Output Committers

8. MapReduce Types and Formats
   MapReduce Types
   The Default MapReduce Job
   Input Formats
   Input Splits and Records
   Text Input
   Binary Input
   Multiple Inputs
   Database Input (and Output)
   Output Formats
   Text Output
   Binary Output
   Multiple Outputs
   Lazy Output
   Database Output

9. MapReduce Features
   Counters
   Built-in Counters
   User-Defined Java Counters
   User-Defined Streaming Counters
   Sorting
   Preparation
   Partial Sort
   Total Sort
   Secondary Sort
   Joins
   Map-Side Joins
   Reduce-Side Joins
   Side Data Distribution
   Using the Job Configuration
   Distributed Cache
   MapReduce Library Classes

Part III. Hadoop Operations

10. Setting Up a Hadoop Cluster
   Cluster Specification
   Cluster Sizing
   Network Topology
   Cluster Setup and Installation
   Installing Java
   Creating Unix User Accounts
   Installing Hadoop
   Configuring SSH
   Configuring Hadoop
   Formatting the HDFS Filesystem
   Starting and Stopping the Daemons
   Creating User Directories
   Hadoop Configuration
   Configuration Management
   Environment Settings
   Important Hadoop Daemon Properties
   Hadoop Daemon Addresses and Ports
   Other Hadoop Properties
   Security
   Kerberos and Hadoop
   Delegation Tokens
   Other Security Enhancements
   Benchmarking a Hadoop Cluster
   Hadoop Benchmarks
   User Jobs

11. Administering Hadoop
   HDFS
   Persistent Data Structures
   Safe Mode
   Audit Logging
   Tools
   Monitoring
   Logging
   Metrics and JMX
   Maintenance
   Routine Administration Procedures
   Commissioning and Decommissioning Nodes
   Upgrades

Part IV. Related Projects

12. Avro
   Avro Data Types and Schemas
   In-Memory Serialization and Deserialization
   The Specific API
   Avro Datafiles
   Interoperability
   Python API
   Avro Tools
   Schema Resolution
   Sort Order
   Avro MapReduce
   Sorting Using Avro MapReduce
   Avro in Other Languages

13. Parquet
   Data Model
   Nested Encoding
   Parquet File Format
   Parquet Configuration
   Writing and Reading Parquet Files
   Avro, Protocol Buffers, and Thrift
   Parquet MapReduce

14. Flume
   Installing Flume
   An Example
   Transactions and Reliability
   Batching
   The HDFS Sink
   Partitioning and Interceptors
   File Formats
   Fan Out
   Delivery Guarantees
   Replicating and Multiplexing Selectors
   Distribution: Agent Tiers
   Delivery Guarantees
   Sink Groups
   Integrating Flume with Applications
   Component Catalog
   Further Reading

15. Sqoop
   Getting Sqoop
   Sqoop Connectors
   A Sample Import
   Text and Binary File Formats
   Generated Code
   Additional Serialization Systems
   Imports: A Deeper Look
   Controlling the Import
   Imports and Consistency
   Incremental Imports
   Direct-Mode Imports
   Working with Imported Data
   Imported Data and Hive
   Importing Large Objects
   Performing an Export
   Exports: A Deeper Look
   Exports and Transactionality
   Exports and SequenceFiles
   Further Reading

16. Pig
   Installing and Running Pig
   Execution Types
   Running Pig Programs
   Grunt
   Pig Latin Editors
   An Example
   Generating Examples
   Comparison with Databases
   Pig Latin
   Structure
   Statements
   Expressions
   Types
   Schemas
   Functions
   Macros
   User-Defined Functions
   A Filter UDF
   An Eval UDF
   A Load UDF
   Data Processing Operators
   Loading and Storing Data
   Filtering Data
   Grouping and Joining Data
   Sorting Data
   Combining and Splitting Data
   Pig in Practice
   Parallelism
   Anonymous Relations
   Parameter Substitution
   Further Reading

17. Hive
   Installing Hive
   The Hive Shell
   An Example
   Running Hive
   Configuring Hive
   Hive Services
   The Metastore
   Comparison with Traditional Databases
   Schema on Read Versus Schema on Write
   Updates, Transactions, and Indexes
   SQL-on-Hadoop Alternatives
   HiveQL
   Data Types
   Operators and Functions
   Tables
   Managed Tables and External Tables
   Partitions and Buckets
   Storage Formats
   Importing Data
   Altering Tables
   Dropping Tables
   Querying Data
   Sorting and Aggregating
   MapReduce Scripts
   Joins
   Subqueries
   Views
   User-Defined Functions
   Writing a UDF
   Writing a UDAF
   Further Reading

18. Crunch
   An Example
   The Core Crunch API
   Primitive Operations
   Types
   Sources and Targets
   Functions
   Materialization
   Pipeline Execution
   Running a Pipeline
   Stopping a Pipeline
   Inspecting a Crunch Plan
   Iterative Algorithms
   Checkpointing a Pipeline
   Crunch Libraries
   Further Reading

19. Spark
   Installing Spark
   An Example
   Spark Applications, Jobs, Stages, and Tasks
   A Scala Standalone Application
   A Java Example
   A Python Example
   Resilient Distributed Datasets
   Creation
   Transformations and Actions
   Persistence
   Serialization
   Shared Variables
   Broadcast Variables
   Accumulators
   Anatomy of a Spark Job Run
   Job Submission
   DAG Construction
   Task Scheduling
   Task Execution
   Executors and Cluster Managers
   Spark on YARN
   Further Reading

20. HBase
   HBasics
   Backdrop
   Concepts
   Whirlwind Tour of the Data Model
   Implementation
   Installation
   Test Drive
   Clients
   Java
   MapReduce
   REST and Thrift
   Building an Online Query Application
   Schema Design
   Loading Data
   Online Queries
   HBase Versus RDBMS
   Successful Service
   HBase
   Praxis
   HDFS
   UI
   Metrics
   Counters
   Further Reading

21. ZooKeeper
   Installing and Running ZooKeeper
   An Example
   Group Membership in ZooKeeper
   Creating the Group
   Joining a Group
   Listing Members in a Group
   Deleting a Group
   The ZooKeeper Service
   Data Model
   Operations
   Implementation
   Consistency
   Sessions
   States
   Building Applications with ZooKeeper
   A Configuration Service
   The Resilient ZooKeeper Application
   A Lock Service
   More Distributed Data Structures and Protocols
   ZooKeeper in Production
   Resilience and Performance
   Configuration
   Further Reading

22. Case Studies
   Composable Data at Cerner
   From CPUs to Semantic Integration
   Enter Apache Crunch
   Building a Complete Picture
   Integrating Healthcare Data
   Composability over Frameworks
   Moving Forward
   Biological Data Science: Saving Lives with Software
   The Structure of DNA
   The Genetic Code: Turning DNA Letters into Proteins
   Thinking of DNA as Source Code
   The Human Genome Project and Reference Genomes
   Sequencing and Aligning DNA
   ADAM, a Scalable Genome Analysis Platform
   From Personalized Ads to Personalized Medicine
   Join In
   Cascading
   Fields, Tuples, and Pipes
   Operations
   Taps, Schemes, and Flows
   Cascading in Practice
   Flexibility
   Hadoop and Cascading at ShareThis
   Summary

A. Installing Apache Hadoop
B. Cloudera's Distribution Including Apache Hadoop
C. Preparing the NCDC Weather Data
D. The Old and New Java MapReduce APIs

Index

Foreword

Hadoop got its start in Nutch. A few of us were attempting to build an open source web search engine and having trouble managing computations running on even a handful of computers. Once Google published its GFS and MapReduce papers, the route became clear. They'd devised systems to solve precisely the problems we were having with Nutch. So we started, two of us, half-time, to try to re-create these systems as a part of Nutch.
We managed to get Nutch limping along on 20 machines, but it soon became clear that to handle the Web's massive scale, we'd need to run it on thousands of machines and, moreover, that the job was bigger than two half-time developers could handle.

Around that time, Yahoo! got interested, and quickly put together a team that I joined. We split off the distributed computing part of Nutch, naming it Hadoop. With the help of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web.

In 2006, Tom White started contributing to Hadoop. I already knew Tom through an excellent article he'd written about Nutch, so I knew he could present complex ideas in clear prose. I soon learned that he could also develop software that was as pleasant to read as his prose.

From the beginning, Tom's contributions to Hadoop showed his concern for users and for the project. Unlike most open source contributors, Tom is not primarily interested in tweaking the system to better meet his own needs, but rather in making it easier for anyone to use.

Initially, Tom specialized in making Hadoop run well on Amazon's EC2 and S3 services. Then he moved on to tackle a wide variety of problems, including improving the MapReduce APIs, enhancing the website, and devising an object serialization framework. In all cases, Tom presented his ideas precisely. In short order, Tom earned the role of Hadoop committer and soon thereafter became a member of the Hadoop Project Management Committee.

Tom is now a respected senior member of the Hadoop developer community. Though he's an expert in many technical corners of the project, his specialty is making Hadoop easier to use and understand. Given this, I was very pleased when I learned that Tom intended to write a book about Hadoop. Who could be better qualified? Now you have the opportunity to learn about Hadoop from a master—not only of the technology, but also of common sense and plain talk.
—Doug Cutting
Shed in the Yard, California

Preface

Martin Gardner, the mathematics and science writer, once said in an interview:

   Beyond calculus, I am lost. That was the secret of my column's success. It took me so long to understand what I was writing about that I knew how to write in a way most readers would understand.

In many ways, this is how I feel about Hadoop. Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.

But it doesn't need to be like this. Stripped to its core, the tools that Hadoop provides for working with big data are simple. If there's a common theme, it is about raising the level of abstraction—to create building blocks for programmers who have lots of data to store and analyze, and who don't have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.

With such a simple and generally applicable feature set, it seemed obvious to me when I started using it that Hadoop deserved to be widely used. However, at the time (in early 2006), setting up, configuring, and writing programs to use Hadoop was an art. Things have certainly improved since then: there is more documentation, there are more examples, and there are thriving mailing lists to go to when you have questions. And yet the biggest hurdle for newcomers is understanding what this technology is capable of, where it excels, and how to use it. That is why I wrote this book.

The Apache Hadoop community has come a long way. Since I wrote the first edition of this book, the Hadoop project has blossomed. "Big data" has become a household term. In this time, the software has made great leaps in adoption...