Hadoop: The Definitive Guide, 1st Edition
Table of Contents

Copyright
Preface
Meet Hadoop
    A Brief History of Hadoop
    Distributed Computing
    Installing Hadoop
The Hadoop Distributed Filesystem
    The Design of HDFS
    HDFS Concepts
    The Command Line Interface to HDFS
    The Java Interface to HDFS
    Data Flow
    Parallel Copying with distcp
    Other Interfaces to HDFS
MapReduce
    A Weather Dataset
    Analyzing the Data with Unix Tools
    Analyzing the Data with Hadoop
    Scaling Out
    Hadoop Streaming
    Hadoop Pipes
Developing a MapReduce Application
    The Development Lifecycle
Hadoop I/O
    Data Integrity
    Compression
    Serialization
    File-based Data Structures
MapReduce Features
    Anatomy of a MapReduce Job Run
    MapReduce Types
    Input Formats
    Output Formats
    Sorting
    Task Execution
    Counters
    Side Data Distribution
Setting Up a Hadoop Cluster
    Cluster Specification
    Cluster Set Up and Installation
    SSH Configuration
    Hadoop Configuration
    Tuning a Hadoop Cluster
Running Hadoop in the Cloud
Administering Hadoop
ZooKeeper
Pig
HBase
Copyright Copyright © Tom White, 2009. All rights reserved. Published by O'Reilly Media, Inc. (1005 Gravenstein Highway North, Sebastopol, CA, 95472) O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safari.oreilly.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or
[email protected]. Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly Media, Inc. The vulpine phalanger and related trade dress are trademarks of O'Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
Preface
P.1. Administrative Notes

During discussion of a particular Java class in the text, I often omit its package name to reduce clutter. If you need to know which package a class is in, you can easily look it up in Hadoop's Java API documentation at http://hadoop.apache.org/core/docs/current/api/index.html. Or, if you're using an IDE, its auto-complete mechanism can help. Similarly, although it deviates from usual style guidelines, program listings that import multiple classes from the same package use the asterisk wildcard character to save space (for example: import org.apache.hadoop.io.*).
P.2. Conventions Used in This Book

The following typographical conventions are used in this book:
Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This icon signifies a tip, suggestion, or general note.

This icon indicates a warning or caution.
P.3. Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you're reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O'Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product's documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: "Book Title by Some Author. Copyright 2008 O'Reilly Media, Inc., 978-0-596-xxxx-x." If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at
[email protected].
P.4. Safari® Enabled
NOTE
When you see a Safari® Enabled icon on the cover of your favorite technology book, that means the book is available online through the O'Reilly Network Safari Bookshelf. Safari offers a solution that's better than e-books. It's a virtual library that lets you easily search thousands of top tech books, cut and paste code samples, download chapters, and find quick answers when you need the most accurate, current information. Try it for free at http://safari.oreilly.com.
P.5. How to Contact Us

Please address comments and questions concerning this book to the publisher:

O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at:

http://www.oreilly.com/catalog/9780596521974

To comment or ask technical questions about this book, send email to:
[email protected]
For more information about our books, conferences, Resource Centers, and the O'Reilly Network, see our web site at: http://www.oreilly.com
Chapter 1. Meet Hadoop

In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers.

—Grace Hopper

1.1. A Brief History of Hadoop
Hadoop was created by Doug Cutting, the creator of Lucene, the widely used text search library. Hadoop has its origins in Nutch, an open source web search engine, itself a part of the Lucene project. Building a web search engine from scratch was an ambitious goal, for not only is the software required to crawl and index websites complex to write, it is also a challenge to run without a dedicated operations team — there are lots of moving parts. It's expensive too — Cutting and Cafarella estimated a system supporting a 1-billion-page index would cost around half a million dollars in hardware, with a $30,000 monthly running cost. Nevertheless, they believed it was a worthy goal, as it would open up and ultimately democratise search engine algorithms. Nutch was started in 2002, and a working crawler and search system quickly emerged. However, they realized that their architecture wouldn't scale to the billions of pages on the web. Help was at hand with the publication of "The Google File System" (GFS) [ref] in 2003. This paper described the architecture of Google's distributed filesystem that was being used in production at Google. GFS, or something like it, would solve their storage needs for the very large files generated as a part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. In 2004 they set about writing an open source implementation, the Nutch Distributed File System (NDFS) as it came to be known. NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. Yahoo! hired Doug Cutting and, with a dedicated team, provided
the resources to turn Hadoop into a system that ran at web scale. This was demonstrated in February 2008 when Yahoo! announced that their production search index was being generated by a 10,000 core Hadoop cluster. Earlier, in January 2008, Hadoop was made a top-level project at Apache, confirming its success and dynamic community.
The Origin of the Name "Hadoop"

The name Hadoop is not an acronym; it's a made-up name. The project's creator, Doug Cutting, explains how the name came about:

    The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term.

Components of Hadoop are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name.
1.1.1. Who Uses Hadoop?
1.2. Distributed Computing

When people first hear about Hadoop, a common question is "How is it different from SETI?"
1.3. Installing Hadoop

It's easy to install Hadoop on a single machine to try it out. The quickest way is to download and run a binary release from an Apache Software Foundation mirror. Hadoop is written in Java, so you will need to have Java installed on your machine, version 6 or later.
1.3.1. Installing Hadoop on Unix

Start by considering the user that you'd like to run Hadoop as. For trying out Hadoop, it is simplest to run Hadoop on a single machine using your own account. Download a stable release, which is packaged as a gzipped tar file, from http://hadoop.apache.org/core/releases.html and unpack it somewhere on your filesystem. For example, if you downloaded hadoop-0.17.0.tar.gz, you would type the following.
% tar xzf hadoop-0.17.0.tar.gz
Before you can run Hadoop, you need to tell it where Java is located on your system. If you have the JAVA_HOME environment variable set to point to a suitable Java installation, then that will be used, and you don't have to configure anything further. Otherwise, you can set the Java installation that Hadoop uses by editing conf/hadoop-env.sh, and specifying the JAVA_HOME variable. For example, on my Mac I changed the line to read:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
to point to version 1.6.0 of Java. Finally, it's very convenient to put the Hadoop binary directory on your command-line path. For example,
export PATH=$PATH:/Users/tom/hadoop-book/hadoop-0.17.0/bin
Now we are ready to run Hadoop. Get a list of options by typing:

% hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  job                  manipulate MapReduce jobs
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl>   copy file or directories recursively
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
Chapter 2. The Hadoop Distributed Filesystem

When a dataset outgrows the storage capacity of a single physical machine, it is necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems. Since they are network-based, all the complications of network programming kick in, thus making distributed
filesystems more complex than regular disk filesystems. For example, one of the biggest challenges is making the filesystem tolerate node failure without suffering data loss. Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem. (You may sometimes see references to 'DFS' — informally or in older documentation or configuration — which is the same thing.) In this chapter we'll look at how to use HDFS by examining its interfaces, but first let's understand what HDFS is.
2.1. The Design of HDFS

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Examining this statement in more detail:

Very large files
    "Very large" in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data[1].

[1]
"Scaling Hadoop to 4000 nodes at Yahoo!", http://developer.yahoo.net/blogs/hadoop/2008/09/scaling_hadoop_to_4000_nodes_a.html.
Streaming data access
    HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, then various analyses are performed on that dataset over time. Each analysis will involve most, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

Commodity hardware
    Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed to run on clusters of commodity hardware where the chance of node failure across the cluster is high (at least for large clusters). HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.

It is also worth examining the applications where using HDFS does not work so well. While this may change in the future, these are areas where HDFS is not a good fit today:
Low-latency data access
    Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember that HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency.

Lots of small files
    Since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300MB of memory (see the sketch following this list).

Multiple writers, arbitrary file modifications
    Files in HDFS may be written to by a single writer. Writes are always made at the end of the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the file. (These might be supported in the future, but they are likely to be relatively inefficient.)
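To make the small-files rule of thumb concrete, here is a quick back-of-the-envelope sketch; the 150-byte figure is just the approximation quoted above, not a measured value.

public class NamenodeMemoryEstimate {
  public static void main(String[] args) {
    long files = 1000000L;       // one million files...
    long blocksPerFile = 1L;     // ...each small enough to occupy a single block
    long bytesPerObject = 150L;  // rough in-memory cost of one file or block object
    // One namespace object per file plus one per block
    long bytes = (files + files * blocksPerFile) * bytesPerObject;
    System.out.println(bytes / (1000 * 1000) + " MB of namenode memory, roughly");
  }
}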
2.2. HDFS Concepts

2.2.1. Blocks

A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes. Mostly this is transparent to the filesystem user who is simply reading or writing a file — of whatever length. However, there are tools for filesystem maintenance, such as df and fsck, which operate on the filesystem block level. HDFS too has the concept of a block, but it is a much larger unit, 64 MB by default. Like in a filesystem for a single disk, files are broken into block-sized chunks, which are stored as independent units. Having a block abstraction brings several benefits. The first benefit is the most obvious: a file can be larger than any single disk in the network. There's nothing that requires the blocks from a file to be stored on the same disk, so they
can take advantage of any of the disks in the cluster. In fact, it would be possible, if unusual, to store a single file on an HDFS cluster whose blocks filled all the disks in the cluster. Secondly, making the unit of abstraction a block rather than a file simplifies the storage subsystem. Simplicity is something to strive for in all systems, but it is important for a distributed system where the failure modes are so varied. The storage subsystem deals with blocks, simplifying storage management (since blocks are a fixed size, it is easy to calculate how many can be stored on a given disk), and eliminating metadata concerns (blocks are just a chunk of data to be stored — file metadata such as permissions information does not need to be stored with the blocks, so another system can handle metadata orthogonally). Furthermore, blocks fit well with replication for providing fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines, typically 3. If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client. A block that is no longer available due to corruption or machine failure can be re-replicated from its alternative locations to other live machines to bring the replication factor back to the normal level. (See Section 5.1 for more on guarding against corrupt data.) Similarly, some applications may choose to set a high replication factor for the blocks in a popular file to spread the read load on the cluster. Like its disk filesystem cousin, HDFS's fsck command understands blocks. For example, running:

% hadoop fsck / -files -blocks
will list the blocks that make up each file in the filesystem.
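The same block-level information is available programmatically through the FileSystem API (covered in the Java interface section of this chapter). The following is a minimal sketch, not one of the book's examples: it assumes a running HDFS installation and a Hadoop version where FileSystem.getFileBlockLocations() accepts a FileStatus, and it takes the file to inspect as a command-line argument. FileSystem.setReplication() can similarly be used to raise the replication factor of a popular file, as mentioned above.

// Prints the block size, replication factor, and replica locations of a file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);   // the filesystem named by fs.default.name

    FileStatus status = fs.getFileStatus(new Path(args[0]));
    System.out.println("Block size:  " + status.getBlockSize());
    System.out.println("Replication: " + status.getReplication());

    // One BlockLocation per block, listing the datanodes holding its replicas
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println(java.util.Arrays.toString(block.getHosts()));
    }
  }
}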
2.2.2. Namenodes and Datanodes

An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers). The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edits log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts. A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes. The client presents a POSIX-like filesystem interface, so the user does not need to know about the namenode and datanodes to function.
Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing. Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were obliterated, all the files on the filesystem would be lost since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this. The first way is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as a remote NFS mount. It is also possible to run a secondary namenode, which, despite its name, does not act as a namenode. Its main role is to periodically merge the namespace image with the edits log to prevent the edits log from becoming too large. The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing. However, the state of the secondary namenode lags that of the primary, so in the event of failure, data loss is almost guaranteed. The usual course of action in this case is to copy the namenode's metadata files that are on NFS to the secondary, and run it as the new primary. Work is going on to create a standby namenode, which is a true namenode that can take over if the primary fails.
2.3. The Command Line Interface to HDFS

We're going to have a look at HDFS by interacting with it from the command line. There are many other interfaces to HDFS, but the command line is one of the simplest, and to many developers the most familiar. I'm assuming that you've already installed Hadoop locally, as described in Chapter 1. If that's not the case, then please do that first. While we're getting to know Hadoop, we're going to run HDFS in standalone mode for convenience. Standalone mode means that the HDFS daemons and the HDFS client run on a single machine. Later we'll see how to run on a cluster of machines to give us scalability and fault tolerance.
Out of the box Hadoop is configured to use the local filesystem, not HDFS, so the first thing we need to do is configure it to run HDFS. We can do this by editing the file conf/hadoop-site.xml, which controls the local site-specific Hadoop configuration. Descriptions of all the properties that may be set in the configuration file may be found in hadoop-default.xml (see http://hadoop.apache.org/core/docs/current/hadoop-default.html). Here is a suitable configuration:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode/</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/disk1/hdfs/name,/remote/hdfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker:8012</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/disk1/mapred/local,/disk2/mapred/local</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/tmp/hadoop/mapred/system</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>7</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
  </property>
</configuration>
7.4.3.1. HDFS

To run HDFS you need to designate one machine as a namenode. In this case, the property fs.default.name is an HDFS filesystem URI, whose host is the namenode's hostname or IP address, and whose port is the port that the namenode will listen on for RPCs. If no port is specified, the default of 8020 is used.
NOTE
The masters file that is used by the control scripts is not used by the HDFS (or MapReduce) daemons to determine hostnames. In fact, since the masters file is only used by the scripts, you can ignore it if you don't use them.

The fs.default.name property also doubles as specifying the default filesystem. The default filesystem is used to resolve relative paths, which are handy to use since they save typing (and avoid hardcoding knowledge of a particular namenode's address). For example, with the default filesystem defined in the example configuration earlier, the relative URI /a/b is resolved to hdfs://namenode/a/b. The fact that fs.default.name is used to specify both the HDFS namenode and the default filesystem means that if you run HDFS then it has to be the default filesystem. If you use another filesystem as well as HDFS, such as an S3 filesystem, then since HDFS has to be the default, all S3 paths must always be specified as absolute URIs.

There are two other configuration properties you should set for HDFS: those that set the storage directories for the namenode and for datanodes. The property dfs.name.dir specifies a list of directories where the namenode stores persistent filesystem metadata (the edit log and the filesystem image). A copy of each of the metadata files is stored in each directory for redundancy. It's common to configure dfs.name.dir so the namenode metadata is written to one or two local disks and a remote disk, such as an NFS-mounted directory. Such a setup guards against failure of a local disk, and failure of the entire namenode, since in both cases the files can be recovered and used to start a new namenode. (The secondary namenode only takes periodic snapshots of the namenode, so it does not provide an up-to-date backup of the namenode.) Bear in mind that since the namenode writes to its edit log synchronously, increasing the number of directories it has to write copies to slows down its throughput.

You should also set the dfs.data.dir property, which specifies a list of directories for a datanode to store its blocks. Unlike the namenode, a datanode does not use its multiple directories for redundancy; block replicas are stored on separate datanodes. Rather, a datanode round-robins writes between its storage directories, so for performance you should specify a storage directory for each local disk. Read performance also benefits from having multiple disks for storage, because blocks will be spread across them, and concurrent reads for distinct blocks will be correspondingly spread across disks.

Finally, you should configure where the secondary namenode stores its checkpoints of the filesystem. The fs.checkpoint.dir property specifies a list of directories where the
checkpoints are kept. Like the storage directories for the namenode, which keep redundant copies of the namenode metadata, the checkpointed filesystem image is stored in each checkpoint directory for redundancy. Table 7-2 summarizes the important configuration properties for HDFS.

Table 7-2. Important HDFS daemon properties

Property name       Type                               Default value                         Description
fs.default.name     URI                                file:///                              The default filesystem.
dfs.name.dir        comma-separated directory names    ${hadoop.tmp.dir}/dfs/name            ...
dfs.data.dir        comma-separated directory names    ${hadoop.tmp.dir}/dfs/data            ...
fs.checkpoint.dir   comma-separated directory names    ${hadoop.tmp.dir}/dfs/namesecondary   ...
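To make the relative path resolution described above concrete, here is a small sketch that is not from the book: the hostname namenode is simply the one used in the example configuration, and a namenode is assumed to be reachable there.

// Shows how a path with no scheme is qualified against the default filesystem.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class QualifyPath {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode/");  // no port given, so RPCs go to 8020

    FileSystem fs = FileSystem.get(conf);              // the default (HDFS) filesystem
    System.out.println(fs.makeQualified(new Path("/a/b")));
    // prints: hdfs://namenode/a/b
  }
}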
7.4.3.2. MapReduce

To run MapReduce you need to designate one machine as a jobtracker, which on small clusters may be the same machine as the namenode. To do this, set the mapred.job.tracker property to the hostname or IP address and port that the jobtracker will listen on. Note that this property is not a URI, but a host-port pair, separated by a colon. During a MapReduce job, intermediate data and working files are written to temporary local files. Since this data includes the potentially very large output of map tasks, you need to ensure that the mapred.local.dir property, which controls the location of local temporary storage, is configured to use disk partitions that are large enough. The mapred.local.dir property takes a comma-separated list of directory names, and you should use all available local disks to spread disk I/O. Typically you will use the same disks for MapReduce temporary data as you use for datanode block storage (as governed by the dfs.data.dir property, discussed above). MapReduce uses a distributed filesystem to share files (such as the job JAR file) with the tasktrackers that run the MapReduce tasks. The mapred.system.dir property is used to specify a directory where these files can be stored. This directory is resolved relative to the default filesystem (configured in fs.default.name), which is usually HDFS. Finally, you should set the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties to reflect the number of available cores on the tasktracker machines, and mapred.child.java.opts to reflect the amount of memory available for the tasktracker child JVMs. See the discussion in Section 7.4.2.1, earlier in the chapter.
Table 7-3 summarizes the important configuration properties for MapReduce.

Table 7-3. Important MapReduce daemon properties

Property name                              Type                               Default value                      Description
mapred.job.tracker                         hostname and port                  local                              The hostname and port that the jobtracker's RPC server runs on. If set to the default value of local, then the jobtracker is run in-process.
mapred.local.dir                           comma-separated directory names    ${hadoop.tmp.dir}/mapred/local     ...
mapred.system.dir                          URI                                ${hadoop.tmp.dir}/mapred/system    ...
mapred.tasktracker.map.tasks.maximum       integer                            2                                  ...
mapred.tasktracker.reduce.tasks.maximum    integer                            2                                  ...
mapred.child.java.opts                     string                             -Xmx200m                           This property can be set on a per-job basis, which can be useful for setting JVM properties for debugging, for example.
7.4.4. Variations

7.4.4.1. Running independent HDFS and MapReduce installations

HDFS and MapReduce, although symbiotic, are actually independent. One can run without the other, and for maintenance reasons it is sometimes desirable to restart one system while leaving the other running, which is easily achieved with the control scripts. The usual case for this is avoiding a restart of the namenode, which on clusters with a large number of files can be quite time-consuming (an hour or more on the largest clusters). In some cases, it would be nice to be able to upgrade one system (typically MapReduce) — perhaps to patch a bug — while leaving the other running. This is not possible with the configurations we have covered so far in this chapter. But by having two Hadoop installations, one configured for HDFS, the other for MapReduce, it is possible to operate them independently. One has to exercise care to ensure that the two versions are compatible, of course, but this is a small price for better uptime. Almost all of the advice in this chapter is still relevant. The pertinent changes are as follows:

• Create two installation directories, naming one for HDFS and one for MapReduce, and unpack the Hadoop distribution in both.
• Create two sets of configuration files. Each set of configuration files only contains the configuration relevant to the installation it is for. For example, the MapReduce hadoop-site.xml shouldn't have HDFS properties in it.
• The log directory can be shared, since the log file names include the daemon name, thus preventing name conflicts.
• On the master nodes, use the start-dfs.sh or start-mapred.sh script to start the appropriate daemons for that installation.
7.4.4.2. Class-based configuration

For some clusters the one-size-fits-all configuration model breaks down. For example, if you expand the cluster with new machines which have a different hardware specification to the existing ones, then you need a different configuration for the new machines to take advantage of their extra resources. In these cases, you need to have the concept of a class of machine, and maintain a separate configuration for each class. Hadoop doesn't provide tools to do this, but there are several excellent tools for doing precisely this type of configuration management, including Puppet, cfengine, and bcfg2. Even for small clusters it can be a challenge to keep all of the machines in sync: consider what happens if the machine is unavailable when you push out an update — who ensures it gets the update when it becomes available? This is a big problem and can lead to divergent installations, so even if you use the Hadoop control scripts for managing Hadoop, it may well be a good idea to use configuration management tools for maintaining the cluster. These tools are excellent for doing regular maintenance, such as patching security holes and updating system packages.
7.4.5. The Configuration API

Components in Hadoop are configured using Hadoop's own configuration API. An instance of the Configuration class (found in the org.apache.hadoop.conf package) represents a collection of configuration properties and their values. Each property is named by a String, and the type of a value may be one of several types, including Java primitives such as boolean, int, long, float, and other useful types such as String, Class, java.io.File, and collections of Strings. Configurations read their properties from resources — XML files with a simple structure for defining name-value pairs. An example is shown in Example 7-2.
Example 7-2. A simple configuration file, configuration-1.xml

<configuration>
  <property>
    <name>color</name>
    <value>yellow</value>
    <description>Color</description>
  </property>
  <property>
    <name>size</name>
    <value>10</value>
    <description>Size</description>
  </property>
  <property>
    <name>weight</name>
    <value>heavy</value>
    <final>true</final>
    <description>Weight</description>
  </property>
  <property>
    <name>size-weight</name>
    <value>${size},${weight}</value>
    <description>Size and weight</description>
  </property>
</configuration>
Assuming this configuration file is in a file called configuration-1.xml, we can access its properties using a piece of code like this:

Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");

assertThat(conf.get("color"), is("yellow"));
assertThat(conf.getInt("size", 0), is(10));
assertThat(conf.get("breadth", "wide"), is("wide"));
There are a couple of things to note: type information is not stored in the XML file; instead, properties can be interpreted as a given type when they are read. Also, the get() methods allow you to specify a default value, which is used if the property is not defined in the XML file, as in the case of breadth here.
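Configuration has further typed getters along the same lines. The following short sketch continues the snippet above; the debug and max-weight properties are hypothetical and are not defined in configuration-1.xml, so the supplied defaults are returned.

String[] parts = conf.getStrings("size-weight");    // {"10", "heavy"} after variable expansion
float size = conf.getFloat("size", 0.0f);           // "10" read as a float
boolean debug = conf.getBoolean("debug", false);    // not set, so the default false is returned
long maxWeight = conf.getLong("max-weight", 100L);  // not set, so the default 100 is returned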
7.4.5.1. Combining resources

Things get interesting when more than one resource is used to define a configuration. This is used in Hadoop to separate out the default properties for the system, defined in hadoop-default.xml, from the site-specific overrides, in hadoop-site.xml. The file in Example 7-3 defines the size and weight properties.

Example 7-3. A second configuration file, configuration-2.xml

<configuration>
  <property>
    <name>size</name>
    <value>12</value>
  </property>
  <property>
    <name>weight</name>
    <value>light</value>
  </property>
</configuration>
Resources are added to a Configuration in order:

Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
conf.addResource("configuration-2.xml");
Properties defined in resources that are added later override the earlier definitions. So the size property takes its value from the second configuration file, configuration-2.xml:

assertThat(conf.getInt("size", 0), is(12));
However, properties that are marked as final cannot be overridden in later definitions. The weight property is final in the first configuration file, so the attempt to override it in the second fails and it takes the value from the first:

assertThat(conf.get("weight"), is("heavy"));
Attempting to override final properties usually indicates a configuration error, so this results in a warning message being logged to aid diagnosis.
7.4.5.2. Variable expansion

Configuration properties can be defined in terms of other properties, or system properties. For example, the property size-weight in the first configuration file is defined as ${size},${weight}, and these properties are expanded using the values found in the configuration:

assertThat(conf.get("size-weight"), is("12,heavy"));
System properties take priority over properties defined in resource files:

System.setProperty("size", "14");
assertThat(conf.get("size-weight"), is("14,heavy"));
This feature is useful for overriding properties on the command line by using -Dproperty=value JVM arguments.
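As a hedged illustration (the class name is made up, and the configuration files are the ones from the examples above, assumed to be on the classpath), a small driver run with a -D JVM argument picks up the override:

// Run as:  java -Dsize=14 PrintSizeWeight
// The system property "size" then takes priority during variable expansion.
import org.apache.hadoop.conf.Configuration;

public class PrintSizeWeight {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource("configuration-1.xml");
    conf.addResource("configuration-2.xml");
    System.out.println(conf.get("size-weight"));  // "14,heavy" with -Dsize=14, "12,heavy" without
  }
}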
7.5. Tuning a Hadoop Cluster

Is the cluster set up correctly? The best way to answer this question is empirically: run some tests and see that you get the expected results. Benchmarks make good tests as you
also get numbers that you can compare with other clusters as a sanity check that your new cluster is performing roughly as expected. And you can tune a cluster using benchmark results to squeeze the best performance out of it. To get the best results, you should run benchmarks on a cluster that is not being used by others. In practice, this is just before it is put into service, and users start relying on it. Once users have periodically scheduled jobs on a cluster it is generally impossible to find a time when the cluster is not being used (unless you arrange downtime with users), so you should run benchmarks to your satisfaction before this happens.
7.5.1. User Jobs

Hadoop comes with several benchmarks that you can run very easily with minimal setup cost. These are useful as sanity checks, and to record performance baselines. For tuning, however, it is best to include a few jobs that are representative of the jobs that your users run, so your cluster is tuned for the kind of jobs you run. If this is your first cluster and you don't have any user jobs yet, then don't worry: you will soon have some that you can use to benchmark at a later date. Just stick to the Hadoop benchmarks to start with in this case. To allow comparisons between runs, you should select a dataset for your user jobs that you use each time you run the benchmarks. When you set up a new cluster, or upgrade a cluster, you will be able to use the same dataset to compare the performance with previous runs.
7.5.2. Hadoop Benchmarks

While there are lots of benchmarks to choose from, I find that you only need to run one or two of them to get a good general picture of how the cluster is performing. Benchmarks can take a long time to run, so it is a good idea to pick a couple and stick to running and tuning these. Benchmarks are packaged in the test JAR file, and you can get a list of them, with descriptions, by invoking the JAR file with no arguments:

% hadoop jar $HADOOP_HOME/hadoop-*-test.jar
Most (but not all) of the benchmarks show usage instructions when invoked with no arguments. For example:

% hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO
TestFDSIO.0.0.4
Usage: TestFDSIO -read | -write | -clean [-nrFiles N] [-fileSize MB] [-resFile resultFileName] [-bufferSize Bytes]
7.5.2.1. Benchmarking HDFS with TestDFSIO

TestDFSIO tests the I/O performance of HDFS. It does this by using a MapReduce job as a convenient way to read or write files in parallel. Each file is read or written in a separate map task, and the output of the map is used for collecting statistics relating to the file just processed. The statistics are accumulated in the reduce, to produce a summary.
The following command writes 10 files of 1000MB each:

% hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
At the end of the run, the results are written to the console, and also recorded in a file:

% cat TestDFSIO_results.log
----- TestDFSIO ----- : write
Date & time: Thu Aug 28 07:33:54 BST 2008
Number of files: 2
Total MBytes processed: 20
Throughput mb/sec: 2.3998080153587713
Average IO rate mb/sec: 2.6585614681243896
IO rate std deviation: 0.8294055902056588
Test exec time sec: 46.509
The files are written under the /benchmarks/TestDFSIO directory by default (this can be changed by setting the test.build.data system property), in a directory called io_data:

% hadoop fs -ls /benchmarks/TestDFSIO/io_data
Found 2 items
-rw-r--r--   3 tom supergroup   10485760 2008-08-28 07:33 /benchmarks/TestDFSIO/io_data/test_io_0
-rw-r--r--   3 tom supergroup   10485760 2008-08-28 07:33 /benchmarks/TestDFSIO/io_data/test_io_1
To run a read benchmark, you use the -read argument, and specify the number of files to read using -nrFiles. Note that these files must already exist (having been written by TestDFSIO -write).

% hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -read -nrFiles 10
Here are the results for a real run:

----- TestDFSIO ----- : read
Date & time: Thu Aug 28 07:38:39 BST 2008
Number of files: 2
Total MBytes processed: 20
Throughput mb/sec: 39.29273084479372
Average IO rate mb/sec: 40.649757385253906
IO rate std deviation: 7.427164753646442
Test exec time sec: 35.189
When you've finished benchmarking, you can delete all the TestDFSIO files from HDFS using the -clean argument:

% hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -clean
7.5.2.2. Benchmarking MapReduce with Sort

7.5.2.3. Benchmarking MapReduce with MRBench
Chapter 8. Running Hadoop in the Cloud

Chapter 9. Administering Hadoop

Chapter 10. ZooKeeper

Chapter 11. Pig

Chapter 12. HBase