Large-scale Incremental Processing Using Distributed Transactions and Notifications

Daniel Peng and Frank Dabek
[email protected], [email protected]
Google, Inc.

Abstract

Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google's indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually because they rely on creating large batches for efficiency.

We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day while reducing the average age of documents in Google search results by 50%.

1 Introduction

Consider the task of building an index of the web that can be used to answer search queries. The indexing system starts by crawling every page on the web and processing each page while maintaining a set of invariants on the index. For example, if the same content is crawled under multiple URLs, only the URL with the highest PageRank appears in the index. Each link is also inverted so that the anchor text from each outgoing link is attached to the page the link points to. Link inversion must work across duplicates: links to a duplicate of a page should be forwarded to the highest-PageRank duplicate if necessary.
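The two invariants above can be made concrete with a short sketch. This is not Percolator code; the page records, URLs, content hashes, and PageRank values below are hypothetical, chosen only to illustrate canonical-duplicate selection and duplicate-aware link inversion.

```python
from collections import defaultdict

# Hypothetical page records: (url, content_hash, pagerank, outgoing_links).
pages = [
    ("http://a.com/x",      "h1", 0.9, ["http://b.com/y"]),
    ("http://mirror.com/x", "h1", 0.3, ["http://b.com/y"]),
    ("http://b.com/y",      "h2", 0.5, []),
]

# Invariant 1: among pages with identical content, only the
# highest-PageRank URL (the "canonical" duplicate) appears in the index.
by_hash = defaultdict(list)
for url, content_hash, pagerank, _ in pages:
    by_hash[content_hash].append((pagerank, url))
canonical = {h: max(dups)[1] for h, dups in by_hash.items()}

# Invariant 2: link inversion works across duplicates, so an incoming
# link is attached to the canonical duplicate of its target.
url_to_hash = {url: h for url, h, _, _ in pages}
inlinks = defaultdict(list)
for url, _, _, outlinks in pages:
    for target in outlinks:
        inlinks[canonical[url_to_hash[target]]].append(url)
```

Here both copies of the `h1` content link to `http://b.com/y`, and both inlinks land on that page's canonical URL; if `http://b.com/y` itself had a higher-PageRank duplicate, the links would be forwarded there instead.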
This is a bulk-processing task that can be expressed as a series of MapReduce operations: one for clustering duplicates, one for link inversion, etc. It's easy to maintain the invariants since MapReduce limits the parallelism of the computation; all documents finish one processing step before starting the next. For example, when the indexing system is writing inverted links to the current highest-PageRank URL, we need not worry about its PageRank concurrently changing; a previous MapReduce step has already determined its PageRank.

Now, consider how to update that index after recrawling some small portion of the web. It's not sufficient to run the MapReduces over just the new pages since, for example, there are links between the new pages and the rest of the web. The MapReduces must be run again over the entire repository, that is, over both the new pages and the old pages. Given enough computing resources, MapReduce's scalability makes this approach feasible, and, in fact, Google's web search index was produced in this way prior to the work described here. However, ...
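The link-inversion step mentioned above fits the MapReduce shape directly: the map phase emits one (target, source) pair per outgoing link, the shuffle groups pairs by target, and the reduce phase collects each target's incoming links. A minimal in-process sketch, assuming a toy document set (`docs` and both phase functions are illustrative, not Percolator or MapReduce API code):

```python
from itertools import groupby

def map_phase(url, outlinks):
    # Invert each outgoing link: emit (target, source) pairs.
    for target in outlinks:
        yield (target, url)

def reduce_phase(target, sources):
    # All sources pointing at target form its inverted-link set.
    return (target, sorted(sources))

# Toy repository: url -> outgoing links.
docs = {"a": ["b", "c"], "b": ["c"]}

pairs = [kv for url, links in docs.items() for kv in map_phase(url, links)]
pairs.sort()  # stands in for the shuffle: group pairs by key
inverted = [reduce_phase(k, [src for _, src in grp])
            for k, grp in groupby(pairs, key=lambda kv: kv[0])]
```

Because the reduce phase for a page runs only after every map task has finished, no document's link set is read while another task is still emitting links to it, which is exactly the serialization property the paragraph above relies on.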