Large-scale Incremental Processing Using Distributed Transactions and Notifications

Daniel Peng and Frank Dabek
Google, Inc.

Abstract

Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google's indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.

We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%.

1 Introduction

Consider the task of building an index of the web that can be used to answer search queries. The indexing system starts by crawling the pages of the web and processing them while maintaining a set of invariants on the index. For example, if the same content is crawled under multiple URLs, only the URL with the highest PageRank [28] appears in the index. Each link is also inverted so that the anchor text from each outgoing link is attached to the page the link points to. Link inversion must work across duplicates: links to a duplicate of a page should be forwarded to the highest-PageRank duplicate if necessary.

This is a bulk-processing task that can be expressed as a series of MapReduce [13] operations: one for clustering duplicates, one for link inversion, etc. It's easy to maintain invariants since MapReduce limits the parallelism of the computation; all documents finish one processing step before starting the next. For example, when the indexing system is writing inverted links to the current highest-PageRank URL, we need not worry about its PageRank concurrently changing; a previous MapReduce step has already determined its PageRank.

Now, consider how to update that index after recrawling some small portion of the web. It's not sufficient to run the MapReduces over just the new pages since, for example, there are links between the new pages and the rest of the web. The MapReduces must be run again over the entire repository, that is, over both the new pages and the old pages. Given enough computing resources, MapReduce's scalability makes this approach feasible, and, in fact, Google's web search index was produced in this way prior to the work described here. However, ...
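The paper gives no code at this point, but the pipeline the introduction describes is concrete enough to sketch. Below is a minimal, single-machine illustration of the two MapReduce steps named above; the Page record, the run_mapreduce helper, and the cluster_*/invert_* functions are all hypothetical names invented for illustration, and a real deployment would of course shard the map and reduce phases across thousands of machines.

    # Toy sketch of the indexing pipeline from the introduction; all names
    # here (Page, run_mapreduce, cluster_*, invert_*) are hypothetical.
    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class Page:
        url: str
        checksum: str     # content fingerprint; duplicates share a checksum
        pagerank: float
        links: list       # outgoing (target_url, anchor_text) pairs

    def run_mapreduce(map_fn, reduce_fn, inputs):
        # Sequential stand-in for MapReduce: group map output by key, then
        # reduce each group. Every mapper "finishes" before any reducer
        # runs, which is the invariant-preserving property the text relies on.
        groups = defaultdict(list)
        for item in inputs:
            for key, value in map_fn(item):
                groups[key].append(value)
        return [reduce_fn(key, values) for key, values in groups.items()]

    # MapReduce 1: cluster duplicates by content; only the highest-PageRank
    # URL in each cluster represents it in the index.
    def cluster_map(page):
        yield page.checksum, page

    def cluster_reduce(checksum, pages):
        return checksum, max(pages, key=lambda p: p.pagerank)

    # MapReduce 2: link inversion. Each outgoing link is forwarded to the
    # canonical duplicate of its target, and the anchor text is attached
    # to the page the link points to.
    def make_invert_map(canonical_for):
        def invert_map(page):
            for target, anchor in page.links:
                yield canonical_for.get(target, target), anchor
        return invert_map

    def invert_reduce(target_url, anchors):
        return target_url, anchors

    pages = [
        Page("http://a.com", "c1", 0.9, [("http://b-mirror.com", "see B")]),
        Page("http://b.com", "c2", 0.8, []),
        Page("http://b-mirror.com", "c2", 0.1, []),  # duplicate of b.com
    ]
    clusters = run_mapreduce(cluster_map, cluster_reduce, pages)
    canonical_for = {p.url: canon.url for _, canon in clusters
                     for p in pages if p.checksum == canon.checksum}
    inverted = run_mapreduce(make_invert_map(canonical_for), invert_reduce, pages)
    # inverted == [("http://b.com", ["see B"])]: the link to the low-PageRank
    # mirror was forwarded to the highest-PageRank duplicate.

Note that rerunning even this toy pipeline after recrawling a single page means feeding every page back through both steps, since the canonical-URL mapping and the inverted links can change anywhere in the repository; that repository-wide rerun is exactly the cost that motivates Percolator's incremental model.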