Ch12_google pagerank


Chapter 12 Google PageRank

The world's largest matrix computation.

One of the reasons why Google™ is such an effective search engine is the PageRank™ algorithm developed by Google's founders, Larry Page and Sergey Brin, when they were graduate students at Stanford University. PageRank is determined entirely by the link structure of the World Wide Web. It is recomputed about once a month and does not involve the actual content of any Web pages or individual queries. Then, for any particular query, Google finds the pages on the Web that match that query and lists those pages in the order of their PageRank.

Imagine surfing the Web, going from page to page by randomly choosing an outgoing link from one page to get to the next. This can lead to dead ends at pages with no outgoing links, or cycles around cliques of interconnected pages. So, a certain fraction of the time, simply choose a random page from the Web. This theoretical random walk is known as a Markov chain or Markov process. The limiting probability that an infinitely dedicated random surfer visits any particular page is its PageRank. A page has high rank if other pages with high rank link to it.

Let $W$ be the set of Web pages that can be reached by following a chain of hyperlinks starting at some root page, and let $n$ be the number of pages in $W$. For Google, the set $W$ actually varies with time, but by June 2004, $n$ was over 4 billion. Let $G$ be the $n$-by-$n$ connectivity matrix of a portion of the Web, that is, $g_{ij} = 1$ if there is a hyperlink to page $i$ from page $j$ and $g_{ij} = 0$ otherwise. The matrix $G$ can be huge, but it is very sparse. Its $j$th column shows the links on the $j$th page. The number of nonzeros in $G$ is the total number of hyperlinks in $W$.

Copyright © 2009 Cleve Moler. Matlab® is a registered trademark of The MathWorks, Inc.™ August 8, 2009.

Let $r_i$ and $c_j$ be the row and column sums of $G$:

$$r_i = \sum_j g_{ij}, \qquad c_j = \sum_i g_{ij}.$$
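To make the definitions of the connectivity matrix and its row and column sums concrete, here is a minimal Python sketch on a made-up four-page web. The link list and the names `links`, `G`, `r`, `c`, and `nnz` are illustrative assumptions, not data or code from the chapter, and the dense list-of-lists storage is only sensible for a toy example; a real $G$ would need a sparse format.

```python
# Hypothetical 4-page web (an assumption for illustration).
# Each pair (j, i) means "page j contains a hyperlink to page i".
links = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
n = 4

# Connectivity matrix: G[i][j] = 1 if there is a hyperlink to page i from page j.
G = [[0] * n for _ in range(n)]
for j, i in links:
    G[i][j] = 1

# Row sums r_i and column sums c_j of G.
r = [sum(G[i][j] for j in range(n)) for i in range(n)]
c = [sum(G[i][j] for i in range(n)) for j in range(n)]

# The number of nonzeros in G equals the total number of hyperlinks.
nnz = sum(sum(row) for row in G)

print("row sums   :", r)
print("column sums:", c)
print("nonzeros   :", nnz)
```

In this toy web, page 2 is linked to by three other pages, so its row sum is the largest, while page 3 is linked to by no page and its row of $G$ is entirely zero.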
The quantities $r_j$ and $c_j$ are the in-degree and out-degree of the $j$th page. Let $p$ be the probability that the random walk follows a link. A typical value is $p = 0.85$. Then $1 - p$ is the probability that some arbitrary page is chosen and $\delta = (1 - p)/n$ is the probability that a particular random page is chosen. Let $A$ be the $n$-by-$n$ matrix whose elements are

$$a_{ij} = \begin{cases} p\,g_{ij}/c_j + \delta & : c_j \neq 0 \\ 1/n & : c_j = 0. \end{cases}$$

Notice that $A$ comes from scaling the connectivity matrix by its column sums. The $j$th column is the probability of jumping from the $j$th page to the other pages on the Web. If the $j$th page is a dead end, that is, has no out-links, then we assign a uniform probability of $1/n$ to all the elements in its column. Most of the elements of $A$ are equal to $\delta$, the probability of jumping from one page to another without following a link. If $n = 4 \cdot 10^9$ and $p = 0.85$, then $\delta = 3.75 \cdot 10^{-11}$.

...
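The construction of $A$ can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the chapter's own code: the function name `transition_matrix`, the four-page example with a dead end at page 3, and the dense storage are all hypothetical choices. For a web-sized $n$ the matrix $A$ could not be formed explicitly like this, since almost every entry is the nonzero value $\delta$.

```python
def transition_matrix(G, p=0.85):
    """Build A from a dense 0/1 connectivity matrix G (toy sizes only).

    a_ij = p * g_ij / c_j + delta  if c_j != 0
    a_ij = 1 / n                   if c_j == 0  (dead-end page)
    """
    n = len(G)
    delta = (1 - p) / n
    # Column sums c_j of G.
    c = [sum(G[i][j] for i in range(n)) for j in range(n)]
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if c[j] != 0:
                A[i][j] = p * G[i][j] / c[j] + delta
            else:
                A[i][j] = 1.0 / n
    return A


# Hypothetical 4-page web; page 3 has no out-links (a dead end).
n = 4
links = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3)]  # (j, i): page j links to page i
G = [[0] * n for _ in range(n)]
for j, i in links:
    G[i][j] = 1

A = transition_matrix(G, p=0.85)

# Every column of A is a probability distribution, so it should sum to 1.
col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
print(col_sums)

# The arithmetic from the text: delta for n = 4e9 pages and p = 0.85.
delta_web = (1 - 0.85) / 4e9
print(delta_web)
```

Each column sums to 1 because a column with $c_j \neq 0$ contributes $p \cdot c_j / c_j = p$ from its links plus $n\delta = 1 - p$ from the random jumps, and a dead-end column contributes $n \cdot (1/n) = 1$.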

This note was uploaded on 10/11/2011 for the course MTHSC 365 taught by Professor Adams during the Spring '11 term at Clemson.
