Evaluating the Web
PageRank
Hubs and Authorities

PageRank
x Intuition: solve the recursive equation: “a page is important if important pages link to it.”
x In highfalutin’ terms: importance = the principal eigenvector of the stochastic matrix of the Web.
A few fixups needed.

Stochastic Matrix of the Web
x Enumerate pages.
x Page i corresponds to row and column i.
x M[i, j] = 1/n if page j links to n pages, including page i; 0 if j does not link to i.
M[i, j] is the probability we’ll next be at page i if we are now at page j.

Example
Suppose page j links to 3 pages, including i. Then the entry in row i, column j of M is 1/3.

Random Walks on the Web
x Suppose v is a vector whose i-th component is the probability that we are at page i at a certain time.
x If we follow a link from i at random, the probability distribution for the page we are then at is given by the vector M v.

Random Walks (2)
x Starting from any vector v, the limit M (M (…M (M v ) …)) is the distribution of page visits during a random walk.
x Intuition: pages are important in proportion to how often a random walker would visit them.
x The math: limiting distribution = principal eigenvector of M = PageRank.
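The random-walk view can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function names, the page names p, q, r, and their links are made up, and the walk starts from the uniform probability distribution.

```python
def transition_matrix(links, pages):
    """Column-stochastic M: M[i][j] = 1/n if page j links to n pages, including i."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    M = [[0.0] * n for _ in range(n)]
    for j, page in enumerate(pages):
        targets = links[page]
        for t in targets:
            M[index[t]][j] = 1.0 / len(targets)
    return M

def step(M, v):
    """One step of the walk: returns the vector M v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def random_walk_limit(M, iterations=50):
    """Apply M repeatedly, starting from the uniform distribution."""
    n = len(M)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = step(M, v)
    return v

# Hypothetical 3-page graph (names and links are assumptions for illustration).
pages = ["p", "q", "r"]
links = {"p": ["p", "q"], "q": ["p", "r"], "r": ["q"]}
M = transition_matrix(links, pages)
v = random_walk_limit(M)
```

Every column of M sums to 1 here because no page is a dead end, so v remains a probability distribution at every step.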
Example: The Web in 1839
Pages: Yahoo (y), Amazon (a), M’soft (m).
Links: Yahoo → Yahoo, Amazon; Amazon → Yahoo, M’soft; M’soft → Amazon.

         y     a     m
   y    1/2   1/2    0
   a    1/2    0     1
   m     0    1/2    0

Simulating a Random Walk
x Start with the vector v = [1,1,…,1] representing the idea that each Web page is given one unit of importance.
x Repeatedly apply the matrix M to v, allowing the importance to flow like a random walk.
x The limit exists, but about 50 iterations are sufficient to estimate the final distribution.

Example
x Equations v = M v :
y = y /2 + a /2
a = y /2 + m
m = a /2
Successive values of [y, a, m], starting from [1, 1, 1]:

   y =   1     1    5/4    9/8   ...   6/5
   a =   1    3/2    1    11/8   ...   6/5
   m =   1    1/2   3/4    1/2   ...   3/5

Solving The Equations
x Because there are no constant terms, these 3 equations in 3 unknowns do not have a unique solution.
x Add in the fact that y +a +m = 3 to solve.
x In Web-sized examples, we cannot solve by Gaussian elimination; we need to use relaxation (= iterative solution).
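For the three-page example, the direct solution can be checked with exact arithmetic. The sketch below is illustrative only: the helper `solve` is a hypothetical name, and the redundant third equation (m = a/2) is replaced by the normalization y + a + m = 3.

```python
from fractions import Fraction as F

def solve(A, b):
    """Gauss-Jordan elimination over exact fractions; A must be square and nonsingular."""
    n = len(A)
    rows = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [x / rows[col][col] for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * p for x, p in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]

# Unknowns [y, a, m].
# Row 1:  y = y/2 + a/2   rewritten as  -y/2 + a/2     = 0
# Row 2:  a = y/2 + m     rewritten as   y/2 - a  + m  = 0
# Row 3:  normalization                  y  + a  + m  = 3
A = [
    [F(-1, 2), F(1, 2), F(0)],
    [F(1, 2), F(-1), F(1)],
    [F(1), F(1), F(1)],
]
b = [F(0), F(0), F(3)]
y, a, m = solve(A, b)
```

This works at toy scale only; elimination takes time cubic in the number of pages, which is why relaxation is used for Web-sized matrices.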
Real-World Problems
x Some pages are “dead ends” (have no links out). Such a page causes importance to leak out.
x Other (groups of) pages are spider traps (all out-links are within the group). Eventually spider traps absorb all importance.

Microsoft Becomes Dead End
M’soft now has no out-links, so its column of M is all zeros:

         y     a     m
   y    1/2   1/2    0
   a    1/2    0     0
   m     0    1/2    0

Example
x Equations v = M v :
y = y /2 + a /2
a = y /2
m = a /2
Successive values of [y, a, m], starting from [1, 1, 1]:

   y =   1     1    3/4   5/8   ...   0
   a =   1    1/2   1/2   3/8   ...   0
   m =   1    1/2   1/4   1/4   ...   0

M’soft Becomes Spider Trap
M’soft now links only to itself:

         y     a     m
   y    1/2   1/2    0
   a    1/2    0     0
   m     0    1/2    1

Example
x Equations v = M v :
y = y /2 + a /2
a = y /2
m = a /2 + m
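Both failure modes can be reproduced with exact fractions. This is a small sketch rather than anything from the slides; `iterate` is an illustrative helper, and the two matrices differ only in the M’soft column (all zeros for the dead end, a self-link for the spider trap).

```python
from fractions import Fraction as F

def iterate(M, v, steps):
    """Apply v <- M v the given number of times."""
    for _ in range(steps):
        v = [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]
    return v

half = F(1, 2)
# Columns are y, a, m.
dead_end = [[half, half, 0],
            [half, 0,    0],
            [0,    half, 0]]
spider_trap = [[half, half, 0],
               [half, 0,    0],
               [0,    half, 1]]
start = [F(1), F(1), F(1)]

leaked = iterate(dead_end, start, 100)      # total importance drains toward 0
trapped = iterate(spider_trap, start, 100)  # total stays 3, but m absorbs it
```

With the dead end the total importance shrinks at every step; with the spider trap the total stays at 3 while y and a drain into m.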
Successive values of [y, a, m], starting from [1, 1, 1]:

   y =   1     1    3/4   5/8   ...   0
   a =   1    1/2   1/2   3/8   ...   0
   m =   1    3/2   7/4    2    ...   3

Google Solution to Traps, Etc.
x “Tax” each page a fixed percentage at each iteration.
x Add the same constant to all pages.
x Models a random walk with a fixed probability of going to a random place next.

Example: Previous with 20% Tax
x Equations v = 0.8(M v ) + 0.2:
y = 0.8(y /2 + a/2) + 0.2
a = 0.8(y /2) + 0.2
m = 0.8(a /2 + m) + 0.2
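These taxed equations can be iterated directly. The sketch below is an illustration under stated assumptions: `taxed_step` and `beta` are made-up names, the matrix is the spider-trap matrix, and the added constant 1 - beta = 0.2 matches the convention that each page starts with one unit of importance.

```python
def taxed_step(M, v, beta=0.8):
    """v' = beta * (M v) + (1 - beta): tax every page, then add the same constant to all."""
    n = len(v)
    return [beta * sum(M[i][j] * v[j] for j in range(n)) + (1 - beta)
            for i in range(n)]

# Spider-trap matrix; columns are y, a, m.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]

v = [1.0, 1.0, 1.0]
history = [v]
for _ in range(100):
    v = taxed_step(M, v)
    history.append(v)
```

The trap still ends up with the largest share, but y and a keep a positive amount of importance instead of draining to 0.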
Successive values of [y, a, m], starting from [1, 1, 1]:

   y =   1    1.00   0.84   0.776   ...    7/11
   a =   1    0.60   0.60   0.536   ...    5/11
   m =   1    1.40   1.56   1.688   ...   21/11

General Case
x In this example, because there are no dead ends, the total importance remains at 3.
x In examples with dead ends, some importance leaks out, but the total remains finite.

Solving the Equations
x Because there are constant terms, we can expect to solve small examples by Gaussian elimination.
x Web-sized examples still need to be solved by relaxation.

Speeding Convergence
x Newton-like prediction of where components of the principal eigenvector are heading.
x Take advantage of locality in the Web.
x Each technique can reduce the number of iterations by 50%.
Important: PageRank takes time!
Predicting Component Values
x Three consecutive values for the importance of a page suggest where the limit might be.
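One standard way to turn three consecutive values into a guess is Aitken's delta-squared extrapolation. The slides do not name the method, so treating it as the intended predictor is an assumption, and `predict_limit` is a hypothetical name. The formula assumes the error shrinks by a roughly constant factor per round.

```python
def predict_limit(x0, x1, x2):
    """Aitken delta-squared guess for the limit, from three consecutive values."""
    d1 = x1 - x0
    d2 = x2 - x1
    denom = d2 - d1
    if denom == 0:
        # Equal differences: no geometric trend to extrapolate, keep the latest value.
        return x2
    return x2 - d2 * d2 / denom

# Values 1.0, 0.7, 0.6: the steps are -0.3 then -0.1, shrinking by a factor of 3.
guess = predict_limit(1.0, 0.7, 0.6)
```

For the values 1.0, 0.7, 0.6 the guess is 0.55, the point the sequence would reach if each later step kept shrinking by the same factor.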
[Figure: successive values 1.0, 0.7, 0.6; guess for the next round: 0.55]

Exploiting Substructure
x Pages from particular domains, hosts, or paths, like stanford.edu or wwwdb.stanford.edu/~ullman, tend to have a higher density of links among themselves.
x Initialize PageRank by ranking pages within their local cluster, then ranking the clusters themselves.
Strategy
x Compute local PageRanks (in parallel?).
x Use local weights to establish intercluster weights on edges.
x Compute PageRank on graph of clusters.
x Initial rank of a page is the product of its local rank and the rank of its cluster.
x “Clusters” are appropriately sized regions with common domain or lower-level detail.
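The four steps above can be sketched as follows. Several details are assumptions the slides leave open: the intercluster weight of an edge is taken to be the source page's local rank split evenly over its out-links, links inside a cluster count as a self-loop on that cluster, and a 20% tax is used throughout. All names, pages, and clusters are hypothetical.

```python
def pagerank(nodes, out_weights, beta=0.8, iterations=50):
    """Taxed PageRank on a weighted graph; every node starts with one unit."""
    rank = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        nxt = {u: 1.0 - beta for u in nodes}
        for u in nodes:
            edges = out_weights.get(u, {})
            total = sum(edges.values())
            for v, w in edges.items():
                nxt[v] += beta * rank[u] * w / total
        rank = nxt
    return rank

def initial_ranks(links, cluster_of):
    """links: page -> list of pages (every page is a key); cluster_of: page -> cluster."""
    clusters = sorted(set(cluster_of.values()))
    # 1. Local PageRank inside each cluster, ignoring links that leave it.
    local = {}
    for c in clusters:
        members = [p for p in links if cluster_of[p] == c]
        inside = {p: {q: 1.0 for q in links[p] if cluster_of[q] == c}
                  for p in members}
        local.update(pagerank(members, inside))
    # 2. Intercluster weights (assumed rule: local rank split over out-links).
    weights = {c: {} for c in clusters}
    for p, targets in links.items():
        for q in targets:
            c, d = cluster_of[p], cluster_of[q]
            weights[c][d] = weights[c].get(d, 0.0) + local[p] / len(targets)
    # 3. PageRank on the graph of clusters.
    cluster_rank = pagerank(clusters, weights)
    # 4. Initial rank of a page = local rank * rank of its cluster.
    return {p: local[p] * cluster_rank[cluster_of[p]] for p in links}

# Hypothetical pages in two clusters, A and B.
links = {
    "a1": ["a2"], "a2": ["a1", "b1"], "a3": ["a1"],
    "b1": ["b2"], "b2": ["b1", "a1"],
}
cluster_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
init = initial_ranks(links, cluster_of)
```

The result is only a starting vector for the full PageRank iteration; the expectation is that it lies closer to the limit than the all-ones vector does, so fewer rounds are needed.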
In Pictures
[Figure: local ranks, intercluster weights, ranks of clusters, and the resulting initial eigenvector]

Hubs and Authorities
x Mutually recursive definition: a hub links to many authorities; an authority is linked to by many hubs.
x Authorities turn out to be places where information can be found.
Example: course home pages.
x Hubs tell where the authorities are.
Example: CSD course-listing page.

Transition Matrix A
x H&A uses a matrix A[i, j] = 1 if page i links to page j, 0 if not.
x A^T, the transpose of A, is similar to the PageRank matrix M, but A^T has 1’s where M has fractions.

Example
Pages: Yahoo (y), Amazon (a), M’soft (m).

            y   a   m
       y    1   1   1
  A =  a    1   0   1
       m    0   1   0

Using Matrix A for H&A
x Powers of A and A^T diverge in size of elements, so we need scale factors.
x Let h and a be vectors measuring the “hubbiness” and authority of each page.
x Equations: h = λAa; a = μA^T h.
Hubbiness = scaled sum of authorities of successor pages (out-links).
Authority = scaled sum of hubbiness of predecessor pages (in-links).

Consequences of Basic Equations
x From h = λAa; a = μA^T h we can derive:
h = λμAA^T h
a = λμA^TA a
x Compute h and a by iteration, assuming initially each page has one unit of hubbiness and one unit of authority.
Pick an appropriate value of λμ.

Example
           1 1 1
   A    =  1 0 1
           0 1 0

           1 1 0
   A^T  =  1 0 1
           1 1 0

           3 2 1
   AA^T =  2 2 0
           1 0 1

           2 1 2
   A^TA =  1 2 1
           2 1 2

   a(yahoo)  =   1    5    24   114   ...   1+√3
   a(amazon) =   1    4    18    84   ...   2
   a(m’soft) =   1    5    24   114   ...   1+√3

   h(yahoo)  =   1    6    28   132   ...   1.000
   h(amazon) =   1    4    20    96   ...   0.732
   h(m’soft) =   1    2     8    36   ...   0.268

Solving the Equations
x Solution of even small examples is tricky, because the value of λμ is one of the unknowns.
Each equation like y = λμ(3y + 2a + m) lets us solve for λμ in terms of y, a, m; equate each expression for λμ.
x As for PageRank, we need to solve big examples by relaxation.
Details for h (1)
y = λμ(3y + 2a + m)
a = λμ(2y + 2a)
m = λμ(y + m)
x Solve for λμ:
λμ = y/(3y + 2a + m) = a/(2y + 2a) = m/(y + m)
Details for h (2)
x Assume y = 1. λμ = 1/(3 + 2a + m) = a/(2 + 2a) = m/(1 + m)
x Cross-multiply second and third:
a + am = 2m + 2am, or a = 2m/(1 − m)
x Cross-multiply first and third:
1 + m = 3m + 2am + m^2, or a = (1 − 2m − m^2)/(2m)
Details for h (3)
x Equate formulas for a: a = 2m/(1 − m) = (1 − 2m − m^2)/(2m)
x Cross-multiply: 1 − 2m − m^2 − m + 2m^2 + m^3 = 4m^2
x Solve for m: m = .268
x Solve for a: a = 2m/(1 − m) = .732
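As a numerical check of this derivation: collecting terms in the cross-multiplied equation gives the cubic m^3 − 3m^2 − 3m + 1 = 0, which factors as (m + 1)(m^2 − 4m + 1), so the only root between 0 and 1 is m = 2 − √3. The short sketch below (variable names are illustrative) plugs that root back in.

```python
import math

y = 1.0
m = 2 - math.sqrt(3)       # root of m^3 - 3m^2 - 3m + 1 = 0 lying in (0, 1)
a = 2 * m / (1 - m)        # from cross-multiplying the second and third expressions

residual = m**3 - 3 * m**2 - 3 * m + 1

# The three expressions for λμ should coincide at the solution.
lam_mu = [y / (3 * y + 2 * a + m), a / (2 * y + 2 * a), m / (y + m)]
```

Rounded to three places, m is 0.268 and a is 0.732, and all three expressions for λμ agree.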
Solving H&A in Practice
x Iterate as for PageRank; don’t try to solve equations.
x But keep components within bounds.
Example: scale to keep the largest component of the vector at 1.
x Trick: start with h = [1,1,…,1]; multiply by A^T to get first a; scale, then multiply by A to get next h, …
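The trick above can be sketched as a short loop. This is an illustration, not the slides' code: the helper names are made up, the iteration count of 50 is arbitrary, and A is the three-page matrix with rows and columns ordered Yahoo, Amazon, M’soft.

```python
def mat_vec(M, v):
    """Matrix-vector product M v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def scale(v):
    """Scale so the largest component is 1."""
    top = max(v)
    return [x / top for x in v]

def hubs_and_authorities(A, iterations=50):
    At = transpose(A)
    h = [1.0] * len(A)
    a = h
    for _ in range(iterations):
        a = scale(mat_vec(At, h))   # authority: sum of hubbiness of predecessors
        h = scale(mat_vec(A, a))    # hubbiness: sum of authority of successors
    return h, a

A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]
h, a = hubs_and_authorities(A)
```

Scaling by the largest component plays the role of λ and μ: it keeps the vectors bounded without requiring the value of λμ in advance.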
H&A Versus PageRank
x If you talk to someone from IBM, they will tell you “IBM invented PageRank.” What they mean is that H&A was invented by Jon Kleinberg when he was at IBM.
x But these are not the same.
x H&A has been used, e.g., to analyze important research papers; it does not appear to be a substitute for PageRank.