Hyperlink Analysis on the Web
Monika Henzinger

Outline
• Random Walks
• Classic Information Retrieval (IR) vs Web IR
• Hyperlink Analysis:
  – PageRank
  – HITS

Random Walks
• Random walk = discrete-time stochastic process over a graph G=(V,E) with a transition probability matrix P
  – The walk is at exactly one node at any time and makes node transitions at time steps t = 1, 2, …, with P_ij being the probability of going to node j when at node i
  – The initial node is chosen according to some probability distribution q^(0) over V
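The definition above can be sketched as a short simulation. This is a minimal illustration, not code from the slides: the 3-node matrix `P`, the start distribution `q0`, and the function name `random_walk` are all hypothetical choices made for the example.

```python
import random

def random_walk(P, q0, steps, rng):
    """Draw the start node from q0, then make `steps` transitions using the rows of P."""
    nodes = list(range(len(P)))
    current = rng.choices(nodes, weights=q0, k=1)[0]
    path = [current]
    for _ in range(steps):
        # Row P[current] holds the probabilities P_ij of moving from i = current to each j.
        current = rng.choices(nodes, weights=P[current], k=1)[0]
        path.append(current)
    return path

# Hypothetical transition matrix; each row sums to 1.
P = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
q0 = [1.0, 0.0, 0.0]  # assumed initial distribution: always start at node 0

path = random_walk(P, q0, steps=10, rng=random.Random(0))
print(path)
```

Passing a seeded `random.Random` instance keeps the run reproducible; the walk only ever follows transitions with nonzero probability in `P`.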

Random Walks (cont.)
• q^(t) = row vector whose i-th component is the probability that the chain is in node i at time t
• q^(t+1) = q^(t) P  =>  q^(t) = q^(0) P^t
• A stationary distribution is a probability distribution q such that q = q P (steady-state behavior)
• Example:
  – If P_ij = 1/degree(i) when (i,j) is an edge of G and 0 otherwise, then q_i = degree(i)/(2m), where G is undirected and m is its number of edges
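The update rule and the degree-based example can be checked numerically. The sketch below assumes a small hypothetical undirected graph (a triangle with one extra edge); the helper name `step` is illustrative and not taken from the slides.

```python
def step(q, P):
    """One update q^(t+1) = q^(t) P for a row vector q."""
    n = len(P)
    return [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical undirected graph with m = 4 edges: a triangle 0-1-2 plus the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n, m = 4, len(edges)
adj = [[] for _ in range(n)]
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)
degree = [len(neighbors) for neighbors in adj]

# P_ij = 1/degree(i) if (i,j) is an edge, 0 otherwise.
P = [[1.0 / degree[i] if j in adj[i] else 0.0 for j in range(n)] for i in range(n)]

# Claimed stationary distribution q_i = degree(i) / (2m).
q_claim = [degree[i] / (2 * m) for i in range(n)]
q_next = step(q_claim, P)  # equals q_claim up to rounding, i.e. q = q P

# Starting from a point mass at node 0, repeated steps approach q_claim on this graph.
q = [1.0, 0.0, 0.0, 0.0]
for _ in range(200):
    q = step(q, P)

print(q_claim)
print(q)
```

Convergence from an arbitrary start relies on this particular graph being connected and containing an odd cycle; on a bipartite graph the iterates would oscillate even though q = q P still holds.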

Random Walks (cont.)
• Theorem: Under certain conditions (for example, when the chain is finite and irreducible):
  – There exists a unique stationary distribution q with q_i > 0 for all i
  – Let N(i,t) be the number of times the random walk visits node i in t steps. Then the long-run fraction of steps the walk spends at i equals q_i, i.e.
    lim_{t→∞} N(i,t)/t = q_i
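The visit-frequency statement can be illustrated empirically. The following is a rough sketch on an assumed small undirected graph, where the stationary distribution is degree(i)/(2m); the graph, the seed, and the function name `visit_fractions` are hypothetical, and a finite run only approximates the limit.

```python
import random

# Hypothetical undirected graph: a triangle 0-1-2 with a tail edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
m = sum(len(neighbors) for neighbors in adj.values()) // 2  # number of edges

def visit_fractions(adj, start, t, rng):
    """Run a t-step walk and return N(i,t)/t for every node i."""
    counts = {node: 0 for node in adj}
    current = start
    for _ in range(t):
        # Uniform choice among neighbors, i.e. P_ij = 1/degree(i).
        current = rng.choice(adj[current])
        counts[current] += 1
    return {node: counts[node] / t for node in adj}

fractions = visit_fractions(adj, start=0, t=200_000, rng=random.Random(42))
expected = {node: len(adj[node]) / (2 * m) for node in adj}

for node in adj:
    print(node, round(fractions[node], 3), expected[node])
```

The empirical fractions should land close to 0.25, 0.25, 0.375, and 0.125, with the gap shrinking as t grows.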

Information Retrieval
• Input: Document collection
• Goal: Retrieve documents or text with information content that is relevant to the user’s information need
Two aspects:

Classic information retrieval
• Ranking is a function of query term frequency within the document (tf) and across all documents (idf)
• This works because of the following assumptions in classical IR:
  – Queries are long and well specified: “What is the impact of the Falklands war on Anglo-Argentinean relations”
  – Documents (e.g., newspaper articles) are coherent, well authored, and usually about one topic
  – The vocabulary is small and relatively well understood
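To make the tf/idf idea concrete, here is a minimal scoring sketch. The slides do not give a formula, so the weighting used here (raw term count times log(N/df)) is an assumption, just one common variant; the function name `tfidf_scores` and the three toy documents are likewise hypothetical.

```python
import math
from collections import Counter

def tfidf_scores(query, docs):
    """Score each document by the sum over query terms of tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # df[term] = number of documents containing the term at least once.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        score = 0.0
        for term in query.lower().split():
            if df[term]:
                score += tf[term] * math.log(n_docs / df[term])
        scores.append(score)
    return scores

# Hypothetical toy collection.
docs = [
    "falklands war impact on relations",
    "football match report",
    "war war war report",
]
scores = tfidf_scores("falklands war", docs)
print(scores)
```

In this toy collection the first document ranks highest because it contains the rarer query term, even though the third repeats "war" three times; the second document shares no query terms and scores zero.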

Web information retrieval
• None of these assumptions hold:
  – Queries are short: 2.35 terms on average
  – Huge variety in documents: language, quality, duplication
  – Huge vocabulary: hundreds of millions of terms
  – Deliberate misinformation
• Ranking is a function of the query terms and of the hyperlink structure

12/6/2002
Hyperlink analysis
• Idea: Mine the structure of the web graph
  – Each web page is a node
  – Each hyperlink is a directed edge
• Related work:
  – Classic IR work (citations = links), a.k.a. “Bibliometrics” [K’63, G’72, S’73,…]
  – Socio-metrics [K’53, MMSM’86,…]
  – Many Web-related papers use this approach [PPR’96, AMM’97, S’97, CK’97, K’98, BP’98,…]
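The web-graph model (pages as nodes, hyperlinks as directed edges) could be represented as follows. This is only one possible data structure; the function name `build_web_graph` and the `*.example` page names are made up for illustration.

```python
def build_web_graph(links):
    """Build successor and predecessor maps from (source_page, target_page) pairs."""
    successors, predecessors = {}, {}
    for source, target in links:
        # Register both endpoints as nodes, even if they have no links in one direction.
        for page in (source, target):
            successors.setdefault(page, set())
            predecessors.setdefault(page, set())
        successors[source].add(target)    # directed edge source -> target
        predecessors[target].add(source)  # reverse index, useful for in-link analysis
    return successors, predecessors

# Hypothetical hyperlinks between made-up pages.
links = [
    ("a.example", "b.example"),
    ("a.example", "c.example"),
    ("b.example", "c.example"),
]
successors, predecessors = build_web_graph(links)
print(successors["a.example"])
print(predecessors["c.example"])
```

Keeping a predecessor map alongside the successor map makes in-link queries as cheap as out-link queries, which matters for analyses based on who links to a page.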

Google’s approach
• Assumption: A link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
  ⇒ Quality of a page is related to its in-degree
• Recursion: Quality of a page is related to
  – its in-degree, and to
  – the quality of pages linking to it
  ⇒
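The two ideas above can be contrasted on a tiny example. The text stops before giving a formula for the recursion, so the update rule below (each page splits its current quality evenly among its successors) is an assumption chosen for illustration, not the definition from the slides. The link list and the name `recursive_quality` are hypothetical.

```python
# Hypothetical link list: (A, B) means page A links to page B.
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
pages = sorted({page for link in links for page in link})

# First idea: quality as plain in-degree.
in_degree = {page: 0 for page in pages}
out_degree = {page: 0 for page in pages}
for src, dst in links:
    out_degree[src] += 1
    in_degree[dst] += 1

def recursive_quality(links, pages, out_degree, iterations=100):
    """Assumed rule: a page's quality is the sum of quality(q)/out_degree(q) over pages q linking to it."""
    quality = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_quality = {page: 0.0 for page in pages}
        for src, dst in links:
            new_quality[dst] += quality[src] / out_degree[src]
        quality = new_quality
    # Note: total quality is preserved only because every page here has an outgoing link.
    return quality

quality = recursive_quality(links, pages, out_degree)
print(in_degree)
print(quality)
```

Under this assumed rule, page A ends up with the same quality as page C (about 0.4 each) despite having in-degree 1 versus 3, because its single in-link comes from a high-quality page. That is the effect the recursion is meant to capture.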