
# 170-google - Hyperlink Analysis on the Web Monika Henzinger...


Hyperlink Analysis on the Web
Monika Henzinger

Outline
- Random Walks
- Classic Information Retrieval (IR) vs Web IR
- Hyperlink Analysis: PageRank, HITS
- Random Walks on the Web

Random Walks
- Random walk = discrete-time stochastic process over a graph $G=(V,E)$ with a transition probability matrix $P$
- The walk is at exactly one node at any time and makes node transitions at time steps $t=1,2,\ldots$, with $P_{ij}$ being the probability of going to node $j$ when at node $i$
- The initial node is chosen according to some probability distribution $q^{(0)}$ over the state space $S$ (here, the nodes $V$)

Random Walks (cont.)
- $q^{(t)}$ = row vector whose $i$-th component is the probability that the chain is in node $i$ at time $t$
- $q^{(t+1)} = q^{(t)} P \;\Rightarrow\; q^{(t)} = q^{(0)} P^t$
- A stationary distribution is a probability distribution $q$ such that $q = qP$ (steady-state behavior)
- Example: if $P_{ij} = 1/\mathrm{degree}(i)$ when $(i,j)$ is in $G$ and $0$ otherwise, then $q_i = \mathrm{degree}(i)/2m$, where $m$ is the number of edges of the undirected graph $G$

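
The update rule $q^{(t+1)} = q^{(t)} P$ and the degree example can be checked numerically. The following is a minimal sketch in plain Python; the four-node graph `ADJ` and the helper names `transition_matrix`, `step`, and `iterate` are made up for illustration and do not come from the slides.

```python
# Sketch: iterate q(t+1) = q(t) P on a small undirected graph and compare
# the result with the closed form q_i = degree(i) / 2m.
# The graph below is a hypothetical example, not taken from the slides.

def transition_matrix(adj):
    """P[i][j] = 1/degree(i) if (i, j) is an edge, else 0."""
    n = len(adj)
    P = [[0.0] * n for _ in range(n)]
    for i, neighbors in enumerate(adj):
        for j in neighbors:
            P[i][j] = 1.0 / len(neighbors)
    return P

def step(q, P):
    """One step of the chain: returns the row vector q P."""
    n = len(q)
    return [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]

def iterate(q0, P, steps):
    """Apply the update `steps` times, i.e. compute q0 P^steps."""
    q = list(q0)
    for _ in range(steps):
        q = step(q, P)
    return q

# Triangle 0-1-2 plus a pendant node 3 attached to node 2.
# The triangle makes the graph non-bipartite, so the iteration converges.
ADJ = [[1, 2], [0, 2], [0, 1, 3], [2]]
P = transition_matrix(ADJ)

m = sum(len(nb) for nb in ADJ) // 2              # number of undirected edges
closed_form = [len(nb) / (2 * m) for nb in ADJ]  # degree(i) / 2m
q = iterate([1.0, 0.0, 0.0, 0.0], P, 200)        # start the walk at node 0

print(closed_form)
print(q)
```

On this graph the iterated vector and the closed form should agree to within floating-point error, regardless of the starting node.
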
Random Walks (cont.)
- Theorem: Under certain conditions:
  - There exists a unique stationary distribution $q$ with $q_i > 0$ for all $i$
  - Let $N(i,t)$ be the number of times the random walk visits node $i$ in $t$ steps. Then the long-run fraction of steps the walk spends at $i$ equals $q_i$, i.e.

$$\lim_{t \to \infty} \frac{N(i,t)}{t} = q_i$$
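
The visit-frequency statement can be illustrated by simulation. This is a rough sketch under assumed inputs: the graph `ADJ`, the function name `visit_fractions`, the step count, and the fixed seed are all choices made here for illustration. On this connected, non-bipartite graph the uniform-neighbor walk has stationary distribution degree(i)/2m.

```python
import random

def visit_fractions(adj, start, steps, seed=0):
    """Simulate a uniform-neighbor random walk and return N(i, t) / t for each node i."""
    rng = random.Random(seed)      # fixed seed so the run is reproducible
    counts = [0] * len(adj)
    node = start
    for _ in range(steps):
        node = rng.choice(adj[node])   # move to a uniformly chosen neighbor
        counts[node] += 1
    return [c / steps for c in counts]

# Hypothetical graph: triangle 0-1-2 plus a pendant node 3 attached to node 2.
ADJ = [[1, 2], [0, 2], [0, 1, 3], [2]]
two_m = sum(len(nb) for nb in ADJ)              # 2m = sum of degrees
expected = [len(nb) / two_m for nb in ADJ]      # stationary distribution degree(i)/2m
observed = visit_fractions(ADJ, start=0, steps=200_000)

print(expected)
print(observed)
```

The observed fractions only approximate $q_i$ for finite $t$; the gap shrinks as the number of steps grows.
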

Information Retrieval
- Input: Document collection
- Goal: Retrieve documents or text with information content that is relevant to the user’s information need
- Two aspects:
  1. Processing the collection
  2. Processing queries (searching)

Classic information retrieval
- Ranking is a function of query term frequency within the document (tf) and across all documents (idf)
- This works because of the following assumptions in classical IR:
  - Queries are long and well specified, e.g. “What is the impact of the Falklands war on Anglo-Argentinean relations”
  - Documents (e.g., newspaper articles) are coherent, well authored, and usually about one topic
  - The vocabulary is small and relatively well understood
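
The slide does not say how tf and idf are combined. One common variant scores a document as the sum over query terms of tf × log(N/df); the sketch below assumes that variant, along with whitespace tokenization and a made-up three-document collection. The names `tf_idf_scores` and `DOCS` are hypothetical.

```python
import math
from collections import Counter

def tf_idf_scores(docs, query):
    """Score each document as sum over query terms of tf * log(N / df).

    This is one common tf-idf variant, assumed here for illustration.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # df[term] = number of documents that contain the term at least once
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)       # term frequency within this document
        score = 0.0
        for term in query.lower().split():
            if df[term]:           # skip terms that appear in no document
                score += tf[term] * math.log(n_docs / df[term])
        scores.append(score)
    return scores

# Made-up collection, loosely echoing the example query on the slide.
DOCS = [
    "falklands war strained anglo argentinean relations",
    "the war ended and relations slowly improved",
    "football results from the weekend",
]
scores = tf_idf_scores(DOCS, "falklands war relations")
print(scores)
```

The first document matches all three query terms, including the rare term "falklands", so it should score highest; the third matches none and scores zero.
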

Web information retrieval
- None of these assumptions hold:
  - Queries are short: 2.35 terms on average
  - Huge variety in documents: language, quality, duplication
  - Huge vocabulary: hundreds of millions of terms
  - Deliberate misinformation
- Ranking is a function of the query terms and of the hyperlink structure

04/29/09

Hyperlink analysis
- Idea: Mine the structure of the web graph
  - Each web page is a node
  - Each hyperlink is a directed edge
- Related work:
  - Classic IR work (citations = links), a.k.a. “Bibliometrics” [K’63, G’72, S’73,…]
  - Socio-metrics [K’53, MMSM’86,…]
  - Many Web-related papers use this approach [PPR’96, AMM’97, S’97, CK’97, K’98, BP’98,…]

Google’s approach
- Assumption: A link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
  - Quality of a page is related to its in-degree
- Recursion: Quality of a page is related to
  - its in-degree, and
  - the quality of the pages linking to it
- PageRank [BP ’98]
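
The slides shown here state the recursion only in words. As a rough illustration, the sketch below assumes the commonly cited formulation in which each page splits its score evenly among its successors and a damping factor mixes in a uniform term. The damping value 0.85, the uniform handling of pages without successors, the function name `pagerank`, and the four-page graph `LINKS` are all assumptions made for this example, not details from the slides.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively compute a PageRank-style score.

    links: dict mapping each page to the list of its successor pages.
    damping and the treatment of dangling pages are assumptions of this sketch.
    """
    pages = sorted(set(links) | {p for succ in links.values() for p in succ})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # start from the uniform distribution

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            successors = links.get(page, [])
            if successors:
                # A page passes its score on, split evenly among its successors.
                share = damping * rank[page] / len(successors)
                for succ in successors:
                    new_rank[succ] += share
            else:
                # Dangling page: spread its score uniformly (an assumption).
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical web graph: C is linked to by A, B, and D; D has no in-links.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

In this graph C has the largest in-degree and ends up with the highest score, while A ranks above B even though both have in-degree 1, because A's single in-link comes from the highly ranked C. That is the "quality of the pages linking to it" part of the recursion.
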