SpamHits-1

CS345 Data Mining
Link Analysis 2: Topic-Specific Page Rank, Hubs and Authorities, Spam Detection
Anand Rajaraman, Jeffrey D. Ullman

Topic-Specific Page Rank
- Instead of generic popularity, can we measure popularity within a topic?
  - E.g., computer science, health
- Bias the random walk
  - When the random walker teleports, he picks a page from a set S of web pages
  - S contains only pages that are relevant to the topic
  - E.g., Open Directory (DMOZ) pages for a given topic (www.dmoz.org)
- For each teleport set S, we get a different rank vector $r_S$

Matrix formulation
- $A_{ij} = \beta M_{ij} + (1-\beta)/|S|$ if $i \in S$
- $A_{ij} = \beta M_{ij}$ otherwise
- Show that A is stochastic
- We have weighted all pages in the teleport set S equally
  - Could also assign different weights to them
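
One way to approach the "show that A is stochastic" exercise is to sum a single column of A. The sketch below assumes a convention the slide does not state explicitly: M is column-stochastic, meaning column j holds the out-link probabilities of page j and sums to 1 (so there are no dead ends), and $0 \le \beta \le 1$.

```latex
\[
\sum_{i} A_{ij}
  = \beta \sum_{i} M_{ij} + \sum_{i \in S} \frac{1-\beta}{|S|}
  = \beta \cdot 1 + |S| \cdot \frac{1-\beta}{|S|}
  = 1
  \qquad \text{for every column } j .
\]
```

Every entry of A is also non-negative under these assumptions, so each column is a probability distribution over the next page visited.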

Example
[Figure: directed graph on nodes 1, 2, 3, 4, with labels 0.2, 0.5, 0.5, 1, 1, 1, 0.4, 0.4, 0.8, 0.8, 0.8]
Suppose S = {1}, β = 0.8

Node   Iteration 0   Iteration 1   Iteration 2   …   stable
1      1.0           0.2           0.52          …   0.294
2      0             0.4           0.08          …   0.118
3      0             0.4           0.08          …   0.327
4      0             0             0.32          …   0.261

Note how we initialize the page rank vector differently from the unbiased page rank case.
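
The iteration in this example can be sketched in a few lines of Python. The graph itself did not survive in the text, so the link structure below (`EXAMPLE_LINKS`: 1→2, 1→3, 2→1, 3→4, 4→3) is an assumption, chosen because it is consistent with the numbers in the table. The function name and its parameters are also illustrative, not taken from the slides.

```python
def topic_specific_pagerank(out_links, teleport_set, beta=0.8, iterations=100):
    """Power iteration with teleports restricted to teleport_set.

    out_links maps each node to the list of nodes it links to.
    Assumes every node has at least one out-link (no dead ends).
    """
    nodes = sorted(out_links)
    # Teleport distribution: uniform over S, zero elsewhere.
    teleport = {
        v: (1.0 / len(teleport_set) if v in teleport_set else 0.0) for v in nodes
    }
    # Start from the teleport distribution, as the slide's table does.
    rank = dict(teleport)
    history = [dict(rank)]
    for _ in range(iterations):
        # Teleport share: (1 - beta) spread over the pages in S.
        new_rank = {v: (1.0 - beta) * teleport[v] for v in nodes}
        # Link-following share: beta times rank, split evenly over out-links.
        for src in nodes:
            share = beta * rank[src] / len(out_links[src])
            for dst in out_links[src]:
                new_rank[dst] += share
        rank = new_rank
        history.append(dict(rank))
    return rank, history


# Hypothetical link structure, reconstructed to match the table (an assumption).
EXAMPLE_LINKS = {1: [2, 3], 2: [1], 3: [4], 4: [3]}

rank, history = topic_specific_pagerank(EXAMPLE_LINKS, teleport_set={1}, beta=0.8)
for node in sorted(rank):
    print(node, round(history[1][node], 3), round(history[2][node], 3),
          round(rank[node], 3))
```

Under the assumed graph, the printed columns correspond to iterations 1 and 2 and the stable vector in the table. Changing `teleport_set` gives a different rank vector for the same graph.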

How well does TSPR work?
- Experimental results [Haveliwala 2000]
- Picked 16 topics
  - Teleport sets determined using DMOZ
  - E.g., arts, business, sports,…
- “Blind study” using volunteers
  - 35 test queries
  - Results ranked using Page Rank and TSPR of most closely related topic
  - E.g., bicycling using Sports ranking
  - In most cases volunteers preferred TSPR ranking

Which topic ranking to use?
- User can pick from a menu
- Use Bayesian classification schemes to classify query into a topic
- Can use the context of the query
  - E.g., query is launched from a web page talking about a known topic
  - History of queries, e.g., “basketball” followed by “jordan”
- User context, e.g., user’s My Yahoo settings, bookmarks, …

Hubs and Authorities
- Suppose we are given a collection of documents on some broad topic
  - e.g., stanford, evolution, iraq
  - perhaps obtained through a text search
- Can we organize these documents in some manner?
  - Page rank offers one solution
  - HITS (Hypertext-Induced Topic Selection) is another
    - proposed at approximately the same time (1998)

HITS Model
- Interesting documents fall into two classes
  1. Authorities are pages containing useful information
     - course home pages
     - home pages of auto manufacturers
  2. Hubs are pages that link to authorities
     - course bulletin
     - list of US auto manufacturers

Idealized view
[Figure with labels: Hubs, Authorities]

Mutually recursive definition
- A good hub links to many good authorities
- A good authority is linked from many good hubs
- Model using two scores for each node
  - Hub score and Authority score
  - Represented as vectors h and a
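
The mutual recursion can be sketched as an alternating update of the two score vectors. The following is a minimal illustration, not the formulation from the rest of the deck, which the preview does not show. The update rule (sum the scores across links), the scaling so that the largest score is 1, the fixed iteration count, and the tiny example graph with its page names are all assumptions made for the sketch.

```python
def scale_to_max(scores):
    """Rescale so the largest score is 1 (one common normalization choice)."""
    top = max(scores.values())
    if top == 0:
        return dict(scores)
    return {page: value / top for page, value in scores.items()}


def hits(out_links, iterations=50):
    """Alternate authority and hub updates over a link graph.

    out_links maps a page to the list of pages it links to.
    """
    pages = sorted(set(out_links) | {d for dsts in out_links.values() for d in dsts})
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # A good authority is linked from many good hubs.
        auth = {page: 0.0 for page in pages}
        for src, dsts in out_links.items():
            for dst in dsts:
                auth[dst] += hub[src]
        auth = scale_to_max(auth)
        # A good hub links to many good authorities.
        hub = {page: sum(auth[d] for d in out_links.get(page, [])) for page in pages}
        hub = scale_to_max(hub)
    return hub, auth


# Hypothetical graph: three list-style pages pointing at three content pages.
EXAMPLE_LINKS = {
    "hub1": ["authA", "authB", "authC"],
    "hub2": ["authA", "authB"],
    "hub3": ["authA"],
}

hub, auth = hits(EXAMPLE_LINKS)
for page in sorted(hub):
    print(page, round(hub[page], 3), round(auth[page], 3))
```

In this toy graph the page linked from every hub ends up with the top authority score, and the page linking to every authority ends up with the top hub score, which is the behaviour the two bullet points describe.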

Transition Matrix A
- HITS uses a matrix A[i, j] = 1 if page i

This document was uploaded on 03/04/2012.
