SpamHits-1

CS345 Data Mining
Link Analysis 2: Topic-Specific Page Rank, Hubs and Authorities, Spam Detection
Anand Rajaraman, Jeffrey D. Ullman

Topic-Specific Page Rank
- Instead of generic popularity, can we measure popularity within a topic?
  - E.g., computer science, health
- Bias the random walk
  - When the random walker teleports, he picks a page from a set S of web pages
  - S contains only pages that are relevant to the topic
  - E.g., Open Directory (DMOZ) pages for a given topic (www.dmoz.org)
- For each teleport set S, we get a different rank vector r_S

Matrix formulation
- $A_{ij} = \beta M_{ij} + (1-\beta)/|S|$ if $i \in S$
- $A_{ij} = \beta M_{ij}$ otherwise
- Show that A is stochastic
- We have weighted all pages in the teleport set S equally
  - Could also assign different weights to them
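
The stochasticity exercise above can be checked numerically. The sketch below assumes M is column-stochastic (M[i][j] is the probability of stepping from page j to page i), so each column of A sums to β·1 + |S|·(1-β)/|S| = 1. The function names, the 0-based page indices, and the three-page graph are hypothetical and chosen only for illustration.

```python
def teleport_matrix(M, S, beta):
    """Build A with A[i][j] = beta*M[i][j] + (1-beta)/|S| if i is in S,
    and A[i][j] = beta*M[i][j] otherwise.

    Assumes M is column-stochastic and pages are numbered from 0.
    """
    n = len(M)
    bonus = (1.0 - beta) / len(S)
    return [[beta * M[i][j] + (bonus if i in S else 0.0) for j in range(n)]
            for i in range(n)]


def column_sums(A):
    """Sum of each column; every entry should be 1 for a stochastic A."""
    n = len(A)
    return [sum(A[i][j] for i in range(n)) for j in range(n)]


# Hypothetical 3-page graph: 0 -> 1, 0 -> 2, 1 -> 0, 2 -> 0.
M = [[0.0, 1.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 0.0, 0.0]]

A = teleport_matrix(M, S={0, 1}, beta=0.8)
sums = column_sums(A)
print(sums)
```

Giving the pages in S unequal weights would amount to replacing the constant bonus with per-page values that still sum to 1-β.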

Example
[Figure: directed graph on nodes 1, 2, 3, 4. Surviving edge labels: 0.2 0.5 0.5 1 1 1 0.4 0.4 0.8 0.8 0.8]

Suppose S = {1}, β = 0.8

Node   Iteration 0   1      2      ...   stable
1      1.0           0.2    0.52   ...   0.294
2      0             0.4    0.08   ...   0.118
3      0             0.4    0.08   ...   0.327
4      0             0      0.32   ...   0.261

Note how we initialize the page rank vector differently from the unbiased page rank case.
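
The iteration in this example can be sketched in a few lines of Python. The graph figure did not survive, so the edge list below (1 -> 2, 1 -> 3, 2 -> 1, 3 -> 4, 4 -> 3) is an assumption inferred from the table values, not read off the slide. With that assumption the sketch gives the same iteration 1, iteration 2, and stable columns as the table. The function and variable names are hypothetical.

```python
def biased_pagerank_step(rank, out_links, teleport_set, beta):
    """One iteration: follow a link with probability beta,
    otherwise teleport to a page chosen uniformly from teleport_set."""
    total = sum(rank.values())
    nxt = {node: 0.0 for node in rank}
    for src, targets in out_links.items():
        share = beta * rank[src] / len(targets)
        for dst in targets:
            nxt[dst] += share
    for node in teleport_set:
        nxt[node] += (1.0 - beta) * total / len(teleport_set)
    return nxt


# ASSUMPTION: edges inferred from the table, since the figure is lost.
out_links = {1: [2, 3], 2: [1], 3: [4], 4: [3]}
S = {1}
beta = 0.8

# All initial rank sits on the teleport set, as the slide notes.
rank = {1: 1.0, 2: 0.0, 3: 0.0, 4: 0.0}
history = [rank]
for _ in range(200):
    rank = biased_pagerank_step(rank, out_links, S, beta)
    history.append(rank)

stable = {node: round(value, 3) for node, value in rank.items()}
print(history[1])
print(history[2])
print(stable)
```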

How well does TSPR (Topic-Specific Page Rank) work?
- Experimental results [Haveliwala 2000]
  - Picked 16 topics
    - Teleport sets determined using DMOZ
    - E.g., arts, business, sports, ...
  - “Blind study” using volunteers
    - 35 test queries
    - Results ranked using Page Rank and TSPR of the most closely related topic
    - E.g., bicycling using Sports ranking
    - In most cases volunteers preferred TSPR ranking

Which topic ranking to use?
- User can pick from a menu
- Use Bayesian classification schemes to classify query into a topic
- Can use the context of the query
  - E.g., query is launched from a web page talking about a known topic
  - History of queries, e.g., “basketball” followed by “jordan”
  - User context, e.g., user’s My Yahoo settings, bookmarks, ...
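
One way the Bayesian option above might look is a small multinomial naive Bayes classifier with add-one smoothing. This is only a sketch of one such scheme, not the method the slides or the cited study used. The training texts, topic names, and function names are all hypothetical; a real system would presumably train on the pages of each topic's teleport set.

```python
import math
from collections import Counter


def train(examples):
    """examples: list of (topic, text). Returns word counts per topic,
    document counts per topic, and the overall vocabulary."""
    word_counts = {}
    topic_counts = Counter()
    vocab = set()
    for topic, text in examples:
        words = text.lower().split()
        word_counts.setdefault(topic, Counter()).update(words)
        topic_counts[topic] += 1
        vocab.update(words)
    return word_counts, topic_counts, vocab


def classify(query, model):
    """Return the topic with the highest log posterior for the query."""
    word_counts, topic_counts, vocab = model
    total_docs = sum(topic_counts.values())
    best_topic, best_score = None, -math.inf
    for topic, n_docs in topic_counts.items():
        counts = word_counts[topic]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(n_docs / total_docs)
        for word in query.lower().split():
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Hypothetical toy training data, one short text per topic.
model = train([
    ("sports", "bicycling race team basketball"),
    ("health", "diet exercise doctor medicine"),
])
print(classify("bicycling race", model))
```

The chosen topic would then select which precomputed rank vector r_S is used to order the results.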

Hubs and Authorities
- Suppose we are given a collection of documents on some broad topic
  - E.g., stanford, evolution, iraq
  - Perhaps obtained through a text search
- Can we organize these documents in some manner?
  - Page rank offers one solution
  - HITS (Hypertext-Induced Topic Selection) is another
    - Proposed at approx. the same time (1998)

HITS Model
- Interesting documents fall into two classes
  1. Authorities are pages containing useful information
     - course home pages
     - home pages of auto manufacturers
  2. Hubs are pages that link to authorities
     - course bulletin
     - list of US auto manufacturers

Idealized view
[Figure: diagram with two groups labeled Hubs and Authorities]

Mutually recursive definition
- A good hub links to many good authorities
- A good authority is linked from many good hubs
- Model using two scores for each node
  - Hub score and Authority score
  - Represented as vectors h and a
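
The mutual recursion can be made concrete with an iterative sketch. The preview ends before the slides give update rules, so the code below follows the commonly described HITS iteration: each authority score is the sum of the hub scores of pages linking to it, and each hub score is the sum of the authority scores of the pages it links to. Scaling so the largest score is 1, the fixed iteration count, and the tiny example graph are assumptions of this sketch.

```python
def normalize(scores):
    """Scale scores so the largest is 1 (one common choice; an assumption here)."""
    top = max(scores.values())
    return {n: (v / top if top > 0 else 0.0) for n, v in scores.items()}


def hits(links, iterations=50):
    """links maps each page to the list of pages it points to.
    Returns (h, a): hub and authority scores per page."""
    nodes = set(links) | {dst for targets in links.values() for dst in targets}
    h = {n: 1.0 for n in nodes}
    a = {n: 0.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: add up hub scores of pages linking in.
        a = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                a[dst] += h[src]
        a = normalize(a)
        # Hub update: add up authority scores of pages linked to.
        h = normalize({n: sum(a[dst] for dst in links.get(n, []))
                       for n in nodes})
    return h, a


# Hypothetical graph: hub1 links to both authorities, hub2 to only one.
links = {"hub1": ["auth1", "auth2"], "hub2": ["auth1"]}
h, a = hits(links)
print(h)
print(a)
```

In this toy graph auth1 ends with the higher authority score because both hubs link to it, and hub1 ends with the higher hub score because it links to both authorities.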

This document was uploaded on 03/04/2012.
