SpamHits-2

CS345 Data Mining
Link Analysis 2: Topic-Specific Page Rank, Hubs and Authorities, Spam Detection
Anand Rajaraman, Jeffrey D. Ullman

Some problems with page rank
- Measures generic popularity of a page
  - Biased against topic-specific authorities
  - Ambiguous queries, e.g., jaguar
- Uses a single measure of importance
  - Other models, e.g., hubs-and-authorities
- Susceptible to link spam
  - Artificial link topographies created in order to boost page rank
Topic-Specific Page Rank
- Instead of generic popularity, can we measure popularity within a topic?
  - E.g., computer science, health
- Bias the random walk
  - When the random walker teleports, he picks a page from a set S of web pages
  - S contains only pages that are relevant to the topic
  - E.g., Open Directory (DMOZ) pages for a given topic (www.dmoz.org)
- For each teleport set S, we get a different rank vector r_S

Matrix formulation
- $A_{ij} = \beta M_{ij} + (1-\beta)/|S|$ if $i \in S$
- $A_{ij} = \beta M_{ij}$ otherwise
- Show that A is stochastic
- We have weighted all pages in the teleport set S equally
  - Could also assign different weights to them
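The "show that A is stochastic" exercise can be checked numerically. Each column of A sums to β times the column sum of M plus |S| copies of (1-β)/|S|, which is β + (1-β) = 1 whenever M itself is column-stochastic. The sketch below is only an illustration under that assumption: the 3-page matrix M, the choice S = {0, 1}, and the helper names teleport_matrix and column_sums are hypothetical and do not come from the slides.

```python
def teleport_matrix(M, S, beta):
    """Build A with A[i][j] = beta*M[i][j] + (1-beta)/|S| if i is in S,
    and A[i][j] = beta*M[i][j] otherwise.

    M is a column-stochastic matrix given as a list of rows.
    S is a set of 0-based row indices (the teleport set).
    """
    n = len(M)
    bonus = (1.0 - beta) / len(S)
    return [
        [beta * M[i][j] + (bonus if i in S else 0.0) for j in range(n)]
        for i in range(n)
    ]


def column_sums(A):
    """Return the sum of each column of A."""
    return [sum(row[j] for row in A) for j in range(len(A))]


# Hypothetical 3-page web: column j holds the out-link probabilities of page j.
M = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]

A = teleport_matrix(M, S={0, 1}, beta=0.8)
sums = column_sums(A)
print(sums)  # every entry should be 1 up to floating-point rounding
```

The check relies on M having no dead ends; a page with no out-links gives a zero column in M, and the corresponding column of A would then sum to 1-β rather than 1.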
Example
[Figure: directed graph on nodes 1, 2, 3, 4. Edge weights 0.5, 0.5, 1, 1, 1, shown scaled by β as 0.4, 0.4, 0.8, 0.8, 0.8; teleport weight 0.2.]

Suppose S = {1}, β = 0.8

Node   Iteration 0   1      2      ...   stable
1      1.0           0.2    0.52   ...   0.294
2      0             0.4    0.08   ...   0.118
3      0             0.4    0.08   ...   0.327
4      0             0      0.32   ...   0.261

Note how we initialize the page rank vector differently from the unbiased page rank case.
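The iteration in this example can be reproduced with a short power-iteration sketch. The figure's edges did not survive in this copy, so the edge list below (1→2, 1→3, 2→1, 3→4, 4→3) is an assumption, inferred from the fact that it is consistent with the iteration values and edge weights given in the example. The names OUT_LINKS, step, and iterate are illustrative only.

```python
BETA = 0.8
S = {1}  # teleport set from the example

# ASSUMPTION: edge list inferred from the iteration values, not read from the figure.
OUT_LINKS = {1: [2, 3], 2: [1], 3: [4], 4: [3]}


def step(r):
    """One update r' = A r for the topic-specific matrix A."""
    nxt = {node: 0.0 for node in OUT_LINKS}
    # Follow out-links with probability BETA, split evenly among targets.
    for src, targets in OUT_LINKS.items():
        share = BETA * r[src] / len(targets)
        for dst in targets:
            nxt[dst] += share
    # Teleport with probability 1 - BETA, landing uniformly on pages in S.
    teleport = (1.0 - BETA) * sum(r.values()) / len(S)
    for node in S:
        nxt[node] += teleport
    return nxt


def iterate(r, max_iters=1000, tol=1e-12):
    """Repeat step() until no entry changes by more than tol."""
    for _ in range(max_iters):
        nxt = step(r)
        if max(abs(nxt[n] - r[n]) for n in r) < tol:
            return nxt
        r = nxt
    return r


# All initial mass sits on the teleport set, as in the Iteration 0 column.
r0 = {1: 1.0, 2: 0.0, 3: 0.0, 4: 0.0}
r1 = step(r0)
r2 = step(r1)
stable = iterate(r0)

for name, vec in [("iteration 1", r1), ("iteration 2", r2), ("stable", stable)]:
    print(name, [round(vec[n], 3) for n in sorted(vec)])
```

Under the assumed edges, the first two updates and the converged vector agree with the table to three decimal places.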

How well does TSPR work?
- Experimental results [Haveliwala 2000]
- Picked 16 topics
  - Teleport sets determined using DMOZ
  - E.g., arts, business, sports, …
- “Blind study” using volunteers
  - 35 test queries
  - Results ranked using Page Rank and TSPR of most closely related topic
  - E.g., bicycling using Sports ranking
  - In most cases volunteers preferred TSPR ranking
Which topic ranking to use?
- User can pick from a menu
- Use Bayesian classification schemes to classify query into a topic
- Can use the context of the query
  - E.g., query is launched from a web page talking about a known topic
  - History of queries, e.g., “basketball” followed by “jordan”
  - User context, e.g., user’s My Yahoo settings, bookmarks, …

- Suppose we are given a collection of documents on some broad topic
  - E.g., stanford, evolution, iraq
  - Perhaps obtained through a text search
- Can we organize these documents in some manner?
  - Page rank offers one solution

This document was uploaded on 03/04/2012.
