SpamHits

# SpamHits - CS345 Data Mining Link Analysis 3 Hubs and...


CS345 Data Mining
Link Analysis 3: Hubs and Authorities, Spam Detection
Anand Rajaraman, Jeffrey D. Ullman

## Problem formulation (1998)

- Suppose we are given a collection of documents on some broad topic
  - e.g., stanford, evolution, iraq
  - perhaps obtained through a text search
- Can we organize these documents in some manner?
  - PageRank offers one solution
  - HITS (Hypertext-Induced Topic Selection) is another, proposed at approximately the same time

## HITS Model

Interesting documents fall into two classes:

1. Authorities are pages containing useful information
   - course home pages
   - home pages of auto manufacturers
2. Hubs are pages that link to authorities
   - course bulletin
   - list of US auto manufacturers

## Idealized view

[Figure: idealized view, with hubs linking to authorities]

## Mutually recursive definition

- A good hub links to many good authorities
- A good authority is linked from many good hubs
- Model using two scores for each node
  - Hub score and Authority score
  - Represented as vectors $h$ and $a$

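## Transition Matrix A

- HITS uses a matrix $A[i, j] = 1$ if page $i$ links to page $j$, 0 if not
- $A^T$, the transpose of $A$, is similar to the PageRank matrix $M$, but $A^T$ has 1’s where $M$ has fractions

## Example

[Figure: link graph over Yahoo (y), M’soft (m), and Amazon (a)]

| A = | y | a | m |
|-----|---|---|---|
| y   | 1 | 1 | 1 |
| a   | 1 | 0 | 1 |
| m   | 0 | 1 | 0 |
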
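The relationship between $A$, $A^T$, and the PageRank matrix $M$ can be illustrated with a short sketch. It assumes the out-links read off the example matrix above (row $i$ lists the pages that $i$ links to); the function names `adjacency_matrix`, `transpose`, and `pagerank_matrix` are hypothetical helpers introduced here, not part of the slides.

```python
from fractions import Fraction

# Page order used for rows and columns: Yahoo, Amazon, M'soft.
pages = ["y", "a", "m"]

# Out-links taken from the rows of the example matrix (an assumption
# about the underlying graph, since the figure itself is not available).
links = {
    "y": ["y", "a", "m"],
    "a": ["y", "m"],
    "m": ["a"],
}


def adjacency_matrix(pages, links):
    """Return A with A[i][j] = 1 if pages[i] links to pages[j], else 0."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    A = [[0] * n for _ in range(n)]
    for src, targets in links.items():
        for dst in targets:
            A[index[src]][index[dst]] = 1
    return A


def transpose(A):
    """Return the transpose of a square matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def pagerank_matrix(A):
    """Return M with M[j][i] = 1/outdeg(i) if page i links to page j.

    M has the same zero pattern as the transpose of A, but each 1 is
    replaced by a fraction so that every column with out-links sums to 1.
    """
    n = len(A)
    out_degree = [sum(row) for row in A]
    M = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if A[i][j]:
                M[j][i] = Fraction(1, out_degree[i])
    return M


A = adjacency_matrix(pages, links)
A_T = transpose(A)
M = pagerank_matrix(A)
```

Exact fractions are used for `M` so that the "1’s versus fractions" contrast stays visible when the matrices are printed side by side.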

## Hub and Authority Equations

- The hub score of page P is proportional to the sum of the authority scores of the pages it links to
  - $h = \lambda A a$
  - Constant $\lambda$ is a scale factor
- The authority score of page P is proportional to the sum of the hub scores of the pages it is linked from
  - $a = \mu A^T h$
  - Constant $\mu$ is a scale factor

## Iterative algorithm

1. Initialize $h$, $a$ to all 1’s
2. $h = A a$
3. Scale $h$ so that its max entry is 1.0
4. $a = A^T h$
5. Scale $a$ so that its max entry is 1.0
6. Continue until $h$, $a$ converge

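The iterative algorithm can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function name `hits`, the convergence tolerance `tol`, and the iteration cap `max_iter` are assumptions introduced here, since the slides only say to continue until the vectors converge.

```python
def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def scale_to_max_one(v):
    """Divide every entry by the largest entry so that the max is 1.0."""
    top = max(v)
    return [x / top for x in v] if top else list(v)


def hits(A, tol=1e-9, max_iter=1000):
    """Return (h, a) for adjacency matrix A, where A[i][j] = 1 if i links to j.

    tol and max_iter are illustrative stopping parameters (assumptions).
    """
    n = len(A)
    A_T = [list(col) for col in zip(*A)]
    h = [1.0] * n  # initialize h to all 1's
    a = [1.0] * n  # initialize a to all 1's
    for _ in range(max_iter):
        new_h = scale_to_max_one(mat_vec(A, a))        # h = A a, then scale
        new_a = scale_to_max_one(mat_vec(A_T, new_h))  # a = A^T h, then scale
        change = max(abs(x - y) for x, y in zip(new_h + new_a, h + a))
        h, a = new_h, new_a
        if change < tol:  # stop once neither vector moves noticeably
            break
    return h, a


# Three-page example: rows and columns ordered Yahoo, Amazon, M'soft.
A = [[1, 1, 1], [1, 0, 1], [0, 1, 0]]
h, a = hits(A)
print([round(x, 3) for x in h])  # [1.0, 0.732, 0.268]
print([round(x, 3) for x in a])  # [1.0, 0.732, 1.0]
```

Scaling by the maximum entry, as the slides do, keeps the scores bounded without changing their relative order; other normalizations (for example, dividing by the vector's length) would give the same ranking.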

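## Example

$$
A = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}
\qquad
A^T = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}
$$

| Score     | Initial |     |      |      | ... | Limit |
|-----------|---------|-----|------|------|-----|-------|
| a(yahoo)  | 1       | 1   | 1    | 1    | ... | 1     |
| a(amazon) | 1       | 1   | 4/5  | 0.75 | ... | 0.732 |
| a(m’soft) | 1       | 1   | 1    | 1    | ... | 1     |
| h(yahoo)  | 1       | 1   | 1    | 1    | ... | 1.000 |
| h(amazon) | 1       | 2/3 | 0.71 | 0.73 | ... | 0.732 |
| h(m’soft) | 1       | 1/3 | 0.29 | 0.27 | ... | 0.268 |
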
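As a cross-check on the limiting values in this example, the fixed point can be worked out by hand. The sketch below assumes the limit has the symmetric form $a = (1, x, 1)$, which is suggested by the first and third rows of $A^T$ being identical; the unknown $x$ is notation introduced here and does not appear in the slides.

$$
\begin{aligned}
A a &= (2 + x,\; 2,\; x)
  &&\Rightarrow\quad h = \left(1,\; \tfrac{2}{2+x},\; \tfrac{x}{2+x}\right) \\
A^T h &= \left(\tfrac{4+x}{2+x},\; \tfrac{2+2x}{2+x},\; \tfrac{4+x}{2+x}\right)
  &&\Rightarrow\quad a = \left(1,\; \tfrac{2+2x}{4+x},\; 1\right) \\
x &= \tfrac{2+2x}{4+x}
  &&\Rightarrow\quad x^2 + 2x - 2 = 0
  \quad\Rightarrow\quad x = \sqrt{3} - 1 \approx 0.732 \\
h &= \left(1,\; \sqrt{3} - 1,\; 2 - \sqrt{3}\right)
  &&\approx (1,\; 0.732,\; 0.268)
\end{aligned}
$$

Under that assumption the result agrees with the rounded limits 0.732 and 0.268 shown in the example.
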
## Existence and Uniqueness

$$
\begin{aligned}
h &= \lambda A a \\
a &= \mu A^T h \\
h &= \lambda \mu \, A A^T h \\
a &= \lambda \mu \, A^T A a
\end{aligned}
$$

Under reasonable assumptions about $A$, the dual iterative algorithm converges to vectors $h^*$ and $a^*$ such that:

- $h^*$ is the principal eigenvector of the matrix $A A^T$
- $a^*$ is the principal eigenvector of the matrix $A^T A$

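The eigenvector claim can be checked numerically on the three-page example. This sketch assumes the exact limits are $h^* = (1, \sqrt{3} - 1, 2 - \sqrt{3})$ and $a^* = (1, \sqrt{3} - 1, 1)$, closed forms chosen here to match the rounded values 0.732 and 0.268 in the example; the helper names are hypothetical.

```python
import math

# Example adjacency matrix; rows and columns ordered Yahoo, Amazon, M'soft.
A = [[1, 1, 1], [1, 0, 1], [0, 1, 0]]


def transpose(M):
    return [list(col) for col in zip(*M)]


def mat_mul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def eigen_ratios(M, v):
    """Return (M v)[k] / v[k] for each k.

    If every ratio is the same number, v is an eigenvector of M and that
    number is its eigenvalue. Assumes v has no zero entries.
    """
    return [mv / x for mv, x in zip(mat_vec(M, v), v)]


A_T = transpose(A)
AAT = mat_mul(A, A_T)  # [[3, 2, 1], [2, 2, 0], [1, 0, 1]]
ATA = mat_mul(A_T, A)  # [[2, 1, 2], [1, 2, 1], [2, 1, 2]]

# Assumed closed-form limits (see lead-in).
h_star = [1.0, math.sqrt(3) - 1, 2 - math.sqrt(3)]
a_star = [1.0, math.sqrt(3) - 1, 1.0]

h_ratios = eigen_ratios(AAT, h_star)
a_ratios = eigen_ratios(ATA, a_star)
print(h_ratios)  # all three ratios are approximately 3 + sqrt(3)
print(a_ratios)  # likewise approximately 3 + sqrt(3)
```

Both products share the eigenvalue $3 + \sqrt{3} \approx 4.732$. Because $A A^T$ and $A^T A$ have only non-negative eigenvalues that sum to the trace (6 here), an eigenvalue above 3 is necessarily the largest, so these are the principal eigenvectors for this example.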

## Bipartite cores

[Figure: hubs and authorities grouped into bipartite cores. Labels: most densely-connected core (primary core); less densely-connected core (secondary core)]

## Secondary cores

- A single topic can have many bipartite cores
