CS345 Data Mining
Link Analysis 3: Hubs and Authorities
Spam Detection
Anand Rajaraman, Jeffrey D. Ullman

Problem formulation (1998)
- Suppose we are given a collection of documents on some broad topic
  - e.g., stanford, evolution, iraq
  - perhaps obtained through a text search
- Can we organize these documents in some manner?
  - PageRank offers one solution
  - HITS (Hypertext-Induced Topic Selection) is another, proposed at approximately the same time

HITS Model
- Interesting documents fall into two classes
  1. Authorities are pages containing useful information
     - course home pages
     - home pages of auto manufacturers
  2. Hubs are pages that link to authorities
     - course bulletin
     - list of US auto manufacturers

Idealized view
[Figure: idealized view, with nodes labeled Hubs and Authorities]

Mutually recursive definition
- A good hub links to many good authorities
- A good authority is linked from many good hubs
- Model using two scores for each node
  - Hub score and Authority score
  - Represented as vectors $h$ and $a$

Transition Matrix A
- HITS uses a matrix $A[i, j] = 1$ if page $i$ links to page $j$, 0 if not
- $A^T$, the transpose of $A$, is similar to the PageRank matrix $M$, but $A^T$ has 1s where $M$ has fractions

Example
[Figure: link graph on the pages Yahoo, Amazon, Msoft]

$$
A = \begin{array}{c|ccc}
  & y & a & m \\ \hline
y & 1 & 1 & 1 \\
a & 1 & 0 & 1 \\
m & 0 & 1 & 0
\end{array}
$$

Hub and Authority Equations
- The hub score of page P is proportional to the sum of the authority scores of the pages it links to
  - $h = \lambda A a$
  - Constant $\lambda$ is a scale factor
- The authority score of page P is proportional to the sum of the hub scores of the pages it is linked from
  - $a = \mu A^T h$
  - Constant $\mu$ is a scale factor

Iterative algorithm
- Initialize $h$, $a$ to all 1s
- $h = A a$
- Scale $h$ so that its max entry is 1.0
- $a = A^T h$
- Scale $a$ so that its max entry is 1.0
- Continue until $h$, $a$ converge

Example

$$
A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}
\qquad
A^T = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
$$

Successive values of the authority scores:
- a(yahoo) = 1, 1, 1, 1, ..., 1
- a(amazon) = 1, 1, 4/5, 0.75, ..., 0.732
- a(msoft) = 1, 1, 1, 1, ..., 1

Successive values of the hub scores:
- h(yahoo) = 1, 1, 1, 1, ..., 1.000
- h(amazon) = 1, 2/3, 0.71, 0.73, ..., 0.732
- h(msoft) = 1, 1/3, 0.29, 0.27, ..., 0.268

Existence and Uniqueness
- $h = \lambda A a$
- $a = \mu A^T h$
- $h = \lambda \mu A A^T h$
- $a = \lambda \mu A^T A a$
- Under reasonable assumptions about $A$, the dual iterative algorithm converges to vectors $h^*$ and $a^*$ such that:
  - $h^*$ is the principal eigenvector of the matrix $A A^T$
  - $a^*$ is the principal eigenvector of the matrix $A^T A$

Bipartite cores
[Figure: Hubs and Authorities, showing the most densely-connected core (primary core) and a less densely-connected core (secondary core)]

...
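
The iterative algorithm from the slides (initialize both vectors to 1s, alternate the hub and authority updates, rescale each so its max entry is 1.0) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the names `hits` and `scale_to_max`, the parameters `tol` and `max_iter`, and the "largest change below a tolerance" stopping rule are all assumptions made here. The link matrix is the Yahoo/Amazon/Msoft example, in the order y, a, m.

```python
def scale_to_max(vec):
    """Divide every entry by the largest entry so the max becomes 1.0."""
    top = max(vec)
    # Guard against an all-zero vector (e.g. a graph with no links).
    return [v / top for v in vec] if top > 0 else list(vec)


def hits(links, tol=1e-9, max_iter=1000):
    """Alternate h = A a and a = A^T h, rescaling after each step.

    links[i][j] is 1 if page i links to page j, else 0.
    tol and max_iter are illustrative stopping parameters (assumed).
    """
    n = len(links)
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(max_iter):
        # Hub score of i: sum of authority scores of the pages i links to.
        new_hub = scale_to_max(
            [sum(links[i][j] * auth[j] for j in range(n)) for i in range(n)]
        )
        # Authority score of j: sum of hub scores of the pages linking to j.
        new_auth = scale_to_max(
            [sum(links[i][j] * new_hub[i] for i in range(n)) for j in range(n)]
        )
        change = max(
            abs(new - old)
            for new, old in zip(new_hub + new_auth, hub + auth)
        )
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth


# Rows and columns in the order yahoo, amazon, msoft.
A = [
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
]

hub, auth = hits(A)
print([round(x, 3) for x in hub])   # [1.0, 0.732, 0.268]
print([round(x, 3) for x in auth])  # [1.0, 0.732, 1.0]
```

The printed values match the limits in the worked example: 0.732 is √3 − 1 and 0.268 is 2 − √3, the fixed point of the rescaled updates for this graph.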
- Fall '09
- Data Mining