CS345 Data Mining
Link Analysis 3: Hubs and Authorities, Spam Detection
Anand Rajaraman, Jeffrey D. Ullman

Problem formulation (1998)
• Suppose we are given a collection of documents on some broad topic
  - e.g., stanford, evolution, iraq
  - perhaps obtained through a text search
• Can we organize these documents in some manner?
  - PageRank offers one solution
  - HITS (Hypertext-Induced Topic Selection) is another, proposed at approximately the same time

HITS Model
• Interesting documents fall into two classes
  1. Authorities are pages containing useful information
     - course home pages
     - home pages of auto manufacturers
  2. Hubs are pages that link to authorities
     - course bulletin
     - list of US auto manufacturers

Idealized view
[Figure: idealized view, with groups labeled Hubs and Authorities]

Mutually recursive definition
• A good hub links to many good authorities
• A good authority is linked from many good hubs
• Model using two scores for each node
  - Hub score and Authority score
  - Represented as vectors h and a

Transition Matrix A
• HITS uses a matrix A[i, j] = 1 if page i links to page j, 0 if not
• A^T, the transpose of A, is similar to the PageRank matrix M, but A^T has 1's where M has fractions

Example
[Figure: link graph over Yahoo, Amazon, and M'soft]

         y  a  m
    y  [ 1  1  1 ]
A = a  [ 1  0  1 ]
    m  [ 0  1  0 ]

Hub and Authority Equations
• The hub score of page P is proportional to the sum of the authority scores of the pages it links to
  - h = λ A a
  - Constant λ is a scale factor
• The authority score of page P is proportional to the sum of the hub scores of the pages it is linked from
  - a = μ A^T h
  - Constant μ is a scale factor

Iterative algorithm
• Initialize h, a to all 1's
• h = A a
• Scale h so that its max entry is 1.0
• a = A^T h
• Scale a so that its max entry is 1.0
• Continue until h, a converge

Example

    [ 1  1  1 ]          [ 1  1  0 ]
A = [ 1  0  1 ]    A^T = [ 1  0  1 ]
    [ 0  1  0 ]          [ 1  1  0 ]

a(yahoo)  =  1    1    1      1      ...   1
a(amazon) =  1    1    4/5    0.75   ...   0.732
a(m'soft) =  1    1    1      1      ...   1

h(yahoo)  =  1    1      1      1      ...   1.000
h(amazon) =  1    2/3    0.71   0.73   ...   0.732
h(m'soft) =  1    1/3    0.29   0.27   ...   0.268

Existence and Uniqueness

h = λ A a
a = μ A^T h
h = λμ A A^T h
a = λμ A^T A a

Under reasonable assumptions about A, the dual iterative algorithm converges to vectors h* and a* such that:
• h* is the principal eigenvector of the matrix A A^T
• a* is the principal eigenvector of the matrix A^T A

Bipartite cores
[Figure: hubs and authorities, showing the most densely-connected core (primary core) and a less densely-connected core (secondary core)]
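
The steps on the "Iterative algorithm" slide can be tried on the Yahoo/Amazon/M'soft example with a short plain-Python sketch. This is an illustration under stated assumptions, not code from the course: the function names (`hits`, `mat_vec`, `transpose`, `scale_to_max_one`), the tolerance `tol`, and the iteration cap `max_iter` are hypothetical choices, since the slides only say to continue until h and a converge. The sketch also assumes the graph has at least one link, so the scaling step never divides by zero.

```python
def mat_vec(matrix, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vec)) for row in matrix]


def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def scale_to_max_one(vec):
    """Scale a vector so that its largest entry is 1.0.

    Assumes at least one entry is positive (the graph has some link).
    """
    largest = max(vec)
    return [v / largest for v in vec]


def hits(adjacency, tol=1e-10, max_iter=1000):
    """Run the hub/authority iteration on adjacency[i][j] = 1 if i links to j.

    tol and max_iter are illustrative stopping parameters (assumptions).
    Returns (h, a), the hub and authority score vectors.
    """
    n = len(adjacency)
    adjacency_t = transpose(adjacency)
    h = [1.0] * n  # initialize h to all 1's
    a = [1.0] * n  # initialize a to all 1's
    for _ in range(max_iter):
        new_h = scale_to_max_one(mat_vec(adjacency, a))        # h = A a, then scale
        new_a = scale_to_max_one(mat_vec(adjacency_t, new_h))  # a = A^T h, then scale
        # Largest change in any hub or authority score this round.
        delta = max(abs(x - y) for x, y in zip(new_h + new_a, h + a))
        h, a = new_h, new_a
        if delta < tol:
            break
    return h, a


# Example matrix from the slides; rows and columns are ordered y, a, m.
A = [
    [1, 1, 1],  # Yahoo links to Yahoo, Amazon, M'soft
    [1, 0, 1],  # Amazon links to Yahoo, M'soft
    [0, 1, 0],  # M'soft links to Amazon
]

h, a = hits(A)
print([round(x, 3) for x in h])  # [1.0, 0.732, 0.268]
print([round(x, 3) for x in a])  # [1.0, 0.732, 1.0]
```

With this matrix the scores settle at h ≈ (1, 0.732, 0.268) and a ≈ (1, 0.732, 1), the limiting values shown in the example table. Calling `hits(A, max_iter=1)` returns the first-round values h = (1, 2/3, 1/3) and a = (1, 4/5, 1).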
- Fall '09
- Data Mining