DETECTING PHRASE-LEVEL DUPLICATION ON THE WORLD WIDE WEB
Dennis Fetterly, Mark Manasse, Marc Najork
Microsoft Research
SIGIR'05

CSE 450 Web Mining Seminar
Presented by Liangjie Hong
March 24th, 2008

BACKGROUND
  Types of Spam
    Content Spam
    Link Spam
    Redirection Spam
  Content Spam
    Keyword stuffing
    Hidden text
    Meta stuffing

MOTIVATION
  Keyword Stuffing
    Page duplication
    Word duplication
    Phrase-level duplication
  Characteristics
    Grammatically well-formed
    Generated randomly
    Assembled from various pages

FINDING PHRASE REPLICATION
  Representation of Documents
    Document:   word word word ...                    (n)
    k-phrases:  k-phrase k-phrase k-phrase ...        (n)
    Shingles:   fingerprint fingerprint fingerprint ... (m)
    In their practice, m = 84 and k = 5

FINDING PHRASE REPLICATION
  Popular Shingles
    Numbers & letters
    Navigational text
    Copyright notices
    Machine generated

FINDING PHRASE REPLICATION
  Some Results with Popular Shingles

COVERING SETS
  [Diagram: a document's shingles alongside the shingles of other documents]
  Finding a minimum-size covering set is NP-complete
  Use a greedy heuristic to approximate it
  More likely to add documents from other hosts

COVERING SETS
  Two Examples of Covering Sets

COVERING SETS
  Some Results about Covering Sets

CONCLUSIONS & FUTURE WORK
  A third of the pages on the web consist of more replicated than original content
  A high fraction of non-original phrases typically features machine-generated content
  Most popular phrases are not very interesting
  Provides a way to estimate how original the content is
  Cannot distinguish legitimate from spam content!
...
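The document representation (words, k-phrases, fingerprints) and the greedy covering-set heuristic from the slides can be illustrated with a small sketch. This is not the authors' implementation. The slides give only k = 5 and m = 84; everything else here is an assumption made for illustration: the function names (`k_phrases`, `shingles`, `greedy_cover`), the use of salted BLAKE2b as the fingerprint function, the wrap-around at the end of the document (so that n words yield n phrases), and the choice to keep, for each of the m fingerprint functions, the smallest value over all phrases.

```python
import hashlib
from typing import Dict, List, Set

K = 5   # phrase length, as stated on the slide
M = 84  # fingerprints kept per document, as stated on the slide


def k_phrases(text: str, k: int = K) -> List[str]:
    """Return the k-word phrases of a document.

    Assumption: the document wraps around, so n words give n phrases.
    """
    words = text.split()
    n = len(words)
    return [" ".join(words[(i + j) % n] for j in range(k)) for i in range(n)]


def fingerprint(phrase: str, salt: int) -> int:
    """64-bit fingerprint of a phrase (salted BLAKE2b, an assumed choice)."""
    digest = hashlib.blake2b(
        phrase.encode("utf-8"),
        digest_size=8,
        salt=salt.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def shingles(text: str, k: int = K, m: int = M) -> List[int]:
    """Reduce a document to m fingerprints.

    Assumption: for each of m salted hash functions, keep the minimum
    fingerprint over all k-phrases of the document.
    """
    phrases = k_phrases(text, k)
    if not phrases:
        return []
    return [min(fingerprint(p, i) for p in phrases) for i in range(m)]


def greedy_cover(target: Set[int], others: Dict[str, Set[int]]) -> List[str]:
    """Greedy approximation of a covering set for one document.

    Repeatedly pick the other document that covers the most shingles of
    `target` that are still uncovered; stop when nothing more can be covered.
    """
    uncovered = set(target)
    chosen: List[str] = []
    while uncovered:
        best = max(others, key=lambda d: len(others[d] & uncovered), default=None)
        if best is None or not (others[best] & uncovered):
            break  # remaining shingles appear in no other document
        chosen.append(best)
        uncovered -= others[best]
    return chosen
```

In this sketch a document would be passed through `shingles`, and the resulting set of fingerprints would be handed to `greedy_cover` together with the shingle sets of other documents. The slides' remark about preferring documents from other hosts is not modeled here.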
