DETECTING PHRASE-LEVEL DUPLICATION ON THE WORLD WIDE WEB
Dennis Fetterly, Mark Manasse, Marc Najork
Microsoft Research
SIGIR'05
CSE 450 Web Mining Seminar
Presented by Liangjie Hong
March 24th, 2008

Types of Spam
  Content Spam
  Link Spam
  Redirection Spam
Content Spam
  Keyword stuffing
  Hidden text
  Meta stuffing

MOTIVATION
  Page duplication
  Word duplication
  Phrase-level duplication
Characteristics
  Grammatically well-formed
  Generated randomly
  Assembled from various pages

FINDING PHRASE REPLICATION
Representation of Documents
  Document:   word word word ...                       (n words)
  k-phrases:  k-phrase k-phrase k-phrase ...           (n k-phrases)
  Shingles:   fingerprint fingerprint fingerprint ...  (m fingerprints)
In their practice, m = 84 and k = 5.

FINDING PHRASE REPLICATION
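The representation above (n words, n k-phrases, fingerprints reduced to m shingles) can be sketched roughly as follows. This is an illustration only, not the authors' code. The function names are hypothetical, BLAKE2b stands in for whatever fingerprint function was actually used, and picking the m shingles as the minimum under m seeded hashes is an assumed min-hash style construction, since the slide gives only the counts. Wrapping phrases around the end of the document is likewise assumed so that n words give n k-phrases.

```python
import hashlib

def k_phrases(words, k=5):
    """Return one k-word phrase per word position (n words -> n phrases).

    Assumption: phrases wrap around the end of the document, which is one
    way to get the n k-phrases shown on the slide.
    """
    n = len(words)
    if n == 0:
        return []
    return [" ".join(words[(i + j) % n] for j in range(k)) for i in range(n)]

def fingerprint(text, seed=0):
    """64-bit fingerprint of a phrase (BLAKE2b is a stand-in, not the original)."""
    data = f"{seed}:{text}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def shingles(text, k=5, m=84):
    """Reduce a document to m shingles (assumed min-hash style selection)."""
    phrases = k_phrases(text.lower().split(), k)
    if not phrases:
        return []
    # For each of the m seeded hash functions keep the smallest fingerprint.
    return [min(fingerprint(p, seed) for p in phrases) for seed in range(m)]
```

With the slide's parameters, `shingles(page_text)` yields 84 integers per page regardless of page length, which is what makes documents cheap to compare at web scale.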
Sources of popular shingles:
  Numbers & letters
  Navigational text
  Copyright notices
  Machine generated

FINDING PHRASE REPLICATION
Some Results with Popular Shingles

COVERING SETS
[Figure: a Document's shingles (Shingle, Shingle, Shingle, ...) alongside the shingle lists of other documents]
  Finding a minimum-size covering set is NP-complete
  A greedy heuristic is used to approximate it
  The heuristic is more likely to add documents from other hosts

COVERING SETS
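The greedy approximation mentioned above can be sketched as the standard greedy set-cover loop: repeatedly add the document that covers the most still-uncovered shingles. This is a minimal sketch under stated assumptions, not the authors' implementation. `greedy_cover`, `hosts` and `target_host` are hypothetical names, and breaking ties in favor of documents from other hosts is only one guess at how the preference for other hosts might be realized.

```python
def greedy_cover(target, candidates, hosts=None, target_host=None):
    """Greedy set cover of one document's shingles by other documents.

    target:      set of shingles of the document to cover
    candidates:  dict doc_id -> set of shingles
    hosts:       optional dict doc_id -> host name (hypothetical input)
    Returns (list of chosen doc_ids, set of shingles left uncovered).
    """
    hosts = hosts or {}
    uncovered = set(target)
    remaining = dict(candidates)
    cover = []
    while uncovered and remaining:
        def score(doc_id):
            gain = len(uncovered & remaining[doc_id])
            # Assumed tie-break: prefer documents hosted elsewhere.
            other_host = hosts.get(doc_id) != target_host
            return (gain, other_host)
        best = max(remaining, key=score)
        if not uncovered & remaining[best]:
            break  # no candidate covers anything that is still uncovered
        cover.append(best)
        uncovered -= remaining.pop(best)
    return cover, uncovered
```

A small covering set means the page can be assembled from only a few other pages. The returned `uncovered` set holds the shingles that appear in no candidate document.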
Two Examples of Covering Sets

COVERING SETS
Some Results about Covering Sets

CONCLUSIONS & FUTURE WORK
  A third of the pages on the web consist of more replicated than original content
  Pages with a high fraction of non-original phrases typically feature machine-generated content
  Most popular phrases are not very interesting
  Provides a way to estimate how original the content is
  Cannot distinguish legitimate from spam content!
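One simple way to read "how original the content is" is as the fraction of a page's shingles that also occur on some other page. The sketch below assumes that definition; the slides do not spell out the exact measure. `replicated_fractions` is a hypothetical helper that works on a small in-memory corpus.

```python
from collections import Counter

def replicated_fractions(corpus):
    """Map doc_id -> fraction of its distinct shingles seen in another doc.

    corpus: dict doc_id -> iterable of shingles (hypothetical input format).
    """
    # Count in how many documents each shingle occurs.
    doc_count = Counter()
    for doc_shingles in corpus.values():
        doc_count.update(set(doc_shingles))

    fractions = {}
    for doc_id, doc_shingles in corpus.items():
        distinct = set(doc_shingles)
        if not distinct:
            fractions[doc_id] = 0.0
            continue
        shared = sum(1 for s in distinct if doc_count[s] > 1)
        fractions[doc_id] = shared / len(distinct)
    return fractions
```

Under this reading, a page with a fraction above 0.5 has more replicated than original content. As the last conclusion says, a high value alone does not show that the page is spam.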
This note was uploaded on 08/06/2008 for the course CSE 450 taught by Professor Davison during the Spring '08 term at Lehigh University.