Detecting_Phrase_Level_Duplication


DETECTING PHRASE-LEVEL DUPLICATION ON THE WORLD WIDE WEB
Dennis Fetterly, Mark Manasse, Marc Najork
Microsoft Research
SIGIR'05

CSE 450 Web Mining Seminar
Presented by Liangjie Hong
March 24th, 2008

BACKGROUND
  Types of Spam
    - Content Spam
    - Link Spam
    - Redirection Spam
  Content Spam
    - Keyword stuffing
    - Hidden text
    - Meta stuffing

MOTIVATION
  Keyword Stuffing
    - Page duplication
    - Word duplication
    - Phrase-level duplication
  Characteristics
    - Grammatically well-formed
    - Generated randomly
    - Assembled from various pages

FINDING PHRASE REPLICATION
  Representation of Documents (Shingle)
    Document -> word, word, word, ... (n)
             -> k-phrase, k-phrase, k-phrase, ... (n)
             -> fingerprint, fingerprint, fingerprint, ... (m)
    In their practice, m = 84 and k = 5.

FINDING PHRASE REPLICATION
  Popular Shingles
    - Numbers & letters
    - Navigational text
    - Copyright notices
    - Machine generated

FINDING PHRASE REPLICATION
  Some Results with Popular Shingles

COVERING SETS
  [Diagram: a document and several groups of shingles]
    - Finding a minimum-size covering set is NP-complete
    - A greedy heuristic is used to approximate it
    - More likely to add documents from other hosts

COVERING SETS
  Two Examples of Covering Sets

COVERING SETS
  Some Results about Covering Sets

CONCLUSIONS & FUTURE WORK
    - A third of the pages on the web consist of more replicated than original content
    - Pages with a high fraction of non-original phrases typically feature machine-generated content
    - Most popular phrases are not very interesting
    - Provides a way to estimate how original the content is
    - Cannot distinguish legitimate from spam content!
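
The shingling and covering-set steps named in the slides can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. Several details are assumptions: the document is treated as circular so that n words yield n k-phrases (matching the counts on the slide), SHA-1 with a per-function salt stands in for whatever fingerprinting function the paper used, the m fingerprints are chosen by min-wise sampling, and the function names k_phrases, sketch and greedy_cover are hypothetical. The slide's preference for documents from other hosts is not modelled.

```python
import hashlib


def k_phrases(words, k=5):
    """Return every run of k consecutive words.

    Assumption: the document wraps around, so n words give n phrases.
    """
    n = len(words)
    if n == 0:
        return []
    return [" ".join(words[(i + j) % n] for j in range(k)) for i in range(n)]


def fingerprint(phrase, salt):
    """64-bit fingerprint of a phrase under the salt-th hash function.

    Assumption: salted SHA-1 is a stand-in for the real fingerprint function.
    """
    digest = hashlib.sha1(f"{salt}:{phrase}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(words, m=84, k=5):
    """Reduce a document to m fingerprints (its shingles).

    For each of m hash functions, keep the smallest fingerprint
    over all k-phrases (min-wise sampling, an assumption here).
    """
    phrases = k_phrases(words, k)
    if not phrases:
        return []
    return [min(fingerprint(p, salt) for p in phrases) for salt in range(m)]


def greedy_cover(target, others):
    """Greedy approximation of a minimum covering set.

    target: set of shingles of the document to be covered.
    others: dict mapping document id -> set of shingles.
    Returns (chosen document ids, shingles left uncovered).
    """
    uncovered = set(target)
    chosen = []
    while uncovered:
        # Pick the document that covers the most still-uncovered shingles.
        best = max(others, key=lambda d: len(uncovered & others[d]), default=None)
        if best is None or not (uncovered & others[best]):
            break  # nothing else can be covered
        chosen.append(best)
        uncovered -= others[best]
    return chosen, uncovered
```

A caller would compute sketch() for every page, then pass one page's shingle set and the shingle sets of candidate pages to greedy_cover(); the shingles left uncovered give a rough indication of how much of the page is original.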

This note was uploaded on 08/06/2008 for the course CSE 450 taught by Professor Davison during the Spring '08 term at Lehigh University.
