DUSTBuster

DUSTBuster - Ziv Bar-Yossef, Idit Keidar, Uri Schonfeld...

Ziv Bar-Yossef, Idit Keidar, Uri Schonfeld (WWW’07)
CSE 450 Web Mining
Presented by Zaihan Yang
- Propose a novel algorithm, DustBuster, for uncovering DUST.
- Discover DUST rules from a URL list.
- Focus mainly on substring substitution rules.
- Introduce 3 heuristic methods.
- Eliminate redundant rules.
- Validate DUST rules.
- Use DUST rules to transform URLs into canonical form.
- Main feature: mine DUST from crawl logs or web server logs instead of examining page content.
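To make the idea of a substring substitution rule concrete, here is a minimal sketch of how such rules might be used to bring URLs into a canonical form. It is an illustration under stated assumptions, not the DustBuster implementation: the two rules, the `example.com` URLs, and the function names `apply_rule` and `canonicalize` are all hypothetical, and the sketch assumes the rules have already been discovered and validated.

```python
from typing import List, Tuple

# A rule (alpha, beta) means: replace the substring alpha with beta.
Rule = Tuple[str, str]


def apply_rule(url: str, rule: Rule) -> str:
    """Replace the first occurrence of alpha in url with beta (no-op if absent)."""
    alpha, beta = rule
    return url.replace(alpha, beta, 1)


def canonicalize(url: str, rules: List[Rule], max_passes: int = 10) -> str:
    """Apply every rule in order, repeating until the URL stops changing.

    max_passes is a safety cap (an assumption of this sketch) so that a pair
    of rules that undo each other cannot loop forever.
    """
    for _ in range(max_passes):
        previous = url
        for rule in rules:
            url = apply_rule(url, rule)
        if url == previous:
            break
    return url


# Hypothetical rules for a hypothetical site; real rules are site-specific.
RULES: List[Rule] = [("http://www.", "http://"), ("/index.html", "/")]

urls = ["http://www.example.com/index.html", "http://example.com/"]
canonical = {canonicalize(u, RULES) for u in urls}
print(canonical)
```

Both input URLs collapse to a single canonical URL, so a crawler that keys its visited set on the canonical form would fetch the page once instead of twice.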
Problem Identification
- What is DUST? Different URLs with Similar Text, e.g. http://google.com/news and http://news.google.com.
- How is it generated? Aliases, redirections, dynamically generated pages, etc.
- Features of DUST: Not random: it follows certain rules. Not universal: the rules are specific to each web site.
- Why uncover it? Reduces overhead in crawling, indexing, and caching. Increases the accuracy of page metrics such as PageRank.
Problem Definition
- URL: a string over the alphabet Σ that starts with “^” and ends with “$”.
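One way to read this definition is that the “^” and “$” markers let a substitution rule refer to the very start or very end of a URL. The sketch below illustrates that reading only; the helper names (`to_anchored`, `from_anchored`, `apply_anchored_rule`), the example rule, and the `example.com` URLs are assumptions made for illustration and do not come from the slides.

```python
def to_anchored(url: str) -> str:
    """Wrap a raw URL in the start marker '^' and the end marker '$'."""
    return "^" + url + "$"


def from_anchored(anchored: str) -> str:
    """Strip the markers again; reject strings that lack them."""
    if not (anchored.startswith("^") and anchored.endswith("$")):
        raise ValueError("expected a string of the form ^...$")
    return anchored[1:-1]


def apply_anchored_rule(url: str, alpha: str, beta: str) -> str:
    """Apply the rule alpha -> beta to the anchored form of url.

    Because alpha may contain '^' or '$', a rule can be restricted to the
    beginning or the end of the URL.
    """
    return from_anchored(to_anchored(url).replace(alpha, beta, 1))


# Hypothetical rule: drop a trailing "/index.html", but only at the end.
print(apply_anchored_rule("http://example.com/index.html", "/index.html$", "/$"))
print(apply_anchored_rule("http://example.com/index.html/archive", "/index.html$", "/$"))
```

In the first call the rule matches because "/index.html" is at the end of the URL; in the second it does not, so the URL is returned unchanged.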