CS345 Data Mining: Crawling the Web

Web Crawling Basics

The basic crawler loop: take the next url from the to-visit set, fetch the page, extract its urls, add unseen urls to the to-visit set, and record the url as visited. Start with a "seed set" of to-visit urls.

Crawling Issues

- Load on web servers
- Insufficient resources to crawl the entire web: which subset of pages to crawl?
- How to keep crawled pages "fresh"?
- Detecting replicated content, e.g. mirrors
- Can't crawl the web from one machine: parallelizing the crawl

Polite Crawling

Minimize load on web servers by spacing out requests to each server, e.g. no more than 1 request to the same server every 10 seconds.

Robot Exclusion Protocol

A protocol for giving spiders ("robots") limited access to a website: www.robotstxt.org/wc/norobots.html

Crawl Ordering

There is not enough storage or bandwidth to crawl the entire web, so visit "important" pages first. Importance metrics:

- In-degree: more important pages will have more inlinks
- PageRank: to be discussed later; for now, assume it is a metric we can compute

Problem: we don't know the actual in-degree or PageRank of a page until we have the entire web. Ordering heuristics:

- Partial in-degree
- Partial PageRank
- Breadth-first search (BFS)
- Random walk (baseline)

stanford.edu experiment, 179K pages. Source: Cho et al (1998).
[Figure: for each ordering metric O(u), overlap with the best x% of pages by in-degree, plotted against the fraction x% crawled]

A larger study (328M pages) found that BFS crawling brings in high-quality pages early in the crawl. Source: Najork and Wiener (2001).

Maintaining Freshness

- How often do web pages change?
- What do we mean by freshness?
- What strategy should we use to refresh pages?

How often do pages change?
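The politeness policy and the Robot Exclusion Protocol above can be sketched with Python's standard urllib.robotparser module. This is a minimal sketch, not the lecture's implementation: the user-agent name is invented, and the 10-second spacing is the figure from the slide.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

MIN_DELAY = 10.0    # seconds between requests to the same server (from the slide)
last_request = {}   # host -> time of the last request to that host


def allowed(url, user_agent="cs345-crawler"):
    """Check the site's robots.txt before fetching url.
    (user_agent is a hypothetical crawler name.)"""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetches and parses robots.txt
    return rp.can_fetch(user_agent, url)


def wait_politely(url):
    """Sleep until at least MIN_DELAY has passed since the last
    request to this url's server, then record the request time."""
    host = urlparse(url).netloc
    elapsed = time.monotonic() - last_request.get(host, -MIN_DELAY)
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)
    last_request[host] = time.monotonic()
```

A real crawler would call `allowed(url)` once per host (caching the parsed robots.txt) and `wait_politely(url)` before every fetch.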
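The partial in-degree heuristic can be illustrated on a toy link graph. TOY_WEB below is an invented stand-in for fetched pages (not data from the lecture); the crawler always fetches the frontier url with the highest in-degree counted among pages seen so far.

```python
from collections import defaultdict

# Invented toy link graph: url -> list of outlinks. A real crawler
# would fetch each page and extract its links instead.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}


def crawl_by_partial_indegree(seed, budget):
    """Crawl up to `budget` pages, always picking the frontier url with
    the highest partial in-degree (inlinks from pages crawled so far)."""
    indegree = defaultdict(int)
    frontier = {seed}
    visited = []
    while frontier and len(visited) < budget:
        # Break ties deterministically by url so runs are reproducible.
        url = max(frontier, key=lambda u: (indegree[u], u))
        frontier.remove(url)
        visited.append(url)
        for out in TOY_WEB.get(url, []):
            indegree[out] += 1  # one more observed inlink for `out`
            if out not in visited:
                frontier.add(out)
    return visited


print(crawl_by_partial_indegree("seed", 4))
```

Swapping the priority function changes the heuristic: a FIFO frontier gives BFS, and replacing partial in-degree with a partial PageRank estimate gives the partial-PageRank ordering.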
Cho et al (2000) experiment:

- Identified the 400 sites with the highest "PageRank" and contacted their administrators; 270 sites were visited (with permission)
- 720,000 pages collected: 3,000 pages from each site daily, starting at the root and visiting breadth-first (to get both new and old pages)
- Ran only 9pm-6am, with 10 seconds between requests to the same site

[Figure: histogram of average change intervals. Fraction of pages (0.00 to 0.35) in each bucket: under 1 day, 1 day-1 week, 1 week-1 month, 1 month-4 months, over 4 months. Source: Cho et al (2000)]

Modeling Change

Assume changes to a web page are a sequence of random events that happen independently at a fixed average rate: a Poisson process with parameter λ. Let X(t) be a random variable denoting the number of changes in any time interval of length t. Then

    Pr[X(t) = k] = e^(-λt) (λt)^k / k!    for k = 0, 1, ...

This is a "memory-less" distribution.

Poisson Processes

Let us compute the expected number of changes in unit time:

    E[X(1)] = Σ_k k e^(-λ) λ^k / k! = λ

λ is therefore the average number of changes in unit time, called the rate parameter.

Time to next event...
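A quick numerical check of the Poisson model above, using an arbitrary example rate (λ = 0.5, i.e. a page that changes on average once every two days, is not from the lecture):

```python
import math


def pr_changes(k, lam, t=1.0):
    """Pr[X(t) = k] for a Poisson process with rate lam:
    e^(-lam*t) * (lam*t)^k / k!"""
    return math.exp(-lam * t) * (lam * t) ** k / math.factorial(k)


lam = 0.5  # example rate: 0.5 changes per day on average

# The probabilities over k sum to 1 (truncated sum, so approximately).
total = sum(pr_changes(k, lam) for k in range(50))

# E[X(1)] = sum_k k * Pr[X(1) = k] recovers lam, matching the
# derivation above.
expected = sum(k * pr_changes(k, lam) for k in range(50))

print(round(total, 6), round(expected, 6))  # 1.0 0.5
```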