CS345 Data Mining: Crawling the Web
Web Crawling Basics
- Start with a "seed set" of to-visit URLs
- Repeat: get the next URL from the to-visit set, get the page from the web, extract its URLs, and add unseen ones to the to-visit set, tracking visited URLs as you go
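The loop above can be sketched in a few lines of Python; `fetch_page` and `extract_urls` are hypothetical stand-ins for a real HTTP client and HTML link extractor:

```python
# Minimal sketch of the basic crawl loop: a frontier of to-visit URLs,
# a set of visited URLs, and the fetched pages. Illustrative only.
from collections import deque

def crawl(seed_urls, fetch_page, extract_urls, max_pages=100):
    to_visit = deque(seed_urls)          # "to visit urls" (the frontier)
    visited = set()                      # "visited urls"
    pages = {}                           # url -> page content
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()         # get next url
        if url in visited:
            continue
        visited.add(url)
        page = fetch_page(url)           # get page
        pages[url] = page
        for link in extract_urls(page):  # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages
```

With a deque popped from the left this visits pages breadth-first, which matters later when BFS is compared against other crawl orderings.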
Crawling Issues
- Load on web servers
- Insufficient resources to crawl the entire web
  - Which subset of pages to crawl?
- How to keep crawled pages "fresh"?
- Detecting replicated content, e.g., mirrors
- Can't crawl the web from one machine
  - Parallelizing the crawl
Polite Crawling
- Minimize load on web servers by spacing out requests to each server
  - E.g., no more than 1 request to the same server every 10 seconds
- Robot Exclusion Protocol
  - Protocol for giving spiders ("robots") limited access to a website
  - www.robotstxt.org/wc/norobots.html
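Both mechanisms can be sketched with the Python standard library: `urllib.robotparser` implements the Robot Exclusion Protocol, and a per-host timestamp table enforces the request spacing. The 10-second figure comes from the slide; the function names are illustrative:

```python
# Sketch of the two politeness mechanisms: robots.txt checking and
# per-host request spacing. Assumes a single-threaded crawler.
import time
import urllib.robotparser
from urllib.parse import urlparse

MIN_DELAY = 10.0                 # seconds between requests to one host
last_request = {}                # host -> time of last request

def allowed(url, user_agent="*"):
    """Check robots.txt before fetching (Robot Exclusion Protocol)."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()                    # fetches and parses robots.txt
    return rp.can_fetch(user_agent, url)

def wait_for_slot(url):
    """Sleep until at least MIN_DELAY has passed for this URL's host."""
    host = urlparse(url).netloc
    elapsed = time.monotonic() - last_request.get(host, float("-inf"))
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)
    last_request[host] = time.monotonic()
```

In a real crawler the delay table would be shared across fetch threads and robots.txt results would be cached per host rather than re-fetched for every URL.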
Crawl Ordering
- Not enough storage or bandwidth to crawl the entire web
- Visit "important" pages first
- Importance metrics
  - In-degree: more important pages will have more inlinks
  - PageRank: to be discussed later; for now, assume it is a metric we can compute
Crawl Order
- Problem: we don't know the actual in-degree or PageRank of a page until we have crawled the entire web!
- Ordering heuristics
  - Partial in-degree
  - Partial PageRank
  - Breadth-first search (BFS)
  - Random walk (baseline)
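One way to realize the partial in-degree heuristic is a priority queue over the frontier, keyed by how many inlinks to each URL the crawler has seen so far. This is a minimal illustration of the idea, not the implementation from the experiments below:

```python
# Frontier ordered by partial in-degree: the count of inlinks observed
# so far among crawled pages. heapq is a min-heap, so priorities are
# negated; a counter breaks ties so URLs are never compared directly.
import heapq
from itertools import count

class Frontier:
    def __init__(self):
        self.indegree = {}       # url -> inlinks seen so far
        self.heap = []
        self.tie = count()

    def add_link(self, url):
        """Record one more observed inlink to url and (re)queue it."""
        self.indegree[url] = self.indegree.get(url, 0) + 1
        heapq.heappush(self.heap, (-self.indegree[url], next(self.tie), url))

    def pop_best(self, visited):
        """Return the unvisited URL with the highest partial in-degree."""
        while self.heap:
            neg_deg, _, url = heapq.heappop(self.heap)
            # Skip stale entries: already visited, or superseded by a
            # later entry with a higher in-degree count.
            if url in visited or -neg_deg != self.indegree[url]:
                continue
            return url
        return None
```

Re-pushing a URL each time an inlink is seen (and discarding stale heap entries on pop) is a standard trick for priority updates, since heapq has no decrease-key operation.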
stanford.edu experiment
- 179K pages
- [Figure: x% crawled under ordering metric O(u) vs. overlap with the best x% of pages by in-degree]
- Source: Cho et al (1998)
Larger study (328M pages)
- BFS crawling brings in high-quality pages early in the crawl
- Source: Najork and Wiener (2001)
Maintaining freshness
- How often do web pages change?
- What do we mean by freshness?
- What strategy should we use to refresh pages?
How often do pages change?
- Cho et al (2000) experiment
- 270 sites visited (with permission)
  - identified 400 sites with the highest "PageRank"
  - contacted administrators
- 720,000 pages collected
  - 3,000 pages from each site daily
  - start at root, visit breadth-first (get new and old pages)
  - ran only 9pm-6am, with 10 seconds between requests to a site
Average change interval
- [Figure: fraction of pages (0.00-0.35) by average change interval, bucketed as under 1 day, 1 day-1 week, 1 week-1 month, 1 month-4 months, over 4 months]
- Source: Cho et al (2000)
Modeling change
- Assume changes to a web page are a sequence of random events that happen independently at a fixed average rate
- Poisson process with parameter λ
- Let X(t) be a random variable denoting the number of changes in the interval (0, t]
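Under this model X(t) has P(X(t) = k) = e^(-λt)(λt)^k / k!, so the probability that a page has changed at all within time t is 1 - e^(-λt), and the mean time between changes is 1/λ. A few lines of Python illustrate this (the units of λ and t are whatever the crawler measures change rates in, e.g. changes per day):

```python
# Numeric sketch of the Poisson change model: probability a page has
# changed by time t, and the expected interval between changes.
import math

def prob_changed(lam, t):
    """P(at least one change in time t) = 1 - e^(-lam * t)."""
    return 1.0 - math.exp(-lam * t)

def expected_change_interval(lam):
    """Mean time between changes of a Poisson(lam) process is 1/lam."""
    return 1.0 / lam
```

For example, a page with λ = ln 2 changes per day has a 50% chance of having changed after one day, which is one way to schedule refreshes: revisit when the change probability crosses a freshness threshold.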

This note was uploaded on 01/31/2011 for the course CS 345, taught by Professor Dunbar during the Fall '07 term at UC Davis.
