CS345 Data Mining
Crawling the Web

Web Crawling Basics
  - Start with a “seed set” of to-visit urls
  - Repeat: get next url, get page (from the Web), extract urls
  - State kept by the crawler: to-visit urls, visited urls, web pages
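
The loop on this slide can be sketched in a few lines. This is a minimal illustration, not a production crawler: the `TOY_WEB` dictionary and the `fetch` helper are hypothetical stand-ins for real HTTP requests and link extraction, and the to-visit list is assumed to be a simple FIFO queue.

```python
from collections import deque

# Hypothetical in-memory "web": url -> list of urls the page links to.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

def fetch(url, web):
    """Stand-in for an HTTP fetch plus link extraction (assumption)."""
    return web.get(url)

def crawl(seeds, web, max_pages=100):
    to_visit = deque(seeds)   # to-visit urls, starting from the seed set
    visited = set()           # visited urls
    pages = {}                # stored web pages (here: just their outlinks)
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()          # get next url
        if url in visited:
            continue
        visited.add(url)
        links = fetch(url, web)           # get page
        if links is None:                 # page could not be fetched
            continue
        pages[url] = links
        for link in links:                # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages

crawled = crawl(["a"], TOY_WEB)
```

Swapping the FIFO queue for a different data structure changes the order in which pages are visited without changing the rest of the loop.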

Crawling Issues
  - Load on web servers
  - Insufficient resources to crawl entire web
      - Which subset of pages to crawl?
  - How to keep crawled pages “fresh”?
  - Detecting replicated content, e.g., mirrors
  - Can’t crawl the web from one machine
      - Parallelizing the crawl

Polite Crawling
  - Minimize load on web servers by spacing out requests to each server
      - E.g., no more than 1 request to the same server every 10 seconds
  - Robot Exclusion Protocol
      - Protocol for giving spiders (“robots”) limited access to a website
      - www.robotstxt.org/wc/norobots.html
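
Both politeness rules can be sketched with the Python standard library. The `PolitenessPolicy` class and `allowed_by_robots` helper below are hypothetical names introduced for illustration; the 10-second delay is the slide's example value, and the caller is assumed to supply the current time (for example from `time.monotonic()`) and the text of the site's robots.txt.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class PolitenessPolicy:
    """Tracks, per server, the earliest time the next request may be sent."""

    def __init__(self, min_delay=10.0):
        self.min_delay = min_delay
        self.next_allowed = {}   # host -> earliest allowed request time

    def wait_time(self, url, now):
        """Seconds the crawler should still wait before requesting url."""
        host = urlparse(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_request(self, url, now):
        """Note that a request to url's server was sent at time now."""
        host = urlparse(url).netloc
        self.next_allowed[host] = now + self.min_delay

def allowed_by_robots(robots_txt, user_agent, url):
    """Check a url against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

policy = PolitenessPolicy(min_delay=10.0)
policy.record_request("http://example.com/a", now=0.0)
remaining = policy.wait_time("http://example.com/b", now=4.0)  # same server

robots = "User-agent: *\nDisallow: /private/\n"
ok = allowed_by_robots(robots, "mybot", "http://example.com/public/x")
```

Keeping the delay per host rather than global lets the crawler stay busy on other servers while one server is cooling down.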

Crawl Ordering
  - Not enough storage or bandwidth to crawl entire web
  - Visit “important” pages first
  - Importance metrics
      - In-degree: more important pages will have more inlinks
      - Page Rank: to be discussed later; for now, assume it is a metric we can compute

Crawl Order
  - Problem: we don’t know the actual in-degree or page rank of a page until we have the entire web!
  - Ordering heuristics
      - Partial in-degree
      - Partial page rank
      - Breadth-first search (BFS)
      - Random Walk (baseline)
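
As an illustration of the partial in-degree heuristic, the sketch below always crawls next the frontier url with the most inlinks seen so far. The toy link graph and the function name are hypothetical, ties are assumed to be broken by discovery order, and the linear scan over the frontier would be replaced by a priority queue at any realistic scale.

```python
from collections import defaultdict

# Hypothetical toy link graph: url -> outgoing links.
TOY_GRAPH = {
    "seed": ["a", "b", "c"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["d"],
    "d": [],
}

def crawl_by_partial_indegree(seeds, web, max_pages):
    indegree = defaultdict(int)   # inlinks counted from pages crawled so far
    frontier = list(seeds)        # discovered but not yet crawled
    seen = set(seeds)
    order = []                    # crawl order
    while frontier and len(order) < max_pages:
        # Highest partial in-degree first; max() keeps the earliest
        # discovered url when several are tied.
        best = max(frontier, key=lambda u: indegree[u])
        frontier.remove(best)
        order.append(best)
        for link in web.get(best, []):
            indegree[link] += 1
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

order = crawl_by_partial_indegree(["seed"], TOY_GRAPH, max_pages=10)
# order is ["seed", "a", "c", "b", "d"]; plain BFS would visit b before c.
```

On this graph the page "c" is pulled ahead of "b" once a second inlink to it has been seen, which is the behavior the heuristic is meant to produce.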

stanford.edu experiment
  - 179K pages
  - [Figure: overlap with the best x% of pages by in-degree, plotted against x% crawled, with pages ordered by O(u)]
  - Source: Cho et al (1998)

Larger study (328M pages)
  - BFS crawling brings in high quality pages early in the crawl
  - Source: Najork and Wiener (2001)

Maintaining freshness
  - How often do web pages change?
  - What do we mean by freshness?
  - What strategy should we use to refresh pages?

How often do pages change?
  - Cho et al (2000) experiment
  - 270 sites visited (with permission)
      - identified 400 sites with highest “PageRank”
      - contacted administrators
  - 720,000 pages collected
      - 3,000 pages from each site daily
      - start at root, visit breadth first (get new & old pages)
      - ran only 9pm - 6am, 10 seconds between site requests

Average change interval
  - [Figure: bar chart, vertical axis from 0.00 to 0.35, with one bar per average change interval: 1 day, 1 day - 1 week, 1 week - 1 month, 1 month - 4 months, 4 months+]
  - Source: Cho et al (2000)

Modeling change
  - Assume changes to a web page are a sequence of random events that happen independently at a fixed average rate
  - Poisson process with parameter λ
  - Let X(t) be a random variable denoting the number of changes in any time interval t
      $\Pr[X(t) = k] = e^{-\lambda t} (\lambda t)^k / k!$ for k = 0, 1, …
  - “Memory-less” distribution
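
The formula on this slide is easy to evaluate numerically. The sketch below is only an illustration under the slide's Poisson assumption; the function names and the example rate (one change every two days on average) are hypothetical.

```python
import math

def poisson_pmf(k, lam, t):
    """Pr[X(t) = k]: probability of exactly k changes in an interval of
    length t, for a Poisson process with rate lam."""
    mu = lam * t
    return math.exp(-mu) * mu ** k / math.factorial(k)

def prob_unchanged(lam, t):
    """Probability that a page has not changed at all after time t."""
    return poisson_pmf(0, lam, t)

# Hypothetical page that changes on average once every 2 days (lam = 0.5/day).
lam = 0.5
p_same_after_2_days = prob_unchanged(lam, 2.0)   # equals e^{-1}
total = sum(poisson_pmf(k, lam, 2.0) for k in range(50))  # close to 1
```

The k = 0 case is the one a crawler cares about most: it is the chance that a stored copy is still fresh t time units after it was fetched.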

Poisson processes
  - Let us compute the expected number of changes in unit time
      $E[X(1)] = \sum_{k} k \, e^{-\lambda} \lambda^k / k!$
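
The remaining steps are not shown in the text above. One standard way to evaluate the sum, using only the series $e^{\lambda} = \sum_{j \ge 0} \lambda^j / j!$, is sketched here:

```latex
\begin{align*}
E[X(1)] &= \sum_{k=0}^{\infty} k \, e^{-\lambda} \frac{\lambda^k}{k!}
         = e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^k}{(k-1)!} \\
        &= \lambda e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!}
         = \lambda e^{-\lambda} e^{\lambda}
         = \lambda
\end{align*}
```

So under this model the parameter λ is the expected number of changes per unit time.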