4 - WEB CRAWLING Outline Motivation and taxonomy of...

WEB CRAWLING
Outline
- Motivation and taxonomy of crawlers
- Basic crawlers and implementation issues
- Universal crawlers
- Preferential (focused and topical) crawlers
- Crawler ethics and conflicts
Q: How does a search engine know that all these pages contain the query terms?
A: Because all of those pages have been crawled.
Many names
- Crawler
- Spider
- Robot (or bot)
- Web agent
- Wanderer, worm, ...
And famous instances: googlebot, scooter, slurp, msnbot, ...
Motivation for crawlers
- Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
- Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
- Business intelligence: keep track of potential competitors, partners
- Monitor Web sites of interest
- Evil: harvest emails for spamming, phishing
- Can you think of some others?
One taxonomy of crawlers
- Crawlers
  - Universal crawlers
  - Preferential crawlers
    - Focused crawlers
    - Topical crawlers
      - Adaptive topical crawlers
        - Evolutionary crawlers
        - Reinforcement learning crawlers
        - etc.
      - Static crawlers
        - Best-first
        - PageRank
        - etc.
Many other criteria could be used: incremental, interactive, concurrent, etc.
Graph traversal (BFS or DFS?)
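A crawl is a graph traversal of the Web: start from seed URLs and follow hyperlinks outward. Managing the frontier of discovered-but-unfetched URLs as a FIFO queue gives breadth-first search; a LIFO stack gives depth-first. A minimal sketch of a BFS crawl, using a hypothetical in-memory link graph in place of real HTTP fetches:

```python
from collections import deque

# Hypothetical in-memory "web": page -> list of outgoing links.
LINK_GRAPH = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": [],
    "d.html": ["a.html"],
}

def crawl(seed):
    """Breadth-first traversal: the frontier is a FIFO queue.
    Popping from the same end you push to (a stack) would make
    this depth-first instead."""
    frontier = deque([seed])
    visited = set()
    order = []
    while frontier:
        url = frontier.popleft()
        if url in visited:          # may have been queued twice
            continue
        visited.add(url)
        order.append(url)           # "fetch" the page here
        for link in LINK_GRAPH.get(url, []):
            if link not in visited:
                frontier.append(link)
    return order
```

For example, `crawl("a.html")` visits the pages in breadth-first order: a, then a's links, then their links.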
Implementation issues
- Don't want to fetch the same page twice!
  - Keep a lookup table (hash) of visited pages
  - What if a page is not yet visited but is already in the frontier?
- The frontier grows very fast!
  - May need to prioritize for large crawls
- The fetcher must be robust!
  - Don't crash if a download fails
  - Timeout mechanism
- Determine file type to skip unwanted files
  - Can try using extensions, but they are not reliable
  - Can issue HEAD HTTP requests to get Content-Type (MIME) headers, at the cost of extra Internet requests
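One way to handle both duplicate-fetch problems above (already visited, and not visited but already queued) is a single hash of every URL ever enqueued, plus some URL canonicalization so trivially different spellings collide. A sketch, with a deliberately crude `normalize` (names and details are illustrative, not a standard API):

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Crude canonicalization: lowercase scheme and host, default the
    empty path to "/", and drop the fragment, so variant spellings of
    the same page hash to one lookup-table entry."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

class Frontier:
    """Frontier whose `seen` hash covers everything ever enqueued,
    so a URL that is queued-but-unvisited is never enqueued again."""
    def __init__(self):
        self.queue = deque()
        self.seen = set()   # queued or already visited

    def add(self, url):
        url = normalize(url)
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next(self):
        return self.queue.popleft()
```

Real crawlers use much more aggressive canonicalization (removing default ports, resolving `.`/`..` segments, etc.); this only illustrates the lookup-table idea.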
More implementation issues
- Fetching
  - Get only the first 10-100 KB per page
  - Take care to detect and break redirection loops
  - Soft fail for timeouts, server not responding, file not found, and other errors
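The size cap can be enforced while reading the response body, so one huge (or endless) page cannot stall the crawl. A minimal sketch of such a capped read over any file-like stream (the 100 KB limit and chunk size are illustrative choices, not fixed constants):

```python
import io

MAX_BYTES = 100 * 1024   # fetch at most ~100 KB per page

def read_capped(stream, max_bytes=MAX_BYTES):
    """Read at most max_bytes from a file-like response body,
    in chunks, stopping early at end of stream."""
    chunks, remaining = [], max_bytes
    while remaining > 0:
        chunk = stream.read(min(8192, remaining))
        if not chunk:
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

In a real fetcher the stream would be an HTTP response opened with a timeout, and redirect loops would be broken by keeping a set of URLs already seen in the current redirect chain.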
More implementation issues: parsing
- HTML has the structure of a DOM (Document Object Model) tree
- Unfortunately, actual HTML is often incorrect in a strict syntactic sense
  - Crawlers, like browsers, must be robust/forgiving
  - Fortunately there are tools that can help, e.g. tidy.sourceforge.net
- Must pay attention to HTML entities and Unicode in text
- What to do with a growing number of other formats? Flash, SVG, RSS, AJAX
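Python's standard-library `html.parser` illustrates the forgiving behaviour a crawler needs: it does not raise on malformed markup, and it decodes entity references in attribute values. A sketch of a link extractor that survives broken HTML and resolves relative URLs (the example URL and markup are made up):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags, resolving them
    against a base URL. Unclosed tags and stray markup are
    tolerated rather than treated as fatal errors."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

# Deliberately broken HTML: unclosed <p> and <a>, entity in the URL.
broken = '<p>unclosed <a href="/next?a=1&amp;b=2">link<br>'
parser = LinkExtractor("http://example.com/dir/")
parser.feed(broken)
```

After `feed`, `parser.links` holds the absolute URL with `&amp;` decoded to `&`.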
More implementation issues
- Stop words
  - Noise words that do not carry meaning should be eliminated ("stopped") before they are indexed
  - E.g. in English: AND, THE, A, AT, OR, ON, FOR, etc.
  - Typically syntactic markers
  - Typically the most common terms
  - Typically kept in a negative dictionary
    - 10–1,000 elements
    - E.g. http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
  - The parser can detect these right away and disregard them
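Filtering against the negative dictionary can happen during tokenization, as the slide suggests. A sketch with a tiny illustrative stop list (real lists, like the one linked above, run to hundreds of entries):

```python
# Small illustrative negative dictionary; not a standard list.
STOP_WORDS = {"and", "the", "a", "at", "or", "on", "for", "of", "in"}

def index_terms(text):
    """Lowercase, split on whitespace, keep only alphabetic tokens,
    and drop anything in the negative dictionary."""
    return [t for t in text.lower().split()
            if t.isalpha() and t not in STOP_WORDS]
```

For example, `index_terms("The crawler runs on the Web for fun")` keeps only the content-bearing terms.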
