s10-crawling-clustering

3/23 Agenda: Engineering Issues (Crawling; Connection Server; Distributed Indexing; Map-Reduce)
HW 3 due on Thu 3/25
Midterm on Tu 3/30
Project 2 due on 4/6
Engineering Issues
• Crawling
• Distributed Index Generation
• Connectivity Serving
• Compressing everything
Crawlers: Main issues
• General-purpose crawling
• Context-specific crawling
• Building topic-specific search engines…
Web Crawling (Search) Strategy
• Starting location(s)
• Traversal order: depth first, breadth first, or ???
• Cycles?
• Coverage?
• Load?
[Figure: diagram with nodes labeled b through j]
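The traversal-order and cycle questions on this slide can be made concrete with a small sketch. It assumes a hypothetical in-memory link graph, `LINKS`, in place of real fetching; the node names echo the figure labels, but the edges are invented for illustration. The only difference between the two orders is which end of the frontier the next page is taken from, and the `seen` set is what keeps cycles from looping forever.

```python
from collections import deque

# Hypothetical toy link graph: page -> pages it links to.
# The edges are invented; they are not taken from the slide's figure.
LINKS = {
    "b": ["c", "d"],
    "c": ["e", "b"],   # c links back to b: a cycle
    "d": ["f"],
    "e": [],
    "f": ["b"],        # another cycle
}

def crawl(seeds, links, breadth_first=True):
    """Return pages in the order visited, starting from the seed pages."""
    frontier = deque(seeds)
    seen = set(seeds)          # guards against cycles and duplicate fetches
    order = []
    while frontier:
        # FIFO gives breadth-first order; LIFO gives a depth-first-style order.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order

print(crawl(["b"], LINKS))                       # breadth first
print(crawl(["b"], LINKS, breadth_first=False))  # depth-first style
```

Coverage and load are not modeled here; a real crawler would also bound the frontier size and rate-limit requests per host.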
Robot (2)
Some specific issues:
1. What initial URLs to use? The choice depends on the type of search engine to be built.
• For general-purpose search engines, use URLs that are likely to reach a large portion of the Web, such as the Yahoo home page.
• For local search engines covering one or several organizations, use the URLs of these organizations' home pages. In addition, use an appropriate domain constraint.
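One possible reading of the "domain constraint" for a local search engine is a host filter applied to every extracted URL before it enters the frontier. The sketch below assumes that reading; the function name `in_allowed_domain` and the example domains are hypothetical.

```python
from urllib.parse import urlsplit

def in_allowed_domain(url, allowed_domains):
    """True if the URL's host is one of the allowed domains or a subdomain."""
    host = (urlsplit(url).hostname or "").lower()
    # Match on a label boundary so "example.edu.evil.com" is not accepted.
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

allowed = {"example.edu"}  # hypothetical organization domain
print(in_allowed_domain("http://www.example.edu/courses", allowed))
print(in_allowed_domain("http://example.edu.evil.com/", allowed))
```

Matching on the parsed host rather than on the raw URL string avoids accepting URLs that merely contain the domain somewhere in their path or query.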
Robot (7)
Several research issues about robots:
• Fetching more important pages first with limited resources. Can use measures of page importance.
• Fetching web pages in a specified subject area, such as movies or sports, for creating domain-specific search engines. This is focused crawling.
• Efficient re-fetching of web pages to keep the web page index up to date. Requires keeping track of the change rate of a page.
Storing Summaries
• Can’t store complete page text: the whole WWW doesn’t fit on any server
• Stop words
• Stemming
• What (compact) summary should be stored?
  Per URL: title, snippet
  Per word: URL, word number
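As a rough illustration of the per-word summary (URL, word number), the sketch below builds postings after dropping stop words and applying a deliberately crude suffix stripper. The stop-word list, `crude_stem`, and `index_page` are all illustrative assumptions; a real system would use a proper stemmer and a much larger stop list.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}  # illustrative only

def crude_stem(word):
    """Strip a couple of common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_page(url, text, index):
    """Add (url, word number) postings for each non-stop-word in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    for position, word in enumerate(words):
        if word in STOP_WORDS:
            continue  # stop words take no space in the index
        index.setdefault(crude_stem(word), []).append((url, position))
    return index

index = index_page("u1", "Crawling the web pages", {})
print(index)
```

Word numbers count every token, including skipped stop words, so positions still reflect where a word sits in the original page.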
Mercator’s way of maintaining the URL frontier
• Extracted URLs enter the front queues. Each URL goes into a front queue based on its priority (assigned based on page importance and change rate).
• URLs are shifted from front to back queues. Each back queue corresponds to a single host, and each queue has a time t_e at which the host can be hit again.
• URLs are removed from a back queue when the crawler wants a page to crawl.
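The front/back queue idea above can be sketched in a few dozen lines. This is a simplified illustration, not the actual Mercator design: the class name `Frontier`, the fixed per-host `delay`, and the strict highest-priority-first refill are assumptions made for brevity, and URLs are assumed to be absolute http(s) URLs so that every one has a host.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit

class Frontier:
    """Simplified front/back queue frontier (not the full Mercator design)."""

    def __init__(self, num_priorities=3, delay=2.0):
        self.front = [deque() for _ in range(num_priorities)]  # index 0 = highest
        self.back = {}      # host -> deque of URLs waiting for that host
        self.ready = []     # heap of (t_e, host): earliest next-hit time
        self.next_ok = {}   # host -> t_e remembered after its queue empties
        self.delay = delay  # assumed politeness gap per host, in seconds

    def add(self, url, priority):
        self.front[priority].append(url)

    def _refill(self, now):
        # Shift URLs from front to back queues, highest priority first.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlsplit(url).hostname
                if host not in self.back:
                    self.back[host] = deque()
                    t_e = self.next_ok.get(host, now)
                    heapq.heappush(self.ready, (t_e, host))
                self.back[host].append(url)

    def next_url(self, now):
        """Return a URL whose host may be hit at time `now`, else None."""
        self._refill(now)
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.back[host].popleft()
        self.next_ok[host] = now + self.delay
        if self.back[host]:
            heapq.heappush(self.ready, (now + self.delay, host))
        else:
            del self.back[host]
        return url

f = Frontier(delay=2.0)
f.add("http://a.com/1", 1)
f.add("http://a.com/2", 0)
f.add("http://b.com/1", 1)
print(f.next_url(0.0), f.next_url(0.5), f.next_url(1.0), f.next_url(2.0))
```

The heap keyed on t_e is what enforces politeness: a host only reappears at the top once its delay has elapsed, no matter how many high-priority URLs it has waiting.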
Robot (4)
2. How to extract URLs from a web page? Need to identify all possible tags and attributes that hold URLs.
• Anchor tag: <a href="URL" …> … </a>
• Option tag: <option value="URL" …> … </option>
• Map: <area href="URL" …>
• Frame: <frame src="URL" …>
• Link to an image: <img src="URL" …>
• Relative path vs. absolute path: <base href= …>
“Path Ascending Crawlers” – ascend the path of the URL to see whether there is anything else higher up the URL
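The tag/attribute list above maps fairly directly onto Python's standard `html.parser` and `urllib.parse` modules. The sketch below is one way to do it under that assumption; `URL_ATTRS`, `LinkExtractor`, `extract_urls`, and `ancestor_urls` are hypothetical names, and treating every `<option value>` as a URL follows the slide rather than general practice.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit

# (tag, attribute) pairs from the list above that can hold a URL.
URL_ATTRS = {("a", "href"), ("area", "href"), ("frame", "src"),
             ("img", "src"), ("option", "value")}

class LinkExtractor(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.base = page_url   # replaced if the page has <base href=...>
        self.raw = []          # attribute values exactly as written

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if tag == "base" and name == "href":
                self.base = urljoin(self.base, value)
            elif (tag, name) in URL_ATTRS:
                self.raw.append(value)

def extract_urls(page_url, html):
    """Return absolute URLs found in the listed tags of an HTML page."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    parser.close()
    # Resolve relative paths against the page URL or its <base href>.
    return [urljoin(parser.base, u) for u in parser.raw]

def ancestor_urls(url):
    """Path-ascending step: URLs of every directory above the given URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    out = []
    for i in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:i]) + ("/" if i else "")
        out.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return out

page = '<base href="http://example.com/dir/"><a href="a.html">x</a><img src="/i.png">'
print(extract_urls("http://example.com/index.html", page))
print(ancestor_urls("http://example.com/a/b/c.html"))
```

Resolution is deferred until parsing finishes so that relative links are interpreted against the `<base href>` value even if links are collected before the base is seen.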
(This was an older characterization)
Focused Crawling
Classifier: Is crawled page P relevant to the topic?
Algorithm that maps page