Module 7 Introduction

To search the Web, we first need an index; but to build an index, we need to extract terms from every webpage so that we can create postings lists. The programs used to gather all webpages so that they can be indexed are known as web crawlers, and in this module we will examine the basic ideas underlying crawlers. Therefore the goals of this third unit examining search engines are:

- to investigate the fundamentals of web crawling;
- to provide an appreciation for the size and connectedness of the Web;
- to explain why some data that you may find via the Web is not indexed;
- to expose some of the conflicts between Web search engine providers and those who create Web content.

More specifically, by the end of the module, you will be able to:

- explain how the Web can be modeled as a graph;
- describe in detail how a graph can be explored by a breadth-first traversal;
- describe some of the problems in maintaining an index of webpages, including the problems of coverage and freshness;
- describe some of the pitfalls that make web crawling challenging.

7.0 Crawling the Web

So far, we have assumed that an index for all the pages on the Web exists, and we have relied on that index for answering simple and compound searches. In this module we examine how the index is created and maintained.

To create an index for the Web, we need to visit all existing webpages to gather the words and generate appropriate postings lists. But how can we find all the Web's pages before there is an index? It turns out that the hyperlinks that point from one page to another form a web-like structure that we can crawl along, gathering pages as we go. But there are also problems: some webpages are not reachable by this form of crawling, the contents of many webpages will have changed by the time we have finished our crawl, and some nefarious webpage creators will try to interfere with our ability to crawl.

The material in this module is supplemented by material from Web Dragons. In particular, you must read Chapter 3, pp. 70-99. (Do not get bogged down in the explanation of scale-free networks and the evolutionary model on pp. 88-91; you will not be responsible for that material in this course.)

7.1 A Spider's Path

Crawl: "Of a program, esp. one associated with a search engine: to follow automatically links (on the World Wide Web or a particular web site) in order to retrieve documents, typically for the purpose of indexing." [Oxford English Dictionary, Draft Additions, June 2009]

7.1.1 Indexing a Webpage

To create an index for the Web, we need to fetch each webpage, one after the other, and collect all the terms used on that page. We saw in Module 5 that we can build postings lists for a given page: after taking into account stop words and stemming, we can record the word offset for each term that appears on the page, sort the terms into alphabetical order, and record all the offsets for each term in corresponding postings...
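To make the ideas in Sections 7.0 and 7.1 concrete, here is a minimal sketch (in Python, which is not part of the course material) of a breadth-first crawl that builds postings lists of (page, word offset) pairs. It is a sketch under stated assumptions, not the course's implementation: the names TOY_WEB, STOP_WORDS, and crawl are illustrative only, a small in-memory dictionary of pages stands in for fetching real pages over the network, and stemming and the other refinements mentioned above are omitted.

from collections import deque, defaultdict

# A tiny, hypothetical "Web": each URL maps to its text and outgoing links.
# A real crawler would fetch each page over HTTP instead of reading this dict.
TOY_WEB = {
    "page-a": ("the spider crawls the web", ["page-b", "page-c"]),
    "page-b": ("the web is a graph of pages", ["page-a"]),
    "page-c": ("crawling finds pages by following links", ["page-b", "page-d"]),
    "page-d": ("some pages are hard to reach", []),
}

STOP_WORDS = {"the", "a", "of", "is", "by", "are", "to"}  # illustrative only

def crawl(seed):
    """Breadth-first traversal of the link graph, starting from a seed URL."""
    index = defaultdict(list)          # term -> list of (url, offset) postings
    visited = {seed}
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()       # FIFO queue gives breadth-first order
        text, links = TOY_WEB[url]     # stands in for fetching the page
        # Record a posting (url, word offset) for every non-stop-word term.
        for offset, term in enumerate(text.split()):
            if term not in STOP_WORDS:
                index[term].append((url, offset))
        # Enqueue newly discovered pages so each page is visited exactly once.
        for link in links:
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return index

if __name__ == "__main__":
    postings = crawl("page-a")
    for term in sorted(postings):      # alphabetical order, as in the module
        print(term, postings[term])

The FIFO queue is what makes the traversal breadth-first: every page one link away from the seed is indexed before any page two links away, and the visited set prevents the crawl from looping around cycles in the link graph.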