project3 - Project 3-CMPSCI 377 (Spring 2008) Worth: 15...

Project 3--CMPSCI 377 (Spring 2008)
Worth: 15 points

1 Overview

Goals of this assignment: understanding threads exclusion model.

For this assignment, you will write a mini web spider. Search engines use web spiders (also called crawlers) to retrieve documents recursively from the Internet.

2 Spider

2.1 Input

Your program, to be called spider, will take three command-line inputs:

1) The root URL to start from
2) The maximum depth to crawl
3) The number of worker threads to spawn

2.1.1 Input: Root URL

The root website will be specified in one of the following forms:

    http://www.cs.umass.edu
    OR http://www.cs.umass.edu/
    OR http://www.cs.umass.edu/index.html

See the section on the helper functions for using parse_single_URL() to parse this.

2.1.2 Input: Depth

The maximum depth tells your crawler how far to recurse. A depth of zero means that the root URL should be retrieved, but no others. A depth of one indicates that the root URL should be retrieved, and all pages that it links to, but no others.

2.1.3 Input: Threads

This is the number of worker threads to spawn for crawling web pages. You will have one additional thread that will do the parsing of pages to find new URLs.

2.2 Crawling Web Pages

Your program will use a work-queue style of concurrency, where multiple threads pull work off of a single queue. Each worker thread will pull a URL off the front of a queue, retrieve that web page into a buffer, then insert that buffer into another queue for parsing. A separate thread will pull the buffers off of the parsing work queue, parse them for new URLs, insert those URLs back onto the work queue, and so on. The worker thread may not retrieve another web page until the previous
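
The two-queue design described above can be sketched roughly as follows. This is only an illustration in Python, not the assignment's required language or its helper API: the in-memory FAKE_WEB table, the fetch() stand-in, the crawl() function, and the regex-based link extraction are all assumptions made so the example runs without network access. It also does not model the restriction on worker threads that the text begins to state at the end of the paragraph, since that sentence is cut off. The depth rule follows Section 2.1.2: links found on a page are queued only while that page's depth is below the maximum.

```python
import queue
import re
import threading

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://a/": '<a href="http://b/">b</a> <a href="http://c/">c</a>',
    "http://b/": '<a href="http://d/">d</a>',
    "http://c/": '<a href="http://a/">back to a</a>',
    "http://d/": "no links here",
}


def fetch(url):
    """Stand-in for a real HTTP download (an assumption, not the course helper)."""
    return FAKE_WEB.get(url, "")


def crawl(root, max_depth, num_workers):
    work_q = queue.Queue()   # (url, depth) pairs waiting to be downloaded
    parse_q = queue.Queue()  # (url, depth, page buffer) tuples waiting to be parsed
    seen = {root}            # used only by the single parser thread once it runs
    visited = []             # appended to only by the parser thread

    def worker():
        while True:
            item = work_q.get()
            if item is None:  # sentinel: no more work will arrive
                break
            url, depth = item
            parse_q.put((url, depth, fetch(url)))

    def parser():
        while True:
            item = parse_q.get()
            if item is None:
                break
            url, depth, body = item
            visited.append(url)
            if depth < max_depth:
                for link in re.findall(r'href="([^"]+)"', body):
                    if link not in seen:
                        seen.add(link)
                        work_q.put((link, depth + 1))
            # Mark the URL finished only after its links are queued, so
            # work_q.join() cannot return while new work is still being created.
            work_q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    threads.append(threading.Thread(target=parser))
    for t in threads:
        t.start()

    work_q.put((root, 0))
    work_q.join()             # blocks until every queued URL has been parsed

    for _ in range(num_workers):
        work_q.put(None)      # one shutdown sentinel per worker
    parse_q.put(None)
    for t in threads:
        t.join()
    return visited
```

Calling crawl("http://a/", 1, 3) on this toy data retrieves the root and the pages it links to, but nothing deeper. Because only one parser thread touches the seen set, the sketch needs no explicit lock there; a design with several parser threads would have to protect it.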