parse-hints


For homework 2, please first use the NZ2 data set provided on this page as input to your indexing program. Larger data sets, such as NZ10 and the entire NZ data set, will be provided soon.

- NZ2 is a set of about 60,000 web pages (2% of the entire crawl) from the .nz (New Zealand) web domain.
- NZ10 is a set of about 300,000 web pages (10% of the entire crawl) from the .nz (New Zealand) web domain.
- NZ is a set of almost 3 million web pages (the entire crawl) from the .nz (New Zealand) web domain.

Note that NZ2 is about 130MB and NZ10 is about 620MB in compressed form, and that the files are all gzipped (even though they do not have the ending .gz).

For assignment #2, first start working with the smallest data set, and if you can efficiently build an index for NZ2, then run your code on the larger sets (NZ10 or NZ) to get full credit for #2.

A few more hints on how to deal with the pages that are provided in the NZ2, NZ10, and NZ data sets:

- again, each file consists of many pages. You need to parse individual pages,
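Since the data files are gzipped but lack the .gz ending, one possible way to read them is sketched below in Python. This is only an illustration, not part of the assignment: the helper names `is_gzipped` and `open_data_file` are hypothetical, and the sketch assumes nothing about the internal layout of the files beyond their being gzip-compressed. It checks the two-byte gzip magic number rather than relying on the file name.

```python
import gzip

# Every gzip stream starts with these two bytes, whatever the file is called.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def open_data_file(path):
    """Open a data file for binary reading, decompressing it if it is gzipped.

    gzip.open does not look at the file extension, so a gzipped file
    without the .gz ending is handled the same way as one with it.
    """
    if is_gzipped(path):
        return gzip.open(path, "rb")
    return open(path, "rb")
```

A caller could then stream a file with `with open_data_file(name) as f:` and read from `f` in chunks, which avoids writing a decompressed copy to disk; this matters more for the larger NZ10 and NZ sets.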

This note was uploaded on 03/07/2010 for the course CS 6913 taught by Professor Torsensuel during the Spring '10 term at NYU Poly.
