
Department of Computing and Information Systems
COMP10002 Foundations of Algorithms
Semester 2, 2016
Assignment 2

Learning Outcomes

In this project you will demonstrate your understanding of dynamic memory and linked data structures, and extend your skills in terms of program design, testing, and debugging. You will also learn about inverted indexes, and the basic principles of web search algorithms.

Indexed Document Retrieval

The idea of an inverted index was mentioned briefly in class. To build an index for some input text, the words are isolated, together with their document numbers (in our case, line numbers in the input file), and arranged so that, for every word, a list of the documents that contain that word is constructed. In terms of notation, if t is an indexed term, then f_t is the number of documents in the collection that contain that term at least once; and for any given document d, the value f_{d,t} is the number of times that t appears in d.

For example, consider the text:

    line one has one word twice
    line two has words once only
    line three follows lines one and two, but not four
    line four is like the other lines, not like line five
    line five has word one and word two and word three
    six is the littlest one

If each line of that input file is taken to be a "document", then the first few lines of a simple document-level inverted index for it would be

    and 2 3 1 5 2
    but 1 3 1
    five 2 4 1 5 1
    follows 1 3 1
    four 2 3 1 4 1
    has 3 1 1 2 1 5 1

where the first integer following each word is the f_t value for that term t, with exactly that many ⟨d, f_{d,t}⟩ pairs after it, all on the same input line. For example, term t = "and" appears in f_t = 2 documents: in document d = 3 (with f_{d,t} = 1, that is, one occurrence), and in document d = 5 (with f_{d,t} = 2 occurrences).

The full index file for the original six lines is available on the LMS, together with larger examples. Make sure that you understand the structure, and what the values represent.
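To make the line format concrete, the following is a minimal sketch (not part of the assignment handout, and not a reference solution) of how one index line could be split into its term, its f_t value, and its ⟨d, f_{d,t}⟩ pairs. The names pair_t and parse_index_line are hypothetical, chosen only for this illustration, and the sketch assumes a whole index line is already held in a string.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_TERM 999   /* longest term, as stated in the Stage 1 text */

/* One <d, f_dt> pair: a document number and a within-document count. */
typedef struct {
    int d;
    int fdt;
} pair_t;

/* Hypothetical helper: parse one index line such as "and 2 3 1 5 2".
 * Copies the term into term (which must hold MAX_TERM+1 bytes), stores
 * f_t in *ft, and returns a malloc'd array of exactly *ft pairs.
 * Returns NULL if the line is malformed or memory runs out. */
pair_t *parse_index_line(const char *line, char *term, int *ft) {
    int used = 0;
    pair_t *pairs;

    assert(line != NULL && term != NULL && ft != NULL);

    /* %n records how many characters have been consumed so far */
    if (sscanf(line, "%999s %d%n", term, ft, &used) != 2 || *ft < 1) {
        return NULL;
    }
    pairs = malloc((size_t)*ft * sizeof(*pairs));
    if (pairs == NULL) {
        return NULL;
    }
    line += used;
    for (int i = 0; i < *ft; i++) {
        if (sscanf(line, "%d %d%n", &pairs[i].d, &pairs[i].fdt, &used) != 2) {
            free(pairs);   /* fewer pairs than f_t promised */
            return NULL;
        }
        line += used;
    }
    return pairs;
}
```

Called on the line "and 2 3 1 5 2", this sketch stores "and" in term, sets *ft to 2, and returns the pairs (3, 1) and (5, 2); the caller is responsible for freeing the returned array.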
Stage 1 – Reading the Index (marks up to 8/15)

Write a program that reads an index file with this format, specified as the first (and only) argument on the command-line, and builds (using realloc() and malloc()) a data structure to store that index information. The only assumption you may make, purely for the purposes of reading the input strings, is that each term in the index will be at most 999 characters long. Apart from a single buffer of that size, all stored strings and lists of ⟨d, f_{d,t}⟩ pairs should be held in dynamic arrays of the correct length for the data they contain (or within a factor of two of that minimum length).

As evidence of the operation of this stage of your program, it should report the number of terms in the index that was read, the total number of ⟨d, f_{d,t}⟩ pairs in the index, and up to ten of the pairs associated with the first two and the last two terms in the index, using exactly the output format that is shown here and in the LMS examples. Note that the terms are labeled from 1:

    mac: ./ass2-soln test0-ind.txt
    Stage 1 Output
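The storage requirements of Stage 1 can be sketched as follows. This is one possible layout under stated assumptions, not the expected solution: the type and function names (pair_t, entry_t, index_t, read_index, free_index) are hypothetical, the array of terms grows by doubling from a capacity of one so that it stays within a factor of two of the minimum length, and each term string and pair list is allocated at exactly the size it needs. Producing the required Stage 1 report is deliberately left out.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TERM 999   /* longest term the input may contain */

typedef struct { int d; int fdt; } pair_t;

typedef struct {
    char *term;       /* exact-length copy of the term */
    int ft;           /* number of pairs that follow */
    pair_t *pairs;    /* exactly ft pairs */
} entry_t;

typedef struct {
    entry_t *entries; /* dynamic array, grown by doubling */
    size_t nterms;    /* entries in use */
    size_t capacity;  /* entries allocated */
    size_t npairs;    /* total pairs over all terms */
} index_t;

/* Hypothetical helper: read a whole index from fp into *ix.
 * Returns 0 on success, -1 on malformed input or allocation failure.
 * On failure, entries read so far stay in *ix for free_index(). */
int read_index(FILE *fp, index_t *ix) {
    char buf[MAX_TERM + 1];   /* the single fixed-size buffer */
    int ft;

    assert(fp != NULL && ix != NULL);
    ix->entries = NULL;
    ix->nterms = ix->capacity = ix->npairs = 0;

    while (fscanf(fp, "%999s %d", buf, &ft) == 2) {
        if (ft < 1) {
            return -1;
        }
        if (ix->nterms == ix->capacity) {
            size_t newcap = ix->capacity ? 2 * ix->capacity : 1;
            entry_t *tmp = realloc(ix->entries, newcap * sizeof(*tmp));
            if (tmp == NULL) {
                return -1;
            }
            ix->entries = tmp;
            ix->capacity = newcap;
        }
        entry_t *e = &ix->entries[ix->nterms];
        e->term = malloc(strlen(buf) + 1);
        e->pairs = malloc((size_t)ft * sizeof(*e->pairs));
        if (e->term == NULL || e->pairs == NULL) {
            free(e->term);
            free(e->pairs);
            return -1;
        }
        strcpy(e->term, buf);
        e->ft = ft;
        for (int i = 0; i < ft; i++) {
            if (fscanf(fp, "%d %d", &e->pairs[i].d, &e->pairs[i].fdt) != 2) {
                free(e->term);
                free(e->pairs);
                return -1;
            }
        }
        ix->nterms++;
        ix->npairs += (size_t)ft;
    }
    return 0;
}

/* Release everything read_index() allocated. */
void free_index(index_t *ix) {
    for (size_t i = 0; i < ix->nterms; i++) {
        free(ix->entries[i].term);
        free(ix->entries[i].pairs);
    }
    free(ix->entries);
    ix->entries = NULL;
    ix->nterms = ix->capacity = ix->npairs = 0;
}
```

In a complete program, main() would open the file named by argv[1] with fopen(), pass the stream to read_index(), print the report from ix.nterms, ix.npairs, and the stored pairs, and finish with free_index() and fclose().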