Department of Computing and Information Systems
COMP10002 Foundations of Algorithms
Semester 2, 2016
Assignment 2
Learning Outcomes
In this project you will demonstrate your understanding of dynamic memory and linked data structures,
and extend your skills in terms of program design, testing, and debugging. You will also learn about
inverted indexes, and the basic principles of web search algorithms.
Indexed Document Retrieval
The idea of an inverted index was mentioned briefly in class. To build an index for some input text, the
words are isolated, together with their document numbers (in our case, line numbers in the input file),
and arranged so that, for every word, a list of the documents that contain that word is constructed. In
terms of notation, if $t$ is an indexed term, then $f_t$ is the number of documents in the collection that
contain that term at least once; and for any given document $d$, the value $f_{d,t}$ is the number of times
that $t$ appears in $d$. For example, consider the text:
line one has one word twice
line two has words once only
line three follows lines one and two, but not four
line four is like the other lines, not like line five
line five has word one and word two and word three
six is the littlest one
If each line of that input file is taken to be a “document”, then the first few lines of a simple
document-level inverted index for it would be
and 2 3 1 5 2
but 1 3 1
five 2 4 1 5 1
follows 1 3 1
four 2 3 1 4 1
has 3 1 1 2 1 5 1
where the first integer following each word is the $f_t$ value for that term $t$, with exactly that many
$\langle d, f_{d,t} \rangle$ pairs after it, all on the same input line. For example, term $t$ = “and” appears in
$f_t = 2$ documents, document $d = 3$ (with $f_{d,t} = 1$, that is, one occurrence), and in document $d = 5$
(with $f_{d,t} = 2$ occurrences). The full index file for the original six lines is available on the LMS, together
with larger examples. Make sure that you understand the structure, and what the values represent.
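To check your understanding of the line format, the following minimal sketch shows one way a single index line might be split into its term, its $f_t$ value, and its $\langle d, f_{d,t} \rangle$ pairs. The function name parse_index_line and its parameters are hypothetical and are not part of the assignment; the caller is assumed to supply a term buffer of at least 1000 bytes and two integer arrays with room for maxpairs entries.

```c
#include <stdio.h>

/* Hypothetical helper (not part of the assignment specification).
   Parses one index line of the form "term ft d1 f1 d2 f2 ...".
   term must point to a buffer of at least 1000 bytes; docs and freqs
   must each have room for maxpairs integers.
   Returns the number of pairs read, or -1 if the line is malformed
   or has more pairs than the caller's arrays can hold. */
int
parse_index_line(const char *line, char *term, int *ft,
        int *docs, int *freqs, int maxpairs) {
    const char *p = line;
    int used, i;

    /* the term and its ft value; %n records how many chars were consumed */
    if (sscanf(p, "%999s %d%n", term, ft, &used) != 2) {
        return -1;
    }
    if (*ft < 1 || *ft > maxpairs) {
        return -1;
    }
    p += used;

    /* exactly ft pairs of (document number, within-document frequency) */
    for (i = 0; i < *ft; i++) {
        if (sscanf(p, "%d %d%n", &docs[i], &freqs[i], &used) != 2) {
            return -1;
        }
        p += used;
    }
    return *ft;
}
```

Called on the line "and 2 3 1 5 2", this sketch stores "and" in term, sets *ft to 2, and fills docs with 3 and 5 and freqs with 1 and 2. The fixed-size arrays are for illustration only; they would not meet the dynamic-memory requirements of Stage 1.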
Stage 1 – Reading the Index (marks up to 8/15)
Write a program that reads an index file with this format, specified as the first (and only) argument on
the command-line, and builds (using realloc() and malloc()) a data structure to store that index
information. The only assumption you may make, purely for the purposes of reading the input strings,
is that each term in the index will be at most 999 characters long. Apart from a single buffer of that
size, all stored strings and lists of $\langle d, f_{d,t} \rangle$ pairs should be held in dynamic arrays of the correct length
for the data they contain (or within a factor of two of that minimum length).
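As an illustration of the memory requirement above, here is a minimal sketch of one possible layout, not a prescribed design. All type and function names (pair_t, term_t, index_t, index_init, index_add) are hypothetical. Each stored string and each list of pairs is allocated at exactly the required length with malloc(), and the array of terms grows by doubling with realloc(), which keeps it within a factor of two of the minimum length.

```c
#include <stdlib.h>
#include <string.h>

/* All names below are illustrative assumptions, not part of the spec. */
typedef struct {
    int d;          /* document number */
    int fdt;        /* occurrences of the term in document d */
} pair_t;

typedef struct {
    char *word;     /* exactly strlen(word)+1 bytes */
    int ft;         /* number of documents containing the term */
    pair_t *pairs;  /* exactly ft elements */
} term_t;

typedef struct {
    term_t *terms;  /* dynamic array of terms, grown by doubling */
    int nterms;     /* number of slots in use */
    int capacity;   /* number of slots allocated */
} index_t;

void
index_init(index_t *ix) {
    ix->terms = NULL;
    ix->nterms = 0;
    ix->capacity = 0;
}

/* Appends a copy of one term and its ft pairs to the index.
   Returns 0 on success, -1 on bad input or allocation failure. */
int
index_add(index_t *ix, const char *word, const pair_t *pairs, int ft) {
    term_t *t;

    if (ft < 1) {
        return -1;
    }
    if (ix->nterms == ix->capacity) {
        /* double the capacity, starting from a single slot */
        int newcap = ix->capacity ? 2 * ix->capacity : 1;
        term_t *grown = realloc(ix->terms, newcap * sizeof(*grown));
        if (grown == NULL) {
            return -1;
        }
        ix->terms = grown;
        ix->capacity = newcap;
    }
    t = &ix->terms[ix->nterms];
    t->word = malloc(strlen(word) + 1);
    t->pairs = malloc(ft * sizeof(*t->pairs));
    if (t->word == NULL || t->pairs == NULL) {
        free(t->word);
        free(t->pairs);
        return -1;
    }
    strcpy(t->word, word);
    memcpy(t->pairs, pairs, ft * sizeof(*t->pairs));
    t->ft = ft;
    ix->nterms++;
    return 0;
}
```

In this sketch the caller would read each term into the single 1000-byte buffer and then pass it to index_add, which makes its own correctly sized copy. Freeing the structure (each word, each pairs array, then the terms array) is omitted for brevity.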
As evidence of the
operation of this stage of your program, it should report the number of terms in the index that was