To Parse or Not to
that is the question.
2:00 pm Tuesday, Mar. 20, 2007
5:30 pm Tuesday, Mar. 20, 2007
To study the performance of search trees when used for storing and searching large text
To learn how to parse individual words in a text file.
Different authors tend to use common words with differing frequencies. At different periods in
history, some words may have been more common than at other times. Some words that were
commonly used in one century may have fallen out of favor in subsequent centuries. Using literary
texts and word frequencies, we can learn interesting things about language. For this project you
will compile the common words used by Shakespeare in two of his plays and the common words
used by E.M. Forster in two of his novels. Then you will determine how many of the common
words from each period are still commonly used today. Common words are words that occur with
the highest frequencies (i.e., most often) in a language or written text.
Efficiency will be important. You will use binary search trees and possibly other tree structures
(AVL, splay) in this project (other data structures may also be used for parts of the project). You
will compute the number of comparisons and the actual CPU time involved in completing your
tasks and compare these statistics with the theoretical performance of the data structures you use.
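One possible way to collect both statistics is sketched below. It is not the required implementation: `std::set` (usually a balanced tree) stands in for your own tree class, and the names `CountingLess`, `RunStats`, and `measureInserts` are hypothetical. The idea is to route every key comparison through a comparator that increments a counter, and to bracket the work with calls to `std::clock`, which reports processor time rather than wall-clock time.

```cpp
#include <cmath>
#include <cstddef>
#include <ctime>
#include <set>
#include <string>
#include <vector>

// Hypothetical comparator: behaves like operator< but counts each call.
struct CountingLess {
    std::size_t* count;  // counter owned by the caller
    bool operator()(const std::string& a, const std::string& b) const {
        ++*count;
        return a < b;
    }
};

// Hypothetical record of one measured run.
struct RunStats {
    std::size_t distinctWords;  // number of nodes in the tree
    std::size_t comparisons;    // key comparisons actually performed
    double cpuSeconds;          // processor time reported by std::clock
    double balancedEstimate;    // m * log2(n): growth-rate yardstick only
};

// Inserts every word into a tree and reports what the run cost.
// std::set is a stand-in here; the project asks for your own trees.
RunStats measureInserts(const std::vector<std::string>& words) {
    std::size_t comparisons = 0;
    std::set<std::string, CountingLess> tree(CountingLess{&comparisons});

    const std::clock_t start = std::clock();
    for (const std::string& w : words) {
        tree.insert(w);  // duplicates are ignored by std::set
    }
    const std::clock_t stop = std::clock();

    const double m = static_cast<double>(words.size());
    const double n = static_cast<double>(tree.size());
    RunStats stats;
    stats.distinctWords = tree.size();
    stats.comparisons = comparisons;
    stats.cpuSeconds = static_cast<double>(stop - start) / CLOCKS_PER_SEC;
    // Constant factors are ignored, so compare trends, not exact values.
    stats.balancedEstimate = (n > 1.0) ? m * std::log2(n) : m;
    return stats;
}
```

For short inputs the CPU time may round to zero, so timing a whole text (or repeating the run) is likely to give more useful numbers than timing a single insertion.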
Write a C++ program that will read four literary texts (specified below), and extract from each text
all the individual words. For each work, store the words into an initially empty binary search tree.
Do not store duplicate words; instead, keep a frequency counter in each node and increment it whenever that node's word occurs again.
After the words for each text have been stored, compute and print the final height of each tree.
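A short sketch of the height computation follows. It assumes the convention that an empty tree has height -1 and a single node has height 0; if the course uses a different convention, the base case changes accordingly. `TreeNode` and `treeHeight` are hypothetical names, and the node is reduced to its two child pointers because the height does not depend on the stored word.

```cpp
#include <algorithm>

// Minimal node: only the links matter for the height.
struct TreeNode {
    TreeNode* left = nullptr;
    TreeNode* right = nullptr;
};

// Height = number of edges on the longest path from the root to a leaf.
// Assumed convention: empty tree -> -1, single node -> 0.
int treeHeight(const TreeNode* root) {
    if (root == nullptr) {
        return -1;
    }
    return 1 + std::max(treeHeight(root->left), treeHeight(root->right));
}
```

The recursion depth equals the height of the tree, so a nearly sorted input that produces a very tall unbalanced tree will also produce deep recursion.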
Next, combine the two trees holding Shakespeare's most common words into one tree, and
combine the two trees holding Forster's most common words into one tree. For example, if
has the words
an(20), be(6), the(12), at(3), and(18), or(5), not(4)
has the words
the(14), be(8), at(4), and(12), or(2), not(3)