Project 2


Common Words
To Parse or Not to Parse... that is the question...

Project #2
Winter 2007
CS352
Section V0A 2:00 pm Tuesday, Mar. 20, 2007
Section V0B 5:30 pm Tuesday, Mar. 20, 2007

Goals

1. To study the performance of search trees when used for storing and searching large text files.
2. To learn how to parse individual words in a text file.

Introduction

Different authors tend to use common words with differing frequencies. At different periods in history, some words may have been more common than at other times. Some words that were commonly used in one century may have fallen out of favor in subsequent centuries. Using literary texts and word frequencies, we can learn interesting things about language.

For this project you will compile the common words used by Shakespeare in two of his plays and the common words used by E.M. Forster in two of his novels. Then you will determine how many of the common words from each period are still commonly used today. Common words are words that occur with the highest frequencies (i.e., most often) in language or written text.

Efficiency will be important. You will use binary search trees and possibly other tree structures (AVL, splay) in this project (other data structures may also be used for parts of the project). You will compute the number of comparisons and the actual CPU time involved in completing your tasks and compare these statistics with the theoretical performance of the data structures you use.

Assignment

Write a C++ program that will read four literary texts (specified below) and extract from each text all the individual words. For each work, store the words into an initially empty binary search tree. Do not store duplicate words, but keep a frequency counter in the node with that word. After the words for each text have been stored, compute and print the final height of each tree.
Next, combine the two trees holding Shakespeare's most common words into one tree, and combine the two trees holding Forster's most common words into one tree. For example, if Hamlet has the words
    an(20), be(6), the(12), at(3), and(18), or(5), not(4)
and Macbeth has the words
    the(14), be(8), at(4), and(12), or(2), not(3)
then the combined tree would hold the words
    an(20), the(26), be(14), at(7), and(30), or(7), not(7)
In this example, the word with the highest frequency (thus, the most common) in Macbeth is "the", and in Hamlet the most common word is "an". But the most common word across both works combined is "and".

Output to a file the 200 most common words used by each author (using the "combined" tree).

The file commonwords.txt contains a list of common words and their rankings, where the most common word is ranked 1 and the least common word is ranked 1000. If a word is ranked 1, then that word's frequency would correspond to 1000. Store these words in an efficient data structure (you must justify your choice in the written report).

Use the common words list and the common words compiled for each author to determine how
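The combining step and the "most common words" selection described above could be sketched along these lines. Again this is a hedged illustration rather than the required solution: the names add, mergeInto, collect and mostCommon are hypothetical, ties in frequency are broken alphabetically (an assumption, since the handout does not say), and the sketch does not cover commonwords.txt, whose file layout is not shown here.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Node {
    std::string word;
    int count;
    std::unique_ptr<Node> left, right;
    Node(std::string w, int c) : word(std::move(w)), count(c) {}
};
using Tree = std::unique_ptr<Node>;
using Entry = std::pair<std::string, int>;  // (word, frequency)

// Adds `count` occurrences of `word`, creating the node if it is missing.
void add(Tree& root, const std::string& word, int count) {
    Tree* cur = &root;
    while (*cur) {
        int c = word.compare((*cur)->word);
        if (c == 0) { (*cur)->count += count; return; }
        cur = (c < 0) ? &(*cur)->left : &(*cur)->right;
    }
    *cur = std::make_unique<Node>(word, count);
}

// Folds every (word, count) pair of `src` into `dest`; `src` is unchanged.
// Pre-order traversal is used so that merging into an empty tree reproduces
// the shape of `src` instead of the chain a sorted (in-order) feed would build.
void mergeInto(Tree& dest, const Node* src) {
    if (!src) return;
    add(dest, src->word, src->count);
    mergeInto(dest, src->left.get());
    mergeInto(dest, src->right.get());
}

// Appends all (word, count) pairs of the tree to `out`, in alphabetical order.
void collect(const Node* n, std::vector<Entry>& out) {
    if (!n) return;
    collect(n->left.get(), out);
    out.emplace_back(n->word, n->count);
    collect(n->right.get(), out);
}

// Returns up to k entries with the highest counts, most frequent first.
// Assumption: equal counts are ordered alphabetically.
std::vector<Entry> mostCommon(const Node* root, std::size_t k) {
    std::vector<Entry> all;
    collect(root, all);
    k = std::min(k, all.size());
    std::partial_sort(all.begin(), all.begin() + k, all.end(),
                      [](const Entry& a, const Entry& b) {
                          return a.second != b.second ? a.second > b.second
                                                      : a.first < b.first;
                      });
    all.resize(k);
    return all;
}
```

With the Hamlet and Macbeth counts from the example, merging both trees into an empty combined tree and calling mostCommon with k = 200 would return all seven words, led by "and". Because the tree is ordered by word rather than by frequency, selecting the top entries requires visiting every node; std::partial_sort keeps that step at roughly O(n log k) instead of sorting the full list.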