Question

Problem 1: Sentiment Analysis

This problem requires you to make a single large program. I have broken it up into smaller tasks, to help you approach writing the code. Please turn in one program file.

Sentiment Analysis is a Big Data problem which seeks to determine the general attitude of a writer given some text they have written. For instance, we would like to have a program that could look at the text "The film was a breath of fresh air" and realize that it was a positive statement while "It made me want to poke out my eye balls" is negative.

One algorithm that we can use for this is to assign a numeric value to any given word based on how positive or negative that word is and then score the statement based on the values of the words. But, how do we come up with our word scores in the first place?

That's the problem that we'll solve in this assignment. You are going to search through a file containing fragments of movie reviews from the Rotten Tomatoes website which have been assigned a numeric score indicating how positive or negative that review is. You'll use this to learn which words are positive and which are negative.

The data file is here: movie_reviews.txt, and looks like this:

4 This quiet , introspective and entertaining independent is worth seeking .
1 Aggressive self-glorification and a manipulative whitewash .
4 Best indie of the year , so far .
2 Nothing more than a run-of-the-mill action flick .
2 Reeks of rot and hack work from start to finish .

Note that each review starts with a number 0 through 4 with the following meaning:

• 0 : negative
• 1 : somewhat negative
• 2 : neutral
• 3 : somewhat positive
• 4 : positive

You are going to write a program that prompts the user to enter a phrase and then indicates whether that phrase is generally "positive" or "negative", by using the sentiment data contained in the data file.

To begin, download this movie_reviews.txt file by selecting File -> Save As in your browser and saving it to your computer. Ensure that the data file is saved in the same folder as your Python program.

Part 0

Begin by writing this function, which you will use to make sure that all of the string data you will be working with is formatted correctly.

# function:   cleanup_string
# input:      a string to clean up
# processing: (1) makes the entire string lowercase.
#             (2) retains only alphabetic and space characters
#                 [all numbers, punctuation and special characters removed]
# output:     returns the cleaned up string

# TEST CODE
test1 = cleanup_string("Hello World! This is a simple test of this function!")
print (test1)
# hello world this is a simple test of this function

test2 = cleanup_string("ABC123abc this is Another TEST!!!#@@")
print (test2)
# abcabc this is another test

test3 = cleanup_string("I'm so happy today! La la la la it doesn't get any better than this.")
print (test3)
# im so happy today la la la la it doesnt get any better than this
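
One possible way to write this function is sketched below. It assumes that "alphabetic" means the ASCII letters a through z, which the assignment does not state explicitly; the three test strings above behave the same under either reading.

```python
def cleanup_string(text):
    """Return text in lowercase, keeping only letters a-z and spaces."""
    kept = []
    for ch in text.lower():
        # Assumption: only ASCII letters count as alphabetic.
        if "a" <= ch <= "z" or ch == " ":
            kept.append(ch)
    # Everything else (digits, punctuation, tabs, newlines) is dropped.
    return "".join(kept)


print(cleanup_string("ABC123abc this is Another TEST!!!#@@"))
```

Building a list of kept characters and joining it once avoids repeatedly concatenating strings inside the loop.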

Part 1

To begin, your program has to compute the average sentiment score for each of the words in the movie_reviews.txt file. First, download the text file and save it in the same folder where your program will be. Then write a program to do the following:

• Set up a new dictionary variable called 'words'
• Iterate over every review in the text file.
• Clean up the text of every review, character by character if necessary: remove all punctuation and numbers by replacing them with empty strings (your cleanup_string function from Part 0 does this).
• Split the cleaned up review into a list of words, and then iterate over that list.
• Any word that is valid (i.e. it has at least 1 character) can then be classified. Otherwise you can ignore it and move on to the next word.
• If this is the first time you have seen this word (i.e. it is not in your dictionary yet), add a new entry to your dictionary for that word (i.e. the word becomes a new key in the dictionary). The value to store at this key should be a list that contains two elements: the score of the review and the number 1 (indicating that you've seen this word 1 time).
• If you have seen the word before (i.e. it is already in your dictionary), add the new score to the running total stored in your list and increase the count of times that you have seen this word. For example:
4 I loved it
1 I hated it
• ... might look like this as a dictionary:
words['i']     = [5,2]
words['loved'] = [4,1]
words['it']    = [5,2]
words['hated'] = [1,1]
• Report to the user that the analysis of the 'movie_reviews.txt' file has been completed. Report how many lines were processed, and how many unique words were recorded. Also give them a summary of how long this took (hint: import the time module and use time.time() to compute the current time before and after your analysis algorithm and then compute the difference). For example:
Initializing sentiment database.
Sentiment database initialization complete.
Total unique words analyzed: 16128
Analysis took 0.142 seconds to complete.
• What is a word? When designing a program like this, you need to make sure that you and the program's end users agree on what counts as a unique word. For this assignment:
• Ignore capitalization: "A" and "a" should be counted as the same word.
• Words should have no punctuation or numbers: they should only contain alphabetic characters.
• Be sure to strip out all whitespace. For example, you should not have words in your dictionary that contain a space or tab character ("\t").
• Also: make sure you are not counting empty lines!
• Also note: your analysis time may vary depending on your computer, but you should get the same number of lines and words as shown above.
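
The steps in Part 1 can be sketched roughly as follows. This is a minimal illustration, not the required solution: the function name build_word_scores is hypothetical, and the sketch runs on a small in-memory list (the two-review example above, plus an empty line) instead of the real file.

```python
import time


def cleanup_string(text):
    """Lowercase text and keep only letters a-z and spaces (as in Part 0)."""
    return "".join(ch for ch in text.lower() if "a" <= ch <= "z" or ch == " ")


def build_word_scores(lines):
    """Make one pass over the reviews; map each word to [total_score, times_seen]."""
    words = {}
    lines_processed = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # do not count empty lines
        score = int(line[0])  # each review starts with a digit 0 through 4
        lines_processed += 1
        # split() with no argument also discards empty strings
        for word in cleanup_string(line[1:]).split():
            if word in words:
                words[word][0] += score  # add to the running total
                words[word][1] += 1      # seen one more time
            else:
                words[word] = [score, 1]
    return words, lines_processed


sample = ["4 I loved it", "", "1 I hated it"]
start = time.time()
words, lines_processed = build_word_scores(sample)
elapsed = time.time() - start

print("Total unique words analyzed:", len(words))
print("Analysis took %.3f seconds to complete." % elapsed)
```

For the real data, the same function could be given an open file, for example `with open("movie_reviews.txt") as f: words, n = build_word_scores(f)`, since iterating over a file yields one line at a time.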

Part 2

• Repeatedly ask the user for a phrase to analyze.
• Convert all words to lowercase for analysis. Also remove any punctuation or numbers from the words.
• Analyze each word in this phrase and use your dictionary to compute the average score for each word, and report this to the user.
• Compute whether the overall phrase is positive or negative by averaging together the scores for each word that is contained within the phrase. Anything less than 2 should be considered negative, and anything greater than 2 is positive. Note: any words that are not in the dictionary should not be counted when computing the score for the phrase.
• Continue to prompt for phrases until the user types "quit", at which point your program should end.
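
A rough sketch of the Part 2 logic is shown below, assuming a dictionary in the [total score, times seen] format from Part 1. The names score_phrase, classify and run_session are hypothetical, and the tiny dictionary exists only for the demonstration. The assignment does not say what to report for a score of exactly 2, so the "NEUTRAL" label here is an assumption.

```python
def cleanup_string(text):
    """Lowercase text and keep only letters a-z and spaces (as in Part 0)."""
    return "".join(ch for ch in text.lower() if "a" <= ch <= "z" or ch == " ")


def score_phrase(phrase, words):
    """Report each word, then return the mean of the known words' averages (or None)."""
    averages = []
    for word in cleanup_string(phrase).split():
        if word in words:
            total, count = words[word]
            average = total / count
            print("* '%s' appears %d times with an average score of %s"
                  % (word, count, average))
            averages.append(average)
        else:
            # Unknown words are reported but not counted in the phrase score.
            print("* '%s' does not appear in any movie reviews." % word)
    if not averages:
        return None
    return sum(averages) / len(averages)


def classify(score):
    """Below 2 is negative, above 2 is positive."""
    if score > 2:
        return "POSITIVE"
    if score < 2:
        return "NEGATIVE"
    return "NEUTRAL"  # assumption: the assignment leaves exactly 2 unspecified


def run_session(words, read_line=input):
    """Prompt for phrases until the user types "quit"."""
    while True:
        phrase = read_line("Enter a phrase to test: ")
        if phrase == "quit":
            print("Quitting.")
            break
        score = score_phrase(phrase, words)
        if score is None:
            print("Not enough words to determine sentiment.")
        else:
            print("Average score for this phrase is:", score)
            print("This is a %s phrase." % classify(score))
        print()


# Tiny stand-in dictionary; the real one comes from movie_reviews.txt.
demo_words = {"i": [5, 2], "loved": [4, 1], "it": [5, 2], "hated": [1, 1]}
print(classify(score_phrase("I loved it!", demo_words)))
```

Calling `run_session(demo_words)` would start the interactive loop. Each word is looked up in the dictionary directly, so the data file is never read again.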

Here is an example session:

Initializing sentiment database.
Sentiment database initialization complete.
Total unique words analyzed: 16128
Analysis took 0.130 seconds to complete.

Enter a phrase to test: i loved it
* 'i' appears 383 times with an average score of 1.8302872062663185
* 'loved' appears 9 times with an average score of 2.6666666666666665
* 'it' appears 2405 times with an average score of 1.99002079002079
Average score for this phrase is: 2.1623248876512586
This is a POSITIVE phrase.

Enter a phrase to test: this movie was awful
* 'this' appears 994 times with an average score of 1.9657947686116701
* 'movie' appears 969 times with an average score of 1.8286893704850362
* 'was' appears 169 times with an average score of 1.621301775147929
* 'awful' appears 23 times with an average score of 1.0869565217391304
Average score for this phrase is: 1.6256856089959415
This is a NEGATIVE phrase.

Enter a phrase to test: pikachu is watching you
* 'pikachu' does not appear in any movie reviews.
* 'is' appears 2799 times with an average score of 2.0568060021436225
* 'watching' appears 80 times with an average score of 1.875
* 'you' appears 850 times with an average score of 2.050588235294118
Average score for this phrase is: 1.9941314124792466
This is a NEGATIVE phrase.

Enter a phrase to test: pikachu charmander
* 'pikachu' does not appear in any movie reviews.
* 'charmander' does not appear in any movie reviews.
Not enough words to determine sentiment.

Enter a phrase to test: happy birthday sad kitten
* 'happy' appears 17 times with an average score of 2.588235294117647
* 'birthday' appears 9 times with an average score of 2.7777777777777777
* 'sad' appears 33 times with an average score of 2.212121212121212
* 'kitten' appears 1 times with an average score of 2.0
Average score for this phrase is: 2.3945335710041595
This is a POSITIVE phrase.

Enter a phrase to test: it made me want to poke out my eyeballs
* 'it' appears 2405 times with an average score of 1.99002079002079
* 'made' appears 148 times with an average score of 1.945945945945946
* 'me' appears 81 times with an average score of 1.5802469135802468
* 'want' appears 67 times with an average score of 1.8208955223880596
* 'to' appears 2996 times with an average score of 1.9589452603471296
* 'poke' does not appear in any movie reviews.
* 'out' appears 298 times with an average score of 1.8187919463087248
* 'my' appears 83 times with an average score of 2.036144578313253
* 'eyeballs' appears 1 times with an average score of 1.0
Average score for this phrase is: 1.7688738696130188
This is a NEGATIVE phrase.

Enter a phrase to test: I would not, could not, Sam I Am
* 'i' appears 383 times with an average score of 1.8302872062663185
* 'would' appears 213 times with an average score of 1.6431924882629108
* 'not' appears 596 times with an average score of 1.919463087248322
* 'could' appears 155 times with an average score of 1.8838709677419354
* 'not' appears 596 times with an average score of 1.919463087248322
* 'sam' appears 2 times with an average score of 1.5
* 'i' appears 383 times with an average score of 1.8302872062663185
* 'am' appears 7 times with an average score of 2.7142857142857144
Average score for this phrase is: 1.90510621966498
This is a NEGATIVE phrase.

Enter a phrase to test: quit
Quitting.
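
The numbers in the session above suggest that the phrase score is the plain mean of the per-word averages, with every word weighted equally regardless of how often it appears in the reviews. The short check below uses the values printed for "i loved it" to illustrate this; the count-weighted alternative is computed only for contrast and is not what the example output shows.

```python
# (times seen, average score) copied from the "i loved it" example
reported = {
    "i": (383, 1.8302872062663185),
    "loved": (9, 2.6666666666666665),
    "it": (2405, 1.99002079002079),
}

# Plain mean of the three per-word averages
averages = [avg for _, avg in reported.values()]
phrase_score = sum(averages) / len(averages)
print(phrase_score)  # approximately 2.1623, as in the example session

# For contrast: weighting each word by how often it was seen
total_score = sum(count * avg for count, avg in reported.values())
total_count = sum(count for count, _ in reported.values())
weighted = total_score / total_count
print(weighted)  # about 1.97, which would be classified differently
```

Because the two approaches can land on opposite sides of 2, the choice of averaging method affects whether output matches the expected results.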

Some notes:

• You must use a dictionary to solve this problem, and you may only analyze the 'movie_reviews.txt' file ONE TIME. You CANNOT re-analyze the file over and over again (i.e. for the phrase 'happy birthday' you can't iterate over every movie review to find all occurrences of 'happy' and then repeat this process to find all occurrences of 'birthday'). You will lose points for inefficient code.
• Important: this program will be tested automatically, so your output should match the examples I give in all formatting, and your sentiment scores should match the values I've computed to at least two decimal places.
• We will also examine your code, so remember to put clear comments to explain what you're doing.