Dan Jurafsky

Baseline Algorithm (adapted from Pang and Lee)
•  Tokenization
•  Feature Extraction
•  Classification using different classifiers
   •  Naïve Bayes
   •  MaxEnt
   •  SVM

Sentiment Tokenization Issues
•  Deal with HTML and XML markup
•  Twitter mark-up (names, hash tags)
•  Capitalization (preserve for words in all caps)
•  Phone numbers, dates
•  Emoticons
•  Useful code:
   •  Christopher Potts sentiment tokenizer
   •  Brendan O'Connor twitter tokenizer

Potts emoticons
    [<>]?                        # optional hat/brow
    [:;=8]                       # eyes
    [\-o\*\']?                   # optional nose
    [\)\]\(\[dDpP/\:\}\{@\|\\]   # mouth
    |                            #### reverse orientation
    [\)\]\(\[dDpP/\:\}\{@\|\\]   # mouth
    [\-o\*\']?                   # optional nose
    [:;=8]                       # eyes
    [<>]?                        # optional hat/brow

Extracting Features for Sentiment Classification
•  How to handle negation
   •  I didn't like this movie
      vs
   •  I really like this movie
•  Which words to use?
   •  Only adjectives
   •  All words
   •  All words turns out to work better, at least on this data

Negation
Das, Sanjiv and Mike Chen. 2001. Yahoo! for Amazon: Extracting market sentiment from stock message boards. In Proceedings of the Asia Pacific Finance Association Annual Conference (APFA).
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP-2002, 79–86.

Add NOT_ to every word between negation and following punctuation:
    didn't like this movie , but I
    didn't NOT_like NOT_this NOT_movie but I

Reminder: Naïve Bayes
$$c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in \text{positions}} P(w_i \mid c_j)$$
$$\hat{P}(w \mid c) = \frac{\text{count}(w, c) + 1}{\text{count}(c) + |V|}$$

Binarized (Boolean feature) Multinomial Naïve Bayes
•  Intuition:
   •  For sentiment (and probably for other t...
