INTRODUCTION TO
DATA SCIENCE
JOHN P DICKERSON
Lecture #18 – 10/29/2018
CMSC320
Mondays & Wednesdays
2:00pm – 3:15pm

ANNOUNCEMENTS
Mini-Project #2 grades will be out by Thursday night!
Mini-Project #3 is out!
• It is linked to from ELMS; it is also available at:
• Deliverable is a .ipynb file submitted to ELMS
• Due November 19th
Please label your ipynb file something like <lastname>_<firstname>_project3.ipynb

MIDTERMS
Not graded yet!
If you still need to take a midterm exam, please please please
please please tell me.
I know of exactly four of you who do.

THIS LECTURE
Data collection → Data processing → Exploratory analysis & Data viz → Analysis, hypothesis testing, & ML → Insight & Policy Decision

THIS LECTURE:
Words words words!
• Free text and natural language processing in data science
• Bag of words and TF-IDF
• N-Grams and language models
• Sentiment mining
Thanks to: Zico Kolter (CMU) & Marine Carpuat’s 723 (UMD)

PRECURSOR TO NATURAL
LANGUAGE PROCESSING

Turing’s Imitation Game [1950]:
• Person A and Person B go into separate rooms
• Guests send questions in, read answers that come out – but they are not told who sent the answers
• Person A (B) wants to convince the group that she is Person B (A)
We now ask the question, "What will happen when a machine takes the part of [Person] A in this game?" Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between [two humans]? These questions replace our original, "Can machines think?"

PRECURSOR TO NATURAL
LANGUAGE PROCESSING
Mechanical translation started in the 1930s
• Largely based on dictionary lookups
Georgetown-IBM Experiment:
• Translated 60 Russian sentences to English
• Fairly basic system behind the scenes
• Highly publicized, system ended up spectacularly failing
Funding dried up; not much research in “mechanical translation” until the 1980s …

STATISTICAL NATURAL
LANGUAGE PROCESSING
Pre-1980s: primarily based on sets of hand-tuned rules
Post-1980s: introduction of machine learning to NLP
• Initially, decision trees learned what-if rules automatically
• Then, hidden Markov models (HMMs) were used for part-of-speech (POS) tagging
• Explosion of statistical models for language
• Recent work focuses on purely unsupervised or semi-supervised learning of models
We’ll cover some of this in the machine learning lectures!

NLP IN DATA SCIENCE
In Mini-Project #1, you used requests and BeautifulSoup
