INTRODUCTION TO DATA SCIENCE
JOHN P DICKERSON
PREM SAGGAR
Lecture #1 – 08/27/2018
CMSC320
Mondays & Wednesdays
2:00pm – 3:15pm
Today and Wednesday!

INTRODUCTION TO
?????????????

Data science is the application of computational and statistical techniques to address or gain [managerial or scientific] insight into some problem in the real world.
Zico Kolter
Machine Learning Prof, CMU

Drew Conway
CEO, Alluvium (analytics company)

MANY DEFINITIONS
Broad: necessarily larger than a single discipline
Interdisciplinary: statistics, computer science, operations research, statistical and machine learning, data warehousing, visualization, mathematics, information science, …
Insight-focused: grounded in the desire to find insights in data and leverage them to inform decision making
Thomas Carsey, UNC

THE DATA LIFECYCLE
Data collection → Data processing → Exploratory analysis & Data viz → Analysis, hypothesis testing, & ML → Insight & Policy Decision
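
The lifecycle stages above can be sketched end to end on a toy dataset. This is a minimal illustration rather than course material: the data values, the function names, and the decision threshold are all hypothetical, and the sketch uses only the Python standard library instead of the pandas or scikit-learn tooling covered later in the course.

```python
import csv
import io
import statistics

# Hypothetical raw data: hours studied vs. exam score, with one missing score.
RAW_CSV = """hours,score
1,52
2,58
3,
4,71
5,77
"""


def collect(raw_text):
    """Data collection: read rows from a CSV source (here, an in-memory string)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def process(rows):
    """Data processing: drop rows with missing fields and convert text to floats."""
    clean = []
    for row in rows:
        if row["hours"] and row["score"]:
            clean.append((float(row["hours"]), float(row["score"])))
    return clean


def explore(pairs):
    """Exploratory analysis: simple summary statistics."""
    scores = [score for _, score in pairs]
    return {"n": len(pairs), "mean_score": statistics.mean(scores)}


def analyze(pairs):
    """Analysis: ordinary least-squares slope of score on hours."""
    xs = [hours for hours, _ in pairs]
    ys = [score for _, score in pairs]
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


def decide(slope, threshold=5.0):
    """Insight and policy decision: a hypothetical rule of thumb."""
    return "recommend more study time" if slope > threshold else "no change"


rows = collect(RAW_CSV)      # data collection
pairs = process(rows)        # data processing
summary = explore(pairs)     # exploratory analysis
slope = analyze(pairs)       # analysis
decision = decide(slope)     # insight and policy decision
print(summary, round(slope, 2), decision)
```

Each function corresponds to one stage, so a stage can be swapped out (for example, replacing `collect` with a file or web source) without touching the others.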

“The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids.”
Hal Varian
Chief Economist at Google

THIS COURSE
You’ll learn to take data:
• Process it
• Visualize it
• Understand it
• Communicate it
• Extract value from it
Info:
Piazza: piazza.com/umd/fall2018/cmsc320
ELMS: (everyone should be registered automatically)
Hal Varian

PREREQUISITE KNOWLEDGE
Aimed at CMSC undergrads – but likely accessible to others with programming experience and mathematical maturity.
We do not assume:
• Experience with Python, pandas, scikit-learn, matplotlib, etc.
• Deep statistics or any ML knowledge
• Database or distributed systems knowledge
We do assume:
• You want to be here!

WHO AM I?

WHO IS PREM SAGGAR?
(Prem will likely be here on Wednesday.)

WHO ARE YOU?
Register on Piazza:
piazza.com/umd/fall2018/cmsc320
2nd-year   3rd-year   4th-year +
STAT400?   CMSC422?   CMSC424?

(TENTATIVE) COURSE STRUCTURE
First 4 lectures: intro & primers in the Python data science stack
Next 6 lectures: data collection & management
• Best practices, data wrangling, exploratory analysis, ethics, debugging, visualization, etc.
Next 9 lectures: statistical modeling & ML
• Statistical learning, regression, classification, cross-validation, model evaluation, hypothesis testing, etc.
Midterm
Final 8 lectures: advanced topics
• Dimensionality reduction, distributed learning, big data, distributed computation
• Either group presentations or more lectures
Ambitious …

GRADE #1: MINI-PROJECTS
Students will complete four mini-project assignments:
• Case studies meant to mimic what you, a future data scientist, will see in industry.
They should be fun :)

