Statistics 202
Fall 2012
Data Mining
Assignment #1
Due Friday October 05, 2012
Prof. J. Taylor
You may discuss homework problems with other students, but you have to prepare
the written assignments yourself. Late homework will be penalized 10 points.
Plea
Homework 1
Problem 1 (Exercise 2.4 #2, p. 52)
Explain whether each scenario is a classification or regression problem, and
indicate whether we are most interested in inference or prediction. Finally,
provide n and p.
(a) We collect a set of data on the to
Homework 2
Problem 1 (Exercise 4.7 #4, p. 168)
When the number of features p is large, there tends to be deterioration in the
performance of KNN and other local approaches that perform prediction using
only observations that are near the test observation
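A rough numeric sketch of this effect, under assumptions not stated above: features uniformly distributed on [0, 1], a neighborhood that spans 10% of each feature's range, and edge effects ignored. In that setting the expected fraction of observations inside the neighborhood is 0.1^p, which shrinks geometrically as p grows. The function name `fraction_nearby` is hypothetical.

```python
def fraction_nearby(p, width=0.1):
    """Expected fraction of uniform [0, 1]^p points that fall inside a
    hypercube neighborhood spanning `width` of each feature's range.
    Edge effects are ignored (assumption)."""
    return width ** p


# The fraction of "near" observations collapses as p grows.
for p in (1, 2, 10, 100):
    print(p, fraction_nearby(p))
```

With so few observations nearby, a local method has to reach far from the test observation to find any neighbors at all.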
Statistics 202
Tony Lindsey
July 1, 2015
Homework 1
Problem 1 (Exercise 2.4 #2, p. 52)
Explain whether each scenario is a classification or regression problem, and
indicate whether we are most interested in inference or prediction. Finally,
provide n and
Homework 3
Problem 1 (Exercise 6.8 #3, p. 260)
Suppose we estimate the regression coefficients in a linear regression model
by minimizing
$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s$$
for a particular value of s. For parts (a) through (e), indicate wh
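A minimal sketch of how the budget s interacts with the fit, under simplifying assumptions that are not part of the exercise: no intercept, synthetic data, and a design matrix with orthonormal columns. In that special case the penalized problem (half the residual sum of squares plus lambda times the sum of absolute coefficients) is solved by soft-thresholding the least squares coefficients, and each lambda corresponds to the budget s that its solution uses. The names `soft_threshold`, `beta_true`, and `results` are hypothetical.

```python
import numpy as np


def soft_threshold(b, t):
    """Shrink each entry of b toward zero by t; entries smaller than t become 0."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)


rng = np.random.default_rng(0)
n, p = 50, 5

# Orthonormal design (assumption): Q from a reduced QR has Q.T @ Q = I.
X, _ = np.linalg.qr(rng.normal(size=(n, p)))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])  # hypothetical coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)

b_ols = X.T @ y  # least squares solution when X.T @ X = I

results = []
for lam in (2.0, 1.0, 0.5, 0.0):
    beta = soft_threshold(b_ols, lam)   # minimizes 0.5 * RSS + lam * sum|beta_j|
    s = np.abs(beta).sum()              # budget used by this solution
    rss = np.sum((y - X @ beta) ** 2)   # training RSS
    results.append((lam, s, rss))
    print(f"lambda={lam:.1f}  s={s:.3f}  training RSS={rss:.3f}")
```

Smaller values of lambda correspond to larger budgets s, so the loop moves from a tightly constrained fit toward ordinary least squares.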
Lecture 8: Classification
Reading: Chapter 4
STATS 202: Data mining and analysis
Sergio Bacallado
October 8, 2014
Classification problems
Supervised learning with a qualitative or categorical response.
Just as common, if not more common than regression
Lecture 3: Principal Components Analysis
(PCA)
Reading: Sections 6.3.1, 10.1, 10.2, 10.4
STATS 202: Data mining and analysis
Sergio Bacallado
September 26, 2014
Announcements
Homework 1 is out; due next Thursday.
Kaggle invitations have been sent.
Lecture 2: Supervised vs. unsupervised
learning, bias-variance tradeoff
Reading: Chapter 2
STATS 202: Data mining and analysis
Sergio Bacallado
September 24, 2014
Supervised vs. unsupervised learning
Samples or units
In unsupervised learning we start
Lecture 1: Course logistics, homework 0
STATS 202: Data mining and analysis
Sergio Bacallado
September 22, 2014
Syllabus
Videos: Every lecture will be recorded by SCPD.
Email policy: Please
Lecture 29: Review
Reading: All chapters in ISLR
STATS 202: Data mining and analysis
Sergio Bacallado
December 5, 2014
Announcements
Please send us all regrade requests as soon as possible.
Lecture 28: Review
Reading: All chapters in ISLR.
STATS 202: Data mining and analysis
Sergio Bacallado
December 3, 2014
Announcements
Remember to submit Homework 8 by Friday at 10am to get
Kaggle credit.
8
Cluster Analysis:
Basic Concepts and
Algorithms
Cluster analysis divides data into groups (clusters) that are meaningful, useful,
or both. If meaningful groups are the goal, then the clusters should capture the
natural structure of the data. In some cas
Lecture 5: Clustering, Linear Regression
Reading: Chapter 10, Sections 3.1-2
STATS 202: Data mining and analysis
Sergio Bacallado
October 1, 2014
Announcements
Starting next week, Julia Fukuyama will be having her office
hours after business hours for
Lecture 6: Linear Regression (continued)
Reading: Sections 3.1-3
STATS 202: Data mining and analysis
Sergio Bacallado
October 3, 2014
Multiple linear regression
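$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma) \ \text{i.i.d.}$$
[Figure 3.4: axes Y, X1, X2]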
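A minimal sketch of this model with two predictors, assuming synthetic data: it simulates Y from a linear combination of X1 and X2 plus i.i.d. Gaussian noise, then recovers the coefficients by least squares. The coefficient values, the noise scale, and the variable names are hypothetical, and the noise parameter is treated as a standard deviation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
beta = np.array([1.0, 2.0, -0.5])      # hypothetical beta_0, beta_1, beta_2
sigma = 0.3                            # hypothetical noise standard deviation

X = rng.normal(size=(n, 2))            # predictors X1 and X2
eps = rng.normal(scale=sigma, size=n)  # i.i.d. Gaussian errors
y = beta[0] + X @ beta[1:] + eps

# Prepend a column of ones so the intercept is estimated with the slopes.
design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print(beta_hat)
```

With two predictors the fitted model is a plane through the point cloud in (X1, X2, Y) space.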
Lecture 7: Linear Regression (continued)
Reading: Chapter 3
STATS 202: Data mining and analysis
Sergio Bacallado
October 6, 2014
Potential issues in linear regression
1. Interactions between predictors
2. Non-linear relationships
3. Correlation of
Lecture 9: Classification, LDA
Reading: Chapter 4
STATS 202: Data mining and analysis
Sergio Bacallado
October 10, 2014
Review: Main strategy in Chapter 4
Find an estimate $\hat{P}(Y \mid X)$. Then, given an input $x_0$, we predict
the response as in a Bayes cla
Lecture 10: Classification examples
Reading: Chapter 4
STATS 202: Data mining and analysis
Sergio Bacallado
October 13, 2014
Example. Predicting default
Used LDA to predict credit card default in a dataset of 10K people.
Predicted yes if P (default =
Lecture 11: Cross validation
Reading: Chapter 5
STATS 202: Data mining and analysis
Sergio Bacallado
October 15, 2014
Validation set approach
Goal: Estimate the test error for a supervised learning method.
Strategy:
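One common way to carry out the validation set approach, offered as a sketch rather than the slide's own wording: randomly split the observations into a training part and a validation part, fit the method on the former, and use the error on the latter as the test error estimate. The synthetic data, the 50/50 split, and the straight-line model are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.2, size=n)  # synthetic data (assumption)

# Random 50/50 split into training and validation indices.
perm = rng.permutation(n)
train, valid = perm[: n // 2], perm[n // 2:]

# Fit a straight line on the training half only.
slope, intercept = np.polyfit(x[train], y[train], deg=1)

# The validation MSE serves as the estimate of the test error.
pred = intercept + slope * x[valid]
valid_mse = np.mean((y[valid] - pred) ** 2)
print(valid_mse)
```

The estimate depends on which observations land in each half, so a different random split would give a somewhat different number.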
Lecture 12: The Bootstrap
Reading: Chapter 5
STATS 202: Data mining and analysis
Sergio Bacallado
October 17, 2014
Announcements
Homework 4 is due next Thursday.
Homework 2 is still being graded
Lecture 4: Finish PCA, Clustering
Reading: Sections 2.2.3, 10.3, 10.5
STATS 202: Data mining and analysis
Sergio Bacallado
September 29, 2014
Classification problem
Recall:
[Figure: scatter plot of observations]
Homework 4
Problem 1 (Exercise 8.4 #10, p. 334)
We now use boosting to predict Salary in the Hitters data set.
(a) Remove the observations for whom the salary information is unknown,
and then log-transform the salaries.
(b) Create a training set consistin
6
Association Analysis:
Basic Concepts and
Algorithms
Many business enterprises accumulate large quantities of data from their day-to-day operations. For example, huge amounts of customer purchase data are
collected daily at the checkout counters of grocer
DATA MINING
Statistics 202 Autumn, 2010
Homework 2 Solutions: due Friday, October 8th at 5pm
1. Reading:
Read about the Mahalanobis distance, in the book or on the internet.
In your own words, explain in no more than 20 lines how the Mahalanobis distance
DATA MINING
Susan Holmes
Stats202
Lecture 21
Fall 2010
Special Announcements
Do not update your version of R before the end of the
quarter.
All requests should be sent to
stats202-aut1011-staff@lists.stanford.edu.
A new homewor
DATA MINING
Susan Holmes
Stats202
Lecture 20
Fall 2010
Special Announcements
Do not update your version of R before the end of the
quarter.
All requests should be sent to
stats202-aut1011-staff@lists.stanford.edu.
A new homewor
DATA MINING
Susan Holmes
Stats202
Lecture 19
Fall 2010
Special Announcements
Do not update your version of R before the end of the
quarter.
All requests should be sent to
stats202-aut1011-staff@lists.stanford.edu.
A new homewor
DATA MINING
Susan Holmes
Stats202
Lecture 7(b)
Fall 2010
Multidimensional Scaling
From a non-technical point of view, the purpose of
multidimensional scaling (MDS) is to provide a visual
representation of the pattern of proximi
Data Mining
Cluster Analysis: Advanced Concepts
and Algorithms
Lecture Notes for Chapter 9
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Edited for STATS202, Stanford University, Fall 2010
Tan, Steinbach, Kumar
Introduction to Data Mining
4/18/2004
Data Mining
Cluster Analysis: Basic Concepts
and Algorithms
Lecture Notes for Chapter 8
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Edited for STATS202, Stanford University, Winter 2010