Statistics 202
Fall 2012
Data Mining
Assignment #1
Due Friday October 05, 2012
Prof. J. Taylor
You may discuss homework problems with other students, but you have to prepare
the written assignments yourself. Late homework will be penalized 10 points.
Plea
Statistics 202
Tony Lindsey
July 1, 2015
Homework 1
Problem 1 (Exercise 2.4 #2, p. 52)
Explain whether each scenario is a classification or regression problem, and
indicate whether we are most interested in inference or prediction. Finally,
provide n and p.
(a) We collect a set of data on the to
Homework 2
Problem 1 (Exercise 4.7 #4, p. 168)
When the number of features p is large, there tends to be a deterioration in the
performance of KNN and other local approaches that perform prediction using
only observations that are near the test observation
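This deterioration is the curse of dimensionality: nearest neighbors stop being near as p grows. The course works in R, but the Python/NumPy sketch below (function name is my own, not from the course materials) makes the effect concrete by estimating the median nearest-neighbor distance for points drawn uniformly from the unit hypercube:

```python
import numpy as np

rng = np.random.default_rng(0)

def median_nn_distance(n, p):
    """Median distance from each of n uniform points in [0,1]^p
    to its nearest neighbor."""
    X = rng.uniform(size=(n, p))
    # All pairwise squared distances via broadcasting.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)          # exclude self-distances
    return float(np.median(np.sqrt(d2.min(axis=1))))

dists = {p: median_nn_distance(200, p) for p in (1, 2, 10, 50)}
# Nearest neighbors drift farther away as p grows, so the "local"
# average that KNN computes is no longer truly local.
```

With n fixed at 200, the median nearest-neighbor distance grows steadily with p, which is exactly why KNN degrades in high dimensions.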
Homework 3
Problem 1 (Exercise 6.8 #3, p. 260)
Suppose we estimate the regression coefficients in a linear regression model
by minimizing
\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s
for a particular value of s. For parts (a) through (e), indicate wh
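The exercise is pencil-and-paper, but its claims about training RSS can be checked numerically. Below is a minimal Python/NumPy coordinate-descent lasso (an illustrative sketch with my own function names; the course itself uses R), written in the penalized form that is equivalent to the constrained problem above, where a larger budget s corresponds to a smaller penalty lam:

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate-descent lasso: minimize (1/2)||y - Xb||^2 + lam * sum |b_j|.
    Equivalent to the constrained form: larger s <=> smaller lam."""
    b = np.zeros(X.shape[1])
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            # Correlation of column j with the partial residual.
            rho = X[:, j] @ (y - X @ b + X[:, j] * b[j])
            # Soft-thresholding update for coordinate j.
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return b

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(size=60)

# Relax the constraint (grow s, i.e. shrink lam) and track training RSS.
rss = [float(((y - X @ lasso_cd(X, y, lam)) ** 2).sum())
       for lam in (50.0, 10.0, 1.0, 0.0)]
```

As the budget grows, the fitted coefficients are shrunk less and the training RSS only goes down, which is the behavior the exercise asks about.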
Lecture 2: Supervised vs. unsupervised
learning, bias-variance tradeoff
Reading: Chapter 2
STATS 202: Data mining and analysis
Sergio Bacallado
September 24, 2014
Supervised vs. unsupervised learning
Samples or units
In unsupervised learning we start
Lecture 1: Course logistics, homework 0
STATS 202: Data mining and analysis
Sergio Bacallado
September 22, 2014
Syllabus
Videos: Every lecture will be recorded by SCPD.
Email policy: Please
Lecture 29: Review
Reading: All chapters in ISLR
STATS 202: Data mining and analysis
Sergio Bacallado
December 5, 2014
Announcements
Please send us all regrade requests as soon as possible.
Lecture 28: Review
Reading: All chapters in ISLR.
STATS 202: Data mining and analysis
Sergio Bacallado
December 3, 2014
Announcements
Remember to submit Homework 8 by Friday at 10am to get Kaggle credit.
Problem 1
Chapter 4, Exercise 4 (p. 168).
When the number of features p is large, there tends to be a deterioration in the performance of KNN
and other local approaches that perform prediction using only observations that are near the test
observation for
Problem 1
Complete Exercise 2 from section 2.4 of the textbook (p. 52).
2. Explain whether each scenario is a classification or regression problem, and indicate whether
we are most interested in inference or prediction. Finally, provide n and p.
(a) We co
Problem 1
Chapter 6, Exercise 3 (p. 260).
(a) As we increase s from 0, the training RSS will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
Lecture 3: Principal Components Analysis
(PCA)
Reading: Sections 6.3.1, 10.1, 10.2, 10.4
STATS 202: Data mining and analysis
Sergio Bacallado
September 26, 2014
Announcements
Homework 1 is out; due next Thursday.
Kaggle invitations have been sent.
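As a preview of what PCA computes, here is a compact Python/NumPy sketch (illustrative only; the function name and simulated data are my own, and the course itself uses R). It finds the principal directions via the SVD of the centered data matrix:

```python
import numpy as np

def pca(X, k):
    """PCA via SVD of the centered data matrix.
    Returns the top-k loading vectors (columns) and the scores."""
    Xc = X - X.mean(axis=0)                  # center each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:k].T                      # principal directions
    scores = Xc @ loadings                   # projected coordinates
    return loadings, scores

rng = np.random.default_rng(0)
# Correlated 2-D data: almost all variance lies along one direction.
z = rng.normal(size=500)
X = np.column_stack([z + 0.1 * rng.normal(size=500),
                     2 * z + 0.1 * rng.normal(size=500)])
loadings, scores = pca(X, 1)
# Fraction of total variance captured by the first component.
var_explained = scores[:, 0].var() / X.var(axis=0).sum()
```

For this data the first loading vector points roughly along (1, 2) and captures nearly all of the variance, which is the dimension-reduction idea the lecture develops.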
Lecture 5: Clustering, Linear Regression
Reading: Chapter 10, Sections 3.1-2
STATS 202: Data mining and analysis
Sergio Bacallado
October 1, 2014
Announcements
Starting next week, Julia Fukuyama will be having her office
hours after business hours for
Lecture 6: Linear Regression (continued)
Reading: Sections 3.1-3
STATS 202: Data mining and analysis
Sergio Bacallado
October 3, 2014
Multiple linear regression

Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon, \qquad \epsilon \sim N(0, \sigma^2) \text{ i.i.d.}

[Figure 3.4: the least-squares regression plane for Y as a function of X1 and X2.]
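The multiple regression model above is fit by least squares. A quick numerical check (a Python/NumPy sketch with simulated data rather than the course's R) shows the coefficients being recovered:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.5, 0.5])       # [b0, b1, b2, b3]
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# Least squares: prepend a column of ones so b0 is the intercept.
Xd = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
# With small noise, beta_hat is close to beta_true.
```

The estimate is exact up to noise of order sigma / sqrt(n), which is why the fitted plane in Figure 3.4 tracks the data so closely.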
Homework 4
Problem 1 (Exercise 8.4 #10, p. 334)
We now use boosting to predict Salary in the Hitters data set.
(a) Remove the observations for whom the salary information is unknown,
and then log-transform the salaries.
(b) Create a training set consistin
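The homework itself is done in R on the Hitters data. Purely to illustrate the mechanics of boosting for regression, here is a from-scratch Python sketch on simulated data (function names and data are my own): repeatedly fit a depth-1 tree (a stump) to the current residuals and add a shrunken copy to the model.

```python
import numpy as np

def fit_stump(X, r):
    """Least-squares regression stump: best single split on residuals r."""
    best = (np.inf, 0, 0.0, r.mean(), r.mean())
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all():
                continue                      # split must be non-trivial
            lm, rm = r[left].mean(), r[~left].mean()
            sse = ((r[left] - lm) ** 2).sum() + ((r[~left] - rm) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]

def boost(X, y, B=200, lr=0.1):
    """Boosting for squared-error loss: fit a stump to the residuals,
    add a shrunken copy to the running prediction, repeat B times."""
    pred = np.zeros(len(y))
    for _ in range(B):
        j, t, lm, rm = fit_stump(X, y - pred)
        pred += lr * np.where(X[:, j] <= t, lm, rm)
    return pred

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(150, 2))
y = np.sign(X[:, 0]) + 0.5 * X[:, 1]      # a step plus a linear trend
pred = boost(X, y)
train_mse = ((y - pred) ** 2).mean()
```

Even with very weak learners, many shrunken stumps drive the training error far below the variance of y, which is the slow-learning behavior boosting relies on.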
Lecture 4: Finish PCA, Clustering
Reading: Sections 2.2.3, 10.3, 10.5
STATS 202: Data mining and analysis
Sergio Bacallado
September 29, 2014
Classification problem
Recall:
[Figure: scatter plot of observations (marked "o") in two dimensions.]
Lecture 12: The Bootstrap
Reading: Chapter 5
STATS 202: Data mining and analysis
Sergio Bacallado
October 17, 2014
Announcements
Homework 4 is due next Thursday.
Homework 2 is still being graded
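The bootstrap estimates the variability of a statistic by resampling the data with replacement. A minimal Python sketch (my own function names; the course uses R) that can be sanity-checked against the textbook formula for the standard error of the mean:

```python
import numpy as np

def bootstrap_se(data, statistic, B=2000, seed=0):
    """Bootstrap standard error: resample the data with replacement
    B times and take the standard deviation of the statistic."""
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = np.array([statistic(data[rng.integers(0, n, size=n)])
                      for _ in range(B)])
    return float(stats.std(ddof=1))

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=2.0, size=100)

# The bootstrap works for statistics with no simple SE formula...
se_median = bootstrap_se(x, np.median)
# ...and for the mean it should roughly match s / sqrt(n).
se_mean = bootstrap_se(x, np.mean)
se_formula = float(x.std(ddof=1) / np.sqrt(len(x)))
```

The agreement for the mean is what makes the bootstrap trustworthy for statistics, like the median, where no closed-form standard error exists.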
Lecture 11: Cross validation
Reading: Chapter 5
STATS 202: Data mining and analysis
Sergio Bacallado
October 15, 2014
Validation set approach
Goal: Estimate the test error for a supervised learning method.
Strategy:
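The validation-set idea generalizes to k-fold cross-validation: split the data into k folds, hold each fold out in turn, and average the held-out errors. A small Python sketch (helper names are my own, not the course's R code), applied to ordinary least squares on simulated data:

```python
import numpy as np

def kfold_mse(X, y, fit, predict, k=5, seed=0):
    """k-fold cross-validation estimate of test MSE.
    `fit(X, y)` returns a model; `predict(model, X)` returns predictions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errs.append(((y[test] - predict(model, X[test])) ** 2).mean())
    return float(np.mean(errs))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.5, size=100)
ols_fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
ols_pred = lambda b, X: X @ b
cv_mse = kfold_mse(X, y, ols_fit, ols_pred)
```

Because every model is evaluated only on data it never saw, cv_mse estimates the test error; here it lands near the true noise variance of 0.25.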
Lecture 10: Classification examples
Reading: Chapter 4
STATS 202: Data mining and analysis
Sergio Bacallado
October 13, 2014
Example. Predicting default
Used LDA to predict credit card default in a dataset of 10K people.
Predicted yes if P(default =
Lecture 9: Classification, LDA
Reading: Chapter 4
STATS 202: Data mining and analysis
Sergio Bacallado
October 10, 2014
Review: Main strategy in Chapter 4
Find an estimate of P(Y | X). Then, given an input x0, we predict
the response as in a Bayes cla
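The plug-in Bayes rule can be made concrete with a tiny LDA example. The Python sketch below (one predictor, two classes; function names and data are my own, not the lecture's) estimates the class means, a shared variance, and the priors, then classifies by the larger posterior:

```python
import numpy as np

def lda_fit(x, y):
    """Two-class LDA in 1-D (labels in {0, 1}): class means,
    pooled variance, and class priors."""
    mu = np.array([x[y == k].mean() for k in (0, 1)])
    var = sum(((x[y == k] - mu[k]) ** 2).sum() for k in (0, 1)) / (len(x) - 2)
    prior = np.array([(y == k).mean() for k in (0, 1)])
    return mu, var, prior

def lda_posterior(x0, mu, var, prior):
    """P(Y = k | X = x0) by Bayes' rule with Gaussian class densities
    sharing one variance (the LDA model)."""
    dens = prior * np.exp(-(x0 - mu) ** 2 / (2 * var))
    return dens / dens.sum()

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 200), rng.normal(1, 1, 200)])
y = np.repeat([0, 1], 200)
mu, var, prior = lda_fit(x, y)
# Predict the class with the larger estimated posterior.
pred_hi = int(np.argmax(lda_posterior(2.0, mu, var, prior)))
pred_lo = int(np.argmax(lda_posterior(-2.0, mu, var, prior)))
```

Taking the argmax of the estimated posterior is exactly "predict as in a Bayes classifier" with the true P(Y | X) replaced by its estimate.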
Lecture 8: Classification
Reading: Chapter 4
STATS 202: Data mining and analysis
Sergio Bacallado
October 8, 2014
Classification problems
Supervised learning with a qualitative or categorical response.
Just as common, if not more common, than regression.
Lecture 7: Linear Regression (continued)
Reading: Chapter 3
STATS 202: Data mining and analysis
Sergio Bacallado
October 6, 2014
Potential issues in linear regression
1. Interactions between predictors
2. Non-linear relationships
3. Correlation of
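For issue 1, an interaction can be handled inside the linear model itself by adding a product column to the design matrix. A quick Python check on simulated data (names are my own, illustrative only) shows how much training RSS the interaction term recovers:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X1, X2 = rng.normal(size=n), rng.normal(size=n)
# True model has an interaction: the effect of X1 depends on X2.
y = 1.0 + X1 + X2 + 2.0 * X1 * X2 + rng.normal(scale=0.1, size=n)

def ols_rss(design, y):
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(((y - design @ b) ** 2).sum())

ones = np.ones(n)
rss_main = ols_rss(np.column_stack([ones, X1, X2]), y)
# Adding X1*X2 as an extra predictor captures the interaction.
rss_inter = ols_rss(np.column_stack([ones, X1, X2, X1 * X2]), y)
```

The main-effects-only fit leaves the entire interaction signal in the residuals, so its RSS is dramatically larger; the model stays linear in its (augmented) predictors throughout.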
Stats 202 Midterm Note
Lecture 1
Types of Data
- Qualitative: descriptive, categorical, 2 subtypes
  - Ordinal: meaningful rank, ordered
    o Si
- Quantitative: numerical
  - Discrete: finite, countable, integer value, does not change
    o # cups of coffee you drank today
8 Cluster Analysis: Basic Concepts and Algorithms
Cluster analysis divides data into groups (clusters) that are meaningful, useful,
or both. If meaningful groups are the goal, then the clusters should capture the
natural structure of the data. In some cas
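As a concrete instance of this grouping idea, the most basic algorithm the chapter develops, k-means, can be sketched in a few lines of Python (illustrative only, not the chapter's code; names are my own): alternate assigning each point to its nearest centroid and moving each centroid to the mean of its cluster.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and
    centroid recomputation until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

rng = np.random.default_rng(1)
# Two well-separated blobs: k-means should recover them as clusters.
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),
               rng.normal(3, 0.3, size=(100, 2))])
labels, centers = kmeans(X, 2)
```

On well-separated data like this, the recovered clusters coincide with the natural groups, which is the "meaningful groups" goal described above.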
DATA MINING
Susan Holmes
Stats202
Lecture 22
Fall 2010
Special Announcements
Do not update your version of R before the end of the
quarter.
All requests should be sent to
[email protected]
A new homewor
DATA MINING
Susan Holmes
Stats202
Lecture 21
Fall 2010
DATA MINING
Susan Holmes
Stats202
Lecture 20
Fall 2010
DATA MINING
Susan Holmes
Stats202
Lecture 19
Fall 2010