STATS 202 Homework 1
Hao Chen
July 3, 2011
In total: 40 points.
Problem 2 (26 points, 2 points each)
Classify the following attributes as binary, discrete, or continuous. Also classify
them as qualitative (nominal or ordinal) or quantitative (interval or ratio).
Statistics 202: Statistical Aspects of Data Mining
Professor Rajan Patel
Lecture 6: Collaborative Filtering
Agenda:
1) Homework #2 due Monday
2) Reminder: Midterm is on Monday, July 15th
3) Collaborative Filtering
4) Simpson's Paradox
5) Review for the Midterm
Homework 5 - Stats 202
1) Read Chapter 5 (Sections 5.2, 5.3, 5.5 and 5.6).
2) This question deals with In Class Exercise #34.
a) Repeat In Class Exercise #34 for the k-nearest neighbor classifier for
k=1,2,...,10. (We did k=1 in class
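In Class Exercise #34 is not reproduced here, but the mechanics of sweeping k can be sketched. This is a minimal Python illustration (the course labs use R) on a hypothetical 1-D two-class dataset; the data and function names are invented, not part of the exercise.

```python
import random

def knn_predict(train, query, k):
    # train: list of (x, label) pairs; classify query by majority vote
    # among the k nearest points in one dimension (ties broken arbitrarily)
    neighbors = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    votes = [label for _, label in neighbors]
    return max(set(votes), key=votes.count)

random.seed(0)
# Hypothetical 1-D data: class 0 centered at 0, class 1 centered at 2
train = [(random.gauss(0, 1), 0) for _ in range(50)] + \
        [(random.gauss(2, 1), 1) for _ in range(50)]
test = [(random.gauss(0, 1), 0) for _ in range(50)] + \
       [(random.gauss(2, 1), 1) for _ in range(50)]

for k in range(1, 11):
    err = sum(knn_predict(train, x, k) != y for x, y in test) / len(test)
    print("k =", k, "test error =", round(err, 2))
```

Plotting test error against k, as in the exercise, shows how the choice of k trades off flexibility against stability.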
Homework 1
Jayanth
SUID# 06166180
Problem 1
a) Regression problem, as the response variable, CEO salary, is continuous rather than discrete and takes
values in \( \mathbb{R}_+ \). In the regression problem, we will test the hypothesis of whether other variables
(profit, number of employees
1. Chapter 4, Exercise 4
a. 10%
b. 10%^2 = 1%
c. 10%^100 = 0.1^100, a vanishingly small fraction
d. The number of available training observations within X% of a test point depends on p.
As p increases, the number of available observations that are within X% of the test point
decreases exponentially.
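The arithmetic in (a)-(c) is just the per-axis fraction raised to the number of predictors, assuming predictors uniformly distributed on [0, 1]. A two-line check:

```python
# Fraction of training observations within 10% of a test point,
# per axis, when predictors are uniform on [0, 1]: 0.10 ** p
for p in [1, 2, 100]:
    print("p =", p, "fraction =", 0.10 ** p)
```

For p = 100 the fraction is on the order of 10^-100, which is the curse of dimensionality in one number.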
Lecture 19: Decision trees
Reading: Section 8.1
STATS 202: Data mining and analysis
Sergio Bacallado
November 5, 2014
Decision trees, 10,000 foot view
[Figure: partition of the predictor space (X1, X2) into regions R1-R5 by split points t1-t4]
1. Find a partition of the space of predictors.
2. Predict a constant in each set of the partition.
1. Exercise 2 from section 2.4
a. n = 500 (top 500 firms); p = 3 (profit, number of employees, industry); Regression and inference
(we are trying to understand the factors that affect/influence CEO Salary as opposed to trying to
predict CEO salary based o
Homework 3 - Stats 202
1) Read Chapter 6 (only sections 6.1 and 6.7).
2) This question uses the sample of 10,000 Ohio house prices
at http://sites.google.com/site/stats202/homework-2/OH_house_prices.csv.
Download the data set to your
STATS 202 Homework 2
Austen Head
July 6, 2011
Disclaimer: These are sample solutions. For some problems there may be other acceptable answers.
R code to run before each of the problems:
> hw2website <- "http://sites.google.com/site/stats202/homework-2/"
Lecture 5: Evaluation and Training
STATS 202: Data mining and analysis
Rajan Patel
Evaluating a classification method
We have talked about the 0-1 loss:
\( \frac{1}{m} \sum_{i=1}^{m} 1(y_i \neq \hat{y}_i) \).
It is possible to make the wrong prediction for some classes more
often than others.
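Written out in code, the 0-1 loss is simply the fraction of misclassified points. A tiny Python illustration (the labels are made up):

```python
def zero_one_loss(y_true, y_pred):
    # (1/m) * sum of 1(y_i != yhat_i): the misclassification rate
    m = len(y_true)
    return sum(yt != yp for yt, yp in zip(y_true, y_pred)) / m

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(zero_one_loss(y_true, y_pred))  # 2 errors out of 5 -> 0.4
```

Because every error counts equally here, this loss cannot distinguish a classifier that errs mostly on one class from one that spreads its errors evenly, which motivates class-weighted alternatives.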
Lecture 4: Linear Regression and
Classification
Reading: Chapter 3 and Chapter 4
STATS 202: Data mining and analysis
Rajan Patel
Potential issues in linear regression
1. Interactions between predictors
2. Non-linear relationships
3. Correlation of error terms
Lecture 3: Linear Regression
STATS 202: Data mining and analysis
Rajan Patel
Simple linear regression
\( y_i = \beta_0 + \beta_1 x_i + \epsilon_i \)
\( \epsilon_i \sim N(0, \sigma^2) \) i.i.d.
[Figure 3.1: Sales plotted against TV]
Simple linear regression
\( y_i = \beta_0 + \beta_1 x_i + \epsilon_i \)
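The least-squares estimates for this model have a closed form: the slope is the sample covariance of x and y over the variance of x, and the intercept makes the line pass through the means. A short Python sketch (the toy numbers echo the TV axis but are invented for illustration):

```python
def fit_simple_ols(x, y):
    # Closed-form least squares: b1 = cov(x, y) / var(x), b0 = ybar - b1 * xbar
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    return b0, b1

# Toy data generated from y = 7 + 0.05 x with no noise, so the fit is exact
x = [0, 50, 100, 150, 200]
y = [7 + 0.05 * xi for xi in x]
b0, b1 = fit_simple_ols(x, y)
print(b0, b1)
```

With noiseless linear data the estimates recover the generating coefficients exactly; with noise added they recover them only on average, which is where the distributional assumption on the errors comes in.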
Lab 18: Analysis of opinion editorials from
two Stanford student newspapers
Stanford has two large student newspapers. The Stanford Daily is the main campus tabloid, while
the Stanford Review publishes conservative-leaning political articles on a biweekly basis. E
Lab 11: The wrong way and the right way
to do cross-validation
In this lab, we simulate the wrong way and the right way to perform cross validation, as explained in
Lecture 11 and in Section 7.10 of the Elements of Statistical Learning.
We will work under
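The contrast the lab simulates can be sketched in a few dozen lines. This is a hypothetical Python version, not the lab's R code: labels are pure noise, features are screened by a simple mean-difference statistic, and a nearest-centroid rule stands in for the classifier. Screening on all the data before cross-validating (the wrong way) makes a useless classifier look accurate; screening inside each training fold (the right way) reports an error near the true 50%.

```python
import random

random.seed(1)
n, p, n_keep = 40, 500, 10

# Pure noise: labels are independent of every feature, so true error is 50%
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [random.choice([-1, 1]) for _ in range(n)]

def score(j, rows, labels):
    # Screening statistic: |mean difference| of feature j between the classes
    pos = [rows[i][j] for i in range(len(rows)) if labels[i] == 1]
    neg = [rows[i][j] for i in range(len(rows)) if labels[i] == -1]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

def centroid_errors(features, train_idx, test_idx):
    # Classify each test point by the closer class centroid on the kept features
    def centroid(lbl):
        pts = [i for i in train_idx if y[i] == lbl]
        return [sum(X[i][j] for i in pts) / len(pts) for j in features]
    c_pos, c_neg = centroid(1), centroid(-1)
    errs = 0
    for i in test_idx:
        d_pos = sum((X[i][j] - cj) ** 2 for j, cj in zip(features, c_pos))
        d_neg = sum((X[i][j] - cj) ** 2 for j, cj in zip(features, c_neg))
        errs += (1 if d_pos < d_neg else -1) != y[i]
    return errs

def cv_error(select_inside_fold):
    folds = [list(range(f, n, 5)) for f in range(5)]
    if not select_inside_fold:
        # Wrong way: screen features once, using ALL the data (test folds included)
        feats = sorted(range(p), key=lambda j: -score(j, X, y))[:n_keep]
    total = 0
    for test_idx in folds:
        train_idx = [i for i in range(n) if i not in test_idx]
        if select_inside_fold:
            # Right way: screen using only the training fold
            Xtr, ytr = [X[i] for i in train_idx], [y[i] for i in train_idx]
            feats = sorted(range(p), key=lambda j: -score(j, Xtr, ytr))[:n_keep]
        total += centroid_errors(feats, train_idx, test_idx)
    return total / n

print("wrong way:", cv_error(False))
print("right way:", cv_error(True))
```

The wrong-way estimate is badly optimistic because the screening step already saw the test labels; only the right-way estimate reflects the true 50% error.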
Lab 1: Illustration of the bias-variance
decomposition
library(ggplot2)
library(splines)
set.seed(1)
Define a true function f.
f = function(x) {
  x^2 - 0.2*x^2.3333
}
Now, we sample random observations of the function at 10 input points with normal errors.
Lecture 2: Classification, Clustering
STATS 202: Data mining and analysis
Rajan Patel
Classification problem
[Figure: scatter plot of observations, X2 on the vertical axis, illustrating a classification problem]
Lecture 1:
Course logistics,
Supervised vs. Unsupervised learning,
Bias-Variance tradeoff
STATS 202: Data mining and analysis
Rajan Patel
Syllabus
Videos: Every lecture will be recorded by SCPD.
Email policy: Please use the stats202 google group for
Lab 26: Missing data
In this lab, we will work with the NLSY79 dataset. This is a longitudinal study from the Bureau of
Labor Statistics, which followed a cohort of a few thousand baby boomers from 1979 until 2010,
recording hundreds of variables every th
Lecture 6: The Bootstrap
Reading: Chapter 5
STATS 202: Data mining and analysis
Rajan Patel
Cross-validation vs. the Bootstrap
Cross-validation: provides estimates of the (test) error.
The Bootstrap: provides the (standard) error of estimates.
One of
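As a concrete example of the second point: a bootstrap estimate of the standard error of a sample mean takes only a few lines. This is an illustrative Python sketch with synthetic data; the choices of n and B are arbitrary.

```python
import random
import statistics

random.seed(0)
data = [random.gauss(10, 2) for _ in range(100)]

def bootstrap_se(data, estimator, B=1000):
    # Standard error of `estimator` from B resamples drawn with replacement
    n = len(data)
    stats = [estimator([random.choice(data) for _ in range(n)]) for _ in range(B)]
    return statistics.stdev(stats)

se_mean = bootstrap_se(data, statistics.mean)
print(round(se_mean, 3))
```

For the mean there is a formula to compare against (roughly sigma / sqrt(n), here about 0.2); the value of the bootstrap is that the same recipe works for estimators with no closed-form standard error.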
Problem 1
a) This is a regression problem in which we are mostly interested in inference; \( n=500 \), \( p=3 \). Note:
The variable industry is categorical; if there are many categories, this might be represented as several
predictors.
b) This is a class
Chapter 8: Decision Trees
Tree-based methods for regression and classification - For supervised learning
These involve stratifying or segmenting the predictor space into a number of simple regions.
Since the set of splitting rules used to segment the pred
Lecture 8: Decision trees
Reading: Section 8.1
STATS 202: Data mining and analysis
Rajan Patel
Decision trees, 10,000 foot view
[Figure: partition of the predictor space (X1, X2) into regions R1-R5 by split points t1-t4]
1. Find a partition of the space of predictors.
2. Predict a constant in each set of the partition.
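The two steps can be made concrete with a toy Python sketch. The partition below is hand-built (hypothetical split points t1, t2 and four regions) rather than learned; step 1 in a real tree would choose the splits, and step 2 is exactly the per-region mean computed here.

```python
def region(x1, x2, t1=0.5, t2=0.3):
    # Hand-built rectangular partition of (X1, X2): split X1 at t1, X2 at t2
    if x1 < t1:
        return "R1" if x2 < t2 else "R2"
    else:
        return "R3" if x2 < t2 else "R4"

def fit_region_means(points):
    # Step 2: predict a constant (the mean response) within each region
    sums, counts = {}, {}
    for x1, x2, y in points:
        r = region(x1, x2)
        sums[r] = sums.get(r, 0.0) + y
        counts[r] = counts.get(r, 0) + 1
    return {r: sums[r] / counts[r] for r in sums}

train = [(0.1, 0.1, 1.0), (0.2, 0.2, 3.0), (0.9, 0.9, 10.0), (0.8, 0.7, 12.0)]
print(fit_region_means(train))  # {'R1': 2.0, 'R4': 11.0}
```

A prediction for a new point is just the stored constant of whichever region the point falls into.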
Errata 1
Errata for Introduction to Data Mining
by Tan, Steinbach, and Kumar.
Please send all error reports to [email protected]
Preface
Page x, last sentence of first paragraph: The email a
Lecture 7: Model selection and
regularization
Reading: Sections 6.1-6.2
STATS 202: Data mining and analysis
Rajan Patel
What do we know so far
- In linear regression, adding predictors always decreases the
training error or RSS.
- However, adding p
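The first point is a consequence of nested optimization: the p-predictor model is the special case \( \beta_{p+1} = 0 \) of the (p+1)-predictor model, so the larger minimization can do no worse on the training data. As a sketch in the notation of Lecture 3:

```latex
% The p-predictor model is nested in the (p+1)-predictor model
% (take beta_{p+1} = 0), so minimizing over more coefficients
% can only lower, never raise, the training RSS.
\mathrm{RSS}_{p+1}
  = \min_{\beta_0,\dots,\beta_{p+1}}
      \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p+1} \beta_j x_{ij} \Big)^2
  \;\le\;
  \min_{\beta_0,\dots,\beta_p}
      \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
  = \mathrm{RSS}_p
```

This is exactly why training RSS cannot be used to choose among models of different sizes, which motivates the selection criteria of Sections 6.1-6.2.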
Lecture 10: Support vector classifier
Reading: Sections 9.1-9.2
STATS 202: Data mining and analysis
Rajan Patel
Hyperplanes and normal vectors
- Consider a p-dimensional space of predictors.
- A hyperplane is an affine space which separates the space
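To make "normal vector" concrete: a hyperplane in p dimensions is the set of points where \( \beta_0 + \beta \cdot x = 0 \), and the sign of that expression says which side of the plane a point lies on. A small Python sketch (the particular plane is made up for illustration):

```python
import math

def side(b0, b, x):
    # Sign of b0 + b.x determines which side of the hyperplane x falls on
    return b0 + sum(bj * xj for bj, xj in zip(b, x))

def signed_distance(b0, b, x):
    # Dividing by ||b|| turns the raw score into a signed Euclidean distance
    return side(b0, b, x) / math.sqrt(sum(bj * bj for bj in b))

b0, b = -1.0, [1.0, 1.0]  # the hyperplane x1 + x2 = 1 in the plane
print(side(b0, b, [1.0, 1.0]))                       # positive side
print(signed_distance(b0, b, [0.0, 0.0]))            # negative side, distance 1/sqrt(2)
```

The support vector classifier of Sections 9.1-9.2 classifies by this sign and seeks the plane that maximizes the smallest such distance over the training points.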