Lecture 19: Decision trees
Reading: Section 8.1
STATS 202: Data mining and analysis
Sergio Bacallado
November 5, 2014
Decision trees, 10,000 foot view
[Figure: the predictor space (X1, X2) partitioned into regions R1-R5 by axis-aligned split points t1, t2, t4.]
1. Find a partition of the space
of predictors.
2. Predict a constant in each set of the partition.
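The two steps above can be sketched in code. This is a hypothetical hand-built tree; the split points and region constants below are illustrative numbers, not the ones in the lecture figure:

```python
# Hypothetical hand-built decision tree: axis-aligned splits partition the
# (X1, X2) plane into regions R1..R5, and each region predicts a constant.
def predict(x1, x2):
    """Route (x1, x2) through the splits to a region, return its constant."""
    if x1 < 0.5:                 # split t1 on X1
        if x2 < 0.3:             # split t2 on X2
            return 1.0           # constant for R1
        return 2.0               # constant for R2
    if x2 < 0.6:                 # split t3 on X2
        return 3.0               # constant for R3
    if x1 < 0.8:                 # split t4 on X1
        return 4.0               # constant for R4
    return 5.0                   # constant for R5

print(predict(0.2, 0.1))  # 1.0 -- the point falls in R1
```

In practice the splits and constants are learned from data (e.g., constants are region means for regression), but the prediction rule has exactly this nested-if structure.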
STATS 202 Homework 1
Hao Chen
July 3, 2011
In total: 40 points.
Problem 2 (26 points, 2 points each)
Classify the following attributes as binary, discrete, or continuous. Also classify
them as qualitative (nominal or ordinal) or quantitative (interval or ratio).
Statistics 202: Statistical Aspects of Data Mining
Professor Rajan Patel
Lecture 6 = Collaborative Filtering
Agenda:
1) Homework #2 due Monday
2) Reminder: Midterm is on Monday, July 15th
3) Collaborative Filtering
4) Simpson's Paradox
5) Review for the Midterm
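Simpson's paradox is easiest to see with a worked example: a treatment can win within every subgroup yet lose in the pooled totals. The counts below are the well-known kidney-stone numbers, used here purely as an illustration (they are not course data):

```python
# Simpson's paradox: treatment A beats B within each stratum, but B beats A
# after the strata are pooled. Counts are (successes, patients).
small = {"A": (81, 87),  "B": (234, 270)}   # small stones
large = {"A": (192, 263), "B": (55, 80)}    # large stones

def rate(successes, n):
    return successes / n

# A wins within each stratum...
print(rate(*small["A"]) > rate(*small["B"]))   # True (93% vs 87%)
print(rate(*large["A"]) > rate(*large["B"]))   # True (73% vs 69%)

# ...yet loses once the strata are pooled.
tot_A = (small["A"][0] + large["A"][0], small["A"][1] + large["A"][1])
tot_B = (small["B"][0] + large["B"][0], small["B"][1] + large["B"][1])
print(rate(*tot_A) < rate(*tot_B))             # True (78% vs 83%)
```

The reversal happens because treatment A was given mostly to the harder (large-stone) cases, so the stratum sizes act as a confounder.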
Homework 5 - Stats 202
1) Read Chapter 5 (Sections 5.2, 5.3, 5.5 and 5.6).
2) This question deals with In Class Exercise #34.
a) Repeat In Class Exercise #34 for the k-nearest neighbor classifier for
k = 1, 2, ..., 10. (We did k = 1 in class.)
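As a sketch of the method the exercise refers to (not the in-class R code), a k-nearest-neighbor classifier can be written from scratch in a few lines; the toy training points and labels below are made up:

```python
# Minimal k-nearest-neighbor classifier: classify a point by majority vote
# among its k closest training points (squared Euclidean distance).
from collections import Counter

def knn_predict(train, labels, x, k):
    """Return the majority label among the k training points nearest to x."""
    order = sorted(range(len(train)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], x)))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["blue", "blue", "blue", "red", "red", "red"]
print(knn_predict(train, labels, (0.5, 0.5), 3))  # blue
print(knn_predict(train, labels, (5.5, 5.5), 3))  # red
```

Re-running `knn_predict` for k = 1, 2, ..., 10 and recording the error rate at each k is exactly the loop the exercise asks for.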
1. Exercise 2 from section 2.4
a. n = 500 (the top 500 firms); p = 3 (profit, number of employees, industry). Regression and inference
(we are trying to understand the factors that affect CEO salary, as opposed to trying to
predict CEO salary based on them).
Homework 3 - Stats 202
1) Read Chapter 6 (only sections 6.1 and 6.7).
2) This question uses the sample of 10,000 Ohio house prices
at http://sites.google.com/site/stats202/homework-2/OH_house_prices.csv.
Download the data set to your computer.
STATS 202 Homework 2
Austen Head
July 6, 2011
Disclaimer: These are sample solutions. For some problems there may be other acceptable answers.
R code to run before each of the problems:
> hw2website <- "http://sites.google.com/site/stats202/homework-2/"
Lecture 7: Model selection and
regularization
Reading: Sections 6.1-6.2
STATS 202: Data mining and analysis
Rajan Patel
What do we know so far
In linear regression, adding predictors always decreases the
training error or RSS.
However, adding predictors does not necessarily decrease the test error.
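The first point can be made concrete with a tiny numerical sketch (the data are made up): adding a predictor to an intercept-only model can only lower the training RSS, since least squares could always fall back to a zero coefficient.

```python
# Training RSS never increases as predictors are added: compare the
# intercept-only fit to a simple least-squares fit on one predictor x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1]
n = len(x)

ybar = sum(y) / n
rss0 = sum((yi - ybar) ** 2 for yi in y)        # RSS with no predictors

# Closed-form simple linear regression coefficients.
xbar = sum(x) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
rss1 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

print(rss1 <= rss0)  # True: the richer model fits the training data at least as well
```

The same monotonicity is why training RSS alone cannot be used to choose among models of different sizes.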
Lecture 6: The Bootstrap
Reading: Chapter 5
STATS 202: Data mining and analysis
Rajan Patel
Cross-validation vs. the Bootstrap
Cross-validation: provides estimates of the (test) error.
The Bootstrap: provides the (standard) error of estimates.
One of
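The bootstrap point can be sketched directly: resample the data with replacement many times, recompute the statistic on each resample, and take the standard deviation of the replicates as the standard error. The data values and seed below are made up for illustration:

```python
# Bootstrap estimate of the standard error of the sample mean.
import random
import statistics

random.seed(0)
data = [2.3, 1.9, 3.1, 2.8, 2.5, 3.4, 1.7, 2.9, 3.0, 2.2]

B = 2000
boot_means = []
for _ in range(B):
    # Draw n observations with replacement from the original sample.
    resample = [random.choice(data) for _ in data]
    boot_means.append(statistics.mean(resample))

se_boot = statistics.stdev(boot_means)  # bootstrap SE of the mean
print(0.05 < se_boot < 0.5)             # True: close to s / sqrt(n) here
```

For the sample mean this roughly reproduces the analytic formula \( s/\sqrt{n} \); the value of the bootstrap is that the same recipe works for statistics with no such formula.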
Problem 1
a) This is a regression problem in which we are mostly interested in inference; \( n=500 \), \( p=3 \). Note:
The variable industry is categorical; if there are many categories, this might be represented as several
predictors.
b) This is a classification problem.
Lecture 4: Classification, Clustering
STATS 202: Data mining and analysis
Rajan Patel
Classification problem
[Figure: scatterplot of observations from two classes over the predictors X1 and X2.]
Lecture 1:
Course logistics,
Supervised vs. Unsupervised learning,
Bias-Variance tradeoff
STATS 202: Data mining and analysis
Rajan Patel
Syllabus
Videos: Every lecture will be recorded by SCPD.
Email policy: Please use the stats202 google group for course questions.
Lecture 8: Decision trees
Reading: Section 8.1
STATS 202: Data mining and analysis
Rajan Patel
Decision trees, 10,000 foot view
[Figure: the predictor space (X1, X2) partitioned into regions R1-R5 by axis-aligned split points t1, t2, t4.]
1. Find a partition of the space
of predictors.
2. Predict a constant in each set of the partition.
Errata 1
Errata for Introduction to Data Mining
by Tan, Steinbach, and Kumar.
Please send all error reports to dmbook@cs.umn.edu
Preface
Page x, last sentence of first paragraph: The email a
Homework 2 solutions
Problem 1
a) Since \( X \) is uniformly distributed on \( [0,1] \), the probability that a given observation will be used
to make the prediction is 10%. Thus, when \( p=1 \), on average, 10% of the available observations will
be used to make the prediction.
1. Chapter 6, Exercise 3
a. Training RSS will steadily decrease. When \( s = 0 \), all coefficients are zero. As \( s \) increases
from 0, the constraint on the coefficients loosens, making the model more flexible.
As a model becomes more flexible, its training RSS decreases.
STATS 202 HW 1
Problem: 1
Background:
Error: the difference between the actual (true) value and the value predicted by the model.
Error term has two components:
1. Reducible error
a. Consists of variance and bias; one needs to strike a balance between the two.
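The balance described above is the bias-variance decomposition of the expected test error; written out in the course text's notation:

\[
E\left[(y_0 - \hat f(x_0))^2\right] = \operatorname{Var}\!\big(\hat f(x_0)\big) + \left[\operatorname{Bias}\!\big(\hat f(x_0)\big)\right]^2 + \operatorname{Var}(\epsilon),
\]

where the last term, \( \operatorname{Var}(\epsilon) \), is the irreducible error: no choice of model can remove it.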
HW 2
Problem 1
a) 0.1
b) \( 0.1^2 = 0.01 \)
c) \( 0.1^{100} \)
d) In the above parts, the fraction is \( (0.1)^p \). In the more general case, the fraction is the neighborhood
radius raised to the power of the dimension, which estimates the underlying volume
of the hypersphere (n-ball).
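The pattern in part (d) is easy to check numerically; this short sketch (with a hypothetical helper name) evaluates the fraction \( (0.1)^p \) for a few dimensions:

```python
# Fraction of observations inside a neighborhood spanning 10% of the range
# of each of p predictors: (0.1)^p, which shrinks exponentially in p.
def fraction(p, r=0.1):
    """Expected fraction of uniformly distributed points captured."""
    return r ** p

print(fraction(1))            # 0.1
print(fraction(2))            # about 0.01 -- only 1% of observations
print(fraction(100) < 1e-90)  # True: essentially no near neighbors remain
```

This is the curse of dimensionality: with even moderately large p, "local" neighborhoods contain almost no data.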
Final Project Report
By Xiaogang Dong
Stanford ID # 05638478
Abstract
The main target of this project is to detect anomalous online credit card transactions. I try six
different classifiers: Naïve Bayes, Decision Tree, K-Nearest Neighbor, Support Vector
Machine
Homework 1
Jayanth
SUID# 06166180
Problem 1
a) This is a regression problem, as the response variable, CEO salary, is continuous rather than
discrete and takes values in \( \mathbb{R}_{+} \). In the regression problem, we will test the hypothesis of whether the other variables
(profit, number of employees
Stats202 Project Report
By Jayanth
Stanford ID #06166180
Abstract
The main target of this project is to make relevance predictions for the search engine query
using 10 different signals. The relevance prediction is 1 if the url is relevant for the query