MSE
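$E\{(Y_{\text{new}} - \hat{f}(X_{\text{new}}))^2\}$
$Y_{\text{new}} - f(X_{\text{new}}) + f(X_{\text{new}}) - E\{\hat{f}(X_{\text{new}})\} + E\{\hat{f}(X_{\text{new}})\} - \hat{f}(X_{\text{new}})$
$E\{(Y_{\text{new}} - f(X_{\text{new}}))^2\}$ = Expected conditional variance of $Y_{\text{old}}$
$E\{(f(X_{\text{new}}) - E\{\hat{f}(X_{\text{new}})\})^2\}$ = Expected squared bias in $\hat{f}$
$E\{(E\{\hat{f}(X_{\text{new}})\} - \hat{f}(X_{\text{ne}}$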
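The decomposition of the MSE into a noise term, a squared-bias term, and a variance term can be checked numerically. The following is a minimal sketch under stated assumptions, not the course's own example: the true regression function `true_f`, the noise level `sigma`, the sample sizes, and the choice of a straight-line fit as the estimator are all hypothetical and chosen only for illustration.

```python
import math
import random


def true_f(x):
    # Assumed "true" regression function f; chosen only for illustration.
    return math.sin(2 * math.pi * x)


def fit_line(xs, ys):
    # Closed-form simple linear regression; returns (intercept, slope).
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def bias_variance_demo(n_train=30, n_reps=500, n_new=100, sigma=0.5, seed=0):
    rng = random.Random(seed)
    x_new = [rng.random() for _ in range(n_new)]  # fixed set of new X values
    sum_pred = [0.0] * n_new
    sum_pred_sq = [0.0] * n_new
    sum_sq_err = 0.0
    for _ in range(n_reps):
        # Draw a fresh training set and refit the estimator.
        xs = [rng.random() for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0.0, sigma) for x in xs]
        b0, b1 = fit_line(xs, ys)
        for j, x in enumerate(x_new):
            pred = b0 + b1 * x
            y_new = true_f(x) + rng.gauss(0.0, sigma)
            sum_pred[j] += pred
            sum_pred_sq[j] += pred ** 2
            sum_sq_err += (y_new - pred) ** 2
    mse = sum_sq_err / (n_reps * n_new)
    noise = sigma ** 2  # conditional variance of Y given X
    mean_pred = [s / n_reps for s in sum_pred]
    bias_sq = sum((true_f(x) - m) ** 2 for x, m in zip(x_new, mean_pred)) / n_new
    variance = sum(sq / n_reps - m ** 2
                   for sq, m in zip(sum_pred_sq, mean_pred)) / n_new
    return mse, noise, bias_sq, variance


mse, noise, bias_sq, variance = bias_variance_demo()
print("simulated MSE:          ", mse)
print("noise + bias^2 + var:   ", noise + bias_sq + variance)
```

The two printed numbers should agree up to Monte Carlo error. Because a straight line cannot follow the assumed sine curve, most of the reducible error in this sketch comes from the squared-bias term.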
Homework 2
Statistics W4240: Data Mining
Columbia University, Fall 2015
Due Wednesday, October 7
For your .R submission, submit a file for each question labeled hw02 q1.R, and so on. The write
up should be saved as a .pdf and under 8MB.
DO NOT submit .rar,
Statistics W4240: Data Mining
Columbia University
Fall 2015
Version: November 25, 2015. The syllabus is subject to change, so look for the version with
the most recent date.
Course Description
Massive data collection and storage capacities have led to new
Data Mining
W4240
Prof. Rahul Mazumder
Columbia Statistics
Lecture Handout #9
Outline
Broadening Linear Regression
Polynomial Regression
Some Pitfalls
Nonlinearity
Heteroscedasticity
Outliers
Collinearity
Linear Regression
[Figure: plot; axis label y]
General guidelines:
*) The test is cumulative. More weight will be placed on the material after Midterm 1.
*) You will be tested based on your conceptual understanding of the material.
*) I will not ask about R coding examples, R functions, etc.
*) There will
General guidelines:
*) You will be tested based on your conceptual understanding of the material.
*) I will not ask about R coding examples, R functions, etc.
*) There will be minor computational exercises, but they should be "doable" with a simple calculator.
*)
General guidelines:
*) The test is cumulative. More weight will be placed on the material after Midterms 1-2.
*) You will be tested based on your conceptual understanding of the material.
*) I will not ask about R coding examples, R functions, etc.
*) There w
Homework 1
Statistics W4240: Data Mining
Columbia University
Due Dates
T/Th section: due on Jan 29th, 2015
M/W section: due on Feb 2nd, 2015
For material from the James book, print out what appears on your R screen if there are no
values requested (such a
Homework 2
Statistics W4240: Data Mining
Columbia University
M/W Class: Due date Feb 25th (before class starts)
T/Th Class: Due date Feb 24th (before class starts)
For your .R submission, submit a file for question 2 labeled hw02 q2.R. The write up should
Homework 3
Statistics W4240: Data Mining
Columbia University
Due Date M/W section: March 30 (before class starts)
Due Date T/Th section: March 31 (before class starts)
Note:
1. The teaching staff has been receiving some emails from students about late s
Homework 4
Statistics W4240: Data Mining
Columbia University
Due date: M/W class, April 22 (before class)
Due date: T/R class, April 21 (before class)
Note:
1. Please try to be punctual in submitting homeworks via courseworks.
2. You may discuss problems
Homework 5 (Final Homework)
Statistics W4240: Data Mining
Columbia University
Due date: T/Th Class May 12th, 7:40 PM
Due date: M/W Class May 11th, 1:10 PM
Problem 1. (10 Points) James 6.8.1
Problem 2. (10 Points) James 6.8.3
Problem 3. (10 Points) James 6
Course: STAT W4240
Title: Data Mining
Semester: Spring 2015
Quiz 0
Explanation
Turn in this work on or before Jan 24th, 5 PM on courseworks. Submissions are
all electronic. No other form of submission will be accepted. No late work will be
accepted.
As li
Homework 6
Statistics W4240: Data Mining
Columbia University, Fall 2015
Due Wednesday, November 9
For your .R submission, submit files for each question labeled hw06 q1.R and so on. The
write up should be saved as a .pdf of size less than 4MB. DO NOT submit
Homework 5
Statistics W4240: Data Mining
Columbia University, Fall 2015
Due Monday, November 30
For your .R submission, submit a file for each question labeled hw05 q1.R and so on. The write
up should be saved as a .pdf of size less than 6MB. DO NOT submit
Some notation
f(x): the conditional expectation of Y given X.
Why estimate conditional expectations, conditional
modes?
One answer has to do with prediction.
What is prediction?
You are going to get a new Xnew (from the same
distribution as the origi
Linear Regression
Linear regression models are a class of models for the
conditional expectation of Y given X.
These are models in which the conditional expectation is a linear function
of the parameters.
As we shall see, the class is quite flexible!
For example
E{
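To illustrate why "linear in the parameters" leaves room for flexibility, here is a minimal sketch that fits a quadratic curve by ordinary least squares. The model is nonlinear in x but linear in its coefficients, so the usual normal equations apply. The helper names `solve_linear_system` and `fit_polynomial`, and the noise-free example data, are assumptions made for this illustration and do not come from the course materials.

```python
def solve_linear_system(a, b):
    # Gaussian elimination with partial pivoting; a is k x k, b has length k.
    k = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, k))
        x[r] = (m[r][k] - tail) / m[r][r]
    return x


def fit_polynomial(xs, ys, degree):
    # Design matrix columns are 1, x, x^2, ..., x^degree.
    design = [[x ** p for p in range(degree + 1)] for x in xs]
    k = degree + 1
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free data generated from 1 - 2x + 3x^2.
xs = [i / 10 for i in range(-10, 11)]
ys = [1.0 - 2.0 * x + 3.0 * x ** 2 for x in xs]
coefs = fit_polynomial(xs, ys, degree=2)
print(coefs)
```

Since the data here contain no noise, the fitted coefficients should match the generating values up to floating-point error. The only change from a straight-line fit is the extra column in the design matrix.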
ANOVA
$\sum_{i=1}^{n} (Y_i - [\beta_0 + \beta_1 X_{i1} + \dots + \beta_{j-1} X_{i,j-1}])^2$
The OLS
Also, n times the naive estimator of expected mean squared
error.
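As a small numerical illustration of the sum-of-squares criterion, the sketch below computes the least squares line for a toy data set, evaluates the residual sum of squares there, and confirms that nudging the coefficients never lowers it. Dividing the minimized sum by n gives the naive in-sample estimate of mean squared error. The data values and the function names `ols_line` and `rss` are assumptions made for this example, and a single predictor is used for brevity.

```python
def ols_line(xs, ys):
    # Closed-form least squares for one predictor; returns (b0, b1).
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def rss(xs, ys, b0, b1):
    # Residual sum of squares for the line b0 + b1 * x.
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_line(xs, ys)
rss_ols = rss(xs, ys, b0, b1)
naive_mse = rss_ols / len(xs)  # naive (in-sample) estimate of MSE

# RSS at coefficients shifted slightly away from the OLS solution.
rss_perturbed = [rss(xs, ys, b0 + d0, b1 + d1)
                 for d0 in (-0.1, 0.0, 0.1)
                 for d1 in (-0.1, 0.0, 0.1)]

print("OLS coefficients:", b0, b1)
print("RSS at OLS:", rss_ols, "naive MSE:", naive_mse)
print("smallest perturbed RSS:", min(rss_perturbed))
```

The naive estimate is computed on the same data used for fitting, so it tends to understate the error on new observations.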
ANOVA
Partitioning the sum of squares
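$\mathrm{SSE} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
$\mathrm{SSR} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$
Correction $= n\bar{Y}^2$
$\sum_{i=1}^{n} Y_i^2 = n\bar{Y}^2 + \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = n$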
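The partition can be verified on a small example. The sketch below assumes a least squares line with an intercept, for which the total sum of squares about the mean is known to split into SSR plus SSE. It checks that split and the identity relating the raw sum of squares to the correction term. The data and the variable names are hypothetical and serve only as an illustration.

```python
def ols_line(xs, ys):
    # Closed-form least squares for one predictor; returns (b0, b1).
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(ys)

b0, b1 = ols_line(xs, ys)
fitted = [b0 + b1 * x for x in xs]
y_bar = sum(ys) / n

sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # error sum of squares
ssr = sum((f - y_bar) ** 2 for f in fitted)          # regression sum of squares
sst = sum((y - y_bar) ** 2 for y in ys)              # total about the mean
correction = n * y_bar ** 2                          # correction for the mean
sum_y_sq = sum(y ** 2 for y in ys)                   # raw sum of squares

print("sum of Y^2:", sum_y_sq)
print("correction + SST:", correction + sst)
print("SST:", sst, "SSR + SSE:", ssr + sse)
```

Both pairs of printed quantities should agree up to floating-point rounding.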
Course
Data Mining
STAT S4240.001
503 Hamilton
MTuWTh 10:45 AM - 12:20 PM
Instructor
Daniel Rabinowitz
1014 School of Social Work Building
(212) 851-2141
dan@stat.columbia.edu
Teaching Assistant
Yixin Wang
1023 School of Social Work Building
(212) 851-215
First Problem Set
1. In the lecture notes, it says that we will be looking at models and
methods for supervised and unsupervised problems.
(a) What is the difference between a model and a method?
A statistical model is a parameterized family of probability
Exercises to prepare for the first day.
Look up or remember the definitions of
Expectation
Variance
Covariance
Conditional distribution
Conditional expectation
Conditional variance
Independent random variables
Conditional covariance
Exercises (cont.)
And prov
Second Problem Set
1. This exercise is to work through in some detail the bias-variance tradeoff calculations in a specific example. In this example, the predictor
is one-dimensional, and takes values in the interval from 0 to 1, and
the distribution of the p
Solutions to the Review Exercises
There is not much in the way of details or intuition in these solutions.
So for those of you who found the exercises difficult, you'll have to work
through the steps carefully. You should also be sure that you have some
intui
Homework 4
Statistics W4240: Data Mining
Columbia University, Fall 2015
Due Wednesday, November 11
For your .R submission, submit a file for each question labeled hw04 q1.R and so on. The
write up should be saved as a .pdf of size less than 8MB. DO NOT subm
Midterm Exam
STAT W4240 Section 001: Data Mining
March 11, 2014
Explanation
This exam is to be done in-class. You have 75 minutes to complete the entirety. All solutions should
be written in the accompanying blue book. On the cover of the blue book, pleas