Homework 2 Solution
Statistics W4240: Data Mining
Columbia University, Fall 2015
Due Wednesday, October 7
1. Problem 1. (35 Points)
(a). (2 Points) Produce column and row means using the commands colMeans() and
rowMeans().
p1 <- as.matrix(read.csv("hw02_q
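As a minimal sketch of what colMeans() and rowMeans() return, the example below uses a small made-up matrix `m`. It is a hypothetical stand-in, not the homework data, whose file name is cut off above.

```r
# Hypothetical 3 x 2 matrix (filled column by column); not the homework data
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)

col_means <- colMeans(m)  # one mean per column (length 2)
row_means <- rowMeans(m)  # one mean per row (length 3)

print(col_means)
print(row_means)
```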
Homework 3
Statistics W4240: Data Mining
Columbia University, Fall 2015
Due Wednesday, October 21
For your .R submission, submit a file for questions 2, 5, and 6 labeled hw02_q2.R, and so on.
The write up should be saved as a .pdf and under 8MB.
DO NOT submi
Homework 1
Statistics W4240: Data Mining
Columbia University, Fall 2015
Due Wednesday, September 23
For material from the James book, print out what appears on your R screen if there are no
values requested (such as 2.8 (a)). For your .R submission, place
Midterm 1 solutions
Statistics W4240 - Data Mining - Professor: Rahul Mazumder. Section: 1
March 26, 2015
Problem 1
All of a), b), and c) are false. Notice that all of them explain the fact that $\sum_i r_i x_{i,1} > 0$.
However, covariates are always orthogonal to the
Midterm Exam
STAT W4240 Section 001: Data Mining
March 11, 2014
Explanation
This exam is to be done in-class. You have 75 minutes to complete the entirety. All solutions should
be written in the accompanying blue book. On the cover of the blue book, pleas
Data Mining
W4240 Section 001
Prof. Giovanni Motta
Columbia University, Department of Statistics
September 21, 2015
Outline
Administrative Notes
Toolkit: Discrete Distributions
(Interjection: Describing Random Variables)
Toolkit: Continuous Distributions
Data Mining
S4240 Section 001
Giovanni Motta
Columbia University, Department of Statistics
September 28, 2015
Outline
Today: Principal components analysis (PCA)
1. PCA math
2. PCA examples
3. PCA with R
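A minimal sketch of what "PCA with R" can look like, assuming base R's prcomp() and simulated data. The matrix `X`, its dimensions, and the induced correlation are illustrative assumptions, not the lecture's own example.

```r
set.seed(1)

# Simulated data (assumption): n = 100 observations, p = 3 features
X <- matrix(rnorm(300), nrow = 100, ncol = 3)
X[, 2] <- X[, 1] + 0.1 * X[, 2]  # make two columns strongly correlated

# Center and scale each column, then compute the principal components
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Low-dimensional representation: scores on the first two components
scores <- pca$x[, 1:2]

print(var_explained)
```

Because two of the simulated columns are nearly collinear, the first component should account for most of the variance here.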
Reminder of data setup
Data with n observations and
#
# Introduction on R for W4240 Fall 2015
# Yixin Wang
# (Courtesy of Peter Lee for an earlier version)
#
# data frame
dat <- data.frame(list(name = c("Fred", "Bert", "Mary"), male = c(TRUE, TRUE, FALSE), age = c(18, 45, 37)))
dat
dat$income <- c(1,2,3)
dat
atta
#
# Introduction on R for W4240 Fall 2015
# Yixin Wang
# (Courtesy of Peter Lee for an earlier version)
#
# working directory
getwd()
setwd("/Users/yixinwang/Desktop/Introduction to R demo")
# As a calculator
3 + 3
3 / 5
2^2
# Assignments
x <- 3
x
y <- x^
Homework 1 Solution
Statistics W4240: Data Mining
Columbia University, Fall 2015
Due Wednesday, September 23
1. Problem 1 (James Ex 2.8)
(a) Data can be read from a local file path, or directly from the website, using the function
read.csv():
college <- read.csv(
Homework 2
Statistics W4240: Data Mining
Columbia University, Fall 2015
Due Wednesday, October 7
For your .R submission, submit a file for questions 2, 5, and 6 labeled hw02_q2.R, and so on.
The write up should be saved as a .pdf and under 8MB.
DO NOT submit
Homework 2
Statistics W4240: Data Mining
Columbia University
Due Tuesday, February 18 (Section 01)
Due Wednesday, February 19 (Section 02)
For your .R submission, submit a file for question 2 labeled hw02_q2.R. The write up should be
saved as a .pdf and und
Data Mining
W4240 Section 001
Giovanni Motta
Columbia University, Department of Statistics
October 12, 2015
Outline
Homework Discussion
Broadening Linear Regression
Polynomial Regression
Some Pitfalls
Nonlinearity
Heteroscedasticity
Outliers
Collinearity
Data Mining
W4240 Section 001
Giovanni Motta
Columbia University, Department of Statistics
October 7, 2015
Outline
Basic Linear Regression
Accuracy of Linear Regression
Multiple Linear Regression
Connecting Linear Regression to PCA
Linear Regression Examp
Data Mining
W4240 Section 001
Giovanni Motta
Columbia University, Department of Statistics
September 30, 2015
Dimensionality Reduction
Summary of last week:
have high dimensional continuous data
want to find a low dimensional representation
Principal Componen
Data Mining
W4240 Section 001
Prof. Giovanni Motta
Columbia University, Department of Statistics
September 9, 2015
Course Information
Course: Data Mining
Number: STAT W4240, Section 001
Course Website: Courseworks, Piazza (for message board)
Instructor: P
Data Mining
W4240 Section 001
Giovanni Motta
Columbia University, Department of Statistics
October 5, 2015
Today:
Supervised Learning
Mean Squared Error
Bias-Variance Tradeoff
Errors in Classification
Image Processing Example
Outline
Supervised Learning
Mean
Data Mining (W4240 Section 001)
Shrinkage
Giovanni Motta
Columbia University, Department of Statistics
November 11, 2015
Outline
Reminder from last time: Subset Selection
Motivation: subset Selection Shrinking coefficients
Regularization
Regularization 2: Ri
Data Mining (W4240 Section 001)
Subset Selection
Giovanni Motta
Columbia University, Department of Statistics
November 9, 2015
Outline
Motivation: Linear Regression
Subset Selection
Optimism
Model Selection Criteria
Example
Data Mining
W4240 Section 001
Giovanni Motta
Columbia University, Department of Statistics
October 19, 2015
Logistic Regression
Recall from last time:
have binary responses (0 or 1)
$Y_i \sim \mathrm{Ber}(p(x_i))$
use a linear model for log odds:
$\log \frac{p(x_i)}{1 - p(x_i)} = \beta_0 +$
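A minimal sketch of fitting a model that is linear in the log odds, assuming a single predictor and simulated data. The right-hand side of the equation is cut off above, so the one-predictor form, the coefficient values, and the variable names below are illustrative assumptions.

```r
set.seed(1)

# Simulated binary responses (assumption): log odds = -0.5 + 2 * x
n <- 200
x <- rnorm(n)
p <- plogis(-0.5 + 2 * x)           # plogis() is the inverse of the log odds
y <- rbinom(n, size = 1, prob = p)  # Bernoulli draws with probability p

# Logistic regression: binomial family with its default logit link
fit <- glm(y ~ x, family = binomial)

log_odds <- predict(fit, type = "link")      # fitted linear predictor
prob_hat <- predict(fit, type = "response")  # fitted probabilities

print(coef(fit))
```

Applying plogis() to `log_odds` recovers `prob_hat`, which is the relationship between the linear predictor and the Bernoulli probability.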
Data Mining
W4240 Section 001
Giovanni Motta
Columbia University, Department of Statistics
October 28, 2015
Outline
Reviewing Estimators
Cross Validation
Data Preprocessing
Some Estimators
R
Data Mining
W4240 Section 001
Yixin Wang
Columbia University, Department of Statistics
November 4, 2015
Outline
Generalization Error
Bootstrap
Bootstrap Examples
Towards Bagging
Bootstrap Summary
Data Mining
W4240 Section 001
Giovanni Motta
Columbia University, Department of Statistics
October 21, 2015
Outline
Classification: Why and When
Naive Bayes Classification
The Naive assumption: what and why
The Dangers of Naiveté
Naive Bayes in Practice
Data Mining
W4240 Section 001
Giovanni Motta
Columbia University, Department of Statistics
October 14, 2015
Outline
Classification
Logistic Functions
Logistic Regression
Optimization for Logistic Regression
Variants of Logistic Regression
Examples
Data Mining
W4240 Section 001
Prof. Hannah
Columbia University, Department of Statistics
February 22, 2016
Outline
Administrative
Broadening Linear Regression
Polynomial Regression
Some Pitfalls
Nonlinearity
Heteroscedasticity
Outliers
Collinearity
Homewo
Data Mining
W4240 Section 001
Prof. Hannah
Columbia University, Department of Statistics
February 8, 2016
Outline
Today:
1. Principal components analysis (PCA)
2. PCA math; answering questions
3. PCA examples
4. PCA with R
Reminder of data setup
Data with
Data Mining
W4240 Section 001
Prof. Hannah
Columbia University, Department of Statistics
February 3, 2016
Outline
1. High dimensional data
2. Principal components analysis (PCA) overview
3. (Review: eigenvalues/vectors and spectral decomposition)
4. PCA c
Data Mining
W4240
Prof. Rahul Mazumder
Columbia Statistics
Subset Selection/ Model Selection
Outline
Motivation: Linear Regression
Subset Selection
Optimism
Model Selection Criteria
Example
Data Mining
W4240
Prof. Rahul Mazumder
Columbia Statistics
Lecture Handout # 13
Generalization
Modeling for prediction:
1. get data
2. choose a model
3. fit the model
4. make predictions for new data
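The four steps above can be sketched in R as follows. This is a minimal illustration under stated assumptions: the data are simulated, the model is a simple linear regression, and the 70/30 split is arbitrary. None of these choices come from the handout.

```r
set.seed(1)

# 1. get data (simulated here; a hypothetical stand-in for real data)
n <- 100
dat <- data.frame(x = runif(n))
dat$y <- 1 + 2 * dat$x + rnorm(n, sd = 0.3)
train <- dat[1:70, ]    # used to fit the model
test  <- dat[71:100, ]  # held out to play the role of new data

# 2. choose a model and 3. fit the model (simple linear regression)
fit <- lm(y ~ x, data = train)

# 4. make predictions for new data
pred <- predict(fit, newdata = test)

# Mean squared error on the held-out rows
test_mse <- mean((test$y - pred)^2)
print(test_mse)
```

Evaluating on rows that were not used for fitting is what separates a measure of prediction quality on new data from the training error.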
Generalization: making high quality predictions for new d
Data Mining
W4240
Prof. Rahul Mazumder
Columbia Statistics
Lecture Handout # 7
Outline
Supervised Learning
Mean Squared Error
Bias-Variance Tradeoff
Errors in Classification
Image Processing Example
Outline
Supervised Learning
Mean Squared Error
Bias and Vari
Data Mining
W4240 Section 001
Prof. Hannah
Columbia University, Department of Statistics
February 10, 2016
Bookkeeping
Homework 1: being graded
Homework 2: longer than hw01, so start early
If you are having trouble with R:
do the tutorials in the book
Data Mining
W4240 Section 001
Prof. Hannah
Columbia University, Department of Statistics
February 17, 2016
Outline
Basic Linear Regression
Accuracy of Linear Regression
Multiple Linear Regression
Connecting Linear Regression to PCA
Linear Regression Examp
Data Mining
W4240 Section 001
Prof. Hannah
Columbia University, Department of Statistics
February 15, 2016
Outline
Administrative Notes
Supervised Learning
Mean Squared Error
Bias and Variance
Errors in Classification
Image Processing Example
Administrati
Data Mining (W4240 Section 001)
Subset Selection
Prof. Hannah
Columbia University, Department of Statistics
March 23, 2016
Outline
Announcements
Motivation: Linear Regression
Subset Selection
Optimism
Model Selection Criteria
Example
Data Mining
W4240 Section 001
Prof. Hannah
Columbia University, Department of Statistics
March 21, 2016
Outline
Administrative
Generalization Error
Bootstrap
Bootstrap Examples
Towards Bagging
Bootstrap Summary