Data Mining: 36-462/36-662
Homework 2
Due Tuesday February 19 2013
(at the beginning of lecture)
Append your R code to the end of your homework. In your solutions, you should just
present your R output (e.g., numbers, tables, figures) or snippets of R code as appropriate.
Data Mining: 36-462/36-662
Homework 1
Due Tuesday February 5 2013
(at the beginning of lecture)
Data Mining: 36-462/36-662
Homework 5
Due Thursday April 11 2013
(at the beginning of lecture)
Correlation analysis 3: Measures of correlation
(continued)
Ryan Tibshirani
Data Mining: 36-462/36-662
February 21 2013
Reminder: correlation, rank correlation
Last time we learned about correlation. In the population: for
random variables X, Y ∈ R,
Cor(X, Y) = Cov(X, Y) / sqrt(Var(X) Var(Y)).
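As an illustration of this definition, here is a minimal sketch of the sample version of Pearson's correlation, written in plain Python rather than the course's R (the function name and toy data are invented for illustration):

```python
import math

def pearson(x, y):
    """Sample Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# An exactly linear relationship gives correlation 1 (up to rounding)
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```

In R the same quantity is computed by cor(x, y).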
Data Mining: 36-462/36-662
Homework 6
Due Thursday April 25 2013
(at the beginning of lecture)
Data Mining: 36-462/36-662
Homework 3
Due Thursday March 7 2013
(at the beginning of lecture)
Data Mining: 36-462/36-662
Final Project
Crime Mining (or Data Criming?)
Thursday April 25 2013
Deliverables and deadlines
Here are some key deliverables and deadlines upfront:
Your predictions and your slides are both due Thursday May 9 at 5:30pm.
Data Mining: 36-462/36-662
Homework 4
Due Thursday March 28 2013
(at the beginning of lecture)
Boosting
Ryan Tibshirani
Data Mining: 36-462/36-662
April 25 2013
Optional reading: ISL 8.2, ESL 10.1–10.4, 10.7, 10.13
Reminder: classification trees
Suppose that we are given training data (x_i, y_i), i = 1, . . . , n, with
y_i ∈ {1, . . . , K} the class labels.
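Boosting combines many such weak classifiers into a strong one. As a hedged illustration (invented names and toy data, plain Python rather than the course's R), here is a minimal AdaBoost sketch using one-dimensional decision stumps:

```python
import math

def stump_train(x, y, w):
    """Weighted best 1-D stump: predict `sign` when x > t, else -sign."""
    best = None
    for t in sorted(set(x)):
        for sign in (1, -1):
            pred = [sign if xi > t else -sign for xi in x]
            err = sum(wi for wi, yi, pi in zip(w, y, pred) if yi != pi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best  # (weighted error, threshold, sign)

def adaboost(x, y, rounds=10):
    """AdaBoost: reweight the data each round to focus on past mistakes."""
    n = len(x)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, t, sign = stump_train(x, y, w)
        err = max(err, 1e-12)                    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # this stump's vote weight
        model.append((alpha, t, sign))
        # Upweight misclassified points, downweight correct ones, renormalize
        w = [wi * math.exp(-alpha * yi * (sign if xi > t else -sign))
             for wi, xi, yi in zip(w, x, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def boost_predict(model, xi):
    """Weighted vote of all stumps."""
    score = sum(a * (s if xi > t else -s) for a, t, s in model)
    return 1 if score >= 0 else -1
```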
Tree-based methods for classification and
regression
Ryan Tibshirani
Data Mining: 36-462/36-662
April 11 2013
Optional reading: ISL 8.1, ESL 9.2
Tree-based methods
Tree-based methods for predicting y from a feature vector
x ∈ R^p divide up the feature space.
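The basic step by which a regression tree divides the feature space is a single greedy split, chosen to minimize the total squared error of fitting a constant on each side. A minimal sketch (illustrative plain Python, not course code; names and toy data are invented):

```python
def best_split(x, y):
    """Greedy search for the single split point minimizing total squared error."""
    def sse(vals):
        # Squared error of fitting the mean to a group of responses
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best_t, best_err = None, float("inf")
    for t in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        err = sse(left) + sse(right)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
```

A full tree applies this step recursively within each resulting region.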
Classification 1: Linear regression of indicators,
linear discriminant analysis
Ryan Tibshirani
Data Mining: 36-462/36-662
April 2 2013
Optional reading: ISL 4.1, 4.2, 4.4, ESL 4.1–4.3
Classification
Classification is a predictive task in which the response takes values in a
discrete set.
Classification 2: Linear discriminant analysis
(continued); logistic regression
Ryan Tibshirani
Data Mining: 36-462/36-662
April 4 2013
Optional reading: ISL 4.4, ESL 4.3; ISL 4.3, ESL 4.4
Reminder: linear discriminant analysis
Last time we defined the Bayes classifier.
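For intuition, the two-class LDA rule in one dimension with a pooled variance estimate can be sketched as follows (a hypothetical plain-Python sketch; the function name and toy data are invented, and real LDA handles p features via a pooled covariance matrix):

```python
import math

def lda_1d(x0, x1):
    """Fit two-class LDA with a single feature and pooled within-class variance."""
    n0, n1 = len(x0), len(x1)
    m0, m1 = sum(x0) / n0, sum(x1) / n1
    pooled = (sum((v - m0) ** 2 for v in x0)
              + sum((v - m1) ** 2 for v in x1)) / (n0 + n1 - 2)
    p0, p1 = n0 / (n0 + n1), n1 / (n0 + n1)  # prior class proportions

    def classify(x):
        # Linear discriminant functions; pick the class with the larger one
        d0 = x * m0 / pooled - m0 ** 2 / (2 * pooled) + math.log(p0)
        d1 = x * m1 / pooled - m1 ** 2 / (2 * pooled) + math.log(p1)
        return 0 if d0 >= d1 else 1

    return classify
```

With equal priors and equal variances, the decision boundary falls halfway between the two class means.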
Bagging
Ryan Tibshirani
Data Mining: 36-462/36-662
April 23 2013
Optional reading: ISL 8.2, ESL 8.7
Reminder: classification trees
Our task is to predict the class label y ∈ {1, . . . , K} given a feature
vector x ∈ R^p. Classification trees divide the feature space.
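Bagging fits one tree per bootstrap resample of the training data and predicts by majority vote. A hedged sketch with one-dimensional stumps standing in for trees (plain Python; names, seed, and toy data are invented for illustration):

```python
import random

def stump_fit(x, y):
    """Best 1-D threshold classifier (labels in {-1, +1}) by training error."""
    best = None
    for t in sorted(set(x)):
        for sign in (1, -1):
            err = sum(1 for xi, yi in zip(x, y)
                      if (sign if xi > t else -sign) != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda v: sign if v > t else -sign

def bagging(x, y, B=25, seed=0):
    """Fit one stump per bootstrap resample; predict by majority vote."""
    rng = random.Random(seed)
    n = len(x)
    stumps = []
    for _ in range(B):
        # Bootstrap: sample n indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(stump_fit([x[i] for i in idx], [y[i] for i in idx]))
    return lambda v: 1 if sum(s(v) for s in stumps) >= 0 else -1
```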
Model selection and validation 2: Model
assessment, more cross-validation
Ryan Tibshirani
Data Mining: 36-462/36-662
March 28 2013
Optional reading: ISL 5.1, ESL 7.10, 7.12
Reminder: cross-validation
Given training data (x_i, y_i), i = 1, . . . , n, we estimate prediction error
by holding out part of the data.
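K-fold cross-validation can be sketched as follows (an illustrative plain-Python sketch, not course code; it uses deterministic interleaved folds and an invented toy "always predict the training mean" model for clarity):

```python
def kfold_cv(x, y, fit, predict, K=5):
    """Estimate prediction error (mean squared error) by K-fold cross-validation."""
    n = len(x)
    folds = [list(range(k, n, K)) for k in range(K)]  # deterministic folds for the sketch
    total = 0.0
    for held in folds:
        keep = [i for i in range(n) if i not in held]
        # Fit on everything except the held-out fold, then score the fold
        model = fit([x[i] for i in keep], [y[i] for i in keep])
        total += sum((y[i] - predict(model, x[i])) ** 2 for i in held)
    return total / n

# Toy model for illustration: always predict the training mean
fit_mean = lambda xs, ys: sum(ys) / len(ys)
pred_mean = lambda m, xi: m
```

In practice the folds are usually a random partition of {1, . . . , n}.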
Model selection and validation 1: Cross-validation
Ryan Tibshirani
Data Mining: 36-462/36-662
March 26 2013
Optional reading: ISL 2.2, 5.1, ESL 7.4, 7.10
Reminder: modern regression techniques
Over the last two lectures we've investigated modern regression
techniques.
Modern regression 2: The lasso
Ryan Tibshirani
Data Mining: 36-462/36-662
March 21 2013
Optional reading: ISL 6.2.2, ESL 3.4.2, 3.4.3
Reminder: ridge regression and variable selection
Recall our setup: given a response vector y ∈ R^n and a matrix
X ∈ R^{n×p} of predictor variables.
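In the simplest one-predictor case the ridge coefficient has a closed form that makes the shrinkage explicit, and the lasso (this lecture's topic) instead soft-thresholds, which can set the coefficient exactly to zero. A hedged sketch in plain Python (function names and toy data invented; the soft-threshold form for the lasso holds under an orthonormal design):

```python
def ridge_1d(x, y, lam):
    """Ridge coefficient for one centered predictor: shrinks OLS toward zero."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    return xy / (xx + lam)  # lam = 0 recovers ordinary least squares

def soft_threshold(b, lam):
    """Lasso's elementwise operation under an orthonormal design."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0
```

Note the qualitative difference: ridge shrinks every coefficient proportionally but never to exactly zero, while soft-thresholding zeroes out small coefficients.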
Modern regression 1: Ridge regression
Ryan Tibshirani
Data Mining: 36-462/36-662
March 19 2013
Optional reading: ISL 6.2.1, ESL 3.4.1
Reminder: shortcomings of linear regression
Last time we talked about:
1. Predictive ability: recall that we can decompose the expected test error
into bias and variance components.
Regression 2: More perspectives, shortcomings
Ryan Tibshirani
Data Mining: 36-462/36-662
March 5 2013
Optional reading: ISL 3.2.3, 2.2.2; ESL 3.2, 7.3
Reminder: explicit formula for regression coefficients
Last time we proved that for the multiple regression of y on X, the least
squares coefficients are given by (X^T X)^{-1} X^T y.
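In the single-predictor case this formula reduces to the familiar slope and intercept. A minimal sketch (plain Python rather than R's lm(); the function name and toy data are invented):

```python
def least_squares(x, y):
    """Closed-form simple linear regression: minimize the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx  # line passes through the point of means
    return intercept, slope
```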
Regression 1: Different perspectives
Ryan Tibshirani
Data Mining: 36-462/36-662
February 28 2013
Optional reading: ISL 3.2.3, ESL 3.2
Linear regression is an old topic
Linear regression, also called the method of least squares, is an old
topic, dating back to the work of Legendre and Gauss in the early 1800s.
Correlation analysis 2: Measures of correlation
Ryan Tibshirani
Data Mining: 36-462/36-662
February 19 2013
Review: correlation
Pearson's correlation is a measure of linear association.
In the population: for random variables X, Y ∈ R,
Cor(X, Y) = Cov(X, Y) / sqrt(Var(X) Var(Y)).
Correlation analysis 1: Canonical correlation
analysis
Ryan Tibshirani
Data Mining: 36-462/36-662
February 14 2013
Review: correlation
Given two random variables X, Y ∈ R, the (Pearson) correlation
between X and Y is defined as
Cor(X, Y) = Cov(X, Y) / sqrt(Var(X) Var(Y)).
Dimension reduction 2: Principal component
analysis (continued)
Ryan Tibshirani
Data Mining: 36-462/36-662
February 7 2013
Optional reading: ISL 10.2, ESL 14.5
Reminder: projections onto unit vectors
The projection of x ∈ R^n onto a unit vector v ∈ R^n is given by
(x^T v) v.
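This formula can be checked directly with a short sketch (plain Python for illustration; the function name and vectors are invented, and the code normalizes v defensively so the unit-vector assumption always holds):

```python
import math

def project(x, v):
    """Project x onto the direction of v: (x . u) u, where u = v / ||v||."""
    norm = math.sqrt(sum(c * c for c in v))
    u = [c / norm for c in v]                # ensure a unit vector
    dot = sum(a * b for a, b in zip(x, u))   # length of x along u
    return [dot * c for c in u]
```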
Dimension reduction 1: Principal component
analysis
Ryan Tibshirani
Data Mining: 36-462/36-662
February 5 2013
Optional reading: ISL 10.2, ESL 14.5
Clustering as dimension reduction
We've thought about clustering observations, given features. But
in many problems we also want to reduce the dimension of the features
themselves.
Clustering 3: Hierarchical clustering (continued);
choosing the number of clusters
Ryan Tibshirani
Data Mining: 36-462/36-662
January 31 2013
Optional reading: ISL 10.3, ESL 14.3
Even more linkages
Last time we learned about hierarchical agglomerative clustering.
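Agglomerative clustering repeatedly merges the two closest clusters, where "closest" is defined by the linkage. A single-linkage sketch for one-dimensional points (illustrative plain Python; the function name and toy data are invented, and real implementations return the full dendrogram rather than stopping at K clusters):

```python
def single_linkage(points, K):
    """Agglomerative clustering: repeatedly merge the two closest clusters.
    Single linkage: inter-cluster distance = closest pair of points."""
    clusters = [[p] for p in points]  # start with every point in its own cluster
    while len(clusters) > K:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters
```

Swapping min for max in the distance computation would give complete linkage; the average would give average linkage.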
Clustering 2: Hierarchical clustering
Ryan Tibshirani
Data Mining: 36-462/36-662
January 29 2013
Optional reading: ISL 10.3, ESL 14.3
From K-means to hierarchical clustering
Recall two properties of K-means (K-medoids) clustering:
1. It fits exactly K clusters.
Clustering 1: K-means, K-medoids
Ryan Tibshirani
Data Mining: 36-462/36-662
January 24 2013
Optional reading: ISL 10.3, ESL 14.3
What is clustering? And why?
Clustering: task of dividing up data into groups (clusters), so that
points in any one group are more similar to each other than to points in
other groups.
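K-means makes this concrete by alternating two steps: assign each point to its nearest center, then move each center to the mean of its assigned points. A minimal one-dimensional sketch of Lloyd's algorithm (plain Python for illustration; the function name, starting centers, and toy data are invented):

```python
def kmeans(points, centers, iters=20):
    """Lloyd's algorithm in one dimension: assign points, then recompute means."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            # Assignment step: nearest current center
            nearest = min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            groups[nearest].append(p)
        # Update step: each center becomes the mean of its group
        # (an empty group keeps its old center)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups
```

The result depends on the starting centers, which is why K-means is typically run from several random initializations.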
PageRank
Ryan Tibshirani
Data Mining: 36-462/36-662
January 22 2013
Optional reading: ESL 14.10
Information retrieval with the web
Last time: information retrieval, learned how to compute similarity
scores (distances) of documents to a given query string.
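PageRank itself can be sketched as power iteration on the random-surfer transition: with probability d follow a random outgoing link, otherwise jump to a uniformly random page. An illustrative plain-Python sketch (names and the toy link graph are invented; it assumes every page appears as a key of the link dictionary):

```python
def pagerank(links, d=0.85, iters=100):
    """Power iteration for PageRank: follow a random link with probability d,
    teleport to a uniformly random page with probability 1 - d."""
    pages = sorted(links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}  # start from the uniform distribution
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}  # teleportation mass
        for p in pages:
            out = links[p]
            if out:
                share = d * r[p] / len(out)
                for q in out:
                    new[q] += share
            else:
                # Dangling page: spread its mass uniformly over all pages
                for q in pages:
                    new[q] += d * r[p] / n
        r = new
    return r
```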
Introduction to data mining
Ryan Tibshirani
Data Mining: 36-462/36-662
January 15 2013
Logistics
Course website (syllabus, lecture slides, homeworks, etc.):
http://www.stat.cmu.edu/~ryantibs/datamining
We will use the Blackboard site for the email list and grades.