Name:
10-702 Statistical Machine Learning: Midterm Exam
March 4, 2010
Submit solutions to any three of the following five problems. Clearly indicate
which problems you are submitting solutions for. Write your answers in the space provided;
additional sheets
Practice Problems: 10/36-702
1. Let (X1 , Y1 ), . . . , (Xn , Yn ) be iid. Suppose that X1 , . . . , Xn ∼ Unif(0, 1) and that
Yi = m(Xi ) + εi
where ε1 , . . . , εn are iid with mean 0 and variance σ² and are independent of the Xi 's.
Assume that m, m′, m″, m
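A quick simulation can make this model concrete. Everything below (the mean function m, the noise level, and the sample size) is an illustrative choice, not part of the problem statement:

```python
import numpy as np

# Simulate Y_i = m(X_i) + eps_i with X_i ~ Unif(0,1); m and sigma are
# illustrative choices, not from the problem.
rng = np.random.default_rng(0)
n, sigma = 500, 0.1

def m(x):
    return np.sin(2 * np.pi * x)  # a hypothetical smooth mean function

X = rng.uniform(0, 1, n)           # X_1, ..., X_n ~ Unif(0,1)
eps = rng.normal(0, sigma, n)      # iid, mean 0, variance sigma^2, indep. of X
Y = m(X) + eps

# The residuals Y_i - m(X_i) are exactly the noise, so their mean is near 0.
print(round(np.mean(Y - m(X)), 3))
```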
Random Matrix Theory
These notes are based on the following sources:
1. Introduction to the Non-asymptotic Analysis of Random Matrices by Roman Vershynin.
2. Error Bounds for Random Matrix Approximation Schemes by A. Gittens and J. Tropp.
Another excellent
Linear Regression
We observe D = {(X1 , Y1 ), . . . , (Xn , Yn )} where Xi = (Xi (1), . . . , Xi (d)) ∈ Rd and Yi ∈ R.
For notational simplicity, we will always assume that Xi (1) = 1.
Given a new pair (X, Y ) we want to predict Y from X. The conditional pre
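To make the setup concrete, here is a least-squares sketch on synthetic data; the coefficients and noise level are my own choices. The first column of ones plays the role of the Xi (1) = 1 convention above:

```python
import numpy as np

# Least-squares fit on synthetic data; the column of ones is the intercept
# (the X_i(1) = 1 convention). True coefficients are an arbitrary choice.
rng = np.random.default_rng(1)
n, d = 200, 3
beta_true = np.array([1.0, 2.0, -0.5])

X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])
Y = X @ beta_true + 0.1 * rng.normal(size=n)

# beta_hat minimizes ||Y - X beta||^2
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.round(beta_hat, 1))  # recovers approximately beta_true
```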
Nonparametric Classification
10/36-702
1
Introduction
Let h : X → {0, 1} denote a classifier, where X is the domain of X. In parametric classification we assumed that h took a very constrained form, typically linear. In nonparametric
classification we a
Concentration of Measure
1
Introduction
Often we need to show that a random quantity f (Z1 , . . . , Zn ) is close to its mean
μ(f ) = E(f (Z1 , . . . , Zn )). That is, we want a result of the form
P( |f (Z1 , . . . , Zn ) − μ(f )| > ε ) < δ.
(1)
Such results are known as concentration inequalities.
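A quick numerical illustration of a bound of the form (1), taking f to be the sample mean of Bernoulli(1/2) draws and comparing with Hoeffding's inequality (the parameters below are arbitrary choices):

```python
import numpy as np

# Empirical check of a bound of the form (1): f = sample mean of n
# Bernoulli(1/2) draws, compared with Hoeffding's inequality
#   P(|f - 1/2| > eps) <= 2 exp(-2 n eps^2).
rng = np.random.default_rng(0)
n, eps, reps = 100, 0.1, 5000

means = rng.binomial(n, 0.5, size=reps) / n
empirical = np.mean(np.abs(means - 0.5) > eps)
hoeffding = 2 * np.exp(-2 * n * eps**2)

print(empirical <= hoeffding)  # the empirical tail sits below the bound
```

The bound is loose here (roughly an order of magnitude), which is typical: concentration inequalities trade tightness for generality.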
10/36-702: Minimax Theory
1
Introduction
When solving a statistical learning problem, there are often many procedures to choose from.
This leads to the following question: how can we tell if one statistical learning procedure is
better than another? One a
Bayes versus Frequentist
This lecture combines three blog posts that I wrote on this topic.
1
Adventures in FlatLand: Stone's Paradox
Mervyn Stone is Emeritus Professor at University College London. He is famous for
his deep work on Bayesian inference as w
Modeling Basics: Assessment, Selection, and Complexity
Statistical Machine Learning, Spring 2015
Ryan Tibshirani (with Larry Wasserman)
1
You (should) already know this stuff: statistical prediction
and the bias-variance tradeoff
Suppose that we observe (X,
Nonparametric Bayesian Methods
1
What is Nonparametric Bayes?
In parametric Bayesian inference we have a model M = {f (y|θ) : θ ∈ Θ} and data
Y1 , . . . , Yn ∼ f (y|θ). We put a prior distribution π(θ) on the parameter θ and compute the
posterior distribution using Bayes' theorem.
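A concrete parametric-Bayes computation in the simplest conjugate case (my example, not from the notes): a Bernoulli(θ) likelihood with a Beta(a, b) prior gives a Beta posterior in closed form.

```python
import numpy as np

# Conjugate update: Bernoulli(theta) likelihood, Beta(a, b) prior, so the
# posterior is Beta(a + sum(y), b + n - sum(y)). Prior and data are
# hypothetical choices for illustration.
a, b = 1.0, 1.0                      # Beta(1,1) = uniform prior
y = np.array([1, 1, 0, 1, 0, 1, 1])  # hypothetical data
n, s = len(y), int(y.sum())

a_post, b_post = a + s, b + n - s
post_mean = a_post / (a_post + b_post)
print(a_post, b_post, round(post_mean, 2))  # -> 6.0 3.0 0.67
```

Nonparametric Bayes replaces the finite-dimensional θ with an infinite-dimensional object, so the posterior is no longer available by a formula like this.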
Practice Midterm
10/36-702
For Review Session on Monday Mar 2
(1) Let X1 , . . . , Xn ∼ Unif(0, 1). Compute the bias and variance of the histogram density
estimator with binwidth h for this distribution. Show that the optimal value of h is h = 1.
(2) Given sa
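For problem (1), a small Monte Carlo sanity check (sample size, evaluation point, and binwidths are my choices): for Unif(0, 1) data the histogram estimator is unbiased at every binwidth, so only the variance matters, and it grows as h shrinks, making h = 1 optimal.

```python
import numpy as np

# Histogram density estimate at x0 for Unif(0,1) data, over many
# replications: the mean stays at 1 (unbiased) while the variance grows
# as h shrinks, so a single bin (h = 1) is best.
rng = np.random.default_rng(0)
n, reps, x0 = 50, 2000, 0.3

results = {}
for h in [1.0, 0.5, 0.25]:
    edges = np.arange(0, 1 + h / 2, h)
    ests = np.empty(reps)
    for r in range(reps):
        X = rng.uniform(0, 1, n)
        counts, _ = np.histogram(X, bins=edges)
        ests[r] = counts[int(x0 / h)] / (n * h)  # estimate of p(x0)
    results[h] = (ests.mean(), ests.var())
    print(h, round(ests.mean(), 2), round(ests.var(), 4))
```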
10/36-702 Review
These are things you should know from 36-705 and 10-715.
1
Probability
Xn →P 0 means that, for every ε > 0, P(|Xn | > ε) → 0 as n → ∞. Xn ⇝ Z means
that P(Xn ≤ z) → P(Z ≤ z) at all continuity points z. Xn = OP (an ) means that Xn /a
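A small simulation of the first definition, Xn →P 0, using the centered sample mean of Unif(0, 1) draws (the threshold ε and the sample sizes are arbitrary choices):

```python
import numpy as np

# X_n = (sample mean of n Unif(0,1) draws) - 1/2 converges to 0 in
# probability: P(|X_n| > eps) shrinks toward 0 as n grows.
rng = np.random.default_rng(0)
eps, reps = 0.05, 500

ps = []
for n in [100, 1000, 10000]:
    Xn = rng.uniform(0, 1, (reps, n)).mean(axis=1) - 0.5
    ps.append(np.mean(np.abs(Xn) > eps))
    print(n, ps[-1])
```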
Undirected Graphical Models
10/36-702
Graphical models are a way of representing the relationships between features (variables).
There are two main brands: directed and undirected. We shall focus on undirected graphical
models. See Figure 1 for an example
Density Estimation
10/36-702 Spring 2015
1
Introduction
Let X1 , . . . , Xn be a sample from a distribution P with density p. The goal of nonparametric
density estimation is to estimate p with as few assumptions about p as possible. We denote
the estima
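A minimal kernel density estimator sketch (Gaussian kernel; the data and the bandwidth below are illustrative assumptions, not recommendations):

```python
import numpy as np

# A minimal kernel density estimate with a Gaussian kernel:
#   p_hat(x) = (1/(n h)) sum_i K((x - X_i) / h).
# Data and bandwidth are illustrative choices.
def kde(x, data, h):
    u = (x - data[:, None]) / h
    K = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)  # standard Gaussian kernel
    return K.mean(axis=0) / h

rng = np.random.default_rng(0)
X = rng.normal(0, 1, 1000)
grid = np.array([-1.0, 0.0, 1.0])
print(np.round(kde(grid, X, h=0.3), 2))  # roughly the N(0,1) density there
```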
Nonparametric Regression
Statistical Machine Learning, Spring 2015
Ryan Tibshirani (with Larry Wasserman)
1
Introduction, and k-nearest-neighbors
1.1
Basic setup, random inputs
Given a random pair (X, Y ) ∈ Rd × R, recall that the function
f0 (x) = E(Y |X =
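The k-nearest-neighbors idea can be sketched in a few lines: to estimate the regression function at x, average the responses of the k training points closest to x. The data here are synthetic and the true regression function is my own choice:

```python
import numpy as np

# k-nearest-neighbors regression in 1d: average the Y's of the k points
# whose X is closest to x. Synthetic data with f0(x) = x^2 (my choice).
def knn_predict(x, X, Y, k):
    idx = np.argsort(np.abs(X - x))[:k]  # indices of the k nearest X_i
    return Y[idx].mean()

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, 400)
Y = X**2 + 0.1 * rng.normal(size=400)

print(round(knn_predict(1.0, X, Y, k=20), 2))  # close to f0(1) = 1
```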
Sparsity and the Lasso
Statistical Machine Learning, Spring 2014
Ryan Tibshirani (with Larry Wasserman)
1
Regularization and the lasso
1.1
A bit of background
If ℓ2 was the norm of the 20th century, then ℓ1 is the norm of the 21st century. OK, maybe
that
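One concrete way to see what the ℓ1 penalty does: for an orthonormal design, the lasso solution is obtained by soft-thresholding the least-squares coefficients, S(z, t) = sign(z) · max(|z| − t, 0), which sets small coefficients exactly to zero. A sketch of that operator (inputs are arbitrary):

```python
import numpy as np

# Soft-thresholding operator: for an orthonormal design the lasso
# coefficient is S(beta_ols_j, lambda). Small coefficients become exactly 0,
# which is the source of the lasso's sparsity.
def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

z = np.array([3.0, -0.5, 1.2, 0.1])
print(soft_threshold(z, 1.0))
```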
Clustering
10/36-702 Spring 2014
1
The Clustering Problem
In a clustering problem we aim to find groups in the data. Unlike classification, the data are
not labeled, and so clustering is considered an example of unsupervised learning. In many
cases, clustering
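A bare-bones k-means (Lloyd's algorithm) illustrates unsupervised grouping; the two well-separated synthetic groups and all parameters below are my own choices:

```python
import numpy as np

# Lloyd's algorithm: alternate between assigning points to the nearest
# center and recomputing each center as the mean of its points.
def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
centers, labels = kmeans(X, k=2)
print(np.round(np.sort(centers[:, 0]), 1))  # one center near each group
```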
Nonparametric Regression
10/36-702
Larry Wasserman
1
Introduction
Now we focus on the following problem: Given a sample (X1 , Y1 ), . . ., (Xn , Yn ), where
Xi ∈ Rd and Yi ∈ R, estimate the regression function
m(x) = E(Y |X = x)
(1)
without making parametric assumptions about m.
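One standard estimator of m(x) = E(Y |X = x) is the Nadaraya-Watson kernel estimator, a locally weighted average of the Yi. A sketch on synthetic data (the regression function, bandwidth, and sample size are my own choices):

```python
import numpy as np

# Nadaraya-Watson estimator: m_hat(x) = sum_i w_i(x) Y_i / sum_i w_i(x)
# with Gaussian weights w_i(x) = K((x - X_i)/h).
def nw(x, X, Y, h):
    w = np.exp(-((x - X) ** 2) / (2 * h**2))
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 500)
Y = np.cos(3 * X) + 0.1 * rng.normal(size=500)  # m(x) = cos(3x), my choice

print(round(nw(0.5, X, Y, h=0.05), 2))  # close to cos(1.5)
```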
Homework 3 Solution
1. (a)
E(θ̂j − θj ) = E[ (1/n) Σ_{i=1}^n ( m(xi )φj (xi ) + εi φj (xi ) ) ] − θj
= (1/n) Σ_{i=1}^n m(xi )φj (xi ) − ∫₀¹ m(x)φj (x) dx,
where we used linearity of expectation and the fact that εi is the only random quantity,
and it has mean 0.
We can lower bound
10702 Homework 2 Solution
Thanks to Akshay Krishnamurthy for providing his solution.
1
Convexity and Optimization
1. (Convexity)
(a) We'll show that the second derivative of 1/g(x) is always positive, which implies that
1/g(x) is convex. First, the second
36-702 Homework 1 Solution
Thanks to William Bishop and Rafael Stern for providing their solutions.
Problem 1
(a) Let n(j) = Σᵢ I{j} (xi ). Then
L(θ) = (θ/2)^n(1) (θ/3)^n(2) θ^n(3) ((6 − 11θ)/6)^n(4) ∝ θ^{n(1)+n(2)+n(3)} (6 − 11θ)^n(4)
Thus, there exists a constant k such that:
l(θ) = k + (n(1) + n(2) + n(3)) log θ + n(4) log(6 − 11θ)
10/36-702 Homework 2
Additional hints for problem 4
Yifei Ma
4. Let H be a Hilbert space of functions. Suppose that the evaluation functionals δx (f ) =
f (x) are continuous. Show that H is a reproducing kernel Hilbert space and find the
kernel.
1. Dual space
Su
10-702/36-702
Midterm Exam Solutions
March 2 2011
There are five questions. You only need to do three. Circle the three questions
you want to be graded:
1   2   3   4   5
Name:
Problem 1: Let X1 , . . . , Xn be a random sample where −B ≤ Xi ≤ B for some finite
B > 0. Fo
Lecture Notes 6
1
The Likelihood Function
Definition. Let X^n = (X1 , . . . , Xn ) have joint density p(x^n ; θ) = p(x1 , . . . , xn ; θ) where
θ ∈ Θ. The likelihood function L : Θ → [0, ∞) is defined by
L(θ) ≡ L(θ; x^n ) = p(x^n ; θ)
where x^n is fixed and θ varies in Θ. The log-likelihood
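The definition can be made concrete with iid Bernoulli(θ) data (my example): the data are held fixed and L is evaluated as θ varies, and its maximizer is the MLE.

```python
import numpy as np

# Likelihood for iid Bernoulli(theta) data:
#   L(theta) = prod_i theta^{x_i} (1 - theta)^{1 - x_i},
# with x fixed and theta varying over (0, 1).
x = np.array([1, 0, 1, 0, 1, 0])  # a fixed hypothetical sample

def L(theta):
    return np.prod(theta**x * (1 - theta) ** (1 - x))

thetas = np.linspace(0.01, 0.99, 99)
mle = thetas[np.argmax([L(t) for t in thetas])]
print(round(mle, 2))  # -> 0.5, the sample mean, as expected for Bernoulli
```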
Lecture Notes 8
1
Minimax Theory
Suppose we want to estimate a parameter θ using data X^n = (X1 , . . . , Xn ). What is the
best possible estimator θ̂ = θ̂(X1 , . . . , Xn ) of θ? Minimax theory provides a framework for
answering this question.
1.1
Introduction
Lecture Notes 16
Model Selection
Not in the text except for a brief mention in 13.6.
1
Introduction
Sometimes we have a set of possible models and we want to choose the best model. Model
selection methods help us choose a good model. Here are some examples
Lecture Notes 9
Asymptotic (Large Sample) Theory (Chapter 9)
1
Review of o, O, etc.
1. an = o(1) means an → 0 as n → ∞.
2. A random sequence An is oP (1) if An →P 0 as n → ∞.
3. A random sequence An is oP (bn ) if An /bn →P 0 as n → ∞.
4.
Lecture Notes 11
Confidence Sets
1
Introduction
Let Cn be a set that is constructed from X1 , . . . , Xn . We say that Cn is a 1 − α confidence
set if
P_θ (θ ∈ Cn ) ≥ 1 − α for all θ ∈ Θ.
In other words
inf_θ P_θ (θ ∈ Cn ) ≥ 1 − α.
When Cn = [L(X1 , . . . , Xn ), U (X1 , . . . , Xn )] w
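A coverage simulation for the usual normal-based interval illustrates the definition; the distribution, its parameters, and the sample size are arbitrary choices:

```python
import numpy as np

# Repeatedly build the interval [mean - z*se, mean + z*se] and count how
# often it contains the true mean; coverage should be near the nominal 95%.
rng = np.random.default_rng(0)
n, reps, z, mu = 50, 5000, 1.96, 3.0

covered = 0
for _ in range(reps):
    X = rng.normal(mu, 2.0, n)
    se = X.std(ddof=1) / np.sqrt(n)
    lo_end, hi_end = X.mean() - z * se, X.mean() + z * se
    covered += lo_end <= mu <= hi_end
print(round(covered / reps, 2))  # close to the nominal 0.95
```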
Function Spaces
A function space is a set of functions F that has some structure. Often a nonparametric
regression function or classifier is chosen to lie in some function space, where the assumed
structure is exploited by algorithms and theoretical analysis.
Lecture Notes 13
The Bootstrap
1
Introduction
The bootstrap is a method for estimating the variance of an estimator and for finding approximate confidence intervals for parameters. Although the method is nonparametric, it
can be used for inference about p
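A minimal bootstrap sketch: estimate the standard error of the sample median, a quantity for which no simple plug-in formula is taught. The data and the number of resamples are my own choices:

```python
import numpy as np

# Nonparametric bootstrap: resample the data with replacement, recompute
# the statistic each time, and use the spread of the replicates as a
# standard-error estimate.
rng = np.random.default_rng(0)
X = rng.exponential(1.0, 200)  # synthetic data

B = 2000
boot_medians = np.array([
    np.median(rng.choice(X, size=len(X), replace=True)) for _ in range(B)
])
print(round(boot_medians.std(ddof=1), 3))  # bootstrap SE of the median
```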
Lecture Notes 15
Prediction
Chapters 13, 22, 20.4.
1
Introduction
We observe training data (X1 , Y1 ), . . . , (Xn , Yn ) where Xi ∈ Rd . Given a new pair (X, Y ) we
want to predict Y from X. There are two common versions:
1. Y ∈ {0, 1}. This is called classification