The learning problem
Training data: (x_1, y_1), ..., (x_n, y_n)
We want to estimate a function h such that h(x) ≈ y for new (previously unobserved) x
From a list of hypotheses (denoted H), select one h by taking the hypothesis that minimizes the empirical risk
We want to argue that this works:
1. Ensure that the empirical risk is close to the true risk
2. Make the empirical risk small
A learning puzzle?
Is learning even possible?
or: How I learned to stop worrying and love statistics
Supervised learning
Given training data (x_1, y_1), ..., (x_n, y_n), we would like to learn a function f such that f(x) ≈ y for x other than x_1, ..., x_n, but as we have just seen, this is impossible. Without
ECE 6254
Statistical Signal Processing
Spring 2014
Mark A. Davenport
Georgia Institute of Technology
School of Electrical and Computer Engineering
Modern data-driven approach
How can we learn effective models from data?
A good reference for the material in this section is Boyd and Vandenberghe's Convex Optimization.
Descent methods for unconstrained optimization
Let us consider the general problem of minimizing an unconstrained function f(z) : R^N → R,
    min_{z ∈ R^N} f(z),
where f
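As a sketch of the generic descent iteration (take a step along the negative gradient), here is fixed-step gradient descent on a toy quadratic; the objective and step size below are hypothetical choices for illustration, not from the notes:

```python
# Gradient descent on f(z) = (z1 - 3)^2 + 2*(z2 + 1)^2, a toy
# instance of the descent iteration z <- z - alpha * grad f(z).

def grad_f(z):
    # gradient of f at z = (z1, z2)
    return (2.0 * (z[0] - 3.0), 4.0 * (z[1] + 1.0))

def gradient_descent(z0, alpha=0.1, iters=200):
    z = z0
    for _ in range(iters):
        g = grad_f(z)
        z = (z[0] - alpha * g[0], z[1] - alpha * g[1])
    return z

z_star = gradient_descent((0.0, 0.0))
# z_star converges to the minimizer (3, -1)
```

For this strongly convex quadratic a fixed step size suffices; in general the step would be chosen by a line search.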
V. Beyond Least Squares
Georgia Tech ECE 6250 Notes by J. Romberg. Last updated 9:22, December 5, 2013
Norm approximation
The bulk of this course has been spent studying variations of the least-squares problem: given observations y ∈ R^M and a known M × N matr
The Kalman Filter
The RLS algorithm for updating the least squares estimate given a series of observation vectors looked like a filter: new data comes in, and we use it (along with collected knowledge of the old data) to produce a new output.
In the previo
The method of conjugate gradients (CG)
An excellent resource for the material in this section is the manuscript:
J. Shewchuk: An introduction to the conjugate gradient method
without the agonizing pain.
We can see from the example on the last page that st
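A minimal pure-Python sketch of the CG iteration for a small symmetric positive definite system may help fix the moving parts (residual, search direction, step size); the 2×2 system below is a hypothetical example, and a real solver would of course use a numerical library:

```python
# Conjugate gradients for Ax = b with A symmetric positive definite.

def matvec(A, x):
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def dot(u, v):
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

def axpy(alpha, x, y):
    # returns y + alpha * x
    return [y_i + alpha * x_i for x_i, y_i in zip(x, y)]

def conjugate_gradient(A, b, tol=1e-10):
    n = len(b)
    x = [0.0] * n
    r = b[:]            # residual b - Ax with x = 0
    d = r[:]            # first search direction
    rs = dot(r, r)
    for _ in range(n):  # exact CG converges in at most n steps
        Ad = matvec(A, d)
        alpha = rs / dot(d, Ad)      # step size along d
        x = axpy(alpha, d, x)
        r = axpy(-alpha, Ad, r)      # update residual
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        d = axpy(rs_new / rs, d, r)  # new direction: r + beta * d
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)   # solves the 2x2 system exactly in 2 steps
```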
III. Computing the Solution to Least Squares Problems
Georgia Tech ECE 6250 Notes by J. Romberg. Last updated 17:13, November 14, 2013
Here are the least-squares problems we have talked about so far:
Pseudo-inverse when A has full column rank:
    x̂ = (A^T A)^{-1} A^T y
Iterative methods for solving least-squares
When A has full column rank, our least-squares estimate is
    x̂ = (A^T A)^{-1} A^T y.
If A is M × N, then constructing A^T A costs O(M N^2) computations, and solving the N × N system A^T A x = A^T y costs O(N^3) computations. (Note t
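The direct route, forming the normal equations A^T A x = A^T y and solving them, can be sketched on a tiny overdetermined system; the line-fitting data below is hypothetical, and the 2×2 solve uses Cramer's rule just to stay self-contained:

```python
# Solving min ||y - Ax||^2 via the normal equations A^T A x = A^T y
# for a tiny line fit: row i of A is [1, t_i].

t = [0.0, 1.0, 2.0]
y = [1.0, 3.0, 5.0]          # lies exactly on y = 1 + 2t
A = [[1.0, ti] for ti in t]

# Form A^T A (2x2) and A^T y (2-vector)
AtA = [[sum(A[i][r] * A[i][c] for i in range(3)) for c in range(2)]
       for r in range(2)]
Aty = [sum(A[i][r] * y[i] for i in range(3)) for r in range(2)]

# Solve the 2x2 system by Cramer's rule
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
x_hat = [(Aty[0] * AtA[1][1] - AtA[0][1] * Aty[1]) / det,
         (AtA[0][0] * Aty[1] - Aty[0] * AtA[1][0]) / det]
# x_hat recovers intercept 1 and slope 2
```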
Streaming solutions to least-squares problems
In our discussion of least-squares so far, we have focussed on static
problems: a set of measurements y = Ax0 + e comes in all at once,
and we use them all to estimate x0.
In this section, we will shift our fo
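The simplest streaming instance is worth sketching: estimating a constant x_0 from scalar measurements y_k = x_0 + noise, where the least-squares estimate after k samples is the running mean and can be updated recursively without storing old data (the same filter-like structure as RLS). The data below is hypothetical:

```python
# Streaming least squares for a constant: the running-mean update
# x_hat <- x_hat + (1/k) * (y_k - x_hat) needs only the previous
# estimate and the new measurement.

def streaming_mean(ys):
    x_hat = 0.0
    for k, y in enumerate(ys, start=1):
        # gain 1/k shrinks as more data is collected
        x_hat = x_hat + (y - x_hat) / k
    return x_hat

est = streaming_mean([2.0, 4.0, 3.0, 3.0])  # running mean = 3.0
```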
Weighted Least Squares
Standard least-squares tries to fit a vector x to a set of measurements y by solving
    min_{x ∈ R^N} ‖y − Ax‖_2^2.
Now, what if some of the measurements are more reliable than others? Or, what if the errors are closely correlated between measure
Toeplitz matrices
Toeplitz matrices, which are matrices that are constant along their diagonals, arise in many different signal processing applications, as they are fundamental in describing the action of linear time-invariant systems. For example, suppose
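The LTI connection can be sketched directly: a convolution matrix has entries H[i][j] = h[i − j], constant along diagonals, and applying it reproduces the convolution sum. The filter and signal below are hypothetical:

```python
# An LTI (convolution) system as a Toeplitz matrix: H[i][j] = h[i - j].
# Full convolution of a length-3 filter with a length-4 input gives
# an output of length 3 + 4 - 1 = 6.

h = [1.0, 2.0, 3.0]          # filter taps
x = [1.0, 0.0, -1.0, 2.0]    # input signal
M, N = len(h) + len(x) - 1, len(x)

# Toeplitz matrix: constant along each diagonal
H = [[h[i - j] if 0 <= i - j < len(h) else 0.0 for j in range(N)]
     for i in range(M)]

y = [sum(H[i][j] * x[j] for j in range(N)) for i in range(M)]
# y equals the direct convolution sum_k h[k] x[i - k]
```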
The Pseudo-Inverse
The pseudo-inverse of a matrix A with singular value decomposition (SVD) A = U Σ V^T is
    A⁺ = V Σ^{-1} U^T.    (1)
Other names for A⁺ include natural inverse, Lanczos inverse, and Moore-Penrose inverse.
Given an observation y, taking x̂ = A⁺ y gives us
Haar wavelets
The Haar wavelet basis for L2(R) breaks down a signal by looking at the difference between piecewise constant approximations at different scales. It is the simplest example of a wavelet transform, and is very easy to understand.
Let V0 be the
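One level of this averaging/differencing can be sketched in a few lines (assuming the orthonormal 1/√2 scaling; the test signal is hypothetical):

```python
# One level of the Haar transform: split a signal into pairwise
# averages (coarse approximation) and differences (detail), scaled
# by 1/sqrt(2) so the transform is orthonormal.

import math

def haar_level(x):
    s = 1.0 / math.sqrt(2.0)
    approx = [s * (x[2*i] + x[2*i + 1]) for i in range(len(x) // 2)]
    detail = [s * (x[2*i] - x[2*i + 1]) for i in range(len(x) // 2)]
    return approx, detail

def haar_inverse(approx, detail):
    s = 1.0 / math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([s * (a + d), s * (a - d)])
    return x

a, d = haar_level([4.0, 2.0, 5.0, 5.0])
rec = haar_inverse(a, d)   # reconstructs the signal exactly
```

Recursing on the approximation coefficients gives the coarser scales of the full transform.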
The Bayes classifier
Consider (X, Y), where X is a random vector in R^d and Y is a random variable (depending on X) taking values in {0, 1}
Define η(x) = P(Y = 1 | X = x)
Bayes classifier: h*(x) = 1 if η(x) ≥ 1/2, and h*(x) = 0 otherwise
Linear discriminant analysis (LDA)
In linear discriminant analysis, we assume a Gaussian class-conditional density f_k(x) = N(x; μ_k, Σ) for each class k
Here N(μ, Σ) denotes the multivariate Gaussian/normal distribution
The log-likelihood
Thus, the log-likelihood function is given by the logarithm of the likelihood
Since the logarithm is a monotonic transformation, finding the parameter maximizing the log-likelihood is equivalent to maximizing the likelihood itself
Maximum likelihood estimation
Notation
Note that
Thus,
, which lets us write
and
Maximum likelihoo
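As a numeric sketch of maximum likelihood estimation (assuming a Gaussian model with known variance and hypothetical data), the sample mean maximizes the log-likelihood of the mean parameter:

```python
# MLE for the mean of a Gaussian with known variance: the sample
# mean maximizes the log-likelihood over mu.

import math

def log_likelihood(mu, data, sigma=1.0):
    n = len(data)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma**2)
            - sum((x - mu)**2 for x in data) / (2.0 * sigma**2))

data = [1.0, 2.0, 4.0, 5.0]
mu_mle = sum(data) / len(data)   # sample mean = 3.0

# log-likelihood at the MLE beats nearby candidate values
better = all(log_likelihood(mu_mle, data) > log_likelihood(mu, data)
             for mu in [2.0, 2.9, 3.1, 4.0])
```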
Example
Suppose that
and that
If
Example
How do we calculate the Bayes risk?
In the case where
declaring 1 iff
, our test reduced to
,
Clustering
Example
Formal definition
Suppose
The goal of clustering is to assign the data to disjoint subsets
called clusters, so that points in the same cluster are more
similar to each other than points in different clusters
A clustering can be represen
k-means clustering algorithm
Suppose
The k-means algorithm tries to solve
Algorithm
Initialize
Repeat until clusters don't change
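The two alternating steps (assign each point to its nearest center, then recompute each center as the mean of its cluster) can be sketched in pure Python for 1-D data; the data and initial centers below are hypothetical:

```python
# A minimal k-means sketch (1-D data, k = 2): alternate assignment
# and update steps until the assignment stops changing.

def kmeans(data, centers, max_iter=100):
    assign = None
    for _ in range(max_iter):
        # assignment step: nearest center for each point
        new_assign = [min(range(len(centers)),
                          key=lambda j: (x - centers[j])**2)
                      for x in data]
        if new_assign == assign:   # clusters didn't change: done
            break
        assign = new_assign
        # update step: each center becomes the mean of its cluster
        for j in range(len(centers)):
            members = [x for x, a in zip(data, assign) if a == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, assign

centers, assign = kmeans([1.0, 2.0, 9.0, 10.0], centers=[0.0, 5.0])
# centers converge to the cluster means 1.5 and 9.5
```

Each iteration can only decrease the within-cluster scatter, so the loop terminates, though possibly at a local minimum that depends on the initialization.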
Beyond k-means clustering
In k-means clustering, by measuring the within-cluster scatter as
we are implicitly assuming that each c
Principal component analysis
Let X denote a data matrix whose columns are given by the vectors x_1, ..., x_n
Recall that we can compute the principal components simply by computing the singular value decomposition X = U Σ V^T
The principal eigenvectors are the first columns of U
The i
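To keep a sketch self-contained without an SVD routine, the first principal direction (the top eigenvector of the covariance matrix) can be found by power iteration; the 2×2 covariance below is hypothetical:

```python
# Power iteration for the top eigenvector of a 2x2 covariance
# matrix: repeatedly apply C and renormalize.

import math

def power_iteration(C, iters=200):
    v = [1.0, 0.0]                       # arbitrary starting vector
    for _ in range(iters):
        w = [C[0][0] * v[0] + C[0][1] * v[1],
             C[1][0] * v[0] + C[1][1] * v[1]]
        norm = math.sqrt(w[0]**2 + w[1]**2)
        v = [w[0] / norm, w[1] / norm]   # keep unit length
    return v

# covariance with dominant direction along (1, 1) / sqrt(2)
C = [[2.0, 1.0], [1.0, 2.0]]
v1 = power_iteration(C)
# v1 is (up to sign) approximately [0.7071, 0.7071]
```

The component along the second eigenvector decays geometrically by the eigenvalue ratio at each step, which is why the iterate converges to the principal direction.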
Principal component analysis (PCA)
Most common linear method for dimensionality reduction
The idea behind PCA is to find an approximation x ≈ Θz, where Θ has orthonormal columns
If X is the matrix with columns x_1, ..., x_n, let X = U Σ V^T denote the SVD of X; then Θ is given by the first columns of U, the opti
Multidimensional scaling (MDS)
The problem of finding an embedding such that the distance between embedded points approximately agrees with given dissimilarities is known as multidimensional scaling (MDS)
There are a number of variants of MDS based on
our choice of distance function
how we quantify "approximately agrees with"
Nearest neighbor classifiers
Given enough data, the k-nearest neighbor classifier will do just as well as pretty much any other method
Catch
The amount of required data can be huge, especially if our feature space is high-dimensional
The parameter k can ma
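The classifier itself is a few lines: find the k training points closest to the query and take a majority vote. A minimal 1-D sketch with hypothetical data:

```python
# A minimal k-nearest neighbor classifier (1-D features,
# majority vote among the k closest training points).

def knn_predict(train, x, k=3):
    # train: list of (feature, label) pairs
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

train = [(0.0, 'a'), (0.5, 'a'), (1.0, 'a'),
         (5.0, 'b'), (5.5, 'b'), (6.0, 'b')]
label = knn_predict(train, 0.8, k=3)   # the 3 nearest points are all 'a'
```

Sorting the whole training set makes each query O(n log n); practical implementations use spatial data structures, which is part of why the required data (and computation) grows quickly in high dimensions.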
Model selection
In statistical learning, a model is a mathematical
representation of a function such as a
classifier
regression function
density
In many cases, we have one (or more) free parameters that
are not automatically determined by the learning alg
Dimensionality reduction
We observe data
The goal of dimensionality reduction is to transform these
inputs to new variables
where
in such a way that minimizes information loss
Dimensionality reduction serves two main purposes:
Helps (many) algorithms to
Linear methods for supervised learning
LDA
Logistic regression
Naïve Bayes
PLA
Maximum margin hyperplanes
Soft-margin hyperplanes
Least squares regression
Ridge regression
Nonlinear feature maps
Sometimes linear methods (in both regression and
classificat
The kernel trick
Many machine learning algorithms only involve the data
through inner products
For many interesting feature maps Φ, the function k(u, v) = ⟨Φ(u), Φ(v)⟩ has a simple, closed form expression that can be evaluated without explicitly calculating Φ(u) and Φ(v)
Homogeneous kern
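The degree-2 homogeneous polynomial kernel in R^2 is a standard concrete instance of this: (u · v)^2 equals the inner product of explicit quadratic feature maps, so the kernel value never requires forming the features. A small numeric check:

```python
# The kernel trick for the degree-2 homogeneous polynomial kernel
# in R^2: k(u, v) = (u . v)^2 equals <phi(u), phi(v)> for the
# feature map phi(u) = (u1^2, sqrt(2) u1 u2, u2^2).

import math

def kernel(u, v):
    return (u[0] * v[0] + u[1] * v[1]) ** 2

def phi(u):
    return (u[0]**2, math.sqrt(2.0) * u[0] * u[1], u[1]**2)

def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

u, v = (1.0, 2.0), (3.0, -1.0)
same = abs(kernel(u, v) - inner(phi(u), phi(v))) < 1e-12
# evaluating kernel() costs one dot product and a square,
# with no explicit trip through the 3-dimensional feature space
```

For degree-d kernels in R^p the explicit feature space has dimension growing like p^d, while the kernel evaluation stays O(p), which is the entire point of the trick.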