The learning problem
Training data: observed pairs (x_1, y_1), ..., (x_n, y_n)
We want to estimate a function f such that f(x) ≈ y for new (previously unobserved) x.
From a list of hypotheses (denoted H), select one h that minimizes the error on the training data.
A learning puzzle?
Is learning even possible?
or: How I learned to stop worrying and love statistics
Supervised learning
Given training data (x_1, y_1), ..., (x_n, y_n), we would like to learn a function f such that f(x) ≈ y for inputs x other than those in the training data.
ECE 6254
Statistical Signal Processing
Spring 2014
Mark A. Davenport
Georgia Institute of Technology
School of Electrical and Computer Engineering
Modern data-driven applications
A good reference for the material in this section is Boyd and Vandenberghe's Convex Optimization.
Descent methods for unconstrained optimization
Let us consider the general problem of minimizing an unconstrained, differentiable function f : R^N -> R.
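As a concrete illustration, here is the simplest descent method, gradient descent, applied to the quadratic f(x) = (1/2)||Ax - y||_2^2. The matrix, step size, and iteration count below are arbitrary choices for the sketch, not values from the notes.

```python
import numpy as np

# Gradient descent on the unconstrained quadratic f(x) = 1/2 ||Ax - y||_2^2,
# whose gradient is A^T (Ax - y). For convergence the step size must be
# smaller than 2 / lambda_max(A^T A).
def gradient_descent(A, y, step=0.01, iters=500):
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = x - step * (A.T @ (A @ x - y))
    return x

A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
y = A @ np.array([1.0, -1.0])   # constructed so the minimizer is (1, -1)
x_hat = gradient_descent(A, y)
```

With a fixed step size the error contracts geometrically, so a few hundred iterations suffice on this tiny, well-conditioned example.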
V. Beyond Least Squares
Georgia Tech ECE 6250 Notes by J. Romberg. Last updated 9:22, December 5, 2013
Norm approximation
The bulk of this course has been spent studying variations of the least-squares problem.
The Kalman Filter
The RLS algorithm for updating the least squares estimate given a series of observation vectors looked like a filter: new data comes in, and we use it (along with collected knowledge from the previous observations) to update our estimate.
The method of conjugate gradients (CG)
An excellent resource for the material in this section is the manuscript:
J. Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain.
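A minimal sketch of CG for a symmetric positive definite system Ax = b, following the standard form of the method described in Shewchuk's manuscript (the test matrix is our own toy example):

```python
import numpy as np

# Conjugate gradients for Ax = b with A symmetric positive definite.
# In exact arithmetic CG converges in at most n iterations.
def conjugate_gradients(A, b, iters=None):
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x                  # residual
    d = r.copy()                   # search direction
    for _ in range(iters or n):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)         # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)   # makes the new direction A-conjugate
        d = r_new + beta * d
        r = r_new
    return x

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])        # SPD test matrix
b = np.array([1.0, 2.0])
x = conjugate_gradients(A, b)
```

Note that each iteration needs only one matrix-vector product with A, which is what makes CG attractive for large, structured systems.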
III. Computing the Solution to
Least Squares Problems
Georgia Tech ECE 6250 Notes by J. Romberg. Last updated 17:13, November 14, 2013
Here are the least-squares problems we have talked about so far:
Iterative methods for solving least-squares
When A has full column rank, our least-squares estimate is
x = (A^T A)^{-1} A^T y.
If A is M × N, then constructing A^T A costs O(MN^2) computations, and solving the resulting N × N system costs O(N^3).
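A quick numerical sketch of the closed form above, solved via the normal equations. The data here is synthetic; in practice np.linalg.lstsq (which factors A directly) is numerically safer when A is ill-conditioned.

```python
import numpy as np

# Solve least squares by forming and solving the normal equations
# A^T A x = A^T y, and compare with NumPy's factorization-based solver.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))     # M = 20, N = 3, full column rank
x0 = np.array([1.0, 2.0, 3.0])
y = A @ x0                           # noiseless measurements

x_normal = np.linalg.solve(A.T @ A, A.T @ y)
x_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]
```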
Streaming solutions to least-squares problems
In our discussion of least-squares so far, we have focused on static problems: a set of measurements y = Ax_0 + e comes in all at once, and we use them all together to compute the estimate.
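In the streaming setting, the recursive least squares update folds in one row at a time using the matrix inversion lemma, with no full re-solve. This is a minimal sketch (no forgetting factor; the function name and the large initialization of P are our choices, not from the notes):

```python
import numpy as np

# One RLS step: x is the current estimate, P tracks (A^T A)^{-1} for the
# rows seen so far, a is the new measurement vector, y the new measurement.
def rls_update(x, P, a, y):
    Pa = P @ a
    k = Pa / (1.0 + a @ Pa)        # gain vector
    x = x + k * (y - a @ x)        # correct the estimate with the residual
    P = P - np.outer(k, Pa)        # rank-one update of (A^T A)^{-1}
    return x, P

rng = np.random.default_rng(1)
x0 = np.array([0.5, -2.0])          # true parameters
x_hat = np.zeros(2)
P = 1e6 * np.eye(2)                 # large P ~ uninformative initialization
for _ in range(200):
    a = rng.standard_normal(2)      # a new row of A arrives
    x_hat, P = rls_update(x_hat, P, a, a @ x0)
```

Each update costs O(N^2) regardless of how many measurements have already been processed.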
Weighted Least Squares
Standard least-squares tries to fit a vector x to a set of measurements y by solving

min_{x in R^N} ||y - Ax||_2^2.
Now, what if some of the measurements are more reliable than others? Or, what if ...
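When measurement reliabilities differ, we can weight each residual by its inverse noise variance and solve the weighted normal equations A^T W A x = A^T W y. A sketch with synthetic data of our own choosing:

```python
import numpy as np

# Weighted least squares: W is diagonal, with large weights on the
# low-noise (reliable) measurements.
rng = np.random.default_rng(2)
A = rng.standard_normal((50, 2))
x0 = np.array([1.0, -1.0])
sigma = np.concatenate([np.full(25, 0.01),   # first 25: very reliable
                        np.full(25, 5.0)])   # last 25: very noisy
y = A @ x0 + sigma * rng.standard_normal(50)

W = np.diag(1.0 / sigma**2)                  # inverse-variance weights
x_wls = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
```

With inverse-variance weights, the noisy measurements barely influence the solution, so the estimate tracks the reliable data.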
Toeplitz matrices
Toeplitz matrices, which are matrices that are constant along their diagonals, arise in many different signal processing applications, as they are fundamental in describing the action of linear time-invariant systems.
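For instance, discrete convolution with a filter h is multiplication by a Toeplitz matrix. A small sketch (the constructor below is a hand-rolled stand-in for scipy.linalg.toeplitz):

```python
import numpy as np

# Build a Toeplitz matrix from its first column and first row:
# T[i, j] depends only on i - j, so it is constant along diagonals.
def toeplitz(col, row):
    col, row = np.asarray(col), np.asarray(row)
    T = np.empty((len(col), len(row)))
    for i in range(len(col)):
        for j in range(len(row)):
            T[i, j] = col[i - j] if i >= j else row[j - i]
    return T

h = np.array([1.0, 2.0, 3.0])          # filter
x = np.array([1.0, 1.0, 1.0, 1.0])     # input signal
# full convolution matrix: (len(h) + len(x) - 1) x len(x)
col = np.concatenate([h, np.zeros(len(x) - 1)])
row = np.concatenate([[h[0]], np.zeros(len(x) - 1)])
H = toeplitz(col, row)
y = H @ x                              # equals np.convolve(h, x)
```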
The Pseudo-Inverse
The pseudo-inverse of a matrix A with singular value decomposition (SVD) A = U Σ V^T is

A^+ = V Σ^{-1} U^T.    (1)

Other names for A^+ include natural inverse, Lanczos inverse, and Moore-Penrose inverse.
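Equation (1) translates directly into code: take the SVD, invert the nonzero singular values, and reassemble. A minimal sketch, checked against NumPy's built-in pinv:

```python
import numpy as np

# Pseudo-inverse from the SVD: A = U S V^T  =>  A^+ = V S^{-1} U^T,
# inverting only singular values above a small tolerance.
def pseudo_inverse(A, tol=1e-10):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.array([1.0 / v if v > tol else 0.0 for v in s])
    return Vt.T @ np.diag(s_inv) @ U.T

A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])     # tall, rank-2 matrix
A_pinv = pseudo_inverse(A)
```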
Haar wavelets
The Haar wavelet basis for L2(R) breaks down a signal by looking at the difference between piecewise constant approximations at different scales. It is the simplest example of a wavelet transform.
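One level of the discrete Haar transform captures exactly this idea: pairwise averages give the coarse approximation and pairwise differences give the detail, scaled so the transform is orthonormal. A sketch:

```python
import numpy as np

# One level of the orthonormal Haar wavelet transform and its inverse.
def haar_level(x):
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)   # coarse approximation
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail at this scale
    return approx, detail

def haar_inverse(approx, detail):
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

x = np.array([4.0, 2.0, 5.0, 5.0])
a, d = haar_level(x)
```

Because the transform is orthonormal, it is perfectly invertible and preserves energy (Parseval).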
The Bayes classifier
Consider a pair (X, Y), where X is a random vector in R^d and Y is a random variable (depending on X) taking values in {0, 1}.
Define eta(x) = P(Y = 1 | X = x).
Bayes classifier: h*(x) = 1 when eta(x) >= 1/2, and h*(x) = 0 otherwise.
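For a toy model where the class-conditional densities are known, eta(x) has a closed form and the Bayes classifier can be written down directly. The model below (equal priors, unit-variance Gaussians with means 0 and 2) is our own illustration, not an example from the notes:

```python
import numpy as np

# Toy model: Y ~ Bernoulli(1/2), X | Y = y ~ N(2y, 1) in one dimension.
def eta(x):
    p1 = np.exp(-0.5 * (x - 2.0) ** 2)   # unnormalized density under Y = 1
    p0 = np.exp(-0.5 * x ** 2)           # unnormalized density under Y = 0
    return p1 / (p0 + p1)                # equal priors and constants cancel

def bayes_classifier(x):
    return int(eta(x) >= 0.5)            # predict 1 iff eta(x) >= 1/2
```

Here eta(x) = 1/2 exactly at the midpoint x = 1, so the Bayes classifier thresholds x at 1, as symmetry suggests.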
Linear discriminant analysis (LDA)
In linear discriminant analysis, we assume that the class-conditional densities are Gaussian with a common covariance matrix, i.e., X | Y = j ~ N(mu_j, Sigma) for each class j.
The log-likelihood
Thus the likelihood of the observed data is the product of the individual densities, and the log-likelihood function is given by its logarithm. Since the logarithm is a monotonic transformation, finding the parameter maximizing the log-likelihood is equivalent to maximizing the likelihood itself.
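A small sketch of this equivalence for the mean of i.i.d. unit-variance Gaussian data: maximizing the log-likelihood over a grid recovers the sample mean, which is the MLE. The data and grid are our own illustration.

```python
import numpy as np

# For x_i ~ N(mu, 1) i.i.d., the log-likelihood in mu is (up to constants)
# -1/2 sum_i (x_i - mu)^2, maximized at the sample mean.
rng = np.random.default_rng(0)
x = rng.normal(3.0, 1.0, 1000)

def log_likelihood(mu, x):
    return -0.5 * np.sum((x - mu) ** 2)

grid = np.linspace(0.0, 6.0, 601)     # spacing 0.01
mu_hat = grid[np.argmax([log_likelihood(m, x) for m in grid])]
```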
Maximum likelihood estimation
Notation
Note
Example
Suppose that
and that
If
Example
How do we calculate the Bayes risk?
In
Clustering
Example
Formal definition
Suppose we observe data x_1, ..., x_n.
The goal of clustering is to assign the data to disjoint subsets, called clusters, so that points in the same cluster are more similar to each other than points in different clusters.
k-means clustering algorithm
Suppose we wish to partition the data x_1, ..., x_n into k clusters.
The k-means algorithm tries to find the assignment and cluster centers that minimize the sum of squared distances from each point to the center of its cluster.
Algorithm
Initialize the cluster centers.
Repeat until the clusters don't change: assign each point to the nearest center, then recompute each center as the mean of the points assigned to it.
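The two alternating steps above can be sketched in a few lines. This version takes the initial centers as an argument (initialization strategy is a separate question); the two-blob data is our own toy example.

```python
import numpy as np

# k-means: alternate assignment (nearest center) and update (cluster mean)
# until the centers stop moving.
def kmeans(X, centers, iters=100):
    centers = centers.astype(float).copy()
    for _ in range(iters):
        # assignment step: distance from every point to every center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center becomes the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0)
                                for j in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),    # blob near (0, 0)
               rng.normal(5.0, 0.1, (20, 2))])   # blob near (5, 5)
labels, centers = kmeans(X, X[[0, 20]])          # one init point per blob
```

Note that k-means only guarantees convergence to a local optimum; the result can depend heavily on the initialization.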
Beyond k-means clustering
In k-means clustering, by measuring within-cluster similarity with squared Euclidean distance, we implicitly favor compact, roughly spherical clusters.
Principal component analysis
Let X denote a data matrix whose columns are given by the vectors x_1, ..., x_n.
Recall that we can compute the principal components simply by computing the singular value decomposition X = U Σ V^T.
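A sketch of this computation: after centering the columns, the left singular vectors of the data matrix are the principal components. The synthetic data below is our own illustration, concentrated along the direction (1, 1).

```python
import numpy as np

# PCA via the SVD of the centered data matrix (columns are data points).
rng = np.random.default_rng(0)
t = rng.standard_normal(200)
X = np.outer(np.array([1.0, 1.0]), t) + 0.01 * rng.standard_normal((2, 200))

mu = X.mean(axis=1, keepdims=True)          # center the columns
U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
pc1 = U[:, 0]                               # first principal component
```

Since the data varies almost entirely along (1, 1), the first principal component aligns with that direction (up to sign).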
Principal component analysis (PCA)
Most common linear method for dimensionality reduction
The idea behind PCA is to find an approximation x ≈ μ + A θ, where A is a matrix with orthonormal columns. If A is the matrix with columns given by the top principal components, this choice minimizes the approximation error.
Multidimensional scaling (MDS)
The problem of finding points y_1, ..., y_n such that the distance between y_i and y_j approximately agrees with the dissimilarity between x_i and x_j is known as multidimensional scaling (MDS).
There are a number of variants of MDS based on our choice of distance (or dissimilarity) measure.
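One particular variant, classical MDS, admits a closed-form solution: double-center the squared-distance matrix to recover a Gram matrix, then embed using its top eigenvectors. A sketch on collinear points, where a 1-D embedding preserves all distances exactly:

```python
import numpy as np

# Classical MDS: B = -1/2 J D2 J (double centering of squared distances)
# is a Gram matrix; embed with the top eigenvectors scaled by sqrt(eigenvalue).
def classical_mds(D2, dim):
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]       # largest eigenvalues first
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# points on a line in R^3: recoverable exactly in one dimension
X = np.array([[0.0, 0.0, 0.0],
              [1.0, 1.0, 1.0],
              [2.0, 2.0, 2.0],
              [4.0, 4.0, 4.0]])
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
Y = classical_mds(D2, 1)
```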
Nearest neighbor classifiers
Given enough data, the k-nearest neighbor classifier will do just as well as pretty much any other method.
Catch
The amount of required data can be huge, especially if our inputs are high-dimensional.
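The classifier itself is almost trivial to state in code: find the k closest training points and take a majority vote. A sketch on a tiny two-blob training set of our own making:

```python
import numpy as np

# k-nearest neighbor classification: majority vote among the k training
# points closest (in Euclidean distance) to the query point.
def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]           # indices of the k closest points
    return np.bincount(y_train[nearest]).argmax()

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(X_train, y_train, np.array([4.9, 4.9]), k=3)
```

Note there is no training step at all; the entire cost is paid at query time, which is another reason large training sets become burdensome.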
Model selection
In statistical learning, a model is a mathematical
representation of a function such as a
classifier
regression function
density
In many cases, we have one (or more) free parameters that we must somehow choose.
Dimensionality reduction
We observe data x_1, ..., x_n in R^D.
The goal of dimensionality reduction is to transform these inputs to new variables z_1, ..., z_n in R^d, where d < D, in such a way that minimizes information loss.
The kernel trick
Many machine learning algorithms only involve the data through inner products <x_i, x_j>.
For many interesting feature maps Phi, the function k(x_i, x_j) = <Phi(x_i), Phi(x_j)> has a simple, closed-form expression that can be evaluated without ever computing Phi explicitly.
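A concrete sketch with the degree-2 polynomial kernel: for the feature map Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) on R^2, the inner product in feature space collapses to the simple closed form (u . v)^2, so Phi never needs to be formed.

```python
import numpy as np

# Explicit feature map for the degree-2 polynomial kernel on R^2.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

explicit = phi(u) @ phi(v)     # inner product computed in feature space
kernel = (u @ v) ** 2          # same quantity via the kernel, no Phi needed
```

For higher degrees or the Gaussian (RBF) kernel, the explicit feature space is enormous or infinite-dimensional, while the kernel evaluation stays O(d); this gap is the whole point of the trick.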