Clustering

Formal definition
Suppose we observe data x_1, …, x_n ∈ R^d.
The goal of clustering is to assign the data to disjoint subsets
C_1, …, C_k, called clusters, so that points in the same cluster are more
similar to each other than points in different clusters.
A clustering can be represented …
The Bayes classifier

Consider a pair (X, Y), where X is a random vector in R^d and
Y ∈ {0, 1} is a random variable (depending on X).
Define η(x) = P(Y = 1 | X = x).
The Bayes classifier is

  h*(x) = 1 if η(x) ≥ 1/2, and 0 otherwise.
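As a concrete sketch of the definition above (the two-Gaussian class-conditional model, the mixing weight p1, and the means/variance are illustrative assumptions, not from the slides), the Bayes classifier just thresholds the posterior η(x) at 1/2:

```python
import numpy as np

def eta(x, p1=0.5, mu0=-1.0, mu1=1.0, sigma=1.0):
    # posterior P(Y = 1 | X = x) for a hypothetical two-Gaussian model;
    # the shared normalizing constant cancels in the ratio
    def gauss(x, mu):
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
    num = p1 * gauss(x, mu1)
    return num / (num + (1 - p1) * gauss(x, mu0))

def bayes_classifier(x):
    # declare class 1 whenever the posterior is at least 1/2
    return (eta(x) >= 0.5).astype(int)

x = np.array([-2.0, 0.0, 2.0])
print(bayes_classifier(x))  # [0 1 1]: symmetric model, threshold at x = 0
```

With equal priors and equal variances the decision boundary sits halfway between the two means, which the printed labels reflect.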
Linear discriminant analysis (LDA)

In linear discriminant analysis, we assume

  f(x | Y = j) = N(μ_j, Σ)  for j = 0, 1.

Here N(μ, Σ) is the multivariate Gaussian/normal distribution.
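Under the shared-covariance assumption above, the LDA rule compares linear discriminant functions δ_j(x) = x^T Σ^{-1} μ_j − ½ μ_j^T Σ^{-1} μ_j + log π_j. A minimal sketch (the means, covariance, and priors below are illustrative, not from the slides):

```python
import numpy as np

# LDA with a shared covariance; mu0, mu1, Sigma, and the priors are
# illustrative values for a two-class problem in R^2
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 2.0])
Sigma = np.eye(2)
pi0, pi1 = 0.5, 0.5

Sinv = np.linalg.inv(Sigma)

def discriminant(x, mu, pi):
    # linear discriminant: x^T S^-1 mu - 0.5 mu^T S^-1 mu + log pi
    return x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(pi)

def lda_classify(x):
    return int(discriminant(x, mu1, pi1) > discriminant(x, mu0, pi0))

print(lda_classify(np.array([0.2, 0.1])))  # 0
print(lda_classify(np.array([1.8, 1.9])))  # 1
```

Because Σ is shared between the classes, the quadratic terms cancel and the decision boundary is a hyperplane.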
The learning problem

Training data: (x_1, y_1), …, (x_n, y_n).
We want to estimate a function h such that h(x) ≈ y
for new (previously unobserved) x.
From a list of hypotheses (denoted H), select one ĥ
that minimizes the empirical risk.
We want to argue that the true risk of ĥ is small:
1. Ensure that the empirical risk of ĥ is close to its true risk.
2. Make the empirical risk of ĥ small.
A learning puzzle?
Is learning even possible?
or: How I learned to stop worrying and love statistics
Supervised learning

Given training data (x_1, y_1), …, (x_n, y_n), we would like to
learn a function h such that h(x) ≈ y for x other than the training
samples, but as we have just seen, this is impossible without
making further assumptions.
ECE 6254
Statistical Signal Processing
Spring 2014
Mark A. Davenport
Georgia Institute of Technology
School of Electrical and Computer Engineering
Statistical Signal Processing
Modern data-driven approach
How can we learn effective models from data?
A good reference for the material in this section is Boyd and Vandenberghe's Convex Optimization.
Descent methods for unconstrained optimization

Let us consider the general problem of minimizing an unconstrained
function f : R^N → R,

  min_{z ∈ R^N} f(z),

where f …
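The simplest descent method is gradient descent with a fixed step size. A minimal sketch on a quadratic objective (the particular Q, b, and step size are illustrative; the step is chosen below 2/λ_max(Q) so the iteration converges):

```python
import numpy as np

# Gradient descent on the quadratic f(z) = 0.5 z^T Q z - b^T z
# (illustrative Q and b; minimizer is z* = Q^{-1} b)
Q = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

def grad(z):
    return Q @ z - b   # gradient of f at z

z = np.zeros(2)
step = 0.25            # fixed step size, smaller than 2 / lambda_max(Q) = 2/3
for _ in range(200):
    z = z - step * grad(z)

print(z)  # approaches the minimizer Q^{-1} b = [1/3, 1]
```

In practice the step size is usually chosen by a line search rather than fixed in advance; the fixed step keeps the sketch short.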
V. Beyond Least Squares
Georgia Tech ECE 6250 Notes by J. Romberg. Last updated 9:22, December 5, 2013
Norm approximation

The bulk of this course has been spent studying variations of the
least-squares problem: given observations y ∈ R^M and a known
M × N matrix …
The Kalman Filter

The RLS algorithm for updating the least squares estimate given a
series of observation vectors looked like a filter: new data comes
in, and we use it (along with collected knowledge of the old data) to
produce a new output.
In the previo…
The method of conjugate gradients (CG)

An excellent resource for the material in this section is the manuscript:
J. Shewchuk, "An Introduction to the Conjugate Gradient Method
Without the Agonizing Pain."
We can see from the example on the last page that st…
III. Computing the Solution to
Least Squares Problems
Georgia Tech ECE 6250 Notes by J. Romberg. Last updated 17:13, November 14, 2013
Here are the least-squares problems we have talked about so far:
Pseudo-inverse when A has full column rank:

  x = (A^T A)^{-1} A^T y
Iterative methods for solving least-squares

When A has full column rank, our least-squares estimate is

  x = (A^T A)^{-1} A^T y.

If A is M × N, then constructing A^T A costs O(M N^2) computations,
and solving the N × N system A^T A x = A^T y costs O(N^3)
computations. (Note t…
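The normal-equations formula above can be checked numerically against a QR-based least-squares solver; the random A and y below are illustrative. Forming A^T A squares the condition number, which is one reason factorization-based solvers are preferred in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 100, 5
A = rng.standard_normal((M, N))   # full column rank with probability 1
y = rng.standard_normal(M)

# normal equations: solve (A^T A) x = A^T y
x_normal = np.linalg.solve(A.T @ A, A.T @ y)

# QR/SVD-based solver avoids forming A^T A explicitly
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(x_normal, x_lstsq))  # True
```

For a well-conditioned A the two agree to machine precision; for ill-conditioned problems the normal-equations route loses roughly twice as many digits.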
Streaming solutions to least-squares problems

In our discussion of least-squares so far, we have focused on static
problems: a set of measurements y = Ax_0 + e comes in all at once,
and we use them all to estimate x_0.
In this section, we will shift our fo…
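One standard streaming approach is recursive least squares (RLS), which updates the estimate one measurement row at a time instead of re-solving the batch problem. A minimal sketch (the data, the regularization delta, and the tolerances are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 50, 3
A = rng.standard_normal((M, N))
y = A @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(M)

# Recursive least squares: process one row of A at a time.
# A tiny regularization delta initializes P = (delta I)^{-1}.
delta = 1e-8
P = np.eye(N) / delta
x = np.zeros(N)
for a, yi in zip(A, y):
    k = P @ a / (1.0 + a @ P @ a)     # gain vector
    x = x + k * (yi - a @ x)          # correct the estimate with the new sample
    P = P - np.outer(k, a @ P)        # rank-one update of the inverse Gram matrix

x_batch, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(x, x_batch, atol=1e-4))  # streaming matches batch
```

Each update costs O(N^2) rather than the O(M N^2 + N^3) of re-solving from scratch, which is the point of the streaming formulation.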
Weighted Least Squares

Standard least-squares tries to fit a vector x to a set of
measurements y by solving

  min_{x ∈ R^N} ||y − Ax||_2^2.

Now, what if some of the measurements are more reliable than others?
Or, what if the errors are closely correlated between measure…
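When the measurements have unequal reliability, a standard fix is to minimize a weighted residual, which leads to x = (A^T W A)^{-1} A^T W y with W diagonal. A sketch on illustrative random data, checked against the equivalent row-rescaled ordinary least-squares problem:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 30, 2
A = rng.standard_normal((M, N))
y = rng.standard_normal(M)
w = rng.uniform(0.1, 10.0, size=M)   # per-measurement reliability weights

# weighted least squares: min_x sum_m w_m (y_m - (Ax)_m)^2,
# solved via x = (A^T W A)^{-1} A^T W y with W = diag(w)
W = np.diag(w)
x_wls = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)

# equivalently, rescale each row by sqrt(w_m) and solve ordinary least squares
sw = np.sqrt(w)
x_check, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
print(np.allclose(x_wls, x_check))  # True
```

Correlated errors generalize this further: W becomes the inverse of the full error covariance rather than a diagonal matrix.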
The log-likelihood

Thus …, and the log-likelihood function is given by ….
Since the logarithm is a monotonic transformation, finding the value
that maximizes the log-likelihood is equivalent to maximizing the
likelihood itself.

Maximum likelihood estimation

Notation: …
Note that …; thus …, which lets us write … and ….
Maximum likelihoo…
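A concrete instance of the monotonicity point above: for the mean of a Gaussian with known variance, maximizing the log-likelihood over a grid lands on the sample mean (the data and the grid are illustrative; this particular model is an assumption, not taken from the slides):

```python
import numpy as np

# MLE for the mean of a Gaussian with known variance s2: the log-likelihood
#   l(mu) = -(n/2) log(2 pi s2) - sum_i (x_i - mu)^2 / (2 s2)
# is maximized at the sample mean.
x = np.array([1.2, 0.8, 1.5, 0.9, 1.1])
s2 = 1.0

def loglik(mu):
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * s2) - np.sum((x - mu) ** 2) / (2 * s2)

mus = np.linspace(0.0, 2.0, 2001)
mu_hat = mus[np.argmax([loglik(m) for m in mus])]
print(mu_hat, x.mean())  # the grid maximizer agrees with the sample mean
```

Working with the log turns the product of densities into a sum, which is also what makes the quadratic form above easy to maximize in closed form.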
Example

Suppose that … and that …. If ….

Example

How do we calculate the Bayes risk? In the case where …, declaring 1
iff …, our test reduced to ….
Recap

For a given hypothesis set, if we know that k is a break point, meaning
that no data set of size k can be shattered, then we know that the
growth function is polynomial, with dominant term n^(k−1).

VC generalization bound

For any δ > 0, with probability at least 1 − δ, ….

VC bound intuition

Hoeffding's inequality … space of all pos…
k-means clustering algorithm

Suppose we observe x_1, …, x_n ∈ R^d. The k-means algorithm tries
to solve

  min_{C_1, …, C_k} Σ_j Σ_{i ∈ C_j} ||x_i − μ_j||_2^2,

where μ_j is the mean of the points in cluster C_j.

Algorithm
Initialize the cluster centers μ_1, …, μ_k.
Repeat until the clusters don't change:
  assign each point to its nearest center, then recompute each
  center as the mean of the points assigned to it.
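The alternating assign/update loop above can be sketched in a few lines (the synthetic two-blob data and the random initialization are illustrative assumptions):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    # plain k-means: alternate the assignment and mean-update steps
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)            # assign each point to nearest center
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j]       # keep a center if its cluster empties
                        for j in range(k)])
        if np.allclose(new, centers):        # clusters stopped changing
            break
        centers = new
    return labels, centers

# two well-separated synthetic blobs
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels, centers = kmeans(X, 2)
print(np.sort(np.round(centers.mean(axis=1))))  # centers near 0 and 5
```

Each iteration can only decrease the within-cluster scatter, which is why the loop terminates; the result, however, depends on the initialization.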
Beyond k-means clustering

In k-means clustering, by measuring the within-cluster scatter as
Σ_j Σ_{i ∈ C_j} ||x_i − μ_j||_2^2, we are implicitly assuming that
each c…
Principal component analysis

Let X denote a D × n data matrix whose columns are given by the
vectors x_1, …, x_n.
Recall that we can compute the principal components simply by
computing the singular value decomposition X = UΣV^T.
The principal eigenvectors are the first k columns of U.
The i…
Principal component analysis (PCA)

PCA is the most common linear method for dimensionality reduction.
The idea behind PCA is to find an approximation x ≈ μ + Θz, where Θ
is a matrix with orthonormal columns.
If X is the matrix with columns x_1 − μ, …, x_n − μ, let X = UΣV^T
denote the SVD of X; the optimal Θ is given by the first k columns
of U.
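The SVD recipe above can be sketched directly (the synthetic data, which lies near a 2-dimensional subspace plus small noise, is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(4)
n, D, k = 200, 5, 2
# synthetic data near a k-dimensional subspace, plus small noise
Z = rng.standard_normal((n, k))
Theta_true = np.linalg.qr(rng.standard_normal((D, k)))[0]
X = Z @ Theta_true.T + 0.01 * rng.standard_normal((n, D))

# PCA: center, take the SVD of the centered data matrix, keep the
# first k left singular vectors
mu = X.mean(axis=0)
Xc = (X - mu).T                      # D x n centered data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Theta = U[:, :k]                     # orthonormal basis for the top subspace

# reconstruction x ~ mu + Theta z with z = Theta^T (x - mu)
Xhat = mu + (X - mu) @ Theta @ Theta.T
err = np.linalg.norm(X - Xhat) / np.linalg.norm(X)
print(err < 0.05)  # almost all of the energy lives in the top-k subspace
```

The relative reconstruction error is on the order of the noise level, confirming that the first k left singular vectors capture the signal subspace.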
Multidimensional scaling (MDS)

The problem of finding z_1, …, z_n such that d(z_i, z_j)
approximately agrees with Δ(x_i, x_j) is known as multidimensional
scaling (MDS).
There are a number of variants of MDS based on
our choice of distance function
how we quantify "approximately agrees with"
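One classical variant, when the input distances are exact Euclidean distances, recovers an embedding by double-centering the squared-distance matrix and eigendecomposing it (the points below are illustrative; this "classical MDS" recipe is one of the variants the text alludes to):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 10, 2
X = rng.standard_normal((n, d))

# pairwise squared Euclidean distances: the only input classical MDS sees
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

# classical MDS: double-center, eigendecompose, embed with top-d eigenvectors
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J                # Gram matrix of the centered points
w, V = np.linalg.eigh(B)
idx = np.argsort(w)[::-1][:d]        # largest d eigenvalues
Z = V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# the embedding reproduces the original distances (up to rotation/translation)
D2_z = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
print(np.allclose(D2, D2_z))  # True
```

With non-Euclidean or noisy dissimilarities the reproduction is only approximate, which is where the different "approximately agrees with" criteria come in.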
Nearest neighbor classifiers

Given enough data, the k-nearest neighbor classifier will do just
as well as pretty much any other method.
Catch: the amount of required data can be huge, especially if our
feature space is high-dimensional.
The parameter k can ma…
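The k-nearest-neighbor rule itself is a few lines: find the k closest training points and take a majority vote (the tiny 1-D training set below is an illustrative assumption):

```python
import numpy as np

def knn_classify(Xtr, ytr, x, k=3):
    # k-nearest-neighbor rule: majority vote among the k closest training points
    d = np.linalg.norm(Xtr - x, axis=1)
    nearest = ytr[np.argsort(d)[:k]]
    return np.bincount(nearest).argmax()

Xtr = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
ytr = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(Xtr, ytr, np.array([0.05])))  # 0
print(knn_classify(Xtr, ytr, np.array([1.05])))  # 1
```

Note that every query requires distances to all n training points, which is exactly why the method's data appetite becomes a computational burden as well.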
Model selection
In statistical learning, a model is a mathematical
representation of a function such as a
classifier
regression function
density
In many cases, we have one (or more) free parameters that
are not automatically determined by the learning alg…
Dimensionality reduction

We observe data x_1, …, x_n ∈ R^D.
The goal of dimensionality reduction is to transform these inputs
to new variables z_1, …, z_n ∈ R^d, where d < D, in such a way
that minimizes information loss.
Dimensionality reduction serves two main purposes:
Helps (many) algorithms to …
Linear methods for supervised learning

LDA
Logistic regression
Naïve Bayes
PLA
Maximum margin hyperplanes
Soft-margin hyperplanes
Least squares regression
Ridge regression

Nonlinear feature maps

Sometimes linear methods (in both regression and
classificat…
The kernel trick

Many machine learning algorithms only involve the data through
inner products ⟨x_i, x_j⟩.
For many interesting feature maps Φ, the function
k(x, x′) = ⟨Φ(x), Φ(x′)⟩ has a simple, closed-form expression that
can be evaluated without explicitly calculating Φ(x) and Φ(x′).
Homogeneous kern…
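A small concrete check of the identity above: the homogeneous quadratic kernel k(x, x′) = (x^T x′)^2 on R^2 equals the inner product under the explicit degree-2 feature map Φ(x) = (x_1^2, x_2^2, √2 x_1 x_2) (the specific test points are illustrative):

```python
import numpy as np

def phi(x):
    # explicit degree-2 feature map for x in R^2
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k_poly(x, xp):
    # homogeneous quadratic kernel, evaluated without forming phi
    return (x @ xp) ** 2

x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])
print(k_poly(x, xp), phi(x) @ phi(xp))  # both equal (x . x')^2 = 1
```

The kernel evaluation costs O(D) regardless of the feature dimension, while the explicit map grows quadratically with D: this gap is the trick.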
Methods for classification

Parametric methods
LDA
Logistic regression
Naïve Bayes

Nonparametric methods
PLA
Support vector machines
Nearest neighbor classifiers

Nearest neighbor classifier

The nearest neighbor classifier is easiest to state in words…
Constrained optimization

A general constrained optimization problem has the form

  min f(z)  subject to  g_i(z) ≤ 0, i = 1, …, m,  h_j(z) = 0, j = 1, …, p,

where ….
The Lagrangian function is given by

  L(z, λ, ν) = f(z) + Σ_i λ_i g_i(z) + Σ_j ν_j h_j(z).

Primal and dual optimization problems

Primal: p* = min_z max_{λ ≥ 0, ν} L(z, λ, ν)
Dual: d* = max_{λ ≥ 0, ν} min_z L(z, λ, ν)
Weak duality: d* ≤ p*.
Strong duality: for convex problems with affine constraints, d* = p*.
KKT co…
The learning problem

Given a set H, find a function h ∈ H that minimizes the risk.
In the case of classification, we can also think of this as trying
to find an h that approximates the Bayes classifier.
More complex H: better chance of approximating the Bayes classifier.
Less complex H: better chance o…