Regularization and Kernel Methods
Le Song
Machine Learning I
CSE 6740, Fall 2013
Nonlinear regression
Want to fit a polynomial regression model
y = θ₀ + θ₁x + θ₂x² + ⋯ + θₖxᵏ + ε
Let φ(x) = (1, x, x², …, xᵏ)ᵀ
and θ = (θ₀, θ₁, θ₂, …, θₖ)ᵀ, so that
y = θᵀφ(x) + ε
Least mean squares method
Given data points {(xᵢ, yᵢ)}ᵢ₌₁ⁿ, find θ that minimizes the mean square error
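As a concrete sketch (not from the slides), the polynomial fit above can be computed by building the feature matrix with rows φ(xᵢ) and solving the least-squares problem; the data and degree below are illustrative.

```python
import numpy as np

def poly_features(x, k):
    """Map scalar inputs x to polynomial features (1, x, x^2, ..., x^k)."""
    return np.vander(x, k + 1, increasing=True)

def fit_least_squares(x, y, k):
    """Minimize the mean square error ||y - Phi theta||^2 over theta."""
    Phi = poly_features(x, k)
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta

# illustrative noiseless data from a known quadratic
x = np.linspace(-1.0, 1.0, 20)
y = 2.0 + 3.0 * x - 1.5 * x**2
theta = fit_least_squares(x, y, k=2)
```

With noiseless data the true coefficients (θ₀, θ₁, θ₂) = (2, 3, −1.5) are recovered exactly.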
Neural Networks
Learning nonlinear decision boundary
Linearly separable
Nonlinearly separable
The XOR gate
Speech recognition
A decision tree for Tax Fraud
Input: a vector of attributes
x = [Refund, MarSt, Tax…]
EM Algorithm and
Abnormal Data Detection
Multimodal distributions
What if we know the data consists of a few Gaussians?
What if we want to fit parametric models?
Gaussian mixture model
A density model p(x) may be multi-modal: model it as a
mixture of uni-modal distributions (e.g. Gaussians)
Mixture of Gaussian
Why do we need density estimation?
Learn the shape of the data cloud
Assess the likelihood of seeing a particular data point
Is this a typical data point? (high density value)
Is this an abnormal data point? (low density value)
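The density-based test above can be sketched in a few lines: fit a simple density model to the data, then flag points whose density falls below a cutoff. The Gaussian model and the threshold value here are illustrative choices, not from the slides.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def flag_abnormal(data, x, threshold=1e-3):
    """Fit a Gaussian to `data` by maximum likelihood (sample mean and std),
    then flag points x whose density falls below `threshold`.
    The threshold is an illustrative choice."""
    mu, sigma = data.mean(), data.std()
    return gaussian_density(x, mu, sigma) < threshold

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=1000)
# a typical point (0.0) vs. an extreme outlier (100.0)
flags = flag_abnormal(data, np.array([0.0, 100.0]))
```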
Review of Last Lecture
Learning nonlinear decision boundary
Linearly separable
Nonlinearly separable
The XOR gate
Speech recognition
Perceptron
From biological neuron to
artificial neuron (perceptron)
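The perceptron's learning rule can be sketched as follows: whenever a point is misclassified, add (or subtract) it to the weight vector. The toy data below is an illustrative, linearly separable example.

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    """Train a perceptron: w <- w + y_i x_i whenever x_i is misclassified.
    Labels y are in {-1, +1}; a constant bias feature is appended to X."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append 1 for the bias term
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
                w += yi * xi         # perceptron update
    return w

def perceptron_predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xb @ w)

# linearly separable toy data: the label is the sign of the first coordinate
X = np.array([[2.0, 1.0], [1.5, -1.0], [-2.0, 0.5], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
```

For linearly separable data this loop is guaranteed to converge to a separating hyperplane.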
Distribution/Density Estimation
What spectral clustering is good for
Cluster or group data points reachable to each other by walking in the data cloud
Key idea:
Build nearest neighbor graph
Use graph Laplacian
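The key idea above (nearest neighbor graph plus graph Laplacian) can be sketched with the unnormalized Laplacian L = D − A; the adjacency matrix below is an illustrative example with two disconnected cliques, where the multiplicity of the zero eigenvalue equals the number of connected components — the fact spectral clustering exploits.

```python
import numpy as np

def graph_laplacian(A):
    """Unnormalized graph Laplacian L = D - A for a symmetric adjacency matrix A."""
    D = np.diag(A.sum(axis=1))   # degree matrix
    return D - A

# two disconnected triangles -> two zero eigenvalues, one per component
block = np.ones((3, 3)) - np.eye(3)
A = np.block([[block, np.zeros((3, 3))],
              [np.zeros((3, 3)), block]])
L = graph_laplacian(A)
eigvals = np.linalg.eigvalsh(L)
n_components = int(np.sum(np.abs(eigvals) < 1e-9))
```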
Clustering Nodes in Graphs
Clustering is a subjective task
What is considered similar/dissimilar?
You pick your similarity/dissimilarity
Distance functions for vectors
Suppose two data points x and y, both in ℝᵈ
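Two standard distance functions for such vectors, sketched directly from their definitions (the points below are illustrative):

```python
import numpy as np

def euclidean(x, y):
    """L2 (Euclidean) distance: sqrt(sum_i (x_i - y_i)^2)."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan(x, y):
    """L1 (Manhattan) distance: sum_i |x_i - y_i|."""
    return float(np.sum(np.abs(x - y)))

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
```

Which distance you pick encodes your notion of similarity, which is exactly the subjective choice the slide refers to.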
Nonlinear Dimensionality Reduction
An example
Data vary more in one direction and less in another; the two features are correlated
Reduce to 1 dimension
Principal component analysis
Given data points x₁, …, xₙ, find the directions along which the data vary the most
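A minimal PCA sketch under the standard formulation (center the data, take top eigenvectors of the sample covariance); the correlated 2-d data below is illustrative.

```python
import numpy as np

def pca(X, k):
    """Project centered data onto the top-k principal directions
    (eigenvectors of the sample covariance matrix)."""
    Xc = X - X.mean(axis=0)                  # center the data
    C = Xc.T @ Xc / len(X)                   # sample covariance
    eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :k]            # top-k directions
    return Xc @ top

# correlated 2-d data lying near the line x2 = 2*x1, plus a little noise
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t]) + 0.01 * rng.normal(size=(200, 2))
Z = pca(X, 1)   # reduce to 1 dimension
```

The 1-d projection retains almost all of the variance, since the data effectively live on a line.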
Dimensionality Reduction
Unsupervised learning
Learning from raw (unlabeled, unannotated, etc.) data, as
opposed to supervised learning, where a label for each
example is given
Explore and understand your data
Bayes Decision Rule and
Naïve Bayes Classifier
Gaussian mixture model
A density model p(x) may be multi-modal: model it as a
mixture of uni-modal distributions (e.g. Gaussians)
Consider a mixture of K Gaussians:
p(x) = Σₖ₌₁ᴷ πₖ N(x | μₖ, Σₖ),  with πₖ ≥ 0 and Σₖ πₖ = 1
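Evaluating such a mixture density is a direct transcription of the formula; the sketch below handles the 1-d case, and the two well-separated components are illustrative.

```python
import numpy as np

def gmm_density(x, weights, means, stds):
    """Mixture density p(x) = sum_k pi_k N(x | mu_k, sigma_k^2), 1-d case."""
    x = np.asarray(x, dtype=float)[..., None]        # shape (n, 1) for broadcasting
    comps = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2.0 * np.pi))
    return comps @ weights                            # weighted sum over components

# two well-separated components -> a bimodal density
weights = np.array([0.5, 0.5])
means = np.array([-3.0, 3.0])
stds = np.array([1.0, 1.0])
p = gmm_density([-3.0, 0.0, 3.0], weights, means, stds)
```

The density is high at the two component means and low between them, i.e. the model is multi-modal.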
SVM and Decision Tree
Which decision boundary is better?
Suppose the training samples
are linearly separable
We can find a decision boundary
which gives zero training error
But there are many such decision boundaries
Discriminative classifier and
logistic regression
Classification
Input: data features x
A label is provided for each data point, e.g., y ∈ {−1, +1}
Classifier
Class conditional distribution
Given data points, drawn i.i.d. from the class conditional distributions
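As a sketch of the discriminative approach this deck is headed toward, logistic regression models P(y = 1 | x) = σ(θᵀx) and fits θ by gradient descent on the negative log-likelihood. The learning rate, step count, and toy data below are illustrative, and labels are coded {0, 1} for the gradient form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_fit(X, y, lr=0.1, steps=2000):
    """Fit P(y=1|x) = sigmoid(w @ x) by gradient descent on the
    negative log-likelihood; labels y are in {0, 1} here."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - y) / len(y)       # gradient of the NLL
    return w

# 1-d toy data: negatives on the left, positives on the right
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = logistic_fit(X, y)
probs = sigmoid(np.hstack([X, np.ones((4, 1))]) @ w)
```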
Overfitting and Cross-Validation
Apartment hunting
Suppose you are to move to Atlanta
And you want to find the most
reasonably priced apartment satisfying
your needs:
square footage, # of bedrooms, distance to campus, etc.
Markov Random Fields &
Conditional Random Fields
From static to dynamic mixture models
Static mixture
Dynamic mixture
[Figure: plate diagrams of a static mixture (one latent variable Y generating each observation X) vs. a dynamic mixture (a chain Y₁ → Y₂ → ⋯ → Y_T, with each Yₜ emitting an observation Xₜ)]
Hidden Markov Model
Directed graphical model
Hidden Markov Models
Hidden Markov Models
Transition model: P(Yₜ₊₁ | Yₜ)
Emission model: P(Xₜ | Yₜ)
From static to dynamic mixture models
Static mixture
Dynamic mixture
[Figure: plate diagrams of a static mixture (one latent variable Y generating each observation X) vs. a dynamic mixture (a chain Y₁ → Y₂ → ⋯ → Y_T, with each Yₜ emitting an observation Xₜ)]
Hidden Markov Model
Directed graphical model
Kernel Methods
Nonlinear regression
Find a nonlinear prediction function from input features to the output
Ridge regression
Given data points {(xᵢ, yᵢ)}ᵢ₌₁ⁿ, find θ that minimizes the regularized
mean square error

J(θ) = (1/n) Σᵢ₌₁ⁿ (yᵢ − θᵀxᵢ)² + λ‖θ‖²
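Setting the gradient of this objective to zero gives the closed-form solution θ = (XᵀX + nλI)⁻¹Xᵀy (the factor n appears because of the 1/n on the squared error). A sketch on illustrative data:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution for (1/n) sum (y_i - theta^T x_i)^2 + lam*||theta||^2:
    theta = (X^T X + n*lam*I)^{-1} X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# illustrative noiseless data with known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true
theta0 = ridge_fit(X, y, 0.0)   # lam = 0 reduces to ordinary least squares
theta1 = ridge_fit(X, y, 1.0)   # lam > 0 shrinks theta toward zero
```

Increasing λ trades a little bias for smaller parameter norm, which is the point of the regularizer.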
Regression
Rationale: Combination of methods
There is no algorithm that is always the most accurate
We can select simple weak classification or regression
methods and combine them into a single strong method
Kernel Methods and HMM
Nonlinear regression
Find a nonlinear prediction function from input features to the output
implicitly map data to a new nonlinear feature space
find linear regressor in the new space
Nonlinear classification
Combining Classifiers
and boosting
Rationale: Combination of methods
There is no algorithm that is always the most accurate
We can select simple weak classification or regression
methods and combine them into
Support Vector Machines
Naïve Bayes classifier
Still use Bayes decision rule for classification:
y* = argmaxᵧ P(Y = y | X = x)
But assume P(X | Y = 1) is fully factorized:
P(X | Y = 1) = Πⱼ₌₁ᵈ P(Xⱼ | Y = 1)
Or: the variables corresponding to each dimension of X are conditionally independent given Y
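The factorized assumption makes estimation trivial: one per-dimension conditional per class. A sketch for binary features, with Laplace smoothing added (an illustrative choice) to avoid zero probabilities; the toy data is made up.

```python
import numpy as np

def nb_fit(X, y):
    """Naive Bayes for binary features: estimate P(y) and P(x_j = 1 | y)
    for each dimension j, with Laplace (add-one) smoothing."""
    classes = np.unique(y)
    priors = {c: float(np.mean(y == c)) for c in classes}
    cond = {c: (X[y == c].sum(axis=0) + 1) / (np.sum(y == c) + 2) for c in classes}
    return classes, priors, cond

def nb_predict(model, x):
    """Bayes decision rule: argmax_c log P(y=c) + sum_j log P(x_j | y=c)."""
    classes, priors, cond = model
    scores = []
    for c in classes:
        p = cond[c]
        ll = np.log(priors[c]) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
        scores.append(ll)
    return classes[int(np.argmax(scores))]

# toy binary data: class +1 tends to have the first feature on,
# class -1 tends to have the last feature on
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([1, 1, -1, -1])
model = nb_fit(X, y)
```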
Clustering
Nonlinear dimensionality reduction
A: walking-distance over the data cloud
B: nearest neighbor graph and shortest path
C: two dimensional reduced representation of the data
Isomap
Key idea: produce a low-dimensional embedding that preserves geodesic (shortest-path) distances along the data cloud
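Steps A and B above — walking-distance via a nearest neighbor graph and shortest paths — can be sketched as follows. Floyd–Warshall is used here for brevity; the four collinear points and the choice k = 2 are illustrative.

```python
import numpy as np

def geodesic_distances(X, k=2):
    """Isomap's first two steps: build a k-nearest-neighbor graph with
    Euclidean edge weights, then approximate geodesic distances with
    all-pairs shortest paths (Floyd-Warshall)."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # Euclidean
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]   # k nearest neighbors of point i
        G[i, nbrs] = D[i, nbrs]
        G[nbrs, i] = D[i, nbrs]            # keep the graph symmetric
    for m in range(n):                      # Floyd-Warshall shortest paths
        G = np.minimum(G, G[:, [m]] + G[[m], :])
    return G

# points spaced along a 1-d curve: graph distance accumulates along the curve
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
G = geodesic_distances(X, k=2)
```

Step C (the low-dimensional embedding) would then apply classical MDS to this distance matrix.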