CS229 Problem Set #4

CS 229, Public Course
Problem Set #4: Unsupervised Learning and Reinforcement Learning

1. EM for supervised learning
In class we applied EM to the unsupervised learning setting. In particular, we represented $p(x)$ by marginalizing over a latent random variable

$$p(x) = \sum_z p(x, z) = \sum_z p(x \mid z)\, p(z).$$
However, EM can also be applied to the supervised learning setting, and in this problem we discuss a "mixture of linear regressors" model; this is an instance of what is often called the Hierarchical Mixture of Experts model. We want to represent $p(y \mid x)$, $x \in \mathbb{R}^n$ and $y \in \mathbb{R}$, and we do so by again introducing a discrete latent random variable
$$p(y \mid x) = \sum_z p(y, z \mid x) = \sum_z p(y \mid x, z)\, p(z \mid x).$$
For simplicity we'll assume that $z$ is binary valued, that $p(y \mid x, z)$ is a Gaussian density, and that $p(z \mid x)$ is given by a logistic regression model. More formally,
$$p(z \mid x; \phi) = g(\phi^T x)^z \left(1 - g(\phi^T x)\right)^{1-z}$$

$$p(y \mid x, z = i; \theta_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y - \theta_i^T x)^2}{2\sigma^2}\right), \quad i = 0, 1,$$
where $\sigma$ is a known parameter and $\phi, \theta_0, \theta_1 \in \mathbb{R}^n$ are parameters of the model (here we use the subscript on $\theta$ to denote two different parameter vectors, not to index a particular entry in these vectors).
Intuitively, the process behind the model can be thought of as follows. Given a data point $x$, we first determine whether the data point belongs to one of two hidden classes, $z = 0$ or $z = 1$, using a logistic regression model. We then determine $y$ as a linear function of $x$ (a different linear function for each value of $z$) plus Gaussian noise, as in the standard linear regression model. For example, the following data set could be well represented by the model, but not by standard linear regression.
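The generative process just described can be simulated directly. The sketch below uses made-up parameter values (`phi`, `theta0`, `theta1`, and `sigma` are illustrative choices, not given in the problem):

```python
import numpy as np

def sample_mixture(X, phi, theta0, theta1, sigma, rng):
    """Draw y for each row of X from the mixture-of-linear-regressors model."""
    p_z1 = 1.0 / (1.0 + np.exp(-X @ phi))        # p(z = 1 | x) = g(phi^T x)
    z = rng.random(len(X)) < p_z1                # sample the hidden class
    means = np.where(z, X @ theta1, X @ theta0)  # theta_z^T x
    return means + sigma * rng.standard_normal(len(X)), z

# Illustrative parameters: two lines, with mixing proportions that depend on x.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.uniform(-3, 3, 500)])  # intercept + feature
theta0 = np.array([1.0, -2.0])
theta1 = np.array([-1.0, 2.0])
phi = np.array([0.0, 1.5])
y, z = sample_mixture(X, phi, theta0, theta1, sigma=0.3, rng=rng)
```

Plotting `y` against the second column of `X` shows two intersecting linear trends, the kind of data set a single linear regression cannot capture.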
(a) Suppose $x$, $y$, and $z$ are all observed, so that we obtain a training set $\{(x^{(1)}, y^{(1)}, z^{(1)}), \ldots, (x^{(m)}, y^{(m)}, z^{(m)})\}$. Write the log-likelihood of the parameters, and derive the maximum likelihood estimates for $\phi$, $\theta_0$, and $\theta_1$. Note that because $p(z \mid x)$ is a logistic regression model, there will not exist a closed-form estimate of $\phi$. In this case, derive the gradient and the Hessian of the likelihood with respect to $\phi$; in practice, these quantities can be used to numerically compute the ML estimate.
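Part (a)'s $\phi$ update can be sanity-checked numerically: with $z$ observed, maximizing over $\phi$ is an ordinary logistic-regression maximum-likelihood problem, with gradient $\sum_i (z^{(i)} - g(\phi^T x^{(i)}))\, x^{(i)}$ and Hessian $-\sum_i g(1-g)\, x^{(i)} x^{(i)T}$. A minimal sketch (the function names and iteration count are my own choices, not part of the problem):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logreg_grad_hess(phi, X, z):
    """Gradient and Hessian of sum_i log p(z_i | x_i; phi) for logistic regression."""
    g = sigmoid(X @ phi)
    grad = X.T @ (z - g)                       # sum_i (z_i - g(phi^T x_i)) x_i
    H = -(X * (g * (1 - g))[:, None]).T @ X    # -sum_i g(1-g) x_i x_i^T
    return grad, H

def newton_fit(X, z, iters=20):
    """Newton's method ascent on the log-likelihood, starting from phi = 0."""
    phi = np.zeros(X.shape[1])
    for _ in range(iters):
        grad, H = logreg_grad_hess(phi, X, z)
        phi = phi - np.linalg.solve(H, grad)   # Newton step (H is negative definite)
    return phi
```

At convergence the gradient should vanish, which gives an easy check on a hand-derived formula.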
(b) Now suppose $z$ is a latent (unobserved) random variable. Write the log-likelihood of the parameters, and derive an EM algorithm to maximize the log-likelihood. Clearly specify the E-step and M-step (again, the M-step will require a numerical solution, so find the appropriate gradients and Hessians).
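A prototype can help check the derivation in (b). In the sketch below, the E-step computes responsibilities $w^{(i)} = p(z^{(i)} = 1 \mid x^{(i)}, y^{(i)})$, and the M-step solves the two weighted least-squares problems for $\theta_0, \theta_1$ in closed form and takes a single Newton step on $\phi$ against the soft labels $w$. The function names and the single-Newton-step choice are mine, not part of the problem:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def e_step(X, y, phi, theta0, theta1, sigma):
    """Responsibilities w_i = p(z_i = 1 | x_i, y_i) under the current parameters."""
    p1 = sigmoid(X @ phi)
    lik1 = np.exp(-(y - X @ theta1) ** 2 / (2 * sigma**2))
    lik0 = np.exp(-(y - X @ theta0) ** 2 / (2 * sigma**2))
    num = p1 * lik1
    return num / (num + (1 - p1) * lik0)

def m_step(X, y, w, phi):
    """Weighted least squares for theta_0, theta_1; one Newton step on phi."""
    def wls(weights):
        Xw = X * weights[:, None]
        return np.linalg.solve(Xw.T @ X, Xw.T @ y)   # (X^T W X)^-1 X^T W y
    theta1 = wls(w)
    theta0 = wls(1 - w)
    g = sigmoid(X @ phi)
    grad = X.T @ (w - g)                             # soft labels replace z
    H = -(X * (g * (1 - g))[:, None]).T @ X
    phi = phi - np.linalg.solve(H, grad)
    return phi, theta0, theta1
```

Alternating the two steps on data drawn from the model should recover the two regression lines when started near the truth.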
2. Factor Analysis and PCA

In this problem we look at the relationship between two unsupervised learning algorithms we discussed in class: Factor Analysis and Principal Component Analysis.
Consider the following joint distribution over $(x, z)$, where $z \in \mathbb{R}^k$ is a latent random variable:

$$z \sim \mathcal{N}(0, I)$$
$$x \mid z \sim \mathcal{N}(Uz, \sigma^2 I).$$
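One consequence worth checking empirically: marginalizing out $z$ gives $x \sim \mathcal{N}(0,\, UU^T + \sigma^2 I)$. The sketch below draws from the generative process and compares the empirical covariance of $x$ to this prediction (the particular $U$, $\sigma$, and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, m = 3, 2, 200_000
U = rng.standard_normal((n, k))
sigma = 0.5

# Generative process: z ~ N(0, I_k), then x | z ~ N(Uz, sigma^2 I_n).
z = rng.standard_normal((m, k))
x = z @ U.T + sigma * rng.standard_normal((m, n))

emp_cov = np.cov(x, rowvar=False)            # empirical covariance of x
model_cov = U @ U.T + sigma**2 * np.eye(n)   # predicted marginal covariance
```

The two covariance matrices should agree up to sampling error, which is the starting point for relating this model to PCA as $\sigma \to 0$.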