Last name (CAPITALS): ___________________________________

First name (CAPITALS): __________________________________

Andrew User ID (CAPITALS) (without the @andrew.cmu.edu bit): ___________________________

15-781 Final Exam, Fall 2001

- You must answer any nine questions out of the following twelve. Each question is worth 11 points.
- You must fill out your name and your andrew userid clearly and in block capital letters on the front page. You will be awarded 1 point for doing this correctly.
- If you answer more than 9 questions, your best 9 scores will be used to derive your total.
- Unless the question asks for explanation, no explanation is required for any answer. But you are welcome to provide explanation if you wish.

1 Bayes Nets Inference
(a) Kangaroos.

[Bayes net: K -> A, with P(K) = 2/3, P(A|K) = 1/2, P(A|~K) = 1/10.]

Half of all kangaroos in the zoo are angry, and 2/3 of the zoo is comprised of kangaroos. Only 1 in 10 of the other animals are angry. What's the probability that a randomly-chosen animal is an angry kangaroo?

P(A ∧ K) = P(K) P(A|K) = 2/3 * 1/2 = 1/3
(b) Stupidity.
( s l P (S) == 0 .5
I P (C l S) = O . 5
{C l p (c  ~s> = 0.2 Half of all people are stupid. If you’re stupid then you7re more likely to be confused.
A randomly—chosen person is confused. What’s the chance they’re stupid? P(SC) = P(S“C) / [P(S“C) + P("S “ 0)] lci Potatoes. 1/4 / (1/4 + 1/10) = 5/7 B is P(B) = 1/2
A P(r‘lB) = 1/2
P("\~B) = 1/10
,, A \. P(JlT) = 1/2 ‘ J P(4\~T> = 1/10 Half of all potatoes are big. A big potato is more likely to be tall. A tall potato is more likely to be lovab e. What s the probability that a big lovable potato is tall? P(TIB“L) = P(T“B“L) / [P(T“B“L + P(”T“B“L)] 1/8 / (1/8 + 1/40) = 5/6 (d) Final part. M “1 s,> <3
,//\/;J(WS)=l/2
(w > P(W\~S):l What’s P(W /\ F)? P(W“F) = P(W“F“H) + P(W“F“”H) but P(W“F“H) = O P(W‘F‘”H) = P(F”H)P(W‘”H) =
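As a sanity check on the part (c) arithmetic, here is a minimal Python sketch (an added illustration, not part of the original answer key) that enumerates the joint distribution of the potato chain B -> T -> L using the probabilities in the diagram:

    # CPTs read off the diagram for the chain B -> T -> L.
    P_B = 0.5
    P_T_given_B = {True: 0.5, False: 0.1}       # P(T = true | B)
    P_L_given_T = {True: 0.5, False: 0.1}       # P(L = true | T)

    def joint(b, t, l):
        """P(B = b, T = t, L = l) for the chain B -> T -> L."""
        pb = P_B if b else 1 - P_B
        pt = P_T_given_B[b] if t else 1 - P_T_given_B[b]
        pl = P_L_given_T[t] if l else 1 - P_L_given_T[t]
        return pb * pt * pl

    # P(T | B, L) = P(T, B, L) / sum over T of P(T, B, L)
    numerator = joint(True, True, True)                  # 1/8
    denominator = numerator + joint(True, False, True)   # 1/8 + 1/40
    print(numerator / denominator)                       # 0.8333... = 5/6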
2 Bayes Nets and HMMs

Let nbs(m) = the number of possible Bayes Network graph structures using m attributes. (Note that two networks with the same structure but different probabilities in their tables do not count as different structures.) Which of the following statements is true?

(i)   nbs(m) < m
(ii)  m < nbs(m) < m(m-1)/2
(iii) m(m-1)/2 < nbs(m) < 2^m
(iv)  2^m < nbs(m) < 2^(m(m-1)/2)
(v)   2^(m(m-1)/2) < nbs(m)

Answer is (v), because the number of undirected graphs with m vertices is 2^(m choose 2) = 2^(m(m-1)/2), and there are even more acyclic directed graphs.
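The growth claim behind answer (v) can be illustrated by brute force for tiny m. The sketch below (an added illustration, not part of the exam) counts labelled directed acyclic graphs on m vertices, which is exactly nbs(m), and compares the counts with 2^(m(m-1)/2):

    from itertools import product

    def is_acyclic(m, edges):
        """Check acyclicity by repeatedly peeling off vertices with no incoming edge."""
        remaining = set(range(m))
        edges = set(edges)
        while remaining:
            sources = [v for v in remaining if not any(j == v for (_, j) in edges)]
            if not sources:
                return False   # every remaining vertex has an incoming edge, so there is a cycle
            for v in sources:
                remaining.discard(v)
            edges = {(i, j) for (i, j) in edges if i in remaining and j in remaining}
        return True

    def count_dags(m):
        """Count labelled DAGs on m vertices (= Bayes net structures) by brute force."""
        candidate_edges = [(i, j) for i in range(m) for j in range(m) if i != j]
        count = 0
        for keep in product([False, True], repeat=len(candidate_edges)):
            chosen = [e for e, k in zip(candidate_edges, keep) if k]
            if is_acyclic(m, chosen):
                count += 1
        return count

    for m in range(1, 5):
        print(m, count_dags(m))   # prints 1, 3, 25, 543; compare 2^(m(m-1)/2) = 1, 2, 8, 64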
Remember that I<X, Y, Z> means "X is conditionally independent of Z given Y".

Assuming the conventional assumptions and notation of Hidden Markov Models, in which q_t denotes the hidden state at time t and o_t denotes the observation at time t, which of the following are true of all HMMs? Write "True" or "False" next to each statement.

(i)   I<q_{t+1}, q_t, q_{t-1}>
(ii)  I<q_{t+2}, q_t, q_{t-1}>
(iii) I<q_{t+1}, q_t, q_{t-2}>
(iv)  I<o_{t+1}, o_t, o_{t-1}>
(v)   I<o_{t+2}, o_t, o_{t-1}>
(vi)  I<o_{t+1}, o_t, o_{t-2}>

(i), (ii), (iii): all TRUE
(iv), (v), (vi): all FALSE

3 Regression

(a) Consider the following data with one input and one output.
[Figure: scatter plot of the dataset, Y (output) vs. X (input), with x ranging from 0 to 3.]

(i) What is the mean squared training set error of running linear regression on this data (using the model y = w0 + w1 x)?

0

(ii) What is the mean squared test set error of running linear regression on this data, assuming the rightmost three points are in the test set, and the others are in the training set?

0

(iii) What is the mean squared leave-one-out cross-validation (LOOCV) error of running linear regression on this data?

(b) Consider the following data with one input and one output.
[Figure: scatter plot of the three datapoints, Y (output) vs. X (input).]

  X   Y
  1   1
  2   2
  3   1

(i) What is the mean squared training set error of running linear regression on this data (using the model y = w0 + w1 x)? (Hint: by symmetry it is clear that the best fit to the three datapoints is a horizontal line.)

SSE = (1/3)^2 + (2/3)^2 + (1/3)^2 = 6/9
MSE = SSE/3 = 2/9

(ii) What is the mean squared leave-one-out cross-validation (LOOCV) error of running linear regression on this data?

1/3 * (2^2 + 1^2 + 2^2) = 9/3 = 3
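The part (b) numbers can be reproduced with a few lines of numpy (an added sketch, not part of the exam): fit y = w0 + w1 x by least squares once on all three points for the training error, and once per held-out point for LOOCV.

    import numpy as np

    X = np.array([1.0, 2.0, 3.0])
    Y = np.array([1.0, 2.0, 1.0])

    def fit_predict(x_train, y_train, x_query):
        """Ordinary least squares for y = w0 + w1*x, then predict at x_query."""
        A = np.column_stack([np.ones_like(x_train), x_train])
        w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
        return w[0] + w[1] * x_query

    # Training-set MSE: fit on all three points, evaluate on the same points.
    preds = np.array([fit_predict(X, Y, x) for x in X])
    print("training MSE:", np.mean((preds - Y) ** 2))     # 2/9 = 0.222...

    # LOOCV MSE: for each point, fit on the other two and predict the held-out one.
    errs = []
    for k in range(len(X)):
        mask = np.arange(len(X)) != k
        errs.append((fit_predict(X[mask], Y[mask], X[k]) - Y[k]) ** 2)
    print("LOOCV MSE:", np.mean(errs))                     # 9/3 = 3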
(c) Suppose we plan to do regression with the following basis functions:

[Figure: three triangular "hat" basis functions plotted against X (input) from 0 to 6.]

  φ1(x) = 0      if x < 0        φ2(x) = 0      if x < 2        φ3(x) = 0      if x < 4
  φ1(x) = x      if 0 <= x < 1   φ2(x) = x - 2  if 2 <= x < 3   φ3(x) = x - 4  if 4 <= x < 5
  φ1(x) = 2 - x  if 1 <= x < 2   φ2(x) = 4 - x  if 3 <= x < 4   φ3(x) = 6 - x  if 5 <= x < 6
  φ1(x) = 0      if 2 <= x       φ2(x) = 0      if 4 <= x       φ3(x) = 0      if 6 <= x

Our regression will be y = β1 φ1(x) + β2 φ2(x) + β3 φ3(x).

Assume all our datapoints and future queries have 1 <= x <= 5. Is this a generally useful set of basis functions to use? If 'yes', then explain their prime advantage. If 'no', explain their biggest drawback.

No. They're forced to predict y = 0 at x = 2 and x = 4 (and forced to be close to zero nearby) no matter what the values of beta.

4 Regression Trees

Regression trees are a kind of decision tree used for learning from data with a real-valued
output instead of a categorical output. They were discussed in the 'Eight favorite regression algorithms' lecture.

On the next page you will see pseudocode for building a regression tree in the special case where all the input attributes are boolean (they can have values 0 or 1).

The MakeTree function takes two arguments:

- D, a set of datapoints
- and A, a set of input attributes.

It then makes the best regression tree it can using only the datapoints and attributes passed to it. It is a recursive procedure. The full algorithm is run by calling MakeTree with D containing every record and A containing every attribute. Note that this code does no pruning, and that it assumes that all input attributes are binary-valued.

Now read the code on the next page, after which question (a) will ask you about bugs in the code.

MakeTree(D,A)
Returns a Regression Tree

1. For each attribute a in the set A do...
   1.1 Let D0 = { (xk,yk) in D such that xk[a] = 0 }
       // Comment: xk[a] denotes the value of attribute a in record xk
   1.2 Let D1 = { (xk,yk) in D such that xk[a] = 1 }
       // Comment: Note that D0 union D1 == D
       // Comment: Note too that D0 intersection D1 == empty
   1.3 mu0 = mean value of yk among records in D0
   1.4 mu1 = mean value of yk among records in D1
   1.5 SSE0 = sum over all records in D0 of (yk - mu0) squared
   1.6 SSE1 = sum over all records in D1 of (yk - mu0) squared
   1.7 Let Score[a] = SSE0 + SSE1

2. // Once a score has been computed for each attribute, let...
   a* = argmax_a Score[a]

3. Let D0 = { (xk,yk) in D such that xk[a*] = 0 }

4. Let D1 = { (xk,yk) in D such that xk[a*] = 1 }

5. Let LeftChild = MakeTree(D0, A - {a*})
   // Comment: A - {a*} means the set containing all elements of A except for a*

6. Let RightChild = MakeTree(D1, A - {a*})
7. Return a tree whose root tests the value of a*, and whose "a* = 0" branch is LeftChild and whose "a* = 1" branch is RightChild.

(a) Beyond the obvious problem that there is no pruning, there are three bugs in the above code. They are all very distinct. One of them is at the level of a typographical error. The other two are more serious errors in logic. Identify the three bugs (remembering that the lack of pruning is not one of the three bugs), explaining why each one is a bug. It is not necessary to explain how to fix any bug, though you are welcome to do so if that's the easiest way to explain the bug.

- Line 1.6 should use (yk - mu1)^2.
- Line 2 should use argmin.
- The algorithm is missing the base case of the recursion.
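For reference, a rough Python sketch of what the algorithm looks like with those three bugs repaired (each side's SSE uses its own mean, the best attribute minimises the score, and a base case stops the recursion). The data representation here is illustrative, not the official solution:

    def make_tree(D, A):
        """D: non-empty list of (x, y) pairs, x a dict of binary attributes; A: set of attribute names."""
        ys = [y for _, y in D]
        # Base case (missing from the pseudocode): nothing left to split on, or pure node.
        if not A or len(set(ys)) <= 1:
            return {"leaf": sum(ys) / len(ys)}

        def sse(points):
            if not points:
                return 0.0
            mu = sum(y for _, y in points) / len(points)
            return sum((y - mu) ** 2 for _, y in points)

        def score(a):
            D0 = [(x, y) for x, y in D if x[a] == 0]
            D1 = [(x, y) for x, y in D if x[a] == 1]
            return sse(D0) + sse(D1)          # bug fix: each side uses its own mean

        a_star = min(A, key=score)            # bug fix: argmin, not argmax
        D0 = [(x, y) for x, y in D if x[a_star] == 0]
        D1 = [(x, y) for x, y in D if x[a_star] == 1]
        if not D0 or not D1:                  # degenerate split: stop and make a leaf
            return {"leaf": sum(ys) / len(ys)}
        return {"attr": a_star,
                "left": make_tree(D0, A - {a_star}),
                "right": make_tree(D1, A - {a_star})}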
(b) Why, in the recursive calls to MakeTree, is the second argument "A - {a*}" instead of simply "A"?

Because a* can't possibly be chosen in any recursive calls.

5 Clustering

In the left of the following two pictures I show a dataset. In the right figure I sketch the
globally maximally likely mixture of three Gaussians for the given data.

- Assume we have protective code in place that prevents any degenerate solutions in which some Gaussian grows infinitesimally small.
- And assume a GMM model in which all parameters (class probabilities, class centroids and class covariances) can be varied.

[Figure: left panel is a scatter plot of the dataset, Y (output) vs. X (input); right panel sketches the globally maximally likely mixture of three Gaussians.]

(a) Using the same notation and the same assumptions, sketch the globally maximally likely mixture of two Gaussians.
[Figure: sketch of the dataset with the two fitted Gaussians.]

(b) Using the same notation and the same assumptions, sketch a mixture of three distinct Gaussians that is stuck in a suboptimal configuration (i.e. in which infinitely many more iterations of the EM algorithm would remain in essentially the same suboptimal configuration). (You must not give an answer in which two or more Gaussians all have the same mean vectors; we are looking for an answer in which all the Gaussians have distinct mean vectors.)
[Figure: sketch of a suboptimal three-Gaussian configuration on the dataset.]

(c) Using the same notation and the same assumptions, sketch the globally maximally likely mixture of two Gaussians in the following, new, dataset.
[Figure: scatter plot of the new dataset, Y (output) vs. X (input).]

(d) Now, suppose we ran k-means with k = 2 on this dataset. Show the rough locations of the centers of the two clusters in the configuration with globally minimal distortion.
[Figure: sketch of the dataset with the two cluster-center locations marked.]
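The fits this question asks you to sketch can be explored numerically. A minimal sketch using scikit-learn (an assumed dependency; any EM and k-means implementation would do) on illustrative synthetic data:

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Illustrative 2-D data: two elongated blobs, loosely like the exam's scatter plots.
    data = np.vstack([rng.normal([1.0, 2.0], [0.6, 0.1], size=(100, 2)),
                      rng.normal([2.0, 1.0], [0.1, 0.6], size=(100, 2))])

    # Mixture of two full-covariance Gaussians; n_init restarts EM so it is less likely
    # to get stuck in the kind of suboptimal configuration part (b) asks you to sketch.
    gmm = GaussianMixture(n_components=2, covariance_type="full", n_init=10).fit(data)
    print("GMM means:\n", gmm.means_)

    # k-means with k = 2; inertia_ is the distortion that part (d) asks you to minimise.
    km = KMeans(n_clusters=2, n_init=10).fit(data)
    print("k-means centers:\n", km.cluster_centers_)
    print("distortion:", km.inertia_)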
6 Regression algorithms

For each empty box in the following table, write in 'Y' if the statement at the top of the column applies to the regression algorithm. Write 'N' if the statement does not apply.

Column 1: No matter what the training data is, the predicted output is guaranteed to be a continuous function of the input (i.e. there are no discontinuities in the prediction). If a predictor gives continuous but undifferentiable predictions then you should answer "Y".

Column 2: The cost of training on a dataset with R records is at least O(R^2): quadratic (or worse) in R. For iterative algorithms marked with (*), simply consider the cost of one iteration of the algorithm through the data.

                                                             Column 1   Column 2
  Linear Regression                                             Y
  Quadratic Regression                                          Y
  Perceptrons with sigmoid activation functions (*)             Y
  1-hidden-layer Neural Nets with sigmoid activation
    functions (*)                                               Y
  1-nearest neighbor
  10-nearest neighbor
  Kernel Regression
  Locally Weighted Regression
  Radial Basis Function Regression with 100 Gaussian
    basis functions
  Regression Trees
  Cascade correlation (with sigmoid activation functions)
  Multilinear interpolation
  MARS

7 Hidden Markov Models

Warning: this is a question that will take a few minutes if you really understand HMMs, but could take hours if you don't.

Assume we are working with this HMM:
[State-transition diagram: three states s1, s2, s3; start in s1 with probability 1.]

  a11 = 1/2   a12 = 1/2   a13 = 0     b1(X) = 1/2   b1(Y) = 1/2   b1(Z) = 0     pi1 = 1
  a21 = 0     a22 = 1/2   a23 = 1/2   b2(X) = 1/2   b2(Y) = 0     b2(Z) = 1/2   pi2 = 0
  a31 = 0     a32 = 0     a33 = 1     b3(X) = 0     b3(Y) = 1/2   b3(Z) = 1/2   pi3 = 0

where aij = P(q_{t+1} = sj | q_t = si).

Suppose we have observed this sequence: XZXYYZYZZ (in long-hand: o1 = X, o2 = Z, o3 = X, o4 = Y, o5 = Y, o6 = Z, o7 = Y, o8 = Z, o9 = Z).

Fill in this table with alpha_t(i) values, remembering the definition

  alpha_t(i) = P(o1 ∧ o2 ∧ ... ∧ ot ∧ qt = si)

So, for example, alpha_3(2) = P(o1 = X ∧ o2 = Z ∧ o3 = X ∧ q3 = s2).

  t    alpha_t(1)   alpha_t(2)   alpha_t(3)
  1    1/2          0            0
  2    0            1/8          0
  3    0            1/32         0
  4    0            0            1/128
  5    0            0            1/256
  6    0            0            1/512
  7    0            0            1/1024
  8    0            0            1/2048
  9    0            0            1/4096
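The table above is exactly what the forward algorithm computes. A short Python sketch for this particular HMM (an added illustration; states and observations are 0-indexed in the code):

    import numpy as np

    # Transitions a[i][j] = P(q_{t+1} = s_j | q_t = s_i), emissions b[i][obs], start probs pi.
    a = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0]])
    b = np.array([[0.5, 0.5, 0.0],      # state 1: P(X), P(Y), P(Z)
                  [0.5, 0.0, 0.5],      # state 2
                  [0.0, 0.5, 0.5]])     # state 3
    pi = np.array([1.0, 0.0, 0.0])
    obs = [{"X": 0, "Y": 1, "Z": 2}[c] for c in "XZXYYZYZZ"]

    # Forward recursion: alpha_t(i) = P(o_1 ... o_t, q_t = s_i)
    alpha = pi * b[:, obs[0]]
    print(1, alpha)                      # [1/2, 0, 0]
    for t, o in enumerate(obs[1:], start=2):
        alpha = (alpha @ a) * b[:, o]
        print(t, alpha)                  # matches the table above (e.g. t=9 -> [0, 0, 1/4096])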
8 Locally Weighted Regression

Here's an argument made by a misguided practitioner of Locally Weighted Regression:

Suppose you have a dataset with R1 training points and another dataset with R2 test points. You must predict the output for each of the test points. If you use a kernel function that decays to zero beyond a certain kernel width, then Locally Weighted Regression is computationally cheaper than regular linear regression. This is because with locally weighted regression you must do the following for each query point in the test set:

- Find all the points that have non-zero weight for this particular query.
- Do a linear regression with them (after having weighted their contribution to the regression appropriately).
- Predict the value of the query.

whereas with regular linear regression you must do the following for each query point:

- Take all the training set datapoints.
- Do an unweighted linear regression with them.
- Predict the value of the query.

The locally weighted regression frequently finds itself doing regression on only a
tiny fraction of the datapoints because most have zero weight. So most of the
local method's queries are cheap to answer. In contrast, regular regression must
use every single training point in every single prediction and so does at least as
much work, and usually more.

This argument has a serious error. Even if it is true that the kernel function causes almost all points to have zero weight for each LWR query, the argument is wrong. What is the error?

Linear regression only needs to learn its weights (i.e. do the appropriate matrix inversion) once in total. LWR must do a separate matrix inversion for each test point.

9 Nearest neighbor and cross-validation

At some point during this question you may find it useful to use the fact that if U and V are
two independent real-valued random variables then Var[aU + bV] = a^2 Var[U] + b^2 Var[V].

Suppose you have 10,000 datapoints {(xk, yk) : k = 1, 2, ..., 10000}. Your dataset has one input and one output. The kth datapoint is generated by the following recipe: xk = k/10000, and yk is drawn from a Gaussian with mean 0 and variance σ^2 = 4 (and standard deviation σ = 2), so that yk is all noise. Note that its value is independent of all the other y values. You are considering two learning algorithms:

- Algorithm NN: 1-nearest neighbor.
- Algorithm Zero: Always predict zero.

(a) What is the expected Mean Squared Training Error for Algorithm NN?

0

(b) What is the expected Mean Squared Training Error for Algorithm Zero?

4

(c) What is the expected Mean Squared Leave-one-out Cross-validation Error for Algorithm NN?

8 = E[(yk - y_{k±1})^2] = Var[yk] + Var[y_{k±1}] = 4 + 4
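A quick simulation sketch (an added illustration, not part of the answer key) that numerically confirms the expected values in (b) and (c):

    import numpy as np

    rng = np.random.default_rng(0)
    R = 10_000
    y = rng.normal(0.0, 2.0, size=R)           # pure noise with variance 4; x = k/R is evenly spaced

    # Algorithm Zero: the LOOCV prediction is always 0, so the error is E[y^2], about 4.
    print("Zero LOOCV MSE:", np.mean(y ** 2))

    # Algorithm NN: with evenly spaced x, each left-out point's nearest neighbour is an
    # adjacent point, so the error is E[(y_k - y_neighbour)^2] = 4 + 4 = 8.
    neighbour = np.roll(y, 1)
    neighbour[0] = y[1]                        # the first point's nearest neighbour is the second
    print("NN LOOCV MSE:", np.mean((y - neighbour) ** 2))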
(d) What is the expected Mean Squared Leave-one-out Cross-validation Error for Algorithm Zero?

10 Neural Nets

(a) Suppose we are learning a 1-hidden-layer neural net with a sign-function activation:

  Sign(z) = 1 if z >= 0
  Sign(z) = -1 if z < 0

[Network diagram: inputs x1 and x2 feed two hidden units,
   h1 = Sign(w11*x1 + w21*x2)
   h2 = Sign(w12*x1 + w22*x2)
 and the output is
   Output = W1*h1 + W2*h2
 with blanks for the six weights w11, w12, w21, w22, W1, W2.]

We give it this training set, which represents the exclusive-or function if you interpret -1 as false and +1 as true:

  X1   X2    Y
   1    1   -1
   1   -1    1
  -1    1    1
  -1   -1   -1

On the diagram above you must write in six numbers: a set of weights that would give zero training error. (Note that constant terms are not being used anywhere, and note too that the output does not need to go through a Sign function.) Or, if it is impossible to find a satisfactory set of weights, just write "impossible".

Impossible

(b) You have a dataset with one real-valued input x and one real-valued output y in which
you believe

  yk = exp(w * xk) + εk

where (xk, yk) is the kth datapoint and εk is Gaussian noise. This is thus a neural net with just one weight: w.

Give the update equation for a gradient descent approach to finding the value of w that minimizes the mean squared error.
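One possible answer sketch (hedged, not the official key): with E(w) = mean over k of (yk - exp(w*xk))^2, the derivative is dE/dw = -2 * mean of (yk - exp(w*xk)) * xk * exp(w*xk), so a gradient descent step is w <- w + 2*eta * mean of (yk - exp(w*xk)) * xk * exp(w*xk) for a small learning rate eta. A tiny numpy version:

    import numpy as np

    def gradient_descent_step(w, x, y, lr=0.01):
        """One step of gradient descent on the MSE of the model y ~ exp(w * x)."""
        pred = np.exp(w * x)
        grad = -2.0 * np.mean((y - pred) * x * pred)   # d/dw of mean((y - exp(w*x))^2)
        return w - lr * grad

    # Tiny illustrative run on synthetic data generated with a "true" w of 0.7.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 2, size=50)
    y = np.exp(0.7 * x) + rng.normal(0, 0.1, size=50)
    w = 0.0
    for _ in range(2000):
        w = gradient_descent_step(w, x, y)
    print(w)   # should land near 0.7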
11 Support Vector Machines

Consider the following dataset. We are going to learn a linear SVM from it of the form
f(x) = sign(wx + b).

[Figure: the datapoints plotted along the x-axis from 0 to 5; circles denote Class -1 and filled dots denote Class 1.]

   X     Y
   1     -1
   2     -1
   1.5    1
   4      1
   5      1

(b) What is the training set error of the above example? (expressed as the percentage of training points misclassified)

(c) What is the leave-one-out cross-validation error of the above example? (expressed as the percentage of left-out points misclassified)

2 wrong => 40%

(d) True or False: Even with the clever SVM Kernel trick it is impossibly computationally expensive, even on a supercomputer, to do the following: Given a dataset with 200 datapoints and 50 attributes, learn an SVM classifier with full 20th-degree-polynomial basis functions and then apply what you've learned to predict the classes of 1000 test datapoints.

FALSE
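A short illustration of why (d) is FALSE: the kernel trick never builds the 20th-degree polynomial feature expansion explicitly; it only needs inner products in the original 50-dimensional space, e.g. K(u, v) = (1 + u.v)^20. The Gram matrices involved are tiny (numpy sketch on random stand-in data):

    import numpy as np

    rng = np.random.default_rng(0)
    train = rng.normal(size=(200, 50))     # 200 datapoints, 50 attributes
    test = rng.normal(size=(1000, 50))

    # Polynomial kernel of degree 20: K(u, v) = (1 + u.v)^20, computed in O(#attributes).
    gram = (1.0 + train @ train.T) ** 20        # 200 x 200 matrix used to train the SVM
    cross = (1.0 + test @ train.T) ** 20        # 1000 x 200 matrix used for prediction
    print(gram.shape, cross.shape)              # both computed in well under a second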
12 VC Dimension

(a) Suppose we have one input variable x and one output variable y. We are using the machine f1(x, α) = sign(x + α). What is the VC dimension of f1?

1

(b) Suppose we have one input variable x and one output variable y. We are using the machine f2(x, α) = sign(αx + 1). What is the VC dimension of f2?

1

(c) Now assume our inputs are m-dimensional and we use the following two-level, two-
choice decision tree to make our classification:

                  is x[A] < B?
                 /            \
           if no               if yes
               /                \
      is x[C] < D?           is x[E] < F?
       /        \             /        \
   if no       if yes      if no      if yes
     |            |           |           |
  Predict      Predict     Predict     Predict
  Class G      Class H     Class I     Class J

where the machine has 10 parameters:

  A ∈ {1, 2, ..., m}    B ∈ R    G ∈ {-1, 1}
  C ∈ {1, 2, ..., m}    D ∈ R    H ∈ {-1, 1}
  E ∈ {1, 2, ..., m}    F ∈ R    I ∈ {-1, 1}
                                 J ∈ {-1, 1}

What is the VC dimension of this machine?

4