15-781 Final Exam, Fall 2002

- Write your name and your Andrew email address below.
  Name:
  Andrew ID:
- There should be 17 pages in this exam (excluding this cover sheet).
- If you need more room to work out your answer to a question, use the back of the page and clearly mark on the front of the page if we are to look at what's on the back.
- You should attempt to answer all of the questions.
- You may use any and all notes, as well as the class textbook.
- All questions are worth an equal amount. They are not all equally difficult.
- You have 3 hours.
- Good luck!

1 Computational Learning Theory

1.1 PAC learning for Decision Lists

A decision list is a list of if-then rules where each condition is a literal (a variable or its negation). It can be thought of as a decision tree with just one path. For example, say that I like to go for a walk if it's warm or if it's snowing and I have a jacket, as long as it's not raining. We could describe this as the following decision list:

if rainy then no
else if warm then yes
else if not(have_jacket) then no
else if snowy then yes
else no.

(a) Describe an algorithm to learn DLs given a data set. [Example data table not reproduced in the scan.] Your algorithm should have the characteristic that it should always classify examples that it has already seen correctly (i.e., it should be consistent with the data). If it's not possible to continue to produce a decision list that's consistent with the data, your algorithm should terminate and announce that it has failed.

(b) Find the size of the hypothesis space, |H|, for decision lists of k attributes.

(c) Find an expression for the number of examples needed to learn a decision list of k attributes with error at most 0.10 with probability 90%.

(d) What if the learner is trying to learn a decision list, but the representation that it is using is a conjunction of k literals? Find the expression for the number of examples needed to learn the decision list with error at most 0.10 with 90% probability.
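The consistent learner asked for in part (a) can be sketched as follows. This is a hypothetical minimal implementation (the function names and the example encoding are my own, not from the exam): greedily find a literal whose matching examples all share one label, emit it as a rule, remove those examples, and fail if no such literal exists.

```python
def learn_decision_list(examples):
    """Greedy consistent learner for decision lists (Rivest-style).

    examples: list of (attrs, label), where attrs maps attribute name -> bool.
    Returns a list of ((attr, polarity), output) rules ending with a default
    rule (None, output), or None if no consistent decision list exists.
    """
    remaining = list(examples)
    rules = []
    attrs = sorted({a for x, _ in examples for a in x})
    while remaining:
        labels = {y for _, y in remaining}
        if len(labels) == 1:                       # all remaining agree: default rule
            rules.append((None, labels.pop()))
            return rules
        progressed = False
        for attr in attrs:
            for polarity in (True, False):
                covered = [(x, y) for x, y in remaining if x[attr] == polarity]
                outputs = {y for _, y in covered}
                if covered and len(outputs) == 1:  # "pure" literal: safe to emit
                    rules.append(((attr, polarity), outputs.pop()))
                    remaining = [(x, y) for x, y in remaining
                                 if x[attr] != polarity]
                    progressed = True
                    break
            if progressed:
                break
        if not progressed:
            return None                            # no consistent decision list

def classify(rules, x):
    for literal, out in rules:
        if literal is None or x[literal[0]] == literal[1]:
            return out
```

For parts (c) and (d), the standard bound for any consistent learner applies: m >= (1/ε)(ln|H| + ln(1/δ)), here with ε = 0.10 and δ = 0.10.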
2 K-means and Gaussian Mixture Models

(a) What is the effect on the means found by k-means (as opposed to the true means) of overlapping clusters? The means are pushed further apart than the true means would be.

(b) Run k-means manually for the following dataset. Circles are data points and squares are the initial cluster centers. [Figure not reproduced in the scan.] Draw the cluster centers and the decision boundaries that define each cluster. Use as many pictures as you need until convergence.
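The updates you would trace by hand in part (b) can be sketched in code. This is a minimal sketch on made-up 1-D data (the exam's 2-D dataset did not survive the scan); it also follows the empty-cluster rule stated in the note below the question:

```python
def kmeans(points, centers, iters=100):
    """Plain k-means on 1-D points.

    A mean with no assigned points stays where it is for that iteration,
    matching the convention used in the exam question.
    """
    centers = list(centers)
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:  # assign each point to its nearest center
            j = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[j].append(p)
        new = [sum(c) / len(c) if c else centers[j]   # empty cluster: stay put
               for j, c in enumerate(clusters)]
        if new == centers:                            # converged
            return centers
        centers = new
    return centers
```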
Note: Execute the algorithm such that if a mean has no points assigned to it, it stays where it is for that iteration.

(c) Now draw (approximately) what a Gaussian mixture model of three Gaussians with the same initial centers as for the k-means problem would converge to. Assume that the model puts no restrictions on the form of the covariance matrices and that EM updates both the means and covariance matrices.

(d) Is the classification given by the mixture model the same as the classification given by k-means? Why or why not?

3 HMMs

Andrew lives a simple life. Some days he's Angry and some days he's Happy. But he hides his emotional state, and so all you can observe is whether he smiles, frowns, laughs, or yells. We start on day 1 in the Happy state, and there's one transition per day.

[State-transition diagram not reproduced in the scan; the worked answers below imply p(Happy→Happy) = 0.8 and p(Happy→Angry) = 0.2.]

Happy: p(smile) = 0.5, p(frown) = 0.1, p(laugh) = 0.2, p(yell) = 0.2
Angry: p(smile) = 0.1, p(frown) = 0.5, p(laugh) = 0.2, p(yell) = 0.2

Definitions:
q_t = state on day t
O_t = observation on day t

(a) What is P(q2 = Happy)? 0.8

(b) What is P(O2 = frown)? 8/10 × 1/10 + 2/10 × 5/10 = 18/100

(c) What is P(q2 = Happy | O2 = frown)? P(O2 = frown | q2 = H) P(q2 = H) / P(O2 = frown) = (1/10 × 8/10) / (18/100) = 8/18 = 4/9

(d) What is P(O100 = yell)? P(O100 = yell | q100 = H) P(q100 = H) + P(O100 = yell | q100 = A) P(q100 = A) = 2/10 × (P(q100 = H) + P(q100 = A)) = 2/10 × 1 = 2/10
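The filtering computations in parts (a) through (c) can be checked mechanically. The transition diagram did not survive the scan, so the transition probabilities below are an assumption inferred from the worked answers (p(Happy→Happy) = 0.8, p(Happy→Angry) = 0.2); the emission probabilities are the ones tabulated above.

```python
# Emission probabilities from the exam's tables.
emit = {
    'H': {'smile': 0.5, 'frown': 0.1, 'laugh': 0.2, 'yell': 0.2},
    'A': {'smile': 0.1, 'frown': 0.5, 'laugh': 0.2, 'yell': 0.2},
}
# ASSUMED (diagram lost in the scan): one-step transitions out of Happy.
trans_from_H = {'H': 0.8, 'A': 0.2}

# (a) Day 1 is Happy and there is one transition per day.
p_q2_H = trans_from_H['H']

# (b) Marginalize over the day-2 state: sum_s P(O2=frown | q2=s) P(q2=s).
p_O2_frown = sum(trans_from_H[s] * emit[s]['frown'] for s in ('H', 'A'))

# (c) Bayes rule.
p_q2_H_given_frown = trans_from_H['H'] * emit['H']['frown'] / p_O2_frown
```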
(e) Assume that O1 = frown, O2 = frown, O3 = frown, O4 = frown, and O5 = frown. What is the most likely sequence of states? H A A A A

4 Bayesian Inference

(a) Consider a dataset over 3 boolean attributes, X, Y, and Z. Of these sets of information, which are sufficient to specify the joint distribution? Circle all that apply.
A. P(~X|Z), P(~X|~Z), P(~Y|X∧Z), P(~Y|X∧~Z), P(~Y|~X∧Z), P(~Y|~X∧~Z), P(Z)
   Answer: yes.

B. P(~X|~Z), P(X|~Z), P(Y|X∧Z), P(Y|X∧~Z), P(Y|~X∧Z), P(Y|~X∧~Z), P(Z)
   Answer: no; nothing specifies X given Z.

C. P(X|Z), P(X|~Z), P(Y|X∧Z), P(Y|X∧~Z), P(Y|~X∧Z), P(~Y|~X∧~Z), P(~Z)
   Answer: yes.

D. P(X|Z), P(X|~Z), P(Y|X∧Z), P(Y|X∧~Z), P(~Y|~X∧~Z), P(Y|~X∧~Z), P(Z)
   Answer: no; nothing specifies Y given ~X∧Z.

Given this dataset of 16 records: [data table not reproduced in the scan]

(b) Write down the probabilities needed to make a joint density Bayes classifier.
P(A∧B|C), P(A∧B|~C), P(A∧~B|C), P(A∧~B|~C), P(~A∧B|C), P(~A∧B|~C), P(~A∧~B|C), P(~A∧~B|~C), and P(C). [Handwritten numeric values illegible in the scan.]
(c) Write down the probabilities needed to make a naive Bayes classifier.
P(A|C), P(A|~C), P(B|C), P(B|~C), and P(C). [Handwritten numeric values illegible in the scan.]

(d) Write the classification that the joint density Bayes classifier would make for C given A=0, B=1.
P(C|~A∧B) = P(~A∧B|C)P(C) / [P(~A∧B|C)P(C) + P(~A∧B|~C)P(~C)] = 5/8, so it predicts C.

(e) Write the classification that the naive Bayes classifier would make for C given A=0, B=1.
P(C|~A∧B) ∝ P(~A|C)P(B|C)P(C), compared against P(~A|~C)P(B|~C)P(~C). [Handwritten computation largely illegible in the scan.]
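The contrast between parts (d) and (e) can be sketched in code. The exam's 16-record table did not survive the scan, so the dataset below is made up for illustration; rows are (A, B, C) triples.

```python
def joint_classify(data, a, b):
    """Joint-density Bayes classifier: argmax_c P(A=a, B=b | C=c) P(C=c)."""
    score = {}
    for c in (0, 1):
        rows = [r for r in data if r[2] == c]
        p_c = len(rows) / len(data)
        p_ab = sum(1 for r in rows if r[0] == a and r[1] == b) / len(rows)
        score[c] = p_ab * p_c
    return max(score, key=score.get)

def naive_classify(data, a, b):
    """Naive Bayes classifier: argmax_c P(A=a|c) P(B=b|c) P(c)."""
    score = {}
    for c in (0, 1):
        rows = [r for r in data if r[2] == c]
        p_c = len(rows) / len(data)
        p_a = sum(1 for r in rows if r[0] == a) / len(rows)
        p_b = sum(1 for r in rows if r[1] == b) / len(rows)
        score[c] = p_a * p_b * p_c
    return max(score, key=score.get)
```

On datasets where A and B are nearly independent given C (like the made-up one in the test), the two classifiers agree; when A and B are strongly correlated given C, the naive factorization can flip the decision.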
5 Support Vector Machines

This picture shows a dataset with two real-valued inputs (x1 and x2) and one categorical output class. The positive points are shown as solid dots and the negative points are small circles. [Figure not reproduced in the scan.]

(a) Suppose you are using a linear SVM with no provision for noise (i.e., a linear SVM that is trying to maximize its margin while ensuring all datapoints are on their correct sides of the margin). Draw three lines on the above diagram, showing the classification boundary and the two sides of the margin. Circle the support vector(s).

(b) Using the familiar LSVM classifier notation of class = sign(w·x + b), calculate the values of w and b learned for part (a).

(c) Assume you are using a noise-tolerant LSVM which tries to minimize
(1/2) w·w + C Σ_{k=1}^{R} ξ_k        (1)

using the notation of your notes and the Burges paper.

Question: is it possible to invent a dataset and a positive value of C in which (a) the dataset is linearly separable but (b) the LSVM would nevertheless misclassify at least one training point? If it is possible to invent such an example, please sketch the example and suggest a value for C. If it is not possible, explain why not.

6 Instance-based learning

This picture shows a dataset with one real-valued input x and one real-valued output y.
There are seven training points. [Figure: seven training points plotted over 0 ≤ x ≤ 6; not reproduced in the scan.]

Suppose you are training using kernel regression using some unspecified kernel function. The only thing you know about the kernel function is that it is a monotonically decreasing function of distance that decays to zero at a distance of 3 units (and is strictly greater than zero at a distance of less than 3 units).

(a) What is the predicted value of y when x = 2? [Handwritten answer illegible in the scan.]
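The setup above can be sketched directly. The triangular kernel below is one arbitrary choice satisfying the stated conditions (monotonically decreasing, strictly positive below 3 units, exactly zero at and beyond 3 units); the training points are placeholders, since the exam's figure did not survive the scan.

```python
def kernel(d, bandwidth=3.0):
    """Triangular kernel: positive for d < bandwidth, zero for d >= bandwidth."""
    return max(0.0, 1.0 - d / bandwidth)

def predict(train, x):
    """Nadaraya-Watson kernel regression over (x_i, y_i) pairs.

    Returns None when every weight is zero, i.e. when no training point lies
    within 3 units of x, in which case the prediction is undefined.
    """
    weighted = [(kernel(abs(x - xi)), yi) for xi, yi in train]
    total = sum(w for w, _ in weighted)
    if total == 0.0:
        return None
    return sum(w * yi for w, yi in weighted) / total
```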
(b) What is the predicted value of y when x = 3? [Handwritten answer partly legible: it notes the prediction is ambiguous where the kernel weights vanish.]

(c) What is the predicted value of y when x = 5? [Handwritten answer illegible in the scan.]

(d) What is the predicted value of y when x = 6? [Handwritten answer illegible in the scan.]

The final two parts of this question concern 1-nearest-neighbor used as a classifier. The following dataset has two real-valued inputs and one binary categorical output. The class is denoted by the color of the datapoint. [Figure not reproduced in the scan.]

(e) Does there exist a choice of Euclidean distance metric for which 1-nearest-neighbor would achieve zero training set error on the above dataset? Yes, e.g. a weighted Euclidean metric D = sqrt(a(x1 − x1′)² + b(x2 − x2′)²). [Remainder of the handwritten answer illegible in the scan.]

Now let's consider a different dataset:
[Figure: second two-class dataset, not reproduced in the scan.]

(f) Does there exist a choice of Euclidean distance metric for which 1-nearest-neighbor would achieve zero training set error on the above dataset? [Handwritten answer illegible in the scan.]

7 Nearest Neighbor and Cross-Validation

Recipe for making the training set of 10,000 datapoints with two real-valued inputs and one binary output class: 5000 points with positions chosen randomly uniformly in the left rectangle (25% have +ve class, 75% have -ve class) and 5000 points chosen randomly uniformly in the right rectangle (75% have +ve class, 25% have -ve class). No points in the gap between the rectangles.

Recipe for making the test set of 10,000 datapoints with two real-valued inputs and one binary output class: 5000 points chosen randomly uniformly in the left rectangle (none have +ve class, 100% have -ve class) and 5000 points chosen randomly uniformly in the right rectangle (100% have +ve class, none have -ve class). No points in the gap between the rectangles.

Using the above recipes for making training and test sets you will see that the training set is noisy: in either region, 25% of the data comes from the minority class. The test set is noise-free.

In each of the following questions, circle the answer that most closely defines the expected error rate, expressed as a fraction.

(a) What is the expected training set error using one-nearest-neighbor?
0 1/8 1/4 3/8 1/3 1/2 5/8 2/3 3/4 7/8 1

(b) What is the expected leave-one-out cross-validation error on the training set using one-nearest-neighbor?

0 1/8 1/4 3/8 1/3 1/2 5/8 2/3 3/4 7/8 1

(c) What is the expected test set error if we train on the training set, test on the test set, and use one-nearest-neighbor?

0 1/8 1/4 3/8 1/3 1/2 5/8 2/3 3/4 7/8 1
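A quick simulation consistent with the recipes illustrates parts (a) and (c). The rectangle coordinates below are made up (the exam's diagram fixes them only pictorially): 1-nearest-neighbor always achieves training error 0, since every training point is its own nearest neighbor, while its error on the noise-free test set tracks the 25% training label noise.

```python
import random

def make_set(n, noise):
    """n/2 points per rectangle; left is truly -ve, right truly +ve.

    With probability `noise`, a point receives the minority (wrong) label,
    mirroring the exam's 25%/75% recipe. Rectangle coordinates are arbitrary.
    """
    pts = []
    for _ in range(n // 2):
        x, y = random.uniform(0, 1), random.uniform(0, 1)   # left rectangle
        pts.append((x, y, -1 if random.random() > noise else +1))
        x, y = random.uniform(2, 3), random.uniform(0, 1)   # right rectangle
        pts.append((x, y, +1 if random.random() > noise else -1))
    return pts

def nn_label(train, p):
    """Label of the nearest training point (squared Euclidean distance)."""
    return min(train, key=lambda t: (t[0] - p[0]) ** 2 + (t[1] - p[1]) ** 2)[2]

random.seed(0)
train = make_set(500, noise=0.25)   # noisy, like the training recipe
test = make_set(500, noise=0.0)     # noise-free, like the test recipe

train_err = sum(nn_label(train, p) != p[2] for p in train) / len(train)
test_err = sum(nn_label(train, p) != p[2] for p in test) / len(test)
```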
(d) What is the expected training set error using 21-nearest-neighbor?

0 1/8 1/4 3/8 1/3 1/2 5/8 2/3 3/4 7/8 1

(e) What is the expected leave-one-out cross-validation error on the training set using 21-nearest-neighbor?

0 1/8 1/4 3/8 1/3 1/2 5/8 2/3 3/4 7/8 1

(f) What is the expected test set error if we train on the training set, test on the test set, and use 21-nearest-neighbor?

0 1/8 1/4 3/8 1/3 1/2 5/8 2/3 3/4 7/8 1

8 Learning Bayes Net Structure

For each of the following training sets, draw the structure and CPTs that a Bayes Net structure learner should learn, assuming that it tries to account for all the dependencies in the data as well as possible while minimizing the number of unnecessary links. In each case, your Bayes Net will have three nodes, called A, B, and C. Some or all of these questions have multiple correct answers; you need only supply one answer to each question.

[Training-set tables not reproduced in the scan.]

9 Markov Decision Processes

Consider the following MDP, assuming a discount factor of γ = 0.5. Note that the action "Party" carries an immediate reward of +10. The action "Study" unfortunately carries no immediate reward, except during the senior year, when a reward of +100 is provided upon transition to the terminal state "Employed".

[MDP diagram not reproduced in the scan.]

(a) What is the probability that a freshman will fail to graduate to the "Employed" state within four years, even if they study at every opportunity?

(b) Draw the diagram for the Markov Process (not the MDP, the MP) that corresponds to the policy "study whenever possible."

(c) What is the value associated with the state "Junior" under the "study whenever possible" policy?

(d) Exactly how rewarding would parties have to be during junior year in order to make it advisable for a junior to party rather than study (assuming, of course, that they wish to optimize their cumulative discounted reward)?

(e) Answer the following true or false. If true, give a one-sentence argument. If false, give a counterexample.

- (True or False?) If partying during junior year is an optimal action when it is assigned reward r, then it will also be an optimal action for a freshman when assigned reward r.

10 Q Learning

Consider the robot grid world shown below, in which actions have deterministic outcomes, and for which the discount factor γ = 0.5. The robot receives zero immediate reward upon executing its actions, except for the few actions where an immediate reward has been written in on the diagram. Note the state in the upper corner allows an action in which the robot remains in that same state for one time tick.

[Grid-world diagram not reproduced in the scan.]
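Part (e) below asks for a hand execution of deterministic Q-learning, whose update is Q(s, a) ← r + γ max_a' Q(s', a'). The sketch below replays the trajectory listed in part (e), using the corrected -100 reward for <C, South>; the state and action names come from the exam, while everything else is scaffolding of my own.

```python
# Deterministic Q-learning with gamma = 0.5; all Q values start at 0.
gamma = 0.5
Q = {}

def maxQ(s):
    """max over actions of Q(s, a); unvisited pairs are implicitly 0."""
    return max([v for (s2, _), v in Q.items() if s2 == s], default=0.0)

def update(s, a, s2, r):
    """Deterministic update: Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    Q[(s, a)] = r + gamma * maxQ(s2)

trajectory = [                      # (state, action, next state, reward)
    ('A', 'East', 'B', 0), ('B', 'East', 'C', 10), ('C', 'Loop', 'C', 0),
    ('C', 'South', 'F', -100), ('F', 'West', 'E', 0), ('E', 'North', 'B', 0),
    ('B', 'East', 'C', 10),
]
for s, a, s2, r in trajectory:
    update(s, a, s2, r)
```

The last transition shows the propagation at work: by the time <E, North> is updated, Q(B, East) is already 10, so Q(E, North) becomes 0 + 0.5 × 10 = 5.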
IMPORTANT: Notice the immediate reward for the state-action pair <C, South> is -100, not +100.

(a) Write in the Q value for each state-action pair, by writing it next to the corresponding arrow.

(b) Write in the V*(s) value for each state, by writing its value inside the grid cell representing that state.

(c) Write down an equation that relates the V*(s) for an arbitrary state s to the Q(s, a) values associated with the same state.

(d) Describe one optimal policy, by circling only the actions recommended by this policy.

(e) Hand execute the deterministic Q learning algorithm, assuming the robot follows the trajectory shown below. Show the sequence of Q estimates (describe which entry in the Q table is being updated at each step):

state  action  next-state  immediate-reward  updated-Q-estimates
A      East    B           0
B East C 10
C      Loop    C           0
C      South   F           -100
F      West    E           0
E      North   B           0
B      East    C           10

(f) Propose a change to the immediate reward function that results in a change to the Q function, but not to the V function.

11 Short Questions

(a) Describe the difference between a maximum likelihood hypothesis and a maximum a posteriori hypothesis. MLE: maximize the probability of the data given the hypothesis, P(D|h). MAP: maximize the probability of the hypothesis given the data, P(h|D), which additionally takes the prior over hypotheses into account.

(b) Consider a learning problem defined over a set of instances X. Assume the space of possible hypotheses, H, consists of all possible disjunctions over instances in X. I.e., the hypothesis x1 ∨ x6 labels these two instances positive, and no others. What is the VC dimension of H?

(c) Consider a naive Bayes classifier with 2 boolean input variables, X and Y, and one boolean output, Z.

- Draw the equivalent Bayesian network. [Drawing not reproduced in the scan.]
- How many parameters must be estimated to train such a naive Bayes classifier? 5: P(Z), P(X|Z), P(X|~Z), P(Y|Z), P(Y|~Z)
- How many parameters would have to be estimated if the naive Bayes assumption is not made, and we wish to learn the Bayes net for the joint distribution over X, Y, and Z? 7: P(Z) and P(X=i, Y=j | Z=k), with three free parameters per value of Z.
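The parameter counts in part (c) follow from simple combinatorics; a two-line sketch of the counting argument (variable names are my own):

```python
# Naive Bayes over X, Y with class Z: one parameter for P(Z=1), plus
# P(X=1|Z=z) and P(Y=1|Z=z) for each of the two values of z.
naive_params = 1 + 2 * 2

# Dropping the naive assumption, the full joint over three booleans has
# 2**3 outcomes whose probabilities sum to 1, leaving 2**3 - 1 free
# parameters (equivalently: P(Z) plus P(X=i, Y=j | Z=k)).
joint_params = 2 ** 3 - 1
```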
True or False? If true, explain why in at most two sentences. If false, explain why or give a brief counterexample.

- (True or False?) The error of a hypothesis measured over the training set provides a pessimistically biased estimate of the true error of the hypothesis.
- (True or False?) Boosting and the Weighted Majority algorithm are both methods for combining the votes of multiple classifiers.
- (True or False?) Unlabeled data can be used to detect overfitting.
- (True or False?) Gradient descent has the problem of sometimes falling into local minima, whereas EM does not.
- (True or False?) HMMs are a special case of MDPs.