final-2001-solutions

Last name (CAPITALS): ___________________________________

First name (CAPITALS): __________________________________

Andrew User ID (CAPITALS) (without the @andrew.cmu.edu bit): ___________________________

15-781 Final Exam, Fall 2001

- You must answer any nine questions out of the following twelve. Each question is worth 11 points.
- You must fill out your name and your andrew userid clearly and in block capital letters on the front page. You will be awarded 1 point for doing this correctly.
- If you answer more than 9 questions, your best 9 scores will be used to derive your total.
- Unless the question asks for explanation, no explanation is required for any answer. But you are welcome to provide explanation if you wish.

1 Bayes Nets Inference

(a) Kangaroos.

[Figure: two-node Bayes net K -> A with P(K) = 2/3, P(A|K) = 1/2, P(A|~K) = 1/10]

Half of all kangaroos in the zoo are angry, and 2/3 of the zoo is comprised of kangaroos. Only 1 in 10 of the other animals are angry. What's the probability that a randomly chosen animal is an angry kangaroo?

P(A ^ K) = P(K) P(A|K) = 2/3 * 1/2 = 1/3

(b) Stupidity.

[Figure: two-node Bayes net S -> C with P(S) = 0.5, P(C|S) = 0.5, P(C|~S) = 0.2]

Half of all people are stupid. If you're stupid then you're more likely to be confused. A randomly chosen person is confused. What's the chance they're stupid?

P(S|C) = P(S ^ C) / [P(S ^ C) + P(~S ^ C)] = 1/4 / (1/4 + 1/10) = 5/7

(c) Potatoes.

[Figure: three-node chain Bayes net B -> T -> L with P(B) = 1/2, P(T|B) = 1/2, P(T|~B) = 1/10, P(L|T) = 1/2, P(L|~T) = 1/10]

Half of all potatoes are big. A big potato is more likely to be tall. A tall potato is more likely to be lovable. What's the probability that a big lovable potato is tall?

P(T|B ^ L) = P(T ^ B ^ L) / [P(T ^ B ^ L) + P(~T ^ B ^ L)] = 1/8 / (1/8 + 1/40) = 5/6

(d) Final part.

[Figure: Bayes net over the variables S, H, W and F; the only legible entries are P(W|S) = 1/2 and P(W|~S) = 1]

What's P(W ^ F)?

P(W ^ F) = P(W ^ F ^ H) + P(W ^ F ^ ~H), but P(W ^ F ^ H) = 0
P(W ^ F ^ ~H) = P(F|~H) P(W ^ ~H)
              = 1/2 * (P(W ^ S ^ ~H) + P(W ^ ~S ^ ~H))
              = 1/2 * (1/8 + 1/4) = 3/16

2 Bayes Nets and HMMs

Let nbs(m) = the number of possible Bayes Network graph structures using m attributes. (Note that two networks with the same structure but different probabilities in their tables do not count as different structures.) Which of the following statements is true?

(i)   nbs(m) < m
(ii)  m <= nbs(m) < m(m-1)/2
(iii) m(m-1)/2 <= nbs(m) < 2^m
(iv)  2^m <= nbs(m) < 2^(m(m-1)/2)
(v)   2^(m(m-1)/2) <= nbs(m)

Answer is (v), because the number of undirected graphs with m vertices is 2^(m choose 2) = 2^(m(m-1)/2), and there are even more acyclic directed graphs.

Remember that I< X, Y, Z > means X is conditionally independent of Z given Y. Assuming the conventional assumptions and notation of Hidden Markov Models, in which q_t denotes the hidden state at time t and O_t denotes the observation at time t, which of the following are true of all HMMs? Write "True" or "False" next to each statement.

(i)   I< q_{t+1}, q_t, q_{t-1} >
(ii)  I< q_{t+2}, q_t, q_{t-1} >
(iii) I< q_{t+1}, q_t, q_{t-2} >
(iv)  I< O_{t+1}, O_t, O_{t-1} >
(v)   I< O_{t+2}, O_t, O_{t-1} >
(vi)  I< O_{t+1}, O_t, O_{t-2} >

(i), (ii), (iii): all TRUE. (iv), (v), (vi): all FALSE.
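The question 1 answers are easy to double-check by brute-force enumeration of the joint distribution. Here is a minimal sketch (not part of the exam; the function name and layout are mine) for the chain of part (c), using the probabilities stated in the question:

    # Sketch (not from the exam): brute-force check of question 1(c), the chain B -> T -> L.
    P_B = 0.5                       # P(Big)
    P_T = {True: 0.5, False: 0.1}   # P(Tall | Big), P(Tall | ~Big)
    P_L = {True: 0.5, False: 0.1}   # P(Lovable | Tall), P(Lovable | ~Tall)

    def joint(b, t, l):
        """P(B=b, T=t, L=l) under the chain factorization P(B) P(T|B) P(L|T)."""
        pb = P_B if b else 1 - P_B
        pt = P_T[b] if t else 1 - P_T[b]
        pl = P_L[t] if l else 1 - P_L[t]
        return pb * pt * pl

    # P(T | B, L) = P(T, B, L) / P(B, L)
    num = joint(True, True, True)
    den = sum(joint(True, t, True) for t in (True, False))
    print(num / den)   # 0.8333... = 5/6

The same pattern, the joint event of interest divided by the marginal of the evidence, reproduces the answers to parts (a) and (b) as well.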
3 Regression

(a) Consider the following data with one input and one output.

[Figure: scatter plot of the dataset, X (input) from 0 to 3 on the horizontal axis, Y (output) from 0 to 3 on the vertical axis]

(i) What is the mean squared training set error of running linear regression on this data (using the model y = w0 + w1*x)?

0

(ii) What is the mean squared test set error of running linear regression on this data, assuming the rightmost three points are in the test set, and the others are in the training set?

0

(iii) What is the mean squared leave-one-out cross-validation (LOOCV) error of running linear regression on this data?

0 (the data lies exactly on a line, so every held-out point is predicted exactly)

(b) Consider the following data with one input and one output.

    X   Y
    1   1
    2   2
    3   1

[Figure: scatter plot of the three points, X (input) from 0 to 3, Y (output) from 0 to 3]

(i) What is the mean squared training set error of running linear regression on this data (using the model y = w0 + w1*x)? (Hint: by symmetry it is clear that the best fit to the three datapoints is a horizontal line.)

SSE = (1/3)^2 + (2/3)^2 + (1/3)^2 = 6/9
MSE = SSE/3 = 2/9

(ii) What is the mean squared leave-one-out cross-validation (LOOCV) error of running linear regression on this data?

1/3 * (2^2 + 1^2 + 2^2) = 9/3 = 3

(c) Suppose we plan to do regression with the following basis functions:

[Figure: three triangular "hat" basis functions plotted against X (input) from 0 to 6]

    phi1(x) = 0       if x < 0
    phi1(x) = x       if 0 <= x < 1
    phi1(x) = 2 - x   if 1 <= x < 2
    phi1(x) = 0       if 2 <= x

    phi2(x) = 0       if x < 2
    phi2(x) = x - 2   if 2 <= x < 3
    phi2(x) = 4 - x   if 3 <= x < 4
    phi2(x) = 0       if 4 <= x

    phi3(x) = 0       if x < 4
    phi3(x) = x - 4   if 4 <= x < 5
    phi3(x) = 6 - x   if 5 <= x < 6
    phi3(x) = 0       if 6 <= x

Our regression will be y = beta1*phi1(x) + beta2*phi2(x) + beta3*phi3(x). Assume all our datapoints and future queries have 1 <= x <= 5. Is this a generally useful set of basis functions to use? If 'yes', then explain their prime advantage. If 'no', explain their biggest drawback.

No. They are forced to predict y = 0 at x = 2 and x = 4 (and forced to be close to zero nearby), no matter what the values of the betas are.
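The part (b) numbers can be verified mechanically. A minimal sketch (not part of the exam), assuming ordinary least squares fitted with numpy:

    # Sketch (not from the exam): check the question 3(b) answers on the points (1,1), (2,2), (3,1).
    import numpy as np

    X = np.array([1.0, 2.0, 3.0])
    Y = np.array([1.0, 2.0, 1.0])

    def fit_line(x, y):
        """Return (w0, w1) minimizing the sum of squared errors for y = w0 + w1*x."""
        A = np.column_stack([np.ones_like(x), x])
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        return w

    # Training MSE: fit on all three points, measure error on the same points.
    w = fit_line(X, Y)
    train_mse = np.mean((Y - (w[0] + w[1] * X)) ** 2)

    # LOOCV MSE: for each point, fit on the other two and predict the held-out one.
    errs = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        w = fit_line(X[mask], Y[mask])
        errs.append((Y[i] - (w[0] + w[1] * X[i])) ** 2)
    loocv_mse = np.mean(errs)

    print(train_mse, loocv_mse)

This prints roughly 0.222 and 3.0, matching the answers 2/9 and 9/3.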
4 Regression Trees

Regression trees are a kind of decision tree used for learning from data with a real-valued output instead of a categorical output. They were discussed in the "Eight favorite regression ..." lecture.

Below you will see pseudocode for building a regression tree in the special case where all the input attributes are boolean (they can have values 0 or 1).

The MakeTree function takes two arguments:

- D, a set of datapoints,
- and A, a set of input attributes.

It then makes the best regression tree it can using only the datapoints and attributes passed to it. It is a recursive procedure. The full algorithm is run by calling MakeTree with D containing every record and A containing every attribute.

Note that this code does no pruning, and that it assumes that all input attributes are binary-valued.

Now read the code below, after which question (a) will ask you about bugs in the code.

    MakeTree(D, A)   // Returns a Regression Tree

    1. For each attribute a in the set A do...
       1.1 Let D0 = { (xk, yk) in D such that xk[a] = 0 }
           // Comment: xk[a] denotes the value of attribute a in record xk
       1.2 Let D1 = { (xk, yk) in D such that xk[a] = 1 }
           // Comment: Note that D0 union D1 == D
           // Comment: Note too that D0 intersection D1 == empty
       1.3 mu0 = mean value of yk among records in D0
       1.4 mu1 = mean value of yk among records in D1
       1.5 SSE0 = sum over all records in D0 of (yk - mu0) squared
       1.6 SSE1 = sum over all records in D1 of (yk - mu0) squared
       1.7 Let Score[a] = SSE0 + SSE1
    2. // Once a score has been computed for each attribute, let...
       a* = argmax_a Score[a]
    3. Let D0 = { (xk, yk) in D such that xk[a*] = 0 }
    4. Let D1 = { (xk, yk) in D such that xk[a*] = 1 }
    5. Let LeftChild = MakeTree(D0, A - {a*})
       // Comment: A - {a*} means the set containing all elements of A except for a*
    6. Let RightChild = MakeTree(D1, A - {a*})
    7. Return a tree whose root tests the value of a*, and whose "a* = 0" branch is LeftChild and whose "a* = 1" branch is RightChild.

(a) Beyond the obvious problem that there is no pruning, there are three bugs in the above code. They are all very distinct. One of them is at the level of a typographical error. The other two are more serious errors in logic. Identify the three bugs (remembering that the lack of pruning is not one of the three bugs), explaining why each one is a bug. It is not necessary to explain how to fix any bug, though you are welcome to do so if that's the easiest way to explain the bug.

- Line 1.6 should use (yk - mu1) squared, not (yk - mu0) squared.
- Line 2 should use argmin, not argmax.
- The algorithm is missing the base case of the recursion.

(b) Why, in the recursive calls to MakeTree, is the second argument "A - {a*}" instead of simply "A"?

Because a* could never usefully be chosen again in any recursive call: within each child, every record has the same value of a*, so splitting on it again cannot reduce the error.

5 Clustering

In the left of the following two pictures I show a dataset. In the right figure I sketch the globally maximally likely mixture of three Gaussians for the given data.

- Assume we have protective code in place that prevents any degenerate solutions in which some Gaussian grows infinitesimally small.
- And assume a GMM model in which all parameters (class probabilities, class centroids and class covariances) can be varied.

[Figure: left panel shows the dataset (X (input) vs Y (output), both from 0 to 3); right panel shows the sketched globally maximally likely mixture of three Gaussians]

(a) Using the same notation and the same assumptions, sketch the globally maximally likely mixture of two Gaussians.

[Figure: the same dataset, with space for the sketched answer]

(b) Using the same notation and the same assumptions, sketch a mixture of three distinct Gaussians that is stuck in a suboptimal configuration (i.e. in which infinitely many more iterations of the EM algorithm would remain in essentially the same suboptimal configuration). (You must not give an answer in which two or more Gaussians all have the same mean vectors; we are looking for an answer in which all the Gaussians have distinct mean vectors.)

[Figure: the same dataset, with space for the sketched answer]

(c) Using the same notation and the same assumptions, sketch the globally maximally likely mixture of two Gaussians in the following, new, dataset.

[Figure: a new dataset (X (input) vs Y (output)), with space for the sketched answer]

(d) Now, suppose we ran k-means with k = 2 on this dataset. Show the rough locations of the centers of the two clusters in the configuration with globally minimal distortion.

[Figure: the same new dataset, with space for the sketched answer]
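For part (d), a practical way to locate the minimum-distortion configuration is to run Lloyd's k-means algorithm from several random restarts and keep the run with the lowest distortion. A minimal sketch (not part of the exam; the function names are mine, and the two-band dataset at the bottom is an invented stand-in, since the exam's dataset is only given as a figure):

    # Sketch (not from the exam): k-means with restarts, keeping the lowest-distortion run.
    import numpy as np

    def kmeans(X, k, n_iter=100, seed=None):
        """Lloyd's algorithm. Returns (centers, distortion = sum of squared distances)."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assign each point to its nearest center.
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Move each center to the mean of its assigned points (keep it if the cluster is empty).
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        distortion = ((X - centers[labels]) ** 2).sum()
        return centers, distortion

    # Example usage with a made-up stand-in dataset (two horizontal bands of points):
    X = np.vstack([np.column_stack([np.linspace(0, 3, 30), np.full(30, 0.5)]),
                   np.column_stack([np.linspace(0, 3, 30), np.full(30, 2.5)])])
    best_centers, best_distortion = min((kmeans(X, k=2, seed=s) for s in range(10)),
                                        key=lambda result: result[1])
    print(best_centers, best_distortion)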
6 Regression algorithms

For each empty box in the following table, write in 'Y' if the statement at the top of the column applies to the regression algorithm. Write 'N' if the statement does not apply.

Statement A (first column): No matter what the training data is, the predicted output is guaranteed to be a continuous function of the input (i.e. there are no discontinuities in the prediction). If a predictor gives continuous but undifferentiable predictions then you should answer "Y".

Statement B (second column): The cost of training on a dataset with R records is at least O(R^2): quadratic (or worse) in R. For iterative algorithms marked with (*), simply consider the cost of one iteration of the algorithm through the data.

                                                                            A    B
    Linear Regression                                                       Y    N
    Quadratic Regression                                                    Y    N
    Perceptrons with sigmoid activation functions (*)                       Y    N
    1-hidden-layer Neural Nets with sigmoid activation functions (*)        Y    N
    1-nearest neighbor                                                      N    N
    10-nearest neighbor                                                     N    N
    Kernel Regression                                                       Y    N
    Locally Weighted Regression                                             Y    N
    Radial Basis Function Regression with 100 Gaussian basis functions      Y    N
    Regression Trees                                                        N    N
    Cascade correlation (with sigmoid activation functions)                 Y    N
    Multilinear interpolation                                               Y    N
    MARS                                                                    Y    N

7 Hidden Markov Models

Warning: this is a question that will take a few minutes if you really understand HMMs, but could take hours if you don't.

Assume we are working with this HMM:

[Figure: three-state transition diagram, with the annotation "Start here with Prob. 1" pointing at state s_1]

    a_11 = 1/2   a_12 = 1/2   a_13 = 0      b_1(X) = 1/2   b_1(Y) = 1/2   b_1(Z) = 0     pi_1 = 1
    a_21 = 0     a_22 = 1/2   a_23 = 1/2    b_2(X) = 1/2   b_2(Y) = 0     b_2(Z) = 1/2   pi_2 = 0
    a_31 = 0     a_32 = 0     a_33 = 1      b_3(X) = 0     b_3(Y) = 1/2   b_3(Z) = 1/2   pi_3 = 0

where a_ij = P(q_{t+1} = s_j | q_t = s_i), b_i(V) = P(O_t = V | q_t = s_i), and pi_i = P(q_1 = s_i).

Suppose we have observed this sequence: X Z X Y Y Z Y Z Z (in long-hand: O_1 = X, O_2 = Z, O_3 = X, O_4 = Y, O_5 = Y, O_6 = Z, O_7 = Y, O_8 = Z, O_9 = Z).

Fill in this table with alpha_t(i) values, remembering the definition:

    alpha_t(i) = P(O_1 ^ O_2 ^ ... ^ O_t ^ q_t = s_i)

So for example, alpha_3(2) = P(O_1 = X ^ O_2 = Z ^ O_3 = X ^ q_3 = s_2).

    t    alpha_t(1)   alpha_t(2)   alpha_t(3)
    1    1/2          0            0
    2    0            1/8          0
    3    0            1/32         0
    4    0            0            1/128
    5    0            0            1/256
    6    0            0            1/512
    7    0            0            1/1024
    8    0            0            1/2048
    9    0            0            1/4096
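The table follows mechanically from the forward recursion alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(O_t), with alpha_1(i) = pi_i * b_i(O_1). A minimal sketch (not part of the exam) that reproduces the table in exact fractions, using the transition, emission and start probabilities stated above:

    # Sketch (not from the exam): forward-algorithm check of the alpha table in question 7.
    from fractions import Fraction as F

    # a[i][j] = P(q_{t+1} = s_{j+1} | q_t = s_{i+1}); states are 0-indexed here.
    a = [[F(1, 2), F(1, 2), F(0)],
         [F(0),    F(1, 2), F(1, 2)],
         [F(0),    F(0),    F(1)]]
    # b[i][o] = P(O_t = o | q_t = s_{i+1})
    b = [{'X': F(1, 2), 'Y': F(1, 2), 'Z': F(0)},
         {'X': F(1, 2), 'Y': F(0),    'Z': F(1, 2)},
         {'X': F(0),    'Y': F(1, 2), 'Z': F(1, 2)}]
    pi = [F(1), F(0), F(0)]                     # start in state 1 with probability 1

    obs = list("XZXYYZYZZ")

    # alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * b[i][obs[0]] for i in range(3)]
    print(1, alpha)
    # alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(O_t)
    for t, o in enumerate(obs[1:], start=2):
        alpha = [sum(alpha[i] * a[i][j] for i in range(3)) * b[j][o] for j in range(3)]
        print(t, alpha)
    # Nonzero values printed: 1/2, 1/8, 1/32, 1/128, 1/256, 1/512, 1/1024, 1/2048, 1/4096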
8 Locally Weighted Regression

Here's an argument made by a misguided practitioner of Locally Weighted Regression.

Suppose you have a dataset with R1 training points and another dataset with R2 test points. You must predict the output for each of the test points. If you use a kernel function that decays to zero beyond a certain kernel width, then Locally Weighted Regression is computationally cheaper than regular linear regression. This is because with locally weighted regression you must do the following for each query point in the test set:

- Find all the points that have non-zero weight for this particular query.
- Do a linear regression with them (after having weighted their contribution to the regression appropriately).
- Predict the value of the query.

whereas with regular linear regression you must do the following for each query point:

- Take all the training set datapoints.
- Do an unweighted linear regression with them.
- Predict the value of the query.

The locally weighted regression frequently finds itself doing regression on only a tiny fraction of the datapoints because most have zero weight. So most of the local method's queries are cheap to answer. In contrast, regular regression must use every single training point in every single prediction and so does at least as much work, and usually more.

This argument has a serious error. Even if it is true that the kernel function causes almost all points to have zero weight for each LWR query, the argument is wrong. What is the error?

Linear regression only needs to learn its weights (i.e. do the appropriate matrix inversion) once in total. LWR must do a separate matrix inversion for each test point.

9 Nearest neighbor and cross-validation

At some point during this question you may find it useful to use the fact that if U and V are two independent real-valued random variables then Var[aU + bV] = a^2 Var[U] + b^2 Var[V].

Suppose you have 10,000 datapoints {(x_k, y_k) : k = 1, 2, ..., 10000}. Your dataset has one input and one output. The kth datapoint is generated by the following recipe:

    x_k = k/10000
    y_k = Gaussian noise

So that y_k is all noise: drawn from a Gaussian with mean 0 and variance sigma^2 = 4 (and standard deviation sigma = 2). Note that its value is independent of all the other y values.

You are considering two learning algorithms:

- Algorithm NN: 1-nearest neighbor.
- Algorithm Zero: Always predict zero.

(a) What is the expected Mean Squared Training Error for Algorithm NN?

0

(b) What is the expected Mean Squared Training Error for Algorithm Zero?

4

(c) What is the expected Mean Squared Leave-one-out Cross-validation Error for Algorithm NN?

8 = E[(y_k - y_{k+1})^2]

(d) What is the expected Mean Squared Leave-one-out Cross-validation Error for Algorithm Zero?

4

10 Neural Nets

(a) Suppose we are learning a 1-hidden-layer neural net with a sign-function activation:

    Sign(z) = 1 if z >= 0
    Sign(z) = -1 if z < 0

[Figure: a 1-hidden-layer network. Inputs x1 and x2 feed hidden units h1 = Sign(w11 x1 + w21 x2) and h2 = Sign(w12 x1 + w22 x2); the output is Output = W1 h1 + W2 h2.]

We give it this training set, which represents the exclusive-or function if you interpret -1 as false and +1 as true:

    X1   X2    Y
     1    1   -1
     1   -1    1
    -1    1    1
    -1   -1   -1

On the diagram above you must write in six numbers: a set of weights that would give zero training error. (Note that constant terms are not being used anywhere, and note too that the output does not need to go through a Sign function.) Or, if it is impossible to find a satisfactory set of weights, just write "impossible".

Impossible

(b) You have a dataset with one real-valued input x and one real-valued output y in which you believe y_k = exp(w x_k) + epsilon_k, where (x_k, y_k) is the kth datapoint and epsilon_k is Gaussian noise. This is thus a neural net with just one weight: w. Give the update equation for a gradient descent approach to finding the value of w that minimizes the mean squared error.

With E(w) = sum_k (y_k - exp(w x_k))^2 we have dE/dw = -2 sum_k (y_k - exp(w x_k)) x_k exp(w x_k), so for a learning rate eta the update is

    w <- w + 2 eta sum_k (y_k - exp(w x_k)) x_k exp(w x_k)

11 Support Vector Machines

Consider the following dataset. We are going to learn a linear SVM from it, of the form f(x) = sign(wx + b).

    X   Class
    1   -1
    2   -1
    3   -1
    4    1
    5    1

[Figure: the five datapoints plotted on an axis from 0 to 5; open circles denote class -1, filled dots denote class 1]

(b) What is the training set error of the above example? (Expressed as the percentage of training points misclassified.)

(c) What is the leave-one-out cross-validation error of the above example? (Expressed as the percentage of left-out points misclassified.)

2 wrong => 40%

(d) True or False: Even with the clever SVM kernel trick, it is impossibly computationally expensive, even on a supercomputer, to do the following: given a dataset with 200 datapoints and 50 attributes, learn an SVM classifier with full 20th-degree-polynomial basis functions and then apply what you've learned to predict the classes of 1000 test datapoints.

FALSE

12 VC Dimension

(a) Suppose we have one input variable x and one output variable y. We are using the machine f1(x, alpha) = sign(x + alpha). What is the VC dimension of f1?

1

(b) Suppose we have one input variable x and one output variable y. We are using the machine f2(x, alpha) = sign(alpha*x + 1). What is the VC dimension of f2?

1

(c) Now assume our inputs are m-dimensional and we use the following two-level, two-choice decision tree to make our classification:

                          is x[A] < B?
                         /            \
                   if no                if yes
                    /                      \
            is x[C] < D?              is x[E] < F?
             /         \               /         \
         if no        if yes       if no        if yes
           |             |            |             |
        Predict       Predict      Predict       Predict
        Class G       Class H      Class I       Class J

where the machine has 10 parameters:

    A, C, E in {1, 2, ..., m}
    B, D, F real numbers
    G, H, I, J in {-1, 1}

What is the VC dimension of this machine?

4
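As a sanity check on question 11(d): thanks to the kernel trick, a degree-20 polynomial SVM never expands the basis functions explicitly; it only evaluates a kernel of the form (gamma * x.x' + c)^20 between pairs of points. A minimal sketch (not part of the exam) using scikit-learn with invented data and arbitrary random labels; the point is only that the computation completes in a fraction of a second on an ordinary machine:

    # Sketch (not from the exam): question 11(d) in practice with a degree-20 polynomial kernel.
    import time
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 50))
    y_train = rng.choice([-1, 1], size=200)      # arbitrary labels, just for timing
    X_test = rng.normal(size=(1000, 50))

    start = time.perf_counter()
    clf = SVC(kernel='poly', degree=20, coef0=1.0, gamma='scale')
    clf.fit(X_train, y_train)                    # 200 x 200 kernel matrix, cheap
    preds = clf.predict(X_test)                  # 1000 x (number of support vectors) kernel evaluations
    print(f"trained and predicted in {time.perf_counter() - start:.3f} s")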