h resulted in heads.

The likelihood of the data:

$$P(X \mid \theta) = p^{h} (1 - p)^{n - h}$$

Log likelihood:

$$\ln P(X \mid \theta) = h \ln p + (n - h) \ln(1 - p)$$

Taking a derivative and setting to 0:

$$\frac{\partial \ln P(X \mid \theta)}{\partial p} = \frac{h}{p} - \frac{n - h}{1 - p} = 0 \quad\Rightarrow\quad p = \frac{h}{n}$$

$$P(Y, X) = P(Y \mid X)\, P(X)$$

and:

$$P(Y, X) = P(X \mid Y)\, P(Y)$$

Therefore:

$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}$$

This is known as Bayes' rule.

Bayes' rule

In $P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}$, the term $P(Y \mid X)$ is the posterior, $P(X \mid Y)$ is the likelihood, and $P(Y)$ is the prior:

posterior ∝ likelihood × prior

P(X) can be computed as:

$$P(X) = \sum_{Y} P(X \mid Y)\, P(Y)$$

but it is not important for inferring a label, because it does not depend on Y.

Maximum a-posteriori and maximum likelihood

The maximum a posteriori (MAP) rule:

$$y_{MAP} = \arg\max_{Y} P(Y \mid X) = \arg\max_{Y} \frac{P(X \mid Y)\, P(Y)}{P(X)} = \arg\max_{Y} P(X \mid Y)\, P(Y)$$

If we ignore the prior distribution or assume it is uniform, we obtain the maximum likelihood rule:

$$y_{ML} = \arg\max_{Y} P(X \mid Y)$$

A classifier that has access to P(Y|X) is a Bayes optimal classifier.

10/29/13

Naïve Bayes classifier

We would like to model P(X | Y), where X is a feature vector, and Y is its associated label.

How many parameters?

Prior: P(Y): $k - 1$ if k classes

Likelihood: P(X | Y): $(2^{d} - 1)k$ for binary features

Learning the Optimal Classifier

Task: Predict whether or not a picnic spot is enjoyable.

Training data: n rows, each a feature vector $X = (X_1, X_2, X_3, \ldots, X_d)$ with a label $Y$.

Let's learn P(Y|X): how many parameters?

Prior: P(Y = y) for all y: $K - 1$ if K labels

Likelihood: P(X = x | Y = y) for all x, y: $(2^{d} - 1)K$ if d binary features

Naïve Bayes classifier

Simplifying assumption: conditional independence. Given the class label, the features are independent, i.e.

$$P(X \mid Y) = P(x_1 \mid Y)\, P(x_2 \mid Y) \cdots P(x_d \mid Y)$$

How many parameters now?

Naïve Bayes classifier Naïve Bay...
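The quantities on these slides (the coin-toss estimate p = h/n, Bayes' rule, the MAP and ML rules, and the parameter counts) can be tried out numerically. The following is a minimal Python sketch, not part of the original slides: the function names, the two picnic labels, and all probability values are hypothetical and chosen only for illustration. The count `d * k` for the naïve Bayes likelihood assumes binary features, one Bernoulli parameter per feature per class.

```python
from math import log


def bernoulli_mle(h, n):
    """Maximum likelihood estimate p = h / n for h heads in n tosses."""
    return h / n


def log_likelihood(p, h, n):
    """ln P(X | theta) = h ln p + (n - h) ln(1 - p), for 0 < p < 1."""
    return h * log(p) + (n - h) * log(1 - p)


def posterior(likelihood, prior):
    """Bayes' rule over a discrete label set; both arguments are dicts keyed by label."""
    joint = {y: likelihood[y] * prior[y] for y in prior}
    evidence = sum(joint.values())  # P(X) = sum over Y of P(X | Y) P(Y)
    return {y: joint[y] / evidence for y in joint}


def map_label(likelihood, prior):
    """arg max over Y of P(X | Y) P(Y); P(X) is dropped because it does not depend on Y."""
    return max(prior, key=lambda y: likelihood[y] * prior[y])


def ml_label(likelihood):
    """arg max over Y of P(X | Y); equals the MAP label when the prior is uniform."""
    return max(likelihood, key=likelihood.get)


def full_likelihood_params(d, k):
    """Parameters needed for an unrestricted P(X | Y) with d binary features and k classes."""
    return (2 ** d - 1) * k


def naive_bayes_likelihood_params(d, k):
    """Parameters for P(X | Y) under conditional independence (binary features assumed)."""
    return d * k


# Hypothetical numbers for one observed feature vector x.
likelihood = {"enjoyable": 0.6, "not_enjoyable": 0.3}  # P(x | Y)
prior = {"enjoyable": 0.2, "not_enjoyable": 0.8}       # P(Y)

p_hat = bernoulli_mle(7, 10)
post = posterior(likelihood, prior)
y_map = map_label(likelihood, prior)
y_ml = ml_label(likelihood)
```

With these made-up numbers the two rules disagree: the likelihood alone favours "enjoyable", while the strong prior tips the MAP decision to "not_enjoyable". For d = 10 binary features and k = 2 classes, the unrestricted likelihood needs 2046 parameters, whereas the conditional independence assumption reduces this to 20.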
