Logistic Regression and Decision Trees
CS 221 Section 4
October 16, 2009

Today we will derive the gradient descent update rule for logistic regression using maximum likelihood and also go over an example of creating decision trees.

1 Maximum likelihood

Maximum likelihood is a general parameter estimation method. The intuition behind maximum likelihood is that we want to choose a hypothesis which makes the data as probable as possible. This will require us to make assumptions about the way that our data is generated. In general we will define probabilistic models which will describe our data generation.

1.1 Example

Suppose we are given the task of predicting the probability that a future tossed thumbtack will land with the pointy side up. To aid us in this task we are given a dataset which contains the results of a set of tosses of the thumbtack in question. How should we proceed?

Let’s model the thumbtack flip in the following way: as a Bernoulli random variable, where the probability that the thumbtack lands point up is $\theta$ and the probability that it lands point down is $1 - \theta$. Let’s also assume that each toss was independent, with result drawn from the same Bernoulli distribution.

Let’s say that our data $D$ contains 8 examples where the thumbtack landed point up, and 2 where it landed point down. We can now talk about the probability of this data, assuming the model parameter $\theta$. This probability is:

$$p(D; \theta) = \theta^8 (1 - \theta)^2$$

We call this probability the likelihood. Our task is to choose the parameter $\theta$ that we feel best describes the probability that the thumbtack lands point up, and we have a tool to tell us the likelihood of any $\theta$ that we pick. Which $\theta$ should we pick?

The principle of maximum likelihood says that we should choose $\theta$ so as to make the probability of the data as high as possible. I.e. we should choose the value for $\theta$ that maximizes the likelihood. So, what would this be for our example?
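To get a feel for how the likelihood varies with the parameter, it can help to evaluate it at a few candidate values before solving anything analytically. The following is a minimal sketch, not part of the original notes: the function name `likelihood`, its argument names, and the list of candidate values are all hypothetical choices made for illustration.

```python
def likelihood(theta, n_up=8, n_down=2):
    """Likelihood p(D; theta) for n_up point-up and n_down point-down tosses.

    Assumes independent Bernoulli tosses, as in the model above.
    """
    return theta ** n_up * (1 - theta) ** n_down


# Hypothetical candidate values, chosen only for illustration.
candidates = [0.5, 0.6, 0.7, 0.8, 0.9]
values = {theta: likelihood(theta) for theta in candidates}

for theta, value in values.items():
    print(f"theta = {theta:.1f}  likelihood = {value:.6f}")

# The candidate under which the observed data is most probable.
best = max(values, key=values.get)
print("best candidate:", best)
```

Comparing only a handful of candidates is crude, since the true maximizer need not be among them; it only shows that different values of the parameter make the same data more or less probable.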
We need to solve

$$\theta = \arg\max_{\theta} \; \theta^8 (1 - \theta)^2$$

In general it is awkward to take derivatives of products like this. Instead, we can maximize the log likelihood $\log p(D; \theta)$. This will give us the same answer as maximizing the likelihood because the logarithm is a monotonically increasing function. We can find the maximum likelihood estimate in our example by setting $\frac{\partial}{\partial\theta} \log p(D; \theta)$ to zero. First we compute the derivative:

$$
\begin{aligned}
\frac{\partial}{\partial\theta} \log p(D; \theta)
  &= \frac{\partial}{\partial\theta} \left( \log\left( \theta^8 (1 - \theta)^2 \right) \right) \\
  &= \frac{\partial}{\partial\theta} \left( \log \theta^8 + \log (1 - \theta)^2 \right) \\
  &= \frac{\partial}{\partial\theta} \left( 8 \log \theta + 2 \log (1 - \theta) \right) \\
  &= \frac{8}{\theta} - \frac{2}{1 - \theta}
\end{aligned}
$$

Setting this to zero and solving for $\theta$:

$$
\begin{aligned}
\frac{2}{1 - \theta} &= \frac{8}{\theta} \\
2\theta &= 8 - 8\theta \\
10\theta &= 8 \\
\theta &= 0.8
\end{aligned}
$$

Thus, 0.8 is our maximum likelihood estimate for the parameter $\theta$. In other words, the data we saw is most likely if $\theta = 0.8$. Note that this matches the intuition of the situation. If someone had asked you what the probability of a thumbtack landing point up was, and also told you that it landed point up 8 out of 10 times previously, you might have answered 80% as your best guess.
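The result can be checked numerically. The sketch below is an illustration rather than part of the original notes: the function names, the grid resolution of 0.001, and the use of a brute-force grid search are all assumptions made here. It also uses the closed form `n_up / (n_up + n_down)`, which the notes do not state but which follows from repeating the same derivation with general counts.

```python
import math


def log_likelihood(theta, n_up=8, n_down=2):
    """Log likelihood: n_up * log(theta) + n_down * log(1 - theta)."""
    return n_up * math.log(theta) + n_down * math.log(1 - theta)


def log_likelihood_derivative(theta, n_up=8, n_down=2):
    """Derivative of the log likelihood: n_up / theta - n_down / (1 - theta)."""
    return n_up / theta - n_down / (1 - theta)


# Brute-force search over a grid in the open interval (0, 1).
# The endpoints are excluded because log(0) is undefined.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=log_likelihood)

# Closed form from the same derivation with general counts (an assumption
# added here, not a formula given in the notes).
theta_closed_form = 8 / (8 + 2)

print("grid search estimate:", theta_grid)
print("closed-form estimate:", theta_closed_form)
print("derivative at closed form:", log_likelihood_derivative(theta_closed_form))
```

The derivative printed at the end should be zero up to floating-point rounding, which is the stationarity condition used in the derivation. A grid search is only practical here because there is a single parameter.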
This note was uploaded on 12/15/2009 for the course CS 221 at Stanford.