CS221 Lecture notes

Decision trees

Last time, we discussed two supervised learning algorithms: linear regression and logistic regression. These algorithms worked well when our inputs x were continuous. Now we will discuss another classification algorithm, called decision trees, which is more appropriate for discrete inputs.

For instance, suppose we are trying to predict whether our roommate will eat a certain food. Our features are as follows:

1. HUNGER: starving, hungry, caneat, or full
2. LIKE: yes/no
3. HEALTHY: yes/no
4. PRICE: free, cheap, expensive

Our target variable is y ∈ {0, 1}, which equals 1 if our roommate eats the food and 0 otherwise.

How would we solve this problem using logistic regression? We might define a feature vector which has a {0, 1} entry for each possible value of each feature. (In other words, if there are four features, each taking on three possible values, our inputs will be of length 12.) If feature i takes the value v_i, we assign a 1 to the element of the input x corresponding to that value, and a 0 to the elements for that feature's other values. For an example where x^(i) = (hungry, yes, yes, cheap), our 11-dimensional feature vector might be x = [0 1 0 0 1 0 1 0 0 1 0].

Figure 1: An example decision tree for deciding whether our roommate will eat a given food.

You can imagine this process producing enormous feature vectors if we use it in domains with large numbers of variables.[1] Instead, we can use a decision tree classifier. A decision tree for this domain might look like the one in Figure 1. Suppose we're given the same example x^(i) as above. We begin at the root node, which is labeled "Hunger." Because our roommate is hungry, we descend the branch labeled "hungry" to get to the node labeled "Like." Since our roommate likes the food, we descend the "Y" branch. Now we've arrived at a leaf node, which happens to give the answer "Yes." Hence, we conclude that our roommate will eat the food.
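To make the two representations concrete, here is a minimal Python sketch of the indicator encoding and of the tree walk described above. The names `FEATURES`, `one_hot`, `TREE`, and `predict` are illustrative assumptions, not part of the notes. Figure 1 is not reproduced in this text, so only the path Hunger = hungry, Like = yes, leaf "Yes" is taken from the notes; every other branch of `TREE` is a made-up placeholder.

```python
# Feature values, in the order the notes list them.
FEATURES = {
    "HUNGER": ["starving", "hungry", "caneat", "full"],
    "LIKE": ["yes", "no"],
    "HEALTHY": ["yes", "no"],
    "PRICE": ["free", "cheap", "expensive"],
}


def one_hot(example):
    """Build one 0/1 entry per (feature, value) pair: 4 + 2 + 2 + 3 = 11."""
    x = []
    for name, values in FEATURES.items():
        x.extend(1 if example[name] == v else 0 for v in values)
    return x


# Internal node: (feature_name, {value: subtree}). Leaf: "yes" or "no".
# ASSUMPTION: only HUNGER=hungry -> LIKE=yes -> "yes" comes from the notes.
# All other branches are hypothetical placeholders, not the real Figure 1.
TREE = ("HUNGER", {
    "starving": "yes",
    "hungry": ("LIKE", {"yes": "yes", "no": "no"}),
    "caneat": ("LIKE", {"yes": "yes", "no": "no"}),
    "full": "no",
})


def predict(tree, example):
    """Start at the root and follow the edge matching each feature's value."""
    node = tree
    while not isinstance(node, str):  # stop once we reach a leaf
        feature, children = node
        node = children[example[feature]]
    return node


example = {"HUNGER": "hungry", "LIKE": "yes", "HEALTHY": "yes", "PRICE": "cheap"}
print(one_hot(example))        # [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
print(predict(TREE, example))  # yes
```

Note that the tree walk reads only the features on the path it follows (here HUNGER and LIKE), whereas the indicator vector always carries one entry for every value of every feature.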
In general, the internal nodes of the tree will correspond to features, the edges will correspond to different values of the feature, and the leaves will correspond to yes/no predictions. Or, rather than a simple yes or no, we might associate with each leaf ℓ a probability p_ℓ, which is the probability that our roommate will eat the food in a situation associated with ℓ.

[1] You may have observed that we can do slightly better by using only a single component of the inputs to represent binary features, but this won't solve the basic problem of large feature vectors.

1 Decision tree learning

Now, we discuss how to learn a decision tree from data. Suppose we take careful notes on our roommate's eating habits and come up with the training data shown in Table 1. We define a scoring function ℓ as follows....
This note was uploaded on 11/30/2009 for the course CS 221 taught by Professors Koller and Ng during the Winter '09 term at Stanford.