DATA MINING
Susan Holmes
Stats 202, Lecture 13, Fall 2010

Special Announcements

- All other requests should be sent to [email protected].
- Homework: the deadline is today at 5:00 pm; any homework not submitted by the deadline is rejected (we have an automatic system). Please don't forget to add your SUNet ID to your homework file name (at the end).
- The next homework is up.
Last Time: Decision Trees and Classification Examples

- Two sets of data: training and test.
- The response Y is a nominal/categorical variable.
- Explanatory variables can be continuous, nominal, or ordinal.
- Indices of purity: Gini, entropy (deviance), and misclassification.
Binary recursive partitioning

In binary recursive partitioning the goal is to partition the predictor space into boxes and then assign a value to each box, based on the values of the response variable for the observations assigned to that box.

At each step of the partitioning process we must choose a specific variable and a split point for that variable, which we then use to divide all or a portion of the data set into two groups. This is done by selecting a group to divide and then examining all possible variables and all possible split points of those variables. Having selected the combination of group, variable, and split point that yields the greatest improvement in the fit criterion we are using, we then divide that group into two parts. The usual fit criterion for a classification tree is an impurity index.
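The exhaustive split search described above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: `gini` and `best_split` are hypothetical helper names, the Gini index stands in for the fit criterion, and only numeric predictors are handled.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Exhaustive search over all variables and all split points.

    X is a list of rows (each a list of numeric predictors), y the class
    labels.  Returns the (variable index, split value) pair that minimizes
    the weighted impurity of the two resulting groups.
    """
    n = len(y)
    best = (None, None, float("inf"))
    for j in range(len(X[0])):                   # every variable
        for s in sorted({row[j] for row in X}):  # every split point
            left = [y[i] for i in range(n) if X[i][j] <= s]
            right = [y[i] for i in range(n) if X[i][j] > s]
            if not left or not right:
                continue
            # weighted impurity of the proposed two-group partition
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, s, score)
    return best[:2]

# toy data: the single variable separates the classes at x <= 2.0
X = [[1.0], [2.0], [3.0], [4.0]]
y = ["a", "a", "b", "b"]
print(best_split(X, y))  # (0, 2.0): that split gives two pure children
```

In a full tree-growing routine this search would be repeated on each resulting group until a stopping rule fires.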
Impurity

For categorical variables there are a number of different ways of calculating impurity. Let y be a categorical variable with m categories, and at node t let

  n_tk = number of observations of type k at node t
  p_tk = proportion of observations of type k at node t.

The following four measures of impurity are commonly used:

1. Deviance: D_t = -2 Σ_k n_tk log p_tk
2. Entropy: D_t = -Σ_k p_tk log₂ p_tk
3. Gini index: D_t = 1 - Σ_k p_tk²
4. Misclassification error: D_t = 1 - p_{t k(t)}, where k(t) is the category at node t with the largest number of observations.
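The four measures can be computed directly from the class counts n_tk and proportions p_tk at a node. A small sketch (the function name `impurity` is mine, not from the slides; the deviance uses natural logs as in its definition):

```python
import math
from collections import Counter

def impurity(labels, measure="gini"):
    """The four node-impurity measures, computed from the class
    counts n_tk and proportions p_tk at a single node t."""
    n = len(labels)
    counts = Counter(labels)                        # n_tk
    props = {k: c / n for k, c in counts.items()}   # p_tk
    if measure == "deviance":           # -2 sum_k n_tk log p_tk
        return -2 * sum(counts[k] * math.log(props[k]) for k in counts)
    if measure == "entropy":            # -sum_k p_tk log2 p_tk
        return -sum(p * math.log2(p) for p in props.values())
    if measure == "gini":               # 1 - sum_k p_tk^2
        return 1 - sum(p ** 2 for p in props.values())
    if measure == "misclassification":  # 1 - p_{t k(t)}
        return 1 - max(props.values())
    raise ValueError(measure)

node = ["a", "a", "a", "b"]  # n_t = 4, p_a = 0.75, p_b = 0.25
print(impurity(node, "gini"))               # 0.375
print(impurity(node, "misclassification"))  # 0.25
```

All four are zero for a pure node and largest when the classes are evenly mixed, which is what makes them usable as split criteria.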
Deviance

The deviance supposes a probability model in which, at node t of the tree, the probability distribution of the classes is p_tk. Each case is eventually assigned to a leaf, so at each leaf we have a random sample of counts n_tk from the multinomial distribution with probabilities p_tk. We condition on the observed explanatory variables x_i in the training set, and hence we know the numbers n_t assigned to every node of the tree, in particular to the leaves. The conditional likelihood is then proportional to

  Π_{leaves t} Π_{classes k} p_tk^{n_tk}.

The deviance (-2 × log-likelihood, shifted to zero for the perfect model) at a leaf is

  D_t = -2 Σ_k n_tk log p_tk = 2 n_t × (entropy of the leaf, in natural logarithms, with p_tk estimated by n_tk / n_t),

and we sum over all the leaves to get the tree's total deviance: D = Σ_t D_t.
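The identity D_t = 2 n_t × entropy (natural logs, plug-in estimates p_tk = n_tk / n_t) is easy to check numerically; the helper names below are mine:

```python
import math
from collections import Counter

def leaf_deviance(labels):
    """D_t = -2 sum_k n_tk log p_tk for one leaf, with the
    plug-in estimate p_tk = n_tk / n_t (natural logarithms)."""
    n = len(labels)
    return -2 * sum(c * math.log(c / n) for c in Counter(labels).values())

def nat_entropy(labels):
    """Natural-log entropy -sum_k p_tk log p_tk of one leaf."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

leaf = ["a"] * 3 + ["b"]                  # n_t = 4
print(leaf_deviance(leaf))                # ~4.4987
print(2 * len(leaf) * nat_entropy(leaf))  # same value: D_t = 2 n_t * entropy

# the tree's total deviance is the sum over its leaves
leaves = [["a", "a", "b"], ["b", "b"]]
print(sum(leaf_deviance(l) for l in leaves))  # pure leaf contributes 0
```

A pure leaf has deviance zero, matching the "shifted to zero for the perfect model" convention.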
Stopping Rules
This note was uploaded on 07/29/2011 for the course STAT 202 at Stanford.
