DATA MINING
Susan Holmes ©
Stats202
Lecture 13
Fall 2010
Special Announcements
- All other requests should be sent to [email protected].
- Homework: the deadline is today at 5:00pm; homework not submitted by the deadline is rejected (we have an automatic system). Please don't forget to add your SUNet ID to your homework file name (at the end).
- The next homework is up.
Last Time: Decision Trees and Classification Examples
- Two sets of data: training and test.
- The response Y is a nominal/categorical variable.
- Explanatory variables can be continuous AND nominal AND ordinal.
- Indices of purity: Gini, entropy (deviance), and misclassification.
Binary recursive partitioning
In binary recursive partitioning the goal is to partition the
predictor space into boxes and then assign a value to each
box, based on the values of the response variable for the
observations falling in that box.
At each step of the partitioning process we must choose a
specific variable and a split point for that variable, which we
then use to divide all or a portion of the data set into two
groups. This is done by selecting a group to divide and then
examining all possible variables and all possible split points
of those variables.
Having selected the combination of group, variable, and split
point that yields the greatest improvement in the fit criterion
we are using, we then divide that group into two parts.
The usual fit criterion for a classification tree is an impurity
index.
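As a concrete illustration, here is a minimal sketch of the split search just described. The code is not from the lecture: the names (best_split, X, y) are illustrative, and the Gini index defined on the next slide stands in as the fit criterion.

```python
# A minimal sketch of one step of binary recursive partitioning:
# scan every variable and every split point, keep the pair that
# most improves the (Gini) impurity of the chosen group.
import numpy as np

def gini(y):
    """Gini impurity 1 - sum_k p_k^2 for the labels in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (variable, split point, improvement) giving the largest
    impurity improvement over leaving the group unsplit."""
    n, d = X.shape
    parent = gini(y)
    best = (None, None, 0.0)
    for j in range(d):
        for s in np.unique(X[:, j])[:-1]:   # candidate split points
            left = X[:, j] <= s
            right = ~left
            # impurity of the two children, weighted by group size
            child = (left.sum() * gini(y[left])
                     + right.sum() * gini(y[right])) / n
            if parent - child > best[2]:
                best = (j, s, parent - child)
    return best
```

In a full tree-growing procedure this search would be repeated on each resulting group until a stopping rule fires.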
Impurity
For categorical variables there are a number of different
ways of calculating impurity. Let y be a categorical variable
with m categories, and define

n_tk = number of observations of type k at node t
p_tk = proportion of observations of type k at node t
The following four measures of impurity are commonly used.
1. Deviance: deviance_t = -2 ∑_k n_tk log p_tk
2. Entropy: entropy_t = -∑_k p_tk log₂ p_tk
3. Gini index: gini_t = 1 - ∑_k p_tk²
4. Misclassification error: misclass_t = 1 - p_{t,k(t)}, where k(t) is the category at node t with the largest number of observations.
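As an illustration, the following sketch (hypothetical Python; the function name and example counts are made up) computes all four measures for a single node from its class counts n_tk, with the logarithm bases as on this slide: natural log for the deviance, base 2 for the entropy.

```python
# Compute the four impurity measures for one node from its class counts.
import numpy as np

def impurities(counts):
    n_tk = np.asarray(counts, dtype=float)
    p_tk = n_tk / n_tk.sum()              # p_tk = n_tk / n_t
    nz = p_tk > 0                         # treat 0 * log(0) as 0
    deviance = -2 * np.sum(n_tk[nz] * np.log(p_tk[nz]))
    entropy = -np.sum(p_tk[nz] * np.log2(p_tk[nz]))
    gini = 1 - np.sum(p_tk ** 2)
    misclass = 1 - p_tk.max()             # 1 - p_{t,k(t)}
    return deviance, entropy, gini, misclass

# A pure node has zero impurity under all four measures:
print(impurities([20, 0, 0]))             # (0.0, 0.0, 0.0, 0.0)
print(impurities([10, 5, 5]))             # mixed node: all four positive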
Deviance: Suppose a probability model in which, at node t of
a tree, the probability distribution of the classes is p_tk.
Each case is eventually assigned to a leaf, so at each leaf we
have a random sample n_tk from the multinomial distribution
with probabilities p_tk.
Conditioning on the observed variables x_i in the training set,
we know the numbers n_t assigned to every node of the tree,
in particular to the leaves.
The conditional likelihood is then proportional to

∏_{leaves t} ∏_{classes k} p_tk^{n_tk}

The deviance (-2 times the log-likelihood, shifted to zero for
the perfect model) is

D_t = -2 ∑_k n_tk log p_tk = 2 n_t × entropy_t

for each leaf (with the entropy computed using the natural
logarithm), and we sum it over all the leaves to get the
tree's total deviance:

D = ∑_t D_t
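As a quick numerical sanity check (hypothetical Python with made-up leaf counts), the per-leaf identity D_t = 2 n_t × entropy_t can be verified directly, and the leaf deviances summed into the tree total:

```python
# Verify D_t = -2 sum_k n_tk log p_tk = 2 * n_t * entropy_t per leaf,
# then accumulate the tree's total deviance D = sum_t D_t.
import numpy as np

leaves = [np.array([40.0, 5.0]),          # class counts n_tk at each leaf
          np.array([3.0, 22.0]),
          np.array([10.0, 10.0])]

total_deviance = 0.0
for n_tk in leaves:
    n_t = n_tk.sum()
    p_tk = n_tk / n_t                     # multinomial probabilities
    nz = n_tk > 0                         # 0 * log(0) taken as 0
    D_t = -2 * np.sum(n_tk[nz] * np.log(p_tk[nz]))
    entropy_t = -np.sum(p_tk[nz] * np.log(p_tk[nz]))  # natural log
    assert np.isclose(D_t, 2 * n_t * entropy_t)
    total_deviance += D_t                 # D = sum over leaves

print(total_deviance)
```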
Stopping Rules