The rpart programs build classification or regression models of a very general
structure using a two stage procedure; the resulting models can be represented as
binary trees. An example is some preliminary data gathered at Stanford on revival
of cardiac arrest patients by paramedics. The goal is to predict which patients
are revivable in the field on the basis of fourteen variables known at or near the
time of paramedic arrival, e.g., sex, age, time from arrest to first care, etc. Since
some patients who are not revived on site are later successfully resuscitated at the
hospital, early identification of these "recalcitrant" cases is of considerable clinical
interest.
The resultant model separated the patients into four groups as shown in figure
1, where
X1 = initial heart rhythm
     1=VF/VT, 2=EMD, 3=Asystole, 4=Other
X2 = initial response to defibrillation
     1=Improved, 2=No change, 3=Worse
X3 = initial response to drugs
     1=Improved, 2=No change, 3=Worse
The other 11 variables did not appear in the final model. This procedure seems
to work especially well for variables such as X1, where there is a definite ordering,
but spacings are not necessarily equal.
The tree is built by the following process: first the single variable is found which
best splits the data into two groups ("best" will be defined later). The data is
separated, and then this process is applied separately to each subgroup, and so on
recursively until the subgroups either reach a minimum size (5 for this data) or until
no improvement can be made.
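
The growing stage described above can be sketched in a few lines of Python. This is a minimal illustration, not the rpart implementation: the text defers its definition of "best", so the sketch assumes Gini impurity as a stand-in criterion, and it reads the minimum-size rule as "a group with fewer than min_size observations is not split further". The names gini, best_split, grow and min_size, and the toy data, are all hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (assumed stand-in for 'best')."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, variable index, cut point) of the best binary split, or None."""
    n = len(rows)
    parent = gini(labels)
    best = None
    for j in range(len(rows[0])):
        # Every observed value except the largest is a candidate cut point.
        for t in sorted({r[j] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if gain > 1e-12 and (best is None or gain > best[0]):
                best = (gain, j, t)
    return best


def grow(rows, labels, min_size=5):
    """Split recursively until a group is too small or no split improves."""
    majority = Counter(labels).most_common(1)[0][0]
    split = best_split(rows, labels) if len(rows) >= min_size else None
    if split is None:
        return {"leaf": True, "class": majority, "n": len(rows)}
    _, j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {
        "leaf": False,
        "feature": j,
        "threshold": t,
        "left": grow([r for r, _ in left], [y for _, y in left], min_size),
        "right": grow([r for r, _ in right], [y for _, y in right], min_size),
    }


# Toy data (invented for illustration): one predictor, two classes.
rows = [[v] for v in range(1, 11)]
labels = ["a"] * 5 + ["b"] * 5
tree = grow(rows, labels)
```

On the toy data the single predictor separates the two classes exactly, so the recursion stops after one split because no further split improves either subgroup. A tree grown this way on real data is what the second stage then trims back.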
The resultant model is, with certainty, too complex, and the question arises, as it
does with all stepwise procedures, of when to stop. The second stage of the procedure
consists of using cross-validation to trim back the full tree. In the medical example
above the full tree had ten terminal regions. A cross-validated estimate of risk was
computed for a nested set of subtrees; this final model, presented in figure 1, is the
subtree with the lowest estimate of risk.

2 Notation
The partitioning method can be applied to many different kinds of data. We will
start by looking at the classification problem, which is one of the more instructive
cases (but also has the most complex equations). The sample population consists
of n observations from C classes. A given model will break these observations into
k terminal groups; to each of these groups is assigned a predicted class (this will be
the response variable). In an actual application, most parameters will be estimated
from the data; such estimates are given by formulae.

$\pi_i$, $i = 1, 2, \ldots, C$: Prior probabilities of each class.
$L(i, j)$, $i = 1, 2, \ldots, C$: Loss matrix for incorrectly classifying
    an i as a j. $L(i, i) \equiv 0$.
$A$: Some node of the tree.
    Note that A represents both a set of individuals in
    the sample data, and, via the tree that produced it,
    a classification rule for future data.
$\tau(x)$: True class of an observation x, where x is the
    vector of predictor variables.
$\tau(A)$: The class assigned to A, if A were to be taken a...