... in a tree with no splits. For regression models (see next section) the scaled cp has a very direct interpretation: if any split does not increase the overall R^2 of the model by at least cp (where R^2 is the usual linear-models definition) then that split is decreed to be, a priori, not worth pursuing. The program does not split said branch any further, and saves considerable computational effort. The default value of .01 has been reasonably successful at `pre-pruning' trees so that the cross-validation step need only remove 1 or 2 layers, but it sometimes overprunes, particularly for large data sets.

6.2 Example: Consumer Report Auto Data

A second example using the class method demonstrates the outcome for a response with multiple (> 2) categories. We also explore the difference between Gini and information splitting rules. The dataset cu.summary contains a collection of variables from the April, 1990 Consumer Reports summary on 117 cars.

[Figure 5: Displays the rpart-based model relating automobile Reliability to car type, price, and country of origin. The figure on the left uses the gini splitting index and the figure on the right uses the information splitting index.]

For our purposes, car reliability will be treated as the response.
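Before fitting anything, it can help to look at the data directly. The following is a minimal sketch, assuming the rpart package (which supplies cu.summary) is installed; the names n_total, n_missing and n_used are illustrative only and do not come from the text.

```r
# cu.summary is distributed with the rpart package.
library(rpart)
data(cu.summary, package = "rpart")

# Structure of the data frame: one row per car.
str(cu.summary)

# Distribution of the response, with missing values shown explicitly.
print(table(cu.summary$Reliability, useNA = "ifany"))

# Count the rows that have a non-missing response and can enter a fit.
n_total   <- nrow(cu.summary)
n_missing <- sum(is.na(cu.summary$Reliability))
n_used    <- n_total - n_missing
print(c(total = n_total, missing = n_missing, used = n_used))
```

The useNA = "ifany" argument makes table() report the NA count alongside the factor levels instead of silently dropping it.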
The variables are:

    Reliability   an ordered factor (contains NAs): Much worse, worse, average, better, Much better
    Price         numeric: list price in dollars, with standard equipment
    Country       factor: country where car manufactured (Brazil, England, France, Germany, Japan, Japan/USA, Korea, Mexico, Sweden, USA)
    Mileage       numeric: gas mileage in miles/gallon, contains NAs
    Type          factor: Small, Sporty, Compact, Medium, Large, Van

In the analysis we are treating reliability as an unordered outcome. Nodes potentially can be classified as much worse, worse, average, better, or much better, though there are none that are labelled as just "better". The 32 cars with missing response (listed as NA) were not used in the analysis. Two fits are made, one using the Gini index and the other the information index.

    library(rpart)
    fit1 <- rpart(Reliability ~ Price + Country + Mileage + Type,
                  data=cu.summary, parms=list(split='gini'))
    fit2 <- rpart(Reliability ~ Price + Country + Mileage + Type,
                  data=cu.summary, parms=list(split='information'))
    par(mfrow=c(1,2))
    plot(fit1); text(fit1, use.n=TRUE, cex=.9)
    plot(fit2); text(fit2, use.n=TRUE, cex=.9)

The first two nodes from the Gini tree are

    Node number 1: 85 observations,    complexity param=0.3051
      predicted class= average   expected loss= 0.6941
        class counts:     18     12     26      8     21
       probabilities: 0.2118 0.1412 0.3059 0.0941 0.2471
      left son=2 (58 obs) right son=3 (27 obs)
      Primary splits:
          Country splits as ---LRRLLLL, improve=15.220, (0 missing)
          Type splits...
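The figures printed for node 1 can be checked by hand from the class counts alone. The sketch below uses base R only. The variable names are illustrative, and the use of the natural logarithm for the information index is an assumption rather than something stated in the text.

```r
# Class counts at the root node, copied from the printed summary
# (order: Much worse, worse, average, better, Much better).
counts <- c(18, 12, 26, 8, 21)

n <- sum(counts)      # number of observations in the node
p <- counts / n       # class probabilities

# Expected loss under 0/1 loss: the fraction of cars that are not
# in the predicted (most frequent) class.
expected_loss <- 1 - max(p)

# Node impurity under each of the two splitting criteria.
gini <- 1 - sum(p^2)          # Gini index
info <- -sum(p * log(p))      # information index (natural log assumed)

print(n)
print(round(p, 4))
print(round(expected_loss, 4))
print(c(gini = gini, information = info))
```

The rounded probabilities and the expected loss agree with the values printed for node 1 (0.2118 0.1412 0.3059 0.0941 0.2471 and 0.6941). The two impurity values are on different scales, so they should be compared within a criterion across candidate splits, not against each other.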

This document was uploaded on 09/26/2013.
