Consumer Report Auto Data (cont.): the dataset car.all

...e S-Plus manual. We will work with a subset of 23 of the variables, created by the first two lines of the example below. We will use Price as the response. This data set is a good example of the usefulness of the missing value logic in rpart: most of the variables are missing on only 3-5 observations, but only 42/111 have a complete subset.

    > cars <- car.all[, c(1:12, 15:17, 21, 28, 32:36)]
    > cars$Eng.Rev <- as.numeric(as.character(car.all$Eng.Rev2))
    > fit3 <- rpart(Price ~ ., data=cars)
    > fit3
    node), split, n, deviance, yval
          * denotes terminal node

     1) root 105 7118.00 15.810
       2) Disp.<156 70 1492.00 11.860
         4) Country:Brazil,Japan,Japan/USA,Korea,Mexico,USA 58 421.20 10.320
           8) Type:Small 21 50.31 7.629 *
           9) Type:Compact,Medium,Sporty,Van 37 132.80 11.840 *
         5) Country:France,Germany,Sweden 12 270.70 19.290 *
       3) Disp.>156 35 2351.00 23.700
         6) HP.revs<5550 24 980.30 20.390
          12) Disp.<267.5 16 396.00 17.820 *
          13) Disp.>267.5 8 267.60 25.530 *
         7) HP.revs>5550 11 531.60 30.940 *

    > printcp(fit3)
    Regression tree:
    rpart(formula = Price ~ ., data = cars)

    Variables actually used in tree construction:
    [1] Country Disp.   HP.revs Type

    Root node error: 7.1183e9/105 = 6.7793e7

            CP nsplit rel error  xerror    xstd
    1 0.460146      0   1.00000 1.02413 0.16411
    2 0.117905      1   0.53985 0.79225 0.11481
    3 0.044491      3   0.30961 0.60042 0.10809
    4 0.033449      4   0.26511 0.58892 0.10621
    5 0.010000      5   0.23166 0.57062 0.11782

Only 4 of 22 predictors were actually used in the fit: engine displacement in cubic inches, country of origin, type of vehicle, and the revolutions for maximum horsepower (the "red line" on a tachometer).

The relative error is $1 - R^2$, similar to linear regression. The xerror is related to the PRESS statistic. The first split appears to improve the fit the most. The last split adds little improvement to the apparent error.

The 1-SE rule would choose a tree with 3 splits.

This is a case where the default cp value of .01 may have overpruned the tree, since the cross-validated error is not yet at a minimum.
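As a rough check on the 1-SE rule and the $1 - R^2$ reading of the relative error, the following sketch recomputes both from the printed cp table. The numbers are copied by hand from the printcp(fit3) output; the names cptab, threshold, one_se_nsplit and r2_apparent are illustrative choices, not part of rpart. Cross-validated columns differ from run to run, so a fresh fit would likely give slightly different values.

```r
# CP table copied by hand from the printcp(fit3) output (an assumption:
# these are the printed, rounded values, not the fitted object itself).
cptab <- data.frame(
  CP     = c(0.460146, 0.117905, 0.044491, 0.033449, 0.010000),
  nsplit = c(0, 1, 3, 4, 5),
  relerr = c(1.00000, 0.53985, 0.30961, 0.26511, 0.23166),
  xerror = c(1.02413, 0.79225, 0.60042, 0.58892, 0.57062),
  xstd   = c(0.16411, 0.11481, 0.10809, 0.10621, 0.11782)
)

# 1-SE rule: take the smallest tree whose cross-validated error is
# within one standard error of the minimum cross-validated error.
best          <- which.min(cptab$xerror)
threshold     <- cptab$xerror[best] + cptab$xstd[best]
chosen        <- min(which(cptab$xerror <= threshold))
one_se_nsplit <- cptab$nsplit[chosen]

# Apparent R^2 of the largest tree: 1 minus its relative error.
r2_apparent <- 1 - cptab$relerr[nrow(cptab)]

print(threshold)      # 0.68844
print(one_se_nsplit)  # 3
print(r2_apparent)    # 0.76834
```

With these values the threshold is 0.57062 + 0.11782 = 0.68844, and the 3-split tree (xerror 0.60042) is the smallest one below it, which agrees with the statement above.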
A rerun with the cp threshold at .002 gave a maximum tree size of 8 splits, with a minimum cross-validated error for the 5-split model.

For any CP value between 0.46015 and 0.11791 the best model has one split; for any CP value between 0.11791 and 0.04449 the best model is with 3 splits; and so on.

The print command also recognizes the cp option, which allows the user to see which splits are the most important.

    > print(fit3, cp=.10)
    node), split, n, deviance, yval
          * denotes terminal node

     1) root 105 7.118e+09 15810
       2) Disp.<156 70 1.492e+09 11860
         4) Country:Brazil,Japan,Japan/USA,Korea,Mexico,USA 58 4.212e+08 10320 *
         5) Country:France,Germany,Sweden 12 2.707e+08 19290 *
       3) Disp.>156 35 2.351e+09 23700
         6) HP.revs<5550 24 9.803e+08 20390 *
         7) HP.revs>5550 11 5.316e+08 30940 *

The first split on displacement partitions the 105 observations into groups of 70 and 35 (nodes 2 and 3) with mean prices of 11,860 and 23,700. The deviances (corrected sum-of-squares) at these 2 nodes are $1.49 \times 10^9$ and $2.35 \times 10^9$, res...
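The correspondence between cp values and tree sizes, and the link between node deviances and the first CP entry, can be sketched with the printed numbers alone. The helper nsplit_for_cp is hypothetical (it is not an rpart function), and its handling of a cp value that falls exactly on a table entry is an arbitrary choice made here for illustration.

```r
# CP and nsplit columns copied by hand from the printcp(fit3) table.
cp_values <- c(0.460146, 0.117905, 0.044491, 0.033449, 0.010000)
nsplits   <- c(0, 1, 3, 4, 5)

# Hypothetical helper: number of splits in the best tree for a given cp.
# The first table row whose CP is <= cp gives the tree size; cp values
# below the smallest tabulated CP are not covered by the table, so NA.
nsplit_for_cp <- function(cp) {
  idx <- which(cp_values <= cp)
  if (length(idx) == 0) return(NA_real_)
  nsplits[idx[1]]
}

print(nsplit_for_cp(0.30))  # 1
print(nsplit_for_cp(0.10))  # 3

# Deviance check for the first split, using the printed node deviances
# (root, then nodes 2 and 3). The proportional reduction should be close
# to the first CP entry, and its complement close to the rel error at
# nsplit = 1, up to rounding in the printed output.
root_dev  <- 7.118e9
child_dev <- c(1.492e9, 2.351e9)
first_split_gain <- 1 - sum(child_dev) / root_dev
print(round(first_split_gain, 3))  # 0.46
```

The value for cp = 0.10 matches the listing from print(fit3, cp=.10), which shows three splits (displacement, country, and HP.revs) and four terminal nodes.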
This document was uploaded on 09/26/2013.