IPS_Splus
220 Pages

IPS_Splus

Course Number: MA 331, Fall 2009

College/University: Stevens

Word Count: 35148

Rating:

Document Preview

S-PLUS GUIDE FOR R Moore and McCabes INTRODUCTION TO THE PRACTICE OF STATISTICS Fifth Edition Gregory L. Snow Intermountain Health Care Laura M. Chihara Carleton College Tim Hesterberg Insightful Corp. W. H. Freeman and Company New York Teachers: Chapter Order We encourage teachers to consider using Chapter 14 after Chapter 7. David Moore writes: Id like users to consider resampling (Chapter 14) right after...

Unformatted Document Excerpt
Coursehero >> New Jersey >> Stevens >> MA 331

Course Hero has millions of student submitted documents similar to the one
below including study guides, practice problems, reference materials, practice exams, textbook help and tutor support.

Course Hero has millions of student submitted documents similar to the one below including study guides, practice problems, reference materials, practice exams, textbook help and tutor support.

GUIDE S-PLUS FOR R Moore and McCabes INTRODUCTION TO THE PRACTICE OF STATISTICS Fifth Edition Gregory L. Snow Intermountain Health Care Laura M. Chihara Carleton College Tim Hesterberg Insightful Corp. W. H. Freeman and Company New York Teachers: Chapter Order We encourage teachers to consider using Chapter 14 after Chapter 7. David Moore writes: Id like users to consider resampling (Chapter 14) right after Chapter 7 in their course. Thats the way Introduction to the Practice of Statistics ows: rst the traditional procedures (and a strong dose of practical concerns), then resampling methods as an answer to some (not all) of the practical issues. I initially wanted Chapter 14 to be Chapter 8, but bowed to market reality. In some future world, resampling will replace Chapters 6 and 7, but not yet. In addition to the practical issues that Moore notes, there are sound pedagogical reasons for considering Chapter 14 early. Resampling methods provide concrete graphical illustrations of some of the concepts that students nd most dicult, including sampling distributions, standard errors, and P -values. And resampling provides a way for students to check their answers obtained by formula methods. Students: Chapter 14 Whether or not your course includes Chapter 14, we encourage you to use this chapter in connection with other chapters. Bootstrap methods and permutation tests provide concrete visual representations for some of the more abstract and dicult concepts treated in Chapter 6 and later. These methods may help you better understand the material in other chapters. The approach is based on computer simulation and visualization, rather than the traditional mathematical approach. For Best Viewing This document is intended to be viewed online. Screen shots are taken at 96 dpi, so should be clear when viewing the document at 100% of actual size on a screen with 96 dpi. If your resolution is dierent, then another magnication may be better. Online Version The current version of this document will be available online through links at www. whfreeman.com/ips or www.whfreeman.com/ipsresample. S-PLUS R is a registered trademark of Insightful Corp. c 2005 by W. H. Freeman and Company This document may be freely copied by students and faculty for use as a supplement to Moore and McCabe, Introduction to the Practice of Statistics. Contents 0 Introduction to S-PLUS 0.1 Obtaining S-PLUS and Two 0.2 Getting Started . . . . . . . 0.3 Getting Help . . . . . . . . 0.4 Data in S-PLUS . . . . . . 0.5 The Object Explorer . . . . 0.6 Exiting S-PLUS . . . . . . . 0.7 Typing Commands . . . . . 1 1 1 2 3 11 11 12 20 20 31 34 40 42 42 43 45 45 53 55 60 63 64 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Looking at DataDistributions 1.1 Displaying Distributions with Graphs . . . 1.2 Describing Distributions with Numbers . . . 1.3 The Normal Distributions . . . . . . . . . . 1.4 Editing Graphs . . . . . . . . . . . . . . . . 1.5 Multiple Plots in Command Line Graphics . 1.6 Exercises . . . . . . . . . . . . . . . . . . . 1.7 Solutions . . . . . . . . . . . . . . . . . . . 2 Looking at DataRelationships 2.1 Scatterplots . . . . . . . . . . . . . . . . . . 2.2 Correlation . . . . . . . . . . . . . . . . . . 2.3 Least-Squares Regression . . . . . . . . . . 2.4 Cautions About Correlation and Regression 2.5 Exercises . . . . . . . . . . . . . . . . . . . 2.6 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Producing Data 66 3.1 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.2 Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4 Probability 70 4.1 Randomness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.2 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.3 Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5 Sampling Distributions 76 5.1 Sampling Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 6 Introduction to Inference 6.1 Condence Intervals and 6.2 Power . . . . . . . . . . 6.3 Exercises . . . . . . . . 6.4 Solutions . . . . . . . . 82 82 84 86 87 Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii 7 Inference for Distributions 7.1 Inference for the Mean of a Population 7.2 Comparing Two Means . . . . . . . . 7.3 Exercises . . . . . . . . . . . . . . . . 7.4 Solutions . . . . . . . . . . . . . . . . 8 Inference for Proportions 8.1 Inference for a Single Proportion 8.2 Two Sample Proportions . . . . . 8.3 Exercises . . . . . . . . . . . . . 8.4 Solutions . . . . . . . . . . . . . 9 Inference for 9.1 Two-Way 9.2 Exercises 9.3 Solutions Two-Way Tables and . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 88 93 96 97 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 . 99 . 101 . 103 . 104 106 106 113 114 116 116 118 120 123 129 130 131 Tables the Chi-Square Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Inference for Regression 10.1 Graphing the Data . . . . . . . . . . . . . . . 10.2 Inference About the Model . . . . . . . . . . 10.3 Inference About Prediction . . . . . . . . . . 10.4 Checking the Regression Assumptions . . . . 10.5 More Detail About Simple Linear Regression 10.6 Exercise . . . . . . . . . . . . . . . . . . . . . 10.7 Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Multiple Regression 135 11.1 Inference for Multiple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 11.2 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 11.3 Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 12 One-Way Analysis of Variance 12.1 The Analysis of Variance F Test 12.2 Comparing the Means . . . . . . 12.3 Exercises . . . . . . . . . . . . . 12.4 Solutions . . . . . . . . . . . . . 13 Two-Way Analysis of Variance 13.1 Preliminary Data Analysis . . 13.2 Two-Way ANOVA . . . . . . 13.3 Exercise . . . . . . . . . . . . 13.4 Solution . . . . . . . . . . . . 145 145 148 152 153 155 155 160 161 162 165 166 183 187 187 187 196 196 196 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Bootstrap Methods and Permutation Tests 14.1 One and Two Means (Chapter 7) . . . . . . . . . . . . . . 14.2 Ratios of Means, Standard Deviations, or Other Statistics 14.3 Proportions (Chapter 8) . . . . . . . . . . . . . . . . . . . 14.4 Two-Way Tables and the Chi-Square Test (Chapter 9) . . 14.5 Regression (Chapter 10) . . . . . . . . . . . . . . . . . . . 14.6 Multiple Regression (Chapter 11) . . . . . . . . . . . . . . 14.7 Analysis of Variance (Chapters 12 and 13) . . . . . . . . . 14.8 Correlation (Chapter 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv 15 Nonparametric Tests 15.1 Wilcoxon Rank Sum Test . . . . 15.2 The Wilcoxon Signed Rank Test 15.3 The Kruskal-Wallis Test . . . . . 15.4 Exercises . . . . . . . . . . . . . 15.5 Solutions . . . . . . . . . . . . . 16 Logistic Regression 16.1 The Logistic Regression Model 16.2 Multiple Logistic Regression . . 16.3 Exercise . . . . . . . . . . . . . 16.4 Solution . . . . . . . . . . . . . Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 199 201 203 204 205 208 208 210 212 212 213 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v Chapter 0 Introduction to S-PLUS This manual is a supplement to Moore and McCabes Introduction to the Practice of Statistics (which we abbreviate IPS throughout this manual). It is intended to guide the student through the use of S-PLUS for Windows for the data analyses and statistical procedures detailed in the text. S-PLUS has both a Graphical User Interface as well as a command line interface. We describe both, focusing primarily on the GUI. We assume that you are using S-PLUS 6.1 or later. Chapter 14 requires S-PLUS 6.1 or later; earlier versions of S-PLUS may be used for other chapters, although there are minor dierences in the GUI. 0.1 Obtaining S-PLUS and Two Libraries To use this book you need a copy of S-PLUS, and two supplemental libraries: the IPSdata library containing the data sets, and (for Chapter 14) the Resample library containing additional software. If you are not using a lab with this software already installed, then see www.whfreeman.com/ipsresample for up-to-date information on obtaining and installing this software. The following information is current at the time this manual was written (but the above site may have more current information): there is a free student version of S-PLUS available at http://elms03.e-academy.com/splus, instructors can obtain evaluation or academic copies of S-PLUS from www.insightful.com, and the supplemental libraries can be downloaded from links at www.whfreeman.com/ipsresample. 0.2 Getting Started If S-PLUS is installed on your PC, then you normally start it using Windows Start menu: Start > Program Files > S-PLUS. Otherwise, there may be an icon for S-PLUS on the Windows desktop. If not, you will need to nd out where it is located on your system or network, or install it. 1 The following screen shot shows the main S-PLUS program window, containing other windows that may be present in a typical session. We will learn more about these windows in the next few sections. Menu Standard toolbar Window toolbar Object Explorer Commands window Data frame Graphsheet Plots 2D palette Minimized windows Figure 0.1 0.3 Getting Help There is extensive online help available in S-PLUS. You can access this information from the menu. For example, to get help on GUI topics, select Help > Available Help > S-PLUS Help. Figure 0.2 This brings up a standard Windows help application: 2 Figure 0.3 The Contents tab groups help les by topic, the Index tab has an alphabetical listing of help les, and the Search tab allows a search for words in a help page. To get a listing of command line functions by topic, select Help > Available Help > Language Reference. Finally, there are a number of manuals available under Help > Online Manuals, including: Getting Started Guide. An overview of the S-PLUS GUI and Commands window. Users Guide. A manual for the S-PLUS GUI. We recommend that you take a look at the brief tour of S-PLUS given in the Getting Started online manual. 0.4 0.4.1 Data in S-PLUS Opening data in S-PLUS You can bring up data in a window in S-PLUS using either the Select Data menu option or the Object Explorer. We describe the rst method here and postpone discussion of the Object Explorer until later in this chapter (Section 0.5). S-PLUS comes supplied with many sample data sets. We begin with a look at fuel.frame which has information about automobiles from a 1990 issue of Consumer Reports. 1. At the menu, select Data > Select Data.... 2. In the Source box, check the radio button for Existing Data. 3. In the Existing Data box, for Name: type fuel.frame. Figure 0.4 3 4. Click OK or Apply. Figure 0.5 The data appear in a data window, a tool for viewing and editing data. Remark: The OK and Apply buttons in the Select Data dialog box are similar, except that Apply keeps the dialog box open after executing the command. This is useful if you wish to try things out, or execute several instances of a command (for example, to open multiple data sets). 0.4.2 Opening data in the IPSdata library The data sets for Introduction to the Practice of Statistics, 5th Edition are included in a special library, the IPSdata library. To use these data, you must rst load the library; if it is installed in the standard location, then you can load the library using menus: 1. At the menu, select File > Load Library 2. For Library Name, select IPSdata. Figure 0.6 3. Click OK. This adds a new menu item IPSdata on the right of the main S-PLUS menu bar. Figure 0.7 4 If the IPSdata library is installed in a nonstandard location on your system, you will need to type a command to load it into the Commands window or a Script window; see Section 0.7 for an introduction and Section 0.7.1 for the command. Once the library is loaded, you can open a data set using the procedure described above in Section 0.4.1. Or you can use menus designed to make it easier to nd data in this library more easily: 1. At the menu, select IPSdata > Select Data... (or equivalently, Data > Select IPS Data...). 2. Select the type of data set you want, for example, for an exercise in Chapter 1 select either Chapter 1 or Exercises. 3. Select the data set from the drop-down menu, for example, Exercise1.29. Figure 0.8 5 0.4.3 Loading the Resample library Loading the IPSdata library automatically loads the resample library, containing bootstrap and permutation test software. If you want to load the library separately, in order to analyze data that is outside the IPSdata library, the procedure is the same except you must also select the button to Attach at top of search list so that when S-PLUS searches for certain functions it nds the versions in the resample library instead of its built-in versions. 0.4.4 Importing data S-PLUS can read data les stored in many dierent formats, such as text (.txt), Excel spreadsheets (.xls), as well as S-PLUSs own format (.sdd). To import data from a le: 1. At the menu, select File > Import Data > From File.... This opens the Import From File dialog box. 2. Click on the Browse button and navigate to the location of the data le you wish to load. As an example, we use the le csdata.txt located in the IPSdata/data folder. 3. Select the data le. The File Format: eld will ll automatically. 6 Figure 0.9 By default, the new data set in S-PLUS is given the same name as the original le. If you wish to change the name, type the new name in the Data Set eld in the To box. 4. Click Update Preview, to be sure that data will be read in the format you expect. 5. Click OK. 0.4.5 Data types The data that we opened or imported in the previous sections appear in a data window; the data window is a mechanism for viewing and manipulating the values that make up the data. Each data set in the IPSdata library is stored in a data frame, a type of matrix in which each column corresponds to a variable and each row to an observation. The variables may be numeric, character, logical (TRUE or FALSE), factor, or other less-common types. For example, in fuel.frame, the rst four columns are numeric and the fth is a factor variable. Factor variables contain categorical data such as gender; the groups (for example, male and female) are called the levels of the factor variable. In fuel.frame, the factor variable Type has six levels: Small, Compact, Medium, Large, Sporty, and Van. 7 To check the type of column, hover your cursor over the column number to see a data tip; double indicates numerical data stored in double-precision. Figure 0.10 An entry of NA indicates a missing value. 0.4.6 Creating a data set In addition to importing already prepared data, you can use S-PLUS to create your own data sets. To create a new data frame: 1. Click on the New Data Set button on the Standard toolbar. Figure 0.11 A blank window resembling a spreadsheet opens. You can type data directly into this window. 2. Type the numbers 1, 2, 3, 4, 5 into the rst ve cells of column 1. Pressing the Enter key moves the active cell in the same direction as the last arrow key. 3. In the next column, enter blue, blue, red, red, green. When entering character data, the column is stored as a factor variable. Figure 0.12 8 0.4.7 Making changes to data You may need to make changes to data. These changes might involve the entire data set (for example, rename the data set), a column or columns (for example, rename a column or change the number of signicant digits displayed), or a single cell (for example, change the value of an entry). Figure 0.13 To change the name of the whole data frame, place the cursor in the top left cell, right-click and select Properties... from the shortcut menu. In the Data Frame dialog box, type a new name in the Name: eld. Click OK. Figure 0.14 To make changes to a column, right-click on the column header (column number or name) and select Properties... from the shortcut menu. In the Double Precision Column dialog box, you can, for example, change the name of the column, add a description to be used for labeling graphs, and so on, or change the number of signicant digits (Precision:) that are displayed. Figure 0.15 9 Remark: In S-PLUS, names can be any length and can consist of letters, digits, or the period symbol. Names must begin with a letter and no spaces are allowed. Thus, weight, Weight1994, weight.94 are all legitimate names while 94Weight, Weight 94, or weight-94 are not. S-PLUS is case-sensitive. If the only change you wish to make is to the column name, then you can also do this by double-clicking on the current column name and then typing in the new name. Figure 0.16 In addition to making changes from a dialog box, many common operations can be done quickly via the toolbar, including changing precision, adding or removing columns, and sorting. Select the column of interest from the data window, then click on the relevant button on the toolbar. Figure 0.17 For a complete description of the data set buttons, see the Data Set Toolbar help le. 0.4.8 Selecting columns or rows For many analyses, you may need to select more than one column or row. Click on the rst column or row header (either number or name) in your selection. While holding down the CTRL key, click on the next column or row header. Continue until all desired columns and/or rows have been selected. Figure 0.18 To select several contiguous columns (or rows), click on the header of the rst column (or row), hold down the SHIFT key, and then click on the last column or row. 10 0.5 The Object Explorer S-PLUS is an object-oriented environment; everything in S-PLUS is an objectdata sets, functions, graphs. The Object Explorer is a tool for organizing and managing these objects. New objects that are imported or created are saved automatically to your working database, typically located in C:/Program Files/Insightful/splus62/users/YourName/.Data The Object Explorer displays the objects in this database. To open the Object Explorer, click on the Object Explorer button on the Standard toolbar. Figure 0.19 The Object Explorer consists of two panes: the left pane consists of folders that hold shortcuts to S-PLUS objects (data, graphsheets, and so on), while the right pane consists of properties of the objects selected in the left pane. For example, if the Data folder is selected in the left pane, then all objects in the working database are displayed in the right pane. If a specic data frame in the Data folder is selected in the left pane, then all columns of this data frame are displayed in the right pane. Thus, for example, you can make your variable selection for graphs or statistical analyses from the Object Explorer rather than from the data window. Figure 0.20 To open a data frame into a data window, you can double-click on its icon in the Object Explorer. To delete an object from the working database, right-click on its icon in the Object Explorer and select Delete from the shortcut menu. 0.6 Exiting S-PLUS To quit S-PLUS, select File > Exit from the menu. You will be presented with the Save Database Changes dialog box. 11 Figure 0.21 By default, S-PLUS saves all your newly created or modied data sets to the working database. If, however, you decide that you do not want to keep any of the current versions (for example, you wish to revert to the version of a data set that you had at the beginning of the session), you may de-select the data set or sets, and then click OK. 0.7 Typing Commands Although all of the graphs and analyses in IPS can be performed through the GUI, you may prefer typing commands rather than using the point-and-click interface. If you have never used S-PLUS before, we recommend that you begin with the point-and-click interface. You may come back later and try typing commands. Commands may be given in two dierent ways, using the Commands Window or a Script Window. In the Commands window you type commands and run them immediately, while in a Script Window you may write a script with one or more commands, then run the commands when desired. Contents of a Script Window may be saved in a le. We begin with the Commands Window, then describe Script Windows. 0.7.1 The Commands Window To open the Commands Window, use the menu: 1. At the menu, select Window > Commands Window. or click on the Commands Window button , located on the Standard toolbar. In the Commands window, you should see a prompt, the greater than sign >. Type an expression at this prompt, then press the ENTER key to evaluate the expression. For example: > 3*(9-4) <ENTER> [1] 15 If you press ENTER before typing a complete expression, then S-PLUS prints the continuation prompt, a + sign. At this prompt, nish the expression and then press ENTER. > 3*(9+ 4) [1] 15 <ENTER> <ENTER> From now on, we will not indicate the ENTER key. In general, S-PLUS ignores spaces. We recommend using spaces to make what you write more readable. 12 > 3 * (9 - 4) [1] 15 The pound symbol # is used for commentsanything typed after # is not evaluated. > (5 <= 8) [1] T # a logical expression; either TRUE (T) or FALSE (F) Missing values are denoted by NA. > log(-1) [1] NA To access help les, use either help or ?; for example > help() > help(mean) > ?mean # bring up the help system, help on help # help for the mean command # help for the mean command (equivalent to previous) There are several useful keys for editing commands. The Up () and Down () keys scroll backward or forward through previously typed commands; this is particularly useful for xing errors in long commands without retyping everything. The Home and End keys bring the cursor to the beginning or end of the current line. Load the IPSdata library To load the IPSdata library, give the command > library(IPSdata) However, if the library is not loaded in the standard location (inside the library folder inside the S-PLUS folder), then you also need to specify where the library can be found; for example, if the library is located at c:/Stat101/IPSdata then the command would be > library(IPSdata, lib.loc = "c:/Stat101") Notea backslash is used to include special characters in quoted strings (for example, "\t" gives a tab), so if you want a real backslash you need to type it twice, for example, > library(IPSdata, lib.loc = "c:\\Stat101") Load the resample library Loading the IPSdata library automatically loads the resample library, which contains bootstrap and permutation test software. You can also load the resample library separately, using almost the same procedure; you should also load the library rst, so that when S-PLUS searches for certain functions it nds the versions in the resample library instead of its built-in versions. > library(resample, first = T) 0.7.2 Script Windows The second way to give commands is to use a Script Window. This has the advantage of making it easier to save your work, and to rerun all or parts of your previous work. To open a Script Window, use the menu: 1. At the menu, select File > New. 2. Select Script File. 13 3. Click OK. In a Script Window, you type the same commands as in the Commands Window (without prompts), then execute them. For example, 3 * (9 - 4) (5 <= 8) # a logical expression; either TRUE (T) or FALSE (F) log(-1) help() ?mean # help for the mean command To execute one or more lines, select them using the mouse (click at the beginning of a line, or drag across multiple lines), and hit F10 or click the black triangle (this only appears when a script window is active). You may use multiple Script Windows. You may save the contents of a script window to a le, when the Script Window is active (usually, when it is not behind another window). 1. At the menu, select File > Save As.... 2. Type the name of the le. 3. Click OK. In the following sections we give commands as if you are using the Commands Window; however you can give the same commands using a Script Window. 0.7.3 Assigning names to objects S-PLUS evaluates each expression you type and outputs the result to the screen. Occasionally, you may wish to save the results of a calculation for later use. You do this by assigning (or saving) the value to a name. You can then obtain the value later by typing the name. For example: > IQ = 120 > IQ [1] 120 In this manual we use = to show assignment, but many of the other S-PLUS manuals use <-, a less than symbol followed immediately (that is, no space) by a hyphen; think of this as an arrow, with the value on the right being put into the name on the left. When assigning names, it is important not to use a name already used by S-PLUS. For example, c, C, T, F, cat, and date are reserved by S-PLUS. If you are not sure if a name is already in use, type the potential name and check the response. > dog Problem: Object "dog" not found Use traceback() to see the call stack > cat function(..., file="", sep=" ", fill=F, labels=NULL, append=F) .Internal(cat(..., file, sep, fill, labels, append), "S_cat", T, 0) We see that dog is safe to use, but not cat. To see a listing of the objects in the working database, use the objects command. 14 > objects() [1] ".Last.value" "IQ" To remove an object from the working database, use the rm command. > IQ [1] 120 > rm(IQ) > IQ Problem: Object "IQ" not found Use traceback() to see the call stack 0.7.4 Vectors The most basic data object in S-PLUS is the vector, a sequence of values. A single number is a vector of length one. We give examples of some of the ways you can create a vector. To create a sequence of numbers diering by 1, use the : command. > 4:10 [1] 4 5 6 7 8 9 10 > 4:-5 [1] 4 3 2 1 0 -1 -2 -3 -4 -5 To increment a sequence of numbers by values other than 1, use seq. > seq(0, 20, by=2) [1] 0 2 4 6 8 10 12 14 16 18 20 > seq(0, 20, length=25) [1] 0.0000000 0.8333333 1.6666667 2.5000000 [5] 3.3333333 4.1666667 5.0000000 5.8333333 [9] 6.6666667 7.5000000 8.3333333 9.1666667 [13] 10.0000000 10.8333333 11.6666667 12.5000000 [17] 13.3333333 14.1666667 15.0000000 15.8333333 [21] 16.6666667 17.5000000 18.3333333 19.1666667 [25] 20.0000000 The bracketed integers in the output are indices for the rst value on a line. For example, the 21st value in the last example is 16.6666667. To create a vector of values that has no particular pattern, or a vector of character or boolean values, use the c command (c for concatenate or combine). > c(3, -1, 0, 5, 2) [1] 3 -1 0 5 2 > c("A", "A", "B", "B") [1] "A" "A" "B" "B" The rep command is a useful function for creating vectors with values that repeat in a regular pattern. > rep(c("A", "B"), 5) [1] "A" "B" "A" "B" "A" > rep(c("A", "B"), c(5, [1] "A" "A" "A" "A" "A" > rep(c("A", "B"), each [1] "A" "A" "A" "A" "A" "B" "A" "B" "A" "B" 3)) "B" "B" "B" = 5) "B" "B" "B" "B" "B" Here we show three dierent ways of calling rep. For more information, see help(rep). 15 0.7.5 Interrupting output You may interrupt any printout using the ESC key > 1:1000000 [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 [17] 17 18 19 20 21 22 23 24 25 26 27 28 29 30 ... <ESC> Problem in print.atomic(x, quote): User interrupt requested. 15 31 16 32 0.7.6 Working with vectors Basic arithmetic operations work component-wise on vectors. > c(2, 4, 6) * c(10, 20, 30) # multiply 2 vectors [1] 20 80 180 > c(2, 4, 6)/2 # divide by 2 [1] 1 2 3 > y = c(8, 2, 9, 7, 5, 2, -3, -4, 0, 2, 1) > y [1] 8 2 9 7 5 2 -3 -4 0 2 1 > y + 2 [1] 10 4 11 9 7 4 -1 -2 2 4 3 > y * 5 [1] 40 10 45 35 25 10 -15 -20 0 10 5 In many cases, you need to work with only a portion of a vector. The basic syntax for subscripting a vector is vector[indices]. The indices may be a vector of integers, or a logical vector. > y [1] 8 2 9 7 5 2 -3 -4 0 2 1 > y[3] # extract the 3rd value [1] 9 > y[c(3, 8)] # extract the 3rd and 8th values [1] 9 -4 > y[-c(3, 8)] # extract all but the 3rd and 8th values [1] 8 2 7 5 2 -3 0 2 1 > y[length(y):1] # reverse the order [1] 1 2 0 -4 -3 2 5 7 9 2 8 > y < 2 # test whether a value is less than 2 [1] F F F F F F T T T F T > y[y < 2] # extract those values of y that are less than 2 [1] -3 -4 0 1 > y[0 < y & y < 5] # extract those values of y between 0 and 5 [1] 2 2 2 1 To replace the value of one entry with another, use the assignment operator. > y[3] [1] 9 > y[3] = 12 # replace the 3rd entry of y with 12 > y[3] [1] 12 > y [1] 8 2 12 7 5 2 -3 -4 0 2 1 16 Symbol == != > >= < <= | & Function Equal to Not equal to Greater than Greater than or equal to Less than Less than or equal to Element-wise or Element-wise and Table 0.1: Logical and Comparison Operators A vector can store only one kind of informationfor example, numerical, character, or logical, but not all three. The technical term is mode. If you try to store data of dierent modes together in a single vector, values are changed as needed to a common mode, whichever mode is most exiblecharacter is most exible, then numeric, then logical. > c(T, F, 3.35) [1] 1.00 0.00 3.35 > c(T, F, 3.35, "Seattle") [1] "TRUE" "FALSE" "3.35" "Seattle" 0.7.7 Data frames Vectors are relatively simple data structures that can only contain values of one mode. However, most data is more complex and may consist of numeric information, as well as categorical information such as gender or country of birth. A data frame is a matrix whose columns are vectors. Dierent columns may be of dierent modes. The built-in data set fuel.frame is a data frame with four columns of numeric values and one factor column. > fuel.frame Eagle Summit Ford Escort Ford Festiva Honda Civic 4 4 4 4 Weight Disp. Mileage Fuel 2560 97 33 3.030303 2345 114 33 3.030303 1845 81 37 2.702703 2260 91 32 3.125000 ... Type Small Small Small Small There are several methods for accessing the columns of a data frame. Using the $ sign The basic syntax is dataframe$column.name > fuel.frame$Weight [1] 2560 2345 1845 2260 2440 2285 2275 2350 2295 1900 [11] 2390 2075 2330 3320 2885 3310 2695 2170 2710 2775 ... Using brackets [,] The basic syntax is dataframe[row indices, column indices]. 17 > fuel.frame[4, 3] # 4th row, 3rd column [1] 32 > fuel.frame[1:5, c(1,3)] # rows 1 through 5, columns 1 and 3. Weight Mileage Eagle Summit 4 2560 33 Ford Escort 4 2345 33 Ford Festiva 4 1845 37 Honda Civic 4 2260 32 Mazda Protege 4 2440 32 To obtain all the rows, leave the row specication blank. The following are equivalent to fuel.frame$Weight: > fuel.frame[, 1] > fuel.frame[, "Weight"] Similarly, for all columns, leave the row specication blank. > fuel.frame[1:3, ] The attach command The columns of a data frame are not accessible directly from the command line. > Weight Problem: Object "Weight" not found Use traceback() to see the call stack We can access them using $ or subscripting, > fuel.frame$Weight but if we plan to do a substantial amount of work with a single data frame, it is nice to access the columns directly. We can attach a data frame to the S-PLUS search path, a list of databases that S-PLUS refers to when a function or data set is called. > attach(fuel.frame) > Weight Eagle Summit 4 Ford Escort 4 Ford Festiva 4 2560 2345 1845 Honda Civic 4 Mazda Protege 4 Mercury Tracer 4 2260 2440 2285 ... When done working with the data frame, we remove it from the search list: > detach("fuel.frame") > Weight Problem: Object "Weight" not found Use traceback() to see the call stack The detach command removes fuel.frame from the S-PLUS search path. Notice that this command requires double quotes around the name of the data frame whereas the attach command does not. Incidentally, when you load the IPSdata library, both it and the resample library are attached to the search path. You can see the current search path using > search() 18 0.7.8 Creating a data frame Use the data.frame command to create a data frame. The arguments to the function are the columns. > > > > 1 2 3 4 5 x = 1:5 y = c("blue", "blue", "red", "red", "green") df = data.frame(x, y) df x y 1 blue 2 blue 3 red 4 red 5 green 0.7.9 Functions In S-PLUS, function calls are always followed by a pair of parentheses, which may or may not enclose arguments. > objects() # function call with no argument > names(fuel.frame) # function call with one argument [1] "Weight" "Disp." "Mileage" "Fuel" "Type" > length(fuel.frame$Weight) [1] 60 > c(3, 4, 9) # function call with 3 arguments [1] 3 4 9 To quickly check options for a function, use the args command: > args(mean) function(x, trim=0, na.rm=F) NULL (You may ignore that NULL.) You can abbreviate the arguments to command line functions, provided the abbreviation is unambiguous: > mean(fuel.frame$Weight, trim=.1) # 10% trimmed mean (10% each side) [1] 2895.729 > mean(fuel.frame$Weight, tr=.1) [1] 2895.729 There is no other argument which begins tr, so trim may be abbreviated. In contrast, when calling the plot funcion, there are two optional arguments xlim and xlabel, so the abbreviation xl would be ambiguous. To exit S-PLUS from the command line, type > q() 19 Chapter 1 Looking at DataDistributions In this chapter, we perform exploratory data analysiscreating graphs and looking at basic numeric summariesto get an overview of the data. 1.1 1.1.1 Displaying Distributions with Graphs Categorical variables: Bar graphs and pie charts Categorical data places an individual into one of several groups or categories. In S-PLUS, a categorical variable is called a factor and the groups are called levels (see also Section 0.4.5). Example 1.1 Sizes of cars The data set cu.summary contains information from a 1990 issue of Consumer Reports on the properties of several cars. 1. At the menu, select Data > Select Data.... 2. In the Select Data dialog box, check the radio button next to Existing Data. 3. For Name: type cu.summary. Figure 1.1 4. Click OK. Verify that the data tip for Type shows it is indeed a factor variable. Next, we create a bar or pie chart of Type. 20 5. Select the Type column. 6. Bring up the Plots 2D palette by clicking on the Plots 2D button on the Standard toolbar. Figure 1.2 7. For a bar graph, click on the Bar Zero Base button. 30 24 Count 18 12 6 0 Small Sporty Compact Type Medium Large Van Figure 1.3 8. For a pie chart, click on the Pie Chart button. Sporty Small Compact Van Large Medium Figure 1.4 End of Example 1.1 21 Remarks: To add grid lines to the bar graph, right-click on the y axis and select Grids/Ticks... from the shortcut menu. In the Y Axis dialog box under Major Grids, change the State: to On and click on OK. Figure 1.5 To make changes to the bars or slices (such as colors, labels, and so forth), right-click on a bar and select By Bar... from the shortcut menu, or right-click on a pie slice and select By Slice... from the shortcut menu. This opens the appropriate dialog box. In the previous example, we worked with a factor variable (Type) which contained all the information in raw formwe have a listing of each observation and the level to which it belongs. In many instances, we instead have a summary of a factor variable, a list of the levels and the number of observations (counts) for each level. S-PLUS can handle these summaries too. Example 1.2 Education The textbook (IPS) provides data on the education level of 30-something young adults. We do not have the information on each of the over 35 million people in the data set, but rather a summary of the number of observations for each level of the variable Education. Education Less than high school High school graduate Some college Associate degree Bachelors degree Advanced degree To analyze this information in S-PLUS, 1. Create a new data set by clicking on the New Data Set button and enter the data above. Count (millions) 4.6 11.6 7.4 3.3 8.6 2.5 Percent 11.8 30.6 19.5 8.8 22.7 6.6 22 Figure 1.6 2. Select Education, then CTRL-click on the Count column. 3. Click on the Bar Zero Base button For a pie chart, 4. Select the Percent column, then CTRL-click on the Education column. 5. Click on the Pie Chart button . on the Plots 2D palette. on the Plots 2D palette. The two plots should look similar to the plots in IPS. If either one looks funny, right-click on a bar or a pie slice and select Data to Plot... from the shortcut menu. Check that the correct information is listed in the appropriate elds: for the bar graph, x Columns: should be the categorical variable Education and y Columns: should be Counts or Percents; for the pie chart, x Columns: should be Counts or Percents and the y Columns: should be Education. End of Example 1.2 Remark: The dotchart is another informative plot. 1. Select the Percent column, then CTRL-click on Education. 2. Click on the Dot (dotplot) button on the Plots 2D palette. 3. In the resulting graph, right-click on the x axis and select Range.... Enter 100 in the Axis Maximum: and Last Tick: elds, enter 0 in the Axis Minimum: eld, and click on OK. 23 Advanced degree Bachelors degree Education Associate degree Some college High school graduate Less than high school 0 25 50 Percent 75 100 Figure 1.7 Like the bar chart, you can compare the sizes of each level by noting how far each symbol is from an axis (in this case, the left vertical axis); since the horizontal axis is scaled to 100, you can compare relative sizes like the pie chart. A Pareto chart is a bar graph with the bars sorted in decreasing height and a line graph showing the cumulative percent in the bars. Since the nal cumulative percent is always 100%, we can easily see what proportion each bar is of the whole and still easily compare heights of bars next to each other. In addition to graphical representations of factor variables, S-PLUS can calculate numeric summaries for factor variables. Example 1.3 Counts and percentages We will continue to use the cu.summary data. 1. At the menu, select Statistics > Data Summaries > Crosstabulations.... 2. For Data Set: select cu.summary, and for Variables: select Type. 24 Figure 1.8 3. Click OK. *** Crosstabulations *** Call: crosstabs(formula = ~ Type, data = cu.summary, na.action = na.fail, drop.unused.levels = T) 117 cases in table +-------+ |N | |N/Total| +-------+ Type | | |RowTotl| -------+-------+-------+ Compact|22 |22 | |0.19 |0.19 | -------+-------+-------+ . . . -------+-------+-------+ Van |10 |10 | |0.085 |0.085 | -------+-------+-------+ ColTotl|117 |117 | |1 | | -------+-------+-------+ This gives the counts and percentages for each car type and the percentages. For example, there are ten vans representing 8.5% of the total. End of Example 1.3 25 Command Line Notes The table command returns the counts of a factor variable and the functions barplot, pie, and dotchart create plots of a factor variable. The table of counts and proportions can be created with crosstabs. When there is one line per subject, the rst argument to crosstabs is a formula of the form ~factor. > attach(cu.summary) > table(Type) Compact Large Medium Small Sporty Van 22 7 30 22 26 10 > barplot(table(Type), names = levels(Type)) > pie(table(Type), names = levels(Type)) > dotchart(table(Type)) > crosstabs( ~ Type) > detach() When instead there is a variable containing counts, the rst argument is of the form counts~factor. Here we multiply by 106 to convert from millions to ones: > SDF1 = data.frame( + Education = c("Less than high school", + "High school graduate", "Some college", "Associate degree", + "Bachelors degree", "Advanced degree"), + Count = c(4.6, 11.6, 7.4, 3.3, 8.6, 2.5), + Percent = c(11.8, 30.6, 19.5, 8.8, 22.7, 6.6)) > crosstabs(10^6 * Count ~ Education, data = SDF1) 1.1.2 Quantitative variables: Histograms Histograms provide a graphical view of quantitative or continuous data. Example 1.4 How to make a histogram Table 1.3 of IPS gives IQ test scores for 60 randomly chosen fth-grade students. To make a histogram of these data: 1. Load the IPSdata library, from menu File > Load Library.... Instructions for obtaining the library are in Section 0.1. 2. Open the data using IPSdata > Select Data..., and select Table1.3. 3. Select the iq column 4. Click on the Histogram button on the Plots 2D palette. 26 30 20 10 0 80 100 120 iq 140 160 Figure 1.9 Histograms can look very dierent depending on just where you place the edges of the bins. To change the number or location of bins, 5. Right-click on the histogram and choose Options... from the shortcut menu. 6. In the Histogram/Density dialog, for Lower Bound: enter 80, and for Interval Width enter 10. Figure 1.10 27 7. Click Apply. 8. Try some other bar widths, or change other parameters; click Apply to see each. 9. To add a density curve, for Output Type: select Density, select the Density Line tab, and select the checkbox for Draw Line. 10. Click OK when you are done. Here are histograms with bars of width 10 and 5, with a lower bound of 80: 15 0.03 10 0.02 5 0.01 0 80 90 100 110 iq 120 130 140 150 0.00 80 90 100 110 iq 120 130 140 150 Figure 1.11 End of Example 1.4 Remarks: You can add a title and explanatory labels along the axes using the Insert > Titles > Axis or Insert > Text menus. You can change other aspects of the histogram (color, shading, and so on) by returning to the Histogram/Density dialog box (right-click on the histogram and select Options from the shortcut menu). A density plot is an alternative to a histogram, generally superior because it is not dependent on bar edges. However, you may work with people not familiar with them. A good compromise is to combine a histogram and density curve . An example is shown above. 28 Command Line Notes The hist function creates histograms and the stem function creates stem and leaf plots. > library(IPSdata) > names(Table1.3) [1] "iq" > hist(Table1.3$iq) > hist(Table1.3$iq, nclass = 15) > hist(Table1.3$iq, breaks = seq(80, 160, by = 10)) > hist(Table1.3$iq, probability = T) # probability scale > lines(density(Table1.3$iq)) # add density curve > plot(density(Table1.3$iq), type = "l") # density only > stem(Table1.3$iq) N = 60 Median = 114 Quartiles = 104, 125.5 Decimal point is 1 place to the right of the colon 8 9 10 11 12 13 14 : : : : : : : 129 0467 01112223568999 00022334445677788 22344456778 013446799 25 1.1.3 Time plots Time plots are used to measure change over time. Example 1.5 Mississippi river discharge Table 1.4 in IPS presents the yearly discharge of the Mississippi River from 1954 to 2001. 1. Load the data using IPSdata > Select Data... and select Table1.4. 2. Select the year column. 3. CTRL-click on the discharge column. Select the rst column in the data set (the prices). 4. Click on any of the three line plot buttons from the top row of the Plots 2D palette To change the labeling, 5. Right-click on the x axis and choose Range... from the shortcut menu. 6. Set Axis Minimum: to 1950, and Axis Maximum to 2005. 7. You can try other options, and see their results by clicking Apply. 8. Click OK. . 29 900 700 discharge 500 300 100 1950 1960 1970 1980 year 1990 2000 Figure 1.12 End of Example 1.5 The previous example had one column of dates and another containing data. If your data set has only a single column, containing data, then you need select only that data; S-PLUS then assumes the times are equally spaced. If your data was not entered in time order, then right-click on the line in the plot and choose Smooth/Sort from the shortcut menu. In the dialog box, change Pre-Sort Data: to X,Y on X and click OK. 30 Command Line Notes Time plots can be made using the plot command. > attach(Table1.4) > plot(year, discharge) > plot(year, discharge, type="l") > plot(year, discharge, type="b") > detach() The type option to plot includes "p" for points, "l" for line, "b" for a line and points. See the online help les for plot and lines for more options. 1.2 Describing Distributions with Numbers Summary statistics of numeric data including the mean, median, quartiles, and standard deviation can be computed using Statistics > Data Summaries > Summary Statistics.... Example 1.6 Car mileage We return to the vehicle data from Consumer Reports, 1990. 1. Open the fuel.frame data set. 2. Select menu Statistics > Data Summaries > Summary Statistics.... 3. In the Summary Statistics dialog box, for Variables: choose <ALL>, and for Group Variables: select (None). Figure 1.13 31 4. Click on the Statistics tab at the top of the dialog box to go to the next page. S-PLUS computes the statistics with check marks next to them; the most common summary statistics are selected by default. 5. Click OK. *** Summary Statistics for data in: fuel.frame *** $$$"Factor Summaries": Type Compact:15 Large: 3 Medium:13 Small:13 Sporty: 9 Van: 7 $$$"Numeric Weight Min: 1845.0000 1st Qu.: 2571.2500 Mean: 2900.8333 Median: 2885.0000 3rd Qu.: 3231.2500 Max: 3855.0000 Total N: 60.0000 NAs : 0.0000 Std Dev.: 495.8661 Summaries": Disp. Mileage Fuel 73.00000 18.000000 2.7027027 113.75000 21.000000 3.7037037 152.05000 24.583333 4.2100326 144.50000 23.000000 4.3478261 180.00000 27.000000 4.7619048 305.00000 37.000000 5.5555556 60.00000 60.000000 60.0000000 0.00000 0.000000 0.0000000 54.16091 4.791559 0.7575258 The output gives counts for the categorical variable and the ve-number summary as well as the mean, standard deviation, and a ve-number summary for each quantitative variable. End of Example 1.6 Remarks: The method used by S-PLUS to compute the quartiles is a bit dierent from the process described in IPS. To see the summary statistics by car type, return to the Summary Statistics dialog box and change the Group Variables: eld to Type. Click OK. If you select a continuous variable for the Group Variables: eld, then S-PLUS automatically breaks it into equal-length categories, and supplies summary statistics for the corresponding groups. S-PLUS can also produce boxplots based on the ve-number summary. Example 1.7 Boxplots of car data 1. Bring up the fuel.frame data set. 2. Select Type, then CTRL-click on Mileage. 32 3. Click on the boxplot button to create the boxplot. 35 30 Mileage 25 20 15 Compact Large Medium Type Small Sporty Van Figure 1.14 End of Example 1.7 Remarks: The order in which you select the variables is important. If your boxplot does not look correct, reselect the two variables (factor variable rst, then CTRL-click the numeric variable). You can right click on any of the boxes and click on the shortcut menu to get the boxplot dialog box. This gives you options to change the colors of the boxes, the styles of lines and points used in the boxes, and other options. S-PLUS plots suspected outliers (using the 1.5IQR criterion) as points above and below the boxplots. The Range: option on the Box Specs page of the dialog allows you to change the decision rule for plotting suspected outliers (entering 2 means to use 2 1.5 IQR). If the data for each group are in its own column, you can still create side-by-side boxplots: with no columns selected in the data frame, click on the boxplot button. The Boxplot dialog box opens with the x Columns: and y Columns: elds blank. Leave the x Columns: eld blank and select the desired columns (use CTRL-click to make multiple selections) for the y Columns:. Click OK. You can transform variables by using the Data > Transform menu. Select the data set, then enter the name you want the new (transformed) column to have in the Target Column: box. Then enter the function for the transformation in the Expression: box. 33 Figure 1.15 Command Line Notes The summary command computes summaries of numeric and factor variables, mean computes the mean, stdev nds the standard deviation, and boxplot together with split creates boxplots. > summary(fuel.frame) Weight Disp. Mileage Min.:1845 Min.: 73.0 Min.:18.00 1st Qu.:2571 1st Qu.:113.8 1st Qu.:21.00 ... > attach(fuel.frame) > summary(Mileage) Min. 1st Qu. Median Mean 3rd Qu. Max. 18.00 21.00 23.00 24.58 27.00 37.00 > mean(Mileage) [1] 24.58 > stdev(Mileage) [1] 4.791559 > boxplot(split(Mileage,Type)) > detach() > daily.temp$temp.c = 5/9 * (daily.temp$temp.f ... ... ... - 32) 1.3 The Normal Distributions You can compute the normal probability values found in Table A of IPS. Example 1.8 SAT scores (IPS Example 1.26) We want to nd the proportion of students who have SAT scores between 720 and 820. SAT scores approximately follow a normal distribution with = 1026 and = 209. 1. Create a new data set . 34 2. Enter 720 and 820 into the rst column. 3. At the menu, select Data > Distribution Functions.... 4. In the Distribution Functions dialog box, select the name of the data set for Data Set:. 5. For Source Column: select the column that you just created. 6. The output is sent to a new column. Designate the name of this new column in the Target Column: eld (the default name is Probability). 7. Under Distribution Parameters, For Mean: enter 1026, and for Std. Deviation: enter 209. Figure 1.16 8. Click OK. 9. Select the new column and click on the Increase precision digits. button twice to view more signicant If we let X denote the SAT score, then the S-PLUS output tells us that P (X 720) = 0.0716 and P (X 820) = 0.1622. Thus, we want P (720 X 820) = 0.1622 0.0716 = 0.0906. End of Example 1.8 Example 1.9 Inverse normal calculations (IPS Example 1.30) We want to nd what score we need to earn on the SAT verbal section in order to be in the top 10%, given that the scores are approximately N (505, 110). 1. Create a new data frame and enter the value 0.9 into the rst cell. 2. At the menu, select Data > Distribution Functions.... 3. In the Distribution Functions dialog box, for Data Set: select the name of the data set. 4. For Source Column: select the column that you just created. For Target Column: keep the default name. 35 5. For Result Type:, check the radio button next to Quantile. 6. Under Distribution Parameters, for Mean: enter 505, and for Std. Deviation: enter 110. Figure 1.17 7. Click OK. We see that we would need to score 646 (rounded up from 645.97) to place in the highest 10%. End of Example 1.9 Example 1.10 Superimposing a normal curve on a histogram It is often informative to see how a histogram compares to a normal curve with the same mean and standard deviation. 1. Bring up the fuel.frame data set and create a histogram of Weight (recall Section 1.1.2). 2. Right-click on the histogram and choose Options... from the shortcut menu. 3. In the Histogram/Specs dialog box, change Output Type: to Density. This scales the histogram so that the total area is 1. 4. Click OK. We next generate a sequence of values, 1000 to 4400, which encompasses the range of x-values in the histogram above. 5. Create a new data set. 6. At the menu, select Data > Fill.... 7. In the Fill Numeric Columns dialog box, for Data Set: select the data set, and for Columns: select <END>. 8. Under Fill Options, for Length: enter 341, for Start: enter 1000, and for Increment: enter 10. 9. Click OK. 36 Figure 1.18 From Example 1.4, we know that the mean of Weight is 2900.83 and the standard deviation is 495.866. We wish to graph the normal density curve with these parameters over the range 1000 to 4400. 10. At the menu, select Data > Distribution Functions.... 11. In the Distribution Functions dialog box, for Data Set: select the name of the data set created above. 12. For Source Column: select the column created by the Fill command. For Target Column: leave the default name. 13. Change Result Type: to Density. 14. Under Distribution Parameters, for Mean: enter 2900.83, and for Std. Deviation. enter 495.866. 15. Click OK. The new column contains small numbers which appear to be zero, unless you increase the precision. Finally, we plot the normal curve and superimpose it onto the histogram of Weight. 16. Select the column created by the Fill command above, then CTRL-click on the Density column just created by the Distribution Functions command. 17. Click on the white background of the graphsheet with the histogram. There should be eight green knobs around the border of the histogram graph area. 37 Figure 1.19 18. Hold down the SHIFT key and click on the Line button on the Plots 2D palette. 0.0008 0.0006 Density 0.0004 0.0002 0.0000 1500 2000 2500 3000 3500 4000 Figure 1.20 The density plot should be superimposed on the histogram. If instead, you have two plots adjacent to each other on the same graphsheet, or if you created a new plot on a separate graphsheet, repeat the above steps (from 16). The histogram graphsheet must be selected as noted above and the SHIFT key must be held down simultaneously while the Line plot is being clicked on. End of Example 1.10 38 1.3.1 Normal quantile plot A normal quantile plot is a graph used to see how well the normal distribution ts a data set. A normal quantile plot graphs the data against what would be expected from a normal distribution. If the data fall close to a straight line then the data are approximately normally distributed. Data points far from the straight line show where there are dierences between the data and a normal distribution. Example 1.11 Normal quantile plot of car weight 1. Bring up the fuel.frame data set. 2. Select the Weight column. 3. Click on the QQ Normal button on the Plots 2D palette. 4000 3500 Weight 3000 2500 2000 1500 -3 -2 -1 0 Normal Distribution 1 2 Figure 1.21 The weights of the cars are fairly normally distributed. End of Example 1.11 39 Command Line Notes The pnorm function computes the probabilities for the normal distribution, qnorm gives the x-value for a given probability, and dnorm gives the height of the normal curvep for probability, q for quantile, and d for density. The function qqnorm creates a normal quantile plot and qqline adds the reference line to it. > pnorm(c(720, 820), mean = 1026, sd = 209) [1] 0.07158129 0.16215344 > 0.16215344 - 0.07158129 [1] 0.09057215 > qnorm(0.9, mean = 505, sd = 110) [1] 645.9707 > attach(fuel.frame) > hist(Weight, prob = T, col = 6, xlim = c(1500, 4500)) > lines(1000:4400, + dnorm(1000:4400, mean = mean(Weight), + sd = stdev(Weight))) > qqnorm(Weight) > qqline(Weight) > detach() 1.4 Editing Graphs In this section, we summarize the editing features that are possible with S-PLUS graphs created through the GUI. Creating a good-looking graph in S-PLUS is both quick and easy. Sometimes the graph can be improved by making small changes such as showing a larger range of data values, adding reference lines, adding tick marks, adding a comment, or changing the symbol used in the plot. 1.4.1 Editing existing portions of a graph Any plot created through the GUI can be customized by right-clicking on dierent features of the graph and choosing the desired option from the shortcut menu. Below, we show where the cursor should be placed to access the appropriate dialog boxes. 40 Bars, Slices, and Points Right-click on a bar, slice, or point to access the dialog box with options for changing colors, ll patterns, line styles, and plotting symbols and their size. X and Y Axes Right-click on either axis to access the dialog box with options for changing the range of the data that is plotted, the placement or appearance of tick marks, the scaling of the axis (for example, to a log scale), or for adding grid lines. Tick Labels Right-click on the labels corresponding to the tick marks to access the dialog box with options for changing the appearance of tick labels (font, rotation angle, and so on). Axis Titles Right-click on the axis title to access the dialog box with options to change fonts, edit content, add a date stamp, and so forth. 1.4.2 Annotating or adding to the graph brings up the graph Annotations palette. The Annotations button Figure 1.22 This palette has buttons for adding objects such as text (for labels or notes) , reference lines and arrows , or various reference symbols , a time and date stamp . then right-clicking on Any of the annotations can be modied by choosing the Select Tool button the annotation to be changed to access the shortcut menu. 41 The Graph Tools palette provides additional tools for labeling points, cropping, and so forth: Figure 1.23 1.5 Multiple Plots in Command Line Graphics Graphs created at the command line in S-PLUS are written to a graphics device. When you create a plot, a graphics device is opened automatically. Subsequent plots are written to the same device, overwriting earlier plots. > hist(fuel.frame$Weight) > boxplot(fuel.frame$Weight) You can have two plots visible by manually opening another graphics device. > graphsheet() > hist(fuel.frame$Mileage) > dev.list() # Show list of current devices graphsheet graphsheet 2 3 > dev.cur() # Show currently active device graphsheet 3 > help(dev.list) # Help, including commands for closing > # or switching devices Graphsheet 3 (GSD3) is the active device so subsequent graphs are displayed in this window. 1.6 Exercises These exercises use the data set cu.summary. 1. Summarize graphically the variables Reliability and Price. 2. Summarize numerically the variable Price. 3. Is the Price variable normally distributed? 42 1.7 Solutions or the Pie Chart button . . 1. From the GUI, select the Reliability column and click on either the Bar Zero Base button Next, select the Price column and click on the Histogram button From the command line, > > > > attach(cu.summary) barplot(table(Reliability), names = levels(Reliability)) hist(Price) detach() 2. From the GUI, Figure 1.24 From the command line, > summary(cu.summary$Price) 3. Create a QQ normal plot, , 43 40000 30000 Price 20000 10000 0 -3 -2 -1 0 1 2 3 Normal Distribution Figure 1.25 It is clear that Price does not follow a normal distribution; the largest values in Price are much greater than we would expect from a normal distribution. 44 Chapter 2 Looking at DataRelationships 2.1 Scatterplots Scatterplots are the easiest and most eective way to see relationships between two numeric variables. S-PLUS has several variations on scatterplots to allow you to explore these relationships. Example 2.1 Gas mileage We will explore data on the weight of cars and their gas mileage. The fuel.frame data set contains information on gas mileage for several cars as reported in Consumer Reports (April 1990). 1. Open the fuel.frame data set by clicking on the Data > Select Data menu, and typing fuel.frame in the Name: eld and clicking on OK. 2. Click on the column header for Weight, then CTRL-click on the column header for Mileage. The rst column selected becomes the x axis (horizontal axis), and the second column selected becomes the y axis (vertical axis). 3. Bring up the Plots 2D palette and click on the Scatter button in the upper left corner . 35 30 Mileage 25 20 15 1500 2000 2500 3000 Weight 3500 4000 Figure 2.1 45 We continue with some additional features of S-PLUS graphs. 4. If you hover the cursor over a data point, a data tip appears giving you information about that point (row number, x-value, y-value). Figure 2.2 To change what appears in the data tip, right-click on a point and choose Data to Plot... from the shortcut menu. Under Data Tips and Point Labels, select the column(s) that you want to use for the data tip. 5. To nd the point that corresponds to a given car, bring the data window to the front and click on the row header of the observation that you want to highlight (you can CTRL-click on additional row headers to highlight multiple points, Section 0.4.8). Now return the graphsheet to the front. The car(s) that you selected is indicated by a bold red symbol. 6. To change the symbol style, color or size, right-click on any data point and choose Symbol... from the shortcut menu. 7. Return to the fuel.frame data window and click on Weight, CTRL-click on Mileage, CTRL-click on Type, then click on the Text as Symbol button uses each cars type for symbols. on the Plots 2D palette.. The resulting scatterplot 46 Small 35 Small Small Sporty Small Small Small Small Small 30 Small Sporty Mileage Small Sporty Compact Sporty Compact Sporty Sporty Compact Compact Small Small Compact 25 Small Compact Compact Compact Compact Compact Compact Sporty Medium Compact Compact Medium Medium Medium Compact Medium Medium Medium Compact Medium Van Large Medium Medium Medium Medium 20 Medium Sporty Van Sporty Van Van Van Van Large Van Large 15 1500 2000 2500 3000 Weight 3500 4000 Figure 2.3 End of Example 2.1 Example 2.2 Are vans dierent? It is often informative to use dierent symbols and colors for dierent groups in a scatter plot. 1. Select the columns Weight, Mileage, and Type in this order. 2. On the Plots 2D palette, click on the Scatter button . A scatterplot is drawn with dierent symbols and colors for each Type of car. 3. Add a legend to the plot by clicking on the Auto Legend button toolbar. on the toolbar below the standard 47 35 30 Mileage Compact Large Medium Small Sporty Van 25 20 15 1500 2000 2500 3000 Weight 3500 4000 Figure 2.4 Now we can see that the vans are all heavier and get lower gas mileage, but do not fall outside of the overall pattern made by the other cars. You can move the legend to a dierent location by clicking on the box around the legend (green knobs will appear) and then dragging the box. To change the default symbol styles and colors for each group of symbols, right-click on a data point from a group and select Symbols... from the shortcut menu. Make your changes in the resulting dialog box. End of Example 2.2 Example 2.3 Scatterplot matrices When rst exploring data we often make several dierent scatterplots. A scatterplot matrix is a way to look at all possible combinations of two variables in one step. 1. Bring up the fuel.frame data set. 2. Select Weight, hold down the SHIFT key and click on the header of the second to last column (Fuel). This selects all of the numeric columns. 3. Click on the Scatter Matrix button on the Plots 2D palette. 48 50 100 150 200 250 300 2.5 3.5 4.5 5.5 4000 3500 Weight 3000 2500 2000 1500 300 250 200 150 100 50 35 30 Mileage 25 20 15 5.5 4.5 Fuel 3.5 2.5 1500 2000 2500 3000 3500 4000 15 20 25 30 35 Disp. Figure 2.5 This produces scatterplots using all possible pairs of the variables selected. To determine the y axis for any plot, scan horizontally until you reach the square with a variable name. To determine the x axis for this plot, scan vertically until you reach the square with a variable name. 4. To add a trend line to the scatterplot matrix, right-click on any data point and select Smooth... from the shortcut menu. 5. This brings up the Smooth tab of the Scatter Plot Matrix dialog box; go to the Smoothing Type: eld and select Least Squares. 6. Click OK. 49 50 100 150 200 250 300 2.5 3.5 4.5 5.5 4000 3500 Weight 3000 2500 2000 1500 300 250 200 150 100 50 35 30 Mileage 25 20 15 5.5 4.5 Fuel 3.5 2.5 1500 2000 2500 3000 3500 4000 15 20 25 30 35 Disp. Figure 2.6 End of Example 2.3 2.1.1 Brush and spin Brushing and spinning are dynamic graphics techniques for visualizing data in three or more dimensions. Brushing is a variation of a scatterplot matrix with dynamically linked plots; points selected in one plot are highlighted in other plots. Spinning is a three-dimensional scatterplot that may be spun around to be viewed from dierent angles. The brush and spin windows may be linked, so that points selected in either are highlighted in the other. 1. Bring up the fuel.frame data set. 2. Select Weight, hold down the SHIFT key and click on the header of the second to last column (Fuel). This selects all of the numeric columns. 3. At the menu, select Graph > Brush and Spin.... 4. Click OK. 5. Use the arrows in the upper left to spin the three-dimensional scatterplot. 6. Brush across points in any of the windows to highlight them in all windows. 50 Figure 2.7 2.1.2 Smoothers A smoother is a line that is added to a scatterplot to indicate curvature. There are many dierent algorithms for constructing smoothers, and we refer the interested reader to the online manuals (Guide to Statistics, Vol. 1) for further details. Example 2.4 Scatterplot smoothers We will continue with the Mileage data. 1. Bring up the fuel.frame data set and select Weight, then Mileage. 2. Click on any of the four scatterplot smoother buttons, Loess, Spline, Supersmooth, or Kernel . 51 35 30 Mileage 25 20 15 1500 2000 2500 3000 Weight 3500 4000 Figure 2.8 The added smooth line emphasizes that the relationship has some curvature and is probably not a straight line relationship (the plot in the above gure used the spline smooth). 3. Try the Kernel smoother. This is a case where the default smoothing parameter is poor (really poor), resulting in a smooth that is not smooth enough. We have often found this with the Loess smoother too, for small data sets. To adjust the parameter right-click on any data point and select Smooth... from the shortcut menu. You should be in the Smooth/Sort button. In the Kernel Specs eld, increase the Bandwidth to 500 and change the Kernel to Normal. You can try dierent values by clicking Apply. 4. When you have a curve you are satised with click OK. End of Example 2.4 See the S-PLUS Users Guide chapters on creating plots, exploring data, and editing graphics for more details on the dierent types of graphs you can make, and various options for changing the appearance of your graphs. 52 Command Line Notes Scatterplots are created using the plot command, scatterplot matrices are created with the pairs command. Brushing with or without spinning are done with brush, and spinning alone with spin. Creating scatterplots using dierent symbols for dierent groups is more complicated from the command line. See the chapters on traditional and trellis graphics in the S-PLUS Programmers Guide (specically superposing) if you are interested. A scatterplot with a loess smother can be created using scatter.smooth and the panel.smooth function can be used to add a loess smoother to a scatterplot matrix. Spline smooths are created using smooth.spline, kernel smooths using ksmooth, and Super Smooths using supsmu. > > > > > > > > > > attach(fuel.frame) plot(Weight, Mileage) plot(Weight, Mileage, pch = ".") plot(Weight, Mileage, pch = "*") pairs(fuel.frame) pairs(fuel.frame, panel = panel.smooth) brush(fuel.frame[,1:4]) spin(fuel.frame[,1:4]) scatter.smooth(Weight, Mileage) detach() 2.2 Correlation Correlation measures the linear relationship between two numeric variables. Example 2.5 Computing correlations 1. At the menu, select Statistics > Data Summaries > Correlations.... 2. In the Correlations and Covariances dialog box, for Data Set: select fuel.frame. 3. Select the numeric variables in the Variables: box by clicking on Weight, then scrolling down and SHIFT-clicking on Fuel. All the numeric variables are highlighted. 53 Figure 2.9 4. Click OK. *** Correlations for data in: fuel.frame *** numeric matrix: 4 rows, 4 columns. Weight Disp. Mileage Fuel Weight 1.0000000 0.8032804 -0.8478541 0.8616848 Disp. 0.8032804 1.0000000 -0.6931928 0.7138435 Mileage -0.8478541 -0.6931928 1.0000000 -0.9796598 Fuel 0.8616848 0.7138435 -0.9796598 1.0000000 The correlation between Disp. and Weight is 0.8032. End of Example 2.5 Command Line Notes The cor function computes the correlation of two variables or all possible pairs of correlations for multiple variables. The rmvnorm function generates random data with the specied correlation. > attach(fuel.frame) > cor(Disp., Mileage) > cor(fuel.frame[ ,1:4]) # Correlations for columns 1 through 4 > plot(rmvnorm(100, rho = -0.75)) > detach() Here is a script that produces 10 plots of variables with correlation 0.5. Paste this into a script window (File > New), modify as desired, and run it. n = 1000 rho = .5 for(i in 1:10) xy = rmvnorm(n, rho = rho) plot(xy) 54 2.3 Least-Squares Regression Correlation measures the linear relationship between two variables and least-squares regression nds the best linear equation y = bx + a that describes this relationship. The x variable is also called the explanatory variable, independent variable, or predictor variable. The y variable is also called the response variable or dependent variable. Example 2.6 Predicting a childs growth Table 2.5 in IPS contains data on two measurements of blood sugar in 18 diabetics. We will t a least squares regression line to this data. 1. Load the data: Table2.5. 2. Click on the explanatory variable hba, then CTRL-click on the response variable fpg. 3. On the Plots 2D palette, click on the Linear Fit Line button 4. From the main menu, select Insert > Curve Fit Equation. This adds the equation of the line to the plot. . 350 66.4285 + 10.4077*x 250 fpg 150 50 4.5 7.0 9.5 12.0 hba 14.5 17.0 19.5 Figure 2.10 To move the equation, click on the equation (green knobs appear) and drag it to the desired location. You can also resize the equation by dragging on one of the green knobs. We now continue with a more numerical approach. 5. Select Table2.5 from menu IPSdata > Select Data.... 55 6. Select Statistics > Regression > Linear... from the main menu. 7. In the Linear Regression dialog box, for Data Set: select the name of the data set (Table2.5). 8. Under Variables, for Response: select fpg, and for Explanatory: select hba. (Notesome versions of S-PLUS use Dependent: and Independent:). The Formula: eld should read fpg~hba. Figure 2.11 9. Click on the Results tab to go to the next page of the dialog box. 10. Under Saved Results, choose the same data set (Table2.5) for the Save In: eld and check the boxes next to Fitted Values and Residuals. 56 Figure 2.12 11. Click on the Plot tab. 12. Under Plots, check the Residuals vs Fit box (and any of the other plots that you want to see). 57 Figure 2.13 13. Click OK. The results of this command: A Report window contains the estimates of the slope (10.4077) and intercept (66.4285) along with other information about this regression. *** Linear Model *** Call: lm(formula = fpg ~ hba, data = Table2.5, na.action = na.exclude) Residuals: Min 1Q Median 3Q Max -73.75 -43.49 -5.536 15.61 181.2 Coefficients: Value Std. Error t value Pr(>|t|) (Intercept) 66.4285 46.5231 1.4279 0.1726 hba 10.4077 4.7310 2.1999 0.0429 Residual standard error: 63.82 on 16 degrees of freedom Multiple R-Squared: 0.2322 F-statistic: 4.84 on 1 and 16 degrees of freedom, the p-value is 0.04285 58 The tted values (predictions) and residuals have been added to the data set. Figure 2.14 A graph of the residuals against the tted values, with a scatterplot smooth added, and large residuals highlighted. The smooth used here is a Loess curve, which is resistant to outliers; hence it curves down in the middle even though there are large positive residuals in the middle. 15 150 100 12 Residuals -50 0 50 13 140 160 180 200 Fitted : hba 220 240 260 Figure 2.15 Note that you can create this plot by using the data added to the data set: Select Age, then residuals, then click on the Scatter button on the Plots 2D palette. Then use the methods described earlier in the chapter to add smooths and highlight points. End of Example 2.6 59 Command Line Notes Regression lines are t with the lm function (Linear Models, another term for least squares regression), the detailed output produced with the summary function and the tted values and residuals given with the fitted and resid functions, respectively. The command abline can be used for adding reference lines to a plot. > > > > > > > > Remarks: S-PLUS uses a special formula syntax of the form y~x to specify the x and y variables. The response variable is written to the left of the tilde , and the explanatory variable is written to the right of the tilde. Additional details on the regression output are given in Chapters 10 and 11. fit.diabetes = lm(fpg ~ hba, data = Table2.5) summary(fit.diabetes) attach(Table2.5) plot(hba, fpg) abline(fit.diabetes) plot(hba, resid(fit.diabetes)) abline(h=0) detach() 2.4 Cautions About Correlation and Regression Observing a relationship between two variables does not imply that the explanatory variable causes the response variable. S-PLUS has some tools that allow us to investigate relationships in greater detail. Example 2.7 Mileage, price, and weight of cars We continue to explore the data from Consumer Reports: what is the relationship between Price and Mileage? The data set car.test.frame combines data from cu.summary and fuel.frame. 1. Bring up the car.test.frame data set using Data > Select Data. 2. Create a Linear Fit plot axis. with Mileage on the horizontal (x) axis and Price on the vertical (y) 60 26000 21000 Price 16000 11000 6000 1000 15 20 25 Mileage 30 35 Figure 2.16 Someone ignorant of statistics may interpret this graph to say that improving your gas mileage may also lower the overall price of the car. We know that there could be lurking variables like the size of the car that aect both the price and the gas mileage. 3. CTRL-click on the Weight column. There should now be three columns selected: Mileage, Price, Weight. 4. Toggle on the Set Conditioning Mode button should be a 1. on the Standard toolbar. The adjacent eld The buttons on the Plots 2D palette should now have yellow strips on the tops. Figure 2.17 5. On the Plots 2D palette, click on the Linear Fit button. 61 16 Weight: 2,885.0 to 3,242.5 22 28 34 Weight: 3,265.0 to 3,860.0 20000 10000 Price Weight: 1,845.0 to 2,567.5 Weight: 2,575.0 to 2,885.0 20000 10000 16 22 28 34 Mileage Figure 2.18 This is a trellis plot: it presents four graphs grouped by the size of the cars (measured by their weight). The variable Weight is split into quartiles. The plot in the lower left corner graphs Mileage against Price for those cars with weights in the lowest quartile; the plot in the upper right corner graphs Mileage against Price for those cars with weights in the top quartile. All four plots are on the same scale, so we can see that the cars that are the biggest also cost the most and get the lowest mileage, and the smallest cars are also cheaper and have high gas mileage. The relationship between mileage and price shows up as much less important after taking the size of the cars into account. See the section on Trellis Graphs in Chapter 3 of the S-PLUS Users Guide for more detail. Important: Toggle the Set Conditioning Mode button End of Example 2.7 to o. Command Line Notes Trellis plots can be constructed using the xyplot command. The rst argument is a formula: y~x|cond. The function equal.count breaks up a continuous variable into groups for the conditional plotting (the default is to use overlapping groups, unlike the GUI plot). > xyplot(Price ~ Mileage | equal.count(Weight), + data = car.test.frame) 62 Remarks: To see another good example of correcting for lurking variables, download the SAT data mentioned at the beginning of Chapter 2 in IPS. First plot SAT Math (y) against Teachers Pay (x). Then redo the plot conditioning on Percent Taking the SAT. Other options for exploring two numeric variables in S-PLUS are bubble plots and color plots See the online help for additional information. . 2.5 Exercises These problems use data sets supplied with S-PLUS. Recall the Data > Select Data command (Section 0.4.1). 1. Bring up fuel.frame and create a scatterplot matrix of all the numeric columns to look for relationships. Next, plot Weight on the x axis and Mileage on the y axis (Mileage against Weight) with dierent symbols for each Type. 2. What is the correlation between Price and Mileage in the S-PLUS supplied cu.summary data set? You should include the argument na.method = "omit" to omit the missing observations. 3. Find the regression line of Mileage (y) against Weight (x) from fuel.frame. Plot the residuals against Weight and include a smoothed line. Does the residual plot show any nonlinearity? 63 2.6 Solutions . Select Weight, Mileage, and Type in that order and click on the Scatter button legend button . , then click on the auto 1. From the GUI, select all columns except Type and click on the Scatter Matrix button From the command line, > pairs(fuel.frame[,1:4], panel = panel.smooth) > xyplot(Mileage ~ Weight, data = fuel.frame, + groups = Type, panel = panel.superpose, pch = 1:6, + key = list(columns = 3, text = levels(fuel.frame$Type), + points = list(pch = 1:6))) 2. From the GUI, Figure 2.19 From the command line, > attach(cu.summary) > cor(Price, Mileage, na.method = "omit") [1] -0.6537541 > detach() 3. From the GUI, 64 Figure 2.20 From the command line, > fuel.fit = lm(Mileage ~ Weight, data = fuel.frame) > summary(fuel.fit) ... > scatter.smooth(fuel.frame$Weight, resid(fuel.fit)) > abline(h = 0) There is denite curvature in the residuals. 65 Chapter 3 Producing Data To obtain a simple random sample (SRS) of size n from a population, we need a method that gives every set of n individuals an equal chance of actually being selected. Example 3.1 How to choose an SRS We will use the built-in data set (fuel.frame) (Section 0.4.1) which has 60 observations. Suppose we wish to obtain an SRS of 10 cars. 1. From the main menu select Data > Random Sample .... 2. In the Random Sample of Rows dialog box, for Data Set: select the name of the data set. 3. For Sample Size: type 10. 4. For Save In: type fuel.sample. Figure 3.1 5. Click OK. The result is a data frame with 10 rows from the original 60 rows and all columns of fuel.frame. Notice that the rows are not only a random sample, but that they are returned in random order. If you were doing an experiment, for example, trying one marketing strategy in ve states and a dierent marketing strategy in another ve states, then you could assign the rst treatment to the rst ve states and the second treatment to the last ve states. The states would be assigned randomly to the treatments. End of Example 3.1 66 Remarks: Repeating this example will result in a dierent sample of size 10. To reproduce the same sample, you can specify a seed (Set Seed:), an integer between 0 and 1023 for the random number generator. Any time you specify the same sample size, population, and seed, the exact same sample results. This is useful so you (or someone else) can replicate your results. In the Random Sample of Rows dialog box, if you set the Sample Size: eld equal to the number of rows in your data set (or equivalently, if you leave this eld blank), then S-PLUS returns the entire data set but with the rows permuted in a random order. In some cases, you may have a large data set created outside of S-PLUS and you would like to work on a random sample of this data in S-PLUS. Rather than importing the entire data set into S-PLUS rst and then obtaining an SRS, you can import just a random sample of this data into S-PLUS. For example, suppose you wish to survey a sample of students at your university, and your registrar has provided you with a le containing the names of the entire student body. You could import a random sample of this population into a data frame in S-PLUS and then enter the responses directly into the data frame as you interview this sample of students. Example 3.2 Sampling an input le The workers data set has 71076 rows. This data set is part of the IPSdata library, in two formsalready loaded into S-PLUS, and as a plain text le. Here we import a random sample from the plain text le. 1. At the menu, select File > Import Data.... 2. That brings up the Import From File dialog box, with the Data Specs tab active. Click on Browse, and navigate to the location of workers.txt. This is typically something like c:/Program Files/Insightful/splus62/library/IPSdata/data/workers.txt Figure 3.2 3. Under To, change the Data set: eld to workers.sample. 67 Figure 3.3 4. Click on the Filter tab. 5. Under Filter columns, type samp fixed(100, 71076) in the Filter expression: eld. Figure 3.4 6. Click OK. End of Example 3.2 68 Command Line Notes The function sample takes a vector for its argument as well as the sample size (unlike the GUI command which works on a data frame). > > > > > > > > > attach(fuel.frame) sample(Weight, 10) sample(Weight, 10) # a different sample set.seed(4) sample(Weight, 10) set.seed(4) sample(Weight, 10) # the same sample detach() fuel.frame[sample(60, 10),] 3.1 Exercise Create a new data set and enter the names of at least ten friends or classmates in the rst column. Take a random sample of size ve from the list. 3.2 Solution From the GUI, Figure 3.5 From the command line, > temp = c("Laura", "Greg", "Tim", "David", "George", "Jodi", + "Sarah", "Ferdinand", "Isabella", "Christopher") > sample(temp, 5) [1] "Ferdinand" "Sarah" "David" "Jodi" "Laura" 69 Chapter 4 Probability 4.1 Randomness Chapter 4 in IPS introduces the axioms of probability and the Law of Large Numbers. The Law of Large Numbers says roughly that as we take larger and larger samples from a xed population, the sample means tend to get closer to the true population mean. Example 4.1 Flipping a fair coin To illustrate the Law of Large Numbers, we simulate the ipping of a fair coin a large number of times. 1. Create a new empty data set by clicking the button on the main toolbar (Section 0.4.6). 2. From the main menu, select Data > Random Numbers.... 3. For Target Column: select <END>. 4. Under Sampling, for Sample Size: type 100000, and for Distribution: select binomial. For Seed: type 12. (This step is optional; it should make your output match ours.) 5. Under Distribution Parameters, for Probability: type .5, and for Sample Size: type 1. 70 Figure 4.1 6. Click OK. Figure 4.2 In the newly created column, each row gives the number of heads in one toss of a fair coin. This experiment was repeated 100000 times. For example, the rst toss resulted in a head (1), the second toss resulted in a head (1), the third toss was not a head (0), and so on. 71 Now we want to convert those raw numbers into fractions. We start by creating a column that contains the numbers 1 to 100000. 7. At the menu, select Data > Fill.... 8. In the Fill Numeric Columns dialog box, for Data Set: select the name of the data set, and for Columns: select <END>. 9. Under Fill Options, for Length: type 100000. Figure 4.3 10. Click OK. We next create a column that contains, for each row, the proportion of heads up to that point. 11. From the main menu, select Data > Transform.... For Data Set: select the name of the data set. For Target Column: select <END> . 12. For Expression: type cumsum(V1)/V2. The command cumsum(V1) computes the cumulative sums of the entries in V1, that is, the number of heads so far. Figure 4.4 72 13. Click OK. Figure 4.5 By the 7th ip (row 7), the proportion of heads up to this point is 0.57 (4 out of the 7 ips). We next plot this information. 14. Select V2 for the x axis, CTRL-click on V3 for the y axis. 15. On the Plots 2D palette, click on the Line plot button 1.0 0.9 0.8 V3 0.7 0.6 0.5 0.4 10000 30000 50000 V2 70000 90000 110000 Figure 4.6 To get a better sense of what this plot is telling us, we change the x axis to a logarithmic scale. 73 16. Right-click on the x axis and select Display/Scale from the shortcut menu. 17. This brings up the X Axis dialog box, Display/Scale tab. In the Axis Scaling box, change Scaling: to Log. 18. Click OK. 19. Open the Annotations palette using the the Horizontal Reference Line button . icon from the main toolbar (Section 1.4.2) and click on 20. Add the reference line by clicking at the y = 0.5 tick mark. You can ne-tune by right-clicking on this reference line and choosing Position from the shortcut menu. Type 0.5 in the Position: eld. 1.0 0.9 0.8 V3 0.7 0.6 0.5 0.4 1 10 100 V2 1000 10000 100000 0.65 0.60 V3 0.55 0.50 0.45 10 2 3 4 5 67 100 2 3 4 5 6 7 1000 2 3 4 5 6 10000 2 3 4 5 6 100000 V2 Figure 4.7 74 You can zoom in by right-clicking on the x axis and/or y axis and changing the axis ranges. Shown here is the result of omitting the rst nine observations, and limiting the y axis range. We see that as the number of coin ips increases, the proportion of heads approaches 0.5. End of Example 4.1 4.2 Exercise 1. The procedures described in this chapter can be used for a variety of distributions other than the binomial. The F distribution comes up in the analysis of variance (IPS Chapter 12). The F distribution requires two parameters, the numerator degrees of freedom and the denominator degrees of freedom. If the random variable X follows the F distribution with numerator degree of freedom equal to 41 and denominator degree of freedom equal to 32, nd P (1.5 X 2.4). 4.3 Solution 1. From the GUI, Figure 4.8 From the command line, > pf(c(1.5, 2.4), 41, 32) [1] 0.8809057 0.9939659 > .9939659 - .8809057 [1] 0.1130602 75 Chapter 5 Sampling Distributions In Chapters 2 and 4 we saw how to compute values from the normal and other distributions. In Chapter 5 we learn about the binomial distribution. Distributions like the binomial, Poisson, and hypergeometric are called discrete distributions because they only apply to counts (0, 1, 2, 3, . . .) instead of a continuous range of values like the normal distribution. This means that in addition to computing cumulative probabilities (for example, the probability of getting 5 or fewer heads out of 10 ips of a coin), we can also compute exact probabilities (for example, the probability of getting exactly 5 heads out of 10 ips). Example 5.1 Flipping a biased coin Suppose you have a biased coin that comes up heads 60% of the time instead of 50%. If you toss this coin 10 times, what is the probability of getting no heads? One head? Two heads, and so on? 1. Create a new data set using the button and ll the rst column with the integers from 0 to 10. Entries in this column represent the possible number of heads on 10 ips of the coin. 2. From the main menu, select Data > Distribution Functions.... 3. For Source Column: type the name of the rst column. For Target Column: leave the default name. 4. Under Statistics, for Result Type: select the Probability radio button. binomial. For Distribution: select 5. Under Distribution Parameter, for Probability: enter 0.6, and for Sample Size: enter 10. 76 Figure 5.1 6. Click Apply. (Recall that Apply executes the command but does not close the dialog box.) 7. Change the Result Type: to Density. Notice that the default name for Target Column: becomes Density. 8. Click OK. 77 Figure 5.2 The Probability column gives cumulative probabilities and the Density column gives individual probabilities. For example, the probability of 7 heads or less is 0.83 and the probability of exactly 7 heads is 0.21 To avoid rounding the probabilities for no heads or 1 heads to zero, increase the precision (Section 0.4.7) of the Probability and Density columns using End of Example 5.1 on the toolbar. Command Line Notes The function pbinom gives the cumulative probabilities and dbinom gives the exact probabilities for the binomial distribution. > pbinom(7, size = 10, prob = .6) # P(X <= 7) [1] 0.8327102 > dbinom(7, size = 10, prob = .6) # P(X == 7) [1] 0.2149908 To reproduce the output in the GUI example, type: > cbind(V1 = 0:10, Probability = pbinom(0:10, 10, .6), + Density = dbinom(0:10, 10, .6)) V1 Probability Density [1,] 0 0.0001048576 0.0001048576 [2,] 1 0.0016777216 0.0015728640 ... [10,] 9 0.9939533824 0.0403107840 [11,] 10 1.0000000000 0.0060466176 The column names are optional. 78 5.1 Sampling Distributions It can be useful to see the eects of a sampling distribution by generating your own random numbers. Example 5.2 Generating a sampling distribution We will generate several columns of data from a known distribution, compute the sample means, and see how the distribution of the sample means compares to the population. 1. Create a new data set. 2. At the menu, select Data > Random Numbers.... 3. In the Random Numbers dialog box, for Data Set: select the name of the data set. For Target Column: select <END>. 4. Under Sampling, for Sample Size: type 1000, and for Distribution: select normal. 5. Under Distribution Parameters, for Mean: type 100, and for Std. Deviation: type 10. 6. Click on Apply four times, then click on Cancel. This creates four columns of random numbers, each column coming from a normal distribution with mean 100, standard deviation 10. Next, we create a fth column whose entries are the average of the four columns. 7. At the menu, select Data > Transform.... 8. In the Data Transform dialog box, for Data Set: select the name of the data set. For Target Column: select <END>. 9. For Expression: type (V1+V2+V3+V4)/4. Figure 5.3 10. Click OK. 11. In the data set, select all ve columns by clicking on the rst column header, then shift-clicking on the last column header. 79 12. On the Plots 2D palette, click on the Density plot button . 0.06 0.04 0.02 0.00 60 80 100 120 140 Figure 5.4 This draws ve density plots, one for each column. The height (y-coordinate) for each plot is determined by the relative concentration of values from one column of data, so this plot provides information similar to the histogram. Notice that one plot is much narrower (smaller standard deviation) and taller than the others. This corresponds to the plot from V5, the average of the rst four columns. We know from the theory that averages have less variability than individuals. Repeat this example but try some of the other distributions available in S-PLUS. For instance, the gamma distribution with shape parameter (mean) 3 will show the sampling distribution to be narrower and more symmetric. End of Example 5.2 We have already shown (Section 1.3) how to use S-PLUS to compute normal probabilities. S-PLUS also computes probabilities for other distributions, including the binomial, t, chi-square, and F distributions (all introduced in IPS in later chapters). Example 5.3 The t distribution The t distribution is important in inference when the population follows a normal distribution with unknown mean and standard deviation . Random variables that follow a t distribution must also have a parameter, the degrees of freedom, to complete its description. If the random variable T follows a t distribution with 43 degrees of freedom, nd the probability that T is less than 0.75, and nd the probability that T is greater than 3.1; that is, nd P (T 0.75) and P (T > 3.1). 1. Create a data set with .75 and 3.1 entered into the rst column. 2. At the menu, select Data > Distribution Functions.... 3. In the Distribution Functions dialog box, for Data Set: select the name of the data set. For Source Column: select the name of the column. 4. For Target Column: type a name for a new column (the default is Probability). The probabilities are saved to a newly created column. 80 5. Under Result Type, check the radio button next to Probability. For Distribution: select t. 6. For Deg. of Freedom 1: type 43. Figure 5.5 7. Click OK. S-PLUS gives us P (T .75) = 0.77 and P (T 3.1) = 1. Increase the precision of the Probability column to 4 by selecting the column and clicking on the Increase Precision button on the data set toolbar (Section 0.4.7). We then see P (T .75) = 0.7713 and P (T 3.1) = 0.9983, so P (T > 3.1) = 1 0.9983 = 0.0017. End of Example 5.3 Command Line Notes Functions to generate random numbers start with the letter r and are followed by the (abbreviated) name of the distribution (rnorm, rt, rbinom, rf). The same names prefaced with p give the probabilities (pnorm, pt, pbinom, pf). > coin = cumsum(rbinom(100000, 1, .5)) / (1:100000) > plot(1:100000, coin, log="x", type="l") > abline(h=.5, lty=4) > samp.gamma = matrix(rgamma(9*10000, 3), ncol=9) > m.gamma = rowMeans(samp.gamma) > d1 = density(samp.gamma) > d2 = density(m.gamma) > plot(d1, ylim = range(d2$y), type="l") > lines(d2, lty=4) > abline(v=3, lty=2) > pt(c(0.75, 3.1), 43) [1] 0.7713306 0.9982958 81 Chapter 6 Introduction to Inference Chapter 6 introduces condence intervals and hypothesis tests. In this chapter, we make the assumption that the population follows a normal distribution with unknown mean but known standard deviation . 6.1 Condence Intervals and Hypothesis Tests If we obtain an SRS of size n with sample mean x from a normal population, N (, ), then a C level condence interval is given by x z , n where z is the z-critical value, the value such that C100% of the area under the normal density curve occurs between z and z . We give the critical values for the most common condence levels. C 100% 90 95 99 Table 6.1 Critical values (1 C)/2 (upper tail area) 0.05 0.025 0.005 z 1.645 1.960 2.560 To perform a hypothesis test with H0 : = 0 , we form the z-test statistic, z= x 0 / n which will follow the standard normal distribution N (0, 1) under the null hypothesis. S-PLUS does not have a command for computing condence intervals or the z-test statistic based on a given known population standard deviation. You can use S-PLUS to nd the mean of a set of data and then calculate the condence interval or test statistic by hand (see the Command Line Notes). Example 6.1 Finding a condence interval Suppose you weigh yourself once a week for 10 weeks and record the following measurements: 190.5 189.0 195.5 187 189 190 190.5 192.5 191 189.5 Assume that your weight follows a normal distribution with standard deviation = 3. To compute a 90% condence interval, we need to nd the mean of these ten numbers. 1. Create a new data set using the button and enter the data (Section 0.4.6). 82 2. At the menu, select Statistics > Data Summaries > Summary Statistics.... 3. On the Data tab, for Data Set: enter select the name of the data set, and for Variables: select the column that holds the data. 4. On the Statistics tab, select Mean and clear the other options. Figure 6.1 5. Click OK. *** Summary Statistics for data in: SDF1 *** weight Mean: 190.45 To compute a 90% condence interval for your true mean weight, using Table 6.1 (or Section 1.3) to note 3 the z-critical value 1.645; then calculate 190.45 1.645 = (188.9, 192.0). 10 End of Example 6.1 Command Line Notes > weight = c(190.5, 189, 195.5, 187, 189, 190, 190.5 + 192.5, 191, 189.5) > mean(weight) [1] 190.45 > 190.45 + c(-1,1) * 1.645 * 3/sqrt(10) # CI [1] 188.8894 192.0106 > round(.Last.value, 1) # Round to 1 decimal digit [1] 188.9 192.0 83 Example 6.2 Two-sided hypothesis test Continuing with the data from Example 6.1, suppose your driverss license states that your weight is 192. We wish to test the hypothesis that these observations are compatible with a mean = 192: H0 : = 192 against Ha : = 192 Use a calculator or the S-PLUS command line to compute the z-statistic: z= x 190.45 192 = = 1.633 / n 3/ 10 Recalling Section 1.3 or Section 4.1, we can use S-PLUS to compute the probability P (Z 1.633) where Z has a N (0, 1) distribution. This probability is 0.051 and thus, the P -value is 2 0.051 = 0.102, which does not give strong evidence that your true mean weight diers from 192 pounds. End of Example 6.2 6.2 Power The power of a signicance test is a measure of its ability to detect an alternative hypothesis. It is the probability of rejecting the null hypothesis given that the alternative hypothesis is true. In order to compute the power of a signicance test, we need to know the hypothesized mean 0 , the alternative hypothesis, the standard deviation , the signicance level , and the sample size. Example 6.3 Power Researchers measure the percent change in total body bone mineral content of young women. They assume that = 2 and test the hypothesis H0 : = 0 against Ha : > 0. Find the power of the test if the alternative mean is = 1, 25 subjects are used, and = 0.05. 1. At the menu, select Statistics > Power and Sample Size > Normal Mean.... 2. For Compute: check Power and for Sample Type: select One Sample. 3. Under Probabilities, for Alpha(s): type 0.05. 4. Under Sample Sizes, for N1: type 25. 5. Under Standard Deviations, for Sigma1: type 2. 6. Under Null Hypothesis, for Mean: type 0. 7. Under Alternative Hypothesis, for Alt. Mean type 1 and for Test Type select greater. 84 Figure 6.2 8. Click OK. *** Power Table *** mean.null sd1 mean.alt delta alpha power n1 1 0 2 1 1 0.05 0.8037649 25 The power of the test is 0.80. End of Example 6.3 Command Line Notes The normal.sample.size command computes the power of a test. > normal.sample.size(mean = 0, mean.alt = 1, sd1 = 2, + n1 = 25, alpha = .05) + alternative = "greater") mean.null sd1 mean.alt delta alpha power n1 1 0 2 1 1 0.05 0.8037649 25 85 6.3 Exercises 1. A runner checks her pulse immediately upon awakening on seven randomly chosen mornings. She records 56 58 52 53 58 52 56 in beats per minute. Assume that the standard deviation is known to be = 2 beats per minute. Find a 90% condence interval for the true mean pulse rate of this runner. 2. Suppose the runner in the above example wishes to test the hypothesis that H0 : = 53 against H0 : > 53. What is the power of the test if the alternative mean is 56 beats per minute? (Use = .01.) 86 6.4 Solutions 1. From the command line, > pulse = c(56, 58, 52, 53, 58, 52, 56) > mean(pulse) + c(-1,1) * 1.645 * (1/sqrt(7)) [1] 54.37825 55.62175 2. From the GUI, Figure 6.3 From the command line, > normal.sample.size(mean = 53, mean.alt = 56, + sd1 = 2, n1 = 7, alpha = .01, + alternative = "greater") *** Power Table *** mean.null sd1 mean.alt delta alpha power n1 1 53 2 56 3 0.01 0.9497339 7 87 Chapter 7 Inference for Distributions This chapter introduces the t distribution and its use in nding condence intervals and performing signicance tests for one or two population means. The t distribution is necessary when samples arise from normal populations with unknown standard deviation . We encourage you to use Chapter 14 together with this chapter, and in particular Section 14.1. The approach taken there, based on computer simulation and visualization, complements the more mathematical approach taken here, and may help you better understand the concepts in this chapter. It also provides a way for you to check your work. 7.1 7.1.1 Inference for the Mean of a Population Condence intervals Suppose we have a normal population with unknown mean and standard deviation . Draw a simple random sample of size n and compute the sample mean x and sample standard deviation s. A level C condence interval for is s x t n where t is the upper (1 C)/2 critical value for t(n 1), the t distribution with n 1 degrees of freedom. Example 7.1 Corn soy blend (CSB)IPS Example 7.1 Researchers wished to evaluate appropriate vitamin C levels (mg/100gm) in a random sample of eight batches of CSB from a production run: 26 31 23 22 11 22 14 31 Find a 95% condence interval for the mean vitamin C content of CSB produced during this run. 1. Open the Example7.1 data set from the IPSdata library (Section 0.4.2). 2. At the menu, select Statistics > Data Summaries > Summary Statistics.... 3. In Data Set: enter Example7.1. In Variables: select vitc. 4. On the Statistics tab, check Mean, Conf. Limits for Mean, and Std. Deviation. 5. The default condence level is 95%. If you want a dierent condence level, type this value in the Conf. Level: eld. 88 Figure 7.1 6. Click OK. *** Summary Statistics for data in: Example7.1 *** vitc Mean: 22.500000 Std Dev.: 7.191265 LCL Mean: 16.487952 UCL Mean: 28.512048 Thus, we see a wide 95% condence interval of (16.49, 28.51). The interval is found by computing 22.50 7.19 t , where t is the upper 0.025 critical value on seven degrees of freedom. 8 End of Example 7.1 We just obtained condence intervals using the Summary Statistics menu. We can also obtain them from the same menu we use for hypothesis tests, the One Sample > t Test... menu. 7.1.2 Hypothesis tests x 0 . s/ n 89 To test the hypothesis H0 : = 0 , compute the one-sample t statistic t= This test statistic has a t(n 1) distribution under the null hypothesis if the population is normal. In S-PLUS, the t test procedure performs a hypothesis test and computes a condence interval. Example 7.2 Corn soy blend, continued Suppose we wish to test the hypothesis that the mean vitamin C content for the production run in Example 7.1 is 40mg/100g: H0 : = 40 against Ha : = 0. 1. At the menu, select Statistics > Compare Samples > One Sample > t Test.... 2. In the Data box, for Data Set enter Example7.1, and for Variable: select vitc. 3. In the Hypotheses box, for Mean Under the Null Hypothesis: enter 40 and for Alternative Hypothesis: select two.sided. Figure 7.2 4. Click OK. One-sample t-Test data: vitc in Example7.1 t = -6.883, df = 7, p-value = 0.0002 alternative hypothesis: mean is not equal to 40 95 percent confidence interval: 16.48795 28.51205 sample estimates: mean of x 22.5 With a P -value of 0.0002, we reject the null hypothesis at the = 0.05 signicance level. Notice that in the t test report, S-PLUS states the alternative hypothesis associated with this particular test. It does not mean that this is the conclusion that S-PLUS is drawing. To perform a one-sided test, repeat the above steps, but in step 3 select less for Alternative Hypothesis:. 90 One-sample t-Test data: vitc in Example7.1 t = -6.883, df = 7, p-value = 0.0001 alternative hypothesis: true mean is less than 40 95 percent confidence interval: NA 27.31696 sample estimates: mean of x 22.5 In the case of a one-sided alternative hypothesis, S-PLUS returns a one-sided condence interval. End of Example 7.2 Command Line Notes The function t.test performs the t test and computes a condence interval. For a one sample test, it requires a vector for the rst argument. > t.test(Exercise7.1$vitc, mu = 40) By default, t.test assumes a .95 condence level, a null hypothesis of = 0, and a two-sided test (the command t.test(Exercise7.1$vitc) would invoke these defaults). The options conf.level, mu, and alternative can be used to change these defaults. For example, to compute a 99% condence interval or to test H0 : = 35 against Ha : < 35, type > t.test(Exercise7.1$vitc, mu = 35, conf.level = .99, alternative = "less") 7.1.3 Matched pairs t procedures In a matched pairs design, subjects are matched in pairs, and each treatment is given to one subject in each pair. In many cases, the same person is used for each pair and the responses are before-and-after measurements. The one-sample t procedure is applied to the observed dierences in the responses. Example 7.3 French language test Twenty French teachers took a language test before and after an intense four-week summer French course. Did the course improve the test scores of these teachers? 1. Open the Exercise7.41 data set from the IPSdata library. Let denote the true mean of the dierence in response times (pre-post). We wish to test the hypothesis H0 : = 0 against H0 : > 0. 2. At the menu, select Statistics > Compare Samples > Two Samples > t Test.... 3. For Data Set: enter Exercise7.41. 4. For Variable 1: select post, and for Variable 2: select pre. 91 5. For Type of t Test select Paired t. 6. In the Hypotheses box, for Mean Under Null Hypothesis: type 0 and for Alternative Hypothesis: select greater. Figure 7.3 7. Click OK. Paired t-Test data: x: post in Exercise7.41 , and y: pre in Exercise7.41 t = 3.8649, df = 19, p-value = 0.0005 alternative hypothesis: mean of differences is greater than 0 95 percent confidence interval: 1.381502 NA sample estimates: mean of x - y 2.5 With t = 3.86 and P < 0.01, we conclude the improvement in scores was signicant. S-PLUS returns a one-sided 95% condence interval (1.38, ). Remark: In the data set provided (Exercise7.41), the dierence in scores (gain) was also present in the data set. Hence you could then do this problem as a one-sample t test, using the menu (Statistics > Compare Samples > One sample > t test...) for the variable gain. In other situations you can use the Transform menu to compute dierences. Select the Data > Transform... menu; for Target Column: eld, type a name for the new column to store dierences (for this example gain). For Expression:, type the algebraic expression for the dierence, for example, post - pre. 92 Figure 7.4 End of Example 7.3 Command Line Notes The t.test command can be used for the matched-pairs t test. > attach(Exercise7.41) > t.test(post, pre, paired = T, alternative = "greater") > detach() 7.1.4 Inference for nonnormal populations See Chapters 14 and 15 for ways to handle nonnormal populations. 7.2 Comparing Two Means We next consider two-sample problems in which we wish to compare the means of two independent, normal populations. Draw a sample of size n1 from a normal population with true mean 1 , and a sample of size n2 from another normal population with unknown 2 . Let x1 and x2 denote the sample means, and s1 and s2 denote the sample standard deviations. To test the hypothesis H0 : 1 2 = 0, we form the two-sample t statistic x1 x2 standard error and use P -values or critical values for a t distribution. A condence interval for the true mean dierence 1 2 is given by t= (1 x2 ) t standard error x If the population standard deviations are assumed to be equal, 1 = 2 , then standard error = sp where s2 denotes the pooled variance p 93 1 1 + n1 n2 s2 = p ((n1 1)s2 + (n2 1)s2 ) 1 2 n1 + n2 2 and the degrees of freedom for the t distribution is n1 + n2 2. If the population standard deviations are not assumed to be equal, 1 = 2 , then standard error = and the degrees of freedom is given by s2 s2 1 + 2 n1 n2 s2 1 1 n1 1 n1 2 2 s2 s2 1 + 2 n1 n2 df = + 1 s2 2 n2 1 n2 . 2 Example 7.4 Degree of reading power (IPS Example 7.14, Table 7.4) In an experiment to determine whether certain reading activities help school children improve reading ability, a group of 21 students receive a treatmentthe reading activitieswhile a group of 23 students follows their usual curriculum. Let c and t denote the true mean reading score for the control group and the treatment group, respectively. We wish to test the hypothesis H0 : t c = 0 against Ha : t c > 0. 1. Open the data set Table7.4 from the IPSdata library. Note that columns one and two give the same information: whether a particular individual belonged to the treatment group or control group. Column two (a factor, see Section 0.4.5) uses character strings to identify the two levels while column three uses an integer coding. 2. At the menu, select Statistics > Compare Samples > Two Samples > t Test .... 3. For Data Set: enter Table7.4. 4. For Variable 1: select drp, and for Variable 2: select group. Check Variable 2 is a Grouping Variable. 5. In the Test box, make sure the box Equal Variances is not checked. 6. In the Hypotheses box, for Mean Under Null Hypothesis: keep the default 0 and select less for Alternative Hypothesis:. 94 Figure 7.5 7. Click OK. Welch Modified Two-Sample t-Test data: x: drp with group = Control, and y: drp with group = Treat t = -2.3109, df = 37.8554006594391, p-value = 0.0132 alternative hypothesis: difference in means is less than 0 95 percent confidence interval: NA -2.691293 sample estimates: mean of x mean of y 41.52174 51.47619 The t test statistic is 2.3 based on 37.9 degrees of freedom. The P -value 0.0132 indicates that we can conclude that the reading activities improve scores at the = 0.05 signicance level. Note that for the t statistic and condence interval, S-PLUS computed the dierence xc xt (rather than xt xc as done in IPS) so that the signs are reversed. End of Example 7.4 Command Line Notes > > + > > + > attach(Table7.4) t.test(drp[group == "Control"], drp[group == "Treat"], alternative = "less", var.equal = F) # The following switches groups, and uses "greater" t.test(drp[group == "Treat"], drp[group == "Control"], alternative = "greater", var.equal = F) detach(Table7.4) 95 7.3 Exercises 1. Exercise 6.71 gives readings for radon detectors in a test setting. In that exercise you are told to Assume (unrealistically) that you know the standard deviation. . . Exercise 7.37 follows up by dropping that unrealistic assumption. It asks: Is there convincing evidence that the mean reading of all detectors of this type diers from the true value of 105? Carry out a test in detail and write a brief conclusion. Perform an appropriate hypothesis test and calculate a condence interval. 2. Examples 7.197.21 in IPS describe an experiment to determine whether or not calcium intake aects blood pressure. The data are in Table7.5 in the IPSdata library. Perform a signicance test to see whether or not calcium reduces blood pressure. 96 7.4 Solutions 1. Open Exercise7.37 from the IPSdata library. From the menu, select Statistics > Compare Samples > One Sample > t Test.... Figure 7.6 > t.test(Exercise7.37$radon, mu = 105) 2. Open Table7.5 from the IPSdata library. From the menu, select Statistics > Compare Samples > Two Samples > t Test.... For Variable 1: select dec (decrease), for Variable 2: select group, and check the Variable 2 is a Grouping Variable box. Figure 7.7 > attach(Table7.5) > t.test(dec[group == "Calcium"], dec[group == "Placebo"], + alternative = "greater", var.equal = F) > attach(Table7.5) 97 Chapter Order We encourage teachers to consider using Chapter 14 (on bootstrap methods and permutation tests) next, to complement the material in Chapter 7 and later. The reasons are given in the preface. We also encourage students to use Chapter 14 to supplement other chapters, even if that chapter is not included in your course. Bootstrap methods and permutation tests provide concrete visual representations, based on computer simulation, for some of the more abstract and dicult concepts treated in the later chapters. This complements the traditional mathematical approach, and may help you understand the concepts better. 98 Chapter 8 Inference for Proportions In the previous chapters, we focused on inference for a population mean . In this chapter, we focus on counts and proportions. We encourage you to use Chapter 14 together with this chapter, and in particular Section 14.3. The approach taken there, based on computer simulation and visualization, complements the more mathematical approach taken here, and may help you better understand the concepts in this chapter. It also provides a way for you to check your work. 8.1 Inference for a Single Proportion If we let p denote the true proportion of the population that has a certain outcome (we will call this outcome a success), then we estimate p by taking a simple random sample of size n and computing the sample proportion, p= count of successes in the sample X = n n p(1 p) . The n calculations in IPS for condence intervals and the P -values for signicance tests are based on this assumption. For hypothesis testing, to test H0 : p = p0 , we use the z test statistic For large n, we know that the sampling distribution of p is approximately N p, z= p p0 p0 (1 p0 ) n The Proportions Test in S-PLUS can be used for hypothesis testing and for nding condence intervals. Example 8.1 Buons coin tosses French naturalist Count Buon (17071788) tossed a coin n = 4040 times and came up with 2048 heads. 2048 The sample proportion is p = = 0.5069. Is this strong evidence that Buons coin was not a fair coin? 4040 We wish to test H0 : p = .50 against Ha : p = .50. In this example p0 = 0.5. 1. At the menu, select Statistics > Compare Samples > Counts and Proportions > Proportions Parameters.... This brings up the Proportions Test dialog box. 2. In the Data box, for Success Variable: type 2048, and for Trials Variable: type 4040. 3. In the Hypotheses box, for Proportions Variable: type .5, and for Alternative Hypothesis: select two.sided. 99 4. For a 95% condence interval leave the Condence Level: eld at the default of .95. Figure 8.1 5. Click OK. 1-sample proportions test with continuity correction data: 2048 and 4040 X-square = 0.7488, df = 1, p-value = 0.3869 alternative hypothesis: P(success) in Group 1 is not equal to 0.5 95 percent confidence interval: 0.4913912 0.5224569 sample estimates: propn in Group 1 0.5069307 The P -value for the test is P = 0.39, so we do not reject the null hypothesis. Note that the Proportions Test command also outputs a 95% condence interval: we are 95% condent that the true proportion of heads from this coin is in the interval (0.491, 0.522). This condence interval is p(1 p) computed using the formula p z , where z denotes the relevant z critical value. n End of Example 8.1 Remarks: The proportions test in S-PLUS uses a continuity correction in performing the calculations. This improves the accuracy of the results but means that the S-PLUS output does not correspond exactly to the results obtained in the text. Clear the Yates Continuity Correction checkbox if you wish to see the results without this correction. 100 To improve accuracy, another estimator of the population proportion is the Wilson estimate of the population proportion, namely X +2 p= , n+4 where X represents the number of successes. The corresponding standard error is SEp = p(1 p) . n+4 To compute condence intervals using the Wilson estimate, just increase the Success Variable: entry by 2 and the Trials Variable: by 4 in the Proportions Test dialog box. For the Buon coin problem, this would mean typing 2050 in the Success Variable: eld and 4044 in the Trials Variable: eld. Command Line Notes The function prop.test performs the proportions test. > prop.test(2048, 4040) This produces output similar to that above. From the command line, you have the option of extracting just the P -value or condence interval. > prop.test(2048, 4040)$p.value [1] 0.3868684 > prop.test(2048, 4040)$conf.int [1] 0.4913912 0.5224569 attr(, "conf.level"): [1] 0.95 If you wish to specify dierent hypotheses or a dierent condence level, say for example, H0 : p = 0.6 against Ha : p < 0.6 at = 0.01, then type > prop.test(2048, 4040, p = .6, alternative = "greater", + conf.level = .99) 8.2 Two Sample Proportions In this section, we compare the proportion of successes in two independent samples. If we let p1 and p2 denote the true proportion of successes for Population 1 and Population 2, respectively, we perform inference on the dierence p1 p2 . Example 8.2 Binge drinking study This example considers a study of binge drinking behavior in college students. The random variable X is the number of students classied as binge drinkers. Pop. 1 (men) 2 (women) Total n n1 = 7180 n2 = 9916 17096 X 1630 1684 3314 p = X/n. p1 = 0.227 p2 = 0.170 0.194 We are testing H0 : p1 = p2 against Ha : p1 = p2 . We rst need to create a data set that holds this information. 1. Create a new data set (Section 0.4.6). 101 2. Enter the counts X into the rst column, then the sample sizes n into the second column and label the columns X and n, respectively. 3. At the menu, select Statistics > Compare Samples > Counts and Proportions > Proportions Parameters .... This brings up the Proportions Test dialog. 4. In the Data box, for Data Set: enter the name of the data set. 5. For Success Variable: select X and for Trials Variable: select n. Figure 8.2 6. Click OK. 2-sample test for equality of proportions with continuity correction data: X and n from data set SDF13 X-square = 86.8062, df = 1, p-value = 0 alternative hypothesis: two.sided 95 percent confidence interval: 0.04488666 0.06949925 sample estimates: propn in Group 1 propn in Group 2 0.2270195 0.1698265 With a P -value of 0, we conclude that there is strong evidence that the proportion of men who binge drink is not the same as the proportion of women who binge drink. From the report, we see that a 95% condence interval for the true dierence in proportions is (0.045, 0.069). Note that this interval does not contain 0. End of Example 8.2 102 Command Line Notes For two independent samples, provide prop.test with two vectors: the rst one contains the successes (X), the second contains the sample sizes (n) corresponding to the successes. > prop.test(c(1630, 1684), c(7180, 9916)) 8.3 Exercises 1. A poll taken during 2001 indicated that about 73% of all Americans were extremely worried about privacy over the Internet. You suspect that a higher proportion of students at your college the have same concern. You poll a random sample of 300 students at your school and nd 240 of those questioned are extremely worried about privacy over the Internet. What can you conclude from your data? 2. Suppose your survey of privacy issues over the Internet included 138 women and 162 men, of which 120 of the women and 110 of the men were extremely worried about privacy issues. Is there evidence that men and women have dierent concerns about privacy issues over the Internet? 103 8.4 Solutions 1. From the GUI, Figure 8.3 From the command line, > prop.test(240, 300, p = .73, alternative = "greater") 1-sample proportions test with continuity correction data: 240 out of 300, null probability 0.73 X-square = 7.1072, df = 1, p-value = 0.0038 alternative hypothesis: P(success) in Group 1 is greater than 0.73 95 percent confidence interval: 0.7576395 1.0000000 sample estimates: propn in Group 1 0.8 A P -value of 0.0038 gives strong evidence that the proportion of students who are concerned about privacy issues over the Internet is higher than the general population. 104 2. From the GUI, Figure 8.4 From the command line, > prop.test(c(120, 110), c(138, 162)) 2-sample test for equality of proportions with continuity correction data: c(120, 110) out of c(138, 162) X-square = 14.0794, df = 1, p-value = 0.0002 alternative hypothesis: two.sided 95 percent confidence interval: 0.0925987 0.2885070 sample estimates: propn in Group 1 propn in Group 2 0.8695652 0.6790123 A P -value of 0.0002 gives strong evidence that there is a dierence in the proportions of men and women at your college who have concerns about privacy issues over the Internet. 105 Chapter 9 Inference for Two-Way Tables In this chapter, we consider two related questions: a) Is the distribution of a categorical variable dierent for several independent populations? b) Is there an association between two categorical variables from the same population? 9.1 Two-Way Tables and the Chi-Square Test Two-way tables (also called contingency tables) are a convenient way to summarize categorical variables. The categorical variables can arise in two dierent settings. A researcher may have a single categorical variable on samples taken from two or more independent populations. For instance, a sociologist surveys a random sample of 100 low-income, 100 medium-income, and 100 high-income individuals to compare their primary means of obtaining news. A test of homogeneity addresses the question, Is the proportion of individuals who read the daily newspaper the same for all three populations? Or, a researcher may have several categorical variables obtained from a single population. For example, an educator conducts a survey of 100 students at a high school and obtains information on their gender and their level of extracurricular activity. A test of association (also called a test of independence) addresses the question, Is there an association between gender and level of participation in extracurricular activity? Example 9.1 Binge drinking and gender A study compared binge drinking behavior and gender: Frequent binge drinker Yes No Total Men 1630 5550 7180 Women 1684 8232 9916 Total 3314 13782 17096 We can enter the data into S-PLUS in one of three ways: 1) a contingency table similar to the one above, but leaving out the (marginal) totals; 2) three columns: a categorical variable for gender, a categorical variable for outcome, and a numeric variable giving counts; 3) a single row for each subject with two entries (one for the row response and one for the column response). The third option is usually only seen when there are many dierent variables collected. In the third case we could add a column of 1s as the counts, and it would be a variation on the second method. 1. Create two new data sets by entering the data using the rst and second methods above. 106 Figure 9.1 Remark: You can convert between these two types of data entry using Data > Restructure > Stack and Data > Restructure > Unstack. Figure 9.2 Now, we conduct a test to see if the proportion of binge drinkers is the same for both sexes. First, use the data entered as a contingency table (SDF15 in Figure 9.1). 2. At the menu, select Statistics > Compare Samples > Counts and Proportions > Chi-square test.... This brings up the Pearsons Chi-Square Test dialog. 107 3. For Data Set: enter the name of the data (for example, SDF15). 4. Check the box labeled Data Set is a Contingency Table. The two Variables elds turn gray. 5. Click OK. Figure 9.3 The results are sent to a Report window: Pearsons chi-square test with Yates continuity correction data: SDF15 X-square = 86.8062, df = 1, p-value = 0 From the P -value of 0, we reject the null hypothesis that the proportion of binge drinkers is the same for both men and women. We repeat the analysis using the data entered into the data frame via the second method (SDF16 in Figure 9.1). 6. At the menu, select Statistics > Data Summaries > Crosstabulations. 7. In the Crosstabulations dialog box, select the name of the Data Set:. 8. Click on Binge, then CTRL-click on Sex in the Variables: box. Select Count in the Counts Variable: box (this would be left empty if there were one row per subject). 9. Click OK. 108 Figure 9.4 The results that are sent to the report window include the test, the table, and the marginal and conditional distributions. 109 *** Crosstabulations *** Call: crosstabs(formula = Count ~ Binge + Sex, data = SDF16, na.action = na.fail, drop.unused.levels = T) 17096 cases in table +----------+ |N | |N/RowTotal| |N/ColTotal| |N/Total | +----------+ Binge |Sex |Men |Women |RowTotl| -------+-------+-------+-------+ Yes |1630 |1684 |3314 | |0.49 |0.51 |0.19 | |0.23 |0.17 | | |0.095 |0.099 | | -------+-------+-------+-------+ No |5550 |8232 |13782 | |0.4 |0.6 |0.81 | |0.77 |0.83 | | |0.32 |0.48 | | -------+-------+-------+-------+ ColTotl|7180 |9916 |17096 | |0.42 |0.58 | | -------+-------+-------+-------+ Test for independence of all factors Chi^2 = 87.17176 d.f.= 1 (p=0) Yates correction not used End of Example 9.1 Command Line Notes To do the chi-square test from the command line, you rst need to create a matrix. Then, to do Cross tabulations, the data should be in a data frame. > men = c(1630, 5550) > women = c(1684, 8232) > binge = cbind(men, women) # This is a matrix (not data frame) > chisq.test(binge) ... > binge2 = data.frame(Sex = rep(c("Men", "Women"), each = 2), + Binge = rep(c("Yes", "No"), 2), + Count = c(men, women)) > crosstabs(Count ~ Binge + Sex, data = binge2) 110 Example 9.2 Chi-square test for raw data In many settings data come in a raw form, not summarized in a two-way table. An example is the S-PLUS built-in data set market.survey, containing the results of a marketing survey of 1000 households conducted for AT&T. Open the data set market.survey (Section 0.4.1) and note that it consists of 10 variables and 1000 observations. Figure 9.5 The variable pick has two levels: ATT, if the household surveyed chose AT&T for their telephone service, and OCC, if the household chose a dierent company. The variable education has ve levels based on the highest level of education achieved by the respondent. Suppose you want to know if there is a relationship between the choice respondents made in telephone service and their education level. That is, H0 : There is no association between choice of telephone service and education level, against Ha : There is an association between choice of telephone service and education level. To perform a chi-square test on these two variables, 1. At the menu, select Statistics > Compare Samples > Counts and Proportions > Chi-square test.... This brings up the Pearsons Chi-Square Test dialog. 2. For Data Set: select market.survey. 3. For Variable 1: select pick, and for Variable 2: select education. 4. Click OK. Figure 9.6 111 Pearsons chi-square test without Yates continuity correction data: pick and education from data set market.survey X-square = 28.623, df = 5, p-value = 0 Note that Yates continuity correction is not used here; it is only used for a two-by-two table. We reject the null hypothesis and conclude that there is an association between choice of service and education level. In addition to the Report window, there is a warning message indicating that one or both of the variables contained missing observations (NAs). S-PLUS automatically excludes them from the analysis. If you want to see more detail on the analysis, you can create a table of cross-classied data. 5. At the menu, select Statistics > Data Summaries > Crosstabulations.... This brings up the Crosstabulations dialog. 6. For Data Set:, type market.survey. 7. For Variables: select pick and education (hold down the CTRL key to make these multiple selections). 8. For Method to Handle Missing Values: select omit. Figure 9.7 9. Click OK. 112 Call: crosstabs(formula = ~ pick + education, data = market.survey, na.action = na.exclude, drop.unused.levels = T) 965 cases in table +----------+ |N | |N/RowTotal| |N/ColTotal| |N/Total | +----------+ pick |education |<HS |HS |Voc |Coll |BA |>BA |RowTotl| -------+-------+-------+-------+-------+-------+-------+-------+ OCC | 96 |180 | 35 | 82 | 60 | 21 |474 | |0.2 |0.38 |0.074 |0.17 |0.13 |0.044 |0.49 | |0.63 |0.5 |0.65 |0.44 |0.4 |0.35 | | |0.099 |0.19 |0.036 |0.085 |0.062 |0.022 | | -------+-------+-------+-------+-------+-------+-------+-------+ ATT | 57 |181 | 19 |105 | 90 | 39 |491 | |0.12 |0.37 |0.039 |0.21 |0.18 |0.079 |0.51 | |0.37 |0.5 |0.35 |0.56 |0.6 |0.65 | | |0.059 |0.19 |0.02 |0.11 |0.093 |0.04 | | -------+-------+-------+-------+-------+-------+-------+-------+ ColTotl|153 |361 |54 |187 |150 |60 |965 | |0.16 |0.37 |0.056 |0.19 |0.16 |0.062 | | -------+-------+-------+-------+-------+-------+-------+-------+ Test for independence of all factors Chi^2 = 28.62297 d.f.= 5 (p=0.00002748984) Yates correction not used We see that there were 965 households that answered both questions of interest (pick and education). Find the cell corresponding to row ATT and column BA. We note that 491 of the 965 respondents chose AT&T, and of these, 90 or 18% (from 90/491 = .1832, rounded to 0.18 in the print-out), had a B.A. degree. We also note that of those surveyed, 150 had a B.A. degree, and of these 90, or 60% (from 90/150 = 0.60), chose AT&T. S-PLUS does not report the expected counts for each cell. End of Example 9.2 Command Line Notes The function chisq.test performs the chi-square test, and crosstabs creates cross-tabulations. > chisq.test(market.survey$pick, market.survey$education) > crosstabs(~ pick + education, data = market.survey, + na.action = na.omit) 9.2 Exercises 1. A college dean wonders if student participation in extracurricular sports is related to their area of study. A questionnaire sent to a simple random sample of 140 students at his college provides the following information: 113 Math/Science Social Science Humanities Fine Arts No participate 32 43 22 20 Yes participate 15 20 10 8 Based on this data, what can this dean conclude? 2. Using the market.survey data, determine whether or not there is an association between choice of telephone service and income. 9.3 1. Solutions Figure 9.8 At the command line, > > > > yes = c(32, 43, 22, 20) no = c(15, 20, 10, 8) participate = cbind(yes, no) participate yes no [1,] 32 15 [2,] 43 20 [3,] 22 10 [4,] 20 8 > chisq.test(participate) Pearsons chi-square test without Yates continuity correction data: participate X-square = 0.1101, df = 3, p-value = 0.9906 From the P -value of 0.9906, the dean cannot conclude that there is an association between area of study and participation in extracurricular sports. 114 2. Figure 9.9 At the command line, > chisq.test(market.survey$pick, market.survey$income) Pearsons chi-square test without Yates continuity correction data: market.survey$pick and market.survey$income X-square = 11.1541, df = 6, p-value = 0.0837 Warning messages: 215 observations (x[i],y[i]) with NA in x or y removed. There does not appear to be an association between choice of telephone service and income. 115 Chapter 10 Inference for Regression In Chapter 2 we saw how to t a least-squares regression line using Statistics > Regression > Linear.... In this chapter we explore in more detail some of the other options available with this command. We encourage you to use Chapter 14 together with this chapter, and in particular Section 14.5. The approach taken there, based on computer simulation and visualization, complements the more mathematical approach taken here and may help you better understand the concepts in this chapter. It also provides a way for you to check your work. 10.1 Graphing the Data We should always begin by graphing the data. Example 10.1 Fuel eciency (IPS Example 10.1) Computers in some vehicles calculate various quantities related to the performance. Example 10.1 in IPS discusses a sample of 60 observations from one vehicle, in which miles per gallon (MPG) and speed in miles per hour (MPH) were recorded each time the gas tank was lled. We begin by plotting the data. A review of plotting commands in Chapter 2 of this manual may be helpful. 1. Open the Example10.1 data set from the IPSdata library. There are 60 observations. 2. Begin by plotting the data. Select MPH, and control-click on MPG (to select these as x and y (explanatory and response) variables, respectively). Bring up the Plots 2D palette and click on the Scatter button . 3. It would be helpful to show a regression line and/or smooth, as in Figure 10.3 in IPS. It is dicult to do both simultaneously without using the command line, but we can do one at a time. Click on the Linear Fit button for a line, and the Spline button for a spline smooth. 116 22 MPG 18 14 10 10 20 30 MPH 40 50 Figure 10.1 The relationship is clearly nonlinear. Following IPS, we next consider LOGMPH as the explanatory variable. If that variable were not already present in the data set, we would create it using the Data > Transform... menu. 4. Bring up the data. Select LOGMPH, and control-click on MPG. Bring up the Plots 2D palette and click on . This relationship appears close to linear. We will proceed with linear regression the Spline button using LOGMPH as the explanatory variable. 22 MPG 18 14 10 2.25 2.50 2.75 3.00 3.25 LOGMPH 3.50 3.75 4.00 Figure 10.2 117 Command Line Notes We use the functions plot, lm, abline, lines and smooth.spline. > attach(Example10.1) > plot(MPH, MPG) > lines(smooth.spline(MPH, MPG, df = 3)) > abline(lm(MPG ~ MPH, data = Example10.1), col=2) > > plot(LOGMPH, MPG) > lines(smooth.spline(LOGMPH, MPG, df = 3)) > abline(lm(MPG ~ LOGMPH, data = Example10.1), col=5) > detach() End of Example 10.1 10.2 Inference About the Model Now that we understand condence intervals and tests of hypothesis, we can apply these ideas to regression analysis to get a sense of how accurate our predictions are and to test to see if the regression equation diers signicantly from a horizontal line. Example 10.2 Fuel eciency 1. At the menu, select Statistics > Regression > Linear.... This brings up the Linear Regression dialog. 2. For Data Set: enter Example10.1. 3. Under Variables, for Response: select MPG, and for Explanatory: select LOGMPH. (In some versions of S-PLUS the elds are called Dependent: and Independent:, respectively.) The Formula: eld should read MPG ~ LOGMPH. 4. In the Save Model Object box, for Save As: type fitMPG. This saves the regression calculations as a model object for later use. 118 Figure 10.3 5. Click OK. *** Linear Model *** Call: lm(formula = MPG ~ LOGMPH, data = Example10.1, na.action = na.exclude) Residuals: Min 1Q Median 3Q Max -3.717 -0.5187 0.1121 0.6593 2.149 Coefficients: (Intercept) LOGMPH Value Std. Error -7.7963 1.1549 7.8742 0.3541 t value Pr(>|t|) -6.7506 0.0000 22.2373 0.0000 Residual standard error: 0.9995 on 58 degrees of freedom Multiple R-Squared: 0.895 F-statistic: 494.5 on 1 and 58 degrees of freedom, the p-value is 0 We see in the Report window that The estimated slope (b) is b = 7.8742, and the estimated intercept is a = 7.7963. Thus, the tted line is MPG = 7.80 + 7.87LOGMPH. 119 The standard error of the estimated slope is SEb = 0.03541. This measures the variability in the estimate of the slope and is used in both the condence interval and the hypothesis test. The t statistic for the slope is 22.2373; this is the ratio t = b/SEb = 7.8742/0.3541 (these numbers are rounded). We use t to test the hypothesis H0 : True slope is zero against Ha : True slope is not zero. Under the null hypothesis and if various other assumptions are satised, then t follows a t distribution with n 2 = 58 degrees of freedom. S-PLUS prints both the t statistic and the corresponding P -value, P = 0.0. S-PLUS also gives the corresponding information for the intercept. The residual standard error is s = 0.9995. This measures the variability of the residuals around the 1 regression line and is computed from s2 = residuals2 . n2 This report does not give a condence interval for the true slope (or intercept), although it is easy to compute with the information given above: b t SEb . You just need to nd t , the upper (1 C)/2 critical value (Table C in IPS, or see Chapter 4 of this manual). End of Example 10.2 Command Line Notes We use lm to t the regression and summary to describe the results. > fitMPG = lm(MPG ~ LOGMPH, data = Example10.1) > summary(fitMPG) . . . > 7.8742 + c(-1,1) * qt(.975,58) * 0.9995 [1] 5.873483 9.874917 Remark: S-PLUS uses a special syntax for expressing formulas in models. The response variable is written rst, followed by the tilde, . Any explanatory variables are written to the right of the tilde, separated by +. The intercept term is included by default. For instance, Y ~ X represents Y = a + bX. To omit the intercept term, write Y ~ X - 1. 10.3 Inference About Prediction A level C condence interval for the true mean response y for x = x is given by y t SE , where the standard error SE is SE = s 1 (x x)2 + n (x x)2 and t represents the upper (1 C)/2 critical value of the t distribution on n 2 degrees of freedom; s is the residual standard error. Example 10.3 Condence intervals for the mean response 1. Bring up the Object Explorer (Section 0.5) 120 . 2. Click on the Data icon in the left panel to select (highlight) it and nd fitMPG (the saved regression object) in the right panel. 3. Right-click on fitMPG and select Predict... from the shortcut menu. Figure 10.4 4. In the Linear Regression Prediction dialog box, enter Example10.1 in both the New Data: and Save In: elds. 5. Check the boxes Predictions and Condence Intervals. Figure 10.5 6. Click on OK. The data set now has three new columns: a column (fit) of tted values (), and columns LCL95 and y UCL95 with the lower and upper limits, respectively, of the condence intervals. 121 Figure 10.6 For example, the 95% condence interval for the true mean response at MPH= 18.60 is (14.88, 15.56). End of Example 10.3 Remarks: To nd tted values and condence intervals for x-values dierent from the ones used in the regression, create a new data set with a column holding the new x-values. This column of new x values must have the same name as the original column (and be spelled exactly the same). Then repeat the above steps (16) but enter the name of this data set in the Data Set: and Save In: elds. To see a plot of condence bounds for mean responses, create a linear t plot of the data. For example, using Example10.1: 1. Select LOGMPH, then CTRL-click on MPG. 2. Click on the Linear Fit button on the Plots 2D palette. 3. Right-click on the line and choose By Conf Bound... from the shortcut menu. This brings up the Curve Fitting Plot dialog, with the By Conf Bound tab active. 4. Type the desired Level: of condence. 5. Under Line Attributes, select a line Style: (solid, dashed, and so on). 6. Click OK. These condence bounds are a little wider than the pointwise condence intervals to adjust for multiple comparisons (that is, to have the condence of including the true line at all x-values simultaneously). Through the GUI, it is possible to compute only condence intervals for the mean response from a regression equation. See the Command Line Notes section below for how to compute the prediction intervals (for a single observation). The prediction interval for a single observation uses the standard error for prediction SEy 1 (x x)2 + . n (x x)2 SEy = s 1 + Note this is larger than SE because of the extra term (the 1) under the radical sign. Cautionwhile most common regression inferences are approximately correct for large sample sizes even when residuals are not normally distributed, this is not true for prediction intervals. That number 1 means that the eective sample size is always 1 for prediction intervals for a new observation; the number of previous observations is nearly irrelevant. 122 Command Line Notes The predict command computes condence and prediction intervals (options ci.fit=T and pi.fit=T, respectively). We rst demonstrate how to create the graph that includes the condence and prediction intervals of the mean responses. > predMPG = predict(fitMPG, ci.fit = T, pi.fit = T) > predMPG$ci.fit $ci.fit: lower upper 1 14.87890 15.56444 2 15.27157 15.91510 3 17.03299 17.55536 ... > attach(Example10.1) > plot(LOGMPH, MPG, ylim = range(predMPG$pi.fit)) > lines(sort(LOGMPH), predMPG$fit[order(LOGMPH)]) > lines(sort(LOGMPH), predMPG$ci.fit[order(LOGMPH),1], lty=3) > lines(sort(LOGMPH), predMPG$ci.fit[order(LOGMPH),2], lty=3) > lines(sort(LOGMPH), predMPG$pi.fit[order(LOGMPH),1], lty=4) > lines(sort(LOGMPH), predMPG$pi.fit[order(LOGMPH),2], lty=4) > detach() The predict command can also be used to obtain predicted values for new values of the explanatory variable: > predict(fitMPG, newdata=data.frame(LOGMPH=c(1.6, 1.7)) 1 2 1.062009 1.055697 Thus, MPG = 1.062 is predicted for LOGMPH = 1.6, and MPG = 1.056 is predicted for LOGMPH = 1.7. 10.4 Checking the Regression Assumptions S-PLUS generates several diagnostic plots to help you detect any extreme points that need investigating and to determine if the regression assumptions are satised. Example 10.4 Diagnostic plots 1. Bring up the Object Explorer. 2. Click on the Data icon in the left panel to select (highlight) it and nd fitMPG (the saved regression object) in the right panel. 3. Right-click on fitMPG and select Plot... from the shortcut menu. 4. In the Linear Model Plots dialog box, under Plots, check the top six boxes in the left panelall except Partial Residuals (which is useful when there are multiple predictor variables). 5. Click OK. The following six plots are produced: 123 Residuals vs. tted values. This plot provides the same information as the residuals vs. independent (x) plot. (The tted values are used because this plot generalizes nicely to the case of more than one predictor variable.) Look for nonlinear patterns or increasing variance in this plot. Residuals -2 -1 0 1 2 35 25 -3 57 12 14 16 18 Fitted : LOGMPH 20 22 Figure 10.7 We look for a nonrandom pattern in this plot. The solid line in the plot is a curve produced by a smoother, and we note how much it deviates from the dashed horizontal reference line at y = 0. In this case, the curve appears nearly straight, but with a positive slope. This curve is produced by loess, a procedure which is resistant to outliers. In this case it indicates observations 57, 25, and 35 are possible outliers. It gives a positive slope because it is less inuenced by observations 57 and 25 than is the least-squares line (which has zero intercept and zero slope). 124 Square root of absolute value of residuals vs. tted values. This plot is similar to the rst one, except that the square root of the absolute value of the residuals is plotted on the y axis. This puts all potential outliers at the top. Unequal variances are also easier to spot in this plot. 2.0 57 1.5 35 25 sqrt(abs(Residuals)) 0.0 12 0.5 1.0 14 16 fits 18 20 22 Figure 10.8 The smoother indicates that there is a positive relationship between the tted values and the size of the residuals. This suggests that there are unequal variances. A common remedy in cases like this, where variance increases with the tted values, is to transform the response variable, for example the using MPG or log(MPG) as the response variable (this may also require transforming explanatory variables for a good t). 125 Response vs. Fit. The third plot graphs the observed y-values against the predicted (tted) y-values. In a perfect t, all of the points would fall along the dotted line. We can see how much of the y-variable is predicted by the x-variable. We can also check for nonlinearity or unequal variances. When there is only one explanatory variable this plot is not very useful. It gives exactly the same points as a simple plot of y vs. x, except that the x axis is rescaled (so the x-values have the same mean as y). And the rst diagnostic plot is more useful for checking for unequal variances, and either of the rst two plots is more useful for checking for unequal variances. MPG 12 12 14 16 18 20 22 24 14 16 18 Fitted : LOGMPH 20 22 Figure 10.9 126 Residuals Normal QQ. This is the normal quantile plot of the residuals. It is used to see if the residuals follow a normal distribution. If the assumption is met, then the points should lie roughly on a line. Residuals -2 -1 0 1 2 35 25 -3 57 -2 -1 0 Quantiles of Standard Normal 1 2 Figure 10.10 In this example observations 57 and perhaps 25 do appear to be outliers. 127 Residual-Fit Spread. This plot is a visual analog of the r2 term. It shows a plot of the (centered) sorted tted values next to the sorted residuals. The better the regression ts, the higher the r2 , and the larger the spread of the tted values relative to the residuals. Fitted Values 6 6 Residuals 4 2 MPG 0 -2 -4 -6 0.0 0.2 0.4 0.6 0.8 1.0 f-value -6 0.0 -4 -2 0 2 4 0.2 0.4 0.6 0.8 1.0 Figure 10.11 In this example, we note that there is substantially more spread in the tted values than the residual values; this corresponds to a high r2 (recall that r2 = 0.895). 128 Cooks Distance. Cooks distance is a measure of how inuential an observation is. An observation is inuential if its omission greatly alters the regression equation (see IPS, Chapter 2). This is indicated by the length of the vertical line in the Cooks distance plot: the longer the line, the more inuential the observation. 57 0.4 Cooks Distance 0.2 0.3 25 0.1 35 0.0 0 10 20 30 40 50 60 Figure 10.12 In this example, observation 57 is quite inuential, followed by observations 25, 35, and 36. End of Example 10.4 Command Line Notes The plot function automatically gives the same six plots when called with a model object. The command par(mfrow=c(2,3)) sets up the graphsheet to display all six graphs on one page. > > > > plot(fitMPG, smooth = T, ask = T) par(mfrow = c(2,3)) # set layout parameter to 2x3 arrangement plot(fitMPG, smooth = T) par(mfrow = c(1,1)) # reset layout parameter to default 10.5 More Detail About Simple Linear Regression The ANOVA table for regression provides information to perform an F-test for the slope coecient. Example 10.5 Fuel eciency, continued We will use the model object fitMPG that we created earlier. 1. In the Object Explorer, right-click on the icon for fitMPG. 2. In the resulting menu, select Summary.... 129 3. In the Linear Regression Results dialog box, check the box ANOVA Table (and clear the box next to Long Form). 4. Click OK. *** Linear Model *** Analysis of Variance Table Response: MPG Terms added sequentially (first to last) Df Sum of Sq Mean Sq F Value Pr(F) LOGMPH 1 493.9918 493.9918 494.4971 0 Residuals 58 57.9407 0.9990 End of Example 10.5 Remark: In this chapter, we demonstrated how to create a model object (fitMPG) from a linear regression (see the rst example) and then with this object, access the dialog boxes to do additional analysis, such as prediction, diagnostic plots, and so forth. It is possible to combine all of these steps at once. In the initial dialog box (Statistics > Regression > Linear...), enter the information (see Figure 10.3), then click on the tabs Results, Plot, and Predict for additional options. 10.6 Exercise 1. In Germany, fuel eciency for cars is measured in liters per hundred kilometers. The analog using gallons and miles measurements is gallons per hundred miles (GPHM). This system more directly measures the cost of driving. For example, if you drive 100 miles per week, how much cheaper is a Jeep Liberty at 17 MPG than a Ford Explorer at 14 MPG? Or a Honda Insight at 57 MPG than a VW Golf at 38 MPG? It is easier given the GPHM ratings, of 5.9, 7.1, 1.8, and 2.6, respectively. The Liberty saves 1.2 gallons over the Explorer, while the Insight saves 0.8 gallons over the Golf. Note that a dierence of 3 MPG between two SUVs matters more than a dierence of 19 MPG between two smaller cars. Convert the MPG into GPHM, using the Data > Transform menu, using 100 / MPG for Expression:. Figure 10.13 130 Then regress GPHM against MPH (or a transformation of MPH). Do the steps outlined in this chapter, in particular: plotting the data, nding the equation of the regression line, obtaining the standard error and a condence interval for the regression slope, obtaining condence intervals for the predictions, and checking the regression assumptions. 10.7 Solution 1. From the GUI, Open the data set, select MPH, and control-click to select GPHM. Use a spline curve to check for nonlinearity; there is a clear nonlinear relationship. Try again with LOGMPH in place of MPH. The relationship is closer to linear, but there is still nonlinearity apparent. Use the Data > Transform menu, to create a new variable inverseMPH using the expression 1 / MPH and do the plot; now the relationship appears linear. 8 7 GPHM 6 5 4 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 inverseMPH Figure 10.14 Then run the regression and create plots and predictions: 131 Figure 10.15 Figure 10.16 132 Figure 10.17 *** Linear Model *** Call: lm(formula = GPHM ~ inverseMPH, data = Example10.1, na.action = na.exclude) Residuals: Min 1Q Median 3Q Max -0.5538 -0.1941 -0.0475 0.1657 0.8671 Coefficients: Value Std. Error t value Pr(>|t|) (Intercept) 3.0808 0.1214 25.3688 0.0000 inverseMPH 65.4213 2.7577 23.7229 0.0000 Residual standard error: 0.3043 on 58 degrees of freedom Multiple R-Squared: 0.9066 F-statistic: 562.8 on 1 and 58 degrees of freedom, the p-value is 0 The regression equation is GPHM = 3.08 + 65.4 inverseMPH. The standard error for the slope is 2.76; a 95% condence interval for the slope is 65.4 t 2.76. The r2 is .907, slightly higher than before. The regression plots indicate there may be some nonlinearity remaining, that residual variances are not quite constant, that residuals are reasonably close to normal though observations 57 and 59 may be outliers, and that observation 57 is inuential. 133 0.8 57 59 0.8 57 59 0.6 18 sqrt(abs(Residuals)) 18 5 6 Fitted : inverseMPH 7 8 5 6 fits 7 8 0.4 Residuals 0.2 -0.2 0.0 -0.6 -0.4 0.2 0.4 0.6 Fitted Values 0.8 59 57 Residuals 0.6 2 0.4 1 Residuals 0.2 0.0 GPHM 0 -0.4 -0.2 -1 -0.6 18 -2 -1 0 Quantiles of Standard Normal 1 2 0.0 0.2 0.4 0.6 0.8 1.0 f-value 0.0 0.2 0.4 0.6 0.8 1.0 57 0.20 Cooks Distance 0.10 0.15 8 59 0.0 0.05 0 10 20 30 40 50 60 Figure 10.18 134 -1 0 1 2 Chapter 11 Multiple Regression Multiple regression models a linear relationship between a numeric response variable y and several numeric explanatory variables x1 , x2 , . . . , xp . Let y denote the mean response: y = 0 + 1 x1 + 2 x2 + p xp 11.1 Inference for Multiple Regression Example 11.1 Computer science majors Information from a study on 244 computer science majors at a large university is described in the Data Appendix of IPS. The researchers wanted to predict success in the early university years, as measured by GPA. This data set is included in the IPSdata library as csdata. The rst step is to perform some exploratory data analysis. We will compute summary statistics and correlations. 1. Open the data set (using IPSdata > Select Data...). 2. At the menu, select Statistics > Data Summaries > Summary Statistics.... This brings up the Summary Statistics dialog. 3. In the Data box, for Data Set: select csdata. 4. For Variables:, select <ALL>. 5. Click OK. This gives numeric summaries of the data, and frequencies for the factor variable sex (not printed here). 6. At the menu, select Statistics > Data Summaries > Correlations.... This brings up the Correlations and Covariances dialog. 7. For Data Set:, select csdata and for Variables: select (use CTRL-click) all numerical variables. Or leave Variables set to <ALL>; and ignore the warning Dropping nonnumeric column(s) sex. 8. Click OK. 135 *** Correlations for data in: hsm 0.4364988 1.0000000 0.5756865 0.4468865 0.4535139 0.2211203 hss 0.3294253 0.5756865 1.0000000 0.5793746 0.2404793 0.2616975 csdata *** hse 0.2890013 0.4468865 0.5793746 1.0000000 0.1082849 0.2437146 satm 0.2517143 0.4535139 0.2404793 0.1082849 1.0000000 0.4639419 satv 0.1144905 0.2211203 0.2616975 0.2437146 0.4639419 1.0000000 gpa hsm hss hse satm satv gpa 1.0000000 0.4364988 0.3294253 0.2890013 0.2517143 0.1144905 From this matrix, we can note, for example, that the correlation between hsm (high school mathematics grade) and satm (SAT mathematics score) is 0.4535. Scatterplot matrices can also be a helpful tool in exploratory data analysis of several numeric variables (see Section 2.1). 9. Select all numerical columns, and click on the Scatter Matrix button 10. Right-click on a point, and select Smooth... from the shortcut menu. 11. For Smoothing Type: select Smoothing Spline. Leave Deg. of Freedom at the default value of 2 (numbers close to 1 give straight lines; larger numbers allow more curvature). 12. Click OK. Note in particular the top row of the gure, which has GPA on the y axis and the potential explanatory variables on the x axes. on the Plots 2D palette. 1 3 5 7 9 2 4 6 8 10 200300400500600700 4 3 gpa 2 1 0 9 7 5 3 1 10 8 hss 6 4 10 8 6 4 2 800 700 600 500 400 300 200 hse 2 hsm satm 700 600 500 400 300 200 0 1 2 3 4 2 4 6 8 10 200300400500600700800 satv Figure 11.1 136 We now consider a multiple regression on high-school grades. We rst work with the model using gpa for the response variable and using only the three high school grades for the explanatory variables: gpa = 0 + 1 hsm + 2 hss + 3 hse 13. At the menu, select Statistics > Regression > Linear.... 14. Select csdata for the Data Set:. 15. For Response: variable, select gpa. 16. For Explanatory:, select hsm, hss, hse (use control-click). The Formula: eld should read gpa ~ hsm + hss + hse. Figure 11.2 This formula syntax in S-PLUS indicates that gpa (the variable to the left of the tilde, ) is the response variable, and that the model includes (+) the terms hsm, hss, and hse. By default, the intercept term is included. 17. Click on the Plot tab and check the boxes next to all plots. 18. Click OK. 137 *** Linear Model *** Call: lm(formula = gpa ~ hsm + hss + hse, data = csdata, na.action = na.exclude) Residuals: Min 1Q Median 3Q Max -2.129 -0.3407 0.07568 0.4744 1.754 Coefficients: Value (Intercept) 0.5899 hsm 0.1686 hss 0.0343 hse 0.0451 Std. Error 0.2942 0.0355 0.0376 0.0387 t value 2.0047 4.7494 0.9136 1.1655 Pr(>|t|) 0.0438 0.0000 0.3599 0.2425 Residual standard error: 0.6998 on 220 degrees of freedom Multiple R-Squared: 0.2046 F-statistic: 18.86 on 3 and 220 degrees of freedom, the p-value is 6.359e-011 From the report, we obtain the regression equation gpa = 0.5899 + 0.1686hsm + 0.0343hss + 0.0451hse. We can also obtain the F statistics 18.86 which follows an F (3, 220) distribution. The P -value is essentially 0 so we can conclude that at least one of the slope coecients is nonzero. We also see that about 20.5% of the variation in gpa is explained by this regression. The plot of the residuals against the tted values shows random noise about the y = 0 line. Residuals -1 0 1 188 1.5 2.0 Fitted : hsm + hss + hse 2.5 138 105 3.0 -2 Figure 11.3 The plot of sqrt(abs(Residuals)) against tted values (not shown here) suggests that variances are slightly smaller for large tted values, though not enough to worry about. The plot of gpa against tted values suggests some possible nonlinearity in the relationship. However, much of the nonlinearity is due to the left side of the gure, where there are few points, so it may just be an artifact. 138 gpa 0 1 2 3 4 1.5 2.0 Fitted : hsm + hss + hse 2.5 3.0 Figure 11.4 The residuals fall o the straight line on the lower left side of the normal quantile plot. This indicates some deviation from normality, though not enough to worry about, except for prediction intervals. Residuals -1 0 1 188 -3 138 105 -2 -1 0 1 2 3 -2 Quantiles of Standard Normal Figure 11.5 The plots of sorted Fitted Values and sorted Residuals show that there is substantially more variation in the residuals; this is because there is a low r2 . The plot of Cooks distance (not shown) labels three points as inuential, but they are not substantially more inuential than other points. The three remaining plots, the partial residual plots (not shown), show no patterns that would be a cause for further investigation. End of Example 11.1 139 Command Line Notes > summary(csdata) > cor(csdata[,-7]) # omit "sex" For the regression, > fit1 = lm(gpa ~ hsm + hss + hse, data = csdata) > summary(fit1) > plot(fit1) # or: plot(fit1, ask = T) Example 11.2 Comparing regression models Sometimes we wish to compare two regression models to see if the simpler model ts the data as well as the more complicated model, or if the additional terms improve the t. 1. Repeat the previous example, but in addition, type fit1 in the Save As: eld on the Model page of the Linear Regression dialog box. 2. Click Apply. This saves the results of the regression in the model object fit1. A shortcut when repeating part of a previous analysis is to bring up the regression dialog, then use the back arrow at the bottom of the dialog to bring up the previous analysis. We now consider the model gpa = 0 + 1 hsm + 2 hss + 3 hse + 4 satm + 5 satv (that is, we add the SAT variables to the original model). 3. On the Model page of the Linear Regression dialog box, select gpa for Dependent: and choose all of hsm, hss, hse, satm, and satv in the Independent: box. 4. Type fit2 in Save As:. 5. Click OK. 140 Figure 11.6 6. From the main menu choose Statistics > Compare Models. 7. Select fit1 and fit2 in the Model Objects: box. 8. Click OK. Figure 11.7 141 Analysis of Variance Table Response: gpa Terms Resid. Df RSS 1 hsm + hss + hse 220 107.7505 2 hsm + hss + hse + satm + satv 218 106.8191 Test Df Sum of Sq 1 2 +satm+satv 2 0.9313136 F Value Pr(F) 0.9503276 0.38821 This tests to see if the two nested models t equally well (the null hypothesis) or if adding either or both of the SAT variables improves the t. The P -value of 0.38821 indicates that we cannot reject the null hypothesis that the simple model ts just as well as the more complicated model. End of Example 11.2 Command Line Notes The anova function compares dierent models. > fit1 = lm(gpa ~ hsm + hss + hse, data = csdata) > fit2 = lm(gpa ~ hsm + hss + hse + satm + satv, + data = csdata) > # Or, use the following to create fit2 from fit1, > # where "." is short for "like before": > fit2 = update(fit1, . ~ . + satm + satv) > anova(fit1, fit2) 11.2 Exercise 1. Using the csdata, perform the linear regression of gpa against satm and satv and interpret the output. 142 11.3 Solution 1. From the GUI, Figure 11.8 Click on the Plot tab and check the boxes to all desired plots. Then click OK. From the command line, > cs2.fit = lm(gpa ~ satm + satv, data = csdata) > summary(cs2.fit) > plot(cs2.fit, ask = T) (partial output) Coefficients: (Intercept) satm satv Value Std. Error t value Pr(>|t|) 1.2887 0.3760 3.4270 0.0005 0.0023 0.0007 3.4436 0.0005 0.0000 0.0006 -0.0397 0.9683 Residual standard error: 0.7577 on 221 degrees of freedom Multiple R-Squared: 0.06337 F-statistic: 7.476 on 2 and 221 degrees of freedom, the p-value is 0.0007218 143 The estimated coecients are 1.2887 for intercept, 0.0023 for satm, and 0.000 for satv. We caution that before we conclude that both slopes are 0, we should notice that the variable gpa ranges from 0.4 to 4.0 while the SAT scores range from 285 to 760 for the verbal, 300 to 800 for the math. The P -value of 0.00072 for the regression indicates that at least one slope is nonzero. The P -value of 0.9684 for the t test on the satv slope indicates that satv is not signicant, given that satm is included in the model. The residuals against tted values plot shows random noise about the line y = 0, but the normal quantile plot of the residuals hints at left-skewness in the residuals. 144 Chapter 12 One-Way Analysis of Variance In Chapter 7, we used the two-sample t procedures to compare the means of two independent populations. In many settings, we wish to compare the means of three or more independent populations. The one-way analysis of variance (ANOVA) is the generalization of the two-sample t test. 12.1 The Analysis of Variance F Test The hypothesis we wish to test in a one-way ANOVA is H0 : 1 = 2 = = k against Ha : i = j for at least one pair of indices i = j, where i represents the mean of a numeric variable grouped by level i. The ANOVA test is based on a distribution known as the F distribution. Analogous to the two-sample t test, a test statistic is computed, and this value is compared to an F distribution with degrees of freedom that depend on the data. Example 12.1 Reading in children The reading data set in the IPSdata library is described in the Data Appendix in IPS. The data is from a study of reading in children. 1. Open the reading data set from the IPSdata library. Figure 12.1 This data set uses the abbreviations B for Basal, D for DRTA, and S for Strat. We begin by obtaining summary statistics for each group. In addition to being instructive in its own right, this provides one way to check the ANOVA assumptions. 145 2. At the menu, select Statistics > Data Summaries > Summary Statistics .... This brings up the Summary Statistics dialog. 3. In the Data box, for Data Set: enter reading, and for Variables: select pre1. 4. In the Summaries by Group box, for Group Variables: select group. 5. Click on the Statistics tab to go the next page. Check the boxes for Mean and Std. Deviation and clear all other boxes. 6. Click OK. *** Summary Statistics for data in: reading *** group:B pre1 Mean: 10.500000 Std Dev.: 2.972092 --------------------------------------------------group:D pre1 Mean: 9.727273 Std Dev.: 2.693587 --------------------------------------------------group:S pre1 Mean: 9.136364 Std Dev.: 3.342304 The standard deviations are similar, suggesting that the ANOVA assumption of equal variances is not violated, so we continue with our ANOVA analysis. 7. At the menu, select Statistics > ANOVA > Fixed Eects. 8. For Data Set: select reading, for Response: select pre1, and for Explanatory: select group. 146 Figure 12.2 9. Click OK. *** Analysis of Variance Model *** Short Output: Call: aov(formula = pre1 ~ group, data = reading, na.action = na.exclude) Terms: Sum of Squares Deg. of Freedom group Residuals 20.5758 572.4545 2 63 Residual standard error: 3.014395 Estimated effects are balanced Df Sum of Sq Mean Sq F Value Pr(F) group 2 20.5758 10.28788 1.132206 0.3287909 Residuals 63 572.4545 9.08658 From the output, we see that the P -value is 0.3287, so we have no evidence to reject the null hypothesis that the reading scores are dierent for the three groups. 147 We would not expect a dierence here, because the response variable was a pretest. Keep in mind, however, that 5% of similar studies will show a signicant dierence in pretest scores between the groups, at the 5% level. S-PLUS also returns the Mean Square values of (using the notation in IPS) SSG/DF G = 10.287 and SSE/DFE= 9.08658. End of Example 12.1 Command Line Notes The tapply command is used to apply a function to a vector grouped by levels of a factor or several factors. The rst argument is the vector whose values you are interested in. The second argument can be a factor or a list of factors. The third argument is the function. > > > > > > attach(reading) boxplot(split(pre1, group)) tapply(pre1, group, stdev) detach() read.fit = aov(pre1 ~ group, data = reading) summary(read.fit) 12.2 Comparing the Means The ANOVA test tells us if there is a dierence or not. It does not tell us which means are dierent (remember only two need to dier, so all but one could all be equal). S-PLUS provides the multiple comparisons needed to answer the more specic questions on which means are dierent. Example 12.2 All pairwise comparisons We analyze the data in fuel.frame to see which types of cars have signicantly dierent gas mileages. 1. Choose Statistics > ANOVA > Fixed Eects.... For Data Set: enter fuel.frame, for Response: select Fuel (gallons to go 100 miles), and for Explanatory: select Type. 2. Click on the Compare tab. 3. For Levels Of: select Type, and make sure that both Print Results and Plot Intervals boxes are checked, then click on OK. 148 Figure 12.3 Partial output: 95 % simultaneous confidence intervals for specified linear combinations, by the Tukey method critical point: 2.9545 response variable: Fuel intervals excluding 0 are flagged by **** Estimate Std.Error Lower Bound Upper Bound -0.800 0.267 -1.590 -0.0116 **** -0.434 0.160 -0.906 0.0387 0.894 0.160 0.422 1.3700 **** Compact-Large Compact-Medium Compact-Small ... From this output we can estimate that the dierence in fuel consumption between Compact and Small cars is between 0.422 and 1.37 gallons. Since this interval does not include 0, we can conclude that small cars take signicantly less fuel to go 100 miles. The condence interval comparing Compact and Medium cars does contain 0, which means that we do not have enough evidence to say that the fuel consumption is dierent for the two types of cars. We could continue to look at other pairs of means and know that we are 95% condent (since we did not change the default condence level) that all 15 comparisons contain the true dierence. (S-PLUS has adjusted for the fact that we are doing 15 dierent comparisons and still maintains the correct condence level; this is more accurate than doing 15 dierent t tests.) 149 We can also look at the condence intervals in the graph produced. The vertical line is at 0, and any condence interval not including 0 indicates a signicant dierence. The advantage of computing condence intervals over doing only signicance tests is that we can also judge if the dierences are important in practical terms. (Some people may consider that 1.37 gallons is not enough to worry about even though the dierence is statistically signicant.) End of Example 12.2 Remarks: We used the default value of mca for the Comparison Type: eld, which gives all pairwise comparisons. There are two other options: mcc gives all comparisons with one single level (specied in Compare To Level:); for example, comparing each other type of car to the Small cars. The none option does no comparisons, but instead computes a condence interval for each type of car (six condence intervals). Other elds allow us to specify other options. Method: allows us to specify the method to use (the default chooses the best method among all those appropriate for the given data) and Condence Level: allows us to choose how condent we want to be in the results. Example 12.3 Contrasts Sometimes it is overkill to look at all pairwise comparisons, or we want to look at a dierent linear combination of the means than just a simple dierence. S-PLUS allows us to compute condence intervals (and test) contrasts on our variables. We proceed to use the data in fuel.frame to answer the questions: Are Small cars dierent from Compact cars? Are Vans dierent from Large cars? Is the fuel consumption of Medium cars equal to the average fuel consumption of Small and Large cars? 1. We rst need to create a data frame with the contrasts. Each column represents a contrast and the rows represent the values to compare/combine. Click on the new data set button and enter the contrasts to match the gure below (the row and column names are not necessary, but do help in keeping the contrasts and terms straight). Note: the rst row represents the intercept or overall mean; it is important to include this row even though we will not use it. The order of the rows is based on the order of the levels in the data (alphabetical by default). Figure 12.4 2. Choose Statistics > ANOVA > Fixed Eects.... For Data Set: enter fuel.frame, for Response: select Fuel, and for Explanatory: select Type. 150 3. Click on the Compare tab. 4. For Levels Of: select Type, and make sure that both Print Results and Plot Intervals boxes are checked. 5. Change Comparison Type: to none. For Contrast Matrix: enter the name of the new data set created above. Then click on OK. Figure 12.5 Partial results: 95 % simultaneous confidence intervals for specified linear combinations, by the Tukey method critical point: 2.41 response variable: Fuel intervals excluding 0 are flagged by **** Estimate Std.Error Lower Bound Upper Bound 0.894 0.160 0.5090 1.280 **** 0.345 0.291 -0.3560 1.050 0.481 0.179 0.0499 0.912 **** Small.vs.Compact Large.vs.Van Med.vs.L.S This shows that Compact cars use more fuel than Small cars, that we do not have enough evidence to show that Large cars and Vans dier, and that Medium cars use more fuel than the average of Small and Large cars. 151 Compare the line for Small vs. Large cars in this output to the output for all pairwise comparisons. The estimate and standard error are the same, but the interval is narrower, because we only needed to adjust the critical point for 3 comparisons rather than 15. End of Example 12.3 Command Line Notes The function multicomp does multiple comparisons and contrasts. > fuel.fit = aov(Fuel ~ Type, data = fuel.frame) > multicomp(fuel.fit, plot = T) > temp.mat = cbind(SvC = c(0,1,0,0,-1,0,0), + LvV = c(0,0,-1,0,0,0,1), + MvLS = c(0,0,-1/2,1,-1/2,0,0)) > multicomp(fuel.fit, comparisons = "none", lmat = temp.mat) 12.3 Exercises 1. For the reading comprehension data (the reading data set used earlier in this chapter), compare the results of a reading comprehension test (post3) for the three reading groups. 2. Test the two hypotheses H01 : 1 (D +S )B = 0 and H02 : D S = 0 for the reading comprehension 2 test. 152 12.4 Solutions 1. Check the sample standard deviations. From the Statistics > Data Summary > Summary Statistics... command, we obtain the mean and standard deviations for each color. Because 7.388 largest sd = = 1.31 < 2, we can safely use ANOVA. smallest sd 5.636 Select Statistics > Anova > Fixed Eects... from the menu and ll out the dialog box: Figure 12.6 The large F value of 4.48 and the small P -value of 0.01515 leads us to conclude that the means are dierent. Here are some of the commands that replicate the above results. > > > > attach(reading) tapply(post3, group, stdev) read2.fit = aov(post3 ~ group, data = reading) summary(read2.fit) 2. Add the following changes to the above solution: 153 Figure 12.7 > multicomp(read2.fit, comparisons = "none", + lmat = cbind(BvDS=c(0,-1,.5,.5), DvS=c(0,0,1,-1)) ) 154 Chapter 13 Two-Way Analysis of Variance In the previous chapter, we compared the means of several populations. In this chapter, we compare the means of populations that are classied in two ways. We assume that the data are approximately normal and that the standard deviations for the dierent groups are equal. 13.1 Preliminary Data Analysis Before doing any ANOVA, we need to take a preliminary look at the data. Do the samples come from populations with the same standard deviation? Is there any indication of interactions between the two explanatory variables? We will perform our exploratory analysis on the data set presented in Example 13.11 of IPS. Example 13.1 Cardiovascular risk factors Two hundred subjects consisting of runners and those who described themselves as sedentary (the control group) had their heart rate measured after six minutes of exercise on a treadmill. The factor (categorical) variable group has two levels, Control and Runners; the factor variable sex has two levels, Male and Female. All combinations appear, so we have four populations to consider: Control-Female, Control-Male, Runners-Female, Runners-Male. 1. Open the data set Example13.11.txt from the IPSdata library. 2. At the menu, select Statistics > Data Summaries > Summary Statistics.... 3. For Data Set enter Example13.11. For Variables: select beats. 4. For Group Variables:, select group and sex. 5. On the Statistics page, check Std. Deviation and clear all other options. 6. Click OK. From the report, we see that the largest standard deviation, 17.10 for Control-Male, is less than twice the smallest standard deviation, 12.49 for the Runners-Male. Thus, we may proceed with our analysis. Next, we take a graphical look at the data. 7. At the menu, select Statistics > Design > Factor Plot.... 8. For Data Set: enter Example13.11. 9. In the Variables box, for Response: select beats and for Explanatory: select group and sex. 10. In the Layout box, change the number of Columns: to 2. 155 Figure 13.1 11. Click OK. This gives two boxplots, of heart rate grouped by gender and by experimental group. We can compare medians, spread, and skewness between the dierent groups. We note that in the control group, there are three outliers, and in the runners group, one outlier. 156 200 180 160 140 beats 120 beats 100 80 Control group Runners 80 100 120 140 160 180 200 Female sex Male Figure 13.2 Remark: We can also obtain the side-by-side boxplots by using the Plots 2D Palette and clicking on the Box button (see Section 1.2). 12. At the menu select Statistics > Design > Design Plot.... 13. For Data Set: enter Example13.11. 14. In the Variables box, for Response: select beats and for Explanatory: select group and sex. 157 Figure 13.3 15. Click OK. This plot allows us to compare the mean heart rates for each group. The horizontal line at approximately y = 125 is the overall mean for beats. 140 Control 135 Female 130 mean of beats 120 125 Male 115 110 Runners group Factors sex Figure 13.4 We next assess interactions between group and sex. 16. At the menu, select Statistics > Design > Interaction Plot.... 17. For Data Set: enter Example13.11. 18. In the Variables box, for Response: select beats and for Explanatory: select group and sex. 19. Under Options, check Both Orderings for Each Pair. 20. Under Layout, change the number of Columns: to 2. 158 Figure 13.5 21. Click OK. group Control Runners sex Female Male 140 130 mean of beats 120 mean of beats Female sex Male 110 110 Control group 120 130 140 Runners Figure 13.6 159 The endpoints of each line in this plot are at the mean of beats for the relevant group: for instance, in the left plot above, the upper endpoint for the dotted line indicates that for females in the control group, the mean heart rate is approximately 148. The two lines in each graph are not quite parallel, indicating some interaction between the explanatory variables group and sex. End of Example 13.1 13.2 Two-Way ANOVA We will consider a model with interaction in the two-way ANOVA. Example 13.2 Cardiovascular risk, continued 1. At the menu, select Statistics > ANOVA > Fixed Eects... 2. For Data Set: enter Example13.11. 3. In the Variables box, for Response: select beats and for Explanatory: select group and sex. 4. To include the interaction term, add the term group:sex in the Formula: eld. The complete formula should read: beats ~ group + sex + group:sex The model formula indicates that beats is the response variable, that group and sex are included as main eects, and that an interaction between group and sex is included (group:sex). 5. Click OK. *** Analysis of Variance Model *** Short Output: Call: aov(formula = beats ~ group + sex + group : sex, data = Example13.11, na.action = na.exclude) Terms: group Sum of Squares 168432.1 Deg. of Freedom 1 sex group:sex Residuals 45030.0 1794.0 192729.8 1 1 796 Residual standard error: 15.5603 Estimated effects are balanced Df Sum of Sq Mean Sq F Value Pr(F) group 1 168432.1 168432.1 695.6470 0.000000000 sex 1 45030.0 45030.0 185.9799 0.000000000 group:sex 1 1794.0 1794.0 7.4095 0.006629953 Residuals 796 192729.8 242.1 160 From the output, we see that the F value for the group main eect is 695.647, which is highly signicant. The F -values and the P -values for sex and the interaction term group:sex are also signicant. End of Example 13.2 Remark: The ANOVA dialog box has options for producing diagnostic plots and making comparisons between the levels of the various categorical variables. These plots and comparisons are typically covered in a more advanced course on ANOVA. Command Line Notes For the standard deviations grouped by group and sex, use the tapply command. > attach(Example13.11) > tapply(beats, list(group, sex), stdev) For the preliminary plots, use the plot.factor, plot.design, and interaction.plot commands. > > > > > > > > > par(mfrow = c(1,2)) plot.factor(beats ~ group + sex) graphsheet() # new graph sheet plot.design(beats ~ group + sex) graphsheet() par(mfrow = c(1,2)) interaction.plot(group, sex, beats) interaction.plot(sex, group, beats) detach() The par command sets layout parameters. In the above, we specied a layout of one row and two columns. The command aov can be used to perform two-way ANOVA. > heart.fit = aov(beats ~ group + sex + group:sex, + data = Example13.11) > summary(heart.fit) 13.3 Exercise 1. Exercise 13.19 in IPS describes a study to compare certain biological materials for use as new tissue. The data are in Exercise13.19 in the IPSdata library. The group variable indicates the type of material used, the material variable is a numeric version of the group variable, and the weeks variable is when the response was measured. The gpi variable is the response, the percent of glucose phosphated isomerase (Gpi) cells in the region of a wound. Run the two-way ANOVA and include any preliminary analyses. Note: The explanatory variables must be factors (see Section 0.4.5). You will need to convert weeks to a factor. 161 13.4 Solution Figure 13.7 Because there are only three data values for each level of the variable Material1, a boxplot of Gpi grouped by Material1 would not be informative. We look at the means using the Design Plot... option. Figure 13.8 That plot indicates that week has little eect, but the type of material has a strong eect. An Interaction Plot... option shows the same information and shows there are possible interactions, albeit weak. 162 Figure 13.9 The numerical calculations are done from the ANOVA menu. They show that both weeks and the interaction are not signicant. Figure 13.10 From the command line, > Exercise13.19$weeks = as.factor(Exercise13.19$weeks) > attach(Exercise13.19) 163 > > > > > > > + > > par(mfrow = c(1,2)) plot.factor(gpi ~ group + weeks) graphsheet() plot.design(gpi ~ group + weeks) interaction.plot(group, weeks, gpi) detach() gpi.fit = aov(gpi ~ group + weeks + group:weeks, data=Exercise13.19) summary(gpi.fit) detach() 164 Chapter 14 Bootstrap Methods and Permutation Tests This chapter describes how to use resampling methods, the bootstrap and permutation tests, to calculate standard errors, to determine whether distributions are close enough to normal to use t tests and condence intervals, to calculate condence intervals, and for signicance testing. The techniques in this chapter can be applied to problems found in other chapters, especially Chapters 7, 8, 10, and 11. This chapter can supplement those chapters, providing an alternate approach with certain advantages, including a consistent approach that applies to a wide variety of problems (obviating the need for a cookbook of formulas to apply in dierent cases), and concrete graphical representations of some of the concepts that students nd most dicult, including sampling distributions, standard errors, condence intervals, and P -values. Most of this chapter is laid out according to the type of problem to be handled, starting with the oneand two-sample means problems of Chapter 7. You may choose to do the beginning of this chapter after Chapter 7, then read additional parts as supplements to later chapters when you come to them. Additional sources of information: There are some additional sources about resampling in connection with IPS besides this chapter. When the IPSdata library is loaded (see Section 0.4.2) it adds a menu to the right of the main menu bar with four options, of which the rst three are informational: IPSdata Library Users Manual is a guide to use the GUI menus for doing bootstrap and permutation test calculations. Example Code for Exercises contains example command-line code for all exercises in IPS Chapter 14. IPSdata help is primarily a listing of the data sets in the library and the variables they contain. Figure 14.1 Loading the IPSdata automatically loads the resample library, which adds an additional manual at the main menu under Help > Online Manuals > Resample Library Users Manual and online help les for resampling functions and the resampling GUI at Help > Available Help > resample. You can also load the resample library separately (for analyzing data not in IPSdata), see Section 0.4.3 or 0.7.1. 165 14.1 One and Two Means (Chapter 7) In Chapter 7 we looked at condence intervals and signicance tests for a single mean, for the mean dierence of matched pairs, and for the dierence of two independent means, using t tests and condence intervals. In this section we look at using resampling for ve of those six situationseverything except signicance tests for a single mean. We also do three things that are not possible with t tests and intervals: use robust alternatives like trimmed means in place of simple means, accurate handling nonnormal populations with small samples (and knowing when this is the case), and condence intervals and tests for the ratio of means. 14.1.1 One mean Example 14.1 Corn soy blend (CSB)IPS Example 7.1 Researchers wished to evaluate appropriate vitamin C levels (mg/100gm) in a random sample of eight batches of CSB from a production run. Our goal is to nd a 95% condence interval for the mean vitamin C content of CSB produced during this run. In Chapter 7 we did this using t condence intervals, based on the standard error s/ n. Here we use a bootstrap approach. 1. Open the IPSdata library (Section 0.4.2). This automatically opens the resample library. 2. Open the Example7.1 data set. 3. At the menu, select Statistics > Compare Samples > One Sample > t Test/Resample.... 4. For Data Set enter Example7.1, and for Variable: select vitc. 5. In the Hypotheses box, for Mean Under the Null Hypothesis: enter 40 and for Alternative Hypothesis: select two.sided. 166 Figure 14.2 6. Select the Bootstrap tab, and check the box to Perform Bootstrap. 7. In the Summary Results box, check the box for t Intervals using Bootstrap SE. 167 Figure 14.3 8. Click OK. 168 bootstrap : Example7.1$vitc : mean 28 mean 16 18 20 22 mean 24 26 28 16 0.0 18 20 0.05 22 24 Density 0.10 26 Observed Mean 0.15 -2 0 Quantiles of Standard Normal 2 Figure 14.4 The plots are a histogram and normal quantile plot of the bootstrap distribution. This is the distribution of means of many random samples. The key things to look for are spread and normality. First, the spread of the bootstrap distribution gives an idea of the uncertainty in the statistichow much does the statistic (the mean) vary when the data it uses come from random sampling? The middle is about 22.5 (the statistic, the mean of the original data), but there is a fairly substantial spread. We get a qualitative sense of the uncertainty just looking at the bootstrap distribution. We can also quantify the uncertainty in two ways: (1) the middle 95% of the bootstrap distribution gives a quick-and-dirty 95% condence interval, and (2) computing the standard deviation. That standard deviation measures the uncertainty in the statistic; in other words, it is a standard error (SE) for the statistic. This may give you a better idea what a standard error meansit is a standard deviation for the uncertainty of a statistic. With the bootstrap we compute the standard error directly, from the bootstrap distribution. With a formula like s/ n we do so indirectly, based on statistical theory. The second thing to look for is that the bootstrap distribution is quite normalwe see this in both the histogram and the normal quantile plot. Hence, even with the small sample size, Chapter 7 methods based on normality should be accurate. Now turn to the printed results, which include: 169 Summary Statistics: Observed Mean Bias SE mean 22.5 22.48 -0.0225 2.45 Percentiles: 2.5% 5% 95% 97.5% mean 17.5 18.25 26.25 27.125 T Confidence Intervals using Bootstrap Standard Errors: 2.5% 5% 95% 97.5% mean 16.69778 17.85287 27.14713 28.30222 The observed value is the mean of the original data; the SE is the standard error, 2.45. For comparison, the formula SE is s/ n = 2.54. The bootstrap provides a way for you to check if you calculated the standard error correctly using the formula. The range of the middle 95% of the bootstrap distribution, between the 2.5% and 97.5% percentiles, (17.5, 27.1) is a quick-and-dirty condence interval, and provides a way to check formula condence intervals. For comparison, the 95% t condence interval is (16.5, 28.5) (this quick-and-dirty bootstrap interval tends to be too narrow for very small samples). A second quick-and-dirty way to use bootstrapping for condence intervals is to do a t interval, but using the bootstrap SE in place of a formula SE. This is labeled here as T Confidence Intervals using Bootstrap Standard Errors (called bootstrap t intervals in IPS), and is (16.7, 28.3), which is fairly close to the formula t interval. We can use the same procedures for one-sided condence intervals. Here a one-sided 95% percentile interval is the bottom 95% of the bootstrap distribution, (, 26.2). For comparison, the formula t interval is (, 27.3), and the t interval using the bootstrap standard error is (, 27.1). In this problem, all the intervals gave roughly the same results, so we would not have to use the bootstrap. The main things the bootstrap gives us here are assurance that it is OK to use t intervals (because the bootstrap distribution is normal) and a concrete picture of the variability of the statistic. Command Line Notes > save.boot = bootstrap(Exercise7.1$vitc, mean) > save.boot # print the results > plot(save.boot) > qqnorm(save.boot) > limits.percentile(save.boot) > limits.t(save.boot) End of Example 14.1 Remark: Resampling is random, so the plots and numbers you create will dier from ours. But they should not be much dierent, unless you use less than 1000 resamples. Example 14.2 CLEC data For comparison, here is a situation in which t intervals and condence intervals should not be used because the underlying distribution is not normal, and the sample size is not large enough to allow use of t methods in spite of the non-normality. The CLEC data is a subset of the Verizon data set described in the chapter. This data set consists of 23 observations from a very skewed distribution. To bootstrap the distribution of the mean, using the GUI: 170 1. Open the CLEC data set (Section 0.4.1). 2. Begin by plotting the data, using a histogram and normal quantile plot from the 2D Plots palette . Note that the data are very skewed. You should make the histogram bars narrower, for example width 5 (see Section 1.1.2). 3. At the menu, select either Statistics > Compare Samples > One Sample > t Test/Resample... or (equivalently) Statistics > Resample > One sample t. 4. For Data Set: enter CLEC. For Variable: select Time. 5. Select the Bootstrap tab, and check the box to Perform Bootstrap. 6. In the Summary Results box, check the box by t Intervals using Bootstrap SE. Figure 14.5 7. Click OK. 171 bootstrap : CLEC$Time : mean 0.12 Observed Mean 0.08 0.10 Density 0.06 mean 10 15 20 mean 25 30 35 0.04 0.02 0.0 10 15 20 25 30 -2 0 Quantiles of Standard Normal 2 Figure 14.6 The bootstrap distribution shows substantial skewness, indicating that t intervals or signicance tests would be inaccurate. Note that while this amount of skewness in data might be acceptable, it is not acceptable in a sampling distribution, where the central limit theorem has already had its chance to work. Any dierences from normality here translate directly into errors if you use t condence intervals or signicance tests. The printed results include t condence intervals and signicance tests (not shown here), saddlepoint inferences (which you may ignore), and the following bootstrap results: 172 *** Bootstrap Results *** Call: bootstrap(data = CLEC$Time, statistic = mean, trace = F, save.indices = T) Number of Replications: 1000 Summary Statistics: Observed Mean Bias SE mean 16.51 16.65 0.1394 4.092 Percentiles: 2.5% 5% 95% 97.5% mean 10.10785 10.74915 24.62728 26.15751 . . . T Confidence Intervals using Bootstrap Standard Errors: 2.5% 5% 95% 97.5% mean 8.011573 9.475426 23.54284 25.00669 The printout includes the following items: Call is the command you could give to create this bootstrap object from the command line. (Note that you would not get exactly the same results, because you would be using dierent random numbers.) Observed is the mean of the original data, 16.51. Mean is the mean of the bootstrap distribution (that is, the average of the 1000 bootstrap means). Bias is the estimated bias, Mean Observed. SE is the bootstrap standard error. Percentiles are percentiles of the bootstrap distribution. The interval (10.1, 26.2) is the 95% bootstrap percentile condence interval. (Other condence intervals) The BCa and Tilting condence intervals are omitted here. T Confidence Intervals using Bootstrap SE are called bootstrap t condence intervals in IPS. These are 16.51 t 4.092. Note that the bootstrap percentiles and the t condence intervals are quite dierent; this is another indication that it would not be appropriate to use t condence intervals. Instead, we would prefer the percentile intervals, or better yet either of the other two intervals (not shown), the BCa or Tilting intervals. 173 Command Line Notes The basic function for one-sample bootstrapping is bootstrap, for which the rst two arguments are the data to bootstrap, and the statistic to apply to each bootstrap sample. > > > > > > save.boot = bootstrap(data = CLEC$Time, statistic = mean) save.boot par(mfrow = c(1,2)) # 1 row, 2 columns plot(save.boot) qqnorm(save.boot) par(mfrow = c(1,1)) # restore default layout There are other arguments you can pass to bootstrap, including label (used for printing and plotting) seed (to make results reproducible), and save.indices (to speed up computing condence intervals). For a more complete list see > help(bootstrap) # most common arguments > help(bootstrap.args) # more arguments, and more detail To compute condence intervals, do for example: > > > > > limits.percentile(save.boot) limits.percentile(save.boot, probs = c(.025, .975)) limits.t(save.boot) limits.bca(save.boot) limits.tilt(save.boot) Trimmed Mean One nice thing about the bootstrap is that it lets us handle a wide variety of statistics easily, without knowing a formula for the standard error of every statistic. We may prefer to use a trimmed mean rather than an ordinary mean. For example, scoring in Olympic gymnastics and diving uses trimmed means, in which the high and low scores are discarded and the mean of the remaining observations is computed. A trimmed mean is particularly useful for data that may have outliers, or otherwise have long tails. An extreme example of a trimmed mean is a median, where all observations but the middle one or two are discarded. The procedure for bootstrapping a trimmed mean is identical, except that after step 4 above, there is an additional step: For Trim % each side: enter the percentage of observations to trim; for a 25% trimmed mean, enter 25 (not 0.25). 174 Figure 14.7 The resulting bootstrap distribution is nearly normal, is centered at the trimmed mean of the original data (13.8), and the bootstrap distribution appears narrower than for the untrimmed mean. This is born out by standard error of 2.7, compared to 4.1 for the untrimmed mean. This is reasonable, as we exp...

Find millions of documents on Course Hero - Study Guides, Lecture Notes, Reference Materials, Practice Exams and more. Course Hero has millions of course specific materials providing students with the best way to expand their education.

Below is a small sample set of documents:

Stevens - MA - 611
Homework 6 Math 611 ProbabilityThis assignment is not going to be graded. It serves as Final test training exercise.(1) Let X1 , X2 , . . . be iid Cauchy distributed random variables with density 1 f (x) = (1+x2 ) and characteristic function (t) =
Stevens - MA - 611
Homework 2 Math 611 ProbabilityNovember 25, 2005NB: This is an extra points assignment (it is not mandatory). There are 15 problems, the more you solve, the more points you may get. Its due date is TBA. (1) We play a game which consists in tossing
Stevens - MA - 635
Final Examination (take-home)Ma635 Real Analysis Idue Thursday Dec 13, 20071) (Egoro, 1911). Let E be a set of nite measure and let f and f1 , f2 , . . . be measurable functions on E such thatklim fk (x) = f (x),a.e. on E.Show that for eve
Stevens - MA - 611
Math 611 ProbabilityAugust 28, 2006Instructor: Oce: Phone:Ionut FlorescuKidde 227 (201) 216-5452 Email: Oce hours: ifloresc@stevens.edu M 4:00pm -6:00pm and by appt.Some Topics to be presented:Elements of Probability Measure, Conditional Prob
Stevens - MA - 623
Homework 2Ma623 Stochastic Processesdue Tuesday Feb 27 2006For the rst assignment please do exercises 2.6, 2.10, 2.13, 2.30, 2.33 from the notes handed in class. In addition do the following simulation exercises: (1) Using a software package simu
Stevens - MA - 222
Probability. A (very) brief history.Ionut FlorescuWhat is Probability? In essence: Mathematical modeling of random events and phenomena. It is fundamentally dierent from modeling deterministic events and functions, which constitutes the traditional
Stevens - MA - 331
Math 331. Introductory StatisticsInstructor: Ionut Florescu Oce: Kidde 227 Email: ifloresc@stevens.edu Phone: (201) 216-5452 Oce hours: TTh 11:00pm -12:00pm and by appt. website: http:/www.math.stevens.edu/ioresc/Teaching/2007-2008/index331.htmlT
Stevens - MA - 641
Homework 3Ma641 Time Series Idue by class time 6:15pm, Monday June 16, 2008Please try to submit a hardcopy of the report in class. If you chose the elearn submission option elearn please convert the report to pdf format before submitting.Please
Stevens - CS - 535
Financial Computing: Asset Price Walks, Pricing Bonds and Options SyllabusIntroduction to Markets1. Markets, assets, derivatives. 2. Interest rates and present value. 3. Options, futures, forward contracts, and hedging strategies. Readings [WHD: C
Stevens - CS - 437
Computer Graphics CS437 A Frame Buer Primer. Double Buering and Animation BasicsGeorge KamberovStevens Institute of Technology, Hoboken, NJ 07030, USATopics The frame buer (FB) The color buer, double buering and animation The z-buer and hidden
Stevens - CS - 535
Asset Price Walks, Fokker-Plank, Wiener Process, and Ito IntegralsCS 535 George KamberovToday Introduction and motivation Asset price walks, the stochastic DE approach Random noise. From discrete to continuous random walks Fokker-Plank He
Stevens - CS - 535
Crash Course in Probability and Stochastic ProcessesCS 535 George KamberovFermat, Laplace, Gauss, Kolmogorov, ItoG. Kamberov CS535: Lecture 3: Probabbility 11Lebesgue, Dini, Fatou, Fubini,Wiener, DiracG. Kamberov CS535: Lecture 3: Probabbi
Stevens - CS - 437
Computer GraphicsInteractive, Animated GraphicsIn this moduleInteraction Keyboard, mouse, menus Selection and picking Readings: Angel 2nd edition or Angel 3rd edition Chapter 3; OpenGL Primer, Chapter 3; The Red Book, Chapter 1, pp 20-27, Chap
Stevens - CS - 437
ReflectionsInteractive Computer Graphics G. KamberovTechniques Ray tracing-based Environment mapping Doing it on the cheap planar reflectors Using texture mapping Using the stencil bufferInteractive Computer Graphics G. Kamberov1Today
Stevens - CS - 437
CS 437 Tentative SyllabusTue 08/30 Introduction, Administratrivia, Course Summary. Key tasks: Modeling, Shading, Viewing, Animation, Rendering, Interaction Th 09/1 Graphics architectures, API layers. Rendering APIs, Direct3D OpenGL, pipelines; rende
Stevens - CS - 537
CS 537 Homework 2 In this homework we will concentrate on building a simple scene, interaction techniques, and some advanced viewing. The scenario you should keep in mind is a game in which you can pick one of several objects in your hand and then ex
Stevens - CS - 537
CS537 Final project requirementsINSTRUCTOR: G. KAMBEROV The final project must show your mastery of the fundamentals of computer graphics covered in the course: Modeling and scene hierarchy Smooth animation of camera and object motions Entertai
Stevens - CS - 537
Computer Graphics CS537 The Flying Camera BrothersGeorge KamberovStevens Institute of Technology, Hoboken, NJ 07030, USAREADINGS: Angel 3d Edition, Chapter 4.10 and 4.11; Chapter 5 (in particular 5.3 and 5.7). The Red Book, Chapter 3. The Maths R
Stevens - CS - 537
Computer Graphics CS537 A Frame Buer Primer. Double Buering and Animation BasicsGeorge KamberovStevens Institute of Technology, Hoboken, NJ 07030, USATopics The frame buer (FB) The color buer, double buering and animation The z-buer and hidden
Stevens - CS - 639
Unpopping: Solving the Image-Space Blend ProblemMarkus Giegl Ars Creat Game Development Michael Wimmer Vienna Univ. of TechnologyAbstract This paper presents a new method for smoothly blending levels of detail or other objects in image space, ther
Stevens - CS - 639
ROAMing Terrain: Real-time Optimally Adapting MeshesMark Duchaineau;y Murray Wolinsky David E. Sigeti y Mark C. Miller Charles Aldrich Mark B. Mineev-Weinstein Los Alamos National Laboratory Lawrence Livermore National LaboratoryyAbstractTerrain
Stevens - CS - 639
.Computer Graphics in EntertainmentDesigning a PC Game Enginee outline here the requirements of a 3D game engine, illustrated by describing a particular engines components. We designed the game engine, marketed as NetImmerse, to run on PCs with a
Stevens - CS - 537
Computer GraphicsIntro and OverviewGK Intro and OverviewSo you are interested in Computer graphics, eh? Computer graphics (CG) is concerned with creating, viewing, and displaying scenes and objects on a computer. The main tasks involved in a C
Stevens - CS - 639
Automatic Bounding Volume Hierarchy Generation Using Stochastic Search MethodsKelvin NgUniversity of British Columbia Department of Computer Science kng@cs.ubc.caBorislav TrifonovUniversity of British Columbia Department of Computer Science trif
Stevens - CS - 537
Computer GraphicsSelection and Picking 1In this moduleInteraction Keyboard, mouse, menus Selection and picking Readings: Angel OpenGL Primer, Chapter 3; The Red Book, Chapter 1, pp 20-27, Chapter 13.1Interaction We can interact with the p
Stevens - CS - 537
3D Primitives and Polygonal ADT and DSCS 437/5373D PrimitivesCurvesSurfacesVolume ObjectsCS 437/53713D PrimitivesObjects With Good Characteristics Described by their surfaces; thought to be hollow Specified through a set of vertice
Stevens - CS - 537
Shadows and TransparencyInteractive Computer Graphics G. Kamberov1Introduction Shadows in CG Why? What? Basic Algorithms. Transparency Algorithms Readings: FVDFHP Chapter 14.4 and 14.5 Angel Chapter 5.10, Red Book 4th edition, 446-450,
Stevens - CS - 639
CS 539 Lecture 3(Terrain)KamberovTerrain 1Were are we &amp; where are we goingKamberovTerrain 2TerrainG. KamberovKamberovTerrain: applications Terrain-based computer games Training and simulation flight simulators, AA, Mission pla
Stevens - CS - 537
CS 537 Homework 3Problem The goal in this homework is to build a small world and a ying camera which will enable you to navigate smoothly in a 3D world. The user is identied with a the ying camera. You must use hierarchical modeling to build the sc
Stevens - CS - 537
CS 537 Homework 3Problem The goal in this homework is to build a a small world and a ying camera which will enable you to navigate smoothly in a 3D world. The user is identied with a the ying camera. You must use hierarchical modelling to build the
Stevens - CS - 537
Math Foundations of CG TransformationsMath 2CS4371Today: The issuesHow do we manipulate, move, reposition, etc? Particular applications:Instancing transformations Local transformation matrix Animation Camera positioning Viewing and projecti
Stevens - CS - 537
CS437/537: Lecture 2 G. KamberovToday Graphics architectures principles APIs. Libraries Programming paradigms Event driven programming Program structure Pipeline architecture1Assignments Reading Angel 4th Edition: 1.6.4-1.9, 2.1-2.5
Stevens - CS - 537
Texture MappingReadings:Angel: Chapter 9 (2nd edition), Chapter 7 (3d edition) Red Book: Chapter 9 Foley Van Dam, et al Chapter 14.3 Angels OpenGL Primer: Chapter 8, see also Chapter 7 about more Angel details on the pixel pipelineTexture Mapping
Stevens - CS - 537
Lines, planes, affine combinations, and containmentG. Kamberov CS437/537Line: parametric equationA line, defied by a point P0 and a vector d consists of all points P obtained by Velocity vector P () = P0 + d where varies over all scalars. P ()
Stevens - CS - 537
Building and Viewing an Interactive Scene TransformationsCS 437/537Today Coordinate systems and transformations The matrix pipeline Modeling: Building a scene Symbols and instancing Hierarchical models Attributes and rendering states Anima
Stevens - CS - 537
CS 537 Homework 2 In this homework we will begin to build a true 3D graphics application. The scenario you should keep in mind is a game in which you can pick one of several objects in your hand and then explore it by turning it around and looking at
Stevens - CS - 437
Intersection Testing Ray-XCS437: X-testing1The problemQuestion: Given a ray Does it intersect an object, a primitive? Where? Here we will be concerned with the first intersectionP + tv , t 0vPCS437: X-testing2ApplicationsRay tracin
Stevens - CS - 410
Stevens - CS - 437
Global Illumination Local Illumination (e.g. Phong) shaded color depends purely on local surface configuration Other surfaces have no effect. No shadows, not effected by their emissions Global Illumination:Illumination of one object depends on
Stevens - CS - 437
Computer GraphicsCS4373D Viewing and Projections1Readings Angel 2nd or 3rd Edition Chapter 5 Angels OpenGL Pirmer Chapter 4 The Red Book Chapter 321World, Camera. Rendering!The rendering3Things to do Define (buy, build, steal) a
Stevens - CS - 437
Computer GraphicsBuilding and Viewing an Interactive Scene TransformationsCS 437CG tasks A computer graphics program has to take care of the following main tasks Build a scene, or a sequence of scenes (animation) View the scene Facilitate int
Stevens - CS - 437
Computer GraphicsBuilding and Viewing an Interactive Scene TransformationsCS 437CG tasks A computer graphics program has to take care of the following main tasks Build a scene, or a sequence of scenes (animation) View the scene Facilitate int
Stevens - CS - 437
Texture MappingReadings:Angel: Chapter 9 (2nd edition), Chapter 7 (3d edition) Red Book: Chapter 9 Angels OpenGL Primer: Chapter 8, see also Chapter 7 about more details on the pixel pipelineTexture Mapping ReviewnnnMotivation: to model re
Stevens - CS - 535
Put Call RatioCS 535 George KamberovPessimism : Optimism 0.7 : 1Pessimism : Optimism 1.7 : 1P/C Jan 02 - May 31, 20021.56Pessimism0.78Optimism0.00 1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97 101April 1 to
Stevens - CS - 437
CS 437 Vector and Affine SpacesGK: 437What is this all about?Building 3D graphics does need much more math than 2D graphics. In the next two modules we will cover the basics. The components needed to build objects and scenes: points, lines, plan
Stevens - CS - 437
Math Foundations of CG TransformationsMath 2CS4371Today: The issuesnHow do we manipulate, move, reposition, etc? Particular applications:n n n n nInstancing transformations Local transformation matrix Animation Camera positioning Viewin
Stevens - CS - 535
Stevens - CS - 535
Fokker-Plank, Normal Distribution, Wiener Process, and Ito IntegralsCS 535 George KamberovToday Normal distribution Wiener process Ito integral. Stochastic DE Log normal asset price modelG. Kamberov CS535: Lecture 4b: Normal Distribution, W
Stevens - CS - 535
CS 535: Expected Price and its Growth Rate
Stevens - E - 344
Problem Set 3 Due: Tuesday, 21 September, 2004 at the beginning of lectureE-344 Fall 2004In the following multiple-choice questions, indicate the single and best answer. 1. Ceramic materials are often made by sintering of green powder compacts, b
Stevens - E - 344
The body-centered tetragonal (BCT) crystal structure is just like the body-centered cubic structure except that it is elongated in one direction. For BCT: a = b c and = = = 90o. The body diagonal is a close-packed direction. Assume c=1.25a. Show
Stevens - E - 344
Make an argument why hydrochloric acid (HCl) boils at T = -84.9 whereas hydroflouric acid (HF) boils at T = 19.54 .
Stevens - E - 344
Chapter 5. Mechanical PropertiesAfter having studied this chapter you will be able to: Describe the importance of mechanical properties of materials in the utilization and in the fabrication of machine elements. Draw a stress-strain curve an
Stevens - MIS - 620
Technical Report CMU/SEI-94-TR-013 ESC-TR-94-013 August 1994Benefits of CMM-Based Software Process Improvement: Initial ResultsJames Herbsleb Anita Carleton James Rozum Jane Siegel David ZubrowTechnical ReportCMU/SEI-94-TR-013 ESC-TR-94-013 Aug
Stevens - MIS - 620
Kweku Ewusi-MensahABANDONED Information SystemsDevelopment ProjectsWhat is it about IS development projects that make them susceptible to cancellations?Critical Issues inAmedium-sized electronic distribution company with plans for becoming
Stevens - MIS - 620
If We Build It, They Will Come: Designing Information Systems that People Wan.Markus, M Lynne; Keil, Mark Sloan Management Review; Summer 1994; 35, 4; ABI/INFORM Global pg. 11Reproduced with permission of the copyright owner. Further reproduction
Stevens - CS - 558
Assignment 4 CS 558 Spring 2007Due: Monday May 14 Remember: no collaboration; cite all source material used (especially code)All code and results should be annotated and explained. Results will also be graded on computation speed.Assignment 4 Yo
Stevens - CS - 552
Research Registry for Neonatal Lupus: Development Plan Revision 1Author: Aaron MattI. Brief Functional DescriptionRRNL began 10 years ago with a grant from the NIH. Today, over 400 families have participated in the study. Staff at NYU's Tische H
Stevens - CS - 600
CS 600 Analysis of AlgorithmsHomework 7Due Friday, November 12, 2004. Programming Assignment. Implement in C/C+ a more advanced version of radix sort, where the size of each digit is decided based on n, the number of elements you are asked to sort,
Stevens - CS - 590
Homework Assignment 1due February 20, 2001Write a program that does the following: (a) Reads in 3 lines of text from a file named infile.dat. (b) Prints the text on the screen without any punctuation symbols. Words are separated by space. (c) Print
Stevens - CS - 590
CS 590 Introduction to Data Structures and AlgorithmsAssignment 4 SortingTask: You are given a set S of n numbers. Produce a sorted sequence of S in non-decreasing order. a) Use the following two methods: Insertion Sort MergeSort b) You will run ea