mulreg_text

Course: BINF 5035, Fall 2009
School: University of Medicine...
Health Sciences M.Sc. Programme
Applied Biostatistics
Week 10: Multiple regression

More than one predictor

In Week 9 we looked at regression with one predictor variable. Often we would like to use more than one predictor variable. In this lecture we look at how to do that for a continuous outcome variable, and describe a related method for use when the outcome is dichotomous.

Table 1 shows the ages, heights and maximum voluntary contraction of the quadriceps muscle (strength) in a group of male alcoholics. The outcome variable is strength. Figure 1 shows the relationship between strength and height. We can fit a regression line:

    strength = −908 + 7.20 × height

This enables us to predict what the mean strength would be for men of any given height. But strength varies with other things besides height. Figure 2 shows the relationship between strength and age. We can fit a regression line from which we could predict the mean strength for any given age:

    strength = 502 − 4.12 × age

However, strength would still vary with height. To investigate the effect of both age and height, we can use multiple regression to fit a regression equation:

    strength = −466 + 5.40 × height − 3.08 × age

The coefficients are calculated by a least squares procedure, exactly the same in principle as for simple regression. In practice, this is always done using a computer program. From this equation, we would estimate the mean strength of men with any given age and height, in the population of which these are a sample.

In this multiple regression equation, 5.40 is the estimated difference in mean muscle strength (in newtons) between men of the same age who differ in height by one centimetre. Similarly, −3.08 is the estimated difference in mean muscle strength between men of the same height who differ in age by one year, i.e. men who are one year older have mean muscle strength lower by 3.08 newtons. We say that 5.40 is the effect of height adjusted for age.
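The least squares calculation described above can be sketched in a few lines. This is a minimal illustration, not the program used for the lecture: it solves the normal equations by Gaussian elimination, and the names `solve` and `least_squares` are assumptions made for this example. The data are the 41 rows of Table 1; if they are transcribed correctly, the printed coefficients should be close to those in the multiple regression equation above.

```python
# Table 1 data: (age in years, height in cm, strength in newtons) for 41 men.
DATA = [
    (24, 166, 466), (27, 175, 304), (28, 173, 343), (28, 175, 404),
    (31, 172, 147), (31, 172, 294), (32, 160, 392), (32, 172, 147),
    (32, 179, 270), (32, 177, 412), (34, 175, 402), (34, 180, 368),
    (35, 167, 491), (37, 175, 196), (38, 172, 343), (39, 172, 319),
    (39, 161, 387), (39, 173, 441), (40, 173, 441), (41, 168, 343),
    (41, 178, 540), (42, 178, 417), (47, 171, 294), (47, 162, 270),
    (48, 177, 368), (49, 177, 441), (49, 178, 392), (50, 167, 294),
    (51, 176, 368), (53, 159, 216), (53, 173, 294), (53, 175, 392),
    (53, 172, 466), (55, 170, 304), (55, 178, 324), (55, 155, 196),
    (58, 160, 98), (61, 162, 216), (62, 159, 196), (65, 168, 137),
    (65, 168, 74),
]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def least_squares(rows, y):
    """Return [intercept, b1, b2, ...] minimising the sum of squared residuals.

    Each element of `rows` holds the predictor values for one subject.
    """
    design = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(design[0])
    # Normal equations: (X'X) b = X'y.
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


predictors = [(height, age) for age, height, _ in DATA]
strength = [s for _, _, s in DATA]
intercept, b_height, b_age = least_squares(predictors, strength)
print(f"strength = {intercept:.0f} + {b_height:.2f} x height + {b_age:.2f} x age")
```

Solving the normal equations directly is adequate for three coefficients and 41 observations; statistical packages generally use more numerically stable decompositions.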
Both coefficients are closer to zero than they are in the separate regressions. They are pulled towards zero because, as Figure 3 shows, age and height are related:

    height = 179 − 0.195 × age, P = 0.03

Age and height each explain some of the relationship between strength and the other variable.

Table 1. Maximum voluntary contraction (strength) of quadriceps muscle, age and height, of 41 male alcoholics (Hickish et al., 1989)

    Age      Height  Strength     Age      Height  Strength
    (years)  (cm)    (newtons)    (years)  (cm)    (newtons)
    24       166     466          42       178     417
    27       175     304          47       171     294
    28       173     343          47       162     270
    28       175     404          48       177     368
    31       172     147          49       177     441
    31       172     294          49       178     392
    32       160     392          50       167     294
    32       172     147          51       176     368
    32       179     270          53       159     216
    32       177     412          53       173     294
    34       175     402          53       175     392
    34       180     368          53       172     466
    35       167     491          55       170     304
    37       175     196          55       178     324
    38       172     343          55       155     196
    39       172     319          58       160     98
    39       161     387          61       162     216
    39       173     441          62       159     196
    40       173     441          65       168     137
    41       168     343          65       168     74
    41       178     540

Figure 1. Muscle strength against height (quadriceps strength in N plotted against height in cm)

Figure 2. Muscle strength against age (quadriceps strength in N plotted against age in years)

Figure 3. Relationship between height and age in Table 1 (height in cm plotted against age in years)

Significance tests and estimation in multiple regression

We can test the significance of the regression of strength on height and age together, and the significance of each predictor variable separately. These tests, and the confidence intervals that go with them, require the same assumptions as simple linear regression: independent observations, and residuals with a Normal distribution and uniform variance.
For the example, both age and height have P=0.04 and we can conclude that both age and height are independently associated with strength:

    strength = −466 + 5.40 × height − 3.08 × age
    height:  95% CI 0.25 to 10.55, P=0.04
    age:     95% CI −6.05 to −0.10, P=0.04

If we compare this with the separate regressions, we see that the P values have increased:

    strength = −908 + 7.20 × height
    height:  95% CI 2.15 to 12.25, P=0.006

    strength = 502 − 4.12 × age
    age:     95% CI −7.04 to −1.21, P=0.007

Each predictor reduces the significance of the other because they are related to one another as well as to strength. This increases the standard error of the estimates, and a variable may have a coefficient which is not significant in a multiple regression despite being related to the outcome variable in a simple regression. When the predictor variables are highly correlated, the individual coefficients will be poorly estimated and have large standard errors. Correlated predictor variables may obscure the relationship of each with the outcome variable.

We check the assumptions of Normal distribution and uniform variance as for simple linear regression, by plotting a histogram and Normal plot of the residuals and scatter plots of the residuals. We usually plot the residuals against the value of the outcome (here strength) predicted by the regression equation.

Interaction in multiple regression

An interaction between two predictor variables arises when the effect of one on the outcome depends on the value of the other. For example, tall men may be stronger than short men when they are young, but the difference may disappear as they age.

An interaction may take two simple forms. As height increases, the effect of age may increase, so that the difference in strength between young and old tall men is greater than the difference between young and old short men. Alternatively, as height increases, the effect of age may decrease.
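When a product term such as height × age is in the model, the slope for one predictor becomes a linear function of the other. The sketch below makes this concrete using the coefficients reported in this section for the model with a height × age term; the function names and the ages and heights chosen for display are illustrative assumptions.

```python
# Coefficients reported in this section for the model with a height x age term.
B0 = 4661.0
B_HEIGHT = -24.7
B_AGE = -112.8
B_INTERACTION = 0.650


def predicted_strength(height_cm, age_years):
    """Mean strength (newtons) predicted by the interaction model."""
    return (B0 + B_HEIGHT * height_cm + B_AGE * age_years
            + B_INTERACTION * height_cm * age_years)


def height_slope(age_years):
    """Change in predicted strength per extra cm of height, at a fixed age."""
    return B_HEIGHT + B_INTERACTION * age_years


def age_slope(height_cm):
    """Change in predicted strength per extra year of age, at a fixed height."""
    return B_AGE + B_INTERACTION * height_cm


for age in (30, 60):
    print(f"age {age}: {height_slope(age):+.1f} N per cm of height")
for height in (160, 180):
    print(f"height {height} cm: {age_slope(height):+.1f} N per year of age")
```

The height slope is larger at age 60 than at age 30, and the age slope is less negative at 180 cm than at 160 cm, which is the pattern the rewritten equations in this section describe.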
If we create an interaction variable = height × age and include it in the model, we can allow for either of these possibilities:

    strength = 4661 − 24.7 × height − 112.8 × age + 0.650 × height × age
    height:        P=0.02
    age:           P=0.004
    height × age:  P=0.005

The regression is still significant, as we would expect. However, the coefficients of height and age have changed; the coefficient of height has even changed sign. The regression equation can be written

    strength = 4661 + (−24.7 + 0.650 × age) × height − 112.8 × age

The coefficient of height depends on age, becoming −24.7 + 0.650 × age. The difference in strength between short and tall subjects is greater for older subjects than for younger. Or we could write

    strength = 4661 − 24.7 × height + (−112.8 + 0.650 × height) × age

The coefficient of age depends on height, becoming −112.8 + 0.650 × height. The difference in strength between young and old subjects is less for taller subjects than for shorter.

Figure 4. Interaction between age and height in their effects on muscle strength (strength in N against height in cm, with separate lines for age < 40 and age 40+)

Figure 5. Fitted quadratic curve (strength in N against height in cm, showing the linear and quadratic fits)

Figure 4 shows this interaction as separate regression lines for younger and older men.

Curvilinear regression

So far, we have assumed that all the regression relationships are linear, i.e. that we are dealing with straight lines. This is not necessarily so. We may have data where the underlying relationship is a curve rather than a straight line. Unless there is a theoretical reason for supposing that a particular form of the equation, such as logarithmic or exponential, is needed, we test for non-linearity by using a polynomial.
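One practical point about polynomial terms is that a predictor and its square are usually very strongly correlated. The sketch below computes the correlation between height and height squared for the 41 heights in Table 1, and then again after subtracting 170 cm (the round number near the mean height that the lecture uses) before squaring. The helper name `correlation` is an assumption for this example.

```python
import math

# The 41 heights (cm) from Table 1, in table order.
HEIGHTS = [
    166, 175, 173, 175, 172, 172, 160, 172, 179, 177, 175, 180, 167, 175,
    172, 172, 161, 173, 173, 168, 178, 178, 171, 162, 177, 177, 178, 167,
    176, 159, 173, 175, 172, 170, 178, 155, 160, 162, 159, 168, 168,
]


def correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


raw_square = [h ** 2 for h in HEIGHTS]
centred_square = [(h - 170) ** 2 for h in HEIGHTS]

r_raw = correlation(HEIGHTS, raw_square)
r_centred = correlation(HEIGHTS, centred_square)
print(f"height vs height squared:         r = {r_raw:.4f}")
print(f"height vs (height - 170) squared: r = {r_centred:.2f}")
```

Subtracting a constant before squaring changes how the linear and quadratic terms share the fit, but not the fitted curve itself, which is why the quadratic coefficient and its P value are unaffected.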
Clearly, if we can fit a relationship of the form

    strength = constant + constant × height + constant × age

we can also fit one of the form

    strength = constant + constant × height + constant × height²

to give a quadratic equation, which would produce a curve rather than a straight line. We can continue adding powers of height to give equations which are cubic, quartic, etc., which would produce more complex curves. For the example data we get

    strength = 1693 − 23.70 × height + 0.0918 × height²
    height:   P=0.9
    height²:  P=0.8

There is no evidence that the quadratic term improves the prediction of strength. Figure 5 shows the curve, which is hard to distinguish from the straight line.

Height and height squared are highly correlated, which can lead to problems in estimation. For the data of Table 1, the correlation between height and height squared is 0.9998. This is why the height coefficient has changed and become non-significant. To reduce the correlation, we can subtract a number close to mean height from height before squaring. Mean height is 170.7 cm, so 170 is a convenient number to subtract. The correlation between height and (height − 170) squared is −0.44, so the correlation has been reduced in magnitude, though not eliminated. The regression equation is

    strength = −961 + 7.49 × height + 0.092 × (height − 170)²
    height:           P=0.01
    (height − 170)²:  P=0.8

The coefficient and P value for the quadratic term have not changed, but the coefficient for the linear term, height, has returned to something like its former value.

Qualitative predictor variables

The predictor variables height and age are quantitative. In the study from which these data come, we also recorded whether or not subjects had cirrhosis of the liver. Cirrhosis was recorded as `present' or `absent', so the variable was dichotomous. It is easy to include such variables as predictors in multiple regression. We create a variable which is 0 if the characteristic is absent, 1 if present, and use this in the regression equation just as we did height.
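A small sketch shows what a 0/1 predictor does in a regression. With the indicator as the only predictor, the fitted slope is exactly the difference between the two group means and the intercept is the mean of the group coded 0. The seven outcome values below are invented purely for illustration (the cirrhosis status of individual subjects is not given in Table 1), and `simple_regression` is a hypothetical helper name.

```python
def simple_regression(x, y):
    """Least squares intercept and slope for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical illustration only: invented values, not data from the study.
has_condition = [0, 0, 0, 0, 1, 1, 1]          # 0 = absent, 1 = present
outcome = [310.0, 350.0, 290.0, 330.0, 280.0, 300.0, 260.0]

intercept, slope = simple_regression(has_condition, outcome)

absent = [y for x, y in zip(has_condition, outcome) if x == 0]
present = [y for x, y in zip(has_condition, outcome) if x == 1]
mean_absent = sum(absent) / len(absent)
mean_present = sum(present) / len(present)

print(f"intercept = {intercept:.1f}, mean of group coded 0 = {mean_absent:.1f}")
print(f"slope = {slope:.1f}, difference in means = {mean_present - mean_absent:.1f}")
```

When other predictors are also in the model, the coefficient of the indicator is the difference in mean outcome between the groups adjusted for those predictors, rather than the raw difference in means.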
The regression coefficient of this dichotomous variable is the difference in the mean of the outcome variable between subjects with the characteristic and subjects without. If the coefficient in this example were negative, it would mean that subjects with cirrhosis were not as strong as subjects without cirrhosis. In the same way, we can use sex as a predictor variable by creating a variable which is 0 for females and 1 for males. The coefficient then represents the difference in mean between male and female. If we use only one, dichotomous predictor variable in the equation, the regression is exactly equivalent to a two sample t test between the groups defined by the variable.

For the strength data, we define a variable cirrhosis = 1 if the subject has cirrhosis, 0 if not.

    strength = −544 + 5.86 × height − 2.75 × age − 34.5 × cirrhosis
    height:     P=0.03
    age:        P=0.07
    cirrhosis:  P=0.3

Men with cirrhosis have mean strength lower than men without cirrhosis, of the same height and age, by 34.5 newtons (but this is not significant, 95% CI for coefficient = −100 to +31).

When we have continuous and categorical predictor variables, regression is also called analysis of covariance or ancova. The continuous variables (here height and age) are called covariates and the categorical variables (here cirrhosis) are called factors. We can also have factors with more than two categories or classes, but the analysis is more complicated and we shall omit it here.

Logistic regression

Logistic regression is used when the outcome variable is dichotomous, a `yes or no', whether or not the subject has a particular characteristic such as a symptom. We want a regression equation which will predict the proportion of individuals who have the characteristic, or, equivalently, estimate the probability that an individual will have the characteristic. We cannot use an ordinary linear regression equation, because this might predict proportions less than zero or greater than one, which would be meaningless.
If we used the odds, rather than the proportion, as the outcome we would have a variable which could take any positive value, but could not be negative. We use the log of the odds, also called the logistic transformation or logit of the proportion, as the outcome variable. The logit can take any value from minus infinity, when the proportion = 0, to plus infinity, when the proportion = 1. We can fit regression models to the logit which are very similar to the ordinary multiple regression models found for data from a Normal distribution. We assume that relationships are linear on the logistic scale. The method is called logistic regression, and the calculation is computer intensive. The effects of the predictor variables are found as log odds ratios. We will look at the interpretation in an example.

When giving birth, women who have had a previous caesarean section usually have a trial of scar, that is, they attempt a natural labour with vaginal delivery and only have another caesarean if this is deemed necessary. Several factors may increase the risk of a caesarean, and in this study the factor of interest was obesity, as measured by the body mass index or BMI, defined as weight/height² (data of Andreas Papadopoulos). For caesareans, the mean BMI was 26.4 kg/m² and for vaginal deliveries the mean was 24.9 kg/m². Two other variables had a strong relationship with a subsequent caesarean. Women who had had a previous vaginal delivery (PVD) were less likely to need a caesarean, odds ratio = 0.18, 95% confidence interval 0.10 to 0.32. Women whose labour was induced had an increased risk of a caesarean, odds ratio = 2.11, 95% confidence interval 1.44 to 3.08. All these relationships were highly significant. The question to be answered was whether the relationship between BMI and caesarean section remained when the effects of induction and previous deliveries were allowed for.
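The logit and its inverse are simple to compute. The sketch below defines both and applies them to the fitted equation reported for the caesarean study in this section, converting each coefficient to an odds ratio by taking the antilog. The function names and the example woman (BMI 25, no induction, no previous vaginal delivery) are illustrative assumptions, not part of the study.

```python
import math


def logit(p):
    """Log odds of a proportion p, for 0 < p < 1."""
    return math.log(p / (1 - p))


def inverse_logit(log_odds):
    """Proportion corresponding to a given log odds."""
    return 1 / (1 + math.exp(-log_odds))


# Coefficients of the fitted equation reported for the caesarean study.
INTERCEPT = -3.70
COEFFICIENTS = {"BMI": 0.0883, "induction": 0.647, "PVD": -1.80}


def log_odds_caesarean(bmi, induction, pvd):
    """Predicted log odds; induction and pvd are 1 if present, 0 if not."""
    return (INTERCEPT + COEFFICIENTS["BMI"] * bmi
            + COEFFICIENTS["induction"] * induction
            + COEFFICIENTS["PVD"] * pvd)


# The antilog of each coefficient is an adjusted odds ratio.
odds_ratios = {name: math.exp(b) for name, b in COEFFICIENTS.items()}
for name, ratio in odds_ratios.items():
    print(f"odds ratio for {name}: {ratio:.3f}")

# Odds ratio for a 5 kg/m2 difference in BMI: the per-unit ratio to the power 5.
or_bmi_5 = math.exp(5 * COEFFICIENTS["BMI"])
print(f"odds ratio for 5 units of BMI: {or_bmi_5:.2f}")

# Illustrative prediction for a hypothetical woman.
p_example = inverse_logit(log_odds_caesarean(bmi=25, induction=0, pvd=0))
print(f"predicted probability of caesarean: {p_example:.2f}")
```

Because the inverse logit always returns a value strictly between 0 and 1, predictions made on the logit scale can never give an impossible proportion, which is the reason for modelling the log odds rather than the proportion itself.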
The logistic regression equation predicting the log odds of a caesarean was:

    log odds caesarean = −3.70 + 0.0883 × BMI + 0.647 × induction − 1.80 × PVD
    BMI:        95% CI 0.0492 to 0.1275, P<0.001
    induction:  95% CI 0.228 to 1.067, P=0.003
    PVD:        95% CI −2.38 to −1.21, P<0.001

where induction and PVD are 1 if present, 0 if not. Because the logistic regression equation predicts the log odds, each coefficient represents the difference between two log odds, a log odds ratio. The antilog of a coefficient is thus an odds ratio. Some programs will print these odds ratios directly. If we antilog the equation we get

    odds caesarean = 0.0247 × 1.092^BMI × 1.910^induction × 0.166^PVD
    BMI:        95% CI 1.050 to 1.136, P<0.001
    induction:  95% CI 1.256 to 2.906, P=0.003
    PVD:        95% CI 0.09 to 0.30, P<0.001

This means that induction multiplies the odds of a caesarean by 1.910 and a previous vaginal delivery multiplies the odds by 0.166, i.e. reduces them. These are often called adjusted odds ratios, and 1.91 is the odds ratio for induction of labour adjusted for BMI and previous vaginal delivery. In this example they and their confidence intervals are similar to the unadjusted odds ratios given above, because the three predictor variables happen not to be closely related to each other.

For a continuous predictor variable, such as BMI, the coefficient is the change in log odds for an increase of one unit in the predictor variable. The antilog of the coefficient, the odds ratio, is the factor by which the odds must be multiplied for a unit increase in the predictor. An increase of two units in the predictor multiplies the odds by the square of the odds ratio, and so on. A difference of 5 kg/m² in BMI gives an odds ratio for a caesarean of 1.092⁵ = 1.55, thus the odds of a caesarean are multiplied by 1.55.

Categorical variables with more than two levels

It is straightforward to use qualitative or categorical variables as predictors when there are two groups, but a bit more complicated when there are more. For example, Coste et al. (1997) followed up children of short stature given growth hormone treatment.
There were three types of treatment: human growth hormone only (311 children), human growth hormone followed by recombinant growth hormone (1455), and recombinant growth hormone only (1467). Hence the treatment is a categorical variable with three categories. If we code these as 1, 2, and 3, then put this variable as a predictor into a multiple or logistic regression, the equation is forced to estimate the difference between human growth hormone only and human growth hormone followed by recombinant growth hormone as the same as the difference between human growth hormone followed by recombinant growth hormone and recombinant growth hormone only.

What we do instead is to set up what we call dummy variables, a set of variables which together represent the categorical variable and which can be used in the regression equation. One way to do this would be:

    dummy1 = 1 if human growth hormone only, 0 if any other treatment
    dummy2 = 1 if human growth hormone followed by recombinant growth hormone, 0 if any other treatment

We do not need a dummy3, because if dummy1 = 0 and dummy2 = 0, we must have the third treatment, recombinant growth hormone only. We need one fewer dummy variables than there are categories. If we put both dummy variables as predictors into a multiple or logistic regression, the coefficient of dummy1 represents the difference between human growth hormone only and recombinant growth hormone only.

...
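The dummy coding above can be written as a short function. This is a sketch only: the treatment labels are shortened forms chosen for this example, and `dummy_code` is a hypothetical name. Recombinant growth hormone only is the reference category, so it is the one coded (0, 0).

```python
# Shortened labels for the three treatments (assumed names for this example).
HUMAN_ONLY = "human only"
HUMAN_THEN_RECOMBINANT = "human then recombinant"
RECOMBINANT_ONLY = "recombinant only"   # reference category
TREATMENTS = (HUMAN_ONLY, HUMAN_THEN_RECOMBINANT, RECOMBINANT_ONLY)


def dummy_code(treatment):
    """Return (dummy1, dummy2) for one child's treatment."""
    if treatment not in TREATMENTS:
        raise ValueError(f"unknown treatment: {treatment!r}")
    dummy1 = 1 if treatment == HUMAN_ONLY else 0
    dummy2 = 1 if treatment == HUMAN_THEN_RECOMBINANT else 0
    return dummy1, dummy2


for treatment in TREATMENTS:
    print(f"{treatment:24s} -> dummy1, dummy2 = {dummy_code(treatment)}")
```

Three categories need only two columns in the regression, and no ordering or equal spacing between the treatments is imposed, unlike the 1, 2, 3 coding.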
