202 Pages

766note

Course: EPID 766, Fall 2009
School: N.C. State
Rating:
 
 
 
 
 

Word Count: 17185

Document Preview

766 Summer Epid 2009 EPID 766: Analysis of Longitudinal Data from Epidemiologic Studies Daowen Zhang zhang@stat.ncsu.edu http://www4.stat.ncsu.edu/dzhang2 Graduate Summer Session in Epidemiology TABLE OF CONTENTS Epid 766, D. Zhang Contents 1 Review and introduction to longitudinal studies 1.1 Review of 3 study designs . . . . . . . . . . . . . . . 1.2 Introduction to longitudinal studies . . . . . . . . ....

Register Now

Unformatted Document Excerpt

Coursehero >> North Carolina >> N.C. State >> EPID 766

Course Hero has millions of student submitted documents similar to the one
below including study guides, practice problems, reference materials, practice exams, textbook help and tutor support.

Course Hero has millions of student submitted documents similar to the one below including study guides, practice problems, reference materials, practice exams, textbook help and tutor support.
766 Summer Epid 2009 EPID 766: Analysis of Longitudinal Data from Epidemiologic Studies Daowen Zhang zhang@stat.ncsu.edu http://www4.stat.ncsu.edu/dzhang2 Graduate Summer Session in Epidemiology TABLE OF CONTENTS Epid 766, D. Zhang Contents 1 Review and introduction to longitudinal studies 1.1 Review of 3 study designs . . . . . . . . . . . . . . . 1.2 Introduction to longitudinal studies . . . . . . . . . . 1.3 Data examples . . . . . . . . . . . . . . . . . . . . . 1.4 Features of longitudinal data . . . . . . . . . . . . . 1.5 Why longitudinal studies? . . . . . . . . . . . . . . . 1.6 Challenges in analyzing longitudinal data . . . . . . . 1.7 Methods for analyzing longitudinal data . . . . . . . 1.8 Two-stage method for analyzing longitudinal data . . 1.9 Analyzing Framingham data using two-stage method 2 Linear mixed models for normal longitudinal data 2.1 2.2 2.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 5 11 12 20 22 25 30 31 33 48 What is a linear mixed (effects) model? . . . . . . . . . . 49 Estimation and inference for linear mixed models . . . . . 65 How to choose random effects and the error structure? . . 67 Graduate Summer Session in Epidemiology TABLE OF CONTENTS Epid 766, D. Zhang 2.4 2.5 2.6 Analyze Framingham data using linear mixed models . . . 68 GEE for linear mixed models . . . . . . . . . . . . . . . . 104 Missing data issues . . . . . . . . . . . . . . . . . . . . . 108 113 . 114 . 116 . 118 . 128 3 Modeling and design issues 3.1 How to handle baseline response? . . . . . . . . . . . . 3.2 Do we model previous responses as covariates? . . . . 3.3 Modeling outcome vs. modeling the change of outcome 3.4 Design a longitudinal study: Sample size estimation . . . . . . 4 Modeling discrete longitudinal data 135 4.1 Generalized estimating equations (GEEs) for continuous and discrete longitudinal data . . . . . . . . . . . . . . . . 136 4.1.1 Why GEEs? . . . . . . . . . . . . . . . . . . . . . 136 4.1.2 Key features of GEEs for analyzing longitudinal data140 4.1.3 Some popular GEE Models . . . . . . . . . . . . . 142 4.1.4 Some basics of GEEs . . . . . . . . . . . . . . . . 144 4.1.5 Interpretation of regression coefficients in a GEE Model . . . . . . . . . . . . . . . . . . . . . . . . 150 Graduate Summer Session in Epidemiology TABLE OF CONTENTS Epid 766, D. Zhang 4.1.6 4.1.7 4.2 4.3 4.4 Analyze Infectious disease data using GEE . . . . . 152 Analyze epileptic seizure count data using GEE . . 158 Generalized linear mixed models (GLMMs) . . . . . . . . . 169 4.2.1 Model specification and implementation . . . . . . 169 Analyze infectious disease data using a GLMM . . . . . . . 180 Analyze epileptic count data using a GLMM . . . . . . . . 191 202 5 Summary: what we covered Graduate Summer Session in Epidemiology CHAPTER 1 Epid 766, D. Zhang 1 Review and introduction to longitudinal studies Review of 3 study designs Introduction to longitudinal (panel) studies Data examples Features of longitudinal data Why longitudinal studies Challenges in analyzing longitudinal data Methods for analyzing longitudinal data: two-stage, linear mixed model, GEE, transition models Two-stage method for analyzing longitudinal data Analyzing Framingham data using two-stage method Graduate Summer Session in Epidemiology Slide 5 CHAPTER 1 Epid 766, D. Zhang 1.1 Review of 3 study designs Information on disease status (Y ) and exposure status (X) is obtained from a random sample at one time point. A snap shot of population. A single observation of each variable of interest is measured from each subject: (Yi , Xi ) (i = 1, ..., n). Regression such as logistic regression (if Yi is binary) can be used to assess the association between Y and X: P[Yi = 1|Xi ] log = 0 + 1 Xi 1 - P[Yi = 1|Xi ] P[Y = 1|X = 1]/(1 - P[Y = 1|X = 1]) 1 = log P[Y = 1|X = 0]/(1 - P[Y = 1|X = 0]) 1 = log odds-ratio between exposure population (X = 1) and non exposure population (X = 0). 1 > 0 = exposure population has higher probability of getting disease. 1. Cross-sectional study: Graduate Summer Session in Epidemiology Slide 6 CHAPTER 1 Epid 766, D. Zhang Given data as follows, Y =1 n11 n01 n11 n00 n10 n01 Y =0 n10 n00 X=1 X=0 then the MLE of 1 is given by 1 = log Feature: All numbers n00 , n01 , n10 , n11 are random. No causal inference can be made! 1 may not be stable (e.g., n11 may be too small). Useful public health information can be obtained, such as the proportion of people in the population with disease, the proportion of people in the population under exposure. Can account for confounders in the model. Graduate Summer Session in Epidemiology Slide 7 CHAPTER 1 Epid 766, D. Zhang 2. Prospective cohort study (follow-up study): A cohort with known exposure status (X) is followed over time to obtain their disease status (Y ). A single observation of (Y ) may be observed (e.g., survival study) or multiple observations of (Y ) may be observed (longitudinal study). Stronger evidence for causal inference. Causal inference can be made if X is assigned randomly (if X is a treatment indicator in the case of clinical trials). When single binary (0/1) Y is obtained, we have D E E n11 n01 D n10 n00 n1+ n0+ Here, n1+ and n0+ are fixed (sample sizes for exposure and nonexposure groups). Graduate Summer Session in Epidemiology Slide 8 CHAPTER 1 Epid 766, D. Zhang 3. Retrospective (case-control) study: A sample with known disease status (D) is drawn and their exposure history (E) is ascertained. Data can be summarized as D E E n11 n01 n+1 D n10 n00 n+0 where the margins n+1 and n+0 are fixed numbers. Assuming no bias in obtaining history information on E, association between E and D can be estimated. n11 Bin(n+1 , P [E|D]), n10 Bin(n+0 , P [E|D]). Odds ratio: estimate from this study n11 n00 = n10 n01 Graduate Summer Session in Epidemiology Slide 9 CHAPTER 1 Epid 766, D. Zhang estimates the following quantity = P [D|E]/(1 - P [D|E]) P [E|D]/(1 - P [E|D]) = . P [E|D]/(1 - P [E|D]) P [D|E]/(1 - P [D|E]) If disease is rare, i.e., P [D|E] 0, P [D|E] 0, relative risk of disease can be approximately obtained: P [D|E] = relative risk. P [D|E] More efficient than prospective cohort study in this case. Problem: recall bias! (it is difficult to ascertain exposure history E.) Graduate Summer Session in Epidemiology Slide 10 CHAPTER 1 Epid 766, D. Zhang 1.2 Introduction to longitudinal studies A longitudinal study is a prospective cohort study where repeated measures are taken over time for each individual. A longitudinal study is usually designed to answer the following questions: 1. How does the variable of interest change over time? 2. How is the (change of) variable of interest associated with treatment and other covariates? 3. How does the variable of interest relate to each other over time? 4. Graduate Summer Session in Epidemiology Slide 11 CHAPTER 1 Epid 766, D. Zhang 1.3 Data examples Example 1: Framingham study In the Framingham study, each of 2634 participants was examined every 2 years for a 10 year period for his/her cholesterol level. Study objectives: 1. How does cholesterol level change over time on average as people get older? 2. How is the change of cholesterol level associated with sex and baseline age? 3. Do males have more stable (true) baseline cholesterol level and change rate than females? A subset of 200 subjects' data is used for illustrative purpose. Graduate Summer Session in Epidemiology Slide 12 CHAPTER 1 Epid 766, D. Zhang A glimpse of the raw data newid id cholst sex age year 1 1244 175 1 32 0 1 1244 198 1 32 2 1 1244 205 1 32 4 1 1244 228 1 32 6 1 1244 214 1 32 8 1 1244 214 1 32 10 2 835 299 0 34 0 2 835 328 0 34 4 2 835 374 0 34 6 2 835 362 0 34 8 2 835 370 0 34 10 3 176 250 0 41 0 3 176 277 0 41 2 3 176 265 0 41 4 3 176 254 0 41 6 3 176 263 0 41 8 3 176 268 0 41 10 4 901 243 0 44 0 4 901 211 0 44 2 4 901 204 0 44 4 4 901 196 0 44 6 4 901 246 0 44 8 Graduate Summer Session in Epidemiology Slide 13 CHAPTER 1 Epid 766, D. Zhang Cholesterol level over time for a subset of 200 subjects from Framingham study Cholesterol levels over time 400 0 350 2 4 Time in years 6 8 10 Cholesterol level 150 200 250 300 Graduate Summer Session in Epidemiology Slide 14 CHAPTER 1 Epid 766, D. Zhang What we observed from this data set: 1. Cholesterol levels increase (linearly) over time for most individuals. 2. Each subject has his/her own trajectory line with a possibly different intercept and slope, implying two sources of variations: within and between subject variations. 3. Each subject has on average 5 observations (as opposed to one observation per subject for a cross-sectional study) 4. The data is not balanced. Some individuals have missing observations. 5. The inference is NOT limited to these 200 individuals. Instead, the inference is for the target population and each subject is viewed as a random person drawn from the target population. Graduate Summer Session in Epidemiology Slide 15 CHAPTER 1 Epid 766, D. Zhang Example 2: Respiratory Infection Disease Each of 275 Indonesian preschool children was examined up to six consecutive quarters for the presence of respiratory infection (yes/no). Information on age, sex, height for age, xerophthalmia (vitamin A deficiency) was also obtained. Study objectives: Was the risk of respiratory infection related to vitamin A deficiency after adjusting for age, sex, and height for age, etc.? Features of this data set: 1. Outcome is whether or not a child has respiratory infection, i.e., binary outcome. 2. Some covariates (age, vitamin A deficiency and height) are time-varying covariates and some are one-time covariates. Graduate Summer Session in Epidemiology Slide 16 CHAPTER 1 Epid 766, D. Zhang Proportions of respiratory infection and vitamin A deficiency 0.15 Proportion of vitamin A deficiency 1 2 3 4 5 6 Proportion of respiratory infection 0.10 0.05 0.0 0.0 0.02 0.04 0.06 1 2 3 4 5 6 Order of visit Order of visit Graduate Summer Session in Epidemiology Slide 17 CHAPTER 1 Epid 766, D. Zhang Example 3: Epileptic seizure counts from progabide trial In progabide trial, 59 epileptics were randomly assigned to receive anti-epileptic treatment (progabide) or placebo. The number of seizure counts was recorded in 4 consecutive 2-week intervals. Age and baseline seizure counts (for an 8 week period prior to treatment assignment) were also recorded. Study objectives: Does the treatment work? What is the treatment effect adjusting for available covariates? Features of this data set: 1. Outcome is count data, implying Poisson regression. However, over-dispersion may be a concern. 2. Baseline seizure counts were for 8 weeks, as opposed to 2 weeks for other seizure counts. 3. Randomization may be taken into account in data analysis. Graduate Summer Session in Epidemiology Slide 18 CHAPTER 1 Epid 766, D. Zhang Epileptic seizure counts from progabide trial Seizure counts for progabide arm 150 150 Seizure counts 0 1 2 Order of visit 3 4 0 50 100 Seizure counts for control arm Seizure counts 0 50 100 0 1 2 Order of visit 3 4 Graduate Summer Session in Epidemiology Slide 19 CHAPTER 1 Epid 766, D. Zhang 1.4 Features of longitudinal data Common features of all examples: Each subject has multiple time-ordered observations of response. Responses from same subject may be "more alike" than others. Inference NOT in study subjects, but in population from which they are from. # of subjects >> # of observations/subject Source of variations between and within subject variations. Difference in the examples: Different type of responses (continuous, binary, count). Objectives depend on type of study "mean" behavior, etc. Graduate Summer Session in Epidemiology Slide 20 CHAPTER 1 Epid 766, D. Zhang Comparison of data structures: Classical study Subject 1 2 Data x1 y1 x2 2 Subject 1 Longitudinal study Data x11 , x12 , ..., x15 y11 , y12 , ..., y15 x21 , x22 , ..., x25 Time t11 , t12 , ..., t15 t11 , t12 , ..., t15 t21 , t22 , ..., t25 y21 , y22 , ..., y25 t21 , t22 , ..., t25 y2 For simplicity, we consider one covariate case. Graduate Summer Session in Epidemiology Slide 21 CHAPTER 1 Epid 766, D. Zhang 1.5 Why longitudinal studies? 1. A longitudinal study allows us to study the change of the variable of interest over time, either population level or individual level. 2. A longitudinal study enables us to separately estimate the cross-sectional effect (e.g., cohort effect) and the longitudinal effect (e.g., aging effect): Given yij , ageij (j = 1, 2, , ni , j = 1 is baseline). In a cross-sectional study, ni = 1 and we are forced to fit the following model yi1 = 0 + C agei1 + i1 . That is, C is the cross-sectional effect of age. With longitudinal data (ni > 1), we can entertain the model yij = 0 + C agei1 + L (ageij - agei1 ) + ij . Graduate Summer Session in Epidemiology Slide 22 CHAPTER 1 Epid 766, D. Zhang Then yi1 = 0 + C agei1 + i1 (let j = 1), yij - yi1 = L (ageij - agei1 ) + ij - i1 . That is, L is the longitudinal effect of age and in general L = C . 3. A longitudinal study is more powerful to detect an association of interest compared to a cross-sectional study, = more efficient, less sample size (number of subjects). 4. A longitudinal study allows us to study the within-subject and between-subject variations. 2 Suppose b (, b ) is the blood pressure for a patient population. 2 However, what we observe is Y = b + e, where e (0, e ) is the measurement error. 2 e = within-subject variation 2 b = between-subject variation Graduate Summer Session in Epidemiology Slide 23 CHAPTER 1 Epid 766, D. Zhang If we have only one observation Yi for each subject from a sample of 2 2 n patients, then we can't separate e and b . Although we can use data Y1 , Y2 , ..., Yn to make inference on , we can't make any 2 inference on b . However, if we have repeated (or longitudinal) measurements Yij of blood pressure for each subjects, then Yij = bi + eij . 2 Now, it is possible to make inference about all quantities , b and 2 e . 5. A longitudinal study provides more evidence for possible causal interpretation. Graduate Summer Session in Epidemiology Slide 24 CHAPTER 1 Epid 766, D. Zhang 1.6 Challenges in analyzing longitudinal data Key assumptions in a classical regression model: There is only one observation of response per subject, = responses are independent to each other. For example, when y = cholesterol level, yi = 0 + 1 agei + 2 sexi + i . However, the observations from the same subject in a longitudinal study tend to be more similar to each other than those observations from other subjects, = responses (from the same subjects) are not independent any more. Although, the observations from different subjects are still independent. What happens if we treat observations as independent (i.e., ignore the correlation)? 1. In general, the estimation of the associations (regression coefficients) of the outcome and covariates is valid. Graduate Summer Session in Epidemiology Slide 25 CHAPTER 1 Epid 766, D. Zhang 2. However, the variability measures (e.g, the SEs from a classical regression analysis) are not right: sometimes smaller, sometimes bigger than the true variability. 3. Therefore, the inference is not valid (too significant than it should be if the SE is too small). Sources of variation and correlation in longitudinal data: 1. Between-subject variation: For the blood pressure example, if each subject's blood pressures were measured within a relatively short time, then the following model may be a reasonable one: yij = bi + eij , where bi is the true blood pressure of subject i, eij is the independent (random) measurement error, independent of bi . Graduate Summer Session in Epidemiology Slide 26 CHAPTER 1 Epid 766, D. Zhang For j = k, corr(yij , yik ) = = cov(yij , yik ) var(yij )var(yik ) 2 b 2 + 2 . b e 2 Therefore, if the between-subject variation b = 0, then data from the same subjects are correlated. Graduate Summer Session in Epidemiology Slide 27 CHAPTER 1 Epid 766, D. Zhang The blood pressure example blood pressure 100 5 110 120 130 140 10 15 minute 20 25 Graduate Summer Session in Epidemiology Slide 28 CHAPTER 1 Epid 766, D. Zhang 2. Serial correlation: If the time intervals between blood pressure measurements are relatively large so it may not be reasonable to assume a constant blood pressure for each subject: yij = bi + Ui (tij ) + ij , where bi = true long-term blood pressure, Ui (tij ) =a stochastic process (like a time series) due to biological fluctuation of blood pressure, ij is the independent (random) measurement error. Here the correlation is caused by both bi and Ui (tij ). 3. In a typical longitudinal study for human where # of observations/subject is small moderate, there may not be enough information for the serial correlation and most correlation can be accounted for by (possibly complicated) between-subject variation. Graduate Summer Session in Epidemiology Slide 29 CHAPTER 1 Epid 766, D. Zhang 1.7 Methods for analyzing longitudinal data 1. Two-stage: summarize each subject's outcome and regress the summary statistics on one-time covariates. Especially useful for continuous longitudinal data. However, this method is getting out-dated since mixed model approach can do the same even better. 2. Mixed (effects) model approach: model fixed effects and random effects; use random effect to model correlation. 3. Generalized estimating equation (GEE) approach: model the dependence of marginal mean on covariates. Correlation is not a main interest. Particularly good for discrete data. 4. transition models: use history as covariates. Good for prediction of future response using history. Graduate Summer Session in Epidemiology Slide 30 CHAPTER 1 Epid 766, D. Zhang 1.8 Two-stage method for analyzing longitudinal data Outcome (continuous): yi1 , ..., yini measured at ti1 , ..., tini , one-time covariates: xi1 , ..., xip . Two-stage analysis is conducted as follows: 1. Stage 1: Get summary statistics from subject i's data: yi1 , ..., yini . For example, use mean yi = (yi1 + + yini )/ni or fit a linear regression for each subject: yij = bi0 + bi1 tij + ij , and get estimates bi0 , bi1 of bi0 and bi1 . Here we assume that subject i's true response at time tij is given by bi0 + bi1 tij , a straight line. Suppose t = 0 is the baseline, then bi0 is subject i's true response at baseline and bi1 is subject i's change rate of Graduate Summer Session in Epidemiology Slide 31 CHAPTER 1 Epid 766, D. Zhang the true response (not y). The error term ij can be regarded as measurement error. 2. Stage 2: Treat the summary statistics as new responses and regress the summary statistics on one-time covariates. For example, after we got bi0 and bi1 , we can calculate the means of bi0 and bi1 and the standard errors of those means, or do the following regressions bi0 = 0 + 1 xi1 + + p xip + ei0 bi1 = 0 + 1 xi1 + + p xip + ei1 . Here, k is the effect of xk on the true baseline response (not y), k is the effect of xk on the change rate of of the true response. Graduate Summer Session in Epidemiology Slide 32 CHAPTER 1 Epid 766, D. Zhang 1.9 Analyzing Framingham data using two-stage method Example 1(a) The Framingham study: Stage I: For each subject, fit yij = bi0 + bi1 tij + ij . and get estimates bi0 and bi1 . SAS program for stage I: options ls=80 ps=200; data cholst; infile "cholst.dat"; input newid id cholst sex age year; run; proc sort; by newid year; run; proc print data=cholst (obs=20); var newid cholst sex age year; run; Graduate Summer Session in Epidemiology Slide 33 CHAPTER 1 title "First stage in two-stage analysis"; proc reg outest=out noprint; model cholst = year; by newid; run; data out; set out; b0 = intercept; b1 = year; keep newid b0 b1; run; data main; merge cholst out; by newid; if first.newid=1; run; title "Summary statistics for intercepts and slopes"; proc means mean stderr var t probt; var b0 b1; run; title "Correlation between intercepts and slopes"; proc corr; var b0 b1; run; Epid 766, D. Zhang Graduate Summer Session in Epidemiology Slide 34 CHAPTER 1 Epid 766, D. Zhang Part of output from above SAS program: Summary statististics for intercepts and slopes The MEANS Procedure Variable Mean Std Error Variance t Value Pr > |t| ------------------------------------------------------------------------------b0 220.6893518 2.9478698 1737.99 74.86 <.0001 b1 2.5502529 0.2566421 13.1730374 9.94 <.0001 ------------------------------------------------------------------------------Correlation between intercepts and slopes The CORR Procedure Pearson Correlation Coefficients, N = 200 Prob > |r| under H0: Rho=0 b0 b0 b1 1.00000 -0.26939 0.0001 b1 -0.26939 0.0001 1.00000 3 2 Graduate Summer Session in Epidemiology Slide 35 CHAPTER 1 Epid 766, D. Zhang Summary statistics from stage 1: Parameter b0 b1 2 Sb = 1738, b 0 mean 221 2.55 1 SE 3 0.257 t 75 10 P [T |t|] < .0001 < .0001 corr(b0 , b1 ) = -0.27 2 Sb = 13.2. b Note: 1. Similar to the blood pressure example, we can use the sample means of b0 and b1 to estimate the means of b0 and b1 . Hence we can use sample mean of b1 (2.55) its SE (0.257) to answer the first objective of this study. 2. However, since var(bi0 ) and var(bi1 ) contain variability due to estimating the true baseline response bi0 and change rate bi1 for individual i, so var(bi0 ) > var(bi0 ), var(bi1 ) > var(bi1 ). Graduate Summer Session in Epidemiology Slide 36 CHAPTER 1 Epid 766, D. Zhang 2 2 Sample variances Sb and Sb are unbiased estimates of var(bi0 ) and b b 0 1 var(bi1 ) and would overestimate var(bi0 ) and var(bi1 ). 3. Similarly, corr(b0 , b1 ) = corr(b0 , b1 ). Therefore, corr(b0 , b1 ) = -0.27 cannot be used to estimate the correlation between the true baseline response b0 and true change rate b1 . 4. We will use mixed model approach to address the above issues later. Graduate Summer Session in Epidemiology Slide 37 CHAPTER 1 Epid 766, D. Zhang Stage II: 1. Try to compare E(b0 ) and E(b1 ) between males and females. 2. Try to compare var(b0 ) and var(b1 ) between males and females. 3. Try to examine the effects of age and sex on b0 using b0 = 0 + 1 sex + 2 age + e0 . Technically, we should use b0 instead of b0 . However, b0 is an unbiased estimate of b0 (and b0 is not observable), so using b0 is valid. 4. Try to examine the effects of age and sex on b1 using b1 = 0 + 1 sex + 2 age + e1 . Similar to the above argument, using b1 here is valid. Graduate Summer Session in Epidemiology Slide 38 CHAPTER 1 Epid 766, D. Zhang SAS program for stage II: title "Test equality of mean and variance of intercepts and slopes between sexes"; proc ttest; class sex; var b0 b1; run; title "Regression to look at the association between intercept and age, sex"; proc reg data=main; model b0 = age sex; run; title "Regression to look at the association between slope and age, sex"; proc reg data=main; model b1 = age sex; run; Graduate Summer Session in Epidemiology Slide 39 CHAPTER 1 Epid 766, D. Zhang Part of output from above SAS program: Test equality of mean and variance of intercepts and slopes between sexes The TTEST Procedure Statistics Variable b0 b0 b0 b1 b1 b1 sex 0 1 Diff (1-2) 0 1 Diff (1-2) 97 103 N 97 103 Lower CL Mean 215.86 209.2 -5.264 1.0688 2.5796 -2.554 Mean 223.97 217.6 6.3629 1.7454 3.3083 -1.563 Upper CL Mean 232.07 226.01 17.99 2.4219 4.0369 -0.572 Lower CL Std Dev 35.252 37.812 37.941 2.9417 3.2793 3.2348 Std Dev 40.226 42.988 41.672 3.3567 3.7282 3.5529 4 Statistics Variable b0 b0 b0 b1 b1 b1 sex 0 1 Diff (1-2) 0 1 Diff (1-2) Upper CL Std Dev 46.847 49.82 46.224 3.9092 4.3206 3.941 Std Err 4.0843 4.2358 5.896 0.3408 0.3673 0.5027 Minimum 146.33 141.14 -14 -11.38 Maximum 348.1 360.17 8.3 11.743 Graduate Summer Session in Epidemiology Slide 40 CHAPTER 1 T-Tests Variable b0 b0 b1 b1 Method Pooled Satterthwaite Pooled Satterthwaite Variances Equal Unequal Equal Unequal DF 198 198 198 198 Epid 766, D. Zhang t Value 1.08 1.08 -3.11 -3.12 Pr > |t| 0.2818 0.2809 0.0022 0.0021 Equality of Variances Variable b0 b1 Method Folded F Folded F Num DF 102 102 Den DF 96 96 F Value 1.14 1.23 Pr > F 0.5117 0.2996 Graduate Summer Session in Epidemiology Slide 41 CHAPTER 1 Epid 766, D. Zhang Regression to look at the association between intercept and age, sex The REG Procedure Model: MODEL1 Dependent Variable: b0 Analysis of Variance Source Model Error Corrected Total DF 2 197 199 Sum of Squares 53715 292145 345859 38.50931 220.68935 17.44956 Mean Square 26857 1482.96718 F Value 18.11 Pr > F <.0001 5 Root MSE Dependent Mean Coeff Var R-Square Adj R-Sq 0.1553 0.1467 Parameter Estimates Variable Intercept age sex DF 1 1 1 Parameter Estimate 138.21793 2.05576 -9.75053 Standard Error 15.04083 0.34820 5.47862 t Value 9.19 5.90 -1.78 Pr > |t| <.0001 <.0001 0.0767 Graduate Summer Session in Epidemiology Slide 42 CHAPTER 1 Epid 766, D. Zhang Regression to look at the association between slope and age, sex The REG Procedure Model: MODEL1 Dependent Variable: b1 Analysis of Variance Source Model Error Corrected Total DF 2 197 199 Sum of Squares 257.85057 2363.58387 2621.43443 3.46380 2.55025 135.82170 Mean Square 128.92528 11.99789 F Value 10.75 Pr > F <.0001 6 Root MSE Dependent Mean Coeff Var R-Square Adj R-Sq 0.0984 0.0892 Parameter Estimates Variable Intercept age sex DF 1 1 1 Parameter Estimate 6.14089 -0.10538 1.73654 Standard Error 1.35288 0.03132 0.49279 t Value 4.54 -3.36 3.52 Pr > |t| <.0001 0.0009 0.0005 Graduate Summer Session in Epidemiology Slide 43 CHAPTER 1 Epid 766, D. Zhang Summary from Stage II: 1. Comparison of E(b0 ) and E(b1 ) between males and females: E(b0 ) : 223.97(female), 217.6(male), p-value = 0.28 E(b1 ) : 1.75(female), 3.31(male), p-value = 0.002. 2. Comparison of var(b0 ) and var(b1 ) between males and females: 2 Sb : 1621(female), 1848(male), p-value = 0.5 b 0 2 Sb : 11.3(female), 13.9(male), p-value = 0.3. b 1 However, the above tests do NOT compare var(b0 ) and var(b1 ) between males and females. We will use mixed model approach to address this problem. 3. Model for true baseline response b0 : b0 = 0 + 1 sex + 2 age + e0 , 0 = 138.2(15.0), 1 = -9.75(5.5), 2 = 2.06(0.35). Graduate Summer Session in Epidemiology Slide 44 CHAPTER 1 Epid 766, D. Zhang After adjusting for sex, one year increase in age corresponds to 2 unit increase in baseline cholesterol level. After adjusting for baseline age, on average males' baseline cholesterol level is about 10 units less than females'. 4. Model for change rate of the true response b1 : b1 = 0 + 1 sex + 2 age + e1 , 0 = 6.14(1.35), 1 = 1.74(0.5), 2 = -0.11(0.03). After adjusting for sex, one year increase in age corresponds to 0.11 less in cholesterol level change rate. After adjusting for baseline age, males' cholesterol level change rate is 1.74 greater than females'. Graduate Summer Session in Epidemiology Slide 45 CHAPTER 1 Epid 766, D. Zhang Some remarks on two-stage analysis: 1. The first stage model should be reasonably good for the second stage analysis to be valid and make sense. 2. Two-stage analysis can only be used when the covariates considered are one-time covariates (fixed over time). 3. Summary statistics of a time-varying covariates cannot be used in the second stage analysis because of error in variable issue. 4. When the covariates considered are time-varying covariates, two-stage analysis is not appropriate. Mixed model or GEE approach can be used. 5. Two-stage analysis for discrete data (binary or count data) may break down. Mixed model or GEE approach is much more appropriate. 6. Although two-stage approach can be used to make inference on the quantities of interest, it is less efficient compared to the mixed Graduate Summer Session in Epidemiology Slide 46 CHAPTER 1 Epid 766, D. Zhang model approach. Therefore, mixed model approach should be used whenever possible. Graduate Summer Session in Epidemiology Slide 47 CHAPTER 2 Epid 766, D. Zhang 2 Linear mixed models for normal longitudinal data 1. Random intercept model 3. Other error structures 4. General mixed models What is a linear mixed model? 2. Random intercept and slope model Estimation and inference Choose a variance matrix of the data Analyze Framingham data using linear mixed models GEE for mixed models, missing data issue Graduate Summer Session in Epidemiology Slide 48 CHAPTER 2 Epid 766, D. Zhang 2.1 What is a linear mixed (effects) model? A linear mixed model is an extension of a linear regression model to model longitudinal (correlated) data. It contains fixed effects and random effects where random effects are subject-specific and used to model between-subject variation and the correlation induced by this variation. What are fixed effects? Fixed effects are the covariate effects that are fixed across subjects in the study sample (so they are population-level quantities). These effects are the ones of our particular interest. E.g., the regression coefficients in usual regression models are fixed effects: y = + x + . What are random effects? Random effects are the covariate effects that vary among subjects. So these effects are subject-specific and hence are random (unobservable) since each subject is a random subject drawn from a population. Graduate Summer Session in Epidemiology Slide 49 CHAPTER 2 Epid 766, D. Zhang I. Random intercept model: Data from m subjects: Subject 1 2 i Outcome y11 , y12 , ..., y1n1 y21 , y22 , ..., y2n2 yi1 , yi2 , ..., yini Time t11 , t12 , ..., t1n1 t21 , t22 , ..., t2n2 ti1 , ti2 , ..., tini Random intercept b1 b2 bi m ym1 , ym2 , ..., ymnm tm1 , tm2 , ..., tmnm bm Other covariates: xij2 , ..., xijp , i = 1, ..., m, j = 1, ..., ni . A random intercept model assumes: yij = 0 + 1 tij + 2 xij2 + + p xijp + bi + ij . Graduate Summer Session in Epidemiology Slide 50 CHAPTER 2 Epid 766, D. Zhang Random intercept model: yij = 0 + 1 tij + 2 xij2 + + p xijp + bi + ij 2 where 's are fixed effects of interest, bi N (0, b ) are random effects, 2 ij N (0, ) are independent errors. Interpretation of the model components: 1. From model, E[yij ] = 0 + 1 tij + 2 xij2 + + p xijp . 2. k : Average increase in y associated with one unit increase in xk , the kth covariate. 3. bi : deviation of intercept of subject i from population intercept 0 (0 + bi is the intercept for subject i). 4. 0 + 1 tij + 2 xij2 + + p xijp + bi = true response for subject i at tij . Graduate Summer Session in Epidemiology Slide 51 CHAPTER 2 Epid 766, D. Zhang 2 2 5. b = between-subject variance, = within-subject variance. 2 2 6. Total variance of y: Var(yij ) = b + , constant over time. 7. Correlation between yij and yij : 2 b = corr(yij , yij ) = 2 2 b + 8. Correlation is constant and positive. Graduate Summer Session in Epidemiology Slide 52 CHAPTER 2 Epid 766, D. Zhang Why treat bi as random 1. Treating bi as random enables us to make inference for the whole population from which the sample was drawn. Treating bi as fixed would only allow us to make inference for the study sample. 2. Usually ni is small for longitudinal studies. Therefore, as the number of total data points gets larger, the number of bi (which is m, the number of subjects) gets large proportionally. In this case, the standard properties (such as consistency) for the parameter estimates may not still hold if bi is treated as fixed. Graduate Summer Session in Epidemiology Slide 53 CHAPTER 2 Epid 766, D. Zhang When no x, model reduces to yij = 0 + 1 tij + bi + ij . Graphical representation of data from random intercept model 8 6 Response 4 2 1 2 3 Time since study 4 5 Graduate Summer Session in Epidemiology Slide 54 CHAPTER 2 Epid 766, D. Zhang II. Random intercept and slope model: Data from m subjects: Subject 1 2 i Outcome y11 , ..., y1n1 y21 , ..., y2n2 yi1 , ..., yini Time t11 , ..., t1n1 t21 , ..., t2n2 ti1 , ..., tini Random intercept b10 b20 bi0 Random slope b11 b21 bi1 bm1 m ym1 , ..., ymnm tm1 , ..., tmnm bm0 Other covariates: xij2 , ..., xijp , i = 1, ..., m, j = 1, ..., ni . A random intercept and slope model assumes: yij = 0 + 1 tij + 2 xij2 + + p xijp + bi0 + bi1 tij + ij . Graduate Summer Session in Epidemiology Slide 55 CHAPTER 2 Epid 766, D. Zhang Random intercept and slope model: yij = 0 + 1 tij + 2 xij2 + + p xijp + bi0 + bi1 tij + ij , k the same as before, random effects bi0 , bi1 are assumed to have a bivariate normal distribution 0 01 b . i0 N , 00 01 11 bi1 0 Usually, no constraint is imposed on ij . Interpretation of the model components: 1. Mean structure is the same as before: E[yij ] = 0 + 1 tij + 2 xij2 + + p xijp . 2. k : Average increase in y associated with one unit increase in xk , the kth covariate. 3. bi0 : deviation of intercept of subject i from population intercept 0 Graduate Summer Session in Epidemiology Slide 56 CHAPTER 2 Epid 766, D. Zhang (0 + bi is the intercept for subject i). 4. bi1 : deviation of slope of subject i from population slope 1 (1 + bi1 is the slope for subject i). 5. 0 + 1 tij + 2 xij2 + + p xijp + bi0 + bi1 tij = true response for subject i at tij . 6. V ar(bi0 + bi1 tij ) = 00 + 2tij 01 + t2 11 = between-subject ij variance (varying over time). 2 7. = within-subject variance. 2 8. Total variance of y: Var(yij ) = 00 + 2tij 01 + t2 11 + , not a ij constant over time. 9. Correlation between yij and yij : not a constant over time. Graduate Summer Session in Epidemiology Slide 57 CHAPTER 2 Epid 766, D. Zhang When no x, model reduces to yij = 0 + 1 tij + bi0 + bi1 tij + ij . Graphical representation of data from random intercept and slope model 14 12 10 Response 8 6 2 4 1 2 3 Time since study 4 5 Graduate Summer Session in Epidemiology Slide 58 CHAPTER 2 Epid 766, D. Zhang III. Other mixed models: A correlated error model yij = 0 + 1 tij + 2 xij2 + + p xijp + ij , where ij are correlated normal errors. For example, 1. Compound symmetric (exchangeable) i1 0 i2 N 0 , 2 i3 0 variance matrix 1 1 . 1 Here, -1 < < 1. A random intercept model is almost equivalent to this model. Graduate Summer Session in Epidemiology Slide 59 CHAPTER 2 Epid 766, D. Zhang 2. AR(1) variance matrix i1 0 1 i2 N 0 , 2 i3 0 2 2 1 Here, -1 < < 1. It assumes that the error (i1 , i2 , i3 ) is an autoregressive process with order 1. This structure is more appropriate if y is measured at equally spaced time points. . 1 3. Spatial power variance matrix i1 0 1 i2 N 0 , 2 |t2 -t1 | i3 0 |t3 -t1 | |t2 -t1 | |t3 -t1 | |t3 -t2 | 1 |t3 -t2 | 1 Here, 0 < < 1. This error structure reduces to AR(1) when y is measured at equally spaced time points. This structure is appropriate if y is measured at unequally spaced time points. . Graduate Summer Session in Epidemiology Slide 60 CHAPTER 2 Epid 766, D. Zhang 4. Unstructured variance matrix i1 0 11 i2 N 0 , 12 i3 13 0 12 22 23 13 23 33 Here no restriction is imposed on ij . . Graduate Summer Session in Epidemiology Slide 61 CHAPTER 2 Epid 766, D. Zhang IV. General linear mixed models General model 1: fixed effects + random effects + pure measurement error: For example, yij = 0 + 1 tij + 2 x + bi0 + bi1 tij + ij , where ij is the pure measurement error (has an independent variance structure). Software to implement the above model: Proc Mixed in SAS: Proc Mixed data= method=; class id; model y = t x / s; /* specify t x for fixed effects */ random intercept t / subject=id type=un; /* specify the covariates */ /* for random effects */ repeated / subject=id type=vc; /* specify the variance structure for error */ run; Graduate Summer Session in Epidemiology Slide 62 CHAPTER 2 Epid 766, D. Zhang General model 2: fixed effects + random effects + stochastic process For example, yij = 0 + 1 tij + 2 xij2 + bi0 + bi1 tij + Ui (tij ), where Ui (t) is a stochastic process with AR(1), a spatial power variance structure or other variance structure. Software to implement the above model: Proc Mixed in SAS: Proc Mixed data= method=; class id; model y = t x / s; /* specify t x for fixed effects */ random intercept t / subject=id type=un; /* specify the covariates */ /* for random effects */ repeated / subject=id type=sp(pow)(t); /* specify the variance structure for error */ run; If the time points are equally spaced, we can use type=ar(1) in the repeated statement for AR(1) variance structure for Ui (t). Graduate Summer Session in Epidemiology Slide 63 CHAPTER 2 Epid 766, D. Zhang General model 3: fixed effects + random effects + stochastic process + pure measurement error For example, yij = 0 + 1 tij + 2 xij2 + bi0 + bi1 tij + Ui (tij ) + ij , where Ui (t) is a stochastic process with some variance structure (e.g.,a spatial power variance structure), ij is the pure measurement error. Software to implement the above model: Proc Mixed in SAS: Proc Mixed data= method=; class id; model y = t x / s; /* specify t x for fixed effects */ random intercept t / subject=id type=un; /* specify the covariates */ /* for random effects */ repeated / subject=id type=sp(pow)(t) local; /* specify error variance structure */ run; If the time points are equally spaced, we can use type=ar(1) in the repeated statement if assuming AR(1) for Ui (t). Graduate Summer Session in Epidemiology Slide 64 CHAPTER 2 Epid 766, D. Zhang 2.2 Estimation and inference for linear mixed models Let consist of all parameters in random effects and errors (ij ). We want to make inference on and . There are two approaches: 1. Maximum likelihood: (, ; y) = logL(, ; y). Maximize (, ; y) jointly w.r.t. and to get their MLEs. 2. Restricted maximum likelihood (REML): (a) Get REML of from a REML likelihood REM L (; y) (take into account estimation of ). Leads to less biased . For example, in a linear regression model 2 REM L = Residual Sum of Squares . n-p-1 Slide 65 (b) Estimate by maximizing (, REM L ; y). Graduate Summer Session in Epidemiology CHAPTER 2 Epid 766, D. Zhang Hypothesis Testing After we fit a linear mixed model such as yij = 0 + 1 tij + 2 xij2 + + p xijp + bi0 + bi1 tij + ij , SAS will output a test for each k , including the estimate, SE, p-value (for testing H0 : k = 0), etc. If we want to test a contrast between k , we can use estimate statement in Proc Mixed. Then SAS will output the estimate, SE for the contrast and the p-value for testing the contrast is zero. See Programs 2 and 3 for Framingham data. Graduate Summer Session in Epidemiology Slide 66 CHAPTER 2 Epid 766, D. Zhang 2.3 How to choose random effects and the error structure? 1. Use graphical representation to identify possible random effects. 2. Use biological knowledge to identify possible error structure. 3. Use information criteria to choose a final model: (a) Akaike's Information Criterion (AIC): AIC = -2((, ; y) - q) where q = # of elements in . Smaller AIC is preferred. (b) Bayesian Information Criterion (BIC): BIC = -2((, ; y) - 0.5 q log(m)), m = # of subjects Again, smaller BIC is preferred. Graduate Summer Session in Epidemiology Slide 67 CHAPTER 2 Epid 766, D. Zhang 2.4 Analyze Framingham data using linear mixed models Model to address objective 1: How does cholesterol level change over time on average as people get older? Consider the following basic model suggested by the data: yij = bi0 + bi1 tij + ij (2.1) where yij is the jth cholesterol level measurement from subject i, tij is year from the beginning of the study (or baseline) and bi0 , bi1 are random variables distributed as 01 b , i0 N 0 , 00 01 11 1 bi1 2 and ij are independent errors distributed as N(0, ). Graduate Summer Session in Epidemiology Slide 68 CHAPTER 2 Epid 766, D. Zhang Model (2.1) assumes that 1. The true cholesterol level for each individual changes linearly over time with different intercept and slope, which are both random (since the individual is a random subject drawn from the population). 2. Since t = 0 is the baseline, so bi0 can be viewed as the true but unobserved cholesterol level for subject i at the baseline, and bi1 can be viewed as the change rate of the true cholesterol level for subject i. 3. 0 is the population average of the true baseline cholesterol level of all individuals in the population, 1 is the population average change rate of true cholesterol level and it tells us how cholesterol level changes on average as people get older. So 1 is the longitudinal effect or aging effect on cholesterol level. 4. 00 is the variance of the true baseline cholesterol level bi0 ; 11 is the variance of the change rate bi1 of the true Graduate Summer Session in Epidemiology Slide 69 CHAPTER 2 Epid 766, D. Zhang cholesterol level; and 01 is the covariance between true baseline cholesterol level bi0 and the change rate bi1 of true cholesterol level. The random variables bi0 and bi1 can be re-written as bi0 = 0 + ai0 , bi1 = 1 + ai1 , where ai0 , ai1 have the following distribution: 0 01 a . i0 N , 00 01 11 ai1 0 Model (2.1) then can be re-expressed as yij = 0 + 1 tij + ai0 + ai1 tij + ij . (2.2) Therefore, 0 , 1 are fixed effects and ai0 , ai1 are random effects. Graduate Summer Session in Epidemiology Slide 70 CHAPTER 2 Epid 766, D. Zhang The following is the SAS program for fitting model (2.1): title "Framingham data: mixed model without covariates"; proc mixed data=cholst; class newid; model cholst = year / s; random intercept year / type=un subject=newid g; repeated / type=vc subject=newid; run; The following is the output from the above program: Framingham data: mixed model without covariates The Mixed Procedure Model Information Data Set Dependent Variable Covariance Structures Subject Effects Estimation Method Residual Variance Method Fixed Effects SE Method Degrees of Freedom Method WORK.CHOLST cholst Unstructured, Variance Components newid, newid REML Parameter Model-Based Containment 1 Graduate Summer Session in Epidemiology Slide 71 CHAPTER 2 Class Level Information Class newid Levels 200 Values 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ... Dimensions Covariance Parameters Columns in X Columns in Z Per Subject Subjects Max Obs Per Subject Observations Used Observations Not Used Total Observations Iteration History Iteration 0 1 2 Evaluations 1 2 1 -2 Res Log Like 10899.75433605 9960.12567386 9960.12082968 4 2 2 200 6 1044 0 1044 Epid 766, D. Zhang Criterion 0.00000120 0.00000000 Convergence criteria met. Graduate Summer Session in Epidemiology Slide 72 CHAPTER 2 The Mixed Procedure Estimated G Matrix Row 1 2 Effect Intercept year newid 1 1 Col1 1467.30 -2.2259 Epid 766, D. Zhang Col2 -2.2259 3.8409 Covariance Parameter Estimates Cov Parm UN(1,1) UN(2,1) UN(2,2) Residual Subject newid newid newid newid Fit Statistics -2 Res Log Likelihood AIC (smaller is better) AICC (smaller is better) BIC (smaller is better) 9960.1 9968.1 9968.2 9981.3 Estimate 1467.30 -2.2259 3.8409 434.11 Null Model Likelihood Ratio Test DF 3 Chi-Square 939.63 Pr > ChiSq <.0001 Graduate Summer Session in Epidemiology Slide 73 CHAPTER 2 Solution for Fixed Effects Effect Intercept year Estimate 220.57 2.8170 Standard Error 2.9305 0.2408 DF 199 191 t Value 75.26 11.70 Epid 766, D. Zhang Pr > |t| <.0001 <.0001 Type 3 Tests of Fixed Effects Effect year Num DF 1 Den DF 191 F Value 136.83 Pr > F <.0001 From this output, we see that: 1. 00 = 1467, as compared to var(b0 ) = 1738 from the two-stage approach. 2. 11 = 3.84, as compared to var(b1 ) = 13.2 from the two-stage approach. 3. corr(b0 , b1 ) = corr(a0 , a1 ) = -2.2259/ 1467 3.84 = -0.03, as compared to corr(b0 , b1 ) = -0.27. 4. The estimated mean of true baseline cholesterol level is 0 = 220.57 with SE=2.93, as compared to the sample mean Graduate Summer Session in Epidemiology Slide 74 CHAPTER 2 Epid 766, D. Zhang 220.69 of b0 with SE = 2.94 from the two-stage approach. 5. The estimated change rate (longitudinal effect) 1 = 2.82 with SE=0.24, as compared to the sample mean 2.55 of b1 with SE = 0.26 from the two-stage approach. 2 6. = 434.11. Q: Is it reasonable to assume ij in model (2.1) to be pure measurement error? We can consider a more general model such as AR(1) for ij and test this assumption. title "Framingham data: mixed model without covariates + AR(1) error"; proc mixed data=cholst covtest; class newid; model cholst = year / s; random intercept year / type=un subject=newid g; repeated / type=ar(1) subject=newid; run; Graduate Summer Session in Epidemiology Slide 75 CHAPTER 2 Epid 766, D. Zhang and the relevant output: Covariance Parameter Estimates Cov Parm UN(1,1) UN(2,1) UN(2,2) AR(1) Residual Subject newid newid newid newid Estimate 1490.82 -4.9483 4.4725 -0.06195 416.45 Standard Error 173.93 10.6697 1.2912 0.05851 27.0039 Z Value 8.57 -0.46 3.46 -1.06 15.42 Pr Z <.0001 0.6428 0.0003 0.2897 <.0001 Fit Statistics -2 Res Log Likelihood AIC (smaller is better) AICC (smaller is better) BIC (smaller is better) 9959.1 9969.1 9969.1 9985.5 Note: 1. P-value for testing H0 : = 0 is 0.2897, no strong evidence against H0 . 2. All model selection criteria lead to iid error ij . 3. We usually don't use the above output to test variances. Graduate Summer Session in Epidemiology Slide 76 CHAPTER 2 Epid 766, D. Zhang Model to investigate the cross sectional age effect and longitudinal age effect on cholesterol level: Re-write the true baseline cholesterol level bi0 and the change rate bi1 in model (2.1) in terms of conditional distributions given age: bi0 = 0 + C agei + ai0 bi1 = 1 + A agei + ai1 , (2.3) (2.4) Where agei is individual i's baseline age. Then C is the cross sectional age effect and 1 + A agei is the longitudinal effect for the population with baseline age eqaul to agei . The average longitudinal effect is 1 + A E(age), which can be estimated by 1 + A age, Graduate Summer Session in Epidemiology Slide 77 CHAPTER 2 Epid 766, D. Zhang where age is the sample average age. Suggest that we can center age and use the centered age (denoted by cagei = agei - age) in (2.3). Then 1 is the average longitudinal effect We are interested in testing H0 : C = 1 . Assume the usual distribution for (ai0 , ai1 ): 0 01 a . i0 N , 00 01 11 ai1 0 Here both 00 and 11 are the remaining variances in bi0 and bi1 after baseline age effect has been taken into account. So they should be smaller than those corresponding values in model (2.1). Basic model (2.1) becomes yij = 0 + C cagei + 1 tij + A cagei tij + ai0 + ai1 tij + ij ,(2.5) Graduate Summer Session in Epidemiology Slide 78 CHAPTER 2 Epid 766, D. Zhang where ij N (0, 2 ) are independent errors. The following is the SAS program for fitting model (2.5): title "Framingham data: longitudinal effect vs. cohort effect"; proc mixed data=cholst; class newid; model cholst = year cage cage*year / s; random intercept year / type=un subject=newid g; repeated / type=vc subject=newid; estimate "long-cross" year 1 cage -1; run; Graduate Summer Session in Epidemiology Slide 79 CHAPTER 2 Epid 766, D. Zhang The relevant output of the above SAS program is Iteration History Iteration 0 1 2 Evaluations 1 2 1 -2 Res Log Like 10826.01576300 9929.74817925 9929.72729664 Criterion 0.00000516 0.00000000 Convergence criteria met. Estimated G Matrix Row 1 2 Effect Intercept year newid 1 1 Col1 1226.69 9.7829 Col2 9.7829 3.2598 Covariance Parameter Estimates Cov Parm UN(1,1) UN(2,1) UN(2,2) Residual Subject newid newid newid newid Fit Statistics -2 Res Log Likelihood AIC (smaller is better) Graduate Summer Session in Epidemiology 9929.7 9937.7 Slide 80 Estimate 1226.69 9.7829 3.2598 434.15 CHAPTER 2 AICC (smaller is better) BIC (smaller is better) 9937.8 9950.9 Epid 766, D. Zhang Null Model Likelihood Ratio Test DF 3 Chi-Square 896.29 Pr > ChiSq <.0001 Solution for Fixed Effects Effect Intercept year cage year*cage Estimate 220.57 2.8157 1.9861 -0.1024 Standard Error 2.7172 0.2343 0.3455 0.02930 DF 198 190 652 652 t Value 81.18 12.02 5.75 -3.50 Pr > |t| <.0001 <.0001 <.0001 0.0005 Type 3 Tests of Fixed Effects Effect year cage year*cage Num DF 1 1 1 Den DF 190 652 652 Estimates Label long-cross Estimate 0.8296 Standard Error 0.4174 DF 652 t Value 1.99 Pr > |t| 0.0473 Slide 81 F Value 144.42 33.05 12.22 Pr > F <.0001 <.0001 0.0005 Graduate Summer Session in Epidemiology CHAPTER 2 Epid 766, D. Zhang What we learn from this output: 1. 00 = 1226.7, much smaller than the corresponding estimate 1467 from model (2.1) when baseline age was not used to explain the variability in the true baseline cholesterol level. 2. 11 = 3.26, much smaller than the corresponding estimate 3.84 from model (2.1) when baseline age was not used to explain the variability in the true baseline cholesterol change rate. 3. 0 = 220.57 is the estimate of mean true baseline cholesterol level for the individuals whose baseline age = 42.56 (the average age), which is the same as the one from model (2.1) but with a smaller SE (2.71 vs. 2.93). 4. The estimate of the longitudinal age effect is 1 = 2.8157 with SE = 0.2343, which is basically the same as 1 = 2.8170 with SE = 0.24 from model (2.1). 5. The estimate of the cross sectional age effect is C = 1.99 with SE = 0.3455, which is very different from the estimate of Graduate Summer Session in Epidemiology Slide 82 CHAPTER 2 Epid 766, D. Zhang the longitudinal age effect 1 . 6. The P-value for testing H0 : L = C is 0.0473, significant at level 0.05! 2 7. = 434.15 is basically the same as the corresponding estimate from model (2.1), which is 434.11. 8. Similarly, we can test iid ij by considering correlated errors such as AR(1). Graduate Summer Session in Epidemiology Slide 83 CHAPTER 2 Epid 766, D. Zhang Model to address objective 2: How is the change of cholesterol level associated with sex and baseline age? Re-write the true baseline cholesterol level bi0 and the change rate bi1 in model (2.1) in terms of conditional distribution given gender and baseline age: bi0 = 0 + sexi 0,sex + agei 0,age + ai0 bi1 = 1 + sexi 1,sex + agei 1,age + ai1 , where we assume that ai0 , ai1 have the following 01 0 a . i0 N , 00 01 11 0 ai1 Then 0,sex , 0,age are the sex effect and baseline age effect on the baseline cholesterol level. Of course, 0 does NOT have a proper interpretation. Graduate Summer Session in Epidemiology Slide 84 (2.6) (2.7) CHAPTER 2 Epid 766, D. Zhang Similarly, 1,sex , 1,age are the sex effect and baseline age effect on the change rate of the true cholesterol level, and 1 does NOT have a proper interpretation. Substituting the above expressions into model (2.1), we got yij = 0 + sexi 0,sex + agei 0,age + 1 tij +sexi tij 1,sex + agei tij 1,age + ai0 + ai1 tij + ij .(2.8) Suppose we also want to test whether or not the change rates between 30 years old males and 40 years old females are the same using above model. From model (2.7), the (average) change rate of 30 years old males is 1 + 1 1,sex + 30 1,age = 1 + 1,sex + 301,age . Graduate Summer Session in Epidemiology Slide 85 CHAPTER 2 Epid 766, D. Zhang The (average) change rate of 40 years old females is 1 + 0 1,sex + 40 1,age = 1 + 401,age . The difference between these two rates is 1 + 1,sex + 301,age - (1 + 401,age ) = 1,sex - 101,age . Therefore, we need only to test H0 : 1,sex - 101,age = 0. We can use the following SAS program to answer our questions. title "Framingham data: how baseline cholesterol level and"; title2" change rate depend on sex and baseline age"; proc mixed data=cholst; class newid; model cholst = sex age year sex*year age*year / s; random intercept year / type=un subject=newid g s; repeated / type=vc subject=newid; estimate "rate-diff" sex*year 1 age*year -10; run; Graduate Summer Session in Epidemiology Slide 86 CHAPTER 2 Epid 766, D. Zhang Part of the relevant output from above program is Framingham data: how baseline cholesterol level and change rate depend on sex and baseline age Iteration History Iteration 0 1 2 Evaluations 1 2 1 -2 Res Log Like 10813.99587154 9907.89014721 9907.86364103 Criterion 0.00000655 0.00000000 1 Convergence criteria met. Estimated G Matrix Row 1 2 Effect Intercept year newid 1 1 Col1 1209.89 13.5502 Col2 13.5502 2.5211 Covariance Parameter Estimates Cov Parm UN(1,1) UN(2,1) UN(2,2) Residual Subject newid newid newid newid Estimate 1209.89 13.5502 2.5211 434.15 Graduate Summer Session in Epidemiology Slide 87 CHAPTER 2 Fit Statistics -2 Res Log Likelihood AIC (smaller is better) AICC (smaller is better) BIC (smaller is better) 9907.9 9915.9 9915.9 9929.1 Epid 766, D. Zhang Null Model Likelihood Ratio Test DF 3 Chi-Square 906.13 Pr > ChiSq <.0001 Solution for Fixed Effects Effect Intercept sex age year sex*year age*year Estimate 138.18 -9.6393 2.0509 6.8003 1.7995 -0.1145 Standard Error 14.9148 5.4352 0.3454 1.2229 0.4536 0.02835 DF 197 652 652 189 652 652 t Value 9.26 -1.77 5.94 5.56 3.97 -4.04 Pr > |t| <.0001 0.0766 <.0001 <.0001 <.0001 <.0001 Graduate Summer Session in Epidemiology Slide 88 CHAPTER 2 Solution for Random Effects Effect Intercept year Intercept year Intercept year newid 2 2 74 74 171 171 Estimate 100.50 2.7414 46.9844 1.3579 -51.5764 -0.6812 Std Err Pred 11.5761 1.2643 11.0096 1.2525 11.3046 1.2583 Estimates Label rate-diff Estimate 2.9441 Standard Error 0.5606 DF 651 t Value 5.25 DF 651 651 651 651 651 651 t Value 8.68 2.17 4.27 1.08 -4.56 -0.54 Epid 766, D. Zhang Pr > |t| <.0001 0.0305 <.0001 0.2787 <.0001 0.5885 Pr > |t| <.0001 Graduate Summer Session in Epidemiology Slide 89 CHAPTER 2 Epid 766, D. Zhang What we learn from this output: 1. 0,sex = -9.64 (SE = 5.43), so after adjusting for baseline age, males' baseline cholesterol level is about 10 units less than females'. 2. 0,age = 2.05 (SE = 0.35), so after adjusting for gender, one year older people's baseline cholesterol level is about 2 units higher than that of one year younger people. 3. 1,sex = 1.80 (SE = 0.45), so after adjusting for baseline age, males' change rate is 1.80 (cholesterol unit/year) greater that females' change rate. Similar estimate from 2-stage analysis is 1.74 (SE=0.49). 4. 1,age = -0.11 (SE = 0.028), so after adjusting for sex, one year older people's change rate is 0.11 less than one year younger people's change rate. Similar estimate from 2-stage analysis is -0.11 (SE=0.031). 5. The change rate difference of interest is 2.94 (SE = 0.56). Significant! Graduate Summer Session in Epidemiology Slide 90 CHAPTER 2 Epid 766, D. Zhang 6. 00 = 1210, which is smaller than the corresponding estimate from model (2.5) since we use both age and gender to explain the variability in baseline true cholesterol level. 7. 11 = 2.52, which is smaller than the corresponding estimate from model (2.5) since we use both age and gender to explain the variability in the cholesterol level change rate. 2 8. = 434.15, basically the same as its estimates from models (2.1) and (2.5). 9. Similarly, we can test iid ij by considering correlated errors such as AR(1). Note: The models (2.6) and (2.7) for bi0 and bi1 are basically the same as the second stage models in the two stage analysis for the Framingham data. Compare results from this model to the results from the two-stage analysis: Graduate Summer Session in Epidemiology Slide 91 CHAPTER 2 Epid 766, D. Zhang (a) Effect on baseline cholesterol level: Model (2.8) : Two-stage : 0 = 138.18(SE = 14.9), 0,sex = -9.64(SE = 5.43), 0,age = 2.05(SE = 0.35) 0 = 138.2(SE = 15.0), 1 = -9.75(SE = 5.48), 2 = 2.06(SE = 0.35). (b) Effect on change rate of cholesterol level: Model (2.8) : Two-stage : 1 = 6.80(SE = 1.22), 1,sex = 1.80(SE = 0.45), 1,age = -0.11(SE = 0.03). 0 = 6.14(SE = 1.35), 1 = 1.74(SE = 0.49), 2 = -0.11(SE = 0.03). We can also estimate the individual random effects and estimate their trajectory lines. Graduate Summer Session in Epidemiology Slide 92 CHAPTER 2 Epid 766, D. Zhang Estimated subject-specific lines from model (2.8): Cholesterol levels over time 400 0 350 2 4 Time in years 6 8 10 ID=2 Cholesterol level ID=74 200 250 300 ID=171 150 Graduate Summer Session in Epidemiology Slide 93 CHAPTER 2 Epid 766, D. Zhang Model to address Objective 3: Do males have more stable (true) baseline cholesterol level and change rate than females? From model (2.1), assume bi0 , bi1 have different distributions for males and females: m01 b i0 N m0 , m00 Males: m01 m11 m1 bi1 f 01 b i0 N f 0 , f 00 (2.9) Females: f 01 f 11 f 1 bi1 We would like to test H0 : m00 = f 00 , m01 = f 01 , m11 = f 11 (i.e., the above two variance-covariance matrices are the same). Graduate Summer Session in Epidemiology Slide 94 CHAPTER 2 Epid 766, D. Zhang The SAS program and its output for fitting above model are as follows: title "Framingham data: do males have more stable (true) baseline"; title2 "cholesterol level and change rate than females?"; proc mixed data=cholst; class newid gender; model cholst = sex year sex*year / s; random intercept year / type=un subject=newid group=gender g; repeated / type=vc subject=newid; run; Framingham data: do males have more stable (true) baseline cholesterol level and change rate than females? The Mixed Procedure Iteration History Iteration 0 1 2 Evaluations 1 3 1 -2 Res Log Like 10889.09479529 9939.57691271 9939.56399905 Criterion 0.00000317 0.00000000 1 The Mixed Procedure Convergence criteria met. Graduate Summer Session in Epidemiology Slide 95 CHAPTER 2 Estimated G Matrix Row 1 2 3 4 Effect Intercept year Intercept year newid 1 1 1 1 gender 0 0 1 1 Col1 1402.47 -4.7015 Col2 -4.7015 1.8279 Epid 766, D. Zhang Col3 Col4 1532.81 3.6119 3.6119 4.7970 Covariance Parameter Estimates Cov Parm UN(1,1) UN(2,1) UN(2,2) UN(1,1) UN(2,1) UN(2,2) Residual Subject newid newid newid newid newid newid newid Group gender gender gender gender gender gender 0 0 0 1 1 1 Estimate 1402.47 -4.7015 1.8279 1532.81 3.6119 4.7970 433.71 Fit Statistics -2 Res Log Likelihood AIC (smaller is better) AICC (smaller is better) BIC (smaller is better) 9939.6 9953.6 9953.7 9976.7 Graduate Summer Session in Epidemiology Slide 96 CHAPTER 2 Epid 766, D. Zhang In order to test H0 : the two variance matrices are the same using the likelihood ratio test, we need to fit a model with the same fixed and random effects but under H0 . The following is the SAS program and its output under H0 . This null model is called model (2.90 ). title "Framingham data under H0: males and females have the same variance"; title2 "matrices of baseline cholesterol level and change rate"; proc mixed data=cholst; class newid gender; model cholst = sex year sex*year / s; random intercept year / type=un subject=newid g; repeated / type=vc subject=newid; run; Framingham data under H0: males and females have the same variance matrices of baseline cholesterol level and change rate The Mixed Procedure Model Information Data Set Dependent Variable Covariance Structures WORK.CHOLST cholst Unstructured, Variance 1 The Mixed Procedure Convergence criteria met. Graduate Summer Session in Epidemiology Slide 97 CHAPTER 2 Estimated G Matrix Row 1 2 Effect Intercept year newid 1 1 Col1 1465.85 -0.2516 Epid 766, D. Zhang Col2 -0.2516 3.2618 Covariance Parameter Estimates Cov Parm UN(1,1) UN(2,1) UN(2,2) Residual Subject newid newid newid newid Estimate 1465.85 -0.2516 3.2618 434.17 Fit Statistics -2 Res Log Likelihood AIC (smaller is better) AICC (smaller is better) BIC (smaller is better) 9943.0 9951.0 9951.1 9964.2 The difference of -2 residual log likelihood is 9943 9939.6 = 3.4 (between models (2.9) and (2.90 )) and the P-value = P [2 3.4] = 0.33. 3 Graduate Summer Session in Epidemiology Slide 98 CHAPTER 2 Epid 766, D. Zhang Note: We can also test H0 : whether or not males and females have the same variance matrices of true baseline cholesterol level and change rate of cholesterol level by adjusting for baseline age and sex. We already fit the model under H0 (model (2.8)) and -2 residual log likelihood is 9907.9. The alternative model can be fit using the following SAS program (called model (2.8A )). title "Framingham data: do males have more stable (true) baseline cholesterol"; title2 "level and change rate than females adjusting for sex and baseline age"; proc mixed data=cholst; class newid gender; model cholst = sex age year sex*year age*year / s; random intercept year / type=un subject=newid group=gender g; repeated / type=vc subject=newid; run; Graduate Summer Session in Epidemiology Slide 99 CHAPTER 2 Epid 766, D. Zhang Part of the output from above program is Framingham data: do males have more stable (true) baseline cholester 20 level and change rate than females adjusting for sex and baseline age The Mixed Procedure Model Information Data Set Dependent Variable Covariance Structures WORK.CHOLST cholst Unstructured, Variance The Mixed Procedure Convergence criteria met. Estimated G Matrix Row 1 2 3 4 Effect Intercept year Intercept year newid 1 1 1 1 gender 0 0 1 1 Col1 1403.04 -2.6077 Col2 -2.6077 1.5955 1021.77 30.7768 30.7768 3.3214 Col3 Col4 Covariance Parameter Estimates Cov Parm UN(1,1) UN(2,1) UN(2,2) UN(1,1) Graduate Summer Session in Epidemiology Subject newid newid newid newid Group gender gender gender gender 0 0 0 1 Estimate 1403.04 -2.6077 1.5955 1021.77 Slide 100 CHAPTER 2 UN(2,1) UN(2,2) Residual newid newid newid gender 1 gender 1 30.7768 3.3214 434.93 Epid 766, D. Zhang Fit Statistics -2 Res Log Likelihood AIC (smaller is better) AICC (smaller is better) BIC (smaller is better) 9901.6 9915.6 9915.7 9938.7 The -2 residual log likelihood is 9901.6 so difference is 9907.9-9901.6 = 6.3. The P-value = P [2 6.3] = 0.09, more 3 evidence against H0 . Graduate Summer Session in Epidemiology Slide 101 CHAPTER 2 Epid 766, D. Zhang Comparison of fit statistics among models Model Model (2.1) Model (2.5) Model (2.8) Model (2.9) Model (2.90 ) Model (2.8A ) AIC 9968.1 9937. 9915.9 9953.6 9951.0 9915.6 BIC 9981.3 9950.9 9929.1 9976.7 9964.2 9938.7 Graduate Summer Session in Epidemiology Slide 102 CHAPTER 2 Epid 766, D. Zhang Note: 1. The choice of model, especially the fixed effects terms, depends on the questions we need to answer. However, we can use AIC or BIC to determine the random effects and the error structure. 2. If we want a model with the most prediction power, we can consider a complicated model with AIC or BIC as a guide for model selection. 3. It seems that model (2.8) is the winner among the above models if we are looking for a model with the most prediction power. Graduate Summer Session in Epidemiology Slide 103 CHAPTER 2 Epid 766, D. Zhang 2.5 GEE for linear mixed models When the variation pattern in data is so complicated that you don't feel comfortable in the random effects and their variance structure you imposed, you can use the model you posed to estimate the fixed effects ('s) and use GEE approach to calculate the standard errors for the fixed effect estimates. These SE estimates will be valid regardless of the validity of the random effects structure you put. So these SE estimates are robust (we will talk more on Thursday). For example, you can use the following model to estimate 's: yij = 0 + 1 tij + 2 sexi + 3 agei + 4 sexi tij + 5 agei tij +bi0 + bi1 tij + ij . If you specify empirical in Proc mixed, you will get robust SE estimates. See the following SAS program and output. Graduate Summer Session in Epidemiology Slide 104 CHAPTER 2 title "Using GEE to fit Framingham data"; proc mixed data=cholst empirical; class newid; model cholst = year sex age sex*year age*year / s; random intercept year / type=un subject=newid; repeated / type=vc subject=newid; 766, run; Epid D. Zhang Output of the above program Using GEE to fit Framingham data The Mixed Procedure Model Information Data Set Dependent Variable Covariance Structures Subject Effects Estimation Method Residual Variance Method Fixed Effects SE Method Degrees of Freedom Method WORK.CHOLST cholst Unstructured, Variance Components newid, newid REML Parameter Empirical Containment 24 Iteration History Iteration 0 1 2 Evaluations 1 2 1 -2 Res Log Like 10813.99587154 9907.89014721 9907.86364103 Criterion 0.00000655 0.00000000 Convergence criteria met. Graduate Summer Session in Epidemiology Slide 105 CHAPTER 2 The Mixed Procedure Covariance Parameter Estimates Cov Parm UN(1,1) UN(2,1) UN(2,2) Residual Subject newid newid newid newid Estimate 1209.89 13.5502 2.5211 434.15 Epid 766, D. Zhang Fit Statistics -2 Res Log Likelihood AIC (smaller is better) AICC (smaller is better) BIC (smaller is better) 9907.9 9915.9 9915.9 9929.1 Null Model Likelihood Ratio Test DF 3 Chi-Square 906.13 Pr > ChiSq <.0001 Solution for Fixed Effects Effect Intercept sex age year year*sex year*age Estimate 138.18 -9.6393 2.0509 6.8003 1.7995 -0.1145 Standard Error 15.4017 5.4588 0.3749 1.2188 0.4524 0.02868 DF 197 651 651 190 651 651 t Value 8.97 -1.77 5.47 5.58 3.98 -3.99 Pr > |t| <.0001 0.0779 <.0001 <.0001 <.0001 <.0001 Slide 106 Graduate Summer Session in Epidemiology CHAPTER 2 Epid 766, D. Zhang What we observe: 1. Fixed effects estimates and variance-covariance parameter estimates are exactly the same as those from model (4). 2. The SE for the fixed effects estimates are different from those from model (4). However, they are very close, indicating model (4) has reasonably good fit to the data and we don't have to use GEE approach. Graduate Summer Session in Epidemiology Slide 107 CHAPTER 2 Epid 766, D. Zhang 2.6 Missing data issues However, GEE will be less efficient if a correct model can be specified; and if there are missing data, then the missing data mechanism has to be missing completely at random (MCAR) for GEE inference to be valid. Missing data mechanism: 1. missing completely at random (MCAR): The reason that the data are missing has nothing to do with anything, i.e., at each time point, the observed data can be viewed as a random sample from the population. 2. missing at random (MAR): The reason that a subject has missing data may depend only on his/her observed data. Mixed model inference is valid under this condition. MCAR implies MAR. 3. missing not at random (MNAR): The reason that a subject has missing data depends on his/her unobserved data. Special assumption (untestable) has to be made for inference. Graduate Summer Session in Epidemiology Slide 108 CHAPTER 2 Epid 766, D. Zhang Ways to assess MCAR 1. Suppose the missing data pattern (for y) looks like Time points 1 2 3 ? ? ? and assume x (such as age) is a completely observed variable. 2. Compare x for the two groups with observed y and missing y at times 2 and 3 (using, say, two-sample t-test). A significant difference indicates the violation of MCAR. Otherwise, you may feel comfortable about the MCAR assumption. Remark: MAR cannot be tested. Graduate Summer Session in Epidemiology Slide 109 CHAPTER 2 Use age to test MCAR for Framingham data: options ls=72 ps=72; data cholst; infile "cholst.dat"; input newid id cholst sex age year; if year = . then delete; run; data base; set cholst; if year=0; keep newid age; run; data year; do newid=1 to 200; do year=0 to 10 by 2; output; end; end; run; data cholst; merge cholst year; by newid year; if cholst=. then yobs=0; else yobs=1; drop age; run; data cholst; merge cholst base; by newid; run; proc sort; by year; run; Graduate Summer Session in Epidemiology Epid 766, D. Zhang Slide 110 CHAPTER 2 Epid 766, D. Zhang title "Test equality of age between missing and non-missing groups"; proc ttest; var age; class yobs; by year; run; SAS output: Test equality of age between missing and non-missing groups 1 -------------------------------- year=2 -------------------------------The TTEST Procedure T-Tests Variable age age Method Pooled Satterthwaite Variances Equal Unequal DF 198 29.6 t Value 0.35 0.35 Pr > |t| 0.7298 0.7325 -------------------------------- year=4 -------------------------------The TTEST Procedure T-Tests Variable age age Method Pooled Satterthwaite Variances Equal Unequal DF 198 39.5 t Value -0.23 -0.22 Pr > |t| 0.8172 0.8304 Graduate Summer Session in Epidemiology Slide 111 CHAPTER 2 Epid 766, D. Zhang -------------------------------- year=6 --------------------------------T-Tests Variable age age Method Pooled Satterthwaite Variances Equal Unequal DF 198 40.3 t Value 1.43 1.37 Pr > |t| 0.1536 0.1774 -------------------------------- year=8 -------------------------------T-Tests Variable age age Method Pooled Satterthwaite Variances Equal Unequal DF 198 47 t Value 0.47 0.50 Pr > |t| 0.6418 0.6179 ------------------------------- year=10 -------------------------------T-Tests Variable age age Method Pooled Satterthwaite Variances Equal Unequal DF 198 63.3 t Value 0.24 0.27 Pr > |t| 0.8071 0.7879 Graduate Summer Session in Epidemiology Slide 112 CHAPTER 3 Epid 766, D. Zhang 3 Modeling and design issues How to handle baseline response? Do we model previous responses as covariates? Modeling response vs. modeling the change of response A simulation study comparing modeling response to modeling its change Design a longitudinal study. Sample size calculation 1. Comparing time-averaged means 2. Comparing slopes Graduate Summer Session in Epidemiology Slide 113 CHAPTER 3 Epid 766, D. Zhang 3.1 How to handle baseline response? Model baseline outcome as part of the response. For example, yij = 0 + 1 xij + eij , i = 1, ..., m, j = 1, 2, ..., ni , (3.1) where the errors eij include random effects and other errors, and hence are correlated. For example, eij = bi + ij for a random intercept model. Model baseline outcome as a covariate. For example, yij = 0 + 1 xij + 2 yi1 + ij , Comments: 1. There are some subtle difference between these two models. The regression parameters 0 , 1 and the variance components have different interpretation and hence we will get different estimates from two models. 1 in model (3.1) is the overall effect of x on y, Graduate Summer Session in Epidemiology Slide 114 i = 1, ..., m, j = 2, ..., ni . (3.2) CHAPTER 3 Epid 766, D. Zhang while 1 in model (3.2) is the adjusted covariate effect of x on y adjusting for baseline response. 2. Model (3.2) is more convenient for prediction. Although one can also get a prediction model similar to model (3.2) by conditioning on the baseline response from model (3.1). 3. When baseline response yi1 is used as a covariate, it CANNOT be re-used in the outcome variable. For model (3.2), index j goes from 2 to ni . Because of this, the estimates from model (3.1) may be more efficient. 4. It is obvious that in the presence of missing data, the subjects with baseline measurements only will be deleted from analysis if model (3.2) is used. In the case where missingness depends on the baseline measurements, inference using model (3.2) will be invalid. However, model (3.1) will still give valid inference. We will see a simulation study later. Graduate Summer Session in Epidemiology Slide 115 CHAPTER 3 Epid 766, D. Zhang 3.2 Do we model previous responses as covariates? One might consider an auto-regressive type of model like the following one instead of (3.1): yij = 0 + 1 xij + 2 yi,j-1 + ij , Comments: 1. This model is different from models (3.1) and (3.2). Here 1 is the adjusted effect of x on y after adjusting for the previous response. Therefore, they have different interpretation. 2. Since we allow the current response depends on the previous response in this model, part of the correlation among responses is taken away by the coefficient 2 . Hence the errors may have much simpler variance structure than the errors in model (3.1). In fact, people often assume ij in (3.3) to be independent. This is an Graduate Summer Session in Epidemiology Slide 116 i = 1, ..., m, j = 2, ..., ni . (3.3) CHAPTER 3 Epid 766, D. Zhang example of transition models. Consequently, the variance component parameters in this model are different and have different interpretation from those in model (3.1). 3. We can obtain a similar model to this one if we assume the errors in model (3.1) have an AR(1) variance structure. 4. Similar to model (3.2), this model is more convenient for prediction. 5. Similar to model (3.2), subjects with baseline measurements only will be deleted from the analysis. If missingness of subsequent measurements depends on the baseline measurements, this model will give invalid inference on the parameters of interest. Graduate Summer Session in Epidemiology Slide 117 CHAPTER 3 Epid 766, D. Zhang 3.3 Modeling outcome vs. modeling the change of outcome Define change from baseline: Dij = yij - yi1 , and consider model Dij = 0 + 1 xij + ij , Comments: 1. This model emphasizes the effect of x on the change (from baseline value) of outcome. Therefore, 1 has different interpretation than the 1 's in previous models. 2. Since we are modeling the difference, part of the correlation in the responses due to among individual variation is removed. Therefore, the errors in this model will have a simpler variance structure than Graduate Summer Session in Epidemiology Slide 118 i = 1, ..., m, j = 2, ..., ni i = 1, ..., m, j = 2, ..., ni . (3.4) CHAPTER 3 Epid 766, D. Zhang model (3.1), and the parameters in the variance structures have different interpretation. 3. Baseline outcome yi1 can be used as a covariate. 4. It cannot model how x affects the overall mean of outcome. 5. Similar to models (3.2) and (3.3), subjects with baseline measurements only will be deleted from the analysis, and if missingness depends on the baseline measurements, the inference will be invalid. 6. Which model to use depends on the scientific questions we want to address. Graduate Summer Session in Epidemiology Slide 119 CHAPTER 3 Epid 766, D. Zhang A simulation study Generate data from the following model: yij = 0 + 1 tj + bi + ij , i = 1, ..., 50, j = 1, 2, where 0 = 1, 1 = 2, t1 = 0, t2 = 1, bi N (0, 1), ij N (0, 1). 1. yi1 can be viewed as pre-test (or baseline) score, yi2 can be viewed as post-test score for subject i. 2. In the simulation, we let yi2 be missing whenever the baseline measurement yi1 is negative. 3. 1 = E(yi2 ) - E(yi1 ). We would like to make inference on 1 in the presence of missing data. Graduate Summer Session in Epidemiology Slide 120 CHAPTER 3 Epid 766, D. Zhang One simulated data set: Obs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 score0 1.33662 0.17404 1.45672 1.08229 0.55392 1.73579 -0.27640 0.78154 -0.33015 -1.11409 1.54039 1.20696 1.35767 0.68858 0.81951 0.49849 -1.68078 2.31063 1.05800 1.00388 4.45060 2.20755 1.02019 2.30880 1.93793 -1.30937 -0.80651 0.65134 0.72529 1.00030 2.75257 -1.71925 0.65070 0.23703 -1.32099 score1 1.96479 1.93052 5.07021 3.71837 2.51172 3.43906 . 1.60275 . . 2.02123 2.19839 2.33060 1.55404 3.78494 1.40747 . 3.70494 2.22613 4.72160 7.63933 2.18365 1.81962 4.09571 3.26014 . . 4.66953 0.77726 4.76540 5.03208 . 3.11335 2.03079 . scoredif 0.62816 1.75648 3.61349 2.63608 1.95780 1.70327 . 0.82121 . . 0.48084 0.99143 0.97293 0.86545 2.96542 0.90897 . 1.39431 1.16813 3.71773 3.18873 -0.02390 0.79943 1.78691 1.32222 . . 4.01819 0.05197 3.76511 2.27951 . 2.46265 1.79376 . Slide 121 Graduate Summer Session in Epidemiology CHAPTER 3 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 0.50320 4.41193 -0.60138 -0.24154 2.31534 1.55065 1.32359 1.08330 0.14231 -0.08897 -1.03434 3.75676 3.19876 1.02650 1.39603 1.96533 5.55117 . . 3.74849 5.14498 5.46448 4.74553 3.23607 . . 4.16679 4.32866 2.97035 1.75847 1.46214 1.13924 . . 1.43315 3.59433 4.14089 3.66223 3.09376 . . 0.41004 1.12990 1.94386 0.36244 Epid 766, D. Zhang 1. If we take difference as we did in the previous model, then we would use the sample mean of the non-missing difference (only 38 differences) to estimate 1 , this will give 1 = 1.85 (SE=0.20). Obviously, this estimate is biased (here it is biased towards zero). We call this approach paired t-test approach. 2. Since we have a special case of longitudinal studies, we can use mixed model approach to estimate 1 . For this purpose, let us re-arrange data in the right format for Proc mixed. Graduate Summer Session in Epidemiology Slide 122 CHAPTER 3 Epid 766, D. Zhang 3. The data for the first 20 subjects are given below: Obs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 id 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 score 1.33662 1.96479 0.17404 1.93052 1.45672 5.07021 1.08229 3.71837 0.55392 2.51172 1.73579 3.43906 -0.27640 . 0.78154 1.60275 -0.33015 . -1.11409 . time 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 4. We use the following SAS program for estimating 1 proc mixed data=maindat; class id; model score = time / s; random int / subject=id type=vc; repeated / subject=id type=vc; run; Graduate Summer Session in Epidemiology Slide 123 CHAPTER 3 Epid 766, D. Zhang 5. Part of the output from the above SAS program: The Mixed Procedure Model Information Data Set Dependent Variable Covariance Structure Subject Effects Estimation Method Residual Variance Method Fixed Effects SE Method Degrees of Freedom Method WORK.MAINDAT score Variance Components id, id REML Parameter Model-Based Containment Class Level Information Class id Levels 50 Values 1 2 .. Dimensions Covariance Parameters Columns in X Columns in Z Per Subject Subjects Max Obs Per Subject Observations Used Observations Not Used Total Observations Convergence criteria met. 2 2 1 50 2 88 12 100 Covariance Parameter Estimates Graduate Summer Session in Epidemiology Slide 124 CHAPTER 3 Cov Parm Intercept Residual Subject id id Fit Statistics -2 Res Log Likelihood AIC (smaller is better) AICC (smaller is better) BIC (smaller is better) The SAS System The Mixed Procedure Solution for Fixed Effects Effect Intercept time Estimate 0.9146 2.0503 Standard Error 0.2117 0.1987 DF 49 37 t Value 4.32 10.32 300.6 304.6 304.8 308.4 Estimate 1.4573 0.7828 Epid 766, D. Zhang 5 Pr > |t| <.0001 <.0001 Type 3 Tests of Fixed Effects Effect time Num DF 1 Den DF 37 F Value 106.50 Pr > F <.0001 Graduate Summer Session in Epidemiology Slide 125 CHAPTER 3 Epid 766, D. Zhang The following table gives the simulation results comparing the longitudinal approach modeling all responses simultaneously and the paired t-test approach modeling the difference based on 1000 simulation runs: Method Longitudinal approach Paired t-test approach Mean 2.002 1.712 SE 0.222 0.214 SD 0.257 0.217 Cov. prob. 0.91 0.72 where Mean is the sample mean of 1000 1 's from both approaches; SE is the sample mean of 1000 estimated SEs of 1 ; SD is the sample standard deviation of 1000 1 's; Cov. prob. is the empirical coverage probability of 95% CI of 1 . Graduate Summer Session in Epidemiology Slide 126 CHAPTER 3 Epid 766, D. Zhang What we see from this table: 1. The estimate 1 using longitudinal approach by modeling all responses simultaneously is unbiased; however, if we take difference of the responses (here we are forced to delete all subjects with missing measurements), the estimate 1 is biased. 2. Although the estimate 1 from the paired t-test approach has slightly smaller SE or SD, since the estimate itself is biased, the coverage probability of the 95% CI of 1 is too low, making invalid inference on 1 . However, the coverage probability of the 95% CI of 1 from the longitudinal approach is almost right at the nominal level (0.95). 3. With mixed model approach, we can estimate other quantities. Graduate Summer Session in Epidemiology Slide 127 CHAPTER 3 Epid 766, D. Zhang 3.4 Design a longitudinal study: Sample size estimation Recall that in the classical setting, sample size estimation is posed as a hypothesis testing problem such as the following one H0 : 1 = 2 vs HA : 1 = 2 . Assume y1k , ..., ymk N (k , 2 ), k = 1, 2. Given significance level , power , and the difference = (1 - 2 )/ we wish to detect, the required total sample size (number of subjects) in each group should be z/2 + z1- m=2 2 . Graduate Summer Session in Epidemiology Slide 128 CHAPTER 3 Epid 766, D. Zhang Design a longitudinal study (cont'd): I: Compare time-averaged means between two groups. Assume model for the data to be collected: Group A : yij = A + ij , i = 1, ..., m, j = 1, ..., n Group B : yij = B + ij , i = 1, ..., m, j = 1, ..., n m =# of subjects, n =# of observations/subject, ij normally distributed errors with mean zero, variance 2 and correlation . We want to test H0 : A = B vs HA : A = B at level with power to detect difference = (A - B )/. The quantities m and n have to satisfy (z/2 + z1- )2 . m = 2(1 + (n - 1)) 2 n Graduate Summer Session in Epidemiology Slide 129 CHAPTER 3 Epid 766, D. Zhang Comments: 1. When n = 1, the study reduces to a cross-sectional study and the sample size formula reduces to the classical one. 2. When = 0 (responses are independent), the required sample size is 1/n of that for classical study. 3. When = 1, required sample size is the same as that of the classical study. 4. For fixed n, smaller gives smaller sample size. 5. If correlation is high, use more subjects and less obs/subject; if correlation is low, use less subjects and more obs/subject. 6. The sample size formula depends on information on 2 and . 7. One can choose a combination of m and n to meet one's specific needs. 8. The above formula is for two-sided test. Graduate Summer Session in Epidemiology Slide 130 CHAPTER 3 Epid 766, D. Zhang An example: If n = 3, = 0.05, = 0.8, then the number of subjects (m) per group is (1.96 + 0.84)2 m = 2(1 + 2) 32 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.1 733 838 942 1047 1152 1256 1361 0.2 184 210 236 262 288 314 341 0.3 82 94 105 117 128 140 152 0.4 46 53 59 66 72 79 86 0.5 30 34 38 42 47 51 55 0.6 21 24 27 30 32 35 38 0.7 15 18 20 22 24 26 28 0.8 12 14 15 17 18 20 22 Graduate Summer Session in Epidemiology Slide 131 CHAPTER 3 Epid 766, D. Zhang Design a longitudinal study (cont'd): II: Compare slopes between two groups. Model for the data to be collected: Group A : yij = 0A + 1A tj + ij , i = 1, ..., m, j = 1, ..., n Group B : yij = 0B + 1B tj + ij , i = 1, ..., m, j = 1, ..., n m =# of subjects, n =# of observations/subject, ij are normal errors with mean zero, variance 2 and correlation . We are interested in testing H0 : 1A = 1B vs HA : 1A = 1B at level with power to detect difference = (1A - 1B )/. The quantities m and n have to satisfy 2(1 - )(z/2 + z1- )2 , s2 = m= t n2 s2 t Graduate Summer Session in Epidemiology n j=1 (tj n - t)2 . Slide 132 CHAPTER 3 Epid 766, D. Zhang Comments: 1. For fixed time points tj , larger gives smaller sample size m. 2. If = 1, one subject from each group is enough. 3. = 0 will require maximum sample size m. 4. If correlation is low, use more subjects and less obs/subject; if correlation is high, use less subjects and more obs/subject. 5. The sample size formula depends on information on 2 and and the placement of time points tj 's. 6. One can choose a combination of m and n to meet one's specific needs. 7. The above formula is for two-sided test. Graduate Summer Session in Epidemiology Slide 133 CHAPTER 3 Epid 766, D. Zhang An example: If n = 3, = 0.05, = 0.8, t = (0, 2, 5) so s2 = 4.222, t then the number of subjects (m) per group is 2(1 - )(1.96 + 0.84)2 m= 3 4.2222 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.02 2479 2169 1859 1550 1240 930 620 0.03 1102 964 827 689 551 414 276 0.04 620 543 465 388 310 233 155 0.05 397 348 298 248 199 149 100 0.06 276 241 207 173 138 104 69 0.07 203 178 152 127 102 76 51 0.08 155 136 117 97 78 59 39 0.09 123 108 92 77 62 46 31 0.10 100 87 75 62 50 38 25 Slide 134 Graduate Summer Session in Epidemiology CHAPTER 4 Epid 766, D. Zhang 4 Modeling discrete longitudinal data 1. Why GEEs? Generalized estimating equations (GEEs) 2. Key features of GEEs 3. Some popular GEE models 4. Some basics of GEEs 5. Interpretation of GEEs 6. Analyze infectious disease data using GEE 7. Analyze epileptic data using GEE Generalized linear mixed models (GLMMs) 1. Model specification & implementation 3. Analyze epileptic data using a GLMM Graduate Summer Session in Epidemiology Slide 135 2. Analyze infectious disease data using a GLMM CHAPTER 4 Epid 766, D. Zhang 4.1 Generalized estimating equations (GEEs) for continuous and discrete longitudinal data Why GEEs? 4.1.1 Recall that a linear mixed model for longitudinal data may take the form: yij = 0 + 1 tij + 2 xij2 + + p xijp + bi0 + bi1 tij + ij . Key features: 1. Outcome yij is continuous and normally distributed. 2. Correlation in outcome observations from the same individuals is directly modeled using random effects (e.g., random intercept and slope). However, 1. in many biomedical studies, the outcome variables are discrete Slide 136 Graduate Summer Session in Epidemiology CHAPTER 4 Epid 766, D. Zhang (not continuous). For example, the outcome is binary (yes/no) in Indonesian children study, and the outcome is count in the Epileptic clinical trial. 2. sometimes, we are mainly interested in the covariate effects, not in correlation among the outcome observations from the same subject. A partial reason is that it is much harder to know how the discrete observations are correlated to each other over time than continuous outcomes. 3. we might also want to model the correlation in a natural way jointly with the estimation of covariate effects of interest. Graduate Summer Session in Epidemiology Slide 137 CHAPTER 4 Epid 766, D. Zhang What is wrong with the classical regression approach such as the logistic regression for binary outcomes? Classical logistic regression model: yi Binomial(1, i (xi )), yi = 1/0, i (xi ) = E[yi |xi ] i (xi ) = 0 + 1 xi1 + + p xip . logit{i (xi )} = log 1 - i (xi ) Key features: 1. Each subject contributes only one binary observation. 2. It is reasonable to assume that the outcomes from different subjects are independent. However, in a longitudinal study, 1. Each subject has multiple binary (1/0) responses over time. 2. The subjects with higher probability to get disease will tend to have more 1's, resulting a correlation. 3. Even though a classical regression by ignoring correlation will Graduate Summer Session in Epidemiology Slide 138 CHAPTER 4 Epid 766, D. Zhang give us correct and meaningful regression coefficient estimates, their SEs are often too small, resulting invalid inference. 4. The correlation has to be taken into account for valid inference (to get correct standard errors of the regression coefficient estimates). Generalized estimating equations (GEEs) is an approach that allows us to make valid inference by implicitly taken into account the correlation. Graduate Summer Session in Epidemiology Slide 139 CHAPTER 4 Epid 766, D. Zhang 4.1.2 Key features of GEEs for analyzing longitudinal data 1. We only need to correctly specify how the mean of the outcome variable is related to the covariates of interest. For example, for the infection disease study, yij Binomial(1, ij (xij )) logit{ij (xij )} = 0 + 1 seasonij + 2 Xeroij + 3 agei +4 timeij + 5 sexi + 6 heightij , ij (xij ) = P [yij = 1|xij ] = E(yij |xij ) is the population probability of respiratory infection for the population defined by the specific covariate values (i.e., xij ). 2. The correlation among the observations from the same subject over time is not the major interest and is treated as nuisance. 3. We can specify a correlation structure. The validity of the inference does not depend on the correct specification of this structure. GEE Graduate Summer Session in Epidemiology Slide 140 CHAPTER 4 Epid 766, D. Zhang will give us a robust inference on the regression coefficients, which is valid regardless whether or not the correlation structure we specified is right. 4. A fundamental assumption on missing data is that missing data mechanism has to be MCAR. 5. GEE calculates correct SEs for the regression coefficient estimates using sandwich estimates that will take into account the possibility that the correlation structure is misspecified. 6. The regression coefficients in GEE have population-average interpretation. Graduate Summer Session in Epidemiology Slide 141 CHAPTER 4 Epid 766, D. Zhang 4.1.3 Some popular GEE Models Continuous (Normal): (x) = 0 + 1 x1 + + p xp where (x) = E(y|x) is the mean of outcome variable at x = (x1 , ..., xp ), such as mean of cholesterol level. Proportion (Binomial, Binary): logit{(x)} = 0 + 1 x1 + + p xp (x) = P [y = 1|x] = E(y|x) such as disease risk. logit() = log{/(1 - )} is the logit link function. Other link functions are possible. Graduate Summer Session in Epidemiology Slide 142 CHAPTER 4 Epid 766, D. Zhang Count or rate (Poisson) log{(x)} = 0 + 1 x1 + + p xp (x) is the rate (e.g. (x) is the incidence rate of a disease) for the count data (number of events) y over a (time, space) region T such that y|x has a distribution similar to Poisson{(x)T }. Here log(.) link is used. Other link functions are possible. However, we don't have to assume the Poisson distribution. In particular, we don't have to assume that var(y|x) = (x)T. Graduate Summer Session in Epidemiology Slide 143 CHAPTER 4 Epid 766, D. Zhang 4.1.4 Some basics of GEEs Data: yij , i = 1, ..., m, j = 1, ..., ni with mean ij = E(yij |xij ). Denote yi = yi1 yi2 . . . yini , i = i1 i2 . . . ini . Suppose we correctly specify the mean structure for data yij : g(ij ) = 0 + x1ij 1 + ... + xpij p , where g() is the link function such as logit function for binary response and log link for count data. Graduate Summer Session in Epidemiology Slide 144 CHAPTER 4 Epid 766, D. Zhang A GEE solves the following generalized estimating equation for (Liang and Zeger, 1986): m S (, ) = i=1 i T Vi-1 (yi - i ) = 0, (4.1) where Vi is some matrix (intended to specify for var(yi |xi )) and is the possible parameters in the correlation structure. The above estimating equation is unbiased no matter what matrix Vi we use as long as the mean structure is right. That is E[S (, )] = 0. Under some regularity conditions, the solution from the GEE equation (4.1) has asymptotic distribution N(, ), Graduate Summer Session in Epidemiology Slide 145 a CHAPTER 4 Epid 766, D. Zhang where -1 -1 = I0 I1 I0 m I0 I1 = i=1 m T Di Vi-1 Di = i=1 m T Di Vi-1 var(yi |xi )Vi-1 Di T Di Vi-1 (yi - i ())(yi - i ())T Vi-1 Di = i=1 is called the empirical or robust variance estimate. -1 If Vi is correctly specified, then I1 I0 and I0 (model based). -1 In this case, is the most efficient estimate. Otherwise, = I0 . Graduate Summer Session in Epidemiology Slide 146 CHAPTER 4 Epid 766, D. Zhang Vi , the working variance matrix for yi (at xi ), can be decomposed as Vi = i Ri i , where i = var(yi1 |xi1 ) 0 . . . 0 var(yi2 |xi2 ) . . . 0 0 . . . var(yini |xini ) , 1/2 1/2 . . . 0 0 and Ri is the correlation structure. We may try to specify Ri so that it is close to the "true". This Ri is called the working correlation matrix and may be mis-specified. Graduate Summer Session in Epidemiology Slide 147 CHAPTER 4 Epid 766, D. Zhang Some working correlation structures 2. Exchangeable (compound 1 Ri = . . . 1 (N 1. Independent: Ri () = Ini ni . No needs to be estimated. symmetric): 1 . . . . . . 1 . . . m Since E(eij eik ) = (at true ), = = - p - 1) eij eik , i=1 j<k where N = m ni (ni - 1)/2 (total # of pairs), is a i=1 (over)-dispersion parameter, no impact on GEE inference! Graduate Summer Session in Epidemiology Slide 148 CHAPTER 4 Epid 766, D. Zhang 3. AR(1): Ri = 1 . . . ni -1 1 . . . ni -2 ni -1 ni -2 . . . . . . 1 Since E(eij ei,j+1 ) = (at true ), = = where N = 1 (N - p - 1) m ni -1 eij ei,j+1 , i=1 j=1 m i=1 (ni 4. Many more can be found in SAS. Software: Proc Genmod in SAS - 1) (total # of adjacent pairs). Graduate Summer Session in Epidemiology Slide 149 CHAPTER 4 Epid 766, D. Zhang 4.1.5 Interpretation of regression coefficients in a GEE Model A classical logistic model: y = indicator of lung cancer Bin(1, ) logit() = 0 + 1 XE + 2 XC where 1 XE = 0 1 XC = 0 exposure = yes exposure = no confounder = yes confounder = no For example, XE = smoking (yes/no), XC = Age (> 50 vs. 50). Then 1 = age-adjusted log(OR) ( log(RR)) of lung cancer comparing the population of smokers and the population of non-smokers. Graduate Summer Session in Epidemiology Slide 150 CHAPTER 4 Epid 766, D. Zhang In general, k in a logistic regression can be interpreted as k = log(OR) of disease under consideration for two populations with covariate values xk + 1 and xk while other covariates are held fixed. The regression coefficients in a GEE logistic model have the same population-averaged interpretation as those in a classical logistic model. GEE combines information from a sample of subjects to estimate these population-averaged estimates. These will be contrasted with subject-specific regression coefficients later. Graduate Summer Session in Epidemiology Slide 151 CHAPTER 4 Epid 766, D. Zhang 4.1.6 Analyze Infectious disease data using GEE Data: 275 Indonesian preschool children. Each was followed over 6 consecutive quarters. Outcome = respiratory infection (yes/no) Covariates: Xero (xerophthalmia (yes/no)), season, age, sex, height (height for age) GEE logistic model: yij (1/0) = infection indicator Bin(1, ij ), logit(ij ) = 0 + 1 seasonij + 2 Xeroij + 3 agei +4 timeij + 5 sexi + 6 heightij See the SAS program indon gee.sas and its output indon gee.lst for details. Graduate Summer Session in Epidemiology Slide 152 CHAPTER 4 SAS program: indon gee.sas options ls=72 ps=72; /*------------------------------------------------------*/ /* */ /* Proc Genmod to fit population average (marginal) */ /* model using GEE approach for the Indonesia children */ /* infection disease data */ /* */ /*------------------------------------------------------*/ data indon; infile "indon.dat"; input id infect intercept age xero cosv sinv sex height stunted visit baseage season visitsq; time = age-baseage; run; data indon; set indon; nobs=_n_; run; title "Print the first 20 observations"; proc print data=indon (obs=20); var id infect season xero age sex height visit; run; title "Model 1: Use exchangeable working correlation"; proc genmod descending; class id; model infect = season xero baseage time sex height / dist=bin link=logit; repeated subject=id / type=exch corrw; run; Graduate Summer Session in Epidemiology Epid 766, D. Zhang Slide 153 CHAPTER 4 SAS output: indon gee.lst Model 1: Use exchangeable working correlation The GENMOD Procedure Model Information Data Set Distribution Link Function Dependent Variable Observations Used WORK.INDON Binomial Logit infect 1200 Epid 766, D. Zhang 2 Class Level Information Class id Levels 275 Values 121013 ... Response Profile Ordered Value 1 2 infect 1 0 Total Frequency 107 1093 PROC GENMOD is modeling the probability that infect='1'. Graduate Summer Session in Epidemiology Slide 154 CHAPTER 4 Criteria For Assessing Goodness Of Fit Criterion Deviance Scaled Deviance Pearson Chi-Square Scaled Pearson X2 Log Likelihood Algorithm converged. Analysis Of Initial Parameter Estimates Parameter Intercept season xero baseage time sex height Scale DF 1 1 1 1 1 1 1 0 Estimate -2.3572 -0.0424 0.6657 -0.0333 0.0006 -0.3841 -0.0462 0.9931 Standard Error 0.3435 0.1098 0.4313 0.0064 0.0199 0.2173 0.0205 0.0000 Wald 95% Confidence Limits -3.0305 -0.2576 -0.1796 -0.0458 -0.0384 -0.8099 -0.0864 0.9931 -1.6838 0.1728 1.5110 -0.0209 0.0397 0.0418 -0.0061 0.9931 ChiSquare 47.08 0.15 2.38 27.47 0.00 3.12 5.09 DF 1193 1193 1193 1193 Value 685.3920 694.9775 1176.5455 1193.0000 -347.4888 Epid 766, D. Zhang Value/DF 0.5745 0.5825 0.9862 1.0000 Pr > ChiSq <.0001 0.6995 0.1227 <.0001 0.9753 0.0771 0.0240 NOTE: The scale parameter was estimated by the square root of Pearson's Chi-Square/DOF. Graduate Summer Session in Epidemiology Slide 155 CHAPTER 4 GEE Model Information Correlation Structure Subject Effect Number of Clusters Correlation Matrix Dimension Maximum Cluster Size Minimum Cluster Size Algorithm converged. Working Correlation Matrix Col1 Row1 Row2 Row3 Row4 Row5 Row6 1.0000 0.0462 0.0462 0.0462 0.0462 0.0462 Col2 0.0462 1.0000 0.0462 0.0462 0.0462 0.0462 Col3 0.0462 0.0462 1.0000 0.0462 0.0462 0.0462 Col4 0.0462 0.0462 0.0462 1.0000 0.0462 0.0462 Exchangeable id (275 levels) 275 6 6 1 Epid 766, D. Zhang Col5 0.0462 0.0462 0.0462 0.0462 1.0000 0.0462 Col6 0.0462 0.0462 0.0462 0.0462 0.0462 1.0000 Analysis Of GEE Parameter Estimates Empirical Standard Error Estimates Parameter Estimate Intercept season xero baseage time sex height -2.3504 -0.0409 0.5525 -0.0338 0.0017 -0.4021 -0.0493 Standard Error 0.3332 0.0889 0.4472 0.0061 0.0216 0.2375 0.0258 95% Confidence Limits -3.0036 -0.2151 -0.3240 -0.0458 -0.0407 -0.8675 -0.0999 -1.6973 0.1334 1.4291 -0.0217 0.0440 0.0633 0.0014 Z Pr > |Z| -7.05 -0.46 1.24 -5.49 0.08 -1.69 -1.91 <.0001 0.6457 0.2167 <.0001 0.9385 0.0903 0.0566 Graduate Summer Session in Epidemiology Slide 156 CHAPTER 4 Epid 766, D. Zhang Some remarks: Proc Genmod in SAS always fits the model using independence correlation structure to get initial parameter estimate. We should read the output under "Analysis Of GEE Parameter Estimates", which is valid even if the correlation structure we specified (it is exchangeable here) may not be true. Given other characteristics, the odds-ratio of getting respiratory infection between two populations with or without Vitamin A deficiency is estimated to be e0.5525 = 1.74. If respiratory infection could be viewed as a rare disease, kids with Vitamin A deficiency would be 74% more likely to develop respiratory infection. However, p-value=0.2167 indicates that there is no significant difference in infection risk for these two populations. Graduate Summer Session in Epidemiology Slide 157 CHAPTER 4 Epid 766, D. Zhang 4.1.7 Analyze epileptic seizure count data using GEE Data: 59 patients, 28 in control group, 31 in treatment (progabide) group. 5 seizure counts (including baseline) were obtained. Covariates: treatment (covariate of interest), age. GEE Poisson model: yij =seizure counts obtained at the jth (j = 0, 1, ..., 4) time point for patient i, yij has a Poisson-like distribution with mean ij = E(yij ) = tij ij , where tij is the length of time from which the seizure count yij was observed, ij is hence the rate to have a seizure. First consider model log(ij ) = 0 + 1 I(j > 0) + 2 trti + 3 trti I(j > 0) log(ij ) = log(tij ) + 0 + 1 I(j > 0) + 2 trti + 3 trti I(j > 0) Note that log(tij ) is often called an offset. Graduate Summer Session in Epidemiology Slide 158 CHAPTER 4 Epid 766, D. Zhang Interpretation of 's: log of seizure rate Group Control (trt=0) Treatment (trt=1) Before randomization 0 0 + 2 After randomization 0 + 1 0 + 1 + 2 + 3 Therefore, 1 = time effect, 2 = difference in seizure rates at baseline between two groups, 3 = treatment effect of interest. If randomization is taken into account (2 = 0), we can consider the following model log(ij ) = log(tij ) + 0 + 1 I(j > 0) + 2 trti I(j > 0) See the SAS program seize gee.sas and its output seize gee.lst for details. Graduate Summer Session in Epidemiology Slide 159 CHAPTER 4 First part of seize gee.sas options ls=80 ps=1000 nodate; /*------------------------------------------------------*/ /* */ /* Proc Genmod to fit population average (marginal) */ /* model using GEE approach for the epileptic seizure */ /* count data */ /* */ /*------------------------------------------------------*/ data seizure; infile "seize.dat"; input id seize visit trt age; nobs=_n_; interval = 2; if visit=0 then interval=8; logtime = log(interval); assign = (visit>0); run; title "Model 1: overall effect of the treatment"; proc genmod data=seizure; class id; model seize = assign trt assign*trt / dist=poisson link=log offset=logtime; repeated subject=id / type=exch; run; Epid 766, D. Zhang Graduate Summer Session in Epidemiology Slide 160 CHAPTER 4 Output of the above program: Model 1: overall effect of the treatment The GENMOD Procedure Model Information Data Set Distribution Link Function Dependent Variable Offset Variable Observations Used WORK.SEIZURE Poisson Log seize logtime 295 Epid 766, D. Zhang 1 Class Level Information Class id Levels 59 Values 101 102 ... Criteria For Assessing Goodness Of Fit Criterion Deviance Scaled Deviance Pearson Chi-Square Scaled Pearson X2 Log Likelihood Algorithm converged. DF 291 291 291 291 Value 3577.8316 3577.8316 5733.4815 5733.4815 6665.9803 Value/DF 12.2950 12.2950 19.7027 19.7027 Graduate Summer Session in Epidemiology Slide 161 CHAPTER 4 Analysis Of Initial Parameter Estimates Parameter Intercept assign trt assign*trt Scale DF 1 1 1 1 0 Estimate 1.3476 0.1108 0.0265 -0.1037 1.0000 Standard Error 0.0341 0.0469 0.0467 0.0651 0.0000 Wald 95% Confidence Limits 1.2809 0.0189 -0.0650 -0.2312 1.0000 1.4144 0.2027 0.1180 0.0238 1.0000 ChiSquare 1565.44 5.58 0.32 2.54 Epid 766, D. Zhang Pr > ChiSq <.0001 0.0181 0.5702 0.1110 NOTE: The scale parameter was held fixed. GEE Model Information Correlation Structure Subject Effect Number of Clusters Correlation Matrix Dimension Maximum Cluster Size Minimum Cluster Size Algorithm converged. Analysis Of GEE Parameter Estimates Empirical Standard Error Estimates Parameter Intercept assign trt assign*trt Estimate 1.3476 0.1108 0.0265 -0.1037 Standard Error 0.1574 0.1161 0.2219 0.2136 95% Confidence Limits 1.0392 -0.1168 -0.4083 -0.5223 1.6560 0.3383 0.4613 0.3150 Z Pr > |Z| 8.56 0.95 0.12 -0.49 <.0001 0.3399 0.9049 0.6274 Slide 162 Exchangeable id (59 levels) 59 5 5 5 Graduate Summer Session in Epidemiology CHAPTER 4 Second part of seize gee.sas Epid 766, D. Zhang title "Model 2: take randomization into account"; proc genmod data=seizure; class id; model seize = assign assign*trt / dist=poisson link=log offset=logtime scale=pearson aggregate=nobs; repeated subject=id / type=exch; run; Output from the above program: Model 2: take randomization into account The GENMOD Procedure Model Information Data Set Distribution Link Function Dependent Variable Off...

Find millions of documents on Course Hero - Study Guides, Lecture Notes, Reference Materials, Practice Exams and more. Course Hero has millions of course specific materials providing students with the best way to expand their education.

Below is a small sample set of documents:

UMBC - NASA - 413
OCE: Review Evaluation Welcome emurian! Remember - your responses.https:/www.onlinecourseevaluations.com/(X(1)S(kpxjyifucjk01gqg3cpfz.UNIVERSITY OF MARYLAND - BALTIMORE COUNTYHENRY EMURIANHere is a printable document of each student in this g
UCF - TAX - 5015
TAX 5015 (Summer 2009) Chapter review exercise #2 Topic review: Corporations Taxable income Due date: Thursday, May 21, 2009 Name(s): SOLUTION. Brady Corporation reported the following information for 2009: Bad debt expense Increase in balance of a
UCF - TAX - 6845
Some common ACE adjustments Solution Part 1: Compute ACE. RedSox Inc., reported the following information: Alternative minimum taxable income Municipal bond interest Expenses related to municipal bonds Key-employee life insurance proceeds in excess
Texas A&M - CPSC - 310
Spring'2005 CPSC310-500/CPSC603-600, Dr. Salih Yurtta. s Test-1I. (70 points.) Relational Database Schema Movies = (MovieExec-schema, Studio-schema, MovieStar-schema, Movie-schema, StarsIn-schema) is given with the following relation schemas: Movi
Texas A&M - CPSC - 310
Spring'2006 CPSC310-500/CPSC603-600, Dr. Salih Yurtta. [hw-01] sRelational Database Schema Movies = (MovieExec-schema, Studio-schema, MovieStar-schema, Movie-schema, StarsIn-schema) is given with the following relation schemas: MovieExec-schema =
Texas A&M - CPSC - 310
z yx wv lB ik j i 9 t f o u p t p p p m d o s p r p p m o k p q p n p m @ u p t p p p s p r p p k p q p B o k p q p n p m @ k p q p o n m y l B B B i k j i 9 1 h h g) e 1 f) 9 1 f) e@ f
Brigham Young University, Hawaii - BUSM - 327
SELECTIONBUSM 327, Ch 61Chapter Objectives Explain the significance of employee selection. Identify environmental factors that affect the selection process. Describe the general selection process. Explain the importance of the preliminary
WVU - CS - 470
Homework 7 : Due Oct. 25, 2006CS 470 October 18, 20061Mesh loadingThe goal of this section is to load a model from an external file (ply format) and compute normal vectors for flat shading.You will create an interactive 3D application with t
WVU - CS - 470
Homework 6 : Due Oct. 4, 2006CS 470 September 27, 20061Camera TransformationThe goal of this homework is to gain experience with viewing transformations, matrices, the camera frame and the gluLookAt command. Develop a Camera3D class with the f
WVU - CS - 470
Computer GraphicsCS 470Computer Science and Electrical Engineering Dept. West Virginia University1st September 2006CS 470 (West Virginia University)Computer Graphics1st September 20061 / 11Outline1ViewingCS 470 (West Virginia Univ
WVU - CS - 470
Computer GraphicsCS 470Computer Science and Electrical Engineering Dept. West Virginia University13th September 2006CS 470 (West Virginia University)Computer Graphics13th September 20061 / 19Outline1Review Coordinate systems and fra
WVU - CS - 470
Computer GraphicsCS 470Computer Science and Electrical Engineering Dept. West Virginia University27th September 2006CS 470 (West Virginia University)Computer Graphics27th September 20061 / 19Outline1The camera frame2The gluLookA
Bowling Green - MATH - 120
Updated Math 112/120/122 announcementApril 23, 2003 The Department of Mathematics and Statistics will no longer offer Math 120, College Algebra, after Fall semester 2003. Math 120 is being replaced by a sequence of two courses, Math 112, College Alg
Eastern Washington University - CSCD - 210
There once was a young person named Little Red Riding Hood who lived on the edge of a large forest full of endangered owls and rare plants that would probably provide a cure for cancer if only someone took the time to study them.Red Riding Hood l
Allan Hancock College - MATH - 3063
Allan Hancock College - MATH - 3063
THE UNIVERSITY OF SYDNEY Differential Equations &amp; Biomathematics Semester 1 Tutorial Week 10 20081. Find first integrals for, and sketch the phase portraits of, the following systems. x = x2 , x = -x2 , (i) (ii) y = 3xy. y = xy. 2. Sketch the p
Colorado State - JT - 211
JT211 Final Exam Study Guide Spring 2009How to see: Visual input + understanding = _? Meaningfulness and memory Finding meaning in an image: 6 perspectives &quot;Sensing plus selecting plus perceiving equals seeing&quot; What does this mean? Light Visible lig
Colorado State - CS - 370
Opening the file Start VIM. From the desktop it can be found by going to Applications &gt; Accessories &gt; Vi Improved. Then open DISK-DISK1, located in your lab4 directory, by going through File &gt; Open. Once you've opened the DISK, you should see somethi
Colorado State - CS - 457
CS457 - Lecture TCP/IP Socket Programming Kaustubh Gadkari (slides adapted from Yan Chen)Outline Socket Introduction Using Stream Sockets Using Datagram Sockets Concurrency Programming Tips Project Q&amp;ATCP/IP Socket Programming2Socket Intr
Santa Clara - COEN - 180
COEN 180Field Effect TransistorFET Strength of an electric field controls the conductivity of a connection. Junction FET (JFET) Metal Oxide Semiconductor Field Effect Transistor (MOSFET). JFETJFETConductivity Width of channel
Santa Clara - COEN - 180
vti_encoding:SR|utf8-nl vti_timelastmodified:TR|16 Apr 2004 04:40:00 -0000 vti_extenderversion:SR|5.0.2.6417 vti_author:SR|BOBADILLA\Thomas Schwarz vti_modifiedby:SR|BOBADILLA\Thomas Schwarz vti_timecreated:TR|16 Apr 2004 04:40:00 -0000 vti_cacheddtm
Santa Clara - COEN - 180
vti_encoding:SR|utf8-nl vti_timelastmodified:TR|12 Apr 2004 16:40:00 -0000 vti_extenderversion:SR|5.0.2.6417 vti_author:SR|BOBADILLA\Thomas Schwarz vti_modifiedby:SR|BOBADILLA\Thomas Schwarz vti_timecreated:TR|12 Apr 2004 16:40:00 -0000 vti_cacheddtm
Santa Clara - COEN - 180
vti_encoding:SR|utf8-nl vti_timelastmodified:TR|10 Apr 2004 06:05:00 -0000 vti_extenderversion:SR|5.0.2.6417 vti_author:SR|BOBADILLA\Thomas Schwarz vti_modifiedby:SR|BOBADILLA\Thomas Schwarz vti_timecreated:TR|10 Apr 2004 06:05:00 -0000 vti_cacheddtm
Fayetteville State University - CHEM - 140
Fayetteville State University Department of Natural Sciences College of Basic and Applied Sciences Syllabus CHEM 140-04, FALL 2006 GENERAL CHEMISTRY ILecture Schedule: Tuesdays and Thursdays 8:00-9:20 AM, LS 304E Laboratory: Tuesdays 2:00-4:50 PM, L
Fayetteville State University - CHEM - 325
FAYETTEVILLE STATE UNIVERSITY DEPARTMENT OF NATURAL SCIENCES COLLEGE OF BASIC AND APPLIED SCIENCES CHEM 325 SPRING 2006Lecture Schedule: Mondays and Wednesdays 3:00-5:50 PM, LS 313 Lecturer: Dr. Jairo Castillo-Char a Office: LS 321 Telephone: (910)
Fayetteville State University - FACULTY - 110
FAYETTEVILLE STATE UNIVERSITY College of Arts and Sciences Department of Natural Sciences SYLLABUS I. LOCATOR INFORMATION: Semester: Fall Course Number NSCI 110-10 Instructor: Dr. Jairo Castillo-Char Office Location: LS 321 Office Telephone: 910-672-
Fayetteville State University - NSCI - 110
Chapter 1Copyright The McGraw-Hill Companies, Inc. Permission required for reproduction or display.Fig.01.coFig.01.01pv =kFig.01.03Fig.01.04Fig.01.07Ptolemaic's SystemFig.01.08Copernican `s SystemFig.01.10Fig.01.11Tycho Bra
Fayetteville State University - LEC - 140
FAYETTEVILLE STATE UNIVERSITY COLLEGE OF BASIC AND APPLIED SCIENCES DEPARTMENT OF NATURAL SCIENCES CHEM 140Chapter 3 Stoichiometry: Calculations with Chemical Formulas and EquationsPrentice Hall 2003 Chapter 3Chemical Equations Lavoisier: mass
Fayetteville State University - FACULTY - 110
1 FAYETTEVILLE STATE UNIVERSITY COLLEGE OF BASIC AND APPLIED SCIENCES DEPARTMENT OF NATURAL SCIENCES NSCI 110-12 SPRINT 2006 Review #1 and 2 PROBLEM SOLVING STRATEGY You will become skillful solving problems if you spend a great deal of time practici
Fayetteville State University - FACULTY - 110
-1FAYETTEVILLE STATE UNIVERSITY COLLEGE OF BASIC AND APPLIED SCIENCES DEPARTMENT OF NATURAL SCIENCES CSCI 110-04 FALL 2005 Oct. 21-05 Chapter 5 Electricity and Magnetism I) Basic concepts and Equations 1) The unit of charge is the Coulomb(C): The cha
Fayetteville State University - FACULTY - 110
-1FAYETTEVILLE STATE UNIVERSITY COLLEGE OF BASIC AND APPLIED SCIENCES DEPARTMENT OF NATURAL SCIENCES CSCI 110-04 FALL 2005 Nov. 28-05 Chapter 10 Crystals, Ions and Solutions I) Basic Concepts and Equations 1) Crystalline Solids: refers to well-organi
Colorado State - MECH - 513
F. Tracking Transient Entity Attributes with Ranked Lists Sometimes you do not want to take parts from the queue in waiting line order. For example, a rush order is expedited through the factory. In this section you will learn: 1. To use ranked lists
Hudson VCC - NR - 288
Agroecological DesignProblems in modern food systems 17% of all fossil fuel used in the U.S. is consumed by the food production system.1 The average U.S. farm uses an estimated 3 calories to produce 1 calorie of food. 1 Food and agricult
Hudson VCC - NR - 288
The Art and Science of Thinking EcologicallyTecate StoryLooking at My Shoes and Seeing Nothing There: The Failure of My EducationWes Jacksonhttp:/www.landinstitute.orgSustainability will result from our becoming better ecological accountant
Hudson VCC - NR - 288
Biosphere 2http:/www.bio2.com/Photo Credits: http:/www.biospheres.com, http:/www.pbase.com/jwalk/image/32289133 http:/www.andreasschreiber.mynetcologne.de/biosphere_2.htmBiosphere 2 Mars IdeasDesigns &amp; ModelsDesigns &amp; ModelsDesignsWest Lu