PUBH 7430 Lecture 5
J. Wolfson
Division of Biostatistics, University of Minnesota School of Public Health
September 20, 2011

The linear predictor

We will often write down models like
$$Y = X\beta + \epsilon$$
or
$$E(Y) = X\beta$$
or (later)
$$f(E(Y)) = X\beta$$
$X\beta$ is called the linear predictor.

The linear predictor, cont’d.

Breaking down the linear predictor:
• $X$ is the “stacked” covariate matrix of dimension $(N \times p)$
• $\beta$ is a coefficient vector of length $p$

Hence $X\beta$ is a vector of length $N$; each entry is a linear combination of the covariate values in the corresponding row of $X$:
$$
X\beta =
\begin{pmatrix}
x_{11} \cdot \beta \\
x_{12} \cdot \beta \\
\vdots \\
x_{K n_K} \cdot \beta
\end{pmatrix}
=
\begin{pmatrix}
\sum_{r=1}^{p} x_{11r} \beta_r \\
\sum_{r=1}^{p} x_{12r} \beta_r \\
\vdots \\
\sum_{r=1}^{p} x_{K n_K r} \beta_r
\end{pmatrix}
$$

Note: If the model has an intercept (standard for most models), the first column of $X$ will be a vector of ones: $[1, 1, 1, \ldots, 1]$.

Scatterplot matrix: what to look for
• Which pairs appear to have the highest correlation (tight cluster around the 45-degree line)? Which pairs have the lowest correlation?
• Do pairs further apart in time (space) have higher or lower correlations?
• Do correlation patterns between adjacent pairs appear to depend on time?
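Returning briefly to the linear predictor defined at the start of the lecture, the following is a minimal Python sketch of how each entry of $X\beta$ is formed from one row of the stacked covariate matrix. The toy data, the coefficient values, and the function name `linear_predictor` are hypothetical and chosen only for illustration; they are not taken from the FEV data.

```python
# Hypothetical toy data: K = 2 clusters with n_1 = 2 and n_2 = 3 observations,
# stacked into N = 5 rows, and p = 2 covariates (an intercept and an age).
X = [
    [1.0, 6.0],   # cluster 1, observation 1
    [1.0, 9.0],   # cluster 1, observation 2
    [1.0, 6.0],   # cluster 2, observation 1
    [1.0, 9.0],   # cluster 2, observation 2
    [1.0, 12.0],  # cluster 2, observation 3
]
beta = [0.5, 0.25]  # made-up coefficients: intercept and slope


def linear_predictor(X, beta):
    """Return X beta as a list of length N.

    Entry k is sum_r X[k][r] * beta[r], i.e. a linear combination of the
    covariate values in row k.
    """
    for row in X:
        if len(row) != len(beta):
            raise ValueError("each row of X must have len(beta) entries")
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


eta = linear_predictor(X, beta)
print(eta)  # [2.0, 2.75, 2.0, 2.75, 3.5]
```

Because the first column of `X` is all ones, `beta[0]` plays the role of the intercept, as in the note above.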

Scatterplot matrix

[Figure: scatterplot matrix of logFEV.6, logFEV.9, logFEV.12, logFEV.15, and logFEV.18]

Correlation matrix

A scatterplot matrix can be summarized numerically by the correlation matrix (a scaled version of the covariance matrix):
$$
\hat{R}(A, B, C, \ldots) =
\begin{pmatrix}
1 & \hat{\rho}(A, B) & \hat{\rho}(A, C) & \cdots \\
\hat{\rho}(A, B) & 1 & \hat{\rho}(B, C) & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}
$$

Note: In the notation from lecture 4, $[A, B, C, \ldots] = Y_i$, corresponding to the observations on a single (or, alternately, “generic”) cluster. It can be estimated assuming that all clusters have the same underlying correlation matrix.

Correlation matrix

When estimated from the data, get the sample correlation matrix:

            logFEV.6  logFEV.9  logFEV.12  logFEV.15  logFEV.18
logFEV.6        1.00      0.55       0.49       0.55         NA
logFEV.9        0.55      1.00       0.71       0.74       0.72
logFEV.12       0.49      0.71       1.00       0.75       0.64
logFEV.15       0.55      0.74       0.75       1.00       0.87
logFEV.18         NA      0.72       0.64       0.87       1.00

Notes
• Usefulness of the correlation matrix as a summary measure depends on whether the relationship between variables is linear
• Look for scatterplots shaped like ellipses ⇒ data are approximately bivariate Normal

Correlation structures: Autocorrelation and variograms

Autocorrelation and stationarity
• Sometimes, pairs of observations measured the same time apart may have similar correlations
• e.g.

            logFEV.13  logFEV.14  logFEV.15  logFEV.16  logFEV.17
logFEV.13        1.00       0.89       0.85       0.75       0.76
logFEV.14        0.89       1.00       0.89       0.80       0.77
logFEV.15        0.85       0.89       1.00       0.88       0.85
logFEV.16        0.75       0.80       0.88       1.00       0.89
logFEV.17        0.76       0.77       0.85       0.89       1.00

• Suggests that the data generating process is stationary, i.e.
correlation between a pair of observations depends only on the time lag between them, not on the observation time itself: $\rho(Y(t_1), Y(t_2))$ depends only on $|t_1 - t_2|$

Autocorrelation and stationarity

If the process is stationary, we can safely combine measurements with the same time lag to estimate the autocorrelation function:
$$A(u) = \rho(Y(t_1), Y(t_2)), \quad |t_1 - t_2| = u$$
e.g. for FEV data
$$A(u) = \rho(\mathrm{FEV}(\mathrm{age}_1), \mathrm{FEV}(\mathrm{age}_2)), \quad |\mathrm{age}_1 - \mathrm{age}_2| = u$$
with estimate
$$
\hat{A}(u) =
\frac{\sum_{(i,j):\,|t_i - t_j| = u} (y_i - \bar{y})(y_j - \bar{y})}
{\sqrt{\sum_{(i,j):\,|t_i - t_j| = u} (y_i - \bar{y})^2 \; \sum_{(i,j):\,|t_i - t_j| = u} (y_j - \bar{y})^2}}
$$

Autocorrelation: limitations

$$A(u) = \rho(Y(t_1), Y(t_2)), \quad |t_1 - t_2| = u$$
The autocorrelation function has limited value if
• The data generating process is not stationary (e.g. FEV data across all ages)
• Observation times are not regularly spaced (e.g. beta-carotene measurements across time)
...
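As a rough illustration of pooling pairs with the same time lag, the sketch below estimates the autocorrelation at lag $u$ from clustered data using only the Python standard library. Several choices here are assumptions rather than anything stated in the slides: pairs are formed only within the same cluster, each pair is ordered earlier observation first, $\bar{y}$ is the overall mean of all observations, and the function name `autocorrelation` and the toy data are hypothetical.

```python
from math import sqrt


def autocorrelation(times, values, u, tol=1e-9):
    """Estimate A(u) by pooling all within-cluster pairs whose times differ by u.

    times, values: lists of per-cluster lists with matching shapes.
    u: time lag (intended for u > 0).
    Returns None when no pair has lag u or a sum of squares is zero.
    """
    # Assumption: y-bar is the overall mean across all clusters and times.
    all_y = [y for cluster in values for y in cluster]
    ybar = sum(all_y) / len(all_y)

    num = den_i = den_j = 0.0
    n_pairs = 0
    for t, y in zip(times, values):
        for a in range(len(t)):
            for b in range(len(t)):
                # Keep ordered pairs (earlier, later) separated by exactly u.
                if abs((t[b] - t[a]) - u) < tol:
                    num += (y[a] - ybar) * (y[b] - ybar)
                    den_i += (y[a] - ybar) ** 2
                    den_j += (y[b] - ybar) ** 2
                    n_pairs += 1

    if n_pairs == 0 or den_i == 0.0 or den_j == 0.0:
        return None
    return num / sqrt(den_i * den_j)


# Made-up log FEV values for three subjects; the third has no age-14 visit.
ages = [[13, 14, 15], [13, 14, 15], [13, 15]]
log_fev = [[0.90, 1.00, 1.10], [1.10, 1.15, 1.30], [1.00, 1.20]]

for lag in (1, 2):
    print(lag, autocorrelation(ages, log_fev, lag))
```

With irregularly spaced observation times, few or no pairs share an exact lag, so the function returns `None` or a very noisy estimate, which is one way to see the second limitation listed above.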
