# L4_rev1 - CORRELATION

## CORRELATION

Data arise in pairs $(x_i, y_i)$, $i = 1, 2, \ldots, n$.

- Response $y$: outcome of the experiment
- Explanatory variable $x$: explains the outcome

### Scatterplot

Plot the observed pairs $(x_i, y_i)$ in the $x$-$y$ plane.

### Example: Cancer Mortality in Oregon vs Radioactive Contamination

| County | Index of exposure $x_i$ | Cancer mortality per 100,000 $y_i$ |
|---|---|---|
| Umatilla | 2.49 | 147.1 |
| Morrow | 2.57 | 130.1 |
| Gilliam | 3.41 | 129.9 |
| Sherman | 1.25 | 113.5 |
| Wasco | 1.62 | 137.5 |
| Hood River | 3.83 | 162.3 |
| Portland | 11.64 | 207.5 |
| Columbia | 6.41 | 177.9 |
| Clatsop | 8.34 | 210.3 |

Background:

- Plutonium storage near Hanford, WA
- Strontium 90 and Cesium 137 leak into the Columbia River and then to the Pacific Ocean
- 9 counties on the river or the ocean
- Index of exposure: distance, frontage

[Figure: scatterplot of mortality against index of exposure]

Cause-effect relationship?

Reference: Fadeley, R.C., Journal of Environmental Health, 27, 1965, pp 883-892

A scatterplot shows

- overall pattern, if any
- grouping, outliers
- direction
- form
- strength

Note: A scatterplot cannot prove a cause-effect relationship, only association.

### Correlation coefficient

$$
r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)
  = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
$$

where

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$

$$
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad s_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2
$$

$r$ measures the direction and strength of the linear relationship between two quantitative variables.
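The two forms of the correlation coefficient can be checked against each other on the Oregon data. The following is a minimal sketch in plain Python, not part of the original slides; the function names `corr_standardized` and `corr_sums` are illustrative choices.

```python
import math

# Oregon data from the example: index of exposure (x) and
# cancer mortality per 100,000 (y) for the nine counties.
x = [2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34]
y = [147.1, 130.1, 129.9, 113.5, 137.5, 162.3, 207.5, 177.9, 210.3]


def mean(values):
    """Arithmetic mean."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation, using the n - 1 divisor."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def corr_standardized(xs, ys):
    """r as the sum of products of standardized deviations, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum(((a - x_bar) / s_x) * ((b - y_bar) / s_y) for a, b in zip(xs, ys))
    return total / (n - 1)


def corr_sums(xs, ys):
    """r as a ratio of sums of deviations (the second form of the formula)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    s_xx = sum((a - x_bar) ** 2 for a in xs)
    s_yy = sum((b - y_bar) ** 2 for b in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


print(round(corr_standardized(x, y), 4))
print(round(corr_sums(x, y), 4))
```

Both calls should print the same value, a strong positive correlation of roughly 0.93. As the note on scatterplots says, this shows association only, not a cause-effect relationship.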
### Facts about $r$

- The magnitude $|r|$ expresses how well a straight line fits the data
- $-1 \le r \le 1$
- $r = +1$: the points $(x_i, y_i)$ lie on a straight line with positive slope
- $r = -1$: the points lie on a straight line with negative slope
- $r = 0$: $x$ and $y$ are called uncorrelated
- $r$ does not depend on the scale used for $x$ or $y$
- $r^2$ measures how much of the variation in $y$ can be "explained" by the variable $x$ (see MEANING OF $r^2$)

### Cautions

- $r$ may be influenced strongly by a single point

  [Figure: plot of $y$ against $x$]

- Don't interpret $r$ without an accompanying scatterplot:
  - (i) $y = \sqrt{x}$: $r$ is near 1, BUT the relationship is curved, not linear [Figure: plot of $y$ against $x$]
  - (ii) $r$ is large and positive, BUT [Figure: plot of $y$ against $x$]
  - (iii) A high $|r|$ does not imply a cause-effect relationship (vocabulary vs foot size)

## FITTING STRAIGHT LINES

Based on the principle of least squares.

[Figure: fitted line with an observed point $(x_i, y_i)$, its predicted value $\hat{y}_i$ at $x_i$, and the vertical distance $y_i - \hat{y}_i$]

Equation of the line:

$$
y = \alpha + \beta x
$$

Fitted values:

$$
\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i
$$

Residual:

$$
e_i = y_i - \hat{y}_i
$$

Minimize the sum of squared residuals

$$
S(\alpha, \beta) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2
$$

Solve

$$
\frac{\partial}{\partial \alpha} \sum (y_i - \alpha - \beta x_i)^2 = 0, \qquad
\frac{\partial}{\partial \beta} \sum (y_i - \alpha - \beta x_i)^2 = 0
$$

that is,

$$
-2 \sum (y_i - \alpha - \beta x_i) = 0, \qquad
-2 \sum x_i (y_i - \alpha - \beta x_i) = 0
$$

which gives the two equations

$$
\sum_i y_i = n\alpha + \beta \sum_i x_i \qquad (1)
$$

$$
\sum_i x_i y_i = \alpha \sum_i x_i + \beta \sum_i x_i^2 \qquad (2)
$$

Multiply (1) by $\sum_i x_i$, multiply (2) by $n$, then subtract (1) from (2). Solve for $\beta$, then substitute into (1) to solve for $\alpha$. Denote the solutions by $\hat{\alpha}, \hat{\beta}$:

$$
\hat{\beta} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = r \frac{s_y}{s_x}, \qquad
\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}
$$

The fitted line is therefore

$$
\begin{aligned}
\hat{y} &= \hat{\alpha} + \hat{\beta} x \\
        &= \bar{y} - \hat{\beta} \bar{x} + r \frac{s_y}{s_x} x \\
        &= \bar{y} - r \frac{s_y}{s_x} \bar{x} + r \frac{s_y}{s_x} x \\
        &= \bar{y} + r \frac{s_y}{s_x} (x - \bar{x})
\end{aligned}
$$

Exercise 1.5-4 p 62. Fitting a logarithm.

1.5-4 $\hat{\alpha} = 3.2761$, $\hat{\beta} = 0.3098$. Good fit.

[Figure: scatter plot of Y = Log(Volume) vs X = Day]

1.5-5 Try transformations of the form $y^{\lambda}$; these are called power transformations. One can show that $\lim_{\lambda \to 0} \frac{y^{\lambda} - 1}{\lambda} = \log y$. Hence $\lambda = 0$ corresponds to the log transformation. The transformation $y^{0.25}$ leads to a linear relation.

[Figure: scatter plot of Y vs X, with regression through origin]

Exercise 1.5-3 p 62. Fitting a parabola without linear term, $y = \alpha + \beta x^2$.

1.5-3 In the equations for the estimates, replace $x_i$ with $x_i^2$. Then $\hat{\beta} = 1.1165$, $\hat{\alpha} = 3.03$ and $\hat{y} = 3.03 + 1.11654 x^2$.

[Figure: scatter plot of Y vs X]

Exercise 1.5-9 p 64.

1.5-8 $r = 0.63$; $\hat{y} = 4.819 + 0.8707 x$

1.5-9 $r = 0.48$; $\hat{\alpha} = 0.25$ and $\hat{\beta} = 2.53$.

[Figure: fitted line plot, Salary = 0.25 + 2.533 Education]

1.5-10 Yearly number of storks and yearly number of births. Strong correlation, but no causation. Lurking variable "economic development" affects both series.

1.5-11 $r = 0.71$ (all data); $r = 0.48$ (male); $r = 0.54$ (female).

1.6-1 through 1.6-3 Need to collect your own data.

1.6-4 Replication: Average of several measurements is less variable than a single measurement. Randomization: Spreads the risk of uncontrolled variability fairly across ...

## MEANING OF $r^2$

Total Sum of Squares:

$$
\mathrm{SSTO} = \sum_{i=1}^{n} (y_i - \bar{y})^2
$$

Error Sum of Squares:

$$
\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

Regression Sum of Squares:

$$
\mathrm{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
$$

Intuitively we must have $\mathrm{SSE} \le \mathrm{SSTO}$.

Identity: $\mathrm{SSTO} = \mathrm{SSE} + \mathrm{SSR}$

Proof.

$$
\begin{aligned}
\sum_{i=1}^{n} (y_i - \bar{y})^2 &= \sum_{i=1}^{n} [(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})]^2 \\
&= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + 2 \sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})
\end{aligned}
$$

Next show that the cross-product term vanishes. We need the following facts.

1. $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$ and $\bar{y} = \hat{\alpha} + \hat{\beta} \bar{x}$.
2. The least-squares estimates satisfy

$$
\frac{\partial}{\partial \alpha} \sum (y_i - \alpha - \beta x_i)^2 = 0, \qquad
\frac{\partial}{\partial \beta} \sum (y_i - \alpha - \beta x_i)^2 = 0
$$

that is,

$$
-2 \sum (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0, \qquad
-2 \sum x_i (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0
$$

so that

$$
\sum (y_i - \hat{y}_i) = 0, \qquad \sum x_i (y_i - \hat{y}_i) = 0
$$

The last two equations, together with fact 1, are used to show that the cross-product vanishes.

Cross-product:

$$
\begin{aligned}
\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})
&= \sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{\alpha} + \hat{\beta} x_i - \hat{\alpha} - \hat{\beta} \bar{x}) \\
&= \hat{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i) x_i - \hat{\beta} \bar{x} \sum_{i=1}^{n} (y_i - \hat{y}_i) \\
&= 0
\end{aligned}
$$

Hence,

$$
\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2,
\qquad \mathrm{SSTO} = \mathrm{SSE} + \mathrm{SSR}
$$

Finally, show that

$$
r^2 = \frac{\mathrm{SSR}}{\mathrm{SSTO}}
$$

...
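The least-squares formulas and the sum-of-squares identity can also be checked numerically. The sketch below is not part of the original slides: it assumes the Oregon exposure and mortality data from the correlation example, and the helper name `fit_line` is illustrative. It fits the line, then compares $\hat{\beta}$ with $r\, s_y / s_x$, SSTO with SSE + SSR, and $r^2$ with SSR/SSTO.

```python
import math

# Oregon data from the correlation example: index of exposure (x)
# and cancer mortality per 100,000 (y).
x = [2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34]
y = [147.1, 130.1, 129.9, 113.5, 137.5, 162.3, 207.5, 177.9, 210.3]


def fit_line(xs, ys):
    """Return (alpha_hat, beta_hat) that minimize the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    s_xx = sum((a - x_bar) ** 2 for a in xs)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


alpha_hat, beta_hat = fit_line(x, y)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
fitted = [alpha_hat + beta_hat * a for a in x]

# Sums of squares as defined in the section on the meaning of r^2.
ssto = sum((b - y_bar) ** 2 for b in y)
sse = sum((b - f) ** 2 for b, f in zip(y, fitted))
ssr = sum((f - y_bar) ** 2 for f in fitted)

# Correlation coefficient and the slope written as r * s_y / s_x.
s_xx = sum((a - x_bar) ** 2 for a in x)
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
r = s_xy / math.sqrt(s_xx * ssto)
beta_from_r = r * math.sqrt(ssto / s_xx)  # s_y / s_x; the n - 1 divisors cancel

print("alpha_hat, beta_hat:", round(alpha_hat, 3), round(beta_hat, 3))
print("beta_hat equals r * s_y / s_x:", math.isclose(beta_hat, beta_from_r))
print("SSTO equals SSE + SSR:", math.isclose(ssto, sse + ssr))
print("r^2 equals SSR / SSTO:", math.isclose(r ** 2, ssr / ssto))
```

For these data the fitted intercept and slope come out at roughly 114.7 and 9.23, and all three comparisons should print `True` up to floating-point rounding.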