STAT 200, Lang Wu

Chapter 2. Looking at Data Relationship

2.1. Correlation

So far we have mostly focused on data from one variable. Sometimes we have data from two variables, such as data on weight and height or data on age and income. In this case, it is often of interest to look at the relationship between the two variables. For example, we may want to check whether the two variables are correlated.

A graphical way to check the relationship between two variables is to plot the data in a scatterplot. Scatterplots are often used to plot data on two continuous variables and can display the relationship between them.

The simplest and most important relationship between two continuous variables is a linear relationship:

$$y = a + bx,$$

where $x$ and $y$ are the two continuous variables and $a$ and $b$ are two constants.

If $x$ and $y$ roughly have a linear relationship, the points in their scatterplot will fall roughly around a straight line. Due to random variation, the points typically will not lie exactly on a straight line. The strength of a linear relationship in a scatterplot is determined by how closely the points follow a straight line.

Let $(x_1, x_2, \ldots, x_n)$ be data for variable $x$, and let $(y_1, y_2, \ldots, y_n)$ be data for variable $y$. A numerical measure of the linear relationship between the two continuous variables is the correlation $r$:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}},$$

where

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i.$$

Remark 1. Note that correlation does not imply causation. In other words, the fact that $x$ and $y$ are correlated does not imply that $x$ causes $y$ or vice versa.

Remark 2. The value of $r$ is always between $-1$ and $1$. The larger the value of $|r|$, the stronger the linear relationship. A negative value of $r$ suggests a negative association, while a positive value of $r$ suggests a positive association.

Remark 3. The value of $r$ does not change when we change the units of measurement for $x$ and $y$.

Remark 4. The correlation $r$ only measures the linear relationship, not other types of relationship.

Remark 5. The value of $r$ is very sensitive to outliers.

Figure 1 shows some data from 47 US states in 1960: the number of males of age 14-24 per 1000 population, the mean number of years of schooling for persons of age 25 or older, and family income. We see that the number of males is negatively associated with years of education (the correlation coefficient is ...

[Figure 1, panel "Male versus Education (r = -0.39)"; axes: Number of teen male, Education (year)]
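The formula for $r$ and Remarks 3 to 5 can be checked numerically. The following is a minimal Python sketch, not part of the original notes: the function name `correlation` and all data values are made up for illustration, and the code simply evaluates the formula for $r$ given above using the standard library.

```python
import math


def correlation(x, y):
    """Sample correlation r, computed directly from the formula in the notes."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator terms: sums of squared deviations.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data (NOT from the notes): heights in cm and weights in kg.
height_cm = [150, 160, 165, 170, 180, 185]
weight_kg = [52, 58, 63, 67, 77, 80]
r = correlation(height_cm, weight_kg)

# Remark 3: converting to inches and pounds should leave r unchanged
# (up to floating-point rounding).
height_in = [h / 2.54 for h in height_cm]
weight_lb = [w * 2.20462 for w in weight_kg]
r_units = correlation(height_in, weight_lb)

# Remark 5: one made-up outlier (a tall but very light person) changes r a lot.
r_outlier = correlation(height_cm + [190], weight_kg + [40])

# Remark 4: y = x^2 is an exact relationship, but it is not linear.
xs = [-2, -1, 0, 1, 2]
ys = [v * v for v in xs]
r_quad = correlation(xs, ys)

print("r (cm, kg):        ", r)
print("r (in, lb):        ", r_units)
print("r with one outlier:", r_outlier)
print("r for y = x^2:     ", r_quad)
```

In this made-up example, the first two printed values should agree, the third should be much smaller than the first, and the last should be essentially zero even though $y$ is completely determined by $x$.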
 Spring '11
 David
