M316 Chapter 2 Dr. Berg

Describing Distributions with Numbers

It is convenient to be able to give numeric descriptions of data, such as the average travel time to work.

Measuring Center: Mean

The most common measure of central tendency is the arithmetic average, or mean.

Definition The mean of a set of quantitative data is equal to the sum of all the measurements in the data set divided by the number of measurements in the data set. Thus, if the data set is \( \{x_1, x_2, \ldots, x_n\} \), then the (sample) mean is

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i . \]

In statistics, we use the mean of a sample from a population to estimate the mean of the whole population, since it is usually not practical to collect the data for the entire population. To distinguish these from each other, we use \( \bar{x} \) (read as "x bar") for the mean of a sample, and use it to make inferences about the population mean \( \mu \) (the Greek lowercase letter mu).

Example Here are the incomes of 15 college graduates (in thousands of dollars):

110, 25, 50, 50, 55, 30, 35, 30, 4, 31, 50, 30, 32, 74, 60

The mean is

\[ \bar{x} = \frac{110 + 25 + 50 + 50 + 55 + 30 + 35 + 30 + 4 + 31 + 50 + 30 + 32 + 74 + 60}{15} = \frac{666}{15} = 44.4 \]

or $44,400. The highest observation (110) and the lowest (4) are probably outliers. Without them the mean is

\[ \bar{x} = \frac{552}{13} \approx 42.46 \]

or about $42,460.

Measuring Center: Median

Another important measure of central tendency is the median. It is particularly useful for skewed distributions.

Definition The median M is the midpoint of a distribution in the sense that half the observations are smaller (or equal) and half the observations are larger (or equal). To find the median:
1) Arrange the observations in increasing order.
2) If the number of observations n is odd, the median is the middle observation: observation number (n + 1)/2.
3) If the number of observations n is even, the median is the average of the middle two observations.

Example We use the 15 annual incomes from the previous example.
In increasing order they are:

4, 25, 30, 30, 30, 31, 32, 35, 50, 50, 50, 55, 60, 74, 110

Since there are 15 of them, we take observation number \( \frac{15 + 1}{2} = 8 \). Thus M = 35.

If we leave out the largest observation, then there are 14 observations. We would average the 7th and 8th observations to get \( M = \frac{32 + 35}{2} = 33.5 \).

Eliminating outliers normally has less effect on the median than on the mean.

Example If, in a group of ten houses, one costs a million dollars and the rest cost ten thousand dollars each, leaving out the million-dollar house radically changes the mean but not the median.

Comparing the Mean and Median

If the data distribution is perfectly symmetric, the mean and median are identical. When the distribution is skewed right, the mean is to the right of the median, and when it is skewed left, the mean is to the left of the median. For a skewed distribution, it is useful to know both the mean and the median.

Measuring Spread: Quartiles

The simplest way to describe the spread of the data is to give the maximum (largest) and minimum (smallest) values. Since these may be outliers, a better measure is to note the smallest and largest values of the middle half of the data.

Definition After the data have been arranged in increasing order and the median M located, the first quartile Q1 is the median of the data to the left of M, and the third quartile Q3 is the median of the data to the right of M.

Question: what would Q2 represent?

Example We use the 15 annual incomes (in thousands of dollars) from the previous examples. As we have seen, the median is 35. They are:

4, 25, 30, 30, 30, 31, 32, 35, 50, 50, 50, 55, 60, 74, 110

For the first quartile, we take the median of the numbers 4, 25, 30, 30, 30, 31, 32, which is 30. For the third quartile, we take the median of the numbers 50, 50, 50, 55, 60, 74, 110, which is 55.
If we leave out the outlier 110, then the median is \( M = \frac{32 + 35}{2} = 33.5 \), the first quartile is the median of 4, 25, 30, 30, 30, 31, 32, which again is 30, and the third quartile is the median of 35, 50, 50, 50, 55, 60, 74, which is 50.

Suppose we omit both outliers, 4 and 110. The median is still 35, but

\[ Q_1 = \frac{30 + 30}{2} = 30 \quad \text{and} \quad Q_3 = \frac{50 + 55}{2} = 52.5 . \]

Some computer systems use a slightly different algorithm for quartiles, so results can vary a little.

The Five-Number Summary and Boxplots

A simple summary of the spread of a distribution is the five-number summary, and its graphic representation, the boxplot (or box-and-whiskers plot).

Definition The five-number summary of a data set is the minimum value, first quartile, median, third quartile, and maximum value.

Example The five-number summary for the 15 annual incomes of the previous examples is: min = 4, Q1 = 30, M = 35, Q3 = 55, max = 110.

Definition A boxplot is a graph of the five-number summary where
a) A central box spans Q1 and Q3.
b) A line in the box marks M.
c) Lines (whiskers) extend from the box to mark the smallest and largest observations.

Example Here are boxplots comparing the travel times to work of samples of workers in North Carolina and in New York. (Figure 2.1 in the textbook)

Spotting Suspected Outliers

Looking at the boxplot of travel times to work for New York, the smallest and largest observations are extreme and don't describe the spread of the majority of observations. A better description is the interquartile range.

Definition The interquartile range IQR is the distance between the first and third quartiles: IQR = Q3 − Q1.

We can use this to develop a rule of thumb for spotting outliers.

The 1.5 × IQR Rule for Outliers

Call an observation a suspected outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.
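The quartile definition, five-number summary, and 1.5 × IQR rule above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original notes: the function names (`median`, `five_number_summary`, `suspected_outliers`) are made up for this example, and the quartiles follow the "median of each half" method described here, which may differ slightly from what a statistics package reports.

```python
def median(sorted_values):
    """Median of an already sorted, non-empty list (rule from the notes)."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the middle observation, number (n + 1)/2 in 1-based counting.
        return sorted_values[mid]
    # Even n: the average of the two middle observations.
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, M, Q3, max) using the median-of-halves quartiles."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]  # observations to the left of M
    # For odd n the median itself is excluded from both halves.
    upper = data[half + 1:] if n % 2 == 1 else data[half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])


def suspected_outliers(values):
    """Observations more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]


incomes = [4, 25, 30, 30, 30, 31, 32, 35, 50, 50, 50, 55, 60, 74, 110]
print(five_number_summary(incomes))  # (4, 30, 35, 55, 110)
```

Calling `suspected_outliers(incomes)` applies the rule to the income data; it is left uncalled here so the result can be checked against a hand calculation.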
Exercise Make a boxplot for the 15 annual incomes (in thousands):

4, 25, 30, 30, 30, 31, 32, 35, 50, 50, 50, 55, 60, 74, 110

Apply the 1.5 × IQR rule to this data.

Measuring Spread: Standard Deviation

The most common measure of the spread of a set of data is the variance, together with its square root, the standard deviation. The standard deviation of a sample is designated s, and is used to make inferences about the standard deviation of a population, which is designated \( \sigma \).

Definition The variance of a set of observations is an "average" of the squared deviations of the observations from their mean. If the n observations are designated \( x_1, x_2, \ldots, x_n \), then the variance is

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1} . \]

The standard deviation is \( s = \sqrt{s^2} \).

Exercise A medical student takes her resting pulse each night after eating. One week's worth of observations, in beats per minute, are: 62, 57, 53, 69, 60, 61, and 58. Find the variance and standard deviation of these observations.

Choosing Measures of Center and Spread

The five-number summary is better for a skewed distribution, or one having strong outliers. The mean and standard deviation are used for symmetric distributions.

Using Technology

A calculator with "two-variable statistics" functions will do basic calculations, but more elaborate tools are useful. Here are displays of some of these tools. Can you identify the outputs?

Organizing a Statistical Problem

There is a four-step process for organizing a statistical problem:
State: What is the practical question, in the context of the real-world problem?
Formulate: What specific statistical operations does the problem call for?
Solve: Make the graphs and carry out the calculations needed for the problem.
Conclude: Give your practical conclusion in the setting of the real-world problem.
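As one possible way to carry out the calculations in the Solve step, the mean, variance, and standard deviation defined earlier in this chapter can be computed with a short script. The following is a minimal Python sketch, not part of the original notes; the function names are chosen only for illustration, and the variance uses the n − 1 divisor from the definition above.

```python
import math


def sample_mean(values):
    """x-bar: the sum of the observations divided by their number."""
    return sum(values) / len(values)


def sample_variance(values):
    """s^2: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sample_mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_std_dev(values):
    """s: the square root of the sample variance."""
    return math.sqrt(sample_variance(values))


# The 15 annual incomes (in thousands of dollars) from the examples.
incomes = [4, 25, 30, 30, 30, 31, 32, 35, 50, 50, 50, 55, 60, 74, 110]
print(sample_mean(incomes))  # 44.4
```

The same functions can be used to check a hand calculation for the resting-pulse exercise.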
 Fall '08
 BLOCKNACK
 Dr. Berg
