Chapter 4: Moments

Intuitively, moments are mathematical expressions that characterize important features of a data set (or “sample”). Many of these features are visible in the overall “shape” of a histogram.

We will investigate the first four moments of a sample (and two standardizations):

1. Mean: the sample’s average value
2. Variance: its degree of spread
   1. Standard deviation
3. Skew: its degree of asymmetry
   1. Standardized skew
4. Kurtosis: its degree of peakedness
   1. Standardized kurtosis
First Moment: the Mean

The mean of a data set is just the average value of the set. For example, the mean of {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} is:

$$\frac{1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10}{10} = \frac{55}{10} = 5.5$$

The mean is a measure of “central tendency”, i.e., it is one measure of the kind of value that the data tend to have.

The mean has two symbols:

1. $m'_1$, which says that the mean is the first moment about 0.
2. $\bar{x}$, which is the notation for the sample average (or $\bar{y}$, or $\bar{z}$, or any other lowercase roman letter with a bar over it).
Here is how I want you to calculate the mean for a data set $\{x_1, \dots, x_n\}$:

1. Add up all the numbers: $x_1 + \dots + x_n = \text{Total}$
2. Divide the Total by $n$: $\text{Total}/n = m'_1 = \bar{x}$

More generally, we can calculate the mean ($m'_1$) of a data set $\{x_1, \dots, x_N\}$ as:

$$m'_1 = \frac{x_1 + x_2 + \dots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i = \bar{x}$$

Notice that the mean is sensitive to the sizes of the individual data points in a way the median is not.
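As a minimal sketch of the two-step recipe and the summation formula, assuming Python and a hypothetical helper name `first_moment` (neither comes from these notes), the calculation might look like this:

```python
def first_moment(data):
    """Return the mean m'_1 of a non-empty sequence of numbers."""
    n = len(data)
    if n == 0:
        raise ValueError("the mean of an empty data set is undefined")
    total = sum(data)   # step 1: add up all the numbers
    return total / n    # step 2: divide the total by n


# The example data set used earlier: {1, 2, ..., 10}
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(first_moment(sample))  # 5.5
```

The empty-set check is a design choice of this sketch: dividing by $N = 0$ has no meaning, so it seems safer to fail loudly than to return a placeholder value.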
ORANGE COUNTY, 2005

Household income                     Estimate    Margin of error
Total households                     969,916     +/-4,402
Less than \$10,000                   45,016      +/-3,233
\$10,000 to \$14,999                 34,180      +/-3,012
\$15,000 to \$24,999                 71,467      +/-4,720
\$25,000 to \$34,999                 83,803      +/-4,948
\$35,000 to \$49,999                 124,419     +/-6,518
\$50,000 to \$74,999                 185,235     +/-6,895
\$75,000 to \$99,999                 134,953     +/-6,313
\$100,000 to \$149,999               159,996     +/-6,086
\$150,000 to \$199,999               63,118      +/-4,025
\$200,000 or more                    67,729      +/-3,930
Median household income (dollars)    65,953      +/-1,107
Mean household income (dollars)      88,648      +/-1,552

Grouped Data

Frequently, especially with data from large surveys, the data are not reported exactly but are grouped into categories that give simplified estimates. The mean of each category can then be estimated by the cell mark, which is the midpoint between the category’s two boundaries.

What assumption does this strategy tacitly make?
Income Bracket (K)    Cell Mark (K)    # of Households
< 10                  5                39,022
10-15                 12.5             32,461
15-25                 20               68,436
25-35                 30               76,620
35-50                 42.5             115,896
50-75                 62.5             173,926
75-100                87.5             133,534
100-150               125              173,339
150-200               175              76,388
> 200                 ????             82,418
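One way to turn the cell marks into an estimate of the overall mean is to weight each cell mark by the number of households in its bracket. The sketch below assumes Python and a hypothetical helper `grouped_mean`. Because the "> 200" bracket has no upper boundary, its cell mark is left as an explicit assumption: the two values tried (250 and 400) are illustrative guesses only, not figures from the table.

```python
def grouped_mean(cell_marks, counts):
    """Estimate the mean of grouped data: each category contributes
    its cell mark, weighted by the number of observations in it."""
    total_count = sum(counts)
    weighted_total = sum(mark * count for mark, count in zip(cell_marks, counts))
    return weighted_total / total_count


# Cell marks (in thousands of dollars) and household counts for the
# closed brackets in the table above.
marks = [5, 12.5, 20, 30, 42.5, 62.5, 87.5, 125, 175]
households = [39_022, 32_461, 68_436, 76_620, 115_896,
              173_926, 133_534, 173_339, 76_388]
top_bracket_households = 82_418

# Hypothetical cell marks for the open-ended "> 200" bracket, used only
# to show how strongly the estimate depends on that choice.
for assumed_top_mark in (250, 400):
    estimate = grouped_mean(marks + [assumed_top_mark],
                            households + [top_bracket_households])
    print(f"top cell mark {assumed_top_mark}K -> estimated mean {estimate:.1f}K")
```

Running the loop with different guesses makes the open question concrete: the estimate moves with whatever value is assumed for the top bracket, so that assumption should be stated whenever a grouped mean is reported.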

Uses of the mean:

1. To compare class averages on a test. Did one class, on average, do better than the other?
2. To summarize the “central tendency” of a group of data (household incomes, weights of neonates, length of insect copulation, etc.)

Unlike the median, the mean is not robust, i.e., adding just one data point can change it arbitrarily much.
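The lack of robustness can be seen in a small experiment. This sketch assumes Python's standard `statistics` module; the data and the outlier value of 1,000,000 are made up for illustration.

```python
from statistics import mean, median

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(mean(scores), median(scores))  # 5.5 5.5

# Add a single extreme data point.
with_outlier = scores + [1_000_000]
print(mean(with_outlier))    # now far above every original value
print(median(with_outlier))  # 6
```

Making the outlier larger pushes the mean as high as you like, while the median stays at 6.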
Second Moment: Variance

The second moment ($m_2$) of a data set measures how “spread out” the data are.

We formally define $m_2$, the second moment of a given data set $\{x_1, \dots, x_N\}$, as:

$$m_2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}$$
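A minimal sketch of this definition, assuming Python and a hypothetical helper name `second_moment`, is given below. Note that the divisor is $N$, so the result should agree with `statistics.pvariance` from the standard library, not with `statistics.variance`, which divides by $N - 1$.

```python
from statistics import pvariance


def second_moment(data):
    """Return m_2: the average squared deviation from the mean."""
    n = len(data)
    if n == 0:
        raise ValueError("the second moment of an empty data set is undefined")
    x_bar = sum(data) / n                              # the mean
    return sum((x - x_bar) ** 2 for x in data) / n     # divide by N


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(second_moment(sample))  # 8.25
print(pvariance(sample))      # 8.25
```

For this sample the mean is 5.5, the squared deviations sum to 82.5, and dividing by $N = 10$ gives 8.25.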

