M316 Chapter 7 - M316 Chapter 7 Dr. Berg Exploring Data:...


M316 Chapter 7 Dr. Berg

Exploring Data: Part I Review

Part I Summary: Now we try to put together all we have learned in Part I.

A: Data
1. Identify the individuals and variables in a set of data.
2. Identify each variable as categorical or quantitative. Identify the units in which each quantitative variable is measured.
3. Identify the explanatory and response variables in situations where one variable explains or influences another.

B: Displaying Distributions
1. Recognize when a pie chart can and cannot be used.
2. Make a bar graph of the distribution of a categorical variable, or in general to compare related quantities.
3. Interpret pie charts and bar graphs.
4. Make a time plot of a quantitative variable over time. Recognize patterns such as trends and cycles in time plots.
5. Make a histogram of the distribution of a quantitative variable.
6. Make a stemplot of the distribution of a small set of observations. Round leaves or split stems as needed to make an effective stemplot.

C: Describing Distributions (Quantitative Variable)
1. Look for the overall pattern and for major deviations from the pattern.
2. Assess from a histogram or stemplot whether the shape of a distribution is roughly symmetric, distinctly skewed, or neither. Assess whether the distribution has one or more major peaks.
3. Describe the overall pattern by giving numerical measures of center and spread in addition to a verbal description of shape.
4. Decide which measures of center and spread are more appropriate: the mean and standard deviation (especially for symmetric distributions) or the five-number summary (especially for skewed distributions).
5. Recognize outliers and give plausible explanations for them.

D: Numerical Summaries of Distributions
1. Find the median M and the quartiles Q1 and Q3 for a set of observations.
2. Find the five-number summary and draw a boxplot; assess center, spread, symmetry, and skewness from a boxplot.
3. Find the mean x̄ and the standard deviation s for a set of observations.
4. Understand that the median is more resistant than the mean. Recognize that skewness in a distribution moves the mean away from the median toward the long tail.
5. Know the basic properties of the standard deviation: s ≥ 0 always; s = 0 only when all observations are identical, and s increases as the spread increases; s has the same units as the original measurements; s is pulled up strongly by outliers or skewness.

E: Density Curves and Normal Distributions
1. Know that areas under a density curve represent proportions of all observations and that the total area under a density curve is 1.
2. Approximately locate the median (equal-areas point) and the mean (balance point) on a density curve.
3. Know that the mean and median both lie at the center of a symmetric density curve and that the mean moves farther toward the long tail of a skewed curve.
4. Recognize the shape of Normal curves and estimate by eye both the mean and standard deviation from such a curve.
5. Use the 68-95-99.7 rule and symmetry to state what percent of the observations from a Normal distribution fall between two points when both points lie at the mean or one, two, or three standard deviations on either side of the mean.
6. Find the standardized value (z-score) of an observation. Interpret z-scores and understand that any Normal distribution becomes standard Normal N(0, 1) when standardized.
7. Given that a variable has a Normal distribution with a stated mean μ and standard deviation σ, calculate the proportion of values above a stated number, below a stated number, or between two stated numbers.
8. Given that a variable has a Normal distribution with a stated mean μ and standard deviation σ, calculate the point having a stated proportion of all values above it or below it.

F: Scatterplots and Correlation
1. Make a scatterplot to display the relationship between two quantitative variables measured on the same subjects. Place the explanatory variable (if any) on the horizontal scale of the plot.
2. Add a categorical variable to a scatterplot by using a different plotting symbol or color.
3. Describe the direction, form, and strength of the overall pattern of a scatterplot. In particular, recognize positive or negative association and linear (straight-line) patterns. Recognize outliers in a scatterplot.
4. Judge whether it is appropriate to use correlation to describe the relationship between two quantitative variables. Find the correlation r.
5. Know the basic properties of correlation: r measures the direction and strength of only straight-line relationships; r is always a number between -1 and 1; r = ±1 only for perfect straight-line relationships; r moves away from 0 toward ±1 as the straight-line relationship gets stronger.

G: Regression Lines
1. Understand that regression requires an explanatory variable and a response variable. Use a calculator or software to find the least-squares regression line of a response variable y on an explanatory variable x from data.
2. Explain what the slope b and the intercept a mean in the equation ŷ = a + bx of a regression line.
3. Draw a graph of a regression line when you are given its equation.
4. Use a regression line to predict y for a given x. Recognize extrapolation and be aware of its dangers.
5. Find the slope and intercept of the least-squares regression line from the means and standard deviations of x and y and their correlation.
6. Use r², the square of the correlation, to describe how much of the variation in one variable can be accounted for by a straight-line relationship with another variable.
7. Recognize outliers and potentially influential observations from a scatterplot with the regression line drawn on it.
8. Calculate the residuals and plot them against the explanatory variable x. Recognize that a residual plot magnifies the pattern of the scatterplot of y versus x.
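
Items G.5, G.6, and G.8 can be illustrated with a short calculation. The sketch below assumes the standard textbook relations b = r·s_y/s_x and a = ȳ - b·x̄ for the least-squares line ŷ = a + bx. The data values and the function names (`correlation`, `least_squares_line`) are hypothetical, chosen only for illustration, and do not come from the handout.

```python
from statistics import mean, stdev

def correlation(x, y):
    """Sample correlation r: sum of products of standardized values over n - 1."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    total = sum(((xi - xbar) / sx) * ((yi - ybar) / sy) for xi, yi in zip(x, y))
    return total / (n - 1)

def least_squares_line(x, y):
    """Intercept a and slope b from the means, standard deviations, and r."""
    r = correlation(x, y)
    b = r * stdev(y) / stdev(x)   # slope
    a = mean(y) - b * mean(x)     # intercept: the line passes through (xbar, ybar)
    return a, b, r

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b, r = least_squares_line(x, y)
r_squared = r ** 2                # fraction of variation in y explained by x

# Residual = observed y minus predicted y; these are what a residual plot shows.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(f"y-hat = {a:.3f} + {b:.3f} x, r = {r:.4f}, r^2 = {r_squared:.4f}")
print("residuals:", [round(e, 3) for e in residuals])
```

The residuals from a least-squares line always sum to zero (up to rounding), which is a quick check that the slope and intercept were computed correctly.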

H: Cautions About Correlation and Regression
1. Understand that both r and the least-squares line can be strongly influenced by a few extreme observations.
2. Recognize possible lurking variables that may explain the observed association between two variables x and y.
3. Understand that even a strong correlation does not mean that there is a cause-and-effect relationship between x and y.
4. Give plausible explanations for an observed association between two variables: direct cause and effect, the influence of lurking variables, or both.

I: Categorical Data
1. From a two-way table of counts, find the marginal distributions of both variables by obtaining the row sums and column sums.
2. Express any distribution in percents by dividing the category counts by their total.
3. Describe the relationship between two categorical variables by computing and comparing percents. Often this involves comparing the conditional distributions of one variable for the different categories of the other variable.
4. Recognize Simpson's paradox and be able to explain it.

Review Exercises

7.4 What We Watch
Here are data on movie studio income in 2004, in billions of dollars.

Source       Income
Theaters        7.4
Video/DVD      20.9
Pay TV          4.0
Free TV        12.6

Make a graph that compares these amounts. What percent of studio income comes from theater showings of movies?

7.7 Returns on Stocks Are Not Normal
The 99.7 part of the 68-95-99.7 rule says that in practice Normal distributions are about 6 standard deviations wide. Here are the returns of the S&P 500 for the years 1972 to 2004.

Year   Return     Year   Return     Year   Return
1972   15.07      1983   18.075     1994   -1.316
1973  -21.522     1984    2.253     1995   34.167
1974  -34.54      1985   26.896     1996   19.008
1975   28.353     1986   17.39      1997   31.138
1976   18.177     1987    0.783     1998   26.534
1977  -12.992     1988   11.677     1999   17.881
1978   -2.264     1989   25.821     2000  -12.082
1979    4.682     1990   -8.679     2001  -13.23
1980   17.797     1991   26.594     2002  -23.909
1981  -12.71      1992    4.584     2003   26.311
1982   17.033     1993    7.127     2004    7.37

Find the mean and standard deviation of the real returns. What are the values three standard deviations above and below the mean? How do these compare to the actual maximum and minimum?

7.10 Hot Mutual Funds?
Investment advertisements always warn that "past performance does not guarantee future results." This table gives the percent returns from 23 Fidelity Investments "sector funds" in 2002 (a down year) and 2003 (an up year). These often rise and fall faster than the market as a whole.

 2002   2003      2002   2003      2002   2003
-17.1   23.9      -0.7   36.9     -37.8   59.4
 -6.7   14.1      -5.6   27.5     -11.5   22.9
-21.1   41.8     -26.9   26.1      -0.7    7.6
-12.8   43.9     -42     62.7      64.3   32.1
-18.9   31.1     -47.8   68.1      -9.6   28.7
 -7.7   32.3     -50.5   71.9     -11.7   29.5
-17.2   36.5     -49.5   57        -2.3   19.1
-11.4   30.6     -23.4   35

a) Make a scatterplot of 2003 return (response) against 2002 return (explanatory). The funds with the best performance in 2002 tend to have the worst performance in 2003. Fidelity Gold Fund, the only fund with a positive return in both years, is an extreme outlier.
b) To demonstrate that correlation is not resistant, find r for all 23 funds and then find r for the 22 funds other than Gold. Explain from Gold's position in the plot why omitting this point makes r more negative.

7.12 More on Hot Funds
a) Find the equations of two least-squares lines for predicting 2003 returns from 2002 returns, one for all 23 funds and one omitting Fidelity Gold Fund. Make a scatterplot with both lines drawn on it. The two lines are very different.
b) Starting with the least-squares idea, explain why adding Fidelity Gold Fund to the other funds moves the line in the direction the graph shows.

7.17 The Mississippi River
Table 7.1 gives the volume of water discharged by the Mississippi River into the Gulf of Mexico for each year from 1954 to 2001.
a) Make a graph of the distribution of water volume. Describe the overall shape and any outliers.
b) Do you expect the mean to be close to the median?
c) Are the mean and standard deviation adequate descriptions of the distribution? Find the five-number summary.

7.18 More on the Mississippi River
Make a time plot of the data in Table 7.1. What does the time plot reveal that the histogram does not?

Solutions
...
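
The numerical parts of Exercises 7.4, 7.7, and 7.10 can be sketched in a few lines. This is an illustrative calculation, not the course's solution key. It uses only the data printed in the exercises and assumes the fund table is read in column groups, so that each 2002 return is paired with the 2003 return beside it and Fidelity Gold Fund is the pair (64.3, 32.1), the only one positive in both years. The helper name `pearson_r` is hypothetical.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Sample correlation r, using the n - 1 divisor."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    num = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    return num / ((n - 1) * stdev(x) * stdev(y))

# Exercise 7.4: share of studio income from theaters (billions of dollars).
income = {"Theaters": 7.4, "Video/DVD": 20.9, "Pay TV": 4.0, "Free TV": 12.6}
theater_pct = 100 * income["Theaters"] / sum(income.values())

# Exercise 7.7: S&P 500 returns, 1972 to 2004, in the order given.
returns = [
    15.07, -21.522, -34.54, 28.353, 18.177, -12.992, -2.264, 4.682,
    17.797, -12.71, 17.033, 18.075, 2.253, 26.896, 17.39, 0.783,
    11.677, 25.821, -8.679, 26.594, 4.584, 7.127, -1.316, 34.167,
    19.008, 31.138, 26.534, 17.881, -12.082, -13.23, -23.909, 26.311, 7.37,
]
ret_mean, ret_sd = mean(returns), stdev(returns)
low, high = ret_mean - 3 * ret_sd, ret_mean + 3 * ret_sd

# Exercise 7.10: (2002 return, 2003 return) pairs; the pairing is an assumption
# about how the flattened table should be read.
funds = [
    (-17.1, 23.9), (-6.7, 14.1), (-21.1, 41.8), (-12.8, 43.9),
    (-18.9, 31.1), (-7.7, 32.3), (-17.2, 36.5), (-11.4, 30.6),
    (-0.7, 36.9), (-5.6, 27.5), (-26.9, 26.1), (-42.0, 62.7),
    (-47.8, 68.1), (-50.5, 71.9), (-49.5, 57.0), (-23.4, 35.0),
    (-37.8, 59.4), (-11.5, 22.9), (-0.7, 7.6), (64.3, 32.1),
    (-9.6, 28.7), (-11.7, 29.5), (-2.3, 19.1),
]
gold = (64.3, 32.1)
others = [p for p in funds if p != gold]

r_all = pearson_r([p[0] for p in funds], [p[1] for p in funds])
r_without_gold = pearson_r([p[0] for p in others], [p[1] for p in others])

print(f"7.4: theaters = {theater_pct:.1f}% of income")
print(f"7.7: mean = {ret_mean:.2f}, s = {ret_sd:.2f}, "
      f"mean +/- 3s = ({low:.2f}, {high:.2f}), "
      f"actual range = ({min(returns)}, {max(returns)})")
print(f"7.10: r (all 23) = {r_all:.3f}, r (without Gold) = {r_without_gold:.3f}")
```

Running the sketch shows the actual minimum and maximum returns lying well inside the interval from three standard deviations below the mean to three above it, and r becoming more negative when Gold is dropped, which is the behavior the exercises ask you to explain.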

This note was uploaded on 09/14/2009 for the course CH 310 N taught by Professor Blocknack during the Fall '08 term at University of Texas at Austin.
