chap12

MATH 241, Fall 2008
University of Montana

12 Chapter Notes (Sample Surveys)

In everything we have done thus far, the data were given, and the subsequent analysis was exploratory in nature. This type of statistical analysis is known as exploratory data analysis (EDA). Here, and in the next chapter, we will study techniques for producing or collecting data to answer specific questions. We will see that methods of collecting a sample of data in an unbiased fashion hinge on the idea of randomness.

Three Keys to a Good Sample

1. Sampling: In most problems, we are interested in learning something about a population of individuals. However, the population is often too large or too difficult to examine completely, so we take a sample of individuals from the population, which we hope is representative of the population as a whole.

Examples:
   What is the average size of Ponderosa pines in a certain area?
   What proportion of Missoula residents have served on a jury?
   What proportion of American adults approve of the recently approved bail-out plan?
   What proportion are satisfied with the state of the country?

Polls, such as this last example, are known as sample surveys.

The most important aspect of a sample, no matter how the sample is taken, is that it is representative of the population from which it comes. If the sample is not representative of the population, we say it is biased. A biased sample is a useless sample! Bias arises in many ways. Consider the following examples, which illustrate poor sampling techniques leading to biased samples. What's wrong with these samples?

Examples:

(a) Suppose CNN takes a poll where they ask viewers to call in and state whether or not they are happy in their marriage. 90% of the call-ins say they are unhappy in their marriage. Do we conclude that, among the US population, 90% of all married people are unhappy in their marriage?

(b) A university instructor wants to know how students feel about their statistics course.
As students come to her office hours, she asks them to answer a few questions about the course. How accurate will the information gathered be?

(c) A survey was given to UM students regarding their opinions on possible new businesses to open in the UC. The survey was administered to any student willing to fill it out. Would the responses received be accurate or somehow biased?

(d) Historical Mishaps:
   1936
   1948
This case illustrates a poor sampling technique that led to a sample not being representative of the intended population.

In addition to the problems raised in the examples above (voluntary response, interviewer bias, convenience sampling), another common problem is nonresponse. People are sometimes difficult to locate or simply refuse to cooperate. How many of you have thrown out a mail survey or hung up the phone on someone who wants to ask you a few questions? If nonrespondents would have answered differently from those who did respond, this can introduce bias in your results. How can we protect against unseen sources of bias?

(e) Wording of Questions: Don't you agree that social workers should earn more money than they currently earn?

2. Randomization: The key to avoiding the introduction of bias in a sample is the use of randomization in selecting which population units will comprise the sample.

Examples:

(a) Reconsider the UC survey on new businesses. Although we might think that simply having students fill out the survey voluntarily is just as good as sampling students at random, can you think of sources of bias that might result?

(b) Suppose a biologist captures and radio tags 50 cutthroat trout in the Rock Creek drainage to study the types of habitats in which these fish live. Do you think these 50 fish are representative of all cutthroat in terms of their habitat?

Random selection of units to comprise a sample from the population protects against a particular type of bias known as selection bias.
Such bias is the result of important but unrealized differences in the units of the population relative to what you are measuring. This is one of the startling truths about sampling: the introduction of randomness in selection is what actually allows us to draw accurate conclusions about the population.

3. Sample Size: The fundamental question when planning a study is: How large a sample do we need for the sample to be representative of the population? Although you might be tempted to think we should take a certain fraction or percentage of the population, it turns out that the size of the population (as long as it's large) is unimportant. In other words, a sample of 100 Missoula residents will be about as representative of the Missoula population as a sample of 100 US residents is of the entire US population.

If the sample consists of the entire population, it is called a census. What problems might we encounter in trying to take a census?
(a)
(b)
(c)

Parameters and Statistics: Typically, the purpose behind taking a sample is to gain information about some aspect of the population as a whole. In particular, we are often interested in estimating the mean or standard deviation of some variable, or the proportion of population units with some characteristic. For example, we might want to estimate:
   the average energy bill for Missoula residents, or
   the proportion of Americans currently unemployed, or
   the mean annual income of Montana residents.
These unknown population quantities are called parameters. The point of taking a sample is to estimate these unknown parameters with statistics computed from the sample (e.g., we might use the sample mean energy bill ȳ from a sample of 30 Missoula households to estimate the true but unknown average energy bill of Missoula residents).

Notation: Common notation used is summarized below.
Name                     Statistic (Sample)    Parameter (Population)
Mean                     ȳ                     μ (pronounced mu)
Standard Deviation       s                     σ (pronounced sigma)
Proportion               p̂                     p
Correlation              r                     ρ (pronounced rho)
Regression coefficient   b                     β (pronounced beta)

Other Important Sampling Terminology

Sampling Unit: The sampling unit is the basic unit on which we will measure the variables of interest (one value per unit); units might be people, households, animals, plots of land, etc.

Sampling Frame: The sampling frame is a list of individual units from which the sample is chosen. This will not always be the same as the population of interest. Examples?

Sampling Variability: This is the notion that every time we take a sample from some population, we will generally not get the same answer. Consider taking a sample of size n = 10 from this class to estimate the average maximum distance traveled by foot in a day by a 241 student.

                    Sample 1    Sample 2    Sample 3
Average Distance    ________    ________    ________

If we were to repeat this many times, we would have several distance averages, hopefully distributed around the true average. The variability in these averages is what we mean by sampling variability. The distribution of these averages is known as the sampling distribution of the mean maximum distance traveled in a sample of size 10.

Identify each of the following as a parameter or a statistic, and give the symbol used to represent it.
   In a sample of 2290 U.S. voters, 65% claim they will be voting for a certain Presidential candidate on election day.
   The proportion of all U.S. voters voting in the last election that voted for Barack Obama.
   51% of all U.S. babies are boys.
   The proportion of all UM campus computers with working versions of SPSS.
   The 15% of US Senators who are women.
   The standard deviation of monthly incomes for 50 Missoula residents.
   The proportion of students at the University of Montana who participate in school-sponsored athletics.
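
The idea of sampling variability described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population of 200 walking distances is made up for illustration (it is not class data), and the helper name sample_means is hypothetical. It draws many samples of size n = 10 and shows that the sample means differ from sample to sample while clustering around the population mean.

```python
import random
import statistics


def sample_means(population, n, repeats, seed=0):
    """Return the mean of each of `repeats` simple random samples of size n."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]


# ASSUMPTION: a made-up population of maximum daily walking distances (miles)
# for 200 students; these values are not from the course.
pop_rng = random.Random(241)
population = [round(pop_rng.uniform(0.5, 6.0), 1) for _ in range(200)]

# Repeat the "sample of size 10" experiment 1000 times.
means = sample_means(population, n=10, repeats=1000)

print("population mean (the parameter):", round(statistics.mean(population), 2))
print("first three sample means (statistics):", [round(m, 2) for m in means[:3]])
print("SD of the 1000 sample means:", round(statistics.stdev(means), 2))
```

The list means is an approximation to the sampling distribution of the mean for samples of size 10; its spread is what the notes call sampling variability.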
Now that we know all the terminology, let's consider some basic sampling methods which rely on randomness to select a representative sample from the population.

1. Simple Random Sample (SRS): Consider selecting a sample of size n. If this sample is drawn so that every possible sample of size n has the same chance of being selected, it is said to be a simple random sample (SRS).

Example: Suppose we have a piece of land and we want to estimate the volume of timber or the number of woodpecker nests on the piece of land. A census might be too costly. One simple way to take a sample might be to divide the area into equal-sized blocks as shown below. The blocks should be small enough to survey reliably. Suppose the area is divided into 36 blocks and we've decided to survey a sample of 9 blocks. To select an SRS, label the blocks in any order. Go to the random number table and select a row at random to generate the sample. For example, suppose we choose row 7:

   73184 95907 05179 51002 83374 52297 07769 99792 78365 93487

Starting at row 7, select an SRS of 9 plots, and mark them on the grid below.

   01 02 03 04 05 06
   07 08 09 10 11 12
   13 14 15 16 17 18
   19 20 21 22 23 24
   25 26 27 28 29 30
   31 32 33 34 35 36

Do the plots selected look random? In an SRS, every combination of 9 blocks has the same probability of being selected. Selecting an SRS does not guarantee that the particular sample selected is perfectly representative of the population. It is not the sample you select which is unbiased; it's the procedure by which the sample is selected which is unbiased.

If we were selecting an SRS from an alphabetical list of 36 people, we probably wouldn't worry that the names weren't evenly distributed through the list, since we have no reason to believe that the variable being measured (e.g., their opinion on some issue) is associated with their position on the list.
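
The random number table procedure for the 36-block example can be sketched in code. This sketch assumes the usual convention of reading the row two digits at a time, keeping labels 01 to 36, and skipping out-of-range values and repeats; the notes do not spell out that convention, and the function name srs_from_digits is hypothetical.

```python
def srs_from_digits(row, n_units, sample_size, width=2):
    """Select unit labels by reading a row of random digits `width` at a time.

    ASSUMPTION: labels run from 1 to n_units; values outside that range and
    labels already chosen are skipped. Returns fewer than sample_size labels
    if the row runs out of digits.
    """
    digits = row.replace(" ", "")
    chosen = []
    for i in range(0, len(digits) - width + 1, width):
        label = int(digits[i:i + width])
        if 1 <= label <= n_units and label not in chosen:
            chosen.append(label)
        if len(chosen) == sample_size:
            break
    return chosen


# Row 7 of the random number table, as given in the notes.
ROW_7 = "73184 95907 05179 51002 83374 52297 07769 99792 78365 93487"

blocks = srs_from_digits(ROW_7, n_units=36, sample_size=9)
print(blocks)  # [18, 7, 5, 17, 10, 2, 22, 36, 34]
```

With software, a call such as random.sample(range(1, 37), 9) from Python's standard library plays the role of the table, since it draws 9 distinct labels with every subset equally likely.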
However, in this example, we might know that there is geographic variation across the area (perhaps the left side is at a higher elevation than the right side). If this were true, we can use this extra information to ensure a more geographically representative sample by taking a stratified random sample of plots.

2. Stratified Random Sample: Suppose we divide the area into 3 rectangular subareas (from left to right with the elevation gradient) each containing 12 plots. Then take a separate SRS of size 3 within each subarea (using different random numbers for each subarea).

   01 07 | 01 07 | 01 07
   02 08 | 02 08 | 02 08
   03 09 | 03 09 | 03 09
   04 10 | 04 10 | 04 10
   05 11 | 05 11 | 05 11
   06 12 | 06 12 | 06 12
          Elevation

This still gives a sample of size 9 as before, but under this plan, the sample taken is more equally representative of the varying elevations in the area. Note that we only need to label the individuals within each stratum.

Use row 29 from the random # table to select a stratified random sample, starting in the left stratum and proceeding to the right:

   72042 12287 21081 48426 44321 58765

Select plots 01-12 according to:

   Plot   Random Numbers      Plot   Random Numbers
   1      00-07               7      48-55
   2 3 4 5 6 ...

Ignore the values 96, 97, 98, 99. Why do we do this?

In what situations does stratified random sampling work best versus simple random sampling?
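
The stratified selection with row 29 can be sketched in code as well. Several things here are assumptions rather than statements from the notes: the full mapping is inferred from the two visible table entries (00-07 for plot 1 and 48-55 for plot 7), which suggests eight consecutive two-digit values per plot; the digits are read two at a time; repeats within a stratum are skipped; and reading simply continues along the row from one stratum to the next. The function name stratified_from_digits is hypothetical.

```python
def stratified_from_digits(row, n_strata=3, per_stratum=3, plots=12, width=2):
    """Pick `per_stratum` plots in each stratum from one row of random digits.

    ASSUMPTION: two-digit values 00-95 map to plots 1-12 in blocks of eight
    (00-07 -> 1, 08-15 -> 2, ..., 88-95 -> 12); 96-99 are ignored so that
    every plot corresponds to the same number of two-digit values.
    Strata left incomplete when the digits run out are returned as-is.
    """
    digits = row.replace(" ", "")
    values = [int(digits[i:i + width])
              for i in range(0, len(digits) - width + 1, width)]
    span = 100 // plots      # 8 two-digit values per plot
    limit = span * plots     # 96: values at or above this are ignored
    sample = [[] for _ in range(n_strata)]
    stratum = 0
    for value in values:
        if stratum == n_strata:
            break
        if value >= limit:
            continue
        plot = value // span + 1
        if plot not in sample[stratum]:
            sample[stratum].append(plot)
        if len(sample[stratum]) == per_stratum:
            stratum += 1     # move on to the next stratum to the right
    return sample


# Row 29 of the random number table, as given in the notes.
ROW_29 = "72042 12287 21081 48426 44321 58765"

left, middle, right = stratified_from_digits(ROW_29)
print(left, middle, right)  # [10, 1, 3] [3, 11, 2] [2, 11, 4]
```

Under these assumptions each stratum contributes exactly 3 plots, so the sample of 9 is spread evenly across the elevation gradient.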

