Introduction to Inference

Objectives (BPS chapter 14)
• Estimating with confidence
• Confidence intervals for the mean µ
• The reasoning of tests of significance
• Stating hypotheses
• Test statistics
• P-values
• Statistical significance

Statistical Inference
Provides methods for drawing conclusions about a population from sample data.
• Confidence Intervals: What is the population mean?
• Tests of Significance: Is the population mean larger than __?

Estimating with confidence
Although the sample mean, x-bar, is a unique number for any particular sample, if you pick a different sample, you will probably get a different sample mean. In fact, you could get many different values for the sample mean, and virtually none of them would actually equal the true population mean, µ.

The sampling distribution of x-bar is narrower than the population distribution, by a factor of √n. Thus, the estimates gained from our samples tend to be relatively close to the population parameter µ.

95% of all sample means will be within roughly 2 standard deviations (2σ/√n) of the population parameter µ. Because distances are symmetrical, this implies that the population parameter µ must be within roughly 2 standard deviations of the sample average x-bar in 95% of all samples.

Confidence interval
A level C confidence interval for a parameter has two parts:
• An interval calculated from the data, usually of the form estimate ± margin of error
• A confidence level C, which gives the probability that the interval will capture the true parameter value in repeated samples, or the success rate for the method.

Implications
We don't need to take lots of random samples to "rebuild" the sampling distribution and find µ at its center. All we need is one SRS of size n, relying on the properties of the sampling distribution of the sample mean to infer the population mean µ. With 95% confidence, we can say that µ should be within roughly 2 standard deviations (2σ/√n) of our sample mean x-bar.
• In 95% of all possible samples of this size n, µ will indeed fall in our confidence interval.
• In only 5% of samples would x-bar be farther from µ.

A confidence interval can be expressed as: x-bar ± m
m is called the margin of error.
Two endpoints of an interval: µ possibly within (x-bar − m) to (x-bar + m)

A confidence level C (in %) indicates the success rate of the method that produces the interval. It represents the area under the normal curve within ± m of the center of the curve.

NAEP Quantitative Scores
The NAEP survey includes a short test of quantitative skills, covering mainly basic arithmetic and the ability to apply it to realistic problems. Scores on the test range from 0 to 500, with higher scores indicating greater numerical abilities. It is known that NAEP scores have standard deviation σ = 60. In a recent year, 840 men 21 to 25 years of age were in the NAEP sample. Their mean quantitative score was 272. On the basis of this sample, estimate the mean score µ in the population of all 9.5 million young men of these ages.
1. To estimate the unknown population mean µ, use the sample mean x-bar = 272.
2. The law of large numbers suggests that x-bar will be close to µ, but there will be some error in the estimate.
3. The sampling distribution of x-bar has the Normal distribution with mean µ and standard deviation
$$\frac{\sigma}{\sqrt{n}} = \frac{60}{\sqrt{840}} \approx 2.1$$
The 68-95-99.7 rule indicates that x-bar and µ are within two standard deviations (4.2) of each other in about 95% of all samples. So, if we estimate that µ lies within 4.2 of x-bar, we'll be right about 95% of the time.

Confidence Interval: Mean of a Normal Population
Take an SRS of size n from a Normal population with unknown mean µ and known standard deviation σ. A level C confidence interval for µ is:
$$\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$$
Using the 68-95-99.7 rule gave an approximate 95% confidence interval. A more precise 95% confidence interval can be found using the appropriate value of z* (1.960) with the previous formula:
$$272 \pm 1.960\,(2.1) = 272 \pm 4.116$$
We are 95% confident that the average NAEP quantitative score for all young men aged 21 to 25 is between 267.884 and 276.116.

Careful Interpretation of a Confidence Interval
"We are 95% confident that the mean NAEP score for the population of all young men aged 21 to 25 is between 267.884 and 276.116." (We feel that plausible values for this population's mean NAEP score are between 267.884 and 276.116.)
** This does not mean that 95% of all young men will have NAEP scores between 267.884 and 276.116. **
Statistically: 95% of all samples of size 840 from this population should yield a sample mean within two standard errors of the population mean; i.e., in repeated samples, 95% of the C.I.s should contain the true population mean.

Confidence intervals contain the population mean µ in C% of samples. Different areas under the curve give different confidence levels C. z* is related to the chosen confidence level C: C is the area under the standard normal curve between −z* and z*.

Example: 95% CI
Say x-bar is 90, σ = 10, n = 25.
$$\bar{x} \pm 1.96\left(\frac{\sigma}{\sqrt{n}}\right) = 90 \pm 1.96\left(\frac{10}{\sqrt{25}}\right) = 90 \pm 3.92$$
We are 95% confident that µ is in the interval [86.08, 93.92].
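The interval calculations above can be checked with a short script. This is a minimal sketch, assuming Python 3.8+ and its standard-library `statistics.NormalDist`; the function name `z_confidence_interval` is illustrative and not part of the notes.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(xbar, sigma, n, level=0.95):
    """Return (low, high) for a level-C z interval: xbar ± z* · sigma/√n."""
    # z* leaves area (1 - level)/2 in each tail of the standard Normal curve.
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z_star * sigma / sqrt(n)
    return xbar - margin, xbar + margin


# Worked example from the notes: x-bar = 90, sigma = 10, n = 25.
low, high = z_confidence_interval(90, 10, 25)
print(f"95% CI: [{low:.2f}, {high:.2f}]")

# NAEP example: x-bar = 272, sigma = 60, n = 840.
naep_low, naep_high = z_confidence_interval(272, 60, 840)
print(f"NAEP 95% CI: [{naep_low:.3f}, {naep_high:.3f}]")
```

The first interval rounds to [86.08, 93.92], matching the worked example. For the NAEP data the script uses the unrounded standard deviation 60/√840 ≈ 2.07, so its interval is slightly narrower than 267.884 to 276.116, which comes from the rounded value 2.1.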
Confidence level    z-value
90%                 z* = 1.645
95%                 z* = 1.96
99%                 z* = 2.576

Confidence intervals
A very large school district in Connecticut wants to estimate the average SAT score of this year's graduating class. The district takes a simple random sample of 100 seniors and calculates the 95% confidence interval for the graduating students' average SAT score at 505 to 520 points.

Impact of sample size
The spread in the sampling distribution of the mean is a function of the number of individuals per sample.
• The larger the sample size, the smaller the standard deviation (spread) of the sample mean distribution.
* But the spread only decreases at a rate equal to √n.

Factors affecting the size of the error term:
• The sample size, n
  o (the larger n is, the smaller the error)
• The standard deviation, σ
  o (the smaller σ is, the smaller the error)
• The confidence level
  o (the smaller the level of confidence, the smaller the error)
• Margin of error
  o Increasing the confidence level will increase the margin of error.

Sample size and experimental design
$$m = z^* \frac{\sigma}{\sqrt{n}} \;\Leftrightarrow\; n = \left(\frac{z^* \sigma}{m}\right)^2$$
You may need a certain margin of error (e.g., drug trial, manufacturing specs). In many cases, the population variability (σ) is fixed, but we can choose the number of measurements (n).

Remember, though, that sample size is not always stretchable at will. There are typically costs and constraints associated with large samples. The best approach is to use the smallest sample size that can give you useful results.

Example: finding the sample size
Say σ = 10 is known and you want your margin of error from a 95% CI to be no more than ±3. What size sample do you need?
$$3 = 1.96\left(\frac{10}{\sqrt{n}}\right) \;\Rightarrow\; \sqrt{n} = \frac{19.6}{3} \;\Rightarrow\; n \approx 42.7$$
Rounding up, n = 43.

Hypothesis tests
A test of statistical significance tests a specific hypothesis using sample data to decide on the validity of the hypothesis.
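Returning briefly to the sample-size example (σ = 10, margin of error no more than 3 at 95% confidence), the calculation n = (z*σ/m)² can be sketched as below. This is an illustrative sketch, assuming Python 3.8+; the helper name `required_sample_size` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(sigma, margin, level=0.95):
    """Smallest n whose z-interval margin of error is at most `margin`."""
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    # n = (z* · sigma / m)^2, rounded up because n must be a whole number
    # and rounding down would exceed the allowed margin.
    return ceil((z_star * sigma / margin) ** 2)


n_needed = required_sample_size(sigma=10, margin=3)
print(n_needed)
```

Rounding up rather than to the nearest integer is deliberate: a sample of 42 would give a margin slightly larger than 3.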
In statistics, a hypothesis is an assumption, or a theory about the characteristics of one or more variables in one or more populations.

The null hypothesis is the statement being tested. It is a statement of "no effect" or "no difference," and it is labeled H0.
The alternative hypothesis is the claim we are trying to find evidence for, and it is labeled Ha.

One-sided and two-sided tests
A two-tail or two-sided test of the population mean has these null and alternative hypotheses:
H0: µ = [a specific number]    Ha: µ ≠ [a specific number]
A one-tail or one-sided test of a population mean has these null and alternative hypotheses:
H0: µ = [a specific number]    Ha: µ < [a specific number]
OR
H0: µ = [a specific number]    Ha: µ > [a specific number]

What determines the choice of a one-sided versus two-sided test is what we know about the problem before we perform a test of statistical significance.

A health advocacy group tests whether the mean nicotine content of a brand of cigarettes is greater than the advertised value of 1.4 mg. Here, the health advocacy group suspects that cigarette manufacturers sell cigarettes with a nicotine content higher than what they advertise in order to better addict consumers to their products and maintain revenues. Thus, this is a one-sided test:
H0: µ = 1.4 mg    Ha: µ > 1.4 mg

To test the hypothesis H0: µ = µ0 (where µ0 is a known or pre-determined constant) based on an SRS of size n from a Normal population with unknown mean µ and known standard deviation σ, we rely on the properties of the sampling distribution N(µ, σ/√n).
The P-value is the area under the sampling distribution for values at least as extreme, in the direction of Ha, as that of our random sample.
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
Again, we first calculate a z-value and then use Table A or the applet.

The P-value: What is the probability of drawing a random sample such as yours if H0 is true?
Tests of statistical significance quantify the chance of obtaining a particular random sample result if the null hypothesis were true. This quantity is the P-value.
This is a way of assessing the "believability" of the null hypothesis given the evidence provided by a random sample.

Interpreting a P-value
To calculate the P-value for a two-sided test, use the symmetry of the normal curve. Find the P-value for a one-sided test and double it.

Could random variation alone account for the difference between the null hypothesis and observations from a random sample?
Remember, the P-value is, assuming the null hypothesis is true, the likelihood of observing a sample with a mean as extreme or more extreme than what we have already observed.
A small P-value implies that random variation because of the sampling process alone is not likely to account for the observed difference.
With a small P-value, we reject H0. The true property of the population is significantly different from what was stated in H0.
Thus small P-values are strong evidence AGAINST H0.

Diet colas use artificial sweeteners to avoid sugar. These sweeteners gradually lose their sweetness over time. Trained testers sip the cola and assign a "sweetness score" of 1 to 10. The cola is then retested after some time and the two scores are compared to determine the difference in sweetness after storage. Bigger differences indicate bigger loss of sweetness.
Suppose we know that for any cola, the sweetness loss scores vary from taster to taster according to a Normal distribution with standard deviation σ = 1. The mean µ for all tasters measures loss of sweetness.
The sweetness losses for a new cola, as measured by 10 trained testers, yield an average sweetness loss of x-bar = 1.02.
Do the data provide sufficient evidence that the new cola lost sweetness in storage?
The null hypothesis is that no average sweetness loss occurs, while the alternative hypothesis (that which we want to show is likely to be true) is that an average sweetness loss does occur.
H0: µ = 0    Ha: µ > 0
This is considered a one-sided test because we are interested only in determining if the cola lost sweetness (gaining sweetness is of no consequence in this study).

If the null hypothesis of no average sweetness loss is true, the test statistic would be:
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{1.02 - 0}{1/\sqrt{10}} \approx 3.23$$
Because the sample result is more than 3 standard deviations above the hypothesized mean 0, it gives strong evidence that the mean sweetness loss is not 0, but positive.

For test statistic z = 3.23 and alternative hypothesis Ha: µ > 0, the P-value would be:
P-value = P(Z > 3.23) = 1 − 0.9994 = 0.0006
If H0 is true, there is only a 0.0006 (0.06%) chance that we would see results at least as extreme as those in the sample. Since we saw results that are unlikely if H0 is true, we have evidence against H0 and in favor of Ha.

The significance level, α, is the largest P-value tolerated for rejecting a true null hypothesis (how much evidence against H0 we require). This value is decided arbitrarily before conducting the test.
If the P-value is equal to or less than α (p ≤ α), then we reject H0.
If the P-value is greater than α (p > α), then we fail to reject H0.

CIs to test hypotheses
Because a two-sided test is symmetrical, you can also use a confidence interval to test a two-sided hypothesis.

Logic of confidence interval test
Ex: A sample gives a 99% confidence interval of x-bar ± m = 0.84 ± 0.0101.
With 99% confidence, could samples be from populations with µ = 0.86? µ = 0.85?
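The cola test and the confidence-interval check above can be reproduced with a small script. This is a sketch under stated assumptions: Python 3.8+ with the standard-library `statistics.NormalDist`, and the helper names `z_test` and `rejects_two_sided` are illustrative, not from the notes.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alternative="greater"):
    """Return (z, P-value) for H0: mu = mu0 with known sigma."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    upper_tail = 1 - NormalDist().cdf(z)
    if alternative == "greater":        # Ha: mu > mu0
        p = upper_tail
    elif alternative == "less":         # Ha: mu < mu0
        p = 1 - upper_tail
    else:                               # Ha: mu != mu0, double the smaller tail
        p = 2 * min(upper_tail, 1 - upper_tail)
    return z, p


def rejects_two_sided(mu0, center, margin):
    """A two-sided test rejects H0: mu = mu0 when mu0 lies outside the CI."""
    return not (center - margin <= mu0 <= center + margin)


# Cola example: x-bar = 1.02, sigma = 1, n = 10, Ha: mu > 0.
z, p = z_test(1.02, 0, 1, 10)
print(f"z = {z:.2f}, P-value = {p:.4f}")

# 99% CI 0.84 ± 0.0101, i.e. roughly 0.8299 to 0.8501.
print(rejects_two_sided(0.86, 0.84, 0.0101))  # 0.86 lies outside the interval
print(rejects_two_sided(0.85, 0.84, 0.0101))  # 0.85 lies inside the interval
```

In the confidence-interval check, µ = 0.86 falls outside the 99% interval and would be rejected at the 1% level, while µ = 0.85 falls inside and cannot be rejected.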
Objectives (BPS chapter 15)
Inference in practice
o Where did the data come from?
o Cautions about z procedures
o Cautions about confidence intervals
o Cautions about significance tests
o The power of a test
o Type I and II errors

Where did the data come from?
When you use statistical inference, you are acting as if your data are a probability sample or come from a randomized experiment.
Statistical confidence intervals and hypothesis tests cannot remedy basic flaws in producing the data, such as voluntary response samples or uncontrolled experiments.

Cautions about z procedures
o The data must be an SRS (simple random sample) of the population. More complex sampling designs require more complex inference methods.
o The sampling distribution must be approximately normal. This is not true in all instances.
o We must know σ, the population standard deviation. This is often an unrealistic requirement. We'll see what can be done when σ is unknown in the next chapter.
o We cannot use the z procedure if the population is not normally distributed and the sample size is too small, because the central limit theorem will not apply and the sampling distribution will not be approximately normal.

Poorly designed studies often produce useless results (e.g., agricultural studies before Fisher). Nothing can overcome a poor design.

The margin of error does not cover all errors: the margin of error in a confidence interval covers only random sampling error. Undercoverage, nonresponse, or other forms of bias are often more serious than random sampling error (e.g., our election polls). The margin of error does not take these into account at all.

Outliers influence averages and therefore your conclusions as well.

Practical significance
Statistical significance only says whether the effect observed is likely to be due to chance alone because of random sampling.
Statistical significance may not be practically important.
That's because statistical significance doesn't tell you about the magnitude of the effect, only that there is one.
An effect could be small enough to be irrelevant. And with a large enough sample size, a test of significance can detect even very small differences between two sets of data, as long as they are real.
Example: A drug to lower temperature is found to reproducibly lower a patient's temperature by 0.4° Celsius (P-value < 0.01). But clinical benefits of temperature reduction appear only for a decrease of 1° or more.

Sample size affects statistical significance
Because large random samples have small chance variation, very small population effects can be highly significant if the sample is large.
Because small random samples have a lot of chance variation, even large population effects can fail to be significant if the sample is small.

Interpreting effect size: It's all about context
There is no consensus on how big an effect has to be in order to be considered meaningful. In some cases, effects that may appear to be trivial can in reality be very important.
Example: Improving the format of a computerized test reduces the average response time by about 2 seconds. Although this effect is small, it is important since this is done millions of times a year. The cumulative time savings of using the better format is gigantic.
Always think about the context. Try to plot your results, and compare them with a baseline or results from similar studies.

More cautions…
Confidence intervals vs. hypothesis tests
It's a good idea to give a confidence interval for the parameter in which you are interested. A confidence interval actually estimates the size of an effect rather than simply asking if it is too large to reasonably occur by chance alone.

Beware of multiple analyses
Running one test and reaching the 5% level of significance is reasonably good evidence that you have found something. Running 20 tests and reaching that level only once is not.
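Two of the cautions above lend themselves to a quick numerical illustration. The sketch below assumes Python 3.8+, uses made-up numbers (an observed difference of 0.1 with σ = 1) purely for illustration, and treats the 20 tests as independent with every null hypothesis true; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(observed_diff, sigma, n):
    """Upper-tail P-value for z = observed_diff / (sigma / sqrt(n))."""
    z = observed_diff / (sigma / sqrt(n))
    return 1 - NormalDist().cdf(z)


def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of `num_tests` independent tests rejects
    a true null hypothesis at level alpha (independence is assumed)."""
    return 1 - (1 - alpha) ** num_tests


# Same small observed difference, very different sample sizes (hypothetical data).
p_small_n = one_sided_p_value(0.1, 1.0, 25)     # z = 0.5
p_large_n = one_sided_p_value(0.1, 1.0, 2500)   # z = 5.0
print(f"n = 25:   P-value = {p_small_n:.4f}")
print(f"n = 2500: P-value = {p_large_n:.2e}")

# Twenty tests at the 5% level.
familywise = prob_at_least_one_false_positive(20)
print(f"P(at least one false positive in 20 tests) = {familywise:.2f}")
```

Under these assumptions the chance of at least one spurious "significant" result among 20 tests is about 0.64, which is why a single hit out of 20 is weak evidence.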
The power of a test
The power of a test of hypothesis with fixed significance level α is the probability that the test will reject the null hypothesis when the alternative is true.
In other words, power is the probability that the data gathered in an experiment will provide sufficient evidence to reject a wrong null hypothesis.
Knowing the power of your test is important:
o When designing your experiment: To select a sample size large enough to detect an effect of a magnitude you think is meaningful.
o When a test found no significance: Check that your test would have had enough power to detect an effect of a magnitude you think is meaningful.

How large a sample size do I need?
In general:
o If you want a smaller significance level (α) or a higher power (1 − β), you need a larger sample.
o A two-sided alternative hypothesis always requires a larger sample than a one-sided alternative.
o Detecting a small effect requires a larger sample than detecting a larger effect.

Type I and II errors
A Type I error is made when we reject the null hypothesis and the null hypothesis is actually true (incorrectly reject a true H0). The probability of making a Type I error is the significance level α.
A Type II error is made when we fail to reject the null hypothesis and the null hypothesis is false (incorrectly keep a false H0). The probability of making a Type II error is labeled β. The power of a test is 1 − β.

Type I and II errors: court of law
o H0: The person on trial is not a thief. (In the U.S., people are considered innocent unless proven otherwise.)
o Ha: The person on trial is a thief. (The police believe this person is the main suspect.)
o A Type I error is made if a jury convicts a truly innocent person. (They reject the null hypothesis even though the null hypothesis is actually true.)
o A Type II error is made if a truly guilty person is set free. (The jury fails to reject the null hypothesis even though the null hypothesis is false.)
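The definitions of power and β above can be made concrete for the one-sided z test. This is a minimal sketch, assuming Python 3.8+, a known σ, and the alternative Ha: µ > µ0; the function name `power_one_sided` and the alternative value µ = 0.8 are hypothetical choices for illustration (σ = 1 and n = 10 echo the cola example).

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Power of the level-alpha z test of H0: mu = mu0 vs Ha: mu > mu0
    when the true mean is mu1."""
    std_normal = NormalDist()
    # The test rejects H0 when z >= z_alpha.
    z_alpha = std_normal.inv_cdf(1 - alpha)
    # Under mu = mu1 the z statistic is centered at this shift, not at 0.
    shift = (mu1 - mu0) / (sigma / sqrt(n))
    return 1 - std_normal.cdf(z_alpha - shift)


base = power_one_sided(mu0=0, mu1=0.8, sigma=1, n=10)
stricter = power_one_sided(mu0=0, mu1=0.8, sigma=1, n=10, alpha=0.01)
larger_n = power_one_sided(mu0=0, mu1=0.8, sigma=1, n=20)

print(f"power at alpha = 0.05, n = 10: {base:.3f} (beta = {1 - base:.3f})")
print(f"power at alpha = 0.01, n = 10: {stricter:.3f}")
print(f"power at alpha = 0.05, n = 20: {larger_n:.3f}")
```

With these illustrative numbers, lowering α from 0.05 to 0.01 lowers the power (raises β), while doubling the sample size raises it. When the true mean equals µ0, the function returns α itself, the Type I error rate.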
Running a test of significance is a balancing act between the chance α of making a Type I error and the chance β of making a Type II error. Reducing α reduces the power of a test and thus increases β.
It might be tempting to emphasize greater power (the more the better). However, with "too much power," trivial effects become highly significant.
A Type II error is not definitive, since a failure to reject the null hypothesis does not imply that the null hypothesis is true.

If the probability of rejecting the null hypothesis when the null hypothesis is actually true must be kept low, it is best to select a conservative (small) value of α.
Example: The drug does not work, but we think it does, so a bad drug goes on the market.

Suppose that a regulatory agency will propose that Congress cut federal funding to a metropolitan area if its mean level of NOx is unsafe, that is, if it exceeds 5.0 ppt. The agency gathers sample NOx concentrations on 60 different days and calculates a test of significance to assess whether the mean level of NOx is greater than 5.0 ppt. ...
 Spring '10
 Quesen
 Confidence
