Section 1.
Introduction and Overview
What are statistics? What is the practice of statistics?
These are two different questions! Statistics are just numbers but the practice of statistics involves measuring variability of numbers to interpret results.
Statistics can be used to analyze data after an experiment has been carried out but can also be used to make suggestions for how experiments can be designed to reduce variation and produce better, more accurate, consistent, predictive results.
There are numbers, formulas, and defined scientific processes involved in answering a statistical question. However, keep in mind that statistics as a mathematical discipline is a different discipline: theoretical statistics. We will be approaching statistics as an applied discipline using some basic levels of math. Instead of using theorems, properties, and abstract math, we're going to use case studies to illustrate some fundamental points about using statistics to make sense of data.
We'll begin with definitions:
Data: A collection of facts, not necessarily numeric, such as: age, gender, hair color, weight, temperature, etc.
Population: A well defined collection of objects, such as: students (at UAB, in engineering), paint colors (from 1 company, from multiple companies), etc. If you collect information on all of the objects in a population, that is a census. If you collect information on some of the population, that is a sample.
Note: The US Census is actually a sample, not a true census.
Copyright Stacey S. Cofield, 23 August 2004. All rights reserved.
Variable: A measurement on an object that can change from one object to another. Usually denoted with lower case letters: x, y, z
numerical variables: age, height, time
categorical variables: gender, hair color, school class
If you measure only one characteristic ~ univariate
If you measure two characteristics ~ bivariate
If you measure more than two characteristics ~ multivariate
Descriptive Statistics: Often called summary statistics, such as the number of subjects (N), the mean of values ($\mu$), variance ($\sigma^2$), standard deviation ($\sigma$). Often depicted using plots, such as: histograms, box plots, and scatter plots.
Inferential Statistics: The process of using data to make generalizations to a population, such as: confidence intervals, estimation, prediction, etc. Inference is a conclusion that patterns in the data are present in the population.
When collecting data, make sure you collect a good sample to avoid a biased sample. For example, if you are trying to summarize how students feel about a political issue, ask men and women, Republicans and Democrats, freshmen and seniors, etc. There are several sampling procedures:
- simple random sampling: the simplest sampling procedure; it involves selecting a subset of n objects from the population, such that each object has an equal chance of being selected
- stratified sampling: sampling a subset of n from each gender, each age group, or each school class
- convenience sampling: when it isn't possible to get a simple random sample, you sample what you have available to you
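The three sampling procedures can be sketched in Python as follows. This is a minimal illustration and not part of the original notes: the `students` population, the function names, and the choice of school class as the stratum are all hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Select n objects so that every object has an equal chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Draw a simple random sample of size n_per_stratum within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for obj in population:
        strata.setdefault(stratum_of(obj), []).append(obj)
    return {label: rng.sample(members, n_per_stratum)
            for label, members in strata.items()}

def convenience_sample(population, n):
    """Take whatever is readily available; here, simply the first n objects."""
    return population[:n]

# Hypothetical population: 40 students, 10 in each school class.
classes = ["freshman", "sophomore", "junior", "senior"]
students = [(i, cls) for i, cls in enumerate(classes * 10)]

srs = simple_random_sample(students, 8, seed=1)
strat = stratified_sample(students, lambda s: s[1], 2, seed=1)
conv = convenience_sample(students, 8)

print("simple random:", srs)
print("stratified:   ", strat)
print("convenience:  ", conv)
```

The stratified version guarantees that every school class is represented, while the convenience version depends entirely on the order in which students happen to be listed.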
Often the goal of a study is to declare a causal relationship between a response and predictors. The response could be change in blood pressure and the predictors could be age, gender, exercise, and weight. Unless the study is designed well, ensuring a random sample, you won't be able to declare a causal relationship.
Cause and effect relationships should only be drawn from randomized experiments. Observational studies, where the subjects are not randomly chosen or allocated for study, can establish correlation between a response and predictors. Keep in mind that Correlation ≠ Causation.
Inferences to populations should only be drawn from random sampling studies, such as randomized clinical trials and designed laboratory experiments.
Some other definitions that statisticians use:
Model: The statistical model is an equation that predicts the response (or outcome) as a function of other variables.
Parameters: These are unknown coefficients (variables) in the model that need to be estimated, such as the mean or standard deviation. Unless you have a census (all subjects in a population), these are never truly known, only estimated.
Statistical Significance: A precise statistical term that does not equate to practical significance. This usually means that the data provide evidence that the estimated parameter is not the null value (assumed value).
Hypotheses: Usually stated in terms of null and alternative hypotheses: the question you are trying to answer and the alternative (or opposite) of that question. In statistics, the null hypothesis is usually the current standard or what you are trying to disprove. The alternative is what you are trying to show by statistically rejecting the null.
Now, we'll look at a case study to determine how to use statistics to answer questions. First we'll discuss the steps involved in answering the question using common sense, and then we'll define a process that we will use throughout the course.
CPR by phone: In an urban setting, about 6% of out-of-hospital cardiac arrest patients survive to hospital discharge. Clearly, survival depends upon a number of things. For instance, survival can increase if a bystander witnesses the arrest and administers cardiopulmonary resuscitation (CPR ~ 15 chest compressions and 2 breaths, 4 times per minute). But this happens no more than 50% of the time. From the literature, when CPR is administered by a non-EMT, survival probability can increase to at least 9%. In a study in the Seattle area [1], emergency response personnel devised a way for dispatchers to instruct a bystander in CPR over the phone. They found that 29 of 278 CPR patients survived to discharge from the hospital.
Question: Does dispatcher-instructed bystander-administered CPR improve the chances of survival?
Steps to Answering the Questions with Data
How do we answer the question? How do we use statistics to answer the question? If you think about how you make decisions every day, this can be applied to making statistical decisions: use what you see, use what you know, use what you can show:
1. Begin by writing down what you understand
2. Outline what the data say and form clear and succinct questions pertaining to what the data may imply (or what you would like to show)
3. Form a scientific question to determine if the results are random
4. Compare the data from each side of the question and decide what to believe
[1] A. Hallstrom, L. Cobb, E. Johnson, and M. Copass (2000). Cardiopulmonary resuscitation by chest compression alone or with mouth-to-mouth ventilation. The New England Journal of Medicine, 342(21), 1546-1553.
Put into scientific (statistical) terms, the four steps above can be further defined as four phases with ten steps:
Phase 1: State the Question
1. Evaluate and describe the data
2. Review the assumptions
3. State the question in the form of hypotheses
Phase 2: Decide How to Answer the Question
4. Decide on a summary number (a statistic) that reflects the question
5. How could random variation affect that statistic?
6. State a decision rule, using the statistic, to answer the question
Phase 3: Answer the Question
7. Calculate the statistic
8. Make a statistical decision
9. State the substantive conclusion
Phase 4: Communicate the Answer to the Question
10. Document our understanding with text, tables, or figures
Summary
We began with an introduction to statistics and defined some commonly used terms. We then looked at the CPR case study and defined a formal process to use statistics to answer a question about the case study. The 10-step process will be used throughout the course to address statistical questions.
We'll get to each of these steps later and re-examine the case study in terms of the 10 steps after we examine the concepts of probability and sampling. Next time, we'll discuss graphical methods that can be used to describe this case study and other forms of data.
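As a preview of steps 4 and 7 for the CPR case study, the summary number can be computed directly from the counts given in the case study. The sketch below also assumes, purely for illustration, that each of the 278 patients survives independently with the 6% baseline probability, which gives a binomial model. That modeling choice and the function name are assumptions, not the course's stated method.

```python
from math import comb

survivors, patients = 29, 278   # counts reported in the case study
baseline = 0.06                 # about 6% survive to discharge at baseline

# Steps 4 and 7: the summary statistic is the observed survival proportion.
p_hat = survivors / patients

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

# Step 5: how often would random variation alone give 29 or more survivors
# if the true survival probability were still the 6% baseline (assumed model)?
tail = binomial_upper_tail(survivors, patients, baseline)

print(f"observed survival proportion: {p_hat:.4f}")
print(f"P(at least {survivors} survivors at a {baseline:.0%} rate): {tail:.4f}")
```

A small tail probability would suggest that the observed proportion is hard to explain by chance alone; whether that supports the conclusion is what steps 6 through 9 formalize.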
Section 2.
Graphical Methods
Graphical displays are an important tool in illustrating data. There are a number of graphical tools that are commonly used to display information about the population you are studying:
- histograms
- stem-and-leaf diagrams
- box plots
- bar or pie charts
- scatter plots, both two and three dimensional
- Receiver Operating Characteristic (ROC) curves
Throughout this course, we'll rely on graphical methods to summarize the data and, in many cases, lead us to the correct statistical analysis method.
When used correctly, graphical methods can provide an accurate visual picture of a number of descriptive statistics, such as:
- mean
- variance
- range
- quartiles
- outliers
Some graphical displays can also summarize the results of a statistical analysis, for example: the difference between groups, a regression fit, or sensitivity and specificity analyses.
Most statistical packages provide a number of the common graphical display methods. In addition, there are spreadsheet packages (e.g., Excel) that also have graphical options.
Copyright Stacey S. Cofield, 25 August 2004. All rights reserved.
Section 2.1 Histogram
A histogram is a standard tool used to illustrate general characteristics of a data set or population. For example, consider the Motivation and Creativity data from Ramsey & Schafer (Section 1.1), where subjects were placed in intrinsic and extrinsic treatment groups, given a questionnaire about their motivation to write, and then asked to write a poem, which was scored to determine if motivation was related to performance. Essentially, the researchers were attempting to determine if students produced better work when motivated by creativity or by rewards.
Viewed in JMP IN, the data contain a score (numeric) and a treatment group (categorical).
The data can be viewed using histograms for:
1) scores for the entire group
[Figure: histogram of scores for all subjects; axes: score, Count]
2) the distribution of treatment groups
[Figure: histogram of treatment group (INTRINSIC, EXTRINSIC); axis: Count]
3) scores for each treatment group
[Figure: histograms of scores by treatment group; panels: TREATMENT=EXTRINSIC, TREATMENT=INTRINSIC; axes: score, Count]
Which is the most informative? The histogram (1) of the overall scores only provides a picture of how all students scored but no information on how the scores differed between the two groups. The histogram (2) of the treatment groups shows that the students were approximately evenly divided between the two groups but gives no information about the scores in each group. The histograms (3) of each of the treatment groups not only show the range of scores in each group but also allow for some comparison between the two groups. The histograms show:
- Extrinsic group scores range from 5 to 25, with most students scoring 15 to 20
- Intrinsic group scores range from 10 to 30, with most students scoring 17.5 to 25
- It appears that the students in the Intrinsic group scored higher than those in the Extrinsic group
The final set of histograms would lead you to examine the statistical difference between the two groups, perhaps by comparing the means of the two groups using a t-test.
2.2 Stem-and-Leaf Plot
The stem-and-leaf plot is both a graph and a table. The stem-and-leaf for the creativity study is below:

Extrinsic Leaf | Stem | Intrinsic Leaf
            40 |   5  |
             1 |   6  |
               |   7  |
               |   8  |
               |   9  |
             9 |  10  |
             8 |  11  |
            30 |  12  | 009
               |  13  | 6
             8 |  14  |
             0 |  15  |
             8 |  16  | 6
          5422 |  17  | 25
           775 |  18  | 2
            52 |  19  | 138
             7 |  20  | 356
             2 |  21  | 36
             1 |  22  | 126
               |  23  | 1
             0 |  24  | 03
               |  25  |
               |  26  | 7
               |  27  |
               |  28  |
               |  29  | 7
               |  30  |

Legend: 29 | 7 is 29.7
The stem is the center column, and each of the treatment groups has a column of leaves. For example, at stem 17, the Extrinsic group has scores of 17.5, 17.4, 17.2, and 17.2, while the Intrinsic group has scores of 17.2 and 17.5. The median can be determined from a stem-and-leaf plot by locating the k = (n+1)/2 leaf. If k is an integer, the median is the kth smallest observation; if k falls between two integers, the median is halfway between the two observations whose ranks are the integers just below and just above k. For example, in the Extrinsic group (n = 23, k = 12), the median is the 12th smallest observation (17.2), and in the Intrinsic group (n = 24, k = 12.5), the median is halfway between the 12th and 13th smallest observations (20.4).
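The k = (n+1)/2 rule can be sketched in Python. The scores below are transcribed from the stem-and-leaf plot in this section, so treat the lists as a reading of that plot rather than as the original data file; the function name is hypothetical.

```python
def median_by_rank(values):
    """Median via the k = (n + 1) / 2 rule for locating the middle leaf."""
    ordered = sorted(values)
    n = len(ordered)
    k = (n + 1) / 2
    if k.is_integer():
        return ordered[int(k) - 1]      # kth smallest observation (1-based rank)
    lower = ordered[int(k) - 1]         # observation ranked just below k
    upper = ordered[int(k)]             # observation ranked just above k
    return (lower + upper) / 2

# Scores as read from the stem-and-leaf plot (a transcription, not the raw file).
extrinsic = [5.0, 5.4, 6.1, 10.9, 11.8, 12.0, 12.3, 14.8, 15.0, 16.8, 17.2, 17.2,
             17.4, 17.5, 18.5, 18.7, 18.7, 19.2, 19.5, 20.7, 21.2, 22.1, 24.0]
intrinsic = [12.0, 12.0, 12.9, 13.6, 16.6, 17.2, 17.5, 18.2, 19.1, 19.3, 19.8, 20.3,
             20.5, 20.6, 21.3, 21.6, 22.1, 22.2, 22.6, 23.1, 24.0, 24.3, 26.7, 29.7]

print("Extrinsic: n =", len(extrinsic), "median =", median_by_rank(extrinsic))
print("Intrinsic: n =", len(intrinsic), "median =", round(median_by_rank(intrinsic), 1))
```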
Like the histogram, the stem-and-leaf plot shows the mean, range, and variability, but it does so by showing each number. The stem-and-leaf plot is, however, only practical for smaller data sets and is not commonly used in publications.
2.3 The Box Plot
Box plots, or box-and-whisker plots, are used to illustrate the mean, median, variance, range, and percentiles of data. The box plot consists of a box with a line at the median, two tails (whiskers) extending from the top and bottom of the box to some percentile of the data, and points beyond the tails representing potential outliers. Most box plots show the first and third quartiles as the ends of the box, so that the box spans the interquartile range (IQR) and contains 50% of the data. The lower whisker most commonly represents the 10th percentile and the upper whisker the 90th percentile. Note: this will vary from program to program; some programs allow the user to specify the whisker length, while others use set limits. The box plots for the creativity data are below:
[Figure: box plots of scores by treatment group; panels: TREATMENT=EXTRINSIC, TREATMENT=INTRINSIC]
In JMP IN, the box plot shows:
[Figure: JMP IN box plot display]
Often, histograms and box plots are shown together:
[Figure: histograms with box plots of scores by treatment group; panels: TREATMENT=EXTRINSIC, TREATMENT=INTRINSIC; axis: Count]
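The quantities a box plot displays can also be computed directly. This is a minimal sketch, assuming Python's `statistics.quantiles` with its default "exclusive" method; percentile conventions differ between statistical packages, so JMP IN may report slightly different values. The Extrinsic scores are transcribed from the stem-and-leaf plot in Section 2.2, and the function name is hypothetical.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Quartiles, IQR, and the 10th/90th percentiles often used for whiskers."""
    q1, median, q3 = quantiles(values, n=4)   # cut points for quarters
    deciles = quantiles(values, n=10)         # cut points for tenths
    return {
        "p10": deciles[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "p90": deciles[-1],
        "iqr": q3 - q1,
    }

# Extrinsic scores as read from the stem-and-leaf plot (a transcription).
extrinsic = [5.0, 5.4, 6.1, 10.9, 11.8, 12.0, 12.3, 14.8, 15.0, 16.8, 17.2, 17.2,
             17.4, 17.5, 18.5, 18.7, 18.7, 19.2, 19.5, 20.7, 21.2, 22.1, 24.0]

summary = box_plot_summary(extrinsic)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```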
2.4 Bar and Pie Charts
Other commonly used graphical displays are bar and pie charts, often used to display classes within a population. For example, the departments in which the students are enrolled and the degrees being pursued by students in BST 621:
[Figure: bar chart of number of students (N) by Department: BioChem, BioEng, Biostat, CS, Cell Biology, EPI, MCH, Math, Medicine, N/A, Ophthalmology]
[Figure: bar chart of number of students (N) by Degree: MPH, MS, MSPH, NDS, Not Specified, PhD]
[Figure: Pie Chart of Departments]
[Figure: Pie Chart of Degrees: MPH, MS, MSPH, NDS, Not Specified, PhD]
2.5 Scatter Plots
Two dimensional scatter plots are most often used to show how data changes across one variable as another variable changes. They are commonly used in regression analysis to show the distribution of the data points, while at the same time, showing the best fit line. For example, in a study of height (in) and weight (lbs) in female college students, it can be seen that as height increases, weight increases:
[Figure: scatter plot of Weight (lbs) versus Height (in) for female college students]
Three dimensional scatter plots can be used to examine relationships between three variables but are often not used in publications unless the relationships between the variables can be clearly illustrated.
2.7 Graphical Displays in JMP IN
The first screen view in JMP IN is the JMP Starter screen. From here, the user has two options: you can either select the Graph starter, choose a graph, and then a data set; or you can open a data set, then choose Graph from the menu bar and select a specific plot.
Once any data set is open, the user can also select a graph icon from the shortcut bar.
For example:
- Histogram (distribution) produces histograms, box plots, and stem-and-leaf plots.
- Chart produces various charts such as bar, pie, line, and needle.
- Overlay Plots produces a special type of line plot. Overlay plots get their name from "overlaying" 2 or more Y columns (along the vertical axis) across the one X column (along the horizontal axis).
- Spinning Plot produces a three-dimensional scatter plot that can be rotated to see depth.
2.8 Chart Junk
Chart Junk is the term used to refer to excess or extraneous information on a graph or chart. Often this information can distract (or is specifically used to distract) from the actual results. Chart Junk can also refer to poor use of high-tech graphics. Some examples:
Not including a 0 reference on the Y axis can distort trends:
[Figure: two line charts of Sales by Month (Jan, March, May, June, Sept, Nov); one with the Y axis running from 35 to 45, one with the Y axis running from 0 to 60]
Compressing the Y axis can do the reverse:
[Figure: two charts of Sales by Quarter (Q1 to Q4); one with the Y axis running from 0 to 200, one with the Y axis running from 0 to 50]
Not providing a relative basis:
[Figure: chart of # A Grades by class (Fr, So, Jr, Sr) on a count scale from 0 to 14, and chart of # A Grades relative to # in Class by class on a scale from 0% to 100%]
Summary
Graphical displays are useful to illustrate properties associated with data and populations. Correct use of graphs can highlight group differences and even lead to appropriate statistical tests. Inappropriate use can lead to a misrepresentation or distortion of results. As with formulating a hypothesis, simple is best when using graphical displays.
Section 3.
Probability Concepts
The concept of probability is used in many aspects of health-related sciences. It is common to hear about the probability of surviving a surgery or treatment. We often read about studies stating that the odds of contracting a disease are higher for one group compared to another. Probability is even used in everyday terms: "there is a 50-50 shot", "the odds of winning are 1 in 7.1 million", etc. These probabilities are actually fractions that are multiplied by 100% to obtain a percentage; for example, a 1 in 4 chance is actually 1 out of 4 = 1/4 = 0.25, and 0.25 x 100% = 25% chance.
The likelihood of an event can range from an event that cannot occur to an event that is certain to occur. The statistical probability of an event is measured by a number between 0 and 1 (0 to 100%); the less likely an event is to occur, the closer the probability is to zero, and the more likely an event is to occur, the closer the probability is to one:
0.00 cannot occur
0.50 equally likely
1.00 certain occurrence
Examples:
The probability of a coin landing heads up is 1 out of 2 = 1/2 = 0.50 or 50%
The probability of rolling a 2 on a six-sided die is 1 out of 6 = 1/6 ≈ 0.167 or about 16.7%
3.1 Types of Probability
There are two types of probability:
1) Objective probability
2) Subjective probability
Copyright Stacey S. Cofield, 30 August 2004. All rights reserved.
3.1.1 Objective Probability
Objective probability is measuring the likelihood of events based on objective processes. Objective probability can be classified as either
a) classical, or a priori, probability, or
b) relative frequency, or a posteriori, probability.
Most statisticians will identify with one classification over the other. Those identifying with classical probability are often referred to as Bayesians, while those identifying with relative frequency probability are often called Frequentists. Note: In most cases, the results are the same; the assumptions associated with calculating the probabilities simply differ.
Classical Probability
An excerpt from Against the Gods: The Remarkable Story of Risk [1]:
"Since the beginning of recorded history, gambling, the very essence of risk-taking, has been a popular pastime and often an addiction. It was a game of chance that inspired Pascal and Fermat's revolutionary breakthrough into the laws of probability, not some profound question about the nature of capitalism or visions of the future."
Games of chance are often used to demonstrate the principles associated with classical probability. Dice and card games are common examples. It is not necessary to actually roll a die or deal a card to determine the probability of a certain number being rolled or card being drawn. The chance of drawing any single card from a full deck of 52 cards is always 1/52, whereas the likelihood of drawing any heart is always 13/52. With a fair die, each of the six sides is equally likely to be observed on any given roll, regardless of which number was observed on the previous roll. The probabilities are based on reasoning, not the act of rolling a die or drawing a card. This fact is often lost on those who gamble:
"Losing streaks and winning streaks occur frequently in games of chance, as they do in real life. Gamblers respond to these events in asymmetric fashion: they appeal to the law of averages to bring losing streaks to a speedy end. And they appeal to the same law of averages to suspend itself so that winning streaks will go on and on. The law of averages hears neither appeal. The last sequence of throws of the dice conveys absolutely no information about what the next throw will bring. Cards, coins, dice and roulette wheels have no memory." [1]
In the above examples, each of the occurrences is an event and each of the events is as likely to occur as any other event. Not only that, but no two events can occur at the same time: the events are mutually exclusive. That is, a 2 and a 4 cannot be observed on the same roll of a die, and when choosing a single card, the selection cannot be both a 2 of hearts and a 4 of diamonds. Let N be the total number of mutually exclusive and equally likely events (for a fair die, N = 6; for a full deck of cards, N = 52). Also, define m as the number of those events that make up the event of interest, E; for example, when drawing a 2 from a deck of cards, there are 4 possibilities: a 2 of hearts, a 2 of clubs, a 2 of spades, and a 2 of diamonds, so m = 4. Let P(E) be the probability of E, then:
$$P(E) = \frac{m}{N}$$
[1] Bernstein, Peter L. Against the Gods: The Remarkable Story of Risk, 2nd edn. New York: John Wiley and Sons, 1998.
Relative Frequency Probability
Often, the true probability of an event is not known and can only be estimated. Relative frequency probability is counting the number of repetitions of a process and the number of times each event occurs in order to predict, or estimate, the likelihood of an event occurring. The formula for the definition of the probability of E is similar, but here m is the number of times the event was observed and n is the number of times the process was repeated:
$$P(E) = \frac{m}{n}$$
A number of health related probabilities are based on relative frequency estimates of occurrence. For example, the probability of surviving breast cancer to 5 years, is an estimate based on the number of observed occurrences of breast cancer and the number of those cases that have survived to 5 years. However, the true probability of surviving to 5 years is unknown.
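The difference between the classical value m/N and a relative frequency estimate m/n can be illustrated with a simulated fair die. This is only a sketch: the simulation, the fixed seed, and the function name are assumptions added for illustration, not part of the notes.

```python
import random

def relative_frequency(event, trials, seed=0):
    """Estimate P(E) as m / n: m occurrences of the event in n simulated rolls."""
    rng = random.Random(seed)
    m = sum(1 for _ in range(trials) if event(rng.randint(1, 6)))
    return m / trials

def is_two(face):
    return face == 2

classical = 1 / 6   # m = 1 favorable face out of N = 6 equally likely faces

for n in (60, 600, 60_000):
    estimate = relative_frequency(is_two, n)
    print(f"n = {n:>6}: relative frequency = {estimate:.4f}, classical = {classical:.4f}")
```

With more repetitions the estimate tends to settle near the classical value, which is the sense in which an observed frequency, such as a 5-year survival rate, estimates an unknown true probability.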
3.1.2 Subjective Probability
See Daniel, p. 61
3.2 Properties of Probability
There are three basic properties of probability:
1. Given n mutually exclusive outcomes (events), $E_1, E_2, \ldots, E_n$, the probability of any event, $E_i$, occurring is non-negative, such that:
$$P(E_i) \ge 0$$
2. The sum of all probabilities for the mutually exclusive events is one:
$$P(E_1) + P(E_2) + \cdots + P(E_n) = \sum_{i=1}^{n} P(E_i) = 1$$
3. The probability of any of k mutually exclusive events occurring is the sum of their individual probabilities:
$$P(E_1 \text{ or } E_2 \text{ or } \cdots \text{ or } E_k) = P(E_1) + P(E_2) + \cdots + P(E_k)$$
These properties only hold if the events are mutually exclusive. If the events can overlap, or more than one event can occur at the same time, calculating the probability of events is more involved.
3.3 Calculating Probability
The theoretical properties of probability can be easily extended to calculating the probability of events for practical purposes. To illustrate the various types of probabilities that can be calculated, we'll look at data from an automotive survey of 303 adults that purchased cars in a single year (Table 1).

Table 1. Summary Data of Automotive Survey

Buyer Information
Gender          Marital Status    Age
Female   138    Married   196     range   18-60
Male     165    Single    107     mean    30.7
                                  sd      5.98

Vehicle Information
Country of Origin    Vehicle Size     Vehicle Purpose
American   115       large    42      family   155
Europe      40       medium   124     sport    100
Japanese   148       small    37      work      48
Some auto makers are interested in the buying habits of men and women: do men buy larger cars than women, do women purchase cars for different uses than men? Table 2 shows the distribution of gender and vehicle purpose. If we select a person at random from this sample, what is the probability that the person is female?

Table 2. Frequency of Gender and Vehicle Purpose

            Vehicle Purpose
Gender      Family   Sporty   Work   Total
Female        76       41      21     138
Male          79       59      27     165
Total        155      100      48     303
We assume that male and female are mutually exclusive categories and that the likelihood of choosing any one person is equal to the likelihood of any other. The number of subjects with the characteristic of interest (female) is 138 and the total number of subjects is 303. Therefore, the probability of the subject being female is:
$$P(F) = \frac{\text{number of females}}{\text{total number of subjects}} = \frac{138}{303} = 0.4554$$
The probability that any randomly selected subject is female is 0.46, or there is a 46% chance that the subject is female. This is an unconditional probability, since the denominator is the total sample, or the probability isn't conditioned on any one subgroup.
If the probability is calculated using a subset of the total sample as the denominator, then this is called a conditional probability. For example, suppose we select a subject and the subject is male (M); what is the probability that this subject purchased a car for work (w)? The total sample is no longer of interest, since by selecting a male, the female subjects have been eliminated. Now, given that the subject is male, the denominator of interest is 165; of these, 27 purchased a car for work purposes. Therefore the probability of interest is:
$$P(w \mid M) = \frac{\text{number of work vehicles purchased by males}}{\text{total number of males}} = \frac{27}{165} = 0.1636$$
Given that a subject is male, the probability that he purchased the car for work is 0.16, or there is a 16% chance that the vehicle was purchased for work given the subject was male. How about if the subject was female?
$$P(w \mid F) = \frac{\text{number of work vehicles purchased by females}}{\text{total number of females}} = \frac{21}{138} = 0.1522$$
Given that a subject is female, the probability that she purchased the car for work is 0.15, or there is a 15% chance that the vehicle was purchased for work given the subject was female. Determining if the 16% is different from the 15% is a different question that will be addressed when we learn about comparing proportions.
Often we want to look at the probability of a subject having more than one characteristic, or the joint probability of multiple events. What is the probability that a subject picked at random is female and purchased the car for work purposes?
$$P(F \cap w) = \frac{\text{number of work vehicles purchased by women}}{\text{total number of subjects}} = \frac{21}{303} = 0.0693$$
The probability that a subject is female and purchases a car for work is 0.07, or a 7% chance that any given subject purchased a car for work and is female. The relationship between conditional and joint probabilities can be expressed as:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) \neq 0$$
Since we know $P(F \cap w) = P(w \cap F)$ and $P(F)$, we can determine the probability of the vehicle being purchased for work given that the subject is female:
$$P(w \mid F) = \frac{P(w \cap F)}{P(F)} = \frac{0.0693}{0.4554} = 0.1522$$
This is the same probability found by calculating it directly above.
What if the probability of interest is either one of two events occurring? If the events are mutually exclusive, then the probability is the sum of the individual probabilities for the two events. For the automotive data, what is the probability that a subject is male or female? $P(F \cup M) = P(F) + P(M) = 0.46 + 0.54 = 1$. This is intuitive if the events cannot occur at the same time, but what about events that can occur simultaneously? To determine the probability of event A or event B occurring, add the probabilities of either event occurring and subtract the probability of both events occurring together:
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
What is the probability that a randomly selected subject is female (F) or will have purchased a vehicle for work (w)? From Table 2, $P(w) = 48/303 = 0.1584$, so:
$$P(F \cup w) = P(F) + P(w) - P(F \cap w) = \frac{138}{303} + \frac{48}{303} - \frac{21}{303} = \frac{165}{303} \approx 0.5446$$
It is important to note that the subjects who are female and purchased a car for work are included both in the number of subjects who are female and in the number of subjects who purchased a car for work. Since these subjects have been counted twice, they have to be subtracted out once to account for the overlap.
What if one event occurring has no effect on the occurrence of another event? For example, what if the probability of a child choosing a type of candy given that the child is a girl, $P(c \mid g)$, is the same as the probability of any child picking that type of candy, $P(c)$? The choice of candy is said to be independent of the gender of the child, or $P(c) = P(c \mid g)$. Consequently, we can determine the probability of the candy being chosen and the subject being a girl:
$$P(c \cap g) = P(g)\,P(c \mid g) = P(g)\,P(c)$$
We can also determine the alternative to events, such as the probability of being not a girl or being a boy. These are called complementary events. The complement of $g$ is $\bar{g}$ and $P(\bar{g}) = 1 - P(g)$.
Returning to the automotive example, the probability of not being a female (being male) is:
$$P(\bar{F}) = 1 - P(F) = 1 - 0.4554 = 0.5446$$
This is the same as the probability of being male.
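The unconditional, conditional, joint, union, and complement calculations in this section can be checked against the counts in Table 2. The following is a minimal sketch; the dictionary layout and function names are hypothetical, and only the counts come from the table.

```python
# Counts from Table 2: (gender, vehicle purpose) -> number of subjects.
counts = {
    ("F", "family"): 76, ("F", "sporty"): 41, ("F", "work"): 21,
    ("M", "family"): 79, ("M", "sporty"): 59, ("M", "work"): 27,
}
total = sum(counts.values())

def p_gender(gender):
    """Unconditional probability of a gender."""
    return sum(c for (g, _), c in counts.items() if g == gender) / total

def p_purpose(purpose):
    """Unconditional probability of a vehicle purpose."""
    return sum(c for (_, u), c in counts.items() if u == purpose) / total

def p_joint(gender, purpose):
    """Joint probability: P(gender and purpose)."""
    return counts[(gender, purpose)] / total

def p_conditional(purpose, gender):
    """Conditional probability: P(purpose | gender) = P(both) / P(gender)."""
    return p_joint(gender, purpose) / p_gender(gender)

def p_union(gender, purpose):
    """Addition rule: P(gender or purpose), subtracting the overlap once."""
    return p_gender(gender) + p_purpose(purpose) - p_joint(gender, purpose)

print(f"P(F)        = {p_gender('F'):.4f}")
print(f"P(w | M)    = {p_conditional('work', 'M'):.4f}")
print(f"P(w | F)    = {p_conditional('work', 'F'):.4f}")
print(f"P(F and w)  = {p_joint('F', 'work'):.4f}")
print(f"P(F or w)   = {p_union('F', 'work'):.4f}")
print(f"P(not F)    = {1 - p_gender('F'):.4f}")
```

Working from the raw counts rather than from rounded probabilities avoids small discrepancies in the last decimal place.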
Let's look at Table 2 again:

            Vehicle Purpose
Gender      Family   Sporty   Work   Total
Female        76       41      21     138
Male          79       59      27     165
Total        155      100      48     303
When the numerator of a probability is one of the column or row totals (a margin total), the probability is called a marginal probability. In general, given that a variable can be broken down into m categories and another jointly occurring variable can be broken down into n categories, the marginal probability of a category of the first variable is the sum of its joint probabilities with each of the n categories of the second variable:
$$P(A_i) = \sum_{j=1}^{n} P(A_i \cap B_j)$$
Earlier, we determined the probability of being female. Let's use the above definition to determine that probability from a series of joint probabilities. The gender variable is broken down into two categories, M and F, and the vehicle usage variable is broken down into three categories: family (f), sport (s), and work (w). The female category occurs jointly with all vehicle usage categories:
$$P(F \cap f) = \frac{76}{303} = 0.2508 \qquad P(F \cap s) = \frac{41}{303} = 0.1353 \qquad P(F \cap w) = \frac{21}{303} = 0.0693$$
Using the above formula, the probability of being a female is:
$$P(F) = P(F \cap f) + P(F \cap s) + P(F \cap w) = 0.2508 + 0.1353 + 0.0693 = 0.4554$$
The result is the same result we have seen previously.
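The marginal and complement calculations above can be checked with a few lines of Python. This is only a sketch: the dictionary layout and the helper name `marginal_gender` are illustrative choices, not part of the original notes; the counts are those of Table 2.

```python
# Counts from Table 2 (gender by vehicle purpose).
counts = {
    "Female": {"Family": 76, "Sporty": 41, "Work": 21},
    "Male": {"Family": 79, "Sporty": 59, "Work": 27},
}

total = sum(sum(row.values()) for row in counts.values())

# Joint probabilities P(gender and purpose).
joint = {
    (gender, purpose): n / total
    for gender, row in counts.items()
    for purpose, n in row.items()
}


def marginal_gender(gender):
    """Sum the joint probabilities over every vehicle purpose."""
    return sum(p for (g, _), p in joint.items() if g == gender)


p_female = marginal_gender("Female")
p_not_female = 1 - p_female  # complement rule

print(round(p_female, 4), round(p_not_female, 4))
```

Summing the three female joint probabilities gives the same value as dividing the female row total by the grand total, which is the point of the marginal probability definition.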
Section 4.
Discrete Probability Distributions
In the previous section we discussed some concepts of probability and how to calculate the probability of an event. In this section, we'll extend these ideas to more complex situations. We'll begin with a few definitions.
Sample Space: The sample space for a given set of events is the set of all possible values the events may assume. A sample space may also be known as an event space or possibility space. For example, the sample space of a toss of two coins, each of which may land heads (H) or tails (T), is the set of all possible outcomes: HH, HT, TH, and TT.
Random variable (rv): A random variable is a real-valued function of events in the sample space. For example:
An electrical system has several components, and the system will fail if any one component fails. The sample space is {S, F}, for success or failure. A random variable X is defined by X(S) = 1 and X(F) = 0, indicating whether the system succeeded or failed.
A random sample of UAB student GPA scores is selected. Define a rv X by letting X = the student GPA. The possible values of X are between 0.000 and 4.333, such that {x : 0.000 ≤ x ≤ 4.333}.
Discrete Random Variable: A rv whose possible values either constitute a finite set or can be listed in an infinite sequence in which there is a first element, a second element, and so on.
4.1
Probability Distributions for Discrete Random Variables
The probability distribution (pdf) of X shows how the total probability of 1 is distributed among all of the possible values of X. It is also called the probability mass function (pmf). The pdf of a discrete random variable is defined for every number x by p(x) = P(X = x). Some properties of P(X = x) are:
1) $0 \le P(X = x) \le 1$
2) $\sum_{x} P(X = x) = 1$
Copyright Stacey S. Cofield, 1 September 2004. All rights reserved.
Example: A produce company supplies lots of produce to grocers. Each lot refers to a different farm. The number of bad pieces of produce on average in each shipment is recorded for quality control:
Table 1. Produce Example, # of Bad Pieces per Lot
lot #    bad pieces
1        0
2        1
3        2
4        0
5        4
6        0
If a grocer selects a random lot for purchase, let X be the number of bad pieces of produce in the selected lot. The possible values of X are 0, 1, 2, and 4. Let p(x) be the probability that X = x, where x can be 0, 1, 2, or 4. Then
p(0) = P(X = 0) = P(lot 1, 4, or 6 is selected) = 3/6 = 0.50
p(1) = P(X = 1) = P(lot 2 is selected) = 1/6 = 0.167
p(2) = P(X = 2) = P(lot 3 is selected) = 1/6 = 0.167
p(4) = P(X = 4) = P(lot 5 is selected) = 1/6 = 0.167
In the long run, a lot with 0 pieces of bad produce would be selected half of the time, and the other half of the time, the lot would contain 1, 2, or 4 bad pieces (1/6 of the time each). The above probabilities define the pdf of X. The plot of the pdf is:
[Figure: bar chart of Probability versus number of bad pieces per lot]
Figure 1. PDF of Produce Example
Example: A video store has kept track of the number of videos that customers rent in a week to help determine potential promotional programs to increase the number of rentals. Over the course of a year, 50% of the customers rented 1 video, 30% rented 2 videos, 10% rented 3 videos, 7% rented 4 videos, and 3% rented 5 or more videos (called 5 videos). The pdf can now be defined as:
p(1) = P(X = 1) = P(1 video) = 0.50
p(2) = P(X = 2) = P(2 videos) = 0.30
p(3) = P(X = 3) = P(3 videos) = 0.10
p(4) = P(X = 4) = P(4 videos) = 0.07
p(5) = P(X = 5) = P(5 videos) = 0.03
Which can also be expressed as:
$$p(x) = \begin{cases} 0.50 & x = 1 \\ 0.30 & x = 2 \\ 0.10 & x = 3 \\ 0.07 & x = 4 \\ 0.03 & x = 5 \\ 0 & \text{if } x \ne 1, 2, 3, 4, 5 \end{cases}$$
[Figure: bar chart of probability versus # videos]
Figure 2. PDF of Video Store Example
If a rv only has two options, it is called a binary rv. If the options are 0 or 1, it is called a Bernoulli rv. For a Bernoulli random variable, the pdf can be generally defined as:
$$P(X = x) = \begin{cases} 1 - \alpha & \text{if } x = 0 \\ \alpha & \text{if } x = 1 \\ 0 & \text{otherwise} \end{cases}$$
where $\alpha$ is a number (probability of occurrence) between 0 and 1. If we look at the video rental data in terms of comedy or non-comedy, comedies are rented 35% of the time. Therefore $\alpha$ = 0.35 and the pdf can be expressed as:
$$P(X = x) = \begin{cases} 1 - 0.35 & \text{if } x = 0 \text{ (non-comedy)} \\ 0.35 & \text{if } x = 1 \text{ (comedy)} \\ 0 & \text{otherwise} \end{cases} = \begin{cases} 0.65 & \text{if } x = 0 \text{ (non-comedy)} \\ 0.35 & \text{if } x = 1 \text{ (comedy)} \\ 0 & \text{otherwise} \end{cases}$$
We can use the pdf to determine probabilities. For example, what is the probability that a customer will rent 1 or 2 videos in a week? The two events are mutually exclusive: a customer who rented 2 videos is not counted in the 1-video group, and nobody can rent exactly 1 and exactly 2 videos in the same week. We therefore simply add the probability of renting 1 video to the probability of renting 2 videos:
$$P(1 \cup 2) = P(1) + P(2) = 0.50 + 0.30 = 0.80$$
The probability that a customer will rent 1 or 2 videos in a given week is 80%. This may lead to a promotional program that encourages those who rent 1 or 2 videos a week to rent 2 or 3 videos at no extra cost, since 80% of the customers are in this demographic.
4.2
Cumulative Distributions
Sometimes, it is more convenient to work with the cumulative probability of a random variable. Cumulative probability is adding successive probabilities to obtain a probability up to and including a certain event. For example, the probability that a customer rents 3 or fewer videos is 0.50 + 0.30 + 0.10 = 0.90. Statistically, this is defined as:
$$F(x) = P(X \le x) = \sum_{x_i \le x} p(x_i)$$
For the video example, the CDF is:
Table 2. CDF for Video Store Example
# videos    CDF P(X ≤ x)
1           0.50
2           0.80
3           0.90
4           0.97
5           1.00
The CDF can also be plotted:
[Figure: plot of Cum Prob versus # videos]
Figure 3. CDF of Video Store Example
What is the probability that a customer will rent 4 or fewer videos in a week? Using the table, the probability may be found directly:
# videos    CDF P(X ≤ x)
1           0.50
2           0.80
3           0.90
4           0.97
5           1.00
Using the plot, the probability can be found by reading the corresponding Y axis value for an X value of 4, which is 0.97:
[Figure: plot of Cum Prob versus # videos, with 0.97 marked at # videos = 4]
The probability that X falls between two numbers can also be obtained from the CDF. In general, for an integer-valued X,
$$P(a \le X \le b) = F(b) - F(a - 1), \quad \text{for } a \le b$$
What is the probability that a customer will rent 3-5 videos in a week?
$$P(3 \le X \le 5) = F(5) - F(2) = 1.00 - 0.80 = 0.20$$
The probability that a customer will rent between 3 and 5 videos in a given week is 0.20, or there is a 20% chance that a customer will rent between 3 and 5 videos per week.
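The CDF table and the interval rule above can be reproduced with a short Python sketch. The function names `F` and `prob_between` are illustrative only; the probabilities are those of the video store example.

```python
from itertools import accumulate

# pmf of the video store example: number of videos -> probability.
pmf = {1: 0.50, 2: 0.30, 3: 0.10, 4: 0.07, 5: 0.03}

values = sorted(pmf)
# Running sums of the pmf give the CDF at each possible value.
cdf = dict(zip(values, accumulate(pmf[x] for x in values)))


def F(x):
    """P(X <= x); 0 below the smallest possible value."""
    eligible = [v for v in values if v <= x]
    return cdf[eligible[-1]] if eligible else 0.0


def prob_between(a, b):
    """P(a <= X <= b) = F(b) - F(a - 1) for integer a <= b."""
    return F(b) - F(a - 1)


for x in values:
    print(x, round(F(x), 2))
print("P(3 <= X <= 5) =", round(prob_between(3, 5), 2))
```

Note that the lower limit uses F(a - 1) rather than F(a), so that the probability at a itself is kept in the interval.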
4.2
Expected Value and Variance of Random Variables
Often it is of interest to know the typical outcome of a situation: the average number of bad produce items, the average number of videos rented, etc. The long-run average outcome is called the expected value or, more commonly, the mean. The pdf can be used to calculate the expected value of X:
$$E(X) = \mu_X = \sum_i x_i\,p(x_i)$$
For the video store example:
$$E(X) = \mu_X = \sum_{i=1}^{5} x_i\,p(x_i) = 1(0.50) + 2(0.30) + 3(0.10) + 4(0.07) + 5(0.03) = 1.83$$
So, the average number of videos rented in a given week is slightly below 2 videos. This may lead to a store striving to increase that number to higher than 2 videos a week.
Some properties of expected value:
for any constant a, E(aX) = aE(X)
for any constant b, E(X + b) = E(X) + b
The variance (and standard deviation) may also be calculated using the pdf:
$$Var(X) = \sigma_X^2 = \sum_i (x_i - \mu_X)^2\,p(x_i), \qquad \sigma_X = \sqrt{\sigma_X^2}$$
For the video store example:
$$Var(X) = \sigma_X^2 = \sum_i (x_i - \mu_X)^2\,p(x_i)$$
$$= (1 - 1.83)^2(0.50) + (2 - 1.83)^2(0.30) + (3 - 1.83)^2(0.10) + (4 - 1.83)^2(0.07) + (5 - 1.83)^2(0.03) = 1.1211$$
$$\sigma_X = \sqrt{\sigma_X^2} = 1.06$$
Note: a short-cut formula for the variance can be computed using just expected values:
$$Var(X) = E(X^2) - [E(X)]^2$$
Proof (writing $\mu$ for $\mu_X$):
$$Var(X) = \sum_i (x_i - \mu)^2\,p(x_i)$$
$$= \sum_i \left(x_i^2 - 2\mu x_i + \mu^2\right) p(x_i)$$
$$= \sum_i x_i^2\,p(x_i) - 2\mu \sum_i x_i\,p(x_i) + \mu^2 \sum_i p(x_i)$$
$$= E(X^2) - 2\mu \cdot \mu + \mu^2 = E(X^2) - 2\mu^2 + \mu^2$$
$$= E(X^2) - \mu^2 = E(X^2) - [E(X)]^2$$
Some properties of variance:
for any constant a, Var(aX) = a^2 Var(X)
for any constant b, Var(X + b) = Var(X)
Notice that the addition of a constant will change the location of the expected value (mean) of X but NOT the variance (spread of values) of X.
Therefore, the video store data can be summarized as follows: A video store recorded the number of videos rented per customer in a week. The average number of videos rented was 1.83 (sd = 1.06), with a range of 1 to 5 or more videos in a week, and 80% of the customers rented 1 or 2 videos in a week.
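As a numerical check of the expected value, the variance, the short-cut formula, and the two variance properties, here is a minimal Python sketch for the video store pmf. The helper names (`mean`, `variance`, `variance_shortcut`, `transform`) are hypothetical and chosen only for this illustration.

```python
from math import isclose, sqrt

# pmf of the video store example: number of videos -> probability.
pmf = {1: 0.50, 2: 0.30, 3: 0.10, 4: 0.07, 5: 0.03}


def mean(pmf):
    """E(X) = sum of x * p(x)."""
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    """Var(X) from the definition: sum of (x - mu)^2 * p(x)."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


def variance_shortcut(pmf):
    """Var(X) = E(X^2) - [E(X)]^2."""
    second_moment = sum(x ** 2 * p for x, p in pmf.items())
    return second_moment - mean(pmf) ** 2


def transform(pmf, a=1.0, b=0.0):
    """pmf of a*X + b (hypothetical helper; assumes a != 0)."""
    return {a * x + b: p for x, p in pmf.items()}


mu = mean(pmf)
var = variance(pmf)
sd = sqrt(var)
print(round(mu, 2), round(var, 4), round(sd, 2))

# Scaling multiplies the variance by a^2; shifting leaves it unchanged.
print(isclose(variance(transform(pmf, a=3)), 9 * var))
print(isclose(variance(transform(pmf, b=10)), var))
```

Both routes to the variance agree, and shifting every value by a constant moves the mean without changing the spread.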
4.5
Named Discrete Probability Distributions
The previous examples were based on collected data, but many phenomena can be described using well known theoretical distributions with associated means and variances. We will discuss two of these distributions: the binomial and Poisson distributions.
4.5.1 The Binomial Distribution
The binomial distribution, one of the most widely used distributions in statistics, is derived from the Bernoulli trial. Named for James Bernoulli, a Bernoulli trial is any experiment that can result in one of two outcomes. Any sequence of experiments (or trials) that meets the following conditions is a Bernoulli process:
1) The trials are identical and can result in one of two mutually exclusive outcomes: success or failure (the labels are arbitrary).
2) The probability of each outcome is constant from trial to trial: p and 1-p.
3) The trials are independent, meaning that the outcome of a single trial does not influence the outcome of any other trial.
Examples: a coin toss, pass or fail, living or deceased, etc.
Given n Bernoulli trials, the binomial random variable is defined as:
X = the number of successes in n trials
Example: An electrical fuse will work or fail to work. Let working be a success = 1 and failing to work be a failure = 0. For two fuses, there are four possible outcomes: 11, 10, 01, 00 (or ss, sf, fs, ff).
It is known that the probability of failure is 10% or 0.10. Consequently, the probability of a success is 90% or 0.90. Therefore, p = 0.90 and 1-p = 0.10. The probability of any single success is 0.90, but we are often interested in a sequence of successes and failures. What is the probability that for two fuses, we observe a success and a success, or 11, denoted P(1,1)?
Note: P(1,1) is a joint probability, but commas will be used in place of the intersection notation.
P(1,1) = pp = 0.90 x 0.90 = 0.81
There is an 81% chance of observing two successes in a row with two fuses. What about a success and a failure, 10, P(1,0)?
P(1,0) = p(1-p) = 0.90 x 0.10 = 0.09
There is only a 9% chance of observing a success followed by a failure with two fuses. Of course, the probability of observing a failure followed by a success is also 0.09 or 9%, since p(1-p) = (1-p)p. What about two failures in a row, 00, P(0,0)?
P(0,0) = (1-p)(1-p) = 0.10 x 0.10 = 0.01
With a failure probability of 0.10, the probability of observing two failures in a row is extremely small; there is only a 1% chance of observing two failures. Notice that the sum of the probabilities of all possible outcomes is 1:
P(1,1) + P(1,0) + P(0,1) + P(0,0) = 0.81 + 0.09 + 0.09 + 0.01 = 1.00
When the number of trials increases to n = 3 fuses, there are 8 possible outcomes:
000 001 010 100 110 101 011 111
The associated probabilities for each outcome are:
Table 4. Outcomes and Probabilities for an n = 3 Bernoulli Trial
outcome    probability
000        (1-p)(1-p)(1-p) = (1-p)^3
001        (1-p)(1-p)p     = (1-p)^2 p
010        (1-p)p(1-p)     = (1-p)^2 p
100        p(1-p)(1-p)     = (1-p)^2 p
110        pp(1-p)         = p^2 (1-p)
101        p(1-p)p         = p^2 (1-p)
011        (1-p)pp         = p^2 (1-p)
111        ppp             = p^3
For a small n, the number of outcomes and associated probabilities can be determined quite easily by hand. However, as n increases, it becomes obvious there is a need for an easy method of counting the number of sequences possible and determining the probability for any given outcome. In order to do this, we must first define two mathematical terms:
1) Factorial, denoted by a!, is the successive multiplication (a)(a-1)(a-2)...(1). By definition 0! = 1.
2) Combination, denoted by nCx or $\binom{n}{x}$, is the number of combinations of n objects that can be formed by taking x of them at a time (when order is immaterial):
$$_nC_x = \binom{n}{x} = \frac{n!}{x!\,(n - x)!}$$
For example, as seen above, the number of ways there can be 1 success in 3 trials is 3 (001, 010, 100), or:
$$_3C_1 = \binom{3}{1} = \frac{3!}{1!\,(3 - 1)!} = \frac{3 \cdot 2 \cdot 1}{1\,(2 \cdot 1)} = \frac{6}{2} = 3$$
The probability of obtaining 1 success in 3 trials is the sum of the probabilities of all the ways you can obtain 1 success in 3 trials, or:
(1-p)^2 p + (1-p)^2 p + (1-p)^2 p = 3(1-p)^2 p
In general:
$$\binom{n}{x}(1 - p)^{n - x}\,p^{x} = \binom{n}{x}\,q^{n - x}\,p^{x}, \quad \text{where } q = (1 - p)$$
For any x, the probability of obtaining x successes in n trials is denoted:
$$f(x) = \begin{cases} \binom{n}{x}\,p^{x}\,q^{n - x}, & \text{for } x = 0, 1, 2, \ldots, n \\ 0, & \text{elsewhere} \end{cases}$$
This expression is called the Binomial Distribution.
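To see that the counting formula agrees with listing every sequence by hand, the following Python sketch computes the probability of x successes both ways for the fuse example (p = 0.90). The function names are illustrative; `math.comb` is the standard-library binomial coefficient.

```python
from itertools import product
from math import comb, isclose


def binom_pmf(x, n, p):
    """f(x) = C(n, x) * p^x * (1 - p)^(n - x) for x = 0..n, else 0."""
    if not 0 <= x <= n:
        return 0.0
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binom_pmf_by_enumeration(x, n, p):
    """Add up the probabilities of every 0/1 sequence with x successes."""
    total = 0.0
    for outcome in product((0, 1), repeat=n):
        if sum(outcome) == x:
            prob = 1.0
            for bit in outcome:
                prob *= p if bit == 1 else (1 - p)
            total += prob
    return total


p = 0.90  # probability that a single fuse works
for x in range(4):
    formula = binom_pmf(x, 3, p)
    listed = binom_pmf_by_enumeration(x, 3, p)
    print(x, round(formula, 4), isclose(formula, listed))
```

Enumeration visits 2^n sequences, so it is only practical for small n; the formula needs a single coefficient per value of x.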
Note that $f(x) \ge 0$ for all real values of x and $\sum_x f(x) = 1$ (See Daniel, p. 88).
Often a b is used in place of an f to denote the pdf and cdf of the binomial distribution: $b(x; n, p)$ denotes the binomial pdf with x successes, n trials, and p equal to the probability of a success. X ~ Bin(n, p) or X ~ B(n, p) denotes that X follows a binomial distribution, and $B(x; n, p)$ denotes the cdf such that:
$$P(X \le x) = B(x; n, p) = \sum_{y=0}^{x} b(y; n, p)$$
Example: If the failure rate of a structural joint is 5%, what is the probability that at least 5 out of 10 joints will fail? First write what you know:
P(# failures at least 5) = P(# fail ≥ 5) = P(# success ≤ 5)
n = 10, p = 0.95, x = 5
$$P(X \le 5) = B(5; 10, 0.95) = \sum_{y=0}^{5} b(y; 10, 0.95)$$
$$= b(0; 10, 0.95) + b(1; 10, 0.95) + b(2; 10, 0.95) + b(3; 10, 0.95) + b(4; 10, 0.95) + b(5; 10, 0.95)$$
$$= \binom{10}{0}(0.05)^{10}(0.95)^{0} + \binom{10}{1}(0.05)^{10-1}(0.95)^{1} + \binom{10}{2}(0.05)^{10-2}(0.95)^{2} + \binom{10}{3}(0.05)^{10-3}(0.95)^{3} + \binom{10}{4}(0.05)^{10-4}(0.95)^{4} + \binom{10}{5}(0.05)^{10-5}(0.95)^{5}$$
$$= 0.0000637 \approx 0.00006$$
What about the probability of exactly 5 joints failing? With 10 joints, exactly 5 failures means exactly 5 successes, so this can be calculated by subtracting the probability of at most 4 successes from the probability of at most 5 successes:
$$P(X = 5) = P(X \le 5) - P(X \le 4) = 0.0000637 - 0.0000028 = 0.0000609$$
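The structural joint probabilities are small enough that hand arithmetic is error-prone, so a quick check is useful. The sketch below follows the b(y; n, p) and B(x; n, p) notation of the notes; the Python function names `b` and `B` are simply chosen to mirror that notation.

```python
from math import comb


def b(y, n, p):
    """Binomial pdf: probability of exactly y successes in n trials."""
    return comb(n, y) * p ** y * (1 - p) ** (n - y)


def B(x, n, p):
    """Binomial cdf: probability of at most x successes in n trials."""
    return sum(b(y, n, p) for y in range(0, x + 1))


n, p = 10, 0.95  # 10 joints, each succeeds with probability 0.95

at_least_5_fail = B(5, n, p)            # at most 5 successes
exactly_5_fail = B(5, n, p) - B(4, n, p)

# The same event counted directly in terms of failures (rate 0.05).
check = sum(b(k, n, 0.05) for k in range(5, n + 1))

print(f"{at_least_5_fail:.7f}", f"{exactly_5_fail:.7f}", f"{check:.7f}")
```

Counting successes with p = 0.95 or failures with p = 0.05 describes the same event, so the two sums agree; both come out near 0.00006.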
In general:
$$P(X = b) = P(X \le b) - P(X \le (b - 1))$$
To determine the probability that X ≥ x, subtract the probability that X ≤ (x-1) from 1:
$$P(X \ge x) = 1 - P(X \le (x - 1))$$
To determine the probability that a ≤ X ≤ b:
$$P(a \le X \le b) = P(X \le b) - P(X \le (a - 1))$$
Clearly, for even marginally large n, this process cannot be easily completed by hand or even using a calculator. There are tables that can be used to look up values but there are also a number of statistical packages and websites that will calculate the cdf and pdf of many distributions given the associated parameters. A good one to use is from UCLA: http://calculators.stat.ucla.edu/cdf/ Be careful to appropriately specify your parameters and make sure you check the far right side of the result to see if the answer is given in scientific notation (such as the one above). For any calculator, make sure you read the instructions to ensure that you know what result the calculator is returning.
The Mean and Variance of the Binomial Distribution
Once the parameters of the binomial distribution have been specified, the mean and variance can be easily calculated:
$$E(X) = np, \qquad Var(X) = np(1 - p) = npq$$
For example:
X ~ B(10, 0.70): E(X) = 10(0.70) = 7, Var(X) = 10(0.70)(0.30) = 2.1
X ~ B(87, 0.42): E(X) = 87(0.42) = 36.54, Var(X) = 87(0.42)(0.58) = 21.19
4.5.2 The Poisson Distribution
The easiest way to think about the Poisson distribution, named for Siméon Denis Poisson, is event counting: counting the number of car accidents at an intersection, the number of errors in a text, the number of new cells, etc. The Poisson process can be described as the occurrence of events over an interval of time or space, such that:
1) The occurrences of events are independent; that is, the occurrence of an event in an interval of space or time has no bearing on the probability of a second occurrence of the event in the same, or any other, interval.
2) An infinite number of event occurrences must be theoretically possible.
3) The probability of a single occurrence of the event in a given interval is proportional to the length of the interval.
4) In any infinitesimally small portion of the interval, the probability of more than one occurrence of the event is ~ 0.
If x is the number of occurrences of a random event in an interval of time, the probability of observing x occurrences is:
$$f(x) = \frac{e^{-\lambda}\,\lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots$$
where $\lambda$ (lambda) is the parameter of the Poisson distn. and e is the constant ~2.7183. As with the binomial dist., $f(x) \ge 0$ for every x and $\sum_x f(x) = 1$.
The Mean and Variance of the Poisson Distribution
Once the parameters of the Poisson distribution have been specified, the mean and variance can be easily calculated:
$$E(X) = \lambda, \qquad Var(X) = \lambda$$
This is a rare distribution where the mean and variance are equal. The Poisson distribution can also be expressed in terms of the Poisson Process. Say you are interested in specifying a specific time interval, such that t = 0 at the start:
1) There exists a parameter $\alpha > 0$ such that for any short time period $\Delta t$, the probability that exactly one event occurs is $\alpha \Delta t + o(\Delta t)$, where $o(\Delta t)/\Delta t$ approaches 0 as $\Delta t$ approaches 0. In other words, $o(\Delta t)$ is negligible.
2) The probability that more than one event occurs during $\Delta t$ is $o(\Delta t)$.
3) The number of events received during $\Delta t$ is independent of the number of events prior to this time interval (memoryless property).
In terms of the Poisson dist.:
$$f(x) = \frac{e^{-\alpha t}\,(\alpha t)^{x}}{x!}, \quad x = 0, 1, 2, \ldots$$
Notice that $\lambda = \alpha t$. The parameter $\alpha$ is called the rate of the process. For example: Suppose at a particular intersection the rate of accidents in a month is 6, or $\alpha = 6$. What is the probability that at least 1 accident occurs in a week? Here t = 0.25 (1/4 of a month) and
$$P(X \ge 1) = 1 - P(X = 0) = 1 - \frac{e^{-6(0.25)}\,(6(0.25))^{0}}{0!} = 1 - \frac{e^{-1.5}\,(1.5)^{0}}{1} = 1 - e^{-1.5} = 0.7769$$
The probability of observing at least one accident in a given week is 78%, or there is a 78% chance of observing at least one accident in a week. This intuitively makes sense, since a month contains 4 weeks and the rate is more than 4 accidents in a month. What is the expected number of accidents in a given week? From the Poisson distn., we know:
$$E(X) = \lambda, \qquad Var(X) = \lambda$$
and since $\lambda = \alpha t$:
$$E(X) = \lambda = \alpha t = 6(0.25) = 1.5, \qquad Var(X) = \lambda = \alpha t = 6(0.25) = 1.5$$
We expect to see 1.5 accidents in a given week, with a variance of 1.5. Clearly, you can't observe 1.5 accidents; this is an average, an expected value. There is no requirement that the mean be an integer.
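The intersection example can be recomputed directly, which also confirms that the weekly mean is 6 x 0.25 = 1.5. This is a minimal sketch; `poisson_pmf` is an illustrative helper, and the variable names `rate` and `t` stand for the monthly rate and the fraction of a month.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """f(x) = e^(-lam) * lam^x / x! for x = 0, 1, 2, ..."""
    return exp(-lam) * lam ** x / factorial(x)


rate = 6.0       # accidents per month
t = 0.25         # one week, as a fraction of a month
lam = rate * t   # expected accidents per week

p_at_least_one = 1 - poisson_pmf(0, lam)

# Mean and variance from the pmf itself (the tail beyond 60 is negligible).
xs = range(60)
mean = sum(x * poisson_pmf(x, lam) for x in xs)
var = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in xs)

print(lam, round(p_at_least_one, 4), round(mean, 4), round(var, 4))
```

The mean and variance recovered from the pmf both match lam, illustrating the equality of mean and variance for the Poisson distribution.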
Summary
We have discussed discrete probability distributions, specifically the binomial and Poisson distributions. These distributions focus on the counting of events: successes or failures, or successive single events. In Section 5, we will discuss continuous random variables, or variables that can take on any value within a specified interval of values.
Section 5.
Continuous Probability Distributions
The binomial and Poisson distributions are examples of discrete distributions, where a random variable can assume only a select number of values. A continuous random variable is a variable that can assume any value within a specified range, and therefore an infinite number of values. Statistically, a continuous random variable is a rv whose set of possible values is an entire interval of numbers from A to B, where A ≤ B. Examples include the height of trees in a forest, the time to run a specified distance, age, etc.
The probability distribution or probability density function (pdf) of X is a function f(x) such that for any two numbers a and b with a ≤ b:
$$P(a \le X \le b) = \int_a^b f(x)\,dx$$
The probability that X takes on a value in the interval [a, b] is the area under the graph of the density function, also referred to as the density curve, with the following conditions:
1. $f(x) \ge 0$ for all x
2. $\int_{-\infty}^{+\infty} f(x)\,dx$ = area under the entire curve f(x) = 1
For example:
[Figure: density curve f(x) plotted against x, with the points a and b marked]
Figure 1. Continuous Distribution
Copyright Stacey S. Cofield, 06 September 2004. All rights reserved.
The cumulative distribution function (cdf), F(x), is defined for every x:
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(y)\,dy$$
as the area under the curve to the left of x.
5.1
Expected Value and Variance of Continuous Distributions
The expected value, or mean, of a continuous random variable X with a pdf f(x) is:
$$\mu_X = E(X) = \int_{-\infty}^{\infty} x\,f(x)\,dx$$
If X is a continuous rv with pdf f(x) and h(X) is any function of X, then:
$$E[h(X)] = \int_{-\infty}^{\infty} h(x)\,f(x)\,dx$$
The variance of a continuous rv X with a pdf f(x) and mean value $\mu$ is:
$$\sigma_X^2 = V(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = E\left[(X - \mu)^2\right] = E(X^2) - [E(X)]^2$$
5.2
The Uniform Distribution
There are a number of named continuous distributions. The least complex is the Uniform distribution, a distribution that has constant probability density. A continuous rv X is said to have a uniform distribution on the interval [A, B] if the pdf of X is:
$$f(x; A, B) = \begin{cases} \dfrac{1}{B - A} & A \le x \le B \\ 0 & \text{otherwise} \end{cases}$$
The probability of X equaling any single number c is P(X = c) = 0, and for any two numbers a and b with a < b:
$$P(a \le X \le b) = P(a < X \le b) = P(a \le X < b) = P(a < X < b)$$
The probability assigned to any particular value is 0, and the probability of an interval does not depend on whether either endpoint is included. This follows because the area under the curve at any single value is 0, and therefore the area at either endpoint is also zero.
Example: Suppose a bus arrives at your stop every 10 minutes. Because the time you leave your house varies, you don't always get to the bus stop at the same time. Therefore, the time spent waiting for the bus, X, is a continuous random variable. What is the probability that you wait between 1 and 3 minutes?
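$$f(x; 0, 10) = \begin{cases} \dfrac{1}{10} & 0 \le x \le 10 \\ 0 & \text{otherwise} \end{cases}$$
$$P(1 \le X \le 3) = \int_1^3 f(x)\,dx = \int_1^3 \frac{1}{10}\,dx = \left[\frac{x}{10}\right]_{x=1}^{x=3} = \frac{3}{10} - \frac{1}{10} = \frac{2}{10} = \frac{1}{5} = 0.20$$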
The probability that I will wait between 1 and 3 minutes for a bus is 0.20 or 20%.
[Figure: uniform pdf of height 1/10 on the interval 0 to 10, with the points 1 and 3 marked]
Figure 2. The pdf for Bus Stop Example
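The bus stop probability is simply the area of a rectangle, which the following Python sketch computes two ways: from the uniform cdf and from a crude numerical integration of the pdf. The function names and the midpoint-rule step count are illustrative assumptions.

```python
from math import isclose


def uniform_pdf(x, A, B):
    """Density 1/(B - A) on [A, B], 0 elsewhere."""
    return 1.0 / (B - A) if A <= x <= B else 0.0


def uniform_cdf(x, A, B):
    """P(X <= x) for a uniform rv on [A, B]."""
    if x <= A:
        return 0.0
    if x >= B:
        return 1.0
    return (x - A) / (B - A)


def integrate(f, a, b, steps=1000):
    """Midpoint-rule approximation of the area under f between a and b."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


A, B = 0.0, 10.0  # a bus arrives every 10 minutes
exact = uniform_cdf(3, A, B) - uniform_cdf(1, A, B)
approx = integrate(lambda x: uniform_pdf(x, A, B), 1, 3)

print(round(exact, 2), round(approx, 2))
```

Because the density is flat, the numerical area matches the exact answer; for other densities the midpoint rule would only approximate it.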
5.3
The Normal Distribution
The most well known distribution is the normal distribution, also referred to as the Gaussian distribution, after Carl Friedrich Gauss, or the bell curve, after the shape of the distribution:
[Figure: pdf and cdf curves of the normal distribution]
Figure 3. Pdf and CDF of the Normal Distribution [1]
The normal density is given by:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x - \mu)^2 / (2\sigma^2)}, \quad -\infty < x < \infty$$
where $\pi$ and e are the constants ~3.14 and ~2.72, and $\mu$ and $\sigma$ are the mean and standard deviation, respectively. The following are some characteristics of the normal distribution:
- the distribution is symmetric about the mean
- the mean = median = mode
- large standard deviations result in flatter curves
- smaller standard deviations result in taller curves
- $\mu \pm \sigma$ contains ~68% of the distribution area
- $\mu \pm 2\sigma$ contains ~95% of the distribution area
- $\mu \pm 3\sigma$ contains ~99.7% of the distribution area
- the mean and standard deviation completely define the distribution
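The 68%, 95%, and 99.7% figures listed above can be verified with Python's standard-library `statistics.NormalDist`. The helper `area_within` is a hypothetical name introduced here, and the second set of parameters (mean 10, standard deviation 3) is an arbitrary illustration.

```python
from statistics import NormalDist


def area_within(k, mu=0.0, sigma=1.0):
    """Area under the normal curve between mu - k*sigma and mu + k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    # The same area results for any mean and standard deviation.
    print(k, round(area_within(k), 4), round(area_within(k, mu=10, sigma=3), 4))
```

The areas depend only on the number of standard deviations k, not on the particular mean or standard deviation, which is why a single standard normal table suffices.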
[1] Eric W. Weisstein. "Gamma Distribution." From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com
The standard normal distribution is a normal distribution with a mean = 0 and standard deviation = 1:
$$f(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}, \quad -\infty < z < \infty$$
where $z = \dfrac{x - \mu}{\sigma}$.
For example, 10000 data points were generated using a normal distribution with a mean = 0 and a standard deviation = 1:
[Figure: histogram of the generated data; y-axis: Probability]
Figure 4. Example of a Histogram from Normally Distributed Data
[Figure: cumulative distribution of the generated data; y-axis: Cum Prob]
Figure 5. Example of a CDF from Normally Distributed Data
It is of interest to note that though the data were generated using a normal distribution with mean = 0 and standard deviation = 1, the estimated mean = -0.02 and the estimated standard deviation = 0.999. Since the entire area under the curve of a standard normal distribution is 1 unit, the relationship between normal distributions and the standard normal distribution can be used to determine the probability of z being ≤, <, ≥, >, or = to any z0. We'll explore this further when we begin hypothesis testing.
Other commonly used continuous distributions are:
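The Gamma Distribution:
$$P(x) = \frac{x^{\alpha - 1}\,e^{-x/\theta}}{\Gamma(\alpha)\,\theta^{\alpha}}$$
with a mean = $\alpha\theta$ and a variance = $\alpha\theta^2$ and pdf and cdf:
[Figure: gamma distribution pdf and cdf curves]
Figure 6. Gamma Distribution pdf and cdf for Given Parameters [1]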
The Exponential Distribution:
$$P(x) = \lambda\,e^{-\lambda x}$$
with a mean = $\dfrac{1}{\lambda}$ and a variance = $\dfrac{1}{\lambda^2}$ and a pdf and cdf:
[Figure: exponential distribution pdf and cdf curves]
Figure 7. Exponential Distribution pdf and cdf [1]
The Chi-Squared Distribution:
$$P_r(x) = \frac{x^{(r/2) - 1}\,e^{-x/2}}{\Gamma\!\left(\tfrac{1}{2} r\right) 2^{r/2}}$$
with mean = r and variance = 2r and a pdf and cdf:
[Figure: chi-squared distribution pdf and cdf curves]
Figure 8. Chi-Squared Distribution pdf and cdf for Given r [1]
The chi-squared distribution is the basis for a number of statistical procedures used throughout this course and will be discussed in more detail in hypothesis testing and inference. It is of note that the chi-squared distribution is a member of the Gamma family with $\theta = 2$ and $\alpha = r/2$.
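The statement that the chi-squared distribution is a Gamma distribution with scale 2 and shape r/2 can be checked numerically. The sketch below assumes the shape/scale form of the gamma density (shape `alpha`, scale `theta`); the function names and the evaluation points are illustrative.

```python
from math import exp, gamma, isclose


def gamma_pdf(x, alpha, theta):
    """Gamma density with shape alpha and scale theta, for x > 0."""
    return x ** (alpha - 1) * exp(-x / theta) / (gamma(alpha) * theta ** alpha)


def chi2_pdf(x, r):
    """Chi-squared density with r degrees of freedom, for x > 0."""
    return x ** (r / 2 - 1) * exp(-x / 2) / (gamma(r / 2) * 2 ** (r / 2))


def exponential_pdf(x, lam):
    """Exponential density with rate lam, for x >= 0."""
    return lam * exp(-lam * x)


for r in (1, 2, 5):
    for x in (0.5, 1.0, 3.0):
        same = isclose(chi2_pdf(x, r), gamma_pdf(x, alpha=r / 2, theta=2))
        print(r, x, same)

# The exponential is also a gamma member: shape 1, scale 1/lam.
print(isclose(exponential_pdf(1.2, 0.5), gamma_pdf(1.2, alpha=1, theta=2.0)))
```

Agreement at a handful of points is not a proof, but substituting alpha = r/2 and theta = 2 into the gamma density reproduces the chi-squared formula term by term.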
Section 6.
Hypothesis Testing
So far, we have defined probability for discrete and continuous events, how to calculate probability, and the underlying distributions associated with common events. The most commonly used distributions in applied statistics are the normal, t, F, and chi-squared distributions. We will use these distributions to calculate p-values: the probability of seeing a result at least as extreme as the one observed if chance alone were responsible. We'll begin by reviewing the steps of hypothesis testing (introduced in Section 1) and comparing proportions.
Review Steps to Answer a Question (AMB)
Phase 1: State the Question
1. Evaluate and describe the data
2. Review the assumptions
3. State the question, in the form of hypotheses
Phase 2: Decide How to Answer the Question
4. Decide on a summary number (a statistic) that reflects the question
5. How could random variation affect that statistic?
6. State a decision rule, using the statistic, to answer the question
Phase 3: Answer the Question
7. Calculate the statistic
8. Make a statistical decision
9. State the substantive conclusion
Phase 4: Communicate the Answer to the Question
10. Document our understanding with text, tables, or figures
Copyright Stacey S. Cofield, 13 September 2004. All rights reserved. AMB materials reprinted courtesy of Al Best, Virginia Commonwealth University
Let's revisit the CPR example:
CPR by phone: In an urban setting, about 6% of out-of-hospital cardiac arrest victims survive to hospital discharge. Survival can increase if a bystander witnesses the arrest and administers cardiopulmonary resuscitation (CPR ~ 15 chest compressions and 2 breaths 4 times per minute). But this happens no more than 50% of the time. From the literature, when CPR is administered by a non-EMT, survival probability can increase to at least 9%. In a study in the Seattle area [1], emergency response personnel devised a way for dispatchers to instruct a bystander in CPR over the phone. They found that 29 of 278 CPR patients survived to discharge from the hospital.
Question: Does dispatcher-instructed bystander-administered CPR improve the chances of survival?
Phase 1: State the Question
The initial phase is to clearly state the question. Though this appears simple, it is the most crucial phase of the process. A poorly defined question will complicate the analysis and the interpretation of results. In order to clearly and concisely state the question or questions, we often have to begin by understanding the data.
1. Evaluate and Describe the Data
Evaluating and describing the data can help to determine the types of questions that can be answered from the data. It is not uncommon for a researcher to pose a question that cannot be directly addressed using the data they have collected. You should begin by determining where the data came from: the population or sample used, the data collection procedures, the amount of data collected, etc. You should also determine what was observed, meaning the observed statistic(s) from the data, such as the mean and variance of a continuous variable or the proportions of responses for categorical data (e.g. 45% Yes, 55% No).
Note: Keep in mind that evaluating the study procedures is a different question than evaluating the data.
[1] A. Hallstrom, L. Cobb, E. Johnson, and M. Copass (2000). Cardiopulmonary resuscitation by chest compression alone or with mouth-to-mouth ventilation. The New England Journal of Medicine, 342(21), 1546-1553.
A superscript page number (e.g. 20) denotes a reference to The Statistical Sleuth and the associated page number.
Statistic: a statistic is any quantity that can be calculated from the observed data (The Statistical Sleuth, p. 20).
Where did the data come from? For the CPR study, to determine where the data came from, we must rely on the researchers' description. From our in-class discussion of the article, we determined that the study variables were clearly defined, the sample was representative of cardiac-arrest victims, and the measurements were not biased. Bias could result, for instance, if only the easier cardiac-arrest cases were included in the study (AMB).
What are the observed statistics? Often there are numerous observed statistics resulting from data, but what we're interested in is the observed statistics that will help to address our main issue: Does dispatcher-instructed bystander-administered CPR improve the chances of survival? Since survival was measured as survival to hospital discharge, our main variable of interest is a Yes or No response (dichotomous variable). From this variable, we would like to compare the proportion of each group that survived, or compare the Yes responses in the two groups. From the article, 29 of 278 CPR patients survived. How can we summarize this statistically? The proportion of people who survived, $\hat{p}$, is:
$$\hat{p} = \frac{x_{survived}}{n_{phone}}$$
where $x_{survived}$ is the number of survivors in the phone-CPR group and $n_{phone}$ is the total number of subjects receiving phone-CPR. In the NEJM paper, the observed proportion surviving is:
$$\hat{p} = \frac{x_{survived}}{n_{phone}} = \frac{29}{278} = 0.104 \approx 10\%$$
Approximately 10% of cardiac-arrest victims receiving phone-CPR survived to hospital discharge.
2. Review Assumptions (AMB)
The next step toward stating the question is to understand what assumptions we're making. In this case study, the main assumptions are that the cardiac arrest victims in this study are representative of cardiac arrest victims in general and that each victim's experience is independent of the next. Since the study had well-designed entry criteria and no subjects were excluded for arbitrary reasons, we can conclude that the subjects in this study are representative of others who may have an emergency response system similar to that of Seattle. (For instance, we would not generalize this study to a rural environment where the time to get to the hospital is longer.) (AMB)
Stating the assumptions will help you in determining what type of generalizations you can make and to what population. The assumptions may also help you decide what type of statistical test to use.
3. State the Question
Now that we have summarized the data and stated the assumptions, we can state the question(s) to be answered using the observed data. As stated above, we want to know if the phone-CPR survival proportion is higher than that of the standard method. Using our observed proportion: Is the observed 10.4% phone-CPR survival rate higher than the old survival rate of 6%? The easy answer is yes, the observed proportion is numerically larger; after all, 10.4 > 6.0. But what we're actually interested in determining is whether the difference could be due to chance. If we sampled this population again and again and again, would we continue to see proportions like 10.4%, or was this an extreme observation? What we are really after is to determine whether 10.4% is statistically different from 6%. Assume that the intervention was something as simple as asking the bystander to wait for EMS. We would expect to see a 6% survival rate.
To answer the question statistically, we need to define the question in terms of "no different" versus "different". The "no different" statement is often called the null hypothesis, H0. We will always assume the null hypothesis: that nothing has happened and the observed difference is due to chance. In other words, we assume the status quo. Assume that the intervention was something as simple as asking the bystander to wait for EMS. We would expect to see a 6% survival rate. Webster's on-line dictionary defines the status quo as "the existing condition or state of affairs" [2]. We'll continue to believe the null hypothesis until we are shown enough evidence to adjust the status quo. The key is to define the null hypothesis as simply as possible in order to make the decision to alter the status quo as easy as possible. If the observed results can be due to chance, we'll have no reason to reject the null hypothesis. A statistician's simplest possible explanation is randomness.
A null hypothesis is the simplest explanation of events: There is no difference. There is no change. There is no improvement. Nothing unusual is occurring. AMB
If random noise, measurement error, or chance occurrence can account for the observed data, then there is no need to reject the null hypothesis. The null hypothesis is a statement about a population, NOT about the observed data. A statistical hypothesis is a way to state a null hypothesis so that it can be evaluated using statistical techniques. Usually, we will try to contradict the null hypothesis. We will try to use the observed data to show that the observed difference is likely not due to chance and that there is enough evidence to change what we believe about the population.
2
http://dictionary.reference.com/
For the CPR study: Let's assume that any observed improvement over the 6% survival rate is purely due to chance. We want to state the null hypothesis in terms of the assumed 6% or less survival rate. Two possible ways to state the null hypothesis are: H0: true percentage surviving is ≤ 6%, or H0: true proportion surviving is ≤ 0.06. We can test either hypothesis; however, it is easier to work with proportions. Therefore, we'll believe that the true rate of survival is 0.06, unless we are shown enough evidence to believe the contrary, that the rate of survival is greater than 0.06. Now we can state both what we currently believe and what we are trying to show in the form of hypotheses:
H0: true proportion ≤ 0.06
HA: true proportion > 0.06
where HA is the alternative hypothesis.
The alternative hypothesis is the statement we hope to be able to conclude. It is the statement about a population that is true if the null hypothesis is not true. The two are complementary: one and only one of the two hypotheses is true. The alternative hypothesis is also called the research hypothesis. AMB
One and only one of the two hypotheses is true, so how do we prove which hypothesis is true? This is where it gets tricky: we don't prove anything directly. What we will use is Proof by Contradiction. We'll see if the data lead us to conclude the alternative hypothesis, thereby contradicting the null hypothesis. The process is as follows:
Define the state of truth as either the status quo or its contrary, and state these as the null hypothesis and the alternative hypothesis.
Collect data.
Assess the likelihood of observing the collected data assuming (under) the null hypothesis.
If the observed data is what we expect (within reason) to see, then continue to believe, or fail to reject, the null hypothesis.
If the observed data is not what we expect (outside of reason), then cease believing, or reject, the null hypothesis in favor of the alternative hypothesis.
Put simply, if the observed data is reasonably similar to the simple assumption, we'll continue to believe the simple explanation (of randomness). If believing the simple explanation seems very unreasonable when presented with the data, then we'll reject the null hypothesis in favor of its alternative.
When forced to choose between a simple conceptual model and real-world data that clearly contradicts that model, we choose to trust the data. AMB
Let's look at how the hypotheses can be defined (AMB):
Null Hypothesis | Alternative Hypothesis
Simpler model | More complicated model
No difference, no change, no relationship | Difference, change, relationship
Nothing is happening; random variation could explain observed data | It is very unlikely that random variation would produce the observed data
We fail to reject the current state | We reject the current state and accept the alternative state
We believe this if p-value ≥ 0.05 | We believe this if p-value < 0.05
Now that we have defined the question, how do we answer it? We first need to determine the statistic that we will use to test the hypothesis, then we need to assess the effect of random variation, and then we need to state the rule we will use to make the decision. It is important to note that the test statistic and decision rule should be well defined prior to actual calculation of the statistic and p-value.
Phase 2: Decide How to Answer the Question
In Phase 2, we'll decide how to answer the question. We'll consider all the outcomes that could happen and assess the impact of random variation. To do this, we look at all the different values that a summary, or test, statistic could assume. Then we decide which values reflect the null hypothesis and which values reflect the alternative hypothesis. All of the steps in Phase 2 are completed prior to actually answering the question. It is very important to state which values of the test statistic will support the null hypothesis and which values will support the alternative hypothesis prior to making a decision based upon the observed data. This will help you, as a statistician, make an impartial decision.
4. Decide on a summary statistic
In Step 1, as a part of describing the data, we chose the observed statistic that best summarizes the question we are attempting to answer. In this step, we will use that statistic to answer the question. For the CPR study, we have decided to look at p, the proportion surviving. In most cases, we take our simple summary statistic and transform it into another number that is (mathematically) easier to use. For simplicity, we'll begin by using p. So, our summary statistic is p, the proportion surviving to hospital discharge. If the null hypothesis is true, we expect p to be approximately 0.06, or about 6%.
5. How could random variation affect the observed proportion surviving?
Let's assume the null hypothesis is true, that the proportion surviving is 0.06. What would we expect to see in repeated studies? Do we expect to see exactly 0.06 every time? What about 0.07 or 0.058? Since we are using a sample of the population, and since there is inherent variability in sampling, it is likely we'll see results near 0.06. We may see 0.071 or 0.05 as survival rates. How different, or how far away from 0.06, would be surprising? What about 0.08 or 0.04? Or 0.10? So, assuming that the CPR intervention is no different than the current intervention and that the true proportion surviving really is 0.06, what survival proportions would we expect to observe if 1) the null hypothesis is true, and 2) we repeated the study over and over again?
A simulation
AMB Try this thought experiment. Consider what we might observe if we ran this study again. And again. And again. And again... How often would we observe various proportions surviving if we ran this study 1000 times, each time on n = 278 subjects, where each subject has a 0.06 chance of surviving?
Note: every time this study is done you'll observe a different number of survivors, so the estimated proportion surviving will be different. What we actually observed in this study is just one of a number of possibilities of what we could have observed. If it helps, think of it this way: Imagine a coin that comes up heads 6 times out of 100 and tails 94 times out of 100. Put 278 of these (weird) coins in a bag. Shake up the bag. Dump the coins out on the table and count how many came up heads. (We will not get exactly 6% of the coins turning up heads every time we do this.) Each coin represents a person. Each head represents a person surviving. The bag of coins dumping out on the table represents one realization of the experiment. Repeat the experiment 1000 times. Table 1.1 shows the results of this approach.
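The bag-of-coins experiment can be sketched in a few lines of code. This is only an illustrative sketch, not the program used to produce the tables in these notes: the function name and the seed are our own assumptions, and a different seed (or a different random number generator) will give different counts than those reported in the tables.

```python
import random
from collections import Counter

N_SUBJECTS = 278      # subjects per experiment (from the study)
P_SURVIVE = 0.06      # survival probability assumed under the null hypothesis
N_EXPERIMENTS = 1000  # number of times the experiment is repeated

def simulate_survivor_counts(seed=2004):
    """Return the number of survivors in each simulated experiment."""
    rng = random.Random(seed)  # arbitrary fixed seed, only for reproducibility
    counts = []
    for _ in range(N_EXPERIMENTS):
        # Each subject is one "weird coin" that lands heads with probability 0.06.
        survivors = sum(rng.random() < P_SURVIVE for _ in range(N_SUBJECTS))
        counts.append(survivors)
    return counts

counts = simulate_survivor_counts()
proportions = [x / N_SUBJECTS for x in counts]

# Tally how many experiments produced each number of survivors.
tally = Counter(counts)
for x in sorted(tally):
    print(f"x = {x:2d}  p = {x / N_SUBJECTS:.3f}  experiments = {tally[x]}")
```

The unsorted list `counts` plays the role of the raw simulation results, and `tally` summarizes how often each number of survivors occurred.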
In experiment 1, n = 278 subjects were given CPR and x = 17 of them survived, or p1 = 0.061. In experiment 2, twenty survived, giving p2 = 0.072. And so on through experiment 1000, where p1000 = 0.065. Recall that we're doing this simulation to see how much variability we would expect to see in this one particular study, assuming the true proportion of survivors is 0.06. One answer to this question comes by looking at the range of values we observed; that is, the smallest proportion surviving and the largest proportion surviving (in our 1000
simulations). Rearranging Table 1.1 by sorting from the fewest survivors to the most survivors gives Table 1.2.
Here we see that the observed proportions range from p = 0.018 (in one experiment only five out of 278 survived) up to p = 0.108 (in two experiments), an observed proportion almost twice the true, underlying survival rate of 0.06. So, one answer to the question "how much variability do we expect?" is: proportions between 0.018 and 0.108.
How much variability? One problem with this answer is that not all values between 0.018 and 0.108 are equally likely. By inspecting the above table, we observe that 7 survivors happened seven times as often as 6 survivors (there is one 6 in the above table and seven 7s). Similarly, 29 survivors happened in only one simulation but 28 survivors happened three times. If we count up the number of experiments where we
observed 5 survivors, 6 survivors, ..., 30 survivors, then these counts may be displayed as in Table 1.3.
From this we observe that it is far more common to observe 16 survivors (it happened 104 times) than to observe 5 survivors (it happened only once out of 1000 simulations). We can show the distribution of these 1000 proportions in a histogram (see Figure 1.1). Note that the most common proportions are around 0.06; for instance, p = 0.058 occurred 104 times and p = 0.061 occurred 88 times.
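For comparison with the simulated counts, the number of experiments we would expect to see at each outcome can also be computed exactly. The sketch below assumes the number of survivors follows a binomial distribution with n = 278 and p = 0.06 (this is the model the coin simulation implements; the exact calculation itself is our addition and does not appear in the original tables). Under this model the single most likely outcome is 16 survivors, which agrees with the simulation.

```python
from math import comb

def binomial_pmf(x, n=278, p=0.06):
    """Probability of exactly x survivors among n independent subjects."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Expected number of experiments (out of 1000) showing each survivor count.
expected_per_1000 = {x: 1000 * binomial_pmf(x) for x in range(5, 31)}
for x, expected in expected_per_1000.items():
    print(f"x = {x:2d}  p = {x / 278:.3f}  expected experiments = {expected:6.1f}")

# The single most likely number of survivors under the null hypothesis.
most_likely = max(range(279), key=binomial_pmf)
print("most likely number of survivors:", most_likely)
```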
Recall that the null hypothesis is true in the above simulation. The true underlying survival proportion is exactly 0.06 (6%). We randomly sample n = 278 from this population and observe some proportion surviving, from 5/278 surviving all the way up to 30/278 surviving. Thus the figure shows what we would expect to see if the null hypothesis is true.
How much variability?
How much variability do we expect? We see, in Figure 1.2, that typical values cluster around the middle.
The single most common proportion, 0.058, occurred only 10.4% of the time. Wider ranges around this middle capture more typical values: 29% of the time we observed proportions between 0.054 and 0.061, and over 46% of the time between 0.050 and 0.065. So, how much variability do we expect? If the true underlying proportion is 0.06, then more than half the time (61.3%) we'll observe values between 0.047 and 0.068. That's the end of our thought experiment. Now we have some feel for how much variability to expect if the null hypothesis is true.
What would we believe if ...?
It's useful to consider what we would choose to believe if various outcomes occur. If we'd observed x = 16 survivors out of n = 278, then the proportion surviving, p, would be 0.058. Assuming the null hypothesis is true, what is the probability of this outcome or one more extreme than this? By more extreme we mean proportions this large (0.058) or larger, that is, outcomes favoring the alternative hypothesis. From Figure 1.3 we see that observing p ≥ 0.058 would be rather common if the null hypothesis is true.
In our simulation of 1000 experiments, we saw p = 0.058 (x = 16 survivors) in 104 experiments. We saw p = 0.061 (x = 17 survivors) in 88 experiments. We saw p = 0.065 (x = 18 survivors) in 88 experiments. And so on. Overall, in 1000 experiments, we saw 613 outcomes this extreme or more extreme. So, observing 16 survivors or more is fairly common; it happened in 61.3% of the experiments.
What would this mean? Observing this, what would this scenario imply about the two possible states of nature we are considering? There are two choices. Choose to believe:
1. The survival proportion is 0.06 (or less). To choose this, we're saying that an event occurred that is within what we'd expect if the null hypothesis is true.
2. The survival proportion is larger than 0.06. To choose this, we're saying that the observed data (survival p = 0.058) is compelling evidence that the null hypothesis is not true.
After observing only 16 survivors, it would be extremely difficult to defend choice 2. We simply can't treat an observed proportion less than 0.06 as evidence that the real proportion is greater than 0.06. Observing p = 0.058 clearly supports the null hypothesis. We'd choose to stay with our preference that the actual proportion is unchanged at 0.06.
6. State a decision rule to answer the question
Some of the possible outcomes will lead us to believe the true proportion is 0.06 or less. In this case, we will fail to reject the null hypothesis and conclude that the true proportion is 0.06 or less. Other outcomes lead us to believe that this is unlikely and, instead, to conclude that the null hypothesis is not true. In this case, we will reject the null hypothesis in favor of the alternative that the true proportion is greater than 0.06. But which outcomes lead us to fail to reject the null hypothesis? And which lead us to reject the null hypothesis in favor of the alternative? The answer to these questions forms our decision rule.
In our simulation of 1000 experiments, we can see in Figure 1.4 how often we'd observe each of an increasing proportion of survivors. Consider three possible outcomes: x = 17 survivors, x = 18 survivors, or x = 19 survivors. The simulation shows that about half of the time (509/1000) we'd see proportions of p = 0.061 or greater (x = 17 or more survivors). About 40% of the time (421/1000) we'd see proportions of p = 0.065 or greater (x = 18 or more survivors). About a third of the time (333/1000) we'd see proportions of p = 0.068 or greater (x = 19 or more survivors).
Each of these observations is close to p = 0.06 and is considered an expected observation. That is, if we observed p = 0.068 we would fail to reject the null hypothesis. These calculations may also be seen in Table 1.4.
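The "this many survivors or more" proportions can also be computed exactly rather than read off a simulation. The sketch below is our own illustration and assumes a binomial model with n = 278 and p = 0.06; the exact values will be close to, but not identical to, the simulated proportions, because the simulation is itself subject to random variation.

```python
from math import comb

def upper_tail(k, n=278, p=0.06):
    """P(X >= k): probability of k or more survivors among n subjects."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

# Probability of observing at least k survivors when the null hypothesis is true.
for k in (16, 17, 18, 19):
    print(f"x >= {k} (p >= {k / 278:.3f}): probability = {upper_tail(k):.3f}")
```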
Another scenario to consider
On the other hand, even assuming a real proportion of 0.06, two of the experiments had a survival proportion of p = 0.108. That is, in 2 out of the 1000 simulated experiments we observed a proportion this large. What if this is what we observe? We can choose:
1. The survival proportion is 0.06 or less. To choose this, we're saying that a rare event occurred (two in a thousand) but that this is within what we'd expect if the null hypothesis is true.
2. The survival proportion is larger than 0.06. To choose this, we're saying that the observed data (survival p = 0.108) is compelling evidence that the null hypothesis is not true.
For choice 1, we're saying that our preference for the null hypothesis is so strong that, in order to disbelieve it, we'll only accept evidence that would happen by chance in fewer than 2 out of 1000 trials. Choice 2 implies that an observed result of p = 0.108 is so significant a departure from what we'd observe by chance that we'll reject the null hypothesis in favor of the alternative hypothesis, that the survival rate is > 0.06.
The significance level is represented by the Greek symbol alpha, α. It is the probability of rejecting a true null hypothesis [3].
The researcher chooses the risk of making this error prior to conducting the analysis. The significance level is the percentage or proportion of time the researcher is willing to conclude that the null hypothesis is false when it really is true. The most frequent values are α = 0.05, 0.01, 0.025, or 0.10. If α = 0.05, then 5% of the time we are willing to reject the null hypothesis when, in fact, the null hypothesis is true.
3
Daniel, p.205
Researchers are often comfortable with more risk than one out of a thousand. Usually, we tolerate a scientific process that will reject a true null hypothesis 5% of the time. That is, a typical significance level chosen for rejecting the null hypothesis is 1/20 or 0.05. Consequently, our level of preference for the null hypothesis is 19/20, or 95%, or 0.95. This means that 95% of the time we will correctly fail to reject the null hypothesis when the null hypothesis is true. It is very important to note that, no matter what significance level α is chosen, α of the time (for example, 0.05 or 5%) we will incorrectly reject a true null hypothesis. That is, when the null hypothesis is true, we will reject it in favor of the alternative hypothesis at the rate α. The level of α defines our decision rule:
We will fail to reject the null hypothesis when the p-value ≥ α.
We will reject the null hypothesis when the p-value < α.
The decision rule
From the simulation, which observations will lead us to reject the null hypothesis 5% of the time or less? We see from Table 1.3 that x ≥ 24 survivors occurred in 49 of 1000 experiments. Thus, choosing this as a cut-off will result in a significance level of α = 0.049. Since 0.049 is less than the desired α = 0.05, this should be our cut-off. Therefore, our decision rule (based on this simulation) is:
H0: The population survival proportion is 0.06 or less, if the observed proportion p ≤ 0.083 (x = 23 survivors or fewer).
HA: The population survival proportion is larger than 0.06, if the observed proportion p ≥ 0.086 (x = 24 or more survivors).
For studies looking at the question "Does dispatcher-instructed bystander-administered CPR improve the chances of survival?", we can say that we believe the survival rate is unchanged from 0.06 unless the observed outcome is p ≥ 0.086 (x = 24 or
more survive). In this case, we'll choose to reject the null hypothesis in favor of the alternative hypothesis that the survival probability is larger than 6%. This is based on our 1000-study simulation results. (We will extend this to situations without simulated data.)
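The search for a cut-off can be written as a small loop: find the smallest number of survivors whose "this many or more" probability under the null hypothesis does not exceed the chosen significance level. The sketch below is our own illustration using exact binomial probabilities instead of the simulated counts; the function names are hypothetical. Because the tail probability at the simulated cut-off (0.049) sits very close to 0.05, the exact cut-off may or may not coincide with the simulated one.

```python
from math import comb

def upper_tail(k, n, p0):
    """P(X >= k) when each of n subjects survives with probability p0."""
    return sum(comb(n, x) * p0**x * (1 - p0)**(n - x) for x in range(k, n + 1))

def rejection_cutoff(n, p0, alpha):
    """Smallest k such that observing k or more survivors has probability <= alpha."""
    for k in range(n + 2):  # k = n + 1 has probability 0, so the loop always returns
        if upper_tail(k, n, p0) <= alpha:
            return k

cutoff = rejection_cutoff(n=278, p0=0.06, alpha=0.05)
print(f"reject H0 when x >= {cutoff} (p >= {cutoff / 278:.3f})")
print(f"attained significance level: {upper_tail(cutoff, 278, 0.06):.4f}")
```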
Phase 3: Answer the Question
Now that we have decided on the test statistic (p, the observed proportion surviving), assessed the effect of variability, and stated the decision rule, we can answer the question with our observed data. Recall the question: Does dispatcher-instructed bystander-administered CPR improve survival? Our two hypotheses are:
H0: true proportion ≤ 0.06
HA: true proportion > 0.06
7. Calculate the statistic
In the actual study, we observed a proportion of 0.104.
8. Make a decision
Based on our simulated results, our observed proportion falls under the decision rule for rejecting the null hypothesis in favor of the alternative hypothesis. That is, we choose to believe the alternative hypothesis, HA: the survival proportion is > 0.06, since the observed proportion p ≥ 0.086 (x = 24 or more survivors). How often, in our 1000 random experiments above, did we see p ≥ 0.104? This occurred in 3 out of the 1000 simulated experiments. So, if the survival rate really is 6%, it would apparently be a rare event to observe 10.4% survival. The p-value for our experiment is thus approximately 3 in 1000 (p-value = 0.003). Recall that this calculation assumes that the null hypothesis is true: That the
survival proportion is really 0.06. It is important to note that it is possible to observe proportions as large as 0.104, but the likelihood is very small (three in a thousand).
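The same counting argument can be repeated with more simulated experiments to get a steadier estimate of the p-value. The sketch below is our own illustration: the function name, the seed, and the choice of 10,000 experiments are assumptions, and the printed estimate will vary with the seed.

```python
import random

def simulated_p_value(x_obs=29, n=278, p0=0.06, n_experiments=10_000, seed=13):
    """Fraction of simulated experiments with x_obs or more survivors under H0."""
    rng = random.Random(seed)  # arbitrary fixed seed, only for reproducibility
    as_extreme = 0
    for _ in range(n_experiments):
        survivors = sum(rng.random() < p0 for _ in range(n))
        if survivors >= x_obs:  # "as extreme or more extreme" in the direction of HA
            as_extreme += 1
    return as_extreme / n_experiments

p_val = simulated_p_value()
print(f"simulated one-sided p-value: {p_val:.4f}")
```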
A p-value is the probability that randomness alone leads to a test statistic at least as extreme as the observed statistic. When H0 is true, the p-value is the probability of a test statistic as extreme as, or more extreme than, the one actually observed (Statistical Sleuth, p. 13).
Note: p-value denotes the above probability and p denotes an observed or sample proportion.
Often the p-value and α level refer to questions that are backwards from what you want to determine. What we are after is the truth. What we really want to know is the probability that the null hypothesis is true; you would conclude no change if this probability is high. Or, you want to know the probability that the alternative hypothesis is true; you would conclude some change if this probability is high.
P-values and significance levels do NOT answer these questions. Statistics will not tell you the truth. The p-value assumes that the null hypothesis is true. Under this assumption, the p-value measures how likely the various outcomes are to occur. If the observed outcome is likely to happen, then you will (usually) choose to believe the null hypothesis. (You could be wrong in this belief.) If the observed outcome is unlikely to happen, then you will (usually) choose to believe the alternative hypothesis. (Again, you could be wrong.) With all of this information, how do we state what we believe?
9. State the conclusion Does dispatcher-instructed bystander-administered CPR improve the chances of survival over no CPR? With this set of observed data, it would be hard to defend the null hypothesis. We reject the null hypothesis of survival rate being 0.06 in favor of the rate being > 0.06. Bystander CPR improves the survival rate compared to no CPR.
Phase 4: Communicate the Answer to the Question
The final, and perhaps most important, phase of this process is to communicate what you understand. We've formalized a precise question, described how to answer it, then brought data to bear on the question, and finally answered it. AMB You need to clearly and concisely convey what was done, what the question is, how you answered the question, what the decision is, and the final conclusion. This can be done using text, tables, and figures.
10. Document our understanding with text, tables, or figures
Bystander-administered CPR was given to n = 278 cardiac arrest victims. The survival probability of cardiac arrest victims without this intervention is 6%. In this study, the observed survival probability was 10.4% (x = 29), which was a significant increase (p-value < 0.01). Bystander CPR improves survival to hospital discharge compared to no CPR. For this simple case study, there is no need for a table or figure to explain the result. It is important that you state the final conclusion, in addition to what the decision was statistically.
Summary
In this section, we used a case study and simulation to go through the steps to answer questions with data. These steps are called hypothesis testing. This was done using simulated results, specific to this study. Often, this isn't possible, so in the next section we'll standardize the process into a formal set of steps to use in general.
Section 7.
Extending Hypothesis Testing
Recall from the simulation in the CPR case study that the p-value associated with our observed proportion of 0.104 was about 3 in 1000 (a better estimate was p-value = 0.003). Recall that this calculation assumes that the null hypothesis is true, that the survival proportion is really 0.06. It's possible to observe proportions as large as 0.104 (we did observe this proportion), but the likelihood is exceedingly small (about 3 in a thousand). It is unlikely that what we observed was due to chance; it is more likely that the survival proportion is greater than the assumed 0.06. We have informally discussed how to state a question in the form of two hypotheses (null and alternative), how to assess the data, and how to answer the question by using a statistic and an associated measure of the probability of observing our statistic, given the current state or null hypothesis. We will use this measure of probability, the p-value, throughout the rest of this course.
A p-value is the probability of observing, when H0 is true, a value of the test statistic as extreme as or more extreme than the one observed (Statistical Sleuth, p. 13).
When the p-value is less than a specified significance level, we will reject the null hypothesis. That is, assuming the null hypothesis is true, if presented with evidence that what we have observed is unlikely due to chance, we will change our assumptions. The significance level that we choose determines how much evidence we need to make the change in the status quo.
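Stated as code, the rule is a single comparison. The helper below is a hypothetical illustration (the name `decide` and the returned strings are our own), with α defaulting to the commonly used 0.05.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when the p-value is below alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    return "reject H0" if p_value < alpha else "fail to reject H0"

# CPR case study: simulated p-value of about 0.003 at alpha = 0.05.
print(decide(0.003))
# An outcome that is common under H0, such as the 0.613 seen for 16 or more survivors.
print(decide(0.613))
```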
,13
Denotes the Statistical Sleuth and page number.
Copyright Stacey S. Cofield, 19 September 2004. All rights reserved. AMB notation denotes material reprinted courtesy of Al Best, Virginia Commonwealth University
Let's review the steps we'll use for testing hypotheses:
Steps for Hypothesis Testing AMB
Phase 1: State the Question
1. Evaluate and describe the data
2. Review the assumptions
3. State the question, in the form of hypotheses
Phase 2: Decide How to Answer the Question
4. Decide on a summary number (a statistic) that reflects the question
5. How could random variation affect that statistic?
6. State a decision rule, using the statistic, to answer the question
Phase 3: Answer the Question
7. Calculate the statistic
8. Make a statistical decision
9. State the substantive conclusion
Phase 4: Communicate the Answer to the Question
10. Document our understanding with text, tables, or figures
We applied the above steps to the CPR study specifically, but the steps can be applied to all hypothesis testing situations by defining some of the steps in a more general fashion. When assessing the CPR example, we did not fully examine the assumptions associated with the data (step 2). In addition, Phase 2 was applied specifically to this example and can be stated using a more formal, general definition.
Assumptions
We discussed two assumptions earlier:
1. Representative: Is the observed data representative of the population?
2. Independence: Are the observations (responses of interest) independent?
The data need to be representative and each observation needs to be independent. If the process used to sample the data is appropriate, that is, if the data are sampled in an unbiased manner, then these assumptions are met. The third assumption that needs to be addressed is:
3. Size: Is the size of the sample large enough to make generalizations to the population at large?
The necessary size is dependent upon the question being addressed and will differ for each set of circumstances. For proportions, the size depends on the hypothesized value of the proportion of interest. In the CPR example, the hypothesized proportion is p0 = 0.06. (Note: the observed proportion p = 0.104 is not relevant. The steps are carried out assuming the null hypothesis is true; we don't evaluate the assumptions with consideration to the observed data.) So, how large is large enough? Here is the rule of thumb: n must be large enough so that we're likely to see at least five of each of the two possible outcomes (Statistical Sleuth, p. 534). This also depends on the assumed proportion, p0. In order to assume the sample size is large enough, both of the following must be true:
$p_0 \times n > 5$
$(1 - p_0) \times n > 5$
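The rule of thumb translates directly into a small check. The function below is a hypothetical helper of our own (the name `sample_size_ok` and the `minimum` parameter are assumptions); note that it takes the hypothesized proportion p0, not the observed proportion.

```python
def sample_size_ok(p0, n, minimum=5):
    """Rule of thumb: expect more than `minimum` of each outcome under H0."""
    expected_events = p0 * n            # e.g. expected survivors under H0
    expected_non_events = (1 - p0) * n  # e.g. expected non-survivors under H0
    return expected_events > minimum and expected_non_events > minimum

# CPR study: hypothesized proportion 0.06 with n = 278 subjects.
print(sample_size_ok(p0=0.06, n=278))
# A much smaller study with the same hypothesized proportion.
print(sample_size_ok(p0=0.06, n=50))
```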
In the CPR study, p0 = 0.06 and n = 278:
$p_0 \times n = 0.06 \times 278 = 16.68 > 5$
$(1 - p_0) \times n = (1 - 0.06) \times 278 = 261.32 > 5$
So, for the CPR study, the sample size assumption is met. That is, the sample size is sufficient to proceed with hypothesis testing. It is important to stress that the observed proportion is not considered when evaluating the sample size assumption. A common mistake would be using the observed proportion rather than the hypothesized proportion
to assess the sample size.AMB Remember, we are always operating under the assumption that the null hypothesis is true. All three of the assumptions must be evaluated using the proportion assumed under the null hypothesis. A similar mistake would be to compare the observed number of events of interest to five. For example, in the CPR study, the number survived = 29. The incorrect argument is that 29 > 5 so the study is of sufficient size.AMB This is the wrong conclusion, because the conclusion is based on the observed number (which is part of the observed proportion), not on the assumed number that would result from evaluating the sample collected under the assumed proportion. That completes the generalization of step 2, evaluating the assumptions or representativeness, independence, and sample size. Each of the 10 steps should be evaluated in order. Upon the completion of each step, we can proceed to the next step. If a single step can not be evaluated, then you should not proceed to the next step in the process. Step 2 is particularly important, if the data do not meet the assumptions, then the statistical tests applied to test the hypothesis will not be valid. If the assumptions in step 2 have been met, then we can proceed to steps 3 10. Choosing the test statistic is a key step in the process (step 4). For the CPR example, we used a specific statistic, the proportion p, but the statistic and decision rules can be more generally defined and applied to all situations for testing a proportion. Recall that our decision rule was based upon the simulation results. With a significance level of = 0.05 (we are comfortable choosing the alternative hypothesis 5% of the time when the null hypothesis is true), which observations should lead us to reject the nullhypothesis 5% of the time or less? We saw that 24 or more survivors occurred in 49 of 1000 experiments (0.049). Choosing this as a cut-off will result in a significance level of = 0.049. 
Since we didn't observe a cut-off for which the proportion of experiments was exactly equal to 0.05, we will use 0.049, since it is less than the desired α = 0.05. Choosing a cut-off with a significance level greater than 0.05 would result in choosing the alternative hypothesis more often than we're comfortable with if the null hypothesis is true. So, our decision rule based on the simulation is:

H0: The population survival proportion is 0.06 or less. Choose this if the observed proportion p̂ ≤ 0.083 (x = 23 survivors or less).
HA: The population survival proportion is larger than 0.06. Choose this if the observed proportion p̂ > 0.086 (x = 24 or more survivors).

But we had to perform 1000 simulations to arrive at this decision rule. It would be tedious, and is in fact unnecessary, to run simulations for every hypothesized proportion p0. It would be much easier if there were a set of standard results that we could use to test hypotheses for proportions. If you recall from our discussions about known distributions, statisticians and mathematicians have done just this. So, for particular situations, prior to conducting an experiment, we can tell how likely it is to observe each possible outcome. Recall that for the CPR simulations, assuming a true proportion of 0.06 and n = 278 subjects, the results looked similar to a normal distribution (Figure 1). The heights of the bars in the figure show the probability of each outcome (the proportion of the 1000 experiments with the specified outcome).
Figure 1. CPR Simulation Results (AMB)
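The simulated cut-off can be cross-checked without rerunning 1000 experiments. The sketch below is not the simulation used in these notes; it assumes the number of survivors follows an exact binomial distribution with n = 278 and p0 = 0.06, and the function names are illustrative only.

```python
from math import comb


def upper_tail(n, p0, c):
    """Exact P(X >= c) when X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(c, n + 1))


def smallest_cutoff(n, p0, alpha):
    """Smallest count c whose upper-tail probability is at most alpha."""
    for c in range(n + 2):
        if upper_tail(n, p0, c) <= alpha:
            return c


n, p0, alpha = 278, 0.06, 0.05
c = smallest_cutoff(n, p0, alpha)
print(c, round(upper_tail(n, p0, c), 4))
```

The printed tail probability plays the same role as the 0.049 found by simulation: it is the chance of seeing at least that many survivors when the null hypothesis is true.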
The smooth curve is the theoretical normal distribution under the null hypothesis, centered on the population value (p0 = 0.06), with proportions farther away from this center being less likely to occur assuming the null hypothesis is true. We can actually use the theoretical distribution to determine if our observed proportion is different from our assumed proportion. This is done by using a generic test statistic in place of the proportion p̂. The generic test statistic is (D,242):

$$ z = \frac{\text{observed } p - \text{hypothesized } p}{\text{standard error of the hypothesized } p} = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 (1 - p_0)}{n}}} $$
Where:

p̂ is the observed proportion (pronounced p-hat).

p0 is the hypothesized proportion under the null hypothesis (pronounced p-naught).

The term standard error in the denominator is a new term. But we have already explored this value when we assessed step 5, How could random variation affect that statistic? Under the null hypothesis, we assume that p̂ is close to p0, but how do we define close? The standard error of p0 will help to determine how close p̂ has to be to p0 to continue under the assumption of the null hypothesis (fail to reject the null hypothesis). Calculation of the standard error of p0 depends upon:
o The assumptions being met,
o The hypothesized value of the parameter, p0, and
o The sample size, n.

Notice that the standard error is calculated using only p0, since we calculate the test statistic under the assumption that the null hypothesis is true. We are not concerned about how the variability of the observed data will affect our hypothesis testing result. Remember, we believe the null hypothesis, and therefore the variability in the observed data should be assumed to be the same as the variability under the null hypothesis (until we are shown otherwise).

z is called a z-score, and it follows the standard normal distribution.

(D,242 denotes Daniel and page number.)
Using the z-score allows us to use a decision rule based on the standard normal distribution, rather than the proportion p̂. Recall that the standard normal distribution is a normal distribution with mean 0 and standard deviation 1. The properties of the standard normal distribution allow us to easily determine an associated p-value. The cut-off for the decision rule is easy to determine and it does not change for different values of p̂, n, and p0. This cut-off is referred to as the critical value, and each critical value has an associated p-value. The critical values most commonly used for z correspond to α = 0.05, 0.025, 0.01, and 0.10. For α = 0.05, the associated z value is 1.645, meaning that 5% of the standard normal distribution values are greater than 1.645 (Figure 2).
Figure 2. The Standard Normal Distribution, Z (AMB)

So how does this change our decision rule? Recall that:
We assume the null hypothesis, that p ≤ p0.
Our previous decision was based upon the proportion, p̂, and our simulation studies.

But we can now use z and the associated p-value to choose between the hypotheses:
H0: proportion p ≤ p0. Choose this if z ≤ zcritical and p-value ≥ α.
HA: proportion p > p0. Choose this if z > zcritical and p-value < α.

You will most commonly see the significance level defined using the p-value approach in the literature. Using the critical value approach would require the reader to know, understand, and recall the relationship between the critical values and associated p-values. Consequently, we'll redefine our hypothesis testing steps in terms of the p-value approach.

Recall: From our simulation in the first case study we said that the p-value for our experiment was about three in 1000 (p-value = 0.003). Recall that this calculation assumes that the null hypothesis is true: that the survival proportion is really 0.06. It's possible to observe proportions as large as 0.104, but the likelihood is small (three in a thousand). (AMB)

Steps for Hypothesis Testing (AMB)

Phase 1: State the Question
1. Evaluate and describe the data
We observed n = 278 CPR patients who received instructions by phone, of whom x = 29 survived to hospital discharge. The characteristic of interest is the survival proportion, p̂ = 29/278 = 0.104. The intent is to compare the outcomes in this study to a p0 = 0.06 survival rate presumed to be typical.

2. Review assumptions
There are three assumptions.
Representativeness: From the design of the study, it is clear that subjects are representative of cardiac-arrest victims in cities with a quick-response emergency system.
Independence: The response of one cardiac-arrest victim does not depend on the response of others. The subjects are independent.
Sufficient size: In order for the test statistic to follow the normal distribution, n must be large enough that we expect at least 5 survivals and 5 non-survivals under the null hypothesis. Since p0 × n = 0.06 × 278 = 16.68 > 5 and (1 − p0) × n = (1 − 0.06) × 278 = 261.32 > 5, this assumption is valid.

3. State the question in the form of hypotheses
The intent is to show that phone-CPR is superior to doing nothing. Thus, the alternative hypothesis is that the survival rate is higher than 6%:
H0: p ≤ 0.06, or
HA: p > 0.06.

Phase 2: Decide How to Answer the Question
4. Decide on a summary number (a statistic) that reflects the question
We'll use the z-score:

$$ z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 (1 - p_0)}{n}}} $$

Remember that p0 is the hypothesized proportion, n is the sample size, and p̂ is the observed proportion calculated from the sample. The statistic used in this situation is z, which follows the standard normal distribution. The advantage of the conversion to the standard normal z is that the cut-off for the decision rule is easy to determine and it does not change for different values of p̂, n, and p0.

The p-value approach makes the decision rule even easier because the decision rule does not change at all, no matter what the null and alternative hypotheses are.
5. How could random variation affect that statistic?
If the null hypothesis is true (with p = p0), then z is expected to be near zero. Since the assumptions are met, z is approximately normally distributed. Large values of z reflect higher survival proportions and thus favor the alternative hypothesis.
6. State a decision rule, using the statistic, to answer the question
If we want to reject the null hypothesis 5% of the time when it is true, the decision rule is: Use the test statistic (in this case, z), and calculate the p-value. (Eventually, computer software will do this for you.) Then the rule is:

Universal decision rule
Choose to believe (at α = 0.05):
H0: null hypothesis. Choose this if p-value ≥ α (usually 0.05).
HA: alternative hypothesis. Choose this if p-value < α (usually 0.05).

In our case, testing whether a single proportion is greater than an assumed proportion, for α = 0.05:
H0: p ≤ 0.06. Choose this if p-value ≥ 0.05.
HA: p > 0.06. Choose this if p-value < 0.05.

Phase 3: Answer the Question
7. Calculate the statistic
The test statistic is:

$$ z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 (1 - p_0)}{n}}} = \frac{0.104 - 0.06}{\sqrt{\dfrac{0.06 (1 - 0.06)}{278}}} = \frac{0.044}{0.0142} = 3.09 $$

Recall that a z-value all the way out to the right of 3 is rather unlikely. In fact, the associated p-value is 0.0010 (we'll talk about calculating p-values later).
8. Make a statistical decision
Reject the null hypothesis since p-value < 0.05. (The observed value of the summary statistic is larger than what is expected by chance alone.)

9. State the substantive conclusion
We conclude that the survival proportion is larger than 0.06.

Phase 4: Communicate the Answer to the Question

10. Document our understanding with text, tables, or figures
Does dispatcher-instructed bystander-administered CPR improve the chances of survival? Without this intervention it is presumed that the survival probability will be unchanged (at 6%). From this study, which used n = 278 patients, we observed p̂ = 0.104 (x = 29 survived until hospital discharge). The observed rate was compared to the hypothesized rate using the z test statistic. We reject the hypothesis p ≤ 0.06 in favor of the alternative hypothesis that the survival probability is larger than 6% since z = 3.09, p-value = 0.0010.

From this, we can define the Universal Decision Rule:
H0: null hypothesis. Choose this if p-value ≥ α (usually 0.05).
HA: alternative hypothesis. Choose this if p-value < α (usually 0.05).
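The calculations in steps 2, 7, and 8 can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in these notes (upper-tailed test, normal approximation); the function name and return layout are hypothetical, and `statistics.NormalDist` from the standard library stands in for a normal table.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_proportion_test(p_hat, n, p0, alpha=0.05):
    """Upper-tailed z test of H0: p <= p0 against HA: p > p0."""
    # Sufficient-size check uses the counts expected under H0, not the observed counts.
    size_ok = n * p0 >= 5 and n * (1 - p0) >= 5
    se0 = sqrt(p0 * (1 - p0) / n)        # standard error under H0
    z = (p_hat - p0) / se0
    p_value = 1 - NormalDist().cdf(z)    # area to the right of z
    decision = "HA" if p_value < alpha else "H0"
    return size_ok, z, p_value, decision


size_ok, z, p_value, decision = one_sided_proportion_test(0.104, 278, 0.06)
print(size_ok, round(z, 2), round(p_value, 4), decision)
```

Passing the unrounded proportion 29/278 instead of 0.104 gives a slightly larger z (about 3.11); the decision is the same either way.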
How do we determine p-values?
Without software (or before software), p-values can be determined from standard normal tables, such as Table A.1 in the Statistical Sleuth (p. 715). The Statistical Sleuth gives the probability that z is less than a specified value. Be very careful about what value a table gives you; some tables will give you the probability that z is greater than a specified value.

We have used the CPR study to test the null hypothesis p ≤ p0; this is called a one-tailed test. That is, we are looking for the probability associated with one side of the standard normal curve, or one tail of the curve. We can also test p ≥ p0 (one-tailed) and p = p0 (two-tailed). Table 1 shows the p-values associated with the various types of tests.

Table 1. Associated p-values for One- and Two-Tailed Tests (AMB)
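As a rough sketch of how the tail areas relate, the code below computes the p-value for each type of test from a z statistic. It does not reproduce Table 1; it uses the standard relationships between the lower-tail and upper-tail areas of the standard normal curve, and the labels 'greater', 'less', and 'two-sided' are names chosen here for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_value(z, alternative):
    """p-value for a z statistic; `alternative` is 'greater', 'less', or 'two-sided'."""
    lower = STD_NORMAL.cdf(z)   # P(Z < z), what a "less than" table reports
    upper = 1 - lower           # P(Z > z), what a "greater than" table reports
    if alternative == "greater":
        return upper
    if alternative == "less":
        return lower
    if alternative == "two-sided":
        return 2 * min(lower, upper)
    raise ValueError(f"unknown alternative: {alternative!r}")


for alt in ("greater", "less", "two-sided"):
    print(alt, round(p_value(3.09, alt), 4))
```

For the CPR study the alternative is p > p0, so the 'greater' value is the one that applies.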
Determining p-values from statistical tables can be very difficult and very confusing. Luckily, there are p-value calculators available that are easy to use. One such calculator was devised by Al Best, PhD, at Virginia Commonwealth University. It is an Excel-based calculator (available for download from the WebCT course website).
Calculation note (AMB): Sometimes, software will return a p-value as 0 or 0.000. As previously discussed, with continuous distributions, this is not possible. Software will often limit the number of decimal places in its output and round to 0 if the p-value is too small to show in that many places. When you encounter this situation, first determine the number of decimal places the calculator reports (when it will return a 0 value) and then report the p-value as < 0.001 or < 0.0001 (for 3 or 4 decimal places, respectively).
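One way to apply this note when reporting results is a small formatting helper. The sketch below is illustrative only; `format_p_value` is a hypothetical name, not part of any calculator mentioned in these notes.

```python
def format_p_value(p, decimals=3):
    """Report a p-value to `decimals` places, never as an exact zero."""
    threshold = 10 ** -decimals            # smallest value the display can show
    if p < threshold:
        return f"< {threshold:.{decimals}f}"
    return f"{p:.{decimals}f}"


print(format_p_value(0.00004))       # reported as "< 0.001"
print(format_p_value(0.00004, 4))    # reported as "< 0.0001"
print(format_p_value(0.0010, 4))
```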
Confidence Intervals
So far, we have used simulations, cut-off values, and p-values to test hypotheses. Hypothesis testing is only one approach to making conclusions about observed data and populations. However, p-values are the most common method of addressing scientific (statistical) questions. Often, researchers want to use a less rigid approach than hypothesis testing by estimating the parameter and placing upper and lower bounds (or limits) on the estimate. The interval is called a confidence interval.

First, some definitions:
Inference: An inference is a conclusion that patterns observed in the data are present in the broader population (p. 8).
Statistical Inference: A statistical inference is an inference justified by a probability model (distribution) linking the data to the broader population (p. 8).
Parameter: A parameter is an unknown numerical value describing a feature of a distribution (p. 20).
Statistic: A statistic is any value that can be calculated from the observed data (p. 20).
Estimate: An estimate is a statistic used as a guess (or estimate) of a parameter (p. 20).

The CPR Simulation (AMB)
We've seen, with our simulation in the first case study, that each experiment will yield a different estimate of the proportion. It's important to remember that in the simulation we knew the population parameter p was exactly 0.06. But in the first simulated experiment the estimated proportion was p̂1 = 0.061. In the second experiment the statistic was p̂2 = 0.072. We saw values all the way from 0.018 up to 0.108.
The point is that in practice we only have one of these experimental realizations in hand. So we have reason to wonder how close this estimate is to the true value. In our simulated experiments we had the luxury of knowing what the true answer was; we knew that the actual underlying population proportion was exactly p = 0.06. Recall that none of our simulated experiments' estimates matched this value. The confidence interval approach allows us to make statements about a population parameter without referring to hypotheses and to also give a range of values that reflects our degree of certainty.

Confidence Interval for a Single Proportion Using the CPR Example (AMB)
In the CPR case study we found a p̂ = 0.104 survival proportion for those receiving full CPR (x = 29 survivors out of n = 278 studied). An interval estimate would be an improvement because we'd be saying that "we estimate that the survival proportion is between ___ and ___" rather than "we estimate p = 0.104." Estimating a parameter with an interval involves three components (p. 538) (let 1= 0):
The point estimate. In this example p̂ = 0.104.
The standard error of the estimate. This describes how much variability we expect.
A reliability coefficient. This describes our degree of certainty.
A Confidence Interval, the General Case
The general form of an interval estimate is:
estimate ± [(reliability coefficient) × (standard error)]
This will yield two values, a lower limit and an upper limit, around the point estimate. This range of values will, we hope, include the true (unknown) population proportion we're trying to estimate. The confidence interval will, with specified reliability, contain the population proportion.
In order to calculate the confidence interval for a proportion, we need to be clear about the three components.
Estimate
The point estimate is easy. Just calculate the observed proportion: p̂ = x / n. In the CPR case p̂ = 0.104. But how variable is this estimate? The standard error tells us.
Standard Error
The standard error we use here is different from that used in hypothesis testing. Recall that earlier we were in the mind-set of hypothesis testing. We had a specified null hypothesis, in that case H0: p ≤ 0.06. We assumed the null hypothesis was true and from this we determined what the standard error would be. Recall that under hypothesis testing:

$$ SE_{p_0} = \sqrt{\frac{p_0 (1 - p_0)}{n}} $$

What's different: Here we are not doing hypothesis testing. We're just estimating a confidence interval. We're not using some hypothesized p0; that is, we are not going to assume any particular value. Instead, we're just going to begin with the data; we're going to use our observed p̂. So:

$$ SE_{\hat{p}} = \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}} $$
How n affects the standard error
Note that the standard error of the estimate gets smaller when n gets larger. We expect less variability in an estimate if we use more data to estimate it (Figure 3).
Figure 3. Relationship Between n and Standard Error (AMB)
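The relationship between n and the standard error can also be seen numerically. The sketch below uses the observed-proportion formula for the standard error with p̂ = 0.104; the sample sizes beyond 278 are hypothetical values chosen to show the pattern.

```python
from math import sqrt


def standard_error(p_hat, n):
    """Standard error of an observed proportion p_hat from n subjects."""
    return sqrt(p_hat * (1 - p_hat) / n)


p_hat = 0.104
for n in (278, 4 * 278, 16 * 278):
    print(n, round(standard_error(p_hat, n), 4))
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.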
In our example where n = 278 and p̂ = 0.104, the associated standard error is:

$$ SE_{\hat{p}} = \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}} = \sqrt{\frac{0.104 (1 - 0.104)}{278}} = 0.0183 $$

Reliability coefficient
The reliability coefficient reflects how sure we want to be. Where we are heading is that we want to say, "we're ____% sure that the proportion is between ___ and ___." Usually the level of reliability we choose is 95%. So, we want to be able to say, "we're 95% sure that the proportion is between ___ and ___." If the assumptions are met, the observed proportion follows a bell-shaped Normal distribution. Conveniently, the decision rule we used in hypothesis testing will give us the reliability coefficient we want. Recall from our hypothesis testing discussion, a two-tailed test:
Null hypothesis: fixed proportion
H0: p = p0. Choose this if −1.96 ≤ z ≤ +1.96.
HA: p ≠ p0. Choose this if z < −1.96 or z > +1.96.
Since we are using an overall significance level of α = 0.05, we need to use a z-score of 1.96 for a two-tailed test, which leaves an area of 0.025 in each tail (Figure 4).
Figure 4. Standard Normal Distribution (AMB)
Reliability Coefficients Commonly Used
The reliability coefficients we use most often are:
For 90% confidence, use z = 1.645.
For 95% confidence, use z = 1.96.
For 99% confidence, use z = 2.575.
Note: Sometimes results are shown as mean ± SE. This gives (apparently) narrow limits since this interval uses z = 1. But this is only a 68% confidence interval! One short-cut interpretation of SE is that about 2/3 of the estimates are within one SE of the true value.
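These coefficients come from the standard normal distribution, and they can be recovered with Python's standard library. The sketch below is an illustration only; the function names are hypothetical.

```python
from statistics import NormalDist


def reliability_coefficient(confidence):
    """z value that leaves (1 - confidence) / 2 in each tail of the standard normal."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def coverage(z):
    """Confidence level implied by an interval of estimate +/- z standard errors."""
    return NormalDist().cdf(z) - NormalDist().cdf(-z)


for level in (0.90, 0.95, 0.99):
    print(level, round(reliability_coefficient(level), 3))
print(round(coverage(1.0), 3))
```

The computed 99% coefficient is about 2.576; the value 2.575 listed above is a common table rounding, and the difference is negligible in practice. The last line confirms that estimate ± 1 SE covers only about 68%.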
Confidence Interval
So now we can calculate a 95% confidence interval for the CPR study:

$$ \text{estimate} \pm (\text{reliability coefficient}) \times (\text{standard error}) = \hat{p} \pm z_{(1-\alpha/2)} \, SE_{\hat{p}} = 0.104 \pm 1.96 (0.0183) = (0.068, 0.140) $$

Or, in sentence format: In the first case study there were 29 survivors (out of n = 278 studied), yielding a 95% confidence interval on the population survival proportion of [0.068, 0.140]. That is, "We're 95% confident that the survival proportion is between 0.068 and 0.140."

Common mistake: Note that in the above calculation we used the reliability coefficient z = 1.96. Do not use the z test statistic calculated in the previous section for hypothesis testing (z = 3.09 there, or 3.11 if p̂ is not rounded before the calculation).
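The interval calculation can be sketched as follows. This is a minimal illustration of the normal-approximation interval described in these notes, not a general-purpose routine; the function name is hypothetical, and the clipping to [0, 1] follows the reporting rule these notes give for limits that fall outside that range.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(x, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion, clipped to [0, 1]."""
    p_hat = x / n
    se = sqrt(p_hat * (1 - p_hat) / n)                    # uses the observed p_hat, not p0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)    # reliability coefficient
    lower = max(0.0, p_hat - z * se)
    upper = min(1.0, p_hat + z * se)
    return lower, upper


lower, upper = proportion_ci(29, 278)
print(round(lower, 3), round(upper, 3))
```

Calling `proportion_ci(29, 278, 0.90)` or `proportion_ci(29, 278, 0.99)` shows how the interval narrows or widens with the chosen level of confidence.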
How level of confidence affects the width of the confidence interval
We saw earlier how sample size affects the size of the SE: larger n means smaller SE. See in Figure 5 how the level of confidence affects the width of the confidence interval.
Figure 5. Relationship between n and the Confidence Interval (AMB)
In the CPR study (n = 278), the 95% CI was [0.068, 0.140] for a width of 0.072.
A 90% CI would be [0.074, 0.134] for a width of 0.060; less confidence = narrower interval.
A 99% CI would be [0.057, 0.152] for a width of 0.094; more confidence = wider interval.

What about 100% reliability?
Say that we want to be sure that our estimated interval includes the true population proportion. For 100%, use reliability coefficient = ∞ (infinity). Then, no matter what the estimate or standard error, the CI = estimate ± (reliability coefficient) × (standard error) = estimate ± ∞ × (standard error) = [−∞, +∞]. In other words, the only way to be sure is to give a trivial answer. But this answer isn't correct either. Remember that proportions must fall between 0 and 1. So a 100% confidence interval about any proportion is [0, 1].

Note: It is mathematically possible to calculate negative lower limits (below 0) and upper limits greater than 1. For a negative lower limit, report 0, and for an upper limit greater than 1, report 1.

Review: Steps to Using Confidence Intervals (AMB)
The steps are similar to those for Hypothesis Testing. The differences are in italics.
Phase 1: State the Question

1. Evaluate and describe the data
A sentence that includes the sample size, the number of events of interest, and the observed proportion will do.

2. Review assumptions
Assess representativeness, independence, and sample size. If the assumptions are met, proceed.

3. State the question: what do you want to estimate and is there a comparison value?
What proportion in the population is the sample proportion trying to estimate? If there is a comparison value (like that specified in a null hypothesis), state it.
Phase 2: Decide How to Answer the Question
4. Decide on a summary statistic that reflects the question.
estimate ± (reliability coefficient) × (standard error)

$$ \hat{p} \pm z_{(1-\alpha/2)} \, SE_{\hat{p}} $$
Note: z is fixed by the desired confidence level, not by data.
5. How could random variation affect that statistic?
If the assumptions are met, then this interval will cover the population proportion 95% of the time. (Or with whatever reliability coefficient you specify.)

6. Determine the reliability coefficient and standard error to be used in the CI
For 95% confidence, we use z = 1.96.

Phase 3: Answer the Question

7. Calculate the interval
See above.

8. Compare the interval to the comparison value
If there is a comparison value, does the interval include it?

9. State the substantive conclusion
Something like: "We estimate the population proportion of ___ to be [lower, upper] with 95% confidence," perhaps adding "which does not include the hypothesized value of ___."
Phase 4: Communicate the Answer to the Question
10. Document our understanding with text
Write sentence(s) that include: sample size, the number of events of interest, the observed proportion, the level of confidence, and the confidence interval. If needed, comment on whether the comparison value is included in the interval.

The CPR Example. Step 10 for the CPR example will be left for homework.
Summary
We have looked at several methods to assess and describe data and underlying populations. We can use simulations, z-scores, p-values, or confidence intervals about an estimate to make conclusions about observed data and broader populations. Next, we'll look at sample size and precision of estimates and the design of a study to estimate population proportions.
Section 8.
Sample Size, Study Design, Comparing Two Proportions with Confidence Intervals
Review
Up to this point, we have discussed how to state a question in the form of two hypotheses (null and alternative), how to assess the data, and how to answer the question by using a general test statistic and an associated measure of the probability of observing our statistic (p-value), given the current state (null hypothesis). In addition, we have addressed estimation with confidence by using the standard normal distribution to place upper and lower confidence bounds about an estimate.
Both the hypothesis testing and the confidence interval procedures involve estimation of a standard error, either about the null proportion or the observed proportion. We briefly explored how standard error estimates relate to the sample size of the observed data. Recall that the sample size is in the denominator of the standard error estimate, so as the sample size increases, the standard error will decrease. Intuitively, this makes sense: the more data collected, the more precise the estimate and the smaller the error about the estimate.

Previously, we have been using observed data to test an assumed proportion. The sample size is involved in testing the assumptions under the null hypothesis and in calculating the standard error under the null hypothesis. What about comparing two observed proportions? How does sample size figure into calculations and affect variability and precision when testing or comparing two observed proportions? Gallup Polls are a good example of a very public application of comparison of two observed proportions.
Gallup Poll: On Election Day this year, residents in Michigan (along with those in potentially 11 other states) will vote on whether to amend their state constitution to make marriages between same-sex couples illegal (1). As part of a special poll of Michigan registered voters, Gallup sought to find out how the people in that state plan to vote on this contentious issue.

(1) www.gallup.com
Copyright Stacey S. Cofield, 19 September 2004. All rights reserved. AMB denotes material reprinted courtesy of Al Best, Virginia Commonwealth University
Michigan residents who are registered to vote are somewhat more likely to say they would vote against, rather than vote for, a proposal to ban gay or lesbian marriages in their state. Support for the gay marriage ban in Michigan is highest among conservatives, men, and residents aged 35 and older.
According to the poll, a bare majority of Michigan registered voters -- 51% -- say they would vote against the proposal to ban gay marriages. This compares with 44% who would vote for the proposal. Among likely Michigan voters, the results are essentially the same, with 51% against the ban and 45% in favor.
Question: If the election were being held today, would you vote for the proposal, which would pass a ban on gay marriages or against the proposal, which would defeat the ban on gay marriages? Results in Figure 1.
Figure 1. Gallup Poll Results for Michigan Gay Marriage Ban

Results are based on telephone interviews with 829 registered voters in Michigan, aged 18 and older, conducted Sept. 10-13, 2004. For results based on this sample, one can say with 95% confidence that the margin of sampling error is ±4 percentage points. In addition to sampling error, question wording and practical difficulties in conducting surveys can introduce error or bias into the findings of public opinion polls.
Confidence Interval
What is the confidence interval about the proportion opposing the ban, 51%?

$$ 0.51 \pm 1.96 \sqrt{\frac{0.51 (1 - 0.51)}{829}} = 0.51 \pm 1.96 (0.017) = 0.51 \pm 0.034 = [0.48, 0.54] $$

That is, we are 95% confident that the proportion opposing the ban on gay marriages is between 48% and 54%. Notice that the lower bound is still greater than the proportion supporting the ban on gay marriages.
Sample Size and Precision
How reliable, or variable, are these numbers? That is, the proportion opposing the ban in this sample is 51%, but how variable might this value be if they redid the survey in exactly the same way? Could it change to 30% next time and 90% the next? Or would a re-do just change it to 50% or 52%? Instead of using confidence intervals, Gallup uses margin of error. The Gallup survey methods say that "the margin of sampling error is ±4 percentage points."

Margin of sampling error
How do they figure the margin is ±4? As we've seen in an earlier section, sample size affects the standard error. Also, we've seen how the width of the confidence interval is narrowed as n increases or as the reliability coefficient decreases. That is, if the Gallup survey was based on n = 1829 instead of 829, there would be less sampling error: a narrower confidence interval. If instead of 95% confidence we'd be happy with only 90% confidence, then the interval would be narrower.

Note: There are diminishing returns with larger n's. That is, after a certain point, sampling more subjects does not buy much in terms of narrowing the confidence interval.
The Margin of Error
The half-width of the confidence interval, commonly called the margin of error, depends on 3 things:

The sample size, n,
The level of confidence, usually 95%, and
The proportion we are trying to estimate.
Designing a study
We're trying to estimate a proportion. Usually the first question is "how many subjects do I need?" The necessary sample size depends on 3 things:

The margin of error (how tight does the estimate need to be?),
The level of confidence, and
A provisional value for the proportion.

The level of confidence is decided by the researcher. Usually it's 95%. The last component can be a problem: We have to know the answer to the question "what is the proportion?" before we can estimate it with some level of precision. For instance, another survey found 54%, using a similar question. In practice what we do is try a few guesses and then add more subjects than the worst case.
Sample size calculation
Recall that the z that corresponds to 95% confidence is 1.96. An equation for calculating the sample size necessary to estimate a proportion p, to a margin of error d, with a reliability coefficient z, is:

$$ n = \frac{z^2 \, p (1 - p)}{d^2} $$
With p = 0.51, Gallup wants a margin of error d = 0.04 with 95% confidence, so:

$$ n = \frac{z^2 \, p (1 - p)}{d^2} = \frac{1.96^2 \times 0.51 (1 - 0.51)}{0.04^2} = 600.01 $$
Rounding up, we need 601 subjects to estimate p with the level of error and confidence specified. If we'd guessed p = 0.54, then the required n would be 597, a roughly similar value. Where does Gallup come up with n = 829? I don't know. My guess is that they worked backwards from the actual n and calculated the margin of error, d:

$$ d = z \sqrt{\frac{p (1 - p)}{n}} = 1.96 \sqrt{\frac{0.51 (1 - 0.51)}{829}} = 0.034 \approx 0.04? $$
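Both directions of this calculation (from a desired margin to n, and from an actual n back to the margin) can be sketched in code. The function names are hypothetical, and the formulas are the two given above with z = 1.96 assumed as the default.

```python
from math import ceil, sqrt


def required_sample_size(p, d, z=1.96):
    """Subjects needed to estimate a proportion near p to within +/- d."""
    return ceil(z**2 * p * (1 - p) / d**2)   # always round up to a whole subject


def margin_of_error(p, n, z=1.96):
    """Margin of error d achieved by a sample of size n."""
    return z * sqrt(p * (1 - p) / n)


print(required_sample_size(0.51, 0.04))
print(required_sample_size(0.54, 0.04))
print(round(margin_of_error(0.51, 829), 3))
```

Trying a few provisional values of p this way is a quick check on how sensitive the required n is to the initial guess.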
Another Gallup Example
PRINCETON, NJ -- In a new Gallup Poll, conducted Sept. 13-15, President George W. Bush leads Democratic candidate John Kerry by 55% to 42% among likely voters, and by 52% to 44% among registered voters. These figures represent a significant improvement for Bush since just before the beginning of the Republican National Convention (Tables 1 and 2).
Table 1. KERRY VS. BUSH AMONG LIKELY VOTERS
Likely voters       Kerry/Edwards %   Bush/Cheney %   NEITHER (vol.) %   No opinion %
2004 Sep 13-15      42                55              1                  2
2004 Sep 3-5        45                52              1                  2
2004 Aug 23-25      47                50              1                  2
2004 Aug 9-11       47                50              1                  2
2004 Jul 30-Aug 1   47                51              *                  2
2004 Jul 19-21      49                47              2                  2
2004 Jul 8-11       50                46              2                  2
2004 Jun 21-23      48                49              1                  2
2004 Jun 3-6        50                44              2                  3
Table 2. KERRY VS. BUSH AMONG REGISTERED VOTERS
Registered voters   Kerry/Edwards %   Bush/Cheney %   NEITHER (vol.) %   No opinion %
2004 Sep 13-15      44                52              2                  2
2004 Sep 3-5        48                49              2                  1
2004 Aug 23-25      48                47              2                  3
2004 Aug 9-11       47                48              2                  2
2004 Jul 30-Aug 1   48                48              1                  3
2004 Jul 19-21      49                45              3                  3
2004 Jul 8-11       51                44              2                  3
2004 Jun 21-23      49                45              2                  3
2004 Jun 3-6        49                44              3                  4
Survey Methods
Results are based on telephone interviews with 1,022 national adults, aged 18 and older, conducted Sept. 13-15, 2004. For results based on the total sample of national adults, one can say with 95% confidence that the margin of sampling error is ±3 percentage points.
Results based on likely voters are based on the subsample of 767 survey respondents deemed most likely to vote in the November 2004 general election, according to a series of questions measuring current voting intentions and past voting behavior. For results based on the total sample of likely voters, one can say with 95% confidence that the margin of sampling error is ±4 percentage points. The likely voter model assumes a turnout of 55% of national adults. The likely voter sample is weighted down to match this assumption. For results based on the sample of 935 registered voters, the maximum margin of sampling error is ±4 percentage points. In addition to sampling error, question wording and practical difficulties in conducting surveys can introduce error or bias into the findings of public opinion polls.
Confidence Intervals
What is the confidence interval about the proportion of likely voters supporting Bush, 0.55?
$$ 0.55 \pm 1.96\sqrt{\frac{0.55(1-0.55)}{767}} = 0.55 \pm 1.96(0.018) = 0.55 \pm 0.035 = [0.515, 0.585] \approx [0.52, 0.59] $$
That is, we are 95% confident that the proportion of likely voters supporting Bush is between 52% and 59%. Notice that the lower bound is still greater than the proportion of likely voters supporting Kerry.
What is the confidence interval about the proportion of registered voters supporting Bush, 0.52?
$$ 0.52 \pm 1.96\sqrt{\frac{0.52(1-0.52)}{935}} = 0.52 \pm 1.96(0.016) = 0.52 \pm 0.032 = [0.49, 0.55] $$
That is, we are 95% confident that the proportion of registered voters supporting Bush is between 49% and 55%. Notice that the lower bound is still greater than the proportion of registered voters supporting Kerry.
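The two intervals above follow the same recipe, which can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes the normal-approximation interval used in these notes with z = 1.96.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for one proportion."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Bush support among likely voters (n = 767) and registered voters (n = 935)
for label, p_hat, n in [("likely voters", 0.55, 767),
                        ("registered voters", 0.52, 935)]:
    lower, upper = proportion_ci(p_hat, n)
    print(f"{label}: [{lower:.3f}, {upper:.3f}]")
# likely voters: [0.515, 0.585]
# registered voters: [0.488, 0.552]
```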
Sample size calculation
Did Gallup use enough people? With p = 0.55, Gallup wants a margin of error d = 0.04 with 95% confidence, so:
$$ n = \frac{z^2 \, p(1-p)}{d^2} = \frac{1.96^2 \cdot 0.55(1-0.55)}{0.04^2} = 594.2 $$
Rounding up, we need 595 subjects to estimate p with the level of error and confidence specified. They used 767 likely voters, more than the 595 required, so the sample was large enough to meet the stated margin of error.
Local Poll
For the Columbiana City Council, District 2, 46% of polled voters support Derrik Bryant and 54% of polled voters support Danny Kelley (2). Only 93 subjects were polled. What is the 95% confidence interval about each proportion?
(2) Birmingham News, http://www.al.com/election/coverage/?municipal_runoffs/columbiana.html
For Bryant:
$$ 0.46 \pm 1.96\sqrt{\frac{0.46(1-0.46)}{93}} = 0.46 \pm 1.96(0.052) = 0.46 \pm 0.10 = [0.36, 0.56] $$
We are 95% confident that the proportion of voters supporting Bryant is between 36% and 56%.
For Kelley:
$$ 0.54 \pm 1.96\sqrt{\frac{0.54(1-0.54)}{93}} = 0.54 \pm 1.96(0.052) = 0.54 \pm 0.10 = [0.44, 0.64] $$
We are 95% confident that the proportion of voters supporting Kelley is between 44% and 64%. Notice that the standard error for both CIs is the same, since we are dealing with two proportions that total 100%. The confidence intervals for the two candidates overlap; this is likely due to the fact that such a small n was used. What should the sample size have been?
$$ n = \frac{1.96^2 \cdot 0.46(1-0.46)}{0.04^2} = 596.4 $$
Rounding up, we need 597 subjects to estimate p with the level of error and confidence specified.
What about declaring statistical significance? How do we determine if two observed proportions are statistically different?
Look Ahead
In the next section, we will look at directly comparing two proportions using the 10 steps of hypothesis testing. First, let's review the CPR example:
Case Study
Recall the question that was actually asked in the CPR study reported in the NEJM. Do we need to give mouth-to-mouth ventilation and chest compression? Or will just doing chest compression alone be just as effective?
Summary: In the Seattle study, heart-attack victims were randomly assigned to two groups: full CPR or chest compression alone. They found a 10.4% survival rate for those receiving full CPR (x = 29, n = 278) and a 14.6% survival rate for those receiving chest compression alone (x = 35, n = 240). The trial was designed to detect a 3.5% improvement of chest compression alone over full CPR.
Question: Is there any difference in the survival proportions of dispatcher-instructed bystander-administered CPR depending on whether mouth-to-mouth ventilation is used or not?
Exercise (not to be turned in for credit; bring to class)
Prior to the next class, briefly write out the 10 steps as they pertain to comparing two observed proportions. Do your best to make a statement about each step, even if you are unsure of the statistical terms or formulas that will be applied.
Section 9.
Comparing Two Proportions
Let's return to the CPR example.
Case Study
Recall the question that was actually asked in the CPR study reported in the NEJM. Do we need to give mouth-to-mouth ventilation and chest compression? Or will just doing chest compression alone be just as effective?
Summary: In the Seattle study, heart-attack victims were randomly assigned to two groups: full CPR or chest compression alone. They found a 10.4% survival rate for those receiving full CPR (x = 29, n = 278) and a 14.6% survival rate for those receiving chest compression alone (x = 35, n = 240). The trial was designed to detect a 3.5% improvement of chest compression alone over full CPR.
Question: Is there any difference in the survival proportions of dispatcher-instructed bystander-administered CPR depending on whether mouth-to-mouth ventilation is used or not?
Review: Steps for Hypothesis Testing (AMB)
Phase 1: State the Question
1. Evaluate and describe the data
2. Review the assumptions
3. State the question in the form of hypotheses
Phase 2: Decide How to Answer the Question
4. Decide on a summary number (a statistic) that reflects the question
5. How could random variation affect that statistic?
6. State a decision rule, using p-values, to answer the question
Phase 3: Answer the Question
7. Calculate the statistic
8. Make a statistical decision
9. State the substantive conclusion
Phase 4: Communicate the Answer to the Question
10. Document our understanding with text, tables, or figures
The NEJM CPR Results
How do these steps get applied in the case of comparing two proportions?
Hypothesis Testing: Comparing Two Proportions
Let's go through the 10 steps for the NEJM paper.
Phase 1: State the Question
1. Evaluate and describe the data
The outcomes of the study are given in Table 4 in the NEJM article (Figure 1).
Figure 1. Table 4 from NEJM CPR Study
Contingency Table
The number of patients in each group and the number of survivors (but not the non-survivors) are shown in Table 4 of the NEJM article. That display is somewhat like a contingency table. The full contingency table corresponding to these results is shown in Table 1. This form of tabular display is called a contingency table, a cross-tabulation table, a two-way classification, or a 2 x 2 table (Statistical Sleuth, p. 552).
Table 1. Contingency Table Showing All 4 Cells (AMB)
We observed n = 278 CPR patients who received instructions by phone, of whom x = 29 survived to hospital discharge. We observed n = 240 chest-compression alone patients, of whom x = 35 survived. Overall (ignoring group membership), there were 64 survivors out of a total of 518.
A 2 x 2 table shows the number of subjects falling into each cross-classification of a row factor and a column factor (Statistical Sleuth, p. 552).
Histogram
Another way to graphically display the NEJM information is by using a histogram (Figure 2). One of the useful characteristics of displaying data using a histogram is that it visually compares the two things that should be compared (the proportions).
Note: A page number cited with a statement refers to the Statistical Sleuth.
Figure 2. Histogram of Proportion Surviving (AMB)
Tabular Display
Another tabular display would show the characteristic of interest: survival proportion. Table 2 shows proportions calculated separately for each column.
Table 2. Proportion Surviving within Each Population (AMB)
Notice the columns sum to 1, which is intuitive, since all subjects either survived or did not survive (we'll address issues of missing data later). Each proportion was calculated separately for each population or treatment group (proportion of survivors for CPR with mouth-to-mouth ventilation and for chest compression alone). These proportions can answer the following questions:
o What proportion of everyone receiving chest compression plus mouth-to-mouth ventilation survived to hospital discharge?
o Of those receiving chest compression alone, what proportion survived to discharge?
And since the columns add to one, we can look at the converse:
o Of those receiving chest compression plus mouth-to-mouth ventilation, what proportion did not survive?
o Of those receiving chest compression alone, what proportion did not survive?
The intent is to compare the two survival proportions. With this reasoning, Table 2 is recommended. The main criticism of this table is that it does not show the n's (the count totals for each group), and therefore it is difficult to judge the reliability of the estimated proportions. I recommend using both Table 1 and Table 2 or combining them into a single table, such as Table 3.
Table 3. Proportion (N) of Population Groups Surviving to Discharge
                   Population proportion (N)
Variable           Chest Compression and mouth-to-mouth   Chest Compression Alone   Row Total
Survived           0.104 (29)                             0.146 (35)                0.124 (64)
Did not survive    0.896 (249)                            0.854 (205)               0.876 (454)
Column Total       1.00 (278)                             1.00 (240)                1.00 (518)
The goal of Step 1 is to adequately describe the proportions and populations of interest. Let's return to the 10 hypothesis-testing steps.
2. Review assumptions (AMB)
As in the case where we're interested in a single proportion, a comparison of two proportions must also meet the three assumptions: representativeness, independence, and sample size.
Representativeness: Are the subjects in each group representative of some population of interest? If the study subjects were chosen as a simple random sample from a larger population and if these subjects were randomly assigned to the two groups, then we can be comfortable that the information in this sample is representative of the population of interest.
Independence: Does the response of one subject depend on the response of another? If not, then the subjects are independent.
Sufficient size: In order for the test statistic to follow the normal distribution, n must be large enough to expect at least 5 survivals and at least 5 non-survivals in each group. As in the single population case, we are not asking whether you observed at least 5 subjects in each cell. To check this, we must calculate the expected number of subjects under the null hypothesis. But we have not yet stated the hypotheses. Let's do that and then come back.
3. State the question in the form of hypotheses (AMB)
The intent is to compare the two survival proportions and ask, "Are the proportions in the two groups the same?" The alternative is that the two groups have different survival proportions.
H0: pCPR = pchest, or
HA: pCPR ≠ pchest,
where we've abbreviated the two groups as CPR and chest. Note that if the null hypothesis is true, the two groups are said to be homogeneous. If the null hypothesis is true, then the two proportions are the same. If they are the same, it's convenient to think of the proportion as a single number, p. So, another way to think of the null hypothesis is:
H0: pCPR = pchest = p
What is the best estimate of p, the survival proportion under the null hypothesis? From Table 3 the total proportion surviving was 0.124. This was obtained by calculating the
total number of survivors x = 64, divided by the total number of subjects in the study n = 518. In general, this overall proportion is termed $\bar{p}$, or "p bar" (Statistical Sleuth, p. 537).
$$ \bar{p} = \frac{x_1 + x_2}{n_1 + n_2} $$
We need this proportion to check whether we have a sufficient sample size in each group.
Sufficient size
Review: For the assumptions to be met, each group's n must be large enough to expect to observe at least 5 survivals and at least 5 non-survivals in each group. If the null hypothesis is true, how many people do we expect to see in each of the four cells? We keep the number of subjects in each group fixed (nCPR = 278 and nchest = 240) and we assume p = $\bar{p}$ = 0.124. So:
If you have 278 people and a proportion of 0.124 survive, how many do you expect to survive?
$$ \bar{p} \, n_{CPR} = 278 \times 0.124 = 34.3 $$
If you have 240 people and a proportion of 0.124 survive, how many do you expect to survive?
$$ \bar{p} \, n_{chest} = 240 \times 0.124 = 29.7 $$
(These values use the unrounded $\bar{p}$ = 64/518.)
And similarly, we can calculate the expected number of non-survivors (Table 4).
Table 4. Expected Frequencies for CPR study
Variable           Chest Compression and mouth-to-mouth   Chest Compression Alone   Row Total
Survived           34.3                                   29.7                      64
Did not survive    243.7                                  210.3                     454
Column Total       278                                    240                       518
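The expected frequencies in Table 4 can be reproduced with a short Python sketch. The function name is hypothetical; the only inputs are the survivor counts and group sizes reported for the CPR study.

```python
def expected_counts(survivors, group_sizes):
    """Expected (survived, did not survive) per group under H0.

    Under H0 every group shares the pooled proportion
    p_bar = total survivors / total subjects.
    """
    p_bar = sum(survivors) / sum(group_sizes)
    return [(n * p_bar, n * (1 - p_bar)) for n in group_sizes]

cells = expected_counts(survivors=[29, 35], group_sizes=[278, 240])
for name, (alive, dead) in zip(["full CPR", "chest compression alone"], cells):
    print(f"{name}: survived {alive:.1f}, did not survive {dead:.1f}")

# Sample-size assumption: every expected cell count is at least 5
print(all(count >= 5 for pair in cells for count in pair))
```

Note that the check is applied to the expected counts, not the observed ones, as the assumption requires.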
Is this assumption for our statistical test met? (Are the expected counts in all cells greater than 5?) If it is, then we can trust that the sample proportion will be normally distributed. If we can trust that the sample proportion is normally distributed, then we can calculate a p-value. If we can calculate a p-value we trust, then we can make a decision with understandable risk.
Phase 2: Decide How to Answer the Question
4. Decide on a summary statistic that reflects the question
We want to know if the two proportions are the same:
H0: pCPR = pchest = p
This is equivalent to asking if the difference between the two is zero (Statistical Sleuth, p. 537):
H0: pCPR - pchest = 0
Note: Recall that when looking at one proportion there were three possibilities for null hypotheses. In the case when we're looking at two proportions, we're almost always interested in the null hypothesis "same proportions" and the alternative hypothesis "different proportions." So, how do we use the generic test statistic to test the difference between two proportions as the statistic of interest?
Generic Test Statistic
From our earlier discussion, recall that the generic test statistic is:
$$ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} $$
The modified test statistic
Here the relevant statistic is the observed difference between the two proportions, $\hat{p}_2 - \hat{p}_1$, and under H0: pCPR - pchest = 0 the hypothesized value of this difference is zero. The standard error of the difference is calculated from the pooled proportion $\bar{p}$ across the two groups (Statistical Sleuth, p. 537).
$$ z = \frac{(\hat{p}_2 - \hat{p}_1) - 0}{SE_0} = \frac{(\hat{p}_2 - \hat{p}_1) - 0}{\sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}}} $$
Note: use p bar in the denominator, not the two p hats.
5. How could random variation affect that statistic? (AMB)
If the null hypothesis is true, then z is expected to be near zero. Since the assumptions are met, z is approximately normally distributed. Extreme values of z reflect larger differences and thus favor the alternative hypothesis. To calculate a p-value, use the two-tail method, where we are interested in calculating the probability of differences between the two proportions as large or larger than we observed.
6. State a decision rule, using the statistic, to answer the question (AMB)
Just like in the first case study, if we are willing to reject a true null hypothesis 5% of the time, our decision rule is to choose to believe:
H0: pCPR - pchest = 0. Choose this if p-value ≥ α (usually 0.05)
HA: pCPR - pchest ≠ 0. Choose this if p-value < α (usually 0.05)
Phase 3: Answer the Question
7. Calculate the statistic
We've already calculated $\hat{p}_{CPR}$ as 0.104. Now we need to calculate $\hat{p}_{chest}$ and $\bar{p}$:
$$ n_{chest} = 240, \quad \hat{p}_{chest} = \frac{35}{240} = 0.146 $$
$$ \bar{p} = \frac{x_1 + x_2}{n_1 + n_2} = 0.124 $$
$$ z = \frac{(0.104 - 0.146) - 0}{\sqrt{\frac{0.124(1-0.124)}{278} + \frac{0.124(1-0.124)}{240}}} = \frac{-0.042}{0.029} = -1.432 $$
Note: There is a short-cut method that some find easier to calculate. We'll cover this later in the section.
8. Make a Statistical Decision
The p-value = 0.1521. Since p-value > α = 0.05, we fail to reject the null hypothesis.
9. State a Substantive Conclusion
There is insufficient evidence to conclude that the two survival proportions are different.
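Steps 7 and 8 can be sketched in Python as follows. The function name is hypothetical, and the two-tail p-value is taken from the standard normal distribution via math.erfc, which assumes the sample-size condition checked earlier is met.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled z test of H0: p1 - p2 = 0, with a two-tail p-value."""
    p1, p2 = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)                # pooled proportion under H0
    se0 = math.sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se0
    p_value = math.erfc(abs(z) / math.sqrt(2))   # P(|Z| >= |z|) for a standard normal
    return z, p_value

# Group 1: full CPR (29 of 278); group 2: chest compression alone (35 of 240)
z, p_value = two_proportion_z_test(29, 278, 35, 240)
print(round(z, 3), round(p_value, 3))   # -1.432 0.152
```

Swapping the two groups flips the sign of z but leaves the two-tail p-value unchanged.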
Phase 4: Communicate the Answer to the Question
10. Document our understanding with text, tables, or figures (AMB)
For a dispatcher-instructed bystander-administered intervention after a cardiac arrest, is the survival proportion for full CPR different from the survival proportion with chest compression alone? In this study, n = 278 patients were randomized to the chest-compression and mouth-to-mouth ventilation group, and we observed that p = 0.104 (x = 29) survived until hospital discharge. And n = 240 patients were randomized to the chest-compression alone group, where we observed that p = 0.146 (x = 35) survived until hospital discharge. Thus, there was a nominal improvement in survival of 4.2 percentage points, but the two proportions were compared and found to be not significantly different (z = 1.4, p-value = 0.1521).
Question: Why did we report a positive z value? (AMB)
If what we're doing is testing "is A different than B?", we could just as well have phrased the question as "is B different than A?". Thus, the sign does not matter. So, by convention, we report the positive value.
Question: Why is our p-value different than the one reported in the NEJM paper?
On page 1547 of the paper, in the last paragraph of the methods, it says, "The primary analysis consisted of a simple comparison of proportions by Fisher's exact test."
So what is Fisher's Exact Test?
Fisher's Exact Test (Statistical Sleuth, p. 562)
Devised by R.A. Fisher, this test lets us determine the exact probability of obtaining the observed results or results that are more extreme. One advantage to this method is that we can use it even if the sample sizes are too small for the normal-approximation assumptions to be met. That is, we can use this method even when the expected cell size is less than 5. Fundamentally, this test does exactly what Dr. Best did in the simulation for the CPR example. Recall that we simulated running the experiment a large number of times (1000 times), we kept track of all the possible outcomes, then counted up the number of times each outcome occurred, and determined a p-value. We then compared the results we got by enumerating all the possibilities from the simulation to what we'd expect by theory.
Fisher's method
Fisher's idea was that with small samples we don't have to approximate the distribution with z to calculate p-values. We can enumerate all the possible outcomes and calculate p-values exactly.
Example (AMB)
Enumeration
Let's look at a simple example. Fisher used an example of a woman tasting tea. A British woman claimed to be able to distinguish whether milk or tea was added to the cup first. Let's use a more up-to-date question. Can you tell the difference between Coke and Pepsi?
Two cups
Say I poured, hidden from you, two soft-drink cups. One with Coke and one with Pepsi. Then I ask you: Which is Coke? And which is Pepsi? What are the possible outcomes of this experiment?
And we can look at the exact distribution of the number correct. Thus we can determine the p-value we'd report for each of the possibilities.
This experiment would not allow us to conclude anything (unless p-values of 0.5 are convincing).
Four cups
Assuming an equal number of Cokes and Pepsis, the next larger experiment would be 4 cups.
If someone is guessing randomly, these 6 possibilities are equally likely.
So if someone got all 4 right, we would be able to conclude that this person could tell the difference between Coke and Pepsi with p-value = 1/6 = 0.1667. Would this be convincing?
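The four-cup enumeration is small enough to list by machine. The sketch below is ours, not part of the original example: it assumes cups 0 and 1 hold Coke (an arbitrary labeling) and that the taster knows two of the four cups are Coke, so a guess is a choice of 2 cups.

```python
from itertools import combinations
from collections import Counter

cups = range(4)
coke_cups = {0, 1}   # hypothetical truth: cups 0 and 1 hold Coke

# Every way the taster could pick which 2 of the 4 cups are "Coke"
outcomes = Counter()
for guess in combinations(cups, 2):
    hits = len(coke_cups & set(guess))   # Coke cups identified correctly
    outcomes[2 * hits] += 1              # each Coke hit also fixes one Pepsi cup

total = sum(outcomes.values())
print(total)                             # 6 equally likely guesses
print(dict(outcomes))                    # {4: 1, 2: 4, 0: 1}
print(round(outcomes[4] / total, 4))     # 0.1667
```

Only one of the six equally likely guesses labels all 4 cups correctly, which is where the p-value of 1/6 comes from.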
Calculation of Fisher's exact p-values
How are we going to use this exact test in practice? Fortunately, software can calculate these p-values easily. And so it is not necessary to discuss how to do it (or to ask you to do it on a homework problem). The one thing you have to know about is how to read the software print-out. When Fisher's exact p-value is reported, it is reported using tails. That is, the software cannot know whether you want to test HA: p1 < p2; or HA: p1 > p2; or HA: p1 ≠ p2. You have to know what your question (hypothesis) is to know which p-value to report.
The most conservative p-value to report is the 2-tail one. In this case that's what they did in the NEJM paper.
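For readers curious about what the software is doing, here is a rough Python sketch of the three tails, built on the hypergeometric distribution with the table margins held fixed. The function name is hypothetical. The two-tail rule used here (add up every table that is no more likely than the observed one) is one common convention and is an assumption on our part; it is not necessarily the rule JMP applies.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Exact p-values for the 2x2 table [[a, b], [c, d]] with margins fixed.

    Returns (lower-tail, upper-tail, two-tail) p-values for the top-left cell.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    k_min, k_max = max(0, row1 + col1 - n), min(row1, col1)
    denom = comb(n, col1)
    # Hypergeometric probability of each possible top-left count k
    prob = {k: comb(row1, k) * comb(n - row1, col1 - k) / denom
            for k in range(k_min, k_max + 1)}
    lower = sum(p for k, p in prob.items() if k <= a)
    upper = sum(p for k, p in prob.items() if k >= a)
    cutoff = prob[a] * (1 + 1e-9)        # small tolerance for floating-point ties
    two_tail = min(1.0, sum(p for p in prob.values() if p <= cutoff))
    return lower, upper, two_tail

# Four cups, all identified correctly
print(fisher_exact_2x2(2, 0, 0, 2))
# CPR study: rows = survived / did not survive, columns = full CPR / chest alone
print(fisher_exact_2x2(29, 35, 249, 205))
```

For the four-cup table the upper tail is the 1/6 found by enumeration, and the two-tail value is twice that.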
Short cut: Comparing Two Proportions
There is another method for comparing two proportions. It's easy to calculate and warrants some discussion.
Short-cut calculation
We start by labeling the four cells with the letters a thru d:
The preferred (and simpler) statistic is represented by $\chi^2$ (say "chi squared"). It's actually the square of the z statistic we have already seen:
$$ \chi^2 = \frac{n\,(ad - bc)^2}{(a+c)(b+d)(a+b)(c+d)} $$
Notice that the denominator is the product of all the marginal totals. In the CPR example (a = 29, b = 35, c = 249, d = 205, n = 518):
$$ \chi^2 = \frac{518\,(29 \cdot 205 - 35 \cdot 249)^2}{(29+249)(35+205)(29+35)(249+205)} = 2.05 $$
Notice that $\sqrt{2.05} = 1.43$.
The decision rule is straightforward. Take the square root of the $\chi^2$ value (it is |z|) and look up the p-value. You will define the decision rule as a homework problem.
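A short Python sketch of the short-cut, using the cell labels from the formula above (a = 29, b = 35, c = 249, d = 205). The function name is hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Short-cut chi-squared statistic for a 2x2 table of counts."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

chi2 = chi_square_2x2(29, 35, 249, 205)
print(round(chi2, 2))              # 2.05
print(round(math.sqrt(chi2), 2))   # 1.43, the absolute value of the pooled z
```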
The confidence interval for the difference between two proportions
The confidence interval for the difference is formed using a generic formula similar to the one we used for the confidence interval for one proportion (Statistical Sleuth, p. 538). Here, the point estimate is the difference in observed proportions, $\hat{p}_1 - \hat{p}_2$. Notice that there is no mention of the mean proportion $\bar{p}$.
The standard error is:
$$ \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} $$
So the CI (or confidence interval) is:
$$ (\hat{p}_1 - \hat{p}_2) \pm z_{(1-\alpha/2)} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} $$
For the CPR example:
$$ (0.104 - 0.146) \pm 1.96\sqrt{\frac{0.104(1-0.104)}{278} + \frac{0.146(1-0.146)}{240}} = -0.042 \pm 1.96(0.029) = [-0.099,\ 0.016] $$
(The endpoints are computed from the unrounded proportions.)
The 95% CI is -0.099 to 0.016, or: We're 95% confident that the interval -9.9% to 1.6% covers the true difference in the population survival proportion from full CPR versus chest compression alone.
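The interval can be checked with a few lines of Python. The function name is hypothetical; the sketch works from the raw counts, so it uses the unrounded proportions, and it assumes z = 1.96 for 95% confidence.

```python
import math

def difference_ci(x1, n1, x2, n2, z=1.96):
    """Confidence interval for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Full CPR (29 of 278) minus chest compression alone (35 of 240)
lower, upper = difference_ci(29, 278, 35, 240)
print(round(lower, 3), round(upper, 3))   # -0.099 0.016
print(lower < 0 < upper)                  # True: the interval covers zero
```

Unlike the test statistic, this standard error uses the two p hats rather than p bar, because the interval does not assume the null hypothesis is true.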
Note: The 95% CI includes zero, meaning that, using the confidence interval alone to test the difference, we would fail to reject the hypothesis that the difference is zero; that is, we cannot conclude that the treatment groups differ. If you find a significant difference, you should add the confidence interval about the observed difference to step 10 of the hypothesis testing steps.
Review
Over the last several sections, we have applied the ten steps of hypothesis testing to comparing a single observed proportion to an assumed proportion and to comparing two observed proportions. We tested the two observed proportions by actually testing whether the difference of the two observed proportions is equal to zero. We will continue to apply the 10 steps of hypothesis testing to other types of hypothesis tests, such as comparing a single mean to an assumed mean, comparing two means, and comparing several means.
Section 10. JMP Tutorial
Overview
We're going to run the JMP IN program and open a file that contains the data from the Automotive study. JMP data tables have an important structure that you need to understand in order to use JMP effectively. We'll illustrate some key concepts in JMP by doing two analyses, a preliminary analysis and the contingency-table analysis of interest. You'll learn how to save reports and open the report in a word processor to use in conveying and writing up your results.
Chapter 1 in JMP Start Statistics (JSS) introduces the format of the text and concepts covered in the book. Preliminary analysis of each variable is described in chapter 7. The analysis of contingency tables is covered in Chapter 11.
Background Information
JMP is pronounced "jump."
Throughout, JMP and JMP IN are used interchangeably. JMP IN is the student version of JMP distributed by Duxbury/Thomson.
Both JMP and JMP IN were developed by a small group at SAS Institute headed by John Sall. "John's Macintosh Product" was designed from the beginning to be wholly interactive and easy to use.
Copyright Stacey S. Cofield, 30 September 2004. All rights reserved.
Typographical Conventions
The typefaces used here are similar to those chosen in JMP Start Statistics (JSS). The typefaces were chosen to help navigate the menus and options available.
o File names and data table names are capitalized. The files on a disk (e.g., BABYWGTS.JMP) must be read into the JMP memory before they can be used in any analysis. What resides on disk is called a data file; once it's in active memory it's called a data table.
o The column names have a similar appearance to the way they appear in the JMP data table (WEIGHT). This Arial typeface may appear either plain or boldface.
o JMP IN is driven by menu and mouse commands. Menu names (File menu) or menu items (Open command) appear in a bold sans-serif font similar to the way they appear at the top of the screen.
o When choosing a command from the menu, the text appears as follows: Choose File > Open. This means that we should use the mouse to go to the File menu and choose the Open command. JSS uses an arrow to show the menu hierarchy: File → Open
o Similarly, JMP has a series of popup menu commands on the top of a report window. If there is no red arrow, there are no options in the menu.
Notes, definitions, tips, and steps appear in rounded-corner, grayed out boxes: The probability of being in a given interval is the area under the density curve over the interval.
Running JMP
Start JMP IN. Go to the Start button and (somewhere) you should find JMP IN:
When you start JMP IN, an Initial Splash Window will appear:
After the splash is finished, the JMP Starter will appear (page 8):
If you do not want the splash or JMP starter to appear at start up, you need to change your preferences by selecting Preferences from the File menu:
The preferences window will open. You can also change your background colors, font type, and font sizes:
Opening a .JMP Data File
After start up you will need to open a data file prior to running any type of analysis. JMP can open and read a number of data types, including:
o .JMP files (JMP files)
o .TXT files (text files)
o .XLS files (Excel files)
o .SAS7BDAT files (permanent SAS files)
o also HTML, FoxPro, and dBase files.
To open a file, choose File > Open to get to the Open Data File window:
Navigate through the directories to find the file you want. In our case, we want the file called CAR POLL.JMP: Select the file in the Open Data Table dialog. Press the Open button. The JMP data table will open:
The Structure of JMP Data Tables Data tables have a standard rectangular structure. A data table always has rows and columns and it is always rectangular. The left panels of the data table tell us that there are 303 rows and 6 columns:
Columns
The columns are variables. Each variable describes a particular characteristic. Columns have three important features:
o Columns have names.
o Columns have a data type.
o Columns have a modeling type.
We can double-click above a column's name to see all the characteristics of a column when the mouse arrow changes shape (the cursor symbol is not reproduced here).
Column name JMP refers to variables by their column name (not, for instance, by number or position). A column name can be any character your keyboard can generate (letters, numbers, spaces, special characters).
Column names can be up to 31 characters long, so spend a few moments to name columns something descriptive. For instance, here we see two columns named Sex and Marital Status. Much better than "A" and "B".
Data type
All columns must have a data type:
o A column can contain just numbers (numeric data type).
o Or, it can contain any character (character data type).
When a non-.JMP data file is opened for the first time, JMP will assign the data type, but you can change it later. For example, if a column contains all numbers, JMP will assign the column a numeric data type. But if the variable is really a character variable, you can change the type by selecting the column, then selecting Cols > Column Info.
What data type should you use?
If you want to do any arithmetic (e.g. mean, variance) on a column it must be numeric. That is, if we have some financial figures (like $12,243.15) and want to calculate totals, we must define the column as numeric and enter the data as: 12243.15 (no dollar signs or commas; just a decimal sign).
However, if you have subject ID numbers (e.g., Social Security numbers like 408-91-2093), it is unlikely that you need to do any arithmetic on this column. So, you could define the column as character (and enter the data as 408-91-2093) or numeric (and enter the data as 408912093, without the dashes). Columns that have character values (M and F, for instance) must be defined as Data Type: Character.
The only things you can type in a Data Type: Numeric field are numeric digits, a decimal point, and a leading negative sign. Numeric columns are right justified while character columns are left justified. Notice that we can tell a column's data type by looking at the spreadsheet:
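The numeric versus character distinction can be illustrated outside JMP. The following is a minimal Python sketch, not JMP's own import logic: the helper name to_numeric and the sample values other than $12,243.15 are made up for illustration. It strips the dollar sign and commas so the figures can be totaled, while the subject ID stays a character string.

```python
def to_numeric(text):
    """Remove currency formatting so the value can be stored as a number."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return float(cleaned)

# Financial figures as they might appear in a raw file (illustrative values).
amounts = ["$12,243.15", "$1,000.00", "250"]
values = [to_numeric(a) for a in amounts]
total = sum(values)  # arithmetic is only possible on the numeric version

# An ID is a label, not a quantity, so it is kept as a character string.
subject_id = "408-91-2093"

print(values, total, subject_id)
```

Trying to sum the original strings would fail, which is the same reason a JMP column must be numeric before means or totals can be computed.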
What about dates? Dates should probably be defined as Data Type: Numeric with Format: Date: mmddyyyy. Then dates are displayed as mm/dd/yyyy.
Modeling type
Also called the measurement scale, the modeling type determines how a variable is used in an analysis. A continuous modeling type implies that we want JMP to use the numeric values.
The categorical modeling types are ordinal and nominal. Both imply that there are discrete, categorical values that name the levels or different groups. Nominal implies that these groups are just different, or in name only. Ordinal implies that their order is important.
Modeling type can be changed by using the Column Info option. Or you can change the type using the columns panel on the left side of the data table.
The little boxes to the left of the column names in the columns panel are called popup menus. Put the tip of the arrow cursor in the left box and hold down the mouse button; the options Continuous, Ordinal, and Nominal appear. To change the modeling type, drag the arrow to the desired modeling type and release the mouse button:
Rows
There are n = 303 subjects in the Car Poll data. Almost always we'll have as many rows as subjects.
Synonyms: 1 row = 1 subject = 1 observation = 1 case
Now that we understand the row and column structure of a JMP data table, let's take a quick overview of the menus.
Menu Overview
Similar to other Windows-based programs, the JMP IN menu bar appears along the top of the application window:
File and Edit Menus
Briefly, the File and Edit menus should be familiar. You can Open, Save, Save As, Print, etc. from the File menu. From the Edit menu you can Cut, Copy, Paste, Search, etc. In addition, you will use the Edit menu to save a JMP report for use in a word processing program (more on this later).
Tables, Rows and Cols Menus
The next three menus are for working on JMP data tables. The Tables menu modifies a table or creates a new JMP table from one or more already open tables. The Rows menu modifies the rows in the data table and the Cols menu modifies the data table columns.
DOE Menu
The DOE menu creates data sets using Design of Experiments techniques.
Analyze and Graph Menus
The Analyze and Graph menus are the launch pad for JMP analysis or graph platforms. Both menus use the data in the top-most (active) data table to create a window that shows various graphical and tabular displays. See the JMP help system or JSS for details.
Analysis Platforms
There are a variety of analyses that JMP IN can produce through the Analyze and Graph menus. The results of these menu selections are called JMP platforms, and each platform provides a different analysis or display. We'll demonstrate the simplest of these in a moment.
Depending on the modeling type of the columns of the active data window, JMP produces the appropriate graphical displays and text reports. In this section, each of the platforms from the Analyze and Graph menus is introduced. Because of the graphical emphasis of JMP IN, many interactive graphical displays are produced when we launch an analysis platform. The platforms in the Analyze menu also produce text reports. The platforms in the Graph menu primarily produce graphical displays. Not all of the platforms are used in BST 621.
The Analyze Platform
There are a number of platforms under the JMP IN Analyze menu. The most commonly used are at the top of the menu. In BST 621 you'll spend almost all of your time with the first two, Distribution and Fit Y by X:
Distribution produces univariate analyses which include histograms, outlier box plots, and descriptive statistics like means, variances and quantiles for continuous columns. For ordinal or nominal columns, this platform shows histograms, mosaic charts (stacked bar chart) and a frequency table.
Fit Y by X does bivariate analyses (two columns at a time) for each pair of the specified X and Y columns. Depending on the modeling type of X and Y, this platform does regression, analysis of variance (or t-tests), contingency table analysis or logistic regression.
Fit Model is a general model fitting platform which allows multi-variable (more than one X column) analysis or multivariate (more than one Y column) analysis or both.
The platforms may initially appear to be fixed and rigid. Keep in mind that the platforms are just starting points to begin an analysis. Each platform is designed for adaptation and exploration. The default display can be modified to suit a presentation or to find features of the data.
The Graph Platform
The Graph menu contains some specialized plots frequently used by statisticians.
Chart produces various charts such as bar, pie, line and needle.
Overlay Plots produces a special type of line plot. Overlay plots get their name from "overlaying" 2 or more Y columns (along the vertical axis) across one X column (along the horizontal axis).
Spinning Plot produces a three-dimensional scatter plot that can be rotated to see depth.
Tools Menu
The default tool is the arrow tool. If we want context-sensitive help on the contents of any window created by Analyze or Graph menu commands, choose Tools > ? and click on the window object we don't understand. There are other tools useful in a variety of situations. Check JSS or the help system for more information.
Window Menus
Since each analysis platform places another window on screen, it is likely that after analyzing data there will be a number of windows. To help manage this bounty of windows, use the Window menu. It's a good idea to close windows (File > Close) not in use; just remember that if you've closed a window and need it again, you'll have to go through the menu commands again.
Preliminary Analyses
Recall Step 1 in the Steps for Hypothesis Testing: Describe the Data. Before beginning any analyses, look at the data. Easy preliminary analyses can alert you to any problems that could alter the results of your analyses.
Looking at Distributions
You could inspect the individual rows and columns, but it's easier to look at the distribution of the variables. Recall that the distribution of a variable is a list of all the values with an indication of their relative frequency. In JMP, this is done with the Distribution of Y platform. Choose Analyze > Distribution.
The Distribution dialog appears:
Notice that all the columns in the data table appear on the left. We're asked to Cast Selected Columns into Roles. Roles refer to one more fundamental JMP concept: column roles.
Roles
Columns have roles in analyses. Like the modeling type, a variable's column role provides JMP IN with information on how to treat the column in an analysis. The popup to the right of a column name in the data table is one way to mark a column's role. A brief description of each column role follows:
o None indicates that the column is to play no role in the upcoming analysis.
o X indicates the column is an explanatory variable. Other names for variables that have this role are: independent variable, classification variable, or predictor variable. In bivariate displays the X values appear on the horizontal axis of plots.
o Y indicates the column is a response variable. Sometimes this is called the dependent variable. In bivariate displays the Y values appear on the vertical axis of plots.
o Weight supplies (optional) weights for the analysis. Weights are usually a number between 0 and 1 that corresponds to the importance of data in that row. This role is not used in BST 621.
o Freq tells JMP IN the number of occurrences of that row of data. Normally, each row of the data table counts as n = 1. When we use the Freq role, each row counts as n = the value of the Freq variable.
o By is an optional role. It indicates that we want to repeat the same analysis for every level of the By column(s).
It is extremely important to be familiar with these column roles. Assuming that the column roles are specified correctly, JMP IN can begin producing the correct analyses by simply choosing an entry in the Analyze or Graph menu.
Returning to the analysis
Returning to the Distribution of Y dialog: Select Sex and press the Y, Response button to use this column in the Y role.
Select Marital Status and press the Y, Response button to use this column in the Y role. Press the OK button.
A window appears with the results of the Distribution command.
The Distribution window
Almost all JMP report windows have a graphical display on top to give us a broad overview and, on the bottom, a tabular display to show us details:
Graphical Displays
In the top portion of the window is a histogram or a stacked-bar chart. Any data errors or unexpected values will show up here, for example, if Male had been entered as Make:
Tabular Displays
In the bottom portion of the window is a frequency table. It lists all the levels for the nominal columns, along with their frequency count, proportion, and cumulative proportion.
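To make the contents of such a frequency table concrete, here is a small Python sketch that tabulates a nominal column and flags unexpected levels. It only illustrates the idea and is not how JMP computes its report; the ten Sex values are invented for the example, and frequency_table is a hypothetical helper.

```python
from collections import Counter

def frequency_table(values):
    """Return (level, count, proportion, cumulative proportion) per level."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0.0
    for level in sorted(counts):
        proportion = counts[level] / n
        cumulative += proportion
        rows.append((level, counts[level], proportion, cumulative))
    return rows

# Invented data: one value was mis-keyed as "Make".
sex = ["Male"] * 5 + ["Female"] * 4 + ["Make"]
table = frequency_table(sex)

for level, count, proportion, cumulative in table:
    print(f"{level:8s} {count:3d} {proportion:6.3f} {cumulative:6.3f}")

# Any level outside the expected set points to a data-entry error.
unexpected = [row[0] for row in table if row[0] not in {"Male", "Female"}]
print("Unexpected levels:", unexpected)
```

A level with a very small count that does not belong to the expected set is exactly the kind of error the histogram and frequency table are meant to expose.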
Interacting with the Analysis Displays
The JMP IN analysis displays are highly interactive. These displays were designed to help you discover important features of your data and lead you to an accurate analysis. In what follows, a description is given of some important buttons, tools and menus.
Graphs and Reports
Every graph produced in JMP IN has interactive features. JMP IN is designed so that the various displays and data tables are dynamically linked.
Using the JMP Tools
JMP provides various tools for working with the analysis displays. Select a tool by clicking on the Tools menu and choosing the tool. The different tools perform different functions.
Modifying the Data Table
We've noticed that there seems to be one row with Sex = Make where all the others have either Male or Female.
Highlighting
Click on the Sex = Female green histogram bar:
Notice that the Female bar changes darkness; this is what's called highlighting.
Notice that the corresponding portion of Marital Status bars also highlight. Notice that the corresponding rows in the data table also highlight (the row state turns dark blue and the rows are highlighted in gray).
Finding the incorrect row
To find the row with the misspelled Sex, select the bar corresponding to the incorrect value. (The bar is very short, but it is there; click to the right of where it is.) Click on the Sex = Make green histogram bar:
The Selected 1 in the lower-left of the data table in the Rows panel shows us that we've selected 1 row. You can see that Row 4 is selected (you may have to select the data table window by clicking on it with the mouse and scroll down to find the row). Now you can change the value to Male. Don't type quote marks. When you're done, press the Tab, Return, or Enter key.
Generally, it's a good idea to review all of the columns with Distribution of Y. You can repair any mis-keyed values easily, and these errors then won't show up in future analyses.
The Distribution of Y window
Notice that the Distribution of Y window did not automatically update when we changed the data. To verify that any changes were properly applied, redo the preliminary analysis. Note, however, that there are two controls: one to change how much of the outline you see and the other for options.
Disclosure buttons: When the blue diamond is down, the portion of the report below it is visible. When the blue diamond is up, the report is hidden.
Popup options: The main functional options in JMP report windows are in the red triangle popup menu. This is also called the popup icon.
Click the red triangle popup menu to the left of Distributions. Choose Script > Redo Analysis.
The revised Distribution window appears.
In general, check these things:
o Are the column names descriptive?
o Are the values in each column what you expect?
o Are the relative frequencies approximately what you expect?
o Do you have any missing values? That is, does the Frequencies Total equal the number of rows in the data table?
Getting Help
There are three ways to get help in JMP. First, there is the Help Tool.
Using the Help Tool
Select the ? tool from the Tools menu, and click on any feature in the window.
For instance, clicking on the green histogram will tell you what kind of graph that is:
You can always click on anything in a report window with the Help tool and JMP will tell you what it is.
Getting Help from the JMP IN Statistics Guide
Choose Help > Index.
Type Contingency Table and then select Contingency Table.
A window appears, telling you how to analyze two-way frequency tables.
It helps to have the overall view of JMP to decode the instructions. The text that begins "The Contingency platform is the personality of the Fit Y by X command" means the following:
Performing Analyses
Choose Analyze > Fit Y by X.
In the dialog, select one of the columns as Y, response. Recall that the Y role is the response variable role. Select another of the columns as X, factor. Recall that the X role is the explanatory variable role.
Take a moment and try the above. When you're done, the dialog will appear as follows:
Press OK.
A report window will appear.
Let's look at each display in this window to understand how to use them.
Graphical Display
At the top of every JMP analysis window is a figure illustrating the main point of the analysis. The display at the top of the window is a mosaic plot.
The stacked bar chart on the far right shows the marginal proportions of Marital Status = single and Marital Status = married. The proportion with Marital Status = married is 0.65. If the proportions are homogeneous in both Gender groups (Male and Female), then the Married bar in each group should have approximately the same height as the marginal Married bar. As we can see, the red bar in the Female group is only slightly larger than the red bar in the Male group. The impression is that the two proportions (Male/Married and Female/Married) are not different.
Tabular Displays
The contingency table displays the count for each combination of variables (male/married, female/married, male/single, female/single), along with various percentages (marginal and total). In the red popup menu, options are available to display summary statistics in the contingency table.
The Tests report In the Tests report there is a lot of information pertaining to relevant questions.
The important part is the row labeled Pearson. The Pearson chi-square value is 1.914 and its p-value is 0.1665. Fisher's exact p-value is also available. In all instances, the p-values are not significant. This means that we fail to reject the null hypothesis that the probability of being single (or married) is the same for females and males.
Saving Analyses
The JMP IN windows can be printed. This is one way to save these reports for future use. However, there is also a way to save the results into a word-processing document.
Journaling Results
To save the analyses, there are three steps. First, make the reports in the top-most window appear as desired by adding or removing options. Second, choose Edit > Journal. An untitled journal window appears.
Then save the journal using the following guidelines:
Saving the Journal
Make the journal window the top-most window and choose File > Save As. A Save As dialog appears. Save the file into an appropriate directory and give it a name. In the dialog:
o at the top, specify a directory
o in the filename field, type a name
o in the save-as popup, choose Microsoft Word (*.DOC) or Rich Text Format (*.RTF)
o press Save. This saves the reports to disk.
Now the file can be read by a word processor such as Microsoft Word. This process can be repeated for any analysis or report window.
Using the file in Microsoft Word
You can either open the file as a new document and add in your written report, or you can insert the file into an already created report.
Editing the results
This is a word processing document. The figures are saved in WMF (Windows Metafile) format so, if desired, they can be edited. The text-based reports are stored as tab-delimited text. As text, it's easy to remove lines or change the way the characters appear.
Printing and Saving
Again, you're in Microsoft Word, so you can print the document, save it as a Microsoft Word .doc, etc., as you would with any other document.
Special Case: Frequencies
Returning to the CPR Study, the NEJM paper showed the results for n = 518 subjects and so there were 518 rows in the data table. However, there is a better way to begin if what we have is the 4 cells of a 2x2 contingency table.
One JMP row = one contingency table cell
Consider this JMP data table:
The JMP data table has one row per contingency-table cell. The JMP data table has the same two columns as in the larger data file. The addition is the third column, here called N.
The Frequency role
We've seen the Y and X roles in use. In contingency tables, the Frequency role is used a great deal. Mark the role of N as Freq.
This makes each row count as the value of N. Note: Freq columns must be numeric.
When you choose Fit Y by X, specify the columns as you did previously, but now also select N as Freq. This will produce the same results as if you had a single row for each subject.
This option is especially helpful if you are trying to replicate results in a published study and do not have access to the full dataset. You can also analyze data from investigators without needing access to all of their data: as long as you have the cell counts, you can test hypotheses.
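The equivalence between a frequency-weighted table and a one-row-per-subject table can be checked with a short Python sketch. The cell counts and labels below are hypothetical (they are not the CPR Study results), and expand is an illustrative helper, not a JMP function.

```python
from collections import Counter

# One row per contingency-table cell: (group, outcome, N). Hypothetical counts.
cells = [
    ("Treatment A", "Survived", 12),
    ("Treatment A", "Died", 38),
    ("Treatment B", "Survived", 20),
    ("Treatment B", "Died", 30),
]

def expand(cell_rows):
    """Repeat each (group, outcome) pair N times: one row per subject."""
    rows = []
    for group, outcome, n in cell_rows:
        rows.extend([(group, outcome)] * n)
    return rows

subject_rows = expand(cells)

# Tabulating the long table recovers exactly the cell counts we started with.
tabulated = dict(Counter(subject_rows))
from_freq = {(group, outcome): n for group, outcome, n in cells}

print(len(subject_rows), tabulated == from_freq)
```

Because both layouts produce the same cell counts, any statistic computed from the counts is the same whichever layout is used.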
Summary
Here are some of the core concepts we've covered:
o Column names
o Data type
o Modeling type
o Column roles
o The Help System
o The Distribution of Y platform
o The Fit Y by X platform
o Journaling
Specifically, for the analysis of frequencies/proportions, the simplest data table is probably as follows:
In contingency tables, we need one row per cell. If there are r rows and c columns, the data table needs r × c rows. If the table is a two-way contingency table, we need three columns:
o A variable for the values of the rows of the contingency table.
o A variable for the values of the columns of the contingency table.
o A variable for the corresponding cell frequency. This variable should be defined with the Freq role.
Most importantly, remember that by using the ? tool, you will automatically be directed to the associated section of the JMP IN 5.1 Help.
Acknowledgement The majority of this material was taken from JMP Introduction, Al Best, PhD, Virginia Commonwealth University 2004. The examples have been changed to reflect the material in BST 621 and any updates to JMP IN 5.1 and JSS.
11.
Analysis of Frequencies
Overview
This is the final section on the analysis of frequencies and proportions. We've compared an observed proportion to a fixed standard. We've compared proportions from two groups. Here we compare proportions from multiple groups and then go on to look for relationships between two classification variables.
Case Studies
From the Framingham study: A sample of male residents of Framingham, MA, aged 40-59, were classified on several factors, including blood pressure. Then, after a six-year follow-up, the investigators determined whether or not they'd developed coronary heart disease (CHD). A sample of the 2,282 men in the original study is shown in Table 1. The research question is "Are the CHD proportions the same across the levels of blood pressure?"
Table 1. Framingham Men by Blood Pressure and CHD
Blood Pressure   CHD Present   CHD Absent   Proportion CHD
< 117                  3           153          0.0192
117-126               17           235          0.0675
127-136               12           272          0.0423
137-146               16           255          0.0590
147-156               12           127          0.0863
157-166                8            77          0.0941
167-186               16            83          0.1616
> 186                  8            35          0.1860
Total                 92          1237          0.0692
Second Case Study
Mental Health and SES: To explore the relationship between socio-economic status (SES) and mental health impairment, Srole et al. sampled residents of Manhattan in the early 1970s. A person's parents' SES was measured by a combination of factors including income level and the type of work performed. It is rated from high to low in 6 levels (Table 2). Mental health status is a global measure of functioning, described as well, mild symptoms, moderate symptoms, and functionally impaired. The question is "Is there a relationship between SES and mental health impairment?"
Table 2. Mental Health Status and Parents' Socioeconomic Status (SES)
                    Mental Health Status
SES         Well   Mild   Moderate   Impaired   Total
A (high)      64     94      58         46       262
B             57     94      54         40       245
C             57    105      65         60       287
D             72    141      77         94       384
E             36     97      54         78       265
F (low)       21     71      54         71       217
Total        307    602     362        389      1660
Copyright Stacey S. Cofield, 19 September 2004. All rights reserved. AMB denotes material reprinted courtesy of Al Best, Virginia Commonwealth University
CHD Case Study
We begin with the steps for comparing multiple proportions. This is an extension of the case where we compared two proportions.
Phase 1: State the Question
1. Evaluate and describe the data
Display the results in a contingency table (see Table 1). Calculate proportions for the event of interest in each population. Note that the table is backwards from that shown in the Section 3 handout. In the 2x2 contingency table, the usual standard in epidemiology is that the columns are outcomes and the rows are populations. When you have a number of populations (we have 8 BP groups) and a binary outcome, it makes more sense to have a table with a number of rows and just a few columns.
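The proportions in Table 1 can be reproduced with a few lines of Python. This is only a sketch of the arithmetic using the counts from Table 1; the variable names are chosen for illustration.

```python
groups = ["< 117", "117-126", "127-136", "137-146",
          "147-156", "157-166", "167-186", "> 186"]
present = [3, 17, 12, 16, 12, 8, 16, 8]        # CHD present, from Table 1
absent = [153, 235, 272, 255, 127, 77, 83, 35]  # CHD absent, from Table 1

# Proportion with CHD in each blood pressure group.
proportions = [p / (p + a) for p, a in zip(present, absent)]

# Overall proportion with CHD, ignoring blood pressure.
overall = sum(present) / (sum(present) + sum(absent))

for group, prop in zip(groups, proportions):
    print(f"{group:8s} {prop:.4f}")
print(f"{'Total':8s} {overall:.4f}")
```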
2. Review assumptions
Assess representativeness, independence, and sample size. For the sample size assumption to be met, n must be large enough that we expect to observe at least 5 subjects with the event (here, CHD present) and at least 5 without it (CHD absent) in each group. As before, we are looking at expected frequencies, so we need a null hypothesis.
Copyright Stacey S. Cofield, 05 October 2004. All rights reserved. Material reprinted courtesy of Al Best, Virginia Commonwealth University 2004, some examples may have been altered for this lecture.
3. State the question in the form of hypotheses
Recall that the research question was "Are the CHD proportions the same across the levels of blood pressure?" Stated as no differences, the null hypothesis is just:
$H_0: p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = p_7 = p_8$
(i.e., all CHD proportions are equal across the 8 BP groups; the proportions are homogeneous). And so the "any difference" alternative hypothesis is:
$H_A: p_i \neq p_j$ for at least one $i$ and $j$
(i.e., at least one pair is different). This is the same test of homogeneity we saw in the two-proportion case; here we extend it to as many groups as necessary.
Calculating expected frequencies
Again, if all the group proportions are equal, then they are equal to a common value, $p$. The observed proportion with CHD in the whole study (ignoring BP) is:
$\hat{p} = 92 / 1329 = 0.0692$
If the null hypothesis is true, then the CHD proportion should be a constant 0.0692 in every blood pressure group. To obtain the expected frequency in the Coronary Disease = Present column, simply multiply $\hat{p}$ by the row total (the number of subjects in a blood pressure group). Subtract this expected frequency from the row total to get the expected frequency in the Coronary Disease = Absent column (Table 3). That is, the first expected frequency (10.8) is 0.0692 × 156. Needless to say, JMP will calculate these for you:
Table 3. Observed and Expected Cell Counts, CHD Example
             CHD Absent             CHD Present
Group        Count    Expected      Count    Expected      Total
< 117          153      145.2           3       10.8         156
117-126        235      234.6          17       17.4         252
127-136        272      264.3          12       19.7         284
137-146        255      252.2          16       18.8         271
147-156        127      129.4          12        9.6         139
157-166         77       79.1           8        5.9          85
167-186         83       92.1          16        6.9          99
> 186           35       40.0           8        3.0          43
Total         1237                     92                   1329
Note that in one of the 16 cells, the expected frequency is less than 5. Actually, the real rule of 5 is that no more than 20% of the cells may have expected frequencies less than 5, so we don't really have a problem here. However, there are three choices when this happens:
1. ignore the possibility of a problem,
2. collapse cells, or
3. use more complex statistical methods.
If just one cell is small, then many will choose to ignore the problem. We could collapse the 167-186 and >186 blood pressure groups. Or, there are more complex statistical methods available; consult with a statistician on exact methods.
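The expected counts and the rule-of-5 check can be sketched in Python as follows. The counts come from Table 1; the variable names are illustrative, and the code only mirrors the hand calculation described above rather than JMP's implementation.

```python
present = [3, 17, 12, 16, 12, 8, 16, 8]        # observed CHD present
absent = [153, 235, 272, 255, 127, 77, 83, 35]  # observed CHD absent

row_totals = [p + a for p, a in zip(present, absent)]
n = sum(row_totals)
p_common = sum(present) / n  # common proportion under the null hypothesis

# Expected count = common proportion times the row total.
expected_present = [p_common * total for total in row_totals]
# The Absent expected count is whatever remains of the row total.
expected_absent = [total - e for total, e in zip(row_totals, expected_present)]

all_expected = expected_present + expected_absent
small_cells = [e for e in all_expected if e < 5]
share_small = len(small_cells) / len(all_expected)

print([round(e, 1) for e in expected_present])
print([round(e, 1) for e in expected_absent])
print(len(small_cells), "of", len(all_expected), "cells below 5")
```

Here share_small is the fraction of cells with an expected count under 5, which is the quantity the 20% rule refers to.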
Phase 2: Decide How to Answer the Question
4. Decide on a summary statistic that reflects the question.
Recall that in the two-proportion case we used the observed difference: $\hat{p}_1 - \hat{p}_2$
This approach just doesn't work well here; there are too many pairwise differences (with 8 groups there are actually 28). It turns out that the chi-square statistic we used in the earlier short-cut method is best. It compares the frequencies we observed to what we'd expect if the group proportions were all the same. (Recall also that the chi-square is just the square of z.) The general form of the chi-square statistic is as follows:
$$\chi^2 = \sum_{\text{all cells } i} \frac{(O_i - E_i)^2}{E_i}$$
where we are summing across all the cells (in our example, 16), $O_i$ is the observed frequency in cell $i$, and $E_i$ is the expected frequency.
Question: If the null hypothesis is true (each $O_i$ is equal to $E_i$), then what do you expect the $\chi^2$ value to equal?
Question: If the null hypothesis is not true (each $O_i$ is different from $E_i$), then what do you expect the $\chi^2$ value to equal?
Keep these questions in mind as you try to interpret chi-square values. In any event, to calculate the chi-square statistic we need expected frequencies.
Example Calculation
Let's follow through one example. From Table 3 we see all 16 $O_i$ and $E_i$ values. In Table 4, we can see $(O_i - E_i)$ for all cells:
Table 4. Observed minus Expected, CHD Example
Group        CHD Absent   CHD Present
< 117            7.8          -7.8
117-126          0.4          -0.4
127-136          7.7          -7.7
137-146          2.8          -2.8
147-156         -2.4           2.4
157-166         -2.1           2.1
167-186         -9.1           9.1
> 186           -5.0           5.0
The squared differences are shown in Table 5. Notice that some of these differences are large. But be careful in interpreting these, because the size of these squared differences depends on the expected values of the cell.
Table 5. Squared Difference of Observed minus Expected, CHD Example
Group        CHD Absent   CHD Present
< 117           60.84        60.84
117-126          0.16         0.16
127-136         59.29        59.29
137-146          7.84         7.84
147-156          5.76         5.76
157-166          4.41         4.41
167-186         82.81        82.81
> 186           25.00        25.00
The cell chi-square values are shown in Table 6. These values are more interpretable. That is, values well above zero indicate a departure from the null-hypothesis values. Because each squared difference is divided by the cell's expected frequency, the CHD Present cells (with small expected counts) contribute the most.
Table 6. Cell Chi-Squared Values, CHD Example
Group        CHD Absent   CHD Present
< 117           0.419        5.633
117-126         0.001        0.011
127-136         0.222        2.984
137-146         0.030        0.406
147-156         0.044        0.588
157-166         0.057        0.761
167-186         0.908       12.208
> 186           0.630        8.477
The sum over all the cells (total over all $i$) is 33.378 (same as the JMP output). And so we see how the observed chi-square value, $\chi^2 = 33.378$, is calculated. Our intuition is that this is a big number since, under the null hypothesis, we expect the chi-square to be nearer to zero. JMP will calculate these cell chi-square values for you; see the options next to Contingency Table.
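The whole calculation (expected counts, squared differences, cell chi-squares, and their sum) can be sketched in Python. The function name pearson_chi_square is an illustrative choice, and the code follows the general formula above using unrounded expected counts, so individual cells may differ slightly from hand calculations based on the rounded values in Tables 4 and 5.

```python
def pearson_chi_square(table):
    """Pearson chi-square for a table given as rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under homogeneity: row total * column total / n.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Rows are blood pressure groups; columns are (CHD absent, CHD present).
chd_table = [
    [153, 3], [235, 17], [272, 12], [255, 16],
    [127, 12], [77, 8], [83, 16], [35, 8],
]

chi2 = pearson_chi_square(chd_table)
print(round(chi2, 3))
```

The result should agree with the value of 33.378 reported above, up to rounding.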
5. How could random variation affect that statistic?
The $\chi^2$ value reflects the overall agreement between the observed and expected frequencies. If H0 is true, the differences between the observed and expected frequencies will be small and reflect only random variation. If the assumptions are met, then $\chi^2$ follows a chi-square distribution. If H0 is not true, then the differences will yield a $\chi^2$ well above zero.
Degrees of Freedom
In order to calculate a p-value for the observed $\chi^2$, we need one more thing. The chi-square distribution changes, depending on the number of cells in the contingency table. What it depends upon is the number of independent terms added together. In a contingency table with r rows and c columns, the degrees of freedom for the chi-square distribution is:
$df = (r - 1)(c - 1)$
In the CHD example, there were r = 8 rows and c = 2 columns, so
$df = (8 - 1)(2 - 1) = 7 \times 1 = 7$
The shape of the chi-square distribution is very skewed for small df, and it becomes more like the normal distribution with larger df (Figure 1). Critical values for the chi-square distribution are given in Daniel's appendix Table F. The tabled critical value for df = 7 and $(1 - \alpha) = 0.95$ is 14.067.
Figure 1. Chi-Square Distribution for Various Degrees of Freedom (df)
However, we'll use software to calculate p-values.
6. State a decision rule, using p-values, to answer the question
The generic decision rule: Reject H0 if p-value < 0.05. (Or, using the critical value determined from the table, the rule would be: Reject H0 if χ² > 14.067.) P-values will be calculated by determining the probability of a chi-square as large or larger than what we observed. In this example, p-value = Prob(χ² ≥ 33.378, df = 7).
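The tail probability can also be checked outside JMP. The sketch below uses only Python's standard library and a recurrence for the chi-square upper tail that holds for whole-number df. The function name chi_square_upper_tail is illustrative (it is not a JMP or course function); statistical packages provide this calculation built in.

```python
import math


def chi_square_upper_tail(x, df):
    """Return Prob(chi-square >= x) for a positive whole-number df."""
    half_x = x / 2.0
    if df % 2 == 0:
        a = 1.0                                  # start from df = 2
        tail = math.exp(-half_x)
    else:
        a = 0.5                                  # start from df = 1
        tail = math.erfc(math.sqrt(half_x))
    # Step up two degrees of freedom at a time:
    # Q(a + 1, x/2) = Q(a, x/2) + (x/2)^a * exp(-x/2) / Gamma(a + 1)
    while a < df / 2.0:
        tail += half_x ** a * math.exp(-half_x) / math.gamma(a + 1.0)
        a += 1.0
    return tail


# The tabled critical value should leave about 0.05 in the upper tail.
alpha_at_critical = chi_square_upper_tail(14.067, 7)
# p-value for the observed statistic in the CHD example.
p_value = chi_square_upper_tail(33.378, 7)

print(f"Prob(chi-square >= 14.067, df = 7) = {alpha_at_critical:.4f}")
print(f"Prob(chi-square >= 33.378, df = 7) = {p_value:.6f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
```

Both forms of the decision rule lead to the same answer here: the observed value exceeds the critical value, and equivalently the p-value falls below 0.05.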
Phase 3: Answer the Question
7. Calculate the statistic
Calculate χ² by the method above or by using software. Use tabled critical values or use software to calculate the p-value. JMP IN will calculate the p-values automatically. The Fit Y by X platform automatically calculates the tests in Figure 1. Let's concentrate on how to read the report; the important parts are shown in bold.
Tests

Source       DF     -LogLike    RSquare (U)
Model         7     15.01128    0.0449
Error      1321    319.40443
C. Total   1328    334.41572
N          1329

Test                ChiSquare   Prob>ChiSq
Likelihood Ratio       30.023       <.0001
Pearson                33.378       <.0001
The chi-square value we use is called a Pearson chi-square after Karl Pearson. The three pieces you need to be able to extract from the print-out are as follows:

The chi-square value. Most people use the Pearson ChiSquare value: 33.378. When you write this up, use at most 2 decimal places. (Actually, the likelihood-ratio chi-square is better. When they disagree, believe the LR chi-square.)
The degrees of freedom. Use what is under DF: 7.
The p-value. Use the corresponding Prob>ChiSq: <.0001. Use 4 decimal places of precision.
When you write this up, combine the pieces together in the text as follows: chi-square = 33.4, df = 7, p-value < 0.0001. Or, if you have a word processor that can do Greek symbols and both subscripts and superscripts: χ² = 33.4, df = 7, p < 0.0001. Note that you only need to report chi-square values to one (or two) decimal places. And, when a p-value p cannot be confused with a proportion p, we sometimes leave off the "-value" of "p-value".
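If you report many tests, a small helper can keep the wording consistent. This is only a sketch of the reporting convention described above; the function name and its use_symbol switch are hypothetical, and the p-values passed in below are placeholders standing for any value under 0.0001.

```python
def format_chi_square_report(chi_square, df, p_value, use_symbol=False):
    """Build the one-line summary: statistic, df, and p-value."""
    label = "\u03c7\u00b2" if use_symbol else "chi-square"
    p_label = "p" if use_symbol else "p-value"
    # Very small p-values are reported as an inequality, as in the JMP output.
    if p_value < 0.0001:
        p_text = f"{p_label} < 0.0001"
    else:
        p_text = f"{p_label} = {p_value:.4f}"
    # One decimal place is enough for the chi-square value itself.
    return f"{label} = {chi_square:.1f}, df = {df}, {p_text}"


print(format_chi_square_report(33.378, 7, 0.00002))
print(format_chi_square_report(33.378, 7, 0.00002, use_symbol=True))
```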
8. Make a statistical decision
Either fail to reject the null-hypothesis or reject the null-hypothesis of no difference. Is the p-value less than 0.05? Yes. Then reject the null-hypothesis.
9. State the substantive conclusion
So, we reject the null-hypothesis and conclude that the CHD proportions in the blood pressure groups are not homogeneous (they are different).
Phase 4: Communicate the Answer to the Question
10. Document our understanding with text, tables, or figures
A cross-tabulation table with frequencies and proportions, as shown in Table 1, is an excellent summary. A summary sentence should include the chi-square value, its df, and the p-value testing whether the proportions are all the same. So, here is a sample write-up:

A sample of n = 1329 male residents of Framingham, MA, were classified according to blood pressure and the presence or absence of Coronary Heart Disease (CHD). The frequency counts and the proportion of CHD in each BP group are shown in Table 1. The homogeneity of these proportions was tested using a chi-square test. The proportion of CHD cases was found to be different across the eight blood pressure groups (χ² = 33.4, df = 7, p < 0.0001). The trend in the proportions was such that those with lower blood pressure had less likelihood of CHD than those with higher blood pressure. For instance, approximately 2% of individuals in the lowest BP group had CHD whereas approximately 19% of individuals in the highest BP group had CHD.
Mental Health and SES
In the previous section we saw how to compare multiple proportions. The contingency table had 8 rows and 2 columns. In the general case, we're interested in any two-way contingency table. This is a very common situation. In Table 2 there are r = 6 rows and c = 4 columns. There are two nominal variables recorded:
Parents' SES: from A through F
Mental Health: from well through impaired
(Actually, you can argue that both of these variables are ordinal. However, the end-result chi-square value and its interpretation are the same. So it's OK to ignore this point.)
Phase 1: State the Question
1. Evaluate and describe the data
Display the results in a contingency table (see Table 2). In this example, we are interested in whether there is a relationship between the two classification variables. Should we calculate proportions? As we saw in the previous section, we can calculate row proportions, column proportions, whole-table proportions. Any of these proportions may be useful, depending upon the question, but usually the whole-table proportions are best.
2. Review assumptions
Assess representativeness, independence, and sample size. Do the calculation of expected sample size. If the other assumptions are met but an expected n is too small then you may need to collapse cells.
3. State the question in the form of hypotheses
The null-hypothesis is that SES status and mental health status are independent. Informally, independence means unrelated. That is, if you knew someone's SES then this would give you no useful information to predict their mental health status. Or, equivalently, if you knew someone's mental health status, this would give you no useful information to predict their SES. Consider how the contingency table would look if the rows and columns were independent (Table 7).

Table 7. Mental Health Status

                     Mental Health Status
SES           Well    Mild   Moderate  Impaired   Total   Proportion
A (high)        ?       ?       ?         ?         262      0.158
B               ?       ?       ?         ?         245      0.148
C               ?       ?       ?         ?         287      0.173
D               ?       ?       ?         ?         384      0.231
E               ?       ?       ?         ?         265      0.160
F (low)         ?       ?       ?         ?         217      0.131
Total          307     602     362       389       1660      1.000
Proportion   0.185   0.363   0.218     0.234
Question: Consider the SES = A row (n = 262) and assume independence. What would the row proportions be?
Question: Consider the SES = B row (n = 245) and assume independence. What would the row proportions be?
Question: Consider the SES = F row (n = 217) and assume independence. What would the row proportions be?
Homogeneous row proportions
If the rows and columns are independent then these row proportions are homogeneous (Table 8).

Table 8. Mental Health Status with Homogeneous Row Proportions

                     Mental Health Status
SES           Well    Mild   Moderate  Impaired   Total   Row Proportion
A (high)     0.185   0.363   0.218     0.234        262       1.000
B            0.185   0.363   0.218     0.234        245       1.000
C            0.185   0.363   0.218     0.234        287       1.000
D            0.185   0.363   0.218     0.234        384       1.000
E            0.185   0.363   0.218     0.234        265       1.000
F (low)      0.185   0.363   0.218     0.234        217       1.000
Total          307     602     362       389       1660
Proportion   0.185   0.363   0.218     0.234                  1.000
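The homogeneous row proportions in Table 8 come straight from the column totals. A minimal Python sketch, using the totals from the table margins (the variable names are illustrative):

```python
# Column totals and row totals from the margins of the table.
column_totals = {"Well": 307, "Mild": 602, "Moderate": 362, "Impaired": 389}
row_totals = {"A (high)": 262, "B": 245, "C": 287, "D": 384, "E": 265, "F (low)": 217}
n = sum(column_totals.values())  # whole-table n

# Under independence, every SES row has the column-marginal proportions.
marginal = {status: count / n for status, count in column_totals.items()}
expected_row_proportions = {ses: dict(marginal) for ses in row_totals}

print(f"{'SES':<9} " + "  ".join(f"{s:>8}" for s in column_totals))
for ses, proportions in expected_row_proportions.items():
    cells = "  ".join(f"{p:8.3f}" for p in proportions.values())
    print(f"{ses:<9} {cells}")
```

The row totals play no role in the proportions themselves. They only matter later, when the proportions are turned into expected counts.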
Note that independence does not state that all the proportions are equal to a single common p. Independence states that the proportions calculated separately for each row are the same as the column-marginal proportions. If you want to think of it in terms of null- and alternative-hypotheses, the null is as follows:

Pr(Well in SES A group) = Pr(Well in SES B group) = ... = Pr(Well in SES F group); AND
Pr(Mild in SES A group) = ... = Pr(Mild in SES F group); AND
Pr(Moderate in SES A group) = ... = Pr(Moderate in SES F group); AND
Pr(Impaired in SES A group) = ... = Pr(Impaired in SES F group).

The alternative is that at least one of these statements of equal probabilities is not true. That is, what we're testing is this: If the expected row proportions in the population are as above, how likely is it that we'd observe the actual row proportions that we did observe (Table 9)?
Table 9. Observed Row Proportions, Parents' Mental Health Status and SES

                     Mental Health Status
SES Status    Well    Mild   Moderate  Impaired   Row sum   Total N
A (high)     0.244   0.359   0.221     0.176       1.000       262
B            0.233   0.384   0.220     0.163       1.000       245
C            0.199   0.366   0.226     0.209       1.000       287
D            0.188   0.367   0.201     0.245       1.000       384
E            0.136   0.366   0.204     0.294       1.000       265
F (low)      0.097   0.327   0.249     0.327       1.000       217
Total N        307     602     362       389                  1660
Proportion   0.185   0.363   0.218     0.234
Homogeneous column proportions
If it's more convenient, we can phrase the question in terms of column proportions (Table 10).
Question: Consider the Mental Health = Well column (n = 307) and assume independence. What would the column proportions be?
Question: Consider the Mental Health = Impaired column (n = 389) and assume independence. What would the column proportions be?

Table 10. Independence as a Statement of Homogeneity of Column Proportions, Parents' Mental Health Status and SES

                     Mental Health Status
SES Status    Well    Mild   Moderate  Impaired   Total N   Proportion
A (high)     0.158   0.158   0.158     0.158         262      0.158
B            0.148   0.148   0.148     0.148         245      0.148
C            0.173   0.173   0.173     0.173         287      0.173
D            0.231   0.231   0.231     0.231         384      0.231
E            0.160   0.160   0.160     0.160         265      0.160
F (low)      0.131   0.131   0.131     0.131         217      0.131
Col. sum     1.000   1.000   1.000     1.000                  1.000
Total N        307     602     362       389        1660
Note that independence does not state that all the proportions are equal to a single p. Independence states that the proportions calculated separately for each column are the same as the row-marginal proportions. Again, if you want to think of it in terms of null- and alternative-hypotheses, the null is as follows:

Pr(SES A in Well group) = Pr(SES A in Mild group) = ... = Pr(SES A in Impaired group); AND
Pr(SES B in Well group) = ... = Pr(SES B in Impaired group); AND
Pr(SES C in Well group) = ... = Pr(SES C in Impaired group); AND
Pr(SES D in Well group) = ... = Pr(SES D in Impaired group); AND
Pr(SES E in Well group) = ... = Pr(SES E in Impaired group); AND
Pr(SES F in Well group) = ... = Pr(SES F in Impaired group).

The alternative is that at least one of these statements of equal probabilities is not true. So what we're testing is this: If the expected column proportions in the population are as above, how likely is it that we'd observe the actual column proportions that we did observe (Table 11)?

Table 11. Observed Column Proportions, Parents' Mental Health Status and SES

                     Mental Health Status
SES Status    Well    Mild   Moderate  Impaired   Total   Proportion
A (high)     0.208   0.156   0.160     0.118        262      0.158
B            0.186   0.156   0.149     0.103        245      0.148
C            0.186   0.174   0.180     0.154        287      0.173
D            0.235   0.234   0.213     0.242        384      0.231
E            0.117   0.161   0.149     0.201        265      0.160
F (low)      0.068   0.118   0.149     0.183        217      0.131
Col. sum     1.000   1.000   1.000     1.000                 1.000
Total N        307     602     362       389       1660
Independence of whole-table proportions
Or, if it's more convenient, we can phrase the question in terms of whole-table proportions.
Question: Now consider the SES = A row (n = 262), the Mental Health = Well column (n = 307), and the total study n = 1660. Assuming independence, what should the whole-table proportion be for the cell in the upper-left corner? In order to answer this, we need a more formal definition of independence.
Independence
Consider the probability of a subject being classified as Mental Health = Well and SES = A. (That is, both events occur.) We know the following:

Probability (Mental Health = Well) = 307/1660 = 0.185.
Probability (SES = A) = 262/1660 = 0.158.

If these two events are independent then:

Probability (Mental Health = Well and SES = A) = Probability (Mental Health = Well) × Probability (SES = A)

Two events A and B are independent if and only if

$$P(A \text{ and } B) = P(A) \times P(B)$$
If Mental Health and SES are independent, then: Probability (Mental Health = Well and SES = A) = 0.185 × 0.158 = 0.029. So, if the rows and columns are independent, then we'd expect to see the proportions shown below (Table 12).

Table 12. Whole-table Proportions Under Independence, Mental Health Status & SES

                     Mental Health Status
SES Status    Well    Mild   Moderate  Impaired   Total   Proportion
A (high)     0.029   0.057   0.034     0.037        262      0.158
B            0.027   0.054   0.032     0.035        245      0.148
C            0.032   0.063   0.038     0.041        287      0.173
D            0.043   0.084   0.050     0.054        384      0.231
E            0.030   0.058   0.035     0.037        265      0.160
F (low)      0.024   0.047   0.029     0.031        217      0.131
Total N        307     602     362       389       1660      1.000
Proportion   0.185   0.363   0.218     0.234
Note that independence does not state that all the proportions are equal to a single p.
Independence states that cell proportions are the same as the product of the two marginal proportions. So what we're testing is this: If the expected whole-table proportions in the population are as above, how likely is it that we'd observe the actual whole-table proportions that we did observe (Table 13)?

Table 13. Observed Whole Table Proportions, Mental Health Status & SES
Bottom line: the p-value will tell you whether there is non-independence in the rows and columns. Tell the story using row, column, or whole-table proportions, whichever is easiest.
Phase 2: Decide How to Answer the Question
4. Decide on a summary statistic that reflects the question.
As in the CHD example where we were comparing multiple proportions, the summary statistic we use is chi-square:

$$\chi^2 = \sum_{\text{all cells } i} \frac{(O_i - E_i)^2}{E_i}$$
As before, the terms in the summation are the observed and expected frequencies. In our example, we can calculate the expected frequencies by multiplying the proportion in Table 12 by the whole-table n = 1660. See Table 14 (JMP table).
Table 14. Expected Frequencies, Parents' Mental Health Status and SES

                     Mental Health Status
SES Status    Well    Mild   Moderate  Impaired   Total
A (high)      48.5    95.0    57.1      61.4        262
B             45.3    88.8    53.4      57.4        245
C             53.1   104.1    62.6      67.3        287
D             71.0   139.3    83.7      90.0        384
E             49.0    96.1    57.8      62.1        265
F (low)       40.1    78.7    47.3      50.9        217
Total N        307     602     362       389       1660
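The expected frequencies can be reproduced from the table margins alone. The sketch below (plain Python, illustrative variable names) multiplies the two marginal proportions to get the whole-table proportion under independence, then scales by n, which is the same as row total × column total / n.

```python
row_totals = [262, 245, 287, 384, 265, 217]   # SES A (high) ... F (low)
column_totals = [307, 602, 362, 389]          # Well, Mild, Moderate, Impaired
n = sum(row_totals)                           # whole-table n = 1660

# Whole-table proportion under independence: product of the two margins.
expected_proportion = [
    [(r / n) * (c / n) for c in column_totals] for r in row_totals
]

# Expected frequency: that proportion times n, i.e. r * c / n.
expected_count = [[r * c / n for c in column_totals] for r in row_totals]

for row in expected_count:
    print("  ".join(f"{e:6.1f}" for e in row))
```

Working from the unrounded margins avoids the small rounding differences you would get by multiplying the three-decimal proportions by 1660.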
5. How could random variation affect that statistic?
As before, if H0 is true, then the chi-square value will be near 0. If the assumptions are met, then the statistic is distributed as a chi-square with df = (r − 1)(c − 1). If H0 is not true, then the chi-square value will be large.
6. State a decision rule, using p-values, to answer the question
The generic decision rule: Reject H0 if p-value < 0.05.
Phase 3: Answer the Question
7. Calculate the statistic
Calculate the χ² value with software. Software will then give you the p-value.
Source       DF    -LogLikelihood    RSquare
Model        15         23.7089      0.0080
Error      1640       2922.3833
C. Total   1655       2946.0922
N          1660

Test                ChiSquare   Prob>ChiSq
Likelihood Ratio       47.418       <.0001
Pearson                45.985       <.0001
8. Make a statistical decision
Is the p-value less than 0.05? Yes. Reject the null-hypothesis.
9. State the substantive conclusion
Reject the null-hypothesis of independence and conclude that the proportions are different.
Phase 4: Communicate the Answer to the Question
10. Document our understanding with text, tables, or figures
A cross-tabulation table with frequencies and proportions, as shown in Table 2, is an excellent summary. A summary sentence should include the chi-square value, its df, and the p-value testing whether there is an association between the two classification variables. So, here is (the beginning of) a sample write-up:

An urban sample of adults was classified according to SES and mental health status. The frequency counts are shown in Table 2. A chi-square test of independence was used to determine if there was any relationship between the two. Parents' SES and mental-health impairment were found to be associated (χ² = 46.0, df = 15, p < 0.0001).
Now what? There is a relationship but where? Which proportions are different than which others? Not every proportion is different. In order to point to cells that contribute to this large chi-square value, we conventionally inspect the cell chi-squares and look for large values (roughly, any greater than 2). The cell chi-squares are shown in Table 15.

Table 15. Cell Chi-square Values

SES Status    Well    Mild   Moderate  Impaired
A (high)     4.988   0.011   0.013     3.861
B            3.016   0.299   0.006     5.281
C            0.290   0.008   0.093     0.783
D            0.014   0.022   0.542     0.179
E            3.453   0.008   0.248     4.071
F (low)      9.121   0.752   0.942     7.984
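The cell chi-squares can also be recomputed by hand. The sketch below recovers the observed counts by rounding each observed row proportion in Table 9 times its row total. That reconstruction is an assumption of this example rather than the original data file, although the recovered counts do add up to the published column totals. The "greater than 2" cutoff is the rough rule of thumb given above.

```python
# Observed row proportions (Table 9) and row totals, SES A (high) to F (low).
row_totals = {"A": 262, "B": 245, "C": 287, "D": 384, "E": 265, "F": 217}
row_props = {
    "A": (0.244, 0.359, 0.221, 0.176),
    "B": (0.233, 0.384, 0.220, 0.163),
    "C": (0.199, 0.366, 0.226, 0.209),
    "D": (0.188, 0.367, 0.201, 0.245),
    "E": (0.136, 0.366, 0.204, 0.294),
    "F": (0.097, 0.327, 0.249, 0.327),
}
columns = ("Well", "Mild", "Moderate", "Impaired")

# Assumption: counts recovered as round(proportion * row total).
observed = {
    ses: [round(p * row_totals[ses]) for p in props]
    for ses, props in row_props.items()
}
n = sum(row_totals.values())
col_totals = [sum(observed[ses][j] for ses in observed) for j in range(len(columns))]

# Cell chi-square: (O - E)^2 / E with E = row total * column total / n.
cell = {}
for ses, counts in observed.items():
    for j, o in enumerate(counts):
        e = row_totals[ses] * col_totals[j] / n
        cell[(ses, columns[j])] = (o - e) ** 2 / e

pearson = sum(cell.values())
df = (len(observed) - 1) * (len(columns) - 1)
large = sorted(key for key, value in cell.items() if value > 2)

print(f"Pearson chi-square = {pearson:.3f}, df = {df}")
for ses, status in large:
    print(f"SES {ses}, {status}: {cell[(ses, status)]:.3f}")
```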
The numerically large values are in the corners. This tells us that the proportions that are different than expected are those in the corners (and that within Mild and Moderate and within SES = C and D, there is little difference between the observed and expected proportions). So, now you may write sentences of interpretation to explain the significant chi-square value. There are three ways to write this up, depending upon whether it's convenient for you to think in terms of the row, column, or whole-table proportions. Choose one:

In terms of whole-table proportions (compare the observed proportions in Table 13 with the expected proportions in Table 12): The relationship between SES and mental health is that a higher proportion of subjects have well mental health and higher SES (A and B values), between 3.4% and 3.9%, than is expected (approximately 2.8%). Similarly, a higher proportion of subjects have impaired mental health and low SES (E and F values), between 4.3% and 4.7%, than is expected (approximately 3.4%). This is in contrast to a lower proportion of subjects having impaired mental health and higher SES (approximately 2.6% observed and 3.6% expected) and a lower proportion of subjects having well mental health and lower SES (approximately 1.7% observed and 2.6% expected). Within the mild and moderate mental health groups, the distribution of SES categories was as expected under independence, as was the distribution of mental health groups within the middle SES categories.

In terms of row proportions (compare the observed proportions in Table 9 with the expected proportions in Table 8): The relationship between SES and mental health is observed within the higher SES subjects and also within the lower SES subjects. If there were no relationship between the two characteristics, then within each SES category the expected proportion well is 18.5%. But within higher SES subjects (groups A & B), the observed proportion well is higher (approximately 23.8%). And within lower SES subjects (groups E & F), the observed proportion well is lower (approximately 11.6%). The proportions mild and moderate, calculated separately within all SES groups, do not differ from that expected (36.3% and 21.8%, respectively). However, the proportion impaired is expected to be 23.4% within all SES groups but the observed proportion within the higher SES groups is lower (approximately 16.9%). The observed proportion impaired within the lower SES groups is higher (approximately 31.1%).

In terms of column proportions (compare the observed proportions in Table 11 with the expected proportions in Table 10): The relationship between SES and mental health is observed within the well subjects and also within the impaired mental health subjects. The distribution of SES within the mild and moderate mental health groups is as expected. If there were no relationship between the two characteristics, then within each mental health group the expected proportion SES = A is 15.8% and the expected proportion SES = B is 14.8%. However, within the well mental health group a larger proportion was observed in these two SES categories (20.8% and 18.6%, respectively). And, within the impaired mental health group a lower proportion was observed in these two SES categories (11.8% and 10.3%, respectively). In addition, if there were no relationship between the two characteristics, then within each mental health group the expected proportion SES = E is 16% and the expected proportion SES = F is 13.1%. However, within the well mental health group a smaller proportion was observed in these two SES categories (11.7% and 6.8%, respectively). And, within the impaired mental health group a higher proportion was observed in these two SES categories (20.1% and 18.3%, respectively).

Again, you do not want to write all three of these interpretations. Choose one approach (row proportions, column proportions, or whole-table proportions), whichever seems most natural.
Summary
So, how should you analyze frequency counts in practice? Here's the recipe:
Phase 1: State the Question
1. Evaluate and describe the data
Display the results in a contingency table. Perhaps calculate proportions, depending on what is of interest. Is the question testing homogeneity or independence?
2. Review assumptions
Assess representativeness, independence, and sample size. Do the calculation of expected sample size. If the other assumptions are met but an expected n is too small then consider collapsing.
3. State the question in the form of hypotheses
Homogeneity compares proportions from different groups and asks "are they the same or different?" Independence looks for relationships between classification variables and asks "is there a relationship?"
Phase 2: Decide How to Answer the Question
4. Decide on a summary statistic that reflects the question.
A chi-square statistic. State the form of the statistic.
5. How could random variation affect that statistic?
If H0 is true, the chi-square will be near 0. If the assumptions are met, then it is distributed as a chi-square distribution with df = (r − 1)(c − 1). If H0 is not true, then chi-square values will be large.
6. State a decision rule, using p-values, to answer the question
Reject H0 if p-value < 0.05.
Phase 3: Answer the Question
7. Calculate the statistic
Calculate the χ² value and determine the p-value.
8. Make a statistical decision
Is the p-value less than 0.05?
9. State the substantive conclusion
Either fail to reject the null-hypothesis or reject the null-hypothesis.
Phase 4: Communicate the Answer to the Question
10. Document our understanding with text, tables, or figures
Show the contingency table and, if useful, the proportions. If there was a significant result, then inspect the cell chi-squares to point you to what is interesting. Write a sentence with your conclusion and include the chi-square value, the df, and the p-value. If there was a significant result, write sentences about the departures from independence (and it's also often useful to write a sentence about where independence holds).
Section 12. Analysis of Frequencies in JMP
Review
We have now expanded the testing of proportions to include comparing multiple proportions. Using a chi-squared (χ²) test, we can determine if there is a difference among the proportions. If there is a difference, we can use cell chi-squared values to determine where the departures from what was expected occur.
We have also discussed the difference between testing for homogeneity and testing for independence. A test for homogeneity is testing if the proportions are different and a test for independence is testing for a relationship between the classification (predictor and outcome) variables. The chi-squared results will be identical but the theory and interpretation (words used to interpret the results) are slightly different. Note: for more information, see the Equivalence of Methods handout.

Multiple Comparisons
It is possible to see a non-significant overall test and still be concerned about testing two or more proportions within the group. For example, say we are testing the proportion of relapse among ten different treatment groups, each taking a different drug, and the overall chi-squared test is not significant. But suppose we think that there may be a significant difference among three of the drug groups. How do we test this? How many degrees of freedom does the test have? What significance level do we use? If we are estimating several confidence intervals, what reliability coefficient do we use? There are some internal relationships that need to be considered, and we need to be penalized for assessing multiple relationships while maintaining an overall level of significance. This is commonly addressed when comparing multiple means but the procedures are similar for proportions. We will discuss multiple comparisons and adjustments for multiple comparisons when we learn about testing multiple means in BST 621. In addition, we'll address the issue directly with logistic regression in BST 622.
Copyright Stacey S. Cofield, 07 October 2004. All rights reserved. AMB denotes material reprinted courtesy of Al Best, Virginia Commonwealth University
Analyzing Frequencies Using JMP
Let's use the Vitamin C and Common Cold Example in The Statistical Sleuth (exercise 19.16, ex1916.JMP), 531. In a Canadian experiment, at the beginning of winter, 800 subjects were randomly divided into two groups: a vitamin C group and a placebo group. The vitamin C group received 1000 mg/day and the placebo group received identical pills made from inert ingredients. At the end of the cold season, a blinded physician interviewed each subject and determined if (Yes/No) the subject had experienced a cold (Table 1).

Table 1. Vitamin C versus Placebo and the Common Cold

Group        Cold   No Cold   Totals
Placebo       328        72      400
Vitamin C     296       104      400
Totals        624       176      800
The question of interest is: Is the cold status independent of treatment group? Four hundred subjects received the placebo pill and 328 (82%) caught a cold during the winter cold season. Of the 400 subjects in the vitamin C group, 296 (74%) caught a cold. Is the 74% in the vitamin C group different from the 82% in the placebo group? That is, is there a relationship between treatment group and cold status or are they independent? Let's return to the 10 steps to address these questions. As we address each step, we'll use JMP where necessary.
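Before turning to JMP, the percentages above and the chi-square statistic for this 2x2 table can be sketched in a few lines of Python using the counts from Table 1. This is an illustration only (the variable names are arbitrary), and the erfc shortcut for the p-value applies to df = 1 only.

```python
import math

# Counts from Table 1.
counts = {
    "Placebo": {"Cold": 328, "No Cold": 72},
    "Vitamin C": {"Cold": 296, "No Cold": 104},
}
statuses = ("Cold", "No Cold")

row_totals = {g: sum(row.values()) for g, row in counts.items()}
col_totals = {s: sum(counts[g][s] for g in counts) for s in statuses}
n = sum(row_totals.values())

# Proportion catching a cold within each treatment group.
cold_proportion = {g: counts[g]["Cold"] / row_totals[g] for g in counts}

# Pearson chi-square: sum of (O - E)^2 / E over the four cells.
chi_square = 0.0
for g in counts:
    for s in statuses:
        expected = row_totals[g] * col_totals[s] / n
        chi_square += (counts[g][s] - expected) ** 2 / expected

df = (len(row_totals) - 1) * (len(col_totals) - 1)
p_value = math.erfc(math.sqrt(chi_square / 2.0))  # upper tail, valid for df = 1

for g, p in cold_proportion.items():
    print(f"{g}: {p:.0%} caught a cold")
print(f"chi-square = {chi_square:.2f}, df = {df}, p-value = {p_value:.4f}")
```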
Review Steps for Hypothesis Testing AMB

Phase 1: State the Question
1. Evaluate and describe the data
2. Review the assumptions
3. State the question in the form of hypotheses

Phase 2: Decide How to Answer the Question
4. Decide on a summary number, a statistic, that reflects the question
5. How could random variation affect that statistic?
6. State a decision rule, using p-values, to answer the question

Phase 3: Answer the Question
7. Calculate the statistic
8. Make a statistical decision
9. State the substantive conclusion

Phase 4: Communicate the Answer to the Question
10. Document our understanding with text, tables, or figures
Using JMP
First let's look at the JMP data, as given by The Statistical Sleuth:
The column "knew" is from an example in Chapter 19 that we will not reference in this section, so we need to combine the Yes rows for each group and the No rows for each group. We can do this by using the Summary function in the Tables menu:
You will now see the Summary dialogue box:
Now select treatment as the Group column, then select cold and nocold and choose Sum from the drop-down menu in the Statistics section:
Press OK. This produces a data set with four columns and two rows:
Notice that the column N Rows tells you how many rows of each treatment were in the original dataset (a yes and no row for each treatment). Also notice that there is a small lock icon next to each column name in the columns panel:
This icon indicates that the column information cannot be changed, since it is derived from a main table. You can create new columns with the same data and change the column information. You can also delete columns. We will not need the N Rows column, so let's delete it by first selecting the column:
Go to the Cols menu and select Delete Columns:
This will produce a dataset that looks like the 2x2 contingency table of treatment by cold status:
Prior to doing any analyses or further transformations of the data, save this data so that we can go back to this table in the future. Go to the File menu and select Save As:
Note: Selecting Save will save the table with an unrelated table name in a default location. You need to choose a name and location in the Save As dialogue box that is meaningful to you and/or the data for ease of recall:
Once this dataset is saved, you can close the original dataset. You should always close windows you won't be using for a number of reasons, including ease of movement from window to window and potential computer operation issues.
Now we need to create a dataset that we can use for analysis of the proportions. To do this, we need to Stack the columns into rows. Go to the Tables menu and select Stack:
The Stack dialogue box will appear. You need to select the columns you would like to stack as rows. You can specify the names of the new columns and the name of the output data set produced here, or you can wait until the dataset is created (this is often easier, you can choose an appropriate name once you determine if the stacking results are what you wanted). In the dialogue box, select Sum(cold) and Sum(nocold) and click Stack:
This results in three columns (treatment, ID, and Stack) and four rows (placebo/Sum(cold), placebo/Sum(nocold), vitc/Sum(cold), vitc/Sum(nocold)):
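For readers who want to see what the Stack step does outside of JMP, the reshaping can be sketched in a few lines of Python. This is only an illustration and not JMP's own implementation: the list-of-dictionaries layout and the function name stack_columns are assumptions, as is the reading that the ID column holds the source column name and the Stack column holds the count. The counts are the ones from the 2x2 table.

```python
# Illustrative sketch of a "stack" (wide-to-long) reshape; this is not JMP code.
# The wide table has one row per treatment and one column per outcome count.
wide = [
    {"treatment": "placebo", "Sum(cold)": 328, "Sum(nocold)": 72},
    {"treatment": "vitC", "Sum(cold)": 296, "Sum(nocold)": 104},
]

def stack_columns(rows, columns, id_name="ID", value_name="Stack"):
    """Turn each listed column into its own row, keeping the other fields."""
    stacked = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in columns}
        for col in columns:
            # id_name records which column the value came from (an assumption
            # about how the stacked table is labeled).
            stacked.append({**kept, id_name: col, value_name: row[col]})
    return stacked

long_table = stack_columns(wide, ["Sum(cold)", "Sum(nocold)"])
for record in long_table:
    print(record)
```

Two rows with two stacked columns each give the four rows described above.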
If the Stack procedure produces what you want, then save the dataset and give the columns appropriate names by File > Save As (to save the file) and Cols > Column Info (to change the column names):
You'll notice that the Cold Status options are not what we would like them to be called in any output we may use in a write-up. So let's change the outcome labels to Cold and No Cold. You can do this by typing the changes yourself or you can do a find and replace by Edit > Search > Find:
This will bring up the Find dialogue box, similar to find and replace procedures in word processing programs. Remember that find/replace in JMP is not reversible by using the Edit > Undo Typing option. In this case, retyping the outcomes is easy, but in large datasets with many rows the find/replace option will be easier and you are less likely to make typographical errors. Note: Since the Treatment column is locked, you'll have to make changes in the treatment options (vitC to Vitamin C) in your word processing document. Let's return to the 10 steps:
Step 1. Evaluate and describe the data
We have been given most of the information (including Ns and proportions) from the text, but we can also determine the Ns and proportions from JMP. Choose Analyze > Distribution:
In the Distribution dialogue box, make the following selections, to see the distributions of cold status by treatment group:
This will produce the following output:

Treatment=placebo
Frequencies
Level     Count   Prob
Cold      328     0.82000
No Cold   72      0.18000
Total     400     1.00000

Treatment=vitC
Frequencies
Level     Count   Prob
Cold      296     0.74000
No Cold   104     0.26000
Total     400     1.00000
This gives both the counts (n) for each of the four groups and the proportion of each outcome within each treatment group. From this and the information in the text, we can complete step 1 (and much of step 10, since we'll use the description of the data in the summary of the experiment).
Example: In a Canadian experiment, at the beginning of winter, 800 subjects were randomly divided into two groups: a vitamin C group and a placebo group. The vitamin C group received 1000 mg/day and the placebo group received identical pills made from inert ingredients. At the end of the cold season, a blinded physician interviewed each subject and determined if (Yes/No) the subject had experienced a cold (Table 2). Four hundred subjects received the placebo pill and 328 (82%) caught a cold during the winter cold season. Of the 400 subjects in the vitamin C group, 296 (74%) caught a cold.

Table 2. Vitamin C versus Placebo and the Common Cold
            Outcome Proportion (N)
Group       Cold          No Cold       Total N
Placebo     0.82 (328)    0.18 (72)     400
Vitamin C   0.74 (296)    0.26 (104)    400
Total       0.78 (624)    0.22 (176)    800
Step 2. Review the assumptions
The assumptions to be assessed are representativeness, independence, and sample size.
Representativeness: We assume that the subjects come from a random sample of potential cold sufferers and that the subjects are representative of the larger population of cold sufferers. We will assume the sample is representative.
Independence: We assume that one subject contracting a cold is independent of another subject contracting a cold (we don't have all the information here, but we will assume they didn't use family members or roommates, who would be more likely to influence the result of another subject). The subjects are independent.
Note: For full assessment of the representativeness and independence assumptions, we should consult the actual article, but for ease of discussion we'll assume that they are met.
Sample size: For the sample size assumption to be met, n must be large enough to expect to observe at least 5 colds and 5 non-colds in each group. As before, we are looking at expected frequencies. So we need a null hypothesis.
Step 3. State the question in the form of hypotheses
The investigator wishes to determine if the cold outcome is independent of treatment status. In statistical terms, our hypotheses are:
H0: Treatment status and Cold status are independent
HA: There is a relationship between treatment and cold status
Now that we have a null hypothesis, we can assess the sample size assumption by using JMP to look at the expected cell counts for each group. Use the Analyze > Fit Y by X option. But first, let's review how to preselect column roles. You can do this by selecting a column and then using Cols > Preselect Roles and selecting the role:
Or you can use the column panel, right click on the column and select the role:
Notice that since we have already selected the N column for the Freq role, the Freq option is not available. Only one column per table can be given the Freq role. There is no limit to the number of X and Y roles. Notice also that until a role is specified, the column has a default specification of No Role. Let's return to our dataset with our preselected roles:
Now, use the Analyze > Fit Y by X option and the model will be automatically fit without using the Fit Y by X dialogue box. With the results, use the Expected option in the Contingency Tables options:
Resulting in the expected cell counts under the null hypothesis of equal proportions:
All of the cells have an expected cell count > 5. The sample size assumption is met.
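The expected cell counts can also be checked by hand. The sketch below is one way to do it in Python, assuming the usual independence model (row total times column total divided by the grand total); the dictionary layout and the function name expected_counts are illustrative choices, not part of JMP.

```python
# Observed counts from the 2x2 table: rows are treatments, columns are outcomes.
observed = {
    "placebo": {"Cold": 328, "No Cold": 72},
    "vitC": {"Cold": 296, "No Cold": 104},
}

def expected_counts(table):
    """Expected count for each cell = row total * column total / grand total."""
    row_totals = {r: sum(cells.values()) for r, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for c, n in cells.items():
            col_totals[c] = col_totals.get(c, 0) + n
    grand_total = sum(row_totals.values())
    return {
        r: {c: row_totals[r] * col_totals[c] / grand_total for c in col_totals}
        for r in table
    }

expected = expected_counts(observed)
print(expected)  # each cell should exceed 5 for the sample size assumption
```

Because both groups have 400 subjects, the expected counts are the same in each group: 312 colds and 88 non-colds.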
4. Decide on a summary number (a statistic) that reflects the question
For testing independence between two classification variables, we'll use a chi-squared test statistic:

\[ \chi^2 = \sum_{\text{all cells } i} \frac{(O_i - E_i)^2}{E_i} \]

where O_i is the observed count and E_i is the expected count in cell i.
5. How could random variation affect that statistic?
Under the null hypothesis, the variables are independent and we expect small chi-squared values, near zero. Large values of the chi-squared statistic will lead to rejecting the null hypothesis.
6. State a decision rule, using p-values, to answer the question
If the associated p-value < 0.05, we will reject the null hypothesis of independence.
7. Calculate the statistic
Since we have already specified our column roles and used the Fit Y by X option to determine the expected cell frequencies for the sample size assumption in Step 2, we can determine the Pearson chi-squared value without further work in JMP. The test is automatically run when you Fit Y by X:
Tests
Source     DF    -LogLike     RSquare (U)
Model      1     3.74621      0.0089
Error      798   417.78016
C. Total   799   421.52637
N          800

Test               ChiSquare   Prob>ChiSq
Likelihood Ratio   7.492       0.0062
Pearson            7.459       0.0063

Fisher's Exact Test   Prob     Alternative Hypothesis
Left                  0.9976   Prob(Cold Status=No Cold) is greater for Treatment=placebo than vitC
Right                 0.0040   Prob(Cold Status=No Cold) is greater for Treatment=vitC than placebo
2-Tail                0.0080   Prob(Cold Status=No Cold) is different across Treatment

\[ \chi^2_1 = 7.459, \quad p = 0.0063 \]
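As a cross-check on the JMP report, the Pearson statistic can be computed directly from the observed counts. The following is a minimal sketch in Python; the function names are illustrative, and the p-value shortcut (the complementary error function) is valid only for 1 degree of freedom, which is the case for a 2x2 table.

```python
import math

observed = [[328, 72], [296, 104]]  # rows: placebo, vitC; columns: Cold, No Cold

def pearson_chi_square(table):
    """Sum (O - E)^2 / E over all cells, with E from the independence model."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / total
            stat += (obs - exp) ** 2 / exp
    return stat

def p_value_df1(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

chi2 = pearson_chi_square(observed)
p = p_value_df1(chi2)
print(round(chi2, 3), round(p, 4))
```

The result should agree with the Pearson row of the JMP output to the precision shown there.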
8. Make a statistical decision
The p-value = 0.0063, which is < 0.05. We reject the null hypothesis.
9. State the substantive conclusion
The classification variables, treatment and cold status, are not independent. There is a relationship between treatment status and cold status.
Prior to completing step 10, it is of interest to determine the extent of the relationship between the two variables. To do this, we could perform a one- or two-sided test of the proportions to determine where the difference between the two proportions lies. But recall that the test for independence is also a two-sided test for homogeneity. That is, we could have set up the hypotheses as follows:
H0: p(vitC, cold) = p(placebo, cold)
HA: p(vitC, cold) ≠ p(placebo, cold)
From the earlier result, we rejected the null hypothesis of equality, in favor of the hypothesis of inequality between the proportions. Let's calculate a confidence interval about the proportion of subjects with a cold in the Vitamin C group. First, remove the preselected roles of X and Y for the Treatment and Cold Status columns, respectively. We need to do this in order to specify a By group in the Distribution dialogue:
Specifying the By group will show us the distribution of Cold Status by Treatment, as we did for step 1. Click OK. This returns the results from step 1. What we want to do now is determine the confidence interval about the proportion. To do this, use the options next to Cold Status for Treatment=vitC and select Confidence Interval > 0.95:
The results are as follows:
We are 95% confident that the true proportion of cold sufferers taking Vitamin C is between 69 and 78%. This confidence interval does not contain the proportion of placebo cold sufferers (82%).
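A confidence interval of this kind can be approximated by hand. The sketch below uses the score (Wilson) interval for a single proportion; that this matches the method behind the JMP report is an assumption, and the function name wilson_interval is illustrative. Other interval methods give slightly different endpoints.

```python
import math
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Score (Wilson) confidence interval for a single proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width

low, high = wilson_interval(296, 400)  # colds in the vitamin C group
print(round(low, 2), round(high, 2))
```

Under this assumption the endpoints round to the 69% and 78% quoted above.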
10. Document our understanding with text, tables, or figures
In a Canadian experiment, at the beginning of winter, 800 subjects were randomly divided into two groups: a vitamin C group and a placebo group. The vitamin C group received 1000 mg/day and the placebo group received identical pills made from inert ingredients. At the end of the cold season, a blinded physician interviewed each subject and determined if (Yes/No) the subject had experienced a cold (Table 3). Four hundred subjects received the placebo pill and 328 (82%) caught a cold during the winter cold season. Of the 400 subjects in the vitamin C group, 296 (74%) caught a cold.

Table 3. Vitamin C versus Placebo and the Common Cold
            Outcome Proportion (N)
Group       Cold          No Cold       Total N
Placebo     0.82 (328)    0.18 (72)     400
Vitamin C   0.74 (296)    0.26 (104)    400
Total       0.78 (624)    0.22 (176)    800
There is a relationship between the treatment group (placebo, vitamin C) and cold status (chi-squared = 7.459, df = 1, p = 0.0063). The probability of suffering a cold is related to the treatment group. The 95% confidence interval about the true proportion of vitamin C cold sufferers is 0.69 to 0.78; this does not include the observed proportion of placebo cold sufferers, 0.82. The analyses show that the proportion of subjects suffering from a cold differs between the groups and is likely lower for the subjects taking vitamin C than for the subjects taking placebo.
Review
Using the 10 steps of hypothesis testing, we have compared a single observed proportion to an assumed proportion, using a z-score to test one- and two-sided hypotheses. We have also used a z-score to test one- and two-sided hypotheses comparing two observed proportions. In addition, we can use a chi-squared statistic to test two-sided hypotheses comparing two observed proportions from frequency tables. The chi-squared test can also be used to test multiple proportions using a test of homogeneity, and to test for a relationship between classification variables, using a test for independence. Next, we will apply the hypothesis testing steps to comparing a single mean to an assumed mean, two observed means, and multiple means.
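The connection between the two-proportion z-score and the chi-squared statistic can be checked numerically with the vitamin C data. The sketch below assumes the pooled-proportion form of the z-score; the function name two_proportion_z is illustrative. For a 2x2 table, the square of this z-score equals the Pearson chi-squared statistic.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z-score for H0: p1 = p2, using the pooled proportion for the standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se

z = two_proportion_z(328, 400, 296, 400)  # placebo colds vs vitamin C colds
print(round(z, 3), round(z * z, 3))
```

The squared z-score should reproduce the Pearson value of 7.459 from the vitamin C example.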
Section 13. Continuous Variables
Overview
Up to this point we have examined statistics related to categorical variables that result in proportions. But what about variables with a continuous measure? What are the most common values? How much variability is associated with each measure? What types of statistical tests can be used to compare statistics of this type?
Case Studies
Sports Program Satisfaction: Students from University College were asked to rate the UC's sports program as part of a review of the athletic program. From the survey, a percent satisfaction score was determined for each student. Approximately 10% of the student body completed the entire survey, resulting in 165 satisfaction scores.
The university administration is interested in several things, including overall satisfaction and satisfaction by gender. More specifically, how satisfied is the overall student body with the athletics program and how does this compare with national results of similar studies for similar programs? Also, is there a difference in satisfaction between male and female students?
If there is a significant difference, the administration will launch a full scale review of the program. If the satisfaction levels are agreeable to the administration, no further effort will be put into the review.
Comprehensive description
As with our other examples, let's begin by examining the data. Figure 1 shows the summary statistics produced by the Distribution option in JMP. We have two variables: gender and score. Gender is a categorical variable (male or female) and Score is a continuous variable (0 to 100).
Copyright Stacey S. Cofield, 14 October 2004. All rights reserved. AMB denotes material reprinted courtesy of Al Best, Virginia Commonwealth University
Figure 1. Descriptive Statistics for Gender and Score for Sports Program
Possible summaries
Consider these summary statements from the dataset:
Of the 165 students who completed the survey, 85 were female and 80 were male.
The average satisfaction with the current athletic program is 72.7%, with a standard deviation of 12.91.
The median satisfaction score was 73.3%.
The range of scores was 48.1% to 97.8%.
From our sample of students, we're 95% confident that the average satisfaction is between 70.7 and 74.7%.
What about the scores by gender? That is, what are the same summaries for male and female students? This is done using the Distribution analysis option and instead of specifying Gender as a Y variable, specify Gender as a By variable (Figure 2).
Figure 2. Descriptive Statistics Score, by Gender, for Sports Program
Possible summaries
Consider these summary statements for the dataset, by gender:
Of the 165 students who completed the survey, 85 were female and 80 were male (note, the Ns for male and female match).
The average percent satisfaction is 67.0% (SD = 11.4) for females and 78.7% (SD = 11.7) for males.
The median satisfaction score was 66.8% for females and 78.5% for males.
The range of scores was 48.2 to 86.6% for females and 60.1 to 97.8% for males.
From our sample of students, we're 95% confident that the average satisfaction is between 64.5 and 69.4% for females and between 76.1 and 81.3% for males.
A cardiac risk study (AMB): Blood lipid values on patients visiting offices in Texas. One lipid measure is total cholesterol. It's desirable to have values less than 200 mg/dL. Individuals with values above 240 mg/dL are thought to be at high risk, according to the National Cholesterol Education Program (NCEP) guidelines. There were 3,576 patients in this study and no missing values.
Descriptive Statistics In the cholesterol data in the cardiac risk study we focus on issues relating to the shape, center, and spread of the distribution of values.
Examine the Data
The data is in a file called LIPIDS.JMP (do not confuse with Lipid Data.JMP from the JMP Sample Data Files). Prior to running a statistical analysis we should look at the values of each variable. Using the Columns Panel, verify that Cholesterol uses the numeric data type and that it is marked with the continuous modeling type.
Choose Analyze > Distribution Select Cholesterol as the Y variable.
Figure 3 shows the default output for a distribution analysis on a continuous variable.
Figure 3. Distribution Analysis for Cholesterol
Histogram / Box Plot
Recall in the previous JMP examples that all the Y variables were nominal. So, we saw one bar per value (and no box plot). Here we do not see one bar per value. The green bars in Figure 3 are those of a grouped-histogram. A grouped-histogram displays a range of values because to display one bar per value would give far too many bars. Although there are n = 3576 patients in the study, there are only 267 unique values of cholesterol after rounding the values to whole numbers. A histogram with 267 bars would be overwhelming (and nearly useless). JMP chooses the number and width of the bars to display, but you can change the number of bars and the width of the bars (see below).
The box plot is a useful way to determine the number of extreme values in either direction. The box plot also displays the central tendencies of the distribution as well as giving another visual display of the spread of the data.
Grouped Frequency Distribution
An option in JMP is to show the counts for each bar (Figure 4). Table 1 also shows the grouped frequency distribution. Tables such as this can be useful to show a continuous distribution without describing each unique value, particularly if the continuous variable will be categorized for analysis. This is often seen with variables such as age (into 10-year age groups) or blood pressure (into 10-point intervals), when a 1-unit change isn't of interest but a 10-unit change is.
Figure 4. Histogram of Cholesterol with Counts
Table 1. Grouped Frequency of Cholesterol, with Proportion of Total Distribution
Group      Frequency   Proportion
60-79      3           0.0008
80-99      8           0.0022
100-119    39          0.0109
120-139    128         0.0358
140-159    347         0.0970
160-179    506         0.1415
180-199    706         0.1974
200-219    613         0.1714
220-239    503         0.1407
240-259    290         0.0811
260-279    201         0.0562
280-299    122         0.0341
300-319    45          0.0126
320-339    31          0.0087
340-359    16          0.0045
360-379    8           0.0022
380-399    5           0.0014
400-419    3           0.0008
440-459    1           0.0003
460-479    1           0.0003

The most common range of values is 180-199, with 706 subjects in that range (19.74%).
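The proportions in Table 1 are simply each group's frequency divided by the total number of patients. A short Python sketch, using the counts copied from the table (the dictionary keyed by each interval's lower bound is an illustrative layout), shows the calculation and picks out the most common interval.

```python
# Counts per 20 mg/dL interval, copied from Table 1 (interval lower bound -> frequency).
counts = {
    60: 3, 80: 8, 100: 39, 120: 128, 140: 347, 160: 506, 180: 706,
    200: 613, 220: 503, 240: 290, 260: 201, 280: 122, 300: 45,
    320: 31, 340: 16, 360: 8, 380: 5, 400: 3, 440: 1, 460: 1,
}

total = sum(counts.values())
proportions = {low: n / total for low, n in counts.items()}
modal_low = max(counts, key=counts.get)  # interval with the largest frequency

print(total)
print(f"{modal_low}-{modal_low + 19}: {proportions[modal_low]:.4f}")
```

The frequencies add up to the 3,576 patients in the study, which is a useful check that no group was dropped from the table.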
Choice of Intervals
The intervals are 20 units wide, starting at whole numbers. How were these intervals decided? Does it matter what intervals we choose? There is no agreement as to what's the best way to choose the number of intervals. There is no rule for choosing the beginning point of the intervals, although there is a preference for round numbers.
Changing the width of the bars For instance, Figure 5 shows three other histograms with the bars 40 units wide, 60 units wide, and 80 units wide.
Figure 5. Alternate Histograms for Cholesterol
Changing the center of the bars You can also change the center of the bars but be very careful as this can distort the distribution of your values (Figure 6). Notice the left is centered at 200, while the right is centered below 200.
Figure 6. Cholesterol Histograms with Different Centers
Other histograms
There are a number of different histograms that could be shown with varying number and width of bars. These range from a single bar, which makes the distribution look flat, to a large number of bars, which shows the distribution as bumpy and nearly discrete (Figure 7).
Figure 7. Other Histograms for Cholesterol
Be aware, and be careful, that the choice of the center and width of grouped-histogram bars matters to how you perceive the distribution of values.
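The effect of the bar width and starting point can be seen by grouping the same values in different ways. The sketch below is a Python illustration only: the values are hypothetical (made up for this example, not taken from LIPIDS.JMP), and the function name grouped_counts is an assumption.

```python
from collections import Counter

def grouped_counts(values, width, start=0):
    """Count values in intervals [start + k*width, start + (k+1)*width)."""
    bins = Counter()
    for v in values:
        k = (v - start) // width
        bins[start + k * width] += 1
    return dict(sorted(bins.items()))

# Hypothetical cholesterol values (mg/dL), made up for illustration only.
values = [152, 168, 175, 181, 189, 195, 198, 203, 207, 214, 226, 241]

print(grouped_counts(values, width=20))            # bars starting at 0, 20, 40, ...
print(grouped_counts(values, width=40))            # wider bars
print(grouped_counts(values, width=20, start=10))  # same width, shifted start
```

The same twelve values produce three different-looking distributions, which is the point of the caution above.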
Quantiles (more below) The quantiles of the distribution are displayed for continuous variables directly below the histogram. Some of the quantiles are named, such as the minimum (0%), maximum (100%), the interquartile range (25% and 75% quartiles), and the median (50%). The quantiles show how much (what percentage) of the data is spread where across the entire distribution.
Moments
The Moments of the distribution are displayed directly below the Quantiles. The moments are the most commonly referred to descriptive statistics of a continuous variable. The moments are the mean, standard deviation (of the distribution), standard error of the mean, the 95% confidence interval about the mean, and the total number of observations in the dataset (N). If there are missing values (none here), there will also be an N Missing given.
Numerical Statistics
In describing the distribution of a continuous variable, there are some key, common questions:
Where are the central values?
How much dispersion is there around this center?
What is the shape of the distribution?
We'll begin with the first two questions. There are two sorts of summary statistics: those that describe the center and those that describe the spread.
Central Tendency
We need to begin with the center, or the typical values. The most commonly used measures of central tendency are the median and the mean.17,20
Mean
The mean of a population is termed μ (pronounced mu) and the mean of a sample is denoted by ȳ (pronounced y-bar), calculated:

\[ \bar{y} = \frac{\sum_i y_i}{N} \]
From the moments, we can see that the mean Cholesterol is 205.7 mg/dL. The mean can be interpreted as the best guess of any given person's cholesterol level. If you were to use a single value as the guess for all subjects, and total the squared distance (the error associated with a wrong guess) from each subject's true value, using the mean as the guess would result in the smallest total squared error compared to using any other value as a universal guess. There are several important properties of the mean:
Uniqueness: for one set of data, there is only one mean.
Simplicity: Easy to calculate. Easy to interpret.
Effect of extreme values: can be large.
Note: When we say mean, or y-bar, this is an arithmetic mean not to be confused with the geometric mean. The geometric mean is different.
Median The median is the value that divides the set of data into two parts: Half (50%) are above the median and half below. The median Cholesterol is 201 mg/dL. Thus, according to the NCEP guidelines, a little less than half of the observed patients have values in the desirable range (200 or less). There are also several important properties of the median: Uniqueness: for one set of data, there is only one median. (If there are tied values, different software may give you slightly different medians.) Simplicity: Easy to calculate. Easy to interpret. Effect of extreme values: less affected than the mean.
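The different sensitivity of the mean and the median to extreme values can be illustrated with a small Python sketch. The values are hypothetical (made up for this example, not taken from the study), and the helper total_squared_error is an illustrative name; it shows that the mean gives a smaller total squared error than another guess.

```python
from statistics import mean, median

def total_squared_error(values, guess):
    """Sum of squared distances between each value and a single guess."""
    return sum((y - guess) ** 2 for y in values)

# Hypothetical cholesterol values (mg/dL), made up for illustration only.
values = [180, 190, 200, 210, 220]
with_extreme = values + [465]  # add one very large value

print(mean(values), median(values))              # center without the extreme value
print(mean(with_extreme), median(with_extreme))  # the mean moves far more than the median
print(total_squared_error(values, 200), total_squared_error(values, 205))
```

Adding the single large value pulls the mean up by more than 40 mg/dL while the median moves by only 5.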
Dispersion
The mean and median are two descriptions of the center of the distribution. What about the spread of the data? How is the data dispersed? Or, how much dispersion is there around this center?
Synonyms: Dispersion, variety, spread, scatter, variability. We'll use variability to describe the spread of the data. Variance and variability are the most common terms used to describe the dispersion of the data.
When variability is small then all the values are close together. When variability is larger, then values range more widely.
Range
The range of the data can be described in two ways, as the range or the statistical range. The range is the smallest and largest observed values in the entire set of observations. The statistical range is the mathematical difference between the largest and the smallest values in the entire set of observations (the 100% and 0% quantiles). In the cholesterol data, since the largest value is 465 and the smallest is 63, the statistical range is 402, or you could say the range is (63, 465). The important properties of the range are:
Uniqueness: for one set of data, there is only one range.
Simplicity: Easy to calculate.
It uses only two pieces of information (not the whole set of data).
As n increases, it is likely that the range will also increase. As you have more opportunity for scatter, you might see more extreme values. The mean is not as affected by an increase in n.
Variance
The word variance has a variety of definitions; here we refer only to the statistical one. Like the mean, the true variance is unknown. Referred to as σ² (pronounced sigma-squared), it is an average squared deviation from the mean. We'll use an estimate, s², of σ². From the Statistical Sleuth, the sample variance is:

\[ \mathrm{Var}(y) = s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \]
Note that it's an average divided by n - 1, not by n. This correction makes this sample estimate closer to the true population variance. In practice, the standard deviation is most often used to describe the variability. The standard deviation, σ, is also not known; we'll use the sample standard deviation, s:

\[ SD(y) = s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \]
Notice that the average deviation from the mean is exactly zero, which is why the deviations are squared before they are averaged. In the JMP report, the variance is not shown; you see the standard deviation. There are a number of statistical properties associated with the variance, the most important being:
As n increases, the estimate of the variance becomes more precise. This is intuitive, since the more information you have about the variable, the more precise your estimate of the variability. (The sample variance itself does not systematically shrink as n grows; it is the standard error of the mean that gets smaller.)
Units: the units of the variance are difficult to interpret since they are squared.
Standard Deviation
The standard deviation solves this problem. The SD is in the original units of the variable, since it is the square root of the variance.
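The variance and standard deviation formulas can be traced step by step in Python. This is a sketch on hypothetical values (made up for this example); the function name sample_variance is illustrative, and the standard library's statistics.variance and statistics.stdev are used only to confirm that they apply the same n - 1 divisor.

```python
import math
from statistics import stdev, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)

# Hypothetical cholesterol values (mg/dL), made up for illustration only.
values = [180, 190, 200, 210, 220]

s2 = sample_variance(values)  # in squared units (mg/dL)^2
s = math.sqrt(s2)             # back in the original units (mg/dL)
print(s2, s)
print(variance(values), stdev(values))  # the standard library uses the same n - 1 divisor
```

Here the variance is 250 in squared units, while the standard deviation, about 15.8, is back in mg/dL.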
In the Cholesterol report, the standard deviation is 45.9 mg/dL. An important property of the standard deviation:
Units: it's in the original units and so it's more straightforward to interpret.
Interpretation: The Empirical Rule
If a set of measurements has a mound (bell) shape then:
the interval ȳ ± 1s contains approximately 2/3 of the measurements,
the interval ȳ ± 2s contains approximately 95% of the measurements,
the interval ȳ ± 3s contains approximately all of the measurements.
One interpretation of the standard deviation is that it describes a range of values on both sides of the mean that include approximately 68% of the data. Actually, in the Cholesterol data, 70.7% of the data are within one standard deviation; 95.8% are within 2 SDs, and 99.1% are within 3 SDs.
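The percentages in the Empirical Rule come from the Normal distribution, and they can be reproduced with Python's standard library. This sketch computes the theoretical coverage for an exactly Normal variable; the function name normal_coverage is illustrative. Real data, such as the Cholesterol values, will only approximate these figures.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def normal_coverage(k):
    """Probability that a Normal value falls within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(normal_coverage(k), 4))
```

The three values are about 68%, 95%, and 99.7%, which is where "approximately 2/3", "approximately 95%", and "approximately all" come from.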
Note: S.D. is a common abbreviation for the standard deviation; s is typically used in formulas.
The word moments is a statistical term referring to parameters of a distribution. If a distribution is Normal, then its first two moments (parameters) are the mean and SD.
Standard Error of the Mean
The standard error of the mean estimates the standard deviation of the distribution of mean estimators. This is the best guess about the difference between the estimated mean and the true mean.,33 Std Err Mean is computed by dividing the sample standard deviation, s, by the square root of N. The standard error is always smaller than the standard deviation (s/√n < s whenever n > 1), and it is therefore tempting to report the SE in place of the SD, but this is an incorrect practice. You want to give an idea of the spread of the observed data (SD), not the variability likely in the estimate of the mean.
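The standard error calculation for the Cholesterol data can be sketched in Python as follows. The inputs are the rounded SD (45.9) and N (3,576) quoted in these notes, so the result may differ slightly in the last digit from what JMP computes from the unrounded data; the function name standard_error is illustrative.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sample SD divided by the square root of n."""
    return sd / math.sqrt(n)

se = standard_error(45.9, 3576)  # SD and N from the Cholesterol report
print(round(se, 3))
```

With nearly 3,600 patients the SE is roughly 60 times smaller than the SD, which shows how misleading it would be to report one in place of the other.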
Percentiles and Quartiles
As we've seen, the median separates the distribution into half; that is, 50% of the values are below the median and 50% are above the median. The median is also called the 50th percentile. In general, the pth percentile is the value of y such that p percent of the values are below this value. The 100%tile value of cholesterol is the value of Cholesterol such that 100 percent of the values are below this value. It's the maximum value, in this case 465 mg/dL. Similarly, the 0%tile is the minimum value, in this case 63 mg/dL.
Just as percentiles divide the distribution into percents, quartiles divide the distribution into quarters. The first quartile is the 25th percentile. The second quartile is the 50th percentile. The third quartile is the 75th percentile. Note that half of the data are between the 25%tile and the 75%tile.
Interquartile Range
As we've seen, the range is the distance between the largest and smallest values. Obviously, just one extreme value can drastically change the range. So, the range is a crude measure of dispersion. The interquartile range (IQR) is the difference between the third and first quartiles, that is, the difference between the 75%tile and the 25%tile. It describes the spread of the middle half of the data.
An advantage of the percentiles and quantiles is that they are useful descriptions regardless of the shape of the distribution. They don't assume that you have a mound-shaped (normal) distribution. These values, and others, are shown in the box plot (Figure 1).
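Quartiles and the interquartile range can be computed with Python's standard library, as sketched below. The values are hypothetical (made up for this example), and the "inclusive" method is only one of several conventions for interpolating quantiles; it is an assumption here, and JMP or other software may use a different rule and give slightly different quartiles.

```python
from statistics import quantiles

# Hypothetical cholesterol values (mg/dL), made up for illustration only.
values = [150, 165, 175, 190, 201, 215, 231, 250, 290]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # spread of the middle half of the data
print(q1, q2, q3, iqr)
```

The second quartile returned here is the median of the nine values.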
Box Plot
JMP shows the quartiles and other values in a box plot. The box of the box plot is the range of data between the 25%tile and the 75%tile. Half of the data is in the box; it's drawn larger to emphasize where the middle values lie. Inside the box is a line; the line is the median, and it divides the data into halves. The whiskers (in the JMP-default box plot) extend out from the box to the most extreme values that lie within 1.5 times the IQR of the quartiles. The whiskers are meant to draw your eye to the typical range of data values. Values beyond the whiskers are shown as dots and are interpreted as potential outliers (Figure 1).
The individual points that extend beyond the whiskers show potential outliers in the sense that they have a large distance from the median and so they may, potentially, be in error. Note that not all outliers are errors (and, not all errors are outliers). Even perfectly normal data will have potential outliers; there has to be some value that is the one that is farthest out.
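The 1.5 times IQR rule for flagging potential outliers can be sketched in Python. The values are hypothetical (made up for this example), the function name potential_outliers is illustrative, and the "inclusive" quartile method is an assumption; software that computes quartiles differently may place the fences slightly differently.

```python
from statistics import quantiles

def potential_outliers(values):
    """Return values more than 1.5 * IQR below the first or above the third quartile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical cholesterol values (mg/dL), made up for illustration only.
values = [150, 165, 175, 190, 201, 215, 231, 250, 465]
print(potential_outliers(values))
```

A flagged value is only a candidate for a closer look, not something to delete automatically.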
What to Do with Potential Outliers
Some researchers just delete the outliers.
WRONG. Do not remove outliers from your data.
Destroying data in this way is just as bad as making up values. We use the word "potential" to describe an appropriate attitude towards these values. The first question is: Is the value wrong? Values can be in error for all sorts of reasons:
o entered incorrectly: digits reversed, subjects reversed, extra digits
o broken equipment: not calibrated, meter broken
o poor lab procedures: using different solutions, leaving during an experiment, etc.
Before removing any outlier, research the potentially errant value. If you find it is in error, fix it by correcting the entry or re-running the experiment under the correct conditions. If you cannot determine the correct value but you have satisfied yourself the present value may be wrong, you've two choices: give it a missing value (an empty value in the JMP data table that prints as a bullet), or keep the value as it is and continue to entertain the possibility that it may be a problem.
If you decide to remove entire subjects or observations from your dataset, be sure to note the reasons and that you did remove observations. This can be an acceptable practice if, and only if, there is no recourse to correcting the error (you can't re-run an experiment) and you are sure the observations were measured in error. Otherwise, keep the points in the analysis. Even unusual values can average out and not cause any problems by unduly influencing conclusions.
Possible summaries
Consider these summary statements. Which one(s) would be acceptable? Which one(s) would give the reader the most accurate sense of the distribution of the data?
o The average cholesterol was 205.7 mg/dL.
o The average cholesterol in 3,576 patients was 205.7 mg/dL (SD = 45.9).
o The average cholesterol in 3,576 patients was 205.7 mg/dL (SE = 0.768).
o The median cholesterol level was 201 mg/dL, with half of the patients between 175 and 231.
o The median cholesterol level was 201 mg/dL, with values ranging between 63 and 465.
o From our sample of 3,576 patients, we're 95% confident that the average cholesterol is between [204.2, 207.2].
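To see how the SD, SE, and confidence interval statements above relate to one another, here is a short sketch that recomputes the SE and the 95% interval from the quoted n, mean, and SD. It assumes the interval was formed as mean plus or minus 1.96 standard errors, which matches the numbers in the list.

```python
import math

n = 3576        # number of patients
mean = 205.7    # mg/dL
sd = 45.9       # mg/dL, describes the spread of the patients' values

# The SE describes the precision of the mean, not the spread of the data.
se = sd / math.sqrt(n)

# 95% confidence interval for the mean (assumes the 1.96 z multiplier).
ci_low = mean - 1.96 * se
ci_high = mean + 1.96 * se

print("SE =", round(se, 3))
print("95% CI =", [round(ci_low, 1), round(ci_high, 1)])
print("mean +/- 1 SD =", [round(mean - sd, 1), round(mean + sd, 1)])
```

The last line is a reminder of why the SD and SE statements are not interchangeable: the range mean plus or minus one SD covers roughly the middle two-thirds of patients, while the confidence interval only describes where the mean is likely to be.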
Distribution of Triglycerides
Let's look at the distribution of a different blood lipid measurement in the same patients (Figure 8). The mean triglyceride is 164.7 mg/dL and the median is 132. In the Triglycerides data, 90.4% of the data are within one standard deviation of the mean; 96.5% are within 2 SD, and 98.2% are within 3 SD.
Figure 8. Distribution of Triglycerides
Summary
There are a number of ways to describe the distribution of a variable. Some are better than others. What should you use? These recommendations come from the American College of Physicians (TA Lang, M Secic, 1997, How to Report Statistics in Medicine. ISBN 0-943126-44-4). They provide some guidance on the question "What descriptive statistics should I use?"
Provide appropriate measures of central tendency and dispersion when summarizing data that have a continuous distribution.
Do not summarize continuous data with the mean and standard error of the mean.
Yes, the SE is smaller than the SD. That's why it is appealing to report it instead of the SD. But the SE describes the precision of an estimate; it's the standard error of the mean. The SE is not a descriptive statistic and should not be used as such.
When describing the dispersion of a sample, use the SD. When using the mean for statistical inference, use the SE.
Use the mean and standard deviation (SD) only when describing approximately normally distributed data.
If the mean and SD are appropriate, how many decimal places should we use? Report the mean to no more than one decimal place more than the data it summarizes. Report the standard deviation to no more than two decimal places more than the data it summarizes; that is, report the SD to one more decimal place than you report the mean.
The American Medical Association Manual of Style (C Iverson (chair), 1998, 9th Edition) says, Numbers that result from calculations, such as the mean and SDs, should be expressed to no more than 1 significant digit beyond the accuracy of the instrument. Thus the mean (SD) weight of individuals weighed on a scale accurate to 0.1 kg should be expressed as 62.45 (4.13) kg.
Describe markedly non-normally distributed (skewed) data with the median and range or interquartile range (or other interpercentile range).
They comment: "When data are markedly non-normally distributed, the mean and standard deviation, although they may be mathematically correct, do not allow the reader to picture the distribution accurately."
All recommendations boil down to two words: Be Clear.
Section 14. Estimating With Confidence
Overview
In Section 13, we examined continuous variables and their distributional properties. We discussed issues relating to shape, center, and dispersion. In this section we'll discuss estimation. How do we estimate parameters of a population distribution? That is, how do we sample a population to estimate parameters related to shape, center, and dispersion?
Often, we'll be interested in estimating the center of the data, i.e., the mean or median. Accurately estimating this parameter can be difficult, depending upon whether the distribution has a large or small amount of dispersion, particularly when we must also estimate the population dispersion. In addition, the estimates we use rely on an underlying distribution, some with properties that are simple and others that are more complex.
Estimation
Rarely are you able to measure every subject or object in a population. Often, you have to sample from the population to estimate its parameters. Recall that in the CPR study we hypothesized a survival proportion of 6%. Then we simulated 1000 samples of size n = 278, and for each simulated experiment we calculated p, the proportion surviving to hospital discharge. From this we were able to address what the distribution of survival proportions would look like (Figure 1).
Figure 1. Simulation, p = 0.06, n = 278.AMB
Copyright Stacey S. Cofield, 19 October 2004. All rights reserved. AMB denotes material reprinted courtesy of Al Best, Virginia Commonwealth University
From the simulation, we could see how the sample proportion would behave if we repeated the experiment. Would the average sample proportion be the same as the true proportion? Yes. How variable are the sample proportions? Over 95% of them were between 0.032 and 0.094. This simulation was for a binary outcome, a categorical variable. But what about a variable with a continuous response?
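The proportion simulation recalled above can be sketched as follows. This is not the original simulation: it is a standard-library re-creation with an arbitrary random seed, so the exact percentiles will differ a little from the 0.032 and 0.094 quoted in the text.

```python
import random

def simulate_proportions(p, n, n_sims, seed=0):
    """Simulate n_sims studies of size n and return each sample proportion."""
    rng = random.Random(seed)  # the seed is arbitrary, chosen for repeatability
    proportions = []
    for _ in range(n_sims):
        survivors = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(survivors / n)
    return proportions

props = simulate_proportions(p=0.06, n=278, n_sims=1000)
mean_prop = sum(props) / len(props)

# Approximate 2.5th and 97.5th percentiles of the 1000 simulated proportions.
ordered = sorted(props)
low, high = ordered[24], ordered[974]

print("mean of sample proportions:", round(mean_prop, 4))
print("middle 95% of sample proportions:", round(low, 3), "to", round(high, 3))
```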
The Lipids Population
Recall from Section 13 the Cholesterol data, specifically the LDL results (Figure 2).
Figure 2. Distribution of Cholesterol Values
Let's assume these values are results from the entire population being studied. Now, let's take samples from the population to try to estimate the parameters associated with the distribution of cholesterol values.
Since we have a response from every subject in the population, we know the true mean of the population is 205.68 mg/dL and the true standard deviation is 45.919 mg/dL. Let's fit a normal distribution with this mean and standard deviation (Figure 3).
Figure 3. Normal Distribution Fit Over Cholesterol Population
If the population distribution truly stems from a normal distribution, the bars of the histogram should follow the line of the fitted normal distribution. We see this is more or less the case, with the bars to the left of the center exceeding the line and the bars to the right of the center falling short of the line. Regardless, we can see that the population responses follow a unimodal (one mode) distribution, that is roughly symmetric about the center.
If this were a sample, is it normally distributed (or close enough) that we can use known properties of the normal distribution to estimate the parameters of the population? In this case, we have over 3500 responses. How many responses do we need to be able to say we have a normally distributed variable? That is, how many observations do we need to sample from a population to accurately estimate the population parameters by relying on the properties of the normal distribution?
Sampling from the Population AMB
Let's draw a sample of size n = 9, and do this again and again, 100 times. For each sample of size n = 9, calculate the mean Cholesterol. What is the distribution of these 100 means?
Or, let's draw a sample of size n = 25, and do this again and again, 100 times. For each sample of size n = 25, calculate the mean Cholesterol. What is the distribution of these 100 means?
Or, let's draw a sample of size n = 100, and do this again and again, 100 times. For each sample of size n = 100, calculate the mean Cholesterol. What is the distribution of these 100 means?
Figure 4. Samples from the Cholesterol PopulationAMB
Since the above figure plots the values on the same scale as the population (shown in Figure 4), we can compare the distribution of these estimated means to the distribution of the original values.
Properties of the Sample Mean
There are three important questions about these 100 means:
o What is the mean of the sample means?
o What is the standard deviation of the sample means?
o What is the shape of the distribution of the sample means?
Mean of the Means
On the average, how close are the sample means to the true value? We observe that the mean of the means when we take samples of size 9 is within 1 mg/dL of the true mean: it's 0.9396 mg/dL higher than the true mean of 205.6826 (Table 1). The mean of the means when we take samples of size 25 is even closer (only 0.7082 higher). When using samples of size 100 to estimate each mean, the mean of these means is very close (0.1219 mg/dL below the true mean).
Table 1. Mean of Means from Samples Compared to Population Mean
population mean = 205.6826
sample     mean        difference from population
n = 9      206.6222     0.9396
n = 25     206.3908     0.7082
n = 100    205.5607    -0.1219
There is no guarantee that one specific sample drawn with an n = 9 must be farther away than one other specific sample drawn with an n = 25. Nor is it guaranteed that one sample mean using n = 100 must be closer to the true value than if we used n = 9. But, on the average, the mean of the means will be identical to the true population mean.
So, the first property of the sampling distribution of $\bar{y}$ is: When sampling from a population with true mean $\mu$, the true mean of the distribution of $\bar{y}$ is $\mu$. If this property were not true, then we'd say that $\bar{y}$ is biased. That is, if the true mean of the distribution of $\bar{y}$ were $\mu + 10$ (for example), then we'd say that $\bar{y}$ is giving us estimates that are biased towards higher values. This is not the case; $\bar{y}$ is not biased.
Further, on the average, the mean of means from larger samples should be closer to the true mean than the mean of the means from smaller samples.
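A small simulation in the spirit of Table 1 can be sketched as below. The course's cholesterol data are not available here, so the population is a hypothetical stand-in: 3,576 normal draws with roughly the quoted mean and SD. The seed and the helper name mean_of_means are likewise illustrative choices.

```python
import random
import statistics

rng = random.Random(1)  # arbitrary seed, for repeatability

# Hypothetical stand-in for the cholesterol population (not the real data).
population = [rng.gauss(205.68, 45.92) for _ in range(3576)]
true_mean = statistics.fmean(population)

def mean_of_means(n, n_samples=100):
    """Draw n_samples samples of size n and average their sample means."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(n_samples)]
    return statistics.fmean(means)

print("population mean:", round(true_mean, 4))
for n in (9, 25, 100):
    m = mean_of_means(n)
    print("n =", n, "mean of means =", round(m, 4),
          "difference =", round(m - true_mean, 4))
```

Re-running with a different seed changes the differences, and a particular n = 9 run may land closer than a particular n = 100 run, but the differences scatter around zero in every case.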
Standard Deviation of the Means
We've used three simulations to estimate the mean of the means, so how variable are the sample means? A quick look at Figure 4 shows that the means vary more when sampling with n = 9 than when sampling with n = 100. A detailed look at the frequency distribution in Table 2 shows a similar pattern.
Table 2. Frequency Distribution of the Population of Cholesterol Values, and the Means when Sampling with Different n's. AMB
Even when sampling n = 9, $\bar{y}$ values are, at worst, about 50 mg/dL from the true mean.
In the table below we first see the true population standard deviation $\sigma$ = 45.926 (Table 3). Then we see the estimated standard deviation of the $\bar{y}$ values from 100 simulations of sample size n = 9, from 100 simulations of sample size n = 25, and from 100 simulations of sample size n = 100.
Table 3. Standard Deviation of Means of Samples Compared to Population Standard Deviation
population SD = 45.926
sample     SD         ratio to population
n = 9      17.4976    2.6
n = 25      9.9440    4.6
n = 100     5.0003    9.2
Clearly, the standard deviation of the means, called the standard error of the mean, decreases with larger sample sizes. The standard error of the mean is denoted $\sigma_{\bar{y}}$, to indicate that it is the standard deviation of the $\bar{y}$ values. Actually, the amount of the decrease is predictable. This brings us to the second property of the sampling distribution of $\bar{y}$: When sampling from a population with true standard deviation $\sigma$, the standard deviation of the distribution of $\bar{y}$ is
$$\sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}}$$
(, 31-34)
That is, sampling with n = 9 should result in a standard error of one-third of the population standard deviation (since $\sqrt{9} = 3$). In our small simulation of 100 experiments, instead of 1/3, the ratio was 1/2.6. Sampling with n = 25 should have yielded a standard error a fifth as large; it was 1/4.6. Sampling with n = 100 should have yielded a standard error a tenth as large; it was 1/9.2. There are a variety of reasons why it didn't come out exactly, but the main reason is that it's a small simulation. The point is: you can control the variability of your estimate, $\bar{y}$, by choosing your sample size n.
Note: The true population $\sigma$ is 45.9194, not the 45.926 estimated by JMP, because the JMP estimate assumes the values come from a sample, not a population.
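Both points above, the predictable shrinkage of the standard error and the population-versus-sample SD note, can be checked with a short sketch. As before, the population is a hypothetical normal stand-in for the cholesterol values, and 2,000 repetitions are used instead of 100 so the ratios settle closer to 3, 5, and 10.

```python
import math
import random
import statistics

rng = random.Random(2)  # arbitrary seed
# Hypothetical stand-in population (not the real cholesterol data).
population = [rng.gauss(205.68, 45.92) for _ in range(3576)]

sigma = statistics.pstdev(population)  # divides by N: values are the whole population
s = statistics.stdev(population)       # divides by N - 1: treats them as a sample
print("population SD:", round(sigma, 4), " sample-formula SD:", round(s, 4))

def sd_of_means(n, n_samples=2000):
    """Standard deviation of n_samples sample means, each from a sample of size n."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(n_samples)]
    return statistics.stdev(means)

for n in (9, 25, 100):
    observed = sd_of_means(n)
    predicted = sigma / math.sqrt(n)
    # Sampling without replacement from a finite population makes the observed
    # value very slightly smaller than sigma / sqrt(n).
    print("n =", n, "observed SE =", round(observed, 3),
          "predicted =", round(predicted, 3),
          "ratio sigma/SE =", round(sigma / observed, 2))
```

The gap between the two SD formulas is tiny at N = 3,576, which is the same small difference the note above points out between 45.9194 and 45.926.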
The Distribution of the Sample Means
Lastly, one key property of the sample mean, $\bar{y}$, is that it is normally distributed no matter what the original distribution of y, given that certain conditions are met.
The Normal Distribution
The normal distribution is not just a vague concept; we use the words "mound shaped" and "bell shaped" to informally visualize a general shape. It does have a precise mathematical definition, which we don't need to fully understand to use the distribution. We do, however, need to know the answer to the questions:
o What do we mean when we say that something follows a normal distribution?
o How can I tell whether a variable seems to follow a normal distribution?
To answer these questions we need to understand the characteristic properties of the normal distribution.
The characteristics of the normal distribution are:
o It's symmetric about its mean, $\mu$. This means that the mean and median are equal, and the tails should be the same.
o It follows the empirical rule.
o It's a probability distribution completely determined by the parameters $\mu$ and $\sigma$. This means that, for a given $\mu$ and $\sigma$, you can determine the probability of observing a value as large or larger than any value y by first calculating a z statistic:
$$z_{\text{calculated}} = \frac{y - \mu}{\sigma}$$
Then use a z-to-p-value calculator to determine $P(z \ge z_c)$. (Don't worry, JMP will do this for you.)
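As a rough illustration of the z-to-p-value step, the sketch below uses Python's standard-library NormalDist. The cut-off of 300 mg/dL is a hypothetical value chosen for the example; the mean and SD are the population values quoted earlier, and the calculation assumes cholesterol is exactly normal.

```python
from statistics import NormalDist

def upper_tail_probability(y, mu, sigma):
    """Return z = (y - mu) / sigma and P(Z >= z) for a normal distribution."""
    z = (y - mu) / sigma
    return z, 1.0 - NormalDist().cdf(z)

# Hypothetical question: how likely is a cholesterol value of 300 or more?
z, p = upper_tail_probability(300, mu=205.68, sigma=45.919)
print("z =", round(z, 3), " P(Z >= z) =", round(p, 4))
```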
So we said the sample mean, $\bar{y}$, is normally distributed despite the original distribution of y, given that certain conditions are met. So, what are the conditions? Practically, it's true if either the original distribution is normal, or if the sample size is large.
Sampling from a Normal Distribution
Let's deal with the easiest case first. If cholesterol really was exactly normally distributed then all three properties above would hold. So:
o Is cholesterol symmetric about its mean? Are the mean and median equal? The observed mean, 205.7, is slightly higher than the median, 201. There seem to be more values in the higher range (say, above 300) than in the lower range (say, below 100).
o Does it follow the empirical rule? In the cholesterol data, 70.7% of the data are within one standard deviation of the mean; 95.8% are within 2 SD, and 99.1% are within 3 SD. For a normal distribution these percentages would be 68%, 95%, and over 99%, respectively.
o Do the probabilities from $z_{\text{calculated}} = \frac{y - \mu}{\sigma}$ equal the percentiles in the observed distribution?
If the cholesterol values were normally distributed, JMP would have shown the distribution of cholesterol as in Figure 5.
Figure 5. Histogram, Box Plot, and Normal Quantile Plot From Normally Distributed Cholesterol ValuesAMB
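The empirical-rule check used above is easy to reproduce. The sketch below computes the share of values within 1, 2, and 3 SD of the mean for two hypothetical data sets: a normal one, and a right-skewed (lognormal) one standing in for a variable like triglycerides. Neither is the course data, and the lognormal parameters are arbitrary.

```python
import random
import statistics

def empirical_rule_shares(values):
    """Share of values within 1, 2, and 3 standard deviations of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    shares = []
    for k in (1, 2, 3):
        within = sum(1 for v in values if abs(v - mean) <= k * sd)
        shares.append(within / len(values))
    return shares

rng = random.Random(3)  # arbitrary seed
normal_like = [rng.gauss(205.68, 45.92) for _ in range(3576)]   # hypothetical
skewed = [rng.lognormvariate(4.9, 0.6) for _ in range(3576)]    # hypothetical

print("normal-like:", [round(x, 3) for x in empirical_rule_shares(normal_like)])
print("skewed:     ", [round(x, 3) for x in empirical_rule_shares(skewed)])
```

For the skewed data the share within one SD is well above 68%, because a few very large values inflate the SD.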
To show the Normal Quantile Plot in JMP, check the option in the Distribution of Y report window. The (generated or manufactured) data are normally distributed and we can see this by inspecting the normal quantile plot. If the distribution is normal then:
o the (black, observed) points should follow the (red) diagonal straight line
o the points should fall within the (dashed red) boundaries.
In Figure 6 we see the comparable normal quantile plot for the actual cholesterol data.
Figure 6. Histogram and Normal Quantile Plot for Cholesterol Data
We can see that there are more values above the mean (the tail is longer and sweeps up). We can also see that the line of points either follows the dashed red lines or slightly extends beyond the limits. So, is Cholesterol normally distributed? Not precisely. But it is close. Close enough to use the properties of the normal distribution?
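For readers without JMP, the idea behind a normal quantile plot can be sketched with the standard library: sort the data, pair each value with the normal quantile expected at its rank, and see how close the points are to a straight line. The plotting position (i + 0.5) / n and the use of a correlation as a "straightness" score are assumptions made for this illustration; JMP's plot and its dashed boundaries are computed differently.

```python
import math
import random
from statistics import NormalDist, fmean

def normal_quantile_points(values):
    """Pair each sorted value with the standard normal quantile for its rank."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    expected = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return expected, ordered

def straightness(expected, observed):
    """Pearson correlation of the plotted points; near 1 means nearly straight."""
    mx, my = fmean(expected), fmean(observed)
    sxy = sum((x - mx) * (y - my) for x, y in zip(expected, observed))
    sxx = sum((x - mx) ** 2 for x in expected)
    syy = sum((y - my) ** 2 for y in observed)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(4)  # arbitrary seed
normal_data = [rng.gauss(205.68, 45.92) for _ in range(1000)]   # hypothetical
skewed_data = [rng.lognormvariate(4.9, 0.8) for _ in range(1000)]  # hypothetical

r_normal = straightness(*normal_quantile_points(normal_data))
r_skewed = straightness(*normal_quantile_points(skewed_data))
print("normal data:", round(r_normal, 4), " skewed data:", round(r_skewed, 4))
```

A single number is no substitute for looking at the plot, since the plot also shows where the departure occurs (for example, an upper tail that sweeps up).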
The Distribution of the Sample Means
Using various sample sizes, are the $\bar{y}$ values normally distributed? Look at the normal quantile plots for all three samples (Figure 7).
Figure 7. Histograms and Normal Quantile Plots for Samples from CholesterolAMB
In all three cases, even though the population values for cholesterol were not perfectly normal, the $\bar{y}$ values were normally distributed, even with a small n = 9.
Sampling from a non-Normal Distribution
Now, the more difficult case. We said the sample mean, $\bar{y}$, is normally distributed no matter what the original distribution of y is, as long as either the original distribution is normal or the sample size is large. How large is large enough? The rule of thumb is that, in most practical situations, n = 30 is satisfactory.,33 As a practical matter though, if the original distribution is severely non-normal then it may take a sample size much larger than 30 to assure us that the sample mean will be normally distributed.
A non-Normal example
Recall the distribution of triglycerides from this population. The mean triglyceride is 164.7 mg/dL and the median is 132, more than 30 points different. In the triglycerides data, 90.4% of the data are within one standard deviation of the mean; 96.5% are within 2 SD, and 98.2% are within 3 SD. The normal quantile plot in Figure 8 shows (with little doubt) that the Triglycerides are not normally distributed.
Figure 8. Histogram, Box Plot, and Normal Quantile Plot for Non-Normal Triglycerides
The Distribution of the Sample Means from a Non-Normal Population
So, for the various sample sizes, are the $\bar{y}$ values normally distributed?
For the 100 simulations each using a sample of size n = 9, the mean of the means is approximately 167 mg/dL, approximately 2 mg/dL above the true mean (which is marked by the horizontal blue line in the histogram). The median of the means is approximately 154, about 21 units above the true median. The distribution of these means is not normal (Figure 9).
Figure 9. Histogram, Box Plot, and Normal Quantile Plot for 100, n = 9 Sample Means from the Non-Normal TriglyceridesAMB
n = 25 Sampling
What about 100 simulations, each using a sample of size n = 25? Here, the mean of the means is approximately 168 mg/dL, approximately 3 mg/dL above the true mean. The median of the means is approximately 163, about 30 units above the true median. The distribution of these means is not normal, although it's closer than for n = 9 (Figure 10).
Figure 10. Histogram, Box Plot, and Normal Quantile Plot for 100, n = 25 Sample Means from the Non-Normal TriglyceridesAMB
n = 100 Sampling
What about 100 simulations, each using a sample of size n = 100? Here, the mean of the means is approximately 165 mg/dL, approximately 0.06 mg/dL above the true mean. The median of the means is approximately 164, about 32 units above the true median. The distribution of these means is normal (Figure 11).
Figure 11. Histogram, Box Plot, and Normal Quantile Plot for 100, n = 100 Sample Means from the Non-Normal Triglycerides AMB
The rule of thumb is that n = 30 should be sufficient for $\bar{y}$ to be normally distributed.,33 This distribution appears to be normal (Figure 12).
Figure 12. Histogram, Box Plot, and Normal Quantile Plot for 100, n = 30 Sample Means from the Non-Normal TriglyceridesAMB
Summary
Even though the population of triglycerides is not normally distributed, with sufficient sample size, the distribution of the sample means is increasingly close to normal (Table 4). So, even with a non-normal population, as sample size increases, the three properties of the sample mean still hold.
Table 4. Comparisons of the Population and Samples, Mean Difference and Standard Deviations, Triglycerides
population: Mean = 164.6700, SD = 129.1319
sample     mean        difference   SD         ratio
n = 9      166.6122    1.9422       46.1812    2.8
n = 25     167.6892    3.0192       26.2889    4.9
n = 100    164.7284    0.0584       15.0938    8.6
Notice, however, that the observed standard errors are larger than the values $\sigma/\sqrt{n}$ predicts. When sampling with n = 9, the SE should be one-third of the population SD (129/3 = 43) but it is larger, 46.2. The standard error when using n = 25 should be 25.8 but it's a slightly larger 26.3. And the SE for n = 100 should be 12.9 but it's 15.1. That is, when sampling from a skewed distribution, the standard errors estimated from a small simulation can come out larger because of the effect of the extreme outliers at the ends of the tails.
Central Limit Theorem
More formally, the conditions we have been discussing are the result of the Central Limit Theorem (CLT).,33 We've discussed large-sample tests and used the term "asymptotic theory"; these terms are based on the CLT. The CLT is the only theorem we'll cover in BST 621 (because it's that important).
Central Limit Theorem: Draw a simple random sample of size n from any population with mean $\mu$ and finite standard deviation $\sigma$. When n is large, the sampling distribution of the sample mean $\bar{y}$ is close to the normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.
It's not surprising that when sampling from a normal population the means will be normally distributed. It's far more useful to know that no matter what the underlying distribution, your means will be approximately normally distributed, as long as you have sufficient n. How large an n is required? It depends on the underlying distribution, but the rule of thumb is 30.
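The Central Limit Theorem can be watched at work with a short simulation. The sketch below draws sample means from a hypothetical right-skewed (lognormal) population, standing in for a variable like triglycerides, and reports the skewness of the means; a value near 0 indicates a symmetric, more normal-looking distribution. The population parameters, the seed, and the use of skewness as the yardstick are all choices made for this illustration.

```python
import random
import statistics

def skewness(values):
    """Simple moment-based skewness; about 0 for symmetric distributions."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)

rng = random.Random(5)  # arbitrary seed
# Hypothetical skewed population (not the real triglyceride data).
population = [rng.lognormvariate(4.8, 0.7) for _ in range(3576)]

def sample_means(n, n_samples=2000):
    """Means of n_samples random samples of size n from the population."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(n_samples)]

print("population skewness:", round(skewness(population), 2))
skew_by_n = {}
for n in (9, 30, 100):
    skew_by_n[n] = skewness(sample_means(n))
    print("n =", n, "skewness of sample means =", round(skew_by_n[n], 2))
```

The skewness of the means shrinks as n grows, which is the CLT in action; how fast it shrinks depends on how skewed the population is.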
However, this theorem cannot save us from an ill-conceived sampling methodology. That is, if we draw a simple random sample then we can trust that the CLT will hold.
Say you don't have a simple random sample; are you in trouble? We're not in great danger if the data can plausibly be thought of as observations taken at random from a population. If the data are representative, you're probably OK. However, there is no way to rescue a study using data collected haphazardly. The data will have unknown biases and no formula can rescue badly produced data. Keep in mind: garbage in, garbage out. AMB
So, assuming the data at hand are representative, let's move on to confidence intervals. So far, our estimation methods for continuous variables have resulted in point estimates. Confidence intervals are even more useful. We've seen confidence intervals about proportion estimates; the process is similar for mean estimates.
Confidence Intervals
Confidence intervals use point estimates and an estimate of dispersion to form interval estimates.
Recall that estimating a parameter with an interval involves three components:
o The point estimate of $\mu$; this is the sample mean, $\bar{y}$.
o The standard error: when the population standard deviation is known to be $\sigma$, the standard error of $\bar{y}$ is $\sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}}$.
o The reliability coefficient; the one we use is the $100(1-\alpha)\%$ z value:
  o For 90% confidence, use z = 1.645.
  o For 95% confidence, use z = 1.96.
  o For 99% confidence, use z = 2.575.
Recall the general form of an interval estimate:
$$\text{estimate} \pm (\text{reliability coefficient}) \times (\text{standard error})$$
This will yield two values, a lower limit and an upper limit, around the point estimate. This range of values will, we hope, include the true (unknown) population mean were trying to estimate. The confidence interval will, with specified reliability, contain the population mean.
Confidence intervals when the variance is known
So, if we know the population standard deviation, $\sigma$, then a 95% confidence interval for the population mean is as follows:
$$\bar{y} \pm 1.96 \frac{\sigma}{\sqrt{n}}$$
In our example population the known $\sigma$ is 45.9194. Using a sample of size n = 9, the first simulated experiment yielded a $\bar{y}$ of 217.6. So:
$$217.6 \pm 1.96 \frac{45.9194}{\sqrt{9}}$$
$$217.6 \pm 1.96\,(15.3065)$$
$$217.6 \pm 30.0006$$
$$[187.6,\ 247.6]$$
Notice that this interval covers the true mean of 205.7. All 100 confidence intervals are shown in Figure 13. Those shown with a red dot do not cover the true mean.
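The known-variance calculation above can be written as a small function. This sketch uses the standard library's NormalDist to look up the z value (about 1.96 for 95% confidence) rather than hard-coding it; the function name z_interval is simply a label chosen here.

```python
import math
from statistics import NormalDist

def z_interval(ybar, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population SD sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # e.g. about 1.96 for 95%
    half_width = z * sigma / math.sqrt(n)
    return ybar - half_width, ybar + half_width

# Values from the first simulated n = 9 cholesterol sample in the text.
low, high = z_interval(ybar=217.6, sigma=45.9194, n=9)
print("95% CI:", [round(low, 1), round(high, 1)])
print("covers the true mean 205.7?", low <= 205.7 <= high)
```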
Figure 13. Cholesterol Confidence Intervals
The problem with calculating confidence intervals by the z method is that it requires knowledge of the population standard deviation, $\sigma$, which is almost always unknown.
Confidence intervals when the variance is not known
In practice, we almost never know $\sigma$. The obvious solution is to use the estimated standard deviation, s, we determined from our sample. If it were as simple as that, then the 95% confidence interval would be:
$$\bar{y} \pm 1.96 \frac{s}{\sqrt{n}}$$
But this does not work. The problem is that the reliability coefficient (1.96) is wrong. It's wrong because now there are two random terms entering into the confidence interval, $\bar{y}$ and s. Both of these are subject to random fluctuation.
Gosset, a statistician who worked at the Guinness brewery, figured out the solution to this problem: the t-distribution. But to keep from getting fired, he had to publish the work under the pseudonym "Student". Thus, you may have seen a reference to Student's t. The t-distribution is very close to the z but the t-distribution has wider tails, reflecting the extra variability ignored by z (Figure 14).
Figure 14. Comparison of Normal and t DistributionAMB
The degrees of freedom for the t-distribution for a single mean is df = n - 1, the same as the denominator used to calculate s, the estimated standard deviation. So, the correct formula for the $100(1-\alpha)\%$ confidence interval on a population mean when estimating both the mean and standard deviation is:
$$\bar{y} \pm t_{(df,\ 1-\alpha/2)} \frac{s}{\sqrt{n}}$$
Appendix Table A.2 in the Statistical Sleuth gives the appropriate t-values for various df., 718 If you use this table, you want to use the values under the column labeled .975 for a 95% CI. (That is, for a 95% CI, $\alpha = 0.05$; so $(1 - \alpha/2) = 0.975$.) Some example reliability coefficients are given in Table 5.
Notice that as n (and df) gets larger the reliability coefficient gets closer to the z value we would use if we knew $\sigma$.
Table 5. Reliability coefficient (t-values) for a 95% Confidence Interval AMB
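Reliability coefficients of the kind such a table lists can be approximated without a statistics package. The sketch below is one possible approach, not how JMP or the published tables compute them: it integrates the t density numerically (Simpson's rule) and finds the .975 quantile by bisection. The function names and the step counts are choices made for this illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(df, confidence=0.95):
    """Two-sided reliability coefficient, found by bisection on the CDF."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for n in (9, 25, 100):
    print("n =", n, "df =", n - 1, "t =", round(t_critical(n - 1), 3))
```

With n = 9 the coefficient is noticeably larger than 1.96, and by n = 100 it is within a few hundredths of it, which is the pattern described above.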
Calculating 95% Confidence Intervals in JMP
Luckily, you don't need to use the table values; JMP IN automatically calculates the 95% confidence interval on the mean and shows it in the Distribution of Y report window. For instance, Figure 15 shows the Moments report from the first (n = 9) cholesterol sample. The 95% confidence interval from this sample is [173.4, 261.7].
Figure 15. Moments Report for n = 9 Cholesterol Sample
All 100 of the confidence intervals, when using n = 9 sampling, are shown in Figure 16.
Figure 16. 95% Confidence Intervals for Cholesterol, n = 9 AMB
The means shown here are identical to those shown in Figure 13, but the intervals above are generally wider than those shown in Figure 13, and they are not of constant width, since the width depends upon s, the estimate of $\sigma$ from each sample. When using larger samples, in this case n = 25, the confidence intervals are narrower (Figure 17).
Figure 17. 95% Confidence Intervals for Cholesterol, n = 25 AMB
And when using large samples, in this case n = 100, the confidence intervals are much narrower (Figure 18).
Figure 18. 95% Confidence Intervals for Cholesterol, n = 100 AMB
The relationship between sample size and confidence
A 95% confidence interval implies that we're 95% sure that the interval covers the true (but unknown) mean. Conversely, it also means that 5% of the intervals we calculate will not cover the true mean. This is true whether we use n = 2 or n = 2,000,000.
What does change with sample size is the width of the confidence interval. With larger sample sizes the width of the interval is narrower but we will still be wrong 5% of the time just by narrower amounts.
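The claim that coverage stays at 95% while only the width changes can be checked by simulation. The sketch below assumes a normal population with roughly the cholesterol mean and SD, draws 2,000 hypothetical studies per sample size, and uses standard table values of t for df = 8, 24, and 99; the seed and the number of studies are arbitrary.

```python
import math
import random
import statistics

# Standard t-table values for 95% confidence with df = n - 1.
T_975 = {9: 2.306, 25: 2.064, 100: 1.984}

def coverage_and_width(n, n_studies=2000, mu=205.68, sigma=45.92, seed=6):
    """Share of 95% t-intervals covering mu, and their average width."""
    rng = random.Random(seed)
    covered = 0
    widths = []
    for _ in range(n_studies):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        ybar = statistics.fmean(sample)
        half = T_975[n] * statistics.stdev(sample) / math.sqrt(n)
        if ybar - half <= mu <= ybar + half:
            covered += 1
        widths.append(2 * half)
    return covered / n_studies, statistics.fmean(widths)

results = {n: coverage_and_width(n) for n in (9, 25, 100)}
for n, (coverage, width) in results.items():
    print("n =", n, "coverage =", round(coverage, 3),
          "average width =", round(width, 1))
```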
What about confidence intervals for non-normal populations? Let's return to the Triglyceride results.
CIs for Triglyceride
Now let's look at 95% confidence intervals using the triglyceride population. Just as before, for each of several sample sizes we simulate 100 studies. First, in Figure 19 we see the distribution of the sample means when sampling with only n = 9.
Figure 19. 95% CI on Triglyceride, n = 9 AMB
Notice how much more variable the widths are. The first sample gave $\bar{y}$ = 164.6 with estimated standard deviation s = 101.2, but the second sample gave $\bar{y}$ = 323.2 with estimated standard deviation s = 383.3. With the larger estimates, you're seeing the effect that an outlier can have. In larger samples, outliers should average out. Do they? In Figure 20 we see the distribution of the sample means when sampling with n = 25.
Figure 20. 95% CI on Triglyceride, n = 25 AMB
The widths are narrower than when sampling with n = 9. But the effect of outliers is still evident. In Figure 21 we see the distribution of the sample means when sampling with n = 100.
Figure 21. 95% CI on Triglycerides, n = 100 AMB
The intervals are narrower still and the effect of outliers is diminished. Here, we have sufficient sample size to trust the Central Limit Theorem.
Summary
Sample estimates have distributions that are affected by the underlying distribution and sample size. Estimates may be useless if obtained from a haphazard sample with unknowable bias. But, if the data are representative of the population, then we can rely on the sample mean to estimate the center of the distribution: the sample mean is unbiased.
Further, if the population is known to be normal, then a sample mean will also be normal.
If the population has an unknown distribution then, with sufficient sample size, we can rely on the CLT and trust that a sample mean will also be normally distributed.
Use the Normal Quantile Plot in JMP to assess whether a distribution appears normal.
The standard error of the sample mean will be smaller with larger samples. The standard error describes the variability of the sample mean (not the variability of the sample data). If the variability (standard deviation) of the data is $\sigma$, then the standard error of the mean is
$$\sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}}.$$
The confidence interval on the population mean, obtained from a sample of n observations, is
$$\bar{y} \pm t_{(df,\,1-\alpha/2)} \frac{s}{\sqrt{n}}.$$
Here, $\bar{y}$ is the sample mean, s is the sample standard deviation, and the t reliability coefficient is the $(1 - \alpha/2)$ percentile of the t-distribution with df = n - 1. When describing a confidence interval in a sentence or table, be sure to indicate the level of confidence and the sample size. Always be aware that the shape of the underlying distribution and the size of your sample will directly affect the believability of your point and interval estimates.
Some Example Write Ups
Non-Normal: For when the sample size is small or you are uncomfortable with the assumption of normality.
Triglycerides
A random sample of n = 20 subjects was assessed for serum triglycerides. Since the sample was small and the distribution was skewed, the distribution of the sample is described by the median and range. The median triglyceride was 115 and the values ranged between 31 and 755. Half of the values were between 91.25 and 195.0.
Normal: For when you are more comfortable that the distribution of the mean is normal.
Cholesterol
A random sample of n = 20 subjects was assessed for serum cholesterol. The average cholesterol was 201.8, SD = 53.25. We are 95% confident that the range 176.8-226.7 includes the true population mean.
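As a check on the cholesterol write-up, the interval can be reproduced approximately from the reported summary statistics. The critical value 2.093 (df = 19) is taken from a standard t table and is an assumption of this sketch, not a value given in the handout.

```latex
\bar{y} \pm t_{(19,\,0.975)} \frac{s}{\sqrt{n}}
  = 201.8 \pm 2.093 \times \frac{53.25}{\sqrt{20}}
  = 201.8 \pm 2.093 \times 11.91
  \approx 201.8 \pm 24.9
```

This gives roughly [176.9, 226.7]. The small difference from the reported lower limit of 176.8 is consistent with the summary statistics having been rounded before this hand calculation.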
Section 15. Hypothesis Testing on Means
Overview
In Section 14, we discussed descriptive statistics associated with continuous variables, such as the mean and median, the standard deviation and standard error, and various types of ranges. Now we'll move on to hypothesis testing on the means of continuous variables. We'll first consider sample means from a single population. There are several questions that can be addressed with a single sample or population mean:
How does the sample mean compare with a hypothesized value?
We have a pre and a post measure on the same characteristic and we want to know: how does the pre-test score compare with the post-test score?
These are the same types of questions asked when we sampled from a single population with proportions; the main difference is the test we'll use to answer the questions. In both of these questions, we'll consider (briefly) the situation where the standard deviation of the population is known (not often the case). Then we'll move to the more common situation where we're estimating both the mean and standard deviation from a sample. Usually it's safe to assume a normal distribution, but not always. In the case where normality is clearly not appropriate, we can answer these questions using methods not based on a known distribution with defined parameters (nonparametric methods). First, we'll begin with the easiest situation of comparing one sample mean to a hypothesized or assumed mean. Let's return to the Cholesterol example.
One Sample Mean Compared to a Hypothesized Mean
Example: The mean cholesterol of a certain population is thought to be 205.7 mg/dL (standard deviation = 45.93 mg/dL). We have obtained a random sample of n = 30 subjects from this population (Table 1).
Table 1. Cholesterol Values of n = 30 Sample
119 127 141 151 160 169 175 175 176 180
183 185 186 190 191 192 195 195 196 197
209 230 230 235 237 240 249 280 292 313
The question of interest is: Is the mean of our sample different from the hypothesized mean of the population? To answer this question, we'll return to the 10 steps for hypothesis testing that we used for comparing proportions. The steps are nearly identical to the steps we went through to answer our very first question related to the CPR data. Recall that in that first situation we'd observed p = 29/278 = 0.104 and we wanted to know if this proportion was greater than the hypothesized proportion $p_0$ = 0.06 (our null hypothesis was $p = p_0$).
10 Steps for Hypothesis Testing AMB
Phase 1: State the Question
1. Evaluate and describe the data
Recall our first two questions: Where did this data come from? What are the observed statistics?
The source of data is extremely important. Remember that if you can't identify the source of the data and how it was collected, your inferences won't mean anything.
Where did this data come from? In this case, we drew a simple random sample of size n = 30 from the population.
What are the observed statistics? Enter the data into a .jmp file and look at the data using the Distribution platform (Figure 1).
Figure 1. Distribution of Cholesterol Sample
Use these results to describe your sample of 30 cholesterol values. There are no extreme outliers to investigate. The normal quantile plot shows no obvious departure from normality (all of the observations are within the dashed red lines and more or less follow the solid diagonal red line). Plus, since n is large (n = 30), we can assume the Central Limit Theorem (CLT) holds. Now that we are comfortable with normality, we can use the mean and standard deviation in our summary for Step 1:
From our random sample of 30 cholesterol values, the sample mean was 199.9 mg/dL (SD = 45.37). The 95% confidence interval on the population mean value is [183.0, 216.9].
Note that this 95% confidence interval is shown in the Moments report.
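The Step 1 summary can also be reproduced outside JMP. The following is a minimal sketch using the Table 1 values and Python's standard library; the critical value 2.045 (df = 29) is taken from a standard t table and is an assumption of this example.

```python
import math
import statistics

# Table 1: cholesterol values (mg/dL) for the n = 30 sample.
cholesterol = [
    119, 127, 141, 151, 160, 169, 175, 175, 176, 180,
    183, 185, 186, 190, 191, 192, 195, 195, 196, 197,
    209, 230, 230, 235, 237, 240, 249, 280, 292, 313,
]

n = len(cholesterol)
ybar = statistics.fmean(cholesterol)   # sample mean
s = statistics.stdev(cholesterol)      # sample standard deviation (n - 1 divisor)
se = s / math.sqrt(n)                  # standard error of the mean

t_crit = 2.045                         # two-sided 95% value for df = 29 (t table)
lower, upper = ybar - t_crit * se, ybar + t_crit * se

print(f"n = {n}, mean = {ybar:.1f}, SD = {s:.2f}")
print(f"95% CI = [{lower:.1f}, {upper:.1f}]")
```

The printed values should agree with the Moments report quoted above, up to rounding.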
2. Review assumptions
The assumptions for testing means are the same as the assumptions for testing proportions:
Is the process used in this study likely to yield data that is representative of the population? Yes.
Is each subject in the sample independent of the others? Yes.
Is the sample size sufficient? Sufficient for what? There are two things to consider. First, can we assume that the sample mean will be normally distributed? The CLT says (essentially) that $\bar{y}$ will be normally distributed if we're in one of two situations:
o The true underlying distribution is normal, or
o We have a large enough sample so that the CLT will hold.
We never really know if the distribution is normal. We could just assume it's normal. But assuming normality can be dangerous. The question is: Is the sample size sufficient so that we can assume the CLT will hold?
In this situation we look at the sample data and ask: Can we tell from the sample that the distribution is not normal? Or, do we have a large enough sample size for the CLT to hold? Bottom line: We have a strong preference for normality (testing is easier). So, unless you can clearly say "No, the sample does NOT look normal to me," and as long as n is large enough for the CLT to hold, stick with normality (we'll discuss what to do with small samples or non-normality later).
3. State the question in the form of hypotheses
Just as when comparing an observed proportion to a hypothesized proportion, there are three possible null hypotheses:
The null hypothesis is $\mu \le \mu_0$ (population less than or equal to hypothesized),
The null hypothesis is $\mu \ge \mu_0$ (population greater than or equal to hypothesized),
The null hypothesis is a fixed value, $\mu = \mu_0$ (population equal to hypothesized).
And the alternative hypothesis is the opposite of the null. What are the null and alternative hypotheses for our question? (Is the sample mean different from the hypothesized mean of 205.7 mg/dL?)
Phase 2: Decide How to Answer the Question
4. Decide on a summary statistic that reflects the question
Recall the generic test statistic from our earlier lectures:
$$\text{test statistic} = \frac{\text{summary statistic} - \text{hypothesized parameter}}{\text{standard error of the summary statistic}}$$
The relevant summary statistic is $\bar{y}$, the observed sample mean from n observations. We're testing hypotheses about whether the parameter estimated by the sample is the same as the hypothesized population parameter. We know, from the second property of the sampling distribution of $\bar{y}$, that the standard error in the denominator is:
$$\sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}}$$
So, to test our hypothesis relating the population mean to the sample mean, we compare the sample mean to the hypothesized value. When we know the population standard deviation, the statistic is as follows:
$$z = \frac{\bar{y} - \mu_0}{\sigma / \sqrt{n}}$$
But we almost never know $\sigma$, so the statistic we usually use is:
$$t = \frac{\bar{y} - \mu_0}{s / \sqrt{n}}$$
where s is the estimate of $\sigma$.
5. How could random variation affect that statistic?
What values of t will we see if the null hypothesis is true? It depends on what the hypothesis is.
For H0: $\mu \le \mu_0$ we'll likely see negative t values.
For H0: $\mu \ge \mu_0$ we'll likely see positive t values.
For H0: $\mu = \mu_0$ we'll see t close to zero.
If the assumptions are met and the null hypothesis is true and we know $\sigma$, then z is normally distributed. This means that we can plug a z-value into the p-value spreadsheet to calculate the associated p-value for our test statistic. Or, as we'll see, JMP will calculate p-values for us, whether we are using the z-statistic or the t-statistic.
What values of t will we see if the null hypothesis is not true? It depends on what the alternative hypothesis is.
For HA: $\mu > \mu_0$ we'll likely see positive t values.
For HA: $\mu < \mu_0$ we'll likely see negative t values.
For HA: $\mu \ne \mu_0$ we'll see t values different from zero.
Recall the rough interpretation that z's larger than 2 in absolute value are remarkable; for large n's this rough interpretation also holds for t.
6. State a decision rule, using the statistic, to answer the question
The universal decision rule: Reject H0 if p-value < $\alpha$ (usually $\alpha$ = 0.05).
Phase 3: Answer the Question
7. Calculate the statistic
There are three cases: a known sigma, an unknown sigma, and the nonparametric statistic.
First case: a known sigma
Assume that we know the true population standard deviation, $\sigma$ = 45.93 mg/dL. We use the sample of n = 30 observations with sample mean $\bar{y}$ = 199.9 mg/dL. The hypothesized mean is $\mu_0$ = 205.7 mg/dL:
$$z = \frac{199.9 - 205.7}{45.93 / \sqrt{30}} = \frac{-5.8}{8.39} = -0.69$$
Using the Excel spreadsheet to determine the p-value, we see the following.
Normal Curve Area: enter a value in the z-cell, z = -0.69. We're using a two-sided test, so use the third result.
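The spreadsheet lookup can be cross-checked in a few lines. This sketch uses the unrounded sample mean from Table 1 and the standard normal distribution from Python's standard library; it is an independent illustration, not the Excel calculator itself.

```python
import math
from statistics import NormalDist, fmean

# Table 1: cholesterol values (mg/dL) for the n = 30 sample.
cholesterol = [
    119, 127, 141, 151, 160, 169, 175, 175, 176, 180,
    183, 185, 186, 190, 191, 192, 195, 195, 196, 197,
    209, 230, 230, 235, 237, 240, 249, 280, 292, 313,
]

mu0 = 205.7      # hypothesized mean
sigma = 45.93    # known population standard deviation

n = len(cholesterol)
ybar = fmean(cholesterol)
z = (ybar - mu0) / (sigma / math.sqrt(n))

# Two-sided p-value: area in both tails beyond |z|.
p_two_sided = 2 * NormalDist().cdf(-abs(z))

print(f"z = {z:.2f}, two-sided p-value = {p_two_sided:.4f}")
```

Because the mean is not rounded to 199.9 before computing z, the p-value should land very close to the 0.4917 that JMP reports for the z-test.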
Second case: an unknown sigma
Using the estimated standard deviation s = 45.37:
$$t = \frac{199.9 - 205.7}{45.37 / \sqrt{30}} = \frac{-5.8}{8.28} = -0.70$$
Note that the df for this test is n - 1. So, in this case the df for the t-test is 29.
Using JMP to calculate the p-value
Hypothesis tests on a single mean are easy with JMP. Begin in the Distribution platform window. Use the arrow popup to the right of the column name and choose Test Mean.
A dialog will appear that's self-explanatory. Under Specify Hypothesized Mean, type 205.7. To use the known standard deviation (and thus a z-test), fill in the value under True Standard Deviation: 45.93.
Press the OK button. The report appears at the bottom of the Distribution of Y window. We still have to choose the p-value we want to report, based on our knowledge of the null and alternative hypotheses.
We were using a two-sided test, so our p-value is 0.4917, which is similar to the p-value from the Excel calculator (we rounded off our mean estimate; JMP does not round off until the p-value). To use the estimated standard deviation (and thus a t-test), do not fill in the value; leave the true standard deviation blank. The new report appears at the bottom of the Distribution of Y window.
Which p-value should we report? Use the one that matches the direction of the alternative hypothesis.
For HA: $\mu > \mu_0$ we'll likely see positive t values; report Prob > t.
For HA: $\mu < \mu_0$ we'll likely see negative t values; report Prob < t.
For HA: $\mu \ne \mu_0$ we'll see t values different from zero; report Prob > |t|.
We use the two-sided result, p-value = 0.4919.
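To make the three p-values concrete, here is a minimal sketch that computes the t statistic from the Table 1 data and all three tail probabilities. It is not how JMP computes them: the t-distribution tail area is obtained here by numerically integrating the t density with Simpson's rule, a choice made only so the example needs nothing beyond the standard library. The helper names `t_pdf` and `t_upper_tail` are hypothetical.

```python
import math
from statistics import fmean, stdev

# Table 1: cholesterol values (mg/dL) for the n = 30 sample.
cholesterol = [
    119, 127, 141, 151, 160, 169, 175, 175, 176, 180,
    183, 185, 186, 190, 191, 192, 195, 195, 196, 197,
    209, 230, 230, 235, 237, 240, 249, 280, 292, 313,
]


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """P(T > t), using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    upper = 0.5 - total * h / 3        # P(T > |t|), by symmetry about zero
    return upper if t >= 0 else 1 - upper


mu0 = 205.7
n = len(cholesterol)
df = n - 1
t = (fmean(cholesterol) - mu0) / (stdev(cholesterol) / math.sqrt(n))

p_greater = t_upper_tail(t, df)              # Prob > t,   for HA: mu > mu0
p_less = 1 - p_greater                       # Prob < t,   for HA: mu < mu0
p_two_sided = 2 * t_upper_tail(abs(t), df)   # Prob > |t|, for HA: mu != mu0

print(f"t = {t:.2f}, df = {df}")
print(f"Prob > t = {p_greater:.4f}, Prob < t = {p_less:.4f}, "
      f"Prob > |t| = {p_two_sided:.4f}")
```

The two-sided value should agree with the 0.4919 in the JMP report to about three decimal places; the one-sided values show why the direction of the alternative matters.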
8. Make a statistical decision
We choose to interpret the t-test. Since p-value = 0.4919, we fail to reject the null hypothesis at the $\alpha$ = 0.05 level of significance.
9. State the substantive conclusion
We do not have sufficient evidence to conclude that the population mean is different from the hypothesized mean of 205.7 mg/dL.
Phase 4: Communicate the Answer to the Question
10. Document our understanding with text, tables, or figures
The hypothesized mean cholesterol value is 205.7 mg/dL in a specific population. From a random sample of 30 observations, we observed a sample mean of 199.9 mg/dL (SD = 45.37). The sample mean is assumed to be normally distributed by the CLT.
We compared the sample mean to the hypothesized mean. Using a two-tailed t-test, we conclude that the observed mean is not significantly different from the hypothesized mean (t = -0.70, df = 29, p-value = 0.4919). From this we conclude that there is no evidence the population mean differs from 205.7 mg/dL. We're 95% confident that the mean cholesterol value is included in the interval [183.0, 216.9].
Alternatively, it may be more straightforward to include many of the numbers in a table and state your results in text. So instead of the above paragraph, you could do this:
From a random sample of 30 patients, we observed the statistics shown in Table 2. We tested the hypothesis that the population mean cholesterol value is 205.7 mg/dL, and concluded that the observed mean is not significantly different (t = -0.70, df = 29, p-value = 0.4919). From this we conclude that there is no evidence the population mean differs from 205.7 mg/dL. We're 95% confident that the mean cholesterol value is included in the interval [183.0, 216.9].
Table 2. Description of Sample
         Cholesterol
mean     199.9
SD       45.37
Range    119.0 to 313.0
It may also be useful to include the 95% confidence interval as additional columns in the table, and thus omit it from the text.
Data not normal?
We know the triglyceride data in the Figure 2 population is not normal. However, all we usually have to go on is the sample. In Figure 2 we see the sample results. Looking at these sample values, would you conclude that the distribution is normal?
If n is too small for the CLT to apply, or when it's known that the distribution is not normal, or when we don't want to make the strong assumptions implied by assuming normality, what can we do? If we want a test that makes minimal assumptions, then we look to the kinds of methods called nonparametric statistics. These methods don't assume any form for the distribution and so there aren't any parameters to estimate. In this situation, the minimal assumptions are as follows:
The data are representative.
The observations are independent.
The population is symmetrically distributed about its center, $\mu$.
In particular, the following method applies for small samples or if the distribution has a center with tails on each side. The hypotheses are the same as before, but the statistic we use is different.
The Wilcoxon signed-rank test
Nonparametric statistical tests use the rank of a score, not the raw value, to compute statistics (and p-values). The signed-rank statistic was developed by a statistician named Wilcoxon. It considers the difference of each value from the hypothesized mean. It ranks these differences without regard to whether the difference is above or below the hypothesized mean. If there is no difference from the hypothesized mean (that is, if the null hypothesis is true), then the chance of observing a positive difference of a given size is the same as the chance of observing a negative difference of that size. Observations are ranked on the magnitude of the difference without regard to sign. So, differences of the same magnitude, say -10.3 and +10.3, have the same rank. Then Wilcoxon's signed-rank test compares the sum of the ranks above the hypothesized value with the sum of the ranks below the hypothesized value. The closer the observed mean is to the hypothesized mean, the closer these two rank sums will be.
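The ranking procedure described above can be sketched in code. Since the triglyceride sample is not reproduced in this handout, the example below applies the procedure to the Table 1 cholesterol values with the hypothesized value 205.7, purely to show the mechanics. The function name `signed_rank` is hypothetical, tied differences receive the average of their ranks, and the p-value uses the large-sample normal approximation without tie or continuity corrections, so it may differ slightly from what JMP reports.

```python
import math
from statistics import NormalDist

# Table 1: cholesterol values (mg/dL), used here only to illustrate the method.
cholesterol = [
    119, 127, 141, 151, 160, 169, 175, 175, 176, 180,
    183, 185, 186, 190, 191, 192, 195, 195, 196, 197,
    209, 230, 230, 235, 237, 240, 249, 280, 292, 313,
]


def signed_rank(values, hypothesized):
    """Return (sum of ranks above, sum of ranks below) the hypothesized value."""
    # Differences from the hypothesized value; exact zeros are dropped.
    diffs = sorted((v - hypothesized for v in values if v != hypothesized), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        # Extend j over a run of differences with the same magnitude.
        while j + 1 < len(diffs) and abs(diffs[j + 1]) == abs(diffs[i]):
            j += 1
        average_rank = (i + j) / 2 + 1   # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


w_plus, w_minus = signed_rank(cholesterol, 205.7)

# Normal approximation for the rank sum under the null hypothesis.
n = len(cholesterol)
mean_w = n * (n + 1) / 4
sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (w_plus - mean_w) / sd_w
p_two_sided = 2 * NormalDist().cdf(-abs(z))

print(f"W+ = {w_plus}, W- = {w_minus}, approximate two-sided p = {p_two_sided:.3f}")
```

If the null hypothesis were true, the two rank sums would tend to be close to each other; a large imbalance is what drives a small p-value.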
Hypothesis testing using Wilcoxon signed-rank test
But you don't have to understand the details to perform the test. To test whether a sample of 30 triglyceride values is different from the hypothesized value of 164.7 mg/dL, just fill in the hypothesized value and check the box labeled "If you also want a nonparametric test: Wilcoxon Sign-Rank".
The resulting table will also show the statistic and p-values for Wilcoxon's signed-rank test:
How to choose between the p-values? The easiest way is to choose the one you would have used if you had done the t-test. The choice is the same. When reporting the result of Wilcoxon's signed-rank test, it is not useful to report the statistic; report just the p-value.
Summary: The hypothesized mean triglyceride value is 164.7 mg/dL. From a random sample of 30 observations, we observed a sample mean of 164.2 mg/dL (SD = 105.86) and a median value of 129.5 (range = 51 to 584). Triglyceride is known to be not normally distributed, so we tested whether the population mean was different from the hypothesized mean using Wilcoxon's signed-rank test. We conclude that the observed mean is not significantly different from the hypothesized mean (two-tailed p-value = 0.269). From this we conclude that there is no evidence the population mean differs from 164.7 mg/dL.
Summary: One Sample Mean
Briefly, here is how to proceed when comparing a sample mean to a hypothesized value.
Describe the source of the data.
After reviewing the Distribution of Y report on the variable of interest, state the descriptive statistics that seem to be the most useful.
Assess the assumptions, including normality. If normality is warranted, a confidence interval on the mean is a useful descriptive statistic.
Decide upon your hypothesis. Be clear as to which values of t will lead to rejecting the null hypothesis.
Use the Test Mean = value popup in JMP to test your hypothesis.
Record the df, t-value, and p-value if you use parametric methods. If you used the nonparametric test, report the p-value.
Reject the hypothesis or fail to reject?
State your substantive conclusion (tell the story).
Include relevant summary and results in your final report.
Section 16. Two Means From Paired Measurements
In 2004, high school students will have to pass Standards of Learning (SOL) tests to graduate and schools will have to achieve a minimum pass-rate (50%) to retain school certification. In Figure 1 we see just the 10th grade Algebra I pass-rates for each high school in Alabama for 1999 and 2000. Do not confuse the individual student scores with these school average pass-rates. Describe these scores.
Figure 1. 10th Grade, 1999 and 2000, Algebra I Passing Rates
Copyright Stacey S. Cofield, 26 October 2004. All rights reserved. AMB denotes material reprinted courtesy of Al Best, Virginia Commonwealth University
Hypothesis testing
In a press release from the Alabama Department of Education, these two sentences appear: "Students posted modest gains in this year's administration of the SOL exams, which test students in the core academic areas of Algebra I, math, science, and history" and "There's a clear trend of overall improvement, and in many places it's very significant." What is the hypothesis being tested here?
Phase 1: State the Question
1. Evaluate and describe the data
Where did this data come from? What are the observed statistics?
2. Review assumptions
There are three questions:
Is the process used in this study likely to yield data that is representative of the population? Actually, you could argue that the data we have is the population. In any event, it's certainly representative.
Is each observation independent of the others? What are the observations? Our assumption is that all the observations in a single column are independent. That is, the Algebra I 1999 column contains the pass-rates for n = 306 schools. Are these observations (rows) independent? The other column is labeled Algebra I 2000; it contains the pass-rates for n = 306 schools. Are these observations independent? If each observation in the Algebra I 1999 column is independent, then the sample mean calculated from those observations will represent the average pass-rate for the 1999 school year on the mandated curriculum in 10th grade Algebra I classes. If the observations are not independent, then the sample mean may be biased and it will also have a different standard error than we expect.
What about the values in a row, are they independent? That is, the 5th row in Figure 2 shows the scores for Aliceville High School in Pickens County. Are the test scores within a school independent?
Figure 2. School Data
Is the sample size sufficient? Can we assume that the sample mean will be normally distributed? The CLT says that each variable's $\bar{y}$ will be normally distributed if the underlying distribution is close to normal or we have a large sample (Figure 3). We are considering two variables:
Is the sample size sufficient for Algebra I 1999? If so, then the sample mean, $\bar{y}_{1999}$, will be normally distributed with mean $\mu_{1999}$ and standard error $\sigma_{1999} / \sqrt{n_{1999}}$. (Note: the 1999 is a subscript representing scores from that calendar year.)
Is the sample size sufficient for Algebra I 2000? If so, then the sample mean, $\bar{y}_{2000}$, will be normally distributed with mean $\mu_{2000}$ and standard error $\sigma_{2000} / \sqrt{n_{2000}}$. But this 2000 mean and 2000 standard error will be related to the 1999 mean and standard error to some unknown degree.
These are important assumptions. If they are met, then the two sample means each have a normal distribution, but each distribution can have a different mean and standard error. But these two distributions are not independent. The values come from paired observations, so the observations in the same school, across different years, are dependent. This is a problem we'll need to address later.
Figure 3. Normal Quantile Plots for Each Year
3. State the question in the form of hypotheses
Again there are three situations we may be in:
The null hypothesis is $\mu_{1999} \le \mu_{2000}$ (the first is less than or equal to the second),
The null hypothesis is $\mu_{1999} \ge \mu_{2000}$ (the first is greater than or equal to the second), or
The null hypothesis is equality, $\mu_{1999} = \mu_{2000}$ (they are the same).
And the alternative hypothesis is the opposite of the null. What are the null and alternative hypotheses for our question?
H0:
HA:
In a sentence:
An alternate statement of the hypotheses
Recall that when we compared two proportions, the null hypothesis was H0: $p_{CPR} = p_{chest}$. We noted that this was equivalent to asking whether the difference between the two was zero. That is, we also stated the hypothesis as H0: $p_{CPR} - p_{chest} = 0$. Now, let's restate the null and alternative hypotheses of the SOL data using the difference.
H0:
HA:
In a sentence:
The Difference Score
Consider this difference score, the difference between a school's 2000 Algebra I score and the same school's 1999 score. Let's calculate it so that a positive number will indicate an improvement:
Difference = [Algebra I 2000, Grade 10] - [Algebra I 1999, Grade 10]
So, for Aliceville High School in Pickens County (see Figure 2):
Difference = 61.84 - 46.15 = 15.69
Figure 4 shows the distribution of values of this difference score. Describe the sample results.
Figure 4. Distribution of the Difference: YR2000 minus YR1999
We've restated the question from "Is there a difference between two paired means?" to "Is the difference zero?" So, let's back up. (In Alabama, one principal whose school made some gains gave credit to "the total dedication of the complete staff. We improved from last year, but we're still not satisfied. We've been working very hard, very diligently, every day, to get our children to pass.")
Phase 1: Re-State the Question
1. Evaluate and describe the data
Where did this data come from? Think of the difference score data this way. Say each school kept its own records and only reported the year-to-year differences to the state. Then, all we'd have to work with is the data we see in Figure 4.
2. Review assumptions
There are three questions:
Is the process used in this study likely to yield data that is representative of the population? Is the difference score representative of the schools in Alabama?
Is each observation independent of the others? Our assumption is that all the observations in a single column are independent. That is, the Difference column contains the differences in pass-rates for the n = 303 schools that had scores in both years. Are these differences independent?
Is the sample size sufficient? Can we assume that the sample mean will be normally distributed? The CLT says that the mean difference $\bar{d}$ will be normally distributed if the underlying distribution is close to normal or we have a large sample. Is the sample size sufficient for Difference? If so, then the sample mean, $\bar{d}$, will be normally distributed with mean $\mu_d$ and standard error $\sigma_d / \sqrt{n_d}$. The sample mean difference $\bar{d}$ will be an unbiased estimate of $\mu_{2000} - \mu_{1999}$. This is good.
It also turns out that $\sigma_d / \sqrt{n_d}$ is not just a simple function of $\sigma_{1999} / \sqrt{n_{1999}}$ and $\sigma_{2000} / \sqrt{n_{2000}}$. This is not good. However, do we care? Do we need to know what the true $\sigma$'s are? Can't we just estimate the standard error of the difference directly? See the Std Error Mean in Figure 4.
Another question that may not be obvious is this: If $y_{1999}$ is normally distributed and $y_{2000}$ is also normally distributed, is the difference d normally distributed? From Figure 5, it looks like it is normally distributed. In fact, the difference of two normally distributed variables will be normally distributed.
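For reference, the reason the standard error of the mean difference is not a simple function of the two yearly standard errors is the dependence between years. The identity below is a standard result for the variance of a difference of paired values; the symbol $\rho$, the correlation between a school's 1999 and 2000 pass-rates, is introduced here for illustration and does not appear elsewhere in this handout.

```latex
\sigma_d^2 = \sigma_{1999}^2 + \sigma_{2000}^2 - 2\,\rho\,\sigma_{1999}\,\sigma_{2000},
\qquad
\sigma_{\bar{d}} = \frac{\sigma_d}{\sqrt{n_d}}
```

When the paired values are positively correlated ($\rho > 0$), $\sigma_d$ is smaller than it would be for independent samples. Estimating the standard error directly from the difference scores, as suggested above, accounts for this without needing to know $\rho$.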
Figure 5. Normal Quantile Plot of the Difference Score
Bottom line: If a variable is normally distributed then we can perform the following operations on it and the new variable will still be normally distributed:
Add or subtract a constant. So, if Algebra I is normally distributed then 10 + Algebra I is normally distributed.
Multiply or divide by a constant. So, Algebra I / 100 is normally distributed.
Add or subtract another normally distributed variable. So, if Algebra I is normal and Algebra II is normal then Algebra I + Algebra II is normal.
And that's it. Note: Dividing two normals will not yield a normal distribution. Multiplying two normals will not yield a normal distribution.
Phase 2: Decide How to Answer the Question
4. Decide on a summary statistic that reflects the question
So the null hypothesis tests whether $\mu_d$ equals a hypothesized value, zero. But haven't we already covered this? In the One Sample Mean handout, one of the null hypotheses was:
The null hypothesis is a fixed value, $\mu = \mu_0$
If we think of it as
The null hypothesis is a fixed value, $\mu_d = 0$
it's obvious that these two situations are identical. So, the statistic here is identical to the One Sample Mean situation:
$$t = \frac{\bar{y} - \mu_0}{s / \sqrt{n}}$$
Then the remaining steps are identical (except in how we write up the sentence interpretations, of course).
7. Calculate the statistic
Then let's fast forward. We saw earlier how to test whether a mean equals a value.
8. Make a statistical decision
Which p-value corresponds to our hypothesis? Which p-value do we report?

Conclusion
One way to state the conclusion would be as follows. The Alabama SOL school-specific passing rates were used to assess the claim that there has been a significant increase in scores since the spring of 1999. The 10th grade Algebra I passing rates are summarized in Table 1. Different numbers of schools were assessed in each year, but we are 95% confident that the average passing rate from the spring 1999 testing is between 36.1% and 41.1%. The spring 2000 testing resulted in a confidence interval between 46.3% and 51.3%. Using a paired t-test, there is an increase in 10th grade Algebra I test scores from 1999 to 2000 (t = 12.42, df = 302, p-value < 0.0001).

Table 1. School Averages
Year          N     Mean    SD      95% CI
1999          306   38.57   22.02   36.1 to 41.1
2000          306   48.80   22.05   46.3 to 51.3
2000 - 1999   303   10.49   14.70   8.8 to 12.1
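The t value quoted in the conclusion can be reproduced from the difference row of Table 1. The sketch below is only an illustration: the function name is made up for this example, and the confidence interval uses the normal quantile (about 1.96) instead of the t quantile, which is an approximation that is reasonable for n = 303.

```python
import math
from statistics import NormalDist


def paired_t_from_summary(mean_diff, sd_diff, n, mu0=0.0):
    """One-sample t statistic applied to difference scores (hypothetical helper)."""
    se = sd_diff / math.sqrt(n)        # standard error of the mean difference
    t = (mean_diff - mu0) / se         # same form as the One Sample Mean statistic
    return t, n - 1, se


# Values from the "2000 - 1999" row of Table 1
t, df, se = paired_t_from_summary(mean_diff=10.49, sd_diff=14.70, n=303)
print(f"t = {t:.2f}, df = {df}")       # t = 12.42, df = 302

# Large-sample 95% interval (normal approximation, an assumption of this sketch)
z = NormalDist().inv_cdf(0.975)
low, high = 10.49 - z * se, 10.49 + z * se
print(f"95% CI: {low:.1f} to {high:.1f}")   # 95% CI: 8.8 to 12.1
```

The p-value itself needs the t distribution, which JMP reports directly.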
Summary: Two Paired Means
Briefly, here is how to proceed when comparing two paired means.
- Describe the source of the data.
- After reviewing the Distribution of Y report on both variables, state the descriptive statistics that seem to be the most useful. Assess the assumptions, including normality. If normality is warranted, confidence intervals on each of the means are useful descriptive statistics.
- Calculate the difference score (see below). Make sure you understand what a positive difference means: is it an improvement or a worsening?
- After reviewing the Distribution of Y report on the difference score, state the descriptive statistics that seem to be the most useful. Assess the assumptions relating to the difference score, including normality. If normality is warranted, a confidence interval on the mean difference is a useful descriptive statistic.
- Using the difference score, decide upon your hypothesis. Be clear as to which values of t will lead to rejecting the null hypothesis.
- Use the Test Mean = value popup in JMP to test your hypothesis. Record the df, t-value, and p-value if you use parametric methods. If you used the nonparametric test, record the p-value.
- Reject the null hypothesis or fail to reject?
- State your substantive conclusion (tell the story).

Creating a difference score in JMP (AMB)
Here are the steps for calculating the difference between two columns using the JMP calculator. First, we have to create a new column. Then we calculate its values.
- Choose Cols > New Column.
- When a dialog window appears, do two things:
  - Type in the name of the column. For example, Column name: Post - Pre
  - Use the popup and change New Property > Formula.
You'll then see the JMP calculator window (Figure 6).
Figure 6. The JMP IN Calculator Window
Note the general layout of the calculator window:
- The Calculator panel at the top is what we manipulate.
- The Formula display panel at the bottom shows us the result of what we've done.

Important note: The formula is a single expression that will be repeatedly applied to each row. That is, we build a formula for one generic row and JMP automatically applies this identical formula to each and every row. (For those of us used to Excel, note that JMP is different from Excel. In Excel, each cell can have a different formula. In JMP, all the cells in a column have the same formula.)

In the calculator panel there are a lot of things we can click on. Important note: There is always something in the formula that is highlighted. Right now, the empty formula is highlighted, but as you click on different things, the portion of the formula that is in blue is highlighted. This is important because whatever is highlighted in the formula display is the object that will be acted on when you click on the actions in the calculator panel. This follows the one rule of a graphical user interface in The analysis of frequencies using JMP handout.

This is what will happen to the formula when you click on the different controls in the calculator panel:
- Clicking on a variable in the column selector list will replace the highlighted portion of the formula with the column name.
- Clicking on one of the operators in the Keypad (add, subtract, and so forth) will apply that operator to the highlighted portion of the formula.
- Clicking on the groups of functions listed on the left side of the function browser displays the functions in that group in the right-side list of the function browser (and does nothing to the formula).
- Clicking on a function in the function browser and choosing from the popup list applies that function to the highlighted portion of the formula.
- Pressing the button tells JMP to apply the formula to each row in the data table (and leave the window open).
- Closing the window applies the formula to each row in the data table and closes the calculator window.
Steps to create a difference score
To build a formula that calculates a difference takes just three clicks. Usually when calculating a difference score, we want to calculate the post minus the pre. So, remember which of the columns is the post and which is the pre.
- In the column selector list, select the post variable (here, using the English scores). The formula display area should now look like below.
- In the Keypad, press the subtract button. The formula display area should now look like below.
- In the column selector list, select the pre variable. In our example, we selected English 1999. The formula display area should now look like below.
That's it. Close the window. Notice that JMP IN automatically handles missing values. That is, if either English 1999 or English 2000 is missing, then no difference score is possible. In either case, the difference is shown in the data table with a missing-value marker.
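For readers who want to see the same row-by-row logic outside JMP, here is a minimal Python sketch. The column values are hypothetical, and `None` stands in for a missing value; this is an illustration of the idea, not a description of how JMP works internally.

```python
def difference_score(post, pre):
    """Return post - pre row by row; a row with a missing value stays missing."""
    return [None if (a is None or b is None) else a - b
            for a, b in zip(post, pre)]


# Hypothetical pass rates for four schools (None = missing)
english_1999 = [62.0, None, 48.5, 71.0]
english_2000 = [70.5, 55.0, None, 69.0]

diff = difference_score(english_2000, english_1999)
print(diff)  # [8.5, None, None, -2.0]
```

As in the JMP formula, one expression is applied identically to every row, and a positive value means the post score is higher than the pre score.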
Section 17. Comparing Means in Two Populations
Overview
The previous section discussed hypothesis testing when sampling from a single population (either a single mean or two means from the same population). Now we'll consider how to compare sample means from two populations. Towards the end of the course, we'll discuss comparing means from more than two populations. When we're comparing the means from two independent samples we usually ask: How does one sample mean compare with the other? However, focusing just on comparing the means can be premature. It's safer to first consider the variability of each sample, the pattern of any outliers, and the shape of the distributions. Then it may be safe to assume a normal distribution, but not always. So, we'll also discuss approaches to answering these questions when we're not comfortable with the assumption of normality, or when this assumption is just not defensible.
Two Sample Means
Cities and counties: Returning to the Alabama SOL pass-rates, is there a difference between cities and counties? Recall that last time we looked at a difference across years in the same population. Here, we want to look at the difference in one year between two populations: city high schools and county high schools.

Phase 1: State the Question

1. Evaluate and describe the data
Begin by looking at the data. Where did the data come from? What are the observed statistics? The source of this data is the Alabama Department of Education. The first step in any data analysis is evaluating and describing the data. These first steps are also called preliminary analysis, to distinguish them from the definitive (or outcome) analysis. The goal of a preliminary analysis is to describe and inform. The goal of a definitive analysis is decision making or hypothesis testing.
Copyright Stacey S. Cofield, 28 October 2004. All rights reserved. AMB denotes material reprinted courtesy of Al Best, Virginia Commonwealth University
Preliminary Analysis
What are the observed statistics? Use the Fit Y by X platform to look at a graphical and tabular summary of the data, as in Figure 1.
Figure 1. Algebra I 2000 Pass Rates by School Location (City/County)
Note: Previously, in Step 1, we used the Distribution of Y reports to identify and fix errors and to further understand the data.
There are three components to this figure: the dot plots, the box plots, and the quantiles table. Let's look at each.

The Dot Plot
A dot plot shows the continuous Y-variable's (Algebra I 2000 pass rate) values along the vertical axis and the nominal X-variable's (City Yes or No) values along the horizontal axis. So we see the two groups along the horizontal axis: City = No and City = Yes. The width of each group is proportional to its sample size; there are more No (non-city) values, so that group is drawn wider. This follows the "your eye goes to ink" rule. Groups with larger samples are more informative than groups with smaller samples, so the larger-n group is drawn bigger.

Along the vertical axis, we see the 10th grade, year 2000 Algebra I SOL pass-rate, with one dot for each high school. Values range from 0% passing to 100% passing. The horizontal spreading of the values is done so that you can see each school's score better (called Jittered Points). The amount of horizontal spread (jittering) is random; that is, don't try to interpret the scores for schools farther to the right or left any differently than schools with scores closer to the center (horizontally). Of course, the vertical values are interpretable; that is, a school at the top has a higher pass-rate than a school at the bottom.

Box Plots
These side-by-side box plots describe the shape of the distributions within each group. These plots do not assume normality, so we use them to begin to answer the question, "Is each group normally distributed?" In the box plots we can easily see whether the values are symmetric about the median. Look for these warning flags that the data are not normal:
- Is the distance between the median and the 75%tile different than the distance between the median and the 25%tile?
- Is the upper whisker-bar (actually, the 90%tile) more distant from the median than the lower whisker-bar (the 10%tile)?
- Are the high-extreme tail-values more distant from the median than the low-extreme tail-values?
These informal, graphic assessments don't raise any warning flags for these data. The dotted horizontal line represents the mean value for all schools (not considering the group).

Quantiles Report
If a more detailed comparison of values is needed, the numerical values plotted in the box plots are shown in the Quantiles Report (Figure 2).
Figure 2. Quantiles Report for Each School Location
For instance, in the City = No group, the distance from the median to the 75%tile (~49 vs 65, 16 points) is about the same as the distance between the median and the 25%tile (~49 vs 34, about 15 points). But, as in the Distribution platform, the preferred way to answer the question, "Is each group normally distributed?" is with a normal quantile plot.

Normal Quantile Plots
Actually, the more proper phrasing of the question is: "Within each group, are the data normally distributed?" That is, it may be that if we were to lump both groups together, the data would appear non-normal. We must take group membership into account when making this assessment. The normal quantile plots for these data appear in Figure 3.
Figure 3. Normal Quantile Plots for Each Location
Follow the same interpretation of the normal quantile plot as we discussed with a single mean. Does each group of black dots lie along a straight line? In the SOL data, these two sets of points follow the lines fairly well, with some departure in the tails.
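To make the construction of a normal quantile plot concrete, here is a minimal sketch of how the plotted points can be computed for one group. The data are hypothetical, and the plotting position (i + 0.5) / n is one common convention; JMP may use a different one, so treat this as an illustration only.

```python
from statistics import NormalDist


def normal_quantile_points(values):
    """Pair each sorted value with the standard normal quantile for its rank."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    # (i + 0.5) / n is an assumed plotting position; other conventions exist
    return [(std_normal.inv_cdf((i + 0.5) / n), y)
            for i, y in enumerate(ordered)]


# Hypothetical pass rates for one group, for illustration only
pass_rates = [34.0, 41.5, 49.0, 52.0, 58.5, 65.0, 72.0]
for q, y in normal_quantile_points(pass_rates):
    print(f"{q:6.2f}  {y:5.1f}")
```

If the group is roughly normal, the points fall near a straight line, and a steeper line corresponds to a larger standard deviation. The calculation is done separately for each group.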
So, we now have enough information to answer the question, "Within each group, are the data normally distributed?" If the answer is "Yes" or "Probably" then we can proceed with parametric tests to compare the means. The Central Limit Theorem can apply if the sample size is large. The rule of thumb is that the total n is at least 30 ($n_1 + n_2 \ge 30$). If the answer to the normality question is "No" or "I doubt it" then we'll use nonparametric methods to answer our question. First we'll look at parametric methods.
Preliminary analysis, showing means
If the data are normally distributed then means and SDs make sense. If these distributional assumptions are unwarranted, then we should consider nonparametric methods. Thus, the next thing to do in our preliminary analysis may be to get rid of the box plots and quantile plot and to show the means and standard deviations calculated within each group. In Figure 4 we see these values; the data points have been hidden to show the plotted means, error bars, and standard deviation lines.
Figure 4. Means and Standard Deviations for Each Location
This figure shows the means, here connected with a line, and short dashed bars that mark one standard error above and below each mean. The long dashed lines above and below the means are one standard deviation away from their respective mean. The means, standard errors, and standard deviations can be shown by selecting Means and Std Dev from the main Options menu. From the Display Options sub-menu in the Options menu, select the options necessary (Figure 5).
Figure 5. Display Options for Fit Y by X

Means and Std Deviations Report
This summary table is repeated in Figure 6. We can use it to describe the following: the number of observations in each group, the mean of each group, and the standard deviation within each group. Recall that the SE is not a descriptive statistic for the data; it is used for inference about the mean. JMP includes the SE here because it is used to form confidence intervals about the mean.
Figure 6. Means and Standard Deviations for Each Location
Note: You can change the number of decimal places displayed in any JMP report:
- Double-click a number in the report.
- A dialog will appear. Change the number of decimal places.
Summary: Preliminary Analysis
So far, what have we learned about the data?
- We have not found any errors in the data.
- We're comfortable with the assumption of normality within each group.
- We've obtained descriptive statistics for each of the groups we're comparing.
Also, at this point, we can look at the two means and make a guess: is there a difference between cities and counties? City schools seem to be about 12 points below non-city schools, and with SEs < 3, this seems like a big difference. Recall that the t-statistic is the ratio of the difference to a standard error. The ratio of 12 to 3 is bigger than 2.
2. Review assumptions
As always there are three questions to consider.
- Is the process used in this study likely to yield data that is representative of each of the two populations? Yes, it is the population.
- Is each observation in the two samples independent of the others? Yes.
- Is the sample size sufficient? Yes, both groups are large and we're comfortable with normality for both groups.
So, we ask the same questions we did in Hypothesis Testing on Means when looking at one sample mean (see that handout). The difference is that we have to ask these questions twice: once for each of the two groups, city and county (or City = Yes and City = No).

Bottom line: We have to be comfortable that the first two assumptions are met before we can proceed at all. If we're comfortable with the normality assumption, then we proceed, as below. Later, we'll discuss what to do when normality cannot be safely assumed.
3. State the question in the form of hypotheses
Let's refer to the two groups as 1 and 2 for notational purposes. Using these as subscripts, there are three possible null hypotheses:
- The null hypothesis is $\mu_1 \le \mu_2$ (the mean of population one is less than or equal to the mean of population two),
- The null hypothesis is $\mu_1 \ge \mu_2$ (the mean of population one is greater than or equal to the mean of population two), or
- The null hypothesis is a fixed value, $\mu_1 = \mu_2$ (the mean of population one is equal to the mean of population two).
And the alternative hypothesis is the opposite of the null. So, what are the null and alternative hypotheses for our question: Is there a difference between cities and counties?
Phase 2: Decide How to Answer the Question

4. Decide on a summary statistic that reflects the question
Recall the general test statistic:

$$\text{test statistic} = \frac{\text{summary statistic} - \text{hypothesized parameter}}{\text{standard error of the summary statistic}}$$

In this situation (as in comparing paired means), we are going to use a difference as our summary statistic:
- The relevant statistic is $\bar{y}_1 - \bar{y}_2$, the observed difference of the two means.
- The hypothesized parameter is easy: $\mu_1 - \mu_2 = 0$, since under any of the three null hypotheses a difference of 0 would result in failing to reject the null hypothesis.
- What about the standard error? There are two possibilities for the standard error of $\bar{y}_1 - \bar{y}_2$, and they depend upon the two standard deviations within each group. Are they the same? Or do the two groups have different standard deviations?
Assuming the two populations have equal variances
If the standard deviations (or variances) within the two populations are equal, then the standard error of the difference is easy. We just combine the two estimated variances to obtain a pooled estimate. The variance of the mean difference is the sum of the variances (the squared standard errors) of each mean:

$$\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$
We'll use the t-test to compare the two sample means and, using a pooled estimate for the variance called $s_p^2$, we calculate:

$$t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}}$$

The pooled variance estimate is a weighted average of the two individual-group variances:

$$s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}$$

Under the equal variance assumption, we calculate the p-value using df = $n_1 + n_2 - 2$.
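As a check on the pooled calculation, the following sketch works it through from summary statistics. The function name is made up for this example; the inputs are the city and county summaries reported later in this section (city: n = 92, mean 35.9, SD 19.90; county: n = 306, mean 48.8, SD 22.05).

```python
import math


def pooled_t_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Equal-variance (pooled) two-sample t statistic from summary statistics."""
    # Pooled variance: weighted average of the two group variances
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    # Standard error of the difference between the two means
    se_diff = math.sqrt(sp2 / n1 + sp2 / n2)
    t = (mean1 - mean2) / se_diff
    return t, n1 + n2 - 2, se_diff


# Group 1 = city, group 2 = county
t, df, se_diff = pooled_t_from_summary(92, 35.9, 19.90, 306, 48.8, 22.05)
print(f"t = {t:.1f}, df = {df}, SE = {se_diff:.2f}")  # t = -5.0, df = 396, SE = 2.57
```

The sign of t depends only on which group is called group 1; here city minus county is negative because city schools have the lower mean.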
Assuming the two populations have unequal variances
If the variances are not equal, the calculation is more complicated:

$$t' = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
Note that the separate variance estimates are used in this t-prime statistic, not the pooled estimate of the variance. Further, the df is not a simple function of just $n_1$ and $n_2$. The details of these calculations are not important. What we need to know is how to proceed using JMP.
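For the curious, the sketch below shows one standard way the unequal-variance statistic and its df are computed, using the Welch-Satterthwaite approximation for the df. That JMP uses exactly this formula is an assumption of this example, and the function name is hypothetical. The inputs are the city and county summaries reported later in this section.

```python
import math


def welch_t_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Unequal-variance t statistic with Welch-Satterthwaite df (a sketch)."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2          # squared SE of each mean
    se_diff = math.sqrt(v1 + v2)               # no pooling of the variances
    t = (mean1 - mean2) / se_diff
    # Approximate df: depends on the variances as well as the sample sizes
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df, se_diff


# Group 1 = city, group 2 = county
t_prime, df_prime, _ = welch_t_from_summary(92, 35.9, 19.90, 306, 48.8, 22.05)
print(f"t' = {t_prime:.2f}, df = {df_prime:.1f}")
```

Note that the resulting df is generally not a whole number, and it is smaller than $n_1 + n_2 - 2$.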
Deciding on the correct t-test
Which test should we use? We may not need to choose; if the two sample sizes are equal ($n_1 = n_2$) the two methods give the same t value (only the df can differ). It's even pretty close if the n's are slightly different. If one n is more than 1.5 times the other (in the SOL case, the group sizes are 306 and 92, so one is over 3 times as large as the other), you'll have to decide which t-test to use. Here are the steps:
- Decide whether the standard deviations are different.
- Use the equal variance t-test if they are the same, or
- Use the unequal variance t-test if they are different.
Or, you could decide not to decide: use the unequal variance t-test. It's more conservative.
Deciding: equal standard deviations?
There are three ways to make this decision.
- Inspect the two standard deviation estimates.
- Use the normal quantile plot.
- Test for equal standard deviations.
Inspect the two standard-deviation estimates
Refer to the Means and Std Deviations report in Figure 6. Look at the two standard deviations, in this case 22.1 and 19.9. Form the ratio of the largest to the smallest. If the ratio is larger than about 3, then the two SDs may be unequal (in our case, the ratio is about 1.1, so the SDs look similar). For a better answer to the question, see the normal quantile plot.
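The ratio check is simple arithmetic. A minimal sketch, using the two standard deviations quoted above (the helper name is made up for this example):

```python
def sd_ratio(sd_a, sd_b):
    """Ratio of the larger standard deviation to the smaller one."""
    return max(sd_a, sd_b) / min(sd_a, sd_b)


ratio = sd_ratio(22.1, 19.9)
print(f"{ratio:.2f}")  # 1.11
```

A ratio this close to 1 gives no reason to doubt equal variability; it is a rough screen only, not a formal test.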
Use the normal quantile plot
If the two standard deviations are equal then the slopes for the two lines in the normal quantile plot will be the same (the lines will be parallel). In the normal quantile plots for the SOL data, the lines have roughly the same slope. So, for the SOL data the assumption of equal variability seems safe. If the slopes of the two lines are in that gray area between clearly parallel and clearly not parallel, what do we do? There are four possibilities:
- Ignore the problem and be risky: use the equal variance t-test.
- Ignore the problem and be conservative: use the unequal variance t-test.
- Make a formal test of unequal variability in the two groups.
- Compare the means using nonparametric methods.
If the lines are not parallel, as in the figures below, assuming equal variability is unwarranted (Figures 7 and 8).
Figure 7. Normal Quantile Plots for US History 1998
Here, the data appear to be reasonably normal (this isn't the question) but the lines are not parallel: they start out close and end far apart.
Figure 8. Normal Quantile Plots for Writing 2000
Notice in Figure 8 that not only do the variances appear to be unequal but normality is also in question. We'll look at this case later.
Test for equal standard deviations
In the case of Figure 7, where we'll be using a t-test but we're not sure which one, JMP provides a way to test for equal variance in the main options menu (Figure 9).
Figure 9. Test for Equality of Variance Results for US History 1998
There are five variance tests. Which test do we use?
Choosing between the tests of equal variance
Of the five tests (O'Brien's, the Brown-Forsythe test, Levene's test, Bartlett's test, and the F-test), the last three are out of date and are not recommended. There's not much difference between O'Brien's and Brown-Forsythe. Brown-Forsythe is more robust (resistant to outlying observations), so we'll use this result.

What are we testing? The null hypothesis for these tests is that the variances are equal. So, if the Prob>F value for the Brown-Forsythe test is < 0.05, then you will reject the null hypothesis (universal decision rule) and conclude that the groups have unequal variances.

In lots of statistics textbooks, a test based on the ratio of the two variances is described. This F Test, 2-sided (above) is not a good test to use unless you have absolutely normal data. The test is extremely sensitive to the normality assumption, so unless your data are nearly perfectly normal, this test will not apply. You should not use this test and you should be skeptical if you read a report where someone else used it.

What if we reject equal variances, as we would do in this case (Figure 9)? The report also shows the result for the t-test to compare the two means, allowing the standard deviations to be unequal. This is the unequal variance t-test. Here is a written summary of the results using this method: The two groups were compared using an unequal variance t-test and found to be significantly different (t = 7.1, df = 217.4, p-value < 0.0001). School districts in cities had lower scores.

Notice the degrees of freedom: it isn't a whole number. That is because it is based on a weighted contribution of each sample (with unequal n's) to the standard error estimate. You can round the number off but ONLY to one decimal place; do not round to a whole number.
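To show what the Brown-Forsythe test is doing, here is a minimal sketch of its F statistic: each value is replaced by its absolute deviation from its own group's median, and a one-way ANOVA F is computed on those deviations. The two small groups are hypothetical, and the p-value (which JMP reports as Prob>F) would come from an F distribution with k - 1 and N - k degrees of freedom, which is not computed here.

```python
from statistics import mean, median


def brown_forsythe_f(*groups):
    """One-way ANOVA F statistic on absolute deviations from group medians."""
    # Replace each value with its distance from its own group's median
    devs = [[abs(y - median(g)) for y in g] for g in groups]
    k = len(devs)
    n_total = sum(len(d) for d in devs)
    grand_mean = mean(z for d in devs for z in d)
    # Between-group and within-group sums of squares of the deviations
    between = sum(len(d) * (mean(d) - grand_mean) ** 2 for d in devs)
    within = sum((z - mean(d)) ** 2 for d in devs for z in d)
    return ((n_total - k) * between) / ((k - 1) * within)


# Hypothetical data: group_b is twice as spread out as group_a
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
f_stat = brown_forsythe_f(group_a, group_b)
print(f"F = {f_stat:.2f}")  # F = 2.06
```

Using medians rather than means is what makes the test resistant to outlying observations. If SciPy is available, `scipy.stats.levene(group_a, group_b, center='median')` computes the same statistic along with its p-value.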
5. How could random variation affect that statistic?
What values of t will we see if the null hypothesis is true? It depends on what the hypothesis is.
- For $H_0: \mu_1 \le \mu_2$ we'll likely see negative t values.
- For $H_0: \mu_1 \ge \mu_2$ we'll likely see positive t values.
- For $H_0: \mu_1 = \mu_2$ we'll see t close to zero.
JMP will calculate p-values for us, either for the equal variance t-statistic or the unequal variance t-statistic.

What values of t will we see if the null hypothesis is not true? It depends on what the alternative hypothesis is.
- For $H_A: \mu_1 > \mu_2$ we'll likely see positive t values.
- For $H_A: \mu_1 < \mu_2$ we'll likely see negative t values.
- For $H_A: \mu_1 \ne \mu_2$ we'll see t values different from zero.
Recall the rough interpretation that t values larger than 2 in absolute value are likely not due to chance.
6. State a decision rule, using the statistic, to answer the question
The universal decision rule: Reject $H_0$ if p-value < $\alpha$.
Phase 3: Answer the Question

7. Calculate the statistic
There are three possible statistics that may be appropriate: an equal variance t-test, an unequal variance t-test, or the nonparametric Wilcoxon rank-sum test.
Equal variance
If the equal variance assumption is reasonable, then the standard t-test is appropriate. (Note: When reporting a t-test it's assumed that, unless you specify otherwise, it's the equal-variance t-test.) Figure 10 shows the means diamonds in the dot plot, the t-test report, and the means for a oneway ANOVA (Analysis Of VAriance) report. We'll cover oneway ANOVA later in the course. When there are only two groups, the t-test and ANOVA give identical results.
Figure 10. Results of Equal Variance t-test
Note that the n's and means in the report are the same as in Figure 6. However, the standard errors are different. As the note says, these standard errors use the pooled estimate of variance, whereas the standard errors in the Means and Std Deviations report in Figure 6 simply take the standard deviation within each group and divide by the square root of that group's n.
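The difference between the two kinds of standard error can be seen in a short calculation. This sketch uses the group sizes and standard deviations reported later in this section (city: n = 92, SD 19.90; county: n = 306, SD 22.05); the variable names are made up for the example.

```python
import math

n_city, sd_city = 92, 19.90
n_county, sd_county = 306, 22.05

# Separate standard errors: each group's own SD over the root of its own n
se_city = sd_city / math.sqrt(n_city)
se_county = sd_county / math.sqrt(n_county)

# Pooled standard errors: one shared SD (root of the pooled variance) for both
sp = math.sqrt(((n_city - 1) * sd_city**2 + (n_county - 1) * sd_county**2)
               / (n_city + n_county - 2))
se_city_pooled = sp / math.sqrt(n_city)
se_county_pooled = sp / math.sqrt(n_county)

print(f"separate: {se_city:.2f}, {se_county:.2f}")
print(f"pooled:   {se_city_pooled:.2f}, {se_county_pooled:.2f}")
```

Because the pooled SD lies between the two group SDs, pooling raises the standard error of the group with the smaller SD and lowers it for the group with the larger SD.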
In JMP
To compare the two means using an equal variance t-test in JMP:
- Choose Means/Anova/Pooled t from the main options menu. This adds the Oneway ANOVA report and means diamonds.
- The t-value, df, and p-value are shown in the t-Test report. However, only the two-tailed p-value is reported (under Prob>|t|). If the null hypothesis specified that we were testing for equality, then this is the p-value we want.
One-tail p-values
To obtain one-tail p-values from this report, refer to your alternative hypothesis. That is, what values of t will we see if the null hypothesis is not true?
- If it's true that $H_A: \mu_1 > \mu_2$, we'll likely see positive t values.
- If it's true that $H_A: \mu_1 < \mu_2$, we'll likely see negative t values.

First, which group did JMP use for $\bar{y}_1$ and which for $\bar{y}_2$? JMP uses the order of the X-variable:
- If the X-variable is character, JMP alphabetically sorts the values and whichever comes first is $\bar{y}_1$.
- If the X-variable is numeric, JMP uses the smallest value of the X-variable as $\bar{y}_1$.
In either case, the order is shown in the plot and in the Means for Oneway Anova report. The left-most group in the plot and the first value in the report represent $\bar{y}_1$.

- If your alternative was $H_A: \mu_1 > \mu_2$ and the t-test value is positive, then the one-tailed p-value is half the two-tail Prob>|t| in the report. If the t-test value was negative then you've observed a difference in the opposite direction from that expected. The p-value is one minus half the Prob>|t| in the report.
- If your alternative was $H_A: \mu_1 < \mu_2$ and the t-test value is negative, then the one-tailed p-value is half the two-tail Prob>|t| in the report. If the t-test value was positive then you've observed a difference in the opposite direction from that expected. The p-value is one minus half the Prob>|t| in the report.
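The conversion rules above can be written as a small function. This is only a sketch; the function name and the "greater"/"less" labels are made up for this example, where "greater" means $H_A: \mu_1 > \mu_2$ and "less" means $H_A: \mu_1 < \mu_2$.

```python
def one_tailed_p(two_tailed_p, t, alternative):
    """Convert a two-tailed Prob>|t| into a one-tailed p-value."""
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    half = two_tailed_p / 2
    # Did t fall in the direction the alternative hypothesis predicts?
    in_expected_direction = (t > 0) if alternative == "greater" else (t < 0)
    return half if in_expected_direction else 1 - half


print(one_tailed_p(0.04, t=2.1, alternative="greater"))   # 0.02
print(one_tailed_p(0.04, t=-2.1, alternative="greater"))  # 0.98
```

The second call shows why the direction matters: a t value in the opposite direction from the alternative gives a large one-tailed p-value, not a small one.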
Unequal variance
If the variances are not equal, or if we just want a more conservative test, then see the bottom portion of the Tests that the Variances are Equal report. The unequal-variances t-test is listed as the Welch Anova (see Figure 9). Report the t-value, df, and p-value, as in the equal variance case.

Nonparametric comparison of the medians
If normality isn't reasonable, then you can use a nonparametric test. You will use a test that compares the medians between the two groups. The nonparametric test is based solely on the ranks of the values of the Y-variable. In JMP, choose Nonparametric > Wilcoxon test. The Wilcoxon rank-sum test (also called the Mann-Whitney test) ranks all the Y-values (in both groups) and then compares the sum of the ranks in each group (the groups are specified by the X-variable). If the median of the first group is, in fact, equal to the median of the second group, then the two rank sums should be about equal when the sample sizes are equal (Figure 11).
Figure 11. Nonparametric Results
When reporting the results of a nonparametric test, it's usual to only report the p-value. In the above report, there are two p-values, one for the z-test and one using a chi-square value. The p-values will rarely be different. For large samples, report the p-value from the normal approximation. For small samples you should probably consult a statistician to help you obtain exact p-values.
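The ranking idea behind the Wilcoxon rank-sum test, and the normal approximation mentioned above, can be sketched as follows. This is a simplified illustration with hypothetical data: it uses midranks for ties but applies no tie or continuity correction, so its z and p-value can differ slightly from what JMP reports.

```python
import math
from statistics import NormalDist


def wilcoxon_rank_sum(group1, group2):
    """Rank sums plus a large-sample z and two-sided p-value (a sketch)."""
    pooled = sorted(group1 + group2)
    # Midranks: tied values share the average of the ranks they occupy
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    w1 = sum(rank_of[y] for y in group1)
    w2 = sum(rank_of[y] for y in group2)
    n1, n2 = len(group1), len(group2)
    # Mean and SD of the group-1 rank sum if the two groups do not differ
    expected = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w1 - expected) / sd
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w1, w2, z, p


# Hypothetical pass rates, for illustration only
w1, w2, z, p = wilcoxon_rank_sum([31.0, 36.0, 42.5], [44.0, 49.0, 55.5])
print(w1, w2, round(z, 2))  # 6.0 15.0 -1.96
```

With samples this small the normal approximation is not trustworthy, which is why the text recommends exact p-values in that case.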
8. Make a statistical decision Using all three tests, the two groups are different. All p-values are < 0.05.
9. State the substantive conclusion
The schools in cities have significantly lower mean pass rates (35.9% vs 48.8%) and significantly different median pass rates (36.0% vs 49.2%).
Phase 4: Communicate the Answer to the Question

10. Document our understanding with text, tables, or figures
The year 2000 Alabama SOL pass-rates in 10th grade Algebra I were divided into two groups according to whether the school was a city or county high school. There were n = 92 schools within city school districts and n = 306 in county school districts. The observed average pass rate within city schools was 35.9% (SD = 19.9) and the average pass rate outside of cities was 48.8% (SD = 22.1). Using a two-tailed t-test, we conclude that the observed means are significantly different (t = 5.0, df = 396, p-value < 0.0001). From this we conclude that city schools have a significantly lower pass rate compared to county schools. The 95% confidence interval about the mean difference is between 7.8% and 17.9%.
Less text, replaced by information in a table
Alternatively, it may be more straightforward to include many of the numbers in a table and state your results in text. So instead of the above paragraph, you could do this:

The summary results for the year 2000 Alabama SOL pass rate percentages in 10th grade Algebra I are shown in Table 1. Schools were classified as city schools if their district name contained "City" and were otherwise classified as county schools. From these results we conclude that schools in cities had a pass rate that was significantly lower than the pass rate in county schools. The 95% confidence interval about the mean difference is between 7.8% and 17.9%.

Table 1. 2000 Alabama SOL Pass-Rates in 10th Grade Algebra I For Schools in Cities and in Counties

Location      Number of Schools   Pass Rate (SD)   SE     95% CI
City          92                  35.9 (19.90)     2.23   31.5 to 40.4
County        306                 48.8 (22.05)     1.62   46.4 to 51.2
Difference                        12.9*            2.57   7.8 to 17.9

*t = 5.0, df = 396, p-value < 0.0001
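The difference row of Table 1 can be approximately reproduced from the group sizes, means and SDs alone. The sketch below assumes the equal-variance (pooled) t-test and, to stay within the Python standard library, uses a normal critical value in place of the t critical value for the confidence interval (with df = 396 the two are nearly identical). The function name `pooled_t_from_summary` is made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def pooled_t_from_summary(n1, mean1, sd1, n2, mean2, sd2, conf=0.95):
    """Equal-variance two-sample t statistic and approximate CI from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se_diff = sqrt(pooled_var * (1 / n1 + 1 / n2))  # SE of the mean difference
    diff = mean2 - mean1
    t = diff / se_diff
    # Assumption: normal quantile used instead of the t quantile (fine for large df).
    crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return diff, se_diff, t, df, (diff - crit * se_diff, diff + crit * se_diff)


# City: n = 92, mean 35.9, SD 19.90.  County: n = 306, mean 48.8, SD 22.05.
diff, se_diff, t, df, (low, high) = pooled_t_from_summary(92, 35.9, 19.90, 306, 48.8, 22.05)
print(f"difference = {diff:.1f}, SE = {se_diff:.2f}, t = {t:.1f}, df = {df}")
print(f"approximate 95% CI: {low:.1f} to {high:.1f}")
```

The results should land close to the tabled values (SE 2.57, t = 5.0, CI 7.8 to 17.9); small discrepancies come from rounding in the summary statistics and from the normal critical value.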
Less text, replaced by information in a figure
As another alternative, it may be more informative to describe the results in a figure. Instead of the above, you could do this:

The summary results for the year 2000 Alabama SOL pass rate percentages in 10th grade Algebra I are shown in Figure 12. Schools were divided into city and county high schools. From these results we conclude that schools in cities had a pass rate that was significantly lower than the pass rate in county schools (t = 5.0, df = 396, p-value < 0.0001). The 95% confidence interval about the mean difference is between 7.8% and 17.9%.
[Figure 12 shows two panels on a 0 to 100 pass-rate scale. County High Schools: mean = 48.8, SD = 22.05. City High Schools: mean = 35.9, SD = 19.90.]
Figure 12. 2000 Alabama SOL Pass-Rates in 10th Grade Algebra I For Schools in Cities and in Counties
Summary: Two Independent Means
Briefly, here is how to proceed when comparing the means obtained from two independent samples.
1) Describe the two groups and the values in each group. What summary statistics are appropriate? Are there missing values? (Why?)
2) Assess the normality assumption. If normality is not warranted, then do a nonparametric test to compare the medians.
3) If normality is warranted, then assess the equal variance assumption.
4) Report confidence intervals on each of the means if normality is reasonable.
5) Perform the appropriate statistical test: equal variance t-test, unequal variance t-test, or the Wilcoxon rank-sum test.
6) Determine the p-value that corresponds to your hypothesis. Reject or fail to reject?
7) State your substantive conclusion.

Additional note: Say you conclude that the groups have different means. How do you describe what the different means are? If you've followed the above recipe, you are in one of three situations:
1) Normality is reasonable and the variances are equal: use the equal variance t-test. The write-up reads "the means are significantly different (t = x.xx, df = xxx, p-value = 0.xxxx)." Also give a table of means, SEs and 95% CIs just like the Means for Oneway Anova report.
2) The variances are unequal: use the unequal-variance t-test. The write-up reads "the means are significantly different (unequal variance t = x.xx, df = xxx, p-value = 0.xxxx)." Also give a table of means, SEs and 95% CIs just like the Means and Std. Deviations report. Note: the means are the same. The SEs and CIs are different.
3) Normality is unreasonable: use Wilcoxon's test. The write-up reads "the medians are significantly different (by Wilcoxon's rank-sum test, p-value = 0.xxxx)." Also give a table of medians and IQRs. There is a way to put 95% CIs on these estimates but not using any easily available software.

Always report a measure of the center and spread. For all three tests, report the p-value and make a decision based upon your hypothesis.
Section 18. Power and Sample Size
Up to this point, we have analyzed data after it has been collected. We have discussed the significance level, alpha: the proportion of the time we're willing to be wrong. That is, the proportion of the time we're willing to choose the alternative when the null is true. But what about the other side of that? What about the times we fail to reject the null hypothesis when the alternative is true? If there is a real difference, we would like to be sure we see it, or as sure as we can be. How do we do that?

We know that the larger the sample size, the smaller the standard error, and (for a given observed difference) the larger the test statistic. Recall that larger test statistics tend toward the alternative. How large is large enough? What size do we need to be able to detect a difference if it exists?

The most recent comparison we have looked at is the comparison of two means. What we have done so far is assess the data for assumptions after the data has been collected. We make sure that the sample is large enough to use large-sample (asymptotic) tests. Wouldn't it be easier if we knew we had collected a large enough sample, that is, if we made the sample size determination before we collected the data? But again, how large is large enough?

As a researcher, you want to have enough subjects (units) to be able to make a good decision, but you don't want to be wasteful. That is, you could collect thousands of units (subjects), but that can be very costly. Not only that, but past a certain point you don't gain enough to warrant using an extra unit.

The single most common question to a statistician is, "How many subjects do I need?" The answer is, "It depends." It depends upon what kind of difference you want to be able to detect, what kind of difference you can afford to detect, and how comfortable you are with failing to reject the null hypothesis when the alternative is true. An understanding of the topics in this section will move us towards a more useful answer.

Power and Sample Size
When doing research, we take risks. We make some guesses and see where it leads.
Copyright Stacey S. Cofield, 04 November 2004. All rights reserved. AMB denotes material reprinted courtesy of Al Best, Virginia Commonwealth University
But we do want to avoid making too many mistakes. There are two sorts of mistakes you could make when drawing a conclusion from the results of a study. Let's review these.

Significance Level
Recall that the significance level, α, is the probability of rejecting a true null hypothesis. That is, it's the probability of making one sort of mistake: The null hypothesis is true (there really is no difference) but we (incorrectly) conclude that the null hypothesis is not true. We concluded "significant difference" but we were wrong. What happened was this: The sample t-statistic was large enough to return a p-value that is small enough to reject (by our universal decision rule, we reject when p-value < 0.05). Remember, though, we know that 5% of the time, simply due to random variation, we'll observe large statistics even when the null hypothesis is true. This kind of error is called a type I error.

Type I Error
Rejecting a true null hypothesis. (The null hypothesis is true but the p-value < α.)

So far, we've only been concerned with minimizing the number of times we'll make this sort of mistake. By convention, we've set up our decision making process so that we'll make this sort of error no more than 5% of the time. When we go through the process of statistical decision making we choose to either reject or fail to reject the null hypothesis. Why don't we say that we accept the null hypothesis? If the null hypothesis is "no difference," then why don't we conclude that we accept the hypothesis of no difference? Two reasons: Because you can never prove the null hypothesis. And because it's shockingly easy to do an experiment where the p-value is much, much greater than 0.05 even though there really is a difference.

Real Differences
Usually, the reason an experiment is performed is that the researcher hopes to find a difference. But there are many ways to run a poor experiment so that we fail to uncover
an important difference. This is the other sort of error we want to guard against: failing to reject the null hypothesis when the alternative hypothesis is true.

Type II Error
Not rejecting a false null hypothesis. (The alternative hypothesis is true but the p-value ≥ α.)

If there really is a difference (the null hypothesis really is not true) and we conclude that there is no difference (we do not reject the null hypothesis), then we've made a Type II error. The probability of making this sort of error is β (say beta). How often we make this mistake depends on the magnitude of the difference and the design of the study.

Overview
There are only two options: there is a difference or there isn't. And we can either conclude there is a difference or conclude that there isn't. Thus there are four different possibilities, as shown in Table 1.

Table 1. Types of Statistical Errors

                                    Truth
Conclusion                          Null            Alternative
Null (fail to reject, p ≥ α)        correct         Type II Error
Alternative (reject, p < α)         Type I Error    correct
Clearly, we prefer to make the correct conclusion but there are no guarantees. If there is, in fact, no difference then we'll correctly fail to reject the null hypothesis 95% of the time. If there is, in fact, a difference then we want to maximize our chance of correctly rejecting the null hypothesis.

Power
Remember that the chance of making the wrong decision when the alternative hypothesis is in fact true is β. The probability of not making this sort of error is 1 − β. This is the power of a statistical test.
Power: correctly rejecting a false null hypothesis. (The alternative hypothesis is true and the p-value < α.)

We want to act in such a way that if there really is a difference (the null hypothesis really is not true) then we are able to conclude that there is a difference (we reject the null hypothesis). We want to maximize power.

Controlling power
Power, the likelihood of discovering real differences, is controlled by four things: the magnitude of the difference, the sample size of the study, the amount of uncontrolled error in the study, and the Type I error rate we are willing to accept.
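The two error rates can be seen directly by simulating many studies. The sketch below is illustrative only: it assumes normally distributed data with a known standard deviation (so a z-test is used rather than a t-test), and the group size, standard deviation and true difference are made-up values. The function name `rejection_rate` is also invented for this example.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def rejection_rate(true_diff, n_per_group, sigma, alpha=0.05, n_sims=2000, seed=1):
    """Fraction of simulated studies whose two-sided z-test rejects the null."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma * sqrt(2 / n_per_group)  # standard error of the mean difference
    rejections = 0
    for _ in range(n_sims):
        group1 = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        group2 = [rng.gauss(true_diff, sigma) for _ in range(n_per_group)]
        z = (fmean(group2) - fmean(group1)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims


# Null true: every rejection is a Type I error, so the rate should be near alpha.
type1_rate = rejection_rate(true_diff=0.0, n_per_group=36, sigma=6.0)
# Alternative true: the rejection rate estimates power, and 1 - power estimates beta.
power = rejection_rate(true_diff=4.0, n_per_group=36, sigma=6.0)
print(f"Type I error rate: about {type1_rate:.3f}")
print(f"Power: about {power:.3f}, so beta is about {1 - power:.3f}")
```

Because the estimates come from random draws, they vary a little with the seed and the number of simulated studies.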
Simplified situation
Say we're looking for a mean difference between two independent groups. To make it easy, let's assume that we know the distribution is normal, we know the variance within each group is the same, and we design the study so that the sample size within each group is the same. Then our test statistic would be:

$$ z = \frac{\bar{x}_2 - \bar{x}_1}{2\sigma / \sqrt{n}} $$

where $\bar{x}_1$ and $\bar{x}_2$ are the sample means of the two groups (estimating μ1 and μ2), σ is the within-group standard deviation, and n is the total sample size (n/2 in each group).

Power, the likelihood of discovering real differences, is controlled by:
- The magnitude of the difference: μ2 − μ1 = d = δ (say delta). To maximize power, have large differences.
- The sample size of the study, n. To maximize power, do big studies.
- The amount of uncontrolled error in the study, σ. To maximize power, minimize variability.
- The Type I error rate we are willing to accept, α. To maximize power, make α large.

Controlling any of these will control power. However, we don't have too much leeway with the significance level; usually we accept the convention and use α = 0.05. What about the other three factors?
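As a rough sketch of how these four ingredients combine, power can be approximated in closed form. The expression below is an approximation under stated assumptions, not a formula from these notes: it takes σ as known, uses a two-sided test, splits a total sample size n equally between the groups, and ignores the negligible chance of rejecting in the wrong direction. The symbols Φ (the standard normal cumulative distribution function) and z_{1−α/2} (its upper critical value, 1.96 for α = 0.05) are introduced here for the illustration.

$$ \text{Power} = 1 - \beta \approx \Phi\!\left( \frac{|\mu_2 - \mu_1|}{2\sigma / \sqrt{n}} - z_{1-\alpha/2} \right) $$

Read this way, a larger difference, a larger n, or a smaller σ increases the first term inside Φ, while a larger α shrinks the critical value being subtracted. Each change raises power.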
The Effect Size, Delta
The magnitude of the difference is the effect size. JMP calls this delta. There are different definitions of effect size. Some authors use a standardized value like

$$ \frac{\mu_2 - \mu_1}{\sigma}, $$

but it's better to separate considerations of the magnitude of the difference from consideration of the amount of uncontrolled variability. The JMP definition of effect size, called delta, in this situation is

$$ \frac{\mu_2 - \mu_1}{2}. $$

That is, half the difference is delta.
Power as a function of Delta
Let's take two groups, n = 1079 and s = 15.5, with a difference of about 10, so delta would be 5. In Figure 1 we see the power of the t-Test for deltas up to 2.

Figure 1. Power of the t-Test (s = 15.5, n = 1079, α = .05)
The power starts out really low (at 5%) and gets higher as the difference gets larger. This makes intuitive sense: it should be easier to reject a null hypothesis if the true difference between the two groups is really large. So how much power do we need?

50% Power
One way to think about this question is that we need power to be at least 50%. We want
our chance of rejecting correctly to be at least an even chance. Here is an example of an experimental design with 50% power:
1) State your null and alternative hypotheses.
2) Collect the following data: Flip a coin.
3) If the coin is heads then reject the null hypothesis.
That's it. If you can't design a study better than that, then you'll save a lot of money by just doing the above study.

Desired Power
So, what values of power do we want? Usually, studies aim for power of about 75 to 85%, with 80% as the most common goal. Studies with 95+% power are usually too resource intensive. But there is no rule; you may want to go lower for a pilot study and higher for an important confirmatory study.

In Figure 1 we can see that 80% power is achieved at a point somewhere near delta = 1.25 (difference = 2.5). Actually, a 2.6 difference is detected with 78.6% power and a 2.8 difference is detected with 84.2% power.

How large a difference is important?
Note also in Figure 1 that even a difference of four (delta = 2) between the groups is detectable with >99% power (again, assuming α = .05, the pooled within group standard deviation, s = 15.5, and a total sample size of n = 1079). Recall an important principle: statistically significant is not the same thing as important or useful or clinically significant or even interesting. Statistically significant just means that a difference as large as the one we observed would be unlikely to arise by chance if the true difference were zero. So, how much of a difference between groups is important? Would a 0.5 difference be interesting? Would a 10 difference be interesting? Somewhere between small and large, there is a point where differences go from not interesting to interesting. Clearly, where this point lies is a judgment call.
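A power curve like the one in Figure 1 can be approximated without JMP. The sketch below is an assumption-laden stand-in for JMP's calculation: it replaces the t distribution with the standard normal (reasonable for n = 1079), treats n as the total sample size split equally between groups, and uses delta as half the mean difference. The function name `approx_power` is made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used in place of the t distribution


def approx_power(delta, sigma, n_total, alpha=0.05):
    """Approximate two-sided power; delta is half the mean difference."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n_total) / sigma  # expected size of the test statistic
    # Probability of landing beyond either critical value.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


for difference in [0.0, 1.0, 2.0, 2.5, 2.6, 2.8, 4.0]:
    power = approx_power(difference / 2, sigma=15.5, n_total=1079)
    print(f"difference = {difference:4.1f}  delta = {difference / 2:4.2f}  power = {power:6.1%}")
```

At a difference of zero the approximation returns the significance level, 5%, and for differences of 2.6 and 2.8 it should land close to the 78.6% and 84.2% quoted above.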
The way to think about this question is as follows: We want to design studies to show differences that are interesting (or important or useful, choose whatever word you want). So, we have to decide: What is the smallest difference that would be interesting? Once this is answered, the next step in designing studies is to ask the question, "How many subjects do I need?"

Power as a function of sample size
So, let's again assume α = 0.05 and s = 15.5, but this time let's fix delta = 2.5 (difference = 5) and vary the sample size. In Figure 2 we see how power increases with larger studies.
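The setup just described (α = 0.05, s = 15.5, delta = 2.5) can be explored numerically by stepping through sample sizes until a target power is reached. This is a sketch under the same assumptions as any normal approximation: the t distribution is replaced by the standard normal and n is the total sample size with equal groups. The helper names `approx_power` and `smallest_n_for_power` are invented for this example.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(delta, sigma, n_total, alpha=0.05):
    """Normal-approximation power; delta is half the mean difference."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n_total) / sigma
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def smallest_n_for_power(target, delta, sigma, alpha=0.05, step=2, n_max=100_000):
    """Smallest total n (in multiples of step) whose approximate power reaches target."""
    n = step
    while n <= n_max:
        if approx_power(delta, sigma, n, alpha) >= target:
            return n
        n += step
    raise ValueError("target power not reached within n_max")


for n in (50, 100, 150, 200, 250, 310, 400):
    print(f"n = {n:3d}  power = {approx_power(2.5, 15.5, n):.1%}")

print("n for 50% power:", smallest_n_for_power(0.50, 2.5, 15.5))
print("n for 80% power:", smallest_n_for_power(0.80, 2.5, 15.5))
```

The normal approximation is slightly optimistic compared with a t-based calculation, so the sample sizes it returns can be a little smaller than values read from a JMP power plot.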
Figure 2. Power of the t-Test (s = 15.5, delta = 2.5, α = 0.05)
We reach 50% power after sampling approximately n = 150 units. Over 80% power is reached when n = 310 units. So, if we decide that a difference of 5 between groups is important then a sample of 310 units would be sufficient to test that hypothesis (at α = 5%, with power = 80%, assuming the standard deviation within groups remains 15.5).

Power as a function of error
The other ingredient in power is the amount of uncontrolled error in the study, σ. To maximize power, minimize measurement variability. Anything that reduces the estimate for s will increase power. For example, say the next time you run the experiment, it is
more accurate (or less accurate). Then, assuming everything else stays the same, the power of the t-test will go up (or down). How much does the estimate of s affect power? For example, say we stay with the idea of being interested in differences of 5. From the previous discussion we decided that about 310 units would be a large enough study for 80% power (at the usual α level). What if, instead of the estimated standard deviation being 15.5, s varies between 5 and 25? See how the power varies with s in Figure 3.
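The dependence of power on s can also be sketched numerically, including the largest standard deviation at which a given power is still reached. As before this is a normal approximation, not JMP's t-based calculation; n is the total sample size, delta is half the difference, and the function names are made up for the illustration. The closed form in `max_sigma_for_power` additionally ignores the tiny chance of rejecting in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(delta, sigma, n_total, alpha=0.05):
    """Normal-approximation power; delta is half the mean difference."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n_total) / sigma
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def max_sigma_for_power(target, delta, n_total, alpha=0.05):
    """Largest sigma for which the approximate power is still at least target."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(target)
    return delta * sqrt(n_total) / (z_crit + z_power)


for s in (5.0, 10.0, 15.5, 17.75, 20.0, 25.0):
    print(f"s = {s:5.2f}  power = {approx_power(2.5, s, 310):.1%}")

print(f"power stays at or above 70% while s <= {max_sigma_for_power(0.70, 2.5, 310):.2f}")
```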
Figure 3. Power of the t-Test (n = 310, delta = 2.5, α = .05)
If the estimated standard deviation within groups decreased from 15.5 to 10.0, then the power would increase from 80.8% to 99.2%. This might be too much to hope for because this would be a huge increase in the reliability of the experiment. Perhaps it's better to be pessimistic. Think about the largest variability you might see. If the variability were to increase overall, then power would go down. In this case, power does not go below 70% until the standard deviation increases to 17.75.

Weighing all the considerations
It's a judgment call, of course. When designing a study we have to weigh all the considerations:
- How big a difference do we expect?
- How large a study, n, can we afford?
- How large will the standard deviation, s, be?
- How much risk, in terms of alpha protection, are we willing to tolerate?
It is a balancing act. There is no rule. You may want to give up a little power in favor of a smaller, more realistic sample size. You may want to increase your alpha to allow yourself the ability to reject more often in a pilot study. Let's move on to see how to actually do power analysis with JMP IN.

Doing Power Analysis with JMP IN
As with many statistical software packages, power analysis is built in to JMP. Not only can you look at power and sample size before running an experiment (in the design stage), but you can also assess power after an analysis has been concluded. Most of the statistical tests have a power option. Power analysis is thought to be important in two situations.
1) If a statistical test fails to find a difference it is important to know if it had sufficient power to detect one. If the test had low power, then perhaps the power can be improved by increasing n or modifying the design to decrease noise.
2) The other application of power analysis is in planning experiments to determine the number of subjects required to detect a particular difference. Using existing data or estimates of the magnitude of the difference and the variation, it is possible to calculate the number of subjects required for a stated level of power.

Interpreting a non-significant result (AMB)
Let's consider a new example where the results of a study indicate a non-significant difference.

Two diets: In a clinical trial of a calorie-substitute drink, obese subjects were randomly assigned to either the drink or a control intervention. After 12 weeks, the amount of weight loss was recorded (Figure 4).
Figure 4. Clinical Trial Results
The average subject in the calorie-substitute arm lost 10.6 pounds and the average control subject lost 8.1 pounds. This was not a significant difference (t = 1.7, df = 70, p-value = 0.0834). When faced with the results of a study that does not show a significant
difference, it may be important to ask, "How large a difference could have been found with this number of subjects?" That is, for n = 72, s = 6 and α = 0.05, what level of power did this study have to show a weight loss of, say, from 2 to 10 pounds? The way to answer these sorts of questions is through power analysis. This is done in JMP by first performing the analysis of interest and then by working through the Power Details dialog. Next to the Oneway Analysis title is an arrow popup. When you click on it, the Power option appears. Choose Power. A dialog appears with the factors that can be considered when doing a power analysis:

Power Details Dialog
Recall that the factors to consider in a power analysis are as follows:
- How much risk, in terms of alpha protection, do we want to tolerate? The JMP
default is Alpha = 0.05.
- How large will the standard deviation, s, be? The JMP default is the pooled standard deviation estimate from the equal variance t-test; here it's called Sigma = 5.97, but it's really the pooled estimate, s_pooled.
- How big a difference do we expect? If the sample size in each group is the same then delta is half the difference; here the difference in the means is 2.47 and so Delta = 1.24.
- How large a study, n, can we afford? The total sample size (not the sample size in each group) is shown by Number = 72.

Filling in the dialog
The question marks allow you to fill in different values for these factors. What you fill in depends upon the question. One possibility is to change nothing and:
1) Check Solve for Power.
2) Press the Done button.
Retrospective or Observed power: This is thought to answer the question, "How much power does this study have?" That isn't exactly what it tells you. What it answers is this: if you did another study where the true difference really is this large and the standard deviation of the population is this large, what is the probability of being able to declare the difference statistically significant?

Often, this is [incorrectly] interpreted as "the power of this study is 41%." That is, if we expect to find a difference in weight between the two groups of about 2.47 pounds, and the standard deviation within each group is about 6 pounds, and we do a two-sided t-test with α = 0.05,
then we have a 41% chance of being able to reject the null-hypothesis of no difference.
If that was our intent when the study was designed, then this was not a well-designed study. What good is power if you don't use it? All this tells you is that if a study were designed with these parameters, you would have very little chance to detect a difference, even if it does exist. It is unlikely that the study was designed with this power in mind. If this is the true difference, then you'll need to change the sample size or the alpha level to increase your power to detect this difference.

Different deltas
Now back to our original question. We wanted to know: for n = 72, s = 6, and α = 0.05, what level of power did this study have to show a weight loss of, say, from 2 to 10 pounds? The question referred to differences thought to be biologically or scientifically important, not to the (randomly varying) observed differences and their statistical significance. Differences in the means of between 2 and 10 pounds translate to deltas of between 1 and 5 pounds. So, fill in the range of values for delta.
And then a larger Power report will appear, as shown in Figure 5:
Figure 5. Power of the t-Test (α = 0.05, s = 6, delta = 1 to 5 by 0.25, n = 72)
- small difference scenario: The report shows that doing a study with n = 72 subjects has 55.6% power to detect a delta of 1.5 (which is a difference of 3 pounds). So, if that is the magnitude of the difference that is important then this study is clearly underpowered. That is, it was likely to show no difference from the start.
- medium difference scenario: However, if we were interested in differences on the order of 4 pounds (delta = 2), then n = 72 subjects is sufficient for 80% power. So, if that is the magnitude of the difference that is important then this study had sufficient power to detect it. That is, before the study had begun we would have judged that it had a good chance (80%) of being able to declare a 4-pound difference as significantly different.
- big difference scenario: Finally, if only a 10-pound difference would be important (delta = 5), then this study would have surely found it (power > 99%) with n = 72 subjects.
Interpretation of a non-significant result
Recall what we did observe in this study. We were unable to declare a statistically significant difference, since the p-value > 0.05 (Figure 6).

Figure 6. t-Test results
- small difference scenario: Diets are notoriously difficult. So, we only expected to see small differences on the order of 3 pounds. We ran the study with only 72 subjects, even though the chance of success was small (55.6%). But we weren't as lucky as we'd hoped and so, in retrospect, we're not surprised that we didn't find a significant difference. Was it a prudent decision to run such a small study when the chances for success were low?
- medium difference scenario: We only expected to see differences on the order of 4 pounds. We ran the study with 72 subjects because the chance of success was good (80%). So we were hoping to be able to claim our diet was better. But it didn't turn out as we'd hoped (we only observed a 2.5 pound difference). The p-value is awfully close to significant, and we wonder if a journal editor would accept the paper with α = 0.10? Or should we declare "marginal significance" and plan to do a study with more subjects to further investigate the relationship of the drug to weight loss?
- large difference scenario: Nobody cares about a product that can't do 10 pounds better than the control diet (just eat sensibly and reduce calories). Running a study with 72 subjects would have found a difference this large (if it was there). But, the
difference is not this large. Give up the idea of being able to make a statistically defendable claim and give the product to the advertising and marketing guys to see if they can sell it anyway. (Hopefully, the clinic that ran the study signed a contract with a "no disclosure without approval" clause so our company can make sure this study never sees the light of day.)

So what we make of a non-significant difference in a study depends on how big a difference we judged to be important and whether the study had sufficient power to detect that difference.

Planning a study to find a difference
Another point of view is that of the pilot study. That is, from this existing data we estimate the magnitude of the difference and the variation. We use this to calculate the number of subjects required for a stated level of power for designing the next study. Say, from this pilot study, we expect to find a difference in weight between the two groups of about 2.47 pounds, and the standard deviation within each group is about 6 pounds, and we plan to do a two-sided t-test with α = 0.05. How large a study do we need for 80% power? It turns out that we can't directly answer this question with JMP. We can answer a similar question though: "With various sample sizes, what will our power be?" That is, we do a power analysis that fixes delta, sigma, and alpha but varies n.
Then the Power table will show the power for various sample sizes, as in Figure 7.
Figure 7. Power of the t-Test (α = .05, s = 6, delta = 1.24, n = 72 to 240 by 12)
We see that 80% power is achieved somewhere between 180 and 192 subjects. If we wanted to determine a more exact sample size estimate, then we could redo the power analysis with smaller increments between these two values. The popup next to the Power button in the above report will show a plot of power versus sample size, as in Figure 8:

Figure 8. Power versus Sample Size
One very common response to a power analysis such as this is, "I can't afford 190 subjects!" So, the conversation continues. When designing a study we have to weigh all the considerations:
- How big a difference do we expect?
- How large a study, n, can we afford?
- How large will the standard deviation, s, be?
- How much risk, in terms of alpha protection, are we willing to tolerate?

If a power analysis yields a too-large sample size, or when the sample size is fixed (by a consideration of the resources available), then we're left with four options:
- Increase differences.
- Lower variability.
- Accept lower power.
- Give up.
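The first three options can be put into numbers for a fixed budget. The sketch below uses the diet-study values (n = 72 in total, s = 6, expected difference 2.47 pounds) and a normal approximation in place of the t-based calculation, so the figures are approximate. The three function names are invented for this illustration, and the two closed-form helpers ignore the negligible chance of rejecting in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def min_detectable_difference(n_total, sigma, power=0.80, alpha=0.05):
    """Smallest mean difference detectable with the given power (increase differences)."""
    z_sum = STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)
    return 2 * sigma * z_sum / sqrt(n_total)


def required_sigma(difference, n_total, power=0.80, alpha=0.05):
    """Within-group SD needed to reach the given power (lower variability)."""
    z_sum = STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)
    return difference * sqrt(n_total) / (2 * z_sum)


def achieved_power(difference, sigma, n_total, alpha=0.05):
    """Approximate power actually available (accept lower power)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = difference * sqrt(n_total) / (2 * sigma)
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


n, s, diff = 72, 6.0, 2.47
print(f"Detectable difference at 80% power: {min_detectable_difference(n, s):.2f} pounds")
print(f"SD needed for 80% power at a {diff}-pound difference: {required_sigma(diff, n):.2f}")
print(f"Power if nothing changes: {achieved_power(diff, s, n):.1%}")
```

Seen this way, a 72-subject study is adequate only if differences near 4 pounds are the target, or if the within-group standard deviation can be cut substantially.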
Additional comments
There are two other options available in JMP that we should discuss because you'll be tempted to use them. Two of the check-boxes are for least significant number and least significant value.

Least Significant Number
How many more observations would make the reported difference become significant? Significance is driven by many things, but if the only thing that changed was the sample size (the square root of n in the standard error) then eventually the t-statistic would pass the critical value and the p-value would become significant. That's what the LSN is.

The LSN for the Diet Study
So, we need an n at least as large as 93 (always round up). However, there are two strong warnings about this line of thinking.

First, it is unethical to keep recruiting subjects until the results are significant. All of our statistical thinking assumes this: Run the study as designed. When it's all over then you get to look at the data and run the statistical test. If the p-value is less than 5%, then you can declare the results as statistically significant. No fair peeking at the data until you're done. If you peek then this introduces all sorts of bias and renders the study useless. Really: useless. Someone
has just spent all sorts of time and money and resources and the result is no information.

Second, as we've seen, what is the power of this study with n = 92 subjects? See Figure 7. The probability of finding a significant difference with LSN observations is as low as 50%.

Least Significant Value
How small a difference could the significance test in this study detect? Say we fix the sample size and sigma and alpha, how small a difference could this study detect? As the difference gets larger and larger, eventually the t-value will pass the critical level and the p-value will be significant. That's what the Least Significant Value is.

The LSV for the Diet Study
So a delta of 2.8 would be significant; this translates to a weight difference of 5.6 pounds. This is sometimes called the sensitivity of the test. That is, we would not be able to detect differences less than 5.6 pounds with this study. In this study a way to interpret the LSV is as follows: If there is a real difference in weights, it is 95% likely to be less than 5.6 pounds. In general, we're 95% confident that the true difference is less than the LSV. Thus, in an inconclusive study, the LSV can be taken to place an upper bound on the difference. If it is judged important to pursue differences as small as this, then an additional study can be designed.

JMP Power Calculator
What we've done so far assumes that you have pilot data but that's not always the case. Suppose you have two samples, their n's, means and the standard deviation (or standard error) within each group:
The steps for calculating sample size are as follows.
From the menu bar, choose DOE > Sample Size and Power.
Press the Two Sample Means button if you prefer working from the difference between the two means, or press the k Sample Means button if you prefer entering the two means. I prefer the latter.
Calculate (or approximate) the pooled standard deviation and enter it for Error Std Dev.
Enter the two Means.
Enter the desired Power. Our window should look like that in Figure 9.
Figure 9. Sample Size Calculation
Press the Continue button.
The result for Sample Size that JMP displayed (rounded UP to whole numbers): 180. In other words, 90 per group. In real life we'd want to vary the standard deviation and the size of the difference to be more comfortable that, if either changed, we'd still have acceptable power. If we wanted to write this up, we'd say something like:
If we expect to find a difference in weight between the two groups of about 2.52 pounds, with a standard deviation within each group of 6 pounds, for a two-sided t-test with α = 0.05, then we have an 80% chance of being able to reject the null hypothesis of no difference if we have a total sample size of n = 180 subjects.

Sample Size Calculators
As we've seen, JMP calculates power for a given alpha, effect size, sample size, and standard deviation. Or, it will calculate sample size. There are other (good) ways to do this:
Talk to a statistician.
Use a free sample size calculator like the ones from the statistics department at UCLA. See http://calculators.stat.ucla.edu/powercalc/ for a list.
When using any sample size calculator, pay close attention to:
The definition of effect size. Differences or standardized differences or something else?
The n's given. Is it the total n for the whole experiment or the n for each cell?
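As a rough cross-check on the 90-per-group figure in the write-up above, here is a minimal sketch of the textbook normal-approximation formula for a two-sided, two-sample comparison of means. It is not JMP's algorithm (JMP works from the t distribution), so it can come out a subject or so lower; the function name n_per_group is made up for this illustration.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate subjects needed per group (normal approximation, not JMP's method)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    n = 2 * ((z_alpha + z_power) * sigma / delta) ** 2
    return math.ceil(n)                            # always round up

# Values from the write-up: difference of 2.52 pounds, SD of 6 pounds
n = n_per_group(delta=2.52, sigma=6.0)
print(n, "per group,", 2 * n, "total")
```

The approximation lands just under JMP's answer, which is the usual pattern: the t-based calculation asks for slightly more subjects than the normal shortcut.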
Final Thoughts
Power and sample size are incredibly important to a successful and believable experiment. If you do not consider power and sample size prior to conducting your experiment, you may not be able to give any useful results. You may get lucky, but this is not the way to go about statistics. Your best offense is a good defense. That is, plan well and, no matter your results, you'll be able to say something believable and useful. When in doubt, contact a statistician. For a straightforward analysis plan, it is often less than an hour's worth of work to get an idea of what you'll need in terms of sample size to detect a meaningful difference with an acceptable level of power at a reasonable significance level.
Section 19. Correlation and Regression
Review and Overview
Previously, we have looked at the relationships between two or more categorical variables (dichotomous or multi-level), and the relationship between a categorical and a continuous variable. Next we'll begin the discussion of relationships between two continuous variables. We'll start by looking at scatterplots and then we'll turn to the case where the relationship between the two variables follows a straight line (a formula or equation). There are three ways to think about the form of the relationship between two continuous variables: by using a correlation coefficient as a descriptive statistic, by looking at the significance of the slope, and by comparing models.
These three points of view turn out to be a general way to look at the relationship between variables. In later discussions we'll deal with other complexities that may arise (e.g., multiple variables). But we will begin with the simplest case: correlation and straight-line regression.
Scatterplots We use scatterplots to show the relationship between two continuous variables measured on the same units (individuals, cells, machines, etc.). The values of one variable appear on the horizontal axis (the X variable in JMP) and the values of the other variable appear on the vertical axis (the Y variable in JMP). Each individual appears as a single point on the plot (as compared to being lumped into a group). To interpret a scatterplot look first at the overall pattern. The figure below shows no relationship (Figure 1).
Copyright Stacey S. Cofield, 16 November 2004. All rights reserved. AMB denotes material reprinted courtesy of Al Best, Virginia Commonwealth University
Figure 1. Scatterplot of X versus Y, with no relationship (r = 0.02)
Direction of the association The figures below illustrate a positive direction of the relationship (on the left) and a negative direction of the relationship (on the right) (Figures 2 and 3). The strength of the relationship is weak in the upper plots and strongest in the lowest plots. If there is a relationship, there are three things to look for: The direction of the relationship (as X increases, Y decreases, etc.) The strength of the relationship (how much does Y increase or decrease) The form of the relationship (is it linear?)
For now, don't worry about the r (correlation) values. In all of the above figures, the form of the relationship is linear.
Figure 2. Positive Relationship (r = 1.0)
Figure 3. Negative Relationship (r = -1.0)
Scaling and aspect ratio
Be careful when interpreting scatterplots because the scaling of the axes does matter. For instance, the figure below (Figure 4) shows exactly the same data as Figure 1 above. The only thing different is the scaling of the Y axis:
Figure 4. No Relationship with Larger Y Axis Even though the data is identical, the pattern in the above figure appears stronger because of the lengthening of the Y axis.
Scaling
If you change the scale of one axis, but not the other, you will get a different impression of the relationship. Figure 5 uses the negative relationship data but lengthens the Y axis (and not the X axis).
Figure 5. Negative Relationship with Unequal Axis Scale
Aspect ratio
In addition to the scale, the aspect ratio of the axes also matters. The aspect ratio is the height of the plot divided by its width. All of the previous figures were drawn so the aspect ratio is 1; that is, the axes are of equal length and the plot is square. Again, we see identical data but this time with different aspect ratios (Figure 6).
Figure 6. Negative Relationship with Unequal Axis Length
Correlation The term correlation refers to the strength and direction of a linear relationship (usually written r).
Interpretation of the r statistic
Here are the basic facts you need to know to interpret a correlation r:
It makes no difference which variable you call X and which you call Y; you'll get the same value for r.
The units of X and the units of Y don't matter. The correlation r has no unit of measurement; it's just a number.
A positive r indicates a positive relationship (as one variable's values increase, the other variable's values also increase).
A negative r indicates a negative relationship (as one variable's values increase, the other variable's values decrease).
A value near zero indicates a very weak linear relationship (as one variable's values increase, we have no idea what the other variable's values will do).
Correlation r is a number, always between -1 and +1.
The number only indicates the strength and direction of a straight-line (linear) relationship (not the causation, i.e., which variable causes the relationship).
Correlation does not describe a curved relationship, no matter how strong it is.
Just like the mean and standard deviation, correlation can be strongly affected by outliers.
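The first two facts in the list above (r does not depend on which variable is called X, and it has no units) can be checked with a short sketch. The height and weight numbers are invented for this illustration, and pearson_r is a hand-written helper, not a JMP function.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: heights in inches, weights in pounds
height_in = [60, 62, 65, 68, 70, 72]
weight_lb = [115, 120, 140, 155, 160, 180]

r_xy = pearson_r(height_in, weight_lb)   # X = height, Y = weight
r_yx = pearson_r(weight_lb, height_in)   # roles swapped

# Same data in centimeters and kilograms
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.4536 for w in weight_lb]
r_metric = pearson_r(height_cm, weight_kg)

print(r_xy, r_yx, r_metric)  # all three agree up to rounding error
```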
Outliers
Below is the negative relationship data. The figure on the left has r = -1.0; the figure on the right has one extreme outlier and now r = -0.33 (Figure 7).
r = -1.0
r = -0.33
Figure 7. The Effect of an Outlier on Strength of a Linear Relationship Moral of the story: Always (every time, really) look at your data (no kidding).
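A minimal sketch of the outlier effect, using made-up data rather than the data behind Figure 7: ten points on a perfect downward line, then the same points plus one extreme point. The helper pearson_r is written out here and is not part of JMP.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical perfect negative relationship: y = 11 - x
x = list(range(1, 11))
y = [11 - v for v in x]
r_clean = pearson_r(x, y)

# Add a single extreme point far from the line
x_out = x + [20]
y_out = y + [20]
r_out = pearson_r(x_out, y_out)

print(r_clean, r_out)  # one point is enough to change r drastically
```

In this invented example the single outlier is extreme enough to flip the sign of r, which is one more reason to look at the scatterplot before trusting the number.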
Form of the Relationship Here are four other sets of data (n = 11) with the same correlation (Figure 8).
Figure 8. Four Sets of Data with r = 0.82 Correlation (AMB)
Keep in mind that the number only indicates the strength and direction of a straight-line (linear) relationship. Correlation does not describe a curved relationship, no matter how strong the relationship. Rule: There is no substitute for plotting the data.
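To see why r cannot describe a curved relationship, consider a sketch with invented data in which Y is completely determined by X (Y is X squared), yet the correlation is zero because the relationship is not a straight line. These are not the data sets shown in Figure 8.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a perfect parabola, symmetric around x = 0
x = list(range(-5, 6))
y = [v ** 2 for v in x]

r_curve = pearson_r(x, y)
print(r_curve)  # essentially zero despite a perfect (curved) relationship
```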
Regression
Regression refers to the situation where we have measurements on two variables and we wish to use one to predict the other.
Alabama High School SOLs: Consider the SOL pass rates in the City High Schools in the Alabama school system. We'll look at two scores, English and Geometry, both from 2000.
Distribution of Y First (as always), look at the data from the 2000 SOLs (Figure 9).
Figure 9. Distribution Reports
Scatterplot
Looking at only the English and Geometry pass rates of 2000, consider these questions:
Do higher (or lower) pass rates in English seem to go with higher (or lower) pass rates in Geometry?
Is the correlation between English and Geometry zero?
Fitting a straight line through the data, is the slope zero?
Does knowing the English pass rate of a school allow us to predict the Geometry pass rate for a school (beyond just a chance level)?
It turns out that these are all the same question. If we answer one, we've answered the others. Let's begin by looking at the scatterplot (Figure 10):
Figure 10. Scatterplot of English and Geometry 2000
Informally, we'd venture a guess that there is a relationship between the English and the Geometry pass rates in Alabama high schools. It's not a strong relationship, but clearly a low English score seems to go with a low Geometry score (and vice versa). So, we've answered one of our questions: Do higher (or lower) pass rates in English seem to go with higher (or lower) pass rates in Geometry? Yes, it seems so. Now, let's look at each of the other questions.
JMP notes
To look at relationships between two continuous variables, begin with the second entry in the Analyze menu (Figure 11).
Choose Analyze > Fit Y by X and identify the X- and Y-columns in the dialog. A dot plot will appear. Making choices in the popup above the figure modifies the default scatterplot.
To fit a straight line through the data, choose Fit Line. A new popup will appear.
To remove the line from the scatterplot, choose Linear Fit > Remove Fit.
Figure 11. Fit of English and Geometry 2000
Slope/intercept
One of the assumptions we are making is that a straight-line relationship is appropriate. If it is not appropriate, then there are alternatives. But in this case linearity does seem to make sense. That is, the form of the relationship is:
Y = intercept + slope × X + error
This is a straight line that crosses the vertical (Y) axis at the intercept, and has a trend indicated by the slope. The slope is the average increase in Y for every unit increase in X. For example, the best fitting line for the example data is shown above (Figure 11). See the Parameter Estimates report for the estimated intercept and slope. The line crosses the vertical axis at about 6.86. That is, if English were zero, then the line predicts that the Geometry pass rate is 6.86. The slope is approximately 0.64. That is, for every increase of one unit in English, the line predicts that there will be an increase of 0.64 units for Geometry.
Testing the Slope
Note that one of our questions was, When fitting a straight line through the data, is the slope zero? That is, does the line have a slope or is it flat? This is a statistical question. We test the slope against 0 the same way we test an observed mean against zero: with a t-test. The Parameter Estimates report answers that question. The estimated slope is approximately 0.64 (SE = 0.18), and it is significantly different from zero (t = 3.5, p-value = 0.0008). It's proper to make this inference only if all the assumptions of the model are met. (We haven't covered these assumptions yet. But we will.)
Least squares
Here is a portion of the data values, showing the explanatory X variable (English), the response Y variable (Geometry), the predicted value for Geometry, and the error term. Let's work through the values for one row in the data table. Using the equation for the line:
Y = intercept + slope × X
we estimate the intercept and slope, and then we can predict Y by the vertical height of the line. Call the prediction $\hat{Y}$.
predicted-Geometry = 6.86 + 0.64 × English
Sheffield High School
Note the Sheffield High School row (row 389). The observed English value is 71.93. We use this in the prediction equation to predict the Geometry value. The line predicts a pass rate of about 52.94%, whereas the actual Geometry pass rate at Sheffield was 28.13%. The difference between the observed value of Y and $\hat{Y}$ is called the error (or the residual):
Y = $\hat{Y}$ + error
Since the actual Y value (Geometry) was 28.13, the error is 28.13 - 52.94 = -24.81. That is, the line overpredicts Sheffield's Geometry pass rate by 24.81 points.
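The Sheffield calculation can be sketched in a few lines. This uses the rounded estimates quoted in the text (6.86 and 0.64), so the prediction comes out near 52.9 rather than exactly the 52.94 that JMP reports, presumably from unrounded coefficients. The variable names are made up for this illustration.

```python
# Rounded estimates from the Parameter Estimates report, as quoted in the text
intercept = 6.86
slope = 0.64

# Sheffield High School values from the data table
english = 71.93
geometry_observed = 28.13

predicted = intercept + slope * english    # height of the line at this English value
residual = geometry_observed - predicted   # observed minus predicted

print(round(predicted, 2), round(residual, 2))
```

A negative residual means the observed point sits below the fitted line.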
Least Squares
How are the estimates for the slope and intercept determined? In what sense is this a best fitting line? The slope and intercept are those values that minimize the total squared error. This is called the least squared error. The slope and intercept are least-squares estimates. If you draw a vertical line from each data point to the fit line, you can see each error. In the figure below, each of the vertical lines represents one error value for a smaller data set, so that you can see it more clearly (Figure 12). Square each of these lengths and add them up. This squared error is the smallest possible. No other slope and intercept will give a smaller squared error.
Figure 12. Error (Residuals) Represented by Distance from Fit Line
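The least-squares idea can be illustrated with a small sketch on invented data: compute the slope and intercept from the standard closed-form formulas, then confirm that nudging either one makes the total squared error larger. The function names and the five data points are assumptions for this example, not values from the SOL data.

```python
def least_squares(x, y):
    """Closed-form least-squares intercept and slope for a straight line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

def squared_error(x, y, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Hypothetical small data set
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares(x, y)
best = squared_error(x, y, b0, b1)

# Any other line does worse
worse_slope = squared_error(x, y, b0, b1 + 0.1)
worse_intercept = squared_error(x, y, b0 + 0.1, b1)
print(b0, b1, best, worse_slope, worse_intercept)
```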
Variance accounted for
R-squared is an important descriptor: the squared correlation is the proportion of variance in Y accounted for by X. That is, r² (say "R squared") tells us how much of the variance in the response variable is accounted for by the explanatory variable. The histogram and moments reports for the response variable (Geometry), the predicted values for Geometry, and the errors are shown in Figure 13.
Figure 13. Observed and Predicted Geometry 2000 Distributions
Notice that there is more spread in Geometry than in Predicted Geometry. That is, the estimated SD of the Y variable is 20.9 and the estimated SD of $\hat{Y}$ is 7.3. The predicted values account for some fraction of the total variance in the observed values. The ratio of the predicted to the observed variance is the proportion of the variance in the response variable that is accounted for by the explanatory variable. This ratio is 0.12 here. That is, r² = 0.12.
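The ratio quoted above can be checked directly from the two standard deviations, remembering to square them first because the ratio is of variances, not SDs:

```latex
r^2 = \frac{\operatorname{Var}(\hat{Y})}{\operatorname{Var}(Y)}
    = \frac{7.3^2}{20.9^2}
    = \frac{53.29}{436.81}
    \approx 0.12
```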
Correlation and Rsquare
The Rsquare value is already available in the Linear Fit report. To calculate the correlation (r) we could take the square root of Rsquare (giving it the sign of the slope), or use JMP to estimate the correlation directly. One of our questions was, Is the correlation between English and Geometry zero? Either the Bivariate Report or the Multivariate Report will answer that question. The correlation between English and Geometry is positive (r = 0.35, p-value = 0.0008).
To estimate the correlation between the Y and X variables in the scatterplot:
Choose Density Ellipse > .95
An ellipse, a bivariate report, and a new popup will appear.
To remove these, choose Bivariate Normal Ellipse > Remove Fit
OR
To estimate the correlation between the Y and X variables in the Multivariate Report:
Choose Analyze > Multivariate Methods > Multivariate
A new window will appear. Select the variables you are interested in (at least 2 variables) and select OK.
From the Multivariate menu, select Pairwise Comparisons.
Assumptions
Inspecting a straight-line fit is one way to interpret the relationship between the Y and X columns. Another is the bivariate normal ellipse. The word normal reminds us that there are assumptions underlying all of what we've talked about so far. That is, under some circumstances a correlation estimate makes sense, a straight-line fit makes sense, and talking about variance accounted for makes sense. But there are many situations where linear relationships are not applicable.
The assumptions behind a correlation
For correlation to apply, the following assumptions must hold:
The observations must be representative and independent.
For a fixed X value, the Y values are normally distributed.
For every fixed X, the variance of the Y values is the same.
For a fixed Y value, the X values are normally distributed.
For every fixed Y, the variance of the X values is the same.
The X and Y values have a bivariate normal distribution.
If these assumptions hold, then the bivariate normal ellipse can be thought of as a confidence bound. That is, a 0.95 bivariate normal density ellipse will enclose about 95% of the data points if these assumptions hold.
The assumptions behind a straight-line fit
For a straight-line fit to make sense, these are the assumptions that need to hold:
The observations (rows) must be representative and independent.
The form of the relationship between Y and X must be linear.
The values of X are measured without error (or at least the measurement error is negligible).
For every fixed X, the variance of the Y values is the same.
The error (residual) values have a normal distribution.
If these assumptions hold, then the statistical test for the slope is appropriate. Most of these are easy to assess in JMP. Linearity is largely assessed visually. But there are also important clues about nonlinearity available by assessing the last two assumptions.
Equal variance How do we assess whether the variance of the Y values is the same for every fixed X? First, visually. Look at the amount of spread above and below the fit line. Is there equal variance here?
See the fan spread? For low cholesterol values, there is less spread above and below the line than with larger cholesterol values. The spread around the line at cholesterol = 200 is much wider than at 100. Secondly, look at a plot of the residuals versus the X values. This plot clearly shows that the variance of the residuals is not equal across the X values.
To see the plot of the residuals for any fit, go to the submenu popup for that fit. Choose Linear Fit > Plot Residuals.
A residual plot like that below is clear evidence of equal variance. What you want to see is no pattern whatsoever.
Normality
How do we assess whether the error (residual) values have a normal distribution? We assess normality of the residuals by inspecting the normal quantile plot, although in practice it's usually sufficient to inspect the residuals vs. predictor plot. Note that you do not care whether the raw Y values are normal. You don't care whether the X values are normal, although it usually is a good sign if they are. You do care about the normality of the residuals at a fixed point on the X axis.
Thinking about a linear fit
Again, we're considering predicting a high school's Geometry pass rate from its English pass rate. We do this by fitting a straight line through the points in the scatterplot and interpreting the slope. If there is no relationship, what will the slope of this line be? Said another way, if we're trying to predict a high school's pass rate from something that is absolutely useless as a predictor, what Geometry pass rate for a high school would we predict?
If we knew no other useful information, our best single predictor of the Geometry pass rate for a high school would be the mean Geometry pass rate. It would be best in the sense that it would minimize our expected error. That is, to give a total score for how good (or bad) our predictions are, calculate the sum of all the squared errors for each high school (the ith one). Our total error is:
$$\mathrm{SST} = \sum_i \left( G_{i,\mathrm{obs}} - \bar{G} \right)^2$$
in general:
$$\mathrm{SST} = \sum_i \left( y_i - \bar{y} \right)^2$$
Predicting the mean is the best we can do if we have no other useful information. That is, the simplest possible model for prediction is to simply predict the mean. Predicting the mean will give us the least squared error. We call the score when using this model the Total Sum of Squares (SST). Since this is the best we can do with no other information, this is the baseline used to compare all other potential predictors. That is, if a set of predicted values does not come out with an error less than SST, then it's clearly worse than the simplest possible model. Say we have a potential way to guess the Geometry pass rate. In our case, our model is a linear regression model; we're using the English values to predict the Geometry pass rate. It gives us a predicted Geometry for all the high schools. We calculate the sum of all the squared deviations of the predicted values from the mean:
$$\mathrm{SSR} = \sum_i \left( G_{i,\mathrm{pred}} - \bar{G} \right)^2$$
in general:
$$\mathrm{SSR} = \sum_i \left( \hat{y}_i - \bar{y} \right)^2$$
This is called the Sum of Squares Regression, or the explained sum of squares. It's also called the Model Sum of Squares, since it describes how good our model is. The better this model is, the larger the SSR. The worse it is, the smaller the SSR. The SSR tells us how much of SST the model explains. What's left over is the Sum of Squared Error:
$$\mathrm{SSE} = \sum_i \left( G_{i,\mathrm{obs}} - G_{i,\mathrm{pred}} \right)^2$$
in general:
$$\mathrm{SSE} = \sum_i \left( y_i - \hat{y}_i \right)^2$$
It turns out that:
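$$\mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}$$
$$\sum_i \left( y_i - \bar{y} \right)^2 = \sum_i \left( \hat{y}_i - \bar{y} \right)^2 + \sum_i \left( y_i - \hat{y}_i \right)^2$$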
So, we have some portion of the total variability (SST) explained by a model (SSR), with some portion left unexplained, the error of the model (SSE). This brings us back to the notion of variance accounted for. Another definition for Rsquare is:
r² = SSR / SST
This has a number of uses because it ties together a lot of what we've done in this class and will tie together things we'll cover in the future. In the Mean Fit report that occurs when we just fit a flat regression line, we see how bad this model is: its squared error is the full SST = 39,220.66 (Figure 11). The goal is to have the SSE be as small as possible and the SSR as large as possible (r² = 1).
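The decomposition SST = SSR + SSE holds for any least-squares line, which a short sketch on invented data can confirm. The six data points and the helper fit_line are assumptions for this illustration only.

```python
def fit_line(x, y):
    """Closed-form least-squares intercept and slope."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Hypothetical data
x = [1, 2, 3, 4, 5, 6]
y = [3.0, 4.5, 4.0, 6.5, 7.0, 6.0]

intercept, slope = fit_line(x, y)
y_bar = sum(y) / len(y)
y_hat = [intercept + slope * a for a in x]

sst = sum((b - y_bar) ** 2 for b in y)              # error when predicting the mean
ssr = sum((p - y_bar) ** 2 for p in y_hat)          # part explained by the line
sse = sum((b - p) ** 2 for b, p in zip(y, y_hat))   # part left unexplained

r_squared = ssr / sst
print(sst, ssr + sse, r_squared)  # the first two numbers match
```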
Analysis of Variance
In the Analysis of Variance report that occurs when we fit a line, we see how much error remains under this model: it has SSE = 34,547.04. Recall that:
SST = SSR + SSE
39,220.66 = 4,673.62 + 34,547.04
And so:
r² = SSR / SST = 4,673.62 / 39,220.66 = 0.119
This gives us another way to look at correlation. It's a way to compare models. We have one very simple model (predict the mean) and we have a model we think is better (the line). How much better? An r² of zero would tell us it's no better. An r² of 1 would tell us it does a perfect job of predicting the data.
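The arithmetic above is easy to reproduce, and taking the square root recovers the correlation reported earlier (r = 0.35). This sketch only uses the sums of squares quoted in the text; the positive root is taken because the fitted slope is positive.

```python
import math

# Sums of squares from the Analysis of Variance report quoted above
ssr = 4673.62
sse = 34547.04

sst = ssr + sse
r_squared = ssr / sst
r = math.sqrt(r_squared)  # positive root, since the slope is positive

print(round(sst, 2), round(r_squared, 3), round(r, 2))
```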
Confidence Bounds
This is visualized very well in JMP by looking at the confidence band around our straight-line model. In the figure below, we see both the Linear Fit and its 95% confidence bound (Figure 14). We also see the Mean Fit. So, recall one of our questions was: Does knowing the English pass rate of a school allow us to predict the Geometry pass rate for a school (beyond just a chance level)? Now we have a way to answer this question. The "just a chance level" prediction would be: Predict the mean Geometry pass rate (about 47.7). Our model competes with that simplest possible model. Is it any better? Answer that question by looking at the confidence bounds: Do the (red, curved) confidence bounds include the (flat green) mean line? If the green line stays inside the confidence band, then the model is not statistically significant (actually the p-value will be > 0.05). If the green line cuts through the confidence band, then the model is statistically significant (and the p-value will be < 0.05). In this case, the green line extends beyond the confidence bounds and the linear fit is significant.
Figure 14. Linear Fit with Confidence Bounds and Mean Fit of Geometry by English
To see the 95% confidence band for the mean predicted value for any fit (linear or mean), go to the submenu popup for that fit. Choose Linear Fit > Confid Curves: Fit
Random Predictor Here is how an equivalent scatterplot would look if we had a useless predictor (Figure 15).
Figure 15. Linear and Mean Fit of Useless Predictor
Just generate a random variable and use it in a scatterplot. The regression happened to have a slight increase but, as you can see, the confidence bounds include the mean prediction. The p-value here was 0.4157 (not significant). The fact that the random variable is not a significant predictor is reflected in the figure.
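A rough sketch of the same experiment outside JMP: generate a made-up response and a completely unrelated random predictor, then look at how little variance the predictor accounts for. The sample size, mean, SD, and seed are arbitrary assumptions, and this sketch reports r² only; it does not compute a p-value or confidence band.

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical pass rates and an unrelated random predictor
geometry = [rng.gauss(50, 20) for _ in range(200)]
useless = [rng.gauss(0, 1) for _ in range(200)]

r2_useless = r_squared(useless, geometry)
print(r2_useless)  # small: the random column explains almost none of the variance
```

The r² is rarely exactly zero even for a useless predictor, which is why a formal test (or the confidence band) is needed to judge whether a small slope is just chance.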
Correlation or Regression?
If there is a clear predictor variable and a clear response variable (you want to use the values from the predictor as a basis for guessing unknown future values of the response), then the regression model is appropriate. If both are random and could be thought of on equal terms, then the correlation model is appropriate. For instance, we've been using English to predict Geometry. Would it make sense to do the reverse (Figure 16)?
Figure 16. Fit of English by Geometry
What's the same? What's different?
Step by Step
In the previous sections we gave an overview of the issues and covered some tools to address these issues. Again, we can use the 10-step process of decision making. It's not necessary to follow through the details of every step each time we look at a set of data. But until we gain the experience to be comfortable that we'll end up in the right place, it's a good idea to take one step at a time.
10 Steps of Hypothesis Testing (AMB)
Phase 1: State the Question
1. Evaluate and describe the data
As always, the first two questions are: Where did this data come from? What are the observed statistics? The source of the data bears directly on the crucial assumption of independence and representativeness. If the data is not a representative sample of the population, then inference to the population from the sample is not useful. If the rows are not independent, then the variance estimates will be wrong, and inference is, at best, weak. The univariate distribution of each variable should be inspected (see Figure 9) and suitable descriptive statistics recorded. A preliminary assessment of normality is useful (although not critical). Potential outliers should be investigated. Then the bivariate scatterplot should be inspected (see Figure 10). The linearity and equal variance assumptions should be informally assessed.
2. Review assumptions
In place of our standard three assumptions, there are five questions here:
Is the process used in this study likely to yield data that is representative of the population? A simple random sample is the best way to assure this. Otherwise, look for possible sources of bias.
Is each observation in the sample independent of the others? The classic example of a situation where this assumption is violated is when there are repeated observations on the sampled individual.
If the above assumptions are questionable, then proceeding is risky. If the above assumptions are met, then we can proceed to the following three crucial assumptions:
Are the residuals normally distributed? If the sample size is sufficient, then we can often assume that the errors will be normally distributed. The test statistic we use, the t statistic, is not sensitive to moderate departures from normality. Thus, unless the distribution is seriously skewed, the actual calculated p-values and confidence intervals will be close to the levels for exact normality. With large samples, the normality assumption is nearly always met.
Is the form of the relationship linear?
For every fixed X, is the variance of the Y values the same? This is the homogeneous-variance assumption.
Usually, the assessment of these last three assumptions will have to wait until the residual errors are determined. This occurs after we fit a straight-line model. However, if it's obvious from the inspection of the scatterplot that linearity is suspect or that equal variance does not hold, then don't proceed. We'll cover some methods to handle these situations later.
3. State the question in the form of hypotheses
There are three equivalent ways to state the null hypotheses:
H0: r = 0 (zero correlation), vs HA: r ≠ 0 (either a positive or negative correlation)
H0: slope = 0 (flat line), vs HA: slope ≠ 0 (a trend)
H0: Y = $\bar{Y}$ + error (predict the mean), vs HA: Y = intercept + slope × X + error (predict a trended value)
Answer one of the three and you have answered the other two. They are absolutely and completely the same question. We may prefer one over the other in terms of ease of explanation, but the decision-making process is identical. However, since the correlation is best thought of as a descriptive statistic, it's best not to use r to do hypothesis testing. (Actually, r is not normally distributed so hypothesis testing on it is best avoided.) So, we proceed by testing hypotheses on the slope. Thus the hypotheses stated for testing purposes are: H0: slope = 0, vs HA: slope ≠ 0.
Phase 2: Decide How to Answer the Question
4. Decide on a summary statistic that reflects the question
There are two test statistics we can use. In this situation, where there is only one Y and one X, the following two statistics give identical p-values:
The t-value with df = n - 2.
The F-value, whose distribution has two df parameters: the df-numerator and the df-denominator. Since dfModel = 1, the F-value has (1, n - 2) df.
It turns out that in the situation where df-numerator = 1, t² = F. If it helps, you can think of it as though the t-test is testing whether the slope is zero and the F-test is testing whether predicting values with a straight line is better than predicting the mean of Y. You get the same p-value and make the same decision either way.
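The equivalence of the two statistics can be checked numerically. The following is a minimal sketch in plain Python (the notes themselves use JMP); the function name `simple_regression` and the eight data pairs are invented for illustration and are not taken from the course data.

```python
import math

def simple_regression(x, y):
    """Least-squares line plus the t and F statistics for the slope."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    fitted = [intercept + slope * xi for xi in x]
    # Error: straight-line prediction versus observed value
    ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    # Model: straight-line prediction versus the mean of Y
    ss_model = sum((fi - mean_y) ** 2 for fi in fitted)

    df_model, df_error = 1, n - 2
    ms_error = ss_error / df_error
    t_value = slope / math.sqrt(ms_error / sxx)   # slope / SE(slope)
    f_value = (ss_model / df_model) / ms_error    # MS model / MS error
    return slope, intercept, t_value, f_value

# Hypothetical pass rates, made up only to exercise the arithmetic.
x = [52, 60, 65, 71, 78, 84, 90, 95]
y = [40, 55, 48, 62, 59, 70, 66, 81]
slope, intercept, t_value, f_value = simple_regression(x, y)
print(f"t = {t_value:.3f}, t squared = {t_value ** 2:.3f}, F = {f_value:.3f}")
```

Whatever data are supplied, the printed t squared and F should agree up to rounding, which is why either report leads to the same decision.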
5. How could random variation affect that statistic?
What values of t will we see if the null hypothesis is true? If the null hypothesis is not true?
For H0: slope = 0 (or r = 0) we'll see t values close to zero.
For HA: slope ≠ 0 (or r ≠ 0) we'll see t values far from zero.
Recall the rough interpretation that t's larger than 2 in absolute value are remarkable.
What values of F will we see if the null hypothesis is true? If the null hypothesis is not true?
For H0: Y = Ȳ + error, we'll see small F values (near 1 or below).
For HA: Y = intercept + slope × X + error, we'll see F values well above 1.
From the rough interpretation of t, and because t² = F here, we'd guess that an F larger than 4 is remarkable. Either test will yield a p-value when compared to a distribution with df = n - 2.
6. State a decision rule, using the statistic, to answer the question
The universal decision rule: Reject H0 if p-value < α.
Phase 3: Answer the Question
7. Calculate the statistic
Use JMP to estimate a linear fit. Inspect the Parameter Estimates report for the slope estimate, SE, t, and p-value. The df corresponds to dfError. Or, inspect the Analysis of Variance report for the F ratio, dfModel and dfError, and p-value.
[Calculation note: The F ratio corresponds to:
F = (Mean Square Model) / (Mean Square Error)
It's a ratio of the mean square of the model (which compares the straight-line predicted value to the mean predicted value) to the mean square error (which compares the straight-line predicted value to the observed value).]
The correlation may be a useful descriptive statistic; it is available in the Bivariate report. The p-value for the correlation is identical to either p-value above.
8. Make a statistical decision
In this example, since p-value = 0.0008, we reject the null hypothesis. But at this point it is crucial to determine whether the statistical decision is defendable. Are the assumptions met? In order of importance:
Is the data representative?
Is each observation independent?
Is the relationship linear?
For every fixed X, is the variance of the Y values the same?
Are the residuals normally distributed?
If the first three are not clearly met, then inference is questionable. If the error variance is nonconstant, then other alternatives should be considered. Normal residuals and equal variance usually go hand in hand. With sufficiently large sample size, normality is usually a safe assumption. That is, unless there is clear evidence of skewness or of severe nonnormality, we're usually safe. The residuals in this case show no obvious pattern (Figure 17).
Figure 17. Residual Plot for Geometry
9. State the substantive conclusion
There is a positive linear relationship (or association) between the high school English pass rate and the high school Geometry pass rate.
Phase 4: Communicate the Answer to the Question
10. Document our understanding with text, tables, or figures
There are a number of ways to document our understanding. They would all begin with background information on each variable. We'd describe the methods of how the data was obtained. And then:
Correlation: If the intent of the question is to inquire whether there was a significant correlation between the two pass rates, the following paragraph would be an adequate description.
In the Spring of 2000, n = 91 high schools in the Alabama School System had end-of-grade SOL pass rates reported in English and Geometry. There was found to be a significant positive linear correlation between the two pass rates (r = 0.34, p-value = 0.0008). Higher pass rates in Geometry are associated with higher pass rates in English and lower pass rates in Geometry are associated with lower pass rates in English.
Straight line: If the intent of the question is to describe a straight line through the pairs of points, the following paragraph would be adequate.
The relationship between Geometry and English pass rates was found to be linear and the trend was significantly different than zero (t = 3.47, df = 89, p-value = 0.0008). The predicted Geometry pass rate for a high school is 6.86 + 0.64*English. This linear relationship accounts for about 12% of the variance of the Geometry pass rate (r² = 0.34² ≈ 0.12). Figure 18 illustrates the linear trend and shows the 95% confidence bounds when predicting an individual school's Geometry pass rate from a given English pass rate (wider bounds) and the 95% confidence bounds for the linear prediction (narrower bounds).
Figure 18. Linear Fit of Geometry by English with 95% Confidence Bounds
The 95% CIs are available in JMP by right-clicking on the parameter estimates table. Choose from the Columns popup that appears.
Note: In clinical situations, the prediction interval (PI) is the one you want (the wider interval). The problem with the PI is that it's often much wider than the CI. So, people often (incorrectly) show the CI, since it's narrower and looks better, when they really should show the PI.
JMP note: Use the popup under the scatterplot to add one confidence curve or the other.
Model comparison: If the intent of the question is to determine whether a model can be developed to predict one variable from another, the results can be stated as with the straight line paragraphs above. Probably the only thing that would be different is that instead of saying: The relationship between Geometry and English pass rates was found to be linear and the trend was significantly different than zero (t = ...). We say: A significant linear model was found that described the relationship between the two variables (F(1, 89) = 12.0, p-value = 0.0008).
Details
Note how the df for the F-value is either included in parentheses or with subscripts: F_{1,89} = 12.04. Only one decimal place (perhaps two) is needed for the F-value; t-values are also reported to one or two decimal places. The r-correlations are reported to, at most, two decimal places. The number of decimal places for slope and intercept estimates should be the same as the number of decimal places you use for the mean or SD (or perhaps one more decimal).
Section 20. Other Topics in Correlation and Regression
Overview
In the previous section we discussed how to assess a relationship between two continuous variables using methods based upon linearity and the assumption of normality. In this section we'll begin by assessing a relationship when the form is not linear or when the normality assumption is arguable (or not defendable). We'll follow this discussion with other topics, including why it's called regression and how no change can masquerade as real change.
Let's begin with nonlinearity and/or non-normality. One way to describe the relationship between two variables without assuming a particular form (i.e. linear) and without making assumptions regarding the distribution (i.e. normal) is to look at the nonparametric analog of Pearson's correlation coefficient (r).
Example: Consider the 2000 Writing SOL pass rate in Alabama County high schools and the Algebra I pass rate. The distribution of each variable is summarized as follows (Figure 1).
Figure 1. Distribution of Writing and Algebra I SOL Scores
A question of interest is: are the two correlated? The first step is to plot Algebra I by Writing (Figure 2).
Copyright Stacey S. Cofield, 17 November 2004. All rights reserved. AMB denotes material reprinted courtesy of Al Best, Virginia Commonwealth University
Figure 2. Plot of Algebra I by Writing 2000
Consider the strength, direction, and form of the relationship:
Relationship: there is a relationship between the spring 2000 Writing SOL pass rate and the Algebra I pass rate.
Strength: appears to be moderate, but clearly non-zero.
Direction: positive.
Form: the form may not be linear.
Of course, we could impose the assumptions of normality and linearity (Figure 3).
Figure 3. Linear Fit of Algebra I by Writing 2000
The line accounts for some portion of the variance (r² = 28.3%, p-value < 0.0001). The model is statistically significant, so one might assert that this is a strong, positive relationship. However, is there anything wrong with this assertion? Let's look at the residuals from this proposed model (Figure 4).
Figure 4. Residual Plot for Linear Fit
So there are issues with a linear fit: there are clear outliers that a straight line does not address appropriately. So can we just use correlation?
Assessment of Pearson's r
Is the correlation r = 0.53 a defendable summary statistic (Figure 5)?
Figure 5. Correlation of Algebra I by Writing 2000
Nearly all of the points are included in the Bivariate Normal Ellipse, but a number of points are not included and lie some distance away from the bounds. In addition, the points outside of the bounds are not evenly spread. So what are our options?
One way to handle non-linear, non-normal relationships is to ignore these problems and use the linear fit (this is not the best approach).
Another option is to remove these annoying data points and proceed with the linear approach (not only is this not good, it can be unethical).
Another way to proceed is to assess the outlier as justifiably different. The Brewbaker Technology Magnet High School may not be a "real" high school, since this type of school has a different curriculum than so-called regular high schools. Perhaps we could justify not including this school, but this is the only such school. Removal of any outliers should be justified and the analysis presented with and without the outlying observations.
Proceed using methods that do not require an assumption of normality.
Spearman's Rank Correlation
Another way to proceed is to simply look at a nonparametric measure of how related two variables are. The most common is Spearman's rank correlation. The Spearman correlation is a summary statistic with many of the same characteristics as Pearson's r. The Spearman r ranges between -1 and +1, and is interpreted as the correlation between the ranks of the two variables.
The Spearman rank-correlation coefficient is not available in Fit Y by X. To obtain nonparametric correlation coefficients between continuous variables:
Choose Analyze > Multivariate and identify all the columns of interest. A correlations report will appear (showing Pearson's correlation).
To add a report of nonparametric associations, choose Nonparametric Correlations > Spearman's Rho.
Correlation between ranks. The Spearman r is calculated simply by determining the (Pearson) r of the ranks. To illustrate, convert the original data to ranks (the smallest value has rank 1, the largest value has rank 306, since there are n = 306 schools with data on both scores). The Spearman results for this data are below, showing an r = 0.52 (Figure 6).
Figure 6. Spearman Rank for Algebra I and Writing 2000
Thus the Spearman r is interpreted as the correlation between the rank-order of the observations.
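The "Pearson r of the ranks" description can be written out directly. The sketch below is plain Python rather than JMP; the helper names (`pearson_r`, `average_ranks`, `spearman_r`) and the eight pairs of pass rates are invented for illustration, and giving tied values the average of their ranks is an assumption, since the notes do not say how ties are handled.

```python
import math

def pearson_r(x, y):
    """Ordinary (Pearson) correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank 1 = smallest value; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of a run of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_r(x, y):
    """Spearman rank correlation: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical pass rates with one extreme school, for illustration only.
writing = [70, 75, 80, 82, 85, 88, 90, 99]
algebra = [20, 35, 30, 45, 50, 48, 60, 100]
print("Pearson r: ", round(pearson_r(writing, algebra), 3))
print("Spearman r:", round(spearman_r(writing, algebra), 3))
```

Because only the ordering enters the calculation, the extreme school contributes no more to the Spearman r than any other school with the top rank.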
So, one summary paragraph might be as follows:
In the Spring 2000 SOL test results in Alabama high schools there was a significant positive relationship between the Writing pass rate and the Algebra I pass rate (Spearman rank correlation r = 0.51, p-value < 0.0001). The rank correlation was used because the relationship between the two pass rates is not linear.
Note that we don't give any slope or intercept or talk about any line that we could use to predict one from the other. Additionally, it may be useful to describe any outliers. If any more description of the relationship is needed, add only the scatterplot.
Regression to the Mean
Another important topic, when considering relationships, is the phenomenon of regression to the mean. Galton observed that children's heights tend to be more moderate than their parents' heights (Figure 7, Example AMB).
Figure 7. Child Height by Parent Height
Consider how children's height compares with the average height of the child's two parents. From the straight-line fit above we see that if two parents have an average height of 72 inches, we'd predict the average child height to be 70.5 inches. Similarly, if the parents' height is 63 inches, we'd predict 65.9 inches. The slope of the line is not one; from these two predictions it is (70.5 - 65.9) / (72 - 63), or about 0.5.
Regression to the mean
Galton called this regression to the mean. That is, there is a tendency for extreme values to be less extreme upon repeated observation. Note that this phenomenon does not depend upon which variable is the X-variable and which is the Y-variable (Figure 8).
Figure 8. Parent Height by Child Height
Remember, you can think of every measurement as being made up of two parts: the true value and the measurement error. Thus, the observed heights are made up of the true height + the measurement error of the height. Measurement error has as much chance of being positive as negative. So if you were simply to re-measure, the observed value next time may be lower or higher. This has profound implications for how we interpret change.
English SOLs
Recall the English SOL pass rates for all County High Schools in the State of Alabama (Figure 9).
Figure 9. English Scores 1998, 1999, 2000
There was a significant increase in the pass rate from 1998 to 1999 (t = 2.33, p = 0.01). Now, let's divide the schools into three groups based on how much they improved between 1998 and 1999: the top 10%, the bottom 10%, and the middle 80% (31, 31, and 245 schools respectively; 5 schools were excluded due to a missing year or a 0 pass rate in either year).
Now we follow these three groups of schools into 2000. What will happen to the Spring 2000 English SOL pass rate for these schools (Figure 10)?
Note that three additional schools were removed due to missing 2000 scores.
Figure 10. Change from 1999 to 2000 by Group
What do we make of these changes in the three groups? The Bottom group increased, the Middle group increased slightly, and the Top group decreased.
Regression to the mean
The change between 1998 and 1999 (horizontal axis) is negatively related to the change between 1999 and 2000 (vertical axis), r = -0.57, p < 0.0001 (Figure 11).
Figure 11. Change from 1999 to 2000 by Change from 1998 to 1999
What this means is that schools that show a decrease one year are likely to show an increase the next year, schools that show an increase one year are likely to show a decrease the next year, and schools with slight change (or 0 change) will tend to have a slight change again. In other words, the change tends towards 0 on average.
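A small simulation shows how this negative relationship between successive changes can arise from measurement error alone. This is a sketch under stated assumptions, not a model of the actual SOL data: each simulated school has a fixed true pass rate, each year's observed rate adds independent noise, and the means, SDs, seed, and number of schools are all arbitrary choices.

```python
import math
import random

def pearson_r(x, y):
    """Ordinary (Pearson) correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(2000)  # fixed seed so the run is repeatable
n_schools = 300

# Assumption: the true pass rate of each school never changes.
true_rate = [random.gauss(75, 8) for _ in range(n_schools)]

def observe(true_values, error_sd=5):
    """Observed value = true value + independent measurement error."""
    return [t + random.gauss(0, error_sd) for t in true_values]

y1998 = observe(true_rate)
y1999 = observe(true_rate)
y2000 = observe(true_rate)

change_1 = [b - a for a, b in zip(y1998, y1999)]  # 1998 to 1999
change_2 = [c - b for b, c in zip(y1999, y2000)]  # 1999 to 2000

r = pearson_r(change_1, change_2)
print("correlation between successive changes:", round(r, 2))
```

Under these assumptions the two changes share the 1999 error with opposite signs, so their correlation is expected to be about -0.5 even though no school truly changed.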
The Application of this Phenomenon
Regression to the mean is real. It's an entirely statistical phenomenon that arises out of measurement error.
Example: We're designing a clinical trial. We bring in a number of individuals with the disease and measure how sick they are.
Our new miracle drug is in short supply. But we're sure our drug works and so we feel ethically obligated to give it to the worst off. We choose those who are in the worst 50% and they get better. Now you've lowered their sickness! But be careful, it could be due to regression to the mean.
Illusion
The problem with this interpretation is that the significant change may be just regression to the mean.
How to detect real change?
So, how do we design studies to deal with regression to the mean? Think about this issue.
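The illusion in the clinical-trial example can be reproduced with no drug at all. The sketch below is a hypothetical simulation: the severity scale, its mean and SD, the size of the measurement error, and the seed are all invented, and the follow-up measurement is generated from the same unchanged true severity as the baseline.

```python
import random
import statistics

random.seed(611)  # fixed seed so the run is repeatable
n = 400

# Assumption: higher score = sicker, and true severity does not change.
true_severity = [random.gauss(50, 10) for _ in range(n)]
baseline = [t + random.gauss(0, 10) for t in true_severity]
follow_up = [t + random.gauss(0, 10) for t in true_severity]  # no treatment effect

# Select the worst-off half using the baseline measurement only.
cutoff = statistics.median(baseline)
worst = [i for i in range(n) if baseline[i] > cutoff]

mean_before = statistics.mean([baseline[i] for i in worst])
mean_after = statistics.mean([follow_up[i] for i in worst])

print("selected group, baseline mean: ", round(mean_before, 1))
print("selected group, follow-up mean:", round(mean_after, 1))
```

The selected group appears to improve because it was chosen partly for having unlucky (high) baseline errors, and those errors are not repeated at follow-up.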
Transformations and Other Models
When looking at a particular scatterplot to see a relationship, a straight-line fit may not be defendable. Perhaps there is a way to linearize the relationship and solve some, if not all, of the problems. That is, measurement scale is often arbitrary. So, if a change of scale results in a linear relationship, then this can be very useful. These changes of scale are referred to as transformations.
In addition, it may be possible to summarize a relationship between two variables with a curvilinear form. That is, we may be able to add more terms to the model and better explain the data.
Transformations
One way to remedy outliers, failed normality, nonlinearity, and/or non-constant variance is by transforming the data to another scale.
First Example: Lipids
Example: Consider the relationship between cholesterol and triglycerides. The distribution of triglycerides was clearly skewed, and there were a number of outliers in cholesterol (Figure 12).
Figure 12. Distribution of Triglycerides and Cholesterol
Rank correlation
As we've seen, a solution to these problems is Spearman's rank correlation (Figure 13).
Figure 13. Spearman Results Cholesterol by Triglycerides
So, there is a statistically significant relationship. If that is a sufficient answer to the question, then that may be all we need. However, if we want to describe the relationship with some form of curve, or if we want to predict one variable from the other, then a rank correlation is insufficient.
Transformations of one variable
First, consider the distribution of a single variable. The most common problem is skewness and this most often occurs in measures that have all-positive values. One way to consider skewness is in terms of its severity. In the figure below, we see increasingly severe skewness (Figure 14).
Figure 14. Skewed Distributions (AMB)
The distributions are increasingly skewed, and all of them can be made approximately normal through a transformation of scale.
Square root
If the magnitude of skewness is moderate, a square-root transformation may be all that's necessary to pull in the outliers and transform the data to be normal (Figure 15).
Figure 15. Square Root Transformation (Before on Left, After on Right) (AMB)
By "transform the data" we mean to use a new variable that is a transformation of the old variable. That is, in this case: NewVariable = Sqrt(OldVariable). Note that the square-root transformation will work even if there are zero values but not if the original variable has negative values.
Log transformation
In general, if you take 10^X and get Y, then X is the log of Y. In this case, the log base is 10; using 10 as the base is taking the common log. Other bases are also used: base e and base 2 are typical. Base e is called the natural log. The number e is a mathematical constant, approximately 2.718. Base 2 could be used if it's convenient to think of the quantity as having units such that each doubling is a unit. Drug doses are commonly log-transformed because the effect of drugs is related to the multiplying of dose-level. That is, the effect of dose is not additive.
If the magnitude of skewness is pronounced, a log transformation may be all that's necessary to pull in the outliers and transform the data to be normal (Figure 16). That is, in this case the transformation is as follows: NewVariable = Log(OldVariable). Note that the log transformation will work only if the original variable has positive (non-zero) values. However, many measuring devices have a lower detection limit that is reported as zero. For instance, a digital measurement device may be unable to reliably measure quantities below 0.01. It's programmed to return 0.0 on the read-out whenever values below this occur. Thus, the argument is, these 0.0 values are not really zero, they are just very small. Then it would be appropriate to transform the data after adding a small positive number to the zeroes that appear in the original variable. That is, if you have zero values, try adding 0.001. The choice of the constant is arbitrary. Most people look at the resulting distribution to verify that the distribution becomes normal, that is, that the zero values are appropriately small after the resulting transformation.
Figure 16. Log Transformation (Before on Left, After on Right) (AMB)
Reciprocal
The reciprocal transformation is not commonly used. However, if the magnitude of skewness is extreme, a reciprocal transformation may be necessary to pull in the outliers and transform the data to be normal (Figure 17). One good example of reciprocal relationships is automobile fuel efficiency. In the US we think in terms of miles per gallon. Almost everywhere else, people think in terms of liters per 100 kilometers. One is the reciprocal of the other (after changes in the units).
Figure 17. Reciprocal Transformation (Before on Left, After on Right) (AMB)
That is, in this case: NewVariable = 1/OldVariable. Since dividing by zero is undefined, this transformation only works with non-zero values.
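The three transformations and their domain restrictions can be tried side by side. This is a sketch only: the ten readings are invented (one 0.0 stands in for a value below a detection limit), the `skewness` helper is a simple moment-based measure chosen for illustration, and the 0.001 offset follows the suggestion in the text.

```python
import math

def skewness(values):
    """Moment-based skewness: 0 for symmetric data, positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed readings; 0.0 means "below the detection limit".
old_variable = [0.0, 0.4, 0.7, 1.1, 1.6, 2.3, 3.5, 5.2, 9.8, 24.0]
offset = 0.001  # arbitrary small constant added before taking logs

# Square root: fine with zeros, undefined for negative values.
square_root = [math.sqrt(v) for v in old_variable]

# Log: needs strictly positive input, so shift by the offset first.
log_shifted = [math.log(v + offset) for v in old_variable]

# Reciprocal: undefined at zero, so only the non-zero readings are used here.
reciprocal = [1 / v for v in old_variable if v != 0]

for name, data in [("original", old_variable),
                   ("square root", square_root),
                   ("log(x + 0.001)", log_shifted),
                   ("reciprocal", reciprocal)]:
    print(f"{name:>15}: skewness = {skewness(data):.2f}")
```

With these made-up numbers the shifted log turns the single zero into a large negative value and overshoots into left skew, which is one reason to inspect the transformed distribution before settling on a constant.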
Considering a transformation of Triglyceride
So, which would we choose? We'd prefer the distribution to be normal, with fewer outliers. Depending upon our choice, we may see a different relationship with cholesterol. Try transforming one or both variables and look at the residual plots to determine normality and assess the equal variance assumption. Remember to report your results in terms of the original scale. That is, if you log transform cholesterol, then report the prediction formula as: Cholesterol = EXP(intercept + slope*Triglycerides). Also, you'll need to justify the transformation, for example:
The triglyceride values in this sample (n = 3576) were log-transformed because the original distributions were strongly skewed. As can be seen in the figure below, the assumptions for assessing correlation are now met and it is clear that there is a significant positive correlation between cholesterol and triglycerides (r = 0.34, p < .0001) and that this relationship is not unduly influenced by the outliers in the untransformed data.
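Reporting on the original scale amounts to undoing the log after fitting. The sketch below fits a straight line to the natural log of the response and then applies EXP to the fitted line, as in the prediction formula above. The data are hypothetical and deliberately noise-free (generated from intercept 5.0 and slope 0.002) so that the recovered coefficients can be checked by eye; the names `fit_line` and `predict_cholesterol` are illustrative.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical, noise-free values generated from a known log-linear rule.
triglycerides = [50, 100, 150, 200, 250, 300]
cholesterol = [math.exp(5.0 + 0.002 * t) for t in triglycerides]

# Fit on the log scale (natural log, so EXP is the matching back-transform).
log_cholesterol = [math.log(c) for c in cholesterol]
intercept, slope = fit_line(triglycerides, log_cholesterol)

def predict_cholesterol(trig):
    """Prediction reported on the original scale."""
    return math.exp(intercept + slope * trig)

print("intercept:", round(intercept, 4), "slope:", round(slope, 5))
print("predicted cholesterol at triglycerides = 100:",
      round(predict_cholesterol(100), 1))
```

If a base-10 log had been used for the transformation, the back-transform would be 10 raised to the fitted value rather than EXP.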
Curvilinear models
Recall that the form of the relationship in the first scatterplot is assumed to be linear:
Y = intercept + slope × X + error
This is a straight line that crosses the Y-axis at the intercept, and has a trend indicated by the slope. The slope is the average increase in Y for every unit increase in X. But what if there is some curvature in the data, so that the relationship doesn't look strictly linear? Consider data from a Department of Transportation report used to predict the amount of police assistance (Total Load) needed based upon the time drivers need to divert to a specified route during an accident or construction (Figure 18). There appears to be a slight curve in the relationship: shorter and longer times require less load than the middle times.
Figure 18. Total Load by Minutes of Diversion
We can expand this to allow for a quadratic trend:
Y = intercept + linear × X + quadratic × X² + error
This is a curve that crosses the Y-axis at the intercept. Then the additive effects of X are twofold: First add a trend indicated by the linear slope (the average increase in Y for every unit increase in X) and then add a trend indicated by the quadratic slope (the average increase in Y for every unit increase in squared-X).
Fitting Quadratic trends
In the Fit Y by X platform, we can fit a quadratic trend in Y easily. To fit a quadratic curve in the scatterplot:
Choose Bivariate > Fit Polynomial > 2, quadratic
A new curve will appear, with associated reports (Figure 19).
Figure 19. Quadratic Fit of Total Load by Minutes of Diversion
Compare the residual plots for the first order (X) linear fit to the second order (X, X²) fit (Figure 20).
Figure 20. Residuals of Linear Fit, First Order on Left, Second Order on Right
The residuals of the first order fit have a curved trend; there are more points below the line on either end. The residuals of the second order fit, by contrast, appear more evenly distributed. In addition to second order models, third and fourth order models can also be used. Be cautious when adding higher order terms, as they can be difficult to interpret. You may wish to explore other types of modeling (non-linear, non-parametric) in lieu of using a linear model.
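The comparison of first and second order fits can be sketched outside JMP as well. The code below is a minimal least-squares polynomial fit via the normal equations in plain Python; `fit_polynomial` and `residuals` are illustrative helper names, and the minutes and load values are invented to have a rise-then-fall shape, not taken from the Department of Transportation report.

```python
def fit_polynomial(x, y, degree):
    """Least-squares coefficients [b0, b1, ..., b_degree] via the normal equations."""
    m = degree + 1
    power_sums = [sum(xi ** k for xi in x) for k in range(2 * degree + 1)]
    cross_sums = [sum(yi * xi ** k for xi, yi in zip(x, y)) for k in range(m)]
    # Augmented matrix: row i encodes sum_j power_sums[i + j] * b_j = cross_sums[i]
    a = [[power_sums[i + j] for j in range(m)] + [cross_sums[i]] for i in range(m)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for c in range(col, m + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution
    b = [0.0] * m
    for i in range(m - 1, -1, -1):
        known = sum(a[i][j] * b[j] for j in range(i + 1, m))
        b[i] = (a[i][m] - known) / a[i][i]
    return b

def residuals(x, y, coefficients):
    """Observed minus fitted values for the given polynomial coefficients."""
    return [yi - sum(b * xi ** k for k, b in enumerate(coefficients))
            for xi, yi in zip(x, y)]

# Hypothetical data with a rise-then-fall pattern, for illustration only.
minutes = [5, 10, 15, 20, 25, 30, 35, 40]
total_load = [12, 19, 25, 27, 28, 26, 22, 15]

linear = fit_polynomial(minutes, total_load, 1)
quadratic = fit_polynomial(minutes, total_load, 2)

sse_linear = sum(e ** 2 for e in residuals(minutes, total_load, linear))
sse_quadratic = sum(e ** 2 for e in residuals(minutes, total_load, quadratic))

print("first order residuals: ",
      [round(e, 1) for e in residuals(minutes, total_load, linear)])
print("second order residuals:",
      [round(e, 1) for e in residuals(minutes, total_load, quadratic)])
print("error sum of squares:", round(sse_linear, 1), "vs", round(sse_quadratic, 1))
```

A curved pattern in the first order residuals (negative at both ends, positive in the middle) is the numeric counterpart of the left-hand residual plot; solving the normal equations directly is adequate for a low-order sketch like this but is not how a statistics package would do it for higher orders.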
Summary
George Box said, "All models are wrong, but some are useful."
This means that models are a representation of reality. If they help us understand what might be the underlying relationship, they are useful. However, they may give us a simple understanding of what is probably a complex relationship. So, while seeking explanations, don't get wedded to the tools we use.
For instance, in the above example, minutes of diversion is likely not the only predictor of police assistance required. Does that make it real? Or true? Or even useful? On the other hand, it may be that a variable has a known relationship but the statistical test is not significant. There are lots of reasons why something may come out to be not significant. Don't confuse "not statistically significant" with "not important", or even with "not useful". A statistical relationship may describe something that isn't useful or practical, and something that is useful or practical may not always be statistically significant.
Section 21. Analysis of Variance
Overview
We've used the t-test to compare the means from two independent groups. Now we've come to the final topic of the course: how to compare means from more than two populations. When we were comparing the means from two independent samples we usually asked: Is one mean different than the other? Fundamentally, we're going to proceed in this section exactly as we did in Hypothesis testing on two means, up to a point. It's more complicated when we have more than two groups. So, we'll need to address those issues.
First, we'll need to have a clear understanding of the data and what we're testing. Both the t-test situation and the correlation/regression situation will help us understand the analysis of variance (ANOVA). The theory behind ANOVA is more complex than the two means situation, and so before we go through the step-by-step approach of doing ANOVA, let's get an intuitive feel for what's happening. How does ANOVA compare means? Since we've attained some understanding of the t-test and correlation, these will help us be comfortable with the answer to this question.
Model Comparisons
First we'll review what we learned when considering the correlation/regression question. We fit a straight line through the data in a scatterplot (when it was appropriate). There's an intuitive link between this situation and what we do when we compare means. This leads us to comparing models. Recall that in the correlation/regression situation there were three ways to phrase the question:
Non-zero correlation?
Non-zero slope?
Is the straight-line model better?
In the case of simple regression, the last form of this question is more complicated than we really need. However, this form of the simple regression question helps us understand the analysis of variance.
Fitting a straight line
In a previous section, we talked about comparing two models of the data.
H0: Y = Ȳ + error
HA: Y = intercept + slope × X + error
What we were doing was looking at the average Y response and wondering: Is the average Y constant across all values of X? Or does the average Y change with X? The way we visualized this comparison (Correlation and Regression handout) was to look at the confidence band around our straight-line model and compare it to the horizontal line drawn at the mean of Y (Figure 1).
Figure 1. Regression Example (AMB)
Comparing group means
When we're comparing the average Y response in different groups, we're asking a similar question. Say we have four groups (X = a, b, c, or d); then the two models would be:
H0: Y = Ȳ + error, or
HA: Y = Ȳa + error (if X = a),
Y = Ȳb + error (if X = b),
Y = Ȳc + error (if X = c),
Y = Ȳd + error (if X = d).
What we are doing is looking at the average Y response and wondering: Is the average Y constant across all values of X? Or does the average Y change with X? The way we visualize this comparison is to look at the confidence bounds around each mean and compare them to the horizontal line drawn at the grand mean of Y (Figure 2).
Figure 2. Example of Four Means and Grand Mean (AMB)
Example to illustrate comparing means
Consider two different experiments designed to compare three different treatment groups. In each experiment, five subjects are randomly allocated to each of the three groups (15 subjects in all). After the experimental protocol is completed, their response is measured. The two experiments use different measuring devices, Narrow and Wide (Table 1).
Table 1. Group Data (AMB)
We see their responses, shown by dot plots, in the two figures below (Figure 3).
Figure 3. Group Data for Narrow and Wide (AMB)
In both experiments the mean response for Group 1 is Ȳ1 = 5.9, for Group 2 is Ȳ2 = 5.5, and for Group 3 is Ȳ3 = 5.0. In the first experiment (the one on the left, measured by the variable Narrow), are the three groups different? In the second experiment (the one on the right, measured by the variable Wide), are the three groups different?
The interocular-trauma test (AMB)
Using the tried and true statistic called the interocular-trauma test (the difference hits you between the eyes), the three groups seem to have different means in the first experiment. In the second experiment, by contrast, the best explanation would probably be that any apparent differences in the means could have come about through chance variation.
In fact, what your eye is doing is comparing the differences between the means in each group to the differences within each group. That is, in the Narrow experiment a summary of the results would include a table of means and SDs (Table 2).
Table 2. Summary Statistics for Narrow Group Data
So, a difference as large as 5 vs. 5.9 is considered in the context of very small SDs (on the order of 0.013). The 0.9 unit difference is compared to standard errors of approximately 0.0058. Consider comparing the two groups with a t-test; we'd calculate something like:
t ≈ (5.9 - 5.0) / (2 × 0.0058) ≈ 78
A t-value of 78 is huge; clearly a significant difference.
The other experiment
But in the Wide experiment, a summary of the results would include a table of means and SDs as below (Table 3).
Table 3. Summary Statistics for Wide Group Data
So, a difference as large as 5 vs. 5.9 is considered in the context of large SDs (on the order of 1.5). The 0.9 unit difference is compared to standard errors of approximately 0.68. Comparing the two groups with a t-test, we'd calculate something like:
t ≈ (5.9 - 5.0) / (2 × 0.68) ≈ 0.66
The 2 in the denominator is a rough allowance for the fact that the difference between two means carries the uncertainty of both means. (Strictly, it is the squared standard errors that add, so the standard error of the difference is sqrt(0.68² + 0.68²) ≈ 0.96 rather than 2 × 0.68 = 1.36; the shortcut errs on the conservative side.) A t-value of 0.66 is not remarkable; no evidence for a significant difference.
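As a rough check on the two back-of-the-envelope t-values above, here is a minimal Python sketch. The function names are our own (hypothetical), and the standard errors are the approximate values quoted in the text. The second function uses the usual textbook denominator, in which the squared standard errors add, so that it can be compared with the doubling shortcut.

```python
import math


def rough_t(mean_a, mean_b, se):
    """Shortcut used in the text: difference divided by twice the common SE."""
    return (mean_a - mean_b) / (2 * se)


def two_sample_t(mean_a, mean_b, se_a, se_b):
    """Textbook form: the squared standard errors add, then take the square root."""
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Approximate standard errors quoted for the two experiments.
for label, se in [("Narrow", 0.0058), ("Wide", 0.68)]:
    shortcut = rough_t(5.9, 5.0, se)
    textbook = two_sample_t(5.9, 5.0, se, se)
    print(f"{label}: shortcut t = {shortcut:.2f}, textbook t = {textbook:.2f}")
```

Either way the conclusion is the same: the Narrow t-value is enormous and the Wide t-value is below 1.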
Understanding the F-test
What ANOVA does is compare the differences between the means in each group to the differences within each group's observations. This is an extremely important concept because it's key to your understanding of the statistical test we use. Recall the F test we used when fitting a straight line. We used:
F = MSmodel / MSerror
It's a ratio of the mean square of the model (which compares the straight-line predicted values to the values predicted by the overall mean) to the mean square error (which compares the straight-line predicted values to the observed values). This is exactly what we do when we compare means. We use an F test that compares the differences between the means in each group to the differences within each group. The notation changes somewhat:
F = (SSmodel / dfmodel) / (SSerror / dferror)
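SS model
First, consider the differences between the means. With nj observations in group j, group mean Ȳj, and grand mean Ȳ = 5.467, the SSmodel is:
SSmodel = Σ nj (Ȳj - Ȳ)² = 5 × [(5.9 - 5.467)² + (5.5 - 5.467)² + (5.0 - 5.467)²] = 2.0333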
So the differences between the means in each group are summarized by the sum of the weighted squared deviations (SSmodel = 2.0333) divided by the number of groups minus one (df = groups - 1 = 2). So the numerator for the F test is SSmodel/dfmodel = 2.0333/2 = 1.0167 = MSmodel.
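The weighted sum of squares between the group means can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the inputs are the three group means (5.9, 5.5, 5.0) and the group size (n = 5) from the example.

```python
def ss_model(group_means, group_sizes):
    """Sum over groups of n_j * (group mean - grand mean)^2."""
    total_n = sum(group_sizes)
    # The grand mean is the size-weighted average of the group means.
    grand_mean = sum(n * m for m, n in zip(group_means, group_sizes)) / total_n
    return sum(n * (m - grand_mean) ** 2 for m, n in zip(group_means, group_sizes))


means = [5.9, 5.5, 5.0]
sizes = [5, 5, 5]

ss = ss_model(means, sizes)
df_model = len(means) - 1          # number of groups minus one
ms_model = ss / df_model           # numerator of the F test

print(f"SSmodel = {ss:.4f}, df = {df_model}, MSmodel = {ms_model:.4f}")
```

Because the means and group sizes are the same in the Narrow and Wide experiments, this numerator is shared by both F tests.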
SS error in the Narrow experiment
In the Narrow experiment, we consider the differences within each group. We take each observation Yij and compare it to its group mean Ȳj:
SSerror = Σj Σi (Yij - Ȳj)²
So the differences within all groups are summarized by the sum of the squared deviations (SSerror = 0.0022) divided by the number of values minus the number of groups (df = n - groups = 15 - 3 = 12). So the denominator for the F test is SSerror/dferror = 0.0022/12 = 0.00018 = MSerror. Again, what we're doing is comparing the differences between the means in each group to the differences within each group. So the F test we use when comparing means is:
F = MSmodel / MSerror = 1.0167 / 0.00018, an F ratio of more than 5,000.
Clearly, this is a significant F.
SS error in the Wide experiment
In the Wide experiment, by contrast, the deviations of the observed values from the group means are much larger.
So the differences within the Wide groups are summarized by the sum of the squared deviations (SSerror = 27.98) divided by the number of values minus the number of groups (df = 12). So the denominator for the F test is SSerror/dferror = 27.98/12 = 2.332 = MSerror.
We're comparing the differences between the means in each group to the differences within each group. So the F test in the Wide experiment is:
F = MSmodel / MSerror = 1.0167 / 2.332 ≈ 0.44
Clearly no evidence for a difference.
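To see the two F ratios side by side, the sketch below plugs the sums of squares quoted in the text into a small helper. The helper name is hypothetical, and the inputs are the rounded values from the handout, so the Narrow result in particular is only approximate.

```python
def f_ratio(ss_model, df_model, ss_error, df_error):
    """F = (SSmodel / dfmodel) / (SSerror / dferror)."""
    ms_model = ss_model / df_model
    ms_error = ss_error / df_error
    return ms_model / ms_error


# Same SSmodel in both experiments; only the within-group spread differs.
f_narrow = f_ratio(ss_model=2.0333, df_model=2, ss_error=0.0022, df_error=12)
f_wide = f_ratio(ss_model=2.0333, df_model=2, ss_error=27.98, df_error=12)

print(f"Narrow: F = {f_narrow:.0f}")
print(f"Wide:   F = {f_wide:.2f}")
```

The numerator is identical in the two calls; the whole difference between a huge F and an unremarkable one comes from the denominator, the variation within groups.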
Understanding ANOVA
ANOVA is applicable when the response variable is continuous and we have more than two groups to compare. Our two intuitive understandings of the analysis of variance are as follows. First, ANOVA compares two models: one overall grand mean vs. different means for each group. Second, it does this by comparing the differences between the means in each group to the differences of the individual values within each group.
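The two views of ANOVA can be tied together with a short Python sketch. The data here are made up for illustration (they are not the values in Table 1, although the group means are 5.9, 5.5, and 5.0), and the function name is our own. The error left over by the single-mean model is split into a part explained by the group means and a part left within groups.

```python
from statistics import mean


def one_way_anova(groups):
    """Compare the grand-mean model with the separate-means model."""
    all_values = [y for g in groups for y in g]
    grand_mean = mean(all_values)

    # Error left over if every observation is predicted by the grand mean.
    ss_total = sum((y - grand_mean) ** 2 for y in all_values)
    # Error left over if every observation is predicted by its group mean.
    ss_error = sum((y - mean(g)) ** 2 for g in groups for y in g)
    # Improvement gained by allowing a different mean for each group.
    ss_model = ss_total - ss_error

    df_model = len(groups) - 1
    df_error = len(all_values) - len(groups)
    f = (ss_model / df_model) / (ss_error / df_error)
    return {"ss_model": ss_model, "ss_error": ss_error,
            "df_model": df_model, "df_error": df_error, "f": f}


# Hypothetical data: three groups of five.
groups = [
    [5.8, 6.0, 5.9, 6.1, 5.7],
    [5.4, 5.6, 5.5, 5.3, 5.7],
    [4.9, 5.1, 5.0, 5.2, 4.8],
]
result = one_way_anova(groups)
print(result)
```

Turning the F ratio into a p-value requires the F distribution, which JMP (or any statistics package) supplies; the sketch stops at the ratio itself.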
ANOVA: More than Two Sample Means
Calcium and Weight: Let's look at a study where four different diets may affect the mean weight of male rats. Researchers (randomly?) divided 7-week-old rats into four groups: untreated controls, high calcium diet (Ca), deoxycorticosterone-NaCl treated rats (DOC), and rats receiving both dietary supplements (DOC+Ca). The question is: do the four conditions have differing effects on the mean weight of Wistar-Kyoto rats?
As always, begin by looking at the data.
Phase 1: State the Question
1. Evaluate and describe the data
Recall our first two questions: Where did this data come from? What are the observed statistics? The first step in any data analysis is evaluating and describing the data.
Preliminary Analysis
What are the observed statistics? Use the Fit Y by X platform in JMP to look at a graphical and tabular summary of the data, as in Figure 4.
Figure 4. Rat Weight Data
Oneway Analysis of Weight By Diet
The dot plots show some differences between groups and spread within each group. Think back to what we'd do next if we were comparing two means. We know we should concern ourselves with two assumptions: equal variance within each group, and normality. In the two-group situation, how did we assess these assumptions?
Normal Quantile Plots
The normal quantile plots for these data appear in Figure 5. Follow the same interpretation of the normal quantile plot as before. Here we see the points lined up along the lines, and the lines are roughly parallel. We can proceed assuming normality and equal variance.
Figure 5. Normal Quantile Plot for the Groups
Preliminary analysis, showing means
If the data is normally distributed then means and SDs make sense. (If these distributional assumptions are unwarranted, then we should consider nonparametric methods.) The next thing to do in our preliminary analysis is to show the means and standard deviations calculated within each group. In Figure 6 we see these values.
Figure 6. Mean and SDs for Rat Data
This figure shows the means and one standard-error error bars. The dashed lines above and below the means are one standard deviation away from their respective mean.
This summary table is repeated in Figure 7. We use it to describe the number of observations in each group, the mean of each group, and the standard deviation within each group.
Figure 7. Summary Report for Rat Data
Note that the 95% CI on the means are shown. Recall that these CIs do NOT assume equal variability.
Summing up
What have we learned about the data? We have not found any errors in the data. We're comfortable with the assumptions of normality and equal variance. We've obtained descriptive statistics for each of the groups we're comparing. It's our guess that there is a significant difference. So, we're ready for step 2.
2. Review assumptions
As always there are three questions to consider. Is the process used in this study likely to yield data that is representative of each of the four populations? Is each animal in the samples independent of the others? Is the sample size within each group sufficient?
Bottom line: We have to be comfortable that the first two assumptions are met before we can proceed at all. If we're comfortable with the normality assumption, then we proceed, as below. In a following section, we'll discuss what to do when normality cannot be safely assumed.
3. State the question in the form of hypotheses
Presuming that we're OK with normality, here are the hypotheses (using group names as subscripts):
The null hypothesis is μCa = μDOC = μDOC+Ca = μcontrol (the means of all populations are the same).
The alternative hypothesis is that not all the means are equal (there is at least one difference).
Phase 2: Decide How to Answer the Question
4. Decide on a summary statistic that reflects the question
We cannot use a t-test because there are more than two means. Note also that it is completely inappropriate to use multiple t-tests! We use the F-test discussed above to compare the differences between the means to the differences within each group. As in the case when two groups are compared, we need to concern ourselves with whether the variances are the same in each group. There are, as before, two possibilities, and they depend upon the standard deviations within each group. Are they the same? Or do the groups have different standard deviations?
Assuming the populations have equal variances
If the standard deviations (or variances) within the populations are equal, then the pooled variance estimate in the denominator of the F-test is appropriate, and we calculate the p-value using the distribution of F. The distribution is complex, but recall that there are two df needed: the df-numerator (which is the number of groups minus one), and the df-denominator (the number of subjects minus the number of groups).
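The two degrees of freedom are simple enough to compute by hand, but a tiny Python helper makes the bookkeeping explicit. The function name is hypothetical. The first call uses the three-group example with 15 subjects; the second uses the rat study, assuming the total of n = 61 animals in four groups that the handout reports in its write-up.

```python
def anova_degrees_of_freedom(n_subjects, n_groups):
    """Return (df-numerator, df-denominator) for a one-way ANOVA F-test."""
    if n_groups < 2 or n_subjects <= n_groups:
        raise ValueError("need at least two groups and more subjects than groups")
    return n_groups - 1, n_subjects - n_groups


print(anova_degrees_of_freedom(15, 3))   # three groups of five
print(anova_degrees_of_freedom(61, 4))   # rat study: four diet groups
```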
Allowing the populations to have unequal variances
If the variances are not equal, the calculation is more complicated. However, like the multiple proportions example, JMP handles the calculation details.
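JMP does this calculation for us, but a sketch of the standard textbook form of Welch's unequal-variance F may make it less mysterious. We are assuming that textbook form here, not JMP's exact implementation; the function name and the data are hypothetical. Each group mean is weighted by n divided by its own variance, so noisy groups count for less, and the denominator degrees of freedom are adjusted downward (and are usually not a whole number).

```python
from statistics import mean, variance


def welch_anova(groups):
    """Welch's F for comparing means without assuming equal variances.

    Returns (F, df-numerator, df-denominator).
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Weight each group by n_j / s_j^2 (its own sample variance).
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    w_total = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / w_total

    between = sum(w * (m - weighted_mean) ** 2
                  for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / w_total) ** 2 / (n - 1)
              for w, n in zip(weights, sizes))

    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df_denominator = (k ** 2 - 1) / (3 * lam)
    return f_stat, k - 1, df_denominator


# Hypothetical data with visibly different spreads.
example = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10, 12], [3, 3, 4, 5, 5]]
print(welch_anova(example))
```

With only two groups, this F reduces to the square of the unequal-variance t statistic, which is a handy sanity check on the formula.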
Deciding on the correct test
Which test should we use? We may not need to choose; if all the sample sizes are equal (termed a balanced design), the two methods generally give similar results. It's even pretty close if the ns are slightly different. If one n is more than 1.5 times another, you'll have to decide which F-test to use. Here are the steps:
Decide whether the standard deviations are different.
Use the equal variance F-test if the SDs appear the same, or
Use the unequal variance Welch ANOVA if the SDs appear different.
As before, look at the normal quantile plot. If the lines are in that gray area between clearly parallel and clearly not parallel, what do we do? Four possibilities come to mind:
o Ignore the problem and be risky: use the equal variance F-test.
o Ignore the problem and be conservative: use the unequal variance F-test.
o Make a formal test of unequal variability in the groups.
o Compare the groups using nonparametric methods.
Test for equal standard deviations
JMP IN provides a way to test for equal variance. The results are shown in Figure 8.
Figure 8. Testing for Equal Variance
Choosing between the tests of equal variance
If the Prob>F value for the Brown-Forsythe test is < 0.05, then you have evidence of unequal variances. This report also shows the result for the F-test to compare the means, allowing the standard deviations to be unequal. This is the Welch ANOVA. Here is a written summary of the results using this method: The four groups were compared using an unequal variance F-test and found to be significantly different (F(3, 31) = 14.2, p-value < .0001). The means were found to be different.
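The Brown-Forsythe test is commonly described as an ordinary one-way ANOVA carried out on the absolute deviations of each observation from its group median. The sketch below follows that description; it is not JMP's code, the function names are ours, and the data are made up. If the groups have similar spread, the average absolute deviations are similar and the F ratio is small.

```python
from statistics import mean, median


def anova_f(groups):
    """Equal-variance one-way ANOVA: returns (F, df-numerator, df-denominator)."""
    values = [y for g in groups for y in g]
    grand_mean = mean(values)
    ss_model = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_error = sum((y - mean(g)) ** 2 for g in groups for y in g)
    df_model = len(groups) - 1
    df_error = len(values) - len(groups)
    return (ss_model / df_model) / (ss_error / df_error), df_model, df_error


def brown_forsythe(groups):
    """ANOVA on absolute deviations from each group's median."""
    abs_deviations = [[abs(y - median(g)) for y in g] for g in groups]
    return anova_f(abs_deviations)


# Hypothetical data: same centre-to-centre pattern, very different spreads.
tight = [1, 2, 3, 4, 5]
loose = [10, 20, 30, 40, 50]
print(brown_forsythe([tight, loose]))
```

As with the ordinary F-test, the Prob>F value that JMP reports comes from comparing this ratio to the F distribution with the two df shown.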
5. How could random variation affect that statistic?
Recall the rough interpretation that Fs larger than 4 are remarkable.
6. State a decision rule, using the statistic, to answer the question
The universal decision rule: Reject H0 if p-value < α.
Phase 3: Answer the Question
7. Calculate the statistic
There are three possible statistics that may be appropriate: an equal variance F-test, an unequal variance F-test, or the nonparametric Kruskal-Wallis (Wilcoxon rank-sum) test.
Equal variance
If the equal variance assumption is tenable then the standard F-test is appropriate. (Note: When reporting an F-test it's assumed that, unless you specify otherwise, it's the equal-variance F-test.) Figure 9 shows the means diamonds in the dot plot. We've seen the means diamonds in the situation with a two-group t-test. As the JMP Help shows, they represent the averages and a 95% confidence interval for each group. Recall that we talked about the relationship between confidence intervals and the two-group t-test. We said that you can interpret confidence intervals as follows: If two confidence intervals do not overlap (vertically) then the two groups are different. Since the control group's confidence interval does not overlap that of any of the other diet groups, we'd be safe concluding that the control group is significantly higher than each of the other groups. Since we see at least one apparent difference, hopefully the F-test will bear that out. The F Ratio appears in the Analysis of Variance report (Figure 10).
Figure 9. Means Diamonds
Figure 10. ANOVA Report
So, the F is large (F = 11.99, with df = 3, 57), and the p-value is small (p-value < 0.0001).
Note that the ns and means in the report are the same as in Figure 7. However, the standard errors are different. As the note says, these standard errors use the pooled estimate of variance, whereas the standard errors in the Means and Std Errors report in Figure 7 simply calculate the standard deviation within each group and divide by the square root of each n. Recall that the F-test, like a two-tailed test, has no direction of the difference. Since the null hypothesis specified a test for equality, this is the p-value we want.
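The difference between the two kinds of standard error is easy to see in a small Python sketch. The function names and the data are hypothetical. The per-group version uses each group's own standard deviation; the pooled version uses the square root of the mean square error for every group, so groups of the same size get the same standard error.

```python
import math
from statistics import mean, stdev


def per_group_standard_error(group):
    """SD within this one group divided by the square root of its n."""
    return stdev(group) / math.sqrt(len(group))


def pooled_standard_errors(groups):
    """Root mean square error (pooled SD) divided by sqrt(n) for each group."""
    df_error = sum(len(g) for g in groups) - len(groups)
    ss_error = sum((y - mean(g)) ** 2 for g in groups for y in g)
    pooled_sd = math.sqrt(ss_error / df_error)
    return [pooled_sd / math.sqrt(len(g)) for g in groups]


# Hypothetical data: one tight group and one spread-out group.
data = [[301, 305, 309], [320, 340, 360]]
print([round(per_group_standard_error(g), 2) for g in data])
print([round(se, 2) for se in pooled_standard_errors(data)])
```

The pooled value sits between the smallest and largest per-group values, which is why the two reports can disagree even though the ns and means match.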
Nonparametric comparison of the means
If we wish to compare the medians, we don't have to make any normality assumptions. So, we use a nonparametric test based solely on the ranks of the values of the Y-variable. The Kruskal-Wallis test (the extension of the Wilcoxon rank-sum test to more than two groups) simply ranks all the Y-values and then compares the sum of the ranks in each group. If the groups in fact have the same median, then the average rank in each group should be about equal.
Figure 11. Wilcoxon Rank Sum Test Report
When reporting the results of a nonparametric test, it's usual to only report the p-value, although reporting the chi-square value is also appropriate. A summary sentence: The groups were compared using the nonparametric Kruskal-Wallis rank sum test and found to be different (chi-square = 23, df = 3, p-value < 0.0001).
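The chi-square value in the report is the Kruskal-Wallis H statistic. A minimal Python sketch of the usual textbook calculation follows; the function names and the data are hypothetical, tied values receive the average of their ranks, and the standard tie correction is applied. This is not JMP's code.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of the 1-based ranks i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return (H, df); H is compared to a chi-square with df = groups - 1."""
    values = [y for g in groups for y in g]
    n_total = len(values)
    ranks = average_ranks(values)

    weighted_rank_sums = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted_rank_sums += rank_sum ** 2 / len(g)
        start += len(g)

    h = 12 / (n_total * (n_total + 1)) * weighted_rank_sums - 3 * (n_total + 1)
    # Tie correction (equals 1 when there are no ties; it would be 0, and the
    # division would fail, only if every value were identical).
    ties = sum(t ** 3 - t for t in Counter(values).values())
    correction = 1 - ties / (n_total ** 3 - n_total)
    return h / correction, len(groups) - 1


# Hypothetical, completely separated groups.
print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```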
8. Make a statistical decision
Using all three tests, the groups are different. (All p-values are less than 0.05.)
9. State the substantive conclusion
The four diet groups do not all have the same mean weight.
Phase 4: Communicate the Answer to the Question
10. Document our understanding with text, tables, or figures
A total of n = 61 Wistar-Kyoto (WKY) rats were assigned to one of four dietary groups: untreated controls, high calcium diet (Ca), deoxycorticosterone-NaCl treated rats (DOC), and rats receiving both dietary supplements (DOC+Ca). The ANOVA test indicated that the groups were significantly different (F(3, 57) = 12, p < .0001). (Show a table of summary statistics and a plot of the means.)
So Far: Multiple Independent Means
Briefly, here is how we proceeded when comparing the means obtained from multiple independent samples.
Describe the groups and the values in each group. What summary statistics are appropriate? Are there missing values? (Why?)
Assess the assumptions, including normality and equal variance. If normality is warranted, then it may be useful to determine confidence intervals on each of the means.
Perform the appropriate statistical test.
Determine the p-value that corresponds to your hypothesis.
Reject or fail to reject?
State your substantive conclusion.
Is that it? What about where the means are different, and which groups are the same?
Considering the means
If we look at the ordered means, it would appear that, from lightest to heaviest, the four groups are: DOC+Ca (303 gm), DOC (309 gm), Ca (321 gm), and control (343 gm). Questions:
Is DOC+Ca significantly different than DOC?
Is DOC+Ca significantly different than Ca?
Is DOC+Ca significantly different than control?
Is DOC significantly different than Ca?
Is DOC significantly different than control?
Is Ca significantly different than control?
All we've decided is that there is a difference somewhere; that not all four of the means are equal. But we're far from explaining where the difference(s) lie. We have an impression that the control group is higher than the others, but are there any other differences?
Graphical comparison
Recall the interpretation of means diamonds. Since they are confidence intervals, if they don't overlap then the groups are different. However, what if they do overlap? Are the groups different? The answer is: if they overlap only a little, the groups may still be different. How much is a little? The smaller horizontal lines within the means diamonds are one way to tell. These are called significance overlap lines. If we pay attention to whether the areas within the overlap lines in two groups separate, then we can (roughly) see whether the groups are different (Figure 12).
Figure 12. Means and Overlap Rat Data
In the above figure, we show the lower limit of the Ca group's overlap line. Extending it across to the right we see that the DOC group's overlap lines are within the lower Ca limit. We also see that the lower Ca limit is above the DOC+Ca group's overlap lines. From this we gain the impression that the Ca and DOC groups may not be different but the Ca and DOC+Ca groups may be different. These visual impressions are important. They give us leads to what the real answers to the group comparison questions will be. But we need a definitive answer.
All possible t-tests
One solution to this problem is to do all possible t-tests. [Note: we do not do this in practice. It is NOT a good idea.] We could answer each of the six questions above as though we had done a series of two-group studies. JMP can do this (Figure 13).
Figure 13. All Possible t-tests
All possible two-group t-tests
The circles to the right of the dot plot are called comparison circles. Clicking on one of the circles asks JMP the question, "Which group is different from this one?" The highlighted group turns red. Other groups that have a different color (gray or black) are significantly different. Other groups that do not have a different color (they are red too) are not significantly different. From the above figure (where control is the highlighted, red group), we see that the control group is significantly higher than each of the others.
Before we make too much of these statistically significant differences, you should know that you shouldn't believe them.
Multiple comparisons
The problem is this: If you torture the data long enough, they will eventually confess. The more statistical tests we do, the more likely it is that one will come out significant by chance. We didn't have this problem in the two-group t-test situation. We just had two groups and there were two alternatives: the groups were not different or the groups were different. There was only one t-test, and it was controlled by a Type I error rate, α = 0.05. We only say "significant difference!" when it's not true 5% of the time. With three groups, there are three possible t-tests to do. If each test has α = 0.05, then the probability of saying "significant difference!" at least once when it's not true is more than 5%. Doing each test at α = 0.05 does not yield an experiment whose overall Type I error rate is α = 0.05. As Table 14 shows, the probability of making this mistake goes up to over 14%. This is not good.
Table 14. Error Rates for Multiple Comparisons
This is what is called the multiple comparison problem. So, even if there really is no difference, if you do the three t-tests required to compare three groups, at least one of them will appear significant, by chance alone, over 14% of the time. If we have 6 groups, this error rate is over 50%, and by the time we're up to 10 groups we're virtually assured of finding a significant difference. We need to correct for the number of tests we are performing so that the error rate stays at 5%.
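The error rates quoted above can be reproduced with a few lines of Python. This sketch rests on a simplifying assumption, namely that the pairwise tests behave as if they were independent; that is only an approximation for t-tests that share groups, but it gives the familiar figures. The function names are hypothetical.

```python
from math import comb


def number_of_pairwise_tests(n_groups):
    """How many two-group comparisons are possible among n_groups groups."""
    return comb(n_groups, 2)


def familywise_error_rate(n_groups, alpha=0.05):
    """Chance of at least one false 'significant' result across all pairs,
    treating the tests as independent (an approximation)."""
    k = number_of_pairwise_tests(n_groups)
    return 1 - (1 - alpha) ** k


for g in (2, 3, 6, 10):
    k = number_of_pairwise_tests(g)
    print(f"{g} groups: {k} tests, error rate = {familywise_error_rate(g):.3f}")
```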
The Bonferroni correction
The Bonferroni correction says this: If we want to operate with α = 0.05, then count up the number of comparisons we're doing (the second column in the table above; call this number k) and do each t-test at p-value < (0.05/k). So, with three groups there are three comparisons (k = 3), so compare the p-values to 0.05/3 = 0.01667.
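The Bonferroni cutoff is just α divided by the number of pairwise comparisons, which a short Python sketch can tabulate. The function name is our own, and we assume that every possible pair of groups is being compared.

```python
from math import comb


def bonferroni_threshold(n_groups, alpha=0.05):
    """Per-test p-value cutoff when all pairs of groups are compared."""
    k = comb(n_groups, 2)      # number of pairwise comparisons
    return alpha / k


for g in (3, 6, 10):
    print(f"{g} groups: k = {comb(g, 2)}, compare p-values to {bonferroni_threshold(g):.5f}")
```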
With 6 groups (k = 15), p-values would have to be less than 0.00333. With 10 groups (k = 45), only huge differences with p-values < 0.0011 would be declared significant. This approximation is simple and works fairly well as long as you don't mind how conservative it is. It's very hard to find a significant difference using Bonferroni-corrected p-values. (On the other hand, if you can declare a difference with this severe a penalty, it's believable.) There is a better way to handle the multiple comparison problem.
JMP notes
To compare means after a significant ANOVA result, make one of the compare choices in the Analysis popup below the dot plot.
To do all possible t-tests between the means (this is not a good idea), choose Oneway Compare Means > Each pair, Student's t and then use Bonferroni's correction by changing alpha: choose Oneway Set alpha level > Other.
To compare all possible pairs of means while keeping the experiment-wise Type I error rate at 0.05, choose Oneway Compare Means > All pairs, Tukey's HSD.
To compare each of the means to one control group mean, highlight at least one point in the control group (click on it) and then choose Oneway Compare Means > With Control, Dunnett's.
Compare each pair
Doing all possible t-tests without some sort of modification of significance level is a bad idea. Then what should we do? Use the Tukey-Kramer honestly significant difference (HSD) test. The HSD takes a much more sophisticated approach to this problem. The HSD looks at the distribution of the biggest difference, that is, the test comparing the smallest mean to the largest mean. The details are not important. We just need to know two things:
The observed differences are compared to an appropriate standard that's larger than the t-test standard.
The standard is arrived at so that the overall experiment-wise error rate is 0.05.
You can believe the HSD results. When Oneway Compare Means > All pairs, Tukey's HSD is chosen, the figure changes to show the HSD comparison circles (Figure 15).
Figure 15. Tukey's Comparison
Notice how, by an uncorrected t-test, the Ca and DOC+Ca groups are declared significantly different. By the HSD, the Ca and DOC+Ca groups are not different. Notice also that the comparison circles for HSD are larger than the Student's t comparison circles. This is a reflection of the higher standard for calling a difference significant.
Bottom line: If we're interested in all possible differences between the groups, use Compare all pairs, Tukey's HSD to compare means.
The HSD results
Below, we see the comparison circles after clicking on each group's circle (Figure 16).
Figure 16. All Group Comparisons with Tukey
Note that we don't have to click on every circle to answer all our questions. In addition to the comparison circles, a report is also available (Figure 17).
Figure 17. Means Comparisons JMP Report
Means Comparisons
The Means Comparisons report first shows a table of all possible ordered differences between the means. It's arranged so that the smallest differences are just off the diagonal and the largest difference is in the upper right-hand (or lower left-hand) corner. The bottom part of the report is rather busy. What you need to know is how to read it, and that is described in the note at the bottom of the report: "Positive values [in the lower report] show pairs of means that are significantly different."
[Irritation: The one thing that's missing is confidence intervals. That is, Ca is significantly different than control by 21.7 grams. We'd like to say we're 95% confident that the difference between the two means lies in the interval between ___ and ___. But JMP doesn't give us these intervals.] In any event, what's the conclusion? The F-test says there is a difference. Which groups are different?
Comparison with a control
If we're not interested in all the possible mean differences then the HSD is too conservative. The most common situation when this occurs is when the experiment is only interested in whether a mean is different than a pre-planned control group. That is, if these are the only questions of interest: Is DOC+Ca significantly different than control? Is DOC significantly different than control? Is Ca significantly different than control? Then there is another option available. Dunnett's comparison has to protect against fewer comparisons and so it's easier to find a difference (compared to the HSD, Figure 18).
Figure 18. Dunnett's Comparison
Click on any point in the control group and then make the choice in the Analysis popup.
Phase 4: Communicate the Answer to the Question
10. Document our understanding with text, tables, or figures
Now we complete the description of our conclusions begun earlier:
Using Tukey's HSD, it was determined that the control diet had significantly higher weight (mean = 343 gm) than each of the special-diet groups. Within the three special-diet groups, DOC+Ca, DOC, and Ca (combined mean* = 311 gm, SD = 21.6), there was no significant difference.
The earlier summary table of means, SDs and CIs is probably sufficient. However, if we wanted to show a figure, the best depiction using JMP would be the dot plot and CIs represented by the means diamonds.
* Note: There are any number of ways to calculate the combined mean and SD. Can you figure out one that works for you?
With some manual work in a graphics editor, the above figure can be redone to show the 95% confidence intervals in the more usual style (Figure 19).
Figure 19. Data Points with Means and 95% CI
Note that the picture frame has also been removed (chart junk). The sharp-eyed reader will also notice that these CIs are the correct ones.
Summary: Comparing Multiple Independent Means
Briefly, here is how to proceed when comparing the means obtained from multiple independent samples.
Describe the groups and the values in each group. What summary statistics are appropriate? Are there missing values? (Why?)
Assess the assumptions, including normality and equal variance. If normality is warranted, then it may be useful to determine confidence intervals on each of the means.
Perform the appropriate statistical test.
o If normality is unwarranted, then use the rank-sum (Kruskal-Wallis) test (or consider transforming the Y-variable to make it more normal).
o If normality and equal variance are apparent, then use the F-test.
o If normality and unequal variance are apparent, then use the Welch ANOVA F-test (or consider transforming the Y-variable to equalize variance).
Determine the p-value that corresponds to your hypothesis. Null hypothesis: No difference.
Reject or fail to reject? Here is where the process diverges.
o If we fail to reject: There is no evidence for more than one mean. Report the single mean, etc.
o If we reject: Then use the appropriate multiple comparison test to determine which groups are significantly different. Report means, etc., that reflect the pattern that is evident.
Indeterminate results
Note: The following scenarios are possible.
The F-test is not significant but one or more of the group comparisons is significant.
o No fair; you weren't supposed to look at the group comparisons if the overall test was not significant. Fishing is not allowed.
The F-test is significant but none of the group comparisons is significant. In other words, the F-test says there is a difference but we can't find it. This will sometimes happen (and it's irritating when it does). All you should do is report it (or redo the study with larger n). What people really do is fish until they find a plausible conclusion (be careful how you report this).
The F-test is significant. When the group comparisons are considered, the pattern of the means is difficult to interpret. It's even possible, especially when the n's in each group are different, that groups that should be different aren't.
o For example, with a three-group study, it's possible that the multiple comparison tests indicate that: Group A > Group B, and Group B > Group C, but Group A is not > Group C (!?)
o All you can do is report the overall F-test and the results, or just the F-test, and re-run the study in a balanced fashion with larger n.
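The reporting logic in these scenarios can be written down as a small sketch. The function name classify_outcome, the pair labels, and the p-values are all hypothetical; the pairwise p-values are assumed to come from whatever multiple comparison test was chosen. The point of the sketch is the ordering: the pairwise results are not inspected at all unless the overall test is significant.

```python
def classify_outcome(overall_p, pairwise_p, alpha=0.05):
    """Map test results onto the reporting scenarios described above.

    pairwise_p maps a pair label such as "A vs B" to the p-value from the
    chosen multiple comparison test.
    """
    if overall_p >= alpha:
        # No fishing: pairwise results are deliberately ignored here.
        return "fail to reject: report a single mean"
    significant = sorted(pair for pair, p in pairwise_p.items() if p < alpha)
    if not significant:
        return "F-test significant, no pair differs: report it or redo with larger n"
    return "F-test significant, differing pairs: " + ", ".join(significant)


# Made-up p-values for illustration only.
pairs = {"A vs B": 0.01, "B vs C": 0.20, "A vs C": 0.03}
not_significant = classify_outcome(0.30, pairs)
some_pairs = classify_outcome(0.02, pairs)
no_pairs = classify_outcome(0.04, {"A vs B": 0.08, "B vs C": 0.40, "A vs C": 0.06})
```

Here the first call returns the single-mean message even though two of the made-up pairwise p-values fall below 0.05, because the overall test was not significant.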