Statistical Methods I (EXST 7005) Page 68

Notes on SAS PROC UNIVARIATE
Note that all values we calculated match the values given by SAS.
Note that the standard error is called the “Std Error Mean”. This is unusual; it is called the “Std
Error” in most other SAS procedures.
The test statistic value matches our calculated value (0.840).
SAS also provides “Pr > |t| 0.4226”.
The value provided by SAS is a P value (Pr > |t| = 0.4226), meaning that the calculated value of t = 0.840 would leave 0.4226 (or 42.26 percent) of the distribution in the 2 tails (half in each tail).

[Figure: t distribution showing the calculated value falling between the lower and upper critical regions.]

The two tailed split is indicated by the absolute value signs around t, so the proportion in each tail is 0.2113 (or 21.13%). The P value indicates our calculated value would leave 21.13% in each tail; our critical region has only 0.5% in each tail. Clearly we are in the region of “acceptance”.

Example 2b with SAS
Testing the thermographs using SAS PROC UNIVARIATE. We don't have the raw data, so we cannot run this test in SAS.
A NOTE. SAS automatically tests the mean of the values in PROC UNIVARIATE against 0.
In the thermograph example our hypothesized value was 0.8, not 0.0.
But from what we know of transformations, we can subtract 0.8 from each value without changing the characteristics of the distribution.

SAS Example 2c – Freund & Wilson (1993) Example 4.2
We receive a shipment of apples that are supposed to be “premium apples”, with a diameter of at least 2.5 inches. We will take a sample of 12 apples and test the hypothesis that the mean size is equal to 2.5 inches, and thus that the apples qualify as premium apples. If the mean is LESS THAN 2.5 inches, we reject.
1) H0: μ = μ0
2) H1: μ < μ0
3) Assume: Independence (randomly selected sample)
Apple size is normally distributed.
4) α = 0.05. We have a one tailed test (H1: μ < μ0), and we chose α = 0.05. The critical limit would be a t value with 11 d.f. This value is –1.796.
5) Draw a sample. We will take 12 apples, and let SAS do the calculations.
The sample values for the 12 apples are;
2.9, 2.1, 2.4, 2.8, 3.1, 2.8, 2.7, 3.0, 2.4, 3.2, 2.3, 3.4
As mentioned, SAS automatically tests against zero, and we want to test against 2.5. So,
we subtract 2.5 from each value and test against zero. The test should give the same
results.
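The shift-and-test idea can also be sketched outside SAS. The following Python fragment (an added cross-check, not part of the original notes) reproduces the t statistic that PROC UNIVARIATE reports for the differences:

```python
import math
import statistics

# apple diameters from the notes; subtract the hypothesized 2.5 so we test against zero
diam = [2.9, 2.1, 2.4, 2.8, 3.1, 2.8, 2.7, 3.0, 2.4, 3.2, 2.3, 3.4]
diff = [d - 2.5 for d in diam]

n = len(diff)                  # n = 12, so d.f. = 11
mean = statistics.mean(diff)   # mean of the shifted values
s = statistics.stdev(diff)     # sample standard deviation
t = mean / (s / math.sqrt(n))  # one-sample t statistic against zero

print(round(t, 2))             # 2.27, matching the SAS value quoted below
```

Shifting every observation by 2.5 changes the mean but not the variance, which is why testing diff against 0 is equivalent to testing diam against 2.5.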
James P. Geaghan Copyright 2010 Statistical Methods I (EXST 7005) Page 69

SAS Program data step
options ps=61 ls=78 nocenter nodate nonumber;
data apples; infile cards missover;
TITLE1 'Test the diameter of apples against 2.5 inches';
LABEL diam = 'Diameter of the apple';
input diam; diff = diam - 2.5;
cards; run;

SAS Program procedures
proc print data=apples; var diam diff; run;
proc univariate data=apples plot; var diff; run;

See SAS PROC UNIVARIATE Output
6) Now we want to compare the observed value to the critical value. This case is a little tricky. We have a one tailed test (H1: μ < μ0) and we chose α = 0.05, so the critical limit, a t value with 11 d.f., was –1.796.
SAS gives us a t value of 2.27.
Reject? No, it is a positive 2.27, not negative. So we would not reject the hypothesis.

[Figure: t distribution with the calculated value (t = 2.27) in the upper tail and the critical limit (t = –1.796) in the lower tail.]

7) Conclude the size of the apples is not significantly below the 2.5 inch diameter we required.
I used the t values here, not the SAS provided P values. Why? Because we were doing a one tailed test and the SAS P values are 2 tailed.
However, we can use them if we understand them. The two tailed P value provided by SAS showed that the area in the two tails was 0.0443, so the area in each tail was 0.02215.
If the calculated and critical values had been in the same tail, an area of 0.02215 would fall well within the 5% region of rejection, so they would have been significantly different. Because they were in different tails, they do not cause rejection of the null hypothesis.

The Null hypothesis
We could not reject the apples as too small. Had we noticed that the mean was greater than 2.5,
we would not even have had to conduct the test and do the calculations. But other hypotheses
could have been tested. Maybe “Prime apples” are supposed to have a mean size greater than
2.5. We could reject the apples if we could not prove that the size was greater than 2.5.
Previously we tested,
1) H0: μ = μ0
2) H1: μ < μ0
But we could have tested
1) H0: μ = μ0
2) H1: μ > μ0

In this case if we set α to a one sided 0.05 we would have rejected the H0 since the tail was
0.0443/2 = 0.02215. We would still have taken the apple shipment.
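The halving of the SAS two tailed P value used throughout this section can be stated in code (a trivial sketch added here for emphasis, not from the notes):

```python
p_two_sided = 0.0443            # Pr > |t| as reported by SAS
p_one_sided = p_two_sided / 2   # area in the single tail of interest

print(round(p_one_sided, 5))    # 0.02215
```

The one sided value is then compared directly against α for the chosen tail.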
But which is the right test?
It depends on what is important to you. Do you lose your job for sending back apples that
were not really too small, or do you lose your job for accepting apples that did not meet
the criteria? Is it fair to demand that the seller prove the mean was greater than the
standard limit? The correct alternative depends on what you have to prove (with an
α*100% chance of error) and what is important to you.

SAS example 2c
Test for differences in seed production at two levels on a plant (top and bottom). We have ten vigorous plants bearing Lucerne flowers, each of which has flowers at the top and bottom. We want to test for differences in the number of seeds for the average of two pods in each position. For each plant, take two pods from the top and get an average, and two from the bottom for an average. Calculate the difference between the mean for the top and the mean for the bottom, and test to see if the difference is zero (i.e. no difference).
1) H0: μ = μ0
2) H1: μ ≠ μ0
3) Assume: Independence (randomly selected sample) and that the number per pod is
normally distributed.
4) α = 0.05 and with 9 d.f. our critical limit for a two tailed test would be t=2.262.
5) Take a sample. We have 10 plants, so n = 10 and d.f. = 9.
TOP    BOTTOM
4.0    4.4
5.2    3.7
5.7    4.7
4.2    2.8
4.8    4.2
3.9    4.3
4.1    3.5
3.0    3.7
4.6    3.1
6.8    1.9

See SAS Output
6) Compare the test statistic to the critical limits.
SAS reports: the mean = 1 and t = 1.978. The P(>|t|) = 0.0793. This area leaves almost 4% in each tail (0.0793/2 = 0.03965), and our critical region includes only 2.5% in each tail. Therefore, the observed value falls in the area of “acceptance”.
We fail to reject the null hypothesis.
7) Conclude the number of seeds does not differ between the top and bottom of the plant.
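As a cross-check on the quoted SAS output (an added sketch, not in the original notes), the paired analysis can be redone with Python's standard library:

```python
import math
import statistics

top    = [4.0, 5.2, 5.7, 4.2, 4.8, 3.9, 4.1, 3.0, 4.6, 6.8]
bottom = [4.4, 3.7, 4.7, 2.8, 4.2, 4.3, 3.5, 3.7, 3.1, 1.9]

# per-plant difference in mean seed count (top minus bottom)
d = [t - b for t, b in zip(top, bottom)]

n = len(d)                         # 10 plants, so d.f. = 9
dbar = statistics.mean(d)          # mean difference
s = statistics.stdev(d)            # standard deviation of the differences
t_stat = dbar / (s / math.sqrt(n)) # paired (one-sample) t statistic

print(round(dbar, 1), round(t_stat, 2))   # 1.0 and 1.98 (SAS reports 1.978)
```

Since |1.978| < 2.262, the critical t with 9 d.f., we fail to reject, in agreement with the notes.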
Of course, we may have made a Type II error.

Summary
The t distribution is similar to the Z distribution, but it is used where the value of σ² is not known and is estimated from the sample. This is a much more common case.
The t distribution is very similar to the normally distributed Z distribution in that it is a bell shaped curve, centered on zero and ranging from −∞ to +∞. However, it differs slightly because the distribution is derived from a normal distribution divided by a chi square distribution.
The t distribution can also be used with observations or samples; the formulas are
  ti = (Yi − Ȳ) / S, the t distribution applied to individual observations
  t = (Ȳ − μ0) / SȲ = (Ȳ − μ0) / (S/√n), the t distribution used for hypothesis testing
0 SY 0 S n Not all tables or computer algorithms give the area in the tail. Some give the cumulative
frequency starting at –∞ or at zero (0). Some give the area in one tail and some the area in
two tails.
In the t tables
Each row represents a different t distribution (with different d.f.); the t table has many different distributions. Only the POSITIVE side of the table is given, since the t distribution is symmetric. Only selected probabilities are provided in the t table.
SAS PROC UNIVARIATE will do a t test, but it only tests the hypothesized value of zero and provides a two tailed result.

Distribution of Variance and the Chi square (χ²) distribution
The distribution of Variance
Given Y ~ N(μ, σ²), E(S²) = σ²
where S² = SS / df, and if we let df = γ:
  for a sample, γ = n − 1 and S² = SS/γ = Σi (Yi − Ȳ)² / γ
  for a population, γ = N and σ² = SS/γ = Σi (Yi − μ)² / N
This is the structure of variance for any variable.
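The two divisors (γ = n − 1 for a sample, γ = N for a population) correspond to the two variance functions in Python's statistics module; a toy illustration (the data values are invented for this sketch, which is not part of the original notes):

```python
import statistics

y = [2, 4, 6]                                         # toy data: mean = 4, SS = 8
ss = sum((yi - statistics.mean(y)) ** 2 for yi in y)  # sum of squared deviations

s2 = ss / (len(y) - 1)   # sample variance, gamma = n - 1
sigma2 = ss / len(y)     # population variance, gamma = N

print(s2, statistics.variance(y))                            # both equal 4
print(round(sigma2, 3), round(statistics.pvariance(y), 3))   # both equal 2.667
```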
Recall, Zi = (Yi − μ)/σ ~ N(0, 1).
We now define a new distribution, ΣZi², the sum of the squared Z values:
  ΣZi² = Σi ((Yi − μ)/σ)² = Σi (Yi − μ)²/σ² = Sum of Squares/σ² = SS/σ²
Recall, σ² = Σi (Yi − μ)²/N = SS/γ, so SS = γσ², or more generally SS = γVar, because the variance will not always be σ².
Therefore, ΣZi² = SS/σ² = γVar/σ².
Finally, E(Var) = σ², so
  E(ΣZi²) = E(SS/σ²) = E(γVar)/σ² = γσ²/σ² = γ
where
  γ = N for a population, or for a sample with known μ
  γ = n − 1 for a sample using Ȳ to get deviations from the mean and variance
Use of the expected value tells us that if all possible samples of size n are drawn from Y ~ N(μ, σ²), then on the average ΣZi² will take the value of the d.f. (γ).
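This expectation result, E(ΣZi²) = γ, can be illustrated with a short simulation (an added sketch; the seed and replicate count are arbitrary choices, not from the notes):

```python
import random

random.seed(1)
mu, sigma = 0.0, 1.0
n = 10        # deviations are taken from the known mean mu, so gamma = n
reps = 5000   # number of simulated samples

total = 0.0
for _ in range(reps):
    y = [random.gauss(mu, sigma) for _ in range(n)]
    # this sample's sum of squared Z values
    total += sum(((yi - mu) / sigma) ** 2 for yi in y)

print(total / reps)   # averages close to gamma = 10
```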
This is the distribution of variance, S² = Σi (Yi − Ȳ)²/(n − 1) = SS/d.f. = SS/γ, and E(ΣZi²) = γ. This new distribution is called the Chi square (χ²) distribution. The most useful form of this distribution for hypothesis testing is SS/σ², and the distribution centers on γ.

Properties of the Chi Square distribution
• The distribution has only one parameter, γ
• For every γ there is a different distribution
• The distribution is nonnegative (positive values only), ranging from 0 to +∞
• The distribution is asymmetrical, but it approaches symmetry as γ increases

[Figure: Chi square distributions with 1, 5, and 10 d.f.]

The Chi square tables
• The left side gives the degrees of freedom, γ. Each degree of freedom is a different distribution, given in the rows as with the t table.
• The probability in the upper TAIL of the distribution is given in the row at the top of the table.
• The distribution is NOT symmetric, so the probabilities at the top must be used for both upper and lower limits.

Partial Chi square table
d.f.    0.995   0.99    0.975   0.95    0.5     0.05    0.025   0.01    0.005
1       0.00    0.00    0.00    0.00    0.45    3.84    5.02    6.63    7.88
2       0.01    0.02    0.05    0.10    1.39    5.99    7.38    9.21    10.60
3       0.07    0.11    0.22    0.35    2.37    7.81    9.35    11.34   12.84
4       0.21    0.30    0.48    0.71    3.36    9.49    11.14   13.28   14.86
5       0.41    0.55    0.83    1.15    4.35    11.07   12.83   15.09   16.75
6       0.68    0.87    1.24    1.64    5.35    12.59   14.45   16.81   18.55
7       0.99    1.24    1.69    2.17    6.35    14.07   16.01   18.48   20.28
8       1.34    1.65    2.18    2.73    7.34    15.51   17.53   20.09   21.95
9       1.73    2.09    2.70    3.33    8.34    16.92   19.02   21.67   23.59
10      2.16    2.56    3.25    3.94    9.34    18.31   20.48   23.21   25.19
20      7.43    8.26    9.59    10.85   19.34   31.41   34.17   37.57   40.00
30      13.79   14.95   16.79   18.49   29.34   43.77   46.98   50.89   53.67
100     67.33   70.06   74.22   77.93   99.33   124.34  129.56  135.81  140.17

Hypothesis testing
We will be able to use this distribution to test hypotheses about the variance.
1) H0: σ² = σ0² (note we are testing hypotheses about variances)
2) H1: σ² ≠ σ0² (directional alternatives can also be tested).
3) Assume independence and normality*
* some types of Chi square test do not require the assumption of normality.
4) Set α (at say 0.05 or 0.01 as before) and we will need to learn to set critical limits.
5) Draw a sample of size n and calculate an estimate of the variance (S2 for a sample).
We know for the Chi square that
  χ² = ΣZi² = Σi (Yi − Ȳ)²/σ² = SS/σ²
And if the null hypothesis is true, χ² = SS/σ0² will have a Chi square distribution and will center on γ (the degrees of freedom).
6) Compare the critical limits to the calculated statistic, and
7) Draw our conclusions and interpret the results.

Critical limits from the Chi square distribution
The tables are similar to the t table in that (1) each row is a different distribution and (2) selected probabilities are given at the top.
The tables are different from the t table in that the χ² tables (1) are not symmetric and (2) do not center on a single value (zero) like the t table; rather, each distribution centers on its d.f.

Examples of using the Chi square tables
Given γ = d.f. = 10, find P(χ² ≥ χ0²) = 0.25.
This is an area in the tail of the distribution, consistent with our tables. We look up the value in the tables and find for d.f. = 10 that the value that leaves 25% in the upper tail is 12.549.
This is an area in the LOWER tail of the
distribution. This value is not given in our tables,
but the area under the curve is still equal to one, so
P(χ2 ≤ 4.87) = 1–P(χ2 ≥ 4.87). 1P(χ2≤4.87)
P(χ2≤4.87) 10% 90%
10 0 +∞ Given γ = d.f. = 20, find P(15.5 ≤ χ2 ≤ 28.4)
This is an area in the center of the distribution.
There are two ways to get this area, find the tails.
P(χ2 ≤ 15.5) and P(χ2 ≥ 28.4), and subtract them
from 1. 75% 65% 10% 20
0
+∞
The other way is to find the probability that P(χ2 ≥
15.5) and P(χ2 ≥ 28.4) and subtract the first from the second. This is easier given the
way our tables are set up. Given γ = d.f. = 15, find P( χ 2 ≥ χ 02 ) = 0.05
This is the area in the upper tail provided by our tables. This value can be read directly from the tables: 24.996.

Given γ = d.f. = 15, find P(χ1² ≤ χ² ≤ χ2²) = 0.95.
This is the area in the middle of our tables. We must assume equal areas in the two tails or there are an infinite number of answers. Since α = 1 − 0.95 = 0.05, we calculate α/2 = 0.025. The upper tail can be read directly from the table, χ2² = 27.488. The value that would leave 2.5% in the lower tail would leave 97.5% (0.975) in the upper tail, so χ1² = 6.262.

Hypothesis testing
1) State the null hypothesis: H0: σ² = σ0², e.g. H0: σ² = 10, where σ0² = 10.
2) H1: σ² ≠ σ0². This could also have been a one tailed alternative, H1: σ² < σ0² or H1: σ² > σ0².
3) Assume independence and normality. Some other Chi square tests do not require the assumption of normality.
4) Set α, say 0.05 (0.01 would be another common choice), and determine the critical limits for the test. We have a two tailed test given H1: σ² ≠ σ0², and want α = 0.05 for two tails, 0.025 in each tail. Given that n = 20 and d.f. = γ = 19, we want to find upper and lower limits so that P(χ1² ≤ χ² ≤ χ2²) = 0.95.
  P(χ² ≤ χ1²) = 0.025, or P(χ² ≥ χ1²) = 0.975, so χ1² = 8.91
  P(χ² ≥ χ2²) = 0.025, so χ2² = 32.9
5) Draw a sample of size n and calculate an estimate of the variance (S² for a sample). The test statistic is χ² = SS/σ0² = Σi (Yi − Ȳ)²/σ0². In this case χ² = 400/10 = 40 with 19 d.f.
6) Compare the critical limits from the Chi square table (8.91 and 32.9) to the calculated test
statistic (χ² = 40). The calculated value exceeds the upper limit and falls in the area of rejection.
7) Since the calculated value exceeds the upper limit we would reject the null hypothesis and conclude the results were consistent with the alternate hypothesis. Since we have rejected the null hypothesis, there is a 5% possibility that we made a type I error.

Numerical example
Lobsters are to be used in a growth experiment. Weight gain will be studied, and it is important
that there be little variation in the initial weights. Based on previous experience, we know
that an initial standard deviation of NO MORE than 0.5 oz. would be adequate. Determine
if this tolerance is met. Less variation is no problem, only exceeding the 0.5 oz. value.
1) H0: σ² = σ0², in this case σ0² = (0.5)² = 0.25
2) H1: σ² > σ0²
3) Assume the sample is independent (randomly sampled) and the lobster weights are
normally distributed.
4) State the level of significance (α). We will use α = 0.01, since the validity of our experiment depends on this test, and a type II error only means we draw a different sample. Determine the critical limit. Given that this is a one tailed test (H1: σ² > σ0²), that α = 0.01, that the sample size is 12, and that the degrees of freedom are γ = df = 12 − 1 = 11, find P(χ² ≥ χ0²) = 0.01. From the table this value is 24.7.
5) Draw a sample and compute the χ² value. The new sample values, where n = 12, are
Yi = 11.9, 11.8, 12.7, 12.3, 12.1, 11.3, 12.6, 11.5, 11.9, 12.0, 11.8, 12.1
  ΣYi² = 1729.80 and ΣYi = 144
  SS = ΣYi² − (ΣYi)²/n = 1729.80 − (144)²/12 = 1.8, with 11 d.f.
  χ² = SS/σ0² = 1.8/0.25 = 7.2
6) Compare the calculated test statistic value (7.2) to the critical limit from the table (24.7).
In this case the test statistic does not occur in the region of rejection.
7) Since the calculated value (7.2) is less than the critical limit value, we fail to reject the null hypothesis and conclude that our results are consistent with the null hypothesis. There is a chance of a Type II error.
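The arithmetic above (SS = 1.8, χ² = 7.2) is easy to verify with a short script (an added sketch, not part of the original SAS workflow):

```python
import math

y = [11.9, 11.8, 12.7, 12.3, 12.1, 11.3, 12.6, 11.5, 11.9, 12.0, 11.8, 12.1]
n = len(y)                         # 12 lobsters, so d.f. = 11

sum_y = sum(y)                     # 144.0
sum_y2 = sum(yi ** 2 for yi in y)  # 1729.80
ss = sum_y2 - sum_y ** 2 / n       # computational formula for the sum of squares

chi2 = ss / 0.25                   # SS / sigma0^2, hypothesized variance 0.25
s = math.sqrt(ss / (n - 1))        # sample standard deviation

print(round(ss, 1), round(chi2, 1), round(s, 4))   # 1.8, 7.2, 0.4045
```

The last value, S = 0.4045, is the standard deviation discussed in the next paragraph.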
We never actually finished our calculation of the variance. The SS was 1.8, so S2 = 1.8 /
11 = 0.1636 and S = √0.1636 = 0.4045. Had we noted earlier that the calculated value
was actually smaller than our hypothesized value, we could have stopped. It couldn't
be in the upper tail.

Review
For a population, E(ΣZi²) = E(Σi (Yi − μ)²/σ²) = E(SS/σ²) = E(γVar)/σ² = γσ²/σ² = γ. All of these follow a chi square distribution.
For a sample, E(ΣZi²) = E(Σi (Yi − Ȳ)²/σ²) = E(SS/σ²) = E((n − 1)S²/σ²) = γ. All of these follow a chi square distribution.
χ² = SS/σ² is the form used for hypothesis testing.
2 Hypothesis tests covered so far.
  Z = (Ȳ − μ0)/σȲ
  t = (Ȳ − μ0)/SȲ
  χ² = SS/σ0² = Σi (Yi − Ȳ)²/σ0²
The last distribution we will discuss is the F distribution. This distribution will allow us to test H0: σ1² = σ2², and some other tests about the equality of means for more than two means.

Which test to use?
To test means
  H0: μ = μ0, σ² known ............ Z test
  H0: μ = μ0, σ² not known ........ t test
To test variances
  H0: σ² = σ0² .................... χ² test
  H0: σ1² = σ2² (covered later) ... F test