Statistical Methods I (EXST 7005)

Coding and Transformations

Objective – Hypothesis testing

Background
Many applications in statistics require modifying an existing distribution to an alternative form
of the distribution. Hypothesis testing, in particular, requires taking an observed
distribution and transforming it to a recognized statistical distribution with known properties.
This modification involves a transformation.

Theorems
If a constant “a” is added to each observation, then the mean of the data set will increase by “a”
units, and the variance and standard deviation will remain unchanged.
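Example: Population of size N = 4
Yi = 2, 4, 6, 8

$$\mu = \frac{\sum_{i=1}^{N} Y_i}{N} = \frac{20}{4} = 5$$

$$\sigma_Y^2 = \frac{\sum_{i=1}^{N} (Y_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} Y_i^2 - \frac{\left(\sum_{i=1}^{N} Y_i\right)^2}{N}}{N} = \frac{120 - 100}{4} = 5$$

$$\sigma_Y = \sqrt{5} = 2.24$$

Now add 10 to each observation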
Example: Population size still N = 4
Yi = 12, 14, 16, 18

$$\mu = \frac{\sum_{i=1}^{N} Y_i}{N} = \frac{60}{4} = 15$$

$$\sigma_Y^2 = \frac{\sum_{i=1}^{N} Y_i^2 - \frac{\left(\sum_{i=1}^{N} Y_i\right)^2}{N}}{N} = \frac{920 - 900}{4} = 5$$

$$\sigma_Y = \sqrt{5} = 2.24$$

The mean increased by 10 units, while the variance and standard deviation did not
change.
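The additive theorem can be checked numerically. The following is a minimal Python sketch, not part of the original notes. It assumes the standard-library `statistics` module, whose `pvariance` and `pstdev` divide by N (the population formulas used above); the helper name `describe` is hypothetical.

```python
from statistics import mean, pvariance, pstdev


def describe(values):
    """Return (mean, population variance, population standard deviation)."""
    return mean(values), pvariance(values), pstdev(values)


y = [2, 4, 6, 8]
shifted = [v + 10 for v in y]  # add the constant a = 10 to each observation

mu, var, sd = describe(y)
mu_s, var_s, sd_s = describe(shifted)

print(f"original: mean={mu}, variance={var}, sd={sd:.2f}")
print(f"shifted:  mean={mu_s}, variance={var_s}, sd={sd_s:.2f}")
```

Only the mean moves; the two measures of dispersion are the same for both data sets.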
NOTE that “a” may be either negative or positive, so we can add or subtract a constant from
all values of Y. If we took the values of Yi = 12, 14, 16, 18 and subtracted 10 from each
value we would reverse the previous example.
When subtracting, the mean is REDUCED by the value subtracted and the variance and
standard deviation remain unchanged. Here the mean would be ten less and the variance
and standard deviation would be unchanged.

James P. Geaghan Copyright 2010

Another theorem
If each observation Yi is multiplied by a constant “a”, then the mean of the data set is “a” times
the old mean, the new variance is “a²” times the old variance, and the standard deviation is
“a” times the old standard deviation (strictly |a| times, if “a” is negative).
Example: using the same population as before; N = 4
Yi = 2, 4, 6, 8; μ = 5; σ² = 5; σ = 2.24. Let “a” be 10, so we multiply each observation by 10.
Yi = 20, 40, 60, 80

$$\mu = \frac{\sum_{i=1}^{N} Y_i}{N} = \frac{200}{4} = 50$$
which is equal to aμ = 10(5) = 50

$$\sigma_Y^2 = \frac{\sum_{i=1}^{N} Y_i^2 - \frac{\left(\sum_{i=1}^{N} Y_i\right)^2}{N}}{N} = \frac{12000 - 10000}{4} = 500$$
which is a²σ² = 10²(5) = 500
σ = 22.4, which is 10(2.24) = 22.4 or √500 = 22.4

NOTE that “a” may also be an inverse (i.e. 1/a instead of a), so we can multiply or divide all
values of Yi by any nonzero constant.
If we took the values of Yi = 20, 40, 60, 80 and divided each Yi by 10, we would reverse
the previous example.
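The multiplicative theorem, and its reversal by division, can be checked the same way. This is a minimal Python sketch under the same assumptions as before (standard-library `statistics`, population formulas); the variable names are illustrative only.

```python
from math import isclose
from statistics import mean, pvariance, pstdev

a = 10
y = [2, 4, 6, 8]
scaled = [a * v for v in y]          # multiply each observation by a
restored = [v / a for v in scaled]   # dividing by a reverses the step

# Mean scales by a, variance by a squared, standard deviation by a.
print(mean(scaled), pvariance(scaled), round(pstdev(scaled), 1))
print(mean(restored), pvariance(restored), round(pstdev(restored), 2))
```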
For division by “a” (here a = 10, i.e. multiplication by 1/10), the mean is divided by “a”, the variance is divided by “a²”
(here 100), and the standard deviation is divided by “a”.

The transformation operations may be used in combination.
Example: Population of size N = 3
Yi = 10, 20, 30; μ = 20; σ² = 66.67; σ = 8.16
The transformation is “divide by 10 (or multiply by 1/10) and subtract 2”
Yi′ = –1, 0, 1 (much easier to work with)

$$\mu' = \frac{\sum_{i=1}^{N} Y_i'}{N} = \frac{0}{3} = 0$$

$$\sigma_{Y'}^2 = \frac{\sum_{i=1}^{N} Y_i'^2 - \frac{\left(\sum_{i=1}^{N} Y_i'\right)^2}{N}}{N} = \frac{2 - 0}{3} = 0.66667$$

and σ′ = 0.816
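A minimal Python sketch of this coded example, again assuming the standard-library `statistics` module; the names `a`, `b`, and `coded` are illustrative. It also recovers the original mean, variance, and standard deviation from the coded ones by undoing the two steps in the opposite order.

```python
from math import isclose
from statistics import mean, pvariance, pstdev

a, b = 10, 2  # divide by a, then subtract b

y = [10, 20, 30]
coded = [v / a - b for v in y]  # the coded values -1, 0, 1

mu_c, var_c, sd_c = mean(coded), pvariance(coded), pstdev(coded)

# Undo the steps in the opposite order: add b first, then multiply by a.
mu_back = a * (mu_c + b)
# Adding or subtracting b does not change dispersion, so only a matters here.
var_back = a**2 * var_c
sd_back = a * sd_c

print(coded, mu_c, round(var_c, 5), round(sd_c, 3))
print(mu_back, round(var_back, 2), round(sd_back, 2))
```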
Note that order is important. To get back the original values we must reverse the
transformation.
Above we (1) divided and then (2) subtracted.
To reverse this we must (1) add and then (2) multiply: μ = 10(μ′ + 2) = 10(0 + 2) = 20
Since addition and subtraction do not affect measures of dispersion, we need consider
only the division:

$$\sigma_Y^2 = a^2 \sigma_{Y'}^2 = 100(0.66667) = 66.667$$

$$\sigma_Y = a \sigma_{Y'} = 10(0.816) = 8.16$$

Note that there is no addition or subtraction for the measures of dispersion since they
were unaffected by the original transformation.

Other transformations
The logarithmic transformation was mentioned previously.

$$Y_i' = \log(Y_i)$$

If we calculate statistics such as the mean using the log-transformed values, and then back-transform (detransform) with the antilog,

$$\operatorname{antilog}\left(\frac{\sum_{i=1}^{n} \log(Y_i)}{n}\right) = e^{\left(\frac{\sum_{i=1}^{n} \log(Y_i)}{n}\right)} = GM(Y_i)$$

this results in a “geometric mean”.
HOWEVER, note that we cannot take the logarithm of 0 (zero), so if there are zeros in the data
set we must combine two transformations. One common modification is to add 1 to all
observations.

$$Y_i' = \log(Y_i + 1)$$

Be careful when back-transforming (detransforming) to subtract 1 after taking the antilog.
Order is important.
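The two log-based back-transformations can be sketched in Python as follows. This is an illustration only: the function names and both data sets are hypothetical, and natural logarithms are assumed so that the antilog is `exp`.

```python
import math


def geometric_mean(values):
    """Mean of the natural logs, back-transformed with the antilog (exp)."""
    logs = [math.log(v) for v in values]
    return math.exp(sum(logs) / len(logs))


def shifted_log_mean(values):
    """Same idea with Y' = log(Y + 1); subtract 1 AFTER taking the antilog."""
    logs = [math.log(v + 1) for v in values]
    return math.exp(sum(logs) / len(logs)) - 1


positive = [1, 10, 100]   # hypothetical data with no zeros
with_zero = [0, 3, 8]     # hypothetical data containing a zero

print(geometric_mean(positive))
print(shifted_log_mean(with_zero))
```

Calling `geometric_mean(with_zero)` would raise a `ValueError`, because `math.log(0)` is undefined; that is the reason for the added constant.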
The same is true for inverses used in calculating a harmonic mean with an inverse
transformation

$$Y_i' = \frac{1}{Y_i}$$

If we calculate the mean of the inverse-transformed values and then detransform with the
inverse, we get the harmonic mean.

The “Z” transformation
The Z transformation consists of a combination of several of the previously discussed
transformations.

$$Z_i = \frac{Y_i - \mu}{\sigma} \quad \text{or, for a sample,} \quad t_i = \frac{Y_i - \bar{Y}}{S}$$

This transformation standardizes any normal distribution to a different normal distribution with
a mean of zero and a variance of one (i.e. μ = 0; σ² = 1; σ = 1). This is called the
standard normal distribution. This is necessary because otherwise there are an infinite
number of different normal distributions with different means and variances. By
transforming to a standard normal distribution we can learn to work with a single
distribution with known characteristics.

Example: transform the data for a population of N = 4.
Yi = 2, 4, 6, 8
Initially, calculate the mean and variance: μ = 5; σ² = 5; σ = 2.24. The transformation is then applied, with the following result.
Zi = (2–5)/2.24, (4–5)/2.24, (6–5)/2.24, (8–5)/2.24 = –1.34, –0.45, 0.45, 1.34
μ = 0 / 4 = 0
σ² = 4 / 4 = 1
σ = √1 = 1
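The same standardization can be sketched in Python. This assumes the standard-library `statistics` module and the population standard deviation (`pstdev`), matching the σ used in the example; it is an illustration, not part of the original notes.

```python
from statistics import mean, pvariance, pstdev

y = [2, 4, 6, 8]
mu = mean(y)        # population mean
sigma = pstdev(y)   # population standard deviation, about 2.24

# Z transformation: subtract the mean, then divide by the standard deviation.
z = [(v - mu) / sigma for v in y]

print([round(v, 2) for v in z])
print(mean(z), pvariance(z))
```

The transformed values have mean 0 and variance 1, up to floating-point rounding.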
NOTE: addition and subtraction do not affect calculations of the variance and could be ignored.

The Z distribution
This is the first statistical distribution that we will use and develop for hypothesis testing.
These hypothesis testing techniques will require an understanding of the distribution, of
how to work with tables of probabilities of the distribution, and of the Z transformation.
Fortunately, other statistical distributions will be similar. Once these techniques are
learned, they apply readily to other statistical tests and applications.

The Z transformation
Purpose – transforms values from any normal population to the corresponding values from the
Standard Normal Distribution. The distribution is N(μ = 0, σ² = 1).
Zi = (Yi – μ) / σ
where; μ = the mean of the original population
σ = the standard deviation of the original population
Yi = the value of an observation from the original population
Zi = the corresponding value from a Standard Normal Distribution
The purpose of this transformation is to deal with the infinite number of possible normal curves
with different values of μ and σ by standardizing any normal curve so we can work with a
single distribution. We will then work with these distributions from a table of Z values
and probabilities related to the Z values. This will tie together much of what we have
discussed (frequency and probability concepts, transformations, use of means, variances
and standard deviations, and their calculations).

Probability statements
“Typical” probability statements are of the form
P[Z ≤ Z0] = r.c.f. at Z0
where Z0 is some hypothesized value and r.c.f. is the relative cumulative frequency.

Tabulated Z distribution
We will need tables to work with the Z distribution. You
should have those tables available. Your book has Z tables, copies of my notes have the tables, and I have a
copy on the internet.
[Figure: standard normal curve; annotation: “Table gives positive values only.”]
The Z distribution is exactly symmetric. As a result, the negative
half (below zero) is a mirror image of the upper half.
Therefore, our tables only need (and will only have)
half of the distribution.
To work with these half tables, it is important to note that
P[Z ≤ 0] = P[Z ≥ 0] = 0.5 since half of the distribution is above 0 and half is below
P[Z ≤ –Z0] = P[Z ≥ +Z0] since the distribution is symmetric
P[Z ≤ Z0] = 1 – P[Z ≥ Z0] since the total area under the curve sums to one.

Z table
The table in the text is “one-sided”, as only one side is required due to symmetry.
Values in the left column and top row of the Z table give the value of Z; values in the body
of the table are the probabilities of randomly choosing a larger Z value.
For example, take Z = 0.11. What proportion of the distribution occurs above this value?
Or, what is the probability of picking a Z value at random and it being larger than 0.11?
Only the first 6 rows and columns of the table are shown here, plus the row and column
headings. Complete tables are available on the Internet linked to the departmental
STATLAB web page.
Z      0.00    0.01    0.02    0.03    0.04    0.05
0.00   0.5000  0.4960  0.4920  0.4880  0.4840  0.4801
0.10   0.4602  0.4562  0.4522  0.4483  0.4443  0.4404
0.20   0.4207  0.4168  0.4129  0.4090  0.4052  0.4013
0.30   0.3821  0.3783  0.3745  0.3707  0.3669  0.3632
0.40   0.3446  0.3409  0.3372  0.3336  0.3300  0.3264
0.50   0.3085  0.3050  0.3015  0.2981  0.2946  0.2912

Reading the Z tables
Values on the left side and top of the Z table give the value of Z. For example, to find Z = 0.11,
read the integer portion and first decimal (0.1) along the left side, and
find the second decimal (0.01) along the top.
The intersection of these gives the probability of a greater value of Z, in this case P(Z ≥
+0.11) = 0.4562.
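The table lookup can be reproduced in software. The following minimal Python sketch assumes the standard-library `statistics.NormalDist` class (Python 3.8 or later); the helper name `upper_tail` is hypothetical. It rebuilds the excerpt of the table shown earlier and looks up Z = 0.11.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # standard normal distribution


def upper_tail(z0):
    """P(Z >= z0), the quantity tabulated in the body of the table."""
    return 1 - Z.cdf(z0)


# Rows give Z to one decimal place; columns add the second decimal place.
print("Z     " + "  ".join(f"{j / 100:.2f}  " for j in range(6)))
for i in range(6):
    row = [upper_tail(i / 10 + j / 100) for j in range(6)]
    print(f"{i / 10:.2f}  " + "  ".join(f"{p:.4f}" for p in row))

p_011 = upper_tail(0.11)  # row 0.10, column 0.01
print(round(p_011, 4))
```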
Note that the value of Z = 0.00 has a probability of 0.5, so half of the distribution is above
this value (and half below).

Working with Z tables
What did we just do?
[Figure: standard normal curve, labeled Z = 0.11]
We found the area under the curve above a value of Z = 0.11. The values in the available
tables will always be giving the r.c.f. of the upper area of the curve.
What if we want to work with the lower half of the curve?
Due to symmetry in the distribution, the probability of a
randomly selected value falling in the negative area
to the left is the same as the corresponding positive
area, so P(Z ≥ +0.11) = P(Z ≤ –0.11)
[Figure: standard normal curve, labeled Z = 0.11]
Some things we know from previous discussions of the
empirical rule.
P(Z ≥ 0) = P(Z ≤ 0) = 0.5.
The probability that a randomly selected Z falls between the limits μ – 1σ and μ + 1σ is
about 68%, and half of the remaining fall in each of the tails (about 16%). Since σ = 1
for the standard normal, we should have about 16% above +1, and 16% below –1.
Looking this up in the table we see P(Z ≥ +1) = 0.1587. Due to symmetry P(Z ≤ –1) is
also 0.1587.
The probability that a randomly selected Z falls between the limits μ – 1.96σ and μ +
1.96σ is 95%, and half of the remaining fall in each of the tails (about 2.5%). Since σ
= 1 for the standard normal, we should have about 2.5% above 1.96, and 2.5% below
–1.96. Looking this up in the table we see P(Z ≥ +1.96) = 0.0250, and P(Z ≤ –1.96)
would be the same.
A memorable value, 1.96!
The probability that a randomly selected Z falls between the limits μ–2.576σ and
μ+2.576σ is about 99%, and half of the remaining fall in each of the tails (about
0.5%). Since σ = 1 for the standard normal, we should have about 0.5% above 2.576,
and 0.5% below –2.576. Attempting to look this up in the table we see that the value
2.576 does not occur exactly in the tables, but
P(Z ≥ +2.57) = P(Z ≤ –2.57) = 0.0051 and P(Z ≥ +2.58) = P(Z ≤ –2.58) = 0.0049
So the true value is somewhere between 2.57 and 2.58; to three decimal places it is 2.576, with
P(Z ≥ +2.576) = P(Z ≤ –2.576) = 0.005
“In between” values would normally be determined by interpolation. Exact values can be
obtained from various software packages, including SAS and EXCEL.
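As an illustration of both approaches, here is a minimal Python sketch. Python is not one of the packages named above; it is used here only as an assumption-laden stand-in, via the standard-library `statistics.NormalDist`. The helper `interpolate_z` is a hypothetical name for ordinary linear interpolation between two adjacent table entries.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_257 = 1 - Z.cdf(2.57)  # upper tail at the table entry below 2.576
p_258 = 1 - Z.cdf(2.58)  # upper tail at the table entry above 2.576


def interpolate_z(p, z_lo, p_lo, z_hi, p_hi):
    """Linear interpolation for Z between two adjacent table entries."""
    return z_lo + (p_lo - p) / (p_lo - p_hi) * (z_hi - z_lo)


# Interpolate using the four-decimal probabilities as printed in the table.
z_interp = interpolate_z(0.005, 2.57, 0.0051, 2.58, 0.0049)

# Ask the software directly for the Z with 0.5% of the area above it.
z_software = Z.inv_cdf(1 - 0.005)

print(round(p_257, 4), round(p_258, 4))
print(round(z_interp, 3), round(z_software, 4))
```

The interpolated value and the software value agree to about two decimal places, which is usually all a table-based decision needs.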
Note: On an exam, if a value does not occur exactly, I will accept either of the two limits
on either side of the correct value, or anything in between.
In the real world you can get “exact” values from EXCEL. In the even more real world,
how much precision, or how many decimal places, do you really need to make this
type of decision? All my tables were created in EXCEL.

A few more examples of working with Z tables
Find P(Z ≥ +1.35). This is an area in the upper half of the distribution (since Z is positive) so
we can read it directly from the Z tables. P(Z ≥ +1.35) = 0.0885
Find P(Z ≤ –2.22). This is an area from the lower half of the distribution, but due to symmetry
P(Z ≤ –2.22) = P(Z ≥ +2.22), so we can use the upper half of the table that we have
available. P(Z ≤ –2.22) = 0.0132
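These two lookups can be checked with a short Python sketch, assuming the standard-library `statistics.NormalDist`. Software has the full distribution available, so the lower tail can be computed directly; the symmetry route used with the half table is shown alongside it for comparison.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

p_upper = 1 - Z.cdf(1.35)      # P(Z >= +1.35)
p_lower = Z.cdf(-2.22)         # P(Z <= -2.22), computed directly
p_lower_sym = 1 - Z.cdf(2.22)  # same area via symmetry: P(Z >= +2.22)

print(round(p_upper, 4), round(p_lower, 4), round(p_lower_sym, 4))
```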
What about problems that do not ask for the area in the
upper or lower tail? For example, P(Z ≤ 1.30).
This value is in the upper half of the table, but the
probability requested is for randomly chosen Z
values less than or equal to 1.30, which extends into the
lower half of the distribution!
To solve this problem you must recall that the total
area under the curve adds to 1. To find P(Z ≤ 1.30),
we first find P(Z ≥ +1.30) and subtract from 1.
P(Z ≤ 1.30) = 1 – P(Z ≥ +1.30) = 1 – 0.0968 =
0.9032.
[Figure: standard normal curve with the areas 0.9032 and 0.0968 labeled]

Even trickier Z distribution problems
Note that the value of Z = 0.00 has a probability of 0.5, so half of the distribution is above this
value (and half below).
Find P(Z ≥ –0.65).
Now we are looking for a value greater than or equal to
a value on the negative side of the distribution.
From our tables we first find
P(Z ≥ 0.65) = 0.2578 = P(Z ≤ –0.65) due to symmetry, and so P(Z ≥ –0.65) = 1 – P(Z ≤ –0.65) =
1 – 0.2578 = 0.7422
It is strongly advisable to sketch the problem, and to see if the answer makes sense. In this case
we can see from the sketch that the desired area is over half of the total area, so the answer
should be greater than 0.5, and of course it was (P(Z ≥ –0.65) = 0.7422).

A few extra examples
1) P(Z ≥ 3.50) = ?   Read directly from the table
2) P(Z ≤ –2.00) = ?   Read from the table, but for the upper (positive) end
3) P(Z ≥ 0.00) = ?   Read directly from the table
4) P(Z ≤ 1.64) = P(Z ≥ –1.64) = ?   This is not in the table. Use 1 – P(Z ≥ 1.64)
5) P(Z ≤ 1.96) = P(Z ≥ –1.96) = ?   This is not in the table. Use 1 – P(Z ≥ 1.96)

Two-tailed problems and “area in the middle” problems
A common type of problem is to determine the area between
two limits, or to determine the area in the tails outside some
specific limits.
Probability expressions for these problems will take the
form P(Z1 ≤ Z ≤ Z2) = ?
If the problem is symmetric, then –Z1 = Z2; call the limit
Z0 and we can rewrite P(–Z0 ≤ Z ≤ Z0) as P(|Z| ≤ Z0).
Probability expressions for areas in the tails will take
the form P(|Z| ≥ Z0) if the problem is symmetric. If
not, we can write this as two expressions, P(Z ≤ Z1) OR P(Z ≥ Z2).
It is also possible that problems involving sections may not be symmetric, and may occur
entirely in the positive tail, or negative tail.

Some Examples
P(|Z| ≥ Z0), where Z0 = 1.96. Since we are taking the
absolute value of a randomly chosen Z value in either
tail, that value may be either positive or negative and
its absolute value may be greater than or equal to 1.96.
[Figure: standard normal curve; P(|Z| ≥ Z0), where Z0 = 1.96]
P(|Z| ≥ Z0) = P(Z ≤ –Z0) + P(Z ≥ Z0), and since it is symmetric
P(|Z| ≥ Z0) = 2*P(Z ≥ Z0) = 2(0.0250) = 0.050
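The two-tailed calculation above can be sketched in Python, assuming the standard-library `statistics.NormalDist`; the function name `two_tailed` is hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def two_tailed(z0):
    """P(|Z| >= z0) = P(Z <= -z0) + P(Z >= z0)."""
    return Z.cdf(-z0) + (1 - Z.cdf(z0))


p_two = two_tailed(1.96)
print(round(p_two, 3))
```

Because the distribution is symmetric, the same value results from doubling the upper tail, `2 * (1 - Z.cdf(1.96))`.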
P(|Z| ≤ Z0), where Z0 = 2.576. This problem is similar to
the previous, but it describes the area in the middle,
between two limits.
P(|Z| ≤ Z0) = 1 – P(|Z| ≥ Z0) = 1 – 2*P(Z ≥ Z0) = 1 –
2(0.0050) = 0.99

An asymmetric case
P(–1.96 ≤ Z ≤ 2.576) = ? This is the area in the
middle, the total minus the two tails. We already
know these tails.
P(–1.96 ≤ Z ≤ 2.576) = 1 – P(Z ≤ –1.96) – P(Z ≥
2.576) = 1 – 0.0250 – 0.0050 = 0.97

Working the Z tables, backward & forward
We have seen how to find a probability from a value of Z0.
Now we need to be able to find a value of Z0 when a
probability is known. Basically, we find the value of
the probability in the body of our Z table, and
determine the corresponding value of Z0.
P(Z ≤ Z0) = 0.1587, find the value of Z0
This probability is a value less than 0.5, so it is a tail and can be solved directly from our
tables. We only need to find 0.1587 in the table and determine the corresponding value
of Z0. The value in the table occurs in the row corresponding to “1.0” and the column
corresponding to “0.00”. Finally note that the randomly chosen Z was to be less than or
equal to Z0, so we are in the lower tail. Z0 = –1.00
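The backward lookup can also be sketched in Python, assuming the standard-library `statistics.NormalDist`, whose `inv_cdf` method takes a lower-tail probability P(Z ≤ Z0) and returns Z0. The variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

# Lower-tail probability given: P(Z <= z0) = 0.1587
z0 = Z.inv_cdf(0.1587)

# If instead an upper-tail probability P(Z >= z0) = 0.1587 were given,
# convert it to a lower-tail probability first.
z0_upper = Z.inv_cdf(1 - 0.1587)

print(round(z0, 2), round(z0_upper, 2))
```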
P(Z ≤ Z0) = 0.8413, find the value of Z0
This probability is a value greater than 0.5. To read it
from our tables we must determine the
corresponding tail. The tail would be given by 1–
0.8413 = 0.1587. So this is the same as the value
we just looked up, it occurs in the row
corresponding to “1.0” and the column
...