EXST7005 Fall2010 06a Transformation 01


Statistical Methods I (EXST 7005)
James P. Geaghan, Copyright 2010

Coding and Transformations

Objective: hypothesis testing

Background

Many applications in statistics require modifying an existing distribution into an alternative form. Hypothesis testing, in particular, requires taking an observed distribution and transforming it to a recognized statistical distribution with known properties. This modification is called a transformation.

Theorems

If a constant “a” is added to each observation, then
  the mean of the data set will increase by “a” units, and
  the variance and standard deviation will remain unchanged.

Example: population of size N = 4, Yi = 2, 4, 6, 8

$$\mu = \frac{\sum_{i=1}^{N} Y_i}{N} = \frac{20}{4} = 5$$

$$\sigma_Y^2 = \frac{\sum_{i=1}^{N} (Y_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} Y_i^2 - \left(\sum_{i=1}^{N} Y_i\right)^2 / N}{N} = \frac{120 - 100}{4} = 5$$

$$\sigma_Y = \sqrt{5} = 2.24$$

Now add 10 to each observation.

Example: population size still N = 4, Yi = 12, 14, 16, 18

$$\mu = \frac{\sum_{i=1}^{N} Y_i}{N} = \frac{60}{4} = 15$$

$$\sigma_Y^2 = \frac{\sum_{i=1}^{N} Y_i^2 - \left(\sum_{i=1}^{N} Y_i\right)^2 / N}{N} = \frac{920 - 900}{4} = 5$$

$$\sigma_Y = \sqrt{5} = 2.24$$

The mean increased by 10 units (from 5 to 15), while the variance and standard deviation did not change.

NOTE that “a” may be either negative or positive, so we can add or subtract a constant from all values of Y. If we took the values Yi = 12, 14, 16, 18 and subtracted 10 from each value, we would reverse the previous example. When subtracting, the mean is REDUCED by the value subtracted: the mean would then be ten less, and the variance and standard deviation would be unchanged.

Another theorem

If each observation Yi is multiplied by a constant “a”, then the mean of the data set is “a” times the old mean, the new variance is “a²” times the old variance, and the standard deviation is “a” times the old standard deviation (|a| times when “a” is negative, since a standard deviation cannot be negative).
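Both theorems can be checked numerically. The following is a minimal Python sketch using only the standard library's statistics module. The helper name summarize is a hypothetical choice made for illustration, and the check reuses the population Yi = 2, 4, 6, 8 with the constant a = 10.

```python
from statistics import fmean, pvariance, pstdev


def summarize(values):
    """Return (mean, population variance, population standard deviation)."""
    return fmean(values), pvariance(values), pstdev(values)


y = [2, 4, 6, 8]   # population used in the notes
a = 10             # constant used in the notes

mean_y, var_y, sd_y = summarize(y)

# Adding a constant shifts the mean and leaves the spread alone.
mean_add, var_add, sd_add = summarize([yi + a for yi in y])

# Multiplying by a constant scales the mean by a, the variance by a**2,
# and the standard deviation by abs(a).
mean_mul, var_mul, sd_mul = summarize([yi * a for yi in y])

print(mean_y, var_y, round(sd_y, 2))
print(mean_add, var_add, round(sd_add, 2))
print(mean_mul, var_mul, round(sd_mul, 2))
```

Note that pvariance and pstdev divide by N, matching the population formulas in these notes; the sample versions (variance, stdev) divide by N - 1 and would give different numbers.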
Example: using the same population as before, N = 4, Yi = 2, 4, 6, 8, with μ = 5, σ² = 5, σ = 2.24. Let “a” be 10, so we multiply each observation by 10: Yi = 20, 40, 60, 80.

$$\mu = \frac{\sum_{i=1}^{N} Y_i}{N} = \frac{200}{4} = 50$$

which is equal to aμ = 10(5) = 50.

$$\sigma_Y^2 = \frac{\sum_{i=1}^{N} Y_i^2 - \left(\sum_{i=1}^{N} Y_i\right)^2 / N}{N} = \frac{12000 - 10000}{4} = 500$$

which is a²σ² = 10²(5) = 500.

σ = 22.4, which is 10(2.24) = 22.4, or √500 = 22.4.

NOTE that the multiplier may also be an inverse (i.e. 1/a instead of a), so we can multiply or divide all values of Yi by any nonzero constant. If we took the values Yi = 20, 40, 60, 80 and divided each Yi by 10, we would reverse the previous example. For division by “a” (here a = 10), the mean is divided by a (multiplied by 1/10), the variance is divided by a² (multiplied by 1/100), and the standard deviation is divided by a (multiplied by 1/10).

The transformation operations may be used in combination.

Example: population of size N = 3, Yi = 10, 20, 30, with μ = 20, σ² = 66.67, σ = 8.16.

The transformation is “divide by 10 (or multiply by 1/10) and subtract 2”, giving Y′i = -1, 0, 1 (much easier to work with).

$$\mu' = \frac{\sum_{i=1}^{N} Y'_i}{N} = \frac{0}{3} = 0$$

$$\sigma_{Y'}^2 = \frac{\sum_{i=1}^{N} Y_i'^2 - \left(\sum_{i=1}^{N} Y'_i\right)^2 / N}{N} = \frac{2 - 0}{3} = 0.66667$$

and σ′ = 0.816.

Note that order is important. To get back the original values we must reverse the transformation. Above we (1) divided and then (2) subtracted. To reverse this we must (1) add and then (2) multiply:

$$\mu = 10(\mu' + 2) = 10(0 + 2) = 20$$

Since addition and subtraction do not affect measures of dispersion, we need consider only the division:

$$\sigma_Y^2 = a^2 \sigma_{Y'}^2 = 100(0.66667) = 66.667$$

$$\sigma_Y = a\,\sigma_{Y'} = 10(0.816) = 8.16$$

Note that there is no addition or subtraction for the measures of dispersion, since they were unaffected by that part of the original transformation.

Other transformations

The logarithmic transformation was mentioned previously.
$$Y'_i = \log(Y_i)$$

If we calculate a statistic such as the mean using the log-transformed values and then back-transform (detransform) with the antilog, the result is a “geometric mean”. With natural logarithms:

$$\mathrm{antilog}\left(\frac{\sum_{i=1}^{n} Y'_i}{n}\right) = e^{\left(\sum_{i=1}^{n} \log(Y_i) / n\right)} = GM(Y_i)$$

HOWEVER, note that we cannot take the logarithm of 0 (zero), so if there are zeros in the data set we must combine two transformations. One common modification is to add 1 to all observations:

$$Y'_i = \log(Y_i + 1)$$

When back-transforming, be careful to subtract 1 after taking the antilog. Order is important.

The same is true for inverses used in calculating a harmonic mean with an inverse transformation:

$$Y'_i = \frac{1}{Y_i}$$

If we calculate the mean of the inverse-transformed values and then detransform by taking the inverse of that mean, we get the harmonic mean.

The “Z” transformation

The Z transformation consists of a combination of several of the previously discussed transformations:

$$Z_i = \frac{Y_i - \mu}{\sigma} \quad \text{for a population, or} \quad t_i = \frac{Y_i - \bar{Y}}{S} \quad \text{for a sample.}$$

This transformation standardizes any normal distribution to a different normal distribution with a mean of zero and a variance of one (i.e. μ = 0; σ² = 1; σ = 1). This is called the standard normal distribution. This is necessary because otherwise there are an infinite number of different normal distributions with different means and variances. By transforming to a standard normal distribution we can learn to work with a single distribution with known characteristics.

Example: transform the data for a population of N = 4, Yi = 2, 4, 6, 8.

Initially, calculate the mean and variance: μ = 5; σ² = 5; σ = 2.24. The transformation is then applied with the following result.

Zi = (2 - 5)/2.24, (4 - 5)/2.24, (6 - 5)/2.24, (8 - 5)/2.24 = -1.34, -0.45, 0.45, 1.34

For the transformed values, μ = 0/4 = 0, σ² = 4/4 = 1, and σ = √1 = 1.

NOTE: addition and subtraction do not affect calculations of the variance and could be ignored.
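The back-transformed means and the Z transformation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, natural logarithms are assumed for the log transformation, and the population standard deviation (divisor N) is used for Z.

```python
import math
from statistics import fmean, pstdev


def geometric_mean_via_logs(values):
    """Mean of log(Y), then antilog. Requires every value > 0."""
    logs = [math.log(v) for v in values]
    return math.exp(fmean(logs))


def back_transformed_log1p_mean(values):
    """Mean of log(Y + 1), then antilog, then subtract 1 (order matters)."""
    logs = [math.log(v + 1) for v in values]
    return math.exp(fmean(logs)) - 1


def harmonic_mean_via_inverse(values):
    """Mean of 1/Y, then the inverse of that mean. Requires no zeros."""
    inverses = [1 / v for v in values]
    return 1 / fmean(inverses)


def z_scores(values):
    """Subtract the population mean, then divide by the population sd."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


z = z_scores([2, 4, 6, 8])
print([round(zi, 2) for zi in z])   # [-1.34, -0.45, 0.45, 1.34]
print(geometric_mean_via_logs([2, 8]))
print(back_transformed_log1p_mean([0, 3]))
print(harmonic_mean_via_inverse([2, 6]))
```

The standard library also provides statistics.geometric_mean and statistics.harmonic_mean; the explicit versions here are written out only to show the transform, average, back-transform sequence.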
The Z distribution

This is the first statistical distribution that we will use and develop for hypothesis testing. These hypothesis testing techniques will require an understanding of the distribution, of how to work with tables of probabilities of the distribution, and of the Z transformation. Fortunately, other statistical distributions will be similar. Once these techniques are learned, they apply readily to other statistical tests and applications.

The Z transformation

Purpose: transforms values from any normal population to the corresponding values from the Standard Normal Distribution. The distribution is N(μ = 0, σ² = 1).

Zi = (Yi - μ) / σ

where
  μ = the mean of the original population
  σ = the standard deviation of the original population
  Yi = the value of an observation from the original population
  Zi = the corresponding value from a Standard Normal Distribution

The purpose of this transformation is to deal with the infinite number of possible normal curves with different values of μ and σ by standardizing any normal curve so we can work with a single distribution. We will then work with this distribution from a table of Z values and probabilities related to the Z values. This will tie together much of what we have discussed (frequency and probability concepts, transformations, use of means, variances and standard deviations, and their calculations).

Probability statements

“Typical” probability statements are of the form

P[Z ≤ Z0] = r.c.f. at Z0

where Z0 is some hypothesized value.

Tabulated Z distribution

We will need tables to work with the Z distribution. You should have those tables available. Your book has Z tables, copies of my notes have the tables, and I have a copy on the internet.

The table gives positive values only. The Z distribution is exactly symmetric. As a result, the negative half (below zero) is a mirror image of the upper half.
Therefore, our tables only need (and will only have) half of the distribution, since it is exactly symmetric.

To work with these half tables, it is important to note that
  P[Z ≤ 0] = P[Z ≥ 0] = 0.5, since half of the distribution is above 0 and half is below;
  P[Z ≤ -Z0] = P[Z ≥ +Z0], since the distribution is symmetric;
  P[Z ≤ Z0] = 1 - P[Z ≥ Z0], since the total area under the curve sums to one.

Z table

The table in the text is “one-sided”, as only one side is required due to symmetry. Values down the left side and across the top of the Z table give the value of Z; values in the body of the table are the probabilities of randomly choosing a larger Z value.

For example, take Z = 0.11. What proportion of the distribution occurs above this value? Or, what is the probability of picking a Z value at random and it being larger than 0.11?

Only the first 6 rows and columns of the table are shown here, plus the row and column headings. Complete tables are available on the Internet, linked to the departmental STATLAB web page.

   Z      0.00     0.01     0.02     0.03     0.04     0.05
  0.0    0.5000   0.4960   0.4920   0.4880   0.4840   0.4801
  0.1    0.4602   0.4562   0.4522   0.4483   0.4443   0.4404
  0.2    0.4207   0.4168   0.4129   0.4090   0.4052   0.4013
  0.3    0.3821   0.3783   0.3745   0.3707   0.3669   0.3632
  0.4    0.3446   0.3409   0.3372   0.3336   0.3300   0.3264
  0.5    0.3085   0.3050   0.3015   0.2981   0.2946   0.2912

Reading the Z tables

To find Z = 0.11, read the integer portion and first decimal place (0.1) along the left side, and find the second decimal place (0.01) along the top. The intersection of these gives the probability of a greater value of Z, in this case P(Z ≥ +0.11) = 0.4562.

Note that the value of Z = 0.00 has a probability of 0.5, so half of the distribution is above this value (and half below).

Working with Z tables

What did we just do?

[Figure: standard normal curve with Z = 0.11 marked]
We found the area under the curve above a value of Z = 0.11. The values in the available tables always give the area in the upper tail of the curve, P(Z ≥ Z0).

What if we want to work with the lower half of the curve? Due to symmetry in the distribution, the probability of a randomly selected value falling in the negative area to the left is the same as the corresponding positive area, so P(Z ≥ +0.11) = P(Z ≤ -0.11).

[Figure: standard normal curve with Z = -0.11 marked]

Some things we know from previous discussions of the empirical rule.

P(Z ≥ 0) = P(Z ≤ 0) = 0.5.

The probability that a randomly selected Z falls between the limits μ - 1σ and μ + 1σ is about 68%, and half of the remainder falls in each of the tails (about 16%). Since μ = 0 and σ = 1 for the standard normal, we should have about 16% above +1 and 16% below -1. Looking this up in the table we see P(Z ≥ +1) = 0.1587. Due to symmetry, P(Z ≤ -1) is also 0.1587.

The probability that a randomly selected Z falls between the limits μ - 1.96σ and μ + 1.96σ is 95%, and half of the remainder falls in each of the tails (about 2.5%). Since σ = 1 for the standard normal, we should have about 2.5% above 1.96 and 2.5% below -1.96. Looking this up in the table we see P(Z ≥ +1.96) = 0.0250, and P(Z ≤ -1.96) would be the same. A memorable value, 1.96!

The probability that a randomly selected Z falls between the limits μ - 2.576σ and μ + 2.576σ is about 99%, and half of the remainder falls in each of the tails (about 0.5%). Since σ = 1 for the standard normal, we should have about 0.5% above 2.576 and 0.5% below -2.576. Attempting to look this up, we see that the value 2.576 does not occur exactly in the tables, but

P(Z ≥ +2.57) = P(Z ≤ -2.57) = 0.0051 and
P(Z ≥ +2.58) = P(Z ≤ -2.58) = 0.0049.

So the value of Z that leaves 0.005 in the tail is somewhere between 2.57 and 2.58; to three decimal places it is 2.576:

P(Z ≥ +2.576) = P(Z ≤ -2.576) = 0.005

“In between” values would normally be determined by interpolation.
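The tail areas quoted above, and the interpolation between 2.57 and 2.58, can be checked with a short Python sketch. It assumes Python 3.8 or later, where the standard library provides statistics.NormalDist; the function names upper_tail, lower_tail and interpolate_z are hypothetical helpers introduced only for this illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def upper_tail(z0):
    """P(Z >= z0): the quantity given in the body of the one-sided Z table."""
    return 1.0 - STANDARD_NORMAL.cdf(z0)


def lower_tail(z0):
    """P(Z <= z0)."""
    return STANDARD_NORMAL.cdf(z0)


def interpolate_z(p, z_lo, p_lo, z_hi, p_hi):
    """Linear interpolation for the Z whose upper-tail area is p,
    given two neighbouring table entries (z_lo, p_lo) and (z_hi, p_hi)."""
    fraction = (p_lo - p) / (p_lo - p_hi)
    return z_lo + fraction * (z_hi - z_lo)


for z0 in (0.11, 1.00, 1.96, 2.57, 2.58, 2.576):
    print(z0, round(upper_tail(z0), 4))

# Interpolating between the two table entries quoted in the notes.
print(interpolate_z(0.005, 2.57, 0.0051, 2.58, 0.0049))
```

Linear interpolation on the four-decimal table entries gives about 2.575, slightly different from 2.576, which is consistent with the remark that either neighbouring table value is close enough for most decisions.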
Exact values can be obtained from various software packages, including SAS and EXCEL.

Note: On an exam, if a value does not occur exactly, I will accept either of the two limits on either side of the correct value, or anything in between. In the real world you can get “exact” values from EXCEL. In the even more real world, how much precision, or how many decimal places, do you really need to make this type of decision? All my tables were created in EXCEL.

A few more examples of working with Z tables

Find P(Z ≥ +1.35). This is an area in the upper half of the distribution (since Z is positive), so we can read it directly from the Z tables.

P(Z ≥ +1.35) = 0.0885

Find P(Z ≤ -2.22). This is an area from the lower half of the distribution, but due to symmetry P(Z ≤ -2.22) = P(Z ≥ +2.22), so we can use the upper half of the table that we have available.

P(Z ≤ -2.22) = 0.0132

What about problems that do not ask for the area in the upper or lower tail? For example, P(Z ≤ 1.30). This value is in the upper half of the table, but the probability requested is for randomly chosen Z values less than or equal to 1.30, so the area extends into the lower half of the distribution!

To solve this problem you must recall that the total area under the curve adds to 1. To find P(Z ≤ 1.30), we first find P(Z ≥ +1.30) and subtract it from 1.

P(Z ≤ 1.30) = 1 - P(Z ≥ +1.30) = 1 - 0.0968 = 0.9032

[Figure: standard normal curve split at Z = 1.30, with area 0.9032 below and 0.0968 above]

Even trickier Z distribution problems

Note that the value of Z = 0.00 has a probability of 0.5, so half of the distribution is above this value (and half below).

Find P(Z ≥ -0.65). Now we are looking for the probability of a value greater than or equal to a value on the negative side of the distribution.
From our tables we first find

P(Z ≥ 0.65) = 0.2578 = P(Z ≤ -0.65) due to symmetry, and so

P(Z ≥ -0.65) = 1 - P(Z ≤ -0.65) = 1 - 0.2578 = 0.7422

It is strongly advisable to sketch the problem and to see if the answer makes sense. In this case we can see from the sketch that the desired area is over half of the total area, so the answer should be greater than 0.5, and of course it was (P(Z ≥ -0.65) = 0.7422).

A few extra examples

1) P(Z ≥ 3.50) = ?  Read directly from the table.
2) P(Z ≤ -2.00) = ?  Read from the table, but for the upper (positive) end.
3) P(Z ≥ 0.00) = ?  Read directly from the table.
4) P(Z ≤ 1.64) = P(Z ≥ -1.64) = ?  This is not in the table. Use 1 - P(Z ≥ 1.64).
5) P(Z ≤ 1.96) = P(Z ≥ -1.96) = ?  This is not in the table. Use 1 - P(Z ≥ 1.96).

Two-tailed problems and “area in the middle” problems

A common type of problem is to determine the area between two limits, or to determine the area in the tails outside some specific limits.

Probability expressions for areas in the middle will take the form P(Z1 ≤ Z ≤ Z2) = ? If the problem is symmetric, then -Z1 = Z2; call the limit Z0, and we can rewrite P(-Z0 ≤ Z ≤ Z0) as P(|Z| ≤ Z0).

Probability expressions for areas in the tails will take the form P(|Z| ≥ Z0) if the problem is symmetric. If not, we can write this as two expressions, P(Z ≤ Z1) OR P(Z ≥ Z2).

It is also possible that problems involving sections may not be symmetric, and may occur entirely in the positive tail or the negative tail.

Some Examples

P(|Z| ≥ Z0), where Z0 = 1.96. Since we are taking the absolute value of a randomly chosen Z value, that value may fall in either tail: it may be either positive or negative, as long as its absolute value is greater than or equal to 1.96.

P(|Z| ≥ Z0) = P(Z ≤ -Z0) + P(Z ≥ Z0), and since the distribution is symmetric

P(|Z| ≥ Z0) = 2*P(Z ≥ Z0) = 2(0.0250) = 0.050

P(|Z| ≤ Z0), where Z0 = 2.576.
This problem is similar to the previous one, but it describes the area in the middle, between two limits.

P(|Z| ≤ Z0) = 1 - P(|Z| ≥ Z0) = 1 - 2*P(Z ≥ Z0) = 1 - 2(0.0050) = 0.99

An asymmetric case

P(-1.96 ≤ Z ≤ 2.576) = ?

This is the area in the middle: the total minus the two tails. We already know these tails.

P(-1.96 ≤ Z ≤ 2.576) = 1 - P(Z ≤ -1.96) - P(Z ≥ 2.576) = 1 - 0.0250 - 0.0050 = 0.97

Working the Z tables, backward & forward

We have seen how to find a probability from a value of Z0. Now we need to be able to find a value of Z0 when a probability is known. Basically, we find the value of the probability in the body of our Z table and determine the corresponding value of Z0.

P(Z ≤ Z0) = 0.1587, find the value of Z0

This probability is a value less than 0.5, so it is a tail and can be solved directly from our tables. We only need to find 0.1587 in the table and determine the corresponding value of Z0. The value in the table occurs in the row corresponding to “1.0” and the column corresponding to “0.00”. Finally, note that the randomly chosen Z was to be less than or equal to Z0, so we are in the lower tail.

Z0 = -1.00

P(Z ≤ Z0) = 0.8413, find the value of Z0

This probability is a value greater than 0.5. To read it from our tables we must determine the corresponding tail. The tail would be given by 1 - 0.8413 = 0.1587. So this is the same as the value we just looked up; it occurs in the row corresponding to “1.0” and the column ...
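The two-tailed, middle-area and backward look-ups worked in this section can be reproduced in software rather than from the printed table. The sketch below is one possible way to do so, assuming Python 3.8 or later and its standard statistics.NormalDist class; the helper names are hypothetical and are not part of the course materials.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1


def two_tail_area(z0):
    """P(|Z| >= z0): twice the upper-tail area, by symmetry."""
    return 2.0 * (1.0 - Z.cdf(abs(z0)))


def area_between(z1, z2):
    """P(z1 <= Z <= z2): the area in the middle between two limits."""
    return Z.cdf(z2) - Z.cdf(z1)


def z_for_lower_tail(p):
    """Backward look-up: the z0 for which P(Z <= z0) = p, with 0 < p < 1."""
    return Z.inv_cdf(p)


print(round(two_tail_area(1.96), 3))
print(round(area_between(-2.576, 2.576), 3))
print(round(area_between(-1.96, 2.576), 3))
print(round(z_for_lower_tail(0.1587), 2))
print(round(z_for_lower_tail(0.8413), 2))
```

Because inv_cdf works directly with the lower-tail probability, no conversion to the upper tail is needed; that conversion is only an artifact of the table being one-sided.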