Statistical Methods I (EXST 7005) Page 22

Measures of dispersion
Our first major objective is to develop the concepts needed to understand hypothesis testing. We will
primarily test hypotheses about means, but variances can also be tested. Testing means will
require a measure of the dispersion or variability in the data set, so testing both means and
variances requires knowledge of variance.
The following presents some measures of variation or variability among the elements
(observations) of a data set:
• Range – difference between the largest and smallest observation
This is a rough estimator which does not use all of the information in the data set.
• Interquartile range – difference between the third and first quartiles (Q3 – Q1)
Recall that the first quartile (Q1) is the value that has one quarter of the observations with
lesser values and the third quartile has three quarters of the observations with lesser
values. This may be a better measure of variability than the range in most situations
because the range can be influenced by a single unusually large or unusually small
value. However, this measure also does not use all of the information in the data set. • Variance – the “average” squared deviation from the mean,
The Population Variance is σ² (called “sigma squared”).
    This is a parameter, and therefore a constant. The variance is given by

        σ² = Σᵢ₌₁ᴺ (Yᵢ − μ)² / N

    where N is the size of the population.
S² is the Sample Variance (called “S-squared”).
    This is a statistic, and therefore a variable. The sample variance is given by

        S² = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² / (n − 1)

    where n is the size of the sample.
    NOTE that the divisor is n − 1 rather than n. If n is used, then the calculation is a biased
    estimator of σ², tending to be an underestimate.
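The effect of the divisor can be checked numerically. Below is a minimal Python sketch (the sample values are arbitrary, not from the notes); the standard library's statistics.variance uses the n − 1 divisor, while statistics.pvariance divides by n:

```python
import statistics

sample = [2, 4, 6, 8]
n = len(sample)
ybar = sum(sample) / n
ss = sum((y - ybar) ** 2 for y in sample)   # corrected sum of squares = 20.0

s2 = ss / (n - 1)    # sample variance, unbiased n-1 divisor
biased = ss / n      # dividing by n under-estimates sigma^2

print(s2, biased)                     # 6.666... versus 5.0
print(statistics.variance(sample))    # stdlib uses n - 1, matches s2
print(statistics.pvariance(sample))   # population formula, divisor n
```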
Standard Deviation – a standard measure of the deviation of observations from the mean. It is
    calculated as the square root of the variance:
        σ = √σ²   (this is a parameter)
        S = √S²   (this is a statistic)
Mean Absolute Deviation (MAD) – the “average deviation” from the mean, but using absolute
    values. This is another possible measure of dispersion. However, the variance is the usual
    calculation, as it has some advantages over the MAD.

James P. Geaghan, Copyright 2010

Desirable properties of a measure of dispersion
A valid, useful measure of dispersion should:
• use all of the available information
• be independent of other parameters (and statistics) for large data sets
• be capable of being expressed in the same units as the variables
• be small when the spread among the points in the data set is small, and large when the
  spread is wider
The standard deviation meets these criteria.

A note on units
When we calculate the mean for a sample or population, the units on the mean are the same as
for the original variable. If the original variable was measured in inches, the units of the
mean will be inches.
The variance also has units, but since the calculation involves the square of the original
variable, the units on the variance are the original variable squared. If the original variable
was measured in inches, the units of the variance would be inches squared.
Since the standard deviation is the square root of the variance, the units on the standard
deviation would again be the same as the original variable.

Degrees of freedom (d.f.)
In the calculation of a population variance the divisor is N, while in the calculation for a sample
the divisor is n–1. This is because the calculated estimate of one parameter (σ2) uses an
estimate of another parameter (μ) in its calculation. For a sample, the estimate of the
variance (S²) employs a previously estimated statistic (Ȳ). Since we use an estimate of Ȳ
to calculate our estimate of S², the divisor is n − 1:

    S² = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² / (n − 1)

This denominator is called the degrees of freedom. If we needed to estimate two parameters
prior to estimating a parameter, the d.f. would be n − 2.
Why? If we knew μ, as we do for a population, then we could get an independent deviation
from each and every observation.
If we knew that μ = 5, and we drew an observation at random and its value was 3, then
the deviation would be –2. Each and every observation contributes a deviation
since we know the value of μ.
But we cannot get an estimate of σ2 from a single sample observation since that
observation is also its own mean and the deviation is zero. If we drew a single
sample observation, with a value of 3, and we did not know the value of μ, then we
would estimate the value of Y from our sample. That estimated value would also
be 3 and there would be no deviation.
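The constraint on deviations from the sample mean can be demonstrated numerically; a minimal Python sketch (arbitrary sample values): deviations from Ȳ sum to zero, so the last deviation is fully determined by the others:

```python
sample = [3, 7, 8, 12, 15]
ybar = sum(sample) / len(sample)
deviations = [y - ybar for y in sample]

# deviations from the sample mean always sum to zero...
print(sum(deviations))   # 0.0

# ...so once n - 1 deviations are known, the last one is forced:
last = -sum(deviations[:-1])
print(last == deviations[-1])   # True
```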
In summary, with a known value of μ every observation can deviate independently from
μ, and the sum of the deviations has no restrictions. However, deviations from Ȳ
always sum to ZERO, so only the first n − 1 can assume “any” independent value. When we
know the value of n − 1 observations, the remaining observation is fixed by our knowledge of Ȳ.

Calculating the Variance
The variance is calculated as the sum of squared deviations divided by the degrees of freedom.
For a sample,

    S² = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² / (n − 1)

This calculation requires going through the data once to estimate Ȳ and a second time to
compute the squared deviations (Yᵢ − Ȳ)².

The variance can, in many cases, be calculated more easily with the “calculator formula”.
When we refer to “sum of squares”, or SS, we will mean the “Corrected Sum of Squares”,
unless otherwise stated. When we need to refer to the uncorrected sums of squares
they will be denoted as UCSS or USS.
Uncorrected sum of squares:   USS = Σᵢ₌₁ⁿ Yᵢ²

Corrected sum of squares (deviation formula):   SS = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)²

Corrected sum of squares (calculator formula):   SS = Σᵢ₌₁ⁿ Yᵢ² − (Σᵢ₌₁ⁿ Yᵢ)² / n   or   SS = Σᵢ₌₁ⁿ Yᵢ² − nȲ²

As noted, the deviation formula requires two passes through the data. However, since most
calculators can simultaneously accumulate both the sum of the Yᵢ, Σ Yᵢ, and the sum of the
squared Yᵢ, Σ Yᵢ², the calculator formula requires only a single pass through the data.
The “correction” made in corrected sums of squares is a correction for the mean. This is
apparent in the deviation formula, but not as obvious in the calculator formula. The term
(Σᵢ₌₁ⁿ Yᵢ)² / n, or nȲ², in the calculator formula is called the “correction factor”, and
corrects for the mean.
Finally, the sum of squares is divided by the degrees of freedom to get the variance. The
value of the sum of squares should be the same regardless of the formula used.

An example of variance
Examine two samples;
Sample 1: 1, 2, 3
Sample 2: 11, 12, 13

    Ȳ₁ = 2,   Ȳ₂ = 12

Note that the deviations from the mean are the same in each case (–1, 0, 1) and the sum of
squared deviations, SS = (–1)2 + (0)2 + (1)2 = 2, is also the same for both samples
using the deviation formula.
The corrected SS using the calculator formula are also the same
Sample 1: SS = 14 – 12 = 2
Sample 2: SS = 434 – 432 = 2
And the variance for both samples is then SS / (n–1) = 2 / 2 = 1
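Both routes to the corrected sum of squares can be verified for the two samples above; a minimal Python sketch (the function names are illustrative, not from the course materials):

```python
def ss_deviation(data):
    """Corrected SS via the two-pass deviation formula."""
    ybar = sum(data) / len(data)
    return sum((y - ybar) ** 2 for y in data)

def ss_calculator(data):
    """Corrected SS via the single-pass calculator formula."""
    n = sum_y = sum_y2 = 0
    for y in data:            # one pass accumulates n, sum(Y) and sum(Y^2)
        n += 1
        sum_y += y
        sum_y2 += y * y
    return sum_y2 - sum_y ** 2 / n

for sample in ([1, 2, 3], [11, 12, 13]):
    ss = ss_calculator(sample)
    print(ss, ss == ss_deviation(sample), ss / (len(sample) - 1))
    # prints: 2.0 True 1.0  (for both samples)
```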
So, two different looking sets of numbers have the same “scatter” and the same
variance.

Coefficient of variation (CV)

The CV is the standard deviation expressed as a percent of the mean, CV = (S / Ȳ) × 100%.
The CV is used to compare relative variation between different experiments or variables,
independent of the mean. This calculation allows the comparison of different
variables (e.g. variability in automobile weights vs. variability in hippopotamus weights) or
variables on different scales (e.g. inches to kilograms).
compare the variability of people's weights to people's heights.
compare variation in infants' lengths to adult heights.
NUMERICAL Example: compare the relative variation in fork length of fish to the weights
and scale lengths of the same fish. Data from 3 year old Flier Sunfish (Centrarchus
               Length (mm)   Weight (g)   Scale Lt. (mm)
    Mean          131.8         53.0           6.9
    Std Dev        15.1         19.6           0.8

CV (length) = (15.1 / 131.8) × 100% = 11.5%
CV (weight) = (19.6 / 53.0) × 100% = 37.0%
CV (scale length) = (0.8 / 6.9) × 100% = 11.6%
From the results above we may conclude that the fish weights are relatively more
variable than their lengths, and that the relative variability in body length and scale
length is nearly identical.
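The three CV calculations can be reproduced directly; a minimal Python sketch using the means and standard deviations from the sunfish table:

```python
# (mean, standard deviation) pairs taken from the sunfish table above
measurements = {
    "length (mm)": (131.8, 15.1),
    "weight (g)": (53.0, 19.6),
    "scale length (mm)": (6.9, 0.8),
}

def cv(mean, std_dev):
    """Coefficient of variation: std dev as a percent of the mean (unitless)."""
    return std_dev / mean * 100

for name, (mean, sd) in measurements.items():
    print(f"CV ({name}) = {cv(mean, sd):.1f}%")
# -> 11.5%, 37.0% and 11.6%, matching the hand calculations
```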
The CV has no units.
Highly variable data may have a CV that exceeds 100%.

From SAS example #1a: see the SAS output for the Coefficient of Variation and the other
statistics discussed.

Expected values and Bias
Unbiased Estimator: a statistic is said to be an unbiased estimator of a parameter if, with
repeated sampling, the average of all of the sample statistics approaches the parameter. An
estimator would be biased if, on the average, it approached a value that was larger or
smaller than the true target parameter.
Expected value: the mean value of a statistic from a large number of samples (the “long run”
average). From our previous discussions, dividing by n–1 to calculate variance for a
sample results in a value which is LARGER than if we divide by n. If dividing by n–1 is
the correct approach (giving an unbiased estimate), it suggests that dividing by a larger
number, n, causes a negative bias (a value which is, on the average, too SMALL). This is
indeed the case: it is true that the expected value of the sample mean equals the population
parameter (i.e. E(Ȳ) = μ), and it is also true that E(S²) = σ², so these are unbiased
estimators.

Note that for symmetric distributions, μ can also be estimated by the median, mode or
midrange. However, the mean is an unbiased estimator for all distributions.
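The unbiasedness of S², and the negative bias that results from dividing by n, can be illustrated by repeated sampling; a minimal Python simulation (the population parameters and sample size are arbitrary choices):

```python
import random

random.seed(1)
mu, sigma2 = 0.0, 4.0          # population mean and variance (sigma = 2)
n, reps = 5, 100_000           # small samples, many repeated draws

sum_s2 = sum_biased = 0.0
for _ in range(reps):
    sample = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    ybar = sum(sample) / n
    ss = sum((y - ybar) ** 2 for y in sample)
    sum_s2 += ss / (n - 1)     # unbiased estimator of sigma^2
    sum_biased += ss / n       # biased: too small on average

print(sum_s2 / reps)       # close to sigma^2 = 4.0
print(sum_biased / reps)   # close to (n-1)/n * 4.0 = 3.2
```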
Expected Values are actually calculated as a sum (or integration for continuous variables) of
the product of the observed values (Yi) in the distribution and the probability (p(Yi)) of
occurrence of each value (e.g. E(Z) = Σ[Zᵢ × P(Zᵢ)]). These have various uses,
including the evaluation of bias.
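For a discrete distribution the expected-value sum can be computed directly; a minimal Python sketch using a fair six-sided die as an illustrative (not course-specific) example:

```python
from fractions import Fraction

# fair six-sided die: values 1..6, each with probability 1/6
values = range(1, 7)
p = Fraction(1, 6)

expected = sum(z * p for z in values)   # E(Z) = sum of z * P(z)
print(expected)          # 7/2
print(float(expected))   # 3.5
```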
For our purposes:
The expected value is the measure of the true central tendency for the probability
distribution. If we took all possible samples, the mean would be the expected value,
provided the estimator we used is unbiased.
For any statistic, if the expected value of the statistic is the same as the population value, the
statistic is unbiased.

Summary of Dispersion
Dispersion is a measure of the variability among the elements of a population or sample
A number of estimates are available, including the Range, Interquartile range, Variance and
Standard deviation. All are available from SAS PROC UNIVARIATE.
Units of the variable are squared on variances, but the same as the original variable for standard deviations.
Calculations on samples must consider degrees of freedom.
Both the sample means and sample variances (when divided by “n–1”) are unbiased estimators
of their target parameters, the population mean and population variance, respectively.

Constructing a Frequency Table
DIVIDE the population into a number of classes or groups based on the characteristics studied.
Categories are often quantitative, but not necessarily
DETERMINE the number of observations in each class (i.e. the frequency of occurrence of
observations in each class).
CONSTRUCT the table with both classes and frequencies. The frequencies may also be relative
(i.e. percentages) or cumulative.
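The three steps can be sketched in Python (the course would do this in SAS); collections.Counter handles the DETERMINE step, using the fish-age data from the example below:

```python
from collections import Counter

ages = [8, 4, 4, 0, 1, 5, 6, 5, 3, 4]   # fish ages (years), from the example
n = len(ages)

freq = Counter(ages)          # DETERMINE the frequency of each class
cum = 0
print("class    f   c.f.    r.f.   r.c.f.")
for cls in range(0, 9):       # DIVIDE into discrete age classes 0..8
    f = freq.get(cls, 0)
    cum += f                  # CONSTRUCT frequency and cumulative columns
    print(f"{cls:5d} {f:4d} {cum:5d} {f/n:7.1f} {cum/n:7.1f}")
```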
Construct a frequency table for a population of fish age groups.
N = 10
Y = age of fish in years: 8, 4, 4, 0, 1, 5, 6, 5, 3, 4
These values are placed into discrete age groups (0 to 8).

Frequency Table

    Class (age)   Frequency (f)   Cumulative frequency (c.f.)
        0               1                  1
        1               1                  2
        2               0                  2
        3               1                  3
        4               3                  6
        5               2                  8
        6               1                  9
        7               0                  9
        8               1                 10
       SUM             10

Additional terms
Frequency Total: the total number of observations. The sum of the class frequencies.
Frequency (f): the number of observations in each class
Cumulative Frequency (c.f.): The sum of all class frequencies up to and including the class
    in question. Implies an order or rank, so this is usually done only with quantitative variables.
Relative Frequency (r.f.): the ratio of the class frequencies to the total frequency. These
always sum to 1.0
r.f. * 100% gives the percentage frequency (sums to 100%)
Relative Cumulative Frequency (r.c.f.): the sum of the r.f. up to and including the class in
question (for QUANTITATIVE VARIABLES).

Frequency Table

    Class value    f     r.f.    c.f.    r.c.f.
        0          1     0.1       1      0.1
        1          1     0.1       2      0.2
        2          0     0.0       2      0.2
        3          1     0.1       3      0.3
        4          3     0.3       6      0.6
        5          2     0.2       8      0.8
        6          1     0.1       9      0.9
        7          0     0.0       9      0.9
        8          1     0.1      10      1.0
       SUM        10     1.0

Graphic displays of frequencies
HISTOGRAM or bar-chart - representation of a frequency table
The area under each bar is proportional to the relative frequency (r.f.) of the class.
[figure: histogram of the age-class frequencies, classes 0 through 9]

FREQUENCY POLYGON – a variation of a histogram-type plot in which the midpoint of
each class relative frequency is connected with a straight line.

[figure: frequency polygon over the same age classes, 0 through 9]

Characteristics of histograms
When done with relative frequencies, the total area of a graph of relative frequencies is 1.0
Any subsection of a graph of relative frequencies will have an area such that
0 ≤ subsection area ≤ 1.

SAS example (#1b) from Freund & Wilson (1997) Table 1.1; see the SAS output for results.
Things to note – Options
OPTIONS LS=99 PS=512 nocenter nodate nonumber;
ODS HTML body='C:\Example01.html';
TITLE1 'Introductory SAS example 1';
– the DATA step
– the raw DATA (note ending semicolon)
– the Procedures
PROC SORT; BY QUALITY;
PROC MEANS; BY QUALITY;
PROC CHART; VBAR QUALITY;
PROC CHART; HBAR QUALITY;
proc gchart; pie QUALITY;
proc gchart; star QUALITY;
proc gchart; donut QUALITY;

Summary
Frequencies are a common and useful technique for descriptive statistics, with many possible
applications.
We would usually do the calculations in SAS.
The distributions that we will use for hypothesis testing will be in the form of frequency
distributions.

Linear Models
The simplest form of the linear additive model is

    Yᵢ = μ + εᵢ   for i = 1, 2, 3, …, N

This is a population version of the model, so the term μ is a constant; it is the population mean.
The sample version would use Ȳ, which is a statistic and a variable.
εi represents the deviations of the observations from the mean. It has a mean of zero since
deviations sum to zero.
ei would be used to represent sample deviations,
and, of course, the population size, N, would be changed to the sample size, n.
This is a mathematical representation of a population or sample. All of the analyses discussed
in the Statistical Methods courses have a linear model. The models get more complex as
the analysis gets more advanced.
Multiplicative models and multiplicative errors exist, but are not covered in basic statistical
methods. Note that the error term in this model is additive.
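The simplest additive model can be simulated directly; a minimal Python sketch (μ, the error spread, and the population size are arbitrary choices):

```python
import random

random.seed(7)
mu = 50.0        # population mean (the constant term in the model)
N = 10_000       # population size

# generate Y_i = mu + e_i, with additive errors e_i centered at zero
errors = [random.gauss(0.0, 3.0) for _ in range(N)]
Y = [mu + e for e in errors]

print(sum(errors) / N)   # mean of the e_i: close to 0
print(sum(Y) / N)        # mean of the Y_i: close to mu = 50
```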
Other models we will discuss this semester include:
    Yᵢⱼ = μᵢ + εᵢⱼ   for the two-sample t-test
    Yᵢⱼ = μ + τᵢ + εᵢⱼ   another form of the t-test, also used for ANOVA
    Yᵢ = β₀ + β₁Xᵢ + εᵢ   Simple Linear Regression
    Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + β₃X₃ᵢ + εᵢ   Multiple Linear Regression

Coding and Transformations
Objective – Hypothesis testing

Background
Many applications in statistics require modifying an existing distribution to an alternative form
of the distribution. Hypothesis testing, in particular, requires taking an observed
distribution and transforming it to a recognized statistical distribution with known properties.
This modification involves a transformation.

Theorems
If a constant “a” is added to each observation, then the mean of the data set will increase by
“a” units, and the variance and standard deviation will remain unchanged.
Example: Population of size N = 4
Yi = 2, 4, 6, 8
    μ = Σᵢ₌₁ᴺ Yᵢ / N = 20 / 4 = 5

    σ² = [Σ Yᵢ² − (Σ Yᵢ)² / N] / N = (120 − 100) / 4 = 5

    σ = 2.24

Now add 10 to each observation.
Example: Population size still N = 4
Yi = 12, 14, 16, 18
    μ = Σᵢ₌₁ᴺ Yᵢ / N = 60 / 4 = 15

    σ² = [Σ Yᵢ² − (Σ Yᵢ)² / N] / N = (920 − 900) / 4 = 5

    σ = 2.24

The mean increased by 10 units while the variance and standard deviation did not change.
NOTE that “a” may be either negative or positive, so we can add or subtract a constant from
all values of Yᵢ. If we took the values Yᵢ = 12, 14, 16, 18 and subtracted 10 from each
value we would reverse the previous example.
When subtracting, the mean is REDUCED by the value subtracted (here the mean would be
ten less), and the variance and standard deviation remain unchanged.

Another theorem
If each observation Yᵢ is multiplied by a constant “a”, then the mean of the data set is “a” times
the old mean, the new variance is “a2” times the old variance and the standard deviation is
“a” times the old standard deviation.
Example: using the same Population as before; N = 4
Yᵢ = 2, 4, 6, 8;   μ = 5;   σ² = 5;   σ = 2.24
Let “a” be 10, so we multiply each observation by 10.
Yi = 20, 40, 60, 80
    μ = Σᵢ₌₁ᴺ Yᵢ / N = 200 / 4 = 50, which is equal to aμ = 10(5) = 50

    σ² = [Σ Yᵢ² − (Σ Yᵢ)² / N] / N = (12000 − 10000) / 4 = 500, which is a²σ² = 10²(5) = 500

    σ = 22.4, which is 10(2.24) = 22.4, or √500 = 22.4

NOTE that “a” may also be an inverse (i.e. 1/a instead of a), so we can multiply or divide all
values of Yᵢ by any constant.
If we took the values Yᵢ = 20, 40, 60, 80 and divided each Yᵢ by 10, we would reverse
the previous example.
For division, the mean is divided by the value “a” (1/10), the variance divided by “a2”
(1/100), and the standard deviation divided by “a” (1/10).

The transformation operations may be used in combination.
Example: Population of size N = 3
Yᵢ = 10, 20, 30;   μ = 20;   σ² = 66.67;   σ = 8.16
The transformation is “divide by 10 (or multiply by 1/10) and subtract 2”.
Yᵢ′ = –1, 0, 1 (much easier to work with)

    μ′ = Σ Yᵢ′ / N = 0 / 3 = 0

    σ′² = [Σ Yᵢ′² − (Σ Yᵢ′)² / N] / N = (2 − 0) / 3 = 0.667

    σ′ = 0.816
Note that order is important. To get back the original values we must reverse the
transformation, undoing the steps in reverse order. Above we (1) divided and then (2)
subtracted, so to decode we (1) add the 2 back and then (2) multiply by 10.
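Both theorems, and the need to undo the steps in reverse order, can be checked numerically; a minimal Python sketch using the N = 3 example population:

```python
Y = [10, 20, 30]

def mean_var(data):
    """Population mean and variance (divisor N, as in the example)."""
    m = sum(data) / len(data)
    v = sum((y - m) ** 2 for y in data) / len(data)
    return m, v

m, v = mean_var(Y)                   # 20.0 and 66.67 (approx)
coded = [y / 10 - 2 for y in Y]      # (1) divide by 10, (2) subtract 2
m2, v2 = mean_var(coded)

print(coded)        # [-1.0, 0.0, 1.0]
print(m2, v2)       # mean 0.0; variance 66.67 / 100 = 0.667 (approx)

# decoding must reverse the order: add the 2 back FIRST, then multiply by 10
print([(y + 2) * 10 for y in coded])   # [10.0, 20.0, 30.0]
```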