Statistical Methods I (EXST 7005)
Introduction
Course Objectives
The objectives of this course are to provide the student with an understanding of elemental
statistics and to develop the ability to understand and apply basic statistical procedures.
Initially we will develop some fundamental concepts of statistics and express data using some
basic descriptive methods.
As part of the course we will develop a framework of statistical notation and terminology. This is
necessary to understand advanced statistical methodology in the published literature and for
communication with statisticians providing statistical support.
We will further develop an understanding and appreciation of statistical inference, particularly
hypothesis testing concepts, in order to understand the role of statistics in the decision making
processes. The early topics covered involve concepts that will form the foundation for
understanding hypothesis testing. We will develop procedures for basic statistical tests of
hypothesis and estimation in order to understand and utilize basic statistical methods.
Finally, we will use modern statistical software to apply statistics. Although simple software, such
as spreadsheets, is useful for some basic statistical analyses, eventually the user will be
severely limited if such tools are the only applications available. This course is intended to take
the user through introductory statistics and into the more advanced statistical analysis typically
needed for scientific research. The software we have used is SAS® 9.1.3, 2004.
The Scientific Method
Understanding the importance of statistics to a myriad of scientific disciplines depends, in part, on
recognizing its role in the scientific method. The scientific method is a way of approaching an
investigation, and statistics is an integral part of the method. The scientific method is described
below as a five-step process.
1. REVIEW and OBSERVATION: This includes all mechanisms by which a scientist
becomes knowledgeable and formulates concepts in his discipline including literature
searches, formal course work, laboratory and field observations, and communication with
other scientists.
2. HYPOTHESIS: The researcher develops a testable contention about the functioning of
some aspect of his discipline. The hypothesis often involves the comparison of the
performance of two or more categories, such as comparisons between environments, plant
varieties, educational alternatives, pharmaceutical applications or agricultural practices.
Statistical concepts are involved here.
3. EXPERIMENT: The researcher plans and executes an experiment designed to test the
hypothesis that has been developed. Statistical techniques are involved here.
4. EVALUATE the HYPOTHESIS: This involves the analysis of the data gathered in the
experiment, and should result in the confirmation or rejection of the hypothesis. This is a
statistical application.
5. DRAW CONCLUSIONS: Based on the initial understanding of the situation, and on the
results of the experimental procedure conducted, the researcher will state a conclusion.
Conclusions and interpretation of the results should be stated in the context of the original
field of study and may not appear to be inherently statistical.
James P. Geaghan Copyright 2010
Some Areas of Statistics
• Descriptive Statistics – graphs and charts
• Exploratory Statistics – a group varying from descriptive statistics to multivariate analyses
• Designed research studies – a variety of scientific experiments
    o Experimental Design – investigator controls individuals (the objective is usually a
      comparison). These studies often involve statistical testing of differences.
    o Sample Survey – individuals are not controlled (e.g. find out “How many” or “How
      much”). These studies may test for differences or estimate values with confidence
      intervals.
Organizing, Tabulating and Summarizing Data
The first order of business in many scientific endeavors is to find some expression of the data.
This may be included in a report as an end in itself, or be used to better understand the results of a
study and to guide further analysis.
• Descriptive Statistics or Exploratory Statistics
• Frequency distributions
• Graphs and histograms
• Pie charts and star charts
• Drawing Conclusions and Assessing reliability – This will be our main concern
Definitions (you do not have to know these terms verbatim)
• STATISTICAL INFERENCE is the drawing of a conclusion from incomplete information
    o DEDUCTIVE Inference – conclude about a part from knowledge of the whole
      population
    o INDUCTIVE Inference – conclude about the whole from a part
• CONSTANT – a quantity or characteristic whose value remains constant from one individual
  to another
• VARIABLE – a quantity or characteristic whose value changes from one individual to another
• OBSERVATION – the measurement of some characteristic or variable on an individual
• DATA – a set of observations taken from a group of individuals being studied
• POPULATION – all possible individuals on which a variable may be measured. The total
  group (as defined by the investigator) about which inferences are to be made.
• SAMPLE – a finite number (subset) of individuals selected from a population for study in a
  given experiment.
• SAMPLE SIZE – the number of observations or measurements in the sample, usually
  designated n.
• RANDOM SAMPLE – a sample drawn in such a way that every individual in the population
  has an equal chance of being included in the sample.
• PARAMETER – a summary number that describes a population. It is a constant since it
  involves measurement of every individual in the population (e.g. μ or σ² or β).
• STATISTIC – a summary number that describes a sample. It is a variable since many different
  samples can be drawn from a population. A statistic is used to estimate a parameter (e.g. Ȳ or
  S² or b).
• EXPERIMENT – a planned inquiry to obtain new knowledge or confirm or deny results of
  previous experiments.
• TREATMENT – a procedure whose effect is to be measured or compared with other
  treatments.
• EXPERIMENTAL UNIT – the unit to which one application of the treatment is applied.
• SAMPLING UNIT – the unit on which the effect of the treatment is measured. This may be
  the same as the experimental unit, or smaller than the experimental unit.
CLASSIFICATION OF VARIABLES
• QUALITATIVE – each individual belongs to one of several mutually exclusive categories.
    o Ordinal scale – ranked category variables; e.g. small, medium, large
    o Nominal scale – a classification or group; e.g. male, female
• QUANTITATIVE – an observation resulting from a true numerical measurement.
    o CONTINUOUS – a quantitative variable for which all values within some range are
      possible (e.g. height, weight, depth). These variables are often grouped in intervals.
    o DISCRETE – a quantitative variable which does not take on all values in a continuum;
      often the variable can assume integer values only (e.g. number of objects or
      individuals).
Symbolic Notation
Greek letters are used to indicate PARAMETERS; Latin (English) letters are used to indicate
STATISTICS.

Quantity                   PARAMETER   STATISTIC
Means                      μ           X̄, Ȳ
Standard deviations        σ           S
Slopes                     β           b
Correlation                ρ           r
Experimental treatments    τ           t

Other Symbolic Notation
• Letters at the beginning of the alphabet are used for CONSTANTS (a, b, c)
• Letters at the end of the alphabet are used for VARIABLES (X, Y, Z)
• Letters in the middle of the alphabet (i, j, k, l) are used as subscripts, often italicized (e.g.
  $X_i$ and $Y_{ijk}$)
Means and measures of central tendency
The mean, or arithmetic “average”, is an important statistic in characterizing data. It is a measure that
provides an indication of where the center of the distribution lies and is an important reference
point. In hypothesis testing it is usually the means that are compared to determine if two samples
are potentially drawn from the same population or not. Variances and other parameters can also
be tested to compare populations, but the test of the means is more common.
Summation Operations
The symbol Σ is used to represent summation. Given a variable, $Y_i$, representing a series of
observations from $Y_1$ (the first observation) to $Y_n$ (the last observation out of “n”
observations), the notation $\sum Y_i$ represents the sum of all of the $Y_i$ values from the first to the
last. Since the summation is for values of i from 1 to n, the summation sign is often
subscripted with “i = 1” and superscripted with an n (e.g. $\sum_{i=1}^{n} Y_i$).
Example of Summation: A variable “length of Bluegill in centimeters” is measured for individuals
captured in a seine. This quantitative variable will be called “Y”, and the number of
individuals captured will be represented by “n”.
For this example let n = 4
The variable $Y_i$ is subscripted in order to distinguish between the individual fish (i)
$Y_1 = 3$, $Y_2 = 4$, $Y_3 = 1$, $Y_4 = 2$
Summation operation: To indicate the sum over all individuals in the sample (size n), write
$\sum_{i=1}^{n} Y_i = Y_1 + Y_2 + Y_3 + Y_4 = 3 + 4 + 1 + 2 = 10$, where n = 4,
and the mean is given by $\frac{\sum_{i=1}^{n} Y_i}{n} = \frac{10}{4} = 2.5$
Sum of Squares
Two other values that will have to be calculated are the “sum of the squares” and the “square of
the sums”. To indicate the sum of squared numbers, simply indicate the square of the
variable after the summation notation.
$\sum_{i=1}^{n} Y_i^2 = Y_1^2 + Y_2^2 + Y_3^2 + Y_4^2 = 3^2 + 4^2 + 1^2 + 2^2 = 9 + 16 + 1 + 4 = 30$, where n = 4
This is called the Sum of Squares, and should not be confused with the ...
Square of the Sum: The sum was $\sum_{i=1}^{n} Y_i = 10$
The square of the sum is given by simply squaring the sum, $\left( \sum_{i=1}^{n} Y_i \right)^2 = 10^2 = 100$.
Both of these calculations will be needed in calculating the variance.
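The hand calculations above can be checked numerically. The following is a minimal Python sketch and is not part of the original course materials, which use SAS; the variable names are illustrative assumptions. It computes the sum, the mean, the sum of squares and the square of the sum for the four fish lengths.

```python
# Lengths (cm) of the n = 4 Bluegill from the example: Y1, Y2, Y3, Y4.
lengths = [3, 4, 1, 2]

n = len(lengths)                                # sample size
total = sum(lengths)                            # sum of the Y_i
mean = total / n                                # sum of the Y_i divided by n
sum_of_squares = sum(y ** 2 for y in lengths)   # square each value, then sum
square_of_sum = total ** 2                      # sum the values, then square

print(n, total, mean, sum_of_squares, square_of_sum)
# prints: 4 10 2.5 30 100
```

The order of operations is the whole difference between the last two quantities: squaring before summing gives 30, summing before squaring gives 100.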
Measures of Central Tendency
These measures provide an indication of location on a scale. The most common measure is called
the arithmetic mean or the “average”. It is the sum of all observations of the variable of
interest ($\sum_{i=1}^{n} Y_i$) divided by the number of values summed (n).
Calculation of the Mean
Example: The calculation of the mean for 4 fish lengths.
It was previously determined that
$\sum_{i=1}^{n} Y_i = Y_1 + Y_2 + Y_3 + Y_4 = 3 + 4 + 1 + 2 = 10$, where n = 4
The mean is given by $\frac{\sum_{i=1}^{n} Y_i}{n} = \frac{10}{4} = 2.5$
For a larger sample of fish
$Y_i$ = 7, 9, 9, 3, 6, 5, 0, 7, 0, 7
n = 10
$\sum Y_i$ = (7 + 9 + 9 + 3 + 6 + 5 + 0 + 7 + 0 + 7) = 53
The mean is then $\frac{\sum_{i=1}^{n} Y_i}{n} = \frac{7 + 9 + 9 + 3 + 6 + 5 + 0 + 7 + 0 + 7}{10} = \frac{53}{10} = 5.3$.
Other measures of central tendency
MEDIAN – the central-most observation in a ranked (ordered or sorted) set of observations.
If the number of observations is even, take the mean of the two central-most observations.
Example: for the larger fish sample used earlier, rank the observations
$Y_i$ = 0, 0, 3, 5, 6, 7, 7, 7, 9, 9
If a single observation was in the center it would be used as the median. In this case the
number of observations is even and the center falls between two numbers, 6 and 7, so
calculate the mean of those two numbers.
MEDIAN = (6 + 7) / 2 = 6.5
MODE – the value of the most frequently occurring observation
Example: For the fish sample,
Y = 0, 0, 3, 5, 6, 7, 7, 7, 9, 9
The most frequently occurring value was “7”.
Therefore, the MODE = 7
MIDRANGE – average of the largest and smallest observations.
Example: The smallest observation in the fish sample was 0 and the largest was 9. The
midrange is calculated as the midpoint between these values
MIDRANGE = (0 + 9) / 2 = 4.5
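The four measures calculated so far for the ten-fish sample can be reproduced with Python's standard `statistics` module. This is a sketch for checking the hand calculations, not part of the original course materials; the variable names are illustrative assumptions.

```python
import statistics

# The larger fish sample from the example (n = 10).
fish = [7, 9, 9, 3, 6, 5, 0, 7, 0, 7]

mean = statistics.mean(fish)             # sum of the values divided by n
median = statistics.median(fish)         # mean of the two central values, since n is even
mode = statistics.mode(fish)             # most frequently occurring value
midrange = (min(fish) + max(fish)) / 2   # midpoint of the smallest and largest values

print(mean, median, mode, midrange)
# prints: 5.3 6.5 7 4.5
```

The data do not need to be ranked beforehand; `statistics.median` sorts a copy internally.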
Do not make the mistake of subtracting the lower value from the higher value and
dividing by 2. That would be half of the RANGE, not the MIDRANGE.
Percentiles and Quartiles
Percentiles – the value of an observation that has a given percent of the observations below that
value and the remaining observations above that value.
The 50th percentile is the value where 50% of the sample observations would have values
below it and 50% would be above it. This is also known as the median.
It is often useful to know what value has 5% of the observations below it and 95% above it.
This is the 5th percentile. Conversely the 95th percentile is the observation whose value
exceeds 95% of the observations in the data set and is exceeded by 5% of the values.
Likewise, the value of the 75th percentile would have 75% of the observations below the value
and 25% above.
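Percentile cut points can be computed with `statistics.quantiles` from Python's standard library (Python 3.8 or later). The sketch below is an illustration rather than part of the original notes: it finds the 25th, 50th and 75th percentiles of the ten-fish sample. Software packages use different interpolation rules for percentiles in small samples, so the 25th and 75th values may not match those from another package or from a hand calculation. The 50th percentile should agree with the median calculated earlier.

```python
import statistics

# The ranked ten-fish sample from the median example.
fish = [0, 0, 3, 5, 6, 7, 7, 7, 9, 9]

# n=4 requests the three cut points that divide the data into four
# equal parts, i.e. the 25th, 50th and 75th percentiles.
p25, p50, p75 = statistics.quantiles(fish, n=4)

print("25th percentile:", p25)
print("50th percentile:", p50)
print("75th percentile:", p75)

# The 50th percentile is the median.
print("equals median:", p50 == statistics.median(fish))
```

Passing `method="inclusive"` to `statistics.quantiles` selects a different interpolation rule; for this sample it changes the 25th and 75th percentiles but not the 50th.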
Quartiles – the values that have one, two or three quarters of the observations below them
and the remaining observations above them.
The first quartile is the value of the observation that has one quarter of the observations below
it and three quarters above the value. It is the 25th percentile.
The second quartile is the value of the observation that has half (two quarters) of the
observations below and above. This value is the same as the MEDIAN or 50th percentile.
The third quartile is the value of the observation that has three quarters of the observations
below and one quarter of the observations above the value. It is the 75th percentile.
Which measure of Central Tendency is best?
This depends on the distribution. If the distribution is monomodal and symmetric then the
MEAN = MEDIAN = MODE = MIDRANGE
This is true for the NORMAL bell-shaped curve.
Bimodal distributions are not well described by any measure of central tendency, particularly a
single MODE.
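Both statements can be checked on small samples. In the sketch below the two data sets are hypothetical, made up only for illustration: one is symmetric with a single mode and the other has two modes. The helper name `central_measures` is likewise an assumption. `statistics.multimode` (Python 3.8 or later) returns every value tied for most frequent.

```python
import statistics

def central_measures(values):
    """Return the mean, median, list of modes and midrange of a sample."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),
        "midrange": (min(values) + max(values)) / 2,
    }

# Hypothetical symmetric sample with one mode: all four measures equal 3.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
print(central_measures(symmetric))

# Hypothetical bimodal sample: the mean, median and midrange are all 3,
# yet the observations pile up at the two modes, 1 and 5.
bimodal = [1, 1, 1, 2, 3, 4, 5, 5, 5]
print(central_measures(bimodal))
```

In the bimodal sample the value 3 is reported as the center even though it occurs only once, which is why a single summary number describes such data poorly.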