EXST7005 Fall2010 02 Central Tendency 01



Statistical Methods I (EXST 7005)

Introduction

Course Objectives

The objectives of this course are to provide the student with an understanding of elemental statistics and to develop the ability to understand and apply basic statistical procedures.

Initially we will develop some fundamental concepts of statistics and express data using some basic descriptive methods. As part of the course we will develop a framework of statistical notation and terminology. This is necessary to understand advanced statistical methodology in the published literature and for communication with statisticians providing statistical support.

We will further develop an understanding and appreciation of statistical inference, particularly hypothesis testing concepts, in order to understand the role of statistics in the decision making processes. The early topics covered involve concepts that will form the foundation for understanding hypothesis testing. We will develop procedures for basic statistical tests of hypothesis and estimation in order to understand and utilize basic statistical methods.

Finally, we will use modern statistical software to apply statistics. Although simple software, such as a spreadsheet, is useful for some basic statistical analyses, eventually the user will be severely limited if such tools are the only applications available. This course is intended to take the user through introductory statistics and into more advanced statistical analysis typically needed for scientific research. The software we have used is SAS® 9.1.3, 2004.

The Scientific Method

Understanding the importance of statistics to a myriad of scientific disciplines depends, in part, on recognizing its role in the scientific method. The scientific method is a way of approaching an investigation, and statistics is an integral part of the method. The scientific method is described below as a 5 step process.

1. REVIEW and OBSERVATION: This includes all mechanisms by which a scientist becomes knowledgeable and formulates concepts in his discipline including literature searches, formal course work, laboratory and field observations, and communication with other scientists.
2. HYPOTHESIS: The researcher develops a testable contention about the functioning of some aspect of his discipline. The hypothesis often involves the comparison of the performance of two or more categories, such as comparisons between environments, plant varieties, educational alternatives, pharmaceutical applications or agricultural practices. Statistical concepts are involved here.
3. EXPERIMENT: The researcher plans and executes an experiment designed to test the hypothesis that has been developed. Statistical techniques are involved here.
4. EVALUATE the HYPOTHESIS: This involves the analysis of the data gathered in the experiment, and should result in the confirmation or rejection of the hypothesis. This is a statistical application.
5. DRAW CONCLUSIONS: Based on the initial understanding of the situation, and on the results of the experimental procedure conducted, the researcher will state a conclusion. Conclusions and interpretation of the results should be stated in the context of the original field of study and may not appear to be inherently statistical.

James P. Geaghan Copyright 2010

Some Areas of Statistics

• Descriptive Statistics – graphs and charts
• Exploratory Statistics – a group varying from descriptive statistics to multivariate analyses
• Designed research studies – a variety of scientific experiments
  o Experimental Design – investigator controls individuals (the objective is usually a comparison). These studies often involve statistical testing of differences.
  o Sample Survey – individuals are not controlled (e.g. find out “How many” or “How much”). These studies may test for differences or estimate values with confidence intervals.

Organizing, Tabulating and Summarizing Data

The first order of business in many scientific endeavors is to find some expression of the data. This may be included in a report as an end in itself, or be used to better understand the results of a study to guide further analysis.

• Descriptive Statistics or Exploratory Statistics
• Frequency distributions
• Graphs and histograms
• Pie charts and star charts
• Drawing Conclusions and Assessing reliability – This will be our main concern

Definitions (you do not have to know these terms verbatim)

• STATISTICAL INFERENCE is the drawing of a conclusion from incomplete information
  o DEDUCTIVE Inference – conclude about a part from knowledge of the whole population
  o INDUCTIVE Inference – conclude about the whole from a part
• CONSTANT – a quantity or characteristic whose value remains constant from one individual to another
• VARIABLE – a quantity or characteristic whose value changes from one individual to another
• OBSERVATION – the measurement of some characteristic or variable on an individual
• DATA – a set of observations taken from a group of individuals being studied
• POPULATION – all possible individuals on which a variable may be measured. The total group (as defined by the investigator) about which inferences are to be made.
• SAMPLE – a finite number (subset) of individuals selected from a population for study in a given experiment.
• SAMPLE SIZE – the number of observations or measurements in the sample, usually designated n.
• RANDOM SAMPLE – a sample drawn in such a way that every individual in the population has an equal chance of being included in the sample.
• PARAMETER – a summary number that describes a population. It is a constant since it involves measurement of every individual in the population (e.g. $\mu$ or $\sigma^2$ or $\beta$).
• STATISTIC – a summary number that describes a sample. It is a variable since many different samples can be drawn from a population. A statistic is used to estimate a parameter (e.g. $\bar{Y}$ or $S^2$ or $b$).
• EXPERIMENT – a planned inquiry to obtain new knowledge or confirm or deny results of previous experiments.
• TREATMENT – a procedure whose effect is to be measured or compared with other treatments.
• EXPERIMENTAL UNIT – the unit to which one application of the treatment is applied.
• SAMPLING UNIT – the unit on which the effect of the treatment is measured. This may be the same as the experimental unit, or smaller than the experimental unit.

CLASSIFICATION OF VARIABLES

• QUALITATIVE – each individual belongs to one of several mutually exclusive categories.
  o Ordinal scale – ranked category variables; e.g. small, medium, large
  o Nominal scale – a classification or group; e.g. male, female
• QUANTITATIVE – an observation resulting from a true numerical measurement.
  o CONTINUOUS – a quantitative variable for which all values within some range are possible (e.g. height, weight, depth). These variables are often grouped in intervals.
  o DISCRETE – a quantitative variable which does not take on all values in a continuum; often the variable can assume integer values only (e.g. number of objects or individuals).

Symbolic Notation

Greek letters are used to indicate PARAMETERS; Latin (English) letters are used to indicate STATISTICS.

  PARAMETER (Greek)                    STATISTIC (Latin)
  $\mu$     (means)                    $\bar{X}$, $\bar{Y}$  (means)
  $\sigma$  (standard deviations)      $S$  (standard deviations)
  $\beta$   (slopes)                   $b$  (slopes)
  $\rho$    (correlation)              $r$  (correlation)
  $\tau$    (experimental treatments)  $t$  (experimental treatments)

Other Symbolic Notation

• Letters at the beginning of the alphabet are used for CONSTANTS (a, b, c)
• Letters at the end of the alphabet are used for VARIABLES (X, Y, Z)
• Letters in the middle of the alphabet (i, j, k, l) are used as subscripts, often italicized (e.g. $X_i$ and $Y_{ijk}$)

Means and measures of central tendency

The mean, or arithmetic “average”, is an important statistic in characterizing data. It is a measure that provides an indication of where the center of the distribution lies and is an important reference point. In hypothesis testing it is usually the means that are compared to determine if two samples are potentially drawn from the same population or not. Variances and other parameters can also be tested to compare populations, but the test of the means is more common.

Summation Operations

The symbol $\Sigma$ is used to represent summation. Given a variable, $Y_i$, representing a series of observations from $Y_1$ (the first observation) to $Y_n$ (the last observation out of “n” observations), the notation $\Sigma Y_i$ represents the sum of all of the $Y_i$ values from the first to the last. Since the summation is for values of i from 1 to n, the summation sign is often subscripted with “i = 1” and superscripted with an n (e.g. $\sum_{i=1}^{n} Y_i$).

Example of Summation: A variable “length of Bluegill in centimeters” is measured for individuals captured in a seine. This quantitative variable will be called “Y”, and the number of individuals captured will be represented by “n”. For this example let n = 4. The variable $Y_i$ is subscripted in order to distinguish between the individual fish (i):

$Y_1 = 3, \quad Y_2 = 4, \quad Y_3 = 1, \quad Y_4 = 2$

Summation operation: To indicate the sum over all individuals in the sample (of size n), write

$\sum_{i=1}^{n} Y_i = Y_1 + Y_2 + Y_3 + Y_4 = 3 + 4 + 1 + 2 = 10$, where n = 4,

and the mean is given by

$\frac{\sum_{i=1}^{n} Y_i}{n} = \frac{10}{4} = 2.5$

Sum of Squares

Two other values that will have to be calculated are the “sum of the squares” and the “square of the sum”. To indicate the sum of squared numbers, simply indicate the square of the variable after the summation notation.

$\sum_{i=1}^{n} Y_i^2 = Y_1^2 + Y_2^2 + Y_3^2 + Y_4^2 = 3^2 + 4^2 + 1^2 + 2^2 = 9 + 16 + 1 + 4 = 30$, where n = 4

This is called the Sum of Squares, and should not be confused with the Square of the Sum. The sum was $\sum_{i=1}^{n} Y_i = 10$. The square of the sum is given by simply squaring the sum:

$\left( \sum_{i=1}^{n} Y_i \right)^2 = 10^2 = 100$

Both of these calculations will be needed in calculating the variance.

Measures of Central Tendency

These measures provide an indication of location on a scale. The most common measure is called the arithmetic mean or the “average”. It is the sum of all observations of the variable of interest ($\sum_{i=1}^{n} Y_i$) divided by the number of values summed (n).

Calculation of the Mean

Example: The calculation of the mean for 4 fish lengths. It was previously determined that

$\sum_{i=1}^{n} Y_i = Y_1 + Y_2 + Y_3 + Y_4 = 3 + 4 + 1 + 2 = 10$, where n = 4

The mean is given by

$\frac{\sum_{i=1}^{n} Y_i}{n} = \frac{10}{4} = 2.5$

For a larger sample of fish, $Y_i$ = 7, 9, 9, 3, 6, 5, 0, 7, 0, 7 and n = 10, so

$\sum_{i=1}^{n} Y_i = 7 + 9 + 9 + 3 + 6 + 5 + 0 + 7 + 0 + 7 = 53$

The mean is then

$\frac{\sum_{i=1}^{n} Y_i}{n} = \frac{53}{10} = 5.3$

Other measures of central tendency

MEDIAN – the central-most observation in a ranked (ordered or sorted) set of observations. If the number of observations is even, take the mean of the two center-most observations.

Example: for the fish sample used earlier, rank the observations: $Y_i$ = 0, 0, 3, 5, 6, 7, 7, 7, 9, 9. If a single observation were in the center it would be used as the median. In this case the number of observations is even and the center falls between two numbers, 6 and 7, so calculate the mean of those two numbers.

MEDIAN = (6 + 7) / 2 = 6.5

MODE – the value of the most frequently occurring observation.

Example: For the fish sample, Y = 0, 0, 3, 5, 6, 7, 7, 7, 9, 9, the most frequently occurring value was “7”. Therefore, the MODE = 7.

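The summation, mean, median and mode calculations worked by hand above can be checked with a few lines of code. The following is a minimal sketch in Python (the course itself uses SAS, so the language and every variable name here are illustrative assumptions rather than part of the course materials); it relies only on Python's standard `statistics` module.

```python
from statistics import mean, median, mode

# Bluegill lengths (cm) from the four-fish example; the names are hypothetical.
small = [3, 4, 1, 2]

total = sum(small)                            # sum of Y_i
sum_of_squares = sum(y ** 2 for y in small)   # sum of Y_i squared
square_of_sum = total ** 2                    # (sum of Y_i) squared
small_mean = total / len(small)               # sum divided by n

# The larger ten-fish sample from the text.
larger = [7, 9, 9, 3, 6, 5, 0, 7, 0, 7]

larger_mean = mean(larger)      # sum of the values divided by n
larger_median = median(larger)  # averages the two middle values when n is even
larger_mode = mode(larger)      # most frequently occurring value

print(total, sum_of_squares, square_of_sum, small_mean)
print(larger_mean, larger_median, larger_mode)
```

Note that the sum of squares and the square of the sum are computed in different orders (square then add, versus add then square), which is why they give different results for the same data.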
MIDRANGE – the average of the largest and smallest observations.

Example: The smallest observation in the fish sample was 0 and the largest was 9. The midrange is calculated as the midpoint between these values.

MIDRANGE = (0 + 9) / 2 = 4.5

Do not make the mistake of subtracting the lower value from the higher value and dividing by 2. That would be half of the RANGE, not the MIDRANGE. In this sample half of the range, (9 - 0) / 2, also happens to equal 4.5, but only because the smallest observation is 0; for the four-fish sample (1, 2, 3, 4) the midrange is 2.5 while half of the range is 1.5.

Percentiles and Quartiles

Percentiles – the value of an observation that has a given percent of the observations below that value and the remaining observations above that value.

The 50th percentile is the value where 50% of the sample observations would have values below it and 50% would be above it. This is also known as the median. It is often useful to know what value has 5% of the observations below it and 95% above it. This is the 5th percentile. Conversely, the 95th percentile is the observation whose value exceeds 95% of the observations in the data set and is exceeded by 5% of the values. Likewise, the value of the 75th percentile would have 75% of the observations below the value and 25% above.

Quartiles – the values that divide the ranked observations into quarters, with one, two or three quarters of the observations below them.

The first quartile is the value of the observation that has one quarter of the observations below it and three quarters above it. It is the 25th percentile. The second quartile is the value of the observation that has half (two quarters) of the observations below it and half above it. This value is the same as the MEDIAN or 50th percentile. The third quartile is the value of the observation that has three quarters of the observations below it and one quarter of the observations above it. It is the 75th percentile.

Which measure of Central Tendency is best?

This depends on the distribution. If the distribution is monomodal and symmetric then the MEAN = MEDIAN = MODE = MIDRANGE. This is true for the NORMAL bell-shaped curve. Bimodal distributions are not well described by any measure of central tendency, particularly a single MODE.
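The midrange, the quartiles and the symmetric-distribution claim can be illustrated with a short Python sketch. The helper functions `midrange` and `half_range` and the small `symmetric` data set are hypothetical, introduced only for this illustration. Quartiles of a small sample depend on the interpolation convention used; the sketch assumes the "inclusive" method of Python's `statistics.quantiles`, and other software (including SAS) may report slightly different first and third quartiles for the same data.

```python
from statistics import mean, median, mode, quantiles


def midrange(values):
    """Average of the smallest and largest observations."""
    return (min(values) + max(values)) / 2


def half_range(values):
    """Half of the range; easily confused with the midrange."""
    return (max(values) - min(values)) / 2


small = [3, 4, 1, 2]                      # four-fish sample
larger = [7, 9, 9, 3, 6, 5, 0, 7, 0, 7]   # ten-fish sample

small_midrange = midrange(small)       # midpoint between 1 and 4
small_half_range = half_range(small)   # differs from the midrange here
larger_midrange = midrange(larger)     # midpoint between 0 and 9

# Quartiles (25th, 50th, 75th percentiles) under the "inclusive" convention.
q1, q2, q3 = quantiles(larger, n=4, method="inclusive")

# A small symmetric, single-peaked data set (made up for illustration).
symmetric = [1, 2, 3, 3, 3, 4, 5]
centers = (mean(symmetric), median(symmetric),
           mode(symmetric), midrange(symmetric))

print(small_midrange, small_half_range, larger_midrange)
print(q1, q2, q3)
print(centers)
```

Whatever convention is used for the first and third quartiles, the second quartile should agree with the median, and for the symmetric data set all four measures of central tendency coincide.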