Biostatistics_1_2009-2010 - Faculty of Engineering Master...


Faculty of Engineering
Master of Biomedical Engineering

Biostatistics (3 credits)
Theory: 15 hrs
Exercises: 10 hrs
Prof. Dr. ir. C. De Wagter

I. Introduction to descriptive statistics

I. 1. Random variable, proportion

A random variable is a quantity whose value (or outcome) is uncertain and is determined by chance or by a random experiment. The lack of determinism can have two causes: it can be fundamental (quantum physics) or related to the observation (finite precision of measurement or computation). We will mainly consider quantitative random variables. Some are discrete (number of bikes per family), others are continuous (time needed to finish this lesson). A proportion is obtained by normalizing the outcomes of a random variable.

Example of a random variable: the yearly number of fatal road accidents in Belgium per age group of the driver. Statistical data show that older drivers are involved in more accidents. However, these data do not prove that young people drive more safely: there are more older drivers than younger ones. It is therefore imperative to examine, in each age group, the proportion of the number of fatal accidents to the number of drivers. (It then turns out that older people drive more safely.)

I. 2. Frequency distribution

In descriptive statistics, the value of a random variable is the result of an observation, experiment or measurement (in a broad sense). To obtain an overview of the measured data, a frequency distribution can be very helpful. Often, the set of possible outcomes in a population is grouped into a number of intervals or classes (always, in the case of continuous random variables). The intervals are bounded by the class limits. The number of times the random variable takes a value that belongs to a certain class is the (absolute) frequency of that class or, briefly, the class frequency.
The sum of the class frequencies equals the total number of observations. The relative frequency of a class is its class frequency divided by the total number of observations.

I. 3. Collection and interpretation of data

I.3.a. Accuracy

Accuracy is directly related to the margin within which the observed outcome approximates the actual (unknown) value of the measured quantity. The accuracy is therefore also affected by the calibration and resolution (resolving power) of the measuring equipment.

I.3.b. Systematic and random errors, errors of interpretation

Systematic errors affect the outcomes of the measurement in a consistent way. They can be due to a wrong zero adjustment or to calibration errors of the measuring equipment. When they are known (possibly approximately), the statistician can correct for them.

Random errors have, by definition, many independent origins and consequently a small mean. Their effect is reduced by performing many observations. This category of errors is closely related to statistics and probability.

cdw-oct-2005 Biostatistics 2

Errors of interpretation are due to human failure. An example is the above-mentioned wrong choice of random variable, which led to the conclusion that young car drivers are more careful.

I. 4. Organization and presentation of data

I.4.a. Contingency table

A contingency table is obtained by classifying a population (or sample, see further) according to 2 different criteria; a categorical (nominal) random variable is associated with each criterion. Each table cell gives the frequency of the corresponding combination of the two criteria.

I.4.b. Box chart, bar chart

The height of each box or bar represents the frequency of the corresponding measured outcome. Continuous random variables are grouped into classes. The bars, in contrast to the boxes, are normally separated from each other.

I.4.c. Histogram

The histogram is the ideal tool for the graphical representation of the frequency distribution of a continuous random variable.
The horizontal axis is divided into class intervals. Above each interval, a rectangle is drawn whose area (relevant in the case of unequal class interval widths) represents the frequency of the corresponding class. In fact, the heights of the rectangles denote frequency densities (frequency per unit class width). In contrast to the box and bar chart, sparsely populated class intervals can be grouped together by using unequal interval widths.

Example. The following numbers are observed as a result of 29 measurements: 50, 51, 30, 20, 30, 36, 60, 51, 1, 200, -100, 5, 70, 75, 100, 20, 50, -50, 140, 160, 250, -150, -180, 30, 48, 47, 46, 31, 40. This results in the following histogram:

[Figure 1. Histogram with equal interval width. Horizontal axis: VAR00002, from -200.0 to 250.0 in steps of 50.0; vertical axis: frequency, from 0 to 16. Mean = 40.0, Std. Dev = 88.03, N = 29.]

I.4.d. Pie chart

The surface of a circle (representing 100 %) is divided into sectors that correspond to the classes of a population (or sample), according to a categorical (nominal) random variable. The area of each sector is proportional to the frequency of the corresponding class.

I. 5. Measures that characterize a frequency distribution

I.5.a. Measures of central tendency

The (arithmetic) mean of N outcomes of the random variable x is defined by

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} \tag{1}$$

All observed outcomes have influence. Outliers strongly affect the mean. Graphically, the mean is the outcome that forms the point of equilibrium of the area under the histogram when it is materialized and hung in a field of gravity.

The median is the middle outcome after the outcomes have been arranged in order of magnitude. If the number of outcomes is even, the mean of the two central outcomes is taken. The median divides the area of the histogram into two equal halves.

The mode is the outcome that has the highest frequency.

I.5.b. Measures of dispersion and variability

The variance $\sigma^2$ is the average of the squared deviations of all N outcomes in a population from the mean:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N} \tag{2}$$

Normally, instead of taking the complete population, the statistician restricts himself to a subset of the population, called the sample, which contains only n (< N) observations. As we will see further on, expression (2) then gives a too optimistic, i.e. too small, value of the variance. The reason is that the mean of the sample also contains uncertainty (another sample would normally lead to a different mean). The sample variance $s^2$ is therefore a random variable in itself, which can be efficiently estimated as

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \tag{3}$$

The reader can easily prove that (2) and (3) can also be expressed as

$$\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2}{N} - \bar{x}^2 \tag{4}$$

and

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}{n-1} \tag{5}$$

Formulae (4) and (5) are better suited to calculations by hand. Note that the results of the subtractions on the right-hand side of (4) and (5) are fundamentally non-negative.

The standard deviation $\sigma$ ($s$) is the square root of $\sigma^2$ ($s^2$). Hence, we can write for the sample standard deviation s:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \tag{6}$$

The coefficient of variation, cv, is the ratio of the standard deviation to the mean:

$$cv = \frac{s}{\bar{x}}$$

The quartiles are those values of the random variable that divide the area under the histogram into 4 equal parts. The second quartile is equal to the median. 25% (75%) of the outcomes are lower than the first (third) quartile. The percentiles are defined analogously. The 71st percentile, for instance, is the outcome below which 71% of all ordered outcomes lie.

The mean, standard deviation, quartiles and percentiles are expressed in the same unit as the random variable itself. The variance has the squared unit. The coefficient of variation is unitless.

I.5.c. The boxplot

An interesting graphical tool for exploring observed data is the boxplot.
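Before turning to the plot, the measures of I.5.a and I.5.b can be evaluated numerically for the 29 measurements of the histogram example. The following Python sketch is only an illustration and not part of the course material: the variable names are chosen here, and Python's standard `statistics` module is assumed as a stand-in for a hand calculation. The mean and the sample standard deviation should agree with the values printed with Figure 1 (Mean = 40.0, Std. Dev = 88.03).

```python
import math
import statistics

# The 29 measurements from the histogram example (Figure 1).
data = [50, 51, 30, 20, 30, 36, 60, 51, 1, 200, -100, 5, 70, 75, 100,
        20, 50, -50, 140, 160, 250, -150, -180, 30, 48, 47, 46, 31, 40]
n = len(data)

mean = sum(data) / n                  # formula (1)
median = statistics.median(data)      # middle of the ordered outcomes
mode = statistics.mode(data)          # outcome with the highest frequency

# Variance with divisor n, as in formula (2), and with divisor n - 1, as in formula (3).
var_n = sum((x - mean) ** 2 for x in data) / n
var_n1 = sum((x - mean) ** 2 for x in data) / (n - 1)

# Shortcut formula (5), the form suited to calculation by hand.
var_shortcut = (sum(x * x for x in data) - n * mean ** 2) / (n - 1)

s = math.sqrt(var_n1)                 # formula (6)
cv = s / mean                         # coefficient of variation

print(f"n = {n}, mean = {mean:.2f}, median = {median}, mode = {mode}")
print(f"variance (divisor n) = {var_n:.1f}, variance (divisor n-1) = {var_n1:.1f}")
print(f"s = {s:.2f}, cv = {cv:.2f}")
```

The divisor n version is always the smaller of the two variances, which is the "too optimistic" value mentioned for expression (2) when it is applied to a sample.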
Figure 2 gives a boxplot for the example considered in Figure 1. The box contains 50% of the outcomes and lies between the 1st quartile and the 3rd quartile. The median is indicated in the box. An off-center position of the median indicates a skew frequency distribution. The whiskers extend to the most distant outcomes that are neither outliers nor extreme values. Outliers lie at a distance between 1.5 and 3 times the box height (3rd quartile - 1st quartile) from the nearest edge of the box. Extreme values are still more distant from the box.

[Figure 2. Boxplot of VAR00002 (N = 29); vertical axis from -300 to 300 (o = outlier; * = extreme value; followed by the ranking number of the outcome). Marked outcomes: nos. 21, 10, 20 and 19 above the box; nos. 18, 11, 22 and 23 below the box.]

I.5.d. Adjustments for grouped observations

In the previous formulae (1) to (6), we tacitly assumed that measurements with the same outcome $x_i$ were repeated in the summation. If, however, we consider each possible outcome $x_i$ only once, we should take into account its frequency $f_i$ (or relative frequency $f_i'$). This also applies when the outcomes are grouped in classes. Then $f_i$ represents the class frequency and $x_i$ the class center. Formulae (1) to (6) then respectively become (N is the total number of observations, n in the case of a sample):

$$\bar{x} = \frac{\sum_{\forall i} f_i\, x_i}{N} \tag{1'}$$

$$\sigma^2 = \frac{\sum_{\forall i} f_i\, (x_i - \bar{x})^2}{N} \tag{2'}$$

$$s^2 = \frac{\sum_{\forall i} f_i\, (x_i - \bar{x})^2}{n-1} \tag{3'}$$

$$\sigma^2 = \frac{\sum_{\forall i} f_i\, x_i^2}{N} - \bar{x}^2 \tag{4'}$$

$$s^2 = \frac{\sum_{\forall i} f_i\, x_i^2 - n\,\bar{x}^2}{n-1} \tag{5'}$$

and

$$s = \sqrt{\frac{\sum_{\forall i} f_i\, (x_i - \bar{x})^2}{n-1}} \tag{6'}$$

II. Basics of probability

II. 1. Random experiment, event

An experiment is a process by which an outcome is obtained. In the case of a random experiment, the outcome is a priori uncertain. Attaining a specific outcome (or combination of outcomes) constitutes an event. The probability of an event $E_1$ is expressed by a number that lies between 0 and 1:

$$0 \le P(E_1) \le 1 \tag{7}$$

$P(E_1) = 0$ indicates that event $E_1$ cannot occur, while $P(E_1) = 1$ guarantees that event $E_1$ occurs with certainty.
From the viewpoint of descriptive statistics, probability is the limit of the relative frequency for a large number of observations:

$$P(E_1) = \lim_{n \to \infty} \frac{f_{E_1}}{n} \tag{8}$$

where $f_{E_1}$ is the number of times event $E_1$ occurs in n observations. An alternative and practical approach was followed by Laplace: if all possible outcomes of an experiment have the same probability, then the probability of an event is the ratio of the number of "favorable" outcomes to the total number of possible outcomes.

Example 1: We throw a die twice. What is the probability of obtaining "2" and "4" (irrespective of sequence)? Answer: 2/36 = 0.0556. Indeed, there are 36 possible outcomes: (1,1), (1,2), ... , (1,6), (2,1), ..., (6,6). There are 2 favorable outcomes: (2,4) and (4,2).

Example 2: The probability of throwing "4" twice amounts to 1/36 = 0.0278. Indeed, the favorable outcome (4,4) occurs only once in the set of possible outcomes.

II.1.a. Basic rules for probability

Union of two events. The probability that event $E_1$ or event $E_2$ (or both) occurs is the sum of the two related probabilities minus the probability of the intersection:

$$P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2) \tag{9}$$

The term $P(E_1 \cap E_2)$ expresses the probability of the intersection of events $E_1$ and $E_2$, i.e. the probability that the two events occur together.

Complement of an event. The probability that an event $E_1$ does NOT occur is given by:

$$P(\bar{E}_1) = 1 - P(E_1) \tag{10}$$

Conditional probability. The probability of event $E_1$ occurring when it is known that some event $E_2$ has occurred:

$$P(E_1 \mid E_2) = \frac{P(E_1 \cap E_2)}{P(E_2)} \tag{11}$$

Bayes' rule relates the conditional probabilities $P(E_1 \mid E_2)$ and $P(E_2 \mid E_1)$:

$$P(E_1 \mid E_2) = \frac{P(E_2 \mid E_1)\, P(E_1)}{P(E_2)} \tag{12}$$

Exercise: derive (12) from (11).
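The two dice examples and rules (9), (11) and (12) can be checked by listing all 36 outcomes, following Laplace's approach. The Python sketch below is an illustration only. The helper `prob` and the events E1 ("first throw is a 4") and E2 ("the two throws sum to 8") are chosen here for the purpose of the check and do not come from the course text.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of throwing a die twice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Laplace probability: number of favorable outcomes / number of possible outcomes."""
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))

# Example 1: a "2" and a "4", irrespective of sequence.
p_example1 = prob(lambda o: sorted(o) == [2, 4])
# Example 2: "4" twice.
p_example2 = prob(lambda o: o == (4, 4))

# Two illustrative events (assumed here, not taken from the text).
e1 = lambda o: o[0] == 4            # E1: first throw is a 4
e2 = lambda o: o[0] + o[1] == 8     # E2: the two throws sum to 8
p_e1, p_e2 = prob(e1), prob(e2)
p_both = prob(lambda o: e1(o) and e2(o))
p_either = prob(lambda o: e1(o) or e2(o))

assert p_either == p_e1 + p_e2 - p_both                # union rule (9)
p_e1_given_e2 = p_both / p_e2                          # conditional probability (11)
p_e2_given_e1 = p_both / p_e1
assert p_e1_given_e2 == p_e2_given_e1 * p_e1 / p_e2    # Bayes' rule (12)

print(p_example1, p_example2, p_e1_given_e2)
```

Exact fractions are used instead of floating-point numbers so that the identities can be tested with `==`.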
Exercise: interpret the following identity:

$$P(E_1 \mid E_2) + P(\bar{E}_1 \mid E_2) = 1 \tag{13}$$

Exercise: prove that, in general,

$$P(E_1) = P(E_1 \cap E_2) + P(E_1 \cap \bar{E}_2) \tag{14}$$

By applying (11), we obtain from (14):

$$P(E_1) = P(E_1 \mid E_2)\, P(E_2) + P(E_1 \mid \bar{E}_2)\, P(\bar{E}_2) \tag{14'}$$

The previous expression can be generalized to

$$P(B) = \sum_i P(B \mid A_i)\, P(A_i) \tag{14''}$$

if the events $A_i$ are mutually exclusive and their union is the entire sample space, i.e. $\sum_i P(A_i) = 1$. This equation, sometimes called the law of total probability, can be very useful in the denominator of Bayes' rule (12). In many cases, the events $A_i$ are all the possible mutually exclusive causes of event B.

Exercise: the probability that a student is in a plaster cast is 0.05 when he has taken a ski holiday and 0.005 when he has not. We know that 20% of the students take a ski holiday. a) What is the probability that a student is in a plaster cast? b) When I see a student in a plaster cast, what is the probability that he has taken a ski holiday? (Answers: a) 0.014, b) 0.71)

Two events $E_1$ and $E_2$ are independent if and only if $P(E_1 \mid E_2) = P(E_1)$ or $P(E_2 \mid E_1) = P(E_2)$. In words: if they do not affect each other. (Bayes' rule implies that both conditions are equivalent.) From (11) it then follows that

$$P(E_1 \cap E_2) = P(E_1) \cdot P(E_2) \tag{15}$$

Two events $E_1$ and $E_2$ are mutually exclusive when $P(E_1 \cap E_2) = 0$. This normally implies that they cannot occur together.

Exercise: prove that for mutually exclusive events $E_1$ and $E_2$, we have $P(E_1 \mid E_2) = 0$.

II.1.b. Conditional probabilities in medical diagnosis

See "Statistiek in de kliniek: de diagnose doorgelicht," Giard, R.W.M., Natuur en Techniek, vol. 59, pp. 260-271, 1991. Key words: conditional probability, Bayes' rule, sensitivity and specificity, indifferent test.

The contingency table shown on p. 265 can be completed as

                  D+                                          D-                                          Marginal total
T+                P(T+ ∩ D+) = 283/N = 0.291                  P(T+ ∩ D-) = 27/N = 0.028 (false positive)  P(T+) = 310/N = 0.319
T-                P(T- ∩ D+) = 51/N = 0.052 (false negative)  P(T- ∩ D-) = 612/N = 0.629                  P(T-) = 663/N = 0.681
Marginal total    P(D+) = 334/N = 0.343                       P(D-) = 639/N = 0.657                       N = 973

Notice the two combinations for which the test gives a "wrong" prediction: false positive and false negative. In Fig. 12a (b), the curve can be made more ideal by increasing the specificity (sensitivity) of the test. For any test, however, the specificity can only be increased by decreasing the sensitivity, and vice versa (both are linked). The values we choose should be based on a trade-off between the consequences of false positive and false negative test results.

The above contingency table also illustrates equation (14). Denoting

prev = prevalence = $P(D^+)$
sens = sensitivity = $P(T^+ \mid D^+)$
spec = specificity = $P(T^- \mid D^-)$

we can construct the following conditional probabilities, in which both physician and patient are really interested.

For a positive test result (Fig. 12a, p. 271):

$$P(D^+ \mid T^+) = \frac{sens \times prev}{sens \times prev + (1 - spec) \times (1 - prev)} \quad \text{(positive predictive value)}$$

$$P(D^- \mid T^+) = \frac{(1 - spec) \times (1 - prev)}{sens \times prev + (1 - spec) \times (1 - prev)}$$

The sum of these two probabilities is 1.

For a negative test result (Fig. 12b, p. 271):

$$P(D^+ \mid T^-) = \frac{(1 - sens) \times prev}{spec \times (1 - prev) + (1 - sens) \times prev}$$

$$P(D^- \mid T^-) = \frac{spec \times (1 - prev)}{spec \times (1 - prev) + (1 - sens) \times prev} \quad \text{(negative predictive value)}$$

The sum of these two probabilities is again 1.

Exercise. Compute from the above contingency table that sens = 0.85 and spec = 0.96.

Exercise. In Fig. 12a and 12b on p. 271 (for sens = 0.85 and spec = 0.96), the indifferent curve is obtained when the events D and T are independent. What do the above conditional probabilities reduce to then? Interpret.

II.1.c. Discrete and continuous probability distribution, probability density function, cumulative probability distribution

In subsection I. 1, we have seen that a random variable can have a number of outcomes. Each outcome occurs with a certain probability.
The probability or chance that the discrete random variable $\mathbf{x}_k$ takes the value or outcome $x_k$ is written as $P(\mathbf{x}_k = x_k)$ or, briefly, $P(x_k)$. The number of outcomes $x_k$ for a discrete random variable is finite or countably infinite. We distinguish $\mathbf{x}_k$ (the symbol for the random variable) from $x_k$ (a specific outcome of the random variable $\mathbf{x}_k$). $P(\mathbf{x}_k = x_k)$ is, according to subsection II. 1, the probability that the event $\langle \mathbf{x}_k = x_k \rangle$ occurs. As a function of $x_k$, $P(x_k) = P(\mathbf{x}_k = x_k)$ is called the probability distribution. The following two properties hold for the probability distribution of the discrete random variable $\mathbf{x}_k$:

$$0 \le P(x_k) \le 1, \quad \forall k \tag{16}$$

and

$$\sum_{\forall k} P(x_k) = 1 \tag{17}$$

A continuous random variable $\mathbf{x}$ has a continuum of possible outcomes. Strictly speaking, this implies that any specific value x has probability 0 of occurring as outcome of the random variable. We therefore define the probability density p(x) by

$$p(x)\, \Delta x = P\left(x - \frac{\Delta x}{2} \le \mathbf{x} \le x + \frac{\Delta x}{2}\right) \tag{18}$$

where $\Delta x$ denotes a small increment. For each $a \le b$, it follows that

$$P(a \le \mathbf{x} \le b) = \int_a^b p(x)\, dx$$

In words: the probability that $\mathbf{x}$ lies between a and b is given by the area under the curve p(x) bounded by x = a and x = b.

In analogy with (16) and (17), we can write for the probability density p(x):

$$p(x) \ge 0, \quad \forall x \tag{19}$$

and

$$\int_{-\infty}^{+\infty} p(x)\, dx = 1 \tag{20}$$

Note that, in contrast with $P(x_k)$, p(x) may be (locally) larger than 1.

The cumulative probability distributions, for discrete and continuous random variables, are defined as

$$\Phi(x_k) = \sum_{j \le k} P(\mathbf{x}_k = x_j) = \sum_{j \le k} P(x_j) \tag{21}$$

$$\Phi(x) = \int_{-\infty}^{x} p(y)\, dy \tag{22}$$

Both represent the probability that an outcome is lower than or equal to the mentioned value of $x_k$ or x. Both functions increase from 0 to 1 with respect to $x_k$ or x. From equation (22) it follows that

$$p(x) = \frac{\partial \Phi(x)}{\partial x} \tag{23}$$

II.1.d. Measures of central tendency and variability of a probability distribution

This subsection parallels subsection "I. 5. Measures that characterize a frequency distribution".
A frequency distribution is obtained through experiment. A probability distribution or density function, on the other hand, results from a mathematical model of the real world. Example: the probability distribution of the random variable "the number of dots obtained when tossing a die" can be obtained experimentally (by tossing many times) and analytically (considering that each of the 6 sides has the same probability of facing upward when the die comes to rest).

1. The expected value or mean of a random variable. In the discrete case:

$$\mu = E(\mathbf{x}_k) = \sum_{\forall k} P(x_k)\, x_k \tag{24}$$

It is clear that the expected value is closely related to the arithmetic mean of a measured random variable (see equation (1')). In the continuous case, the expected value of the random variable $\mathbf{x}$ is defined as:

$$\mu = E(\mathbf{x}) = \int_{-\infty}^{+\infty} p(x)\, x\, dx \tag{25}$$

2. The median is the value of the random variable that partitions the area under the probability distribution curve into two halves. Analogously, the expected value is the value of the random variable that would balance the materialized area under the same p(x) curve. This explains why the expected value of a skew distribution lies further than the median in the direction of the longer tail. With symmetric distribution curves, the median and expected value coincide on the symmetry axis. The median always has a cumulative probability of 0.5.

3. The mode is the x-value for which p(x) is maximum.

4. In the discrete case, the variance is given by

$$\sigma^2 = E\{(\mathbf{x}_k - \mu)^2\} = \sum_{\forall k} P(x_k)\, (x_k - \mu)^2 \tag{26}$$

and in the continuous case by

$$\sigma^2 = E\{(\mathbf{x} - \mu)^2\} = \int_{-\infty}^{+\infty} p(x)\, (x - \mu)^2\, dx \tag{27}$$

In the above expressions, E{} is a mathematical operator that acts on a random variable which is itself a function of the random variable $\mathbf{x}_k$ or $\mathbf{x}$.

5. The standard deviation $\sigma$ is the square root of the variance $\sigma^2$.

6. The quartiles and percentiles are defined as in subsection I. 5 "Measures that characterize a frequency distribution".

7. The skewness coefficient of a probability distribution is defined as

$$\frac{E\{(\mathbf{x} - \mu)^3\}}{\sigma^3} \tag{28}$$

This coefficient is positive (negative) for distributions with a longer positive (negative) tail.

8. Kurtosis or peakedness:

$$\frac{E\{(\mathbf{x} - \mu)^4\}}{\sigma^4} - 3 \tag{29}$$

The subtraction of 3 ensures that the kurtosis of the Gaussian distribution is zero (see further). The kurtosis is positive when the distribution is more peaked than the Gaussian distribution (for the same standard deviation). This commonly implies that the tails are longer.

II.1.e. Gaussian distribution

The Gaussian distribution or normal distribution of a continuous random variable is symmetric and bell-shaped. There are two parameters, $\mu$ and $\sigma$:

$$p(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2} \tag{30}$$

For each normal distribution, we have:

• The probability of finding the outcome within a distance $\sigma$, $2\sigma$ or $3\sigma$ from the expected value $\mu$ is respectively given by:

$$P(\mu - \sigma \le \mathbf{x} \le \mu + \sigma) = 0.68 \tag{31a}$$

$$P(\mu - 2\sigma \le \mathbf{x} \le \mu + 2\sigma) = 0.95 \tag{31b}$$

$$P(\mu - 3\sigma \le \mathbf{x} \le \mu + 3\sigma) = 0.997 \tag{31c}$$

• Figure 3 displays the effect of $\sigma$ on the shape of the curve.

• Graphically, $\sigma$ is characterized by the points of inflection of the Gaussian curve.

[Figure 3. Gaussian distribution for µ = 0 and i) σ = 1 and ii) σ = 1.5]

Figure 4 shows the standard normal probability distribution, which is normalized such that µ = 0 and σ = 1. The same figure illustrates equations (31a) to (31c). Table 1 lists the corresponding cumulative probability distribution. Each Gaussian distributed random variable $\mathbf{x}$ can be related to the standard normal variable z via

$$z = \frac{x - \mu}{\sigma} \tag{32}$$

Using Table 1, the reader can easily check (31a) to (31c) via

$$1 - 2\,\Phi(\mu - \sigma) = 0.6826 \tag{33a}$$

$$1 - 2\,\Phi(\mu - 2\sigma) = 0.9544 \tag{33b}$$

$$1 - 2\,\Phi(\mu - 3\sigma) = 0.9974 \tag{33c}$$

[Figure 4. Standard normal distribution with indication of σ, 2σ and 3σ]

Exercise.
Determine the intervals (around the mean) in which a Gaussian distributed variable falls with a probability of
• 50 % (the corresponding half interval width is sometimes called the "probable error")
• 90 %
• 95 %.
(Answers: ± 0.674 σ, ± 1.645 σ, ± 1.960 σ).

The sum of 2 independent random variables that are Gaussian distributed according to $(\mu_1, \sigma_1^2)$ and $(\mu_2, \sigma_2^2)$, respectively, is itself Gaussian distributed according to $(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$. (Note that the variances have to be added and NOT the standard deviations!) In general, the sum of n independent random variables that are Gaussian distributed according to $(\mu_1, \sigma_1^2)$, $(\mu_2, \sigma_2^2)$, ... and $(\mu_n, \sigma_n^2)$ is itself Gaussian distributed according to $(\mu_1 + \mu_2 + \ldots + \mu_n, \sigma_1^2 + \sigma_2^2 + \ldots + \sigma_n^2)$.

This implies that the arithmetic mean of n independent observations of the random variable x, all Gaussian distributed according to $(\mu, \sigma^2)$, is itself Gaussian distributed according to $(\mu, \sigma^2/n)$. Indeed,

$$E\left\{ \frac{1}{n} \sum_{i=1}^{n} x_i \right\} = \frac{1}{n} \sum_{i=1}^{n} E\{x_i\} = \frac{1}{n} \sum_{i=1}^{n} \mu = \mu$$

and

$$\sigma_{\bar{x}}^2 = E\left\{ \left( \frac{1}{n} \sum_{i=1}^{n} x_i - \mu \right)^2 \right\} = \frac{1}{n^2}\, E\left\{ \left( \sum_{i=1}^{n} (x_i - \mu) \right)^2 \right\} = \frac{1}{n^2} \sum_{i=1}^{n} E\{(x_i - \mu)^2\} = \frac{1}{n^2} \sum_{i=1}^{n} \sigma^2 = \frac{\sigma^2}{n}$$

In the variance, swapping "E" and "Σ" (replacing the expected value of the squared sum by the sum of the expected squares) is only allowed when the random variables $x_i$ are independent: the cross terms $E\{(x_i - \mu)(x_j - \mu)\}$, $i \ne j$, then vanish.

The central limit theorem, on the other hand, states that the sum of n independent random variables is, provided that n is large, approximately normally distributed irrespective of the individual probability distributions of the random variables. This allows us to generalize the above conclusions to arbitrarily distributed random variables when n is sufficiently large.

The normal distribution is very important in practice, especially because random variables that are influenced by many independent factors become normally distributed to a good approximation (only approximately when the random variable is essentially positive, for instance). This follows from the central limit theorem. Therefore, the random errors discussed in subsection I.3.b will be normally distributed.

When reporting quantitative measurement results, good practice is to supplement the arithmetic mean with a number of standard deviations: $\bar{x} \pm s$ or $\bar{x} \pm 2s$ or $\bar{x} \pm 3s$.

II.1.f. Binomial distribution

With the binomial probability distribution (or Bernoulli distribution), the random variable $\mathbf{x}_k = k$ is discrete and can take the values 0, 1, 2, ..., n:

$$P(k) = C_n^k\, \theta^k (1 - \theta)^{n-k}; \quad k = 0, 1, 2, \ldots, n \tag{34}$$

where the binomial coefficients $C_n^k$ (the number of possible combinations or subsets of k elements in a set of n elements) are given by

$$C_n^k = \frac{n!}{k!\,(n-k)!}; \quad k = 0, 1, 2, \ldots, n \tag{35}$$

Recall that 0! = 1.

The binomial distribution arises when we repeat an experiment (to observe event $E_1$ that occurs with probability $P(E_1) = \theta$) n times. The probability that $E_1$ occurs exactly k times is given by equation (34). The probability distribution is symmetric if θ = 0.5. (Example: the probability of obtaining k heads when tossing a coin n times.)

Exercise. Verify that P(k) is positively skew if θ < 0.5. Consider the probability of obtaining k times "6 dots" when throwing a die 10 times. Make a graph of the probability distribution. This exercise illustrates why the binomially distributed variable k is sometimes called the "number of successes".

For large n, the binomial distribution comes close to the normal distribution (and thus becomes approximately symmetric, even for θ ≠ 0.5).

II.1.g. Poisson distribution

Here, the discrete random variable $\mathbf{x}_k = k$ can take the values 0, 1, 2, ... (unbounded) according to

$$P(k) = \frac{\mu^k}{k!}\, e^{-\mu}; \quad k = 0, 1, 2, \ldots; \quad \mu > 0 \tag{36}$$

As with the normal distribution, equation (30), µ directly denotes the expected value. For the Poisson distribution, we have $\sigma^2 = \mu$.

The Poisson distribution arises when we observe a rare event $E_1$ that may occur repeatedly in time. In any small time interval ∆t, the probability that the event occurs should be given by λ∆t, irrespective of the prehistory.
The probability that the event occurs exactly k times over a time span t is given by equation (36) with µ = λt. The Poisson distribution is relevant when studying the following problems:
• the probability of receiving k telephone calls over a time span t (λ is the average number of calls per unit time)
• the probability of finding k particles in the field of view of a microscope
• the probability of observing k disintegrations in a radioactive material over a time span t.

The Poisson distribution has a positive skewness and becomes more symmetric for increasing µ. In fact, the Poisson distribution then comes close to the normal distribution. Figure 5 gives the Poisson probability distribution of k for µ = 1, 2, 10 and 50.

[Figure 5. Poisson probability distribution of k for µ = 1, 2, 10 and 50. Horizontal axis: k, from 0 to 68; vertical axis: probability, from 0 to 0.4.]

In fact, the Poisson distribution was originally developed (by S.-D. Poisson, 1781-1840) as an approximation to the binomial distribution for the case that the probability of success on any trial, θ, is very small and the number of trials n is so high that the product n·θ remains finite. The binomial distribution can then be approximated by (36), where µ = n·θ.

Exercise. In nuclear imaging, the coefficient of variation should be lower than 1% in a specific part of the region of interest. What is the minimum number of scintillation counts required in the corresponding pixels? [Background radiation and radioactive decay of the isotope during the investigation may be ignored.]

II.1.h. χ²-distribution

The random variable χ² is defined as the sum of the squares of n continuous independent random variables that are standard normally distributed:

$$\chi^2 = x_1^2 + x_2^2 + \ldots + x_n^2 \tag{37}$$

The resulting probability distribution is the χ²-distribution.
The parameter n is called the number of degrees of freedom of χ². We generally have that E(χ²) = n. The kurtosis is higher than for the normal distribution (and is thus positive). The larger n is, the larger µ becomes and the more the distribution approximates the normal distribution. See Figure 6 for the probability distribution of χ².

[Figure 6. χ² probability distribution for df = 1, 2, 3, 10 and 50]

The critical values $\chi_p^2$ such that $P(\chi^2 > \chi_p^2) = p$ for n = 1, 2, ... can be found in Table 2. In Excel, you can use the function CHIINV (inverse of CHIDIST).

II.1.i. The t-distribution of Student

The distribution of Student (pseudonym of W.-S. Gosset) describes the distribution of the random variable t that is defined as:

$$t = \sqrt{n}\, \frac{y}{\sqrt{z}}$$

where y is normally distributed according to (µ = 0, σ² = 1), and z (independent of y) is distributed according to χ² with n degrees of freedom (df). The t-distribution is symmetric around t = 0 and approximates the normal distribution when n is very large. See Figure 7. The t-distribution has a positive kurtosis.

Critical values $t_p^{(1)}$ such that $P(t > t_p^{(1)}) = p$ (1-sided) and $t_p^{(2)}$ such that $P(|t| > t_p^{(2)}) = p$ (2-sided) for n = 1, 2, ... are listed in Table 3. It holds that $t_p^{(1)} = t_{2p}^{(2)}$. In Excel, you can use the function TINV (inverse of TDIST).

[Figure 7. Standard normal distribution (blue), t-distribution df=1 (green) and t-distribution df=2 (red).]

II.1.j. The F-distribution

The F-distribution describes the probability distribution of the random variable F ("F" of R.A. Fisher) that is defined as

$$F = \frac{x / m}{y / n}$$

where x and y are independent random variables that are distributed according to χ² with m and n degrees of freedom respectively. The critical values $F_p$ such that $P(F > F_p) = p$ for different pairs of degrees of freedom can be found in Table 4. In Excel, you can use the function FINV (inverse of FDIST).
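The definitions of the χ², t and F variables in II.1.h to II.1.j can be illustrated by building each of them from standard normal random numbers. The Python sketch below is only a rough simulation under assumptions made here: the function names, the seed, the number of draws and the chosen degrees of freedom are all illustrative and do not come from the course text. It does not replace Tables 2 to 4 or the Excel functions; a simulated critical value is only an approximation of the tabulated one.

```python
import math
import random
import statistics

rng = random.Random(2005)  # fixed seed (arbitrary) so that repeated runs give the same numbers

def chi2_draw(n):
    """Sum of the squares of n independent standard normal draws, as in (37)."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))

def t_draw(n):
    """t = sqrt(n) * y / sqrt(z), with y standard normal and z chi-square with n df."""
    y = rng.gauss(0.0, 1.0)
    z = chi2_draw(n)
    return math.sqrt(n) * y / math.sqrt(z)

def f_draw(m, n):
    """F = (x / m) / (y / n), with x and y chi-square with m and n df."""
    return (chi2_draw(m) / m) / (chi2_draw(n) / n)

N_DRAWS = 20000  # number of simulated values per distribution (arbitrary choice)

chi2_values = [chi2_draw(3) for _ in range(N_DRAWS)]
t_values = [t_draw(10) for _ in range(N_DRAWS)]
f_values = sorted(f_draw(5, 10) for _ in range(N_DRAWS))

chi2_mean = statistics.fmean(chi2_values)  # expected to be close to n = 3
t_mean = statistics.fmean(t_values)        # expected to be close to 0 (symmetry around t = 0)

# Empirical critical value F_p with P(F > F_p) = p, here for p = 0.05.
p = 0.05
f_crit = f_values[int((1 - p) * N_DRAWS)]

print(f"mean of chi-square (3 df): {chi2_mean:.2f}")
print(f"mean of t (10 df): {t_mean:.3f}")
print(f"simulated F critical value (5 and 10 df, p = 0.05): {f_crit:.2f}")
```

The simulated mean of χ² should lie near its number of degrees of freedom, in line with E(χ²) = n, and the simulated F value can be compared with the entry for m = 5 and n = 10 in a table of F critical values.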