Faculty of Engineering
Master of Biomedical Engineering
Biostatistics (3 credits)
Theory: 15 hrs
Exercises: 10 hrs
Prof. Dr. ir. C. De Wagter

I. Introduction to descriptive statistics

I. 1. Random variable, proportion

A random variable is a quantity whose value (or outcome) is uncertain (and
determined by chance or random experiment). The reason for lack of determinism can be
twofold: fundamental (quantum physics) or related to the observation (finite precision of
measurement or computation). We will mainly consider quantitative random variables; some are discrete (the number of bikes per family), others are continuous (the time needed to finish this lesson).
A proportion is obtained by normalization of the outcomes of a random variable. Example of
random variable: the yearly number of fatal road accidents in Belgium per age group of the driver. Statistical data show that older drivers are involved more often. However, these data do not prove that young people drive more safely; indeed, there are more older drivers than younger ones. It is therefore imperative to investigate, in each age group, the proportion of the number of fatal accidents to the number of drivers. (It then turns out that older people drive more safely.)
I. 2. Frequency distribution
In descriptive statistics, the value of a random variable is the result of an observation, experiment or measurement (in a broad sense). In order to obtain an overview of the data measured,
a frequency distribution may be very helpful. Often, the set of possible outcomes in a population is grouped into a number of intervals or classes (always in the case of continuous random variables). The intervals are bounded by the class limits. The number of times the random variable takes a value that belongs to a certain class, establishes the (absolute) frequency of the class considered, or, briefly, the class frequency. The sum of the class
frequencies equals the total number of observations. The relative frequency is obtained after
division by the number of observations.
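These notions can be illustrated with a short Python sketch (not part of the original notes; the data are invented for illustration):

```python
from collections import Counter

# Invented discrete observations: number of bikes per family
observations = [0, 1, 2, 2, 1, 0, 3, 2, 1, 1]
n = len(observations)

# Absolute frequency of each outcome
abs_freq = Counter(observations)

# Relative frequency: class frequency divided by the number of observations
rel_freq = {outcome: count / n for outcome, count in abs_freq.items()}

# The class frequencies sum to the total number of observations,
# and the relative frequencies sum to 1
assert sum(abs_freq.values()) == n
assert abs(sum(rel_freq.values()) - 1.0) < 1e-12
```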
I. 3. Collection and interpretation of data
I.3.a. Accuracy
Accuracy is directly related to the margin with which the observed outcome approximates
the actual (unknown) value of the quantity measured. Thus, the accuracy is also affected by
the calibration and resolution (resolving power) of the measuring equipment.
I.3.b. Systematic and random errors, errors of interpretation
Systematic errors affect the outcomes of the measurement in a consistent way. They can be
due to wrong zero adjustment or to calibration errors of measuring equipment. When known
(possibly approximately), the statistician can correct for them.
Random errors have, by definition, many independent origins and, consequently, a small mean. By performing many observations, the effect of random errors is minimized. This category of errors is closely related to statistics and probability.
cdwoct2005 Biostatistics 2
Errors of interpretation are due to human failure. An example is the above mentioned wrong
choice of random variable to conclude that young car drivers are more careful.
I. 4. Organization and presentation of data
I.4.a. Contingency table
A contingency table is obtained by classification of a population (or sample, see further) according to 2 different criteria (to each criterion, a categorical (nominal) random variable is
connected). Each table cell contains the frequency, split up according to the respective criteria.
I.4.b. Box chart, bar chart
The height of each box or bar represents the frequency of the corresponding outcome measured. Continuous random variables are grouped into classes. The bars, in contrast to the
boxes, are normally separated from each other.
I.4.c. Histogram
The histogram is the ideal tool for the graphical representation of the frequency distribution
of a continuous random variable. The horizontal axis is divided into class intervals. Above each interval, a rectangle is drawn whose area (relevant in the case of unequal class interval widths) represents the frequency of the corresponding class. In fact, the heights of the rectangles denote frequency densities (frequency per unit class width). In contrast to the box and bar chart, sparse class intervals can be grouped together by using unequal interval widths.
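The role of the frequency density can be sketched as follows (classes and frequencies are invented; the last, sparse interval is deliberately wider):

```python
# Invented classes as (lower limit, upper limit, frequency)
classes = [(0, 50, 14), (50, 100, 8), (100, 150, 3), (150, 300, 4)]

# The height of each histogram rectangle is the frequency density
# (frequency per unit class width), so that its AREA equals the frequency
densities = [freq / (upper - lower) for lower, upper, freq in classes]

for (lower, upper, freq), density in zip(classes, densities):
    assert abs(density * (upper - lower) - freq) < 1e-12  # area = frequency
```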
Example. The following numbers are observed as a result of 29 measurements: 50, 51, 30,
20, 30, 36, 60, 51, 1, 200, 100, 5, 70, 75, 100, 20, 50, 50, 140, 160, 250, 150, 180, 30, 48,
47, 46, 31, 40. This results in the following histogram:

[Figure 1. Histogram with equal interval width (SPSS output for variable VAR00002; N = 29; axis values and summary statistics not reproduced)]

I.4.d. Pie chart
The surface of a circle (representing 100 %) is divided into sectors that correspond with the
classes of a population (or sample), according to a categorical (nominal) random variable.
The area of each sector is proportional to the frequency of the corresponding class.
I. 5. Measures that characterize a frequency distribution
I.5.a. Measures of central tendency
The (arithmetic) mean of N outcomes of the random variable x is defined by

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad (1)$$

All observed outcomes have influence. Outliers strongly affect the mean. Graphically, the mean is the outcome that forms the point of equilibrium of the surface under the histogram when materialized and hung in a field of gravity.
The median is the middle of the outcomes after arranging them in order of magnitude. If the number of outcomes is even, the mean of the two central outcomes is taken. The median divides the area determined by the histogram into two equal halves.

The mode is the outcome that has the highest frequency.
I.5.b. Measures of dispersion and variability
The variance σ² is the average of the squared deviations of all N outcomes in a population with respect to the mean:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2 \qquad (2)$$

Normally, instead of taking the complete population, the statistician restricts himself to a subset of the population, called the sample, which contains only n (< N) observations. As we will see further on, expression (2) then gives a too optimistic, i.e. too small, value of the variance. The reason is that the mean of the sample also contains uncertainty (another sample would normally lead to a different mean). Therefore, the sample variance s² is a random variable in itself, which can be efficiently estimated as:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (3)$$

The reader can easily prove that (2) and (3) can also be expressed as

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \bar{x}^2 \qquad (4)$$

and

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1} \qquad (5)$$

Formulae (4) and (5) are better suited for calculations by hand. Remark that the subtractions that occur on the right in (4) and (5) are fundamentally positive.

The standard deviation σ (s) is the square root of σ² (s²). Hence, we can write for the sample standard deviation s:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \qquad (6)$$

The coefficient of variation, cv, is the ratio of the standard deviation to the mean:

$$cv = \frac{s}{\bar{x}}$$
The quartiles are those values of the random variable that divide the area under the histogram into 4 equal parts. The second quartile is equal to the median. 25% (75%) of the outcomes are lower than the first (third) quartile.
The percentiles are defined analogously. The 71st percentile, for instance, is that outcome
under which 71% of all ordered outcomes lie.
The mean, standard deviation, quartiles and percentiles are expressed in the same unit as the
random variable itself. The variance has the squared unit. The coefficient of variation is
unitless.
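These measures can be checked on the 29 observations of the histogram example with Python's standard statistics module (an illustrative sketch, not part of the original notes):

```python
import statistics

# The 29 observations from the histogram example above
data = [50, 51, 30, 20, 30, 36, 60, 51, 1, 200, 100, 5, 70, 75, 100,
        20, 50, 50, 140, 160, 250, 150, 180, 30, 48, 47, 46, 31, 40]

mean = statistics.mean(data)          # equation (1)
median = statistics.median(data)      # middle of the ordered outcomes
pop_var = statistics.pvariance(data)  # equation (2), divides by N
samp_var = statistics.variance(data)  # equation (3), divides by n - 1
samp_std = statistics.stdev(data)     # equation (6)
cv = samp_std / mean                  # coefficient of variation

# The sample formula (3) gives a larger value than the population formula (2)
assert samp_var > pop_var
```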
I.5.c. The boxplot
An interesting graphical tool for exploring observed data is the boxplot. Figure 2 gives such
a plot for the example considered in Figure 1 (p. 4). The box contains 50% of the outcomes
and lies between the 1st quartile and the 3rd quartile. The median is indicated inside the box. An acentral position indicates a skew frequency distribution. The whiskers indicate the outcomes that are just not outliers nor extreme values. Outliers are contained in the interval defined between distances of 1.5 and 3 times the box height (3rd quartile − 1st quartile) from the median. Extreme values lie still farther from the box.
[Figure 2. Boxplot of VAR00002, N = 29 (o = outlier; * = extreme value; each followed by the rank number of the outcome); axis values not reproduced]

I.5.d. Adjustments for grouped observations
In the previous formulae (1)-(6), we tacitly assumed that measurements with the same outcome xi were repeated in the summation. If we consider each possible outcome xi only once, however, we should take into account its frequency fi (or relative frequency fi'). This also applies when the outcomes are grouped in classes. Then fi represents the class frequency and xi the class center. Formulae (1)-(6) then respectively become (N is the total number of observations):

$$\bar{x} = \frac{\sum_{\forall i} f_i x_i}{N} \qquad (1')$$

$$\sigma^2 = \frac{\sum_{\forall i} f_i (x_i - \bar{x})^2}{N} \qquad (2')$$

$$s^2 = \frac{\sum_{\forall i} f_i (x_i - \bar{x})^2}{n-1} \qquad (3')$$

$$\sigma^2 = \frac{\sum_{\forall i} f_i x_i^2}{N} - \bar{x}^2 \qquad (4')$$

$$s^2 = \frac{\sum_{\forall i} f_i x_i^2 - n\bar{x}^2}{n-1} \qquad (5')$$

$$s = \sqrt{\frac{\sum_{\forall i} f_i (x_i - \bar{x})^2}{n-1}} \qquad (6')$$

II. Basics of probability

II. 1. Random experiment, event

An experiment is a process by which an outcome is obtained. In the case of a random experiment, the outcome is a priori uncertain. Attaining a specific outcome (or combination of outcomes) establishes an event. The probability of an event E1 is expressed by a number that
lies between 0 and 1:
$$0 \le P(E_1) \le 1 \qquad (7)$$

P(E1) = 0 indicates that event E1 cannot occur, while P(E1) = 1 guarantees that event E1 occurs with certainty. From descriptive statistics, probability is the limit of the relative frequency for a large number of observations:

$$P(E_1) = \lim_{n \to \infty} \frac{f_{E_1}}{n} \qquad (8)$$

An alternative and practical approach was followed by Laplace: if all the possible outcomes
of an experiment have the same probability, then the probability of an event is the ratio of
the number of "favorable" outcomes to the total number of possible outcomes.
Example 1: We throw a die twice. What is the probability of obtaining "2" and "4" (irrespective of sequence)? Answer: 2/36 = 0.0556. Indeed, there are 36 possible outcomes: (1,1), (1,2), ..., (1,6), (2,1), ..., (6,6). There are 2 favorable outcomes: (2,4) and (4,2).
Example 2: The probability of throwing "4" twice amounts to 1/36 = 0.0278. Indeed, the favorable outcome (4,4) occurs only once in the set of possible outcomes.
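Laplace's approach can be verified by enumerating all equally probable outcomes (an illustrative sketch, not part of the original notes):

```python
from itertools import product

# All 36 equally probable outcomes of throwing a die twice
outcomes = list(product(range(1, 7), repeat=2))

# Example 1: "2" and "4" irrespective of sequence -> (2,4) and (4,2)
p_2_and_4 = sum(1 for o in outcomes if sorted(o) == [2, 4]) / len(outcomes)

# Example 2: "4" twice -> single favorable outcome (4,4)
p_double_4 = sum(1 for o in outcomes if o == (4, 4)) / len(outcomes)
```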
II.1.a. Basic rules for probability
Union of two events. The probability of either event E1 or event E2 is the sum of the two
related probabilities minus the probability of the intersection:

$$P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2) \qquad (9)$$

The term P(E1 ∩ E2) expresses the probability of the intersection of events E1 and E2, i.e. the
probability of the two events being concurrent.
Complement of an event. The probability that an event E1 does NOT occur is given by:

$$P(\bar{E}_1) = 1 - P(E_1) \qquad (10)$$

Conditional probability. The probability of event E1 occurring when it is known that some event E2 has occurred:

$$P(E_1 \mid E_2) = \frac{P(E_1 \cap E_2)}{P(E_2)} \qquad (11)$$

Bayes' rule interrelates the conditional probabilities P(E1 | E2) and P(E2 | E1):

$$P(E_1 \mid E_2) = \frac{P(E_2 \mid E_1)\, P(E_1)}{P(E_2)} \qquad (12)$$

Exercise: derive (12) from (11).

Exercise: interpret the following identity:

$$P(E_1 \mid E_2) + P(\bar{E}_1 \mid E_2) = 1 \qquad (13)$$

Exercise: prove that in general

$$P(E_1) = P(E_1 \cap E_2) + P(E_1 \cap \bar{E}_2) \qquad (14)$$

By applying (11), we obtain from (14):

$$P(E_1) = P(E_1 \mid E_2)\, P(E_2) + P(E_1 \mid \bar{E}_2)\, P(\bar{E}_2) \qquad (14')$$

The previous expression can be generalized to

$$P(B) = \sum_i P(B \mid A_i)\, P(A_i) \qquad (14'')$$

if the events Ai are mutually exclusive and if their union is the entire sample space, i.e. Σi P(Ai) = 1. The above equation, sometimes called the law of total probability, can be very useful in the denominator of Bayes' rule (12). In many cases, the events Ai are all possible mutually exclusive causes of event B.
Exercise: the probability that a student is in a plaster cast is 0.05 when he has taken a ski holiday and 0.005 when he has not. We know that 20% of the students take a ski holiday. a) What is the probability that a student is in a plaster cast? b) When I see a student in a plaster cast, what is the probability that he has taken a ski holiday? (Answers: a) 0.014, b) 0.71)
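A worked sketch of this exercise, using the law of total probability (14') and Bayes' rule (12) (variable names are illustrative):

```python
# Given data from the exercise
p_ski = 0.20                 # P(ski holiday)
p_cast_given_ski = 0.05      # P(plaster cast | ski holiday)
p_cast_given_no_ski = 0.005  # P(plaster cast | no ski holiday)

# a) Law of total probability, equation (14')
p_cast = p_cast_given_ski * p_ski + p_cast_given_no_ski * (1 - p_ski)

# b) Bayes' rule, equation (12)
p_ski_given_cast = p_cast_given_ski * p_ski / p_cast
```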
Two events E1 and E2 are independent if and only if P(E1 | E2) = P(E1) or P(E2 | E1) = P(E2). In words: if they do not affect each other. (Bayes' rule implies that both conditions are equivalent.) From (11) it follows that

$$P(E_1 \cap E_2) = P(E_1)\, P(E_2) \qquad (15)$$

Two events E1 and E2 are mutually exclusive when P(E1 ∩ E2) = 0. This normally implies that they cannot occur concurrently.

Exercise: prove that for mutually exclusive events E1 and E2, we have that P(E1 | E2) = 0.
II.1.b. Conditional probabilities in medical diagnosis
See "Statistiek in de kliniek: de diagnose doorgelicht" [Statistics in the clinic: the diagnosis examined], Giard, R.W.M., Natuur en Techniek, vol. 59, pp. 260-271, 1991.
Key words: conditional probability, Bayes’ rule, sensitivity and specificity, indifferent test.
The contingency table shown on p. 265 can be completed as follows (N = 973):

                 D+                  D-                  Marginal total
  T+       P(T+ ∩ D+)          P(T+ ∩ D-)           P(T+)
           283/N = 0.291        27/N = 0.028          310/N = 0.319
                                (false positive)
  T-       P(T- ∩ D+)          P(T- ∩ D-)           P(T-)
           51/N = 0.052         612/N = 0.629         663/N = 0.681
           (false negative)
  Marginal P(D+)                P(D-)                 N = 973
  total    334/N = 0.343        639/N = 0.657

Notice the two combinations for which the test gives a "wrong" prediction: false positive and false negative. In Fig. 12a (12b), the curve can be made more ideal by increasing the specificity (sensitivity) of the test. For any given test, however, the specificity can only be increased by decreasing the sensitivity, and vice versa (both are linked). The values we choose should be based on a trade-off between the consequences of false positive and false negative test results. The above contingency table also illustrates equation (14).
Denoting

    prev = prevalence = P(D+)
    sens = sensitivity = P(T+ | D+)
    spec = specificity = P(T- | D-)

we can construct the following conditional probabilities, which both physician and patient are really interested in:

$$P(D^+ \mid T^+) = \frac{\mathrm{sens}\cdot\mathrm{prev}}{\mathrm{sens}\cdot\mathrm{prev} + (1-\mathrm{spec})(1-\mathrm{prev})} \quad \text{(positive predictive value; Fig. 12a, p. 271)}$$

$$P(D^- \mid T^+) = \frac{(1-\mathrm{spec})(1-\mathrm{prev})}{\mathrm{sens}\cdot\mathrm{prev} + (1-\mathrm{spec})(1-\mathrm{prev})}$$

(these two sum to 1), and

$$P(D^+ \mid T^-) = \frac{(1-\mathrm{sens})\cdot\mathrm{prev}}{\mathrm{spec}(1-\mathrm{prev}) + (1-\mathrm{sens})\cdot\mathrm{prev}} \quad \text{(Fig. 12b, p. 271)}$$

$$P(D^- \mid T^-) = \frac{\mathrm{spec}(1-\mathrm{prev})}{\mathrm{spec}(1-\mathrm{prev}) + (1-\mathrm{sens})\cdot\mathrm{prev}} \quad \text{(negative predictive value)}$$

(these two also sum to 1).

Exercise. Compute from the above contingency table that sens = 0.85 and spec = 0.96.
Exercise. In Fig. 12a and 12b on p. 271 (for sens = 0.85 and spec = 0.96), the indifferent
curve is obtained when the events D and T are independent. What do the above conditional
probabilities reduce to then? Interpret.
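The quantities above can be computed directly from the cell counts of the contingency table (an illustrative sketch; the variable names are my own):

```python
# Cell counts from the contingency table above (N = 973)
tp, fn = 283, 51   # disease present:  true positives, false negatives
fp, tn = 27, 612   # disease absent:   false positives, true negatives
N = tp + fn + fp + tn

sens = tp / (tp + fn)   # sensitivity P(T+ | D+)
spec = tn / (tn + fp)   # specificity P(T- | D-)
prev = (tp + fn) / N    # prevalence  P(D+)

# Positive predictive value via the Bayes-rule formula above
ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))

# It agrees with the direct table estimate P(D+ | T+) = 283/310
assert abs(ppv - tp / (tp + fp)) < 1e-12
```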
II.1.c. Discrete and continuous probability distribution, probability density
function, cumulative probability distribution
In subsection I. 1, we have seen that a random variable can have a number of outcomes.
Each outcome occurs with a certain probability. The probability or chance that the discrete random variable takes the value or outcome xk is written as P(x = xk) or, briefly, P(xk). The number of outcomes xk of a discrete random variable is finite or countably infinite. We distinguish x (the symbol for the random variable) from xk (a specific outcome of that random variable). P(x = xk) is, according to subsection II. 1 (p. 8), the probability that the event <x = xk> occurs. As a function of xk, P(xk) = P(x = xk) is called the probability distribution. The following two properties hold for the probability distribution of a discrete random variable:

$$0 \le P(x_k) \le 1, \quad \forall k \qquad (16)$$

$$\sum_{\forall k} P(x_k) = 1 \qquad (17)$$

A continuous random variable x has a continuum of possible outcomes. Strictly speaking,
this implies that any specific value x has probability 0 of occurring as an outcome of the random variable. Therefore, we define the probability density p(x) by

$$p(x)\,\Delta x = P\!\left(x - \frac{\Delta x}{2} \le x \le x + \frac{\Delta x}{2}\right) \qquad (18)$$

where Δx denotes a small increment. For each a ≤ b, it follows that

$$P(a \le x \le b) = \int_a^b p(x)\,dx$$

In words: the probability that x lies between a and b is given by the area under the curve p(x) bounded by x = a and x = b.

In analogy with (16) and (17), we can write for the probability density p(x):

$$p(x) \ge 0, \quad \forall x \qquad (19)$$

and

$$\int_{-\infty}^{+\infty} p(x)\,dx = 1 \qquad (20)$$

Remark that, in contrast with P(xk), p(x) may be (locally) larger than 1.

The cumulative probability distributions, for discrete and continuous random variables, are defined as

$$\Phi(x_k) = \sum_{j \le k} P(x = x_j) = \sum_{j \le k} P(x_j) \qquad (21)$$

$$\Phi(x) = \int_{-\infty}^{x} p(y)\,dy \qquad (22)$$

Both represent the probability that an outcome is lower than or equal to the mentioned value of xk or x. Both functions increase from 0 to 1 with respect to xk or x. From equation (22) it follows that

$$p(x) = \frac{d\Phi(x)}{dx} \qquad (23)$$

II.1.d. Measures of central tendency and variability of a probability distribution
This subsection parallels subsection “I. 5. Measures that characterize a frequency
distribution" (p. 4). A frequency distribution is obtained through experiment. A probability distribution or density function, on the other hand, results from a mathematical model of the real world. Example: the probability distribution of the random variable "the number of dots obtained when tossing a die" can be obtained experimentally (by tossing many times) and analytically (considering the fact that any of the 6 sides has the same probability of facing upward when the die comes to rest).
1. The expected value or mean of a random variable. In the discrete case:

$$\mu = E(x) = \sum_{\forall k} x_k\, P(x_k) \qquad (24)$$

It is clear that the expected value is closely related to the arithmetic mean of a measured random variable (see equation (1')). In the continuous case, the expected value of the random variable x is defined as:

$$\mu = E(x) = \int_{-\infty}^{+\infty} x\, p(x)\,dx \qquad (25)$$

2. The median is the value of the random variable that partitions the area under the probability distribution curve into two halves. Analogously, the expected value is the value of
the random variable that would equilibrate the materialized area under the same p(x) curve. This explains why the expected value of a skew distribution lies, more than the median, in the direction of the longer tail. With symmetric distribution curves, the median and expected value coincide on the symmetry axis.
The median always has a cumulative probability of 0.5.
3. The mode is the x-value for which p(x) is maximum.
4. In the discrete case, the variance is given by

$$\sigma^2 = E\{(x-\mu)^2\} = \sum_{\forall k} P(x_k)\,(x_k-\mu)^2 \qquad (26)$$

and in the continuous case by

$$\sigma^2 = E\{(x-\mu)^2\} = \int_{-\infty}^{+\infty} (x-\mu)^2\, p(x)\,dx \qquad (27)$$

In the above expressions, E{} is a mathematical operator that acts on a random variable which, in its own right, is a function of the random variable x.
5. The standard deviation σ is the square root of the variance σ2.
6. The quartiles and percentiles are defined as in subsection I. 5 “Measures that
characterize a frequency distribution” (p. 4).
7. The skewness coefficient of a probability distribution is defined as

$$\frac{E\{(x-\mu)^3\}}{\sigma^3} \qquad (28)$$

This coefficient is positive (negative) for distributions with a longer positive (negative) tail.

8. Kurtosis or peakedness:

$$\frac{E\{(x-\mu)^4\}}{\sigma^4} - 3 \qquad (29)$$

The subtraction of 3 ensures that the kurtosis of the Gaussian distribution is zero (see further). The kurtosis is positive when the distribution is more peaked than the Gaussian distribution (for the same standard deviation). This commonly implies that the tails are longer.

II.1.e. Gaussian distribution
The Gaussian distribution or normal distribution of a continuous random variable is symmetric and bell-shaped. There are two parameters, µ and σ:

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \qquad (30)$$

For each normal distribution, the probability of finding the outcome within a distance σ, 2σ, 3σ from the expected value µ is respectively given by:

$$P(\mu-\sigma \le x \le \mu+\sigma) = 0.68 \qquad (31a)$$
$$P(\mu-2\sigma \le x \le \mu+2\sigma) = 0.95 \qquad (31b)$$
$$P(\mu-3\sigma \le x \le \mu+3\sigma) = 0.997 \qquad (31c)$$

Figure 3 displays the effect of σ on the shape of the curve. Graphically, σ is characterized by the points of inflection of the Gaussian curve.

[Figure 3. Gaussian distribution for µ = 0 and i) σ = 1 and ii) σ = 1.5]
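Probabilities of the type (31a)-(31c) can be checked numerically via the error function in Python's standard library (an illustrative sketch; phi and prob_within are helper names introduced here, using Φ(z) = (1 + erf(z/√2))/2):

```python
import math

def phi(z):
    """Standard normal cumulative distribution via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_within(k):
    """P(mu - k*sigma <= x <= mu + k*sigma) for a Gaussian random variable."""
    return phi(k) - phi(-k)

# Equations (31a)-(31c)
assert round(prob_within(1), 2) == 0.68
assert round(prob_within(2), 2) == 0.95
assert round(prob_within(3), 3) == 0.997

# Half interval widths of 0.674, 1.645 and 1.960 sigma give 50 %, 90 % and 95 %
assert round(prob_within(0.674), 2) == 0.50
assert round(prob_within(1.645), 2) == 0.90
assert round(prob_within(1.960), 2) == 0.95
```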
Figure 4 shows the standard normal probability distribution, which is normalized such that µ = 0 and σ = 1. The same figure illustrates equations (31a)-(31c). Table 1 lists the corresponding cumulative probability distribution. Each Gaussian-distributed random variable x can be related to the standard normal variable z via

$$z = \frac{x-\mu}{\sigma} \qquad (32)$$

Using Table 1, the reader can easily check (31a)-(31c) via

$$1 - 2\,\Phi(\mu-\sigma) = 0.6826 \qquad (33a)$$
$$1 - 2\,\Phi(\mu-2\sigma) = 0.9544 \qquad (33b)$$
$$1 - 2\,\Phi(\mu-3\sigma) = 0.9974 \qquad (33c)$$

[Figure 4. Standard normal distribution with indication of σ, 2σ and 3σ]

Exercise. Determine the intervals (around the mean) in which a Gaussian-distributed variable falls with a probability of
• 50% (the corresponding half interval width is sometimes called the "probable error")
• 90%
• 95%.
(Answers: ± 0.674 σ, ± 1.645 σ, ± 1.960 σ.)

The sum of 2 independent random variables that are Gaussian distributed according to
(µ1, σ1²) and (µ2, σ2²), respectively, is itself Gaussian distributed according to (µ1+µ2, σ1²+σ2²). (Remark that the variances have to be added and NOT the standard deviations!)

In general, the sum of n independent random variables that are Gaussian distributed according to (µ1, σ1²), (µ2, σ2²), ... and (µn, σn²) is itself Gaussian distributed according to (µ1+µ2+...+µn, σ1²+σ2²+...+σn²). This implies that the arithmetic mean of n independent observations of the random variable x, all Gaussian distributed according to (µ, σ²), is itself Gaussian distributed according to (µ, σ²/n). Indeed,

$$E\!\left(\frac{\sum_{i=1}^{n} x_i}{n}\right) = \frac{1}{n}\sum_{i=1}^{n} E(x_i) = \mu$$

and

$$E\!\left\{\left(\frac{\sum_{i=1}^{n} x_i}{n} - \mu\right)^2\right\} = \frac{1}{n^2}\sum_{i=1}^{n} E\{(x_i-\mu)^2\} = \frac{1}{n^2}\sum_{i=1}^{n} \sigma^2 = \frac{\sigma^2}{n}$$

Note that reducing the expectation of the squared sum to the sum of the individual variances (i.e. dropping the cross terms) is only allowed when the random variables xi are independent.
The central limit theorem, on the other hand, states the following: the sum of n independent random variables is, on condition that n is large, normally distributed irrespective of the individual probability distributions of those random variables. This allows us to generalize the above conclusions to arbitrarily distributed random variables when n is sufficiently large.

The normal distribution is very important in practice, especially when realizing that random variables that are influenced by many independent factors will become normally distributed, in good approximation (only approximately when the random variable is essentially positive, for instance). This follows from the central limit theorem. Therefore, the random errors discussed in subsection I.3.b (p. 2) will be normally distributed. When reporting quantitative measurement results, good practice is to supplement the arithmetic mean with a number of standard deviations:

    x̄ ± s  or  x̄ ± 2s  or  x̄ ± 3s

II.1.f. Binomial distribution
With the binomial probability distribution (the distribution of the number of successes in n Bernoulli trials), the random variable k is discrete and can take the values 0, 1, 2, ..., n:

$$P(k) = C_k^n\, \theta^k (1-\theta)^{n-k}\; ; \quad k = 0, 1, 2, \ldots, n \qquad (34)$$

where the binomial coefficients C_k^n (the number of possible combinations or subsets of k elements in a set of n elements) are given by

$$C_k^n = \frac{n!}{k!\,(n-k)!}\; ; \quad k = 0, 1, 2, \ldots, n \qquad (35)$$

Remark that 0! = 1.
The binomial distribution comes up when we repeat an experiment (to observe event E1 that
occurs with probability P(E1)=θ) n times. The probability that E1 occurs just k times is given
by equation (34). The probability distribution is symmetric if θ=0.5. (Example: probability
of obtaining k heads when tossing a coin n times).
Exercise. Ascertain that P(k) is positively skew if θ < 0.5. Consider the probability of obtaining k times "6 dots" when throwing a die 10 times. Make a graph of the probability distribution. This exercise illustrates why the binomially distributed variable k is sometimes called the "number of successes".
For large n, the binomial distribution comes close to the normal distribution (and thus becomes approximately symmetric, even for θ ≠ 0.5).
II.1.g. Poisson distribution
Here, the discrete random variable k can take the values 0, 1, 2, ... (unbounded) according to

$$P(k) = \frac{\mu^k}{k!}\, e^{-\mu}\; ; \quad k = 0, 1, 2, \ldots \qquad (36)$$

As with the normal distribution, equation (30), µ directly denotes the expected value. For the Poisson distribution, we have σ² = µ.

The Poisson distribution comes up when we observe a rare event E1 that may occur repeatedly in time. In any small time interval ∆t, the probability that the event occurs should be given by λ∆t, irrespective of the prehistory. The probability that the event occurs exactly k times over a time span t is then given by equation (36) after substituting µ = λt.
The Poisson distribution is relevant when studying the following problems:
• the probability of receiving k telephone calls over a time span t (λ is the average number of calls per unit time)
• the probability of finding k particles in the view field of a microscope
• the probability of observing k disintegrations in a radioactive material over a time span t.

The Poisson distribution has a positive skewness and becomes more symmetric for increasing µ; in fact, the Poisson distribution then comes close to the normal distribution. Figure 5 gives the Poisson probability distribution of k for µ = 1, 2, 10 and 50.
[Figure 5. Poisson probability distribution of k for µ = 1, 2, 10 and 50; axis values and legend not reproduced]
In fact, the Poisson distribution was originally developed (by S.D. Poisson, 1781-1840) as an approximation to the binomial distribution for the case where the probability of success on any trial, θ, is very small and the number of trials is so high that the product nθ remains finite. Then, the binomial distribution can be approximated by (36), with µ = nθ.
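This approximation can be checked numerically (an illustrative sketch; the values of n and θ are arbitrary small-θ choices):

```python
import math

n, theta = 1000, 0.002   # many trials, small success probability
mu = n * theta           # mu = n * theta = 2

def poisson_pmf(k):
    """Equation (36): P(k) = mu^k / k! * exp(-mu)."""
    return mu**k / math.factorial(k) * math.exp(-mu)

def binom_pmf(k):
    """Equation (34), the exact binomial probability."""
    return math.comb(n, k) * theta**k * (1 - theta)**(n - k)

# The Poisson pmf closely approximates the binomial pmf here
max_diff = max(abs(poisson_pmf(k) - binom_pmf(k)) for k in range(21))
assert max_diff < 0.001
```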
Exercise. In nuclear imaging, the coefficient of variation should be lower than 1% in a specific part of the region of interest. What is the minimum number of scintillation counts required in the corresponding pixels? [Background radiation and radioactive decay of the isotope during the investigation may be ignored].
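Since scintillation counts are Poisson distributed, σ = √µ and thus cv = 1/√(counts). A sketch of the computation (which yields 10 000 counts as the minimum):

```python
import math

def cv(counts):
    """Poisson counts: sigma = sqrt(mu), so cv = sqrt(counts)/counts = 1/sqrt(counts)."""
    return 1.0 / math.sqrt(counts)

# Smallest number of counts for which cv <= 1 %
min_counts = math.ceil(1.0 / 0.01 ** 2)

assert cv(min_counts) <= 0.01
assert cv(min_counts - 1) > 0.01
```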
II.1.h. χ² distribution

The random variable χ² is defined as the sum of the squares of n continuous independent random variables that are standard normally distributed:

$$\chi^2 = x_1^2 + x_2^2 + \ldots + x_n^2 \qquad (37)$$

The resulting probability distribution is the χ²-distribution. The parameter n is called the number of degrees of freedom of χ². We generally have that E(χ²) = n. The kurtosis is higher than for the normal distribution (and is thus positive). The larger n is, the larger µ becomes and the more the distribution approximates the normal distribution. See Figure 6 for the probability distribution of χ².

[Figure 6. χ² probability distribution for df = 1, 2, 3, 10 and 50]

The critical values χp² such that P(χ² > χp²) = p for n = 1, 2, ... can be found in Table 2. In Excel, you can use the function CHIINV (inverse of CHIDIST).

II.1.i. The t-distribution of Student
The distribution of Student (pseudonym of W.S. Gosset) describes the distribution of the
random variable t defined as

$$t = \frac{y}{\sqrt{z/n}}$$

where y is normally distributed according to (µ=0, σ²=1), and z (independent of y) is distributed according to χ² with n degrees of freedom (df). The t-distribution is symmetric around t=0 and approximates the normal distribution when n is very large. See Figure 7. The t-distribution has a positive kurtosis.

Critical values tp(1) such that P(t > tp(1)) = p (1-sided) and tp(2) such that P(|t| > tp(2)) = p (2-sided) for n = 1, 2, ... are listed in Table 3. It holds that tp(1) = t2p(2). In Excel, you can use the function TINV (inverse of TDIST).

[Figure 7. Standard normal distribution (blue), t-distribution df=1 (green) and t-distribution df=2 (red)]

II.1.j. The F-distribution
The F-distribution describes the probability distribution of the random variable F ("F" for R.A. Fisher), defined as

$$F = \frac{x/m}{y/n}$$

where x and y are independent random variables distributed according to χ² with m and n degrees of freedom, respectively. The critical values Fp such that P(F > Fp) = p for different pairs of degrees of freedom can be found in Table 4. In Excel, you can use the function FINV (inverse of FDIST).