Chapter 3 Concepts of Probability
© 2010 by Harvey Gould and Jan Tobochnik
5 October 2010

We introduce the basic concepts of probability and apply them to simple physical systems and everyday life. We discuss the universal nature of the central limit theorem and the Gaussian distribution for the sum of a large number of random variables. Because of the importance of probability in many contexts, our discussion goes beyond what we need for the applications of statistical mechanics that we discuss in later chapters.

3.1 Probability in Everyday Life

One of our goals, which we will consider in Chapter 4 and subsequent chapters, is to relate the behavior of various macroscopic quantities to the underlying microscopic behavior of the individual atoms or other constituents. To do so, we need to introduce some ideas from probability.

We all use ideas of probability in everyday life. For example, every morning many of us decide what to wear based on the probability of rain. We cross streets knowing that the probability of being hit by a car is small. You can make a rough estimate of the probability of being hit by a car. It must be less than one in a thousand, because you have crossed streets thousands of times and hopefully you have not been hit. You might be hit tomorrow, or you might have been hit the first time you tried to cross a street. These comments illustrate that we have some intuitive sense of probability, and because it is a useful concept for survival, we know how to estimate it. As expressed by Laplace (1819), “Probability theory is nothing but common sense reduced to calculation.” Another interesting thought is due to Maxwell (1850): “The true logic of this world is the calculus of probabilities.” That is, probability is a natural language for describing many real-world phenomena.

However, our intuition only takes us so far. Consider airplane travel. Is it safe to fly?
Suppose that there is one chance in five million of a plane crashing on a given flight and that there are about 50,000 flights a day. Then every 100 days or so there is a reasonable likelihood of a plane crash somewhere in the world. This estimate is in rough accord with what we know. For a given flight, your chances of crashing are approximately one part in $5 \times 10^6$, and if you fly ten times a year for 100 years, it seems that flying is not too much of a risk. Suppose that instead of living 100 years, you could live 50,000 years. In this case you would take 500,000 flights, and it would be much more risky to fly if you wished to live your full 50,000 years. Although this last statement seems reasonable, can you explain why?

Much of the motivation for the mathematical formulation of probability arose from the proficiency of professional gamblers in estimating betting odds and their desire to have more quantitative measures of success. Although games of chance have been played since history has been recorded, the first steps toward a mathematical formulation of games of chance began in the middle of the seventeenth century. Some of the important contributors over the following 150 years include Pascal, Fermat, Descartes, Leibnitz, Newton, Bernoulli, and Laplace, names that are probably familiar to you.

Given the long history of games of chance and the interest in estimating probability in a variety of contexts, it is remarkable that the theory of probability took so long to develop. One reason is that the idea of probability is subtle and is capable of many interpretations. An understanding of probability is elusive due in part to the fact that the probability depends on the status of the information that we have (a fact well known to poker players). Although the rules of probability are mathematically simple, an understanding of probability is greatly aided by experience with real data and concrete problems.
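The airline estimates earlier in this section can be checked with a few lines of arithmetic. The sketch below is one possible way to make the comparison quantitative, not the authors' calculation. It assumes that flights are independent, so that the chance of at least one crash in n flights is 1 − (1 − p)^n; the function name `prob_at_least_one` and the variable names are illustrative choices.

```python
# Sketch: risk of at least one crash over many flights, assuming that
# each flight is an independent trial with the same crash probability.
p_crash = 1 / 5_000_000      # per-flight crash probability quoted in the text
flights_per_day = 50_000     # worldwide flights per day quoted in the text


def prob_at_least_one(p, n):
    """Probability that an event of probability p occurs at least once in n independent trials."""
    return 1.0 - (1.0 - p) ** n


# Expected number of days between crashes somewhere in the world.
days_per_crash = 1 / (p_crash * flights_per_day)

# Ten flights a year for 100 years versus ten flights a year for 50,000 years.
risk_100_years = prob_at_least_one(p_crash, 10 * 100)
risk_50000_years = prob_at_least_one(p_crash, 10 * 50_000)

print(f"about one crash every {days_per_crash:.0f} days")
print(f"risk over 1,000 flights:   {risk_100_years:.5f}")
print(f"risk over 500,000 flights: {risk_50000_years:.3f}")
```

Under these assumptions the lifetime risk grows from roughly two parts in ten thousand to nearly one in ten, because a small per-flight probability accumulates over a very large number of trials.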
To test your current understanding of probability, solve Problems 3.1–3.6 before reading the rest of this chapter. Then in Problem 3.7 formulate the laws of probability based on your solutions to these problems.

Problem 3.1. Marbles in a jar
A jar contains two orange, five blue, three red, and four yellow marbles. A marble is drawn at random from the jar. Find the probability that (a) the marble is orange; (b) the marble is red; (c) the marble is orange or blue.

Problem 3.2. Piggy bank
A piggy bank contains one penny, one nickel, one dime, and one quarter. It is shaken until two coins fall out at random. What is the probability that at least $0.30 falls out?

Problem 3.3. Two dice
A person tosses a pair of dice at the same time. Find the probability that (a) both dice show the same number; (b) both dice show a number less than 5; (c) both dice show an even number; (d) the product of the numbers is 12.

Problem 3.4. Free throws
A person hits 16 free throws out of 25 attempts. What is the probability that this person will make a free throw on the next attempt?

Problem 3.5. Toss of a die
Consider an experiment in which a die is tossed 150 times and the number of times each face is observed is counted.[1] The value of A, the number of dots on the face of the die, and the number of times that it appeared are shown in Table 3.1.
(a) What is the predicted average value of A assuming a fair die?
(b) What is the average value of A observed in this experiment?

value of A    1    2    3    4    5    6
frequency    23   28   30   21   23   25

Table 3.1: The number of times face A appeared in 150 tosses (see Problem 3.5).

Problem 3.6. What’s in your purse?
A coin is taken at random from a purse that contains one penny, two nickels, four dimes, and three quarters. If x equals the value of the coin, find the average value of x.

Problem 3.7. Rules of probability
Based on your solutions to Problems 3.1–3.6, state the rules of probability as you understand them at this time.
The following problems are related to the use of probability in everyday life.

Problem 3.8. Choices
Suppose that you are offered the following choice:
(a) A prize of $50, or
(b) you flip a (fair) coin and win $100 if you get a head, but $0 if you get a tail.
Which choice would you make? Explain your reasoning. Would your choice change if the prize was $40?

Problem 3.9. More choices
Suppose that you are offered the following choices:
(a) A prize of $100 is awarded for each head found in ten flips of a coin, or
(b) a prize of $400.
What choice would you make? Explain your reasoning.

[1] The earliest known six-sided dice have been found in the Middle East. A die made of baked clay was found in excavations of ancient Mesopotamia. The history of games of chance is discussed by Bennett (1998).

Problem 3.10. Thinking about probability
(a) Suppose that you were to judge an event to be 99.9999% probable. Would you be willing to bet $999,999 against $1 that the event would occur? Discuss why probability assessments should be kept separate from decision issues.
(b) In one version of the lottery the player chooses six numbers from 1 through 49. The player wins only if there is an exact match with the numbers that are randomly generated. Suppose that someone gives you a dollar to play the lottery. What sequence of six numbers between 1 and 49 would you choose? Are some choices better than others?
(c) Suppose you toss a coin six times and obtain heads each time. Estimate the probability that you will obtain heads on your seventh toss. Now imagine tossing the coin 60 times, and obtaining heads each time. What do you think would happen on the next toss?
(d) What is the probability that it will rain tomorrow? What is the probability that the Dow Jones industrial average will increase tomorrow?
(e) Give several examples of the use of probability in everyday life. In each case discuss how you would estimate the probability.

3.2 The Rules of Probability

We now summarize the basic rules and ideas of probability.[2] Suppose that there is an operation or a process that has several distinct possible outcomes. The process might be the flip of a coin or the roll of a six-sided die. We call each flip a trial. The list of all the possible events or outcomes is called the sample space. We assume that the events are mutually exclusive, that is, the occurrence of one event implies that the others cannot happen at the same time.
We let n represent the number of events, and label the events by the index i, which varies from 1 to n. For now we assume that the sample space is finite and discrete. For example, the flip of a coin results in one of two events, which we refer to as heads and tails, and the roll of a die yields one of six possible events.

For each event i, we assign a probability P(i) that satisfies the conditions

$$P(i) \ge 0 \tag{3.1}$$

and

$$\sum_i P(i) = 1. \tag{3.2}$$

P(i) = 0 means that the event cannot occur, and P(i) = 1 means that the event must occur. The normalization condition (3.2) says that the sum of the probabilities of all possible mutually exclusive outcomes is 1.
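Conditions (3.1) and (3.2) are easy to test for any proposed assignment of probabilities. The following is a minimal sketch; the function name `is_valid_distribution` and the two example assignments (a die with equal probabilities and a deliberately invalid assignment) are illustrative assumptions rather than part of the text. Exact fractions are used so that the sum in (3.2) is not affected by floating-point round-off.

```python
from fractions import Fraction


def is_valid_distribution(probs):
    """Check (3.1) and (3.2): every P(i) >= 0 and the P(i) sum to exactly 1."""
    return all(p >= 0 for p in probs) and sum(probs) == 1


# Assumed example: six outcomes with equal probability 1/6.
equal_die = [Fraction(1, 6)] * 6

# Assumed counterexample: three outcomes with probability 1/2 each (sum is 3/2).
bad_assignment = [Fraction(1, 2)] * 3

print(is_valid_distribution(equal_die))       # True
print(is_valid_distribution(bad_assignment))  # False
```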
[2] In 1933 the Russian mathematician A. N. Kolmogorov formulated a complete set of axioms for the mathematical definition of probability.

Example 3.1. Sample space of a die
Let x be the number of points on the face of a die. What is the sample space of x?

Solution. The sample space or set of possible events is $x_i = \{1, 2, 3, 4, 5, 6\}$. These six outcomes are mutually exclusive. ♦

The rules of probability will be summarized in (3.3) and (3.5). These abstract rules must be supplemented by an interpretation of the term probability. As we will see, there are many different interpretations of probability because any interpretation that satisfies the rules of probability may be regarded as a kind of probability.

A common interpretation of probability is based on symmetry. Suppose that we have a two-sided coin that shows heads and tails. There are two possible mutually exclusive outcomes, and if the coin is fair, each outcome is equally likely.[3] If a die with six distinct faces (see Figure 3.1) is perfect, we can use symmetry arguments to argue that each outcome should be counted equally and P(i) = 1/6 for each of the six faces. For an actual die, we can estimate the probability of an outcome a posteriori, that is, by the observation of the outcome of many throws. As we will see, other kinds of information in addition to symmetry arguments can be used to estimate probabilities.

Figure 3.1: The six possible outcomes of the toss of a die.

Suppose that we know that the probability of rolling any face of a die in one throw is equal to 1/6, and we want to find the probability of finding face 3 or face 6 in one throw. In this case we wish to know the probability of a trial that is a combination of more elementary operations for which the probabilities are already known. That is, we want to know the probability of the outcome, i or j, where i and j are mutually exclusive events.
According to the rules of probability, the probability of event i or j is given by

$$P(i \text{ or } j) = P(i) + P(j) \qquad \text{(addition rule)}. \tag{3.3}$$

The relation (3.3) is generalizable to more than two events. An important consequence of (3.3) is that if P(i) is the probability of event i, then the probability of event i not occurring is 1 − P(i).

Example 3.2. What is the probability of throwing a three or a six with one throw of a die?

Solution. The probability that the face exhibits either 3 or 6 is $\frac{1}{6} + \frac{1}{6} = \frac{1}{3}$. ♦

Example 3.3. What is the probability of not throwing a six with one throw of a die?

Solution. The answer is the probability of either 1 or 2 or 3 or 4 or 5. The addition rule gives that the probability P(not six) is

$$P(\text{not six}) = P(1) + P(2) + P(3) + P(4) + P(5) \tag{3.4a}$$
$$= 1 - P(6) = \frac{5}{6}, \tag{3.4b}$$

where the last relation follows from the fact that the probabilities of all outcomes sum to one. It is useful to take advantage of this property when solving many probability problems. ♦

[3] Is the outcome of a coin toss really random? The outcome of a coin flip is deterministic, but the outcome depends sensitively on the initial conditions, which we don’t know precisely. See the references at the end of the chapter.

Another simple rule concerns the probability of the joint occurrence of independent events. These events might be throwing a 3 on one die and throwing a 4 on a second die. If two events are independent, then the probability of both events occurring is the product of their probabilities:

$$P(i \text{ and } j) = P(i)\,P(j) \qquad \text{(multiplication rule)}. \tag{3.5}$$

Events are independent if the occurrence of one event does not affect the probability of the occurrence of the other.

To understand the applicability of (3.5) and the meaning of the independence of events, consider the problem of determining the probability that a person chosen at random is a female over six feet tall. Suppose that we know that the probability of a person being over six feet tall is $P(6^+) = \frac{1}{10}$, and the probability of being female is $P(\text{female}) = \frac{1}{2}$. We might conclude that the probability of being a tall female is $P(\text{female})\,P(6^+) = \frac{1}{2} \times \frac{1}{10} = \frac{1}{20}$. This same probability calculation would hold for a tall male. However, this reasoning is incorrect, because the probability of being a tall female differs from the probability of being a tall male. The problem is that the two events, being over six feet tall and being female, are not independent.
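The failure of the multiplication rule for dependent events can be made concrete with a small table of joint probabilities. The numbers below are hypothetical and are not data from the text; they were chosen only so that the marginal probabilities agree with the values P(female) = 1/2 and P(6+) = 1/10 used above. The helper `marginal` is likewise an illustrative name.

```python
from fractions import Fraction as F

# Hypothetical joint probabilities (not real data), chosen only so that
# the marginals match the text: P(female) = 1/2 and P(6+) = 1/10.
joint = {
    ("female", "tall"): F(1, 100),
    ("female", "not tall"): F(49, 100),
    ("male", "tall"): F(9, 100),
    ("male", "not tall"): F(41, 100),
}


def marginal(table, index, value):
    """Sum the joint probabilities over all outcomes whose entry `index` equals `value`."""
    return sum(p for key, p in table.items() if key[index] == value)


p_female = marginal(joint, 0, "female")
p_tall = marginal(joint, 1, "tall")
p_tall_female = joint[("female", "tall")]

# The two events are independent only if the joint probability equals the product.
independent = p_tall_female == p_female * p_tall

print(p_female, p_tall)          # 1/2 1/10
print(p_female * p_tall)         # 1/20
print(p_tall_female)             # 1/100
print(independent)               # False
```

In this assumed table the product of the marginals (1/20) overestimates the joint probability (1/100), which is the kind of discrepancy the multiplication rule produces when it is applied to events that are not independent.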
On the other hand, consider the probability that a person chosen at random is female and was born on September 6. We can reasonably assume equal likelihood of birthdays for all days of the year, and it is correct to conclude that this probability is $\frac{1}{2} \times \frac{1}{365}$ (not counting leap years). Being a woman and being born on September 6 are independent events.

Problem 3.11. Give an example from your solutions to Problems 3.1–3.6 where you used the addition rule or the multiplication rule or both.

Example 3.4. What is the probability of throwing an even number with one throw of a die?

Solution. We can use the addition rule to find that

$$P(\text{even}) = P(2) + P(4) + P(6) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}. \tag{3.6}$$
♦

Example 3.5. What is the probability of the same face appearing on two successive throws of a die?

Solution. We know that the probability of any specific combination of outcomes, for example, (1,1), (2,2), ..., (6,6), is $\frac{1}{6} \times \frac{1}{6} = \frac{1}{36}$. Hence, by the addition rule

$$P(\text{same face}) = P(1,1) + P(2,2) + \cdots + P(6,6) = 6 \times \frac{1}{36} = \frac{1}{6}. \tag{3.7}$$
♦

Example 3.6. What is the probability that in two throws of a die at least one 6 appears?

Solution. We know that

$$P(6) = \frac{1}{6}, \qquad P(\text{not } 6) = \frac{5}{6}. \tag{3.8}$$

There are four possible outcomes (6, 6), (6, not 6), (not 6, 6), and (not 6, not 6) with the respective probabilities

$$P(6, 6) = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}, \tag{3.9a}$$
$$P(6, \text{not } 6) = P(\text{not } 6, 6) = \frac{1}{6} \times \frac{5}{6} = \frac{5}{36}, \tag{3.9b}$$
$$P(\text{not } 6, \text{not } 6) = \frac{5}{6} \times \frac{5}{6} = \frac{25}{36}. \tag{3.9c}$$

All outcomes except the last have at least one 6. Hence, the probability of obtaining at least one 6 is

$$P(\text{at least one } 6) = P(6, 6) + P(6, \text{not } 6) + P(\text{not } 6, 6) \tag{3.10a}$$
$$= \frac{1}{36} + \frac{5}{36} + \frac{5}{36} = \frac{11}{36}. \tag{3.10b}$$

A more direct way of obtaining this result is to use the normalization condition. That is,

$$P(\text{at least one six}) = 1 - P(\text{not } 6, \text{not } 6) = 1 - \frac{25}{36} = \frac{11}{36}. \tag{3.10c}$$
♦

Example 3.7. What is the probability of obtaining at least one six in four throws of a die?

Solution. We know that, in one throw of a die, there are two outcomes with $P(6) = \frac{1}{6}$ and $P(\text{not } 6) = \frac{5}{6}$ as in (3.8). Hence, in four throws of a die there are 16 possible outcomes, only one of which has no 6. We can use the multiplication rule (3.5) to find that

$$P(\text{not } 6, \text{not } 6, \text{not } 6, \text{not } 6) = P(\text{not } 6)^4 = \left(\frac{5}{6}\right)^4, \tag{3.11}$$

and hence

$$P(\text{at least one six}) = 1 - P(\text{not } 6, \text{not } 6, \text{not } 6, \text{not } 6) \tag{3.12a}$$
$$= 1 - \left(\frac{5}{6}\right)^4 = \frac{671}{1296} \approx 0.518. \tag{3.12b}$$
♦

Frequently we know the probabilities only up to a constant factor. For example, we might know P(1) = 2P(2), but not P(1) or P(2) separately. Suppose we know that P(i) is proportional to f(i), where f(i) is a known function. To obtain the normalized probabilities, we divide each function f(i) by the sum of all the unnormalized probabilities. That is, if $P(i) \propto f(i)$ and $Z = \sum_i f(i)$, then P(i) = f(i)/Z. This procedure is called normalization.

Example 3.8. Suppose that in a given class it is three times as likely to receive a C as an A, twice as likely to obtain a B as an A, one-fourth as likely to be assigned a D as an A, and nobody fails the class. What are the probabilities of getting each grade?

Solution. We first assign the unnormalized probability of receiving an A as f(A) = 1. Then f(B) = 2, f(C) = 3, and f(D) = 0.25. Then $Z = \sum_i f(i) = 1 + 2 + 3 + 0.25 = 6.25$. Hence, P(A) = f(A)/Z = 1/6.25 = 0.16, P(B) = 2/6.25 = 0.32, P(C) = 3/6.25 = 0.48, and P(D) = 0.25/6.25 = 0.04. ♦

The normalization procedure arises again and again in different contexts. We will see that much of the mathematics of statistical mechanics can be formulated in terms of the calculation of normalization constants.

Problem 3.12. Rolling the dice
If a person rolls two dice, what is the probability P(n) of getting the sum n? Plot P(n) as a function of n.

Problem 3.13. An almost even bet
What is the probability of obtaining at least one double 6 in 24 throws of a pair of dice?

Problem 3.14. Rolling three dice
Suppose that three dice are thrown at the same time. What is the ratio of the probabilities that the sum of the three faces is 10 compared to 9?

Problem 3.15. Fallacious reasoning
What is the probability that the total number of spots shown on three dice thrown at the same time is 11? What is the probability that the total is 12? What is the fallacy in the following argument?
The number 11 occurs in six ways: (1,4,6), (2,3,6), (1,5,5), (2,4,5), (3,3,5), (3,4,4). The number 12 also occurs in six ways: (1,5,6), (2,4,6), (3,3,6), (2,5,5), (3,4,5), (4,4,4), and hence the two numbers should be equally probable.

3.3 Mean Values

The specification of the probability distribution P(1), P(2), ..., P(n) for the n possible values of the variable x constitutes the most complete statistical description of the system. However, in many cases it is more convenient to describe the distribution of the possible values of x in a less detailed way. The most familiar way is to specify the average or mean value of x, which we will denote as $\overline{x}$. The definition of the mean value of x is

$$\overline{x} \equiv x_1 P(1) + x_2 P(2) + \cdots + x_n P(n) \tag{3.13a}$$
$$= \sum_{i=1}^{n} x_i P(i), \tag{3.13b}$$

where P(i) is the probability of $x_i$. If f(x) is a function of x, then the mean value of f(x) is given by

$$\overline{f(x)} = \sum_{i=1}^{n} f(x_i) P(i). \tag{3.14}$$

If f(x) and g(x) are any two functions of x, then

$$\overline{f(x) + g(x)} = \sum_{i=1}^{n} [f(x_i) + g(x_i)] P(i) \tag{3.15a}$$
$$= \sum_{i=1}^{n} f(x_i) P(i) + \sum_{i=1}^{n} g(x_i) P(i), \tag{3.15b}$$

or

$$\overline{f(x) + g(x)} = \overline{f(x)} + \overline{g(x)}. \tag{3.15c}$$

Problem 3.16. Show that, if c is a constant, then

$$\overline{c f(x)} = c\,\overline{f(x)}. \tag{3.16}$$

We define the mth moment of the probability distribution P as

$$\overline{x^m} \equiv \sum_{i=1}^{n} x_i^m P(i), \tag{3.17}$$

where we have let $f(x) = x^m$. The mean of x is the first moment of the probability distribution.

Problem 3.17. Suppose that the variable x takes on the values −2, −1, 0, 1, and 2 with probabilities 1/16, 4/16, 6/16, 4/16, and 1/16, respectively. Calculate the first two moments of x.

The mean value of x is a measure of the central value of x about which the various values of $x_i$ are distributed. If we measure the deviation of x from its mean, we have

$$\Delta x \equiv x - \overline{x} \tag{3.18}$$

and

$$\overline{\Delta x} = \overline{(x - \overline{x})} = \overline{x} - \overline{x} = 0. \tag{3.19}$$

That is, the average value of the deviation of x from its mean vanishes.

If only one outcome j were possible, we would have P(i) = 1 for i = j and zero otherwise; that is, the probability distribution would have zero width. Usually, there is more than one outcome and a measure of the width of the probability distribution is given by

$$\overline{\Delta x^2} \equiv \overline{(x - \overline{x})^2}. \tag{3.20}$$

The quantity $\overline{\Delta x^2}$ is known as the dispersion or variance and its square root is called the standard deviation. The use of the square of $x - \overline{x}$ ensures that values of x that are smaller and larger than $\overline{x}$ contribute to $\overline{\Delta x^2}$ with the same sign. It is easy to see that the larger the spread of values of x about $\overline{x}$, the larger the variance.

A useful form for the variance can be found by noting that

$$\overline{(x - \overline{x})^2} = \overline{x^2 - 2x\overline{x} + \overline{x}^2} \tag{3.21a}$$
$$= \overline{x^2} - 2\,\overline{x}\,\overline{x} + \overline{x}^2 \tag{3.21b}$$
$$= \overline{x^2} - \overline{x}^2. \tag{3.21c}$$

Because $\overline{\Delta x^2}$ is always nonnegative, it follows that $\overline{x^2} \ge \overline{x}^2$.

The variance is the mean value of $(x - \overline{x})^2$ and represents the square of a width. We will find that it is useful to interpret the width of the probability distribution in terms of the standard deviation σ, which is defined as the square root of the variance. The standard deviation of the probability distribution P(x) is given by

$$\sigma_x = \sqrt{\overline{\Delta x^2}} = \sqrt{\overline{x^2} - \overline{x}^2}. \tag{3.22}$$

Example 3.9. Find the mean value $\overline{x}$, the variance $\overline{\Delta x^2}$, and the standard deviation $\sigma_x$ for the value of a single throw of a die.

Solution. Because $P(i) = \frac{1}{6}$ for i = 1, ..., 6, we have that

$$\overline{x} = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = \frac{7}{2} = 3.5, \tag{3.23a}$$
$$\overline{x^2} = \frac{1}{6}(1 + 4 + 9 + 16 + 25 + 36) = \frac{91}{6}, \tag{3.23b}$$
$$\overline{\Delta x^2} = \overline{x^2} - \overline{x}^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12} \approx 2.92, \tag{3.23c}$$
$$\sigma_x \approx \sqrt{2.92} \approx 1.71. \tag{3.23d}$$
♦

Example 3.10. On the average, how many times must a die be thrown until a 6 appears?

Solution. Although it might be obvious that the answer is six, it is instructive to confirm this answer directly. Let p be the probability of a 6 on a given throw. To calculate m, the mean number of throws needed before a 6 appears, we calculate the probability of obtaining a 6 for the first time on the ith throw, multiply this probability by i, and then sum over all i. The first few probabilities are listed in Table 3.2.

throw                                1     2      3          4
probability of success on trial i    p     qp     $q^2 p$    $q^3 p$

Table 3.2: Probability of a 6 for the first time on throw i, where p = 1/6 is the probability of a 6 on a given throw and q = 1 − p (see Example 3.10).

The resulting sum is

$$m = p + 2pq + 3pq^2 + 4pq^3 + \cdots \tag{3.24a}$$
$$= p\,(1 + 2q + 3q^2 + \cdots) \tag{3.24b}$$
$$= p\,\frac{d}{dq}\left(1 + q + q^2 + q^3 + \cdots\right) \tag{3.24c}$$
$$= p\,\frac{d}{dq}\,\frac{1}{1 - q} = \frac{p}{(1 - q)^2} = \frac{1}{p}. \tag{3.24d}$$

Another way to obtain this result is to use the following recursive argument. Because the throws are independent, the mean number of additional throws needed after throwing the die any number of times is still m. If we throw the die once and do not obtain a 6, we will need on average m more throws or a total of m + 1. If we throw it twice and do not obtain a 6 on either throw, we will need m more throws or a total of m + 2, and so forth. The contribution to the mean due to failing on the first throw and then succeeding sometime later is q(m + 1). The probability of succeeding on the first throw is p and the contribution to the mean is p(1) = p. The mean is the sum of these two terms or m = (1 − p)(m + 1) + p. The solution for m is m = 1/p. ♦

3.4 The Meaning of Probability

How can we assign probabilities to the various events? If event $E_1$ is more probable than event $E_2$ [$P(E_1) > P(E_2)$], we mean that $E_1$ is more likely to occur than $E_2$. This statement of our intuitive understanding of probability illustrates that probability is a way of classifying the plausibility of events under conditions of uncertainty. Probability is related to our degree of belief in the occurrence of an event.

This definition of probability is not bound to a single evaluation rule and there are many ways to obtain $P(E_i)$. For example, we could use symmetry considerations as we have done, past frequencies, simulations, theoretical calculations, or, as we will learn in Section 3.4.2, Bayesian inference. Probability assessments depend on who does the evaluation and the status of the information the evaluator has at the moment of the assessment. We always evaluate the conditional probability, that is, the probability of an event E given the information I, $P(E \mid I)$. Consequently, several people can have simultaneously different degrees of belief about the same event, as is well known to investors in the stock market.
If rational people have access to the same information, they should come to the same conclusion about the probability of an event. The idea of a coherent bet forces us to make probability assessments that correspond to our belief in the occurrence of an event. If we consider an event to be 50% probable, then we should be ready to place an even bet on the occurrence of the event or on its opposite. However, if someone wishes to place the bet in one direction but not in the other, it means that this person thinks that the preferred event is more probable than the other. In this case the 50% probability assessment is incoherent, and this person’s wish does not correspond to his or her belief.

A coherent bet has to be considered virtual. For example, a person might judge an event to be 99.9999% probable, but nevertheless refuse to bet $999,999 against $1, if $999,999 is much more than the person’s resources. Nevertheless, the person might be convinced that this bet would be fair if he or she had an infinite budget. Probability assessments should be kept separate from decision issues. Decisions depend not only on the probability of the event, but also on the subjective importance of a given amount of money (see, for example, Problems 3.10 and 3.85).

Our discussion of probability as the degree of belief that an event will occur shows the inadequacy of the frequency definition of probability, which defines probability as the ratio of the number of desired outcomes to the total number of possible outcomes. This definition is inadequate because we would have to specify that each outcome has equal probability. Thus we would have to use the term probability in its own definition. If we do an experiment to measure the frequencies of various outcomes, then we need to make an additional assumption that the measured frequencies will be the same in the future as they were in the past.
Also we have to make a large number of measurements to ensure accuracy, and we have no way of knowing a priori how many measurements are sufficient. Thus, the definition of probability as a frequency really turns out to be a method for estimating probabilities with some hidden assumptions.

Our definition of probability as a measure of the degree of belief in the occurrence of an outcome implies that probability depends on our prior knowledge, because belief depends on prior knowledge. For example, if we toss a coin and obtain 100 tails in a row, we might use this knowledge as evidence that the coin or toss is biased, and thus estimate that the probability of throwing another tail is very high. However, if a careful physical analysis shows that there is no bias, then we would stick to our estimate of 1/2. The probability assessment depends on what knowledge we bring to the problem. If we have no knowledge other than the possible outcomes, then the best estimate is to assume equal probability for all events. However, this assumption is not a definition, but an example of belief. As an example of the importance of prior knowledge, consider the following problem.

Problem 3.18. A couple with two children
(a) A couple has two children. What is the probability that at least one child is a girl?
(b) Suppose that you know that at least one child is a girl. What is the probability that both children are girls?
(c) Instead suppose that we know that the older child is a girl. What is the probability that the younger child is a girl?

We know that we can estimate probabilities empirically by sampling, that is, by making repeated measurements of the outcome of independent events. Intuitively we believe that if we perform more and more measurements, the calculated average will approach the exact mean of the quantity of interest. This idea is called the law of large numbers. As an example, suppose that we flip a single coin M times and count the number of heads.
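The law of large numbers can be explored with a short simulation. The sketch below is an independent illustration and is not the authors' Program CoinToss; it assumes that Python's standard pseudorandom generator is an adequate stand-in for a fair coin, and the function name `fraction_of_heads` and the seed are arbitrary choices.

```python
import random


def fraction_of_heads(num_tosses, rng):
    """Simulate num_tosses flips of a fair coin and return the fraction that are heads."""
    # rng.random() is uniform on [0, 1); values below 0.5 are counted as heads.
    heads = sum(1 for _ in range(num_tosses) if rng.random() < 0.5)
    return heads / num_tosses


rng = random.Random(2010)  # arbitrary seed so that a run can be repeated
for num_tosses in (10, 100, 1_000, 10_000, 100_000):
    fraction = fraction_of_heads(num_tosses, rng)
    print(f"{num_tosses:>7d} tosses: fraction of heads = {fraction:.4f}")
```

The printed fractions depend on the seed, but for large numbers of tosses they are expected to cluster more and more tightly around 1/2.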
Our result for the number of heads is shown in Table 3.3. We see that the fraction of heads approaches 1/2 as the number of measurements becomes larger.

tosses       heads     fraction of heads
10           4         0.4
50           29        0.58
100          49        0.49
200          101       0.505
500          235       0.470
1,000        518       0.518
10,000       4997      0.4997
100,000      50021     0.50021
500,000      249946    0.49999
1,000,000    500416    0.50042

Table 3.3: The number and fraction of heads in M tosses of a coin. We did not really toss a coin in the air $10^6$ times. Instead we used a computer to generate a sequence of random numbers to simulate the tossing of a coin.

Problem 3.19. Multiple tosses of a single coin
Use Program CoinToss to simulate multiple tosses of a single coin. What is the correspondence between this simulation of a coin being tossed many times and the actual physical tossing of a coin? If the coin is “fair,” what do you think the ratio of the number of heads to the total number of tosses will be? Do you obtain this number after 100 tosses? 10,000 tosses?

Another way of estimating the probability is to perform a single measurement on many copies or replicas of the system of interest. For example, instead of flipping a single coin 100 times in succession, we collect 100 coins and flip all of them at the same time. The fraction of coins that show heads is an estimate of the probability of that event. The collection of identically prepared systems is called an ensemble, and the probability of occurrence of a single event is estimated with respect to this ensemble. The ensemble consists of a large number M of identical systems, that is, systems that satisfy the same known conditions.

Because you might not be familiar with sequences of random numbers, imagine a robot that can write the positive integers between 1 and $2^{31}$ on pieces of paper. The robot places these pieces in a hat, shakes the hat, and then chooses the pieces at random. If the number chosen is less than $\frac{1}{2} \times 2^{31}$, then we say that we found a head.
Each piece is placed back in the hat after it is read.

If the system of interest is not changing in time, it is reasonable to assume that an estimate of the probability by either a series of measurements on a single system at different times or similar measurements on many identical systems at the same time would give consistent results.

Note that we have estimated various probabilities by a frequency, but have not defined probability in terms of a frequency. As emphasized by D’Agostini, past frequency is experimental data. This data happened with certainty so the concept of probability no longer applies. Probability is how much we believe that an event will occur taking into account all available information including past frequencies. Because probability quantifies the degree of belief at a given time, it is not directly measurable. If we make further measurements, they can only influence future assessments of the probability.

3.4.1 Information and uncertainty

Consider two experiments that each have two outcomes $E_1$ and $E_2$ with probabilities $P_1$ and $P_2$. For example, the experiments could correspond to the toss of a coin. In the first experiment the coin has probabilities $P_1 = P_2 = 1/2$, and in the second experiment (a bent coin) $P_1 = 1/5$ and $P_2 = 4/5$. Intuitively, we would say that the result of the first experiment is more uncertain than the result of the second experiment.

Next consider two additional experiments. In the third experiment there are four outcomes with $P_1 = P_2 = P_3 = P_4 = 1/4$, and in the fourth experiment there are six outcomes with $P_1 = P_2 = P_3 = P_4 = P_5 = P_6 = 1/6$. The fourth experiment is the most uncertain because there are more equally likely outcomes and the second experiment is the least uncertain.

We will now introduce a mathematical measure that is consistent with our intuitive sense of uncertainty. Let us define the uncertainty function $S(P_1, P_2, \ldots, P_i, \ldots)$, where $P_i$ is the probability of event i.
We first consider the case where all the probabilities $P_i$ are equal. Then $P_1 = P_2 = \ldots = P_i = 1/\Omega$, where $\Omega$ is the total number of outcomes. In this case we have $S = S(1/\Omega, 1/\Omega, \ldots)$ or simply $S(\Omega)$. It is easy to see that $S(\Omega)$ has to satisfy some simple conditions. For only one outcome, $\Omega = 1$ and there is no uncertainty. Hence we must have
$$S(\Omega = 1) = 0. \tag{3.25}$$
We also have that
$$S(\Omega_1) > S(\Omega_2) \quad \text{if } \Omega_1 > \Omega_2. \tag{3.26}$$
That is, $S(\Omega)$ is an increasing function of $\Omega$.

We next consider the form of $S$ for multiple events. For example, suppose that we throw a die with $\Omega_1$ outcomes and flip a coin with $\Omega_2$ equally probable outcomes. The total number of outcomes is $\Omega = \Omega_1 \Omega_2$. If the result of the die is known, the uncertainty associated with the die is reduced to zero, but there still is uncertainty associated with the toss of the coin. Similarly, we can reduce the uncertainty in the reverse order, but the total uncertainty is still nonzero. These considerations suggest that
$$S(\Omega_1 \Omega_2) = S(\Omega_1) + S(\Omega_2). \tag{3.27}$$

It is remarkable that there is a unique functional form that satisfies the three conditions (3.25)–(3.27). We can find this form by writing (3.27) in the form
$$S(xy) = S(x) + S(y), \tag{3.28}$$
and taking the variables $x$ and $y$ to be continuous. (The analysis can be done assuming that $x$ and $y$ are discrete variables, but the analysis is simpler if we assume that $x$ and $y$ are continuous.) Now we take the partial derivative of $S(xy)$ with respect to $x$ and then with respect to $y$. We let $z = xy$ and obtain
\begin{align}
\frac{\partial S(z)}{\partial x} &= \frac{\partial z}{\partial x}\frac{dS(z)}{dz} = y\,\frac{dS(z)}{dz}, \tag{3.29a} \\
\frac{\partial S(z)}{\partial y} &= \frac{\partial z}{\partial y}\frac{dS(z)}{dz} = x\,\frac{dS(z)}{dz}. \tag{3.29b}
\end{align}
From (3.28) we have
\begin{align}
\frac{\partial S(z)}{\partial x} &= \frac{dS(x)}{dx}, \tag{3.30a} \\
\frac{\partial S(z)}{\partial y} &= \frac{dS(y)}{dy}. \tag{3.30b}
\end{align}
By comparing the right-hand sides of (3.29) and (3.30), we have
\begin{align}
\frac{dS}{dx} &= y\,\frac{dS}{dz}, \tag{3.31a} \\
\frac{dS}{dy} &= x\,\frac{dS}{dz}. \tag{3.31b}
\end{align}
If we multiply (3.31a) by $x$ and (3.31b) by $y$, we obtain
$$x\,\frac{dS(x)}{dx} = y\,\frac{dS(y)}{dy} = z\,\frac{dS(z)}{dz}. \tag{3.32}$$
Note that the first term in (3.32) depends only on $x$ and the second term depends only on $y$. Because $x$ and $y$ are independent variables, the three terms in (3.32) must be equal to a constant. Hence we have the desired condition
$$x\,\frac{dS(x)}{dx} = y\,\frac{dS(y)}{dy} = A, \tag{3.33}$$
where $A$ is a constant. The differential equation in (3.33) can be integrated to give
$$S(x) = A \ln x + B. \tag{3.34}$$
The integration constant $B$ must be equal to zero to satisfy the condition (3.25). The constant $A$ is arbitrary, so we choose $A = 1$. Hence for equal probabilities we have that
$$S(\Omega) = \ln \Omega. \tag{3.35}$$

What about the case where the probabilities for the various events are unequal? We will not derive the result here but only state that the general form of the uncertainty $S$ is
$$S = -\sum_i P_i \ln P_i. \tag{3.36}$$
Note that, if all the probabilities are equal, then
$$P_i = \frac{1}{\Omega} \quad \text{(for all } i\text{)}. \tag{3.37}$$
In this case
$$S = -\sum_i \frac{1}{\Omega} \ln \frac{1}{\Omega} = \Omega\,\frac{1}{\Omega} \ln \Omega = \ln \Omega, \tag{3.38}$$
because there are $\Omega$ equal terms in the sum. Hence (3.36) reduces to (3.35) as required. We also see that, if outcome $j$ is certain, $P_j = 1$ and $P_i = 0$ if $i \neq j$, and $S = -1 \ln 1 = 0$ (the terms with $P_i = 0$ do not contribute because $x \ln x \to 0$ as $x \to 0$). That is, if the outcome is certain, the uncertainty is zero, and there is no missing information.

We have shown that, if the $P_i$ are known, then the uncertainty or missing information $S$ can be calculated. Usually the problem is the other way around, and we want to determine the probabilities. Suppose we flip a perfect coin for which there are two possibilities. We expect that $P_1(\text{heads}) = P_2(\text{tails}) = 1/2$. That is, we would not assign a different probability to each outcome unless we had information to justify it. Intuitively we have adopted the principle of least bias or maximum uncertainty. Let's reconsider the toss of a coin. In this case $S$ is given by
\begin{align}
S &= -\sum_i P_i \ln P_i = -(P_1 \ln P_1 + P_2 \ln P_2) \tag{3.39a} \\
&= -\big[P_1 \ln P_1 + (1 - P_1) \ln(1 - P_1)\big], \tag{3.39b}
\end{align}
where we have used the fact that $P_1 + P_2 = 1$. We use the principle of maximum uncertainty and set the derivative of $S$ with respect to $P_1$ equal to zero:$^4$
$$\frac{dS}{dP_1} = -[\ln P_1 + 1 - \ln(1 - P_1) - 1] = -\ln \frac{P_1}{1 - P_1} = 0. \tag{3.40}$$
The solution of (3.40) satisfies
$$\frac{P_1}{1 - P_1} = 1, \tag{3.41}$$
which is satisfied by $P_1 = 1/2$. We can check that this solution is a maximum by calculating the second derivative at $P_1 = 1/2$:
$$\frac{\partial^2 S}{\partial P_1^2} = -\left[\frac{1}{P_1} + \frac{1}{1 - P_1}\right] = -4, \tag{3.42}$$
which is less than zero, as expected for a maximum.

Problem 3.20. Uncertainty
(a) Consider the toss of a coin for which $P_1 = P_2 = 1/2$ for the two outcomes. What is the uncertainty in this case?
(b) What is the uncertainty for $P_1 = 1/5$ and $P_2 = 4/5$? How does the uncertainty in this case compare to that in part (a)?
(c) At the beginning of Section 3.4.1 we discussed four experiments with various outcomes. Compare the uncertainty $S$ of the third and fourth experiments.

Example 3.11.
The toss of a three-sided die yields events $E_1$, $E_2$, and $E_3$ with faces of 1, 2, and 3 points, respectively. As a result of tossing many dice, we learn that the mean number of points is $\overline{f} = 1.9$, but we do not know the individual probabilities. What are the values of $P_1$, $P_2$, and $P_3$ that maximize the uncertainty consistent with the information that $\overline{f} = 1.9$?

Solution. We have
$$S = -\big[P_1 \ln P_1 + P_2 \ln P_2 + P_3 \ln P_3\big]. \tag{3.43}$$
We also know that
$$\overline{f} = 1P_1 + 2P_2 + 3P_3, \tag{3.44}$$
and $P_1 + P_2 + P_3 = 1$. We use the latter condition to eliminate $P_3$ using $P_3 = 1 - P_1 - P_2$, and rewrite (3.44) as
$$\overline{f} = P_1 + 2P_2 + 3(1 - P_1 - P_2) = 3 - 2P_1 - P_2. \tag{3.45}$$
We then use (3.45) to eliminate $P_2$ and $P_3$ from (3.43) with $P_2 = 3 - \overline{f} - 2P_1$ and $P_3 = \overline{f} - 2 + P_1$:
$$S = -\big[P_1 \ln P_1 + (3 - \overline{f} - 2P_1) \ln(3 - \overline{f} - 2P_1) + (\overline{f} - 2 + P_1) \ln(\overline{f} - 2 + P_1)\big]. \tag{3.46}$$
Because $S$ in (3.46) depends only on $P_1$, we can differentiate $S$ with respect to $P_1$ to find its maximum value:
\begin{align}
\frac{dS}{dP_1} &= -\Big[\ln P_1 + 1 - 2\big[\ln(3 - \overline{f} - 2P_1) + 1\big] + \ln(\overline{f} - 2 + P_1) + 1\Big] \tag{3.47a} \\
&= \ln \frac{(3 - \overline{f} - 2P_1)^2}{P_1(\overline{f} - 2 + P_1)} = 0. \tag{3.47b}
\end{align}
We see that for $dS/dP_1$ to be equal to zero, the argument of the logarithm must be one. The result is a quadratic equation for $P_1$ (see Problem 3.21). ♦

Problem 3.21. Fill in the missing steps in Example 3.11 and solve for $P_1$, $P_2$, and $P_3$.

In Section 3.11.1 we maximize the uncertainty for a case for which there are more than three outcomes.

4 We have used the fact that $d(\ln x)/dx = 1/x$.

3.4.2 *Bayesian inference

Conditional probabilities are not especially important for the development of equilibrium statistical mechanics, so this section may be omitted for now. Conditional probability and Bayes' theorem are very important for the analysis of data, with applications such as spam filters for email and image restoration. Bayes' theorem gives us a way of understanding how the probability that a hypothesis is true is affected by new evidence.

Let us define $P(A|B)$ as the probability of $A$ occurring given that we know that $B$ has occurred. We know that
$$P(A) = P(A|B)P(B) + P(A|{-B})P(-B), \tag{3.48}$$
where $-B$ means that $B$ did not occur. In (3.48) we used the fact that
$$P(A \text{ and } B) = P(A|B)P(B) = P(B|A)P(A). \tag{3.49}$$
Equation (3.49) means that the probability that $A$ and $B$ occur equals the probability that $A$ occurs given $B$ times the probability that $B$ occurs, which is the same as the probability that $B$ occurs given $A$ times the probability that $A$ occurs. Note that $P(A \text{ and } B)$ is the same as $P(B \text{ and } A)$, but $P(A|B)$ does not have the same meaning as $P(B|A)$. We can rearrange (3.49) to obtain Bayes' theorem
$$P(A|B) = \frac{P(B|A)P(A)}{P(B)} \quad \text{(Bayes' theorem)}. \tag{3.50}$$
We can generalize (3.50) for multiple possible outcomes $A_i$ for the same $B$. We rewrite (3.50) as
$$P(A_i|B) = \frac{P(B|A_i)P(A_i)}{P(B)} \quad \text{(multiple outcomes)}. \tag{3.51}$$
If all the $A_i$ are mutually exclusive and if at least one of the $A_i$ must occur, then we can also write
$$P(B) = \sum_i P(B|A_i)P(A_i). \tag{3.52}$$
If we substitute (3.52) for $P(B)$ into (3.51), we obtain
$$P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_i P(B|A_i)P(A_i)}. \tag{3.53}$$
Bayes' theorem is very useful for finding the most probable explanation of a given data set. In this context $A_i$ represents the possible explanation and $B$ represents the data. As more data becomes available, the probabilities $P(B|A_i)P(A_i)$ change.

Example 3.12. A chess program has two modes, expert (E) and novice (N). The expert mode beats you 75% of the time and the novice mode beats you 50% of the time. You close your eyes and randomly choose one of the modes and play two games. The computer wins (W) both times. What is the probability that you chose the novice mode?

Solution. The probability of interest is $P(\text{N}|\text{WW})$, which is difficult to calculate directly. Bayes' theorem lets us use the probability $P(\text{WW}|\text{N})$, which is easy to calculate, to determine $P(\text{N}|\text{WW})$. We use (3.50) to write
$$P(\text{N}|\text{WW}) = \frac{P(\text{WW}|\text{N})P(\text{N})}{P(\text{WW})}. \tag{3.54}$$
We know that $P(\text{N}) = 1/2$ and $P(\text{WW}|\text{N}) = (1/2)^2 = 1/4$. We next have to calculate $P(\text{WW})$. There are two ways that the program could have won the two games: You chose the novice mode and it won twice, or you chose the expert mode and it won twice. Because N and E are mutually exclusive, we have $P(\text{WW}) = P(\text{N and WW}) + P(\text{E and WW})$. From (3.49) we have
\begin{align}
P(\text{WW}) &= P(\text{WW}|\text{N})P(\text{N}) + P(\text{WW}|\text{E})P(\text{E}) \tag{3.55a} \\
&= (1/2 \times 1/2 \times 1/2) + (3/4 \times 3/4 \times 1/2) = \frac{13}{32}. \tag{3.55b}
\end{align}
Hence
$$P(\text{N}|\text{WW}) = \frac{P(\text{WW}|\text{N})P(\text{N})}{P(\text{WW})} = \frac{(1/4 \times 1/2)}{13/32} = \frac{4}{13} \approx 0.31. \tag{3.56}$$
Note that the probability of choosing the novice mode has decreased from 50% to about 31% because you have the additional information that the computer won twice and thus you are more likely to have chosen the expert mode. ♦

Example 3.13. Alice plants two types of flowers in her garden: 30% of type A and 70% of type B. Both types yield either red or yellow flowers, with $P(\text{red}|\text{A}) = 0.4$ and $P(\text{red}|\text{B}) = 0.3$.

(a) What is the percentage of red flowers that Alice will obtain?

Solution.
We can use the total probability law (3.48) to write
\begin{align}
P(\text{red}) &= P(\text{red}|\text{A})P(\text{A}) + P(\text{red}|\text{B})P(\text{B}) \tag{3.57a} \\
&= (0.4 \times 0.3) + (0.3 \times 0.7) = 0.33. \tag{3.57b}
\end{align}
So Alice will find on average that about one of three flowers will be red.

(b) Suppose a red flower is picked at random from Alice's garden. What is the probability of the flower being type A?

Solution. We apply Bayes' theorem (3.53) and obtain
\begin{align}
P(\text{A}|\text{red}) &= \frac{P(\text{red}|\text{A})P(\text{A})}{P(\text{red}|\text{A})P(\text{A}) + P(\text{red}|\text{B})P(\text{B})} \tag{3.58a} \\
&= \frac{0.4 \times 0.3}{(0.4 \times 0.3) + (0.3 \times 0.7)} = \frac{12}{33} = \frac{4}{11} \approx 0.36. \tag{3.58b}
\end{align}
We find that given that the flower is red, its probability of being type A increases from 0.30 to 0.36 because type A has a higher probability than type B of yielding red flowers. ♦

Example 3.14. Do you have a fair coin?
Suppose that there are four coins of the same type in a bag. Three of them are fair, but the fourth is double headed. You choose one coin at random from the bag and toss it five times. It comes up heads each time. What is the probability that you have chosen the double headed coin?

Solution. If the coin were fair, the probability of five heads in a row (5H) would be $(1/2)^5 = 1/32 \approx 0.03$. This probability is small, so you would probably decide that you have not chosen a fair coin. But because you have more information, you can determine a better estimate of the probability. We have
\begin{align}
P(\text{5H}) &= P(\text{5H}|\text{fair})P(\text{fair}) + P(\text{5H}|\text{not fair})P(\text{not fair}) \tag{3.59a} \\
&= [(1/2)^5 \times 3/4] + [1 \times 1/4] = 35/128 \approx 0.27, \tag{3.59b} \\
P(\text{fair}|\text{5H}) &= P(\text{5H}|\text{fair})P(\text{fair})/P(\text{5H}) \tag{3.59c} \\
&= \frac{[(1/2)^5 \times 3/4]}{35/128} = 3/35 \approx 0.086. \tag{3.59d}
\end{align}
Thus the probability that the coin is fair given the five heads in succession is much less than the probability 3/4 of picking a fair coin randomly out of the bag, and the probability that you have chosen the double headed coin is $1 - 3/35 = 32/35 \approx 0.91$. ♦

Problem 3.22. More on choosing a fair coin
Suppose that you have two coins that look and feel identical, but one is double headed and one is fair. The two coins are placed in a box and you choose one at random.
(a) What is the probability that you have chosen the fair coin?
(b) Suppose that you toss the chosen coin twice and obtain heads both times. What is the probability that you have chosen the fair coin? Why is this probability different from that in part (a)?
(c) Suppose that you toss the chosen coin four times and obtain four heads. What is the probability that you have chosen the fair coin?
(d) Suppose that there are ten coins in the box with nine fair and one double headed. You toss the chosen coin twice and obtain two heads. What is the probability that you have chosen the fair coin?
(e) Now suppose that the biased coin is not double headed, but has a probability of 0.98 of coming up heads. Also suppose that the probability of choosing the biased coin is 1 in $10^4$. What is the probability of choosing the biased coin given that the first toss yields heads?

Example 3.15. Let's Make A Deal
Consider the quandary known as the Let's Make A Deal or Monty Hall problem.$^5$ In this former television show a contestant is shown three doors. Behind one door is an expensive prize such as a car and behind the other two doors are inexpensive gifts such as a tie. The contestant chooses a door. Suppose the contestant chooses door 1. Then the host opens door 2 containing the tie, knowing that the car is not behind door 2. The contestant now has a choice: stay with the original choice or switch to door 3. What would you do?
Let us use Bayes' theorem (3.53) to determine the best course of action. We want to calculate
$$P(A_1|B) = P(\text{car behind door 1} \mid \text{door 2 open after door 1 chosen}) \tag{3.60a}$$
and
$$P(A_3|B) = P(\text{car behind door 3} \mid \text{door 2 open after door 1 chosen}), \tag{3.60b}$$
where $A_i$ denotes the car behind door $i$. We know that all the $P(A_i)$ equal 1/3, because with no information we assume that the probability that the car is behind each door is the same. Because the host can open door 2 or 3 if the car is behind door 1, but can only open door 2 if the car is behind door 3, we have
\begin{align}
P(\text{door 2 open after door 1 chosen} \mid \text{car behind 1}) &= \frac{1}{2}, \tag{3.61a} \\
P(\text{door 2 open after door 1 chosen} \mid \text{car behind 2}) &= 0, \tag{3.61b} \\
P(\text{door 2 open after door 1 chosen} \mid \text{car behind 3}) &= 1. \tag{3.61c}
\end{align}
From Bayes' theorem we have
\begin{align}
P(\text{car behind 1} \mid \text{door 2 open after door 1 chosen}) &= \frac{\frac{1}{2} \times \frac{1}{3}}{(\frac{1}{2} \times \frac{1}{3}) + (0 \times \frac{1}{3}) + (1 \times \frac{1}{3})} = \frac{1}{3}, \tag{3.62a} \\
P(\text{car behind 3} \mid \text{door 2 open after door 1 chosen}) &= \frac{1 \times \frac{1}{3}}{(\frac{1}{2} \times \frac{1}{3}) + (0 \times \frac{1}{3}) + (1 \times \frac{1}{3})} = \frac{2}{3}. \tag{3.62b}
\end{align}
The results in (3.62) suggest the contestant has a higher probability of winning the car by switching doors and choosing door 3. The same logic suggests that one should always switch doors regardless of which door was originally chosen.

Problem 3.23. Simple considerations
Make a table showing the three possible arrangements of the car and explain in simple terms why switching doubles the chances of winning.
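The posterior probabilities in (3.62) can also be estimated by simulation. The following is a minimal sketch, not part of the original text: the function name `simulate_monty_hall`, the number of trials, and the random seed are illustrative choices. It assumes the model used in Example 3.15, namely that the contestant always chooses door 1 and that the host knows where the car is and chooses at random when two doors are available.

```python
import random

def simulate_monty_hall(trials=200_000, seed=0):
    """Estimate P(car behind 1 | door 2 opened) and P(car behind 3 | door 2 opened).

    Assumptions (illustrative): the contestant always picks door 1, and the
    host never opens the contestant's door or the door hiding the car.
    """
    rng = random.Random(seed)
    opened_2 = 0   # games in which the host opened door 2
    car_1 = 0      # ... and the car was behind door 1
    car_3 = 0      # ... and the car was behind door 3
    for _ in range(trials):
        car = rng.choice((1, 2, 3))
        # Doors the host may open: not door 1 (chosen) and not the car.
        allowed = [door for door in (2, 3) if door != car]
        opened = rng.choice(allowed)
        if opened == 2:
            opened_2 += 1
            car_1 += (car == 1)
            car_3 += (car == 3)
    return car_1 / opened_2, car_3 / opened_2

p_stay, p_switch = simulate_monty_hall()
print(f"P(car behind 1 | door 2 open) ~ {p_stay:.3f}")
print(f"P(car behind 3 | door 2 open) ~ {p_switch:.3f}")
```

Only the games in which door 2 is opened are counted, so the two estimates correspond to the conditional probabilities in (3.62a) and (3.62b) rather than to the overall fraction of wins.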
5 This question was posed on the TV game show, "Let's Make A Deal," hosted by Monty Hall. See, for example, <www.letsmakeadeal.com/problem.htm> and <www.nytimes.com/2008/04/08/science/08monty.html>.

Problem 3.24. What does the host know?
The point of Bayesian statistics is that it approaches a given data set with a particular model in mind. In the Let's Make A Deal problem the model we have used is that the host knows where the car is.
(a) Suppose that the host doesn't know where the car is, but chooses door 2 at random and there is no car. What is the probability that the car is behind door 1?
(b) Is the probability that you found in part (a) the same as found in Example 3.15? Discuss why the probability that the car is behind door 1 depends on what the host knows.

Example 3.16. Bayes' theorem and the problem of false positives
Even though you have no symptoms, your doctor wishes to test you for a rare disease that only 1 in 10,000 people of your age contract. The test is 98% accurate, which means that, if you have the disease, 98% of the time the test will come out positive and 2% of the time negative. Also, if you do not have the disease, the test will come out negative 98% of the time and positive 2% of the time. You take the test and it comes out positive. What is the probability that you have the disease? Answer the question before you read the solution, which uses Bayes' theorem.

Solution. Let $P(\text{D})$ represent the probability of having the disease given no other information except the age of the person. In this example $P(\text{D}) = 1/10000 = 0.0001$. The probability of not having the disease is $P(\text{N}) = 1 - P(\text{D}) = 0.9999$. Let $P(+|\text{D}) = 0.98$ represent the probability of testing positive given that you have the disease, and $P(+|\text{N}) = 0.02$ represent the probability of testing positive given that you do not have the disease. We wish to find the probability $P(\text{D}|+)$ that you actually have the disease given that you tested positive.
From Bayes' theorem we have
\begin{align}
P(\text{D}|+) &= \frac{P(+|\text{D})P(\text{D})}{P(+|\text{D})P(\text{D}) + P(+|\text{N})P(\text{N})} \tag{3.63a} \\
&= \frac{(0.98)(0.0001)}{(0.98)(0.0001) + (0.02)(0.9999)} \tag{3.63b} \\
&\approx 0.0049 = 0.49\%. \tag{3.63c}
\end{align}
We expect that you will find this result difficult to accept. How can it be that the probability of having the disease is so small given the high reliability of the test? This example, and others like it, shows our lack of intuition about many statistical problems. ♦

Problem 3.25. Testing accuracy
Suppose that a person tests positive for a disease that occurs in 1 in 100, 1 in 1000, 1 in 10,000, or 1 in 100,000 people. Determine in each case how accurate the test must be for the test to give a probability equal to at least 50% of actually having the disease.

Because of the problem of false positives, some tests might actually reduce your life span and thus are not recommended. Suppose that a certain type of cancer occurs in 1 in 1000 people who are less than 50 years old. The death rate from this cancer is 25% in 10 years. The probability of having cancer if the test is positive is 1 in 20. Because people who test positive become worried, 90% of the patients who test positive have surgery to remove the cancer. As a result of the surgery, 2% die due to complications, and the rest are cured. We have that
\begin{align}
P(\text{death rate due to cancer}) &= P(\text{death}|\text{cancer})P(\text{cancer}) \tag{3.64a} \\
&= 0.25 \times 0.001 = 0.00025, \tag{3.64b} \\
P(\text{death due to test}) &= P(\text{die}|\text{surgery})P(\text{surgery}|\text{positive})P(\text{positive}|\text{cancer}) \tag{3.64c} \\
&= 0.02 \times 0.90 \times 0.02 = 0.00036. \tag{3.64d}
\end{align}
Hence, the probability of dying from the surgery is greater than dying from the cancer.

Problem 3.26. Three balls in a sack
Imagine that you have a sack of three balls that can be either red or green. There are four hypotheses for the distribution of colors for the balls: (1) all are red, (2) two are red, (3) one is red, and (4) all are green. Initially, you have no information about which hypothesis is correct, and thus you assume that they are equally probable. Suppose that you pick one ball out of the sack and it is green. Use Bayes' theorem to determine the new probabilities for each hypothesis.

We have emphasized that the definition of probability as a frequency is inadequate. If you are interested in learning more about Bayesian inference, see in particular the paper by D'Agostini.

3.5 Bernoulli Processes and the Binomial Distribution

Because most physicists spend little time gambling,$^6$ we will have to develop our intuitive understanding of probability in other ways. Our strategy will be to first consider some physical systems for which we can calculate the probability distribution by analytical methods. Then we will use the computer to generate more data to analyze.

Noninteracting magnetic moments. Consider a system of $N$ noninteracting magnetic dipoles each having a magnetic moment $\mu$ and associated spin in an external magnetic field $B$. The field $B$ is in the up ($+z$) direction. According to quantum mechanics the component of the magnetic dipole moment along a given axis is limited to certain discrete values. Spin 1/2 implies that a magnetic dipole can either point up (parallel to $B$) or down (antiparallel to $B$). We will use the word spin as a shorthand for magnetic dipole. The energy of interaction of a spin with the magnetic field is $E = -\mu B$ if the spin is up and $+\mu B$ if the spin is down (see Figure 3.2). More generally, we can write $E = -s\mu B$, where $s = +1$ (spin up) or $s = -1$ (spin down). As discussed in Section 1.9, page 22, this model is a simplification of a more realistic magnetic system.

We will take $p$ to be the probability that the spin (magnetic moment) is up and $q$ the probability that the spin is down. Because there are no other possible outcomes, we have $p + q = 1$ or $q = 1 - p$. If $B = 0$, there is no preferred spatial direction and $p = q = 1/2$. For $B \neq 0$ we do not yet know how to calculate $p$ and for now we will assume that $p$ is given.
In Section 4.8 we will learn how to calculate p and q when the system is in equilibrium at temperature T .
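The two-state model just described is easy to sample on a computer. The following is a minimal sketch under stated assumptions: the function names `sample_spins` and `spin_energy`, the value $p = 0.7$, the choice $\mu = B = 1$, and the random seed are all illustrative and are not taken from the text.

```python
import random

def sample_spins(N, p, rng):
    """Return N spin values: s = +1 (up) with probability p, s = -1 (down) with probability q = 1 - p."""
    return [1 if rng.random() < p else -1 for _ in range(N)]

def spin_energy(s, mu=1.0, B=1.0):
    """Interaction energy E = -s * mu * B of a single spin in the field B."""
    return -s * mu * B

rng = random.Random(1)                  # fixed seed so that the run is reproducible
spins = sample_spins(10_000, 0.7, rng)  # p = 0.7 is an assumed value
fraction_up = spins.count(1) / len(spins)
total_energy = sum(spin_energy(s) for s in spins)
print(f"fraction of up spins: {fraction_up:.3f}")
print(f"total energy (mu = B = 1): {total_energy:.1f}")
```

For a large number of spins the fraction of up spins should be close to the assumed value of $p$, which is the sense in which $p$ is estimated by a frequency.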
6 After a Las Vegas hotel hosted a meeting of the American Physical Society (APS) in March 1986, the APS was asked never to return. The Las Vegas newspaper headline read, "Meeting of physicists in town, lowest casino take ever."

Figure 3.2: The energy of a spin 1/2 magnetic dipole (energy levels at $+\mu B$ and $-\mu B$). Note that the state of lowest energy is negative.

We associate with each spin a random variable $s_i$ which has the values $\pm 1$ with probability $p$ and $q$, respectively. One of the quantities of interest is the magnetization $M$, which is the net magnetic moment of the system. For a system of $N$ spins the magnetization is given by
$$M = \mu(s_1 + s_2 + \ldots + s_N) = \mu \sum_{i=1}^{N} s_i. \tag{3.65}$$
In the following, we will take $\mu = 1$ for convenience whenever it will not cause confusion. We will first calculate the mean value of $M$, then its variance, and finally the probability distribution $P(M)$ that the system has magnetization $M$.

To calculate the mean value of $M$, we need to take the mean values of both sides of (3.65). We interchange the sum and the average [see (3.15c)] and write
$$\overline{M} = \overline{\sum_{i=1}^{N} s_i} = \sum_{i=1}^{N} \overline{s_i}. \tag{3.66}$$
Because the probability that any spin has the value $\pm 1$ is the same for each spin, the mean value of each spin is the same, that is, $\overline{s_1} = \overline{s_2} = \ldots = \overline{s_N} \equiv \overline{s}$. Therefore the sum in (3.66) consists of $N$ equal terms and can be written as
$$\overline{M} = N\overline{s}. \tag{3.67}$$
The meaning of (3.67) is that the mean magnetization is $N$ times the mean magnetization of a single spin. Because $\overline{s} = (1 \times p) + (-1 \times q) = p - q$, we have that
$$\overline{M} = N(p - q). \tag{3.68}$$
To calculate the variance of $M$, that is, $\overline{(M - \overline{M})^2}$, we write
$$\Delta M = M - \overline{M} = \sum_{i=1}^{N} \Delta s_i, \tag{3.69}$$
where
$$\Delta s_i \equiv s_i - \overline{s}. \tag{3.70}$$
As an example, let us calculate $\overline{(\Delta M)^2}$ for $N = 3$ spins. In this case $(\Delta M)^2$ is given by
\begin{align}
(\Delta M)^2 &= (\Delta s_1 + \Delta s_2 + \Delta s_3)(\Delta s_1 + \Delta s_2 + \Delta s_3) \tag{3.71a} \\
&= \big[(\Delta s_1)^2 + (\Delta s_2)^2 + (\Delta s_3)^2\big] + 2\big[\Delta s_1 \Delta s_2 + \Delta s_1 \Delta s_3 + \Delta s_2 \Delta s_3\big]. \tag{3.71b}
\end{align}
We take the mean value of (3.71b), interchange the order of the sums and averages, and write
$$\overline{(\Delta M)^2} = \big[\overline{(\Delta s_1)^2} + \overline{(\Delta s_2)^2} + \overline{(\Delta s_3)^2}\big] + 2\big[\overline{\Delta s_1 \Delta s_2} + \overline{\Delta s_1 \Delta s_3} + \overline{\Delta s_2 \Delta s_3}\big]. \tag{3.72}$$
The first term in brackets on the right of (3.72) represents the three terms in the sum that are multiplied by themselves. The second term represents all the cross terms arising from different terms in the sum, that is, the products in the second sum refer to different spins. Because different spins are statistically independent (the spins do not interact), we have that
$$\overline{\Delta s_i \Delta s_j} = \overline{\Delta s_i}\;\overline{\Delta s_j} = 0 \quad (i \neq j), \tag{3.73}$$
because $\overline{\Delta s_i} = 0$. That is, each cross term vanishes on the average. Hence (3.72) reduces to a sum of squared terms
$$\overline{(\Delta M)^2} = \overline{(\Delta s_1)^2} + \overline{(\Delta s_2)^2} + \overline{(\Delta s_3)^2}. \tag{3.74}$$
Because each spin is equivalent on the average, each term in (3.74) is equal. Hence, we obtain the desired result
$$\overline{(\Delta M)^2} = 3\,\overline{(\Delta s)^2}. \tag{3.75}$$
The variance of $M$ is 3 times the variance of a single spin, that is, the variance is additive.

We now evaluate $\overline{(\Delta M)^2}$ by finding an explicit expression for $\overline{(\Delta s)^2}$. We have $\overline{s^2} = [1^2 \times p] + [(-1)^2 \times q] = p + q = 1$. Hence,
\begin{align}
\overline{(\Delta s)^2} &= \overline{s^2} - \overline{s}^2 = 1 - (p - q)^2 = 1 - (2p - 1)^2 \tag{3.76a} \\
&= 1 - 4p^2 + 4p - 1 = 4p(1 - p) = 4pq, \tag{3.76b}
\end{align}
and our desired result for $\overline{(\Delta M)^2}$ is
$$\overline{(\Delta M)^2} = 3(4pq). \tag{3.77}$$

Problem 3.27. Variance of N spins
In the text we showed that $\overline{(\Delta M)^2} = 3\,\overline{(\Delta s)^2}$ for $N = 3$ spins [see (3.75) and (3.77)]. Use similar considerations for $N$ noninteracting spins to show that
$$\overline{(\Delta M)^2} = N(4pq). \tag{3.78}$$

Figure 3.3: An ensemble of $N = 3$ spins. The arrow indicates the direction of a spin. The probability of each member of the ensemble is shown ($p^3$; $p^2q$ for each of three members; $pq^2$ for each of three members; $q^3$).

Because of the simplicity of a system of noninteracting spins, we can calculate the probability distribution itself and not just the first few moments. As an example, let us consider the statistical properties of a system of $N = 3$ noninteracting spins. Because each spin can be in one of two states, there are $2^{N=3} = 8$ distinct outcomes (see Figure 3.3). Because each spin is independent of the other spins, we can use the multiplication rule (3.5) to calculate the probabilities of each outcome as shown in Figure 3.3. Although each outcome is distinct, several of the microstates have the same number of up spins. The main quantity of interest is the probability $P_N(n)$ that $n$ spins point up out of a total of $N$ spins. For example, for $N = 3$ spins, there are three states with $n = 2$, each with probability $p^2q$, so the probability that two spins are up is equal to $3p^2q$. From Figure 3.3 we see that
\begin{align}
P_3(n = 3) &= p^3, \tag{3.79a} \\
P_3(n = 2) &= 3p^2q, \tag{3.79b} \\
P_3(n = 1) &= 3pq^2, \tag{3.79c} \\
P_3(n = 0) &= q^3. \tag{3.79d}
\end{align}

Example 3.17. Find the first two moments of $P_3(n)$.

Solution. The first moment $\overline{n}$ of the distribution is given by
\begin{align}
\overline{n} &= (0 \times q^3) + (1 \times 3pq^2) + (2 \times 3p^2q) + (3 \times p^3) \tag{3.80a} \\
&= 3p\,(q^2 + 2pq + p^2) = 3p\,(q + p)^2 = 3p. \tag{3.80b}
\end{align}
Similarly, the second moment $\overline{n^2}$ of the distribution is given by
\begin{align}
\overline{n^2} &= (0 \times q^3) + (1 \times 3pq^2) + (4 \times 3p^2q) + (9 \times p^3) \tag{3.81a} \\
&= 3p\,(q^2 + 4pq + 3p^2) = 3p\,(q + 3p)(q + p) \tag{3.81b} \\
&= 3p\,(q + 3p) = (3p)^2 + 3pq. \tag{3.81c}
\end{align}
Hence
$$\overline{(n - \overline{n})^2} = \overline{n^2} - \overline{n}^2 = 3pq. \tag{3.82}$$
♦

The mean magnetization $\overline{M}$ or the mean number of up spins minus the mean number of down spins is given by $\overline{M} = \overline{[n - (3 - n)]} = 2\overline{n} - 3 = 6p - 3$, or $\overline{M} = 3(2p - 1) = 3(p - q)$ in agreement with (3.68).

Problem 3.28. Coin flips
The outcome of $N$ coins is identical to $N$ noninteracting spins, if we associate the number of coins with $N$, the number of heads with $n$, and the number of tails with $N - n$. For a fair coin the probability $p$ of a head is $p = 1/2$ and the probability of a tail is $q = 1 - p = 1/2$. What is the probability that in three tosses of a coin, there will be two heads?

Problem 3.29. One-dimensional random walk
If a drunkard begins at a lamp post and takes $N$ steps of equal length in random directions, how far will the drunkard be from the lamp post?$^7$ We will consider an idealized example of a random walk for which the steps of the walker are restricted to a line (a one-dimensional random walk). Each step is of equal length $a$, and at each interval of time, the walker takes either a step to the right with probability $p$ or a step to the left with probability $q = 1 - p$. The direction of each step is independent of the preceding one. Let $n$ be the number of steps to the right, and $n'$ the number of steps to the left. The total number of steps is $N = n + n'$. What is the probability that a random walker in one dimension has taken three steps to the right out of four steps?

From the above examples and problems, we see that the probability distributions of noninteracting spins, the flip of a coin, and a simple one-dimensional random walk are identical. These examples have two characteristics in common. First, in each trial there are only two outcomes, for example, up or down, heads or tails, and right or left.
Second, the result of each trial is independent of all previous trials, for example, the drunken sailor has no memory of his or her previous steps. This type of process is called a Bernoulli process (after the mathematician Jacob Bernoulli, 1654–1705).

Because of the importance of magnetic systems, we will cast our discussion of Bernoulli processes in terms of noninteracting spins with spin 1/2. The main quantity of interest is the probability $P_N(n)$ which we now calculate for arbitrary $N$ and $n$. We know that a particular outcome with $n$ up spins and $n'$ down spins occurs with probability $p^n q^{n'}$. We write the probability $P_N(n)$ as
$$P_N(n) = W_N(n, n')\, p^n q^{n'}, \tag{3.83}$$
where $n' = N - n$ and $W_N(n, n')$ is the number of distinct microstates of $N$ spins with $n$ up spins and $n'$ down spins. From our discussion of $N = 3$ noninteracting spins, we already know the first several values of $W_N(n, n')$.

We can determine the general form of $W_N(n, n')$ by obtaining a recursion relation between $W_N$ and $W_{N-1}$. A total of $n$ up spins and $n'$ down spins out of $N$ total spins can be found by adding one spin to $N - 1$ spins. The additional spin is either (1) up if there are already $(n - 1)$ up spins and $n'$ down spins, or (2) down if there are already $n$ up spins and $(n' - 1)$ down spins.
7 The history of the random walk problem is discussed by Montroll and Shlesinger (1984).

Figure 3.4: The values of the first few coefficients $W_N(n, n')$ (rows $N = 0$ to $N = 4$: 1; 1 1; 1 2 1; 1 3 3 1; 1 4 6 4 1). Each number is the sum of the two numbers to the left and right above it. This construction is called a Pascal triangle.

Because there are $W_{N-1}(n - 1, n')$ ways of reaching the first case and $W_{N-1}(n, n' - 1)$ ways of reaching the second case, we obtain the recursion relation
$$W_N(n, n') = W_{N-1}(n - 1, n') + W_{N-1}(n, n' - 1). \tag{3.84}$$
If we begin with the known values $W_0(0, 0) = 1$, $W_1(1, 0) = W_1(0, 1) = 1$, we can use the recursion relation (3.84) to construct $W_N(n, n')$ for any desired $N$. For example,
\begin{align}
W_2(2, 0) &= W_1(1, 0) + W_1(2, -1) = 1 + 0 = 1, \tag{3.85a} \\
W_2(1, 1) &= W_1(0, 1) + W_1(1, 0) = 1 + 1 = 2, \tag{3.85b} \\
W_2(0, 2) &= W_1(-1, 2) + W_1(0, 1) = 0 + 1 = 1. \tag{3.85c}
\end{align}
In Figure 3.4 we show that $W_N(n, n')$ forms a pyramid or (Pascal) triangle.

It is straightforward to show by induction that the expression
$$W_N(n, n') = \frac{N!}{n!\, n'!} = \frac{N!}{n!\,(N - n)!} \tag{3.86}$$
satisfies the relation (3.84). Note the convention $0! = 1$. We can combine (3.83) and (3.86) to find the desired result
$$P_N(n) = \frac{N!}{n!\,(N - n)!}\, p^n q^{N-n} \quad \text{(binomial distribution)}. \tag{3.87}$$
The form (3.87) is the binomial distribution. Note that for $p = q = 1/2$, $P_N(n)$ reduces to
$$P_N(n) = \frac{N!}{n!\,(N - n)!}\, 2^{-N}. \tag{3.88}$$
The probability $P_N(n)$ is shown in Figure 3.5 for $N = 64$.
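The recursion relation (3.84) and the closed form (3.86) are easy to compare numerically. The following is a minimal sketch, not taken from the text: the function names `pascal_W` and `binomial_P` are illustrative, and entries of $W_{N-1}$ that lie outside the triangle are assumed to be zero, as in (3.85).

```python
from math import factorial

def pascal_W(N):
    """Return [W_N(n, N - n) for n = 0..N], built from the recursion (3.84)."""
    row = [1]  # W_0(0, 0) = 1
    for _ in range(N):
        # W_N(n, n') = W_{N-1}(n - 1, n') + W_{N-1}(n, n' - 1);
        # entries outside the triangle count as zero.
        row = [(row[n - 1] if n > 0 else 0) + (row[n] if n < len(row) else 0)
               for n in range(len(row) + 1)]
    return row

def binomial_P(N, n, p):
    """P_N(n) from (3.87), with q = 1 - p."""
    W = factorial(N) // (factorial(n) * factorial(N - n))  # closed form (3.86)
    return W * p**n * (1 - p)**(N - n)

print(pascal_W(4))  # last row shown in Figure 3.4
total = sum(binomial_P(64, n, 0.5) for n in range(65))
print(f"sum of P_64(n) over n: {total:.12f}")
```

Building each row from the previous one mirrors the construction of the Pascal triangle, whereas `binomial_P` uses the factorial expression directly; the two should agree for every $N$ and $n$.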
Figure 3.5: The binomial distribution $P_{64}(n)$ for $p = q = 1/2$ (horizontal axis: $n$ from 0 to 64; vertical axis: $P_{64}(n)$ from 0.00 to 0.10). What is your visual estimate of the width of the distribution?

Problem 3.30. Binomial distribution
(a) Calculate the probability $P_N(n)$ that $n$ spins are up out of a total of $N$ for $N = 4$ and $N = 16$ and put your results in a table. Calculate the mean values of $n$ and $n^2$ using your tabulated values of $P_N(n)$. It is possible to do the calculation for general $p$ and $q$, but choose $p = q = 1/2$ for simplicity. Although it is better to first do the calculation of $P_N(n)$ by hand, you can use Program Binomial.
(b) Use Program Binomial to plot $P_N(n)$ for larger values of $N$. Assume that $p = q = 1/2$. Determine the value of $n$ corresponding to the maximum of the probability and visually estimate the width for each value of $N$. What is your measure of the width? One measure is the value of $n$ at which $P_N(n)$ is equal to half its value at its maximum. What is the qualitative dependence of the width on $N$? Also compare the relative heights of the maximum of $P_N$ for increasing values of $N$.
(c) Program Binomial also plots $P_N(n)$ versus $n/\overline{n}$. Does the width of $P_N(n)$ appear to become larger or smaller as $N$ is increased?
(d) Plot $\ln P_N(n)$ versus $n$ for $N = 16$. (Choose Log Axes under the Views menu.) Describe the qualitative dependence of $\ln P_N(n)$ on $n$. Can $\ln P_N(n)$ be fitted to a parabola of the form $A + B(n - \overline{n})^2$, where $A$ and $B$ are fit parameters?

Problem 3.31. Asymmetrical distribution
(a) Plot $P_N(n)$ versus $n$ for $N = 16$ and $p = 2/3$. For what value of $n$ is $P_N(n)$ a maximum? How does the width of the distribution compare to what you found in Problem 3.30?
(b) For what values of $p$ and $q$ do you think the width is a maximum for a given $N$?

Example 3.18. Show that the expression (3.87) for $P_N(n)$ satisfies the normalization condition (3.2).

Solution.
The reason that (3.87) is called the binomial distribution is that its form represents a typical term in the expansion of $(p + q)^N$. By the binomial theorem we have
\[ (p + q)^N = \sum_{n=0}^{N} \frac{N!}{n!\,(N-n)!}\, p^n q^{N-n}. \tag{3.89} \]
We use (3.87) and write
\[ \sum_{n=0}^{N} P_N(n) = \sum_{n=0}^{N} \frac{N!}{n!\,(N-n)!}\, p^n q^{N-n} = (p + q)^N = 1^N = 1, \tag{3.90} \]
where we have used (3.89) and the fact that $p + q = 1$. ♦

Mean value. We now find an analytical expression for the dependence of the mean number of up spins $\overline{n}$ on $N$ and $p$. From the definition (3.13) and (3.87) we have
\[ \overline{n} = \sum_{n=0}^{N} n\, P_N(n) = \sum_{n=0}^{N} n\, \frac{N!}{n!\,(N-n)!}\, p^n q^{N-n}. \tag{3.91} \]
We evaluate the sum in (3.91) by using a technique that is useful in a variety of contexts.8 The technique is based on the fact that
\[ p \frac{d}{dp}\, p^n = n p^n. \tag{3.92} \]
We use (3.92) to rewrite (3.91) as
\[ \overline{n} = \sum_{n=0}^{N} \frac{N!}{n!\,(N-n)!} \Big( p \frac{\partial}{\partial p}\, p^n \Big) q^{N-n}. \tag{3.93} \]
We have used a partial derivative in (3.93) to remind us that the derivative operator does not act on $q$, which we have temporarily assumed to be an independent variable. We interchange the order of summation and differentiation in (3.93) and write
\begin{align}
\overline{n} &= p \frac{\partial}{\partial p} \Big[ \sum_{n=0}^{N} \frac{N!}{n!\,(N-n)!}\, p^n q^{N-n} \Big] \tag{3.94a} \\
&= p \frac{\partial}{\partial p}\, (p + q)^N. \tag{3.94b}
\end{align}
Because the operator acts only on $p$, we have
\[ \overline{n} = pN(p + q)^{N-1}. \tag{3.95} \]

8 The integral $\int_0^\infty x^n e^{-ax^2}\,dx$ for $a > 0$ is evaluated in the Appendix using a similar technique.

The result (3.95) is valid for arbitrary $p$ and $q$, and hence it is applicable for $p + q = 1$. Thus our desired result is
\[ \overline{n} = pN. \tag{3.96} \]
The nature of the dependence of $\overline{n}$ on $N$ and $p$ should be intuitively clear. Compare the general result (3.96) to the result (3.80b) for $N = 3$. What is the dependence of $\overline{n'}$ on $N$ and $p$?

Relative fluctuations. To determine $\overline{(\Delta n)^2}$ we need to know $\overline{n^2}$ [see the relation (3.21)]. The average value of $n^2$ can be calculated in a manner similar to that for $\overline{n}$. We write
\begin{align}
\overline{n^2} &= \sum_{n=0}^{N} n^2\, \frac{N!}{n!\,(N-n)!}\, p^n q^{N-n} \tag{3.97a} \\
&= \sum_{n=0}^{N} \frac{N!}{n!\,(N-n)!} \Big( p \frac{\partial}{\partial p} \Big)^2 p^n q^{N-n} \tag{3.97b} \\
&= \Big( p \frac{\partial}{\partial p} \Big)^2 \sum_{n=0}^{N} \frac{N!}{n!\,(N-n)!}\, p^n q^{N-n} = \Big( p \frac{\partial}{\partial p} \Big)^2 (p + q)^N \tag{3.97c} \\
&= p \frac{\partial}{\partial p} \big[ pN(p + q)^{N-1} \big] \tag{3.97d} \\
&= p \big[ N(p + q)^{N-1} + pN(N-1)(p + q)^{N-2} \big]. \tag{3.97e}
\end{align}
Because we are interested in the case $p + q = 1$, we have
\begin{align}
\overline{n^2} &= p\,[N + pN(N-1)] \tag{3.98a} \\
&= p\,[pN^2 + N(1-p)] = (pN)^2 + p(1-p)N \tag{3.98b} \\
&= \overline{n}^2 + pqN, \tag{3.98c}
\end{align}
where we have used (3.96) and substituted $q = 1 - p$. Hence, from (3.98c) we find that the variance of $n$ is given by
\[ \sigma_n^2 = \overline{(\Delta n)^2} = \overline{n^2} - \overline{n}^2 = pqN. \tag{3.99} \]
The relative width of the probability distribution of $n$ is given by (3.96) and (3.99):
\[ \frac{\sigma_n}{\overline{n}} = \frac{\sqrt{pqN}}{pN} = \Big( \frac{q}{p} \Big)^{1/2} \frac{1}{\sqrt{N}}. \tag{3.100} \]
We see that the relative width goes to zero as $1/\sqrt{N}$.

Problem 3.32. Width of the binomial distribution
Compare the calculated values of $\sigma_n$ from (3.99) with your estimates in Problem 3.30 and to the exact result (3.82) for $N = 4$. Explain why $\sigma_n$ is a measure of the width of $P_N(n)$.

Frequently we need to evaluate $\ln N!$ for $N \gg 1$. An approximation for $\ln N!$ known as Stirling's approximation is9
\[ \ln N! \approx N \ln N - N + \frac{1}{2} \ln(2\pi N) \qquad \text{(Stirling's approximation)}. \tag{3.101} \]
In some contexts we can neglect the logarithmic term in (3.101) and use the weaker approximation
\[ \ln N! \approx N \ln N - N. \tag{3.102} \]
A derivation of Stirling's approximation is given in the Appendix.

9 It is more accurate to call it the de Moivre-Stirling approximation because de Moivre had already found that $n! \approx c\,\sqrt{n}\, n^n / e^n$ for some constant $c$. Stirling's contribution was to identify the constant $c$ as $\sqrt{2\pi}$.

Problem 3.33. Range of applicability of Stirling's approximation
(a) What is the largest value of $\ln N!$ that you can calculate exactly using your calculator?
(b) Compare the approximations (3.101) and (3.102) to each other and to the exact value of $\ln N!$ for $N = 5$, 10, 20, and 50. If necessary, compute $\ln N!$ using the relation
\[ \ln N! = \sum_{m=1}^{N} \ln m. \tag{3.103} \]
Put your results in a table. What is the percentage error of the two approximations for $N = 50$?
(c) Use Stirling's approximation to show that
\[ \frac{d}{dx} \ln x! = \ln x \qquad \text{for } x \gg 1. \tag{3.104} \]

Problem 3.34. Density fluctuations
A container of volume $V$ contains $N$ molecules of a gas. We assume that the gas is dilute so that the position of any one molecule is independent of all other molecules. Although the density is uniform on the average, there are fluctuations in the density. Divide the volume $V$ into two parts $V_1$ and $V_2$ with $V = V_1 + V_2$.
(a) What is the probability $p$ that a particular molecule is in the volume $V_1$?
(b) What is the probability that $N_1$ molecules are in $V_1$ and $N_2$ molecules are in $V_2$, where $N = N_1 + N_2$?
(c) What is the average number of molecules in each part?
(d) What are the relative fluctuations of the number of molecules in each part?

Problem 3.35. Random walk
Suppose that a random walker takes $n$ steps to the right and $n'$ steps to the left for a total of $N$ steps. Each step is of equal length $a$ and the probability of a step to the right is $p$. Denote $x$ as the net displacement of a walker after $N$ steps. What is the mean value $\overline{x}$ for an $N$-step random walk? What is the $N$ dependence of the variance $\overline{(\Delta x)^2}$?

Problem 3.36. Monte Carlo simulation of a one-dimensional random walk
Program RandomWalk1D simulates a random walk in one dimension. A walker starts at the origin and takes $N$ steps. At each step the walker goes to the right with probability $p$ or to the left with probability $q = 1 - p$. Each step is the same length and independent of the previous steps. What is the displacement of the walker after $N$ steps? Are some displacements more likely than others?

We can simulate an $N$-step walk by the following pseudocode:

    do istep = 1, N
       if (rnd <= p) then
          x = x + 1
       else
          x = x - 1
       end if
    end do

The function rnd generates a random number between zero and one.
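The pseudocode above can be sketched in Python as follows. This is only an illustration and is not Program RandomWalk1D: the function names `walk` and `displacement_histogram`, the fixed seed, and the choice of 10,000 trials are assumptions made for this sketch. The call `rng.random()` plays the role of rnd, and the walk is repeated for many independent trials so that the relative frequency of each displacement can be tabulated.

```python
import random
from collections import Counter


def walk(n_steps, p, rng):
    """Return the net displacement after n_steps steps of unit length."""
    x = 0
    for _ in range(n_steps):
        # rng.random() plays the role of rnd: a uniform number in [0, 1)
        if rng.random() <= p:
            x += 1
        else:
            x -= 1
    return x


def displacement_histogram(n_steps, p, n_trials, seed=0):
    """Count how often each displacement occurs in n_trials independent walks."""
    rng = random.Random(seed)
    return Counter(walk(n_steps, p, rng) for _ in range(n_trials))


if __name__ == "__main__":
    N, p, trials = 16, 0.5, 10000
    hist = displacement_histogram(N, p, trials)
    for x in sorted(hist):
        # relative frequency of displacement x among all trials
        print(x, hist[x] / trials)
```

Changing `trials` or `N` in the last block is one way to explore how the tabulated frequencies depend on the number of walkers and the number of steps.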
The quantity $x$ is the net displacement after $N$ steps assuming that the steps are of unit length. We average over many walkers (trials), where each trial consists of an $N$-step walk, and construct a histogram for the number of times that the displacement $x$ is found for a given number of walkers. The probability that the walker is a distance $x$ from the origin after $N$ steps is proportional to the corresponding value of the histogram. This procedure is called Monte Carlo sampling.10
(a) Is the value of $x$ for one trial of any interest? Why do we have to average over many trials?
(b) Will we obtain the exact result for the probability distribution by doing a Monte Carlo simulation?
(c) Describe the changes of the histogram for larger values of $N$ and $p = 1/2$.
(d) What is the most probable value of $x$ for $p = 1/2$ and $N = 16$ and $N = 32$? What is the approximate width of the distribution? Estimate the width visually. One way to do so is to determine the value of $x$ at which the value of the histogram is one-half of its maximum value. How does the width change as a function of $N$ for fixed $p$?
(e) Choose $N = 4$ and $p = 1/2$. How does the histogram change, if at all, as the number of walks increases for fixed $N$?

10 The name "Monte Carlo" was first used by Nicholas Metropolis and Stanislaw Ulam in "The Monte Carlo method," Journal of the American Statistical Association 44 (247), 335–341 (1949).

The binomial distribution for large N. In Problem 3.30 we found that the binomial distribution has a well-defined maximum and can be approximated by a smooth, continuous function for large $N$ even though only integer values of $n$ are possible. We now find the form of this $n$ dependence.

The first step is to realize that $P_N(n)$ for $N \gg 1$ is a rapidly varying function of $n$ near the maximum of $P_N(n)$ at $n = pN$. For this reason we do not want to approximate $P_N(n)$ directly. Because the logarithm of $P_N(n)$ is a slowly varying function (see Problem 3.30), we expect that the Taylor series expansion of $\ln P_N(n)$ will converge. Hence, we expand $\ln P_N(n)$ in a Taylor series about the value $n = \tilde{n}$ at which $\ln P_N(n)$ reaches its maximum value. We have
\[ \ln P_N(n) = \ln P_N(n = \tilde{n}) + (n - \tilde{n}) \frac{d \ln P_N(n)}{dn} \bigg|_{n=\tilde{n}} + \frac{1}{2} (n - \tilde{n})^2 \frac{d^2 \ln P_N(n)}{dn^2} \bigg|_{n=\tilde{n}} + \cdots. \tag{3.105} \]
Because the expansion (3.105) is about the maximum $n = \tilde{n}$, the first derivative $d \ln P_N(n)/dn|_{n=\tilde{n}}$ must be zero and the second derivative $d^2 \ln P_N(n)/dn^2|_{n=\tilde{n}}$ must be negative. We assume that the higher terms in (3.105) can be neglected (see Problem 3.76) and adopt the notation
\[ \ln A = \ln P_N(n = \tilde{n}) \tag{3.106} \]
and
\[ B = -\frac{d^2 \ln P_N(n)}{dn^2} \bigg|_{n=\tilde{n}}. \tag{3.107} \]
The approximation (3.105) and the notation in (3.106) and (3.107) allow us to write
\[ \ln P_N(n) \approx \ln A - \frac{1}{2} B (n - \tilde{n})^2, \tag{3.108} \]
or
\[ P_N(n) \approx A\, e^{-\frac{1}{2} B (n - \tilde{n})^2}. \tag{3.109} \]
We next use Stirling's approximation (3.101) to evaluate the first two derivatives of $\ln P_N(n)$ to find the parameters $B$ and $\tilde{n}$. We first take the logarithm of both sides of (3.87) and obtain
\[ \ln P_N(n) = \ln N! - \ln n! - \ln (N-n)! + n \ln p + (N-n) \ln q. \tag{3.110} \]
It is straightforward to use the approximation (3.104) to obtain
\[ \frac{d(\ln P_N(n))}{dn} = -\ln n + \ln(N-n) + \ln p - \ln q. \tag{3.111} \]
The most probable value of $n$ is found by finding the value of $n$ that satisfies the condition $d \ln P_N(n)/dn = 0$. We find
\[ \frac{N - \tilde{n}}{\tilde{n}} = \frac{q}{p}, \tag{3.112} \]
or $(N - \tilde{n})p = \tilde{n}q$. The relation $p + q = 1$ allows us to write
\[ \tilde{n} = pN, \tag{3.113} \]
as expected. Note that $\tilde{n} = \overline{n}$, that is, the value of $n$ for which $P_N(n)$ is a maximum is also the mean value of $n$.

The second derivative can be found from (3.111). We have
\[ \frac{d^2(\ln P_N(n))}{dn^2} = -\frac{1}{n} - \frac{1}{N-n}. \tag{3.114} \]
Hence, the coefficient $B$ defined in (3.107) is given by
\[ B = -\frac{d^2 \ln P_N(n)}{dn^2} \bigg|_{n=\tilde{n}} = \frac{1}{\tilde{n}} + \frac{1}{N - \tilde{n}} = \frac{1}{Npq}. \tag{3.115} \]
From the relation (3.99) we see that
\[ B = \frac{1}{\sigma^2}, \tag{3.116} \]
where $\sigma^2$ is the variance of $n$. In Problem 3.37 you will be asked to show that the coefficient $A$ defined in (3.106) can be approximated for large $N$ as
\[ A = \frac{1}{(2\pi Npq)^{1/2}} = \frac{1}{(2\pi\sigma^2)^{1/2}}. \tag{3.117} \]
We thus find the form of the Gaussian probability distribution
\[ P_N(n) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(n - \overline{n})^2/2\sigma^2} \qquad \text{(Gaussian probability distribution)}. \tag{3.118} \]
An alternative derivation of the parameters $A$, $\tilde{n}$, and $B$ is given in Problem 3.72.

Problem 3.37. Calculation of the normalization constant
Derive the form of $A$ in (3.117) using Stirling's approximation (3.101). Note that the weaker form of Stirling's approximation in (3.102) yields the incorrect result that $\ln A = 0$.

    n    P10(n)      Gaussian approximation
    0    0.000977    0.001700
    1    0.009766    0.010285
    2    0.043945    0.041707
    3    0.117188    0.113372
    4    0.205078    0.206577
    5    0.246094    0.252313

Table 3.4: Comparison of the exact values of $P_{10}(n)$ with the Gaussian probability distribution (3.118) for $p = q = 1/2$.

From our derivation we see that (3.118) is valid for large values of $N$ and for values of $n$ near $\overline{n}$. The Gaussian approximation is a good approximation even for relatively small values of $N$ for most values of $n$. A comparison of the Gaussian approximation to the binomial distribution is given in Table 3.4. A discussion of the accuracy of the Gaussian approximation to the binomial distribution is given in Problem 3.76.

The most important feature of the Gaussian probability distribution is that its relative width, $\sigma_n/\overline{n}$, decreases as $N^{-1/2}$. The binomial distribution also shares this feature. The alternate derivation of the Gaussian probability distribution in Problem 3.72 shows why the binomial and Gaussian distributions have the same mean and variance.

Figure 3.6: The angle $\theta$ is an example of a continuous random variable.

3.6 Continuous Probability Distributions
In many cases of physical interest the random variables have continuous values. Examples of continuous variables are the positions of the holes left by darts thrown at a dart board, the position and velocity of a particle described by classical mechanics, and the angle of a compass needle.
As an example, consider a spinner, the equivalent of a wheel of fortune,11 with an arrow that spins around and stops at some angle at random (see Figure 3.6). In this case the variable $\theta$ is a continuous random variable that takes all values in the interval $[0, 2\pi]$. What is the probability that $\theta$ has a particular value? Because there are an infinite number of possible values of $\theta$ in the interval $[0, 2\pi]$, the probability of obtaining any particular value of $\theta$ is zero. Thus, we have to reformulate the question and ask for the probability that the value of $\theta$ is between $\theta$ and $\theta + \Delta\theta$. In other words, we have to ask for the probability that $\theta$ is in a particular angular range $\Delta\theta$ about $\theta$. For example, the probability that $\theta$ in Figure 3.6 is between 0 and $\pi$ is 1/2 and the probability that $\theta$ is between 0 and $\pi/2$ is 1/4.

Another example of a continuous random variable is the displacement from the origin of a one-dimensional random walker that steps at random to the right with probability $p$, but with a step length that is chosen at random between zero and the maximum step length $a$. The continuous nature of the step length means that the displacement $x$ of the walker is a continuous variable. If we perform a simulation of this random walk, we can record the number of times $H(x)$ that the displacement of the walker from the origin after $N$ steps is in a bin of width $\Delta x$ between $x$ and $x + \Delta x$. A plot of $H(x)$ as a function of $x$ for the bin width $\Delta x = 0.5$ is shown in Figure 3.7. The histogram $H(x)$ is proportional to the estimated probability that a walker lies in a bin of width $\Delta x$ a distance $x$ from the origin after $N$ steps. To obtain the probability, we divide $H(x)$ by the total number of walkers $N_w$. In practice, the choice of the bin width is a compromise. If $\Delta x$ is too big, the features of the histogram would be lost. If $\Delta x$ is too small, many of the bins would be empty for a given number of walkers, and our estimate of the number of walkers in each bin would be less accurate.
Because we expect the number of walkers in a particular bin to be proportional to the width of the bin, we may write p(x)∆x = H (x)/Nw . The quantity p(x) is called the probability density.
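The relation between the histogram and the probability density can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not Program RandomWalk1DContinuous: the names `variable_step_walk` and `density_estimate`, the seed, and the parameter values are hypothetical choices. Each walker takes steps of random length between 0 and a, the final displacements are sorted into bins of width dx, and the density in each bin is estimated as the bin count divided by the number of walkers times dx.

```python
import math
import random
from collections import Counter


def variable_step_walk(n_steps, p, a, rng):
    """Displacement after n_steps steps, each of random length in [0, a]."""
    x = 0.0
    for _ in range(n_steps):
        length = rng.uniform(0.0, a)
        # step to the right with probability p, otherwise to the left
        x += length if rng.random() < p else -length
    return x


def density_estimate(n_steps, p, a, n_walkers, dx, seed=0):
    """Map the left edge of each occupied bin to the estimated density there."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_walkers):
        x = variable_step_walk(n_steps, p, a, rng)
        counts[math.floor(x / dx)] += 1  # bin k covers [k*dx, (k+1)*dx)
    # p(x) dx = H(x)/N_w, so p(x) is estimated by H(x)/(N_w dx)
    return {k * dx: h / (n_walkers * dx) for k, h in counts.items()}


if __name__ == "__main__":
    dx = 0.5
    dens = density_estimate(n_steps=16, p=0.5, a=1.0, n_walkers=1000, dx=dx)
    for left in sorted(dens):
        print(f"{left:6.2f} {dens[left]:.4f}")
    # the estimated density times the bin width sums to 1 up to rounding
    print(sum(d * dx for d in dens.values()))
```

Rerunning the sketch with a much smaller or much larger `dx` shows the compromise described above: wide bins wash out the shape, while narrow bins leave many bins with few or no walkers.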
11 The Wheel of Fortune is an American television game. The name of the show comes from the large spinning wheel that determines the dollar amounts and prizes won by the contestants.

Figure 3.7: Histogram $H(x)$ of the number of times that the displacement of a one-dimensional random walker lies between $x$ and $x + \Delta x$ after $N = 16$ steps (see Problem 3.38). The length of each step is chosen with uniform random probability to be between zero and one. The bin width is $\Delta x = 0.5$. The data were generated with 1000 trials, a relatively small number. The results of this set of trials are the estimates $\overline{x} = -0.045$ and $\overline{x^2} = 4.95$.

In the limit $\Delta x \to 0$, $H(x)$ becomes a continuous function of $x$, and we can write the probability that the displacement $x$ of the walker is between $a$ and $b$ as (see Figure 3.8)
\[ P(a < x < b) = \int_a^b p(x)\, dx. \tag{3.119} \]
Note that the probability density $p(x)$ is nonnegative and has units of one over the dimension of length.

The formal properties of the probability density $p(x)$ can be generalized from the discrete case. For example, the normalization condition is given by
\[ \int_{-\infty}^{\infty} p(x)\, dx = 1. \tag{3.120} \]
The mean value of the function $f(x)$ is given by
\[ \overline{f} = \int_{-\infty}^{\infty} f(x)\, p(x)\, dx. \tag{3.121} \]

Problem 3.38. Simulation of a one-dimensional random walk with variable step length
Program RandomWalk1DContinuous simulates a random walk in one dimension with a variable step length.
Figure 3.8: The probability that $x$ is between $a$ and $b$ is equal to the shaded area.

(a) The step length is generated at random with a uniform probability between 0 and 1. Calculate the mean displacement and its variance for one step.
(b) Compare your analytical results from part (a) to the results of the simulation for $N = 1$.
(c) How does the variance of the displacement found in the simulation for $N = 16$ depend on the variance of the displacement for $N = 1$ that you calculated in part (a)?
(d) Explore how the histogram changes with the bin width. What is a reasonable choice of the bin width for $N = 100$?

Problem 3.39. Exponential probability density
The random variable $x$ has the probability density
\[ p(x) = \begin{cases} A\, e^{-\lambda x} & (0 \le x \le \infty), \\ 0 & (x < 0). \end{cases} \tag{3.122} \]
The exponential probability density plays an important role in statistical mechanics (see (4.79), page 200).
(a) Determine the normalization constant $A$ in terms of $\lambda$.
(b) What is the mean value of $x$? What is the most probable value of $x$?
(c) What is the mean value of $x^2$?
(d) Determine the probability for $\lambda = 1$ that a measurement of $x$ yields a value between 1 and 2.
(e) Determine the probability for $\lambda = 1$ that a measurement of $x$ yields a value less than 0.3.

Problem 3.40. Probability density for velocity
Consider the probability density function $p(v_x) = (a/\pi)^{3/2} e^{-a v_x^2}$ for the velocity of a particle in the $x$-direction. The probability densities for $v_y$ and $v_z$ have the same form. Each of the three velocity components can range from $-\infty$ to $+\infty$ and $a$ is a constant. This form of the probability density for the velocity will be derived in Section 6.2.2 for a classical system of particles at temperature $T$.
(a) Show that $p(\mathbf{v})$ is normalized. Use the fact that [see (A.15)]
\[ \int_0^\infty e^{-a u^2}\, du = \frac{1}{2} \sqrt{\frac{\pi}{a}}. \tag{3.123} \]
Note that this calculation involves doing three similar integrals that can be evaluated separately.
(b) What is the probability that a particle has a velocity between $v_x$ and $v_x + dv_x$, $v_y$ and $v_y + dv_y$, and $v_z$ and $v_z + dv_z$?
(c) What is the probability that $v_x \ge 0$, $v_y \ge 0$, $v_z \ge 0$ simultaneously?

Problem 3.41. Gaussian probability density
(a) Find the first four moments of the Gaussian probability density
\[ p(x) = (2\pi)^{-1/2}\, e^{-x^2/2} \qquad (-\infty < x < \infty). \tag{3.124} \]
(b) Calculate the value of $C_4$, the fourth-order cumulant, defined by
\[ C_4 = \overline{x^4} - 4\,\overline{x^3}\,\overline{x} - 3\,\overline{x^2}^{\,2} + 12\,\overline{x^2}\,\overline{x}^{\,2} - 6\,\overline{x}^{\,4}. \tag{3.125} \]

Problem 3.42. Uniform probability distribution
Consider the probability density given by
\[ p(x) = \begin{cases} (2a)^{-1} & (|x| \le a), \\ 0 & (|x| > a). \end{cases} \tag{3.126} \]
(a) Sketch the dependence of $p(x)$ on $x$.
(b) Find the first four moments of $p(x)$.
(c) Calculate the value of the fourth-order cumulant $C_4$ defined in (3.125) for the probability density in (3.126). Compare your result to the corresponding result for $C_4$ for the Gaussian distribution.

Problem 3.43. Other probability distributions
Not all probability densities have a finite variance as you will find in the following.
(a) Sketch the Lorentz or Cauchy distribution given by
\[ p(x) = \frac{1}{\pi} \frac{\gamma}{(x - a)^2 + \gamma^2} \qquad (-\infty < x < \infty). \tag{3.127} \]
Choose $a = 0$ and $\gamma = 1$ and compare the form of $p(x)$ in (3.127) to the Gaussian distribution given by (3.124).
(b) Calculate the first moment of the Lorentz distribution assuming that $a = 0$ and $\gamma = 1$.
(c) Does the second moment exist?

3.7 The Central Limit Theorem (or Why Thermodynamics Is Possible)
We have discussed how to estimate probabilities empirically by sampling, that is, by making repeated measurements of the outcome of independent random events. Intuitively we believe that if we perform more and more measurements, the calculated average will approach the exact mean of the quantity of interest. This idea is called the law of large numbers. However, we can go further and find the form of the probability distribution that a particular measurement differs from the exact mean. The form of this probability distribution is given by the central limit theorem. We first illustrate this theorem by considering a simple example.

Suppose that we wish to estimate the probability of obtaining the face with five dots in one throw of a die. The answer of 1/6 means that if we perform $N$ measurements, five will appear approximately $N/6$ times. What is the meaning of approximately? Let $S$ be the total number of times that a five appears in $N$ measurements. We write
\[ S = \sum_{i=1}^{N} s_i, \tag{3.128} \]
where
\[ s_i = \begin{cases} 1 & \text{if the $i$th throw gives a 5,} \\ 0 & \text{otherwise.} \end{cases} \tag{3.129} \]
The ratio $S/N$ approaches 1/6 for large $N$. How does this ratio approach the limit? We can empirically answer this question by repeating the measurement $M$ times. (Each measurement of $S$ consists of $N$ throws of a die.) Because $S$ itself is a random variable, we know that the measured values of $S$ will not be identical. In Figure 3.9 we show the results of $M = 10{,}000$ measurements of $S$ for $N = 100$ and $N = 800$. We see that the approximate form of the distribution of values of $S$ is a Gaussian. In Problem 3.44 we calculate the absolute and relative width of the distributions.

Problem 3.44. Analysis of Figure 3.9
(a) Estimate the absolute width and the relative width $\Delta S/\overline{S}$ of the distributions shown in Figure 3.9 for $N = 100$ and $N = 800$.
(b) Does the error of any one measurement of $S$ decrease with increasing $N$ as expected?
(c) How would the plot change if the number of measurements $M$ were increased to $M = 100{,}000$?

In Section 3.11.2 we show that in the limit $N \to \infty$, the probability density $p(S)$ is given by
\[ p(S) = \frac{1}{\sqrt{2\pi\sigma_S^2}}\, e^{-(S - \overline{S})^2/2\sigma_S^2} \qquad \text{(central limit theorem)}, \tag{3.130} \]
where
\[ \overline{S} = N\overline{s}, \tag{3.131} \]
\[ \sigma_S^2 = N\sigma^2, \tag{3.132} \]
with $\sigma^2 = \overline{s^2} - \overline{s}^2$. The quantity $p(S)\Delta S$ is the probability that the value of the sum $\sum_{i=1}^{N} s_i$ is between $S$ and $S + \Delta S$. Equation (3.130) is equivalent to the central limit theorem. Note that the Gaussian form in (3.130) holds only for large $N$ and for values of $S$ near its most probable (mean) value. The latter restriction is the reason that the theorem is called the central limit theorem; the requirement that $N$ be large is the reason for the term limit.

The central limit theorem is one of the most remarkable results of the theory of probability. In its simplest form the theorem states that the probability distribution of the value of the sum of a large number of random variables is approximately a Gaussian. The approximation improves as the number of variables in the sum increases.

For the throw of a die we have $\overline{s} = 1/6$, $\overline{s^2} = 1/6$, and $\sigma^2 = \overline{s^2} - \overline{s}^2 = 1/6 - 1/36 = 5/36$. For $N$ throws of a die, we have $\overline{S} = N/6$ and $\sigma_S^2 = 5N/36$. We see that in this example the most probable relative error in any one measurement of $S$ decreases as $\sigma_S/\overline{S} = \sqrt{5/N}$.

If we let $S$ represent the displacement of a walker after $N$ steps and let $\sigma^2$ equal the mean square displacement of a single step, then the central limit theorem implies that the probability density of the displacement is a Gaussian, which is equivalent to the results that we found for random walks in the limit of large $N$. Or we can let $S$ represent the magnetization of a system of spins and obtain similar results. The displacement of a random walk after $N$ steps and the magnetization of a system of spins are examples of a random additive process. Hence, the probability distribution for random walks, spins, and multiple coin tosses is given by (3.130), and our task reduces to finding expressions for $\overline{s}$ and $\sigma^2$ for the particular process of interest.

Problem 3.45.
Central limit theorem
Use Program CentralLimitTheorem to test the applicability of the central limit theorem.
(a) Assume that the variable $s_i$ is uniformly distributed between 0 and 1. Calculate analytically the mean and standard deviation of $s$ and compare your numerical results with your analytical calculation.
(b) Use the default value of $N = 12$, the number of terms in the sum, and describe the qualitative form of $p(S)$, where $p(S)\Delta S$ is the probability that the sum $S$ is between $S$ and $S + \Delta S$. Does the qualitative form of $p(S)$ change as the number of measurements (trials) of $S$ is increased for a given value of $N$?
(c) What is the approximate width of $p(S)$ for $N = 12$? Describe the changes, if any, of the width of $p(S)$ as $N$ is increased. Increase $N$ by at least a factor of 4. Do your results depend strongly on the number of measurements?
(d) To determine the generality of your results, consider the probability density $f(s) = e^{-s}$ for $s \ge 0$ and answer the same questions as in parts (a)–(c).
(e) Consider the Lorentz distribution $f(s) = (1/\pi)\,[1/(s^2 + 1)]$, where $-\infty \le s \le \infty$. What is the mean value and variance of $s$? Is the form of $p(S)$ consistent with the results that you found in parts (b)–(d)?
(f) Each value of $S$ can be considered to be a measurement. The sample variance $\tilde{\sigma}_S^2$ is a measure of the square of the differences of the result of each measurement and is given by
\[ \tilde{\sigma}_S^2 = \frac{1}{N-1} \sum_{i=1}^{N} (S_i - \overline{S})^2. \tag{3.133} \]
The reason for the factor of $N - 1$ rather than $N$ in the definition of $\tilde{\sigma}_S^2$ is that to compute it, we need to use the $N$ values of $s$ to compute the mean of $S$, and thus, loosely speaking, we have only $N - 1$ independent values of $s$ remaining to calculate $\tilde{\sigma}_S^2$. Show that if $N \gg 1$, then $\tilde{\sigma}_S \approx \sigma_S$, where the standard deviation $\sigma_S$ is given by $\sigma_S^2 = \overline{S^2} - \overline{S}^2$.
(g) The quantity $\tilde{\sigma}_S$ is known as the standard deviation of the means. That is, $\tilde{\sigma}_S$ is a measure of how much variation we expect to find if we make repeated measurements of $S$. How does the value of $\tilde{\sigma}_S$ compare to your estimated width of the probability density $p(S)$?

Figure 3.9: The distribution of the measured values of $M = 10{,}000$ different measurements of the sum $S$ for $N = 100$ and $N = 800$ terms in the sum. The quantity $S$ is the number of times that face 1 appears in $N$ throws of a die. For $N = 100$, the measured values are $\overline{S} = 16.67$, $\overline{S^2} = 291.96$, and $\sigma_S = 3.74$. For $N = 800$, the measured values are $\overline{S} = 133.31$, $\overline{S^2} = 17881.2$, and $\sigma_S = 10.52$. What is the estimated value of the relative width for each case?

The central limit theorem shows why the Gaussian probability density is ubiquitous in nature. If a random process is related to a sum of a large number of microscopic processes, the sum will be distributed according to the Gaussian distribution independently of the nature of the distribution of the microscopic processes.12

The central limit theorem implies that macroscopic bodies have well-defined macroscopic properties even though their constituent parts are changing rapidly. For example, the particle positions and velocities in a gas or liquid are continuously changing at a rate much faster than a typical measurement time. For this reason we expect that during a measurement of the pressure of a gas or a liquid, there are many collisions with the wall and hence the pressure, which is a sum of the pressure due to the individual particles, has a well-defined average. We also expect that the probability that the measured pressure deviates from its average value is proportional to $N^{-1/2}$, where $N$ is the number of particles. Similarly, the vibrations of the molecules in a solid have a time scale much smaller than that of macroscopic measurements, and hence the pressure of a solid also is a well-defined quantity.

Problem 3.46.
Random walks and the central limit theorem
Use the central limit theorem to find the probability that a one-dimensional random walker has a displacement between $x$ and $x + dx$. (There is no need to derive the central limit theorem.)

3.8 *The Poisson Distribution or Should You Fly?
We now return to the question of whether or not it is safe to fly. If the probability of a plane crashing is $p = 10^{-6}$, then $1 - p$ is the probability of surviving a single flight. The probability of surviving $N$ flights is then $P_N = (1 - p)^N$. For $N = 1000$, $P_N \approx 0.999$, and for $N = 500{,}000$, $P_N \approx 0.607$. Thus, our intuition is verified that if we took 1000 flights, we would have only a small chance of crashing.

This type of reasoning is typical when the probability of an individual event is small, but there are very many attempts. Suppose we are interested in the probability of the occurrence of $n$ events out of $N$ attempts given that the probability $p$ of the event for each attempt is very small. The resulting probability is called the Poisson distribution, a distribution that is important in the analysis of experimental data. We discuss it here because of its intrinsic interest.

One way to derive the Poisson distribution is to begin with the binomial distribution:
\[ P(n) = \frac{N!}{n!\,(N-n)!}\, p^n (1 - p)^{N-n}. \tag{3.134} \]

12 We will state the central limit theorem more carefully in Section 3.11.2 and note that the theorem holds only if the second moment of the probability distribution of the individual terms in the sum is finite.

Figure 3.10: Plot of the Poisson probability distribution for $p = 0.0025$ and $N = 1000$.

We will suppress the $N$ dependence of $P$. We first show that the term $N!/(N-n)!$ can be approximated by $N^n$ in the limit $N \gg n$. We write
\begin{align}
\frac{N!}{(N-n)!} &= N(N-1)(N-2) \cdots (N-n+1) \tag{3.135a} \\
&= N^n\, [1 - 1/N][1 - 2/N] \cdots [1 - (n-1)/N] \tag{3.135b} \\
&\approx N^n, \tag{3.135c}
\end{align}
where each term in brackets can be replaced by 1 because $N \gg n$. We next write $\ln (1-p)^{(N-n)} = (N-n) \ln(1-p) \approx -(N-n)p \approx -Np$, and hence $q^{N-n} \approx e^{-pN}$. We then combine these approximations to obtain
\[ P(n) \approx \frac{1}{n!}\, N^n p^n e^{-pN} = \frac{(Np)^n}{n!}\, e^{-pN}, \tag{3.136} \]
or
\[ P(n) = \frac{\overline{n}^{\,n}}{n!}\, e^{-\overline{n}} \qquad \text{(Poisson distribution)}, \tag{3.137} \]
where $\overline{n} = pN$. The form (3.137) is the Poisson distribution (see Figure 3.10).

Let us apply the Poisson distribution to the airplane survival problem. We want to know the probability of never crashing, that is, $P(n = 0)$. The mean $\overline{n} = pN$ equals $10^{-6} \times 1000 = 0.001$ for $N = 1000$ flights and $\overline{n} = 0.5$ for $N = 500{,}000$ flights. Thus, the survival probability is $P(0) = e^{-\overline{n}} \approx 0.999$ for $N = 1000$ and $P(0) \approx 0.607$ for $N = 500{,}000$ as we calculated previously. We see that if we fly 500,000 times, we have a much larger probability of dying in a plane crash.

Problem 3.47. Poisson distribution
(a) Show that the Poisson distribution is properly normalized, and calculate the mean and variance of $n$ [see (A.5)]. Because $P(n)$ for $n > N$ is negligibly small, you can sum $P(n)$ from $n = 0$ to $n = \infty$ even though the maximum value of $n$ is $N$.
(b) Plot the Poisson distribution $P(n)$ as a function of $n$ for $p = 0.01$ and $N = 100$.

3.9 *Traffic Flow and the Exponential Distribution
The Poisson distribution is closely related to the exponential distribution as we will see in the following. Consider a sequence of similar random events which occur at times $t_1, t_2, \ldots$.
Examples of such sequences are the times that a Geiger counter registers a decay of a radioactive nucleus and the times of an accident at a busy intersection. Suppose that we determine the sequence over a very long time $T$ that is much greater than any of the intervals $\tau_i = t_i - t_{i-1}$. We also suppose that the mean number of events is $\lambda$ per unit time so that in the interval $\tau$, the mean number of events is $\lambda\tau$. We also assume that the events occur at random and are independent of each other.

We wish to find the probability $w(\tau)\,d\tau$ that the interval between events is between $\tau$ and $\tau + d\tau$. If an event occurred at $t = 0$, the probability that at least one other event occurs within the interval $[0, \tau]$ is
\[ \int_0^\tau w(\tau')\, d\tau'. \tag{3.138} \]
The probability that no event occurs in this interval is
\[ 1 - \int_0^\tau w(\tau')\, d\tau'. \tag{3.139} \]
Another way of thinking of $w(\tau)$ is that it is the probability that no event occurs in the interval $[0, \tau]$ and then an event occurs within $[\tau, \tau + \Delta\tau]$. Thus,
\begin{align}
w(\tau)\Delta\tau &= \text{probability that no event occurs in the interval } [0, \tau] \notag \\
&\quad \times \text{probability that an event definitely occurs in the interval } [\tau, \tau + \Delta\tau] \notag \\
&= \Big[ 1 - \int_0^\tau w(\tau')\, d\tau' \Big] \lambda\Delta\tau. \tag{3.140}
\end{align}
If we cancel $\Delta\tau$ from each side of (3.140) and differentiate both sides with respect to $\tau$, we find
\[ \frac{dw}{d\tau} = -\lambda w, \tag{3.141} \]
so that
\[ w(\tau) = A e^{-\lambda\tau}. \tag{3.142} \]
The constant of integration $A$ is determined from the normalization condition:
\[ \int_0^\infty w(\tau')\, d\tau' = 1 = A \int_0^\infty e^{-\lambda\tau'}\, d\tau' = A/\lambda. \tag{3.143} \]
Hence, $w(\tau)$ is the exponential function
\[ w(\tau) = \lambda e^{-\lambda\tau}. \tag{3.144} \]
These results for the exponential distribution lead naturally to the Poisson distribution. Let us divide the interval $T \gg 1$ into $n$ smaller intervals $\tau = T/n$. What is the probability that 0, 1, 2, 3, ... events occur in the interval $\tau$? We will show that the probability that $n$ events occur in the time interval $\tau$ is given by the Poisson distribution:
\[ P_n(\tau) = \frac{(\lambda\tau)^n}{n!}\, e^{-\lambda\tau}, \tag{3.145} \]
where we have set $\overline{n} = \lambda\tau$ in (3.137). We first consider the case $n = 0$. If $n = 0$, the probability that no event occurs in the interval $\tau$ is [see (3.140)]
τ τ Pn=0 (τ ) = 1 − 0 w(τ ′ ) dτ ′ = 1 − λ e−λτ dτ ′ = e−λτ . ′ (3.146) 0 For n = 1 there is exactly one event in time interval τ . This event must occur at some time τ ′ . If it occurs at τ ′ , then no other event can occur in the interval [τ ′ , τ ] (otherwise n would not equal 1). Thus, we have
τ Pn=1 (τ ) =
0 τ w(τ ′ )Pn=0 (τ − τ ′ ) dτ ′ λe−λτ e−λ(τ −τ ) dτ ′ ,
′ ′ (3.147a) (3.147b) =
0 where we have used (3.146) with τ → (τ − τ ′ ). Hence,
τ Pn=1 (τ ) =
0 λe−λτ dτ ′ = (λτ )e−λτ . (3.148) If n events are to occur in the interval [0, τ ], the ﬁrst must occur at some time τ ′ and exactly (n − 1) must occur in the time (τ − τ ′ ). Hence,
τ Pn (τ ) =
0 λe−λτ Pn−1 (τ − τ ′ ) dτ ′ . ′ (3.149) Equation (3.149) is a recurrence formula that can be used to derive (3.145) by induction. It is easy to see that (3.145) satisﬁes (3.149) for n = 1. As is usual when solving recursion formulas by induction, we assume that (3.145) is correct for (n − 1). We substitute this result into (3.149) and ﬁnd τ (λτ )n −λτ Pn (τ ) = λn e−λτ (τ − τ ′ )n−1 dτ ′ /(n − 1)! = e . (3.150) n! 0 An application of the Poisson distribution is given in Problem 3.48. Problem 3.48. Analysis of traﬃc data In Table 3.5 we show the number of vehicles passing a marker during a 30 s interval. The observations were made on a single lane of a sixlane divided highway. Assume that the traﬃc density is so low that passing occurs easily and no platoons of cars develop.
   N     frequency
   0         1
   1         7
   2        14
   3        25
   4        31
   5        26
   6        27
   7        14
   8         8
   9         3
  10         4
  11         3
  12         1
  13         0
  14         1
 > 15        0

Table 3.5: Observed distribution of vehicles passing a marker on a highway in 30 s intervals, taken from Montroll and Badger (1974), page 98.

(a) Is the distribution of the number of vehicles consistent with the Poisson distribution? If so, what is the value of the parameter $\lambda$?

(b) As the traffic density increases, the flow reaches a regime where the vehicles are very close to one another so that they are no longer mutually independent. Make arguments for the form of the probability distribution of the number of vehicles passing a given point in this regime.

3.10 *Are All Probability Distributions Gaussian?

We have discussed random additive processes and found that the probability distribution of their sum is a Gaussian for a sufficiently large number of terms. An example of such a process is a one-dimensional random walk for which the displacement $x$ is the sum of $N$ random steps.

We now discuss random multiplicative processes. Examples of such processes include the distributions of incomes, rainfall, and fragment sizes in rock crushing processes [13]. Consider the latter for which we begin with a rock of size $w$. We strike the rock with a hammer and generate two fragments whose sizes are $pw$ and $qw$, where $q = 1 - p$. In the next step the possible sizes of the fragments are $p^2 w$, $pqw$, $qpw$, and $q^2 w$. What is the distribution of the fragment sizes after $N$ blows of the hammer?

To answer this question consider the value of the product of the binary sequence of $N$ elements in which the numbers $x_1$ and $x_2$ appear independently with probabilities $p$ and $q$, respectively. We write

$$\Pi = x_1 x_1 x_2 x_1 x_2 \ldots \tag{3.151}$$
[13] The following discussion is based on an article by Redner (1990).

We ask what is $\overline{\Pi}$, the mean value of $\Pi$? To calculate $\overline{\Pi}$ we define $P_N(n)$ to be the probability that the product of $N$ independent factors of $x_1$ and $x_2$ has the value $x_1^{\,n} x_2^{\,N-n}$. This probability is given by the number of sequences where $x_1$ appears $n$ times multiplied by the probability of choosing a specific sequence with $x_1$ appearing $n$ times. This probability is the familiar binomial distribution:

$$P_N(n) = \frac{N!}{n!\,(N - n)!}\, p^n q^{N-n}. \tag{3.152}$$

We average over all possible outcomes of the product to obtain its mean value

$$\overline{\Pi} = \sum_{n=0}^{N} P_N(n)\, x_1^{\,n} x_2^{\,N-n} = (p x_1 + q x_2)^N. \tag{3.153}$$

The most probable event in the product contains $Np$ factors of $x_1$ and $Nq$ factors of $x_2$. Hence, the most probable value of the product is

$$\widetilde{\Pi} = (x_1^{\,p} x_2^{\,q})^N. \tag{3.154}$$

To obtain a better feeling for these results we consider some special cases. For $x_1 = 2$, $x_2 = 1/2$, and $p = q = 1/2$ we have $\overline{\Pi} = (1/4)[x_1^2 + 2x_1 x_2 + x_2^2] = (1/4)[4 + 2 + 1/4] = 25/16$ for $N = 2$; for general $N$ we have $\overline{\Pi} = (5/4)^N$. In contrast, the most probable value for $N = 2$ is given by $\widetilde{\Pi} = [2^{1/2} \times (1/2)^{1/2}]^2 = 1$; the same result holds for any $N$. For $p = 1/3$ and $q = 2/3$ and the same values of $x_1$ and $x_2$ we find $\overline{\Pi} = 1$ for all $N$ and $\widetilde{\Pi} = [2^{1/3} \times (1/2)^{2/3}]^2 = 2^{-2/3}$ for $N = 2$ and $2^{-N/3}$ for any $N$. We see that $\overline{\Pi} \neq \widetilde{\Pi}$ for a random multiplicative process. In contrast, the most probable event is a good approximation to the mean value of the sum of a random additive process (and is identical for $p = q$).

The reason for the large discrepancy between $\overline{\Pi}$ and $\widetilde{\Pi}$ is the important role played by rare events. For example, a sequence of $N$ factors of $x_1 = 2$ occurs with a very small probability, but the value of this product is very large in comparison to the most probable value. Hence, this extreme event makes a finite contribution to $\overline{\Pi}$ and a dominant contribution to the higher moments $\overline{\Pi^m}$.
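The special cases quoted above can be checked numerically. The following is a minimal sketch, not part of the original text: the function names `mean_product` and `most_probable_product` are chosen here for illustration. It evaluates the binomial sum in (3.153) term by term and compares it with the most probable value (3.154).

```python
from math import comb

def mean_product(x1, x2, p, N):
    """Mean of the product: sum over n of P_N(n) * x1**n * x2**(N-n), as in (3.153)."""
    q = 1.0 - p
    return sum(comb(N, n) * p**n * q**(N - n) * x1**n * x2**(N - n)
               for n in range(N + 1))

def most_probable_product(x1, x2, p, N):
    """Most probable value (x1**p * x2**q)**N, as in (3.154)."""
    q = 1.0 - p
    return (x1**p * x2**q) ** N

# Compare the two quantities for the cases discussed in the text.
for p in (1/2, 1/3):
    for N in (2, 10, 40):
        print(f"p={p:.3f} N={N:2d} "
              f"mean={mean_product(2, 0.5, p, N):.6g} "
              f"most probable={most_probable_product(2, 0.5, p, N):.6g}")
```

For $p = q = 1/2$ the printed mean grows as $(5/4)^N$ while the most probable value stays at 1, which makes the growing gap between the two quantities easy to see.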
*Problem 3.49. A simple multiplicative process

(a) Confirm the general result in (3.153) for $N = 4$ by showing explicitly all the possible values of the product.

(b) Consider the case $x_1 = 2$, $x_2 = 1/2$, $p = 1/4$, and $q = 3/4$, and calculate $\overline{\Pi}$ and $\widetilde{\Pi}$.

(c) Show that the mean value of the $m$th moment $\overline{\Pi^m} = \sum_{n=0}^{N} P_N(n)\,(x_1^{\,n} x_2^{\,N-n})^m$ reduces to $(p x_1^{\,m})^N$ as $m \to \infty$ (for $x_1 > x_2$). (Hint: Consider the ratio of each term in the sum to the term with $x_1^{\,Nm}$.) This result implies that the $m$th moment is determined solely by the most extreme event for $m \gg 1$.

(d) Explain why a lognormal distribution for which $p(\Pi) \sim e^{-(\ln \Pi - \overline{\ln \Pi})^2/2\sigma^2}$ is a reasonable guess for the continuum approximation to the probability of a random multiplicative process for $N \gg 1$. Here $\Pi = x_1^{\,n} x_2^{\,N-n}$.
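The role of rare events in Section 3.10 can also be explored by direct sampling. The sketch below is an illustration only and is not the book's Program MultiplicativeProcess; the function name `sample_products`, the seed, and the number of trials are assumptions made here. It draws many products of $N$ factors and compares the statistics of $\Pi$ and of $\ln \Pi$. Because $\ln \Pi = n \ln x_1 + (N - n)\ln x_2$ is an additive quantity, its mean and variance follow from the binomial results for $n$.

```python
import math
import random
import statistics

def sample_products(x1, x2, p, N, trials, seed=0):
    """Draw `trials` products of N factors, each x1 (prob. p) or x2 (prob. 1 - p)."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for reproducibility
    products = []
    for _ in range(trials):
        n = sum(rng.random() < p for _ in range(N))  # number of x1 factors
        products.append(x1**n * x2**(N - n))
    return products

x1, x2, p, N = 2.0, 0.5, 0.5, 40
products = sample_products(x1, x2, p, N, trials=10_000)
log_values = [math.log(v) for v in products]

# Mean and variance of ln(Pi) from the binomial mean Np and variance Npq of n.
log_mean_theory = N * (p * math.log(x1) + (1 - p) * math.log(x2))
log_var_theory = N * p * (1 - p) * (math.log(x1) - math.log(x2)) ** 2

print("sample mean of Pi   :", statistics.fmean(products))
print("exact mean of Pi    :", (p * x1 + (1 - p) * x2) ** N)
print("sample median of Pi :", statistics.median(products))
print("mean of ln Pi       :", statistics.fmean(log_values), "theory:", log_mean_theory)
print("variance of ln Pi   :", statistics.pvariance(log_values), "theory:", log_var_theory)
```

The statistics of $\ln \Pi$ should settle quickly, whereas the sample mean of $\Pi$ depends on whether a few very large products happen to be drawn, so it can differ noticeably from run to run when the seed is changed.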
*Problem 3.50. Simulation of a multiplicative process
Run Program MultiplicativeProcess to simulate the distribution of values of the product $x_1^{\,n} x_2^{\,N-n}$. Choose $x_1 = 2$, $x_2 = 1/2$, and $p = q = 1/2$. First choose $N = 4$ and estimate $\overline{\Pi}$ and $\widetilde{\Pi}$. Do your estimated values converge more or less uniformly to the analytical values as the number of measurements becomes large? Do a similar simulation for $N = 40$. Compare your results with a similar simulation of a random walk and discuss the importance of extreme events for random multiplicative processes.

3.11 *Supplementary Notes

3.11.1 Method of undetermined multipliers

Suppose that we want to maximize the function $f(x, y) = xy^2$ subject to the constraint that $x^2 + y^2 = 1$. One way would be to substitute $y^2 = 1 - x^2$ and maximize $f(x) = x(1 - x^2)$. This approach works only if $f$ can be reduced to a function of one variable. We first consider this case as a way of introducing the general method of undetermined multipliers.

Our goal is to maximize $f(x, y) = xy^2$ subject to the constraint that $g(x, y) = x^2 + y^2 - 1 = 0$. In the method of undetermined multipliers this problem can be reduced to solving the equation

$$df - \lambda\,dg = 0, \tag{3.155}$$

where $df = 0$ at the maximum of $f$, $dg = 0$ because $g$ expresses the constraint, and $\lambda$ will be chosen so that (3.155) is satisfied. If we substitute $df = y^2\,dx + 2xy\,dy$ and $dg = 2x\,dx + 2y\,dy$ in (3.155), we obtain

$$(y^2 - 2\lambda x)\,dx + 2(xy - \lambda y)\,dy = 0. \tag{3.156}$$

We choose $\lambda = y^2/2x$ so that the first term is zero at the maximum. Because this term is zero, the second term must also be zero; that is, $x = \lambda = y^2/2x$, so $x = \pm y/\sqrt{2}$. Hence, from the constraint $g(x, y) = 0$, we obtain $x = 1/\sqrt{3}$ and $\lambda = x = 1/\sqrt{3}$ at the maximum.

More generally, we wish to maximize the function $f(x_1, x_2, \ldots, x_N)$ subject to the constraints $g_j(x_1, x_2, \ldots, x_N) = 0$ where $j = 1, 2, \ldots, M$ with $M < N$. The maximum of $f$ is given by

$$df = \sum_{i=1}^{N} \frac{\partial f}{\partial x_i}\,dx_i = 0, \tag{3.157}$$

and the constraints can be expressed as

$$dg_j = \sum_{i=1}^{N} \frac{\partial g_j}{\partial x_i}\,dx_i = 0. \tag{3.158}$$

As in our example, we can combine (3.157) and (3.158) and write $df - \sum_{j=1}^{M} \lambda_j\,dg_j = 0$, or

$$\sum_{i=1}^{N} \Big[\frac{\partial f}{\partial x_i} - \sum_{j=1}^{M} \lambda_j \frac{\partial g_j}{\partial x_i}\Big]\,dx_i = 0. \tag{3.159}$$

We are free to choose all $M$ values of $\lambda_j$ such that the first $M$ terms in the square brackets are zero. For the remaining $N - M$ terms, the $dx_i$ can be independently varied because the constraints have been satisfied. Hence, the remaining terms in square brackets must be independently zero, and we are left with $N - M$ equations of the form

$$\frac{\partial f}{\partial x_i} - \sum_{j=1}^{M} \lambda_j \frac{\partial g_j}{\partial x_i} = 0. \tag{3.160}$$

In Example 3.11 we were able to obtain the probabilities by reducing the uncertainty $S$ to a function of a single variable $P_1$ and then maximizing $S(P_1)$. We now consider a more general problem where there are more outcomes: a loaded die for which there are six outcomes. Suppose that we know that the average number of points on the face of a die is $\overline{n}$. We wish to determine the values of $P_1, P_2, \ldots, P_6$ that maximize the uncertainty $S$ subject to the constraints
$$\sum_{j=1}^{6} P_j = 1, \tag{3.161}$$

$$\sum_{j=1}^{6} j P_j = \overline{n}. \tag{3.162}$$

For a perfect die $\overline{n} = 3.5$. We take

\begin{align}
f = S &= -\sum_{j=1}^{6} P_j \ln P_j, \tag{3.163a} \\
g_1 &= \sum_{j=1}^{6} P_j - 1, \tag{3.163b}
\end{align}

and

$$g_2 = \sum_{j=1}^{6} j P_j - \overline{n}. \tag{3.163c}$$

We have $\partial f/\partial P_j = -(1 + \ln P_j)$, $\partial g_1/\partial P_j = 1$, and $\partial g_2/\partial P_j = j$, and write (3.160) for $j = 1$ and $j = 2$ as

\begin{align}
-(1 + \ln P_1) - \alpha - \beta &= 0, \tag{3.164a} \\
-(1 + \ln P_2) - \alpha - 2\beta &= 0, \tag{3.164b}
\end{align}

where we have taken $\alpha$ and $\beta$ (instead of $\lambda_1$ and $\lambda_2$) as the undetermined Lagrange multipliers. The solution of (3.164) for $\alpha$ and $\beta$ is

\begin{align}
\alpha &= \ln P_2 - 2\ln P_1 - 1, \tag{3.165a} \\
\beta &= \ln P_1 - \ln P_2. \tag{3.165b}
\end{align}

We solve (3.165b) for $\ln P_2 = \ln P_1 - \beta$ and use (3.165a) to find $\ln P_1 = -1 - \alpha - \beta$. We then use this result to write $\ln P_2$ as $\ln P_2 = -1 - \alpha - 2\beta$. We can independently vary $dP_3, \ldots, dP_6$ because the two constraints are satisfied by the values of $P_1$ and $P_2$. Hence, we have from (3.160) and (3.163) that

$$\ln P_j = -1 - \alpha - j\beta, \tag{3.166}$$

or

$$P_j = e^{-1-\alpha} e^{-\beta j}. \tag{3.167}$$

We eliminate the constant $\alpha$ by the normalization condition (3.161) and write

$$P_j = \frac{e^{-\beta j}}{\sum_j e^{-\beta j}}. \tag{3.168}$$

The constant $\beta$ is determined by the constraint (3.162):

$$\overline{n} = \frac{e^{-\beta} + 2e^{-2\beta} + 3e^{-3\beta} + 4e^{-4\beta} + 5e^{-5\beta} + 6e^{-6\beta}}{e^{-\beta} + e^{-2\beta} + e^{-3\beta} + e^{-4\beta} + e^{-5\beta} + e^{-6\beta}}. \tag{3.169}$$

Usually, (3.169) must be solved numerically. The exponential form (3.168) will become very familiar to us [see (4.79), page 200] and is known as the Boltzmann distribution. In the context of thermal systems the Boltzmann distribution maximizes the uncertainty given the constraints that the probability distribution is normalized and the mean energy is known.

Problem 3.51. Numerical solution of (3.169)
Show that the solution to (3.169) is $\beta = 0$ for $\overline{n} = 7/2$, $\beta = +\infty$ for $\overline{n} = 1$, $\beta = -\infty$ for $\overline{n} = 6$, and $\beta = -0.1746$ for $\overline{n} = 4$.

3.11.2 Derivation of the central limit theorem

To discuss the derivation of the central limit theorem, it is convenient to introduce the characteristic function $\phi(k)$ of the probability density $p(x)$. The main utility of the characteristic function is that it simplifies the analysis of the sums of independent random variables. We define $\phi(k)$ as the Fourier transform of $p(x)$:
$$\phi(k) = \overline{e^{ikx}} = \int_{-\infty}^{\infty} e^{ikx}\,p(x)\,dx. \tag{3.170}$$

Because $p(x)$ is normalized, it follows that $\phi(k = 0) = 1$. The main property of the Fourier transform that we need is that if $\phi(k)$ is known, we can find $p(x)$ by calculating the inverse Fourier transform:

$$p(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-ikx}\,\phi(k)\,dk. \tag{3.171}$$

Problem 3.52. Characteristic function of a Gaussian
Calculate the characteristic function of the Gaussian probability density.

One useful property of $\phi(k)$ is that its power series expansion yields the moments of $p(x)$:

\begin{align}
\phi(k) &= \sum_{n=0}^{\infty} \frac{k^n}{n!}\,\frac{d^n\phi(k)}{dk^n}\Big|_{k=0} \tag{3.172} \\
&= \overline{e^{ikx}} = \sum_{n=0}^{\infty} \frac{(ik)^n\,\overline{x^n}}{n!}. \tag{3.173}
\end{align}

By comparing coefficients of $k^n$ in (3.172) and (3.173), we see that

$$\overline{x} = -i\,\frac{d\phi}{dk}\Big|_{k=0}. \tag{3.174}$$

In Problem 3.53 we show that

$$\overline{x^2} - \overline{x}^2 = -\frac{d^2}{dk^2}\ln\phi(k)\Big|_{k=0}, \tag{3.175}$$

and that certain convenient combinations of the moments are related to the power series expansion of the logarithm of the characteristic function.

Problem 3.53. The first few cumulants
The characteristic function generates the cumulants $C_m$ defined by

$$\ln\phi(k) = \sum_{n=1}^{\infty} \frac{(ik)^n}{n!}\,C_n. \tag{3.176}$$

Show that the cumulants are combinations of the moments of $x$ and are given by

\begin{align}
C_1 &= \overline{x}, \tag{3.177a} \\
C_2 &= \sigma^2 = \overline{x^2} - \overline{x}^2, \tag{3.177b} \\
C_3 &= \overline{x^3} - 3\,\overline{x^2}\,\overline{x} + 2\,\overline{x}^3, \tag{3.177c} \\
C_4 &= \overline{x^4} - 4\,\overline{x^3}\,\overline{x} - 3\,\overline{x^2}^{\,2} + 12\,\overline{x^2}\,\overline{x}^2 - 6\,\overline{x}^4. \tag{3.177d}
\end{align}

The first few cumulants were calculated for several probability distributions in Problems 3.41(b) and 3.42(c). What is the value of $C_4$ for the Gaussian distribution?

Now let us consider the properties of the characteristic function for the sums of independent variables. For example, let $p_1(x)$ be the probability density for the weight $x$ of adult males and let $p_2(y)$ be the probability density for the weight of adult females. If we assume that people marry one another independently of weight, what is the probability density $p(z)$ for the weight $z$ of an adult couple? We have that

$$z = x + y. \tag{3.178}$$

How do the probability densities combine? The answer is given by

$$p(z) = \int\!\!\int p_1(x)\,p_2(y)\,\delta(z - x - y)\,dx\,dy. \tag{3.179}$$

The integral in (3.179) represents all the possible ways of obtaining the combined weight $z$ as determined by the probability density $p_1(x)p_2(y)$ for the combination of $x$ and $y$ that sums to $z$. The form (3.179) of the integrand is known as a convolution. An important property of a convolution is that its Fourier transform is a simple product. We have

\begin{align}
\phi_z(k) &= \int e^{ikz}\,p(z)\,dz \tag{3.180a} \\
&= \int\!\!\int\!\!\int e^{ikz}\,p_1(x)\,p_2(y)\,\delta(z - x - y)\,dx\,dy\,dz \tag{3.180b} \\
&= \int e^{ikx}\,p_1(x)\,dx \int e^{iky}\,p_2(y)\,dy \tag{3.180c} \\
&= \phi_1(k)\,\phi_2(k). \tag{3.180d}
\end{align}

It is straightforward to generalize this result to a sum of $N$ random variables. We write

$$S = x_1 + x_2 + \ldots + x_N. \tag{3.181}$$

Then
$$\phi_S(k) = \prod_{i=1}^{N} \phi_i(k). \tag{3.182}$$

That is, the characteristic function of the sum of several independent variables is the product of the individual characteristic functions. If we take the logarithm of both sides of (3.182), we obtain

$$\ln\phi_S(k) = \sum_{i=1}^{N} \ln\phi_i(k). \tag{3.183}$$

Each side of (3.183) can be expanded as a power series and compared order by order in powers of $ik$. The result is that when random variables are added, their associated cumulants also add. That is, the $n$th order cumulants satisfy the relation
$$C_n^{S} = C_n^{1} + C_n^{2} + \ldots + C_n^{N}. \tag{3.184}$$

We conclude that if the random variables $x_i$ are independent (uncorrelated), their cumulants, and in particular, their variances, add. We saw a special case of this result for the variance in (3.75).

If we denote the mean and standard deviation of the weight of an adult male as $\overline{w}$ and $\sigma_w$, respectively, then from (3.177a) and (3.184) we find that the mean of the total weight of $N$ adult males is given by $N\overline{w}$. Similarly from (3.177b) we see that the standard deviation of the weight of $N$ adult males is given by $\sigma_N^2 = N\sigma_w^2$, or $\sigma_N = \sqrt{N}\,\sigma_w$. Hence, we find the now familiar result that the sum of $N$ random variables scales as $N$ while the standard deviation scales as $\sqrt{N}$.

We are now in a position to derive the central limit theorem. Let $x_1, x_2, \ldots, x_N$ be $N$ mutually independent variables. For simplicity, we assume that each variable has the same probability density $p(x)$. The only condition is that the variance $\sigma_x^2$ of the probability density $p(x)$ must be finite. For simplicity, we make the additional assumption that $\overline{x} = 0$, a condition that always can be satisfied by measuring $x$ from its mean. The central limit theorem states that the sum $S$ has the probability density

$$p(S) = \frac{1}{\sqrt{2\pi N\sigma_x^2}}\,e^{-S^2/2N\sigma_x^2}. \tag{3.185}$$

From (3.177b) we see that $\overline{S^2} = N\sigma_x^2$, and hence the variance of $S$ grows linearly with $N$. However, the distribution of the values of the arithmetic mean $S/N$ becomes narrower with increasing $N$:

$$\overline{\Big(\frac{x_1 + x_2 + \ldots + x_N}{N}\Big)^2} = \frac{N\sigma_x^2}{N^2} = \frac{\sigma_x^2}{N}. \tag{3.186}$$

From (3.186) we see that it is useful to define a scaled sum

$$z = \frac{1}{\sqrt{N}}(x_1 + x_2 + \ldots + x_N), \tag{3.187}$$

and to write the central limit theorem in the form

$$p(z) = \frac{1}{\sqrt{2\pi\sigma_x^2}}\,e^{-z^2/2\sigma_x^2}. \tag{3.188}$$

To obtain the result (3.188), we write the characteristic function of $z$ as

\begin{align}
\phi_z(k) &= \int e^{ikz} \int\!\cdots\!\int \delta\Big(z - \frac{x_1 + x_2 + \ldots + x_N}{N^{1/2}}\Big)\,p(x_1)\,p(x_2)\cdots p(x_N)\,dz\,dx_1\,dx_2\cdots dx_N \tag{3.189a} \\
&= \int\!\cdots\!\int e^{ik(x_1 + x_2 + \cdots + x_N)/N^{1/2}}\,p(x_1)\,p(x_2)\cdots p(x_N)\,dx_1\,dx_2\cdots dx_N \tag{3.189b} \\
&= \phi\Big(\frac{k}{N^{1/2}}\Big)^{N}. \tag{3.189c}
\end{align}

We next take the logarithm of both sides of (3.189c) and expand the right-hand side in powers of $k$ using (3.176) to find

$$\ln\phi_z(k) = \sum_{m=2}^{\infty} \frac{(ik)^m}{m!}\,N^{1-m/2}\,C_m. \tag{3.190}$$

The $m = 1$ term does not contribute in (3.190) because we have assumed that $\overline{x} = 0$. More importantly, note that as $N \to \infty$ the higher-order terms are suppressed so that

$$\ln\phi_z(k) \to -\frac{k^2}{2}\,C_2, \tag{3.191}$$

or

$$\phi_z(k) \to e^{-k^2\sigma_x^2/2} + \cdots. \tag{3.192}$$

Because the inverse Fourier transform of a Gaussian is also a Gaussian, we find that

$$p(z) = \frac{1}{\sqrt{2\pi\sigma_x^2}}\,e^{-z^2/2\sigma_x^2}. \tag{3.193}$$

The leading correction to $\phi(k)$ in (3.193) gives rise to a term of order $N^{-1/2}$, and therefore does not contribute in the limit $N \to \infty$.

The only requirements for the applicability of the central limit theorem are that the various $x_i$ be statistically independent and that the second moment of $p(x)$ exists. It is not necessary that all the $x_i$ have the same distribution. Not all probabilities have a finite second moment as demonstrated by the Lorentz distribution (see Problem 3.43), but the requirements for the central limit theorem are weak and the central limit theorem is widely applicable.

Vocabulary
sample space, events, outcome
uncertainty, principle of least bias or maximum uncertainty
probability distribution $P(i)$ or $P_i$, probability density $p(x)$
mean value $\overline{f(x)}$, moments, variance $\overline{\Delta x^2}$, standard deviation $\sigma$
conditional probability $P(A|B)$, Bayes' theorem
binomial distribution, Gaussian distribution, Poisson distribution
random walk, random additive processes, central limit theorem
Stirling's approximation
Monte Carlo sampling
rare or extreme events, random multiplicative processes
cumulants, characteristic function

Additional problems
Problem 3.54. Probability that a site is occupied
In Figure 3.11 we show a square lattice of $16^2$ sites each of which is occupied with probability $p$. Estimate the probability $p$ that a site in the lattice is occupied and explain your reasoning.

Figure 3.11: Representation of a square lattice of 16 × 16 sites. The sites are represented by squares. Each site is either occupied (shaded) independently of its neighbors with probability $p$ or is empty (white) with probability $1 - p$. These configurations are discussed in the context of percolation in Section 9.3.

Problem 3.55. Three coins (in a fountain)
Three coins are tossed in succession. Assume that landing heads or tails is equiprobable. Find the probabilities of the following: (a) the first coin is heads; (b) exactly two heads have occurred; (c) not more than two heads have occurred.

Problem 3.56. A student's fallacious reasoning
A student tries to solve Problem 3.13 by using the following reasoning. The probability of a double 6 is 1/36. Hence the probability of finding at least one double 6 in 24 throws is 24/36. What is wrong with this reasoning? If you have trouble understanding the error in this reasoning, try solving the problem of finding the probability of at least one double 6 in two throws of a pair of dice. What are the possible outcomes? Is each outcome equally probable?

Problem 3.57. d'Alembert's fallacious reasoning

(a) What is the probability that heads will appear at least once in two tosses of a single coin? Use the rules of probability to show that the answer is 3/4.

(b) d'Alembert, a distinguished French mathematician of the eighteenth century, reasoned that there are only three possible outcomes: heads on the first throw, heads on the second throw, and no heads at all. The first two of these three outcomes are favorable. Therefore the probability that heads will appear at least once is 2/3. What is the fallacy in his reasoning?

Even eminent mathematicians (and physicists) have been led astray by the subtle nature of probability.

Problem 3.58. Number of fish in a pond
A farmer wants to estimate how many fish are in a pond. The farmer takes out 200 fish and tags them and returns them to the pond. After sufficient time to allow the tagged fish to mix with the others, the farmer removes 250 fish at random and finds that 25 of them are tagged. Estimate the number of fish in the pond.

Problem 3.59. Estimating the area of a pond
A farmer owns a field that is 10 m × 10 m. In the midst of this field is a pond of unknown area. Suppose that the farmer is able to throw 100 stones at random into the field and finds that 40 of the stones make a splash. How can the farmer use this information to estimate the area of the pond?

   i    x_i, y_i          i    x_i, y_i
   1    0.984, 0.246      6    0.637, 0.581
   2    0.860, 0.132      7    0.779, 0.218
   3    0.316, 0.028      8    0.276, 0.238
   4    0.523, 0.542      9    0.081, 0.484
   5    0.349, 0.623     10    0.289, 0.032

Table 3.6: A sequence of ten random pairs of numbers (see Problem 3.60).

Problem 3.60. Monte Carlo integration
Consider the ten pairs of numbers $(x_i, y_i)$ given in Table 3.6. The numbers are all in the range $0 < x_i, y_i \leq 1$. Imagine that these numbers were generated by counting the clicks generated by a Geiger counter of radioactive decays, and hence they can be considered to be a part of a sequence of random numbers. Use this sequence to estimate the magnitude of the integral
$$F = \int_0^1 dx\,\sqrt{1 - x^2}. \tag{3.194}$$

If you have been successful in estimating the integral in this way, you have found a simple version of a general method known as Monte Carlo integration [14].

(a) Show analytically that the integral in (3.194) is equal to $\pi/4$.

(b) Use Program MonteCarloEstimation to estimate the integral (3.194) by Monte Carlo integration. Determine the error (the magnitude of the deviation from the exact answer) for trials of $n$ pairs of points equal to $n = 10^4$, $10^6$, and $10^8$. Does the error decrease with increasing $n$ on the average?

(c) Estimate the integral using $n = 1000$. Repeat for a total of ten trials using a different random number seed each time. The easiest way to do so is to press the Reset button and then press the Calculate button. The default is for the program to choose a new seed each time based on the clock. Is the magnitude of the variation of your values of the same order as the error between the average value and the exact value? For a large number of trials, the error is estimated from the standard error of the mean, which approximately equals the standard deviation divided by the square root of the number of trials.

[14] Monte Carlo methods were first developed to estimate integrals that could not be performed analytically or by the usual numerical methods.

Problem 3.61. Bullseye
A person playing darts hits a bullseye 20% of the time on the average. Why is the probability of $b$ bullseyes in $N$ attempts a binomial distribution? What are the values of $p$ and $q$? Find the probability that the person hits a bullseye (a) once in five throws; (b) twice in ten throws. Why are these probabilities not identical?

Problem 3.62. Family values
There are ten children in a given family. Assuming that a boy is as likely to be born as a girl, find the probability of the family having (a) five boys and five girls; (b) three boys and seven girls.

Problem 3.63. Fathers and sons (and daughters)
What is the probability that five children produced by the same couple will consist of the following (assume that the probabilities of giving birth to a boy and a girl are the same): (a) three sons and two daughters? (b) alternating sexes? (c) alternating sexes starting with a son? (d) all daughters?

Problem 3.64. Probability in baseball
A good hitter in major league baseball has a batting average of 300, which means that the hitter will be successful three times out of ten tries on the average. Assume that the batter has four times at bat per game.

(a) What is the probability that he gets no hits in one game?

(b) What is the probability that he will get two hits or less in a three-game series?

(c) What is the probability that he will get five or more hits in a three-game series?

Baseball fans might want to think about the significance of "slumps" and "streaks" in baseball.

Problem 3.65. Playoff winners

(a) In the World Series in baseball and in the playoffs in the National Basketball Association and the National Hockey League, the winner is determined by the best of seven games. That is, the first team that wins four games wins the series and is the champion. Do a simple statistical calculation assuming that the two teams are evenly matched and show that a seven-game series should occur 31.25% of the time. What is the probability that the series lasts $n$ games?
More information can be found at <www.mste.uiuc.edu/hill/ev/seriesprob.html> and at <www.aip.org/isns/reports/2003/080.html>.

(b) Most teams have better records at home. Assume the two teams are evenly matched and each has a 60% chance of winning at home and a 40% chance of winning away. In principle, both teams should have an equal chance of winning a seven-game series. Determine which pattern of home games is closer to giving each team a 50% chance of winning. Consider the two common patterns: (1) two home, three away, two home; and (2) two home, two away, one home, one away, one home.

Problem 3.66. Galton board
The Galton board [named after Francis Galton (1822–1911)] is a triangular array of pegs. The rows are numbered 0, 1, . . . from the top row down, such that row $n$ has $n + 1$ pegs. Suppose that a ball is dropped from above the top peg. Each time the ball hits a peg, it bounces to the right with probability $p$ and to the left with probability $1 - p$, independently from peg to peg. Suppose that $N$ balls are dropped successively such that the balls do not encounter one another. How will the balls be distributed at the bottom of the board? Links to applets that simulate the Galton board can be found in the references.

Problem 3.67. The birthday problem
What if somebody offered to bet you that at least two people in your physics class had the same birthday? Would you take the bet?

(a) What are the chances that at least two people in your class have the same birthday? Assume that the number of students is 25.

(b) What are the chances that at least one other person in your class has the same birthday as you? Explain why the chances are less in this case than in part (a).

Problem 3.68. A random walk down Wall Street
Many analysts attempt to select stocks by looking for correlations in the stock market as a whole or for patterns for particular companies. Such an analysis is based on the belief that there are repetitive patterns in stock prices. To understand one reason for the persistence of this belief do the following experiment. Construct a stock chart (a plot of stock price versus time) showing the movements of a hypothetical stock initially selling at $50 per share. On each successive day the closing stock price is determined by the flip of a coin. If the coin toss is a head, the stock closes 1/2 point ($0.50) higher than the preceding close. If the toss is a tail, the price is down by 1/2 point. Construct the stock chart for a long enough time to see "cycles" and other "patterns" appear. A sequence of numbers produced in this manner is identical to a random walk, yet the sequence frequently appears to be correlated. The lesson of the charts is that our eyes look for patterns even when none exists.

Problem 3.69. Displacement and number of steps to the right

(a) Suppose that a random walker takes $N$ steps of unit length with probability $p$ of a step to the right. The displacement $m$ of the walker from the origin is given by $m = n - n'$, where $n$ is the number of steps to the right and $n'$ is the number of steps to the left. Show that $\overline{m} = (p - q)N$ and $\sigma_m^2 = \overline{(m - \overline{m})^2} = 4Npq$.
(b) The result (3.78) for $\overline{(\Delta M)^2}$ differs by a factor of four from the result for $\sigma_n^2$ in (3.99). Why?

Problem 3.70. Watching a drunkard
A random walker is observed to take a total of $N$ steps, $n$ of which are to the right.

(a) Suppose that a curious observer finds that on ten successive nights the walker takes $N = 20$ steps and that the values of $n$ are given successively by 14, 13, 11, 12, 11, 12, 16, 16, 14, 8. Calculate $\overline{n}$, $\overline{n^2}$, and $\sigma_n$. You can use this information to make two estimates of $p$, the probability of a step to the right. If you obtain different estimates for $p$, which estimate is likely to be the most accurate?

(b) Suppose that on another ten successive nights the same walker takes $N = 100$ steps and that the values of $n$ are given by 58, 69, 71, 58, 63, 53, 64, 66, 65, 50. Calculate the same quantities as in part (a) and use this information to estimate $p$. How does the ratio of $\sigma_n$ to $\overline{n}$ compare for the two values of $N$? Explain your results.

(c) Calculate $\overline{m}$ and $\sigma_m$, where $m = n - n'$ is the net displacement of the walker for parts (a) and (b). This problem inspired an article by Zia and Schmittmann.

Problem 3.71.
Consider the binomial distribution $P_N(n)$ for $N = 16$ and $p = q = 1/2$.

(a) What is the value of $P_N(n)$ at $n = \overline{n} - \sigma_n$?

(b) What is the value of the product $P_N(n = \overline{n})(2\sigma_n)$?

Problem 3.72. Alternative derivation of the Gaussian distribution
On page 137 we evaluated the binomial probability $P_N(n)$ using Stirling's approximation to determine the parameters $A$, $B$, and $\tilde{n}$ in (3.109). Another way to determine these parameters is to approximate the binomial distribution by a Gaussian and require that the zeroth, first, and second moments of the Gaussian and binomial distribution be equal. We write
$$P(n) = A e^{-B(n - \tilde{n})^2/2}, \tag{3.195}$$

where $A$, $B$, and $\tilde{n}$ are the parameters to be determined. We first require that

$$\int_0^{N} P(n)\,dn = 1. \tag{3.196}$$

Because $P(n)$ depends on the difference $n - \tilde{n}$, it is convenient to change the variable of integration in (3.196) to $x = n - \tilde{n}$ and write

$$\int_{-Np}^{N(1-p)} P(x)\,dx = 1, \tag{3.197}$$

where

$$P(x) = A e^{-Bx^2/2}. \tag{3.198}$$

Because we are interested in the limit $N \to \infty$, we can extend the limits in (3.197) to $\pm\infty$:

$$\int_{-\infty}^{\infty} P(x)\,dx = 1. \tag{3.199}$$

(a) The first moment of the Gaussian distribution is

$$\overline{n} = \int_{-\infty}^{\infty} n\,P(n)\,dn, \tag{3.200}$$

where $P(n)$ is given by (3.195). Make a change of variables and show that

$$\overline{n} = \int_{-\infty}^{\infty} (x + \tilde{n})\,P(x)\,dx = \tilde{n}. \tag{3.201}$$

(b) The first moment of the binomial distribution is given by $pN$ according to (3.96). Require the first moments of the binomial and Gaussian distributions to be equal, and determine $\tilde{n}$.

(c) The variance of the binomial distribution is given in (3.99) and is equal to $\overline{(n - \overline{n})^2} = Npq$. The corresponding variance of the Gaussian distribution is given by

$$\overline{(n - \overline{n})^2} = \int_{-\infty}^{\infty} (n - \overline{n})^2\,P(n)\,dn. \tag{3.202}$$

Make the necessary change of variables in (3.202) and do the integrals in (3.199) and (3.202) [see (A.23) and (A.17)] to confirm that the values of $B$ and $A$ are given by (3.115) and (3.117), respectively.

(d) Explain why the third moments of the binomial and Gaussian distribution are not equal.

Problem 3.73. A simple two-dimensional wall
Consider a two-dimensional "wall" constructed from $N$ squares as shown in Figure 3.12. The base row of the wall must be continuous, but higher rows can have gaps. Each column must be continuous and self-supporting with no overhangs. Determine the total number $W_N$ of different $N$-site clusters, that is, the number of possible arrangements of $N$ squares consistent with these rules. Assume that the squares are identical.

Figure 3.12: Example of a wall as explained in Problem 3.73.

Problem 3.74. Heads you win
Two people take turns tossing a coin. The first person to obtain heads is the winner. Find the probabilities of the following events:

(a) The game terminates at the fourth toss;

(b) the first player wins the game;

(c) the second player wins the game.
*Problem 3.75. First-passage time
Suppose that a one-dimensional unbiased random walker starts out at the origin $x = 0$ at $t = 0$ and takes unit length steps at regular intervals. As usual the probability of a step to the right is $p$.

(a) How many steps will it take for the walker to first reach $x = +1$? This quantity, known as the first-passage time, is a random variable because it is different for different realizations of the walk. Let $P_n$ be the probability that $x$ first equals $+1$ after $n$ steps. What is $P_n$ for $n = 1$, 3, 5, and 7?

(b) Write a program to simulate a random walker in one dimension and estimate the number of steps needed to first reach $x = +1$. What is your estimate of the probability that the walker will eventually reach $x = +1$ assuming that $p = 1/2$? What is the mean number of steps needed to reach $x = +1$?
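As a possible starting point for part (b), the following is a minimal sketch under stated assumptions and is not the authors' program. The function names `first_passage_time` and `estimate_first_passage`, the cutoff `max_steps`, the number of walkers, and the seed are all choices made here for illustration. Each walk is stopped after `max_steps` steps, so walkers that have not yet reached $x = +1$ are simply recorded as not having arrived.

```python
import random

def first_passage_time(p, max_steps, rng):
    """Steps needed to first reach x = +1, or None if not reached within max_steps."""
    x = 0
    for step in range(1, max_steps + 1):
        x += 1 if rng.random() < p else -1
        if x == 1:
            return step
    return None

def estimate_first_passage(p=0.5, walkers=10_000, max_steps=1_000, seed=1):
    """Return (fraction of walkers that arrived, estimates of P_n for n = 1, 3, 5, 7)."""
    rng = random.Random(seed)
    times = [first_passage_time(p, max_steps, rng) for _ in range(walkers)]
    arrived = [t for t in times if t is not None]
    # Estimate P_n as the fraction of all walkers whose first arrival is at step n.
    p_n = {n: sum(1 for t in arrived if t == n) / walkers for n in (1, 3, 5, 7)}
    return len(arrived) / walkers, p_n

fraction_arrived, p_n = estimate_first_passage()
print("fraction arrived within cutoff:", fraction_arrived)
for n, value in p_n.items():
    print(f"P_{n} is approximately {value:.4f}")
```

It is worth repeating the run with several values of `max_steps` and comparing both the fraction of walkers that arrive and the sample mean of the arrival times, because either quantity may depend on the cutoff.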
∗ Problem 3.76. Range of validity of the Gaussian distribution
How good is the Gaussian distribution as an approximation to the binomial distribution as a function of $N$? To determine the validity of the Gaussian distribution, consider the next two terms after (3.114) in the power series expansion of $\ln P(n)$:
$\frac{1}{3!}(n - \tilde{n})^3 C + \frac{1}{4!}(n - \tilde{n})^4 D$, (3.203)
where $C = d^3 \ln P(n)/dn^3$ and $D = d^4 \ln P(n)/dn^4$ evaluated at $n = \tilde{n}$.
(a) Show that $|C| < 1/N^2p^2q^2$. What does $C$ equal if $p = q$?
(b) Show that $|D| < 4/N^3p^3q^3$.
(c) Show that the results for $|C|$ and $|D|$ imply that the neglect of terms beyond second order in $(n - \tilde{n})$ is justified if $|n - \tilde{n}| \ll Npq$. Explain why stopping at second order is justified if $Npq \gg 1$.

Problem 3.77. A Lévy flight
A Lévy flight, named after the mathematician Paul Pierre Lévy, is a random walk in which the length $\ell$ of each step is distributed according to a probability distribution of the form $p(\ell) \propto \ell^{-\mu}$, where $1 < \mu < 3$. Is the form of the probability distribution of the displacement of the walker after $N$ steps a Gaussian?

Problem 3.78. Balls and boxes
Suppose there are three boxes each with two balls. The first box has two green balls, the second box has one green and one red ball, and the third box has two red balls. Suppose you choose a box at random and find one green ball. What is the probability that the other ball is green?

Problem 3.79. Telephone numbers
Open a telephone directory to a random page or look at the phone numbers in your cell phone and make a list corresponding to the last digit of the first 100 telephone numbers you see. Find the probability $P(n)$ that the number $n$ appears in your list. Plot $P(n)$ as a function of $n$ and describe its $n$ dependence. Do you expect that $P(n)$ is approximately uniform?
∗ Problem 3.80. Benford’s law or looking for number one
Suppose that you constructed a list of the populations of the largest cities in the world, or a list of the house numbers of everybody you know. Other naturally occurring lists include river lengths, mountain heights, radioactive decay half-lives, the size of the files on your computer, and the first digit of each of the numbers that you find in a newspaper. (The first digit of a number such as 0.00123 is 1.) What is the probability $P(n)$ that the first digit is $n$, where $n = 1, \ldots, 9$? Do you think that $P(n)$ will be the same for all $n$?
It turns out that the form of the probability $P(n)$ is given by
$P(n) = \log_{10}\left(1 + \frac{1}{n}\right)$. (3.204)
The distribution (3.204) is known as Benford’s law and is named after Frank Benford, a physicist who independently discovered it in 1938, although it was discovered previously by the astronomer Simon Newcomb in 1881. The distribution (3.204) implies that for certain data sets, the first digit is distributed in a predictable pattern with a higher percentage of the numbers beginning with the digit 1. What are the numerical values of $P(n)$ for the different values of $n$? Is $P(n)$ normalized?
Accounting data is one of the many types of data that is expected to follow the Benford distribution. It has been found that artificial data sets do not have first digit patterns that follow the Benford distribution. Hence, the more an observed digit pattern deviates from the expected Benford distribution, the more likely the data set is suspect. Tax returns have been checked in this way.
The frequencies of the first digit of 2000 numerical answers to problems given in the back of four physics and mathematics textbooks have been tabulated and found to be distributed in a way consistent with Benford’s law. Benford’s law is also expected to hold for answers to homework problems [15].

Problem 3.81. Faking it
Ask several of your friends to flip a coin 100 times and record the results or pretend to flip a coin and fake the results. Can you tell which of your friends faked the results?
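For Problem 3.81, one statistic that might help distinguish real from invented sequences is the length of the longest run of identical outcomes. The Python sketch below estimates how that length is distributed for 100 flips of a fair coin. The choice of statistic, the function names, the number of trials, and the seed are assumptions made here for illustration; the problem itself does not prescribe a method.

```python
import random


def longest_run(flips):
    """Length of the longest run of identical consecutive outcomes.

    Assumes a non-empty sequence (a string such as "HHTH" or a list).
    """
    best = current = 1
    for prev, curr in zip(flips, flips[1:]):
        current = current + 1 if curr == prev else 1
        best = max(best, current)
    return best


def longest_run_distribution(n_flips=100, trials=10000, seed=1):
    """Estimate the probability of each longest-run length for a fair coin."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(trials):
        flips = [rng.random() < 0.5 for _ in range(n_flips)]
        k = longest_run(flips)
        counts[k] = counts.get(k, 0) + 1
    return {k: c / trials for k, c in sorted(counts.items())}


if __name__ == "__main__":
    for length, prob in longest_run_distribution().items():
        print(length, prob)
```

A recorded sequence can then be passed to `longest_run` as a string of `H` and `T` characters and its longest run compared with the simulated distribution. Other statistics, such as the number of heads or the number of switches between heads and tails, could be checked in the same way.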
∗ ∗ Problem 3.82. Zipf’s law
Suppose that we analyze a text and count the number of times a given word appears. The word with rank $r$ is the $r$th word when the words of the text are listed with decreasing frequency. Make a log-log plot of word frequency $f$ versus word rank $r$. The relation between word rank and word frequency was first stated by George Kingsley Zipf (1902–1950). This relation states that for a given text
$f \sim \frac{1}{r \ln(1.78R)}$, (3.205)
where $R$ is the number of different words. Note the inverse power law behavior of the frequency on the rank. The relation (3.205) is known as Zipf’s law. The top 20 words in an analysis of a 1.6 MB collection of 423 short Time magazine articles (245,412 term occurrences) are given in Table 3.7. Analyze another text and determine if you find a similar relation.

Rank  Word  Count      Rank  Word  Count
 1    the   15861      11    his   1839
 2    of     7239      12    is    1810
 3    to     6331      13    he    1700
 4    a      5878      14    as    1581
 5    and    5614      15    on    1551
 6    in     5294      16    by    1467
 7    that   2507      17    at    1333
 8    for    2228      18    it    1290
 9    was    2149      19    from  1228
10    with   1839      20    but   1138

Table 3.7: Ranking of the top 20 words (see Problem 3.82).

[15] See Huddle (1997) and Hill (1998).
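For Problem 3.82, the rank-frequency analysis might be sketched in Python as follows. This is a minimal sketch under stated assumptions: the tokenization rule (lowercase runs of letters and apostrophes) and all function names are choices made here, and `zipf_prediction` treats the relation (3.205) as an equality for the relative frequency, which the "∼" in the text does not guarantee. The counts in `TABLE_3_7_TOP10` are copied from Table 3.7.

```python
import math
import re
from collections import Counter

# Counts of the ten most frequent words, copied from Table 3.7.
TABLE_3_7_TOP10 = [15861, 7239, 6331, 5878, 5614, 5294, 2507, 2228, 2149, 1839]


def rank_frequency(text):
    """Return (rank, word, count) tuples sorted by decreasing count."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]


def log_log_points(ranked):
    """Return (log10 rank, log10 count) pairs suitable for a log-log plot."""
    return [(math.log10(rank), math.log10(count)) for rank, _, count in ranked]


def zipf_prediction(rank, n_distinct):
    """Relative frequency if (3.205) is read as an equality (an assumption)."""
    return 1.0 / (rank * math.log(1.78 * n_distinct))


if __name__ == "__main__":
    # If f is inversely proportional to r, then rank * count varies slowly with rank.
    for rank, count in enumerate(TABLE_3_7_TOP10, start=1):
        print(rank, count, rank * count)
```

To analyze another text, read it into a string, pass it to `rank_frequency`, and plot the output of `log_log_points`; a straight line with slope near −1 would be consistent with the inverse power law in (3.205).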
∗ Problem 3.83. Time of response to emails
When you receive an email, how long does it take for you to respond to it? If you keep a record of your received and sent mail, you can analyze the distribution of your response times – the number of hours between receiving an email from someone and replying to it.
It turns out that the time it takes people to reply to emails can be described by a power law; that is, the probability $p(\tau)\,d\tau$ that the response is between $\tau$ and $\tau + d\tau$ is $p(\tau) \sim \tau^{-a}$ with $a \approx 1$. Oliveira and Barabási have shown that the response times of Einstein and Darwin to letters can also be described by a power law, but with an exponent $a \approx 3/2$ [16]. This result suggests that there is a universal pattern for human behavior in response to correspondence. What is the implication of a power law response?

Problem 3.84. Pick any card
Three cards are in a hat. One card is white on both sides, the second is white on one side and red on the other, and the third is red on both sides. The dealer shuffles the cards, takes one out and places it flat on the table. The side showing is red. The dealer now says, “Obviously this card is not the white-white card. It must be either the red-white card or the red-red card. I will bet even money that the other side is red.” Is this bet fair?
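The reasoning needed for Problem 3.84 can be checked by listing every equally likely outcome, where an outcome is a choice of card together with a choice of which side faces up. The Python sketch below does this with exact fractions. The function and variable names are hypothetical, and the assumption that each side of the chosen card is equally likely to face up is stated here rather than in the problem.

```python
from fractions import Fraction

# Each card is a pair of side colors, as described in Problem 3.84.
CARDS = [("white", "white"), ("red", "white"), ("red", "red")]


def prob_hidden_matches(cards, seen):
    """P(hidden side has color `seen` | visible side has color `seen`).

    Enumerates all (card, face-up side) outcomes, assumed equally likely.
    """
    shown = 0
    matching = 0
    for card in cards:
        for up in (0, 1):
            visible, hidden = card[up], card[1 - up]
            if visible == seen:
                shown += 1
                if hidden == seen:
                    matching += 1
    return Fraction(matching, shown)


if __name__ == "__main__":
    print(prob_hidden_matches(CARDS, "red"))
```

An even-money bet is fair only if the printed probability equals 1/2. The same function can be applied to other two-sided or two-item setups by changing the list of pairs.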
∗ Problem 3.85. Societal response to rare events (a) Estimate the probability that an asteroid will impact the Earth and cause major damage. Does it make sense for society to take steps now to guard itself against such an occurrence? (b) The likelihood of the breakdown of the levees near New Orleans was well known before its occurrence on August 30, 2005. Discuss the various reasons why the decision was made not to strengthen the levees. Relevant issues include the ability of people to think about the
probability of rare events, and the large amount of money needed to strengthen the levees to withstand such an event.

[16] See Oliveira and Barabási (2005).
∗ Problem 3.86. Science and society
Does capital punishment deter murder? Are vegetarians more likely to have daughters? Does it make sense to talk about a “hot hand” in basketball? Are the digits of π random? See <chance.dartmouth.edu/chancewiki/> and <www.dartmouth.edu/~chance/> to read about interesting issues involving probability and statistics.

Suggestions for further reading
Vinay Ambegaokar, Reasoning About Luck, Cambridge University Press (1996). A book developed for a course for nonscience majors. An excellent introduction to statistical reasoning and its uses in physics.

Ralph Baierlein, Atoms and Information Theory, W. H. Freeman (1971). The author derives the Boltzmann distribution using arguments similar to those used to obtain (3.168).

Arieh Ben-Naim, Entropy Demystified: The Second Law Reduced to Plain Common Sense, World Scientific (2007).

Deborah J. Bennett, Randomness, Harvard University Press (1998).

Peter L. Bernstein, Against the Gods: The Remarkable Story of Risk, John Wiley & Sons (1996). The author is a successful investor and an excellent writer. The book includes an excellent summary of the history of probability.

David S. Betts and Roy E. Turner, Introductory Statistical Mechanics, Addison-Wesley (1992). Section 3.4 is based in part on Chapter 3 of this text.

Jean-Philippe Bouchaud and Marc Potters, Theory of Financial Risks, Cambridge University Press (2000). This book by two physicists is an example of the application of concepts in probability and statistical mechanics to finance. Although the treatment is at the graduate level and assumes some background in finance, the first several chapters are a good read for students who are interested in the overlap of physics, finance, and economics. Also see J. Doyne Farmer, Martin Shubik, and Eric Smith, “Is economics the next physical science?,” Phys. Today 58 (9), 37–42 (2005). A related book on the importance of rare events is by Nassim Nicholas Taleb, The Black Swan: The Impact of the Highly Improbable, Random House (2007).

See <www.compadre.org/stp/> to download a simulation of the Galton board by Wolfgang Christian and Anne Cox. Other simulations related to statistical and thermal physics are also available at this site.

Giulio D’Agostini, “Teaching statistics in the physics curriculum: Unifying and clarifying role of subjective probability,” Am. J. Phys. 67, 1260–1268 (1999). The author, whose main research interest is in particle physics, discusses subjective probability and Bayes’ theorem. Section 3.4 is based in part on this article.

F. N. David, Games, Gods and Gambling: A History of Probability and Statistical Ideas, Dover Publications (1998).

Marta C. González, César A. Hidalgo, and Albert-László Barabási, “Understanding individual human mobility patterns,” Nature 453, 779–782 (2008). The authors studied the trajectories of 100,000 cell phone users over a six-month period and found that human trajectories cannot be simply modeled by a Lévy flight or as an ordinary random walk. Similar studies have been done on animal trajectories. The website <barabasilab.com/> has many examples of the application of probability to diverse systems of interest in statistical physics.

James R. Huddle, “A note on Benford’s law,” Math. Comput. Educ. 31, 66 (1997); T. P. Hill, “The first digit phenomenon,” Am. Sci. 86, 358–363 (1998).

Gene F. Mazenko, Equilibrium Statistical Mechanics, John Wiley & Sons (2000). Sections 1.7 and 1.8 of this graduate level text discuss the functional form of the missing information.

Leonard Mlodinow, The Drunkard’s Walk: How Randomness Rules Our Lives, Vintage Press (2009). A popular book on how the mathematical laws of randomness affect our lives.

Elliott W. Montroll and Michael F. Shlesinger, “On the wonderful world of random walks,” in Studies in Statistical Mechanics, Vol. XI: Nonequilibrium Phenomena II, edited by J. L. Lebowitz and E. W. Montroll, North-Holland (1984). An excellent article on the history of random walks.

Elliott W. Montroll and Wade W. Badger, Introduction to Quantitative Aspects of Social Phenomena, Gordon and Breach (1974). The applications of probability that are discussed include traffic flow, income distributions, floods, and the stock market.

João Gama Oliveira and Albert-László Barabási, “Darwin and Einstein correspondence patterns,” Nature 437, 1251 (2005).

Richard Perline, “Zipf’s law, the central limit theorem, and the random division of the unit interval,” Phys. Rev. E 54, 220–223 (1996).

The outcome of tossing a coin is not really random. See Ivars Peterson, “Heads or tails?,” Science News Online, <www.sciencenews.org/articles/20040228/mathtrek.asp> and Erica Klarreich, “Toss out the toss-up: Bias in heads-or-tails,” Science News 165 (9), 131 (2004), <www.sciencenews.org/articles/20040228/fob2.asp>. Some of the original publications include Joseph Ford, “How random is a coin toss?,” Phys. Today 36 (4), 40–47 (1983); Joseph B. Keller, “The probability of heads,” Am. Math. Monthly 93, 191–197 (1986); and Vladimir Z. Vulovic and Richard E. Prange, “Randomness of a true coin toss,” Phys. Rev. A 33, 576–582 (1986).

S. Redner, “Random multiplicative processes: An elementary tutorial,” Am. J. Phys. 58, 267–273 (1990).

Jason Rosenhouse, The Monty Hall Problem: The Remarkable Story of Math’s Most Contentious Brain Teaser, Oxford University Press (2009).

Charles Ruhla, The Physics of Chance, Oxford University Press (1992).

B. Schmittmann and R. K. P. Zia, “‘Weather’ records: Musings on cold days after a long hot Indian summer,” Am. J. Phys. 67, 1269–1276 (1999). A relatively simple introduction to the statistics of extreme values. Suppose that somebody breaks the record for the 100 m dash. How long do such records typically survive before they are broken?

Kyle Siegrist at the University of Alabama in Huntsville has developed many applets to illustrate concepts in probability and statistics. See <www.math.uah.edu/stat/> and follow the link to Bernoulli processes.

J. Torres, S. Fernández, A. Gamero, and A. Sola, “How do numbers begin? (The first digit law),” Eur. J. Phys. 28, L17–L25 (2007).

G. Troll and P. beim Graben, “Zipf’s law is not a consequence of the central limit theorem,” Phys. Rev. E 57, 1347–1355 (1998).

Hans Christian von Baeyer, Information: The New Language of Science, Harvard University Press (2004). This book raises many profound issues. It is not an easy read even though it is well written.

Charles A. Whitney, Random Processes in Physical Systems: An Introduction to Probability-Based Computer Simulations, John Wiley & Sons (1990).

Michael M. Woolfson, Everyday Probability and Statistics, Imperial College Press (2008). An interesting book for lay people.

A discussion by Eliezer Yudkowsky of the intuitive basis of Bayesian reasoning can be found at <yudkowsky.net/bayes/bayes.html>.

R. K. P. Zia and B. Schmittmann, “Watching a drunkard for 10 nights: A study of distributions of variances,” Am. J. Phys. 71, 859–865 (2003). See Problem 3.70.