

Entropy, Power Laws, and Economics

Tom Carter
Complex Systems Summer School
SFI, 2007
~tom/
Santa Fe
June, 2007

Contents

Mathematics of Information
Some entropy theory
A Maximum Entropy Principle
Application: Economics I
Fit to Real World™
A bit about Power Laws
Application: Economics II

The quotes

Science, wisdom, and counting
Surprise, information, and miracles
Information (and hope)
H (or S) for Entropy

Science, wisdom, and counting

“Science is organized knowledge. Wisdom is organized life.”
- Immanuel Kant

“My own suspicion is that the universe is not only stranger than we suppose, but stranger than we can suppose.”
- John Haldane

“Not everything that can be counted counts, and not everything that counts can be counted.”
- Albert Einstein (1879-1955)

“The laws of probability, so true in general, so fallacious in particular.”
- Edward Gibbon

Surprise, information, and miracles

“The opposite of a correct statement is a false statement. The opposite of a profound truth may well be another profound truth.”
- Niels Bohr (1885-1962)

“I heard someone tried the monkeys-on-typewriters bit trying for the plays of W. Shakespeare, but all they got was the collected works of Francis Bacon.”
- Bill Hirst

“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
- Albert Einstein (1879-1955)

Mathematics of Information

• We would like to develop a usable measure of the information we get from observing the occurrence of an event having probability $p$. Our first reduction will be to ignore any particular features of the event, and only observe whether or not it happened. Thus we will think of an event as the observance of a symbol whose probability of occurring is $p$, and we will define the information in terms of the probability $p$ alone.

The approach we will be taking here is axiomatic: below is a list of the four fundamental axioms we will use. Note that we can apply this axiomatic system in any context in which we have available a set of non-negative real numbers. A specific special case of interest is probabilities (i.e., real numbers between 0 and 1), which motivated the selection of axioms . . .

• We will want our information measure $I(p)$ to have several properties:
1. Information is a non-negative quantity: $I(p) \ge 0$.
2. If an event has probability 1, we get no information from the occurrence of the event: $I(1) = 0$.
3. If two independent events occur (whose joint probability is the product of their individual probabilities), then the information we get from observing the events is the sum of the two informations: $I(p_1 \cdot p_2) = I(p_1) + I(p_2)$. (This is the critical property . . . )
4. We will want our information measure to be a continuous (and, in fact, monotonic) function of the probability (slight changes in probability should result in slight changes in information).

• We can therefore derive the following:
1. $I(p^2) = I(p \cdot p) = I(p) + I(p) = 2\,I(p)$
2. Thus, further, $I(p^n) = n\,I(p)$ (by induction . . . )
3. $I(p) = I((p^{1/m})^m) = m\,I(p^{1/m})$, so $I(p^{1/m}) = \frac{1}{m}\,I(p)$, and thus in general $I(p^{n/m}) = \frac{n}{m}\,I(p)$
4. And thus, by continuity, we get, for $0 < p \le 1$ and $a > 0$ a real number: $I(p^a) = a\,I(p)$

• From this, we can derive the nice property:
\[ I(p) = -\log_b(p) = \log_b(1/p) \]
for some base $b$.

• Summarizing: from the four properties,
1. $I(p) \ge 0$
2. $I(p_1 \cdot p_2) = I(p_1) + I(p_2)$
3. $I(p)$ is monotonic and continuous in $p$
4. $I(1) = 0$
we can derive that
\[ I(p) = \log_b(1/p) = -\log_b(p), \]
for some constant $b > 1$ (a base below 1 would make $I(p)$ negative). The base $b$ determines the units we are using.

We can change the units by changing the base, using the formulas, for $b_1, b_2, x > 0$,
\[ x = b_1^{\log_{b_1}(x)} \]
and therefore
\[ \log_{b_2}(x) = \log_{b_2}\left(b_1^{\log_{b_1}(x)}\right) = (\log_{b_2}(b_1))(\log_{b_1}(x)). \]
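As a minimal sketch of the measure derived above (in Python; the function name `information` and the sample probabilities are illustrative assumptions, not part of the original slides), we can check the additivity property and the change of units numerically:

```python
import math

def information(p, base=2.0):
    """Information I(p) = log_b(1/p) of an event with probability p, 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must satisfy 0 < p <= 1")
    return math.log(1.0 / p, base)

# Additivity for independent events: I(p1 * p2) = I(p1) + I(p2).
p1, p2 = 0.5, 0.25  # illustrative probabilities
joint_bits = information(p1 * p2)
sum_bits = information(p1) + information(p2)

# Change of units: log_e(x) = log_e(2) * log_2(x), so nats = ln(2) * bits.
bits = information(0.125, base=2.0)
nats = information(0.125, base=math.e)
bits_to_nats = math.log(2.0)

print(joint_bits, sum_bits)
print(nats, bits * bits_to_nats)
```

Each printed pair should agree up to floating-point rounding, which is all the base-change formula promises: a different base only rescales the measure by a constant factor.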
• Thus, using different bases for the logarithm results in information measures which are just constant multiples of each other, corresponding with measurements in different units:
1. $\log_2$ units are bits (from 'binary')
2. $\log_3$ units are trits (from 'trinary')
3. $\log_e$ units are nats (from 'natural logarithm') (We'll use $\ln(x)$ for $\log_e(x)$)
4. $\log_{10}$ units are Hartleys, after an early worker in the field.

• Unless we want to emphasize the units, we need not bother to specify the base for the logarithm, and will write $\log(p)$. Typically, we will think in terms of $\log_2(p)$.

• For example, flipping a fair coin once will give us events h and t each with probability 1/2, and thus a single flip of a coin gives us $-\log_2(1/2) = 1$ bit of information (whether it comes up h or t).

Flipping a fair coin $n$ times (or, equivalently, flipping $n$ fair coins) gives us
\[ -\log_2((1/2)^n) = \log_2(2^n) = n \log_2(2) = n \]
bits of information.

We could enumerate a sequence of 25 flips as, for example:
hthhtththhhthttththhhthtt
or, using 1 for h and 0 for t, the 25 bits
1011001011101000101110100.
We thus get the nice fact that $n$ flips of a fair coin gives us $n$ bits of information, and takes $n$ binary digits to specify. That these two are the same reassures us that we have done a good job in our definition of our information measure . . .

Information (and hope)

“In Cyberspace, the First Amendment is a local ordinance.”
- John Perry Barlow

“Groundless hope, like unconditional love, is the only kind worth having.”
- John Perry Barlow

“The most interesting facts are those which can be used several times, those which have a chance of recurring. . . . Which, then, are the facts that have a chance of recurring? In the first place, simple facts.”
- H. Poincaré, 1908

Some entropy theory

• One question we might ask here is, what is the average amount of information we will get (per observation) from observing events from a probability distribution $P$? In particular, what is the expected value of the information?

• Given a discrete probability distribution $P = \{p_1, p_2, \ldots, p_n\}$, with $p_i \ge 0$ and $\sum_{i=1}^{n} p_i = 1$, or a continuous distribution $p(x)$ with $p(x) \ge 0$ and $\int p(x)\,dx = 1$, we can define the expected value of an associated discrete set $F = \{f_1, f_2, \ldots, f_n\}$ or function $F(x)$ by:
\[ \langle F \rangle = \sum_{i=1}^{n} f_i p_i \]
or
\[ \langle F(x) \rangle = \int F(x)\,p(x)\,dx. \]

With these ideas in mind, we can define the entropy of a distribution by:
\[ H(P) = \langle I(p) \rangle. \]
In other words, we can define the entropy of a probability distribution as the expected value of the information of the distribution.

In particular, for a discrete distribution $P = \{p_1, p_2, \ldots, p_n\}$, we have the entropy:
\[ H(P) = \sum_{i=1}^{n} p_i \log\left(\frac{1}{p_i}\right). \]

Several questions probably come to mind at this point:
• What properties does the function $H(P)$ have? For example, does it have a maximum, and if so where?
• Is entropy a reasonable name for this? In particular, the name entropy is already in use in thermodynamics. How are these uses of the term related to each other?
• What can we do with this new tool?
• Let me start with an easy one. Why use the letter H for entropy? What follows is a slight variation of a footnote, p. 105, in the book Spikes by Rieke, et al. :-)

H (or S) for Entropy

“The enthalpy is [often] written U. V is the volume, and Z is the partition function. P and Q are the position and momentum of a particle. R is the gas constant, and of course T is temperature. W is the number of ways of configuring our system (the number of states), and we have to keep X and Y in case we need more variables. Going back to the first half of the alphabet, A, F, and G are all different kinds of free energies (the last named for Gibbs). B is a virial coefficient or a magnetic field. I will be used as a symbol for information; J and L are angular momenta. K is Kelvin, which is the proper unit of T. M is magnetization, and N is a number, possibly Avogadro's, and O is too easily confused with 0. This leaves S . . .”

and H. In Spikes they also eliminate H (e.g., as the Hamiltonian). I, on the other hand, along with Shannon and others, prefer to honor Hartley. Thus, H for entropy . . .

A Maximum Entropy Principle

• Suppose we have a system for which we can measure certain macroscopic characteristics. Suppose further that the system is made up of many microscopic elements, and that the system is free to vary among various states. Then (a generic version of) the Second Law of Thermodynamics says that with probability essentially equal to 1, the system will be observed in states with maximum entropy.

We will then sometimes be able to gain understanding of the system by applying a maximum information entropy principle (MEP), and, using Lagrange multipliers, derive formulae for aspects of the system.

• Suppose we have a set of macroscopic measurable characteristics $f_k$, $k = 1, 2, \ldots, M$ (which we can think of as constraints on the system), which we assume are related to microscopic characteristics via:
\[ \sum_i p_i f_i^{(k)} = f_k. \]
Of course, we also have the constraints:
\[ p_i \ge 0, \quad \text{and} \quad \sum_i p_i = 1. \]
We want to maximize the entropy, $\sum_i p_i \log(1/p_i)$, subject to these constraints. Using Lagrange multipliers $\lambda_k$ (one for each constraint, with $\lambda$ for the normalization), we have the general solution:
\[ p_i = \exp\left(-\lambda - \sum_k \lambda_k f_i^{(k)}\right). \]

If we define $Z$, called the partition function, by
\[ Z(\lambda_1, \ldots, \lambda_M) = \sum_i \exp\left(-\sum_k \lambda_k f_i^{(k)}\right), \]
then we have $e^{\lambda} = Z$, or $\lambda = \ln(Z)$.

Application: Economics I (a Boltzmann Economy)

• Our first example here is a very simple economy. Suppose there is a fixed amount of money ($M$ dollars), and a fixed number of agents ($N$) in the economy. Suppose that during each time step, each agent randomly selects another agent and transfers one dollar to the selected agent. An agent having no money doesn't go into debt. What will the long term (stable) distribution of money be?
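The exchange rule just described is simple enough to simulate directly. The following is a rough sketch under stated assumptions, not a definitive model: the function name, the number of agents, the starting money, the number of steps, and the random seed are all hypothetical choices, and agents are updated one after another within a time step (the description above does not say whether the transfers are simultaneous).

```python
import math
import random

def simulate_economy(n_agents=500, dollars_each=10, n_steps=1000, seed=0):
    """Random one-dollar transfers among agents; returns the final money of each agent."""
    rng = random.Random(seed)
    money = [dollars_each] * n_agents  # assumed start: everyone has the same amount
    for _ in range(n_steps):
        for giver in range(n_agents):
            if money[giver] == 0:
                continue  # an agent with no money does not go into debt
            # Pick a different agent uniformly at random.
            receiver = rng.randrange(n_agents - 1)
            if receiver >= giver:
                receiver += 1
            money[giver] -= 1
            money[receiver] += 1
    return money

money = simulate_economy()
T = sum(money) / len(money)  # average money per agent
below_average = sum(1 for m in money if m < T) / len(money)
print(f"average money per agent: {T:.1f}")
print(f"fraction of agents below the average: {below_average:.2f}")
# For comparison: a continuous exponential density with mean T puts
# a fraction 1 - exp(-1) of its mass below T.
print(f"exponential reference value: {1 - math.exp(-1):.2f}")
```

Total money is conserved and no balance goes negative by construction, so the only thing the run can change is how the fixed total is spread across agents.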
This is not a very realistic economy: there is no growth, only a redistribution of money (by a random process). For the sake of argument, we can imagine that every agent starts with approximately the same amount of money, although in the long run, the starting distribution shouldn't matter.

• For this example, we are interested in looking at the distribution of money in the economy, so we are looking at the probabilities $\{p_i\}$ that an agent has the amount of money $i$. We are hoping to develop a model for the collection $\{p_i\}$.

If we let $n_i$ be the number of agents who have $i$ dollars, we have two constraints:
\[ \sum_i n_i \cdot i = M \quad \text{and} \quad \sum_i n_i = N. \]
Phrased differently (using $p_i = n_i / N$), this says
\[ \sum_i p_i \cdot i = \frac{M}{N} \quad \text{and} \quad \sum_i p_i = 1. \]

• We now apply Lagrange multipliers:
\[ L = \sum_i p_i \ln(1/p_i) - \lambda\left[\sum_i p_i \cdot i - \frac{M}{N}\right] - \mu\left[\sum_i p_i - 1\right], \]
from which we get
\[ \frac{\partial L}{\partial p_i} = -[1 + \ln(p_i)] - \lambda i - \mu = 0. \]
We can solve this for $p_i$:
\[ \ln(p_i) = -\lambda i - (1 + \mu) \]
and so
\[ p_i = e^{-\lambda_0} e^{-\lambda i} \]
(where we have set $1 + \mu \equiv \lambda_0$).

• Putting in constraints, we have
\[ 1 = \sum_i p_i = \sum_i e^{-\lambda_0} e^{-\lambda i} = e^{-\lambda_0} \sum_{i=0}^{M} e^{-\lambda i}, \]
and
\[ \frac{M}{N} = \sum_i p_i \cdot i = \sum_i e^{-\lambda_0} e^{-\lambda i} \cdot i = e^{-\lambda_0} \sum_{i=0}^{M} e^{-\lambda i} \cdot i. \]
We can approximate (for large $M$)
\[ \sum_{i=0}^{M} e^{-\lambda i} \approx \int_0^M e^{-\lambda x}\,dx \approx \frac{1}{\lambda}, \]
and
\[ \sum_{i=0}^{M} e^{-\lambda i} \cdot i \approx \int_0^M x e^{-\lambda x}\,dx \approx \frac{1}{\lambda^2}. \]

From these we have (approximately)
\[ e^{\lambda_0} = \frac{1}{\lambda} \]
and
\[ e^{\lambda_0} \frac{M}{N} = \frac{1}{\lambda^2}. \]
From this, we get
\[ \lambda = \frac{N}{M} = e^{-\lambda_0}, \]
and thus (letting $T = M/N$) we have:
\[ p_i = e^{-\lambda_0} e^{-\lambda i} = \frac{1}{T} e^{-i/T}. \]
This is a Boltzmann-Gibbs distribution, where we can think of $T$ (the average amount of money per agent) as the “temperature,” and thus we have a “Boltzmann economy” . . .

Note: up to the normalization constant, this distribution also solves the functional equation
\[ p(m_1)\,p(m_2) = p(m_1 + m_2). \]

• This example, and related topics, are discussed in Statistical mechanics of money by Adrian Dragulescu and Victor M. Yakovenko, and Statistical mechanics of money: How saving propensity affects its distribution by Anirban Chakraborti and Bikas K. Chakrabarti.

Fit of this model to the Real World™

• How well does this model seem to fit to the Real World? For a fairly large range of individuals, it actually does a decent job. Here is a graphical representation of U.S. census data for 1996:

[Figure: U.S. census data for 1996.]

The black line is $p(x) = \frac{1}{R} e^{-x/R}$.

• However, for the wealthy it doesn't do such a good job. Here are some graphical representations of U.K. and U.S. data for 1996-2001:

[Figures: U.K. and U.S. data for 1996-2001.]

As can be seen on the left graph, the wealth distribution for the U.K. wealthy in 1996 is close to a linear fit in log-log coordinates. Can we modify the model somewhat to capture other characteristics of the data?

• There are a wide variety of important distributions that are observed in data sets. For example:

– Normal (gaussian) distribution:
\[ p(x) \sim \exp\left(-\frac{x^2}{2\sigma^2}\right) \]
Natural explanation: Central limit theorem; sum of random variables (with finite second moment):
\[ X_n = \sum_{i=1}^{n} x_i \]
Many applications:
∗ Maxwell: distribution of velocities of gas particles
∗ IQ
∗ heights of individuals
Distribution is thin tailed: no one is 20 feet tall . . .

– Exponential distribution:
\[ p(x) \sim \exp(-x/x_0) \]
Natural explanation 1: Survival time for constant probability decay.
Natural explanation 2: Equilibrium statistical mechanics (see above: maximum entropy subject to constraint on mean).
Many applications:
∗ Radioactive decay.
∗ Equilibrium statistical mechanics (Boltzmann-Gibbs distribution)
Characteristic scale is $x_0$; distribution is thin tailed.

– Power law (see below):
\[ p(x) \sim x^{-\alpha} \]

A bit about Power Laws

• Various researchers in various fields at various times have observed that many datasets seem to reflect a relationship of the form
\[ p(x) \sim x^{-\alpha} \]
for a fairly broad range of values of $x$. These sorts of data relations are often called power laws, and have been the subject of fairly intensive interest and study.

An early researcher, Vilfredo Pareto, observed in the late 1800s that pretty uniformly across geographical locations, wealth was distributed through the population according to a power law, and hence such distributions are often called Pareto distributions.

A variety of other names have been applied to these distributions:
– Power law distribution
– Pareto's law
– Zipf's law
– Lotka's law
– Bradford's law
– Zeta distribution
– Scale free distribution
– Rank-size rule
My general rule of thumb is that if something has lots of names, it is likely to be important . . .

• These distributions have been observed many places (as noted, for example, in Wikipedia):
– Frequencies of words in longer texts
– The size of human settlements (few cities, many hamlets/villages)
– File size distribution of Internet traffic which uses the TCP protocol (many smaller files, few larger ones)
– Clusters of Bose-Einstein condensate near absolute zero
– The value of oil reserves in oil fields (a few large fields, many small fields)
– The length distribution in jobs assigned to supercomputers (a few large ones, many small ones)
– The standardized price returns on individual stocks
– Size of sand particles
– Number of species per genus (please note the subjectivity involved: the tendency to divide a genus into two or more increases with the number of species in it)
– Areas burnt in forest fires

• There are a variety of important properties of power laws:

– Distribution has fat / heavy tails (extreme events are more likely than one might expect . . . ). Stock market volatility; sizes of storms / floods, etc.

– A power law is a linear relation between logarithms:
\[ p(x) = K x^{-\alpha} \]
\[ \log(p(x)) = -\alpha \log(x) + \log(K) \]

– Power laws are scale invariant:
Sufficient: with $p(x) = K x^{-\alpha}$, the rescaling $x \to cx$ gives
\[ p(x) \to K c^{-\alpha} x^{-\alpha} = c^{-\alpha} p(x). \]
Necessary: Scale invariant is defined as
\[ p(cx) = K(c)\,p(x). \]
Power law is the only solution (0 and 1 are trivial solutions).

• Power laws are actually asymptotic relations. We can't define a power law on $[0, \infty)$:
If $\alpha > 1$, not integrable at 0.
If $\alpha \le 1$, not integrable at $\infty$.
Thus, when we say something is a power law, we mean either within a range, or as $x \to 0$ or as $x \to \infty$.

• Moments: power laws have a threshold above which moments don't exist. For $p(x) \sim x^{-(\alpha+1)}$, when $m \ge \alpha$,
\[ \gamma(m) = \int_a^{\infty} x^m p(x)\,dx = \int_a^{\infty} x^m x^{-(\alpha+1)}\,dx = \infty. \]

• The lack of moments is conserved under aggregation . . . If $\alpha(x)$ is the tail exponent of the random variable $x$ (the value above which moments don't exist), then
\[ \alpha(x + y) = \min(\alpha(x), \alpha(y)) \]
\[ \alpha(xy) = \min(\alpha(x), \alpha(y)) \]
\[ \alpha(x^k) = \alpha(x)/k. \]

• Power laws are generic for heavy / fat tailed distributions. In other words, any “reasonable” distribution with fat tails (i.e., with moments that don't exist) is a power law; for large $x$:
\[ P(X > x) = 1 - \Phi_\alpha(x) = 1 - \exp(-x^{-\alpha}) \approx 1 - (1 - x^{-\alpha}) = x^{-\alpha} \]
(there is some discussion of extreme value distributions that goes here, with discussion of Fréchet, Weibull, and Gumbel distributions, specifically Fréchet distributions (with fat tails) . . . perhaps another place or time).

• Some mechanisms for generating power laws:
– Critical points and deterministic dynamics
– Non-equilibrium statistical mechanics
– Random processes
– Mixtures
– Maximization principles
– Preferential attachment
– Dimensional constraints

• Multiplicative (random) processes generate log-normal distributions, which can look like power law distributions across various ranges of the variable. If $a(t)$ is a random variable:
\[ x(t + 1) = a(t)\,x(t) \]
\[ x(t) = \prod_{i=0}^{t-1} a(i)\; x(0) \]
\[ \log(x(t)) = \sum_{i=0}^{t-1} \log(a(i)) + \log(x(0)) \]
so that $\log(x(t))$ is a sum of random terms, and $x(t)$ tends to a log-normal distribution:
\[ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma x}\, e^{-(\log x - \mu)^2 / 2\sigma^2} \]
\[ \log(f(x)) = -\frac{(\log(x))^2}{2\sigma^2} + \left(\frac{\mu}{\sigma^2} - 1\right) \log(x) + \text{const} \]
In particular, if $\sigma$ is large in comparison with $\log(x)$, then it will look like
\[ \log(f(x)) \approx \log(x^{-1}), \]
which is a one-over-x power law distribution . . .
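The multiplicative process above can be checked numerically. This is only a sketch under stated assumptions: the function name `log_of_product` is hypothetical, and $a(t)$ is taken to be uniform on [0.5, 1.5], a choice made purely for illustration (the slides do not specify a distribution for $a(t)$). The code iterates $x(t+1) = a(t)x(t)$, confirms that $\log(x(t))$ equals the sum of the $\log(a(i))$, and compares the sample mean of $\log(x(t))$ with $t$ times the exact mean of $\log(a)$.

```python
import math
import random

def log_of_product(t_steps, rng, x0=1.0):
    """Iterate x(t+1) = a(t) * x(t); return (log x(t), sum of log a(i))."""
    x = x0
    sum_log_a = 0.0
    for _ in range(t_steps):
        a = rng.uniform(0.5, 1.5)  # assumed distribution for a(t), illustration only
        x *= a
        sum_log_a += math.log(a)
    return math.log(x), sum_log_a

rng = random.Random(1)
t_steps, n_runs = 100, 2000
runs = [log_of_product(t_steps, rng) for _ in range(n_runs)]
log_x = [lx for lx, _ in runs]

# log(x(t)) = sum_i log(a(i)) + log(x(0)), and log(x(0)) = 0 here.
max_gap = max(abs(lx - s) for lx, s in runs)

# Exact mean of log(a) for a uniform on [0.5, 1.5]: the integral of ln(a) over that interval.
mean_log_a = 1.5 * math.log(1.5) - 0.5 * math.log(0.5) - 1.0
sample_mean = sum(log_x) / n_runs
sample_var = sum((lx - sample_mean) ** 2 for lx in log_x) / (n_runs - 1)

print(f"largest |log x(t) - sum of log a(i)|: {max_gap:.2e}")
print(f"sample mean of log x(t): {sample_mean:.3f}, t * E[log a]: {t_steps * mean_log_a:.3f}")
print(f"sample variance of log x(t): {sample_var:.3f}")
```

Because $\log(x(t))$ is a sum of independent terms, the central limit theorem applies to it, which is why $x(t)$ itself comes out approximately log-normal.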
• Other distributions that have power-law appearing regions:
– Mixed multiplicative / additive processes (Kesten processes):
\[ x(t + 1) = a(t)\,x(t) + b(t) \]
– Stable multiplicative random walk with reflecting barrier.
Both of these will look log-normal in their bodies, and like power laws in their tails.

(Various pieces of this section draw from lectures / notes by Doyne Farmer on power laws in financial markets; my thanks to him . . . )

Application: Economics II (a power law)

• Suppose that a (simple) economy is made up of many agents $a$, each with wealth at time $t$ in the amount of $w(a, t)$. (I'll leave it to you to come up with a reasonable definition of “wealth”; of course we will want to make sure that the definition of “wealth” is applied consistently across all the agents.) We can also look at the total wealth in the economy $W(t) = \sum_a w(a, t)$.

For this example, we are interested in looking at the distribution of wealth in the economy, so we will assume there is some collection $\{w_i\}$ of possible values for the wealth an agent can have, and associated probabilities $\{p_i\}$ that an agent has wealth $w_i$. We are hoping to develop a model for the collection $\{p_i\}$.

• In order to apply the maximum entropy principle, we want to look at global (aggregate/macro) observables of the system that reflect (or are made up of) characteristics of (micro) elements of the system.

For this example, we can look at the growth rate of the economy. A reasonable way to think about this is to let $R_i = w_i(t_1)/w_i(t_0)$ and $R = W(t_1)/W(t_0)$ (where $t_0$ and $t_1$ represent time steps of the economy). The growth rate will then be $\ln(R)$. We then have the two constraints on the $p_i$:
\[ \sum_i p_i \ln(R_i) = \ln(R) \quad \text{and} \quad \sum_i p_i = 1. \]

• We now apply Lagrange multipliers:
\[ L = \sum_i p_i \ln(1/p_i) - \lambda\left[\sum_i p_i \ln(R_i) - \ln(R)\right] - \mu\left[\sum_i p_i - 1\right], \]
from which we get
\[ \frac{\partial L}{\partial p_i} = -[1 + \ln(p_i)] - \lambda \ln(R_i) - \mu = 0. \]
We can solve this for $p_i$:
\[ p_i = e^{-\lambda_0} e^{-\lambda \ln(R_i)} = e^{-\lambda_0} R_i^{-\lambda} \]
(where we have set $1 + \mu \equiv \lambda_0$).

Solving, we get $\lambda_0 = \ln(Z(\lambda))$, where
\[ Z(\lambda) \equiv \sum_i R_i^{-\lambda} \]
(the partition function) normalizes the probability distribution to sum to 1. From this we see the power law (for $\lambda > 1$):
\[ p_i = \frac{R_i^{-\lambda}}{Z(\lambda)}. \]

We might actually like to calculate specific values of $\lambda$, so we will do the process again in a continuous version. In this version, we will let $R = w(T)/w(0)$ be the relative wealth at time $T$. We want to find the probability density function $f(R)$, that is:
\[ \max_{\{f\}} H(f) = -\int_1^{\infty} f(R) \ln(f(R))\,dR, \]
subject to
\[ \int_1^{\infty} f(R)\,dR = 1, \]
\[ \int_1^{\infty} f(R) \ln(R)\,dR = C \cdot \ln(R), \]
where $C$ is the average number of transactions per time step, and the right-hand side is treated as a fixed constant.

We need to apply the calculus of variations to maximize over a class of functions.

When we are solving an extremal problem of the form
\[ \int F[x, f(x), f'(x)]\,dx, \]
we work to solve
\[ \frac{\partial F}{\partial f(x)} - \frac{d}{dx}\left(\frac{\partial F}{\partial f'(x)}\right) = 0. \]
Our Lagrangian is of the form
\[ L \equiv -\int_1^{\infty} f(R) \ln(f(R))\,dR - \mu\left[\int_1^{\infty} f(R)\,dR - 1\right] - \lambda\left[\int_1^{\infty} f(R) \ln(R)\,dR - C \cdot \ln(R)\right]. \]
Since this does not depend on $f'(x)$, we look at:
\[ \frac{\partial\,[-f(R) \ln f(R) - \mu(f(R) - 1) - \lambda(f(R) \ln R - C \cdot \ln(R))]}{\partial f(R)} = 0 \]
from which we get
\[ f(R) = e^{-(\lambda_0 + \lambda \ln(R))} = R^{-\lambda} e^{-\lambda_0}, \]
where again $\lambda_0 \equiv 1 + \mu$.

We can use the first constraint to solve for $e^{\lambda_0}$:
\[ e^{\lambda_0} = \int_1^{\infty} R^{-\lambda}\,dR = \left[\frac{R^{1-\lambda}}{1-\lambda}\right]_1^{\infty} = \frac{1}{\lambda - 1}, \]
assuming $\lambda > 1$. We therefore have a power law distribution for wealth of the form:
\[ f(R) = (\lambda - 1) R^{-\lambda}. \]
To solve for $\lambda$, we use:
\[ C \cdot \ln(R) = (\lambda - 1) \int_1^{\infty} R^{-\lambda} \ln(R)\,dR. \]
Using integration by parts, we get
\[ C \cdot \ln(R) = (\lambda - 1)\left[\ln(R) \frac{R^{1-\lambda}}{1-\lambda}\right]_1^{\infty} - (\lambda - 1) \int_1^{\infty} \frac{R^{-\lambda}}{1-\lambda}\,dR = (\lambda - 1)\left[\ln(R) \frac{R^{1-\lambda}}{1-\lambda}\right]_1^{\infty} + \left[\frac{R^{1-\lambda}}{1-\lambda}\right]_1^{\infty}. \]

By L'Hôpital's rule, the first term goes to zero as $R \to \infty$ (and it vanishes at $R = 1$), so we are left with
\[ C \cdot \ln(R) = \left[\frac{R^{1-\lambda}}{1-\lambda}\right]_1^{\infty} = \frac{1}{\lambda - 1}, \]
or, in other terms,
\[ \lambda - 1 = (C \cdot \ln(R))^{-1}. \]
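As a rough numerical check of the continuous result (a sketch only; the function name, the value of $\lambda$, the sample size, and the seed are hypothetical choices for illustration), we can draw samples from $f(R) = (\lambda - 1)R^{-\lambda}$ on $R \ge 1$ by inverting its cumulative distribution $1 - R^{1-\lambda}$, and compare the sample mean of $\ln(R)$ with the value $1/(\lambda - 1)$ obtained from the integration by parts.

```python
import math
import random

def sample_power_law(lam, n, rng):
    """Draw n values from f(R) = (lam - 1) * R**(-lam), R >= 1, via the inverse CDF."""
    # If u is uniform on [0, 1), then R = (1 - u)**(-1 / (lam - 1)) has CDF 1 - R**(1 - lam).
    return [(1.0 - rng.random()) ** (-1.0 / (lam - 1.0)) for _ in range(n)]

rng = random.Random(2)
lam_true = 2.5  # illustrative exponent, must exceed 1
samples = sample_power_law(lam_true, 100_000, rng)

# Sample version of the constraint integral of f(R) * ln(R).
mean_log_R = sum(math.log(r) for r in samples) / len(samples)

# Invert mean_log_R = 1 / (lam - 1) to estimate the exponent from data.
lam_est = 1.0 + 1.0 / mean_log_R

print(f"mean of ln(R): {mean_log_R:.4f}, 1/(lambda - 1): {1.0 / (lam_true - 1.0):.4f}")
print(f"estimated lambda: {lam_est:.3f}")
```

Read in the other direction, the same relation gives a simple way to estimate the exponent from observed growth factors: one plus the reciprocal of the average of their logarithms.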
For much more discussion of this example, see the paper A Statistical Equilibrium Model of Wealth Distribution by Mishael Milakovic, February, 2001, available on the web at:˜ tom/SFICSSS/Wealth/wealth-Milakovic.pdf