Probability and Measurements - Albert Tarantola
Probability and Measurements

Albert Tarantola
Université de Paris, Institut de Physique du Globe
4, place Jussieu; 75005 Paris; France
E-mail: [email protected]

December 3, 2001

© A. Tarantola, 2001.

To the memory of my father. To my mother and my wife.

Preface

In this book I attempt to reach two goals. The first is purely mathematical: to clarify some of the basic concepts of probability theory. The second goal is physical: to clarify the methods to be used when handling the information brought by measurements, in order to understand how accurate the predictions we may wish to make can be.

Probability theory is solidly based on the Kolmogorov axioms, and there is no problem when treating discrete probabilities. But I am very unhappy with the usual way of extending the theory to continuous probability distributions. In this text, I introduce the notion of 'volumetric probability', different from the more usual notion of 'probability density'. I claim that some of the most basic problems of the theory of continuous probability distributions can only be solved within this framework, and that many of the well-known 'paradoxes' of the theory are fundamental misunderstandings, which I try to clarify.

I start the book with an introduction to tensor calculus, because I choose to develop probability theory on metric manifolds. The second chapter deals with probability theory per se. I try to use intrinsic notions everywhere, i.e., I only introduce definitions that make sense irrespective of the particular coordinates being used on the manifold under investigation. The reader will see that this leads to many developments that are at odds with those found in usual texts.

In physical applications one not only needs to define probability distributions over (typically) high-dimensional manifolds.
One also needs to make use of them, and this is achieved by sampling the probability distributions using the 'Monte Carlo' methods described in chapter 3. There is no major discovery exposed in that chapter, but I make the effort to set Monte Carlo methods in the intrinsic point of view mentioned above.

The metric foundation used here allows one to introduce the important notion of 'homogeneous' probability distributions. Contrary to the 'noninformative' probability distributions common in the Bayesian literature, the homogeneity notion is not controversial (provided one has agreed on a given metric over the space of interest).

After a brief chapter that explains what an ideal measuring instrument should be, the book enters the four chapters developing what I see as the four most basic inference problems in physics: (i) problems that are solved using the notion of 'sum of probabilities' (just an elaborate way of 'making histograms'), (ii) problems that are solved using the 'product of probabilities' (an approach that seems to be original), (iii) problems that are solved using 'conditional probabilities' (these include the so-called 'inverse problems'), and (iv) problems that are solved using the 'transport of probabilities' (like the typical [indirect] measurement problem, but solved here by transporting probability distributions, rather than just transporting 'uncertainties').

I am very indebted to my colleagues (Bartolomé Coll, Georges Jobert, Klaus Mosegaard, Miguel Bosch, Guillaume Évrard, John Scales, Christophe Barnes, Frédéric Parrenin and Bernard Valette) for illuminating discussions. I am also grateful to my collaborators at what was the Tomography Group at the Institut de Physique du Globe de Paris.
Paris, December 3, 2001
Albert Tarantola

Contents

1 Introduction to Tensors
  1.1 Chapter's Overview
  1.2 Change of Coordinates (Notations)
    1.2.1 Jacobian Matrices
    1.2.2 Tensors, Capacities and Densities
  1.3 Metric, Volume Density, Metric Bijections
    1.3.1 Metric
    1.3.2 Volume Density
    1.3.3 Bijection Between Densities, Tensors and Capacities
  1.4 The Levi-Civita Tensor
    1.4.1 Orientation of a Coordinate System
    1.4.2 The Fundamental (Levi-Civita) Capacity
    1.4.3 The Fundamental Density
    1.4.4 The Levi-Civita Tensor
    1.4.5 Determinants
  1.5 The Kronecker Tensor
    1.5.1 Kronecker Tensor
    1.5.2 Kronecker Determinants
  1.6 Totally Antisymmetric Tensors
    1.6.1 Totally Antisymmetric Tensors
    1.6.2 Dual Tensors
    1.6.3 Exterior Product of Tensors
    1.6.4 Exterior Derivative of Tensors
  1.7 Integration, Volumes
    1.7.1 The Volume Element
    1.7.2 The Stokes' Theorem
  1.8 Appendixes
    1.8.1 Appendix: Tensors for Beginners
    1.8.2 Appendix: Dimension of Components
    1.8.3 Appendix: The Jacobian in Geographical Coordinates
    1.8.4 Appendix: Kronecker Determinants in 2, 3 and 4 D
    1.8.5 Appendix: Definition of Vectors
    1.8.6 Appendix: Change of Components
    1.8.7 Appendix: Covariant Derivatives
    1.8.8 Appendix: Formulas of Vector Analysis
    1.8.9 Appendix: Metric, Connection, etc. in Usual Coordinate Systems
    1.8.10 Appendix: Gradient, Divergence and Curl in Usual Coordinate Systems
    1.8.11 Appendix: Connection and Derivative in Different Coordinate Systems
    1.8.12 Appendix: Computing in Polar Coordinates
    1.8.13 Appendix: Dual Tensors in 2, 3 and 4 D
    1.8.14 Appendix: Integration in 3D

2 Elements of Probability
  2.1 Volume
    2.1.1 Notion of Volume
    2.1.2 Volume Element
    2.1.3 Volume Density and Capacity Element
    2.1.4 Change of Variables
    2.1.5 Conditional Volume
  2.2 Probability
    2.2.1 Notion of Probability
    2.2.2 Volumetric Probability
    2.2.3 Probability Density
    2.2.4 Volumetric Histograms and Density Histograms
    2.2.5 Change of Variables
  2.3 Sum and Product of Probabilities
    2.3.1 Sum of Probabilities
    2.3.2 Product of Probabilities
  2.4 Conditional Probability
    2.4.1 Notion of Conditional Probability
    2.4.2 Conditional Volumetric Probability
  2.5 Marginal Probability
    2.5.1 Marginal Probability Density
    2.5.2 Marginal Volumetric Probability
    2.5.3 Interpretation of Marginal Volumetric Probability
    2.5.4 Bayes Theorem
    2.5.5 Independent Probability Distributions
  2.6 Transport of Probabilities
  2.7 Central Estimators and Dispersion Estimators
    2.7.1 Introduction
    2.7.2 Center and Radius of a Probability Distribution
  2.8 Appendixes
    2.8.1 Appendix: Conditional Probability Density
    2.8.2 Appendix: Marginal Probability Density
    2.8.3 Appendix: Replacement Gymnastics
    2.8.4 Appendix: The Gaussian Probability Distribution
    2.8.5 Appendix: The Laplacian Probability Distribution
    2.8.6 Appendix: Exponential Distribution
    2.8.7 Appendix: Spherical Gaussian Distribution
    2.8.8 Appendix: Probability Distributions for Tensors
    2.8.9 Appendix: Determinant of a Partitioned Matrix
    2.8.10 Appendix: The Borel 'Paradox'
    2.8.11 Appendix: Axioms for the Sum and the Product
    2.8.12 Appendix: Random Points on the Surface of the Sphere
    2.8.13 Appendix: Histograms for the Volumetric Mass of Rocks

3 Monte Carlo Sampling Methods
  3.1 Introduction
  3.2 Random Walks
  3.3 Modification of Random Walks
  3.4 The Metropolis Rule
  3.5 The Cascaded Metropolis Rule
  3.6 Initiating a Random Walk
  3.7 Designing Primeval Walks
  3.8 Multistep Iterations
  3.9 Choosing Random Directions and Step Lengths
    3.9.1 Choosing Random Directions
    3.9.2 Choosing Step Lengths
  3.10 Appendixes
    3.10.1 Random Walk Design
    3.10.2 The Metropolis Algorithm
    3.10.3 Appendix: Sampling Explicitly Given Probability Densities

4 Homogeneous Probability Distributions
  4.1 Parameters
  4.2 Homogeneous Probability Distributions
  4.3 Appendixes
    4.3.1 Appendix: First Digit of the Fundamental Physical Constants
    4.3.2 Appendix: Homogeneous Probability for Elastic Parameters
    4.3.3 Appendix: Homogeneous Distribution of Second Rank Tensors

5 Basic Measurements
  5.1 Terminology
  5.2 Old Text: Measuring Physical Parameters
  5.3 From ISO
    5.3.1 Proposed Vocabulary to be Used in Metrology
    5.3.2 Some Basic Concepts
  5.4 The Ideal Output of a Measuring Instrument
  5.5 Output as Conditional Probability Density
  5.6 A Little Bit of Theory
  5.7 Example: Instrument Specification
  5.8 Measurements and Experimental Uncertainties
  5.9 Appendixes
    5.9.1 Appendix: Operational Definitions Cannot be Infinitely Accurate
    5.9.2 Appendix: The International System of Units (SI)

6 Inference Problems of the First Kind (Sum of Probabilities)
  6.1 Experimental Histograms
  6.2 Sampling a Sum
  6.3 Further Work to be Done

7 Inference Problems of the Second Kind (Product of Probabilities)
  7.1 The 'Shipwrecked Person' Problem
  7.2 Physical Laws as Probabilistic Correlations
    7.2.1 Physical Laws
    7.2.2 Example: Realistic 'Uncertainty Bars' Around a Functional Relation
    7.2.3 Inverse Problems

8 Inference Problems of the Third Kind (Conditional Probabilities)
  8.1 Adjusting Measurements to a Physical Theory
  8.2 Inverse Problems
    8.2.1 Model Parameters and Observable Parameters
    8.2.2 A Priori Information on Model Parameters
    8.2.3 Measurements and Experimental Uncertainties
    8.2.4 Joint 'Prior' Probability Distribution in the (M, D) Space
    8.2.5 Physical Laws
    8.2.6 Inverse Problems
  8.3 Appendixes
    8.3.1 Appendix: Short Bibliographical Review
    8.3.2 Appendix: Example of Ideal (Although Complex) Geophysical Inverse Problem
    8.3.3 Appendix: Probabilistic Estimation of Earthquake Locations
    8.3.4 Appendix: Functional Inverse Problems
    8.3.5 Appendix: Nonlinear Inversion of Waveforms (by Charara & Barnes)
    8.3.6 Appendix: Using Monte Carlo Methods
    8.3.7 Appendix: Using Optimization Methods

9 Inference Problems of the Fourth Kind (Transport of Probabilities)
  9.1 Measure of Physical Quantities
    9.1.1 Example: Measure of Poisson's Ratio
  9.2 Prediction of Observations
  9.3 Appendixes
    9.3.1 Appendix: Mass Calibration

Bibliography

Index

Chapter 1
Introduction to Tensors

[Note: This is an old introduction, to be updated!]

The first part of this book recalls some of the mathematical tools developed to describe the geometric properties of a space. By "geometric properties" one understands those properties that Pythagoras (6th century B.C.) or Euclid (3rd century B.C.) were interested in. The only major conceptual progress since those times has been the recognition that physical space may not be Euclidean, but may have curvature and torsion, and that the behaviour of clocks depends on their displacements in space. Still, these representations of space accept the notion of continuity (or, equivalently, of differentiability). New theories are being developed that drop that condition (e.g., Nottale, 1993); they will not be examined here.

A mathematical structure can describe very different physical phenomena. For instance, the structure "3-D vector space" may describe the combination of forces applied to a particle, as well as the combination of colors. The same holds for the mathematical structure "differential manifold".
It may describe the 3-D physical space, any 2-D surface, or, more importantly, the 4-dimensional space-time brought into physics by Minkowski and Einstein. The same theorem, when applied to the physical 3-D space, will have a geometrical interpretation (stricto sensu), while when applied to the 4-D space-time it will have a dynamical interpretation.

The aim of this first chapter is to introduce the fundamental concepts necessary to describe geometrical properties: those of tensor calculus. Many books on tensor calculus exist. Why, then, this chapter? Essentially because no uniform system of notations exists (indices at different places, different signs . . . ). It is not possible to start any serious work without first fixing the notations. This chapter does not aim to give a complete discussion of tensor calculus. Among the many books that do, the best are (of course) in French, and Brillouin (1960) is the best among them. Many other books contain introductory discussions of tensor calculus; Weinberg (1972) is particularly lucid. I do not pretend to give a complete set of demonstrations, but to give a complete description of interesting properties, some of which are not easily found elsewhere.

Perhaps original is the notation proposed here to distinguish between densities and capacities. While the trick of using indices in upper or lower position to distinguish between tensors and forms (or, in metric spaces, to distinguish between "contravariant" and "covariant" components) makes formulas intuitive, I propose to use a bar (in upper or lower position) to distinguish between densities (like a probability density) and capacities (like a volume element), this also leading to intuitive results. In particular, the bijection existing between these objects in metric spaces becomes as "natural" as the one just mentioned between contravariant and covariant components.

1.1 Chapter's Overview

[Note: This is an old introduction, to be updated!]
A vector at a point of a space can intuitively be imagined as an "arrow". As soon as we can introduce vectors, we can introduce other objects, the forms. A form at a point of a space can intuitively be imagined as a series of parallel planes . . . At any point of a space we may have tensors, of which the vectors of elementary texts are a particular case. Those tensors may describe the properties of the space itself (metric, curvature, torsion . . . ) or the properties of something that the space "contains", like the stress at a point of a continuous medium. If the space under consideration has a metric (i.e., if the notion of distance between two points makes sense), only tensors have to be considered. If there is no metric, then we have to consider tensors and forms simultaneously.

It is well known that under a transformation of coordinates, the value of a probability density $f$ at any point of the space is multiplied by 'the Jacobian' of the transformation. In fact, a probability density is a scalar field that has well-defined tensor properties. This suggests introducing two different notions where sometimes only one is found: for instance, in addition to the notion of mass density, $\overline{\rho}$, we will also consider the notion of volumetric mass, $\rho$, identical to the former only in Cartesian coordinates. If $\overline{\rho}(x)$ is a mass density and $v^i(x)$ a true vector, like a velocity, their product $\overline{p}^i(x) = \overline{\rho}(x)\, v^i(x)$ will not transform like a true vector: there will be an extra multiplication by the Jacobian. So $\overline{p}^i(x)$ is a density too (of linear momentum). In addition to tensors and densities, the concept of "capacity" will be introduced. Under a transformation of coordinates, a capacity is divided by the Jacobian of the transformation. An example is the capacity element $\underline{dV} = dx^1\, dx^2 \cdots$, not to be confused with the volume element $dV$. The product of a capacity by a density gives a true scalar, as in $dM = \overline{\rho}\; \underline{dV}$.
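The invariance of the mass element can be checked numerically. The following sketch is a hypothetical illustration, not from the book: it takes a small cell in 2-D polar coordinates, computes $dM$ there (where the mass density picks up the Jacobian factor $r$ and the capacity element is $dr\,d\varphi$), and compares it with the mass obtained by mapping the same cell to Cartesian coordinates and measuring its area directly.

```python
import numpy as np

# Hypothetical numerical illustration (not from the book): the product of a
# density by a capacity, dM = rho-bar * dV-capacity, is a true scalar.
# In polar coordinates the mass density is rho-bar = rho * r (it is multiplied
# by the Jacobian), while the capacity element is just dr * dphi.

rho = 3.0                      # uniform volumetric mass (a true scalar)
r, phi = 2.0, 0.5              # corner of a small polar cell
dr, dphi = 1e-4, 1e-4          # capacity element dr dphi

# polar computation: (density) * (capacity element)
dM_polar = (rho * r) * (dr * dphi)

# Cartesian computation: map the cell's corners and measure its area directly
def to_cartesian(rr, pp):
    return np.array([rr * np.cos(pp), rr * np.sin(pp)])

corners = [to_cartesian(rr, pp) for rr, pp in
           [(r, phi), (r + dr, phi), (r + dr, phi + dphi), (r, phi + dphi)]]
xs = np.array([c[0] for c in corners])
ys = np.array([c[1] for c in corners])
# shoelace formula for the (tiny, nearly flat) quadrilateral
area = 0.5 * abs(np.dot(xs, np.roll(ys, -1)) - np.dot(ys, np.roll(xs, -1)))

dM_cartesian = rho * area      # in Cartesian coordinates rho-bar equals rho
assert np.isclose(dM_polar, dM_cartesian, rtol=1e-3)
print(dM_polar)                # ~6e-08, the same in both coordinate systems
```

The agreement holds to first order in the cell size: neither computation needs to know which coordinate system the other used.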
It is well known that if there is a metric we can define a bijection between forms and vectors (we can "raise and lower indices") through $V_i = g_{ij}\, V^j$. The square root of the determinant of $\{g_{ij}\}$ will be denoted $\overline{g}$, and we will see that it defines a natural bijection between capacities, tensors, and densities, as in $\overline{p}^i = \overline{g}\; p^i$; so, in addition to the rules concerning the indices, we will have rules concerning the "bars".

Without a clear understanding of the concepts of densities and capacities, some properties remain obscure. We can, for instance, easily introduce a Levi-Civita capacity $\underline{\varepsilon}_{ijk\ldots}$, or a Levi-Civita density (the components of both take only the values −1, +1 or 0). A Levi-Civita pure tensor can be defined, but it does not have that simple property. The lack of a clear understanding of the need to work simultaneously with densities, pure tensors, and capacities forces some authors to juggle with "pseudo-things", like the pseudo-vector corresponding to the vector product of two vectors, or to the curl of a vector field.

Many of the properties of tensor spaces are not dependent on the fact that the space may have a metric (i.e., a notion of distance). We will only assume that we have a metric when the property to be demonstrated requires it. In particular, the definition of the "covariant" derivative, in the next chapter, will not depend on that assumption. Also, the dimension of the differentiable manifold (i.e., space) under consideration is arbitrary (but finite).

We will use Latin indices $\{i, j, k, \ldots\}$ to denote the components of tensors. In the second part of the book, as we will specifically deal with the physical space and space-time, the Latin indices $\{i, j, k, \ldots\}$ will be reserved for the 3-D physical space, while the Greek indices $\{\alpha, \beta, \gamma, \ldots\}$ will be reserved for the 4-D space-time.
1.2 Change of Coordinates (Notations)

1.2.1 Jacobian Matrices

Consider a change of coordinates, passing from the coordinate system $x = \{x^i\} = \{x^1, \ldots, x^n\}$ to another coordinate system $y = \{y^i\} = \{y^1, \ldots, y^n\}$. One may write the coordinate transformation using either of the two equivalent functions

$$y = y(x) \qquad ; \qquad x = x(y) \; , \tag{1.1}$$

this being, of course, a shorthand notation for $y^i = y^i(x^1, \ldots, x^n)$, $(i = 1, \ldots, n)$, and $x^i = x^i(y^1, \ldots, y^n)$, $(i = 1, \ldots, n)$. We shall need the two sets of partial derivatives

$$Y^i{}_j = \frac{\partial y^i}{\partial x^j} \qquad ; \qquad X^i{}_j = \frac{\partial x^i}{\partial y^j} \; . \tag{1.2}$$

One has

$$Y^i{}_k\, X^k{}_j = X^i{}_k\, Y^k{}_j = \delta^i{}_j \; . \tag{1.3}$$

To simplify language and notations, it is useful to introduce matrices of partial derivatives, arranging the elements $X^i{}_j$ and $Y^i{}_j$ as follows,

$$\mathbf{X} = \begin{pmatrix} X^1{}_1 & X^1{}_2 & X^1{}_3 & \cdots \\ X^2{}_1 & X^2{}_2 & X^2{}_3 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} \qquad ; \qquad \mathbf{Y} = \begin{pmatrix} Y^1{}_1 & Y^1{}_2 & Y^1{}_3 & \cdots \\ Y^2{}_1 & Y^2{}_2 & Y^2{}_3 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} \; . \tag{1.4}$$

Then equations 1.3 just tell that the matrices $\mathbf{X}$ and $\mathbf{Y}$ are mutually inverse:

$$\mathbf{Y}\,\mathbf{X} = \mathbf{X}\,\mathbf{Y} = \mathbf{I} \; . \tag{1.5}$$

The two matrices $\mathbf{X}$ and $\mathbf{Y}$ are called Jacobian matrices. As the matrix $\mathbf{Y}$ is obtained by taking derivatives of the variables $y^i$ with respect to the variables $x^i$, one obtains the matrix $\{Y^i{}_j\}$ as a function of the variables $\{x^i\}$, so we can write $\mathbf{Y}(x)$ rather than just $\mathbf{Y}$. The reciprocal argument tells that we can write $\mathbf{X}(y)$ rather than just $\mathbf{X}$. We shall later use this to make some notations more explicit. Finally, the Jacobian determinants of the transformation are the determinants¹ of the two Jacobian matrices:

$$Y = \det \mathbf{Y} \qquad ; \qquad X = \det \mathbf{X} \; . \tag{1.6}$$

¹ Explicitly, $Y = \det \mathbf{Y} = \frac{1}{n!}\, \varepsilon_{ijk\ldots}\, Y^i{}_p\, Y^j{}_q\, Y^k{}_r \cdots \varepsilon^{pqr\ldots}$, and $X = \det \mathbf{X} = \frac{1}{n!}\, \varepsilon_{ijk\ldots}\, X^i{}_p\, X^j{}_q\, X^k{}_r \cdots \varepsilon^{pqr\ldots}$, where the Levi-Civita "symbols" $\varepsilon_{ijk\ldots}$ take the value +1 if $\{i, j, k, \ldots\}$ is an even permutation of $\{1, 2, 3, \ldots\}$, the value −1 if $\{i, j, k, \ldots\}$ is an odd permutation of $\{1, 2, 3, \ldots\}$, and the value 0 if some indices are identical.
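The mutual-inverse relation between the two Jacobian matrices is easy to check numerically. The following sketch is a hypothetical illustration, not from the book, using the familiar change between Cartesian coordinates $(x^1, x^2)$ and polar coordinates $(r, \varphi)$ in 2-D.

```python
import numpy as np

# Hypothetical illustration (not from the book): for the change between
# Cartesian x = (x1, x2) and polar y = (r, phi), the Jacobian matrices
# Y^i_j = dy^i/dx^j and X^i_j = dx^i/dy^j, evaluated at the same point,
# are mutually inverse (equations 1.3 / 1.5), and X = det X = 1/Y.

def X_matrix(r, phi):
    # X^i_j = d x^i / d y^j, with x1 = r cos(phi), x2 = r sin(phi)
    return np.array([[np.cos(phi), -r * np.sin(phi)],
                     [np.sin(phi),  r * np.cos(phi)]])

def Y_matrix(r, phi):
    # Y^i_j = d y^i / d x^j, from r = sqrt(x1^2 + x2^2), phi = atan2(x2, x1)
    x1, x2 = r * np.cos(phi), r * np.sin(phi)
    return np.array([[x1 / r,       x2 / r],
                     [-x2 / r**2,   x1 / r**2]])

r, phi = 2.0, 0.7
X, Y = X_matrix(r, phi), Y_matrix(r, phi)
assert np.allclose(Y @ X, np.eye(2))   # Y X = I
assert np.allclose(X @ Y, np.eye(2))   # X Y = I
assert np.isclose(np.linalg.det(X) * np.linalg.det(Y), 1.0)  # X = 1/Y
print(np.linalg.det(X))                # ~2.0, i.e. the familiar Jacobian r
```

The determinant $\det \mathbf{X} = r$ is the factor that appears in the polar area element, anticipating the discussion of densities and capacities below.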
The Levi-Civita tensors will be introduced in more detail in section 1.4.

1.2.2 Tensors, Capacities and Densities

Consider an n-dimensional manifold, and let P be a point of it. Also consider a tensor T at point P , and let T_x^{ij...}_{kl...} be the components of T on the local natural basis associated to some coordinates x = {x^1 , . . . , x^n} . On a change of coordinates from x into y = {y^1 , . . . , y^n} (and the corresponding change of local natural basis) the components of T become T_y^{ij...}_{kl...} . It is well known that the components are related through

T_y^{pq...}_{rs...} = (∂y^p/∂x^i)(∂y^q/∂x^j) ··· (∂x^k/∂y^r)(∂x^l/∂y^s) ··· T_x^{ij...}_{kl...} , (1.7)

or, using the notations introduced above,

T_y^{pq...}_{rs...} = Y^p_i Y^q_j ··· X^k_r X^l_s ··· T_x^{ij...}_{kl...} . (1.8)

In particular, for totally contravariant and totally covariant tensors,

T_y^{kl...} = Y^k_i Y^l_j ··· T_x^{ij...} ; T_y_{kl...} = X^i_k X^j_l ··· T_x_{ij...} . (1.9)

In addition to actual tensors, we shall encounter other objects that 'have indices' too, and that transform in a slightly different way: densities and capacities (see for instance Weinberg [1972] and Winogradzki [1979]). Rather than giving a general exposition of the properties of densities and capacities, let us anticipate that we shall only find totally contravariant densities and totally covariant capacities (the most notable example being the Levi-Civita capacity, to be introduced below). From now on, in all this text,

• a density is denoted with an overline, as in ā ;
• a capacity is denoted with an underline, as in b̲ .

It is time now to give what we can take as defining properties. Under the considered change of coordinates, a totally contravariant density ā changes components following the law

ā_y^{kl...} = (1/Y) Y^k_i Y^l_j ··· ā_x^{ij...} , (1.10)

or, equivalently, ā_y^{kl...} = X Y^k_i Y^l_j ··· ā_x^{ij...} . Here X = det X and Y = det Y are the Jacobian determinants introduced in equation 1.6.
This rule for the change of components of a totally contravariant density is the same as that for a totally contravariant tensor (equation at left in 1.9), except that there is an extra factor, the Jacobian determinant X = 1/Y . Similarly, a totally covariant capacity b̲ changes components following the law

b̲_y_{kl...} = (1/X) X^i_k X^j_l ··· b̲_x_{ij...} , (1.11)

or, equivalently, b̲_y_{kl...} = Y X^i_k X^j_l ··· b̲_x_{ij...} . Again, this rule for the change of components of a totally covariant capacity is the same as that for a totally covariant tensor (equation at right in 1.9), except that there is an extra factor, the Jacobian determinant Y = 1/X .

The number of terms in equations 1.10 and 1.11 depends on the 'variance' of the objects considered (i.e., on the number of indices they have). We shall find, in particular, scalar densities and scalar capacities, which do not have any index. The natural extension of equations 1.10 and 1.11 is, obviously,

ā_y = X ā_x = (1/Y) ā_x (1.12)

for a scalar density, and

b̲_y = Y b̲_x = (1/X) b̲_x (1.13)

for a scalar capacity. Explicitly, these equations can be written, using y as variable,

ā_y(y) = X(y) ā_x(x(y)) ; b̲_y(y) = (1/X(y)) b̲_x(x(y)) , (1.14)

or, equivalently, using x as variable,

ā_y(y(x)) = (1/Y(x)) ā_x(x) ; b̲_y(y(x)) = Y(x) b̲_x(x) . (1.15)

1.3 Metric, Volume Density, Metric Bijections

1.3.1 Metric

A manifold is called a metric manifold if there is a definition of distance between points, such that the distance ds between the point of coordinates x = {x^i} and the point of coordinates x + dx = {x^i + dx^i} can be expressed as[2]

ds² = (dx)² = g_ij(x) dx^i dx^j , (1.16)

i.e., if the notion of distance is 'of the L2 type'[3]. The matrix whose entries are g_ij is the metric matrix, and an important result of differential geometry and integration theory is that the volume density, ḡ(x) , equals the square root of the determinant of the metric:

ḡ(x) = √det g(x) . (1.17)

[2] This is a property that must be valid for any coordinate system that can be chosen over the space.
[3] As a counterexample, the distance defined as ds = |dx| + |dy| is not of the L2 type (it is L1).

Example 1.1 In the Euclidean 3D space, using geographical coordinates (see example ??) the distance element is ds² = dr² + r² cos²ϑ dϕ² + r² dϑ² , from where it follows that the metric matrix is

( g_rr g_rϕ g_rϑ )   ( 1   0          0  )
( g_ϕr g_ϕϕ g_ϕϑ ) = ( 0   r² cos²ϑ  0  )   (1.18)
( g_ϑr g_ϑϕ g_ϑϑ )   ( 0   0          r² )

The volume density equals the square root of the metric determinant, ḡ(r, ϕ, ϑ) = √det g(r, ϕ, ϑ) = r² cos ϑ . [End of example.]

Note: define here the contravariant components of the metric through

g^ij g_jk = δ^i_k . (1.19)

Using equations 1.9, we see that the covariant and contravariant components of the metric change according to

g_y_{kl} = X^i_k X^j_l g_x_{ij} and g_y^{kl} = Y^k_i Y^l_j g_x^{ij} . (1.20)

In section 1.2, we introduced the matrices of partial derivatives. It is useful to also introduce two metric matrices, with respectively the covariant and the contravariant components of the metric:

g = ( g_11 g_12 g_13 ··· ; g_21 g_22 g_23 ··· ; ... ) ; g⁻¹ = ( g^11 g^12 g^13 ··· ; g^21 g^22 g^23 ··· ; ... ) , (1.21)

the notation g⁻¹ for the second matrix being justified by the definition 1.19, which now reads

g⁻¹ g = I . (1.22)

In matrix notation, the change of the metric matrix under a change of variables, as given by the two equations 1.20, is written

g_y = Xᵗ g_x X ; g_y⁻¹ = Y g_x⁻¹ Yᵗ . (1.23)

1.3.2 Volume Density

[Note: The text that follows has to be simplified.] We have seen that the metric can be used to define a natural bijection between forms and vectors. Let us now see that it can also be used to define a natural bijection between tensors, densities, and capacities. Let us denote by ḡ the square root of the determinant of the metric,

ḡ = √det g = √( (1/n!) ε̄^ijk... ε̄^pqr... g_ip g_jq g_kr . . . ) . (1.24)

[Note: Explain here that this is a density (in fact, the fundamental density)]. In (Comment: where?)
we demonstrate that we have

∂_i ḡ = ḡ Γ_is^s . (1.25)

Using expression (Comment: which one?) for the (covariant) derivative of a scalar density, this simply gives

∇_i ḡ = ∂_i ḡ − ḡ Γ_is^s = 0 , (1.26)

which is consistent with the fact that

∇_i g_jk = 0 . (1.27)

Note: define here the fundamental capacity

g̲ = 1/ḡ , (1.28)

and say that it is a capacity (obvious).

1.3.3 Bijection Between Densities, Tensors and Capacities

Using the scalar density ḡ we can associate tensor densities, pure tensors, and tensor capacities. Using the same letter to designate the objects related through this natural bijection, we will write expressions like

ρ̄ = ḡ ρ ; V̄^i = ḡ V^i or T^{ij...}_{kl...} = ḡ T̲^{ij...}_{kl...} . (1.29)

So, if g_ij and g^ij can be used to "lower and raise indices", ḡ and g̲ can be used to "put and remove bars". Comment: say somewhere that ḡ is the density of volumetric content, as the volume element of a metric space is given by

dV = ḡ dτ̲ , (1.30)

where dτ̲ is the capacity element defined in (Comment: where?), and which, when we take an element along the coordinate lines, equals dx¹ ∧ dx² ∧ dx³ . . . . Comment: Give somewhere the formula ∂_i ḡ = ḡ Γ_is^s . It can be justified by the fact that, for any density s̄ , ∇_k s̄ = ∂_k s̄ − Γ_ks^s s̄ , and the result follows by using s̄ = ḡ and remembering that ∇_k ḡ = 0 .

1.4 The Levi-Civita Tensor

1.4.1 Orientation of a Coordinate System

The Jacobian determinants associated to a change of variables x ↔ y have been defined in section 1.2. As their product must equal +1, they must be both positive or both negative. Two different coordinate systems x = {x¹, x², . . . , xⁿ} and y = {y¹, y², . . . , yⁿ} are said to have the 'same orientation' (at a given point) if the Jacobian determinants of the transformation are positive. If they are negative, it is said that the two coordinate systems have 'opposite orientation'. Precisely, the orientation of a coordinate system is a quantity η that may take the value +1 or the value −1 .
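The behaviour of the volume density ḡ = √det g as a scalar density (equations 1.12 and 1.17) can be verified in the simplest non-trivial case, the Euclidean plane in polar coordinates. The following sympy sketch is our own illustration, not part of the original text:

```python
import sympy as sp

r, phi = sp.symbols('r phi', positive=True)

# Metric of the Euclidean plane in polar coordinates:
# ds^2 = dr^2 + r^2 dphi^2  (the 2-D analogue of equation 1.16)
g = sp.Matrix([[1, 0], [0, r**2]])

# Volume density = square root of the metric determinant (equation 1.17)
g_bar = sp.sqrt(g.det())
assert sp.simplify(g_bar - r) == 0

# Scalar-density rule (1.12): starting from Cartesian coordinates, where
# the volume density is 1, the polar density must be
# 1 * det( d(x, y)/d(r, phi) ) = r, i.e. the Jacobian determinant X
X = sp.Matrix([[sp.cos(phi), -r * sp.sin(phi)],
               [sp.sin(phi),  r * sp.cos(phi)]])
assert sp.simplify(X.det() - g_bar) == 0
```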
The orientation η of any coordinate system is then unambiguously defined once a definite sign of η is assigned to one particular coordinate system.

Example 1.2 In the Euclidean 3D space, a positive orientation is assigned to a Cartesian coordinate system {x, y, z} when the positive sense of the z axis is obtained from the positive senses of the x axis and the y axis following the screwdriver rule. Another Cartesian coordinate system {u, v, w} defined as u = y , v = x , w = z would then have a negative orientation. A system of three spherical coordinates, if taken in their usual order {r, θ, ϕ} , then also has a positive orientation, but when changing the order of two coordinates, as in {r, ϕ, θ} , the orientation of the coordinate system is negative. For a system of geographical coordinates, the reverse is true: while {r, ϕ, ϑ} is a positively oriented system, {r, ϑ, ϕ} is negatively oriented. [End of example.]

1.4.2 The Fundamental (Levi-Civita) Capacity

The Levi-Civita capacity can be defined by the condition

ε̲_ijk... = { +η if ijk . . . is an even permutation of 12 . . . n ; 0 if some indices are identical ; −η if ijk . . . is an odd permutation of 12 . . . n } , (1.31)

where η is the orientation of the coordinate system, as defined in section 1.4.1. It can be shown [note: give here a reference or the demonstration] that the object so defined actually is a capacity, i.e., that in a change of coordinates, when it is imposed that the components of this 'object' change according to equation 1.11, the defining property 1.31 is preserved.

1.4.3 The Fundamental Density

Let g be the metric tensor of the manifold. For any system of coordinates, we define the quantity ḡ , called the volume density (in the given coordinates), as

ḡ = η √det g , (1.32)

where η is the orientation of the coordinate system, as defined in section 1.4.1.
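The defining property 1.31 (taking η = +1) and the determinant formula of the footnote to equation 1.6 can be checked numerically. The helper function `levi_civita` below is our own construction, written with numpy:

```python
import itertools
import numpy as np

def levi_civita(n):
    """Levi-Civita symbol in n dimensions: +1 / -1 on even / odd
    permutations of (0, ..., n-1), 0 otherwise (equation 1.31 with
    orientation eta = +1)."""
    eps = np.zeros((n,) * n, dtype=int)
    for perm in itertools.permutations(range(n)):
        # parity obtained by counting inversions
        inv = sum(1 for i in range(n) for j in range(i + 1, n)
                  if perm[i] > perm[j])
        eps[perm] = -1 if inv % 2 else 1
    return eps

eps3 = levi_civita(3)
A = np.random.default_rng(0).normal(size=(3, 3))

# det A = (1/n!) eps_ijk eps_pqr A_ip A_jq A_kr  (footnote to eq. 1.6)
det = np.einsum('ijk,pqr,ip,jq,kr->', eps3, eps3, A, A, A) / 6
assert np.isclose(det, np.linalg.det(A))
```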
It can be shown [note: give here a reference or the demonstration] that the object so defined actually is a scalar density, i.e., that in a change of coordinates this quantity changes according to equation 1.12, so that the property 1.32 is preserved.

1.4.4 The Levi-Civita Tensor

Then, the Levi-Civita tensor can be defined as[4]

ε_{ij...k} = ḡ ε̲_{ij...k} , (1.33)

i.e., explicitly,

ε_{ijk...} = { +√det g if ijk . . . is an even permutation of 12 . . . n ; 0 if some indices are identical ; −√det g if ijk . . . is an odd permutation of 12 . . . n } . (1.34)

It can be shown [note: give here a reference or the demonstration] that the object so defined actually is a tensor, i.e., that in a change of coordinates, when it is imposed that the components of this 'object' change according to equation 1.9, the property 1.34 is preserved.

1.4.5 Determinants

The Levi-Civita tensors can be used to define determinants. For instance, the determinants of the tensors Q_ij , R_i^j , S^i_j , and T^ij are defined by

Q = (1/n!) ε^{ijk...} ε^{mnr...} Q_im Q_jn Q_kr . . . , (1.35)

R = (1/n!) ε^{ijk...} ε_{mnr...} R_i^m R_j^n R_k^r . . . = (1/n!) ε̄^{ijk...} ε̲_{mnr...} R_i^m R_j^n R_k^r . . . , (1.36)

S = (1/n!) ε_{ijk...} ε^{mnr...} S^i_m S^j_n S^k_r . . . = (1/n!) ε̲_{ijk...} ε̄^{mnr...} S^i_m S^j_n S^k_r . . . , (1.37)

T = (1/n!) ε_{ijk...} ε_{mnr...} T^{im} T^{jn} T^{kr} . . . , (1.38)

where the Levi-Civita tensors ε_{ijk...} , ε^{ijk...} , ε̲_{ijk...} and ε̄^{ijk...} have as many indices as the space under consideration has dimensions.

[4] It can be shown that this is, indeed, a tensor, i.e., in a change of coordinates it transforms as a tensor should.

1.5 The Kronecker Tensor

1.5.1 Kronecker Tensor

There are two Kronecker "symbols", g_i^j and g^i_j . They are defined similarly:

g_i^j = { 1 if i and j are the same index ; 0 if i and j are different indices } , (1.39)

and

g^i_j = { 1 if i and j are the same index ; 0 if i and j are different indices } . (1.40)

Comment: I should avoid this last notation. It can easily be seen (Comment: how?)
that the g_i^j are more than 'symbols': they are tensors, in the sense that, if, when changing coordinates, we compute the new components of the Kronecker tensors using the rules applying to all tensors, the property (Comment: which equation?) remains satisfied. The Kronecker tensors are defined even if the space does not have a metric defined on it. Note that, sometimes, instead of using the symbols g_i^j and g^i_j to represent the Kronecker tensors, the symbols δ_i^j and δ^i_j are used. But then, using the metric g_ij to "lower an index" of δ_i^j gives

δ_ij = g_jk δ_i^k = g_ij , (1.41)

which means that, if the space has a metric, the Kronecker tensor and the metric tensor are the same object. Why, then, use a different symbol? The use of the symbol δ_i^j may lead, by inadvertence, after lowering an index, to assigning to δ_ij the value 1 when i and j are the same index. This is obviously wrong: if there is no metric, δ_ij is not defined, and if there is a metric, δ_ij equals g_ij , which is only 1 in Euclidean spaces using Cartesian coordinates. There is only one Kronecker tensor, and g_i^j and g^i_j can be deduced one from the other by raising and lowering indices. But, even in that case, we dislike the notation g_j^i , where the place of each index is not indicated, and we will not use it systematically. Warning: a common error of beginners is to give the value 1 to the symbol g_i^i (or to g^i_i). In fact, the right value is n , the dimension of the space, as there is an implicit sum assumed: g_i^i = g_0^0 + g_1^1 + ··· = 1 + 1 + ··· = n .

1.5.2 Kronecker Determinants

Let us denote by n the dimension of the space under consideration. The Levi-Civita tensor then has n indices. For any (non-negative) integer p satisfying p ≤ n , consider the integer q such that p + q = n . The following property holds:

ε_{i1...ip s1...sq} ε^{j1...jp s1...sq} = q! det ( δ_{i1}^{j1} δ_{i1}^{j2} . . . δ_{i1}^{jp} ; δ_{i2}^{j1} δ_{i2}^{j2} . . . δ_{i2}^{jp} ; . . . ; δ_{ip}^{j1} δ_{ip}^{j2} . . . δ_{ip}^{jp} ) , (1.42)

where δ_i^j stands for the Kronecker tensor. The determinant at the right-hand side is called the Kronecker determinant, and is denoted δ_{i1 i2...ip}^{j1 j2...jp} :

δ_{i1 i2...ip}^{j1 j2...jp} = det ( δ_{i1}^{j1} δ_{i1}^{j2} . . . δ_{i1}^{jp} ; δ_{i2}^{j1} δ_{i2}^{j2} . . . δ_{i2}^{jp} ; . . . ; δ_{ip}^{j1} δ_{ip}^{j2} . . . δ_{ip}^{jp} ) . (1.43)

As the Kronecker determinant is defined as a product of Levi-Civita tensors, it is itself a tensor. It generalizes the definition of the Kronecker tensor δ_i^j , as it has the properties

δ_{i1 i2...im}^{j1 j2...jm} = { +1 if (j1, j2, . . . , jm) is an even permutation of (i1, i2, . . . , im) ; −1 if (j1, j2, . . . , jm) is an odd permutation of (i1, i2, . . . , im) ; 0 if two of the i's or two of the j's are the same index ; 0 if (i1, i2, . . . , im) and (j1, j2, . . . , jm) are different sets of indices } . (1.44)

As applying the same permutation to the indices of the two Levi-Civita tensors of equation 1.42 will not change the total sign of the expression, we have

ε_{i1...ip s1...sq} ε^{j1...jp s1...sq} = ε_{s1...sq i1...ip} ε^{s1...sq j1...jp} = q! δ_{i1 i2...ip}^{j1 j2...jp} , (1.45)

but if we only perform a permutation in one of the Levi-Civita tensors, then we must care about the sign of the permutation, and we obtain

ε_{i1...ip s1...sq} ε^{s1...sq j1...jp} = ε_{s1...sq i1...ip} ε^{j1...jp s1...sq} = (−1)^{pq} q! δ_{i1 i2...ip}^{j1 j2...jp} . (1.46)

This possible change of sign has an effect only in spaces with even dimension (n = 2, 4, . . .) , as in spaces with odd dimension (n = 3, 5, . . .) the condition p + q = n implies that pq is an even number, and (−1)^{pq} = +1 . Remark that a multiplication and a division by ḡ will not change the value of an expression, so that, instead of using the Levi-Civita density and capacity we can use the Levi-Civita true tensors. For instance,

ε̲_{i1...ip s1...sq} ε̄^{j1...jp s1...sq} = ε_{i1...ip s1...sq} ε^{j1...jp s1...sq} . (1.47)

Comment: explain better.
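Equation 1.42 (taking n = 3, p = 2, q = 1) and its classical consequence, the "bac-cab" rule for the double vector product, can be verified numerically; this numpy sketch is our own illustration:

```python
import numpy as np

# 3-D Levi-Civita symbol (orientation +1)
eps = np.zeros((3, 3, 3))
eps[0, 1, 2] = eps[1, 2, 0] = eps[2, 0, 1] = 1.0
eps[0, 2, 1] = eps[2, 1, 0] = eps[1, 0, 2] = -1.0

delta = np.eye(3)

# Equation 1.42 with n = 3, p = 2, q = 1:
# eps_{ij s} eps^{kl s} = 1! * (delta_i^k delta_j^l - delta_i^l delta_j^k)
lhs = np.einsum('ijs,kls->ijkl', eps, eps)
rhs = (np.einsum('ik,jl->ijkl', delta, delta)
       - np.einsum('il,jk->ijkl', delta, delta))
assert np.allclose(lhs, rhs)

# The "bac-cab" identity a x (b x c) = (a.c) b - (a.b) c follows
# by contracting the identity above with a, b, c
rng = np.random.default_rng(1)
a, b, c = rng.normal(size=(3, 3))
assert np.allclose(np.cross(a, np.cross(b, c)),
                   np.dot(a, c) * b - np.dot(a, b) * c)
```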
Appendix 1.8.4 gives special formulas for spaces with dimension 2, 3, and 4. As shown in appendix 1.8.8, these formulas replace more elementary identities between grad, div, rot, . . . As an example, a well known identity like

a · (b × c) = b · (c × a) = c · (a × b) (1.48)

is obvious, as the three formulas correspond to the expression ε_ijk a^i b^j c^k . The identity

a × (b × c) = (a · c) b − (a · b) c (1.49)

is easily demonstrated, as

(a × (b × c))^i = ε^{ijk} a_j (b × c)_k = ε^{ijk} a_j ε_{klm} b^l c^m , (1.50)

which, using XXX, gives

(a × (b × c))^i = (a_m c^m) b^i − (a_m b^m) c^i , i.e., a × (b × c) = (a · c) b − (a · b) c . (1.51)

Comment: I should clearly say here that we have the identity

ε̄^{ijk...} ε̲_{lmn...} = ε^{ijk...} ε_{lmn...} . (1.52)

Comment: say somewhere that if B_{i1...ip} is a totally antisymmetric tensor, then

B_{i1...ip} = (1/p!) δ_{i1...ip}^{l1...lp} B_{l1...lp} . (1.53)

Comment: give somewhere the property

(1/q!) δ_{i1...ip j1...jq}^{k1...kp l1...lq} δ^{j1...jq}_{m1...mq} = δ_{i1...ip m1...mq}^{k1...kp l1...lq} . (1.54)

Comment: give somewhere the property

(1/q!) ε_{i1...ip j1...jq} δ^{j1...jq}_{k1...kq} = ε_{i1...ip k1...kq} . (1.55)

Note: Check if there are not factors (−1)^{pq} missing.

1.6 Totally Antisymmetric Tensors

1.6.1 Totally Antisymmetric Tensors

A tensor is completely antisymmetric if any even permutation of indices does not change the value of the components, and if any odd permutation of indices changes the sign of the value of the components:

t_{pqr...} = { +t_{ijk...} if ijk . . . is an even permutation of pqr . . . ; −t_{ijk...} if ijk . . . is an odd permutation of pqr . . . }
(1.56)

For instance, a fourth rank tensor t_ijkl is totally antisymmetric if

t_ijkl = t_iklj = t_iljk = t_jilk = t_jkil = t_jlki = t_kijl = t_kjli = t_klij = t_likj = t_ljik = t_lkji = −t_ijlk = −t_ikjl = −t_ilkj = −t_jikl = −t_jkli = −t_jlik = −t_kilj = −t_kjil = −t_klji = −t_lijk = −t_ljki = −t_lkij , (1.57)

a third rank tensor t_ijk is totally antisymmetric if

t_ijk = t_jki = t_kij = −t_ikj = −t_jik = −t_kji , (1.58)

a second rank tensor t_ij is totally antisymmetric if

t_ij = −t_ji , (1.59)

and a first rank tensor t_i can always be considered totally antisymmetric. Well known examples of totally antisymmetric tensors are the Levi-Civita tensors of any rank, the rank-two electromagnetic tensor, and the "vector product" of two vectors:

c_ij = a_i b_j − a_j b_i , (1.60)

etc. Comment: say somewhere that the Kronecker tensors and determinants are totally antisymmetric.

1.6.2 Dual Tensors

In a space with n dimensions, let p and q be two (nonnegative) integers such that p + q = n . To any totally antisymmetric tensor of rank p , B^{i1...ip} , we can associate a totally antisymmetric tensor of rank q , b_{i1...iq} , defined by

b_{i1...iq} = (1/p!) ε_{i1...iq j1...jp} B^{j1...jp} . (1.61)

The tensor b is called the dual of B , and we write

b = Dual[B] (1.62)

or

b = *B . (1.63)

From the properties of the product of Levi-Civita tensors it follows that the dual of the dual gives the original tensor, except for a sign:

**(B) = Dual[Dual[B]] = (−1)^{p(n−p)} B . (1.64)

For spaces with odd dimension (n = 1, 3, 5, . . .) , the product p(n − p) is even, and

**(B) = B (spaces with odd dimension) . (1.65)

For spaces with even dimension (n = 2, 4, 6, . . .) , we have

**(B) = (−1)^p B (spaces with even dimension) . (1.66)

Although definition 1.61 has been written for pure tensors, it can obviously be written for densities and capacities,
i1 ...iq j1 ...jp bi1 ...iq = bi1 ...iq (1.67) or for tensor where covariant and contravariant indices have replaced each other: 1 i1 ...iq j1 ...jp ε Dj1 ...jp p! 1 i1 ...iq j1 ...jp = ε Dj1 ...jp p! 1 i1 ...iq j1 ...jp = ε Dj1 ...jp , p! di1 ...iq = di1 ...iq i1 ...iq d (1.68) Appendix 1.8.13 gives explicitly the dual tensor relations in spaces with 2, 3, and 4 dimensions. Example 1.3 Consider an antisymmetric tensor E11 E12 E13 E21 E12 E23 = E31 E32 E33 Eij in three dimensions. It has components 0 E12 E13 E21 0 E23 , (1.69) E31 E32 0 with Eij = −Eji . The definition ei = gives 1 ijk ε Ejk 2! 0 0 E12 E13 e3 −e2 E21 0 E23 = −e3 0 e1 , 2 1 E31 E32 0 e −e 0 (1.70) (1.71) which is the classical relation between the three independent components of a 3-D antisymmetric tensor and the components of a vector density. [End of example.] 16 1.6 Example 1.4 The vector product of two vectors Ui and Vi can be either defined as the antisymmetric tensor Wij = Ui Vj − Vj Ui , (1.72) 1 ijk ε Uj Vk . 2! (1.73) or as the vector density wi = The two definitions are equivalent, as Wij and wi are mutually duals. [End of example.] Definition 1.73 shows that the vector product of two vectors is not a pure vector, but a vector density. Changing the sense of one axis gives a Jacobian equal to −1 , thus changing the sign of the vector product wi . 1.6.3 Exterior Product of Tensors In a space of dimension n , let Ai1 i2 ...ip and Bi1 i2 ...iq , be two totally antisymmetric tensors with ranks p and q such that p + q ≤ n . Note: check that total antisymmetry has been defined. The exterior product of the two tensors is denoted C=A∧B (1.74) and is the totally antisymmetric tensor of rank p + q defined by Ci1 ...ip j1 ...jq = 1 k ...k ... δi11...ipp 11 qq Ak1 i2 ...kp B 1 i2 ... q . j ...j (p + q )! (1.75) Permuting the set of indices {k1 . . . kp } by the set { 1 . . . q } in the above definition gives the property (A ∧ B) = (−1)pq (B ∧ A) . 
(1.76) It is also easy to see that the associativity property holds: A ∧ (B ∧ C) = (A ∧ B) ∧ C . (1.77) j1 ... Comment: say that δi1 ij22... are the components of the Kronecker’s determinant defined in Section 1.5.2. Say that it equation 1.54 gives the property (A1 ∧ A2 ∧ . . . AP)i1 i2 ...ip = 1 j1 j2 ...jp A1j1 A2j2 . . . APjp . δ p! i1 i2 ...ip (1.78) Totally Antisymmetric Tensors 1.6.3.1 17 Particular cases: It follows from equation 1.53 that the exterior product of a tensor of rank zero (a scalar) by a totally antisymmetric tensor of any order is the simple product of the scalar by the tensor: (A , → Bi1 ...iq ) (A ∧ B)i1 ...iq = A Bi1 ...iq . (1.79) For the exterior product of two vectors we easily obtain (independently of the dimension of the space into consideration) 1 (Ai , Bi ) → (A ∧ B)ij = (Ai Bj − Aj Bi ) . (1.80) 2 The exterior product of a vector by a second rank (antisymmetric) tensor gives 1 → (A ∧ B)ijk = (Ai Bjk + Aj Bki + Ak Bij ) . (1.81) (Ai , Bij ) 3 Finally, it can be seen that the exterior product of three vectors gives , Bi , Ci ) → (1.82) 1 (Ai (Bj Ck − Bk Cj ) + Aj (Bk Ci − Bi Ck ) + Ak (Bi Cj − Bj Ci )) (A ∧ B ∧ C)ijk = 6 1 = (Bi (Cj Ak − Ck Aj ) + Bj (Ck Ai − Ci Ak ) + Bk (Ci Aj − Cj Ai )) 6 1 (Ci (Aj Bk − Ak Bj ) + Cj (Ak Bi − Ai Bk ) + Ck (Ai Bj − Aj Bi )) . = 6 Let us examine with more detail the formulas above in the special case of a 3-D space. The dual of the exterior product of two vectors (equation 1.80) gives i 1 ∗ (a ∧ b) = εijk aj bk , (1.83) 2 i.e., one half the usual vector product of the two vectors: 1 ∗ (a ∧ b) = (a × b) . (1.84) 2 The dual of the exterior product of a vector by a second rank (antisymmetric) tensor (equation 1.81) is (Ai ∗ or, introducing the vector (a ∧ b) = 1 ai 3 1 ijk ε bjk 2! , (1.85) ∗i b , dual of the tensor bij , 1 ∗i (1.86) ai b . 3 This shows that the exterior product contains, via the duals, the contraction of a form and a vector. 
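The factor 1/2 relating the dual of the exterior product of two vectors to the usual vector product (equations 1.83–1.84) can be checked numerically; this numpy sketch is our own illustration, not part of the original text:

```python
import numpy as np

# 3-D Levi-Civita symbol (orientation +1)
eps = np.zeros((3, 3, 3))
eps[0, 1, 2] = eps[1, 2, 0] = eps[2, 0, 1] = 1.0
eps[0, 2, 1] = eps[2, 1, 0] = eps[1, 0, 2] = -1.0

rng = np.random.default_rng(3)
a, b = rng.normal(size=(2, 3))

# Exterior product of two vectors (1.80): (a ^ b)_ij = (a_i b_j - a_j b_i)/2
wedge = (np.outer(a, b) - np.outer(b, a)) / 2

# Its dual (1.83): *(a ^ b)^i = (1/2!) eps^ijk (a ^ b)_jk
# = (1/2) eps^ijk a_j b_k, i.e. one half of the vector product (1.84)
dual = np.einsum('ijk,jk->i', eps, wedge) / 2

assert np.allclose(dual, np.cross(a, b) / 2)
```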
Finally, the dual of the exterior product of three vectors (equation 1.82) is 1 ∗ (a ∧ b ∧ c) = εijk ai bj ck , (1.87) 3! i.e., one sixth of the triple product of the three vectors. Comment: explain that the triple product of three vectors is a · (b × c) = b · (c × a) = c · (a × b) . ∗ (a ∧ b) = 18 1.6.4 1.6 Exterior Derivative of Tensors Let T be a totally antisymmetric tensor with components Ti1 i2 ...ip . The exterior product of “nabla” with T is called the exterior derivative of T , and is denoted ∇ ∧ T : k ... (∇ ∧ T)ij1 j2 ...jp = δij11j22...jpp ∇k T 1 2 ... p . (1.88) Here, ∇i Tjk... denotes the covariant derivative defined in section XXX. The “nabla” notation allows to use direclty the formulas developed for the exterior product of a vector by a tensor to obtain formulas for exterior derivatives. For instance, from equation 1.80 it follows the definition of the exterior derivative of a vector 1 (∇i bj − ∇j bi ) , 2 (1.89) 1 ijk ε ∇j bk , 2 (1.90) 1 (∇ ∧ b) = (∇ × b) . 2 (1.91) (∇ ∧ b)ij = or, if we use the dual (equations 1.83–1.84), ∗ i (∇ ∧ b) = i.e., ∗ The exterior derivative of a vector equals one-half the rotational (curl) of the vector. The exterior derivative of a second rank (antisymmetric) tensor is directly obtained from equation 1.81: (∇ ∧ b)ijk = 1 (∇i bjk + ∇j bki + ∇k bij ) . 3 (1.92) i Taking the dual of the expression and introducing the vector ∗ b , dual of the tensor bij , gives (see equation 1.86) ∗ (∇ ∧ b) = 1 i ∇i ∗ b , 3 (1.93) which shows that the dual of the exterior derivative of a second rank (antisymmetric) tensor equals one-third of the divergence of the dual of the tensor. The exterior derivative contains, via the duals, the divergence of a vector. Integration, Volumes 1.7 1.7.1 19 Integration, Volumes The Volume Element Consider, in a space with n dimensions, p linearly independent vectors {dr1 , dr2 , . . . , drp } . As they are linear independent, p ≤ n . We define the “differential element” d(p)σ = p! 
(dr1 ∧ dr2 ∧ · · · ∧ drp ) . (1.94) Using equation 1.78 (Note: in fact this equation with indices changed of place) gives the components i ...i i i i d(p)σ i1 ...ip = δj1 ...jp dr11 dr22 . . . drpp . p 1 (1.95) In a space with n dimensions, the dual of the differential element of dimension p will have q indices, with p + q = n . The general definition of dual (equation 1.67) gives 1 ∗ (p) d σ i1 ...iq = εi1 ...iq j1 ...jp d(p)σ j1 ...jp (1.96) p! The definition 1.95 and the property 1.55 give ∗ (p) j j j d σ i1 ...iq = εi1 ...iq j1 ...jp dr11 dr22 . . . drpp . (1.97) In order to simplify subsequent notations, it is better not to keep the ∗ notation. Instead, we will write ∗ (p) d σ i1 ...iq = d(p)Σi1 ...iq (1.98) For reasons to be developed below, d(p)Σi1 ...iq will be called the capacity element . We can easily see, for instance, that the differential elements of dimensions 0, 1, 2 and 3 have components d0σ d1 σ i d2 σ ij d3 σ ijk = = = = = = 1 (1.99) i dr1 (1.100) j j i i dr1 dr2 − dr1 dr2 (1.101) j j j j j i k k k i i k k i i dr1 (dr2 dr3 − dr2 dr3 ) + dr1 (dr2 dr3 − dr2 dr3 ) + dr1 (dr2 dr3 − dr2 dr3 ) j j j j j i k k k i i k k i i dr2 (dr3 dr1 − dr3 dr1 ) + dr2 (dr3 dr1 − dr3 dr1 ) + dr2 (dr3 dr1 − dr3 dr1 ) j j j j j i k k k i i k k i i dr3 (dr1 dr2 − dr1 dr2 ) + dr3 (dr1 dr2 − dr1 dr2 ) + dr3 (dr1 dr2 − dr1 dr2 ) . (1.102) For a given dimemsion of the differential element, the number of indices of the capacity elements depends on the dimension of the space. In a three-dimensional space, for instance, we have d0 Σijk = εijk (1.103) k d1 Σij = εijk dr1 d2 Σi = 3 dΣ = j k εijk dr1 dr2 j i k εijk dr1 dr2 dr3 (1.104) (1.105) . (1.106) Note: explain that I use the notation d(p) but d1 , d2 , . . . in order not to suggest that p is a tensor index and, at the same time, for not using too heavy notations.. Note: refer here to figure 1.1, and explain that we have, in fact, vector products of vectors and triple products of vectors. 
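As figure 1.1 suggests, the capacity elements d²Σ_i and d³Σ are the vector product and the triple product of the defining vectors. A quick numerical check (our own illustration, using numpy):

```python
import numpy as np

# 3-D Levi-Civita symbol (orientation eta = +1)
eps = np.zeros((3, 3, 3))
eps[0, 1, 2] = eps[1, 2, 0] = eps[2, 0, 1] = 1.0
eps[0, 2, 1] = eps[2, 1, 0] = eps[1, 0, 2] = -1.0

rng = np.random.default_rng(7)
dr1, dr2, dr3 = rng.normal(size=(3, 3))

# Two-dimensional capacity element (1.105):
# d2Sigma_i = eps_ijk dr1^j dr2^k -- the vector product of dr1 and dr2
d2S = np.einsum('ijk,j,k->i', eps, dr1, dr2)
assert np.allclose(d2S, np.cross(dr1, dr2))

# Three-dimensional capacity element (1.106):
# d3Sigma = eps_ijk dr1^i dr2^j dr3^k -- the triple product,
# i.e. the determinant of the matrix whose rows are the three vectors
d3S = np.einsum('ijk,i,j,k->', eps, dr1, dr2, dr3)
assert np.isclose(d3S, np.linalg.det(np.array([dr1, dr2, dr3])))
```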
20 1.7 dr3 dr2 dr2 dr1 dr1 dr1 Figure 1.1: From vectors in a three-dimensional space we define the one-dimensional capacity j k k element d1 Σij = εijk dr1 , the two-dimensional capacity element d2 Σi = εijk dr1 dr2 and the j i k three-dimensional capacity element d3Σ = εijk dr1 dr2 dr3 . In a metric space, the rank-two 1 form d Σij defines a surface perpendicular to dr1 and with a surface magnitude equal to the length of dr1 . The rank-one form d2 Σi defines a vector perpendicular to the surface defined by dr1 and dr2 and with length representing the surface magnitude (the vector product of the two vectors). The rank-zero form d3Σ is a scalar representing the volume defined by the three vectors dr1 , dr2 and dr3 (the triple product of the vectors). Note: clarify all this. 1.7.2 The Stokes’ Theorem Comment: I must explain here first what integration means. Let, in a space with n dimensions, (T) be a totally antisymmetric tensor of rank p , with (p < n) . The Stokes’ theorem d(p+1)σ i1 ...ip+1 (∇ ∧ T)i1 ...ip+1 = (p+1)D d(p)σ i1 ...ip Ti1 ...ip (1.107) pD holds. Here, the symbol (p+1)D d(p+1) stands for an integral over a p+1)-dimensional “volume”, (embedded in an space of dimension n ), and pD d(p) for the integral over the pdimensional boundary of the “volume”. This fundamental theorem contains, as special cases, the divergence theorem of GaussOstrogradsky, and the rotational theorem of Stokes (stricto sensu). Rather than deriving it here, we will explore its consequences. For a demonstration, see, for instance, Von Westenholz (1981). In a three-dimensional space (n = 3) , we may have p respectively equal to 2 , 1 and 0 . This gives the three theorems d3σ ijk (∇ ∧ T)ijk = 3D d2σ ij Tij (1.108) 2D d2σ ij (∇ ∧ T)ij = 2D d1σ i (∇ ∧ T)i = 1D d1σ i Ti (1.109) 1D d0σ T . 0D (1.110) Integration, Volumes 21 It is easy to see (appendix 1.8.14) that these equation can be written 1 ijk 1 d3Σ ε ∇i Tjk 0! 3D 2! 1 1 ijk d2Σi ε ∇j Tk 1! 2D 1! 1 1 ijk d1Σij ε ∂k T 2! 1D 0! 1 1! 
1 = 2! 1 = 3! d2Σi = 2D d1Σij 1D d0Σijk 0D 1 ijk ε Tjk 2! 1 ijk ε Tk 1! 1 ijk εT 0! (1.111) (1.112) . (1.113) i Simplifying equation 1.111 and introducing the vector density t , dual to the tensor Tij , ( i 1 i.e., t = 2! εijk Tjk ), gives i d3Σ ∇i t = 3D i d2Σi t . (1.114) 2D This corresponds to the divergence theorem of Gauss-Ostrogradsky: The integral over a (3-D) volume of the divergence of a vector equals the flux of the vector across the surface bounding the volume. It is worth to mention here that expression 1.114 has been derived without any mention to a metric in the space. We have sen elsewhere that densities and capacities can be defined even if there is no notion of distance. If there is a metric, then from the capacity element d3Σ we can introduce the volume element d3Σ using the standard rule for putting on and taking off bars d3Σ = g d3Σ , (1.115) d2Σi = g d2Σi . (1.116) as well as the surface element d3Σ is now the familiar volume inside a prism, and d2Σi the vector (if we raise the index with the metric) representing the surface inside a lozenge. Equation 1.114 then gives d3Σ ∇i ti = 3D d2Σi ti , (1.117) 2D which is the familiar form for the divergence theorem. Keeping the compact expression for the capacity element in the lefthand side of equation 1.112, but introducing its explicit expression in the right hand side gives, after simplification, d2Σi (εijk ∇j Tk ) = 2D i dr1 Ti , (1.118) 1D which corresponds to the rotational theorem (theorem of Stokes stricto sensu): the integral of the rotational (curl) of a vector on a surface equals the circulation of the vector along the line bounding the surface. 22 1.7 Finally, introducing explicit expressions for the capacity elements at both sides of equation 1.113 gives i dr1 ∂i T = 1D T. 
(1.119) 0D Writing this in the more familiar form gives b dri ∂i T = T (b) − T (a) , (1.120) a which corresponds the fundamental theorem of integral calculus: the integral over a line of the gradient of a scalar equals the difference of the values of the scalar at the two end-points. Note: say that more details can be found in appendix 1.8.14 Comment: explain here what the “capacity element”is. Explain that, in polar coordinates, it is given by drdϕ , to be compared with the “surface element” rdrdϕ . Comment figure 1.2. ϕ = 2π . . . . . . . . .. . . . . . . . . . . . . . . . . . . .. . . . .. . . .. . . . . .. . . . . . 2 . . .. . .. . . .. .. .. 0 ϕ=π . . .. . . .. . .. . . -2 .. .. . -4 . . . ϕ = 0 -4 4 -2 2 0 r=0 r = 1/2 r=1 4 ϕ = π/2 1 . .. . . .. . .. . .. . .. . .. . . 0 ϕ=π ... . .. . ϕ=0 . . . . . .. . -0.5 . . . . . . . -1 1 -1 -0.5 0 ϕ = 3π/2 0.5 0.5 Figure 1.2: We consider, in an Euclidean space, a cylinder with a circular basis of radius 1, and cylindrical coordinates (r, ϕ, z ) . Only a section of the cylinder is represented in the figure, with all its thickness, dz , projected on the drawing plane. At left, we have represented a “map” of the corresponding circle, and, at right, the coordinate lines on the circle itself. All the “cells” at left have the same capacity dV = drdϕdz , while the cells at right have the volume dV (r, ϕ, z ) = rdrdϕdz . The points represent particles with given masses. If, at left, at point with coordinates (r, ϕ, z ) the sum of all the masses inside the local cell is denoted, dM , then, the mass density at this point is estimated by ρ(r, ϕ, z ) = dM/dV , i.e., ρ(r, ϕ) = dM/(drdϕdz ) . If, at right, at point (r, ϕ, z ) the total mass inside the local cell is dM , the volumetric mass at this point is estimated by ρ(r, ϕ, z ) = dM/dV (r, ϕ, z ) , i.e., ρ(r, ϕ, z ) = dM/(rdrdϕdz ) . 
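The distinction drawn in figure 1.2 between the capacity element dr dϕ (the same for every cell of the coordinate grid) and the volume element r dr dϕ can be illustrated by integrating a uniform volumetric mass over the unit disc; this numpy sketch is our own, and the grid sizes are arbitrary:

```python
import numpy as np

# Regular grid in polar coordinates (r, phi) over the unit disc.
n_r, n_phi = 1000, 1000
dr, dphi = 1.0 / n_r, 2 * np.pi / n_phi
r = (np.arange(n_r) + 0.5) * dr        # cell centres in r

# Every cell has the same *capacity* dr * dphi, but its *volume*
# is  dV = r dr dphi  (the 2-D analogue of r dr dphi dz).
rho = 1.0                              # uniform volumetric mass

# M = integral of rho over the disc, summing rho * r dr dphi per cell
M = np.sum(rho * r * dr * dphi * n_phi)

# Exact mass of a uniform unit disc: rho * pi * R^2 = pi
assert abs(M - np.pi) < 1e-3
```

Summing rho times the bare capacity dr dϕ instead would give 2π, not π: the capacity element alone mis-weights the cells far from the origin, which is exactly the point of the figure.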
By definition, then, the total mass inside a volume $V$ can be computed either as $M = \int_V \underline{dV}\;\bar\rho(r,\varphi,z) = \int_V dr\,d\varphi\,dz\;\bar\rho(r,\varphi,z)$, or as $M = \int_V dV(r,\varphi,z)\;\rho(r,\varphi,z) = \int_V r\,dr\,d\varphi\,dz\;\rho(r,\varphi,z)$.

1.8 Appendixes

1.8.1 Appendix: Tensors For Beginners

1.8.1.1 Tensor Notations

The velocity of the wind at the top of Eiffel's tower, at a given moment, can be represented by a vector $\mathbf{v}$ with components $\{v^i\}$ $(i = 1, 2, 3)$ in some given local basis. The velocity of the wind is defined at any point $x$ of the atmosphere and at any time $t$: we have a vector field $v^i(x,t)$. The water temperature at some point in the ocean, at a given moment, can be represented by a scalar $T$. The field $T(x,t)$ is a scalar field. The state of stress at a given point of the Earth's crust, at a given moment, is represented by a second order tensor $\boldsymbol{\sigma}$ with components $\{\sigma^{ij}\}$ $(i = 1, 2, 3;\ j = 1, 2, 3)$. In a general model of continuous media, where it is not assumed that the stress tensor is symmetric, this means that we need 9 scalar quantities to characterize the state of stress. In more particular models, the stress tensor is symmetric, $\sigma^{ij} = \sigma^{ji}$, and only six scalar quantities are needed. The stress field $\sigma^{ij}(x,t)$ is a second order tensor field.

Tensor fields can be combined to give other fields. For instance, if $n_i$ is a unit vector considered at a point inside a medium, the vector

$\tau^i(x,t) \;=\; \sum_{j=1}^{3} \sigma^{ij}(x,t)\, n_j(x) \;=\; \sigma^{ij}(x,t)\, n_j(x) \qquad (i = 1, 2, 3)$   (1.121)

represents the traction that the medium at one side of the surface defined by the normal $n_i$ exerts on the medium at the other side, at the considered point. As a further example, if the deformations of an elastic solid are small enough, the stress tensor is related linearly to the strain tensor (Hooke's law).
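Contractions like the one in equation 1.121 can be mirrored numerically with `numpy.einsum`, whose subscript string reproduces the index notation. A minimal sketch, with an invented stress tensor and normal (the numbers have no physical significance):

```python
import numpy as np

# A made-up symmetric stress tensor at one point (illustrative values only).
sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 4.0]])

n = np.array([0.0, 0.0, 1.0])  # unit normal to the surface element

# tau^i = sigma^{ij} n_j : the repeated index j is summed over
tau = np.einsum('ij,j->i', sigma, n)
print(tau)  # [0.  0.3 4. ]
```

Hooke's law, discussed next, is the same mechanism with two contracted indices: `np.einsum('ijkl,kl->ij', c, eps)`.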
A linear relation between two second order tensors means that each component of one tensor can be computed as a linear combination of all the components of the other tensor:

$\sigma^{ij}(x,t) \;=\; \sum_{k=1}^{3}\sum_{\ell=1}^{3} c^{ijk\ell}(x)\, \varepsilon_{k\ell}(x,t) \;=\; c^{ijk\ell}(x)\, \varepsilon_{k\ell}(x,t) \qquad (i = 1, 2, 3;\ j = 1, 2, 3)$   (1.122)

The fourth order tensor $c^{ijk\ell}$ represents a property of an elastic medium: its elastic stiffness. As each index takes 3 values, there are $3 \times 3 \times 3 \times 3 = 81$ scalars to define the elastic stiffness of a solid at a point (assuming some symmetries, we may reduce this number to 21 and, assuming isotropy of the medium, to 2).

We are not interested here in the physical meaning of the equations above, but in their structure. First, tensor notations are such that they are independent of the coordinates being used. This is not obvious, as changing the coordinates implies changing the local basis in which the components of vectors and tensors are expressed. That the two equalities above hold for any coordinate system means that all the components of all tensors will change if we change the coordinate system being used (for instance, from Cartesian to spherical coordinates), but the two sides of the expression will still take equal values. The mechanics of the notation, once understood, is such that it is only possible to write expressions that make sense (see a list of rules at the end of this section).

For reasons about to be discussed, indices may come in upper or lower positions, like in $v^i$, $f_i$ or $T_i{}^j$. The definitions will be such that in all tensor expressions (i.e., in all expressions that will be valid for all coordinate systems), the sums over indices will always concern one index in lower position and one index in upper position. For instance, we may encounter expressions like

$\varphi = \sum_{i=1}^{3} A_i B^i = A_i B^i \qquad \text{or} \qquad A_i = \sum_{j=1}^{3}\sum_{k=1}^{3} D_{ijk}\, E^{jk} = D_{ijk}\, E^{jk}$   (1.123)

These two equations (as equations 1.121 and 1.122) have been written in two versions: one with the sums over the indices explicitly indicated, and another where the sums are implicitly assumed. This implicit notation is useful, as one easily forgets that one is dealing with sums, and it happens that, with respect to the usual tensor operations (sum with another tensor field, multiplication with another tensor field, and derivation), a sum of such terms is handled as one single term of the sum would be handled. In an expression like $A_i = D_{ijk} E^{jk}$, it is said that the indices $j$ and $k$ have been contracted (or are "dummy indices"), while the index $i$ is a free index. A tensor equation is assumed to hold for all possible values of the free indices.

In some spaces, like our physical 3-D space, it is possible to define the distance between two points, and in such a way that, in a local system of coordinates, approximately Cartesian, the distance has approximately the Euclidean form (square root of a sum of squares). These spaces are called metric spaces. A mathematically convenient manner to introduce a metric is by defining the length of an arc $\Gamma$ as $S = \int_\Gamma ds$, where, for instance, in Cartesian coordinates, $ds^2 = dx^2 + dy^2 + dz^2$ or, in spherical coordinates, $ds^2 = dr^2 + r^2\, d\theta^2 + r^2 \sin^2\theta\, d\varphi^2$. In general, we write $ds^2 = g_{ij}\, dx^i dx^j$, and we call $g_{ij}(x)$ the metric field or, simply, the metric.

The components of a vector $\mathbf{v}$ are associated to a given basis (the vector will have different components on different bases). If a basis $\{\mathbf{e}_i\}$ is given, then the components $v^i$ are defined through $\mathbf{v} = v^i\, \mathbf{e}_i$ (implicit sum). The dual basis of the basis $\{\mathbf{e}_i\}$ is denoted $\{\mathbf{e}^i\}$ and is defined by the equation $\langle \mathbf{e}^i, \mathbf{e}_j \rangle = \delta^i{}_j$ (equal to 1 if $i$ and $j$ are the same index, and to 0 if not).
When there is a metric, the defining equation of the dual basis can be interpreted as a scalar (vector) product, and the dual basis is just another basis (identical to the first one when working with Cartesian coordinates in Euclidean spaces, but different in general). The properties of the dual basis will be analyzed later in the chapter. Here we just need to recall that if $v^i$ are the components of the vector $\mathbf{v}$ on the basis $\{\mathbf{e}_i\}$ (remember the expression $\mathbf{v} = v^i \mathbf{e}_i$), we will denote by $v_i$ the components of the vector $\mathbf{v}$ on the basis $\{\mathbf{e}^i\}$: $\mathbf{v} = v_i\, \mathbf{e}^i$. In that case (metric spaces), the components on the two bases are related by $v_i = g_{ij}\, v^j$: it is said that "the metric tensor ascends (or descends) the indices".

Here is a list with some rules helping to recognize tensor equations:

• A tensor expression must have the same free indices, at the top and at the bottom, on the two sides of an equality. For instance, the expressions

$\varphi = A_i B^i \qquad \varphi = g_{ij}\, B^i C^j \qquad A_i = D_{ijk}\, E^{jk} \qquad D_{ijk}{}^\ell = \nabla_i F_{jk}{}^\ell$   (1.124)

are valid, but the expressions

$A_i = F_{ij}\, B^i \qquad B^i = A_j\, C^j \qquad A_i = B^i$   (1.125)

are not.

• Sums and products of tensors (with eventual "contraction" of indices) give tensors. For instance, if $D_{ijk}$, $G_{ijk}$ and $H_i{}^j$ are tensors, then

$J_{ijk} = D_{ijk} + G_{ijk} \qquad K_{ijk\ell}{}^m = D_{ijk}\, H_\ell{}^m \qquad L_{ik\ell} = D_{ijk}\, H_\ell{}^j$   (1.126)

also are tensors.

• True (or "covariant") derivatives of tensor fields give tensor fields. For instance, if $E^{jk}$ is a tensor field, then

$M_i{}^{jk} = \nabla_i E^{jk} \qquad B^j = \nabla_i E^{ij}$   (1.127)

also are tensor fields. But partial derivatives of tensors do not define, in general, tensors. For instance, if $V^{jk}$ is a tensor field, then

$M_i{}^{jk} = \partial_i V^{jk} \qquad B^j = \partial_i V^{ij}$   (1.128)

are not tensors, in general.

• All "objects with indices" that are normally introduced are tensors, with four notable exceptions. The first exception is the coordinates $\{x^i\}$ (to see that it makes no sense to add coordinates, think, for instance, of adding the spherical coordinates of two points). But the differentials $dx^i$ appearing in an expression like $ds^2 = g_{ij}\, dx^i dx^j$ do correspond to the components of a vector, $d\mathbf{r} = dx^i\, \mathbf{e}_i$. Another notable exception is the "symbol" $\partial_i$ mentioned above. The third exception is the "connection" $\Gamma_{ij}{}^k$, to be introduced later in the chapter. In fact, it is because both of the symbols $\partial_i$ and $\Gamma_{ij}{}^k$ are not tensors that an expression like

$\nabla_i V^j = \partial_i V^j + \Gamma_{ik}{}^j\, V^k$   (1.129)

can have a tensorial sense: if one of the terms at right were a tensor and not the other, their sum could never give a tensor. The objects $\partial_i$ and $\Gamma_{ij}{}^k$ are both non-tensors, and "what one term misses, the other term has". The fourth and last case of "objects with indices" which are not tensors are the Jacobian matrices arising in coordinate changes $x \rightsquigarrow y$,

$J^i{}_I = \dfrac{\partial x^i}{\partial y^I}$   (1.130)

That this is not a tensor is obvious when considering that, contrary to a tensor, the Jacobian matrix is not defined per se: it is only defined when two different coordinate systems have been chosen. A tensor exists even if no coordinate system at all has been defined.

1.8.1.2 Differentiable Manifolds

A manifold is a continuous space of points. In an n-dimensional manifold it is always possible to "draw" coordinate lines in such a way that to any point P of the manifold correspond coordinates $\{x^1, x^2, \ldots, x^n\}$, and vice versa. Saying that the manifold is a continuous space of points is equivalent to saying that the coordinates themselves are "continuous", i.e., that they are, in fact, a part of $\mathbb{R}^n$. On such manifolds we define physical fields, and the continuity of the manifold allows us to define the derivatives of the considered fields. When derivatives of fields on a manifold can be defined, the manifold is called a differentiable manifold. Obvious examples of differentiable manifolds are the lines and surfaces of ordinary geometry. Our 3-D physical space (with, possibly, curvature and torsion) is also represented by a differentiable manifold.
The space-time of general relativity is a four-dimensional differentiable manifold.

A coordinate system may not "cover" all the manifold. For instance, the poles of a sphere are as ordinary as any other point on the sphere, but the coordinates are singular there (the coordinate $\varphi$ is not defined). Changing the coordinate system around the poles makes any problem related to the coordinate choice vanish there. A more serious difficulty appears when, at some point, not the coordinates but the manifold itself is singular (the linear tangent space is not defined at this point), as, for instance, in the example shown in figure 1.3. Those are named "essential singularities". No effort will be made in this book to classify them.

Figure 1.3: The surface at left has an essential singularity that will cause trouble for whatever system of coordinates we may choose (the tangent linear space is not defined at the singular point). The sphere at right has no essential singularity, but the coordinate system chosen is singular at the two poles. Other coordinate systems would be singular at different points.

1.8.1.3 Tangent Linear Space, Tensors

Consider, for instance, in classical dynamics, a trajectory $x^i(t)$ on a space which may not be flat, such as the surface of a sphere. The trajectory is "on" the sphere. If we now define the velocity at some point,

$v^i = \dfrac{dx^i}{dt}$   (1.131)

we get a vector which is not "on" the sphere, but tangent to it. It belongs to what is called the tangent linear space at the considered point. At that point, we will have a basis for vectors. At another point, we will have another tangent linear space, and another vector basis. More generally, at every point of a differentiable manifold, we can consider different vector or tensor quantities, like the forces, velocities, or stresses of the mechanics of continuous media.
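A small numerical sketch of equation 1.131 (the trajectory below is invented for illustration): for a great circle on the unit sphere, the velocity obtained by differentiating the trajectory is orthogonal to the position vector, i.e., it lies in the tangent plane rather than on the sphere itself.

```python
import numpy as np

# Great-circle trajectory on the unit sphere: x(t) = (cos t, sin t, 0).
def x(t):
    return np.array([np.cos(t), np.sin(t), 0.0])

t0, h = 0.7, 1e-6
v = (x(t0 + h) - x(t0 - h)) / (2 * h)   # v^i = dx^i/dt by central difference

# The velocity is not a point of the sphere; for the unit sphere the
# tangent linear space at x(t0) is the plane orthogonal to x(t0).
print(np.dot(x(t0), v))   # ~ 0, up to finite-difference error
```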
As suggested by figure 1.4, those tensorial objects do not belong to the nonlinear manifold, but to the tangent linear space to the manifold at the considered point (a notion that will only be introduced intuitively here). At every point of the space, tensors can be added, multiplied by scalars, contracted, etc. This means that at every point of the manifold we have to consider a different vector space (in general, a tensor space). It is important to understand that two tensors at two different points of the space belong to two different tangent spaces, and cannot be added as such (see figure 1.4). This is why we will later need to introduce the concept of "parallel transport of tensors". All through this book, the two names linear space and vector space will be used as completely equivalent.

The structure of vector space is too narrow to be of any use in physics. What is needed is a structure where equations like

$\lambda = R_i\, S^i \qquad T^j = U_i\, V^{ij} + \mu\, W^j \qquad X^{ij} = Y^i\, Z^j$   (1.132)

make sense. This structure is that of a tensor space. In short, a tensor space is a collection of vector spaces, together with rules of multiplication and differentiation that use elements of the vector spaces considered to get other elements of other vector spaces.

Figure 1.4: Surface with two planes tangent at two points, and a vector drawn at each point. As the vectors belong to two different vector spaces, their sum is not defined. Should we need to add them, for instance, to define true (or "covariant") derivatives of the vector field, then we would need to transport them (by "parallel transportation") to a common point.

1.8.1.4 Vectors and Forms

When we introduce some vector space, with elements denoted, for instance, $\mathbf{V}, \mathbf{v}, \ldots$, it often happens that a new, different vector space is needed, with elements denoted, for instance, $\mathbf{F}, \mathbf{f}, \ldots$, such that, when taking an element of each space, we can "multiply" them and get a scalar,

$\lambda = \langle\, \mathbf{F}, \mathbf{V} \,\rangle$   (1.133)

In terms of components, this will be written

$\lambda = F_i\, V^i$   (1.134)

The product in 1.133–1.134 is called a duality product, and it has to be clearly distinguished from an inner (or scalar) product: in an inner product, we multiply two elements of a vector space; in a duality product, we multiply an element of a vector space by an element of a "dual space". This operation can always be defined, including the case where we do not have a metric (and, therefore, a scalar product).

As an example, imagine that we work with pieces of metal and we need to consider the two parameters "electric conductivity" $\sigma$ and "temperature" $T$. We may need to consider some (possibly nonlinear) function of $\sigma$ and $T$, say $S(\sigma, T)$. For instance, $S(\sigma, T)$ may represent a "misfit function" on the $(\sigma, T)$ space, of those encountered when solving inverse problems in physics, if we are measuring the parameters $\sigma$ and $T$ using indirect means. In this case, $S$ is adimensional (see footnote 5). We may wish to know by which amount $S$ will change when passing from the point $(\sigma_0, T_0)$ to a neighbouring point $(\sigma_0 + \Delta\sigma,\, T_0 + \Delta T)$. Writing only the first order term, and using matrix notations,

$S(\sigma_0 + \Delta\sigma,\, T_0 + \Delta T) \;=\; S(\sigma_0, T_0) \;+\; \begin{pmatrix} \frac{\partial S}{\partial \sigma} & \frac{\partial S}{\partial T} \end{pmatrix} \begin{pmatrix} \Delta\sigma \\ \Delta T \end{pmatrix} \;+\; \ldots$   (1.135)

where the partial derivatives are taken at the point $(\sigma_0, T_0)$. Using tensor notations, setting $x = (x^1, x^2) = (\sigma, T)$, we can write

$S(x + \Delta x) \;=\; S(x) + \sum_i \frac{\partial S}{\partial x^i}\, \Delta x^i \;=\; S(x) + \gamma_i\, \Delta x^i \;=\; S(x) + \langle\, \boldsymbol{\gamma}, \Delta\mathbf{x} \,\rangle$   (1.136)

where the notation introduced in equations 1.133–1.134 is used. As above, the partial derivatives are taken at the point $x_0 = (x_0^1, x_0^2) = (\sigma_0, T_0)$.

Note: say that figure 1.5 illustrates the definition of the gradient as a tangent linear application. Say that the "mille-feuilles" are the "level lines" of that tangent linear application.

Note: I have to explain somewhere the reason for putting an index in lower position to represent $\partial/\partial x^i$, i.e., for using the notation $\partial_i = \partial/\partial x^i$.

Note: I have also to explain that, in spite of the fact that we have here partial derivatives, we have defined a tensorial object: the partial derivative of a scalar equals its true (covariant) derivative.

It is important to realize that there is no "scalar product" involved in equation 1.136. Here are the arguments:

• The components $\gamma_i$ are not the components of a vector in the $(\sigma, T)$ space. This can directly be seen by an inspection of their physical dimensions. As the function $S$ is adimensional (see footnote 5), the components of $\boldsymbol{\gamma}$ have as dimensions the inverse of the physical dimensions of the components of the vector $\Delta\mathbf{x} = (\Delta x^1, \Delta x^2) = (\Delta\sigma, \Delta T)$. This clearly means that $\Delta\mathbf{x}$ and $\boldsymbol{\gamma}$ are "objects" that do not belong to the same space.

Footnote 5: For instance, one could have the simple expression $S(\sigma, T) = |\sigma - \sigma_0|/s_\sigma + |T - T_0|/s_T$, where $s_\sigma$ and $s_T$ are standard deviations (or mean deviations) of some probability distribution.

• If equation 1.136 involved a scalar product, we could define the norm of $\Delta\mathbf{x}$, the norm of $\boldsymbol{\gamma}$, and the angle between them. But these norms and this angle are not defined. For instance, what could be the norm of $\Delta\mathbf{x} = (\Delta\sigma, \Delta T)$? Should we choose an $L_2$ norm? Or, as suggested by footnote 5, an $L_1$ norm? And, in any case, how could we make such a definition of a norm consistent with a change of variables where, instead of the electric conductivity, we use the electric resistivity? (Note: make an appendix where the solution to this problem is given.)

The product in equation 1.136 is not a scalar product (i.e., it is not the "product" of two elements belonging to the same space): it is a "duality product", multiplying an element of a vector space by an element of a "dual space".

Why is this discussion needed? Because of the tendency of imagining the gradient of a function $S(\sigma, T)$ as a vector (an "arrow") in the $(\sigma, T)$ space. If the gradient is not an arrow, then, what is it?
Note: say here that figures 1.6 and 1.7 answer this by showing that an element of a dual space can be represented as a "mille-feuilles".

Up to here we have only considered a vector space and its dual. But the notion generalizes to more general tensor spaces, i.e., to the case where "we have more than one index". For instance, instead of equation 1.134 we could use an equation like

$\lambda = F_{ij}{}^{k}\, V^{ij}{}_{k}$   (1.137)

to define scalars, consider that we are doing a duality product, and also use the notation of equation 1.133 to denote it. But this is not very useful as, from a given "tensor" $F_{ij}{}^{k}$, we can obtain scalars by operations like

$\lambda = F_{ij}{}^{k}\, V^i\, W^j{}_{k}$   (1.138)

It is better, in general, to just write the indices explicitly, to indicate which sort of "product" we consider. Sometimes (as in quantum mechanics), a "bra-ket" notation is used, where the name stands for the bra "$\langle\,|$" and the ket "$|\,\rangle$". Then, instead of $\lambda = \langle\, \mathbf{F}, \mathbf{V} \,\rangle$, one writes

$\lambda = \langle\, \mathbf{F} \,|\, \mathbf{V} \,\rangle = F_i\, V^i$   (1.139)

The bra-ket notation is then also used for expressions like

$\lambda = \langle\, \mathbf{V} \,|\, \mathbf{H} \,|\, \mathbf{W} \,\rangle = H_{ij}\, V^i\, W^j$   (1.140)

Note: say that the general rules for the change of component values under a change of coordinates allow us to talk about "tensors" for "generalized vectors" as well as for "generalized forms". The "number of indices" that has to be used to represent the components of a tensor is called the rank, or the order, of the tensor. A tensor object with components $R_{ijk}{}^{\ell}$ could be called, in all rigor, a "(third-rank-form)-(first-rank-vector)", but we will not try to use this heavy terminology, the simple writing of the indices being explicit enough.
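The coordinate-independence of the duality product $\langle \boldsymbol{\gamma}, \Delta\mathbf{x} \rangle$ can be checked numerically. The sketch below uses an invented, dimensionless misfit function (not one from the text) and passes from conductivity $\sigma$ to resistivity $\rho = 1/\sigma$: the form components and the vector components both change, but their pairing does not (to first order in the displacement).

```python
import numpy as np

def S(sigma, T):                     # an invented, dimensionless misfit
    return (sigma - 2.0)**2 + sigma * (T - 1.0)**2

def grad(f, u, v, d=1e-6):           # form components gamma_i = dS/dx^i
    return np.array([(f(u + d, v) - f(u - d, v)) / (2 * d),
                     (f(u, v + d) - f(u, v - d)) / (2 * d)])

sigma0, T0 = 3.0, 1.5
dx = np.array([0.01, -0.02])         # a small displacement (a vector)
p1 = grad(S, sigma0, T0) @ dx        # <gamma, dx> in (sigma, T) coordinates

# Same computation in (rho, T) coordinates, with rho = 1/sigma.
S2 = lambda rho, T: S(1.0 / rho, T)
dy = np.array([-dx[0] / sigma0**2, dx[1]])   # the vector transforms with the Jacobian
p2 = grad(S2, 1.0 / sigma0, T0) @ dy         # the form transforms with its inverse

print(p1, p2)   # the same number: the pairing is invariant
```

No scalar product (and hence no metric) is used anywhere: only the pairing of a form with a vector.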
Note: say that, if there is a metric, there is a trivial identification between a vector space and its dual, through equations like $F_i = g_{ij}\, V^j$ or $S^{ijk}{}_{\ell} = g^{ip} g^{jq} g^{kr} g_{\ell s}\, R_{pqr}{}^{s}$, and that, in that case, the same letter is used to designate a vector and its dual element, as in $V_i = g_{ij}\, V^j$ and $R^{ijk}{}_{\ell} = g^{ip} g^{jq} g^{kr} g_{\ell s}\, R_{pqr}{}^{s}$. But in non-metric spaces (i.e., spaces without a metric), there is usually a big difference between a space and its dual.

1.8.1.4.1 Gradient and Hessian

Explain somewhere that, if $\phi(x)$ is a scalar function, the Taylor development

$\phi(x + \Delta x) \;=\; \phi(x) \;+\; \langle\, \mathbf{g} \,|\, \Delta\mathbf{x} \,\rangle \;+\; \dfrac{1}{2!}\, \langle\, \Delta\mathbf{x} \,|\, \mathbf{H} \,|\, \Delta\mathbf{x} \,\rangle \;+\; \ldots$   (1.141)

defines the gradient $\mathbf{g}$ and the Hessian $\mathbf{H}$.

1.8.1.4.2 Old Text

We may want the gradient to be "perpendicular" to the level lines of $\varphi$ at $O$, but there is no natural way to define a scalar product in the $\{P, T\}$ space, so we cannot naturally define what "perpendicularity" is. That there is no natural way to define a scalar product does not mean that we cannot define one: we can define many. For any symmetric, positive-definite matrix with the right physical dimensions (i.e., for any covariance matrix), the expression

$\left\langle \begin{pmatrix} \delta P_1 \\ \delta T_1 \end{pmatrix}, \begin{pmatrix} \delta P_2 \\ \delta T_2 \end{pmatrix} \right\rangle \;=\; \begin{pmatrix} \delta P_1 \\ \delta T_1 \end{pmatrix}^{\!T} \begin{pmatrix} C_{PP} & C_{PT} \\ C_{TP} & C_{TT} \end{pmatrix}^{\!-1} \begin{pmatrix} \delta P_2 \\ \delta T_2 \end{pmatrix}$

defines a scalar product. By an appropriate choice of the covariance matrix, we can make either of the two lines in figure 1.6 (or any other line) perpendicular to the level lines at the considered point: the gradient at a given point is something univocally defined, even in the absence of any scalar product; the "direction of steepest descent" is not, and there are as many of them as there are different scalar products we may choose. The gradient is not an arrow, i.e., it is not a vector. So, then, how to draw the gradient? Roughly speaking, the gradient is the linear tangent application at the considered point. It is represented in figure 1.7.
As, by definition, it is a linear application, its level lines are straight lines, and the spacing of the level lines of the tangent linear application corresponds to the spacing of the level lines of the original function around the point where the gradient is computed. Speaking more technically, it is the development

$\varphi(x + \delta x) \;=\; \varphi(x) + \langle\, \mathbf{g}, \delta\mathbf{x} \,\rangle + \ldots \;=\; \varphi(x) + g_i\, \delta x^i + \ldots$

when limited to its first order, that defines the tangent linear application. The gradient of $\varphi$ is then $\mathbf{g}$. The gradient $\mathbf{g} = \{g_i\}$ at $O$ allows us to associate a scalar to any vector $\mathbf{V} = \{V^i\}$ (also at $O$): $\lambda = g_i V^i = \langle\, \mathbf{g}, \mathbf{V} \,\rangle$. This scalar is the difference of the values at the top and the bottom of the arrow representing the vector $\mathbf{V}$ on the local tangent linear application to $\varphi$ at $O$. The index on the gradient can be a lower index, as the gradient is not a vector.

Note: say that figure 1.8 illustrates the fact that an element of the dual space can be represented as a "mille-feuilles" in the "primal" space or as an "arrow" in the dual space. And reciprocally.

Note: say that figure 1.9 illustrates the sum of arrows and the sum of "mille-feuilles".

Note: say that figure 1.10 illustrates the sum of "mille-feuilles" in 3-D.

1.8.1.5 Natural Basis

A coordinate system associates, to any point of the space, its coordinates. Each individual coordinate can be seen as a function associating, to any point of the space, that particular coordinate. We can define the gradient of this scalar function. We will have as many gradients $\mathbf{f}^i$ as coordinates $x^i$.

Figure 1.5: The gradient of a function (i.e., of an application) at a point $x_0$ is the tangent linear application at that point. Let $x \mapsto f(x)$ represent the original (possibly nonlinear) application. The tangent linear application could be considered as mapping $x$ into the values given by the linearized approximation of $f(x)$: $x \mapsto F(x) = \alpha + \beta\, x$. (Note: explain better.) Rather, it is mathematically simpler to consider that the gradient maps increments of the independent variable $x$, $\Delta x = x - x_0$, into increments of the linearized dependent variable, $\Delta y = y - f(x_0)$: $\Delta x \mapsto \Delta y = \beta\, \Delta x$. (Note: explain this MUCH better.) (Figure to be redrawn.)

Figure 1.6: A scalar function $\varphi(P, T)$ depends on pressure and temperature. From a given point, two directions in the $\{P, T\}$ space are drawn. Which one corresponds to the gradient of $\varphi(P, T)$? In the figure at left, the pressure is indicated in International System units (m, kg, s), while in the figure at right the c.g.s. units (cm, g, s) are used (remember that 1 Pa = 10 dyne/cm²). From the left figure we may think that the gradient is direction A, while from the figure at right we may think it is B. It is neither: the right definition of the gradient (see text) only allows, as graphic representation, the result shown in figure 1.7.

Figure 1.7: Gradient of the function displayed in figure 1.6, at the considered point. As the gradient is the linear tangent application at the given point, it is a linear application, and its level lines are straight lines. The value of the gradient at the considered point equals the value of the original function at that point. The spacing of the level lines of the gradient corresponds to the spacing of the level lines of the original function around the point where the gradient is computed. The two figures shown here are perfectly equivalent, as they should be.

Figure 1.8: A point, at the left of the figure, may serve as the origin point for any vector we may want to represent. As usual, we may represent a vector $\mathbf{V}$ by an arrow. Then, a form $\mathbf{F}$ is represented by an oriented pattern of lines (or by an oriented pattern of surfaces in 3-D), with the line of zero value passing through the origin point. Each line has a value, which is the number that the form associates to any vector whose end point is on the line. Here, $\mathbf{V}$ and $\mathbf{F}$ are such that $\langle\, \mathbf{F}, \mathbf{V} \,\rangle = 2$. But a form is an element of the dual space, which is also a linear space. In the dual space, then, the form $\mathbf{F}$ can be represented by an arrow (figure at right). In turn, $\mathbf{V}$ is represented, in the dual space, by a pattern of lines.

Figure 1.9: When representing vectors by arrows, the sum of two vectors is given by the main diagonal of the "parallelogram" drawn by the two arrows. When, then, a form is represented by a pattern of lines, the sum of two forms can be geometrically obtained using the "parallelogram" defined by the principal lozenge (containing the origin, and with positive sense for both forms): the secondary diagonal of the lozenge is a line of the sum of the two forms. Note: explain this better.

Figure 1.10: Sum of two forms, as in the previous figure, but here in 3-D. Note: explain that this figure can be "sheared" as one wants (we do not need to have a metric). Note: explain this better.

Figure 1.11: A system of coordinates, at left, and their gradients, at right. These gradients are forms. When, in an n-dimensional space, we have n forms, we can define n associated vectors by $\langle\, \mathbf{f}^i, \mathbf{e}_j \,\rangle = \delta^i{}_j$. (Figure to be redrawn.)
Since a gradient, as we have seen, is a form, we will have as many forms as coordinates. The usual requirements that coordinate systems have to fulfill (different points of the space have different coordinates, and vice versa) give n linearly independent forms (we cannot obtain one of them by linear combination of the others), i.e., a basis for the forms. If we have a basis $\{\mathbf{f}^i\}$ of forms, then we can introduce a basis $\{\mathbf{e}_i\}$ of vectors, through

$\langle\, \mathbf{f}^i, \mathbf{e}_j \,\rangle = \delta^i{}_j$   (1.142)

If we define the components $V^i$ of a vector $\mathbf{V}$ by

$\mathbf{V} = V^i\, \mathbf{e}_i$   (1.143)

then we can compute the components $V^i$ by the formula

$V^i = \langle\, \mathbf{f}^i, \mathbf{V} \,\rangle$   (1.144)

as we have

$\langle\, \mathbf{f}^i, \mathbf{V} \,\rangle = \langle\, \mathbf{f}^i, V^j\, \mathbf{e}_j \,\rangle = V^j\, \langle\, \mathbf{f}^i, \mathbf{e}_j \,\rangle = V^j\, \delta^i{}_j = V^i$   (1.145)

Note that the computation of the components of a vector does not involve a scalar product, but a duality product. To find the equivalent of equations 1.143 and 1.144 for forms, one defines the components $F_i$ of a form $\mathbf{F}$ by

$\mathbf{F} = F_i\, \mathbf{f}^i$   (1.146)

and one easily gets

$F_i = \langle\, \mathbf{F}, \mathbf{e}_i \,\rangle$   (1.147)

The notation $\mathbf{e}_i$ for the basis of vectors is quite universal. Although the notation $\mathbf{f}^i$ seems well adapted for a basis of forms, it is quite common to use the same letter for the basis of forms as for the basis of vectors. In what follows, we will use the notation

$\mathbf{e}^i \equiv \mathbf{f}^i$   (1.148)

whose dangerousness vanishes only if we have a metric, i.e., when we can give sense to an expression like $\mathbf{e}^i = g^{ij}\, \mathbf{e}_j$. Using this notation, the expressions

$\mathbf{V} = V^i\, \mathbf{e}_i \;\Longleftrightarrow\; V^i = \langle\, \mathbf{f}^i, \mathbf{V} \,\rangle \;; \qquad \mathbf{F} = F_i\, \mathbf{f}^i \;\Longleftrightarrow\; F_i = \langle\, \mathbf{F}, \mathbf{e}_i \,\rangle$   (1.149)

become

$\mathbf{V} = V^i\, \mathbf{e}_i \;\Longleftrightarrow\; V^i = \langle\, \mathbf{e}^i, \mathbf{V} \,\rangle \;; \qquad \mathbf{F} = F_i\, \mathbf{e}^i \;\Longleftrightarrow\; F_i = \langle\, \mathbf{F}, \mathbf{e}_i \,\rangle$   (1.150)

We now have bases for vectors and forms, so we can write expressions like $\mathbf{V} = V^i\, \mathbf{e}_i$ and $\mathbf{F} = F_i\, \mathbf{e}^i$. We also need bases for objects "with more than one index", so that we can write expressions like

$\mathbf{B} = B^{ij}\, \mathbf{e}_{ij} \;; \quad \mathbf{C} = C_{ij}\, \mathbf{e}^{ij} \;; \quad \mathbf{D} = D_i{}^j\, \mathbf{e}^i{}_j \;; \quad \mathbf{E} = E_{ijk\ldots}{}^{\ell mn\ldots}\, \mathbf{e}^{ijk\ldots}{}_{\ell mn\ldots}$   (1.151)

The introduction of these bases raises a difficulty.
While we have an immediate intuitive representation of vectors (as "arrows") and of forms (as "mille-feuilles"), tensor objects of higher rank are more difficult to represent. If a symmetric 2-tensor, like the stress tensor $\sigma^{ij}$ of mechanics, can be viewed as an ellipsoid, how could we view a tensor $T_{ijk}{}^{\ell m}$? It is the power of mathematics to suggest analogies, so we can work even without geometric interpretations. But this absence of an intuitive interpretation of high-rank tensors tells us that we will have to introduce the bases for these objects in a non-intuitive way. Essentially, what we want is that the bases for high-rank tensors are not independent of the bases of vectors and forms. We want, in fact, more than this. Given two vectors $U^i$ and $V^i$, we understand what we mean when we define a 2-tensor $\mathbf{W}$ by $W^{ij} = U^i V^j$. The basis for 2-tensors is perfectly defined by the condition that the components of $\mathbf{W}$ are precisely $U^i V^j$, and not, for instance, the values obtained after some rotation or change of coordinates. This is enough, and we could directly use the notations introduced in equations 1.151. Instead, common mathematical developments introduce the notion of "tensor product" and, instead of notations like $\mathbf{e}_{ij}$, $\mathbf{e}^{ij}$, $\mathbf{e}^i{}_j$, or $\mathbf{e}^{ijk\ldots}{}_{\ell mn\ldots}$, introduce the notations $\mathbf{e}_i \otimes \mathbf{e}_j$, $\mathbf{e}^i \otimes \mathbf{e}^j$, $\mathbf{e}^i \otimes \mathbf{e}_j$, or $\mathbf{e}^i \otimes \mathbf{e}^j \otimes \mathbf{e}^k \otimes \cdots \otimes \mathbf{e}_\ell \otimes \mathbf{e}_m \otimes \mathbf{e}_n \otimes \cdots$. Then, equations 1.151 are written

$\mathbf{B} = B^{ij}\, \mathbf{e}_i \otimes \mathbf{e}_j \;; \quad \mathbf{C} = C_{ij}\, \mathbf{e}^i \otimes \mathbf{e}^j \;; \quad \mathbf{D} = D_i{}^j\, \mathbf{e}^i \otimes \mathbf{e}_j \;; \quad \mathbf{E} = E_{ijk\ldots}{}^{\ell mn\ldots}\, \mathbf{e}^i \otimes \mathbf{e}^j \otimes \mathbf{e}^k \otimes \cdots \otimes \mathbf{e}_\ell \otimes \mathbf{e}_m \otimes \mathbf{e}_n \otimes \cdots$   (1.152)

What follows is an old text, to be updated. The metric tensor has been introduced in section 1.3. Let us show here that, if the space under consideration has a scalar product, then the metric can be computed. Here, the scalar product of two vectors $\mathbf{V}$ and $\mathbf{W}$ is denoted $\mathbf{V} \cdot \mathbf{W}$. Then, defining

$d\mathbf{r} = dx^i\, \mathbf{e}_i$   (1.153)

and

$ds^2 = d\mathbf{r} \cdot d\mathbf{r}$   (1.154)

gives

$ds^2 = d\mathbf{r} \cdot d\mathbf{r} = (dx^i\, \mathbf{e}_i) \cdot (dx^j\, \mathbf{e}_j) = (\mathbf{e}_i \cdot \mathbf{e}_j)\, dx^i\, dx^j$   (1.155)

Defining the metric tensor

$g_{ij} = \mathbf{e}_i \cdot \mathbf{e}_j$   (1.156)

then gives

$ds^2 = g_{ij}\, dx^i\, dx^j$   (1.157)

To emphasize that at every point of the manifold we have a different tensor space, and a different basis, we can always write explicitly the dependence of the basis vectors on the coordinates, as in $\mathbf{e}_i(x)$. Equation 1.143 is then just a short notation for

$\mathbf{V}(x) = V^i(x)\, \mathbf{e}_i(x)$   (1.158)

while equation 1.146 is a short notation for

$\mathbf{F}(x) = F_i(x)\, \mathbf{e}^i(x)$   (1.159)

Here, and in most places of this book, the notation $x$ is a shortcut for $\{x^1, x^2, \ldots\}$. The reader should just remember that $x$ represents a point of the space, but that it is not a vector.

It is important to realize that, when dealing with tensor mathematics, a single basis is a basis for all the vector spaces at the considered point. For instance, the vector $\mathbf{V}$ may be a velocity, and the vector $\mathbf{E}$ may be an electric field. The two vectors belong to different vector spaces, but they are obtained as "linear combinations" of the same basis vectors:

$\mathbf{V} = V^i\, \mathbf{e}_i \qquad \mathbf{E} = E^i\, \mathbf{e}_i$   (1.160)

but, of course, the components are not pure real numbers: they have dimensions. Box ?? recalls what the dimensions of components are.

Let us examine the components of the basis vectors (on the basis they define). Obviously,

$(\mathbf{e}_i)^j = \delta_i{}^j \qquad (\mathbf{e}^j)_i = \delta_i{}^j$   (1.161)

or, explicitly,

$\mathbf{e}_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \\ \vdots \end{pmatrix} \qquad \mathbf{e}_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \\ \vdots \end{pmatrix} \qquad \ldots$   (1.162)

Equivalently, for the basis of 2-tensors we have

$(\mathbf{e}_i \otimes \mathbf{e}_j)^{k\ell} = \delta_i{}^k\, \delta_j{}^\ell$   (1.163)

or, explicitly,

$\mathbf{e}_1 \otimes \mathbf{e}_1 = \begin{pmatrix} 1 & 0 & \cdots \\ 0 & 0 & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \quad \mathbf{e}_1 \otimes \mathbf{e}_2 = \begin{pmatrix} 0 & 1 & \cdots \\ 0 & 0 & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \quad \mathbf{e}_2 \otimes \mathbf{e}_1 = \begin{pmatrix} 0 & 0 & \cdots \\ 1 & 0 & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \quad \mathbf{e}_2 \otimes \mathbf{e}_2 = \begin{pmatrix} 0 & 0 & \cdots \\ 0 & 1 & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \quad \ldots$   (1.164)

and similar formulas for the other bases.

Note: say somewhere that the definition of basis vectors given above imposes that the vectors of the natural basis are, at any point, tangent to the coordinate lines at that point.
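This tangency, and the relation $g_{ij} = \mathbf{e}_i \cdot \mathbf{e}_j$, can be sketched numerically for polar coordinates: differentiating the map $(r, \varphi) \mapsto (x, y)$ along each coordinate line yields the natural basis, with $\|\mathbf{e}_r\| = 1$ and $\|\mathbf{e}_\varphi\| = r$ (the evaluation point below is arbitrary).

```python
import numpy as np

def X(r, phi):                        # polar coordinates -> Cartesian point
    return np.array([r * np.cos(phi), r * np.sin(phi)])

r0, phi0, h = 2.0, 0.8, 1e-6

# Natural basis vectors: partial derivatives of the point with respect to
# each coordinate, i.e., vectors tangent to the coordinate lines.
e_r   = (X(r0 + h, phi0) - X(r0 - h, phi0)) / (2 * h)
e_phi = (X(r0, phi0 + h) - X(r0, phi0 - h)) / (2 * h)

print(np.linalg.norm(e_r))            # 1
print(np.linalg.norm(e_phi))          # r0 = 2, up to finite-difference error
print(np.dot(e_r, e_phi))             # ~ 0: the coordinate lines are orthogonal

# g_ij = e_i . e_j gives diag(1, r^2), i.e., ds^2 = dr^2 + r^2 dphi^2.
```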
The notion of tangency is independent of the existence, or not, of a metric, i.e., of the possibility of measuring distances in the space. This is not so for the notion of perpendicularity, which makes sense only if we can measure distances (and, therefore, angles). In general, then, the vectors of the natural basis are tangent to the coordinate lines. When a metric has been introduced, the vectors of the natural basis at a given point will be mutually perpendicular only if the coordinate lines themselves are mutually perpendicular at that point. Ordinary coordinates in the Euclidean 3-D space (Cartesian, cylindrical, spherical, ...) define coordinate lines that are orthogonal at every point. Then, the vectors of the natural basis will also be mutually orthogonal at all points. But the vectors of the natural basis are not, in general, normed to 1 . For instance, figure XXX illustrates the fact that the norms of the vectors of the natural basis in polar coordinates are, at point (r, ϕ) , ‖e_r‖ = 1 and ‖e_ϕ‖ = r .

1.8.1.6 Tensor Components

Consider an n-dimensional manifold X . At any point P of the manifold, one can consider the linear space L that is tangent to the manifold at that point, and its dual L* . One can also consider, at point P , the 'tensor product' of spaces L(p, q) = L ⊗ L ⊗ ··· ⊗ L (p times) ⊗ L* ⊗ L* ⊗ ··· ⊗ L* (q times). A 'p-times contravariant, q-times covariant tensor' at point P of the manifold is an element of L(p, q) . When a coordinate system x = {x¹, ..., xⁿ} is chosen over X , one has, at point P , the 'natural basis' for the linear tangent space L , say {e_i} , and, by virtue of the tensor product, also a basis for the space L(p, q) , say {e_i1 ⊗ e_i2 ⊗ ··· ⊗ e_ip ⊗ e^j1 ⊗ e^j2 ⊗ ··· ⊗ e^jq} . Any tensor T at point P of X can then be developed on this basis,

T = T_x^{i1 i2 ... ip}_{j1 j2 ... jq} e_i1 ⊗ e_i2 ⊗ ··· ⊗ e_ip ⊗ e^j1 ⊗ e^j2 ⊗ ··· ⊗ e^jq , (1.165)

to define the natural components of the tensor, T_x^{i1 i2 ... ip}_{j1 j2 ... jq} .
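The norms of the natural basis vectors just discussed follow from ‖e_i‖ = √(g_ii). A small numerical sketch (my own check, assuming numpy; the basis vectors are again obtained as Jacobian columns by finite differences) recovers the values ‖e_r‖ = 1, ‖e_θ‖ = r, ‖e_ϕ‖ = r sin θ quoted later for spherical coordinates:

```python
import numpy as np

def spherical_to_cartesian(q):
    # theta is the colatitude, as in the book's spherical coordinates
    r, th, ph = q
    return np.array([r * np.sin(th) * np.cos(ph),
                     r * np.sin(th) * np.sin(ph),
                     r * np.cos(th)])

def jacobian(f, q, h=1e-6):
    """Columns are the natural basis vectors e_i, tangent to the
    coordinate lines (central differences)."""
    q = np.asarray(q, float)
    return np.column_stack([(f(q + h * e) - f(q - h * e)) / (2 * h)
                            for e in np.eye(len(q))])

r, th, ph = 2.0, 0.6, 1.1
J = jacobian(spherical_to_cartesian, (r, th, ph))
norms = np.linalg.norm(J, axis=0)   # expect [1, r, r sin(theta)]
print(np.round(norms, 6))
```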
They are intimately linked to the coordinate system chosen over X , as this coordinate system has induced the natural basis {e_i} at the considered point P . The index x in the components is there to recall this fact. It is essential when different coordinates are going to be simultaneously considered, but it can be dropped when there is no possible confusion about the coordinate system being used. Its lower or upper position may be chosen for typographical clarity, and, of course, has no special variance meaning.

1.8.1.7 Tensors in Metric Spaces

Comment: explain here that it is possible to give a lot of structure to a manifold (tangent linear space, (covariant) derivation, etc.) without the need of a metric. It is introduced here to simplify the text, as, if not, we would have needed to come back to most of the results to add the particular properties arising when there is a metric. But, in all rigor, it would be preferable to introduce the metric after, for instance, the definition of covariant differentiation, which does not need it.

Having a metric in a differential manifold means being able to define the length of a line. This then implies that we can define a scalar product at every local tangent linear space (and, thus, the angle between two crossing lines). The metric also allows us to define a natural bijection between vectors and forms, and between tensors, densities and capacities. A metric is defined when a second-rank symmetric form g with components g_ij is given. The length L of a line x^i(λ) is then defined by the line integral

L = ∫_λ ds , (1.166)

where

ds² = g_ij dx^i dx^j . (1.167)

Once we have a metric, it is possible to define a bijection between forms and vectors. For, to the vector V with components V^i we can associate the form F with components

F_i = g_ij V^j . (1.168)

Then, it is customary to use the same letter to designate a vector and a form that are linked by this natural bijection, as in V_i = g_ij V^j .
(1.169)

The inverse of the previous equation is written

V^i = g^ij V_j , (1.170)

where

g_ij g^jk = δ_i^k . (1.171)

The reader will easily give sense to the expression

e^i = g^ij e_j . (1.172)

The equations above, and equations like

T_ij...^kl... = g_ip g_jq ⋯ g^kr g^ls ⋯ T^pq..._rs... , (1.173)

are summarized by saying that "the metric tensor allows us to raise and lower indices". The value of the metric at a particular point of the manifold allows us to define a scalar product for the vectors in the local tangent linear space. Denoting the scalar product of two vectors V and W by V · W , we can use any of the definitions

V · W = g_ij V^i W^j = V_i W^i = V^i W_i . (1.174)

To define parallel transportation of tensors, we have introduced a connection Γ_ij^k . Now that we have a metric, we may wonder whether a parallel-transported vector keeps a constant length. It is easy to show (see demonstration in [Comment: where?]) that this is true if we have the compatibility condition

∇_i g_jk = 0 , (1.175)

i.e.,

∂_i g_jk = g_sk Γ_ij^s + g_js Γ_ik^s . (1.176)

The compatibility condition 1.175 implies that the metric tensor and the nabla symbol commute:

∇_i (g_jk T^pq..._rs...) = g_jk (∇_i T^pq..._rs...) , (1.177)

which, in fact, means that it is equivalent to take a covariant derivative, then raise or lower an index, or first raise or lower an index, then take the covariant derivative. Note: introduce somewhere the notation

Γ_ijk = g_ks Γ_ij^s , (1.178)

warn the reader that this is just a notation: the connection coefficients are not the components of a tensor; and say that if the condition 1.175 holds, then it is possible to compute the connection coefficients from the metric and the torsion:

Γ_ijk = (1/2)(∂_i g_jk + ∂_j g_ik − ∂_k g_ij) + (1/2)(S_ijk + S_kij + S_kji) . (1.179)

As the basis vectors have components

(e_i)^j = δ_i^j , (1.180)

we have

e_i · e_j = g_ij . (1.181)

Defining

dr = dx^i e_i (1.182)

gives then

dr · dr = ds² .
(1.183)

We have seen that the metric can be used to define a natural bijection between forms and vectors. Let us now see that it can also be used to define a natural bijection between tensors, densities, and capacities. We denote by g̿ the determinant of g_ij :

g̿ = det({g_ij}) = (1/n!) ε̄^ijk... ε̄^pqr... g_ip g_jq g_kr ⋯ . (1.184)

The two upper bars recall that g̿ is a second-order density, as there is the product of two densities at the right-hand side. For a reason that will become obvious soon, the square root of g̿ is denoted ḡ :

g̿ = ḡ ḡ . (1.185)

In (Comment: where?) we demonstrate that we have

∂_i ḡ = ḡ Γ_is^s . (1.186)

Using expression (Comment: which one?) for the (covariant) derivative of a scalar density, this simply gives

∇_i ḡ = ∂_i ḡ − ḡ Γ_is^s = 0 , (1.187)

which is consistent with the fact that

∇_i g_jk = 0 . (1.188)

We can also define the determinant of g^ij :

g̳ = det({g^ij}) = (1/n!) ε̲_ijk... ε̲_pqr... g^ip g^jq g^kr ⋯ , (1.189)

and its square root g̲ :

g̳ = g̲ g̲ . (1.190)

As the matrices g_ij and g^ij are mutually inverse, we have

ḡ g̲ = 1 . (1.191)

Using the scalar density ḡ and the scalar capacity g̲ we can associate tensor densities, pure tensors, and tensor capacities. Using the same letter to designate the objects related through this natural bijection, we will write expressions like

ρ̄ = ḡ ρ , (1.192)

V̄^i = ḡ V^i , (1.193)

or

T̄_ij...^kl... = ḡ T_ij...^kl... . (1.194)

So, if g_ij and g^ij can be used to "lower and raise indices", ḡ and g̲ can be used to "put and remove bars". Comment: say somewhere that ḡ is the density of volumetric content, as the volume element of a metric space is given by

dV = ḡ dτ̲ , (1.195)

where dτ̲ is the capacity element defined in (Comment: where?), which, when we take an element along the coordinate lines, equals dx¹ dx² dx³ ⋯ .
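The index gymnastics of this section can be exercised numerically. The sketch below (my own illustration, assuming numpy) lowers and raises indices with the spherical metric g_ij = diag(1, r², r² sin²θ) and checks the three equivalent forms of the scalar product in equation 1.174:

```python
import numpy as np

r, th = 2.0, 0.6
g  = np.diag([1.0, r**2, (r * np.sin(th))**2])   # g_ij (spherical metric)
gi = np.linalg.inv(g)                            # g^ij, so g_ij g^jk = delta_i^k

V_up = np.array([1.0, 0.5, -0.3])    # contravariant components V^i
V_dn = g @ V_up                      # V_i = g_ij V^j   (lowering, eq. 1.169)
V_back = gi @ V_dn                   # V^i = g^ij V_j   (raising, eq. 1.170)

W_up = np.array([0.2, -1.0, 0.7])
W_dn = g @ W_up

# Three expressions of the scalar product (eq. 1.174) must agree:
s1 = V_up @ g @ W_up     # g_ij V^i W^j
s2 = V_dn @ W_up         # V_i W^i
s3 = V_up @ W_dn         # V^i W_i
```

The "bar" bijection of equations 1.192-1.194 is, in the same spirit, just a multiplication of all components by the scalar density ḡ (here √(det g) = r² sin θ).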
Comment: Say that we can demonstrate that, in a Euclidean space, the matrix representing the metric equals the product of the transposed Jacobian matrix times the Jacobian matrix,

{g_ij} = Jᵀ J , where J is the matrix of partial derivatives J^K_i = ∂X^K/∂x^i . (1.196)

In short,

g_ij = Σ_K (∂X^K/∂x^i)(∂X^K/∂x^j) . (1.197)

This follows directly from the general equation

g_ij = (∂X^I/∂x^i)(∂X^J/∂x^j) g_IJ (1.198)

using the fact that, if the {X^I} are Cartesian coordinates, {g_IJ} is the identity matrix,

g_IJ = δ_IJ . (1.199)

Comment: explain here that the metric introduces a bijection between forms and vectors:

V_i = g_ij V^j . (1.200)

Comment: introduce here the notation

(V, W) = g_ij V^i W^j = V_i W^i = W_i V^i . (1.201)

1.8.2 Appendix: Dimension of Components

Which dimensions have the components of a vector? Contrary to the bases of elementary calculus, the vectors defining the natural basis are not normed to one. Rather, it follows from g_ij = e_i · e_j that the length (i.e., the norm) of the basis vector e_i is ‖e_i‖ = √(g_ii) . For instance, while in the Euclidean 3-D space with Cartesian coordinates ‖e_x‖ = ‖e_y‖ = ‖e_z‖ = 1 , the use of spherical coordinates gives ‖e_r‖ = 1 , ‖e_θ‖ = r , ‖e_ϕ‖ = r sin θ . Denoting by [V] the physical dimension of (the norm of) a vector, this gives [e_i] = [√(g_ii)] . For instance, in Cartesian coordinates, [e_x] = [e_y] = [e_z] = 1 , and in spherical coordinates, [e_r] = 1 , [e_θ] = L , [e_ϕ] = L , where L represents the dimension of a length. A vector V = V^i e_i has components with dimensions

[V^i] = [V]/[e_i] = [V]/[√(g_ii)] .

For instance, in Cartesian coordinates, [V^x] = [V^y] = [V^z] = [V] , and in spherical coordinates, [V^r] = [V] , [V^θ] = [V]/L , [V^ϕ] = [V]/L . In general, the physical dimension of the component T_ij...^kl... of a tensor T is obtained by multiplying [T] by the dimension of each basis vector associated with a lower index and dividing by that of each basis vector associated with an upper index:
[T_ij...^kl...] = [T] ([e_i][e_j]⋯) / ([e_k][e_l]⋯) = [T] (√(g_ii) √(g_jj) ⋯) / (√(g_kk) √(g_ll) ⋯) .

1.8.3 Appendix: The Jacobian in Geographical Coordinates

Example 1.5 Let

x = {x, y, z} ; y = {r, ϕ, ϑ} (1.202)

respectively represent a Cartesian and a geographical system of coordinates over the Euclidean 3-D space,

x = r cos ϑ cos ϕ ; y = r cos ϑ sin ϕ ; z = r sin ϑ . (1.203)

The matrix of partial derivatives defined at the right of equation 1.2 is

X = ( cos ϑ cos ϕ   −r cos ϑ sin ϕ   −r sin ϑ cos ϕ
      cos ϑ sin ϕ    r cos ϑ cos ϕ   −r sin ϑ sin ϕ
      sin ϑ          0                r cos ϑ        ) . (1.204)

The matrix Y defined at the left of equation 1.2 could be computed by, first, solving equations 1.203 for the geographical coordinates as a function of the Cartesian ones, and, then, computing the partial derivatives. This would give the matrix Y as a function of {x, y, z} . More simply, we can just evaluate Y as X⁻¹ (equation 1.5), but this, of course, gives Y as a function of {r, ϕ, ϑ} :

Y = ( cos ϑ cos ϕ            cos ϑ sin ϕ           sin ϑ
      −sin ϕ/(r cos ϑ)       cos ϕ/(r cos ϑ)       0
      −sin ϑ cos ϕ/r         −sin ϑ sin ϕ/r        cos ϑ/r ) . (1.205)

The two Jacobian determinants are

X = 1/Y = r² cos ϑ . (1.206)

For the metric, as

ds² = dx² + dy² + dz² = dr² + r² cos²ϑ dϕ² + r² dϑ² , (1.207)

one has the volume densities (remember that ḡ = √(det g))

ḡ_x = 1 ; ḡ_y = r² cos ϑ . (1.208)

The comparison of these two last equations with equation 1.206 shows that one has

ḡ_y = X ḡ_x , (1.209)

in accordance with the general rule for the change of values of a scalar density under a change of variables (equation 1.12). Here, the fundamental capacity elements are

dv̲_x = dx ∧ dy ∧ dz ; dv̲_y = dr ∧ dϕ ∧ dϑ . (1.210)

Using the change of variables in equation 1.203 one obtains⁶

dx ∧ dy ∧ dz = r² cos ϑ dr ∧ dϕ ∧ dϑ , (1.211)

and inserting this into equation 1.210 gives

dv̲_y = (1/(r² cos ϑ)) dv̲_x = (1/X) dv̲_x , (1.212)

in accordance with the general rule for the change of values of a scalar capacity under a change of variables (equation 1.13). [End of example.]
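The two claims of the example, det X = r² cos ϑ and {g_ij} = XᵀX, can be verified numerically. A sketch (my own check, assuming numpy; the Jacobian is computed by central differences rather than analytically):

```python
import numpy as np

def geographic_to_cartesian(q):
    r, ph, th = q   # th is the latitude (not the colatitude)
    return np.array([r * np.cos(th) * np.cos(ph),
                     r * np.cos(th) * np.sin(ph),
                     r * np.sin(th)])

def jacobian(f, q, h=1e-6):
    q = np.asarray(q, float)
    return np.column_stack([(f(q + h * e) - f(q - h * e)) / (2 * h)
                            for e in np.eye(len(q))])

r, ph, th = 2.0, 0.8, 0.3
X = jacobian(geographic_to_cartesian, (r, ph, th))

detX = np.linalg.det(X)   # expect r^2 cos(th), equation 1.206
gmat = X.T @ X            # expect diag(1, r^2 cos^2(th), r^2), equation 1.207
```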
⁶ This results from the explicit computation of the exterior product dx ∧ dy ∧ dz , where dx = cos ϕ cos ϑ dr − r sin ϕ cos ϑ dϕ − r cos ϕ sin ϑ dϑ , dy = sin ϕ cos ϑ dr + r cos ϕ cos ϑ dϕ − r sin ϕ sin ϑ dϑ , and dz = sin ϑ dr + r cos ϑ dϑ .

1.8.4 Appendix: Kronecker Determinants in 2, 3 and 4 D

1.8.4.1 The Kronecker determinants in 2-D

δ_ij^kl = (1/0!) ε̲_ij ε̄^kl = δ_i^k δ_j^l − δ_i^l δ_j^k
δ_j^k = (1/1!) ε̲_ij ε̄^ik = δ_j^k
1 = (1/2!) ε̲_ij ε̄^ij (1.213)

1.8.4.2 The Kronecker determinants in 3-D

δ_ijk^lmn = (1/0!) ε̲_ijk ε̄^lmn
  = δ_i^l δ_j^m δ_k^n + δ_i^m δ_j^n δ_k^l + δ_i^n δ_j^l δ_k^m − δ_i^l δ_j^n δ_k^m − δ_i^n δ_j^m δ_k^l − δ_i^m δ_j^l δ_k^n
δ_jk^mn = (1/1!) ε̲_ijk ε̄^imn = δ_j^m δ_k^n − δ_j^n δ_k^m
δ_k^n = (1/2!) ε̲_ijk ε̄^ijn = δ_k^n
1 = (1/3!) ε̲_ijk ε̄^ijk (1.214)

1.8.4.3 The Kronecker determinants in 4-D

δ_ijkl^mnpq = (1/0!) ε̲_ijkl ε̄^mnpq = det ( δ_i^m δ_i^n δ_i^p δ_i^q ; δ_j^m δ_j^n δ_j^p δ_j^q ; δ_k^m δ_k^n δ_k^p δ_k^q ; δ_l^m δ_l^n δ_l^p δ_l^q ) ,
  i.e., the signed sum over the 24 permutations of the upper indices {m, n, p, q} ;
δ_jkl^mnp = (1/1!) ε̲_ijkl ε̄^imnp
  = δ_j^m δ_k^n δ_l^p + δ_j^n δ_k^p δ_l^m + δ_j^p δ_k^m δ_l^n − δ_j^m δ_k^p δ_l^n − δ_j^n δ_k^m δ_l^p − δ_j^p δ_k^n δ_l^m
δ_kl^mn = (1/2!) ε̲_ijkl ε̄^ijmn = δ_k^m δ_l^n − δ_k^n δ_l^m
δ_l^m = (1/3!) ε̲_ijkl ε̄^ijkm = δ_l^m
1 = (1/4!) ε̲_ijkl ε̄^ijkl (1.215)

1.8.5 Appendix: Definition of Vectors

Consider the 3-D physical space, with coordinates {x^i} = {x¹, x², x³} . In classical mechanics, the trajectory of a particle is described by the three functions of time x^i(t) . Obviously the three values {x¹, x², x³} are not the components of a vector, as an expression like x^i(t) = x^i_I(t) + x^i_II(t) has, in general, no sense (think, for instance, of the case where we use spherical coordinates).
Define now the velocity of the particle at time t₀ :

v^i(t₀) = (dx^i/dt)|_{t=t₀} .

If two particles coincide at some point {x₀¹, x₀², x₀³} of the space, it makes sense to define, for instance, their relative velocity by v^i(x₀¹, x₀², x₀³, t₀) = v_I^i(x₀¹, x₀², x₀³, t₀) − v_II^i(x₀¹, x₀², x₀³, t₀) . The v^i are the components of a vector. If we change coordinates, x'^I = x'^I(x^j) , then the velocity is defined, in the new coordinate system, as v'^I = dx'^I/dt , and we have v'^I = dx'^I/dt = (∂x'^I/∂x^i)(dx^i/dt) , i.e.,

v'^I = (∂x'^I/∂x^i) v^i ,

which is the standard rule for the transformation of the components of a vector when the coordinates (and, so, the natural basis) change. Objects with upper or lower indices are not always tensors. The four classical objects which do not necessarily have tensorial character are:

• the coordinates {x^i} ,
• the partial differential operator ∂_i ,
• the connection coefficients Γ_ij^k ,
• the elements of the Jacobian matrix J_i^I = ∂x'^I/∂x^i .

1.8.6 Appendix: Change of Components

Table 1.1 collects how the components of capacities, tensors and densities change under a change of variables. Writing J_i^I = ∂x'^I/∂x^i and J_I^i = ∂x^i/∂x'^I for the two Jacobian matrices, every index of a tensor is transformed with the corresponding Jacobian matrix; in addition, the components of a capacity are multiplied by the Jacobian capacity J̲ = det{J_I^i} , and those of a density by the Jacobian density J̄ = 1/J̲ (compare equations 1.209 and 1.212):

0-rank:              s̲' = J̲ s̲ ;                        s' = s ;                      s̄' = J̄ s̄
1-form:              F̲'_I = J̲ J_I^i F̲_i ;              F'_I = J_I^i F_i ;            F̄'_I = J̄ J_I^i F̄_i
1-vector:            V̲'^I = J̲ V̲^i J_i^I ;              V'^I = V^i J_i^I ;            V̄'^I = J̄ V̄^i J_i^I
2-form:              Q'_IJ = J_I^i J_J^j Q_ij , with the extra factor J̲ or J̄ for the capacity and density versions
(1-form)-(1-vector): R'_I^J = J_I^i R_i^j J_j^J , idem
2-vector:            T'^IJ = T^ij J_i^I J_j^J , idem
...

Table 1.1: Changes of the components of the capacities, tensors and densities under a change of variables.
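The transformation rule v'^I = (∂x'^I/∂x^i) v^i of the previous appendix can be checked numerically. A sketch (my own illustration, assuming numpy; the trajectory r(t) = 1 + t², ϕ(t) = t/2 is arbitrary): the polar velocity components, pushed through the Jacobian of the map (r, ϕ) → (x, y), must agree with a direct differentiation of x(t), y(t):

```python
import numpy as np

t = 0.4
r, phi   = 1.0 + t**2, 0.5 * t    # an arbitrary trajectory in polar coordinates
dr, dphi = 2 * t, 0.5             # its velocity components v^r, v^phi

# Jacobian of (x, y) = (r cos phi, r sin phi) with respect to (r, phi)
J = np.array([[np.cos(phi), -r * np.sin(phi)],
              [np.sin(phi),  r * np.cos(phi)]])

v_cart = J @ np.array([dr, dphi])   # v'^I = (dx'^I/dx^i) v^i

# Direct check: differentiate the Cartesian trajectory by central differences
def xy(t):
    r, phi = 1.0 + t**2, 0.5 * t
    return np.array([r * np.cos(phi), r * np.sin(phi)])

h = 1e-6
v_direct = (xy(t + h) - xy(t - h)) / (2 * h)
```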
1.8.7 Appendix: Covariant Derivatives

Table 1.2 gives the covariant derivatives of capacities, tensors and densities. Writing Γ_k ≡ Γ_ks^s for the contracted connection, the formulas for tensors are the usual ones; a capacity acquires the additional term +Γ_k times the object, and a density the additional term −Γ_k times the object. For a scalar,

∇_k s̲ = ∂_k s̲ + Γ_k s̲ ; ∇_k s = ∂_k s ; ∇_k s̄ = ∂_k s̄ − Γ_k s̄ ,

and, for the pure-tensor column,

∇_k F_i = ∂_k F_i − Γ_ki^s F_s
∇_k V^i = ∂_k V^i + Γ_ks^i V^s
∇_k Q_ij = ∂_k Q_ij − Γ_ki^s Q_sj − Γ_kj^s Q_is
∇_k R_i^j = ∂_k R_i^j − Γ_ki^s R_s^j + Γ_ks^j R_i^s
∇_k S^i_j = ∂_k S^i_j + Γ_ks^i S^s_j − Γ_kj^s S^i_s
∇_k T^ij = ∂_k T^ij + Γ_ks^i T^sj + Γ_ks^j T^is
...

(for the corresponding capacities and densities, add respectively +Γ_k or −Γ_k times the object to the right-hand side).

Table 1.2: Covariant derivatives for capacities, tensors and densities.

1.8.8 Appendix: Formulas of Vector Analysis

Let a , b , and c be vector fields, ϕ a scalar field, and ∆a the vector Laplacian (the Laplacian applied to each component of the vector). The following list of identities holds:

div rot a = 0 (1.216)
rot grad ϕ = 0 (1.217)
div(ϕa) = (grad ϕ) · a + ϕ(div a) (1.218)
rot(ϕa) = (grad ϕ) × a + ϕ(rot a) (1.219)
grad(a · b) = (a · ∇)b + (b · ∇)a + a × (rot b) + b × (rot a) (1.220)
div(a × b) = b · (rot a) − a · (rot b) (1.221)
rot(a × b) = a(div b) − b(div a) + (b · ∇)a − (a · ∇)b (1.222)
rot rot a = grad(div a) − ∆a . (1.223)

Using the nabla symbol everywhere, these equations become:

∇ · (∇ × a) = 0 (1.224)
∇ × (∇ϕ) = 0 (1.225)
∇ · (ϕa) = (∇ϕ) · a + ϕ(∇ · a) (1.226)
∇ × (ϕa) = (∇ϕ) × a + ϕ(∇ × a) (1.227)
∇(a · b) = (a · ∇)b + (b · ∇)a + a × (∇ × b) + b × (∇ × a) (1.228)
∇ · (a × b) = b · (∇ × a) − a · (∇ × b) (1.229)
∇ × (a × b) = a(∇ · b) − b(∇ · a) + (b · ∇)a − (a · ∇)b (1.230)
∇ × (∇ × a) = ∇(∇ · a) − ∆a .
(1.231)

The following three vector equations are also often useful:

a · (b × c) = b · (c × a) = c · (a × b) (1.232)
a × (b × c) = (a · c) b − (a · b) c (1.233)
(a × b) · (c × d) = a · [b × (c × d)] = (a · c)(b · d) − (a · d)(b · c) . (1.234)

As, in tensor notations, the scalar product of two vectors is a · b = a_i b^i , and the vector product has components (a × b)^i = ε^ijk a_j b_k (see section XXX), the identities 1.224, 1.225 and 1.232-1.234 correspond respectively to:

∇_i (ε^ijk ∇_j a_k) = 0 (1.235)
ε^ijk ∇_j ∇_k ϕ = 0 (1.236)
a_i ε^ijk b_j c_k = b_i ε^ijk c_j a_k = c_i ε^ijk a_j b_k (1.237)
ε^ijk a_j (ε_kℓm b^ℓ c^m) = (a_j c^j) b^i − (a_j b^j) c^i (1.238)
(ε^ijk a_j b_k)(ε_iℓm c^ℓ d^m) = a^i (ε_ijk b^j (ε^kℓm c_ℓ d_m)) , (1.239)

while the identities 1.226-1.231 correspond respectively to

∇_i (ϕ a^i) = (∇_i ϕ) a^i + ϕ (∇_i a^i) (1.240)
ε^ijk ∇_j (ϕ a_k) = ε^ijk (∇_j ϕ) a_k + ϕ ε^ijk ∇_j a_k (1.241)
∇^i (a_j b^j) = (a^j ∇_j) b^i + (b^j ∇_j) a^i + ε^ijk a_j (ε_kℓm ∇^ℓ b^m) + ε^ijk b_j (ε_kℓm ∇^ℓ a^m) (1.242)
∇_i (ε^ijk a_j b_k) = b_k ε^kij ∇_i a_j − a_j ε^jik ∇_i b_k (1.243)
ε^ijk ∇_j (ε_kℓm a^ℓ b^m) = a^i ∇_j b^j − b^i ∇_j a^j + b^j ∇_j a^i − a^j ∇_j b^i (1.244)
ε^ijk ∇_j (ε_kℓm ∇^ℓ a^m) = ∇^i (∇_j a^j) − ∇_j ∇^j a^i , (1.245)

where the (inelegant) notation ∇^i represents g^ij ∇_j . The truth of the set of equations 1.235-1.245, when not obvious, is easily demonstrated by the simple use of the property (see section XXX)

ε_ijk ε^kℓm = δ_i^ℓ δ_j^m − δ_i^m δ_j^ℓ . (1.246)

1.8.9 Appendix: Metric, Connection, etc. in Usual Coordinate Systems

[Note: This appendix shall probably be suppressed.]
1.8.9.1 Cartesian Coordinates

1.8.9.1.1 Line element

ds² = dx² + dy² + dz² (1.247)

1.8.9.1.2 Metric

{g_ij} = ( g_xx g_xy g_xz ; g_yx g_yy g_yz ; g_zx g_zy g_zz ) = diag(1, 1, 1) (1.248)

1.8.9.1.3 Fundamental density

ḡ = 1 (1.249)

1.8.9.1.4 Connection

All the connection coefficients vanish: Γ_ij^k = 0 . (1.250)

1.8.9.1.5 Contracted connection

Γ_x = Γ_y = Γ_z = 0 (1.251)

1.8.9.1.6 Relationship between covariant and contravariant components for first order tensors

V_x = V^x ; V_y = V^y ; V_z = V^z (1.252)

1.8.9.1.7 Relationship between covariant and contravariant components for second order tensors

T_ij = T_i^j = T^i_j = T^ij , componentwise (e.g., T_xy = T_x^y = T^x_y = T^xy). (1.253)

1.8.9.1.8 Norm of the vectors of the natural basis

‖e_x‖ = ‖e_y‖ = ‖e_z‖ = 1 (1.254)

1.8.9.1.9 Norm of the vectors of the normed basis

‖ê_x‖ = ‖ê_y‖ = ‖ê_z‖ = 1 (1.255)

1.8.9.1.10 Missing

Comment: give also the norms of the vectors of the dual basis.
1.8.9.1.11 Relations between components on the natural and the normed basis for first order tensors

Writing a hat for components on the normed basis, in Cartesian coordinates the two sets coincide:

V̂^x = V^x ; V̂^y = V^y ; V̂^z = V^z ; V̂_x = V_x ; V̂_y = V_y ; V̂_z = V_z . (1.256)

1.8.9.1.12 Relations between components on the natural and the normed basis for second order tensors

Likewise, T̂_ij = T_ij , T̂_i^j = T_i^j , and T̂^ij = T^ij for all components. (1.257)

————————————————

1.8.9.2 Spherical Coordinates

1.8.9.2.1 Line element

ds² = dr² + r² dθ² + r² sin²θ dϕ² (1.258)

1.8.9.2.2 Metric

{g_ij} = ( g_rr g_rθ g_rϕ ; g_θr g_θθ g_θϕ ; g_ϕr g_ϕθ g_ϕϕ ) = diag(1, r², r² sin²θ) (1.259)

1.8.9.2.3 Fundamental density

ḡ = r² sin θ (1.260)

1.8.9.2.4 Connection

Γ_θθ^r = −r ; Γ_ϕϕ^r = −r sin²θ ; Γ_rθ^θ = Γ_θr^θ = 1/r ; Γ_ϕϕ^θ = −sin θ cos θ ;
Γ_rϕ^ϕ = Γ_ϕr^ϕ = 1/r ; Γ_θϕ^ϕ = Γ_ϕθ^ϕ = cot θ ; (the others vanish) (1.261)

1.8.9.2.5 Contracted connection

Γ_r = 2/r ; Γ_θ = cot θ ; Γ_ϕ = 0 (1.262)

1.8.9.2.6 Relationship between covariant and contravariant components for first order tensors

V_r = V^r ; V_θ = r² V^θ ; V_ϕ = r² sin²θ V^ϕ (1.263)

1.8.9.2.7 Relationship between covariant and contravariant components for second order tensors

Each index is lowered with the diagonal metric, an index r costing a factor 1 , an index θ a factor r² , and an index ϕ a factor r² sin²θ ; for instance

T_rθ = r² T^rθ ; T_θ^ϕ = r² T^θϕ ; T_θϕ = r² · r² sin²θ T^θϕ , (1.264)

and similarly for all other components.

1.8.9.2.8 Norm of the vectors of the natural basis

‖e_r‖ = 1 ; ‖e_θ‖ = r ; ‖e_ϕ‖ = r sin θ (1.265)

1.8.9.2.9 Norm of the vectors of the normed basis

‖ê_r‖ = ‖ê_θ‖ = ‖ê_ϕ‖ = 1 (1.266)

1.8.9.2.10 Missing

Comment: give also the norms of the vectors of the dual basis.
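The contracted connection of the spherical table can be recovered from the fundamental density through equation 1.186, ∂_i ḡ = ḡ Γ_is^s , i.e. Γ_i = ∂_i ln ḡ . A numerical sketch (my own check, assuming numpy), with ḡ = r² sin θ :

```python
import numpy as np

r, th = 1.5, 0.8
gbar = lambda r, th: r**2 * np.sin(th)   # fundamental density, eq. 1.260

# Gamma_i = partial_i ln(gbar), by central differences (eq. 1.186)
h = 1e-6
Gamma_r  = (np.log(gbar(r + h, th)) - np.log(gbar(r - h, th))) / (2 * h)
Gamma_th = (np.log(gbar(r, th + h)) - np.log(gbar(r, th - h))) / (2 * h)
# expect Gamma_r = 2/r, Gamma_th = cot(theta), Gamma_phi = 0  (eq. 1.262)
```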
1.8.9.2.11 Relations between components on the natural and the normed basis for first order tensors

Writing a hat for components on the normed basis,

V̂^r = V^r ; V̂^θ = r V^θ ; V̂^ϕ = r sin θ V^ϕ ; V̂_r = V_r ; V̂_θ = (1/r) V_θ ; V̂_ϕ = (1/(r sin θ)) V_ϕ . (1.267)

1.8.9.2.12 Relations between components on the natural and the normed basis for second order tensors

Each contravariant index θ (respectively ϕ ) brings a factor r (respectively r sin θ ) when passing from natural to normed components, and each covariant index the inverse factor; for instance

T̂^θϕ = r · r sin θ T^θϕ ; T̂_θϕ = (1/(r² sin θ)) T_θϕ ; T̂_r^θ = r T_r^θ , (1.268)

and similarly for all other components.

Note: say somewhere in this appendix that the two following formulas are quite useful in deriving the formulas above:

(1/rⁿ) ∂(rⁿ ψ)/∂r = ∂ψ/∂r + n ψ/r (1.269)

(1/sinⁿϑ) ∂(sinⁿϑ ψ)/∂ϑ = ∂ψ/∂ϑ + n cot ϑ ψ . (1.270)

————————————————

1.8.9.3 Cylindrical Coordinates: Metric, Connection . . .
1.8.9.3.1 Line element

ds² = dr² + r² dϕ² + dz² (1.271)

1.8.9.3.2 Metric

{g_ij} = ( g_rr g_rϕ g_rz ; g_ϕr g_ϕϕ g_ϕz ; g_zr g_zϕ g_zz ) = diag(1, r², 1) (1.272)

1.8.9.3.3 Fundamental density

ḡ = r (1.273)

1.8.9.3.4 Connection

Γ_ϕϕ^r = −r ; Γ_rϕ^ϕ = Γ_ϕr^ϕ = 1/r ; (the others vanish) (1.274)

1.8.9.3.5 Contracted connection

Γ_r = 1/r ; Γ_ϕ = 0 ; Γ_z = 0 (1.275)

1.8.9.3.6 Relationship between covariant and contravariant components for first order tensors

V_r = V^r ; V_ϕ = r² V^ϕ ; V_z = V^z (1.276)

1.8.9.3.7 Relationship between covariant and contravariant components for second order tensors

Each index ϕ is lowered at the cost of a factor r² , the indices r and z at no cost; for instance

T_rϕ = r² T^rϕ ; T_ϕ^z = r² T^ϕz ; T_rz = T^rz , (1.277)

and similarly for all other components.

1.8.9.3.8 Norm of the vectors of the natural basis

‖e_r‖ = 1 ; ‖e_ϕ‖ = r ; ‖e_z‖ = 1 (1.278)

1.8.9.3.9 Norm of the vectors of the normed basis

‖ê_r‖ = ‖ê_ϕ‖ = ‖ê_z‖ = 1 (1.279)

1.8.9.3.10 Missing

Comment: give also the norms of the vectors of the dual basis.

1.8.9.3.11 Relations between components on the natural and the normed basis for first order tensors

Writing a hat for components on the normed basis,

V̂^r = V^r ; V̂^ϕ = r V^ϕ ; V̂^z = V^z ; V̂_r = V_r ; V̂_ϕ = (1/r) V_ϕ ; V̂_z = V_z . (1.280)

1.8.9.3.12 Relations between components on the natural and the normed basis for second order tensors

Each contravariant index ϕ brings a factor r , and each covariant index ϕ a factor 1/r ; for instance

T̂^rϕ = r T^rϕ ; T̂^ϕϕ = r² T^ϕϕ ; T̂_ϕz = (1/r) T_ϕz , (1.281)

and similarly for all other components.

1.8.10 Appendix: Gradient, Divergence and Curl in Usual Coordinate Systems

Here we analyze the 3-D Euclidean space, using Cartesian, spherical or cylindrical coordinates. The words scalar, vector, and tensor mean "true" scalars, vectors and tensors, respectively.
The scalar densities, vector densities and tensor densities (see section XXX) are named explicitly.

1.8.10.1 Definitions

If x → φ(x) is a scalar field, its gradient is the form defined by

G_i = ∇_i φ . (1.282)

If x → V̄^i(x) is a vector density field, its divergence is the scalar density defined by

D̄ = ∇_i V̄^i . (1.283)

If x → F_i(x) is a form field, its curl (or rotational) is the vector density defined by

R̄^i = ε̄^ijk ∇_j F_k . (1.284)

1.8.10.2 Properties

These definitions are such that we can replace everywhere true ("covariant") derivatives by partial derivatives (see exercise XXX). This gives, for the gradient,

G_i = ∇_i φ = ∂_i φ , (1.285)

for the divergence of a vector density,

D̄ = ∇_i V̄^i = ∂_i V̄^i , (1.286)

and for the curl of a form,

R̄^i = ε̄^ijk ∇_j F_k = ε̄^ijk ∂_j F_k (1.287)

[this equation is only valid for spaces without torsion; the general formula is R̄^i = ε̄^ijk ∇_j F_k = ε̄^ijk (∂_j F_k − (1/2) S_jk^ℓ F_ℓ) ].

These equations lead to particularly simple expressions. For instance, the following table shows that the explicit expressions have the same form for Cartesian, spherical and cylindrical coordinates (or for whatever coordinate system):

Cartesian: gradient G_x = ∂_x φ , G_y = ∂_y φ , G_z = ∂_z φ ; divergence D̄ = ∂_x V̄^x + ∂_y V̄^y + ∂_z V̄^z ; curl R̄^x = ∂_y F_z − ∂_z F_y , R̄^y = ∂_z F_x − ∂_x F_z , R̄^z = ∂_x F_y − ∂_y F_x .

Spherical: gradient G_r = ∂_r φ , G_θ = ∂_θ φ , G_ϕ = ∂_ϕ φ ; divergence D̄ = ∂_r V̄^r + ∂_θ V̄^θ + ∂_ϕ V̄^ϕ ; curl R̄^r = ∂_θ F_ϕ − ∂_ϕ F_θ , R̄^θ = ∂_ϕ F_r − ∂_r F_ϕ , R̄^ϕ = ∂_r F_θ − ∂_θ F_r .

Cylindrical: gradient G_r = ∂_r φ , G_ϕ = ∂_ϕ φ , G_z = ∂_z φ ; divergence D̄ = ∂_r V̄^r + ∂_ϕ V̄^ϕ + ∂_z V̄^z ; curl R̄^r = ∂_ϕ F_z − ∂_z F_ϕ , R̄^ϕ = ∂_z F_r − ∂_r F_z , R̄^z = ∂_r F_ϕ − ∂_ϕ F_r .

1.8.10.3 Remarks

Although we have only defined the gradient of a true scalar, the divergence of a vector density, and the curl of a form, the definitions can immediately be extended by "putting bars on" and "taking bars off" (see section XXX).
As an example, from equation 1.282, we can immediately write the definition of the gradient of a scalar density,

Ḡ_i = ∇_i φ̄ , (1.288)

from equation 1.283 we can write the definition of the divergence of a (true) vector field,

D = ∇_i V^i , (1.289)

and from equation 1.284 we can write the definition of the curl of a form as a true vector,

R^i = g̲ ε̄^ijk ∇_j F_k , (1.290)

or a true form,

R_i = g̲ g_iℓ ε̄^ℓjk ∇_j F_k . (1.291)

Although equation 1.289 seems well adapted to the practical computation of the divergence of a true vector, it is better to use 1.286 instead. For we have, successively,

D̄ = ∂_i V̄^i ⟺ ḡ D = ∂_i (ḡ V^i) ⟺ D = (1/ḡ) ∂_i (ḡ V^i) . (1.292)

This last expression directly provides compact expressions for the divergence of a vector. For instance, as the fundamental density ḡ takes, in Cartesian, spherical and cylindrical coordinates, respectively the values 1 , r² sin θ and r , this leads to the results of the following table:

Divergence, Cartesian coordinates: D = ∂V^x/∂x + ∂V^y/∂y + ∂V^z/∂z (1.293)

Divergence, spherical coordinates: D = (1/r²) ∂(r² V^r)/∂r + (1/sin θ) ∂(sin θ V^θ)/∂θ + ∂V^ϕ/∂ϕ (1.294)

Divergence, cylindrical coordinates: D = (1/r) ∂(r V^r)/∂r + ∂V^ϕ/∂ϕ + ∂V^z/∂z (1.295)

Replacing the components on the natural basis by the components on the normed basis (see section XXX; a hat denotes normed-basis components) gives

Divergence, Cartesian coordinates: D = ∂V̂^x/∂x + ∂V̂^y/∂y + ∂V̂^z/∂z (1.296)

Divergence, spherical coordinates: D = (1/r²) ∂(r² V̂^r)/∂r + (1/(r sin θ)) ∂(sin θ V̂^θ)/∂θ + (1/(r sin θ)) ∂V̂^ϕ/∂ϕ (1.297)

Divergence, cylindrical coordinates: D = (1/r) ∂(r V̂^r)/∂r + (1/r) ∂V̂^ϕ/∂ϕ + ∂V̂^z/∂z (1.298)

These are the formulas given in elementary texts (not using tensor concepts). Similarly, although 1.291 seems well adapted to a practical computation of the curl, it is better to go back to equation 1.287. We have, successively,

R̄^i = ε̄^ijk ∂_j F_k ⟺ ḡ R^i = ε̄^ijk ∂_j F_k ⟺ R^i = (1/ḡ) ε̄^ijk ∂_j F_k ⟺ R_i = (1/ḡ) g_iℓ ε̄^ℓjk ∂_j F_k .
(1.299)

This last expression directly provides compact expressions for the curl. For instance, as the fundamental density ḡ takes, in Cartesian, spherical and cylindrical coordinates, respectively the values 1 , r² sin θ and r , this leads to the results of the following table.

Curl, Cartesian coordinates:
R_x = ∂_y F_z − ∂_z F_y ; R_y = ∂_z F_x − ∂_x F_z ; R_z = ∂_x F_y − ∂_y F_x (1.300)

Curl, spherical coordinates:
R_r = (1/(r² sin θ)) (∂_θ F_ϕ − ∂_ϕ F_θ) ; R_θ = (1/sin θ) (∂_ϕ F_r − ∂_r F_ϕ) ; R_ϕ = sin θ (∂_r F_θ − ∂_θ F_r) (1.301)

Curl, cylindrical coordinates:
R_r = (1/r) (∂_ϕ F_z − ∂_z F_ϕ) ; R_ϕ = r (∂_z F_r − ∂_r F_z) ; R_z = (1/r) (∂_r F_ϕ − ∂_ϕ F_r) (1.302)

Replacing the components on the natural basis by the components on the normed basis (see section XXX; a hat denotes normed-basis components) gives

Curl, Cartesian coordinates:
R̂_x = ∂_y F̂_z − ∂_z F̂_y ; R̂_y = ∂_z F̂_x − ∂_x F̂_z ; R̂_z = ∂_x F̂_y − ∂_y F̂_x (1.303)

Curl, spherical coordinates:
R̂_r = (1/(r sin θ)) [ ∂(sin θ F̂_ϕ)/∂θ − ∂F̂_θ/∂ϕ ] ; R̂_θ = (1/r) [ (1/sin θ) ∂F̂_r/∂ϕ − ∂(r F̂_ϕ)/∂r ] ; R̂_ϕ = (1/r) [ ∂(r F̂_θ)/∂r − ∂F̂_r/∂θ ] (1.304)

Curl, cylindrical coordinates:
R̂_r = (1/r) ∂F̂_z/∂ϕ − ∂F̂_ϕ/∂z ; R̂_ϕ = ∂F̂_r/∂z − ∂F̂_z/∂r ; R̂_z = (1/r) [ ∂(r F̂_ϕ)/∂r − ∂F̂_r/∂ϕ ] (1.305)

These are the formulas given in elementary texts (not using tensor concepts). Comment: I should remember not to put this back in a table, as it is not very readable.

1.8.10.3.1 Comment: What follows is not very interesting and should be suppressed. From 1.288 we can write

ḡ G_i = ∇_i (ḡ φ) , (1.306)

which leads to the formula

G_i = (1/ḡ) ∇_i (ḡ φ) . (1.307)

For instance, as the fundamental density ḡ takes, in Cartesian, spherical and cylindrical coordinates, respectively the values 1 , r² sin θ and r , this leads to the results of the following table.
Gradient, Cartesian coordinates: G_x = ∂φ/∂x ; G_y = ∂φ/∂y ; G_z = ∂φ/∂z
Gradient, spherical coordinates: G_r = (1/r²) ∂(r² φ)/∂r ; G_θ = (1/sin θ) ∂(sin θ φ)/∂θ ; G_ϕ = ∂φ/∂ϕ
Gradient, cylindrical coordinates: G_r = (1/r) ∂(r φ)/∂r ; G_ϕ = ∂φ/∂ϕ ; G_z = ∂φ/∂z

1.8.11 Appendix: Connection and Derivative in Different Coordinate Systems

(Comment: mention here the boxes with different coordinate systems.)

1.8.11.1 Polar coordinates

(Two-dimensional Euclidean space with non-Cartesian coordinates.)

ds² = dr² + r² dϕ² (1.308)

Γ_rϕ^ϕ = 1/r ; Γ_ϕr^ϕ = 1/r ; Γ_ϕϕ^r = −r ; (the others vanish) (1.309)

R_ij = 0 (1.310)

∇_i V^i = (1/r) ∂(r V^r)/∂r + ∂V^ϕ/∂ϕ (1.311)

1.8.11.2 Cylindrical coordinates

(Three-dimensional Euclidean space with non-Cartesian coordinates.)

ds² = dr² + r² dϕ² + dz² (1.312)

Γ_rϕ^ϕ = 1/r ; Γ_ϕr^ϕ = 1/r ; Γ_ϕϕ^r = −r ; (the others vanish) (1.313)

R_ij = 0 (1.314)

∇_i V^i = (1/r) ∂(r V^r)/∂r + ∂V^ϕ/∂ϕ + ∂V^z/∂z (1.315)

1.8.11.3 Geographical coordinates

(Two-dimensional non-Euclidean space.)

ds² = R² (dθ² + sin²θ dϕ²) (1.316)

Γ_θϕ^ϕ = cot θ ; Γ_ϕθ^ϕ = cot θ ; Γ_ϕϕ^θ = −sin θ cos θ ; (the others vanish) (1.317)

R_θ^θ = 1/R² ; R_ϕ^ϕ = 1/R² ; (the others vanish) ; R = 2/R² (1.318)

∇_i V^i = (1/sin θ) ∂(sin θ V^θ)/∂θ + ∂V^ϕ/∂ϕ (1.319)

1.8.11.4 Spherical coordinates

(Three-dimensional Euclidean space.)

ds² = dr² + r² dθ² + r² sin²θ dϕ² (1.320)

Γ_rθ^θ = Γ_θr^θ = 1/r ; Γ_rϕ^ϕ = Γ_ϕr^ϕ = 1/r ; Γ_θϕ^ϕ = Γ_ϕθ^ϕ = cot θ ;
Γ_θθ^r = −r ; Γ_ϕϕ^r = −r sin²θ ; Γ_ϕϕ^θ = −sin θ cos θ ; (the others vanish) (1.321)

R_ij = 0 (1.322)

∇_i V^i = (1/r²) ∂(r² V^r)/∂r + (1/sin θ) ∂(sin θ V^θ)/∂θ + ∂V^ϕ/∂ϕ (1.323)

1.8.12 Appendix: Computing in Polar Coordinates

[Note: This appendix is probably to be suppressed.]

1.8.12.1 General formula

1.8.12.1.1 Simple-minded computation From

div V = (1/r) ∂(r V^r)/∂r + ∂V^ϕ/∂ϕ , (1.324)

we obtain, using a simple-minded discretization,

(div V)(r, ϕ) ≈ [ (r + δr) V^r(r + δr, ϕ) − (r − δr) V^r(r − δr, ϕ) ] / (r · 2δr) + [ V^ϕ(r, ϕ + δϕ) − V^ϕ(r, ϕ − δϕ) ] / (2δϕ) . (1.325)
2 δϕ Computation through parallel transport The notion of parallel transport (div V)(r, ϕ) = + which gives (1.325) V ϕ (r, ϕ V r (r, ϕ r + δr, ϕ) − V r (r, ϕ 2 δr r, ϕ + δϕ) − V ϕ (r, ϕ 2 δϕ r − δr, ϕ) r, ϕ − δϕ) , (1.326) V r (r + δr, ϕ) − V r (r − δr, ϕ) 2 δr V ϕ (r, ϕ + δϕ) − V ϕ (r, ϕ − δϕ) + cos(δϕ) 2 δϕ (div V)(r, ϕ) = + sin(δϕ) 1 V r (r + δr, ϕ) + V r (r − δr, ϕ) . δϕ r 2 (1.327) 1.8.12.1.3 Note: Natural basis and “normed” basis The components on the natural basis V r et V ϕ are related with the components on the normed basis V r and V ϕ through Vr =Vr (1.328) Vϕ =r Vϕ . (1.329) and 1.8.12.2 Divergence of a constant field A constant vector field (oriented “as the x axis”) has components V r (r, ϕ) = k cos ϕ (1.330) k V ϕ (r, ϕ) = − sin ϕ . r (1.331) and 64 1.8.12.2.1 gives 1.8 Simple-minded computation An exact evaluation of approximation 1.325 (div V)(r, ϕ) = sin(δϕ) k cos ϕ 1 − r δϕ , (1.332) expression with an error of order (δϕ)2 . 1.8.12.2.2 Computation through parallel transport An exact evaluation of approximation 1.327 gives (div V)(r, ϕ) = 0 , as it should. (1.333) Appendixes 65 1.8.13 Appendix: Dual Tensors in 2 3 and 4D 1.8.13.1 Dual tensors in 2-D In 2-D, we may need to take the following duals of contravariant (antisymmetric) tensors: ∗ Bij = 1 εij B 0! ∗ Bi = B= 1 εB 0! ij Bi = 1 j εij B 1! ∗ 1 εij B ij 2! Bij = ∗ 1 εij B j 1! ∗ ∗ B= 1 ij εij B 2! ∗ B ij = 1 εB 0! ij (1.334) ∗ Bi = 1 ε Bj 1! ij (1.335) ∗ B= 1 ε B ij 2! ij (1.336) ∗ B= ij 1 ij εB 0! (1.337) ∗ B= i 1 ij ε Bj 1! (1.338) ∗ B= 1 ij ε Bij 2! (1.339) We may also need to take duals of covariant tensors: ∗ B ij = 1 ij εB 0! ∗ Bi = 1 ij ε Bj 1! ∗ B= ∗ 1 ij εB 0! ∗ Bi = 1 ij ε Bj 1! ∗ 1 ij ε Bij 2! B ij = B= 1 ij ε B ij 2! As in a space with an even number of dimensions the dual of the dual of a tensor of rank p equals (−1)p the original tensor (see text), we have, in 2-D, that for a tensor with 0 or 2 indices, ∗ (∗ B) = B , while for a tensor with 1 index, ∗ (∗ B) = −B . 
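Returning to the polar-coordinate computations of section 1.8.12.2, the two discretisations can be checked numerically. The sketch below is not part of the text; the function names and the sample point are illustrative, and it assumes the last term of 1.327 averages V^r at (r, ϕ±δϕ). It verifies that, for the constant field 1.330–1.331, the simple-minded formula 1.325 reproduces the biased value 1.332, while the parallel-transport formula 1.327 gives exactly zero, as equation 1.333 states.

```python
import math

def div_naive(Vr, Vphi, r, phi, dr, dphi):
    # Simple-minded discretisation of div V = (1/r) d(r V^r)/dr + dV^phi/dphi  (eq. 1.325)
    term_r = ((r + dr) * Vr(r + dr, phi) - (r - dr) * Vr(r - dr, phi)) / (r * 2 * dr)
    term_phi = (Vphi(r, phi + dphi) - Vphi(r, phi - dphi)) / (2 * dphi)
    return term_r + term_phi

def div_transport(Vr, Vphi, r, phi, dr, dphi):
    # Discretisation using parallel transport of the neighbouring vectors  (eq. 1.327)
    term_r = (Vr(r + dr, phi) - Vr(r - dr, phi)) / (2 * dr)
    term_phi = math.cos(dphi) * (Vphi(r, phi + dphi) - Vphi(r, phi - dphi)) / (2 * dphi)
    term_extra = (math.sin(dphi) / dphi) * (Vr(r, phi + dphi) + Vr(r, phi - dphi)) / (2 * r)
    return term_r + term_phi + term_extra

# Constant field oriented along the x axis, natural components (eqs. 1.330-1.331)
k = 2.0
Vr = lambda r, phi: k * math.cos(phi)
Vphi = lambda r, phi: -(k / r) * math.sin(phi)

r, phi, dr, dphi = 1.5, 0.7, 0.01, 0.1       # arbitrary sample point and steps
naive = div_naive(Vr, Vphi, r, phi, dr, dphi)
bias = (k * math.cos(phi) / r) * (1.0 - math.sin(dphi) / dphi)   # eq. 1.332
transported = div_transport(Vr, Vphi, r, phi, dr, dphi)          # should vanish (eq. 1.333)
```

The cancellation in `div_transport` is exact (up to floating-point rounding) for any δr and δϕ, not merely in the limit.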
1.8.13.2 Dual tensors in 3-D

In 3-D, we may need to take the following duals of contravariant (totally antisymmetric) tensors:

  *B_ijk = (1/0!) ε_ijk B        (1.340)
  *B_ij = (1/1!) ε_ijk B^k        (1.341)
  *B_i = (1/2!) ε_ijk B^jk        (1.342)
  *B = (1/3!) ε_ijk B^ijk .        (1.343)

We may also need to take duals of covariant tensors:

  *B^ijk = (1/0!) ε^ijk B        (1.344)
  *B^ij = (1/1!) ε^ijk B_k        (1.345)
  *B^i = (1/2!) ε^ijk B_jk        (1.346)
  *B = (1/3!) ε^ijk B_ijk .        (1.347)

As in a space with an odd number of dimensions the dual of the dual of a tensor always equals the original tensor (see text), we have, in 3-D, for all tensors above, *(*B) = B .

1.8.13.3 Dual tensors in 4-D

In 4-D, we may need to take the following duals of contravariant (totally antisymmetric) tensors:

  *B_ijkl = (1/0!) ε_ijkl B        (1.348)
  *B_ijk = (1/1!) ε_ijkl B^l        (1.349)
  *B_ij = (1/2!) ε_ijkl B^kl        (1.350)
  *B_i = (1/3!) ε_ijkl B^jkl        (1.351)
  *B = (1/4!) ε_ijkl B^ijkl .        (1.352)

We may also need to take duals of covariant tensors:

  *B^ijkl = (1/0!) ε^ijkl B        (1.353)
  *B^ijk = (1/1!) ε^ijkl B_l        (1.354)
  *B^ij = (1/2!) ε^ijkl B_kl        (1.355)
  *B^i = (1/3!) ε^ijkl B_jkl        (1.356)
  *B = (1/4!) ε^ijkl B_ijkl .        (1.357)
As in a space with an even number of dimensions the dual of the dual of a tensor of rank p equals (−1)^p times the original tensor (see text), we have, in 4-D, that for a tensor with 0 , 2 or 4 indices, *(*B) = B , while for a tensor with 1 or 3 indices, *(*B) = −B .

1.8.14 Appendix: Integration in 3D

In a three-dimensional space (n = 3) , we may have p respectively equal to 2 , 1 and 0 . This gives the three theorems

  ∫_3D d³σ^ijk (∇ ∧ T)_ijk = ∫_2D d²σ^ij T_ij        (1.358)
  ∫_2D d²σ^ij (∇ ∧ T)_ij = ∫_1D d¹σ^i T_i        (1.359)
  ∫_1D d¹σ^i (∇ ∧ T)_i = ∫_0D d⁰σ T .        (1.360)

Explicitly, using the results of sections 1.6.3 and 1.6.4, this gives

  ∫_3D d³σ^ijk (1/3)(∇i T_jk + ∇j T_ki + ∇k T_ij) = ∫_2D d²σ^ij T_ij        (1.361)
  ∫_2D d²σ^ij (1/2)(∇i T_j − ∇j T_i) = ∫_1D d¹σ^i T_i        (1.362)
  ∫_1D d¹σ^i ∇i T = ∫_0D d⁰σ T ,        (1.363)

or, if we use the antisymmetry of the tensors,

  ∫_3D d³σ^ijk ∇i T_jk = ∫_2D d²σ^ij T_ij        (1.364)
  ∫_2D d²σ^ij ∇i T_j = ∫_1D d¹σ^i T_i        (1.365)
  ∫_1D d¹σ^i ∂i T = ∫_0D d⁰σ T .        (1.366)

We can now introduce the capacity elements instead of the differential elements:

  (1/0!) ∫_3D d³Σ (1/2!) ε^ijk ∇i T_jk = (1/1!) ∫_2D d²Σ_i (1/2!) ε^ijk T_jk        (1.367)
  (1/1!) ∫_2D d²Σ_i (1/1!) ε^ijk ∇j T_k = (1/2!) ∫_1D d¹Σ_ij (1/1!) ε^ijk T_k        (1.368)
  (1/2!) ∫_1D d¹Σ_ij (1/0!) ε^ijk ∂k T = (1/3!) ∫_0D d⁰Σ_ijk (1/0!) ε^ijk T .        (1.369)

Introducing explicit expressions for the capacity elements gives

  ∫_3D (ε_ijk dr₁^i dr₂^j dr₃^k) (∇l t^l) = ∫_2D (ε_ijk dr₁^j dr₂^k) t^i        (1.370)
  ∫_2D (ε_ilm dr₁^l dr₂^m) (ε^ijk ∇j T_k) = ∫_1D dr₁^i T_i        (1.371)
  ∫_1D dr₁^i ∂i T = [ T ]_0D ,        (1.372)

where, in equation 1.370, t stands for the vector dual to the tensor T_ij , i.e., t^i = (1/2!) ε^ijk T_jk .

Equations 1.367 and 1.370 correspond to the divergence theorem of Gauss-Ostrogradsky, equations 1.368 and 1.371 correspond to the rotational theorem of Stokes (stricto sensu), and equation 1.372, when written in its more familiar form

  ∫_a^b dr^i ∂i T = T(b) − T(a) ,

corresponds to the fundamental theorem of integral calculus.
(1.373)

Chapter 2. Elements of Probability

As probability theory is essential to the formulation of the rules of physical inference —to be analyzed in subsequent chapters— we have to start with an introduction to the concept of probability. This chapter is, however, more than a simple review. I assume that the spaces we shall work with have a natural definition of distance between points and, therefore, a definition of volume. This allows the introduction of the notion of 'volumetric probability', as opposed to the more conventional 'probability density'. The notion of conditional volumetric probability is carefully introduced (I disagree with the usual definitions of conditional probability density), and, finally, the whole concept of conditional probability is generalized into a more general notion: the product of probability distributions.

2.1 Volume

2.1.1 Notion of Volume

The axiomatic introduction of a 'volume' over an n-dimensional manifold is very similar to the introduction of a 'probability', and both can be reduced to the axiomatic introduction of a 'measure'. For pedagogical reasons, I choose to separate the two notions, presenting the notion of volume as more fundamental than that of a probability, as the definition of a probability shall require the previous definition of the volume. Of course, given an n-dimensional manifold X , one may wish to associate to it different 'measures' of the volume of any region of it. But, in this text, we shall rather assume that, within a given context, there is one 'natural' definition of volume. So it is assumed that to any region A ⊂ X there is associated a real or imaginary¹ quantity V(A) , called the volume of A , that satisfies

Postulate 2.1: for any region A of the space, V(A) ≥ 0 ;

Postulate 2.2: if A₁ and A₂ are two disjoint regions of the space, then V(A₁ ∪ A₂) = V(A₁) + V(A₂) .

We shall say that a volume distribution (or, for short, a 'volume') has been defined over X .
The volume of the whole space X may be positive real, positive imaginary, it may be zero, or it may be infinite.

2.1.2 Volume Element

Consider a region A of an n-dimensional manifold X , and an approximate subdivision of it into regions with individual volumes ∆Vᵢ (see illustration 2.1). Successively refining the subdivision easily allows relating the volume of the whole region to the volumes of the individual regions,

  V(A) = lim_{∆Vᵢ→0} Σᵢ ∆Vᵢ ,        (2.1)

an expression that we may take as an elementary definition for the integral

  V(A) = ∫_{P∈A} dV(P) .        (2.2)

When some coordinates x = {x¹, ..., xⁿ} are chosen over X , we may rewrite this equation as

  V(A) = ∫_{x∈A} dv(x) .        (2.3)

While dV(P) stands for a function depending on the abstract notion of a 'point', dv(x) stands for an ordinary function depending on some coordinates. Apart from this subtle difference, the two objects coincide: if by x(P) we designate the coordinates of the point P , then

  dV(P) = dv( x(P) ) .        (2.4)

¹ Some spaces having a 'hyperbolic metric', like the Minkowskian space-time of special relativity, have an imaginary volume. By convention, this volume is taken as positive imaginary.

Figure 2.1: The volume of an arbitrarily shaped, smooth region of a space X can be defined as the limit of a sum, using elementary regions whose individual volume is known (for instance, triangles in this 2D illustration). This way of defining the volume of a region does not require the definition of a coordinate system over the space.

2.1.3 Volume Density and Capacity Element

Consider, at a given point P of an n-dimensional manifold, n vectors (of the tangent linear space) {v₁, v₂, ..., vₙ} . These vectors may not have the same physical dimensions (for instance, v₁ may represent a displacement, v₂ a velocity, etc.). The exterior product of the n vectors, denoted v₁ ∧ v₂ ∧ ··· ∧ vₙ , is the scalar capacity

  v₁ ∧ v₂ ∧ ··· ∧ vₙ = ε_{i₁i₂...iₙ} v₁^{i₁} v₂^{i₂} ··· vₙ^{iₙ} ,        (2.5)

where ε_{i₁i₂...iₙ}
is the Levi-Civita capacity, defined in section 1.4.2. This is, of course, a totally antisymmetric expression. If some coordinates x = {x¹, x², ..., xⁿ} have been defined over the manifold, then, at any given point, we may consider the n infinitesimal vectors

  dr₁ = (dx¹, 0, ..., 0) ;  dr₂ = (0, dx², ..., 0) ;  ··· ;  drₙ = (0, 0, ..., dxⁿ) ,        (2.6)

corresponding to the respective perturbations of the n coordinates. The exterior product, at point x , of these n vectors is called the capacity element, and is denoted dv̲(x) :

  dv̲(x) = dr₁ ∧ dr₂ ∧ ··· ∧ drₙ .        (2.7)

In view of expressions 2.6, and using a notational abuse, the capacity element so defined is usually written as

  dv̲(x) = dx¹ ∧ dx² ∧ ··· ∧ dxⁿ .        (2.8)

One of the major theorems of integration theory is that the volume element dv introduced in equation 2.3 is related to the capacity element dv̲ through

  dv(x) = g̅(x) dv̲(x) ,        (2.9)

where g̅(x) is the volume density in the coordinates x , as defined in equation 1.32:

  g̅(x) = η √(det g(x)) .        (2.10)

Here η is the orientation of the coordinate system, as defined in section 1.4.1. If the system of coordinates in use is positively oriented, the quantities g̅(x) and dv̲(x) are both positive. Alternatively, if the system of coordinates is negatively oriented, these two quantities are negative. The volume element dv(x) is always a positive quantity.

The overbar in g̅ is to remind us that the determinant of the metric tensor is a density, in the tensorial sense of section 1.2.2, while the underbar in dv̲ is to remind us that the 'capacity element' is a capacity, in the tensorial sense of the term. In equation 2.9, the product of a density times a capacity gives the volume element dv , which is an invariant scalar. In view of this equation, we can call g̅(x) the volume density in the coordinates x = {x¹, ..., xⁿ} . It is important to realize that g̅(x) does not represent any intrinsic property of the space, but, rather, a property of the coordinates being used.
Example 2.1. In the Euclidean 3D space, using geographical coordinates² x = {r, ϕ, λ} , it is well known that the volume element is

  dv(r, ϕ, λ) = r² cos λ dr ∧ dϕ ∧ dλ ,        (2.11)

so the volume density associated to the geographical coordinates is

  g̅(r, ϕ, λ) = r² cos λ .        (2.12)

The metric in geographical coordinates is

  ds² = dr² + r² cos²λ dϕ² + r² dλ² ,        (2.13)

so

  √(det g) = r² cos λ .        (2.14)

Comparing this equation with equation 2.12 shows that one has

  g̅ = √(det g) ,        (2.15)

as it should be. [End of example.]

Figure 2.2: The geographical coordinates generalize better to n-dimensional spaces than the usual spherical coordinates. Note that the order of the angles, {ϕ, λ} , has to be the reverse of that of the angles {θ, ϕ} , so as to define in both cases local referentials dr ∧ dθ ∧ dϕ and dr ∧ dϕ ∧ dλ that have the same orientation as dx ∧ dy ∧ dz . The two systems are related through

  x = r cos λ cos ϕ = r sin θ cos ϕ ;  y = r cos λ sin ϕ = r sin θ sin ϕ ;  z = r sin λ = r cos θ .

² The usual spherical coordinates are {r, θ, ϕ} , and the domain of variation of θ is 0 ≤ θ ≤ π . These 3D coordinates do not generalize properly into 'spherical' coordinates in spaces of dimension larger than three. To these spherical coordinates one should prefer the 'geographical coordinates' {r, ϕ, λ} , where the domain of variation of λ is −π/2 ≤ λ ≤ +π/2 . These are not 'geographical coordinates' in the normal sense used by geodesists, as r is here a radius (not the 'height' above some reference). See figure 2.2 for more details.
(2.18) x∈A Using expressions 2.8 and 1.32 we can write this in the more explicit (but not manifestly covariant) form V (A) = η x∈A dx1 ∧ · · · ∧ dxn det g(x) . (2.19) These two (equivalent) expressions allow the usual interpretation of an integral as a limit involving the domains defined by constant increments of the coordinate values (see figure 2.3). Although such an expression is useful for analytic developments it is usually not well adapted to numerical evaluations (unless the coordinates are very specially chosen). Figure 2.3: For the same shape of figure 2.1, the volume can be evaluated using, for instance, a polar coordinate system. In a numerical integration, regions near the origin may be oversampled, while regions far from the orign may be undersampled. In some situation, this problem may become crucial, so this sort of ‘coordinate integration’ is to be reserved to analytical developments only. 2.1.4 Change of Variables 2.1.4.1 Volume Element and Change of Variables Consider an n-dimensional metric manifold with some coordinates x . The defining property of the volume element, say dvx (x) , was (equation 2.3) V (A) = Under a change of variables x x∈A dvx (x) . (2.20) y , this expression shall become V (A) = y∈A dvy (y) . (2.21) 74 2.1 These two equations just correspond to a different labeling, respectively using the coordinates x and the coordinates y , of the fundamental equation 2.2 defining the volume element dV , so they are completely equivalent. In other words, the volume element is an invariant scalar, and one may write dvy = dvx , (2.22) or, more explicitly, dvy (y) = dvx ( x(y) ) . 2.1.4.2 (2.23) Volume Density, Capacity Element, and Change of Variables In a change of variables x via y , the two capacity elements dv x (x) and dv y (y) are related dv y (y) = 1 dv ( x(y) ) , X (y) x (2.24) where X (y) is the Jacobian determinant det{∂xi /∂y j } , as they are tensorial capacities, in the sense of section 1.2.2. 
Also, because a 'volume density' is a tensorial density, we have

  g̅_y(y) = X̅(y) g̅_x( x(y) ) .        (2.25)

Equation 2.18, which can be written, in the coordinates x ,

  V(A) = ∫_{x∈A} dv̲_x(x) g̅_x(x) ,        (2.26)

g̅_x(x) being the volume density (square root of the determinant of the metric matrix) in the coordinates x , becomes

  V(A) = ∫_{y∈A} dv̲_y(y) g̅_y(y) ,        (2.27)

g̅_y(y) being the corresponding quantity in the coordinates y . Of course, the two capacity elements can be expressed as (equation 2.8)

  dv̲_x(x) = dx¹ ∧ dx² ∧ ··· ∧ dxⁿ        (2.28)
and
  dv̲_y(y) = dy¹ ∧ dy² ∧ ··· ∧ dyⁿ .        (2.29)

If the two coordinate systems {x¹, ..., xⁿ} and {y¹, ..., yⁿ} have the same orientation, the two capacity elements dv̲_x(x) and dv̲_y(y) have the same sign. Otherwise, they have opposite signs.

2.1.5 Conditional Volume

Consider an n-dimensional manifold Xⁿ , with some coordinates x = {x¹, ..., xⁿ} , and a metric tensor g(x) = {g_ij(x)} . Consider also a p-dimensional submanifold X_p of the n-dimensional manifold Xⁿ (with p ≤ n ). The n-dimensional volume over Xⁿ , as characterized by the metric determinant √(det g) , induces a p-dimensional volume over the submanifold X_p . Let us try to characterize it.

The simplest way to represent a p-dimensional submanifold X_p of the n-dimensional manifold Xⁿ is by separating the n coordinates x = {x¹, ..., xⁿ} of Xⁿ into one group of p coordinates r = {r¹, ..., r^p} and one group of q coordinates s = {s¹, ..., s^q} , with

  p + q = n .        (2.30)

Using the notations

  x = {x¹, ..., xⁿ} = {r¹, ..., r^p, s¹, ..., s^q} = {r, s} ,        (2.31)

the set of q relations

  s¹ = s¹(r¹, r², ..., r^p)
  s² = s²(r¹, r², ..., r^p)
  ...
  s^q = s^q(r¹, r², ..., r^p) ,        (2.32)

that, for short, may be written

  s = s(r) ,        (2.33)

defines a p-dimensional submanifold X_p in the (p + q)-dimensional space Xⁿ . For later use, we can now introduce the matrix of partial derivatives
  S = {S^i_α} = {∂s^i/∂r^α} ,        (2.34)

i.e., the q × p matrix whose entry in row i and column α is ∂s^i/∂r^α . We can write S(r) for this matrix, as it is defined at a point {x} = {r, s(r)} . Note also that the metric over Xⁿ can always be partitioned as

  g(x) = g(r, s) = ( g_rr(r, s)  g_rs(r, s) ; g_sr(r, s)  g_ss(r, s) ) ,        (2.35)

with g_rs = (g_sr)ᵀ . In what follows, let us use Greek indices for the variables {r¹, ..., r^p} , as in r^α ; α ∈ {1, ..., p} , and Latin indices for the variables {s¹, ..., s^q} , as in s^i ; i ∈ {1, ..., q} .

Consider an arbitrary point {r, s} of the space Xⁿ . If the coordinates r^α are perturbed to r^α + dr^α , with the coordinates s^i kept unperturbed, one defines a p-dimensional subvolume of the n-dimensional manifold Xⁿ that can be written³ (middle panel in figure 2.4)

  dv_p(r, s) = √(det g_rr(r, s)) dr¹ ∧ ··· ∧ dr^p .        (2.36)

³ In all generality, we should write dv_p(r, s) = η √(det g_rr(r, s)) dr¹ ∧ ··· ∧ dr^p , where η is ±1 depending on the order of the coordinates {r¹, ..., r^p} . Let us simplify the equations here by assuming that we have chosen the order of the coordinates so as to have a positively oriented capacity element dr¹ ∧ ··· ∧ dr^p .

Figure 2.4: On a 3D space (3D manifold), a coordinate system {x¹, x², x³} = {r¹, r², s} is defined. Some characteristic coordinate surfaces are represented (left). In the middle, a surface element dS (2D volume element) on a coordinate surface s = const. is represented, corresponding to the expression in equation 2.36. On the right, a submanifold (surface) is defined by an equation s = s(r¹, r²) .
A surface element dS (2D volume element) is represented on the submanifold, corresponding to the expression in equation 2.37.

Alternatively, consider a point (r, s) of Xⁿ that, in fact, is on the submanifold X_p , i.e., a point that has coordinates of the form (r, s(r)) . It is clear that the variables {r¹, ..., r^p} define a coordinate system over the submanifold, as it is enough to specify r to define a point in X_p . If the coordinates r^α are perturbed to r^α + dr^α , and the coordinates s^i are also perturbed to s^i + ds^i in a way that one remains on the submanifold (i.e., with ds^i = S^i_α dr^α ), then, with the metric over Xⁿ partitioned as in equation 2.35, the general distance element ds² = g_ij dx^i dx^j can be written

  ds² = (g_rr)_αβ dr^α dr^β + (g_rs)_αj dr^α ds^j + (g_sr)_iβ ds^i dr^β + (g_ss)_ij ds^i ds^j ,

and, replacing ds^i by ds^i = S^i_α dr^α , we obtain ds² = G_αβ dr^α dr^β , with

  G = g_rr + g_rs S + Sᵀ g_sr + Sᵀ g_ss S .

The ds² just expressed gives the distance between any two points of X_p , i.e., G is the metric matrix of the submanifold associated to the coordinates r . The p-dimensional volume element on the submanifold is, then, dv_r = √(det G) dr¹ ∧ ··· ∧ dr^p , i.e.,

  dv_p(r) = √( det( g_rr + g_rs S + Sᵀ g_sr + Sᵀ g_ss S ) ) dr¹ ∧ ··· ∧ dr^p ,        (2.37)

where S = S(r) , g_rr = g_rr(r, s(r)) , g_rs = g_rs(r, s(r)) , g_sr = g_sr(r, s(r)) and g_ss = g_ss(r, s(r)) . Figure 2.4 illustrates this result. The expression 2.37 says that the p-dimensional volume density induced over the submanifold X_p is

  g̅(x) = η √( det( g_rr + g_rs S + Sᵀ g_sr + Sᵀ g_ss S ) ) .
(2.38)

Note: say here that in the case the space Xⁿ is formed as the cartesian product of two spaces, R^p × S^q , with the metric over Xⁿ induced from the metric g_r over R^p and the metric g_s over S^q by

  ds²_x = ds²_r + ds²_s ,        (2.39)

then the expression of the metric 2.35 simplifies into

  g(x) = ( g_r(r)  0 ; 0  g_s(s) ) ,        (2.40)

and equations 2.37–2.38 simplify into

  dv_p(r) = √( det( g_r + Sᵀ g_s S ) ) dr¹ ∧ ··· ∧ dr^p        (2.41)

and

  g̅(x) = η √( det( g_r + Sᵀ g_s S ) ) .        (2.42)

2.2 Probability

2.2.1 Notion of Probability

Consider an n-dimensional metric manifold, over which a 'volume distribution' has been defined (satisfying the axioms in section 2.1.1), associating to any region (i.e., subset) A of X its volume

  A → V(A) .        (2.43)

A particular volume distribution having been introduced over X , once and for all, different 'probability distributions' may be considered, that we are about to characterize axiomatically. We shall say that a probability distribution (or, for short, a probability) has been defined over X if to any region A ⊂ X we can associate an adimensional real number,

  A → P(A) ,        (2.44)

called the probability of A , that satisfies

Postulate 2.3: for any region A of the space, P(A) ≥ 0 ;        (2.45)

Postulate 2.4: for disjoint regions of the space, the probabilities are additive:

  A₁ ∩ A₂ = ∅  ⇒  P(A₁ ∪ A₂) = P(A₁) + P(A₂) ;        (2.46)

Postulate 2.5: the probability distribution must be absolutely continuous with respect to the volume distribution, i.e., the probability P(A) of any region A ⊂ X with vanishing volume must be zero:

  V(A) = 0  ⇒  P(A) = 0 .        (2.47)

The probability of the whole space X may be zero, it may be finite, or it may be infinite. The first two axioms are due to Kolmogorov (1933). In common texts, there is usually an axiom concerning the behaviour of a probability when we consider an infinite collection⁴ of sets, A₁, A₂, A₃, ... , but this is a technical issue that I choose to ignore.
Our third axiom here is not usually introduced, as the distinction between the 'volume distribution' and a 'probability distribution' is generally not made: both are just considered as examples of 'measure distributions'. This distinction shall, in fact, play a major role in the theory that follows. When the probability of the whole space is finite, a probability distribution can be renormalized, so as to have P(X) = 1 . We shall then say that we face an 'absolute probability'. If a probability distribution is not normalizable, we shall say that we have a 'relative probability': in that case, what usually matters is not the probability P(A) of a region A ⊂ X , but the relative probability between two regions A and B , denoted P(A ; B) , and defined as

  P(A ; B) = P(A) / P(B) .        (2.48)

⁴ Presentations of measure theory that pretend to mathematical rigor assume 'finite additivity' or, alternatively, 'countable additivity'. See, for instance, the interesting discussion in Jaynes (1995).

2.2.2 Volumetric Probability

We have just defined a probability distribution over an n-dimensional manifold, that is absolutely continuous with respect to the volume distribution over the manifold. Then, by virtue of the Radon-Nikodym theorem (e.g., Taylor, 1966), one can define over X a volumetric probability f(P) such that the probability of any region A of the space can be obtained as

  P(A) = ∫_{P∈A} dV(P) f(P) .        (2.49)

Note that this equation makes sense even if no particular coordinate system is defined over the manifold X , as the integral here can be understood in the sense suggested in figure 2.1. If a coordinate system x = {x¹, ..., xⁿ} is defined over X , we may well wish to write equation 2.49 as

  P(A) = ∫_{x∈A} dv_x(x) f_x(x) ,        (2.50)

where, now, dv_x(x) is to be understood as the special expression of the volume element in the coordinates x .
One may be interested in using the volume element dv_x(x) directly for the integration (as suggested in figure 2.1). Alternatively, one may wish to use the coordinate lines for the integration (as suggested in figure 2.3). In this case, one writes (equation 2.9)

  dv_x(x) = g̅_x(x) dv̲_x(x) ,        (2.51)

to get

  P(A) = ∫_{x∈A} dv̲_x(x) g̅_x(x) f_x(x) .        (2.52)

Using dv̲_x(x) = dx¹ ∧ ··· ∧ dxⁿ (equation 2.8) and g̅_x(x) = η √(det g(x)) (equation 1.32), this expression can be written in the more explicit (but not manifestly covariant) form

  P(A) = η ∫_{x∈A} dx¹ ∧ ··· ∧ dxⁿ √(det g(x)) f_x(x) ,        (2.53)

where η is +1 if the system of coordinates is positively oriented and −1 if it is negatively oriented. These two (equivalent) expressions may be useful for analytical developments, but not for numerical evaluations, where one should choose a direct handling of expression 2.50.

2.2.3 Probability Density

In equation 2.52 we can introduce the definition

  f̲_x(x) = g̅_x(x) f_x(x) ,        (2.54)

to obtain

  P(A) = ∫_{x∈A} dv̲_x(x) f̲_x(x) ,        (2.55)

where
When probability theory is developed without the notion of volume and of distance, one is forced to include definitions that do not have the necessary invariances, the most striking example being the usual definition of ‘conditional probability density’. One does not obtain a correct definition unless a metric in the space is introduced (see section 2.4). The well-known ‘Borel paradox’ (see appendix 2.8.10) is the simplest example of this annoying situation. If I mention at all the notion of probability density is to allow the reader to make the connection between the formulas to be developed in this book and the formulas she/he may find elsewhere. As we have chosen in this text to give signs to densities and capacities that are associated to the orientation of the coordinate system, it is clear from definition 2.54 that, contrary to a volumetric probability, a probability density is not necessarily positive: it has the sign of the capacity element, i.e., a positive sign in positively oriented coordinate systems, and a negative sign in negatively oriented coordinate systems. Example 2.3 Consider a homogeneous probability distribution at the surface of a sphere of radius r . When parameterizing a point by its geographical coordinates (ϕ, λ) , the associated (2D) volumetric probability is f (ϕ, λ) = 1 4πr2 . (2.57) The probability of a region A of the surface is computed as P (A) = dS (ϕ, λ) f (ϕ, λ) , (2.58) {ϕ,λ}∈A where dS (ϕ, λ) = r2 cos λ dϕ dλ , and the total probability equals one. Alternatively, the probability density associated to the homogeneous probability distribution over the sphere is f (ϕ, λ) = 1 cos λ 4π . (2.59) The probability of a region A of the surface is computed as P (A) = dϕ dλ f (ϕ, λ) , {ϕ,λ}∈A and the probability of the whole surface also equals one. [End of example.] (2.60) Probability 2.2.4 81 Volumetric Histograms and Density Histograms Note: explain here what is a volumetric histogram and a density histogram. 
Say that while the limit of a volumetric histogram is a volumetric probability, the limit of a density histogram is a probability density. Introduce the notion of 'naïve histogram'.

Consider a problem where we have two physical properties to analyze. The first is the resistance-conductance property of a metallic wire, as it can be characterized, for instance, by its resistance R or by its conductance C = 1/R . The second is the 'cold-warm' property of the wire, as it can be characterized by its temperature T or its thermodynamic parameter β = 1/(kT) (k being the Boltzmann constant). The 'parameter space' is, here, two-dimensional. In the 'resistance-conductance' space, the distance between two points, characterized by the resistances R₁ and R₂ , or by the conductances C₁ and C₂ , is, as explained in section XXX,

  D = | log(R₂/R₁) | = | log(C₂/C₁) | .        (2.61)

Similarly, in the 'cold-warm' space, the distance between two points, characterized by the temperatures T₁ and T₂ , or by the thermodynamic parameters β₁ and β₂ , is

  D = | log(T₂/T₁) | = | log(β₂/β₁) | .        (2.62)

A homogeneous probability distribution can be defined as ... Bla, bla, bla ... In figure 2.5, the two histograms that can be made from the first two diagrams give the volumetric probability. The naïve histogram that could be made from the diagram at the right would give a probability density.
Figure 2.5: Note: explain here how to make a volumetric histogram. Explain that when the electric resistance or the temperature span orders of magnitude, the diagram at the right becomes totally impractical. (The panels show the same samples plotted against logarithmic axes R* = log₁₀ R/R₀ and T* = log₁₀ T/T₀ , with R₀ = 1 Ω and T₀ = 1 K , and against linear axes R , T .)

2.2.5 Change of Variables

2.2.5.1 Volumetric Probability and Change of Variables

In a change of coordinates x → y(x) , the expression 2.50

  P(A) = ∫_{x∈A} dv_x(x) f_x(x)        (2.63)

becomes

  P(A) = ∫_{y∈A} dv_y(y) f_y(y) ,        (2.64)

where dv_y(y) and f_y(y) are respectively the expressions of the volume element and of the volumetric probability in the coordinates y . These are actual invariants (in the tensorial sense), so, when comparing this equation (written in the coordinates y ) to equation 2.50 (written in the coordinates x ), one simply has, at every point,

  f_y = f_x ;  dv_y = dv_x ,        (2.65)

or, to be more explicit,

  f_y(y) = f_x( x(y) ) ;  dv_y(y) = dv_x( x(y) ) .        (2.66)

That under a change of variables x → y one has f_y = f_x for volumetric probabilities is an important property. It contrasts with the property found in usual texts (where the Jacobian of the transformation appears): remember that we are considering here volumetric probabilities, not the usual probability densities. A volumetric probability can also be integrated using the expression 2.53,

  P(A) = η_x ∫_{x∈A} dx¹ ∧ ··· ∧ dxⁿ √(det g_x(x)) f_x(x) ,        (2.67)

that, under the change of variables, becomes

  P(A) = η_y ∫_{y∈A} dy¹ ∧ ··· ∧ dyⁿ √(det g_y(y)) f_y(y) .
(2.68) that, under the change of variables becomes P (A) = ηy y∈A dy 1 ∧ · · · ∧ dy n These equations contain each a capacity element and a volume density, that change, under the change of variables, following the rules given in section 1.2.2, but we do not need to be concerned with this here, as the meaning of dy 1 ∧ · · · ∧ dy n is clear, and one usually obtains ηy det gy (y) by an explicit computation of the determinant in the coordinates y , rather than by mutiplying the volume density ηx det gx (x(y)) by the Jacobian determinant X (y) (see section 2.25). [Note: Important: I have to erect as a basic principle to use, in a change of variables, the representation exemplified by figures 9.5, 9.8 and 9.9.] Probability 2.2.5.2 83 Probability Density and Change of Variables A probability density being defined as the product of an invariant times a density (equation 2.54) it is a density in the tensorial sense of the term. Under a change of variables x y , expression 2.55 P (A) = x∈A dv x (x) f x (x) , (2.69) dv y (y) f y (y) , (2.70) where dv x (x) = dx1 ∧ · · · ∧ dxn , becomes P (A) = y∈A where dv y (y) = dy 1 ∧ · · · ∧ dy n . The two capacity elements dv x (x) and dv y (y) are related through the relation 2.24, and, more importantly, the two probability densities are related as tensorial densities should (see section 1.2.2), f y (y) = X (y) f x (x(y)) . (2.71) This is called the Jacobian rule for the change of a probability density under a change of ‘variables’ (i.e., under a change of coordinates over the considered manifold). Note that the X appearing in this equation is the determinant of the matrix {X i j } = {∂xi /∂y j } , not that of the matrix {Y i j } = {∂y i /∂xj } . Many authors take the absolute value of the Jacobian in this equation, which is not quite correct: it is the actual Jacobian that appears. 
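The two transformation rules can be checked numerically in one dimension. In this sketch (the exponential density is an illustrative assumption, not an example from the text), the density transforms with the Jacobian X(y) = dx/dy under the change of variables y = log x , and the probability of an interval comes out the same in either coordinate system.

```python
import numpy as np

# 1-D sketch of the Jacobian rule (equation 2.71) for y = log x,
# so x(y) = exp(y) and the Jacobian is X(y) = dx/dy = exp(y).

def f_bar_x(x):
    """Hypothetical probability density in the x coordinate (x > 0)."""
    return np.exp(-x)

def f_bar_y(y):
    """Jacobian rule: f_y(y) = X(y) f_x(x(y))."""
    return np.exp(y) * f_bar_x(np.exp(y))

def midpoint_integral(f, a, b, n=200_000):
    """Simple midpoint-rule quadrature of f over (a, b)."""
    edges = np.linspace(a, b, n + 1)
    mid = 0.5 * (edges[:-1] + edges[1:])
    return np.sum(f(mid) * np.diff(edges))

# P(1 < x < 3), computed in either coordinate system:
P_x = midpoint_integral(f_bar_x, 1.0, 3.0)
P_y = midpoint_integral(f_bar_y, np.log(1.0), np.log(3.0))

print(P_x, P_y)   # both ~ exp(-1) - exp(-3)
```

The same probability is obtained in both coordinate systems, even though the two density functions have different shapes; a volumetric probability would instead keep the same values, the Jacobian being absorbed by the volume element.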
The absolute value of the Jacobian is taken by these authors to force probability densities to always be positive, but this denies to probability densities the right to be densities, in the full tensorial sense of the term (see section 1.2.2). In this text, I try to avoid the use of probability densities, and only mention them in the appendixes.

2.3 Sum and Product of Probabilities

Let X be an n-dimensional metric manifold, with a volume distribution V , and let P and Q be two normalized probability distributions over X . In what follows we shall deduce, from P and Q , two new probability distributions over X : their sum, denoted P ∪ Q , and their product, denoted P ∩ Q .

2.3.1 Sum of Probabilities

P and Q being two probability distributions over X , their sum (or union), denoted P ∪ Q , is defined by the conditions

Postulate 2.6 for any A ⊂ X ,

(P ∪ Q)(A) = (Q ∪ P)(A) ;    (2.72)

Postulate 2.7 for any A ⊂ X ,

P(A) = 0 and Q(A) = 0 =⇒ (P ∪ Q)(A) = 0 ;    (2.73)

Postulate 2.8 if there is some A ⊂ X for which P(A) = 0 , then, necessarily, for any probability Q ,

(P ∪ Q)(A) = Q(A) .    (2.74)

Note: I have to explain here that these postulates do not characterize uniquely the sum operation. The solution I choose is the following one.

Property 2.1 If the probability distribution P is characterized by the volumetric probability f(P) , and the probability distribution Q is characterized by the volumetric probability g(P) , then the probability distribution P ∪ Q is characterized by the volumetric probability, denoted (f + g)(P) , given by

(f + g)(P) = ( α f(P) + β g(P) ) / ( α + β ) ,    (2.75)

where α and β are two arbitrary constants.

Note: An alternative solution would be what is used in fuzzy set theory to define the union of fuzzy sets. Translated to the language of volumetric probabilities, and slightly generalized, this would correspond to

(f + g)(P) = k max( α f(P) , β g(P) ) ,    (2.76)

where α and β are two arbitrary constants, and k a normalizing one.
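A minimal numerical sketch of equation 2.75, on a discretized one-dimensional manifold (the two Gaussian-shaped volumetric probabilities and the weights below are illustrative assumptions): the weighted sum of two normalized volumetric probabilities is again normalized, for any choice of α and β.

```python
import numpy as np

x = np.linspace(0.0, 10.0, 1001)
dv = np.gradient(x)                    # trivial (Euclidean) volume element

def normalize(f):
    """Normalize a volumetric probability: sum of f dv equals 1."""
    return f / np.sum(f * dv)

# Two hypothetical volumetric probabilities over the same manifold.
f = normalize(np.exp(-0.5 * (x - 3.0) ** 2))
g = normalize(np.exp(-0.5 * (x - 7.0) ** 2))

# Equation 2.75: the sum, for arbitrary weights alpha and beta.
alpha, beta = 2.0, 1.0
f_sum = (alpha * f + beta * g) / (alpha + beta)

print(np.sum(f_sum * dv))              # ~ 1.0 : the sum is still normalized
```

The division by α + β is what keeps the result normalized, which is why the constants can remain arbitrary.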
Let me try to give an interpretation of this sum of probabilities. If an experimenter faces realizations of a random process and wants to investigate the probability distribution governing the process, she/he may start making histograms of the realizations. As an example, for realizations of a probability distribution over a continuous space, the experimenter will obtain histograms that, in some sense, will approach the volumetric probability corresponding to the probability distribution. A histogram is typically made by dividing the working space into cells, counting how many realizations fall inside each cell, and dividing the count by the cell volume.

A more subtle approach is possible. First, we have to understand that, in the physical sciences, when "a random point materializes in an abstract space" we have to measure its coordinates. As any physical measurement of a real quantity has attached uncertainties, mathematically speaking, the measurement will not produce a 'point', but a state of information over the space, i.e., a volumetric probability. If we have measured the coordinates of many points, the result of each measurement will be described by a volumetric probability fi(x) . The 'sum' of all these, i.e., the volumetric probability

(f1 + f2 + . . . )(x) = Σi fi(x) ,    (2.77)

is a finer estimation of the background volumetric probability than an ordinary histogram, as actual measurement uncertainties are used, irrespective of any division of the space into cells.

2.3.2 Product of Probabilities

P and Q being two probability distributions over X , their product (or intersection), denoted P ∩ Q , is defined by the conditions

Postulate 2.9 for any A ⊂ X ,

(P ∩ Q)(A) = (Q ∩ P)(A) ;    (2.78)

Postulate 2.10 for any A ⊂ X ,

P(A) = 0 or Q(A) = 0 =⇒ (P ∩ Q)(A) = 0 .
(2.79)

Postulate 2.11 if for whatever B ⊂ X one has P(B) = k V(B) , then, necessarily, for any A ⊂ X and for any probability Q ,

(P ∩ Q)(A) = (Q ∩ P)(A) = Q(A)    (2.80)

(the homogeneous probability distribution is the neutral element of the product operation).

Note: I have to explain here that these postulates do not characterize uniquely the product operation. The solution I choose is the following one.

Property 2.2 If the probability distribution P is characterized by the volumetric probability f(P) , and the probability distribution Q is characterized by the volumetric probability g(P) , then the probability distribution P ∩ Q is characterized by the volumetric probability, denoted (f · g)(P) , given by

(f · g)(P) = f(P) g(P) / ∫_{P∈X} dV(P) f(P) g(P) .    (2.81)

More generally, the 'product' of the volumetric probabilities f1(P) , f2(P) . . . is

(f1 · f2 · f3 . . . )(P) = f1(P) f2(P) f3(P) . . . / ∫_{P∈X} dV(P) f1(P) f2(P) f3(P) . . . .    (2.82)

Note: An alternative solution would be what is used in fuzzy set theory to define the intersection of fuzzy sets. Translated to the language of volumetric probabilities, and slightly generalized, this would correspond to

(f · g)(P) = k min( α f(P) , β g(P) ) ,    (2.83)

where α and β are two arbitrary constants, and k a normalizing one.

It is easy to write some extra conditions that distinguish the solution to the axioms given by equations 2.75 and 2.81 from that given by equations 2.76 and 2.83. For instance, as volumetric probabilities are normed using a multiplicative constant (this is not the case with the grades of membership in fuzzy set theory), it makes sense to impose the simplest possible algebra for the multiplication of volumetric probabilities f(P), g(P) . . . by constants λ, µ . . . :

[(λ + µ)f ](P) = (λf + µf)(P) ;    [λ(f · g)](P) = (λf · g)(P) = (f · λg)(P) .
(2.84)

One important property of the two operations 'sum' and 'product' just introduced is their invariance with respect to a change of variables. As we consider probability distributions over a continuous space, and as our definitions are independent of any choice of coordinates over the space, we obtain equivalent results in any coordinate system.

[Note: Say somewhere that the set of 11 postulates 2.1–2.11, defining the volume and a set of probability distributions furnished with two operations, defines an inference space.]

The interpretation of this product of volumetric probabilities can be obtained by comparing figures 2.7 and 2.6. In figure 2.7, a probability distribution P( · ) is represented by the volumetric probability associated to it. To any region A of the plane, it associates the probability P(A) . If a point has been realized following the probability distribution P( · ) and we are given the information that, in fact, the point is "somewhere" inside the region B , then we can update the prior probability P( · ) , replacing it by the conditional probability P( · |B) = P( · ∩ B)/P(B) . This (classical) definition means that P( · |B) equals P( · ) inside B and is zero outside, as suggested in the center of the figure (the division by P(B) just corresponds to a renormalization). If the probability A → P(A) is represented by a volumetric probability f(P) , the probability A → P(A|B) is represented by the volumetric probability f(P|B) given by

f(P|B) = k f(P) H(P) = f(P) H(P) / ∫_{X} dV(P) f(P) H(P) ,    (2.85)

where H(P) takes a constant value inside B , and vanishes outside. We see that f(P|B) is proportional to f(P) inside B and is zero outside B . While the elements entering the definition of a conditional probability are a probability distribution P and a subset B ⊂ X , we here consider two probability distributions P and Q , with volumetric probabilities f(P) and g(P) .
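This relation between product and conditioning can be checked numerically: taking for g the homogeneous volumetric probability of a set B, the product of equation 2.81 reproduces the conditional volumetric probability of equation 2.85. (The Gaussian f and the interval B below are illustrative assumptions.)

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 2001)
dv = np.gradient(x)                          # trivial 1-D volume element

def product(f, g):
    """Equation 2.81, discretized: (f.g) = f g / integral of f g dV."""
    fg = f * g
    return fg / np.sum(fg * dv)

# A hypothetical volumetric probability f.
f = np.exp(-0.5 * x ** 2)
f /= np.sum(f * dv)

# g: homogeneous volumetric probability over the set B = (0, 2).
H = ((x > 0.0) & (x < 2.0)).astype(float)    # indicator of B
g = H / np.sum(H * dv)

f_given_B = product(f, g)                    # product, equation 2.81

# Direct conditional of equation 2.85: f restricted to B, renormalized.
direct = f * H / np.sum(f * H * dv)

print(np.max(np.abs(f_given_B - direct)))    # ~ 0
```

The two computations agree, which is the announced special case: the product with a homogeneous probability on B is the conditional probability given B.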
It is clear that equation 2.81 is a generalization of equation 2.85, as the set B is now replaced by a probability distribution Q (see figure 2.6). In the special case where the probability Q is zero everywhere except inside a domain B , where it is homogeneous, we recover the standard notion of conditional probability.

Figure 2.6: Illustration of the definition of the product of two probability distributions, interpreted here as a generalization of the notion of conditional probability (see figure 2.7). While a conditional probability combines a probability distribution P( · ) with an 'event' B , the product operation combines two probability distributions P( · ) and Q( · ) defined over the same space. [Figure labels: P( · ) , Q( · ) , (P∩Q)( · ) ; f(x) , g(x) , (f·g)(x) = k f(x) g(x) .]

Example 2.4 Let S represent the surface of the Earth, using geographical coordinates (longitude ϕ and latitude λ ). An estimation of the position of a floating object at the surface of the sea by an airplane navigator gives a probability distribution for the position of the object corresponding to the (2D) volumetric probability f(ϕ, λ) , and an independent, simultaneous estimation of the position by another airplane navigator gives a probability distribution corresponding to the volumetric probability g(ϕ, λ) . How should the two volumetric probabilities f(ϕ, λ) and g(ϕ, λ) be 'combined' to obtain a 'resulting' volumetric probability? The answer is given by the 'product' of the two volumetric probabilities:

(f · g)(ϕ, λ) = f(ϕ, λ) g(ϕ, λ) / ∫_{S} dS(ϕ, λ) f(ϕ, λ) g(ϕ, λ) .    (2.86)

[End of example.]

2.4 Conditional Probability

2.4.1 Notion of Conditional Probability

Let P( · ) represent a probability distribution over an n-dimensional manifold Xn , i.e., a function A → P(A) satisfying the Kolmogorov axioms.
Letting now B be a 'fixed' region of Xn , we can define another probability distribution, say PB( · ) , that to any region A associates the probability PB(A) defined by PB(A) = P(A ∩ B)/P(B) . It can be shown that this, indeed, is a probability (i.e., satisfies the Kolmogorov axioms). Instead of the notation PB(A) , it is customary to use the notation PB(A) = P(A|B) , and the definition then reads

P(A|B) = P(A ∩ B) / P(B) .    (2.87)

It is important to intuitively understand this definition. The left of figure 2.7 (to be examined later in more detail) suggests a 2D probability distribution P( · ) , that to any region A of the space associates the probability P(A) . Given now a fixed region B , suggested in the figure by an ovoid, we can define another probability distribution, denoted P( · |B) , that to any region A of the space associates the probability P(A|B) defined by equation 2.87. The probability P( · |B) is to be understood as

• being identical to P( · ) inside B (except for a renormalization factor guaranteeing that P(B|B) = 1 ),

• vanishing outside B .

This standard definition of conditional probability is mathematically consistent, and not prone to misinterpretations.

Figure 2.7: Illustration of the definition of conditional probability. Given an initial probability distribution P( · ) (left of the figure) and a set B (middle of the figure), P( · |B) is identical to P( · ) inside B (except for a renormalization factor guaranteeing that P(B|B) = 1 ) and vanishes outside B (right of the figure). [Figure labels: p(x) ; P(A|B) = P(A∩B)/P(B) ; H(x) ; p(x|B) = k p(x) H(x) .]

2.4.2 Conditional Volumetric Probability

A volumetric probability over an n-dimensional manifold induces a volumetric probability over any p-dimensional submanifold (see figure 2.8). We examine here the details of this important issue.
Figure 2.8: Top: A probability distribution in an n-dimensional metric manifold Xn is suggested by some sample points. The probability distribution can be represented by a volumetric probability fn(x) , proportional everywhere to the number of points per unit of n-dimensional volume. A p-dimensional submanifold Xp is also suggested (by a line). Middle: To define the conditional volumetric probability on the submanifold Xp , one considers a 'tube' of constant thickness around the submanifold, and counts the number of points per unit of n-dimensional volume. Bottom: In the limit where the thickness of the tube tends to zero, this defines a p-dimensional volumetric probability fp(x) over the submanifold Xp . The metric over Xp is that induced by the metric over Xn , as is the element of volume. When the n coordinates x = {x1 , . . . , xn} can be separated into p coordinates r = {r1 , . . . , rp} and q coordinates s = {s1 , . . . , sq} (with n = p + q ), so that the p-dimensional submanifold Xp can be defined by the conditions s = s(r) , then the coordinates r can be used as coordinates over the submanifold Xp , and the (p-dimensional) conditional volumetric probability, as given by equation 2.95, is simply fp(r) = k fn(r, s(r)) , where k is a normalization constant. The probability of a region Ap ⊂ Xp is to be evaluated as P(Ap) = ∫_{r∈Ap} dvp(x) fp(x) , where the p-dimensional volume element dvp(x) is given in equations 2.97–2.99.
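The construction suggested by figure 2.8 can be sketched numerically in the simplest case n = 2 , p = 1 , with a Euclidean metric, where the induced length element reduces to dℓ = √(1 + S(r)²) dr . (The 2D Gaussian f_n and the line s(r) below are illustrative assumptions, not an example from the text.)

```python
import numpy as np

def f_n(r, s):
    """Hypothetical 2-D volumetric probability (isotropic Gaussian)."""
    return np.exp(-0.5 * (r ** 2 + s ** 2)) / (2.0 * np.pi)

def s_of_r(r):
    """Hypothetical submanifold s = s(r): a straight line."""
    return 0.5 * r + 1.0

r = np.linspace(-8.0, 8.0, 4001)
S = np.gradient(s_of_r(r), r)               # partial derivative ds/dr
dl = np.sqrt(1.0 + S ** 2) * np.gradient(r) # induced length element

# Conditional volumetric probability on the submanifold:
# f_p(r) = f_n(r, s(r)) / integral of f_n(r, s(r)) dl.
fp = f_n(r, s_of_r(r))
fp /= np.sum(fp * dl)

print(np.sum(fp * dl))                      # ~ 1.0 by construction
```

Note that the restriction f_n(r, s(r)) is simply renormalized; the metric enters only through the length element dℓ used in the normalization integral.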
2.4.2.1 General Situation

As in section 2.1.5, consider an n-dimensional manifold Xn , with some coordinates x = {x1 , . . . , xn} , and a metric tensor g(x) = {gij(x)} . The n-dimensional volume element is, then, dV(x) = ḡ(x) dv̄(x) = √(det g(x)) dx1 ∧ · · · ∧ dxn . In section 2.1.5, the n coordinates x = {x1 , . . . , xn} of Xn have been separated into one group of p coordinates r = {r1 , . . . , rp} and one group of q coordinates s = {s1 , . . . , sq} , with p + q = n , and a p-dimensional submanifold Xp of the n-dimensional manifold X (with p ≤ n ) has been introduced via the constraint

s = s(r) .    (2.88)

Consider a probability distribution P over Xn , represented by the volumetric probability f(x) = f(r, s) . We wish to define (and to characterize) the 'conditional volumetric probability' induced over the submanifold by the volumetric probability f(x) = f(r, s) .

Given the p-dimensional submanifold Xp of the n-dimensional manifold Xn , one can define a set B(∆s) as being the set of all points whose distance to the submanifold Xp is less than or equal to ∆s . For any finite value of ∆s , Kolmogorov's definition of conditional probability applies, and the conditional probability so defined associates, to any A ⊂ Xn , the probability 2.87. Except for a normalization factor, this conditional probability equals the original one, except that all the regions whose points are at a distance larger than ∆s have been 'trimmed away'. This is still a probability distribution over Xn . In the limit ∆s → 0 this shall define a probability distribution over the submanifold Xp , that we are about to characterize.

Consider a volume element dvp over the submanifold Xp , and all the points of Xn that are at a distance smaller than or equal to ∆s from the points inside the volume element.
For small enough ∆s , the n-dimensional volume ∆vn so defined is

∆vn ≈ dvp ∆ωq ,    (2.89)

where ∆ωq is the volume of the q-dimensional sphere of radius ∆s that is orthogonal to the submanifold at the considered point. This volume is proportional to (∆s)q , so we have

∆vn ≈ k dvp (∆s)q ,    (2.90)

where k is a numerical factor. The conditional probability associated to this n-dimensional region by formula 2.87 is, by definition of volumetric probability,

dP(p+q) ≈ k′ f ∆vn ≈ k″ f dvp (∆s)q ,    (2.91)

where k′ and k″ are constants. The conditional probability of the p-dimensional volume element dvp of the submanifold Xp is then defined as the limit

dPp = lim_{∆s→0} dP(p+q) / (∆s)q ,    (2.92)

this giving dPp = k f dvp , or, to put the variables explicitly,

dPp(r) = k f(r, s(r)) dvp(r) .    (2.93)

We have thus arrived at a p-dimensional volumetric probability over the submanifold Xp that is given by

fp(r) = k f(r, s(r)) ,    (2.94)

where k is a constant. If the probability is normalizable, and we choose to normalize it to one, then

fp(r) = f(r, s(r)) / ∫_{r∈Xp} dvp(r) f(r, s(r)) .    (2.95)

With this volumetric probability, the probability of a region Ap of the submanifold is computed as

P(Ap) = ∫_{r∈Ap} dvp(x) fp(r) .    (2.96)

I must emphasize here that the limit we have used to define the conditional volumetric probability is an 'orthogonal limit' (see figure 2.9). This contrasts with usual texts, where, instead, a 'vertical limit' is used. The formal similarity of the result 2.95 with that proposed in the books that use the 'vertical limit' deserves explanation: we are handling here volumetric probabilities, not probability densities. The results for the 'orthogonal limit' used here, when translated into the language of probability densities, give results that are not the familiar results of common texts (see appendix 2.8.1).

Figure 2.9: The three limits that could be used to define a conditional volumetric probability over a submanifold.
In the top, the 'orthogonal' or 'natural' limit. In the middle, the usual 'vertical' limit, and in the bottom, a 'horizontal' limit. The last two, although mentioned below (section 2.4.2.2), are not used in this book.

As already mentioned, the coordinates r define a coordinate system over the submanifold Xp . The volume element of the submanifold can, then, be written

dvp(r) = ḡp(r) dv̄p(r) ,    (2.97)

with dv̄p(r) = dr1 ∧ · · · ∧ drp . The volume density in the coordinates r on the submanifold Xp has been characterized in section 2.1.5 (equation 2.37):

ḡp(r) = √(det gp(r)) ,    (2.98)

with

gp(r) = grr + grs S + Sᵗ gsr + Sᵗ gss S .    (2.99)

It is understood that all the 'matrices' appearing at the right are taken at the point ( r, s(r) ) . The probability of a region Ap of the submanifold can then either be computed using equation 2.96 or as

P(Ap) = ∫_{r∈Ap} dv̄(r) ḡp(r) fp(r) ,    (2.100)

with the ḡp(r) given in equation 2.98 and with dv̄(r) = dr1 ∧ · · · ∧ drp .

Figure 2.10: The spherical Fisher distribution corresponds to the conditional probability distribution induced over a sphere by a Gaussian probability distribution in a Euclidean 3D space (see example 2.5). To have a full 3D representation of the property, this figure should be 'rotated around the vertical axis'.

Example 2.5 In the Euclidean 3D space, consider an isotropic Gaussian probability distribution with standard deviation σ . What is the conditional (2D) volumetric probability it induces on the surface of a sphere of unit radius whose center is at unit distance from the center of the Gaussian? Using geographical coordinates (see figure 2.10), the answer is given by the (2D) volumetric probability

f(ϕ, λ) = k exp( sin λ / σ² ) ,    (2.101)

where k is a normalizing constant (see the demonstration in appendix XXX). This is the celebrated Fisher probability distribution, widely used as a model probability on the sphere's surface.
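As a numerical check of equation 2.101 (with the illustrative value σ = 0.5 , i.e., concentration κ = 1/σ² = 4 , an assumption made only for this sketch), integrating exp(κ sin λ) over the sphere with the surface element dS = cos λ dϕ dλ reproduces the closed form 4π sinh(κ)/κ , so that k = κ / (4π sinh κ) .

```python
import numpy as np

kappa = 1.0 / 0.5 ** 2          # concentration 1/sigma^2, sigma = 0.5

# Midpoint-rule integration over latitude; the longitude integral of
# an axially symmetric integrand contributes a factor 2*pi.
edges = np.linspace(-np.pi / 2.0, np.pi / 2.0, 200_001)
lam = 0.5 * (edges[:-1] + edges[1:])
Z = 2.0 * np.pi * np.sum(np.exp(kappa * np.sin(lam)) * np.cos(lam)
                         * np.diff(edges))

# Closed form of the same integral (substitute u = sin(lambda)):
Z_exact = 4.0 * np.pi * np.sinh(kappa) / kappa

print(Z, Z_exact)               # the two values agree
```

The normalization constant of equation 2.101 is then k = 1/Z , and the check also confirms that the natural integration measure on the sphere is cos λ dϕ dλ , not dϕ dλ .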
The surface element over the surface of the sphere could be obtained using equations 2.98–2.99, but it is well known to be dS(ϕ, λ) = cos λ dϕ dλ . [End of example.]

Example 2.6 In the case where we work in a two-dimensional space X2 , with p = q = 1 , we can use the notations r and s instead of r and s , so that the constraint 2.88 is written

s = s(r) ,    (2.102)

and the 'matrix' of partial derivatives is now a simple real quantity

S = ∂s/∂r .    (2.103)

The conditional volumetric probability on the line s = s(r) induced by a volumetric probability f(r, s) is (equation 2.95)

f1(r) = f(r, s(r)) / ∫ dℓ(r′) f(r′, s(r′)) ,    (2.104)

where, if the metric of the space X2 is written

g(r, s) = | grr(r, s)  grs(r, s) |
          | gsr(r, s)  gss(r, s) | ,    (2.105)

the (1D) volume element is (equations 2.97–2.99)

dℓ(r) = √( grr(r, s(r)) + 2 S(r) grs(r, s(r)) + S(r)² gss(r, s(r)) ) dr .    (2.106)

The probability of an interval ( r1 < r < r2 ) along the line s = s(r) is then

P = ∫_{r1}^{r2} dℓ(r) f1(r) .    (2.107)

If the constraint 2.102 is, in fact, s = s0 , then equation 2.104 simplifies into

f1(r) = f(r, s0) / ∫ dℓ(r′) f(r′, s0) ,    (2.108)

and, as the partial derivative vanishes, S = 0 , the length element 2.106 becomes

dℓ(r) = √( grr(r, s0) ) dr .    (2.109)

[End of example.]

Example 2.7 Consider two Cartesian coordinates {x, y} on the Euclidean plane, associated to the usual metric ds² = dx² + dy² . It is easy to see (using, for instance, equation 1.23) that the metric matrix associated to the new coordinates (see figure 2.11)

r = x ;    s = x y    (2.110)

is

g(r, s) = | 1 + s²/r⁴   −s/r³ |
          | −s/r³       1/r²  | ,    (2.111)

with volume density √(det g(r, s)) = 1/r . Assume that all that we know about the position of a given point is described by the volumetric probability f(r, s) . Then, we are told that, in fact, the point is on the line defined by the equation s = s0 . What can we now say about the coordinate r of the point?
This is clearly a problem of conditional volumetric probability, and the information we now have on the position of the point is represented by the volumetric probability (on the line s = s0 ) given by equation 2.108:

f1(r) = f(r, s0) / ∫ dℓ(r′) f(r′, s0) .    (2.112)

Here, considering the special form of the metric in equation 2.111, the length element given by equation 2.109 is

dℓ(r) = √( 1 + s0²/r⁴ ) dr .    (2.113)

The special case s = s0 = 0 gives

f1(r) = f(r, 0) / ∫ dℓ(r′) f(r′, 0) ;    dℓ(r) = dr .    (2.114)

[End of example.]

Example 2.8 To address a paradox mentioned by E.T. Jaynes, let us solve the same problem as in the previous example, but using the Cartesian coordinates {x, y} . The information that was represented by the volumetric probability f(r, s) is now represented by the volumetric probability h(x, y) given by (as volumetric probabilities are invariant objects)

h(x, y) = f(r, s)|_{r=x ; s=x y} .    (2.115)

Figure 2.11: The Euclidean plane, with, at the left, two Cartesian coordinates {x, y} , and, at the right, the two coordinates u = x ; v = x y . [Figure labels omitted: coordinate lines for x and u from 0 to 1, and for y and v from −1 to +1.]

As the condition s = 0 is equivalent to the condition y = 0 , and as the metric matrix is the identity, it is clear that we shall arrive, for the (1D) volumetric probability representing the information we have on the coordinate x , at

h1(x) = h(x, 0) / ∫ dℓ(x′) h(x′, 0) ;    dℓ(x) = dx .    (2.116)

Not only is this equation similar in form to equation 2.114; replacing here h by f (using equation 2.115) we obtain an identity that can be expressed using either of the two equivalent forms

h1(x) = f1(r)|_{r=x} ;    f1(r) = h1(x)|_{x=r} .    (2.117)

Along the line s = y = 0 , the two coordinates r and x coincide, so we obtain the same volumetric probability (with the same length elements dℓ(x) = dx and dℓ(r) = dr ).
Trivial as it may seem, this result is not the one found with the traditional definition of conditional probability density. Jaynes, in the 15th chapter of his unfinished Probability Theory book, lists this as one of the paradoxes of probability theory. It is not a paradox; it is a mistake one makes when falling into the illusion that a conditional probability density (or a conditional volumetric probability) can be defined without invoking the existence of a metric (i.e., of a notion of distance) in the working space. This 'paradox' is related to the 'Borel-Kolmogorov paradox', that I address in appendix 2.8.10. [End of example.]

2.4.2.2 Case X = R × S

I shall show here that a 'joint' volumetric probability f(r, s) over a space Xp+q = Rp × Sq can induce, via a relation s = s(r) , three different conditional volumetric probabilities: (i) a volumetric probability fx(r) over the submanifold s = s(r) itself; (ii) a volumetric probability fr(r) over Rp ; and (iii) a volumetric probability fs(s) over Sq (in both the case p ≤ q and the case p ≥ q ). Figure 2.12 shows a schematic view of the properties we are about to analyze.

Figure 2.12: In an n-dimensional space Xn that is the Cartesian product of two spaces Rp and Sq , with coordinates r = {r1 , . . . , rp} and s = {s1 , . . . , sq} and metric tensors gr and gs , there is a volume element on each of Rp and Sq , and an induced volume element in Xn = Rp × Sq . Given a p-dimensional submanifold s = s(r) of Xn , there also is an induced volume element on it.
A volumetric probability f(r, s) over Xn induces a (conditional) volumetric probability fx(r) over the submanifold s = s(r) (equation 2.125), and, as the submanifold shares the same coordinates as Rp , a volumetric probability fr(r) is also induced over Rp (equation 2.127). This volumetric probability can, in turn, be transported into Sq , using the concepts developed in section 2.6.

Consider a p-dimensional manifold R with a coordinate system r = {rα} and metric tensor gr(r) , and a q-dimensional manifold S with a coordinate system s = {si} and metric tensor gs(s) . Each space has, then, a distance element

ds²r = (gr)αβ drα drβ ;    ds²s = (gs)ij dsi dsj ,    (2.118)

and a volume element

dvr(r) = ḡr(r) dv̄r(r) ;    dvs(s) = ḡs(s) dv̄s(s) ,    (2.119)

that are related to the capacity elements

dv̄r(r) = dr1 ∧ · · · ∧ drp ;    dv̄s(s) = ds1 ∧ · · · ∧ dsq    (2.120)

via the volume densities

ḡr(r) = ηr √(det gr(r)) ;    ḡs(s) = ηs √(det gs(s)) .    (2.121)

We can build the Cartesian product X = R × S of the two spaces, by defining the points of X as being made by a point of R and a point of S (so we can write x = {r, s} ), and by introducing a metric tensor g(x) over X through the definition⁵

ds² = ds²r + ds²s .    (2.122)

This implies that the metric g(x) = g(r, s) has the partitioned form

g(r, s) = | gr(r)   0     |
          | 0       gs(s) | .    (2.123)

Note: explain that what follows is on the submanifold. With this partitioned metric, the metric tensor in equation 2.99 simplifies to

gp = gr + Sᵗ gs S ,    (2.124)

or, more explicitly, gp(r) = gr(r) + Sᵗ(r) gs(s(r)) S(r) . Collecting here equations 2.95, 2.98 and 2.100, we can write the conditional volumetric probability on the submanifold s = s(r) as

fx(r) = k f(r, s(r)) ,    (2.125)

where k is a normalization constant.
Using the volume element over the submanifold, the probability of a region A of the submanifold s = s(r) is computed via

P(A) = ∫_A dr1 ∧ · · · ∧ drp √( det(gr + Sᵗ gs S) ) fx(r) .    (2.126)

As the conditional volumetric probability fx(r) is on the submanifold s = s(r) , it is integrated with the volume density of the submanifold (equation 2.126). Remember that the coordinates r are not only the coordinates of the subspace R : they also define a coordinate system over the submanifold s = s(r) .

Note: explain that what follows is on the space Rp . Equations 2.125–2.126 define a volumetric probability over the submanifold Xp . As the coordinates r are both coordinates over Rp and over the submanifold Xp , if we define

fr(r) = k ( √( det(gr + Sᵗ gs S) ) / √(det gr) ) f(r, s(r)) ,    (2.127)

where the normalization factor k is given by

1/k = ∫_{Rp} dvr(r) ( √( det(gr + Sᵗ gs S) ) / √(det gr) ) f(r, s(r)) ,    (2.128)

a probability is then expressed as

P(A) = ∫_A dvr(r) fr(r) ,    (2.129)

the volume element being

dvr(r) = √(det gr) dr1 ∧ · · · ∧ drp .    (2.130)

As this is the volume element of Rp , we see that we have defined a volumetric probability over Rp . [Note: This is very important, it has to be better explained.] We see thus that, via s = s(r) , the volumetric probability f(r, s) has not only induced a conditional volumetric probability fx(r) over the submanifold s = s(r) , but also a volumetric probability fr(r) over R . These two volumetric probabilities are completely equivalent, and one may focus on one or the other depending on the applications in view. We shall talk about the conditional volumetric probability fx(r) on the submanifold s = s(r) and about the conditional volumetric probability fr(r) on the subspace R .

⁵ Expression 2.122 is just a special situation. More generally, one should take ds² = α² ds²r + β² ds²s .
If instead of the volumetric probabilities fr(r) and f(r, s) we introduce the probability densities

f̄r(r) = ḡr(r) fr(r) = √(det gr(r)) fr(r) ;    f̄(r, s) = ḡ(r, s) f(r, s) = √(det gr(r)) √(det gs(s)) f(r, s) ,    (2.131)

then equation 2.127 becomes

f̄r(r) = k ( √( det(gr + Sᵗ gs S) ) / ( √(det gr) √(det gs) ) ) f̄(r, s(r)) ,    (2.132)

where the normalization factor k is given by

1/k = ∫_{Rp} dv̄r(r) ( √( det(gr + Sᵗ gs S) ) / ( √(det gr) √(det gs) ) ) f̄(r, s(r)) ,    (2.133)

the capacity element being

dv̄r(r) = dr1 ∧ · · · ∧ drp .    (2.134)

A probability is expressed as

P(A) = ∫_A dv̄r(r) f̄r(r) .    (2.135)

Note: analyze here the case where the application s = s(r) degenerates into

s = s0 ,    (2.136)

in which case the matrix S of partial derivatives vanishes. Then, using for the conditional volumetric probability the usual notation f(r|s0) , equations 2.127–2.128 simply give

f(r|s0) = f(r, s0) / ∫ dvr(r) f(r, s0) .    (2.137)

Equivalently, in terms of probability densities, equations 2.132–2.133 become, in the case s = s0 ,

f̄(r|s0) = ( f̄(r, s0) / √(det gs(s0)) ) / ∫ dv̄r(r) ( f̄(r, s0) / √(det gs(s0)) ) .    (2.138)

Note: I have to check if I can drop the constant term √(det gs(s0)) from this equation.

The assumption that the joint metric diagonalizes 'in the variables' {r, s} is essential here. If from the variables {r, s} we pass to some other variables {u, v} through a general change of variables, the metric of the space X shall no longer be diagonal in the new variables, and a definition of, say, f̄(u|v0) shall not be possible. This difficulty is often disregarded in usual texts working with probability densities, causing some confusion in applications of probability theory that use the notion of conditional probability density, and the associated expression of the Bayes theorem (see section 2.5.4).
Example 2.9 With the notations of this section, consider that the metric g_r of the space R^p and the metric g_s of the space S^q are constant (i.e., that both the coordinates r^α and s^i are rectilinear coordinates in Euclidean spaces), and that the application s = s(r) is a linear application, that we can write

    s = S r ,        (2.139)

as this is consistent with the definition of S as the matrix of partial derivatives, S^i{}_α = ∂s^i/∂r^α. Consider that we have a Gaussian probability distribution over the space R^p, represented by the volumetric probability

    f_p(r) = \frac{1}{(2\pi)^{p/2}} \exp\left( -\tfrac{1}{2} (r - r_0)^t g_r (r - r_0) \right) ,        (2.140)

that is normalized via \int dr^1 \wedge \cdots \wedge dr^p \sqrt{\det g_r}\, f_p(r) = 1. Similarly, consider that we also have a Gaussian probability distribution over the space S^q, represented by the volumetric probability

    f_q(s) = \frac{1}{(2\pi)^{q/2}} \exp\left( -\tfrac{1}{2} (s - s_0)^t g_s (s - s_0) \right) ,        (2.141)

that is normalized via \int ds^1 \wedge \cdots \wedge ds^q \sqrt{\det g_s}\, f_q(s) = 1. Finally, consider the (p+q)-dimensional probability distribution over the space X_{p+q} defined as the product of these two volumetric probabilities,

    f(r,s) = f_p(r)\, f_q(s) .        (2.142)

Given this (p+q)-dimensional volumetric probability f(r,s) and given the p-dimensional hyperplane s = S r, we obtain the conditional volumetric probability f_r(r) over R^p as given by equation 2.127. All simplifications done^6, one obtains the Gaussian volumetric probability^7

    f_r(r) = \frac{1}{(2\pi)^{p/2}} \frac{\sqrt{\det \tilde g_r}}{\sqrt{\det g_r}} \exp\left( -\tfrac{1}{2} (r - \tilde r_0)^t \tilde g_r (r - \tilde r_0) \right) ,        (2.143)

where the metric \tilde g_r (inverse of the covariance matrix) is

    \tilde g_r = g_r + S^t g_s S        (2.144)

and where the mean \tilde r_0 can be obtained by solving the linear system^8

    \tilde g_r ( \tilde r_0 - r_0 ) = S^t g_s ( s_0 - S r_0 ) .
        (2.145)

Note: I should now show here that f_s(s), the volumetric probability in the space S^q, is given, in all cases (p ≤ q or p ≥ q), by

    f_s(s) = \frac{1}{(2\pi)^{q/2}} \frac{\sqrt{\det \tilde g_s}}{\sqrt{\det g_s}} \exp\left( -\tfrac{1}{2} (s - \tilde s_0)^t \tilde g_s (s - \tilde s_0) \right) ,        (2.146)

where the metric \tilde g_s (inverse of the covariance matrix) is

    (\tilde g_s)^{-1} = S\, (\tilde g_r)^{-1}\, S^t        (2.147)

and where the mean \tilde s_0 is

    \tilde s_0 = S\, \tilde r_0 .        (2.148)

Note: say that this is illustrated in figure 2.13. [End of example.]

Figure 2.13: Provisional figure to illustrate example 2.9. [Figure: the volumetric probabilities f_p(r) and f_q(s), the line s = S r, and the induced f_r(r) and f_s(s).]

^6 Note: explain this.
^7 This volumetric probability is normalized by \int dr^1 \wedge \cdots \wedge dr^p \sqrt{\det g_r}\, f_r(r) = 1.
^8 Explicitly, one can write \tilde r_0 = r_0 + (\tilde g_r)^{-1} S^t g_s (s_0 - S r_0), but in numerical applications, the direct resolution of the linear system 2.145 is preferable.

2.5 Marginal Probability

2.5.1 Marginal Probability Density

In a (p+q)-dimensional space X_{p+q}, consider a continuous, non-intersecting set of p-dimensional hypersurfaces, parameterized by some parameters s = {s^1, s^2, ..., s^q}, as suggested in figure 2.14. Each given value of s, say s = s_0, defines one such hypersurface. Consider also a probability distribution over X_{p+q} (suggested by the ovoidal shape marked 'P' in the figure). We have seen above that, given a particular hypersurface s = s_0, we can define a conditional probability distribution that associates a different value of a volumetric probability to each point of the hypersurface. We are not interested now in the 'variability' inside each hypersurface, but in defining a global 'probability' for each hypersurface, to analyze the variation of the probability from one hypersurface to another. Crudely speaking, to the hypersurface marked 'H' in the figure, we are going to associate the probability of the small 'crescent' defined by two infinitely close hypersurfaces.

Figure 2.14: Figure for the definition of marginal probability. Caption to be written.
The easiest way to develop the idea (and to find explicit expressions) is to characterize the points inside each of the hypersurfaces by some coordinates r = {r^1, r^2, ..., r^p}. Still better, we can assume that the set {r, s} individualizes one particular point of X_{p+q}, i.e., that the set x = {r, s} is a coordinate system over X_{p+q} (see figure 2.15).

Figure 2.15: Figure for the definition of marginal probability. Caption to be written. [Figure: coordinate lines for constant values of s (from s = -1 to s = +1) and for constant values of r (from r = 0 to r = 1).]

We shall verify at the end that the definition we are going to make of a probability distribution over s is independent of the particular choice of the coordinates r.

Let, then, f(r,s) be a volumetric probability over X_{p+q}. The probability of a domain A ⊂ X_{p+q} is computed as

    P(A) = \int_A dv(r,s)\, f(r,s) ,        (2.149)

where dv(r,s) = \sqrt{\det g}\; ds^1 \wedge \cdots \wedge ds^q \wedge dr^1 \wedge \cdots \wedge dr^p. Explicitly,

    P(A) = \int_A ds^1 \wedge \cdots \wedge ds^q \wedge dr^1 \wedge \cdots \wedge dr^p \sqrt{\det g}\; f(r,s) .        (2.150)

As the (infinitesimal) probability of the 'crescent' around the hypersurface 'H' in figure 2.14 is

    dP_q(s) = ds^1 \wedge \cdots \wedge ds^q \int_{\text{all values of } r} dr^1 \wedge \cdots \wedge dr^p \sqrt{\det g}\; f(r,s) ,        (2.151)

we can introduce the definition

    \bar f_s(s) = \int_{\text{all values of } r} dr^1 \wedge \cdots \wedge dr^p \sqrt{\det g}\; f(r,s) ,        (2.152)

to have dP_q(s) = ds^1 \wedge \cdots \wedge ds^q\, \bar f_s(s). When the parameters s are formally seen as coordinates over some (yet undefined) space, the probability of a region B of this space is, by definition of \bar f_s(s), computed as

    P(B) = \int_B ds^1 \wedge \cdots \wedge ds^q\, \bar f_s(s) ,        (2.153)

this showing that \bar f_s(s) can be interpreted as a probability density over s that, by construction, corresponds to the integrated probability over the hypersurface defined by a constant value of the parameters s (see figure 2.14 again).
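Definition 2.152 can be checked numerically. The following sketch is a hypothetical test case: the unit sphere with geographical coordinates (where √det g = cos λ) and the uniform volumetric probability f = 1/(4π). Integrating √det g · f over the longitude φ at fixed latitude λ gives the marginal probability density \bar f_s(λ) = (cos λ)/2, which integrates to one.

```python
import numpy as np

def trap(y, x):
    # simple trapezoidal rule, to keep the sketch dependency-free
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

f = 1.0 / (4.0 * np.pi)                          # uniform volumetric probability
lam = np.linspace(-np.pi / 2, np.pi / 2, 4001)   # 's' coordinate: latitude
phi = np.linspace(-np.pi, np.pi, 4001)           # 'r' coordinate: longitude

# Equation 2.152: at fixed s, integrate sqrt(det g) * f over the r coordinate;
# on the sphere sqrt(det g) = cos(lambda)
fbar_s = np.array([trap(np.cos(l) * f * np.ones_like(phi), phi) for l in lam])

print(fbar_s[2000])        # at the equator (lam = 0): cos(0)/2 = 0.5
print(trap(fbar_s, lam))   # total probability ≈ 1
```

Note that \bar f_s(λ) is a probability density over λ, not a volumetric probability; Example 2.10 below returns to this distinction.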
The expression 2.153 is the typical one for evaluating finite probabilities from a probability density (see equations 2.55–2.56); for this reason we shall call \bar f_s(s) the marginal probability density (for the variables s). This is the most one can do given only the elements of the problem, i.e., a probability distribution over a space and a continuous family of hypersurfaces. Note that we have been able to introduce a probability density over the variables s, but not a volumetric probability, which can only be defined over a well-defined space.

Once we understand that we can only define a probability density \bar f_s(s) (and not a volumetric probability), we can rewrite equation 2.152 as

    \bar f_s(s) = \int_{\text{all values of } r} dr^1 \wedge \cdots \wedge dr^p\, \bar f(r,s) ,        (2.154)

where

    \bar f(r,s) = \sqrt{\det g}\; f(r,s)        (2.155)

is the probability density representing (in the coordinates {r, s}) the initial probability distribution over the space X_{p+q}.

The elements used in the definition of the marginal probability density \bar f_s(s) are: (i) a probability distribution over a (p+q)-dimensional metric space X_{p+q}, and (ii) a continuous family of p-dimensional hypersurfaces characterized by some q parameters s = {s^1, ..., s^q}. This is independent of any coordinate system over X_{p+q}. It remains that the q parameters s can be considered as q coordinates over X_{p+q} that can be completed, in an arbitrary manner, by p more coordinates r = {r^1, ..., r^p} in order to have a complete coordinate system x = {r, s} over X_{p+q}. That the probability density \bar f_s(s) is independent of the choice of the coordinates r is seen by considering equation 2.152. For any fixed value of s (i.e., on a given p-dimensional submanifold), the term \sqrt{\det g}\; dr^1 \wedge \cdots \wedge dr^p is just the expression of the volume element on the submanifold, which, by definition, is an invariant, as is the volumetric probability f.
Therefore, the integral in equation 2.152 keeps its value invariant under any change of the coordinates r.

In many applications, the continuous family of p-dimensional hypersurfaces is not introduced per se. Rather, one has a given coordinate system x over X_{p+q} that is, for some reason, split into p coordinates r and q coordinates s. These coordinates define different coordinate hypersurfaces over X_{p+q}, and, among them, the p-dimensional hypersurfaces defined by constant values of the coordinates s. Then, the definition of marginal probability density given above applies. [NOTE: COME BACK HERE AFTER ANALYZING POISSON.] In this particular situation, the metric properties of the space need not be taken into account, and the two equations 2.153–2.154, which only involve probability densities, can be used.

2.5.2 Marginal Volumetric Probability

Consider now the special situation where the (p+q)-dimensional space X_{p+q} is defined as the Cartesian product of two spaces, X = R × S, with respective dimensions p and q. The notion of Cartesian product of two metric manifolds has been introduced in section 2.4.2.2. Note: recall here equations 2.122–2.123:

    ds^2 = ds_r^2 + ds_s^2 .        (2.156)

This implies that the metric g(x) = g(r,s) has the partitioned form

    g(r,s) = \begin{pmatrix} g_r(r) & 0 \\ 0 & g_s(s) \end{pmatrix} .        (2.157)

In particular, over the (p+q)-dimensional manifold X one then has the induced volume element

    dv(r,s) = dv_r(r)\, dv_s(s) ,        (2.158)

where the 'marginal' volume elements dv_r(r) and dv_s(s) are those given in equations 2.119.

Consider now a probability distribution over X, characterized by a volumetric probability f(x) = f(r,s). It is not assumed that this volumetric probability factors as a product of a volumetric probability over R by a volumetric probability over S. Assuming that this probability is normalizable, we can write the equivalent expressions

    P(X) = \int_{x \in X} dv(x)\, f(x) = \int_{r \in R} dv_r(r) \int_{s \in S} dv_s(s)\, f(r,s) = \int_{s \in S} dv_s(s) \int_{r \in R} dv_r(r)\, f(r,s) .
        (2.159)

Defining the two marginal volumetric probabilities

    f_r(r) = \int_{s \in S} dv_s(s)\, f(r,s) ;  f_s(s) = \int_{r \in R} dv_r(r)\, f(r,s) ,        (2.160)

this can be written

    P(X) = \int_{x \in X} dv(x)\, f(x) = \int_{r \in R} dv_r(r)\, f_r(r) = \int_{s \in S} dv_s(s)\, f_s(s) .        (2.161)

It is clear that the marginal volumetric probability f_r(r) defines a probability over R, while the marginal volumetric probability f_s(s) defines a probability over S.

2.5.3 Interpretation of Marginal Volumetric Probability

These definitions can be intuitively interpreted as follows. Assume that there is a volumetric probability f(x) = f(r,s) defined over a space X that is the Cartesian product of two spaces R and S, in the sense just explained. A sampling of the (probability distribution over X associated to the) 'joint' volumetric probability f would produce points (of X)

    x_1 = (r_1, s_1) , x_2 = (r_2, s_2) , x_3 = (r_3, s_3) , ... .        (2.162)

Then,

• the points (of R) r_1, r_2, r_3, ... are samples of the (probability distribution over R associated to the) marginal volumetric probability f_r; and
• the points (of S) s_1, s_2, s_3, ... are samples of the (probability distribution over S associated to the) marginal volumetric probability f_s.

Thus, when working with a Cartesian product of two manifolds X = R × S, and facing a 'joint' volumetric probability f(r,s), if one is only interested in the probability properties induced by f(r,s) over R (respectively over S), one only needs to consider the marginal volumetric probability f_r(r) (respectively f_s(s)). This, of course, implies that one is not interested in the possible dependences between the variables r and the variables s.

2.5.4 Bayes Theorem

Let us continue to work in the special situation where the n-dimensional space X is defined as the Cartesian product of two spaces, X = R × S, with respective dimensions p and q, with n = p + q.
Given a 'joint' volumetric probability f(r,s) over X_n, we have defined two marginal volumetric probabilities f_r(r) and f_s(s) using equations 2.160. We have also written, for any fixed value of s (equation 2.137, dropping the index '0'),

    f(r|s) = \frac{ f(r,s) }{ \int_R dv_r(r)\, f(r,s) } = \frac{ f(r,s) }{ f_s(s) } ,        (2.163)

where, in the second equality, we have used the definition of marginal volumetric probability. It follows that

    f(r,s) = f(r|s)\, f_s(s) ,        (2.164)

an equation that can be read as saying bla, bla, bla... Similarly,

    f(r,s) = f(s|r)\, f_r(r) ,        (2.165)

and comparing these two equations we deduce the well known Bayes theorem

    f(r|s) = \frac{ f(s|r)\, f_r(r) }{ f_s(s) } ,        (2.166)

an equation that can be read as saying bla, bla, bla... Note: explain here again that the assumption that the metric of the space X takes the form expressed in equation 2.157 is fundamental.

2.5.5 Independent Probability Distributions

Assume again that there is a volumetric probability f(x) = f(r,s) defined over a space X that is the Cartesian product of two spaces R and S, in the sense being considered. Then, one may define the marginal volumetric probabilities f_r(r) and f_s(s) given by equation 2.160. If it happens that the 'joint' volumetric probability f(r,s) is just the product of the two marginal probability distributions,

    f(r,s) = f_r(r)\, f_s(s) ,        (2.167)

it is said that the probability distributions over R and S (as characterized by the marginal volumetric probabilities f_r(r) and f_s(s)) are independent. Note: the comparison of this definition with equations 2.164–2.165 shows that, in this case,

    f(r|s) = f_r(r) ;  f(s|r) = f_s(s) ,        (2.168)

from where the 'independence' notion can be understood (note: explain this). [NOTE: REFRESH THE EXAMPLE BELOW.]
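The sampling interpretation of section 2.5.3 can be illustrated numerically. In the sketch below (hypothetical values; R and S are one-dimensional Euclidean spaces, so volumetric probabilities coincide with ordinary densities), the components of samples of a correlated joint Gaussian reproduce the moments of the two marginals, even though the joint does not factorize as in equation 2.167:

```python
import numpy as np

rng = np.random.default_rng(1)
mean = np.array([1.0, -2.0])          # (r, s) centers, made-up values
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])          # r and s are correlated: not independent

x = rng.multivariate_normal(mean, cov, size=200_000)
r_samples, s_samples = x[:, 0], x[:, 1]

# The r (resp. s) components are samples of the marginal f_r (resp. f_s):
# Gaussian marginals keep the diagonal moments, whatever the correlation
print(r_samples.mean(), r_samples.var())   # ≈ 1.0, 1.0
print(s_samples.mean(), s_samples.var())   # ≈ -2.0, 2.0
```

The off-diagonal term 0.6 is what the marginals discard: using f_r and f_s alone, as noted above, means renouncing any statement about the dependence between r and s.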
Example 2.10 Over the surface of the unit sphere, using geographical coordinates, we have the two displacement elements

    ds_\varphi(\varphi, \lambda) = \cos\lambda\, d\varphi ;  ds_\lambda(\varphi, \lambda) = d\lambda ,        (2.169)

with the associated surface element (as the coordinates are orthogonal) ds(\varphi, \lambda) = \cos\lambda\, d\varphi\, d\lambda. Consider a (2D) volumetric probability f(\varphi, \lambda) over the surface of the sphere, normalized under the usual condition

    \int_{\text{surface}} ds(\varphi, \lambda)\, f(\varphi, \lambda) = \int_{-\pi}^{+\pi} d\varphi \int_{-\pi/2}^{+\pi/2} d\lambda\, \cos\lambda\, f(\varphi, \lambda) = \int_{-\pi/2}^{+\pi/2} d\lambda\, \cos\lambda \int_{-\pi}^{+\pi} d\varphi\, f(\varphi, \lambda) = 1 .        (2.170)

One may define the partial integrations

    \eta_\varphi(\varphi) = \int_{-\pi/2}^{+\pi/2} d\lambda\, \cos\lambda\, f(\varphi, \lambda) ;  \eta_\lambda(\lambda) = \int_{-\pi}^{+\pi} d\varphi\, f(\varphi, \lambda) ,        (2.171)

so that the probability of a sector between two meridians and of an annulus between two parallels are respectively computed as

    P(\varphi_1 < \varphi < \varphi_2) = \int_{\varphi_1}^{\varphi_2} d\varphi\, \eta_\varphi(\varphi) ;  P(\lambda_1 < \lambda < \lambda_2) = \int_{\lambda_1}^{\lambda_2} d\lambda\, \cos\lambda\, \eta_\lambda(\lambda) ,        (2.172)

but the terms d\varphi and \cos\lambda\, d\lambda appearing in these two expressions are not the displacement elements on the sphere's surface (equation 2.169). The functions \eta_\varphi(\varphi) and \eta_\lambda(\lambda) should not be mistaken for marginal volumetric probabilities: as the surface of the sphere is not the Cartesian product of two 1D spaces, marginal volumetric probabilities are not defined. [End of example.]

2.6 Transport of Probabilities

2.6.0.1 The Problem

We are contemplating:

• a p-dimensional metric space R^p, with coordinates r = {r^α}, and a metric matrix that, in these coordinates, is g_r;
• a q-dimensional metric space S^q, with coordinates s = {s^i}, and a metric matrix that, in these coordinates, is g_s;
• an application s = σ(r) from R^p into S^q.

To any volumetric probability f_r(r) over R^p, the application

    s = σ(r)        (2.173)

associates a unique volumetric probability f_s(s) over S^q. To intuitively understand this, consider a large collection of samples of f_r(r), say {r_1, r_2, ...}. To each of these points in R^p we can associate a unique point in S^q, via s = σ(r), so we have a large collection of points {s_1, s_2, ...
} in S^q. Of which volumetric probability f_s(s) are these points samples?

Although the major inference problems considered in this book (conditional probability, product of probabilities, etc.) are only defined when the considered spaces are metric, this problem of transport of probabilities makes perfect sense even if the spaces do not have a metric. For this reason, one could set the problem of transportation of a probability distribution in terms of probability densities, instead of volumetric probabilities. I prefer to use the metric concepts and language, but shall also give below the equivalent formulas for those who may choose to work with probability densities.

In what follows, S denotes the matrix of partial derivatives

    S^i{}_α = \frac{\partial s^i}{\partial r^α} .        (2.174)

Note: write somewhere what follows. As we have represented by g_r the metric in the space R^p, the volume element is given by the usual expression

    dv_r(r) = \sqrt{\det g_r(r)}\; dr^1 \wedge \cdots \wedge dr^p ,        (2.175)

the volume of a finite region A being computed via

    V(A) = \int_A dv_r(r) = \int_A dr^1 \wedge \cdots \wedge dr^p \sqrt{\det g_r(r)} .        (2.176)

2.6.0.2 Case p ≤ q

When p ≤ q, the p-dimensional manifold R^p is mapped, via s = s(r), into a p-dimensional submanifold of S^q, say S^p (see figure 2.16). In that submanifold we can use as coordinates the coordinates induced from the coordinates r of R^p via s = s(r). So, now, the coordinates r define, at the same time, a point of R^p and a point of S^p ⊂ S^q (if the points of S^q are covered more than once by the application s = s(r), then let us assume that we work inside a subdomain of R^p where the problem does not exist). Note: I should mention here figure 2.19.

The application s = s(r) maps the p-dimensional volume element dv_r on R^p into a p-dimensional volume element dv_s on the submanifold S^p of S^q. Let us characterize it. The distance element between two points in S^q is ds^2 = (g_s)_{ij}\, ds^i\, ds^j .
If, in fact, those are points of S^p, then we can write ds^i = S^i{}_α dr^α, to obtain ds^2 = G_{αβ}\, dr^α dr^β, where G = S^t g_s S (remember that we can use r as coordinates over the submanifold S^p of S^q). The p-dimensional volume element obtained on S^p by transportation of the volume element dv_r of R^p (via s = s(r)) is

    dv_s(r) = \sqrt{\det( S^t g_s S )}\; dr^1 \wedge \cdots \wedge dr^p ,        (2.177)

where g_s = g_s(s(r)) and S = S(r). The volume of a finite region A of S^p is computed via

    V(A) = \int_A dv_s(r) = \int_A dr^1 \wedge \cdots \wedge dr^p \sqrt{\det( S^t g_s S )} .        (2.178)

Note that by comparing equations 2.175 and 2.177 we obtain the ratio of the volumes,

    \frac{dv_s}{dv_r} = \frac{\sqrt{\det( S^t g_s S )}}{\sqrt{\det g_r}} .        (2.179)

We have seen that bla, bla, bla, and we have the same coordinates over the two spaces, and bla, bla, bla, and to a common capacity element dr^1 \wedge \cdots \wedge dr^p correspond the two volume elements

    dv_r(r) = \sqrt{\det g_r}\; dr^1 \wedge \cdots \wedge dr^p ;  dv_s(r) = \sqrt{\det( S^t g_s S )}\; dr^1 \wedge \cdots \wedge dr^p .        (2.180)

When there are volumetric probabilities f_r(r) and f_s(r), they are defined so as to have

    dP_r = f_r\, dv_r ;  dP_s = f_s\, dv_s .        (2.181)

We say that the volumetric probability f_s has been 'transported' from f_r if the two probabilities associated to the two volumes defined by the common capacity element dr^1 \wedge \cdots \wedge dr^p are identical, i.e., if dP_r = dP_s. It follows the relation f_s = (dv_r / dv_s)\, f_r, i.e.,

    f_s = \frac{\sqrt{\det g_r}}{\sqrt{\det( S^t g_s S )}}\, f_r ,        (2.182)

or, more explicitly,

    f_s(r) = \frac{\sqrt{\det g_r(r)}}{\sqrt{\det( S^t(r)\, g_s(σ(r))\, S(r) )}}\, f_r(r) .        (2.183)

The matrix S of partial derivatives has dimension (q × p), and unless p = q, it is not a square matrix. This implies that, in general, \det( S^t g_s S ) ≠ \det( S^t S )\, \det g_s. While the probability of a domain A of R^p is to be evaluated as

    P_r(A) = \int_{r \in A} dr^1 \wedge \cdots \wedge dr^p \sqrt{\det g_r}\; f_r(r) ,        (2.184)

the probability of its image s(A) (which, by definition, is identical to the probability of A) is to be evaluated as

    P_s(s(A)) = P_r(A) = \int_{r \in A} dr^1 \wedge \cdots \wedge dr^p \sqrt{\det( S^t g_s S )}\; f_s(r) .
        (2.185)

Of course, one could introduce the probability densities

    \bar f_r(r) = \sqrt{\det g_r}\; f_r(r) ;  \bar f_s(r) = \sqrt{\det( S^t g_s S )}\; f_s(r) .        (2.186)

Using them, the integrations 2.184–2.185 would formally simplify into

    P_r = \int_{r \in A} dr^1 \wedge \cdots \wedge dr^p\, \bar f_r(r) ;  P_s = \int_{r \in A} dr^1 \wedge \cdots \wedge dr^p\, \bar f_s(r) ,        (2.187)

while the relation 2.183 would trivialize into

    \bar f_s(r) = \bar f_r(r) .        (2.188)

There is no harm in using equations 2.187–2.188 in analytical developments (I have already mentioned that for numerical integrations it is much better to use volume elements rather than capacity elements, and volumetric probabilities rather than probability densities), provided one remembers that the volume of a domain A of R^p is to be evaluated as

    V(A) = \int_{r \in A} dr^1 \wedge \cdots \wedge dr^p \sqrt{\det g_r} ,        (2.189)

while the volume of its image s(A) (of course, different from that of A) is to be evaluated as

    V(s(A)) = \int_{r \in A} dr^1 \wedge \cdots \wedge dr^p \sqrt{\det( S^t g_s S )} .        (2.190)

2.6.0.3 Case p ≥ q

Let us now consider the case p ≥ q, i.e., when the 'starting space' has larger (or equal) dimension than the 'arrival space'. Let us begin by choosing over R^p a new system of coordinates specially adapted to the problem. Remember that we are using Latin indices for the coordinates s^i, where 1 ≤ i ≤ q, and Greek indices for the coordinates r^α, where 1 ≤ α ≤ p. We pass from the p coordinates r to the new p coordinates

    s^i = s^i(r)  (1 ≤ i ≤ q) ;  t^A = t^A(r)  (q+1 ≤ A ≤ p) ,        (2.191)

Figure 2.16: Transporting a volume element from a p-dimensional space R^p into a q-dimensional space S^q, via an expression s = s(r). Left: 1 = p < q = 2; in this case, we start with a p-dimensional volume in R^p and arrive at S^q with a volume of the same dimension (equations 2.175 and 2.177). Right: 2 = p > q = 1; in this case we start with a p-dimensional volume in R^p but arrive at S^q with a q-dimensional volume, i.e., a volume of lower dimension (equations 2.176 and ??).
Figure 2.17: Detail of figure 2.16, showing a domain of R^p that maps into a single point of S^q.

where the functions s^i are the same as those appearing in equation 2.173 (i.e., the q coordinates s of S^q are used as q of the p coordinates of R^p), and where the functions t^A are arbitrary (one could, for instance, choose t^A = r^A, for q+1 ≤ A ≤ p). It may well happen that the coordinates {s, t} are only regular inside distinct regions of R^p. Let us work inside one such region, letting the ad-hoc management of the more general situation be just suggested in figure 2.19.

We need to express the metric tensor in the new coordinates, and, for this, we must introduce the (Jacobian) matrix K of partial derivatives

    K = \begin{pmatrix} S^i{}_β \\ T^A{}_β \end{pmatrix} = \begin{pmatrix} \partial s^i / \partial r^β \\ \partial t^A / \partial r^β \end{pmatrix} ,        (2.192)

and its inverse

    L = K^{-1} .        (2.193)

Using L, the matrix representing the metric tensor of the space R^p in the new coordinates is (see, for instance, equation 1.23)

    G = L^t g_r L ,        (2.194)

while, in terms of the matrix K, equivalently,

    G^{-1} = K\, g_r^{-1}\, K^t .        (2.195)

Note: I have to say here that, as the matrices K and L are invertible,

    \sqrt{\det G} = L \sqrt{\det g_r} = \frac{\sqrt{\det g_r}}{K} ,        (2.196)

where

    L = \sqrt{\det( L^t L )} ;  K = \sqrt{\det( K K^t )} .        (2.197)

[Note: Emphasize here that when only the determinant of the metric appears, and not the full metric, this means that we only need a volume element over the space, not a distance element. Important to solve the problem of 'relative weights'.]

The definition of volumetric probability that we have used makes it an invariant. The relation between a volumetric probability f_r(r), expressed in the coordinates r, and the equivalent volumetric probability f(s,t), expressed in the coordinates {s, t}, is, simply,

    f_r(r) = f(s(r), t(r)) .        (2.198)

In the coordinates r, the probability of a region of R^p is computed as

    P_r = \int dr^1 \wedge \cdots \wedge dr^p \sqrt{\det g_r}\; f_r(r) .
        (2.199)

In the coordinates {s, t}, it is computed using the equation

    P_r = \int ds^1 \wedge \cdots \wedge ds^q \int dt^{q+1} \wedge \cdots \wedge dt^p \sqrt{\det G}\; f(s,t) ,        (2.200)

where G is given by equation 2.194. While this expression defines the probability of an arbitrary region of R^p, the expression

    P_s = \int ds^1 \wedge \cdots \wedge ds^q \int_{\text{all } t} dt^{q+1} \wedge \cdots \wedge dt^p \sqrt{\det G}\; f(s,t) ,        (2.201)

where the first integral is taken over an arbitrary domain of the coordinates s, but the second integral is now taken over all possible values of the coordinates t, corresponds to the probability of a region of S^q (as the coordinates s are not only some of the coordinates of R^p, but are also the coordinates over S^q). As a volumetric probability f_s(s) over S^q is to be integrated via

    P_s = \int ds^1 \wedge \cdots \wedge ds^q \sqrt{\det g_s}\; f_s(s) ,        (2.202)

then, by comparison with equation 2.201, we deduce that the expression representing the volumetric probability we wished to characterize is

    f_s(s) = \frac{1}{\sqrt{\det g_s}} \int_{\text{all } t} dt^{q+1} \wedge \cdots \wedge dt^p \sqrt{\det G}\; f(s,t) ,        (2.203)

and our problem is, essentially, solved. Note that, here, the volumetric probability appears with the variables s and t, while the original volumetric probability was f_r(r). Although the two expressions are linked through equation 2.198, this is not enough to actually have the expression of f(s,t). This requires that we solve the change of variables 2.191, to obtain the relations

    r = r(s,t) ,        (2.204)

so we can write

    f(s,t) = f_r(r(s,t)) .        (2.205)

Explicitly, using equation 2.196, the volumetric probability f_s(s) can be written

    f_s(s) = \frac{1}{\sqrt{\det g_s}} \int_{\text{all } t} dt^{q+1} \wedge \cdots \wedge dt^p \left. \frac{\sqrt{\det g_r(r)}}{K(r)}\; f_r(r) \right|_{r = r(s,t)} .        (2.206)

As the probability densities associated to the volumetric probabilities f_s and f_r are

    \bar f_s = \sqrt{\det g_s}\; f_s ;  \bar f_r = \sqrt{\det g_r}\; f_r ,        (2.207)

equation 2.206 can also be written

    \bar f_s(s) = \int_{\text{all } t} dt^{q+1} \wedge \cdots \wedge dt^p \frac{1}{K(r(s,t))}\; \bar f_r(r(s,t)) ,        (2.208)

an expression that is independent of the metrics in the spaces R^p and S^q.
Figure 2.18: Lines that map into a same value of s. Two different choices for the variables t.

Figure 2.19: Consider that we have a mapping from the Euclidean plane, with polar coordinates r = {ρ, φ}, into a one-dimensional space with a metric coordinate s (in this illustration, s = s(ρ, φ) = sin ρ / ρ). When transporting a probability from the plane into the 'vertical axis', for a given value of s = s_0 we have, first, to obtain the set of discrete values ρ_n giving the same s_0, and, for each of these values, we have to perform the integration for −π < φ ≤ +π corresponding to that indicated in equations ??–2.206.

Example 2.11 A one-dimensional material medium with an initial length X is deformed into a second state, where its length is Y. The strain that has affected the medium, denoted ε, is defined as

    ε = \log \frac{Y}{X} .        (2.209)

A measurement of X and Y provides the information represented by a volumetric probability f_r(Y, X). This induces an information on the actual value of the strain, that shall be represented by a volumetric probability f_s(ε). The problem is to express f_s(ε) using as 'inputs' the definition 2.209 and the volumetric probability f_r(Y, X).

Let us introduce the two-dimensional 'data' space R^2, over which the quantities X and Y are coordinates. The lengths X and Y being Jeffreys quantities (see discussion in section XXX), we have, in the space R^2, the distance element ds_r^2 = (dY/Y)^2 + (dX/X)^2, associated to the metric matrix

    g_r = \begin{pmatrix} 1/Y^2 & 0 \\ 0 & 1/X^2 \end{pmatrix} .        (2.210)

This, in particular, gives

    \det g_r = \frac{1}{Y^2 X^2} ,        (2.211)

so the (2D) volume element over R^2 is dv_r = \frac{dY \wedge dX}{Y X}, and any volumetric probability f_r(Y, X) over R^2 is to be integrated via

    P_r = \int \frac{dY \wedge dX}{Y X}\; f_r(Y, X) ,        (2.212)

over the appropriate bounds. In particular, a volumetric probability f_r(Y, X) is normalized if the integral over (0 < Y < ∞ ; 0 < X < ∞) equals one.
Let us also introduce the one-dimensional 'space of deformations' S^1, over which the quantity ε is the chosen coordinate (one could as well choose the exponential of ε, or twice the strain, as coordinate). The strain being an ordinary Cartesian coordinate, we have, in the space of deformations S^1, the distance element ds_s^2 = dε^2, associated to the trivial metric matrix g_s = (1). Therefore,

    \det g_s = 1 .        (2.213)

The (1D) volume element over S^1 is dv_s = dε, and any volumetric probability f_s(ε) over S^1 is to be integrated via

    P_s = \int dε\; f_s(ε) ,        (2.214)

over given bounds. A volumetric probability f_s(ε) is normalized by the condition that the integral over (−∞ < ε < +∞) equals one.

As suggested in the general theory, we must change the coordinates in R^2, using as part of the coordinates those of S^1, i.e., here, using the strain ε. Then, arbitrarily, select X as second coordinate, so we pass in R^2 from the coordinates {Y, X} to the coordinates {ε, X}. The Jacobian matrix defined in equation 2.192 is

    K = \begin{pmatrix} \partial ε / \partial Y & \partial ε / \partial X \\ \partial X / \partial Y & \partial X / \partial X \end{pmatrix} = \begin{pmatrix} 1/Y & -1/X \\ 0 & 1 \end{pmatrix} ,        (2.215)

and we obtain, using the metric 2.210,

    \sqrt{\det( K\, g_r^{-1}\, K^t )} = X .        (2.216)

Noting that the expression 2.209 can trivially be solved for Y as

    Y = X \exp ε ,        (2.217)

everything is ready now to attack the problem. If a measurement of X and Y has produced the information represented by the volumetric probability f_r(Y, X), this transports into a volumetric probability f_s(ε) that is given by equation 2.206. Using the particular expressions 2.213, 2.216 and 2.217, this gives

    f_s(ε) = \int_0^∞ dX\; \frac{1}{X}\; f_r( X \exp ε , X ) .        (2.218)

[End of example.]
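Equation 2.218 can be checked by sampling. With lognormal input volumetric probabilities (the situation of Example 2.12 below), samples of ε = log(Y/X) must be normally distributed, centered at log(Y_0/X_0), with variance s_X² + s_Y². A sketch with made-up measurement values:

```python
import numpy as np

rng = np.random.default_rng(2)
X0, Y0, sX, sY = 2.0, 3.0, 0.1, 0.2     # hypothetical measurement results

n = 1_000_000
X = X0 * np.exp(sX * rng.standard_normal(n))   # lognormal samples, median X0
Y = Y0 * np.exp(sY * rng.standard_normal(n))   # lognormal samples, median Y0
eps = np.log(Y / X)                            # strain, equation 2.209

print(eps.mean())   # ≈ log(Y0/X0) ≈ 0.405
print(eps.var())    # ≈ sX**2 + sY**2 = 0.05
```

Because the Jeffreys metric makes log X and log Y the natural Cartesian variables, the transported distribution is exactly Gaussian in ε, which is what the sampled moments confirm.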
Example 2.12 In the context of the previous example, assume that the measurement of the two lengths X and Y has provided an information on their actual values that (i) has independent uncertainties and (ii) is Gaussian (which, as indicated in section 2.8.4, means that the dependence of the volumetric probability on the Jeffreys quantities X and Y is expressed by the lognormal function). Then we have

    f_X(X) = \frac{1}{\sqrt{2\pi}\, s_X} \exp\left( -\frac{1}{2 s_X^2} \left( \log \frac{X}{X_0} \right)^2 \right) ,        (2.219)

    f_Y(Y) = \frac{1}{\sqrt{2\pi}\, s_Y} \exp\left( -\frac{1}{2 s_Y^2} \left( \log \frac{Y}{Y_0} \right)^2 \right) ,        (2.220)

and

    f_r(Y, X) = f_Y(Y)\, f_X(X) .        (2.221)

The volumetric probability for X is centered at point X_0, with standard deviation s_X, and the volumetric probability for Y is centered at point Y_0, with standard deviation s_Y (see section 2.7 for a precise, invariant, definition of standard deviation). In this simple example, the integration in equation 2.218 can be performed analytically, and one obtains a Gaussian probability distribution for the strain, represented by the normal function

    f_s(ε) = \frac{1}{\sqrt{2\pi}\, s_ε} \exp\left( -\frac{(ε - ε_0)^2}{2 s_ε^2} \right) ,        (2.222)

where ε_0, the center of the probability distribution for the strain, equals the logarithm of the ratio of the centers of the probability distributions for the lengths,

    ε_0 = \log \frac{Y_0}{X_0} ,        (2.223)

and where s_ε^2, the variance of the probability distribution for the strain, equals the sum of the variances of the probability distributions for the lengths,

    s_ε^2 = s_X^2 + s_Y^2 .        (2.224)

[End of example.]

2.6.0.4 Case p = q

The two cases examined above, p ≤ q and p ≥ q, both contain the case p = q, but let us, to avoid possible misunderstandings, treat the case explicitly here. In the case p ≤ q, we have chosen to use over the subspace S^p, image of R^p through s = σ(r), the image of the coordinates r of R^p, and, in these coordinates, we have found the expression 2.182,

    f_s(r) = \frac{\sqrt{\det g_r}}{\sqrt{\det( S^t g_s S )}}\, f_r(r) ,        (2.225)

that is directly valid here.
As the matrix S is a square matrix, we could further write

    f_s(r) = \frac{\sqrt{\det g_r}}{\sqrt{\det g_s}}\, \frac{1}{S}\, f_r(r) ,        (2.226)

where S = \det S. In the case p ≥ q, we have used over S^q its own coordinates. Expression 2.206 drastically simplifies when p = q (there are no variables t), to give

    f_s(s) = \frac{1}{\sqrt{\det g_s}}\, \frac{1}{\sqrt{\det( S\, g_r^{-1}\, S^t )}}\, f_r(r(s)) ,        (2.227)

or, as the matrix S is now square,

    f_s(s) = \frac{\sqrt{\det g_r}}{\sqrt{\det g_s}}\, \frac{1}{S}\, f_r(r(s)) .        (2.228)

This is, of course, the same expression as that in 2.226: we know that volumetric probabilities are invariant, and have the same value, at a given point, irrespective of the coordinates being used.

Note that the expression s = σ(r) is not defining a change of variables inside a given space: we have two different spaces, a space R^p with coordinates r and a metric matrix g_r, and a space S^q with coordinates s and a metric matrix g_s. These two metrics are totally independent, and the application s = σ(r) is mapping points from R^p into S^q. If we were contemplating a change of variables inside a given space, then the metric matrices, instead of being independently given, would be related in the usual way tensors relate under a change of variables,

    (g_r)_{αβ} = \frac{\partial s^i}{\partial r^α} (g_s)_{ij} \frac{\partial s^j}{\partial r^β} , i.e., for short,

    g_r = S^t g_s S  (if we were considering a change of variables) .        (2.229)

In particular, then, \sqrt{\det g_r} = \sqrt{\det( S^t g_s S )}, i.e., as the matrix S is (p × p),

    \sqrt{\det g_r} = S \sqrt{\det g_s}  (if we were considering a change of variables) ,        (2.230)

where S is the Jacobian determinant, S = \det S. Then, the two equations 2.226–2.228 would simply give

    f_s(s) = f_r(r) ,        (2.231)

expressing the invariance of a volumetric probability under a change of variables (equation 2.65). Of course, we are not considering this situation: equations 2.226–2.228 represent a transport of a probability distribution between two spaces, not a change of variables inside a given space.

2.6.0.5 Transportation into the manifold s = s(r) itself
Note: say here that we use in the space X^{p+q} the 'induced metric'. We have seen that bla, bla, bla, and we have the same coordinates over the two spaces, and bla, bla, bla, and to a common capacity element dr^1 ∧ · · · ∧ dr^p correspond the two volume elements

    dv_r(r) = \sqrt{\det g_r}\; dr^1 \wedge \cdots \wedge dr^p
    dv_x(r) = \sqrt{\det( g_r + S^t g_s S )}\; dr^1 \wedge \cdots \wedge dr^p .    (2.232)

When there are volumetric probabilities f_r(r) and f_x(r) , they are defined so as to have

    dP_r = f_r\, dv_r ; dP_x = f_x\, dv_x .    (2.233)

We say that the volumetric probability f_x has been 'transported' from f_r if the two probabilities associated to the two volumes defined by the common capacity element dr^1 ∧ · · · ∧ dr^p are identical, i.e., if

    dP_r = dP_x .    (2.234)

It follows the relation

    f_x\, dv_x = f_r\, dv_r ,    (2.235)

i.e.,

    f_x \sqrt{\det( g_r + S^t g_s S )} = f_r \sqrt{\det g_r} .    (2.236)

2.7 Central Estimators and Dispersion Estimators

2.7.1 Introduction

Let X be an n-dimensional manifold, and let P, Q, . . . represent points of X . The manifold is assumed to have a metric defined over it, i.e., the distance between any two points P and Q is defined, and denoted D(Q, P) . Of course, D(Q, P) = D(P, Q) . A normalized probability distribution P is defined over X , represented by the volumetric probability f . The probability of A ⊂ X is obtained, using the notations of equation 2.49, as

    P(A) = \int_{P \in A} dV(P)\, f(P) .    (2.237)

If \psi(P) is a scalar (invariant) function defined over X , its average value is denoted \langle \psi \rangle , and is defined as

    \langle \psi \rangle = \int_{P \in X} dV(P)\, f(P)\, \psi(P) .    (2.238)

This clearly corresponds to the intuitive notion of 'average'.

2.7.2 Center and Radius of a Probability Distribution

Let p be a real number in the range 1 ≤ p < ∞ . To any point P we can associate the quantity (having the dimension of a length)

    \sigma_p(P) = \left( \int_{Q \in X} dV(Q)\, f(Q)\, D(Q, P)^p \right)^{1/p} .    (2.239)

Definition 2.1 The point^9 where \sigma_p(P) attains its minimum value is called the L_p-norm center of the probability distribution f(P) , and it is denoted P_p .
Definition 2.2 The minimum value of \sigma_p(P) is called the L_p-norm radius of the probability distribution f(P) , and it is denoted \sigma_p .

The interpretation of these definitions is simple. Take, for instance, p = 1 . Comparing the two equations 2.238–2.239, we see that, for a fixed point P , the quantity \sigma_1(P) corresponds to the average of the distances from the point P to all the points. The point P that minimizes this average distance is 'at the center' of the distribution (in the L_1-norm sense). For p = 2 , it is the average of the squared distances that is minimized, etc. The following terminology shall be used:

• P_1 is called the median , and \sigma_1 is called the mean deviation ;
• P_2 is called the barycenter (or the center , or the mean ), and \sigma_2 is called the standard deviation (while its square is called the variance );
• P_∞ is called^{10} the circumcenter , and \sigma_∞ is called the circumradius .

[Footnote 9: If there is more than one point where \sigma_p(P) attains its minimum value, any such point is called a center (in the L_p-norm sense) of the probability distribution f(P) .]

Calling P_∞ and \sigma_∞ respectively the 'circumcenter' and the 'circumradius' seems justified when considering, in the Euclidean plane, a volumetric probability that is constant inside a triangle, and zero outside. The 'circumcenter' of the probability distribution is then the circumcenter of the triangle, in the usual geometrical sense, and the 'circumradius' of the probability distribution is the radius of the circumscribed circle^{11} . More generally, the circumcenter of a probability distribution is always at the point that minimizes the maximum distance to all other points, and the circumradius of the probability distribution is this 'minimax' distance.

Example 2.13 Consider a one-dimensional space N , with a coordinate ν , such that the distance between the point ν_1 and the point ν_2 is

    D(\nu_2, \nu_1) = \left| \log\frac{\nu_2}{\nu_1} \right| .
(2.240)

As suggested in XXX, the space N could be the space of musical notes, and ν the frequency of a note. Then, this distance is just (up to a multiplicative factor) the usual distance between notes, as given by the number of 'octaves'. Consider a normalized volumetric probability f(ν) , and let us be interested in the L_2-norm criteria. For p = 2 , equation 2.239 can be written

    \sigma_2(\mu)^2 = \int_0^\infty ds(\nu)\, f(\nu) \left( \log\frac{\nu}{\mu} \right)^2 .    (2.241)

The L_2-norm center of the probability distribution, i.e., the value ν_2 at which \sigma_2(\mu) is minimum, is easily found^{12} to be

    \nu_2 = \nu_0 \exp\left( \int_0^\infty ds(\nu)\, f(\nu) \log\frac{\nu}{\nu_0} \right) ,    (2.242)

where ν_0 is an arbitrary constant (in fact, and by virtue of the properties of the log-exp functions, the value ν_2 is independent of this constant). This mean value ν_2 corresponds to what in statistical theory is called the 'geometric mean'. The variance of the distribution, i.e., the value of the expression 2.241 at its minimum, is

    \sigma_2^2 = \int_0^\infty ds(\nu)\, f(\nu) \left( \log\frac{\nu}{\nu_2} \right)^2 .    (2.243)

The distance element associated to the distance in equation 2.240 is, clearly, ds(ν) = dν/ν , and the probability density associated to f(ν) is \bar f(\nu) = f(\nu)/\nu , so, in terms of the probability density \bar f(\nu) , equation 2.242 becomes

    \nu_2 = \nu_0 \exp\left( \int_0^\infty d\nu\, \bar f(\nu) \log\frac{\nu}{\nu_0} \right) .    (2.244)

[Footnote 10: The L_∞-norm center and radius are defined as the limit p → ∞ of the L_p-norm center and radius.]

[Footnote 11: The circumscribed circle is the circle that contains the three vertices of the triangle. Its center (called circumcenter) is at the point where the perpendicular bisectors of the sides cross.]

[Footnote 12: As the minimization of the function \sigma_2(\mu) is equivalent to the minimization of \sigma_2(\mu)^2 , this gives the condition \int ds(\nu)\, f(\nu)\, \log(\nu/\mu) = 0 . For any constant ν_0 , this is equivalent to \int ds(\nu)\, f(\nu)\, ( \log(\nu/\nu_0) - \log(\mu/\nu_0) ) = 0 , i.e., \log(\mu/\nu_0) = \int ds(\nu)\, f(\nu)\, \log(\nu/\nu_0) , from where the result follows.]
[Footnote: The constant ν_0 is necessary in these equations for reasons of physical dimensions (only the logarithm of adimensional quantities is defined).]

while equation 2.243 becomes

    \sigma_2^2 = \int_0^\infty d\nu\, \bar f(\nu) \left( \log\frac{\nu}{\nu_2} \right)^2 .    (2.245)

The reader shall easily verify that if, instead of the variable ν , one chooses to use the logarithmic variable ν^* = \log(\nu/\nu_0) , where ν_0 is an arbitrary constant (perhaps the same as above), then, instead of the six expressions 2.240–2.245, we would have obtained, respectively,

    s(\nu^*_2, \nu^*_1) = | \nu^*_2 - \nu^*_1 |    (2.246)

    \sigma_2(\mu^*)^2 = \int_{-\infty}^{+\infty} ds(\nu^*)\, f(\nu^*)\, (\nu^* - \mu^*)^2

    \nu^*_2 = \int_{-\infty}^{+\infty} ds(\nu^*)\, f(\nu^*)\, \nu^*

    \sigma_2^2 = \int_{-\infty}^{+\infty} ds(\nu^*)\, f(\nu^*)\, (\nu^* - \nu^*_2)^2

    \nu^*_2 = \int_{-\infty}^{+\infty} d\nu^*\, \bar f(\nu^*)\, \nu^*    (2.247)

and

    \sigma_2^2 = \int_{-\infty}^{+\infty} d\nu^*\, \bar f(\nu^*)\, (\nu^* - \nu^*_2)^2 ,    (2.248)

with, for this logarithmic variable, ds(ν^*) = dν^* and \bar f(\nu^*) = f(\nu^*) . The two last expressions are the ordinary equations used to define the mean and the variance in elementary texts. [End of example.]

Example 2.14 Consider a one-dimensional space, with a coordinate χ , the distance between two points χ_1 and χ_2 being denoted D(χ_2, χ_1) . Then, the associated length element is d\ell(\chi) = D(\chi + d\chi, \chi) . Finally, consider a (1D) volumetric probability f(χ) , and let us be interested in the L_1-norm case. Assume that χ runs from a minimum value χ_min to a maximum value χ_max (both could be infinite). For p = 1 , equation 2.239 can be written

    \sigma_1(\chi) = \int d\ell(\chi')\, f(\chi')\, D(\chi', \chi) .
(2.249)

Denoting by χ_1 the median, i.e., the point where \sigma_1(\chi) is minimum, one easily^{13} finds that χ_1 is characterized by the property that it separates the line into two regions of equal probability, i.e.,

    \int_{\chi_{\min}}^{\chi_1} d\ell(\chi)\, f(\chi) = \int_{\chi_1}^{\chi_{\max}} d\ell(\chi)\, f(\chi) ,    (2.250)

an expression that can readily be used for an actual computation of the median, and which corresponds to its elementary definition. The mean deviation is then given by

    \sigma_1 = \int_{\chi_{\min}}^{\chi_{\max}} d\ell(\chi)\, f(\chi)\, D(\chi, \chi_1) .    (2.251)

[Footnote 13: In fact, the property 2.250 of the median being intrinsic (independent of any coordinate system), we can limit ourselves to demonstrating it using a special 'Cartesian' coordinate, where d\ell(x) = dx and D(x_1, x_2) = |x_2 - x_1| , where the property is easy to demonstrate (and well known).]

[End of example.]

Example 2.15 Consider the same situation as in the previous example, but let us now be interested in the L_∞-norm case. Let χ_min and χ_max be the minimum and the maximum values of χ for which f(χ) ≠ 0 . It can be shown that the circumcenter of the probability distribution is the point χ_∞ that separates the interval {χ_min , χ_max} into two intervals of equal length, i.e., satisfying the condition

    D(\chi_\infty, \chi_{\min}) = D(\chi_{\max}, \chi_\infty) ,    (2.252)

and that the circumradius is

    \sigma_\infty = \frac{D(\chi_{\max}, \chi_{\min})}{2} .    (2.253)

[End of example.]

Example 2.16 Consider, in the Euclidean n-dimensional space E_n , with Cartesian coordinates x = \{x^1, \ldots, x^n\} , a normalized volumetric probability f(x) , and let us be interested in the L_2-norm case. For p = 2 , equation 2.239 can be written, using obvious notations,

    \sigma_2(y)^2 = \int dx\, f(x)\, \| x - y \|^2 .    (2.254)

Let x_2 denote the mean of the probability distribution, i.e., the point where \sigma_2(y) is minimum (or, equivalently, where \sigma_2(y)^2 is minimum). The condition of minimum (the vanishing of the derivatives) gives \int dx\, f(x)\, (x - x_2) = 0 , i.e.,

    x_2 = \int dx\, f(x)\, x ,    (2.255)

which is an elementary definition of the mean.
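The estimators of examples 2.13–2.16 can be illustrated numerically. The sketch below is mine, not from the text: a discrete 1D Euclidean approximation with an arbitrary test density (a truncated exponential), showing that the minimizer of \sigma_2 sits at the mean while the minimizer of \sigma_1 sits at the median, where the cumulative probability reaches one half ( \log 2 for \exp(-x) ).

```python
import math

# Discrete 1D Euclidean illustration of Definitions 2.1-2.2 (a sketch;
# the truncated-exponential test density is an arbitrary choice of mine).
xs = [0.01 * i for i in range(1000)]            # grid on [0, 10)
w = [math.exp(-x) for x in xs]
total = sum(w)
f = [v / total for v in w]                      # normalized weights

mean = sum(x * p for x, p in zip(xs, f))

def sigma(p, y):
    # discrete version of equation 2.239
    return sum(q * abs(x - y) ** p for x, q in zip(xs, f)) ** (1.0 / p)

# The L2-norm center is the grid point minimizing sigma_2 (the mean);
# the L1-norm center is the grid point minimizing sigma_1 (the median).
l2_center = min(xs, key=lambda y: sigma(2, y))
l1_center = min(xs, key=lambda y: sigma(1, y))

print(l2_center, mean)            # L2 center: grid point nearest the mean
print(l1_center, math.log(2.0))   # L1 center: near the median, log 2
```

The L2 minimizer lands on the grid point nearest the discrete mean (since \sigma_2(y)^2 equals the variance plus (y - \text{mean})^2 ), and the L1 minimizer lands where the cumulative weight crosses one half, as equation 2.250 states.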
The variance of the probability distribution is then

    (\sigma_2)^2 = \int dx\, f(x)\, \| x - x_2 \|^2 .    (2.256)

In the context of this example, we can define the covariance tensor

    C = \int dx\, f(x)\, ( x - x_2 ) \otimes ( x - x_2 ) .    (2.257)

Note that equation 2.255 and equation 2.257 can be written, using indices, as

    x_2^i = \int dx^1 \wedge \cdots \wedge dx^n\, f(x^1, \ldots, x^n)\, x^i    (2.258)

and

    C^{ij} = \int dx^1 \wedge \cdots \wedge dx^n\, f(x^1, \ldots, x^n)\, ( x^i - x_2^i )( x^j - x_2^j ) .    (2.259)

[End of example.]

2.8 Appendixes

2.8.1 Appendix: Conditional Probability Density

Note to the reader: this section can be skipped, unless one is particularly interested in probability densities. In view of equation 2.100, the conditional probability density (over the submanifold X_p ) is to be defined as

    \bar f_p(r) = \bar g_p(r)\, f_p(r) ,    (2.260)

i.e.,

    \bar f_p(r) = \eta_r \sqrt{\det g_p(r)}\, f_p(r) ,    (2.261)

so the probability of a region A_p of the submanifold is given by

    P(A_p) = \int_{r \in X_p} d\bar v_p(r)\, \bar f_p(r) ,    (2.262)

where d\bar v_p(r) = dr^1 \wedge \cdots \wedge dr^p . We must now express \bar f_p(r) in terms of f(r, s) . First, from equations 2.95 and 2.261 we obtain

    \bar f_p(r) = \eta_r \sqrt{\det g_p(r)}\, \frac{f(r, s(r))}{\int_{r \in X_p} dv_p(r)\, f(r, s(r))} .    (2.263)

As f(r, s) = \bar f(r, s) / ( \eta \sqrt{\det g} ) (equation 2.54),

    \bar f_p(r) = \eta_r \sqrt{\det g_p(r)}\, \frac{\bar f(r, s(r)) / \sqrt{\det g}}{\int_{r \in X_p} dv_p(r)\, \bar f(r, s(r)) / \sqrt{\det g}} .    (2.264)

Finally, using 2.97, and expliciting g_p(r) ,

    \bar f_p(r) = \frac{ \dfrac{\sqrt{\det( g_{rr} + g_{rs} S + S^t g_{sr} + S^t g_{ss} S )}}{\sqrt{\det g}}\, \bar f(r, s(r)) }{ \displaystyle\int_{r \in X_p} dr^1 \wedge \cdots \wedge dr^p\, \frac{\sqrt{\det( g_{rr} + g_{rs} S + S^t g_{sr} + S^t g_{ss} S )}}{\sqrt{\det g}}\, \bar f(r, s(r)) } .    (2.265)

Again, it is understood here that all the 'matrices' are taken at the point ( r, s(r) ) . This expression does not coincide with the conditional probability density given in usual texts (even when the manifold is defined by the condition s = s_0 = const. ). This is because we contemplate here the 'metric' or 'orthogonal' limit to the manifold (in the sense of figure 2.9), while usual texts just consider the 'vertical limit'.
Of course, I take this approach here because I think it is essential for consistent applications of the notion of conditional probability. The best known expression of this problem is the so-called 'Borel paradox', that we analyze in section 2.8.10.

Example 2.17 If we face the case where the space X is the Cartesian product of two spaces R × S , with g_{rs} = g_{sr} = 0 , g_{rr} = g_r(r) and g_{ss} = g_s(s) , then \det g(r, s) = \det g_r(r)\, \det g_s(s) , and the conditional probability density of equation 2.265 becomes

    \bar f_p(r) = \frac{ \dfrac{\sqrt{\det( g_r(r) + S^t(r)\, g_s(s(r))\, S(r) )}}{\sqrt{\det g_r(r)} \sqrt{\det g_s(s(r))}}\, \bar f(r, s(r)) }{ \displaystyle\int_{r \in X_p} dr^1 \wedge \cdots \wedge dr^p\, \frac{\sqrt{\det( g_r(r) + S^t(r)\, g_s(s(r))\, S(r) )}}{\sqrt{\det g_r(r)} \sqrt{\det g_s(s(r))}}\, \bar f(r, s(r)) } .    (2.266)

[End of example.]

Example 2.18 If, in addition to the conditions of the previous example, the hypersurface is defined by a constant value of s , say s = s_0 , then the probability density becomes

    \bar f_p(r) = \frac{\bar f(r, s_0)}{\int_{r \in X_p} dr^1 \wedge \cdots \wedge dr^p\, \bar f(r, s_0)} .    (2.267)

[End of example.]

Example 2.19 In the situation of the previous example, let us rewrite equation 2.267 dropping the index 0 from s_0 , and use the notations

    \bar f_{r|s}(r|s) = \frac{\bar f(r, s)}{\bar f_s(s)} ; \bar f_s(s) = \int_{r \in X_p} dr^1 \wedge \cdots \wedge dr^p\, \bar f(r, s) .    (2.268)

We could redo all the computations to define the conditional for s , given a fixed value of r , but it is clear by simple analogy that we obtain, in this case,

    \bar f_{s|r}(s|r) = \frac{\bar f(r, s)}{\bar f_r(r)} ; \bar f_r(r) = \int_{s \in X_q} ds^1 \wedge \cdots \wedge ds^q\, \bar f(r, s) .    (2.269)

Solving in these two equations for \bar f(r, s) gives the 'Bayes theorem'

    \bar f_{s|r}(s|r) = \frac{\bar f_{r|s}(r|s)\, \bar f_s(s)}{\bar f_r(r)} .    (2.270)

Note that this theorem is valid only if we work in the Cartesian product of two spaces. In particular, we must have g_{ss}(r, s) = g_s(s) . Working, for instance, at the surface of the sphere with geographical coordinates (r, s) = (ϕ, λ) , this condition is not fulfilled, as g_ϕ = \cos λ is a function of λ : the surface of the sphere is not the Cartesian product of two 1D spaces.
As we shall later see, this enters in the discussion of the so-called 'Borel paradox' (there is no paradox, if we do things properly). [End of example.]

2.8.2 Appendix: Marginal Probability Density

In the context of section 2.5.2, where a manifold X is built through the Cartesian product R × S of two manifolds, and given a 'joint' volumetric probability f(r, s) , the marginal volumetric probability f_r(r) is defined as (see equation 2.160)

    f_r(r) = \int_{s \in S} dv_s(s)\, f(r, s) .    (2.271)

Let us find the equivalent expression using probability densities instead of volumetric probabilities. Here below, following our usual conventions, the notations

    \bar g(r, s) = \sqrt{\det g(r, s)} ; \bar g_r(r) = \sqrt{\det g_r(r)} ; \bar g_s(s) = \sqrt{\det g_s(s)}    (2.272)

are introduced. First, we may use the relation

    f(r, s) = \frac{\bar f(r, s)}{\bar g(r, s)}    (2.273)

linking the volumetric probability f(r, s) and the probability density \bar f(r, s) . Here, g is the metric of the manifold X , that has been assumed to have a partitioned form (equation 2.123). Then, f(r, s) = \bar f(r, s) / ( \bar g_r(r)\, \bar g_s(s) ) , and equation 2.271 becomes

    f_r(r) = \frac{1}{\bar g_r(r)} \int_{s \in S} dv_s(s)\, \frac{\bar f(r, s)}{\bar g_s(s)} .    (2.274)

As the volume element dv_s(s) is related to the capacity element d\bar v_s(s) = ds^1 \wedge ds^2 \wedge \ldots via the relation

    dv_s(s) = \bar g_s(s)\, d\bar v_s(s) ,    (2.275)

we can write

    f_r(r) = \frac{1}{\bar g_r(r)} \int_{s \in S} d\bar v_s(s)\, \bar f(r, s) ,    (2.276)

i.e.,

    \bar g_r(r)\, f_r(r) = \int_{s \in S} d\bar v_s(s)\, \bar f(r, s) .    (2.277)

We recognize, at the left-hand side, the usual definition of a probability density as the product of a volumetric probability by the volume density, so we can introduce the marginal probability density

    \bar f_r(r) = \bar g_r(r)\, f_r(r) .    (2.278)

Then, equation 2.277 becomes

    \bar f_r(r) = \int_{s \in S} d\bar v_s(s)\, \bar f(r, s) ,    (2.279)

an expression that could be taken as a direct definition of the marginal probability density \bar f_r(r) in terms of the 'joint' probability density \bar f(r, s) . Note that this expression is formally identical to 2.271.
This contrasts with the expression of a conditional probability density (equation 2.265), which is formally very different from the expression of a conditional volumetric probability (equation 2.95).

2.8.3 Appendix: Replacement Gymnastics

In an n-dimensional manifold with coordinates x , the volume element dv_x(x) is related to the capacity element d\bar v_x(x) = dx^1 \wedge \cdots \wedge dx^n via the volume density \bar g_x(x) = \sqrt{\det g_x(x)} ,

    dv_x(x) = \bar g_x(x)\, d\bar v_x(x) ,    (2.280)

while the relation between a volumetric probability f_x(x) and the associated probability density \bar f_x(x) is

    \bar f_x(x) = \bar g_x(x)\, f_x(x) .    (2.281)

In a change of variables x → y , the capacity element changes according to

    d\bar v_x(x) = \bar X(y)\, d\bar v_y(y) ,    (2.282)

where the Jacobian determinant \bar X is the determinant of the matrix \{ X^i{}_j \} = \{ \partial x^i / \partial y^j \} , and the probability density changes according to

    \bar f_x(x) = \frac{1}{\bar X(y)}\, \bar f_y(y) .    (2.283)

In the variables y , the relation between a volumetric probability f_y(y) and the associated probability density \bar f_y(y) is

    \bar f_y(y) = \bar g_y(y)\, f_y(y) ,    (2.284)

where \bar g_y(y) = \sqrt{\det g_y(y)} is the volume density in the coordinates y . Finally, the volume element dv_y(y) is related to the capacity element d\bar v_y(y) = dy^1 \wedge \cdots \wedge dy^n through

    dv_y(y) = \bar g_y(y)\, d\bar v_y(y) .    (2.285)

Using these relations in turn, we can obtain the following circle of equivalent equations:

    P(A) = \int_{P \in A} dV(P)\, f(P)
         = \int_{x \in A} dv_x(x)\, f_x(x)
         = \int_{x \in A} d\bar v_x(x)\, \bar g_x(x)\, f_x(x)
         = \int_{x \in A} d\bar v_x(x)\, \bar f_x(x)
         = \int_{y \in A} \bar X(y)\, d\bar v_y(y)\, \frac{1}{\bar X(y)}\, \bar f_y(y)
         = \int_{y \in A} d\bar v_y(y)\, \bar f_y(y)
         = \int_{y \in A} d\bar v_y(y)\, \bar g_y(y)\, f_y(y)
         = \int_{y \in A} dv_y(y)\, f_y(y)
         = \int_{P \in A} dV(P)\, f(P) = P(A) .    (2.286)

Each one of them may be useful in different circumstances. The student should be able to easily move from one equation to the next.
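The circle of equations above can be illustrated by a minimal 1D numerical sketch (the choices are mine, not from the text): a normal probability density in a Cartesian coordinate x , and the change of variables y = \exp(x) . The probability of an event must come out the same whether one integrates the density in x , or the density in y transported with the Jacobian dx/dy = 1/y .

```python
import math

def f_x(x):
    # probability density in the coordinate x (standard normal)
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def f_y(y):
    # transported density: f_y(y) = f_x(x(y)) |dx/dy|, with x = log y
    return f_x(math.log(y)) / y

def integrate(g, a, b, n=20_000):
    # midpoint rule
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

# P(0.5 <= x <= 1.5), computed in both coordinate systems
p_in_x = integrate(f_x, 0.5, 1.5)
p_in_y = integrate(f_y, math.exp(0.5), math.exp(1.5))
print(p_in_x, p_in_y)  # identical up to integration error
```

Both integrals reproduce the same probability, which can also be checked against the closed form obtained from the error function.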
Example 2.20 In the example Cartesian–geographical, the equations above give, respectively (using the index r for the geographical coordinates),

    dv_x(x, y, z) = dx \wedge dy \wedge dz    (2.287)

    \bar f_x(x, y, z) = f_x(x, y, z)    (2.288)

    dx \wedge dy \wedge dz = r^2 \cos\lambda\; dr \wedge d\varphi \wedge d\lambda    (2.289)

    \bar f_x(x, y, z) = \frac{1}{r^2 \cos\lambda}\, \bar f_r(r, \varphi, \lambda)    (2.290)

    \bar f_r(r, \varphi, \lambda) = r^2 \cos\lambda\; f_r(r, \varphi, \lambda)    (2.291)

    dv_r(r, \varphi, \lambda) = r^2 \cos\lambda\; dr \wedge d\varphi \wedge d\lambda ,    (2.292)

to obtain the circle of equations

    P(A) = \int_{P \in A} dV(P)\, f(P)
         = \int_{\{x,y,z\} \in A} dv_x(x, y, z)\, f_x(x, y, z)
         = \int_{\{x,y,z\} \in A} dx \wedge dy \wedge dz\, f_x(x, y, z)
         = \int_{\{x,y,z\} \in A} dx \wedge dy \wedge dz\, \bar f_x(x, y, z)
         = \int_{\{r,\varphi,\lambda\} \in A} r^2 \cos\lambda\; dr \wedge d\varphi \wedge d\lambda\, \frac{1}{r^2 \cos\lambda}\, \bar f_r(r, \varphi, \lambda)
         = \int_{\{r,\varphi,\lambda\} \in A} dr \wedge d\varphi \wedge d\lambda\, \bar f_r(r, \varphi, \lambda)
         = \int_{\{r,\varphi,\lambda\} \in A} dr \wedge d\varphi \wedge d\lambda\, r^2 \cos\lambda\; f_r(r, \varphi, \lambda)
         = \int_{\{r,\varphi,\lambda\} \in A} dv_r(r, \varphi, \lambda)\, f_r(r, \varphi, \lambda)
         = \int_{P \in A} dV(P)\, f(P) = P(A) .    (2.293)

Note that the Cartesian system of coordinates is special: scalar densities, scalar capacities and invariant scalars coincide. [End of example.]

2.8.4 Appendix: The Gaussian Probability Distribution

2.8.4.1 One-Dimensional Spaces

Let X be a one-dimensional metric line with points P, Q . . . , and let s(Q, P) denote the displacement from point P to point Q , the distance or 'length' between the two points being the absolute value of the displacement, L(Q, P) = | s(Q, P) | . Given any particular point P on the line, it is assumed that the line extends to infinite distances from P in the two senses. The one-dimensional Gaussian probability distribution is defined by the volumetric probability

    f(P; P_0, \sigma) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left( -\frac{s^2(P, P_0)}{2 \sigma^2} \right) ,    (2.294)

and it follows from the general definition of volumetric probability that the probability of the interval between any two points P_1 and P_2 is

    P = \int_{P_1}^{P_2} dL(P)\, f(P; P_0, \sigma) ,    (2.295)

where dL denotes the elementary length element.
The following properties are easy to demonstrate:

• the probability of the whole line equals one (i.e., the volumetric probability f(P; P_0, σ) is normalized);
• the mean of f(P; P_0, σ) is the point P_0 ;
• the standard deviation of f(P; P_0, σ) equals σ .

Example 2.21 Consider a coordinate X such that the displacement between two points is s_X(X', X) = \log(X'/X) . Then, the Gaussian distribution 2.294 takes the form

    f_X(X; X_0, \sigma) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left( -\frac{1}{2 \sigma^2} \left( \log\frac{X}{X_0} \right)^2 \right) ,    (2.296)

where X_0 is the mean and σ the standard deviation. As, here, ds(X) = dX/X , the probability of an interval is

    P(X_1 \le X \le X_2) = \int_{X_1}^{X_2} \frac{dX}{X}\, f_X(X; X_0, \sigma) ,    (2.297)

and we have the normalization

    \int_0^\infty \frac{dX}{X}\, f_X(X; X_0, \sigma) = 1 .    (2.298)

This expression of the Gaussian probability distribution, written in terms of the variable X , is called the lognormal law. I suggest that the information on the parameter X represented by the volumetric probability 2.296 should be expressed by a notation like^{14}

    \log\frac{X}{X_0} = \pm\sigma ,    (2.299)

[Footnote 14: Equivalently, one may write X = X_0 \exp(\pm\sigma) , or X = X_0 \overset{\times}{\div} \Sigma , where \Sigma = \exp\sigma .]

that is the exact equivalent of the notation used in equation 2.303 below. Defining the difference δX = X − X_0 , one converts this equation into \pm\sigma = \log( 1 + \delta X / X_0 ) , whose first-order approximation is \delta X / X_0 = \pm\sigma . This shows that σ corresponds to what is usually called the 'relative uncertainty'. I do not recommend this terminology, as, with the definitions used in this book (see section 2.7), σ is the actual standard deviation of the quantity X . [End of example.]

Exercise: write the equivalent of the three expressions 2.296–2.298 using, instead of the variable X , the variables U = 1/X or Y = X^n .

Example 2.22 Consider a coordinate x such that the displacement between two points is s_x(x', x) = x' − x . Then, the Gaussian distribution 2.294 takes the form

    f_x(x; x_0, \sigma) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left( -\frac{(x - x_0)^2}{2 \sigma^2} \right) ,    (2.300)

where x_0 is the mean and σ the standard deviation.
As, here, ds(x) = dx , the probability of an interval is

    P(x_1 \le x \le x_2) = \int_{x_1}^{x_2} dx\, f_x(x; x_0, \sigma) ,    (2.301)

and we have the normalization

    \int_{-\infty}^{+\infty} dx\, f_x(x; x_0, \sigma) = 1 .    (2.302)

This expression of the Gaussian probability distribution, written in terms of the variable x , is called the normal law. The information on the parameter x represented by the volumetric probability 2.300 is commonly expressed by a notation like^{15}

    x = x_0 \pm \sigma .    (2.303)

[Footnote 15: More concise notations are also used. As an example, the expression x = 1 234.567 89 m ± 0.000 11 m (here, 'm' represents the physical unit 'meter') is sometimes written x = ( 1 234.567 89 ± 0.000 11 ) m , or even x = 1 234.567 89(11) m .]

[End of example.]

Example 2.23 It is easy to verify that through the change of variable

    x = \log\frac{X}{K} ,    (2.304)

where K is an arbitrary constant, the equations of example 2.21 become those of example 2.22, and vice versa. In this case, the quantity x has no physical dimensions (this is, of course, a possibility, but not a necessity, for the quantity x in example 2.22). [End of example.]

The Gaussian probability distribution is represented in figure 2.20. Note that there is no need to make different plots for the normal and the lognormal volumetric probabilities. When one is interested in a wide range of values, a logarithmic version of the vertical axis may be necessary (see figure 2.21).

Figure 2.20: A representation of the Gaussian probability distribution, where the example of a temperature T is used. Reading the scale at the top, we associate to each value of the temperature T the value h(T) of a lognormal volumetric probability. Reading the scale at the bottom, we associate to every value of the logarithmic temperature t the value g(t) of a normal volumetric probability.
There is no need to make a special plot where the lognormal volumetric probability h(T) would not be represented 'in a logarithmic axis', as this strongly distorts the beautiful Gaussian bell (see figures 2.22 and 2.23). In the figure represented here, one standard deviation corresponds to one unit of t , so the whole range represented equals 8 σ . (Horizontal scales: T from 10^{-4} K to 10^{4} K at the top; t = \log_{10}(T/T_0) , with T_0 = 1 K , at the bottom.)

Figure 2.21: A representation of the normal volumetric probability using a logarithmic vertical axis (here, a base-10 logarithm of the volumetric probability, relative to its maximum value). While the representation in figure 2.20 is not practical if one is interested in values of t outside the interval with endpoints situated at ±3σ from the center, this representation allows the examination of the statistics concerning as many decades as we may wish. Here, the whole range represented equals more than 20 standard deviations.

Figure 2.22: Left: the lognormal volumetric probability h(X) . Right: the lognormal probability density \bar h(X) . Distributions centered at 1, with standard deviations respectively equal to 0.1, 0.2, 0.4, 0.8, 1.6 and 3.2 .
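The correspondence between the lognormal and normal representations of examples 2.21–2.23 can be checked numerically. The sketch below is mine, not from the text (the values X_0 = 3.0 and σ = 0.4 are arbitrary): the probability the lognormal law assigns to an interval in X equals the probability the normal law assigns to the image interval in x = \log(X/X_0) .

```python
import math

X0, sigma = 3.0, 0.4

def f_X(X):
    # lognormal volumetric probability of equation 2.296
    u = math.log(X / X0)
    return math.exp(-0.5 * (u / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

def prob_lognormal(X1, X2, n=50_000):
    # P(X1 <= X <= X2) = integral of (dX/X) f_X(X), equation 2.297,
    # evaluated with the midpoint rule
    h = (X2 - X1) / n
    return h * sum(f_X(X1 + (i + 0.5) * h) / (X1 + (i + 0.5) * h)
                   for i in range(n))

def prob_normal(x1, x2):
    # closed form for the centered normal law 2.300, via the error function
    s = sigma * math.sqrt(2.0)
    return 0.5 * (math.erf(x2 / s) - math.erf(x1 / s))

p_lognormal = prob_lognormal(2.0, 5.0)
p_normal = prob_normal(math.log(2.0 / X0), math.log(5.0 / X0))
print(p_lognormal, p_normal)  # the two probabilities agree
```

This is the computational content of the remark that no separate plot is needed: the two laws carry the same probability distribution, read in different coordinates.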
The histogram suggested at the left samples the variable homogeneously, but this only means that we are using constant steps of the logarithmic quantity x associated to the positive quantity X . Better, then, to directly use the representation suggested in figure 2.20 or in figure 2.21. We have then a double conclusion: (i) the lognormal probability density (at the right in figures 2.22 and 2.23) does not correspond to any practical histogram; it is generally uninteresting. (ii) the lognormal volumetric probability (at the left in figures 2.22 and 2.23) does correspond to a practical histogram, but is better handled when the associated normal volumetric probability is used instead (figure 2.20 or figure 2.21). In short: lognormal functions should never be used. 0.7 Probability Density 0.35 Volumetric Probability Figure 2.23: A typical Gaussian distribution, with central point 1 and standard deviation 5/4, represented here, using a Jeffreys (positive) quantity, by the lognormal volumetric probability (left) and the lognormal probability density (right). 0.3 0.25 0.2 0.15 0.1 0.05 0 0.6 0.5 0.4 0.3 0.2 0.1 0 0 2 4 6 8 10 0 2 4 6 8 10 Appendixes 2.8.4.2 129 Multi Dimensional Spaces In dimension grater than one, the spaces may have curvature. But the multidimensional Gaussian distribution is useful in flat spaces (i.e., Euclidean spaces) only. Although it is possible to give general expressions for arbitrary coordinate systems, let us simplify the exposition, and assume that we are using rectilinear coordinates. The squared distance between points x1 and x2 is then given by the ‘sum of squares’ D2 (x2 , x1 ) = (x2 − x1 )t g (x2 − x1 ) , (2.305) where the metric tensor g is constant (because the assumption of an Euclidean space with rectlinear coordinates). The volume element is, then, dv (x) = det g dx1 ∧ · · · ∧ dxn , (2.306) √ where, again, det g is a constant. Let f (x) be a volumetric probability over the space. 
By definition, the probability of a region A is

    P(A) = \int_A dv(x)\, f(x) ,    (2.307)

i.e.,

    P(A) = \sqrt{\det g} \int_A dx^1 \wedge \cdots \wedge dx^n\, f(x) .    (2.308)

The multidimensional Gaussian volumetric probability is

    f(x) = \frac{1}{(2\pi)^{n/2}} \frac{\sqrt{\det G}}{\sqrt{\det g}} \exp\left( -\frac{1}{2} ( x - x_0 )^t\, G\, ( x - x_0 ) \right) .    (2.309)

The following properties are slight generalizations of well known results concerning the multidimensional Gaussian:

• f(x) is normed, i.e., \int dv(x)\, f(x) = 1 ;
• the mean of f(x) is x_0 ;
• the covariance matrix of f(x) is^{16} C = G^{-1} .

[Footnote 16: Remember that the general definition of covariance gives here C^{ij} = \int dv(x)\, ( x^i - x_0^i )( x^j - x_0^j )\, f(x) , so this property is not as obvious as it may seem.]

Note that when in a Euclidean space with metric g one defines a Gaussian with covariance C , one may use the inverse of the covariance matrix, G = C^{-1} , as a supplementary metric over the space.

2.8.5 Appendix: The Laplacian Probability Distribution

2.8.5.1 Appendix: One-Dimensional Laplacian

Let X be a metric manifold with points P, Q . . . , and let s(P, Q) = s(Q, P) denote the distance between two points P and Q . The Laplacian probability distribution is represented by the volumetric probability

    f(P) = k \exp\left( -\frac{s(P, Q)}{\sigma} \right) .    (2.310)

[Note: Elaborate this.]

2.8.6 Appendix: Exponential Distribution

Note: I have to verify that the following terminology has been introduced. By s(P, P_0) we denote the geodesic line arriving at P , with origin at P_0 . If the space is 1D, we write s(P, P_0) , and call this a displacement. Then, the distance is D(P, P_0) = | s(P, P_0) | . In an nD Euclidean space, using Cartesian coordinates, we write s(x, x_0) = x − x_0 , and call this the displacement vector.

2.8.6.1 Definition

Consider a one-dimensional space, and denote s(Q, P) the displacement from point P to point Q . The exponential distribution has the (1D) volumetric probability

    f(P; P_0) = \alpha \exp( -\alpha\, s(P, P_0) ) ; \alpha \ge 0 ,    (2.311)

where P_0 is some fixed point.
This volumetric probability is normed via \int ds(P)\, f(P; P_0) = 1 , where the sum concerns the half-interval at the right or at the left of point P_0 , depending on the orientation chosen (see examples 2.24 and 2.25).

Example 2.24 Consider a coordinate X such that the displacement between two points is s_X(X', X) = \log(X'/X) . Then, the exponential distribution 2.311 takes the form f_X(X; X_0) = k \exp( -\alpha \log(X/X_0) ) , i.e.,

    f_X(X) = \alpha \left( \frac{X}{X_0} \right)^{-\alpha} ; \alpha \ge 0 .    (2.312)

As, here, ds(X) = dX/X , the probability of an interval is

    P(X_1 \le X \le X_2) = \int_{X_1}^{X_2} \frac{dX}{X}\, f_X(X) .    (2.313)

The volumetric probability f_X(X) has been normed using \int_{X_0}^{\infty} \frac{dX}{X}\, f_X(X) = 1 . This form of the exponential distribution is usually called the Pareto law. The cumulative probability function is

    g_X(X) = \int_{X_0}^{X} \frac{dX'}{X'}\, f_X(X') = 1 - \left( \frac{X}{X_0} \right)^{-\alpha} .    (2.314)

It is negative for X < X_0 , zero for X = X_0 , and positive for X > X_0 . The power α of the 'power law' 2.312 may be any real number, but in most examples concerning the physical, biological or economical sciences, it is of the form α = p/q , with p and q being small positive integers.^{17}

[Footnote 17: In most problems, the variables seem to be chosen in such a way that α = 2/3 . This is the case for the probability distribution of earthquakes as a function of their energy (Gutenberg–Richter law, see figure 2.25), or for the probability distribution of meteorites hitting the Earth as a function of their volume (see figure 2.28).]

With a variable U = 1/X , equation 2.312 becomes

    f_U(U) = k\, U^{\alpha} ; \alpha \ge 0 .    (2.315)
Then, the exponential distribution 2.311 takes the form fx (x) = α exp (−α (x − x0 )) ; α≥0 . As, here, ds(s) = ds , the probability of an interval is P (x1 ≤ x ≤ x2 ) = fx (x) is normed by (2.317) x2 x1 dxfx (x) , and +∞ dx fx (x) = 1 . (2.318) x0 With a variable u = −x , equation 2.317 becomes fu (u) = α exp (α (u − u0 )) u ; α≥0 , (2.319) 0 and the norming condition is −∞ du fu (u) = 1 . For the plotting of these volumetric probabilities, sometimes a logarithmic ‘vertical axis’ is used, as suggested in figure 2.24. Note that via a logarithmic change of variables x = log(X/K ) (where K is some constant) this example is identical to the example 2.24. The two volumetric probabilities 2.312 and 2.317 represent the same exponential distribution. Note: mention here figure 2.24. Appendixes 133 2 1.5 α= 1 1.5 α= 0 α = 1/4 α = 1/2 α= 1 α= 2 1 0.5 α = 1/2 α = 1/4 1 α= 0 0.5 0 0 0 0.5 1 1.5 2 0 0.5 1 X f f 1.5 α = 1/2 1.5 α = 1/4 α= 0 = 1/4 α = 1/2 α= 1 α= 2 0.5 2 α= 2 α= 1 2 1 1.5 U 2 1 α= 0 0.5 0 0 -1 -0.5 0 0.5 1 -1 x = log X/X0 -0.5 0 log f/f0 1 0.5 α= 0 0 α = 1/4 α = 1/2 -0.5 α= 2 -1 -0.5 0 0.5 x = log X/X0 1 α= 2 α= 1 0.5 α = 1/2 α = 1/4 0 α= 0 -0.5 α= 1 -1 0.5 u = log U/U0 1 log f/f0 Figure 2.24: Plots of exponential distribution for different definitions of the variables. Top: The power functions fX (X ) = 1/X −α , and fU (U ) = 1/U α . Middle: Using logarithmic variables x and u , one has the exponential functions fx (x) = exp(−α x) and fu (u) = exp(α u) . Bottom: the ordinate is also represented using a logarithmic variable, this giving the typical log-log linear functions. α= 2 2 f f -1 1 -1 -0.5 0 0.5 u = log U/U0 1 134 2.8 2.8.6.2 Example: Distribution of Earthquakes The historically first example of power law distribution is the distribution of energies of Earthquakes (the famous Gutenberg-Richter law). An earthquake can be characterized by the seismic energy generated, E , or by the moment corresponding to the dislocation, that I denote here18 M . 
As a rough approximation, the moment is given by the product M = ν ū S, where ν is the elastic shear modulus of the medium, ū is the average displacement between the two sides of the fault, and S is the fault's surface (Aki and Richards, 1980). Figure 2.25 shows the distribution of earthquakes in the Earth. As the same logarithmic base (of 10) has been chosen in both axes, the slope of the line approximating the histogram (which is quite close to −2/3) directly gives the power of the power law (Pareto) distribution. The volumetric probability f(M) representing the distribution of earthquakes in the Earth is

f(M) = k / M^{2/3} , (2.320)

where k is a constant. Kanamori (1977) pointed out that the moment and the seismic energy liberated are roughly proportional: M ≈ 2.0 × 10^4 E (energy and moment have the same physical dimensions). This implies that the volumetric probability as a function of the energy has the same form as for the moment:

g(E) = k / E^{2/3} . (2.321)

Figure 2.25: Histogram of the number of earthquakes (in base 10 logarithmic scale) recorded by the global seismological networks in a period of xxx years, as a function of the logarithmic seismic moment (adapted from Lay and Wallace, 1995). More precisely, the quantity in the horizontal axis is μ = log10(M/M_K), where M is the seismic moment, and M_K = 10^{−7} J = 1 erg is a constant whose value is arbitrarily taken equal to the unit of moment (and of energy) in the cgs system of units; the vertical axis is n = log10 of the number of events. [Note: ask for the permission to publish this figure.]

^18 It is traditionally denoted M0.

2.8.6.3 Example: Size of oil fields

[Note: mention here figure 2.27.]

2.8.6.4 Example: Shapes at the Surface of the Earth

[Note: mention here figure 2.26.]

Figure 2.26: Wessel and Smith (1996) have compiled high-resolution shoreline data, and have processed it to suppress erratic points and crossing segments.
The shorelines are closed polygons, and they are classified in four levels: ocean boundaries, lake boundaries, island-in-lake boundaries and pond-in-island-in-lake boundaries. The 180,496 polygons they encountered had the size distribution shown at the right (the approximate numbers are in the quoted paper; the exact numbers were kindly sent to me by Wessel). A line of slope −2/3 is suggested in the figure. (Axes of the figure: log10(S/S0), with S0 = 1 km², against log10 of the number of polygons.)

Figure 2.27: Histogram of the sizes of oil fields in a region of Texas. The horizontal axis corresponds, with a logarithmic scale, to the 'millions of Barrels of Oil Equivalent' (mmBOE). Extracted from chapter 2 (The fractal size and spatial distribution of hydrocarbon accumulation, by Christopher C. Barton and Christopher H. Scholz) of the book "Fractals in petroleum geology and Earth processes", edited by Christopher C. Barton and Paul R. La Pointe, Plenum Press, New York and London, 1995. [Note: ask for the permission to publish this figure.] The slope of the straight line is −2/3, comparable to the value found with the data of Wessel and Smith (figure 2.26).

2.8.6.5 Example: Meteorites

[Note: mention here figure 2.28.]

Figure 2.28: The approximate number of meteorites falling on Earth every year is distributed as follows: 10^12 meteorites with a diameter of 10^{−3} mm, 10^6 with a diameter of 1 mm, 1 with a diameter of 1 m, 10^{−4} with a diameter of 100 m, and 10^{−8} with a diameter of 10 km. The statement is loose, and I have extracted it from the general press. It is nevertheless clear that a log-log plot of this 'histogram' gives a linear trend with a slope equal to −2. Transforming the diameter D into the volume V = D^3 (which is proportional to the mass) gives, rather, the 'histogram' at the right (vertical axis: log10 of the number every year), with a slope of −2/3.
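The claim that the log-log slope of such histograms directly gives the (negative of the) power of the Pareto law can be checked with a short computation. The sketch below (Python, illustrative; the bin width, number of bins and total count are arbitrary choices made here) evaluates the expected counts of a Pareto distribution in bins of constant log10 width and recovers the slope by least squares:

```python
import math

def expected_counts(alpha=2.0/3.0, M0=1.0, nbins=10, width=0.5, N=1.0e6):
    # Expected number of events among N in successive bins of constant
    # log10 width, for the cumulative function g(M) = 1 - (M/M0)^(-alpha).
    g = lambda M: 1.0 - (M / M0) ** (-alpha)
    edges = [M0 * 10.0 ** (width * i) for i in range(nbins + 1)]
    return [(N * (g(edges[i + 1]) - g(edges[i])),      # count in bin i
             math.sqrt(edges[i] * edges[i + 1]))       # geometric bin center
            for i in range(nbins)]

def loglog_slope(bins):
    # Ordinary least-squares slope of log10(count) against log10(M).
    pts = [(math.log10(M), math.log10(n)) for n, M in bins]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    num = sum((x - mx) * (y - my) for x, y in pts)
    den = sum((x - mx) ** 2 for x, _ in pts)
    return num / den
```

Because the counts in constant-log10 bins decrease exactly as M^{−α}, the fitted slope reproduces −α = −2/3.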
(Horizontal axis of figure 2.28: log10 V/V0, with V0 = 1 m³.)

2.8.7 Appendix: Spherical Gaussian Distributions

The simplest probability distributions over the circle and over the surface of the sphere are the von Mises and the Fisher probability distributions, respectively.

2.8.7.1 The von Mises Distribution

As already mentioned in example 2.5, and demonstrated in section 2.8.7.3 below, the conditional volumetric probability induced over the unit circle by a 2D Gaussian is

f(λ) = k exp( sin λ / σ² ) . (2.322)

The constant k is to be fixed by the normalization condition ∫_0^{2π} dϕ f(ϕ) = 1, this giving

k = 1 / ( 2π I0(1/σ²) ) , (2.323)

where I0(·) is the modified Bessel function of order zero.

Figure 2.29: The circular (von Mises) distribution corresponds to the intersection of a 2D Gaussian by a circle passing through the center of the Gaussian. Here, the unit circle has been represented, and two Gaussians with standard deviations σ = 1 (left) and σ = 1/2 (right). In fact, this is my preferred representation of the von Mises distribution, rather than the conventional functional display of figure 2.30.

Figure 2.30: The circular (von Mises) distribution, drawn for two full periods, centered at zero, and with values of σ equal to 2, √2, 1, 1/√2, 1/2 (from smooth to sharp).

2.8.7.2 The Fisher Probability Distribution

[Note: mention here Fisher (1953).] As already mentioned in example 2.5, and demonstrated in section 2.8.7.3 below, the conditional volumetric probability induced over the surface of a sphere by a 3D Gaussian is, using geographical coordinates,

f(ϕ, λ) = k exp( sin λ / σ² ) . (2.324)

We can normalize this volumetric probability by

∫ dS(ϕ, λ) f(ϕ, λ) = 1 , (2.325)

with dS(ϕ, λ) = cos λ dϕ dλ. This gives

k = 1 / ( 4π χ(1/σ²) ) , (2.326)

where
χ(x) = sinh(x) / x . (2.327)

2.8.7.3 Appendix: Fisher from Gaussian (Demonstration)

Let us demonstrate here that the Fisher probability distribution is obtained as the conditional of a Gaussian probability distribution over a sphere. As the demonstration is independent of the dimension of the space, let us take a space with n dimensions, where the (generalized) geographical coordinates are

x^1 = r cos λ cos λ^2 cos λ^3 cos λ^4 ··· cos λ^{n−2} cos λ^{n−1}
x^2 = r cos λ cos λ^2 cos λ^3 cos λ^4 ··· cos λ^{n−2} sin λ^{n−1}
··· = ···
x^{n−2} = r cos λ cos λ^2 sin λ^3
x^{n−1} = r cos λ sin λ^2
x^n = r sin λ . (2.328)

We shall consider the unit sphere at the origin, and an isotropic Gaussian probability distribution with standard deviation σ, with its center along the x^n axis, at position x^n = 1. The Gaussian volumetric probability, when expressed as a function of the Cartesian coordinates, is

f_x(x^1, ..., x^n) = k exp( − [ (x^1)² + (x^2)² + ··· + (x^{n−1})² + (x^n − 1)² ] / (2σ²) ) . (2.329)

As the volumetric probability is an invariant, to express it using the geographical coordinates we just need to use the replacements 2.328, to obtain

f_r(r, λ, λ^2, ...) = k exp( − [ r² cos² λ + (r sin λ − 1)² ] / (2σ²) ) , (2.330)

i.e.,

f_r(r, λ, λ^2, ...) = k exp( − [ r² + 1 − 2 r sin λ ] / (2σ²) ) . (2.331)

The condition to be on the sphere is just

r = 1 , (2.332)

so that the conditional volumetric probability, as given in equation 2.95, is just obtained (up to a multiplicative constant) by setting r = 1 in equation 2.331:

f(λ, λ^2, ...) = k exp( (sin λ − 1) / σ² ) , (2.333)

i.e., absorbing the constant factor exp(−1/σ²) into k,

f(λ, λ^2, ...) = k exp( sin λ / σ² ) . (2.334)

This volumetric probability corresponds to the n-dimensional version of the Fisher distribution. Its expression is identical in all dimensions; only the norming constant depends on the dimension of the space.
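The norming constant 2.323 of the von Mises distribution can be verified numerically. The sketch below (Python, illustrative; the power series used for I0 is the standard one) integrates exp(sin λ / σ²) over one period and compares the result with 2π I0(1/σ²):

```python
import math

def bessel_I0(x, terms=30):
    # Modified Bessel function of order zero: I0(x) = sum_m (x/2)^(2m) / (m!)^2
    return sum((x / 2.0) ** (2 * m) / math.factorial(m) ** 2
               for m in range(terms))

def vonmises_norm(sigma, n=20_000):
    # Midpoint-rule integral of exp(sin(t)/sigma^2) over one full period.
    dt = 2.0 * math.pi / n
    return sum(math.exp(math.sin((i + 0.5) * dt) / sigma ** 2)
               for i in range(n)) * dt

sigma = 1.0
numeric = vonmises_norm(sigma)                          # direct integration
analytic = 2.0 * math.pi * bessel_I0(1.0 / sigma ** 2)  # 1/k of eq. 2.323
```

The two values agree to machine precision, confirming that k = 1 / (2π I0(1/σ²)) normalizes the distribution 2.322.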
2.8.8 Appendix: Probability Distributions for Tensors

In this appendix we consider a symmetric second rank tensor, like the stress tensor σ of continuum mechanics. A symmetric tensor, σij = σji, has only six degrees of freedom, while it has nine components. It is important, for the development that follows, to agree on a proper definition of a set of 'independent components'. This can be done, for instance, by defining the following six-dimensional basis for symmetric tensors:

e1 = ( 1 0 0 ; 0 0 0 ; 0 0 0 ) ; e2 = ( 0 0 0 ; 0 1 0 ; 0 0 0 ) ; e3 = ( 0 0 0 ; 0 0 0 ; 0 0 1 ) ; (2.335)

e4 = (1/√2) ( 0 0 0 ; 0 0 1 ; 0 1 0 ) ; e5 = (1/√2) ( 0 0 1 ; 0 0 0 ; 1 0 0 ) ; e6 = (1/√2) ( 0 1 0 ; 1 0 0 ; 0 0 0 ) . (2.336)

Then, any symmetric tensor can be written as

σ = s^α e_α , (2.337)

and the six values s^α are the six 'independent components' of the tensor, in terms of which the tensor writes

σ = ( s^1, s^6/√2, s^5/√2 ; s^6/√2, s^2, s^4/√2 ; s^5/√2, s^4/√2, s^3 ) . (2.338)

The only natural definition of distance between two tensors is the norm of their difference, so we can write

D(σ2, σ1) = ‖ σ2 − σ1 ‖ , (2.339)

where the norm of a tensor σ is^19

‖σ‖ = √( σij σ^{ji} ) . (2.340)

The basis in equations 2.335-2.336 is normed with respect to this norm^20. In terms of the independent components in expression 2.338, the norm of a tensor simply becomes

‖σ‖ = √( (s^1)² + (s^2)² + (s^3)² + (s^4)² + (s^5)² + (s^6)² ) , (2.341)

this showing that the six components s^α play the role of Cartesian coordinates of this 6D space of tensors. A Gaussian volumetric probability in this space then has, obviously, the form

f_s(s) = k exp( − Σ_{α=1}^{6} (s^α − s0^α)² / (2ρ²) ) , (2.342)

or, more generally,

f_s(s) = k exp( − (s^α − s0^α) W_{αβ} (s^β − s0^β) / (2ρ²) ) . (2.343)

^19 Of course, as, here, σij = σji, one can also write ‖σ‖ = √(σij σ^{ij}), but this expression is only valid for symmetric tensors, while expression 2.340 is generally valid.
^20 It is also orthonormed, with the obvious definition of scalar product from which this norm derives.
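The statement that the components s^α behave as Cartesian coordinates is easy to verify numerically. The following sketch (Python, purely illustrative; the sample values of s are arbitrary) builds the tensor 2.338 and compares the norm 2.340 with the Euclidean expression 2.341:

```python
import math

SQ2 = math.sqrt(2.0)

def tensor_from_components(s):
    # Symmetric tensor of eq. 2.338 built from the six independent components.
    s1, s2, s3, s4, s5, s6 = s
    return [[s1,       s6 / SQ2, s5 / SQ2],
            [s6 / SQ2, s2,       s4 / SQ2],
            [s5 / SQ2, s4 / SQ2, s3      ]]

def frobenius_norm(T):
    # ||sigma|| = sqrt(sigma_ij sigma^ij)  (eq. 2.340; Euclidean metric here)
    return math.sqrt(sum(T[i][j] ** 2 for i in range(3) for j in range(3)))

s = [1.0, -2.0, 0.5, 3.0, -1.0, 2.0]       # arbitrary sample components
T = tensor_from_components(s)
euclid = math.sqrt(sum(c ** 2 for c in s))  # eq. 2.341
```

The factors 1/√2 on the off-diagonal entries are exactly what makes the two norms coincide, since each off-diagonal component appears twice in the sum 2.340.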
It is easy to find probabilistic models for tensors when we choose as coordinates the independent components of the tensor, as this Gaussian example suggests. But a symmetric second rank tensor may also be described using its three eigenvalues {λ1, λ2, λ3} and the three Euler angles {ψ, θ, ϕ} defining the directions of the eigenvectors:

( s^1, s^6/√2, s^5/√2 ; s^6/√2, s^2, s^4/√2 ; s^5/√2, s^4/√2, s^3 ) = R(ψ) R(θ) R(ϕ) diag(λ1, λ2, λ3) R(ϕ)^T R(θ)^T R(ψ)^T , (2.344)

where R denotes the usual rotation matrix. Some care is required when using the coordinates {λ1, λ2, λ3, ψ, θ, ϕ}. To write a Gaussian volumetric probability in terms of eigenvalues and eigendirections only requires, of course, inserting in the f_s(s) of equation 2.343 the expression 2.344 giving the tensor components as a function of the eigenvalues and eigendirections (we consider volumetric probabilities, which are invariant, and not probability densities, which would require an extra multiplication by the Jacobian determinant of the transformation):

f(λ1, λ2, λ3, ψ, θ, ϕ) = f_s(s^1, s^2, s^3, s^4, s^5, s^6) . (2.345)

But then, of course, we still need to know how to integrate in the space using these new coordinates, in order to evaluate probabilities. Before facing this problem, let us remark that it is the replacement in equation 2.343 of the components s^α in terms of the eigenvalues and eigendirections of the tensor that expresses a Gaussian probability distribution in terms of the variables {λ1, λ2, λ3, ψ, θ, ϕ}. Using a function that would 'look Gaussian' in the variables {λ1, λ2, λ3, ψ, θ, ϕ} would not correspond to a Gaussian probability distribution, in the sense of section 2.8.4. The Jacobian determinant of the transformation {s^1, s^2, s^3, s^4, s^5, s^6} ↦ {λ1, λ2, λ3, ψ, θ, ϕ} can be obtained using a direct computation, which gives^21

∂(s^1, s^2, s^3, s^4, s^5, s^6) / ∂(λ1, λ2, λ3, ψ, θ, ϕ) = (λ1 − λ2) (λ2 − λ3) (λ3 − λ1) sin θ . (2.346)
^21 If, instead of the three Euler angles, we take three rotations around the three coordinate axes, the sin θ here above is replaced by the cosine of the second angle. This is consistent with the formula by Xu and Grafarend (1997).

The capacity elements in the two systems of coordinates are

dv̄_s(s^1, ..., s^6) = ds^1 ∧ ds^2 ∧ ds^3 ∧ ds^4 ∧ ds^5 ∧ ds^6
dv̄(λ1, λ2, λ3, ψ, θ, ϕ) = dλ1 ∧ dλ2 ∧ dλ3 ∧ dψ ∧ dθ ∧ dϕ . (2.347)

As the coordinates {s^α} are Cartesian, the volume element of the space is numerically identical to the capacity element,

dv_s(s^1, ..., s^6) = dv̄_s(s^1, ..., s^6) , (2.348)

but in the coordinates {λ1, λ2, λ3, ψ, θ, ϕ} the volume element and the capacity element are related via the Jacobian determinant in equation 2.346,

dv(λ1, λ2, λ3, ψ, θ, ϕ) = (λ1 − λ2) (λ2 − λ3) (λ3 − λ1) sin θ dv̄(λ1, λ2, λ3, ψ, θ, ϕ) . (2.349)

Then, while the evaluation of a probability in the variables {s^1, ..., s^6} should be done via

P = ∫ dv_s(s^1, ..., s^6) f_s(s^1, ..., s^6)
  = ∫ ds^1 ∧ ds^2 ∧ ds^3 ∧ ds^4 ∧ ds^5 ∧ ds^6 f_s(s^1, s^2, s^3, s^4, s^5, s^6) , (2.350)

in the variables {λ1, λ2, λ3, ψ, θ, ϕ} it should be done via

P = ∫ dv(λ1, λ2, λ3, ψ, θ, ϕ) f(λ1, λ2, λ3, ψ, θ, ϕ)
  = ∫ dλ1 ∧ dλ2 ∧ dλ3 ∧ dψ ∧ dθ ∧ dϕ (λ1 − λ2) (λ2 − λ3) (λ3 − λ1) sin θ f(λ1, λ2, λ3, ψ, θ, ϕ) . (2.351)

To conclude this appendix, we may remark that the homogeneous probability distribution (defined as the one that is 'proportional to the volume distribution') is obtained by taking both f_s(s^1, ..., s^6) and f(λ1, λ2, λ3, ψ, θ, ϕ) as constants. [Note: I should explain somewhere that there is a complication when, instead of considering 'a tensor like the stress tensor', one considers a positive tensor (like an electric permittivity tensor). The treatment above applies, approximately, to the logarithm of such a tensor.]
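The structure of the Jacobian determinant 2.346 can be checked numerically by finite differences. The sketch below (Python, illustrative; it assumes a z-x-z Euler convention for R(ψ)R(θ)R(ϕ), and verifies only proportionality, since the overall constant factor depends on the rotation convention chosen):

```python
import math

def Rz(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def Rx(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def components(p):
    # Map {l1, l2, l3, psi, theta, phi} -> {s1, ..., s6} of eq. 2.338,
    # with sigma = R diag(l1, l2, l3) R^T  (z-x-z Euler angles assumed).
    l1, l2, l3, psi, theta, phi = p
    R = matmul(matmul(Rz(psi), Rx(theta)), Rz(phi))
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    S = matmul(matmul(R, [[l1, 0, 0], [0, l2, 0], [0, 0, l3]]), Rt)
    r2 = math.sqrt(2.0)
    return [S[0][0], S[1][1], S[2][2], r2 * S[1][2], r2 * S[0][2], r2 * S[0][1]]

def det6(M):
    # Determinant by Gaussian elimination with partial pivoting.
    M = [row[:] for row in M]
    det = 1.0
    for c in range(6):
        piv = max(range(c, 6), key=lambda r: abs(M[r][c]))
        if piv != c:
            M[c], M[piv] = M[piv], M[c]
            det = -det
        det *= M[c][c]
        for r in range(c + 1, 6):
            f = M[r][c] / M[c][c]
            for k in range(c, 6):
                M[r][k] -= f * M[c][k]
    return det

def jacobian_det(p, h=1e-6):
    # Central finite differences of the six components w.r.t. the six coords.
    cols = []
    for j in range(6):
        pp, pm = list(p), list(p)
        pp[j] += h
        pm[j] -= h
        sp, sm = components(pp), components(pm)
        cols.append([(sp[i] - sm[i]) / (2.0 * h) for i in range(6)])
    return det6([[cols[j][i] for j in range(6)] for i in range(6)])

def formula(p):
    # The eigenvalue and angle dependence of eq. 2.346.
    l1, l2, l3, _, theta, _ = p
    return (l1 - l2) * (l2 - l3) * (l3 - l1) * math.sin(theta)
```

Evaluating the ratio |jacobian_det(p)| / |formula(p)| at different points gives the same constant, confirming the functional form of equation 2.346.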
2.8.9 Appendix: Determinant of a Partitioned Matrix

Using well known properties of matrix algebra (e.g., Lütkepohl, 1996), the determinant of a partitioned matrix can be expressed as

det ( g_rr, g_rs ; g_sr, g_ss ) = det g_rr det( g_ss − g_sr g_rr^{−1} g_rs ) . (2.352)

2.8.10 Appendix: The Borel 'Paradox'

[Note: This appendix has to be updated.] A description of the paradox is given, for instance, by Kolmogorov (1933), in his Foundations of the Theory of Probability (see figure 2.31).

Figure 2.31: A reproduction of a section of Kolmogorov's book Foundations of the Theory of Probability (1950, pp. 50-51), where he describes the so-called 'Borel paradox'. His explanation is not profound: instead of discussing the behaviour of a conditional probability density under a change of variables, it concerns the interpretation of a probability density over the sphere when using spherical coordinates. I do not agree with his conclusion (see main text).

A probability distribution is considered over the surface of the unit sphere, associating, as it should, to any region A of the surface of the sphere a positive real number P(A). To any possible choice of coordinates {u, v} on the surface of the sphere will correspond a probability density f(u, v) representing the given probability distribution, through P(A) = ∫∫ du dv f(u, v) (integral over the region A). At this point of the discussion, the coordinates {u, v} may be the standard spherical coordinates or any other system of coordinates (as, for instance, the Cartesian coordinates in a representation of the surface of the sphere as a 'geographical map', using any 'geographical projection'). A great circle is given on the surface of the sphere that, should we use spherical coordinates, is not necessarily the 'equator' or a 'meridian'. Points on this circle may be parameterized by a coordinate α that, for simplicity, we may take to be the circular angle (as measured from the center of the sphere).
The probability distribution P(·) defined over the surface of the sphere will induce a probability distribution over the circle. Said otherwise, the probability density f(u, v) defined over the surface of the sphere will induce a probability density g(α) over the circle. This is the situation one has in mind when defining the notion of conditional probability density, so we may say that g(α) is the conditional probability density induced on the circle by the probability density f(u, v), given the condition that points must lie on the great circle.

The Borel-Kolmogorov paradox is obtained when the probability distribution over the surface of the sphere is homogeneous. If it is homogeneous over the sphere, the conditional probability distribution over the great circle must be homogeneous too, and, as we parameterize by the circular angle α, the conditional probability density over the circle must be

g(α) = 1/(2π) , (2.353)

and this is not what one gets from the standard definition of conditional probability density, as we will see below.

From now on, assume that the spherical coordinates {λ, ϕ} are used, where λ is the latitude (rather than the colatitude θ), so the domains of definition of the variables are

−π/2 < λ ≤ +π/2 ; −π < ϕ ≤ +π . (2.354)

As the surface element is dS(λ, ϕ) = cos λ dλ dϕ, the homogeneous probability distribution over the surface of the sphere is represented, in spherical coordinates, by the probability density

f(λ, ϕ) = cos λ / (4π) , (2.355)

and we satisfy the normalization condition

∫_{−π/2}^{+π/2} dλ ∫_{−π}^{+π} dϕ f(λ, ϕ) = 1 . (2.356)

The probability of any region equals the relative surface of the region (i.e., the ratio of the surface of the region divided by the surface of the sphere, 4π), so the probability density in equation 2.355 does represent the homogeneous probability distribution. Two different computations follow. Both are aimed at computing the conditional probability density over a great circle.
The first one uses the nonconventional definition of conditional probability density introduced in section ?? of this text (and claimed to be 'consistent'). No paradox appears, no matter whether we take as great circle a meridian or the equator. The second computation is the conventional one. The traditional Borel-Kolmogorov paradox appears when the great circle is taken to be a meridian. We interpret this as a sign of the inconsistency of the conventional theory. Let us develop the example. We have the line element (taking a sphere of radius 1)

ds² = dλ² + cos² λ dϕ² , (2.357)

which gives the metric components

g_λλ(λ, ϕ) = 1 ; g_ϕϕ(λ, ϕ) = cos² λ , (2.358)

and the surface element

dS(λ, ϕ) = cos λ dλ dϕ . (2.359)

Letting f(λ, ϕ) be a probability density over the sphere, consider the restriction of this probability on the (half) meridian ϕ = ϕ0, i.e., the conditional probability density on this (half) meridian. It is, following equation ??,

f_λ(λ | ϕ = ϕ0) = k f(λ, ϕ0) / √( g_ϕϕ(λ, ϕ0) ) . (2.360)

In our case, using the second of equations 2.358,

f_λ(λ | ϕ = ϕ0) = k f(λ, ϕ0) / cos λ , (2.361)

or, in normalized version,

f_λ(λ | ϕ = ϕ0) = [ f(λ, ϕ0) / cos λ ] / ∫_{−π/2}^{+π/2} dλ f(λ, ϕ0) / cos λ . (2.362)

If the original probability density f(λ, ϕ) represents a homogeneous probability, then it must be proportional to the surface element dS (equation 2.359), so, in normalized form, the homogeneous probability density is

f(λ, ϕ) = cos λ / (4π) . (2.363)

Then, equation 2.362 gives

f_λ(λ | ϕ = ϕ0) = 1/π . (2.364)

We see that this conditional probability density is constant^22.
This is in contradiction with the usual 'definitions' of conditional probability density, where the metric of the space is not considered, and where, instead of the correct equation 2.360, the conditional probability density is 'defined' by

f_λ(λ | ϕ = ϕ0) = k f(λ, ϕ0) = f(λ, ϕ0) / ∫_{−π/2}^{+π/2} dλ f(λ, ϕ0)   (wrong definition) , (2.365)

this leading, in the considered case, to the conditional probability density

f_λ(λ | ϕ = ϕ0) = cos λ / 2   (wrong result) . (2.366)

This result is the celebrated 'Borel paradox'. As any other 'mathematical paradox', it is not a paradox: it is just the result of an inconsistent calculation, with an arbitrary definition of conditional probability density. The interpretation of the paradox by Kolmogorov (1933) sounds quite strange to us (see figure 2.31). Jaynes (1995) says: "Whenever we have a probability density on one space and we wish to generate from it one on a subspace of measure zero, the only safe procedure is to pass to an explicitly defined limit [...]. In general, the final result will and must depend on which limiting operation was specified. This is extremely counter-intuitive at first hearing; yet it becomes obvious when the reason for it is understood." We agree with Jaynes, and go one step further. We claim that the usual parameter spaces where we define probability densities normally accept a natural definition of distance, and that the 'limiting operation' (in the words of Jaynes) must be the uniform convergence associated to the metric. This is what we have done to define the notion of conditional probability. Many examples of such distances are shown in this text.

^22 This constant value is 1/π if we consider half a meridian, or 1/(2π) if we consider a whole meridian.
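The two computations above can be reproduced numerically. The following sketch (Python, purely illustrative) evaluates, for the homogeneous distribution 2.363, the metric-based conditional of equation 2.362 and the 'wrong definition' of equation 2.365:

```python
import math

def integrate(g, a, b, n=20_000):
    # Simple midpoint rule (it avoids the endpoints, where cos(lambda) = 0).
    dx = (b - a) / n
    return sum(g(a + (i + 0.5) * dx) for i in range(n)) * dx

def f_homog(lam, phi):
    # Homogeneous probability density on the sphere, eq. 2.363
    return math.cos(lam) / (4.0 * math.pi)

def conditional_metric(lam, phi0=0.0):
    # Eq. 2.362: divide by sqrt(g_phiphi) = cos(lambda), then renormalize.
    raw = lambda l: f_homog(l, phi0) / math.cos(l)
    return raw(lam) / integrate(raw, -math.pi / 2, math.pi / 2)

def conditional_conventional(lam, phi0=0.0):
    # Eq. 2.365 (the 'wrong definition'): renormalize f(lambda, phi0) as is.
    raw = lambda l: f_homog(l, phi0)
    return raw(lam) / integrate(raw, -math.pi / 2, math.pi / 2)
```

The first function returns the constant 1/π of equation 2.364; the second returns cos λ / 2, the 'wrong result' 2.366.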
2.8.11 Appendix: Axioms for the Sum and the Product

2.8.11.1 The Sum

I guess that the two defining axioms for the union of two probabilities are

P(A) = 0 and Q(A) = 0 ⟹ (P ∪ Q)(A) = 0 (2.367)

and

P(A) ≠ 0 or Q(A) ≠ 0 ⟹ (P ∪ Q)(A) ≠ 0 . (2.368)

But the last property is equivalent to its negation,

P(A) = 0 and Q(A) = 0 ⟸ (P ∪ Q)(A) = 0 , (2.369)

and this can be reunited with the first property, to give the single axiom

P(A) = 0 and Q(A) = 0 ⟺ (P ∪ Q)(A) = 0 . (2.370)

2.8.11.2 The Product

We only have the axiom

P(A) = 0 or Q(A) = 0 ⟹ (P ∩ Q)(A) = 0 , (2.371)

and, of course, its (equivalent) negation

P(A) ≠ 0 and Q(A) ≠ 0 ⟸ (P ∩ Q)(A) ≠ 0 . (2.372)

2.8.12 Appendix: Random Points on the Surface of the Sphere

Figure 2.32: 1000 random points on the surface of the sphere.

Note: Figure 2.32 has been generated using the following Mathematica code:

spc[t_,p_,r_:1] := r {Sqrt[1-t^2] Cos[p], Sqrt[1-t^2] Sin[p], t}
Show[Graphics3D[Table[Point[spc[Random[Real,{-1,1}], Random[Real,{0,2Pi}]]],{1000}]]]

Figure 2.33: A geodesic dome dividing the surface of the sphere into regions with approximately the same area.

Figure 2.34: The coordinate division of the surface of the sphere.

Figure 2.35: Map representation of a random homogeneous distribution of points at the surface of the sphere (coordinates θ from −π to +π and ϕ from −π/2 to +π/2). At the left, the naïve division of the surface of the sphere using constant increments of the coordinates. At the right, the cylindrical equal-area projection. Counting the points inside each 'rectangle' gives, at the left, the probability density of points; at the right, the volumetric probability.

2.8.13 Appendix: Histograms for the Volumetric Mass of Rocks

Figure 2.36: Histogram of the volumetric mass for the 557 minerals listed in the Handbook of Physical Properties of Rocks (Johnson and Olhoeft, 1984).
A logarithmic axis is used that represents the variable u = log10(ρ/K), with K = 1 g/cm³. Superposed to the histogram is the normal function with mean 0.60 and standard deviation 0.23. The vertical lines correspond to successive deviations at multiples of the standard deviation. See the lognormal function in figure 2.37.

Figure 2.37: A naïve version of the histogram in figure 2.36, using an axis labeled in volumetric mass (g/cm³).

Figure 2.38: A third version of the histogram, obtained using intervals of constant length δρ/ρ (axis in g/cm³).

Chapter 3

Monte Carlo Sampling Methods

[Note: write here a small introduction to the chapter.]

3.1 Introduction

When a probability distribution has been defined, we have to face the problem of how to 'use' it. The definition of some 'central estimators' (like the mean or the median) and some 'estimators of dispersion' (like the covariance matrix) lacks generality, as it is quite easy to find examples (like multimodal distributions in highly-dimensioned spaces) where these estimators fail to have any interesting meaning. When a probability distribution has been defined over a space of low dimension (say, from one to four dimensions), then we can directly represent the associated probability density^1. This is trivial in one or two dimensions. It is easy in three dimensions, using, for instance, virtual reality software. Some tricks may allow us to represent a four-dimensional probability distribution, but clearly this approach cannot be generalized to the high dimensional case. Let us explain the only approach that seems practical, with the help of figure 3.1.
At the left of the figure, there is an explicit representation of a 2D probability distribution (by means of the associated probability density or the associated (2D) volumetric probability). In the middle, some random points have been generated (using the Monte Carlo method about to be described). It is clear that if we make a histogram with these points, in the limit of a sufficiently large number of points, we recover the representation at the left^2. Disregarding the histogram possibility, we can concentrate on the individual points. In the 2D example of the figure, we have actual points in a plane. If the problem is multidimensional, each 'point' may correspond to some abstract notion. For instance, for a geophysicist a 'point' may be a given model of the Earth. This model may be represented in some way, for instance a nice drawing with plenty of colors. Then a collection of 'points' is a collection of such drawings. Our experience shows that, given such a collection of randomly generated 'models', the human eye-brain system is extremely good at apprehending the basic characteristics of the underlying probability distribution, including possible multimodalities, correlations, etc.

Figure 3.1: An explicit representation of a 2D probability distribution, and the sampling of it using Monte Carlo methods. While the representation at the top-left cannot be generalized to high dimensions, the examination of a collection of points can be done in arbitrary dimensions. Practically, the Monte Carlo generation of points is done through a 'random walk', where a 'new point' is generated in the vicinity of the previous point.

When such a (hopefully large) collection of random models is available, we can also answer quite interesting questions. For instance, a geologist may ask: at which depth is that subsurface structure?
To answer this, we can make a histogram of the depth of the given geological structure over the collection of random models, and the histogram is the answer to the question. What is the probability of having a low velocity zone around a given depth? The ratio of the number of models presenting such a low velocity zone over the total number of models in the collection gives the answer (if the collection of models is large enough). This is essentially what we propose: looking at a large number of randomly generated models in order to intuitively apprehend the basic properties of the probability distribution, followed by precise computations of the probability of all interesting 'events'. Practically, as we shall see, the random sampling is not made by generating points independently of each other. Rather, as suggested in the last image of figure 3.1, it is made through a 'random walk', where a 'new point' is generated in the vicinity of the previous point. Monte Carlo methods have a random generator at their core^3. At present, Monte Carlo methods are typically implemented on digital computers, and are based on the pseudorandom generation of numbers^4. As we shall see, any conceivable operation on probability densities (e.g., computing marginals and conditionals, integration, conjunction (the and operation), etc.) has its counterpart in an operation on/by their corresponding Monte Carlo algorithms. Inverse problems are often formulated in high dimensional spaces.

^1 Or, better, the associated volumetric probability.
^2 There are two ways of making a histogram. If the space is divided in cells with constant coordinate differences dx1, dx2, ..., then the limit converges to the probability density. If, instead, the space is divided in cells of constant volume dV, then the limit converges to the volumetric probability.
In this case a certain class of Monte Carlo algorithms, the so-called importance sampling algorithms, comes to the rescue, allowing us to sample the space with a sampling density proportional to the given probability density. In this way, excessive (and useless) sampling of low-probability areas of the space is avoided. That this is not only important, but in fact vital, in high dimensional spaces can be seen in figure 3.2, where the failure of a plain Monte Carlo sampling (one that samples the space uniformly) in high dimensional spaces is made clear. Another advantage of the importance sampling Monte Carlo algorithms is that we need not have a closed form mathematical expression for the probability density we want to sample: only an algorithm that allows us to evaluate it at a given point in the space is needed. This is of considerable practical advantage in the analysis of inverse problems, where computer intensive evaluation of, e.g., misfit functions plays an important role in the calculation of certain probability densities. Given a probability density that we wish to sample, and a class of Monte Carlo algorithms that samples this density, which one of the algorithms should we choose? Practically, the problem here is to find the most efficient of these algorithms. This is an interesting and difficult problem that we will not go into detail with here. We will, later in this chapter, limit ourselves to only two general methods which are recommendable in many practical situations.

3.2 Random Walks

To escape the dimensionality problem, any sampling of a probability density for which point values are available only upon request has to be based on a random walk, i.e., on the generation of successive points with the constraint that the point x_{i+1} sampled in iteration i+1 is in the vicinity of the point x_i sampled in iteration i.
The simplest of the random walks are the so-called Markov Chain Monte Carlo (MCMC) algorithms, where the point x_{i+1} depends on the point x_i, but not on previous points. We will concentrate on these algorithms here.

^3 Note: Cite here the example of Buffon, and a couple of other simple examples.
^4 I.e., series of numbers that appear random if tested with any reasonable statistical test. Note: cite here some references (Press, etc.).

Figure 3.2: Consider a square and the inscribed circle. If the circle's surface is πR², that of the square is (2R)². If we generate a random point inside the square, with homogeneous probability distribution, the probability of hitting the circle equals the ratio of the surfaces, i.e., P = π/4. We can do the same in 3D, but, in this case, the ratio of volumes is P = π/6: the probability of hitting the target is smaller in 3D than in 2D. This probability tends dramatically to zero when the dimension of the space increases. For instance, in dimension 100, the probability of hitting the hypersphere inscribed in the hypercube is P = 1.9 × 10^{−70}, which means that it is practically impossible to hit the target 'by chance'. The formulas at the top give the volume of a hypersphere of radius R in a space of dimension 2n or 2n+1, namely π^n R^{2n} / n! and 2^{n+1} π^n R^{2n+1} / (2n+1)!! (the formula is not the same for spaces with even or odd dimension), and the volume of a hypercube with sides of length 2R, namely (2R)^{2n} or (2R)^{2n+1}. The graph at the bottom shows the evolution, as a function of the dimension of the space, of the ratio between the volume of the hypersphere and the volume of the hypercube. In large dimension, the hypersphere fills a negligible amount of the hypercube.
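The even- and odd-dimension volume formulas of figure 3.2 are both contained in the single expression V_n(R) = π^{n/2} R^n / Γ(n/2 + 1). The sketch below (Python, illustrative) uses it to reproduce the ratios quoted in the caption:

```python
import math

def sphere_to_cube_ratio(n):
    # V_sphere(n, R) / V_cube(side 2R) = pi^(n/2) / (2^n Gamma(n/2 + 1)),
    # one formula valid for even and odd n (and independent of R).
    return math.pi ** (n / 2.0) / (2.0 ** n * math.gamma(n / 2.0 + 1.0))
```

For n = 2 and n = 3 this gives π/4 and π/6, and for n = 100 it gives approximately 1.9 × 10^{−70}, the value quoted in the caption.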
If random rules have been defined to select points such that the probability of selecting a point in the infinitesimal "box" dx^1 ... dx^N is p(x) dx^1 ... dx^N, then the points selected in this way are called samples of the probability density p(x). Depending on the rules defined, successive samples i, j, k, ... may be dependent or independent. Before going into more complex sampling situations, we should mention that there exist methods for sampling probability densities that can be described by explicit mathematical expressions. Information on some of the most important of these methods can be found in appendix 3.10.3. Sampling in cases where only point values of the probability density are available upon request can be done by means of Monte Carlo algorithms based on random walks. In the following, we shall describe the essential properties of random walks performing the so-called importance sampling.

3.3 Modification of Random Walks

Assume here that we can start with a random walk that samples some probability density f(x), and that we have the goal of obtaining a random walk that samples the probability density

h(x) = k f(x) g(x) / μ(x) . (3.1)

Call x_i the 'current point'. With this current point as starting point, run one step of the random walk that, unimpeded, would sample the probability density f(x), and generate a 'test point' x_test. Compute the value

q_test = g(x_test) / μ(x_test) . (3.2)

If that value is 'high enough', let that point 'survive'. If q_test is not 'high enough', discard this point and generate another one (making another step of the random walk sampling the prior probability density f(x), using again the 'current point' x_i as starting point). There are many criteria for deciding when a point should survive or be discarded, all of them resulting in a collection of 'surviving points' that are samples of the target probability density h(x).
For instance, if we know the maximum possible value of the ratio g(x)/µ(x), say q_max, then define

P_test = q_test / q_max ,    (3.3)

and give the point x_test the probability P_test of survival (note that 0 < P_test ≤ 1). It is intuitively obvious why a random walk modified using such a criterion actually samples the probability density h(x) defined by equation 3.1.

Among the many criteria that can be used, by far the most efficient is the Metropolis criterion, the criterion behind the Metropolis algorithm (Metropolis et al., 1953). In the following we shall describe this algorithm in some detail.

3.4 The Metropolis Rule

Consider the following situation. Some random rules define a random walk that samples the probability density f(x). At a given step, the random walker is at point x_j, and the application of the rules would lead to a transition to point x_i. If that 'proposed transition' x_i ← x_j is always accepted, the random walker will sample the probability density f(x). Instead of always accepting the proposed transition x_i ← x_j, we reject it sometimes by using the following rule to decide if the walker is allowed to move to x_i or if it must stay at x_j:

• if g(x_i)/µ(x_i) ≥ g(x_j)/µ(x_j), then accept the proposed transition to x_i,

• if g(x_i)/µ(x_i) < g(x_j)/µ(x_j), then decide randomly to move to x_i, or to stay at x_j, with the following probability of accepting the move to x_i:

P = (g(x_i)/µ(x_i)) / (g(x_j)/µ(x_j)) .    (3.4)

Then we have the following

Theorem 3.1 The random walker samples the conjunction h(x) of the probability densities f(x) and g(x),

h(x) = k f(x) g(x)/µ(x)    (3.5)

(see appendix 3.10.2 for a demonstration).

It should be noted here that this algorithm nowhere requires the probability densities to be normalized. This is of vital importance in practice, since it allows sampling of probability densities whose values are known only at points already sampled by the algorithm.
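A minimal sketch of the acceptance rule above (not from the book): take the primeval walk to be a symmetric random walk, so that it samples a constant f(x), take µ(x) constant, and take g(x) as a standard Gaussian; the surviving points then sample h(x) ∝ g(x). The step length and chain length are illustrative choices.

```python
import math
import random

def metropolis(log_g, x0, step, n_steps, seed=0):
    """Metropolis rule (eq. 3.4): a symmetric proposal plays the role of
    the f(x)-sampling walk, and mu(x) is taken constant, so the test
    point is accepted with probability min(1, g(x_test)/g(x))."""
    rng = random.Random(seed)
    x, lg = x0, log_g(x0)
    samples = []
    for _ in range(n_steps):
        x_test = x + rng.uniform(-step, step)   # one step of the primeval walk
        lg_test = log_g(x_test)
        # accept always if the ratio is >= 1, else with that probability
        if math.log(rng.random()) < lg_test - lg:
            x, lg = x_test, lg_test
        samples.append(x)   # a rejected move recounts the current point
    return samples

# target: g(x) proportional to exp(-x^2/2); no normalization is needed
chain = metropolis(lambda x: -0.5 * x * x, 0.0, 2.5, 100_000)
mean = sum(chain) / len(chain)
var = sum((s - mean) ** 2 for s in chain) / len(chain)
print(round(mean, 2), round(var, 2))
```

Note that the target is passed unnormalized, in line with the remark above that the algorithm nowhere requires normalized probability densities.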
Obviously, such probability densities cannot be normalized. Also, the fact that our theory allows unnormalizable probability densities will not cause any trouble in the application of the above algorithm.

The algorithm above is reminiscent (see appendix 3.10.2) of the Metropolis algorithm (Metropolis et al., 1953), originally designed to sample the Gibbs-Boltzmann distribution⁵. Accordingly, we will refer to the above acceptance rule as the Metropolis rule.

⁵ To see this, put f(x) = 1, µ(x) = 1, and g(x) = exp(−E(x)/T) / ∫ exp(−E(x)/T) dx, where E(x) is an 'energy' associated to the point x, and T is a 'temperature'. The integral in the denominator is over the entire space. In this way, our acceptance rule becomes the classical Metropolis rule: point x_i is always accepted if E(x_i) ≤ E(x_j), but if E(x_i) > E(x_j), it is only accepted with probability P_ij^acc = exp(−(E(x_i) − E(x_j))/T).

3.5 The Cascaded Metropolis Rule

As above, assume that some random rules define a random walk that samples the probability density f_1(x). At a given step, the random walker is at point x_j ;

1. apply the rules that, unthwarted, would generate samples of f_1(x), to propose a new point x_i ;

2. if f_2(x_i)/µ(x_i) ≥ f_2(x_j)/µ(x_j), go to point 3; if f_2(x_i)/µ(x_i) < f_2(x_j)/µ(x_j), then decide randomly to go to point 3 or to go back to point 1, with the following probability of going to point 3: P = (f_2(x_i)/µ(x_i)) / (f_2(x_j)/µ(x_j)) ;

3. if f_3(x_i)/µ(x_i) ≥ f_3(x_j)/µ(x_j), go to point 4; if f_3(x_i)/µ(x_i) < f_3(x_j)/µ(x_j), then decide randomly to go to point 4 or to go back to point 1, with the following probability of going to point 4: P = (f_3(x_i)/µ(x_i)) / (f_3(x_j)/µ(x_j)) ;

...
n. if f_n(x_i)/µ(x_i) ≥ f_n(x_j)/µ(x_j), then accept the proposed transition to x_i ; if f_n(x_i)/µ(x_i) < f_n(x_j)/µ(x_j), then decide randomly to move to x_i, or to stay at x_j, with the following probability of accepting the move to x_i: P = (f_n(x_i)/µ(x_i)) / (f_n(x_j)/µ(x_j)) .

Then we have the following

Theorem 3.2 The random walker samples the conjunction h(x) of the probability densities f_1(x), f_2(x), ..., f_n(x):

h(x) = k f_1(x) (f_2(x)/µ(x)) · · · (f_n(x)/µ(x))    (3.6)

(see appendix XXX for a demonstration).

3.6 Initiating a Random Walk

Consider the problem of obtaining samples of a probability density h(x) defined as the conjunction of some probability densities f_1(x), f_2(x), f_3(x), ...,

h(x) = k f_1(x) (f_2(x)/µ(x)) (f_3(x)/µ(x)) · · · ,    (3.7)

and let us examine three common situations.

3.6.0.0.1 We start with a random walk that samples f_1(x) (optimal situation): This corresponds to the basic algorithm, where we know how to produce a random walk that samples f_1(x), and we only need to modify it, taking into account the values f_2(x)/µ(x), f_3(x)/µ(x), ..., using the cascaded Metropolis rule, to obtain a random walk that samples h(x).

3.6.0.0.2 We start with a random walk that samples µ(x): We can write equation 3.7 as

h(x) = k µ(x) (f_1(x)/µ(x)) (f_2(x)/µ(x)) · · · .    (3.8)

This expression corresponds to the case where we are not able to start with a random walk that samples f_1(x), but we have a random walk that samples the homogeneous probability density µ(x). Then, with respect to the example just mentioned, there is one extra step to be added, taking into account the values of f_1(x)/µ(x).
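The cascaded rule invoked above can be sketched numerically (an illustration, not part of the original text): take µ(x) constant and a symmetric primeval walk, so f_1(x) is constant, and cascade the tests for the remaining densities. A failure at any stage sends the walker back to step 1 from the current point. The two Gaussian factors below are an arbitrary choice whose conjunction is known in closed form.

```python
import math
import random

def cascaded_metropolis(log_fs, x0, step, n_steps, seed=0):
    """Cascaded Metropolis rule: a symmetric walk proposes x_test, which
    must then pass one acceptance test per density f_2, ..., f_n; the
    generator inside all() short-circuits, so a failure at any stage
    skips the remaining tests and keeps the current point."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        x_test = x + rng.uniform(-step, step)
        if all(math.log(rng.random()) < lf(x_test) - lf(x) for lf in log_fs):
            x = x_test            # the point survived every stage
        samples.append(x)
    return samples

# conjunction of two Gaussians centered at -1 and +1 (unit widths);
# their product is a Gaussian centered at 0 with variance 1/2
chain = cascaded_metropolis(
    [lambda x: -0.5 * (x + 1.0) ** 2, lambda x: -0.5 * (x - 1.0) ** 2],
    0.0, 2.0, 100_000)
m = sum(chain) / len(chain)
print(round(m, 1))
```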
3.6.0.0.3 We start with an arbitrary random walk (worst situation): In the situation where we are not able to directly define a random walk that samples the homogeneous probability distribution, but only one that samples some arbitrary probability distribution ψ(x), we can write equation 3.7 in the form

h(x) = k ψ(x) (µ(x)/ψ(x)) (f_1(x)/µ(x)) (f_2(x)/µ(x)) · · · .    (3.9)

Then, with respect to the example just mentioned, there is one more extra step to be added, taking into account the values of µ(x)/ψ(x). Note that the closer ψ(x) is to µ(x), the more efficient the first modification of the random walk will be.

3.7 Designing Primeval Walks

What the Metropolis algorithm does is to modify some initial walk, in cascade, to produce a final random walk that samples the target probability distribution. The initial walk, which is designed ab initio, i.e., independently of the Metropolis algorithm (or any similar algorithm), may be called the primeval walk. We shall see below some examples where primeval walks are designed that sample the homogeneous probability distribution µ(x), or directly the probability density f(x) (see equation 3.7). If we do not know how to do this, then we have to resort to using a primeval walk that samples the arbitrary function ψ(x) mentioned above.

Example 3.1 Consider the homogeneous probability density on the 2D surface of a sphere of radius R, µ(ϑ, ϕ) = cos ϑ/(4π), where we use geographical coordinates. This distribution can be sampled by generating a value of ϑ using the probability density (1/2) cos ϑ, and then a value of ϕ using a constant probability density. Alternatively, one could use a purely geometrical approach. [End of example.]

Example 3.2 If instead of the surface of a sphere, we have some spheroid, with spheroidal coordinates {ϑ, ϕ}, the homogeneous probability density will have some expression µ(ϑ, ϕ) that will not be identical to that corresponding to a sphere (see example 3.1).
We may then use the function ψ(ϑ, ϕ) = cos ϑ/(4π), i.e., we may start with the same primeval walk as in example 3.1, using, in the Metropolis rule, the 'corrective step' mentioned in section 3.6, depending on the values µ(ϑ, ϕ)/ψ(ϑ, ϕ). [End of example.]

Example 3.3 If x is a one-dimensional Cartesian quantity, i.e., if the associated homogeneous probability density is constant, then it is trivial to design a random walk that samples it. If x is the 'current point', choose randomly a real number e with an arbitrary probability density that is symmetric around zero, and jump to x + e. The iteration of this rule produces a random walk that samples the homogeneous probability density for a Cartesian parameter, µ(x) = k. [End of example.]

Example 3.4 Consider the homogeneous probability density for a temperature, µ(T) = 1/T, as an example of a Jeffreys parameter. This distribution can be sampled by the following procedure. If T is the 'current point', choose randomly a real number e with an arbitrary probability density that is symmetric around zero, let Q = exp e, and jump⁶ to QT. The iteration of this rule produces⁷ a random walk that samples the homogeneous probability density for a Jeffreys parameter, µ(T) = 1/T. [End of example.]

⁶ Note that if Q > 1, the algorithm 'goes to the right', while if Q < 1, it 'goes to the left'.

Example 3.5 Consider a random walk that, when it is at point x_j, chooses another point x_i with a probability density f(x) = U(x | x_j), satisfying

U(x | y) = U(y | x) .    (3.10)

Then, the random walk samples the constant probability density f(x) = k (see appendix ?? for a proof). [End of example.]

The reader should be warned that although the Metropolis rule would allow the use of a primeval walk sampling a probability density ψ(x) that may be quite different from the homogeneous probability density µ(x), this may be quite inefficient.
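The multiplicative walk of example 3.4 can be sketched as follows (a sketch, not part of the original text; the step scale and the starting temperature are arbitrary choices). In logarithmic coordinates the walk reduces to the Cartesian walk of example 3.3, which is what the check at the end verifies.

```python
import math
import random

def jeffreys_walk(T0, n_steps, scale=0.1, seed=0):
    """Random walk sampling the homogeneous density mu(T) = 1/T of a
    Jeffreys parameter: multiply the current value by Q = exp(e), with
    e drawn from a density symmetric around zero."""
    rng = random.Random(seed)
    T = T0
    path = [T]
    for _ in range(n_steps):
        e = rng.uniform(-scale, scale)   # any symmetric density works
        T *= math.exp(e)                 # Q > 1 moves right, Q < 1 left
        path.append(T)
    return path

path = jeffreys_walk(300.0, 10_000)
# in t = log T the increments are symmetric around zero, as in example 3.3
logs = [math.log(T) for T in path]
steps = [b - a for a, b in zip(logs, logs[1:])]
print(round(sum(steps) / len(steps), 3))
```

Note that the walk can never leave the positive axis, as befits a positive parameter.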
One should not, in general, use the random walk defined in example 3.5 as a general primeval walk.

3.8 Multistep Iterations

An algorithm will converge to a unique equilibrium distribution if the random walk is irreducible. Often, it is convenient to split up an iteration into a number of steps, each having its own transition probability density. A typical example is a random walk in an N-dimensional Euclidean space where we are interested in dividing an iteration of the random walk into N steps, where the n-th move of the random walker is in a direction parallel to the n-th axis. The question is now: if we want to form an iteration consisting of a series of steps, can we give a sufficient condition, to be satisfied by each step, such that the complete iteration has the desired convergence properties?

It is easy to see that if the individual steps in an iteration all have the same probability density p(x) as their equilibrium probability density (not necessarily unique), then the complete iteration also has p(x) as an equilibrium probability density. This follows from the fact that the equilibrium probability density is an eigenfunction with eigenvalue 1 for the integral operators corresponding to each of the step transition probability densities. Then it is also an eigenfunction with eigenvalue 1, and hence an equilibrium probability density, for the integral operator corresponding to the transition probability density for the complete iteration. If this probability density is to be the unique equilibrium probability density for the complete iteration, then the random walk must be irreducible. That is, it must be possible to go from any point to any other point by performing iterations consisting of the specified steps.

If the steps of an iteration satisfy these sufficient conditions, there is also another way of defining an iteration with the desired, unique equilibrium density.
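The series-of-steps construction just described can be sketched numerically (an illustration, not part of the original text; the 2D Gaussian target and the uniform proposals are arbitrary choices): each iteration consists of N axis-parallel Metropolis steps, one per axis.

```python
import math
import random

def sweep_metropolis(log_p, x0, step, n_iters, seed=0):
    """One iteration = N axis-parallel Metropolis steps, one per axis.
    Each step leaves p(x) invariant, so the complete iteration does too;
    together the steps can reach any point, making the walk irreducible."""
    rng = random.Random(seed)
    x = list(x0)
    lp = log_p(x)
    out = []
    for _ in range(n_iters):
        for k in range(len(x)):          # the n-th move is along axis n
            x_test = list(x)
            x_test[k] += rng.uniform(-step, step)
            lp_test = log_p(x_test)
            if math.log(rng.random()) < lp_test - lp:
                x, lp = x_test, lp_test
        out.append(tuple(x))
    return out

# 2D standard Gaussian target, unnormalized
chain = sweep_metropolis(lambda v: -0.5 * (v[0] ** 2 + v[1] ** 2),
                         (0.0, 0.0), 2.5, 100_000)
mx = sum(p[0] for p in chain) / len(chain)
vx = sum(p[0] ** 2 for p in chain) / len(chain)
print(round(mx, 1), round(vx, 1))
```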
Instead of performing an iteration as a series of steps, it is possible to define the iteration as consisting of one of the steps, chosen randomly (with any distribution having nonzero probabilities) among the possible steps. In this case, the transition probability density for the iteration is equal to a linear combination of the transition probability densities for the individual steps. The coefficient of the transition probability density for a given step is the probability that this step is selected. Since the desired probability density is an equilibrium probability density (eigenfunction with eigenvalue 1) for the integral operators corresponding to each of the step transition probability densities, and since the sum of all the coefficients in the linear combination is equal to 1, it is also an equilibrium probability density for the integral operator corresponding to the transition probability density for the complete iteration. This equilibrium probability density is unique, since it is possible, following the given steps, to go from any point to any other point in the space.

⁷ It is easy to see why. Let t = log T/T_0. Then f(T) = 1/T transforms into g(t) = const. This example is then just the 'exponentiated version' of example 3.3.

Of course, a step of an iteration can, in the same way, be built from substeps, and in this way acquire the same (not necessarily unique) equilibrium probability density as the substeps.

3.9 Choosing Random Directions and Step Lengths

A random walk is an iterative process where, from some 'current point', we may jump to a neighboring point. We must decide two things: the direction of the jump and its step length. Let us examine the two problems in turn.

3.9.1 Choosing Random Directions

When the number of dimensions is small, a 'direction' in a space is something simple. This is not so when we work in large-dimensional spaces.
Consider, for instance, the problem of choosing a direction in a space of functions. Of course, a space where each point is a function is infinite-dimensional, and we work here with finite-dimensional spaces, but we may just assume that we have discretized the functions using a large number of points, say 10 000 or 10 000 000 points. If we are 'at the origin' of the space, i.e., at the point {0, 0, ...} representing a function that is everywhere zero, we may decide to choose a direction pointing towards smooth functions, or fractal functions, Gaussian-like functions, functions having zero mean value, L1 functions, L2 functions, functions having a small number of large jumps, etc. This freedom of choice, typical of large-dimensional problems, has to be carefully analyzed, and it is indispensable to take advantage of it when designing random walks.

Assume that we are able to design a primeval random walk that samples the probability density f(x), and we wish to modify it considering the values g(x)/µ(x), using the Metropolis rule (or any equivalent rule), in order to obtain a random walk that samples

h(x) = k f(x) g(x)/µ(x) .    (3.11)

We can design many primeval random walks that sample f(x). Using the Metropolis modification of a random walk, we will always obtain a random walk that samples h(x). A well designed primeval random walk will 'present' to the Metropolis criterion test points x_test that have a large probability of being accepted (i.e., that have a large value of g(x_test)/µ(x_test)). A poorly designed primeval random walk will present test points with a low probability of being accepted; then, the algorithm is very slow in producing accepted points. Although a high acceptance probability can always be obtained with very small step lengths (if the probability density to be sampled is smooth), we need to discover directions that give high acceptance ratios even for large step lengths.
3.9.2 Choosing Step Lengths

Numerical algorithms are usually forced to compromise between conflicting wishes. For instance, a gradient-based minimization algorithm has to select a finite step length along the direction of steepest descent. The larger the step length, the smaller may be the number of iterations required to reach the minimum, but if the step length is chosen too large, we may lose efficiency; we can even increase the value of the target function, instead of diminishing it.

The random walks contemplated here face exactly the same situation. The direction of the move is not deterministically calculated, but is chosen randomly, with the common-sense constraint discussed in the previous section. But once a direction has been decided, the size of the jump along this direction, which has to be submitted to the Metropolis criterion, has to be 'as large as possible', but not too large. Again, the 'Metropolis theorem' guarantees that the final random walk will sample the target probability distribution, but the better we are at choosing the step length, the more efficient the algorithm will be. In practice, a neighborhood size giving an acceptance rate of 30%-60% (for the final, posterior sampler) can be recommended.

3.10 Appendixes

3.10.1 Random Walk Design

The design of a random walk that equilibrates at a desired distribution p(x) can be formulated as the design of an equilibrium flow having a throughput of p(x_i) dx_i particles in the neighborhood of point x_i. The simplest equilibrium flows are symmetric, that is, they satisfy

F(x_i, x_j) = F(x_j, x_i) .    (3.12)

That is, the transition x_i ← x_j is as likely as the transition x_i → x_j. It is easy to define a symmetric flow, but it will in general not have the required throughput of p(x_j) dx_j particles in the neighborhood of point x_j.
This requirement can be satisfied if the following adjustment of the flow density is made: first multiply F(x_i, x_j) by a positive constant c. This constant must be small enough to ensure that the throughput of the resulting flow density cF(x_i, x_j) at every point x_j is smaller than the desired probability p(x_j) dx_j of its neighborhood. Finally, at every point x_j, add a flow density F(x_j, x_j), going from the point to itself, such that the throughput at x_j gets the right size p(x_j) dx_j. Neither the flow scaling nor the addition of F(x_j, x_j) will destroy the equilibrium property of the flow. In practice, it is unnecessary to add a flow density F(x_j, x_j) explicitly, since it is implicit in our algorithms that if no move away from the current point takes place, the move goes from the current point to itself. This rule automatically adjusts the throughput at x_j to the right size p(x_j) dx_j.

3.10.2 The Metropolis Algorithm

Characteristic of a random walk is that the probability of going to a point x_i in the space X in a given step (iteration) depends only on the point x_j it came from. We will define the conditional probability density P(x_i | x_j) of the location of the next destination x_i of the random walker, given that it currently is at the neighbouring point x_j. P(x_i | x_j) is called the transition probability density. As, at each step, the random walker must go somewhere (including the possibility of staying at the same point), we have

∫_X P(x_i | x_j) dx_i = 1 .    (3.13)

For convenience we shall assume that P(x_i | x_j) is nonzero everywhere (but typically negligibly small everywhere, except in a certain neighborhood around x_j). For this reason, staying in an infinitesimal neighborhood of the current point x_j has nonzero probability, and therefore is considered a 'transition' (from the point x_j to itself). The current point, having been reselected, then contributes with one more sample.
Consider a random walk defined by the transition probability density P(x_i | x_j), and assume that the point where the random walk is initiated is only known probabilistically: there is a probability density q(x) that the random walk is initiated at point x. Then, when the number of steps tends to infinity, the probability density that the random walker is at point x will 'equilibrate' at some other probability density p(x). It is said that p(x) is an equilibrium probability density of P(x_i | x_j). Then, p(x) is an eigenfunction with eigenvalue 1 of the linear integral operator with kernel P(x_i | x_j):

∫_X P(x_i | x_j) p(x_j) dx_j = p(x_i) .    (3.14)

If for any initial probability density q(x) the random walk equilibrates to the same probability density p(x), then p(x) is called the equilibrium probability density of P(x_i | x_j). Then, p(x) is the unique eigenfunction with eigenvalue 1 of the integral operator. If it is possible for the random walk to go from any point to any other point in X, it is said that the random walk is irreducible. Then, there is only one equilibrium probability density (Note: Find appropriate reference...).

Given a probability density p(x), many random walks can be defined that have p(x) as their equilibrium density. Some tend more rapidly to the final probability density than others. Samples x^(1), x^(2), x^(3), ... obtained by a random walk where P(x_i | x_j) is negligibly small everywhere, except in a certain neighborhood around x_j, will, of course, not be independent unless we only consider points separated by a sufficient number of steps.

Instead of considering p(x) to be the probability density of the position of a (single) random walker (in which case ∫_X p(x) dx = 1), we can consider a situation where we have a 'density p(x) of random walkers' at point x. Then, ∫_X p(x) dx represents the total number of random walkers. None of the results presented below will depend on the way p(x) is normalized.
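A discrete analogue of equations 3.13 and 3.14 may clarify the statement (a sketch, not part of the original text; the 3-state transition matrix is chosen arbitrarily): iterating the walk from any initial probability converges to the eigenvector with eigenvalue 1, which is unique because every transition has nonzero probability (the chain is irreducible).

```python
# Transition matrix: P[i][j] is the probability of the transition i <- j;
# each column sums to 1, the discrete version of equation 3.13.
P = [[0.5, 0.2, 0.3],
     [0.3, 0.6, 0.4],
     [0.2, 0.2, 0.3]]

q = [1.0, 0.0, 0.0]      # arbitrary initial probability density q
for _ in range(200):     # iterate the walk: q <- P q
    q = [sum(P[i][j] * q[j] for j in range(3)) for i in range(3)]

p = q                    # the equilibrated probability
# equilibrium check, discrete version of equation 3.14: P p = p
Pp = [sum(P[i][j] * p[j] for j in range(3)) for i in range(3)]
print([round(a, 6) for a in p])
```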
If at some moment the density of random walkers at a point x_j is p(x_j), and the transition probability density is P(x_i | x_j), then

F(x_i, x_j) = P(x_i | x_j) p(x_j)    (3.15)

represents the probability density of transitions from x_j to x_i: while P(x_i | x_j) is the conditional probability density of the next point x_i visited by the random walker, given that it currently is at x_j, F(x_i, x_j) is the unconditional probability density that the next step will be a transition from x_j to x_i, given only the probability density p(x_j). When p(x_j) is interpreted as the density of random walkers at a point x_j, F(x_i, x_j) is called the flow density, as F(x_i, x_j) dx_i dx_j can be interpreted as the number of particles going to a neighborhood of volume dx_i around point x_i from a neighborhood of volume dx_j around point x_j in a given step.

The flow corresponding to an equilibrated random walk has the property that the particle density p(x_i) at point x_i is constant in time. Thus, that a random walk has equilibrated at a distribution p(x) means that, in each step, the total flow into an infinitesimal neighborhood of a given point is equal to the total flow out of this neighborhood. Since each of the particles in a neighborhood around point x_i must move in each step (possibly to the neighborhood itself), the flow has the property that the total flow out from the neighborhood, and hence the total flow into the neighborhood, must equal p(x_i) dx_i :

∫_X F(x_i, x_j) dx_j = ∫_X F(x_k, x_i) dx_k = p(x_i) .    (3.16)

Consider a random walk with transition probability density P(x_i | x_j), equilibrium probability density p(x), and equilibrium flow density F(x_i, x_j). We can multiply F(x_i, x_j) by any symmetric flow density ψ(x_i, x_j), where ψ(x_i, x_j) ≤ q(x_j) for all x_i and x_j, and the resulting flow density

ϕ(x_i, x_j) = F(x_i, x_j) ψ(x_i, x_j)    (3.17)

will also be symmetric, and hence an equilibrium flow density.
A 'modified' algorithm with flow density ϕ(x_i, x_j) and equilibrium probability density r(x_j) is obtained by dividing ϕ(x_i, x_j) by the product probability density r(x_j) = p(x_j) q(x_j). This gives the transition probability density

P(x_i | x_j)_modified = F(x_i, x_j) ψ(x_i, x_j) / (p(x_j) q(x_j)) = P(x_i | x_j) ψ(x_i, x_j)/q(x_j) ,

which is the product of the original transition probability density and a new probability, the acceptance probability

P_ij^acc = ψ(x_i, x_j)/q(x_j) .    (3.18)

If we choose to multiply F(x_i, x_j) by the symmetric flow density

ψ_ij = Min(q(x_i), q(x_j)) ,    (3.19)

we obtain the Metropolis acceptance probability

P_ij^metrop = Min(1, q(x_i)/q(x_j)) ,    (3.20)

which is one for q(x_i) ≥ q(x_j), and equals q(x_i)/q(x_j) when q(x_i) < q(x_j).

The efficiency of an acceptance rule can be defined as the sum of acceptance probabilities for all possible transitions. The acceptance rule with maximum efficiency is obtained by simultaneously maximizing ψ(x_i, x_j) for all pairs of points x_j and x_i. Since the only constraint on ψ(x_i, x_j) (except for positivity) is that ψ(x_i, x_j) is symmetric and ψ(x_k, x_l) ≤ q(x_l) for all k and l, we have ψ(x_i, x_j) ≤ q(x_j) and ψ(x_i, x_j) ≤ q(x_i). This means that the acceptance rule with maximum efficiency is the Metropolis rule, where

ψ_ij = Min(q(x_i), q(x_j)) .    (3.21)

3.10.3 Appendix: Sampling Explicitly Given Probability Densities

Three methods for sampling explicitly known probability densities are important, and they are given by the following three theorems (formulated for a probability density over a 1-dimensional space):

Theorem 1. Let p be an everywhere nonzero probability density with distribution function P, given by

P(x) = ∫_{−∞}^{x} p(s) ds ,    (3.22)

and let r be a random number chosen uniformly at random between 0 and 1. Then the random number x generated through the formula

x = P^{−1}(r)    (3.23)

has probability density p.

Theorem 2.
Let p be a nonzero probability density defined on an interval I = [a, b] for which there exists a positive number M such that

p(x) ≤ M ,    (3.24)

and let r and u be two random numbers chosen uniformly at random from the intervals [0, 1] and I, respectively. If u survives the test

r ≤ p(u)/M ,    (3.25)

it is a sample of the probability density p.

More specialized, yet useful, is the following way of generating Gaussian random numbers:

Theorem 3. Let r_1 and r_2 be random numbers chosen uniformly at random between 0 and 1. Then the random numbers x_1 and x_2 generated through the formulas

x_1 = √(−2 ln r_2) cos(2π r_1) ;  x_2 = √(−2 ln r_2) sin(2π r_1)

are independent and Gaussian distributed with zero mean and unit variance.

These theorems are straightforward to use in practice. The proofs are left to the reader as an exercise.

Chapter 4

Homogeneous Probability Distributions

4.1 Parameters

To describe a physical system (a planet, an elastic sample, etc.) we use physical quantities (temperature and mass density at some given points, total mass, surface color, etc.). We examine here the situation where the total number of physical quantities is finite. The limitation to a finite number of quantities may seem essential to some (in inverse problems, the school of thought developed by Backus and Gilbert) and accessory to others (like the authors of this text). When we consider a function (for instance, a temperature profile as a function of depth), we assume that the function has been discretized in sufficient detail. By 'sufficient' we mean that a limit has practically been attained where the computation of the finite probability of any event becomes practically independent of any further refinement of the discretization of the function¹.

In this section, {x1, x2, ..., xn} represents a set of n physical quantities, for which we will assume to have a probability distribution defined. The quantities {x1, x2, ...
xn} are assumed to take real values (with, generally, some physical dimensions).

Example 4.1 We may consider, for instance, (i) the mass of a particle, (ii) the temperature at the center of the Earth, (iii) the value of the fine-structure constant, etc. [End of example.]

Assuming that we have a set of real quantities excludes the possibility that we may have a quantity that takes only discrete values, like spin ∈ {+1/2, −1/2}, or even a nonnumerical variable, like organism ∈ {plant, animal}. This is not essential, and the formulas given here could easily be generalized to the case where we have both discrete and continuous probabilities. But, as discrete probability distributions have obvious definitions of marginal and conditional probability distributions, we do not wish to review them here. On the contrary, probabilities over continuous manifolds have specific problems (change of variables, limits, etc.) that demand our attention.

¹ A random function is a function that, at each point, is a random variable. A random function is completely characterized if, for whatever choice of n points we may make, we are able to exhibit the joint n-dimensional probability distribution for the n random variables, and this for any value of n. If the considered random function has some degree of smoothness, there is a limit in the value of n such that any finite probability computed using the actual random function is practically identical to the same probability computed from an n-dimensional discretization of the random function. For an excellent introductory text on random functions, see Pugachev (1965).

[Note: Explain here that we shall use the language of 'manifolds'.]
[Note: Explain here that 'space' is used as a synonym of 'manifold'.]

In the following, we will meet two distinct categories of uncertain 'parameters'.
The first category consists of physical quantities whose 'actual values' are not exactly known and cannot be analyzed by generating many realizations of the parameter values in a repetitive experiment. An obvious example of such a parameter is the radius of the Earth's core (say r). If f(r) is a probability density over r, we will never say that r is a 'random variable'; we will rather say that we have a probability density defined over a 'physical quantity'. The second category of parameters are bona fide 'random variables', for which we can obtain histograms through repeated experiments. Such 'random variables' do not play any major role in this text.

Although in mathematical texts there is a difference in notation between a parameter and a particular value of the parameter (for instance, by denoting them X and x respectively), we choose here to simplify the notation and use expressions like 'let x = x_0 be a particular value of the parameter x'.

Note: I have to talk about the commensurability of distances,

ds² = ds_r² + ds_s² ,    (4.1)

every time I have to define the Cartesian product of two spaces, each with its own metric.

4.2 Homogeneous Probability Distributions

In some parameter spaces, there is an obvious definition of distance between points, and therefore of volume. For instance, in the 3D Euclidean space the distance between two points is just the Euclidean distance (which is invariant under translations and rotations). Should we choose to parameterize the position of a point by its Cartesian coordinates {x, y, z}, then the volume element in the space would be

dV(x, y, z) = dx dy dz .    (4.2)

Should we choose to use geographical coordinates, then the volume element would be

dV(r, ϑ, ϕ) = r² cos ϑ dr dϑ dϕ .    (4.3)

Question: what would be, in this parameter space, a homogeneous probability distribution of points?
Answer: a probability distribution assigning to each region of the space a probability proportional to the volume of the region. Then, question: which probability density represents such a homogeneous probability distribution? Let us give the answer in three steps.

• If we use Cartesian coordinates {x, y, z}, as we have dV(x, y, z) = dx dy dz, the probability density representing the homogeneous probability distribution is constant:

f(x, y, z) = k .    (4.4)

• If we use geographical coordinates {r, ϑ, ϕ}, as we have dV(r, ϑ, ϕ) = r² cos ϑ dr dϑ dϕ, the probability density representing the homogeneous probability distribution is (see example 2.3)

g(r, ϑ, ϕ) = k r² cos ϑ .    (4.5)

• Finally, if we use an arbitrary system of coordinates {u, v, w}, in which the volume element of the space is dV(u, v, w) = v(u, v, w) du dv dw, the homogeneous probability distribution is represented by the probability density

h(u, v, w) = k v(u, v, w) .    (4.6)

This is obviously true, since if we calculate the probability of a region A of the space, with volume V(A), we get a number proportional to V(A).

We can arrive at some conclusions from this example that are of general validity. First, the homogeneous probability distribution is represented by a constant probability density only if we use Cartesian (or rectilinear) coordinates. Two other conclusions can be stated as two (equivalent) rules:

Rule 4.1 The probability density representing the homogeneous probability distribution is easily obtained if the expression of the volume element dV(u1, u2, ...) = v(u1, u2, ...) du1 du2 ... of the space is known, as it is then given by h(u1, u2, ...) = k v(u1, u2, ...), where k is a proportionality constant (that may have physical dimensions).

Rule 4.2 If there is a metric g_ij(u1, u2, ...) in the space, then, as mentioned above, the volume element is given by dV(u1, u2, ...) = √(det g(u1, u2, ...)) du1 du2 ...
, i.e., we have v(u1, u2, . . . ) = √(det g(u1, u2, . . . )) . The probability density representing the homogeneous probability distribution is, then, h(u1, u2, . . . ) = k √(det g(u1, u2, . . . )) .

Rule 4.3 If the expression of the probability density representing the homogeneous probability distribution is known in one system of coordinates, then it is known in any other system of coordinates, through the Jacobian rule (equation ??).

Indeed, in the expression above, g(r, ϑ, ϕ) = k r^2 cos ϑ , we recognize the Jacobian between the geographical and the Cartesian coordinates (where the probability density is constant). For short, when we say the homogeneous probability density we mean the probability density representing the homogeneous probability distribution. One should remember that, in general, the homogeneous probability density is not constant.

Let us now examine 'positive parameters', like a temperature, a period, etc. One of the properties of the parameters we have in mind is that they occur in pairs of mutually reciprocal parameters:

Period T = 1/ν ; Frequency ν = 1/T ;
Resistivity ρ = 1/σ ; Conductivity σ = 1/ρ ;
Temperature T = 1/(kβ) ; Thermodynamic parameter β = 1/(kT) ;
Mass density ρ = 1/ℓ ; Lightness ℓ = 1/ρ ;
Compressibility γ = 1/κ ; Bulk modulus (uncompressibility) κ = 1/γ .

When physical theories are elaborated, one may freely choose one of these parameters or its reciprocal. Sometimes these pairs of equivalent parameters come from a definition, as when we define the frequency ν as a function of the period T, by ν = 1/T. Sometimes these parameters arise when analyzing an idealized physical system. For instance, Hooke's law, relating stress σij to strain εij, can be expressed as σij = cijkl εkl , thus introducing the stiffness tensor cijkl , or as εij = dijkl σkl , thus introducing the compliance tensor dijkl , inverse of the stiffness tensor. Then the respective eigenvalues of these two tensors belong to the class of scalars analyzed here.
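Rule 4.3 and the geographical-coordinates example lend themselves to a quick numerical check. The sketch below is mine, not part of the original text (the sample size and the fixed seed are arbitrary choices): it draws points homogeneously in a ball and verifies two signatures of the volume element r^2 cos ϑ dr dϑ dϕ, namely the r^2 factor in the radial distribution and the cos ϑ factor in the latitude distribution.

```python
import math
import random

random.seed(0)

# Draw points homogeneously (uniformly per unit volume) in the ball of
# radius R = 1, by rejection sampling from the enclosing cube in
# Cartesian coordinates, where the homogeneous density is constant.
R = 1.0
points = []
while len(points) < 50_000:
    x = random.uniform(-R, R)
    y = random.uniform(-R, R)
    z = random.uniform(-R, R)
    if x * x + y * y + z * z <= R * R:
        points.append((x, y, z))

# In geographical coordinates the same homogeneous distribution has
# density k r^2 cos(theta): the r^2 factor means the inner half-radius
# ball holds 1/8 of the points, and the cos(theta) factor means that
# sin(theta) = z/r is uniform, so P(|z/r| < 1/2) = 1/2.
radii = [math.sqrt(x * x + y * y + z * z) for x, y, z in points]
inner = sum(1 for r in radii if r < R / 2) / len(points)
flat_sine = sum(1 for (x, y, z), r in zip(points, radii) if abs(z / r) < 0.5) / len(points)
print(inner, flat_sine)  # near 0.125 and 0.5
```

The constant density in Cartesian coordinates and the density k r^2 cos ϑ in geographical coordinates thus describe one and the same homogeneous distribution.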
Let us take, as an example, the pair conductivity-resistivity (this may be thermal, electric, etc.). Assume we have in the laboratory two samples S1 and S2 whose resistivities are respectively ρ1 and ρ2. Correspondingly, their conductivities are σ1 = 1/ρ1 and σ2 = 1/ρ2. How should we define the 'distance' between the two samples? As, in general, |ρ2 − ρ1| ≠ |σ2 − σ1| , choosing one of the two expressions as the 'distance' would be arbitrary. Consider the following definition of 'distance' between the two samples:

D(S1, S2) = | log(ρ2/ρ1) | = | log(σ2/σ1) | . (4.7)

This definition (i) treats the two equivalent parameters ρ and σ symmetrically and, more importantly, (ii) is scale-invariant (what matters is how many 'octaves' we have between the two values, not the plain difference between the values). In fact, it is the only 'sensible' definition of distance between the two samples S1 and S2.

Associated to the distance D(x1, x2) = | log(x2/x1) | is the distance element (differential form of the distance)

dL(x) = dx / x . (4.8)

This being a 'one-dimensional volume', we can now apply rule 4.1 above to obtain the expression of the homogeneous probability density for such a positive parameter:

f(x) = k / x . (4.9)

Defining the reciprocal parameter y = 1/x and using the Jacobian rule, we arrive at the homogeneous probability density for y:

g(y) = k / y . (4.10)

These two probability densities have the same form: the two reciprocal parameters are treated symmetrically. Introducing the logarithmic parameters

x* = log(x/x0) ; y* = log(y/y0) , (4.11)

where x0 and y0 are arbitrary positive constants, and using the Jacobian rule, we arrive at the homogeneous probability densities

f(x*) = k ; g(y*) = k . (4.12)

This shows that the logarithm of a positive parameter (of the type considered above) is a 'Cartesian' parameter.
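As a numerical illustration (my sketch, not the author's; the bounds a, b and the seed are arbitrary choices), sampling x with the homogeneous density k/x and mapping to the reciprocal y = 1/x exhibits the symmetry between equations 4.9 and 4.10: both parameters assign equal probability to equal logarithmic intervals.

```python
import math
import random

random.seed(1)

# Sample x with density f(x) = k/x on [a, b]: equivalent to taking
# x* = log x uniform on [log a, log b] (x* is the 'Cartesian' parameter).
a, b = 0.1, 10.0
xs = [math.exp(random.uniform(math.log(a), math.log(b))) for _ in range(100_000)]

# Under f(x) = k/x, the decades [0.1, 1) and [1, 10) carry equal
# probability; and the reciprocal y = 1/x shares exactly that property.
ys = [1.0 / x for x in xs]
frac_x = sum(1 for x in xs if x < 1.0) / len(xs)
frac_y = sum(1 for y in ys if y < 1.0) / len(ys)
print(frac_x, frac_y)  # both close to 0.5
```

Nothing distinguishes the parameter from its reciprocal here, which is the point of the symmetric 'distance' 4.7.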
In fact, it is the consideration of equations 4.12, together with the Jacobian rule, that allows full understanding of the (homogeneous) probability densities 4.9–4.10. The association of the probability density f(u) = k/u to positive parameters was first made by Jeffreys (1939). To honor him, we propose to use the term Jeffreys parameters for all the parameters of the type considered above. The 1/u probability density was advocated by Jaynes (1968), and a nontrivial use of it was made by Rietsch (1977), in the context of inverse problems.

Rule 4.4 The homogeneous probability density for a Jeffreys quantity u is f(u) = k/u .

Rule 4.5 The homogeneous probability density for a 'Cartesian parameter' u (like the logarithm of a Jeffreys parameter, an actual Cartesian coordinate in a Euclidean space, or the Newtonian time coordinate) is f(u) = k . The homogeneous probability density for an angle describing the position of a point on a circle is also constant.

If a parameter u is a Jeffreys parameter, with the homogeneous probability density f(u) = k/u, then its inverse, its square and, in general, any power of the parameter is also a Jeffreys parameter, as can easily be seen using the Jacobian rule.

Rule 4.6 Any power of a Jeffreys quantity (including its inverse) is a Jeffreys quantity.

It is important to recognize when we do not face a Jeffreys parameter. Among the many parameters used in the literature to describe an isotropic linear elastic medium we find parameters like the Lamé coefficients λ and µ, the bulk modulus κ, the Poisson ratio σ, etc. A simple inspection of the theoretical range of variation of these parameters shows that the first Lamé parameter λ and the Poisson ratio σ may take negative values, so they are certainly not Jeffreys parameters.
In contrast, Hooke's law σij = cijkl εkl, defining a linearity between stress σij and strain εij, defines the positive definite stiffness tensor cijkl or, if we write εij = dijkl σkl, defines its inverse, the compliance tensor dijkl. The two reciprocal tensors cijkl and dijkl are 'Jeffreys tensors'. This is a notion that would take too long to develop here, but we can give the following rule:

Rule 4.7 The eigenvalues of a Jeffreys tensor are Jeffreys quantities2.

As the two (different) eigenvalues of the stiffness tensor cijkl are λκ = 3κ (with multiplicity 1) and λµ = 2µ (with multiplicity 5), we see that the uncompressibility modulus κ and the shear modulus µ are Jeffreys parameters3 (as is any parameter proportional to them, or any power of them, including the inverses). If, for some reason, instead of working with κ and µ we wish to work with other elastic parameters, like for instance the Young modulus Y and the Poisson ratio σ, then the homogeneous probability distribution must be found using the Jacobian of the transformation between (Y, σ) and (κ, µ). This is done in appendix 4.3.2.

Some probability densities have conspicuous 'dispersion parameters', like the σ's in the normal probability density f(x) = k exp(−(x − x0)^2/(2σ^2)), in the lognormal probability density g(X) = (k/X) exp(−(log(X/X0))^2/(2σ^2)), or in the Fisher probability density h(ϑ, ϕ) = k cos ϑ exp(sin ϑ/σ^2). A consistent probability model requires that, when the dispersion parameter σ tends to infinity, the probability density tends to the homogeneous probability distribution. For instance, in the three examples just given, f(x) → k , g(X) → k/X and h(ϑ, ϕ) → k cos ϑ , which are the respective homogeneous probability densities for a Cartesian quantity, a Jeffreys quantity, and the geographical coordinates on the surface of the sphere.
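The σ → ∞ limit can be checked numerically. The sketch below is not from the original text (X0 = 1 and the evaluation points are arbitrary choices): it writes the lognormal with its 1/X factor explicit and verifies that, for large σ, the product X g(X) is nearly constant over several decades, i.e., g(X) approaches the homogeneous density k/X of a Jeffreys quantity.

```python
import math

# Lognormal density with explicit dispersion parameter sigma (X0 = 1,
# k = 1 for simplicity); the 1/X factor is what makes the
# sigma -> infinity limit the homogeneous density k/X.
def lognormal(X, sigma, X0=1.0, k=1.0):
    return (k / X) * math.exp(-(math.log(X / X0)) ** 2 / (2 * sigma ** 2))

# For large sigma, X * g(X) should be nearly constant over several
# decades, i.e. g(X) nearly proportional to 1/X there.
sigma = 50.0
values = [X * lognormal(X, sigma) for X in (0.01, 0.1, 1.0, 10.0, 100.0)]
flatness = min(values) / max(values)
print(flatness)  # approaches 1 as sigma grows
```

With a small σ the same ratio drops well below 1, so the limit, not the lognormal itself, is homogeneous.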
We can state the

Rule 4.8 A probability density is only consistent if it tends to the homogeneous probability density when its dispersion parameters tend to infinity.

For example, using the normal probability density f(x) = k exp(−(x − x0)^2/(2σ^2)) for a Jeffreys parameter is not consistent. Note that it would assign a finite probability to negative values of a parameter that, by definition, is positive. More technically, this would violate our postulate ??.

There is a problem of terminology in the Bayesian literature. The homogeneous probability distribution is a very special distribution. When the problem of selecting a 'prior' probability distribution arises, in the absence of any information except the fundamental symmetries of the problem, one may select as prior probability distribution the homogeneous distribution. But enthusiastic Bayesians do not call it 'homogeneous' but 'noninformative'. We do not agree with this. The homogeneous probability distribution is as informative as any other distribution; it is just the homogeneous one4.

2 This solves the complete problem for isotropic tensors only. It is beyond the scope of this text to propose rules valid for general anisotropic tensors: the necessary mathematics have not yet been developed.

3 The definition of the elastic constants was made before the tensorial structure of the theory was understood. Seismologists, today, should never introduce, at a theoretical level, parameters like the first Lamé coefficient λ or the Poisson ratio. Instead, they should use κ and µ (and their inverses). In fact, our suggestion, in this IASPEI volume, is to use the true eigenvalues of the stiffness tensor, λκ = 3κ and λµ = 2µ, which we propose to call the eigen-bulk-modulus and the eigen-shear-modulus.

In general, each time we consider an abstract parameter space, each point being represented by some parameters x = {x1 , x2 . . .
xn } , we will start by solving the (sometimes nontrivial) problem of defining a distance between points that respects the necessary symmetries of the problem. Only exceptionally will this distance be a quadratic expression of the parameters (coordinates) being used (i.e., only exceptionally will our parameters correspond to 'Cartesian coordinates' in the space). From this distance, a volume element dV(x) = v(x) dx will be deduced, from which the expression f(x) = k v(x) of the homogeneous probability density will follow. We emphasize the need of defining a distance in the parameter space, from which the notion of homogeneity will follow. In this, we slightly depart from the original work by Jeffreys and Jaynes.

4 Note that Shannon's (1948) definition of the information content of a discrete probability, I = Σi pi log pi , does not generalize into a definition of the information content of a probability density (the 'definition' I = ∫ dx f(x) log f(x) is not invariant under a change of variables). Rather, one may define the 'Kullback distance' (Kullback, 1967) from the probability density g(x) to the probability density f(x) as I(f|g) = ∫ dx f(x) log( f(x)/g(x) ) . This means, in particular, that we can never know if a single probability density is, by itself, informative or not. The equation above defines the information gain when we pass from g(x) to f(x) (I is always positive). But there is also an information gain when we pass from f(x) to g(x): I(g|f) = ∫ dx g(x) log( g(x)/f(x) ) . One should note that (i) the 'Kullback distance' is not a distance (the distance from f(x) to g(x) does not equal the distance from g(x) to f(x)); (ii) for the 'Kullback distance' I(f|g) = ∫ dx f(x) log( f(x)/g(x) ) to be defined, the probability density f(x) has to be 'absolutely continuous' with respect to g(x), which amounts to saying that f(x) must vanish wherever g(x) vanishes.
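A small discrete example (mine, with arbitrarily chosen distributions f and g) illustrates the footnote's point: both information gains are positive, but they differ, so the Kullback 'distance' is not a distance.

```python
import math

# Kullback 'distance' from g to f, I(f|g) = sum_i f_i log(f_i / g_i),
# for discrete distributions (the continuous version replaces the sum
# by an integral).
def kullback(f, g):
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g) if fi > 0)

f = [0.7, 0.2, 0.1]          # an arbitrary distribution
g = [1 / 3, 1 / 3, 1 / 3]    # the uniform distribution on three outcomes

I_fg = kullback(f, g)
I_gf = kullback(g, f)
print(I_fg, I_gf)  # both positive, but unequal
```

Note also that I(f|g) is finite here only because f puts no mass where g vanishes, the discrete counterpart of the absolute-continuity requirement.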
We have postulated that any probability density f(x) is absolutely continuous with respect to the homogeneous probability distribution µ(x), for the homogeneous probability distribution 'fills the space'. Then, one may take the convention of measuring the information content of any probability density f(x) with respect to the homogeneous probability density:

I(f) ≡ I(f|µ) = ∫ dx f(x) log( f(x)/µ(x) ) .

The homogeneous probability density is then 'noninformative', I(µ) = I(µ|µ) = 0, but this is just by definition.

4.3 Appendixes

4.3.1 Appendix: First Digit of the Fundamental Physical Constants

Note: mention here figure 4.1, and explain. Say that the negative numbers of the table are 'false negatives'. Figure 4.3: statistics of surfaces and populations of States and Islands.

Figure 4.1: Statistics of the first digit in the table of Fundamental Physical Constants (1998 CODATA least-squares adjustment; Mohr and Taylor, 2001). I have indiscriminately taken all the constants of the table (263 in total). The 'model' corresponds to the prediction that the relative frequency of digit n in a base-K system of numeration is logK((n + 1)/n). Here, K = 10. [Bar chart: actual frequencies versus model, for digits 1 to 9.]

Figure 4.2: The beginning of the list of the States, Territories and Principal Islands of the World, in the Times Atlas of the World (Times Books, 1983), with the first digit of the surfaces and populations highlighted. The statistics of this first digit is shown in figure 4.3. [Excerpt of the list, from Abu Dhabi to Angola, with columns: name [plate], description, surface in sq. km and sq. miles, population, and census year.]

Figure 4.3: Statistics of the first digit in the table of the surfaces (both in square kilometers and square miles) and populations of the States, Territories and Principal Islands of the World, as printed in the first few pages of the Times Atlas of the World (Times Books, 1983). As for figure 4.1, the 'model' corresponds to the prediction that the relative frequency of digit n is log10((n + 1)/n). [Bar chart: actual frequencies versus model, for digits 1 to 9.]

4.3.2 Appendix: Homogeneous Probability for Elastic Parameters

In this appendix, we start from the assumption that the uncompressibility modulus and the shear modulus are Jeffreys parameters (they are the eigenvalues of the stiffness tensor cijkl), and find the expression of the homogeneous probability density for other sets of elastic parameters, like the set {Young modulus, Poisson ratio} or the set {longitudinal wave velocity, transverse wave velocity}.

4.3.2.1 Uncompressibility Modulus and Shear Modulus

The 'Cartesian parameters' of elastic theory are the logarithm of the uncompressibility modulus and the logarithm of the shear modulus,

κ* = log(κ/κ0) ; µ* = log(µ/µ0) , (4.13)

where κ0 and µ0 are two arbitrary constants. The homogeneous probability density is just constant for these parameters (a constant that we set arbitrarily to one):

fκ*µ* (κ*, µ*) = 1 . (4.14)

As is often the case for homogeneous 'probability' densities, fκ*µ* (κ*, µ*) is not normalizable.
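Before transforming the density 4.14 with the Jacobian rule, here is a numerical sketch (mine, not part of the original text; the three-decade bounds and the seed are arbitrary choices) of what constancy in (κ*, µ*) means for the positive parameters themselves: every decade-by-decade cell in the (κ, µ) plane carries the same probability, which is the signature of a density proportional to 1/(κµ).

```python
import math
import random

random.seed(4)

# Sample the 'Cartesian' elastic parameters kappa* = log kappa and
# mu* = log mu uniformly over three decades (hypothetical bounds,
# for illustration only), i.e. kappa, mu in [1, 1000).
N = 100_000
pairs = [(math.exp(random.uniform(0.0, 3 * math.log(10))),
          math.exp(random.uniform(0.0, 3 * math.log(10)))) for _ in range(N)]

# A density proportional to 1/(kappa mu) means every
# (decade of kappa) x (decade of mu) cell carries the same probability;
# with 3 x 3 cells, each should hold about 1/9 of the samples.
def in_decade(x, lo):  # the decade [lo, 10 * lo)
    return lo <= x < 10 * lo

cell_a = sum(1 for k, m in pairs if in_decade(k, 1.0) and in_decade(m, 1.0)) / N
cell_b = sum(1 for k, m in pairs if in_decade(k, 100.0) and in_decade(m, 10.0)) / N
print(cell_a, cell_b)  # both near 1/9
```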
Using the Jacobian rule, it is easy to transform this probability density into the equivalent one for the positive parameters themselves:

fκµ (κ, µ) = 1/(κ µ) . (4.15)

This 1/x form of the probability density remains invariant if we take any power of κ and of µ. In particular, if instead of the uncompressibility κ we use the compressibility γ = 1/κ, the Jacobian rule simply gives fγµ (γ, µ) = 1/(γ µ).

Associated to the probability density 4.14 there is the Euclidean definition of distance

ds^2 = (dκ*)^2 + (dµ*)^2 , (4.16)

which corresponds, in the variables (κ, µ), to

ds^2 = (dκ/κ)^2 + (dµ/µ)^2 , (4.17)

i.e., to the metric

( gκκ gκµ ; gµκ gµµ ) = ( 1/κ^2 , 0 ; 0 , 1/µ^2 ) . (4.18)

4.3.2.2 Young Modulus and Poisson Ratio

The Young modulus Y and the Poisson ratio σ can be expressed as functions of the uncompressibility modulus and the shear modulus as

Y = 9κµ/(3κ + µ) ; σ = (1/2) (3κ − 2µ)/(3κ + µ) , (4.19)

or, reciprocally,

κ = Y/(3(1 − 2σ)) ; µ = Y/(2(1 + σ)) . (4.20)

The absolute value of the Jacobian of the transformation is easily computed,

J = Y / ( 2 (1 + σ)^2 (1 − 2σ)^2 ) , (4.21)

and the Jacobian rule transforms the probability density 4.15 into

fYσ (Y, σ) = (1/(κµ)) J = 3 / ( Y (1 + σ)(1 − 2σ) ) , (4.22)

which is the probability density representing the homogeneous probability distribution for elastic parameters using the variables (Y, σ). This probability density is the product of the probability density 1/Y for the Young modulus and the probability density

g(σ) = 3 / ( (1 + σ)(1 − 2σ) ) (4.23)

for the Poisson ratio. This probability density is represented in figure 4.4. From the definition of σ it can be demonstrated that its values must range in the interval −1 < σ < 1/2, and we see that the homogeneous probability density is singular at these points. Although most rocks have positive values of the Poisson ratio, there are materials where σ is negative (e.g., Yeganeh-Haeri et al., 1992).
Figure 4.4: The homogeneous probability density for the Poisson ratio, as deduced from the condition that the uncompressibility and the shear modulus are Jeffreys parameters. [Plot of g(σ) on −1 < σ < +1/2, singular at both ends of the interval.]

It may be surprising that the probability density in figure 4.4 corresponds to a homogeneous distribution. But if we have many samples of elastic materials, and if their logarithmic uncompressibility modulus κ* and their logarithmic shear modulus µ* have a constant probability density (which is the definition of a homogeneous distribution of elastic materials), then σ will be distributed according to the g(σ) of the figure.

To be complete, let us mention that in a change of variables x^i → x^I, a metric gij changes to

gIJ = ΛI^i ΛJ^j gij = (∂x^i/∂x^I) (∂x^j/∂x^J) gij . (4.24)

The metric 4.17 then transforms into

( gYY gYσ ; gσY gσσ ) = ( 2/Y^2 , 2/((1−2σ)Y) − 1/((1+σ)Y) ; 2/((1−2σ)Y) − 1/((1+σ)Y) , 4/(1−2σ)^2 + 1/(1+σ)^2 ) . (4.25)

The surface element is

dSYσ (Y, σ) = √(det g) dY dσ = 3 dY dσ / ( Y (1 + σ)(1 − 2σ) ) , (4.26)

a result from which expression 4.22 can be inferred.

Although the Poisson ratio has a historical interest, it is not a simple parameter, as shown by its theoretical bounds −1 < σ < 1/2, or by the form of the homogeneous probability density (figure 4.4). In fact, the Poisson ratio σ depends only on the ratio κ/µ (uncompressibility modulus over shear modulus), as we have

(1 + σ)/(1 − 2σ) = 3κ/(2µ) . (4.27)

The ratio J = κ/µ of two Jeffreys parameters being itself a Jeffreys parameter, a useful pair of Jeffreys parameters may be {κ, J}. The ratio J = κ/µ has a physical interpretation easy to grasp (the ratio between the uncompressibility and the shear modulus), and should be preferred, in theoretical developments, to the Poisson ratio, as it has simpler theoretical properties. As the name of the metro station nearest to the university of one of the authors (A.T.) is Jussieu, we accordingly call J the Jussieu ratio.
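The statement above is easy to verify by simulation. In this sketch (mine, not the author's; the log-range of the moduli and the seed are arbitrary choices), κ* and µ* are drawn with constant density, and the resulting Poisson ratios crowd toward the singular endpoints −1 and 1/2, as the density g(σ) of figure 4.4 predicts.

```python
import math
import random

random.seed(3)

# Sample kappa* = log kappa and mu* = log mu uniformly (the homogeneous
# distribution of elastic media), then map each pair to the Poisson
# ratio sigma = (3 kappa - 2 mu) / (2 (3 kappa + mu)).
N = 200_000
sigmas = []
for _ in range(N):
    kappa = math.exp(random.uniform(-6.0, 6.0))
    mu = math.exp(random.uniform(-6.0, 6.0))
    sigmas.append((3 * kappa - 2 * mu) / (2 * (3 * kappa + mu)))

# All values fall in the theoretical range -1 < sigma < 1/2, and the
# histogram piles up near the two ends, as
# g(sigma) = k / ((1 + sigma)(1 - 2 sigma)) predicts.
assert all(-1.0 < s < 0.5 for s in sigmas)
near_ends = sum(1 for s in sigmas if s < -0.8 or s > 0.4) / N
middle = sum(1 for s in sigmas if -0.2 < s < 0.1) / N
print(near_ends > middle)  # the ends are far more probable than the middle
```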
4.3.2.3 Longitudinal and Transverse Wave Velocities

Equation 4.15 gives the probability density representing the homogeneous probability distribution of elastic media, when parameterized by the uncompressibility modulus and the shear modulus:

fκµ (κ, µ) = 1/(κ µ) . (4.28)

Should we also be interested in the mass density ρ, we would arrive (as ρ is another Jeffreys parameter) at the probability density

fκµρ (κ, µ, ρ) = 1/(κ µ ρ) . (4.29)

This is the starting point for this section. What is the probability density representing the homogeneous probability distribution of elastic materials when we use as parameters the mass density and the two wave velocities? The longitudinal wave velocity α and the shear wave velocity β are related to the uncompressibility modulus κ and the shear modulus µ through

α = √( (κ + 4µ/3)/ρ ) ; β = √( µ/ρ ) , (4.30)

and a direct use of the Jacobian rule transforms the probability density 4.29 into

fαβρ (α, β, ρ) = 1 / ( ρ α β^3 (3/β^2 − 4/α^2) ) = α / ( ρ β (3α^2 − 4β^2) ) , (4.31)

which is the answer to our question. That this function becomes singular for α = (2/√3) β is just due to the fact that the 'boundary' α = (2/√3) β cannot be crossed: the fundamental inequalities κ > 0 and µ > 0 impose that the two velocities are linked by the inequality constraint

α > (2/√3) β . (4.32)

Let us focus for a moment on the homogeneous probability density for the two wave velocities (α, β) of an elastic solid (disregarding here the mass density ρ). We have

fαβ (α, β) = 1 / ( α β^3 (3/β^2 − 4/α^2) ) = α / ( β (3α^2 − 4β^2) ) . (4.33)

It is displayed in figure 4.5.

Figure 4.5: The joint homogeneous probability density for the velocities (α, β) of the longitudinal and transverse waves propagating in an elastic solid. Contrary to the uncompressibility and the shear modulus, which are independent parameters, the longitudinal wave velocity and the transverse wave velocity are not independent (see text for an explanation).
The scales for the velocities are unimportant: it is possible to multiply the two velocity scales by any factor without modifying the form of the probability (which is itself defined up to a multiplicative constant).

Let us demonstrate that the marginal probability density for both α and β is of the form 1/x. We have to compute

fα (α) = ∫ from 0 to √3 α/2 of dβ f(α, β) (4.34)

and

fβ (β) = ∫ from 2β/√3 to +∞ of dα f(α, β) (4.35)

(the bounds of integration can easily be understood by a look at figure 4.5). These integrals can be evaluated as

fα (α) = lim ε→0 ∫ from √ε √3 α/2 to √(1−ε) √3 α/2 of dβ f(α, β) = lim ε→0 [ (1/3) log((1−ε)/ε) ] (1/α) (4.36)

and

fβ (β) = lim ε→0 ∫ from (1+ε) 2β/√3 to 2β/(√3 √ε) of dα f(α, β) = lim ε→0 [ (1/6) log((1−ε)/(2ε^2)) ] (1/β) . (4.37)

The numerical factors tend to infinity, but this is only one more manifestation of the fact that homogeneous probability densities are usually improper (not normalizable). Dropping these factors gives

fα (α) = 1/α (4.38)

and

fβ (β) = 1/β . (4.39)

It is interesting to note that we have here an example of two parameters that look like Jeffreys parameters but are not, because they are not independent (the homogeneous joint probability density is not the product of the homogeneous marginal probability densities). It is also worth knowing that using slownesses instead of velocities (n = 1/α, η = 1/β) leads, as one would expect, to

fnηρ (n, η, ρ) = 1 / ( ρ n^3 η (3/n^2 − 4/η^2) ) = η / ( ρ n (3η^2 − 4n^2) ) . (4.40)

4.3.3 Appendix: Homogeneous Distribution of Second Rank Tensors

The usual definition of the norm of a tensor provides the only natural definition of distance in the space of all possible tensors. This shows that, when using a Cartesian system of coordinates, the components of a tensor are the 'Cartesian coordinates' in the 6D space of symmetric tensors. The homogeneous distribution is then represented by a constant (nonnormalizable) probability density: f(σxx, σyy, σzz, σxy, σyz, σzx) = k .
(4.41)

Instead of using the components, we may use the three eigenvalues {λ1, λ2, λ3} of the tensor and the three Euler angles {ψ, θ, ϕ} defining the orientation of the eigendirections in space. As the Jacobian of the transformation

{σxx, σyy, σzz, σxy, σyz, σzx} → {λ1, λ2, λ3, ψ, θ, ϕ} (4.42)

is

∂(σxx, σyy, σzz, σxy, σyz, σzx) / ∂(λ1, λ2, λ3, ψ, θ, ϕ) = (λ1 − λ2)(λ2 − λ3)(λ3 − λ1) sin θ , (4.43)

the homogeneous probability density 4.41 transforms into

g(λ1, λ2, λ3, ψ, θ, ϕ) = k (λ1 − λ2)(λ2 − λ3)(λ3 − λ1) sin θ . (4.44)

Although this is not obvious, this probability density is isotropic in spatial directions (i.e., the 3D referentials defined by the three Euler angles are isotropically distributed). In this sense, we recover 'isotropy' as a special case of 'homogeneity'.

Rule 4.8, imposing that any probability density on the variables {λ1, λ2, λ3, ψ, θ, ϕ} has to tend to the homogeneous probability density 4.44 when the 'dispersion parameters' tend to infinity, puts a strong constraint on the form of acceptable probability densities, a constraint that is generally overlooked. For instance, a Gaussian model for the variables {σxx, σyy, σzz, σxy, σyz, σzx} is consistent (as the limit of a Gaussian is a constant). This induces, via the Jacobian rule, a probability density for the variables {λ1, λ2, λ3, ψ, θ, ϕ}, a probability density that is not simple, but consistent. A Gaussian model for the parameters {λ1, λ2, λ3, ψ, θ, ϕ} would not be consistent.

Chapter 5

Basic Measurements

Note: Complete and expand what follows: I take here a probabilistic point of view. The axioms of probability theory apply to different situations. One is the traditional statistical analysis of random phenomena; another is the description of (more or less) subjective states of information on a system.
For instance, the estimation of the uncertainties attached to any measurement usually involves both uses of probability theory: some of the uncertainties contributing to the total uncertainty are estimated using statistics, while others are estimated using informed scientific judgement about the quality of an instrument, about effects not explicitly taken into account, etc. The International Organization for Standardization (ISO), in its Guide to the Expression of Uncertainty in Measurement (1993), recommends that the uncertainties evaluated by statistical methods be named 'type A' uncertainties, and those evaluated by other means (for instance, using Bayesian arguments) be named 'type B' uncertainties. It also recommends that former classifications, for instance into 'random' and 'systematic' uncertainties, be avoided. In the present text, we accept ISO's basic point of view and extend it, by underplaying the role assigned by ISO to the particular Gaussian model for uncertainties (see section 5.8) and by not assuming that the uncertainties are 'small'.

5.1 Terminology

Note: Introduce here the ISO terminology for analyzing uncertainties in measurements. Note: say that we are interested in volumetric probabilities, not 'uncertainties'. Note: For the time being, this section is written in telegraphic style. It will, obviously, be rewritten.

Measurand: Particular quantity subject to measurement. It is the input to the measuring instrument. The input may be a length; the output may be an electric tension. They need not be the same physical quantity. For instance, the input of a seismometer is a displacement, the output is a voltage. At a given time, the voltage is a convolution of the past input with a transfer function.
5.2 Old text: Measuring physical parameters

To define the experimental procedure that will lead to a 'measurement' we need to conceptualize the objects of the 'universe': do we have point particles or a continuous medium? Any instrument that we can build will have finite accuracy, as any manufacture is imperfect. Also, during the act of measurement, the instrument will always be submitted to unwanted solicitations (like uncontrolled vibrations). This is why, even if the experimenter postulates the existence of a well defined 'true value' of the measured parameter, she/he will never be able to measure it exactly. Careful modeling of experimental uncertainties is not easy. Sometimes, the result of a measurement of a parameter p is presented as p = p0 ± σ , where the interpretation of σ may be diverse. For instance, the experimenter may imagine a bell-shaped probability density around p0 representing her/his state of information 'on the true value of the parameter'. The constant σ can be the standard deviation (or the mean deviation, or some other estimator of dispersion) of the probability density used to model the experimental uncertainty. In part, the shape of this probability density may come from histograms of observed or expected fluctuations. In part, it will come from a subjective estimation of the defects of the unique pieces of the instrument. We postulate here that the result of any measurement can, in all generality, be described by defining a probability density over the measured parameter, representing the information brought by the experiment on the 'true', unknowable, value of the parameter.
The official guidelines for expressing uncertainty in measurement, as given by the International Organization for Standardization (ISO) and the National Institute of Standards and Technology1, although stressing the special notion of standard deviation, are consistent with the possible use of general probability distributions to express the result of a measurement, as advocated here. Not every shape of the density function is acceptable, though. For instance, the use of a Gaussian density to represent the result of a measurement of a positive quantity (like an electric resistivity) would give a finite probability to negative values of the variable, which is inconsistent (a lognormal probability density, on the contrary, could be acceptable). In the event of an 'infinitely bad measurement' (as when an unexpected event prevents, in fact, any meaningful measurement) the result of the measurement should be described using the null information probability density introduced above. In fact, when the density function used to represent the result of a measurement has a parameter σ describing the 'width' of the function, it is the limit of the density function for σ → ∞ that should represent a measurement of infinitely bad quality. This is consistent, for instance, with the use of a lognormal probability density for a parameter like an electric resistivity r , as the limit of the lognormal for σ → ∞ is the 1/r function, which is the right choice of noninformative probability density for r . Another possible probability density to represent the result of a measurement of a parameter p is the noninformative probability density for p1 < p < p2 and zero outside. This fixes strict bounds on the possible values of the parameter, and tends to the noninformative probability density when the bounds tend to infinity.
The point of view proposed here will be consistent with the use of "theoretical parameter correlations" as proposed in section ??, so that there is no difference, from our point of view, between a "simple measurement" and a measurement using physical theories, including, perhaps, sophisticated inverse methods.

1 Guide to the expression of uncertainty in measurement, International Organization for Standardization (ISO), Switzerland, 1993. B.N. Taylor and C.E. Kuyatt, 1994, Guidelines for evaluating and expressing the uncertainty of NIST measurement results, NIST technical note 1297.

5.3 From ISO

The International Organization for Standardization (ISO) has published (ISO, 1993) a "Guide to the expression of uncertainty in measurement", which is the result of a joint work with the BIPM2, the IEC3 and the OIML4. The recommendations of the Guide have also been adopted by the U.S. National Institute of Standards and Technology (Taylor and Kuyatt, 1994). These recommendations have the advantage of being widely accepted (in addition to being legal). It is therefore important to see to what extent the approach proposed in this book to describe the result of a measurement is consistent with that proposed by ISO.

2 Bureau International des Poids et Mesures.
3 International Electrotechnical Commission.
4 International Organization of Legal Metrology.

5.3.1 Proposed vocabulary to be used in metrology

In the definitions that follow, the use of parentheses around certain words of some terms means that the words may be omitted if this is unlikely to cause confusion.

5.3.1.1 (measurable) quantity: attribute of a phenomenon, body or substance that may be distinguished qualitatively and determined quantitatively.

5.3.1.2 value (of a quantity): magnitude of a particular quantity, generally expressed as a unit of measurement multiplied by a number.

5.3.1.3 true value (of a quantity): definition not reproduced here.
Comments from the ISO guide: The term "true value of a measurand" or of a quantity (often truncated to "true value") is avoided in this guide because the word "true" is viewed as redundant. "Measurand" means "particular quantity subject to measurement", hence "value of a measurand" means "value of a particular quantity subject to measurement". Since "particular quantity" is generally understood to mean a definite or specified quantity, the adjective "true" in "true value of a measurand" (or in "true value of a quantity") is unnecessary — the "true" value of the measurand (or quantity) is simply the value of the measurand (or quantity). In addition, as indicated in the discussion above, a unique "true" value is only an idealized concept.

My comments: I have not reproduced the definition of the term "true value" because i) I do not understand it, and ii) it does not seem consistent with the comment above (which I understand perfectly).

5.3.1.4 measurement: set of operations having the object of determining a value of a quantity.

²Bureau International des Poids et Mesures
³International Electrotechnical Commission
⁴International Organization of Legal Metrology

My comments: I do not agree. The object of a measurement is not to determine "a value" of a quantity, but, rather, to obtain a "state of information" on the (true) value of a quantity. The proposed definition is acceptable only in the particular case when the information obtained in the measurement can be represented by a probability density that, being practically monomodal, can be well described by a central estimator (the "determined value" of the quantity) and an estimator of dispersion (the "uncertainty" of the measurement).

5.3.1.5 measurand: particular quantity subject to measurement.

Comments from the ISO guide: The specification of a measurand may require statements about quantities such as time, temperature and pressure.
5.3.1.6 influence quantity: quantity that is not the measurand but that affects the result of the measurement.

5.3.1.7 result of a measurement: value attributed to a measurand, obtained by measurement.

My comments: see comments in "measurement".

5.3.1.8 uncertainty (of measurement): parameter, associated with the result of a measurement, that characterizes the dispersion of the values that could reasonably be attributed to the measurand.

Comments from the ISO guide: The word "uncertainty" means doubt, and thus in its broadest sense "uncertainty of measurement" means doubt about the validity of the result of a measurement. Because of the lack of different words for this general concept of uncertainty and the specific quantities that provide quantitative measures of the concept, for example, the standard deviation, it is necessary to use the word "uncertainty" in these two different senses.

More comments from the ISO guide: The definition of uncertainty of measurement is an operational one that focuses on the measurement result and its evaluated uncertainty. However, it is not inconsistent with other concepts of uncertainty of measurement, such as i) a measure of the possible error in the estimated value of the measurand as provided by the result of a measurement; ii) an estimate characterizing the range of values within which the true value of a measurand lies. Although these two traditional concepts are valid as ideals, they focus on unknowable quantities: the "error" of the result of a measurement and the "true value" of the measurand (in contrast to its estimated value), respectively.

Still more comments from the ISO guide: Uncertainty of measurement comprises, in general, many components. Some of these components may be evaluated from the statistical distribution of the results of series of measurements and can be characterized by experimental standard deviations.
The other components, which can also be characterized by standard deviations, are evaluated from assumed probability distributions based on experience or other information.

My comments: I could almost agree with this definition, but would rather say that, as the result of a measurement is a probability density, the uncertainty, as a parameter, is any estimator of dispersion associated with the probability density. I was pleasantly surprised to discover that the ISO guidelines accept probability distributions coming from subjective knowledge as an essential part of the description of the results of a measurement. One could fear that normal statistical practices, which exclude Bayesian (subjective) reasoning, were exclusively adopted. I am personally inclined (as this book demonstrates) to push the other way, and reject the notion of "statistical distribution of results of series of measurements": the maximum generality is obtained when each individual measurement is used, and the well-known statistical rules for combining individual measurement "results" will appear by themselves when working properly at the elementary level. At most, the rules proposed by statistical texts are a way of (approximately) short-circuiting some of the steps of the inference methods proposed in this book.

5.3.2 Some basic concepts

Note: what follows is important for the chapter on "physical theories" too.

In practice, the required specification or definition of the measurand is dictated by the required accuracy of [the] measurement. The measurand should be defined with sufficient completeness with respect to the required accuracy so that for all practical purposes associated with the measurement its value is unique. It is in this sense that the expression "value of the measurand" is used in this Guide.
Example: If the length of a nominally one-metre long steel bar is to be determined to micrometre accuracy, its specification should include the temperature and pressure at which the length is defined. Thus the measurand should be specified as, for example, the length of the bar at 35.00 °C and 101 325 Pa (plus any other defining parameters deemed necessary, such as the way the bar is to be supported). However, if the length is to be determined to only millimetre accuracy, its specification would not require a defining temperature or pressure or a value for any other defining parameter.

Note: Incomplete definition of the measurand can give rise to a component of uncertainty sufficiently large that it must be included in the evaluation of the uncertainty of the measurement result.

Note: The first step in making a measurement is to specify the measurand — the quantity to be measured; the measurand cannot be specified by a value but only by a description of a quantity. However, in principle, a measurand cannot be completely described without an infinite amount of information. Thus, to the extent that it leaves room for interpretation, incomplete definition of the measurand introduces into the uncertainty of the result of a measurement a component of uncertainty that may or may not be significant relative to the accuracy required of the measurement.

Note: At some level, every measurand has [...] an "intrinsic" uncertainty that can in principle be estimated in some way. This is the minimum uncertainty with which a measurand can be determined, and every measurement that achieves such an uncertainty may be viewed as the best possible measurement of the measurand. To obtain a value of the quantity in question having a smaller uncertainty requires that the measurand be more completely defined. [...]
The uncertainty in the result of a measurement generally consists of several components which may be grouped into two categories according to the way in which their numerical value is estimated:

• A. those which are evaluated by statistical methods,
• B. those which are evaluated by other means.

[...] a type A standard uncertainty is obtained from a probability density function derived from an observed frequency distribution, while a type B standard uncertainty is obtained from an assumed probability density function based on the degree of belief that an event will occur (often called subjective probability). Both approaches employ recognized interpretations of probability. [...] In practice, there are many possible sources of uncertainty in a measurement, including

• incomplete definition of the measurand;
• imperfect realization of the definition of the measurand;
• nonrepresentative sampling — the sample measured may not represent the defined measurand;
• inadequate knowledge of the effects of environmental conditions on the measurement or imperfect measurement of environmental conditions;
• personal bias in reading analogue instruments;
• finite instrument resolution or discrimination threshold;
• inexact values of measurement standards and reference materials;
• inexact values of constants and other parameters obtained from external sources and used in the data-reduction algorithm;
• approximations and assumptions incorporated in the measurement method and procedure;
• variations in repeated observations of the measurand under apparently identical conditions. [...]

5.3.2.1 The need for type B evaluations
If a measurement laboratory had limitless time and resources, it could conduct an exhaustive statistical investigation of every conceivable cause of uncertainty, for example, by using many different makes and kinds of instruments, different methods of measurement, different applications of the method, and different approximations in its theoretical models of the measurement. The uncertainties associated with all of these causes could then be evaluated by the statistical analysis of series of observations, and the uncertainty of each cause would be characterized by a statistically evaluated standard deviation. In other words, all of the uncertainty components would be obtained from type A evaluations. Since such an investigation is not an economic practicality, many uncertainty components must be evaluated by whatever other means is practical.

5.3.2.2 Single observation, calibrated instruments

If an input estimate has been obtained from a single observation with a particular instrument that has been calibrated against a standard of small uncertainty, the uncertainty of the estimate is mainly one of repeatability. The variance of repeated measurements by the instrument may have been obtained on an earlier occasion, not necessarily at precisely the same value of the reading but near enough to be useful, and it may be possible to assume the variance to be applicable to the input value in question. If no such information is available, an estimate must be made based on the nature of the measuring apparatus or instrument, the known variances of other instruments of similar construction, etc.

5.3.2.3 Single observation, verified instruments

Not all measuring instruments are accompanied by a calibration certificate or a calibration curve. Most instruments, however, are constructed to a written standard and verified, either by the manufacturer or by an independent authority, to conform to that standard.
Usually the standard contains metrological requirements, often in the form of "maximum permissible errors", to which the instrument is required to conform. The compliance of the instrument with these requirements is determined by comparison with a reference instrument whose maximum allowed uncertainty is usually specified in the standard. This uncertainty is then a component of the uncertainty of the verified instrument. If nothing is known about the characteristic error curve of the verified instrument it must be assumed that there is an equal probability that the error has any value within the permitted limits, that is, a rectangular probability distribution. However, certain types of instruments have characteristic curves such that the errors are, for example, likely always to be positive in part of the measuring range and negative in other parts. Sometimes such information can be deduced from a study of the written standard.

5.4 The Ideal Output of a Measuring Instrument

Note: mention here figures 5.1 and 5.2.

Figure 5.1: Instrument built to measure pitches of musical notes.

Due to unavoidable measuring noises, a measurement is never infinitely accurate. Figure 5.2 suggests an ideal instrument output.

Figure 5.2: The ideal output of a measuring instrument (in this example, measuring frequencies/periods). The curve in the middle corresponds to the volumetric probability describing the information brought by the measurement (on 'the measurand'). [Figure annotations include logarithmic frequency and period scales, ν* = log ν/ν0 and Τ* = log Τ/Τ0 with ν0 = 1/Τ0 = 1 Hz; a center at ν = 440 Hz (ν* = +6.09, Τ = 2.27×10⁻³ s); and a radius (standard deviation) σ = 0.12.]
Five different scales are shown (in a real instrument, the user would just select one of the scales). Here, the logarithmic scales correspond to the natural logarithms that a physicist should prefer, but engineers could select scales using decimal logarithms. Note that all the scales are 'linear' (with respect to the natural distance in the frequency-period space [see section XXX]): I do not recommend the use of a scale where the frequencies (or the periods) would 'look linear'.

5.5 Output as Conditional Probability Density

As suggested by figure 5.3, a 'measuring instrument' is specified when the conditional volumetric probability f(y|x) for the output y , given the input x , is given.

Figure 5.3: The input (or measurand) and the output of a measuring instrument. The output is never an actual value, but a probability distribution, in fact, a conditional volumetric probability f(y|x) for the output y , given the input x .

5.6 A Little Bit of Theory

We want to measure a given property of an object, say the quantity x . Assume that the object has been randomly selected from a set of objects, so that the 'prior' probability for the quantity x is fx(x) . Then, the conditional . . . Then, Bayes theorem . . .

5.7 Example: Instrument Specification

[Note: This example is to be put somewhere, I don't know yet where.]

It is unfortunate that ordinary measuring instruments tend to just display some 'observed value', the 'measurement uncertainty' tending to be hidden inside some written documentation. Awaiting the day when measuring instruments directly display a probability distribution for the measurand, let us contemplate the simple situation where the maker of an instrument, say a frequencymeter, writes something like the following.

This frequencymeter can operate, with high accuracy, in the range 10² Hz < ν < 10⁹ Hz . When very far from this range, one may face uncontrollable uncertainties.
Inside (or close to) this range, the measurement uncertainty is, with a good approximation, independent of the value of the measured frequency. When the instrument displays the value ν0 , this means that the (1D) volumetric probability for the measurand is

  if log(ν/ν0) ≤ −σ         then f(ν) = 0
  if −σ < log(ν/ν0) < +2σ   then f(ν) = (2/(9σ²)) ( 2σ − log(ν/ν0) )      (5.1)
  if +2σ ≤ log(ν/ν0)        then f(ν) = 0

where σ = 10⁻⁴ . This volumetric probability is displayed at the top of figure 5.4. Using the logarithmic frequency as coordinate, this is an asymmetric triangle.

Figure 5.4: Figure for 'instrument specification'. Note: write this caption. [Figure annotations: K = 1 Hz, σ = 10⁻⁴, ν* = log₁₀(ν/K); ν0 = 1.0000×10⁶ Hz, ν0* = 6.0000; marks at ν0* − σ and ν0* + 2σ.]

5.8 Measurements and Experimental Uncertainties

Observation of geophysical phenomena is represented by a set of parameters d that we usually call data. These parameters result from prior measurement operations, and they are typically seismic vibrations on the instrument site, arrival times of seismic phases, gravity or electromagnetic fields. As in any measurement, the data are determined with an associated uncertainty, described with a volumetric probability over the data parameter space, that we denote here ρd(d). This density describes not only marginals on individual datum values, but also possible cross-relations in data uncertainties. Although the instrumental errors are an important source of data uncertainties, in geophysical measurements there are other sources of uncertainty. The errors associated with the positioning of the instruments, the environmental noise, and the human appreciation (as when picking arrival times) are also relevant sources of uncertainty.

Example 5.1 Non-analytic volumetric probability

Assume that we wish to measure the time t of occurrence of some physical event.
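Equation (5.1) can be checked numerically. A small Python sketch (σ is the value given in the text; ν0 is the illustrative displayed value of figure 5.4) verifies that the triangle of (5.1) integrates to one over the logarithmic frequency coordinate:

```python
import numpy as np

# Triangular volumetric probability of equation (5.1), written in the
# logarithmic coordinate u = log(nu/nu0). sigma is the value from the
# text; nu0 is the illustrative displayed value of figure 5.4.
sigma = 1.0e-4
nu0 = 1.0e6  # Hz

def f(nu):
    u = np.log(np.asarray(nu, dtype=float) / nu0)
    return np.where((u > -sigma) & (u < 2.0 * sigma),
                    (2.0 / (9.0 * sigma ** 2)) * (2.0 * sigma - u),
                    0.0)

# Numerical check: the density integrates to one in the coordinate u.
u = np.linspace(-2.0 * sigma, 3.0 * sigma, 200_001)
du = u[1] - u[0]
total = (f(nu0 * np.exp(u)) * du).sum()
print(total)  # close to 1
```

The normalization constant 2/(9σ²) is what makes the asymmetric triangle of base (−σ, +2σ) have unit area.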
It is often assumed that the result of a measurement corresponds to something like

t = t0 ± σ .      (5.2)

An obvious question is the exact meaning of the ±σ . Has the experimenter in mind that she/he is absolutely certain that the actual arrival time satisfies the strict conditions t0 − σ ≤ t ≤ t0 + σ , or has she/he in mind something like a Gaussian probability, or some other probability distribution (see figure 5.5)? We accept, following ISO's recommendations (1993), that the result of any measurement has a probabilistic interpretation, with some sources of uncertainty being analyzed using statistical methods ('type A' uncertainties), and other sources of uncertainty being evaluated by other means (for instance, using Bayesian arguments) ('type B' uncertainties). But, contrary to ISO suggestions, we do not assume that the Gaussian model of uncertainties should play any central role. In an extreme example, we may well have measurements whose probabilistic description may correspond to a multimodal volumetric probability. Figure 5.6 shows a typical example for a seismologist: the measurement on a seismogram of the arrival time of a certain seismic wave, in the case where one hesitates in the phase identification, or in the identification of noise and signal. In this case the volumetric probability for the arrival of the seismic phase does not have an explicit expression like f(t) = k exp(−(t − t0)²/(2σ²)) , but is a numerically defined function. Using, for instance, the Mathematica (registered trademark) computer language, we may define the volumetric probability f(t) as

f[t_] := ( If[t1<t<t2,a,c] If[t3<t<t4,b,c] )

Here, a and b are the 'levels' of the two steps, and c is the 'background' volumetric probability. [End of example.]

Figure 5.5: What has an experimenter in mind when she/he describes the result of a measurement by something like t = t0 ± σ ?
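The same numerically defined, bimodal density can be written in any language; here is a minimal Python sketch (the interval bounds t1…t4 and the levels a, b, c are illustrative values, not taken from a real seismogram):

```python
import numpy as np

# Bimodal, numerically defined volumetric probability of example 5.1.
# The interval bounds t1..t4 and the levels a, b, c are illustrative.
t1, t2, t3, t4 = 10.0, 12.0, 20.0, 23.0   # seconds
a, b = 0.9, 0.6                            # 'levels' of the two steps
c = 0.01                                   # small 'background' value

def f(t):
    """Two box-cars over a small nonzero background."""
    t = np.asarray(t, dtype=float)
    out = np.full_like(t, c)
    out[(t > t1) & (t < t2)] = a
    out[(t > t3) & (t < t4)] = b
    return out

# Normalize on a grid so the values behave as a probability density.
t = np.linspace(0.0, 30.0, 30_001)
dt = t[1] - t[0]
density = f(t) / (f(t).sum() * dt)
```

Keeping the background c strictly positive follows the robustness advice given with figure 5.6: some unexpected source of error can never be excluded.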
Figure 5.6: A seismologist tries to measure the arrival time of a seismic wave at a seismic station, by 'reading' the seismogram at the top of the figure. The seismologist may find it quite likely that the arrival time of the wave is between times t3 and t4 , and believe that what is before t3 is just noise. But if there is a significant probability that the signal between t1 and t2 is not noise but the actual arrival of the wave, then the seismologist should define a bimodal volumetric probability, as the one suggested at the bottom of the figure. Typically, the actual form of each peak of the volumetric probability is not crucial (here, box-car functions are chosen), but the position of the peaks is important. Rather than assigning a zero volumetric probability to the zones outside the two intervals, it is safer (more 'robust') to attribute some small 'background' value, as we may never exclude some unexpected source of error.

Example 5.2 The Gaussian model for uncertainties

The simplest probabilistic model that can be used to describe experimental uncertainties is the Gaussian model

ρD(d) = k exp( −(1/2) (d − dobs)ᵀ CD⁻¹ (d − dobs) ) .      (5.3)

It is here assumed that we have some 'observed data values' dobs , with uncertainties described by the covariance matrix CD . If the uncertainties are uncorrelated,

ρD(d) = k exp( −(1/2) Σᵢ ( (dⁱ − dⁱobs) / σⁱ )² ) ,      (5.4)

where the σⁱ are the 'standard deviations'. [End of example.]

Example 5.3 The Generalized Gaussian model for uncertainties

An alternative to the Gaussian model is to use the Laplacian (double exponential) model for uncertainties,

ρD(d) = k exp( − Σᵢ |dⁱ − dⁱobs| / σⁱ ) .      (5.5)

While the Gaussian model leads to least-squares related methods, this Laplacian model leads to absolute-values methods (see section 8.2.6), well known for producing robust⁵ results.
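The difference between the two models is easy to see numerically. A minimal sketch for a single datum (dobs , σ and the grid are illustrative values; the constant k is obtained here by normalizing on the grid):

```python
import numpy as np

# Gaussian (5.4) vs Laplacian (5.5) uncertainty models for one datum.
# d_obs, sigma and the grid are illustrative values.
d_obs, sigma = 0.0, 1.0
d = np.linspace(-8.0, 8.0, 4001)
dd = d[1] - d[0]

gauss = np.exp(-0.5 * ((d - d_obs) / sigma) ** 2)
laplace = np.exp(-np.abs(d - d_obs) / sigma)

gauss /= gauss.sum() * dd      # normalization constant k, Gaussian case
laplace /= laplace.sum() * dd  # normalization constant k, Laplacian case

# The Laplacian puts far more probability on large residuals, which is
# why it leads to robust, absolute-value criteria.
tail = d > 4.0
print(laplace[tail].sum() * dd, gauss[tail].sum() * dd)
```

The printed tail probabilities differ by orders of magnitude: under the Laplacian model a residual of several σ is unlikely but not astonishing, so a few large errors do not dominate the fit.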
More generally, there is the Lp model of uncertainties

ρp(d) = k exp( −(1/p) Σᵢ |dⁱ − dⁱobs|ᵖ / (σp)ᵖ )      (5.6)

(see figure 5.7). [End of example.]

⁵A numerical method is called robust if it is not sensitive to a small number of large errors.

Figure 5.7: Generalized Gaussian for values of the parameter p = 1, √2, 2, 4, 8 and ∞ .

5.9 Appendixes

5.9.1 Appendix: Operational Definitions cannot be Infinitely Accurate

Note: refer here to figure 5.8, and explain that "the length" of a real object (as opposed to a mathematically defined object) can only be defined by specifying the measuring instrument. There are different notions of length associated with a given object. For instance, figure 5.8 suggests that the length of a piece of wood is larger when defined by the use of a calliper⁶ than when defined by the use of a ruler⁷, because a calliper tends to measure the distance between extremal points, while an observer using a ruler tends to average the rugosities at the wood ends.

Figure 5.8: Different definitions of the length of an object.

⁶Calliper: an instrument for measuring diameters (as of logs or trees) consisting of a graduated beam and at right angles to it a fixed arm and a movable arm. From the Digital Webster.
⁷Ruler: a smooth-edged strip (as of wood or metal) that is usu. marked off in units (as inches) and is used as a straightedge or for measuring. From the Digital Webster.

5.9.2 Appendix: The International System of Units (SI)

Note: make here a small introduction about the usefulness of a unified system of units. The rest of this appendix is a reproduction (with permission) of a text published by Robert A.
Nelson in the August 1996 issue of Physics Today, pages 15–16. Robert Nelson is the author of the booklet SI: The International System of Units, 2nd ed. (American Association of Physics Teachers, College Park, Maryland, 1982). He is Program Director for Commercial Space at Veda Incorporated in Alexandria, Virginia and teaches in the Department of Aerospace Engineering at the University of Maryland.

Note: ASK FOR THE PERMISSION TO REPRODUCE!!!

Note: The accent in "ampère" is valid in French; check if it is valid in English.

5.9.2.1 Guide for Metric Practice, by Robert A. Nelson

The modernized metric system is known as the Système International d'Unités (International System of Units), with the international abbreviation SI. It is founded on seven base units, listed in table 1, that by convention are regarded as dimensionally independent. All other units are derived units, formed coherently by multiplying and dividing units within the system without numerical factors. Examples of derived units, including some with special names, are listed in table 2. The expression of multiples and submultiples of SI units is facilitated through the use of the prefixes listed in table 3.

Table 1. SI base units

Quantity                    Unit Name   Symbol
length                      meter       m
mass                        kilogram    kg
time                        second      s
electric current            ampère      A
thermodynamic temperature   kelvin      K
amount of substance         mole        mol
luminous intensity          candela     cd

SI obtains its international authority from the Meter Convention, signed in Paris by the delegates of 17 countries, including the United States, on 20 May 1875, and amended in 1921. Today 48 states are members. The treaty established the Conférence Générale des Poids et Mesures (General Conference on Weights and Measures) as the formal diplomatic body responsible for ratification of the new proposals related to metric units. The scientific decisions are made by the Comité International des Poids et Mesures (International Committee for Weights and Measures).
It is assisted by the advice of eight Consultative Committees specializing in particular areas of metrology. The activities of the national standards laboratories are coordinated by the Bureau International des Poids et Mesures (International Bureau of Weights and Measures), whose headquarters is at the Pavillon de Breteuil in Sèvres, France, and which is under the supervision of the CIPM. The SI was established by the 11th CGPM in 1960, when the metric unit definitions, symbols and terminology were extensively revised and simplified.⁸

⁸For the history of the metric system and SI units, see R.A. Nelson, Phys. Teach. 19, 596 (1981).

Table 2. Examples of SI derived units

Quantity                  Special name     Symbol   Equivalent
plane angle               radian           rad      m/m = 1
solid angle               steradian        sr       m²/m² = 1
speed, velocity                                     m/s
acceleration                                        m/s²
angular velocity                                    rad/s
angular acceleration                                rad/s²
frequency                 hertz            Hz       s⁻¹
force                     newton           N        kg·m/s²
pressure, stress          pascal           Pa       N/m²
work, energy, heat        joule            J        N·m , kg·m²/s²
impulse, momentum                                   N·s , kg·m/s
power                     watt             W        J/s
electric charge           coulomb          C        A·s
electric potential, emf   volt             V        J/C , W/A
resistance                ohm              Ω        V/A
conductance               siemens          S        A/V , Ω⁻¹
magnetic flux             weber            Wb       V·s
inductance                henry            H        Wb/A
capacitance               farad            F        C/V
electric field strength                             V/m , N/C
magnetic flux density     tesla            T        Wb/m² , N/(A·m)
electric displacement                               C/m²
magnetic field strength                             A/m
Celsius temperature       degree Celsius   °C       K
luminous flux             lumen            lm       cd·sr
illuminance               lux              lx       lm/m²
radioactivity             becquerel        Bq       s⁻¹

Table 3.
SI prefixes

Factor   Prefix   Symbol      Factor   Prefix   Symbol
10²⁴     yotta    Y           10⁻¹     deci     d
10²¹     zetta    Z           10⁻²     centi    c
10¹⁸     exa      E           10⁻³     milli    m
10¹⁵     peta     P           10⁻⁶     micro    µ
10¹²     tera     T           10⁻⁹     nano     n
10⁹      giga     G           10⁻¹²    pico     p
10⁶      mega     M           10⁻¹⁵    femto    f
10³      kilo     k           10⁻¹⁸    atto     a
10²      hecto    h           10⁻²¹    zepto    z
10¹      deka     da          10⁻²⁴    yocto    y

The BIPM, with the guidance of the Consultative Committee for Units (CCU) and the approval of the CIPM, periodically publishes a document⁹ that summarizes the historical decisions of the CGPM and the CIPM and gives some conventions for metric practice. In addition, Technical Committee 12 of the International Organization for Standardization has prepared recommendations concerning the practical use of the SI¹⁰. Some other recommendations have been given by the Commission for Symbols, Units, Nomenclature, Atomic Masses and Fundamental Constants of the International Union of Pure and Applied Physics¹¹. The National Institute of Standards and Technology has published a practical guide for the use of the SI¹². The Institute of Electrical and Electronics Engineers has developed a metric practice manual¹³ that has been recognized by the American National Standards Institute and has been adopted by the US Department of Defense. The American Society for Testing and Materials has prepared a similar manual¹⁴. The Secretary of Commerce, through NIST, has also issued recommendations for US metric practice¹⁵ as provided under the Metric Conversion Act of 1975 and the Omnibus Trade and Competitiveness Act of 1988.

In October 1995 the 20th CGPM, on the recommendation of the CCU and the CIPM, eliminated the "supplementary units" radian and steradian as a special class of derived units having dimension 1 (so-called dimensionless derived units). Thus the SI now consists of only two classes of units, base units and derived units, with the radian and steradian included among the derived units as shown in table 2.
5.9.2.2 Style conventions

Letter symbols include quantity symbols and unit symbols. Symbols for physical quantities are set in italic (sloping) type, while symbols for units are set in roman (upright) type (for example, F = 15 N). Symbols for unit names derived from proper names have the first letter capitalized — otherwise unit symbols are lower case — but the unit names themselves are not capitalized (for example, tesla, T; meter, m). A unit symbol is a mathematical entity (not an abbreviation) and is usually denoted by the first letter of the unit name (for example, the symbol for gram is g, not gm; the symbol for second is s, not sec), with some exceptions (for example, mol, cd and Hz). The unit symbol is not followed by a period, and plurals of unit symbols are not followed by an "s" (for example, 3 kg, not 3 kg. or 3 kgs).

⁹Bureau International des Poids et Mesures, Le Système International d'Unités (SI), 6th ed., BIPM, Sèvres, France (1991); US ed.: The International System of Units (SI), B.N. Taylor, ed., Natl. Inst. Stand. Technol. Spec. Pub. 330, US Govt. Printing Office, Washington, D.C. (1991).
¹⁰International Organization for Standardization, Quantities and Units, ISO Standards Handbook, 3rd ed., ISO, Geneva (1993). This is a compilation of the individual standards ISO 31-0 to 31-13 and ISO 1000, available from Am. Natl. Stand. Inst., New York.
¹¹E.R. Cohen, P. Giacomo, eds., Physica 146A, 1 (1987). Reprinted as Symbols, Units, Nomenclature and Fundamental Constants in Physics (1987 revision), document IUPAP-25 (SUNAMCO 87-1).
¹²B.N. Taylor, Guide for the Use of the International System of Units, Natl. Inst. Stand. Technol. Spec. Pub. 811, US Govt. Printing Office, Washington, D.C. (1995).
¹³Inst. of Electrical and Electronics Engineers, American National Standard for Metric Practice, ANSI/IEEE Std. 268-1992, IEEE, New York (1992).
¹⁴Am. Soc.
for Testing and Materials, Standard Practice for Use of the International System of Units (SI) (The Modernized Metric System), ASTM E 380-93, ASTM, Philadelphia (1993).
¹⁵"Metric System of Measurement; Interpretation of the International System of Units for the United States," Fed. Register 55 (245), 52 242 (20 December 1990).

The word "degree" and its symbol, °, are omitted from the unit of thermodynamic temperature T (that is, one uses kelvin or K, not degree Kelvin or °K). However, they are retained in the unit of Celsius temperature t, defined as t ≡ T − T0, where T0 = 273.15 K exactly (that is, degree Celsius, °C).

Symbols for prefixes representing 10⁶ or greater are capitalized; all others are lower case. There is no space between the prefix and the unit. Compound prefixes are to be avoided (for example, pF, not µµF). An exponent applies to the whole unit including its prefix (for example, cm³ = 10⁻⁶ m³). When a unit multiple or submultiple is written out in full, the prefix should be written in full, beginning with a lower-case letter (for example, megahertz, not Megahertz or Mhertz). The kilogram is the only base unit whose name, for historical reasons, contains a prefix; names of multiples and submultiples of the kilogram and their symbols are formed by attaching prefixes to the word "gram" and the symbol "g".

Multiplication of units is indicated by inserting a raised dot or by leaving a space between the units (for example, N·m or N m). Division may be indicated by the use of the solidus, a horizontal fraction bar or a negative exponent (for example, m/s or m·s⁻¹), but repeated use of the solidus is not permitted (for example, m/s², not m/s/s). To avoid possible misinterpretation when more than one unit appears in the denominator, the preferred practice is to use parentheses or negative exponents (for example, W/(m²·K⁴) or W·m⁻²·K⁻⁴). The unit expression may include a prefixed unit (for example, kJ/mol, W/cm²).
Unit names should not be mixed with symbols for mathematical operations. (For example, one should write "meter per second", but not "meter/second" or "meter second⁻¹".) When spelling out the product of two units, a space is recommended (although a hyphen is permissible), but one should never use a centered dot. (Write, for example, "newton meter" or "newton-meter", but not "newton·meter".)

Three-digit groups in numbers with more than four digits are separated by thin spaces instead of commas (for example, 299 792 458, not 299,792,458) to avoid confusion with the decimal marker in European literature. This spacing convention is also used to the right of the decimal marker. The numerical value and unit symbol must be separated by a space, even when used as an adjective (for example, 35 mm, not 35mm or 35-mm). A zero should be placed in front of the decimal marker in decimal fractions (for example, 0.3 J, not .3 J). The prefix of a unit should be chosen so that the numerical value will be within a practical range, usually between 0.1 and 1000 (for example, 200 kN, 0.5 mA).¹⁶

5.9.2.3 Non-SI units

An important function of the SI is to discourage the proliferation of unnecessary units. However, it is recognized that some units outside the SI are so well established that their use is to be permitted. Units in use with the SI are listed in table 4. As exceptions to the rules, the symbols °, ′ and ″ for units of plane angle are not preceded by a space, and the symbol for liter, L, is capitalized to avoid confusion between the letter l and the number 1. Certain units whose values are obtained experimentally, listed in table 5, are also accepted for use in special fields.

16. This footnote is from A. Tarantola, not from R. Nelson: remark that "three-digit groups (...) are separated by thin spaces". In the LaTeX document preparation system, for instance, a thin space is obtained by "\,".
I also use thin spaces to separate the numerical value and unit symbol (for example, 35\,mm with a thin space, rather than 35 mm with a full space), but I do not know if this is an explicit specification.

Table 4. Units in use with the SI

Quantity      Name         Symbol   Definition
time          minute       min      1 min = 60 s
              hour         h        1 h = 60 min = 3600 s
              day          d        1 d = 24 h = 86 400 s
plane angle   degree       °        1° = (π/180) rad
              minute       ′        1′ = (1/60)° = (π/10 800) rad
              second       ″        1″ = (1/60)′ = (π/648 000) rad
volume        liter        L        1 L = 1 dm³ = 10⁻³ m³
mass          metric ton   t        1 t = 1000 kg
land area     hectare      ha       1 ha = 1 hm² = 10⁴ m²

Table 5. Units whose values are obtained experimentally

Quantity   Name                       Symbol   Value
energy     electron volt              eV       1.602 177 33(49) × 10⁻¹⁹ J
mass       unified atomic mass unit   u        1.660 540 2(10) × 10⁻²⁷ kg

Chapter 6 Inference Problems of the First Kind (Sum of Probabilities)

Note: Say here that we consider the Problem of Making Histograms.

6.1 Experimental Histograms

[Note: This is a provisional text, to be expanded.]

Consider an n-dimensional manifold, with a volume element $dv$, and a probability distribution defined over it, represented by the (normalized) volumetric probability $f$. Although this is not necessary, let us simplify the exposition by assuming that some coordinates have been chosen over the manifold. Then, the probability distribution is represented by the volumetric probability function $f(x)$, and the volume distribution by the volume element function $dv(x)$. Some process, mathematical or physical, produces points $P_1, P_2, \dots, P_K$ that are samples of the probability distribution. Assume that we don't know $f(x)$, and that we wish to obtain a reasonable estimate of it, by measuring the coordinates of the points $P_1, P_2, \dots, P_K$.
As any physical measurement has some experimental uncertainty, the measurement of the coordinates of the point $P_1$ shall not produce definite values $x_1$ but, rather, some information about the coordinates of the point, which we can represent by a volumetric probability $f_1(x)$. Let, then, $f_1(x), f_2(x), \dots, f_K(x)$ be the (normalized) volumetric probabilities obtained when measuring the coordinates of the points $P_1, P_2, \dots, P_K$. When we have a large enough number of points, i.e., when $K$ is large enough¹, we can start having some information about the probability distribution $f(x)$ itself. Which volumetric probability $f(x)$ shall we choose to represent our information? Of course, the one that satisfies the postulates used in section 2.3 to define the 'sum' of probabilities. We then arrive at the volumetric probability

$$ f(x) \;=\; \frac{1}{K} \sum_{i=1}^{K} f_i(x) . \qquad (6.1) $$

This is equivalent, in a slightly more sophisticated manner, to 'making a histogram of the observed points'.

Example 6.1 A seismologist has analyzed for many years the seismicity of a quite active region of the Earth. For every earthquake, using the arrival times of the seismic waves at some observatories, she/he has estimated its epicentral (geographic) coordinates $\{\varphi, \lambda\}$, obtaining the (2D) volumetric probabilities $f_1(\varphi,\lambda), f_2(\varphi,\lambda), \dots, f_K(\varphi,\lambda)$. If the next earthquake is to be a standard earthquake, the best estimate we have for the probability distribution of its epicentral coordinates (in the absence of any supplementary information) is that represented by the volumetric probability $f(\varphi,\lambda) = \frac{1}{K} \sum_{i=1}^{K} f_i(\varphi,\lambda)$. [End of example.]

As suggested in chapter 2, let us write the volume element of the space as

$$ dv(x) \;=\; g(x)\, \overline{dv}(x) , \qquad (6.2) $$

where $g(x)$ and $\overline{dv}(x)$ are respectively the volume density and the capacity element of the space in the coordinates $x$.
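As a numerical aside, the averaging rule (6.1) can be checked in one dimension. The Gaussian form chosen here for the $f_i$, and all numerical values, are assumptions of this sketch only, not of the text:

```python
import numpy as np

# Sketch of equation (6.1): each measured point P_i is known only through a
# normalized volumetric probability f_i(x); the estimate of f(x) is their
# plain average.  Gaussian f_i of width sigma are assumed for illustration.
rng = np.random.default_rng(0)
x = np.linspace(-12.0, 12.0, 2401)
dx = x[1] - x[0]
K = 200
points = rng.normal(loc=1.0, scale=2.0, size=K)   # the sampled points P_i
sigma = 0.5                                       # measurement uncertainty

def f_i(xi):
    """Normalized volumetric probability describing one measurement."""
    return np.exp(-0.5 * ((x - xi) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

f_hat = np.mean([f_i(xi) for xi in points], axis=0)   # equation (6.1)

# An average of normalized densities is itself normalized:
print(f_hat.sum() * dx)   # ≈ 1.0
```

Plotting `f_hat` against a histogram of `points` shows the two estimates agree up to the smoothing of width `sigma`.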
By definition of probability density (see section 2.2.3), the relation between a volumetric probability $h(x)$ and the associated probability density $\overline{h}(x)$ is

$$ \overline{h}(x) \;=\; g(x)\, h(x) . \qquad (6.3) $$

¹ How large is large enough? This depends, of course, on the relative radii of $f$ and of the $f_i$, on the number of dimensions of the space, and on the relative degree of smoothness of the probability distributions.

Equation (6.1) can obviously also be written as

$$ \overline{f}(x) \;=\; \frac{1}{K} \sum_{i=1}^{K} \overline{f}_i(x) , \qquad (6.4) $$

where now only probability densities are invoked.

6.2 Sampling a Sum

Note: explain here that if we wish to obtain a sample of the volumetric probability

$$ f(x) \;=\; \frac{1}{K} \sum_{i=1}^{K} f_i(x) , \qquad (6.5) $$

we can:
• first, select at random, with equal probability, a value $i$ in the interval $1 \le i \le K$;
• then, obtain a sample of $f_i(x)$.

6.3 Further Work to be Done

Note: I have to prove here the following conjecture. Consider a metric coordinate $x$ over a one-dimensional metric space. Let $f(x)$ be a (1D) volumetric probability over the space, and let $x_1, x_2, \dots$ be samples of it. When trying to measure the coordinate $x$ with a given instrument, assume that 'the reading' of the instrument is a value $x'$ that is a sample of a volumetric probability $g(x'; x, \sigma)$ centered at $x$ and with standard deviation $\sigma$. Given the reading $x'$, the volumetric probability for the measurand is

$$ h(x) \;=\; h(x; x', \sigma) \;=\; \text{WRITE THIS} . \qquad (6.6) $$

The readings have been $x'_1, x'_2, \dots$. Then,

$$ F(x) \;=\; k \sum_i h_i(x) \;=\; k \sum_i h(x; x'_i, \sigma) \;=\; k \sum_i g(x'_i; x, \sigma) . \qquad (6.7) $$

And I conjecture that the relation between the original $f(x)$ and our estimate $F(x)$ is

$$ F(x) \;=\; \int dx'\, g(x', x, \sigma)\, f(x') . \qquad (6.8) $$

This, in fact, is a convolution.

Chapter 7 Inference Problems of the Second Kind (Product of Probabilities)

Note: write an introduction here.

7.1 The 'Shipwrecked Person' Problem

Note: this example is to be developed.
For the time being this is just a copy of example 2.4. Let $S$ represent the surface of the Earth, using geographical coordinates (longitude $\varphi$ and latitude $\lambda$). An estimation of the position of a floating object at the surface of the sea by an airplane navigator gives a probability distribution for the position of the object corresponding to the (2D) volumetric probability $f(\varphi,\lambda)$, and an independent, simultaneous estimation of the position by another airplane navigator gives a probability distribution corresponding to the volumetric probability $g(\varphi,\lambda)$. How should the two volumetric probabilities $f(\varphi,\lambda)$ and $g(\varphi,\lambda)$ be 'combined' to obtain a 'resulting' volumetric probability? The answer is given by the 'product' of the two volumetric probabilities:

$$ (f \cdot g)(\varphi,\lambda) \;=\; \frac{ f(\varphi,\lambda)\, g(\varphi,\lambda) }{ \int_S dS(\varphi,\lambda)\, f(\varphi,\lambda)\, g(\varphi,\lambda) } . \qquad (7.1) $$

7.2 Physical Laws as Probabilistic Correlations

7.2.1 Physical Laws

Are we forced to introduce uncertainties in physical laws, to be used as 'thicknesses' of a mathematical function $d = f(m)$, via a metric in the space? In fact, actual theories are always approximate, and they have some 'uncertainty bars' associated with them (see an example in section 7.2.2). The conditional volumetric probability has to be seen as a way of taking a limit when the uncertainty bars tend to zero. Then, the sort of limit defining the conditional probability density is imposed by the form of the 'theoretical uncertainty bars'. Rather than basing inversion theory on an expression like 8.16, it is better to introduce explicitly the theoretical uncertainties, and take any 'small uncertainty limit' afterwards. Let us do this. Assume that the physical correlations between the model parameters $m$ and the data parameters $d$ are not represented by an analytical expression like $d = f(m)$, but by a probability density $\vartheta(m,d)$.
Then, the conjunction of the 'a priori and experimental information' contained in $\rho(m,d)$ and the 'theoretical information' contained in $\vartheta(m,d)$ can be combined using the conjunction operation defined by equation ??, to give

$$ \sigma(m,d) \;=\; k\, \frac{ \rho(m,d)\, \vartheta(m,d) }{ \mu(m,d) } , \qquad (7.2) $$

where $\mu(m,d)$ is the homogeneous probability density. The implications of this equation will be examined later.

7.2.2 Example: Realistic 'Uncertainty Bars' Around a Functional Relation

In the approximation of a constant gravity field, with acceleration $g$, the position at time $t$ of an apple in free fall is $\mathbf{r}(t) = \mathbf{r}_0 + \mathbf{v}_0 t + \frac{1}{2}\mathbf{g}\, t^2$, where $\mathbf{r}_0$ and $\mathbf{v}_0$ are, respectively, the position and velocity of the object at time $t = 0$. More simply, if the movement is 1D,

$$ x(t) \;=\; x_0 + v_0\, t + \tfrac{1}{2}\, g\, t^2 . \qquad (7.3) $$

Of course, for many reasons this equation can never be exact: air friction, wind effects, inhomogeneity of the gravity field, effects of the Earth's rotation, forces from the Sun and the Moon (not to mention Pluto), relativity (special and general), etc. Even given very careful experimental conditions, it is not a trivial task to estimate the size of the leading uncertainty. Although one may think of an equation $x = x(t)$ as a line, infinitely thin, there will always be sources of uncertainty (at least due to the unknown limits of validity of general relativity): looking at the line with a magnifying glass should reveal a fuzzy object of finite thickness. As a simple example, let us examine here the mathematical object we arrive at when assuming that the leading sources of uncertainty in the relation $x = x(t)$ are the uncertainties in the initial position and velocity of the falling apple.
Let us assume that:

• the initial position of the apple is random, with a Gaussian distribution centered at $x_0$, and with standard deviation $\sigma_x$;
• the initial velocity of the apple is random, with a Gaussian distribution centered at $v_0$, and with standard deviation $\sigma_v$.

Then, it can be shown that at a given time $t$, the possible positions of the apple are random, with probability density

$$ \vartheta(x\,|\,t) \;=\; \frac{1}{\sqrt{2\pi}\,\sqrt{\sigma_x^2 + \sigma_v^2 t^2}} \exp\left( -\,\frac{1}{2}\, \frac{ \left( x - (x_0 + v_0\, t + \frac{1}{2} g\, t^2) \right)^2 }{ \sigma_x^2 + \sigma_v^2 t^2 } \right) . \qquad (7.4) $$

This is obviously a conditional probability density for $x$, given $t$. If we select the time $t$ randomly with a homogeneous probability distribution (i.e., if we assume that the marginal probability density for $t$ is constant), then the joint probability density for $x$ and $t$ is

$$ \vartheta(x,t) \;=\; k\, \vartheta(x\,|\,t) , \qquad (7.5) $$

where $k$ is a constant, and where $\vartheta(x\,|\,t)$ is that in equation (7.4). This probability density is represented in figure 7.1, together with the two marginals, and the conditional probability density at three different times is represented in figure 7.2.

Figure 7.1: A typical parabola representing the free fall of an object (position $x$ as a function of time $t$). Here, rather than an infinitely thin line, we have a fuzzy object (a probability distribution), because the initial position and initial velocity are uncertain. This figure represents the probability density defined by equation (7.5), with $x_0 = 0$, $v_0 = 1$ m/s, $\sigma_x = 1$ m, $\sigma_v = 1$ m/s and $g = 9.91$ m/s². While, by definition, the marginal of the probability density with respect to the time $t$ is homogeneous, the marginal for the position $x$ is not: there is a pronounced maximum for $x = 0$ (where the falling object is slower), and the distribution is very asymmetric (as the object is falling 'downwards').
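The moments of the conditional density (7.4) are easy to verify numerically. The sketch below uses $g = 9.81$ m/s² (the standard value; the variance check is independent of $g$) and the other values quoted for figure 7.1:

```python
import numpy as np

# Numerical check of (7.4): at fixed t the position is Gaussian, centered
# on the parabola, with variance sigma_x^2 + sigma_v^2 t^2.
x0, v0, g = 0.0, 1.0, 9.81
sx, sv = 1.0, 1.0

def theta_cond(x, t):                      # equation (7.4)
    s2 = sx**2 + sv**2 * t**2
    mean = x0 + v0 * t + 0.5 * g * t**2
    return np.exp(-0.5 * (x - mean)**2 / s2) / np.sqrt(2.0 * np.pi * s2)

x = np.linspace(-30.0, 60.0, 18001)
dx = x[1] - x[0]
for t in (0.0, 1.0, 2.0):
    p = theta_cond(x, t)
    m = (x * p).sum() * dx                 # mean: on the parabola
    var = ((x - m)**2 * p).sum() * dx      # variance: 1, 2, 5
    print(t, round(m, 3), round(var, 3))
```

The printed variances 1, 2 and 5 m² reproduce the widening visible in figure 7.2.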
Figure 7.2: Three conditional volumetric probabilities from the joint distribution of the previous figure, at times $t = 0$, $t = 1$ s and $t = 2$ s. The width increases with time because of the uncertainty in the initial velocity.

7.2.3 Inverse Problems

We have seen that the result of measurements can be represented by a probability density $\rho_d(d)$ in the data space. We have also seen that the a priori information on the model parameters can be represented by another probability density $\rho_m(m)$ in the model space. When we talk about 'measurements' and about 'a priori information on model parameters', we usually mean that we have a joint probability density in the $(M, D)$ space of the form $\rho(m,d) = \rho_m(m)\, \rho_d(d)$. But let us consider the more general situation where, for the whole set of parameters $(M, D)$, we have some information that can be represented by a joint probability density $\rho(m,d)$. Having well in mind the interpretation of this information, let us use for it the simple name of 'experimental information':

$$ \rho(m,d) \qquad \text{(experimental information)} . \qquad (7.6) $$

We have also seen that we have information coming from physical theories, which predict correlations between the parameters, and it has been argued that a probabilistic description of these correlations is well adapted to the resolution of inverse problems¹. Let $\vartheta(m,d)$ be the probability density representing this 'theoretical information':

$$ \vartheta(m,d) \qquad \text{(theoretical information)} . \qquad (7.7) $$

A quite fundamental assumption is that in all the spaces we consider there is a notion of volume, which allows us to give sense to the notion of 'homogeneous probability distribution' over the space.
The corresponding probability density is not constant, but is proportional to the volume element of the space (see section 4):

$$ \mu(m,d) \qquad \text{(homogeneous probability distribution)} . \qquad (7.8) $$

Finally, we have seen examples suggesting that the conjunction of the experimental information with the theoretical information corresponds exactly to the and operation defined over the probability densities, to obtain the 'conjunction of information', as represented by the probability density

$$ \sigma(m,d) \;=\; k\, \frac{ \rho(m,d)\, \vartheta(m,d) }{ \mu(m,d) } \qquad \text{(conjunction of informations)} , \qquad (7.9) $$

with marginal probability densities²

$$ \sigma_m(m) \;=\; \int_D dd\; \sigma(m,d) \quad ; \quad \sigma_d(d) \;=\; \int_M dm\; \sigma(m,d) . \qquad (7.10) $$

Example 7.1 We may assume that the physical correlations between the parameters $m$ and $d$ are of the form

$$ \vartheta(m,d) \;=\; \vartheta_{D|M}(d\,|\,m)\; \vartheta_M(m) , \qquad (7.11) $$

this expressing that a 'physical theory' gives, on the one hand, the conditional probability density for $d$, given $m$, and, on the other hand, the marginal probability density for $m$. [End of example.]

¹ Remember that, even if we wish to use a simple method based on the notion of conditional probability density, an analytic expression like $d = f(m)$ needs some 'thickness' before going to the limit defining the conditional probability density. This limit crucially depends on the 'thickness', i.e., on the type of uncertainties the theory contains.
² As explained in section ??, the definition of marginal probability density is only intrinsic if the total space is the Cartesian product of the two spaces, i.e., when $(M, D) = M \times D$.

Example 7.2 Many applications concern the special situation where we have

$$ \mu(m,d) = \mu_m(m)\, \mu_d(d) \quad ; \quad \rho(m,d) = \rho_m(m)\, \rho_d(d) . \qquad (7.12) $$

In this case, equations (7.9)–(7.10) give

$$ \sigma_m(m) \;=\; k\, \frac{\rho_m(m)}{\mu_m(m)} \int_D dd\; \frac{ \rho_d(d)\, \vartheta(m,d) }{ \mu_d(d) } \qquad (7.13) $$

and

$$ \sigma_d(d) \;=\; k\, \frac{\rho_d(d)}{\mu_d(d)} \int_M dm\; \frac{ \rho_m(m)\, \vartheta(m,d) }{ \mu_m(m) } . \qquad (7.14) $$

If equation (7.11) holds, then

$$ \sigma_m(m) \;=\; k\, \rho_m(m)\, \frac{\vartheta_M(m)}{\mu_m(m)} \int_D dd\; \frac{ \rho_d(d)\, \vartheta_{D|M}(d\,|\,m) }{ \mu_d(d) } \qquad (7.15) $$

and

$$ \sigma_d(d) \;=\; k\, \frac{\rho_d(d)}{\mu_d(d)} \int_M dm\; \rho_m(m)\, \vartheta_{D|M}(d\,|\,m)\, \frac{\vartheta_M(m)}{\mu_m(m)} . \qquad (7.16) $$

Finally, if the simplification $\vartheta_M(m) = \mu_m(m)$ arises (this usually holds only if nonlinearities are weak³), then

$$ \sigma_m(m) \;=\; k\, \rho_m(m) \int_D dd\; \frac{ \rho_d(d)\, \vartheta(d\,|\,m) }{ \mu_d(d) } \qquad (7.17) $$

and

$$ \sigma_d(d) \;=\; k\, \frac{\rho_d(d)}{\mu_d(d)} \int_M dm\; \rho_m(m)\, \vartheta(d\,|\,m) . \qquad (7.18) $$

[End of example.]

Example 7.3 Let us reproduce here equation (7.17),

$$ \sigma_m(m) \;=\; k\, \rho_m(m) \int_D dd\; \frac{ \rho_d(d)\, \vartheta(d\,|\,m) }{ \mu_d(d) } . \qquad (7.19) $$

Assume that observational uncertainties are Gaussian,

$$ \rho_d(d) \;=\; k \exp\left( -\tfrac{1}{2}\, (d - d_{\rm obs})^t\, C_D^{-1}\, (d - d_{\rm obs}) \right) . \qquad (7.20) $$

Note that the limit for infinite variances gives the homogeneous probability density $\mu_d(d) = k$. Furthermore, assume that uncertainties in the physical law are also Gaussian:

$$ \vartheta(d\,|\,m) \;=\; k \exp\left( -\tfrac{1}{2}\, (d - f(m))^t\, C_T^{-1}\, (d - f(m)) \right) . \qquad (7.21) $$

³ Note: some explanation is needed here.

Here 'the physical theory says' that the data values must be 'close' to the 'computed values' $f(m)$, with a notion of closeness defined by the 'theoretical covariance matrix' $C_T$. As demonstrated in Tarantola (1987, page 158), the integral in equation (7.19) can be analytically evaluated, and gives

$$ \int_D dd\; \frac{ \rho_d(d)\, \vartheta(d\,|\,m) }{ \mu_d(d) } \;=\; k \exp\left( -\tfrac{1}{2}\, (f(m) - d_{\rm obs})^t\, (C_D + C_T)^{-1}\, (f(m) - d_{\rm obs}) \right) . \qquad (7.22) $$

This shows that when using the Gaussian probabilistic model, observational and theoretical uncertainties combine through addition of the respective covariance operators (a nontrivial result). [End of example.]

Example 7.4 In the 'Galilean law' example developed in section 7.2.2, we described the correlation between the position $x$ and the time $t$ of a free falling object through a probability density $\vartheta(x,t)$. This law says that falling objects describe, approximately, a space-time parabola. Assume that in a particular experiment the falling object explodes at some point of its space-time trajectory. A plain measurement of the coordinates $(x,t)$ of the event gives the probability density $\rho(x,t)$.
By 'plain measurement' we mean here that we have used a measurement technique that does not take into account the particular parabolic character of the fall (i.e., the measurement is designed to work identically for any sort of trajectory). The conjunction of the physical law $\vartheta(x,t)$ and the experimental result $\rho(x,t)$, using expression (7.9), gives

$$ \sigma(x,t) \;=\; k\, \frac{ \rho(x,t)\, \vartheta(x,t) }{ \mu(x,t) } , \qquad (7.23) $$

where, as the coordinates $(x,t)$ are 'Cartesian', $\mu(x,t) = k$. Taking the explicit expression given for $\vartheta(x,t)$ in equations (7.4)–(7.5),

$$ \vartheta(x,t) \;=\; \frac{k}{\sqrt{2\pi}\,\sqrt{\sigma_x^2 + \sigma_v^2 t^2}} \exp\left( -\,\frac{1}{2}\, \frac{ \left( x - (x_0 + v_0\, t + \frac{1}{2} g\, t^2) \right)^2 }{ \sigma_x^2 + \sigma_v^2 t^2 } \right) , \qquad (7.24) $$

and assuming the Gaussian form⁴ for $\rho(x,t)$,

$$ \rho(x,t) \;=\; \rho_x(x)\, \rho_t(t) \;=\; k \exp\left( -\frac{1}{2}\, \frac{(x - x_{\rm obs})^2}{\Sigma_x^2} \right) \exp\left( -\frac{1}{2}\, \frac{(t - t_{\rm obs})^2}{\Sigma_t^2} \right) , \qquad (7.25) $$

we obtain the combined probability density

$$ \sigma(x,t) \;=\; \frac{k}{\sqrt{\sigma_x^2 + \sigma_v^2 t^2}} \exp\left( -\,\frac{1}{2} \left[ \frac{ \left( x - (x_0 + v_0\, t + \frac{1}{2} g\, t^2) \right)^2 }{ \sigma_x^2 + \sigma_v^2 t^2 } + \frac{(x - x_{\rm obs})^2}{\Sigma_x^2} + \frac{(t - t_{\rm obs})^2}{\Sigma_t^2} \right] \right) . \qquad (7.26) $$

Figure 7.3 illustrates the three probability densities $\vartheta(x,t)$, $\rho(x,t)$ and $\sigma(x,t)$. [End of example.]

Note: explain here that $\delta(d - f(m))$, as it concerns a difference in the data space (rather than a distance), is not a mathematically nice object.

⁴ Note that taking the limit of $\vartheta(x,t)$ or of $\rho(x,t)$ for infinite variances, we obtain $\mu(x,t)$, as we should.

Figure 7.3: Note: this is a provisional figure. It was made with the numerical values mentioned in figure 7.1 with, in addition, $x_{\rm obs} = 5.0$ m, $\Sigma_x = 4.0$ m, $t_{\rm obs} = 2.0$ s and $\Sigma_t = 0.75$ s.
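The conjunction (7.23)–(7.26) is easy to evaluate on a grid. The following sketch uses the numerical values quoted for figures 7.1 and 7.3; the grid extents are choices of this illustration only:

```python
import numpy as np

# Grid evaluation of the conjunction (7.23): with mu(x,t) constant on these
# Cartesian coordinates, sigma(x,t) is proportional to rho(x,t)*theta(x,t).
x0, v0, g = 0.0, 1.0, 9.91      # values quoted for figure 7.1
sx, sv = 1.0, 1.0
x_obs, Sx = 5.0, 4.0            # values quoted for figure 7.3
t_obs, St = 2.0, 0.75

t = np.linspace(-2.0, 2.0, 401)[None, :]
x = np.linspace(-5.0, 30.0, 701)[:, None]

s2 = sx**2 + sv**2 * t**2
theta = np.exp(-0.5 * (x - (x0 + v0*t + 0.5*g*t**2))**2 / s2) / np.sqrt(s2)  # (7.24)
rho = np.exp(-0.5 * ((x - x_obs) / Sx)**2 - 0.5 * ((t - t_obs) / St)**2)     # (7.25)

sigma = rho * theta             # (7.26), up to the constant k
sigma /= sigma.sum()            # normalize on the grid

ix, it = np.unravel_index(sigma.argmax(), sigma.shape)
print(x[ix, 0], t[0, it])       # the maximum lies near the parabola, before t_obs
```

The maximum of `sigma` sits on the fuzzy parabola of $\vartheta$, pulled toward the observed $(x_{\rm obs}, t_{\rm obs})$, which is the qualitative content of figure 7.3.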
(7.28) d=f (m) If µ(m, d) = k det g(m, d) (i.e., if we use the same metric to represent theoretical uncertainties as we used to define the homogeneous probability distributions), this equation is identical to equation 8.16, obtained using the equation d = f (m) to define a conditional probability. [End of example.] The previous example is important because it shows that the formulation using an ‘exact physical law’ can be found as a particular case of this, more general, approach were physical correlations are represented probabilistically. Chapter 8 Inference Problems of the Third Kind (Conditional Probabilities) Note: Say here the we consider here two problems: (i) ‘adjusting measurements’ to a physical theory and (ii) resolution of Inverse problems. These two problems are mathematically very similar, and are essentially solved using either the notion of ‘conditional probability’ or the notion of ‘product of probabilities’ (see chapter 2). Note: what follows comes from an old text: A so-called ‘inverse problem’ usually consists in a sort quite complex measurement, simetimes a gigantic measurement, involving years of observations and thousands of instruments. Any measurement is indirect (we may weigh a mass by observing the displacement of the cursor of a balance), and as such, a possibly nontrivial analysis of uncertainties must be done. Any good guide describing good experimental practice (see, for instance ISO’s Guide to the expression of uncertainty in measurement [ISO, 1993] or the shorter description by Taylor and Kuyatt, 1994) acknowledges that any measurement involves, at least, two different sources of uncertainties: those that we estimate using statistical methods, and those that we estimate using subjective, common sense estimations. Both are described using the axioms of probability theory, and this article clearly takes the probabilistic point of view for developing inverse theory. 
8.1 Adjusting Measurements to a Physical Theory

When a particle of mass $m$ is submitted to a force $F$, one has

$$ F \;=\; m\, \frac{d}{dt}\, \frac{v}{\sqrt{1 - v^2/c^2}} . \qquad (8.1) $$

Assuming initial conditions of rest (at a time arbitrarily set to 0), the trajectory of the particle is

$$ x(t) \;=\; \frac{c^2}{\gamma} \left( \sqrt{1 + \left( \frac{\gamma t}{c} \right)^2} - 1 \right) , \qquad (8.2) $$

where

$$ \gamma \;=\; \frac{F}{m} . \qquad (8.3) $$

Note: introduce here the problem set in the caption of figure 8.1. Say, in particular, that we have a measurement whose results are represented by the volumetric probability $f(t,x)$.

Figure 8.1: In the space-time of special relativity, we have measured the space-time coordinates of an event, and obtained the volumetric probability $f(t,x)$ displayed in the figure at the top. We then learn that the event happened on the trajectory of a particle with mass $m$ submitted to a constant force $F$ (equation 8.2). This trajectory is represented in the figure at the middle. It is clear that, thanks to the theory, we can improve the knowledge of the coordinates of the event, by considering the conditional volumetric probability induced on the trajectory. See text for details. (The axes are graduated in units of $T = c/\gamma$ and $X = c^2/\gamma$.)

The problem here is clearly a problem of conditional probability, and it makes sense because we do have a metric over our 2D space, the Minkowskian metric

$$ ds^2 \;=\; dt^2 - \frac{1}{c^2}\, dx^2 . \qquad (8.4) $$

With respect to the notations in section 2.4.2.2, we have here $r = t$ and $s = x$, and the relation $s = s(r)$ is, here, the relation $x = x(t)$ given by equation (8.2). As we have, here,

$$ \sqrt{ \det( g_r + S^t g_s S ) } \;=\; \frac{1}{\sqrt{1 + (\gamma t/c)^2}} , $$

a direct use of equation 2.127 gives the (1D) volumetric probability over the time variable

$$ f_t(t) \;=\; \frac{k}{\sqrt{1 + (\gamma t/c)^2}}\, f(t,x)\big|_{x = x(t)} , \qquad (8.5) $$

where $k$ is the normalization constant ensuring that

$$ \int_0^\infty dt\; f_t(t) \;=\; 1 , \qquad (8.6) $$

and where $x = x(t)$ is a short-hand notation for the relation (8.2).
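Equation (8.5) can be sketched numerically. The Gaussian form of $f(t,x)$, the unit values of $\gamma$ and $c$, and the integration range below are assumptions of this illustration only:

```python
import numpy as np

# Sketch of equation (8.5): a measured 2D volumetric probability f(t, x)
# (assumed Gaussian here, purely for illustration) is restricted to the
# relativistic trajectory (8.2) and reweighted by 1/sqrt(1 + (gamma*t/c)^2).
c = 1.0
gamma = 1.0        # gamma = F/m, in units where c = 1 (placeholder values)

def x_traj(t):     # equation (8.2)
    return (c**2 / gamma) * (np.sqrt(1.0 + (gamma * t / c) ** 2) - 1.0)

def f(t, x):       # the measured volumetric probability (assumed Gaussian)
    return np.exp(-0.5 * ((t - 2.0) ** 2 + (x - 1.0) ** 2))

t = np.linspace(0.0, 12.0, 12001)
dt = t[1] - t[0]
ft = f(t, x_traj(t)) / np.sqrt(1.0 + (gamma * t / c) ** 2)   # equation (8.5)
ft /= ft.sum() * dt      # fixes the normalization constant k of (8.6)

print(ft.sum() * dt)     # 1.0 by construction
```

The maximum of `ft` sits between the time preferred by the measurement and the time at which the trajectory passes closest to the measured position, as expected for a conditional restricted to the curve.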
Note: I now have to transport this volumetric probability over the time axis into a volumetric probability over the $x$ axis, using the transport of probabilities introduced in section 2.6.

Note: I have to convince the reader here that we cannot give an intrinsic definition of this problem inside Galilean physics, as there is no space-time metric. This is very important, and enforces my decision to use a metric definition of the conditional volumetric probabilities.

8.2 Inverse Problems

[Note: Complete and expand what follows.]

In the so-called 'inverse problems', values of the parameters describing physical systems are estimated, using as data some indirect measurements. A consistent formulation of inverse problems can be made using the concepts of probability theory. Data and attached uncertainties, (possibly vague) a priori information on model parameters, and a physical theory relating the model parameters to the observations are the fundamental elements of any inverse problem. While the most general solution of the inverse problem requires extensive use of Monte Carlo methods, special hypotheses (e.g., Gaussian uncertainties) allow, in some cases, part of the problem to be solved analytically (e.g., using the method of least squares).

Given a physical system, the 'forward' or 'direct' problem consists, by definition, in using a physical theory to predict the outcome of possible experiments. In classical physics, this problem has a unique solution. For instance, given a seismic model of the whole Earth (elastic constants, attenuation, etc. at every point inside the Earth) and given a model of a seismic source, we can use current seismological theories to predict which seismograms should be observed at given locations at the Earth's surface.
The 'inverse problem' arises when we do not have a good model of the Earth, or a good model of the seismic source, but we have a set of seismograms, and we wish to use these observations to infer the internal Earth structure or a model of the source (typically we try to infer both). There are many reasons that make the inverse problem underdetermined (the solution is not unique). In the seismic example, two different Earth models may predict the same seismograms¹, the finite bandwidth of our data sets will never allow us to resolve very small features of the Earth model, and there are always experimental uncertainties that allow different models to be 'acceptable'.

The name 'inverse problem' is widely accepted. I only like this name moderately, as I see the problem more as a problem of 'conjunction of states of information' (theoretical, experimental and a priori information). In fact, the equations used below have a range of applicability well beyond 'inverse problems': they can be used, for instance, to predict the values of observations in a realistic situation where the parameters describing the Earth model are not 'given', but only known approximately. In fact, I like to think of an 'inverse' problem as merely a 'measurement'. A measurement that can be quite complex, but the basic principles and the basic equations to be used are the same for a relatively complex 'inverse problem' as for a relatively simple 'measurement'.

¹ For instance, we could fit our observations with a heterogeneous but isotropic Earth model or, alternatively, with a homogeneous but anisotropic Earth.

8.2.1 Model Parameters and Observable Parameters

Although the separation of all the variables of a problem into two groups may sometimes be artificial, we take this point of view here, since it allows us to propose a simple setting for a wide class of problems. We may have in mind a given physical system, like the whole Earth, or a small crystal under our microscope.
The system (or a given state of the system) may be described by assigning values to a given set of parameters $m = \{m^1, m^2, \dots, m^{N_M}\}$ that we will name the model parameters. Let us assume that we make observations on this system. Although we are interested in the parameters $m$, they may not be directly observable, so we may make some indirect measurement, like obtaining seismograms at the Earth's surface for analyzing the Earth's interior, or making spectroscopic measurements for analyzing the chemical properties of a crystal. The set of (directly) observable parameters (or, by language abuse, the set of data parameters) will be represented by $d = \{d^1, d^2, \dots, d^{N_D}\}$.

We assume that we have a physical theory that solves the forward problem, i.e., that, given an arbitrary model $m$, allows us to predict the theoretical data values $d$ that an ideal measurement should produce (if $m$ was the actual system). The generally nonlinear function that associates to any model $m$ the theoretical data values $d$ may be represented by a notation like

$$ d^i \;=\; f^i(m^1, m^2, \dots, m^{N_M}) \quad ; \quad i = 1, 2, \dots, N_D , \qquad (8.7) $$

or, for short,

$$ d \;=\; f(m) . \qquad (8.8) $$

In fact, it is this expression that separates the whole set of our parameters into the subsets $d$ and $m$, as sometimes there is no difference of nature between the parameters in $d$ and the parameters in $m$. For instance, in the classical inverse problem of estimating the hypocenter coordinates of an earthquake, we may put in $d$ the arrival times of the seismic waves at some seismic observatories, and we need to put in $m$ the coordinates of the observatories (as these are parameters needed to compute the travel times), although we estimate arrival times of waves as well as coordinates of the observatories using similar types of measurements.
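The hypocenter example just mentioned suggests a toy forward function $d = f(m)$. The station geometry and uniform wave speed below are inventions of this sketch, not data from the text:

```python
import numpy as np

# A toy forward function d = f(m), in the spirit of the hypocenter example:
# m gathers the (x, y) epicenter coordinates and the origin time t0, and d
# gathers the arrival times at four fixed stations, for a uniform speed.
stations = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
v = 5.0            # km/s, assumed uniform (placeholder value)

def forward(m):
    """m = (x, y, t0)  ->  predicted arrival times at the stations."""
    x, y, t0 = m
    dist = np.sqrt(((stations - np.array([x, y])) ** 2).sum(axis=1))
    return t0 + dist / v

print(forward((5.0, 5.0, 0.0)))   # equidistant event: four equal times
```

Note that `m` here mixes a geometrical parameter and a time, and the stations' coordinates could equally well have been placed in `m`, illustrating the remark that the split between $d$ and $m$ is set by the expression $d = f(m)$ itself.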
8.2.2 A Priori Information on Model Parameters

In a typical geophysical problem, the model parameters contain geometrical parameters (positions and sizes of geological bodies) and physical parameters (values of the mass density, of the elastic parameters, the temperature, the porosity, etc.). The a priori information on these parameters is all the information we possess independently of the particular measurements that will be considered as 'data' (to be described below). This probability distribution is, generally, quite complex, as the model space may be high-dimensional, and the parameters may have nonstandard probability densities. To this, generally complex, probability distribution over the model space corresponds a volumetric probability that we denote $\rho_m(m)$.

If an explicit expression for the volumetric probability $\rho_m(m)$ is known, then it can be used in analytical developments. But such an explicit expression is by no means necessary. All that is needed is a set of probabilistic rules that allows us to generate samples of $\rho_m(m)$ in the model space (random samples distributed according to $\rho_m(m)$).

Example 8.1 Gaussian a priori information. Of course, the simplest example of a probability distribution is the Gaussian (or 'normal') distribution. Not many physical parameters accept the Gaussian as a probabilistic model (we have, in particular, seen that many positive parameters are Jeffreys parameters, for which the simplest consistent volumetric probability is not the normal, but the lognormal). But if we have chosen the right parameters (for instance, taking the logarithms of all Jeffreys parameters), it may happen that the Gaussian probabilistic model is acceptable. We then have

$$ \rho_m(m) \;=\; k \exp\left( -\tfrac{1}{2}\, (m - m_{\rm prior})^T\, C_{\rm prior}^{-1}\, (m - m_{\rm prior}) \right) . \qquad (8.9) $$

When this Gaussian volumetric probability is used, $m_{\rm prior}$, the center of the Gaussian, is called the 'a priori model', while $C_{\rm prior}$ is called the 'a priori covariance matrix'.
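Samples of the Gaussian prior (8.9) are straightforward to draw. The two-parameter mean and covariance below are placeholders chosen for this illustration:

```python
import numpy as np

# Drawing random samples of the Gaussian prior (8.9).  The mean vector and
# covariance matrix are placeholder values, not taken from the text.
rng = np.random.default_rng(2)
m_prior = np.array([1.0, -2.0])
C_prior = np.array([[1.0, 0.6],
                    [0.6, 2.0]])

samples = rng.multivariate_normal(m_prior, C_prior, size=50_000)
print(samples.mean(axis=0))            # ≈ m_prior
print(np.cov(samples, rowvar=False))   # ≈ C_prior
```

Such a sampler is exactly the "set of probabilistic rules" invoked above: it produces random models distributed according to $\rho_m(m)$ without ever evaluating (8.9) explicitly.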
The name 'a priori model' is dangerous, as for large-dimensional problems the average model may not be a good representative of the models that can be obtained as samples of the distribution (see figure 8.27 as an example). Other usual sources of prior information are the ranges and distributions of media properties in the rocks, or probabilities for the localization of media discontinuities. If the information refers to marginals of the model parameters, and does not include the description of relations across model parameters, the prior volumetric probability reduces to a product of univariate densities, $\rho_m(m) = \prod_i \rho_i(m^i)$. The next example illustrates this case. [End of example.]

Example 8.2 Prior information for a 1D mass density model. We consider the problem of describing a model consisting of a stack of horizontal layers with variable thickness and uniform mass density. The prior information is shown in figure 8.2, involving marginal distributions of the mass density and the layer thickness. Spatial statistical homogeneity is assumed, hence the marginals do not depend on depth in this example. Additionally, they are independent of the parameters of neighboring layers. The model parameters consist of a sequence of thicknesses and a sequence of mass density parameters, $m = \{\ell_1, \ell_2, \dots, \ell_{N_L}, \rho_1, \rho_2, \dots, \rho_{N_L}\}$. The marginal prior probability densities for the layer thicknesses are all assumed to be identical, and of the form (exponential volumetric probability)

$$ f(\ell) \;=\; \frac{1}{\ell_0} \exp\left( -\frac{\ell}{\ell_0} \right) , \qquad (8.10) $$

where the constant $\ell_0$ has the value $\ell_0 = 4$ km (see the left of figure 8.2), while all the marginal prior probability densities for the mass density are also assumed to be identical, and of the form (lognormal volumetric probability)

$$ g(\rho) \;=\; \frac{1}{\sqrt{2\pi}\, \sigma\, \rho} \exp\left( -\frac{1}{2\sigma^2} \left( \log \frac{\rho}{\rho_0} \right)^{\!2} \right) , \qquad (8.11) $$

where $\rho_0 = 3.98$ g/cm³ and $\sigma = 0.58$ (see the right of figure 8.2).
Assuming that the probability distribution of any layer thickness is independent of the thicknesses of the other layers, that the probability distribution of any mass density is independent of the mass densities of the other layers, and that layer thicknesses are independent of mass densities, the a priori volumetric probability in this problem is the product of the a priori probability densities (equations 8.10 and 8.11) for each parameter,

    \rho_m(m) = \rho_m(\ell_1, \ell_2, \dots, \ell_{NL}, \rho_1, \rho_2, \dots, \rho_{NL}) = k \prod_{i=1}^{NL} f(\ell_i)\, g(\rho_i) .    (8.12)

Figure 8.3 shows (pseudo)random models generated according to this probability distribution. Of course, the explicit expression 8.12 has not been used to generate these random models. Rather, consecutive layer thicknesses and consecutive mass densities have been generated using the univariate probability densities defined by equations 8.10 and 8.11. [End of example.]

Figure 8.2: At left, the probability density for the layer thickness (in km). At right, the probability density for the mass density (in g/cm^3).

Figure 8.3: Three random Earth models generated according to the a priori probability density in the model space. (Axes: depth in km, mass density in g/cm^3.)

8.2.3 Measurements and Experimental Uncertainties

Note: the text that was here has been moved to section 5.8.

8.2.4 Joint 'Prior' Probability Distribution in the (M, D) Space

We have just seen that the a priori information on model parameters can be described by a volumetric probability in the model space, ρ_m(m), and that the result of measurements can be described by a volumetric probability in the data space, ρ_d(d).
As by 'a priori' information on model parameters we mean information obtained independently of the measurements, we can multiply these two volumetric probabilities (see section 2.5.5 on Independent Probability Distributions) to define a joint volumetric probability in the X = (M, D) space,

    \rho(x) = \rho(m, d) = \rho_m(m)\, \rho_d(d) .    (8.13)

Although we have introduced ρ_m(m) and ρ_d(d) separately, and we have suggested building a probability distribution in the (M, D) space by the multiplication 8.13, we may face the more general situation where the information we have on m and on d is not independent. So, in what follows, let us assume that we have some information in the (M, D) space, represented by the volumetric probability ρ(x) = ρ(m, d), and let us regard equation 8.13 as just a special case.

8.2.5 Physical Laws

Physics analyzes the correlations existing between physical parameters. In standard mathematical physics, these correlations are represented by 'equalities' between physical parameters (as when we write f = m a to relate the force f applied to a particle, the mass m of the particle, and the acceleration a). In the context of inverse problems, this corresponds to assuming that we have a function from the 'parameter space' to the 'data space' that we may represent as

    d = d(m) .    (8.14)

We do not mean that the relation is necessarily explicit. Given m, we may need to solve a complex system of equations in order to get d, but this nevertheless defines a function m → d = d(m). At this point, given the volumetric probability ρ(m, d) and given the relation d = d(m), one may wish to define the associated conditional volumetric probability.
But we have emphasized in chapter 2 that there is no way to define a conditional volumetric probability given only an equation like d = d(m): we must, in addition, specify a metric over the (M, D) space^2, which we may denote here by

    g(m, d) = \begin{pmatrix} g_m(m) & 0 \\ 0 & g_d(d) \end{pmatrix} ,    (8.15)

where, to simplify the exposition, I assume the special case where the metric partitions into a metric g_m(m) in the model space M and a metric g_d(d) in the data space D.

[Footnote 2: Or, at least, in the vicinity of the submanifold d = d(m).]

8.2.6 Inverse Problems

In the X = (M, D) space, we have the volumetric probability ρ(m, d), and we have the hypersurface defined by the relation d = d(m). We can 'combine' these two kinds of information by using the conditional volumetric probability deduced from ρ(m, d) on the hypersurface d = d(m) (see equation 2.127),

    \sigma_m(m) = k\, \rho(m, d(m))\, \frac{ \sqrt{ \det\left( g_m + D^T g_d D \right) } }{ \sqrt{ \det g_m } } ,    (8.16)

where D = D(m) is the matrix of partial derivatives, with components D^i{}_\alpha = \partial d^i / \partial m^\alpha, where g_m = g_m(m), and where g_d = g_d(d(m)). The probability of a finite domain A of the model space is then evaluated as

    P(A) = \int_A dm^1 \wedge \dots \wedge dm^{N_M}\, \sqrt{\det g_m}\; \sigma_m(m) .    (8.17)

Example 8.3 In the particular case where

    \rho(m, d) = \rho_m(m)\, \rho_d(d) ,    (8.18)

equation 8.16 becomes

    \sigma_m(m) = k\, \rho_m(m)\, \rho_d(d(m))\, \frac{ \sqrt{ \det\left( g_m + D^T g_d D \right) } }{ \sqrt{ \det g_m } } ,    (8.19)

where, again, D = D(m), g_m = g_m(m), and g_d = g_d(d(m)). [End of example.]

Example 8.4 The conditional volumetric probability has been defined by taking an 'orthogonal limit'. Should one have some reason to prefer the 'vertical limit', it can be obtained here by formally taking the limit g_d → 0. Then, equation 8.19 simplifies into

    \sigma_m(m) = k\, \rho_m(m)\, \rho_d(d(m)) ,    (8.20)

where the partial derivatives D do not appear. [End of example.]

Example 8.5 Gaussian Case.
Let us examine here how equation 8.19 simplifies when assuming that the 'input' probability densities are Gaussian, and that the weight matrices (inverses of the covariance matrices) are the metric matrices (note: explain this, and give here the argument that the accuracy of a theory is, ultimately, the accuracy of the experiments used to control it):

    \rho_m(m) = k \exp\left( -\tfrac{1}{2}\, (m - m_{\rm prior})^T g_m (m - m_{\rm prior}) \right) ,    (8.21)

    \rho_d(d) = k \exp\left( -\tfrac{1}{2}\, (d - d_{\rm obs})^T g_d (d - d_{\rm obs}) \right) .    (8.22)

Equation 8.19 then gives

    \sigma_m(m) = k\, \frac{ \sqrt{ \det\left( g_m + D(m)^T g_d D(m) \right) } }{ \sqrt{ \det g_m } } \exp\left( -\tfrac{1}{2} \left[ (m - m_{\rm prior})^T g_m (m - m_{\rm prior}) + (d(m) - d_{\rm obs})^T g_d (d(m) - d_{\rm obs}) \right] \right)    (8.23)

(the constant factor \sqrt{\det g_m} has been left for subsequent simplifications). Defining the misfit

    S(m) = -\log \frac{ \sigma_m(m) }{ \sigma_0 } ,    (8.24)

where σ_0 is an arbitrary value of σ_m(m), gives, up to an additive constant,

    S(m) = S_1(m) - S_2(m) ,    (8.25)

where S_1(m) is the usual least-squares misfit function,

    2\, S_1(m) = (m - m_{\rm prior})^T g_m (m - m_{\rm prior}) + (d(m) - d_{\rm obs})^T g_d (d(m) - d_{\rm obs}) ,    (8.26)

and where (as \log \sqrt{A} = \tfrac{1}{2} \log A)

    2\, S_2(m) = \log \frac{ \det\left( g_m + D(m)^T g_d D(m) \right) }{ \det g_m } .    (8.27)

The maximum likelihood point is defined as the point where the volumetric probability is maximum^3. If γ denotes the gradient of the misfit,

    \gamma_\alpha = \frac{ \partial S }{ \partial m^\alpha } ,    (8.28)

then the steepest ascent direction is the vector \tilde\gamma defined through

    g_m\, \tilde\gamma = \gamma .    (8.29)

The algorithm

    m_{k+1} = m_k - \varepsilon_k\, \tilde\gamma_k ,    (8.30)

where ε_k is an ad hoc, well-chosen number, is called the algorithm of steepest descent; it converges to the maximum likelihood point (or, at least, to a local maximum). To ensure convergence, it is sufficient to use a descent direction, not necessarily the steepest one. This, in practice, allows two simplifications: (i) compute only an approximation to the gradient; (ii) use physical intuition to define directions that are better (for finite jumps) than the locally steepest one.
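The iteration 8.29–8.30 can be sketched in a few lines. The small linear problem below is hypothetical (illustrative matrices and a fixed step size), chosen so that the least-squares gradient is exact:

```python
import numpy as np

# A small hypothetical linear problem d = D m (illustrative numbers only).
D = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
g_m = np.eye(3)                      # model-space metric / prior weight matrix
g_d = np.eye(5)                      # data-space metric / data weight matrix
m_prior = np.zeros(3)
d_obs = np.array([1.5, -4.0, -1.0, 0.5, 1.0])

m = m_prior.copy()
for _ in range(500):
    # gradient of the least-squares misfit S1 (exact in the linear case)
    gamma = g_m @ (m - m_prior) + D.T @ g_d @ (D @ m - d_obs)
    # steepest-ascent direction: solve g_m gamma_tilde = gamma (equation 8.29)
    gamma_tilde = np.linalg.solve(g_m, gamma)
    # equation 8.30, with a fixed (hypothetical) step epsilon_k = 0.1
    m = m - 0.1 * gamma_tilde
```

At convergence, m approaches the maximum likelihood point, which in this linear case solves (g_m + D^T g_d D) m = g_m m_prior + D^T g_d d_obs.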
In many applications, it is the gradient of S_1(m) that is computed, not that of S(m) = S_1(m) - S_2(m), and this gradient is approximated by dropping the derivatives of D(m) (i.e., the second derivatives of d(m)). One then has

    \gamma_k \approx g_m (m_k - m_{\rm prior}) + D_k^T g_d (d_k - d_{\rm obs}) ,

where D_k = D(m_k) and d_k = d(m_k). Using the relation between the gradient and the steepest ascent direction (equation 8.29), this gives^4

    g_m\, \tilde\gamma_k \approx g_m (m_k - m_{\rm prior}) + D_k^T g_d (d_k - d_{\rm obs}) .    (8.31)

The two equations 8.30–8.31 encapsulate the algorithm of steepest descent. Once the algorithm has converged, if the volumetric probability σ_m(m) is approximated by a Gaussian centered on the maximum likelihood point, then the weight matrix (inverse of the covariance matrix) of the Gaussian is (see equation 2.144)

    \tilde g_m = g_m + D^T g_d D .    (8.32)

[End of example.]

Example 8.6 If the relation solving the forward problem, d = d(m), happens to be linear,

    d = D\, m ,    (8.33)

then the volumetric probability σ_m(m) in equation 8.23 becomes^5

    \sigma_m(m) = k \exp\left( -\tfrac{1}{2} \left[ (m - m_{\rm prior})^T g_m (m - m_{\rm prior}) + (D m - d_{\rm obs})^T g_d (D m - d_{\rm obs}) \right] \right) .    (8.34)

As the argument of the exponential is a quadratic function of m, we can write it in standard form,

    \sigma_m(m) = k \exp\left( -\tfrac{1}{2}\, (m - \tilde m)^T\, \tilde g_m\, (m - \tilde m) \right) ,    (8.35)

implying that σ_m(m) is a Gaussian volumetric probability.

[Footnote 3: Unfortunately, many authors define, inconsistently, the maximum likelihood point as the point where the probability density is maximum.]
[Footnote 4: Of course, one could equivalently write \tilde\gamma_k \approx (m_k - m_{\rm prior}) + g_m^{-1} D_k^T g_d (d_k - d_{\rm obs}), but, numerically, it is usually much better to solve a linear system than to evaluate the inverse of a matrix. This may be important in large-dimensioned spaces.]
[Footnote 5: The last multiplicative factor in equation 8.23 is a constant that can be integrated into the constant k.]
The values m̃ and g̃_m of the center and the weight matrix (inverse of the covariance matrix), respectively, of the Gaussian representing the a posteriori information in the model space can be computed using certain matrix identities (see, for instance, Tarantola, 1987, problem 1.19). For the weight matrix, this gives

    \tilde g_m = g_m + D^T g_d D ,    (8.36)

and the central point m̃ is obtained via

    \tilde g_m\, (\tilde m - m_{\rm prior}) = D^T g_d\, (d_{\rm obs} - D\, m_{\rm prior}) .    (8.37)

Let us introduce the covariance matrices

    C_m = g_m^{-1} ; \quad \tilde C_m = \tilde g_m^{-1} ; \quad C_d = g_d^{-1} .    (8.38)

An equation equivalent to 8.36 is

    \tilde C_m = C_m - C_m D^T \left( D\, C_m D^T + C_d \right)^{-1} D\, C_m ,    (8.39)

while an equation equivalent to 8.37 is

    \tilde m - m_{\rm prior} = C_m D^T \left( D\, C_m D^T + C_d \right)^{-1} (d_{\rm obs} - D\, m_{\rm prior}) .    (8.40)

[End of example.]

Example 8.7 If, in the context of the previous example, we do not have any a priori information on the model parameters, then C_m → ∞ I, i.e., g_m → 0. In this case,

    \tilde g_m = D^T g_d D ,    (8.41)

and equation 8.37 simplifies to

    \tilde m = \left( D^T g_d D \right)^{-1} D^T g_d\, d_{\rm obs} .    (8.42)

[End of example.]

Example 8.8 In the context of the previous example, let us explore the very special circumstance where we have the same number of 'data' and 'unknowns', i.e., the case where the matrix D is square. Assume that the matrix is regular, so that its inverse exists. It is easy to see that equation 8.42 then becomes

    \tilde m = D^{-1} d_{\rm obs} .    (8.43)

We see that in this special case, m̃ is just the Cramer solution of the linear equation d_obs = D m̃. [End of example.]

The formulas in the examples above give expressions that contain analytic parts. What we write as d = d(m) may sometimes correspond to an explicit expression; sometimes it may correspond to the solution of an implicit equation^6.
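The two forms of the linear Gaussian solution (the 'weight matrix' form 8.36–8.37 and the 'covariance matrix' form 8.38–8.40) can be checked against each other numerically. A sketch, with hypothetical matrices chosen only to make the check concrete:

```python
import numpy as np

# Hypothetical small linear Gaussian problem (illustrative numbers only).
D = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [1.0, 1.0, 0.0]])
g_m = np.diag([1.0, 2.0, 1.0])       # prior weight matrix
g_d = np.eye(3)                      # data weight matrix
m_prior = np.array([0.5, 0.0, 1.0])
d_obs = np.array([2.0, -1.0, 0.0])

# 'Weight matrix' form: equations 8.36 and 8.37
g_m_post = g_m + D.T @ g_d @ D
m_post = m_prior + np.linalg.solve(g_m_post, D.T @ g_d @ (d_obs - D @ m_prior))

# 'Covariance matrix' form: equations 8.38-8.40
C_m = np.linalg.inv(g_m)
C_d = np.linalg.inv(g_d)
K = C_m @ D.T @ np.linalg.inv(D @ C_m @ D.T + C_d)   # common factor of 8.39-8.40
C_m_post = C_m - K @ D @ C_m                         # equation 8.39
m_post_2 = m_prior + K @ (d_obs - D @ m_prior)       # equation 8.40
```

Both routes give the same center and the same posterior covariance; which one is cheaper depends on whether the model space or the data space has the smaller dimension.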
Should d = d(m) be an explicit expression, and should the 'prior probability densities' ρ_m(m) and ρ_d(d) (or the joint ρ(m, d)) also be given by explicit expressions (as when we have Gaussian probability densities), then the formulas of this section give explicit expressions for the posterior volumetric probability σ_m(m). If the relation d = d(m) is linear, then the expression giving σ_m(m) can sometimes be simplified easily (as with the linear Gaussian case examined above). More often than not, the relation d = d(m) is a complex nonlinear relation, and the expression we are left with for σ_m(m) is explicit, but complex.

Once the volumetric probability σ_m(m) has been defined, there are different ways of 'using' it. If the 'model space' M has a small number of dimensions (say, between one and four), the values of σ_m(m) can be computed at every point of a grid, and a graphical representation of σ_m(m) can be attempted. A visual inspection of such a representation is usually worth a thousand 'estimators' (central estimators or estimators of dispersion). But, of course, if the values of σ_m(m) are known at all significant points, these estimators can also be computed. This point of view is emphasized in section ??. If the 'model space' M has a large number of dimensions (say, from five to many millions or billions), then an exhaustive exploration of the space is not possible, and we must turn to Monte Carlo sampling methods to extract information from σ_m(m). We discuss the application of Monte Carlo methods to inverse problems in section 8.3.6. Finally, optimization techniques are discussed in section 8.3.7.

[Footnote 6: Practically, it may correspond to the output of some 'black box' solving the 'forward problem'.]

8.3 Appendixes

8.3.1 Appendix: Short Bibliographical Review

For a long time, scientists have estimated parameters using optimization techniques. Laplace explicitly stated the least absolute values criterion.
This criterion and the least squares criterion were later popularized by Gauss (1809). While Laplace and Gauss were mainly interested in overdetermined problems, Hadamard (1902, 1932) introduced the notion of the 'ill-posed problem', which can be seen as an underdetermined problem. For seismologists, the first bona fide solution of an inverse problem was the estimation of the hypocenter coordinates of an earthquake using the 'Geiger method' (Geiger, 1910), which present-day computers have made practical. In fact, seismologists have been the originators of the theory of inverse problems (for data interpretation), and this is because the problem of understanding the structure of the Earth's interior using only surface data is a difficult one. The first uses of Monte Carlo theory to obtain Earth models were made by Keilis-Borok and Yanovskaya (1967) and by Press (1968). At about the same time, Backus and Gilbert, and Backus alone, in the years 1967–1970, made original contributions to the theory of inverse problems, focusing on the problem of obtaining an unknown function from discrete data. Although the resulting mathematical theory is quite beautiful, its initial predominance over the more 'brute force' (but more powerful) Monte Carlo theory was possibly due to the quite limited capacities of the computers at that time. It is our feeling that Monte Carlo methods will play a more important role in the future (and this is the reason why we put emphasis on these methods in this article). Interesting contributions to the theory were made by Wiggins (1969), with his method of suppressing 'small eigenvalues', and by Franklin (1970), who introduced the right mathematical setting for the Gaussian, functional (i.e., infinite-dimensional) inverse problem (see also Lehtinen et al., 1989).
The 3-D tomography of the Earth, using travel times of seismic waves, was developed by Keiiti Aki and his coworkers in a couple of well-known papers (Aki and Lee, 1976; Aki, Christofferson and Husebye, 1977). Minster and Jordan (1978) applied the theory of inverse problems to the reconstruction of tectonic plate motions, introducing the concept of 'data importance'. In an interesting paper, Rietsch (1977) made a nontrivial use of the notion of a 'noninformative' (homogeneous, in our terminology) a priori distribution for positive parameters. Jackson (1979) made an explicit introduction of a priori information in the context of linear inverse problems, an approach that was generalized by Tarantola and Valette (1982) to nonlinear problems. There are three monographs in the area of inverse problems (from the viewpoint of data interpretation). In Tarantola (1987), the general, probabilistic formulation for nonlinear inverse problems is proposed. The small book by Menke (1984) is easy to read. Finally, Parker (1994) exposes his view of the general theory of linear problems. From time to time, some authors try to resuscitate the Laplacian 'least absolute values criterion' (and this is good). Claerbout and Muir (1973), for instance, show that the use of the ℓ1-norm can accommodate erratic data, and Djikpéssé and Tarantola (1999) used the ℓ1-norm in a large-scale inverse problem involving seismic waveforms (a seismic reflection experiment). Recently, interest in Monte Carlo methods for the solution of inverse problems has been increasing. Mosegaard and Tarantola (1995) proposed a generalization of the Metropolis algorithm for the analysis of general inverse problems, introducing explicitly a priori probability distributions, and they applied the theory to a synthetic numerical example. Monte Carlo analysis was recently applied to real-data inverse problems by Mosegaard et al. (1997), Dahl-Jensen et al.
(1998), Mosegaard and Rygaard-Hjalsted (1999), and Khan et al. (2000).

8.3.2 Appendix: Example of an Ideal (Although Complex) Geophysical Inverse Problem

Assume we wish to explore a complex medium, like the Earth's crust, using elastic waves. Figure 8.4 suggests an Earth model and a set of seismograms produced by the waves generated by an earthquake (or an artificial source). The seismometers (not represented) may be at the Earth's surface or inside boreholes. Although only four seismograms are displayed, actual experiments may generate thousands or millions of them. The problem here is to use a set of observed seismograms to infer the structure of the Earth.

Figure 8.4: A set of observed seismograms (at the right) is to be used to infer the structure of the Earth (at the left). A couple of trees suggest a scale (the numbers could correspond to meters), although the same principle can be used for global Earth tomography. (Panel title: 'An Earth Model and a Set of Observed Seismograms'.)

The first step is to define the set of parameters to be used to represent an Earth model. These parameters have to qualitatively correspond to the ideas we have about the Earth's interior: thickness and curvature of the geological layers, position and dip of the geological faults, etc. Inside the bodies so defined, different types of rocks will correspond to different values of some geophysical quantities (volumetric mass, elastic rigidity, porosity, etc.). These quantities, which have a smooth space variation (inside a given body), may be discretized by considering a grid of points, by using a discrete basis of functions to represent them, etc. If the source of seismic waves is not perfectly known (this is always the case if the source is an earthquake), then the parameters describing the source also belong to the 'model parameter set'.
A given Earth model (including the source of the waves), then, will consist of a huge set of values: the numerical values of all the parameters being used in the description. For instance, we may use the parameters m = {m1, m2, ..., mM} to describe an Earth model, where M may be a small number (for simple 1D models) or a large number (in the millions or billions for complex 3D models). Then, we may consider an 'Earth model number one', denoted m_1, an 'Earth model number two', denoted m_2, and so on.

Now, what is a seismogram? It is, in fact, one of the components of a vectorial function s(t) that depends on the vectorial displacement r(t) of the particles 'at the point' where the seismometer is located. Given the manufacturing parameters of the seismometers, it is possible to calculate the 'output' (seismogram) s(t) that corresponds to a given 'input' (soil displacement) r(t). In some loose sense, the instrument acts as a 'nonlinear filter' (the nonlinearity coming from the possible saturation of the sensors for large values of the input, or from their insensitivity to small values). While the displacement of the soil is measured, say, in micrometers, the output of the seismometer, typically an electric tension, is measured, say, in millivolts. In our digital era, seismograms are not recorded as 'functions'. Rather, a discrete value of the output is recorded with a given frequency (for instance, one value every millisecond). A seismogram set consists, then, of a large number of (discrete) values, say, s_{iam} = ( s_i(t_a) )_m, representing the value at time t_a of the i-th component of the m-th seismogram. Such a seismogram set is what is schematically represented at the right of figure 8.4. For our needs, the particular structure of a seismogram set is not interesting, and we will simply represent such a set using the notation d = {d1, d2, . . .
, dN}, where the number N may range in the thousands (if we only have one seismogram) or in the trillions for global Earth data or data from seismic exploration for minerals. An exact theory then defines a function d = f(m): given an arbitrary Earth model m, the associated theoretical seismograms d = f(m) can be computed. A 'theory' able to predict seismograms has to encompass the whole way between the Earth model and the instrument output, the millivolts. An 'exact theory' would define a functional relationship d = f(m) associating, to any Earth model m, a precisely defined point in the data space. This theory would essentially consist of the theory of elastic waves in anisotropic and heterogeneous media, perhaps modified to include attenuation, nonlinear effects, the description of the recording instrument, etc. As mentioned elsewhere [note: where?], there are many reasons why a 'theory' is not an exact functional relationship but, rather, a conditional volumetric probability ϑ(d|m). [Note: explain this better.] Realistic estimations of this probability distribution may be extremely complex. Sometimes we may limit ourselves to 'putting uncertainty bars around a functional relation', as suggested in section 7.2.2. Then, for instance, using a Gaussian model, we may write

    \vartheta(d|m) = k \exp\left( -\tfrac{1}{2}\, (d - f(m))^T\, C_T^{-1}\, (d - f(m)) \right) ,    (8.44)

where the uncertainty on the predicted data point d = f(m) is described by the 'theory covariance operator' C_T. With a simple probability model like this one, or by any other means, it is assumed that the conditional volumetric probability ϑ(d|m) is defined. Then, given any point m representing an Earth model, we should be able to sample the volumetric probability ϑ(d|m), i.e., to obtain as many samples (specimens) of d as we may wish. Figure 8.5 gives a schematic illustration of this.

Assume that we have not yet collected the seismograms.
At this moment, the information we have on the Earth is called 'a priori' information. As explained elsewhere [note: say where], it may always be represented by a probability distribution over the model parameter space, corresponding to a volumetric probability ρ_m(m). The expression of this volumetric probability is, in realistic problems, never explicitly known. Let us see this in some detail. In some very simple situations, we may have an 'average a priori model' m_prior and a priori uncertainties that can be modeled by a Gaussian distribution with covariance operator C_m. Then,

    \rho_m(m) = k \exp\left( -\tfrac{1}{2}\, (m - m_{\rm prior})^T\, C_m^{-1}\, (m - m_{\rm prior}) \right) .    (8.45)

Other probability models (Laplace, Pareto, etc.) may, of course, be used. In more realistic situations, the a priori information we have over the model space is not easily expressible as an explicit expression of a volumetric probability. Rather, a large set of rules, some of them probabilistic, is expressed. Already, the very definition of the parameters contains fundamental topological information (the type of objects being considered: geological layers, faults, etc.). Then, we may have rules of the type 'a sedimentary layer may never be below a layer of igneous origin' or 'with probability 2/3, a layer with a thickness larger than D is followed by a layer with a thickness smaller than d', etc.

Figure 8.5: Given an arbitrary Earth model m, a (non-exact) theory gives a probability distribution for the data, ϑ(d|m), that can be sampled, producing the sets of seismograms shown here. (Panel title: 'Theoretical Sets of Seismograms (inside theoretical uncertainties)'.)
There are also explicit volumetric probabilities, like 'the joint volumetric probability for porosity π and rigidity µ for a calcareous layer is g(π, µ) = . . .'. They may come from statistical studies made using large petrophysical data banks, or from qualitative 'Bayesian' estimations of the correlations existing between different parameters.

Figure 8.6: Samples of the a priori distribution of Earth models, each accompanied by the predicted set of seismograms. A set of rules, some deterministic, some random, is used to randomly generate Earth models. These are assumed to be samples from a probability distribution over the model space corresponding to a volumetric probability ρ_m(m) whose explicit expression may be difficult to obtain. But it is not this expression that is required for proceeding with the method, only the possibility of obtaining as many samples of it as we may wish. Although a large number of samples may be necessary to grasp all the details of a probability distribution, as few as the six samples shown here already provide some elementary information. For instance, there are always five geological layers, separated by smooth interfaces. In each model, all four interfaces are dipping 'leftwards' or all four are dipping 'rightwards'. These observations may be confirmed, and other properties become conspicuous, as more and more samples are displayed. The theoretical sets of seismograms associated with each model, displayed at right, are as different as the models are different. These are only 'schematic' seismograms, bearing no relation to any actual situation.
The fundamental hypothesis of the approach that follows is that we are able to use this set of rules to randomly generate Earth models, and as many as we may wish. Figure 8.6 suggests the results obtained using such a procedure. On a computer screen, when the models are displayed one after the other, we have a 'movie'. A geologist (knowing nothing about mathematics) should, when observing such a movie for long enough, agree with a sentence like the following: all models displayed are possible models; the more likely models appear quite frequently; some unlikely models appear, but infrequently; if we wait long enough, we may well reach a model that is arbitrarily close to the actual Earth. This means that (i) we have described the a priori information by defining a probability distribution over the model space, and (ii) we are sampling this probability distribution, even if an expression for the associated volumetric probability ρ_m(m) has not been developed explicitly.

Assume now that we collect a data set, i.e., in our example, the set of seismograms generated by a given set of earthquakes, or by a given set of artificial sources. In the notation introduced above, a given set of seismograms corresponds to a particular point d in the data space.
As any measurement has attached uncertainties, rather than 'a point' in the data space we have, as explained elsewhere [note: say where], a probability distribution in the data space, corresponding to a volumetric probability ρ_d(d). The simplest examples of probability distributions in the data space are obtained when using simple probability models. For instance, the assumption of Gaussian uncertainties would give

    \rho_d(d) = k \exp\left( -\tfrac{1}{2}\, (d - d_{\rm obs})^T\, C_d^{-1}\, (d - d_{\rm obs}) \right) ,    (8.46)

where d_obs represents the 'observed data values', with 'experimental uncertainties' described by the covariance operator C_d. As always, other probability models may, of course, be used. Actual experimental uncertainties are quite difficult to model. [Note: develop here this notion, and explain, here or somewhere, what 'noise' is in a data set (unmodeled signal).] [Note: explain that figure 8.7 represents a few samples of 'data points' generated according to ρ_d(d).]

Note: explain here that from

    \sigma(d, m) = k\, \rho(d, m)\, \vartheta(d, m) , \quad \rho(d, m) = \rho_d(d)\, \rho_m(m) , \quad \vartheta(d, m) = \vartheta(d|m)\, \vartheta_m(m) = k'\, \vartheta(d|m) ,    (8.47)

it follows that

    \sigma(d, m) = k\, \rho_d(d)\, \vartheta(d|m)\, \rho_m(m) .    (8.48)

Assume that we are able to generate a random walk that samples the a priori probability distribution of Earth models, ρ_m(m) (we have seen above how to do this; see also section XXX). Consider the following algorithm:

1. Initialize the algorithm at an arbitrary point (m_1, d_1), the first 'accepted' point.
2. Relabel the last accepted point (m_n, d_n). Given m_n, use the rules that sample the volumetric probability ρ_m(m) to generate a candidate point m_c.
3. Given m_c, randomly generate a sample data point according to the volumetric probability ϑ(d|m_c), and name it d_c.
4. Compare the values ρ_d(d_n) and ρ_d(d_c), and decide to accept or to reject the candidate point d_c according to the logistic or to the Metropolis rule (or any equivalent rule). If the candidate point is accepted, set (m_{n+1}, d_{n+1}) = (m_c, d_c) and go to 2.
If the candidate point is rejected, set (m_{n+1}, d_{n+1}) = (m_n, d_n) and go to 2.

Figure 8.7: We have one 'observed set of seismograms', together with a description of the uncertainties in the data. The corresponding probability distribution may be complex (correlation of uncertainties, non-Gaussianity of the noise, etc.). Rather than plotting the 'observed set of seismograms', pseudorandom realizations of the probability distribution in the data space are displayed here. (Panel title: 'Acceptable Sets of Seismograms (inside experimental uncertainties)'.)

[Note: explain here that figure 8.8 shows some samples of the a posteriori probability distribution.]

Figure 8.8: Samples of the a posteriori distribution of Earth models, each accompanied by the predicted set of seismograms. Note that, contrary to what happens with the a priori samples, all the models presented here have 'left-dipping interfaces'. The second layer is quite thin. Etc. (Panel title: 'Samples of the a posteriori Distribution, and Seismograms'.)

[Note: the marginal for m corresponds to the same 'movie', just looking at the models and disregarding the data sets. Reciprocally, the marginal for d . . .
]

'Things' can be considerably simplified if uncertainties in the theory can be neglected (i.e., if the 'theory' is assumed to be exact):

    \vartheta(d|m) = \delta( d - f(m) ) .    (8.49)

Then, the marginal for m, \sigma_m(m) = \int dV_d(d)\, \sigma(d, m), is, using 8.48,

    \sigma_m(m) = k\, \rho_m(m)\, \rho_d( f(m) ) .    (8.50)

The algorithm proposed above then simplifies to:

1. Initialize the algorithm at an arbitrary point m_1, the first 'accepted' point.
2. Relabel the last accepted point m_n. Use the rules that sample the volumetric probability ρ_m(m) to generate a candidate point m_c.
3. Compute d_c = f(m_c).
4. Compare the values ρ_d(d_n) and ρ_d(d_c), and decide to accept or to reject the candidate point d_c according to the logistic or to the Metropolis rule (or any equivalent rule). If the candidate point is accepted, set m_{n+1} = m_c and go to 2. If the candidate point is rejected, set m_{n+1} = m_n and go to 2.

[Note: explain that both algorithms require the resolution of the 'forward problem'.] [Note: explain that the initial point cannot be completely arbitrary.] [Note: the validity of the algorithm with the conditional probability inside has not been demonstrated.] [Note: develop these notions.]

8.3.3 Appendix: Probabilistic Estimation of Earthquake Locations

Earthquakes generate waves, and the arrival times of the waves at a network of seismic observatories carry information on the location of the hypocenter. This information is better understood by a direct examination of the probability density f(X, Y, Z) defined by the arrival times, rather than by just estimating a particular location (X, Y, Z) and the associated uncertainties. Provided that a 'black box' is available that rapidly computes the travel times to the seismic stations from any possible location of the earthquake, this probabilistic approach can be relatively efficient.
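The simplified sampling algorithm above (exact theory, equation 8.50) fits in a few lines of code; it is this kind of sampler that could also be applied to a location problem like the one treated in this appendix. A minimal sketch in which every numerical ingredient is hypothetical (a one-parameter forward relation f(m) = m², a Gaussian ρ_d, a random-walk prior sampler that leaves a wide uniform prior invariant):

```python
import numpy as np

rng = np.random.default_rng(3)

def forward(m):
    """Hypothetical exact forward relation d = f(m), one-parameter toy problem."""
    return m ** 2

d_obs, s_d = 4.0, 0.5          # assumed observed datum and its uncertainty

def rho_d(d):
    """Gaussian data volumetric probability (up to a constant factor)."""
    return np.exp(-0.5 * ((d - d_obs) / s_d) ** 2)

def prior_step(m):
    """Step 2: rules that sample the prior. Here, a symmetric random walk,
    which samples a wide uniform prior (an assumption of this sketch)."""
    return m + rng.normal(0.0, 0.25)

m = 1.0                        # step 1: arbitrary starting point
rho_curr = rho_d(forward(m))
chain = []
for _ in range(20000):
    m_c = prior_step(m)                # step 2: candidate from the prior sampler
    rho_c = rho_d(forward(m_c))        # step 3: solve the forward problem
    # step 4: Metropolis rule on the data volumetric probability
    if rho_c >= rho_curr or rng.random() * rho_curr < rho_c:
        m, rho_curr = m_c, rho_c
    chain.append(m)

posterior = np.array(chain[5000:])     # discard a burn-in period
```

With these numbers the chain equilibrates near m = 2, where f(m) = d_obs; started near m = −2 it would sample the other mode instead, illustrating that such a random walk explores σ_m(m) = k ρ_m(m) ρ_d(f(m)) only locally when modes are well separated.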
This appendix shows that it is quite trivial to write a computer code that uses this probabilistic approach (much easier than to write a code using the traditional Geiger method, which seeks to obtain the ‘best’ hypocentral coordinates).

8.3.3.1 A Priori Information on Model Parameters

The ‘unknowns’ of the problem are the hypocentral coordinates of an earthquake7, {X, Z}, as well as the origin time T. We assume to have some a priori information about the location of the earthquake, as well as about its origin time. This a priori information is assumed to be represented using the probability density

    ρm(X, Z, T) .    (8.51)

Because we use Cartesian coordinates and Newtonian time, the homogeneous probability density is just a constant,

    µm(X, Z, T) = k .    (8.52)

For consistency, we must assume (rule 4.8) that the limit of ρm(X, Z, T) for infinite ‘dispersions’ is µm(X, Z, T).

Example 8.9 We assume that the a priori probability density for (X, Z) is constant inside the region 0 < X < 60 km, 0 < Z < 50 km, and that the (unnormalizable) probability density for T is constant. [End of example.]

8.3.3.2 Data

The data of the problem are the arrival times {t1, t2, t3, t4} of the seismic waves at a set of four seismic observatories whose coordinates are {xi, zi}. The measurement of the arrival times will produce a probability density

    ρd(t1, t2, t3, t4)    (8.53)

over the ‘data space’. As these are Newtonian times, the associated homogeneous probability density is constant:

    µd(t1, t2, t3, t4) = k .    (8.54)

For consistency, we must assume (rule 4.8) that the limit of ρd(t1, t2, t3, t4) for infinite ‘dispersions’ is µd(t1, t2, t3, t4).

7 To simplify, here, we consider a 2D flat model of the Earth, and use Cartesian coordinates.
Example 8.10 Assuming Gaussian, independent uncertainties, we have

    ρd(t1, t2, t3, t4) = k exp( −(t1 − t1obs)²/(2σ1²) ) exp( −(t2 − t2obs)²/(2σ2²) )
                           × exp( −(t3 − t3obs)²/(2σ3²) ) exp( −(t4 − t4obs)²/(2σ4²) ) .    (8.55)

[End of example.]

8.3.3.3 Solution of the Forward Problem

The forward problem consists in calculating the arrival times ti as a function of the hypocentral coordinates {X, Z} and the origin time T:

    ti = f i(X, Z, T) .    (8.56)

Example 8.11 Assuming that the velocity of the medium is constant, equal to v,

    ti_cal = T + √( (X − xi)² + (Z − zi)² ) / v .    (8.57)

8.3.3.4 Solution of the Inverse Problem

Note: explain here that ‘putting all this together’,

    σm(X, Z, T) = k ρm(X, Z, T) ρd(t1, t2, t3, t4) |_{ti = f i(X,Z,T)} .    (8.58)

8.3.3.5 Numerical Implementation

To show how simple it is to implement an estimation of the hypocentral coordinates using the solution given by equation 8.58, we give, in extenso, all the commands that are necessary for the implementation, using a commercial mathematical software (Mathematica). Unfortunately, while it is perfectly possible, using this software, to explicitly use quantities with their physical dimensions, the plotting routines require adimensional numbers. This is why the dimensions have been suppressed in what follows. We use kilometers for the space positions and seconds for the time positions. We start by defining the geometry of the seismic network (the vertical coordinate z is oriented with positive sign upwards):

    x1 = 5;  z1 = 0;
    x2 = 10; z2 = 0;
    x3 = 15; z3 = 0;
    x4 = 20; z4 = 0;

The velocity model is simply defined, in this toy example, by giving its constant value (5 km/s):

    v = 5;

The ‘data’ of the problem are those of example 8.10.
Explicitly:

    t1OBS = 30.3; s1 = 0.1;
    t2OBS = 29.4; s2 = 0.2;
    t3OBS = 28.6; s3 = 0.1;
    t4OBS = 28.3; s4 = 0.1;

    rho1[t1_] := Exp[ - (1/2) (t1 - t1OBS)^2/s1^2 ]
    rho2[t2_] := Exp[ - (1/2) (t2 - t2OBS)^2/s2^2 ]
    rho3[t3_] := Exp[ - (1/2) (t3 - t3OBS)^2/s3^2 ]
    rho4[t4_] := Exp[ - (1/2) (t4 - t4OBS)^2/s4^2 ]

    rho[t1_,t2_,t3_,t4_] := rho1[t1] rho2[t2] rho3[t3] rho4[t4]

Although an arbitrarily complex velocity model could be considered here, let us take, for solving the forward problem, the simple model in example 8.11:

    t1CAL[X_, Z_, T_] := T + (1/v) Sqrt[ (X - x1)^2 + (Z - z1)^2 ]
    t2CAL[X_, Z_, T_] := T + (1/v) Sqrt[ (X - x2)^2 + (Z - z2)^2 ]
    t3CAL[X_, Z_, T_] := T + (1/v) Sqrt[ (X - x3)^2 + (Z - z3)^2 ]
    t4CAL[X_, Z_, T_] := T + (1/v) Sqrt[ (X - x4)^2 + (Z - z4)^2 ]

The posterior probability density is just that defined in equation 8.58:

    sigma[X_,Z_,T_] := rho[t1CAL[X,Z,T],t2CAL[X,Z,T],t3CAL[X,Z,T],t4CAL[X,Z,T]]

We should have multiplied by the ρm(X, Z, T) defined in example 8.9, but as this just corresponds to a ‘trimming’ of the values of the probability density outside the ‘box’ 0 < X < 60 km, 0 < Z < 50 km, we can do this afterwards. The defined probability density is 3D, and we could try to represent it. Instead, let us just represent the marginal probability densities. First, we ask the software to evaluate analytically the space marginal:

    sigmaXZ[X_,Z_] = Integrate[ sigma[X,Z,T], {T,-Infinity,Infinity} ];

This gives a complicated result, with hypergeometric functions8.

8 Typing sigmaXZ[X,Z] presents the result.

Representing this probability density is easy, as we just need to type the command

    ContourPlot[-sigmaXZ[X,Z],{X,15,35},{Z,0,-25}, PlotRange->All,PlotPoints->51]

The result is represented in figure 8.9 (while the level lines are those directly produced by the software, there has been some additional editing to add the labels). When using ContourPlot, we change the sign of sigma, because we wish to reverse the software’s convention of using light colors for positive values.
We have chosen the right region of the space to be plotted (significant values of sigma) by a preliminary plotting of ‘all’ the space (not represented here). Should we have some a priori probability density on the location of the earthquake, represented by a probability density f(X, Z), then the theory says that we should multiply the density just plotted by f(X, Z). For instance, if we have the a priori information that the hypocenter is above the level z = −10 km, we just set to zero everything below this level in the figure just plotted. Let us now evaluate the marginal probability density for the time, by typing the command

    sigmaT[T_] := NIntegrate[ sigma[X,Z,T], {X,0,+60}, {Z,0,+50} ]

Here, we ask Mathematica NOT to try to evaluate the result analytically, but to perform a numerical computation (as we have checked that no analytical result is found). We use the ‘a priori information’ that the hypocenter must be inside the region 0 < X < 60 km, 0 < Z < 50 km by limiting the integration domain to that area (see example 8.9). To represent the result, we enter the commands

    p = Table[0,{i,1,400}];
    Do[ p[[i]] = sigmaT[i/10.] , {i,100,300}]
    ListPlot[ p,PlotJoined->True, PlotRange->{{100,300},All}]

and the produced result is shown (after some editing) in figure 8.10. The software was not very stable in producing the results of the numerical integration.

Figure 8.9: The probability density for the location of the hypocenter. Its asymmetric shape is quite typical, as seismic observatories tend to be asymmetrically placed. (The labels in the figure recall the data, t1obs = (30.3 ± 0.1) s, t2obs = (29.4 ± 0.2) s, t3obs = (28.6 ± 0.1) s, t4obs = (28.3 ± 0.1) s, and the velocity model, v = 5 km/s.)

Figure 8.10: The marginal probability density for the origin time. The asymmetry seen in the probability density in figure 8.9, where the decay of probability is slow downwards, translates here into significant probabilities for early times. The sharp decay of the probability density for t < 17 s does not come from the values of the arrival times, but from the a priori information that the hypocenters must be above the depth Z = −50 km.

8.3.3.6 An Example of Bimodal Probability Density for an Arrival Time

As an exercise, the reader could reformulate the problem replacing the assumption of Gaussian uncertainties in the arrival times by multimodal probability densities. For instance, figure 5.6 suggested the use of a bimodal probability density for the reading of the arrival time of a seismic wave. Using the Mathematica software, the command

    rho[t_] := (If[8.0<t<8.8,5,1] If[9.8<t<10.2,10,1])

defines a probability density that, when plotted using the command

    Plot[ rho[t],{t,7,11} ]

produces the result displayed in figure 8.11.

Figure 8.11: In figure 5.6 it was suggested that the probability density for the arrival time of a seismic phase may be multimodal. This is just an example to show that it is quite easy to define such multimodal probability densities in computer codes, even if they are not analytic.

8.3.4 Appendix: Functional Inverse Problems

8.3.4.1 Introduction

As mentioned in section 2.2, the main concern of this text is with discrete problems, i.e., problems where the number of data/parameters is finite. When functions are involved, it was assumed that a sampling of the function could be made that was fine enough for subsequent refinements of the sampling to have no effect on the results. This, of course, means replacing any step (Heaviside) function by a sort of discretized Erf function9. The limit of a very steep Erf function being the step function, any functional operation involving the Erf will have as its limit the same functional operation involving the step (unless very pathological problems are considered).
The major reason for this limitation is that probability theory is easily developed in finite-dimensional spaces, but not in infinite-dimensional spaces. In fact, the only practical infinite-dimensional probability theory, where ‘measures’ are replaced by ‘cylinder measures’, is nothing but the assumption that the probabilities calculated have a well behaved limit when the dimensions of the space tend to infinity. Then, the ‘cylinder measure’ or ‘probability’ of a region of the infinite-dimensional space is defined as the limit of the probability calculated in a finite-dimensional subspace, when the dimensions of this subspace tend to infinity.

There are, nevertheless, some parcels of the theory whose generalization to the infinite-dimensional case is possible and well understood. For instance, infinite-dimensional Gaussian probability distributions have been well studied. This is not surprising, because the random realizations of an infinite-dimensional Gaussian probability distribution are L2 functions, la crème de la crème of the functions.

Most of what will be said here will concern L2 functions10, and the formulas presented will be the functional equivalent of the least-squares formalism developed above for discrete problems. In fact, most results will be valid for Lp functions. The difference, of course, between an L2 space and an Lp space is the existence of a scalar product in the L2 spaces, a scalar product intimately related, as we will see, with the covariance operator typical of Gaussian probability distributions.

We face here an unfortunate fact that plagues some mathematical literature: the abuse of the term ‘adjoint operator’ where the simple ‘transpose operator’ would suffice. As we will see below, the transpose of a linear operator is something as simple as the original operator (like the transpose of a matrix is as simple as the original matrix), but the adjoint of an operator is a different thing.
It is defined only in spaces that have a scalar product (i.e., in L2 spaces), and depends essentially on the particular scalar product of the space. As the scalar product is, usually, nontrivial (it will always involve covariance operators in our examples), the adjoint operator is generally an object more complex than the transpose operator. What we need, for using optimization methods in functional spaces, is to be able to define the norm of a function, and the transpose of an operator, so the ideal setting is that of Lp spaces. Unfortunately, most mathematical results that, in fact, are valid for Lp, are demonstrated only for L2.

The steps necessary for the solution of an inverse problem involving functions are: (i) definition of the functional norms; (ii) definition of the (generally nonlinear) application between parameters and data (forward problem); (iii) calculation of its tangent linear application (characterized by a linear operator); (iv) understanding of the transpose of this operator; (v) setting an iterative procedure that leads to the function minimizing the norm of the ‘misfit’. Let us see here the main mathematical points to be understood prior to any attempt of ‘functional inversion’. There are not many good books on functional analysis; probably the best is the ‘Introduction to Functional Analysis’ by Taylor and Lay (1980).

8.3.4.2 The Functional Spaces Under Investigation

A seismologist may consider a (three-component) seismogram

    u = { ui(t) ; i = 1, 2, 3 ; t0 ≤ t ≤ t1 } ,    (8.59)

representing the displacement of a given material point of an elastic body, as a function of time.

9 The Erf function, or error function, is the primitive of a Gaussian. It is a simple example of a ‘sigmoidal’ function.
10 Grossly speaking, a function f(x) belongs to L2 if ‖f‖ = ( ∫ dx f(x)² )^{1/2} is finite. A function f(x) belongs to Lp if ‖f‖ = ( ∫ dx |f(x)|^p )^{1/p} is finite. The limit for p → ∞ corresponds to the L∞ space.
She/he may wish to define the norm of the function (in fact, of ‘the set of three functions’) u, denoted ‖u‖, as

    ‖u‖² = ∫_{t0}^{t1} dt ui(t) ui(t) ,    (8.60)

where, as usual, ui ui stands for the Euclidean scalar product. The space of all the elements u where this norm ‖u‖ is finite is, by definition, an L2 space.

This plain example is here to warn against wrong definitions of norm. For instance, we may measure a resistivity-versus-depth profile

    ρ = { ρ(z) ; z0 ≤ z ≤ z1 } ,    (8.61)

but it will generally not make sense to define

    ‖ρ‖² = ∫_{z0}^{z1} dz ρ(z)²    (bad definition) .    (8.62)

For the resistivity-versus-depth profile is equivalent to the conductivity-versus-depth profile

    σ = { σ(z) ; z0 ≤ z ≤ z1 } ,    (8.63)

where, for any z, ρ(z) σ(z) = 1, and the definition of the norm

    ‖σ‖² = ∫_{z0}^{z1} dz σ(z)²    (bad definition)    (8.64)

would not be consistent with that of the norm ‖ρ‖ (we do not have, in general, any reason to assume that σ(z) should be ‘more L2’ than ρ(z), or vice versa). This is a typical example where the logarithmic variables r = log ρ/ρ0 and s = log σ/σ0 (where ρ0 and σ0 are arbitrary constants) allow the only sensible definition of norm,

    ‖r‖² = ‖s‖² = ∫_{z0}^{z1} dz r(z)² = ∫_{z0}^{z1} dz s(z)²    (good definition) ,    (8.65)

or, in terms of ρ and σ,

    ‖ρ‖² = ‖σ‖² = ∫_{z0}^{z1} dz ( log ρ(z)/ρ0 )² = ∫_{z0}^{z1} dz ( log σ(z)/σ0 )²    (good definition) .    (8.66)

We see that the right functional space for the resistivity ρ(z) or the conductivity σ(z) is not L2 but, to speak grossly, the exponential of L2. Although these examples concern the L2 norm, the same comments apply to any Lp norm. We will see below an example with the L1 norm.

8.3.4.3 Duality Product

Every time we define a functional space, and we start developing mathematical properties (for instance, analyzing the existence and uniqueness of solutions to partial differential equations), we face another function space, with the same degrees of freedom. For instance, in elastic theory we may define the strain field ε = {εij(x, t)}.
Another field will automatically appear, with the same variables (degrees of freedom), which, in this case, is the stress σ = {σij(x, t)}. The ‘contracted multiplication’ will consist in making the sum (over discrete indices) and the integral (over continuous variables) of the product of the two fields, as in

    ⟨ σ, ε ⟩ = ∫ dt ∫ dV(x) σij(x, t) εij(x, t) ,    (8.67)

where the sum over i, j is implicit. The space of strains and the space of stresses is just one example of dual spaces. When one space is called ‘the primal space’, the other one is called ‘the dual space’, but this is just a matter of convention. The product 8.67 is one example of a duality product, where one element of the primal space and one element of the dual space are ‘multiplied’ to form a scalar (that may be a real number or that may have physical dimensions). This implies the sum or the integral over the variables of the functions. Mathematicians say that ‘the dual of a space X is the space of all linear forms over X’. It is true that a given σ associates, to any ε, the number defined by equation 8.67, and that this association defines a linear application. But this rough definition of duality doesn’t help readers to understand the actual mathematical structure.

8.3.4.4 Scalar Product in L2 Spaces

When we consider a functional space, its dual appears spontaneously, and we can say that any space is always accompanied by its dual space (as in the strain-stress example seen above). Then, the duality product is always defined. Things are completely different with the scalar product, which is only defined sometimes. If, for instance, we consider functions f = {f(x)} belonging to a space F, the scalar product is a bilinear form that associates, to any pair of elements f1 and f2 of F, a number11 denoted ( f1, f2 ). Practically, to define a scalar product over a space F, we must first define a symmetric, positive definite operator C⁻¹ mapping F into its dual, F̂.
The dual of a function f = {f(x)}, which we may denote f̂ = {f̂(x)}, is then

    f̂ = C⁻¹ f .    (8.68)

11 It is usually a real number, but it may have physical dimensions.

The scalar product of two elements f1 and f2 of F is then defined as

    ( f1, f2 ) = ⟨ f̂1, f2 ⟩ = ⟨ C⁻¹ f1, f2 ⟩ .    (8.69)

In the context of an infinite-dimensional Gaussian process, some mean and some covariance are always defined. If, for instance, we consider functions f = {f(x)}, the mean function may be denoted f0 = {f0(x)} and the covariance function (the kernel of the covariance operator) may be denoted C = {C(x, x′)}. The space of functions we work with, say F, is the set of all the possible random realizations of such a Gaussian process with the given mean and the given covariance. The dual of F can here be identified with the image of F under C⁻¹, the inverse of the covariance operator (which is a symmetric, positive definite operator). So, denoting F̂ the dual of F, we can formally write F̂ = C⁻¹ F or, equivalently, F = C F̂. The explicit expression of the equation

    f = C f̂    (8.70)

is

    f(x) = ∫ dx′ C(x, x′) f̂(x′) .    (8.71)

Let us denote W the inverse of the covariance operator,

    W = C⁻¹ ,    (8.72)

which is usually named the weight operator. As C W = W C = I, its kernel W(x, x′), the weight function, satisfies

    ∫ dx′ W(x, x′) C(x′, x″) = ∫ dx′ C(x, x′) W(x′, x″) = δ(x − x″) ,    (8.73)

where δ( · ) is the Dirac delta ‘function’. Typically, the covariance function C(x, x′) is a smooth function; then, the weight function W(x, x′) is a distribution (a sum of Dirac delta ‘functions’ and their derivatives). Equations 8.70-8.71 can equivalently be written

    f̂ = W f    (8.74)

and

    f̂(x) = ∫ dx′ W(x, x′) f(x′) .    (8.75)

If the duality product between f̂1 and f2 is written ⟨ f̂1, f2 ⟩ = ∫ dx f̂1(x) f2(x), the scalar product, as defined by equation 8.69, becomes

    ( f1, f2 ) = ⟨ C⁻¹ f1, f2 ⟩ = ⟨ W f1, f2 ⟩    (8.76)

    = ∫ dx ( ∫ dx′ W(x, x′) f1(x′) ) f2(x) = ∫ dx ∫ dx′ f1(x) W(x, x′) f2(x′) .    (8.77)

The norm of f, denoted ‖f‖ and defined as

    ‖f‖² = ( f, f ) ,    (8.78)

is expressed, in this example, as

    ‖f‖² = ∫ dx ∫ dx′ f(x) W(x, x′) f(x′) .    (8.79)

This is the L2 norm of the function f(x) (the case where W(x, x′) = δ(x − x′) being a very special case).

One final remark. If f̂(x) is a random realization of a Gaussian white noise with zero mean, then the function f(x) defined by equation 8.71 is a random realization of a Gaussian process with zero mean and covariance function C(x, x′). This means that if the space F is the space of all the random realizations of a Gaussian process with covariance operator C, then its dual, F̂, is the space of all the realizations of a Gaussian white noise.

Example 8.12 Consider the covariance operator C, with covariance function C(x, x′),

    f = C f̂  ⟺  f(x) = ∫_{−∞}^{+∞} dx′ C(x, x′) f̂(x′) ,    (8.80)

in the special case where the covariance function is the exponential function,

    C(x, x′) = σ² exp( −|x − x′| / X ) ,    (8.81)

where X is a constant. The results of this example are a special case of those demonstrated in Tarantola (1987, page 572). The inverse covariance operator is

    f̂ = C⁻¹ f  ⟺  f̂(x) = (1/(2σ²)) ( (1/X) f(x) − X f̈(x) ) ,    (8.82)

where the double dot means second derivative. As noted above, if f(x) is a random realization of a Gaussian process having the exponential covariance function considered here, then the f̂(x) given by this equation is a random realization of a white noise. Formally, this means that the weighting function (kernel of C⁻¹) is W(x, x′) = (1/(2σ²)) ( (1/X) δ(x − x′) − X δ̈(x − x′) ). The squared norm of a function f(x) is obtained integrating by parts:

    ‖f‖² = ⟨ f̂, f ⟩ = (1/(2σ²)) ( (1/X) ∫_{−∞}^{+∞} dx f(x)² + X ∫_{−∞}^{+∞} dx ḟ(x)² ) .    (8.83)

This is the usual norm in the so-called Sobolev space H¹. [End of example.]

8.3.4.5 The Transposed Operator

Let G be a linear operator mapping a space E into a space F (we have in mind functional spaces, but the definition is general). We denote, as usual,

    G : E → F .    (8.84)

If e ∈ E and f ∈ F, then we write

    f = G e .    (8.85)
Let Ê and F̂ be the respective duals of E and F, and denote ⟨ · , · ⟩E and ⟨ · , · ⟩F the respective duality products. A linear operator H mapping the dual of F into the dual of E is named the transpose of G if for any f̂ ∈ F̂ and for any e ∈ E we have ⟨ f̂, G e ⟩F = ⟨ H f̂, e ⟩E, and, in this case, we use the notation H = Gᵀ. The whole definition then reads

    Gᵀ : F̂ → Ê ;    (8.86)

    ∀ e ∈ E , ∀ f̂ ∈ F̂ :  ⟨ f̂, G e ⟩F = ⟨ Gᵀ f̂, e ⟩E .    (8.87)

Example 8.13 The Transposed of a Matrix. Let us consider a discrete situation where

    f = G e  ⟺  f i = Σα G iα e α .    (8.88)

In this circumstance, the duality products in each space will read

    ⟨ f̂, f ⟩F = Σi f̂i f i ;  ⟨ ê, e ⟩E = Σα êα e α .    (8.89)

The linear operator H is the transpose of G if for any f̂ and for any e (equation 8.87)

    ⟨ f̂, G e ⟩F = ⟨ H f̂, e ⟩E ,    (8.90)

i.e., if

    Σi f̂i (G e)i = Σα (H f̂)α e α ,    (8.91)

or, explicitly,

    Σi f̂i ( Σα G iα e α ) = Σα ( Σi Hα i f̂i ) e α .    (8.92)

The condition can be written

    Σi Σα f̂i G iα e α = Σi Σα f̂i Hα i e α ,    (8.93)

and it is clear that this is true for any f̂ and for any e iff

    Hα i = G iα ,    (8.94)

i.e., if the matrix representing H is the transpose (in the elementary matricial sense) of the matrix representing G:

    H = Gᵀ .    (8.95)

This demonstrates that the abstract definition given above of the transpose of a linear operator is consistent with the matricial notion of transpose. [End of example.]

Example 8.14 The Transposed of the Derivative Operator. Let us consider a situation where

    v = D x  ⟺  v(t) = (dx/dt)(t) ,    (8.96)

i.e., where the linear operator D is the derivative operator. In this circumstance, the duality products in each space will typically read

    ⟨ v̂, v ⟩V = ∫_{t1}^{t2} dt v̂(t) v(t) ;  ⟨ x̂, x ⟩X = ∫_{t1}^{t2} dt x̂(t) x(t) .    (8.97)

If the linear operator Dᵀ has to be the transpose of D, for any v̂ and for any x we must have (equation 8.87)

    ⟨ v̂, D x ⟩V = ⟨ Dᵀ v̂, x ⟩X .    (8.98)

[End of example.]

Let us demonstrate that the derivative operator is an antisymmetric operator, i.e., that

    Dᵀ = −D .    (8.99)

To demonstrate this, we will need to make a restrictive condition, interesting to analyze. Using 8.99, equation 8.98 writes

    ∫_{t1}^{t2} dt v̂(t) (D x)(t) = − ∫_{t1}^{t2} dt (D v̂)(t) x(t) ,    (8.100)

i.e.,

    ∫_{t1}^{t2} dt v̂(t) (dx/dt)(t) + ∫_{t1}^{t2} dt (dv̂/dt)(t) x(t) = 0 .    (8.101)

We have to check if this equation holds for any x(t) and any v̂(t). The condition is equivalent to

    ∫_{t1}^{t2} dt ( v̂(t) (dx/dt)(t) + (dv̂/dt)(t) x(t) ) = 0 ,    (8.102)

i.e., to

    ∫_{t1}^{t2} dt (d/dt)( v̂(t) x(t) ) = 0 ,    (8.103)

or, using the elementary properties of the integral, to

    v̂(t2) x(t2) − v̂(t1) x(t1) = 0 .    (8.104)

In general, there is no reason for this to be true. So, in general, we cannot say that Dᵀ = −D. If the spaces of functions we work with (here, the space of functions v̂(t) and the space of functions x(t)) satisfy condition 8.104, it is said that the spaces satisfy dual boundary conditions. If the spaces satisfy dual boundary conditions, then it is true that Dᵀ = −D, i.e., that the derivative operator is antisymmetric. A typical example of dual boundary conditions being satisfied is the case where all the functions x(t) vanish at the initial time, and all the functions v̂(t) vanish at the final time:

    x(t1) = 0 ;  v̂(t2) = 0 .    (8.105)

The notation Dᵀ = −D is very suggestive. One has, nevertheless, to remember that (with the boundary conditions chosen), while D acts on functions x(t) that vanish at the initial time, Dᵀ acts on functions v̂(t) that vanish at the final time.

Consider now the operator D² (second derivative),

    γ(t) = (d²x/dt²)(t) .    (8.106)

Following the same lines of reasoning as above, the reader may easily demonstrate that the second derivative operator is symmetric, i.e., (D²)ᵀ = D², provided that the functional spaces under consideration satisfy the dual boundary condition

    γ̂(t2) (dx/dt)(t2) − (dγ̂/dt)(t2) x(t2) = γ̂(t1) (dx/dt)(t1) − (dγ̂/dt)(t1) x(t1) .    (8.107)

A typical example where this condition is satisfied is when we have

    x(t1) = 0 ;  (dx/dt)(t1) = 0 ;  γ̂(t2) = 0 ;  (dγ̂/dt)(t2) = 0 ,    (8.108)

i.e., when the functions x(t) have zero value and zero derivative at the initial time, and the functions γ̂(t) have zero value and zero derivative at the final time. This is the sort of boundary conditions found when working with the wave equation, as it contains second order time derivatives. Further details are given in section 8.3.4.7 below.

As an exercise, the reader may try to understand why the quite obvious property

    ( ∂/∂xi )ᵀ = − ∂/∂xi    (8.109)

corresponds, in fact, to the properties

    gradᵀ = −div ;  divᵀ = −grad    (8.110)

(hint: if an operator maps E into F, its transpose maps F̂ into Ê; the dual of a space has the same ‘variables’ as the original space).

Let us formally demonstrate that the operator representing the acoustic wave equation is symmetric. Starting from12

    L = (1/κ(x)) ∂²/∂t² − div (1/ρ(x)) grad ,    (8.111)

we have

    Lᵀ = ( (1/κ(x)) ∂²/∂t² )ᵀ − ( div (1/ρ(x)) grad )ᵀ .    (8.112)

Using the property (A B)ᵀ = Bᵀ Aᵀ, we arrive at

    Lᵀ = ( ∂²/∂t² )ᵀ ( 1/κ(x) )ᵀ − (grad)ᵀ ( 1/ρ(x) )ᵀ (div)ᵀ .    (8.113)

Now, (i) the transpose of a scalar is the scalar itself; (ii) the second derivative (as we have seen) is a symmetric operator; (iii) we have (as has been mentioned above) gradᵀ = −div and divᵀ = −grad. We then have

    Lᵀ = ∂²/∂t² (1/κ(x)) − div (1/ρ(x)) grad ,    (8.114)

and, as the uncompressibility κ is assumed to be independent of time,

    Lᵀ = (1/κ(x)) ∂²/∂t² − div (1/ρ(x)) grad = L ,    (8.115)

and we see that the acoustic wave operator is symmetric. As we have seen above, this conclusion has to be understood with the condition that the wavefields p(x, t) on which L acts satisfy boundary conditions that are dual to those satisfied by the fields p̂(x, t) on which Lᵀ acts.
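The integration-by-parts argument above has an exact discrete counterpart, worth seeing once in code. In the Python sketch below (the grid, the test functions, and the forward-difference choice are all arbitrary), grid functions satisfying the dual boundary conditions x(t1) = 0 and v(t2) = 0 obey a summation-by-parts identity, the discrete analogue of ⟨v̂, D x⟩ = ⟨−D v̂, x⟩:

```python
import math

n, dt = 200, 0.01
ts = [k * dt for k in range(n)]

x = [math.sin(3.0 * t) * t for t in ts]             # x vanishes at the initial time
v = [math.cos(2.0 * t) * (ts[-1] - t) for t in ts]  # v vanishes at the final time

def D(u):                                           # forward-difference derivative
    return [(u[k + 1] - u[k]) / dt for k in range(len(u) - 1)]

Dx, Dv = D(x), D(v)

# summation by parts: sum v_k (x_{k+1} - x_k) = - sum (v_{k+1} - v_k) x_{k+1}
lhs = dt * sum(v[k] * Dx[k] for k in range(n - 1))
rhs = -dt * sum(Dv[k] * x[k + 1] for k in range(n - 1))
```

Without the boundary conditions the two sides would differ by v(t2) x(t2) − v(t1) x(t1), exactly as in equation 8.104.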
Typically, the fields p(x, t) satisfy initial conditions of rest, and the fields p̂(x, t) satisfy final conditions of rest. Tarantola (1988) demonstrates that the transpose of the operator corresponding to the ‘wave equation with attenuation’ corresponds to the wave equation with ‘anti-attenuation’. But it has to be understood that any physical or numerical implementation of the operator Lᵀ is made ‘backwards in time’, so, in that sense of time, we face an ordinary attenuation: there is no difficulty in the implementation of Lᵀ.

Example 8.15 The Kernel of the Transposed Operator. If the explicit expression of the equation

    f = G e    (8.116)

is

    f(t) = ∫ dx G(t, x) e(x) ,    (8.117)

where G(t, x) is an ordinary function13, then it is said that G is an integral operator, and that the function G(t, x) is its kernel. [End of example.]

12 Here, and below, an expression like A B C means, as usual, A(B C). This means, for instance, that the div operator in equation 8.111 is to be understood as being applied not to 1/ρ(x) only, but to ‘everything at its right’.

The transpose of G will map an element f̂ into an element ê, these two elements belonging to the respective duals of the spaces where the elements e and f mentioned in equation 8.116 belong. An equation like

    ê = Gᵀ f̂    (8.118)

will correspond, explicitly, to

    ê(x) = ∫ dt Gᵀ(x, t) f̂(t) .    (8.119)

The reader may easily verify that the definition of the transpose operator imposes that the kernel of Gᵀ is related to the kernel of G by the simple expression

    Gᵀ(x, t) = G(t, x) .    (8.120)

We see that the kernels of G and of Gᵀ are, in fact, identical, via a simple ‘transposition’ of the variables.

8.3.4.6 The Adjoint Operator

Let G be a linear operator mapping a space E into a space F:

    G : E → F .    (8.121)

If e ∈ E and f ∈ F, then we write

    f = G e .    (8.122)
(8.122) If e ∈ E and f ∈ F , then we write Assume that both, E and F are furnished with an scalar product each (see section 8.3.4.4), that we denote, respectively, as ( e1 , e2 )E and ( f1 , f2 )F A linear operator H mapping F into E , is named the adjoint of G if for any f ∈ F and for any e ∈ E we have ( f , G e )F = ( H f , e )E , and, in this case, we use the notation H = G∗ . The whole definition then reads G∗ : F → E ∀e ∈ E ; ∀f ∈ F : ( f , G e )F = ( G∗ f , e )E . (8.123) (8.124) 13 If G(t, x) is a distribution (like the derivative of a Dirac’s delta) then equation 8.116 may be a disguised expression for a differential operator. 256 8.3 Let E and F be the respective duals of E and F , and denote · , · E and · , · F the respective duality products. We have seen above that a scalar product is defined through a symmetric, positive operator mapping a space into its dual. Then, as E and F are assumed to have a scalar product defined, there are two ‘covariance’ operators CE and CF such that the respective scalar products are given by ( e1 , e2 )E = ( f1 , f2 )F = Then, equation 8.124 writes C−1 e2 , e1 E C−1 f2 , f1 F C−1 f , G e F f , Ge F F E F . = C−1 G∗ f , e E = C−1 G∗ CF f , e E E (8.125) E , or, denoting f = C−1 f , F . (8.126) The comparison with equation 8.124 defining the transposed operator gives the relation between adjoint and transpose, GT = C−1 G∗ CF , that can be written, equivalently, as E G∗ = CE GT C−1 . F (8.127) The transposed operator is an elementary operator. Its definition only requires the existence of the dual of the considered spaces, that is automatic. If, for instance, a linear operator G has the kernel G(u, v ) , the transposed operator GT will have the kernel GT (v, u) = G(u, v ) . The adjoint operator is not an elementary operator. Its definition requires the existence of scalar products in the working spaces, that are necessarily defoned through symmetric, positive definite operators. 
This means that (except in degenerate cases) the adjoint operator is a complex object, depending on three elementary objects: this is how equation 8.127 is to be interpreted.

8.3.4.7 The Green Operator

The pressure field p(x, t) propagating in an elastic medium with uncompressibility modulus κ(x) and volumetric mass ρ(x) satisfies the ‘acoustic wave equation’

    (1/κ(x)) ∂²p/∂t² (x, t) − div( (1/ρ(x)) grad p(x, t) ) = S(x, t) .    (8.128)

Here, x denotes a point inside the medium (the coordinate system being still unspecified), t is the Newtonian time, and S(x, t) is a source function. To simplify the notations, the variables x and t will be dropped when there is no risk of confusion. For instance, the equation above will be written

    (1/κ) ∂²p/∂t² − div( (1/ρ) grad p ) = S .    (8.129)

Also, I shall denote p the function {p(x, t)} as a whole, and not its value at a given point of space and time. Similarly, S shall denote the source function S(x, t). For fixed κ(x) and ρ(x), the wave equation above can be written, for short,

    L p = S ,    (8.130)

where L is the second order differential operator defined through equation 8.129. In order to define a unique wavefield p, we have to prescribe some boundary and initial conditions. An example of those is, if we work inside the time interval (t1, t2), and inside a volume V bounded by the surface S,

    p(x, t1) = 0 ;  x ∈ V
    ṗ(x, t1) = 0 ;  x ∈ V
    p(x, t) = 0 ;  x ∈ S , t ∈ (t1, t2) .    (8.131)

Here, a dot means time derivative. With prescribed initial and boundary conditions, then, there is a one to one correspondence between the source field S and the wavefield p. The inverse of the wave equation operator, L⁻¹, is called the Green operator, and is denoted G:

    G = L⁻¹ .    (8.132)

We can then write

    L p = S  ⟺  p = G S .    (8.133)

As L is a differential operator, its inverse G is an integral operator. The kernel of the Green operator is named the Green function, and is usually denoted G(x, t; x′, t′).
The explicit expression for p = G S is then

    p(x, t) = ∫_V dV(x′) ∫_{t₁}^{t₂} dt′ G(x, t; x′, t′) S(x′, t′) .    (8.134)

It is easy to demonstrate (Footnote 14) that the wave equation operator L is a symmetric operator, so this is also true for the Green operator G . But we have seen that the transposed operators work in spaces which have dual boundary conditions (see section 8.14 above). Using the method outlined in section 8.14, the boundary conditions dual to those in equations 8.131 are

    p(x, t₂) = 0 ,   x ∈ V ;
    ṗ(x, t₂) = 0 ,   x ∈ V ;
    p(x, t) = 0 ,   x ∈ S , t ∈ (t₁, t₂) ,    (8.135)

i.e., we have final conditions of rest instead of initial conditions of rest (and the same surface condition). We have to understand that while the equation L p = S is associated to the boundary conditions 8.131, equations like

    L^T p̂ = Ŝ ;   p̂ = G^T Ŝ    (8.136)

are associated to the dual boundary conditions 8.135 (the hats here mean that the transposed operators operate in the dual spaces; see section 8.3.4.3). This being understood, we can write L^T = L and G^T = G , and rewrite equations 8.136 as

    L p̂ = Ŝ ;   p̂ = G Ŝ .    (8.137)

Footnote 14: This comes from the property that the derivative operator is antisymmetric (so that the second derivative is a symmetric operator), and from the properties grad^T = −div and div^T = −grad , mentioned in section 8.14.

The hats have to be maintained, to remember that the fields with a hat must satisfy boundary conditions dual to those satisfied by the fields without a hat. Using the transposed Green operator, we can write

    p̂(x, t) = ∫_V dV(x′) ∫_{t₁}^{t₂} dt′ G^T(x, t; x′, t′) Ŝ(x′, t′) .    (8.138)

[Some text is missing here.]
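On a finite grid, the Green operator becomes an ordinary matrix acting on a discretized source, and the duality between initial and final conditions of rest shows up as the duality between a causal (lower-triangular) matrix and its anti-causal transpose. A minimal sketch — the 'Green matrix' below is a hypothetical toy, not a discretization of equation 8.128:

```python
import numpy as np

# Toy causal 'Green matrix' on nt time steps: the response at time i
# depends only on sources acting at earlier times j <= i
# (initial conditions of rest).
nt = 5
G = np.tril(np.ones((nt, nt)))

S = np.zeros(nt)
S[1] = 1.0                 # impulsive source at time step 1

p = G @ S                  # forward field: nonzero only after the source
p_hat = G.T @ S            # transposed field: nonzero only before it
                           # (final conditions of rest)
print(p)                   # [0. 1. 1. 1. 1.]
print(p_hat)               # [1. 1. 0. 0. 0.]
```

The transpose thus propagates 'backwards in time', which is exactly the role of the dual boundary conditions 8.135.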
8.3.4.8 Born Approximation for the Acoustic Wave Equation

Let us start from equation 8.129, using the same notations:

    (1/κ) ∂²p/∂t² − div( (1/ρ) grad p ) = S .    (8.139)

I shall denote by p the function {p(x, t)} as a whole, and not its value at a given point of space and time. Similarly, κ and ρ will denote the functions {κ(x)} and {ρ(x)} . Given appropriate boundary and initial conditions, and given a source function, the acoustic wave equation defines an application {κ, ρ} → p = ψ(κ, ρ) , i.e., an application that associates to each medium {κ, ρ} the (unique) pressure field p that satisfies the wave equation (with given boundary and initial conditions).

Let p₀ be the pressure field propagating in the medium defined by κ₀ and ρ₀ , i.e., p₀ = ψ(κ₀, ρ₀) , and let p be the pressure field propagating in the medium defined by κ and ρ , i.e., p = ψ(κ, ρ) . Clearly, if κ and ρ are close (in a sense to be defined) to κ₀ and ρ₀ , then the wavefield p will be close to p₀ . Let us obtain an explicit expression for the first order approximation to p . This is known as the (first) Born approximation of the wavefield. Both κ and ρ could be perturbed, but I simplify the discussion here by considering only perturbations in the uncompressibility κ . The reader may easily obtain the general case.

The pressure inside an elastic fluid medium is (note: check if this sign is consistent with the sign given to the stress tensor elsewhere in the book)

    p = −(1/3) σᵏ_k .    (8.140)

So defined, the pressure may take positive or negative values, corresponding to an elastic medium that is compressed or stretched. In the terminology of section 2, this is a Cartesian quantity. Note: check what follows. Perhaps it is better to assume that the pressure P is a positive quantity, and to define

    p = P₀ log( P / P₀ ) ,    (8.141)

where P₀ is the 'ambient pressure'. For small pressure perturbations, we have

    p = P₀ log( 1 + (P − P₀)/P₀ ) ≈ P − P₀ .    (8.142)
The uncompressibility and the volumetric mass are positive, Jeffreys quantities. In most texts, the difference p − p₀ is calculated as a function of the difference κ − κ₀ , but we have seen that this is not the right way, as the resulting approximation will depend on the fact that we are using the uncompressibility κ(x) instead of the compressibility γ(x) = 1/κ(x) . At this point we may introduce the logarithmic parameters, and proceed trivially (note: explain why this is important). The logarithmic uncompressibilities for the reference medium and for the perturbed medium are

    κ₀* = log( κ₀ / K ) ;   κ* = log( κ / K ) ,    (8.143)

where K is an arbitrary constant (having the right physical dimension). Reciprocally,

    κ₀ = K exp κ₀* ;   κ = K exp κ* .    (8.144)

In particular, we have

    κ = κ₀ exp( δκ* ) ,    (8.145)

where

    δκ* = κ* − κ₀* = log( κ / κ₀ ) .    (8.146)

Note that we have here a perturbation δκ* of a logarithmic (Cartesian) quantity, not of the positive (Jeffreys) one. We also write

    p = p₀ + δp .    (8.147)

The reference solution satisfies

    (1/κ₀) ∂²p₀/∂t² − div( (1/ρ₀) grad p₀ ) = S ,    (8.148)

while the perturbed solution satisfies

    (1/κ) ∂²p/∂t² − div( (1/ρ₀) grad p ) = S .    (8.149)

In this equation, κ can be replaced by the expression 8.145, and p by the expression 8.147. Using then the first order approximation 1/κ = (1/κ₀) exp(−δκ*) ≈ (1/κ₀)(1 − δκ*) leads to

    ( 1/κ₀ − δκ*/κ₀ ) ( ∂²p₀/∂t² + ∂²δp/∂t² ) − div( (1/ρ₀) ( grad p₀ + grad δp ) ) = S .    (8.150)

Some of the terms in this equation correspond to the terms in the reference equation 8.148, and can be simplified. Keeping only first order terms then leads to

    (1/κ₀) ∂²δp/∂t² − div( (1/ρ₀) grad δp ) = (δκ*/κ₀) ∂²p₀/∂t² .    (8.151)

Explicitly, replacing δp = p − p₀ and δκ* = log(κ/κ₀) , this gives

    (1/κ₀) ∂²(p − p₀)/∂t² − div( (1/ρ₀) grad (p − p₀) ) = (1/κ₀) log(κ/κ₀) ∂²p₀/∂t² .    (8.152)

This is the equation we were looking for.
It says that the field p − p₀ satisfies the wave equation with the unperturbed value of the uncompressibility κ₀ , and is generated by the 'Born secondary source'

    S_Born = (1/κ₀) log(κ/κ₀) ∂²p₀/∂t² .    (8.153)

Had we made the development using the compressibility γ = 1/κ instead of the uncompressibility, we would have arrived at the secondary source

    S_Born = γ₀ log(γ₀/γ) ∂²p₀/∂t² ,    (8.154)

which is identical to the previous one. The expression obtained here for the secondary source is not the usual one, as it depends on the distance log(κ/κ₀) and not on the difference κ − κ₀ . An additive perturbation κ = κ₀ + δκ of the positive parameter κ would have led to the Born secondary source

    S_κ = (δκ/κ₀²) ∂²p₀/∂t² = ( (κ − κ₀)/κ₀² ) ∂²p₀/∂t² ,    (8.155)

while an additive perturbation γ = γ₀ + δγ of the positive parameter γ = 1/κ would have led to the Born secondary source

    S_γ = −δγ ∂²p₀/∂t² = (γ₀ − γ) ∂²p₀/∂t² ,    (8.156)

and these two sources are not identical. I mean here that their finite expressions are not identical; of course, in the limit of an infinitesimal perturbation they tend to be identical.

The approach followed here has two advantages. The first is mathematical consistency, in the sense that the secondary source is defined independently of the quantities used to make the computation (covariance of the results). The second is numerical: in a computation, the perturbations may be small, but they are finite. 'Large contrasts' in the parameters may give quite bad approximations when the differences are inserted in expressions 8.155 or 8.156, while the logarithmic expressions in the right Born source (equation 8.153 or 8.154) may remain good.

8.3.4.9 Tangent Application of Data With Respect to Parameters

In the context of an inverse problem, assume that we observe the pressure field p(x, t) at some points xⁱ inside the volume. The solution of the forward problem is obtained by solving the wave equation, or by using the Green's function.
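Before moving on, the invariance claimed for the logarithmic secondary source, and the discrepancy between the two additive ones, can be checked in a few lines. This is a hedged numerical sketch comparing only the scalar factors that multiply ∂²p₀/∂t² , for hypothetical parameter values:

```python
import math

kappa0, kappa = 1.0, 2.0          # hypothetical finite 'large contrast'
gamma0, gamma = 1.0/kappa0, 1.0/kappa

# Logarithmic secondary-source factors (scalar parts of 8.153 and 8.154):
s_log_kappa = (1.0/kappa0) * math.log(kappa/kappa0)
s_log_gamma = gamma0 * math.log(gamma0/gamma)

# Additive secondary-source factors (scalar parts of 8.155 and 8.156):
s_add_kappa = (kappa - kappa0) / kappa0**2
s_add_gamma = gamma0 - gamma

print(s_log_kappa, s_log_gamma)   # identical for any finite contrast
print(s_add_kappa, s_add_gamma)   # 1.0 vs 0.5: they disagree
```

For this contrast of a factor of two, the two additive factors disagree by a factor of two, while the two logarithmic factors coincide exactly.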
We are here interested in the tangent linear application. Let us write the first order perturbation δp(xⁱ, t) of the pressure wavefield produced when the logarithmic uncompressibility is perturbed by the amount δκ*(x) as (linear tangent application)

    δp = F δκ* ,    (8.157)

or, introducing the kernel of the Fréchet derivative F ,

    δp(xⁱ, t) = ∫_V dV(x′) F(xⁱ, t; x′) δκ*(x′) .    (8.158)

Let us express the kernel F(xⁱ, t; x′) . We have seen that a perturbation δκ* is equivalent, to first order, to the secondary Born source (equation 8.151)

    S_Born(x, t) = (δκ*(x)/κ₀(x)) p̈₀(x, t) .    (8.159)

Then, using the Green function,

    δp(xⁱ, t) = ∫_V dV(x′) ∫_{t₁}^{t₂} dt′ G(xⁱ, t; x′, t′) S_Born(x′, t′)
              = ∫_V dV(x′) ∫_{t₁}^{t₂} dt′ G(xⁱ, t; x′, t′) (δκ*(x′)/κ₀(x′)) p̈₀(x′, t′) .    (8.160)

The last expression can be rearranged into the form used in equation 8.158, thus showing that F(xⁱ, t; x′) is given by

    F(xⁱ, t; x′) = (1/κ₀(x′)) ∫_{t₁}^{t₂} dt′ G(xⁱ, t; x′, t′) p̈₀(x′, t′) .    (8.161)

This is the kernel of the Fréchet derivative of the data with respect to the parameter κ*(x) .

8.3.4.10 The Transpose of the Fréchet Derivative Just Computed

Now that we are able to understand the expression δp = F δκ* , let us face the dual problem. What is the meaning of an expression like

    δκ̂* = F^T δp̂ ?    (8.162)

Denoting by F^T(x′; xⁱ, t) the kernel of F^T , such an expression writes

    δκ̂*(x′) = Σᵢ ∫_{t₁}^{t₂} dt F^T(x′; xⁱ, t) δp̂(xⁱ, t) ,    (8.163)

but we know that the kernel of the transposed operator equals the kernel of the original operator, with variables transposed (note: say where this has been demonstrated), so that we can write this equation as

    δκ̂*(x′) = Σᵢ ∫_{t₁}^{t₂} dt F(xⁱ, t; x′) δp̂(xⁱ, t) ,    (8.164)

where F(xⁱ, t; x′) is the kernel given in equation 8.161.
Replacing the kernel by its expression gives

    δκ̂*(x′) = (1/κ₀(x′)) Σᵢ ∫_{t₁}^{t₂} dt ∫_{t₁}^{t₂} dt′ G(xⁱ, t; x′, t′) p̈₀(x′, t′) δp̂(xⁱ, t) ,    (8.165)

and this can be rearranged into (note that primed and nonprimed variables have been exchanged)

    δκ̂*(x) = (1/κ₀(x)) ∫_{t₁}^{t₂} dt ψ̂(x, t) p̈₀(x, t) ,    (8.166)

where

    ψ̂(x, t) = Σᵢ ∫_{t₁}^{t₂} dt′ G(xⁱ, t′; x, t) δp̂(xⁱ, t′) ,    (8.167)

or, using the kernel of the transposed Green operator,

    ψ̂(x, t) = Σᵢ ∫_{t₁}^{t₂} dt′ G^T(x, t; xⁱ, t′) δp̂(xⁱ, t′) .    (8.168)

(Note: explain here that this means that the field ψ̂(x, t) can be interpreted as the solution of the transposed wave equation, with a point source at each point xⁱ where we have a receiver, radiating the value δp̂(xⁱ, t′) . As we have the transposed Green operator, the field ψ̂(x, t) must satisfy dual boundary conditions, i.e., in our case, final conditions of rest.)

8.3.4.11 The Continuous Inverse Problem

Let p = f(κ*) be the function calculating the theoretical data associated to the model κ* (resolution of the forward problem). We seek the model minimizing the sum

    S(κ*) = ½ ( ‖ f(κ*) − p_obs ‖² + ‖ κ* − κ*_prior ‖² )
          = ½ ( ⟨ C_p⁻¹ ( f(κ*) − p_obs ) , f(κ*) − p_obs ⟩ + ⟨ C_κ*⁻¹ ( κ* − κ*_prior ) , κ* − κ*_prior ⟩ ) .    (8.169)

Using, in this functional context, the steepest descent algorithm proposed in section 8.3.7.4, we arrive at

    κ*_{n+1} = κ*_n − εₙ ( C_κ* Fₙ^T C_p⁻¹ ( pₙ − p_obs ) + ( κ*_n − κ*_prior ) ) ,    (8.170)

where pₙ = f(κ*_n) and where Fₙ^T is the transposed operator defined above, computed at point κ*_n . Covariances aside, we see that the fundamental object appearing in this inversion algorithm is the transposed operator F^T . As it has been interpreted above, we have all the elements to understand how this sort of inverse problem is solved. For more details, see Tarantola (1984, 1986, 1987).
8.3.5 Appendix: Nonlinear Inversion of Waveforms (by Charara & Barnes)

[Note: I plan to convince Marwan and Christophe to contribute to our book by writing this section (on a work that, unfortunately, has never been published).]

Figure 8.12: Geometry.
Figure 8.13: Observed seismograms (VSP WEST filtered real data). X component.
Figure 8.14: Observed seismograms (VSP WEST filtered real data). Z component.
Figure 8.15: Model. VP (P velocity).
Figure 8.16: Model. VS (S velocity).
Figure 8.17: Model. RHO (density).
Figure 8.18: Model. Q (log Qs).
Figure 8.19: Calculated seismograms (VSP WEST synthetic). X component.
Figure 8.20: Calculated seismograms (VSP WEST synthetic). Z component.
Figure 8.21: Residual seismograms. X component.
Figure 8.22: Residual seismograms. Z component.

8.3.6 Appendix: Using Monte Carlo Methods

[Note: Write a small introduction here.]

8.3.6.1 Basic Equations

The starting point could be the general equation 7.9,

    σ(m, d) = k ρ(m, d) ϑ(m, d) / µ(m, d) ,    (8.171)

combining the 'a priori' information ρ(m, d) with the 'theoretical' information ϑ(m, d) . We have seen in section 3 that if we are able to design a random walk that samples ρ(m, d) , then the Metropolis rule can be used to obtain a random walk that samples σ(m, d) . We have also seen that if we are not able to design a (primeval) random walk that samples ρ(m, d) , then we can start using a random walk that samples the homogeneous probability density µ(m, d) , or even an arbitrary (Footnote 15) probability density ψ(m, d) .

This point of view is very general, but more practical algorithms are obtained when we particularize. Let us consider, for instance, the explicit expression (equation ??) for σ_m(m) given in section 8.2.6:

    σ_m(m) = k ρ_m(m) φ(m) / µ_m(m) ,    (8.172)

where

    φ(m) = [ ( ρ_d(d) / µ_d(d) ) det( g_m(m) + F^T(m) g_d(d) F(m) ) ]_{d = f(m)} .    (8.173)

In this expression the matrix of partial derivatives F = F(m) , with components Fⁱ_α = ∂dⁱ/∂m^α , appears. The 'slope' F enters here because the steeper the slope for a given m , the greater the accumulation of points we will have with this particular m . This is because we use explicitly the analytic expression d = f(m) . One should realize that using the more general approach based on equation 8.171, the effect is automatically accounted for, and there is no need to explicitly consider the partial derivatives. In any case, equation 8.172 has the standard form of a conjunction of two probability densities, and is, therefore, ready to be integrated in a Metropolis algorithm.
But one should note that, contrary to many 'nonlinear' formulations of inverse problems, the partial derivatives F are needed, even if we use a Monte Carlo method. In some weakly nonlinear problems we have F^T(m) g_d(d) F(m) ≪ g_m(m) , and then

    φ(m) = µ_m(m) [ ρ_d(d) / µ_d(d) ]_{d = f(m)} ,    (8.174)

and equation 8.172 becomes

    σ_m(m) = k ρ_m(m) L(m) ,    (8.175)

where

    L(m) = [ ρ_d(d) / µ_d(d) ]_{d = f(m)} .    (8.176)

Footnote 15: Although, hopefully, not too different from µ(m, d) .

This expression is also ready for use in the Metropolis algorithm. In this way the sampling of the prior ρ_m(m) is modified into a sampling of the posterior σ_m(m) , and the Metropolis rule uses the 'likelihood function' L(m) (in fact, a volumetric probability) to calculate acceptance probabilities.

8.3.6.2 Sampling the Homogeneous Probability Distribution

If we do not have an algorithm that samples the prior probability density directly, the first step in a Monte Carlo analysis of an inverse problem is to design a random walk that samples the model space according to the homogeneous probability distribution µ(m) . In some cases this is easy, but in other cases only an algorithm (a primeval random walk) that samples a probability density ψ(m) ≠ µ(m) is available. Then the Metropolis rule can be used to modify ψ(m) into µ(m) . This way of generating samples from µ(m) is efficient if ψ(m) is close to µ(m) ; otherwise it may be very inefficient. Methods for designing primeval random walks are found in section 3.4. Once µ(m) can be sampled, the Metropolis rule allows us to modify this sampling into an algorithm that samples the prior.

8.3.6.3 Sampling the Prior Probability Distribution

The first step in the Monte Carlo analysis is to switch off the comparison between computed and observed data, thereby generating samples of the a priori probability density. This allows us to verify statistically that the algorithm is working correctly, and it allows us to understand the prior information we are using.
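As a toy illustration of the Metropolis rule just described — a hypothetical 1-D problem where a symmetric random walk plays the role of the prior sampler, and a Gaussian stands in for the likelihood L(m) ; this is a sketch, not the algorithm of any specific section:

```python
import math, random

random.seed(0)

def propose(m):
    # symmetric random-walk proposal (stands in for the prior sampler)
    return m + random.gauss(0.0, 0.5)

def likelihood(m):
    # hypothetical L(m): one datum d_obs = 1.0, uncertainty sigma = 0.2
    return math.exp(-0.5 * ((m - 1.0) / 0.2) ** 2)

m = 0.0
samples = []
for _ in range(20000):
    m_new = propose(m)
    # Metropolis rule: accept with probability min(1, L_new / L_old)
    if random.random() < likelihood(m_new) / max(likelihood(m), 1e-300):
        m = m_new
    samples.append(m)

mean = sum(samples) / len(samples)
print(mean)   # the walk concentrates near the observed value
```

The first iterations (the 'burn-in') still remember the starting point; in practice they are discarded before computing averages.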
We will refer to a large collection of models representing the prior probability distribution as the 'prior movie'. The more models present in this movie, the more accurate the representation of the prior probability density. If we are interested in smooth Earth models (knowing, e.g., that only smooth properties are resolved by the data), a smooth movie can be produced simply by smoothing the individual models of the original movie.

8.3.6.4 Sampling the Posterior Probability Distribution

If we now switch on the comparison between computed and observed data using, e.g., the Metropolis rule, the random walk sampling the prior distribution is modified into a walk sampling the posterior distribution. Again, smoothed versions of this 'posterior movie' can be generated by smoothing the individual models in the original, posterior movie.

Since data rarely put strong constraints on the Earth, the 'posterior movie' typically shows that many different models are possible. But even though the models in the posterior movie may be quite different, all of them predict data that agree with the observations within experimental uncertainties, i.e., all of them are models with high likelihood. In other words, we must accept that the data alone cannot single out a preferred model. The posterior movie allows us to perform a proper resolution analysis that helps us to choose between different interpretations of a given data set. Using the movie we can answer complicated questions about the correlations between several model parameters. To answer such questions, we can view the posterior movie and try to discover structure that is well resolved by the data. Such structure will appear as 'persistent' in the posterior movie. Another, more traditional, way of investigating resolution is to calculate covariances and higher order moments.
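Means, covariances, and higher moments can be estimated directly from the movie by plain averaging. In this minimal sketch the two-parameter 'movie' is synthetic — drawn from hypothetical Gaussians rather than produced by an actual sampler:

```python
import random

random.seed(1)

# Hypothetical two-parameter 'posterior movie' of N models; in practice
# these samples would come from the Metropolis random walk.
N = 50000
movie = [(random.gauss(2.0, 1.0), random.gauss(-1.0, 0.5)) for _ in range(N)]

# Posterior means and a covariance, as plain averages over the movie:
mean0 = sum(m[0] for m in movie) / N
mean1 = sum(m[1] for m in movie) / N
cov01 = sum((m[0] - mean0) * (m[1] - mean1) for m in movie) / N

print(mean0, mean1, cov01)
```

Here the two parameters were drawn independently, so the estimated covariance is close to zero; a real posterior movie would generally exhibit correlations.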
For this we need to evaluate integrals of the form

    R_f = ∫_A dm f(m) σ(m) ,    (8.177)

where f(m) is a given function of the model parameters and A is an event in the model space M containing the models we are interested in. For instance,

    A = { m | a given range of parameters in m is cyclic } .    (8.178)

In the special case when A = M is the entire model space and f(m) = mⁱ , the R_f in equation 8.177 equals the mean m̄ⁱ of the i-th model parameter mⁱ . If f(m) = (mⁱ − m̄ⁱ)(mʲ − m̄ʲ) , R_f becomes the covariance between the i-th and j-th model parameters. Typically, in the general inverse problem we cannot evaluate the integral in 8.177 analytically, because we have no analytical expression for σ(m) . However, from the samples m₁ , … , m_N of the posterior movie we can approximate R_f by the simple average

    R_f ≈ (1/N) Σ_{ {n | mₙ ∈ A} } f(mₙ) .    (8.179)

8.3.7 Appendix: Using Optimization Methods

As we have seen, the solution of an inverse problem essentially consists of a probability distribution over the space of all possible models of the physical system under study. In general, this 'model space' is high-dimensional, and the only general way to explore it is by using the Monte Carlo methods developed in section 3. If the probability distributions are 'bell-shaped' (i.e., if they look like a Gaussian or like a generalized Gaussian), then one may simplify the problem by calculating only the point around which the probability is maximum, together with an approximate estimation of the variances and covariances. This is the problem addressed in this section.

Among the many methods available to obtain the point at which a scalar function reaches its maximum value (relaxation methods, linear programming techniques, etc.), we limit our scope here to the methods using the gradient of the function, which we assume can be computed analytically or, at least, numerically.
For more general methods, the reader may have a look at Fletcher (1980, 1981), Powell (1981), Scales (1985), Tarantola (1987) or Scales et al. (1992).

8.3.7.1 Maximum Likelihood Point

Let us consider a space X , with a notion of volume element dV defined. If some coordinates x ≡ {x¹, x², …, xⁿ} are chosen over the space, the volume element has an expression dV(x) = v(x) dx , and each probability distribution over X can be represented by a probability density f(x) . For any fixed small volume ∆V , we can search for the point x_ML such that the probability dP of the small volume, when centered around x_ML , is maximum. In the limit ∆V → 0 this defines the maximum likelihood point.

The maximum likelihood point may be unique (if the probability distribution is monomodal), may be degenerate (if the probability distribution is 'roof-shaped'), or may be multiple (as when we have the sum of a few bell-shaped functions).

The maximum likelihood point is not the point at which the probability density is maximum: our definition imposes that what must be maximum is the ratio of the probability density to the function v(x) defining the volume element,

    x = x_ML   ⟺   F(x) = f(x)/v(x) maximum .    (8.180)

We recognize in the ratio F(x) = f(x)/v(x) the volumetric probability associated to the probability density f(x) (see equation ??). As the homogeneous probability density is µ(x) = k v(x) (see rule 4.2), we can equivalently define the maximum likelihood point by the condition

    x = x_ML   ⟺   f(x)/µ(x) maximum .    (8.181)

The point at which a probability density has its maximum is not x_ML .
In fact, the maximum of a probability density does not correspond to an intrinsic definition of a point: a change of coordinates x → y = ψ(x) would change the probability density f(x) into the probability density g(y) (obtained using the Jacobian rule), but the point of the space at which f(x) is maximum is not the same as the point of the space where g(y) is maximum (unless the change of variables is linear). This contrasts with the maximum likelihood point, as defined by equation 8.181, which is an intrinsically defined point: no matter which coordinates we use in the computation, we always obtain the same point of the space.

8.3.7.2 Misfit

One of the goals here is to develop gradient-based methods for obtaining the maximum of F(x) = f(x)/µ(x) . As a quite general rule, gradient-based methods perform quite poorly for (bell-shaped) probability distributions: when one is far from the maximum, the probability densities tend to be quite flat, and it is difficult to get, reliably, the direction of steepest ascent. Taking a logarithm transforms a bell-shaped distribution into a paraboloid-shaped distribution, on which gradient methods work well. The logarithmic volumetric probability, or misfit, is defined as S(x) = −log( F(x)/F₀ ) , where F₀ is a constant, and, dropping the additive constant, is given by

    S(x) = −log( f(x)/µ(x) ) .    (8.182)

The problem of maximization of the (typically) bell-shaped function f(x)/µ(x) has been transformed into the problem of minimization of the (typically) paraboloid-shaped function S(x) :

    x = x_ML   ⟺   S(x) minimum .    (8.183)

Example 8.16 The conjunction σ(x) of two probability densities ρ(x) and ϑ(x) was defined (equation ??) as

    σ(x) = k ρ(x) ϑ(x) / µ(x) .    (8.184)

Then,

    S(x) = S_ρ(x) + S_ϑ(x) ,    (8.185)

where

    S_ρ(x) = −log( ρ(x)/µ(x) ) ;   S_ϑ(x) = −log( ϑ(x)/µ(x) ) .    (8.186)

[End of example.]

Example 8.17 In the context of Gaussian distributions, we have found the probability density (see example ??)
    σ_m(m) = k exp( −½ [ (m − m_prior)ᵗ C_M⁻¹ (m − m_prior) + (f(m) − d_obs)ᵗ C_D⁻¹ (f(m) − d_obs) ] ) .    (8.187)

The limit of this distribution for infinite variances is a constant, so in this case µ_m(m) = k . The misfit function S(m) = −log( σ_m(m)/µ_m(m) ) is then given by

    2 S(m) = (m − m_prior)ᵗ C_M⁻¹ (m − m_prior) + (f(m) − d_obs)ᵗ C_D⁻¹ (f(m) − d_obs) .    (8.188)

The reader should remember that this misfit function is valid only for weakly nonlinear problems (see examples 8.5 and ??). The maximum likelihood model here is the one that minimizes the sum of squares 8.188. This corresponds to the least squares criterion. [End of example.]

Example 8.18 In the context of Laplacian distributions, we have found the probability density (see example ??)

    σ_m(m) = k exp( − [ Σ_α |m^α − m^α_prior| / σ^α + Σᵢ |fⁱ(m) − dⁱ_obs| / σⁱ ] ) .    (8.189)

The limit of this distribution for infinite mean deviations is a constant, so here µ_m(m) = k . The misfit function S(m) = −log( σ_m(m)/µ_m(m) ) is then given by

    S(m) = Σ_α |m^α − m^α_prior| / σ^α + Σᵢ |fⁱ(m) − dⁱ_obs| / σⁱ .    (8.190)

The reader should remember that this misfit function is valid only for weakly nonlinear problems. The maximum likelihood model here is the one that minimizes the sum of absolute values 8.190. This corresponds to the least absolute values criterion. [End of example.]

8.3.7.3 Gradient and Direction of Steepest Ascent

One must not consider as synonymous the notions of 'gradient' and 'direction of steepest ascent'. Consider, for instance, an adimensional misfit function (Footnote 16) S(P, T) over a pressure P and a temperature T . Any sensible definition of the gradient of S will lead to an expression like

    grad S = ( ∂S/∂P , ∂S/∂T )ᵗ ,    (8.191)

and this by no means can be regarded as a 'direction' in the (P, T) space (for instance, the components of this 'vector' do not have the dimensions of pressure and temperature, but of inverse pressure and inverse temperature).

Footnote 16: We take this example because typical misfit functions are adimensional, but the argument has general validity.
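Anticipating the resolution of this example (P and T, being Jeffreys quantities, carry the metric g = diag(1/P², 1/T²), as derived in example 8.21 below), here is a hedged numerical sketch of mapping such a gradient to a bona fide direction; the misfit S and all values are hypothetical:

```python
import math

# Hypothetical adimensional misfit: S(P,T) = (log(P/P0))^2 + (log(T/T0))^2
P0, T0 = 1.0e5, 300.0

def grad_S(P, T):
    # gradient components: dimensions of 1/pressure and 1/temperature
    return 2.0 * math.log(P / P0) / P, 2.0 * math.log(T / T0) / T

P, T = 2.0e5, 600.0
gP, gT = grad_S(P, T)

# Applying the inverse metric g^{-1} = diag(P^2, T^2) yields a direction
# whose components do have the dimensions of pressure and temperature:
dP, dT = P * P * gP, T * T * gT
print(dP, dT)
```

Note that the raw gradient components (gP, gT) differ by eight orders of magnitude here, while the metric-corrected direction scales sensibly with each quantity.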
Mathematically speaking, the gradient of a function S(x) at a point x₀ is the linear application that is tangent to S(x) at x₀ . This definition of gradient is consistent with the more elementary one, based on the use of the first order development

    S(x₀ + δx) = S(x₀) + γ₀ᵗ δx + … ;    (8.192)

here, it is γ₀ that is called the gradient of S(x) at point x₀ . It is clear that S(x₀) + γ₀ᵗ δx is a linear application, and that it is tangent to S(x) at x₀ , so the two definitions are, in fact, equivalent. Explicitly, the components of the gradient at point x₀ are

    (γ₀)_p = ∂S/∂x^p (x₀) .    (8.193)

Everybody is well trained in computing the gradient of a function (even if the interpretation of the result as a direction in the original space is wrong). How can we pass from the gradient to the direction of steepest ascent (a bona fide direction in the original space)? In fact, the gradient (at a given point) of a function defined over a given space E is an element of the dual of the space. To obtain a direction in E , we must pass from the dual to the primal space. As usual, it is the metric of the space that maps the dual of the space into the space itself. So if g is the metric of the space where S(x) is defined, and if γ is the gradient of S at a given point, the direction of steepest ascent is

    γ̃ = g⁻¹ γ .    (8.194)

The direction of steepest ascent must be interpreted as follows: if we are at a point x₀ of the space, we can consider a very small hypersphere around x₀ . The direction of steepest ascent points towards the point of the sphere at which S(x) gets its maximum value.

Example 8.19 Figure 8.23 represents the level lines of a scalar function S(u, v) in a 2D space. A particular point has been selected. What is the gradient of the function at the given point?
As suggested in the main text, it is not an arrow 'perpendicular' to the level lines of the function at the considered point, as the notion of perpendicularity depends on a metric not yet specified (and unnecessary for defining the gradient). The gradient must be seen as 'the linear function that is tangent to S(u, v) at the considered point'. If S(u, v) has been represented by its level lines, then the gradient may also be represented by its level lines (right of the figure). We see that the condition, in fact, is that the level lines of the gradient are tangent to the level lines of the original function (at the considered point). Contrary to the notion of perpendicularity, the notion of tangency is metric-independent. [End of example.]

Figure 8.23: The gradient of a function is not to be seen as a vector orthogonal to the level lines, but as a form parallel to them (see text). Left panel: a function, a point, and the tangent level line. Right panel: the gradient of the function at the considered point.

Example 8.20 In the context of least squares, we consider a misfit function S(m) and a covariance matrix C_M . If γ₀ is the gradient of S at a point m₀ , and if we use C_M to define distances in the space, the direction of steepest ascent is

    γ̃₀ = C_M γ₀ .    (8.195)

[End of example.]

Example 8.21 If the misfit function S(P, T) depends on a pressure P and on a temperature T , the gradient of S is, as mentioned above (equation 8.191),

    γ = ( ∂S/∂P , ∂S/∂T )ᵗ .    (8.196)

As the quantities P and T are Jeffreys quantities, associated to the metric ds² = (dP/P)² + (dT/T)² , the direction of steepest ascent is (Footnote 17)

    γ̃ = ( P² ∂S/∂P , T² ∂S/∂T )ᵗ .    (8.197)

Footnote 17: We have here g = ( g_PP , g_PT ; g_TP , g_TT ) = ( 1/P² , 0 ; 0 , 1/T² ) .

[End of example.]

8.3.7.4 The Steepest Descent Method

Consider that we have a probability distribution defined over an n-dimensional space X . Having chosen some coordinates x ≡ {x¹, x², …, xⁿ} over the space, the probability distribution is represented by the probability density f(x) , whose homogeneous limit (in the sense developed in section 4) is µ(x) . We wish to calculate the coordinates x_ML of the maximum likelihood point. By definition (equation 8.181),

    x = x_ML   ⟺   f(x)/µ(x) maximum ,    (8.198)

i.e.,

    x = x_ML   ⟺   S(x) minimum ,    (8.199)

where S(x) is the misfit (equation 8.182)

    S(x) = −log( f(x)/µ(x) ) .    (8.200)

Let us denote by γ(x_k) the gradient of S(x) at point x_k , i.e. (equation 8.193),

    (γ_k)_p = ∂S/∂x^p (x_k) .    (8.201)

We have seen above that γ(x) is not to be interpreted as a direction in the space X , but as a direction in the dual space. The gradient can be converted into a direction using some metric g(x) over X . In simple situations the metric g will be that used to define the volume element of the space, i.e., we will have µ(x) = k v(x) = k √det g(x) , but this is not a necessity, and iterative algorithms may be accelerated by the astute introduction of ad hoc metrics. Given, then, the gradient γ(x_k) (at some particular point x_k ) and any possible choice of metric g(x) , we can define the direction of steepest ascent associated to the metric g by (equation 8.194)

    γ̃(x_k) = g⁻¹(x_k) γ(x_k) .    (8.202)

The algorithm of steepest descent is an iterative algorithm passing from point x_k to point x_{k+1} by making a 'small jump' along the local direction of steepest descent,

    x_{k+1} = x_k − ε_k g_k⁻¹ γ_k ,    (8.203)

where ε_k is an ad hoc (real, positive) value adjusted to force the algorithm to converge rapidly (if ε_k is chosen too small, the convergence may be too slow; if it is chosen too large, the algorithm may even diverge). Many elementary presentations of the steepest descent algorithm just forget to include the metric g_k in expression 8.203. These algorithms are not consistent: even the physical dimensionality of the equation is not assured.
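A hedged one-dimensional sketch of such a metric-aware descent, previewing the least-squares iteration derived in example 8.23 below; the forward model f(m) = G m and all numbers are hypothetical toys:

```python
# Toy scalar inverse problem: forward model f(m) = G*m, one datum.
G, d_obs, m_prior = 2.0, 3.0, 0.0
C_D, C_M = 0.25, 1.0            # data and prior variances (scalars here)

m, eps = m_prior, 0.05
for _ in range(200):
    # direction of steepest ascent, with the metric g = C_M^{-1}:
    steepest = C_M * G * (G * m - d_obs) / C_D + (m - m_prior)
    m -= eps * steepest

print(m)   # converges to the least-squares model 24/17 = 1.41176...
```

For this linear toy the iteration is a contraction (each step multiplies the error by 1 − 17 ε), so the choice of ε directly controls convergence or divergence, as stated above.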
The authors of this article have traced some 'numerical' problems in existing computer implementations of steepest descent algorithms to this neglect of the metric.

Example 8.22 In the context of example 8.17, where the misfit function S(m) is given by

    2 S(m) = (f(m) − d_obs)ᵗ C_D⁻¹ (f(m) − d_obs) + (m − m_prior)ᵗ C_M⁻¹ (m − m_prior) ,   (8.204)

the gradient γ , whose components are γ_α = ∂S/∂m^α , is given by the expression

    γ(m) = Fᵗ(m) C_D⁻¹ (f(m) − d_obs) + C_M⁻¹ (m − m_prior) ,   (8.205)

where F is the matrix of partial derivatives

    F^i_α = ∂f^i/∂m^α .   (8.206)

An example of computation of partial derivatives is given in appendix ??. [End of example.]

Example 8.23 In the context of example 8.22 the model space M has an obvious metric, namely that defined by the inverse of the 'a priori' covariance operator, g = C_M⁻¹ . Using this metric and the gradient given by equation 8.205, the steepest descent algorithm 8.203 becomes

    m_{k+1} = m_k − ε_k C_M ( F_kᵗ C_D⁻¹ (f_k − d_obs) + (m_k − m_prior) ) ,   (8.207)

where F_k ≡ F(m_k) and f_k ≡ f(m_k) . The real positive quantities ε_k can be fixed, after some trial and error, by accurate linear search, or by using a linearized approximation (see footnote 18). [End of example.]

Example 8.24 In the context of example 8.22 the model space M has a less obvious metric, namely that defined by the inverse of the 'a posteriori' covariance operator, g = C̃_M⁻¹ . [Note: Explain here that the 'best current estimator' of C̃_M is]

    C̃_M ≈ ( F_kᵗ C_D⁻¹ F_k + C_M⁻¹ )⁻¹ .   (8.208)

Using this metric and the gradient given by equation 8.205, the steepest descent algorithm 8.203 becomes

    m_{k+1} = m_k − ε_k ( F_kᵗ C_D⁻¹ F_k + C_M⁻¹ )⁻¹ ( F_kᵗ C_D⁻¹ (f_k − d_obs) + C_M⁻¹ (m_k − m_prior) ) ,   (8.209)

Footnote 18: As shown in Tarantola (1987), if γ̂_k is the direction of steepest ascent at point m_k , i.e., γ̂_k = C_M ( F_kᵗ C_D⁻¹ (f_k − d_obs) + (m_k − m_prior) ) , then a local linearized approximation for the optimal ε_k gives

    ε_k = ( γ̂_kᵗ C_M⁻¹ γ̂_k ) / ( γ̂_kᵗ ( F_kᵗ C_D⁻¹ F_k + C_M⁻¹ ) γ̂_k ) .
where F_k ≡ F(m_k) and f_k ≡ f(m_k) . The real positive quantities ε_k can be fixed, after some trial and error, by accurate linear search, or by using a linearized approximation that simply gives (see footnote 19) ε_k ≈ 1 . [End of example.]

The algorithm 8.209 is usually called a 'quasi-Newton algorithm'. This is a misnomer, as a Newton method applied to the minimization of the misfit function S(m) would be a method using the second derivatives of S(m) , and thus the derivatives H^i_{αβ} = ∂²f^i/(∂m^α ∂m^β) , which are not computed (or not estimated) when using this algorithm. It is just a steepest descent algorithm with a nontrivial definition of the metric in the working space. In this sense it belongs to the wider class of 'variable metric methods', not discussed in this article.

Example 8.25 In the context of example 8.18, where the misfit function S(m) is given by

    S(m) = Σ_i |f^i(m) − d^i_obs| / σ^i + Σ_α |m^α − m^α_prior| / σ^α ,   (8.210)

the gradient γ , whose components are γ_α = ∂S/∂m^α , is given by the expression

    γ_α = Σ_i F^i_α (1/σ^i) sign(f^i − d^i_obs) + (1/σ^α) sign(m^α − m^α_prior) ,   (8.211)

where F^i_α = ∂f^i/∂m^α . We can now choose in the model space the ad hoc metric defined as the inverse of the 'covariance matrix' formed by the squares of the mean deviations σ^i and σ^α (interpreted as if they were variances). Using this metric, the direction of steepest ascent associated to the gradient 8.211 is

    γ̂^α = (σ^α)² Σ_i F^i_α (1/σ^i) sign(f^i − d^i_obs) + σ^α sign(m^α − m^α_prior) .   (8.212)

The steepest descent algorithm can now be applied:

    m_{k+1} = m_k − ε_k γ̂_k .   (8.213)

The real positive quantities ε_k can be fixed after some trial and error or by accurate linear search. [End of example.]

An expression like 8.210 defines a sort of deformed polyhedron, and to solve this sort of minimization problem, linear programming techniques are often advocated (e.g., Claerbout and Muir, 1973).
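The robustness contrast between the criteria 8.204 and 8.210 can be sketched with a deliberately trivial problem: a single parameter measured several times, with one corrupted value. This is an illustration only; the data values and the decreasing-step policy are assumptions, the forward relation is f(m) = m (so F^i_α = 1), and no prior term is used.

```python
import numpy as np

# Repeated measurements of a single quantity m, one of them badly corrupted.
d_obs = np.array([10.1, 9.9, 10.0, 10.2, 50.0])   # last value is an outlier
sigma = np.ones_like(d_obs)

# Least-squares estimate: the mean, dragged toward the outlier.
m_l2 = d_obs.mean()

# Least-absolute-values estimate by the sign-gradient descent of
# equations 8.211-8.213 (a shrinking step eps_k lets the iterates settle).
m, eps = 0.0, 1.0
for _ in range(200):
    gamma = np.sum(np.sign(m - d_obs) / sigma)   # gradient of sum |m - d_i|/sigma_i
    m = m - eps * gamma
    eps *= 0.95
m_l1 = m   # close to the median of the data, insensitive to the outlier
```

The least-squares estimate is pulled far from the cluster of good data, while the least-absolute-values estimate settles near the median, illustrating the robustness claim made below.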
We have found that for problems involving many dimensions the crude steepest descent method defined by equations 8.212–8.213 performs extremely well. For instance, in Djikpéssé and Tarantola (1999) a large-sized problem of waveform fitting is solved using this algorithm.

It is well known that the sum of absolute values 8.210 provides a more robust (footnote 20) criterion than the sum of squares 8.204. If one fears that the data set to be used is corrupted by some unexpected errors, the least-absolute-values criterion should be preferred to the least-squares criterion (footnote 21).

Footnote 19: While a sensible estimation of the optimal values of the real positive quantities ε_k is crucial for the algorithm 8.207, they can, in many usual circumstances, be dropped from the algorithm 8.209.

Footnote 20: A method is 'robust' if its output is not sensitive to a small number of large errors in the inputs.

Footnote 21: Of course, it would be much better to develop a realistic model of the uncertainties, and use the more general probabilistic methods developed above, but if those models are not available, then the least-absolute-values criterion is a valuable one.

8.3.7.5 Estimation of A Posteriori Uncertainties

In the Gaussian context, the Gaussian probability density that is tangent to σ_m(m) has its center at the point given by the iterative algorithm

    m_{k+1} = m_k − ε_k C_M ( F_kᵗ C_D⁻¹ (f_k − d_obs) + (m_k − m_prior) )   (8.214)

(equation 8.207) or, equivalently, by the iterative algorithm

    m_{k+1} = m_k − ε_k ( F_kᵗ C_D⁻¹ F_k + C_M⁻¹ )⁻¹ ( F_kᵗ C_D⁻¹ (f_k − d_obs) + C_M⁻¹ (m_k − m_prior) )   (8.215)

(equation 8.209). The covariance of the tangent Gaussian is

    C̃_M ≈ ( F_∞ᵗ C_D⁻¹ F_∞ + C_M⁻¹ )⁻¹ ,   (8.216)

where F_∞ refers to the value of the matrix of partial derivatives at the convergence point. [Note: Emphasize here the importance of C̃_M .]

8.3.7.6 Some Comments on the Use of Deterministic Methods

8.3.7.6.1 About the Use of the Term 'Matrix' [Note: Warning, old text to be updated.]
Contrary to the next chapter, where the model parameter space and the data space may be functional spaces, I assume here that we have discrete spaces, with a finite number of dimensions. [Note: What is 'indicial'?] Then, it makes sense to use the indicial notation

    d = {d^i} , i ∈ I_D  ;  m = {m^α} , α ∈ I_M ,   (8.217)

where I_D and I_M are two index sets, for the data and the model parameters respectively. In the simplest case, the indices are simple integers, I_D = {1, 2, 3, . . .} and I_M = {1, 2, 3, . . .} , but this is not necessarily true. For instance, figure 8.24 suggests a 2D problem where we compute the gravitational field from a distribution of masses. Then, the index α is better understood as consisting of a pair of integers.

Figure 8.24: A simple example where the index in m = {m^α} is not necessarily an integer. In this case, where we are interested in predicting the gravitational field g generated by a 2-D distribution of mass, the index α is better understood as consisting of a pair of integers. Here, for instance, m_{A,B} means the total mass in the block at row A and column B .

8.3.7.6.2 Linear, Weakly Nonlinear and Nonlinear Problems

There are different degrees of nonlinearity. Figure 8.25 illustrates the four domains of nonlinearity allowing the use of the different optimization algorithms. This figure symbolically represents the model space on the abscissa axis, and the data space on the ordinate axis. The gray oval represents the information coming in part from a priori information on the model parameters and in part from the data observations (footnote 22). It is the function ρ(d, m) = ρ_d(d) ρ_m(m) seen elsewhere [note: say where].

Figure 8.25: Illustration of the four domains of nonlinearity allowing the use of the different optimization algorithms. The model space is symbolically represented on the abscissa axis, and the data space on the ordinate axis.
The gray oval represents the information coming in part from a priori information on the model parameters and in part from the data observations. What is important is not some intrinsic nonlinearity of the function relating model parameters to data, but how linear the function is inside the domain of significant probability. The four panels show: a linear problem ( d = G m ), a linearizable problem ( d − d_prior = G_0 (m − m_prior) ), a weakly nonlinear problem, and a nonlinear problem ( d = g(m) ).

To fix ideas, the oval suggests here a Gaussian probability, but the sorting of problems we are about to make as a function of their nonlinearity will not depend fundamentally on this.

First, there are some strictly linear problems. For instance, in the example illustrated by figure 8.24, the gravitational field g depends linearly on the masses inside the blocks (footnote 23).

Footnote 22: The gray oval is the product of the probability density over the model space, representing the a priori information, times the probability density over the data space, representing the experimental results.

Footnote 23: The gravitational field at point x_0 generated by a distribution of volumetric mass ρ(x) is given by

    g(x_0) = ∫ dV(x) ρ(x) (x_0 − x) / ‖x_0 − x‖³ .

When the volumetric mass is constant inside some predefined (2-D) volumes, as suggested in figure 8.24, this gives

    g(x_0) = Σ_A Σ_B G^{A,B}(x_0) m_{A,B} .

This is a strictly linear equation between data (the gravitational field at a given observation point) and the model parameters (the masses inside the volumes). Note that if instead of choosing as model parameters the

Strictly linear problems are illustrated at the top left of figure 8.25. The linear relationship between data and model parameters, d = G m , is represented by a straight line.
The a priori probability density ρ(d, m) 'induces', on this straight line, the a posteriori probability density σ(d, m) (warning: this notation corresponds to volumetric probabilities), whose 'projection' over the model space gives the a posteriori probability density over the model parameter space, σ_m(m) . Should the a priori probability densities be Gaussian, then the a posteriori probability distribution would also be Gaussian: this is the simplest situation (in such problems, as we will later see (section xxx), the problem reduces to finding the mean and the covariance of the a posteriori Gaussian).

Quasi-linear problems are illustrated at the bottom left of figure 8.25. If the relationship linking the observable data d to the model parameters m ,

    d = g(m) ,   (8.218)

is approximately linear inside the domain of significant a priori probability (i.e., inside the gray oval of the figure), then the a posteriori probability is as simple as the a priori probability. For instance, if the a priori probability is Gaussian, the a posteriori probability is also Gaussian. In this case also, the problem can be reduced to the computation of the mean and the covariance of the Gaussian. Typically, one begins at some 'starting model' m_0 (typically, one takes for m_0 the 'a priori model' m_prior ) [note: explain clearly somewhere in this section that 'a priori model' is a language abuse for the 'mean a priori model'], linearizes the function d = g(m) around m_0 , and looks for a model m_1 'better than m_0 '. Iterating such an algorithm, one tends to the model m_∞ at which the 'quasi-Gaussian' σ_m(m) is maximum. The linearizations made in order to arrive at m_∞ are not, so far, an approximation: the point m_∞ is perfectly defined independently of any linearization, and of any method used to find it.
But once the convergence to this point has been obtained, a linearization of the function d = g(m) around this point,

    d − g(m_∞) = G_∞ (m − m_∞) ,   (8.219)

allows us to obtain a good approximation of the a posteriori uncertainties. For instance, if the a priori probability is Gaussian, this will give the covariance of the 'tangent Gaussian'.

Between linear and quasi-linear problems there are the 'linearizable problems'. The scheme at the top right of figure 8.25 shows the case where the linearization of the function d = g(m) around the a priori model,

    d − g(m_prior) = G_prior (m − m_prior) ,   (8.220)

gives a function that, inside the domain of significant probability, is very similar to the true (nonlinear) function. In this case, there is no practical difference between this problem and the strictly linear problem, and the iterative procedure necessary for quasi-linear problems is here superfluous.

It remains to analyze the truly nonlinear problems that, using a pleonasm, are sometimes called strongly nonlinear problems. They are illustrated at the bottom right of figure 8.25.

(Footnote 23, continued:) total masses inside some predefined volumes one chooses the geometrical parameters defining the sizes of the volumes, then the gravity field is not a linear function of the parameters. More details can be found in Tarantola and Valette (1982b, page 229).

In this case, even if the a priori probability is simple, the a posteriori probability can be quite complicated. For instance, it can be multimodal. Such problems are, in general, quite complex to solve, and only the Monte Carlo methods described in the previous chapter are sufficiently general. If full Monte Carlo methods cannot be used, because they are too expensive, then one can mix some random part (for instance, to choose the starting point) and some deterministic part.
The optimization methods applicable to quasi-linear problems can, for instance, allow us to go from the randomly chosen starting point to the 'nearest' optimal point [note: explain this better]. Repeating these computations for different starting points, one can arrive at a good idea of the a posteriori probability in the model space.

8.3.7.6.3 The Maximum Likelihood Model

The most likely model is, by definition, that at which the volumetric probability σ_m(m) attains its maximum. As σ_m(m) is maximum when S(m) is minimum, we see that the most likely model is also the 'best model' obtained when using a 'least-squares criterion'. Should we have used the double exponential model for all the uncertainties, then the most likely model would be defined by a 'least absolute values' criterion.

There are many circumstances where the most likely model is not an interesting model. One trivial example is when the volumetric probability has a 'narrow maximum', with small total probability (see figure 8.26). A much less trivial situation arises when the number of parameters is very large, as for instance when we deal with a random function (which, in all rigor, corresponds to an infinite number of random variables). Figure XXX, for instance, shows a few realizations of a Gaussian function with zero mean and an (approximately) exponential correlation. The most likely function is the center of the Gaussian, i.e., the null function shown at the left. But this is not a representative sample (specimen) of the probability distribution, as any realization of the probability distribution will have, with a probability very close to one, the 'oscillating' characteristics of the three samples shown at the right.

Figure 8.26: One of the circumstances where the 'maximum likelihood model' may not be very interesting is when it corresponds to a narrow maximum, with small total probability, as the peak at the left of this probability distribution.
8.3.7.6.4 The Interpretation of 'The Least Squares Solution'

Note: explain here that when working with a large number of dimensions, the center of a Gaussian is a bad representative of the possible realizations of the Gaussian. Mention somewhere that m_post is not the 'posterior model', but the center of the a posteriori Gaussian, and explain that for multidimensional problems, the center of a Gaussian is not representative of a random realization of the Gaussian. [Note: Mention somewhere that one should not compute the inverse of the matrices, but solve the associated linear system.]

Figure 8.27: At the right, three random realizations of a Gaussian random function with zero mean and (approximately) exponential correlation function. The most likely function, i.e., the center of the Gaussian, is shown at the left. We see that the most likely function is not representative of the probability distribution.

Chapter 9. Inference Problems of the Fourth Kind (Transport of Probabilities)

Note: Say here that we consider two problems: (i) the measure of physical quantities —through a direct use of their definition— and (ii) the prediction of observations. It is, of course, our goal to pay attention to the uncertainties involved. These two problems are mathematically very similar, and are essentially solved using the notion of 'transport of probabilities' introduced in chapter 2.

9.1 Measure of Physical Quantities

Note: we develop here a problem that is fundamental in metrology: when a quantity s is defined as a function of some other quantity r , through s = s(r) , and we measure r , we must 'transport' the information we have obtained on r into information on s . Note: give the main ideas here. The method is illustrated in section 9.1.1, where the Poisson ratio of a solid is evaluated, using its definition in terms of stresses and deformations.
It is also illustrated in appendix 9.3.1, in an example of mass calibration.

9.1.1 Example: Measure of Poisson's Ratio

9.1.1.1 Hooke's Law in Isotropic Media

For an elastic medium, in the limit of infinitesimal strains (Hooke's law),

    σ_ij = c_ijkl ε^kl ,   (9.1)

where c_ijkl is the stiffness tensor. If the elastic medium is isotropic,

    c_ijkl = (λ_κ/3) g_ij g_kl + (λ_µ/2) ( g_ik g_jl + g_il g_jk − (2/3) g_ij g_kl ) ,   (9.2)

where λ_κ (with multiplicity one) and λ_µ (with multiplicity five) are the two eigenvalues of the stiffness tensor c_ijkl . They are related to the common uncompressibility modulus κ and shear modulus µ through

    κ = λ_κ/3  ;  µ = λ_µ/2 .   (9.3)

Hooke's law 9.1 can, alternatively, be written

    ε_ij = d_ijkl σ^kl ,   (9.4)

where d_ijkl , the inverse of the stiffness tensor, is called the compliance tensor. If the elastic medium is isotropic,

    d_ijkl = (γ/3) g_ij g_kl + (φ/2) ( g_ik g_jl + g_il g_jk − (2/3) g_ij g_kl ) ,   (9.5)

where γ (with multiplicity one) and φ (with multiplicity five) are the two eigenvalues of d_ijkl . These are, of course, the inverses of the eigenvalues of c_ijkl :

    γ = 1/λ_κ = 1/(3κ)  ;  φ = 1/λ_µ = 1/(2µ) .   (9.6)

From now on, I shall call γ the eigencompressibility or, if there is no risk of confusion with 1/κ , the compressibility. The quantity φ shall be called the eigenshearability or, if there is no risk of confusion with 1/µ , the shearability. With the isotropic stiffness tensor of equation 9.2, Hooke's law 9.1 becomes

    σ_ij = (λ_κ/3) g_ij ε^k_k + λ_µ ( ε_ij − (1/3) g_ij ε^k_k ) ,   (9.7)

or, equivalently, with the isotropic compliance tensor of equation 9.5, Hooke's law 9.4 becomes

    ε_ij = (γ/3) g_ij σ^k_k + φ ( σ_ij − (1/3) g_ij σ^k_k ) .   (9.8)

9.1.1.2 Definition of the Poisson's Ratio

Consider the experimental arrangement of figure 9.1, where an elastic medium is submitted to the (homogeneous) uniaxial stress (using Cartesian coordinates)

    σ_xx = σ_yy = σ_xy = σ_yz = σ_zx = 0  ;  σ_zz ≠ 0 .
   (9.9)

Then Hooke's law 9.4 predicts the strain

    ε_xx = ε_yy = (1/3) (γ − φ) σ_zz  ;  ε_zz = (1/3) (γ + 2φ) σ_zz  ;  ε_xy = ε_yz = ε_zx = 0 .   (9.10)

The Young modulus Y and the Poisson ratio ν are defined as

    Y = σ_zz / ε_zz  ;  ν = − ε_xx / ε_zz = − ε_yy / ε_zz ,   (9.11)

and equation 9.10 gives

    Y = 3 / (2φ + γ)  ;  ν = (φ − γ) / (2φ + γ) ,   (9.12)

with the reciprocal relations

    γ = (1 − 2ν) / Y  ;  φ = (1 + ν) / Y .   (9.13)

Figure 9.1: A possible experimental setup for measuring the Young modulus and the Poisson ratio of an elastic medium. The measurement of the force F , of the 'bar length' Z , and of the bar diameter X allows one to estimate the two elastic parameters. Details below.

Note that when γ and φ take values inside their natural range

    0 < γ < ∞  ;  0 < φ < ∞ ,   (9.14)

the range of variation of Y and ν is

    0 < Y < ∞  ;  −1 < ν < +1/2 .   (9.15)

Although most materials have positive values of the Poisson ratio ν , there are materials where it is negative (see figures 9.2 and 9.3). The Poisson ratio has mainly a historical interest. Note that a simple function of it would have given a bona fide Jeffreys quantity,

    J = (1 + ν) / (1 − 2ν) = λ_κ / λ_µ ,   (9.16)

with the natural domain of variation 0 < J < ∞ .

Figure 9.2: An example of a 2D elastic structure with a positive value of the Poisson ratio. When imposing a stretching in one direction (the 'horizontal' here), the elastic structure reacts by contracting in the perpendicular direction.

Figure 9.3: An example of a 2D elastic structure with a negative value of the Poisson ratio. When imposing a stretching in one direction (the 'horizontal' here), the elastic structure reacts by also stretching in the perpendicular direction.

9.1.1.3 The Parameters

Although one may be interested in the Young modulus Y and the Poisson ratio ν , we may choose to measure the compressibility γ = 1/λ_κ and the shearability φ = 1/λ_µ . Any information we may need on Y and ν can be obtained, as usual, through a change of variables.
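The changes of variables 9.12 and 9.13 are easy to check numerically. The following sketch (the function names and the steel-like numerical values are illustrative assumptions) verifies that the two relations are mutual inverses.

```python
def gamma_phi_from_Y_nu(Y, nu):
    # Reciprocal relations (equation 9.13)
    return (1 - 2 * nu) / Y, (1 + nu) / Y

def Y_nu_from_gamma_phi(gamma, phi):
    # Equation 9.12
    Y = 3.0 / (2 * phi + gamma)
    nu = (phi - gamma) / (2 * phi + gamma)
    return Y, nu

# Steel-like values: Y = 200 GPa, nu = 0.3 (illustrative numbers only)
g, p = gamma_phi_from_Y_nu(200e9, 0.3)   # compressibility, shearability (1/Pa)
Y, nu = Y_nu_from_gamma_phi(g, p)        # round trip recovers (Y, nu)
```

Note that for ν inside its natural range 9.15 both γ and φ come out positive, as required by 9.14.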
From the first two equations in expression 9.10 it follows that the relation between the elastic parameters γ and φ , the stress, and the strains is

    γ = (ε_zz + 2 ε_xx) / σ_zz  ;  φ = (ε_zz − ε_xx) / σ_zz .   (9.17)

As the uniaxial stress is generated by a force F applied to one of the ends of the bar (and the reaction force of the support),

    σ_zz = F / s ,   (9.18)

where s , the section of the bar, is

    s = π X² / 4 .   (9.19)

The most general definition of strain (one that does not assume the strains to be small) is

    ε_xx = log(X/X₀)  ;  ε_zz = log(Z/Z₀) ,   (9.20)

where X₀ and Z₀ are the initial lengths (see figure 9.1) and X and Z are the final lengths. We then have the final relation

    γ = (π X² / 4F) ( log(Z/Z₀) + 2 log(X/X₀) )  ;  φ = (π X² / 4F) ( log(Z/Z₀) − log(X/X₀) ) .   (9.21)

When necessary, these two expressions shall be written

    γ = γ(X₀, Z₀, X, Z, F)  ;  φ = φ(X₀, Z₀, X, Z, F) .   (9.22)

We shall later need to extract from these relations the two parameters X₀ and Z₀ :

    X₀ = X exp( − 4F(γ − φ) / (3π X²) )  ;  Z₀ = Z exp( − 4F(γ + 2φ) / (3π X²) ) ,   (9.23)

expressions that, when necessary, shall be written

    X₀ = X₀(γ, φ, X, Z, F)  ;  Z₀ = Z₀(γ, φ, X, Z, F) .   (9.24)

9.1.1.4 The Partial Derivatives

In what follows, let us use the notation

    r = {X₀, Z₀, X, Z, F}  ;  s = {γ, φ} ,   (9.25)

so the relation 9.21 may be written

    s = s(r) .   (9.26)

We need to complete the set of two variables s to have a set of five variables, as suggested in section 2.6.0.3. The simplest choice is to take

    t = {X, Z, F}   (9.27)

as supplementary variables. We can then introduce the 5 × 5 matrix of partial derivatives of {γ, φ, X, Z, F} with respect to {X₀, Z₀, X, Z, F} ,

    K = ( ∂γ/∂X₀  ∂γ/∂Z₀  ∂γ/∂X  ∂γ/∂Z  ∂γ/∂F
          ∂φ/∂X₀  ∂φ/∂Z₀  ∂φ/∂X  ∂φ/∂Z  ∂φ/∂F
          ∂X/∂X₀  ∂X/∂Z₀  ∂X/∂X  ∂X/∂Z  ∂X/∂F
          ∂Z/∂X₀  ∂Z/∂Z₀  ∂Z/∂X  ∂Z/∂Z  ∂Z/∂F
          ∂F/∂X₀  ∂F/∂Z₀  ∂F/∂X  ∂F/∂Z  ∂F/∂F ) ,   (9.28)

to easily obtain

    K ≡ √( det(K Kᵗ) ) = 3π² X⁴ / (16 F² X₀ Z₀) .   (9.29)

9.1.1.5 The Measurement Space and the Measurand Space

We measure the five quantities r = {X₀, Z₀, X, Z, F} in order to evaluate the two quantities s = {γ, φ} .
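Relations 9.21 and 9.23 can be checked against each other numerically. The sketch below (the function names are illustrative; the numerical values are those used in the numerical illustration of section 9.1.1.8) verifies that equation 9.23 recovers the initial lengths from the elastic parameters and the final state.

```python
import math

def gamma_phi(X0, Z0, X, Z, F):
    # Equation 9.21: elastic parameters from the measured lengths and force
    s = math.pi * X**2 / 4                  # section of the bar (equation 9.19)
    gamma = (math.log(Z / Z0) + 2 * math.log(X / X0)) * s / F
    phi = (math.log(Z / Z0) - math.log(X / X0)) * s / F
    return gamma, phi

def initial_lengths(gamma, phi, X, Z, F):
    # Equation 9.23: recover X0, Z0 from the parameters and the final state
    a = 4 * F / (3 * math.pi * X**2)
    return X * math.exp(-a * (gamma - phi)), Z * math.exp(-a * (gamma + 2 * phi))

g, p = gamma_phi(1.000, 1.000, 0.975, 1.105, 9.81)
X0, Z0 = initial_lengths(g, p, 0.975, 1.105, 9.81)   # recovers (1.000, 1.000)
```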
Let us denote by R⁵ the five-dimensional measurement space, over which r = {X₀, Z₀, X, Z, F} shall be considered coordinates. The distance element over the measurement space is [note: explain why]

    ds² = (1/a²) ( (dX₀/X₀)² + (dZ₀/Z₀)² + (dX/X)² + (dZ/Z)² ) + (1/b²) dF² ,   (9.30)

where a and b represent arbitrary 'weights'. We then have the metric determinant

    √(det g_r) = k / (X₀ Z₀ X Z) ,   (9.31)

where the constant k = 1/(a⁴ b) shall not play any important role in what follows (it will spontaneously disappear). Similarly, let us denote by S² the two-dimensional measurand space, over which s = {γ, φ} shall be considered coordinates. The distance element over the measurand space is [note: explain why]

    ds² = (1/c²) ( (dγ/γ)² + (dφ/φ)² ) ,   (9.32)

where c represents an arbitrary 'weight'. The metric matrix is, therefore,

    g_s = (1/c²) ( 1/γ²  0 ; 0  1/φ² ) ,   (9.33)

and this gives the metric determinant

    √(det g_s) = k′ / (γ φ) ,   (9.34)

where the constant k′ = 1/c² shall not play any important role in what follows (it will spontaneously disappear).

9.1.1.6 The Measurement

We measure {X₀, Z₀, X, Z, F} and describe the result of our measurement via a volumetric probability

    f_r(X₀, Z₀, X, Z, F) .   (9.35)

[Note: Explain this.]

9.1.1.7 Transportation of the Probability Distribution

Equation 2.206 applies here directly, and gives the transported volumetric probability over the measurand space. Using the present notations, this gives

    f_s(γ, φ) = (1/√(det g_s)) ∫₀^∞ dX ∫₀^∞ dZ ∫₋∞^{+∞} dF ( √(det g_r) / K ) f_r(X₀, Z₀, X, Z, F) |_{X₀ = X₀(γ,φ,X,Z,F) ; Z₀ = Z₀(γ,φ,X,Z,F)} ,   (9.36)

where the functions X₀ = X₀(γ, φ, X, Z, F) and Z₀ = Z₀(γ, φ, X, Z, F) are those expressed by equations 9.23–9.24. More explicitly, using the result for the Jacobian determinant K given by equation 9.29, and the two metric determinants given by equations 9.31 and 9.34,

    f_s(γ, φ) = (16 k / (3π² k′)) γ φ ∫₀^∞ (dX/X) ∫₀^∞ (dZ/Z) ∫₋∞^{+∞} dF (F²/X⁴) f_r(X₀, Z₀, X, Z, F) |_{X₀ = X₀(γ,φ,X,Z,F) ; Z₀ = Z₀(γ,φ,X,Z,F)} .
   (9.37)

The two associated marginal volumetric probabilities are, then,

    f_γ(γ) = ∫₀^∞ (dφ/φ) f_s(γ, φ)   (9.38)

and

    f_φ(φ) = ∫₀^∞ (dγ/γ) f_s(γ, φ) .   (9.39)

To represent these volumetric probabilities I prefer to use the 'Cartesian parameters' of the problem [note: explain]. Here, the logarithmic parameters

    γ* = log(γ/γ₀)  ;  φ* = log(φ/φ₀) ,   (9.40)

where γ₀ and φ₀ are two arbitrary constants having the dimension of a compliance, are Cartesian coordinates over the 2D space of elastic (isotropic) media. For them, the distance element of equation 9.32 becomes

    c² ds² = (dγ*)² + (dφ*)² ,   (9.41)

typical of Cartesian coordinates in Euclidean spaces. As volumetric probabilities are invariant quantities, the new volumetric probability function, say g_s(γ*, φ*) , is simply given by

    g_s(γ*, φ*) = f_s(γ, φ) |_{γ = γ₀ exp γ* ; φ = φ₀ exp φ*} .   (9.42)

To be complete, let us mention that equations 9.37–9.39 define volumetric probabilities; should we wish to evaluate probability densities,

    f̄_s(γ, φ) = f_s(γ, φ) / (γ φ)  ;  f̄_γ(γ) = f_γ(γ) / γ  ;  f̄_φ(φ) = f_φ(φ) / φ ,   (9.43)

then

    f̄_s(γ, φ) = (16 k / (3π² k′)) ∫₀^∞ (dX/X) ∫₀^∞ (dZ/Z) ∫₋∞^{+∞} dF (F²/X⁴) f_r(X₀, Z₀, X, Z, F) |_{X₀ = X₀(γ,φ,X,Z,F) ; Z₀ = Z₀(γ,φ,X,Z,F)}   (9.44)

and

    f̄_γ(γ) = ∫₀^∞ dφ f̄_s(γ, φ)  ;  f̄_φ(φ) = ∫₀^∞ dγ f̄_s(γ, φ) .   (9.45)

9.1.1.8 Numerical Illustration

Note: to do things properly, the constants k and k′ of equations 9.31 and 9.34 should appear here, as they measure distances. They should all simplify and disappear.

Let us use the notations N(u, u₀, s) and L(U, U₀, s) respectively for the normal and the lognormal functions

    N(u, u₀, s) = k exp( −(u − u₀)² / (2s²) )  ;  L(U, U₀, s) = k exp( −(1/(2s²)) (log(U/U₀))² ) .
   (9.46)

Assume that the result of the measurement of the quantities X₀ , Z₀ (initial diameter and length of the bar), X , Z (final diameter and length of the bar), and the force F has given an information that can be represented by a five-dimensional Gaussian volumetric probability with independent uncertainties,

    f_r(X₀, Z₀, X, Z, F) = L(X₀, X₀^obs, s_X₀) L(Z₀, Z₀^obs, s_Z₀) L(X, X^obs, s_X) L(Z, Z^obs, s_Z) N(F, F^obs, s_F) ,   (9.47)

with the numerical values

    X₀^obs = 1.000 m ;  s_X₀ = 0.015
    Z₀^obs = 1.000 m ;  s_Z₀ = 0.015
    X^obs = 0.975 m ;  s_X = 0.015
    Z^obs = 1.105 m ;  s_Z = 0.015
    F^obs = 9.81 kg m/s² ;  s_F ≈ 0 .

This is the volumetric probability that appears at the right of equation 9.37. To simplify the example I have assumed that the uncertainty on the force F is much smaller than the other uncertainties, so, in fact, F can be treated as a constant. With the small uncertainties chosen, the lognormal functions in 9.47 look much like normal ones. Figure 9.4 displays the four (marginal) one-dimensional lognormal functions.

To illustrate how the uncertainties in the measurement of the lengths propagate into uncertainties in the elastic parameters, I have chosen the quite unrealistic example where the uncertainties in X and X₀ overlap: it is likely that the diameter of the rod has decreased (so the Poisson ratio is positive), but the probability that it has increased (negative Poisson ratio) is significant. In fact, as we shall see, the measurements do not even exclude the possibility of negative elastic parameters γ and φ (this possibility being excluded by the elastic theory).

Figure 9.4: The four 1D marginal volumetric probabilities for the initial and final lengths and diameters. Note that the uncertainties in X and X₀ overlap: it is likely that the diameter of the rod has decreased (so the Poisson ratio is positive), but the probability that it has increased (negative Poisson ratio) is significant.
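A Monte Carlo simulation of this measurement (of the kind used for the right panel of figure 9.5) can be sketched as follows. This is an assumption-laden illustration (the random seed, the sample size, and the variable names are mine): each sample of the lengths, drawn from the lognormal distributions 9.47, is transported through equation 9.21, and only samples with positive γ and φ are kept, negative values being excluded by the elastic theory.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3000

# Sample the lognormal volumetric probabilities of equation 9.47
X0 = np.exp(rng.normal(np.log(1.000), 0.015, n))
Z0 = np.exp(rng.normal(np.log(1.000), 0.015, n))
X = np.exp(rng.normal(np.log(0.975), 0.015, n))
Z = np.exp(rng.normal(np.log(1.105), 0.015, n))
F = 9.81                                   # force treated as exactly known

# Transport each sample through equation 9.21
s = np.pi * X**2 / 4
gamma = (np.log(Z / Z0) + 2 * np.log(X / X0)) * s / F
phi = (np.log(Z / Z0) - np.log(X / X0)) * s / F

# Only samples with positive gamma and phi are physically acceptable
ok = (gamma > 0) & (phi > 0)
gamma_star = np.log(gamma[ok])             # Cartesian (logarithmic) parameters
phi_star = np.log(phi[ok])
```

The fraction of rejected samples is the probability mass that the measurement alone assigns to unphysical negative parameters, which is significant here by construction.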
Figure 9.5 represents the volumetric probability f_s(γ, φ) defined by equations 9.37 and 9.42. It represents the information that the measurements of the lengths have given on the elastic parameters γ and φ . [Note: Explain this better.] [Note: Explain that negative values of γ and φ are excluded 'by hand'.] The two associated marginal volumetric probabilities are defined in equations 9.38–9.39, and are represented in figure 9.6. Note: mention here figure 9.7.

9.1.1.9 Translation into the Young Modulus and Poisson Ratio Language

To obtain the expression of the metric in the coordinates {Y, ν} one can use the partial derivatives of the old coordinates with respect to the new coordinates, and equation 1.23.

Figure 9.5: The (2D) volumetric probability for the compressibility γ and the shearability φ , as induced from the measurement results, plotted in the coordinates γ* = log(γQ) , φ* = log(φQ) . At the left, a direct representation of the volumetric probability defined by equations 9.37 and 9.42. At the right, a Monte Carlo simulation of the measurement (see section XXX). Here, natural logarithms are used, and Q = 1 N/m² . Of the 3000 points used, 9 fell to the left of, and 7 below, the domain plotted, and are not represented. The zone of nonvanishing probability extends over all the space, and only the level lines automatically proposed by the plotting software have been used.

Figure 9.6: The marginal (1D) volumetric probabilities defined by equations 9.38–9.39.

Figure 9.7: The marginal probability distributions for the diameters X and X₀ .
At the left, a Monte Carlo sampling of the probability distribution for X and X₀ defined by equation 9.47 (the values Z and Z₀ are also sampled, but are not shown). At the right, the same Monte Carlo sampling, but where only the points that correspond, through equation 9.21, to positive values of γ and φ (and are, thus, acceptable by the theory of elastic media) have been retained. Note that many of the points 'behind' the diagonal bar have been suppressed.

Then, the metric matrix in equation 9.33, written in the coordinates {Y, ν} , becomes

    ( g_YY  g_Yν ; g_νY  g_νν ) = ( 2/Y²   2/(Y(1−2ν)) − 1/(Y(1+ν)) ; 2/(Y(1−2ν)) − 1/(Y(1+ν))   4/(1−2ν)² + 1/(1+ν)² ) ,   (9.48)

with the metric determinant being given as

    √(det g) = 3 / ( Y (1+ν)(1−2ν) ) .   (9.49)

To obtain the equivalent of the volumetric probability f_s(γ, φ) in terms of the Young modulus Y and the Poisson ratio ν we just need to perform the change of variables (remember that volumetric probabilities are invariant under a change of variables), so the volumetric probability f_s(γ, φ) transforms into a volumetric probability q(Y, ν) that is given by (see relations 9.13)

    q(Y, ν) = f_s(γ, φ) |_{γ = (1−2ν)/Y ; φ = (1+ν)/Y} .   (9.50)

To evaluate the probability of a domain we have to integrate, in view of equation 9.49, as

    P(Y₁ < Y < Y₂ , ν₁ < ν < ν₂) = ∫_{Y₁}^{Y₂} dY ∫_{ν₁}^{ν₂} dν ( 3 / (Y (1+ν)(1−2ν)) ) q(Y, ν) .   (9.51)

Figure 9.8: The metrically correct representation of the volumetric probability q(Y, ν) , obtained by just superimposing on figure 9.5 the new coordinates {Y, ν} (level lines of constant Y and constant ν). As above, Q = 1 N/m² .

This being said, the question now is: how should we represent the volumetric probability q(Y, ν) ? A direct, naïve plot, using Y as abscissa and ν as ordinate, is possible, and only needs the use of equation 9.50 (as the volumetric probability f_s(γ, φ) has already been evaluated). But let us first use a subtler approach.
We have seen that the quantities γ* and ϕ* (logarithmic compressibility and logarithmic shearability) are Cartesian quantities in the 2D space of linear elastic media. My preferred choice for visualizing q(Y, ν) is a metrically correct representation, i.e., to superimpose in figure 9.5, where the coordinates γ* and ϕ* were used, the new coordinates {Y, ν} (the change of variables being defined by equations 9.12–9.13). This gives the representation displayed in figure 9.8.

As this is not the conventional way of plotting probability distributions, let us also examine the more conventional plot of q(Y, ν) in figure 9.9. One may observe, in particular, the 'round' character of the 'level lines' in this plot, due to the fact that the experiment was specially designed to have a good (and independent) resolution of the Young modulus and the Poisson ratio.

Figure 9.9: The volumetric probability for the Young modulus Y and the Poisson ratio ν , deduced, using a change of variables, from the volumetric probability on γ and ϕ represented in figure 9.5 (see equation 9.50).

As the metric matrix is not diagonal in the coordinates {Y, ν} , one cannot define marginal volumetric probabilities, but marginal probability densities only (see section 2.5). We can start by introducing the probability density q̄(Y, ν) = √(det g) q(Y, ν) , i.e.,

\[
\bar{q}(Y,\nu) = \frac{3 \, q(Y,\nu)}{Y \, (1+\nu)(1-2\nu)} \; .
\qquad (9.52)
\]

Then, the marginal probability density for the Young modulus is \( \bar{q}_Y(Y) = \int_{-1}^{+1/2} d\nu \; \bar{q}(Y,\nu) \) , i.e.,

\[
\bar{q}_Y(Y) = \frac{3}{Y} \int_{-1}^{+1/2} d\nu \; \frac{q(Y,\nu)}{(1+\nu)(1-2\nu)} \; ,
\qquad (9.53)
\]

and the marginal probability density for the Poisson ratio is \( \bar{q}_\nu(\nu) = \int_0^\infty dY \; \bar{q}(Y,\nu) \) , i.e.,

\[
\bar{q}_\nu(\nu) = \frac{3}{(1+\nu)(1-2\nu)} \int_0^\infty dY \; \frac{q(Y,\nu)}{Y} \; .
\qquad (9.54)
\]

Then, one can evaluate probabilities like

\[
P(Y_1 < Y < Y_2) = \int_{Y_1}^{Y_2} dY \; \bar{q}_Y(Y) \; ; \qquad
P(\nu_1 < \nu < \nu_2) = \int_{\nu_1}^{\nu_2} d\nu \; \bar{q}_\nu(\nu) \; .
\qquad (9.55)
\]
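The change of variables 9.50 and the probability density 9.52 can be checked numerically: the total probability computed in the {Y, ν} coordinates, with the volume element √(det g) dY dν, must equal the total probability computed in the Cartesian coordinates {γ*, ϕ*}. The sketch below assumes, for illustration only, a Gaussian volumetric probability in (γ*, ϕ*); the helper names, centers and width are invented, not those of the experiment described in the text:

```python
import math

# Hypothetical volumetric probability: a Gaussian in the Cartesian
# coordinates gamma* = log(gamma/Q), phi* = log(phi/Q), with Q = 1 N/m^2.
# The centers C_G, C_P and the width S are invented, illustrative values.
C_G, C_P, S = -7.0, -5.0, 0.3

def f_s(gamma, phi):
    gs, ps = math.log(gamma), math.log(phi)
    return math.exp(-((gs - C_G)**2 + (ps - C_P)**2) / (2*S*S)) / (2*math.pi*S*S)

def qbar(Y, nu):
    # Equation 9.52: probability density in {Y, nu}. The volumetric
    # probability q(Y, nu) is f_s evaluated at gamma = (1-2 nu)/Y and
    # phi = (1+nu)/Y (equation 9.50); the factor is sqrt(det g), eq. 9.49.
    q = f_s((1 - 2*nu) / Y, (1 + nu) / Y)
    return 3.0 * q / (Y * (1 + nu) * (1 - 2*nu))

def prob(Y1, Y2, nu1, nu2, n=300):
    # P(Y1 < Y < Y2, nu1 < nu < nu2) of equation 9.51, midpoint rule.
    dY, dn = (Y2 - Y1) / n, (nu2 - nu1) / n
    return sum(qbar(Y1 + (i + .5)*dY, nu1 + (j + .5)*dn)
               for i in range(n) for j in range(n)) * dY * dn

# Over a domain wide enough to contain essentially all the probability,
# the result is close to 1, as it must be: f_s is normalized with
# respect to the Cartesian volume element d(gamma*) d(phi*).
print(prob(30.0, 3000.0, -0.99, 0.49))
```

Note that the map (Y, ν) → (γ, ϕ) sends Y > 0, −1 < ν < 1/2 bijectively onto γ > 0, ϕ > 0, which is why the ν-integration limits in equation 9.53 are −1 and +1/2.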
As an example, the marginal probability density for the Poisson ratio, q̄_ν(ν) , is plotted in figure 9.10.

Figure 9.10: The marginal probability density for the Poisson ratio ν (equation 9.54), plotted over the range −1 ≤ ν ≤ +1/2.

9.1.1.10 Direct Evaluation Using the Young Modulus and Poisson Ratio

Rather than deducing the volumetric probability for {Y, ν} from that of {γ, ϕ} , we could redo all the computations using directly {Y, ν} as parameters; the only major difference is that the metric matrix 9.48 replaces that in equation 9.33. I leave this as an exercise for the reader.

9.2 Prediction of Observations

This is the typical prediction problem in physics: any serious physical theory is able to make predictions (that may be confronted with experiments). An engineer, for instance, may wish to predict the load at which a given bridge may collapse, or an astrophysicist may wish to predict the flux of neutrinos from the Sun. In these situations, the parameters defining the system (the bridge or the Sun) may be known with some uncertainties, and these uncertainties shall reflect as an uncertainty on the prediction. Note: I could use here a notation like

\[
d = d(p) \qquad (9.56)
\]

or like

\[
d = d(m) \; . \qquad (9.57)
\]

9.3 Appendixes

9.3.1 Appendix: Mass Calibration

Note: I take this problem from Measurement Uncertainty and the Propagation of Distributions, by Cox and Harris, 10th International Metrology Congress, 2001.

When two bodies, with masses m_W and m_R , equilibrate in a balance that operates in air of density a , one has (taking into account Archimedes' buoyancy)

\[
\left( 1 - \frac{a}{\rho_W} \right) m_W = \left( 1 - \frac{a}{\rho_R} \right) m_R \; ,
\qquad (9.58)
\]

where ρ_W and ρ_R are the two volumetric masses of the bodies. Given a body with mass m and volumetric mass ρ , it is common practice in metrology to define its 'conventional mass', denoted m_0 , as the mass of a (hypothetical) body of conventional density ρ_0 = 8000 kg/m³ weighed in air of conventional density a_0 = 1.2 kg/m³.
The equation above then gives the relation

\[
\left( 1 - \frac{a_0}{\rho_0} \right) m_0 = \left( 1 - \frac{a_0}{\rho} \right) m \; .
\qquad (9.59)
\]

In terms of conventional masses, equation 9.58 becomes (replacing each mass by its expression in terms of the conventional mass, via equation 9.59; the common factor (ρ_0 − a_0)/ρ_0 then cancels from both sides)

\[
\frac{\rho_W - a}{\rho_W - a_0} \; m_{W,0} = \frac{\rho_R - a}{\rho_R - a_0} \; m_{R,0} \; .
\qquad (9.60)
\]

To evaluate the mass m_{W,0} of a body one puts a mass m_{R,0} in the other arm, and selects the (typically small) mass δm_{R,0} (with the same volumetric mass as m_{R,0}) that equilibrates the balance. Replacing m_{R,0} by m_{R,0} + δm_{R,0} in the equation above, and solving for m_{W,0} , gives

\[
m_{W,0} = \frac{(\rho_R - a)(\rho_W - a_0)}{(\rho_W - a)(\rho_R - a_0)} \, \left( m_{R,0} + \delta m_{R,0} \right) \; .
\qquad (9.61)
\]

The knowledge of the five quantities { m_{R,0} , δm_{R,0} , a , ρ_W , ρ_R } allows one to evaluate m_{W,0} , via equation 9.61. Assume that a measurement of these five quantities has provided the information represented by the probability density f(m_{R,0}, δm_{R,0}, a, ρ_W, ρ_R) . What is the probability density induced over the quantity m_{W,0} by equation 9.61?

This is just a special case of the transport of probabilities considered in section 2.6.0.3, so we can directly apply here the results of that section. In the five-dimensional 'measurement space' over which the variables { m_{R,0} , δm_{R,0} , a , ρ_W , ρ_R } can be considered as coordinates, we can change to the variables { m_{W,0} , δm_{R,0} , a , ρ_W , ρ_R } , this defining the matrix K of partial derivatives (see equation 2.192). One easily arrives at the simple result

\[
\sqrt{\det \mathbf{K} \mathbf{K}^t} = \frac{(\rho_R - a)(\rho_W - a_0)}{(\rho_W - a)(\rho_R - a_0)} \; .
\qquad (9.62)
\]

Because of the change of variables used, we shall also need to express m_{R,0} as a function of { m_{W,0} , δm_{R,0} , a , ρ_W , ρ_R } . From equation 9.61 one immediately obtains

\[
m_{R,0} = \frac{(\rho_W - a)(\rho_R - a_0)}{(\rho_R - a)(\rho_W - a_0)} \; m_{W,0} - \delta m_{R,0} \; .
\qquad (9.63)
\]

Equation 2.206 gives the probability density for m_{W,0} :

\[
g(m_{W,0}) = \int d(\delta m_{R,0}) \int da \int d\rho_W \int d\rho_R \;
\frac{(\rho_W - a)(\rho_R - a_0)}{(\rho_R - a)(\rho_W - a_0)} \;
f(m_{R,0}, \delta m_{R,0}, a, \rho_W, \rho_R) \; ,
\qquad (9.64)
\]

where in f(m_{R,0}, δm_{R,0}, a, ρ_W, ρ_R) one has to replace the variable m_{R,0} by its expression as a function of the other variables, as given by equation 9.63.
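Equation 9.61 is immediate to implement; a minimal sketch (the function name is mine, and the numerical values in the checks are invented):

```python
def conventional_mass_W(m_R0, dm_R0, a, rho_W, rho_R, a0=1.2):
    # Equation 9.61: the conventional mass m_W0 from the reference
    # conventional mass m_R0, the small equilibrating mass dm_R0, the air
    # density a, and the volumetric masses rho_W, rho_R of the two bodies.
    # a0 = 1.2 kg/m^3 is the conventional air density.
    return ((rho_R - a) * (rho_W - a0)) / ((rho_W - a) * (rho_R - a0)) \
           * (m_R0 + dm_R0)

# Sanity checks: if the air density equals the conventional one (a = a0),
# or if the two bodies have equal volumetric mass (rho_W = rho_R), the
# buoyancy corrections cancel and m_W0 = m_R0 + dm_R0 exactly.
print(conventional_mass_W(0.1, 1e-5, 1.2, 7000.0, 8000.0))  # = m_R0 + dm_R0
print(conventional_mass_W(0.1, 1e-5, 1.1, 8000.0, 8000.0))  # = m_R0 + dm_R0
```

These two limiting cases are a useful check because in both of them the prefactor of equation 9.61 reduces identically to 1.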
Given the probability density f(m_{R,0}, δm_{R,0}, a, ρ_W, ρ_R) representing the information obtained through the measurement act, one can try an analytic integration (provided the probability density f has an analytical expression, or can be approximated by one). More generally, the probability density f can be sampled using the Monte Carlo methods described in section XXX. This is, in fact, quite simple here. Let us denote r = { m_{R,0} , δm_{R,0} , a , ρ_W , ρ_R } and s = m_{W,0} . Then relation 9.61 can be written formally as s = s(r) . One just needs to sample f(r) to obtain points r_1 , r_2 , . . . . The points s_1 = s(r_1) , s_2 = s(r_2) , . . . are then samples of g(s) (because of the very definition of the notion of transport of probabilities).
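This sampling strategy can be sketched as follows. The probability density f is taken here, for illustration only, as a product of independent Gaussians with invented means and standard deviations (these numbers, and the helper names, are not those of Cox and Harris):

```python
import random
import statistics

def s_of_r(r):
    # Equation 9.61: s = m_W0 as a function of
    # r = (m_R0, dm_R0, a, rho_W, rho_R); a0 is the conventional air density.
    m_R0, dm_R0, a, rho_W, rho_R = r
    a0 = 1.2
    return ((rho_R - a) * (rho_W - a0)) / ((rho_W - a) * (rho_R - a0)) \
           * (m_R0 + dm_R0)

def sample_f(rng):
    # Hypothetical measurement information f(r): independent Gaussians
    # with invented means and standard deviations (SI units).
    return (rng.gauss(0.100, 1e-6),   # m_R0   [kg]
            rng.gauss(1e-5, 1e-7),    # dm_R0  [kg]
            rng.gauss(1.16, 0.03),    # a      [kg/m^3]
            rng.gauss(7800.0, 30.0),  # rho_W  [kg/m^3]
            rng.gauss(8000.0, 25.0))  # rho_R  [kg/m^3]

# The points s_i = s(r_i) are, by the very definition of the transport
# of probabilities, samples of g(s); any statistic of g can then be
# estimated from them (here, the mean and the standard deviation).
rng = random.Random(0)
samples = [s_of_r(sample_f(rng)) for _ in range(100_000)]
print(statistics.mean(samples), statistics.stdev(samples))
```

With these illustrative inputs the standard deviation of the samples is dominated by the uncertainty on m_{R,0}, since the buoyancy prefactor of equation 9.61 stays very close to 1.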