Probability and Measurements

Albert Tarantola

To be published by ...

Université de Paris, Institut de Physique du Globe
4, place Jussieu; 75005 Paris; France
Email: [email protected]

December 3, 2001

© A. Tarantola, 2001.

To the memory of my father.
To my mother and my wife.

Preface
In this book, I attempt to reach two goals. The first is purely mathematical: to clarify some of the basic concepts of probability theory. The second goal is physical: to clarify the methods to be used when handling the information brought by measurements, in order to understand how accurate the predictions we may wish to make can be.

Probability theory is solidly based on the Kolmogorov axioms, and there is no problem when treating discrete probabilities. But I am very unhappy with the usual way of extending the theory to continuous probability distributions. In this text, I introduce the notion of 'volumetric probability', different from the more usual notion of 'probability density'. I claim that some of the most basic problems of the theory of continuous probability distributions can only be solved within this framework, and that many of the well-known 'paradoxes' of the theory are fundamental misunderstandings, which I try to clarify.
I start the book with an introduction to tensor calculus, because I choose to develop probability theory on metric manifolds.

The second chapter deals with probability theory per se. I try to use intrinsic notions everywhere, i.e., I only introduce definitions that make sense irrespective of the particular coordinates being used on the manifold under investigation. The reader shall see that this leads to many developments that are at odds with those found in the usual texts.
In physical applications one not only needs to define probability distributions over (typically) large-dimensional manifolds; one also needs to make use of them, and this is achieved by sampling the probability distributions using the 'Monte Carlo' methods described in chapter 3. There is no major discovery exposed in this chapter, but I make the effort to set Monte Carlo methods in the intrinsic point of view mentioned above.

The metric foundation used here allows the introduction of the important notion of 'homogeneous' probability distributions. Contrary to the 'noninformative' probability distributions common in the Bayesian literature, the notion of homogeneity is not controversial (provided one has agreed on a given metric over the space of interest).
After a brief chapter that explains what an ideal measuring instrument should be, the book enters the four chapters developing what I see as the four most basic inference problems in physics: (i) problems that are solved using the notion of 'sum of probabilities' (just an elaborate way of 'making histograms'), (ii) problems that are solved using the 'product of probabilities' (an approach that seems to be original), (iii) problems that are solved using 'conditional probabilities' (these including the so-called 'inverse problems'), and (iv) problems that are solved using the 'transport of probabilities' (like the typical [indirect] measurement problem, but solved here by transporting probability distributions, rather than just transporting 'uncertainties').
I am very indebted to my colleagues (Bartolomé Coll, Georges Jobert, Klaus Mosegaard, Miguel Bosch, Guillaume Évrard, John Scales, Christophe Barnes, Frédéric Parrenin and Bernard Valette) for illuminating discussions. I am also grateful to my collaborators at what was the Tomography Group at the Institut de Physique du Globe de Paris.
Paris, December 3, 2001
Albert Tarantola

Contents

1 Introduction to Tensors
  1.1 Chapter's overview
  1.2 Change of Coordinates (Notations)
    1.2.1 Jacobian Matrices
    1.2.2 Tensors, Capacities and Densities
  1.3 Metric, Volume Density, Metric Bijections
    1.3.1 Metric
    1.3.2 Volume Density
    1.3.3 Bijection Between Densities, Tensors and Capacities
  1.4 The Levi-Civita Tensor
    1.4.1 Orientation of a Coordinate System
    1.4.2 The Fundamental (Levi-Civita) Capacity
    1.4.3 The Fundamental Density
    1.4.4 The Levi-Civita Tensor
    1.4.5 Determinants
  1.5 The Kronecker Tensor
    1.5.1 Kronecker Tensor
    1.5.2 Kronecker Determinants
  1.6 Totally Antisymmetric Tensors
    1.6.1 Totally Antisymmetric Tensors
    1.6.2 Dual Tensors
    1.6.3 Exterior Product of Tensors
    1.6.4 Exterior Derivative of Tensors
  1.7 Integration, Volumes
    1.7.1 The Volume Element
    1.7.2 The Stokes' Theorem
  1.8 Appendixes
    1.8.1 Appendix: Tensors For Beginners
    1.8.2 Appendix: Dimension of Components
    1.8.3 Appendix: The Jacobian in Geographical Coordinates
    1.8.4 Appendix: Kronecker Determinants in 2, 3 and 4D
    1.8.5 Appendix: Definition of Vectors
    1.8.6 Appendix: Change of Components
    1.8.7 Appendix: Covariant Derivatives
    1.8.8 Appendix: Formulas of Vector Analysis
    1.8.9 Appendix: Metric, Connection, etc. in Usual Coordinate Systems
    1.8.10 Appendix: Gradient, Divergence and Curl in Usual Coordinate Systems
    1.8.11 Appendix: Connection and Derivative in Different Coordinate Systems
    1.8.12 Appendix: Computing in Polar Coordinates
    1.8.13 Appendix: Dual Tensors in 2, 3 and 4D
    1.8.14 Appendix: Integration in 3D

2 Elements of Probability
  2.1 Volume
    2.1.1 Notion of Volume
    2.1.2 Volume Element
    2.1.3 Volume Density and Capacity Element
    2.1.4 Change of Variables
    2.1.5 Conditional Volume
  2.2 Probability
    2.2.1 Notion of Probability
    2.2.2 Volumetric Probability
    2.2.3 Probability Density
    2.2.4 Volumetric Histograms and Density Histograms
    2.2.5 Change of Variables
  2.3 Sum and Product of Probabilities
    2.3.1 Sum of Probabilities
    2.3.2 Product of Probabilities
  2.4 Conditional Probability
    2.4.1 Notion of Conditional Probability
    2.4.2 Conditional Volumetric Probability
  2.5 Marginal Probability
    2.5.1 Marginal Probability Density
    2.5.2 Marginal Volumetric Probability
    2.5.3 Interpretation of Marginal Volumetric Probability
    2.5.4 Bayes Theorem
    2.5.5 Independent Probability Distributions
  2.6 Transport of Probabilities
  2.7 Central Estimators and Dispersion Estimators
    2.7.1 Introduction
    2.7.2 Center and Radius of a Probability Distribution
  2.8 Appendixes
    2.8.1 Appendix: Conditional Probability Density
    2.8.2 Appendix: Marginal Probability Density
    2.8.3 Appendix: Replacement Gymnastics
    2.8.4 Appendix: The Gaussian Probability Distribution
    2.8.5 Appendix: The Laplacian Probability Distribution
    2.8.6 Appendix: Exponential Distribution
    2.8.7 Appendix: Spherical Gaussian Distribution
    2.8.8 Appendix: Probability Distributions for Tensors
    2.8.9 Appendix: Determinant of a Partitioned Matrix
    2.8.10 Appendix: The Borel 'Paradox'
    2.8.11 Appendix: Axioms for the Sum and the Product
    2.8.12 Appendix: Random Points on the Surface of the Sphere
    2.8.13 Appendix: Histograms for the Volumetric Mass of Rocks

3 Monte Carlo Sampling Methods
  3.1 Introduction
  3.2 Random Walks
  3.3 Modification of Random Walks
  3.4 The Metropolis Rule
  3.5 The Cascaded Metropolis Rule
  3.6 Initiating a Random Walk
  3.7 Designing Primeval Walks
  3.8 Multistep Iterations
  3.9 Choosing Random Directions and Step Lengths
    3.9.1 Choosing Random Directions
    3.9.2 Choosing Step Lengths
  3.10 Appendixes
    3.10.1 Random Walk Design
    3.10.2 The Metropolis Algorithm
    3.10.3 Appendix: Sampling Explicitly Given Probability Densities

4 Homogeneous Probability Distributions
  4.1 Parameters
  4.2 Homogeneous Probability Distributions
  4.3 Appendixes
    4.3.1 Appendix: First Digit of the Fundamental Physical Constants
    4.3.2 Appendix: Homogeneous Probability for Elastic Parameters
    4.3.3 Appendix: Homogeneous Distribution of Second Rank Tensors

5 Basic Measurements
  5.1 Terminology
  5.2 Old text: Measuring physical parameters
  5.3 From ISO
    5.3.1 Proposed vocabulary to be used in metrology
    5.3.2 Some basic concepts
  5.4 The Ideal Output of a Measuring Instrument
  5.5 Output as Conditional Probability Density
  5.6 A Little Bit of Theory
  5.7 Example: Instrument Specification
  5.8 Measurements and Experimental Uncertainties
  5.9 Appendixes
    5.9.1 Appendix: Operational Definitions can not be Infinitely Accurate
    5.9.2 Appendix: The International System of Units (SI)

6 Inference Problems of the First Kind (Sum of Probabilities)
  6.1 Experimental Histograms
  6.2 Sampling a Sum
  6.3 Further Work to be Done

7 Inference Problems of the Second Kind (Product of Probabilities)
  7.1 The 'Shipwrecked Person' Problem
  7.2 Physical Laws as Probabilistic Correlations
    7.2.1 Physical Laws
    7.2.2 Example: Realistic 'Uncertainty Bars' Around a Functional Relation
    7.2.3 Inverse Problems

8 Inference Problems of the Third Kind (Conditional Probabilities)
  8.1 Adjusting Measurements to a Physical Theory
  8.2 Inverse Problems
    8.2.1 Model Parameters and Observable Parameters
    8.2.2 A Priori Information on Model Parameters
    8.2.3 Measurements and Experimental Uncertainties
    8.2.4 Joint 'Prior' Probability Distribution in the (M, D) Space
    8.2.5 Physical Laws
    8.2.6 Inverse Problems
  8.3 Appendixes
    8.3.1 Appendix: Short Bibliographical Review
    8.3.2 Appendix: Example of Ideal (Although Complex) Geophysical Inverse Problem
    8.3.3 Appendix: Probabilistic Estimation of Earthquake Locations
    8.3.4 Appendix: Functional Inverse Problems
    8.3.5 Appendix: Nonlinear Inversion of Waveforms (by Charara & Barnes)
    8.3.6 Appendix: Using Monte Carlo Methods
    8.3.7 Appendix: Using Optimization Methods

9 Inference Problems of the Fourth Kind (Transport of Probabilities)
  9.1 Measure of Physical Quantities
    9.1.1 Example: Measure of Poisson's Ratio
  9.2 Prediction of Observations
  9.3 Appendixes
    9.3.1 Appendix: Mass Calibration

Bibliography

Index

Chapter 1

Introduction to Tensors

[Note: This is an old introduction, to be updated!]
The first part of this book recalls some of the mathematical tools developed to describe the geometric properties of a space. By "geometric properties" one understands those properties that Pythagoras (6th century B.C.) or Euclid (3rd century B.C.) were interested in. The only major conceptual progress since those times has been the recognition that physical space may not be Euclidean, but may have curvature and torsion, and that the behaviour of clocks depends on their displacements in space.

Still, these representations of space accept the notion of continuity (or, equivalently, of differentiability). New theories are being developed that drop that condition (e.g. Nottale, 1993). They will not be examined here.
A mathematical structure can describe very different physical phenomena. For instance, the structure "3D vector space" may describe the combination of forces applied to a particle, as well as the combination of colors. The same holds for the mathematical structure "differential manifold". It may describe the 3D physical space, any 2D surface, or, more importantly, the 4-dimensional spacetime brought into physics by Minkowski and Einstein. The same theorem, when applied to the physical 3D space, will have a geometrical interpretation (stricto sensu), while when applied to the 4D spacetime it will have a dynamical interpretation.
The aim of this first chapter is to introduce the fundamental concepts necessary to describe geometrical properties: those of tensor calculus. Many books on tensor calculus exist. Why, then, this chapter here? Essentially because no uniform system of notations exists (indices at different places, different signs . . . ). It is then not possible to start any serious work without fixing the notations first. This chapter does not aim to give a complete discussion of tensor calculus. Among the many books that do, the best are (of course) in French, and Brillouin (1960) is the best among them. Many other books contain introductory discussions of tensor calculus. Weinberg (1972) is particularly lucid. I do not pretend to give a complete set of demonstrations, but to give a complete description of interesting properties, some of which are not easily found elsewhere.

Perhaps original is a notation proposed to distinguish between densities and capacities.
While the trick of using indices in upper or lower position to distinguish between tensors and forms (or, in metric spaces, to distinguish between "contravariant" and "covariant" components) makes formulas intuitive, I propose to use a bar (in upper or lower position) to distinguish densities (like a probability density) from capacities (like a volume element), this also leading to intuitive results. In particular, the bijection existing between these objects in metric spaces becomes as "natural" as the one just mentioned between contravariant and covariant components.

1.1 Chapter's overview

[Note: This is an old introduction, to be updated!]
A vector at a point of a space can intuitively be imagined as an "arrow". As soon as we can introduce vectors, we can introduce other objects, the forms. A form at a point of a space can intuitively be imagined as a series of parallel planes . . . At any point of a space we may have tensors, of which the vectors of elementary texts are a particular case. Those tensors may describe the properties of the space itself (metric, curvature, torsion . . . ) or the properties of something that the space "contains", like the stress at a point of a continuous medium.

If the space under consideration has a metric (i.e., if the notion of distance between two points makes sense), only tensors have to be considered. If there is no metric, then we have to consider tensors and forms simultaneously.
It is well known that under a transformation of coordinates, the value of a probability density $\overline{f}$ at any point of the space is multiplied by 'the Jacobian' of the transformation. In fact, a probability density is a scalar field that has well-defined tensor properties. This suggests introducing two different notions where sometimes only one is found: for instance, in addition to the notion of mass density, $\overline{\rho}$, we will also consider the notion of volumetric mass, $\rho$, identical to the former only in Cartesian coordinates. If $\overline{\rho}(x)$ is a mass density and $v^i(x)$ a true vector (like a velocity), their product $\overline{p}^i(x) = \overline{\rho}(x)\, v^i(x)$ will not transform like a true vector: there will be an extra multiplication by the Jacobian. $\overline{p}^i(x)$ is a density too (of linear momentum).
In addition to tensors and to densities, the concept of "capacity" will be introduced. Under a transformation of coordinates, a capacity is divided by the Jacobian of the transformation. An example is the capacity element $\underline{dV} = dx^1\, dx^2 \cdots$, not to be assimilated to the volume element $dV$. The product of a capacity by a density gives a true scalar, as in $dM = \overline{\rho}\; \underline{dV}$.
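To make the distinction concrete, here is a small numerical sketch (my own illustration, not from the book; the function name is invented): in polar coordinates the capacity element is the bare product of coordinate increments $dr\, d\theta$, and multiplying it by the volume density $\overline{g} = r$ yields the true volume (here, area) element.

```python
import math

def disk_area(R, n=2000):
    """Area of a disk of radius R, integrated in polar coordinates (r, theta)."""
    dr = R / n                       # coordinate increment in r
    dtheta = 2.0 * math.pi / n       # coordinate increment in theta
    area = 0.0
    for i in range(n):
        r = (i + 0.5) * dr           # midpoint of the i-th radial cell
        g_bar = r                    # volume density sqrt(det g) in polar coords
        # density times capacity element gives a true scalar contribution;
        # all n theta-cells of one radial ring contribute the same amount.
        area += n * g_bar * dr * dtheta
    return area

print(disk_area(1.0))  # close to pi = 3.14159...
```

Omitting the factor $\overline{g} = r$ (i.e., integrating the bare capacity element) would instead give the coordinate volume $2\pi R$, which is not the area of the disk.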
It is well known that if there is a metric, we can define a bijection between forms and vectors (we can "raise and lower indices") through $V_i = g_{ij}\, V^j$. The square root of the determinant of $\{g_{ij}\}$ will be denoted $\overline{g}$, and we will see that it defines a natural bijection between capacities, tensors, and densities, as in $\overline{p}^i = \overline{g}\, p^i$; so, in addition to the rules concerning the indices, we will have rules concerning the "bars".

Without a clear understanding of the concepts of densities and capacities, some properties remain obscure. We can, for instance, easily introduce a Levi-Civita capacity $\underline{\epsilon}_{ijk\ldots}$, or a Levi-Civita density $\overline{\epsilon}^{ijk\ldots}$ (the components of both take only the values $-1$, $+1$ or $0$). A Levi-Civita pure tensor can be defined, but it does not have that simple property. The lack of a clear understanding of the need to work simultaneously with densities, pure tensors, and capacities forces some authors to juggle with "pseudo-things", like the pseudovector corresponding to the vector product of two vectors, or to the curl of a vector field.
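As a quick numerical sketch of both bijections (again my own illustration, assuming NumPy is available; the variable names are mine): the metric of the Euclidean plane in polar coordinates lowers and raises indices, while $\overline{g} = \sqrt{\det\{g_{ij}\}}$ supplies the "bar" bijection between tensors and densities.

```python
import numpy as np

# Metric of the Euclidean plane in polar coordinates (r, theta), at r = 2.
r = 2.0
g = np.array([[1.0, 0.0],
              [0.0, r**2]])              # g_ij
g_inv = np.linalg.inv(g)                 # g^ij

V_up = np.array([3.0, 0.5])              # contravariant components V^i
V_down = g @ V_up                        # lower the index: V_i = g_ij V^j
assert np.allclose(g_inv @ V_down, V_up) # raising the index recovers V^i

# The volume density g_bar = sqrt(det g) relates pure tensors and densities,
# e.g. a vector p^i and the corresponding density p_bar^i = g_bar * p^i.
g_bar = np.sqrt(np.linalg.det(g))
p_bar = g_bar * V_up
print(g_bar)  # equals r = 2.0 in polar coordinates
```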
Many of the properties of tensor spaces are not dependent on the fact that the space may have a metric (i.e., a notion of distance). We will only assume that we have a metric when the property to be demonstrated requires it. In particular, the definition of the "covariant" derivative, in the next chapter, will not depend on that assumption.

Also, the dimension of the differentiable manifold (i.e., space) under consideration is arbitrary (but finite). We will use Latin indices $\{i, j, k, \ldots\}$ to denote the components of tensors.

In the second part of the book, as we will specifically deal with the physical space and spacetime, the Latin indices $\{i, j, k, \ldots\}$ will be reserved for the 3D physical space, while the Greek indices $\{\alpha, \beta, \gamma, \ldots\}$ will be reserved for the 4D spacetime.

1.2 Change of Coordinates (Notations)

1.2.1 Jacobian Matrices

Consider a change of coordinates, passing from the coordinate system $x = \{x^i\} = \{x^1, \ldots, x^n\}$
to another coordinate system $y = \{y^i\} = \{y^1, \ldots, y^n\}$. One may write the coordinate transformation using either of the two equivalent functions

$$ y = y(x) \quad ; \quad x = x(y) \quad , \qquad (1.1) $$

this being, of course, a shorthand notation for $y^i = y^i(x^1, \ldots, x^n)\ (i = 1, \ldots, n)$ and $x^i = x^i(y^1, \ldots, y^n)\ (i = 1, \ldots, n)$. We shall need the two sets of partial derivatives

$$ Y^i{}_j = \frac{\partial y^i}{\partial x^j} \quad ; \quad X^i{}_j = \frac{\partial x^i}{\partial y^j} \quad . \qquad (1.2) $$

One has

$$ Y^i{}_k\, X^k{}_j = X^i{}_k\, Y^k{}_j = \delta^i{}_j \quad . \qquad (1.3) $$

To simplify language and notations, it is useful to introduce matrices of partial derivatives, arranging the elements $X^i{}_j$ and $Y^i{}_j$ as follows:

$$ \mathbf{X} = \begin{pmatrix} X^1{}_1 & X^1{}_2 & X^1{}_3 & \cdots \\ X^2{}_1 & X^2{}_2 & X^2{}_3 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} \quad ; \quad \mathbf{Y} = \begin{pmatrix} Y^1{}_1 & Y^1{}_2 & Y^1{}_3 & \cdots \\ Y^2{}_1 & Y^2{}_2 & Y^2{}_3 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} \quad . \qquad (1.4) $$

Then, equation 1.3 just tells that the matrices $\mathbf{X}$ and $\mathbf{Y}$ are mutually inverse:

$$ \mathbf{Y}\mathbf{X} = \mathbf{X}\mathbf{Y} = \mathbf{I} \quad . \qquad (1.5) $$

The two matrices $\mathbf{X}$ and $\mathbf{Y}$ are called Jacobian matrices. As the matrix $\mathbf{Y}$ is obtained by taking derivatives of the variables $y^i$ with respect to the variables $x^i$, one obtains the matrix $\{Y^i{}_j\}$ as a function of the variables $\{x^i\}$, so we can write $\mathbf{Y}(x)$ rather than just writing $\mathbf{Y}$. The reciprocal argument tells that we can write $\mathbf{X}(y)$ rather than just $\mathbf{X}$. We shall later use this to make some notations more explicit.
Finally, the Jacobian determinants of the transformation are the determinants¹ of the two Jacobian matrices:

$$ Y = \det \mathbf{Y} \quad ; \quad X = \det \mathbf{X} \quad . \qquad (1.6) $$

¹ Explicitly, $Y = \det \mathbf{Y} = \frac{1}{n!}\, \epsilon_{ijk\ldots}\, Y^i{}_p\, Y^j{}_q\, Y^k{}_r \cdots\, \epsilon^{pqr\ldots}$ and $X = \det \mathbf{X} = \frac{1}{n!}\, \epsilon_{ijk\ldots}\, X^i{}_p\, X^j{}_q\, X^k{}_r \cdots\, \epsilon^{pqr\ldots}$, where the Levi-Civita "symbols" $\epsilon_{ijk\ldots}$ take the value $+1$ if $\{i, j, k, \ldots\}$ is an even permutation of $\{1, 2, 3, \ldots\}$, the value $-1$ if $\{i, j, k, \ldots\}$ is an odd permutation of $\{1, 2, 3, \ldots\}$, and the value $0$ if some indices are identical. The Levi-Civita tensors will be introduced in more detail in section 1.4.
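These definitions are easy to verify numerically. The following sketch (mine, not the book's; it assumes NumPy is available) builds the two Jacobian matrices of the Cartesian-to-polar change of coordinates at one point and checks equations 1.3 and 1.5, as well as the reciprocity $X = 1/Y$ of the two Jacobian determinants.

```python
import numpy as np

# Cartesian coordinates x = (x1, x2), polar coordinates y = (r, theta).
x1, x2 = 1.0, 1.0
r = np.hypot(x1, x2)
theta = np.arctan2(x2, x1)

Y_mat = np.array([[ x1 / r,     x2 / r    ],   # dr/dx1,     dr/dx2
                  [-x2 / r**2,  x1 / r**2 ]])  # dtheta/dx1, dtheta/dx2
X_mat = np.array([[np.cos(theta), -r * np.sin(theta)],   # dx1/dr, dx1/dtheta
                  [np.sin(theta),  r * np.cos(theta)]])  # dx2/dr, dx2/dtheta

assert np.allclose(Y_mat @ X_mat, np.eye(2))   # equations (1.3) and (1.5)
assert np.isclose(np.linalg.det(X_mat) * np.linalg.det(Y_mat), 1.0)  # X = 1/Y
print(np.linalg.det(X_mat))  # the determinant X equals r = sqrt(2) here
```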
1.2.2 Tensors, Capacities and Densities

Consider an $n$-dimensional manifold, and let $P$ be a point of it. Also consider a tensor $\mathbf{T}$ at point $P$, and let $T_x{}^{ij\ldots}{}_{k\ell\ldots}$ be the components of $\mathbf{T}$ on the local natural basis associated to some coordinates $x = \{x^1, \ldots, x^n\}$.

On a change of coordinates from $x$ into $y = \{y^1, \ldots, y^n\}$ (and the corresponding change of local natural basis) the components of $\mathbf{T}$ shall become $T_y{}^{ij\ldots}{}_{k\ell\ldots}$. It is well known that the components are related through

$$ T_y{}^{pq\ldots}{}_{rs\ldots} = \frac{\partial y^p}{\partial x^i}\, \frac{\partial y^q}{\partial x^j} \cdots \frac{\partial x^k}{\partial y^r}\, \frac{\partial x^\ell}{\partial y^s} \cdots\; T_x{}^{ij\ldots}{}_{k\ell\ldots} \quad , \qquad (1.7) $$

or, using the notations introduced above,

$$ T_y{}^{pq\ldots}{}_{rs\ldots} = Y^p{}_i\, Y^q{}_j \cdots X^k{}_r\, X^\ell{}_s \cdots\; T_x{}^{ij\ldots}{}_{k\ell\ldots} \quad . \qquad (1.8) $$

In particular, for totally contravariant and totally covariant tensors,

$$ T_y{}^{k\ell\ldots} = Y^k{}_i\, Y^\ell{}_j \cdots T_x{}^{ij\ldots} \quad ; \quad T_{y\,k\ell\ldots} = X^i{}_k\, X^j{}_\ell \cdots T_{x\,ij\ldots} \quad . \qquad (1.9) $$

In addition to actual tensors, we shall encounter other objects, that 'have indices' also, and
that transform in a slightly diﬀerent way: densities and capacities (see for instance Weinberg
[1972] and Winogradzki [1979]). Rather than a general exposition of the properties of densities
and capacities, let us anticipate that we shall only ﬁnd totally contravariant densities and
totally covariant capacities (the most notable example being the LeviCivita capacity, to be
introduced below). From now on, in all this text,
• a density is denoted with an overline, as in \( \overline{a} \) ;
• a capacity is denoted with an underline, as in \( \underline{b} \) .
It is time now to give what we can take as defining properties. Under the considered change of coordinates, a totally contravariant density \( \overline{a} \) changes components following the law
\[
\overline{a}_y{}^{k\ell\ldots} = \frac{1}{Y}\; Y^k{}_i\, Y^\ell{}_j \cdots \overline{a}_x{}^{ij\ldots} \, , \tag{1.10}
\]
or, equivalently, \( \overline{a}_y{}^{k\ell\ldots} = X\, Y^k{}_i\, Y^\ell{}_j \cdots \overline{a}_x{}^{ij\ldots} \). Here X = det X and Y = det Y are the Jacobian determinants introduced in equation 1.6. This rule for the change of components of a totally contravariant density is the same as that for a totally contravariant tensor (equation at left in 1.9), except that there is an extra factor, the Jacobian determinant X = 1/Y.
Similarly, a totally covariant capacity \( \underline{b} \) changes components following the law
\[
\underline{b}_{y\, k\ell\ldots} = \frac{1}{X}\; X^i{}_k\, X^j{}_\ell \cdots \underline{b}_{x\, ij\ldots} \, , \tag{1.11}
\]
or, equivalently, \( \underline{b}_{y\, k\ell\ldots} = Y\, X^i{}_k\, X^j{}_\ell \cdots \underline{b}_{x\, ij\ldots} \). Again, this rule for the change of components of a totally covariant capacity is the same as that for a totally covariant tensor (equation at right in 1.9), except that there is an extra factor, the Jacobian determinant Y = 1/X.

The number of terms in equations 1.10 and 1.11 depends on the 'variance' of the objects considered (i.e., on the number of indices they have). We shall find, in particular, scalar densities and scalar capacities, that do not have any index. The natural extension of equations 1.10 and 1.11 is, obviously,
\[
\overline{a}_y = X\, \overline{a}_x = \frac{1}{Y}\, \overline{a}_x \tag{1.12}
\]
for a scalar density, and
\[
\underline{b}_y = Y\, \underline{b}_x = \frac{1}{X}\, \underline{b}_x \tag{1.13}
\]
for a scalar capacity. Explicitly, these equations can be written, using y as variable,
\[
\overline{a}_y(y) = X(y)\; \overline{a}_x(x(y)) \; ; \quad \underline{b}_y(y) = \frac{1}{X(y)}\; \underline{b}_x(x(y)) \, , \tag{1.14}
\]
or, equivalently, using x as variable,
\[
\overline{a}_y(y(x)) = \frac{1}{Y(x)}\; \overline{a}_x(x) \; ; \quad \underline{b}_y(y(x)) = Y(x)\; \underline{b}_x(x) \, . \tag{1.15}
\]

1.3 Metric, Volume Density, Metric Bijections

1.3.1 Metric

A manifold is called a metric manifold if there is a definition of distance between points, such
that the distance ds between the point of coordinates \( x = \{x^i\} \) and the point of coordinates \( x + dx = \{x^i + dx^i\} \) can be expressed as²
\[
ds^2 = \| dx \|^2 = g_{ij}(x)\; dx^i\, dx^j \, , \tag{1.16}
\]
i.e., if the notion of distance is 'of the \( L_2 \) type'³. The matrix whose entries are \( g_{ij} \) is the metric matrix, and an important result of diﬀerential geometry and integration theory is that the volume density, \( \overline{g}(x) \), equals the square root of the determinant of the metric:
\[
\overline{g}(x) = \sqrt{\det \mathbf{g}(x)} \, . \tag{1.17}
\]

Example 1.1 In the Euclidean 3D space, using geographical coordinates (see example ??) the distance element is \( ds^2 = dr^2 + r^2 \cos^2\vartheta\, d\varphi^2 + r^2\, d\vartheta^2 \), from where it follows that the metric matrix is
\[
\begin{pmatrix} g_{rr} & g_{r\varphi} & g_{r\vartheta} \\ g_{\varphi r} & g_{\varphi\varphi} & g_{\varphi\vartheta} \\ g_{\vartheta r} & g_{\vartheta\varphi} & g_{\vartheta\vartheta} \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 0 \\ 0 & r^2 \cos^2\vartheta & 0 \\ 0 & 0 & r^2 \end{pmatrix} \, . \tag{1.18}
\]
The volume density equals the square root of the metric determinant, \( \overline{g}(r, \varphi, \vartheta) = \sqrt{\det \mathbf{g}(r, \varphi, \vartheta)} = r^2 \cos\vartheta \). [End of example.]

Note: define here the contravariant components of the metric through
g ij gjk = δ i k . (1.19) Using equations 1.9, we see that the covariant and contravariant components of the metric
change according to
gy k = X i k X j gx ij and k
gy = Y k i Y j ij
gx . (1.20) In section 1.2, we introdiced the matrices of partial derivatives. It is useful to also introduce two metric matrices, with respectively the covariant and contravariant components of the
metric: g11 g12 g13 · · ·
g 11 g 12 g 13 · · · 21 g 22 g 23 · · · ;
g−1 = g
(1.21)
g = g21 g22 g23 · · · ,
. ..
. ..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
the notation g−1 for the second matrix being justiﬁed by the deﬁnition 1.19, that now reads
g −1 g = I . (1.22) In matrix notation, the change of the metric matrix under a change of variables, as given
by the two equations 1.20, is written
gy = Xt gx X
2
3 ; −
−
gy 1 = Y gx 1 Yt . This is a property that is valid for any coordinate system that can be chosen over the space.
As a counterexample, the distance deﬁned as ds = dx + dy  is not of the L2 type (it is L1 ). (1.23) 8 1.3 1.3.2 Volume Density [Note: The text that follows has to be simpliﬁed.]
We have seen that the metric can be used to deﬁne a natural bijection between forms and
vectors. Let us now see that it can also be used to deﬁne a natural bijection between tensors,
densities, and capacities.
Let us denote by g the square root of the determinant of the metric,
g= det g = 1 ijk... pqr...
ε
ε
gip gjq gkr . . . .
n! (1.24) [Note: Explain here that this is a density (in fact, the fundamental density)].
In (Comment: where?) we demonstrate that we have
∂i g = g Γis s . (1.25) Using expression (Comment: which one?) for the (covariant) derivative of a scalar density, this
simply gives
∇i g = ∂i g − g Γis s = 0 , (1.26) which is consistent with the fact that
∇i gjk = 0 . (1.27) Note: deﬁne here the fundamental capacity
g= 1
g , (1.28) an say that it is a capacity (obvious). 1.3.3 Bijection Between Densities Tensors and Capacities Using the scalar density g we can associate tensor densities, pure tensors, and tensor capacities.
Using the same letter to designate the objects related through this natural bijection, we will
write expressions like
ρ = gρ ; V i = g Vi or g T ij... kl... = Tij... kl... . (1.29) So, if gij and g ij can be used to “lower and raise indices”, g and g can be used to “put
and remove bars”.
Comment: say somewhere that g is the density of volumetric content , as the volume
element of a metric space is given by
dV = g dτ , (1.30) where dτ is the capacity element deﬁned in (Comment: where?), and which, when we take an
element along the coordinate lines, equals dx1 ∧ dx2 ∧ dx3 . . . .
Comment: Give somewhere the formula ∂i g = g Γi . It can be justiﬁed by the fact that, for
any density, s , ∇k s = ∂k s − Γk s , and the result follows by using s = g and remembering
that ∇k g = 0 . The LeviCivita Tensor 1.4
1.4.1 9 The LeviCivita Tensor
Orientation of a Coordinate System The Jacobian determinants associated to a change of variables x y have been deﬁned in
section 1.2. As their product must equal +1, they must be both positive or both negative. Two
diﬀerent coordinate systems x = {x1 , x2 , . . . , xn } and y = {y 1 , y 2 , . . . , y n } are said to have
the ‘same orientation’ (at a given point) if the Jacobian determinants of the transformation,
are positive. If they are negative, it is said that the two coordinate systems have ’opposite
orientation’. Precisely, the orientation of a coordinate system is the quantity η that may
take the value +1 or the value −1 . The orientation η of any coordinate system is then
unambiguously deﬁned when a deﬁnite sign of η is assigned to a particular coordinate system.
Example 1.2 In the Euclidean 3D space, a positive orientation is assigned to a Cartesian
coordinate system {x, y, z } when the positive sense of the z is obtained from the positive senses
of the x axis and the y axis following the screwdriver rule. Another Cartesian coordinate system
{u, v, w} deﬁned as u = y , v = x , w = z , then would have a negative orientation. A system
of theee spherical coordinates, if taken in their usual order {r, θ, ϕ} , then also has a positive
orientation, but when changing the order of two coordinates, like in {r, ϕ, θ} , the orientation
of the coordinate system is negative. For a system of geographical coordinates, the reverse is
true, while {r, , ϑ} is a positively oriented system, {r, ϑ, ϕ} is negatively oriented. [End
of example.] 1.4.2 The Fundamental (LeviCivita) Capacity The LeviCivita capacity can +η =
0
ijk... −η be deﬁned by the condition
if ijk . . . is an even permutation of 12 . . . n
if some indices are identical
if ijk . . . is an odd permutation of 12 . . . n , (1.31) where η is the orientation of the coordinate system, as deﬁned in section 1.4.1.
It can be shown [note: give here a reference or the demonstration] that the object so
deﬁned actually is a capacity, i.e., that in a change of coordinates, when it is imposed that the
components of this ‘object’ change according to equation 1.11, the deﬁning property 1.31 is
preserved. 1.4.3 The Fundamental Density Let g the metric tensor of the manifold. For any positively oriented system of coordinates,
we deﬁne the quantity g , called the volume density (in the given coordinates) as
g=η det g . (1.32) where η is the orientation of the coordinate system, as deﬁned in section 1.4.1.
It can be shown [note: give here a reference or the demonstration] that the object so deﬁned
actually is a scalar density, i.e., that in a change of coordinates, this quantity changes according
to equation 1.12 respectively, the property 1.32 is preserved. 10 1.4.4 1.4 The LeviCivita Tensor Then, the LeviCivita tensor can be deﬁned as
ij...k 4 =g ij...k , (1.33) i.e., explicitly, ijk... √
+ det g =
0
√ − det g if ijk . . . is an even permutation of 12 . . . n
if some indices are identical
if ijk . . . is an odd permutation of 12 . . . n . (1.34) It can be shown [note: give here a reference or the demonstration] that the object so deﬁned
actually is a tensor, i.e., that in a change of coordinates, when it is imposed that the components
of this ‘object’ change according to equation 1.9, the property 1.34 is preserved. 1.4.5 Determinants The LeviCivita’s tensors can be used to deﬁne determinants. For instance, the determinants
of the tensors Qij , Ri j , S i j , and T ij are deﬁned by
Q= 1 ijk... mnr...
ε
ε
Qim Qjn Qkr . . . ,
n! R= 1 ijk...
εmnr... Ri m Rj n Rk r . . . ,
ε
n! = 1 ijk...
ε
εmnr... Ri m Rj n Rk r . . . ,
n! S= 1
εijk... εmnr... S i m S j n S k r . . . ,
n! = 1
εmnr... S i m S j n S k r . . . ,
ε
n! ijk... (1.37) T= 1
εijk... εmnr... T im T jn T kr . . . ,
n! (1.38) (1.35) (1.36) and where the LeviCivita’s tensors εijk... , εijk... , εijk... and εijk... have as many indices as the
space under consideration has dimensions. 4
It can be shown that this, indeed, a tensor, i.e., in a change of coordinates, it transforms like a tensor
should. The Kronecker Tensor 1.5
1.5.1 11 The Kronecker Tensor
Kronecker Tensor There are two Kronecker’s “symbols”, gi j and g i j . They are deﬁned similarly:
gi j = 1
0 if i and j are the same index
if i and j are diﬀerent indices , (1.39) gij = 1
0 if i and j are the same index
if i and j are diﬀerent indices . (1.40) and Comment: I should be avoid this last notation.
It can easily be seen
(Comment: how?)
that g i j are more than ‘symbols’: they are tensors , in the sense that, if when changing
the coordinates, we compute the new components of the Kronecker’s tensors using the rules
applying to all tensors, the property (Comment: which equation?) remains satisﬁed.
The Kronecker’s tensors are deﬁned even if the space has not a metric deﬁned on it. Note
that, sometimes, instead of using the symbols gi j and g j j to represent the Kronecker’s
tensors, the symbols δi j and δ j j are used. But then, using the metric gij to “lower an
index” of δi j gives
δij = gjk δi k = gij , (1.41) which means that, if the space has a metric, the Kronecker’s tensor and the metric tensor are
the same object. Why, then, use a diﬀerent symbol? The use of the symbol δi j may lead, by
inadvertence, after lowering an index, to assing to δij the value 1 when i and j are the
same index. This is obviously wrong: if there is not a metric, δij is not deﬁned, and if there
is a metric, δij equals gij , which is only 1 in Euclidean spaces using Cartesian coordinates.
There is only one Kronecker’s tensor, and gi j and g i j can be deduced one from the other
i
raising and lowering indices. But, even in that case, we dislike the notation gj , where the place
of each index is not indicated, and we will not use it sistematically.
Warning: a common error in beginners is to give the value 1 to the symbol gi i (or to g i i ) .
In fact, the right value is n , the dimension of the space, as there is an implicit sum assumed:
gi i = g0 0 + g1 1 + · · · = 1 + 1 + · · · = n . 1.5.2 Kronecker Determinants Let us denote by n the dimension of the space into consideration. The LeviCivita’s tensor
has then n indices. For any (nonnegative) integer p satisfying p ≤ n , consider the integer
q such that p + q = n . The following property holds: j1 j2
jp δi1 δi1 . . . δi1
δ j1 δ j2 . . . δ jp i
i
i2 j1 ...jp s1 ...sq
εi1 ...ip s1 ...sq ε
= q ! det .2 .2 . .
(1.42)
. ,
.
.
.
. .
.
.
jp
j1
j2
δip δip . . . δip 12 1.5 where δi j stands for the Kronecker’s tensor. The determinant at the righthand side is called
j1 j ...j
the Kronecker’s determinant , and is denoted δi1 i22...ipp : j j ...j 1
δi1 i22...ipp jp j1
j2
δi1 δi1 . . . δi1
δ j1 δ j2 . . . δ jp i
i
i2 = det .2 .2 . .
. .
.
.
.
. .
.
.
jp
j1
j2
δip δip . . . δip (1.43) As the Kronecker’s determinant is deﬁned as a product of LeviCivita’s tensors, it is itself a
tensor. It generalizes the deﬁnition of the Kronecker’s tensor δi j , as it has the properties if (j1 , j2 , . . . , jm ) is an even permutation of (i1 , i2 , . . . , im ) +1 −1
if (j1 , j2 , . . . , jm ) is an odd permutation of (i1 , i2 , . . . , im )
j1 ...jm
δi1 ij22...im =
0
if two of the i s or two of the j s are the same index 0
if (i1 , i2 , . . . , im ) and (j1 , j2 , . . . , jm ) are diﬀerent sets of indices .
(1.44)
As applying the same permutation to the indices of the two LeviCivita’s tensors of equation 1.42 will not change the total sign of the expression, we have
εi1 ...ip s1 ...sq εj1 ...jp s1 ...sq =
j j ...j 1
εs1 ...sq i1 ...ip εs1 ...sq j1 ...jp = q ! δi1 i22...ipp , (1.45) but we only perform a permutation in one of the LeviCivita’s tensors, then we must care about
the sign of the permutation, and we obtain
εi1 ...ip s1 ...sq εs1 ...sq j1 ...jp =
j j ...j 1
εs1 ...sq i1 ...ip εj1 ...jp s1 ...sq = (−1)pq q ! δi1 i22...ipp . (1.46) This possible change of sign has only eﬀect in spaces with even dimension (n = 2, 4, . . . ) , as
in spaces with odd dimension (n = 3, 5, . . . ) the condition p + q = n implies that pq is an
even number, and (−1)pq = +1 .
Remark that a multiplication and a division by g will not change the value of an expression,
so that, instead of using LeviCivita’s density and capacity we can use LeviCivita’s true tensors.
For instance,
εi1 ...ip s1 ...sq εj1 ...jp s1 ...sq = εi1 ...ip s1 ...sq εj1 ...jp s1 ...sq . (1.47) Comment: explain better.
Appendix 1.8.4 gives special formulas to spaces with dimension 2 , 3 , and 4 . As shown in
appendix 1.8.8, these formulas replace more elementary identities between grad, div, rot, . . .
As an example, a well known identity like
a · (b × c) = b · (c × a) = c · (a × b) (1.48) is obvious, as the three formulas correspond to the expression εijk ai bj ck . The identity
a × (b × c) = (a · c) b − (a · b) c (1.49) The Kronecker Tensor 13 is easily demonstrated, as
a × (b × c) = εijk aj (b × c)k = εijk aj εk m b cm , (1.50) which, using XXX, gives
a × (b × c) = (am cm )bi − (am bm )ci = (a · c) b − (a · b) c . (1.51) Comment: I should clearly say here that we have the identity
εijk... ε mn... = εijk... ε mn... . (1.52) Comment: say somewhere that if Bi1 ...ip is a totally antisymmetric tensor, then
1 1 ... p
B ... = Bi1 ...ip
δ
p! i1 ...ip 1 p (1.53) Comment: give somewhere the property
1 k1 ...kp 1 ... q j1 ...jq
k ...k ... q
δi1 ...ip j1 ...jq δm1 ...mq = δi11...ipp 1 ...mq .
m1
q! (1.54) Comment: give somewhere the property
1
j ...j
δ 1 q = εi1 ...ip k1 ...kq .
ε
q ! i1 ...ip j1 ...jq k1 ...kq
Note: Check if there are not factors (−1)pq missing. (1.55) 14 1.6
1.6.1 1.6 Totally Antisymmetric Tensors
Totally Antisymmetric Tensors A tensor is completely antisymmetric if any even permutation of indices does not change the
value of the components, and if any odd permutation of indices changes the sign of the value
of the components:
tpgr... = +tijk...
−tijk... if ijk . . . is an even permutation of pqr . . .
if ijk . . . is an odd permutation of pqr . . . (1.56) For instance, a fourth rank tensor tijkl is totally antisymmetric if
tijkl = tiklj = tiljk = tjilk = tjkil = tjlki
= tkijl = tkjli = tklij = tlikj = tljik = tlkij
= −tijlk = −tikjl = −tilkj = −tjikl = −tjkli = −tjlik
= −tkilj = −tkjil = −tklji = −tlijk = −tljki = −tlkij (1.57) a third rank tensor tijk is totally antisymmetric if
tijk = tjki = tkji = −tikj = −tjik = −tkji , (1.58) a second rank tensor tij is totally antisymmetric if
tij = −tji , (1.59) and a ﬁrst rank tensor ti can always be considered totally antisymmetric.
Well known examples of totally antisymmetric tensors are the LeviCivita’s tensors of any
rank, the ranktwo electromagnetic tensors, the “vector product” of two vectors:
cij = ai bj − aj bi , (1.60) etc.
Comment: say somewhere that the Kronecker’s tensors and determinants are totally antisymmetric. 1.6.2 Dual Tensors In a space with n dimensions, let p and q be two (nonnegative) integers such that
p + q = n . To any totally antisymmetric tensor of rank p , B i1 ...ip , we can associate a totally
antisymmetric tensor of rank q , bi1 ...iq , deﬁned by
bi1 ...iq = 1
εi1 ...iq j1 ...jp B j1 ...jp .
p! (1.61) The tensor b is called the dual of B , and we write
b = Dual[B] (1.62) Totally Antisymmetric Tensors 15 or
b =∗ B (1.63) From the properties of the product of LeviCivita’s tensors it follows that the dual of the
dual gives the original tensor, excepted for a sign:
∗∗ ( B) = Dual[Dual[B]] = (−1)p(n−p) B . (1.64) For spaces with odd dimension (n = 1, 3, 5, . . . ) , the product p(n − p) is even, and
∗∗ ( B) = B (spaces with odd dimension) . (1.65) For spaces with even dimension (n = 2, 4, 6, . . . ) , we have
∗∗ ( B) = (−1)p B (spaces with even dimension) . (1.66) Although deﬁnition 1.61 has been written for pure tensors, it can obviously be written for
densities and capacities,
1
j1 ...jp
εi1 ...iq j1 ...jp B
p!
1
=
B j1 ...jp ,
ε
p! i1 ...iq j1 ...jp bi1 ...iq =
bi1 ...iq (1.67) or for tensor where covariant and contravariant indices have replaced each other:
1 i1 ...iq j1 ...jp
ε
Dj1 ...jp
p!
1 i1 ...iq j1 ...jp
=
ε
Dj1 ...jp
p!
1 i1 ...iq j1 ...jp
=
ε
Dj1 ...jp ,
p! di1 ...iq =
di1 ...iq
i1 ...iq d (1.68) Appendix 1.8.13 gives explicitly the dual tensor relations in spaces with 2, 3, and 4 dimensions.
Example 1.3 Consider an antisymmetric tensor E11 E12 E13 E21 E12 E23 = E31 E32 E33 Eij in three dimensions. It has components 0 E12 E13
E21 0 E23 ,
(1.69)
E31 E32 0 with Eij = −Eji . The deﬁnition
ei =
gives 1 ijk
ε Ejk
2! 0
0 E12 E13
e3 −e2 E21 0 E23 = −e3 0
e1 ,
2
1
E31 E32 0
e −e
0 (1.70) (1.71) which is the classical relation between the three independent components of a 3D antisymmetric
tensor and the components of a vector density. [End of example.] 16 1.6 Example 1.4 The vector product of two vectors Ui and Vi can be either deﬁned as the
antisymmetric tensor
Wij = Ui Vj − Vj Ui , (1.72) 1 ijk
ε Uj Vk .
2! (1.73) or as the vector density
wi = The two deﬁnitions are equivalent, as Wij and wi are mutually duals. [End of example.]
Deﬁnition 1.73 shows that the vector product of two vectors is not a pure vector, but a
vector density. Changing the sense of one axis gives a Jacobian equal to −1 , thus changing
the sign of the vector product wi . 1.6.3 Exterior Product of Tensors In a space of dimension n , let Ai1 i2 ...ip and Bi1 i2 ...iq , be two totally antisymmetric tensors
with ranks p and q such that p + q ≤ n . Note: check that total antisymmetry has been
deﬁned. The exterior product of the two tensors is denoted
C=A∧B (1.74) and is the totally antisymmetric tensor of rank p + q deﬁned by
Ci1 ...ip j1 ...jq = 1
k ...k ...
δi11...ipp 11 qq Ak1 i2 ...kp B 1 i2 ... q .
j ...j
(p + q )! (1.75) Permuting the set of indices {k1 . . . kp } by the set { 1 . . . q } in the above deﬁnition gives the
property
(A ∧ B) = (−1)pq (B ∧ A) . (1.76) It is also easy to see that the associativity property holds:
A ∧ (B ∧ C) = (A ∧ B) ∧ C . (1.77) j1 ...
Comment: say that δi1 ij22... are the components of the Kronecker’s determinant deﬁned in
Section 1.5.2.
Say that it equation 1.54 gives the property (A1 ∧ A2 ∧ . . . AP)i1 i2 ...ip = 1 j1 j2 ...jp
A1j1 A2j2 . . . APjp .
δ
p! i1 i2 ...ip (1.78) Totally Antisymmetric Tensors
1.6.3.1 17 Particular cases: It follows from equation 1.53 that the exterior product of a tensor of rank zero (a scalar) by a
totally antisymmetric tensor of any order is the simple product of the scalar by the tensor:
(A , → Bi1 ...iq ) (A ∧ B)i1 ...iq = A Bi1 ...iq . (1.79) For the exterior product of two vectors we easily obtain (independently of the dimension of
the space into consideration)
1
(Ai , Bi )
→
(A ∧ B)ij = (Ai Bj − Aj Bi ) .
(1.80)
2
The exterior product of a vector by a second rank (antisymmetric) tensor gives
1
→
(A ∧ B)ijk = (Ai Bjk + Aj Bki + Ak Bij ) .
(1.81)
(Ai , Bij )
3
Finally, it can be seen that the exterior product of three vectors gives
, Bi , Ci ) → (1.82)
1
(Ai (Bj Ck − Bk Cj ) + Aj (Bk Ci − Bi Ck ) + Ak (Bi Cj − Bj Ci ))
(A ∧ B ∧ C)ijk =
6
1
=
(Bi (Cj Ak − Ck Aj ) + Bj (Ck Ai − Ci Ak ) + Bk (Ci Aj − Cj Ai ))
6
1
(Ci (Aj Bk − Ak Bj ) + Cj (Ak Bi − Ai Bk ) + Ck (Ai Bj − Aj Bi )) .
=
6
Let us examine with more detail the formulas above in the special case of a 3D space.
The dual of the exterior product of two vectors (equation 1.80) gives
i
1
∗
(a ∧ b) = εijk aj bk ,
(1.83)
2
i.e., one half the usual vector product of the two vectors:
1
∗
(a ∧ b) = (a × b) .
(1.84)
2
The dual of the exterior product of a vector by a second rank (antisymmetric) tensor (equation 1.81) is
(Ai ∗ or, introducing the vector (a ∧ b) = 1
ai
3 1 ijk
ε bjk
2! , (1.85) ∗i b , dual of the tensor bij , 1 ∗i
(1.86)
ai b .
3
This shows that the exterior product contains, via the duals, the contraction of a form and a
vector.
Finally, the dual of the exterior product of three vectors (equation 1.82) is
1
∗
(a ∧ b ∧ c) = εijk ai bj ck ,
(1.87)
3!
i.e., one sixth of the triple product of the three vectors.
Comment: explain that the triple product of three vectors is a · (b × c) = b · (c × a) =
c · (a × b) .
∗ (a ∧ b) = 18 1.6.4 1.6 Exterior Derivative of Tensors Let T be a totally antisymmetric tensor with components Ti1 i2 ...ip . The exterior product of
“nabla” with T is called the exterior derivative of T , and is denoted ∇ ∧ T :
k ... (∇ ∧ T)ij1 j2 ...jp = δij11j22...jpp ∇k T 1 2 ... p . (1.88) Here, ∇i Tjk... denotes the covariant derivative deﬁned in section XXX.
The “nabla” notation allows to use direclty the formulas developed for the exterior product of a vector by a tensor to obtain formulas for exterior derivatives. For instance, from
equation 1.80 it follows the deﬁnition of the exterior derivative of a vector
1
(∇i bj − ∇j bi ) ,
2 (1.89) 1 ijk
ε ∇j bk ,
2 (1.90) 1
(∇ ∧ b) = (∇ × b) .
2 (1.91) (∇ ∧ b)ij =
or, if we use the dual (equations 1.83–1.84),
∗ i (∇ ∧ b) = i.e.,
∗ The exterior derivative of a vector equals onehalf the rotational (curl) of the vector.
The exterior derivative of a second rank (antisymmetric) tensor is directly obtained from
equation 1.81:
(∇ ∧ b)ijk = 1
(∇i bjk + ∇j bki + ∇k bij ) .
3 (1.92) i Taking the dual of the expression and introducing the vector ∗ b , dual of the tensor bij , gives
(see equation 1.86)
∗ (∇ ∧ b) = 1
i
∇i ∗ b ,
3 (1.93) which shows that the dual of the exterior derivative of a second rank (antisymmetric) tensor
equals onethird of the divergence of the dual of the tensor. The exterior derivative contains,
via the duals, the divergence of a vector. Integration, Volumes 1.7
1.7.1 19 Integration, Volumes
The Volume Element Consider, in a space with n dimensions, p linearly independent vectors {dr1 , dr2 , . . . , drp } .
As they are linear independent, p ≤ n .
We deﬁne the “diﬀerential element”
d(p)σ = p! (dr1 ∧ dr2 ∧ · · · ∧ drp ) . (1.94) Using equation 1.78 (Note: in fact this equation with indices changed of place) gives the
components
i ...i i
i
i
d(p)σ i1 ...ip = δj1 ...jp dr11 dr22 . . . drpp .
p
1 (1.95) In a space with n dimensions, the dual of the diﬀerential element of dimension p will
have q indices, with p + q = n . The general deﬁnition of dual (equation 1.67) gives
1
∗ (p)
d σ i1 ...iq = εi1 ...iq j1 ...jp d(p)σ j1 ...jp
(1.96)
p!
The deﬁnition 1.95 and the property 1.55 give
∗ (p) j
j
j
d σ i1 ...iq = εi1 ...iq j1 ...jp dr11 dr22 . . . drpp . (1.97) In order to simplify subsequent notations, it is better not to keep the ∗ notation. Instead, we
will write
∗ (p) d σ i1 ...iq = d(p)Σi1 ...iq (1.98) For reasons to be developed below, d(p)Σi1 ...iq will be called the capacity element .
We can easily see, for instance, that the diﬀerential elements of dimensions 0, 1, 2 and 3
have components
d0σ
d1 σ i
d2 σ ij
d3 σ ijk =
=
=
=
=
= 1
(1.99)
i
dr1
(1.100)
j
j
i
i
dr1 dr2 − dr1 dr2
(1.101)
j
j
j
j
j
i
k
k
k
i
i
k
k
i
i
dr1 (dr2 dr3 − dr2 dr3 ) + dr1 (dr2 dr3 − dr2 dr3 ) + dr1 (dr2 dr3 − dr2 dr3 )
j
j
j
j
j
i
k
k
k
i
i
k
k
i
i
dr2 (dr3 dr1 − dr3 dr1 ) + dr2 (dr3 dr1 − dr3 dr1 ) + dr2 (dr3 dr1 − dr3 dr1 )
j
j
j
j
j
i
k
k
k
i
i
k
k
i
i
dr3 (dr1 dr2 − dr1 dr2 ) + dr3 (dr1 dr2 − dr1 dr2 ) + dr3 (dr1 dr2 − dr1 dr2 ) . (1.102) For a given dimemsion of the diﬀerential element, the number of indices of the capacity elements
depends on the dimension of the space. In a threedimensional space, for instance, we have
d0 Σijk = εijk (1.103) k
d1 Σij = εijk dr1 d2 Σi =
3 dΣ = j
k
εijk dr1 dr2
j
i
k
εijk dr1 dr2 dr3 (1.104)
(1.105)
. (1.106) Note: explain that I use the notation d(p) but d1 , d2 , . . . in order not to suggest that p is a
tensor index and, at the same time, for not using too heavy notations..
Note: refer here to ﬁgure 1.1, and explain that we have, in fact, vector products of vectors
and triple products of vectors. 20 1.7
dr3
dr2 dr2
dr1 dr1 dr1 Figure 1.1: From vectors in a threedimensional space we deﬁne the onedimensional capacity
j
k
k
element d1 Σij = εijk dr1 , the twodimensional capacity element d2 Σi = εijk dr1 dr2 and the
j
i
k
threedimensional capacity element d3Σ = εijk dr1 dr2 dr3 . In a metric space, the ranktwo
1
form d Σij deﬁnes a surface perpendicular to dr1 and with a surface magnitude equal to the
length of dr1 . The rankone form d2 Σi deﬁnes a vector perpendicular to the surface deﬁned
by dr1 and dr2 and with length representing the surface magnitude (the vector product of
the two vectors). The rankzero form d3Σ is a scalar representing the volume deﬁned by the
three vectors dr1 , dr2 and dr3 (the triple product of the vectors). Note: clarify all this. 1.7.2 The Stokes’ Theorem Comment: I must explain here ﬁrst what integration means.
Let, in a space with n dimensions, (T) be a totally antisymmetric tensor of rank p ,
with (p < n) . The Stokes’ theorem d(p+1)σ i1 ...ip+1 (∇ ∧ T)i1 ...ip+1 =
(p+1)D d(p)σ i1 ...ip Ti1 ...ip (1.107) pD holds. Here, the symbol (p+1)D d(p+1) stands for an integral over a p+1)dimensional “volume”, (embedded in an space of dimension n ), and pD d(p) for the integral over the pdimensional boundary of the “volume”.
This fundamental theorem contains, as special cases, the divergence theorem of GaussOstrogradsky, and the rotational theorem of Stokes (stricto sensu). Rather than deriving it
here, we will explore its consequences. For a demonstration, see, for instance, Von Westenholz
(1981).
In a threedimensional space (n = 3) , we may have p respectively equal to 2 , 1 and 0 .
This gives the three theorems d3σ ijk (∇ ∧ T)ijk =
3D d2σ ij Tij (1.108) 2D d2σ ij (∇ ∧ T)ij =
2D d1σ i (∇ ∧ T)i =
1D d1σ i Ti (1.109) 1D d0σ T .
0D (1.110) Integration, Volumes 21 It is easy to see (appendix 1.8.14) that these equation can be written
1 ijk
1
d3Σ
ε ∇i Tjk
0! 3D
2!
1
1 ijk
d2Σi
ε ∇j Tk
1! 2D
1!
1
1 ijk
d1Σij
ε ∂k T
2! 1D
0! 1
1!
1
=
2!
1
=
3! d2Σi = 2D d1Σij
1D d0Σijk
0D 1 ijk
ε Tjk
2!
1 ijk
ε Tk
1!
1 ijk
εT
0! (1.111)
(1.112)
. (1.113) i Simplifying equation 1.111 and introducing the vector density t , dual to the tensor Tij , (
i
1
i.e., t = 2! εijk Tjk ), gives
i d3Σ ∇i t =
3D i d2Σi t . (1.114) 2D This corresponds to the divergence theorem of GaussOstrogradsky: The integral over a (3D)
volume of the divergence of a vector equals the ﬂux of the vector across the surface bounding
the volume.
It is worth to mention here that expression 1.114 has been derived without any mention to
a metric in the space. We have sen elsewhere that densities and capacities can be deﬁned even
if there is no notion of distance. If there is a metric, then from the capacity element d3Σ we
can introduce the volume element d3Σ using the standard rule for putting on and taking oﬀ
bars
d3Σ = g d3Σ , (1.115) d2Σi = g d2Σi . (1.116) as well as the surface element d3Σ is now the familiar volume inside a prism, and d2Σi the vector (if we raise the index with
the metric) representing the surface inside a lozenge.
Equation 1.114 then gives
d3Σ ∇i ti =
3D d2Σi ti , (1.117) 2D which is the familiar form for the divergence theorem.
Keeping the compact expression for the capacity element in the lefthand side of equation 1.112, but introducing its explicit expression in the right hand side gives, after simpliﬁcation,
d2Σi (εijk ∇j Tk ) =
2D i
dr1 Ti , (1.118) 1D which corresponds to the rotational theorem (theorem of Stokes stricto sensu): the integral of
the rotational (curl) of a vector on a surface equals the circulation of the vector along the line
bounding the surface. 22 1.7 Finally, introducing explicit expressions for the capacity elements at both sides of equation 1.113 gives
i
dr1 ∂i T =
1D T. (1.119) 0D Writing this in the more familiar form gives
b dri ∂i T = T (b) − T (a) , (1.120) a which corresponds the fundamental theorem of integral calculus: the integral over a line of the
gradient of a scalar equals the diﬀerence of the values of the scalar at the two endpoints.
Note: say that more details can be found in appendix 1.8.14
Comment: explain here what the “capacity element”is. Explain that, in polar coordinates,
it is given by drdϕ , to be compared with the “surface element” rdrdϕ . Comment ﬁgure 1.2. ϕ = 2π . . . . . . . . .. . . . . . . . .
. . . . . . . . . . .. . . . .. . .
..
. . . . .. . . . .
.
2
.
.
..
. ..
.
. ..
..
..
0
ϕ=π . .
..
.
. .. .
..
.
.
2
..
..
.
4
.
.
.
ϕ = 0 4
4
2
2
0
r=0
r = 1/2
r=1
4 ϕ = π/2
1 .
.. .
.
..
.
..
.
.. .
..
.
..
.
.
0
ϕ=π
...
. .. . ϕ=0
.
.
.
.
.
..
.
0.5
.
.
.
.
.
.
.
1
1
1
0.5
0
ϕ = 3π/2 0.5
Figure 1.2: We consider, in a Euclidean space, a cylinder with a circular basis of radius 1, and cylindrical coordinates (r, ϕ, z) . Only a section of the cylinder is represented in the figure, with all its thickness, dz , projected on the drawing plane. At left, we have represented a "map" of the corresponding circle, and, at right, the coordinate lines on the circle itself. All the "cells" at left have the same capacity dV = dr dϕ dz , while the cells at right have the volume dV(r, ϕ, z) = r dr dϕ dz . The points represent particles with given masses. If, at left, at the point with coordinates (r, ϕ, z) the sum of all the masses inside the local cell is denoted dM , then the mass density at this point is estimated by ρ(r, ϕ, z) = dM/dV , i.e., ρ(r, ϕ, z) = dM/(dr dϕ dz) . If, at right, at the point (r, ϕ, z) the total mass inside the local cell is dM , the volumetric mass at this point is estimated by ρ(r, ϕ, z) = dM/dV(r, ϕ, z) , i.e., ρ(r, ϕ, z) = dM/(r dr dϕ dz) . By definition, then, the total mass inside a volume V will be found by M = ∫_V dV ρ(r, ϕ, z) = ∫_V dr dϕ dz ρ(r, ϕ, z) or by M = ∫_V dV(r, ϕ, z) ρ(r, ϕ, z) = ∫_V r dr dϕ dz ρ(r, ϕ, z) .
1.8 Appendixes

1.8.1 Appendix: Tensors For Beginners

1.8.1.1 Tensor Notations

The velocity of the wind at the top of Eiffel's tower, at a given moment, can be represented by
a vector v with components, in some local, given, basis, {v i } (i = 1, 2, 3) . The velocity of
the wind is deﬁned at any point x of the atmosphere at any time t : we have a vector ﬁeld
v i (x, t) .
The water’s temperature at some point in the ocean, at a given moment, can be represented
by a scalar T . The ﬁeld T (x, t) is a scalar ﬁeld .
The state of stress at a given point of the Earth’s crust, at a given moment, is represented
by a second order tensor σ with components {σ ij } (i = 1, 2, 3; j = 1, 2, 3) . In a general
model of continuous media, where it is not assumed that the stress tensor is symmetric, this
means that we need 9 scalar quantities to characterize the state of stress. In more particular
models, the stress tensor is symmetric, σ ij = σ ji , and only six scalar quantities are needed.
The stress ﬁeld σ ij (x, t) is a second order tensor ﬁeld .
Tensor ﬁelds can be combined, to give other ﬁelds. For instance, if ni is a unit vector
considered at a point inside a medium, the vector
τ^i(x, t) = Σ_{j=1}^{3} σ^ij(x, t) n_j(x) = σ^ij(x, t) n_j(x) ;   (i = 1, 2, 3)   (1.121)

represents the traction that the medium at one side of the surface defined by the normal n_i exerts on the medium at the other side, at the considered point.
As a further example, if the deformations of an elastic solid are small enough, the stress
tensor is related linearly to the strain tensor (Hooke’s law). A linear relation between two
second order tensors means that each component of one tensor can be computed as a linear
combination of all the components of the other tensor:
σ^ij(x, t) = Σ_{k=1}^{3} Σ_{ℓ=1}^{3} c^ijkℓ(x) ε_kℓ(x, t) = c^ijkℓ(x) ε_kℓ(x, t) ;   (i = 1, 2, 3; j = 1, 2, 3) .   (1.122)

The fourth order tensor c^ijkℓ represents a property of an elastic medium: its elastic stiffness.
As each index takes 3 values, there are 3 × 3 × 3 × 3 = 81 scalars to deﬁne the elastic stiﬀness
of a solid at a point (assuming some symmetries, we may reduce this number to 21, and assuming isotropy of the medium, to 2).
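The two contractions above (equations 1.121 and 1.122) can be checked numerically. The sketch below uses NumPy's einsum to perform the sums over the repeated indices; the arrays are arbitrary illustrative values, not data from the text:

```python
import numpy as np

# Numerical sketch of equations 1.121-1.122 (illustrative arrays only).
rng = np.random.default_rng(0)

sigma = rng.normal(size=(3, 3))        # stress tensor sigma^ij
n = np.array([0.0, 0.0, 1.0])          # unit normal n_j
tau = np.einsum('ij,j->i', sigma, n)   # traction tau^i = sigma^ij n_j

c = rng.normal(size=(3, 3, 3, 3))      # stiffness tensor c^ijkl (81 scalars)
eps = rng.normal(size=(3, 3))          # strain tensor eps_kl
sigma_hooke = np.einsum('ijkl,kl->ij', c, eps)   # sigma^ij = c^ijkl eps_kl

print(tau.shape, sigma_hooke.shape)    # (3,) (3, 3)
```

The einsum index strings mirror the tensor notation: repeated letters are summed (dummy indices), the letters after `->` are the free indices.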
We are not yet interested in the physical meaning of the equations above, but in their structure. First, tensor notations are such that they are independent of the coordinates being used. This is not obvious, as changing the coordinates implies changing the local basis where the components of vectors and tensors are expressed. That the two equalities above hold
for any coordinate system, means that all the components of all tensors will change if we change
the coordinate system being used (for instance, from Cartesian to spherical coordinates), but
still the two sides of the expression will take equal values.
The mechanics of the notation, once understood, are such that it is only possible to write
expressions that make sense (see a list of rules at the end of this section).

For reasons about to be discussed, indices may come in upper or lower positions, like in v^i ,
f_i or T_i^j . The definitions will be such that in all tensor expressions (i.e., in all expressions that will be valid for all coordinate systems), the sums over indices will always concern one index in lower position and one index in upper position. For instance, we may encounter expressions like

ϕ = Σ_{i=1}^{3} A_i B^i = A_i B^i    or    A_i = Σ_{j=1}^{3} Σ_{k=1}^{3} D_ijk E^jk = D_ijk E^jk .   (1.123)

These two equations (as equations 1.121 and 1.122) have been written in two versions, one with
the sums over the indices explicitly indicated, and another where this sum is implicitly assumed.
This implicit notation is useful because one may safely forget that one is dealing with sums: with respect to the usual tensor operations (sum with another tensor field, multiplication by another tensor field, and derivation), a sum of such terms can be handled as a single term of the sum would be.
In an expression like Ai = Dijk E jk it is said that the indices j and k have been contracted
(or are “dummy indices”), while the index i is a free index . A tensor equation is assumed to
hold for all possible values of the free indices.
In some spaces, like our physical 3D space, it is possible to define the distance between two
points, and in such a way that, in a local system of coordinates, approximately Cartesian, the
distance has approximately the Euclidean form (square root of a sum of squares). These spaces
are called metric spaces . A mathematically convenient manner to introduce a metric is by
defining the length of an arc Γ by S = ∫_Γ ds , where, for instance, in Cartesian coordinates,
ds2 = dx2 + dy 2 + dz 2 or, in spherical coordinates, ds2 = dr2 + r2 dθ2 + r2 sin2 θ dϕ2 . In general,
we write ds2 = gij dxi dxj , and we call gij (x) the metric ﬁeld or, simply, the metric .
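A length computed through ds² = g_ij dx^i dx^j does not depend on the coordinates used. The sketch below (with an assumed illustrative curve, a quarter circle of radius 2) evaluates the arc length numerically in Cartesian and in spherical coordinates and obtains the same value:

```python
import numpy as np

# Arc length of the same quarter circle (radius 2, equatorial plane)
# from ds^2 = g_ij dx^i dx^j, in two coordinate systems.
t = np.linspace(0.0, np.pi / 2, 100001)

# Cartesian coordinates: ds^2 = dx^2 + dy^2 + dz^2
x, y, z = 2 * np.cos(t), 2 * np.sin(t), np.zeros_like(t)
ds_cart = np.sqrt(np.diff(x)**2 + np.diff(y)**2 + np.diff(z)**2)

# Spherical coordinates (r = 2, theta = pi/2, phi = t):
# ds^2 = dr^2 + r^2 dtheta^2 + r^2 sin^2(theta) dphi^2
r, theta = 2.0, np.pi / 2
ds_sph = np.sqrt(r**2 * np.sin(theta)**2 * np.diff(t)**2)

print(ds_cart.sum(), ds_sph.sum())   # both close to pi
```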
The components of a vector v are associated to a given basis (the vector will have different components on different bases). If a basis e_i is given, then the components v^i are defined through v = v^i e_i (implicit sum). The dual basis of the basis {e_i} is denoted {e^i} and is defined by the equation ⟨ e^i , e_j ⟩ = δ^i_j (equal to 1 if i and j are the same index and to 0 if not). When there is a metric, this equation can be interpreted as a scalar product of vectors, and the dual basis is just another basis (identical to the first one when working with Cartesian coordinates in Euclidean spaces, but different in general). The properties of the dual basis will be analyzed later in the chapter. Here we just need to recall that if v^i are the components of the vector v on the basis {e_i} (remember the expression v = v^i e_i ), we will denote by v_i the components of the vector v on the basis {e^i} : v = v_i e^i . In that case (metric spaces), the components on the two bases are related by v_i = g_ij v^j : it is said that "the metric tensor ascends (or descends) the indices".
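A minimal numerical sketch of the index descent v_i = g_ij v^j. The metric used, g = diag(1, r²), is the one of the natural basis of polar coordinates (anticipating the natural-basis discussion later in this appendix); the point and components are arbitrary illustrative values:

```python
import numpy as np

# Descending and ascending an index with the metric (illustrative values).
r = 2.0
g = np.diag([1.0, r**2])      # g_ij at a point with radial coordinate r
g_inv = np.linalg.inv(g)      # g^ij

v_up = np.array([3.0, 0.5])   # contravariant components v^i
v_down = g @ v_up             # covariant components v_i = g_ij v^j

# ascending the index again recovers the original components
assert np.allclose(g_inv @ v_down, v_up)
print(v_down)                 # [3. 2.]
```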
Here is a list with some rules helping to recognize tensor equations:
• A tensor expression must have the same free indices, at the top and at the bottom, on the two sides of an equality. For instance, the expressions

ϕ = A_i B^i
ϕ = g_ij B^i C^j
A_i = D_ijk E^jk
D_ijk = ∇_i F_jk   (1.124)

are valid, but the expressions
A_i = F_ij B^i
B^i = A_j C^j
A_i = B^i   (1.125)

are not.
• Sums and products of tensors (with eventual "contraction" of indices) give tensors. For instance, if D_ijk , G_ijk and H_ℓ^m are tensors, then

J_ijk = D_ijk + G_ijk
K_ijkℓ^m = D_ijk H_ℓ^m
L_ikℓ = D_ijk H_ℓ^j   (1.126)

also are tensors.
• True (or "covariant") derivatives of tensor fields give tensor fields. For instance, if E^ij is a tensor field, then

M_i^jk = ∇_i E^jk
B^j = ∇_i E^ij   (1.127)

also are tensor fields. But partial derivatives of tensors do not define, in general, tensors. For instance, if E^ij is a tensor field, then

M_i^jk = ∂_i E^jk
B^j = ∂_i E^ij   (1.128)

are not tensors, in general.
• All “objects with indices” that are normally introduced are tensors, with four notable
exceptions. The first exception is the coordinates {x^i} (to see that it makes no sense to add coordinates, think, for instance, of adding the spherical coordinates of two points). But the differentials dx^i appearing in an expression like ds² = g_ij dx^i dx^j do correspond to the components of a vector dr = dx^i e_i . Another notable exception is the "symbol"
∂i mentioned above. The third exception is the “connection” Γij k to be introduced later
in the chapter. In fact, it is because both of the symbols ∂_i and Γ_ij^k are not tensors that an expression like

∇_i V^j = ∂_i V^j + Γ_ik^j V^k   (1.129)

can have a tensorial sense: if one of the terms at right were a tensor and not the other, their sum could never give a tensor. The objects ∂_i and Γ_ij^k are both non-tensors, and
“what one term misses, the other term has”. The fourth and last case of “objects with
indices” which are not tensors are the Jacobian matrices arising in the coordinate changes x ⇄ y ,

J^i_J = ∂x^i / ∂y^J .   (1.130)
That this is not a tensor is obvious when considering that, contrarily to a tensor, the
Jacobian matrix is not deﬁned per se, but it is only deﬁned when two diﬀerent coordinate
systems have been chosen. A tensor exists even if no coordinate system at all has been
defined.

1.8.1.2 Differentiable Manifolds

A manifold is a continuous space of points. In an n-dimensional manifold it is always possible
to “draw” coordinate lines in such a way that to any point P of the manifold correspond
coordinates {x1 , x2 , . . . , xn } and vice versa.
Saying that the manifold is a continuous space of points is equivalent to saying that the coordinates themselves are "continuous", i.e., that they are, in fact, a part of R^n . On such manifolds we define physical fields, and the continuity of the manifold will allow us to define the
derivatives of the considered ﬁelds. When derivatives of ﬁelds on a manifold can be deﬁned,
the manifold is then called a diﬀerentiable manifold .
Obvious examples of diﬀerentiable manifolds are the lines and surfaces of ordinary geometry. Our 3D physical space (with, possibly, curvature and torsion) is also represented by a
diﬀerentiable manifold. The spacetime of general relativity is a four dimensional diﬀerentiable
manifold.
A coordinate system may not "cover" all of the manifold. For instance, the poles of a sphere are as ordinary as any other point on the sphere, but the coordinates are singular there (the coordinate ϕ is not defined). Changing the coordinate system around the poles makes any problem related to the coordinate choice vanish there. A more serious difficulty appears when, at some point, not the coordinates but the manifold itself is singular (the linear tangent space is not defined at this point), as, for instance, in the example shown in figure 1.3. Those are named "essential singularities". No effort will be made in this book to classify them.
Figure 1.3: The surface at left has an essential singularity that will cause trouble for whatever system of coordinates we may choose (the tangent linear space is not defined at the singular point). The sphere at right has no essential singularity, but the coordinate system chosen is singular at the two poles. Other coordinate systems will be singular at different points.

1.8.1.3 Tangent Linear Space, Tensors

Consider, for instance, in classical dynamics, a trajectory x^i(t) on a space which may not be
ﬂat, as the surface of a sphere. The trajectory is “on” the sphere. If we deﬁne now the velocity
at some point,

v^i = dx^i / dt ,   (1.131)

we get a vector which is not "on" the sphere, but tangent to it. It belongs to what is called the
tangent linear space to the considered point. At that point, we will have a basis for vectors.
At another point, we will have another tangent linear space, and another vector basis.
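The tangency of the velocity can be checked numerically: on the unit sphere the position vector has constant norm, so the velocity of equation 1.131 is orthogonal to it. The trajectory below is an arbitrary illustrative path, not one from the text:

```python
import numpy as np

# A trajectory on the unit sphere and its velocity v^i = dx^i/dt
# (equation 1.131), approximated by a central finite difference.
def position(t):
    theta, phi = 1.0, 0.7 * t          # illustrative path: constant colatitude
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

t0, h = 2.0, 1e-6
v = (position(t0 + h) - position(t0 - h)) / (2 * h)

x = position(t0)
print(abs(np.dot(x, v)))   # ~0: v does not lie "on" the sphere, it is tangent to it
```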
More generally, at every point of a diﬀerential manifold, we can consider diﬀerent vector or
tensor quantities, like the forces , velocities , or stresses of mechanics of continuous media. As
suggested by ﬁgure 1.4, those tensorial objects do not belong to the nonlinear manifold, but to
the tangent linear space to the manifold at the considered point (that will only be introduced
intuitively here).
At every point of a space, tensors can be added, multiplied by scalars, contracted, etc.
This means that at every point of the manifold we have to consider a diﬀerent vector space
(in general, a tensor space). It is important to understand that two tensors at two diﬀerent
points of the space belong to two diﬀerent tangent spaces, and can not be added as such (see
ﬁgure 1.4). This is why we will later need to introduce the concept of “parallel transport of
tensors”.
All through this book, the two names linear space and vector space will be used as completely
equivalent.
The structure of vector space is too narrow to be of any use in physics. What is needed is
the structure where equations like
λ = R_i S^i
T^j = U_i V^ij + µ W^j
X^ij = Y^i Z^j   (1.132)

make sense. This structure is that of a tensor space . In short, a tensor space is a collection
of vector spaces and rules of multiplication and diﬀerentiation that use elements of the vector
spaces considered to get other elements of other vector spaces.

Figure 1.4: Surface with two planes tangent at two points, and a
vector drawn at each point. As the vectors belong to two diﬀerent
vector spaces, their sum is not deﬁned. Should we need to add
them, for instance, to deﬁne true (or “covariant”) derivatives of the
vector ﬁeld, then, we would need to transport them (by “parallel
transportation") to a common point.

1.8.1.4 Vectors and Forms

When we introduce some vector space, with elements denoted, for instance, V , v . . . , it
often happens that a new, diﬀerent, vector space is needed, with elements denoted, for instance
F , F . . . , and such that when taking an element of each space, we can “multiply” them and
get a scalar,
λ = ⟨ F , V ⟩ .   (1.133)

In terms of components, this will be written

λ = F_i V^i .   (1.134)

The product in 1.133–1.134 is called a duality product, and it has to be clearly distinguished
from an inner (or scalar) product: in an inner product, we multiply two elements of a vector
space; in a duality product, we multiply an element of a vector space by an element of a “dual
space”.
This operation can always be defined, including the case where we do not have a metric
(and, therefore, a scalar product). As an example, imagine that we work with pieces of metal
and we need to consider the two parameters “electric conductivity” σ and “temperature” T .
We may need to consider some (possibly nonlinear) function of σ and T , say S (σ, T ) . For
instance, S (σ, T ) may represent a “misﬁt function” on the (σ, T ) space of those encountered
when solving inverse problems in physics if we are measuring the parameters σ and T using
indirect means. In this case, S is adimensional⁵. We may wish to know by which amount S will change when passing from the point (σ₀, T₀) to a neighbouring point (σ₀ + ∆σ, T₀ + ∆T) .
Writing only the first order term, and using matrix notations,

S(σ₀ + ∆σ, T₀ + ∆T) = S(σ₀, T₀) + ( ∂S/∂σ  ∂S/∂T ) ( ∆σ ; ∆T ) + ... ,   (1.135)

where the partial derivatives are taken at the point (σ₀, T₀) . Using tensor notations, setting
x = (x1 , x2 ) = (σ, T ) , we can write
S(x + ∆x) = S(x) + Σ_i (∂S/∂x^i) ∆x^i = S(x) + γ_i ∆x^i = S(x) + ⟨ γ , ∆x ⟩ ,   (1.136)

where the notation introduced in equations 1.133–1.134 is used. As above, the partial derivatives
are taken at the point x₀ = (x₀¹, x₀²) = (σ₀, T₀) .
Note: say that ﬁgure 1.5 illustrates the deﬁnition of gradient as a tangent linear application.
Say that the "millefeuilles" are the "level lines" of that tangent linear application.
Note: I have to explain somewhere the reason for putting an index in lower position to represent ∂/∂x^i , i.e., to use the notation

∂_i = ∂/∂x^i .

Note: I have also to explain that, in spite of the fact that we have here partial derivatives, we
have deﬁned a tensorial object: the partial derivative of a scalar equals its true (covariant)
derivative.
It is important that we realize that there is no “scalar product” involved in equations 1.136.
Here are the arguments:
• The quantities γ_i are not the components of a vector in the (σ, T) space. This
can directly be seen by an inspection of their physical dimensions. As the function S is
adimensional (see footnote 5), the components of γ have as dimensions the inverse of
the physical dimensions of the components of the vector ∆x = (∆x1 , ∆x2 ) = (∆σ, ∆T ) .
This clearly means that ∆x and γ are “objects” that do not belong to the same space.
⁵ For instance, one could have the simple expression S(σ, T) = |σ − σ₀|/s_P + |T − T₀|/s_T , where s_P and s_T are standard deviations (or mean deviations) of some probability distribution.

• If equations 1.136 involved a scalar product we could define the norm of x , the norm of
γ and the angle between x and γ . But these norms and angle are not deﬁned. For
instance, what could be the norm of x = (∆σ, ∆T ) ? Should we choose an L2 norm?
Or, as suggested by footnote 5, an L1 norm? And, in any case, how could we make
consistent such a deﬁnition of a norm with a change of variables where, instead of electric
conductivity we use electric resistivity? (Note: make an appendix where the solution to
this problem is given).
The product in equations 1.136 is not a scalar product (i.e., it is not the “product” of two
elements belonging to the same space): it is a “duality product”, multiplying an element of a
vector space and one element of a “dual space”.
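This can be made concrete numerically. Under a change of variables the components γ_i of the gradient (a form) and the components of a small displacement (a vector) transform in opposite ways, so their duality product is invariant; the misfit function below is an assumed illustrative example, using conductivity σ and its inverse, the resistivity ρ = 1/σ, as alternative first coordinates:

```python
# Duality product <gamma, dx> is invariant under a change of variables,
# here sigma -> rho = 1/sigma (illustrative misfit function, not from the text).
def S(sigma, T):
    return (sigma - 2.0)**2 + 0.5 * (T - 1.0)**2

sigma0, T0, h = 3.0, 2.0, 1e-6

# gradient component dS/dsigma in the (sigma, T) coordinates
g_sigma = (S(sigma0 + h, T0) - S(sigma0 - h, T0)) / (2 * h)

# gradient component dS/drho in the (rho, T) coordinates
rho0 = 1 / sigma0
g_rho = (S(1 / (rho0 + h), T0) - S(1 / (rho0 - h), T0)) / (2 * h)

# the same small displacement, expressed in both coordinate systems
d_sigma = 1e-3
d_rho = 1 / (sigma0 + d_sigma) - 1 / sigma0

# to first order, the duality product is the same in both systems
print(g_sigma * d_sigma, g_rho * d_rho)
```

The two printed numbers agree to first order even though g_sigma and g_rho differ by a factor of σ², because the displacement components differ by the inverse factor.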
Why is this discussion needed? Because of the tendency to imagine the gradient of a function S(σ, T) as a vector (an "arrow") in the (σ, T) space. If the gradient is not an arrow, then what is it? Note: say here that figures 1.6 and 1.7 answer this by showing that an
element of a dual space can be represented as a “millefeuilles”.
Up to here we have only considered a vector space and its dual. But the notion generalizes
to more general tensor spaces, i.e., to the case where “we have more than one index’’. For
instance, instead of equation 1.134 we could use an equation like
λ = F_ij^k V^ij_k   (1.137)

to define scalars, consider that we are doing a duality product, and also use the notation of equation 1.133 to denote it. But this is not very useful, as, from a given "tensor" F_ij^k we can obtain scalars by operations like

λ = F_ij^k V^i W^j_k .   (1.138)

It is better, in general, to just write explicitly the indices to indicate which sort of "product"
we consider.
Sometimes (like in quantum mechanics), a "braket" notation is used, where the name stands for the bra "⟨ · |" and the ket "| · ⟩". Then, instead of λ = ⟨ F , V ⟩ one writes

λ = ⟨ F | V ⟩ = F_i V^i .   (1.139)

Then, the braket notation is also used for the expression

λ = ⟨ V | H | W ⟩ = H_ij V^i W^j .   (1.140)

Note: say that the general rules for the change of component values in a change of coordinates allow us to talk about "tensors" for "generalized vectors" as well as for "generalized
forms”.
The “number of indices” that have to be used to represent the components of a tensor is
called the rank , or the order of the tensor. Thus the tensors F and V just introduced
are second rank, or second order. A tensor object with components R_ijk^ℓ could be called, in all rigor, a "(third-rank form)(first-rank vector)", but we will not try to use this heavy terminology, the simple writing of the indices being explicit enough.
Note: say that if there is a metric, there is a trivial identification between a vector space and its dual, through equations like F_i = g_ij V^j , or S^ijkℓ = g^ip g^jq g^kr g^ℓs R_pqrs , and in that case, the same letter is used to designate a vector and its dual element, as in V_i = g_ij V^j and R^ijkℓ = g^ip g^jq g^kr g^ℓs R_pqrs . But in non-metric spaces (i.e., spaces without a metric), there is usually a big difference between a space and its dual.

1.8.1.4.1 Gradient and Hessian

Explain somewhere that if φ(x) is a scalar function,
the Taylor development

φ(x + ∆x) = φ(x) + ⟨ g , ∆x ⟩ + (1/2!) ⟨ ∆x , H ∆x ⟩ + ...   (1.141)

defines the gradient g and the Hessian H .
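The development 1.141 can be checked numerically for an illustrative function (the function φ below, and its point x₀, are assumptions of this sketch, not from the text):

```python
import numpy as np

# Second-order Taylor development (equation 1.141) for an illustrative phi.
def phi(x):
    return x[0]**2 * x[1] + 3.0 * x[1]**2

x0 = np.array([1.0, 2.0])
g = np.array([2 * x0[0] * x0[1], x0[0]**2 + 6 * x0[1]])   # gradient at x0
H = np.array([[2 * x0[1], 2 * x0[0]],
              [2 * x0[0], 6.0]])                          # Hessian at x0

dx = np.array([1e-2, -2e-2])
taylor = phi(x0) + g @ dx + 0.5 * dx @ H @ dx

print(abs(phi(x0 + dx) - taylor))   # residual is third-order small
```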
1.8.1.4.2 Old text

We may want the gradient to be "perpendicular" to the level lines of
ϕ at O , but there is no natural way to deﬁne a scalar product in the {P, T } space, so we can
not naturally deﬁne what “perpendicularity” is. That there is no natural way to deﬁne a scalar
product does not mean that we can not deﬁne one: we can deﬁne many. For any symmetric,
positive-definite matrix with the right physical dimensions (i.e., for any covariance matrix), the expression

⟨ (δP₁, δT₁) , (δP₂, δT₂) ⟩ = ( δP₁  δT₁ ) ( C_PP  C_PT ; C_TP  C_TT )⁻¹ ( δP₂ ; δT₂ )

(a row vector times the inverse covariance matrix times a column vector) defines a scalar product. By an appropriate choice of the covariance matrix, we can make
any of the two lines in ﬁgure 1.6 (or any other line) to be perpendicular to the level lines at
the considered point: the gradient at a given point is something univocally deﬁned, even in
the absence of any scalar product; the “direction of steepest descent” is not, and there are as
many as we may choose different scalar products. The gradient is not an arrow, i.e., it is not
a vector . So, then, how to draw the gradient? Roughly speaking, the gradient is the linear
tangent application at the considered point. It is represented in ﬁgure 1.7. As, by deﬁnition,
it is a linear application, the level lines are straight lines, and the spacing of the level lines
in the tangent linear application corresponds to the spacing of the level lines in the original
function around the point where the gradient is computed. Speaking more technically, it is the
development
ϕ(x + δx) = ϕ(x) + ⟨ g , δx ⟩ + ... = ϕ(x) + g_i δx^i + ... ,
when limited to its ﬁrst order, that deﬁnes the tangent linear application. The gradient of ϕ
is then g . The gradient g = {g_i} at O allows us to associate a scalar to any vector V = {V^i}
(also at O ): λ = gi V i = g , V . This scalar is the diﬀerence of the values at the top and
the bottom of the arrow representing the vector V on the local tangent linear application to
ϕ at O . The index on the gradient can be a lower index, as the gradient is not a vector.
Note: say that figure 1.8 illustrates the fact that an element of the dual space can be represented as a "millefeuilles" in the "primal" space or as an "arrow" in the dual space. And reciprocally.
Note: say that ﬁgure 1.9 illustrates the sum of arrows and the sum of “millefeuilles”.
Note: say that figure 1.10 illustrates the sum of "millefeuilles" in 3D.
1.8.1.5 Natural Basis

A coordinate system associates to any point of the space its coordinates. Each individual
coordinate can be seen as a function associating, to any point of the space, the particular
coordinate. We can define the gradient of this scalar function. We will have as many gradients

Figure 1.5: The gradient of a function (i.e., of an application) at a point x₀ is the tangent linear application at the given point. Let x → f(x) represent the original (possibly nonlinear) application. The tangent linear application could be considered as mapping x into the values given by the linearized approximation of f(x) : x → F(x) = α + β x . (Note: explain better). Rather, it is mathematically simpler to consider that the gradient maps increments of the independent variable x , ∆x = x − x₀ , into increments of the linearized dependent variable, ∆y = y − f(x₀) : ∆x → ∆y = β ∆x . (Note: explain this MUCH better). [Panels f(x) and f(x, y); to be redrawn.]

Figure 1.6: A scalar function ϕ(P, T) depends on pressure and temperature. From a given point, two directions in the {P, T} space are drawn. Which one corresponds to the gradient of ϕ(P, T) ? In the figure at left, the pressure is indicated in International Units (m, kg, s), while in the figure at right, the c.g.s. units (cm, g, s) are used (remember that 1 Pa = 10 dyne/cm²). From the left figure, we may think that the gradient is direction A , while from the figure at right we may think it is B . It is neither: the right definition of gradient (see text) only allows, as graphic representation, the result shown in figure 1.7. [Panels with axes T/K versus P in the two unit systems.]

Figure 1.7: Gradient of the function displayed in figure 1.6, at the considered point. As the gradient is the linear tangent application at the given point, it is a linear application, and its level lines are straight lines. The value of the gradient at the considered point equals the value of the original function at that point. The spacing of the level lines in the gradient corresponds to the spacing of the level lines in the original function around the point where the gradient is computed. The two figures shown here are perfectly equivalent, as it should be.

Figure 1.8: A point, at the left of the figure, may serve as the origin point for any vector we may want to represent. As usual, we may represent a vector V by an arrow. Then, a form F is represented by an oriented pattern of lines (or by an oriented pattern of surfaces in 3D) with the line of zero value passing through the origin point. Each line has a value, that is the number that the form associates to any vector whose end point is on the line. Here, V and F are such that ⟨F, V⟩ = 2 . But a form is an element of the dual space, which is also a linear space. In the dual space, then, the form F can be represented by an arrow (figure at right). In turn, V is represented, in the dual space, by a pattern of lines. [Panels: the "primal" space, with form lines f = −1 ... +4, and the dual space, with lines v = −3 ... +6.]

Figure 1.9: When representing vectors by arrows, the sum of two vectors is given by the main diagonal of the "parallelogram" drawn by the two arrows. Then, a form is represented by a pattern of lines. The sum of two forms can be geometrically obtained using the "parallelogram" defined by the principal lozenge (containing the origin and with positive sense for both forms): the secondary diagonal of the lozenge is a line of the sum of the two forms. Note: explain this better.

Figure 1.10: Sum of two forms, like in the previous figure, but here in 3D. Note: explain that this figure can be "sheared" as one wants (we do not need to have a metric). Note: explain this better.

Figure 1.11: A system of coordinates, at left, and their gradients, at right. These gradients are forms. When in an n-dimensional space we have n forms, we can define n associated vectors by ⟨ f^i , e_j ⟩ = δ^i_j . [To be redrawn.]
f^i as coordinates x^i . As a gradient, we have seen, is a form, we will have as many forms as
coordinates. The usual requirements that coordinate systems have to fulfill (different points of the space have different coordinates, and vice versa) give n linearly independent forms (we cannot obtain one of them by a linear combination of the others), i.e., a basis for the forms.
If we have a basis f^i of forms, then we can introduce a basis e_i of vectors, through

⟨ f^i , e_j ⟩ = δ^i_j .   (1.142)

If we define the components V^i of a vector V by

V = V^i e_i ,   (1.143)

then we can compute the components V^i by the formula

V^i = ⟨ f^i , V ⟩ ,   (1.144)

as we have

⟨ f^i , V ⟩ = ⟨ f^i , V^j e_j ⟩ = V^j ⟨ f^i , e_j ⟩ = V^j δ^i_j = V^i .   (1.145)
Note that the computation of the components of a vector does not involve a scalar product,
but a duality product.
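The computation V^i = ⟨ f^i , V ⟩ can be made concrete in matrix terms (a sketch with an arbitrary assumed basis): if the basis vectors e_i are the columns of a matrix, the dual-basis forms f^i are the rows of its inverse, since that is exactly the condition ⟨ f^i , e_j ⟩ = δ^i_j.

```python
import numpy as np

# Dual basis and components via the duality product (eqs. 1.142-1.145).
E = np.array([[2.0, 1.0],
              [0.0, 3.0]])             # columns: e_1, e_2 (illustrative basis)
F = np.linalg.inv(E)                   # rows: f^1, f^2

assert np.allclose(F @ E, np.eye(2))   # <f^i, e_j> = delta^i_j

V = np.array([5.0, 6.0])               # a vector, in the canonical basis
components = F @ V                     # V^i = <f^i, V>
assert np.allclose(E @ components, V)  # indeed V = V^i e_i
print(components)
```

Note that no metric (no scalar product) entered this computation: only the matrix inverse, i.e., the duality product.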
To find the equivalent of equations 1.143 and 1.144 for forms, one defines the components F_i of a form F by

F = F_i f^i ,   (1.146)

and one easily gets

F_i = ⟨ F , e_i ⟩ .   (1.147)

The notation e_i for the basis of vectors is quite universal. Although the notation f^i seems
well adapted for a basis of forms, it is quite common to use the same letter for the basis of
forms and for the basis of vectors. In what follows, we will use the notation
e^i ≡ f^i ,   (1.148)

whose dangerousness vanishes only if we have a metric, i.e., when we can give sense to an expression like e^i = g^ij e_j . Using this notation, the expressions

V = V^i e_i ⟺ V^i = ⟨ f^i , V ⟩ ;   F = F_i f^i ⟺ F_i = ⟨ F , e_i ⟩   (1.149)

become

V = V^i e_i ⟺ V^i = ⟨ e^i , V ⟩ ;   F = F_i e^i ⟺ F_i = ⟨ F , e_i ⟩ .   (1.150)

We now have bases for vectors and forms, so we can write expressions like V = V^i e_i and
F = Fi ei . We need basis for objects “with more than one index”, so we can write expressions
like
B = B^ij e_ij ;   C = C_ij e^ij ;   D = D_i^j e^i_j ;   E = E_ijk...^ℓmn... e^ijk..._ℓmn...   (1.151)
The introduction of these bases raises a difficulty. While we have an immediate intuitive representation for vectors (as "arrows") and for forms (as "millefeuilles"), tensor objects of higher rank are more difficult to represent. If a symmetric 2-tensor, like the stress tensor σ^ij of mechanics, can be viewed as an ellipsoid, how could we view a tensor T_ijk^ℓm ? It is the power of mathematics to suggest analogies, so we can work even without geometric interpretations. But this absence of intuitive interpretation of high-rank tensors tells us that we will have to introduce the bases for these objects in a non-intuitive way. Essentially, what we want is that the bases for high-rank tensors are not independent of the bases of vectors and forms. We want,
in fact, more than this. Given two vectors U i and V i , we understand what we mean when
we deﬁne a 2tensor W by W ij = U i V j . The basis for 2tensors is perfectly deﬁned by the
condition that we wish that the components of W are precisely U i V j and not, for instance,
the values obtained after some rotation or change of coordinates.
This is enough, and we could directly use the notations introduced by equations 1.151.
Instead, common mathematical developments introduce the notion of "tensor product", and, instead of notations like e_ij , e^ij , e^i_j , or e^ijk..._ℓmn... , introduce the notations e_i ⊗ e_j , e^i ⊗ e^j , e^i ⊗ e_j , or e^i ⊗ e^j ⊗ e^k ⊗ ... ⊗ e_ℓ ⊗ e_m ⊗ e_n ⊗ ... . Then, equations 1.151 are written

B = B^ij e_i ⊗ e_j ;   C = C_ij e^i ⊗ e^j ;   D = D_i^j e^i ⊗ e_j ;   E = E_ijk...^ℓmn... e^i ⊗ e^j ⊗ e^k ⊗ ... ⊗ e_ℓ ⊗ e_m ⊗ e_n ⊗ ... .   (1.152)

What follows is an old text, to be updated.
The metric tensor has been introduced in section 1.3. Let us show here that if the space
into consideration has a scalar product, then, the metric can be computed. Here, the scalar
product of two vectors V and W is denoted V · W . Then, deﬁning
dr = dxi ei (1.153) ds2 = dr · dr (1.154) ds2 = dr · dr = (dxi ei ) · (dxj ej ) = (ei · ej ) dxi dxj . (1.155) and gives Deﬁning the metric tensor
gij = ei · ej (1.156) Appendixes 35 gives then
ds2 = gij dxi dxj . (1.157) To emphasize that at every point of the manifold we have a diﬀerent tensor space, and
a diﬀerent basis, we can always write explicitly the dependence of the basis vectors on the
coordinates, as in ei (x) . Equation 1.143 is then just a short notation for
V(x) = V i (x) ei (x) , (1.158) while equation 1.146 is a short notation for
F(x) = Fi (x) ei (x) . (1.159) Here and in most places of the book, the notation x is a shortcut notation for {x1 , x2 , . . . } .
The reader should just remember that x represents a point in the space, but it is not a vector.
It is important to realize that, when dealing with tensor mathematics, a single basis is a
basis for all the vector spaces at the considered point. For instance, the vector V may be a
velocity, and the vector E may be an electric ﬁeld. The two vectors belong to diﬀerent vector
spaces, but the are obtained as “linear combinations” of the same basis vectors:
V = V i ei
E = E i ei , (1.160) but, of course, the components are not pure real numbers: they have dimensions. Box ?? recalls
what the dimensions of components are.
Let us examine the components of the basis vectors (on the basis they deﬁne). Obviously,
(ei )j = δi j
or, explicitly, 1
0 e1 = 0 .
.
. (ej )i = δi j , 0
1 e2 = 0 .
.
. (1.161) ... . (1.162) Equivalently, for the basis of 2tensors we have 1
0 e 1 ⊗ e1 = 0 ··· 0
0
0 ··· ··· e2 ⊗ e1 = 0
1
0 0
0
0
··· (ei ⊗ ej )kl = δi k δj l 0 ···
0
0 0 ··· e1 ⊗ e2 = 0
0 ··· ...
···
··· ···
··· ··· ...
···
0
0
0 0
1
0 ··· ··· e2 ⊗ e2 = 0
0
0 (1.163)
1
0
0 ···
··· ··· ...
···
0
0
0 ··· ···
··· ··· ...
···
0
0
0 ... ... (1.164) 36 1.8
... ... ... and similar formular for other basis.
Note: say somewhere that the deﬁnition of basis vectors given above imposes that the
vectors of the natural basis are, at any point, tangent to the coordinate lines at that point. The
notion of tangency is independent of the existence, or not, of a metric, i.e., of the possibility of
measuring distances in the space. This is not so for the notion of perpendicularity, that makes
sense only if we can measure distances (and, therefore, angles). In general, then, the vectors
of the natural basis are tangent to the coordinate lines. When a metric has been introduced,
the vectors in the natural basis at a given point will be mutually perpendicular only if the
coordinate lines themselves are mutually perpendicular at that point. Ordinary coordinates
in the Euclidean 3D space (Cartesian, cylindrical, spherical, . . . ) deﬁne coordinate lines that
are orthogonal at every point. Then, the vectors of the natural basis will also be mutually
orthogonal at all points. But the vectors of the natural basis are not, in general, normed to 1 .
For instance, figure XXX illustrates the fact that the norms of the vectors of the natural basis in polar coordinates are, at point (r, ϕ) , $ \|\mathbf{e}_r\| = 1 $ and $ \|\mathbf{e}_\varphi\| = r $ .
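These polar-coordinate norms are easy to verify symbolically. The sketch below (Python with SymPy; an illustration, not part of the original text) builds the natural basis vectors as tangents to the coordinate lines and recovers the two norms:

```python
import sympy as sp

r, phi = sp.symbols('r phi', positive=True)

# Cartesian position vector expressed in polar coordinates
x = sp.Matrix([r * sp.cos(phi), r * sp.sin(phi)])

# Natural basis vectors: partial derivatives of the position vector,
# i.e. tangents to the coordinate lines
e_r = x.diff(r)      # (cos phi, sin phi)
e_phi = x.diff(phi)  # (-r sin phi, r cos phi)

norm_r = sp.simplify(e_r.norm())      # 1
norm_phi = sp.simplify(e_phi.norm())  # r
print(norm_r, norm_phi)
```

Note that e_ϕ is tangent to the ϕ coordinate line but not normed to 1, exactly as the text states.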
1.8.1.6 Tensor Components

Consider an n-dimensional manifold X . At any point P of the manifold, one
can consider the linear space L that is tangent to the manifold at that point, and its
dual L∗ . One can also consider, at point P , the ‘tensor product’ of spaces

$ L(p,q) = \underbrace{L \otimes \cdots \otimes L}_{p\ \text{times}} \otimes \underbrace{L^\ast \otimes \cdots \otimes L^\ast}_{q\ \text{times}} \; . $

A ‘p-times contravariant, q-times covariant tensor’ at point P of the manifold is an element of L(p,q) .
When a coordinate system x = {x1 , . . . , xn } is chosen over X , one has, at point P , the
‘natural basis’ for the linear tangent space L , say {ei } , and, by virtue of the tensor product,
also a basis for the space L(p, q ) , say {ei1 ⊗ ei2 ⊗ · · · ⊗ eip ⊗ ej1 ⊗ ej2 ⊗ · · · ⊗ ejq } . Any tensor
T at point P of X can then be developed on this basis,
T = Tx i1 i2 ...ip j1 j2 ...jq ei1 ⊗ ei2 ⊗ · · · ⊗ eip ⊗ ej1 ⊗ ej2 ⊗ · · · ⊗ ejq , (1.165) to deﬁne the natural components of the tensor, Tx i1 i2 ...ip j1 j2 ...jq . They are intimately linked
to the coordinate system chosen over X , as this coordinate system has induced the natural
basis {ei } at the considered point P . The index x in the components is there to recall this
fact. It is essential when diﬀerent coordinates are going to be simultaneously considered, but it
can be dropped when there is no possible confusion about the coordinate system being used. Its
lower or upper position may be chosen for typographical clarity, and, of course, has no special
variance meaning.
1.8.1.7 Tensors in Metric Spaces

Comment: explain here that it is possible to give a lot of structure to a manifold (tangent linear space, (covariant) derivation, etc.) without the need of a metric. It is introduced here to simplify the text, as, if not, we would have needed to come back to most of the results to add the particular properties arising when there is a metric. But, in all rigor, it would be preferable to introduce the metric after, for instance, the definition of covariant differentiation, which does not need it.

Having a metric in a differential manifold means being able to define the length of a line.
This will then imply that we can deﬁne a scalar product at every local tangent linear space
(and, thus, the angle between two crossing lines).
The metric will also allow us to define a natural bijection between vectors and forms, and between tensor densities and capacities.

A metric is defined when a second-rank symmetric form g with components gij is given.
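As a small numerical aside (a sketch in Python/NumPy, not part of the original text; the evaluation point and the component values below are invented for illustration), here is the machinery that such a metric provides, anticipating the index lowering and scalar product defined in the following paragraphs:

```python
import numpy as np

# Metric of spherical coordinates evaluated at r = 2, theta = pi/2
r, theta = 2.0, np.pi / 2
g = np.diag([1.0, r**2, (r * np.sin(theta))**2])  # g_ij

V = np.array([1.0, 0.5, 0.25])   # contravariant components V^i
W = np.array([2.0, 0.1, 0.0])    # contravariant components W^i

V_low = g @ V        # F_i = g_ij V^j : the form associated to the vector V

s1 = V @ g @ W       # g_ij V^i W^j
s2 = V_low @ W       # V_i W^i
print(s1, s2)        # the two expressions of the scalar product agree
```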
The length L of a line x^i(λ) is then defined by the line integral

$ L = \int_\lambda ds \; , $ (1.166)

where
ds2 = gij dxi dxj . (1.167) Once we have a metric, it is possible to deﬁne a bijection between forms and vectors. For,
to the vector V with components V i we can associate the form F with components
Fi = gij V j . (1.168) Then, it is customary to use the same letter to designate a vector and a form that are linked
by this natural bijection, as in
Vi = gij V j . (1.169) The inverse of the previous equation is written
$ V^i = g^{ij}\, V_j \; , $ (1.170)

where

$ g_{ij}\, g^{jk} = \delta_i{}^k \; . $ (1.171)

The reader will easily give sense to the expression
ei = gij ej . (1.172) The equations above, and equations like
$ T_{ij\ldots}{}^{kl\ldots} = g_{ip}\, g_{jq} \ldots g^{kr}\, g^{ls} \ldots\, T^{pq\ldots}{}_{rs\ldots} \; , $ (1.173)

are summarized by saying that “the metric tensor allows one to raise and lower indices”.
The value of the metric at a particular point of the manifold allows us to define a scalar product
for the vectors in the local tangent linear space. Denoting the scalar product of two vectors
V and W by V · W , we can use any of the deﬁnitions
$ \mathbf{V} \cdot \mathbf{W} = g_{ij}\, V^i W^j = V_i W^i = V^i W_i \; . $ (1.174)

To define parallel transportation of tensors, we have introduced a connection Γij k . Now that
we have a metric, we may wonder whether a parallel-transported vector keeps a constant length. It is easy to show (see demonstration in [Comment: where?]) that this is true if we
have the compatibility condition
$ \nabla_i\, g_{jk} = 0 \; , $ (1.175)

i.e.,

$ \partial_i\, g_{jk} = g_{sk}\, \Gamma_{ij}{}^s + g_{js}\, \Gamma_{ik}{}^s \; . $ (1.176)
∇i (gjk T pq... rs... ) = gjk (∇i T pq... rs... ), (1.177) which, in fact, means that it is equivalent to take a covariant derivative, then raise or lower an
index, or ﬁrst raise or lower an index, then take the covariant derivative.
Note: introduce somewhere the notation
Γijk = gks Γij s , (1.178) warn the reader that this is just a notation : the connection coeﬃcients are not the components
of a tensor. and say that if the condition 1.175 holds, then, it is possible to compute the
connection coeﬃcients from the metric and the torsion:
Γijk = 1
1
(∂i gjk + ∂j gik − ∂k gij ) + (Sijk + Skij + Skji ) .
2
2 (1.179) As the basis vectors have components
(ei )j = δi j , (1.180) ei · ej = gij . (1.181) dr = dxi ei (1.182) dr · dr = ds2 . (1.183) we have Deﬁning gives then We have seen that the metric can be used to deﬁne a natural bijection between forms and
vectors. Let us now see that it can also be used to deﬁne a natural bijection between tensors,
densities, and capacities.
We denote by g the determinant of gij :
g = det({gij }) = 1 ijk... pqr...
ε
ε
gip gjq gkr . . . .
n! (1.184) Appendixes 39 The two upper bars recall that g is a second order density, as there is the product of two
densities at the righthand side.
For a reason that will become obvious soon, the square root of g is denoted g :
g = g g. (1.185) In (Comment: where?) we demonstrate that we have
∂i g = g Γis s . (1.186) Using expression (Comment: which one?) for the (covariant) derivative of a scalar density, this
simply gives
∇i g = ∂i g − g Γis s = 0 , (1.187) which is consistent with the fact that
∇i gjk = 0 . (1.188) We can also deﬁne the determinant of g ij :
g = det({g ij }) = 1
ε
g ip g jq g kr . . . ,
ε
n! ijk... pqr... (1.189) and its square root g :
g = g g. (1.190) As the matrices gij and g ij are mutually inverses, we have
g g = 1. (1.191) Using the scalar density g and the scalar capacity g we can associate tensor densities, pure
tensors, and tensor capacities. Using the same letter to designate the objects related through
this natural bijection, we will write expressions like
ρ = g ρ,
i (1.192) V = g Vi, (1.193) Tij... kl... = g T ij... kl... . (1.194) or So, if gij and g ij can be used to “lower and raise indices”, g and g can be used to “put
and remove bars”.
Comment: say somewhere that g is the density of volumetric content , as the volume
element of a metric space is given by
dV = g dτ , (1.195) 40 1.8 where dτ is the capacity element deﬁned in (Comment: where?), and which, when we take an
element along the coordinate lines, equals dx1 dx2 dx3 . . . .
Comment: Say that we can demonstrate that, in an Euclidean space, the matrix representing
the metric equals the product of the Jacobian matrix times the transposed matrix: ∂X 1 ∂X 1 ∂X 1 ∂X 2 g11 g12 . . .
...
...
1
2
1
1
∂x
∂x
∂x
∂x ∂X 2 ∂X 2 ∂X 1 ∂X 2 (1.196)
{gij } = g21 g22 . . . = ∂x1 ∂x2 . . . × ∂x2 ∂x2 . . . .
. ..
.
.
.
.
.
...
...
.
.
.
.
.
.
.
.
.
.
.
.
.
In short,
gij =
K ∂X K ∂X K
.
∂xi ∂xj (1.197) This follows directly from the general equation
$ g_{ij} = \frac{\partial X^I}{\partial x^i}\, \frac{\partial X^J}{\partial x^j}\; g_{IJ} \; , $ (1.198)

using the fact that, if the {X I } are Cartesian coordinates,

$ \{g_{IJ}\} = \begin{pmatrix} g_{11} & g_{12} & \cdots \\ g_{21} & g_{22} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots \\ 0 & 1 & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \; . $ (1.199)

Comment: explain here that the metric introduces a bijection between forms and vectors:
Vi = gij V j . (1.200) Comment: introduce here the notation
$ (\mathbf{V}, \mathbf{W}) = g_{ij}\, V^i W^j = V_i W^i = W_i V^i \; . $ (1.201)

1.8.2 Appendix: Dimension of Components

What dimensions do the components of a vector have? Contrary to the bases of elementary
calculus, the vectors deﬁning the natural basis are not normed to one. Rather, it follows from
gij = ei · ej that the length (i.e., the norm) of the basis vector ei is $ \|\mathbf{e}_i\| = \sqrt{g_{ii}} $ . For instance, if in the Euclidean 3D space with Cartesian coordinates

$ \|\mathbf{e}_x\| = \|\mathbf{e}_y\| = \|\mathbf{e}_z\| = 1 \; , $

the use of spherical coordinates gives

$ \|\mathbf{e}_r\| = 1 \qquad \|\mathbf{e}_\theta\| = r \qquad \|\mathbf{e}_\varphi\| = r \sin\theta \; . $

Denoting by [V] the physical dimension of (the norm of) a vector, this gives

$ [\,\mathbf{e}_i\,] = [\,\sqrt{g_{ii}}\,] \; . $

For instance, in Cartesian coordinates,

$ [\,\mathbf{e}_x\,] = [\,\mathbf{e}_y\,] = [\,\mathbf{e}_z\,] = 1 \; , $

and in spherical coordinates,

$ [\,\mathbf{e}_r\,] = 1 \qquad [\,\mathbf{e}_\theta\,] = L \qquad [\,\mathbf{e}_\varphi\,] = L \; , $

where L represents the dimension of a length. A vector V = V^i e_i has components with dimensions

$ [\,V^i\,] = \frac{[\,\mathbf{V}\,]}{[\,\mathbf{e}_i\,]} = \frac{[\,\mathbf{V}\,]}{[\,\sqrt{g_{ii}}\,]} \; . $

For instance, in Cartesian coordinates,

$ [\,V^x\,] = [\,V^y\,] = [\,V^z\,] = [\,\mathbf{V}\,] \; , $

and in spherical coordinates,

$ [\,V^r\,] = [\,\mathbf{V}\,] \qquad [\,V^\theta\,] = \frac{[\,\mathbf{V}\,]}{L} \qquad [\,V^\varphi\,] = \frac{[\,\mathbf{V}\,]}{L} \; . $

In general, the physical dimension of the component $ T_{ij\ldots}{}^{kl\ldots} $ of a tensor T is

$ [\,T_{ij\ldots}{}^{kl\ldots}\,] = [\,\mathbf{T}\,]\; [\,\mathbf{e}_i\,]\, [\,\mathbf{e}_j\,] \cdots \frac{1}{[\,\mathbf{e}_k\,]}\, \frac{1}{[\,\mathbf{e}_l\,]} \cdots = [\,\mathbf{T}\,]\; [\,\sqrt{g_{ii}}\,]\, [\,\sqrt{g_{jj}}\,] \cdots \frac{1}{[\,\sqrt{g_{kk}}\,]}\, \frac{1}{[\,\sqrt{g_{ll}}\,]} \cdots \; . $

1.8.3 Appendix: The Jacobian in Geographical Coordinates

Example 1.5 Let
x = {x, y, z } y = {r, ϕ, ϑ} ; (1.202) respectively represent a Cartesian and a geographical system of coordinates over the Euclidean
3D space,
x = r cos ϑ cos ϕ
y = r cos ϑ sin ϕ
z = r sin ϑ . (1.203)

The matrix of partial derivatives defined at the right of equation 1.2 is

$ \mathbf{X} = \begin{pmatrix} \cos\vartheta\cos\varphi & -r\cos\vartheta\sin\varphi & -r\sin\vartheta\cos\varphi \\ \cos\vartheta\sin\varphi & r\cos\vartheta\cos\varphi & -r\sin\vartheta\sin\varphi \\ \sin\vartheta & 0 & r\cos\vartheta \end{pmatrix} \; . $ (1.204)

The matrix Y defined at the left of equation 1.2 could be computed by, first, solving in
equations 1.203 for the geographical coordinates as a function of the Cartesian ones, and, then,
by computing the partial derivatives. This would give the matrix Y as a function of {x, y, z } .
More simply, we can just evaluate Y as X−1 (equation 1.5), but this, of course, gives Y as a function of {r, ϕ, ϑ} :

$ \mathbf{Y} = \begin{pmatrix} \cos\vartheta\cos\varphi & \cos\vartheta\sin\varphi & \sin\vartheta \\ -\sin\varphi/(r\cos\vartheta) & \cos\varphi/(r\cos\vartheta) & 0 \\ -\sin\vartheta\cos\varphi/r & -\sin\vartheta\sin\varphi/r & \cos\vartheta/r \end{pmatrix} \; . $ (1.205)
The two Jacobian determinants are

$ X = \frac{1}{Y} = r^2 \cos\vartheta \; . $ (1.206)

For the metric, as
ds2 = dx2 + dy 2 + dz 2 = dr2 + r2 cos2 ϑ dϕ2 + r2 dϑ2
one has the volume densities (remember that $ \bar{g} = \sqrt{\det \mathbf{g}} $ )

$ \bar{g}_x = 1 \; ; \qquad \bar{g}_y = r^2 \cos\vartheta \; . $ (1.207, 1.208)

The comparison of these two last equations with equation 1.206 shows that one has
gy = X gx , (1.209) in accordance with the general rule for the change of values of a scalar density under a change
of variables (equation 1.12). Appendixes 43 Here, the fundamental capacity elements are
dv x = dx ∧ dy ∧ dz ; dv y = dr ∧ dϕ ∧ dϑ . (1.210) Using the change of variables in equation 1.203 one obtains6
dx ∧ dy ∧ dz = r2 cos ϑ dr ∧ dϕ ∧ dϑ , (1.211) and inserting this into equation 1.210 gives
$ \underline{dv}_y = \frac{1}{r^2 \cos\vartheta}\; \underline{dv}_x = \frac{1}{\bar{X}}\; \underline{dv}_x \; , $ (1.212)

in accordance with the general rule for the change of values of a scalar capacity under a change
of variables (equation 1.13). [End of example.] This results from the explicit computation of the exterior product dx ∧ dy ∧ dz , where dx = cos ϕ cos ϑ dr −
r sin ϕ cos ϑ dϕ − r cos ϕ sin ϑ dϑ , dy = sin ϕ cos ϑ dr + r cos ϕ cos ϑ dϕ − r sin ϕ sin ϑ dϑ and dz = sin ϑ dr +
r cos ϑ dϑ .
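The computations of this example are easy to reproduce symbolically. The sketch below (Python/SymPy; an illustration, not part of the original text) builds the Jacobian matrix of equation 1.204, checks the determinant r² cos ϑ of equation 1.206, and recovers the metric of equation 1.207 as the transposed Jacobian times the Jacobian (equation 1.197):

```python
import sympy as sp

r, ph, th = sp.symbols('r varphi vartheta', positive=True)

# Cartesian coordinates as functions of the geographical ones (eq. 1.203)
xyz = sp.Matrix([r * sp.cos(th) * sp.cos(ph),
                 r * sp.cos(th) * sp.sin(ph),
                 r * sp.sin(th)])

X = xyz.jacobian([r, ph, th])   # matrix of partial derivatives

detX = sp.simplify(X.det())
print(detX)                     # r**2 * cos(vartheta), as in eq. 1.206

g = sp.simplify(X.T * X)        # g_ij = sum_K dX^K/dx^i dX^K/dx^j
print(g)                        # diag(1, r**2 cos(vartheta)**2, r**2)
```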
6 44 1.8 1.8.4 Appendix: Kronecker Determinants in 2 3 and 4 D 1.8.4.1 The Kronecker’s determinants in 2D
k
δij = k
k
(1/0!) εij εk = δi δj − δi δj k
δj = k
(1/1!) εij εik = δj δ
1.8.4.2 = (1.213) ij (1/2!) εij ε = 1 The Kronecker’s determinants in 3D mn
δijk = mn
mn
n
m
nm
nm
m
n
(1/0!) εijk ε mn = δi δj δk + δi δj δk + δi δj δk − δi δj δk − δi δj δk − δi δj δk m
δjk = m
m
(1/1!) εijk εi m = δj δk − δj δk δk = (1/2!) εijk εij = δk δ = (1/3!) εijk εijk = 1 1.8.4.3 (1.214) The Kronecker’s determinants in 4D mnpq
δijk = (1/0!) εijk εmnpq
mnp
mpq
mqn
nqp
npm
nmq
= +δi δj δk δ q + δi δj δk δ n + δi δj δk δ p + δi δj δk δ m + δi δj δk δ q + δi δj δk δ p
pqm
pmn
pnq
qmp
qnm
qpn
+ δi δj δk δ n + δi δj δk δ q + δi δj δk δ m + δi δj δk δ n + δi δj δk δ p + δi δj δk δ m
mnq
mpn
mqp
npq
nqm
nmp
− δi δj δk δ p − δi δj δk δ q − δi δj δk δ n − δi δj δk δ m − δi δj δk δ p − δi δj δk δ q
pqn
pmq
pnm
qmn
qnp
qpm
− δi δj δk δ m − δi δj δk δ n − δi δj δk δ q − δi δj δk δ p − δi δj δk δ m − δi δj δk δ n mnp
δjk = (1/1!) εijk εimnp
pm
pn
mn
np
mp
nm
= δj δk δ p + δj δk δ m + δj δk δ n − δj δk δ n − δj δk δ p − δj δk δ m mn
m
n
δk = (1/2!) εijk εijmn = (δk δ n − δk δ m ) δ m = (1/3!) εijk εijkm = δ m
δ = (1/4!) εijk εijk = 1 (1.215) Appendixes 1.8.5 45 Appendix: Deﬁnition of Vectors Consider the 3D physical space, with coordinates {xi } = {x1 , x2 , x3 } . In classical mechanics,
the trajectory of a particle is described by the three functions of time xi (t) . Obviously the three
values {x1 , x2 , x3 } are not the components of a vector, as an expression like xi (t) = xi (t)+ xi (t)
I
II
has, in general, no sense (think, for instance, in the case where we use spherical coordinates).
Deﬁne now the velocity of the particle at time t0 :
v i (t0 ) = dxi
dt .
t=t0 If two particles coincide at some point of the space {x1 , x2 , x3 } , it makes sense to deﬁne, for
0
0
0
i
i
instance, their relative velocity by v i (x1 , x2 , x3 , t0 ) = vI (x1 , x2 , x3 , t0 ) − vII (x1 , x2 , x3 , t0 ) . The
0
0
0
0
0
0
0
0
0
v i are the components of a vector.
If we change coordinates, x I = x I (xj ) , then the velocity is deﬁned, in the new
coordinate system, v I = dx I /dt , and we have v I = dx I /dt = ∂x I /∂xi dxi /dt , i.e.,
∂x I i
v=
v,
∂xi
I which is the standard rule for transformation of the components of a vector when the coordinates
(and, so, the natural basis) change.
Objects with upper or lower indices not always are tensors . The four classical objects
which do not have necessarily tensorial character are:
• the coordinates {xi } ,
• the partial diﬀerential operator ∂i ,
• the Connection Coeﬃcients Γij k ,
• the elements of the Jacobian matrix Ji I = ∂x I /∂xi . 46 1.8 1.8.6 Appendix: Change of Components
capacity
s =J s
F I = J JI i F i 0rank
1form
1vector
2form V
Q IJ I = J V i Ji I
= J JI i JJ j Qij (1form)(1vector) R I J = J JI i R i j Jj J (1vector)(1form) S 2vector
.
.
. T I
J
IJ tensor
s =s
FI = JI i Fi
V
Q IJ density
s =J s
F I = J JI i F i I = V i Ji I
= JI i JJ j Qij V
Q IJ I i = J V Ji I
= J JI i JJ j Qij R I J = JI i R i j Jj J R I J = J JI i R i j Jj J = J Ji I S i j JJ j S I J = J i I S i j JJ j S = J T ij Ji I Jj J
.
.
. T IJ = T ij Ji I Jj J
.
.
. T I J
IJ i = J Ji I S j J J j
ij = J T Ji I J j J
.
.
. Table 1.1: Changes of the components of the capacities, tensors and densities under a change
of variables. Appendixes 1.8.7 47 Appendix: Covariant Derivatives
Capacity Tensor Density ∇k s = ∂k s + Γk s ∇k s = ∂k s ∇k s = ∂k s − Γk s ∇k F i = ∂k F i + Γk F i
−Γki s F s ∇k Fi = ∂k Fi
−Γki s Fs ∇k F i = ∂k F i − Γk F i
−Γki s F s ∇k V i = ∂k V i + Γk V i
+Γks i V s ∇k V i = ∂k V i
+Γks i V s ∇k V = ∂k V − Γk V
s
+Γks i V ∇k Qij = ∂k Qij + Γk Qij
−Γki s Qsj − Γkj s Qis ∇k Qij = ∂k Qij
−Γki Qsj − Γkj s Qis ∇k Qij = ∂k Qij − Γk Qij
−Γki s Qsj − Γkj s Qis ∇k Ri j = ∂k Ri j + Γk Ri j
−Γki s Rs j + Γks j Ri s ∇k Ri j = ∂k Ri j
−Γki Rs j + Γks j Ri s ∇k Ri j = ∂k Ri j − Γk Ri j
−Γki s Rs j + Γks j Ri s ∇k S i j = ∂k S i j + Γk S i j
+Γks i S s j − Γkj s S i s ∇k S i j = ∂k S i j
+Γks i S s j − Γkj s S i s ∇k S j = ∂k S j − Γk S j
s
i
+Γks i S j − Γkj s S s ∇k T ij = ∂k T ij + Γk T ij
+Γks i T sj + Γks j T is ∇k T ij = ∂k T ij
+Γks i T sj + Γks j T is ∇k T = ∂k T − Γk T
sj
is
+Γks i T + Γks j T .
.
. .
.
. .
.
. s s i i i i i i ij ij ij Table 1.2: Covariant derivatives for capacities, tensors and densities. 48 1.8 1.8.8 Appendix: Formulas of Vector Analysis Let be a , b , and c vector ﬁelds, ϕ a scalar ﬁeld, and ∆a the vector Laplacian (the
Laplacian applied to each component of the vector). The following list of identities holds:
div rot a = 0 (1.216) rot grad ϕ = 0 (1.217) div(ϕa) = (grad ϕ) · a + ϕ(div a) (1.218) rot(ϕa) = (grad ϕ) × a + ϕ(rot a) (1.219) grad(a · b) = (a · ∇)b + (b · ∇)a + a × (rot b) + b × (rot a) (1.220) div(a × b) = b · (rot a) − a · (rot b) (1.221) rot(a × b) = a(div b) − b(div a) + (b · ∇)a − (a · ∇)b (1.222) rot rot a = grad(div a) − ∆a . (1.223) Using the nabla symbol everywhere, these equations become:
∇ · (∇ × a) = 0 (1.224)

∇ × (∇ϕ) = 0 (1.225)

∇ · (ϕa) = (∇ϕ) · a + ϕ(∇ · a) (1.226)

∇ × (ϕa) = (∇ϕ) × a + ϕ(∇ × a) (1.227)

∇(a · b) = (a · ∇)b + (b · ∇)a + a × (∇ × b) + b × (∇ × a) (1.228)

∇ · (a × b) = b · (∇ × a) − a · (∇ × b) (1.229)

∇ × (a × b) = a(∇ · b) − b(∇ · a) + (b · ∇)a − (a · ∇)b (1.230)

∇ × (∇ × a) = ∇(∇ · a) − ∆a . (1.231)

The following three vector equations are also often useful:
a · (b × c) = b · (c × a) = c · (a × b) (1.232)

a × (b × c) = (a · c) b − (a · b) c (1.233)

(a × b) · (c × d) = a · [b × (c × d)] = (a · c)(b · d) − (a · d)(b · c) (1.234)

As, in tensor notations, the scalar product of two vectors is a · b = ai bi , and the vector
product has components (a × b)i = εijk aj bk (see section XXX), the identities 1.232–1.234
correspond respectively to:
$ \nabla_i\, \varepsilon^{ijk} \nabla_j a_k = 0 $ (1.235)

$ \varepsilon^{ijk} \nabla_j \nabla_k \varphi = 0 $ (1.236)

$ a^i \varepsilon_{ijk} b^j c^k = b^i \varepsilon_{ijk} c^j a^k = c^i \varepsilon_{ijk} a^j b^k $ (1.237)

$ \varepsilon^{ijk} a_j\, (\varepsilon_{klm}\, b^l c^m) = (a_j c^j)\, b^i - (a_j b^j)\, c^i $ (1.238)

$ (\varepsilon_{ijk}\, a^j b^k)(\varepsilon^{ilm}\, c_l d_m) = a^i\, \varepsilon_{ijk}\, b^j\, (\varepsilon^{klm}\, c_l d_m) \; , $ (1.239)

while the identities 1.226–1.231 correspond respectively to
∇i (ϕ ai ) = (∇i ϕ)ai + ϕ(∇i ai ) (1.240) εijk ∇j (ϕ ak ) = εijk (∇j ϕ)ak + ϕεijk ∇j ak (1.241) ∇i (aj bj ) = (aj ∇j )bi + (bj ∇j )ai + εijk aj (εk m ∇ bm ) + εijk bj (εk m ∇ am ) (1.242) ∇i (εijk aj bk ) = bk εkij ∇i aj − aj εjik ∇i bk (1.243) εijk ∇j (εk m a bm )ai ∇j bj − bi ∇j aj + bj ∇j ai − aj ∇j bi (1.244) εijk ∇j (εk m ∇ am ) = ∇i (∇j aj ) − ∇j ∇j ai , (1.245) where the (inelegant) notation ∇i represents g ij ∇j .
The truth of the set of equations 1.235–1.245, when not obvious, is easily demonstrated by
the simple use of the property (see section XXX)
εijk εk m = δi δj m − δi m δj (1.246) 50 1.8.9 1.8 Appendix: Metric, Connection, etc. in Usual Coordinate Systems [Note: This appendix shall probably be suppressed.]
1.8.9.1
1.8.9.1.1 Cartesian Coordinates
Line element
ds2 = dx2 + dy 2 + dz 2 100
gxx gxy gxz
gyx gyy gyz = 0 1 0
001
gzx gzy gzz 1.8.9.1.2 (1.247) (1.248) Metric 1.8.9.1.3 Fundamental density
g=1 1.8.9.1.4 Connection
x
Γxx
Γyx x
Γzx x
y
Γxx
Γyx y
Γzx y
z
Γxx
Γyx z
Γzx z 1.8.9.1.5 (1.249) 000
Γxy x Γxz x
Γyy x Γyz x = 0 0 0
000
Γzy x Γzz x y
y
000
Γxy Γxz
Γyy y Γyz y = 0 0 0
000
Γzy y Γzz y z
z
000
Γxy Γxz
Γyy z Γyz z = 0 0 0
000
Γzy z Γzz z (1.250) Contracted connection Γx
0
Γy = 0
0
Γz (1.251) 1.8.9.1.6 Relationship between covariant and contravariant components for ﬁrst
order tensors x
Vx
V
Vy = V y (1.252)
Vz
Vz Appendixes 51 1.8.9.1.7 Relationship between covariant
ond order tensors
x Tx Tx y
Txx Txy Txz
Tyx Tyy Tyz = Ty x Ty y
Tzx Tzy Tzz
Tz x Tz y
1.8.9.1.8 and contravariant components for sec xx T
Tx z
T xy T xz
Ty z = T yx T yy T yz Tz z
T zx T zy T zz Norm of the vectors of the natural basis
e x = e y = ez = 1 1.8.9.1.9 (1.254) Norm of the vectors of the normed basis
e x = e y = ez = 1 1.8.9.1.10 (1.253) (1.255) Missing Comment: give also the norms of the vectors of the dual basis. 1.8.9.1.11 Relations between components on the natural and the normed basis
for ﬁrst order tensors x x Vx
V
V
Vx y
y
V
Vy = Vy ;
(1.256)
= V z
z
Vz
V
Vz
V
1.8.9.1.12 Relations between components on the natural and the normed basis
for second order tensors Txx Txy Txz
Txx Txy Txz
Tyx Tyy Tyz = Tyx Tyy Tyz Tzx Tzy Tzz
Tzx Tzy Tzz x Tx x Tx y Tx z
Tx Tx y Tx z
Ty x Ty y Ty z = Ty x Ty y Ty z x
y
z
Tz x Tz y Tz z
Tz Tz Tz xx xy
xz
T xx T xy T xz
T
T
T
T yx T yy T yz = T yx T yy T yz (1.257) zx
zy
zz
zx
zy
zz
T
T
T
T
T
T
———————————————— 52
1.8.9.2
1.8.9.2.1 1.8
Spherical Coordinates
Line element
ds2 = dr2 + r2 dθ2 + r2 sin2 θ dϕ2 1.8.9.2.2 1.8.9.2.3 Metric (1.258) 10
0
grr grθ grϕ gθr gθθ gθϕ = 0 r2
0
2
0 0 r sin2 θ
gϕr gϕθ gϕϕ (1.259) Fundamental density
g = r2 sin θ 1.8.9.2.4 1.8.9.2.5 Connection
r
Γrr Γθr r
Γϕr r
θ
Γrr Γθr θ
Γϕr θ
ϕ
Γrr Γθr ϕ
Γϕr ϕ 00
0
Γrθ r Γrϕ r 0
Γθθ r Γθϕ r = 0 −r
2
r
r
0 0 −r sin θ
Γϕθ Γϕϕ 0 1/r
0
Γrθ θ Γrϕ θ 0
Γθθ θ Γθϕ θ = 1/r 0
θ
θ
0
0 − sin θ cos θ
Γϕθ Γϕϕ ϕ
ϕ
0
0
1/r
Γrθ
Γrϕ
0
cotg θ
Γθθ ϕ Γθϕ ϕ = 0
ϕ
ϕ
1/r cotg θ
0
Γϕθ Γϕϕ (1.260) (1.261) Contracted connection Γr
2/r Γθ = cotg θ
0
Γϕ (1.262) 1.8.9.2.6 Relationship between covariant and contravariant components for ﬁrst
order tensors Vr
Vr Vθ = r2 V θ (1.263)
Vϕ
r2 sin2 θ V ϕ
1.8.9.2.7 Relationship between covariant and contravariant components for second order tensors r 1
Trr r12 Trθ r2 sin2 θ Trϕ
Tr Tr θ Tr ϕ
T rr
T rθ
T rϕ Tθr 12 Tθθ 2 1 2 Tθϕ = Tθ r Tθ θ Tθ ϕ = r2 T θr
r2 T θθ
r2 T θϕ r
r sin θ
2
2
1
1
r
θ
ϕ
2
ϕr
2
ϕθ
2
Tϕ Tϕ Tϕ
r sin θ T
r sin θ T
r sin2 θ T ϕϕ
Tϕr r2 Tϕθ r2 sin2 θ Tϕϕ
(1.264) Appendixes
1.8.9.2.8 53
Norm of the vectors of the natural basis
er = 1 1.8.9.2.9 ; eθ = r ; eϕ = r sin θ Norm of the vectors of the normed basis
er = eθ = eϕ = 1 1.8.9.2.10 (1.265) (1.266) Missing Comment: give also the norms of the vectors of the dual basis. 1.8.9.2.11 Relations between components on the natural and the normed basis
for ﬁrst order tensors r Vr
Vr
V
Vr
V θ = 1 V θ Vθ = r Vθ ;
(1.267) r ϕ
1
ϕ
Vϕ
V
r sin θ Vϕ
V
r sin θ
1.8.9.2.12 Relations between
for second order tensors Trr Trθ Trϕ Tθr Tθθ Tθϕ Tϕr Tϕθ Tϕϕ
r Tr Tr θ Tr ϕ Tθ r Tθ θ Tθ ϕ Tϕ r Tϕ θ Tϕ ϕ rr T
T rθ T rϕ T θr T θθ T θϕ T ϕr T ϕθ T ϕϕ components on the natural and the normed basis rTrθ
r sin θ Trϕ
Trr = rTθr
r2 Tθθ
r2 sin θ Tθϕ r sin θ Tϕr r2 sin θ Tϕθ r2 sin2 θ Tϕϕ 1
1
Tr r
Tr θ
Tr ϕ
r
r sin θ 1
= rTθ r
Tθ θ
Tϕ
sin θ θ Tϕ ϕ
r sin θ Tϕ r sin θ Tϕ θ 1 rθ
1
T
T rϕ
T rr
r
r sin θ 1 θθ
1
= 1 T θr
T
T θϕ r
r2
r 2 sin θ
1
1
1
T ϕr r2 sin θ T ϕθ r2 sin2 θ T ϕϕ
r sin θ (1.268) Note: say somewhere in this appendix that the two following formulas are quite useful in
deriving the formulas above.
1∂ n
∂ψ n
(r ψ ) =
+ψ
rn ∂r
∂r
r (1.269) ∂ψ
1∂
(sinn ϑ ψ ) =
+ n cotgϑ ψ .
n
sin ϑ ∂ϑ
∂ϑ (1.270) ———————————————— 54
1.8.9.3
1.8.9.3.1 1.8
Cylindrical Coordinates: Metric, Connection . . .
Line element
ds2 = dr2 + r2 dϕ2 + dz 2 100
grr grϕ grz
gϕr gϕϕ gϕz = 0 r2 0
001
gzr gzϕ gzz 1.8.9.3.2 (1.271) (1.272) Metric 1.8.9.3.3 Fundamental density
g=r 1.8.9.3.4 Connection Γrr r
Γϕr r
Γzr r
ϕ
Γrr
Γϕr ϕ
Γzr ϕ
z
Γrr
Γϕr z
Γzr z
1.8.9.3.5 (1.273) 000
Γrϕ r Γrz r
Γϕϕ r Γϕz r = 0 −r 0
000
Γzϕ r Γzz r 0 1/r 0
Γrϕ ϕ Γrz ϕ
Γϕϕ ϕ Γϕz ϕ = 1/r 0 0
0
00
Γzϕ ϕ Γzz ϕ 000
Γrϕ z Γrz z
Γϕϕ z Γϕz z = 0 0 0
000
Γzϕ z Γzz z (1.274) Contracted connection Γr
1/r
Γϕ = 0 0
Γz (1.275) 1.8.9.3.6 Relationship between covariant and contravariant components for ﬁrst
order tensors r
Vr
V
Vϕ = r2 V ϕ (1.276)
z
Vz
V
1.8.9.3.7 Relationship between
ond order tensors
r Tr
Trr r12 Trϕ Trz
Tϕr 12 Tϕϕ Tϕz = Tϕ r
r
Tzr r12 Tzϕ Tzz
Tz r covariant and contravariant components for sec rr T
Tr ϕ Tr z
T rϕ
T rz
Tϕ ϕ Tϕ z = r2 T ϕr r2 T ϕϕ r2 T ϕz Tz ϕ Tz z
T zr
T zθ
T zz (1.277) Appendixes
1.8.9.3.8 55
Norm of the vectors of the natural basis
er = 1 1.8.9.3.9 ; eϕ = r ; ez = 1 Norm of the vectors of the normed basis
er = eϕ = ez = 1 1.8.9.3.10 (1.278) (1.279) Missing Comment: give also the norms of the vectors of the dual basis. 1.8.9.3.11 Relations between components on the natural and the normed basis
for ﬁrst order tensors r r Vr
V
V
Vr
V ϕ = 1 V ϕ Vϕ = r Vϕ ;
(1.280) r z
z
Vz
V
Vz
V
1.8.9.3.12 Relations between
for second order tensors Trr Trϕ
Tϕr Tϕϕ
Tzr Tzϕ
r
Tr Tr ϕ
Tϕ r Tϕ ϕ
Tz r Tz ϕ rr
T
T rϕ
T ϕr T ϕϕ
T zr T zϕ components on the natural and the normed basis Trr
Trz = rTϕr
Tϕz Tzz
Tzr Tr r
Tr z z
Tϕ
= rTϕ r
z
Tz
Tz r T rr
T rz 1 ϕr
T ϕz = r T
T zz
T zr rTrϕ Trz r2 Tϕϕ rTϕz rTzϕ Tzz 1
Tr ϕ Tr z
r Tϕ ϕ rTϕ z 1
T ϕ Tz z
rz 1 rϕ
T
T rz
r
1 ϕϕ 1 ϕz T
T
r2
r
1 zϕ
T
T zz
r (1.281) 56 1.8 1.8.10 Appendix: Gradient, Divergence and Curl in Usual Coordinate Systems Here we analyze the 3D Euclidean space, using Cartesian, spherical or cylindrical coordinates.
The words scalar, vector, and tensor mean “true” scalars, vectors and tensors, respectively. The
scalar densities, vector densities and tensor densities (see section XXX) are named explicitly.
1.8.10.1 Deﬁnitions If x → φ(x) is a scalar ﬁeld, its gradient is the form deﬁned by
Gi = ∇i φ . (1.282) i If x → V (x) is a vector density ﬁeld, its divergence is the scalar density deﬁned by
i D = ∇i V . (1.283) If x → Fi (x) is a form ﬁeld, its curl (or rotational ) is the vector density deﬁned by
i R = εijk ∇j Fk .
1.8.10.2 (1.284) Properties These deﬁnitions are such that we can replace everywhere true (“covariant”) derivatives by
partial derivatives (see exercise XXX). This gives, for the gradient of a density,
Gi = ∇i φ = ∂i φ , (1.285) for the divergence of a vector density,
i i D = ∇i V = ∂i V , (1.286) R = εijk ∇j Fk = εijk ∂j Fk (1.287) and for the curl of a form,
i i [this equation is only valid for spaces without torsion; the general formula is R = εijk ∇j Fk =
εijk (∂j Fk − 1 Sjk V ) ].
2
These equations lead to particularly simple expressions. For instance, the following table
shows that the explicit expressions have the same form for Cartesian, spherical and cylindrical
coordinates (or for whatever coordinate system).
Cartesian
Gx = ∂x φ
Gradient
Gy = ∂y φ
Gz = ∂z φ
D
Divergence
=
x
y
z
∂x V + ∂y V + ∂z V
x
R = ∂y Fz − ∂z Fy
y
Curl
R = ∂z Fx − ∂x Fz
z
R = ∂x Fy − ∂y Fx Spherical
Gr = ∂r φ
Gθ = ∂θ φ
Gϕ = ∂ϕ φ
D
=
r
θ
ϕ
∂r V + ∂θ V + ∂ϕ V
r
R = ∂θ Fϕ − ∂ϕ Fθ
θ
R = ∂ϕ Fr − ∂r Fϕ
ϕ
R = ∂r Fθ − ∂θ Fr Cylindrical
Gr = ∂r φ
Gϕ = ∂ϕ φ
Gz = ∂z φ
D
=
r
ϕ
z
∂r V + ∂ϕ V + ∂z V
r
R = ∂ϕ Fz − ∂z Fϕ
ϕ
R = ∂z Fr − ∂r Fz
z
R = ∂r Fϕ − ∂ϕ Fr Appendixes
1.8.10.3 57 Remarks Although we have only deﬁned the gradient of a true scalar, the divergence of a vector density,
and the curl of a form, the deﬁnitions can be immediately be extended by “putting bars on”
and “taking bars oﬀ” (see section XXX).
As an example, from equation 1.282, we can immediately write the deﬁnition of the gradient
of a scalar density,
Gi = ∇i φ , (1.288) from equation 1.283 we can write the deﬁnition of the divergence of a (true) vector ﬁeld,
D = ∇i V i , (1.289) and from equation 1.284 we can write the deﬁnition of the curl of a form as a true vector,
Ri = εijk ∇j Fk , (1.290) R = g i εijk ∇j Fk . (1.291) or a true form, Although equation 1.289 seems well adapted to the practical computation of the divergence
of a true vector, it is better to use 1.286 instead. For we have successively
D = ∂i V i ⇐⇒ ⇐⇒ g D = ∂i (g V i ) D= 1
∂i (g V i ) .
g (1.292) This last expression provides directly compact expressions for the divergence of a vector. For
instance, as the fundamental density g takes, in Cartesian, spherical and cylindrical coordinates, respectively the values 1 , r2 sin θ and r , this leads to the results of the following
table.
∂V x ∂V y ∂V z
+
+
(1.293)
∂x
∂y
∂z
1 ∂ (sin θ V θ ) ∂V ϕ
1 ∂ (r2 V r )
+
+
(1.294)
Divergence, Spherical coordinates : D = 2
r
∂r
sin θ
∂θ
∂ϕ
1 ∂ (rV r ) ∂V ϕ ∂V z
+
+
(1.295)
Divergence, Cylindrical coordinates : D =
r ∂r
∂ϕ
∂z
Divergence, Cartesian coordinates : D = Replacing the components on the natural basis by the components on the normed basis (see
section XXX) gives
Divergence, Cartesian coordinates : D = ∂V x ∂V y ∂V z
+
+
∂x
∂y
∂z Divergence, Spherical coordinates : D = 1 ∂ (sin θ V θ )
1 ∂V ϕ
1 ∂ (r2 V r )
+
+
(1.297)
r2 ∂r
r sin θ
∂θ
r sin θ ∂ϕ Divergence, Cylindrical coordinates : D = 1 ∂ (rV r ) 1 ∂ V ϕ ∂ V z
+
+
r ∂r
r ∂ϕ
∂z (1.296) (1.298) 58 1.8 These are the formulas given in elementary texts (not using tensor concepts).
Similarly, although 1.291 seems well adapted to a practical computation of the curl, it is
better to go back to equation 1.287. We have, successively,
i R = εijk ∂j Fk ⇐⇒ g Ri = εijk ∂j Fk ⇐⇒ Ri = 1 ijk
ε ∂j Fk
g ⇐⇒ R= 1
g i εijk ∂j Fk .
g
(1.299) This last expression provides directly compact expressions for the curl. For instance, as the
fundamental density g takes, in Cartesian, spherical and cylindrical coordinates, respectively
the values 1 , r2 sin θ and r , this leads to the results of the following table.
Rx = ∂y Fz − ∂z Fy
Curl, Cartesian coordinates : Ry = ∂z Fx − ∂x Fz
Rz = ∂x Fy − ∂y Fx
1
(∂θ Fϕ − ∂ϕ Fθ )
sin θ
1
Curl, Spherical coordinates : Rθ =
(∂ϕ Fr − ∂r Fϕ )
sin θ
Rϕ = sin θ (∂r Fθ − ∂θ Fr )
Rr = (1.300) r2 1
Rr = (∂ϕ Fz − ∂z Fϕ )
r
Curl, Cylindrical coordinates : Rϕ = r(∂z Fr − ∂r Fz )
1
Rz = (∂r Fϕ − ∂ϕ Fr )
r (1.301) (1.302) Appendixes 59 Replacing the components on the natural basis by the components on the normed basis (see
section XXX) gives
Rx = ∂y Fz − ∂z Fy
Curl, Cartesian coordinates : Ry = ∂z Fx − ∂x Fz (1.303) Rz = ∂x Fy − ∂y Fx
Rr = 1
r sin θ ∂ (sin θFϕ ) ∂ Fθ
−
∂θ
∂ϕ Curl, Spherical coordinates : Rθ = 1
r 1 ∂ Fr ∂ (rFϕ )
−
sin θ ∂ϕ
∂r Rϕ = 1
r ∂ (rFθ ) ∂ Fr
−
∂r
∂θ Rr = 1
r ∂ Fz ∂ (rFϕ )
−
∂ϕ
∂z ∂ Fr ∂ Fz
−
∂z
∂r
1 ∂ (rFϕ ) ∂ Fr
−
Rz =
r
∂r
∂ϕ Curl, Cylindrical coordinates : Rϕ = (1.304) (1.305) These are the formulas given in elementary texts (not using tensor concepts).
Comment: I should remember not to put this back in a table, as it is not very readable:
Curl
Cartesian Rx = ∂y Fz − ∂z Fy
Ry = ∂z Fx − ∂x Fz
Rz = ∂x Fy − ∂y Fx Spherical Cylindrical ∂ sin θFϕ
∂θ − ∂ Fθ
∂ϕ
∂rFϕ
1
1 ∂ Fr
Rθ = r sin θ ∂ϕ − ∂r
F
Rϕ = 1 ( ∂rFθ − ∂∂θϕ )
r
∂r
Rr = ∂ Fz − ∂rFϕ
∂ϕ
∂z
∂ Fr
∂ Fz
Rϕ = ∂z − ∂r
R z = 1 ∂ r Fϕ − ∂ Fr
r
∂r
∂ϕ Rr = 1
r sin θ 1.8.10.3.1 Comment: What follows is not very interesting and should be suppresed.
From 1.288 we can write
g Gi = ∇i (g φ) , (1.306) 60 1.8 which leads to the formula
Gi = 1
∇i (g φ) .
g (1.307) For instance, as the fundamental density g takes, in Cartesian, spherical and cylindrical
coordinates, respectively the values 1 , r2 sin θ and r , this leads to the results of the
following table.
Cartesian
Gradient Gx =
Gy =
Gz = ∂φ
∂x
∂φ
∂y
∂φ
∂z Spherical
∂
Gr = r2 ∂r r12 φ
∂
1
Gθ = sin θ ∂θ sin θ φ
∂φ
Gϕ = ∂ϕ Cylindrical
∂
Gr = r ∂r 1 φ
r
∂φ
Gϕ = ∂ϕ Gz = ∂φ
∂z Appendixes 1.8.11 61 Appendix: Connection and Derivative in Diﬀerent Coordinate Systems (Comment: mention here the boxes with diﬀerent coordinate systems).
1.8.11.1 Polar coordinates (Twodimensional Euclidean space with nonCartesian coordinates).
ds2 = dr2 + r2 dϕ2 (1.308) Γrϕ ϕ = 1/r ; Γϕr ϕ = 1/r ; Γϕϕ r = −r ; (the others vanish) (1.309) Rij = 0 (1.310) ∇i V i =
1.8.11.2 1∂
∂V ϕ
(rV r ) +
r ∂r
∂ϕ (1.311) Cylindrical coordinates (Threedimensional Euclidean space with nonCartesian coordinates).
ds2 = r2 + r2 dϕ2 + dz 2 (1.312) Γrϕ ϕ = 1/r ; Γϕr ϕ = 1/r ; Γϕϕ r = −r ; (the others vanish) (1.313) Rij = 0 (1.314) ∇i V i =
1.8.11.3 1∂
∂V ϕ ∂V z
(rV r ) +
+
r ∂r
∂ϕ
∂z (1.315) Geographical coordinates Geographical coordinates
(Twodimensional nonEuclidean space).
ds2 = R2 (dθ2 + sin2 θ dϕ2 )
Γθϕ ϕ = cotg θ ; Γϕθ ϕ = cotg θ ; ϕϕ θ = − sin θ cos θ ; (1.316)
(the others vanish) Rθθ = 1/R2 ; Rϕϕ = 1/R2 ; (the others vanish) ; R = 2/R2
∇i V i = 1∂
∂V ϕ
(sin θ V θ ) +
sin θ ∂θ
∂ϕ (1.317)
(1.318)
(1.319) 62
1.8.11.4 1.8
Spherical coordinates (Threedimensional Euclidean space).
ds2 = dr2 + r2 dθ2 + r2 sin2 θ dϕ2
Γrθ θ = 1/r ;
Γθθ r = −r ;
Γϕθ ϕ = cotg θ ; Γrϕ ϕ = 1/r ;
Γθr θ = 1/r ;
Γθϕ ϕ = cotg θ ;
Γϕr ϕ = 1/r ;
2
r
θ
Γϕϕ = −r sin θ ; Γϕϕ = − sin θ cos θ ;
(the others vanish)
Rij = 0 ∇i V i = 1∂ 2 r
1∂
∂V ϕ
(r V ) +
(sin θ V θ ) +
r2 ∂r
sin θ ∂θ
∂ϕ (1.320) (1.321) (1.322) (1.323) Appendixes 1.8.12 63 Appendix: Computing in Polar Coordinates [Note: This appendix is probably to be suppressed.]
1.8.12.1 General formula

1.8.12.1.1 Simple-minded computation. From

    div V = (1/r) ∂(r V^r)/∂r + ∂V^ϕ/∂ϕ ,    (1.324)

we obtain, using a simple-minded discretisation,

    (div V)(r, ϕ) = (1/r) [ (r + δr) V^r(r + δr, ϕ) − (r − δr) V^r(r − δr, ϕ) ] / (2 δr)
                  + [ V^ϕ(r, ϕ + δϕ) − V^ϕ(r, ϕ − δϕ) ] / (2 δϕ) .    (1.325)

1.8.12.1.2 Computation through parallel transport. The notion of parallel transport leads to

    (div V)(r, ϕ) = [ V^r(r, ϕ ← r + δr, ϕ) − V^r(r, ϕ ← r − δr, ϕ) ] / (2 δr)
                  + [ V^ϕ(r, ϕ ← r, ϕ + δϕ) − V^ϕ(r, ϕ ← r, ϕ − δϕ) ] / (2 δϕ) ,    (1.326)

where V^i(r, ϕ ← r′, ϕ′) denotes the component at (r′, ϕ′) after parallel transport to (r, ϕ), which gives

    (div V)(r, ϕ) = [ V^r(r + δr, ϕ) − V^r(r − δr, ϕ) ] / (2 δr)
                  + cos(δϕ) [ V^ϕ(r, ϕ + δϕ) − V^ϕ(r, ϕ − δϕ) ] / (2 δϕ)
                  + (sin(δϕ)/δϕ) (1/r) [ V^r(r, ϕ + δϕ) + V^r(r, ϕ − δϕ) ] / 2 .    (1.327)

1.8.12.1.3 Note: Natural basis and ‘normed’ basis. The components on the natural basis, V^r and V^ϕ, are related to the components on the normed basis, V̂^r and V̂^ϕ, through

    V̂^r = V^r    (1.328)

and

    V̂^ϕ = r V^ϕ .    (1.329)

1.8.12.2 Divergence of a constant field

A constant vector field (oriented ‘as the x axis’) has components
    V^r(r, ϕ) = k cos ϕ    (1.330)

and

    V^ϕ(r, ϕ) = −(k/r) sin ϕ .    (1.331)

1.8.12.2.1 Simple-minded computation. An exact evaluation of approximation 1.325 gives

    (div V)(r, ϕ) = (k/r) cos ϕ ( 1 − sin(δϕ)/δϕ ) ,    (1.332)

an expression with an error of order (δϕ)² .
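The spurious value 1.332 is easy to reproduce numerically. The sketch below (Python; the code and the numerical values are illustrative additions, not part of the original text) applies the simple-minded discretisation 1.325 to the constant field 1.330–1.331 and compares the result with the prediction 1.332:

```python
import math

k = 2.0  # magnitude of the constant field (illustrative value)

# Components of a constant field oriented "as the x axis", in the natural
# polar basis (equations 1.330-1.331).
def V_r(r, phi):   return k * math.cos(phi)
def V_phi(r, phi): return -k / r * math.sin(phi)

def div_naive(r, phi, dr, dphi):
    """Simple-minded discretisation of equation 1.325."""
    radial = ((r + dr) * V_r(r + dr, phi)
              - (r - dr) * V_r(r - dr, phi)) / (2.0 * dr) / r
    angular = (V_phi(r, phi + dphi) - V_phi(r, phi - dphi)) / (2.0 * dphi)
    return radial + angular

r, phi, dr, dphi = 1.5, 0.7, 1e-3, 1e-2

# The true divergence of a constant field is zero; equation 1.332 predicts
# the spurious value returned by the naive scheme:
predicted = k / r * math.cos(phi) * (1.0 - math.sin(dphi) / dphi)

print(div_naive(r, phi, dr, dphi))   # ~ 1.7e-5, matching `predicted`
print(predicted)                     # an error of order (dphi)**2
```

The discrepancy vanishes only as δϕ → 0, which is the point of the example: the naive discretisation is consistent but, even for a constant field, not exact.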
1.8.12.2.2 Computation through parallel transport. An exact evaluation of approximation 1.327 gives

    (div V)(r, ϕ) = 0 ,    (1.333)

as it should.

1.8.13 Appendix: Dual Tensors in 2D, 3D and 4D

1.8.13.1 Dual tensors in 2D

In 2D, we may need to take the following duals of contravariant (antisymmetric) tensors:
    ∗B_{ij} = (1/0!) ε_{ij} B ;   ∗B_i = (1/1!) ε_{ij} B^j ;   ∗B = (1/2!) ε_{ij} B^{ij} ,    (1.334–1.336)

and the same formulas hold for the density and capacity variants of the tensors, with the corresponding variant of the Levi-Civita symbol. We may also need to take duals of covariant tensors:

    ∗B^{ij} = (1/0!) ε^{ij} B ;   ∗B^i = (1/1!) ε^{ij} B_j ;   ∗B = (1/2!) ε^{ij} B_{ij} .    (1.337–1.339)

As in a space with an even number of dimensions the dual of the dual of a tensor of rank p equals (−1)^p times the original tensor (see text), we have, in 2D, that for a tensor with 0 or 2 indices ∗(∗B) = B , while for a tensor with 1 index ∗(∗B) = −B .

1.8.13.2 Dual tensors in 3D

In 3D, we may need to take the following duals of contravariant (totally antisymmetric) tensors:
    ∗B_{ijk} = (1/0!) ε_{ijk} B ;   ∗B_{ij} = (1/1!) ε_{ijk} B^k ;   ∗B_i = (1/2!) ε_{ijk} B^{jk} ;   ∗B = (1/3!) ε_{ijk} B^{ijk} ,    (1.340–1.343)

and similarly for the density and capacity variants. We may also need to take duals of covariant tensors:

    ∗B^{ijk} = (1/0!) ε^{ijk} B ;   ∗B^{ij} = (1/1!) ε^{ijk} B_k ;   ∗B^i = (1/2!) ε^{ijk} B_{jk} ;   ∗B = (1/3!) ε^{ijk} B_{ijk} .    (1.344–1.347)

As in a space with an odd number of dimensions the dual of the dual of a tensor always equals the original tensor (see text), we have, in 3D, that for all the tensors above, ∗(∗B) = B .
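The 3D statement ∗(∗B) = B can be verified numerically. The following sketch (Python with NumPy; an illustrative addition, not part of the original text) builds the Levi-Civita symbol and chains equations 1.341 and 1.342 for a random vector, in a Euclidean space with Cartesian coordinates so that index position is immaterial:

```python
import numpy as np

# Levi-Civita symbol in 3D: +1 for even permutations of (0,1,2), -1 for odd.
eps = np.zeros((3, 3, 3))
for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
    eps[i, j, k] = 1.0
for i, j, k in [(0, 2, 1), (2, 1, 0), (1, 0, 2)]:
    eps[i, j, k] = -1.0

rng = np.random.default_rng(0)
B = rng.normal(size=3)                 # a vector B^i

# (*B)_{jk} = (1/1!) eps_{jkl} B^l     (equation 1.341)
starB = np.einsum('jkl,l->jk', eps, B)

# (**B)^i = (1/2!) eps^{ijk} (*B)_{jk} (equation 1.342)
starstarB = 0.5 * np.einsum('ijk,jk->i', eps, starB)

print(np.allclose(starstarB, B))       # True: *(*B) = B in 3D
```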
1.8.13.3 Dual tensors in 4D

In 4D, we may need to take the following duals of contravariant (totally antisymmetric) tensors:
    ∗B_{ijkl} = (1/0!) ε_{ijkl} B ;   ∗B_{ijk} = (1/1!) ε_{ijkl} B^l ;   ∗B_{ij} = (1/2!) ε_{ijkl} B^{kl} ;   ∗B_i = (1/3!) ε_{ijkl} B^{jkl} ;   ∗B = (1/4!) ε_{ijkl} B^{ijkl} ,    (1.348–1.352)

and similarly for the density and capacity variants. We may also need to take duals of covariant tensors:

    ∗B^{ijkl} = (1/0!) ε^{ijkl} B ;   ∗B^{ijk} = (1/1!) ε^{ijkl} B_l ;   ∗B^{ij} = (1/2!) ε^{ijkl} B_{kl} ;   ∗B^i = (1/3!) ε^{ijkl} B_{jkl} ;   ∗B = (1/4!) ε^{ijkl} B_{ijkl} .    (1.353–1.357)

As in a space with an even number of dimensions the dual of the dual of a tensor of rank p equals (−1)^p times the original tensor (see text), we have, in 4D, that for a tensor with 0 , 2 or 4 indices ∗(∗B) = B , while for a tensor with 1 or 3 indices ∗(∗B) = −B .
1.8.14 Appendix: Integration in 3D

In a three-dimensional space (n = 3), we may have p respectively equal to 2 , 1 and 0 . This gives the three theorems

    ∫_{3D} d³σ^{ijk} (∇ ∧ T)_{ijk} = ∫_{2D} d²σ^{ij} T_{ij}    (1.358)

    ∫_{2D} d²σ^{ij} (∇ ∧ T)_{ij} = ∫_{1D} d¹σ^i T_i    (1.359)

    ∫_{1D} d¹σ^i (∇ ∧ T)_i = ∫_{0D} d⁰σ T .    (1.360)

Explicitly, using the results of sections 1.6.3 and 1.6.4, this gives

    ∫_{3D} d³σ^{ijk} (1/3)(∇_i T_{jk} + ∇_j T_{ki} + ∇_k T_{ij}) = ∫_{2D} d²σ^{ij} T_{ij}    (1.361)

    ∫_{2D} d²σ^{ij} (1/2)(∇_i T_j − ∇_j T_i) = ∫_{1D} d¹σ^i T_i    (1.362)

    ∫_{1D} d¹σ^i ∇_i T = ∫_{0D} d⁰σ T ,    (1.363)

or, if we use the antisymmetry of the tensors,

    ∫_{3D} d³σ^{ijk} ∇_i T_{jk} = ∫_{2D} d²σ^{ij} T_{ij}    (1.364)

    ∫_{2D} d²σ^{ij} ∇_i T_j = ∫_{1D} d¹σ^i T_i    (1.365)

    ∫_{1D} d¹σ^i ∂_i T = ∫_{0D} d⁰σ T .    (1.366)

We can now introduce the capacity elements instead of the differential elements:

    (1/0!) ∫_{3D} d³Σ (1/2!) ε^{ijk} ∇_i T_{jk} = (1/1!) ∫_{2D} d²Σ_i (1/2!) ε^{ijk} T_{jk}    (1.367)

    (1/1!) ∫_{2D} d²Σ_i (1/1!) ε^{ijk} ∇_j T_k = (1/2!) ∫_{1D} d¹Σ_{ij} (1/1!) ε^{ijk} T_k    (1.368)

    (1/2!) ∫_{1D} d¹Σ_{ij} (1/0!) ε^{ijk} ∂_k T = (1/3!) ∫_{0D} d⁰Σ_{ijk} (1/0!) ε^{ijk} T .    (1.369)

Introducing explicit expressions for the capacity elements gives

    ∫_{3D} (ε_{ijk} dr₁^i dr₂^j dr₃^k) ∇_l t^l = ∫_{2D} (ε_{ijk} dr₁^j dr₂^k) t^i    (1.370)

    ∫_{2D} (ε_{ilm} dr₁^l dr₂^m) (ε^{ijk} ∇_j T_k) = ∫_{1D} dr₁^i T_i    (1.371)

    ∫_{1D} dr₁^i ∂_i T = ∫_{0D} T ,    (1.372)

where, in equation 1.370, t stands for the vector dual to the tensor T_{ij} , i.e., t^i = (1/2!) ε^{ijk} T_{jk} .

Equations 1.367 and 1.370 correspond to the divergence theorem of Gauss–Ostrogradsky, equations 1.368 and 1.371 correspond to the rotational theorem of Stokes (stricto sensu), and
b dri ∂i T = T (b) − T (a) a corresponds the fundamental theorem of integral calculus. (1.373) Chapter 2
Elements of Probability

As probability theory is essential to the formulation of the rules of physical inference —to be analyzed in subsequent chapters— we have to start with an introduction of the concept of probability. This chapter is, however, more than a simple review. I assume that the spaces we shall work with have a natural definition of distance between points and, therefore, a definition of volume. This allows the introduction of the notion of ‘volumetric probability’, as opposed to the more conventional ‘probability density’. The notion of conditional volumetric probability is carefully introduced (I disagree with the usual definitions of conditional probability density), and finally, the whole concept of conditional probability is generalized into a broader notion: the product of probability distributions.

2.1 Volume

2.1.1 Notion of Volume

The axiomatic introduction of a ‘volume’ over an n-dimensional manifold is very similar to the
introduction of a ‘probability’, and both can be reduced to the axiomatic introduction of a
‘measure’. For pedagogical reasons, I choose to separate the two notions, presenting the notion
of volume as more fundamental than that of a probability, as the deﬁnition of a probability
shall require the previous deﬁnition of the volume.
Of course, given an n-dimensional manifold X , one may wish to associate to it different ‘measures’ of the volume of any region of it. But, in this text, we shall rather assume that, within a given context, there is one ‘natural’ definition of volume.
So it is assumed that to any region A ⊂ X there is associated a real or imaginary¹ quantity
V (A) , called the volume of A , that satisﬁes
Postulate 2.1 for any region A of the space, V (A) ≥ 0 ;
Postulate 2.2 if A1 and A2 are two disjoint regions of the space, then V (A1 ∪ A2 ) =
V (A1 ) + V (A2 ) .
We shall say that a volume distribution (or, for short, a ‘volume’) has been defined over X .
The volume of the whole space X may be positive real, positive imaginary, zero, or infinite.

2.1.2 Volume Element

Consider a region A of an n-dimensional manifold X , and an approximate subdivision of it into regions with individual volume ΔV_i (see illustration 2.1). Successively refining the subdivision allows us to relate the volume of the whole region to the volumes of the individual regions,

    V(A) = lim_{ΔV_i → 0} Σ_i ΔV_i ,    (2.1)

an expression that we may take as an elementary definition of the integral
    V(A) = ∫_{P∈A} dV(P) .    (2.2)

When some coordinates x = {x¹, ..., xⁿ} are chosen over X , we may rewrite this equation
as
    V(A) = ∫_{x∈A} dv(x) .    (2.3)

While dV(P) stands for a function depending on the abstract notion of a ‘point’, dv(x) stands for an ordinary function depending on some coordinates. Apart from this subtle difference, the two objects coincide: if by x(P) we designate the coordinates of the point P , then
    dV(P) = dv( x(P) ) .    (2.4)

¹ Some spaces having a ‘hyperbolic metric’, like the Minkowskian spacetime of special relativity, have an imaginary volume. By convention, this volume is taken as imaginary positive.

Figure 2.1: The volume of an arbitrarily shaped, smooth, region of a space X can be defined as the limit of a sum, using elementary
regions whose individual volume is known (for instance, triangles in
this 2D illustration). This way of deﬁning the volume of a region does
not require the definition of a coordinate system over the space.

2.1.3 Volume Density and Capacity Element

Consider, at a given point P of an n-dimensional manifold, n vectors (of the tangent linear space) {v₁, v₂, ..., v_n} . These vectors may not have the same physical dimensions (for instance, v₁ may represent a displacement, v₂ a velocity, etc.). The exterior product of the n
vectors, denoted v1 ∧ v2 ∧ · · · ∧ vn , is the scalar capacity
    v₁ ∧ v₂ ∧ ··· ∧ v_n = ε_{i₁ i₂ ... i_n} v₁^{i₁} v₂^{i₂} ··· v_n^{i_n} ,    (2.5)

where ε_{ij...} is the Levi-Civita capacity, defined in section 1.4.2. This is, of course, a totally
antisymmetric expression. If some coordinates x = {x1 , x2 , . . . , xn } have been deﬁned over
the manifold, then, at any given point we may consider the n infinitesimal vectors

    dr₁ = (dx¹, 0, ..., 0)ᵀ ;  dr₂ = (0, dx², ..., 0)ᵀ ;  ··· ;  dr_n = (0, 0, ..., dxⁿ)ᵀ ,    (2.6)
corresponding to the respective perturbation of the n coordinates. The exterior product, at
point x , of these n vectors is called the capacity element , and is denoted dv (x) :
dv (x) = dr1 ∧ dr2 ∧ · · · ∧ drn . (2.7) In view of expressions 2.6, and using a notational abuse, the capacity element so deﬁned is
usually written as
dv (x) = dx1 ∧ dx2 ∧ · · · ∧ dxn . (2.8) One of the major theorems of integration theory is that the volume element introduced in
equation 2.3 is related to the capacity element dv through
    dv(x) = ḡ(x) d̲v(x) ,    (2.9)

where ḡ(x) is the volume density in the coordinates x , as defined in equation 1.32:

    ḡ(x) = η √( det g(x) ) .    (2.10)

Here η is the orientation of the coordinate system, as defined in section 1.4.1.
If the system of coordinates in use is positively oriented, the quantities ḡ(x) and d̲v(x) are both positive. Alternatively, if the system of coordinates is negatively oriented, these two quantities are both negative. The volume element dv(x) is always a positive quantity.
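Equation 2.5 defines the exterior product as a signed sum over permutations, which is exactly a determinant. The sketch below (Python with NumPy; an illustrative addition, not part of the original text) compares the Levi-Civita sum with the determinant in 3D:

```python
import numpy as np
from itertools import permutations

def exterior_product(vectors):
    """Equation 2.5: sum over permutations weighted by the parity sign."""
    n = len(vectors)
    total = 0.0
    for perm in permutations(range(n)):
        sign, p = 1, list(perm)
        for i in range(n):            # parity via cycle sorting
            while p[i] != i:
                j = p[i]
                p[i], p[j] = p[j], p[i]
                sign = -sign
        total += sign * np.prod([vectors[a][perm[a]] for a in range(n)])
    return total

rng = np.random.default_rng(1)
v = [rng.normal(size=3) for _ in range(3)]

print(exterior_product(v))            # equals ...
print(np.linalg.det(np.array(v)))     # ... the determinant of (v1, v2, v3)
```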
the tensorial sense of section 1.2.2, while the underbar in dv is to remember that the ‘capacity
element’ is a capacity in the tensorial sense of the term. In equation 2.9, the product of a
density times a capacity gives the volume element dv , that is an invariant scalar. In view of
this equation, we can call g (x) the volume density in the coordinates x = {x1 , . . . , xn } . It
is important to realize that g (x) does not represent any intrinsic property of the space, but,
rather, a propery of the coordinates being used.
Example 2.1 In the Euclidean 3D space, using geographical coordinates2 x = {r, ϕ, λ} , it is
well known that the volume element is
    dv(r, ϕ, λ) = r² cos λ dr ∧ dϕ ∧ dλ ,    (2.11)

so the volume density in the geographical coordinates is

    ḡ(r, ϕ, λ) = r² cos λ .    (2.12)

The metric in geographical coordinates is

    ds² = dr² + r² cos²λ dϕ² + r² dλ² ,    (2.13)

so

    √( det g ) = r² cos λ .    (2.14)

Comparing this equation with equation 2.12 shows that one has

    ḡ = √( det g ) ,    (2.15)

as it should. [End of example.]

Figure 2.2: The geographical coordinates
generalize better to ndimensional spaces
than the usual spherical coordinates. Note
that the order of the angles, {ϕ, λ} , has to
be the reverse of that of the angles {θ, ϕ} ,
so as to deﬁne in both cases local referentials
dr ∧ dθ ∧ dϕ and dr ∧ dϕ ∧ dλ that have the
same orientation as dx ∧ dy ∧ dz .

[Figure (two panels): geographical coordinates {r, ϕ, λ} and spherical coordinates {r, θ, ϕ}, related to the Cartesian coordinates by x = r cos λ cos ϕ = r sin θ cos ϕ ; y = r cos λ sin ϕ = r sin θ sin ϕ ; z = r sin λ = r cos θ.]

² The usual spherical coordinates are {r, θ, ϕ} , and the domain of variation of θ is 0 ≤ θ ≤ π . These 3D coordinates do not generalize properly into ‘spherical’ coordinates in spaces of dimension larger than three. To these spherical coordinates one should prefer the ‘geographical coordinates’ {r, ϕ, λ} , where the domain of variation of λ is −π/2 ≤ λ ≤ +π/2 . These are not ‘geographical coordinates’ in the normal sense used by geodesists, as r is here a radius (not the ‘height’ above some reference). See figure 2.2 for more details.
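Example 2.1 lends itself to a numerical check. The sketch below (Python with NumPy; the code and its names are illustrative additions, not part of the original text) pulls the Euclidean metric back through the geographical coordinates by finite differences and compares √(det g) with r² cos λ (equations 2.12–2.15):

```python
import numpy as np

def cartesian(r, phi, lam):
    """Geographical coordinates -> Cartesian coordinates (figure 2.2)."""
    return np.array([r * np.cos(lam) * np.cos(phi),
                     r * np.cos(lam) * np.sin(phi),
                     r * np.sin(lam)])

def metric(r, phi, lam, h=1e-6):
    """Pulled-back Euclidean metric g_ab = sum_i (dx^i/du^a)(dx^i/du^b)."""
    u = np.array([r, phi, lam])
    J = np.empty((3, 3))
    for a in range(3):
        du = np.zeros(3); du[a] = h
        J[:, a] = (cartesian(*(u + du)) - cartesian(*(u - du))) / (2.0 * h)
    return J.T @ J

r, phi, lam = 2.0, 0.3, 0.5
g = metric(r, phi, lam)

print(np.sqrt(np.linalg.det(g)))   # volume density, equation 2.15
print(r**2 * np.cos(lam))          # equation 2.12: the two values agree
```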
Example 2.2  In the 4D spacetime of special relativity, with the Minkowskian coordinates {τ₀, τ₁, τ₂, τ₃} = {t, x/c, y/c, z/c} , the distance element ds satisfies

    ds² = dτ₀² − dτ₁² − dτ₂² − dτ₃² .    (2.16)

Then, the metric g is diagonal, with the elements {+1, −1, −1, −1} on the diagonal, and

    ḡ = √( det g ) = √(−1) = i .    (2.17)

[End of example.]
Replacing 2.9 into equation 2.3 gives

    V(A) = ∫_{x∈A} d̲v(x) ḡ(x) .    (2.18)

Using expressions 2.8 and 1.32 we can write this in the more explicit (but not manifestly covariant) form

    V(A) = η ∫_{x∈A} dx¹ ∧ ··· ∧ dxⁿ √( det g(x) ) .    (2.19)

These two (equivalent) expressions allow the usual interpretation of an integral as a limit
involving the domains deﬁned by constant increments of the coordinate values (see ﬁgure 2.3).
Although such an expression is useful for analytic developments it is usually not well adapted
to numerical evaluations (unless the coordinates are very specially chosen).
Figure 2.3: For the same shape of ﬁgure 2.1, the volume can be evaluated
using, for instance, a polar coordinate system. In a numerical integration,
regions near the origin may be oversampled, while regions far from the
origin may be undersampled. In some situations, this problem may become
crucial, so this sort of ‘coordinate integration’ is to be reserved to analytical
developments only.

2.1.4 Change of Variables

2.1.4.1 Volume Element and Change of Variables

Consider an n-dimensional metric manifold with some coordinates x . The defining property
of the volume element, say dvx (x) , was (equation 2.3)
    V(A) = ∫_{x∈A} dv_x(x) .    (2.20)

Under a change of variables x → y , this expression shall become

    V(A) = ∫_{y∈A} dv_y(y) .    (2.21)

These two equations just correspond to a different labeling, respectively using the coordinates
x and the coordinates y , of the fundamental equation 2.2 deﬁning the volume element dV ,
so they are completely equivalent. In other words, the volume element is an invariant scalar,
and one may write
dvy = dvx , (2.22) or, more explicitly,
    dv_y(y) = dv_x( x(y) ) .    (2.23)

2.1.4.2 Volume Density, Capacity Element, and Change of Variables

In a change of variables x → y , the two capacity elements d̲v_x(x) and d̲v_y(y) are related via

    d̲v_y(y) = (1 / X̄(y)) d̲v_x( x(y) ) ,    (2.24)

where X̄(y) is the Jacobian determinant det{∂xⁱ/∂yʲ} , as they are tensorial capacities in the sense of section 1.2.2. Also, because a ‘volume density’ is a tensorial density, we have

    ḡ_y(y) = X̄(y) ḡ_x( x(y) ) .    (2.25)

Equation 2.18, which can be written, in the coordinates x ,
    V(A) = ∫_{x∈A} d̲v_x(x) ḡ_x(x) ,    (2.26)

ḡ_x(x) being the determinant of the metric matrix in the coordinates x , becomes

    V(A) = ∫_{y∈A} d̲v_y(y) ḡ_y(y) ,    (2.27)

ḡ_y(y) being the determinant of the metric matrix in the coordinates y . Of course, the two
capacity elements can be expressed as (equation 2.8)
dv x (x) = dx1 ∧ dx2 ∧ · · · ∧ dxn (2.28) and
dv y (y) = dy 1 ∧ dy 2 ∧ · · · ∧ dy n . (2.29) If the two coordinate systems {x1 , . . . , xn } and {y 1 , . . . , y n } have the same orientation,
the two capacity elements dv x (x) and dv y (y) have the same sign. Otherwise, they have
opposite signs.

2.1.5 Conditional Volume

Consider an n-dimensional manifold X_n , with some coordinates x = {x¹, ..., xⁿ} , and a
metric tensor g(x) = {g_ij(x)} . Consider also a p-dimensional submanifold X_p of the n-dimensional manifold X_n (with p ≤ n ). The n-dimensional volume over X_n , as characterized by the metric determinant √(det g) , induces a p-dimensional volume over the submanifold X_p . Let us try to characterize it.
The simplest way to represent a p-dimensional submanifold X_p of the n-dimensional manifold X_n is by separating the n coordinates x = {x¹, ..., xⁿ} of X_n into one group of p coordinates r = {r¹, ..., rᵖ} and one group of q coordinates s = {s¹, ..., s^q} , with

    p + q = n .    (2.30)

Using the notations

    x = {x¹, ..., xⁿ} = {r¹, ..., rᵖ, s¹, ..., s^q} = {r, s} ,    (2.31)

the set of q relations

    s¹ = s¹(r¹, r², ..., rᵖ)
    s² = s²(r¹, r², ..., rᵖ)
    ...
    s^q = s^q(r¹, r², ..., rᵖ) ,    (2.32)

that, for short, may be written

    s = s(r) ,    (2.33)

define a p-dimensional submanifold X_p in the (p + q)-dimensional space X_n . For later use, we can now introduce the matrix of partial derivatives

    S = { S^i_α } ,  with  S^i_α = ∂s^i/∂r^α ,  i ∈ {1, ..., q} ,  α ∈ {1, ..., p} .    (2.34)

We can write S(r) for this matrix, as it is defined at a point {x} = {r, s(r)} . Note also that the metric over X can always be partitioned as

    g(x) = g(r, s) = ( g_rr(r, s)   g_rs(r, s)
                       g_sr(r, s)   g_ss(r, s) ) ,    (2.35)

with g_rs = (g_sr)ᵀ .
In what follows, let us use the Greek indexes for the variables {r1 , . . . , rp } , like in rα ; α ∈
{1, . . . , p} , and Latin indexes for the variables {s1 , . . . , sq } , like in si ; i ∈ {1, . . . , q } .
Consider an arbitrary point {r, s} of the space X . If the coordinates rα are perturbed to
rα + drα , with the coordinates si kept unperturbed, one deﬁnes a pdimensional subvolume
of the ndimensional manifold X n that can be written3 (middle panel in ﬁgure 2.4)
    dv_p(r, s) = √( det g_rr(r, s) ) dr¹ ∧ ··· ∧ drᵖ .    (2.36)

³ In all generality, we should write dv_p(r, s) = η √( det g_rr(r, s) ) dr¹ ∧ ··· ∧ drᵖ , where η is ±1 depending on the order of the coordinates {r¹, ..., rᵖ} . Let us simplify the equations here by assuming that we have chosen the order of the coordinates so as to have a positively oriented capacity element dr¹ ∧ ··· ∧ drᵖ .
[Figure (three panels): some surface coordinates of a coordinate system over a 3D manifold; an elementary region on a coordinate surface defined by a condition s = constant; an elementary region on the surface defined by a condition s = s(r¹, r²).]

Figure 2.4: On a 3D space (3D manifold), a coordinate system {x¹, x², x³} = {r¹, r², s}
is deﬁned. Some characteristic surface coordinates are represented (left). In the middle, a
surface element (2D volume element) on a coordinate surface s = const. is represented, that
corresponds to the expression in equation 2.36. In the right, a submanifold (surface) is deﬁned
by an equation s = s(r1 , r2 ) . A surface element (2D volume element) is represented on the
submanifold, that corresponds to the expression in equation 2.37.
Alternatively, consider a point (r, s) of X n that, in fact, is on the submanifold X p , i.e.,
a point that has coordinates of the form (r, s(r)) . It is clear that the variables {r1 . . . rp }
deﬁne a coordinate system over the submanifold, as it is enough to precise r to deﬁne a point
in Xp . If the coordinates rα are perturbed to rα + drα , and the coordinates si are also
perturbed to si + dsi in a way that one remains on the submanifold, (i.e., with dsi = S i α drα ),
then, with the metric over X n partitioned as in equation 2.35, the general distance element
ds2 = gij dxi dxj can be written ds2 = (grr )αβ drα drβ + (grs )αj drα dsj + (gsr )iβ dsi drβ +
(gss )ij dsi dsj , and replacing dsi by dsi = S i α drα , we obtain ds2 = Gαβ drα drβ , with
G = grr + grs S + ST gsr + ST gss S . The ds2 just expressed gives the distance between two
any points of X p , i.e., G is the metric matrix of the submanifold associated to the coordinates
√
r . The pdimensional volume element on the manifold is, then, dvr = det G dr1 ∧ · · · ∧ drp ,
i.e.,
dvp (r) = det (grr + grs S + ST gsr + ST gss S) dr1 ∧ · · · ∧ drp where S = S(r) , grr = grr (r, s(r)) , grs
gss (r, s(r)) . Figure 2.4 illustrates this result.
volume density induced over the submanifold
g (x) = η , (2.37) = grs (r, s(r)) , gsr = gsr (r, s(r)) and gss =
The expression 2.37 says that the pdimensional
Xp is det (grr + grs S + ST gsr + ST gss S) . (2.38) Note: say here that in the case the space X n is formed as the caresian product of two
spaces, R p × S q , with the metric over X n induced from the metric gr over R p and the
metric gs over S q by
ds2 = ds2 + ds2
x
r
s , (2.39) Volume 77 then, the expression of the metric 2.35 simpliﬁes into
g(x) = 0
gr (r)
0
gs (s) , (2.40) and equations 2.37–2.38 simplify into
dvp (r) = det (gr + ST gs S) dr1 ∧ · · · ∧ drp (2.41) and
g (x) = η det (gr + ST gs S) . (2.42) 78 2.2
2.2.1 2.2 Probability
Notion of Probability Consider an ndimensional metric manifold, over which a ‘volume distribution’ has been deﬁned
(satisfying the axioms in section 2.1.1), associating to any region (i.e., subset) A of X its
volume
A → V (A) . (2.43) A particular volume distribution having been introduced over X , once for all, diﬀerent ‘probability distributions’ may be considered, that we are about to characterize axiomatically.
We shall say that a probability distribution (or, for short, a probability ) has been deﬁned
over X if to any region A ⊂ X we can associate an adimensional real number,
A → P (A) (2.44) called the probability of A , that satisﬁes
Postulate 2.3 for any region A of the space,
P (A) ≥ 0 ; (2.45) Postulate 2.4 for disjoints regions of the space, the probabilities are additive:
A 1 ∩ A2 = ∅ ⇒ P (A1 ∪ A2 ) = P (A1 ) + P (A2 ) ; (2.46) Postulate 2.5 the probability distribution must be absolutely continuous with respect to the
volume distribution, i.e., the probability P (A) of any region A ⊂ X with vanishing volume
must be zero:
V (A) = 0 ⇒ P (A) = 0 . (2.47) The probability of the whole space X may be zero, it may be ﬁnite, or it may be inﬁnite.
The ﬁrst two axioms are due to Kolmogorov (1933). In common texts, there is usually an
axiom concerning the behaviour of a probability when we consider an inﬁnite collection4 of
sets, A1 , A2 , A3 . . . , but this is a technical issue that I choose to ignore. Our third axiom
here is not usually introduced, as the distinction between the ‘volume distribution’ and a ‘probability distribution’ is generally not made: both are just considered as examples of ‘measure
distributions’. This distinction shall, in fact, play a major role in the theory that follows.
When the probability of the whole space is ﬁnite, a probability distribution can be renormalized, so as to have P (X ) = 1 . We shall then say that we face an ‘absolute probability’. If
a probability distribution is not normalizable, we shall say that we have a ‘relative probability’:
in that case, what usually matters is not the probability P (A) of a region A ∈ X , but the
relative between probability two regions A and B , denoted P ( A ; B ) , and deﬁned as
P(A; B ) = P (A)
P (B ) . (2.48) 4
Presentations of measure theory that pretend to mathematical rigor, assume ‘ﬁnite additivity’ or, alternatively, ‘countable additivity’. See, for instance, the interesting discussion in Jaynes (1995). Probability 2.2.2 79 Volumetric Probability We have just deﬁned a probability distribution over an ndimensional manifold, that is absolutely continuous with respect to the volume distribution over the manifold. Then, by virtue
of the RadonNikodym theorem (e.g., Taylor, 1966), one can deﬁne over X a volumetric
probability f (P) such that the probability of any region A of the space can be obtained as
P (A) = P∈A dV (P) f (P) . (2.49) Note that this equation makes sense even if no particular coordinate system is deﬁned over
the manifold X , as the integral here can be understood in the sense suggested in ﬁgure 2.1.
If a coordinate system x = {x1 , . . . , xn } is deﬁned over X , we may well wish to write
equation 2.49 as
P (A) = x∈A dvx (x) fx (x) , (2.50) where, now, dvx (x) is to be understood as the special expression of the volume element in
the coordinates x . One may be interested in using the volume element dvx (x) directly for
the integration (as suggested in ﬁgure 2.1). Alternatively, one may wish to use the coordinate
lines for the integration (as suggested in ﬁgure 2.3). In this case, one writes (equation 2.9)
dvx (x) = g x (x) dv x (x) , (2.51) to get
P (A) = x∈A dv x (x) g x (x) fx (x) . (2.52) Using dv x (x) = dx1 ∧ · · · ∧ dxn (equation 2.8) and g x (x) = det g(x) (equation 1.32), this
expression can be written in the more explicit (but not manifestly covariant) form
P (A) = η x∈A dx1 ∧ · · · ∧ dxn det g(x) fx (x) , (2.53) where η is +1 is the system of coordinates is positively oriented and 1 if it is negatively
oriented. These two (equivalent) expressions may be useful for analytical developments, but
not for numerical evaluations, where one should choose a direct handling of expression 2.50. 2.2.3 Probability Density In equation 2.52 we can introduce the deﬁnition
f x (x) = g x (x) fx (x) , (2.54) to obtain
P (A) = x∈A dv x (x) f x (x) , (2.55) 80 2.2 where
dv x (x) = dx1 ∧ · · · ∧ dxn . (2.56) The function f x (x) is called the probability density (associated to the probability distribution P ). It is a density, in the tensorial sense of the term, i.e., under a change of variables
x y it change according to the Jacobian rule (see section 2.2.5.2).
Having deﬁned a volumetric probability fx (x) in section 2.2.2, why should one care at all
about the probability density f x (x) ?
One possible advantage of a probability density aver a volumetric probability appears when
comparing equation 2.50 to equation 2.55. To integrate a volumetric probability one must have
deﬁned a volume element over the space, while to integrate a volumetric probability, one only
needs to have deﬁned coordinates, irrespectively of any metric meaning they may have. This,
is, of course, why usual expositions of the theory use probability densities.
In fact, I see this as a handicap. When probability theory is developed without the notion
of volume and of distance, one is forced to include deﬁnitions that do not have the necessary
invariances, the most striking example being the usual deﬁnition of ‘conditional probability
density’. One does not obtain a correct deﬁnition unless a metric in the space is introduced
(see section 2.4). The wellknown ‘Borel paradox’ (see appendix 2.8.10) is the simplest example
of this annoying situation. If I mention at all the notion of probability density is to allow
the reader to make the connection between the formulas to be developed in this book and the
formulas she/he may ﬁnd elsewhere.
As we have chosen in this text to give signs to densities and capacities that are associated
to the orientation of the coordinate system, it is clear from deﬁnition 2.54 that, contrary to a
volumetric probability, a probability density is not necessarily positive: it has the sign of the
capacity element, i.e., a positive sign in positively oriented coordinate systems, and a negative
sign in negatively oriented coordinate systems.
Example 2.3 Consider a homogeneous probability distribution at the surface of a sphere of
radius r . When parameterizing a point by its geographical coordinates (ϕ, λ) , the associated
(2D) volumetric probability is
f (ϕ, λ) = 1
4πr2 . (2.57) The probability of a region A of the surface is computed as
P (A) = dS (ϕ, λ) f (ϕ, λ) , (2.58) {ϕ,λ}∈A where dS (ϕ, λ) = r2 cos λ dϕ dλ , and the total probability equals one. Alternatively, the
probability density associated to the homogeneous probability distribution over the sphere is
f (ϕ, λ) = 1
cos λ
4π . (2.59) The probability of a region A of the surface is computed as
P (A) = dϕ dλ f (ϕ, λ) ,
{ϕ,λ}∈A and the probability of the whole surface also equals one. [End of example.] (2.60) Probability 2.2.4 81 Volumetric Histograms and Density Histograms Note: explain here what is a volumetric histogram and a density histogram. Say that while the
limit of a volumetric histogram is a volumetric probability, the limit of a density histogram is
a probability density.
Introduce the notion of ‘na¨ histogram’.
ıve
Consider a problem where we have two physical properties to analyze. The ﬁrst is the property of electric resistanceconductance of a metallic wire, as it can be characterized, for instance,
by its resistance R or by its conductance 4 C = 1/R . The second is the ‘coldwarm’ property
of the wire, as it can be charcterized by its temperature T or its thermodynamic parameter
β = 1/kT (k being the Boltzmann constant). The ‘parameter space’ is, here, twodimensional.
In the ‘resistanceconductance’ space, the distance between two points, characterized by the
resistances R1 and R2 , or by the conductances C1 and C2 is, as explained in section XXX,
D = log R2
R1 = log C2
C1 . (2.61) Similarly, in the ‘coldwarm’ space, the distance between two points, characterized by the
temperatures T1 and T2 , or by the thermodynamic parameters β1 and β2 is
D = log T2
T1 = log β2
β1 . (2.62) R = 100 Ω R = 80 Ω R = 60 Ω R = 40 Ω R = 20 Ω R=0Ω R = 100 Ω R = 80 Ω R = 60 Ω R = 40 Ω R = 20 Ω R=0Ω R = 100 Ω R = 50 Ω R = 30 Ω R = 20 Ω R = 10 Ω An homogeneous probability distribution can be deﬁned as . . .
Bla, bla, bla . . .
In ﬁgure 2.5, the two histograms that can be made from the two ﬁrst diagrams give the
volumetric probability. The na¨ histrogram that could be made form the diagram at the right
ıve
would give a probability density. T = 100 K T* = 1.8 T* = 2.0 T = 100 K T* = 2.0 T = 100 K T* = 1.9 T* = 2.0 T = 80 K T* = 1.9 T = 80 K T* = 1.8 T = 60 K T = 50 K
T* = 1.8 T* = 1.6
T = 30 K
T = 20 K T = 60 K T* = 1.7 T* = 1.7
T* = 1.5 T = 40 K T = 40 K T* = 1.4 T* = 1.5
T = 20 K T = 20 K T* = 1.2 T* = 1.0 T* = 1.0
T = 10 K R* = 2.0 R* = 1.9 R* = 1.8 R* = 1.7 T0= 1 K R* = 1.5 R0= 1 Ω R* = 1.0 R* = 2.0 R* = 1.9 R* = 1.8 R* = 1.7 T* = log10 T/T0 T=0K T=0K
R* = 1.5 R* = log10 R/R 0 R* = 1.0 R* = 2.0 R* = 1.8 R* = 1.6 R* = 1.4 R* = 1.2 R* = 1.0 T* = 1.0 Figure 2.5: Note: explain here how to make a volumetric histogram. Explain that when the
electric resistance or thetemperature span ordrs of magnituce, the disgram at the right become
totally impractical. 82 2.2 2.2.5 Change of Variables 2.2.5.1 Volumetric Probability and Change of Variables In a change of coordinates x → y(x) , the expression 2.50
P (A) = x∈A dvx (x) fx (x) (2.63) dvy (y) fy (y) (2.64) becomes
P (A) = y∈A where dvy (y) and fy (y) are respectively the expressions of the volume element and of the
volumetric probability in the coordinates y . These are actual invariants (in the tensorial
sense), so, when comparing this equation (written in the coordinates y ) to equation 2.50
(written in the coordinates x ), one simply has, at every point,
fy = fx
dvy = dvx , (2.65) or, to be more explicit,
fy (y) = fx ( x(y) ) ; dvy (y) = dvx ( x(y) ) . (2.66) That under a change of variables x y one has fy = fx for volumetric probabilities, is an
important property. It contrasts with the property found in usual texts (where the Jacobian of
the transformation appears): remember that we are considering here volumetric probabilities,
not the usual probability densities.
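The invariance fy = fx is easy to see numerically. The following one-dimensional sketch is entirely my own toy setup (a half-line with the Euclidean metric in the coordinate x , an exponential volumetric probability, and the change of variables y = log x); it checks that the probability of an interval is the same in both coordinate systems:

```python
import math

# Toy example (not from the text): points on the half-line x > 0, Euclidean
# metric in x (length element dx), and the change of variables y = log x.
# In the y coordinate the length element is dx = e^y dy, so the volume
# element is dv_y = e^y dy, while the volumetric probability is invariant.

def f_x(x):                      # volumetric probability in the x coordinate
    return math.exp(-x)          # normalized: integral of e^-x dx over x>0 is 1

def f_y(y):                      # invariance f_y(y) = f_x(x(y)), equation 2.66
    return f_x(math.exp(y))

def dv_y(y):                     # volume element in the y coordinate
    return math.exp(y)

def prob_x(a, b, n=100000):      # P(a < x < b) integrated in the x coordinate
    h = (b - a) / n
    return sum(f_x(a + (i + 0.5) * h) for i in range(n)) * h

def prob_y(a, b, n=100000):      # the same probability, integrated in y
    ya, yb = math.log(a), math.log(b)
    h = (yb - ya) / n
    total = 0.0
    for i in range(n):
        y = ya + (i + 0.5) * h
        total += dv_y(y) * f_y(y)
    return total * h

p1, p2 = prob_x(0.5, 3.0), prob_y(0.5, 3.0)
print(p1, p2)    # the two coordinate systems give the same probability
```

Had we instead carried a probability density, the Jacobian factor dx/dy = e^y would have to multiply it, as in the Jacobian rule of the next subsection.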
A volumetric probability can also be integrated using the expression 2.53
P(A) = ηx ∫_{x∈A} dx1 ∧ · · · ∧ dxn √(det gx(x)) fx(x) ,   (2.67)

that, under the change of variables, becomes

P(A) = ηy ∫_{y∈A} dy1 ∧ · · · ∧ dyn √(det gy(y)) fy(y) .   (2.68)

These equations each contain a capacity element and a volume density, which change, under
the change of variables, following the rules given in section 1.2.2, but we do not need to be
concerned with this here, as the meaning of dy 1 ∧ · · · ∧ dy n is clear, and one usually obtains
ηy √(det gy(y)) by an explicit computation of the determinant in the coordinates y , rather than by multiplying the volume density ηx √(det gx(x(y))) by the Jacobian determinant X(y) (see section 2.2.5.2).
[Note: Important: I have to erect as a basic principle to use, in a change of variables, the
representation exemplified by figures 9.5, 9.8 and 9.9.]

2.2.5.2 Probability Density and Change of Variables

A probability density, being defined as the product of an invariant times a density (equation 2.54), is a density in the tensorial sense of the term. Under a change of variables x → y , expression 2.55

P(A) = ∫_{x∈A} dv̄x(x) f̄x(x) ,   (2.69)

where dv̄x(x) = dx1 ∧ · · · ∧ dxn , becomes

P(A) = ∫_{y∈A} dv̄y(y) f̄y(y) ,   (2.70)
P (A) = y∈A where dv y (y) = dy 1 ∧ · · · ∧ dy n . The two capacity elements dv x (x) and dv y (y) are related
through the relation 2.24, and, more importantly, the two probability densities are related as
tensorial densities should (see section 1.2.2),
f̄y(y) = X(y) f̄x( x(y) ) .   (2.71)

This is called the Jacobian rule for the change of a probability density under a change of
‘variables’ (i.e., under a change of coordinates over the considered manifold). Note that the X
appearing in this equation is the determinant of the matrix {X i j } = {∂xi /∂y j } , not that of
the matrix {Y i j } = {∂y i /∂xj } .
Many authors take the absolute value of the Jacobian in this equation, which is not quite
correct: it is the actual Jacobian that appears. The absolute value of the Jacobian is taken by
these authors to force probability densities to always be positive, but this denies to probability
densities the right to be densities, in the full tensorial sense of the term (see section 1.2.2).
In this text, I try to avoid the use of probability densities, and only mention them in the
appendixes.

2.3 Sum and Product of Probabilities

Let X be an n-dimensional metric manifold, with a volume distribution V , and let P and
Q be two normalized probability distributions over X . In what follows we shall deduce, from
P and Q , two new probability distributions over X , their sum, denoted P ∪ Q and their
product, denoted P ∩ Q .

2.3.1 Sum of Probabilities

P and Q being two probability distributions over X , their sum (or union), denoted P ∪ Q ,
is deﬁned by the conditions
Postulate 2.6 for any A ⊂ X ,

(P ∪ Q)(A) = (Q ∪ P)(A) ;   (2.72)

Postulate 2.7 for any A ⊂ X ,

P(A) = 0 and Q(A) = 0 =⇒ (P ∪ Q)(A) = 0 ;   (2.73)

Postulate 2.8 if there is some A ⊂ X for which P(A) = 0 , then, necessarily, for any probability Q ,

(P ∪ Q)(A) = Q(A) .   (2.74)

Note: I have to explain here that these postulates do not characterize uniquely the sum
operation. The solution I choose is the following one.
Property 2.1 If the probability distribution P is characterized by the volumetric probability
f (P) , and the probability distribution Q is characterized by the volumetric probability g (P) ,
then, the probability distribution P ∪ Q is characterized by the volumetric probability, denoted
(f + g)(P) , given by

(f + g)(P) = ( α f(P) + β g(P) ) / ( α + β ) ,   (2.75)

where α and β are two arbitrary constants.
Note: An alternative solution would be what is used in fuzzy set theory to deﬁne the union
of fuzzy sets. Translated to the language of volumetric probabilities, and slightly generalized,
this would correspond to
(f + g)(P) = k max( α f(P) , β g(P) ) ,   (2.76)

where α and β are two arbitrary constants, and k a normalizing one.
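The two candidate definitions can be compared on a computer. In the sketch below, the grid, the two Gaussian shapes and the weights α, β are assumptions of mine, purely for illustration; equation 2.75 and the fuzzy-union rule 2.76 are both evaluated on a discretized line with a flat metric:

```python
import math

# Discretized sketch (assumed setting, not from the text): two volumetric
# probabilities f and g on a grid, combined by the weighted-average sum of
# equation 2.75 and, alternatively, by the fuzzy-union rule of equation 2.76.

xs = [i * 0.01 for i in range(-500, 501)]      # grid on [-5, 5], flat metric
dx = 0.01

def gauss(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

f = [gauss(x, -1.0, 0.5) for x in xs]
g = [gauss(x, +2.0, 0.8) for x in xs]

alpha, beta = 1.0, 3.0                          # arbitrary weights

# equation 2.75: (f+g) = (alpha f + beta g) / (alpha + beta)
f_sum = [(alpha * fi + beta * gi) / (alpha + beta) for fi, gi in zip(f, g)]

# equation 2.76: (f+g) = k max(alpha f, beta g), k fixed by normalization
m = [max(alpha * fi, beta * gi) for fi, gi in zip(f, g)]
k = 1.0 / (sum(m) * dx)
f_max = [k * mi for mi in m]

t1 = sum(f_sum) * dx
t2 = sum(f_max) * dx
print(t1, t2)    # both combined volumetric probabilities integrate to 1
```

Note that the weighted-average sum is automatically normalized when f and g are, while the max rule needs the explicit constant k.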
Let me try to give an interpretation of this sum of probabilities. If an experimenter faces
realizations of a random process and wants to investigate the probability distribution governing
the process, she/he may start making histograms of the realizations. As an example, for
realizations of a probability distribution over a continuous space, the experimenter will obtain histograms that, in some sense, will approach the volumetric probability corresponding to the
probability distribution.
A histogram is typically made by dividing the working space into cells, by counting how
many realizations fall inside each cell and by dividing the count by the cell volume. A more
subtle approach is possible. First, we have to understand that, in the physical sciences, when
“a random point materializes in an abstract space” we have to measure its coordinates. As any
physical measure of a real quantity will have attached uncertainties, mathematically speaking,
the measurement will not produce a ‘point’, but a state of information over the space, i.e., a
volumetric probability. If we have measured the coordinates of many points, the results of each
measurement will be described by a volumetric probability fi (x) . The ‘sum’ of all these, i.e.,
the volumetric probability

(f1 + f2 + · · ·)(x) = Σi fi(x)   (2.77)

is a finer estimation of the background volumetric probability than an ordinary histogram, as actual measurement uncertainties are used, irrespective of any division of the space into cells.

2.3.2 Product of Probabilities

P and Q being two probability distributions over X , their product (or intersection), denoted
P ∩ Q is deﬁned by the conditions
Postulate 2.9 for any A ⊂ X ,
(P ∩ Q)(A) = (Q ∩ P)(A) ;   (2.78)

Postulate 2.10 for any A ⊂ X ,

P(A) = 0 or Q(A) = 0 =⇒ (P ∩ Q)(A) = 0 .   (2.79)

Postulate 2.11 if for whatever B ⊂ X one has P(B) = k V(B) , then, necessarily, for any A ⊂ X and for any probability Q ,

(P ∩ Q)(A) = (Q ∩ P)(A) = Q(A) .   (2.80)

(The homogeneous probability distribution is the neutral element of the product operation.)
Note: I have to explain here that these postulates do not characterize uniquely the product
operation. The solution I choose is the following one.
Property 2.2 If the probability distribution P is characterized by the volumetric probability
f (P) , and the probability distribution Q is characterized by the volumetric probability g (P) ,
then, the probability distribution P ∩ Q is characterized by the volumetric probability, denoted (f · g)(P) , given by

(f · g)(P) = f(P) g(P) / ∫_{P∈X} dV(P) f(P) g(P) .   (2.81)
More generally, the ‘product’ of the volumetric probabilities f1 (P) , f2 (P) . . . is
(f1 · f2 · f3 · · ·)(P) = f1(P) f2(P) f3(P) · · · / ∫_{P∈X} dV(P) f1(P) f2(P) f3(P) · · · .   (2.82)

Note: An alternative solution would be what is used in fuzzy set theory to define the
intersection of fuzzy sets. Translated to the language of volumetric probabilities, and slightly
generalized, this would correspond to
(f · g)(P) = k min( α f(P) , β g(P) ) ,   (2.83)

where α and β are two arbitrary constants, and k a normalizing one.
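Equation 2.81 is easy to exercise numerically. In the following sketch the flat metric, the Gaussian shapes and all numerical values are my own choices; the product of two Gaussian volumetric probabilities on a line is computed by direct discretization, and the well-known fact that the inverse variances add provides a check:

```python
import math

# Numerical sketch of the product rule 2.81 on a discretized line (flat
# metric, hypothetical values): the product of two Gaussian volumetric
# probabilities is again Gaussian, with precision-weighted mean.

xs = [i * 0.001 for i in range(-8000, 8001)]   # grid on [-8, 8]
dx = 0.001

def gauss(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2)

f = [gauss(x, 1.0, 0.6) for x in xs]           # first state of information
g = [gauss(x, 2.0, 0.8) for x in xs]           # second, independent one

prod = [fi * gi for fi, gi in zip(f, g)]
norm = sum(prod) * dx                          # denominator of equation 2.81
fg = [p / norm for p in prod]

mean = sum(x * v for x, v in zip(xs, fg)) * dx
# closed form: precisions add, means combine precision-weighted
w1, w2 = 1 / 0.6**2, 1 / 0.8**2
print(mean, (w1 * 1.0 + w2 * 2.0) / (w1 + w2))
```

The same loop with `min(alpha * fi, beta * gi)` in place of the product would implement the fuzzy-intersection alternative of equation 2.83.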
It is easy to write some extra conditions that distinguish the solution to the axioms given by equations 2.75 and 2.81 from that given by equations 2.76 and 2.83. For instance, as volumetric probabilities are normed using a multiplicative constant (this is not the case with the grades of membership in fuzzy set theory), it makes sense to impose the simplest possible algebra for the multiplication of volumetric probabilities f(P), g(P) . . . by constants λ, µ . . . :

[(λ + µ)f](P) = (λf + µf)(P)   ;   [λ(f · g)](P) = (λf · g)(P) = (f · λg)(P) .   (2.84)

One important property of the two operations 'sum' and 'product' just introduced is that of
invariance with respect to a change of variables. As we consider probability distributions over a
continuous space, and as our deﬁnitions are independent of any choice of coordinates over the
space, we obtain equivalent results in any coordinate system.
[Note: Say somewhere that the set of 11 postulates 2.1–2.11, deﬁning the volume and a set
of probability distributions furnished with two operations, deﬁne an inference space.]
The interpretation of this product of volumetric probabilities can be obtained by comparing
ﬁgures 2.7 and 2.6. In ﬁgure 2.7, a probability distribution P ( · ) is represented by the
volumetric probability associated to it. To any region A of the plane, it associates the
probability P (A) . If a point has been realized following the probability distribution P ( · )
and we are given the information that, in fact, the point is “somewhere” inside the region B ,
then we can update the prior probability P( · ) , replacing it by the conditional probability P( · | B) = P( · ∩ B)/P(B) . This (classical) definition means that P( · | B) equals P( · ) inside B and is zero outside, as suggested in the center of the figure (the division by P(B) just corresponds to a renormalization). If the probability A → P(A) is represented by a volumetric probability f(P) , the probability A → P(A|B) is represented by the volumetric probability f(P|B) given by

f(P|B) = k f(P) H(P) = f(P) H(P) / ∫_X dV(P) f(P) H(P) ,   (2.85)

where H(P) takes a constant value inside B , and vanishes outside. We see that f(P|B) is proportional to f(P) inside B and is zero outside B .
While the elements entering the deﬁnition of a conditional probability are a probability
distribution P and a subset B ⊂ X , we here consider two probability distributions P
and Q , with volumetric probabilities f (P) and g (P) . It is clear that equation 2.81 is a
generalization of equation 2.85, as the set B is now replaced by a probability distribution Q (see figure 2.6). In the special case where the probability Q is zero everywhere except inside a domain B , where it is homogeneous, we recover the standard notion of conditional probability.

Figure 2.6: Illustration of the definition of the
product of two probability distribution, interpreted here as a generalization of the notion of
conditional probability (see ﬁgure 2.7). While
a conditional probability combines a probability distribution P ( · ) with an ‘event’ B ,
the product operation combines two probability distributions P ( · ) and Q( · ) deﬁned
over the same space. . . (P∩Q)( ) Q( ) P( ) f(x) g(x)
P∩Q (f.g)(x) (f.g)(x) = k f(x) g(x) Example 2.4 Let S represent the surface of the Earth, using geographical coordinates (longitude ϕ and latitude λ ). An estimation of the position of a ﬂoating object at the surface
of the sea by an airplane navigator gives a probability distribution for the position of the object
corresponding to the (2D) volumetric probability f (ϕ, λ) , and an independent, simultaneous
estimation of the position by another airplane navigator gives a probability distribution corresponding to the volumetric probability g (ϕ, λ) . How the two volumetric probabilities f (ϕ, λ)
and g (ϕ, λ) should be ‘combined’ to obtain a ‘resulting’ volumetric probability? The answer is
given by the ‘product’ of the two volumetric probabilities densities:
(f · g )(ϕ, λ) =
[End of example.] f (ϕ, λ) g (ϕ, λ)
dS (ϕ, λ) f (ϕ, λ) g (ϕ, λ)
S . (2.86) 88 2.4 2.4
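Equation 2.86 can be sketched numerically as follows. The grid resolution and the bell-shaped volumetric probabilities below are invented for the illustration; the essential point is that every integral is weighted by the surface element dS = cos λ dϕ dλ :

```python
import math

# Sketch of equation 2.86 (hypothetical numbers): two volumetric
# probabilities for the position of a floating object, combined on a
# longitude-latitude grid of the sphere's surface.

n = 200
dphi = 2 * math.pi / n
dlam = math.pi / n
grid = [(-math.pi + (i + 0.5) * dphi, -math.pi / 2 + (j + 0.5) * dlam)
        for i in range(n) for j in range(n)]

def vp(phi, lam, phi0, lam0, s):
    # a crude bell around (phi0, lam0); any positive function would do here
    d2 = (phi - phi0) ** 2 + (lam - lam0) ** 2
    return math.exp(-0.5 * d2 / s**2)

f = {p: vp(p[0], p[1], 0.30, 0.80, 0.10) for p in grid}   # navigator 1
g = {p: vp(p[0], p[1], 0.35, 0.82, 0.15) for p in grid}   # navigator 2

# equation 2.86, with the area element dS = cos(lam) dphi dlam
norm = sum(f[p] * g[p] * math.cos(p[1]) for p in grid) * dphi * dlam
fg = {p: f[p] * g[p] / norm for p in grid}

total = sum(fg[p] * math.cos(p[1]) for p in grid) * dphi * dlam
print(total)    # the combined volumetric probability integrates to 1
```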
2.4 Conditional Probability

2.4.1 Notion of Conditional Probability

Let P( · ) represent a probability distribution over an n-dimensional manifold Xn , i.e., a
function A → P (A) satisfying the Kolmogorov axioms. Letting now B be a ‘ﬁxed’ region
of X n , we can deﬁne another probability distribution, say PB ( · ) , that to any region A
associates the probability PB (A) deﬁned by PB (A) = P (A ∩ B )/P (B ) . It can be shown that
this, indeed, is a probability (i.e., satisﬁes the Kolmogorov axioms). Instead of the notation
PB(A) , it is customary to use the notation PB(A) = P(A|B) , and the definition then reads

P(A|B) = P(A ∩ B) / P(B) .   (2.87)

It is important to intuitively understand this definition. The left of figure 2.7 (to be examined later in more detail) suggests a 2D probability distribution P( · ) , that to any region
A of the space associates the probability P (A) . Given now a ﬁxed region B , suggested in
the figure by an ovoid, we can define another probability distribution, denoted P( · | B) , that to any region A of the space associates the probability P(A|B) defined by equation 2.87. The probability P( · | B) is to be understood as
• being identical to P( · ) inside B (except for a renormalization factor guaranteeing that P(B|B) = 1 ),
• vanishing outside B .
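In a discrete setting the two properties above can be verified directly (the four-cell prior below is a made-up illustration):

```python
# Discrete illustration (hypothetical numbers): conditioning keeps the
# relative probabilities inside B and renormalizes so that P(B|B) = 1.

p = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}   # prior P over four cells
B = {"b", "c"}                                  # the region conditioned on

PB = sum(p[x] for x in B)                       # P(B) = 0.5
p_cond = {x: (p[x] / PB if x in B else 0.0) for x in p}

print(p_cond)                        # proportional to p inside B, zero outside
print(sum(p_cond[x] for x in B))     # P(B|B) = 1
```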
This standard definition of conditional probability is mathematically consistent, and not prone to misinterpretations.

Figure 2.7: Illustration of the definition of conditional probability. Given an initial probability distribution P( · ) (left of the figure) and a set B (middle of the figure), P( · | B) is identical to P( · ) inside B (except for a renormalization factor guaranteeing that P(B|B) = 1 ) and vanishes outside B (right of the figure). [Panel formulas: P(A|B) = P(A ∩ B)/P(B) ; p(x|B) = k p(x) H(x) .]

2.4.2 Conditional Volumetric Probability

A volumetric probability over an n-dimensional manifold induces a volumetric probability over
any p-dimensional submanifold (see figure 2.8). We examine here the details of this important issue.

Figure 2.8: Top: A probability distribution in an n-dimensional metric manifold Xn is suggested by some sample points. The probability distribution can be represented by a volumetric probability fn(x) , proportional everywhere to the number of points per unit of n-dimensional volume. A p-dimensional submanifold Xp is also suggested (by a line). Middle: To define the conditional volumetric probability on the submanifold Xp , one considers a 'tube' of constant thickness around the submanifold, and counts the number of points per unit of n-dimensional volume. Bottom: In the limit where the thickness of the tube tends to zero, this defines a p-dimensional volumetric probability fp(x) over the submanifold Xp . The metric over Xp is that induced by the metric over Xn , as is the element of volume. When the n coordinates x = {x1 , . . . , xn} can be separated into p coordinates r = {r1 , . . . , rp} and q coordinates s = {s1 , . . . , sq} (with n = p + q ), so that the p-dimensional submanifold Xp can be defined by the condition s = s(r) , then the coordinates r can be used as coordinates over the submanifold Xp , and the (p-dimensional) conditional volumetric probability, as given by equation 2.95, is simply fp(r) = k fn(r, s(r)) , where k is a normalization constant. The probability of a region Ap ⊂ Xp is to be evaluated as P(Ap) = ∫_{r∈Ap} dvp(r) fp(r) , where the p-dimensional volume element dvp(r) is given in equations 2.97–2.99.

2.4.2.1 General Situation

As in section 2.1.5, consider an n-dimensional manifold Xn , with some coordinates x =
{x1 , . . . , xn} , and a metric tensor g(x) = {gij(x)} . The n-dimensional volume element is, then, dV(x) = ḡ(x) dv̄(x) = √(det g(x)) dx1 ∧ · · · ∧ dxn . In section 2.1.5, the n coordinates x = {x1 , . . . , xn} of Xn have been separated into one group of p coordinates r = {r1 , . . . , rp} and one group of q coordinates s = {s1 , . . . , sq} , with p + q = n , and a p-dimensional submanifold Xp of the n-dimensional manifold Xn (with p ≤ n ) has been introduced via the constraint

s = s(r) .   (2.88)

Consider a probability distribution P over Xn , represented by the volumetric probability f(x) = f(r, s) . We wish to define (and to characterize) the 'conditional volumetric probability'
induced over the submanifold by the volumetric probability f (x) = f (r, s) .
Given the p-dimensional submanifold Xp of the n-dimensional manifold Xn , one can define a set B(∆s) as being the set of all points whose distance to the submanifold Xp is less than or equal to ∆s . For any finite value of ∆s , Kolmogorov's definition of conditional probability applies, and the conditional probability so defined associates, to any A ⊂ Xn , the probability 2.87. Except for a normalization factor, this conditional probability equals the original one, except that all the regions whose points are at a distance larger than ∆s have been 'trimmed away'. This is still a probability distribution over Xn . In the limit when
∆s → 0 this shall deﬁne a probability distribution over the submanifold X p that we are about
to characterize.
Consider a volume element dvp over the submanifold Xp , and all the points of Xn that are at a distance smaller than or equal to ∆s from the points inside the volume element. For small enough ∆s , the n-dimensional volume ∆vn so defined is

∆vn ≈ dvp ∆ωq ,   (2.89)

where ∆ωq is the volume of the q-dimensional sphere of radius ∆s that is orthogonal to the submanifold at the considered point. This volume is proportional to (∆s)^q , so we have

∆vn ≈ k dvp (∆s)^q ,   (2.90)

where k is a numerical factor. The conditional probability associated to this n-dimensional region by formula 2.87 is, by definition of volumetric probability,

dP(p+q) ≈ k′ f ∆vn ≈ k″ f dvp (∆s)^q ,   (2.91)

where k′ and k″ are constants. The conditional probability of the p-dimensional volume element dvp of the submanifold Xp is then defined as the limit

dPp = lim_{∆s→0} dP(p+q) / (∆s)^q ,   (2.92)

this giving dPp = k″ f dvp , or, to put the variables explicitly,

dPp(r) = k f(r, s(r)) dvp(r) .   (2.93)

We have thus arrived at a p-dimensional volumetric probability over the submanifold Xp
that is given by
fp (r) = k f (r, s(r)) , (2.94) where k is a constant. If the probability is normalizable, and we choose to normalize it to
one, then,
fp(r) = f(r, s(r)) / ∫_{r∈Xp} dvp(r) f(r, s(r)) .   (2.95)

With this volumetric probability, the probability of a region Ap of the submanifold is computed
as
P(Ap) = ∫_{r∈Ap} dvp(r) fp(r) .   (2.96)

I must emphasize here that the limit we have used to define the conditional volumetric
probability is an ‘orthogonal limit’ (see ﬁgure 2.9). This contrasts with usual texts, where,
instead, a ‘vertical limit’ is used. The formal similarity of the result 2.95 with that proposed
in the books that use the ‘vertical limit’ deserves explanation: we are handling here volumetric
probabilities, not probability densities. The results for the ‘orthogonal limit’ used here, when
translated to the language of probability densities, give results that are not the familiar results
of common texts (see appendix 2.8.1).
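The 'orthogonal limit' can be visualized with a small Monte Carlo experiment. The setting below is a toy of my own (a 2D standard Gaussian conditioned on the line y = x), not the book's figure: samples are kept when their orthogonal distance to the line is small, and the surviving points, parametrized by arc length, should follow f(r, s(r)) restricted to the line, here again a unit Gaussian:

```python
import math, random

# Monte Carlo sketch of the orthogonal-tube construction (assumed toy
# setting): a 2D standard Gaussian, the submanifold y = x, and the
# arc-length coordinate t = (x + y)/sqrt(2) along it.

random.seed(0)
half_width = 0.05                   # half thickness of the orthogonal tube
ts = []
for _ in range(400000):
    x, y = random.gauss(0, 1), random.gauss(0, 1)
    if abs(y - x) / math.sqrt(2) <= half_width:   # orthogonal distance
        ts.append((x + y) / math.sqrt(2))         # arc-length coordinate

mean = sum(ts) / len(ts)
var = sum((t - mean) ** 2 for t in ts) / len(ts)
print(len(ts), mean, var)    # mean near 0, variance near 1
```

For the isotropic Gaussian this is exact: the arc-length coordinate and the orthogonal distance are independent, so the thin-tube histogram reproduces the unit Gaussian whatever the tube width.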
Figure 2.9: The three limits
that could be used to deﬁne a
conditional volumetric probability over a submanifold. In
the top, the ‘orthogonal’ or
‘natural’ limit. In the middle, the usual ‘vertical’ limit,
and in the bottom a 'horizontal' limit. The last two, although mentioned below (section 2.4.2.2), are not used in
this book.
As already mentioned, the coordinates r deﬁne a coordinate system over the submanifold
X p . The volume element of the submanifold can, then, be written
dvp(r) = ḡp(r) dv̄p(r) ,   (2.97)

with dv̄p(r) = dr1 ∧ · · · ∧ drp . The volume density in the coordinates r on the submanifold Xp has been characterized in section 2.1.5 (equation 2.37):

ḡp(r) = √(det gp(r)) ,   (2.98)

with

gp(r) = grr + grs S + Sᵀ gsr + Sᵀ gss S .   (2.99)

It is understood that all the 'matrices' appearing at the right are taken at the point ( r, s(r) ) .
The probability of a region Ap of the submanifold can then either be computed using equation 2.96 or as
P(Ap) = ∫_{r∈Ap} dv̄(r) ḡp(r) fp(r) ,   (2.100)

with the ḡp(r) given in equation 2.98 and with dv̄(r) = dr1 ∧ · · · ∧ drp .

Figure 2.10: The spherical Fisher distribution corresponds to the conditional probability distribution induced over a sphere by a Gaussian probability distribution in a Euclidean 3D space (see example 2.5). To have a full 3D representation of the property, this figure should be 'rotated around the vertical axis'.

Example 2.5 In the Euclidean 3D space, consider an isotropic Gaussian probability distribution with standard deviation σ . Which is the conditional (2D) volumetric probability it induces
on the surface of a sphere of unit radius whose center is at unit distance from the center of the
Gaussian? Using geographical coordinates (see ﬁgure 2.10), the answer is given by the (2D)
volumetric probability
f(ϕ, λ) = k exp( sin λ / σ² ) ,   (2.101)

where k is a norming constant (see the demonstration in appendix XXX). This is the celebrated
Fisher probability distribution, widely used as a model probability on the sphere’s surface. The
surface element over the surface of the sphere could be obtained using the equations 2.98–2.99,
but it is well known to be dS (ϕ, λ) = cos λ dϕ dλ . [End of example.]
Example 2.6 In the case where we work in a two-dimensional space X2 , with p = q = 1 , we can use the scalar notations r and s instead of the vectors r and s , so that the constraint 2.88 is written

s = s(r) ,   (2.102)

and the 'matrix' of partial derivatives is now a simple real quantity

S = ∂s/∂r .   (2.103)

The conditional volumetric probability on the line s = s(r) induced by a volumetric probability f(r, s) is (equation 2.95),

f1(r) = f(r, s(r)) / ∫ dℓ(r′) f(r′, s(r′)) ,   (2.104)

where, if the metric of the space X2 is written

g(r, s) = ( grr(r, s)   grs(r, s)
            gsr(r, s)   gss(r, s) ) ,   (2.105)

the (1D) volume element is (equations 2.97–2.99)

dℓ(r) = √( grr(r, s(r)) + 2 S(r) grs(r, s(r)) + S(r)² gss(r, s(r)) ) dr .   (2.106)

The probability of an interval (r1 < r < r2) along the line s = s(r) is then

P = ∫_{r1}^{r2} dℓ(r) f1(r) .   (2.107)

If the constraint 2.102 is, in fact, s = s0 , then equation 2.104 simplifies into

f1(r) = f(r, s0) / ∫ dℓ(r′) f(r′, s0) ,   (2.108)

and, as the partial derivative vanishes, S = 0 , the length element 2.106 becomes

dℓ(r) = √(grr(r, s0)) dr .   (2.109)

[End of example.]
Example 2.7 Consider two Cartesian coordinates {x, y} on the Euclidean plane, associated to the usual metric ds² = dx² + dy² . It is easy to see (using, for instance, equation 1.23) that the metric matrix associated to the new coordinates (see figure 2.11)

r = x   ;   s = x y   (2.110)

is

g(r, s) = ( 1 + s²/r⁴   −s/r³
            −s/r³        1/r² ) ,   (2.111)

with metric determinant √(det g(r, s)) = 1/r . Assume that all that we know about the position of a given point is described by the volumetric probability f(r, s) . Then, we are told that, in fact, the point is on the line defined by the equation s = s0 . What can we now say about the coordinate r of the point? This is clearly a problem of conditional volumetric probability, and the information we have now on the position of the point is represented by the volumetric probability (on the line s = s0 ) given by equation 2.108:

f1(r) = f(r, s0) / ∫ dℓ(r′) f(r′, s0) .   (2.112)

Here, considering the special form of the metric in equation 2.111, the length element given by equation 2.109 is

dℓ(r) = √(1 + s0²/r⁴) dr .   (2.113)

The special case s = s0 = 0 gives

f1(r) = f(r, 0) / ∫ dℓ(r′) f(r′, 0)   ;   dℓ(r) = dr .   (2.114)

[End of example.]
as in the previous example, but using the Cartesian coordinates {x, y } . The information that
was represented by the volumetric probability f (r, s) is now represented by the volumetric
probability h(x, y ) given by (as volumetric probabilities are invariant objects)
h(x, y ) = f (r, s)r=x ; s=x y . (2.115) 94 2.4
y = +1 v = +1 y = +0.5 y=0 v=0 y = 0.5 Figure 2.11: The Euclidian plane,
with, at the left, two Cartesian coordinates {x, y } , and, at the right the
two coordinates u = x ; v = x y . v=
+0.5 0.5
v=1
v= u=1 u = 0.8 u = 0.6 u = 0.4 u = 0.2 u=0 x=1 x = 0.8 x = 0.6 x = 0.4 x = 0.2 x=0 y = 1 As the condition s = 0 is equivalent to the condition y = 0 , and as the metric matrix is
the identity, it is clear that the shall arrive, for the (1D) volumetric probability representing the
information we have on the coordinate x to
h1 (x) = h(x, 0)
d (x ) h(x , 0) ; d (x) = dx . (2.116) Not only this equation is similar in form to equation 2.114; replacing here h by f (using
equation 2.115) we obtain an identity that can be expressed using any of the two equivalent
forms
h1 (x) = f1 (r)r=x ; f1 (r) = h1 (x)x=r . (2.117) Along the line s = y = 0 , the two coordinates r and s coincide, so we obtain the same
volumetric probability (with the same length elements d (x) = dx and d (r) = dr ). Trivial
as it may seem, this result is not that found the traditional deﬁnition of conditional probability
density. Jaynes, in the 15th chaper of his unﬁnished Probability Theory book lists this as one of
the paradoxes of probability theory. It is not a paradox, it is a mistake one makes when falling
into the illusion that a conditional probability density (or a conditional volumetric probability)
can be deﬁned without invoking the existence of a metric (i.e., of a notion of distance) in the
working space. This ‘paradox’ is related to the ‘BorelKolmogorov paradox’, that I address in
appendix 2.8.10. [End of example.]
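Examples 2.7 and 2.8 can be accompanied by a short numerical check (the test density h is an arbitrary choice of mine). Since f(r, s) = h(r, s/r) is the same invariant object expressed in the coordinates r = x , s = x y , conditioning on s = y = 0 (where dℓ(r) = dr and dℓ(x) = dx ) must give the same one-dimensional volumetric probability either way:

```python
import math

# Numerical companion to examples 2.7-2.8 (assumed test density): h(x, y)
# is a volumetric probability in Cartesian coordinates; f(r, s) = h(r, s/r)
# is the same invariant object in the coordinates r = x, s = x y.

def h(x, y):
    return math.exp(-0.5 * (x - 1.5) ** 2 - 0.5 * y ** 2)

def f(r, s):                      # invariance: f(r, s) = h(x(r, s), y(r, s))
    return h(r, s / r)

rs = [0.5 + i * 0.001 for i in range(3000)]   # grid on 0.5 < r < 3.5
dr = 0.001

h1_norm = sum(h(x, 0.0) for x in rs) * dr     # denominator with dl(x) = dx
f1_norm = sum(f(r, 0.0) for r in rs) * dr     # denominator with dl(r) = dr

x0 = 1.2
v1 = h(x0, 0.0) / h1_norm                     # h1(x0), equation 2.116
v2 = f(x0, 0.0) / f1_norm                     # f1(r0), equation 2.114
print(v1, v2)                                 # identical values
```

The traditional 'vertical-limit' conditional probability density would instead pick up a factor from the coordinate change, and the two computations would disagree: that disagreement is Jaynes's 'paradox'.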
2.4.2.2 Case X = R × S

I shall show here that a 'joint' volumetric probability f(r, s) over a space Xp+q = Rp × Sq can induce, via a relation s = s(r) , three different conditional volumetric probabilities: (i) a volumetric probability fx(r) over the submanifold s = s(r) itself; (ii) a volumetric probability fr(r) over Rp ; and (iii) a volumetric probability fs(r) (case p ≤ q ) or fs(s) (case p ≥ q ) over Sq . Figure 2.12 shows a schematic view of the properties we are about to analyze.

Figure 2.12: In an n-dimensional space Xn that is the Cartesian product of two spaces Rp and Sq , with coordinates r = {r1 , . . . , rp} and s = {s1 , . . . , sq} and metric tensors gr and gs , there is a volume element on each of Rp and Sq , and an induced volume element in Xn = Rp × Sq . Given a p-dimensional submanifold s = s(r) of Xn , there also is an induced volume element on it. A volumetric probability f(r, s) over Xn induces a (conditional) volumetric probability fx(r) over the submanifold s = s(r) (equation 2.125), and, as the submanifold shares the same coordinates as Rp , a volumetric probability fr(r) is also induced over Rp (equation 2.127). This volumetric probability can, in turn, be transported into Sq , using the concepts developed in section 2.6.

Consider a p-dimensional manifold R with a coordinate system r = {rα} and metric
tensor gr (r) , and a q dimensional manifold S with a coordinate system s = {si } and metric
tensor gs (s) . Each space has, then, a distance element
ds²r = (gr)αβ drα drβ   ;   ds²s = (gs)ij dsi dsj ,   (2.118)

and a volume element

dvr(r) = ḡr(r) dv̄r(r)   ;   dvs(s) = ḡs(s) dv̄s(s) ,   (2.119)

that are related to the capacity elements

dv̄r(r) = dr1 ∧ · · · ∧ drp   ;   dv̄s(s) = ds1 ∧ · · · ∧ dsq   (2.120)

via the volume densities

ḡr(r) = ηr √(det gr(r))   ;   ḡs(s) = ηs √(det gs(s)) .   (2.121)

We can build the Cartesian product X = R × S of the two spaces, by defining the points of
X as being made by a point of R and a point of S (so we can write x = {r, s} ), and by
introducing a metric tensor g(x) over X through the deﬁnition5
ds² = ds²r + ds²s .   (2.122)

This implies that the metric g(x) = g(r, s) has the partitioned form

g(r, s) = ( gr(r)   0
            0       gs(s) ) .   (2.123)

Note: explain that what follows is on the submanifold.
With this partitioned metric, the metric tensor in equation 2.99 simpliﬁes to
gp = gr + Sᵀ gs S ,   (2.124)

or, more explicitly, gp(r) = gr(r) + Sᵀ(r) gs(s(r)) S(r) . Collecting here equations 2.95, 2.98
and 2.100, we can write the conditional probability of a region Ap of the submanifold s = s(r)
as
fx(r) = k f(r, s(r)) ,   (2.125)

where k is a normalization constant. Using the volume element over the submanifold, the probability of a region A of the submanifold s = s(r) is computed via

P(A) = ∫_A dr1 ∧ · · · ∧ drp √(det(gr + Sᵀ gs S)) fx(r) .   (2.126)

As the conditional volumetric probability fx(r) is on the submanifold s = s(r) , it is integrated
with the volume density of the submanifold (equation 2.126). Remember that the coordinates
r are not only the coordinates of the subspace R , they also deﬁne a coordinate system over
the submanifold s = s(r) .
Note: explain that what follows is on the space R p :
Equations 2.125–2.126 deﬁne a volumetric probability over the submanifold X p . As the
coordinates r are both, coordinates over R p and over the submanifold X p , if we deﬁne
fr(r) = k ( √(det(gr + Sᵀ gs S)) / √(det gr) ) f(r, s(r)) ,   (2.127)

where the normalization factor k is given by

1/k = ∫_{Rp} dvr(r) ( √(det(gr + Sᵀ gs S)) / √(det gr) ) f(r, s(r)) ,   (2.128)

a probability is then expressed as

P(A) = ∫_A dvr(r) fr(r) ,   (2.129)

the volume element being

dvr(r) = √(det gr) dr1 ∧ · · · ∧ drp .   (2.130)

As this is the volume element of Rp , we see that we have defined a volumetric probability over Rp . [Note: This is very important, it has to be better explained.]

[Footnote 5: Expression 2.122 is just a special situation. More generally, one should take ds² = α² ds²r + β² ds²s .]

We see thus that, via s = s(r) , the volumetric probability f(r, s) has not only induced a conditional volumetric probability fx(r) over the submanifold s = s(r) , but also a volumetric probability fr(r) over Rp . These two volumetric probabilities are completely equivalent, and one may focus on one or the other depending on the applications in view. We shall talk about the conditional volumetric probability fx(r) on the submanifold s = s(r) and about the conditional volumetric probability fr(r) on the subspace Rp .

If instead of the volumetric probabilities fr(r) and f(r, s) we introduce the probability densities

f̄r(r) = ḡr(r) fr(r) = √(det gr(r)) fr(r)   ;   f̄(r, s) = ḡ(r, s) f(r, s) = √(det gr(r)) √(det gs(s)) f(r, s) ,   (2.131)

then, equation 2.127 becomes

f̄r(r) = k ( √(det(gr + Sᵀ gs S)) / ( √(det gr) √(det gs) ) ) f̄(r, s(r)) ,   (2.132)

where the normalization factor k is given by

1/k = ∫_{Rp} dv̄r(r) ( √(det(gr + Sᵀ gs S)) / ( √(det gr) √(det gs) ) ) f̄(r, s(r)) ,   (2.133)

the capacity element being

dv̄r(r) = dr1 ∧ · · · ∧ drp .   (2.134)

A probability is expressed as

P(A) = ∫_A dv̄r(r) f̄r(r) .   (2.135)

Note: analyze here the case where the application s = s(r) degenerates into

s = s0 ,   (2.136)

in which case the matrix S of partial derivatives vanishes. Then, using for the conditional volumetric probability the usual notation f(r|s0) , equations 2.127–2.128 simply give

f(r|s0) = f(r, s0) / ∫ dvr(r) f(r, s0) .   (2.137)

Equivalently, in terms of probability densities, equations 2.132–2.133 become, in the case s = s0 ,

f̄(r|s0) = ( f̄(r, s0) / √(det gs(s0)) ) / ∫ dv̄r(r) ( f̄(r, s0) / √(det gs(s0)) ) .   (2.138)

Note: I have to check if I can drop the constant term √(det gs(s0)) from this equation.

The assumption that the joint metric diagonalizes 'in the variables' {r, s} is essential here. If from the variables {r, s} we pass to some other variables {u, v} through a general change of variables, the metric of the space X shall no longer be diagonal in the new variables, and a definition of, say, f̄(u|v0) shall not be possible.

This difficulty is often disregarded in usual texts working with probability densities, this causing some confusion in applications of probability theory using the notion of conditional probability density, and the associated expression of the Bayes theorem (see section 2.5.4).
Example 2.9 With the notations of this section, consider that the metric gr of the space Rp and the metric gs of the space Sq are constant (i.e., that both the coordinates rα and si are rectilinear coordinates in Euclidean spaces), and that the application s = s(r) is a linear application, that we can write

s = S r ,   (2.139)

as this is consistent with the definition of S as the matrix of partial derivatives, S^i_α = ∂s^i/∂r^α. Consider that we have a Gaussian probability distribution over the space Rp, represented by the volumetric probability

fp(r) = ( 1 / (2π)^{p/2} ) exp( − (1/2) (r − r0)^t gr (r − r0) ) ,   (2.140)

that is normalized via ∫ dr1 ∧ · · · ∧ drp √det gr fp(r) = √det gr ∫ dr1 ∧ · · · ∧ drp fp(r) = 1. Similarly, consider that we also have a Gaussian probability distribution over the space Sq, represented by the volumetric probability

fq(s) = ( 1 / (2π)^{q/2} ) exp( − (1/2) (s − s0)^t gs (s − s0) ) ,   (2.141)

that is normalized via ∫ ds1 ∧ · · · ∧ dsq √det gs fq(s) = √det gs ∫ ds1 ∧ · · · ∧ dsq fq(s) = 1. Finally, consider the (p + q)-dimensional probability distribution over the space Xp+q defined as the product of these two volumetric probabilities,

f(r, s) = fp(r) fq(s) .   (2.142)

Given this (p + q)-dimensional volumetric probability f(r, s) and given the p-dimensional hyperplane s = S r, we obtain the conditional volumetric probability fr(r) over Rp as given
by equation 2.127. All simplifications done⁶ one obtains the Gaussian volumetric probability⁷

fr(r) = ( 1 / (2π)^{p/2} ) ( √det ĝr / √det gr ) exp( − (1/2) (r − r̂0)^t ĝr (r − r̂0) ) ,   (2.143)

where the metric ĝr (inverse of the covariance matrix) is

ĝr = gr + St gs S   (2.144)

and where the mean r̂0 can be obtained solving the expression⁸

ĝr ( r̂0 − r0 ) = St gs ( s0 − S r0 ) .   (2.145)

Note: I should now show here that fs(s), the volumetric probability in the space Sq, is given, in all cases ( p ≤ q or p ≥ q ) by

fs(s) = ( 1 / (2π)^{q/2} ) ( √det ĝs / √det gs ) exp( − (1/2) (s − ŝ0)^t ĝs (s − ŝ0) ) ,   (2.146)

where the metric ĝs (inverse of the covariance matrix) is

(ĝs)−1 = S (gr)−1 St   (2.147)

and where the mean ŝ0 is

ŝ0 = S r0 .   (2.148)

Note: say that this is illustrated in figure 2.13. [End of example.]

Figure 2.13: Provisional figure to illustrate example 2.9 (panels showing fp(r), fq(s), the line s = S r, and the resulting fr(r) and fs(s)). Caption to be written.

⁶ Note: explain this.
⁷ This volumetric probability is normalized by ∫ dr1 ∧ · · · ∧ drp √det gr fr(r) = 1.
⁸ Explicitly, one can write r̂0 = r0 + (ĝr)−1 St gs ( s0 − S r0 ), but in numerical applications, the direct resolution of the linear system 2.145 is preferable.
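A numerical sketch (with arbitrary random inputs, not from the text) of the computation in example 2.9: build ĝr = gr + St gs S as in equation 2.144, solve the linear system 2.145 for the conditional mean, and check it against the explicit formula of footnote 8.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 2
g_r = np.diag(rng.uniform(0.5, 2.0, p))   # metric of R^p (inverse covariance)
g_s = np.diag(rng.uniform(0.5, 2.0, q))   # metric of S^q
S = rng.standard_normal((q, p))           # matrix of the linear map s = S r
r0 = rng.standard_normal(p)
s0 = rng.standard_normal(q)

# Equation 2.144: metric (inverse covariance) of the conditional Gaussian.
g_hat = g_r + S.T @ g_s @ S

# Equation 2.145: solve g_hat (r_hat - r0) = S^t g_s (s0 - S r0).
r_hat = r0 + np.linalg.solve(g_hat, S.T @ g_s @ (s0 - S @ r0))

# Footnote 8 gives the explicit (numerically less advisable) form.
r_hat_explicit = r0 + np.linalg.inv(g_hat) @ S.T @ g_s @ (s0 - S @ r0)
assert np.allclose(r_hat, r_hat_explicit)
```

As the footnote says, `np.linalg.solve` on the linear system is preferable to forming the inverse explicitly.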
2.5 Marginal Probability

2.5.1 Marginal Probability Density

In a (p + q)-dimensional space Xp+q, consider a continuous, non-intersecting set of p-dimensional hypersurfaces, parameterized by some parameters s = {s1, s2, . . . , sq}, as suggested in figure 2.14. Each given value of s, say s = s0, defines one such hypersurface.
Consider also a probability distribution over X p+q (suggested by the ovoidal shape marked
‘P’ in the ﬁgure). We have seen above that given a particular hypersurface s = s0 , we can
deﬁne a conditional probability distribution, that associates a diﬀerent value of a volumetric
probability to each point of the hypersurface. We are not interested now in the ‘variability’
inside each hypersurface, but in deﬁning a global ‘probability’ for each hypersurface, to analyze
the variation of the probability from one hypersurface to another one.
Crudely speaking, to the hypersurface marked ‘H’ in the ﬁgure, we are going to associate
the probability of the small 'crescent' defined by two infinitely close hypersurfaces.

Figure 2.14: Figure for the definition of marginal probability. Caption to be written.

The easiest way to develop the idea (and to find explicit expressions) is to characterize the
points inside each of the hypersurfaces by some coordinates r = {r1 , r2 , . . . , rp } . Still better,
we can assume that the set {r, s} individualizes one particular point of X p+q , i.e., the set
x = {r, s} is a coordinate system over X p+q (see ﬁgure 2.15).
Figure 2.15: Figure for the definition of marginal probability. Caption to be written.

We shall verify at the end that the definition we are going to make of a probability distribution over s is independent of the particular choice of coordinates r.

Let, then, f(r, s) be a volumetric probability over Xp+q. The probability of a domain
A ⊂ Xp+q is computed as

P(A) = ∫_A dv(r, s) f(r, s) ,   (2.149)

where dv(r, s) = √det g ds1 ∧ · · · ∧ dsq ∧ dr1 ∧ · · · ∧ drp. Explicitly,

P(A) = ∫_A ds1 ∧ · · · ∧ dsq ∧ dr1 ∧ · · · ∧ drp √det g f(r, s) .   (2.150)

As the (infinitesimal) probability of the 'crescent' around the hypersurface 'H' in figure 2.14 is

dPq(s) = ds1 ∧ · · · ∧ dsq ∫_{all values of r} dr1 ∧ · · · ∧ drp √det g f(r, s) ,   (2.151)

we can introduce the definition

f̄s(s) = ∫_{all values of r} dr1 ∧ · · · ∧ drp √det g f(r, s) ,   (2.152)

to have dPq(s) = ds1 ∧ · · · ∧ dsq f̄s(s). When the parameters s are formally seen as coordinates over some (yet undefined) space, the probability of a region B of this space is, by definition of f̄s(s), computed as

P(B) = ∫_B ds1 ∧ · · · ∧ dsq f̄s(s) ,   (2.153)

this showing that f̄s(s) can be interpreted as a probability density over s, that, by construction, corresponds to the integrated probability over the hypersurface defined by a constant value of the parameters s (see figure 2.14 again). The expression 2.153 is the typical one for evaluating finite probabilities from a probability density (see equations 2.55–2.56); for this reason we shall call f̄s(s) the marginal probability density (for the variables s).
This is the most one can do given only the elements of the problem, i.e., a probability
distribution over a space and a continuous family of hypersurfaces. Note that we have been
able to introduce a probability density over the variables s , but not a volumetric probability,
that can only be deﬁned over a well deﬁned space.
Once we understand that we can only define a probability density f̄s(s) (and not a volumetric probability) we can rewrite equation 2.152 as

f̄s(s) = ∫_{all values of r} dr1 ∧ · · · ∧ drp f̄(r, s) ,   (2.154)

where

f̄(r, s) = √det g f(r, s)   (2.155)

is the probability density representing (in the coordinates {r, s}) the initial probability distribution over the space Xp+q.

The elements used in the definition of the marginal probability density f̄s(s) are: (i) a
probability distribution over a (p + q )dimensional metric space X p+q , and (ii) a continuous
family of pdimensional hypersurfaces characterized by some q parameters s = {s1 , . . . , sq } .
This is independent of any coordinate system over X p+q . It remains that the q parameters s
can be considered as q coordinates over X p+q that can be completed, in an arbitrary manner, by
p more coordinates r = {r1 , . . . , rp } in order to have a complete coordinate system x = {r, s}
over X p+q . That the probability density f s (s) is independent of the choice of the coordinates
r is seen by considering equation 2.152. For any fixed value of s (i.e., on a given p-dimensional submanifold), the term √det g dr1 ∧ · · · ∧ drp is just the expression of the volume element on the submanifold, that, by definition, is an invariant, as is the volumetric probability f. Therefore, the integral sum in equation 2.152 shall keep its value invariant under any change of the coordinates r.
In many applications, the continuous family of p-dimensional hypersurfaces is not introduced per se. Rather, one has a given coordinate system x over Xp+q that is, for some reason, split into p coordinates r and q coordinates s. These coordinates define different coordinate hypersurfaces over Xp+q, and, among them, the p-dimensional hypersurfaces defined by constant values of the coordinates s. Then, the definition of marginal probability density given above applies.
NOTE COME BACK HERE AFTER ANALYZING POISSON.
In this particular situation, the metric properties of the space need not be taken into account, and the two equations 2.153–2.154, which only invoke probability densities, can be used.

2.5.2 Marginal Volumetric Probability

Consider now the special situation where the (p + q)-dimensional space Xp+q is defined as the
Cartesian product of two spaces, X = R × S , with respective dimensions p and q . The notion
of Cartesian product of two metric manifolds has been introduced in section 2.4.2.2.
Note: recall here equations 2.122–2.123:
ds2 = dsr2 + dss2 .   (2.156)

This implies that the metric g(x) = g(r, s) has the partitioned form

g(r, s) = ( gr(r)  0 ; 0  gs(s) ) .   (2.157)

In particular, over the (p + q)-dimensional manifold X one then has the induced volume
element
dv (r, s) = dvr (r) dvs (s) , (2.158) where the ‘marginal’ volume elements dvr (r) and dvs (s) are those given in equations 2.119.
Consider now a probability distribution over X , characterized by a volumetric probability
f (x) = f (r, s) . It is not assumed that this volumetric probability factors as a product of
a volumetric probability over R by a volumetric probability over S . Assuming that this
probability is normalizable, we can write the equivalent expressions
P(X) = ∫_{x∈X} dv(x) f(x) = ∫_{r∈R} dvr(r) ∫_{s∈S} dvs(s) f(r, s) = ∫_{s∈S} dvs(s) ∫_{r∈R} dvr(r) f(r, s) .   (2.159)

Defining the two marginal volumetric probabilities

fr(r) = ∫_{s∈S} dvs(s) f(r, s) ;  fs(s) = ∫_{r∈R} dvr(r) f(r, s) ,   (2.160)

this can be written

P(X) = ∫_{x∈X} dv(x) f(x) = ∫_{r∈R} dvr(r) fr(r) = ∫_{s∈S} dvs(s) fs(s) .   (2.161)

It is clear that the marginal volumetric probability fr(r) defines a probability over R,
while the marginal volumetric probability fs(s) defines a probability over S.

2.5.3 Interpretation of Marginal Volumetric Probability

These definitions can be intuitively interpreted as follows. Assume that there is a volumetric
probability f (x) = f (r, s) deﬁned over a space X that is the Cartesian product of two spaces
R and S , in the sense just explained.
A sampling of the (probability distribution over X associated to the) ‘joint’ volumetric
probability f would produce points (of X )
x1 = (r1, s1) ,  x2 = (r2, s2) ,  x3 = (r3, s3) ,  . . . .   (2.162)

Then,

• the points (of R) r1, r2, r3, . . . are samples of the (probability distribution over R associated to the) marginal volumetric probability fr ; and

• the points (of S) s1, s2, s3, . . . are samples of the (probability distribution over S associated to the) marginal volumetric probability fs .
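This sampling interpretation is easy to check numerically. The sketch below (an illustration with an assumed Gaussian joint, not an example from the text) draws samples of a joint distribution over a Cartesian product of two Euclidean 1D spaces and verifies that the r components alone behave as samples of the marginal.

```python
import numpy as np

# Sampling the joint and keeping only the r components samples the marginal.
rng = np.random.default_rng(1)
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])                  # joint covariance of (r, s)
mean = np.array([1.0, -2.0])
x = rng.multivariate_normal(mean, cov, size=200_000)

r_samples = x[:, 0]                           # projection onto R
assert abs(r_samples.mean() - 1.0) < 0.05     # marginal mean is mean_r
assert abs(r_samples.var() - 2.0) < 0.05      # marginal variance is cov_rr
```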
Thus, when working with a Cartesian product of two manifolds X = R × S, and facing a 'joint' volumetric probability f(r, s), if one is only interested in the probability properties induced by f(r, s) over R (respectively over S), one only needs to consider the marginal volumetric probability fr(r) (respectively fs(s)). This, of course, implies that one is not interested in the possible dependences between the variables r and the variables s.

2.5.4 Bayes Theorem

Let us continue to work in the special situation where the n-dimensional space X is defined
as the Cartesian product of two spaces, X = R × S , with respective dimensions p and q ,
with n = p + q . Given a ‘joint’ volumetric probability f (r, s) over X n , we have deﬁned two
marginal volumetric probabilities fr (r) and fs (s) using equations 2.160.
We have also written, for any ﬁxed value of s (equation 2.137 dropping the index ‘0’)
f(r|s) = f(r, s) / ∫_R dvr(r) f(r, s) = f(r, s) / fs(s) ,   (2.163)

where, in the second equality, we have used the definition of marginal volumetric probability. It follows

f(r, s) = f(r|s) fs(s) ,   (2.164)

equation that can be read as saying bla, bla, bla . . . Similarly,

f(r, s) = f(s|r) fr(r) ,   (2.165)

and comparing these two equations we deduce the well known Bayes theorem

f(r|s) = f(s|r) fr(r) / fs(s) ,   (2.166)

equation that can be read as saying bla, bla, bla . . .
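The chain of identities 2.163–2.166 can be verified on a discrete grid, where the volume elements are absorbed into the cell weights. The sketch below uses an arbitrary random joint distribution (my own illustration, not from the text).

```python
import numpy as np

# Discrete analogue of equations 2.163-2.166:
#   f(r|s) fs(s) = f(s|r) fr(r) = f(r, s).
rng = np.random.default_rng(2)
f_joint = rng.uniform(0.1, 1.0, size=(40, 30))
f_joint /= f_joint.sum()                     # normalized joint over (r, s)

f_r = f_joint.sum(axis=1)                    # marginal over r
f_s = f_joint.sum(axis=0)                    # marginal over s
f_r_given_s = f_joint / f_s[None, :]         # conditionals
f_s_given_r = f_joint / f_r[:, None]

# Bayes theorem, equation 2.166:
lhs = f_r_given_s
rhs = f_s_given_r * f_r[:, None] / f_s[None, :]
assert np.allclose(lhs, rhs)
```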
Note: explain here again that the assumption that the metric of the space X takes the
form expressed in equation 2.157 is fundamental.

2.5.5 Independent Probability Distributions

Assume again that there is a volumetric probability f(x) = f(r, s) defined over a space X
that is the Cartesian product of two spaces R and S , in the sense being considered. Then, one
may deﬁne the marginal volumetric probabilities fr (r) and fs (s) deﬁned by equation 2.160.
If it happens that the ‘joint’ volumetric probability f (r, s) is just the product of the two
marginal probability distributions,
f (r, s) = fr (r) fs (s) , (2.167) it is said that the probability distributions over R and S (as characterized by the marginal
volumetric probabilities fr (r) and fs (s) ) are independent .
Note: the comparison of this deﬁnition with equations 2.164–2.165 shows that, in this case,
f(r|s) = fr(r) ;  f(s|r) = fs(s) ,   (2.168)

from where the 'independence' notion can be understood (note: explain this).
NOTE: REFRESH THE EXAMPLE BELOW.
Example 2.10 Over the surface of the unit sphere, using geographical coordinates, we have
the two displacement elements
dsϕ (ϕ, λ) = cos λ dϕ ; dsλ (ϕ, λ) = dλ , (2.169) with the associated surface element (as the coordinates are orthogonal) ds(ϕ, λ) = cos λ dϕ dλ .
Consider a (2D) volumetric probability f (ϕ, λ) over the surface of the sphere, normed under
the usual condition
∫_surface ds(ϕ, λ) f(ϕ, λ) = ∫_{−π}^{+π} dϕ ∫_{−π/2}^{+π/2} dλ cos λ f(ϕ, λ) = ∫_{−π/2}^{+π/2} dλ cos λ ∫_{−π}^{+π} dϕ f(ϕ, λ) = 1 .   (2.170)

One may define the partial integrations

ηϕ(ϕ) = ∫_{−π/2}^{+π/2} dλ cos λ f(ϕ, λ) ;  ηλ(λ) = ∫_{−π}^{+π} dϕ f(ϕ, λ) ,   (2.171)

so that the probability of a sector between two meridians and of an annulus between two parallels are respectively computed as

P(ϕ1 < ϕ < ϕ2) = ∫_{ϕ1}^{ϕ2} dϕ ηϕ(ϕ) ;  P(λ1 < λ < λ2) = ∫_{λ1}^{λ2} dλ cos λ ηλ(λ) ,   (2.172)

but the terms dϕ and cos λ dλ appearing in these two expressions are not the displacement
elements on the sphere's surface (equation 2.169). The functions ηϕ(ϕ) and ηλ(λ) should not be mistaken for marginal volumetric probabilities: as the surface of the sphere is not the Cartesian product of two 1D spaces, marginal volumetric probabilities are not defined. [End
of example.]
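The bookkeeping of example 2.10 can be checked numerically. The sketch below (my own illustration) takes the uniform volumetric probability on the unit sphere, f = 1/(4π), computes the two partial integrations of equation 2.171, and verifies that integrating each one as in equation 2.172 over the full ranges reproduces the normalization 2.170.

```python
import numpy as np

# Geographical coordinates: longitude phi, latitude lam.
phi = np.linspace(-np.pi, np.pi, 2001)
lam = np.linspace(-np.pi / 2, np.pi / 2, 1001)
f = np.full((phi.size, lam.size), 1.0 / (4.0 * np.pi))   # uniform on sphere

# Equation 2.171: the two partial integrations.
eta_phi = np.trapz(np.cos(lam) * f, lam, axis=1)
eta_lam = np.trapz(f, phi, axis=0)

# Equation 2.172 over the full ranges must reproduce equation 2.170.
total_from_phi = np.trapz(eta_phi, phi)
total_from_lam = np.trapz(np.cos(lam) * eta_lam, lam)
assert abs(total_from_phi - 1.0) < 1e-4
assert abs(total_from_lam - 1.0) < 1e-4
```

Note that, as the text stresses, `eta_phi` and `eta_lam` are partial integrations, not marginal volumetric probabilities.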
2.6 Transport of Probabilities

2.6.0.1 The Problem

We are contemplating:
• a pdimensional metric space R p , with coordinates r = {rα } , and a metric matrix that,
in these coordinates, is gr ;
• a q dimensional metric space S q , with coordinates s = {si } , and a metric matrix that,
in these coordinates, is gs ;
• an application s = σ (r) from R p into S q .
To any volumetric probability fr (r) over R p , the application
s = σ (r) (2.173) associates a unique volumetric probability fs (s) over S q . To intuitively understand this,
consider a large collection of samples of fr (r) , say {r1 , r2 , . . . } . To each of these points in
R p we can associate a unique point in S q , via s = σ (r) , so we have a large collection of
points {s1 , s2 , . . . } in S q . Of which volumetric probability fs (s) are these points samples?
Although the major inference problems considered in this book (conditional probability,
product of probabilities, etc.) are only deﬁned when the considered spaces are metric, this
problem of transport of probabilities makes perfect sense even if the spaces do not have a
metric. For this reason, one could set the problem of transportation of a probability distribution in terms of probability densities, instead of volumetric probabilities. I prefer to use the metric concepts and language, but shall also give below the equivalent formulas for those who may choose to work with probability densities.
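The sampling intuition above is easy to demonstrate. The sketch below (an assumed one-dimensional map of my own choosing, not from the text) pushes samples of fr through s = σ(r) and histograms the images; for σ(r) = exp(r) with r standard normal, the transported density is the lognormal one, known in closed form.

```python
import numpy as np

# Transport by sampling: push samples of f_r through s = sigma(r).
rng = np.random.default_rng(3)
r = rng.standard_normal(500_000)
s = np.exp(r)                                  # s = sigma(r)

counts, edges = np.histogram(s, bins=50, range=(0.1, 4.0))
centers = 0.5 * (edges[:-1] + edges[1:])
density = counts / (len(s) * np.diff(edges))   # empirical density of the images

# Lognormal density: the exact transported probability density for this map.
lognorm_pdf = np.exp(-0.5 * np.log(centers) ** 2) / (centers * np.sqrt(2 * np.pi))
assert np.max(np.abs(density - lognorm_pdf)) < 0.05
```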
In what follows, S denotes the matrix of partial derivatives
S^i_α = ∂s^i / ∂r^α .   (2.174)

Note: write somewhere what follows:
As we have represented by gr the metric in the space Rp, the volume element is given by the usual expression

dvr(r) = √det gr(r) dr1 ∧ · · · ∧ drp ,   (2.175)

the volume of a finite region A being computed via

V(A) = ∫_A dvr(r) = ∫_A dr1 ∧ · · · ∧ drp √det gr(r) .   (2.176)
2.6.0.2 Case p ≤ q

When p ≤ q, the p-dimensional manifold Rp is mapped, via s = s(r), into a p-dimensional
submanifold of S q , say S p (see ﬁgure 2.16). In that submanifold we can use as coordinates
the coordinates induced from the coordinates r of R p via s = s(r) . So, now, the coordinates
r deﬁne, at the same time, a point of R p and a point of S p ⊂ S q (if the points of S q are
covered more than once by the application s = s(r) , then, let us assume that we work inside a
subdomain of R p where the problem does not exist). Note: I should mention here ﬁgure 2.19.
The application s = s(r) maps the pdimensional volume element dvr on R p into a
pdimensional volume element dvs on the submanifold S p of S q . Let us characterize it.
The distance element between two points in S q is ds2 = (gs )ij dsi dsj . If, in fact, those
are points of Sp , then we can write dsi = S i α drα , to obtain ds2 = Gαβ drα drβ , where
G = St gs S (remember that we can use r as coordinates over the submanifold S p of S q ).
The pdimensional volume element obtained on S p by transportation of the volume element
dvr of R p (via s = s(r)) is
dvs(r) = √det( St gs S ) dr1 ∧ · · · ∧ drp ,   (2.177)

where gs = gs(s(r)) and S = S(r). The volume of a finite region A of Sp is computed via

V(A) = ∫_A dvs(r) = ∫_A dr1 ∧ · · · ∧ drp √det( St gs S ) .   (2.178)

Note that by comparing equations 2.175 and 2.177 we obtain the ratio of the volumes,

dvs / dvr = √det( St gs S ) / √det gr .   (2.179)

We have seen that bla, bla, bla, and we have the same coordinates over the two spaces, and bla, bla, bla, and to a common capacity element dr1 ∧ · · · ∧ drp correspond the two volume elements

dvr(r) = √det gr dr1 ∧ · · · ∧ drp ;  dvs(r) = √det( St gs S ) dr1 ∧ · · · ∧ drp .   (2.180)

When there are volumetric probabilities fr(r) and fs(r), they are defined so as to have
dPr = fr dvr ;  dPs = fs dvs .   (2.181)

We say that the volumetric probability fs has been 'transported' from fr if the two probabilities associated to the two volumes defined by the common capacity element dr1 ∧ · · · ∧ drp are identical, i.e., if dPr = dPs. It follows the relation fs = (dvr/dvs) fr, i.e.,

fs = ( √det gr / √det( St gs S ) ) fr ,   (2.182)

or, more explicitly,

fs(r) = ( √det gr(r) / √det( St(r) gs(σ(r)) S(r) ) ) fr(r) .   (2.183)

The matrix S of partial derivatives has dimension (q × p), and unless p = q, it is not a
square matrix. This implies that, in general, det( St gs S ) ≠ det( St S ) det gs .
While the probability of a domain A of R p is to be evaluated as
Pr(A) = ∫_{r∈A} dr1 ∧ · · · ∧ drp √det gr fr(r) ,   (2.184)

the probability of its image s(A) (that, by definition, is identical to the probability of A) is to be evaluated as

Ps(s(A)) = Pr(A) = ∫_{r∈A} dr1 ∧ · · · ∧ drp √det( St gs S ) fs(r) .   (2.185)

Of course, one could introduce the probability densities

f̄r(r) = √det gr fr(r) ;  f̄s(r) = √det( St gs S ) fs(r) .   (2.186)

Using them, the integrations 2.184–2.185 would formally simplify into

Pr = ∫_{r∈A} dr1 ∧ · · · ∧ drp f̄r(r) ;  Ps = ∫_{r∈A} dr1 ∧ · · · ∧ drp f̄s(r) ,   (2.187)

while the relation 2.183 would trivialize into

f̄s(r) = f̄r(r) .   (2.188)

There is no harm in using equations 2.187–2.188 in analytical developments (I have already mentioned that for numerical integrations it is much better to use volume elements, rather than capacity elements, and volumetric probabilities rather than probability densities), provided one remembers that the volume of a domain A of Rp is to be evaluated as

V(A) = ∫_{r∈A} dr1 ∧ · · · ∧ drp √det gr ,   (2.189)

while the volume of its image s(A) (of course, different from that of A) is to be evaluated as

V(s(A)) = ∫_{r∈A} dr1 ∧ · · · ∧ drp √det( St gs S ) .   (2.190)

2.6.0.3 Case p ≥ q

Let us now consider the case p ≥ q, i.e., when the 'starting space' has larger (or equal)
dimension than the ‘arrival space’.
Let us begin by choosing over Rp a new system of coordinates specially adapted to the
problem. Remember that we are using Latin indices for the coordinates si , where 1 ≤ i ≤ q ,
and Greek indices for the coordinates rα , where 1 ≤ α ≤ p . We pass from the p coordinates
r to the new p coordinates
s^i = s^i(r)  (1 ≤ i ≤ q) ;  t^A = t^A(r)  (q + 1 ≤ A ≤ p) .   (2.191)

Figure 2.16: Transporting a volume element from a p-dimensional space Rp into
a q-dimensional space Sq, via an expression s = s(r). Left: 1 = p < q = 2; in this case, we start with a p-dimensional volume in Rp and arrive at Sq with a volume of the same dimension (equations 2.175 and 2.177). Right: 2 = p > q = 1: in this case we start with a p-dimensional volume in Rp but arrive at Sq with a q-dimensional volume, i.e., a volume of lower dimension (equations 2.176 and ??).

Figure 2.17: Detail of figure 2.16, showing a domain of Rp that maps into a single point of Sq.

where the functions σ^i are the same as those appearing in equation 2.173 (i.e., the q coordinates
s of S q are used as q of the p coordinates of R p ), and where the functions τ A are arbitrary
(one could, for instance, choose tA = rA , for q + 1 ≤ A ≤ p ).
It may well happen that the coordinates {s, t} are only regular inside distinct regions of
R p . Let us work inside one such region, letting the adhoc management of the more general
situation be just suggested in ﬁgure 2.19.
We need to express the metric tensor in the new coordinates, and, for this, we must introduce the (Jacobian) matrix K of partial derivatives

K = ( S^i_β ; T^A_β ) = ( ∂s^i/∂r^β ; ∂t^A/∂r^β ) ,   (2.192)

and its inverse

L = K−1 .   (2.193)

Using L, the matrix representing the metric tensor of the space Rp in the new coordinates is (see, for instance, equation 1.23)

G = Lt gr L ,   (2.194)

while, in terms of the matrix K, equivalently,

G−1 = K gr−1 Kt .   (2.195)

Note: I have to say here that, as the matrices K and L are invertible,

√det G = L √det gr = (1/K) √det gr ,   (2.196)

where

L = √det( Lt L ) ;  K = √det( K Kt ) .   (2.197)

[Note: Emphasize here that when only the determinant of the metric appears, and not the full metric, this means that we only need a volume element over the space, not a distance element. Important to solve the problem of 'relative weights'.]
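Equations 2.196–2.197 can be exercised numerically. The sketch below (random invertible Jacobian and random positive definite metric, both arbitrary choices of mine) checks that √det G equals both L √det gr and (1/K) √det gr.

```python
import numpy as np

rng = np.random.default_rng(5)
p = 4
K_mat = rng.standard_normal((p, p))          # Jacobian of the new coordinates
A = rng.standard_normal((p, p))
g_r = A @ A.T + p * np.eye(p)                # a positive definite metric

L_mat = np.linalg.inv(K_mat)                 # equation 2.193
G = L_mat.T @ g_r @ L_mat                    # equation 2.194

L_scal = np.sqrt(np.linalg.det(L_mat.T @ L_mat))   # equation 2.197
K_scal = np.sqrt(np.linalg.det(K_mat @ K_mat.T))
sqrt_det_G = np.sqrt(np.linalg.det(G))

assert np.isclose(sqrt_det_G, L_scal * np.sqrt(np.linalg.det(g_r)))
assert np.isclose(sqrt_det_G, np.sqrt(np.linalg.det(g_r)) / K_scal)
```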
The deﬁnition of volumetric probability that we have used makes it an invariant. The
relation between a volumetric probability fr (r) , expressed in the coordinates r and the
equivalent volumetric probability f (s, t) , expressed in the coordinates {s, t} , is, simply,
fr(r) = f(s(r), t(r)) .   (2.198)

In the coordinates r, the probability of a region of Rp is computed as

Pr = ∫ dr1 ∧ · · · ∧ drp √det gr fr(r) .   (2.199)

In the coordinates {s, t}, it is computed using the equation

Pr = ∫ ds1 ∧ · · · ∧ dsq ∫ dtq+1 ∧ · · · ∧ dtp √det G f(s, t) ,   (2.200)

where G is given by equation 2.194. While this expression defines the probability of an arbitrary region of Rp, the expression

Ps = ∫ ds1 ∧ · · · ∧ dsq ∫_{all t} dtq+1 ∧ · · · ∧ dtp √det G f(s, t) ,   (2.201)

where the first sum is taken over an arbitrary domain of the coordinates s, but the second sum is now taken over all possible values of the coordinates t, corresponds to the probability of a region of Sq (as the coordinates s are not only some of the coordinates of Rp, but are also the coordinates over Sq).

As a volumetric probability fs(s) over Sq is to be integrated via

Ps = ∫ ds1 ∧ · · · ∧ dsq √det gs fs(s) ,   (2.202)

then, by comparison with equation 2.201, we deduce that the expression representing the volumetric probability we wished to characterize is

fs(s) = ( 1 / √det gs ) ∫_{all t} dtq+1 ∧ · · · ∧ dtp √det G f(s, t) ,   (2.203)

and our problem is, essentially, solved. Note that, here, the volumetric probability appears with
the variables s and t , while the original volumetric probability was fr (r) . Although the two
expressions are linked through equation 2.198, this is not enough to actually have the expression
of f (s, t) . This requires that we solve the change of variables 2.191, to obtain the relations
r = r(s, t) ,   (2.204)

so we can write

f(s, t) = fr(r(s, t)) .   (2.205)

Explicitly, using equation 2.196, the volumetric probability fs(s) can be written

fs(s) = ( 1 / √det gs ) ∫_{all t} dtq+1 ∧ · · · ∧ dtp ( √det gr(r) / K(r) ) fr(r) |_{r = r(s,t)} .   (2.206)

As the probability densities associated to the volumetric probabilities fs and fr are

f̄s = √det gs fs ;  f̄r = √det gr fr ,   (2.207)

equation 2.206 can also be written

f̄s(s) = ∫_{all t} dtq+1 ∧ · · · ∧ dtp ( 1 / K(r(s, t)) ) f̄r(r(s, t)) ,   (2.208)

an expression that is independent of the metrics in the spaces Rp and Sq.
Figure 2.18: Lines that map into a same value of s. Two different choices for the variables t.

Figure 2.19: Consider that we have a mapping from the Euclidean plane, with polar coordinates r = {ρ, ϕ}, into a one-dimensional space with a metric coordinate s (in this illustration, s = s(ρ, ϕ) = sin ρ/ρ). When transporting a probability from the plane into the 'vertical axis', for a given value of s = s0 we have, first, to obtain the set of discrete values ρn giving the same s0, and, for each of these values, we have to perform the integration for −π < ϕ ≤ +π corresponding to that indicated in equations ??–2.206.
Example 2.11 A one-dimensional material medium with an initial length X is deformed into a second state, where its length is Y. The strain that has affected the medium, denoted ε, is defined as

ε = log ( Y / X ) .   (2.209)

A measurement of X and Y provides the information represented by a volumetric probability fr(Y, X). This induces an information on the actual value of the strain, that shall be represented by a volumetric probability fs(ε). The problem is to express fs(ε) using as 'inputs' the definition 2.209 and the volumetric probability fr(Y, X). Let us introduce the two-dimensional 'data' space R2, over which the quantities X and Y are coordinates. The lengths X and Y being Jeffreys quantities (see discussion in section XXX), we have, in the space R2, the distance element dsr2 = ( dY/Y )2 + ( dX/X )2, associated to the metric matrix

gr = ( 1/Y2  0 ; 0  1/X2 ) .   (2.210)

This, in particular, gives

√det gr = 1 / (YX) ,   (2.211)

so the (2D) volume element over R2 is dvr = ( dY ∧ dX ) / (YX), and any volumetric probability fr(Y, X) over R2 is to be integrated via

Pr = ∫ ( dY ∧ dX / (YX) ) fr(Y, X) ,   (2.212)

over the appropriate bounds. In particular, a volumetric probability fr(Y, X) is normalized if the integral over ( 0 < Y < ∞ ; 0 < X < ∞ ) equals one. Let us also introduce the one-dimensional 'space of deformations' S1, over which the quantity ε is the chosen coordinate (one could as well choose the exponential of ε, or twice the strain, as coordinate). The strain being an ordinary Cartesian coordinate, we have, in the space of deformations S1, the distance element dss2 = dε2, associated to the trivial metric matrix gs = (1). Therefore,

det gs = 1 .   (2.213)

The (1D) volume element over S1 is dvs = dε, and any volumetric probability fs(ε) over S1 is to be integrated via

Ps = ∫ dε fs(ε) ,   (2.214)

over given bounds. A volumetric probability fs(ε) is normalized by the condition that the integral over ( −∞ < ε < +∞ ) equals one. As suggested in the general theory, we must change the coordinates in R2 using as part of the coordinates those of S1, i.e., here, using the strain ε. Then, arbitrarily, select X as second coordinate, so we pass in R2 from the coordinates {Y, X} to the coordinates {ε, X}. Then, the Jacobian matrix defined in equation 2.192 is

K = ( ∂ε/∂Y  ∂ε/∂X ; ∂X/∂Y  ∂X/∂X ) = ( 1/Y  −1/X ; 0  1 ) ,   (2.215)

and we obtain, using the metric 2.210,

√det( K gr−1 Kt ) = X .   (2.216)

Noting that the expression 2.209 can trivially be solved for Y as

Y = X exp ε ,   (2.217)

everything is ready now to attack the problem. If a measurement of X and Y has produced the information represented by the volumetric probability fr(Y, X), this transports into a volumetric probability fs(ε) that is given by equation 2.206. Using the particular expressions 2.213, 2.216 and 2.217 this gives

fs(ε) = ∫_0^∞ ( dX / X ) fr( X exp ε , X ) .   (2.218)

[End of example.]
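Equation 2.218 can be evaluated by direct quadrature. The sketch below uses lognormal measurement information (the situation treated next, in example 2.12) with assumed numbers X0 = 2, Y0 = 5, sX = 0.2, sY = 0.3, and checks the result against the analytic answer for this case: a normal function centered at log(Y0/X0) with variance sX² + sY² (the convolution of two Gaussians in logarithmic coordinates).

```python
import numpy as np

X0, Y0, sX, sY = 2.0, 5.0, 0.2, 0.3          # assumed measurement values

def f_X(X):
    return np.exp(-0.5 * (np.log(X / X0) / sX) ** 2) / (np.sqrt(2 * np.pi) * sX)

def f_Y(Y):
    return np.exp(-0.5 * (np.log(Y / Y0) / sY) ** 2) / (np.sqrt(2 * np.pi) * sY)

u = np.linspace(np.log(X0) - 8 * sX, np.log(X0) + 8 * sX, 4001)  # u = log X
X = np.exp(u)

eps = np.linspace(np.log(Y0 / X0) - 1.5, np.log(Y0 / X0) + 1.5, 301)
# Equation 2.218: fs(eps) = int_0^inf (dX/X) fr(X e^eps, X), with dX/X = du.
fs = np.array([np.trapz(f_Y(X * np.exp(e)) * f_X(X), u) for e in eps])

eps0, s_eps = np.log(Y0 / X0), np.hypot(sX, sY)
fs_analytic = np.exp(-0.5 * ((eps - eps0) / s_eps) ** 2) / (np.sqrt(2 * np.pi) * s_eps)
assert np.allclose(fs, fs_analytic, atol=1e-5)
assert np.isclose(np.trapz(fs, eps), 1.0, atol=1e-3)
```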
Example 2.12 In the context of the previous example, assume that the measurement of the two lengths X and Y has provided an information on their actual values that: (i) has independent uncertainties and (ii) is Gaussian (which, as indicated in section 2.8.4, means that the dependence of the volumetric probability on the Jeffreys quantities X and Y is expressed by the lognormal function). Then we have

fX(X) = ( 1 / (√(2π) sX) ) exp( − (1/(2 sX2)) ( log (X/X0) )2 ) ,   (2.219)

fY(Y) = ( 1 / (√(2π) sY) ) exp( − (1/(2 sY2)) ( log (Y/Y0) )2 ) ,   (2.220)

and

fr(Y, X) = fY(Y) fX(X) .   (2.221)

The volumetric probability for X is centered at point X0, with standard deviation sX, and the volumetric probability for Y is centered at point Y0, with standard deviation sY (see section 2.7 for a precise —invariant— definition of standard deviation). In this simple example, the integration in equation 2.218 can be performed analytically, and one obtains a Gaussian probability distribution for the strain, represented by the normal function

fs(ε) = ( 1 / (√(2π) sε) ) exp( − (ε − ε0)2 / (2 sε2) ) ,   (2.222)

where ε0, the center of the probability distribution for the strain, equals the logarithm of the ratio of the centers of the probability distributions for the lengths,

ε0 = log ( Y0 / X0 ) ,   (2.223)

and where sε2, the variance of the probability distribution for the strain, equals the sum of the variances of the probability distributions for the lengths,

sε2 = sX2 + sY2 .   (2.224)
[End of example.]

2.6.0.4 Case p = q

The two cases examined above, p ≤ q and p ≥ q, both contain the case p = q, but let us, to avoid possible misunderstandings, treat the case explicitly here.
In the case p ≤ q, we have chosen to use, over the subspace Sp, image of Rp through s = σ(r), the image of the coordinates r of Rp, and, in these coordinates, we have found the expression 2.182

fs(r) = ( √det gr / √det( St gs S ) ) fr(r) ,   (2.225)

that is directly valid here. As the matrix S is a square matrix, we could further write

fs(r) = ( √det gr / √det gs ) (1/S) fr(r) ,   (2.226)

where S = det S.
In the case p ≥ q, we have used over Sq its own coordinates. Expression 2.206 drastically simplifies when p = q (there are no variables t), to give

fs(s) = ( 1 / √det gs ) ( 1 / √det( S gr−1 St ) ) fr(r(s)) ,   (2.227)

or, as the matrix S is now square,

fs(s) = ( √det gr / ( √det gs S ) ) fr(r(s)) .   (2.228)

This is, of course, the same expression as that in 2.226: we know that volumetric probabilities are invariant, and have the same value, at a given point, irrespective of the coordinates being used.
Note that the expression s = σ(r) is not defining a change of variables inside a given space: we have two different spaces, an Rp with coordinates r and a metric matrix gr, and an Sq with coordinates s and metric matrix gs. These two metrics are totally independent, and the application s = σ(r) is mapping points from Rp into Sq. If we were contemplating a change of variables inside a given space, then the metric matrices, instead of being independently given, would be related in the usual way tensors relate under a change of variables, (gr)αβ = (∂s^i/∂r^α) (∂s^j/∂r^β) (gs)ij, i.e., for short,

gr = St gs S   (if we were considering a change of variables) .   (2.229)

In particular, then, √det gr = √det( St gs S ), i.e., as the matrix S is (p × p),

√det gr = S √det gs   (if we were considering a change of variables) ,   (2.230)

where S is the Jacobian determinant, S = det S. Then, the two equations 2.226–2.228 would simply give

fs(s) = fr(r) ,   (2.231)

expressing the invariance of a volumetric probability under a change of variables (equation 2.65).
Of course, we are not considering this situation: equations 2.226–2.228 represent a transport of a probability distribution between two spaces, not a change of variables inside a given space.
2.6.0.5 Transportation into the manifold s = s(r) itself

Note: say here that we use in the space X^{p+q} the 'induced metric'.

We have seen that bla, bla, bla, and we have the same coordinates over the two spaces, and bla, bla, bla, and to a common capacity element dr^1 ∧ ··· ∧ dr^p correspond the two volume elements

    dv_r(r) = \sqrt{\det g_r} \; dr^1 ∧ ··· ∧ dr^p ,    dv_x(r) = \sqrt{\det( g_r + S^t g_s S )} \; dr^1 ∧ ··· ∧ dr^p .    (2.232)

When there are volumetric probabilities f_r(r) and f_x(r) , they are defined so as to have

    dP_r = f_r \, dv_r ,    dP_x = f_x \, dv_x .    (2.233)

We say that the volumetric probability f_x has been 'transported' from f_r if the two probabilities associated to the two volumes defined by the common capacity element dr^1 ∧ ··· ∧ dr^p are identical, i.e., if

    dP_r = dP_x .    (2.234)

It follows the relation

    f_x \, dv_x = f_r \, dv_r ,    (2.235)

i.e.,

    f_x \sqrt{\det( g_r + S^t g_s S )} = f_r \sqrt{\det g_r} .    (2.236)
2.7 Central Estimators and Dispersion Estimators

2.7.1 Introduction

Let X be an n-dimensional manifold, and let P, Q, . . . represent points of X . The manifold
is assumed to have a metric deﬁned over it, i.e., the distance between any two points P and
Q is deﬁned, and denoted D(Q, P) . Of course, D(Q, P) = D(P, Q) .
A normalized probability distribution P is deﬁned over X , represented by the volumetric
probability f . The probability of A ⊂ X is obtained, using the notations of equation 2.49, as

    P(A) = \int_{P∈A} dV(P) \, f(P) .    (2.237)

If ψ(P) is a scalar (invariant) function defined over X , its average value is denoted ⟨ψ⟩ , and is defined as

    ⟨ψ⟩ = \int_{P∈X} dV(P) \, f(P) \, ψ(P) .    (2.238)

This clearly corresponds to the intuitive notion of 'average'.

2.7.2 Center and Radius of a Probability Distribution

Let p be a real number in the range 1 ≤ p < ∞ . To any point P we can associate the quantity (having the dimension of a length)

    σ_p(P) = \left( \int_{Q∈X} dV(Q) \, f(Q) \, D(Q, P)^p \right)^{1/p} .    (2.239)

Definition 2.1 The point⁹ where σ_p(P) attains its minimum value is called the Lp-norm center of the probability distribution f(P) , and it is denoted P_p .

Definition 2.2 The minimum value of σ_p(P) is called the Lp-norm radius of the probability distribution f(P) , and it is denoted σ_p .
The interpretation of these definitions is simple. Take, for instance, p = 1 . Comparing the
two equations 2.238–2.239, we see that, for a ﬁxed point P , the quantity σ1 (P) corresponds
to the average of the distances from the point P to all the points. The point P that minimizes
this average distance is ‘at the center’ of the distribution (in the L1 norm sense). For p = 2 ,
it is the average of the squared distances that is minimized, etc.
The following terminology shall be used:

• P_1 is called the median , and σ_1 is called the mean deviation ;

• P_2 is called the barycenter (or the center , or the mean ), and σ_2 is called the standard deviation (while its square is called the variance );

• P_∞ is called¹⁰ the circumcenter , and σ_∞ is called the circumradius .

⁹ If there is more than one point where σ_p(P) attains its minimum value, any such point is called a center (in the Lp-norm sense) of the probability distribution f(P) .
Calling P∞ and σ∞ respectively the ‘circumcenter’ and the ‘circumradius’ seems justiﬁed when considering, in the Euclidean plane, a volumetric probability that is constant inside
a triangle, and zero outside. The ‘circumcenter’ of the probability distribution is then the
circumcenter of the triangle, in the usual geometrical sense, and the ‘circumradius’ of the probability distribution is the radius of the circumscribed circle11 . More generally, the circumcenter
of a probability distribution is always at the point that minimizes the maximum distance to all
other points, and the circumradius of the probability distribution is this ‘minimax’ distance.
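The three centers just listed can be compared numerically on a discrete distribution. The sketch below (all names are illustrative, not from the text) searches only among the sample points themselves, and assumes the plain distance |a − b|; it shows the median's robustness to an outlier and the minimax character of the circumcenter.

```python
import math

def lp_center(points, weights, p, dist=lambda a, b: abs(a - b)):
    """Return (center, radius): the candidate point minimizing sigma_p of
    equation 2.239, and the minimum value. p = math.inf uses the weighted
    maximum distance (the 'minimax' criterion)."""
    best = None
    for c in points:
        if p == math.inf:
            s = max(dist(q, c) for q, w in zip(points, weights) if w > 0)
        else:
            s = sum(w * dist(q, c) ** p for q, w in zip(points, weights)) ** (1.0 / p)
        if best is None or s < best[1]:
            best = (c, s)
    return best

pts = [0.0, 1.0, 2.0, 3.0, 10.0]
wts = [0.2, 0.2, 0.2, 0.2, 0.2]
median, _ = lp_center(pts, wts, 1)          # L1 center: barely moved by the outlier
mean_pt, _ = lp_center(pts, wts, 2)         # L2 center (restricted to sample points)
circ, rad = lp_center(pts, wts, math.inf)   # minimax center and circumradius
```

Note that the search is restricted to the sample points, so for p = ∞ the result only approximates the true midpoint of the support.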
Example 2.13 Consider a one-dimensional space N , with a coordinate ν , such that the distance between the point ν_1 and the point ν_2 is

    D(ν_2, ν_1) = \left| \log \frac{ν_2}{ν_1} \right| .    (2.240)

As suggested in XXX, the space N could be the space of musical notes, and ν the frequency of a note. Then, this distance is just (up to a multiplicative factor) the usual distance between notes, as given by the number of 'octaves'. Consider a normalized volumetric probability f(ν) , and let us be interested in the L2-norm criteria. For p = 2 , equation 2.239 can be written

    σ_2(µ)^2 = \int_0^∞ ds(ν) \, f(ν) \left( \log \frac{ν}{µ} \right)^2 .    (2.241)

The L2-norm center of the probability distribution, i.e., the value ν_2 at which σ_2(µ) is minimum, is easily found¹² to be

    ν_2 = ν_0 \exp\left( \int_0^∞ ds(ν) \, f(ν) \log \frac{ν}{ν_0} \right) ,    (2.242)

where ν_0 is an arbitrary constant (in fact, and by virtue of the properties of the log-exp functions, the value ν_2 is independent of this constant). This mean value ν_2 corresponds to what in statistical theory is called the 'geometric mean'. The variance of the distribution, i.e., the value of the expression 2.241 at its minimum, is

    σ_2^2 = \int_0^∞ ds(ν) \, f(ν) \left( \log \frac{ν}{ν_2} \right)^2 .    (2.243)

The distance element associated to the distance in equation 2.240 is, clearly, ds(ν) = dν/ν , and the probability density associated to f(ν) is \bar f(ν) = f(ν)/ν , so, in terms of the probability density \bar f(ν) , equation 2.242 becomes

    ν_2 = ν_0 \exp\left( \int_0^∞ dν \, \bar f(ν) \log \frac{ν}{ν_0} \right) ,    (2.244)

¹⁰ The L∞-norm center and radius are defined as the limit p → ∞ of the Lp-norm center and radius.
¹¹ The circumscribed circle is the circle that contains the three vertices of the triangle. Its center (called the circumcenter) is at the point where the perpendicular bisectors of the sides cross.
¹² For the minimization of σ_2(µ) is equivalent to the minimization of σ_2(µ)^2 , and this gives the condition \int ds(ν) f(ν) \log(ν/µ) = 0 . For any constant ν_0 , this is equivalent to \int ds(ν) f(ν) ( \log(ν/ν_0) − \log(µ/ν_0) ) = 0 , i.e., \log(µ/ν_0) = \int ds(ν) f(ν) \log(ν/ν_0) , from where the result follows. The constant ν_0 is necessary in these equations for reasons of physical dimensions (only the logarithm of adimensional quantities is defined).

while equation 2.243 becomes
    σ_2^2 = \int_0^∞ dν \, \bar f(ν) \left( \log \frac{ν}{ν_2} \right)^2 .    (2.245)

The reader shall easily verify that if, instead of the variable ν , one chooses to use the logarithmic variable ν^* = \log(ν/ν_0) , where ν_0 is an arbitrary constant (perhaps the same as above), then, instead of the six expressions 2.240–2.245, we would have obtained, respectively,
    s(ν_2^*, ν_1^*) = | ν_2^* − ν_1^* | ,
    σ_2(µ^*)^2 = \int_{−∞}^{+∞} ds(ν^*) \, f(ν^*) \, (ν^* − µ^*)^2 ,
    ν_2^* = \int_{−∞}^{+∞} ds(ν^*) \, f(ν^*) \, ν^* ,
    σ_2^2 = \int_{−∞}^{+∞} ds(ν^*) \, f(ν^*) \, (ν^* − ν_2^*)^2 ,    (2.246)

    ν_2^* = \int_{−∞}^{+∞} dν^* \, \bar f(ν^*) \, ν^* ,    (2.247)

and

    σ_2^2 = \int_{−∞}^{+∞} dν^* \, \bar f(ν^*) \, (ν^* − ν_2^*)^2 ,    (2.248)

with, for this logarithmic variable, ds(ν^*) = dν^* and \bar f(ν^*) = f(ν^*) . The two last expressions
are the ordinary equations used to deﬁne the mean and the variance in elementary texts. [End
of example.]
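The geometric mean of equations 2.242–2.244 can be checked on a discrete sample; the following sketch (names are illustrative) uses four frequencies one octave apart, so the answer is known in closed form.

```python
import math

def geometric_mean_and_spread(samples):
    """Discrete analogue of equations 2.242 and 2.243: under the distance
    D(a, b) = |log(a/b)|, the L2-norm center is the geometric mean, and the
    L2-norm radius is the standard deviation of the logarithms."""
    logs = [math.log(x) for x in samples]
    center = math.exp(sum(logs) / len(logs))                        # nu_2
    var = sum((l - math.log(center)) ** 2 for l in logs) / len(logs)
    return center, math.sqrt(var)                                   # (nu_2, sigma_2)

nu2, sigma2 = geometric_mean_and_spread([110.0, 220.0, 440.0, 880.0])
# nu2 is exactly 110 * 2**1.5 ≈ 311.13: the midpoint (in octaves) of the sample
```

The center is independent of the unit chosen for the frequencies, as the text asserts for the constant ν₀.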
Example 2.14 Consider a one-dimensional space, with a coordinate χ , the distance between two points χ_1 and χ_2 being denoted D(χ_2, χ_1) . Then, the associated length element is dℓ(χ) = D(χ + dχ, χ) . Finally, consider a (1D) volumetric probability f(χ) , and let us be interested in the L1-norm case. Assume that χ runs from a minimum value χ_min to a maximum value χ_max (both could be infinite). For p = 1 , equation 2.239 can be written

    σ_1(χ) = \int dℓ(χ') \, f(χ') \, D(χ', χ) .    (2.249)

Denoting by χ_1 the median, i.e., the point where σ_1(χ) is minimum, one easily¹³ finds that χ_1 is characterized by the property that it separates the line into two regions of equal probability, i.e.,

    \int_{χ_min}^{χ_1} dℓ(χ) \, f(χ) = \int_{χ_1}^{χ_max} dℓ(χ) \, f(χ) ,    (2.250)

an expression that can readily be used for an actual computation of the median, and which corresponds to its elementary definition. The mean deviation is then given by

    σ_1 = \int_{χ_min}^{χ_max} dℓ(χ) \, f(χ) \, D(χ, χ_1) .    (2.251)

¹³ In fact, the property 2.250 of the median being intrinsic (independent of any coordinate system), we can limit ourselves to demonstrating it using a special 'Cartesian' coordinate, where dℓ(x) = dx , and D(x_1, x_2) = |x_2 − x_1| , where the property is easy to demonstrate (and well known).

[End of example.]
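The equal-probability characterization of equation 2.250 translates directly into a computation; a minimal discrete sketch (illustrative names and numbers):

```python
def weighted_median(points, weights):
    """Median as in equation 2.250: the point where the cumulative weight
    first reaches half of the total weight (discrete 'equal areas')."""
    pairs = sorted(zip(points, weights))
    total = sum(weights)
    acc = 0.0
    for x, w in pairs:
        acc += w
        if acc >= total / 2.0:
            return x

m = weighted_median([1.0, 2.0, 3.0, 4.0, 100.0], [1, 1, 1, 1, 1])
# m == 3.0 : the outlier at 100 does not displace the L1-norm center
```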
Example 2.15 Consider the same situation as in the previous example, but let us now be interested in the L∞-norm case. Let χ_min and χ_max be the minimum and the maximum values of χ for which f(χ) ≠ 0 . It can be shown that the circumcenter of the probability distribution is the point χ_∞ that separates the interval {χ_min, χ_max} into two intervals of equal length, i.e., satisfying the condition

    D(χ_∞, χ_min) = D(χ_max, χ_∞) ,    (2.252)

and that the circumradius is

    σ_∞ = \frac{D(χ_max, χ_min)}{2} .    (2.253)

[End of example.]
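The defining condition 2.252 can be solved numerically by bisection; a sketch for a 1-D support, assuming the plain distance |b − a| (all names illustrative):

```python
def circumcenter(support_min, support_max, dist=lambda a, b: abs(b - a)):
    """Circumcenter and circumradius of equations 2.252-2.253: bisection on
    the condition D(c, support_min) = D(support_max, c)."""
    lo, hi = support_min, support_max
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if dist(support_min, mid) < dist(mid, support_max):
            lo = mid          # c is still left of the true circumcenter
        else:
            hi = mid
    return mid, 0.5 * dist(support_min, support_max)

c, r = circumcenter(2.0, 10.0)   # c -> 6.0, r -> 4.0
```

With a different (e.g., logarithmic) distance passed as `dist`, the same bisection returns the circumcenter in that metric.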
Example 2.16 Consider, in the Euclidean n-dimensional space E_n , with Cartesian coordinates x = {x^1, . . . , x^n} , a normalized volumetric probability f(x) , and let us be interested in the L2-norm case. For p = 2 , equation 2.239 can be written, using obvious notations,

    σ_2(y)^2 = \int dx \, f(x) \, \| x − y \|^2 .    (2.254)

Let x_2 denote the mean of the probability distribution, i.e., the point where σ_2(y) is minimum (or, equivalently, where σ_2(y)^2 is minimum). The condition of minimum (the vanishing of the derivatives) gives \int dx \, f(x) \, (x − x_2) = 0 , i.e.,

    x_2 = \int dx \, f(x) \, x ,    (2.255)

which is an elementary definition of the mean. The variance of the probability distribution is then

    (σ_2)^2 = \int dx \, f(x) \, \| x − x_2 \|^2 .    (2.256)

In the context of this example, we can define the covariance tensor

    C = \int dx \, f(x) \, (x − x_2) ⊗ (x − x_2) .    (2.257)

Note that equation 2.255 and equation 2.257 can be written, using indices, as

    x_2^i = \int dx^1 ∧ ··· ∧ dx^n \, f(x^1, . . . , x^n) \, x^i ,    (2.258)

and

    C^{ij} = \int dx^1 ∧ ··· ∧ dx^n \, f(x^1, . . . , x^n) \, (x^i − x_2^i)(x^j − x_2^j) .    (2.259)

[End of example.]

2.8 Appendixes

2.8.1 Appendix: Conditional Probability Density

Note to the reader: this section can be skipped, unless one is particularly interested in probability densities.
In view of equation 2.100, the conditional probability density (over the submanifold X p ) is
to be deﬁned as
    \bar f_p(r) = \bar g_p(r) \, f_p(r) ,    (2.260)

i.e.,

    \bar f_p(r) = η_r \sqrt{\det g_p(r)} \, f_p(r) ,    (2.261)

so the probability of a region A_p of the submanifold is given by

    P(A_p) = \int_{r∈X_p} d\bar v_p(r) \, \bar f_p(r) ,    (2.262)

where d\bar v_p(r) = dr^1 ∧ ··· ∧ dr^p .

We must now express \bar f_p(r) in terms of \bar f(r, s) . First, from equations 2.95 and 2.261 we obtain

    \bar f_p(r) = η_r \sqrt{\det g_p(r)} \; \frac{ f(r, s(r)) }{ \int_{r∈X_p} dv_p(r) \, f(r, s(r)) } .    (2.263)

As f(r, s) = \bar f(r, s) / ( η \sqrt{\det g} ) (equation 2.54),

    \bar f_p(r) = η_r \sqrt{\det g_p(r)} \; \frac{ \bar f(r, s(r)) / \sqrt{\det g} }{ \int_{r∈X_p} dv_p(r) \, \bar f(r, s(r)) / \sqrt{\det g} } .    (2.264)

Finally, using 2.97, and expliciting g_p(r) ,

    \bar f_p(r) = \frac{ \dfrac{ \sqrt{\det( g_{rr} + g_{rs} S + S^T g_{sr} + S^T g_{ss} S )} }{ \sqrt{\det g} } \, \bar f(r, s(r)) }{ \int_{r∈X_p} dr^1 ∧ ··· ∧ dr^p \; \dfrac{ \sqrt{\det( g_{rr} + g_{rs} S + S^T g_{sr} + S^T g_{ss} S )} }{ \sqrt{\det g} } \, \bar f(r, s(r)) } .    (2.265)

Again, it is understood here that all the 'matrices' are taken at the point ( r, s(r) ) .
This expression does not coincide with the conditional probability defined in usual texts (even when the manifold is defined by the condition s = s_0 = const. ). This is because we
contemplate here the ‘metric’ or ‘orthogonal’ limit to the manifold (in the sense of ﬁgure 2.9),
while usual texts just consider the ‘vertical limit’. Of course, I take this approach here because
I think it is essential for consistent applications of the notion of conditional probability. The
best known expression of this problem is the so called ‘Borel Paradox’ that we analyze in
section 2.8.10.

Example 2.17 If we face the case where the space X is the Cartesian product of two spaces R × S , with g_{uv} = g_{vu} = 0 , g_{rr} = g_r(r) and g_{ss} = g_s(s) , then det g(r, s) = det g_r(r) det g_s(s) , and the conditional probability density of equation 2.265 becomes

    \bar f_p(r) = \frac{ \dfrac{ \sqrt{\det( g_r(r) + S^T(r) \, g_s(s(r)) \, S(r) )} }{ \sqrt{\det g_r(r)} \sqrt{\det g_s(s(r))} } \, \bar f(r, s(r)) }{ \int_{r∈X_p} dr^1 ∧ ··· ∧ dr^p \; \dfrac{ \sqrt{\det( g_r(r) + S^T(r) \, g_s(s(r)) \, S(r) )} }{ \sqrt{\det g_r(r)} \sqrt{\det g_s(s(r))} } \, \bar f(r, s(r)) } .    (2.266)
Example 2.18 If, in addition to the condition of the previous example, the hyperfurface is
deﬁned by a constant value of s , say s = s0 , then, the probability density becomes
f p (r) = f (r, s0 )
dr1 ∧ · · · ∧ drp f (r, s0 )
r∈X p . (2.267) [End of example.]
Example 2.19 In the situation of the previous example, let us rewrite equation 2.267 dropping
the index 0 from s0 , and use the notations
f rs (rs) = f (r, s)
f s (s) , ; f s (s) = r∈X p dr1 ∧ · · · ∧ drp f (r, s) . (2.268) We could redo all the computations to deﬁne the conditional for s , given a ﬁxed value v , but
it is clear by simple analogy that we obtain, in this case,
f sr (sr) = f (r, s)
f r (r) , ; f r (r) = r∈X q ds1 ∧ · · · ∧ dsq f (r, s) . (2.269) Solving in these two equations for f (r, s) gives the ‘Bayes theorem’
f sr (sr) = f rs (rs) f s (s)
f r (r) . (2.270) Note that this theorem is valid only if we work in the Cartesian product of two spaces. In
particular, we must have gss (r, s) = gs (s) . Working, for instance, at the surface of the
sphere with geographical coordinates (r, s) = (r, s) = (ϕ, λ) this condition is not fulﬁlled, as
gϕ = cos λ is a function of λ : the surface of the sphere is not the Cartesian product of two
1D spaces. A we shall later see, this enters in the discussion of the socalled ‘Borel paradox’
(there is no paradox, if we do things properly). [End of example.] 122 2.8 2.8.2 Appendix: Marginal Probability Density In the context of section 2.5.2, where a manifold X is built through the Cartesian product
R × S of two manifolds, and given a ‘joint’ volumetric probability f (r, s) , the marginal
volumetric probabily fr (r) is deﬁned as (see equation 2.160)
fr (r) = dvs (s) f (r, s) . (2.271) s∈S Let us ﬁnd the equivalent expression using probability densities instead of volumetric probabilities.
Here below, following our usual conventions, the following notations
g (r, s)) = det g(r, s) ; g r (r) = det gr (r) ; g s (s) = det gs (s) (2.272) are introduced. First, we may use the relation
f (r, s)
g (r, s) f (r, s) = (2.273) linking the volumetric probability f (r, s) and the probability density f (r, s) . Here, g is the
metric of the manifold X , that has been assumed to have a partitioned form (equation 2.123).
Then, f (r, s) = f (r, s) / ( g r (r) g s (s) ) , and equation 2.271 becomes
fr (r) = 1
g r (r) dvs (s) s∈S f (r, s)
g s (s) . (2.274) As the volume element dvs (s) is related to the capacity element dv s (s) = ds1 ∧ ds2 ∧ . . . via
the relation
dvs (s) = g s (s) dv s (s) , (2.275) we can write
fr (r) = 1
g r (r) dv s (s) f (r, s) , (2.276) dv s (s) f (r, s) . (2.277) s∈S i.e.,
g r (r) fr (r) = s∈S We recognize, at the lefthand side, the usual deﬁntion of a probability density as the
product of a volumetric probability by the volume density, so we can introduce the marginal
probability density
f r (r) = g r (r) fr (r) . (2.278) Then, equation 2.277 becomes
f r (r) = dv s (s) f (r, s) , (2.279) s∈S expression that could be taken as a direct deﬁnition of the marginal probability density f r (r)
in terms of the ‘joint’ probability density f (r, s) .
Note that this expression is formally identical to 2.271. This contrasts with the expression
of a conditional probability density (equation 2.265) that is formally very diﬀerent from the
expression of a conditional volumetric probability (equation 2.95). Appendixes 2.8.3 123 Appendix: Replacement Gymnastics In an ndimensional manifold with coordinates x , the volume element dvx (x) , is related to the
the capacity element dv x (x) = dx1 ∧ · · · ∧ dxn via the volume density g x (x) = det gx (x) ,
dvx (x) = g x (x) dv x (x) , (2.280) while the relation between a volumetric probability fx (x) and the associated probability
density f x (x) is
f x (x) = g x (x) fx (x) .
In a change of variables x (2.281) y , while the capacity element changes according to
dv x (x) = X (y) dv y (y) , (2.282) where the Jacobian determinant X is the determinant of the matrix {X i j } = {∂xi /∂y j } ,
the probability density changes according to
f x (x) = 1
f (y) .
X (y) y (2.283) In the variables y , the relation between a volumetric probability fy (y) and the associated
probability density f y (y) is
f y (y) = g y (y) fy (y) , (2.284) where g y (y) = det gy (y) is the volume density in the coordinates y . Finally, the volume
element dvy (y) , is related to the the capacity element dv y (y) = dy 1 ∧ · · · ∧ dy n through
dvy (y) = g y (y) dv y (y) . (2.285) Using these relations in turn, we can obtain the following circle of equivalent equations:
P (A) = P∈A dV (P) f (P) = x∈A =
x∈A =
x∈A =
y∈A =
y∈A =
y∈A =
y∈A dvx (x) fx (x)
dv x (x) g x (x) fx (x)
dv x (x) f x (x)
X (y) dv y 1
f (y)
X (y) y (2.286) dv y (y) f y (y)
dv y (y) g y (y) fy (y)
dvy (y) fy (y) = P∈A dV (P) f (P) = P (A) . Each one of them may be useful in diﬀerent circumstances. The student should be able to
easily move from one equation to the next. 124 2.8 Example 2.20 In the example Cartesiangeographical, the equations above give, respectively
(using the index r for the geographical coordinates),
dvx (x, y, z ) = dx ∧ dy ∧ dz (2.287) f x (x, y, z ) = fx (x, y, z ) (2.288) dx ∧ dy ∧ dz = r2 cos λ dr ∧ dϕ ∧ dλ (2.289) 1
f (r, ϕ, λ)
cos λ r (2.290) f r (r, ϕ, λ) = r2 cos λ fr (r, ϕ, λ) (2.291) f x (x, y, z ) = r2 dvr (r, ϕ, λ) = r2 cos λ dr ∧ dϕ ∧ dλ , (2.292) to obtain the circle of equations,
P (A) = dV (P) f (P) = P∈A = dvx (x, y, z )
{x,y,z }∈A fx (x, y, z ) dx ∧ dy ∧ dz fx (x, y, z ) {x,y,z }∈A = dx ∧ dy ∧ dz f x (x, y, z ) {x,y,z }∈A = r2 cos λ dr ∧ dϕ ∧ dλ {r,ϕ,λ}∈A = r2 1
f (r, ϕ, λ)
cos λ r dr ∧ dϕ ∧ dλ f r (r, ϕ, λ) {r,ϕ,λ}∈A = dr ∧ dϕ ∧ dλ r2 cos λ fr (r, ϕ, λ) {r,ϕ,λ}∈A = dvr (r, ϕ, λ)
{r,ϕ,λ}∈A fr (r, ϕ, λ) = dV (P) f (P) = P (A) . P∈A (2.293)
Note that the Cartesian system of coordinates is special: scalar densities, scalar capacities and
invariant scalars coincide. [End of example.] Appendixes 125 2.8.4 Appendix: The Gaussian Probability Distribution 2.8.4.1 One Dimensional Spaces Let X by a onedimensional metric line with points P , Q . . . , and let s(Q, P) denote the
displacement from point P to point Q , the distance or ‘length’ between the two points being
the absolute value of the displacement, L(Q, P) = L(Q, P) =  s(Q, P)  . Given any particular
point P on the line, it is assumed that the line extends to inﬁnite distances from P in the
two senses. The onedimensional Gaussian probability distribution is deﬁned by the volumetric
probability
f (P; P0 ; σ ) = √ 12
1
exp −
s (P, P0 )
2 σ2
2π σ , (2.294) and it follows from the general deﬁnition of volumetric probability, that the probability of the
interval between any two points P1 and P2 is
P2 P=
P1 dL(P) f (P; P0 ; σ ) , (2.295) where dL denotes the elementary length element. The following properties are easy to demonstrate:
• the probability of the whole line equals one (i.e., the volumetric probability f (P; P0 ; σ )
is normalized);
• the mean of f (P; P0 ; σ ) is the point P0 ;
• the standard deviation of f (P; P0 ; σ ) equals σ .
Example 2.21 Consider a coordinate X such that the displacement between two points is
sX (X , X ) = log(X /X ) . Then, the Gaussian distribution 2.294 takes the form
fX (X ; X0 , σ ) = √ 1
1
exp − 2
2σ
2π σ log X
X0 2 , (2.296) where X0 is the mean and σ the standard deviation. As, here, ds(X ) = dX/X , the
probability of an interval is
P (X1 ≤ X ≤ X2 ) = X2
X1 dX
fX (X ; X0 , σ ) ,
X (2.297) and we have the normalization
∞
0 dX
fX (X ; X0 , σ ) = 1 .
X (2.298) This expression of the Gaussian probability distribution, written in terms on the variable X ,
is called the lognormal law. I suggest that the information on the parameter X represented
by the volumetric probability 2.296 should be expressed by a notation like14
log
14 X
= ±σ
X0 , (2.299)
· Equivalently, one may write X = X0 exp(±σ ) , or X = X0 ÷ Σ , where Σ = exp σ . 126 2.8 that is the exact equivalent of the notation used in equation 2.303 below. Deﬁning the diﬀerence
δX = X −X0 one converts this equation into log (1 + δX/X0 ) , whose ﬁrst order approximation
is δX/X0 = ±σ . This shows that σ corresponds to what is usually called the ‘relative
uncertainty’. I do not recommend this terminology, as, with the deﬁnitions used in this book
(see section 2.7), σ is the actual standard deviation of the quantity X . [End of example.]
Exercise: write the equivalent of the three expressions 2.296–2.298 using, instead of the
variable X , the variables U = 1/X or Y = X n .
Example 2.22 Consider a coordinate x such that the displacement between two points is
sx (x , x) = x − x . Then, the Gaussian distribution 2.294 takes the form
fx (x; x0 , σ ) = √ 1
1
exp − 2 (x − x0 )2
2σ
2π σ , (2.300) where x0 is the mean and σ the standard deviation. As, here, ds(x) = dx , the probability
of an interval is
x2 P (x1 ≤ x ≤ x2 ) = dx fx (x; x0 , σ ) , (2.301) x1 and we have the normalization
+∞
−∞ dx fx (x; x0 , σ ) = 1 . (2.302) This expression of the Gaussian probability distribution, written in terms on the variable x ,
is called the normal law. The information on the parameter x represented by the volumetric
probability 2.300 is commonly expressed by a notation like15
x = x0 ± σ . (2.303) [End of example.]
Example 2.23 It is easy to verify that through the change of variable
x = log X
K , (2.304) where K is an arbitrary constant, the equations of the example 2.21 become those of the
example 2.22, and viceversa. In this case, the quantity x has no physical dimensions (this
is, of course, a possibility, but not a necessity, for the quantity x in example 2.22). [End of
example.]
The Gaussian probability distribution is represented in ﬁgure 2.20. Note that there is no
need to make diﬀerent plots for the normal and the lognormal volumetric probabilities. When
one is interested in a wide range of values, a logarithmic version of the vertical axis may be
necessary (see ﬁgure 2.21).
More concise notations are also used. As an example, the expression x = 1 234.567 89 m ± 0.000 11 m
(here, ‘m’ represents the physical unit ‘meter’) is sometimes written x = ( 1 234.567 89 ± 0.000 11 ) m or even
x = 1 234.567 89(11) m .
15 Appendixes 127 Figure 2.20: A representation of the Gaussian probability distribution, where the example of a temperature T
is used. Reading the scale at the top, we associate to
each value of the temperature T the value h(T ) of
a lognormal volumetric probability. Reading the scale
at the bottom, we associate to every value of the logarithmic temperature t the value g (t) of a normal
volumetric probability. There is no need to make a
special plot where the lognormal volumetric probability
h(T ) would not be represented ‘in a logarithmic axis’,
as this strongly distorts the beautiful Gaussian bell (see
ﬁgures 2.22 and 2.23). In the ﬁgure represented here,
one standard deviation corresponds to one unit of t , so
the whole range represented equals 8 σ . T
104K 4 3 2 1 0 1 2 104K 3 4 t = log10(T/T0) ; T0 = 1K 0
5
10
15
20
25
5 10 5 0 10 4
Probability Density 4
Volumetric Probability 102K 1K t Figure 2.21: A representation of the normal volumetric probability using a logarithmic vertical axis (here, a
base 10 logarithm of the volumetric probability, relative
to its maximum value). While the representation in ﬁgure 2.20 is not practical is one is interested in values of
t outside the interval with endpoints situated at ±3σ
of the center, this representation allows the examination
of the statistics concerning as many decades as we may
wish. Here, the whole range represented equals more
than 20 standard deviations. Figure 2.22: Left: the lognormal volumetric probability h(X ) . Right: the
lognormal probability density h(X ) . Distributions
centered at 1, with standard deviations respectively
equal to 0.1, 0.2, 0.4, 0.8,
1.6 and 3.2 . 102K 3 2 1 0 3 2 1 0
0 0.5 1 1.5 2 2.5 3 0 0.5 1 1.5 2 2.5 3 128 2.8 Figure 2.23 gives the interpretation of these functions in terms of histograms. By deﬁnition
of volumetric probability, an histogram should be made dividing the interval under study in
segments of same length ds(X ) = dX/Y , as opposed to the deﬁnition of probability density,
where the interval should be divided in segments of equal ‘variable increment’ dX . We clearly
see, at the right of the ﬁgure the impracticality of making the histogram corresponding to the
probability density: while the right part of the histogram oversamples the variable, the left
part undersamples it. The histogram suggested at the left samples the variable homogeneously,
but this only means that we are using constant steps of the logarithmic quantity x associated
to the positive quantity X . Better, then, to directly use the representation suggested in
ﬁgure 2.20 or in ﬁgure 2.21. We have then a double conclusion: (i) the lognormal probability
density (at the right in ﬁgures 2.22 and 2.23) does not correspond to any practical histogram; it
is generally uninteresting. (ii) the lognormal volumetric probability (at the left in ﬁgures 2.22
and 2.23) does correspond to a practical histogram, but is better handled when the associated
normal volumetric probability is used instead (ﬁgure 2.20 or ﬁgure 2.21). In short: lognormal
functions should never be used. 0.7
Probability Density 0.35
Volumetric Probability Figure 2.23:
A typical
Gaussian distribution, with
central point 1 and standard
deviation 5/4, represented
here, using a Jeﬀreys (positive) quantity, by the lognormal volumetric probability (left) and the lognormal
probability density (right). 0.3
0.25
0.2
0.15
0.1
0.05
0 0.6
0.5
0.4
0.3
0.2
0.1
0 0 2 4 6 8 10 0 2 4 6 8 10 Appendixes
2.8.4.2 129 Multi Dimensional Spaces In dimension grater than one, the spaces may have curvature. But the multidimensional Gaussian distribution is useful in ﬂat spaces (i.e., Euclidean spaces) only. Although it is possible to
give general expressions for arbitrary coordinate systems, let us simplify the exposition, and
assume that we are using rectilinear coordinates. The squared distance between points x1 and
x2 is then given by the ‘sum of squares’
D2 (x2 , x1 ) = (x2 − x1 )t g (x2 − x1 ) , (2.305) where the metric tensor g is constant (because the assumption of an Euclidean space with
rectlinear coordinates). The volume element is, then,
dv (x) = det g dx1 ∧ · · · ∧ dxn , (2.306) √
where, again,
det g is a constant.
Let f (x) be a volumetric probability over the space. By deﬁnition, the probability of a
region A is
P (A) = dv (x) f (x) , (2.307) dx1 ∧ · · · ∧ dxn f (x) . (2.308) A i.e.,
P (A) = det g The multidimensional Gaussian volumetric probability is
√
1
1
det G
√
f (x) =
exp − (x − x0 )t G (x − x0 )
n/2
(2π )
2
det g . (2.309) The following properties are slight generalizations of well known results concerning the multidimensuonal Gaussian:
• f (x) is normed, i.e., dv (x) f (x) = 1 ; • the mean of f (x) is x0 ;
• the covariance matrix of f (x) is16 C = G−1 .
Note that when in an Euclidean space with metric g one deﬁnes a Gaussian with covariance
C , one may use the inverse of the covariance matrix, G = C−1 , as a supplementary metric
over the space. 16 Remember that the general deﬁnition of covariance gives here C ij =
this property is not as obvious as it may seem. dv (x)(xi − xi )(xj − xj ) f (x) , so
0
0 130 2.8 2.8.5 Appendix: The Laplacian Probability Distribution 2.8.5.1 Appendix: One Dimensional Laplacian Let X by a metric manifold with points P , Q . . . , and let s(P, Q) = s(Q, P) denote the
distance netween two points P and Q . The Gaussian probability distribution is represented
by the volumetric probability
f (P) = k exp −
[Note: Elaborate this.] 1
s(P, Q)
σ . (2.310) Appendixes 2.8.6 131 Appendix: Exponential Distribution Note: I have to verify that the following terminology has been introduced. By s(P, P0 ) we
denote the geodesic line arriving at P , with origin at P0 . If the space is 1D, we write
s(P, P0 ) , and call this a displacement . Then, the distance is D(P, P0 ) = s(P, P0 ) . In an nD
Euclidean space, using Cartesian coordinates, we write s(x, x0 ) = x − x0 , and call this the
displacement vector .
2.8.6.1 Deﬁnition Consider a onedimensional space, and denote s(Q, P) , the displacement from point P to
point Q . The exponential distribution has the (1D) volumetric probability
f (P; P0 ) = α exp − α s(P, P0 ) α≥0 , ; (2.311) where P0 is some ﬁxed point. This volumetric probability is normed via ds(P) f (P, P0 ) = 1 ,
where the sum concerns the halfinterval at the right or at the left of point P0 , depending on
the orientation chosen (see examples 2.24 and 2.25).
Example 2.24 Consider a coordinate X such that the displacement between two points is
sX (X , X ) = log(X /X ) . Then, the exponential distribution 2.311 takes the form fX (X ; X0 ) =
k exp (−α log(X/X0 )) , i.e.,
fX (X ) = α X
X0 −α α≥0 . ; (2.312) As, here, ds(X ) = dX/X , the probability of an interval is P (X1 ≤ X ≤ X2 ) =
The volumetric probability fX (X ) has been normed using
∞
X0 dX
fX (X ) = 1 .
X X2 dX
f (X )
X1 X X . (2.313) This form of the exponential distribution is usually called the Pareto law. The cumulative
probability function is
X gX (X ) =
X0 dX
fX (X ) = 1 −
X X
X0 −α . (2.314) It is negative for X < X0 , zero for X = X0 , and positive for X > X0 . The power α of
the ‘power law’ 2.312 may be any real number, but it most examples concerning the physical,
biological or economical sciences, it is of the form α = p/q , with p and q being small
positive integers17 . With a variable U = 1/X , equation 2.317 becomes
fU (U ) = k U α
17 ; α≥0 , (2.315) In most problems, the variables seem to be chosen in such a way that α = 2/3 . This is the case for the
probability distributions of Earthquakes as a function of their energy (GutenbergRichter law, see ﬁgure 2.25),
or of the probability distribution of meteorites hitting the Earth as a function of their volume (see ﬁgure 2.28). 132 2.8
U2 dU
U1 U the probability on an interval is P (U1 ≤ U ≤ U2 ) =
U0 dU
U
0 fU (U ) , and one typically uses the norming condition
fU (U ) = 1 , where U0 is some selected point. Using a variable
n
Y = X , one arrives at the volumetric probability
fY (Y ) = k Y −β ; β= α
≥0 .
n (2.316) Example 2.25 Consider a coordinate x such that the displacement between two points is
sx (x , x) = x − x . Then, the exponential distribution 2.311 takes the form
fx (x) = α exp (−α (x − x0 )) ; α≥0 . As, here, ds(s) = ds , the probability of an interval is P (x1 ≤ x ≤ x2 ) =
fx (x) is normed by (2.317)
x2
x1 dxfx (x) , and +∞ dx fx (x) = 1 . (2.318) x0 With a variable u = −x , equation 2.317 becomes
fu (u) = α exp (α (u − u0 ))
u ; α≥0 , (2.319) 0
and the norming condition is −∞ du fu (u) = 1 . For the plotting of these volumetric
probabilities, sometimes a logarithmic ‘vertical axis’ is used, as suggested in ﬁgure 2.24. Note
that via a logarithmic change of variables x = log(X/K ) (where K is some constant) this
example is identical to the example 2.24. The two volumetric probabilities 2.312 and 2.317
represent the same exponential distribution. Note: mention here ﬁgure 2.24. Appendixes 133 2 1.5 α= 1
1.5 α= 0
α = 1/4
α = 1/2
α= 1
α= 2 1 0.5 α = 1/2
α = 1/4 1 α= 0 0.5 0 0
0 0.5 1 1.5 2 0 0.5 1 X f f
1.5 α = 1/2 1.5 α = 1/4 α= 0
= 1/4
α = 1/2
α= 1
α= 2 0.5 2 α= 2
α= 1 2 1 1.5 U 2 1 α= 0 0.5 0 0
1 0.5 0 0.5 1 1 x = log X/X0 0.5 0 log f/f0 1 0.5 α= 0 0 α = 1/4
α = 1/2 0.5 α= 2
1 0.5 0 0.5 x = log X/X0 1 α= 2
α= 1 0.5 α = 1/2
α = 1/4 0 α= 0 0.5 α= 1 1 0.5 u = log U/U0 1 log f/f0 Figure 2.24: Plots of exponential
distribution for diﬀerent deﬁnitions
of the variables. Top: The power
functions fX (X ) = 1/X −α , and
fU (U ) = 1/U α . Middle: Using logarithmic variables x and u , one has
the exponential functions fx (x) =
exp(−α x) and fu (u) = exp(α u) .
Bottom: the ordinate is also represented using a logarithmic variable,
this giving the typical loglog linear
functions. α= 2 2 f f 1
1 1 0.5 0 0.5 u = log U/U0 1 134 2.8 2.8.6.2 Example: Distribution of Earthquakes The historically ﬁrst example of power law distribution is the distribution of energies of Earthquakes (the famous GutenbergRichter law).
An earthquake can be characterized by the seismic energy generated, E , or by the moment corresponding to the dislocation, that I denote here¹⁸ M . As a rough approximation, the moment is given by the product M = ν ∆u S , where ν is the elastic shear modulus of the medium, ∆u the average displacement between the two sides of the fault, and S the fault's surface (Aki and Richards, 1980).

Figure 2.25 shows the distribution of earthquakes in the Earth. As the same logarithmic base (of 10) has been chosen on both axes, the slope of the line approximating the histogram (which is quite close to 2/3) directly leads to the power of the power law (Pareto) distribution. The volumetric probability f(M) representing the distribution of earthquakes in the Earth is

f(M) = k / M^{2/3} ,   (2.320)

where k is a constant. Kanamori (1977) pointed out that the moment and the seismic energy liberated are roughly proportional: M ≈ 2.0 × 10⁴ E (energy and moment have the same physical dimensions). This implies that the volumetric probability as a function of the energy has the same form as for the moment:

g(E) = k / E^{2/3} .   (2.321)

¹⁸ It is traditionally denoted M0.

Figure 2.25: Histogram of the number of earthquakes (in base 10 logarithmic scale) recorded by the global seismological networks in a period of xxx years, as a function of the logarithmic seismic moment (adapted from Lay and Wallace, 1995). More precisely, the quantity on the horizontal axis is µ = log10(M/MK) , where M is the seismic moment, and MK = 10⁻⁷ J = 1 erg is a constant, whose value is arbitrarily taken equal to the unit of moment (and of energy) in the cgs system of units. [Note: Ask for the permission to publish this figure.]

2.8.6.3 Example: Size of oil fields

Note: mention here figure 2.27.

Figure 2.27: Histogram of the sizes of oil fields in a region of Texas. The horizontal axis corresponds, with a logarithmic scale, to the 'millions of Barrels of Oil Equivalent' (mmBOE). Extracted from chapter 2 (The fractal size and spatial distribution of hydrocarbon accumulation, by Christopher C. Barton and Christopher H. Scholz) of the book "Fractals in petroleum geology and Earth processes", edited by Christopher C. Barton and Paul R. La Pointe, Plenum Press, New York and London, 1995. [Note: ask for the permission to publish this figure.] The slope of the straight line is 2/3, comparable to the value found with the data of Wessel & Smith (figure 2.26).

2.8.6.4 Example: Shapes at the Surface of the Earth

Note: mention here figure 2.26.

Figure 2.26: Wessel and Smith (1996) have compiled high-resolution shoreline data, and have processed it to suppress erratic points and crossing segments. The shorelines are closed polygons, and they are classified in 4 levels: ocean boundaries, lake boundaries, island-in-lake boundaries and pond-in-island-in-lake boundaries. The 180,496 polygons they encountered had the size distribution shown at the right (the approximate numbers are in the quoted paper; the exact numbers were kindly sent to me by Wessel). A line of slope 2/3 is suggested in the figure. [Horizontal axis: log10(S/S0) , S0 = 1 km².]
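The slope-reading procedure described for figure 2.25 can be imitated on synthetic data (an illustrative sketch, not the actual earthquake catalogue; the names a, M1, Ms are mine): draw values with survival function (M/M1)^(−a), histogram them in equal bins of log10 M, and fit a straight line; its slope recovers −a ≈ −2/3.

```python
import math
import random

random.seed(1)
a = 2.0 / 3.0          # Pareto exponent suggested by the histogram slope
M1 = 1.0               # lower cutoff (arbitrary units)

# Sample moments with survival function P(M > m) = (m/M1)^(-a)
Ms = [M1 * (1.0 - random.random()) ** (-1.0 / a) for _ in range(200_000)]

# Histogram in equal one-decade bins of mu = log10(M/M1), as in figure 2.25
nbins, width = 6, 1.0
counts = [0] * nbins
for M in Ms:
    b = int(math.log10(M / M1) / width)
    if 0 <= b < nbins:
        counts[b] += 1

# Least-squares slope of log10(count) versus bin centre: should be close to -a
xs = [(b + 0.5) * width for b in range(nbins)]
ys = [math.log10(c) for c in counts]
n = float(nbins)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
print(round(slope, 3))
```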
2.8.6.5 Example: Meteorites

Note: mention here figure 2.28.

Figure 2.28: The approximate number of meteorites falling on Earth every year is distributed as follows: 10¹² meteorites with a diameter of 10⁻³ mm, 10⁶ with a diameter of 1 mm, 1 with a diameter of 1 m, 10⁻⁴ with a diameter of 100 m, and 10⁻⁸ with a diameter of 10 km. The statement is loose, and I have extracted it from the general press. It is nevertheless clear that a log-log plot of this 'histogram' gives a linear trend with a slope equal to 2. Rather, transforming the diameter D into the volume V = D³ (which is proportional to the mass) gives the 'histogram' at the right, with a slope of 2/3. [Vertical axis: log10(number every year); horizontal axis: log10(V/V0) , V0 = 1 m³.]

2.8.7 Appendix: Spherical Gaussian Distribution

The simplest probability distributions over the circle and over the surface of the sphere are the
von Mises and the Fisher probability distributions, respectively.
2.8.7.1 The von Mises Distribution

As already mentioned in example 2.5, and demonstrated in section 2.8.7.3 here below, the conditional volumetric probability induced over the unit circle by a 2D Gaussian is

f(λ) = k exp( sin λ / σ² ) .   (2.322)

The constant k is to be fixed by the normalization condition ∫_0^{2π} dϕ f(ϕ) = 1 , this giving

k = 1 / ( 2π I0(1/σ²) ) ,   (2.323)

where I0( · ) is the modified Bessel function of order zero.
corresponds to the intersection of a 2D Gaussian
by a circle passing by the center of the Gaussian.
Here, the unit circle has been represented, and two
Gaussians with standard deviations σ = 1 (left)
and σ = 1/2 (right) . In fact, this is my preferred representation of the von Mises distribution,
rather than the conventional functional display of
ﬁgure 2.30. ϑ ϑ 0.8 Figure 2.30: The circular (von Mises) distribution,
drawn for two full periods, centered at zero, and with
√
√
values of σ equal to 2 , 2 , 1 , 1/ 2 , 1/2 (from
smooth to sharp). 0.6
0.4
0.2
0 6
−π/2 2.8.7.2 4 0 2 0
+π/2 2 4 6 The Fisher Probability Distribution Note: mention here Fisher (1953).
As already mentioned in example 2.5, and demonstrated in section 2.8.7.3 here below, the conditional volumetric probability induced over the surface of a sphere by a 3D Gaussian is, using geographical coordinates,

f(ϕ, λ) = k exp( sin λ / σ² ) .   (2.324)

We can normalize this volumetric probability by

∫ dS(ϕ, λ) f(ϕ, λ) = 1 ,   (2.325)

with dS(ϕ, λ) = cos λ dϕ dλ . This gives

k = 1 / ( 4π χ(1/σ²) ) ,   (2.326)

where

χ(x) = sinh(x) / x .   (2.327)
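These constants are easy to verify numerically (an illustrative sketch, not part of the original text): the integral of exp(sin λ/σ²) over the sphere, with the surface element cos λ dϕ dλ, must equal 4π χ(1/σ²).

```python
import math

checks = []
for sigma in (1.0, 0.5):
    x = 1.0 / sigma ** 2
    # Midpoint rule over latitude; the longitude integral contributes a factor 2*pi
    N = 20000
    h = math.pi / N
    integral = 0.0
    for j in range(N):
        lam = -math.pi / 2.0 + (j + 0.5) * h
        integral += math.exp(x * math.sin(lam)) * math.cos(lam) * h
    integral *= 2.0 * math.pi
    expected = 4.0 * math.pi * math.sinh(x) / x   # = 4*pi*chi(1/sigma^2)
    checks.append(abs(integral - expected))
    print(sigma, integral, expected)
```

(In fact the latitude integral is elementary: substituting t = sin λ gives σ²(e^{1/σ²} − e^{−1/σ²}) = 2σ² sinh(1/σ²), which is how the χ function arises.)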
2.8.7.3 Appendix: Fisher from Gaussian (Demonstration)

Let us demonstrate here that the Fisher probability distribution is obtained as the conditional of a Gaussian probability distribution over a sphere. As the demonstration is independent of the dimension of the space, let us take a space with n dimensions, where the (generalized) geographical coordinates are

x1 = r cos λ cos λ2 cos λ3 cos λ4 · · · cos λn−2 cos λn−1
x2 = r cos λ cos λ2 cos λ3 cos λ4 · · · cos λn−2 sin λn−1
· · ·
xn−2 = r cos λ cos λ2 sin λ3
xn−1 = r cos λ sin λ2
xn = r sin λ .   (2.328)

We shall consider the unit sphere at the origin, and an isotropic Gaussian probability distribution with standard deviation σ , with its center along the xn axis, at position xn = 1 . The Gaussian volumetric probability, when expressed as a function of the Cartesian coordinates, is

f_x(x1, ..., xn) = k exp( − [ (x1)² + (x2)² + · · · + (xn−1)² + (xn − 1)² ] / (2σ²) ) .   (2.329)

As the volumetric probability is an invariant, to express it using the geographical coordinates we just need to use the replacements 2.328, to obtain

f_r(r, λ, λ2, ...) = k exp( − [ r² cos² λ + (r sin λ − 1)² ] / (2σ²) ) ,   (2.330)

i.e.,

f_r(r, λ, λ2, ...) = k exp( − [ r² + 1 − 2 r sin λ ] / (2σ²) ) .   (2.331)

The condition to be on the sphere is just

r = 1 ,   (2.332)

so that the conditional volumetric probability, as given in equation 2.95, is just obtained (up to a multiplicative constant) by setting r = 1 in equation 2.331,

f(λ, λ2, ...) = k exp( (sin λ − 1) / σ² ) ,   (2.333)

i.e., absorbing the constant factor exp(−1/σ²) into k,

f(λ, λ2, ...) = k exp( sin λ / σ² ) .   (2.334)

This volumetric probability corresponds to the n-dimensional version of the Fisher distribution. Its expression is identical in all dimensions; only the norming constant depends on the dimension of the space.

2.8.8 Appendix: Probability Distributions for Tensors

In this appendix we consider a symmetric second rank tensor, like the stress tensor σ of
continuum mechanics.
A symmetric tensor, σij = σji , has only six degrees of freedom, while it has nine components. It is important, for the development that follows, to agree on a proper definition of a set of 'independent components'. This can be done, for instance, by defining the following six-dimensional basis for symmetric tensors:

e1 = [ 1 0 0 ; 0 0 0 ; 0 0 0 ] ;   e2 = [ 0 0 0 ; 0 1 0 ; 0 0 0 ] ;   e3 = [ 0 0 0 ; 0 0 0 ; 0 0 1 ] ;   (2.335)

e4 = (1/√2) [ 0 0 0 ; 0 0 1 ; 0 1 0 ] ;   e5 = (1/√2) [ 0 0 1 ; 0 0 0 ; 1 0 0 ] ;   e6 = (1/√2) [ 0 1 0 ; 1 0 0 ; 0 0 0 ] .   (2.336)

Then, any symmetric tensor can be written as
σ = s^α e_α ,   (2.337)

and the six values s^α are the six 'independent components' of the tensor, in terms of which the tensor writes

σ = [ s1  s6/√2  s5/√2 ;  s6/√2  s2  s4/√2 ;  s5/√2  s4/√2  s3 ] .   (2.338)

The only natural definition of distance between two tensors is the norm of their difference, so we can write

D(σ2, σ1) = ‖ σ2 − σ1 ‖ ,   (2.339)

where the norm of a tensor σ is¹⁹

‖ σ ‖ = √( σij σ^ji ) .   (2.340)

The basis in equation 2.336 is normed with respect to this norm²⁰. In terms of the independent components in expression 2.338 the norm of a tensor simply becomes

‖ σ ‖ = √( (s1)² + (s2)² + (s3)² + (s4)² + (s5)² + (s6)² ) ,   (2.341)

this showing that the six components s^α play the role of Cartesian coordinates of this 6D space of tensors. A Gaussian volumetric probability in this space has then, obviously, the form

f_s(s) = k exp( − Σ_{α=1}^{6} (s^α − s0^α)² / (2ρ²) ) ,   (2.342)

¹⁹ Of course, as, here, σij = σji , one can also write ‖σ‖ = √( σij σ^ij ) , but this expression is only valid for symmetric tensors, while the expression 2.340 is generally valid.

²⁰ It is also orthonormed, with the obvious definition of the scalar product from which this norm derives.

or, more generally,
f_s(s) = k exp( − (s^α − s0^α) W_αβ (s^β − s0^β) / (2ρ²) ) .   (2.343)

It is easy to find probabilistic models for tensors when we choose as coordinates the independent components of the tensor, as this Gaussian example suggests. But a symmetric second rank tensor may also be described using its three eigenvalues {λ1, λ2, λ3} and the three Euler angles {ψ, θ, ϕ} defining the eigenvectors' directions:

[ s1  s6/√2  s5/√2 ;  s6/√2  s2  s4/√2 ;  s5/√2  s4/√2  s3 ] = R(ψ) R(θ) R(ϕ) diag(λ1, λ2, λ3) R(ϕ)^T R(θ)^T R(ψ)^T ,   (2.344)

where R denotes the usual rotation matrix. Some care is required when using the coordinates {λ1, λ2, λ3, ψ, θ, ϕ} .

To write a Gaussian volumetric probability in terms of eigenvalues and eigendirections only requires, of course, inserting in the f_s(s) of equation 2.343 the expression 2.344 giving the tensor components as a function of the eigenvalues and eigendirections (we consider volumetric probabilities, which are invariant, and not probability densities, which would require an extra multiplication by the Jacobian determinant of the transformation),

f(λ1, λ2, λ3, ψ, θ, ϕ) = f_s(s1, s2, s3, s4, s5, s6) .   (2.345)
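The parameterization 2.344 can be checked numerically (an illustrative sketch, not part of the original text; the z-y-z Euler convention used for R here is my assumption): building the tensor from random eigenvalues and angles, the Euclidean norm of the six components of 2.338 must equal √(λ1² + λ2² + λ3²), as the Cartesian-coordinate interpretation of the s^α requires.

```python
import math
import random

random.seed(2)

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]

# Random eigenvalues and Euler angles (z-y-z convention: an assumption, not the text's)
lams = [random.gauss(0.0, 1.0) for _ in range(3)]
psi, theta, phi = (random.uniform(0.0, math.pi) for _ in range(3))

R = mat_mul(mat_mul(rot_z(psi), rot_y(theta)), rot_z(phi))
D = [[lams[0], 0.0, 0.0], [0.0, lams[1], 0.0], [0.0, 0.0, lams[2]]]
T = mat_mul(mat_mul(R, D), transpose(R))      # equation 2.344

# Independent components of equation 2.338, read off the symmetric matrix T
r2 = math.sqrt(2.0)
s = [T[0][0], T[1][1], T[2][2], r2 * T[1][2], r2 * T[0][2], r2 * T[0][1]]

norm_s = math.sqrt(sum(c * c for c in s))          # Euclidean norm of the s^alpha
norm_lam = math.sqrt(sum(l * l for l in lams))     # sqrt(lam1^2 + lam2^2 + lam3^2)
print(norm_s, norm_lam)
```

The two numbers coincide up to rounding, because the Frobenius norm is invariant under rotation and equals the Euclidean norm of the six components.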
But then, of course, we still need to know how to integrate in the space using these new coordinates, in order to evaluate probabilities.

Before facing this problem, let us remark that it is the replacement in equation 2.343 of the components s^α in terms of the eigenvalues and eigendirections of the tensor that shall express a Gaussian probability distribution in terms of the variables {λ1, λ2, λ3, ψ, θ, ϕ} . Using a function that would 'look Gaussian' in the variables {λ1, λ2, λ3, ψ, θ, ϕ} would not correspond to a Gaussian probability distribution, in the sense of section 2.8.4.

The Jacobian determinant of the transformation {s1, s2, s3, s4, s5, s6} → {λ1, λ2, λ3, ψ, θ, ϕ} can be obtained by a direct computation, which gives²¹

∂(s1, s2, s3, s4, s5, s6) / ∂(λ1, λ2, λ3, ψ, θ, ϕ) = (λ1 − λ2) (λ2 − λ3) (λ3 − λ1) sin θ .   (2.346)

The capacity elements in the two systems of coordinates are

dv̄_s(s1, s2, s3, s4, s5, s6) = ds1 ∧ ds2 ∧ ds3 ∧ ds4 ∧ ds5 ∧ ds6 ,
dv̄(λ1, λ2, λ3, ψ, θ, ϕ) = dλ1 ∧ dλ2 ∧ dλ3 ∧ dψ ∧ dθ ∧ dϕ .   (2.347)

As the coordinates {s^α} are Cartesian, the volume element of the space is numerically identical to the capacity element,

dv_s(s1, s2, s3, s4, s5, s6) = dv̄_s(s1, s2, s3, s4, s5, s6) ,   (2.348)

but in the coordinates {λ1, λ2, λ3, ψ, θ, ϕ} the volume element and the capacity element are related via the Jacobian determinant in equation 2.346,

dv(λ1, λ2, λ3, ψ, θ, ϕ) = (λ1 − λ2) (λ2 − λ3) (λ3 − λ1) sin θ dv̄(λ1, λ2, λ3, ψ, θ, ϕ) .   (2.349)

Then, while the evaluation of a probability in the variables {s1, ..., s6} should be done via

P = ∫ dv_s(s1, ..., s6) f_s(s1, ..., s6) = ∫ ds1 ∧ ds2 ∧ ds3 ∧ ds4 ∧ ds5 ∧ ds6 f_s(s1, ..., s6) ,   (2.350)

in the variables {λ1, λ2, λ3, ψ, θ, ϕ} it should be done via

P = ∫ dv(λ1, λ2, λ3, ψ, θ, ϕ) f(λ1, λ2, λ3, ψ, θ, ϕ)
  = ∫ dλ1 ∧ dλ2 ∧ dλ3 ∧ dψ ∧ dθ ∧ dϕ (λ1 − λ2) (λ2 − λ3) (λ3 − λ1) sin θ f(λ1, λ2, λ3, ψ, θ, ϕ) .   (2.351)

To conclude this appendix, we may remark that the homogeneous probability distribution (defined as the one that is 'proportional to the volume distribution') is obtained by taking both f_s(s1, s2, s3, s4, s5, s6) and f(λ1, λ2, λ3, ψ, θ, ϕ) as constants.

[Note: I should explain somewhere that there is a complication when, instead of considering 'a tensor like the stress tensor', one considers a positive tensor (like an electric permittivity tensor). The treatment above applies approximately to the logarithm of such a tensor.]

²¹ If instead of the 3 Euler angles we take 3 rotations around the three coordinate axes, the sin θ here above is replaced by the cosine of the second angle. This is consistent with the formula by Xu and Grafarend (1997).

2.8.9 Appendix: Determinant of a Partitioned Matrix

Using well-known properties of matrix algebra (e.g., Lütkepohl, 1996), the determinant of a partitioned matrix can be expressed as
det [ g_rr  g_rs ; g_sr  g_ss ] = det( g_rr ) det( g_ss − g_sr g_rr⁻¹ g_rs ) .   (2.352)

2.8.10 Appendix: The Borel 'Paradox'

[Note: This appendix has to be updated.]
A description of the paradox is given, for instance, by Kolmogorov (1933), in his Foundations of the Theory of Probability (see figure 2.31).

Figure 2.31: A reproduction of a section of Kolmogorov's book Foundations of the Theory of Probability (1950, pp. 50–51). He describes the so-called "Borel paradox". His explanation is not profound: instead of discussing the behaviour of a conditional probability density under a change of variables, it concerns the interpretation of a probability density over the sphere when using spherical coordinates. I do not agree with the conclusion (see main text).

A probability distribution is considered over the surface of the unit sphere, associating,
as it should, to any region A of the surface of the sphere, a positive real number P (A) .
To any possible choice of coordinates {u, v } on the surface of the sphere will correspond a
probability density f(u, v) representing the given probability distribution, through P(A) = ∫ du dv f(u, v) (integral over the region A ). At this point of the discussion, the coordinates
{u, v } may be the standard spherical coordinates or any other system of coordinates (as,
for instance, the Cartesian coordinates in a representation of the surface of the sphere as a
‘geographical map’, using any ‘geographical projection’).
A great circle is given on the surface of the sphere, which, should we use spherical coordinates, is not necessarily the 'equator' or a 'meridian'. Points on this circle may be parameterized by a coordinate α , which, for simplicity, we may take to be the circular angle (as measured from the center of the sphere).
The probability distribution P ( · ) deﬁned over the surface of the sphere will induce a
probability distribution over the circle. Said otherwise, the probability density f (u, v ) deﬁned
over the surface of the sphere will induce a probability density g (α) over the circle. This
is the situation one has in mind when defining the notion of conditional probability density, so we may say that g(α) is the conditional probability density induced on the circle by the
probability density f (u, v ) , given the condition that points must lie on the great circle.
The Borel-Kolmogorov paradox is obtained when the probability distribution over the surface of the sphere is homogeneous. If it is homogeneous over the sphere, the conditional probability distribution over the great circle must be homogeneous too, and as we parameterize by
the circular angle α , the conditional probability density over the circle must be
g(α) = 1 / (2π) ,   (2.353)

and this is not what one gets from the standard definition of conditional probability density,
as we will see below.
From now on, assume that the spherical coordinates {λ, ϕ} are used, where λ is the latitude (rather than the colatitude θ ), so the domains of definition of the variables are

−π/2 < λ ≤ +π/2 ;   −π < ϕ ≤ +π .   (2.354)

As the surface element is dS(λ, ϕ) = cos λ dλ dϕ , the homogeneous probability distribution over the surface of the sphere is represented, in spherical coordinates, by the probability density

f(λ, ϕ) = (1/4π) cos λ ,   (2.355)

and we satisfy the normalization condition

∫_{−π/2}^{+π/2} dλ ∫_{−π}^{+π} dϕ f(λ, ϕ) = 1 .   (2.356)

The probability of any region equals the relative surface of the region (i.e., the ratio of the surface of the region divided by the surface of the sphere, 4π ), so the probability density in equation 2.355 does represent the homogeneous probability distribution.
Two diﬀerent computations follow. Both are aimed at computing the conditional probability
density over a great circle.
The first one uses the nonconventional definition of conditional probability density introduced in section ?? of this article (and claimed to be 'consistent'). No paradox appears, no matter whether we take as great circle a meridian or the equator.
The second computation is the conventional one. The traditional Borel-Kolmogorov paradox appears when the great circle is taken to be a meridian. We interpret this as a sign of the
inconsistency of the conventional theory. Let us develop the example.
We have the line element (taking a sphere of radius 1 ),
ds² = dλ² + cos² λ dϕ² ,   (2.357)

which gives the metric components

g_λλ(λ, ϕ) = 1 ;   g_ϕϕ(λ, ϕ) = cos² λ ,   (2.358)

and the surface element

dS(λ, ϕ) = cos λ dλ dϕ .   (2.359)

Letting f(λ, ϕ) be a probability density over the sphere, consider the restriction of this
probability on the (half) meridian ϕ = ϕ0 , i.e., the conditional probability density on this
(half) meridian. It is, following equation ??,
f_λ(λ | ϕ = ϕ0) = k f(λ, ϕ0) / √( g_ϕϕ(λ, ϕ0) ) .   (2.360)

In our case, using the second of equations 2.358,

f_λ(λ | ϕ = ϕ0) = k f(λ, ϕ0) / cos λ ,   (2.361)

or, in normalized version,

f_λ(λ | ϕ = ϕ0) = [ f(λ, ϕ0) / cos λ ] / [ ∫_{−π/2}^{+π/2} dλ f(λ, ϕ0) / cos λ ] .   (2.362)

If the original probability density f(λ, ϕ) represents a homogeneous probability, then it
must be proportional to the surface element dS (equation 2.359), so, in normalized form, the
homogeneous probability density is
f(λ, ϕ) = (1/4π) cos λ .   (2.363)

Then, equation 2.361 gives

f_λ(λ | ϕ = ϕ0) = 1/π .   (2.364)

We see that this conditional probability density is constant²².
This is in contradiction with usual ‘deﬁnitions’ of conditional probability density, where
the metric of the space is not considered, and where instead of the correct equation 2.360, the
conditional probability density is ‘deﬁned’ by
f_λ(λ | ϕ = ϕ0) = k f(λ, ϕ0) = f(λ, ϕ0) / [ ∫_{−π/2}^{+π/2} dλ f(λ, ϕ0) ]   (wrong definition) ,   (2.365)

this leading, in the considered case, to the conditional probability density

f_λ(λ | ϕ = ϕ0) = (cos λ) / 2   (wrong result) .   (2.366)

This result is the celebrated 'Borel paradox'. As any other 'mathematical paradox', it is not
a paradox, it is just the result of an inconsistent calculation, with an arbitrary deﬁnition of
conditional probability density.
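Both answers can be reproduced by brute force (an illustrative sketch, not part of the original text): sample homogeneous points on the sphere and keep those lying in a thin band around a meridian. A band of constant coordinate width leaves the latitudes distributed as cos λ/2 (the conventional result 2.366), while a band of constant metric width, |ϕ| cos λ < ε, leaves them uniform, 1/π (the result 2.364): the limit depends on how the band shrinks.

```python
import math
import random

random.seed(3)
eps = 0.01
band_coord, band_metric = [], []

for _ in range(2_000_000):
    lam = math.asin(random.uniform(-1.0, 1.0))   # homogeneous latitude on the sphere
    phi = random.uniform(-math.pi, math.pi)
    if abs(phi) < eps:
        band_coord.append(lam)                    # constant coordinate width
    if abs(phi) * math.cos(lam) < eps:
        band_metric.append(lam)                   # constant metric (geodesic) width

def frac_large(lams):
    # fraction of latitudes with |lam| > pi/4
    return sum(abs(l) > math.pi / 4.0 for l in lams) / len(lams)

print(frac_large(band_coord))   # near 1 - sin(pi/4) = 0.293, density cos(lam)/2
print(frac_large(band_metric))  # near 0.5, i.e. uniform density 1/pi
```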
The interpretation of the paradox by Kolmogorov (1933) sounds quite strange to us (see
ﬁgure 2.31). Jaynes (1995) says “Whenever we have a probability density on one space and
we wish to generate from it one on a subspace of measure zero, the only safe procedure is to
pass to an explicitly defined limit [...]. In general, the final result will and must depend on which limiting operation was specified. This is extremely counterintuitive at first hearing; yet it becomes obvious when the reason for it is understood."

²² This constant value is 1/π if we consider half a meridian, or 1/2π if we consider a whole meridian.
We agree with Jaynes, and go one step further. We claim that usual parameter spaces, where we define probability densities, normally accept a natural definition of distance, and that the 'limiting operation' (in the words of Jaynes) must be the uniform convergence associated with the metric. This is what we have done to define the notion of conditional probability. Many
examples of such distances are shown in this text.

2.8.11 Appendix: Axioms for the Sum and the Product

2.8.11.1 The Sum

I guess that the two defining axioms for the union of two probabilities are

P(A) = 0 and Q(A) = 0   =⇒   (P ∪ Q)(A) = 0   (2.367)

and

P(A) ≠ 0 or Q(A) ≠ 0   =⇒   (P ∪ Q)(A) ≠ 0 .   (2.368)

But the last property is equivalent to its contrapositive,

P(A) = 0 and Q(A) = 0   ⇐=   (P ∪ Q)(A) = 0 ,   (2.369)

and this can be reunited with the first property, to give the single axiom

P(A) = 0 and Q(A) = 0   ⇐⇒   (P ∪ Q)(A) = 0 .   (2.370)

2.8.11.2 The Product

We only have the axiom

P(A) = 0 or Q(A) = 0   =⇒   (P ∩ Q)(A) = 0 ,   (2.371)

and, of course, its (equivalent) contrapositive

P(A) ≠ 0 and Q(A) ≠ 0   ⇐=   (P ∩ Q)(A) ≠ 0 .   (2.372)

2.8.12 Appendix: Random Points on the Surface of the Sphere

Figure 2.32: 1000 random points on the surface of the sphere.

Note: Figure 2.32 has been generated using the following Mathematica code:
spc[t_,p_,r_:1] := r {Sqrt[1-t^2] Cos[p], Sqrt[1-t^2] Sin[p], t}
Show[Graphics3D[Table[Point[spc[Random[Real,{-1,1}],
Random[Real,{0,2Pi}]]],{1000}]]] Figure 2.33: A geodesic dome dividing the surface of the sphere
into regions with approximately the same area.

Figure 2.34: The coordinate division of the surface of the sphere.

Figure 2.35: Map representation of a random homogeneous distribution of points at the surface of the sphere. At the left, the naïve division of the surface of the sphere using constant increments of the coordinates. At the right, the cylindrical equal-area projection. Counting the points inside each 'rectangle' gives, at the left, the probability density of points; at the right, the volumetric probability.

2.8.13 Appendix: Histograms for the Volumetric Mass of Rocks

Figure 2.36: Histogram of the volumetric mass
for the 557 minerals listed in the Handbook of Physical Properties of Rocks (Johnson and Olhoeft, 1984). A logarithmic axis is used that represents the variable u = log10(ρ/K) , with K = 1 g/cm³. Superposed on the histogram is the normal function with mean 0.60 and standard deviation 0.23. The vertical lines correspond to successive deviations, multiples of the standard deviation. See the lognormal function in figure 2.37.

Figure 2.37: A naïve version of the histogram in figure 2.36, using an axis labeled in volumetric mass (in g/cm³).

Figure 2.38: A third version of the histogram, obtained using intervals of constant length δρ/ρ .

Chapter 3

Monte Carlo Sampling Methods

Note: write here a small introduction to the chapter.

3.1 Introduction

When a probability distribution has been defined, we have to face the problem of how to 'use' it.
The deﬁnition of some ‘central estimators’ (like the mean or the median) and some ‘estimators
of dispersion’ (like the covariance matrix), lacks generality, as it is quite easy to ﬁnd examples
(like multimodal distributions in highlydimensioned spaces) where these estimators fail to have
any interesting meaning.
When a probability distribution has been deﬁned over a space of low dimension (say, from
one to four dimensions), then we can directly represent the associated probability density1 .
This is trivial in one or two dimensions. It is easy in three dimensions, using, for instance,
virtual reality software. Some tricks may allow us to represent a four-dimensional probability
distribution, but clearly this approach cannot be generalized to the high dimensional case.
Let us explain the only approach that seems practical, with help of ﬁgure 3.1. At the left
of the ﬁgure, there is an explicit representation of a 2D probability distribution (by means
of the associated probability density or the associated (2D) volumetric probability). In the
middle, some random points have been generated (using the Monte Carlo method about to
be described). It is clear that if we make a histogram with these points, in the limit of a
suﬃciently large number of points, we recover the representation at the left2 . Disregarding
the histogram possibility, we can concentrate on the individual points. In the 2D example of
the ﬁgure, we have actual points in a plane. If the problem is multidimensional, each ‘point’
may corresponds to some abstract notion. For instance, for a geophysicist a ‘point’ may be
a given model of the Earth. This model may be represented in some way, for instance a nice
drawing with plenty of colors. Then a collection of ‘points’ is a collections of such drawings.
Our experience shows that, given such a collection of randomly generated ‘models’, the human
eyebrain system is extremely good at apprehending the basic characteristics of the underlying
probability distribution, including possible multimodalities, correlations, etc. Figure 3.1: An explicit representation of a 2D
probability distribution, and the sampling of
it, using Monte Carlo methods. While the
representation at the topleft cannot be generalized to high dimensions, the examination
of a collection of points can be done in arbitrary dimensions. Practically, Monte Carlo
generation of points is done through a ‘random walk’ where a ‘new point’ is generated
in the vicinity of the previous point. . . . .. .
... .
... .
. . ..... .
.
..
.
. . ...
. ..
.
. . . ..
.
When such a (hopefully large) collection of random models is available we can also answer
quite interesting questions. For instance, a geologist may ask: at which depth is that subsurface
strucure? To answer this, we can make an histogram of the depth of the given geological
1 Or, best, the associated volumetric probability.
There are two ways for making an histogram. If the space is devided in cells with constant coordinate
diﬀerences dx1 , dx2 , . . . , then the limit converges to the probability density. If, instead, the space is divided
in cells of constant volume dV , then the limit converges to the volumetric probability.
2 Random Walks 155 structure over the collection of random models, and the histogram is the answer to the question.
Which is the probability of having a low velocity zone around a given depth? The ratio of the
number of models presenting such a low velocity zone over the total number of models in the
collection gives the answer (if the collection of models is large enough).
This is essentially what we propose: looking to a large number of randomly generated models
in order to intuitively apprehend the basic properties of the probability distribution, followed
by precise computations of the probability of all interesting ‘events’.
Practically, as we shall see, the random sampling is not made by generating points independently of each other. Rather, as suggested in the last image of ﬁgure 3.1, through a ‘random
walk’ where a ‘new point’ is generated in the vicinity of the previous point.
Monte Carlo methods have a random generator at their core3 . At present, Monte Carlo
methods are typically implemented on digital computers, and are based on the pseudorandom
generation of numbers4 . As we shall see, any conceivable operation on probability densities (e.g.,
computing marginals and conditionals, integration, conjunction (the and operation), etc.) has
its counterpart in an operation on/by their corresponding Monte Carlo algorithms.
Inverse problems are often formulated in high dimensional spaces. In this case a certain
class of Monte Carlo algorithms, the socalled importance sampling algorithms, come to rescue,
allowing us to sample the space with a sampling density proportional to the given probability
density. In this case excessive (and useless) sampling of lowprobability areas of the space is
avoided. That this is not only important, but in fact vital in high dimensional spaces, can be
seen in ﬁgure 3.2, where the failure of a plain Monte Carlo sampling (one that samples the
space uniformly) in high dimensional spaces is made clear.
Another advantage of the importance sampling Monte Carlo algorithms is that we need
not have a closed form mathematical expression for the probability density we want to sample.
Only an algorithm that allows us to evaluate it at a given point in the space is needed. This
has considerable practical advantage in analysis of inverse problems where computer intensive
evaluation of, e.g., misﬁt functions plays an important role in calculation of certain probability
densities.
Given a probability density that we wish to sample, and a class of Monte Carlo algorithms
that samples this density, which one of the algorithms should we choose? Practically, the
problem is here to ﬁnd the most eﬃcient of these algorithms. This is an interesting and
diﬃcult problem that we will not go into detail with here. We will, later in this chapter, limit
ourselves to only two general methods which are recommendable in many practical situations. 3.2 Random Walks To escape the dimensionality problem, any sampling of a probability density for which point
values are available only upon request has to be based on a random walk, i.e., in a generation
of successive points with the constraint that point xi+1 sampled in iteration (i + 1) is in the
vicinity of the point xi sampled in iteration i. The simplest of the random walks are the socalled Markov Chain Monte Carlo (MCMC) algorithms, where the point xi+1 depends on the
point xi , but not on previous points. We will concentrate on these algorithms here.
3 Note: Cite here the example of Buﬀon, and a couple of other simple examples.
I.e., series of numbers that appear random if tested with any reasonable statistical test. Note: cite here
some references (Press, etc.).
4 156 3.2 4 πR3 (...) πn R2n 2n+1 πn R2n+1
3
n!
(2n+1)!! 2R πR2 2R (2R)2 (2R)3 (...) (2R)2n (2R)2n+1 Volume hypersphere / Volume hypercube
1
1.0 0.8
0.8 0.6
0.6 0.4
0.4 0.2
0.2 0
0.0 1 2 2 3 4 4 5 6 6 7 8 8 9 10 10 11 Dimension Figure 3.2: Consider a square and the inscribed circle. If the circle’s surface is πR2 , that of the
square is (2R)2 . If we generate a random point inside the square, with homogeneous probability
distribution, the probability of hitting the circle equals the ratio of the surfaces, i.e., P = π/4 .
We can do the same in 3D, but, in this case, the ratio of volumes is P = π/6 : the probability
of hitting the target is smaller in 3D than in 2D. This probability tends dramatically to zero
when the dimension of the space increases. For instance, in dimension 100, the probability of
hitting the hypersphere incribed in the hypercube is P = 1.9 10−70 , what means that it is
practically impossible to hit the target ‘by chance’. The formulas at the top give the volume
of an hypersphere of radius R in a space of dimension 2n or 2n + 1 (the formula is not the
same for spaces with even or odd dimension), and the volume of an hypercube with sides of
length 2R . The graph at the bottom shows the evolution, as a function of the dimension of
the space, of the ratio between the volume of the hypersphere and the volume of the hypercube.
In large dimensions, the hypersphere fills a negligible fraction of the hypercube.

If random rules have been defined to select points such that the probability of selecting
a point in the inﬁnitesimal “box” dx1 . . . dxN is p(x)dx1 . . . dxN , then the points selected in
this way are called samples of the probability density p(x). Depending on the rules deﬁned,
successive samples i, j, k, . . . may be dependent or independent.
Before going into more complex sampling situations, we should mention that there exist
methods for sampling probability densities that can be described by an explicit mathematical expression. Information on some of the most important of these methods can be found in
appendix 3.10.3.
Sampling in cases where only point values of the probability density are available upon
request can be done by means of Monte Carlo algorithms based on random walks. In the
following, we shall describe the essential properties of random walks performing the so-called importance sampling.

3.3 Modification of Random Walks

Assume here that we can start with a random walk that samples some probability density f(x) , and that we have the goal of obtaining a random walk that samples the probability density

h(x) = k f(x) g(x) / µ(x) . (3.1)

Call xi the 'current point'. With this current point as starting point, run one step of the
random walk that unimpeded would sample the probability density f (x) , and generate a ‘test
point’ xtest . Compute the value
qtest = g(xtest) / µ(xtest) . (3.2)

If that value is 'high enough', let that point 'survive'. If qtest is not 'high enough', discard this
point and generate another one (making another step of the random walk sampling the prior
probability density f(x) , using again the 'current point' xi as starting point).
There are many criteria for deciding when a point should survive or should be discarded, all
of them resulting in a collection of ‘surviving points’ that are samples of the target probability
density h(x) . For instance, if we know the maximum possible value of the ratio g (x)/µ(x) ,
say qmax , then deﬁne
Ptest = qtest / qmax , (3.3)

and give the point xtest the probability Ptest of survival (note that 0 < Ptest ≤ 1 ). It is
intuitively obvious why the random walk modiﬁed using such a criterion produces a random
walk that actually samples the probability density h(x) deﬁned by equation 3.1.
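A minimal numerical sketch of this survival rule (all concrete choices here are ours, not the text's): the primeval walk draws independent uniform points on [0, 1], so f(x) = 1 , we take g(x)/µ(x) = x , whose maximum is qmax = 1 , and the surviving points then sample h(x) = k x , i.e. the density 2x , with mean 2/3.

```python
import random

random.seed(42)

q_max = 1.0                    # assumed known maximum of g(x)/mu(x) on [0, 1]

def q(x):
    return x                   # our illustrative ratio g(x)/mu(x)

samples = []
for _ in range(200_000):
    x_test = random.random()   # one step of a walk sampling f(x) = 1 on [0, 1]
    p_test = q(x_test) / q_max # survival probability P_test = q_test / q_max
    if random.random() < p_test:
        samples.append(x_test) # the point 'survives'

mean = sum(samples) / len(samples)   # should approach 2/3, the mean of h(x) = 2x
```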
Among the many criteria that can be used, by far the most efficient is the Metropolis
criterion, the criterion behind the Metropolis Algorithm (Metropolis et al. 1953). In the
following we shall describe this algorithm in some detail.

3.4 The Metropolis Rule

Consider the following situation. Some random rules define a random walk that samples the
probability density f (x) . At a given step, the random walker is at point xj , and the application
of the rules would lead to a transition to point xi . If that ‘proposed transition’ xi ← xj is
always accepted, the random walker will sample the probability density f (x). Instead of always
accepting the proposed transition xi ← xj , we reject it sometimes by using the following rule
to decide if it is allowed to move to xi or if it must stay at xj :
• if g (xi )/µ(xi ) ≥ g (xj )/µ(xj ) , then accept the proposed transition to xi ,
• if g (xi )/µ(xi ) < g (xj )/µ(xj ) , then decide randomly to move to xi , or to stay at xj ,
with the following probability of accepting the move to xi :
P = ( g(xi)/µ(xi) ) / ( g(xj)/µ(xj) ) . (3.4)

Then we have the following
Theorem 3.1 The random walker samples the conjunction h(x) of the probability densities
f (x) and g (x)
h(x) = k f(x) g(x) / µ(x) (3.5)

(see appendix 3.10.2 for a demonstration).
It should be noted here that this algorithm nowhere requires the probability densities to
be normalized. This is of vital importance in practice, since it allows sampling of probability
densities whose values are known only at points already sampled by the algorithm. Obviously, such probability densities cannot be normalized. Also, the fact that our theory allows unnormalizable probability densities will not cause any trouble in the application of the above
algorithm.
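As a sketch of the rule in code (our own illustrative choices: a symmetric uniform proposal plays the role of the walk sampling f(x) , and g(x)/µ(x) is an unnormalized Gaussian); note that only ratios of g/µ ever appear, so the omitted normalization constant is irrelevant, exactly as the text remarks:

```python
import math
import random

random.seed(0)

def q(x):
    # unnormalized g(x)/mu(x): a Gaussian with its
    # normalization constant deliberately omitted
    return math.exp(-0.5 * x * x)

x = 0.0
samples = []
for _ in range(100_000):
    x_test = x + random.uniform(-1.0, 1.0)   # symmetric proposal step
    # Metropolis rule: accept uphill moves always,
    # downhill moves with probability q(x_test)/q(x)
    if q(x_test) >= q(x) or random.random() < q(x_test) / q(x):
        x = x_test
    samples.append(x)    # a rejected move re-counts the current point

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
# the samples should equilibrate at the unit Gaussian: mean 0, variance 1
```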
The algorithm above is reminiscent (see appendix 3.10.2) of the Metropolis algorithm
(Metropolis et al., 1953), originally designed to sample the Gibbs-Boltzmann distribution5 . Accordingly, we will refer to the above acceptance rule as the Metropolis rule.

3.5 The Cascaded Metropolis Rule

As above, assume that some random rules define a random walk that samples the probability
density f1 (x) . At a given step, the random walker is at point xj ;
1 apply the rules that, unthwarted, would generate samples of f1(x) , to propose a new point xi ;
5 To see this, put f(x) = 1 , µ(x) = 1 , and g(x) = exp(−E(x)/T ) / ∫ exp(−E(x)/T ) dx , where E(x) is an "energy" associated to the point x , and T is a "temperature". The summation in the denominator is over the entire space. In this way, our acceptance rule becomes the classical Metropolis rule: point xi is always accepted if E(xi) ≤ E(xj) , but if E(xi) > E(xj) , it is only accepted with probability p_ij^acc = exp( −(E(xi) − E(xj))/T ) .

2 if f2(xi)/µ(xi) ≥ f2(xj)/µ(xj) , go to point 3; if f2(xi)/µ(xi) < f2(xj)/µ(xj) , then
decide randomly to go to point 3 or to go back to point 1, with the following probability
of going to point 3: P = (f2 (xi )/µ(xi ))/(f2 (xj )/µ(xj )) ;
3 if f3 (xi )/µ(xi ) ≥ f3 (xj )/µ(xj ) , go to point 4; if f3 (xi )/µ(xi ) < f3 (xj )/µ(xj ) , then
decide randomly to go to point 4 or to go back to point 1, with the following probability
of going to point 4: P = (f3 (xi )/µ(xi ))/(f3 (xj )/µ(xj )) ;
. . .
n if fn (xi )/µ(xi ) ≥ fn (xj )/µ(xj ) , then accept the proposed transition to xi ; if fn (xi )/µ(xi ) <
fn (xj )/µ(xj ) , then decide randomly to move to xi , or to stay at xj , with the following
probability of accepting the move to xi : P = (fn (xi )/µ(xi ))/(fn (xj )/µ(xj )) ;
Then we have the following
Theorem 3.2 The random walker samples the conjunction h(x) of the probability densities
f1 (x), f2 (x), . . . , fn (x) :
h(x) = k f1(x) ( f2(x)/µ(x) ) · · · ( fn(x)/µ(x) ) . (3.6)

(see appendix XXX for a demonstration).

3.6 Initiating a Random Walk

Consider the problem of obtaining samples of a probability density h(x) defined as the conjunction of some probability densities f1(x), f2(x), f3(x) . . . ,
h(x) = k f1(x) ( f2(x)/µ(x) ) ( f3(x)/µ(x) ) · · · , (3.7)

and let us examine three common situations.
3.6.0.0.1 We start with a random walk that samples f1 (x) (optimal situation):
This corresponds to the basic algorithm where we know how to produce a random walk that
samples f1 (x) , and we only need to modify it, taking into account the values f2 (x)/µ(x) ,
f3 (x)/µ(x) . . . , using the cascaded Metropolis rule, to obtain a random walk that samples
h(x) .
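The cascaded scheme in this optimal situation can be sketched as follows (every density below is our own illustrative choice: f1(x) = 1 on [0, 1], f2(x)/µ(x) = x and f3(x)/µ(x) = 1 − x , so that the conjunction is the Beta(2,2) density k x(1 − x) , with mean 1/2 and variance 1/20); a test point rejected at any stage sends the walker back to step 1, and the current point is counted again:

```python
import random

random.seed(7)

def f2_over_mu(x):
    return x          # first cascaded criterion (illustrative)

def f3_over_mu(x):
    return 1.0 - x    # second cascaded criterion (illustrative)

x = 0.5
samples = []
for _ in range(200_000):
    x_test = random.random()   # step 1: primeval walk sampling f1(x) = 1
    # step 2: first Metropolis test; on failure, go back to step 1
    if f2_over_mu(x_test) < f2_over_mu(x) and \
            random.random() >= f2_over_mu(x_test) / f2_over_mu(x):
        samples.append(x)      # stay at the current point
        continue
    # step 3: second Metropolis test; on failure, go back to step 1
    if f3_over_mu(x_test) < f3_over_mu(x) and \
            random.random() >= f3_over_mu(x_test) / f3_over_mu(x):
        samples.append(x)
        continue
    x = x_test                 # all tests passed: move to the test point
    samples.append(x)

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
# should approach the Beta(2,2) values: mean 1/2, variance 1/20
```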
3.6.0.0.2 We start with a random walk that samples µ(x) : We can write equation 3.7 as

h(x) = k µ(x) ( f1(x)/µ(x) ) ( f2(x)/µ(x) ) · · · . (3.8)

The expression corresponds to the case where we are not able to start with a random walk that
samples f1 (x) , but we have a random walk that samples the homogeneous probability density
µ(x) . Then, with respect to the example just mentioned, there is one extra step to be added,
taking into account the values of f1(x)/µ(x) .

3.6.0.0.3 We start with an arbitrary random walk (worst situation): In the situation where we are not able to directly define a random walk that samples the homogeneous
probability distribution, but only one that samples some arbitrary probability distribution
ψ(x) , we can write equation 3.7 in the form

h(x) = k ψ(x) ( µ(x)/ψ(x) ) ( f1(x)/µ(x) ) ( f2(x)/µ(x) ) · · · . (3.9)

Then, with respect to the example just mentioned, there is one more extra step to be added,
taking into account the values of µ(x)/ψ(x) . Note that the closer ψ(x) is to µ(x) , the more efficient the first modification of the random walk will be.

3.7 Designing Primeval Walks

What the Metropolis algorithm does is to modify some initial walk, in cascade, to produce a
ﬁnal random walk that samples the target probability distribution. The initial walk, that is
designed ab initio, i.e., independently of the Metropolis algorithm (or any similar algorithm),
may be called the primeval walk . We shall see below some examples where primeval walks
are designed that sample the homogeneous probability distribution µ(x) , or directly the
probability density f (x) (see equation 3.7). If we do not know how to do this, then we have
to resort to using a primeval walk that samples the arbitrary function ψ (x) mentioned above.
Example 3.1 Consider the homogeneous probability density on the 2D surface of a sphere of radius R , µ(ϑ, ϕ) = (R²/4π) cos ϑ , where we use geographical coordinates. This distribution can be sampled by generating a value of ϑ using the probability density (1/2) cos ϑ , and then a value of ϕ using a constant probability density. Alternatively, one could use a purely geometrical approach. [End of example.]
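A sketch of this recipe (the inverse-CDF formula ϑ = arcsin(2r − 1) for the density (1/2) cos ϑ is our addition, not the text's); a convenient numerical check of homogeneity on the sphere is that z = sin ϑ must come out uniform on [−1, 1]:

```python
import math
import random

random.seed(1)

n = 100_000
# theta with density (1/2) cos(theta) on [-pi/2, pi/2], via its inverse CDF
theta = [math.asin(2.0 * random.random() - 1.0) for _ in range(n)]
# phi with a constant density on [0, 2*pi)
phi = [2.0 * math.pi * random.random() for _ in range(n)]

# homogeneity check: z = sin(theta) should be uniform on [-1, 1],
# so its mean is 0 and its second moment is 1/3
z = [math.sin(t) for t in theta]
z_mean = sum(z) / n
z2_mean = sum(v * v for v in z) / n
```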
Example 3.2 If instead of the surface of a sphere, we have some spheroid, with spheroidal
coordinates {ϑ, ϕ} , the homogeneous probability density will have some expression µ(ϑ, ϕ) ,
that will not be identical to that corresponding to a sphere (see example 3.1). We may then
use the function ψ(ϑ, ϕ) = (R²/4π) cos ϑ , i.e., we may start with the same primeval walk as in
example 3.1, using, in the Metropolis rule, the ‘corrective step’ mentioned in section 3.6, and
depending on the values µ(ϑ, ϕ)/ψ (ϑ, ϕ) . [End of example.]
Example 3.3 If x is a one-dimensional Cartesian quantity, i.e., if the associated homogeneous probability density is constant, then it is trivial to design a random walk that samples
it. If x is the ‘current point’, choose randomly a real number e with an arbitrary probability
density that is symmetric around zero, and jump to x + e . The iteration of this rule produces
a random walk that samples the homogeneous probability density for a Cartesian parameter,
µ(x) = k . [End of example.]
Example 3.4 Consider the homogeneous probability density for a temperature, µ(T ) = 1/T ,
as an example of a Jeﬀreys parameter. This distribution can be sampled by the following procedure. If T is the ‘current point’, choose randomly a real number e with an arbitrary
probability density that is symmetric around zero, let Q = exp e , and jump6 to QT .

6 Note that if Q > 1 , the algorithm 'goes to the right', while if Q < 1 , it 'goes to the left'.

The iteration of this rule produces7 a random walk that samples the homogeneous probability density
for a Jeﬀreys parameter, µ(T ) = 1/T . [End of example.]
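Examples 3.3 and 3.4 can be sketched together; as footnote 7 notes, the multiplicative walk is just the 'exponentiated version' of the additive one, so log T reproduces the Cartesian walk exactly (the step density and sizes below are our own choices):

```python
import math
import random

random.seed(3)

x = 0.0       # Cartesian quantity: additive walk (example 3.3)
T0 = 300.0    # a temperature, a Jeffreys quantity: multiplicative walk (example 3.4)
T = T0
increments = []
for _ in range(10_000):
    e = random.gauss(0.0, 0.1)   # any density symmetric around zero will do
    increments.append(e)
    x = x + e                    # jump to x + e
    T = T * math.exp(e)          # jump to Q*T with Q = exp(e)

total = sum(increments)
log_ratio = math.log(T / T0)
# the multiplicative walk never leaves T > 0, and log(T/T0)
# performs exactly the same additive walk as x
```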
Example 3.5 Consider a random walk that, when it is at point xj , chooses another point xi
with a probability density f(x) = U(x | xj) , satisfying

U(x | y) = U(y | x) . (3.10)

Then, the random walk samples the constant probability density f(x) = k (see appendix ?? for
a proof ). [End of example.]
The reader should be warned that although the Metropolis rule would allow using a primeval
walk sampling a probability density ψ (x) that may be quite diﬀerent from the homogeneous
probability density µ(x) , this may be quite ineﬃcient. One should not, in general, use the
random walk defined in example 3.5 as a general primeval walk.

3.8 Multistep Iterations

An algorithm will converge to a unique equilibrium distribution if the random walk is irreducible. Often, it is convenient to split an iteration into a number of steps, having their own
transition probability densities, and their own transition probabilities. A typical example is
a random walk in an N-dimensional Euclidean space where we are interested in dividing an
iteration of the random walk into N steps, where the nth move of the random walker is in a
direction parallel to the nth axis.
The question is now: if we want to form an iteration consisting of a series of steps, can we
give a sufficient condition to be satisfied by each step such that the complete iteration has the
desired convergence properties?
It is easy to see that if the individual steps in an iteration all have the same probability
density p(x) as their equilibrium probability density (not necessarily unique), then the complete iteration also has p(x) as an equilibrium probability density. This follows from the fact
that the equilibrium probability density is an eigenfunction with eigenvalue 1 for the integral
operators corresponding to each of the step transition probability densities. Then it is also an
eigenfunction with eigenvalue 1, and hence an equilibrium probability density, for the integral
operator corresponding to the transition probability density for the complete iteration.
If this equilibrium probability density is to be unique for the complete iteration, then the random walk must be irreducible. That is, it must be possible to
go from any point to any other point by performing iterations consisting of the speciﬁed steps.
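A sketch of such a multistep iteration (our own example: an unnormalized two-dimensional Gaussian target, with one Metropolis step per axis inside each iteration; each single-axis step preserves the target as an equilibrium density, and the composed iteration is irreducible):

```python
import math
import random

random.seed(11)

def q(x, y):
    return math.exp(-0.5 * (x * x + y * y))   # unnormalized target density

point = [0.0, 0.0]
xs = []
for _ in range(50_000):
    # one iteration = two steps, the n-th step moving parallel to the n-th axis
    for axis in range(2):
        trial = list(point)
        trial[axis] += random.uniform(-1.0, 1.0)
        # Metropolis rule applied to this single-axis step
        if random.random() < q(trial[0], trial[1]) / q(point[0], point[1]):
            point = trial
    xs.append(point[0])

mx = sum(xs) / len(xs)
vx = sum((v - mx) ** 2 for v in xs) / len(xs)
# the first coordinate should equilibrate at a unit Gaussian
```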
If the steps of an iteration satisfy these suﬃcient conditions, there is also another way of
deﬁning an iteration with the desired, unique equilibrium density. Instead of performing an
iteration as a series of steps, it is possible to deﬁne the iteration as consisting of one of the steps,
chosen randomly (with any distribution having nonzero probabilities) among the possible steps.
In this case, the transition probability density for the iteration is equal to a linear combination
of the transition probability densities for the individual steps. The coeﬃcient of the transition
probability density for a given step is the probability that this step is selected. Since the
7 It is easy to see why. Let t = log(T /T0) . Then f(T ) = 1/T transforms into g(t) = const . This example is then just the 'exponentiated version' of example 3.3.

desired probability density is an equilibrium probability density (eigenfunction with eigenvalue
1) for the integral operators corresponding to each of the step transition probability matrices,
and since the sum of all the coeﬃcients in the linear combination is equal to 1, it is also
an equilibrium probability density for the integral operator corresponding to the transition
probability density for the complete iteration. This equilibrium probability density is unique,
since it is possible, following the given steps, to go from any point to any other point in the
space.
Of course, a step of an iteration can, in the same way, be built from substeps, and in this
way acquire the same (not necessarily unique) equilibrium probability density as the substeps.

3.9 Choosing Random Directions and Step Lengths

A random walk is an iterative process where, when we are at some 'current point', we may
jump to a neighboring point. We must decide two things, the direction of the jump and its
step length. Let us examine the two problems in turn.

3.9.1 Choosing Random Directions

When the number of dimensions is small, a 'direction' in a space is something simple. This
is not so when we work in large-dimensional spaces. Consider, for instance, the problem of
choosing a direction in a space of functions. Of course, a space where each point is a function is
infinite-dimensional, and we work here with finite-dimensional spaces, but we may just assume
that we have discretized the functions using a large number of points, say 10 000 or 10 000 000
points.
If we are ‘at the origin’ of the space, i.e., at point {0, 0, . . . } representing a function that
is everywhere zero, we may decide to choose a direction pointing towards smooth functions,
or fractal functions, Gaussian-like functions, functions having zero mean value, L1 functions,
L2 functions, functions having a small number of large jumps, etc. This freedom of choice,
typical of large-dimensional problems, has to be carefully analyzed, and it is indispensable to take advantage of it when designing random walks.
Assume that we are able to design a primeval random walk that samples the probability
density f (x) , and we wish to modify it considering the values g (x)/µ(x) , using the Metropolis
rule (or any equivalent rule), in order to obtain a random walk that samples
h(x) = k f(x) g(x) / µ(x) . (3.11)

We can design many primeval random walks that sample f(x) . Using the Metropolis modification of a random walk, we will always obtain a random walk that samples h(x) . A well
designed primeval random walk will ‘present’ to the Metropolis criterion test points xtest that
have a large probability of being accepted (i.e., that have a large value of g(xtest)/µ(xtest) ). A
poorly designed primeval random walk will test points with a low probability of being accepted.
Then, the algorithm is very slow in producing accepted points. Although high acceptance probability can always be obtained with very small step lengths (if the probability density to be
sampled is smooth), we need to discover directions that give high acceptance ratios even for large step lengths.

3.9.2 Choosing Step Lengths

Numerical algorithms are usually forced to compromise between some conflicting wishes. For
instance, a gradient-based minimization algorithm has to select a finite step length along the
direction of steepest descent. The larger the step length, the smaller may be the number of
iterations required to reach the minimum, but if the step length is chosen too large, we may
lose eﬃciency; we can even increase the value of the target function, instead of diminishing it.
The random walks contemplated here face exactly the same situation. The direction of
the move is not deterministically calculated, but is chosen randomly, with the commonsense
constraint discussed in the previous section. But once a direction has been decided, the size
of the jump along this direction, that has to be submitted to the Metropolis criterion, has to
be ‘as large as possible’, but not too large. Again, the ‘Metropolis theorem’ guarantees that
the ﬁnal random walk will sample the target probability distribution, but the better we are in
choosing the step length, the more eﬃcient the algorithm will be.
In practice, a neighborhood size giving an acceptance rate of 30% − 60% (for the ﬁnal,
posterior sampler) can be recommended.

3.10 Appendixes

3.10.1 Random Walk Design

The design of a random walk that equilibrates at a desired distribution p(x) can be formulated as
the design of an equilibrium ﬂow having a throughput of p(xi )dxi particles in the neighborhood
of point xi . The simplest equilibrium ﬂows are symmetric , that is, they satisfy
F(xi, xj) = F(xj, xi) . (3.12)

That is, the transition xi ← xj is as likely as the transition xi → xj . It is easy to define a
symmetric ﬂow, but it will in general not have the required throughput of p(xj )dxj particles
in the neighborhood of point xj . This requirement can be satisﬁed if the following adjustment
of the ﬂow density is made: ﬁrst multiply F (xi , xj ) with a positive constant c. This constant
must be small enough to assure that the throughput of the resulting ﬂow density cF (xi , xj ) at
every point xj is smaller than the desired probability p(xj )dxj of its neighborhood. Finally,
at every point xj , add a ﬂow density F (xj , xj ), going from the point to itself, such that the
throughput at xj gets the right size p(xj )dxj . Neither the ﬂow scaling nor the addition of
F (xj , xj ) will destroy the equilibrium property of the ﬂow. In practice, it is unnecessary to add
a ﬂow density F (xj , xj ) explicitly, since it is implicit in our algorithms that if no move away
from the current point takes place, the move goes from the current point to itself. This rule
automatically adjusts the throughput at xj to the right size p(xj)dxj .

3.10.2 The Metropolis Algorithm

Characteristic of a random walk is that the probability of going to a point xi in the space X in a
given step (iteration) depends only on the point xj it came from. We will deﬁne the conditional
probability density P(xi | xj) of the location of the next destination xi of the random walker, given that it currently is at the neighbouring point xj . The P(xi | xj) is called the transition
probability density. As, at each step, the random walker must go somewhere (including the
possibility of staying at the same point), then
∫X P(xi | xj) dxi = 1 . (3.13)

For convenience we shall assume that P(xi | xj) is nonzero everywhere (but typically negligibly
small everywhere, except in a certain neighborhood around xj ). For this reason, staying in
an inﬁnitesimal neighborhood of the current point xj has nonzero probability, and therefore
is considered a “transition” (from the point xj to itself). The current point, having been
reselected, then contributes one more sample. Consider a random walk defined by the transition probability density P(xi | xj) . Assume
that the point, where the random walk is initiated, is only known probabilistically: there is a
probability density q (x) that the random walk is initiated at point x. Then, when the number
of steps tends to inﬁnity, the probability density that the random walker is at point x will
“equilibrate” at some other probability density p(x). It is said that p(x) is an equilibrium
probability density of P(xi | xj) . Then, p(x) is an eigenfunction with eigenvalue 1 of the linear integral operator with kernel P(xi | xj) :

∫X P(xi | xj) p(xj) dxj = p(xi) . (3.14)

If for any initial probability density q(x) the random walk equilibrates to the same probability
density p(x), then p(x) is called the equilibrium probability of P(xi | xj) . Then, p(x) is the unique eigenfunction with eigenvalue 1 of the integral operator.
If it is possible for the random walk to go from any point to any other point in X it is said
that the random walk is irreducible. Then, there is only one equilibrium probability density
(Note: Find appropriate reference...).
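The eigenfunction property of equation 3.14, and the uniqueness claimed for irreducible walks, are easy to verify on a discrete analogue, where the kernel P(xi | xj) becomes a matrix and the integral a sum (the 3-state chain below is our own illustration):

```python
# P[i][j] = probability of going to state i from state j;
# each column sums to 1, and all entries are positive (irreducible)
P = [[0.5, 0.2, 0.3],
     [0.3, 0.6, 0.1],
     [0.2, 0.2, 0.6]]

def step(p):
    # one application of the 'integral operator' (a matrix-vector product)
    return [sum(P[i][j] * p[j] for j in range(3)) for i in range(3)]

p = [1.0, 0.0, 0.0]   # walk initiated with certainty in state 0
q = [0.0, 0.0, 1.0]   # walk initiated with certainty in state 2
for _ in range(200):  # let both equilibrate
    p = step(p)
    q = step(q)

residual = max(abs(a - b) for a, b in zip(step(p), p))  # eigenvalue-1 property
spread = max(abs(a - b) for a, b in zip(p, q))          # unique equilibrium
```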
Given a probability density p(x), many random walks can be deﬁned that have p(x) as
their equilibrium density. Some tend more rapidly to the ﬁnal probability density than others.
Samples x(1) , x(2) , x(3) , . . . obtained by a random walk where P(xi | xj) is negligibly small
everywhere, except in a certain neighborhood around xj will, of course, not be independent
unless we only consider points separated by a suﬃcient number of steps.
Instead of considering p(x) to be the probability density of the position of a (single) random
walker (in which case X p(x))dx = 1), we can consider a situation where we have a “density
p(x) of random walkers” in point x. Then, X p(x))dx represents the total number of random
walkers. None of the results presented below will depend on the way p(x) is normed.
If at some moment the density of random walkers at a point xj is p(xj) , and the transition probability density is P(xi | xj) , then
F(xi, xj) = P(xi | xj) p(xj) (3.15)

represents the probability density of transitions from xj to xi : while P(xi | xj) is the conditional
probability density of the next point xi visited by the random walker, given that it currently is at xj , F(xi, xj) is the unconditional probability density that the next step will be a transition
from xj to xi , given only the probability density p(xj ).
When p(xj ) is interpreted as the density of random walkers at a point xj , F (xi , xj ) is called
the ﬂow density , as F (xi , xj )dxi dxj can be interpreted as the number of particles going to a
neighborhood of volume dxi around point xi from a neighborhood of volume dxj around point
xj in a given step. The ﬂow corresponding to an equilibrated random walk has the property
that the particle density p(xi ) at point xi is constant in time. Thus, that a random walk has
equilibrated at a distribution p(x) means that, in each step, the total ﬂow into an inﬁnitesimal
neighborhood of a given point is equal to the total flow out of this neighborhood.
Since each of the particles in a neighborhood around point xi must move in each step
(possibly to the neighborhood itself), the ﬂow has the property that the total ﬂow out from
the neighborhood, and hence the total flow into the neighborhood, must equal p(xi)dxi :

∫X F(xi, xj) dxj = ∫X F(xk, xi) dxk = p(xi) . (3.16)

Consider a random walk with transition probability density P(xi | xj) with equilibrium probability density p(x) and equilibrium flow density F(xi, xj) . We can multiply F(xi, xj) by any
symmetric ﬂow density ψ (xi , xj ), where ψ (xi , xj ) ≤ q (xj ), for all xi and xj , and the resulting
ﬂow density
ϕ(xi , xj ) = F (xi , xj )ψ (xi , xj ) (3.17) will also be symmetric, and hence an equilibrium ﬂow density. A “modiﬁed” algorithm with
ﬂow density ψ (xi , xj ) and equilibrium probability density r(xj ) is obtained by dividing ϕ(xi , xj )
with the product probability density r(xj ) = p(xj )q (xj ). This gives the transition probability
density
P(xi | xj)modified = F(xi, xj) ψ(xi, xj) / ( p(xj) q(xj) ) = P(xi | xj) ψ(xi, xj) / q(xj) ,

which is the product of the original transition probability density, and a new probability — the
acceptance probability
P_ij^acc = ψ(xi, xj) / q(xj) . (3.18)

If we choose to multiply F(xi, xj) by the symmetric flow density
ψij = Min( q(xi), q(xj) ) , (3.19)

we obtain the Metropolis acceptance probability

P_ij^metrop = Min( 1 , q(xi)/q(xj) ) , (3.20)

which is one for q(xi) ≥ q(xj) , and equals q(xi)/q(xj) when q(xi) < q(xj) .

The efficiency of an acceptance rule can be defined as the sum of acceptance probabilities
for all possible transitions. The acceptance rule with maximum eﬃciency is obtained by simultaneously maximizing ψ (xi , xj ) for all pairs of points xj and xi . Since the only constraint on
ψ (xi , xj ) (except for positivity) is that ψ (xi , xj ) is symmetric and ψ (xk , xl ) ≤ q (xl ), for all k
and l, we have ψ (xi , xj ) ≤ q (xj ) and ψ (xi , xj ) ≤ q (xi ). This means that the acceptance rule
with maximum eﬃciency is the Metropolis rule, where
ψij = Min( q(xi), q(xj) ) . (3.21)

3.10.3 Appendix: Sampling Explicitly Given Probability Densities

Three methods for sampling explicitly known probability densities are important, and they are
given by the following three theorems (formulated for a probability density over a 1-dimensional
space):
Theorem 1. Let p be an everywhere nonzero probability density with distribution function
P , given by
P(x) = ∫_{−∞}^{x} p(s) ds , (3.22)

and let r be a random number chosen uniformly at random between 0 and 1. Then the random
number x generated through the formula
x = P⁻¹(r) (3.23)

has probability density p.
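For instance (the density is our choice, not the text's), applying Theorem 1 to the one-sided density p(x) = e^(−x) , x ≥ 0 , whose distribution function is P(x) = 1 − e^(−x) , gives P⁻¹(r) = −ln(1 − r) :

```python
import math
import random

random.seed(5)

n = 200_000
# Theorem 1: x = P^{-1}(r) with r uniform on [0, 1);
# here P^{-1}(r) = -ln(1 - r), and 1 - r is never zero
samples = [-math.log(1.0 - random.random()) for _ in range(n)]

mean = sum(samples) / n   # should approach 1, the mean of p(x) = exp(-x)
```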
Theorem 2. Let p be a nonzero probability density deﬁned on the interval I = [a, b] for which
there exists a positive number M , such that
p(x) ≤ M , (3.24)

and let r and u be two random numbers chosen uniformly at random from the intervals [0, 1]
and I , respectively. If u survives the test
r ≤ p(u)/M , (3.25)

it is a sample of the probability density p.
More specialized, yet useful, is the following way of generating Gaussian random numbers:
Theorem 3. Let r1 and r2 be random numbers chosen uniformly at random between 0 and 1.
Then the random numbers x1 and x2 generated through the formulas

x1 = √(−2 ln r2) cos(2πr1) , x2 = √(−2 ln r2) sin(2πr1)

are independent and Gaussian distributed with zero mean and unit variance.
These theorems are straightforward to use in practice. The proofs are left to the reader as an
exercise.

Chapter 4

Homogeneous Probability Distributions

4.1 Parameters

To describe a physical system (a planet, an elastic sample, etc.) we use physical quantities
(temperature and mass density at some given points, total mass, surface color, etc.). We
examine here the situation where the total number of physical quantities is ﬁnite. The limitation
to a ﬁnite number of quantities may seem essential to some (in inverse problems, the school
of thought developed by Backus and Gilbert) and accessory to others (like the authors of this
text). When we consider a function (for instance, a temperature proﬁle as a function of depth),
we assume that the function has been discretized in suﬃcient detail. By ‘suﬃcient’ we mean
that a limit has practically been attained where the computation of the ﬁnite probability of
any event becomes practically independent of any further reﬁnement of the discretization of
the function1 .
In this section, {x1 , x2 . . . xn } represents a set of n physical quantities, for which we will
assume to have a probability distribution deﬁned. The quantities {x1 , x2 . . . xn } are assumed
to take real values (with, generally, some physical dimensions).
Example 4.1 We may consider, for instance, (i) the mass of a particle, (ii) the temperature at
the center of the Earth, (iii) the value of the fine-structure constant, etc. [End of example.]
Assuming that we have a set of real quantities excludes the possibility that we may have a
quantity that takes only discrete values, like spin ∈ { +1/2 , −1/2 } , or even a non-numerical variable, like organism ∈ { plant , animal } . This is not essential, and the formulas
given here could easily be generalized to the case where we have both discrete and continuous
probabilities. But, as discrete probability distributions have obvious deﬁnitions of marginal and
conditional probability distributions, we do not wish to review them here. On the contrary,
probabilities over continuous manifolds have speciﬁc problems (change of variables, limits, etc.)
that demand our attention.
1 A random function is a function that, at each point, is a random variable. A random function is completely
characterized if, for whatever choice of n points we may make, we are able to exhibit the joint n-dimensional
probability distribution for the n random variables, and this for any value of n . If the considered random
function has some degree of smoothness, there is a limit in the value of n such that any ﬁnite probability
computed using the actual random function is practically identical to the same probability computed from an
n-dimensional discretization of the random function. For an excellent introductory text on random functions, see Pugachev (1965).

[Note: Explain here that we shall use the language of 'manifolds'.]
[Note: Explain here that ’space’ is used as synonymous of ‘manifold’]
In the following, we will meet two distinct categories of uncertain ‘parameters’. The ﬁrst
category consists of physical quantities whose ‘actual values’ are not exactly known but cannot
be analyzed by generating many realizations of the parameter values in a repetitive experiment.
An obvious example of such a parameter is the radius of the earth’s core (say r ). If f (r) is
a probability density over r , we will never say that r is a ‘random variable’; we will rather
say that we have a probability density deﬁned over a ‘physical quantity’. The second category
of parameters are bona ﬁde ‘random variables’, for which we can obtain histograms through
repeated experiments. Such ‘random variables’ do not play any major role in this article.
Although in mathematical texts there is a diﬀerence in notation between a parameter and
a particular value of the parameter (for instance, by denoting them X and x respectively),
we choose here to simplify the notation and use expressions like ‘let x = x0 be a particular
value of the parameter x .’
Note: I have to talk about the commensurability of distances,

ds² = ds_r² + ds_s² , (4.1)

every time I have to define the Cartesian product of two spaces each with its own metric.

4.2 Homogeneous Probability Distributions

In some parameter spaces, there is an obvious definition of distance between points, and therefore of volume. For instance, in the 3D Euclidean space the distance between two points is
just the Euclidean distance (which is invariant under translations and rotations). Should we
choose to parameterize the position of a point by its Cartesian coordinates {x, y, z } , then, the
volume element in the space would be
dV(x, y, z) = dx dy dz .    (4.2)

Should we choose to use geographical coordinates, then the volume element would be

dV(r, ϑ, ϕ) = r² cos ϑ dr dϑ dϕ .    (4.3)

Question: what would be, in this parameter space, a homogeneous probability distribution
of points? Answer: a probability distribution assigning to each region of the space a probability
proportional to the volume of the region.
Then, question: which probability density represents such a homogeneous probability distribution? Let us give the answer in three steps.

• If we use Cartesian coordinates {x, y, z}, as we have dV(x, y, z) = dx dy dz, the probability density representing the homogeneous probability distribution is constant:

f(x, y, z) = k .    (4.4)

• If we use geographical coordinates {r, ϑ, ϕ}, as we have dV(r, ϑ, ϕ) = r² cos ϑ dr dϑ dϕ, the probability density representing the homogeneous probability distribution is (see example 2.3)

g(r, ϑ, ϕ) = k r² cos ϑ .    (4.5)

• Finally, if we use an arbitrary system of coordinates {u, v, w}, in which the volume element of the space is dV(u, v, w) = v(u, v, w) du dv dw, the homogeneous probability distribution is represented by the probability density

h(u, v, w) = k v(u, v, w) .    (4.6)

This is obviously true, since if we calculate the probability of a region A of the space, with
volume V (A) , we get a number proportional to V (A) .
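This can be illustrated numerically. The following sketch (the test region, the sample size, and the choice k = 1 are arbitrary, purely illustrative assumptions) integrates the density of equation 4.5 over a region of the 3D Euclidean space and recovers the Euclidean volume of that region:

```python
import numpy as np

# Sketch: under the homogeneous density g(r, theta, phi) = k r^2 cos(theta)
# of eq. 4.5 (theta = latitude, k = 1), the probability assigned to a region
# equals the Euclidean volume of that region.
rng = np.random.default_rng(0)

# Region: spherical shell 0.5 < r < 1 in the northern hemisphere.
vol_exact = (2 * np.pi / 3) * (1.0 - 0.5**3)  # Euclidean volume of the region

# Monte Carlo integration of r^2 cos(theta) over the coordinate box.
n = 200_000
r = rng.uniform(0.5, 1.0, n)
theta = rng.uniform(0.0, np.pi / 2, n)
box = 0.5 * (np.pi / 2) * (2 * np.pi)            # dr * dtheta * dphi box volume
integral = box * np.mean(r**2 * np.cos(theta))   # integrand is constant in phi

print(integral, vol_exact)   # the two values agree to Monte Carlo accuracy
```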
We can arrive at some conclusions from this example, that are of general validity. First,
the homogeneous probability distribution is represented by a constant probability density only
if we use Cartesian (or rectilinear) coordinates. Two other conclusions can be stated as two
(equivalent) rules:
Rule 4.1 The probability density representing the homogeneous probability distribution is easily
obtained if the expression of the volume element dV (u1 , u2 , . . . ) = v (u1 , u2 , . . . ) du1 du2 . . . of
the space is known, as it is then given by h(u1 , u2 , . . . ) = k v (u1 , u2 , . . . ) , where k is a
proportionality constant (that may have physical dimensions).

Rule 4.2 If there is a metric gij(u1, u2, …) in the space, then, as mentioned above, the volume element is given by dV(u1, u2, …) = √det g(u1, u2, …) du1 du2 …, i.e., we have v(u1, u2, …) = √det g(u1, u2, …). The probability density representing the homogeneous probability distribution is, then, h(u1, u2, …) = k √det g(u1, u2, …).
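As an illustration of rule 4.2, the following sketch recovers the volume factor of equation 4.5 from the standard metric of the 3D Euclidean space in geographical coordinates (here with ϑ written as a latitude, as in the text):

```python
import numpy as np

# Sketch: metric of 3D Euclidean space in geographical coordinates
# (r, theta = latitude, phi): g = diag(1, r^2, r^2 cos^2(theta)).
def sqrt_det_g(r, theta):
    g = np.diag([1.0, r**2, (r * np.cos(theta))**2])
    return np.sqrt(np.linalg.det(g))

# sqrt(det g) reproduces the volume factor r^2 cos(theta) of eqs. 4.3 and 4.5
r, theta = 2.0, 0.3
print(sqrt_det_g(r, theta), r**2 * np.cos(theta))
```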
Rule 4.3 If the expression of the probability density representing the homogeneous probability
distribution is known in one system of coordinates, then, it is known in any other system of
coordinates, through the Jacobian rule (equation ??).
Indeed, in the expression above, g (r, ϑ, ϕ) = k r2 cos ϑ , we recognize the Jacobian between
the geographical and the Cartesian coordinates (where the probability density is constant).
For short, when we say ‘the homogeneous probability density’ we mean ‘the probability density representing the homogeneous probability distribution’. One should remember that, in general, the homogeneous probability density is not constant.
Let us now examine ‘positive parameters’, like a temperature, a period, etc. One of the
properties of the parameters we have in mind is that they occur in pairs of mutually reciprocal
parameters:
Period T = 1/ν  ;  Frequency ν = 1/T
Resistivity ρ = 1/σ  ;  Conductivity σ = 1/ρ
Temperature T = 1/(kβ)  ;  Thermodynamic parameter β = 1/(kT)
Mass density ρ = 1/ℓ  ;  Lightness ℓ = 1/ρ
Compressibility γ = 1/κ  ;  Bulk modulus (uncompressibility) κ = 1/γ .

When physical theories are elaborated, one may freely choose one of these parameters or its
reciprocal.
Sometimes these pairs of equivalent parameters come from a deﬁnition, like when we deﬁne
frequency ν as a function of the period T , by ν = 1/T . Sometimes these parameters arise
when analyzing an idealized physical system. For instance, Hooke’s law, relating stress σij to strain εij, can be expressed as σij = cijkl εkl , thus introducing the stiffness tensor cijkl , or as εij = dijkl σkl , thus introducing the compliance tensor dijkl , inverse of the stiffness tensor.
Then the respective eigenvalues of these two tensors belong to the class of scalars analyzed
here.
Let us take, as an example, the pair conductivity-resistivity (this may be thermal, electric, etc.). Assume we have two samples in the laboratory, S1 and S2, whose resistivities are respectively ρ1 and ρ2. Correspondingly, their conductivities are σ1 = 1/ρ1 and σ2 = 1/ρ2. How should we define the ‘distance’ between the two samples? As, in general, |ρ2 − ρ1| ≠ |σ2 − σ1|, choosing one of the two expressions as the ‘distance’ would be arbitrary. Consider the following definition of ‘distance’ between the two samples:

D(S1, S2) = | log(σ2/σ1) | = | log(ρ2/ρ1) | .    (4.7)

This definition (i) treats symmetrically the two equivalent parameters ρ and σ and, more importantly, (ii) has an invariance of scale (what matters is how many ‘octaves’ we have between the two values, not the plain difference between the values). In fact, it is the only ‘sensible’ definition of distance between the two samples S1 and S2.

Associated with the distance D(x1, x2) = | log(x2/x1) | is the distance element (differential
form of the distance)

dL(x) = dx/x .    (4.8)

This being a ‘one-dimensional volume’, we can now apply rule 4.1 above, to get the expression of the homogeneous probability density for such a positive parameter:

f(x) = k/x .    (4.9)

Defining the reciprocal parameter y = 1/x and using the Jacobian rule we arrive at the
homogeneous probability density for y:

g(y) = k/y .    (4.10)

These two probability densities have the same form: the two reciprocal parameters are treated symmetrically. Introducing the logarithmic parameters

x* = log(x/x0) ;  y* = log(y/y0) ,    (4.11)

where x0 and y0 are arbitrary positive constants, and using the Jacobian rule we arrive at the homogeneous probability densities

f(x*) = k ;  g(y*) = k .    (4.12)

This shows that the logarithm of a positive parameter (of the type considered above) is a
‘Cartesian’ parameter. In fact, it is the consideration of equations 4.12, together with the
Jacobian rule, that allows full understanding of the (homogeneous) probability densities 4.9–
4.10.
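The two defining properties of the logarithmic distance of equation 4.7 are easy to check numerically; the following sketch uses arbitrary sample values:

```python
import math

# Sketch: the distance of eq. 4.7 is symmetric in a reciprocal pair of
# parameters and invariant under a rescaling (change of units).
def D(x1, x2):
    return abs(math.log(x2 / x1))

rho1, rho2 = 3.0, 12.0               # resistivities of two samples
sigma1, sigma2 = 1 / rho1, 1 / rho2  # the corresponding conductivities

print(D(rho1, rho2))                 # distance computed from the resistivities
print(D(sigma1, sigma2))             # same value from the conductivities
print(D(1000 * rho1, 1000 * rho2))   # same value after changing the units
```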
The association of the probability density f (u) = k/u to positive parameters was ﬁrst
made by Jeﬀreys (1939). To honor him, we propose to use the term Jeﬀreys parameters for all
the parameters of the type considered above. The 1/u probability density was advocated by
Jaynes (1968), and a nontrivial use of it was made by Rietsch (1977), in the context of inverse
problems.
Rule 4.4 The homogeneous probability density for a Jeﬀreys quantity u is f (u) = k/u .
Rule 4.5 The homogeneous probability density for a ‘Cartesian parameter’ u (like the logarithm of a Jeffreys parameter, an actual Cartesian coordinate in a Euclidean space, or the Newtonian time coordinate) is f(u) = k (for the Newtonian time, this holds by definition). The homogeneous probability density for an angle describing the position of a point on a circle is also constant.
If a parameter u is a Jeﬀreys parameter, with the homogeneous probability density f (u) =
k/u , then, its inverse, its square, and, in general, any power of the parameter is also a Jeﬀreys
parameter, as it can easily be seen using the Jacobian rule.
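This behavior can be observed by sampling. The sketch below assumes a truncation interval [a, b] (an illustrative assumption, since the k/x density is improper on (0, ∞)); it draws x from the density proportional to 1/x by inversion of its cumulative distribution, and checks that log x is uniformly distributed, i.e., ‘Cartesian’:

```python
import numpy as np

# Inverse-CDF sampling from the density proportional to 1/x on [a, b]:
# F(x) = log(x/a) / log(b/a), so x = a * (b/a)**u with u uniform on [0, 1].
rng = np.random.default_rng(1)
a, b = 1.0, 100.0
u = rng.uniform(size=100_000)
x = a * (b / a) ** u

# The logarithm of the (truncated) Jeffreys parameter is 'Cartesian':
# log(x) is uniform on [log a, log b].
hist, _ = np.histogram(np.log(x), bins=10, range=(np.log(a), np.log(b)))
print(hist / x.size)   # all ten bin frequencies are close to 0.1
```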
Rule 4.6 Any power of a Jeffreys quantity (including its inverse) is a Jeffreys quantity.

It is important to recognize when we do not face a Jeffreys parameter. Among the many parameters used in the literature to describe an isotropic linear elastic medium we find parameters like the Lamé coefficients λ and µ, the bulk modulus κ, the Poisson ratio σ, etc. A simple inspection of the theoretical range of variation of these parameters shows that the first Lamé parameter λ and the Poisson ratio σ may take negative values, so they are certainly not Jeffreys parameters. In contrast, Hooke’s law σij = cijkl εkl , defining a linearity between stress σij and strain εij , defines the positive definite stiffness tensor cijkl or, if we write εij = dijkl σkl , defines its inverse, the compliance tensor dijkl . The two reciprocal tensors cijkl and dijkl are ‘Jeffreys tensors’. This is a notion that would take too long to develop here, but we can give the following rule:
Rule 4.7 The eigenvalues of a Jeﬀreys tensor are Jeﬀreys quantities2 .
As the two (different) eigenvalues of the stiffness tensor cijkl are λκ = 3κ (with multiplicity 1) and λµ = 2µ (with multiplicity 5), we see that the uncompressibility modulus κ and the shear modulus µ are Jeffreys parameters3 (as is any parameter proportional to them, or any
power of them, including the inverses). If for some reason, instead of working with κ and µ ,
we wish to work with other elastic parameters, like for instance the Young modulus Y and
the Poisson ratio σ , then the homogeneous probability distribution must be found using the
Jacobian of the transformation between (Y, σ ) and (κ, µ) . This is done in appendix 4.3.2.
Some probability densities have conspicuous ‘dispersion parameters’, like the σ’s in the normal probability density f(x) = k exp(−(x − x0)²/(2σ²)), in the lognormal probability density g(X) = (k/X) exp(−(log(X/X0))²/(2σ²)), or in the Fisher probability density h(ϑ, ϕ) = k cos ϑ exp(sin ϑ/σ²). A consistent probability model requires that when the dispersion parameter σ tends to infinity, the probability density tends to the homogeneous probability distribution. For instance, in the three examples just given, f(x) → k, g(X) → k/X and h(ϑ, ϕ) → k cos ϑ, which are the respective homogeneous probability densities for a Cartesian quantity, a Jeffreys quantity and the geographical coordinates on the surface of the sphere. We can state the
Rule 4.8 A probability density is only consistent if it tends to the homogeneous probability
density when its dispersion parameters tend to inﬁnity.
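Rule 4.8 can be checked numerically on the lognormal example. In the sketch below (the values of X and σ are arbitrary choices, and the lognormal is written with the 1/X prefactor needed for the limit g(X) → k/X), the ratio of the lognormal density to 1/X flattens as σ grows:

```python
import numpy as np

# Sketch: the lognormal density (1/X) exp(-(log(X/X0))^2 / (2 s^2)) tends to
# the Jeffreys homogeneous density k/X as s -> infinity; equivalently, its
# ratio to 1/X becomes constant over any fixed range of X.
def lognormal(X, X0=1.0, s=1.0):
    return np.exp(-np.log(X / X0) ** 2 / (2 * s**2)) / X

X = np.array([0.1, 1.0, 10.0])
spreads = []
for s in (1.0, 10.0, 100.0):
    ratio = lognormal(X, s=s) * X          # lognormal divided by 1/X
    spreads.append(float(ratio.max() - ratio.min()))

print(spreads)   # the spread shrinks toward zero as s grows
```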
As an example, using the normal probability density f(x) = k exp(−(x − x0)²/(2σ²)) for a Jeffreys parameter is not consistent. Note that it would assign a finite probability to negative values of a parameter that, by definition, is positive. More technically, this would violate our postulate ??.
There is a problem of terminology in the Bayesian literature. The homogeneous probability
distribution is a very special distribution. When the problem of selecting a ‘prior’ probability
distribution arises, in the absence of any information except the fundamental symmetries of the
problem, one may select as prior probability distribution the homogeneous distribution. But
enthusiastic Bayesians do not call it ‘homogeneous’ but ‘noninformative’. We do not agree with this. The homogeneous probability distribution is as informative as any other distribution; it is just the homogeneous one4.

2 This solves the complete problem for isotropic tensors only. It is beyond the scope of this text to propose rules valid for general anisotropic tensors: the necessary mathematics have not yet been developed.

3 The definition of the elastic constants was made before the tensorial structure of the theory was understood. Seismologists, today, should never introduce, at a theoretical level, parameters like the first Lamé coefficient λ or the Poisson ratio. Instead they should use κ and µ (and their inverses). In fact, our suggestion, in this IASPEI volume, is to use the true eigenvalues of the stiffness tensor, λκ = 3κ and λµ = 2µ, that we propose to call the eigen-bulk-modulus and the eigen-shear-modulus.
In general, each time we consider an abstract parameter space, each point being represented
by some parameters x = {x1 , x2 . . . xn } , we will start by solving the (sometimes nontrivial)
problem of deﬁning a distance between points that respects the necessary symmetries of the
problem. Only exceptionally will this distance be a quadratic expression of the parameters (coordinates) being used (i.e., only exceptionally will our parameters correspond to ‘Cartesian
coordinates’ in the space). From this distance, a volume element dV (x) = v (x) dx will be
deduced, from where the expression f (x) = k v (x) of the homogeneous probability density
will follow. We emphasize the need of deﬁning a distance in the parameter space, from which
the notion of homogeneity will follow. In this, we slightly depart from the original work by
Jeffreys and Jaynes.

4 Note that Shannon’s definition of the information content (Shannon, 1948) of a discrete probability, I = Σi pi log pi , does not generalize into a definition of the information content of a probability density (the ‘definition’ I = ∫ dx f(x) log f(x) is not invariant under a change of variables). Rather, one may define the ‘Kullback distance’ (Kullback, 1967) from the probability density g(x) to the probability density f(x) as

I(f|g) = ∫ dx f(x) log( f(x)/g(x) ) .

This means, in particular, that we can never know if a single probability density is, by itself, informative or not. The equation above defines the information gain when we pass from g(x) to f(x) (I is always positive). But there is also an information gain when we pass from f(x) to g(x): I(g|f) = ∫ dx g(x) log( g(x)/f(x) ). One should note that (i) the ‘Kullback distance’ is not a distance (the distance from f(x) to g(x) does not equal the distance from g(x) to f(x)); (ii) for the ‘Kullback distance’ I(f|g) = ∫ dx f(x) log( f(x)/g(x) ) to be defined, the probability density f(x) has to be ‘absolutely continuous’ with respect to g(x), which amounts to saying that f(x) must vanish wherever g(x) vanishes. We have postulated that any probability density f(x) is absolutely continuous with respect to the homogeneous probability distribution µ(x), for the homogeneous probability distribution ‘fills the space’. Then, one may take the convention of measuring the information content of any probability density f(x) with respect to the homogeneous probability density:

I(f) ≡ I(f|µ) = ∫ dx f(x) log( f(x)/µ(x) ) .

The homogeneous probability density is then ‘noninformative’, I(µ) = I(µ|µ) = 0, but this is just by definition.

4.3 Appendixes

4.3.1 Appendix: First Digit of the Fundamental Physical Constants

Note: mention here figure 4.1, and explain. Say that the negative numbers of the table are ‘false negatives’. Figure 4.3: statistics of surfaces and populations of States and Islands.
Figure 4.1: Statistics of the first digit in the table of Fundamental Physical Constants (1998 CODATA least-squares adjustment; Mohr and Taylor, 2001). I have indiscriminately taken all the constants of the table (263 in total). The ‘model’ corresponds to the prediction that the relative frequency of digit n in a base-K system of numeration is logK((n + 1)/n); here, K = 10. [Bar chart: actual frequencies vs. model, digits 1 to 9.]
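The model frequencies quoted in the caption can be computed directly; a small sketch for K = 10:

```python
import math

# The 'model' of the caption: predicted relative frequency of first digit n
# in a base-K system of numeration is log_K((n + 1)/n); here K = 10.
freqs = {n: math.log10((n + 1) / n) for n in range(1, 10)}
print({n: round(f, 3) for n, f in freqs.items()})

# The nine frequencies telescope to log10(10/1) = 1.
print(sum(freqs.values()))
```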
Figure 4.2: The beginning of the list of the States, Territories and Principal Islands of the World (headed ‘STATES, TERRITORIES & PRINCIPAL ISLANDS OF THE WORLD’), in the Times Atlas of the World (Times Books, 1983), with the first digit of the surfaces and populations highlighted. Each entry gives the name [plate number], a description, the surface in square kilometers and in square miles, and the population (e.g., Afghanistan [31], Capital: Kabul; 636,267 sq. km; 245,664 sq. miles; population 15,551,358). The statistics of this first digit is shown in figure 4.3.

Figure 4.3: Statistics of the first digit in the table of the surfaces (both in square kilometers and square miles) and populations of the States, Territories and Principal Islands of the World, as printed in the first few pages of the Times Atlas of the World (Times Books, 1983). As for figure 4.1, the ‘model’ corresponds to the prediction that the relative frequency of digit n is log10((n + 1)/n). [Bar chart: actual frequencies vs. model, digits 1 to 9.]

4.3.2 Appendix: Homogeneous Probability for Elastic Parameters

In this appendix, we start from the assumption that the uncompressibility modulus and the shear modulus are Jeffreys parameters (they are the eigenvalues of the stiffness tensor cijkl), and find the expression of the homogeneous probability density for other sets of elastic parameters, like the set { Young modulus, Poisson ratio } or the set { longitudinal wave velocity, transverse wave velocity }.

4.3.2.1 Uncompressibility Modulus and Shear Modulus

The ‘Cartesian parameters’ of elastic theory are the logarithm of the uncompressibility modulus
and the logarithm of the shear modulus
κ* = log(κ/κ0) ;  µ* = log(µ/µ0) ,    (4.13)

where κ0 and µ0 are two arbitrary constants. The homogeneous probability density is just constant for these parameters (a constant that we set arbitrarily to one):

fκ*µ*(κ*, µ*) = 1 .    (4.14)

As is often the case for homogeneous ‘probability’ densities, fκ*µ*(κ*, µ*) is not normalizable. Using the Jacobian rule, it is easy to transform this probability density into the equivalent one for the positive parameters themselves:

fκµ(κ, µ) = 1/(κµ) .    (4.15)

This 1/x form of the probability density remains invariant if we take any power of κ and of µ. In particular, if instead of using the uncompressibility κ we use the compressibility γ = 1/κ, the Jacobian rule simply gives fγµ(γ, µ) = 1/(γµ).
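The invariance of the 1/x form under reciprocation can be checked with a finite-difference version of the Jacobian rule (a sketch; the sample values of γ are arbitrary):

```python
# Sketch: the Jacobian rule f(gamma) = f(kappa(gamma)) * |d kappa / d gamma|,
# applied to kappa = 1/gamma and f(kappa) = 1/kappa, returns 1/gamma again.
def f_kappa(kappa):
    return 1.0 / kappa

def f_gamma(gamma, h=1e-7):
    kappa = 1.0 / gamma
    # |d kappa / d gamma| by central finite differences
    dk_dg = abs((1.0 / (gamma + h) - 1.0 / (gamma - h)) / (2 * h))
    return f_kappa(kappa) * dk_dg

for gamma in (0.5, 2.0, 7.0):
    print(f_gamma(gamma), 1.0 / gamma)   # the 1/x form is preserved
```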
Associated with the probability density 4.14 there is the Euclidean definition of distance

ds² = (dκ*)² + (dµ*)² ,    (4.16)

that corresponds, in the variables (κ, µ), to

ds² = (dκ/κ)² + (dµ/µ)² ,    (4.17)

i.e., to the metric

( gκκ  gκµ )   =   ( 1/κ²   0   )
( gµκ  gµµ )       (  0    1/µ² ) .    (4.18)

4.3.2.2 Young Modulus and Poisson Ratio

The Young modulus Y and the Poisson ratio σ can be expressed as a function of the
uncompressibility modulus and the shear modulus as
Y = 9κµ/(3κ + µ) ;  σ = (1/2) (3κ − 2µ)/(3κ + µ) ,    (4.19)

or, reciprocally,

κ = Y/(3(1 − 2σ)) ;  µ = Y/(2(1 + σ)) .    (4.20)

The absolute value of the Jacobian of the transformation is easily computed,

J = Y/(2(1 + σ)²(1 − 2σ)²) ,    (4.21)

and the Jacobian rule transforms the probability density 4.15 into

fYσ(Y, σ) = (1/(κµ)) J = 3/( Y (1 + σ)(1 − 2σ) ) ,    (4.22)

which is the probability density representing the homogeneous probability distribution for elastic parameters using the variables (Y, σ). This probability density is the product of the probability density 1/Y for the Young modulus and the probability density

g(σ) = 3/( (1 + σ)(1 − 2σ) )    (4.23)

for the Poisson ratio. This probability density is represented in figure 4.4. From the definition
of σ it can be demonstrated that its values must range in the interval −1 < σ < 1/2 , and
we see that the homogeneous probability density is singular at these points. Although most
rocks have positive values of the Poisson ratio, there are materials where σ is negative (e.g.,
Yeganeh-Haeri et al., 1992).
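The Jacobian of equation 4.21 (and hence the density 4.22) can be verified by finite differences; the evaluation point in this sketch is an arbitrary choice:

```python
import numpy as np

# Sketch: finite-difference check of the Jacobian of eq. 4.21, using the
# inverse map of eq. 4.20.
def kappa_mu(Y, s):
    return np.array([Y / (3 * (1 - 2 * s)), Y / (2 * (1 + s))])

def jacobian_fd(Y, s, h=1e-6):
    dY = (kappa_mu(Y + h, s) - kappa_mu(Y - h, s)) / (2 * h)
    ds = (kappa_mu(Y, s + h) - kappa_mu(Y, s - h)) / (2 * h)
    return abs(dY[0] * ds[1] - dY[1] * ds[0])

Y, s = 100.0, 0.25
J_analytic = Y / (2 * (1 + s) ** 2 * (1 - 2 * s) ** 2)
print(jacobian_fd(Y, s), J_analytic)   # both give the same value
```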
Figure 4.4: The homogeneous probability density for the Poisson ratio, as deduced from the condition that the uncompressibility and the shear modulus are Jeffreys parameters. [Plot of g(σ) for −1 < σ < +0.5; the density diverges at both endpoints.]
It may be surprising that the probability density in ﬁgure 4.4 corresponds to a homogeneous
distribution. If we have many samples of elastic materials, and if their logarithmic uncompressibility modulus κ∗ and their logarithmic shear modulus µ∗ have a constant probability
density (what is the deﬁnition of homogeneous distribution of elastic materials), then, σ will
be distributed according to the g (σ ) of the ﬁgure.
To be complete, let us mention that in a change of variables xi → xI , a metric gij changes to

gIJ = ΛIi ΛJj gij = (∂xi/∂xI)(∂xj/∂xJ) gij .    (4.24)

The metric 4.17 then transforms into
gYY = 2/Y² ,  gYσ = gσY = 2/((1 − 2σ) Y) − 1/((1 + σ) Y) ,  gσσ = 4/(1 − 2σ)² + 1/(1 + σ)² .    (4.25)

The surface element is

dSYσ(Y, σ) = √det g dY dσ = 3 dY dσ / ( Y (1 + σ)(1 − 2σ) ) ,    (4.26)

a result from which expression 4.22 can be inferred.
Although the Poisson ratio has a historical interest, it is not a simple parameter, as shown
by its theoretical bounds −1 < σ < 1/2 , or the form of the homogeneous probability density
(ﬁgure 4.4). In fact, the Poisson ratio σ depends only on the ratio κ/µ (incompressibility
modulus over shear modulus), as we have
(1 + σ)/(1 − 2σ) = 3κ/(2µ) .    (4.27)

The ratio J = κ/µ of two Jeffreys parameters being a Jeffreys parameter, a useful pair
of Jeﬀreys parameters may be {κ, J } . The ratio J = κ/µ has a physical interpretation
easy to grasp (as the ratio between the uncompressibility and the shear modulus), and should
be preferred, in theoretical developments, to the Poisson ratio, as it has simpler theoretical
properties. As the name of the nearest metro station to the university of one of the authors
(A.T.) is Jussieu, we accordingly call J the Jussieu ratio.

4.3.2.3 Longitudinal and Transverse Wave Velocities

Equation 4.15 gives the probability density representing the homogeneous probability distribution of elastic media, when parameterized by the uncompressibility modulus and the shear modulus:

fκµ(κ, µ) = 1/(κµ) .    (4.28)

Should we have been interested, in addition, in the mass density ρ, then we would have arrived (as ρ is another Jeffreys parameter) at the probability density

fκµρ(κ, µ, ρ) = 1/(κµρ) .    (4.29)

This is the starting point for this section.
What about the probability density representing the homogeneous probability distribution of elastic materials when we use as parameters the mass density and the two wave velocities? The longitudinal wave velocity α and the shear wave velocity β are related to the uncompressibility modulus κ and the shear modulus µ through

α = √((κ + 4µ/3)/ρ) ;  β = √(µ/ρ) ,    (4.30)

and a direct use of the Jacobian rule transforms the probability density 4.29 into

fαβρ(α, β, ρ) = 1/( ραβ (3/4 − β²/α²) ) ,    (4.31)

which is the answer to our question.
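Equation 4.31 can be checked numerically: pushing the density 1/(κµρ) through the change of variables with a finite-difference Jacobian yields a function proportional to the expression above, the proportionality constant being absorbed into the arbitrary k. The test points in this sketch are arbitrary:

```python
import numpy as np

# Sketch: push the homogeneous density 1/(kappa mu rho) through the change of
# variables (kappa, mu, rho) -> (alpha, beta, rho), and compare with eq. 4.31.
def kmr(a, b, r):
    # inverse of eq. 4.30: kappa = rho (alpha^2 - 4 beta^2/3), mu = rho beta^2
    return np.array([r * (a**2 - 4 * b**2 / 3), r * b**2, r])

def pushed_density(a, b, r, h=1e-6):
    cols = []
    for i in range(3):
        p = [a, b, r]; m = [a, b, r]
        p[i] += h; m[i] -= h
        cols.append((kmr(*p) - kmr(*m)) / (2 * h))
    J = abs(np.linalg.det(np.array(cols).T))  # |d(kappa,mu,rho)/d(alpha,beta,rho)|
    kappa, mu, rho = kmr(a, b, r)
    return J / (kappa * mu * rho)

pts = [(5.0, 2.0, 3.0), (8.0, 3.0, 2.0)]      # both satisfy alpha > 2 beta/sqrt(3)
ratios = [pushed_density(a, b, r) * (r * a * b * (0.75 - b**2 / a**2))
          for a, b, r in pts]
print(ratios)   # the same constant at both points: the arbitrary factor k
```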
That this function becomes singular for α = (2/√3) β is just due to the fact that the “boundary” α = (2/√3) β cannot be crossed: the fundamental inequalities κ > 0, µ > 0 impose that the two velocities are linked by the inequality constraint

α > (2/√3) β .    (4.32)

Let us focus for a moment on the homogeneous probability density for the two wave velocities (α, β) existing in an elastic solid (disregarding here the mass density ρ). We have

fαβ(α, β) = 1/( αβ (3/4 − β²/α²) ) .    (4.33)

It is displayed in figure 4.5.

Figure 4.5: The joint homogeneous probability density for the velocities (α, β) of the
longitudinal and transverse waves propagating in an elastic solid. Contrary to the incompressibility and the shear modulus, which are independent parameters, the longitudinal wave velocity and the transverse wave velocity are not independent (see text for an explanation). The scales for the velocities are unimportant: it is possible to multiply the two velocity scales by any factor without modifying the form of the probability (which is itself defined up to a multiplicative constant).

Let us demonstrate that the marginal probability density for both α and β is of the form 1/x. We have to compute

fα(α) = ∫₀^{√3 α/2} dβ f(α, β)    (4.34)

and

fβ(β) = ∫_{2β/√3}^{+∞} dα f(α, β)    (4.35)

(the bounds of integration can easily be understood by a look at figure 4.5). These integrals
can be evaluated as

fα(α) = lim_{ε→0} ∫_{√ε √3 α/2}^{(1−ε) √3 α/2} dβ f(α, β) = lim_{ε→0} [ (4/3) log((1 − ε)/ε) ] (1/α)    (4.36)

and

fβ(β) = lim_{ε→0} ∫_{(1+ε) 2β/√3}^{2β/(√ε √3)} dα f(α, β) = lim_{ε→0} [ (2/3) log((1/ε − 1)/ε) ] (1/β) .    (4.37)

The numerical factors tend to infinity, but this is only one more manifestation of the fact that
the homogeneous probability densities are usually improper (not normalizable). Dropping these
numerical factors gives
fα(α) = 1/α    (4.38)

and

fβ(β) = 1/β .    (4.39)

It is interesting to note that we have here an example of two parameters that look like Jeffreys parameters but are not, because they are not independent (the homogeneous joint probability density is not the product of the homogeneous marginal probability densities).

It is also worth knowing that using slownesses instead of velocities (n = 1/α, η = 1/β) leads, as one would expect, to

fnηρ(n, η, ρ) = 1/( ρnη (3/4 − n²/η²) ) .    (4.40)

4.3.3 Appendix: Homogeneous Distribution of Second Rank Tensors

The usual definition of the norm of a tensor provides the only natural definition of distance in
the space of all possible tensors. This shows that, when using a Cartesian system of coordinates,
the components of a tensor are the ‘Cartesian coordinates’ in the 6D space of symmetric tensors.
The homogeneous distribution is then represented by a constant (nonnormalizable) probability
density:
f(σxx, σyy, σzz, σxy, σyz, σzx) = k .    (4.41)

Instead of using the components, we may use the three eigenvalues {λ1, λ2, λ3} of the tensor and the three Euler angles {ψ, θ, ϕ} defining the orientation of the eigendirections in the space. As the Jacobian of the transformation

{σxx, σyy, σzz, σxy, σyz, σzx} → {λ1, λ2, λ3, ψ, θ, ϕ}    (4.42)

is

∂(σxx, σyy, σzz, σxy, σyz, σzx) / ∂(λ1, λ2, λ3, ψ, θ, ϕ) = (λ1 − λ2)(λ2 − λ3)(λ3 − λ1) sin θ ,    (4.43)

the homogeneous probability density 4.41 transforms into
g(λ1, λ2, λ3, ψ, θ, ϕ) = k (λ1 − λ2)(λ2 − λ3)(λ3 − λ1) sin θ .    (4.44)

Although this is not obvious, this probability density is isotropic in spatial directions (i.e., the
3D referentials deﬁned by the three Euler angles are isotropically distributed). In this sense,
we recover ‘isotropy’ as a special case of ‘homogeneity’.
Rule 4.8, which requires that any probability density on the variables {λ1, λ2, λ3, ψ, θ, ϕ} tend to the homogeneous probability density 4.44 when the ‘dispersion parameters’ tend to infinity, imposes a strong constraint on the form of acceptable probability densities that is generally overlooked.
For instance, a Gaussian model for the variables {σxx, σyy, σzz, σxy, σyz, σzx} is consistent (as the limit of a Gaussian is a constant). This induces, via the Jacobian rule, a probability density for the variables {λ1, λ2, λ3, ψ, θ, ϕ}, a probability density that is not simple, but consistent. A Gaussian model for the parameters {λ1, λ2, λ3, ψ, θ, ϕ} would not be consistent.

Chapter 5

Basic Measurements

Note: Complete and expand what follows:
I take here a probabilistic point of view. The axioms of probability theory apply to diﬀerent
situations. One is the traditional statistical analysis of random phenomena, another one is
the description of (more or less) subjective states of information on a system. For instance,
estimation of the uncertainties attached to any measurement usually involves both uses of
probability theory: some uncertainties contributing to the total uncertainty are estimated using
statistics, while some other uncertainties are estimated using informed scientiﬁc judgement
about the quality of an instrument, about eﬀects not explicitly taken into account, etc. The
International Organization for Standardization (ISO) in Guide to the Expression of Uncertainty
in Measurement (1993), recommends that the uncertainties evaluated by statistical methods are
named ‘type A’ uncertainties, and those evaluated by other means (for instance, using Bayesian
arguments) are named ‘type B’ uncertainties. It also recommends that former classiﬁcations,
for instance into ‘random’ and ‘systematic uncertainties’, should be avoided. In the present
text, we accept ISO’s basic point of view, and extend it, by underplaying the role assigned by
ISO to the particular Gaussian model for uncertainties (see section 5.8) and by not assuming
that the uncertainties are ‘small’.

5.1 Terminology

Note: Introduce here the ISO terminology for analyzing uncertainties in measurements.
Note: Say that we are interested in volumetric probabilities, not ‘uncertainties’.
Note: For the time being, this section is written in telegraphic style. It will, obviously, be rewritten.
Measurand: particular quantity subject to measurement. It is the input to the measuring
instrument.
Input may be a length; output may be an electric tension. They may not be the same
physical quantity.
For instance, the input of a seismometer is a displacement, the output is a voltage. At a
given time, the voltage is a convolution of the past input with a transfer function.

5.2 Old text: Measuring physical parameters

To define the experimental procedure that will lead to a “measurement” we need to conceptualize the objects of the “universe”: do we have point particles or a continuous medium? Any instrument that we can build will have finite accuracy, as any manufacture is imperfect. Also, during the measurement act, the instrument will always be subject to unwanted solicitations (like uncontrolled vibrations).

This is why, even if the experimenter postulates the existence of a well-defined “true value” of the measured parameter, she/he will never be able to measure it exactly. Careful modeling of experimental uncertainties is not easy. Sometimes, the result of a measurement of a parameter
p is presented as p = p0 ± σ , where the interpretation of σ may be diverse. For instance, the
experimenter may imagine a bell-shaped probability density around p0 representing her/his
state of information “on the true value of the parameter”. The constant σ can be the standard
deviation (or mean deviation, or other estimator of dispersion) of the probability density used
to model the experimental uncertainty.
In part, the shape of this probability density may come from histograms of observed or
expected ﬂuctuations. In part, it will come from a subjective estimation of the defects of the
unique pieces of the instrument. We postulate here that the result of any measurement can,
in all generality, be described by deﬁning a probability density over the measured parameter,
representing the information brought by the experiment on the “true”, unknowable, value of the
parameter. The oﬃcial guidelines for expressing uncertainty in measurement, as given by the
International Organization for Standardization (ISO) and the National Institute of Standards
and Technology1 , although stressing the special notion of standard deviation, are consistent with
the possible use of general probability distributions to express the result of a measurement, as
advocated here.
Not every shape of the density function is acceptable. For instance, the use of a Gaussian
density to represent the result of a measurement of a positive quantity (like an electric resistivity) would give a ﬁnite probability for negative values of the variable, which is inconsistent
(a lognormal probability density, on the contrary, could be acceptable).
In the event of an “inﬁnitely bad measurement” (like when, for instance, an unexpected
event prevents, in fact, any meaningful measure) the result of the measurement should be
described using the null information probability density introduced above. In fact, when the
density function used to represent the result of a measurement has a parameter σ describing
the “width” of the function, it is the limit of the density function for σ → ∞ that should
represent a measurement of inﬁnitely bad quality. This is consistent, for instance, with the use
of a lognormal probability density for a parameter like an electric resistivity r , as the limit
of the lognormal for σ → ∞ is the 1/r function, which is the right choice of noninformative
probability density for r .
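This limit can be illustrated numerically (a minimal sketch; the value of σ and the test points are arbitrary): for very large σ the ratio of lognormal density values at two points approaches the ratio predicted by a density proportional to 1/r.

```python
import math

def lognormal_pdf(r, r0=1.0, sigma=1.0):
    """Lognormal probability density over a positive parameter r (e.g. a resistivity)."""
    return (1.0 / (r * sigma * math.sqrt(2.0 * math.pi))
            * math.exp(-(math.log(r / r0) ** 2) / (2.0 * sigma ** 2)))

# For very large sigma the exponential factor flattens out, so the density
# behaves like a constant times 1/r; then f(a)/f(b) tends to b/a.
sigma = 1e6
a, b = 2.0, 50.0
ratio = lognormal_pdf(a, 1.0, sigma) / lognormal_pdf(b, 1.0, sigma)
print(ratio)  # close to b / a = 25
```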
Another example of possible probability density to represent the result of a measurement
of a parameter p is to take the noninformative probability density for p1 < p < p2 and
zero outside. This ﬁxes strict bounds for possible values of the parameter, and tends to the
noninformative probability density when the bounds tend to inﬁnity.
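A minimal sketch of such a bounded density, assuming the parameter is of the positive type discussed above (so that the noninformative density is proportional to 1/p); the bounds p1, p2 are invented for the example, and the normalization is checked numerically:

```python
import math

def bounded_noninformative(p, p1=1.0, p2=100.0):
    """1/p density truncated to the strict bounds p1 < p < p2, normalized."""
    if p1 < p < p2:
        return 1.0 / (p * math.log(p2 / p1))
    return 0.0

# Normalization check: the integral of 1/(p ln(p2/p1)) from p1 to p2 is 1.
n = 100_000
p1, p2 = 1.0, 100.0
h = (p2 - p1) / n
total = sum(bounded_noninformative(p1 + (i + 0.5) * h, p1, p2) * h for i in range(n))
print(round(total, 4))  # close to 1.0
```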
The point of view proposed here will be consistent with the use of "theoretical parameter correlations" as proposed in section ??, so that there is no difference, from our point of view, between a "simple measurement" and a measurement using physical theories, including, perhaps, sophisticated inverse methods.

1 Guide to the expression of uncertainty in measurement, International Organization for Standardization (ISO), Switzerland, 1993. B.N. Taylor and C.E. Kuyatt, 1994, Guidelines for evaluating and expressing the uncertainty of NIST measurement results, NIST Technical Note 1297.

5.3 From ISO

The International Organization for Standardization (ISO) has published (ISO, 1993) a "Guide
to the expression of uncertainty in measurement”, which is the result of a joint work with the
BIPM2 , the IEC3 and the OIML4 . The recommendations of the Guide have also been adopted
by the U.S. National Institute of Standards and Technology (Taylor and Kuyatt, 1994).
These recommendations have the advantage of being widely accepted (in addition to being legal). It is therefore important to see to what extent the approach proposed in this book to describe the result of a measurement is consistent with that proposed by ISO.

5.3.1 Proposed vocabulary to be used in metrology

In the definitions that follow, the use of parentheses around certain words of some terms means
that the words may be omitted if this is unlikely to cause confusion.
5.3.1.1 (measurable) quantity:

attribute of a phenomenon, body or substance that may be distinguished qualitatively and
determined quantitatively.
5.3.1.2 value (of a quantity):

magnitude of a particular quantity generally expressed as a unit of measurement multiplied by
a number.
5.3.1.3 true value (of a quantity):

Definition not reproduced here.
Comments from the ISO guide: The term “true value of a measurand” or of a quantity
(often truncated to “true value”) is avoided in this guide because the word “true” is viewed as
redundant. “Measurand” means “particular quantity subject to measurement”, hence “value of
a measurand” means “value of a particular quantity subject to measurement”. Since “particular
quantity” is generally understood to mean a deﬁnite or speciﬁed quantity, the adjective “true”
in “true value of a measurand” (or in “true value of a quantity”) is unnecessary — the “true”
value of the measurand (or quantity) is simply the value of the measurand (or quantity). In
addition, as indicated in the discussion above, a unique “true” value is only an idealized concept.
My comments: I have not reproduced the deﬁnition of the term “true value” because i)
I do not understand it, and ii) it does not seem consistent with the comment above (that I
understand perfectly).
5.3.1.4 measurement:

set of operations having the object of determining a value of a quantity.

2 Bureau International des Poids et Mesures
3 International Electrotechnical Commission
4 International Organization of Legal Metrology

My comments: I do not agree. The object of a measurement is not to determine "a value"
of a quantity, but, rather, to obtain a “state of information” on the (true) value of a quantity. The proposed deﬁnition is acceptable only in the particular case when the information
obtained in the measurement can be represented by a probability density that, being practically monomodal, can be well described by a central estimator (the “determined value” of the
quantity) and an estimator of dispersion (the “uncertainty” of the measurement).
5.3.1.5 measurand:

particular quantity subject to measurement.
Comments from the ISO guide: The speciﬁcation of a measurand may require statements
about quantities such as time, temperature and pressure.
5.3.1.6 influence quantity:

quantity that is not the measurand but that affects the result of the measurement.
5.3.1.7 result of a measurement:

value attributed to a measurand, obtained by measurement.
My comments: see comments in “measurement”.
5.3.1.8 uncertainty (of measurement):

parameter, associated with the result of a measurement, that characterizes the dispersion of
the values that could reasonably be attributed to the measurand.
Comments from the ISO guide: The word “uncertainty” means doubt, and thus in its
broadest sense “uncertainty of measurement” means doubt about the validity of the result of
a measurement. Because of the lack of diﬀerent words for this general concept of uncertainty
and the speciﬁc quantities that provide quantitative measures of the concept, for example, the
standard deviation, it is necessary to use the word “uncertainty” in these two diﬀerent senses.
More comments from the ISO guide: The deﬁnition of uncertainty of measurement is an
operational one that focuses on the measurement result and its evaluated uncertainty. However,
it is not inconsistent with other concepts of uncertainty of measurement, such as i) a measure
of the possible error in the estimated value of the measurand as provided by the result of a
measurement; ii) an estimate characterizing the range of values within which the true value of
a measurand lies. Although these two traditional concepts are valid as ideals, they focus on
unknowable quantities: the “error” of the result of a measurement and the “true value” of the
measurand (in contrast to its estimated value), respectively.
Still more comments from the ISO guide: Uncertainty of measurement comprises, in general,
many components. Some of these components may be evaluated from the statistical distribution
of the results of series of measurements and can be characterized by experimental standard
deviations. The other components, which can also be characterized by standard deviations, are
evaluated from assumed probability distributions based on experience or other information.
My comments: I could almost agree with this definition, but would rather say that, as the result of a measurement is a probability density, the uncertainty, as a parameter, is any estimator of dispersion associated with the probability density. I was pleasantly surprised to discover that the ISO guidelines accept probability distributions coming from subjective knowledge as an essential part of the description of the results of a measurement. One could fear that normal
statistical practices, which exclude Bayesian (subjective) reasoning, had been exclusively adopted. I am personally inclined (as this book demonstrates) to push the other way, and to reject the notion of "statistical distribution of results of series of measurements": the maximum generality is obtained when each individual measurement is used, and the well-known statistical rules for combining individual measurement "results" will appear by themselves when working properly at the elementary level. At most, the rules proposed by statistical texts are a way of (approximately) short-circuiting some of the steps of the inference methods proposed in this book.

5.3.2 Some basic concepts

Note: what follows is important for the chapter on "physical theories" too.
In practice, the required speciﬁcation or deﬁnition of the measurand is dictated by the
required accuracy of [the] measurement. The measurand should be deﬁned with suﬃcient
completeness with respect to the required accuracy so that for all practical purposes associated
with the measurement its value is unique. It is in this sense that the expression “value of the
measurand” is used in this Guide.
Example: If the length of a nominally one-metre long steel bar is to be determined to micrometre accuracy, its specification should include the temperature and pressure at which the length is defined. Thus the measurand should be specified as, for example, the length of the
bar at 35.00 ◦ C and 101 325 Pa (plus any other deﬁning parameters deemed necessary, such
as the way the bar is to be supported). However, if the length is to be determined to only
millimetre accuracy, its speciﬁcation would not require a deﬁning temperature or pressure or a
value for any other deﬁning parameter.
Note: Incomplete deﬁnition of the measurand can give rise to a component of uncertainty
suﬃciently large that it must be included in the evaluation of the uncertainty of the measurement result.
Note: The ﬁrst step in making a measurement is to specify the measurand — the quantity
to be measured; the measurand cannot be speciﬁed by a value but only by a description of a
quantity. However, in principle, a measurand cannot be completely described without an inﬁnite
amount of information. Thus, to the extent that it leaves room for interpretation, incomplete
deﬁnition of the measurand introduces into the uncertainty of the result of a measurement a
component of uncertainty that may or may not be signiﬁcant relative to the accuracy required
of the measurement.
Note: At some level, every measurand has [...] an “intrinsic” uncertainty that can in
principle be estimated in some way. This is the minimum uncertainty with which a measurand
can be determined, and every measurement that achieves such an uncertainty may be viewed as
the best possible measurement of the measurand. To obtain a value of the quantity in question
having a smaller uncertainty requires that the measurand be more completely deﬁned.
[...]
The uncertainty in the result of a measurement generally consists of several components
which may be grouped into two categories according to the way in which their numerical value
is estimated:
• A. those which are evaluated by statistical methods,

• B. those which are evaluated by other means.
[...] a type A standard uncertainty is obtained from a probability density function derived from
an observed frequency distribution, while a type B standard uncertainty is obtained from an
assumed probability density function based on the degree of belief that an event will occur (often
called subjective probability). Both approaches employ recognized interpretations of probability.
[...]
In practice, there are many possible sources of uncertainty in a measurement, including
• incomplete deﬁnition of the measurand;
• imperfect realization of the deﬁnition of the measurand;
• nonrepresentative sampling — the sample measured may not represent the deﬁned measurand;
• inadequate knowledge of the effects of environmental conditions on the measurement or
imperfect measurement of environmental conditions;
• personal bias in reading analogue instruments;
• ﬁnite instrument resolution or discrimination threshold;
• inexact values of measurement standards and reference materials;
• inexact values of constants and other parameters obtained from external sources and used
in the datareduction algorithm;
• approximations and assumptions incorporated in the measurement method and procedure;
• variations in repeated observations of the measurand under apparently identical conditions.
[...]
5.3.2.1 The need for type B evaluations.

If a measurement laboratory had limitless time and resources, it could conduct an exhaustive statistical investigation of every conceivable cause of uncertainty, for example, by using
many diﬀerent makes and kinds of instruments, diﬀerent methods of measurement, diﬀerent
applications of the method, and diﬀerent approximations in its theoretical models of the measurement. The uncertainties associated with all of these causes could then be evaluated by the
statistical analysis of series of observations and the uncertainty of each cause would be characterized by a statistically evaluated standard deviation. In other words, all of the uncertainty
components would be obtained from type A evaluations. Since such an investigation is not
an economic practicality, many uncertainty components must be evaluated by whatever other
means is practical.

5.3.2.2 Single observation, calibrated instruments.

If an input estimate has been obtained from a single observation with a particular instrument
that has been calibrated against a standard of small uncertainty, the uncertainty of the estimate
is mainly one of repeatability. The variance of repeated measurements by the instrument may
have been obtained on an earlier occasion, not necessarily at precisely the same value of the
reading but near enough to be useful, and it may be possible to assume the variance to be
applicable to the input value in question. If no such information is available, an estimate must
be made based on the nature of the measuring apparatus or instrument, the known variances
of other instruments of similar construction, etc.
5.3.2.3 Single observation, verified instruments

Not all measuring instruments are accompanied by a calibration certificate or a calibration curve.
Most instruments, however, are constructed to a written standard and veriﬁed, either by the
manufacturer or by an independent authority, to conform to that standard. Usually the standard contains metrological requirements, often in the form of "maximum permissible errors", to
which the instrument is required to conform. The compliance of the instrument with these requirements is determined by comparison with a reference instrument whose maximum allowed
uncertainty is usually speciﬁed in the standard. This uncertainty is then a component of the
uncertainty of the veriﬁed instrument.
If nothing is known about the characteristic error curve of the veriﬁed instrument it must
be assumed that there is an equal probability that the error has any value within the permitted
limits, that is, a rectangular probability distribution. However, certain types of instruments
have characteristic curves such that the errors are, for example, likely always to be positive in
part of the measuring range and negative in other parts. Sometimes such information can be
deduced from a study of the written standard.

5.4 The Ideal Output of a Measuring Instrument

Note: mention here figures 5.1 and 5.2.

Figure 5.1: Instrument built to measure pitches of musical notes. Due to unavoidable measuring noises, a measurement is never infinitely accurate. Figure 5.2 suggests an ideal instrument output.

Figure 5.2: The ideal output of a measuring instrument (in this example, measuring frequencies-periods; the example output is centered at ν = 440 Hz, with standard deviation σ = 0.12). The curve in the middle corresponds to the volumetric probability describing the information brought by the measurement (on 'the measurand'). Five different scales are shown (in a real instrument, the user would just select one of the scales). Here, the logarithmic scales correspond to the natural logarithms that a physicist should prefer, but engineers could select scales using decimal logarithms. Note that all the scales are 'linear' (with respect to the natural distance in the frequency-period space [see section XXX]): I do not recommend the use of a scale where the frequencies (or the periods) would 'look linear'.

5.5 Output as Conditional Probability Density

As suggested by figure 5.3, a 'measuring instrument' is specified when the conditional volumetric probability f(y|x) for the output y, given the input x, is given.

Figure 5.3: The input (or measurand) and the output of
a measuring instrument. The output is never an actual
value, but a probability distribution, in fact, a conditional
volumetric probability f(y|x) for the output y , given the input x .

5.6 A Little Bit of Theory

We want to measure a given property of an object, say the quantity x . Assume that the
object has been randomly selected from a set of objects, so that the ‘prior’ probability for the
quantity x is fx (x) .
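Before the formal statement, the Bayesian combination of the prior f_x(x) with the instrument's conditional probability can be sketched numerically on a coarse grid (all values here are invented for illustration):

```python
import math

# Coarse grid of possible values of the measurand x, with prior f_x(x).
prior = {1.0: 0.2, 2.0: 0.5, 3.0: 0.3}

def f_y_given_x(y, x, sigma=0.5):
    """Hypothetical instrument response: Gaussian around the input x (up to a constant)."""
    return math.exp(-((y - x) ** 2) / (2.0 * sigma ** 2))

y_obs = 2.2   # the value displayed by the instrument
unnormalized = {x: p * f_y_given_x(y_obs, x) for x, p in prior.items()}
z = sum(unnormalized.values())
posterior = {x: w / z for x, w in unnormalized.items()}
print(posterior)  # the mass concentrates on x = 2.0, the value most compatible with y_obs
```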
Then, the conditional . . .
Then, Bayes' theorem . . .

5.7 Example: Instrument Specification

[Note: This example is to be put somewhere, I don't know yet where.]
It is unfortunate that ordinary measuring instruments tend to just display some ‘observed
value’, the ‘measurement uncertainty’ tending to be hidden inside some written documentation.
Awaiting the day when measuring instruments directly display a probability distribution for
the measurand, let us contemplate the simple situation where the maker of an instrument, say
a frequency-meter, writes something like the following.
This frequency-meter can operate, with high accuracy, in the range 10^2 Hz < ν < 10^9 Hz .
When very far from this range, one may face uncontrollable uncertainties. Inside (or close to)
this range, the measurement uncertainty is, with a good approximation, independent of the
value of the measured frequency. When the instrument displays the value ν0 , this means that
the (1D) volumetric probability for the measurand is

f(\nu) = \begin{cases} 0 & \text{if } \log(\nu/\nu_0) \le -\sigma \\ \dfrac{2}{9\sigma^2}\left( 2\sigma - \log\dfrac{\nu}{\nu_0} \right) & \text{if } -\sigma < \log(\nu/\nu_0) < +2\sigma \\ 0 & \text{if } +2\sigma \le \log(\nu/\nu_0) \end{cases} \qquad (5.1)

where σ = 10^−4 . This volumetric probability is displayed at the top of figure 5.4. Using the logarithmic frequency as coordinate, this is an asymmetric triangle.

Figure 5.4: Figure for 'instrument specification' (scales shown for an output centered at ν0∗ = log10(ν0/K) = 6.0000, i.e. ν0 = 1.0000 × 10^6 Hz, with K = 1 Hz and σ = 10^−4). Note: write this caption.
5.8 Measurements and Experimental Uncertainties

Observation of geophysical phenomena is represented by a set of parameters d that we usually
call data. These parameters result from prior measurement operations, and they are typically
seismic vibrations on the instrument site, arrival times of seismic phases, gravity or electromagnetic fields. As in any measurement, the data is determined with an associated uncertainty,
described with a volumetric probability over the data parameter space, that we denote here
ρd (d). This density describes, not only marginals on individual datum values, but also possible
cross-relations in data uncertainties.
Although the instrumental errors are an important source of data uncertainties, in geophysical measurements there are other sources of uncertainty. The errors associated with the
positioning of the instruments, the environmental noise, and the human appreciation (like for
picking arrival times) are also relevant sources of uncertainty.
Example 5.1 Non-analytic volumetric probability. Assume that we wish to measure the time
t of occurrence of some physical event. It is often assumed that the result of a measurement
corresponds to something like

t = t_0 \pm \sigma . \qquad (5.2)

An obvious question is the exact meaning of the ±σ . Has the experimenter in mind that she/he
is absolutely certain that the actual arrival time satisﬁes the strict conditions t0 −σ ≤ t ≤ t0 +σ ,
or has she/he in mind something like a Gaussian probability, or some other probability distribution (see ﬁgure 5.5)? We accept, following ISO’s recommendations (1993) that the result of
any measurement has a probabilistic interpretation, with some sources of uncertainty being analyzed using statistical methods (‘type A’ uncertainties), and other sources of uncertainty being
evaluated by other means (for instance, using Bayesian arguments) (‘type B’ uncertainties).
But, contrary to ISO suggestions, we do not assume that the Gaussian model of uncertainties
should play any central role. In an extreme example, we may well have measurements whose
probabilistic description may correspond to a multimodal volumetric probability. Figure 5.6
shows a typical example for a seismologist: the measurement on a seismogram of the arrival
time of a certain seismic wave, in a case where one hesitates in the phase identification, or in the
identiﬁcation of noise and signal. In this case the volumetric probability for the arrival of the
seismic phase does not have an explicit expression like f (t) = k exp(−(t − t0 )2 /(2σ 2 )) , but
is a numerically deﬁned function. Using, for instance, the Mathematica (registered trademark)
computer language we may deﬁne the volumetric probability f (t) as
f[t_] := ( If[t1<t<t2,a,c] If[t3<t<t4,b,c] )

Here, a and b are the 'levels' of the two steps, and c is the 'background' volumetric probability. [End of example.]

Figure 5.5: What has an experimenter in mind when she/he describes the result of a measurement by something like t = t0 ± σ ?

Figure 5.6: A seismologist tries to measure the arrival
time of a seismic wave at a seismic station, by ‘reading’
the seismogram at the top of the ﬁgure. The seismologist
may ﬁnd quite likely that the arrival time of the wave
is between times t3 and t4 , and believe that what is
before t3 is just noise. But if there is a signiﬁcant probability that the signal between t1 and t2 is not noise
but the actual arrival of the wave, then the seismologist
should deﬁne a bimodal volumetric probability, as the
one suggested at the bottom of the ﬁgure. Typically, the
actual form of each peak of the volumetric probability is
not crucial (here, boxcar functions are chosen), but the
position of the peaks is important. Rather than assigning a zero volumetric probability to the zones outside
the two intervals, it is safer (more ‘robust’) to attribute
some small ‘background’ value, as we may never exclude
some unexpected source of error.

Example 5.2 The Gaussian model for uncertainties. The simplest probabilistic model that can
be used to describe experimental uncertainties is the Gaussian model
\rho_D(\mathbf{d}) = k \exp\left( -\tfrac{1}{2}\, (\mathbf{d}-\mathbf{d}_{\rm obs})^{\rm T}\, \mathbf{C}_D^{-1}\, (\mathbf{d}-\mathbf{d}_{\rm obs}) \right) . \qquad (5.3)

It is here assumed that we have some 'observed data values' dobs , with uncertainties described
by the covariance matrix CD . If the uncertainties are uncorrelated,
\rho_D(\mathbf{d}) = k \exp\left( -\tfrac{1}{2} \sum_i \left( \frac{d^i - d^i_{\rm obs}}{\sigma^i} \right)^{2} \right) , \qquad (5.4)

where the σ^i are the 'standard deviations'. [End of example.]
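A minimal numerical sketch of equation (5.3) (the values of dobs and C_D are invented, and the constant k is set to 1):

```python
import numpy as np

def rho_D(d, d_obs, C_D, k=1.0):
    """Gaussian data density of equation (5.3), with normalization constant k."""
    r = d - d_obs
    return float(k * np.exp(-0.5 * r @ np.linalg.solve(C_D, r)))

d_obs = np.array([1.0, 2.0])
C_D = np.array([[0.04, 0.01],   # covariance matrix: correlated uncertainties
                [0.01, 0.09]])
print(rho_D(d_obs, d_obs, C_D))                        # 1.0 at the observed values
print(rho_D(np.array([1.2, 2.0]), d_obs, C_D) < 1.0)   # True: density decays away from d_obs
```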
Example 5.3 The Generalized Gaussian model for uncertainties. An alternative to the Gaussian model is to use the Laplacian (double exponential) model for uncertainties,

\rho_D(\mathbf{d}) = k \exp\left( - \sum_i \frac{|d^i - d^i_{\rm obs}|}{\sigma^i} \right) . \qquad (5.5)

While the Gaussian model leads to least-squares related methods, this Laplacian model leads to absolute-values methods (see section 8.2.6), well known for producing robust5 results. More generally, there is the L_p model of uncertainties
\rho_p(\mathbf{d}) = k \exp\left( -\frac{1}{p} \sum_i \frac{|d^i - d^i_{\rm obs}|^p}{(\sigma_p)^p} \right) \qquad (5.6)

(see figure 5.7). [End of example.]

5 A numerical method is called robust if it is not sensitive to a small number of large errors.

Figure 5.7: Generalized Gaussian for values of the parameter p = 1, √2, 2, 4, 8 and ∞ .
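Footnote 5 can be illustrated with a tiny numerical sketch (invented data with one gross outlier): the L2 misfit behind equation (5.4) is minimized by the mean, the L1 misfit behind equation (5.5) by the median, and only the latter resists the outlier.

```python
# Six repeated measurements of the same quantity; the last one is a gross error.
data = [1.0, 1.1, 0.9, 1.05, 0.95, 8.0]

# Minimizing the L2 misfit gives the mean; minimizing the L1 misfit gives the median.
mean = sum(data) / len(data)
s = sorted(data)
median = 0.5 * (s[len(s) // 2 - 1] + s[len(s) // 2])

print(mean)    # pulled far from 1 by the outlier
print(median)  # stays near 1
```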
5.9 Appendixes

5.9.1 Appendix: Operational Definitions cannot be Infinitely Accurate

Note: refer here to figure 5.8, and explain that "the length" of a real object (as opposed to a mathematically defined object) can only be defined by specifying the measuring instrument.
There are different notions of length associated with a given object. For instance, figure 5.8
suggests that the length of a piece of wood is larger when deﬁned by the use of a calliper6 than
when deﬁned by the use of a ruler7 , because a calliper tends to measure the distance between
extremal points, while an observer using a ruler tends to average the rugosities at the wood
ends.

Figure 5.8: Different definitions of the length of an object.

6 Calliper: an instrument for measuring diameters (as of logs or trees) consisting of a graduated beam and at right angles to it a fixed arm and a movable arm. From the Digital Webster.
7 Ruler: a smooth-edged strip (as of wood or metal) that is usu. marked off in units (as inches) and is used as a straightedge or for measuring. From the Digital Webster.

5.9.2 Appendix: The International System of Units (SI)

Note: make here a small introduction about the usefulness of a unified system of units.
The rest of this appendix is a reproduction (with permission) of a text published by Robert A. Nelson in the August 1996 issue of Physics Today, pages 15–16. Robert Nelson is the author of the booklet SI: The International System of Units, 2nd ed. (American Association of Physics Teachers, College Park, Maryland, 1982). He is Program Director for Commercial Space at Veda Incorporated in Alexandria, Virginia and teaches in the Department of Aerospace Engineering at the University of Maryland.
Note: ASK FOR THE PERMISSION TO REPRODUCE!!!
Note: The accent in "ampère" is valid in French; check if it is valid in English.

5.9.2.1 Guide for Metric Practice, by Robert A. Nelson

The modernized metric system is known as the Système International d'Unités (International
System of Units), with the international abbreviation SI. It is founded on seven base units,
listed in table 1, that by convention are regarded as dimensionally independent. All other
units are derived units, formed coherently by multiplying and dividing units within the system
without numerical factors. Examples of derived units, including some with special names, are
listed in table 2. The expression of multiples and submultiples of SI units is facilitated through
the use of the preﬁxes listed in table 3.
Table 1. SI base units

Quantity                     Unit name    Symbol
length                       meter        m
mass                         kilogram     kg
time                         second       s
electric current             ampère       A
thermodynamic temperature    kelvin       K
amount of substance          mole         mol
luminous intensity           candela      cd

SI obtains its international authority from the Meter Convention, signed in Paris by the
delegates of 17 countries, including the United States, on 20 May 1875, and amended in 1921.
Today 48 states are members. The treaty established the Conférence Générale des Poids et Mesures (General Conference on Weights and Measures) as the formal diplomatic body responsible for ratification of the new proposals related to metric units. The scientific decisions are made by the Comité International des Poids et Mesures (International Committee for Weights and Measures). It is assisted by the advice of eight Consultative Committees specializing in particular areas of metrology. The activities of the national standards laboratories are coordinated by the Bureau International des Poids et Mesures (International Bureau of Weights and Measures), whose headquarters is at the Pavillon de Breteuil in Sèvres, France, and which is under the supervision of the CIPM. The SI was established by the 11th CGPM in 1960, when the metric unit definitions, symbols and terminology were extensively revised and simplified.8

8 For the history of the metric system and SI units, see R.A. Nelson, Phys. Teach. 19, 596 (1981).

Table 2. Examples of SI derived units
Quantity                   Special name     Symbol    Equivalent
plane angle                radian           rad       m/m = 1
solid angle                steradian        sr        m2/m2 = 1
speed, velocity            —                —         m/s
acceleration               —                —         m/s2
angular velocity           —                —         rad/s
angular acceleration       —                —         rad/s2
frequency                  hertz            Hz        s−1
force                      newton           N         kg·m/s2
pressure, stress           pascal           Pa        N/m2
work, energy, heat         joule            J         N·m , kg·m2/s2
impulse, momentum          —                —         N·s , kg·m/s
power                      watt             W         J/s
electric charge            coulomb          C         A·s
electric potential, emf    volt             V         J/C , W/A
resistance                 ohm              Ω         V/A
conductance                siemens          S         A/V , Ω−1
magnetic flux              weber            Wb        V·s
inductance                 henry            H         Wb/A
capacitance                farad            F         C/V
electric field strength    —                —         V/m , N/C
magnetic flux density      tesla            T         Wb/m2 , N/(A·m)
electric displacement      —                —         C/m2
magnetic field strength    —                —         A/m
Celsius temperature        degree Celsius   ◦C        K
luminous flux              lumen            lm        cd·sr
illuminance                lux              lx        lm/m2
radioactivity              becquerel        Bq        s−1

Table 3. SI prefixes
Factor   Prefix   Symbol       Factor   Prefix   Symbol
10^24    yotta    Y            10^−1    deci     d
10^21    zetta    Z            10^−2    centi    c
10^18    exa      E            10^−3    milli    m
10^15    peta     P            10^−6    micro    µ
10^12    tera     T            10^−9    nano     n
10^9     giga     G            10^−12   pico     p
10^6     mega     M            10^−15   femto    f
10^3     kilo     k            10^−18   atto     a
10^2     hecto    h            10^−21   zepto    z
10^1     deka     da           10^−24   yocto    y

The BIPM, with the guidance of the Consultative Committee for Units and approval of
the CIPM, periodically publishes a document9 that summarizes the historical decisions of the CGPM and the CIPM and gives some conventions for metric practice. In addition, Technical Committee 12 of the International Organization for Standardization has prepared recommendations concerning the practical use of the SI10. Some other recommendations have been given by the Commission for Symbols, Units, Nomenclature, Atomic Masses and Fundamental Constants of the International Union of Pure and Applied Physics11. The National Institute of Standards and Technology has published a practical guide for the use of the SI12. The Institute of Electrical and Electronics Engineers has developed a metric practice manual13 that has been recognized by the American National Standards Institute and has been adopted by the US Department of Defense. The American Society for Testing and Materials has prepared a similar manual14. The Secretary of Commerce, through NIST, has also issued recommendations for US metric practice15 as provided under the Metric Conversion Act of 1975 and the Omnibus Trade and Competitiveness Act of 1988.
In October 1995 the 20th CGPM, on the recommendation of the CCU and CIPM, eliminated the "supplementary units" radian and steradian as a special class of derived units having dimension 1 (so-called dimensionless derived units). Thus the SI now consists of only two classes of units, base units and derived units, with the radian and steradian included among the derived units as shown in table 2.

5.9.2.2 Style conventions

Letter symbols include quantity symbols and unit symbols. Symbols for physical quantities are
set in italic (sloping) type, while symbols for units are set in roman (upright) type (for example,
F = 15 N).
Symbols for unit names derived from proper names have the ﬁrst letter capitalized —
otherwise unit symbols are lower case — but the unit names themselves are not capitalized (for
example, tesla, T; meter, m). A unit symbol is a mathematical entity (not an abbreviation)
and is usually denoted by the ﬁrst letter of the unit name (for example, the symbol for gram is
g, not gm; the symbol for second is s, not sec), with some exceptions (for example, mol, cd and
Hz). The unit symbol is not followed by a period, and plurals of unit symbols are not followed
by an "s" (for example, 3 kg, not 3 kg. or 3 kgs).
9 Bureau International des Poids et Mesures, Le systme International d’unit´s (SI), 6th ed., BIPM S`vres,
`
e
e
France (1991); US ed.: The International System of Units (SI), B.N. Taylor. ed., Natl. Inst. Stand. Technol.
Spec. Pub. 330, US Govt. Printing Oﬃce, Washington, D.C. (1991).
10
International Organization for Standardization, Quantities and Units, ISO Standards Handbook, 3rd ed.,
ISO, Geneva (1993). This is a compilation of individual standards ISO 310 to 3113 and ISO 1000, available
from Am. Natl. Stand. Inst., New York.
11
E. R. Cohen, P. Giacomo, eds., Physica 146A, 1 (1987). Reprinted as symbols, Units, Nomenclature and
Fundamental Constants in Physics (1987 revision), document IUPAP25 (sunamco 871).
12 B. N. Taylor, Guide for the Use of the International System of Units, Natl. Inst. Stand. Technol. Spec. Pub. 811, US Govt. Printing Office, Washington, D.C. (1995).
13 Inst. of Electrical and Electronics Engineers, American National Standard for Metric Practice, ANSI/IEEE Std. 268-1992, IEEE, New York (1992).
14 Am. Soc. for Testing and Materials, Standard Practice for Use of the International System of Units (SI) (The Modernized Metric System), ASTM E 380-93, ASTM, Philadelphia (1993).
15 “Metric System of Measurement; Interpretation of the International System of Units for the United States,” Fed. Register 55 (245), 52 242 (20 December 1990).

The word “degree” and its symbol, ◦, are omitted from the unit of thermodynamic temperature T (that is, one uses kelvin or K, not degree Kelvin or ◦K). However, they are retained in the unit of Celsius temperature t, defined as t ≡ T − T0, where T0 = 273.15 K exactly (that is, degree Celsius, ◦C).
Symbols for prefixes representing 10⁶ or greater are capitalized; all others are lower case. There is no space between the prefix and the unit. Compound prefixes are to be avoided (for example, pF, not µµF). An exponent applies to the whole unit including its prefix (for example, cm³ = 10⁻⁶ m³). When a unit multiple or submultiple is written out in full, the prefix should be written in full, beginning with a lowercase letter (for example, megahertz, not Megahertz or Mhertz). The kilogram is the only base unit whose name, for historical reasons, contains a prefix; names of multiples and submultiples of the kilogram and their symbols are formed by attaching prefixes to the word “gram” and the symbol “g”.
Multiplication of units is indicated by inserting a raised dot or by leaving a space between the units (for example, N·m or N m). Division may be indicated by the use of the solidus, a horizontal fraction bar, or a negative exponent (for example, m/s or m·s⁻¹), but repeated use of the solidus is not permitted (for example, m/s², not m/s/s). To avoid possible misinterpretation when more than one unit appears in the denominator, the preferred practice is to use parentheses or negative exponents (for example, W/(m²·K⁴) or W·m⁻²·K⁻⁴). The unit expression may include a prefixed unit (for example, kJ/mol, W/cm²).
Unit names should not be mixed with symbols for mathematical operations. (For example, one should write “meter per second” but not “meter/second” or “meter second⁻¹”.) When spelling out the product of two units, a space is recommended (although a hyphen is permissible), but one should never use a centered dot. (Write, for example, “newton meter” or “newton-meter”, but not “newton·meter”.)
Three-digit groups in numbers with more than four digits are separated by thin spaces instead of commas (for example, 299 792 458, not 299,792,458) to avoid confusion with the decimal marker in European literature. This spacing convention is also used to the right of the decimal marker. The numerical value and unit symbol must be separated by a space, even when used as an adjective (for example, 35 mm, not 35mm). A zero should be placed in front of the decimal marker in decimal fractions (for example, 0.3 J, not .3 J). The prefix of a unit should be chosen so that the numerical value will be within a practical range, usually between 0.1 and 1000 (for example, 200 kN, 0.5 mA).16
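The prefix-selection guideline above is easy to mechanize. A minimal sketch (not part of the original text; note that the rule underdetermines the choice, e.g. both 0.5 mA and 500 µA are admissible, and this version simply picks the largest prefix that keeps the value in range):

```python
# Pick an SI prefix so that the numeric part falls in the practical
# range 0.1 <= value < 1000, preferring the largest admissible prefix.
PREFIXES = [(1e9, "G"), (1e6, "M"), (1e3, "k"), (1.0, ""),
            (1e-3, "m"), (1e-6, "µ"), (1e-9, "n")]

def si(value, unit):
    for factor, prefix in PREFIXES:
        scaled = abs(value) / factor
        if 0.1 <= scaled < 1000.0:
            return f"{value / factor:g} {prefix}{unit}"
    return f"{value:g} {unit}"   # out of table range: leave as is

print(si(0.0005, "A"))    # 0.5 mA
print(si(200000.0, "N"))  # 0.2 MN (the guideline would equally allow 200 kN)
print(si(15.0, "N"))      # 15 N
```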
5.9.2.3 Non-SI units

An important function of the SI is to discourage the proliferation of unnecessary units. However, it is recognized that some units outside the SI are so well established that their use is to be permitted. Units in use with the SI are listed in table 4. As exceptions to the rules, the symbols ◦, ’ and ” for units of plane angle are not preceded by a space, and the symbol for liter, L, is capitalized to avoid confusion between the letter l and the number 1. Certain units whose values are obtained experimentally, listed in table 5, are also accepted for use in special fields.

16 This footnote is from A. Tarantola, not from R. Nelson: Remark that “Three-digit groups ( . . . ) are separated by thin spaces”. In the LaTeX document preparation system, for instance, a thin space is obtained by “\,”. I also use a thin space to separate the numerical value and the unit symbol (for example, “35\,mm” rather than “35 mm”), but I do not know if this is an explicit specification.

Table 4. Units in use with the SI

Quantity      Unit Name    Symbol   Definition
time          minute       min      1 min = 60 s
              hour         h        1 h = 60 min = 3600 s
              day          d        1 d = 24 h = 86 400 s
plane angle   degree       ◦        1◦ = (π/180) rad
              minute       ’        1’ = (1/60)◦ = (π/10 800) rad
              second       ”        1” = (1/60)’ = (π/648 000) rad
volume        liter        L        1 L = 1 dm³ = 10⁻³ m³
mass          metric ton   t        1 t = 1000 kg
land area     hectare      ha       1 ha = 1 hm² = 10⁴ m²

Table 5. Units whose values are obtained experimentally
Quantity   Unit Name                  Symbol   Value
energy     electron volt              eV       1.602 177 33(49) × 10⁻¹⁹ J
mass       unified atomic mass unit   u        1.660 540 2(10) × 10⁻²⁷ kg

Chapter 6
Inference Problems of the First Kind (Sum of Probabilities)

Note: Say here that we consider here the Problem of Making Histograms.

6.1 Experimental Histograms

[Note: This is a provisional text, to be expanded.]
Consider an n-dimensional manifold, with a volume element dv, and a probability distribution defined over it, represented by the (normalized) volumetric probability f. Although this is not necessary, let us simplify the exposition by assuming that some coordinates have been chosen over the manifold. Then, the probability distribution is represented by the volumetric probability function f(x), and the volume distribution by the volume element function dv(x).
Some process, mathematical or physical, produces points P1, P2, . . . , PK that are samples of the probability distribution. Assume that we don’t know f(x), and that we wish to obtain a reasonable estimation of it by measuring the coordinates of the points P1, P2, . . . , PK.
As any physical measurement has some experimental uncertainties, measuring the coordinates of the point P1 will not produce definite values x1 but, rather, information about the coordinates of the point, which we can represent by the volumetric probability f1(x). Let, then, f1(x), f2(x), . . . , fK(x) be the (normalized) volumetric probabilities obtained when measuring the coordinates of the points P1, P2, . . . , PK.
When we have a large enough number of points, i.e., when K is large enough,1 we can start having some information about the probability distribution f(x) itself.
Which volumetric probability f(x) shall we choose to represent our information? Of course, the one that satisfies the postulates used in section 2.3 to define the ‘sum’ of probabilities. We then arrive at the volumetric probability

    f(x) = (1/K) Σ_{i=1}^{K} fi(x) .    (6.1)

This is equivalent, in a slightly more sophisticated manner, to ‘making a histogram of the observed points’.
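As an illustration (mine, not from the book): a minimal one-dimensional sketch of equation 6.1, where each measurement is represented by a normalized Gaussian volumetric probability fi centered on an invented reading, and the sum is sampled by the two-step recipe (pick i uniformly, then sample fi):

```python
import math
import random

def gaussian(x, mu, sigma):
    # normalized 1-D Gaussian density
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# K noisy measurements: each point P_i is known only through f_i(x),
# here a Gaussian centered at the observed value, with uncertainty sigma
readings = [1.0, 1.2, 0.9, 3.1, 1.1]   # invented values
sigma = 0.2

def f(x):
    # equation (6.1): f(x) = (1/K) sum_i f_i(x)
    return sum(gaussian(x, r, sigma) for r in readings) / len(readings)

def sample_f(rng):
    # two-step sampling of the sum: pick i uniformly, then sample f_i
    r = rng.choice(readings)
    return rng.gauss(r, sigma)

rng = random.Random(0)
samples = [sample_f(rng) for _ in range(10000)]
# the sample mean approaches the mean of the mixture, sum(readings)/K = 1.46
print(sum(samples) / len(samples))
```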
Example 6.1 A seismologist has analyzed for many years the seismicity of a quite active region of the Earth. For every earthquake, using the arrival times of the seismic waves at some observatories, she/he has estimated its epicentral (geographic) coordinates {ϕ, λ}, obtaining the (2D) volumetric probabilities f1(ϕ, λ), f2(ϕ, λ), . . . , fK(ϕ, λ). If the next earthquake is to be a standard earthquake, the best estimate we have for the probability distribution of its epicentral coordinates (in the absence of any supplementary information) is that represented by the volumetric probability f(ϕ, λ) = (1/K) Σ_{i=1}^{K} fi(ϕ, λ). [End of example.]
As suggested in chapter 2, let us write the volume element of the space as

    dv(x) = g(x) dv̄(x) ,    (6.2)

where g(x) and dv̄(x) are respectively the volume density and the capacity element of the space in the coordinates x.
By definition of probability density (see section 2.2.3), the relation between a volumetric probability h(x) and the associated probability density h̄(x) is

    h̄(x) = g(x) h(x) .    (6.3)
1 How large is large enough? This depends, of course, on the relative radii of f and of the fi, on the number of dimensions of the space, and on the relative degree of smoothness of the probability distributions.

Equation 6.1 can obviously also be written as

    f̄(x) = (1/K) Σ_{i=1}^{K} f̄i(x) ,    (6.4)

where now only probability densities are invoked.

6.2 Sampling a Sum

Note: explain here that if we wish to obtain a sample of the volumetric probability

    f(x) = (1/K) Σ_{i=1}^{K} fi(x) ,    (6.5)

we can:
• first, select at random, with equal probability, a value i in the interval 1 ≤ i ≤ K ;
• then, obtain a sample of fi(x).

6.3 Further Work to be Done

Note: I have to prove here the following conjecture.
Consider a metric coordinate x over a one-dimensional metric space. Let f(x) be a (1D) volumetric probability over the space, and let x1, x2, . . . be samples of it.
When trying to measure the coordinate x with a given instrument, assume that ‘the reading’ of the instrument is a value x′ that is a sample of a volumetric probability g(x′; x, σ) centered at x and with standard deviation σ. Given the reading x′, the volumetric probability for the measurand is

    h(x) = h(x; x′, σ) = WRITE THIS .    (6.6)

The readings have been x′1, x′2, . . . . Then,

    F(x) = k Σi hi(x) = k Σi h(x; x′i, σ) = k Σi g(x′i; x, σ) .    (6.7)

And I conjecture that the relation between the original f(x) and our estimation F(x) is

    F(x) = ∫ dx′ g(x′, x, σ) f(x′) .    (6.8)

This, in fact, is a convolution.
Chapter 7
Inference Problems of the Second Kind (Product of Probabilities)

Note: write an introduction here.

7.1 The ‘Shipwrecked Person’ Problem

Note: this example is to be developed. For the time being this is just a copy of example 2.4.
Let S represent the surface of the Earth, using geographical coordinates (longitude ϕ and latitude λ). An estimation of the position of a floating object at the surface of the sea by an airplane navigator gives a probability distribution for the position of the object corresponding to the (2D) volumetric probability f(ϕ, λ), and an independent, simultaneous estimation of the position by another airplane navigator gives a probability distribution corresponding to the volumetric probability g(ϕ, λ). How should the two volumetric probabilities f(ϕ, λ) and g(ϕ, λ) be ‘combined’ to obtain a ‘resulting’ volumetric probability? The answer is given by the ‘product’ of the two volumetric probabilities:

    (f · g)(ϕ, λ) = f(ϕ, λ) g(ϕ, λ) / ∫_S dS(ϕ, λ) f(ϕ, λ) g(ϕ, λ) .    (7.1)
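A minimal numerical sketch of equation 7.1 (mine; the navigators’ distributions and the grid are invented), using the surface element dS = cos λ dϕ dλ of the unit sphere:

```python
import math

# crude longitude-latitude grid over a small patch of the sphere (radians)
n = 200
lons = [(i + 0.5) * (0.2 / n) for i in range(n)]
lats = [(j + 0.5) * (0.2 / n) for j in range(n)]
dlon = dlat = 0.2 / n

def vp(lon, lat, lon0, lat0, s):
    # an (unnormalized) bell-shaped volumetric probability on the patch
    return math.exp(-0.5 * (((lon - lon0) / s) ** 2 + ((lat - lat0) / s) ** 2))

def product(f, g):
    # equation (7.1): pointwise product, normalized with dS = cos(lat) dlon dlat
    norm = sum(f(lo, la) * g(lo, la) * math.cos(la) * dlon * dlat
               for lo in lons for la in lats)
    return lambda lo, la: f(lo, la) * g(lo, la) / norm

f = lambda lo, la: vp(lo, la, 0.08, 0.10, 0.02)   # navigator 1
g = lambda lo, la: vp(lo, la, 0.12, 0.10, 0.02)   # navigator 2
fg = product(f, g)

# the combined volumetric probability integrates to 1 over the patch,
# and peaks between the two individual estimates
total = sum(fg(lo, la) * math.cos(la) * dlon * dlat for lo in lons for la in lats)
print(total)
```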
7.2 Physical Laws as Probabilistic Correlations

7.2.1 Physical Laws

Are we forced to introduce uncertainties in physical laws, to be used as ‘thicknesses’ of a mathematical function d = f(m), via a metric in the space?
In fact, actual theories are always approximate and have some ‘uncertainty bars’ associated with them (see an example in section 7.2.2). The conditional volumetric probability has to be seen as a way of taking a limit when the uncertainty bars tend to zero. Then the sort of limit defining the conditional probability density is imposed by the form of the ‘theoretical uncertainty bars’. Rather than basing inversion theory on an expression like 8.16, it is better to introduce the theoretical uncertainties explicitly, and take any ‘small uncertainty limit’ afterwards. Let us do this.
Assume that the physical correlations between the model parameters m and the data
parameters d are not represented by an analytical expression like d = f (m) , but by
a probability density ϑ(m, d) . Then, the conjunction of the ‘a priori and experimental
information’ contained in ρ(m, d) and the ‘theoretical information’ contained in ϑ(m, d) can
be combined using the conjunction operation deﬁned by equation ??, to give
σ(m, d) = k ρ(m, d) ϑ(m, d) / µ(m, d) ,    (7.2)

where µ(m, d) is the homogeneous probability density. The implications of this equation will be examined later.

7.2.2 Example: Realistic ‘Uncertainty Bars’ Around a Functional Relation

In the approximation of a constant gravity field, with acceleration g, the position at time t of an apple in free fall is r(t) = r0 + v0 t + (1/2) g t², where r0 and v0 are, respectively, the position and velocity of the object at time t = 0. More simply, if the movement is 1D,

    x(t) = x0 + v0 t + (1/2) g t² .    (7.3)

Of course, for many reasons this equation can never be exact: air friction, wind effects, inhomogeneity of the gravity field, effects of the Earth rotation, forces from the Sun and the Moon
(not to mention Pluto), relativity (special and general), etc.
It is not a trivial task, given very careful experimental conditions, to estimate the size of
the leading uncertainty. Although one may think of an equation x = x(t) as a line, inﬁnitely
thin, there will always be sources of uncertainty (at least due to the unknown limits of validity
of general relativity): looking at the line with a magnifying glass should reveal a fuzzy object
of ﬁnite thickness. As a simple example, let us examine here the mathematical object we arrive
at when assuming that the leading sources of uncertainty in the relation x = x(t) are the
uncertainties in the initial position and velocity of the falling apple. Let us assume that:
• the initial position of the apple is random, with a Gaussian distribution centered at x0, and with standard deviation σx ;
• the initial velocity of the apple is random, with a Gaussian distribution centered at v0, and with standard deviation σv .

Then, it can be shown that at a given time t, the possible positions of the apple are random, with probability density

    ϑ(x|t) = ( 1 / ( √(2π) √(σx² + σv² t²) ) ) exp( − (1/2) ( x − (x0 + v0 t + (1/2) g t²) )² / ( σx² + σv² t² ) ) .    (7.4)

This is obviously a conditional probability density for x, given t. If we select the time
t randomly with homogeneous probability distribution (i.e., if we assume that the marginal
probability density for t is constant), then the joint probability density for x and t is
ϑ(x, t) = k ϑ(x|t) ,    (7.5)

where k is a constant, and where ϑ(x|t) is that in equation 7.4. This probability density is represented in figure 7.1, together with the two marginals, and the conditional probability density at three different times is represented in figure 7.2.

Figure 7.1: A typical parabola representing the free fall of an object (position x as a function of time t). Here, rather than an infinitely thin line, we have a fuzzy object (a probability distribution) because the initial position and initial velocity are uncertain. The figure represents the probability density defined by equation 7.5, with x0 = 0, v0 = 1 m/s, σx = 1 m, σv = 1 m/s and g = 9.81 m/s². While, by definition, the marginal of the probability density with respect to the time t is homogeneous, the marginal for the position x is not: there is a pronounced maximum for x = 0 (when the falling object is slower), and the distribution is very asymmetric (as the object is falling ‘downwards’).

Figure 7.2: Three conditional volumetric probabilities from the joint distribution of the previous figure, at times t = 0, t = 1 s and t = 2 s. The width increases with time because of the uncertainty in the initial velocity.
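A small numerical sketch of equation 7.4 (with the figure’s parameter values and standard gravity g = 9.81 m/s²; the code itself is mine, not the book’s): each time slice is a normalized Gaussian whose width √(σx² + σv² t²) grows with t:

```python
import math

x0, v0, g = 0.0, 1.0, 9.81   # initial position (m), initial velocity (m/s), gravity (m/s^2)
sx, sv = 1.0, 1.0            # sigma_x (m), sigma_v (m/s)

def theta(x, t):
    # equation (7.4): Gaussian in x centered on the parabola,
    # with variance sigma_x^2 + sigma_v^2 t^2
    s2 = sx ** 2 + (sv * t) ** 2
    mean = x0 + v0 * t + 0.5 * g * t ** 2
    return math.exp(-0.5 * (x - mean) ** 2 / s2) / math.sqrt(2.0 * math.pi * s2)

def width(t):
    return math.sqrt(sx ** 2 + (sv * t) ** 2)

for t in (0.0, 1.0, 2.0):
    # each conditional slice integrates to 1 in x (crude Riemann sum)
    area = sum(theta(-50.0 + k * 0.01, t) * 0.01 for k in range(12000))
    print(t, width(t), round(area, 6))
```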
7.2.3 Inverse Problems

We have seen that the result of measurements can be represented by a probability density ρd(d)
in the data space. We have also seen that the a priori information on the model parameters can be represented by another probability density ρm(m) in the model space. When we talk
about ‘measurements’ and about ‘a priori information on model parameters’, we usually mean
that we have a joint probability density in the (M , D ) space, that is ρ(m, d) = ρm (m) ρd (d) .
But let us consider the more general situation where for the whole set of parameters (M , D ) we
have some information that can be represented by a joint probability density ρ(m, d) . Having
well in mind the interpretation of this information, let us use the simple name of ‘experimental
information’ for it:

    ρ(m, d)    (experimental information) .    (7.6)

We have also seen that we have information coming from physical theories, that predict
correlations between the parameters, and it has been argued that a probabilistic description of
these correlations is well adapted to the resolution of inverse problems1 . Let ϑ(m, d) be the
probability density representing this ‘theoretical information’:
ϑ(m, d)    (theoretical information) .    (7.7)

A quite fundamental assumption is that in all the spaces we consider, there is a notion of volume which allows us to give sense to the notion of ‘homogeneous probability distribution’ over the space. The corresponding probability density is not constant, but is proportional to the volume element of the space (see section 4):
µ(m, d)    (homogeneous probability distribution) .    (7.8)

Finally, we have seen examples suggesting that the conjunction of the experimental information with the theoretical information corresponds exactly to the and operation defined over the probability densities, to obtain the ‘conjunction of information’, as represented by the probability density

    σ(m, d) = k ρ(m, d) ϑ(m, d) / µ(m, d)    (conjunction of informations) ,    (7.9)

with marginal probability densities2

    σm(m) = ∫_D dd σ(m, d)  ;  σd(d) = ∫_M dm σ(m, d) .    (7.10)

Example 7.1 We may assume that the physical correlations between the parameters m and
d are of the form

    ϑ(m, d) = ϑD|M(d|m) ϑM(m) ,    (7.11)

this expressing that a ‘physical theory’ gives, on the one hand, the conditional probability density for d, given m, and on the other hand, the marginal probability density for m. [End of
example.]
1 Remember that, even if we wish to use a simple method based on the notion of conditional probability
density, an analytic expression like d = f (m) needs some ‘thickness’ before going to the limit deﬁning the
conditional probability density. This limit crucially depends on the ‘thickness’, i.e., on the type of uncertainties
the theory contains.
2 As explained in section ??, the definition of marginal probability density is only intrinsic if the total space
is the Cartesian product of the two spaces, i.e., when (M, D) = M × D.

Example 7.2 Many applications concern the special situation where we have

    µ(m, d) = µm(m) µd(d) ;  ρ(m, d) = ρm(m) ρd(d) .    (7.12)

In this case, equations 7.9–7.10 give

    σm(m) = k ( ρm(m) / µm(m) ) ∫_D dd ρd(d) ϑ(m, d) / µd(d) ,    (7.13)

and

    σd(d) = k ( ρd(d) / µd(d) ) ∫_M dm ρm(m) ϑ(m, d) / µm(m) .    (7.14)

If equation 7.11 holds, then
µm (m) dd
D ρd (d) ϑDM (d  m)
µd (d) (7.15) ϑm (m)
.
µm (m) (7.16) and
σd (d) = k ρd (d)
µd (d) M dm ρm (m) ϑDM (dm) Finally, if the simpliﬁcation ϑM (m) = µm (m) arises (this usually holds only if nonlinearities
are weak3 ), then,
σm (m) = k ρm (m) dd
D ρd (d) ϑ(dm)
µd (d) (7.17) and
σd (d) = k ρd (d)
µd (d) M dm ρm (m) ϑ(dm) . (7.18) [End of example.]
Example 7.3 Let us reproduce here equation 7.17,
σm (m) = k ρm (m) dd
D ρd (d) ϑ(dm)
.
µd (d) (7.19) Assume that observational uncertainties are Gaussian,
1
ρd (d) = k exp − (d − dobs )t C−1 (d − dobs )
D
2 . (7.20) Note that the limit for inﬁnite variances gives the homogeneous probability density µd (d) = k .
Furthermore, assume that uncertainties in the physical law are also Gaussian:
1
ϑ(dm) = k exp − (d − f (m))t C−1 (d − f (m))
T
2
3 Note: some explanation is needed here. . (7.21) Physical Laws as Probabilistic Correlations 217 Here ‘the physical theory says’ that the data values must be ‘close’ to the ‘computed values’
f (m) , with a notion of closeness deﬁned by the ‘theoretical covariance matrix’ CT . As
demonstrated in Tarantola (1987, page 158), the integral in equation 7.19 can be analytically
evaluated, and gives
dd
D ρd (d) ϑ(dm)
1
= k exp − (f (m) − dobs )t (CD + CT )−1 (f (m) − dobs )
µd (d)
2 . (7.22) This shows that when using the Gaussian probabilistic model, observational and theoretical uncertainties combine through addition of the respective covariance operators (a nontrivial result).
[End of example.]
Example 7.4 In the ‘Galilean law’ example developed in section 7.2.1, we described the correlation between the position x and the time t of a free falling object through a probability density
ϑ(x, t) . This law says than falling objects describe, approximately, a spacetime parabola. Assume that in a particular experiment the falling object explodes at some point of its spacetime
trajectory A plain measurement of the coordinates (x, t) of the event gives the probability
density ρ(x, t) . By ‘plain measurement’ we mean here that we have used a measurement
technique that is not taking into account the particular parabolic character of the fall (i.e., the
measurement is designed to work identically for any sort of trajectory). The conjunction of the
physical law ϑ(x, t) and the experimental result ρ(x, t) , using expression 7.9, gives
σ (x, t) = k ρ(x, t) ϑ(x, t)
µ(x, t) , (7.23) where, as the coordinates (x, t) are ‘Cartesian’, µ(x, t) = k . Taking the explicit expression
given for ϑ(x, t) in equations 7.24–7.5,
ϑ(x, t) = √ 2π 1
1 x − (x0 + v0 t + 1 g t2 )
2
exp −
2
2
2
2
2
σx + σv t2
σx + σv t2 2 , (7.24) , (7.25) and assuming the Gaussian form4 for ρ(x, t) ,
ρ(x, t) = ρx (x) ρt (t) = k exp − 1 (x − xobs )2
2
Σ2
x exp − 1 (t − tobs )2
2
Σ2
t we obtain the combined probability density
σ (x, t) = 1
k
exp −
2
2
2
σx + σv t2 x − (x0 + v0 t + 1 g t2 )
(x − xobs )2 (t − tobs )2
2
+
+
2
2
Σ2
Σ2
σx + σv t2
t
x 2 .
(7.26) Figure 7.3 illustrates the three probability densities ϑ(x, t) , ρ(x, t) and σ (x, t) . [End of
example.]
Note: explain here that δ (d − f (m)) , as it concerns a diﬀerence in the data space (rather
than a distance), it is not a mathematically nice object.
4 Note that taking the limit of ϑ(x, t) or of ρ(x, t) for inﬁnite variances we obtain µ(x, t) , as we should. 218 7.2 Figure 7.3: Note: this is
a provisional ﬁgure.
It
was made with the numerical values mentioned in ﬁgure 7.1 with, in addition,
xobs = 5.0 m , Σx = 4.0 m ,
tobs = 2.0 s and Σt =
0.75 s . 20 20 20 15 15 15 10 10 10 5 5 5 0 0
2 1 0 1 2 0
2 1 0 1 2 2 1 0 1 2 Example 7.5 Note: consider here equation 7.11 and let us formally take
ϑ(dm) = δ (d − f (m))
ϑM (m) = k det gmm + gmd F + FT gdm + FT gdd F . (7.27) d=f (m) [Note: Explain this choice for ϑM (m)...] Then we arrive at
σm (m) = k ρ(m, f (m)) det (gmm + gmd F + FT gdm + FT gdd F)
µ(m, d) . (7.28) d=f (m) If µ(m, d) = k det g(m, d) (i.e., if we use the same metric to represent theoretical uncertainties as we used to deﬁne the homogeneous probability distributions), this equation is identical
to equation 8.16, obtained using the equation d = f (m) to deﬁne a conditional probability.
[End of example.]
The previous example is important because it shows that the formulation using an ‘exact
physical law’ can be found as a particular case of this, more general, approach were physical
correlations are represented probabilistically. Chapter 8
Inference Problems of the Third Kind
(Conditional Probabilities) Note: Say here the we consider here two problems: (i) ‘adjusting measurements’ to a physical
theory and (ii) resolution of Inverse problems.
These two problems are mathematically very similar, and are essentially solved using either
the notion of ‘conditional probability’ or the notion of ‘product of probabilities’ (see chapter 2).
Note: what follows comes from an old text:
A socalled ‘inverse problem’ usually consists in a sort quite complex measurement, simetimes a gigantic measurement, involving years of observations and thousands of instruments.
Any measurement is indirect (we may weigh a mass by observing the displacement of the cursor
of a balance), and as such, a possibly nontrivial analysis of uncertainties must be done.
Any good guide describing good experimental practice (see, for instance ISO’s Guide to
the expression of uncertainty in measurement [ISO, 1993] or the shorter description by Taylor
and Kuyatt, 1994) acknowledges that any measurement involves, at least, two diﬀerent sources
of uncertainties: those that we estimate using statistical methods, and those that we estimate
using subjective, common sense estimations. Both are described using the axioms of probability
theory, and this article clearly takes the probabilistic point of view for developing inverse theory. 219 220 8.1 8.1 Adjusting Measurements to a Physical Theory When a particle of mass m is submitted to a force F , one has
F=m d
dt v . 1 − v 2 /c2 (8.1) Assuming initial conditions of rest (at a time arbitrarily set to 0 ), the trajectory of the particle
is 2
2
c
γt
x(t) =
1+
− 1 ,
(8.2)
γ
c
where
F
.
(8.3)
m
Note: introduce here the problem set in the caption of ﬁgure 8.1. Say, in particular that we
have a measurement whose results are represented by the volumetric probability f (t, x) .
γ= 3X x 2X
X Figure 8.1: In the spacetime of special relativity, we have
measured the spacetime coordinates of an event, and obtained the volumetric probability f (t, x) displayed in the
ﬁgure at the top. We then learn that that event happened
on the trajectory of a particle with mass m submitted
to a constant force F (equation 8.2). This trajectory is
represented in the ﬁgure at the middle. It is clear that
thanks to the theory, we can ameliorate the knowledge of
the coordinates of the event, by considering the conditional
volumetric probability induced on the trajectory. See text
for details. 0 T 0 3X x 2T
c
T= γ 3T t
4T c2
X= γ 2X
X
0 T 0 3X 2T 3T t
4T 2T 3T t
4T x 2X
X
0 0 T The problem here, is clearly a problem of conditional probability, and it makes sense because
we do have a metric over our 2D space, the Minkowskian metric
ds2 = dt2 − 12
dx
c2 . (8.4) Adjusting Measurements to a Physical Theory 221 With respect to the notations in section 2.4.2.2, we have here r = r = t , and s = s = x ,
and the relation s = s(r) is, here, the relation x = x(t) given by equation 8.2.
As we have, here,
det(gr + St gs S) = c/ 1 + (γ t/c)2 , A direct use of equation 2.127
gives the (1D) volumetric probability over the time variable
k ft (t) = 1 + (γ t/c)2 f (t, x)x=x(t) , (8.5) where k is the normalization constant ensuring that
∞ dt ft (t) = 1 , (8.6) 0 and where x = x(t) is a shorthand notation for the relation 8.2.
Note: I have now to transport this volumetric probability over the time axis into a volumetric
probability over the x axis, using the transport of probabilities introduced in section 2.6.
Note: I have to convince the reader here that we can not give an intrinsic deﬁnition of this
problem inside the Galilean physics, as there is no spacetime metric. This is very important,
and enforces my decision to use a metric deﬁnition of the conditional volumetric
probabilities. 222 8.2 8.2 Inverse Problems [Note: Complete and expands what follows.]
In the so-called ‘inverse problems’, values of the parameters describing physical systems are estimated, using as data some indirect measurements. A consistent formulation of inverse problems can be made using the concepts of probability theory. Data and attached uncertainties, (possibly vague) a priori information on model parameters, and a physical theory relating the model parameters to the observations are the fundamental elements of any inverse problem. While the most general solution of the inverse problem requires extensive use of Monte Carlo methods, special hypotheses (e.g., Gaussian uncertainties) allow one, in some cases, to solve part of the problem analytically (e.g., using the method of least squares).
Given a physical system, the ‘forward’ or ‘direct’ problem consists, by definition, in using a physical theory to predict the outcome of possible experiments. In classical physics,
this problem has a unique solution. For instance, given a seismic model of the whole Earth
(elastic constants, attenuation, etc. at every point inside the Earth) and given a model of a
seismic source, we can use current seismological theories to predict which seismograms should
be observed at given locations at the Earth’s surface.
The ‘inverse problem’ arises when we do not have a good model of the Earth, or a good
model of the seismic source, but we have a set of seismograms, and we wish to use these
observations to infer the internal Earth structure or a model of the source (typically we try to
infer both).
There are many reasons that make the inverse problem underdetermined (the solution is
not unique). In the seismic example, two diﬀerent Earth models may predict the same seismograms1 , the ﬁnite bandwidth of our data sets will never allow us to resolve very small features
of the Earth model, and there are always experimental uncertainties that allow diﬀerent models
to be ‘acceptable’.
The name ‘inverse problem’ is widely accepted. I only like this name moderately, as I see the
problem more as a problem of ‘conjunction of states of information’ (theoretical, experimental
and a priori information). In fact, the equations used below have a range of applicability well
beyond ‘inverse problems’: they can be used, for instance, to predict the values of observation
in a realistic situation where the parameters describing the Earth model are not ‘given’, but
only known approximately.
In fact, I like to think of an ‘inverse’ problem as merely a ‘measurement’. A measurement
that can be quite complex, but the basic principles and the basic equations to be used are the
same for a relatively complex ‘inverse problem’ as for a relatively simple ‘measurement’. 1
For instance, we could ﬁt our observations with a heterogeneous but isotropic Earth model or, alternatively,
with a homogeneous but anisotropic Earth.

8.2.1 Model Parameters and Observable Parameters

Although the separation of all the variables of a problem into two groups may sometimes be
artiﬁcial, we take this point of view here, since it allows us to propose a simple setting for a
wide class of problems.
We may have in mind a given physical system, like the whole Earth, or a small crystal under
our microscope. The system (or a given state of the system) may be described by assigning
values to a given set of parameters m = {m1 , m2 , . . . , mNM } that we will name the model
parameters .
Let us assume that we make observations on this system. Although we are interested in
the parameters m , they may not be directly observable, so we may make some indirect measurement like obtaining seismograms at the Earth’s surface for analyzing the Earth’s interior,
or making spectroscopic measurements for analyzing the chemical properties of a crystal. The
set of (directly) observable parameters (or, by language abuse, the set of data parameters ) will
be represented by d = {d1 , d2 , . . . , dND } .
We assume that we have a physical theory that solves the forward problem , i.e., that given
an arbitrary model m , it allows us to predict the theoretical data values d that an ideal
measurement should produce (if m was the actual system). The generally nonlinear function
that associates to any model m the theoretical data values d may be represented by a
notation like
d^i = f^i(m^1, m^2, . . . , m^NM) ;  i = 1, 2, . . . , ND ,    (8.7)

or, for short,

    d = f(m) .    (8.8)

In fact, it is this expression that separates the whole set of our parameters into the subsets d
and m , as sometimes there is no diﬀerence of nature between the parameters in d and the
parameters in m . For instance, in the classical inverse problem of estimating the hypocenter
coordinates of an earthquake, we may put in d the arrival times of the seismic waves at some
seismic observatories, and we need to put in m the coordinates of the observatories —as these
are parameters that are needed to compute the travel times—, although we estimate arrival
times of waves as well as coordinates of the observatories using similar types of measurements. 8.2.2 A Priori Information on Model Parameters In a typical geophysical problem, the model parameters contain geometrical parameters (positions and sizes of geological bodies) and physical parameters (values of the mass density, of the
elastic parameters, the temperature, the porosity, etc.).
The a priori information on these parameters is all the information we possess independently of the particular measurements that will be considered as ‘data’ (to be described below).
This probability distribution is, generally, quite complex, as the model space may be high
dimensional, and the parameters may have nonstandard probability densities.
To this, generally complex, probability distribution over the model space corresponds a
volumetric probability that we denote as ρm (m) .
If an explicit expression for the volumetric probability ρm (m) is known, then it can be
used in analytical developments. But such an explicit expression is, by no means, necessary. All that is needed is a set of probabilistic rules that allows us to generate samples of ρm(m)
in the model space (random samples distributed according to ρm (m) ).
Example 8.1 Gaussian a priori Information.
Of course, the simplest example of a probability distribution is the Gaussian (or ‘normal’)
distribution. Not many physical parameters accept the Gaussian as a probabilistic model (we
have, in particular, seen that many positive parameters are Jeﬀreys parameters, for which the
simplest consistent volumetric probability is not the normal, but the lognormal). But if we have
chosen the right parameters (for instance, taking the logarithms of all Jeﬀreys parameters), it
may happen that the Gaussian probabilistic model is acceptable. We then have

ρm(m) = k exp( −(1/2) (m − mprior)^T C_prior^−1 (m − mprior) ) .   (8.9)

When this Gaussian volumetric probability is used, mprior , the center of the Gaussian, is called
the ‘a priori model’ while Cprior is called the ‘a priori covariance matrix’. The name ‘a priori
model’ is dangerous, as for large dimensional problems, the average model may not be a good
representative of the models that can be obtained as samples of the distribution (see ﬁgure 8.27
as an example). Other usual sources of prior information are the ranges and distribution of
media properties in the rocks, or probabilities for the localization of media discontinuities. If the
information refers only to the marginals of the model parameters, and does not include any description of the relations across model parameters, the prior volumetric probability reduces to a product of univariate densities, ρm(m) = ∏i ρi(mi). The next example illustrates this case. [End of
example.]
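The sampling point of view can be illustrated for the Gaussian prior of equation 8.9. The following is a minimal sketch; the dimension and all numerical values are invented for illustration, not taken from the text:

```python
import numpy as np

# Sketch: generating samples of the Gaussian prior of equation 8.9,
#   rho_m(m) = k exp( -1/2 (m - m_prior)^T C_prior^{-1} (m - m_prior) ).
# The dimension and the numerical values below are illustrative assumptions.
rng = np.random.default_rng(0)

m_prior = np.array([1.0, 2.0])          # center of the Gaussian ('a priori model')
C_prior = np.array([[0.50, 0.20],
                    [0.20, 0.40]])      # 'a priori covariance matrix'

# With the Cholesky factor L (C_prior = L L^T), m = m_prior + L z is a sample
# of the prior whenever z is a vector of independent standard normal variables.
L = np.linalg.cholesky(C_prior)
samples = m_prior + rng.standard_normal((10000, 2)) @ L.T

print(samples.mean(axis=0))   # close to m_prior
print(np.cov(samples.T))      # close to C_prior
```

This is exactly the "set of probabilistic rules that allows us to generate samples" mentioned above: the explicit expression 8.9 is never evaluated during the sampling.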
Example 8.2 Prior Information for a 1D Mass Density Model
We consider the problem of describing a model consisting of a stack of horizontal layers
with variable thickness and uniform mass density. The prior information is shown in ﬁgure 8.2, involving marginal distributions of the mass density and the layer thickness. Spatial statistical homogeneity is assumed, hence marginals are not dependent on depth in this
example. Additionally, they are independent of neighbor layer parameters. The model parameters consist of a sequence of thicknesses and a sequence of mass density parameters,
m = { ℓ1, ℓ2, ..., ℓNL, ρ1, ρ2, ..., ρNL } . The marginal prior probability densities for the layer
thicknesses are all assumed to be identical and of the form (exponential volumetric probability)
f(ℓ) = (1/ℓ0) exp( −ℓ/ℓ0 ) ,   (8.10)

where the constant ℓ0 has the value ℓ0 = 4 km (see the left of figure 8.2), while all the
marginal prior probability densities for the mass density are also assumed to be identical, and
of the form (lognormal volumetric probability)

g(ρ) = ( 1 / ( √(2π) σ ρ ) ) exp( −(1/(2σ²)) ( log(ρ/ρ0) )² ) ,   (8.11)

where ρ0 = 3.98 g/cm3 and σ = 0.58 (see the right of figure 8.2). Assuming that the probability distribution of any layer thickness is independent of the thicknesses of the other layers,
that the probability distribution of any mass density is independent of the mass densities of the other layers, and that layer thicknesses are independent of mass densities, the a priori volumetric probability in this problem is the product of the a priori probability densities (equations 8.10 and 8.11) for each parameter,

ρm(m) = ρm(ℓ1, ℓ2, ..., ℓNL, ρ1, ρ2, ..., ρNL) = k ∏i=1..NL f(ℓi) g(ρi) .   (8.12)

Figure 8.3 shows (pseudo) random models generated according to this probability distribution.
Of course, the explicit expression 8.12 has not been used to generate these random models.
Rather, consecutive layer thicknesses and consecutive mass densities have been generated using
the univariate probability densities deﬁned by equations 8.10 and 8.11. [End of example.]
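The generation procedure just described can be sketched in a few lines of code. Only the constants ℓ0 = 4 km, ρ0 = 3.98 g/cm3 and σ = 0.58 are taken from the example; the number of layers is an arbitrary choice of this sketch:

```python
import numpy as np

# Sketch of the sampling used in Example 8.2: layer thicknesses follow the
# exponential density of equation 8.10, mass densities follow the lognormal
# density of equation 8.11. NL (number of layers) is an illustrative choice.
rng = np.random.default_rng(0)

ELL0 = 4.0      # ell_0, mean layer thickness in km (equation 8.10)
RHO0 = 3.98     # median mass density in g/cm3 (equation 8.11)
SIGMA = 0.58    # dispersion parameter of the lognormal (equation 8.11)
NL = 5          # layers per model (hypothetical)

def random_model(rng):
    """One random Earth model: (layer thicknesses, mass densities)."""
    thickness = rng.exponential(scale=ELL0, size=NL)            # eq. 8.10
    density = RHO0 * np.exp(SIGMA * rng.standard_normal(NL))    # eq. 8.11
    return thickness, density

thickness, density = random_model(rng)
print(thickness)   # drawn from a density with mean ell_0 = 4 km
print(density)     # drawn from a density with median rho_0 = 3.98 g/cm3
```

Each call to `random_model` produces one model of the kind displayed in figure 8.3; the joint expression 8.12 is never used, only the univariate marginals.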
Figure 8.2: At left, the probability density for the layer thickness. At right, the probability density for the mass density. (Axes: thickness in km; mass density in g/cm3.)

Figure 8.3: Three random Earth models generated according to the a priori probability density in the model space. (Axes: depth in km; mass density in g/cm3.)

8.2.3 Measurements and Experimental Uncertainties

Note: the text that was here has been moved to section 5.8.

8.2.4 Joint 'Prior' Probability Distribution in the (M, D) Space

We have just seen that the a priori information on model parameters can be described by a
volumetric probability in the model space, ρm (m) , and that the result of measurements can
be described by a volumetric probability in the data space ρd (d) . As by ‘a priori’ information
on model parameters we mean information obtained independently from the measurements, we
can multiply these two volumetric probabilities (see section 2.5.5 on Independent Probability
Distributions) to deﬁne a joint volumetric probability in the X = (M, D) space.
ρ(x) = ρ(m, d) = ρm(m) ρd(d) .   (8.13)

Although we have introduced ρm(m) and ρd(d) separately, and we have suggested building a probability distribution in the (M, D) space by the multiplication 8.13, we may have a more general situation where the information we have on m and on d is not independent. So, in
what follows, let us assume that we have some information in the (M , D ) space, represented
by the volumetric probability ρ(x) = ρ(m, d) and let us contemplate equation 8.13 as just a
special case.

8.2.5 Physical Laws

Physics analyzes the correlations existing between physical parameters. In standard mathematical physics, these correlations are represented by 'equalities' between physical parameters
(like when we write f = m a to relate the force f applied to a particle, the mass m of
the particle and the acceleration a ). In the context of inverse problems this corresponds to
assuming that we have a function from the ‘parameter space’ to the ‘data space’ that we may
represent as
d = d(m) .   (8.14)

We do not mean that the relation is necessarily explicit. Given m , we may need to solve
a complex system of equations in order to get d , but this, nevertheless deﬁnes a function
m → d = d(m) .
At this point, given the volumetric probability ρ(m, d) and given the relation d =
d(m) , one may wish to deﬁne the associated conditional volumetric probability. But we have
emphasized in chapter 2 that there is no way to deﬁne a conditional volumetric probability
given only an equation like d = d(m) : we must, in addition, specify a metric in the (M, D)
space² , that we may denote here by

g(m, d) = ( gm(m)   0 ;   0   gd(d) ) ,   (8.15)

where, to simplify the exposition, I assume the special case where the metric partitions into a
metric gm(m) in the model space M and a metric gd(d) in the data space D .

8.2.6 Inverse Problems

In the X = (M, D) space, we have the volumetric probability ρ(m, d) , and we have the hypersurface defined by the relation d = d(m) . We can 'combine' these two kinds of information
by using the conditional volumetric probability deduced from ρ(m, d) on the hypersurface
d = d(m) (see equation 2.127)
σm(m) = k ρ(m, d(m)) √( det(gm + D^T gd D) / det gm ) ,   (8.16)

where D = D(m) is the matrix of partial derivatives, with components D^i_α = ∂d^i/∂m^α ,
where gm = gm (m) and where gd = gd (d(m)) .
The probability of a ﬁnite domain A of the model space is then to be evaluated as
P(A) = ∫_A dm^1 ∧ · · · ∧ dm^NM √(det gm) σm(m) .   (8.17)

Example 8.3 In the particular case where
ρ(m, d) = ρm(m) ρd(d) ,   (8.18)

² Or, at least in the vicinity of the submanifold d = d(m) .

equation 8.16 becomes
σm(m) = k ρm(m) ρd(d(m)) √( det(gm + D^T gd D) / det gm ) ,   (8.19)

where, again, D = D(m) , gm = gm(m) and gd = gd(d(m)) . [End of example.]
Example 8.4 The conditional volumetric probability has been deﬁned by taking an ‘orthogonal
limit’. Should one have some reason to prefer the ‘vertical limit’, it can be obtained here by
formally taking the limit gd → 0 . Then, equation 8.19 simpliﬁes into
σm(m) = k ρm(m) ρd(d(m)) ,   (8.20)

where the partial derivatives D don't appear. [End of example.]
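For a model space of very small dimension, a posterior of the simple form of equation 8.20 can be evaluated directly on a grid (a possibility discussed again at the end of this section). Here is a minimal sketch with one model parameter; the forward relation d(m) = m² and all numerical values are invented for illustration:

```python
import numpy as np

# Sketch: evaluating sigma_m(m) = k rho_m(m) rho_d(d(m)) (equation 8.20) on a
# grid, for a toy problem with one model parameter and one datum. The forward
# relation and all numerical values are illustrative assumptions.
def rho_m(m, m_prior=2.0, s_m=1.0):      # Gaussian prior information on m
    return np.exp(-0.5 * ((m - m_prior) / s_m) ** 2)

def rho_d(d, d_obs=5.0, s_d=0.5):        # Gaussian information on the datum
    return np.exp(-0.5 * ((d - d_obs) / s_d) ** 2)

def forward(m):                          # hypothetical forward relation d(m)
    return m ** 2

m_grid = np.linspace(0.0, 4.0, 2001)
dm = m_grid[1] - m_grid[0]
sigma = rho_m(m_grid) * rho_d(forward(m_grid))
sigma /= sigma.sum() * dm                # normalize to unit integral

m_max = m_grid[np.argmax(sigma)]         # maximum of the posterior
print(m_max)  # close to sqrt(5) ~ 2.236, pulled slightly toward m_prior = 2
```

Once the grid values are available, any central or dispersion estimator can be computed from them, but a plot of `sigma` itself is usually more informative.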
Example 8.5 Gaussian Case. Let us examine here how equation 8.19 simpliﬁes when assuming
that the ‘input’ probability densities are Gaussian, and that the weight matrices (inverse of the
covariance matrices) are the metric matrices (note: explain this, and give here the argument
that the accuracy of a theory is, ultimately, the accuracy of the experiments used to control it):

ρm(m) = k exp( −(1/2) (m − mprior)^t gm (m − mprior) ) ,   (8.21)

ρd(d) = k exp( −(1/2) (d − dobs)^t gd (d − dobs) ) .   (8.22)

Equation 8.19 then gives
σm(m) = k √( det(gm + D^t(m) gd D(m)) / det gm ) × exp( −(1/2) [ (m − mprior)^t gm (m − mprior) + (d(m) − dobs)^t gd (d(m) − dobs) ] )   (8.23)

(the constant factor √(det gm) has been left for subsequent simplifications). Defining the misfit

S(m) = − log( σm(m) / σ0 ) ,   (8.24)

where σ0 is an arbitrary value of σm(m) , gives, up to an additive constant,
S(m) = S1(m) − S2(m) ,   (8.25)

where S1(m) is the usual least-squares misfit function
2 S1(m) = (m − mprior)^t gm (m − mprior) + (d(m) − dobs)^t gd (d(m) − dobs) ,   (8.26)

and where (as log √A = (1/2) log A )

2 S2(m) = log( det(gm + D^t(m) gd D(m)) / det gm ) .   (8.27)

The maximum likelihood point is defined as the point where the volumetric probability is maximum³. If γ denotes the gradient of the misfit,
γα = ∂S/∂mα ,   (8.28)

then, the steepest ascent direction is the vector γ̃ defined through

gm γ̃ = γ .   (8.29)

The algorithm

mk+1 = mk − εk γ̃k ,   (8.30)

where εk is an ad hoc, well-chosen number, is called the algorithm of steepest descent; it converges
to the maximum likelihood point (or, at least, to a local maximum). To ensure convergence,
it is suﬃcient to use a descent direction, not necessarily the steepest one. This, in practice,
allows two simpliﬁcations: (i) compute only an approximation to the gradient, (ii) use physical
intuition to deﬁne directions that are better (for ﬁnite jumps) than the locally steepest one.
In many applications, it is the gradient of S1 (m) that is computed, not that of S (m) =
S1(m) − S2(m) , and this gradient is approximated by dropping the derivatives of D(m) (i.e.,
second derivatives of d(m) ). One then has

γk ≈ gm (mk − mprior) + Dk^t gd (dk − dobs) ,

where Dk = D(mk) and dk = d(mk) . Using the relation between the gradient and the steepest-ascent vector (equation 8.29) this gives⁴

gm γ̃k ≈ gm (mk − mprior) + Dk^t gd (dk − dobs) .   (8.31)

The two equations 8.30–8.31 encapsulate the algorithm of steepest descent. Once the algorithm
has converged, if the volumetric probability σm (m) is approximated by a Gaussian centered
on the maximum likelihood point, then, the weight matrix (inverse of the covariance matrix) of
the Gaussian is (see equation 2.144)
g̃m = gm + D^t gd D .   (8.32)

[End of example.]
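The iteration of equations 8.30–8.31 can be sketched as follows. Everything here — the forward relation, its Jacobian, and all numerical values — is an invented toy problem, not an example from the text; following footnote 4, a linear system is solved at each step rather than inverting gm :

```python
import numpy as np

# Sketch of the steepest-descent iteration of equations 8.30-8.31:
#   g_m gamma_tilde_k = g_m (m_k - m_prior) + D_k^t g_d (d_k - d_obs)
#   m_{k+1}           = m_k - eps_k gamma_tilde_k
# The forward relation, its Jacobian and all numerical values are invented.
def forward(m):
    return np.array([m[0] ** 2 + m[1], m[0] * m[1]])

def jacobian(m):                       # D(m), matrix of partial derivatives
    return np.array([[2.0 * m[0], 1.0],
                     [m[1], m[0]]])

g_m = np.eye(2)                        # model-space weight (inverse covariance)
g_d = np.eye(2)                        # data-space weight (inverse covariance)
m_prior = np.array([1.0, 1.0])
d_obs = forward(np.array([1.2, 0.8]))  # data computed from a known model

m = m_prior.copy()
for _ in range(500):
    D = jacobian(m)
    rhs = g_m @ (m - m_prior) + D.T @ g_d @ (forward(m) - d_obs)
    gamma_tilde = np.linalg.solve(g_m, rhs)   # solve, do not invert g_m
    m = m - 0.1 * gamma_tilde                 # eps_k = 0.1, a fixed small step

print(m)  # the maximum likelihood model, a compromise between prior and data
```

At the fixed point the right-hand side of equation 8.31 vanishes, i.e. the gradient of S1 is zero, which is the stationarity condition of the least-squares misfit.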
Example 8.6 If the ‘relation solving the forward problem’ d = d(m) happens to be a linear
relation,
d = D m ,   (8.33)

then the volumetric probability σm(m) in equation 8.23 becomes⁵
σm(m) = k exp( −(1/2) [ (m − mprior)^t gm (m − mprior) + (D m − dobs)^t gd (D m − dobs) ] ) .   (8.34)

³ Unfortunately, many authors define, inconsistently, the maximum likelihood point as the point where the probability density is maximum.
⁴ Of course, one could equivalently write γ̃k ≈ (mk − mprior) + gm^−1 Dk^t gd (dk − dobs) , but, numerically, it is usually much better to solve a linear system than to evaluate the inverse of a matrix. This may be important in large-dimensioned spaces.
⁵ The last multiplicative factor in equation 8.23 is a constant that can be integrated into the constant k .
As the argument of the exponential is a quadratic function of m , we can write it in standard
form,
σm(m) = k exp( −(1/2) (m − m̃)^t g̃m (m − m̃) ) ,   (8.35)

this implying that σm(m) is a Gaussian volumetric probability. The values m̃ and g̃m of the
center and the weight matrix (inverse of the covariance matrix), respectively, of the Gaussian
representing the a posteriori information in the model space, can be computed using certain
matrix identities (see, for instance, Tarantola, 1987, problem 1.19). For the weight matrix, this
gives
g̃m = gm + D^t gd D ,   (8.36)

and the central point m̃ is obtained via

g̃m (m̃ − mprior) = D^t gd (dobs − D mprior) .   (8.37)

Let us introduce the covariance matrices

Cm = gm^−1 ;   C̃m = g̃m^−1 ;   Cd = gd^−1 .   (8.38)

An equation equivalent to 8.36 is

C̃m = Cm − Cm D^t ( D Cm D^t + Cd )^−1 D Cm ,   (8.39)

while an equation equivalent to 8.37 is

m̃ − mprior = Cm D^t ( D Cm D^t + Cd )^−1 (dobs − D mprior) .   (8.40)

[End of example.]
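The equivalence of the two pairs of formulas (8.36–8.37 versus 8.39–8.40) can be checked numerically. This is a sketch: the linear operator D and the covariance matrices below are invented for illustration:

```python
import numpy as np

# Sketch: the linear Gaussian posterior of Example 8.6, computed both in the
# 'weight matrix' form (equations 8.36-8.37) and in the 'covariance' form
# (equations 8.39-8.40). The operator and covariances below are invented.
D = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [3.0, 0.0]])                 # linear forward operator, d = D m
C_m = np.array([[1.0, 0.3], [0.3, 2.0]])   # prior covariance
C_d = 0.25 * np.eye(3)                     # data covariance
m_prior = np.array([1.0, -1.0])
d_obs = np.array([0.5, 1.0, 2.0])

g_m, g_d = np.linalg.inv(C_m), np.linalg.inv(C_d)

# Weight-matrix form: gtilde_m = g_m + D^t g_d D  (8.36),
# and gtilde_m (mtilde - m_prior) = D^t g_d (d_obs - D m_prior)  (8.37).
gtilde_m = g_m + D.T @ g_d @ D
m_tilde = m_prior + np.linalg.solve(gtilde_m, D.T @ g_d @ (d_obs - D @ m_prior))

# Covariance form (8.39-8.40), working with S = D C_m D^t + C_d.
S = D @ C_m @ D.T + C_d
Ctilde_m = C_m - C_m @ D.T @ np.linalg.solve(S, D @ C_m)
m_tilde2 = m_prior + C_m @ D.T @ np.linalg.solve(S, d_obs - D @ m_prior)

print(np.allclose(m_tilde, m_tilde2))                  # True
print(np.allclose(Ctilde_m, np.linalg.inv(gtilde_m)))  # True
```

The first form works in the model space (useful when ND is large), the second in the data space (useful when NM is large); the matrix identities cited in the example guarantee they agree.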
Example 8.7 If, in the context of the previous example, we do not have any a priori information on the model parameters, then Cm → ∞ I , i.e., gm → 0 . In this case,

g̃m = D^t gd D ,   (8.41)

and equation 8.37 simplifies to

m̃ = ( D^t gd D )^−1 D^t gd dobs .   (8.42)

[End of example.]
Example 8.8 In the context of the previous example, let us explore the very special circumstance where we have the same number of ‘data’ and ‘unknowns’, i.e., the case where the matrix
D is a square matrix. Assume that the matrix is regular, so its inverse exists. It is easy to see
that equation 8.42 then becomes
m̃ = D^−1 dobs .   (8.43)

We see that in this special case, m̃ is just the Cramer solution of the linear equation dobs = D m . [End of example.]

The formulas in the examples above give expressions that contain analytic parts. What we
write as d = d(m) may sometimes correspond to an explicit expression; sometimes it may
correspond to the solution of an implicit equation6 . Should d = d(m) be an explicit expression,
and should the ‘prior probability densities’ ρm (m) and ρd (d) (or the joint ρ(m, d) ) also be
given by explicit expressions (as when we have Gaussian probability densities), then the
formulas of this section would give explicit expressions for the posterior volumetric probability
σm (m) .
If the relation d = d(m) is a linear relation, then the expression giving σm (m) can
sometimes be simpliﬁed easily (as with the linear Gaussian case to be examined below). More
often than not, the relation d = d(m) is a complex nonlinear relation, and the expression we
are left with for σm (m) is explicit, but complex.
Once the volumetric probability σm (m) has been deﬁned, there are diﬀerent ways of ‘using’
it.
If the ‘model space’ M has a small number of dimensions (say between one and four), the
values of σm (m) can be computed at every point of a grid and a graphical representation
of σm (m) can be attempted. A visual inspection of such a representation is usually worth
a thousand ‘estimators’ (central estimators or estimators of dispersion). But, of course, if the
values of σm (m) are known at all signiﬁcant points, these estimators can also be computed.
This point of view is emphasized in section ??. If the ‘model space’ M has a large number of
dimensions (say from ﬁve to many millions or billions), then an exhaustive exploration of the
space is not possible, and we must turn to Monte Carlo sampling methods to extract information
from σm (m) . We discuss the application of Monte Carlo methods to inverse problems in 8.3.6.
Finally, the optimization techniques are discussed in section 8.3.7.

⁶ Practically, it may correspond to the output of some 'black box' solving the 'forward problem'.

8.3 Appendixes

8.3.1 Appendix: Short Bibliographical Review

For a long time, scientists have estimated parameters using optimization techniques. Laplace
explicitly stated the least absolute values criterion. This, and the least squares criterion were
later popularized by Gauss (1809). While Laplace and Gauss were mainly interested in overdetermined problems, Hadamard (1902, 1932) introduced the notion of "ill-posed problem", which can be seen as an underdetermined problem.
For seismologists, the ﬁrst bona ﬁde solution of an inverse problem was the estimation of
the hypocenter coordinates of an earthquake using the 'Geiger method' (Geiger, 1910), which present-day computers have made practical. In fact, seismologists have been the originators
of the theory of inverse problems (for data interpretation), and this is because the problem of
understanding the structure of the Earth’s interior using only surface data is a diﬃcult problem.
The first uses of the Monte Carlo theory to obtain Earth models were made by Keilis-Borok and Yanovskaya (1967) and by Press (1968). At about the same time, Backus and
Gilbert, and Backus alone, in the years 1967–1970, made original contributions to the theory
of inverse problems, focusing on the problem of obtaining an unknown function from discrete
data. Although the resulting mathematical theory is quite beautiful, its initial predominance
over the more ‘brute force’ (but more powerful) Monte Carlo theory was possibly due to the
quite limited capacities of the computers at that time. It is our feeling that Monte Carlo
methods will play a more important role in the future (and this is the reason why we put
emphasis on these methods in this article).
Interesting contributions to the theory were made by Wiggins (1969), with his method of
suppressing ‘small eigenvalues’, and by Franklin (1970), by introducing the right mathematical
setting for the Gaussian, functional (i.e., inﬁnite dimensional) inverse problem (see also Lehtinen
et al., 1989).
The 3D tomography of the Earth, using travel times of seismic waves, was developed by
Keiiti Aki and his coworkers, in a couple of well known papers (Aki and Lee, 1976; Aki, Christofferson and Husebye 1977). Minster and Jordan (1978) applied the theory of inverse problems to
the reconstruction of the tectonic plate motions, introducing the concept of ‘data importance’.
In an interesting paper, Rietsch (1977) made a nontrivial use of the notion of ‘noninformative’
(homogeneous, in our terminology) a priori distribution for positive parameters.
Jackson (1979) made an explicit introduction of a priori information in the context of linear
inverse problems, an approach that was generalized by Tarantola and Valette (1982) to nonlinear
problems.
There are three monographs in the area of Inverse Problems (from the view point of data
interpretation). In Tarantola (1987), the general, probabilistic formulation for nonlinear inverse
problems is proposed. The small book by Menke (1984) is easy to read. Finally, Parker (1994)
exposes his view of the general theory of linear problems.
From time to time, some authors try to resuscitate the Laplacian ‘least absolute criterion’
(and this is good). Claerbout and Muir (1973), for instance, show that the use of the ℓ1 norm can accommodate erratic data, and Djikpéssé and Tarantola (1999) used the ℓ1 norm in a large-scale inverse problem, involving seismic waveforms (a seismic reflection experiment).
Recently, the interest in Monte Carlo methods, for the solution of Inverse Problems, has
been increasing. Mosegaard and Tarantola (1995) proposed a generalization of the Metropolis
algorithm for the analysis of general inverse problems, introducing explicitly a priori probability distributions, and they applied the theory to a synthetic numerical example. Monte Carlo analysis was recently applied to real data inverse problems by Mosegaard et al. (1997), Dahl-Jensen et al. (1998), Mosegaard and Rygaard-Hjalsted (1999), and Khan et al. (2000).

8.3.2 Appendix: Example of Ideal (Although Complex) Geophysical Inverse Problem

Assume we wish to explore a complex medium, like the Earth's crust, using elastic waves.
Figure 8.4 suggests an Earth model and a set of seismograms produced by the waves generated
by an earthquake (or an artiﬁcial source). The seismometers (not represented) may be at
the Earth’s surface or inside boreholes. Although only four seismograms are displayed, actual
experiments may generate thousands or millions of them. The problem here is to use a set of
observed seismograms to infer the structure of the Earth.
Figure 8.4: A set of observed seismograms (at the right) is to be used to infer the structure of the Earth (at the left). A couple of trees suggest a scale (the numbers could correspond to meters), although the same principle can be used for global Earth tomography. (Panel title: "An Earth Model and a Set of Observed Seismograms".)

The first step is to define the set of parameters to be used to represent an Earth model.
These parameters have to qualitatively correspond to the ideas we have about the Earth’s
interior: thickness and curvature of the geological layers, position and dip of the geological
faults, etc. Inside of the bodies so deﬁned, diﬀerent types of rocks will correspond to diﬀerent
values of some geophysical quantities (volumetric mass, elastic rigidity, porosity, etc.). These
quantities, that have a smooth space variation (inside a given body), may be discretized by
considering a grid of points, by using a discrete basis of functions to represent them, etc. If
the source of seismic waves is not perfectly known (this is always the case if the source is an
earthquake), then, the parameters describing the source also belong to the ‘model parameter
set’.
A given Earth model (including the source of the waves), then, will consist in a huge set of
values: the ‘numerical values’ of all the parameters being used in the description. For instance,
we may use the parameters m = {m1, m2, ..., mM} to describe an Earth model, where M
may be a small number (for simple 1D models) or a large number (by the millions or billions
for complex 3D models). Then, we may consider an 'Earth model number one', denoted m1 ,
an ‘Earth model number two’, denoted m2 , and so on.
Now, what is a seismogram? It is, in fact, one of the components of a vectorial function
s(t) that depends on the vectorial displacement r(t) of the particles ‘at the point’ where
the seismometer is located. Given the manufacturing parameters of the seismometers, then,
it is possible to calculate the ‘output’ (seismogram) s(t) that corresponds to a given ‘input’
(soil displacement) r(t) . In some loose sense, the instrument acts as a 'nonlinear filter' (the
nonlinearity coming from the possible saturation of the sensors for large values of the input, or
from their insensitivity to small values). While the displacement of the soil is measured, say, in
micrometers, the output of the seismometer, typically an electric tension, is measured, say in
millivolts. In our digital era, seismograms are not recorded as ‘functions’. Rather, a discrete
value of the output is recorded with a given frequency (for instance, one value every millisecond).
A seismogram set consists, then, in a large number of (discrete) values, say, s_iam = ( s_i(t_a) )_m , representing the value at time t_a of the i-th component of the m-th seismogram. Such a seismogram set is what is schematically represented at the right of figure 8.4. For our needs,
the particular structure of a seismogram set is not interesting, and we will simply represent
such a set using the notation d = {d1 , d2 , . . . , dN } , where the number N may range in the
thousands (if we only have one seismogram), or in the trillions for global Earth data or data
from seismic exploration for minerals.
An exact theory then deﬁnes a function d = f (m) : given an arbitrary Earth model m ,
the associated theoretical seismograms d = f (m) can be computed.
A ‘theory’ able to predict seismograms has to encompass the whole way between the Earth
model and the instrument output, the millivolts. An ‘exact theory’ would deﬁne a functional
relationship d = f (m) associating, to any Earth model m a precisely deﬁned point in the data
space. This theory would essentially consist in the theory of elastic waves in anisotropic and
heterogeneous media, perhaps modified to include attenuation, nonlinear effects, the description
of the recording instrument, etc.
As mentioned elsewhere [note: where?], there are many reasons for which a 'theory' is not an exact functional relationship but, rather, a conditional volumetric probability ϑ(d | m) . [note:
explain this better.] Realistic estimations of this probability distribution may be extremely
complex. Sometimes we may limit ourselves to ‘putting uncertainty bars around a functional
relation’, as suggested in section 7.2.2. Then, for instance, using a Gaussian model, we may
write

ϑ(d | m) = k exp( −(1/2) (d − f(m))^T C_T^−1 (d − f(m)) ) ,   (8.44)

where the uncertainty on the predicted data point, d = f(m) , is described by the 'theory covariance operator' CT . With a simple probability model, like this one, or by any other means, it is assumed that the conditional volumetric probability ϑ(d | m) is defined.
Then, given any point m representing an Earth model, we should be able to sample the
volumetric probability ϑ(d | m) , i.e., to obtain as many samples (specimens) of d as we may
wish. Figure 8.5 gives a schematic illustration of this.
Assume that we have not yet collected the seismograms. At this moment, the information
we have on the Earth is called ‘a priori’ information. As explained elsewhere [note: say where],
it may always be represented by a probability distribution over the model parameter space,
corresponding to a volumetric probability ρm (m) . The expression of this volumetric probability
is, in realistic problems, never explicitly known. Let us see this with some detail.
In some very simple situations, we may have an ‘average a priori model’ mprior and a priori
uncertainties that can be modeled by a Gaussian distribution with covariance operator Cm .
Then,

ρm(m) = k exp( −(1/2) (m − mprior)^T C_m^−1 (m − mprior) ) .   (8.45)

Other probability models (Laplace, Pareto, etc.) may, of course, be used. In more realistic
situations, the a priori information we have over the model space is not easily expressible as
an explicit expression of a volumetric probability. Rather, a large set of rules, some of them
probabilistic, is expressed.
Already, the very definition of the parameters contains fundamental topological information (the type of objects being considered: geological layers, faults, etc.).

Figure 8.5: Given an arbitrary Earth model m , a (non-exact) theory gives a probability distribution for the data, ϑ(d | m) , that can be sampled, producing the sets of seismograms shown here. (Panel title: "Theoretical Sets of Seismograms (inside theoretical uncertainties)".)

Then, we may have rules of the type 'a sedimentary layer may never be below a layer of igneous origin' or 'with probability 2/3, a layer with a thickness larger than D is followed by a layer with a thickness
smaller than d ’, etc. There are, also, explicit volumetric probabilities, like ‘the joint volumetric
probability for porosity π and rigidity µ for a calcareous layer is g (π, µ) = . . . ’. They may
come from statistical studies made using large petrophysical data banks, or from qualitative
‘Bayesian’ estimations of the correlations existing between diﬀerent parameters.
Figure 8.6: Samples of the a priori distribution of Earth models, each accompanied by the predicted set of seismograms. A set of rules, some deterministic, some random, is used to randomly generate Earth models. These are assumed to be samples from a probability distribution over the model space corresponding to a volumetric probability ρm(m) whose explicit expression may be difficult to obtain. But it is not this expression that is required for proceeding with the method, only the possibility of obtaining as many samples of it as we may wish. Although a large number of samples may be necessary to grasp all the details of a probability distribution, as few as the six samples shown here already provide some elementary information. For instance, there are always five geological layers, separated by smooth interfaces. In each model, all four interfaces are dipping 'leftwards' or all four are dipping 'rightwards'. These observations may be confirmed, and other properties become conspicuous, as more and more samples are displayed. The theoretical sets of seismograms associated to each model, displayed at right, are as different as the models are different. These are only 'schematic' seismograms, bearing no relation to any actual situation. (Panel title: "Samples of the a priori Distribution, and Seismograms".)

The fundamental hypothesis of the approach that follows is that we are able to use this set
of rules to randomly generate Earth models, and as many as we may wish. Figure 8.6 suggests the results obtained using such a procedure. On a computer screen, when the models are displayed
one after the other, we have a ‘movie’. A geologist (knowing nothing about mathematics)
should, when observing such a movie for long enough, agree with a sentence like the following. All models displayed are possible models; the more likely models appear quite frequently; some
unlikely models appear, but infrequently; if we wait long enough we may well reach a model that
may be arbitrarily close to the actual Earth.
This means that (i) we have described the a priori information, by deﬁning a probability
distribution over the model space, (ii) we are sampling this probability distribution, even if an
expression for the associated volumetric probability ρm (m) has not been developed explicitly.
Assume now that we collect a data set, i.e., in our example, the set of seismograms generated
by a given set of earthquakes, or by a given set of artiﬁcial sources. In the notation introduced
above, a given set of seismograms corresponds to a particular point d in the data space.
As any measurement has attached uncertainties, rather than ‘a point’ in the data space, we
have, as explained elsewhere [note: say where], a probability distribution in the data space,
corresponding to a volumetric probability ρd (d) .
The simplest examples of probability distributions in the data space are obtained when
using simple probability models. For instance, the assumption of Gaussian uncertainties would
give

ρd(d) = k exp( −(1/2) (d − dobs)^T C_d^−1 (d − dobs) ) ,   (8.46)

where dobs represents the 'observed data values', with 'experimental uncertainties' described
by the covariance operator Cd . As always, other probability models may, of course, be used.
Actual experimental uncertainties are quite diﬃcult to model. [note: develop here this
notion, and explain, here or somewhere, what is ‘noise’ in a data set (unmodeled signal)].
[Note: explain that ﬁgure 8.7 represents a few samples of ‘data points’ generated according
to ρd (d) .]
Note: explain here that from

σ(d, m) = k ρ(d, m) ϑ(d, m) ,
ρ(d, m) = ρd(d) ρm(m) ,
ϑ(d, m) = ϑ(d | m) ϑm(m) = k ϑ(d | m) ,   (8.47)

it follows

σ(d, m) = k ρd(d) ϑ(d | m) ρm(m) .   (8.48)
distribution of Earth models, ρm (m) (we have seen above how to do this; see also section XXX).
Consider the following algorithm:
1. Initialize the algorithm at an arbitrary point (m1 , d1 ) , the ﬁrst ‘accepted’ point.
2. Relabel the last accepted point (mn , dn ) . Given mn , use the rules that sample the
volumetric probability ρm (m) to generate a candidate point mc .
3. Given mc , randomly generate a sample data point, according to the volumetric probability ϑ(d|mc) , and name it dc .
4. Compare the values ρd (dn ) and ρd (dc ) , and decide to accept or to reject the candidate
point dc according to the logistic or to the Metropolis rule (or any equivalent rule).
If the candidate point is accepted, set (mn+1 , dn+1 ) = (mc , dc ) and go to 2. If the
candidate point is rejected, set (mn+1 , dn+1 ) = (mn , dn ) and go to 2.
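A minimal sketch of this algorithm in Python. Everything problem-specific below is an assumed toy stand-in (a one-parameter model, a Gaussian prior playing the role of ρm , a noisy linear 'theory' playing the role of ϑ(d|m) , and a Gaussian ρd ); only the structure of steps 1–4 is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
d_obs, s_d = 2.0, 0.5

def sample_prior():              # rules that sample rho_m(m) (toy: standard normal)
    return rng.normal(0.0, 1.0)

def sample_theta(m):             # sample of theta(d|m): forward relation + modeling error
    return 2.0 * m + rng.normal(0.0, 0.1)

def rho_d(d):                    # Gaussian data volumetric probability (eq. 8.46 style)
    return np.exp(-0.5 * ((d - d_obs) / s_d) ** 2)

m_n, d_n = 0.0, 0.0              # step 1: starting point (chosen where rho_d is nonzero)
samples = []
for _ in range(20000):
    m_c = sample_prior()         # step 2: candidate model from the prior
    d_c = sample_theta(m_c)      # step 3: candidate data point
    # step 4: Metropolis rule on the rho_d values
    if rng.random() < rho_d(d_c) / rho_d(d_n):
        m_n, d_n = m_c, d_c
    samples.append(m_n)          # a rejection repeats the current point

print(np.mean(samples))          # mass concentrates near m = 1, where 2m matches d_obs
```

The accepted models sample the posterior σ(d, m) of equation 8.48; keeping only the m component gives the marginal 'movie' of models discussed below.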
Figure 8.7 ('Acceptable Sets of Seismograms, inside experimental uncertainties'): We have one 'observed set of seismograms', together with a description of the uncertainties in the data. The corresponding probability distribution may be complex (correlation of uncertainties, non-Gaussianity of the noise, etc.). Rather than plotting the 'observed set of seismograms', pseudorandom realizations of the probability distribution in the data space are displayed here.
[Note: explain here that figure 8.8 shows some samples of the a posteriori probability distribution.]

Figure 8.8 ('Samples of the a posteriori Distribution, and Seismograms'): Samples of the a posteriori distribution of Earth models, each accompanied by the predicted set of seismograms. Note that, contrary to what happens with the a priori samples, all the models presented here have 'left dipping interfaces'. The second layer is quite thin. Etc.

[Note: the marginal for m corresponds to the same 'movie', just looking at the models and disregarding the data sets. Reciprocally, the marginal for d . . . ]
‘Things’ can be considerably simpliﬁed if uncertainties in the theory can be neglected (i.e.,
if the ‘theory’ is assumed to be exact):
ϑ(d|m) = δ( d − f(m) ) .   (8.49)

Then, the marginal for m , σm(m) = ∫ dVd(d) σ(d, m) , is, using 8.48,

σm(m) = k ρm(m) ρd( f(m) ) .   (8.50)

The algorithm proposed above simplifies to:
1. Initialize the algorithm at an arbitrary point m1 , the ﬁrst ‘accepted’ point.
2. Relabel the last accepted point mn . Use the rules that sample the volumetric probability
ρm (m) to generate a candidate point mc .
3. Compute dc = f (mc ) .
4. Compare the values ρd (dn ) and ρd (dc ) , and decide to accept or to reject the candidate
point dc according to the logistic or to the Metropolis rule (or any equivalent rule). If
the candidate point is accepted, set mn+1 = mc and go to 2. If the candidate point is
rejected, set mn+1 = mn and go to 2.
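In Python, the simplified loop reduces to a few lines. The forward relation f and the probability densities below are assumed toy choices (a linear 'theory' and Gaussian densities), not the seismological example:

```python
import numpy as np

rng = np.random.default_rng(1)
d_obs, s_d = 2.0, 0.5
f = lambda m: 2.0 * m                                    # assumed exact 'theory'
rho_d = lambda d: np.exp(-0.5 * ((d - d_obs) / s_d) ** 2)

m_n = 0.0                    # step 1: starting point, chosen with rho_d(f(m_n)) > 0
chain = []
for _ in range(20000):
    m_c = rng.normal(0.0, 1.0)   # step 2: candidate sampled from rho_m (toy prior)
    d_c = f(m_c)                 # step 3: solve the forward problem
    if rng.random() < rho_d(d_c) / rho_d(f(m_n)):   # step 4: Metropolis rule
        m_n = m_c
    chain.append(m_n)
```

Note that the starting point is not completely arbitrary here: it must give a nonvanishing ρd(f(m)) , or the acceptance ratio is undefined.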
[Note: explain that both algorithms require the resolution of the 'forward problem'.]
[Note: explain that the initial point cannot be completely arbitrary.]
[Note: the validity of the algorithm with the conditional probability inside has not been
demonstrated.]
[Note: develop these notions.]

8.3.3 Appendix: Probabilistic Estimation of Earthquake Locations

Earthquakes generate waves, and the arrival times of the waves at a network of seismic observatories carry information on the location of the hypocenter. This information is better understood by a direct examination of the probability density f(X, Y, Z) defined by the arrival times than by just estimating a particular location (X, Y, Z) and the associated uncertainties.
Provided that a 'black box' is available that rapidly computes the travel times to the seismic stations from any possible location of the earthquake, this probabilistic approach can be relatively efficient. This appendix shows that it is quite trivial to write a computer code that uses this probabilistic approach (much easier than to write a code using the traditional Geiger method, which seeks to obtain the 'best' hypocentral coordinates).
8.3.3.1 A Priori Information on Model Parameters

The 'unknowns' of the problem are the hypocentral coordinates of an earthquake7 {X, Z} , as well as the origin time T . We assume that we have some a priori information about the location of the earthquake, as well as about its origin time. This a priori information is assumed to be represented using the probability density

ρm(X, Z, T) .   (8.51)

Because we use Cartesian coordinates and Newtonian time, the homogeneous probability density is just a constant,

µm(X, Z, T) = k .   (8.52)

For consistency, we must assume (rule 4.8) that the limit of ρm(X, Z, T) for infinite 'dispersions'
Example 8.9 We assume that the a priori probability density for (X, Z ) is constant inside
the region 0 < X < 60 km , 0 < Z < 50 km , and that the (unnormalizable) probability density
for T is constant. [End of example.]
8.3.3.2 Data The data of the problem are the arrival times {t1 , t2 , t3 , t4 } of the seismic waves at a set of four
seismic observatories whose coordinates are {xi , z i } . The measurement of the arrival times
will produce a probability density
ρd (t1 , t2 , t3 , t4 ) (8.53) over the ‘data space’. As these are Newtonian times, the associated homogeneous probability
density is constant:
µd (t1 , t2 , t3 , t4 ) = k . (8.54) For consistency, we must assume (rule 4.8) that the limit of ρd (t1 , t2 , t3 , t4 ) for inﬁnite ‘dispersions’ is µd (t1 , t2 , t3 , t4 ) .
7 To simplify, here, we consider a 2D flat model of the Earth, and use Cartesian coordinates.

Example 8.10 Assuming Gaussian, independent uncertainties, we have

ρd(t1 , t2 , t3 , t4) = k exp( −(1/2)(t1 − t1obs)²/σ1² ) exp( −(1/2)(t2 − t2obs)²/σ2² ) × exp( −(1/2)(t3 − t3obs)²/σ3² ) exp( −(1/2)(t4 − t4obs)²/σ4² ) .   (8.55)

[End of example.]
8.3.3.3 Solution of the Forward Problem

The forward problem consists in calculating the arrival times ti as a function of the hypocentral coordinates {X, Z} and the origin time T :

ti = f i(X, Z, T) .   (8.56)

Example 8.11 Assuming that the velocity of the medium is constant, equal to v ,

tical = T + ( (X − xi)² + (Z − zi)² )^{1/2} / v .   (8.57)

8.3.3.4 Solution of the Inverse Problem

Note: explain here that 'putting all this together',

σm(X, Z, T) = k ρm(X, Z, T) ρd(t1 , t2 , t3 , t4)|ti = f i(X,Z,T) .   (8.58)

8.3.3.5 Numerical Implementation

To show how simple it is to implement an estimation of the hypocentral coordinates using the
solution given by equation 8.58, we give, in extenso, all the commands that are necessary to
the implementation, using a commercial mathematical software (Mathematica). Unfortunately,
while it is perfectly possible, using this software, to explicitly use quantities with their physical
dimensions, the plotting routines require adimensional numbers. This is why the dimensions
have been suppressed in what follows. We use kilometers for the space positions and seconds
for the time positions.
We start by deﬁning the geometry of the seismic network (the vertical coordinate z is
oriented with positive sign upwards):
x1 = 5;
z1 = 0;
x2 = 10;
z2 = 0;
x3 = 15;
z3 = 0;
x4 = 20;
z4 = 0;

The velocity model is simply defined, in this toy example, by giving its constant value ( 5
km/s ):

v = 5;
The ‘data’ of the problem are those of example 8.10. Explicitly:
t1OBS = 30.3;
s1 = 0.1;
t2OBS = 29.4;
s2 = 0.2;
t3OBS = 28.6;
s3 = 0.1;
t4OBS = 28.3;
s4 = 0.1;
rho1[t1_] := Exp[ -(1/2) (t1 - t1OBS)^2/s1^2 ]
rho2[t2_] := Exp[ -(1/2) (t2 - t2OBS)^2/s2^2 ]
rho3[t3_] := Exp[ -(1/2) (t3 - t3OBS)^2/s3^2 ]
rho4[t4_] := Exp[ -(1/2) (t4 - t4OBS)^2/s4^2 ]

rho[t1_,t2_,t3_,t4_]:=rho1[t1] rho2[t2] rho3[t3] rho4[t4]
Although an arbitrarily complex velocity model could be considered here, let us
take, for solving the forward problem, the simple model in example 8.11:
t1CAL[X_, Z_, T_] := T + (1/v) Sqrt[ (X - x1)^2 + (Z - z1)^2 ]
t2CAL[X_, Z_, T_] := T + (1/v) Sqrt[ (X - x2)^2 + (Z - z2)^2 ]
t3CAL[X_, Z_, T_] := T + (1/v) Sqrt[ (X - x3)^2 + (Z - z3)^2 ]
t4CAL[X_, Z_, T_] := T + (1/v) Sqrt[ (X - x4)^2 + (Z - z4)^2 ]

The posterior probability density is just that defined in equation 8.58:
sigma[X_,Z_,T_] := rho[t1CAL[X,Z,T],t2CAL[X,Z,T],t3CAL[X,Z,T],t4CAL[X,Z,T]]
We should have multiplied by the ρm (X, Z, T ) deﬁned in example 8.9, but as this just corresponds to a ‘trimming’ of the values of the probability density outside the ‘box’ 0 < X < 60 km ,
0 < Z < 50 km , we can do this afterwards.
The deﬁned probability density is 3D, and we could try to represent it. Instead, let us just
represent the marginal probability densities. First, we ask the software to evaluate analytically
the space marginal:
sigmaXZ[X_,Z_] = Integrate[ sigma[X,Z,T], {T,-Infinity,Infinity} ];
This gives a complicated result, with hypergeometric functions8 . Representing this probability
density is easy, as we just need to type the command
ContourPlot[sigmaXZ[X,Z],{X,15,35},{Z,0,25},
  PlotRange->All,PlotPoints->51]
8 Typing sigmaXZ[X,Z] presents the result.

The result is represented in figure 8.9 (while the level lines are those directly produced by the
software, there has been some additional editing to add the labels). When using ContourPlot,
we change the sign of sigma, because we wish to reverse the software’s convention of using light
colors for positive values. We have chosen the right region of the space to be plotted (signiﬁcant
values of sigma) by a preliminary plotting of ‘all’ the space (not represented here).
Should we have some a priori probability density on the location of the earthquake, represented by the probability density f(X,Y,Z), then, the theory says that we should multiply
the density just plotted by f(X,Y,Z). For instance, if we have the a priori information that the
hypocenter is above the level z = −10 km, we just put to zero everything below this level in the
ﬁgure just plotted.
Let us now evaluate the marginal probability density for the time, by typing the command
sigmaT[T_] := NIntegrate[ sigma[X,Z,T], {X,0,+60}, {Z,0,+50} ]
Here, we ask Mathematica NOT to try to evaluate analytically the result, but to perform a
numerical computation (as we have checked that no analytical result is found). We use the ‘a
priori information’ that the hypocenter must be inside a region 0 < X < 60 km , 0 < Z < 50 km
by limiting the integration domain to that area (see example 8.9). To represent the result, we
enter the command
p = Table[0,{i,1,400}];
Do[ p[[i]] = sigmaT[i/10.] , {i,100,300}]
ListPlot[ p,PlotJoined->True, PlotRange->{{100,300},All}]

Figure 8.9: The probability density for the location of the hypocenter. Its asymmetric shape is quite typical, as seismic observatories tend to be asymmetrically placed. [The plot also shows the four stations, with the observed arrival times t1obs = (30.3 ± 0.1) s , t2obs = (29.4 ± 0.2) s , t3obs = (28.6 ± 0.1) s , t4obs = (28.3 ± 0.1) s , and the velocity v = 5 km/s .]

and the produced result is shown (after some editing) in figure 8.10. The software was not very stable in producing the results of the numerical integration.

Figure 8.10: The marginal probability density for
the origin time. The asymmetry seen in the probability density in ﬁgure 8.9, where the decay of
probability is slow downwards, translates here into
signiﬁcant probabilities for early times. The sharp
decay of the probability density for t < 17s does
not come from the values of the arrival times, but
from the a priori information that the hypocenters
must be above the depth Z = −50 km .
8.3.3.6 An Example of Bimodal Probability Density for an Arrival Time

As an exercise, the reader could reformulate the problem replacing the assumption of Gaussian
uncertainties in the arrival times by multimodal probability densities. For instance, ﬁgure 5.6
suggested the use of a bimodal probability density for the reading of the arrival time of a seismic
wave. Using the Mathematica software, the command
rho[t_] := (If[8.0<t<8.8,5,1] If[9.8<t<10.2,10,1])

defines a probability density that, when plotted using the command
Plot[ rho[t],{t,7,11} ]
produces the result displayed in ﬁgure 8.11.
Figure 8.11: In figure 5.6 it was suggested that the probability density for the arrival time of a seismic phase may be multimodal. This is just an example to show that it is quite easy to define such multimodal probability densities in computer codes, even if they are not analytic.

8.3.4 Appendix: Functional Inverse Problems

8.3.4.1 Introduction

As mentioned in section 2.2, the main concern of this article is with discrete problems, i.e., problems
where the number of data/parameters is finite. When functions are involved, it was assumed that a sampling of the function could be made that was fine enough for subsequent refinements of the sampling to have no effect on the results. This, of course, means replacing any step
(Heaviside) function by a sort of discretized Erf function9 . The limit of a very steep Erf function
being the step function, any functional operation involving the Erf will have as limit the same
functional operation involving the step (unless very pathological problems are considered).
The major reason for this limitation is that probability theory is easily developed in finite-dimensional spaces, but not in infinite-dimensional spaces. In fact, the only practical infinite-dimensional probability theory, where 'measures' are replaced by 'cylinder measures', is nothing but the assumption that the probabilities calculated have a well-behaved limit when the dimensions of the space tend to infinity. Then, the 'cylinder measure' or 'probability' of a region of the infinite-dimensional space is defined as the limit of the probability calculated in a finite-dimensional subspace, when the dimensions of this subspace tend to infinity.
There are, nevertheless, some parts of the theory whose generalization to the infinite-dimensional case is possible and well understood. For instance, infinite-dimensional Gaussian probability distributions have been well studied. This is not surprising, because the random realizations of an infinite-dimensional Gaussian probability distribution are L2 functions, la crème de la crème of the functions.
Most of what will be said here will concern L2 functions10 , and the formulas presented will be the functional equivalent of the least-squares formalism developed above for discrete problems. In fact, most results will be valid for Lp functions. The difference, of course, between an L2 space and an Lp space is the existence of a scalar product in the L2 spaces, a scalar product intimately related, as we will see, with the covariance operator typical of Gaussian probability distributions.
We face here an unfortunate fact that plagues some mathematical literature: the abuse of the term 'adjoint operator' where the simple 'transpose operator' would suffice. As we will see below, the transpose of a linear operator is something as simple as the original operator (just as the transpose of a matrix is as simple as the original matrix), but the adjoint of an operator is a different thing. It is defined only in spaces that have a scalar product (i.e., in L2 spaces), and depends essentially on the particular scalar product of the space. As the scalar product is, usually, nontrivial (it will always involve covariance operators in our examples), the adjoint operator is generally an object more complex than the transpose operator. What we need, for using optimization methods in functional spaces, is to be able to define the norm of a function, and the transpose of an operator, so the ideal setting is that of Lp spaces. Unfortunately, most mathematical results that, in fact, are valid for Lp , are demonstrated only for L2 .
The steps necessary for the solution of an inverse problem involving functions are: (i) definition of the functional norms; (ii) definition of the (generally nonlinear) application between parameters and data (forward problem); (iii) calculation of its tangent linear application (characterized by a linear operator); (iv) understanding of the transposed of this operator; (v) setting an iterative procedure that leads to the function minimizing the norm of the 'misfit'.

9 The Erf function, or error function, is the primitive of a Gaussian. It is a simple example of a 'sigmoidal' function.

10 Grossly speaking, a function f(x) belongs to L2 if ‖f‖ = ( ∫ dx f(x)² )^{1/2} is finite. A function f(x) belongs to Lp if ‖f‖ = ( ∫ dx |f(x)|^p )^{1/p} is finite. The limit for p → ∞ corresponds to the L∞ space.
Let us see here the main mathematical points to be understood prior to any attempt at 'functional inversion'. There are not many good books on functional analysis; probably the best is the 'Introduction to Functional Analysis' by Taylor and Lay (1980).
8.3.4.2 The Functional Spaces Under Investigation

A seismologist may consider a (three-component) seismogram

u = { ui(t) ; i = 1, 2, 3 ; t0 ≤ t ≤ t1 } ,   (8.59)

representing the displacement of a given material point of an elastic body, as a function of time. She/he may wish to define the norm of the function (in fact of 'the set of three functions') u , denoted ‖u‖ , as

‖u‖² = ∫_{t0}^{t1} dt ui(t) ui(t) ,   (8.60)

where, as usual, ui ui stands for the Euclidean scalar product. The space of all the elements u where this norm ‖u‖ is finite is, by definition, an L2 space.
This plain example is here to warn against wrong definitions of norm. For instance, we may measure a resistivity-versus-depth profile

ρ = { ρ(z) ; z0 ≤ z ≤ z1 } ,   (8.61)

but it will generally not make sense to define

‖ρ‖² = ∫_{z0}^{z1} dz ρ(z)²   (bad definition) .   (8.62)

For the resistivity-versus-depth profile is equivalent to the conductivity-versus-depth profile

σ = { σ(z) ; z0 ≤ z ≤ z1 } ,   (8.63)

where, for any z , ρ(z) σ(z) = 1 , and the definition of the norm

‖σ‖² = ∫_{z0}^{z1} dz σ(z)²   (bad definition) ,   (8.64)

would not be consistent with that of the norm ‖ρ‖ (we do not have, in general, any reason to assume that σ(z) should be 'more L2' than ρ(z) , or vice versa). This is a typical example where the logarithmic variables r = log ρ/ρ0 and s = log σ/σ0 (where ρ0 and σ0 are arbitrary constants) allow the only sensible definition of norm

‖r‖² = ‖s‖² = ∫_{z0}^{z1} dz r(z)² = ∫_{z0}^{z1} dz s(z)²   (good definition) ,   (8.65)

or, in terms of ρ and σ ,

‖ρ‖² = ‖σ‖² = ∫_{z0}^{z1} dz ( log ρ(z)/ρ0 )² = ∫_{z0}^{z1} dz ( log σ(z)/σ0 )²   (good definition) .   (8.66)

We see that the right functional space for the resistivity ρ(z) or the conductivity σ(z) is not L2 , but, to speak grossly, the exponential of L2 .
Although these examples concern the L2 norm, the same comments apply to any Lp
norm. We will see below an example with the L1 norm.
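A small numerical check of the point made in equations 8.62–8.66, sketched in Python on an assumed discretized profile (the choice ρ(z) = 100 exp(sin 6z) is arbitrary): the 'bad' L2 norms of ρ and of σ = 1/ρ are unrelated numbers, while the logarithmic norms coincide.

```python
import numpy as np

z = np.linspace(0.0, 1.0, 1001)          # depth grid, z0 = 0 , z1 = 1 (arbitrary units)
dz = z[1] - z[0]
rho = 100.0 * np.exp(np.sin(6.0 * z))    # an assumed resistivity profile
sigma = 1.0 / rho                        # the equivalent conductivity profile
rho0, sigma0 = 100.0, 1.0 / 100.0        # arbitrary reference constants

# 'Bad' definitions (8.62, 8.64): the two numbers are unrelated
bad_rho = np.sum(rho**2) * dz
bad_sigma = np.sum(sigma**2) * dz

# 'Good' definition (8.66): identical, since log(sigma/sigma0) = -log(rho/rho0)
good_rho = np.sum(np.log(rho / rho0)**2) * dz
good_sigma = np.sum(np.log(sigma / sigma0)**2) * dz

print(bad_rho, bad_sigma)    # differ by many orders of magnitude
print(good_rho, good_sigma)  # equal up to rounding
```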
8.3.4.3 Duality Product

Every time we define a functional space, and we start developing mathematical properties (for instance, analyzing the existence and uniqueness of solutions to partial differential equations), we face another function space, with the same degrees of freedom.

For instance, in elastic theory we may define the strain field ε = {εij(x, t)} . There will automatically appear another field, with the same variables (degrees of freedom), that, in this case, is the stress σ = {σij(x, t)} . The 'contracted multiplication' will consist in taking the sum (over discrete indices) and the integral (over continuous variables) of the product of the two fields, as in

⟨ σ , ε ⟩ = ∫ dt ∫ dV(x) σij(x, t) εij(x, t) ,   (8.67)

where the sum over i, j is implicit.
The space of strains and the space of stresses is just one example of dual spaces. When one space is called 'the primal space', the other one is called 'the dual space', but this is just a matter of convention.

The product 8.67 is one example of a duality product, where one element of the primal space and one element of the dual space are 'multiplied' to form a scalar (that may be a real number or that may have physical dimensions). This implies the sum or the integral over the variables of the functions. Mathematicians say that 'the dual of a space X is the space of all linear forms over X'. It is true that a given σ associates, to any ε , the number defined by equation 8.67, and that this association defines a linear application. But this rough definition of duality doesn't help readers to understand the actual mathematical structure.
8.3.4.4 Scalar Product in L2 Spaces

When we consider a functional space, its dual appears spontaneously, and we can say that any space is always accompanied by its dual space (as in the strain-stress example seen above). Then, the duality product is always defined.

Things are completely different with the scalar product, which is only defined sometimes. If, for instance, we consider functions f = {f(x)} belonging to a space F , the scalar product is a bilinear form that associates, to any pair of elements f1 and f2 of F , a number11 denoted ( f1 , f2 ) .
Practically, to define a scalar product over a space F , we must first define a symmetric, positive definite operator C−1 mapping F into its dual, F̂ . The dual of a function f = {f(x)} , which we may denote f̂ = {f̂(x)} , is then

f̂ = C−1 f .   (8.68)

11 It is usually a real number, but it may have physical dimensions.

The scalar product of two elements f1 and f2 of F is then defined as

( f1 , f2 ) = ⟨ f̂1 , f2 ⟩ = ⟨ C−1 f1 , f2 ⟩ .   (8.69)

In the context of an infinite-dimensional Gaussian process, some mean and some covariance
are always defined. If, for instance, we consider functions f = {f(x)} , the mean function may be denoted f0 = {f0(x)} and the covariance function (the kernel of the covariance operator) may be denoted C = {C(x, x′)} . The space of functions we work with, say F , is the set of all the possible random realizations of such a Gaussian process with the given mean and the given covariance. The dual of F can here be identified with the image of F under C−1 , the inverse of the covariance operator (which is a symmetric, positive definite operator). So, denoting F̂ the dual of F , we can formally write F̂ = C−1 F or, equivalently, F = C F̂ .
The explicit expression of the equation

f = C f̂   (8.70)

is

f(x) = ∫ dx′ C(x, x′) f̂(x′) .   (8.71)

Let us denote W the inverse of the covariance operator,

W = C−1 ,   (8.72)

which is usually named the weight operator. As C W = W C = I , its kernel, W(x, x′) , the weight function, satisfies

∫ dx′ W(x, x′) C(x′, x′′) = ∫ dx′ C(x, x′) W(x′, x′′) = δ(x − x′′) ,   (8.73)

where δ( · ) is the Dirac delta 'function'. Typically, the covariance function C(x, x′) is a smooth function; then, the weight function W(x, x′) is a distribution (a sum of Dirac delta 'functions' and their derivatives).

Equations 8.70–8.71 can equivalently be written

f̂ = W f   (8.74)

and

f̂(x) = ∫ dx′ W(x, x′) f(x′) .   (8.75)

If the duality product between f̂1 and f2 is written

⟨ f̂1 , f2 ⟩ = ∫ dx f̂1(x) f2(x) ,

the scalar product, as defined by equation 8.69, becomes

( f1 , f2 ) = ⟨ f̂1 , f2 ⟩ = ⟨ C−1 f1 , f2 ⟩ = ⟨ W f1 , f2 ⟩   (8.76)

= ∫ dx ( ∫ dx′ W(x, x′) f1(x′) ) f2(x) = ∫ dx ∫ dx′ f1(x) W(x, x′) f2(x′) .   (8.77)

The norm of f , denoted ‖f‖ and defined as

‖f‖² = ( f , f ) ,   (8.78)

is expressed, in this example, as

‖f‖² = ∫ dx ∫ dx′ f(x) W(x, x′) f(x′) .   (8.79)

This is the L2 norm of the function f(x) (the case where W(x, x′) = δ(x − x′) being a very
One final remark. If f̂(x) is a random realization of a Gaussian white noise with zero mean, then the function f(x) defined by equation 8.71 is a random realization of a Gaussian process with zero mean and covariance function C(x, x′) . This means that if the space F is the space of all the random realizations of a Gaussian process with covariance operator C , then its dual, F̂ , is the space of all the realizations of a Gaussian white noise.
Example 8.12 Consider the covariance operator C , with covariance function C(x, x′) ,

f = C f̂  ⇐⇒  f(x) = ∫_{−∞}^{+∞} dx′ C(x, x′) f̂(x′) ,   (8.80)

in the special case where the covariance function is the exponential function,

C(x, x′) = σ² exp( − |x − x′| / X ) ,   (8.81)

where X is a constant. The results of this example are a special case of those demonstrated in Tarantola (1987, page 572). The inverse covariance operator is

f̂ = C−1 f  ⇐⇒  f̂(x) = (1/(2σ²)) ( f(x)/X − X f̈(x) ) ,   (8.82)

where the double dot means second derivative. As noted above, if f(x) is a random realization of a Gaussian process having the exponential covariance function considered here, then the f̂(x) given by this equation is a random realization of a white noise. Formally, this means that the weighting function (kernel of C−1 ) is W(x, x′) = (1/(2σ²)) ( (1/X) δ(x − x′) − X δ̈(x − x′) ) . The squared norm of a function f(x) is obtained integrating by parts:

‖f‖² = ⟨ f̂ , f ⟩ = (1/(2σ²)) ( (1/X) ∫_{−∞}^{+∞} dx f(x)² + X ∫_{−∞}^{+∞} dx ḟ(x)² ) .   (8.83)

This is the usual norm in the so-called Sobolev space H1 . [End of example.]
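A discrete counterpart of this example can be checked numerically (a sketch with assumed values σ = 1.5 , X = 0.2 ): on a uniform grid, the exponential covariance 8.81 is the covariance of a first-order Markov process, and the matrix inverse of C is exactly tridiagonal — the discrete trace of the local differential operator of equation 8.82.

```python
import numpy as np

sigma_c, X = 1.5, 0.2                     # assumed values of the constants in eq. 8.81
x = np.linspace(0.0, 1.0, 60)             # uniform grid
C = sigma_c**2 * np.exp(-np.abs(x[:, None] - x[None, :]) / X)   # eq. 8.81 on the grid

W = np.linalg.inv(C)                      # discrete weight operator W = C^{-1}

# Remove the three central diagonals; what remains should vanish
off = W.copy()
for k in (-1, 0, 1):
    off -= np.diag(np.diag(W, k), k)

ratio = np.max(np.abs(off)) / np.max(np.abs(W))
print(ratio)                              # at the level of rounding errors
```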
8.3.4.5 The Transposed Operator

Let G be a linear operator mapping a space E into a space F (we have in mind functional spaces, but the definition is general). We denote, as usual,

G : E → F .   (8.84)

If e ∈ E and f ∈ F , then we write

f = G e .   (8.85)

Let Ê and F̂ be the respective duals of E and F , and denote ⟨ · , · ⟩E and ⟨ · , · ⟩F the respective duality products. A linear operator H mapping the dual of F into the dual of E is named the transpose of G if for any f̂ ∈ F̂ and for any e ∈ E we have ⟨ f̂ , G e ⟩F = ⟨ H f̂ , e ⟩E , and, in this case, we use the notation H = GT . The whole definition then reads

GT : F̂ → Ê ;   (8.86)

∀ e ∈ E , ∀ f̂ ∈ F̂ :   ⟨ f̂ , G e ⟩F = ⟨ GT f̂ , e ⟩E .   (8.87)

Example 8.13 The Transposed of a Matrix. Let us consider a discrete situation where
f = G e  ⇐⇒  fi = Σ_α Giα eα .   (8.88)

In this circumstance, the duality products in each space will read

⟨ f̂ , f ⟩F = Σ_i f̂i fi ;   ⟨ ê , e ⟩E = Σ_α êα eα .   (8.89)

The linear operator H is the transposed of G if for any f̂ and for any e (equation 8.87),

⟨ f̂ , G e ⟩F = ⟨ H f̂ , e ⟩E ,   (8.90)

i.e., if

Σ_i f̂i (G e)i = Σ_α (H f̂)α eα ,   (8.91)

or, explicitly,

Σ_i f̂i Σ_α Giα eα = Σ_α ( Σ_i Hαi f̂i ) eα .   (8.92)

The condition can be written

Σ_i Σ_α f̂i Giα eα = Σ_i Σ_α f̂i Hαi eα ,   (8.93)

and it is clear that this is true for any f̂ and for any e iff

Hαi = Giα ,   (8.94)

i.e., if the matrix representing H is the transpose (in the elementary matricial sense) of the matrix representing G :

H = GT .   (8.95)

This demonstrates that the abstract definition given above of the transpose of a linear operator is consistent with the matricial notion of transpose. [End of example.]
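The matricial identity can also be checked numerically; the following sketch verifies the defining relation 8.87 for a random matrix, the duality products being plain Euclidean sums as in equation 8.89:

```python
import numpy as np

rng = np.random.default_rng(0)

G = rng.normal(size=(3, 5))     # a linear operator mapping E (dim 5) into F (dim 3)
e = rng.normal(size=5)          # an element of E
fh = rng.normal(size=3)         # an element of the dual of F

lhs = fh @ (G @ e)              # < f^ , G e >_F
rhs = (G.T @ fh) @ e            # < G^T f^ , e >_E
print(abs(lhs - rhs))           # zero up to rounding
```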
Example 8.14 The Transposed of the Derivative Operator. Let us consider a situation where

v = D x  ⇐⇒  v(t) = (dx/dt)(t) ,   (8.96)

i.e., where the linear operator D is the derivative operator. In this circumstance, the duality products in each space will typically read

⟨ v̂ , v ⟩V = ∫_{t1}^{t2} dt v̂(t) v(t) ;   ⟨ x̂ , x ⟩X = ∫_{t1}^{t2} dt x̂(t) x(t) .   (8.97)

If the linear operator DT has to be the transposed of D , for any v and for any x we must have (equation 8.87)

⟨ v , D x ⟩V = ⟨ DT v , x ⟩X .   (8.98)

[End of example.]
Let us demonstrate that the derivative operator is an antisymmetric operator, i.e., that

DT = −D .   (8.99)

To demonstrate this, we will need to make a restrictive condition, interesting to analyze. Using 8.99, equation 8.98 writes

∫_{t1}^{t2} dt v(t) (D x)(t) = − ∫_{t1}^{t2} dt (D v)(t) x(t) ,   (8.100)

i.e.,

∫_{t1}^{t2} dt v(t) (dx/dt)(t) + ∫_{t1}^{t2} dt (dv/dt)(t) x(t) = 0 .   (8.101)

We have to check if this equation holds for any x(t) and any v(t) . The condition is equivalent to

∫_{t1}^{t2} dt ( v(t) (dx/dt)(t) + (dv/dt)(t) x(t) ) = 0 ,   (8.102)

i.e., to

∫_{t1}^{t2} dt (d/dt) ( v(t) x(t) ) = 0 ,   (8.103)

or, using the elementary properties of the integral, to

v(t2) x(t2) − v(t1) x(t1) = 0 .   (8.104)

In general, there is no reason for this to be true. So, in general, we cannot say that DT = −D .
If the spaces of functions we work with (here, the space of functions v(t) and the space of functions x(t) ) satisfy the condition 8.104, it is said that the spaces satisfy dual boundary conditions. If the spaces satisfy dual boundary conditions, then it is true that DT = −D , i.e., that the derivative operator is antisymmetric.

A typical example of dual boundary conditions being satisfied is the case where all the functions x(t) vanish at the initial time, and all the functions v(t) vanish at the final time:

x(t1) = 0 ;   v(t2) = 0 .   (8.105)

The notation DT = −D is very suggestive. One has, nevertheless, to remember that (with the boundary conditions chosen) while D acts on functions x(t) that vanish at the initial time, DT acts on functions v(t) that vanish at the final time.
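A discretized sanity check of this antisymmetry (a sketch with assumed functions x(t) = t² and v(t) = (1 − t)² on [0, 1] , which satisfy the dual boundary conditions 8.105):

```python
import numpy as np

t = np.linspace(0.0, 1.0, 2001)
dt = t[1] - t[0]
x = t**2               # x(t1) = 0 at the initial time t1 = 0
v = (1.0 - t)**2       # v(t2) = 0 at the final time t2 = 1

dx = np.gradient(x, dt)   # D x
dv = np.gradient(v, dt)   # D v

lhs = np.sum(v * dx) * dt    # < v , D x >
rhs = -np.sum(dv * x) * dt   # < -D v , x >
print(lhs, rhs)              # both approximate the exact value 1/6
```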
Consider now the operator D2 (second derivative),

γ(t) = (d²x/dt²)(t) .   (8.106)

Following the same lines of reasoning as above, the reader may easily demonstrate that the
second derivative operator is symmetric, i.e., (D2)T = D2 , provided that the functional spaces under consideration satisfy the dual boundary condition

γ(t2) (dx/dt)(t2) − (dγ/dt)(t2) x(t2) = γ(t1) (dx/dt)(t1) − (dγ/dt)(t1) x(t1) .   (8.107)

A typical example where this condition is satisfied is when we have

x(t1) = 0 ;   (dx/dt)(t1) = 0 ;   γ(t2) = 0 ;   (dγ/dt)(t2) = 0 ,   (8.108)

i.e., when the functions x(t) have zero value and zero derivative value at the initial time and
the functions γ (t) have zero value and zero derivative value at the ﬁnal time.
This is the sort of boundary conditions found when working with the wave equation, as it
contains second order time derivatives. Further details are given in section 8.3.4.7 below.
As an exercise, the reader may try to understand why the quite obvious property

( ∂/∂xi )T = − ∂/∂xi   (8.109)

corresponds, in fact, to the properties

gradT = −div ;   divT = −grad   (8.110)

(hint: if an operator maps E into F , its transpose maps F̂ into Ê ; the dual of a space
has the same ‘variables’ as the original space).
Let us formally demonstrate that the operator representing the acoustic wave equation is symmetric. Starting from12

L = (1/κ(x)) ∂²/∂t² − div (1/ρ(x)) grad ,   (8.111)

we have

LT = ( (1/κ(x)) ∂²/∂t² − div (1/ρ(x)) grad )T = ( (1/κ(x)) ∂²/∂t² )T − ( div (1/ρ(x)) grad )T .   (8.112)

Using the property (A B)T = BT AT , we arrive at
LT = ( ∂²/∂t² )T ( 1/κ(x) )T − (grad)T ( 1/ρ(x) )T (div)T .   (8.113)

Now, (i) the transposed of a scalar is the scalar itself; (ii) the second derivative (as we have
seen) is a symmetric operator; (iii) we have (as it has been mentioned above) gradT = −div
and divT = −grad . We then have
LT = ∂²/∂t² (1/κ(x)) − div (1/ρ(x)) grad ,   (8.114)

and, as the incompressibility κ is assumed to be independent of time,

LT = (1/κ(x)) ∂²/∂t² − div (1/ρ(x)) grad = L ,   (8.115)

and we see that the acoustic wave operator is symmetric. As we have seen above, this conclusion
has to be understood with the condition that the wavefields p(x, t) on which L acts satisfy boundary conditions that are dual to those satisfied by the fields p̂(x, t) on which LT acts. Typically the fields p(x, t) satisfy initial conditions of rest, and the fields p̂(x, t) satisfy final conditions of rest.
Tarantola (1988) demonstrates that the transposed of the operator corresponding to the 'wave equation with attenuation' corresponds to the wave equation with 'anti-attenuation'. But it has to be understood that any physical or numerical implementation of the operator LT is made 'backwards in time', so, in that sense of time, we face an ordinary attenuation: there is no difficulty in the implementation of LT .
Example 8.15 The Kernel of the Transposed Operator. If the explicit expression of the equation

f = G e   (8.116)

is

f(t) = ∫ dx G(t, x) e(x) ,   (8.117)

where G(t, x) is an ordinary function13 , then it is said that G is an integral operator, and that the function G(t, x) is its kernel. [End of example.]

12 Here, and below, an expression like A B C means, as usual, A(B C) . This means, for instance, that the div operator in this equation is to be understood as being applied not to 1/ρ(x) only, but to 'everything at its right'.
The transpose of G will map an element f into an element e , these two elements
belonging to the respective duals of the spaces where the elements e and f mentioned in
equation 8.116 belong. An equation like
e = GT f (8.118) will correspond, explicitly, to
dx GT (x, t) f (t) . e(t) = (8.119) The reader may easily verify that the deﬁnition of transpose operator imposes that the kernel
of GT is related to the kernel of G by the simple expression
GT (x, t) = G(t, x) . (8.120) We see that the kernels of G and of GT are, in fact, identical, via a simple ‘transposition’
of the variables.
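In a discretized setting this becomes a statement about matrices, which is easy to check numerically. A minimal sketch (the grid sizes and the random kernel are arbitrary illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete analogue of equations 8.116-8.120: the integral operator becomes a
# matrix, the kernel G(t, x) a table of values, and transposing the operator
# is exactly transposing that table.
nt, nx = 5, 4
G = rng.standard_normal((nt, nx))   # kernel G(t, x): maps e(x) to f(t)

e = rng.standard_normal(nx)         # element of E
f_hat = rng.standard_normal(nt)     # element of the dual of F

# Defining property of the transpose: <f_hat, G e> = <G^T f_hat, e>.
lhs = f_hat @ (G @ e)
rhs = (G.T @ f_hat) @ e
assert np.isclose(lhs, rhs)

# The kernel of G^T is the kernel of G with variables transposed (eq. 8.120).
assert G.T[1, 3] == G[3, 1]
```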
8.3.4.6 The Adjoint Operator

Let G be a linear operator mapping a space E into a space F:

G : E → F .    (8.121)

If e ∈ E and f ∈ F, then we write

f = G e .    (8.122)

Assume that both E and F are each furnished with a scalar product (see section 8.3.4.4), that we denote, respectively, ( e₁ , e₂ )_E and ( f₁ , f₂ )_F.

A linear operator H mapping F into E is named the adjoint of G if for any f ∈ F and for any e ∈ E we have ( f , G e )_F = ( H f , e )_E, and, in this case, we use the notation H = G*. The whole definition then reads

G* : F → E ,    (8.123)
∀ e ∈ E ; ∀ f ∈ F :  ( f , G e )_F = ( G* f , e )_E .    (8.124)

¹³ If G(t, x) is a distribution (like the derivative of a Dirac's delta) then equation 8.116 may be a disguised expression for a differential operator.

Let Ê and F̂ be the respective duals of E and F, and denote ⟨ · , · ⟩_E and ⟨ · , · ⟩_F
the respective duality products. We have seen above that a scalar product is defined through a symmetric, positive operator mapping a space into its dual. Then, as E and F are assumed to have a scalar product defined, there are two 'covariance' operators C_E and C_F such that the respective scalar products are given by

( e₁ , e₂ )_E = ⟨ C_E⁻¹ e₂ , e₁ ⟩_E ,
( f₁ , f₂ )_F = ⟨ C_F⁻¹ f₂ , f₁ ⟩_F .    (8.125)

Then, equation 8.124 writes ⟨ C_F⁻¹ f , G e ⟩_F = ⟨ C_E⁻¹ G* f , e ⟩_E or, denoting f̂ = C_F⁻¹ f,

⟨ f̂ , G e ⟩_F = ⟨ C_E⁻¹ G* C_F f̂ , e ⟩_E .    (8.126)

The comparison with the equation defining the transposed operator gives the relation between adjoint and transpose, G^T = C_E⁻¹ G* C_F, which can be written, equivalently, as

G* = C_E G^T C_F⁻¹ .    (8.127)

The transposed operator is an elementary operator. Its definition only requires the existence of the duals of the considered spaces, which is automatic. If, for instance, a linear operator G has the kernel G(u, v), the transposed operator G^T will have the kernel G^T(v, u) = G(u, v). The adjoint operator is not an elementary operator. Its definition requires the existence of scalar products in the working spaces, which are necessarily defined through symmetric, positive definite operators. This means that (except in degenerate cases) the adjoint operator is a complex object, depending on three elementary objects: this is how equation 8.127 is to be interpreted.
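Relation 8.127 can be verified in a finite-dimensional sketch, where the covariance operators become symmetric positive definite matrices (all sizes and values here are invented for the check):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_spd(n):
    # A random symmetric positive definite matrix, playing the role of a
    # 'covariance' operator defining a scalar product.
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

nE, nF = 4, 3
G = rng.standard_normal((nF, nE))    # linear map E -> F
CE, CF = random_spd(nE), random_spd(nF)

# Adjoint built from the transpose via equation 8.127: G* = C_E G^T C_F^{-1}.
G_star = CE @ G.T @ np.linalg.inv(CF)

e = rng.standard_normal(nE)
f = rng.standard_normal(nF)

ip_F = f @ np.linalg.inv(CF) @ (G @ e)        # ( f , G e )_F
ip_E = (G_star @ f) @ np.linalg.inv(CE) @ e   # ( G* f , e )_E
assert np.isclose(ip_F, ip_E)                 # defining property, eq. 8.124
```

The check shows the point made in the text: the adjoint depends on three elementary objects (G^T, C_E, C_F), while the transpose alone needs no metric.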
8.3.4.7 The Green Operator

The pressure field p(x,t) propagating in an elastic medium with uncompressibility modulus κ(x) and volumetric mass ρ(x) satisfies the 'acoustic wave equation'

(1/κ(x)) ∂²p/∂t² (x,t) − div( (1/ρ(x)) grad p(x,t) ) = S(x,t) .    (8.128)

Here, x denotes a point inside the medium (the coordinate system being still unspecified), t is the Newtonian time, and S(x,t) is a source function. To simplify the notations, the variables x and t will be dropped when there is no risk of confusion. For instance, the equation above will be written

(1/κ) ∂²p/∂t² − div( (1/ρ) grad p ) = S .    (8.129)

Also, I shall denote p the function {p(x,t)} as a whole, and not its value at a given point of space and time. Similarly, S shall denote the source function S(x,t).

For fixed κ(x) and ρ(x), the wave equation above can be written, for short,

L p = S ,    (8.130)

where L is the second order differential operator defined through equation 8.129. In order to define a unique wavefield p, we have to prescribe some boundary and initial conditions. An example of those is, if we work inside the time interval (t₁,t₂) and inside a volume V bounded by the surface S,

p(x,t₁) = 0 ;  x ∈ V
ṗ(x,t₁) = 0 ;  x ∈ V
p(x,t) = 0 ;  x ∈ S ,  t ∈ (t₁,t₂) .    (8.131)

Here, a dot means time derivative. With prescribed initial and boundary conditions, then, there is a one to one correspondence between the source field S and the wavefield p. The inverse of the wave equation operator, L⁻¹, is called the Green operator, and is denoted G:

G = L⁻¹ .    (8.132)

We can then write

L p = S  ⟺  p = G S .    (8.133)

As L is a differential operator, its inverse G is an integral operator. The kernel of the Green operator is named the Green function, and is usually denoted G(x,t; x′,t′). The explicit expression for p = G S is then

p(x,t) = ∫_V dV(x′) ∫_{t₁}^{t₂} dt′ G(x,t; x′,t′) S(x′,t′) .    (8.134)

It is easy to demonstrate¹⁴ that the wave equation operator L is a symmetric operator,
so this is also true for the Green operator G. But we have seen that the transpose operators work in spaces which have dual boundary conditions (see section 8.14 above).

Using the method outlined in section 8.14, the boundary conditions dual to those in equations 8.131 are

p̂(x,t₂) = 0 ;  x ∈ V
∂p̂/∂t (x,t₂) = 0 ;  x ∈ V
p̂(x,t) = 0 ;  x ∈ S ,  t ∈ (t₁,t₂) ,    (8.135)

i.e., we have final conditions of rest instead of initial conditions of rest (and the same surface condition). We have to understand that while the equation L p = S is associated to the boundary conditions 8.131, equations like

L^T p̂ = Ŝ  ;  p̂ = G^T Ŝ    (8.136)

are associated to the dual boundary conditions 8.135 (the hats here mean that the transpose operator operates in the dual spaces; see section 8.3.4.3). This being understood, we can write L^T = L and G^T = G, and rewrite equations 8.136 as

L p̂ = Ŝ  ;  p̂ = G Ŝ .    (8.137)

¹⁴ This comes from the property that the derivative operator is antisymmetric (so that the second derivative is a symmetric operator) and from the properties grad^T = − div and div^T = − grad, mentioned in section 8.14.

The hats have to be maintained, to remember that the fields with a hat must satisfy boundary conditions dual to those satisfied by the fields without a hat. Using the transposed of the Green operator, we can write

p̂(x,t) = ∫_V dV(x′) ∫_{t₁}^{t₂} dt′ G^T(x,t; x′,t′) Ŝ(x′,t′) .    (8.138)

[Some text is missing here.]
8.3.4.8 Born Approximation for the Acoustic Wave Equation

Let us start from equation 8.129, using the same notations:

(1/κ) ∂²p/∂t² − div( (1/ρ) grad p ) = S .    (8.139)

I shall denote p the function {p(x,t)} as a whole, and not its value at a given point of space and time. Similarly, κ and ρ will denote the functions {κ(x)} and {ρ(x)}.

Given appropriate boundary and initial conditions, and given a source function, the acoustic wave equation defines an application {κ, ρ} → p = ψ(κ, ρ), i.e., an application that associates to each medium {κ, ρ} the (unique) pressure field p that satisfies the wave equation (with given boundary and initial conditions).

Let p₀ be the pressure field propagating in the medium defined by κ₀ and ρ₀, i.e., p₀ = ψ(κ₀, ρ₀), and let p be the pressure field propagating in the medium defined by κ and ρ, i.e., p = ψ(κ, ρ). Clearly, if κ and ρ are close (in a sense to be defined) to κ₀ and ρ₀, then the wavefield p will be close to p₀.

Let us obtain an explicit expression for the first order approximation to p. This is known as the (first) Born approximation of the wavefield. Both κ and ρ could be perturbed, but I simplify the discussion here by considering only perturbations in the uncompressibility κ. The reader may easily obtain the general case.

The pressure inside an elastic fluid medium is (note: check if this sign is consistent with the sign given to the stress tensor elsewhere in the book)

p = −(1/3) σᵏₖ .    (8.140)

So defined, the pressure may take positive or negative values, corresponding to an elastic medium that is compressed or stretched. In the terminology of section 2, this is a Cartesian quantity.

Note: check what follows. Perhaps it is better to assume that the pressure P is a positive quantity, and to define

p = P₀ log( P/P₀ ) ,    (8.141)

where P₀ is the 'ambient pressure'. For small pressure perturbations, we have

p = P₀ log( 1 + (P − P₀)/P₀ ) ≈ P − P₀ .    (8.142)

The uncompressibility and the volumetric mass are positive, Jeffreys quantities.
In most texts, the difference p − p₀ is calculated as a function of the difference κ − κ₀, but we have seen that this is not the right way, as the resulting approximation will depend on the fact that we are using uncompressibility κ(x) instead of compressibility γ(x) = 1/κ(x).

At this point we may introduce the logarithmic parameters, and proceed trivially (note: explain why this is important). The logarithmic uncompressibilities for the reference medium and for the perturbed medium are

κ₀* = log( κ₀/K )  ;  κ* = log( κ/K ) ,    (8.143)

where K and R are arbitrary constants (having the right physical dimension). Reciprocally,

κ₀ = K exp κ₀*  ;  κ = K exp κ* .    (8.144)

In particular, we have

κ = κ₀ exp(δκ*) ,    (8.145)

where

δκ* = κ* − κ₀* = log( κ/κ₀ ) .    (8.146)

Note that we have here a perturbation δκ* of a logarithmic (Cartesian) quantity, not of the positive (Jeffreys) one. We also write

p = p₀ + δp .    (8.147)

The reference solution satisfies

(1/κ₀) ∂²p₀/∂t² − div( (1/ρ₀) grad p₀ ) = S ,    (8.148)

while the perturbed solution satisfies

(1/κ) ∂²p/∂t² − div( (1/ρ₀) grad p ) = S .    (8.149)

In this equation, κ can be replaced by the expression 8.145, and p by the expression 8.147. Using then the first order approximation exp(−δκ*) = 1 − δκ* leads to

( 1/κ₀ − δκ*/κ₀ ) ( ∂²p₀/∂t² + ∂²δp/∂t² ) − div( (1/ρ₀) ( grad p₀ + grad δp ) ) = S .    (8.150)

Some of the terms in this equation correspond to the terms in the reference equation 8.148, and can be simplified. Keeping only first order terms then leads to

(1/κ₀) ∂²δp/∂t² − div( (1/ρ₀) grad δp ) = (δκ*/κ₀) ∂²p₀/∂t² .    (8.151)

Explicitly, replacing δp = p − p₀ and δκ* = log(κ/κ₀), gives
(1/κ₀) ∂²(p − p₀)/∂t² − div( (1/ρ₀) grad (p − p₀) ) = (1/κ₀) log( κ/κ₀ ) ∂²p₀/∂t² .    (8.152)

This is the equation we were looking for. It says that the field p − p₀ satisfies the wave equation with the unperturbed value of the uncompressibility κ₀, and is generated by the 'Born secondary source'

S_Born = (1/κ₀) log( κ/κ₀ ) ∂²p₀/∂t² .    (8.153)

Should we have made the development using the compressibility γ = 1/κ instead of the uncompressibility, we would have arrived at the secondary source

S_Born = γ₀ log( γ₀/γ ) ∂²p₀/∂t² ,    (8.154)

which is identical to the previous one.

The expression here obtained for the secondary source is not the usual one, as it depends on the distance log(κ/κ₀) and not on the difference κ − κ₀. An additive perturbation κ = κ₀ + δκ of the positive parameter κ would have led to the Born secondary source

S_κ = (δκ/κ₀²) ∂²p₀/∂t² = ( (κ − κ₀)/κ₀² ) ∂²p₀/∂t² ,    (8.155)

while an additive perturbation γ = γ₀ + δγ of the positive parameter γ = 1/κ would have led to the Born secondary source

S_γ = −δγ ∂²p₀/∂t² = ( γ₀ − γ ) ∂²p₀/∂t² ,    (8.156)

and these two sources are not identical. I mean here that their finite expressions are not identical. Of course, in the limit of an infinitesimal perturbation they tend to be identical.

The approach followed here has two advantages. First, mathematical consistency, in the sense that the secondary source is defined independently of the quantities used to make the computation (covariance of the results). Second, in a numerical computation, the perturbations may be small, but they are finite. 'Large contrasts' in the parameters may give, when inserting the differences in expressions 8.155 or 8.156, quite bad approximations, while the logarithmic expressions in the right Born source (equation 8.153 or 8.154) may remain good.
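This can be checked with a one-line computation. A small numeric sketch (the 2:1 contrast is an arbitrary illustrative value, and the κ₀² denominator of equation 8.155 is the reconstruction used above):

```python
import numpy as np

# Amplitude factors multiplying the second time derivative of p0 in each
# secondary-source expression, for a finite 2:1 contrast in uncompressibility.
kappa0, kappa = 1.0, 2.0
gamma0, gamma = 1.0 / kappa0, 1.0 / kappa

s_log   = np.log(kappa / kappa0) / kappa0   # equation 8.153
s_kappa = (kappa - kappa0) / kappa0**2      # equation 8.155 (difference in kappa)
s_gamma = gamma0 - gamma                    # equation 8.156 (difference in gamma)

# The logarithmic source is the same whether written with kappa or gamma ...
assert np.isclose(s_log, gamma0 * np.log(gamma0 / gamma))   # equation 8.154
# ... while the two difference-based sources disagree at finite contrast.
assert not np.isclose(s_kappa, s_gamma)
```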
8.3.4.9 Tangent Application of Data With Respect to Parameters

In the context of an inverse problem, assume that we observe the pressure field p(x,t) at some points x_i inside the volume. The solution of the forward problem is obtained by solving the wave equation, or by using the Green's function. We are here interested in the tangent linear application. Let us write the first order perturbation δp(x_i,t) of the pressure wavefield produced when the logarithmic uncompressibility is perturbed by the amount δκ*(x) as (linear tangent application)

δp = F δκ* ,    (8.157)

or, introducing the kernel of the Fréchet derivative F,

δp(x_i,t) = ∫_V dV(x′) F(x_i,t; x′) δκ*(x′) .    (8.158)

Let us express the kernel F(x_i,t; x′). We have seen that a perturbation δκ* is equivalent, up to first order, to having the secondary Born source (equation 8.151)

S_Born(x,t) = ( δκ*(x)/κ₀(x) ) p̈₀(x,t) .    (8.159)

Then, using the Green function,

δp(x_i,t) = ∫_V dV(x′) ∫_{t₁}^{t₂} dt′ G(x_i,t; x′,t′) S_Born(x′,t′)
          = ∫_V dV(x′) ∫_{t₁}^{t₂} dt′ G(x_i,t; x′,t′) ( δκ*(x′)/κ₀(x′) ) p̈₀(x′,t′) .    (8.160)

The last expression can be rearranged into the form used in equation 8.158, showing that F(x_i,t; x′) is given by

F(x_i,t; x′) = ( 1/κ₀(x′) ) ∫_{t₁}^{t₂} dt′ G(x_i,t; x′,t′) p̈₀(x′,t′) .    (8.161)

This is the kernel of the Fréchet derivative of the data with respect to the parameter κ*(x).

8.3.4.10 The Transpose of the Fréchet Derivative Just Computed

Now that we are able to understand the expression δp = F δκ*, let us face the dual problem.
What is the meaning of an expression like

δκ̂* = F^T δp̂ ?    (8.162)

Denoting by F^T(x′; x_i,t) the kernel of F^T, such an expression writes

δκ̂*(x′) = Σ_i ∫_{t₁}^{t₂} dt F^T(x′; x_i,t) δp̂(x_i,t) ,    (8.163)

but we know that the kernel of the transpose operator equals the kernel of the original operator, with variables transposed (note: say where this has been demonstrated), so that we can write this equation as

δκ̂*(x′) = Σ_i ∫_{t₁}^{t₂} dt F(x_i,t; x′) δp̂(x_i,t) ,    (8.164)

where F(x_i,t; x′) is the kernel given in equation 8.161. Replacing the kernel by its expression gives

δκ̂*(x′) = ( 1/κ₀(x′) ) Σ_i ∫_{t₁}^{t₂} dt ∫_{t₁}^{t₂} dt′ G(x_i,t; x′,t′) p̈₀(x′,t′) δp̂(x_i,t) ,    (8.165)

and this can be rearranged into (note that primed and nonprimed variables have been exchanged)

δκ̂*(x) = ( 1/κ₀(x) ) ∫_{t₁}^{t₂} dt ψ̂(x,t) p̈₀(x,t) ,    (8.166)

where

ψ̂(x,t) = Σ_i ∫_{t₁}^{t₂} dt′ G(x_i,t′; x,t) δp̂(x_i,t′) ,    (8.167)

or, using the kernel of the transposed Green's operator,

ψ̂(x,t) = Σ_i ∫_{t₁}^{t₂} dt′ G^T(x,t; x_i,t′) δp̂(x_i,t′) .    (8.168)

(Note: explain here that this means that the field ψ̂(x,t) can be interpreted as the solution of the transposed wave equation, with a point source at each point x_i where we have a receiver, radiating the value δp̂(x_i,t′). As we have the transposed of the Green's operator, the field ψ̂(x,t) must satisfy dual boundary conditions, i.e., in our case, final conditions of rest.)

8.3.4.11 The Continuous Inverse Problem

Let p = f(κ*) be the function calculating the theoretical data associated to the model κ*
(resolution of the forward problem). We seek the model minimizing the sum

S(κ*) = ½ ( ‖ f(κ*) − p_obs ‖² + ‖ κ* − κ*_prior ‖² )
      = ½ ( ⟨ C_p⁻¹ ( f(κ*) − p_obs ) , f(κ*) − p_obs ⟩ + ⟨ C_κ*⁻¹ ( κ* − κ*_prior ) , κ* − κ*_prior ⟩ ) .    (8.169)

Using, in this functional context, the steepest descent algorithm proposed in section 8.3.7.4, we arrive at

κ*_{n+1} = κ*_n − ( C_κ* F_n^T C_p⁻¹ ( p_n − p_obs ) + ( κ*_n − κ*_prior ) ) ,    (8.170)

where p_n = f(κ*_n) and where F_n^T is the transposed operator defined above, at point κ*_n.

Covariances aside, we see that the fundamental object appearing in this inversion algorithm is the transposed operator F^T. As it has been interpreted above, we have all the elements to understand how this sort of inverse problem is solved. For more details, see Tarantola (1984, 1986, 1987).

8.3.5 Appendix: Nonlinear Inversion of Waveforms (by Charara &
Barnes)

[Note: I plan to convince Marwan and Christophe to contribute to our book by writing this section (on a work that, unfortunately, has never been published).]

[Figures (plots omitted): Figure 8.12: Geometry. Figure 8.13: Observed seismograms. X component. Figure 8.14: Observed seismograms. Z component. Figure 8.15: Model. VP. Figure 8.16: Model. VS. Figure 8.17: Model. RHO. Figure 8.18: Model. Q. Figure 8.19: Calculated seismograms. X component. Figure 8.20: Calculated seismograms. Z component. Figure 8.21: Residuals seismograms. X component. Figure 8.22: Residuals seismograms. Z component.]

8.3.6 Appendix: Using Monte Carlo Methods

[Note: Write a small introduction here.]
8.3.6.1 Basic Equations

The starting point could be the general equation 7.9,

σ(m,d) = k ρ(m,d) ϑ(m,d) / µ(m,d) ,    (8.171)

combining the 'a priori' information ρ(m,d) with the 'theoretical' information ϑ(m,d). We have seen in section 3 that if we are able to design a random walk that samples ρ(m,d), then the Metropolis rule can be used to obtain a random walk that samples σ(m,d). We have also seen that if we are not able to design a (primeval) random walk that samples ρ(m,d), then we can start using a random walk that samples the homogeneous probability density µ(m,d), or even an arbitrary¹⁵ probability density ψ(m,d).

This point of view is very general, but more practical algorithms are obtained when we particularize.

Let us consider, for instance, the explicit expression (equation ??) for σ_m(m) given in section 8.2.6:

σ_m(m) = k ρ_m(m) φ(m) / µ_m(m) ,    (8.172)

where

φ(m) = [ ρ_d(d)/µ_d(d) ]_{d=f(m)} det( g_m(m) + F^T(m) g_d(d) F(m) ) .    (8.173)

In this expression the matrix of partial derivatives F = F(m), with components F^i_α = ∂d^i/∂m^α, appears. The 'slope' F enters here because the steeper the slope for a given m, the greater the accumulation of points we will have with this particular m. This is because we use explicitly the analytic expression d = f(m). One should realize that, using the more general approach based on equation 8.171, the effect is automatically accounted for, and there is no need to explicitly consider the partial derivatives.

In any case, equation 8.172 has the standard form of a conjunction of two probability densities, and is, therefore, ready to be integrated in a Metropolis algorithm. But one should note that, contrary to many 'nonlinear' formulations of inverse problems, the partial derivatives F are needed, even if we use a Monte Carlo method.

In some weakly nonlinear problems, we have F^T(m) g_d(d) F(m) ≪ g_m(m) and, then,

φ(m) = µ_m(m) [ ρ_d(d)/µ_d(d) ]_{d=f(m)} ,    (8.174)

and equation 8.172 becomes

σ_m(m) = k ρ_m(m) L(m) ,    (8.175)

where

L(m) = [ ρ_d(d)/µ_d(d) ]_{d=f(m)} .    (8.176)

This expression is also ready for use with the Metropolis algorithm. In this way the sampling of the prior ρ_m(m) is modified into a sampling of the posterior σ_m(m), and the Metropolis rule uses the 'likelihood function' L(m) (in fact, a volumetric probability) to calculate acceptance probabilities.

¹⁵ Although, hopefully, not too different from µ(m,d).
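The mechanism of equations 8.175-8.176 can be sketched on a toy problem. The following is only an illustrative one-dimensional example (the prior, the linear forward relation f(m) = 2m, and the data values are all invented for the demonstration): proposals are drawn from the prior, and the Metropolis rule, with acceptance probability min(1, L(m_new)/L(m)), turns them into samples of the posterior.

```python
import numpy as np

rng = np.random.default_rng(2)

d_obs, sigma_d = 1.0, 0.5

def L(m):
    # Likelihood function of equation 8.176 for a hypothetical linear
    # forward relation f(m) = 2 m with Gaussian data uncertainties.
    return np.exp(-0.5 * ((2.0 * m - d_obs) / sigma_d) ** 2)

m = 0.0
samples = []
for _ in range(50_000):
    m_new = rng.standard_normal()        # a sample of the prior rho_m = N(0, 1)
    if rng.random() < min(1.0, L(m_new) / L(m)):
        m = m_new                        # Metropolis rule: accept, else stay
    samples.append(m)

# For this linear Gaussian toy the posterior is N(8/17, 1/17), so the
# sample mean should be close to 8/17 ~ 0.47.
posterior_mean = np.mean(samples)
```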
8.3.6.2 Sampling the Homogeneous Probability Distribution

If we do not have an algorithm that samples the prior probability density directly, the first step in a Monte Carlo analysis of an inverse problem is to design a random walk that samples the model space according to the homogeneous probability distribution µ(m). In some cases this is easy, but in other cases only an algorithm (a primeval random walk) that samples a probability density ψ(m) ≠ µ(m) is available. Then the Metropolis rule can be used to modify ψ(m) into µ(m). This way of generating samples from µ(m) is efficient if ψ(m) is close to µ(m); otherwise it may be very inefficient. Methods for designing primeval random walks are found in section 3.4.

Once µ(m) can be sampled, the Metropolis rule allows us to modify this sampling into an algorithm that samples the prior.
8.3.6.3 Sampling the Prior Probability Distribution

The first step in the Monte Carlo analysis is to switch off the comparison between computed and observed data, thereby generating samples of the a priori probability density. This allows us to verify statistically that the algorithm is working correctly, and it allows us to understand the prior information we are using. We will refer to a large collection of models representing the prior probability distribution as the "prior movie". The more models present in this movie, the more accurate the representation of the prior probability density.

If we are interested in smooth Earth models (knowing, e.g., that only smooth properties are resolved by the data), a smooth movie can be produced simply by smoothing the individual models of the original movie.
8.3.6.4 Sampling the Posterior Probability Distribution

If we now switch on the comparison between computed and observed data using, e.g., the Metropolis rule, the random walk sampling the prior distribution is modified into a walk sampling the posterior distribution. Again, smoothed versions of this "posterior movie" can be generated by smoothing the individual models in the original, posterior movie.

Since data rarely put strong constraints on the Earth, the "posterior movie" typically shows that many different models are possible. But even though the models in the posterior movie may be quite different, all of them predict data that fit the observations within experimental uncertainties; they are all models with high likelihood. In other words, we must accept that the data alone cannot single out a preferred model.
The posterior movie allows us to perform a proper resolution analysis that helps us to choose between different interpretations of a given data set. Using the movie we can answer complicated questions about the correlations between several model parameters. To answer such questions, we can view the posterior movie and try to discover structure that is well resolved by data. Such structure will appear as "persistent" in the posterior movie. Another, more traditional, way of investigating resolution is to calculate covariances and higher order moments. For this we need to evaluate integrals of the form

R_f = ∫_A dm f(m) σ(m) ,    (8.177)

where f(m) is a given function of the model parameters and A is an event in the model space M containing the models we are interested in. For instance,

A = { m | a given range of parameters in m is cyclic } .    (8.178)

In the special case when A = M is the entire model space and f(m) = m^i, the R_f in equation 8.177 equals the mean ⟨m^i⟩ of the i'th model parameter m^i. If f(m) = (m^i − ⟨m^i⟩)(m^j − ⟨m^j⟩), R_f becomes the covariance between the i'th and j'th model parameters.
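Assuming we already hold the samples of a "posterior movie", such means and covariances are estimated by simple averages over the collection. A sketch with synthetic Gaussian samples (the true values are invented, chosen so the estimates can be checked):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical posterior movie: N samples of a 2-parameter model, drawn here
# from a known Gaussian so the sample estimates can be verified.
true_mean = np.array([1.0, -2.0])
true_cov = np.array([[1.0, 0.6],
                     [0.6, 2.0]])
movie = rng.multivariate_normal(true_mean, true_cov, size=200_000)

# f(m) = m^i over A = M: R_f is the mean of the i'th parameter.
mean_est = movie.mean(axis=0)

# f(m) = (m^i - <m^i>)(m^j - <m^j>): R_f is the covariance of parameters i, j.
dev = movie - mean_est
cov_est = dev.T @ dev / len(movie)

assert np.allclose(mean_est, true_mean, atol=0.02)
assert np.allclose(cov_est, true_cov, atol=0.03)
```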
Typically, in the general inverse problem, we cannot evaluate the integral in 8.177 analytically, because we have no analytical expression for σ(m). However, from the samples of the posterior movie m₁, …, m_N we can approximate R_f by the simple average

R_f ≈ (1/N) Σ_{n | mₙ ∈ A} f(mₙ) .    (8.179)

8.3.7 Appendix: Using Optimization Methods

As we have seen, the solution of an inverse problem essentially consists of a probability distribution over the space of all possible models of the physical system under study. In general, this 'model space' is highly dimensional, and the only general way to explore it is by using the Monte Carlo methods developed in section 3.

If the probability distributions are 'bell-shaped' (i.e., if they look like a Gaussian or like a generalized Gaussian), then one may simplify the problem by calculating only the point around which the probability is maximum, with an approximate estimation of the variances and covariances. This is the problem addressed in this section. [Note: I rephrased this sentence.]

Among the many methods available to obtain the point at which a scalar function reaches its maximum value (relaxation methods, linear programming techniques, etc.), we limit our scope here to the methods using the gradient of the function, which we assume can be computed analytically or, at least, numerically. For more general methods, the reader may have a look at Fletcher (1980, 1981), Powell (1981), Scales (1985), Tarantola (1987) or Scales et al. (1992).

8.3.7.1 Maximum Likelihood Point

Let us consider a space X, with a notion of volume element dV defined. If some coordinates
x ≡ {x¹, x², …, xⁿ} are chosen over the space, the volume element has an expression dV(x) = v(x) dx, and each probability distribution over X can be represented by a probability density f(x). For any fixed small volume ΔV, we can search for the point x_ML such that the probability dP of the small volume, when centered around x_ML, gets a maximum. In the limit ΔV → 0 this defines the maximum likelihood point. The maximum likelihood point may be unique (if the probability distribution is monomodal), may be degenerate (if the probability distribution is 'roof-shaped') or may be multiple (as when we have the sum of a few bell-shaped functions).

The maximum likelihood point is not the point at which the probability density is maximum. [Note: Rephrase the following sentence...] For our definition imposes that what must be maximum is the ratio of the probability density by the function v(x) defining the volume element:

x = x_ML  ⟺  F(x) = f(x)/v(x)  maximum .    (8.180)

We recognize in the ratio F(x) = f(x)/v(x) the volumetric probability associated to the probability density f(x) (see equation ??). As the homogeneous probability density is µ(x) = k v(x) (see rule 4.2), we can equivalently define the maximum likelihood point by the condition

x = x_ML  ⟺  f(x)/µ(x)  maximum .    (8.181)

The point at which a probability density has its maximum is not x_ML. In fact, the maximum of a probability density does not correspond to an intrinsic definition of a point: a change of coordinates x → y = ψ(x) would change the probability density f(x) into the probability density g(y) (obtained using the Jacobian rule), but the point of the space at which f(x) is maximum is not the same as the point of the space where g(y) is maximum (unless the change of variables is linear). This contrasts with the maximum likelihood point, as defined by equation 8.181, which is an intrinsically defined point: no matter which coordinates we use in the computation, we always obtain the same point of the space.
8.3.7.2 Misfit

One of the goals here is to develop gradient-based methods for obtaining the maximum of F(x) = f(x)/µ(x). As a quite general rule, gradient-based methods perform quite poorly for (bell-shaped) probability distributions: when one is far from the maximum, the probability densities tend to be quite flat, and it is difficult to get, reliably, the direction of steepest ascent. Taking a logarithm transforms a bell-shaped distribution into a paraboloid-shaped distribution on which gradient methods work well.

The logarithmic volumetric probability, or misfit, is defined as S(x) = − log( F(x)/F₀ ), where F₀ is a constant, and is given (up to an additive constant) by

S(x) = − log( f(x)/µ(x) ) .    (8.182)

The problem of maximization of the (typically) bell-shaped function f(x)/µ(x) has been transformed into the problem of minimization of the (typically) paraboloid-shaped function S(x):

x = x_ML  ⟺  S(x)  minimum .    (8.183)

Example 8.16 The conjunction σ(x) of two probability densities ρ(x) and ϑ(x) was
defined (equation ??) as

σ(x) = p ρ(x) ϑ(x) / µ(x) .    (8.184)

Then,

S(x) = S_ρ(x) + S_ϑ(x) ,    (8.185)

where

S_ρ(x) = − log( ρ(x)/µ(x) )  ;  S_ϑ(x) = − log( ϑ(x)/µ(x) ) .    (8.186)

[End of example.]
Example 8.17 In the context of Gaussian distributions, we have found the probability density (see example ??)

σ_m(m) = k exp( −½ [ (m − m_prior)ᵗ C_M⁻¹ (m − m_prior) + (f(m) − d_obs)ᵗ C_D⁻¹ (f(m) − d_obs) ] ) .    (8.187)

The limit of this distribution for infinite variances is a constant, so in this case µ_m(m) = k. The misfit function S(m) = − log( σ_m(m)/µ_m(m) ) is then given by

2 S(m) = (m − m_prior)ᵗ C_M⁻¹ (m − m_prior) + (f(m) − d_obs)ᵗ C_D⁻¹ (f(m) − d_obs) .    (8.188)

The reader should remember that this misfit function is valid only for weakly nonlinear problems (see examples 8.5 and ??). The maximum likelihood model here is the one that minimizes the sum of squares 8.188. This corresponds to the least squares criterion. [End of example.]

Example 8.18 In the context of Laplacian distributions, we have found the probability density
(see example ??)

σ_m(m) = k exp( − [ Σ_α |m^α − m^α_prior| / σ^α + Σ_i |f^i(m) − d^i_obs| / σ^i ] ) .    (8.189)

The limit of this distribution for infinite mean deviations is a constant, so here µ_m(m) = k. The misfit function S(m) = − log( σ_m(m)/µ_m(m) ) is then given by

S(m) = Σ_α |m^α − m^α_prior| / σ^α + Σ_i |f^i(m) − d^i_obs| / σ^i .    (8.190)

The reader should remember that this misfit function is valid only for weakly nonlinear problems. The maximum likelihood model here is the one that minimizes the sum of absolute values 8.190. This corresponds to the least absolute values criterion. [End of example.]
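The two criteria can be made concrete on a toy problem. A minimal sketch, assuming a linear forward relation f(m) = F m with diagonal covariances (all numbers are hypothetical):

```python
import numpy as np

# Hypothetical linear forward problem f(m) = F m.
F = np.array([[1.0, 0.0],
              [1.0, 1.0]])
m_prior = np.zeros(2)
d_obs = np.array([1.0, 2.0])
sigma_m = np.array([1.0, 1.0])    # prior uncertainties
sigma_d = np.array([0.5, 0.5])    # data uncertainties

def misfit_l2(m):
    # Twice the misfit of equation 8.188, with diagonal covariance operators.
    return (np.sum(((m - m_prior) / sigma_m) ** 2)
            + np.sum(((F @ m - d_obs) / sigma_d) ** 2))

def misfit_l1(m):
    # The misfit of equation 8.190 (least absolute values criterion).
    return (np.sum(np.abs(m - m_prior) / sigma_m)
            + np.sum(np.abs(F @ m - d_obs) / sigma_d))

# For the quadratic misfit the minimizer solves the normal equations.
CM_inv = np.diag(1.0 / sigma_m**2)
CD_inv = np.diag(1.0 / sigma_d**2)
m_l2 = np.linalg.solve(CM_inv + F.T @ CD_inv @ F,
                       CM_inv @ m_prior + F.T @ CD_inv @ d_obs)

# m_l2 is the (global, since the misfit is convex) minimum of misfit_l2.
steps = [np.array([0.1, 0.0]), np.array([-0.1, 0.0]),
         np.array([0.0, 0.1]), np.array([0.0, -0.1])]
assert all(misfit_l2(m_l2) <= misfit_l2(m_l2 + s) for s in steps)
```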
8.3.7.3 Gradient and Direction of Steepest Ascent

One must not consider as synonymous the notions of 'gradient' and 'direction of steepest ascent'. Consider, for instance, an adimensional misfit function¹⁶ S(P,T) over a pressure P and a temperature T. Any sensible definition of the gradient of S will lead to an expression like

grad S = ( ∂S/∂P , ∂S/∂T )ᵗ ,    (8.191)

and this by no means can be regarded as a 'direction' in the (P,T) space (for instance, the components of this 'vector' do not have the dimensions of pressure and temperature, but of inverse pressure and inverse temperature).
Mathematically speaking, the gradient of a function S(x) at a point x₀ is the linear application that is tangent to S(x) at x₀. [Note: Rephrase the following sentence...] This definition of gradient is consistent with the more elementary one, based on the use of the first order development

S(x₀ + δx) = S(x₀) + γ₀ᵗ δx + …    (8.192)

Here, it is γ₀ that is called the gradient of S(x) at point x₀. It is clear that S(x₀) + γ₀ᵗ δx is a linear application, and that it is tangent to S(x) at x₀, so the two definitions are, in fact, equivalent. Explicitly, the components of the gradient at point x₀ are

(γ₀)_p = ∂S/∂x^p (x₀) .    (8.193)

Everybody is well trained at computing the gradient of a function (even if the interpretation of the result as a direction in the original space is wrong). How can we pass from the gradient to the direction of steepest ascent (a bona fide direction in the original space)? In fact, the gradient (at a given point) of a function defined over a given space E is an element of the dual of the space. To obtain a direction in E, we must pass from the dual to the primal space. As usual, it is the metric of the space that maps the dual of the space into the space itself. So if g is the metric of the space where S(x) is defined, and if γ is the gradient of S at a given point, the direction of steepest ascent is

γ̃ = g⁻¹ γ .    (8.194)

¹⁶ We take this example because typical misfit functions are adimensional, but the argument has general validity.

The direction of steepest ascent must be interpreted as follows: if we are at a point x₀ of the space, we can consider a very small hypersphere around x₀. The direction of steepest ascent points towards the point of the sphere at which S(x) gets its maximum value.
Example 8.19 Figure 8.23 represents the level lines of a scalar function S(u,v) in a 2D space. A particular point has been selected. What is the gradient of the function at the given point? As suggested in the main text, it is not an arrow 'perpendicular' to the level lines of the function at the considered point, as the notion of perpendicularity will depend on a metric not yet specified (and unnecessary to define the gradient). The gradient must be seen as 'the linear function that is tangent to S(u,v) at the considered point'. If S(u,v) has been represented by its level lines, then the gradient may also be represented by its level lines (right of the figure). We see that the condition, in fact, is that the level lines of the gradient are tangent to the level lines of the original function (at the considered point). Contrary to the notion of perpendicularity, the notion of tangency is metric-independent. [End of example.]

Figure 8.23: The gradient of a function has not to be seen as a vector orthogonal to the level lines, but as a form parallel to them (see text). Left panel: a function, a point, and the tangent level line. Right panel: the gradient of the function at the considered point.

Example 8.20 In the context of least squares, we consider a misfit function S(\mathbf{m}) and a
covariance matrix \mathbf{C}_M . If \boldsymbol{\gamma}_0 is the gradient of S at a point \mathbf{m}_0 , and if we use \mathbf{C}_M to define distances in the space, the direction of steepest ascent is

\tilde{\boldsymbol{\gamma}}_0 = \mathbf{C}_M \, \boldsymbol{\gamma}_0 .   (8.195)

[End of example.]
Example 8.21 If the misfit function S(P,T) depends on a pressure P and on a temperature T , the gradient of S is, as mentioned above (equation 8.191),

\boldsymbol{\gamma} = \begin{pmatrix} \partial S / \partial P \\ \partial S / \partial T \end{pmatrix} .   (8.196)

As the quantities P and T are Jeffreys quantities, associated to the metric ds^2 = ( dP/P )^2 + ( dT/T )^2 , the direction of steepest ascent is^17

\tilde{\boldsymbol{\gamma}} = \begin{pmatrix} P^2 \, \partial S / \partial P \\ T^2 \, \partial S / \partial T \end{pmatrix} .   (8.197)

[End of example.]
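The pressure-temperature example can be made concrete with a small numerical sketch. The misfit function S , the reference values P_0 , T_0 , and the evaluation point below are all hypothetical, chosen only for illustration:

```python
import math

# Hypothetical adimensional misfit over a pressure P and a
# temperature T (function and numbers assumed, for illustration only).
P0, T0 = 1.0e5, 300.0

def S(P, T):
    return math.log(P / P0) ** 2 + math.log(T / T0) ** 2

def gradient(P, T, h=1.0e-6):
    # components of equation 8.191: they live in the dual space and
    # have dimensions 1/pressure and 1/temperature
    dP = (S(P * (1 + h), T) - S(P * (1 - h), T)) / (2.0 * h * P)
    dT = (S(P, T * (1 + h)) - S(P, T * (1 - h))) / (2.0 * h * T)
    return dP, dT

def steepest_ascent(P, T):
    # Jeffreys metric g = diag(1/P^2, 1/T^2); the direction is g^{-1}
    # applied to the gradient (equations 8.194 and 8.197), and it has
    # the dimensions of P and T, as a direction must
    gP, gT = gradient(P, T)
    return P * P * gP, T * T * gT

print(gradient(2.0e5, 400.0))
print(steepest_ascent(2.0e5, 400.0))
```

The two printed pairs make the dimensional argument visible: the gradient components are tiny numbers in inverse pressure and inverse temperature, while the steepest-ascent components are commensurate with P and T themselves.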
8.3.7.4 The Steepest Descent Method

Consider that we have a probability distribution defined over an n-dimensional space X . Having chosen some coordinates \mathbf{x} \equiv \{ x^1, x^2, \dots, x^n \} over the space, the probability distribution is represented by the probability density f(\mathbf{x}) whose homogeneous limit (in the sense developed in section 4) is \mu(\mathbf{x}) . We wish to calculate the coordinates \mathbf{x}_{ML} of the maximum likelihood point. By definition (equation 8.181),

\mathbf{x} = \mathbf{x}_{ML} \iff f(\mathbf{x}) / \mu(\mathbf{x}) \ \text{maximum} ,   (8.198)

i.e.,

\mathbf{x} = \mathbf{x}_{ML} \iff S(\mathbf{x}) \ \text{minimum} ,   (8.199)

where S(\mathbf{x}) is the misfit (equation 8.182)

S(\mathbf{x}) = - k \, \log \frac{ f(\mathbf{x}) }{ \mu(\mathbf{x}) } .   (8.200)

Let us denote by \boldsymbol{\gamma}(\mathbf{x}_k) the gradient of S(\mathbf{x}) at point \mathbf{x}_k , i.e. (equation 8.193),

(\gamma_k)_p = \frac{\partial S}{\partial x^p}(\mathbf{x}_k) .   (8.201)

We have seen above that \boldsymbol{\gamma}(\mathbf{x}) is not to be interpreted as a direction in the space X , but as a direction in the dual space. The gradient can be converted into a direction using some metric \mathbf{g}(\mathbf{x}) over X . In simple situations the metric \mathbf{g} will be that used to define the volume element of the space, i.e., we will have \mu(\mathbf{x}) = k \, v(\mathbf{x}) = k \sqrt{ \det \mathbf{g}(\mathbf{x}) } , but this is not a necessity, and iterative algorithms may be accelerated by the astute introduction of ad hoc metrics.

Given, then, the gradient \boldsymbol{\gamma}(\mathbf{x}_k) (at some particular point \mathbf{x}_k ), for any possible choice of metric \mathbf{g}(\mathbf{x}) we can define the direction of steepest ascent associated to the metric \mathbf{g} by (equation 8.195)
\tilde{\boldsymbol{\gamma}}(\mathbf{x}_k) = \mathbf{g}^{-1}(\mathbf{x}_k) \, \boldsymbol{\gamma}(\mathbf{x}_k) .   (8.202)

The algorithm of steepest descent is an iterative algorithm passing from point \mathbf{x}_k to point \mathbf{x}_{k+1} by making a 'small jump' along the local direction of steepest descent,

\mathbf{x}_{k+1} = \mathbf{x}_k - \varepsilon_k \, \mathbf{g}_k^{-1} \boldsymbol{\gamma}_k ,   (8.203)

where \varepsilon_k is an ad hoc (real, positive) value adjusted to force the algorithm to converge rapidly (if \varepsilon_k is chosen too small the convergence may be too slow; if it is chosen too large, the algorithm may even diverge).

Many elementary presentations of the steepest descent algorithm just forget to include the metric \mathbf{g}_k in expression 8.203. These algorithms are not consistent. Even the physical dimensionality of the equation is not assured. The authors of this article have traced some 'numerical' problems in existing computer implementations of steepest descent algorithms to this neglect of the metric.

^17 We have here \begin{pmatrix} g_{PP} & g_{PT} \\ g_{TP} & g_{TT} \end{pmatrix} = \begin{pmatrix} 1/P^2 & 0 \\ 0 & 1/T^2 \end{pmatrix} .
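As an illustration, here is a minimal sketch of algorithm 8.203 for a hypothetical misfit over the Jeffreys quantities (P,T) of example 8.21. The misfit S and all numerical values are assumed, not taken from the text:

```python
import math

# Sketch of the iterative algorithm 8.203 for a hypothetical misfit
# S(P,T) = log^2(P/P0) + log^2(T/T0) over Jeffreys quantities.
P0, T0 = 1.0e5, 300.0

def grad(P, T):
    # analytic gradient (dual-space components)
    return 2.0 * math.log(P / P0) / P, 2.0 * math.log(T / T0) / T

P, T, eps = 5.0e5, 500.0, 0.25
for _ in range(100):
    gP, gT = grad(P, T)
    # metric g = diag(1/P^2, 1/T^2), so g^{-1} gamma = (P^2 gP, T^2 gT);
    # the jump below has the dimensions of P and of T, as it must
    P -= eps * P * P * gP
    T -= eps * T * T * gT

print(P, T)  # approaches the minimum (P0, T0)
```

Dropping the metric here (i.e., jumping along the raw gradient) would subtract a quantity with dimensions of inverse pressure from a pressure, which is exactly the dimensional inconsistency criticized above.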
Example 8.22 In the context of example 8.17, where the misfit function S(\mathbf{m}) is given by

2 S(\mathbf{m}) = (\mathbf{f}(\mathbf{m}) - \mathbf{d}_{\rm obs})^t \, \mathbf{C}_D^{-1} \, (\mathbf{f}(\mathbf{m}) - \mathbf{d}_{\rm obs}) + (\mathbf{m} - \mathbf{m}_{\rm prior})^t \, \mathbf{C}_M^{-1} \, (\mathbf{m} - \mathbf{m}_{\rm prior}) ,   (8.204)

the gradient \boldsymbol{\gamma} , whose components are \gamma_\alpha = \partial S / \partial m^\alpha , is given by the expression

\boldsymbol{\gamma}(\mathbf{m}) = \mathbf{F}^t(\mathbf{m}) \, \mathbf{C}_D^{-1} \, (\mathbf{f}(\mathbf{m}) - \mathbf{d}_{\rm obs}) + \mathbf{C}_M^{-1} \, (\mathbf{m} - \mathbf{m}_{\rm prior}) ,   (8.205)

where \mathbf{F} is the matrix of partial derivatives

F^i{}_\alpha = \frac{\partial f^i}{\partial m^\alpha} .   (8.206)

An example of computation of partial derivatives is given in appendix ??. [End of example.]
Example 8.23 In the context of example 8.22 the model space M has an obvious metric, namely that defined by the inverse of the 'a priori' covariance operator, \mathbf{g} = \mathbf{C}_M^{-1} . Using this metric and the gradient given by equation 8.205, the steepest descent algorithm 8.203 becomes

\mathbf{m}_{k+1} = \mathbf{m}_k - \varepsilon_k \, \mathbf{C}_M \left( \mathbf{F}_k^t \, \mathbf{C}_D^{-1} \, (\mathbf{f}_k - \mathbf{d}_{\rm obs}) + (\mathbf{m}_k - \mathbf{m}_{\rm prior}) \right) ,   (8.207)

where \mathbf{F}_k \equiv \mathbf{F}(\mathbf{m}_k) and \mathbf{f}_k \equiv \mathbf{f}(\mathbf{m}_k) . The real positive quantities \varepsilon_k can be fixed, after some trial and error, by accurate linear search, or by using a linearized approximation^18. [End of example.]
Example 8.24 In the context of example 8.22 the model space M has a less obvious metric, namely that defined by the inverse of the 'a posteriori' covariance operator, \mathbf{g} = \tilde{\mathbf{C}}_M^{-1} . Note: Explain here that the 'best current estimator' of \tilde{\mathbf{C}}_M is

\tilde{\mathbf{C}}_M \approx \left( \mathbf{F}_k^t \, \mathbf{C}_D^{-1} \, \mathbf{F}_k + \mathbf{C}_M^{-1} \right)^{-1} .   (8.208)

Using this metric and the gradient given by equation 8.205, the steepest descent algorithm 8.203 becomes

\mathbf{m}_{k+1} = \mathbf{m}_k - \varepsilon_k \left( \mathbf{F}_k^t \, \mathbf{C}_D^{-1} \, \mathbf{F}_k + \mathbf{C}_M^{-1} \right)^{-1} \left( \mathbf{F}_k^t \, \mathbf{C}_D^{-1} \, (\mathbf{f}_k - \mathbf{d}_{\rm obs}) + \mathbf{C}_M^{-1} \, (\mathbf{m}_k - \mathbf{m}_{\rm prior}) \right) ,   (8.209)

where \mathbf{F}_k \equiv \mathbf{F}(\mathbf{m}_k) and \mathbf{f}_k \equiv \mathbf{f}(\mathbf{m}_k) . The real positive quantities \varepsilon_k can be fixed, after some trial and error, by accurate linear search, or by using a linearized approximation that simply gives^19 \varepsilon_k \approx 1 . [End of example.]

^18 As shown in Tarantola (1987), if \tilde{\boldsymbol{\gamma}}_k is the direction of steepest ascent at point \mathbf{m}_k , i.e., \tilde{\boldsymbol{\gamma}}_k = \mathbf{C}_M \left( \mathbf{F}_k^t \mathbf{C}_D^{-1} (\mathbf{f}_k - \mathbf{d}_{\rm obs}) + (\mathbf{m}_k - \mathbf{m}_{\rm prior}) \right) , then a local linearized approximation for the optimal \varepsilon_k gives \varepsilon_k = \dfrac{ \tilde{\boldsymbol{\gamma}}_k^t \, \mathbf{C}_M^{-1} \, \tilde{\boldsymbol{\gamma}}_k }{ \tilde{\boldsymbol{\gamma}}_k^t \left( \mathbf{F}_k^t \mathbf{C}_D^{-1} \mathbf{F}_k + \mathbf{C}_M^{-1} \right) \tilde{\boldsymbol{\gamma}}_k } .
The algorithm 8.209 is usually called a 'quasi-Newton algorithm'. This name is a misnomer: a Newton method applied to the minimization of the misfit function S(\mathbf{m}) would use the second derivatives of S(\mathbf{m}) , and thus the derivatives H^i{}_{\alpha\beta} = \frac{\partial^2 f^i}{\partial m^\alpha \partial m^\beta} , which are not computed (or even estimated) when using this algorithm. It is just a steepest descent algorithm with a nontrivial definition of the metric in the working space. In this sense it belongs to the wider class of 'variable metric methods', not discussed in this article.
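For a strictly linear forward relation the step 8.209 can be checked directly: with \varepsilon_k = 1 it reaches, in a single iteration, the minimizer of the quadratic misfit 8.204. The sketch below is a toy illustration (all matrices and numbers are assumed values; \mathbf{C}_D and \mathbf{C}_M are kept as scaled identities):

```python
# One step of algorithm 8.209 (eps = 1) on a toy linear problem
# d = G m, with C_D = sd2 I and C_M = sm2 I (values assumed).

def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]

def inv2(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

G = [[1.0, 2.0], [3.0, 1.0]]
sd2, sm2 = 0.1, 4.0
d_obs = [3.0, 5.0]
m_prior = [0.0, 0.0]

m = [10.0, -10.0]   # arbitrary starting point
# the matrix F^t C_D^-1 F + C_M^-1 (here F = G, constant)
H = [[sum(G[k][i] * G[k][j] for k in range(2)) / sd2
      + (1.0 / sm2 if i == j else 0.0) for j in range(2)]
     for i in range(2)]
f = matvec(G, m)
# gradient, equation 8.205
gamma = [sum(G[k][i] * (f[k] - d_obs[k]) for k in range(2)) / sd2
         + (m[i] - m_prior[i]) / sm2 for i in range(2)]
# one 'quasi-Newton' step, equation 8.209 with eps = 1
step = matvec(inv2(H), gamma)
m = [m[i] - step[i] for i in range(2)]
print(m)  # center of the posterior Gaussian
```

After the step, the gradient 8.205 vanishes at m (up to rounding), whatever the starting point; for a nonlinear f the same step must instead be iterated, as in the examples above.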
Example 8.25 In the context of example 8.18, where the misfit function S(\mathbf{m}) is given by

S(\mathbf{m}) = \sum_i \frac{| f^i(\mathbf{m}) - d^i_{\rm obs} |}{\sigma_i} + \sum_\alpha \frac{| m^\alpha - m^\alpha_{\rm prior} |}{\sigma^\alpha} ,   (8.210)

the gradient \boldsymbol{\gamma} , whose components are \gamma_\alpha = \partial S / \partial m^\alpha , is given by the expression

\gamma_\alpha = \sum_i F^i{}_\alpha \, \frac{1}{\sigma_i} \, {\rm sign}( f^i - d^i_{\rm obs} ) + \frac{1}{\sigma^\alpha} \, {\rm sign}( m^\alpha - m^\alpha_{\rm prior} ) ,   (8.211)

where F^i{}_\alpha = \partial f^i / \partial m^\alpha . We can now choose in the model space the ad hoc metric defined as the inverse of the 'covariance matrix' formed by the squares of the mean deviations \sigma_i and \sigma^\alpha (interpreted as if they were variances). Using this metric, the direction of steepest ascent associated to the gradient in 8.211 is

\tilde{\gamma}^\alpha = \sum_i F^i{}_\alpha \, \sigma_i \, {\rm sign}( f^i - d^i_{\rm obs} ) + \sigma^\alpha \, {\rm sign}( m^\alpha - m^\alpha_{\rm prior} ) .   (8.212)

The steepest descent algorithm can now be applied:

\mathbf{m}_{k+1} = \mathbf{m}_k - \varepsilon_k \, \tilde{\boldsymbol{\gamma}}_k .   (8.213)

The real positive quantities \varepsilon_k can be fixed after some trial and error or by accurate linear search. [End of example.]
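A minimal sketch of the iteration 8.212-8.213 on a toy linear problem ( f(\mathbf{m}) = \mathbf{G}\,\mathbf{m} ; the matrix, data, and mean deviations are assumed values). Because a constant \varepsilon_k is used here, the iterates end up oscillating in a small neighborhood of the minimum rather than converging exactly:

```python
# Least-absolute-values steepest descent, equations 8.212-8.213,
# on a toy linear problem (all numbers assumed for illustration).
G = [[1.0, 0.5], [0.2, 1.0], [1.0, -1.0]]
d_obs = [2.0, 1.0, 0.5]
m_prior = [0.0, 0.0]
sig_d = [0.1, 0.1, 0.1]   # data mean deviations
sig_m = [1.0, 1.0]        # prior mean deviations

def forward(m):
    return [sum(G[i][a] * m[a] for a in range(2)) for i in range(3)]

def misfit(m):  # equation 8.210
    f = forward(m)
    return (sum(abs(f[i] - d_obs[i]) / sig_d[i] for i in range(3))
            + sum(abs(m[a] - m_prior[a]) / sig_m[a] for a in range(2)))

def sign(x):
    return (x > 0) - (x < 0)

m, eps = [5.0, -5.0], 0.05
for _ in range(500):
    f = forward(m)
    # direction of steepest ascent, equation 8.212
    dirn = [sum(G[i][a] * sig_d[i] * sign(f[i] - d_obs[i]) for i in range(3))
            + sig_m[a] * sign(m[a] - m_prior[a]) for a in range(2)]
    m = [m[a] - eps * dirn[a] for a in range(2)]  # equation 8.213

print(m, misfit(m))
```

In a practical implementation the constant eps would be replaced by a decreasing schedule or an accurate linear search, as the text suggests, to remove the residual oscillation.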
An expression like 8.210 defines a sort of deformed polyhedron, and to solve this sort of minimization problem linear programming techniques are often advocated (e.g., Claerbout and Muir, 1973). We have found that for problems involving many dimensions the crude steepest descent method defined by equations 8.212-8.213 performs extremely well. For instance, in Djikpéssé and Tarantola (1999) a large-sized problem of waveform fitting is solved using this algorithm. It is well known that the sum of absolute values 8.210 provides a more robust^20 criterion than the sum of squares 8.204. If one fears that the data set to be used is corrupted by some unexpected errors, the least absolute values criterion should be preferred to the least squares criterion^21.

^19 While a sensible estimation of the optimal values of the real positive quantities \varepsilon_k is crucial for the algorithm 8.207, they can, in many usual circumstances, be dropped from the algorithm 8.209.

^20 A method is 'robust' if its output is not sensitive to a small number of large errors in the inputs.

^21 Of course, it would be much better to develop a realistic model of the uncertainties, and use the more general probabilistic methods developed above, but if those models are not available, then the least absolute values criterion is a valuable criterion.
8.3.7.5 Estimation of A Posteriori Uncertainties

In the Gaussian context, the Gaussian probability density that is tangent to \sigma_m(\mathbf{m}) has its center at the point given by the iterative algorithm

\mathbf{m}_{k+1} = \mathbf{m}_k - \varepsilon_k \, \mathbf{C}_M \left( \mathbf{F}_k^t \, \mathbf{C}_D^{-1} \, (\mathbf{f}_k - \mathbf{d}_{\rm obs}) + (\mathbf{m}_k - \mathbf{m}_{\rm prior}) \right)   (8.214)

(equation 8.207) or, equivalently, by the iterative algorithm

\mathbf{m}_{k+1} = \mathbf{m}_k - \varepsilon_k \left( \mathbf{F}_k^t \, \mathbf{C}_D^{-1} \, \mathbf{F}_k + \mathbf{C}_M^{-1} \right)^{-1} \left( \mathbf{F}_k^t \, \mathbf{C}_D^{-1} \, (\mathbf{f}_k - \mathbf{d}_{\rm obs}) + \mathbf{C}_M^{-1} \, (\mathbf{m}_k - \mathbf{m}_{\rm prior}) \right)   (8.215)

(equation 8.209). The covariance of the tangent Gaussian is

\tilde{\mathbf{C}}_M \approx \left( \mathbf{F}_\infty^t \, \mathbf{C}_D^{-1} \, \mathbf{F}_\infty + \mathbf{C}_M^{-1} \right)^{-1} ,   (8.216)

where \mathbf{F}_\infty refers to the value of the matrix of partial derivatives at the convergence point. [note: Emphasize here the importance of \tilde{\mathbf{C}}_M .]
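For a linear problem ( \mathbf{F}_\infty = \mathbf{G} ) the estimate 8.216 can be evaluated directly. A toy sketch with assumed numbers, diagonal \mathbf{C}_D = s_d^2 \mathbf{I} and \mathbf{C}_M = s_m^2 \mathbf{I} :

```python
# Direct evaluation of the posterior covariance 8.216 for a linear
# problem; all numbers are toy values, assumed for illustration.

def inv2(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

G = [[1.0, 2.0], [3.0, 1.0]]
sd2, sm2 = 0.1, 4.0   # C_D = sd2 I, C_M = sm2 I

# F^t C_D^-1 F + C_M^-1, then its inverse (equation 8.216)
H = [[sum(G[k][i] * G[k][j] for k in range(2)) / sd2
      + (1.0 / sm2 if i == j else 0.0) for j in range(2)]
     for i in range(2)]
CM_post = inv2(H)
print(CM_post)
# the posterior variances come out far below the prior variance sm2:
# with these (accurate) data, both parameters are well constrained
```

Comparing the diagonal of CM_post with the prior variances is the simplest quantitative way to express what the measurement has actually taught us about each parameter.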
8.3.7.6 Some Comments on the Use of Deterministic Methods

8.3.7.6.1 About the Use of the Term 'Matrix' [note: Warning, old text to be updated.]

Contrary to the next chapter, where the model parameter space and the data space may be functional spaces, I assume here that we have discrete spaces, with a finite number of dimensions. [Note: What is 'indicial'?] Then, it makes sense to use the indicial notation

\mathbf{d} = \{ d^i \} , \ i \in I_D ; \qquad \mathbf{m} = \{ m^\alpha \} , \ \alpha \in I_M ,   (8.217)

where I_D and I_M are two index sets, for the data and the model parameters respectively. In the simplest case, the indices are simple integers, I_D = \{ 1, 2, 3, \dots \} and I_M = \{ 1, 2, 3, \dots \} , but this is not necessarily true. For instance, figure 8.24 suggests a 2D problem where we compute the gravitational field from a distribution of masses. Then, the index \alpha is better understood as consisting of a pair of integers.

Figure 8.24: A simple example where the index in \mathbf{m} = \{ m^\alpha \} is not necessarily an integer. In this case, where we are interested in predicting the gravitational field g generated by a 2D distribution of mass, the index \alpha is better understood as consisting of a pair of integers. Here, for instance, m^{A,B} means the total mass in the block at row A and column B . (The panel shows observation points g_1, \dots, g_4 around a 3 x 4 grid of blocks m^{1,1}, \dots, m^{3,4} .)

8.3.7.6.2 Linear, Weakly Nonlinear and Nonlinear Problems

There are different
degrees of nonlinearity. Figure 8.25 illustrates the four domains of nonlinearity allowing the use of the different optimisation algorithms. This figure symbolically represents the model space on the abscissa axis, and the data space on the ordinate axis. The gray oval represents the information coming in part from a priori information on the model parameters and coming in part from the data observations^22. It is the function \rho(\mathbf{d}, \mathbf{m}) = \rho_d(\mathbf{d}) \, \rho_m(\mathbf{m}) seen elsewhere (note: say where).
Figure 8.25: Illustration of the four domains of nonlinearity allowing the use of the different optimization algorithms. The model space is symbolically represented on the abscissa axis, and the data space on the ordinate axis. The gray oval represents the information coming in part from a priori information on the model parameters and coming in part from the data observations. What is important is not some intrinsic nonlinearity of the function relating model parameters to data, but how linear the function is inside the domain of significant probability. (Panels: linear problem, d = G m ; linearisable problem, d - d_prior = G_0 (m - m_prior) ; weakly nonlinear problem and nonlinear problem, d = g(m) .)

To fix ideas, the oval suggests here a Gaussian probability, but the sorting of problems we
are about to make as a function of their nonlinearity will not depend fundamentally on this.
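The criterion "how linear is the function inside the domain of significant probability" can be checked numerically. Below, a sketch with a hypothetical one-dimensional forward function g(m) = e^m (not from the text): the problem behaves quasilinearly if, within a few prior standard deviations of the reference model, the tangent linearization stays close to g :

```python
import math

# Numeric check of linearity inside the significant-probability
# domain, for a hypothetical forward g(m) = exp(m).
def g(m):
    return math.exp(m)

m_ref, sigma = 0.0, 0.1      # reference model and prior spread (assumed)
G_ref = math.exp(m_ref)      # dg/dm at m_ref

worst = 0.0
for k in range(-30, 31):
    m = m_ref + 3.0 * sigma * k / 30.0      # scan the +-3 sigma interval
    lin = g(m_ref) + G_ref * (m - m_ref)    # tangent, as in equation 8.219
    worst = max(worst, abs(g(m) - lin))

print(worst)  # small compared with the ~0.6 variation of g over the interval
```

Repeating the scan with sigma = 1.0 instead makes the worst-case deviation comparable to the variation of g itself, which is precisely the situation where the iterative or Monte Carlo treatments described below become necessary.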
First, there are some strictly linear problems. For instance, in the example illustrated by figure 8.24, the gravitational field g depends linearly on the masses inside the blocks^23.

^22 The gray oval is the product of the probability density over the model space, representing the a priori information, times the probability density over the data space, representing the experimental results.

^23 The gravitational field at point \mathbf{x}_0 generated by a distribution of volumetric mass \rho(\mathbf{x}) is given by g(\mathbf{x}_0) = \int dV(\mathbf{x}) \, \frac{ \mathbf{x}_0 - \mathbf{x} }{ \| \mathbf{x}_0 - \mathbf{x} \|^3 } \, \rho(\mathbf{x}) . When the volumetric mass is constant inside some predefined (2D) volumes, as suggested in figure 8.24, this gives g(\mathbf{x}_0) = \sum_A \sum_B G^{A,B}(\mathbf{x}_0) \, m^{A,B} . This is a strictly linear equation between the data (the gravitational field at a given observation point) and the model parameters (the masses inside the volumes). Note that if instead of choosing as model parameters the

Strictly linear problems are illustrated at the top left of figure 8.25. The linear relationship
between data and model parameters, \mathbf{d} = \mathbf{G}\, \mathbf{m} , is represented by a straight line. The a priori probability density \rho(\mathbf{d}, \mathbf{m}) "induces", on this straight line, the a posteriori probability density (warning: this notation corresponds to volumetric probabilities) \sigma(\mathbf{d}, \mathbf{m}) , whose "projection" over the model space gives the a posteriori probability density over the model parameter space, \sigma_m(\mathbf{m}) . Should the a priori probability densities be Gaussian, then the a posteriori probability distribution would also be Gaussian: this is the simplest situation (in such problems, as we will later see (section xxx), the problem reduces to finding the mean and the covariance of the a posteriori Gaussian).
Quasilinear problems are illustrated at the bottom-left of figure 8.25. If the relationship linking the observable data \mathbf{d} to the model parameters \mathbf{m} ,

\mathbf{d} = \mathbf{g}(\mathbf{m}) ,   (8.218)

is approximately linear inside the domain of significant a priori probability (i.e., inside the gray oval of the figure), then the a posteriori probability is as simple as the a priori probability. For instance, if the a priori probability is Gaussian, the a posteriori probability is also Gaussian. In this case also, the problem can be reduced to the computation of the mean and the covariance of the Gaussian. Typically, one begins at some "starting model" \mathbf{m}_0 (typically, one takes for \mathbf{m}_0 the "a priori model" \mathbf{m}_{\rm prior} ) (note: explain clearly somewhere in this section that "a priori model" is a language abuse for the "mean a priori model"), linearizes the function \mathbf{d} = \mathbf{g}(\mathbf{m}) around \mathbf{m}_0 , and looks for a model \mathbf{m}_1 "better than \mathbf{m}_0 ". Iterating such an algorithm, one tends to the model \mathbf{m}_\infty at which the "quasi-Gaussian" \sigma_m(\mathbf{m}) is maximum. The linearizations made in order to arrive at \mathbf{m}_\infty are not, so far, an approximation: the point \mathbf{m}_\infty is perfectly defined independently of any linearization, and of any method used to find it. But once the convergence to this point has been obtained, a linearization of the function \mathbf{d} = \mathbf{g}(\mathbf{m}) around this point,

\mathbf{d} - \mathbf{g}(\mathbf{m}_\infty) = \mathbf{G}_\infty \, (\mathbf{m} - \mathbf{m}_\infty) ,   (8.219)

allows one to obtain a good approximation of the a posteriori uncertainties. For instance, if the a priori probability is Gaussian, this will give the covariance of the "tangent Gaussian".
Between the linear and the quasilinear problems there are the "linearizable problems". The scheme at the top-right of figure 8.25 shows the case where the linearization of the function \mathbf{d} = \mathbf{g}(\mathbf{m}) around the a priori model,

\mathbf{d} - \mathbf{g}(\mathbf{m}_{\rm prior}) = \mathbf{G}_{\rm prior} \, (\mathbf{m} - \mathbf{m}_{\rm prior}) ,   (8.220)

gives a function that, inside the domain of significant probability, is very similar to the true (nonlinear) function. In this case, there is no practical difference between this problem and the strictly linear problem, and the iterative procedure necessary for quasilinear problems is here superfluous.
It remains to analyze the true nonlinear problems that, using a pleonasm, are sometimes called strongly nonlinear problems. They are illustrated at the bottom-right of figure 8.25. In this case, even if the a priori probability is simple, the a posteriori probability can be quite complicated; for instance, it can be multimodal. These problems are, in general, quite complex to solve, and only the Monte Carlo methods described in the previous chapter are sufficiently general.

^23 (continued) total masses inside some predefined volumes one chooses the geometrical parameters defining the sizes of the volumes, then the gravity field is not a linear function of the parameters. More details can be found in Tarantola and Valette (1982b, page 229).
If full Monte Carlo methods cannot be used, because they are too expensive, then one can
mix some random part (for instance, to choose the starting point) and some deterministic part.
The optimization methods applicable to quasilinear problems can, for instance, allow us to
go from the randomly chosen starting point to the “nearest” optimal point (note: explain this
better). Repeating these computations for diﬀerent starting points one can arrive at a good
idea of the a posteriori probability in the model space.
8.3.7.6.3 The Maximum Likelihood Model

The most likely model is, by definition, that at which the volumetric probability \sigma_m(\mathbf{m}) attains its maximum. As \sigma_m(\mathbf{m}) is maximum when S(\mathbf{m}) is minimum, we see that the most likely model is also the 'best model' obtained when using a 'least squares criterion'. Should we have used the double exponential model for all the uncertainties, then the most likely model would be defined by a 'least absolute values' criterion.

There are many circumstances where the most likely model is not an interesting model. One trivial example is when the volumetric probability has a 'narrow maximum', with small total probability (see figure 8.26). A much less trivial situation arises when the number of parameters is very large, as for instance when we deal with a random function (that, in all rigor, corresponds to an infinite number of random variables). Figure XXX, for instance, shows a few realizations of a Gaussian function with zero mean and an (approximately) exponential correlation. The most likely function is the center of the Gaussian, i.e., the null function shown at the left. But this is not a representative sample (specimen) of the probability distribution, as any realization of the probability distribution will have, with a probability very close to one, the 'oscillating' characteristics of the three samples shown at the right.
Figure 8.26: One of the circumstances where the 'maximum likelihood model' may not be very interesting is when it corresponds to a narrow maximum, with small total probability, as the peak at the left of this probability distribution.

8.3.7.6.4 The Interpretation of 'The Least Squares Solution'

Note: explain here
that when working with a large number of dimensions, the center of a Gaussian is a bad representative of the possible realizations of the Gaussian.

Mention somewhere that \mathbf{m}_{\rm post} is not the 'posterior model', but the center of the a posteriori Gaussian, and explain that for multidimensional problems the center of a Gaussian is not representative of a random realisation of the Gaussian.

[note: Mention somewhere that one should not compute the inverse of the matrices, but solve the associated linear system.]

Figure 8.27: At the right, three random realizations of a Gaussian random function with zero mean and (approximately) exponential correlation function. The most likely function, i.e., the center of the Gaussian, is shown at the left. We see that the most likely function is not representative of the probability distribution.

Chapter 9
Inference Problems of the Fourth Kind (Transport of Probabilities)

Note: Say here that we consider two problems: (i) the measure of physical quantities (through a direct use of their definition) and (ii) the prediction of observations. It is, of course, our goal to pay attention to the uncertainties involved. These two problems are mathematically very similar, and are essentially solved using the notion of 'transport of probabilities' introduced in chapter 2.

9.1 Measure of Physical Quantities

Note: we develop here a problem that is fundamental in metrology: when a quantity s is defined as a function of some other quantity r , through s = s(r) , and we measure r , we must 'transport' the information we have obtained on r into information on s . Note: give the main ideas here. The method is illustrated in section 9.1.1, where the Poisson ratio of a solid is evaluated, using its definition in terms of stresses and deformations. It is also illustrated in appendix 9.3.1, in an example of mass calibration.

9.1.1 Example: Measure of Poisson's Ratio

9.1.1.1 Hooke's Law in Isotropic Media

For an elastic medium, in the limit of infinitesimal strains (Hooke's law),
\sigma_{ij} = c_{ij}{}^{k\ell} \, \varepsilon_{k\ell} ,   (9.1)

where c_{ijk\ell} is the stiffness tensor. If the elastic medium is isotropic,

c_{ijk\ell} = \frac{\lambda_\kappa}{3} \, g_{ij} g_{k\ell} + \frac{\lambda_\mu}{2} \left( g_{ik} g_{j\ell} + g_{i\ell} g_{jk} - \frac{2}{3} \, g_{ij} g_{k\ell} \right) ,   (9.2)

where \lambda_\kappa (with multiplicity one) and \lambda_\mu (with multiplicity five) are the two eigenvalues of the stiffness tensor c_{ijk\ell} . They are related to the common uncompressibility modulus \kappa and shear modulus \mu through

\kappa = \lambda_\kappa / 3 ; \qquad \mu = \lambda_\mu / 2 .   (9.3)

The Hooke's law 9.1 can, alternatively, be written

\varepsilon_{ij} = d_{ij}{}^{k\ell} \, \sigma_{k\ell} ,   (9.4)

where d_{ijk\ell} , the inverse of the stiffness tensor, is called the compliance tensor. If the elastic medium is isotropic,

d_{ijk\ell} = \frac{\gamma}{3} \, g_{ij} g_{k\ell} + \frac{\varphi}{2} \left( g_{ik} g_{j\ell} + g_{i\ell} g_{jk} - \frac{2}{3} \, g_{ij} g_{k\ell} \right) ,   (9.5)

where \gamma (with multiplicity one) and \varphi (with multiplicity five) are the two eigenvalues of d_{ijk\ell} . These are, of course, the inverses of the eigenvalues of c_{ijk\ell} :

\gamma = \frac{1}{\lambda_\kappa} = \frac{1}{3\kappa} ; \qquad \varphi = \frac{1}{\lambda_\mu} = \frac{1}{2\mu} .   (9.6)

From now on, I shall call \gamma the eigencompressibility or, if there is no risk of confusion with 1/\kappa , the compressibility. The quantity \varphi shall be called the eigenshearability or, if there is no risk of confusion with 1/\mu , the shearability.

With the isotropic stiffness tensor of equation 9.2, the Hooke's law 9.1 becomes

\sigma_{ij} = \frac{\lambda_\kappa}{3} \, g_{ij} \, \varepsilon_k{}^k + \lambda_\mu \left( \varepsilon_{ij} - \frac{1}{3} \, g_{ij} \, \varepsilon_k{}^k \right) ,   (9.7)

or, equivalently, with the isotropic compliance tensor of equation 9.5, the Hooke's law 9.4 becomes

\varepsilon_{ij} = \frac{\gamma}{3} \, g_{ij} \, \sigma_k{}^k + \varphi \left( \sigma_{ij} - \frac{1}{3} \, g_{ij} \, \sigma_k{}^k \right) .   (9.8)

9.1.1.2 Definition of the Poisson's Ratio

Consider the experimental arrangement of figure 9.1, where an elastic medium is submitted to
the (homogeneous) uniaxial stress (using Cartesian coordinates)

\sigma_{xx} = \sigma_{yy} = \sigma_{xy} = \sigma_{yz} = \sigma_{zx} = 0 ; \qquad \sigma_{zz} \neq 0 .   (9.9)

Then, the Hooke's law 9.4 predicts the strain

\varepsilon_{xx} = \varepsilon_{yy} = \frac{1}{3} (\gamma - \varphi) \, \sigma_{zz} ; \qquad \varepsilon_{zz} = \frac{1}{3} (\gamma + 2\varphi) \, \sigma_{zz} ; \qquad \varepsilon_{xy} = \varepsilon_{yz} = \varepsilon_{zx} = 0 .   (9.10)

The Young modulus Y and the Poisson ratio \nu are defined as

Y = \frac{\sigma_{zz}}{\varepsilon_{zz}} ; \qquad \nu = - \frac{\varepsilon_{xx}}{\varepsilon_{zz}} = - \frac{\varepsilon_{yy}}{\varepsilon_{zz}} ,   (9.11)

and equation 9.10 gives

Y = \frac{3}{2\varphi + \gamma} ; \qquad \nu = \frac{\varphi - \gamma}{2\varphi + \gamma} ,   (9.12)

with reciprocal relations

\gamma = \frac{1 - 2\nu}{Y} ; \qquad \varphi = \frac{1 + \nu}{Y} .   (9.13)

Figure 9.1: A possible experimental setup for measuring the Young modulus and the Poisson ratio of an elastic medium. The measurement of the force F , of the 'bar length' Z , and of the bar diameter X allows one to estimate the two elastic parameters. Details below.

Note that when \gamma and \varphi take values inside their natural range

0 < \gamma < \infty ; \qquad 0 < \varphi < \infty ,   (9.14)

the variation of Y and \nu is

0 < Y < \infty ; \qquad -1 < \nu < +1/2 .   (9.15)

Although most materials have positive values of the Poisson ratio \nu , there are materials where it is negative (see figures 9.2 and 9.3).

The Poisson ratio has mainly a historical interest. Note that a simple function of it would have given a bona fide Jeffreys quantity,

J = \frac{1 + \nu}{1 - 2\nu} = \frac{\lambda_\kappa}{\lambda_\mu} ,   (9.16)

with the natural domain of variation 0 < J < \infty .

Figure 9.2: An example of a 2D elastic structure with a positive value of the Poisson ratio. When imposing a stretching in one direction (the 'horizontal' here), the elastic structure reacts contracting in the perpendicular direction.

Figure 9.3: An example of a 2D elastic structure with a negative value of the Poisson ratio. When imposing a stretching in one direction (the 'horizontal' here), the elastic structure reacts also stretching in the perpendicular direction.

9.1.1.3 The Parameters

Although one may be interested in the Young modulus Y and the Poisson ratio \nu , we
may choose to measure the compressibility \gamma = 1/\lambda_\kappa and the shearability \varphi = 1/\lambda_\mu . Any information we may need on Y and \nu can be obtained, as usual, through the change of variables.

From the first two equations in expression 9.10 it follows that the relation between the elastic parameters \gamma and \varphi , the stress, and the strains is

\gamma = \frac{ \varepsilon_{zz} + 2\,\varepsilon_{xx} }{ \sigma_{zz} } ; \qquad \varphi = \frac{ \varepsilon_{zz} - \varepsilon_{xx} }{ \sigma_{zz} } .   (9.17)

As the uniaxial stress is generated by a force F applied to one of the ends of the bar (and the reaction force of the support),

\sigma_{zz} = \frac{F}{s} ,   (9.18)

where s , the section of the bar, is

s = \frac{\pi X^2}{4} .   (9.19)

The most general definition of strain (that does not assume the strains to be small) is

\varepsilon_{xx} = \log \frac{X}{X_0} ; \qquad \varepsilon_{zz} = \log \frac{Z}{Z_0} ,   (9.20)

where X_0 and Z_0 are the initial lengths (see figure 9.1) and X and Z are the final lengths. We have then the final relation

\gamma = \frac{\pi X^2}{4F} \left( \log \frac{Z}{Z_0} + 2 \log \frac{X}{X_0} \right) ; \qquad \varphi = \frac{\pi X^2}{4F} \left( \log \frac{Z}{Z_0} - \log \frac{X}{X_0} \right) .   (9.21)

When necessary, these two expressions shall be written

\gamma = \gamma(X_0, Z_0, X, Z, F) ; \qquad \varphi = \varphi(X_0, Z_0, X, Z, F) .   (9.22)

We shall later need to extract from these relations the two parameters X_0 and Z_0 :

X_0 = X \, \exp\left( - \frac{ 4 F (\gamma - \varphi) }{ 3 \pi X^2 } \right) ; \qquad Z_0 = Z \, \exp\left( - \frac{ 4 F (\gamma + 2\varphi) }{ 3 \pi X^2 } \right) ,   (9.23)

expressions that, when necessary, shall be written

X_0 = X_0(\gamma, \varphi, X, Z, F) ; \qquad Z_0 = Z_0(\gamma, \varphi, X, Z, F) .   (9.24)

9.1.1.4 The Partial Derivatives

In what follows, let us use the notation
\mathbf{r} = \{ X_0, Z_0, X, Z, F \} ; \qquad \mathbf{s} = \{ \gamma, \varphi \} ,   (9.25)

so the relation 9.21 may be written

\mathbf{s} = \mathbf{s}(\mathbf{r}) .   (9.26)

We need to complete the set of two variables \mathbf{s} to have a set of five variables, as suggested in section 2.6.0.3. The simplest choice is

\mathbf{t} = \{ X, Z, F \}   (9.27)

as supplementary variables. We can then introduce the matrix of partial derivatives

\mathbf{K} = \begin{pmatrix} \partial\gamma/\partial X_0 & \partial\gamma/\partial Z_0 & \partial\gamma/\partial X & \partial\gamma/\partial Z & \partial\gamma/\partial F \\ \partial\varphi/\partial X_0 & \partial\varphi/\partial Z_0 & \partial\varphi/\partial X & \partial\varphi/\partial Z & \partial\varphi/\partial F \\ \partial X/\partial X_0 & \partial X/\partial Z_0 & \partial X/\partial X & \partial X/\partial Z & \partial X/\partial F \\ \partial Z/\partial X_0 & \partial Z/\partial Z_0 & \partial Z/\partial X & \partial Z/\partial Z & \partial Z/\partial F \\ \partial F/\partial X_0 & \partial F/\partial Z_0 & \partial F/\partial X & \partial F/\partial Z & \partial F/\partial F \end{pmatrix}   (9.28)

to easily obtain

K = \sqrt{ \det ( \mathbf{K} \mathbf{K}^t ) } = \frac{ 3 \pi^2 X^4 }{ 16 \, F^2 X_0 Z_0 } .   (9.29)

9.1.1.5 The Measurement Space and the Measurand Space

We measure the five quantities \mathbf{r} = \{ X_0, Z_0, X, Z, F \} in order to evaluate the two quantities
\mathbf{s} = \{ \gamma, \varphi \} . Let us denote by R^5 the five-dimensional measurement space, over which \mathbf{r} = \{ X_0, Z_0, X, Z, F \} shall be considered coordinates. The distance element over the measurement space is [note: explain why]

ds^2 = \frac{1}{a^2} \left[ \left( \frac{dX_0}{X_0} \right)^2 + \left( \frac{dZ_0}{Z_0} \right)^2 + \left( \frac{dX}{X} \right)^2 + \left( \frac{dZ}{Z} \right)^2 \right] + \frac{dF^2}{b^2} ,   (9.30)

where a and b represent arbitrary 'weights'. We then have the metric determinant

\sqrt{ \det \mathbf{g}_r } = \frac{k}{X_0 Z_0 X Z} ,   (9.31)

where the constant k = 1/(a^4 b) shall not play any important role in what follows (it will spontaneously disappear).
Similarly, let us denote by S^2 the two-dimensional measurand space, over which \mathbf{s} = \{ \gamma, \varphi \} shall be considered coordinates. The distance element over the measurand space is [note: explain why]

ds^2 = \frac{1}{c^2} \left[ \left( \frac{d\gamma}{\gamma} \right)^2 + \left( \frac{d\varphi}{\varphi} \right)^2 \right] ,   (9.32)

where c represents an arbitrary 'weight'. The metric matrix is, therefore,

\mathbf{g}_s = \frac{1}{c^2} \begin{pmatrix} 1/\gamma^2 & 0 \\ 0 & 1/\varphi^2 \end{pmatrix} ,   (9.33)

and this gives the metric determinant

\sqrt{ \det \mathbf{g}_s } = \frac{k'}{\gamma \varphi} ,   (9.34)

where the constant k' = 1/c^2 shall not play any important role in what follows (it will spontaneously disappear).
9.1.1.6 The Measurement

We measure \{ X_0, Z_0, X, Z, F \} and describe the result of our measurement via a volumetric probability

f_r(X_0, Z_0, X, Z, F) .   (9.35)

[Note: Explain this.]

9.1.1.7 Transportation of the Probability Distribution

Equation 2.206 applies here directly, and gives the transported volumetric probability over the measurand space. Using the present notations, this gives
f_s(\gamma, \varphi) = \int_0^\infty dX \int_0^\infty dZ \int_{-\infty}^{+\infty} dF \ \frac{ \sqrt{\det \mathbf{g}_r} }{ \sqrt{\det \mathbf{g}_s} } \, \frac{1}{K} \ f_r(X_0, Z_0, X, Z, F) \Big|_{ X_0 = X_0(\gamma,\varphi,X,Z,F) \; ; \; Z_0 = Z_0(\gamma,\varphi,X,Z,F) } ,   (9.36)

where the functions X_0 = X_0(\gamma, \varphi, X, Z, F) and Z_0 = Z_0(\gamma, \varphi, X, Z, F) are those expressed
by equations 9.23–9.24. More explicitly, using the result for the Jacobian determinant K given
by equation 9.29, and the two metric determinants given by equations 9.31 and 9.34,
f_s(\gamma, \varphi) = \frac{k}{k'} \, \frac{16}{3\pi^2} \ \gamma \varphi \int_0^\infty \frac{dX}{X} \int_0^\infty \frac{dZ}{Z} \int_{-\infty}^{+\infty} dF \ \frac{F^2}{X^4} \ f_r(X_0, Z_0, X, Z, F) \Big|_{ X_0 = X_0(\gamma,\varphi,X,Z,F) \; ; \; Z_0 = Z_0(\gamma,\varphi,X,Z,F) } .   (9.37)

The two associated marginal volumetric probabilities are, then,
f_\gamma(\gamma) = \int_0^\infty \frac{d\varphi}{\varphi} \ f_s(\gamma, \varphi)   (9.38)

and

f_\varphi(\varphi) = \int_0^\infty \frac{d\gamma}{\gamma} \ f_s(\gamma, \varphi) .   (9.39)

To represent these volumetric probabilities I prefer to use the 'Cartesian parameters' of the
problem [note: explain]. Here, the logarithmic parameters

\gamma^* = \log \frac{\gamma}{\gamma_0} ; \qquad \varphi^* = \log \frac{\varphi}{\varphi_0} ,   (9.40)

where \gamma_0 and \varphi_0 are two arbitrary constants having the dimension of a compliance, are Cartesian coordinates over the 2D space of elastic (isotropic) media. Indeed, the distance element of equation 9.32 becomes

c^2 \, ds^2 = ( d\gamma^* )^2 + ( d\varphi^* )^2 ,   (9.41)

typical of Cartesian coordinates in Euclidean spaces. As volumetric probabilities are invariant quantities, the new volumetric probability function, say g_s(\gamma^*, \varphi^*) , is simply given by

g_s(\gamma^*, \varphi^*) = f_s(\gamma, \varphi) \Big|_{ \gamma = \gamma_0 \exp \gamma^* \; ; \; \varphi = \varphi_0 \exp \varphi^* } .   (9.42)

To be complete, let us mention that equations 9.37-9.39 define volumetric probabilities;
should we wish to evaluate probability densities,

\bar{f}_s(\gamma, \varphi) = \frac{ f_s(\gamma, \varphi) }{ \gamma \varphi } ; \qquad \bar{f}_\gamma(\gamma) = \frac{ f_\gamma(\gamma) }{ \gamma } ; \qquad \bar{f}_\varphi(\varphi) = \frac{ f_\varphi(\varphi) }{ \varphi } ,   (9.43)

then

\bar{f}_s(\gamma, \varphi) = \frac{k}{k'} \, \frac{16}{3\pi^2} \int_0^\infty \frac{dX}{X} \int_0^\infty \frac{dZ}{Z} \int_{-\infty}^{+\infty} dF \ \frac{F^2}{X^4} \ f_r(X_0, Z_0, X, Z, F) \Big|_{ X_0 = X_0(\gamma,\varphi,X,Z,F) \; ; \; Z_0 = Z_0(\gamma,\varphi,X,Z,F) }   (9.44)

and

\bar{f}_\gamma(\gamma) = \int_0^\infty d\varphi \ \bar{f}_s(\gamma, \varphi) , \qquad \bar{f}_\varphi(\varphi) = \int_0^\infty d\gamma \ \bar{f}_s(\gamma, \varphi) .   (9.45)

9.1.1.8 Numerical Illustration

Note: to do things properly, the constants k and k' of equations 9.31 and 9.34 should appear here, as they measure distances. They should all simplify and disappear.
Let us use the notations N(u, u_0, s) and L(U, U_0, s) respectively for the normal and the lognormal functions

N(u, u_0, s) = k \, \exp\left( - \frac{ (u - u_0)^2 }{ 2 s^2 } \right) ; \qquad L(U, U_0, s) = k \, \exp\left( - \frac{1}{2 s^2} \left( \log \frac{U}{U_0} \right)^2 \right) .   (9.46)

Assume that the result of the measurement of the quantities X_0 , Z_0 (initial diameter and
length of the bar), X , Z (final diameter and length of the bar), and the force F , has given information that can be represented by a five-dimensional Gaussian volumetric probability with independent uncertainties
f_r(X, X_0, Z, Z_0, F) = L(X_0, X_0^{\rm obs}, s_{X_0}) \ L(Z_0, Z_0^{\rm obs}, s_{Z_0}) \ L(X, X^{\rm obs}, s_X) \ L(Z, Z^{\rm obs}, s_Z) \ N(F, F^{\rm obs}, s_F) ,   (9.47)

with the numerical values

X_0^{\rm obs} = 1.000 m ;  s_{X_0} = 0.015
Z_0^{\rm obs} = 1.000 m ;  s_{Z_0} = 0.015
X^{\rm obs} = 0.975 m ;  s_X = 0.015
Z^{\rm obs} = 1.105 m ;  s_Z = 0.015
F^{\rm obs} = 9.81 kg m/s^2 ;  s_F \approx 0 .

This is the volumetric probability that appears at the right of equation 9.37. To simplify
the example, I have assumed that the uncertainty on the force F is much smaller than the other uncertainties, so, in fact, F can be treated as a constant. With the small uncertainties chosen, the lognormal functions in 9.47 look much like normal ones. Figure 9.4 displays the four (marginal) one-dimensional lognormal functions. To illustrate how the uncertainties in the measurement of the lengths propagate into uncertainties in the elastic parameters, I have chosen the quite unrealistic example where the uncertainties in X and X0 overlap: it is likely that the diameter of the rod has decreased (so the Poisson ratio is positive), but the probability that it has increased (negative Poisson ratio) is significant. In fact, as we shall see, the measurements do not even exclude the possibility of negative elastic parameters γ and ϕ (this possibility being excluded by the elastic theory).
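The Monte Carlo simulation referred to in the figure captions below can be sketched as follows (a sketch, not the book's code). The only assumption is the standard sampling rule for the lognormal of equation 9.46, U = U^obs exp(s n) with n a standard normal deviate; F is held at F^obs since sF ≈ 0:

```python
import math, random

random.seed(0)

# Observed values and standard deviations from the numerical illustration
obs = {"X0": 1.000, "Z0": 1.000, "X": 0.975, "Z": 1.105}   # metres
s   = {"X0": 0.015, "Z0": 0.015, "X": 0.015, "Z": 0.015}
F_obs = 9.81   # kg m/s^2 ; s_F ~ 0, so F is treated as a constant

def sample_measurement():
    """One draw from the volumetric probability of equation 9.47: each
    length is lognormal, U = U_obs * exp(s * n), n standard normal."""
    draw = {k: obs[k] * math.exp(s[k] * random.gauss(0.0, 1.0)) for k in obs}
    draw["F"] = F_obs
    return draw

samples = [sample_measurement() for _ in range(3000)]

# Fraction of draws in which the diameter has *increased* (X > X0), i.e., a
# negative Poisson ratio; with the overlapping uncertainties above this
# fraction is significant (clearly nonzero), as claimed in the text.
frac_increased = sum(d["X"] > d["X0"] for d in samples) / len(samples)
print(frac_increased)
```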
Figure 9.4: The four 1D marginal volumetric probabilities for the initial and final lengths and diameters (X0, X, Z0, Z). Note that the uncertainties in X and X0 overlap: it is likely that the diameter of the rod has decreased (so the Poisson ratio is positive), but the probability that it has increased (negative Poisson ratio) is significant.
Figure 9.5 represents the volumetric probability fs(γ, ϕ) defined by equations 9.37 and 9.42. It represents the information that the measurements of the lengths have given on the elastic parameters γ and ϕ. [Note: Explain this better.] [Note: Explain that negative values of γ and ϕ are excluded ‘by hand’.]
The two associated marginal volumetric probabilities are defined in equations 9.38–9.39, and are represented in figure 9.6.
Note: mention here figure 9.7.
9.1.1.9 Translation into the Young Modulus and Poisson Ratio Language

To obtain the expression of the metric in the coordinates {Y, ν} one can use the partial derivatives of the old coordinates with respect to the new coordinates, and equation 1.23.

Figure 9.5: The (2D) volumetric probability for the compressibility γ and the shearability ϕ (axes γ* = log γ/Q and ϕ* = log ϕ/Q), as induced from the measurement results. At the left, a direct representation of the volumetric probability defined by equations 9.37 and 9.42. At the right, a Monte Carlo simulation of the measurement (see section XXX). Here, natural logarithms are used, and Q = 1 N/m². Of the 3000 points used, 9 fell at the left and 7 below the domain plotted, and are not represented. The zone of nonvanishing probability extends over all the space, and only the level lines automatically proposed by the plotting software have been used.

Figure 9.6: The marginal (1D) volumetric probabilities defined by equations 9.38–9.39.

Figure 9.7: The marginal probability distributions for the lengths X and X0. At the left, a Monte Carlo sampling of the probability distribution for X and X0 defined by equation 9.47 (the values Z and Z0 are also sampled, but are not shown). At the right, the same Monte Carlo sampling, but where only the points that correspond, through equation 9.21, to positive values of γ and ϕ (and are, thus, acceptable by the theory of elastic media) are kept. Note that many of the points ‘behind’ the diagonal bar have been suppressed.

Then, the metric matrix in equation 9.33, written in the coordinates {Y, ν}, becomes
    ( gYY  gYν )       (  2/Y²                           2/(Y(1−2ν)) − 1/(Y(1+ν))  )
    ( gνY  gνν )   =   (  2/(Y(1−2ν)) − 1/(Y(1+ν))      4/(1−2ν)² + 1/(1+ν)²      ) ,   (9.48)

with the square root of the metric determinant given by

    √det g = 3 / ( Y (1 + ν)(1 − 2ν) ) .   (9.49)

To obtain the equivalent of the volumetric probability fs(γ, ϕ) in terms of the Young modulus Y and the Poisson ratio ν we just need to perform the change of variables (remember that volumetric probabilities are invariant under a change of variables), so the volumetric probability fs(γ, ϕ) transforms into a volumetric probability q(Y, ν) that is given by (see relations 9.13)

    q(Y, ν) = fs(γ, ϕ)|γ = (1−2ν)/Y ; ϕ = (1+ν)/Y .   (9.50)

To evaluate the probability of a domain we have to integrate, in view of equation 9.49, as

    P(Y1 < Y < Y2 , ν1 < ν < ν2) = ∫Y1^Y2 dY ∫ν1^ν2 dν ( 3 / ( Y (1 + ν)(1 − 2ν) ) ) q(Y, ν) .   (9.51)
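The metric 9.48 and the determinant 9.49 can be checked with a small computation (a sketch, not from the book). Since the starred coordinates are Cartesian, the metric in {Y, ν} is JᵀJ, with J the Jacobian of (log γ, log ϕ) with respect to (Y, ν), computed from γ = (1−2ν)/Y and ϕ = (1+ν)/Y of equation 9.50:

```python
import math

def metric(Y, nu):
    """Metric of equation 9.48 in the coordinates {Y, nu}, obtained as J^T J,
    where J is the Jacobian of (gamma*, phi*) = (log gamma, log phi) with
    respect to (Y, nu), using gamma = (1-2nu)/Y, phi = (1+nu)/Y (eq. 9.50)."""
    # analytic partial derivatives of log gamma and log phi
    dlg_dY, dlg_dnu = -1.0 / Y, -2.0 / (1.0 - 2.0 * nu)
    dlp_dY, dlp_dnu = -1.0 / Y,  1.0 / (1.0 + nu)
    gYY = dlg_dY ** 2 + dlp_dY ** 2
    gYn = dlg_dY * dlg_dnu + dlp_dY * dlp_dnu
    gnn = dlg_dnu ** 2 + dlp_dnu ** 2
    return gYY, gYn, gnn

Y, nu = 150.0, 0.25
gYY, gYn, gnn = metric(Y, nu)

# compare with the closed forms of equations 9.48 and 9.49
assert math.isclose(gYY, 2.0 / Y ** 2)
assert math.isclose(gYn, 2.0 / (Y * (1 - 2 * nu)) - 1.0 / (Y * (1 + nu)))
assert math.isclose(gnn, 4.0 / (1 - 2 * nu) ** 2 + 1.0 / (1 + nu) ** 2)

det_g = gYY * gnn - gYn ** 2
assert math.isclose(math.sqrt(det_g),
                    3.0 / (Y * (1 + nu) * (1 - 2 * nu)))   # equation 9.49
print("metric and determinant are consistent")
```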
Figure 9.8: The metrically correct representation of the volumetric probability q(Y, ν), obtained by just superimposing on the figure 9.5 the new coordinates {Y, ν} (lines of constant Y, e.g., Y = 100 Q, Y = 200 Q, Y = 300 Q, and lines of constant ν). As above, Q = 1 N/m².

This being said, the question now is: how should we represent the volumetric probability
q(Y, ν)? A direct, naïve plot, using Y as abscissa and ν as ordinate, is possible, and only needs the use of equation 9.50 (as the probability density fs(γ, ϕ) has already been evaluated). But let us first use a subtler approach.
We have seen that the quantities γ* and ϕ* (logarithmic compressibility and logarithmic shearability) are Cartesian quantities in the 2D space of linear elastic media. My preferred choice for visualizing q(Y, ν) is a direct representation of the ‘new coordinates’ on a metrically correct representation, i.e., to superimpose in figure 9.5, where the coordinates γ* and ϕ* were used, the new coordinates {Y, ν} (the change of variables being defined by equations 9.12–9.13). This gives the representation displayed in figure 9.8.
As this is not the conventional way of plotting probability distributions, let us also examine
the more conventional plot of q(Y, ν) in figure 9.9. One may observe, in particular, the ‘round’ character of the ‘level lines’ in this plot, due to the fact that the experiment was specially designed to have a good (and independent) resolution of the Young modulus and the Poisson ratio.

Figure 9.9: The volumetric probability for the Young modulus Y and the Poisson ratio ν, deduced, using a change of variables, from the volumetric probability on γ and ϕ represented in figure 9.5 (see equation 9.50).
As the metric matrix is not diagonal in the coordinates {Y, ν}, one can not define marginal volumetric probabilities, but marginal probability densities only (see section 2.5). We can start by introducing the probability density q̄(Y, ν) = √det g q(Y, ν), i.e.,

    q̄(Y, ν) = 3 q(Y, ν) / ( Y (1 + ν)(1 − 2ν) ) .   (9.52)

Then, the marginal probability density for the Young modulus is q̄Y(Y) = ∫−1^+1/2 dν q̄(Y, ν), i.e.,

    q̄Y(Y) = (3/Y) ∫−1^+1/2 dν q(Y, ν) / ( (1 + ν)(1 − 2ν) ) ,   (9.53)

and the marginal probability density for the Poisson ratio is q̄ν(ν) = ∫0^∞ dY q̄(Y, ν), i.e.,

    q̄ν(ν) = ( 3 / ( (1 + ν)(1 − 2ν) ) ) ∫0^∞ dY q(Y, ν) / Y .   (9.54)

Then, one can evaluate probabilities like

    P(Y1 < Y < Y2) = ∫Y1^Y2 dY q̄Y(Y) ;   P(ν1 < ν < ν2) = ∫ν1^ν2 dν q̄ν(ν) .   (9.55)

As an example, the marginal probability density for the Poisson ratio, q̄ν(ν), is plotted in figure 9.10.

Figure 9.10: The marginal probability density for the Poisson ratio ν (equation 9.54).
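A marginal such as equation 9.54 is easily evaluated by numerical integration. The sketch below uses a hypothetical stand-in for q(Y, ν) (not the volumetric probability deduced from the measurements), just to show the mechanics:

```python
import math

def q(Y, nu):
    """Hypothetical stand-in for the volumetric probability q(Y, nu)
    (NOT the one deduced from the measurements): a bump centred at
    Y = 150, nu = 0.25."""
    return math.exp(-0.5 * (((Y - 150.0) / 20.0) ** 2
                            + ((nu - 0.25) / 0.05) ** 2))

def q_bar_nu(nu, Ymax=1000.0, n=4000):
    """Marginal probability density of equation 9.54:
    (3/((1+nu)(1-2nu))) * integral_0^inf dY q(Y, nu)/Y, by the midpoint
    rule, truncated at Ymax where q is already negligible."""
    h = Ymax / n
    integral = sum(q((i + 0.5) * h, nu) / ((i + 0.5) * h)
                   for i in range(n)) * h
    return 3.0 / ((1.0 + nu) * (1.0 - 2.0 * nu)) * integral

print(q_bar_nu(0.25))
```

The prefactor 3/((1+ν)(1−2ν)) is the ν-dependent part of √det g of equation 9.49, which is what turns the volumetric probability into a density before marginalizing.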
9.1.1.10 Direct Evaluation Using the Young Modulus and Poisson Ratio

Rather than deducing the volumetric probability for {Y, ν} from that of {γ, ϕ}, we could redo all the computations using directly {Y, ν} as parameters; the only major difference is that the metric matrix 9.48 replaces that in equation 9.33. I leave this as an exercise for the reader.

9.2 Prediction of Observations

This is the typical prediction problem in physics: any serious physical theory has to be able to make
predictions (that may be confronted with experiments). An engineer, for instance, may wish to predict the load at which a given bridge may collapse, or an astrophysicist may wish to predict the flux of neutrinos from the Sun. In these situations, the parameters defining the system (the bridge or the Sun) may be known with some uncertainties, and these uncertainties will be reflected as an uncertainty in the prediction.
Note: I could use here a notation like

    d = d(p)   (9.56)

or like

    d = d(m) .   (9.57)

9.3 Appendixes

9.3.1 Appendix: Mass Calibration

Note: I take this problem from Measurement Uncertainty and the Propagation of Distributions,
by Cox and Harris, 10th International Metrology Congress, 2001.
When two bodies, with masses mW and mR, equilibrate in a balance that operates in air of density a, one has (taking into account Archimedes’ buoyancy)

    (1 − a/ρW) mW = (1 − a/ρR) mR ,   (9.58)

where ρW and ρR are the two volumetric masses of the bodies.
Given a body with mass m, and volumetric mass ρ, it is a common practice in metrology to define its ‘conventional mass’, denoted m0, as the mass of a (hypothetical) body of conventional density ρ0 = 8000 kg/m³ in air of conventional density a0 = 1.2 kg/m³. The equation above then gives the relation

    (1 − a0/ρ0) m0 = (1 − a0/ρ) m .   (9.59)

In terms of conventional masses, equation 9.58 becomes

    ( (ρW − a)/(ρW − a0) ) mW,0 = ( (ρR − a)/(ρR − a0) ) mR,0 .   (9.60)

To evaluate the mass mW,0 of a body one puts a mass mR,0 in the other arm, and selects the (typically small) mass δmR,0 (with the same volumetric mass as mR,0) that equilibrates the balance. Replacing mR,0 by mR,0 + δmR,0 in the equation above, and solving for mW,0, gives

    mW,0 = ( (ρR − a)(ρW − a0) / ( (ρW − a)(ρR − a0) ) ) (mR,0 + δmR,0) .   (9.61)

The knowledge of the five quantities {mR,0, δmR,0, a, ρW, ρR} allows, via equation 9.61, to
evaluate mW,0. Assume that a measurement of these five quantities has provided the information represented by the probability density f(mR,0, δmR,0, a, ρW, ρR). What is the probability density induced on the quantity mW,0 by equation 9.61?
This is just a special case of the transport of probabilities considered in section 2.6.0.3, so we can directly apply here the results of that section. In the five-dimensional ‘measurement space’ over which the variables {mR,0, δmR,0, a, ρW, ρR} can be considered as coordinates, we can change to the variables {mW,0, δmR,0, a, ρW, ρR}, this defining the matrix K of partial derivatives (see equation 2.192). One easily arrives at the simple result

    √det(K Kᵗ) = (ρR − a)(ρW − a0) / ( (ρW − a)(ρR − a0) ) .   (9.62)

Because of the change of variables used, we shall also need to express mR,0 as a function of {mW,0, δmR,0, a, ρW, ρR}. From equation 9.61 one immediately obtains

    mR,0 = ( (ρW − a)(ρR − a0) / ( (ρR − a)(ρW − a0) ) ) mW,0 − δmR,0 .   (9.63)

Equation 2.206 gives the probability density for mW,0:

    g(mW,0) = ∫ dδmR,0 da dρW dρR ( (ρW − a)(ρR − a0) / ( (ρR − a)(ρW − a0) ) ) f(mR,0, δmR,0, a, ρW, ρR) ,   (9.64)

where in f(mR,0, δmR,0, a, ρW, ρR) one has to replace the variable mR,0 by its expression as a function of the other five variables, as given by equation 9.63.
Given the probability density f(mR,0, δmR,0, a, ρW, ρR) representing the information obtained through the measurement act, one can try an analytic integration (provided the probability density f has an analytical expression, or can be approximated by one). More generally, the probability density f can be sampled using the Monte Carlo methods described in section XXX.
This is, in fact, quite trivial here. Let us denote r = {mR,0, δmR,0, a, ρW, ρR} and s = mW,0. Then the relation 9.61 can be written formally as s = s(r). One just needs to sample f(r) to obtain points r1, r2, .... The points s1 = s(r1), s2 = s(r2), ... are samples of g(s) (because of the very definition of the notion of transport of probabilities).
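This sampling can be sketched as follows (a sketch, not from Cox and Harris; since the text leaves f unspecified, the Gaussian distributions assigned below to the five quantities are illustrative assumptions):

```python
import math, random

random.seed(1)

a0 = 1.2   # conventional air density (kg/m^3)

def m_W0(r):
    """Equation 9.61: conventional mass of the body being weighed."""
    mR0, dmR0, a, rhoW, rhoR = r
    return ((rhoR - a) * (rhoW - a0)) / ((rhoW - a) * (rhoR - a0)) * (mR0 + dmR0)

def sample_r():
    """One draw of r = {m_R0, dm_R0, a, rho_W, rho_R}. The means and
    standard deviations are illustrative assumptions."""
    return (random.gauss(0.100, 1e-6),     # m_R0  (kg)
            random.gauss(1e-6, 1e-8),      # dm_R0 (kg)
            random.gauss(1.2, 0.01),       # a     (kg/m^3)
            random.gauss(8000.0, 5.0),     # rho_W (kg/m^3)
            random.gauss(8000.0, 5.0))     # rho_R (kg/m^3)

# samples of s = m_W0: transport of probabilities by direct evaluation
samples = [m_W0(sample_r()) for _ in range(10000)]
mean = sum(samples) / len(samples)
sd = math.sqrt(sum((x - mean) ** 2 for x in samples) / (len(samples) - 1))
print(mean, sd)
```

A histogram of the list `samples` is then an estimate of g(mW,0); no Jacobian factor is needed, precisely because the transport is done by direct evaluation of s = s(r).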