Thus we have
\[
\{\Sigma \mid y_1,\ldots,y_n,\theta\} \sim \text{inverse-Wishart}\!\left(\nu_0+n,\,[S_0+S_\theta]^{-1}\right). \tag{7.9}
\]
Hopefully this result seems somewhat intuitive: We can think of $\nu_0+n$ as the “posterior sample size,” being the sum of the “prior sample size” $\nu_0$ and the data sample size. Similarly, $S_0+S_\theta$ can be thought of as the “prior” residual sum of squares plus the residual sum of squares from the data. Additionally, the conditional expectation of the population covariance matrix is
\[
\begin{aligned}
E[\Sigma \mid y_1,\ldots,y_n,\theta] &= \frac{1}{\nu_0+n-p-1}\,(S_0+S_\theta) \\
&= \frac{\nu_0-p-1}{\nu_0+n-p-1}\,\frac{1}{\nu_0-p-1}S_0 \;+\; \frac{n}{\nu_0+n-p-1}\,\frac{1}{n}S_\theta
\end{aligned}
\]
and so the conditional expectation can be seen as a weighted average of the prior expectation and the unbiased estimator. Because it can be shown that $S_\theta/n$ converges to the true population covariance matrix, the posterior expectation of $\Sigma$ is a consistent estimator of the population covariance, even if the true population distribution is not multivariate normal.
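The weighted-average identity above can be checked numerically. The following sketch uses hypothetical values of $\nu_0$, $n$, $p$ and arbitrary positive-definite matrices standing in for $S_0$ and $S_\theta$ (none of these are the text's data), and verifies that the two expressions for the posterior expectation agree.

```python
import numpy as np

# Hypothetical dimensions and prior sample size (for illustration only).
p, nu0, n = 2, 4, 22

rng = np.random.default_rng(0)
A = rng.normal(size=(p, p)); S0 = A @ A.T       # arbitrary "prior" sum of squares
B = rng.normal(size=(p, p)); S_theta = B @ B.T  # arbitrary "data" sum of squares

# Posterior expectation of Sigma, written directly ...
post_mean = (S0 + S_theta) / (nu0 + n - p - 1)

# ... and as a weighted average of the prior expectation S0/(nu0 - p - 1)
# and the unbiased estimator S_theta/n.
w_prior = (nu0 - p - 1) / (nu0 + n - p - 1)
w_data = n / (nu0 + n - p - 1)
weighted = w_prior * S0 / (nu0 - p - 1) + w_data * S_theta / n

assert np.allclose(post_mean, weighted)
```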
7.4 Gibbs sampling of the mean and covariance
In the last two sections we showed that
\[
\begin{aligned}
\{\theta \mid y_1,\ldots,y_n,\Sigma\} &\sim \text{multivariate normal}(\mu_n,\Lambda_n) \\
\{\Sigma \mid y_1,\ldots,y_n,\theta\} &\sim \text{inverse-Wishart}(\nu_n,S_n^{-1}),
\end{aligned}
\]
where $\{\Lambda_n,\mu_n\}$ are defined in Equations 7.4 and 7.5, $\nu_n = \nu_0+n$ and $S_n = S_0+S_\theta$. These full conditional distributions can be used to construct a Gibbs sampler, providing us with an MCMC approximation to the joint posterior distribution $p(\theta,\Sigma \mid y_1,\ldots,y_n)$. Given a starting value $\Sigma^{(0)}$, the Gibbs sampler generates $\{\theta^{(s+1)},\Sigma^{(s+1)}\}$ from $\{\theta^{(s)},\Sigma^{(s)}\}$ via the following two steps:
1. Sample $\theta^{(s+1)}$ from its full conditional distribution:
   a) compute $\mu_n$ and $\Lambda_n$ from $y_1,\ldots,y_n$ and $\Sigma^{(s)}$;
   b) sample $\theta^{(s+1)} \sim \text{multivariate normal}(\mu_n,\Lambda_n)$.
2. Sample $\Sigma^{(s+1)}$ from its full conditional distribution:
   a) compute $S_n$ from $y_1,\ldots,y_n$ and $\theta^{(s+1)}$;
   b) sample $\Sigma^{(s+1)} \sim \text{inverse-Wishart}(\nu_0+n,S_n^{-1})$.
Steps 1.a and 2.a highlight the fact that $\{\mu_n,\Lambda_n\}$ depend on the value of $\Sigma$, and that $S_n$ depends on the value of $\theta$, and so these quantities need to be recalculated at every iteration of the sampler.
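A minimal Python sketch of this sampler follows. It assumes a data matrix `Y` with one row per observation, and uses the conjugate updates $\Lambda_n = (\Lambda_0^{-1} + n\Sigma^{-1})^{-1}$ and $\mu_n = \Lambda_n(\Lambda_0^{-1}\mu_0 + n\Sigma^{-1}\bar{y})$ for the mean's full conditional (the quantities referred to in Equations 7.4 and 7.5). Note that SciPy's `invwishart` is parameterized by the scale matrix $S_n$ itself rather than its inverse.

```python
import numpy as np
from scipy.stats import invwishart

def gibbs_mvn(Y, mu0, L0, nu0, S0, n_iter=5000, seed=0):
    """Gibbs sampler for {theta, Sigma} under the multivariate normal model.

    Y is an (n x p) data matrix; mu0, L0 (Lambda_0), nu0, S0 are the prior
    hyperparameters described in the text.
    """
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    ybar = Y.mean(axis=0)
    Sigma = np.cov(Y, rowvar=False)          # starting value Sigma^(0)
    iL0 = np.linalg.inv(L0)
    THETA = np.empty((n_iter, p))
    SIGMA = np.empty((n_iter, p, p))
    for s in range(n_iter):
        # Step 1: theta^(s+1) ~ multivariate normal(mu_n, Lambda_n)
        iSig = np.linalg.inv(Sigma)
        Ln = np.linalg.inv(iL0 + n * iSig)                 # Lambda_n
        mun = Ln @ (iL0 @ mu0 + n * (iSig @ ybar))         # mu_n
        theta = rng.multivariate_normal(mun, Ln)
        # Step 2: Sigma^(s+1) ~ inverse-Wishart(nu_0 + n, S_n^{-1})
        resid = Y - theta
        Sn = S0 + resid.T @ resid                          # S_n = S_0 + S_theta
        Sigma = invwishart.rvs(df=nu0 + n, scale=Sn, random_state=rng)
        THETA[s] = theta
        SIGMA[s] = Sigma
    return THETA, SIGMA
```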
Example: Reading comprehension
Let’s return to the example from the beginning of the chapter in which each of 22 children was given two reading comprehension exams, one before a certain type of instruction and one after. We’ll model these 22 pairs of scores as i.i.d. samples from a multivariate normal distribution. The exam was designed to give average scores of around 50 out of 100, so $\mu_0 = (50, 50)^T$ would be a good choice for our prior expectation. Since the true mean cannot be below 0 or above 100, it is desirable to use a prior variance for $\theta$ that puts little probability outside of this range. We’ll take the prior variances on $\theta_1$ and $\theta_2$ to be $\lambda^2_{0,1} = \lambda^2_{0,2} = (50/2)^2 = 625$, so that the prior probability $\Pr(\theta_j \notin [0, 100])$ is only 0.05. Finally, since the two exams are measuring similar things, whatever the true values of $\theta_1$ and $\theta_2$ are it is probable that they are close. We can reflect this with a prior correlation of 0.5, so that $\lambda_{1,2} = 312.5$. As for the prior distribution on $\Sigma$, some of the same logic about the range of exam scores applies. We’ll take $S_0$ to be the same as $\Lambda_0$, but only loosely center $\Sigma$ around this value by taking $\nu_0 = p + 2 = 4$.
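In code, this prior specification might look as follows, again under the hypothetical `gibbs_mvn` sketch above and assuming a $22 \times 2$ score matrix `Y` not shown here.

```python
import numpy as np

p = 2
mu0 = np.array([50.0, 50.0])           # prior expectation of theta
L0 = np.array([[625.0, 312.5],
               [312.5, 625.0]])        # Lambda_0: variances 25^2, correlation 0.5
nu0 = p + 2                            # = 4, only loosely centering Sigma on S0
S0 = L0.copy()                         # S_0 taken equal to Lambda_0

# theta_samps, Sigma_samps = gibbs_mvn(Y, mu0, L0, nu0, S0, n_iter=5000)
```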