March 14, 2003
CHAPTER 3. MAXIMUM LIKELIHOOD AND M-ESTIMATION
3.1 Maximum likelihood estimates — in exponential families.
Let $(X, \mathcal{B})$ be a measurable space and $\{P_\theta,\ \theta \in \Theta\}$ a measurable family of laws on $(X, \mathcal{B})$, dominated by a $\sigma$-finite measure $v$. Let $f(\theta, x)$ be a jointly measurable version of the density $(dP_\theta/dv)(x)$ by Theorem 1.3.3.
For each $x \in X$, a maximum likelihood estimate (MLE) of $\theta$ is any $\hat{\theta} = \hat{\theta}(x)$ such that $f(\hat{\theta}, x) = \sup\{f(\phi, x) : \phi \in \Theta\}$. In other words, $\hat{\theta}(x)$ is a point at which $f(\cdot, x)$ attains its maximum. In general, the supremum may not be attained, or it may be attained at more than one point. If it is attained at a unique point $\hat{\theta}$, then $\hat{\theta}$ is called the maximum likelihood estimate of $\theta$.
A measurable function $\hat{\theta}(\cdot)$ defined on a measurable subset $B$ of $X$ is called a maximum likelihood estimator if for all $x \in B$, $\hat{\theta}(x)$ is a maximum likelihood estimate of $\theta$, and for $v$-almost all $x$ not in $B$, the supremum of $f(\cdot, x)$ is not attained at any point.
Examples. (i) For each $\theta > 0$ let $P_\theta$ be the uniform distribution on $[0, \theta]$, with $f(\theta, x) := 1_{[0,\theta]}(x)/\theta$ for all $x$. Then if $X_1, \ldots, X_n$ are observed, i.i.d. ($P_\theta$), the MLE of $\theta$ is $X_{(n)} := \max(X_1, \ldots, X_n)$. Note however that if the density had been defined as $1_{[0,\theta)}(x)/\theta$, its supremum for given $X_1, \ldots, X_n$ would not be attained at any $\theta$. The MLE of $\theta$ is the smallest possible value of $\theta$ given the data, so it is not a very reasonable estimate in some ways. For example, it is not Bayes admissible.
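The uniform example can be checked numerically. The following is a minimal sketch, not part of the original notes: the function names `uniform_likelihood` and `uniform_mle` and the sample values are illustrative assumptions. For data in $[0, \theta]$ the joint density is $\theta^{-n}$, which decreases in $\theta$, so the maximum over admissible $\theta$ sits at the smallest admissible value, $X_{(n)}$.

```python
def uniform_likelihood(theta, xs):
    """Joint density of i.i.d. U[0, theta] observations (closed interval [0, theta])."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    # The indicator 1_{[0, theta]}(x) vanishes if any observation falls outside [0, theta].
    if any(x < 0 or x > theta for x in xs):
        return 0.0
    return theta ** (-len(xs))


def uniform_mle(xs):
    """MLE of theta for the closed-interval density: the sample maximum X_(n)."""
    return max(xs)


# Hypothetical sample, chosen only for illustration.
xs = [0.8, 2.5, 1.7, 0.3]
theta_hat = uniform_mle(xs)

# Compare the likelihood at theta_hat with its values on a coarse grid of candidates.
grid = [0.1 * k for k in range(1, 101)]
grid_max = max(uniform_likelihood(t, xs) for t in grid)
print(theta_hat, uniform_likelihood(theta_hat, xs) >= grid_max)
```

Any candidate below the sample maximum has likelihood 0, and any candidate above it has likelihood strictly smaller than at the sample maximum, which is why the estimate is the smallest value of $\theta$ compatible with the data.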
(ii). For $P_\theta = N(\theta, 1)^n$ on $\mathbb{R}^n$, with usual densities, the sample mean $\bar{X}$ is the MLE of $\theta$. For $N(0, \sigma^2)^n$, $\sigma > 0$, the MLE of $\sigma^2$ is $\sum_{j=1}^n X_j^2/n$. For $N(m, \sigma^2)^n$, $n \ge 2$, the MLE of $(m, \sigma^2)$ is $\left(\bar{X},\ \sum_{j=1}^n (X_j - \bar{X})^2/n\right)$. Here recall that the usual, unbiased estimator of $\sigma^2$ has $n - 1$ in place of $n$, so that the MLE is biased, although the bias is small: it equals $-\sigma^2/n$, so it is of order $1/n$ as $n \to \infty$. The MLE of $\sigma^2$ fails to exist (or equals 0, if 0 were allowed as a value of $\sigma^2$) exactly on the event that all $X_j$ are equal for $j \le n$, which happens for $n = 1$, but only with probability 0 for $n \ge 2$. On this event, $f((\bar{X}, \sigma^2), x) \to +\infty$ as $\sigma \downarrow 0$.
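The normal case can be illustrated with a short computation. This is a sketch under stated assumptions, not code from the notes: the helper names `normal_mle` and `normal_log_likelihood` and both samples are hypothetical. It computes the MLE $(\bar{X}, \sum_j (X_j - \bar{X})^2/n)$, compares it with the unbiased variance estimator, and shows the log-likelihood growing without bound as the variance shrinks when all observations coincide.

```python
import math


def normal_mle(xs):
    """MLE of (m, sigma^2) for i.i.d. N(m, sigma^2) data: sample mean and divide-by-n variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, var


def normal_log_likelihood(m, var, xs):
    """Log of the joint N(m, var)^n density at xs; requires var > 0."""
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi * var) - sum((x - m) ** 2 for x in xs) / (2 * var)


# Hypothetical sample, chosen only for illustration.
xs = [1.2, -0.4, 0.9, 2.1, 0.2]
m_hat, var_hat = normal_mle(xs)

# The unbiased estimator rescales the MLE by n / (n - 1).
unbiased = var_hat * len(xs) / (len(xs) - 1)

# Degenerate event: all observations equal, so the likelihood is unbounded as var -> 0.
degenerate = [3.0, 3.0, 3.0]
degenerate_values = [normal_log_likelihood(3.0, v, degenerate) for v in (1.0, 0.1, 0.01)]
print(m_hat, var_hat, unbiased, degenerate_values)
```

On the degenerate sample the sum of squared deviations is 0, so the log-likelihood reduces to $-(n/2)\log(2\pi\sigma^2)$, which increases to $+\infty$ as $\sigma \downarrow 0$.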
In general, let $\Theta$ be an open subset of $\mathbb{R}^k$ and suppose $f(\theta, x)$ has first partial derivatives with respect to $\theta_j$ for $j = 1, \ldots, k$, forming the gradient vector $\nabla_\theta f(\theta, x) := \{\partial f(\theta, x)/\partial \theta_j\}_{j=1}^k$.