Review of Probability Theory
Arian Maleki and Tom Do
Stanford University
Probability theory is the study of uncertainty. Throughout this class, we will rely on concepts
from probability theory to derive machine learning algorithms. These notes attempt to cover the
basics of probability theory at a level appropriate for CS 229. The mathematical theory of probability
is very sophisticated and delves into a branch of analysis known as measure theory. In these notes,
we provide a basic treatment of probability that does not address these finer details.
1 Elements of probability
In order to define a probability on a set, we need a few basic elements:
• Sample space $\Omega$: The set of all the outcomes of a random experiment. Here, each outcome
  $\omega \in \Omega$ can be thought of as a complete description of the state of the real world at
  the end of the experiment.
• Set of events (or event space) $\mathcal{F}$: A set whose elements $A \in \mathcal{F}$ (called
  events) are subsets of $\Omega$ (i.e., $A \subseteq \Omega$ is a collection of possible outcomes
  of an experiment).¹
• Probability measure: A function $P : \mathcal{F} \to \mathbb{R}$ that satisfies the following
  properties:
  - $P(A) \geq 0$, for all $A \in \mathcal{F}$
  - $P(\Omega) = 1$
  - If $A_1, A_2, \ldots$ are disjoint events (i.e., $A_i \cap A_j = \emptyset$ whenever
    $i \neq j$), then $P(\cup_i A_i) = \sum_i P(A_i)$
These three properties are called the Axioms of Probability.
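As a small worked consequence of the axioms, the following sketch shows why any probability measure must assign $P(\emptyset) = 0$. The particular choice of sequence ($A_1 = \Omega$ and $A_i = \emptyset$ for $i \geq 2$) is an assumption made here for illustration; it is not spelled out in the notes.

```latex
% Choose A_1 = \Omega and A_i = \emptyset for i \ge 2. These events are pairwise
% disjoint and their union is \Omega, so the third axiom applies.
\begin{align*}
P(\Omega) &= P\Big(\bigcup_{i} A_i\Big) = \sum_{i} P(A_i)
           = P(\Omega) + \sum_{i \geq 2} P(\emptyset) \\
\Longrightarrow \quad \sum_{i \geq 2} P(\emptyset) &= 0
           \quad \text{(since } P(\Omega) = 1 \text{ is finite)} \\
\Longrightarrow \quad P(\emptyset) &= 0
           \quad \text{(since } P(\emptyset) \geq 0 \text{ by the first axiom)}.
\end{align*}
```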
Example: Consider tossing a six-sided die. The sample space is $\Omega = \{1, 2, 3, 4, 5, 6\}$.
We can define different event spaces on this sample space. For example, the simplest event space
is the trivial event space $\mathcal{F} = \{\emptyset, \Omega\}$. Another event space is the set of
all subsets of $\Omega$. For the first event space, the unique probability measure satisfying the
requirements above is given by $P(\emptyset) = 0$, $P(\Omega) = 1$. For the second event space, one
valid probability measure is to assign each set in the event space the probability $\frac{i}{6}$,
where $i$ is the number of elements of that set; for example, $P(\{1, 2, 3, 4\}) = \frac{4}{6}$ and
$P(\{1, 2, 3\}) = \frac{3}{6}$.
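The die example can be checked numerically. The sketch below is a minimal illustration, not part of the original notes: the names `OMEGA`, `power_set`, and `uniform_measure` are hypothetical helpers. It builds the event space of all subsets of $\Omega$ and checks the three axioms for the measure $P(A) = |A|/6$. On a finite sample space, checking additivity on disjoint pairs is assumed to be sufficient, since finite additivity then follows by induction.

```python
from fractions import Fraction
from itertools import chain, combinations

# Sample space for one toss of a six-sided die.
OMEGA = frozenset({1, 2, 3, 4, 5, 6})


def power_set(omega):
    """Return every subset of omega as a frozenset (the largest event space)."""
    items = sorted(omega)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(s) for s in subsets}


def uniform_measure(event, omega=OMEGA):
    """P(A) = |A| / |Omega|, kept exact by using Fraction."""
    return Fraction(len(event), len(omega))


events = power_set(OMEGA)

# Axiom 1: non-negativity for every event.
assert all(uniform_measure(a) >= 0 for a in events)

# Axiom 2: the whole sample space has probability 1.
assert uniform_measure(OMEGA) == 1

# Axiom 3 (finite case): additivity over every pair of disjoint events.
assert all(
    uniform_measure(a | b) == uniform_measure(a) + uniform_measure(b)
    for a in events
    for b in events
    if not (a & b)
)

print(len(events))                               # 64
print(uniform_measure(frozenset({1, 2, 3, 4})))  # 2/3
```

`Fraction` reduces $\frac{4}{6}$ to $\frac{2}{3}$ automatically, which is why the printed value differs in form from the one in the text while being equal to it.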
Properties:
  - If $A \subseteq B$, then $P(A) \leq P(B)$.
  - $P(A \cap B) \leq \min(P(A), P(B))$.
  - (Union Bound) $P(A \cup B) \leq P(A) + P(B)$.
  - $P(\Omega \setminus A) = 1 - P(A)$.
  - (Law of Total Probability) If $A_1, \ldots, A_k$ are a set of disjoint events such that
    $\cup_{i=1}^{k} A_i = \Omega$, then $\sum_{i=1}^{k} P(A_i) = 1$.
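These properties can be spot-checked by brute force on the six-sided die. The sketch below is illustrative only: `prob`, `all_events`, and `check_properties` are hypothetical names, the uniform measure $P(A) = |A|/6$ is assumed, and the partition used for the last property is one arbitrary choice. A passing check on one finite example is not a proof.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = frozenset(range(1, 7))  # outcomes of a six-sided die


def prob(event):
    """Uniform measure P(A) = |A| / |Omega| (an assumption for this sketch)."""
    return Fraction(len(event), len(OMEGA))


def all_events():
    """Every subset of OMEGA, i.e. the largest event space."""
    items = sorted(OMEGA)
    return [
        frozenset(c)
        for r in range(len(items) + 1)
        for c in combinations(items, r)
    ]


def check_properties():
    """Return True if all five listed properties hold for the uniform die measure."""
    events = all_events()
    for a in events:
        # Complement rule: P(Omega \ A) = 1 - P(A).
        if prob(OMEGA - a) != 1 - prob(a):
            return False
        for b in events:
            # Monotonicity: A subset of B implies P(A) <= P(B).
            if a <= b and prob(a) > prob(b):
                return False
            # Intersection bound: P(A and B) <= min(P(A), P(B)).
            if prob(a & b) > min(prob(a), prob(b)):
                return False
            # Union bound: P(A or B) <= P(A) + P(B).
            if prob(a | b) > prob(a) + prob(b):
                return False
    # Law of total probability, checked on one arbitrary partition of OMEGA.
    partition = [frozenset({1, 2}), frozenset({3}), frozenset({4, 5, 6})]
    return sum(prob(a) for a in partition) == 1


print(check_properties())  # True
```

For instance, with $A = \{1, 2, 3\}$ and $B = \{3, 4\}$ the union bound is strict: $P(A \cup B) = \frac{4}{6}$ while $P(A) + P(B) = \frac{5}{6}$, because the shared outcome 3 is counted twice on the right-hand side.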
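¹ $\mathcal{F}$ should satisfy three properties: (1) $\emptyset \in \mathcal{F}$; (2) $A \in \mathcal{F} \implies \Omega \setminus A \in \mathcal{F}$; and (3) $A_1, A_2, \ldots \in \mathcal{F} \implies \cup_i A_i \in \mathcal{F}$.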