1 Introduction to Information Theory
Information must not be confused with meaning: “The semantic aspects of communications are irrelevant to the engineering aspects” [Sh48].
Information is a measure of one’s freedom of choice and is measured by the logarithm of the number of choices.
Tossing a coin gives two choices. If the logarithm is taken to base 2, the unit of information is called a “bit”. Each doubling of the number of choices adds one bit of information. Thus 4, 8 and 16 choices lead to 2, 3 and 4 bits of information, respectively. In general, if you have $N$ choices, the information content of the situation is $\lceil \log_2 N \rceil$, which is the number of binary digits needed to label each of the $N$ choices.
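As a rough illustration of the bit counts above, the following Python sketch computes the ceiling of the base-2 logarithm for a given number of choices. The function name `bits_for_choices` is hypothetical, chosen here for illustration only; the integer method `bit_length` is used instead of a floating-point logarithm to avoid rounding issues.

```python
def bits_for_choices(n: int) -> int:
    """Binary digits needed to label one of n choices (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # For n >= 1, (n - 1).bit_length() equals ceil(log2(n)).
    return (n - 1).bit_length()


if __name__ == "__main__":
    # The doubling pattern from the text, plus one N that is not a power of 2.
    for n in (2, 4, 8, 16, 5):
        print(n, "choices ->", bits_for_choices(n), "bits")
```

When $N$ is not a power of 2 (for example $N = 5$), the ceiling rounds up to the next whole digit, so three binary digits are needed.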
The above situation can be captured using probability. Assuming the $N$ events are independent and equally probable, the probability of the $i$-th event ($1 \le i \le N$) is $p_i = 1/N$, and the amount of information associated with the occurrence of this event, its self-information, is given by $-\log p_i$ (in bits when the logarithm is taken to base 2).
If $p_i = 1$, the information is zero (certainty), and if $p_i = 0$, it is infinite; if $p_i = 0.5$, it is one bit, corresponding to $N = 2$. If $N = 4$, then $p_i = 0.25$ and the information is 2 bits, and so on.
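The special cases listed above can be checked with a small Python sketch. The function name `self_information` is an assumption made for illustration, and base 2 is assumed so that the result is in bits.

```python
import math


def self_information(p: float) -> float:
    """Self-information -log2(p), in bits, of an event with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 0.0:
        # An impossible event: the self-information is unbounded.
        return math.inf
    # log2(1/p) equals -log2(p) and avoids returning -0.0 when p == 1.
    return math.log2(1.0 / p)


if __name__ == "__main__":
    for p in (1.0, 0.5, 0.25, 0.0):
        print(f"p = {p}: {self_information(p)} bits")
```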
Note that in the case of tossing a coin there are two possible events: head or tail. If you consider the tossing of the coin to be an “experiment”, the question is: how much total information will this experiment have? This can be quantified if we can describe the outcome of the experiment in some reasonable fashion. Let’s “encode” the outcome ‘head’ by the bit 1 and the outcome ‘tail’ by the bit 0. Thus, a minimal description of this experiment needs only one bit. Note that the experiment is the sum total of all the events. If we take the self-information of each event, multiply it by the event’s probability and sum over all the events, intuitively that gives a measure of the information content, or average information, of the experiment. It just so happens that this quantity is also exactly one bit for the coin toss, since the probability of either head or tail is 0.5 and the self-information of each event is 1 bit. This 1 bit also expresses how uncertain we are of the outcome.
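Written out for the coin toss, the averaging step looks as follows. The subscripts "head" and "tail" are notation assumed here for illustration, and the logarithm is taken to base 2 so that the result is in bits.

```latex
\begin{align*}
\text{average information}
  &= p_{\text{head}} \left(-\log_2 p_{\text{head}}\right)
   + p_{\text{tail}} \left(-\log_2 p_{\text{tail}}\right) \\
  &= 0.5 \cdot 1 + 0.5 \cdot 1 \\
  &= 1 \text{ bit}
\end{align*}
```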
How do you generalize the definition? Suppose we have a set of $N$ events whose probabilities of occurrence are $p_1, p_2, \ldots, p_N$. Can we measure how much “choice” is involved, or how uncertain we are of the outcome? Such a measure is precisely the entropy of the experiment or “source”, denoted $H(p_1, p_2, \ldots, p_N)$. [More precisely, it is called the first-order entropy. Higher-order entropies depend on contextual information. The true entropy is the infinite-order entropy. But, by popular use, entropy most often refers to first-order entropy unless stated otherwise. Read the discussion in Sayood, pp. 14-16.]
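A minimal Python sketch of the first-order entropy follows. It assumes the measure is the probability-weighted sum of self-informations described earlier for the coin toss, with logarithms to base 2. The function name `entropy` and the convention of skipping zero-probability events (treating $0 \log 0$ as 0) are assumptions made for this illustration.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """First-order entropy in bits: sum over i of p_i * (-log2 p_i)."""
    if any(p < 0.0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Events with p == 0 are skipped (assumed convention: 0 * log 0 = 0).
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0.0)


if __name__ == "__main__":
    print(entropy([0.5, 0.5]))    # fair coin
    print(entropy([0.25] * 4))    # four equally probable events
    print(entropy([0.9, 0.1]))    # biased coin
```

For equally probable events the result reduces to the logarithm of the number of choices, and a biased coin gives less than one bit, reflecting the lower uncertainty about its outcome.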