6.896 Sublinear Time Algorithms
March 1, 2007
Lecture 8
Lecturer: Ronitt Rubinfeld
Scribe: Jacob Scott
1  Huffman Coding and Entropy
Consider a string $w = w_1 w_2 \cdots w_m$ on an alphabet $A = \{a_1, a_2, \ldots, a_n\}$. We will now be considering our data as fixed, as opposed to being generated from a probability distribution as in previous lectures. Thus, we can consider the frequency of each letter in the alphabet, $p = \{p_1, p_2, \ldots, p_n\}$. We can now define a code $C = \{c_1, c_2, \ldots, c_n\}$ such that $c_i$ is the "code word" for $a_i$. The following coding algorithm encodes $w$:
Coding Algorithm
    scan $w$ left to right: if $w_i = a_j$, write $c_j$
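As a concrete illustration, here is a minimal Python sketch of this encoder (not from the lecture); the dictionary `code` standing in for $C$ and the sample code words are hypothetical.

```python
def encode(w, code):
    # Scan w left to right; for each letter a_j, write its code word c_j.
    return "".join(code[ch] for ch in w)

# Hypothetical prefix-free code C on the alphabet A = {a, b, c}:
code = {"a": "0", "b": "10", "c": "11"}
print(encode("abca", code))  # -> 010110
```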
Choice of code

We would like to pick variable lengths for the $c_i$'s to minimize
$$L(C) = \sum_i p_i |c_i|,$$
which can be considered the expected length of a letter $a_i$ drawn from $p$ and written as $c_i$.
Shannon's Source Coding Theorem relates this quantity to entropy as follows:
$$L(C) \geq H(p).$$
Huffman codes achieve this bound when for all $i$ there is an integer $j_i$ such that $p_i = 2^{-j_i}$.
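As a quick numeric check (a sketch, not from the lecture), take the hypothetical dyadic distribution $p = (1/2, 1/4, 1/8, 1/8)$: every $p_i$ is a power of two, a Huffman code assigns code-word lengths $(1, 2, 3, 3)$, and $L(C) = H(p) = 1.75$ bits.

```python
import math

p = [1/2, 1/4, 1/8, 1/8]        # dyadic: each p_i = 2^{-j_i}
lengths = [1, 2, 3, 3]          # code-word lengths a Huffman code assigns here

H = -sum(pi * math.log2(pi) for pi in p)        # entropy H(p)
L = sum(pi * li for pi, li in zip(p, lengths))  # expected length L(C)
print(H, L)  # both 1.75, so Shannon's bound holds with equality
```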
Some examples of distributions and their entropies are:

1. $H(U_n) = \log n$, where $U_n$ is the uniform distribution on $n$ letters.

2. $H(p_1 = 1, p_{i > 1} = 0) = 0$.

3. If $p_1 = 1/2$ and $p_2 = p_3 = \cdots = p_n = \frac{1}{2(n-1)}$:
$$L(C) \geq H(p) = -\left( \frac{1}{2} \log \frac{1}{2} + (n-1) \cdot \frac{1}{2(n-1)} \log \frac{1}{2(n-1)} \right) = \log 2 + \frac{1}{2} \log (n-1).$$
This is approximately half of the entropy of the uniform distribution.

4. If $p_i = 1/2^i$ (taking $n \to \infty$): $H(p) = \sum_{i \geq 1} i \cdot 2^{-i} = 2$.

5. If $p_1 = p_2 = \cdots = p_l = \frac{1}{l}$ and $p_{l+1} = \cdots = p_n = 0$: $H(p) = \log l$.
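These values can be checked numerically; below is a small sketch (my own, with assumed values $n = 16$ and $l = 8$) applying the definition $H(p) = -\sum_i p_i \log p_i$ with the convention $0 \log 0 = 0$.

```python
import math

def entropy(p):
    # H(p) = -sum_i p_i log2(p_i), with the convention 0 * log 0 = 0.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

n, l = 16, 8
print(entropy([1/n] * n))                        # example 1: log n = 4
print(entropy([1.0] + [0.0] * (n - 1)))          # example 2: 0
print(entropy([1/2] + [1/(2*(n-1))] * (n - 1)))  # example 3: log 2 + (1/2) log(n-1)
print(1 + 0.5 * math.log2(n - 1))                #   ... same value, ~2.95
print(entropy([2.0**-i for i in range(1, 60)]))  # example 4: ~2
print(entropy([1/l] * l + [0.0] * (n - l)))      # example 5: log l = 3
```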
2  Distinct Colors
Before moving on to talk about Lempel-Ziv compression, we will explore the following question: how many distinct letters are there in a string? This problem arises in many areas, for example in the study of
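As a point of reference (a sketch of my own, not the lecture's method), the exact count takes one linear scan over $w$; sublinear-time algorithms aim to approximate it while reading far fewer letters.

```python
def distinct_letters(w):
    # Exact baseline: one linear pass, collecting the set of letters seen.
    return len(set(w))

print(distinct_letters("abracadabra"))  # -> 5 (a, b, c, d, r)
```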