4.8 Huffman Codes
These lecture slides are supplied by Mathijs de Weerd
Data Compression
Q.
Given a text that uses 32 symbols (26 different letters, space, and
some punctuation characters), how can we encode this text in bits?
A.
We can encode 2^5 = 32 different symbols using a fixed length of 5 bits per
symbol. This is called fixed-length encoding.
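A minimal sketch of fixed-length encoding, assuming a hypothetical 32-symbol alphabet (the exact punctuation marks are not given in the slides, so the ones below are placeholders). The names ALPHABET, fixed_length_bits and encode_fixed are illustrative only.

```python
import string

# Hypothetical alphabet: 26 letters, space, and five punctuation marks (32 symbols).
ALPHABET = string.ascii_lowercase + " .,!?;"


def fixed_length_bits(num_symbols):
    """Smallest number of bits b such that 2**b >= num_symbols."""
    return max(1, (num_symbols - 1).bit_length())


def encode_fixed(text, alphabet=ALPHABET):
    """Encode each symbol as its index in the alphabet, padded to a fixed width."""
    width = fixed_length_bits(len(alphabet))
    return "".join(format(alphabet.index(ch), f"0{width}b") for ch in text)


print(fixed_length_bits(len(ALPHABET)))  # 5
print(encode_fixed("abe"))               # 000000000100100
```

Because every codeword has the same width, the decoder can simply cut the bit string into blocks of 5 bits; no separator is needed.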
Q.
Some symbols (e, t, a, o, i, n) are used far more often than others.
How can we use this to reduce our encoding?
A.
Encode these characters with fewer bits, and the others with more bits.
Q.
How do we know when the next symbol begins?
A.
Use a separation symbol (like the pause in Morse), or make sure that
there is no ambiguity by ensuring that no code is a prefix of another one.
Ex.
c(a) = 01
c(b) = 010
c(e) = 1
What is 0101?
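The ambiguity in this example can be checked by listing every way to split 0101 into codewords. The sketch below uses a hypothetical helper, all_decodings, that tries each codeword as a possible start of the remaining bits.

```python
def all_decodings(bits, code):
    """Return every symbol sequence whose encoding equals `bits`."""
    if not bits:
        return [""]
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            # Use this codeword, then decode the remainder in every possible way.
            for rest in all_decodings(bits[len(word):], code):
                results.append(symbol + rest)
    return results


AMBIGUOUS_CODE = {"a": "01", "b": "010", "e": "1"}
print(all_decodings("0101", AMBIGUOUS_CODE))  # ['aa', 'be']
```

Two readings exist because c(a) = 01 is a prefix of c(b) = 010, so the decoder cannot tell where the first symbol ends.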
Prefix Codes
Definition.
A prefix code for a set S is a function c that maps each x ∈ S to a string
of 1s and 0s in such a way that for all x, y ∈ S with x ≠ y, c(x) is not a
prefix of c(y).
Ex.
c(a) = 11
c(e) = 01
c(k) = 001
c(l) = 10
c(u) = 000
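The definition can be tested directly by comparing every ordered pair of codewords. This is a minimal sketch; is_prefix_code is an illustrative name, and the second dictionary is the earlier example with c(a) = 01, c(b) = 010, c(e) = 1.

```python
from itertools import permutations


def is_prefix_code(code):
    """True if no codeword is a prefix of the codeword of a different symbol."""
    return not any(
        longer.startswith(shorter)
        for shorter, longer in permutations(code.values(), 2)
    )


EXAMPLE_CODE = {"a": "11", "e": "01", "k": "001", "l": "10", "u": "000"}
AMBIGUOUS_CODE = {"a": "01", "b": "010", "e": "1"}

print(is_prefix_code(EXAMPLE_CODE))    # True
print(is_prefix_code(AMBIGUOUS_CODE))  # False, since 01 is a prefix of 010
```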
Q.
What is the meaning of 1001000001 ?
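With a prefix code, decoding can proceed greedily from left to right: read bits until they match a codeword, output that symbol, and start again. The sketch below assumes the example code above; decode is an illustrative name.

```python
def decode(bits, code):
    """Decode a bit string under a prefix code by scanning left to right."""
    inverse = {word: symbol for symbol, word in code.items()}
    decoded = []
    current = ""
    for bit in bits:
        current += bit
        if current in inverse:
            # A complete codeword was read; no longer codeword can start with it.
            decoded.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(decoded)


EXAMPLE_CODE = {"a": "11", "e": "01", "k": "001", "l": "10", "u": "000"}
print(decode("1001000001", EXAMPLE_CODE))  # leuk
```

The bit string splits as 10 | 01 | 000 | 001, which reads l, e, u, k.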
Suppose frequencies are known in a text of 1G:
f_a = 0.4, f_e = 0.2, f_k = 0.2, f_l = 0.1, f_u = 0.1
Q.
What is the size of the encoded text?
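One way to answer this is to compute the average number of bits per symbol, the sum over all x of f_x times the length of c(x), and multiply by the text length. The sketch below assumes that "1G" means 10^9 symbols, which the slides do not state explicitly; average_bits is an illustrative name.

```python
EXAMPLE_CODE = {"a": "11", "e": "01", "k": "001", "l": "10", "u": "000"}
FREQUENCIES = {"a": 0.4, "e": 0.2, "k": 0.2, "l": 0.1, "u": 0.1}


def average_bits(code, freq):
    """Expected codeword length: sum of frequency times codeword length."""
    return sum(freq[symbol] * len(word) for symbol, word in code.items())


# Assumption: "1G" is read as 10**9 symbols.
NUM_SYMBOLS = 10**9

avg = average_bits(EXAMPLE_CODE, FREQUENCIES)
print(f"average bits per symbol: {avg:.1f}")           # 2.3
print(f"encoded size in bits: {avg * NUM_SYMBOLS:.2e}")  # 2.30e+09
```

Under that assumption the encoded text takes about 2.3G bits: 0.4·2 + 0.2·2 + 0.2·3 + 0.1·2 + 0.1·3 = 2.3 bits per symbol. A fixed-length code for these five symbols would need 3 bits per symbol, so 3G bits.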