CSE 143, Winter 2010
Programming Assignment #8: Huffman Coding (20 points)
Due Thursday, March 11, 2010, 11:30 PM
No submissions for this assignment will be accepted after Sunday, March 14, at 11:30 PM.
This program focuses on binary trees, priority queues, and input/output.
Turn in files named
from the Homework section of the
web site. You will need support files
, and input files from the course web page.
Huffman coding is an algorithm devised by David A. Huffman of MIT in 1952 for compressing text data to make a file
occupy a smaller number of bytes.
This relatively simple compression algorithm is powerful enough that variations of it
are still used today in computer networks, fax machines, modems, HDTV, and other areas.
Normally text data is stored in a standard format of 8 bits per character, commonly using an encoding called ASCII
(in its extended 8-bit form) that maps every character to a binary integer value from 0 to 255.
The idea of Huffman coding is to abandon the rigid 8-bits-per-character requirement
and use different-length binary encodings for different characters.
The advantage of doing this is
that if a character occurs frequently in the file, such as the letter
, it could be given a shorter encoding (fewer bits),
making the file smaller.
The tradeoff is that some characters may need to use encodings that are longer than 8 bits, but
this is reserved for characters that occur infrequently, so the extra cost is worth it.
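The tradeoff can be checked with a little arithmetic. The sketch below is only an illustration: the class name EncodingCostSketch, its method names, and the counts and code lengths are all made up for the example and do not come from the assignment or its support files.

```java
class EncodingCostSketch {
    // Bits needed when every character costs a fixed 8 bits.
    static long fixedBits(int[] counts) {
        long total = 0;
        for (int count : counts) {
            total += 8L * count;
        }
        return total;
    }

    // Bits needed when character i costs codeLengths[i] bits.
    static long variableBits(int[] counts, int[] codeLengths) {
        long total = 0;
        for (int i = 0; i < counts.length; i++) {
            total += (long) counts[i] * codeLengths[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up numbers: a frequent character seen 1000 times with a
        // 3-bit code, and a rare one seen 10 times with a 12-bit code.
        int[] counts = {1000, 10};
        int[] codeLengths = {3, 12};
        System.out.println(fixedBits(counts));                 // 8080
        System.out.println(variableBits(counts, codeLengths)); // 3120
    }
}
```

Even though the rare character's code is longer than 8 bits in this made-up case, the total shrinks because the frequent character dominates the count.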
The table below compares ASCII values of various characters to possible Huffman encodings for the text of Shakespeare's
Frequent characters such as space and
have short encodings, while rarer ones like
have longer ones.
The steps involved in Huffman coding a given text source file into a destination compressed file are the following:
1. Examine the source file's contents and count the number of occurrences of each character.
2. Place each character and its frequency (count of occurrences) into a sorted "priority" queue.
3. Convert the contents of this priority queue into a binary tree with a particular structure.
4. Traverse the tree to discover the binary encodings of each character.
5. Re-examine the source file's contents, and for each character, output the encoded binary version of that character to the destination file.
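The five steps above can be sketched in Java roughly as follows. This is a minimal sketch, not the required solution: the class HuffmanSketch, its Node type, and all method names are hypothetical, the tree is built with the standard Huffman construction (repeatedly merging the two lowest-count nodes), the "file" is an in-memory String, the output is a String of '0' and '1' characters rather than real bits, and no end-of-file marker is encoded. It also assumes the text contains at least two distinct characters.

```java
import java.util.*;

class HuffmanSketch {
    // Hypothetical node type; the assignment's own node class may differ.
    static class Node implements Comparable<Node> {
        final char ch;          // meaningful only for leaves
        final int count;
        final Node left, right;

        Node(char ch, int count, Node left, Node right) {
            this.ch = ch;
            this.count = count;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }

        // Lower counts come out of the priority queue first.
        public int compareTo(Node other) {
            return Integer.compare(count, other.count);
        }
    }

    // Step 1: count the occurrences of each character.
    static Map<Character, Integer> countChars(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : text.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        return counts;
    }

    // Steps 2 and 3: fill a priority queue with leaves, then merge the two
    // smallest nodes until a single tree remains.
    static Node buildTree(Map<Character, Integer> counts) {
        PriorityQueue<Node> queue = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        while (queue.size() > 1) {
            Node a = queue.remove();
            Node b = queue.remove();
            queue.add(new Node('\0', a.count + b.count, a, b));
        }
        return queue.remove();
    }

    // Step 4: walk the tree; going left appends 0, going right appends 1.
    static void collectCodes(Node node, String path, Map<Character, String> codes) {
        if (node.isLeaf()) {
            codes.put(node.ch, path);
        } else {
            collectCodes(node.left, path + "0", codes);
            collectCodes(node.right, path + "1", codes);
        }
    }

    // Step 5: re-read the text and emit each character's code.
    static String encode(String text) {
        Map<Character, String> codes = new TreeMap<>();
        collectCodes(buildTree(countChars(text)), "", codes);
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            out.append(codes.get(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String encoded = encode("ab ab cab");
        System.out.println(encoded);
        System.out.println(encoded.length() + " bits");
    }
}
```

The exact bit patterns depend on how the priority queue breaks ties between equal counts, so two correct implementations may produce different codes of the same total length.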
Encoding a File:
For example, suppose we have a file named
with the following contents:
ab ab cab
In the original file, this text occupies 10 bytes (80 bits) of data.
The first 9 bytes are the characters shown (including the two spaces); the 10th is a special "end-of-file" (EOF) byte.
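The 80-bit figure can be reproduced with a short sketch that writes each character as its 8-bit pattern. The class name AsciiBitsSketch and the method toBits are hypothetical, the sketch assumes every character has a value below 256, and it follows this handout's convention of counting the EOF marker as one extra byte.

```java
class AsciiBitsSketch {
    // Returns the text as a string of '0'/'1' characters, 8 per character.
    // Assumes every character value is below 256.
    static String toBits(String text) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            String binary = Integer.toBinaryString(c);
            // Left-pad with zeros so each character takes exactly 8 bits.
            for (int i = binary.length(); i < 8; i++) {
                bits.append('0');
            }
            bits.append(binary);
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        String text = "ab ab cab";
        System.out.println(toBits("a"));      // 01100001
        int totalBits = toBits(text).length() + 8; // 9 characters plus the EOF byte
        System.out.println(totalBits);        // 80
    }
}
```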