February 22, 2008
Data Compression and Huffman Encoding
In the early 1980s, personal computers had hard disks that were no larger than 10MB;
today, the puniest of disks are still measured in gigabytes.
Even though hard drives are
getting bigger, the files we want to store (images, videos, MP3s, and so on) seem to keep
pace with that growth, which makes even today's gargantuan disks seem too small to
hold everything.
One technique to make better use of our storage is to compress the files.
By taking
advantage of redundancy or patterns, we may be able to "abbreviate" the contents in
such a way as to take up less space yet maintain the ability to reconstruct a full version of
the original when needed.
Such compression could be useful when trying to cram more
things on a disk or to shorten the time needed to copy/send a file over a network.
There are compression algorithms that you may already have heard of.
Some
compression formats, such as GIF, MPEG, or MP3, are specifically designed to handle a
particular type of data file.
They tend to take advantage of known features of that type
of data (such as the propensity for pixels in an image to be the same or similar colors as their
neighbors) to compress it.
Other, more general-purpose tools and programs
can be used to compress any sort of file.
These algorithms have no advance
expectations about the data and usually rely on studying the particular file's contents to find
redundancy and patterns that allow for compression.
Some of the compression algorithms (e.g. JPEG, MPEG) are lossy, meaning the
compressed result doesn't recreate a perfect copy of the original.
Such an algorithm
compresses by "summarizing" the data.
The summary retains the general structure
while discarding the more minute details.
For sound, video, and images, this
imprecision may be acceptable because the bulk of the data is maintained and a few
missed pixels or milliseconds of video delay are no big deal.
For text data, though, a lossy
algorithm usually isn't appropriate.
An example of a lossy algorithm for compressing
text would be to remove all the vowels.
Compressing the previous sentence by this
scheme results in:
n xmpl f
lssy lgrthm fr cmprssng txt wld b t rmv ll th vwls.
This shrinks the original 87 characters down to just 61 and requires only 70% of the
space.
To decompress, we could try matching the consonant patterns to English
words with vowels inserted, but we cannot reliably reconstruct the original in this
manner.
Is the compressed word "fr" an abbreviation for the word "four" or the word
"fir" or "far"?
An intelligent reader can usually figure it out by context, but, alas, a
computer program cannot.
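The vowel-stripping scheme above can be sketched in a few lines of code. This is just an illustration of the idea, using a hypothetical strip_vowels helper; note that the lone word "a" disappears entirely, leaving behind a double space, which is how the count comes out to 61 characters:

```python
def strip_vowels(text):
    """Lossy 'compression': drop every vowel (upper- or lowercase),
    keeping all other characters, including spaces and punctuation."""
    return ''.join(ch for ch in text if ch.lower() not in "aeiou")

sentence = ("An example of a lossy algorithm for compressing text "
            "would be to remove all the vowels.")
compressed = strip_vowels(sentence)

print(len(sentence))    # 87
print(len(compressed))  # 61
print(compressed)       # n xmpl f  lssy lgrthm fr cmprssng txt wld b t rmv ll th vwls.
```

Note that there is no corresponding decompress function: the vowels are gone for good, which is precisely what makes the scheme lossy.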