Data Compression and Huffman Encoding
Handout written by Julie Zelenski.
In the early 1980s, personal computers had hard disks that were no larger than 10MB;
today, even the puniest of disks are measured in gigabytes.
Even though hard drives are
getting bigger, the files we want to store (images, videos, MP3s and so on) seem to keep
pace with that growth, which makes even today's gargantuan disks seem too small to hold
everything. One technique to use our storage more optimally is to compress the files. By taking
advantage of redundancy or patterns, we may be able to "abbreviate" the contents in such a
way as to take up less space yet maintain the ability to reconstruct a full version of the original.
Such compression could be useful when trying to cram more things on a
disk or to shorten the time needed to copy/send a file over a network.
There are compression algorithms that you may already have heard of. Some file
formats, such as GIF, MPEG, or MP3, are specifically designed to handle a particular type
of data file.
They tend to take advantage of known features of that type of data (such as the
propensity for pixels in an image to be the same or similar colors to their neighbors) to
compress it. Other general-purpose compression tools and programs
can be used to compress any sort of file. These algorithms have no such
expectations and usually rely on studying the particular data file contents to find
redundancy and patterns that allow for compression.
Some of the compression algorithms (e.g. JPEG, MPEG) are lossy, meaning the
compressed result doesn't recreate a perfect copy of the original.
Such an algorithm
compresses by "summarizing" the data.
The summary retains the general structure while
discarding the more minute details.
For sound, video, and images, this imprecision may be
acceptable because the bulk of the data is maintained, and a few missed pixels or
milliseconds of video delay are no big deal.
For text data, though, a lossy algorithm is usually
not appropriate. An example of a lossy algorithm for compressing text would be to
remove all the vowels.
Compressing the previous sentence by this scheme results in:
n xmpl f lssy lgrthm fr cmprssng txt wld b t rmv ll th vwls.
This shrinks the original 87 characters down to just 61 and requires only 70% of the
original space.
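As a quick sketch, the vowel-removal scheme above can be written in a few lines of Python (the function name `strip_vowels` is invented here for illustration; it is not part of any standard library):

```python
# Lossy "compression" by vowel removal: drop every vowel, keep all
# other characters (consonants, spaces, punctuation) untouched.
VOWELS = set("aeiouAEIOU")

def strip_vowels(text):
    """Return text with all vowels removed."""
    return "".join(ch for ch in text if ch not in VOWELS)

original = ("An example of a lossy algorithm for compressing text "
            "would be to remove all the vowels.")
compressed = strip_vowels(original)

print(len(original))    # 87 characters
print(len(compressed))  # 61 characters, about 70% of the original
```

Note that the compression step is trivial; it is the decompression step, discussed next, that is impossible to do reliably.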
To decompress, we could try matching the consonant patterns to English
words with vowels inserted, but we cannot reliably reconstruct the original in this manner.
Is the compressed word "fr" an abbreviation for the word "four" or the word "fir" or "far"?
An intelligent reader can usually figure it out by context, but, alas, a brainless computer
cannot.