Robert Sedgewick and Kevin Wayne, Copyright 2006, http://www.Princeton.EDU/~cos226

Data Compression

Reference: Chapter 22, Algorithms in C, 2nd Edition, Robert Sedgewick.
Reference: Introduction to Data Compression, Guy Blelloch.


Data Compression

Compression reduces the size of a file:
• To save space when storing it.
• To save time when transmitting it.
• Most files have lots of redundancy.

Who needs compression?
• Moore's law: number of transistors on a chip doubles every 18-24 months.
• Parkinson's law: data expands to fill space available.
• Text, images, sound, video, ...

Basic concepts ancient (1950s), best technology recently developed.

"All of the books in the world contain no more information than is broadcast as video in a single large American city in a single year. Not all bits have equal value." -Carl Sagan


Applications of Data Compression

Generic file compression.
• Files: GZIP, BZIP, BOA.
• Archivers: PKZIP.
• File systems: NTFS.

Multimedia.
• Images: GIF, JPEG.
• Sound: MP3.
• Video: MPEG, DivX, HDTV.

Communication.
• ITU-T T4 Group 3 Fax.
• V.42bis modem.

Databases. Google.


Encoding and Decoding

Message. Binary data M we want to compress.
Encode. Generate a "compressed" representation C(M).
Decode. Reconstruct original message or some approximation M'.

Compression ratio. Bits in C(M) / bits in M.

Lossless. M = M', 50-75% or lower.
Ex. Natural language, source code, executables.

Lossy. M ≠ M', 10% or lower.
Ex. Images, sound, video.

M → Encoder → C(M) → Decoder → M'
(C(M) hopefully uses fewer bits than M.)


Ancient Ideas

Ancient ideas.
• Braille.
• Morse code.
• Natural languages.
• Mathematical notation.
• Decimal number system.

"Poetry is the art of lossy data compression."


Natural Encoding

Natural encoding. (19 × 51) + 6 = 975 bits.
The 6 extra bits are needed to encode the number of characters per line.

000000000000000000000000000011111111111111000000000
000000000000000000000000001111111111111111110000000
000000000000000000000001111111111111111111111110000
000000000000000000000011111111111111111111111111000
000000000000000000001111111111111111111111111111110
000000000000000000011111110000000000000000001111111
000000000000000000011111000000000000000000000011111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000001111000000000000000000000001110
000000000000000000000011100000000000000000000111000
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011000000000000000000000000000000000000000000000011

19-by-51 raster of letter 'q' lying on its side.


Run-Length Encoding

Natural encoding. (19 × 51) + 6 = 975 bits....
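
The 975-bit count for the natural encoding, and the compression ratio defined earlier (bits in C(M) / bits in M), can be checked with a short Python sketch. The helper names (`expand`, `natural_bits`, `run_lengths`) and the constants `WIDTH_BITS` and `RUN_BITS` are hypothetical. The run lengths in `ROW_RUNS` were read off the 19-by-51 raster above, row by row, starting each row with its run of zeros. Charging 6 bits per run length is an assumption made only for illustration; it is not necessarily the scheme the slides go on to describe.

```python
from itertools import groupby

WIDTH_BITS = 6  # bits for the number of characters per line (51 < 2**6)
RUN_BITS = 6    # assumption: each run length is stored in 6 bits

# Alternating run lengths per row (zeros first), read off the raster.
ROW_RUNS = [
    [28, 14, 9],
    [26, 18, 7],
    [23, 24, 4],
    [22, 26, 3],
    [20, 30, 1],
    [19, 7, 18, 7],
    [19, 5, 22, 5],
    [19, 3, 26, 3],
    [19, 3, 26, 3],
    [19, 3, 26, 3],
    [19, 3, 26, 3],
    [20, 4, 23, 3, 1],
    [22, 3, 20, 3, 3],
    [1, 50],
    [1, 50],
    [1, 50],
    [1, 50],
    [1, 50],
    [1, 2, 46, 2],
]


def expand(runs):
    """Turn alternating run lengths (zeros first) back into a bit string."""
    return "".join(str(i % 2) * n for i, n in enumerate(runs))


def natural_bits(raster):
    """One bit per pixel, plus WIDTH_BITS for the line width."""
    return sum(len(row) for row in raster) + WIDTH_BITS


def run_lengths(row):
    """Lengths of the maximal runs of equal bits in one row."""
    return [len(list(group)) for _, group in groupby(row)]


raster = [expand(runs) for runs in ROW_RUNS]

m_bits = natural_bits(raster)                              # bits in M
run_count = sum(len(run_lengths(row)) for row in raster)   # runs in the raster
c_bits = run_count * RUN_BITS + WIDTH_BITS                 # bits in C(M), under the assumption
ratio = c_bits / m_bits                                    # compression ratio

print(m_bits)            # 975
print(run_count, c_bits) # 63 384
print(f"{ratio:.2f}")    # 0.39
```

Because every row has at most five runs while holding 51 pixels, counting runs instead of pixels is what makes the second figure smaller. A raster with rapidly alternating bits would have many more runs, and the same accounting could then exceed the natural encoding.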
