CSC 311
CHAPTER FIVE: DATA COMPRESSION

For many of the applications and uses we make of modern computers, data compression is absolutely essential:
Fax
MP3
Video
TV
etc.

For example: a typical fax uses 40,000 dots per square inch; sent uncompressed over a 56K modem, it would require more than one minute per page. A typical 2-hour movie would require 1.04 × 10^12 bits, far beyond the capacity of any DVD, yet you can put two 2-hour movies on a DVD. This is made possible by the use of data compression.

There are fundamentally two types of data compression:
Lossless
Lossy

Lossless: Lossless compression techniques allow the receiver to precisely reconstruct the original data being transmitted.

Lossy: Lossy compression techniques allow the receiver to approximately reconstruct the original data.

Frequency Dependent Codes:
We first want to examine two compression techniques that rely on the frequency of occurrence of the various symbols in constructing a compression algorithm:
Huffman codes
Arithmetic compression
Huffman Codes: Huffman codes rely on the frequency of use of the various symbols to produce codes of varying length to represent the symbols: frequent symbols get short codes and rare symbols get long ones.

Huffman codes have the property that no valid Huffman code for any symbol is a prefix of the code for any other symbol. This is sometimes called the no-prefix property.
Example:

Letter   Frequency   Huffman Code
A        25          01
B        15          110
C        10          111
D        20          10
E        30          00

Note: Huffman codes are not unique, but a properly formed Huffman code will always be optimal: no other prefix code for the same frequencies has a shorter average code length.
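The construction behind a table like the one above can be sketched in a few lines. This is a minimal illustration, not the method from the chapter: the function names (huffman_code, is_prefix_free, total_bits, decode) and the tie-breaking order are assumptions made for the example. Because Huffman codes are not unique, the codes it produces may differ from the table, but the total number of bits should be the same.

```python
import heapq

# Frequencies and codes copied from the example table in the notes.
FREQ = {"A": 25, "B": 15, "C": 10, "D": 20, "E": 30}
NOTES_CODE = {"A": "01", "B": "110", "C": "111", "D": "10", "E": "00"}


def huffman_code(freq):
    """Build one valid Huffman code by repeatedly merging the two lightest subtrees."""
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, counter, merged))
        counter += 1
    return heap[0][2]


def is_prefix_free(code):
    """True if no symbol's code is a prefix of another symbol's code."""
    return not any(
        s != t and code[t].startswith(code[s]) for s in code for t in code
    )


def total_bits(code, freq):
    """Bits needed to send every symbol as often as its frequency says."""
    return sum(freq[s] * len(code[s]) for s in freq)


def decode(bits, code):
    """Decode a bit string; the no-prefix property makes each match unambiguous."""
    inverse = {c: s for s, c in code.items()}
    out, word = [], ""
    for b in bits:
        word += b
        if word in inverse:
            out.append(inverse[word])
            word = ""
    return "".join(out)


built = huffman_code(FREQ)
print(built, total_bits(built, FREQ), total_bits(NOTES_CODE, FREQ))
print(decode("01110111", NOTES_CODE))
```

The decoder never needs a separator between codes: as soon as the accumulated bits match a code, that symbol is the only possible reading.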
Arithmetic Compression: Another frequency dependent compression technique, based on representing a character string as a single real number.

Assigning ranges based on frequency:

Letter   Frequency %   Subinterval [p,q]
A        25            [0, 0.25]
B        15            [0.25, 0.40]
C        10            [0.4, 0.5]
D        20            [0.5, 0.7]
E        30            [0.7, 1.0]
How does it work? We calculate the new interval based on the old interval and the probabilities of the current symbol. The interval, in this case, would change from 0.3-0.9 to 0.45-0.60.

Math shown in next slide.
Encoding the string CABAC, where w = y - x is the width of the current interval [x,y] and [p,q] is the subinterval of the next character:

Step  String  Next char  Current [x,y]          [p,q]         Width w    New x = x + w*p                        New y = x + w*q
1     (none)  C          [0, 1]                 [0.4, 0.5]    1.0        0 + 1*0.4 = 0.4                        0 + 1*0.5 = 0.5
2     C       A          [0.4, 0.5]             [0, 0.25]     0.1        0.4 + 0.1*0 = 0.4                      0.4 + 0.1*0.25 = 0.425
3     CA      B          [0.4, 0.425]           [0.25, 0.40]  0.025      0.4 + 0.025*0.25 = 0.40625             0.4 + 0.025*0.4 = 0.41
4     CAB     A          [0.40625, 0.41]        [0, 0.25]     0.00375    0.40625 + 0.00375*0 = 0.40625          0.40625 + 0.00375*0.25 = 0.4071875
5     CABA    C          [0.40625, 0.4071875]   [0.4, 0.5]    0.0009375  0.40625 + 0.0009375*0.4 = 0.406625     0.40625 + 0.0009375*0.5 = 0.40671875

We could choose any number in the interval [0.406625, 0.40671875] to represent the string CABAC.

Suppose we send N = 0.4067. The receiver only knows the number we sent and the contents of the original table of symbols and their probabilities. How do we produce the original string?
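The interval-narrowing steps in the table above can be reproduced with a short sketch. It assumes exact rational arithmetic (Python's fractions module) so that the printed values are not blurred by floating-point rounding; the names TABLE and encode_interval are made up for this illustration.

```python
from fractions import Fraction

# Subintervals [p, q] from the frequency table in the notes.
TABLE = {
    "A": (Fraction("0"), Fraction("0.25")),
    "B": (Fraction("0.25"), Fraction("0.40")),
    "C": (Fraction("0.4"), Fraction("0.5")),
    "D": (Fraction("0.5"), Fraction("0.7")),
    "E": (Fraction("0.7"), Fraction("1.0")),
}


def encode_interval(message, table=TABLE):
    """Narrow [x, y] once per character: new x = x + w*p, new y = x + w*q."""
    x, y = Fraction(0), Fraction(1)
    for ch in message:
        p, q = table[ch]
        w = y - x  # width of the current interval
        x, y = x + w * p, x + w * q  # both sides use the old x
    return x, y


low, high = encode_interval("CABAC")
print(float(low), float(high))
```

Any number between the two returned bounds identifies the string, which is why a short value such as 0.4067 can be sent instead of either endpoint.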
Decoding N = 0.4067: at each step, find the subinterval [p,q] that contains N, output its character, subtract p, and divide by the width q - p to get the next N.

Step  N       Interval [p,q]  Width  Char  N - p    Divide by width
1     0.4067  [0.4, 0.5]      0.1    C     0.0067   0.067
2     0.067   [0, 0.25]       0.25   A     0.067    0.268
3     0.268   [0.25, 0.40]    0.15   B     0.018    0.12
4     0.12    [0, 0.25]       0.25   A     0.12     0.48
5     0.48    [0.4, 0.5]      0.10   C     0.08     0.8

How do we know when to stop? Obviously we could continue the decoding process begun above, but there are no more characters actually encoded in the message. It is customary to include a terminating character in the code; when you decode the terminating character, you stop.

The number of characters that can be encoded is limited by the precision of real number representation on your machine.
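The decoding table above amounts to a short loop. The sketch below makes two assumptions that are not in the notes: the receiver is told the message length instead of watching for a terminating character, and each subinterval is treated as half-open, [p, q), so that a value sitting exactly on a boundary maps to one character only. The names TABLE and decode_number are illustrative.

```python
from fractions import Fraction

# Subintervals from the frequency table in the notes, read as [p, q).
TABLE = {
    "A": (Fraction("0"), Fraction("0.25")),
    "B": (Fraction("0.25"), Fraction("0.40")),
    "C": (Fraction("0.4"), Fraction("0.5")),
    "D": (Fraction("0.5"), Fraction("0.7")),
    "E": (Fraction("0.7"), Fraction("1.0")),
}


def decode_number(n, length, table=TABLE):
    """Recover `length` characters from the single number n."""
    out = []
    for _ in range(length):
        for ch, (p, q) in table.items():
            if p <= n < q:
                break
        else:
            raise ValueError("number falls outside every subinterval")
        out.append(ch)
        n = (n - p) / (q - p)  # subtract p, then divide by the width
    return "".join(out)


print(decode_number(Fraction("0.4067"), 5))
```

Using exact fractions sidesteps the precision limit mentioned above for this small example; with ordinary floating-point numbers the loop would start returning wrong characters once the message grew long enough.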
Run Length Encoding: Look for long runs of ones or zeros.

Example: Runs of the same bit. Here we are looking for long runs of 0, which one may commonly find in fax transmissions.
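A minimal sketch of run-length coding with 4-bit codes follows. It rests on assumptions that the notes do not spell out: every run of 0s is ended by a single 1 bit, each 4-bit code gives the number of 0s before that 1, and the code 1111 means "15 zeros, and the run continues in the next code". The function names rle_encode and rle_decode are hypothetical.

```python
def rle_encode(bits):
    """Encode a string of '0'/'1' characters as a list of 4-bit run-length codes."""
    codes, run = [], 0
    for b in bits:
        if b == "0":
            run += 1
            continue
        # A 1 bit ends the run: emit 1111 for every full block of 15 zeros,
        # then the remainder (possibly 0000) to close the run.
        while run >= 15:
            codes.append("1111")
            run -= 15
        codes.append(format(run, "04b"))
        run = 0
    if run:
        raise ValueError("this sketch assumes the input ends with a 1 bit")
    return codes


def rle_decode(codes):
    """Rebuild the bit string from the 4-bit codes."""
    out, run = [], 0
    for code in codes:
        value = int(code, 2)
        run += value
        if value == 15:
            continue  # 1111: the run carries on into the next code
        out.append("0" * run + "1")
        run = 0
    return "".join(out)


print(rle_encode("0" * 20 + "1"))
print(rle_encode("0" * 30 + "1"))
```

Under these assumptions a run of exactly 15 zeros is sent as 1111 0000, because the receiver always expects another code after 1111.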
If we encounter a run of more than 15 zeros, how can we specify that with only 4-bit codes? A single 4-bit code allows a maximum of 15 zeros.

To send, for example, 20 zeros we would send: 1111 0101
The receiver assumes, when it sees 1111, that the next code is a continuation of the previous one.

How then might we send a code for 30 zeros? 1111 1111 0000
The code for 0 zeros is needed to terminate the 1111 code.

Lempel-Ziv Compression

Lempel-Ziv (LZ) is a compression technique that can realize compression ratios of up to 20 to 1. It relies on the fact that, in any document, character strings are going to be repeated.

For example: in legal documents such as contracts, one is likely to find phrases such as "whereas the party of the first part" repeated many times in the document.

Would it not be nice if, rather than sending the thirty-five individual characters contained in the above phrase, we could simply send a single integer, such as "18", and have the receiver understand that "18" stands for the above phrase?
Lempel-Ziv provides an elegant algorithm for accomplishing this.

The sender has the original message and a previously agreed upon symbol table, usually the set of allowable characters in the alphabet.

The receiving party knows nothing about the message content, but it knows the contents and organization of the symbol table.
Let us suppose that, at the sender's end, we wish to send the message: ABABAAABBCACABABACAC

The sender would have the following symbol table, assuming that all possible messages consist only of patterns of the characters A, B, and C.

Beginning Symbol Table:

Code   String
0      A
1      B
2      C

The receiver, knowing that all messages are composed only of the characters A, B, and C, would have the same symbol table at the beginning:

Code   String
0      A
1      B
2      C
At the sending end, the goal is to build an expanded symbol table containing all of the character patterns encountered so far. One pass through the algorithm is the processing of one new character in the message. On each pass the sender tracks the following information:

Pass  Buffer content  Current char  What is sent     What is stored in table  New buffer content
1     A               B             0 (code for A)   AB (code = 3)            B

The algorithm begins with the first character, A, in the buffer; the first pass through the loop begins by reading the second character, "B". Since AB is not yet in the table, the sender sends the code for the buffer (A), stores AB as a new entry, and makes B the new buffer content.

The sender's symbol table would now look as follows:

Code   String
0      A
1      B
2      C
3      AB
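The sender's loop can be sketched as below; this dictionary scheme is the variant of Lempel-Ziv commonly known as LZW. The function names lzw_encode and lzw_decode are made up for the illustration, and a matching receiver (whose side the notes take up next) is included only so the round trip can be checked. The input string is also an assumption: it is the 13-character sequence ABABABCBABABA that the buffer and current-char columns of the pass-by-pass sender trace in these notes read, with the final buffer flushed as one last code at end of input.

```python
def lzw_encode(message, alphabet="ABC"):
    """Sender: emit a code whenever buffer + current char is not yet in the table."""
    table = {ch: i for i, ch in enumerate(alphabet)}
    codes = []
    buffer = message[0]  # assumes a non-empty message
    for ch in message[1:]:
        if buffer + ch in table:
            buffer += ch  # keep growing the match, send nothing
        else:
            codes.append(table[buffer])  # what is sent
            table[buffer + ch] = len(table)  # what is stored
            buffer = ch  # new buffer content
    codes.append(table[buffer])  # flush whatever is left at end of input
    return codes, table


def lzw_decode(codes, alphabet="ABC"):
    """Receiver: rebuild the same table one step behind the sender."""
    table = {i: ch for i, ch in enumerate(alphabet)}
    prior = table[codes[0]]
    out = [prior]
    for code in codes[1:]:
        if code in table:
            current = table[code]
        else:
            # Code not in the table yet: it must be prior + first char of prior.
            current = prior + prior[0]
        table[len(table)] = prior + current[0]
        out.append(current)
        prior = current
    return "".join(out)


codes, table = lzw_encode("ABABABCBABABA")
print(codes)
print(lzw_decode(codes))
```

The only case where the receiver sees a code it has not stored yet is when the sender uses an entry on the very pass after creating it, which is what the else branch handles.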
At the other end of the transmission, the receiver is trying to reconstruct the symbol table that the sender is building. The receiver is gathering the following information:

Pass  Prior (string)  Current (string)  Is current code in table?  C (1st char)  Tempstring/Code pair  What is printed (curr or temp?)
1     0 (A)           1 (B)             Yes                        B             AB/3                  B (current)

Since the receiver has received the codes for both A and B sequentially, it knows the sender has seen the character pattern AB, and it stores this as entry 3 in its table.

Receiver's table after pass one:

Code   String
0      A
1      B
2      C
3      AB

This process continues for the entire message: ABABAAABBCACABABACAC

Sender:

Pass  Buffer content  Current char  What is sent      What is stored in table  New buffer content
1     A               B             0 (code for A)    AB (code = 3)            B
2     B               A             1 (code for B)    BA (code = 4)            A
3     A               B             (nothing)         (nothing)                AB
4     AB              A             3 (code for AB)   ABA (code = 5)           A
5     A               B             (nothing)         (nothing)                AB
6     AB              C             3 (code for AB)   ABC (code = 6)           C
7     C               B             2 (code for C)    CB (code = 7)            B
8     B               A             (nothing)         (nothing)                BA
9     BA              B             4 (code for BA)   BAB (code = 8)           B
10    B               A             (nothing)         (nothing)                BA
11    BA              B             (nothing)         (nothing)                BAB
12    BAB             A             8 (code for BAB)  BABA (code = 9)          A

Receiver:

Pass  Prior (string)  Current (string)  Is current code in table?  C (1st char)  Tempstring/Code pair  What is printed (curr or temp?)
1     0 (A)           1 (B)             Yes                        B             AB/3                  B (current)
2     1 (B)           3 (AB)            Yes                        A             BA/4                  AB (current)
3     3 (AB)          3 (AB)            Yes                        A             ABA/5                 AB (current)
4     3 (AB)          2 (C)             Yes                        C             ABC/6                 C (current)
5     2 (C)           4 (BA)            Yes                        B             CB/7                  BA (current)
6     4 (BA)          8                 No                         B             BAB/8                 BAB (temp)

At this point the sender and receiver symbol tables would contain:

Code  Sender  Receiver
0     A       A
1     B       B
2     C       C
3     AB      AB
4     BA      BA
5     ABA     ABA
6     ABC     ABC
7     CB      CB
8     BAB     BAB
9     BABA    not yet

Image compression is an example of Lossy Compression:

At just 640 x 480 resolution, a color image would require 7,372,800 bits (640 x 480 pixels at 24 bits per pixel). For motion, we send 30 images per second, which would require over 220 million bits per second for a single video stream.

Lossy compression schemes are used to dramatically reduce this requirement.
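The uncompressed image and video figures above can be checked with a few lines of arithmetic. The 24 bits per pixel is an assumption here: it is the colour depth that reproduces the 7,372,800-bit figure. The constant names are illustrative.

```python
WIDTH, HEIGHT = 640, 480      # resolution quoted in the notes
BITS_PER_PIXEL = 24           # assumption: 8 bits each for red, green, blue
FRAMES_PER_SECOND = 30        # frame rate quoted in the notes

bits_per_frame = WIDTH * HEIGHT * BITS_PER_PIXEL
bits_per_second = bits_per_frame * FRAMES_PER_SECOND

print(f"{bits_per_frame:,} bits per frame")
print(f"{bits_per_second:,} bits per second")
```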
We won't go through the details of how video compression is accomplished, but I suggest you read the remainder of the chapter for your own enlightenment.

Images that are transmitted consist of three different frame types:
P frame: encoded by computing the differences between the current frame and the previous frame
B frame: similar to a P frame, except it is interpolated between a previous and a future frame
I frame: just a JPEG-encoded image
 Spring '08
 Whitston,H
