
6.896 Sublinear Time Algorithms
March 6, 2007
Lecture 9
Lecturer: Ronitt Rubinfeld
Scribe: Khanh Do Ba

1 Lempel-Ziv Compression

1.1 The algorithm

LZ77($w$)
    $t \leftarrow 1$
    repeat
        find the longest substring $w_t \dots w_{t+\ell-1}$ s.t. there is an index $p < t$ with $w_p \dots w_{p+\ell-1} = w_t \dots w_{t+\ell-1}$
        if none then
            next symbol $= w_t$
            $t \leftarrow t + 1$
        else
            next symbol $= (p, \ell)$
            $t \leftarrow t + \ell$
    until $t > n$ ($= |w|$)

Note that the matched copy starting at $p$ may overlap position $t$; only its starting index must satisfy $p < t$.

1.2 Some notation

The following notation will be used extensively.

$n_\ell(w)$ = # compressed segments of length $\ell$ in $w$, not counting alphabet symbols nor the last compressed segment
$C_{LZ}(w)$ = # symbols in the compressed string (# pairs + # alphabet symbols)
$d_\ell(w)$ = # distinct substrings of length $\ell$ in $w$

Examples:

1. aaaaaaa is encoded as a(1, 6) and has $d_\ell(w) = 1$ for all $\ell \in [7]$.

2. abcd is encoded as abcd.

3. abaabaaabaaa is encoded as ab(1, 1)(1, 4)(4, 5), where the compressed segments are broken up as a, b, a, abaa, abaaa.

4. abcaabbccaaabbbccc is encoded as abc(1, 1)(1, 2)(2, 2)(3, 3)(5, 3)(7, 3)(3, 1). The compressed segments are broken up as a, b, c, a, ab, bc, caa, abb, bcc, c, and

    $d_1(w) = 3$, $d_2(w) = 6$, $d_3(w) = 11$, $d_4(w) = 13$,
    $n_1(w) = 1$, $n_2(w) = 2$, $n_3(w) = 3$, $n_{>3}(w) = 0$.
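The algorithm and notation above can be checked on small strings with a brute-force sketch such as the one below. It is only an illustration, not part of the lecture: the function names (`lz77_parse`, `c_lz`, `d_ell`, `n_ell`) are hypothetical, ties between equally long matches are broken toward the smallest index p (an assumption, since the pseudocode does not specify a rule), and "last compressed segment" is read as the final segment of the parse, whatever its type.

```python
def lz77_parse(w):
    """Parse w as in the LZ77 pseudocode; return a list of literals and (p, l) pairs.

    Indices p are 1-based, as in the notes. A copy may overlap position t.
    Assumption: among equally long matches, the smallest p is reported.
    """
    n = len(w)
    out = []
    t = 0  # 0-based position of the next unparsed symbol
    while t < n:
        best_p, best_len = None, 0
        for p in range(t):
            l = 0
            # Extend the match; p + l stays below t + l < n, so it is in bounds.
            while t + l < n and w[p + l] == w[t + l]:
                l += 1
            if l > best_len:  # strict '>' keeps the smallest p on ties
                best_p, best_len = p, l
        if best_len == 0:
            out.append(w[t])  # no earlier occurrence: emit the alphabet symbol
            t += 1
        else:
            out.append((best_p + 1, best_len))  # emit the pair (p, l)
            t += best_len
    return out


def c_lz(w):
    """C_LZ(w): number of pairs plus number of alphabet symbols in the encoding."""
    return len(lz77_parse(w))


def d_ell(w, ell):
    """d_l(w): number of distinct substrings of length ell in w."""
    return len({w[i:i + ell] for i in range(len(w) - ell + 1)})


def n_ell(w, ell):
    """n_l(w): number of pairs of length ell, excluding the final segment.

    Assumption: the 'last compressed segment' is the last item of the parse.
    """
    segments = lz77_parse(w)[:-1]
    return sum(1 for s in segments if isinstance(s, tuple) and s[1] == ell)


print(lz77_parse("aaaaaaa"))                        # ['a', (1, 6)]
print([d_ell("aaaaaaa", l) for l in range(1, 8)])   # [1, 1, 1, 1, 1, 1, 1]
print(c_lz("abcd"))                                 # 4
```

The parser tries every earlier start position at every step, so it is far from sublinear; it is meant only for verifying small worked examples by hand.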

2 Estimating LZ Compressibility

2.1 Main theorem

The key idea is the following theorem.
