6.896 Sublinear Time Algorithms
March 6, 2007
Lecture 9
Lecturer: Ronitt Rubinfeld
Scribe: Khanh Do Ba
1 Lempel-Ziv Compression
1.1 The algorithm
LZ77(w)
    t ← 1
    repeat
        find the longest substring $w_t \ldots w_{t+\ell-1}$ s.t. ∃ index $p < t$ with $w_p \ldots w_{p+\ell-1} = w_t \ldots w_{t+\ell-1}$
        if none then
            next symbol = $w_t$
            t ← t + 1
        else
            next symbol = $(p, \ell)$
            t ← t + $\ell$
    until $t > n$ (= $|w|$)
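The pseudocode above can be sketched in Python as follows. This is a minimal, quadratic-time illustration rather than an efficient implementation. The function name `lz77_parse` and the token representation (a one-character string for an alphabet symbol, a tuple `(p, length)` with 1-based `p` for a compressed segment) are assumptions made for this sketch. The pseudocode does not say which index p to report when several give the same longest match; the sketch assumes the smallest one.

```python
def lz77_parse(w):
    """Parse w as in the LZ77 pseudocode; reported positions p are 1-based."""
    n = len(w)
    tokens = []
    t = 0  # 0-based counterpart of the pseudocode's t
    while t < n:
        best_p, best_len = -1, 0
        for p in range(t):  # try every earlier start position
            k = 0
            # The match may run past position t, i.e. overlap the
            # segment currently being encoded.
            while t + k < n and w[p + k] == w[t + k]:
                k += 1
            if k > best_len:  # strict '>' keeps the smallest p among ties (assumption)
                best_p, best_len = p, k
        if best_len == 0:
            tokens.append(w[t])  # no earlier occurrence: emit the symbol itself
            t += 1
        else:
            tokens.append((best_p + 1, best_len))  # emit the pair (p, length)
            t += best_len
    return tokens
```

For instance, `lz77_parse("aaaaaaa")` yields `['a', (1, 6)]`: the single pair copies from position 1 and overlaps the text it produces.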
1.2 Some notation
The following notation will be used extensively.
$n_\ell(w)$ = # compressed segments of length $\ell$ in $w$, not counting alphabet symbols nor the last compressed segment
$C_{LZ}(w)$ = # symbols in the compressed string (# pairs + # alphabet symbols)
$d_\ell(w)$ = # distinct substrings of length $\ell$ in $w$
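The three quantities can be computed directly from their definitions, as in the sketch below. The function names `d_ell`, `c_lz` and `n_ell` are hypothetical, and the parse is assumed to be given as a list of tokens in which an alphabet symbol is a one-character string and a compressed segment is a tuple `(p, length)`. The sketch also assumes that "last compressed segment" means the final pair appearing in the parse.

```python
def d_ell(w, ell):
    """Number of distinct substrings of length ell in the string w."""
    return len({w[i:i + ell] for i in range(len(w) - ell + 1)})


def c_lz(tokens):
    """Number of symbols in the compressed string: pairs plus alphabet symbols."""
    return len(tokens)


def n_ell(tokens, ell):
    """Number of compressed segments of length ell.

    Alphabet symbols are ignored, and so is the final pair of the parse
    (assumed reading of "last compressed segment").
    """
    pairs = [tok for tok in tokens if isinstance(tok, tuple)]
    return sum(1 for (_, length) in pairs[:-1] if length == ell)
```

As a quick check, `d_ell("aaaaaaa", ell)` is 1 for every `ell` from 1 to 7, since all substrings of a given length are identical.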
Examples:
1. aaaaaaa is encoded as a(1, 6) and has $d_\ell(w) = 1$ $\forall \ell \in [7]$.
2. abcd is encoded as abcd.
3. abaabaaabaaa is encoded as ab(1, 1)(1, 4)(4, 5), where the compressed segments are broken up as a, b, a, abaa, abaaa.
4. abcaabbccaaabbbccc is encoded as abc(1, 1)(1, 2)(2, 2)(3, 3)(5, 3)(7, 3)(3, 1). The compressed segments are broken up as a, b, c, a, ab, bc, caa, abb, bcc, c, and
$d_1(w) = 3$, $d_2(w) = 6$, $d_3(w) = 11$, $d_4(w) = 13$, $n_1(w) = 1$, $n_2(w) = 2$, $n_3(w) = 3$, $n_{\ell > 3}(w) = 0$.