Unbounded Length Contexts for PPM
John G. Cleary and W. J. Teahan
Department of Computer Science, University of Waikato, Hamilton, New Zealand
Email: jcleary@cs.waikato.ac.nz, wjt@cs.waikato.ac.nz
The PPM data compression scheme has set the performance standard in lossless compression of
text throughout the past decade. PPM is a finite-context statistical modelling technique that can
be viewed as blending together several fixed-order context models to predict the next character
in the input sequence. This paper gives a brief introduction to PPM, and describes a variant of
the algorithm, called PPM*, which exploits contexts of unbounded length. Although requiring
considerably greater computational resources (in both time and space), it reliably achieves
compression superior to the benchmark PPMC version. Its major contribution is that it shows that
the full information available by considering all substrings of the input string can be used effectively
to generate high-quality predictions. Hence, it provides a useful tool for exploring the bounds of
compression.
Received June 28, 1996; revised July 25, 1997
1. INTRODUCTION
The prediction by partial matching (PPM) data compression scheme has set the performance standard in lossless compression of text throughout the past decade. The original algorithm was first published in 1984 by Cleary and Witten [1], and a series of improvements was described by Moffat, culminating in a careful implementation, called PPMC, which has become the benchmark version [2]. This still achieves results superior to virtually all other compression methods, despite many attempts to better it. Other methods, such as those based on Ziv-Lempel coding [3, 4], are more commonly used in practice, but their attractiveness lies in their relative speed rather than any superiority in compression; indeed, their compression performance generally falls distinctly below that of PPM in practical benchmark tests [5].
Prediction by partial matching, or PPM, is a finite-context statistical modelling technique that can be viewed as blending together several fixed-order context models to predict the next character in the input sequence. Prediction probabilities for each context in the model are calculated from frequency counts which are updated adaptively, and the symbol that actually occurs is encoded relative to its predicted distribution using arithmetic coding [6, 7]. The maximum context length is a fixed constant, and it has been found that increasing it beyond about 5 does not generally improve compression [1, 2, 8].
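To make the blending idea concrete, the following sketch maintains a frequency table for every context length up to a fixed maximum and falls back to shorter contexts when a symbol has not been seen. It is not the authors' implementation: the class name SimplePPM, the byte-alphabet fallback, and the simplified escape handling (modelled loosely on PPMC's distinct-symbol escape count) are our illustrative assumptions.

    from collections import defaultdict

    class SimplePPM:
        """Toy PPM-style model: one adaptive frequency table per
        context, for every context length from 0 up to MAX_ORDER."""

        MAX_ORDER = 5  # the fixed maximum context length discussed above

        def __init__(self):
            # counts[context][symbol] -> frequency, updated adaptively
            self.counts = defaultdict(lambda: defaultdict(int))

        def update(self, history: str, symbol: str) -> None:
            """Record `symbol` under every usable context length."""
            for order in range(min(self.MAX_ORDER, len(history)) + 1):
                context = history[len(history) - order:]
                self.counts[context][symbol] += 1

        def predict(self, history: str, symbol: str) -> float:
            """Estimate P(symbol | history): start at the longest
            available context and escape to shorter contexts when the
            symbol is unseen there."""
            p = 1.0  # product of escape probabilities so far
            for order in range(min(self.MAX_ORDER, len(history)), -1, -1):
                context = history[len(history) - order:]
                table = self.counts.get(context)
                if not table:
                    continue  # context never seen: fall through
                total = sum(table.values())
                distinct = len(table)  # escape count, "method C" style
                if symbol in table:
                    return p * table[symbol] / (total + distinct)
                p *= distinct / (total + distinct)  # escape downward
            return p / 256  # order -1: uniform over a byte alphabet

An arithmetic coder would then encode each symbol in roughly -log2(p) bits, so the compression achieved depends entirely on the quality of these probability estimates.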
The present paper describes an algorithm, PPM*, which exploits contexts of unbounded length. It reliably achieves compression superior to the benchmark PPMC version, although our current implementation uses considerably greater computational resources (in both time and space). The next section describes the basic PPM compression scheme.