JOURNAL OF COMPUTER AND SYSTEM SCIENCES 31, 182-209 (1985)
Probabilistic Counting Algorithms for Data Base Applications
PHILIPPE FLAJOLET
INRIA, Rocquencourt, 78153 Le Chesnay, France
AND
G. NIGEL MARTIN
IBM Development Laboratory, Hursley Park, Winchester, Hampshire SO21 2JN, United Kingdom
Received June 13, 1984; revised April 3, 1985
This paper introduces a class of probabilistic counting algorithms with which one can estimate the number of distinct elements in a large collection of data (typically a large file stored on disk) in a single pass using only a small additional storage (typically less than a hundred binary words) and only a few operations per element scanned. The algorithms are based on statistical observations made on bits of hashed values of records. They are by construction totally insensitive to the replicative structure of elements in the file; they can be used in the context of distributed systems without any degradation of performance and prove especially useful in the context of data base query optimisation.
© 1985 Academic Press, Inc.
1. INTRODUCTION
As data base systems allow the user to specify more and more complex queries, the need arises for efficient processing methods. A complex query can however generally be evaluated in a number of different manners, and the overall performance of a data base system depends rather crucially on the selection of appropriate decomposition strategies in each particular case.
Even a problem as trivial as computing the intersection A ∩ B of two collections of data A and B lends itself to a number of different treatments (see, e.g., [7]):
1. Sort A, search each element of B in A and retain it if it appears in A;
2. sort A, sort B, then perform a merge-like operation to determine the intersection;
3. eliminate duplicates in A and/or B using hashing or hash filters, then perform Algorithm 1 or 2.
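The three strategies listed above can be sketched in Python as follows. This is only an illustrative sketch, not code from the paper: the function names are hypothetical, the collections are assumed to be in-memory lists of comparable values, and Python's built-in `set` stands in for the "hashing or hash filters" of the third strategy.

```python
from bisect import bisect_left


def intersect_sort_search(a, b):
    """Strategy 1: sort A, then binary-search each element of B in A."""
    sorted_a = sorted(a)
    result = []
    for x in b:
        i = bisect_left(sorted_a, x)
        if i < len(sorted_a) and sorted_a[i] == x:
            result.append(x)  # a value repeated in B is retained each time
    return result


def intersect_merge(a, b):
    """Strategy 2: sort both inputs, then merge; keeps distinct common values."""
    sa, sb = sorted(a), sorted(b)
    i = j = 0
    result = []
    while i < len(sa) and j < len(sb):
        if sa[i] < sb[j]:
            i += 1
        elif sa[i] > sb[j]:
            j += 1
        else:
            # Equal heads: record the value once, then advance both sides.
            if not result or result[-1] != sa[i]:
                result.append(sa[i])
            i += 1
            j += 1
    return result


def intersect_dedup_first(a, b):
    """Strategy 3: eliminate duplicates by hashing, then apply strategy 2."""
    return intersect_merge(set(a), set(b))
```

Note that the first sketch keeps repeated values of B as they occur, whereas the other two return each common value once. The amount of work saved by the third strategy depends on how many duplicates each collection contains.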
Each of these evaluation strategies will have a cost essentially determined by the number of records a, b in A and B, and the number of distinct elements α, β in A and B, and for typical sorting methods, the costs are:
for strategy 1: O(a log a + b log a);
for strategy 2: O(a log a + b log b + a + b).
...
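As a rough numerical illustration of the two cost expressions, the sketch below evaluates them for given record counts. The function names are hypothetical, base-2 logarithms are an assumption, and the constant factors hidden by the O-notation are ignored, so the results only indicate relative orders of magnitude.

```python
import math


def cost_strategy_1(a, b):
    """a log a + b log a: sort A, then search each record of B in A."""
    return a * math.log2(a) + b * math.log2(a)


def cost_strategy_2(a, b):
    """a log a + b log b + a + b: sort both files, then merge them."""
    return a * math.log2(a) + b * math.log2(b) + a + b


if __name__ == "__main__":
    # A small file A and a much larger file B (illustrative sizes only).
    a, b = 1024, 1048576
    print(cost_strategy_1(a, b), cost_strategy_2(a, b))
```

Under these assumptions, when A is much smaller than B the first expression is the smaller one, since B is never sorted.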
In a number of similar situations, it appears thus that, apart from the sizes of the files on which one operates (i.e., the number of records), a major determinant of efficiency is the cardinalities of the underlying sets, i.e., the number of distinct elements they comprise.
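The distinction between the size of a file and the cardinality of its underlying set can be made concrete with a small Python illustration. The helper name and the sample values are hypothetical. The sketch counts distinct elements exactly with a hash set, which requires memory proportional to the number of distinct values, unlike the small fixed storage of the estimation algorithms this paper introduces.

```python
def size_and_cardinality(records):
    """Return (number of records, number of distinct elements)."""
    size = len(records)              # every record counts, duplicates included
    cardinality = len(set(records))  # exact distinct count via hashing
    return size, cardinality


if __name__ == "__main__":
    sample = ["paris", "london", "paris", "rome", "london", "paris"]
    print(size_and_cardinality(sample))
```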
The situation gets much more complex when operations like projections, selections, multiple joins in combination with various boolean operations appear in queries.
As an example, the relational system System R has a sophisticated query optimiser. In order to perform its task, that programme keeps several statistics on relations of the data base. The most important ones are the sizes of relations as well as the number of different elements of some key fields [8].
