JOURNAL OF COMPUTER AND SYSTEM SCIENCES 31, 182-209 (1985)

Probabilistic Counting Algorithms for Data Base Applications

PHILIPPE FLAJOLET
INRIA, Rocquencourt, 78153 Le Chesnay, France

AND

G. NIGEL MARTIN
IBM Development Laboratory, Hursley Park, Winchester, Hampshire SO21 2JN, United Kingdom

Received June 13, 1984; revised April 3, 1985

This paper introduces a class of probabilistic counting algorithms with which one can estimate the number of distinct elements in a large collection of data (typically a large file stored on disk) in a single pass, using only a small additional storage (typically less than a hundred binary words) and only a few operations per element scanned. The algorithms are based on statistical observations made on bits of hashed values of records. They are by construction totally insensitive to the replicative structure of elements in the file; they can be used in the context of distributed systems without any degradation of performance and prove especially useful in the context of data base query optimisation. © 1985 Academic Press, Inc.

1. INTRODUCTION

As data base systems allow the user to specify more and more complex queries, the need arises for efficient processing methods. A complex query can, however, generally be evaluated in a number of different manners, and the overall performance of a data base system depends rather crucially on the selection of appropriate decomposition strategies in each particular case. Even a problem as trivial as computing the intersection of two collections of data A and B lends itself to a number of different treatments (see, e.g., [7]). To compute A ∩ B, one can, as sketched below:

1. sort A, then search each element of B in A and retain it if it appears in A;
2. sort A and B, then perform a merge-like operation to determine the intersection;
3. eliminate duplicates in A and/or B using hashing or hash filters, then perform Algorithm 1 or 2.
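The following is a minimal sketch of strategies 2 and 3 above (not part of the original paper): the function names and the use of Python's built-in set type are illustrative choices of ours. Strategy 2 sorts both inputs and merges them; strategy 3 eliminates duplicates by hashing before intersecting.

```python
def intersect_sorted(A, B):
    """Strategy 2: sort both collections, then merge to find common elements."""
    A, B = sorted(A), sorted(B)
    i, j, result = 0, 0, []
    while i < len(A) and j < len(B):
        if A[i] == B[j]:
            v = A[i]
            result.append(v)
            # skip duplicates of the matched value in both inputs
            while i < len(A) and A[i] == v:
                i += 1
            while j < len(B) and B[j] == v:
                j += 1
        elif A[i] < B[j]:
            i += 1
        else:
            j += 1
    return result


def intersect_hashed(A, B):
    """Strategy 3: eliminate duplicates by hashing, then intersect."""
    return sorted(set(A) & set(B))
```

Either routine reports each common value once, regardless of how often it is repeated in A or B.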
Each of these evaluation strategies will have a cost essentially determined by the number of records a, b in A and B, and the number of distinct elements α, β in A and B; for typical sorting methods, the costs are:

for strategy 1: O(a log a + b log a);
for strategy 2: O(a log a + b log b + a + b); ...

In a number of similar situations, it thus appears that, apart from the sizes of the files on which one operates (i.e., the number of records), a major determinant of efficiency is the cardinalities of the underlying sets, i.e., the number of distinct elements they comprise.

The situation gets much more complex when operations like projections, selections, and multiple joins in combination with various boolean operations appear in queries. As an example, the relational system System R has a sophisticated query optimiser. In order to perform its task, that programme keeps several statistics on relations in the data base. The most important ones are the sizes of relations as well as the number of different elements of some key fields [5].
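To make the abstract's description concrete, here is a minimal single-bitmap sketch of the underlying observation: hash each record, note the position ρ of the least significant 1-bit of the hash, and estimate the cardinality from the first position never observed. The correction constant φ ≈ 0.77351 comes from the paper's analysis; the hash function, word length, and function names below are illustrative assumptions of ours, and the paper itself refines this basic estimator considerably (notably by averaging over several bitmaps).

```python
import hashlib

L = 32          # number of bit positions tracked (assumed word length)
PHI = 0.77351   # correction constant from the paper's analysis


def rho(y):
    """Position (from 0) of the least significant 1-bit of y; L if y == 0."""
    if y == 0:
        return L
    r = 0
    while y & 1 == 0:
        y >>= 1
        r += 1
    return r


def estimate_cardinality(records):
    bitmap = [0] * L
    for rec in records:
        # hash the record to a pseudo-uniform 32-bit value (illustrative choice)
        h = int.from_bytes(hashlib.md5(str(rec).encode()).digest()[:4], "big")
        bitmap[min(rho(h), L - 1)] = 1
    # R = index of the leftmost zero in the bitmap
    R = next((i for i, b in enumerate(bitmap) if b == 0), L)
    return 2 ** R / PHI


# duplicates do not change the answer: the bitmap only records which
# rho-values have occurred, so the estimate depends on distinct records only
print(estimate_cardinality(list(range(10000)) * 3))
```

Because only the set of observed ρ-values matters, repeated records leave the estimate unchanged, which is the insensitivity to the replicative structure of the file claimed in the abstract.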