CS 70
Discrete Mathematics and Probability Theory
Fall 2010
Tse/Wagner
Note 13
A Killer Application
In this lecture, we will see a “killer app” of elementary probability in Computer Science. Suppose a hash function distributes keys evenly over a table of size $n$. How many (randomly chosen) keys can we hash before the probability of a collision exceeds (say) $\frac{1}{2}$? As we shall see, this question can be tackled by an analysis of the balls-and-bins probability space which we have already encountered.
Application: Hash functions
As you may already know, a hash table is a data structure that supports the storage of sets of keys from a (large) universe $U$ (say, the names of all 250 million people in the US). The operations supported are ADDing a key to the set, DELETEing a key from the set, and testing MEMBERship of a key in the set. The hash function $h$ maps $U$ to a table $T$ of modest size. To ADD a key $x$ to our set, we evaluate $h(x)$ (i.e., apply the hash function to the key) and store $x$ at the location $h(x)$ in the table $T$. All keys in our set that are mapped to the same table location are stored in a simple linked list. The operations DELETE and MEMBER are implemented in similar fashion, by evaluating $h(x)$ and searching the linked list at $h(x)$. Note that the efficiency of a hash function depends on having only few collisions (i.e., keys that map to the same location). This is because the search time for DELETE and MEMBER operations is proportional to the length of the corresponding linked list.
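The data structure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ChainedHashTable` and its method names are hypothetical, Python lists stand in for the linked lists, and Python's built-in `hash` reduced mod $n$ stands in for the hash function $h$ (the notes do not specify a particular $h$).

```python
class ChainedHashTable:
    """A set of keys stored in n table locations, with colliding keys chained together."""

    def __init__(self, n):
        self.n = n
        # One chain per table location; a Python list stands in for a linked list.
        self.table = [[] for _ in range(n)]

    def _h(self, x):
        # Assumed stand-in for the hash function h: built-in hash, reduced mod n.
        return hash(x) % self.n

    def add(self, x):
        chain = self.table[self._h(x)]
        if x not in chain:          # keep set semantics: no duplicate keys
            chain.append(x)

    def delete(self, x):
        chain = self.table[self._h(x)]
        if x in chain:
            chain.remove(x)

    def member(self, x):
        # Linear scan of a single chain: cost grows with the chain's length.
        return x in self.table[self._h(x)]

    def longest_chain(self):
        return max(len(chain) for chain in self.table)
```

For example, after `t = ChainedHashTable(8)` and `t.add("alice")`, the call `t.member("alice")` returns `True`. The `longest_chain` helper is included only to make the cost of collisions visible: it reports the worst-case number of keys a DELETE or MEMBER operation may have to scan.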
The question we are interested in here is the following: suppose our hash table $T$ has size $n$, and that our hash function $h$ distributes $U$ evenly over $T$. Assume that the keys we want to store are chosen uniformly at random and independently from the universe $U$. What is the largest number, $m$, of keys we can store before the probability of a collision reaches $\frac{1}{2}$?
Let’s begin by seeing how this problem can be put into the balls and bins framework. The balls will be the $m$ keys to be stored, and the bins will be the $n$ locations in the hash table $T$. Since the keys are chosen uniformly and independently from $U$, and since the hash function distributes keys evenly over the table, we can see each key (ball) as choosing a hash table location (bin) uniformly and independently from $T$. Thus the probability space corresponding to this hashing experiment is exactly the same as the balls and bins space.
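The balls and bins view is easy to simulate. The sketch below throws $m$ balls into $n$ bins uniformly and independently, and estimates the probability of a collision by repeating the experiment many times. The function names, the default number of trials, and the fixed seed are illustrative choices, not part of the notes.

```python
import random


def has_collision(m, n, rng):
    """Throw m balls into n bins uniformly at random; True if some bin gets two balls."""
    occupied = set()
    for _ in range(m):
        b = rng.randrange(n)        # each ball picks a bin uniformly and independently
        if b in occupied:
            return True
        occupied.add(b)
    return False


def estimate_collision_probability(m, n, trials=20000, seed=0):
    """Monte Carlo estimate of Pr[collision] for m balls and n bins."""
    rng = random.Random(seed)       # fixed seed so repeated runs agree
    hits = sum(has_collision(m, n, rng) for _ in range(trials))
    return hits / trials
```

Calling `estimate_collision_probability(m, n)` for a fixed `n` and increasing `m` gives a rough picture of how quickly collisions become likely. The estimate is only approximate, and its accuracy improves as `trials` grows.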
We are interested in the event $A$ that there is no collision, or equivalently, that all $m$ balls land in different bins. Clearly $\Pr[A]$ will decrease as $m$ increases (with $n$ fixed). Our goal is to find the largest value of $m$ such that $\Pr[A]$ remains above $\frac{1}{2}$.
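For very small tables, this goal can be checked by brute force. The sketch below lists every way of placing $m$ balls into $n$ bins, counts the placements in which all balls land in different bins, and then searches for the largest $m$ with $\Pr[A]$ above $\frac{1}{2}$. The function names are hypothetical, and the approach is practical only for tiny $n$ and $m$, since it visits all $n^m$ placements.

```python
from fractions import Fraction
from itertools import product


def pr_no_collision(m, n):
    """Exact Pr[A], found by enumerating all n**m equally likely placements."""
    good = sum(
        1
        for outcome in product(range(n), repeat=m)
        if len(set(outcome)) == m   # all m balls landed in different bins
    )
    return Fraction(good, n ** m)


def largest_m(n):
    """Largest m for which Pr[A] stays above 1/2, trying m = 1, 2, 3, ..."""
    m = 1                           # a single ball can never collide
    while pr_no_collision(m + 1, n) > Fraction(1, 2):
        m += 1
    return m
```

With $n = 10$, `pr_no_collision(4, 10)` is $\frac{63}{125} = 0.504$ and `pr_no_collision(5, 10)` is below $\frac{1}{2}$, so `largest_m(10)` returns 4. Exact fractions are used so that the comparison with $\frac{1}{2}$ is not affected by floating-point rounding.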
Let’s fix the value of $m$ and try to compute $\Pr[A]$. Since our probability space is uniform (each outcome has probability $\frac{1}{n^m}$), it’s enough just to count the number of outcomes in $A$. In how many ways can we arrange $m$ balls in $n$ bins so that no bin contains more than one ball? Well, there are $n$ places to put the first ball, then $n$ …