Copyright © 2009 Ananda Gunawardena
Lecture 15
Introduction to Hashing
Why Hashing?
The Internet has grown to millions of users generating terabytes of content every day.
According to Internet data tracking services, the amount of content on the Internet
doubles every six months. With this kind of growth, it is impossible to find anything on
the Internet unless we develop new data structures and algorithms for storing and
accessing data. So what is wrong with traditional data structures like arrays and linked
lists? Suppose we have a very large data set stored in an array. The amount of time
required to look up an element in the array is either O(log n) or O(n), depending on
whether the array is sorted or not. If the array is sorted, a technique such as binary
search can be used to search the array. Otherwise, the array must be searched linearly.
Either case may not be desirable if we need to process a very large data set. Therefore we
discuss a new technique called hashing that allows us to update and retrieve any entry in
constant time, O(1), on average. Constant time, or O(1), performance means that the amount
of time to perform the operation does not depend on the data size n.
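The two array lookups described above can be sketched as follows. This is only an illustration in Python; the function names and the sample data are assumptions made for the example, not part of the lecture.

```python
from bisect import bisect_left


def linear_search(items, target):
    """Scan each element in turn: O(n) comparisons in the worst case."""
    for index, item in enumerate(items):
        if item == target:
            return index
    return -1  # not found


def binary_search(sorted_items, target):
    """Halve the search range at each step: O(log n), but the input must be sorted."""
    index = bisect_left(sorted_items, target)
    if index < len(sorted_items) and sorted_items[index] == target:
        return index
    return -1  # not found


data = [3, 8, 15, 23, 42, 57]   # sample data (already sorted)
print(linear_search(data, 42))  # 4
print(binary_search(data, 42))  # 4
print(binary_search(data, 10))  # -1
```

In both cases the work grows with n, which is the cost hashing tries to avoid.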
The Map Data Structure
In a mathematical sense, a map is a relation between two sets. We can define a map M as a
set of pairs, where each pair is of the form (key, value) and where, given a key, we can
find its value using some kind of “function” that maps keys to values. That is, the
location for a given key can be calculated using a function called a hash function. In its
simplest form, we can think of an array as a map where the key is the index and the value
is the value at that index. For example, given an array A, if i is the key, then we can
find the value by simply looking up A[i]. The idea of a hash table can be described as
follows.
The concept of a hash table generalizes the idea of an array so that the key does not have
to be an integer. We can have a name as a key, or for that matter any object as the key.
The trick is to find a hash function that computes an index from the key, so that the
object can be stored at a specific location in a table where it can easily be found.
Example:
Suppose we have a set of strings {“abc”, “def”, “ghi”} that we’d like to store in a table.
Our objective here is to find or update them quickly in the table, ideally in O(1) time. We
are not concerned about ordering them or maintaining any order at all. Let us think of a
simple scheme to do this. Suppose we assign “a” = 1, “b” = 2, and so on to all alphabetical
characters. We can then simply compute a number for each of the strings by using the
sum of its characters as follows.
“abc” = 1 + 2 + 3 = 6,
“def” = 4 + 5 + 6 = 15,
“ghi” = 7 + 8 + 9 = 24
Now if we assume that we have a table of size 5 to store these strings, we can compute
the location of the string by taking the sum mod 5. So we will store
“abc” in 6 mod 5 = 1, “def” in 15 mod 5 = 0, and “ghi” in 24 mod 5 = 4.
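The example above can be reproduced with a short Python sketch. The names letter_sum, slot_for, and TABLE_SIZE are hypothetical and chosen only for illustration. The sketch assumes lowercase letters a to z and does nothing about two strings landing in the same slot.

```python
TABLE_SIZE = 5  # table size used in the example


def letter_sum(word):
    """Sum the letter positions, with a=1, b=2, ..., z=26 (lowercase assumed)."""
    return sum(ord(ch) - ord("a") + 1 for ch in word)


def slot_for(word, table_size=TABLE_SIZE):
    """Hash function from the example: letter sum mod table size."""
    return letter_sum(word) % table_size


# Store each string at the slot computed by the hash function.
table = [None] * TABLE_SIZE
for word in ["abc", "def", "ghi"]:
    table[slot_for(word)] = word

print(slot_for("abc"), slot_for("def"), slot_for("ghi"))  # 1 0 4
print(table)  # ['def', 'abc', None, None, 'ghi']
```

Looking up a string then takes one hash computation and one array access, regardless of how many strings are stored. Note that this simple sum sends any rearrangement of the same letters, such as “abc” and “cba”, to the same slot, and the sketch would simply overwrite the earlier entry.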