Similarity Search in High Dimensions via Hashing

Aristides Gionis*    Piotr Indyk†    Rajeev Motwani‡

Department of Computer Science
Stanford University
Stanford, CA 94305
{gionis,indyk,rajeev}@cs.stanford.edu

Abstract

The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the "curse of dimensionality." That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in k-d trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than brute-force linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should suffice for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points from the database so as to ensure that the probability of collision is much higher for objects that are close to each other than for those that are far apart. We provide experimental evidence that our method gives significant improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition. Experimental results also indicate that our scheme scales well even for a relatively large number of dimensions (more than 50).

* Supported by NAVY N00014-96-1-1221 grant and NSF Grant IIS-9811904.
† Supported by Stanford Graduate Fellowship and NSF NYI Award CCR-9357849.
‡ Supported by ARO MURI Grant DAAH04-96-1-0007, NSF Grant IIS-9811904, and NSF Young Investigator Award CCR-9357849, with matching funds from IBM, Mitsubishi, Schlumberger Foundation, Shell Foundation, and Xerox Corporation.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.

1 Introduction

A similarity search problem involves a collection of objects (e.g., documents, images) that are characterized
by a collection of relevant features and represented as points in a high-dimensional attribute space; given queries in the form of points in this space, we are required to find the nearest (most similar) object to the query. The particularly interesting and well-studied case is the d-dimensional Euclidean space. The problem is of major importance to a variety of applications; some examples are: data compression [20]; databases and data mining [21]; information retrieval [11, 16, 38]; image and video databases [15, 17, 37, 42]; machine learning [7]; pattern recognition [9, 13]; and, statistics and data analysis [12, 27]. Typically, the features of the objects of interest are represented as points in $\Re^d$ and a distance metric is used to measure similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to thousands. For example, in multimedia applications such as IBM's QBIC (Query by Image Content), the number of features could be several hundreds [15, 17]. In information retrieval for text documents, vector-space representations involve several thousands of dimensions, and it is considered to be a dramatic improvement that dimension-reduction techniques, such as the Karhunen-Loève transform [26, 30] (also known as principal components analysis [22] or latent semantic indexing [11]), can reduce the dimensionality to a mere few hundreds!

The low-dimensional case (say, for d equal to 2 or
3) is well-solved [14], so the main issue is that of dealing with a large number of dimensions, the so-called "curse of dimensionality." Despite decades of intensive effort, the current solutions are not entirely satisfactory; in fact, for large enough d, in theory or in practice, they provide little improvement over a linear algorithm which compares a query to each point from the database. In particular, it was shown in [45] that, both empirically and theoretically, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions. This situation poses a serious obstacle to the future development of large-scale similarity search systems. Imagine, for example, a search engine which enables content-based image retrieval on the World-Wide Web. If the system was to index a significant fraction of the web, the number of images to index would be at least of the order of tens (if not hundreds) of millions. Clearly, no indexing method exhibiting linear (or close to linear) dependence on the data size could manage such a huge data set.
The premise of this paper is that in many cases it is not necessary to insist on the exact answer; instead, determining an approximate answer should suffice (refer to Section 2 for a formal definition). This observation underlies a large body of recent research in databases, including using random sampling for histogram estimation [8] and median approximation [33], and using wavelets for selectivity estimation [34] and approximate SVD [25]. We observe that there are many applications of nearest neighbor search where an approximate answer is good enough. For example, it often happens (e.g., see [23]) that the relevant answers are much closer to the query point than the irrelevant ones; in fact, this is a desirable property of a good similarity measure. In such cases, the approximate algorithm (with a suitable approximation factor) will return the same result as an exact algorithm. In other situations, an approximate algorithm provides the user with a time-quality tradeoff: the user can decide whether to spend more time waiting for the exact answer, or to be satisfied with a much quicker approximation (e.g., see [5]).
The above arguments rely on the assumption that approximate similarity search can be performed much faster than the exact one. In this paper we show that this is indeed the case. Specifically, we introduce a new indexing method for approximate nearest neighbor with a truly sublinear dependence on the data size even for high-dimensional data. Instead of using space partitioning, it relies on a new method called locality-sensitive hashing (LSH). The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects which are close to each other than for those which are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point. We provide such locality-sensitive hash functions that are simple and easy to implement; they can also be naturally extended to the dynamic setting, i.e., when insertion and deletion operations also need to be supported. Although in this paper we are focused on Euclidean spaces, different LSH functions can also be used for other similarity measures, such as dot product [5].
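As a concrete illustration of this collision property (a minimal sketch of our own, not the data structure developed in Section 3; `make_hash`, `dim`, `k`, and `trials` are illustrative names), the following Python fragment bit-samples binary vectors and estimates collision frequencies for a near pair and a far pair:

```python
import random

def make_hash(dim, k, seed):
    """One sketch hash function: sample k bit positions (with
    replacement); the hash of a binary vector is its bits there."""
    rng = random.Random(seed)
    positions = [rng.randrange(dim) for _ in range(k)]
    return lambda v: tuple(v[i] for i in positions)

# Compare a pair at Hamming distance 1 with a pair at distance dim/2.
dim, k, trials = 64, 8, 1000
p = [0] * dim
near = [0] * dim
near[0] = 1                            # distance 1 from p
far = [i % 2 for i in range(dim)]      # distance 32 from p
hits_near = hits_far = 0
for t in range(trials):
    g = make_hash(dim, k, seed=t)
    hits_near += g(p) == g(near)
    hits_far += g(p) == g(far)
print(hits_near / trials, hits_far / trials)   # about 0.88 vs 0.004
```

Under one such function, a point at Hamming distance r from the query collides with it with probability $(1 - r/\mathrm{dim})^k$; the printed frequencies approximate $(63/64)^8 \approx 0.88$ and $(1/2)^8 \approx 0.004$.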
Locality-Sensitive Hashing was introduced by Indyk and Motwani [24] for the purposes of devising main-memory algorithms for nearest neighbor search; in particular, it enabled us to achieve worst-case $O(dn^{1/\epsilon})$ time for an approximate nearest neighbor query over an n-point database. In this paper we improve that technique and achieve a significantly improved query time of $O(dn^{1/(1+\epsilon)})$. This yields an approximate nearest neighbor algorithm running in sublinear time for any $\epsilon > 0$. Furthermore, we generalize the algorithm and its analysis to the case of external memory.
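For concreteness (our own arithmetic on the bounds just stated, with an illustrative database size): for $\epsilon = 1$ the earlier bound $O(dn^{1/\epsilon})$ degenerates to $O(dn)$, no better than a linear scan, whereas the new bound $O(dn^{1/(1+\epsilon)})$ becomes $O(d\sqrt{n})$; for $n = 10^6$ points this reduces the n-dependent factor from $10^6$ to $10^3$.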
We support our theoretical arguments by empirical evidence. We performed experiments on two data sets. The first contains 20,000 histograms of color images, where each histogram was represented as a point in d-dimensional space, for d up to 64. The second contains around 270,000 points representing texture information of blocks of large aerial photographs. All our tables were stored on disk. We compared the performance of our algorithm with the performance of the Sphere/Rectangle-tree (SR-tree) [28], a recent data structure which was shown to be comparable to or significantly more efficient than other tree-decomposition-based indexing methods for spatial data. The experiments show that our algorithm is significantly faster than the earlier methods, in some cases even by several orders of magnitude. It also scales well as the data size and dimensionality increase. Thus, it enables a new approach to high-performance similarity search: fast retrieval of an approximate answer, possibly followed by a slower but more accurate computation in the few cases where the user is not satisfied with the approximate answer.
The rest of this paper is organized as follows. In Section 2 we introduce the notation and give formal definitions of the similarity search problems. Then in Section 3 we describe locality-sensitive hashing and show how to apply it to nearest neighbor search. In Section 4 we report the results of experiments with LSH. The related work is described in Section 5. Finally, in Section 6 we present conclusions and ideas for
future research.

2 Preliminaries

We use $l_p^d$ to denote the space $\Re^d$ under the $l_p$ norm, i.e., when the length of a vector $(x_1,\ldots,x_d)$ is defined as $(|x_1|^p + \cdots + |x_d|^p)^{1/p}$. Further, $d_p(p,q) = \|p - q\|_p$ denotes the distance between the points p and q in $l_p^d$. We use $H^d$ to denote the Hamming metric space of dimension d, i.e., the space of binary vectors of length d under the standard Hamming metric. We use $d_H(p,q)$ to denote the Hamming distance, i.e., the number of bits on which p and q differ.

The nearest neighbor search problem is defined as follows:

Definition 1 (Nearest Neighbor Search (NNS)) Given a set P of n objects represented as points in a normed space $l_p^d$, preprocess P so as to efficiently answer queries by finding the point in P closest to a query point q.

The definition generalizes naturally to the case where we want to return K > 1 points. Specifically, in the K-Nearest Neighbors Search (K-NNS), we wish to return the K points in the database that are closest to the query point. The approximate version of the NNS problem is defined as follows:

Definition 2 ($\epsilon$-Nearest Neighbor Search ($\epsilon$-NNS)) Given a set P of points in a normed space $l_p^d$, preprocess P so as to efficiently return a point $p \in P$ for any given query point q, such that $d(q,p) \le (1+\epsilon)\,d(q,P)$, where $d(q,P)$ is the distance of q to its closest point in P.

Again, this definition generalizes naturally to finding K > 1 approximate nearest neighbors. In the Approximate K-NNS problem, we wish to find K points $p_1,\ldots,p_K$ such that the distance of $p_i$ to the query q is at most $(1+\epsilon)$ times the distance from the i-th nearest point to q.

3 The Algorithm

In this section we present efficient solutions to the approximate versions of the NNS problem. Without significant loss of generality, we will make the following two assumptions about the data:

1. the distance is defined by the $l_1$ norm (see comments below),
2. all coordinates of points in P are positive integers.

The first assumption is not very restrictive, as usually there is no clear advantage in, or even difference between, using the $l_2$ or $l_1$ norm for similarity search. For example, the experiments done for the Webseek [43] project (see [40], chapter 4) show that comparing color histograms using $l_1$ and $l_2$ norms yields very similar results ($l_1$ is marginally better). Both our data sets (see Section 4) have a similar property. Specifically, we observed that a nearest neighbor of an average query point computed under the $l_1$ norm was also an $\epsilon$-approximate neighbor under the $l_2$ norm with an average value of $\epsilon$ less than 3% (this observation holds for both data sets). Moreover, in most cases (i.e., for 67% of the queries in the first set and 73% in the second set) the nearest neighbors under the $l_1$ and $l_2$ norms were exactly the same. This observation is interesting in its own right, and can be partially explained via the theorem of Figiel et al. (see [19] and references therein). They showed analytically that by simply applying scaling and random rotation to the space $l_2$, we can make the distances induced by the $l_1$ and $l_2$ norms almost equal up to an arbitrarily small factor. It seems plausible that real data is already randomly rotated, thus the difference between the $l_1$ and $l_2$ norms is very small. Moreover, for the data sets for which this property does not hold, we are guaranteed that after performing scaling and random rotation our algorithms can be used for the $l_2$ norm with arbitrarily small loss of precision.

As far as the second assumption is concerned, clearly all coordinates can be made positive by properly translating the origin of $\Re^d$. We can then convert all coordinates to integers by multiplying them by a suitably large number and rounding to the nearest integer. It can be easily verified that by choosing proper parameters, the error induced by rounding can be made arbitrarily small. Notice that after this operation the minimum interpoint distance is 1.

3.1 Locality-Sensitive Hashing

In this section we present locality-sensitive hashing
(LSH). This technique was originally introduced by Indyk and Motwani [24] for the purposes of devising main-memory algorithms for the $\epsilon$-NNS problem. Here we give an improved version of their algorithm. The new algorithm is in many respects more natural than the earlier one: it does not require the hash buckets to store only one point; it has better running-time guarantees; and, the analysis is generalized to the case of secondary memory.
Let C be the largest coordinate in all points in P. Then, as per [29], we can embed P into the Hamming cube $H^{d'}$ with $d' = Cd$, by transforming each point $p = (x_1,\ldots,x_d)$ into a binary vector

$$v(p) = \mathrm{Unary}_C(x_1)\,\ldots\,\mathrm{Unary}_C(x_d),$$

where $\mathrm{Unary}_C(x)$ denotes the unary representation of x, i.e., a sequence of x ones followed by $C - x$ zeroes.

Fact 1 For any pair of points p, q with coordinates in the set $\{1,\ldots,C\}$,

$$d_1(p,q) = d_H(v(p), v(q)).$$

That is, the embedding preserves the distances between the points. Therefore, in the sequel we can concentrate on solving $\epsilon$-NNS in the Hamming space $H^{d'}$. However, we emphasize that we do not need to actually convert the data to the unary representation, which could be expensive when C is large; in fact, all our algorithms can be made to run in time independent of C. Rather, the unary representation provides us with a convenient framework for the description of the algorithms, which would otherwise be more complicated.
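As a quick check of Fact 1, here is a small sketch (our own code; `unary`, `unary_embed`, and the toy points are illustrative, and, as just noted, a real implementation would avoid materializing these vectors):

```python
def unary(x, C):
    """Unary_C(x): x ones followed by C - x zeroes."""
    return [1] * x + [0] * (C - x)

def unary_embed(p, C):
    """v(p): concatenation of the unary codes of all d coordinates."""
    return [b for x in p for b in unary(x, C)]

def d1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def d_hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

# Fact 1 on a toy pair: both distances equal 8.
p, q, C = [3, 1, 4], [1, 5, 2], 5
assert d1(p, q) == d_hamming(unary_embed(p, C), unary_embed(q, C)) == 8
```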
We define the hash functions as follows. For an integer l to be specified later, choose l subsets $I_1,\ldots,I_l$ of $\{1,\ldots,d'\}$. Let $p_{|I}$ denote the projection of vector p on the coordinate set I, i.e., we compute $p_{|I}$ by selecting the coordinate positions as per I and concatenating the bits in those positions; for example, if p = 10110 and I = {1, 4, 5}, then $p_{|I}$ = 110. Denote $g_j(p) = p_{|I_j}$. For the preprocessing, we store each $p \in P$ in the bucket $g_j(p)$, for $j = 1,\ldots,l$. As the total number of buckets may be large, we compress the buckets by resorting to standard hashing. Thus, we use two levels of hashing: the LSH function maps a point p to bucket $g_j(p)$, and a standard hash function maps the contents of these buckets into a hash table of size M. The maximal bucket size of the latter hash table is denoted by B. For the algorithm's analysis, we will assume hashing with chaining, i.e., when the number of points in a bucket exceeds B, a new bucket (also of size B) is allocated and linked to and from the old bucket. However, our implementation does not employ chaining, but relies on a simpler approach: if a bucket in a given index is full, a new point cannot be added to it, since it will be added to some other index with high probability. This saves us the overhead of maintaining the link structure.
The number n of points, the size M of the hash table, and the maximum bucket size B are related by the following equation:

$$M = \alpha\,\frac{n}{B},$$

where $\alpha$ is the memory utilization parameter, i.e., the ratio of the memory allocated for the index to the size of the data set.
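As an illustration (hypothetical parameter values, not the paper's experimental settings): with $n = 200{,}000$ points, bucket capacity $B = 100$, and $\alpha = 2$ (an index twice the size of the data set), the hash table would have size $M = \alpha n / B = 4{,}000$.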
To process a query q, we search all indices $g_1(q),\ldots,g_l(q)$ until we either encounter at least $c \cdot l$ points (for c specified later) or use all l indices. Clearly, the number of disk accesses is always upper bounded by the number of indices, which is equal to l. Let $p_1,\ldots,p_t$ be the points encountered in the process. For Approximate K-NNS, we output the K points $p_i$ closest to q; in general, we may return fewer points if the number of points encountered is less than K.
It remains to specify the choice of the subsets $I_j$. For each $j \in \{1,\ldots,l\}$, the set $I_j$ consists of k elements from $\{1,\ldots,d'\}$ sampled uniformly at random with replacement. The optimal value of k is chosen to maximize the probability that a point p "close" to q will fall into the same bucket as q, and also to minimize the probability that a point p' "far away" from q will fall into the same bucket. The choice of the values
of l and k is deferred to the next section.

Algorithm Preprocessing
Input:   A set of points P,
         l (number of hash tables)
Output:  Hash tables T_i, i = 1, ..., l
Foreach i = 1, ..., l
    Initialize hash table T_i by generating
    a random hash function g_i(.)
Foreach i = 1, ..., l
    Foreach j = 1, ..., n
        Store point p_j in bucket g_i(p_j) of hash table T_i

Figure 1: Preprocessing algorithm for points already embedded in the Hamming cube.

Algorithm Approximate Nearest Neighbor Query
Input:   A query point q,
         K (number of approximate nearest neighbors)
Access:  Hash tables T_i, i = 1, ..., l, generated by the preprocessing algorithm
Output:  K (or fewer) approximate nearest neighbors
S ← ∅
Foreach i = 1, ..., l
    S ← S ∪ {points found in bucket g_i(q) of table T_i}
Return the K nearest neighbors of q found in set S
/* Can be found by main memory linear search */

Figure 2: Approximate Nearest Neighbor query answering algorithm.
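The two figures translate almost line for line into code. Below is a sketch under our own simplifications (the class name `LSHIndex` is ours; Python dictionaries stand in for the size-M second-level hash table, so the bucket capacity B and the overflow rule described above are omitted, as is the early cutoff at $c \cdot l$ retrieved points):

```python
import random
from collections import defaultdict

class LSHIndex:
    """Sketch of Figures 1 and 2 for points already in the Hamming cube."""

    def __init__(self, points, d_prime, k, l, seed=0):
        rng = random.Random(seed)
        # Each g_i projects a d'-bit vector onto k coordinates sampled
        # uniformly at random with replacement (Section 3.1).
        self.subsets = [[rng.randrange(d_prime) for _ in range(k)]
                        for _ in range(l)]
        self.points = points
        # Preprocessing (Figure 1): store each point in bucket g_i(p).
        self.tables = [defaultdict(list) for _ in range(l)]
        for j, p in enumerate(points):
            for i, I in enumerate(self.subsets):
                self.tables[i][self._g(p, I)].append(j)

    def _g(self, p, I):
        return tuple(p[c] for c in I)

    def query(self, q, K):
        # Query (Figure 2): union the buckets g_i(q), rank by distance.
        S = set()
        for i, I in enumerate(self.subsets):
            S.update(self.tables[i].get(self._g(q, I), ()))

        def dist(j):
            return sum(a != b for a, b in zip(self.points[j], q))

        return sorted(S, key=dist)[:K]   # may return fewer than K points

# Usage: idx = LSHIndex(points, d_prime=64, k=8, l=10); idx.query(q, K=5)
```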
Although we are mainly interested in the I/O complexity of our scheme, it is worth pointing out that the hash functions can be efficiently computed if the data set is obtained by mapping $l_1^d$ into the d'-dimensional Hamming space. Let p be any point from the data set and let p' denote its image after the mapping. Let I be the set of coordinates and recall that we need to compute $p'_{|I}$. For $i = 1,\ldots,d$, let $I_{|i}$ denote, in sorted order, the coordinates in I which correspond to the i-th coordinate of p. Observe that projecting p' on $I_{|i}$ results in a sequence of bits which is monotone, i.e., consists of a number, say $o_i$, of ones followed by zeros. Therefore, in order to represent $p'_{|I}$ it is sufficient to compute $o_i$ for $i = 1,\ldots,d$. However, the latter task is equivalent to finding the number of elements in the sorted array $I_{|i}$ which are smaller than a given value, i.e., the i-th coordinate of p. This can be done via binary search in $O(\log C)$ time, or even in constant time using a precomputed array of C bits. Thus, the total time needed to compute the function is either $O(d \log C)$ or $O(d)$, depending on the resources used. In our experimental section, the value of C can be made very small, and therefore we will resort to the second method.
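A sketch of this trick (our own code and names; `project_without_unary` and `offsets_by_coord` are hypothetical, and positions inside each coordinate's unary block are taken as 0-based offsets):

```python
from bisect import bisect_left

def project_without_unary(p, offsets_by_coord):
    """Compute the projection of v(p) onto I without building v(p).

    offsets_by_coord[i] holds, in sorted order, the sampled positions
    falling inside coordinate i's unary block, as offsets in 0..C-1.
    Bit t of Unary_C(x) is 1 iff t < x, so the projection onto those
    positions is monotone: o_i ones followed by zeroes, where o_i is
    the number of offsets smaller than p[i] -- one binary search each.
    """
    bits = []
    for x, offsets in zip(p, offsets_by_coord):
        o = bisect_left(offsets, x)           # o_i in O(log C) time
        bits.extend([1] * o + [0] * (len(offsets) - o))
    return tuple(bits)

# Example: C = 5, coordinate value 3 -> Unary_5(3) = 11100; sampling
# offsets [1, 3] within that block projects to bits (1, 0).
print(project_without_unary([3], [[1, 3]]))   # (1, 0)
```

Note that the bits come out grouped by coordinate rather than in I's original order; applied consistently at preprocessing and query time, this yields an equivalent bucket key.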
For quick reference we summarize the preprocessing and query answering algorithms in Figures 1 and 2.

3.2 ...
