Similarity Search in High Dimensions via Hashing

Aristides Gionis*, Piotr Indyk†, Rajeev Motwani‡

Department of Computer Science
Stanford University
Stanford, CA 94305

Abstract

The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the "curse of dimensionality." That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in k-d trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than brute-force linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should suffice for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points from the database so as to ensure that the probability of collision is much higher for objects that are close to each other than for those that are far apart. We provide experimental evidence that our method gives significant improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition. Experimental results also indicate that our scheme scales well even for a relatively large number of dimensions (more than 50).

* Supported by NAVY N00014-96-1-1221 grant and NSF Grant IIS-9811904.
† Supported by Stanford Graduate Fellowship and NSF NYI Award CCR-9357849.

1 Introduction
‡ Supported by ARO MURI Grant DAAH04-96-1-0007, NSF Grant IIS-9811904, and NSF Young Investigator Award CCR-9357849, with matching funds from IBM, Mitsubishi, Schlumberger Foundation, Shell Foundation, and Xerox Corporation.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.

A similarity search problem involves a collection of objects (e.g., documents, images) that are characterized by a collection of relevant features and represented as points in a high-dimensional attribute space; given queries in the form of points in this space, we are required to find the nearest (most similar) object to the query. The particularly interesting and well-studied case is the d-dimensional Euclidean space. The problem is of major importance to a variety of applications; some examples are: data compression [20]; databases and data mining [21]; information retrieval [11, 16, 38]; image and video databases [15, 17, 37, 42]; machine learning [7]; pattern recognition [9, 13]; and statistics and data analysis [12, 27]. Typically, the features of the objects of interest are represented as points in R^d and a distance metric is used to measure similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to thousands. For example, in multimedia applications such as IBM's QBIC (Query by Image Content), the number of features could be several hundreds [15, 17].
In information retrieval for text documents, vector-space representations involve several thousands of dimensions, and it is considered a dramatic improvement that dimension-reduction techniques, such as the Karhunen-Loeve transform [26, 30] (also known as principal components analysis [22] or latent semantic indexing [11]), can reduce the dimensionality to a mere few hundreds! The low-dimensional case (say, for d equal to 2 or 3) is well solved [14], so the main issue is that of dealing with a large number of dimensions, the so-called "curse of dimensionality." Despite decades of intensive effort, the current solutions are not entirely satisfactory; in fact, for large enough d, in theory or in practice, they provide little improvement over a linear algorithm which compares a query to each point from the database. In particular, it was shown in [45] that, both empirically and theoretically, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions. This situation poses a serious obstacle to the future development of large-scale similarity search systems. Imagine, for example, a search engine which enables content-based image retrieval on the World-Wide Web. If the system were to index a significant fraction of the web, the number of images to index would be at least on the order of tens (if not hundreds) of millions. Clearly, no indexing method exhibiting linear (or close to linear) dependence on the data size could manage such a huge data set.

The premise of this paper is that in many cases it is not necessary to insist on the exact answer; instead, determining an approximate answer should suffice (refer to Section 2 for a formal definition). This observation underlies a large body of recent research in databases, including using random sampling for histogram estimation [8] and median approximation [33], and using wavelets for selectivity estimation [34] and approximate SVD [25].
We observe that there are many applications of nearest neighbor search where an approximate answer is good enough. For example, it often happens (e.g., see [23]) that the relevant answers are much closer to the query point than the irrelevant ones; in fact, this is a desirable property of a good similarity measure. In such cases, the approximate algorithm (with a suitable approximation factor) will return the same result as an exact algorithm. In other situations, an approximate algorithm provides the user with a time-quality tradeoff: the user can decide whether to spend more time waiting for the exact answer, or to be satisfied with a much quicker approximation (e.g., see [5]).

The above arguments rely on the assumption that approximate similarity search can be performed much faster than the exact one. In this paper we show that this is indeed the case. Specifically, we introduce a new indexing method for approximate nearest neighbor with a truly sublinear dependence on the data size even for high-dimensional data. Instead of using space partitioning, it relies on a new method called locality-sensitive hashing (LSH). The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects which are close to each other than for those which are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point. We provide such locality-sensitive hash functions that are simple and easy to implement; they can also be naturally extended to the dynamic setting, i.e., when insertion and deletion operations also need to be supported. Although in this paper we focus on Euclidean spaces, different LSH functions can also be used for other similarity measures, such as dot product [5].
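The collision property behind LSH can be illustrated on bit vectors: hashing by a single uniformly random coordinate makes two points collide with probability exactly 1 - d_H(p, q)/d, so closer points collide more often. The following sketch (our illustration, not code from the paper) estimates this probability empirically.

```python
import random

def single_coordinate_collision_rate(p, q, trials=20000, seed=1):
    """Estimate Pr[h(p) == h(q)] for the hash h(x) = x[i], where the
    coordinate i is drawn uniformly at random; the exact probability
    is 1 - hamming(p, q) / d, so nearby points collide more often."""
    rng = random.Random(seed)
    d = len(p)
    hits = 0
    for _ in range(trials):
        i = rng.randrange(d)
        if p[i] == q[i]:
            hits += 1
    return hits / trials
```

For a query q, a point differing from q in 1 of 8 bits collides at rate about 7/8, while a point differing in 4 of 8 bits collides at rate about 1/2; concatenating several such coordinates (and using several tables) amplifies the gap.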
Locality-Sensitive Hashing was introduced by Indyk and Motwani [24] for the purposes of devising main memory algorithms for nearest neighbor search; in particular, it enabled us to achieve worst-case O(dn^(1/ε)) time for approximate nearest neighbor queries over an n-point database. In this paper we improve that technique and achieve a significantly improved query time of O(dn^(1/(1+ε))). This yields an approximate nearest neighbor algorithm running in sublinear time for any ε > 0. Furthermore, we generalize the algorithm and its analysis to the case of external memory.

We support our theoretical arguments by empirical evidence. We performed experiments on two data sets. The first contains 20,000 histograms of color images, where each histogram was represented as a point in d-dimensional space, for d up to 64. The second contains around 270,000 points representing texture information of blocks of large aerial photographs. All our tables were stored on disk. We compared the performance of our algorithm with the performance of the Sphere/Rectangle-tree (SR-tree) [28], a recent data structure which was shown to be comparable to or significantly more efficient than other tree-decomposition-based indexing methods for spatial data. The experiments show that our algorithm is significantly faster than the earlier methods, in some cases even by several orders of magnitude. It also scales well as the data size and dimensionality increase. Thus, it enables a new approach to high-performance similarity search: fast retrieval of an approximate answer, possibly followed by a slower but more accurate computation in the few cases where the user is not satisfied with the approximate answer.

The rest of this paper is organized as follows. In Section 2 we introduce the notation and give formal definitions of the similarity search problems. Then in Section 3 we describe locality-sensitive hashing and show how to apply it to nearest neighbor search. In Section 4 we report the results of experiments with LSH.
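To see what the improved bound buys, a back-of-the-envelope sketch (ours, ignoring the factor d and all constants): with n = 10^6 points and ε = 1, the exponent 1/(1+ε) gives on the order of n^(1/2) = 1,000 candidates to examine rather than all 10^6.

```python
def candidate_scale(n, eps):
    """Illustrative candidate count n^(1/(1+eps)) from the
    O(d * n^(1/(1+eps))) query-time bound; d and constants are ignored."""
    return n ** (1.0 / (1.0 + eps))
```

Larger ε (a looser approximation guarantee) shrinks the exponent further, which is the time-quality tradeoff discussed above.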
The related work is described in Section 5. Finally, in Section 6 we present conclusions and ideas for future research.

2 Preliminaries

We use l_p^d to denote the Euclidean space R^d under the l_p norm, i.e., when the length of a vector (x_1, ..., x_d) is defined as (|x_1|^p + ... + |x_d|^p)^(1/p). Further, d_p(p, q) = ||p - q||_p denotes the distance between the points p and q in l_p^d. We use H^d to denote the Hamming metric space of dimension d, i.e., the space of binary vectors of length d under the standard Hamming metric. We use d_H(p, q) to denote the Hamming distance, i.e., the number of bits on which p and q differ.

The nearest neighbor search problem is defined as follows:

Definition 1 (Nearest Neighbor Search (NNS)) Given a set P of n objects represented as points in a normed space l_p^d, preprocess P so as to efficiently answer queries by finding the point in P closest to a query point q.

The definition generalizes naturally to the case where we want to return K > 1 points. Specifically, in the K-Nearest Neighbors Search (K-NNS), we wish to return the K points in the database that are closest to the query point. The approximate version of the NNS problem is defined as follows:

Definition 2 (ε-Nearest Neighbor Search (ε-NNS)) Given a set P of points in a normed space l_p^d, preprocess P so as to efficiently return a point p ∈ P for any given query point q, such that d(q, p) ≤ (1 + ε) d(q, P), where d(q, P) is the distance of q to its closest point in P.

Again, this definition generalizes naturally to finding K > 1 approximate nearest neighbors. In the Approximate K-NNS problem, we wish to find K points p_1, ..., p_K such that the distance of p_i to the query q is at most (1 + ε) times the distance from the ith nearest point to q.

3 The Algorithm

In this section we present efficient solutions to the approximate versions of the NNS problem. Without significant loss of generality, we will make the following two assumptions about the data:

1. the distance is defined by the l_1 norm (see comments below),
2. all coordinates of points in P are positive integers.

The first assumption is not very restrictive, as usually there is no clear advantage in, or even difference between, using the l_2 or l_1 norm for similarity search. For example, the experiments done for the Webseek [43] project (see [40], chapter 4) show that comparing color histograms using l_1 and l_2 norms yields very similar results (l_1 is marginally better). Both our data sets (see Section 4) have a similar property. Specifically, we observed that a nearest neighbor of an average query point computed under the l_1 norm was also an ε-approximate neighbor under the l_2 norm with an average value of ε less than 3% (this observation holds for both data sets). Moreover, in most cases (i.e., for 67% of the queries in the first set and 73% in the second set) the nearest neighbors under the l_1 and l_2 norms were exactly the same. This observation is interesting in its own right, and can be partially explained via the theorem by Figiel et al. (see [19] and references therein). They showed analytically that by simply applying scaling and random rotation to the space l_2, we can make the distances induced by the l_1 and l_2 norms almost equal up to an arbitrarily small factor. It seems plausible that real data is already randomly rotated, thus the difference between the l_1 and l_2 norms is very small. Moreover, for the data sets for which this property does not hold, we are guaranteed that after performing scaling and random rotation our algorithms can be used for the l_2 norm with an arbitrarily small loss of precision.

As far as the second assumption is concerned, clearly all coordinates can be made positive by properly translating the origin of R^d. We can then convert all coordinates to integers by multiplying them by a suitably large number and rounding to the nearest integer. It can be easily verified that by choosing proper parameters, the error induced by rounding can be made arbitrarily small. Notice that after this operation the minimum interpoint distance is 1.

3.1 Locality-Sensitive Hashing

In this section we present locality-sensitive hashing (LSH). This technique was originally introduced by Indyk and Motwani [24] for the purposes of devising main memory algorithms for the ε-NNS problem. Here we give an improved version of their algorithm. The new algorithm is in many respects more natural than the earlier one: it does not require the hash buckets to store only one point; it has better running time guarantees; and the analysis is generalized to the case of secondary memory.

Let C be the largest coordinate in all points in P. Then, as per [29], we can embed P into the Hamming cube H^{d'} with d' = Cd, by transforming each point p = (x_1, ..., x_d) into a binary vector

v(p) = Unary_C(x_1) ... Unary_C(x_d),

where Unary_C(x) denotes the unary representation of x, i.e., a sequence of x ones followed by C - x zeroes.

Fact 1 For any pair of points p, q with coordinates in the set {1, ..., C},

d_1(p, q) = d_H(v(p), v(q)).

That is, the embedding preserves the distances between the points. Therefore, in the sequel we can concentrate on solving ε-NNS in the Hamming space H^{d'}. However, we emphasize that we do not need to actually convert the data to the unary representation, which could be expensive when C is large; in fact, all our algorithms can be made to run in time independent of C. Rather, the unary representation provides us with a convenient framework for the description of the algorithms, which would be more complicated otherwise.

We define the hash functions as follows. For an integer l to be specified later, choose l subsets I_1, ..., I_l of {1, ..., d'}.
Let p|I denote the projection of vector p on the coordinate set I, i.e., p|I is computed by selecting the coordinate positions as per I and concatenating the bits in those positions. Denote g_j(p) = p|I_j. For the preprocessing, we store each p ∈ P in the bucket g_j(p), for j = 1, ..., l. As the total number of buckets may be large, we compress the buckets by resorting to standard hashing. Thus, we use two levels of hashing: the LSH function maps a point p to bucket g_j(p), and a standard hash function maps the contents of these buckets into a hash table of size M. The maximal bucket size of the latter hash table is denoted by B. For the algorithm's analysis, we will assume hashing with chaining, i.e., when the number of points in a bucket exceeds B, a new bucket (also of size B) is allocated and linked to and from the old bucket. However, our implementation does not employ chaining, but relies on a simpler approach: if a bucket in a given index is full, a new point cannot be added to it, since it will be added to some other index with high probability. This saves us the overhead of maintaining the link structure.

The number n of points, the size M of the hash table, and the maximum bucket size B are related by the following equation:

M = α n / B,

where α is the memory utilization parameter, i.e., the ratio of the memory allocated for the index to the size of the data set.

To process a query q, we search all indices g_1(q), ..., g_l(q) until we either encounter at least c·l points (for c specified later) or use all l indices. Clearly, the number of disk accesses is always upper-bounded by the number of indices, which is equal to l. Let p_1, ..., p_t be the points encountered in the process. For Approximate K-NNS, we output the K points p_i closest to q; in general, we may return fewer points if the number of points encountered is less than K. It remains to specify the choice of the subsets I_j.
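To make the embedding and the hash functions concrete, here is a minimal sketch (our illustration, 0-indexed, not the paper's implementation). It also lets one check Fact 1 directly: the unary embedding turns l_1 distance into Hamming distance, and g(p) = p|I simply gathers the bits at the sampled positions.

```python
import random

def unary(x, C):
    """Unary_C(x): a sequence of x ones followed by C - x zeroes."""
    return (1,) * x + (0,) * (C - x)

def embed(p, C):
    """v(p) = Unary_C(x_1) ... Unary_C(x_d), as a flat bit tuple."""
    return tuple(b for x in p for b in unary(x, C))

def l1(p, q):
    """l_1 distance between integer points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def hamming(u, v):
    """Hamming distance between equal-length bit vectors."""
    return sum(a != b for a, b in zip(u, v))

def make_g(d_prime, k, rng):
    """g(p) = p|I for a k-element coordinate set I of {0, ..., d'-1},
    sampled uniformly at random."""
    I = [rng.randrange(d_prime) for _ in range(k)]
    return lambda bits: tuple(bits[i] for i in I)
```

In the two-level scheme described above, the key returned by g would then be mapped by a standard hash function into a table of size M = αn/B.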
For each j ∈ {1, ..., l}, the set I_j consists of k elements of {1, ..., d'} sampled uniformly at random with replacement. The optimal value of k is chosen to maximize the probability that a point p "close" to q will fall into the same bucket as q, and also to minimize the probability that a point p' "far away" from q will fall into the same bucket. The choice of the values of l and k is deferred to the next section.

Algorithm Preprocessing
Input: A set of points P, l (number of hash tables)
Output: Hash tables T_i, i = 1, ..., l
  Foreach i = 1, ..., l
    Initialize hash table T_i by generating a random hash function g_i(·)
  Foreach i = 1, ..., l
    Foreach j = 1, ..., n
      Store point p_j in bucket g_i(p_j) of hash table T_i

Figure 1: Preprocessing algorithm for points already embedded in the Hamming cube.

Algorithm Approximate Nearest Neighbor Query
Input: A query point q, K (number of approximate nearest neighbors)
Access: Hash tables T_i, i = 1, ..., l, generated by the preprocessing algorithm
Output: K (or fewer) approximate nearest neighbors
  S ← ∅
  Foreach i = 1, ..., l
    S ← S ∪ {points found in bucket g_i(q) of table T_i}
  Return the K nearest neighbors of q found in set S
  /* Can be found by main memory linear search */

Figure 2: Approximate Nearest Neighbor query answering algorithm.

Although we are mainly interested in the I/O complexity of our scheme, it is worth pointing out that the hash functions can be efficiently computed if the data set is obtained by mapping l_1^d into d'-dimensional Hamming space. Let p be any point from the data set and let p' denote its image after the mapping. Let I be the set of coordinates and recall that we need to compute p'|I. For i = 1, ..., d, let I_i denote, in sorted order, the coordinates in I which correspond to the ith coordinate of p. Observe that projecting p' on I_i results in a sequence of bits which is monotone, i.e., consists of a number, say o_i, of ones followed by zeros.
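Concretely (a 0-indexed illustrative helper of ours, not the paper's code): bit t of Unary_C(x) equals 1 exactly when t < x, so the count o_i of ones at the sorted positions belonging to coordinate i is the number of those positions below x_i, which binary search finds without ever materializing the unary vector.

```python
from bisect import bisect_left

def ones_count(x, positions):
    """o_i for coordinate value x and the sorted 0-indexed offsets
    (within Unary_C(x)) sampled for that coordinate: bit t of Unary_C(x)
    is 1 iff t < x, so o_i = #{t in positions : t < x}, found by binary
    search in O(log C) time."""
    return bisect_left(positions, x)
```

For example, with C = 5 and sampled offsets [0, 2, 4], the bits of Unary_5(3) = 11100 at those offsets are 1, 1, 0, and indeed ones_count(3, [0, 2, 4]) returns 2.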
Therefore, in order to represent p'|I it is sufficient to compute o_i for i = 1, ..., d. However, the latter task is equivalent to finding the number of elements in the sorted array I_i which are smaller than a given value, i.e., the ith coordinate of p. This can be done via binary search in log C time, or even in constant time using a precomputed array of C bits. Thus, the total time needed to compute the function is either O(d log C) or O(d), depending on the resources used. In our experimental setting, the value of C can be made very small, and therefore we resort to the second method. For quick reference, we summarize the preprocessing and query answering algorithms in Figures 1 and 2.

3.2 ...
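The two procedures of Figures 1 and 2 can be sketched in Python as follows (our minimal in-memory illustration: a plain dict stands in for the size-M second-level hash table, and the bucket-capacity bound B and the c·l stopping rule of the disk-based implementation are omitted). Parameter names follow the text: l tables, each keyed by k bit positions sampled with replacement from {0, ..., d'-1}.

```python
import random

def preprocess(points, l, k, d_prime, seed=0):
    """Figure 1: build l hash tables; g_i projects a bit vector onto k
    coordinates sampled uniformly at random with replacement."""
    rng = random.Random(seed)
    gs = [[rng.randrange(d_prime) for _ in range(k)] for _ in range(l)]
    tables = [{} for _ in range(l)]
    for p in points:
        for I, T in zip(gs, tables):
            # Second-level hashing is simulated by the dict itself.
            T.setdefault(tuple(p[i] for i in I), []).append(p)
    return gs, tables

def knn_query(q, K, gs, tables):
    """Figure 2: take the union S of the buckets g_i(q), then return the
    K candidates closest to q by a linear scan (Hamming distance)."""
    S = []
    for I, T in zip(gs, tables):
        for p in T.get(tuple(q[i] for i in I), []):
            if p not in S:
                S.append(p)
    S.sort(key=lambda p: sum(a != b for a, b in zip(p, q)))
    return S[:K]
```

A point identical to the query always lands in the same bucket in every table, and near points collide in at least one table with probability controlled by the choice of k and l.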