This preview shows page 1. Sign up to view the full content.
Unformatted text preview: In uence Sets Based on Rev erse Nearest Neigh Queries bor Flip Korn AT&T Labs-Researc h S. Muth ukrishnan AT&T Labs-Researc h email@example.com firstname.lastname@example.org Inherent in the operation of many decision support and continuous referral systems is the notion of the in uence" of a data point on the database. This notion arises in examples such as nding the set of customers a ected by the opening of a new store outlet location, notifying the subset of subscribers to a digital library who will nd a newly added document most relevant, etc. Standard approaches to determining the in uence set of a data point involve range searching and nearest neighbor queries. In this paper, we formalize a novel notion of in uence based on reverse neighbor queries and its variants. Since the nearest neighbor relation is not symmetric, the set of points that are closest to a query point i.e., the nearest neighbors di ers from the set of points that have the query point as their nearest neighbor called the reverse nearest neighbors. In uence sets based on reverse nearest neighbor RNN queries seem to capture the intuitive notion of in uence from our motivating examples. We present a general approach for solving RNN queries and an e cient R-tree based method for large data sets, based on this approach. Although the RNN query appears to be natural, it has not been studied previously. RNN queries are of independent interest, and as such should be part of the suite of available queries for processing spatial and multimedia data. In our experiments with real geographical data, the proposed method appears to scale logarithmically, whereas straightforward sequential scan scales linearly. Our experimental study also shows that approaches based on range searching or nearest neighbors are ine ective at nding in uence sets of our interest. Abstract in uence of a new store outlet or the in uence of a new document to a repository. The concept of in uence depends on the application at hand and is often di cult to formalize. W e rst develop an intuitive notion of in uence sets through examples to motivate our formalization of it. The following two examples are drawn from spatial domains. are many factors that may contribute to a clientele adopting one outlet over another, but a simple premise is to base it on the geographical proximity to the customers. Consider a marketing application in which the issue is to determine the business impact of opening an outlet of Company A at a given location. A simple task is to determine the segment of A's customers who would be likely to use this new facility. Alternatively, one may wish to determine the segment of customers of Company B say A's competitor who are likely to nd the new facility more convenient than the locations of B . Such segments of customers are loosely what we would like to refer to as in uence sets. 2 Example 1 Decision Support Systems: There A fundamental task that arises in various marketing and decision support systems is to determine the in uence" of a data point on the database, for example, the
Permission to make digital or hard copies of part or all of this work or personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. MOD 2000, Dallas, T X USA ACM 2000 1-58113-218-2/00/05 . . .$5.00 1 Introduction Example 2 Con tinuous Referral Systems: Consider a referral service wherein a user can specify a street address, and the system returns a list of the ve closest FedExTM drop-o locations.1 A responsible referral service may wish to give the option e.g., by clicking a button to make this a continuous query, that is, to request the system to notify the user when this list changes. The referral service will then notify those users whose list changes due to the opening of a closer FedEx drop-o location or the closing of an existing one. When such an event happens, the users who need to be updated correspond to our notion of the in uence set of the added or dropped location. 2 Both examples above reinforce the notion of the in uence set of a data point in terms of geographical proximity. This concept of in uence sets is inherent
1 See this. http: www.fedex.com us dropoff for a realization of 201 in many other decision support situations and referral services for which there is no underlying spatial or geographical distance, but for which there is a notion of similarity based on the vector space model in which distance" between vectors is taken as a measure of dissimilarity. The following two examples provide illustration. Example 3 Pro le-based Marketing: A com- pany may wish to keep pro les of its customers' interests so that it can gear a new service towards most customers. For example, suppose AT&T launches a new wireless service. The service may be abstracted a feature vector e.g., covers New England area, free local calling on weekends, best for $100-per-month users. The issue is which customers will nd this the most suitable plan for their calling patterns; these customers form the in uence set of the new service. One approach is to identify such users based on the distance between their pro les and the feature vector representing the new service. 2 When a new report is led, it may be desirable to alert the authors of other TRs who would likely nd the document interesting based on similarity to their publications; the set of all such authors corresponds to the notion of in uence set we have been developing so far. Here, the in uence set is de ned based on the similarity between text documents which has been well-explored in the Information Retrieval community. Other similar scenarios abound, such as in a repository of Web pages, precendent legal cases, etc. 2 Let us now make the notion of an in uence set more precise. We start with a data set S , some suitable de nition of distance between points in S , and a query point q; the goal is to nd the subset of points in S in uenced by q. Two suggestions present themselves immediately. The rst is to use range queries wherein one speci es a threshold radius from q, and all points within are returned. The second is to use the well known concept of nearest neighbors NN, or, more generally, k-nearest neighbors wherein one speci es k, and the k closest points to q are returned. Both of these suggestions fall short of capturing the intuitive notion of in uence we have so far developed. In both cases, parameters have to be engineered to yield an appropriate result size, and it is not obvious how to choose a value without a priori knowledge of the local density of points. Range queries may be appropriate for other notions of in uence e.g., the opening of a toxic waste dump on its surrounding population but not for what is required in the examples given above. NN queries are commonly used in domains
202 Example 4 Maintaining Document Repositories: Consider a repository of technical reports. which call for searching based on proximity; however, they are not appropriate in this context for similar reasons. Consider Example 1, in which one wants to nd potential customers for a new store outlet q. The deciding factor is not how close a customer is to q, but rather if the customer is further from every other store than from q. Thus, it may very well be the case that potential customers lie outside a small radius from q, or are further from q than the rst few nearest neighbors. Expanding the search radius will not necessarily work around this problem. Although it may encompass more customers who are likely to be in uenced by q, it may do so at the trade-o of introducing many customers who are not in the in uence set i.e., customers whose closest store is not q. Later, we will make these discussions more concrete and present quantitative measures of comparison see Section 6. We address these shortcomings and develop a notion of in uence set with broad applications. A fundamental observation which is the basis for our work here is that the nearest neighbor relation is not symmetric. For example, if p is the nearest neighbor of q, then q need not be the nearest neighbor of p see Figure 1.2 Note that this is the case even though the underlying distance function is Euclidean and, hence, symmetric. Similarly, the k-nearest neighbor relation is not symmetric. It follows that, for a given query point q, the nearest neighbors of q may di er substantially from the set of all points for which q is a nearest neighbor. We call these points the reverse nearest neighbors of q.
p q r Figure 1: Nearest neighbors need not be symmetric: the NN of q is p , whereas the NN of p is r. An arrow from point i to point j indicates that j is the nearest neighbor of i. We now summarize our contributions: We identify a natural and broadly applicable notion for the in uence" of a data point on the database namely, the in uence set, and formalize it based on reverse nearest neighbors RNN and its variants such as reverse k-nearest neighbors, reverse furthest neighbor, etc.;
2 That is, provided there are other points in the collection. We present a general approach for determining reverse nearest neighbors. Our approach is geometric, reducing the problem to that of testing the enclosure of points in geometric objects; it works for di erent distance functions and variants of RNNs. Although the RNN query appears to be natural, it has not been studied previously. RNN queries are of independent interest, and as such should be part of the suite of available queries for processing spatial and multimedia data; Based on our approach, we propose e cient and scalable R-tree based methods for implementing reverse nearest neighbor queries. We also perform an experimental study of the I O-e ciency of the proposed R-tree based methods. Using our approach, we show in terms of standard precision and recall measures to assess the output quality, that well known database queries range and nearest neighbor queries are not e ective in nding in uence sets. The structure of the paper is as follows. Section 2 de nes RNN queries and describes its relationship to NN queries. Section 3 presents an approach and algorithmic framework for answering RNN queries; we also propose a scalable method for implementating this framework using R-trees in Section 4. Section 5 gives empirical results from experiments for RNN queries. In Section 6, we formalize the basic notion of in uence sets based on RNN queries and give results from a qualitative study of the e ectiveness of well known queries to substitute for RNN queries. Then we develop the variants of RNN queries needed for generalized notions of in uence sets. Section 7 reviews the related work. Section 8 lists the conclusions and gives directions for future work. Reverse nearest neighbor RNN queries are the basis for in uence sets, and are also of independent interest. We de ne and develop them in this section. We start from the de nition of the nearest neighbor NN query, a standard query in spatial and multimedia databases and de ne the RNN query and its variants based on this. We will develop the underlying concepts in two dimensions for simplicity; there will be no di culty in extending them to higher dimensions. In our discussion, we shall assume the distance between any two points p = px ; py and q = qx ; qy is dp; q = qx , px2 + qy , py 2 , known as the Euclidean, or L2 , distance.3
Other p distances may also be interest, for example 1 where = j x , x j + j y , y j or 1 where = maxfj x , x j j y , y jg.
L L d p; q q p q p L d p; q q p ; q p Suppose we have a collection S of points in the plane. For a nearest neighbor query, we are given a query point q, and the goal is to determine the nearest neighbor set NN q de ned as 2.1 Formal De nitions NN q = fr 2 S j 8p 2 S : dq; r dq; pg:
Our focus here is on the inverse relation among the points. Given any query point q, we need to determine the set RNN q of reverse nearest neighbors, de ned as RNN q = fr 2 S j 8p 2 S : dr; q dr; pg: RNN q may be empty, or have one or more elements,
and we may wish to return any one of them, or the entire list. There are two variants of this basic scenario that are of interest to us. We will de ne only the variants for RNN queries, although the corresponding variants of NN queries may also be of interest. Monochromatic vs Bichromatic. In some applications, the points in S are of two di erent categories, such as clients and servers; the points may therefore be thought of as being colored red or blue. The RNN query now consists of a point in one of the categories, say blue, and must determine the red points for which the query point is the closest blue point. Formally, let B denote the set of blue points and R the set of red points. Consider a blue query point q. We have, 2.2 Variants RNN q = fr 2 R j 8p 2 B : dr; q dr; pg:
We call this the bichromatic version; in contrast, the basic scenario above wherein all points were of the same category is the monochromatic version. Both versions of the problem are of interest. At rst look, the mono and bichromatic versions of the RNN problem seem very similar. For a blue query point, we consider only the red points and their distance to the closest blue point vice versa for the red query points. However, at a deeper level, there is a fundamental di erence. Let us focus on the L2 case. Proposition 1 For any query point, RNN q may have at most 6 points in the monochromatic case; in the bichromatic case, the size of the set RNN q may
be unbounded. 2 Reverse Nearest Neighbor Queries A proof of this may be found in 17 . From a combinatorial viewpoint, the output of RNN queries is bounded; this in turn a ects the e ciency because a RNN query is output-sensitive. This entire phenomenon is not restricted to the plane e.g., in three dimensions, the
203 RNN q contains at most 12 points under L2 distance and so on, or the distance function e.g., in the L1 case, the cardinality of RNN q is at most 3d , 1 in d
Static vs Dynamic. Sometimes we wish to insert or delete points from the set S and still support the RNN query; we refer to this as the dynamic case. In contrast, the case when set S is not modi ed is called the static case. The dynamic case is relevant in most applications. The crux here, as in all dynamic problems, is to be able to handle insertions and deletions e ciently without rebuilding the entire data structure. Other distance functions. If the distance function is L1 rather than L2 , we generate squares p; N p in Step 1 with center p and sides 2N p. See Figure 2b for an illustration. Similarly, for other Lp distance functions, we will have suitable geometric shapes. Bichromatic version. Consider only blue query points for now. We perform the two steps above only for the red points in set S . For each red point p 2 S , we determine N p, the distance to the nearest blue neighbor. The rest of the description above remains unchanged. We also process for red query points analogously. Our description above was for the static case only. For the dynamic case, we need to make some modi cations. Below we assume the presence of a dynamically maintained data structure for answering NN queries. Recall the de nition of N p for point p from the previous section. Consider an insertion of a point q as illustrated in Figure 3a: 1. Determine the reverse nearest neighbors p of q. For each such point p, we replace circle p; N p with p; dp; q, and update N p to equal dp; q; 2. Find N q, the distance of q from its nearest neighbor, and add q; N q to the collection of circles. Our approach for solving the reverse nearest neighbors query problem is quite general, and it applies also to its variants as we shall see. For exposition, let us consider a basic version of the problem. We are given a set S of points which is not updated, and the distance between any two points is measured using Euclidean distance. Our approach involves two steps. Step 1. For each point p 2 S , determine the distance to the nearest neighbor of p in S , denoted N p. Formally, N p = minq2S ,fpg dp; q. For each p 2 S , generate a circle p; N p where p is its center and N p its radius. See Figure 2a for an illustration. Step 2. For any query q, determine all the circles p; N p that contain q and return their centers p. We have not yet described how to perform the two steps above, but we will rst prove that they su ce. 3 Our Approach to RNN Queries
Static Case 3.2 Dynamic Case 3.1 Lemma 2 The insertion procedure is correct. Proof. It su ces to argue that, for each point p, N p Lemma 1 Step 2 determines precisely all the reverse
nearest neighbors of q. Proof. If point p is returned from Step 2, then q falls within the circle p; N p. Therefore, the distance dp; q is smaller than the radius N p. In other words, dp; q N p and hence q is the nearest neighbor of p equivalently, p is a reverse nearest neighbor of q. Conversely, if p is the reverse nearest neighbor of q, dp; q N p and, therefore, q lies within the circle p; N p. Hence, p will be found in Step 2. 2 What our approach has achieved is to reduce the problem of answering the reverse nearest neighbor query to the problem of nding all nearest neighbors Step 1 and then to what is known in the literature as point enclosure problems wherein we need to determine all the objects that contain a query point Step 2. Our approach is attractive for two reasons. First, both steps are of independent interest and have been studied in the literature. They have e cient solutions, as we will see later. Second, our approach extends to the variants of our interest as we show below.
204 is the correct distance of p to its nearest neighbor after an insertion. This clearly holds for the inserted point q from Step 2. Among the rest of the points, the only ones which will be a ected are those which have q as their nearest neighbor, in other words, the reverse nearest neighbors of q. For all such points p, we update their N p's appropriately in Step 1. The remaining points p do not change N p as a result of inserting q. Hence, all points p have the correct value of N p. 2 Step 1 is shown in Figure 3b where we shrink all circles p; N p for which q is the nearest neighbor of p to p; dp; q. Step 2 is shown in Figure 3c. Now consider an deletion of a point q as illustrated in Figure 4a: 1. We need to remove the circle q; N q from the collection of circles see Figure 4b; 2. Determine all the reverse nearest neighbors p of q. For each such point p, determine its current N p and replace its existing circle with p; N p. We can argue much as before that the deletion procedure is correct. The crucial observation is that p p N(p) a L2 case b L1 case Figure 2: A point set and its nearest neighborhoods. q q N(p) q a nd RNN q b shrink circles c nd NN q Figure 3: A geometrical illustration of the insertion algorithm. the only existing circles p; N p that get a ected are those that have q on the circumference, that is, those associated with the reverse nearest neighbors of q; their circles get expanded in Step 1 see Figure 4c. The details for how to extend these algorithms to other distance functions and to the bichromatic version are similar to those given in the previous section. In this section we propose a scalable method for implementing RNN queries on large, out-of-core data sets, based on our approach from Section 3. Like NN queries, RNN queries are I O-bound as opposed to, e.g., spatial joins which are CPU-bound, and thus the focus is on I O performance. Because R-trees 7, 2, 16 have been successfully deployed in spatial databases and because of their generality to support a variety of norms via bounding boxes, we use them in the proposed method. However, note that any spatial access method could be employed see 6 for a recent
205 survey of spatial access methods. Our deployment of R-trees is standard, but requires some elaboration. First we describe static RNN search; we then present details of the algorithms and data structures for the dynamic case. The rst step in being able to e ciently answer RNN queries is to precompute the nearest neighbor for each and every point. The problem of e ciently computing all-nearest neighbors in large data sets has been studied in 3, 8 , and thus we do not investigate it further in this paper.4 Given a query point q, a straightforward but naive approach for nding reverse nearest neighbors is to sequentially scan through the entries pi ! pj of a precomputed all-NN list in order to determine which points pi are closer to q than to pi 's current nearest
4 4 Scalable RNN Queries 4.1 Static Case All-nearest neighbors is a special case of a spatial join. q q q a remove NN q b nd RNN q c expand circles Figure 4: A geometrical illustration of the deletion algorithm. neighbor pj . Ideally, one would like to avoid having to sequentially scan through the data. Based on the approach in Section 3, a RNN query reduces to a point enclosure query in a database of nearest neighborhood objects e.g., circles for L2 distance in the plane; these objects can be obtained from the all-nearest neighbor distances. We propose to store the objects explicitly in an R-tree. Henceforth, we shall refer to this instantiation of an R-tree as an RNNtree. Thus, we can answer RNN queries by a simple search in the R-tree for those objects enclosing q. As mentioned in Section 4.1, a sequential scan of a precomputed all-NN list can be used to determine the reverse nearest neighbors of a given point query q. Insertion and deletion can be handled similarly. Even if this list were inverted, enabling deletion to be achieved in constant time by looking up the corresponding entry pj ! fpi1 ; pi2 ; : : :; pik g, queries and insertions would still require a pass over the data. We would like to avoid having to do this. We describe how to incrementally maintain the RNNtree in the presence of insertions and deletions. To do this will require a supporting access method that can nd nearest neighbors of points e ciently. At this point, one may wonder if a single R-tree will su ce for nding reverse nearest neighbors as well as nearest neighbors, in other words, if our RNN-tree can be used for this purpose, This turns out to be not the case since geometric objects rather than points are stored in the RNN-tree, and thus the bounding boxes are not optimized for nearest neighbor search performance on points. Therefore, we propose to use a separate R-tree for NN queries, henceforth referred to as an NN-tree. Note that the NN-tree is not needed for static RNN queries, only for insertions and deletions, and that, in addition to the RNN-tree, it must be dynamically
Algorithm Insert: Input: point
q 1. 2. 3. 4. 5. 6. f 4.2 Dynamic Case Algorithm Delete: Input: point
q p1 ; p2 ; : : : ; pk query q in RNN-tree; for each pi with corresponding Ri do shrink Ri to pi ; d pi ; q ; find N q from NN-tree; insert q in NN-tree; insert q; N q in RNN-tree; g 1. 2. 3. 4. 5. 6. delete from NN-tree; query q in RNN-tree; delete q; N q from RNN-tree; for each pi with corresponding Ri do find N pi from NN-tree; grow Ri to pi ; N pi ; f q p1 ; p2 ; : : : ; pk g Figure 5: Proposed Algorithms for Insertion and Deletion. Figure 5 presents pseudocode for insertion and deletion. The algorithm for insertion retrieves from the RNN-tree the reverse nearest neighbors pi of q, and their corresponding neighborhood objects Ri, without having to scan; each Ri is then reduced in size to pi ; dpi; q. The algorithm for deletion works similarly, using the RNN-tree to nd the points pi a ected by the deletion; each corresponding Ri is then expanded to pi ; dpi; N pi. We designed a set of experiments to test the I O performance of our proposed method on large data sets. Our goal was to determine the scale-up trend of both static and dynamic queries. We also examined the performance of bichromatic versus monochromatic 5 Experiments on RNN queries data. Below we present results from two batches of experiments, for static and dynamic RNN queries. Methods: We compared the proposed algorithms given in Section 4 to the basic scanning approach. In the static case, the scanning approach precomputes an all-NN list and makes a pass through it to determine the reverse nearest neighbors. In the dynamic case, the scanning approach precomputes and maintains an inverted all-NN list. Each entry in the all-NN list corresponds to a point in the data set, and thus requires storing two items for nearest neighbor information: the point coordinates and nearest neighbor distances. Similarly, the RNN-tree used in the proposed method requires storing each point and its associated nearest neighborhood. Both also use an NN-tree for nearest neighbor search. Thus, the methods require the same storage space. Data Sets: Our testbed includes two real data sets. The rst is mono and the second is bichromatic: cities1 - Centers of 100K cities and small towns in the USA chosen at random from a larger data set of 132K cities, represented as latitude and longitude coordinates; - Coordinates of 100K red cities i.e., clients and 400 black cities i.e., servers. The red cities are mutually disjoint from the black cities, and points from both colors were chosen at random from the same source.
cities2 We uniformly sampled the cities1 data set to get subsets of varying sizes, between 10K and 100K points. Figure 6a shows the I O performance of the proposed method compared to sequential scan. Each query took roughly between 9-28 I Os for the data sets we tried with our approach; in contrast, the performance of the scanning approach increased from 40 to 400 I Os with increasing data set size n. The gap between the two curves clearly widens as n increases, and the proposed method appears to scale logarithmically, whereas the scanning approach scales linearly. We performed the same experiment for cities2. Figure 6b plots the I O performance. It is interesting to note that the performance degrades more with increasing n from 12-65 I Os with bichromatic data; this is primarily because the output size is larger in bichromatic case than in the monochromatic case as remarked earlier. However, this increase again appears to be logarithmic. Again, we used the cities1 data set and uniformly sampled it to get subsets of varying sizes, between 10K and 100K points. As shown in Figure 7, the I O cost for an even workload of insertions and deletions appears to scale logarithmically, whereas the scanning method scales linearly. It is interesting to note that the average I O is up to four times worse than in the static case, although this factor decreases for larger data sets. We broke down the I O into four categories RNN queries, NN queries, insertions and deletions and found that each took approximately the same number of I Os. Thus, the maintenance of the NN-tree accounts for the extra I O compared to the static queries.
cities: Dynamic Performance 220 200 180 160 "insdel.proposed.cities" "insdel.scan.cities" 5.1 Static Case 5.2 Dynamic Case Queries: We assume the so-called `biased' query model, in which queries are more likely to come from dense regions 13 . We chose 500 query points at random without replacement from the same source that the data sets were chosen; note that these points are external to the data sets. For dynamic queries, we simulated a mixed workload of insertions by randomly choosing between insertions and deletions. In the case of insertions, one of the 500 query points were inserted; for deletions, an existing point was chosen at random. We report the average I O per query, that is, the cumulative number of page accesses divided by the number of queries. Avg I/Os 140 120 100 80 60 40 20 10000 30000 50000 70000 90000 Software: The code for our experiments was imple- mented in C on a Sun SparcWorkstation. To implement RNN queries, we extended DR-tree, a disk-resident R*tree package; to implement NN queries which were used for the second batch of experiments, we used the DR-tree as is.5 The page size was set to 4K.
5 Data set size (n) available at ftp: ftp.olympos.umd.edu . Figure 7: The I O performance of dynamic RNN queries proposed method vs. scanning in the presence of an even mix of insertions and deletions. 207 cities: Static Performance 400 350 300 "static.proposed.cities" "static.scan.cities" 400 350 300 cities.bipartite: Static RNN Performance "static.proposed.cities.bipartite" "static.scan.cities.bipartite" Avg I/Os 250 200 150 100 50 0 10000 30000 50000 70000 90000 Avg Leaf I/Os 250 200 150 100 50 0 10000 30000 50000 70000 90000 a cities1 monochromatic Data set size (n) b cities2 bipartite
cities1 Data set size (n) Figure 6: The I O performance of static RNN queries proposed method vs. scanning for a monochromatic and b cities2 bipartite. Our rst, and most basic, de nition of the in uence set of a point q is simply that it is the set of all reverse nearest neighbors of q, that is, RNN q. This may be mono or bichromatic reverse nearest neighbors, depending on the application. Before exploring this notion further, let us brie y reexamine the motivating examples from Section 1. In Examples 1 and 2, the in uence set of the new location of a store outlet is indeed the set of customers who nd the new location the closest amongst all locations of stores. This is an instance of bichromatic RNN. In Example 3, the customers who are in uenced by a new service are those whose pro les have the feature vector of the new service closest amongst all service feature vectors. Again, the in uence set of the new service corresponds to our basic de nition above. In Example 4, the in uence set of a new document is the set of all documents in the database that nd it the closest under a suitable measure of similarity; here, the de nition of an in uence set based on monochromatic RNNs applies. We can think of many other applications where the basic notion of in uence set arises. What is perhaps more interesting is that this notion of in uence sets implicitly arises in many computational tasks. For example, many problems of interest in Operations Research and Combinatorial Optimization have greedy solutions with good performance. One such example is the facility location problem. Here we are given many points and the goal is to designate some as facilities and others as non-facilities. There is a cost to designating a point as a facility, and a cost for non208 6.1 6 In uence Sets Basic notion and applications facilities which equals the cost of accessing the closest facility. This problem is known to be NP-hard, and thus the focus is on designing approximation algorithms for this problem. The method of choice in practice for this problem is the greedy method it is simple, and is a provably small approximation 14 .6 The greedy algorithm involves repeatedly adding a facility, deleting one, or swapping a facility with a non-facility. In order to implement this algorithm, we need to determine the enhanced cost when a new facility is added which involves looking at precisely those locations whose NN distance is changed when a new facility is added or deleted, swapped. The set of all such locations is indeed our basic de nition of a in uence set; these have been implicitly computed in this context for a long time. Another example is that of computing the shortest path from a single point to every other point in the database. When a point is added to a partial solution that greedy algorithms maintain, the distance of remaining points to the partial solution has to be updated and this will again be given by the in uence set of the point added to the partial solution. Many other implicit uses of in uence sets exist in Combinatorial Optimization. There are two potential problems with the e ectiveness of any approach to nding in uence sets. One is the precision problem wherein a large portion of the retrieved set contains irrelevant points. Conversely, there is the recall problem wherein the retrieved set misses some of the relevant points. An e ective approach would achieve high precision at high recall
6 Better approximations exist, but they are based on Linear Programming 10 . 6.2 Using existing methods ideally, 100 precision at 100 recall. In this section we present results from an experiment to demonstrate that nearest neighbor queries and range queries are not e ective engineering" substitutes for RNN queries in nding in uence sets; we use standard precision and recall metrics from information retrieval to assess their quality. The rst issue that arises in nding in uence sets is what region to search in. Two possibilities immediately present themselves: nd the closest points i.e., the k-nearest neighbors or all points within some radius i.e., -range search. Of course, there are many variants of these basic queries, such as searching with weighted distances, searching over polygonal or elliptical regions, etc. To demonstrate the ine ectiveness of these approaches, it shall su ce to consider the most basic version. The question then is how to engineer the parameter value namely k or that will contain the desired information. Without a priori knowledge of the density of points near the query point q, it is not clear how to choose these values. Regardless, we show that any clever strategy to engineer parameter values be it from histograms, etc. would still fall short. Figure 8 illustrates this concept. The black points represent servers and the white points represent clients. In this example, we wish to nd all the clients for which q is their closest server. The example illustrates that a -range alternatively, k-NN query cannot nd the desired information in this case, regardless of which value of or k is chosen. Figure 8a shows a 'safe' radius l in which all points are reverse nearest neighbors of q; however, there exist reverse nearest neighbors of q outside l . Figure 8b shows a wider radius h that includes all of the reverse nearest neighbors of q but also includes points which are not. In this example, it is possible to achieve 100 precision or 100 recall, but not both simultaneously. We ran an experiment to investigate how often this trade-o occurs in practice. The experiment was carried out as follows. Suppose we had an oracle to suggest the largest radius l admitting no false-positives, i.e., whose neighborhood contains only points in the in uence set. For this scenario, we assess the quality of the retrieved set from the number of false-negatives within this radius. More speci cally, we measured the recall at 100 precision, that is, the cardinality of the retrieved set divided by that of the in uence set. Further suppose we had an oracle to suggest the smallest radius h allowing no false-negatives, i.e., whose neighborhood contains the full in uence set equivalently, reverse nearest neighbors. For this scenario, we assess the quality of the retrieved set from the number of falsepositives within this radius. More spec cially, we measured the precision at 100 recall, that is, the cardinality of the in uence set divided by that of the
209 retrieved set. We used the cities2 data set in the our experiment and averaged over 100 queries. The results are summarized in Table 1. The quality of the retrieved set at radius l is poor, containing a small fraction of the full in uence set. The quality of the retrieved set at radius h is also poor, containing a lot of 'garbage' in addition to the in uenced points.
measure precision at 100 recall recall at 100 precision radius
h l 44.3 40.2 value Table 1: The e ectiveness of range queries in nding in uence sets. Quality is measured by precision at 100 recall and recall at 100 precision. In this section, we extend the notion of in uence sets from the previous section. We do not explore these notions in depth here using experiments; instead we focus on sketching how our approach for nding the basic in uence sets can be modi ed to nd these extended in uence sets. Some of these modi cations will be straightforward, others less so. Reverse k-nearest neighbors. A rather simple extension of the in uence set of point q is to de ne it to be the set of all points that have q as one of their k nearest neighbors. Here, k is xed and speci ed a priori. For static queries, the only di erence in our solution is that we store the neighborhood of kth neighbor rather than nearest neighbor. Note that we do not explicitly store the k nearest neighbors. Each query is an enclosure problem on these objects as in the basic case. For insertions and deletions, we update the neighborhood of the kth nearest neighbor of each a ected point as follows. When inserting or deleting q, we rst nd the set of a ected points using the enclosure problem as done for answering queries. For insertion, we perform a range query to determine the k nearest neighbors of each such a ected point and do necessary updates. For deletion, the neighborhood radius of the a ected points is expanded to the distance of the k + 1th neighbor, which can be found by a modi ed NN search on R-trees. In uence sets with predicates. The basic notion of in uence sets can be enhanced with predicates. Some examples of predicates involve bounding the search distance nd reverse nearest neighbors within a speci ed region of interest and providing multiple facilities nd the reverse nearest neighbors to any, some, or all of multiple points in the set fq1; : : :; qm g. 6.3 Extended notions of in uence sets q q l h a range l b range h Figure 8: In many cases, any -range query or k-NN query will be either a too small or b too big. For such queries, we can push the predicates inside the R-tree search. Reverse furthest neighbors. An interesting variation of in uence sets is to base it on dissimilarity rather than similarity, in other words, on furthest neighbors rather than nearest neighbors. More formally, de ne the in uence set of a point q to be the set of all points r such that q is farther from r than any other point of the database is from r. This notion of in uence has a solution that di ers from the basic notion in an interesting way. We sketch the solution here only for the static case, but all modi cations to convert this into a dynamic solution are based on ideas we already described before. We will also only describe the solution for the two dimensional case, but extending it to the multi-dimensional case is straightforward. Say S is the set of points which will be xed. A query point is denoted q. For simplicity, we will rst describe our solution for the L1 distance. Preprocessing: We rst determine the furthest point for each point p 2 S and denote it as f p. We will put a square with center p and sides 2dp; f p for each p; say this square is Rp . Query processing: The simple observation is that for any query q, the reverse furthest neighbors r are those for which the Rr does not include q. Thus the problem we have is square non-enclosure problem. Recall that, in contrast, the reverse nearest neighbors problem led to square enclosure problem. The following observation is the key to solving the square non-enclosure problem. Lemma 3 Consider the intervals xr and yr obtained by projecting the square Rr on x and y axis respectively. A point q = x; y is not contained in Rr if and only if either xr does not contain x or yr does not contain y.
210 Therefore, if we return all the xr 's that do not contain x as well as those yr 's that do not contain y's, each square r in the output is repeated atmost twice. So the problem can be reduced to a one dimensional problem on intervals without losing much e ciency. Let us restate the one dimensional problem formally: we are given a set of intervals, say N of them. Each query is a one dimensional point, say p, and the goal is to return all interval that do not contain p. For solving this problem, we maintain two sorted arrays, one of the right endpoints of the intervals and the other of their left endpoints. The following observation is easy: Proposition 2 Any interval with its right endpoint to the left of p does not contain p. Likewise, any interval with its left endpoint to the right of p does not contain it. Hence, it su ces to perform two binary searches with p in the two arrays, to determine the intervals that do not contain p. Notice that from a practical viewpoint, the only data structure we need is a B-tree to keep these two arrays. Theorem 1 There exists an log B N + t time algorithm to answer each query in the square non-enclosure problem, where t is the output size, N = n=B is the number of disk blocks, and B is the block size; space used is N . Although the solution above is simple, we are not aware of any published claim of this result for square non-disclosure problem. As mentioned before, this immediately gives a solution for nding the set of all reverse furthest neighbors for a query point under the L1 distance. While a R-tree may be used to solve this problem, our solution above shows that the only data structure we need is to a B-tree. Hence, the solution is very e cient. For other distance functions, we still have non-enclosure problem, but with di erent shapes e.g., circles for Euclidean distance. Practical approaches for solving such problems would be to either use bounding boxes to reduce the problem to square non-enclosure with some false positives, or to use R-tree based search. There has been a lot of work on nearest neighbor queries 5, 11, 4, 15, 9 . NN queries are useful in many applications: GIS `Find the k nearest hospitals from the place of an accident.', information retrieval `Find the most similar web page to mine.', multimedia databases `Find the tumor shape that looks the most similar to the query shape.' 12 , etc. Conceptually, a RNN query is di erent from a NN query; it is the inverse. There has been work in the area of spatial joins, and more speci cally with all-nearest neighbor queries 8 . To the best of our knowledge, none of the previous work has addressed the issue of incremental maintenance. While reverse nearest neighbors is conceptually di erent, RNN queries provide an e cient means to incrementally maintain all-nearest neighbors. Both incremental and random incremental Delaunay triangulations could be used for answering RNN queries, as the update step involves identifying points and their circumcircles local to the query point whose edges are a ected by insertion deletion, a superset of reverse nearest neighbors.7 However, these algorithms rely on being able to e ciently locate the simplex containing the query point, a problem for which there is no e cient solution in large data sets. In addition, the algorithms make the general position assumption and do not work well for the bipartite case. Our approach for RNN queries relied on solving point enclosure problems with di erent shapes. Point enclosure problems with n rectangles can be solved after On logd,1 n time preprocessing in Ologd,1 n+t time per query where t is the output size 1 . Such e cient algorithms are not known for other shapes, for dynamic cases, or for external memory datasets. Our R-tree approach is simple and applies to all the variants of RNN queries. The in uence" of a point in a database is a useful concept. In this paper, we introduce an intuitive notion of in uence based on reverse nearest neighbors, and illustrate through examples that it has broad
7 Alternatively, one could use the dual data structure, Voronoi diagrams. 7 Related Work appeal in many application domains. The basic notion of in uence sets depends on reverse nearest neighbor queries, which are also of independent interest. We provide the rst solution to this problem, and validate its I O-e ciency through experiments. We also demonstrate using experiments that standard database queries such as range searching and NN queries are ine ective at nding in uence sets. Finally, we further extend the notion of in uence based on variants of RNN queries and provide e cient solutions for these variants as well. We have initiated the study of in uence sets using reverse nearest neighbors. Many issues remain to be explored, for example, the notion of in uence outside of the typical query-response context, such as in data mining. It is often desirable to process the data set to suggest a region in which a query point should lie so as to exert maximal in uence. What is the appropriate notion of in uence sets in this context? From a technical point of view, e cient solutions for RNN queries are needed in high dimensions. Extensions of our approach to higher dimensions is straightforward; however, in very high dimensions, alternative approaches may be needed, as is the case for high dimensional NN queries. Also, the role of RNN queries can be explored further, such as in other proximity-based problems. It is our belief that the RNN query is a fundamental query, deserving to be a standard tool for data processing. Acknowledgements References The authors wish to thank Christos Faloutsos, Dimitris Gunopoulos, H.V. Jagadish, Nick Koudas, and Dennis Shasha for their comments. 1 P. Agrawal. Range searching. In E. Goodman and J. O'Rourke, editors, Handbook of Discrete and Computational Geometry, pages 575 598. CRC Press, Boca Raton, FL, 1997. 2 N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An e cient and robust access method for points and rectangles. ACM SIGMOD, pages 322 331, May 23-25 1990. 3 T. Brinkho , H.-P. Kriegel, and B. Seeger. E cient processing of spatial joins using R-trees. In Proc. of ACM SIGMOD, pages 237 246, Washington, D.C., May 26-28 1993. 4 B. Chazelle and L. J. Guibas. Fractional cascading: I. A data structuring technique. Algorithmica, 1:133 162, 1986. 5 K. Fukunaga and P. M. Narendra. A branch and bound algorithm for computing k-nearest 8 Conclusions and Future Work 211 6 7 8 9 10 11 12 13 neighbors. IEEE Trans. on Computers TOC, C247:750 753, July 1975. V. Gaede and O. Gunther. Multidimensional access methods. ACM Computing Surveys, 302:170 231, June 1998. A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proc. ACM SIGMOD, pages 47 57, Boston, Mass, June 1984. G. R. Hjaltason and H. Samet. Incremental distance join algorithms for spatial databases. ACM SIGMOD '98, pages 237 248, June 1998. G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. ACM TODS, 242:265 318, June 1999. K. Jain and V. Vazirani. Primal-dual approximation algorithms for metric facility location and kmedian problems. Proc. 40th IEEE Foundations of Computer Science FOCS '99, pages 2 13, 1999. D. G. Kirkpatrick. Optimal search in planar subdivisions. SIAM J. Comput., 12:28 35, 1983. F. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, and Z. Protopapas. Fast nearest-neighbor search in medical image databases. Conf. on Very Large Data Bases VLDB, pages 215 226, September 1996. B. Pagel, H. Six, H. Toben, and P. Widmayer. Towards an analysis of range query performance. In Proc. of ACM SIGACT-SIGMOD-SIGART 1993. R. Rajaraman, M. Korupolu, and G. Plaxton. Analysis of a local search heuristic for facility location problems. Proceedings of ACM-SIAM Symposium on Discrete Algorithms SODA '98, pages 1 10, 1998. N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In Proc. of ACM-SIGMOD, pages 71 79, San Jose, CA, May 1995. T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+ tree: A dynamic index for multi-dimensional objects. In Proc. 13th International Conference on VLDB, pages 507 518, England,, September 1987. M. Smid. Closest point problems in computational geometry. In J.-R. Sack and J. Urrutia, editors, Handbook on Computational Geometry. Elsevier Science Publishing, 1997.
Symposium on Principles of Database Systems PODS, pages 214 221, Washington, D.C., May 14 15 16 17 212 ...
View Full Document
This note was uploaded on 02/28/2008 for the course CSCI 599 taught by Professor Cyrusshahabi during the Fall '07 term at USC.
- Fall '07