Theory of LSH

Distance Measures
LS Families of Hash Functions
S-Curves

Distance Measures

Generalized LSH is based on some kind of "distance" between points. Similar points are "close."
Two major classes of distance measure:
1. Euclidean
2. Non-Euclidean

Euclidean vs. Non-Euclidean

A Euclidean space has some number of real-valued dimensions and "dense" points. There is a notion of the "average" of two points.
A Euclidean distance is based on the locations of points in such a space.
A non-Euclidean distance is based on properties of points, but not their "location" in a space.

Axioms of a Distance Measure

d is a distance measure if it is a function from pairs of points to real numbers such that:
1. $d(x,y) \ge 0$.
2. $d(x,y) = 0$ iff $x = y$.
3. $d(x,y) = d(y,x)$.
4. $d(x,y) \le d(x,z) + d(z,y)$ (triangle inequality).

Some Euclidean Distances

$L_2$ norm: $d(x,y)$ = square root of the sum of the squares of the differences between x and y in each dimension. The most common notion of "distance."
$L_1$ norm: sum of the absolute differences in each dimension. Manhattan distance = the distance if you had to travel along coordinates only.

Examples of Euclidean Distances

a = (5,5), b = (9,8)
$L_2$ norm: $\mathrm{dist}(a,b) = \sqrt{4^2 + 3^2} = 5$
$L_1$ norm: $\mathrm{dist}(a,b) = 4 + 3 = 7$

Another Euclidean Distance

$L_\infty$ norm: $d(x,y)$ = the maximum of the absolute differences between x and y in any dimension.
Note: the maximum is the limit as n goes to $\infty$ of the $L_n$ norm: what you get by taking the n-th power of the absolute differences, summing, and taking the n-th root.

Non-Euclidean Distances

Jaccard distance for sets = 1 minus Jaccard similarity.
Cosine distance = angle between vectors from the origin to the points in question.
Edit distance = number of inserts and deletes to change one string into another.
Hamming distance = number of positions in which bit vectors differ.

Jaccard Distance for Sets (Bit-Vectors)

Example: $p_1$ = 10111; $p_2$ = 10011.
Size of intersection = 3; size of union = 4; Jaccard similarity (not distance) = 3/4.
$d(x,y) = 1 - (\text{Jaccard similarity}) = 1/4$.

Why J.D. Is a Distance Measure

$d(x,x) = 0$ because $x \cap x = x \cup x$.
$d(x,y) = d(y,x)$ because union and intersection are symmetric....
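The worked examples on the slides above can be checked with a few lines of Python. This is only a minimal sketch: the function names (`lp_distance`, `linf_distance`, `jaccard_distance`, `hamming_distance`) are hypothetical and do not come from the slides, bit vectors are assumed to be strings of "0" and "1", and returning 0 for two empty sets is an assumed convention.

```python
def lp_distance(x, y, p):
    """L_p distance: p-th root of the sum of p-th powers of absolute differences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def linf_distance(x, y):
    """L_infinity distance: the largest absolute difference in any dimension."""
    return max(abs(a - b) for a, b in zip(x, y))


def jaccard_distance(s, t):
    """1 minus Jaccard similarity, treating each bit string as the set of
    positions that hold a 1."""
    s_set = {i for i, bit in enumerate(s) if bit == "1"}
    t_set = {i for i, bit in enumerate(t) if bit == "1"}
    union = s_set | t_set
    if not union:
        # Assumed convention: two empty sets are at distance 0.
        return 0.0
    return 1.0 - len(s_set & t_set) / len(union)


def hamming_distance(s, t):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(s, t))


# Points and bit vectors taken from the slide examples.
a, b = (5, 5), (9, 8)
l1 = lp_distance(a, b, 1)      # 4 + 3
l2 = lp_distance(a, b, 2)      # sqrt(4^2 + 3^2)
linf = linf_distance(a, b)     # max(4, 3)
l100 = lp_distance(a, b, 100)  # large n: should be close to linf

jd = jaccard_distance("10111", "10011")
hd = hamming_distance("10111", "10011")
```

Raising `p` (here to 100) illustrates the note on the L-infinity slide: the largest difference dominates the sum, so the value approaches the maximum difference, 4.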
This document was uploaded on 03/04/2012.
 Fall '09
