# 10similarity3 - 1 Theory of LSH Distance Measures LS...

This preview shows pages 1–14.

### Theory of LSH

◆ Distance Measures
◆ LS Families of Hash Functions
◆ S-Curves

### Distance Measures

◆ Generalized LSH is based on some kind of “distance” between points.
  ◗ Similar points are “close.”
◆ Two major classes of distance measure:
  1. Euclidean
  2. Non-Euclidean

### Euclidean Vs. Non-Euclidean

◆ A Euclidean space has some number of real-valued dimensions and “dense” points.
  ◗ There is a notion of “average” of two points.
  ◗ A Euclidean distance is based on the locations of points in such a space.
◆ A Non-Euclidean distance is based on properties of points, but not their “location” in a space.

### Axioms of a Distance Measure

◆ d is a distance measure if it is a function from pairs of points to real numbers such that:
  1. d(x,y) ≥ 0.
  2. d(x,y) = 0 iff x = y.
  3. d(x,y) = d(y,x).
  4. d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality).

### Some Euclidean Distances

◆ L₂ norm: d(x,y) = square root of the sum of the squares of the differences between x and y in each dimension.

### Examples of Euclidean Distances

◆ a = (5,5), b = (9,8): the points differ by 4 in one dimension and by 3 in the other.
◆ L₂ norm: d(a,b) = √(4² + 3²) = 5.
◆ L₁ norm (sum of the differences in each dimension): d(a,b) = 4 + 3 = 7.

### Another Euclidean Distance

◆ L∞ norm: d(x,y) = the maximum of the differences between x and y in any dimension.

### Non-Euclidean Distances

◆ Jaccard distance for sets = 1 minus Jaccard similarity.
◆ Cosine distance = angle between vectors from the origin to the points in question.
◆ Edit distance = number of inserts and deletes to change one string into another.
◆ Hamming distance = number of positions in which bit vectors differ.

### Jaccard Distance for Sets (Bit-Vectors)

◆ Example: p₁ = 10111; p₂ = 10011.
◆ Size of intersection = 3; size of union = 4; Jaccard similarity (not distance) = 3/4.
◆ d(x,y) = 1 – (Jaccard similarity) = 1/4.

### Why J.D. Is a Distance Measure

◆ d(x,x) = 0 because x ∩ x = x ∪ x.
◆ d(x,y) = d(y,x) because union and intersection are symmetric.
◆ d(x,y) ≥ 0 because |x ∩ y| ≤ |x ∪ y|.
◆ d(x,y) ≤ d(x,z) + d(z,y) is trickier (next slide).

### Triangle Inequality for J.D.

◆ To show: (1 – |x ∩ z|/|x ∪ z|) + (1 – |y ∩ z|/|y ∪ z|) ≥ 1 – |x ∩ y|/|x ∪ y|.
◆ Remember: |a ∩ b|/|a ∪ b| = probability that minhash(a) = minhash(b).
◆ Thus, 1 – |a ∩ b|/|a ∪ b| = probability that minhash(a) ≠ minhash(b).

### Triangle Inequality (2)

◆ Claim: prob[minhash(x) ≠ minhash(y)] ≤ prob[minhash(x) ≠ minhash(z)] + prob[minhash(z) ≠ minhash(y)].
◆ Proof: whenever minhash(x) ≠ minhash(y), at least one of minhash(x) ≠ minhash(z) and minhash(z) ≠ minhash(y) must be true; otherwise minhash(x) = minhash(z) = minhash(y).

### Cosine Distance

◆ Think of a point as a vector from the origin (0,0,…,0) to its location....
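
The distance measures named in the slides can be sketched in a few lines of Python. This is a minimal illustration and not code from the lecture: the function names (`l1`, `l2`, `linf`, `jaccard_distance`, `hamming`, `bits_to_set`) are hypothetical, and the convention that two empty sets have Jaccard distance 0 is an assumption made here so that d(x,x) = 0 holds everywhere. It reuses the slides' own examples, a = (5,5), b = (9,8) and p1 = 10111, p2 = 10011, and then brute-force checks the triangle inequality for Jaccard distance on all subsets of a small universe.

```python
import math
from itertools import combinations, product


def l2(x, y):
    """L2 norm: square root of the sum of squared per-dimension differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def l1(x, y):
    """L1 norm: sum of the absolute per-dimension differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def linf(x, y):
    """L-infinity norm: largest absolute difference in any dimension."""
    return max(abs(a - b) for a, b in zip(x, y))


def jaccard_distance(s, t):
    """1 minus Jaccard similarity; two empty sets get distance 0 (assumption)."""
    union = s | t
    if not union:
        return 0.0
    return 1.0 - len(s & t) / len(union)


def hamming(u, v):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(u, v))


def bits_to_set(bits):
    """Interpret a bit string as the set of positions that hold a 1."""
    return {i for i, ch in enumerate(bits) if ch == "1"}


# Euclidean example from the slides.
a, b = (5, 5), (9, 8)
print(l2(a, b), l1(a, b), linf(a, b))  # 5.0 7 4

# Jaccard example from the slides.
p1, p2 = "10111", "10011"
print(jaccard_distance(bits_to_set(p1), bits_to_set(p2)))  # 0.25
print(hamming(p1, p2))  # 1

# Brute-force check of the triangle inequality on all subsets of {0, 1, 2, 3}.
universe = range(4)
subsets = [frozenset(c) for r in range(5) for c in combinations(universe, r)]
triangle_ok = all(
    jaccard_distance(x, y)
    <= jaccard_distance(x, z) + jaccard_distance(z, y) + 1e-12
    for x, y, z in product(subsets, repeat=3)
)
print(triangle_ok)  # True
```

The exhaustive check is only a sanity test on a tiny universe; the minhash argument in the slides is what establishes the inequality in general.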