similarity4 - Methods for High Degrees of Similarity...

1 Methods for High Degrees of Similarity
Index-Based Methods
Exploiting Prefixes and Suffixes
Exploiting Length

2 Overview
LSH-based methods are excellent for similarity thresholds that are not too high, possibly up to 80% or 90%.
But for similarities above that, there are other methods that are more efficient and that also give exact answers.

3 Setting: Sets as Strings
We'll again talk about Jaccard similarity and distance of sets.
However, we now represent sets by strings (lists of symbols):
1. Enumerate the universal set.
2. Represent a set by the string of its elements in sorted order.

4 Example: Shingles
If the universal set is k-shingles, there is a natural lexicographic order.
Think of each shingle as a single symbol.
Then the 2-shingling of abcad, which is the set {ab, bc, ca, ad}, is represented by the list ab, ad, bc, ca of length 4.
Alternative: hash shingles; order by bucket number.
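The shingle example can be sketched in a few lines of Python. The function name `shingle_string` is hypothetical (it does not come from the slides), and the sketch assumes character-level shingles ordered by Python's default string comparison, which is lexicographic.

```python
def shingle_string(text: str, k: int = 2) -> list[str]:
    """Return the k-shingles of `text` as a sorted, duplicate-free list.

    Each shingle plays the role of a single symbol, so the returned list
    is the "string" that represents the set of shingles.
    """
    # Collect every length-k substring; the set drops repeated shingles.
    shingles = {text[i:i + k] for i in range(len(text) - k + 1)}
    # Sorting gives the lexicographic order used to represent the set.
    return sorted(shingles)


print(shingle_string("abcad"))  # ['ab', 'ad', 'bc', 'ca']
```

The hashing alternative would replace each shingle with a bucket number and sort the numbers instead. The only requirement is that every set is written out in the same global order.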

5 Example: Words
If we treat a document as a set of words, we could order the words alphabetically.
Better: order words lowest-frequency-first.
Why? We shall index documents based on the early words in their lists, so putting rare words first spreads documents over more buckets.
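A minimal sketch of the lowest-frequency-first ordering follows. It assumes "frequency" means the number of documents containing a word and breaks ties alphabetically so that all documents share one consistent order. Both choices, and the name `rare_first_strings`, are illustrative assumptions rather than something the slides specify.

```python
from collections import Counter


def rare_first_strings(docs: list[set[str]]) -> list[list[str]]:
    """Represent each document (a set of words) as a list, rarest words first."""
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in docs for word in doc)
    # Sort by (frequency, word); the alphabetical tie-break keeps the
    # order identical across all documents.
    return [sorted(doc, key=lambda w: (df[w], w)) for doc in docs]


docs = [{"the", "cat", "sat"}, {"the", "dog", "sat"}, {"the", "zebra"}]
for s in rare_first_strings(docs):
    print(s)
# ['cat', 'sat', 'the']
# ['dog', 'sat', 'the']
# ['zebra', 'the']
```

With alphabetical order, an index on the first word would be dominated by a few common early words. With rare words first, the three documents above start with three different symbols.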

6 Jaccard and Edit Distances
Suppose two sets have Jaccard distance J and are represented by strings s1 and s2.
Let the LCS (longest common subsequence) of s1 and s2 have length C, and let the edit distance of s1 and s2, counting insertions and deletions only, be E.
Then:
1-J = Jaccard similarity = C/(C+E).
J = E/(C+E).
This works because these strings never repeat a symbol and symbols appear in the same order in both strings, so C is the size of the intersection and C+E is the size of the union.
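The identity can be checked numerically. The sketch below assumes the edit distance counts insertions and deletions only, in which case E = |s1| + |s2| - 2C. The helper names `lcs_length` and `jaccard_via_strings` are hypothetical.

```python
def lcs_length(s1: list[str], s2: list[str]) -> int:
    """Length of the longest common subsequence (standard dynamic program)."""
    prev = [0] * (len(s2) + 1)
    for a in s1:
        cur = [0]
        for j, b in enumerate(s2, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def jaccard_via_strings(set1: set[str], set2: set[str]) -> float:
    """Jaccard similarity computed as C / (C + E) on the sorted strings."""
    s1, s2 = sorted(set1), sorted(set2)
    c = lcs_length(s1, s2)
    # Insert/delete edit distance: remove what is not shared, add the rest.
    e = len(s1) + len(s2) - 2 * c
    if c + e == 0:
        return 1.0  # assumption: two empty sets are treated as identical
    return c / (c + e)


a, b = {"a", "b", "c", "d"}, {"b", "c", "e"}
print(jaccard_via_strings(a, b))   # 0.4
print(len(a & b) / len(a | b))     # 0.4
```

Here the LCS is b, c (C = 2) and E = 4 + 3 - 4 = 3, so the similarity is 2/5, matching the direct intersection-over-union computation.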

7 Indexes
The general approach is to build some indexes on the set of strings.
Then, visit each string once and use the index to find possible candidates for similarity.
For thought: how does this approach compare with bucketizing and looking within buckets for similarity?

8 Length-Based Indexes
The simplest thing to do is create an index on the length of strings.
A string of length L can be within Jaccard distance J of a string of length M only if L × (1-J) ≤ M ≤ L/(1-J).
Example: if 1-J = 90% (Jaccard similarity), then M is between 90% and 111% of L.
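One possible shape for a length-based index is sketched below. The function name `length_candidates` and the choice to report each pair once, from its shorter member, are assumptions made for illustration. The output is only a candidate list, and each pair would still need its actual Jaccard distance checked.

```python
import math
from collections import defaultdict


def length_candidates(strings: list[str], j: float) -> list[tuple[int, int]]:
    """Index pairs whose lengths alone permit Jaccard distance <= j.

    Assumes 0 <= j < 1. Each pair is reported once, shorter string first
    (for equal lengths, lower index first).
    """
    # The index: string length -> positions of strings with that length.
    by_length = defaultdict(list)
    for idx, s in enumerate(strings):
        by_length[len(s)].append(idx)

    pairs = []
    for i, s in enumerate(strings):
        shortest = len(s)
        # Longest partner length allowed by M <= L / (1 - J).
        longest = math.floor(shortest / (1 - j))
        for m in range(shortest, longest + 1):
            for k in by_length.get(m, ()):
                # Skip self-pairs and avoid reporting equal-length pairs twice.
                if m > shortest or k > i:
                    pairs.append((i, k))
    return pairs


strings = ["abcdefghij", "abcdefghijk", "abcdefghijkl", "abcdefghijklmnopqrst"]
print(length_candidates(strings, 0.1))  # [(0, 1), (1, 2)]
```

Only longer-or-equal partners are probed, because the lower bound L × (1-J) ≤ M is covered when the shorter string of a pair does its own lookup. Lengths that fall exactly on the boundary are subject to floating-point rounding in this sketch.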
9 Why the Limit on Lengths?
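A short derivation of the bound, offered as a sketch rather than as the slide's original argument. The set names A and B are introduced here for illustration: A has L elements, B has M elements, and M ≤ L.

```latex
\begin{align*}
|A \cap B| &\le |B| = M, \qquad |A \cup B| \ge |A| = L \\
\text{so}\quad 1 - J \;\le\; \mathrm{sim}(A,B) &= \frac{|A \cap B|}{|A \cup B|} \;\le\; \frac{M}{L}
\quad\Longrightarrow\quad M \ge L\,(1 - J).
\end{align*}
```

Swapping the roles of the two sets when M ≥ L gives L/M ≥ 1-J, that is, M ≤ L/(1-J). Together these are the two sides of the length condition.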