similarity4

similarity4 - 1 Methods for High Degrees of Similarity...

This preview shows pages 1–9.


1 Methods for High Degrees of Similarity
Index-Based Methods
Exploiting Prefixes and Suffixes
Exploiting Length

2 Overview
- LSH-based methods are excellent for similarity thresholds that are not too high.
  - Possibly up to 80% or 90%.
- But for similarities above that, there are other methods that are more efficient.
  - And they also give exact answers.

3 Setting: Sets as Strings
- We'll again talk about Jaccard similarity and distance of sets.
- However, we now represent sets by strings (lists of symbols):
  1. Enumerate the universal set.
  2. Represent a set by the string of its elements in sorted order.

4 Example: Shingles
- If the universal set is k-shingles, there is a natural lexicographic order.
- Think of each shingle as a single symbol.
- Then the 2-shingling of abcad, which is the set {ab, bc, ca, ad}, is represented by the list ab, ad, bc, ca of length 4.
- Alternative: hash shingles; order by bucket number.

5 Example: Words
- If we treat a document as a set of words, we could order the words alphabetically.
- Better: order words lowest-frequency-first.
- Why? We shall index documents based on the early words in their lists.
  - Documents spread over more buckets.

6 Jaccard and Edit Distances
- Suppose two sets have Jaccard distance J and are represented by strings s1 and s2. Let the LCS (longest common subsequence) of s1 and s2 have length C and the edit distance of s1 and s2 be E. Then:
  - 1-J = Jaccard similarity = C/(C+E).
  - J = E/(C+E).
- Works because these strings never repeat a symbol, and symbols appear in the same order.
- Here the edit distance counts insertions and deletions only, so E = |s1| + |s2| - 2C and the union of the two sets has C+E elements.

7 Indexes
- The general approach is to build some indexes on the set of strings.
- Then, visit each string once and use the index to find possible candidates for similarity.
- For thought: how does this approach compare with bucketizing and looking within buckets for similarity?

8 Length-Based Indexes
- The simplest thing to do is create an index on the length of strings.
- A string of length L can be within Jaccard distance J of a string of length M only if L(1-J) ≤ M ≤ L/(1-J)....
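
The sets-as-strings representation, the relation between Jaccard similarity, LCS length, and edit distance, and the length condition above can be checked with a small sketch. This is only an illustration under stated assumptions, not code from the slides: all function names (`shingles`, `set_to_string`, `lcs_length`, `indel_distance`, `length_compatible`) are hypothetical, edit distance is assumed to count insertions and deletions only, and the length bounds are treated as inclusive. J is passed as a `Fraction` so the comparisons are exact.

```python
from fractions import Fraction


def shingles(text, k):
    """Return the set of k-shingles (length-k substrings) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def set_to_string(s):
    """Represent a set as the list of its elements in sorted order."""
    return sorted(s)


def lcs_length(a, b):
    """Length of the longest common subsequence of two lists (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(a, b):
    """Edit distance with insertions and deletions only (assumption)."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def jaccard_similarity(s, t):
    """Exact Jaccard similarity of two non-empty sets."""
    s, t = set(s), set(t)
    return Fraction(len(s & t), len(s | t))


def length_compatible(L, M, J):
    """Necessary condition for strings of lengths L and M to be within
    Jaccard distance J (0 <= J < 1): L(1-J) <= M <= L/(1-J)."""
    return L * (1 - J) <= M <= L / (1 - J)


if __name__ == "__main__":
    s1 = set_to_string(shingles("abcad", 2))
    s2 = set_to_string(shingles("abcab", 2))
    C = lcs_length(s1, s2)
    E = indel_distance(s1, s2)
    print(s1, s2, C, E, Fraction(C, C + E), jaccard_similarity(s1, s2))
    print(length_compatible(9, 10, Fraction(1, 10)))
```

A length-based index would only need to compare a string of length L against stored strings whose length M passes `length_compatible`; every other pair can be skipped without computing any similarity.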

This document was uploaded on 03/04/2012.

Page 1 / 37
