Methods for High Degrees of Similarity
Index-Based Methods
Exploiting Prefixes and Suffixes
Exploiting Length

Overview
- LSH-based methods are excellent for similarity thresholds that are not too high.
  - Possibly up to 80% or 90%.
- But for similarities above that, there are other methods that are more efficient.
  - And they also give exact answers.

Setting: Sets as Strings
- We'll again talk about Jaccard similarity and distance of sets.
- However, we now represent sets by strings (lists of symbols):
  1. Enumerate the universal set.
  2. Represent a set by the string of its elements in sorted order.

Example: Shingles
- If the universal set is k-shingles, there is a natural lexicographic order.
- Think of each shingle as a single symbol.
- Then the 2-shingling of abcad, which is the set {ab, bc, ca, ad}, is represented by the list ab, ad, bc, ca of length 4.
- Alternative: hash shingles; order by bucket number.

Example: Words
- If we treat a document as a set of words, we could order the words alphabetically.
- Better: order words lowest-frequency-first.
- Why? We shall index documents based on the early words in their lists.
  - Documents spread over more buckets.

Jaccard and Edit Distances
- Suppose two sets have Jaccard distance J and are represented by strings s1 and s2. Let the longest common subsequence (LCS) of s1 and s2 have length C and the edit distance of s1 and s2 be E. Then:
  - 1 - J = Jaccard similarity = C/(C+E).
  - J = E/(C+E).
- Works because these strings never repeat a symbol, and symbols appear in the same order.

Indexes
- The general approach is to build some indexes on the set of strings.
- Then, visit each string once and use the index to find possible candidates for similarity.
- For thought: how does this approach compare with bucketizing and looking within buckets for similarity?

Length-Based Indexes
- The simplest thing to do is create an index on the length of strings.
- A string of length L can be Jaccard distance J from a string of length M only if L(1 - J) <= M <= L/(1 - J)....
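
To make the length-based idea concrete, here is a minimal Python sketch; it is an illustration, not the slides' own implementation. The function names (`shingle_string`, `jaccard_distance`, `similar_pairs`) are hypothetical. It assumes the edit distance E counts insertions and deletions only, so that E = L + M - 2C for strings of lengths L and M, and it treats J as an upper bound on the allowed distance. Each string probes only lengths from L up to L/(1 - J), since a shorter partner is found when that partner takes its own turn.

```python
import math
from collections import defaultdict


def shingle_string(text, k=2):
    """Return the set of k-shingles of `text` as a sorted list of symbols."""
    return sorted({text[i:i + k] for i in range(len(text) - k + 1)})


def jaccard_distance(s, t):
    """Jaccard distance of two sorted, repeat-free symbol lists via C and E."""
    c = len(set(s) & set(t))      # LCS length C: shared symbols, already in the same order
    e = len(s) + len(t) - 2 * c   # assumed insert/delete-only edit distance E
    return e / (c + e) if c + e else 0.0


def similar_pairs(strings, max_dist):
    """Find index pairs within Jaccard distance max_dist (0 <= max_dist < 1)."""
    # Length index: string length -> positions of the strings with that length.
    by_length = defaultdict(list)
    for idx, s in enumerate(strings):
        by_length[len(s)].append(idx)

    pairs = []
    for i, s in enumerate(strings):
        L = len(s)
        # Probe only lengths M with L <= M <= L/(1 - J); the small epsilon
        # guards against floating-point rounding just below an integer.
        max_len = math.floor(L / (1 - max_dist) + 1e-9)
        for M in range(L, max_len + 1):
            for j in by_length.get(M, ()):
                if M == L and j <= i:
                    continue  # skip self and equal-length pairs already checked
                if jaccard_distance(s, strings[j]) <= max_dist:
                    pairs.append((i, j))
    return pairs
```

For example, `shingle_string("abcad")` yields the list ab, ad, bc, ca from the shingles slide. With a distance bound of 0.2, a string of length 9 is compared only against strings of length 9 through 11, so a string of length 3 is never examined as its partner.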
This document was uploaded on 03/04/2012.
 Fall '09
