similarity1



1 Finding Similar Sets
Applications
Shingling
Minhashing
Locality-Sensitive Hashing

2 Goals
Many Web-mining problems can be expressed as finding similar sets:
1. Pages with similar words, e.g., for classification by topic.
2. NetFlix users with similar tastes in movies, for recommendation systems.
3. Dual: movies with similar sets of fans.
4. Images of related things.

3 Similarity Algorithms
The best techniques depend on whether you are looking for items that are very similar or only somewhat similar.
We'll cover the "somewhat" case first, then talk about "very".

4 Example Problem: Comparing Documents
Goal: common text, not common topic.
Special cases are easy, e.g., identical documents, or one document contained character-by-character in another.
General case, where many small pieces of one doc appear out of order in another, is very hard.

5 Similar Documents (2)
Given a body of documents, e.g., the Web, find pairs of documents with a lot of text in common, e.g.:
Mirror sites, or approximate mirrors.
    Application: Don't want to show both in a search.
Plagiarism, including large quotations.
Similar news articles at many news sites.
    Application: Cluster articles by same story.

6 Three Essential Techniques for Similar Documents
1. Shingling: convert documents, emails, etc., to sets.
2. Minhashing: convert large sets to short signatures, while preserving similarity.
3. Locality-sensitive hashing: focus on pairs of signatures likely to be similar.

7 The Big Picture
Document
    -> Shingling -> The set of strings of length k that appear in the document
    -> Minhashing -> Signatures: short integer vectors that represent the sets, and reflect their similarity
    -> Locality-sensitive Hashing -> Candidate pairs: those pairs of signatures that we need to test for similarity.

8 Shingles
A k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document.
Example: k = 2; doc = abcab. Set of 2-shingles = {ab, bc, ca}.
Option: regard shingles as a bag, and count ab twice.
Represent a doc by its set of k-shingles.

9 Working Assumption
Documents that have lots of shingles in common have similar text, even if the text appears in different order.
Careful: you must pick k large enough, or most documents will have most shingles....
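The shingle definition and the warning about k can be checked with a few lines of Python. This is a minimal sketch, not code from the slides: the function names shingle_set, shingle_bag, and shared_fraction, and the two sample sentences, are assumptions made up for illustration.

```python
from collections import Counter


def shingle_set(doc, k):
    """Return the set of k-shingles (length-k substrings) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def shingle_bag(doc, k):
    """Return the bag (multiset) of k-shingles, keeping repeat counts."""
    return Counter(doc[i:i + k] for i in range(len(doc) - k + 1))


def shared_fraction(a, b, k):
    """Fraction of a's distinct k-shingles that also occur in b (hypothetical helper)."""
    sa, sb = shingle_set(a, k), shingle_set(b, k)
    return len(sa & sb) / len(sa) if sa else 0.0


# The example from the slides: k = 2, doc = abcab.
print(sorted(shingle_set("abcab", 2)))          # ['ab', 'bc', 'ca']
print(sorted(shingle_bag("abcab", 2).items()))  # [('ab', 2), ('bc', 1), ('ca', 1)]

# Two unrelated sentences (made-up sample data): with a tiny k they still
# share most shingles, which is why k must be large enough.
doc_a = "the cat sat on the mat"
doc_b = "a dog ran in the park"
for k in (1, 2, 5):
    print(k, round(shared_fraction(doc_a, doc_b, k), 2))
```

With k = 1 the shingles are single characters, so almost any two English texts overlap heavily; the shared fraction drops as k grows, which is the behavior the working assumption relies on.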