cs345-lsh

1 Low-Support, High-Correlation
  Finding Rare but Similar Items
  Minhashing
  Locality-Sensitive Hashing

2 The Problem
  - Rather than finding high-support item-pairs in basket data, look for items that are highly correlated.
    - If one appears in a basket, there is a good chance that the other does.
    - Yachts and caviar as itemsets: low support, but often appear together.

3 Correlation Versus Support
  - A-Priori and similar methods are useless for low-support, high-correlation itemsets.
  - When support threshold is low, too many itemsets are frequent.
    - Memory requirements too high.
  - A-Priori does not address correlation.

4 Matrix Representation of Item/Basket Data
  - Columns = items.
  - Rows = baskets.
  - Entry (r, c) = 1 if item c is in basket r; = 0 if not.
  - Assume matrix is almost all 0s.

5 In Matrix Form

               m  c  p  b  j
  {m,c,b}      1  1  0  1  0
  {m,p,b}      1  0  1  1  0
  {m,b}        1  0  0  1  0
  {c,j}        0  1  0  0  1
  {m,p,j}      1  0  1  0  1
  {m,c,b,j}    1  1  0  1  1
  {c,b,j}      0  1  0  1  1
  {c,b}        0  1  0  1  0

6 Applications --- (1)
  - Rows = customers; columns = items.
    - (r, c) = 1 if and only if customer r bought item c.
    - Well correlated columns are items that tend to be bought by the same customers.
    - Used by on-line vendors to select items to pitch to individual customers.

7 Applications --- (2)
  - Rows = (footprints of) shingles; columns = documents.
    - (r, c) = 1 iff footprint r is present in document c.
    - Find similar documents, as in Anand's 10/10 lecture.

8 Applications --- (3)
  - Rows and columns are both Web pages.
    - (r, c) = 1 iff page r links to page c.
    - Correlated columns are pages with many of the same in-links.
    - These pages may be about the same topic.

9 Assumptions --- (1)
  1. Number of items allows a small amount of main-memory/item.
    - E.g., main memory = Number of items * 100
  2. Too many items to store anything in main-memory for each pair of items.

10 Assumptions --- (2)
  3. Too many baskets to store anything in main memory for each basket.
  4. Data is very sparse: it is rare for an item to be in a basket.

11 From Correlation to Similarity
  - Statistical correlation is too hard to compute, and probably meaningless.
    - Most entries are 0, so correlation of columns is always high.
  - Substitute similarity, as in shingles-and-documents study.

12 Similarity of Columns
  - Think of a column as the set of rows in which it has 1....
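
The column-as-set view can be made concrete with the eight baskets from the "In Matrix Form" slide. The sketch below is illustrative only: the preview cuts off before the slides define similarity, so Jaccard similarity (intersection size over union size) is an assumption here, and the names columns_as_row_sets, jaccard, and pair_similarity are hypothetical helpers, not taken from the slides.

```python
from itertools import combinations

# Baskets from the "In Matrix Form" slide; items are m, c, p, b, j.
# Row index = position of the basket in this list.
baskets = [
    {"m", "c", "b"},
    {"m", "p", "b"},
    {"m", "b"},
    {"c", "j"},
    {"m", "p", "j"},
    {"m", "c", "b", "j"},
    {"c", "b", "j"},
    {"c", "b"},
]


def columns_as_row_sets(baskets):
    """Map each item (column) to the set of row indices where it has a 1.

    Only the 1s are stored, which suits a matrix that is almost all 0s.
    """
    cols = {}
    for row, basket in enumerate(baskets):
        for item in basket:
            cols.setdefault(item, set()).add(row)
    return cols


def jaccard(a, b):
    """Assumed similarity measure: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


cols = columns_as_row_sets(baskets)

# Exhaustive pairwise comparison. This is fine for five items, but it is
# exactly the per-pair work the Assumptions slides rule out at scale.
pair_similarity = {
    (x, y): jaccard(cols[x], cols[y])
    for x, y in combinations(sorted(cols), 2)
}

for pair, sim in sorted(pair_similarity.items(), key=lambda kv: -kv[1]):
    print(pair, round(sim, 3))
```

On this toy data, items b and m share 4 of the 7 rows in which either appears, so their assumed similarity is 4/7. Comparing every pair this way does not scale to the setting the slides describe, which is presumably what the minhashing and locality-sensitive hashing named on the title slide are meant to address.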
This note was uploaded on 01/31/2011 for the course CS 345 taught by Professor Dunbar,a during the Fall '07 term at UC Davis.
