Low-Support, High-Correlation
Finding Rare but Similar Items
Minhashing
Locality-Sensitive Hashing

The Problem
- Rather than finding high-support item pairs in basket data, look for items that are highly correlated.
  - If one appears in a basket, there is a good chance that the other does.
  - "Yachts and caviar" as itemsets: low support, but often appear together.

Correlation Versus Support
- A-Priori and similar methods are useless for low-support, high-correlation itemsets.
- When support threshold is low, too many itemsets are frequent.
  - Memory requirements too high.
- A-Priori does not address correlation.

Matrix Representation of Item/Basket Data
- Columns = items.
- Rows = baskets.
- Entry (r, c) = 1 if item c is in basket r; = 0 if not.
- Assume matrix is almost all 0s.

In Matrix Form

            m  c  p  b  j
{m,c,b}     1  1  0  1  0
{m,p,b}     1  0  1  1  0
{m,b}       1  0  0  1  0
{c,j}       0  1  0  0  1
{m,p,j}     1  0  1  0  1
{m,c,b,j}   1  1  0  1  1
{c,b,j}     0  1  0  1  1
{c,b}       0  1  0  1  0

Applications (1)
- Rows = customers; columns = items.
  - (r, c) = 1 if and only if customer r bought item c.
  - Well correlated columns are items that tend to be bought by the same customers.
  - Used by online vendors to select items to "pitch" to individual customers.

Applications (2)
- Rows = (footprints of) shingles; columns = documents.
  - (r, c) = 1 iff footprint r is present in document c.
  - Find similar documents, as in Anand's 10/10 lecture.

Applications (3)
- Rows and columns are both Web pages.
  - (r, c) = 1 iff page r links to page c.
  - Correlated columns are pages with many of the same in-links.
  - These pages may be about the same topic.

Assumptions (1)
1. Number of items allows a small amount of main-memory/item.
  - E.g., main memory = Number of items * 100
2. Too many items to store anything in main-memory for each pair of items.

Assumptions (2)
3. Too many baskets to store anything in main memory for each basket.
4. Data is very sparse: it is rare for an item to be in a basket.
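As a rough illustration of the matrix representation described earlier (rows = baskets, columns = items), the sketch below rebuilds the 0/1 matrix from the eight baskets on the "In Matrix Form" slide. The names ITEMS, BASKETS, and to_matrix are hypothetical and chosen only for this example; a dense list of lists is used for readability and would not suit the very sparse, very large data the assumptions describe.

```python
# Items and baskets copied from the "In Matrix Form" slide.
# ITEMS, BASKETS and to_matrix are illustrative names, not from the slides.
ITEMS = ["m", "c", "p", "b", "j"]
BASKETS = [
    {"m", "c", "b"},
    {"m", "p", "b"},
    {"m", "b"},
    {"c", "j"},
    {"m", "p", "j"},
    {"m", "c", "b", "j"},
    {"c", "b", "j"},
    {"c", "b"},
]


def to_matrix(baskets, items):
    """Return a list of rows; entry (r, c) is 1 if item c is in basket r, else 0."""
    return [[1 if item in basket else 0 for item in items] for basket in baskets]


matrix = to_matrix(BASKETS, ITEMS)

# Print one row per basket, labelled with the basket's items.
for basket, row in zip(BASKETS, matrix):
    print("".join(sorted(basket)).ljust(5), row)
```

Each printed row corresponds to one basket, and each position in the row to one of the items m, c, p, b, j in that order.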
From Correlation to Similarity
- Statistical correlation is too hard to compute, and probably meaningless.
  - Most entries are 0, so correlation of columns is always high.
- Substitute "similarity," as in shingles-and-documents study.

Similarity of Columns
- Think of a column as the set of rows in which it has 1....
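The idea of treating a column as the set of rows in which it has a 1 can be sketched as follows, reusing the eight baskets from the "In Matrix Form" slide. The names ITEMS, BASKETS, and column_sets are assumptions made for this example only. Because only the positions of the 1s are stored, the representation stays small when the matrix is almost all 0s.

```python
# Items and baskets copied from the "In Matrix Form" slide.
# ITEMS, BASKETS and column_sets are illustrative names, not from the slides.
ITEMS = ["m", "c", "p", "b", "j"]
BASKETS = [
    {"m", "c", "b"},
    {"m", "p", "b"},
    {"m", "b"},
    {"c", "j"},
    {"m", "p", "j"},
    {"m", "c", "b", "j"},
    {"c", "b", "j"},
    {"c", "b"},
]


def column_sets(baskets, items):
    """Map each item (column) to the set of row indices where the column has a 1."""
    cols = {item: set() for item in items}
    for row_index, basket in enumerate(baskets):
        for item in basket:
            cols[item].add(row_index)
    return cols


cols = column_sets(BASKETS, ITEMS)

# Rows (0-indexed) in which both m and c have a 1.
both_m_and_c = cols["m"] & cols["c"]
print(sorted(cols["m"]), sorted(cols["c"]), sorted(both_m_and_c))
```

Ordinary set operations on these column sets then give, for example, the rows in which two items appear together.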
This note was uploaded on 01/31/2011 for the course CS 345 taught by Professor Dunbar,a during the Fall '07 term at UC Davis.