# cs345-lsh - Low-Support High-Correlation Finding Rare but...

## Low-Support, High-Correlation

Finding Rare but Similar Items

- Minhashing
- Locality-Sensitive Hashing

## The Problem

- Rather than finding high-support item-pairs in basket data, look for items that are highly “correlated.”
  - If one appears in a basket, there is a good chance that the other does.
  - “Yachts and caviar” as itemsets: low support, but often appear together.

## Correlation Versus Support

- A-Priori and similar methods are useless for low-support, high-correlation itemsets.
- When support threshold is low, too many itemsets are frequent.
  - Memory requirements too high.
- A-Priori does not address correlation.
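
The gap between support and correlation can be sketched on a toy data set. Everything below is hypothetical: the 1000 baskets are invented for illustration, and "confidence" (the fraction of baskets containing one item that also contain the other) is used as a simple stand-in for "a good chance that the other does"; the slides do not name a specific measure here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: 1000 baskets, invented for illustration only.
baskets = (
    [{"bread", "milk"}] * 600
    + [{"bread"}] * 250
    + [{"milk"}] * 146
    + [{"yacht", "caviar"}] * 3
    + [{"caviar"}] * 1
)

# Count how many baskets contain each item and each pair of items.
item_count = Counter(item for b in baskets for item in b)
pair_count = Counter(
    pair for b in baskets for pair in combinations(sorted(b), 2)
)


def support(a, b):
    """Number of baskets containing both a and b."""
    return pair_count[tuple(sorted((a, b)))]


def confidence(a, b):
    """Fraction of the baskets containing a that also contain b."""
    return support(a, b) / item_count[a]


for a, b in [("bread", "milk"), ("yacht", "caviar"), ("caviar", "yacht")]:
    print(a, b, support(a, b), round(confidence(a, b), 3))
```

With a support threshold of, say, 1% of the baskets (10 here, also an assumed value), the yacht/caviar pair is discarded even though the two items almost always appear together.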

## Matrix Representation of Item/Basket Data

- Columns = items.
- Rows = baskets.
- Entry (*r*, *c*) = 1 if item *c* is in basket *r*; = 0 if not.
- Assume matrix is almost all 0’s.
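
## In Matrix Form

| Basket | m | c | p | b | j |
|---|---|---|---|---|---|
| {m,c,b} | 1 | 1 | 0 | 1 | 0 |
| {m,p,b} | 1 | 0 | 1 | 1 | 0 |
| {m,b} | 1 | 0 | 0 | 1 | 0 |
| {c,j} | 0 | 1 | 0 | 0 | 1 |
| {m,p,j} | 1 | 0 | 1 | 0 | 1 |
| {m,c,b,j} | 1 | 1 | 0 | 1 | 1 |
| {c,b,j} | 0 | 1 | 0 | 1 | 1 |
| {c,b} | 0 | 1 | 0 | 1 | 0 |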
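
As a minimal sketch, the matrix above can be rebuilt from the eight baskets with one row per basket and one column per item. The variable names (`items`, `baskets`, `matrix`) are chosen for illustration and do not come from the slides.

```python
# Column order as on the slide: m, c, p, b, j.
items = ["m", "c", "p", "b", "j"]

# The eight baskets from the slide, in the same order.
baskets = [
    {"m", "c", "b"},
    {"m", "p", "b"},
    {"m", "b"},
    {"c", "j"},
    {"m", "p", "j"},
    {"m", "c", "b", "j"},
    {"c", "b", "j"},
    {"c", "b"},
]

# Entry (r, c) is 1 if item c is in basket r, and 0 otherwise.
matrix = [[1 if item in basket else 0 for item in items] for basket in baskets]

for basket, row in zip(baskets, matrix):
    label = "{" + ",".join(i for i in items if i in basket) + "}"
    print(f"{label:<10}", *row)
```

A dense list of lists is fine at this size; under the slides' assumption that the matrix is almost all 0's, a real data set would more plausibly store only the positions of the 1's.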

## Applications (1)

- Rows = customers; columns = items.
  - (*r*, *c*) = 1 if and only if customer *r* bought item *c*.
  - Well correlated columns are items that tend to be bought by the same customers.
  - Used by on-line vendors to select items to “pitch” to individual customers.

## Applications (2)

- Rows = (footprints of) shingles; columns = documents.
  - (*r*, *c*) = 1 iff footprint *r* is present in document *c*.
  - Find similar documents, as in Anand’s 10/10 lecture.
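
A rough sketch of how such a shingle/document matrix might be built follows. The slides do not specify the shingle length or the footprint function, so the choices here are assumptions: character shingles of length 5, and CRC32 (via `zlib.crc32`) as an arbitrary stand-in for the footprint hash. The three documents are invented.

```python
import zlib


def shingles(text, k=5):
    """Set of all length-k character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def footprints(text, k=5):
    """Hash each shingle to a 32-bit integer (CRC32 is an assumed choice)."""
    return {zlib.crc32(s.encode("utf-8")) for s in shingles(text, k)}


# Hypothetical documents, for illustration only.
docs = {
    "d1": "the quick brown fox",
    "d2": "the quick brown dog",
    "d3": "lorem ipsum dolor",
}

doc_footprints = {name: footprints(text) for name, text in docs.items()}

# Rows = footprints seen in any document; columns = documents.
rows = sorted(set().union(*doc_footprints.values()))
columns = list(docs)
matrix = [[1 if fp in doc_footprints[d] else 0 for d in columns] for fp in rows]

print(len(rows), "rows x", len(columns), "columns")
print("footprints shared by d1 and d2:",
      len(doc_footprints["d1"] & doc_footprints["d2"]))
```

Documents that share long runs of text share many footprints, so their columns have 1's in many of the same rows.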

## Applications (3)

- Rows and columns are both Web pages.
  - (*r*, *c*) = 1 iff page *r* links to page *c*.
  - Correlated columns are pages with many of the same in-links.
  - These pages may be about the same topic.
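
The link matrix can be sketched on a tiny, invented link graph. The page names and links below are hypothetical; the point is only that column *c*, read as a set, is the set of pages linking to *c*, so two columns overlap when the pages share in-links.

```python
from itertools import combinations

# Hypothetical link graph: page -> set of pages it links to.
links = {
    "hub1": {"a", "b"},
    "hub2": {"a", "b"},
    "hub3": {"a", "b", "c"},
    "a": {"c"},
    "b": set(),
    "c": set(),
}
pages = sorted(links)

# Entry (r, c) is 1 iff page r links to page c.
matrix = [[1 if c in links[r] else 0 for c in pages] for r in pages]

# Column c, viewed as a set: the pages that link to c.
in_links = {c: {r for r in pages if c in links[r]} for c in pages}

# Number of in-links shared by each pair of pages.
shared = {
    (p, q): len(in_links[p] & in_links[q])
    for p, q in combinations(pages, 2)
}

print(shared[("a", "b")], shared[("a", "c")])
```

Here pages `a` and `b` are linked from the same three hubs, while `a` and `c` share only one in-link.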

## Assumptions (1)

1. Number of items allows a small amount of main memory per item.
   - E.g., main memory = Number of items * 100
2. Too many items to store anything in main memory for each pair of items.

## Assumptions (2)

3. Too many baskets to store anything in main memory for each basket.
4. Data is very sparse: it is rare for an item to be in a basket.
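
A quick back-of-the-envelope calculation shows why assumptions 1 and 2 are compatible. The item count below is hypothetical, and since the slides do not state the unit of memory, the sketch treats "Number of items * 100" as abstract units.

```python
def pairs(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


n_items = 1_000_000              # hypothetical number of items
memory_units = 100 * n_items     # "main memory = Number of items * 100"

per_item = memory_units / n_items          # budget per item
per_pair = memory_units / pairs(n_items)   # budget per pair of items

print(f"pairs of items: {pairs(n_items):,}")
print(f"units per item: {per_item:.0f}")
print(f"units per pair: {per_pair:.6f}")
```

The per-item budget stays at 100 units, but the per-pair budget falls far below one unit, which is why nothing can be stored for each pair.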

## From Correlation to Similarity

- Statistical correlation is too hard to compute, and probably meaningless.
  - Most entries are 0, so correlation of columns is always high.
- Substitute “similarity,” as in shingles-and-documents study.
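
One possible reading of the remark about sparse columns is that any measure counting matching entries is dominated by the rows where both columns are 0. The sketch below illustrates that reading only; the row count and the two columns are invented, and "agreement" (the fraction of rows where the entries are equal) is an assumed measure, not one named in the slides.

```python
n_rows = 10_000  # hypothetical number of baskets

# Two hypothetical sparse columns with 20 ones each, in disjoint rows.
ones_a = set(range(0, 20))
ones_b = set(range(1000, 1020))

col_a = [1 if r in ones_a else 0 for r in range(n_rows)]
col_b = [1 if r in ones_b else 0 for r in range(n_rows)]

# Fraction of rows in which the two columns have the same entry.
agreement = sum(x == y for x, y in zip(col_a, col_b)) / n_rows

# Number of rows in which both columns have a 1.
shared_ones = len(ones_a & ones_b)

print(agreement, shared_ones)
```

The two columns agree on nearly every row while never having a 1 in the same row, so a measure of this kind says little about whether the items actually occur together.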

## Similarity of Columns

- Think of a column as the set of rows in which it has 1.
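
The set view of a column can be sketched with the eight-basket matrix shown earlier in the slides. The text available here stops before a similarity measure is defined, so the Jaccard ratio used below (size of the intersection divided by size of the union) is an assumption, chosen because it is a common similarity for sets.

```python
items = ["m", "c", "p", "b", "j"]

# The 0/1 matrix from the "In Matrix Form" slide: rows = baskets, columns = items.
matrix = [
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 1, 0, 1, 1],
    [0, 1, 0, 1, 0],
]

# Each column becomes the set of row indices in which it has a 1.
column_sets = {
    item: {r for r, row in enumerate(matrix) if row[c] == 1}
    for c, item in enumerate(items)
}


def jaccard(a, b):
    """Assumed similarity: |A intersect B| / |A union B| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


print(sorted(column_sets["m"]))
print(jaccard(column_sets["m"], column_sets["b"]))
print(jaccard(column_sets["p"], column_sets["j"]))
```

Rows in which both columns are 0 never enter either set, so they do not affect the result.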