Low-Support, High-Correlation
Finding Rare but Similar Items
Minhashing
Locality-Sensitive Hashing
1 The Problem
x Rather than finding high-support item pairs in basket data, look for items that are highly "correlated." If one appears in a basket, there is a good chance that the other does.
"Yachts and caviar" as itemsets: low support, but often appear together.
2 Correlation Versus Support
x A-Priori and similar methods are useless for low-support, high-correlation itemsets.
x When the support threshold is low, too many itemsets are frequent.
Memory requirements too high.
x A-Priori does not address correlation.
3 Matrix Representation of Item/Basket Data
x Columns = items.
x Rows = baskets.
x Entry (r , c ) = 1 if item c is in basket r ; = 0 if not.
x Assume matrix is almost all 0's.
4 In Matrix Form
            m  c  p  b  j
{m,c,b}     1  1  0  1  0
{m,p,b}     1  0  1  1  0
{m,b}       1  0  0  1  0
{c,j}       0  1  0  0  1
{m,p,j}     1  0  1  0  1
{m,c,b,j}   1  1  0  1  1
{c,b,j}     0  1  0  1  1
{c,b}       0  1  0  1  0
5 Applications (1)
x Rows = customers; columns = items. (r, c) = 1 if and only if customer r bought item c.
Well-correlated columns are items that tend to be bought by the same customers.
Used by online vendors to select items to "pitch" to individual customers.
6 Applications (2)
x Rows = (footprints of) shingles; columns = documents.
(r, c ) = 1 iff footprint r is present in document c.
Find similar documents, as in Anand's 10/10 lecture.
7 Applications (3)
x Rows and columns are both Web pages. (r, c) = 1 iff page r links to page c.
Correlated columns are pages with many of the same in-links.
These pages may be about the same topic.
8 Assumptions (1)
1. Number of items allows a small amount of main memory per item.
x E.g., main memory = number of items * 100.
2. Too many items to store anything in main memory for each pair of items.
9 Assumptions (2)
3. Too many baskets to store anything in main memory for each basket.
4. Data is very sparse: it is rare for an item to be in a basket.
10 From Correlation to Similarity
x Statistical correlation is too hard to compute, and probably meaningless.
Most entries are 0, so correlation of columns is always high.
x Substitute "similarity," as in the shingles-and-documents study.
11 Similarity of Columns
x Think of a column as the set of rows in which it has 1.
x The similarity of columns C1 and C2, Sim(C1, C2), is the ratio of the sizes of the intersection and union of C1 and C2.
Sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2| = the Jaccard measure.
12 Example
C1  C2
 0   1
 1   0
 1   1
 0   0
 1   1
 0   1

Sim(C1, C2) = 2/5 = 0.4
13 Outline of Algorithm
1. Compute "signatures" ("sketches") of columns = small summaries of columns.
Read from disk to main memory.
2. Examine signatures in main memory to find similar signatures.
Essential: similarity of signatures and columns are related.
3. Check that columns with similar signatures are really similar (optional).
14 Signatures
x Key idea: "hash" each column C to a small signature Sig(C), such that:
1. Sig(C) is small enough that we can fit a signature in main memory for each column.
2. Sim(C1, C2) is the same as the "similarity" of Sig(C1) and Sig(C2).
15 An Idea That Doesn't Work
x Pick 100 rows at random, and let the signature of column C be the 100 bits of C in those rows.
x Because the matrix is sparse, many columns would have 00...0 as a signature, yet be very dissimilar because their 1's are in different rows.
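A quick back-of-the-envelope calculation (a sketch, not from the slides; the helper name is made up) shows why sampled rows fail on sparse data:

```python
# Illustrative estimate: probability that a fixed-size random sample of
# rows contains none of a sparse column's 1's.
def prob_all_zero_sample(k: int, n: int, sample: int = 100) -> float:
    """Approximate probability that `sample` randomly chosen rows all hold 0,
    for a column with k ones among n rows (a good approximation when k << n)."""
    return (1 - k / n) ** sample

# A column with 10 ones among a billion rows:
p = prob_all_zero_sample(k=10, n=10**9)
# p is essentially 1, so such a column almost surely gets signature 00...0
```

So nearly every sparse column collapses to the same all-zero signature, carrying no information about which rows hold its 1's.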
16 Four Types of Rows
x Given columns C1 and C2, rows may be classified as:

type  C1  C2
 a     1   1
 b     1   0
 c     0   1
 d     0   0

x Also, a = # rows of type a, etc.
x Note Sim(C1, C2) = a/(a + b + c).
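In code (a minimal sketch, assuming a set-of-row-numbers representation of columns), the Jaccard measure and its a/(a + b + c) reading:

```python
def jaccard(c1: set, c2: set) -> float:
    """Sim(C1, C2) for columns represented as sets of rows holding 1.
    |intersection| counts the type-a rows; |union| counts types a, b, c."""
    if not (c1 | c2):
        return 0.0
    return len(c1 & c2) / len(c1 | c2)

# The columns from the slide-13 example: C1 has 1's in rows {2, 3, 5},
# C2 in rows {1, 3, 5, 6}.
sim = jaccard({2, 3, 5}, {1, 3, 5, 6})   # 2/5 = 0.4
```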
17 Minhashing
x Imagine the rows permuted randomly.
x Define “hash” function h (C ) = the number of the first (in the permuted order) row in which column C has 1.
x Use several (e.g., 100) independent hash functions to create a signature.
18 Minhashing Example
Three permutations (each row's position in permuted order) and the input matrix:

p1  p2  p3    C1  C2  C3  C4
 1   4   3     1   0   1   0
 3   2   4     1   0   0   1
 7   1   7     0   1   0   1
 6   3   6     0   1   0   1
 2   6   1     0   1   0   1
 5   7   2     1   0   1   0
 4   5   5     1   0   1   0

Signature matrix M:

p3:  2  1  2  1
p2:  2  1  4  1
p1:  1  2  1  2

19 Surprising Property
x The probability (over all permutations of the rows) that h (C1) = h (C2) is the same as Sim (C1, C2).
x Both are a/(a + b + c)!
x Why? Look down the permuted columns C1 and C2 until we see a 1.
If it's a type a row, then h(C1) = h(C2). If a type b or c row, then not. (Type d rows affect neither minhash, and the first non-d row is type a with probability a/(a + b + c).)
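The property can be checked empirically with a small simulation (a sketch with made-up helper names, reusing the similarity-0.4 columns from the earlier example):

```python
import random

def minhash(col: set, pos: dict) -> int:
    """h(C): smallest permuted position over the rows where C has a 1."""
    return min(pos[r] for r in col)

def estimate_sim(c1: set, c2: set, rows: list, trials: int = 20000) -> float:
    """Fraction of random row permutations on which the minhashes agree."""
    rng = random.Random(42)
    agree = 0
    for _ in range(trials):
        order = rows[:]
        rng.shuffle(order)                      # a random permutation of the rows
        pos = {r: i for i, r in enumerate(order)}
        agree += minhash(c1, pos) == minhash(c2, pos)
    return agree / trials

# Columns with Jaccard similarity 2/5 = 0.4:
est = estimate_sim({2, 3, 5}, {1, 3, 5, 6}, rows=list(range(1, 7)))
# est should be close to 0.4
```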
20 Similarity for Signatures
x The similarity of signatures is the fraction of the rows in which they agree.
Remember, each row corresponds to a permutation or "hash function."
21 Minhashing – Example
(Same input matrix and signature matrix M as on slide 18.)

Similarities:
           1-3    2-4    1-2    3-4
Col/Col   0.75   0.75     0      0
Sig/Sig   0.67   1.00     0      0

22 Minhash Signatures
x Pick (say) 100 random permutations of the rows.
x Think of Sig(C) as a column vector.
x Let Sig(C)[i] = number of the first row (in the i-th permutation) with 1 in column C.
23 Implementation (1)
x Number of rows = 1 billion.
x Hard to pick a random permutation of 1…1 billion.
x Representing a random permutation requires 1 billion entries.
x Accessing rows in permuted order is tough!
The number of passes would be prohibitive.
24 Implementation (2)
1. Pick (say) 100 hash functions.
2. For each column c and each hash function hi, keep a "slot" M(i, c) for that minhash value.
3. For each row r, for each column c with 1 in row r, and for each hash function hi do
   if hi(r) is a smaller value than M(i, c) then M(i, c) := hi(r).
x Needs only one pass through the data.
25 Example
Row  C1  C2        h(x) = x mod 5
 1    1   0        g(x) = (2x + 1) mod 5
 2    0   1
 3    1   1
 4    1   0
 5    0   1

Slots M(i, c) after processing each row (∞ = not yet set):

                     Sig(C1)   Sig(C2)
Row 1: h=1, g=3       1, 3      ∞, ∞
Row 2: h=2, g=0       1, 3      2, 0
Row 3: h=3, g=2       1, 2      2, 0
Row 4: h=4, g=4       1, 2      2, 0
Row 5: h=0, g=1       1, 2      0, 0
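The one-pass slot-update algorithm can be sketched as follows (assuming Python; the function and variable names are made up, but the h, g and the matrix are the ones from the example above):

```python
INF = float('inf')

def minhash_signatures(row_stream, n_cols, hash_funcs):
    """One pass over the data: row_stream yields (row_number, columns_with_1).
    M[i][c] ends up as the minimum of hash_funcs[i](r) over rows r where
    column c has a 1, i.e., the minhash signature matrix."""
    M = [[INF] * n_cols for _ in hash_funcs]
    for r, cols in row_stream:
        values = [h(r) for h in hash_funcs]     # compute each hi(r) once
        for c in cols:                          # only columns with 1 in row r
            for i, v in enumerate(values):
                if v < M[i][c]:                 # keep the smaller value
                    M[i][c] = v
    return M

h = lambda x: x % 5
g = lambda x: (2 * x + 1) % 5
rows = [(1, [0]), (2, [1]), (3, [0, 1]), (4, [0]), (5, [1])]
M = minhash_signatures(rows, n_cols=2, hash_funcs=[h, g])
# M == [[1, 0], [2, 0]]: Sig(C1) = (1, 2), Sig(C2) = (0, 0)
```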
26 Comparison with “Shingling”
x The shingling paper proposed using one hash function and taking the first 100 (say) values.
x Almost the same, but:
Faster: saves on hash computation.
Admits some correlation among rows of the signatures.
27 Candidate Generation
x Pick a similarity threshold s, a fraction < 1.
x A pair of columns c and d is a candidate pair if their signatures agree in at least fraction s of the rows.
I.e., M(i, c) = M(i, d) for at least fraction s of the values of i.
28 The Problem with Checking Candidates
x While the signatures of all columns may fit in main memory, comparing the signatures of all pairs of columns is quadratic in the number of columns.
x Example: 10^6 columns implies 5×10^11 comparisons.
x At 1 microsecond/comparison: about 6 days.
29 Solutions
1. DCM method (Anand’s 10/10 slides) relies on external sorting, so several passes over the data are needed.
2. Locality-Sensitive Hashing (LSH) is a method that can be carried out in main memory, but admits some false negatives.
30 Locality-Sensitive Hashing
x Unrelated to “minhashing.”
x Operates on signatures.
x Big idea: hash columns of signature matrix M several times.
x Arrange that similar columns are more likely to hash to the same bucket.
x Candidate pairs are those that hash at least once to the same bucket.
31 Partition into Bands
x Divide matrix M into b bands of r rows.
x For each band, hash its portion of each column to k buckets.
x Candidate column pairs are those that hash to the same bucket for ≥ 1 band.
x Tune b and r to catch most similar pairs, but few non-similar pairs.
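The banding scheme can be sketched as follows (assuming Python; the function names are made up, and the band slice is used directly as the bucket key, which mirrors the "same bucket means identical" simplification made on the next slide):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(M, b, r):
    """Split signature matrix M (b*r rows, one per hash function) into b bands
    of r rows; columns identical within at least one band become candidates."""
    assert len(M) == b * r
    n_cols = len(M[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(n_cols):
            key = tuple(M[i][c] for i in range(band * r, band * r + r))
            buckets[key].append(c)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates

def prob_candidate(s, b, r):
    """Chance that a pair of similarity s agrees in at least one of b bands."""
    return 1 - (1 - s ** r) ** b
```

With the b = 20, r = 5 tuning from the example below, `prob_candidate(0.8, 20, 5)` is about 0.9996, matching the "miss about 1/3000" calculation.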
32 Simplifying Assumption
x There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band.
x Hereafter, we assume that "same bucket" means "identical."
33 [Figure: signature matrix M partitioned into b bands of r rows each; each band of each column is hashed to buckets]
34 Example
x Suppose 100,000 columns.
x Signatures of 100 integers.
x Therefore, signatures take 40 MB.
x But 5,000,000,000 pairs of signatures can take a while to compare.
x Choose 20 bands of 5 integers/band.
35 Suppose C1, C2 are 80% Similar
35 Suppose C1, C2 are 80% Similar
x Probability C1, C2 identical in one particular band: (0.8)^5 = 0.328.
x Probability C1, C2 are not identical in any of the 20 bands: (1 − 0.328)^20 = 0.00035.
I.e., we miss about 1/3000 of the 80%-similar column pairs.
36 Suppose C1, C2 Only 40% Similar
x Probability C1, C2 identical in any one particular band: (0.4)^5 ≈ 0.01.
x Probability C1, C2 identical in ≥ 1 of 20 bands: ≤ 20 * 0.01 = 0.2 .
x Small probability C1, C2 not identical in a band, but hash to the same bucket.
x But false positives much lower for similarities << 40%.
37 LSH Graphically
x Example target: all pairs with Sim > 60%.
x Suppose we use only one hash function:
[Figure: probability of becoming a candidate vs. similarity s. The ideal curve steps from 0 to 1 at the threshold; a single hash function gives the straight line Prob = s.]
x LSH (partition into bands) gives us an S-curve:
[Figure: Prob = 1 − (1 − s^r)^b as a function of s]
38 LSH Summary
x Tune to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures.
x Check in main memory that candidate pairs really do have similar signatures.
x Then, in another pass through the data, check that the remaining candidate pairs really are similar columns.
39 New Topic: Hamming LSH
x An alternative to minhash + LSH.
x Takes advantage of the fact that if columns are not sparse, random rows serve as a good signature.
x Trick: create data matrices of exponentially decreasing sizes, increasing densities.
40 Amplification of 1’s
x Hamming LSH constructs a series of matrices, each with half as many rows, by ORing together pairs of rows.
x Candidate pairs from each matrix have (say) between 20% and 80% 1's and are similar in 100 selected rows.
20%–80% is OK for similarity thresholds ≥ 0.5. Otherwise, two "similar" columns could fail to both be in range for at least one matrix.
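The row-halving step can be sketched as follows (assuming Python and dense 0/1 lists; the column is the illustrative 8-row example used on the next slide):

```python
def halve(column):
    """OR together consecutive pairs of rows, giving a column half as long.
    Each halving doubles the density of 1's (at most)."""
    return [a | b for a, b in zip(column[0::2], column[1::2])]

col = [0, 0, 1, 1, 0, 0, 1, 0]      # an illustrative 8-row column
series = [col]
while len(series[-1]) > 1:          # log2(R) successive halvings
    series.append(halve(series[-1]))
# series == [[0,0,1,1,0,0,1,0], [0,1,0,1], [1,1], [1]]
```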
41 Example
An 8-row column and its successive halvings (OR of consecutive pairs of rows):

0 0 1 1 0 0 1 0  →  0 1 0 1  →  1 1  →  1

42 Using Hamming LSH
x Construct the sequence of matrices. If there are R rows, then log2(R) matrices.
Total work = twice that of reading the original matrix.
x Use standard LSH to identify similar columns in each matrix, but restricted to columns of "medium" density.
43 ...
View Full
Document
 Spring '09

Click to edit the document details