emnlp - Hashing, sketching, and other approximate...

1 Hashing, sketching, and other approximate algorithms for high-dimensional data
Piotr Indyk, MIT
2 Plan
• Intro
  – High dimensionality
  – Problems
• Technique: randomized projection
  – Intuition
  – Proof idea
• Applications:
  – Sketching/streaming
  – Nearest Neighbor Search
• Conclusions
• Refs
3 High-Dimensional Data
[Figure: the text “To be or not to be …” mapped to a vector of word counts, with coordinates labeled to, be, or, not, shown next to the vectors of other documents]
(…, 2, …, 2, …, 1, …, 1, …)
(…, 1, …, 4, …, 2, …, 2, …)
(…, 6, …, 1, …, 3, …, 6, …)
(…, 1, …, 3, …, 7, …, 5, …)
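A minimal sketch of how a document turns into such a count vector. The four-word vocabulary and the function name `word_histogram` are illustrative assumptions; a real vocabulary has one coordinate per distinct word, which is what makes the vectors high-dimensional.

```python
from collections import Counter


def word_histogram(text, vocabulary):
    """Return the number of occurrences of each vocabulary word in text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary, chosen only to match the example sentence.
vocabulary = ["to", "be", "or", "not"]
vector = word_histogram("To be or not to be", vocabulary)
print(vector)  # [2, 2, 1, 1]
```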
4 Problems
• Storage
  – How to represent the data “accurately” using “small” space
• Search
  – How to find “similar” documents
• Learning, etc…
5 Randomized Dimensionality Reduction
6 Randomized Dimensionality Reduction (a.k.a. “Flattening Lemma”)
• Johnson-Lindenstrauss lemma (1984)
  – Choose the projection plane “at random”
  – The distances are “approximately” preserved with “high” probability
7 Dimensionality Reduction, Formally
• JL: For any set of n points X in $\mathbb{R}^d$ under the Euclidean norm, there is a $(1+\varepsilon)$-distortion embedding of X into $\mathbb{R}^{d'}$, for $d' = O(\log n / \varepsilon^2)$
• L: There is a distribution over random linear mappings $A: \mathbb{R}^d \to \mathbb{R}^{d'}$ such that for any vector x we have $\|Ax\| = (1 \pm \varepsilon)\,\|x\|$ with probability $1 - e^{-C d' \varepsilon^2}$
• Questions:
  – What is the distribution?
  – Why does it work?
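The step from the single-vector statement L to the set statement JL can be sketched with a union bound. This is the standard textbook argument, not necessarily the one used later in the deck; C is the unspecified constant from L, and A is assumed to be scaled so that L applies as written.

```latex
% Apply L to each of the \binom{n}{2} difference vectors u - v with u, v \in X.
% By linearity, A(u - v) = Au - Av, so L controls every pairwise distance.
\Pr\Bigl[\exists\, u, v \in X :\ \|Au - Av\| \notin (1 \pm \varepsilon)\,\|u - v\|\Bigr]
  \;\le\; \binom{n}{2}\, e^{-C d' \varepsilon^{2}}
  \;<\; n^{2}\, e^{-C d' \varepsilon^{2}} .
% The right-hand side is at most 1 as soon as
d' \;\ge\; \frac{2 \ln n}{C\, \varepsilon^{2}} \;=\; O\!\left(\frac{\log n}{\varepsilon^{2}}\right),
% so with positive probability a random A preserves all pairwise distances
% up to a factor (1 \pm \varepsilon).
```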
8 Normal Distribution
• Normal distribution:
  – Range: $(-\infty, \infty)$
  – Density: $f(x) = e^{-x^2/2} / (2\pi)^{1/2}$
  – Mean = 0, Variance = 1
• Basic facts:
  – If X and Y are independent r.v. with normal distribution, then X+Y has normal distribution
  – $\mathrm{Var}(cX) = c^2\, \mathrm{Var}(X)$
  – If X, Y are independent, then $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$
9 Back to the Embedding
• We use the mapping Ax where each entry of A has normal distribution
• Let $a^1, \dots, a^{d'}$ be the rows of A
• Consider $Z = a^j \cdot x = a \cdot x = \sum_i a_i x_i$ (writing a for a fixed row $a^j$, with entries $a_i$)
• Each term $a_i x_i$
  – Has normal distribution
  – With variance $x_i^2$
• Thus, Z has normal distribution with variance $\sum_i x_i^2 = \|x\|^2$
• This holds for each $a^j$
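The variance claim can be checked empirically. The sketch below draws many independent rows a with N(0, 1) entries and compares the sample variance of Z = a · x with the squared norm of x. The vector x, the trial count, and the seed are arbitrary choices made for illustration.

```python
import random


def projection_samples(x, trials, rng):
    """Draw `trials` independent values of Z = a . x with each a_i ~ N(0, 1)."""
    return [sum(rng.gauss(0.0, 1.0) * xi for xi in x) for _ in range(trials)]


rng = random.Random(0)          # fixed seed so the run is reproducible
x = [3.0, 4.0]                  # ||x||^2 = 9 + 16 = 25
samples = projection_samples(x, 20000, rng)

mean = sum(samples) / len(samples)
variance = sum((z - mean) ** 2 for z in samples) / len(samples)
squared_norm = sum(xi * xi for xi in x)

# The sample variance should land close to ||x||^2, and the mean close to 0.
print(mean, variance, squared_norm)
```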
10 What is $\|Ax\|^2$?
• $\|Ax\|^2 = (a^1 \cdot x)^2 + \dots + (a^{d'} \cdot x)^2 = Z_1^2 + \dots + Z_{d'}^2$, where:
  – All $Z_i$’s are independent
  – Each has normal distribution with variance $\|x\|^2$
• Therefore, $E[\|Ax\|^2] = d' \cdot E[Z_1^2] = d'\, \|x\|^2$
• By the “law of large numbers” (quantitative): $\Pr\bigl[\, \bigl|\, \|Ax\|^2 - d'\, \|x\|^2 \bigr| > \varepsilon d' \,\bigr] < e^{-C d' \varepsilon^2}$ for some constant C (as written, the bound is for $\|x\| = 1$)
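A small experiment illustrating the concentration. Since E[||Ax||^2] = d' ||x||^2, the sketch rescales A by 1/sqrt(d') so that the projected norm can be compared to ||x|| directly; that scaling, the dimensions, and the function names are assumptions made for this illustration.

```python
import math
import random


def random_projection(x, d_out, rng):
    """Return (1 / sqrt(d_out)) * A x, where A is d_out x len(x), entries N(0, 1)."""
    scale = 1.0 / math.sqrt(d_out)
    return [
        scale * sum(rng.gauss(0.0, 1.0) * xi for xi in x)  # one row of A times x
        for _ in range(d_out)
    ]


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))


rng = random.Random(1)                       # fixed seed for reproducibility
d, d_out = 500, 400                          # original and reduced dimensions
x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
y = random_projection(x, d_out, rng)

ratio = norm(y) / norm(x)                    # expected to be close to 1
print(ratio)
```

Increasing `d_out` tightens the spread of `ratio` around 1, which is the trade-off behind d' = O(log n / ε^2).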
11 Streaming/sketching implications
• Can replace d-dimensional vectors by d’-dimensional ones
  – Cost: O(dd’) per vector
  – Faster method known [Ailon-Chazelle’06]
• Can avoid storing the original d-dimensional vectors in the first place (thanks to linearity of the mapping A)
  – Suppose:
    • x is the histogram of a document
    • We are receiving a stream of document words
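One way the linearity can be used, sketched under assumptions: the preview ends before the slides spell out the streaming update, so the code below is an illustration rather than the deck's own method. Seeing a word increments one coordinate of x, so the sketch Ax changes by the matching column of A. Regenerating that column from a word-seeded generator (the hypothetical helper `word_column`) is a shortcut so that neither x nor A is ever stored.

```python
import math
import random


def word_column(word, d_out, seed=0):
    """Pseudo-random N(0, 1) column of A for `word`, regenerated on demand."""
    rng = random.Random(f"{seed}:{word}")    # string seeds are deterministic
    return [rng.gauss(0.0, 1.0) for _ in range(d_out)]


def sketch_stream(words, d_out, seed=0):
    """Maintain s = A x while reading the document one word at a time."""
    sketch = [0.0] * d_out
    for word in words:
        column = word_column(word, d_out, seed)
        for j in range(d_out):
            sketch[j] += column[j]           # x[word] += 1  =>  s += A[:, word]
    return sketch


d_out = 200
doc_a = "to be or not to be".split()
doc_b = "to be or not".split()
s_a = sketch_stream(doc_a, d_out)
s_b = sketch_stream(doc_b, d_out)

# Histograms differ by 1 in the "to" and "be" coordinates, so the true
# Euclidean distance is sqrt(2); the rescaled sketch distance estimates it.
estimate = math.sqrt(sum((p - q) ** 2 for p, q in zip(s_a, s_b)) / d_out)
print(estimate, math.sqrt(2))
```

Both documents must be sketched with the same `seed`, since they have to share the same matrix A for their sketches to be comparable.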

This note was uploaded on 11/09/2011 for the course CIS 6930 taught by Professor Staff during the Fall '08 term at University of Florida.
