6.896 Sublinear Time Algorithms
December 2, 2004
Lecture 22
Lecturer: Eli Ben-Sasson
Scribe: Rafael Pass
1 Testing Proximity of Distributions
A distribution p on [n] = {1, 2, ..., n} is given by the probabilities (p_1, ..., p_n), such that ∑_{i=1}^n p_i = 1 and 0 ≤ p_i ≤ 1. We will consider algorithms that are given oracle access to a distribution p, i.e., each time we "press a button" we get a sample i ∈ [n] with probability p_i. Our objective is to test whether two distributions p, q over [n] are "close".
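The sampling-oracle model above is easy to mimic in code; the following is a minimal sketch (the names `make_oracle` and `press_button` are illustrative, not from the notes):

```python
import random

random.seed(0)  # fixed seed, only so the demo below is reproducible

def make_oracle(p):
    """Return a 'button' that yields i in [n] with probability p[i-1].

    `p` is a list (p_1, ..., p_n) of nonnegative weights summing to 1.
    """
    n = len(p)
    def press():
        # random.choices draws one value from range(1, n+1) according to p
        return random.choices(range(1, n + 1), weights=p, k=1)[0]
    return press

# Example: a distribution on [4]; pressing the button 10000 times should
# return the sample 4 roughly 40% of the time.
press_button = make_oracle([0.1, 0.2, 0.3, 0.4])
draws = [press_button() for _ in range(10000)]
```

The tester only interacts with p through such a button; it never sees the vector (p_1, ..., p_n) itself.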
1.1 Definitions of closeness of distributions
Ideally we would like to consider the L_1 distance between two distributions, defined as follows:

    ‖p − q‖_1 = ∑_{i=1}^n |p_i − q_i|
Today we will instead focus on the (easier) L_2 distance (i.e., the Euclidean norm), defined as follows:

    ‖p − q‖_2 = √( ∑_{i=1}^n (p_i − q_i)^2 )
Later we will use this to estimate the L_1 distance. (A well-known fact, which follows from the Cauchy–Schwarz inequality, is that ‖p − q‖_1 ≤ √n · ‖p − q‖_2. This fact will, however, not be enough to get a good estimate of the L_1 distance.)
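As a quick sanity check of the two definitions and the √n relation between them, here is a small numeric example (the helper names `l1` and `l2` are ours, not from the notes):

```python
import math

def l1(p, q):
    """L_1 distance: sum of |p_i - q_i|."""
    return sum(abs(a - b) for a, b in zip(p, q))

def l2(p, q):
    """L_2 (Euclidean) distance: sqrt of the sum of (p_i - q_i)^2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

p = [0.25, 0.25, 0.25, 0.25]
q = [0.40, 0.30, 0.20, 0.10]
n = len(p)
# the well-known fact: ||p - q||_1 <= sqrt(n) * ||p - q||_2
assert l1(p, q) <= math.sqrt(n) * l2(p, q)
```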
1.2 The Theorem
We will prove the following theorem:
Theorem 1 (Batu, Fortnow, Rubinfeld, Smith, White [1]) For every constant ε and all distributions p, q over [n], there exists a test that runs in time O(δ^{−4} log(1/ε)) such that:
• If ‖p − q‖_2 < δ/2, then Pr[test accepts] ≥ 1 − ε.

• If ‖p − q‖_2 > δ, then Pr[test accepts] ≤ ε.

• The query complexity of the tester (i.e., the number of samples) is less than the running time (which is constant for constant δ, ε).
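One natural way to obtain such a tester (a sketch of the standard collision-based approach; the notes have not yet given the algorithm, so the threshold 5δ²/8 and the helper names are our assumptions) is to expand ‖p − q‖_2² = ‖p‖² + ‖q‖² − 2⟨p, q⟩ and note that each term is a collision probability, estimable from samples alone:

```python
import random
from itertools import combinations

def l2_squared_estimate(samples_p, samples_q):
    """Estimate ||p - q||_2^2 from two lists of m samples each.

    ||p||^2 is the probability two independent p-samples collide, and
    likewise for ||q||^2 and the cross term <p, q>.
    """
    m = len(samples_p)
    pairs = m * (m - 1) / 2
    coll_p = sum(1 for a, b in combinations(samples_p, 2) if a == b)
    coll_q = sum(1 for a, b in combinations(samples_q, 2) if a == b)
    cross = sum(1 for a in samples_p for b in samples_q if a == b)
    return coll_p / pairs + coll_q / pairs - 2 * cross / (m * m)

def l2_closeness_test(sample_p, sample_q, delta, m):
    """Accept iff the estimate is below a threshold between
    (delta/2)^2 and delta^2 (we use their midpoint, 5*delta^2/8)."""
    sp = [sample_p() for _ in range(m)]
    sq = [sample_q() for _ in range(m)]
    return l2_squared_estimate(sp, sq) < (5 / 8) * delta ** 2
```

With m = O(δ^{−2}) samples the estimate concentrates, and repeating O(log(1/ε)) times and taking a majority vote drives the error probability below ε, matching the O(δ^{−4} log(1/ε)) bound in the theorem.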
Next lecture we will show a tester for L_1 distance which uses query complexity ≈ n^{2/3} (which thus is super-constant).
1.3 Why is it harder to test L_1 distance?
The following example shows why testing L_1 distance requires super-constant query complexity (even for constant δ, ε). Consider the following two cases:
• p, q are two uniform distributions on two equally large (unknown) disjoint subsets of [n]. Note that ‖p − q‖_1 = 2.
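For such a pair the L_1 distance is always 2, while the L_2 distance is √(2/k) for supports of size k, i.e., it vanishes as the supports grow; this is why an L_2 tester with constant δ says nothing here. A small numeric check (the helper `disjoint_uniform_distances` is ours):

```python
import math

def disjoint_uniform_distances(n):
    """p uniform on {1..n/2}, q uniform on {n/2+1..n}; return (L1, L2)."""
    k = n // 2
    p = [1 / k] * k + [0.0] * k
    q = [0.0] * k + [1 / k] * k
    d1 = sum(abs(a - b) for a, b in zip(p, q))
    d2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return d1, d2

for n in (100, 10000):
    d1, d2 = disjoint_uniform_distances(n)
    # d1 stays 2 regardless of n; d2 = 2/sqrt(n) shrinks with n
```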