6.896 Sublinear Time Algorithms
December 2, 2004
Lecture 22
Lecturer: Eli BenSasson
Scribe: Rafael Pass
1 Testing Proximity of Distributions
A distribution $p$ on $[n] = \{1, 2, \ldots, n\}$ is given by the probabilities $(p_1, \ldots, p_n)$, such that $\sum_{i=1}^{n} p_i = 1$ and $0 \le p_i \le 1$. We will consider algorithms that are given oracle access to a distribution $p$, i.e., each time we "press a button" we get a sample $i \in [n]$ with probability $p_i$. Our objective is to test whether two distributions $p, q$ over $[n]$ are "close".
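The oracle model above can be sketched in a few lines. This is only an illustration: the name `make_oracle`, the `seed` parameter, and the explicit probability vector are assumptions made here (a real tester never sees the vector, only the samples).

```python
import random
from typing import Callable, List


def make_oracle(p: List[float], seed: int = 0) -> Callable[[], int]:
    """Return a 'button' that yields i in {1, ..., n} with probability p[i-1].

    Hypothetical helper for illustration; the tester itself would only
    ever call the returned function, never inspect p.
    """
    if any(x < 0 or x > 1 for x in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be a probability vector")
    rng = random.Random(seed)
    support = list(range(1, len(p) + 1))

    def press() -> int:
        # One independent sample from p.
        return rng.choices(support, weights=p, k=1)[0]

    return press
```

Each call to the returned function corresponds to one query, so the query complexity of a tester is simply the number of times it presses the button.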
1.1 Definitions of closeness of distributions
Ideally we would like to consider the $L_1$ distance between two distributions, defined as follows:
$$\|p - q\|_1 = \sum_{i=1}^{n} |p_i - q_i|$$
Today we will instead focus on the (easier) $L_2$ distance (i.e., the Euclidean norm), defined as follows:
$$\|p - q\|_2 = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
Later we will use this to estimate the $L_1$ distance. (A well-known fact is that $\|p - q\|_1 \le \sqrt{n}\, \|p - q\|_2$. This fact will, however, not be enough to get a good estimate of the $L_1$ distance.)
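To make the two distances and the $\sqrt{n}$ bound concrete, here is a small numerical sketch. The function names and the example pair (uniform distributions on the two halves of $[n]$, with $n$ even) are chosen here for illustration and are not part of the lecture's notation.

```python
import math
from typing import List, Sequence, Tuple


def l1_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def l2_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean norm of the difference vector."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def disjoint_uniform_pair(n: int) -> Tuple[List[float], List[float]]:
    """Hypothetical example: p uniform on the first half of [n], q on the second (n even)."""
    half = n // 2
    p = [1.0 / half] * half + [0.0] * half
    q = [0.0] * half + [1.0 / half] * half
    return p, q


for n in (8, 128, 2048):
    p, q = disjoint_uniform_pair(n)
    d1, d2 = l1_distance(p, q), l2_distance(p, q)
    # The bound ||p - q||_1 <= sqrt(n) * ||p - q||_2 holds for every pair.
    assert d1 <= math.sqrt(n) * d2 + 1e-9
    print(n, d1, d2)
```

For this pair the $L_1$ distance stays at 2 while the $L_2$ distance is $2/\sqrt{n}$, so it shrinks as $n$ grows. This suggests one way to see why the bound alone is not enough: certifying a constant $L_1$ gap through it would require resolving $L_2$ distances of order $1/\sqrt{n}$.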
1.2 The Theorem
We will prove the following theorem:
Theorem 1 (Batu, Fortnow, Rubinfeld, Smith, White [1]) For all constants $\delta, \epsilon$ and all distributions $p, q$ over $[n]$, there exists a test that runs in time $O(\delta^{-4} \log(1/\epsilon))$ such that:
• If $\|p - q\|_2 < \delta/2$, then $\Pr[\text{test accepts}] \ge 1 - \epsilon$.
• If $\|p - q\|_2 > \delta$, then $\Pr[\text{test accepts}] \le \epsilon$.
• The query complexity of the tester (i.e., the number of samples) is at most the running time (which is constant for constant $\delta, \epsilon$).
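The statement does not by itself say how a test with sample complexity independent of $n$ could operate. One standard route, given here as an assumption and not necessarily the construction used in this lecture, is collision counting: $\|p - q\|_2^2 = \sum_i p_i^2 + \sum_i q_i^2 - 2\sum_i p_i q_i$, and each of the three sums is the probability that two independent samples coincide. The sketch below (all function names hypothetical) turns that identity into an estimator over two lists of samples.

```python
from collections import Counter
from typing import Sequence


def self_collision_rate(samples: Sequence[int]) -> float:
    """Fraction of unordered sample pairs that coincide; its expectation is sum_i p_i^2."""
    m = len(samples)
    if m < 2:
        raise ValueError("need at least two samples")
    counts = Counter(samples)
    colliding = sum(c * (c - 1) // 2 for c in counts.values())
    return colliding / (m * (m - 1) // 2)


def cross_collision_rate(xs: Sequence[int], ys: Sequence[int]) -> float:
    """Fraction of (x, y) pairs with x == y; its expectation is sum_i p_i * q_i."""
    if not xs or not ys:
        raise ValueError("need at least one sample from each distribution")
    cx, cy = Counter(xs), Counter(ys)
    matches = sum(c * cy[v] for v, c in cx.items())
    return matches / (len(xs) * len(ys))


def l2_squared_estimate(xs: Sequence[int], ys: Sequence[int]) -> float:
    """Estimate of ||p - q||_2^2 from samples xs drawn from p and ys drawn from q."""
    return (
        self_collision_rate(xs)
        + self_collision_rate(ys)
        - 2.0 * cross_collision_rate(xs, ys)
    )
```

A tester built on this would compare the estimate against a threshold between $\delta^2/4$ and $\delta^2$; the number of samples and the threshold needed to meet the stated error probabilities are not derived here.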
Next lecture we show a tester for $L_1$ distance whose query complexity is $\approx n^{2/3}$ (and thus superconstant).
1.3 Why is it harder to test $L_1$ distance?
The following example shows why testing $L_1$ distance requires superconstant query complexity (even for constant $\delta, \epsilon$). Consider the following two cases:
• $p, q$ are two uniform distributions on two equally large (unknown) disjoint subsets of $[n]$. Note that $\|p - q\|_1 = 2$.