Assume the stream has length n.
Pick a random time t to start, so that any time is equally likely.
Let the stream have element a at time t (i.e., X.el = a).
Maintain a count c (X.val = c) of the number of a's in the stream starting from the chosen time t.
Then the estimate of the 2nd moment is n(2c - 1).
Store n once; count a's for each X.

3/2/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets

The 2nd moment is $\sum_a (m_a)^2$, where $m_a$ is the number of times a appears in the stream.
Let $c_t$ be the number of times the stream element at time t appears from that time on. Then

$$
\begin{aligned}
E[\,n\,(2\,X.\mathrm{val} - 1)\,] &= \frac{1}{n} \sum_{\text{all times } t} n\,(2c_t - 1) \\
&= \sum_a \frac{1}{n}\, n\,\bigl(1 + 3 + 5 + \dots + (2m_a - 1)\bigr) \\
&= \sum_a (m_a)^2
\end{aligned}
$$

The second line groups the times by the value seen; the last line uses the fact that the first $m_a$ odd numbers sum to $(m_a)^2$.
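The expectation argument can be checked numerically on a short stream. The following is a minimal sketch, not part of the original slides: the function names `ams_estimate` and `second_moment` and the example stream are assumptions chosen for illustration. It computes n(2c - 1) for every possible start time and averages the results, which is exactly the expectation over a uniformly random start time.

```python
from collections import Counter


def ams_estimate(stream, t):
    """Return n * (2c - 1) for the variable X that starts at position t (0-based)."""
    n = len(stream)
    element = stream[t]                  # X.el
    count = stream[t:].count(element)    # X.val: occurrences from time t onward
    return n * (2 * count - 1)


def second_moment(stream):
    """Exact 2nd moment: sum over distinct elements a of (m_a)^2."""
    return sum(m * m for m in Counter(stream).values())


# Hypothetical example stream: a appears 5 times, b 4 times, c and d 3 times each.
stream = list("abcbdacdabdcaab")
n = len(stream)

# Averaging over all n start times is the expectation for a uniform random t.
expected = sum(ams_estimate(stream, t) for t in range(n)) / n
print(expected, second_moment(stream))
```

A single estimate can be far from the true value (for example, the variable started at the first position gives 15 * (2 * 5 - 1) = 135); only the average over start times matches the 2nd moment.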
In the sum 1 + 3 + 5 + … + (2m_a - 1), the term 1 comes from the time when the last a is seen, the term 3 from the time when the penultimate a is seen, and the term 2m_a - 1 from the time when the first a is seen.

In practice:
Compute n(2c - 1) for as many variables X as you can fit in memory.
Average them in groups.
Take the median of the averages.
A proper balance of group sizes and number of groups assures not only the correct expected value, but also that the expected error goes to 0 as the number of samples gets large.

We assumed there was a number n, the number of positions in the stream.
But real streams go on forever, so n is a variable: the number of inputs seen so far.

1. The variables X have n as a factor: keep n separately; just hold the count in X.
2. Suppose we can only store k counts.
   We must throw some Xs out as time goes on:
   Objective: Each starting time t is selected with probability k/n.
   Solution:
   Choose the first k times for the k variables.
   When the nth element arrives (n > k), choose it with probability k/n.
   If you choose it, throw one of the previously stored variables out, with equal probability.

New Problem: Given a stream, which items appear more than s times in the window?
Possible solution: Think of the stream of baskets as one binary stream per item.
1 = item presen...
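Returning to the second-moment estimate: the replacement rule for k stored variables and the median-of-averages step described earlier could fit together roughly as follows. This is a sketch under stated assumptions, not the slides' own code: the class name `StreamingAMS` and the parameters `k`, `seed`, and `group_size` are hypothetical, and the choice of group size is left to the caller.

```python
import random
from statistics import mean, median


class StreamingAMS:
    """Keeps at most k (element, count) variables over an unbounded stream."""

    def __init__(self, k, seed=None):
        self.k = k
        self.n = 0              # number of inputs seen so far, stored once
        self.variables = []     # each entry is [X.el, X.val]
        self.rng = random.Random(seed)

    def add(self, item):
        self.n += 1
        # Every stored variable tracking this element sees one more occurrence.
        for var in self.variables:
            if var[0] == item:
                var[1] += 1
        if len(self.variables) < self.k:
            # The first k times are always chosen.
            self.variables.append([item, 1])
        elif self.rng.random() < self.k / self.n:
            # Choose time n with probability k/n, evicting a uniformly random variable.
            victim = self.rng.randrange(self.k)
            self.variables[victim] = [item, 1]

    def estimates(self):
        """One estimate n * (2c - 1) per stored variable."""
        return [self.n * (2 * c - 1) for _, c in self.variables]

    def second_moment(self, group_size):
        """Median of the averages of consecutive groups of estimates."""
        est = self.estimates()
        groups = [est[i:i + group_size] for i in range(0, len(est), group_size)]
        return median(mean(g) for g in groups)


# Usage on a hypothetical stream, repeated to make it longer than k.
sketch = StreamingAMS(k=12, seed=0)
for x in "abcbdacdabdcaab" * 20:
    sketch.add(x)
print(sketch.n, sketch.second_moment(group_size=4))
```

Counts are incremented before a new variable is stored, so a freshly chosen start time begins with a count of 1 rather than 2. When k is at least the stream length, every start time is kept and a single group average reproduces the exact 2nd moment.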