16-streams

# Var X: we store X.el and X.val (note: this requires a...)


Jure Leskovec, Stanford C246: Mining Massive Datasets, 3/2/2011

Assume the stream has length $n$.

- Pick a random time $t$ to start, so that any time is equally likely.
- Let the stream have element $a$ at time $t$ (i.e., X.el = $a$).
- Maintain a count $c$ (X.val = $c$) of the number of $a$'s in the stream starting from the chosen time $t$.
- The estimate of the 2nd moment is then $n(2c - 1)$.
- Store $n$ once, and count $a$'s for each X.

Let $m_a$ be the number of times $a$ appears in the stream, so the 2nd moment is $\sum_a (m_a)^2$. Let $c_t$ be the number of times the stream element at time $t$ appears from that time on. Averaging the estimate over all start times, and grouping times by the value seen:

$$E[\,n(2\,\text{X.val} - 1)\,] = \frac{1}{n} \sum_{\text{all times } t} n(2c_t - 1) = \sum_a \frac{1}{n}\, n\, \bigl(1 + 3 + 5 + \dots + (2m_a - 1)\bigr) = \sum_a (m_a)^2$$

In the inner sum, the term $1$ comes from the time when the last $a$ is seen ($c_t = 1$), the term $3$ from the time when the penultimate $a$ is seen ($c_t = 2$), and the term $2m_a - 1$ from the time when the first $a$ is seen ($c_t = m_a$). The last equality holds because the sum of the first $m_a$ odd numbers is $(m_a)^2$.

In practice:

- Compute $n(2c - 1)$ for as many variables X as you can fit in memory.
- Average them in groups.
- Take the median of the averages.

A proper balance of group sizes and number of groups ensures not only the correct expected value, but also that the expected error goes to 0 as the number of samples gets large.

We assumed there was a number $n$, the number of positions in the stream. But real streams go on forever, so $n$ is a variable: the number of inputs seen so far.

1. The variables X have $n$ as a factor, so keep $n$ separately and hold only the count in X.
2. Suppose we can only store $k$ counts. We must throw some Xs out as time goes on:
   - Objective: each starting time $t$ is selected with probability $k/n$.
   - Solution: choose the first $k$ times for the $k$ variables. When the $n$th element arrives ($n > k$), choose it with probability $k/n$. If you choose it, throw out one of the previously stored variables, each with equal probability.

New problem: given a stream, which items appear more than $s$ times in the window?

Possible solution: think of the stream of baskets as one binary stream per item, where 1 = item presen...
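The second-moment estimator described above, with a budget of $k$ stored counts and the median-of-averages combination, can be sketched in Python as follows. This is a minimal illustration, not the course's reference code: the class name `SecondMomentEstimator`, its methods, and the `groups` parameter are hypothetical names introduced here, and the grouping of estimates is one simple choice among many.

```python
import random
from statistics import mean, median


class SecondMomentEstimator:
    """Sketch of the n(2c - 1) estimator using at most k stored counts.

    All names here are illustrative assumptions, not taken from the slides.
    """

    def __init__(self, k, rng=None):
        self.k = k                        # maximum number of variables X
        self.n = 0                        # stream length so far, stored once
        self.vars = []                    # one [el, val] pair per variable X
        self.rng = rng or random.Random()

    def add(self, item):
        """Process the next stream element."""
        self.n += 1
        # Each stored X whose element matches counts this occurrence.
        for x in self.vars:
            if x[0] == item:
                x[1] += 1
        if len(self.vars) < self.k:
            # The first k times are always chosen.
            self.vars.append([item, 1])
        elif self.rng.random() < self.k / self.n:
            # Choose time n with probability k/n and throw out one
            # previously stored variable, picked uniformly at random.
            self.vars[self.rng.randrange(self.k)] = [item, 1]

    def estimates(self):
        """Return n(2c - 1) for every stored variable."""
        return [self.n * (2 * val - 1) for _, val in self.vars]

    def estimate(self, groups=1):
        """Average the estimates in `groups` groups, then take the median."""
        est = self.estimates()
        if not 1 <= groups <= len(est):
            raise ValueError("groups must be between 1 and the number of variables")
        # Strided slices give non-empty groups of nearly equal size.
        averages = [mean(est[i::groups]) for i in range(groups)]
        return median(averages)
```

When `k` is at least the stream length, every start time is kept, so the plain average (`groups=1`) equals the exact second moment. For the 15-element stream a b c b d a c d a b d c a a b, with counts 5, 4, 3 and 3, that value is 25 + 16 + 9 + 9 = 59. With a smaller `k` the result is a random estimate whose quality depends on `k` and on how the groups are balanced.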

## This document was uploaded on 02/26/2014 for the course CS 246 at Stanford.
