CSE5120 Fall 2009
Querying and Mining Data Streams

Data Streams: You Only Get One Look
(adapted from tutorials at VLDB 2002 and PODS 2002)

[Figure: Stream Processing Engine; Synopsis in Memory; (Approximate) Answer]

Motivation
A growing number of applications generate data streams:
• Network monitoring and traffic management
• Call detail records in telecommunications
• Transactions in retail chains
• ATM operations in banks
• Log records generated by Web servers
• Sensor network data

Characteristics
• Massive volumes of data, e.g. several terabytes (10^12 bytes)
• Records arrive at a rapid rate

Goal: Mine patterns, process queries, and compute statistics on data streams in real time.

Stream Projects
• Amazon/Cougar (Cornell): sensors
• Aurora (Brown/MIT): sensor monitoring, dataflow
• Hancock (AT&T): telecom streams
• Niagara (OGI/Wisconsin): internet XML databases
• OpenCQ (Georgia): triggers, view maintenance
• Stream (Stanford): general purpose
• Tapestry (Xerox): content-based filtering
• Telegraph (Berkeley): adaptive engine for sensors
• Tribeca (Bellcore): network monitoring

Data Stream Processing Algorithms
• Generally, algorithms compute approximate answers. It is difficult to compute exact answers with limited memory.
• Bounds on error. Although the answer is approximate, there are bounds on the error. The bound may be probabilistic: e.g. with probability at least 1 - δ, the computed answer is within a fraction ε of the actual answer.

Sampling: A Simple Case
A small random sample S of the data often represents all the data well. For a fast approximate answer to a query, apply a "modified" query to S.

Example:
Data stream R: 9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1 (n = 12)
Sample S: 9, 5, 1, 8
• Select average value from R where R.e is odd
  Answer = (9 + 5 + 1) / 3 = 5
• Select count from R where R.e is odd
  Answer = 12 × 3 / 4 = 9 (3 of the 4 sampled values are odd, scaled up to n = 12)

Frequency Counts over Data Streams
(G.S. Manku, R. Motwani, VLDB 2002)

Examples of data streams: stock tickers, network traffic measurements, web server logs, sensor networks, telecom call records.

A data stream is a stream of tuples, where each tuple is either a single item or a set of items.

Problem: Find the items whose frequency counts exceed a user-specified threshold s.

Problem definition
Let N be the current length of the stream. Some guarantees:
1. All items whose true frequency exceeds sN are output. There are no false negatives.
2. No item whose true frequency is less than (s - ε)N is output.
3. Estimated frequencies are less than the true frequencies by at most εN.

An algorithm maintains an ε-deficient synopsis if its output satisfies the above properties.

E.g. s = 0.5, ε = 0.1:
• No item with frequency less than (0.5 - 0.1)N = 0.4N is output.
• Estimated frequencies are less than the true frequencies by at most 0.1N.

Sticky Sampling Algorithm
Data structure S: S is a set of entries of the form (c, f), where
• c is an element
• f estimates the frequency of c
....
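To make the three guarantees in the problem definition concrete, the following is a minimal sketch of an offline checker. It is not taken from the lecture or the paper: the function name check_epsilon_deficient and the reported dictionary (item mapped to estimated frequency) are assumptions made for illustration. It computes exact counts, so it only works on a stream small enough to store, and it reads guarantee 3 as "the estimate never exceeds the true frequency and falls short of it by at most εN".

```python
from collections import Counter


def check_epsilon_deficient(stream, reported, s, eps):
    """Return a list of violated guarantees (empty if all three hold).

    stream   -- the full list of items seen so far (length N)
    reported -- hypothetical output: dict mapping item -> estimated frequency
    s        -- support threshold, as a fraction of N
    eps      -- error tolerance, as a fraction of N
    """
    n = len(stream)
    true_freq = Counter(stream)  # exact counts, feasible only offline
    violations = []

    # Guarantee 1: every item with true frequency above s*N is output.
    for item, f in true_freq.items():
        if f > s * n and item not in reported:
            violations.append(f"1: {item!r} (freq {f}) missing from output")

    for item, est in reported.items():
        f = true_freq[item]  # Counter returns 0 for unseen items
        # Guarantee 2: no item with true frequency below (s - eps)*N is output.
        if f < (s - eps) * n:
            violations.append(f"2: {item!r} (freq {f}) should not be output")
        # Guarantee 3: estimate undercounts by at most eps*N, never overcounts.
        if est > f or f - est > eps * n:
            violations.append(f"3: {item!r} estimate {est} vs true {f}")

    return violations


# Toy stream with N = 10, using s = 0.5 and eps = 0.1 from the example above:
# s*N = 5, (s - eps)*N = 4, eps*N = 1.
toy_stream = ["a"] * 6 + ["b"] * 3 + ["c"]
print(check_epsilon_deficient(toy_stream, {"a": 5}, s=0.5, eps=0.1))
print(check_epsilon_deficient(toy_stream, {"a": 5, "b": 3}, s=0.5, eps=0.1))
```

On the toy stream, reporting only a with estimate 5 satisfies all three properties, while also reporting b (true frequency 3, below 0.4N = 4) breaks guarantee 2.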