CSE5120 Fall 2009
Querying and Mining Data Streams

Data Streams: You Only Get One Look
(adapted from tutorials, VLDB 2002, PODS 2002)

[Figure: Stream Processing Engine, Synopsis in Memory, (Approximate) Answer]

Motivation: A growing number of applications:
• Network monitoring and traffic management
• Call detail records in telecommunications
• Transactions in retail chains
• ATM operations in banks
• Log records generated by Web servers
• Sensor network data

Characteristics:
• Massive volumes of data, e.g. several terabytes (10^12 bytes)
• Records arrive at a rapid rate

Goal: Mine patterns, process queries, and compute statistics on data streams in real time.

Stream Projects
• Amazon/Cougar (Cornell): sensors
• Aurora (Brown/MIT): sensor monitoring, dataflow
• Hancock (AT&T): telecom streams
• Niagara (OGI/Wisconsin): internet XML databases
• OpenCQ (Georgia): triggers, view maintenance
• Stream (Stanford): general purpose
• Tapestry (Xerox): content-based filtering
• Telegraph (Berkeley): adaptive engine for sensors
• Tribeca (Bellcore): network monitoring

Data Stream Processing Algorithms
• Generally, algorithms compute approximate answers.
  It is difficult to compute answers accurately with limited memory.
• Bounds on error
  Although the answer is approximate, there are bounds on the error. The bound may be probabilistic: e.g. with probability at least 1 - δ, the computed answer is within a fraction ε of the actual answer.

Sampling: A Simple Case
A small random sample S of the data often represents all the data well.
For a fast approximate answer to a query, apply a "modified" query to S.

Example:
Data stream R: 9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1 (n = 12)
Sample S: 9, 5, 1, 8
• Select average value from R where R.e is odd
  The odd values in S are 9, 5, 1, so the answer = (9 + 5 + 1) / 3 = 5
• Select count from R where R.e is odd
  3 of the 4 sampled values are odd; scaling by n / |S| gives answer = 12 × 3 / 4 = 9

Frequency Counts over Data Streams
(G.S. Manku, R. Motwani, VLDB 2002)

Examples of data streams: stock tickers, network traffic measurements, web-server logs, sensor networks, telecom call records.
A data stream is a stream of tuples, where each tuple is either a single item or a set of items.

Problem: Find the items whose frequency counts exceed a user-specified threshold s.

Problem definition
Let N be the current length of the stream.
Some guarantees:
1. All items whose true frequency exceeds sN are output. There are no false negatives.
2. No item whose true frequency is less than (s - ε)N is output.
3. Estimated frequencies are less than the true frequencies by at most εN.

An algorithm maintains an ε-deficient synopsis if its output satisfies the above properties.

E.g. s = 0.5, ε = 0.1:
• No item with frequency less than (0.5 - 0.1)N = 0.4N is output.
• Estimated frequencies are less than the true frequencies by at most 0.1N.

Sticky Sampling Algorithm
Data structure S: S is a set of entries of the form (c, f), where
  c is an element
  f estimates the frequency of c
....
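The three guarantees in the problem definition can be turned into a small check. The sketch below is only an illustration: it compares a synopsis's output against exact counts, which a real stream algorithm could not afford to keep. The function name check_epsilon_deficient and the dictionary format of the reported output are assumptions made here, not part of the notes or of the Manku-Motwani paper.

```python
from collections import Counter

def check_epsilon_deficient(stream, reported, s, eps):
    """Return the list of violated guarantees (empty if all three hold).

    stream   -- the full list of items seen so far (length N)
    reported -- hypothetical output format: dict item -> estimated frequency
    s, eps   -- support threshold and error bound, with eps < s
    """
    n = len(stream)
    true = Counter(stream)  # exact counts, used only for verification
    violations = []

    # Guarantee 1: every item with true frequency > sN must be output.
    for item, f in true.items():
        if f > s * n and item not in reported:
            violations.append(("false negative", item))

    for item, est in reported.items():
        f = true[item]  # Counter returns 0 for items never seen
        # Guarantee 2: no item with true frequency < (s - eps)N is output.
        if f < (s - eps) * n:
            violations.append(("frequency below (s - eps)N", item))
        # Guarantee 3: estimate never exceeds the true frequency and
        # undercounts it by at most eps * N.
        if not (f - eps * n <= est <= f):
            violations.append(("estimate outside [f - eps*N, f]", item))

    return violations

# Example with s = 0.5, eps = 0.1 and N = 10, as in the notes.
example_stream = ["a"] * 6 + ["b"] * 3 + ["c"]
print(check_epsilon_deficient(example_stream, {"a": 6}, 0.5, 0.1))  # prints []
```

In the example, item a occurs 6 times out of N = 10, which exceeds sN = 5, so it must be reported; b and c fall below (s - ε)N = 4 and must not be.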
 Fall '09
 AdaFu