CSE-5120-Fall-2009
Querying and Mining Data Streams

Data Streams: You Only Get One Look
(adapted from tutorials, VLDB 2002, PODS 2002)

[Figure: Stream Processing Engine with a Synopsis in Memory, producing an (Approximate) Answer]

Motivation: A growing number of applications:
• Network monitoring and traffic management
• Call detail records in telecommunications
• Transactions in retail chains
• ATM operations in banks
• Log records generated by Web servers
• Sensor network data

Characteristics:
• Massive volumes of data, e.g. several terabytes (10^12 bytes)
• Records arrive at a rapid rate

Goal: Mine patterns, process queries, and compute statistics on data streams in real time.

Stream Projects
• Amazon/Cougar (Cornell) - sensors
• Aurora (Brown/MIT) - sensor monitoring, dataflow
• Hancock (AT&T) - telecom streams
• Niagara (OGI/Wisconsin) - internet XML databases
• OpenCQ (Georgia) - triggers, view maintenance
• Stream (Stanford) - general purpose
• Tapestry (Xerox) - content-based filtering
• Telegraph (Berkeley) - adaptive engine for sensors
• Tribeca (Bellcore) - network monitoring

Data Stream Processing Algorithms
• Generally, algorithms compute approximate answers
  - It is difficult to compute answers accurately with limited memory.
• Bounds on error
  - Although the answer is approximate, there are bounds on the error.
  - The bound may be probabilistic: e.g. with probability at least 1 - δ, the computed answer is within a fraction ε of the actual answer.

Sampling: A Simple Case
A small random sample S of the data often represents all the data well.
For a fast approximate answer to a query, apply a "modified" query to S.

Example:
Data stream R: 9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1 (n = 12)
Sample S: 9, 5, 1, 8
• Select average value from R where R.e is odd
  Answer = 5 (the average of the odd sample values 9, 5, 1)
• Select count from R where R.e is odd
  Answer = 12 × 3 / 4 = 9 (3 of the 4 sampled values are odd, scaled up to n = 12)

Frequency Counts over Data Streams
(G.S. Manku, R. Motwani, VLDB 2002)

Examples of data streams: stock tickers, network traffic measurements, web-server logs, sensor networks, telecom call records.
A data stream is a stream of tuples, where each tuple is either a single item or a set of items.

Problem: Find the items whose frequency counts exceed a user-specified threshold s.

Problem definition
Let N be the current length of the stream. Some guarantees:
1. All items whose true frequency exceeds sN are output. There are no false negatives.
2. No item whose true frequency is less than (s - ε)N is output.
3. Estimated frequencies are less than the true frequencies by at most εN.
An algorithm maintains an ε-deficient synopsis if its output satisfies the above properties.

E.g. s = 0.5, ε = 0.1:
• No item with frequency less than (0.5 - 0.1)N = 0.4N is output.
• Estimated frequencies are less than the true frequencies by at most 0.1N.

Sticky Sampling Algorithm
Data structure S: S is a set of entries of the form (c, f)
• c is an element
• f estimates the frequency of c
....
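The sampling example earlier in these notes (stream R, sample S, queries over odd values) can be reproduced with a short sketch. The function name `approx_avg_and_count` and its parameters are hypothetical, chosen only for illustration; the sample is taken as given rather than drawn at random.

```python
def approx_avg_and_count(sample, n, predicate):
    """Estimate AVG and COUNT over a stream of length n using only a sample.

    Hypothetical helper for illustration; not part of the original notes.
    """
    matching = [x for x in sample if predicate(x)]
    # AVG is estimated directly from the matching sample values.
    avg = sum(matching) / len(matching) if matching else None
    # COUNT is the "modified" query: scale the sample count by n / |S|.
    count = n * len(matching) / len(sample)
    return avg, count


R = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]  # the full stream (n = 12)
S = [9, 5, 1, 8]                          # the sample given in the notes


def is_odd(x):
    return x % 2 == 1


est_avg, est_count = approx_avg_and_count(S, len(R), is_odd)

# Exact answers, computed over all of R for comparison only.
exact = [x for x in R if is_odd(x)]
exact_avg, exact_count = sum(exact) / len(exact), len(exact)

print(est_avg, est_count)      # estimates from the sample
print(exact_avg, exact_count)  # exact values over the full stream
```

For this particular sample the estimated average happens to match the exact one, while the estimated count (9) overshoots the exact count (8) by one, which is the kind of error an approximate answer is expected to carry.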
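The three guarantees in the problem definition can also be written as an offline check. This is only a sketch for testing a candidate output against exact counts, which a real streaming algorithm would not have available; the function name `check_guarantees` and its argument layout are assumptions made for illustration.

```python
from collections import Counter


def check_guarantees(stream, estimates, output, s, eps):
    """Check the three guarantees against exact counts (offline, for testing).

    estimates: dict mapping item -> estimated frequency
    output:    set of items reported as exceeding the threshold
    Hypothetical helper; names are not from the original notes.
    """
    n = len(stream)
    true = Counter(stream)  # exact frequencies; missing items count as 0
    return {
        # 1. Every item with true frequency above sN is output.
        "no_false_negatives": all(
            item in output for item, f in true.items() if f > s * n
        ),
        # 2. No output item has true frequency below (s - eps)N.
        "no_low_frequency_output": all(
            true[item] >= (s - eps) * n for item in output
        ),
        # 3. Estimates undercount the true frequency by at most eps * N.
        "bounded_undercount": all(
            0 <= true[item] - est <= eps * n for item, est in estimates.items()
        ),
    }


# Toy stream with N = 20: with s = 0.5 and eps = 0.1, sN = 10 and eps*N = 2.
stream = ["a"] * 12 + ["b"] * 6 + ["c"] * 2

good = check_guarantees(stream, {"a": 11}, {"a"}, s=0.5, eps=0.1)
bad = check_guarantees(stream, {"a": 9}, {"a", "b"}, s=0.5, eps=0.1)
print(good)
print(bad)
```

In the second call, item "b" has true frequency 6, below 0.4N = 8, so reporting it violates guarantee 2, and the estimate 9 for "a" undercounts by 3, more than 0.1N = 2, so guarantee 3 fails as well.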