10stream - CSE-5120-Fall-2009 Querying and Mining Data...

CSE-5120-Fall-2009 Querying and Mining Data Streams

Data Streams: You Only Get One Look
(adapted from tutorials, VLDB 2002, PODS 2002)

[Figure labels: Stream Processing Engine; Synopsis in Memory; (Approximate Answer)]

Motivation: A growing number of applications:
• Network monitoring and traffic management
• Call detail records in telecommunications
• Transactions in retail chains
• ATM operations in banks
• Log records generated by Web servers
• Sensor network data

Characteristics
• Massive volumes of data, e.g. several terabytes (10^12 bytes)
• Records arrive at a rapid rate

Goal: Mine patterns, process queries, and compute statistics on data streams in real time.

Stream Projects
• Amazon/Cougar (Cornell) - sensors
• Aurora (Brown/MIT) - sensor monitoring, dataflow
• Hancock (AT&T) - telecom streams
• Niagara (OGI/Wisconsin) - internet XML databases
• OpenCQ (Georgia) - triggers, view maintenance
• Stream (Stanford) - general purpose
• Tapestry (Xerox) - content-based filtering
• Telegraph (Berkeley) - adaptive engine for sensors
• Tribeca (Bellcore) - network monitoring

Data Stream Processing Algorithms
• Generally, algorithms compute approximate answers
  - It is difficult to compute answers accurately with limited memory.
• Bounds on error
  - Although the answer is approximate, there are bounds on the error.
  - The bound may be probabilistic: e.g. with probability at least 1 - δ, the computed answer is within a fraction ε of the actual answer.

Sampling: A Simple Case
A small random sample S of the data often represents all the data well.
For a fast approximate answer to a query, apply a "modified" query to S.

Example:
Data stream R: 9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1 (n = 12)
Sample S: 9, 5, 1, 8
• Select average value from R where R.e is odd
  Answer = 5 (the odd values in S are 9, 5, 1)
• Select count from R where R.e is odd
  Answer = 12 × 3 / 4 = 9 (3 of the 4 sampled values are odd, scaled up to n = 12)

Frequency Counts over Data Streams
(G.S. Manku, R. Motwani, VLDB 2002)

Examples of data streams: stock tickers, network traffic measurements, web-server logs, sensor networks, telecom call records.
A data stream is a stream of tuples, where each tuple is either a single item or a set of items.

Problem: To find the items whose frequency exceeds a user-specified threshold s.

Problem definition
Let N be the current length of the stream. Some guarantees:
1. All items whose true frequency exceeds sN are output. There are no false negatives.
2. No item whose true frequency is less than (s - ε)N is output.
3. Estimated frequencies are less than the true frequencies by at most εN.

An algorithm maintains an ε-deficient synopsis if its output satisfies the above properties.

E.g. for s = 0.5 and ε = 0.1:
• No item with frequency less than (0.5 - 0.1)N = 0.4N is output.
• Estimated frequencies are less than the true frequencies by at most 0.1N.

Sticky Sampling Algorithm
Data structure S: S is a set of entries of the form (c, f)
• c is an element
• f estimates the frequency of c
....
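Two of the ideas in these notes are easy to try out in a few lines of Python: answering a query from a sample by scaling, and checking whether a reported set of frequency counts satisfies the three guarantees of an ε-deficient synopsis. The sketch below is illustrative only. The function names (scaled_count, sample_average, is_epsilon_deficient) are hypothetical and do not come from the notes or from the Manku-Motwani paper, and the checker assumes the whole stream is available offline, which a real stream algorithm would not have. It is not an implementation of Sticky Sampling.

```python
from collections import Counter


def scaled_count(sample, stream_length, predicate):
    """Estimate how many stream items satisfy predicate by scaling the sample count."""
    matches = sum(1 for x in sample if predicate(x))
    return stream_length * matches / len(sample)


def sample_average(sample, predicate):
    """Average of the sampled values that satisfy predicate."""
    chosen = [x for x in sample if predicate(x)]
    return sum(chosen) / len(chosen)


def is_epsilon_deficient(stream, output, s, eps):
    """Check the three guarantees for a reported {item: estimated_frequency} map.

    Assumption: the full stream is kept so true frequencies can be counted;
    this is only for checking an answer, not for producing one.
    """
    n = len(stream)
    true = Counter(stream)
    # 1. No false negatives: every item with true frequency above s*N is reported.
    if any(f > s * n and item not in output for item, f in true.items()):
        return False
    for item, est in output.items():
        f = true[item]
        # 2. Nothing with true frequency below (s - eps)*N is reported.
        if f < (s - eps) * n:
            return False
        # 3. Estimates undercount the true frequency by at most eps*N.
        if not (f - eps * n <= est <= f):
            return False
    return True


def is_odd(x):
    return x % 2 == 1


# The sampling example from the notes.
R = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
S = [9, 5, 1, 8]
print(sample_average(S, is_odd))        # average of 9, 5, 1
print(scaled_count(S, len(R), is_odd))  # 12 * 3 / 4
```

With the data from the sampling example, the two calls reproduce the answers 5 and 9 given in the notes. The true count of odd values in R is 8, so the scaled estimate of 9 is close but not exact, which is the trade-off that sampling accepts.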