Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling

Graham Cormode, Bell Laboratories
S. Muthukrishnan, Rutgers University
Irina Rozenbaum, Rutgers University

Abstract

Emerging data stream management systems approach the challenge of massive data distributions which arrive at high speeds while there is only small storage by summarizing and mining the distributions using samples or sketches. However, data distributions can be viewed in different ways. A data stream of integer values can be viewed either as the forward distribution f(x), i.e., the number of occurrences of x in the stream, or as its inverse, f^{-1}(i), which is the number of items that appear i times. While both such views are equivalent in stored data systems, over data streams that entail approximations, they may be significantly different. In other words, samples and sketches developed for the forward distribution may be ineffective for summarizing or mining the inverse distribution. Yet, many applications such as IP traffic monitoring naturally rely on mining inverse distributions.

We formalize the problems of managing and mining inverse distributions and show provable differences between summarizing the forward distribution vs. the inverse distribution. We present methods for summarizing and mining inverse distributions of data streams: they rely on a novel technique to maintain a dynamic sample over the stream with provable guarantees which can be used for a variety of summarization tasks (building quantiles or equidepth histograms) and mining (anomaly detection: finding heavy hitters, and measuring the number of rare items), all with provable guarantees on quality of approximations and time/space used by our streaming methods.
We also complement our analytical and algorithmic results by presenting an experimental study of the methods over network data streams.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005

1 Introduction

Database systems are evolving to handle high speed data streams where transactions arrive rapidly and have to be processed while storing only a limited amount of information. Many applications generate data streams: IP traffic streams, click streams, financial transactions, text streams at application level, sensor streams. Each of these applications demands systems to manage the vast streams and provide basic analyses or mining capability. For example, in the IP traffic analysis example, there is a great de-...
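The forward and inverse views defined in the abstract can be illustrated with a small exact computation. The sketch below is only an offline illustration of the two definitions on a stored list. It is not the streaming method of the paper, which maintains a sample in small space. The function names and the example stream are hypothetical, chosen for illustration.

```python
from collections import Counter

def forward_distribution(stream):
    """f(x): number of occurrences of each item x in the stream."""
    return Counter(stream)

def inverse_distribution(stream):
    """f^{-1}(i): number of distinct items that occur exactly i times."""
    forward = forward_distribution(stream)
    # Count how many items share each occurrence count.
    return Counter(forward.values())

# Hypothetical example stream of integer values.
stream = [7, 3, 7, 9, 3, 7, 4]
f = forward_distribution(stream)
f_inv = inverse_distribution(stream)

print(dict(f))      # {7: 3, 3: 2, 9: 1, 4: 1}
print(dict(f_inv))  # {3: 1, 2: 1, 1: 2}
```

Here item 7 appears three times, so f(7) = 3, while two distinct items (9 and 4) appear exactly once, so f^{-1}(1) = 2. With the full stream stored, either view can be derived exactly from the other's underlying counts, which matches the abstract's remark that the views are equivalent in stored data systems.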