Real-world Data is Dirty: Data Cleansing and the Merge/Purge Problem



Thus, what the controlled empirical studies have shown indicates that improved accuracy will be exhibited for real-world data with the same sorts of errors and complexity of matching as described in this paper. Finally, the results reported here form the basis of a DataBlade Module available from Informix Software as the DataCleanser DataBlade. The technology is broadly applicable; after all, real-world data is dirty.

7 Acknowledgments

We are grateful to Timothy Clark, Computer Information Consultant for OCAR, for the help provided in obtaining the results in Section 4. Thanks also to Dr. Diana English, OCAR's Chair, for allowing the use of their database in our work.

A OPS5 version of the equational theory

    /* RULEs: same-ssn-and-address and same-name-and-address */
    if ((similar_ssns || similar_names) && very_similar_addrs) {
        merge_tuples(person1, person2);
        continue;
    }

    /* rule_program(number of tuples, first tuple, window size)
       Compare all tuples inside a window.
       If a match is found, call merge_tuples(). */
    void rule_program(int ntuples, int start, int wsize)
    {
        register int i, j;
        register WindowEntry person1, person2;
        boolean similar_ssns, similar_names, similar_addrs;
        boolean similar_city, similar_state, similar_zip;
        boolean very_similar_addrs, very_close_aptm, very_close_stnum, not_close;

        not_close = close_but_not_much(person1->stname, person2->stnam...
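The windowed comparison named in the appendix comment ("compare all tuples inside a window; if a match is found, call merge_tuples()") can be illustrated with a minimal, self-contained sketch. It is not the paper's implementation. The `Person` layout, the helper names `similar`, `rule_matches`, `merge_clusters`, and `window_scan`, and the use of exact string equality in place of the fuzzy matchers are all assumptions made for illustration, and the records are assumed to be already sorted on some key.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical record layout; field names are assumptions, not the paper's. */
typedef struct {
    const char *ssn;
    const char *name;
    const char *addr;
    int cluster;            /* id of the merged group this record belongs to */
} Person;

/* Stand-in similarity test: exact equality. The real matchers are fuzzier. */
static bool similar(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* The rule from the appendix fragment:
   (similar ssns OR similar names) AND very similar addresses. */
static bool rule_matches(const Person *p, const Person *q)
{
    return (similar(p->ssn, q->ssn) || similar(p->name, q->name))
        && similar(p->addr, q->addr);
}

/* Relabel every record in cluster `from` so it joins cluster `to`. */
static void merge_clusters(Person *recs, int n, int from, int to)
{
    for (int k = 0; k < n; k++)
        if (recs[k].cluster == from)
            recs[k].cluster = to;
}

/* Compare each record with the (wsize - 1) records before it and merge
   the clusters of any pair the rule accepts. Returns the merge count. */
int window_scan(Person *recs, int n, int wsize)
{
    int merges = 0;

    for (int i = 0; i < n; i++)
        recs[i].cluster = i;            /* every record starts alone */

    for (int i = 1; i < n; i++) {
        int lo = i - (wsize - 1);
        if (lo < 0)
            lo = 0;
        for (int j = lo; j < i; j++) {
            if (recs[i].cluster == recs[j].cluster)
                continue;               /* already merged transitively */
            if (rule_matches(&recs[j], &recs[i])) {
                merge_clusters(recs, n, recs[i].cluster, recs[j].cluster);
                merges++;
            }
        }
    }
    return merges;
}
```

In this sketch, two records that would match but sit further apart than the window are never compared, so a larger `wsize` trades extra comparisons for fewer missed duplicates.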

This document was uploaded on 02/15/2014.
