An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages

Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos and Constantine D. Spyropoulos

Software and Knowledge Engineering Laboratory
Institute of Informatics and Telecommunications
National Centre for Scientific Research "Demokritos"
153 10 Ag. Paraskevi, Athens, Greece
e-mail: {ionandr, jkoutsi, kostel, [email protected

Abstract

The growing problem of unsolicited bulk e-mail, also known as "spam", has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in "encrypted" form, contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures, investigating at the same time the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments. Finally, the Naive Bayesian filter is compared, in terms of performance, to a filter that uses keyword patterns, and which is part of a widely used e-mail reader.

Keywords

filtering/routing; text categorization; machine learning and IR; evaluation (general); test collections

1. Introduction

In recent years, the increasing popularity and low cost of e-mail have attracted the attention of direct marketers. Using readily available bulk-mailing software and large lists of e-mail addresses, typically harvested from web pages and newsgroup archives, it is now possible to send unsolicited messages blindly to thousands of recipients at essentially no cost.
As a result, it is becoming increasingly common for users to receive large quantities of unsolicited bulk e-mail daily, known as spam, advertising anything from vacations to get-rich schemes. The term Unsolicited Commercial E-mail (UCE) is also used in the literature. We use "spam" with a broader meaning that does not exclude unsolicited bulk e-mail sent for non-commercial purposes (e.g. to communicate a message from a sectarian group).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR 2000 7/00 Athens, Greece © 2000 ACM 1-58113-226-3/00/0007...$5.00

Spam messages are annoying to most users, as they waste their time and clutter their mailboxes. They also cost money to users with dial-up connections, waste bandwidth, and may expose minors to unsuitable content (e.g. when advertising pornographic sites). A 1997 study [3] found that spam