10.1.1.91.8665 - An Experimental Comparison of Naive...

An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages

Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos and Constantine D. Spyropoulos
Software and Knowledge Engineering Laboratory
Institute of Informatics and Telecommunications
National Centre for Scientific Research "Demokritos"
153 10 Ag. Paraskevi, Athens, Greece
e-mail: {ionandr, jkoutsi, kostel, costass}@iit.demokritos.gr

Abstract

The growing problem of unsolicited bulk e-mail, also known as "spam", has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in "encrypted" form, contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures, investigating at the same time the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments. Finally, the Naive Bayesian filter is compared, in terms of performance, to a filter that uses keyword patterns, and which is part of a widely used e-mail reader.

Keywords

filtering/routing; text categorization; machine learning and IR; evaluation (general); test collections

1. Introduction

In recent years, the increasing popularity and low cost of e-mail have attracted the attention of direct marketers. Using readily available bulk-mailing software and large lists of e-mail addresses, typically harvested from web pages and newsgroup archives, it is now possible to send blindly unsolicited messages to thousands of recipients at essentially no cost.
As a result, it is becoming increasingly common for users to receive daily large quantities of unsolicited bulk e-mail, known as spam, advertising anything from vacations to get-rich schemes. The term Unsolicited Commercial E-mail (UCE) is also used in the literature. We use "spam" with a broader meaning, that does not exclude unsolicited bulk e-mail sent for non-commercial purposes (e.g. to communicate a message from a sectarian group).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR 2000 7/00 Athens, Greece © 2000 ACM 1-58113-226-3/00/0007...$5.00

Spam messages are annoying to most users, as they waste their time and clutter their mailboxes. They also cost money to users with dial-up connections, waste bandwidth, and may expose minors to unsuitable content (e.g. when advertising pornographic sites). A 1997 study [3] found that spam