Though SVMs are very powerful and commonly used in classification, they suffer from several drawbacks. Training them is computationally expensive. They are also sensitive to noisy data and hence prone to overfitting.

Fig. 4. Support Vector Machines.

4. Quantitative evaluation

4.1 Phishing dataset

The phishing dataset consists of 6561 raw emails. The total number of phishing emails in the dataset is 1409. These emails were donated by Nazario (2007); they cover many of the new trends in phishing and were collected between August 7, 2006 and August 7, 2007. The total number of legitimate emails is 5152. These emails are a combination of financial-related emails and other regular communication emails. The financial-related emails were received from financial institutions such as Bank of America, eBay, PayPal, American Express, Chase, Amazon, AT&T, and many others. As shown in Table 1, these financial-related emails make up 3% of the complete dataset. The other part of the legitimate set was collected from the authors' mailboxes. These emails represent regular communications, emails about conferences and academic events, and emails from several mailing lists.

Table 1. Corpus description.

4.1.1 Data standardization, cleansing, and transformation

The analysis of emails consists of two steps. First, textual analysis, where text mining is performed on all emails. To get consistent results from the analysis, the studied data must be standardized. Therefore, we convert all emails into XML documents after stripping all HTML tags and email header information. Figure 5 shows an example of a phishing email after the conversion. Text mining is performed using the text-miner software kit (TMSK) provided by Weiss et al. (2004). Second, structural analysis. In this step
we analyze the structure of emails. Specifically, we analyze links, images, forms, JavaScript code, and other components in the emails.

Fig. 5. Phishing email after conversion to XML.

Afterwards, each email is converted into a vector x = (x1, x2, ..., xp), where x1, ..., xp are the values corresponding to the specific features we are interested in studying (Salton & McGill, 1983). Our dataset consists of 70 continuous and binary features (variables) and one binary response variable, which indicates whether an email is phishing (1) or legitimate (0). The first 60 features represent the frequencies of the most frequent terms that appear in phishing emails. Choosing words (terms) as features is widely applied in the text mining literature and is referred to as “bag-of-words”. Table 2 lists both the textual and structural features used in the dataset. As shown in Figure 6, we start by stripping all attachments from emails in order to facilitate their analysis. The following subsections describe the textual and structural analysis in further detail.
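
To make the conversion of an email into a feature vector more concrete, the following is a minimal Python sketch. It is not the TMSK-based pipeline used in this chapter. The three-word vocabulary VOCAB is hypothetical and stands in for the 60 most frequent phishing terms. Counting the HTML tags a, img, form, and script is only an assumed proxy for the structural features listed in Table 2. Raw term counts are used as the frequency values, which is also an assumption.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical vocabulary (assumption); the chapter uses the 60 most
# frequent terms found in phishing emails.
VOCAB = ["account", "bank", "verify"]

# Assumed proxies for structural features: links, images, forms, scripts.
TAGS = ["a", "img", "form", "script"]


class EmailBodyParser(HTMLParser):
    """Collects visible text and counts opening tags in an HTML email body."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.tag_counts = Counter()
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script source is counted structurally but kept out of the text.
        if not self._in_script:
            self.text_parts.append(data)


def email_to_vector(html_body):
    """Return [term counts for VOCAB] + [tag counts for TAGS]."""
    parser = EmailBodyParser()
    parser.feed(html_body)
    parser.close()

    # Bag-of-words: lowercase the text, split into alphabetic tokens, count.
    words = re.findall(r"[a-z]+", " ".join(parser.text_parts).lower())
    term_counts = Counter(words)

    textual = [term_counts[term] for term in VOCAB]
    structural = [parser.tag_counts[tag] for tag in TAGS]
    return textual + structural
```

Calling email_to_vector on an HTML body returns a list with one entry per vocabulary term followed by one entry per counted tag. Appending the response label (1 for phishing, 0 for legitimate) to each such list would give one row of the kind of dataset described above.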