FinalProject - 2007 6 9 [DATA MINING & MACHINE...


2007 6 9
[DATA MINING & MACHINE LEARNING FINAL PROJECT]
Group 2: R95922027, R95922034, R95922081, R95942129

Contents

Experiment setting
Feature extraction
    Date of mail
    Number of receivers
    Mail with attachment
    Mail with image
    Mail with URL
    Mail title
    Mail body
Model training
    Naive Bayes
    Knn
    Maximum Entropy
    SVM
Hybrid Model
    Vote
    Neural network
Conclusion
Reference

Experiment setting

We selected a corpus from the internet called enron-spam-preprocessed, which is derived from the Enron corpus. It contains six folders (enron1 ~ enron6), with 13496 spam mails and 15045 ham mails. It is a preprocessed mail corpus: the HTML tags have been removed and the important headers factored out.

Feature Extraction

Feature 1 : Date of the mail

Figure 1 shows the distribution of the sending time of the mails over one day. Spam mails are nearly uniformly distributed over the twenty-four hours. Ham mails are concentrated in the daytime, from 7 am to 6 pm. This is reasonable because most people work in the daytime.

Figure 1

Feature 2 : Number of receivers

Figure 2 shows the number of receivers of each mail. Most spam mails have only 1 receiver, and no spam mail has more than 20 receivers. However, some ham mails have many receivers, because we sometimes send information to a group of people, such as coworkers in a company or classmates at school. The maximum number of receivers in the training data is 206.

Figure 2

Given a mail, we can check its date and its number of receivers and assign it a probability of being ham. We assign the probability of being ham as

$$P[\mathrm{ham} \mid \mathrm{date} = h] = \frac{P[h \mid \mathrm{ham}]}{P[h \mid \mathrm{spam}] + P[h \mid \mathrm{ham}]}$$

$$P[\mathrm{ham} \mid \#\ \mathrm{of\ receivers} = r] = \frac{P[r \mid \mathrm{ham}]}{P[r \mid \mathrm{spam}] + P[r \mid \mathrm{ham}]}$$

Figure 3

        Attachment                  Image                       URL
        (with / without : ratio)    (with / without : ratio)    (with / without : ratio)
Spam    5 / 13491 : 0.0307%         92 / 13404 : 0.6816%        4154 / 9342 : 30.779%
Ham     1109 / 13936 : 7.3712%      0 / 15045 : 0%              1061 / 13984 : 7.0521%

Table 1

Feature 3 : Mail with Attachment

Ham : Spam = 229 : 1

Comparing column 1 of Table 1, we see that when an email has an attachment, the probability that it is a spam mail is extremely low.

Feature 4 : Mail with Image (img src = )

Ham : Spam = 0 : 1
P[spam | mail with image] = 0.999

In column 2 of Table 1, no ham mail contains an image, so the counts imply that a mail with an image is 100% spam. To avoid the zero-probability problem, we set the conditional probability to 0.999.

Feature 5 : Mail with URL (http://)

Spam : Ham = 4 : 1
P[spam | mail with URL] = 0.8

Given a mail with a URL, the probability that it is a spam mail is still the higher one.

Feature 6 : Mail Title

Previous research has mentioned that non-alphanumeric characters, Arabic numerals, and punctuation marks in mail titles, or even the absence of a title, can be viewed as discriminating features between spam and ham.

Some papers say that spam mails sometimes have no title. That is true, but people sometimes forget to write titles, too. In our mail corpus, mails without titles account for 6% of hams and 7% of spams, respectively.

Arabic numerals are not a powerful feature for discriminating the two classes in our mail corpus. Many commercial hams contain IDs (receiver's ID, product ID) or serial numbers (part #). Dates are another common numeral in mail titles, and they are equally likely to appear in the titles of spam and ham. We did a further analysis by counting the numerals larger and smaller than 31. For both types of numerals, the two mail types are split almost fifty-fifty.

In the experiment, punctuation marks and non-alphanumeric characters can be classified into three types: spam-bias, ham-bias, and non-bias. For example, "!" and "?" are widely used in spam, since spam often prefers a tone of surprise. In our computations, "~ ^ | * % ! ? =" are spam-bias punctuation marks, and "\ / ; &" are ham-bias.
The other marks, such as ",", are non-bias since they are common punctuation marks in ordinary writing.

Marks              Probability of being spam mail    Feature showing rate
~ ^ | * % ! ? =    0.911                             28% in spams
\ / ; &            0.182                             16% in hams

Table 2

Feature 7 : Mail body

Once we have the mail documents, there is an important issue concerning word morphology, the field within linguistics that studies the internal structure of words (e.g., the verb "do" can appear as "did" or "doing"). To reduce the data sparseness problem caused by the morphology of English words, we apply word stemming in preprocessing. There are many useful tools on the web; here we use a good NLP tool called Treetagger, developed by the Institute for Computational Linguistics of the University of Stuttgart. Treetagger provides two main functions: word stemming and part-of-speech tagging. The input is a sequence of words, and it returns the corresponding prototype and part of speech of each word. For example:

Word          POS    Prototype
the           DT     the
Treetagger    NP     Treetagger
is            VBZ    be

Table 3

Model training

Our experiments are based on the enron1 mail corpus, which includes 1406 spam mails and 3671 ham mails.

Naive Bayes method:

Given a bag of words, Naive Bayes is a common technique in NLP for document classification, so we first use this method to solve the spam filtering problem. Given a document X = (x1, x2, x3, ..., xn), we need to calculate the posterior probability, expand it by Bayes' theorem, apply the independence assumption, and ignore the evidence P(X):

$$
\begin{aligned}
C &= \arg\max_{C_i} P(C_i \mid X), \quad C_i \in \{\mathrm{ham}, \mathrm{spam}\} \\
  &= \arg\max_{C_i} \frac{P(X \mid C_i)\, P(C_i)}{P(X)} \\
  &= \arg\max_{C_i} \prod_j P(x_j \mid C_i)\, P(C_i) \\
  &= \arg\max_{C_i} \sum_j \log P(x_j \mid C_i) + \log P(C_i)
\end{aligned}
$$

So our task is to calculate the likelihood P(xj | Ci) by simply counting:

$$\log P(x_j \mid C_i) = \log \frac{c(x_j, C_i)}{c(C_i)} = \log c(x_j, C_i) - \log c(C_i)$$

The logarithm transforms the product into a summation and avoids the underflow caused by successive floating-point multiplications.
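The counting scheme above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the project: the function names are hypothetical, c(Ci) is assumed to be the total word count of class Ci, the prior P(Ci) is assumed to be the fraction of training mails in that class, and add-one smoothing is added (an assumption not described in the report) so that unseen words do not produce log 0.

```python
import math
from collections import Counter


def train_naive_bayes(docs):
    """docs: list of (tokens, label) pairs. Returns the counts needed for scoring."""
    word_counts = {}        # label -> Counter of word occurrences, c(x_j, C_i)
    class_docs = Counter()  # label -> number of training mails, used for P(C_i)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return word_counts, class_docs, vocab


def classify(tokens, word_counts, class_docs, vocab):
    """Return the label maximizing sum_j log P(x_j | C_i) + log P(C_i)."""
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        total_words = sum(counts.values())  # assumed meaning of c(C_i)
        score = math.log(class_docs[label] / total_docs)
        for tok in tokens:
            # Add-one smoothing is an assumption made for this sketch.
            score += math.log((counts[tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working in log space keeps the score a sum of moderate negative numbers, even for long mails where the raw product of probabilities would underflow.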
Figure 4

Vector space method with k-nearest neighbor:

Another popular approach to document processing is to transform a document into a high-dimensional super vector, so we use this idea to solve the problem. First we concatenate all the mails in our corpus into one big file and then use a well-known language modeling tool, SRILM, which provides many useful functions such as language model building, word counting, and the Viterbi algorithm. Here we use the "ngram-count" function to create the dictionary we need from the big file:

    ngram-count -text big.file -write dict -order 1

Using the dictionary, we process each document and create a word-document matrix whose rows are the words w_1, ..., w_M and whose columns are the documents d_1, ..., d_N:

$$W = [w_{ij}]_{M \times N}, \qquad w_{ij} = \frac{c_{ij}}{n_j}$$

where w_ij is the normalized word count. After the matrix is prepared, we use k-nearest neighbor with the cosine similarity function to solve our problem:

$$\mathrm{similarity}(d_i, d_j) = \frac{d_i^T d_j}{\|d_i\| \, \|d_j\|}$$

In document processing, the cosine distance function is more reasonable than the common Euclidean distance function.

Figure 5

Maximum Entropy method:

Maximum Entropy is a state-of-the-art extension of logistic regression in the machine learning and natural language processing areas. It has a dual property: it not only maximizes the log likelihood but also maximizes the entropy, and it minimizes the Kullback-Leibler distance between the model and the real distribution. Similar to the Naive Bayes method, it needs the independence assumption.

$$
\begin{aligned}
C &= \arg\max_{C_i} P(C_i \mid X) = \arg\max_{C_i} \prod_j P(C_i \mid x_j) \\
  &= \arg\max_{C_i} \sum_j \log P(C_i \mid x_j) \\
  &= \arg\max_{C_i} \sum_j \log \frac{e^{\sum_k \lambda_k f_k(x_j, C_i)}}{\sum_{C_{i'}} e^{\sum_k \lambda_k f_k(x_j, C_{i'})}}
\end{aligned}
$$

We adopted this model to solve our problem, using a good tool, Maxent, developed by Zhang Le at the University of Edinburgh. Because the ME model only accepts nominal attributes for the feature function f(xj, Ci), we need to convert the elements of the word-document matrix to the binary values {0, 1}. In model training, we set the iteration parameter for convex optimization to 30.
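Returning to the vector space method, the following is a minimal sketch of k-nearest neighbor with cosine similarity over normalized word counts. It is not the project's implementation: the function names are hypothetical, n_j is assumed to be the total number of words in document j, and documents are stored as sparse dictionaries rather than as columns of a full matrix.

```python
import math
from collections import Counter


def to_vector(tokens):
    """Sparse vector of normalized word counts, w_ij = c_ij / n_j (n_j assumed = document length)."""
    counts = Counter(tokens)
    n = sum(counts.values())
    return {word: c / n for word, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(x * v.get(word, 0.0) for word, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(query, training, k=3):
    """training: list of (vector, label). Majority label among the k most similar documents."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because cosine similarity divides by both vector lengths, a long mail and a short mail with the same word proportions are treated as identical, which is one reason it tends to suit documents better than Euclidean distance.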
Figure 6

SVM:

We found a tool called SVMlight, which handles SVM model training and classification. We fed it our extracted files (in sparse format) for cross-validation, and used two ways to represent our mail-body features:

1. Binary : using just 0 or 1 to represent whether a word appears or not.
2. Normalized : counting the occurrences of each word, divided by the maximum occurrence count.

Figure 7 shows the result of the SVM model using only mail-body features. Comparing the blue line and the red line, we can clearly see that the binary format, with accuracy around 97%, outperforms the normalized format, with accuracy about 92%.

Figure 7

Figure 8 shows the SVM model with mail-header features added. Comparing the blue line with the green line, we found that adding header information indeed helps us judge correctly. The red and purple lines are the results of the normalized method, which still performs worse than the binary method. We think the binary method performs better because a vector representing the occurrence of words in a file can, in a sense, represent the "semantics of a document". Since the binary method is better, we use it in our later work.

Figure 8

Hybrid Model

Committee-based approach

From the above experiments, we now have 3 classifiers to filter spam mail. Instead of using a single classifier independently, we can build a committee-based classifier:

    Mail (bag of words) -> committee (Naive Bayes, k-nearest neighbor, Maximum Entropy) -> decision maker

We propose two kinds of decision makers: vote and a single-layer neural network.

Single layer neural network:

We can use a linear combination of the classifier outputs c_i as the decision maker:

$$C = \sigma\Big(\sum_i w_i c_i + \theta\Big)$$

where the input layer consists of the outputs of Naive Bayes, kNN, and Maximum Entropy, the output layer is a single unit, and the activation is the sigmoid function. But the question is: how do we decide the weights w_i and the bias θ efficiently? Here we use the popular neural network learning method, the backpropagation algorithm, to learn the weights iteratively.
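A single-layer decision maker of this kind can be sketched as follows. This is an illustrative sketch under several assumptions rather than the project's code: the function names are hypothetical, the classifier outputs are taken to be 0 (ham) or 1 (spam), the activation is the standard sigmoid 1 / (1 + e^(-x)), the update uses the gradient of the cross-entropy loss (the report does not state its loss function), and the learning rate, momentum, and epoch count are arbitrary illustrative values.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_decision_maker(samples, epochs=2000, lr=0.1, momentum=0.9, seed=0):
    """samples: list of (votes, target), where votes are 0/1 classifier outputs.

    Learns the weights and bias of one sigmoid unit by stochastic gradient
    descent with sample shuffling and momentum.
    """
    rng = random.Random(seed)
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    vel_w, vel_b = [0.0] * n, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a new order each epoch
        for votes, target in data:
            out = sigmoid(sum(w * c for w, c in zip(weights, votes)) + bias)
            err = out - target  # cross-entropy gradient w.r.t. the pre-activation (assumed loss)
            for i in range(n):
                vel_w[i] = momentum * vel_w[i] - lr * err * votes[i]
                weights[i] += vel_w[i]
            vel_b = momentum * vel_b - lr * err
            bias += vel_b
    return weights, bias


def decide(votes, weights, bias):
    """Return 1 (spam) if the unit's output is at least 0.5, else 0 (ham)."""
    activation = sum(w * c for w, c in zip(weights, votes)) + bias
    return 1 if sigmoid(activation) >= 0.5 else 0
```

Unlike a fixed majority vote, the learned weights let the decision maker trust one classifier more than another when the training data supports it.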
Because ordinary backpropagation is a gradient descent method and is quite slow when the initial weights are bad, we use several acceleration techniques such as sample shuffling and momentum. The sigmoid function we use is

$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$

Compared to the voting method, backpropagation is more error-driven and machine-intelligent.

Vote:

Voting is an ad hoc way to judge whether a mail is spam or not. If more classifiers consider the test document to be spam, then we are more confident about the decision.

Vote 1 : kNN + Naive Bayes + Maximum Entropy

We first tried using kNN, Naive Bayes, and Maximum Entropy to build our vote model. The result is shown in Figure 9. In this case, we can see that voting does improve the performance slightly.

Figure 9

Vote 2 : Naive Bayes + Maximum Entropy + SVM

Secondly, we combined Naive Bayes, Maximum Entropy, and SVM. Originally we expected the accuracy of "vote" to always be the highest. However, Figure 10 shows that in the 10-fold cross-validation "vote" is the best in three folds and the second best in the other seven. If two of the models predict correctly for most instances, the accuracy of "vote" will increase. Conversely, if two of the models predict incorrectly for most instances, the accuracy of "vote" will decrease. We think this is why "vote" is only the second best in some cases.

Figure 10

Conclusion

In this work, we implemented several well-known machine learning techniques, including Naive Bayes, Maximum Entropy, SVM, kNN (Vector Space), and neural network, to simulate a spam filter. Some statistical computations were also done for the feature selection part. Besides the word counts of the mail content, six other features have been shown to be useful in discriminating spam and ham mails. In our spam filters, all five techniques gave impressive performance with the features we selected.
Cross-validation was performed on the Bayesian model, the Vector Space model, and the SVM model, and the results give us confidence in our experiments. The hybrid model, or voting model, averages the classification results and improves the ability of the filter a little. However, voting sometimes reduces the accuracy when the majority is wrong. Besides, we also tried some other approaches such as Latent Dirichlet Allocation, but it gave discouraging results in indicating the labels of ham or spam.

Spam filtering is an ongoing issue. Through our analysis, we have gained a deeper understanding of it and have also become familiar with the machine learning techniques we implemented.

Reference

[1] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian Approach to Filtering Junk E-Mail," in Proc. AAAI 1998, Jul. 1998.
[2] A plan for spam: http://www.paulgraham.com/spam.html
[3] Enron Corpus: http://www.aueb.gr/users/ion/
[4] Treetagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
[5] Maximum Entropy: http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html
[6] SRILM: http://www.speech.sri.com/projects/srilm/
[7] SVM: http://svmlight.joachims.org/