how these network features can be exploited to considerably improve performance on this problem.

In order to combine textual and social network features, we used a classification-based scheme. The idea is to perform the leak prediction in two steps. In the first step, we calculate the textual similarity scores using a cross-validation procedure on the training set. In the second step, we extract the network features and then learn a function that combines them with the textual scores.

The textual scores are calculated in the following way. We split the training set (received + sent_train collections) into 10 parts. Using a 10-fold cross-validation procedure, we compute the Knn-30 scores for each 10% of the messages, using the remaining 90% of the data as training data. At the end of this process, each training-set example has, associated with it, a list of email addresses (from the top 30 messages selected by Knn-30) and their predicted scores. We now have an “outlier score” associated with each message recipient in the training set. These scores are used as features in the second step of the classification procedure.

In addition to the textual scores, we used three different sets of social network features. The first set is based on the relative frequency of a recipient’s email address in the training set. For each recipient we extracted the normalized sent frequency (i.e., the number of messages sent to this recipient divided by the total number of messages sent by this particular Enron user) and the normalized received frequency (i.e., the number of messages received from this recipient divided by the total number of messages received by this particular Enron user). In addition, we used two binary features to indicate whether no messages were sent to a particular user, and whether no messages were received from a particular user.
We refer to these features as Frequency features.

The second set of social network information is based on the co-occurrence of recipients in other messages in the training set. The intuition behind this feature is that we expect leak-recipients to co-occur less frequently with the other recipients. Given a message with three recipients $a_1$, $a_2$ and $a_3$, let the frequency of co-occurrence between recipients $a_1$ and $a_2$ be $F(a_1, a_2)$ (i.e., the number of messages in the training set that had $a_1$ as well as $a_2$ as recipients). Then the relative co-occurrence frequency of users $a_1$, $a_2$ and $a_3$ will be proportional to, respectively, $F(a_1, a_2) + F(a_1, a_3)$, $F(a_2, a_3) + F(a_2, a_1)$ and $F(a_3, a_1) + F(a_3, a_2)$; i.e., the relative co-occurrence frequency of each recipient $a_i$ is proportional to $\sum_{j \neq i} F(a_i, a_j)$. These values are then divided by their sum, so that they add up to one. When a message has only two recipients, the value of this feature is 0.5 for each. No features are extracted if the message has only one recipient. We refer to this feature as Coocurr features.
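As a rough illustration of the Frequency and Coocurr features described above, the following Python sketch computes both from simple counts. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names, the dictionary keys, and the data layout (one list of recipient addresses per training message) are all hypothetical, and the uniform fallback used when no pair of recipients ever co-occurred in the training set is an assumption the text does not address.

```python
from collections import Counter
from itertools import combinations


def frequency_features(address, sent_to, n_sent, received_from, n_received):
    """Frequency features for one address (hypothetical layout).

    sent_to / received_from: Counter mapping address -> message count.
    n_sent / n_received: total messages sent / received by the user.
    """
    to_count = sent_to[address]
    from_count = received_from[address]
    return {
        "sent_freq": to_count / n_sent if n_sent else 0.0,
        "received_freq": from_count / n_received if n_received else 0.0,
        "never_sent": int(to_count == 0),
        "never_received": int(from_count == 0),
    }


def pair_counts(training_recipient_lists):
    """F(a, b): number of training messages addressed to both a and b."""
    counts = Counter()
    for recipients in training_recipient_lists:
        for a, b in combinations(sorted(set(recipients)), 2):
            counts[(a, b)] += 1
    return counts


def cooccurrence_features(recipients, counts):
    """Relative co-occurrence frequency of each recipient of one message."""
    recipients = list(dict.fromkeys(recipients))  # drop duplicates, keep order
    if len(recipients) < 2:
        return {}  # no feature for single-recipient messages

    def F(a, b):
        return counts[tuple(sorted((a, b)))]

    # Unnormalized value for a_i: sum over j != i of F(a_i, a_j).
    raw = {a: sum(F(a, b) for b in recipients if b != a) for a in recipients}
    total = sum(raw.values())
    if total == 0:
        # Assumption (not in the text): fall back to a uniform value.
        return {a: 1.0 / len(recipients) for a in recipients}
    return {a: value / total for a, value in raw.items()}


# Toy data: F(a1,a2) = 2, F(a1,a3) = 1, F(a2,a3) = 1.
training = [["a1", "a2"], ["a1", "a2"], ["a1", "a3"], ["a2", "a3"]]
counts = pair_counts(training)
cooc = cooccurrence_features(["a1", "a2", "a3"], counts)
freq = frequency_features("a1", Counter({"a1": 3}), 10, Counter(), 0)
```

In the toy data, `a3` co-occurs with the other two recipients less often than they do with each other, so it receives the smallest share (2/8, against 3/8 for `a1` and `a2`), which matches the intuition that a leak-recipient should stand out with a low value.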