how these network features can be exploited to considerably improve performance on this problem. In order to combine textual and social network features, we used a classification-based scheme. The idea is to perform the leak prediction in two steps. In the first step, we calculate the textual similarity scores using a cross-validation procedure on the training set. In the second step, we extract the network features and then learn a function that combines them with the textual scores.

The textual scores are calculated in the following way. We split the training set (received + sent train collections) into 10 parts. Using a 10-fold cross-validation procedure, we compute the Knn-30 scores on 10% of the messages, using the remaining 90% of the data as training data. At the end of this process, each training example has, associated with it, a list of email addresses (from the top 30 messages selected by Knn-30) and their predicted scores. We now have an "outlier score" associated with each message recipient in the training set. These scores are used as features in the second step of the classification procedure.

In addition to the textual scores, we used three different sets of social network features. The first set is based on the relative frequency of a recipient's email address in the training set. For each recipient we extracted the normalized sent frequency (i.e., the number of messages sent to this recipient divided by the total number of messages sent by this particular Enron user) and the normalized received frequency (i.e., the number of messages received from this recipient divided by the total number of messages received by this particular Enron user). In addition, we used two binary features to indicate whether no messages were sent to a particular user, and whether no messages were received from a particular user. We refer to these features as Frequency features.
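The Frequency features described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `frequency_features`, the input layout (one recipient list per sent message, one sender address per received message), and the choice to return 0.0 when the user has no sent or received messages are all assumptions made for the example.

```python
def frequency_features(sent_recipients, received_senders, address):
    """Frequency features of `address` for one user's training mailbox.

    sent_recipients:  list of recipient lists, one per message the user sent
                      (hypothetical input layout).
    received_senders: list of sender addresses, one per message the user
                      received (hypothetical input layout).
    """
    n_sent = len(sent_recipients)
    n_received = len(received_senders)

    # Number of sent messages that include `address` among the recipients.
    sent_to = sum(1 for recips in sent_recipients if address in recips)
    # Number of received messages whose sender is `address`.
    received_from = sum(1 for sender in received_senders if sender == address)

    return {
        # Normalized sent frequency (assumed 0.0 if the user sent nothing).
        "sent_freq": sent_to / n_sent if n_sent else 0.0,
        # Normalized received frequency (assumed 0.0 if nothing was received).
        "received_freq": received_from / n_received if n_received else 0.0,
        # Binary indicators: 1 if no message was ever sent to / received from.
        "never_sent": int(sent_to == 0),
        "never_received": int(received_from == 0),
    }


sent = [["a", "b"], ["a"], ["c"], ["a", "c"]]
received = ["a", "b", "b", "d"]
features_a = frequency_features(sent, received, "a")
features_d = frequency_features(sent, received, "d")
```

In this toy mailbox, address `a` appears in three of the four sent messages and authored one of the four received messages, while `d` was never written to, so only its "never sent" indicator is set.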
The second set of social network information is based on the co-occurrence of recipients in other messages in the training set. The intuition behind this feature is that we expect leak-recipients to co-occur less frequently with the other recipients. Given a message with three recipients $a_1$, $a_2$ and $a_3$, let the frequency of co-occurrence between recipients $a_1$ and $a_2$ be $F(a_1, a_2)$ (i.e., the number of messages in the training set that had $a_1$ as well as $a_2$ as recipients). Then the relative co-occurrence frequencies of users $a_1$, $a_2$ and $a_3$ will be proportional to, respectively, $F(a_1, a_2) + F(a_1, a_3)$, $F(a_2, a_3) + F(a_2, a_1)$ and $F(a_3, a_1) + F(a_3, a_2)$; i.e., the relative co-occurrence frequency of each recipient $a_i$ is proportional to $\sum_{j \neq i} F(a_i, a_j)$. These values are then divided by their sum, so that they add up to one. With only two recipients, the value of this feature is 0.5 for each, since $F$ is symmetric. No features are extracted if the message has only one recipient. We refer to this feature as Coocurr features.
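The co-occurrence feature can be illustrated with a short sketch. The function names (`cooccurrence_counts`, `pair_count`, `cooccurrence_features`) and the input layout are hypothetical. The fallback to a uniform value when none of the recipients ever co-occurred in training is an assumption of this example; the text does not specify that case.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(training_recipient_lists):
    """Count, for each unordered pair, the training messages listing both."""
    counts = Counter()
    for recips in training_recipient_lists:
        # Sort so that each unordered pair is stored under a single key.
        for a, b in combinations(sorted(set(recips)), 2):
            counts[(a, b)] += 1
    return counts


def pair_count(counts, a, b):
    """F(a, b): symmetric lookup; a Counter returns 0 for unseen pairs."""
    return counts[tuple(sorted((a, b)))]


def cooccurrence_features(counts, recipients):
    """Relative co-occurrence frequency of each recipient of one message."""
    recips = list(dict.fromkeys(recipients))  # drop duplicates, keep order
    if len(recips) < 2:
        return {}  # no feature is extracted for a single recipient

    # Unnormalized value for a_i: sum over j != i of F(a_i, a_j).
    raw = {
        a: sum(pair_count(counts, a, b) for b in recips if b != a)
        for a in recips
    }
    total = sum(raw.values())
    if total == 0:
        # Assumption of this sketch: fall back to a uniform value.
        return {a: 1.0 / len(recips) for a in recips}
    # Divide by the sum so that the values add up to one.
    return {a: value / total for a, value in raw.items()}


training = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "x"]]
counts = cooccurrence_counts(training)
features = cooccurrence_features(counts, ["a", "b", "x"])
```

Here `x` co-occurred with the other recipients only once in training, so it receives the smallest share of the normalized score, which matches the intuition that a leak-recipient co-occurs less often with the rest of the recipient list.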
