on the basis of a single preceding item (unigram), two preceding items (bigram) or more items (trigram, four-gram etc.). In our case the items are the parts of speech of the words used in the sentences (Koppel 2002).
Example: PNN, VB, ART, NN

5. Character n-grams
These are similar to other n-grams, such as word and part-of-speech n-grams. Here the items are the letters or characters used in the words of a sentence (Järvelin et al. 2007).
Example: ‘a’, ‘b’, ‘o’, ‘v’, ‘e’

6. Function words + part of speech n-grams
Function words and part of speech n-grams combined and used as a single feature for classification.
Example: This, VB, in, NN, ART

7. All words
All the words present in the tweets, including the stop words.
Example: This, is, a, dog

8. Content words + function words + part of speech n-grams
A combination of the most informative content words, the function words and the part of speech n-grams as features.
Example: School, this, VB, in, Beer, ART

6, 7, 8 are combinations of 1, 2, 3, 4

Gender classification of microblog text based on authorial style
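As a rough illustration of the n-gram features listed in the table above, the sketch below builds part-of-speech n-grams and character n-grams from a sequence of items. It is not the authors' implementation: the helper names `ngrams` and `char_ngram_counts` are hypothetical, the tag sequence is simply copied from the table's example, and the word "above" is assumed from the table's character example.

```python
from collections import Counter


def ngrams(items, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def char_ngram_counts(text, n):
    """Count the character n-grams of a string."""
    return Counter("".join(gram) for gram in ngrams(list(text), n))


# Hypothetical inputs: the tag sequence and the word are taken from the
# examples in the table; a real pipeline would obtain tags from a POS tagger.
pos_tags = ["PNN", "VB", "ART", "NN"]
pos_bigrams = ngrams(pos_tags, 2)
char_bigrams = char_ngram_counts("above", 2)

print(pos_bigrams)
print(char_bigrams)
```

The same `ngrams` helper works for words, tags or characters, which is why the combined features (rows 6 to 8) can be formed by concatenating the individual feature sets.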
validation entails dividing the dataset into ten equal random folds, of which nine are used for training and one for testing or validation. The whole process is repeated ten times, with each fold being used for validation exactly once. This gives an estimate of how well the model generalizes to an independent dataset and helps detect over-fitting (Kohavi 1995).

Usable features from tweets were extracted and selected from the training set. The features were then tested for accuracy and F-measure on the test set. We started with a small number of tweets and progressively increased the number to observe its effect on the classification accuracy and the F-measure. One must bear in mind that, for a small dataset, manually cleaning and labeling tweets is standard practice in supervised learning. We have emphasized extracting features that have not been used in extant literature. We now explain the feature extraction and feature selection methods used in our work.

3.1 Feature extraction

Feature extraction is a method used to reduce the amount of resources required to describe a large dataset (Guyon and Elisseeff 2003). When analyzing complex data, one of the major problems stems from the number of variables involved. Analysis with a large number of variables generally requires a large amount of memory and computation power. It may also lead to a classification algorithm that over-fits the training sample and generalizes poorly to new samples. Hence, feature extraction becomes essential when dealing with classification problems with a large number of variables.

We have extracted a comprehensive list of linguistic features for our classification job. Using different features lets us compare the results across the features listed in Table 1.

3.2 Feature selection

Feature selection is a process through which a subset of relevant features is selected for model formation (Guyon and Elisseeff 2003). This removes the redundant, irrelevant or less important features. Though it causes some loss of information,
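The ten-fold cross-validation procedure described at the start of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `k_fold_indices`, the fixed `seed` and the sample count of 100 are all hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment to folds
    # Fold j takes every k-th shuffled index starting at position j,
    # so fold sizes differ by at most one.
    folds = [indices[j::k] for j in range(k)]
    for j in range(k):
        test_idx = folds[j]
        train_idx = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train_idx, test_idx


# Hypothetical dataset of 100 tweets: each tweet lands in the held-out
# fold exactly once across the ten rounds.
splits = list(k_fold_indices(100, k=10))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

In each round, feature extraction and selection would be run on the training indices only, with accuracy and F-measure computed on the held-out indices, matching the description in the text.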