Subsequent blog classification was attempted by Yan and Yan (2006). The authors used simple word features, background colors, and emoticons to classify text using the Naïve Bayesian algorithm. Zhang and Zhang (2010) captured word features and simple part-of-speech tags to classify the gender of blog authors. Later, Mukherjee and Liu (2010) used word sequence n-grams and feature selection ensembles for classifying blog text. The authors also compared the classification accuracy of their algorithm against commercially available software and found better results.

Gender classification of microblog authors is relatively new and is just starting to be explored by researchers. One of the more remarkable works in the field is by Rao et al. (2010), who captured latent user attributes using a support vector machine (SVM) based algorithm. The authors used n-gram word features of tweets as gender differentiators for their dataset. They also classified authors based on religious beliefs and political orientation using the same features. Pennacchiotti and Popescu (2011) used rich linguistic features for classification. They applied a machine learning approach to a comprehensive set of features derived from relevant user information. Alowibdi et al. (2013) used non-textual features such as background colors and their combinations to classify Twitter user profiles by gender and obtained reasonably high accuracy. Miller et al. (2012) used character-level n-grams as features to classify Twitter text, applying Naïve Bayes and perceptron-based classification models. More recently, Ikeda et al. (2013) used community mining for classifying tweets. They formulated hybrid text-based and community-based methods to classify tweets by demographics on a large dataset. Most of the major works on text-feature-based gender classification of microblogs rely on word features.
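To make the character-level n-gram idea concrete, the following is a minimal sketch of n-gram extraction feeding a small multinomial Naïve Bayes classifier with add-one smoothing. It is an illustration only and not the implementation of any cited work: the names `char_ngrams` and `TinyNaiveBayes`, the choice of n = 3, the class labels, and the toy training strings are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class TinyNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts (add-one smoothing)."""

    def fit(self, texts, labels, n=3):
        self.n = n
        self.class_counts = Counter(labels)          # documents per class
        self.feature_counts = defaultdict(Counter)   # n-gram counts per class
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(char_ngrams(text, n))
        self.vocab = {g for c in self.feature_counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for gram in char_ngrams(text, self.n):
                if gram in self.vocab:  # skip n-grams never seen in training
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training strings with neutral placeholder labels.
toy_texts = ["sooo excited for tonight!!!", "final score and match report"]
toy_labels = ["class_a", "class_b"]
model = TinyNaiveBayes().fit(toy_texts, toy_labels)
print(model.predict("so excited!!!"))
```

Character n-grams pick up spelling habits, elongations, and punctuation runs that word features miss, which is one plausible reason they suit short, noisy text such as tweets.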
Classification based on word features generally gives reasonable accuracy for the dataset used; however, it is heavily dependent on the words used in the text. The classification algorithms used in the above cases may not satisfactorily capture the latent features of tweeting behavior, which go beyond the topic being discussed in tweets and depend on the hidden nuances of the writing style of the genders. These features could be important for correctly predicting the gender of the author of a new tweet, which might be written in a different context. In such cases capturing authorial style becomes indispensable. We have tried to overcome this issue by using features that are mostly independent of the topic of discussion in the tweets. In our method, we focus more on the authorial style of both genders, which is better captured using function words and part-of-speech n-grams. Koppel (2002) states that categorization by topic is typically based on keywords which reflect a document's content, whereas categorization by author style uses precisely those features which are independent of context.

S. Mukherjee, P. K. Bala
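The two topic-independent feature families mentioned above, function words and part-of-speech n-grams, can be sketched as follows. This is a hedged illustration, not the authors' feature pipeline: the ten-word `FUNCTION_WORDS` list is a hypothetical stand-in for the much longer lists used in stylometry, the helper names are invented for this example, and the part-of-speech tags are assigned by hand rather than produced by a tagger.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list for illustration only.
FUNCTION_WORDS = ("the", "a", "of", "and", "to", "in", "i", "you", "it", "but")


def function_word_profile(text):
    """Relative frequency of each listed function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {word: counts[word] / total for word in FUNCTION_WORDS}


def pos_ngrams(tags, n=2):
    """Count n-grams over a sequence of part-of-speech tags."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


profile = function_word_profile("I went to the match and it was great")

# Tags written by hand for the sentence above; in practice they would come
# from whichever part-of-speech tagger is available.
tags = ["PRP", "VBD", "TO", "DT", "NN", "CC", "PRP", "VBD", "JJ"]
bigrams = pos_ngrams(tags, n=2)
```

Neither feature depends on content words such as "match", so the same profile shape can in principle be compared across tweets on unrelated topics.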