Subsequent blog classification was attempted by Yan and Yan (2006). The authors used simple word features, background colors, and emoticons to classify text using the Naïve Bayes algorithm. Zhang and Zhang (2010) used word features and simple part-of-speech tags to classify the gender of blog authors. Later, Mukherjee and Liu (2010) used word sequence n-grams and feature selection ensembles for classifying blog text. The authors also compared the classification accuracy of their algorithm against commercially available software and reported better results.

Gender classification of microblog authors is relatively new and is just starting to be explored by researchers. One of the more notable works in the field is by Rao et al. (2010), who inferred latent user attributes with a support vector machine (SVM) based algorithm. The authors used n-gram word features of tweets as gender differentiators for their dataset. They also classified authors based on religious beliefs and political orientation using the same features. Pennacchiotti and Popescu (2011) used rich linguistic features for classification. They applied a machine learning approach to a comprehensive set of features derived from relevant user information. Alowibdi et al. (2013) used non-textual features such as background colors and their combinations to classify Twitter user profiles by gender and obtained reasonably high accuracy. Miller et al. (2012) used character-level n-grams as features to classify Twitter text. They applied Naïve Bayes and perceptron-based classification models. More recently, Ikeda et al. (2013) used community mining for classifying tweets. They formulated hybrid text-based and community-based methods to classify tweets based on demographics for a large dataset.

Most of the major works on text-feature-based gender classification of microblogs are based on word features.
Classification based on word features generally gives reasonable accuracy for the dataset used; however, it is heavily dependent on the words used in the text. The classification algorithms used in the above cases may not satisfactorily capture the latent features of tweeting behavior, which go beyond the topic being discussed in tweets and depend on the hidden nuances of the writing style of the genders. These features could be important for correctly predicting the gender of the author of a new tweet, which might be written in a different context. In such cases, capturing authorial style becomes indispensable. We have tried to overcome this issue by using features that are mostly independent of the topic of discussion in the tweets. In our method, we focus more on the authorial style of both genders, which is better captured using function words and part-of-speech n-grams. Koppel (2002) states that categorization by topic is typically based on keywords that reflect a document’s content, whereas categorization by author style uses precisely those features that are independent of context.

S. Mukherjee, P. K. Bala
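The contrast between topic words and style features can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function-word list, the helper names `function_word_features` and `pos_ngrams`, the example tweets, and the hand-assigned part-of-speech tags are all assumptions made for illustration. A real pipeline would use a much larger function-word list and an automatic tagger.

```python
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for
# illustration; not the list used in the paper).
FUNCTION_WORDS = ("the", "a", "and", "but", "of", "to", "in", "i", "you", "it")


def function_word_features(tokens):
    """Return the relative frequency of each function word in a token list."""
    counts = Counter(token.lower() for token in tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {word: counts[word] / total for word in FUNCTION_WORDS}


def pos_ngrams(tags, n=2):
    """Count contiguous n-grams over a sequence of part-of-speech tags."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


# Two made-up tweets on different topics, tagged by hand (assumed tags).
tweet_a = ["I", "love", "the", "game"]
tags_a = ["PRP", "VBP", "DT", "NN"]
tweet_b = ["I", "hate", "the", "rain"]
tags_b = ["PRP", "VBP", "DT", "NN"]

# Style features: function-word frequencies plus part-of-speech bigrams.
style_a = (function_word_features(tweet_a), pos_ngrams(tags_a))
style_b = (function_word_features(tweet_b), pos_ngrams(tags_b))

# The content words differ, but the style features coincide.
print(set(tweet_a) == set(tweet_b))
print(style_a == style_b)
```

In this toy case the two tweets share no content words beyond the function words, yet their style features are identical. This is the sense in which such features are largely independent of topic.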