EMNLP-2010-blog-gender

Improving Gender Classification of Blog Authors

Arjun Mukherjee
Department of Computer Science
University of Illinois at Chicago
851 South Morgan Street
Chicago, IL 60607, USA
[email protected]

Bing Liu
Department of Computer Science
University of Illinois at Chicago
851 South Morgan Street
Chicago, IL 60607, USA
[email protected]

Abstract

The problem of automatically classifying the gender of a blog author has important applications in many commercial domains. Existing systems mainly use features such as words, word classes, and POS (part-of-speech) n-grams, for classification learning. In this paper, we propose two new techniques to improve the current result. The first technique introduces a new class of features which are variable length POS sequence patterns mined from the training data using a sequence pattern mining algorithm. The second technique is a new feature selection method which is based on an ensemble of several feature selection criteria and approaches. Empirical evaluation using a real-life blog data set shows that these two techniques improve the classification accuracy of the current state-of-the-art methods significantly.

1 Introduction

Weblogs, commonly known as blogs, refer to online personal diaries which generally contain informal writings. With the rapid growth of blogs, their value as an important source of information is increasing. A large amount of research work has been devoted to blogs in the natural language processing (NLP) and other communities. There are also many commercial companies that exploit information in blogs to provide value-added services, e.g., blog search, blog topic tracking, and sentiment analysis of people’s opinions on products and services. Gender classification of blog authors is one such study, which also has many commercial applications.
For example, it can help the user find what topics or products are most talked about by males and females, and what products and services are liked or disliked by men and women. Knowing this information is crucial for market intelligence because the information can be exploited in targeted advertising and also product development.

In the past few years, several authors have studied the problem of gender classification in the natural language processing and linguistic communities. However, most existing works deal with formal writings, e.g., essays of people, the Reuters news corpus and the British National Corpus (BNC). Blog posts differ from such text in many ways. For instance, blog posts are typically short and unstructured, and consist of mostly informal sentences, which can contain spurious information and are full of grammar errors, abbreviations, slang words and phrases, and wrong spellings. Due to these reasons, gender classification of blog posts is a harder problem than gender classification of traditional formal text.
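
As a point of reference for the POS n-gram features that the abstract attributes to existing systems, the following is a minimal sketch of how such features could be counted from an already tagged post. It is an illustration only, not the authors' implementation: the function names `pos_ngrams` and `pos_ngram_features`, the `max_n` parameter, and the hand-assigned Penn Treebank style tags in the example are all assumptions made here. It covers fixed-length n-grams only, not the variable length sequence patterns the paper proposes.

```python
from collections import Counter


def pos_ngrams(tags, n):
    """Return all contiguous n-grams over a list of POS tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def pos_ngram_features(tags, max_n=3):
    """Count POS n-grams for n = 1..max_n.

    Keys are space-joined tag strings, values are occurrence counts.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for gram in pos_ngrams(tags, n):
            counts[" ".join(gram)] += 1
    return counts


# Hypothetical, hand-tagged post: "i love my new phone"
# (tags assigned by hand for illustration, not produced by a real tagger)
example_tags = ["PRP", "VBP", "PRP$", "JJ", "NN"]
features = pos_ngram_features(example_tags, max_n=2)
```

In a real pipeline the tag list would come from a POS tagger run over each blog post, and the resulting counts would be one feature group alongside words and word classes.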

This note was uploaded on 11/12/2010 for the course CSCI 271 taught by Professor Wilczynski during the Spring '08 term at USC.
