
profile information to the associated Twitter account can be used (Burger et al. 2011) (Fig. 1). We downloaded 10,000 tweets covering a range of women-related issues: abortion, female literacy, violence against women, female empowerment, rape, gender equality, gender-based harassment, forced prostitution of women, domestic violence against women, female infanticide and women's health. The tweets were cleaned by removing retweets and ambiguous tweets for which the gender of the author could not be ascertained. The remaining tweets were labeled manually: we visited each profile and determined the author's gender from keywords (mom of two, husband by profession, etc.), the profile picture and any other available information. While labeling, we kept one tweet per user. This reduced the dataset to roughly 3000 users, about 1800 of them female.

Fig. 1 Gender classification based on supervised learning. [Figure elements: raw data; feature extraction (words, n-grams, POS, etc.); feature selection (frequency, information gain); training input and testing input of unstructured tweets; machine learning algorithm; classifier model using Naïve Bayes and MaxEnt; male/female labels. Green arrows mark training data; amber arrows mark test data.]

S. Mukherjee, P. K. Bala

To train and test the classifiers, the dataset was randomly split into two sets in a ratio of 75–25. This ratio has been applied extensively in the classification literature (Schürer and Muskal 2013). A tenfold cross-validation was performed on the training set. In choosing the training-testing ratio, the stress is on the generalizability of the results, which is achieved by K-fold cross-validation as explained later in this section (Domingos 2012). One needs to ensure that the model does not overfit the training set, as this could drastically distort the results on the test set. This is usually addressed by K-fold cross-validation. For our purposes we use K = 10, which is the usual norm in training classifiers (Pennacchiotti and Popescu 2011). A tenfold cross

Table 1 Different feature types extracted from Twitter datasets

Sl. no | Feature | Definition | Example
1 | Content words | Content words are typically nouns, verbs, adjectives or adverbs that carry semantic content, bearing reference to the world independently of their use within a particular sentence (Winkler 2012) | School, beer, run, black, teach
2 | Function words | Function words are words that have little or ambiguous lexical meaning and instead serve to express grammatical relationships with other words within a sentence, or to specify the attitude or mood of the speaker (Klammer et al. 2000) | The, These, in, can, my
3 | Part of speech tags | Part of speech tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context (Church 1989) | This/PNN, is/VB, a/ART dog/NN
4 | Part of speech n-grams | An n-gram model is a type of probabilistic language model for predicting the next item in a sequence in the form of an (n − 1)-order Markov model. The prediction could be done
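The random 75–25 split and the tenfold cross-validation on the training set described in this section can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names, the fixed seed, the reading of 75–25 as 75% training and 25% test, and the placeholder user list are all assumptions made for the example.

```python
import random


def train_test_split(items, train_fraction=0.75, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n_items, k=10):
    """Yield (train_idx, val_idx) pairs; every index is held out exactly once."""
    # Spread any remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, val_idx
        start += size


# Hypothetical stand-in for the roughly 3000 labeled users.
users = [f"user_{i}" for i in range(3000)]
train, test = train_test_split(users)
folds = list(k_fold_indices(len(train), k=10))
```

In each of the ten folds, a classifier would be fit on `train_idx` and scored on `val_idx`; the 25% test set stays untouched until the cross-validated model has been chosen.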
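To make the part of speech n-gram feature from Table 1 concrete, the sketch below extracts bigrams over the tag sequence of the table's own example sentence and counts them. The helper name `ngrams` and the choice of n = 2 are assumptions for illustration; the tags are copied from the table as given, and a real pipeline would obtain them from a part of speech tagger.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Tagged example from Table 1 (tags reproduced as they appear there).
tagged = [("This", "PNN"), ("is", "VB"), ("a", "ART"), ("dog", "NN")]

# Keep only the tags, then slide a window of size 2 over them.
pos_tags = [tag for _, tag in tagged]
pos_bigrams = ngrams(pos_tags, 2)

# Bigram counts can serve as features for a classifier.
pos_bigram_counts = Counter(pos_bigrams)
```

The same helper applied to the words rather than the tags would give word n-grams.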
