Clustering

In this task we separated the Hindi and the English portions of the tweet. One of the main properties of such texts is that the English and the Hindi parts generally occur in groups, so we first try to isolate them. We use an English corpus generated from a dictionary. For example, suppose we have to classify the misspelt word 'reccommend', whose correct spelling is 'recommend'. We take this word and compute its Levenshtein distance to the words in our corpus that start with 'r' and have a length in the range (l-2, l+2), where l is the length of the word under consideration. For this example the Levenshtein distance is small. A Hindi word like 'ghatiya', which means 'bad' or 'cheap' depending on context, has a large Levenshtein distance to every dictionary word beginning with 'g'. We therefore assign a distance to every word and then apply the k-means algorithm to obtain two clusters, Hindi and English.

Some words are ambiguous. The Hindi word 'main' means 'me', but 'main' is also an English word. Classifying it as Hindi has no effect on our results, because words like this do not affect the overall sentiment of the tweet. Most of the Hindi words that can affect the overall sentiment have a high Levenshtein distance to words of similar length in the English corpus.

Processing

Using the googletrans library, we translate the Hindi written in Latin script into Hindi written in Devanagari script. Then we use the ESWN and the HSWN to assign sentiment scores to all the words. For emojis we use Python regular expressions to assign scores.

Feature set

We then construct a feature set consisting of 7 features for every tweet:
o Whether it has a positive score or not
o Whether it has a negative score or not
o Whether the word count is greater than 8
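
The clustering step described earlier (edit distance against a dictionary corpus, followed by k-means with two clusters) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the tiny `CORPUS`, the helper names `min_distance` and `two_means_1d`, the inclusive reading of the length range, and the fallback used when no candidate word exists are all assumptions made for the example. The k-means here is a plain one-dimensional version written out by hand so the block has no external dependencies.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical stand-in for the dictionary-derived English corpus.
CORPUS = ["recommend", "record", "good", "great", "ghost", "gravity",
          "movie", "main", "this", "is", "i"]


def min_distance(word: str, corpus=CORPUS) -> int:
    """Smallest edit distance to corpus words sharing the first letter
    and having a length within 2 of the word's length (assumed inclusive)."""
    word = word.lower()
    candidates = [w for w in corpus
                  if w[0] == word[0] and abs(len(w) - len(word)) <= 2]
    if not candidates:
        return len(word)  # assumption: no candidate means "far from English"
    return min(levenshtein(word, w) for w in candidates)


def two_means_1d(values, iterations=20):
    """Plain k-means with k=2 on scalars; label 1 is the high-distance cluster."""
    lo, hi = float(min(values)), float(max(values))
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        if low:
            lo = sum(low) / len(low)
        if high:
            hi = sum(high) / len(high)
    return labels


tokens = ["i", "reccommend", "this", "movie", "ghatiya"]
distances = [min_distance(t) for t in tokens]
labels = two_means_1d(distances)  # 0: likely English, 1: likely Hindi
for token, dist, label in zip(tokens, distances, labels):
    print(token, dist, "Hindi" if label == 1 else "English")
```

With a realistic dictionary the distances would differ, but the idea is the same: misspelt English words stay close to some dictionary entry, while romanized Hindi words tend to land in the high-distance cluster.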

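The emoji scoring mentioned in the processing step could look something like the sketch below. The source only says that Python regular expressions were used, so everything specific here is an assumption: the character ranges in `EMOJI_PATTERN` cover only some common emoji blocks, and the `EMOJI_SCORES` table and its values are hypothetical placeholders rather than the scores used in the original work.

```python
import re

# Rough matcher for single emoji code points (assumed ranges, not exhaustive).
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Hypothetical sentiment scores for a handful of emojis.
EMOJI_SCORES = {
    "\U0001F600": 1.0,   # grinning face
    "\U0001F60D": 1.0,   # smiling face with heart-eyes
    "\U0001F622": -1.0,  # crying face
    "\U0001F621": -1.0,  # pouting face
}


def emoji_score(text: str) -> float:
    """Sum the scores of all matched emojis; unknown emojis count as 0."""
    return sum(EMOJI_SCORES.get(e, 0.0) for e in EMOJI_PATTERN.findall(text))
```

A call such as `emoji_score("ghatiya \U0001F621")` would then contribute a negative value to the tweet's overall score.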