Clustering
In this task we separate the Hindi and the English portions of the tweet. One of the main properties of such texts is that the English and the Hindi parts generally occur in groups, so we first try to isolate them. We use a corpus generated from an English dictionary. For example, suppose we have to classify the word 'reccommend', which is a misspelling of 'recommend'. We take this word and compute its Levenshtein distance to the words in our corpus that start with 'r' and have a length in the range (l-2, l+2), where l is the length of the word under consideration. For this example the Levenshtein distance will be small. A Hindi word like 'ghatiya', which means 'bad' or 'cheap' depending upon context, will instead have a large Levenshtein distance to any word beginning with 'g' in the dictionary. We therefore assign a distance to every word and then apply the k-means algorithm to these distances to obtain two clusters, Hindi and English. Some words are ambiguous: the Hindi word 'main', which means 'me', is also an English word. Classifying such a word as Hindi does not affect our results, since words like these have no overall effect on the sentiment of the tweet. Most of the Hindi words that can affect the overall sentiment have a high Levenshtein distance to any word of similar length in the English corpus.

Processing
Using the googletrans library, we convert the Hindi written in Latin script into Hindi written in Devanagari script. We then look up the words of our text in the ESWN and the HSWN and assign sentiment scores to all of them. For emojis, we use Python regular expressions to detect them and assign them scores.

Feature set
We then construct a feature set consisting of 7 features for every tweet:
o Whether it has a positive score or not
o Whether it has a negative score or not
o Word count greater than 8
o