Assignment6.docx

Ans 1. After a text is obtained, we start with text normalization. Text normalization includes:

1. Converting all letters to lower or upper case
2. Converting numbers into words, or removing numbers
3. Removing punctuation, accent marks, and other diacritics
4. Removing extra white space
5. Expanding abbreviations
6. Removing stop words, sparse terms, and particular words
7. Text canonicalization

Shortcut words can be treated in two ways:

1. Expand the shortcut words. Stemming can bring words to their root form, but a stemmer will not recognize shortcut words unless a group (mapping) for these words is defined first. Once that mapping exists, normalization techniques can be applied to expand these words.
2. Remove the shortcut words from the text. This can be done by tokenizing the text in Python, by using the "re" regex library, or by adding these words to the stop-word list.

Ans 2. R code: In R, sub() replaces the first match of a pattern and gsub() replaces every match, so gsub() can be used to replace a shortcut word throughout the text:

    test <- "asdas bcz asdsd"
    gsub("bcz", "because", test)
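The normalization steps from Ans 1 and the two treatments of shortcut words can be sketched in Python roughly as follows. This is a minimal illustration and not the assignment's required solution: the function name normalize, the SHORTCUTS table, and the tiny STOP_WORDS set are all assumptions made for the example, and a real pipeline would more likely use a library-provided stop-word list.

```python
import re
import string
import unicodedata

# Hypothetical lookup table of shortcut words; extend it for your own corpus.
SHORTCUTS = {"bcz": "because", "u": "you", "pls": "please"}

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "to"}


def normalize(text, expand=True):
    """Normalize text; expand shortcut words, or drop them if expand=False."""
    # Convert all letters to lower case.
    text = text.lower()
    # Remove accent marks: decompose characters, then drop combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove numbers (replaced by a space so neighbouring words stay apart).
    text = re.sub(r"\d+", " ", text)
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize on whitespace; this also discards extra white space.
    tokens = text.split()
    # Treat shortcut words: either expand them or remove them.
    if expand:
        tokens = [SHORTCUTS.get(tok, tok) for tok in tokens]
    else:
        tokens = [tok for tok in tokens if tok not in SHORTCUTS]
    # Remove stop words.
    tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    return " ".join(tokens)


print(normalize("Asdas  BCZ asdsd!"))
print(normalize("Asdas  BCZ asdsd!", expand=False))
```

Matching whole tokens against the lookup table, rather than substituting substrings, avoids rewriting "bcz" when it happens to occur inside a longer word.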