Assignment Name – Advance Predictive Modelling

Problem Statement –

1. How will you treat text having shortcut words (like bcz, u, thr etc…)?

Most text message normalization approaches are based on supervised learning and rely on human-labelled training data. In addition, the nonstandard words are often categorized into different types, and specific models are designed to tackle each type.

Words are an integral part of any text classification technique. However, the same word often appears in different variations depending on its grammatical role (verb, adjective, noun, etc.). It is good practice to normalize terms to their root forms; this technique is known as lemmatization.

So the process is to apply normalization, which cleans the complete data set. Text normalization includes:

- converting all letters to lower or upper case
- converting numbers into words or removing numbers
- removing punctuation, accent marks and other diacritics
- removing white spaces
- expanding abbreviations (the step that applies to shortcut words such as bcz and u)
- removing stop words, sparse terms, and particular words
- text canonicalization
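The normalization steps listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a complete normalizer. The names `SHORTCUTS`, `STOP_WORDS`, `strip_accents` and `normalize` are hypothetical, the shortcut expansions (for example reading "thr" as "there") are assumptions, and the stop-word list is deliberately tiny. Lemmatization is left out because it needs a linguistic resource beyond the standard library.

```python
import re
import string
import unicodedata

# Hypothetical lookup table: the expansions are assumptions for illustration.
# Real data would need a much larger, domain-specific dictionary.
SHORTCUTS = {"bcz": "because", "u": "you", "thr": "there"}

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of"}

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_accents(text):
    """Remove accent marks by decomposing characters and dropping combining marks.

    Intended for Latin-script text only.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text, remove_stop_words=True):
    """Apply the listed normalization steps and return a cleaned string."""
    text = strip_accents(text.lower())      # lower-case, then remove accents
    text = re.sub(r"\d+", " ", text)        # remove numbers
    text = text.translate(PUNCT_TO_SPACE)   # remove punctuation
    tokens = text.split()                   # also collapses extra white space
    tokens = [SHORTCUTS.get(tok, tok) for tok in tokens]  # expand shortcut words
    if remove_stop_words:
        tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    return " ".join(tokens)


print(normalize("U r late bcz the café is 2 far!!"))
```

A dictionary lookup only covers shortcuts that have been listed in advance; unseen variants pass through unchanged, which is one reason the supervised approaches mentioned above are used.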