

The ultimate objective of any text mining process using the "bag-of-words" approach is to convert the text to be analysed into a data frame consisting of the words used in the text and their frequencies. These are captured by the document term matrix (DTM) and the term document matrix (TDM), which we will look into in the subsequent sections.
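Before turning to the tm functions, the idea behind a DTM can be sketched in a few lines of base R. This is an illustration only (the two example sentences are made up); in practice the tm package builds these matrices from a corpus with DocumentTermMatrix() and TermDocumentMatrix().

```r
# Minimal base-R sketch of the bag-of-words idea: rows = documents,
# columns = terms, cells = word counts. The TDM is simply the transpose.
docs  <- c("the dress fits well", "the dress runs small")
words <- strsplit(docs, "\\s+")
vocab <- sort(unique(unlist(words)))

# Count each vocabulary term per document, then bind rows into a matrix.
dtm <- t(sapply(words, function(w) table(factor(w, levels = vocab))))
rownames(dtm) <- paste0("doc", seq_along(docs))
dtm
```

Here "dress" and "the" get a count of 1 in both rows, while "runs" and "small" appear only in the second document's row.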
Md Abu (002853247), Homework 11, 8/12/19

To ensure that the DTM and TDM are cleaned up and represent the core set of relevant words, a set of pre-processing activities needs to be performed on the corpus. This is similar to the data clean-up done for structured data before data mining. The following are some of the common pre-processing steps:

1. Convert to lower case: this way, if there are two words "Dress" and "dress", they will be converted to a single entry, "dress".

corpus_review = tm_map(corpus_review, tolower)

2. Remove punctuation:

corpus_review = tm_map(corpus_review, removePunctuation)

3. Remove stopwords: "stopwords" is a very important concept to understand when doing text mining. When we write, the text generally contains a large number of prepositions, pronouns, conjunctions, etc. These words need to be removed before we analyse the text; otherwise, stopwords will dominate the list of frequently used words and will not give the correct picture of the core words used in the text. There is a list of common stopwords used in English, which we can view with this command: stopwords("en")

#Remove stopwords
corpus_review = tm_map(corpus_review, removeWords, stopwords("english"))

We might also want to remove custom stopwords based on the context of the text mining. These are words specific to the dataset that may not add value to the text.

#Remove context-specific stop words
corpus_review = tm_map(corpus_review, removeWords,
                       c("also", "get", "like", "company", "made",
                         "can", "im", "dress", "just", "i"))

Stemming a document

In linguistics, stemming is the process of reducing inflected (or derived) words to their word stem, base or root form, generally a written word form. The SnowballC package is used for document stemming. For example, "complicated", "complication" and "complicate" will all be reduced to "complicat" after stemming.
This is again to ensure that the same word is not repeated as multiple versions in the DTM and TDM, and that only the root of each word is represented there.

##Stem document
corpus_review = tm_map(corpus_review, stemDocument)

##View the corpus content
corpus_review[[8]][1]
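To make the effect of stemming concrete, here is a toy suffix-stripper in base R. It is emphatically not the Snowball/Porter algorithm that stemDocument uses (the real stemmer applies a much more careful, staged set of rules), but on the three example words above it happens to produce the same stem:

```r
# Toy suffix-stripper, for illustration only: strips a few common
# English suffixes so that related word forms collapse together.
toy_stem <- function(w) sub("(ions|ion|ed|e|s)$", "", w)

toy_stem(c("complicated", "complication", "complicate"))
# All three collapse to the single stem "complicat".
```

This is why stemmed entries in a DTM often look like truncated words: the stem is a shared root form, not necessarily a dictionary word.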
Corpus content

The corpus object in R is a nested list. We can use the R syntax for lists to view the contents of the corpus.

Frequently used words

We now have a text corpus which is cleaned and contains only the core words required for text mining. The next step is exploratory analysis, and its first task is to identify the most frequently used words in the overall review text.
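The frequency count itself can be sketched in base R. The review sentences below are made up for illustration; on the real cleaned corpus the same counts come directly from the DTM (e.g. as column sums).

```r
# Sketch: most frequent words in a small set of reviews, after a crude
# stopword-removal step done inline with %in% (tm::removeWords does the
# equivalent on a whole corpus).
reviews <- c("love this dress", "the dress fits great", "great fit love it")
tokens  <- unlist(strsplit(reviews, "\\s+"))
tokens  <- tokens[!tokens %in% c("the", "this", "it")]  # drop stopwords
sort(table(tokens), decreasing = TRUE)
```

Here "love", "dress" and "great" each appear twice and rise to the top, which is exactly the kind of picture the exploratory step is after.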


