Now that we have access to the CorpusReader objects that come with NLTK, we will explore how to modify them specifically for use with the HTML content that we have been ingesting throughout the chapter so far.

Reading an HTML Corpus

The CategorizedPlaintextCorpusReader in the previous section is actually very useful as it implements a standard preprocessing API that exposes the following methods:

• paras(): a generator of paragraphs, blocks of text delimited with double newlines.
• sents(): a generator of individual sentences in the text.
• words(): tokenizes the text into individual words.
• raw(): provides access to the raw text without preprocessing.
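To make the shape of this API concrete, here is a minimal sketch built only on the standard library. The ToyPlaintextReader class and its regular-expression rules are hypothetical stand-ins invented for illustration; they are not NLTK's implementation. NLTK's readers rely on proper tokenizers and return nested lists rather than flat strings, so treat this only as a picture of what each method is responsible for.

```python
import re


class ToyPlaintextReader:
    """Hypothetical stand-in illustrating the raw/paras/sents/words API."""

    def __init__(self, text):
        self._text = text

    def raw(self):
        # The text exactly as stored, with no preprocessing.
        return self._text

    def paras(self):
        # Paragraphs are blocks of text separated by blank lines.
        for block in re.split(r'\n\s*\n', self._text):
            block = block.strip()
            if block:
                yield block

    def sents(self):
        # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
        for para in self.paras():
            for sent in re.split(r'(?<=[.!?])\s+', para):
                if sent:
                    yield sent

    def words(self):
        # Naive rule: runs of word characters, or single punctuation marks.
        for sent in self.sents():
            yield from re.findall(r"\w+|[^\w\s]", sent)


reader = ToyPlaintextReader(
    "HTML is structured. Tags mark it up.\n\nPlain text is not."
)
print(list(reader.paras()))
print(list(reader.sents()))
print(list(reader.words())[:5])
```

Because each method is a generator layered on the one before it, the text is processed lazily, which is the property that matters most when streaming a large corpus from disk.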
Other CorpusReader objects expose other language processing methods, for example automatically tagging or parsing sentences, converting annotated text into meaningful data structures like Tree objects, or exposing format-specific utilities like individual XML elements. In order to fit models using machine learning techniques on our text, we will need these methods as part of the feature extraction process. In the next section, we will discuss the details of preprocessing and explore what is actually going on. Before we get to that, however, we need a way to stream the HTML data we have ingested into our programs.

So far in this chapter we have explored data ingestion from the web through a variety of techniques including web scraping, APIs and search, or by using RSS feeds or other syndication mechanisms. Because we are ingesting data from the Internet, it is a safe bet that the data we're ingesting is formatted as HTML. One option for creating a streaming corpus reader is to simply strip all the tags from the HTML, writing it as plaintext and using the CategorizedPlaintextCorpusReader. However, if we do that, we will lose the main benefit of HTML, namely computer-parseable, structured text, which we can take advantage of when preprocessing. Therefore, in this section we will begin to design a custom HTMLCorpusReader that we will extend in the preprocessing section.

    from nltk.corpus.reader.api import CorpusReader
    from nltk.corpus.reader.api import CategorizedCorpusReader

    # Tags to extract as paragraphs from the HTML text
    TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'p', 'li']

    class HTMLCorpusReader(CategorizedCorpusReader, CorpusReader):
        """
        A corpus reader for raw HTML documents to enable preprocessing.
        """

        def __init__(self, root, tags=TAGS, **kwargs):
            """
            Initialize the corpus reader. Categorization arguments
            (``cat_pattern``, ``cat_map``, and ``cat_file``) are passed to
            the ``CategorizedCorpusReader`` constructor. The remaining
            arguments are passed to the ``CorpusReader`` constructor.
            """
            # Get the CorpusReader specific arguments (both must be
            # supplied as keyword arguments, otherwise pop raises KeyError)
            fileids = kwargs.pop('fileids')
            encoding = kwargs.pop('encoding')

            # Initialize the NLTK corpus reader objects. Note that
            # CategorizedCorpusReader takes the kwargs dict itself, not
            # unpacked keyword arguments.
            CategorizedCorpusReader.__init__(self, kwargs)
            CorpusReader.__init__(self, root, fileids, encoding)

            # Save the tags that we specifically want to extract.