library. For example, if we wanted to extract the text from one of the news article pages we saved and print it to the console, we would use the code snippet below to do so. In the code below, we first identify the tags where text data is contained. Then we create an html_to_text function which takes a file path, reads the HTML from the file, and uses the get_text method to yield the text from anywhere it finds a tag that matches our tag list.

```python
import bs4

TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'p', 'li']

def html_to_text(path):
    with open(path, 'r') as f:
        html = f.read()
        soup = bs4.BeautifulSoup(html, "lxml")
        for tag in soup.find_all(TAGS):
            yield tag.get_text()
```

Just as with crawling, there are also some considerations to take into account when scraping content from web pages. Some websites have dynamic content that is loaded via JavaScript. For these websites, you would need to take a different approach in order to obtain the content.
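To make this concrete, the snippet below repeats the function and runs it over a small throwaway page written to a temporary file; the sample HTML is invented purely for illustration, and it assumes bs4 and lxml are installed.

```python
import bs4
import tempfile

TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'p', 'li']

def html_to_text(path):
    with open(path, 'r') as f:
        html = f.read()
        soup = bs4.BeautifulSoup(html, "lxml")
        for tag in soup.find_all(TAGS):
            yield tag.get_text()

# A made-up page standing in for a saved news article.
sample = "<html><body><h1>Headline</h1><p>First paragraph.</p></body></html>"
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# Print each chunk of text the generator yields, in document order.
for text in html_to_text(path):
    print(text)
```

Because html_to_text is a generator, it yields text incrementally rather than building one large string, which keeps memory use flat even for long pages.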
There are also several ways to crawl and scrape websites besides the methods we've demonstrated here. For more advanced crawling and scraping, it may be worth looking into the following tools.

• Scrapy - an open source framework for extracting data from websites.
• Selenium - a Python library that allows you to simulate user interaction with a website.
• Apache Nutch - a highly extensible and scalable open source web crawler.

Web crawling and scraping can take us a long way in our quest to acquire text data from the web, and the tools currently available make performing these tasks easier and more efficient. However, there is still much work left to do after initial ingestion. While formatted HTML is fairly easy to parse with packages like BeautifulSoup, after a bit of experience with scraping, one quickly realizes that while general formats are similar, different websites can lay out content very differently. Accounting for, and working with, all these different HTML layouts can be frustrating and time consuming, which can make using more structured text data sources, like RSS, look much more attractive.

Ingestion using RSS Feeds and Feedparser

RSS (Really Simple Syndication) is a standardized XML format for syndicated text data that is primarily used by blogs, news sites, and other online publishers who publish multiple documents (posts, articles, etc.) using the same general layout. There are different versions of RSS, all originally evolved from the Resource Description Framework (RDF) data serialization model, the most common of which is currently RSS 2.0. Atom is a newer and more standardized, but at the time of this writing, a less widely used approach to providing XML content updates.

Text data structured as RSS is formatted more consistently than text data on a regular web page, as a content feed, or a series of documents arranged in the order they were published.
This feed means you do not need to crawl the website in order to get other content or acquire updates, making it preferable to acquiring data through crawling.