Chapter 3: Stirring the HTML and CSS Soup

  • [attributename|=value] selects all elements whose attributename attribute value is either exactly “value” or starts with “value” immediately followed by a hyphen (“-”).
  • [attributename^=value] selects all elements whose attribute value starts with the provided value. If you want to include spaces, wrap the value in double quotes.
  • [attributename$=value] selects all elements whose attribute value ends with the provided value. If you want to include spaces, wrap the value in double quotes.
  • [attributename*=value] selects all elements whose attribute value contains the provided value. If you want to include spaces, wrap the value in double quotes.

Finally, there are a number of “colon” and “double-colon” “pseudo-classes” that can be used in a selector rule as well. p:first-child selects every “<p>” tag that is the first child of its parent element, and p:last-child and p:nth-child(10) provide similar functionality.

Play around with the Wikipedia page using Chrome’s Developer Tools (or the equivalent in your browser): try to find instances of the “class” attribute. The CSS resource of the page is referenced through a “<link>” tag (note that pages can load multiple CSS files as well):

    <link rel="stylesheet" href="/w/load.php?[...];skin=vector">

We’re not going to build websites using CSS. Instead, we’re going to scrape them. As such, you might wonder why this discussion regarding CSS is useful for our purposes. The reason is that the same CSS selector syntax can be used to quickly find and retrieve elements from an HTML page using Python. Try right-clicking some HTML elements in the Elements tab of Chrome’s Developer Tools pane and pressing “Copy, Copy selector.” Note that you obtain a CSS selector. For instance, this is the selector to fetch one of the tables on the page:

    #mw-content-text > div > table:nth-child(9)
Or: “inside the element with id ‘mw-content-text,’ get the child ‘div’ element, and get the ‘table’ element that is its 9th child.” We’ll use these selectors quite often once we start working with HTML in our web scraping scripts.

3.4 The Beautiful Soup Library

We’re now ready to start working with HTML pages using Python. Recall the following lines of code:

    import requests

    url = '' + \
          '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
    r = requests.get(url)
    html_contents = r.text

How do we deal with the HTML contained in html_contents? To properly parse and tackle this “soup,” we’ll bring in another library, called “Beautiful Soup.”

Soup, Rich and Green: And finally, it becomes clear why we’ve been referring to messy HTML pages as a “soup.” The Beautiful Soup library was named after a Lewis Carroll poem bearing the same name from “Alice’s Adventures in Wonderland.” In the tale, the poem is sung by a character called the “Mock Turtle” and goes as follows: “Beautiful Soup, so rich and green, // Waiting in a hot tureen! // Who for such dainties would not stoop? // Soup of the evening, beautiful Soup!”
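
Before moving on, it can help to make the attribute selector operators from the list earlier in this chapter concrete. The following is a minimal sketch in plain Python of the string test each operator performs. The function name attribute_matches and the sample attribute values are hypothetical, made up for illustration, and are not part of Beautiful Soup or any other library. Real selector engines also deal with quoting, case sensitivity, and more.

```python
def attribute_matches(actual, operator, value):
    """Return True if an attribute value passes a CSS attribute test.

    `actual` is the element's attribute value, or None when the
    attribute is absent. This is an illustrative helper only.
    """
    if actual is None:
        return False
    if operator == "|=":
        # Exactly "value", or "value" immediately followed by a hyphen.
        return actual == value or actual.startswith(value + "-")
    if value == "":
        # The substring operators never match an empty value.
        return False
    if operator == "^=":
        return actual.startswith(value)
    if operator == "$=":
        return actual.endswith(value)
    if operator == "*=":
        return value in actual
    raise ValueError(f"unsupported operator: {operator}")


# Hypothetical attribute values, chosen only to show each operator.
print(attribute_matches("en-US", "|=", "en"))                # True
print(attribute_matches("english", "|=", "en"))              # False
print(attribute_matches("/w/load.php", "^=", "/w/"))         # True
print(attribute_matches("logo.png", "$=", ".png"))           # True
print(attribute_matches("mw-content-text", "*=", "content")) # True
```

Note how “english” fails the |= test even though it starts with “en”: the hyphen directly after the value is required unless the match is exact.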
