Chapter 7: NLP

Boilerpipe

One of the hardest parts of analyzing web pages is removing the navigation links, headers, footers, and sidebars to leave the meaningful content text. If all of that boilerplate is left in, the analysis will be highly distorted by repeated irrelevant words and phrases from those sections. Boilerpipe is a Java framework that uses an algorithmic approach to spotting the actual content of an HTML document, and so makes a great preprocessing tool for any web content. It’s aimed at pages that look something like a news story, but I’ve found it works decently for many different types of sites.

A live demonstration of the service

OpenCalais

OpenCalais is a web API that takes a piece of text, spots the names of entities it knows about, and suggests overall tags. It’s a mature project run by Thomson Reuters and is widely used. In my experience, it tends to be strongest at understanding terms and phrases that you might see in formal news stories, as you might expect from its heritage. It’s definitely a good place to start when you need a semantic analysis of your content, but there are still some reasons you might want to look into alternatives.

There is a 50,000 per-day limit on calls, and 100K limit on document sizes for the standard API. This is negotiable with the commercial version, but the overhead is one reason to consider running something on a local cluster instead for large volumes of data. You may also need to ensure that the content you’re submitting is not sensitive, though the service does promise not to retain any of it. There may also be a set of terms or phrases unique to your problem domain that’s not covered by the service. In that case, a hand-rolled parser built on NLTK or OpenNLP could be a better solution.
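
To get a feel for what the OpenCalais limits mean in practice, here is a minimal Python sketch that splits long documents into chunks under the size limit and estimates how many days a batch of calls would take under the daily quota. It makes no calls to the service itself. The function names are hypothetical, and reading "100K" as 100,000 bytes of UTF-8 is an assumption, so check the current terms before relying on either number.

```python
from typing import List

MAX_DOC_BYTES = 100_000     # assumption: "100K" read as 100,000 bytes of UTF-8
MAX_CALLS_PER_DAY = 50_000  # per-day call limit quoted in the text


def split_for_limit(text: str, max_bytes: int = MAX_DOC_BYTES) -> List[str]:
    """Split text so that every chunk encodes to at most max_bytes of UTF-8.

    Paragraph boundaries (blank lines) are preferred; a single paragraph
    that is still too large is cut at a character boundary.
    """
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4")
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = para if not current else current + "\n\n" + para
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
            continue
        if current:
            chunks.append(current)
        data = para.encode("utf-8")
        while len(data) > max_bytes:
            cut = max_bytes
            # Back off so a multi-byte character is never split in half.
            while (data[cut] & 0xC0) == 0x80:
                cut -= 1
            chunks.append(data[:cut].decode("utf-8"))
            data = data[cut:]
        current = data.decode("utf-8")
    if current:
        chunks.append(current)
    return chunks


def days_needed(num_calls: int, calls_per_day: int = MAX_CALLS_PER_DAY) -> int:
    """Whole days required to make num_calls under a daily quota."""
    return -(-num_calls // calls_per_day)  # ceiling division
```

Each chunk costs one call, so a corpus that splits into 120,000 chunks would need `days_needed(120_000)` days on the standard quota. An estimate like that helps decide whether a local NLTK or OpenNLP pipeline is worth the setup cost.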

CHAPTER 8
Machine Learning

Another important processing category, machine learning systems automate decision making on data. They use training information to deal with subsequent data points, automatically producing outputs like recommendations or groupings. These systems are especially useful when you want to turn the results of a one-off data analysis into a production service that will perform something similar on new data without supervision. Some of the most famous uses of these techniques are features like Amazon’s product recommendations.

WEKA

WEKA is a Java-based framework and GUI for machine learning algorithms. It provides a plug-in architecture for researchers to add their own techniques, with a command-line and window interface that makes it easy to apply them to your own data. You can use it to do everything from basic clustering to advanced classification, together with a lot of tools for visualizing your results. It is heavily used as a teaching tool, but it also comes in extremely handy for prototyping and experimenting outside of the classroom.
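
As a rough illustration of what "basic clustering" means, the sketch below implements a bare-bones k-means in plain Python. It is not WEKA's API or implementation. The function names are hypothetical, and seeding the centroids with the first k points is a simplifying assumption that keeps the example deterministic.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: Sequence[Point], k: int, iterations: int = 20) -> List[int]:
    """Return a cluster label (0..k-1) for each point.

    Assumption: the first k points seed the centroids. Real tools
    usually pick the starting centroids more carefully.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels


sample = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9), (0.9, 1.1)]
sample_labels = k_means(sample, k=2)
```

On the sample data, the three points near (1, 1) end up with one label and the two points near (8, 8) with the other. A framework like WEKA adds the pieces this sketch leaves out, such as data loading, a choice of algorithms, and visualization.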