The Cranfield collection is a standard IR text collection, consisting of
1400 documents from the aerodynamics field. It is available from the class
web page (see the "Links and resources" section).

1. Write a program that preprocesses the collection. The preprocessing stage
should specifically include:
a. A function that eliminates SGML tags.
b. A function that tokenizes the text. Pay particular attention to
characters that need special handling, as discussed in class
(".", ",", "-", etc.). For this task, please use
_your own_ implementation of a tokenizer.
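
One possible shape for the two functions is sketched below in Python. It is only an illustration under stated assumptions, not the expected solution: the names strip_sgml_tags, tokenize, and TAG_RE are hypothetical, the tag pattern assumes simple tags such as <TITLE> or </TEXT> with no ">" inside attribute values, and the rules for ".", ",", and "-" are one plausible reading, since the in-class discussion is not reproduced here.

```python
import re

# Assumption: tags are simple, e.g. <DOC>, </TITLE>, <TEXT attr="x">,
# and never contain a literal '>' inside an attribute value.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def strip_sgml_tags(text):
    """Replace every SGML tag with a space so neighboring words do not fuse."""
    return TAG_RE.sub(" ", text)


def tokenize(text):
    """Hand-written character scanner (no library tokenizer).

    Assumed rules, chosen for illustration:
      - letters and digits build a token, lowercased;
      - '.' or ',' between two digits stays in the token (2.5, 1,000);
      - '-' between two alphanumerics stays in the token (wing-body);
      - every other character ends the current token.
    """
    tokens = []
    current = []
    n = len(text)
    for i, ch in enumerate(text):
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < n else ""
        if ch.isalnum():
            current.append(ch.lower())
        elif ch in ".," and prev_ch.isdigit() and next_ch.isdigit():
            current.append(ch)
        elif ch == "-" and prev_ch.isalnum() and next_ch.isalnum():
            current.append(ch)
        elif current:
            # Separator reached: emit the token collected so far.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    sample = "<TITLE>Wing-body flow at Mach 2.5, 1,000 ft.</TITLE>"
    print(tokenize(strip_sgml_tags(sample)))
```

Replacing tags with a space rather than the empty string keeps text on either side of a tag from merging into one token. Whether hyphenated words should stay whole or be split, and whether case should be folded, are design choices the assignment leaves to the class discussion; the sketch picks one option for each.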

This question was asked on Jan 28, 2013.
