... a document, thereby obtaining useful structured data from unstructured text [17, 10]. Specifically, it involves finding a set of substrings from the document, called fillers, for each of a set of specified slots. When applied to web pages instead of natural language text, such an extractor is sometimes called a wrapper [15]. The current slots utilized by the recommender are: title, authors, synopses, published reviews, customer comments, related authors, related titles, and subject terms. Amazon produces the information about related authors and titles using collaborative methods; however, Libra simply treats them as additional content about the book. Only books that have at least one synopsis, review, or customer comment are retained as having adequate content information. A number of other slots are also extracted (e.g. publisher, date, ISBN, price, etc.) but are currently not used by the recommender. We have initially assembled databases for literary fiction (3,061 titles), science fiction (3,813 titles), mystery (7,285 titles), and science (6,177 titles).

Since the layout of Amazon's automatically generated pages is quite regular, a fairly simple extraction system is sufficient. Libra's extractor employs a simple pattern matcher that uses pre-filler, filler, and post-filler patterns for each slot, as described by [7]. In other applications, more sophisticated information extraction methods and inductive learning of extraction rules might be useful [8].

The text in each slot is then processed into an unordered bag of words (tokens) and the examples represented as a vector of bags of words (one bag for each slot). A book's title and authors are also added to its own related-title and related-author slots, since a book is obviously "related" to itself, and this allows overlap in these slots with books listed as related to it.
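The pre-filler, filler, and post-filler matching described above can be illustrated with a minimal sketch. The slot names follow the text, but the markup in the patterns, the `SLOT_PATTERNS` table, and the `extract_slots` function are hypothetical: the excerpt does not show Libra's actual patterns or Amazon's page layout.

```python
import re

# Hypothetical (pre-filler, filler, post-filler) regexes per slot.
# The HTML here is invented for illustration only.
SLOT_PATTERNS = {
    "title": (r'<h1 class="title">', r".+?", r"</h1>"),
    "authors": (r'<span class="author">', r".+?", r"</span>"),
    "subjects": (r'<li class="subject">', r".+?", r"</li>"),
}


def extract_slots(page, slot_patterns=SLOT_PATTERNS):
    """Return {slot: [filler, ...]} with every filler found for each slot."""
    fillers = {}
    for slot, (pre, filler, post) in slot_patterns.items():
        # Only the filler is captured; pre- and post-filler supply context.
        pattern = re.compile(f"{pre}({filler}){post}", re.DOTALL)
        fillers[slot] = [match.strip() for match in pattern.findall(page)]
    return fillers
```

A slot may receive several fillers (for example, one per subject term), which is why each slot maps to a list. Fixed patterns like these are only workable because the pages are regular; the text notes that learned extraction rules may be needed elsewhere.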
Some minor additions include the removal of a small list of stop-words, the preprocessing of author names into unique tokens of the form first-initial last-name, and the grouping of the words associated with synopses, published reviews, and customer comments all into one bag called "words".

2.2 Learning a Profile

Next, the user selects and rates a set of training books. By searching for particular authors or titles, the user can avoid scanning the entire database or picking selections at random. The user is asked to provide a discrete 1-10 rating for each selected title. The inductive learner currently employed by Libra is a bag-of-words naive Bayesian text classifier [23] extended to handle a vector of bags rather than a single bag. Recent experimental results [11, 21] indicate that this relatively simple approach to text categorization performs as well as or better than many competing methods. Libra does not attempt to predict the exact numerical rating of a title, but rather just a total ordering (ranking) of titles in order of preference. This task is then recast as a probabilistic binary categorization problem of predicting the probability...
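The excerpt breaks off before giving the details of the classifier, so the following is only a sketch of one way a binary naive Bayes learner could be extended to a vector of bags. Everything specific is an assumption: the class name `SlotNaiveBayes`, Laplace smoothing with constant `alpha`, a separate word distribution per slot, and boolean liked/disliked labels supplied by the caller (how the 1-10 ratings map to two categories is not visible here).

```python
import math
from collections import Counter, defaultdict


class SlotNaiveBayes:
    """Binary naive Bayes over books given as {slot: Counter(token -> count)}."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # assumed Laplace smoothing constant
        self.class_counts = Counter()
        # counts[label][slot] holds token frequencies for that class and slot.
        self.counts = {True: defaultdict(Counter), False: defaultdict(Counter)}
        self.vocab = defaultdict(set)  # one vocabulary per slot

    def fit(self, books, labels):
        for bags, label in zip(books, labels):
            self.class_counts[label] += 1
            for slot, bag in bags.items():
                self.counts[label][slot].update(bag)
                self.vocab[slot].update(bag)
        return self

    def _log_likelihood(self, bags, label):
        total = 0.0
        for slot, bag in bags.items():
            slot_counts = self.counts[label][slot]
            denom = sum(slot_counts.values()) + self.alpha * len(self.vocab[slot])
            for token, n in bag.items():
                if token in self.vocab[slot]:  # skip tokens unseen in training
                    total += n * math.log((slot_counts[token] + self.alpha) / denom)
        return total

    def log_odds(self, bags):
        """Log-odds that a book is liked; a higher value ranks earlier."""
        n = sum(self.class_counts.values())
        score = 0.0
        for label, sign in ((True, 1), (False, -1)):
            prior = (self.class_counts[label] + self.alpha) / (n + 2 * self.alpha)
            score += sign * (math.log(prior) + self._log_likelihood(bags, label))
        return score
```

Because only an ordering of titles is needed, candidates can be sorted by this score, for example `sorted(candidates, key=model.log_odds, reverse=True)`, without converting it to a numerical rating.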