10.1.1.55.6337 - RC 20338 (04/10/97) Digital Libraries IBM...


RC 20338 (04/10/97)
Digital Libraries

IBM Research Report

Extracting Names from Natural-Language Text

Yael Ravin
T. J. Watson Research Center
yael@watson.ibm.com

Nina Wacholder
CRIA, Columbia University
nina@cs.columbia.edu

IBM Research Division
Almaden . T.J. Watson . Tokyo . Zurich

NOTICE
This report will be distributed outside of IBM up to one year after the IBM publication date.

Extracting Names from Natural-Language Text

Yael Ravin
Nina Wacholder
IBM Research
T. J. Watson Research Center
Yorktown Heights, NY 10598
yael@watson.ibm.com
nina@watson.ibm.com

Abstract: We describe Nominator, a module we developed to extract proper names from natural language text, which is currently being integrated into IBM products and services. Using fast and robust heuristics, Nominator locates names in text, determines what type of entity they refer to -- such as person, place or organization -- and groups together all the variant names that refer to the same entity. For example, "President Clinton", "Mr. Clinton" and "Bill Clinton" are grouped as referring to the same person. Each group is assigned a "canonical name" (e.g., "Bill Clinton") to distinguish it from other groups referring to other entities ("Clinton, New Jersey"). Nominator produces a dictionary, or database, of names associated with a collection of documents.

TABLE OF CONTENTS

Introduction . . . 1
Section 1: Some principles and assumptions . . . 3
Section 2: The name extraction procedure . . . 4
Section 3: Forming a candidate name list . . . 7
Section 4: Splitting sequences into smaller names . . . 8
Section 5: Grouping names in equivalent classes . . . 10
Section 6: Aggregation of classes across documents . . . 13
Section 7: Future work . . . 15
Acknowledgements . . . 16
References . . . 17
Appendix A: Evaluation of Nominator . . . 19
  The process of evaluation . . . 19
  Recall . . .
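
To illustrate the grouping behaviour the abstract describes, here is a minimal Python sketch. It is not Nominator's implementation: the title list, the rule of grouping person names by their last token, the assumption that entity types are already assigned, and the names `TITLES`, `strip_titles` and `group_names` are all hypothetical choices made for this example.

```python
from collections import defaultdict

# Hypothetical list of titles to ignore; not taken from the report.
TITLES = {"President", "Mr.", "Mrs.", "Ms.", "Dr."}


def strip_titles(name):
    """Return the tokens of a name with any leading titles removed."""
    tokens = name.split()
    while tokens and tokens[0] in TITLES:
        tokens = tokens[1:]
    return tokens


def group_names(typed_names):
    """Group (name, entity_type) pairs into a dictionary keyed by canonical name.

    Assumed toy rule: person names sharing the same last token are treated
    as variants of one entity; every other type is grouped by exact string.
    """
    groups = defaultdict(list)
    for name, entity_type in typed_names:
        if entity_type == "person":
            tokens = strip_titles(name)
            key = tokens[-1] if tokens else name
        else:
            key = name
        groups[(entity_type, key)].append(name)

    dictionary = {}
    for (entity_type, _key), variants in groups.items():
        if entity_type == "person":
            # Pick the variant with the most non-title tokens as canonical.
            best = max(variants, key=lambda v: len(strip_titles(v)))
            canonical = " ".join(strip_titles(best)) or best
        else:
            canonical = variants[0]
        dictionary[canonical] = (entity_type, variants)
    return dictionary


mentions = [
    ("President Clinton", "person"),
    ("Mr. Clinton", "person"),
    ("Bill Clinton", "person"),
    ("Clinton, New Jersey", "place"),
]
for canonical, (entity_type, variants) in group_names(mentions).items():
    print(canonical, entity_type, variants)
```

With the abstract's own example, the three person variants fall under the canonical name "Bill Clinton", while "Clinton, New Jersey" stays in a separate place group. A real system would need far more than a shared last token to decide that two names co-refer.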