10.1.1.55.6337 - RC 20338 Digital Libraries IBM Research...


RC 20338 (04/10/97) Digital Libraries

IBM Research Report

Extracting Names from Natural-Language Text

Yael Ravin
T. J. Watson Research Center
[email protected]

Nina Wacholder
CRIA, Columbia University
[email protected]

IBM Research Division
T. J. Watson Research Center
Yorktown Heights, NY 10598

IBM Research Division: Almaden . T.J. Watson . Tokyo . Zurich

NOTICE: This report will be distributed outside of IBM up to one year after the IBM publication date.

Abstract: We describe Nominator, a module we developed to extract proper names from natural language text, which is currently being integrated into IBM products and services. Using fast and robust heuristics, Nominator locates names in text, determines what type of entity they refer to -- such as person, place or organization -- and groups together all the variant names that refer to the same entity. For example, "President Clinton", "Mr. Clinton" and "Bill Clinton" are grouped as referring to the same person. Each group is assigned a "canonical name" (e.g., "Bill Clinton") to distinguish it from other groups referring to other entities ("Clinton, New Jersey"). Nominator produces a dictionary, or database, of names associated with a collection of documents.

TABLE OF CONTENTS

Introduction
Section 1: Some principles and assumptions
Section 2: The name extraction procedure
Section 3: Forming a candidate name list
Section 4: Splitting sequences into smaller names
Section 5: Grouping names in equivalent classes
Section 6: Aggregation of classes across documents
Section 7: Future work
Acknowledgements
References
Appendix A: Evaluation of Nominator
    The process of evaluation
    Recall
    ...
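The grouping of variant names described in the abstract can be illustrated with a small sketch. This is not Nominator's actual procedure, whose heuristics are not shown in this preview. It is a toy approximation that assumes a hypothetical title list (`TITLES`), treats the last token of a name as the surname, keeps any name containing a comma as its own entity, and picks the longest title-free variant as the canonical name. The function names are illustrative only.

```python
from collections import defaultdict

# Hypothetical title list (an assumption; not taken from the report).
TITLES = {"President", "Mr.", "Mrs.", "Ms.", "Dr."}


def strip_titles(name):
    """Return the name's tokens with leading titles such as "Mr." removed."""
    tokens = name.split()
    while len(tokens) > 1 and tokens[0] in TITLES:
        tokens = tokens[1:]
    return tokens


def group_variants(names):
    """Group name variants under a shared key.

    Assumption: names containing a comma (e.g. "Clinton, New Jersey") are
    kept as separate entities; all other names are keyed by their final
    token, which is assumed to be a surname.
    """
    groups = defaultdict(list)
    for name in names:
        if "," in name:
            groups[name].append(name)
        else:
            groups[strip_titles(name)[-1]].append(name)
    return dict(groups)


def canonical_name(variants):
    """Pick the variant with the most tokens after title removal."""
    return " ".join(max((strip_titles(v) for v in variants), key=len))


names = ["President Clinton", "Mr. Clinton", "Bill Clinton", "Clinton, New Jersey"]
groups = group_variants(names)
for key, variants in groups.items():
    print(canonical_name(variants), "<-", variants)
```

With the abstract's own examples, the three person variants fall into one group whose canonical name is "Bill Clinton", while "Clinton, New Jersey" stays in a group of its own. A real system would need far richer evidence than a shared final token, since two different people can share a surname.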