COMPUTER PROCESSING OF ORIENTAL LANGUAGES, VOL. 11, NO. 4, 1998

Named Entity Extraction for Information Retrieval

HSIN-HSI CHEN, YUNG-WEI DING, and SHIH-CHUNG TSAI

Abstract: Name extraction is indispensable for both natural language understanding and information retrieval. However, proper names account for a major share of the unknown words in natural language texts, and unknown word identification is still a challenging problem in natural language processing. This paper deals with the identification of person names, organization names and location names in Chinese texts. Different types of information from different levels of text are employed, including character conditions, statistical information, titles, punctuation marks, organization and location keywords, speech-act and locative verbs, a cache, and an n-gram model. We also clarify which strategies can be used in which cases, i.e., queries and/or documents. In our experiments on MET data, the (recall, precision) rates for the extraction of person names, organization names, and location names are (87.33%, 82.33%), (76.67%, 79.33%) and (77.00%, 82.00%), respectively.

Keywords: Chinese language processing, Information retrieval, N-gram model, Named entity extraction, Word segmentation.

1. Introduction

People, affairs, time, places and things are five basic entities in a document. Once we capture these fundamental entities, we can understand a document to some degree. These entities are also the targets that users are interested in. That is, users often issue queries to retrieve such entities in information retrieval systems. Thompson and Dozier reported an experiment conducted over periods of several days in 1995. It showed that 67.8%, 83.4%, and 38.8% of queries to the Wall Street Journal, Los Angeles Times, and Washington Post, respectively, involved name searching. Besides name searching, name identification has many applications.
Chen and Wu consider person names as one of the cues in sentence alignment. Chen and Lee show its application to anaphora resolution. Chen and Bian propose a method to construct white pages for Internet/Intranet users automatically. They extract information from World Wide Web documents, including proper nouns, e-mail addresses and home page URLs, and find the relationships among these data.

Name extraction is indispensable for both natural language understanding and information retrieval. However, proper names account for a major share of the unknown words in natural language texts. Chen, He and Xu examined the TREC-5 Chinese collection and found 287 university and college names and 627 company names. Only 21 of the 287 university and college names and 14 of the 627 company names (roughly 7.3% and 2.2%, respectively) are included in their dictionary. Unknown word identification is a challenging problem in natural language processing, and many papers [6-8] touch on it. In the well-known message understanding system evaluation and Message Understanding Conference (MUC), which is sponsored by the Tipster Text Program of DARPA, named entity, which covers named orga-...