Proper Name Extraction from Non-Journalistic Texts

Thierry Poibeau* and Leila Kosseim**

* Laboratoire Central de Recherches, Thales/LCR, and Laboratoire d'Informatique de Paris-Nord, Institut Galilée, Université Paris-Nord. Thierry.Poibeau@lcr.thomson-csf.com
** RALI, Université de Montréal, firstname.lastname@example.org

May 30, 2001

Abstract

This paper discusses the influence of the corpus on the automatic identification of proper names in texts. Techniques developed for the news-wire genre are generally not sufficient to deal with larger corpora containing texts that do not follow strict writing constraints (for example, e-mail messages, transcriptions of oral conversations, etc.). After a brief review of the research performed on news texts, we present some of the problems involved in the analysis of two different corpora: e-mails and hand-transcribed telephone conversations. Having presented the sources of errors, we then describe an approach to adapting a proper name extraction system developed for newspaper texts to the analysis of e-mail messages.

Key-words: Proper Name Extraction, Information Extraction, Corpus Analysis

1 Introduction

The identification of proper nouns in written or oral documents is an important task in natural language processing. Proper names figure prominently in many corpora (newspapers, corporate documents, e-mails, etc.). It is therefore important to be able to identify these expressions, either for specific applications (e.g. to index documents by proper names or to build mailing lists) or for general research purposes (e.g. to improve the syntactic analysis of a text). Many research projects have addressed the issue of proper name identification in newspaper texts, in particular the Message Understanding Conferences (MUC) [1, 2, 3]. In these conferences, the first task is to identify named entities: proper names as well as temporal and numerical expressions.
This task is generally viewed as being generic, in the sense that all texts use such expressions and their identification seems a priori independent of the discourse domain or textual genre. However, the experiments performed within the MUC framework have all used homogeneous corpora consisting primarily of newspaper articles. This type of text respects strict writing guidelines, which facilitates the identification task. For example, sequences like Mr. or Ms. precede proper names rather systematically. Strategies that rely on such cues are insufficient to analyse other types of texts, such as electronic mail or minutes from a meeting, because their writing guidelines are either different or much less strict. With the explosion of documents in electronic format, it is precisely these types of documents that need to be processed automatically....
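To make the point about title cues concrete, the following is a minimal sketch of a trigger-based extractor, not the system described in this paper. The trigger list `TITLES`, the function name `extract_names`, and the capitalization rule are all assumptions introduced for illustration; only "Mr." and "Ms." come from the text above.

```python
import re

# Hypothetical trigger list: the text only names "Mr." and "Ms." as examples.
TITLES = ("Mr.", "Ms.", "Mrs.", "Dr.")

# A title followed by one or more capitalized tokens (an assumed, simplistic rule).
_PATTERN = re.compile(
    r"\b(?:%s)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)"
    % "|".join(re.escape(t) for t in TITLES)
)


def extract_names(text):
    """Return the capitalized sequences that directly follow a title trigger."""
    return [match.group(1) for match in _PATTERN.finditer(text)]


# Newspaper-style sentence: the title cue is present, so names are found.
print(extract_names("Mr. Smith met Ms. Jane Doe yesterday."))  # ['Smith', 'Jane Doe']

# Informal e-mail-style sentence: no title, no capitals, so nothing is found.
print(extract_names("hi john, can u call smith tmrw"))  # []
```

The second call illustrates why such a cue, reliable on edited news text, returns nothing on informal text where titles and capitalization are often missing.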