Proper Name Extraction from Non-Journalistic Texts

Thierry Poibeau * and Leila Kosseim **

* Laboratoire Central de Recherches, Thales/LCR, and Laboratoire d’Informatique de Paris-Nord, Institut Galilée, Université Paris-Nord. [email protected]
** RALI, Université de Montréal, [email protected]

May 30, 2001

Abstract

This paper discusses the influence of the corpus on the automatic identification of proper names in texts. Techniques developed for the news-wire genre are generally not sufficient to deal with larger corpora containing texts that do not follow strict writing constraints (for example, e-mail messages, transcriptions of oral conversations, etc.). After a brief review of the research performed on news texts, we present some of the problems involved in the analysis of two different corpora: e-mails and hand-transcribed telephone conversations. Once the sources of errors have been presented, we describe an approach to adapt a proper name extraction system developed for newspaper texts to the analysis of e-mail messages.

Key-words: Proper Name Extraction, Information Extraction, Corpus Analysis

1 Introduction

The identification of proper nouns in written or oral documents is an important task in natural language processing. This type of expression holds an important place in many corpora (newspapers, corporate documents, e-mails ...). It is therefore important to be able to identify these expressions either for specific applications (e.g. to index documents by proper names or to build mailing lists) or for general research purposes (e.g. to improve the syntactic analysis of a text). Many research projects have addressed the issue of proper name identification in newspaper texts; in particular, the Message Understanding Conferences (MUC) [1, 2, 3]. In these conferences, the first task to achieve is to identify named entities: proper names and also temporal and numerical expressions.
This task is generally viewed as being generic, in the sense that all texts use such expressions and their identification seems a priori independent of the discourse domain or textual genre. However, the experiments performed within the MUC framework have all used homogeneous corpora consisting primarily of newspaper articles. This type of text respects strict writing guidelines, which facilitates the identification task. For example, sequences like Mr. or Ms. precede proper names rather systematically. However, these strategies are insufficient to analyse other types of texts, such as electronic mail or minutes from a meeting, because writing guidelines are either different or much less strict. With the explosion of documents in electronic
format, it is precisely these types of documents that need to be processed automatically.
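
To make the point about title cues concrete, the following minimal Python sketch looks for capitalised words that follow a cue such as Mr. or Ms. It is an illustration only and not the system described in this paper: the cue list `TITLE_CUES`, the function name `find_names_after_titles`, and the sample sentences are all assumptions introduced here.

```python
import re

# Hypothetical cue list: the text only mentions "Mr." and "Ms." as examples.
TITLE_CUES = ("Mr.", "Ms.")

# Build an alternation of the escaped cues, e.g. "Mr\.|Ms\.".
_CUE = "|".join(re.escape(cue) for cue in TITLE_CUES)

# A cue, some whitespace, then one or more capitalised words.
_PATTERN = re.compile(rf"(?:{_CUE})\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)")


def find_names_after_titles(text):
    """Return the capitalised word sequences that directly follow a title cue."""
    return [match.group(1) for match in _PATTERN.finditer(text)]


if __name__ == "__main__":
    # Invented examples: one in a news-like style, one in an informal e-mail style.
    news_like = "Mr. John Smith met Ms. Kosseim in Paris."
    email_like = "hi john, can you call smith tomorrow"
    print(find_names_after_titles(news_like))
    print(find_names_after_titles(email_like))
```

On the news-like sentence the cue finds both names, whereas the informal sentence, which has neither titles nor capitalisation, yields nothing. This is the kind of gap that motivates adapting the extraction strategy to the corpus.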