10.1.1.64.2109 - Tokenization and Proper Noun Recognition...


Tokenization and Proper Noun Recognition for Information Retrieval

Fco. Mario Barcala, Jesús Vilares, Miguel A. Alonso, Jorge Graña, Manuel Vilares
Departamento de Computación, Universidade da Coruña
Campus de Elviña s/n, 15071 La Coruña, Spain
{barcala,jvilares}@mail2.udc.es, {alonso,grana,vilares}@udc.es

Abstract

In this paper we consider a set of natural language processing techniques that can be used to analyze large amounts of text, focusing on an advanced tokenizer that accounts for a number of complex linguistic phenomena, as well as for pre-tagging tasks such as proper noun recognition. We also show the results of several experiments performed in order to study the impact of the strategy chosen for the recognition of proper nouns.

1 Introduction

In recent years, there has been a considerable amount of interest in using Natural Language Processing (NLP) in Information Retrieval (IR) research, with specific implementations ranging from word-level morphological analysis through syntactic parsing to conceptual-level semantic analysis.

In this paper we consider the use of a set of NLP techniques suitable for dealing with large amounts of text. We propose the following sequence of finite-state based processes, each of them corresponding to the recognition of intuitive linguistic elements that reflect important universals about language:

• A preprocessor that identifies the individual words, proper nouns and idioms forming each sentence.
• A tagger that assigns a syntactic category to each word, in order to identify those words carrying the semantics of the sentence: nouns, adjectives and verbs.
• A morphological families generator that groups related words belonging to different categories (e.g. the noun corresponding to the action of a verb).
• A shallow parser that extracts the basic syntactic structures relating words within a sentence, such as the noun-modifier, subject-verb or verb-object relations.
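The four-stage sequence listed above can be pictured as a chain of functions, each consuming the output of the previous one. The following is only a minimal sketch under strong assumptions: the lexicon, the family table and every function name are hypothetical stand-ins, plain dictionary lookups replace the finite-state machinery, and proper nouns and idioms are ignored.

```python
# Toy resources: hypothetical stand-ins for real linguistic lexicons.
LEXICON = {"the": "DET", "tall": "ADJ", "builder": "NOUN",
           "builds": "VERB", "houses": "NOUN"}
FAMILIES = {"builder": "build", "builds": "build", "houses": "house"}
CONTENT_CATEGORIES = {"NOUN", "ADJ", "VERB"}


def preprocess(sentence):
    """Stage 1 (simplified): lowercase, drop a final period, split on whitespace."""
    return sentence.lower().rstrip(".").split()


def tag(tokens):
    """Stage 2 (simplified): assign a category to each token by lexicon lookup."""
    return [(tok, LEXICON.get(tok, "UNK")) for tok in tokens]


def content_words(tagged):
    """Keep only nouns, adjectives and verbs."""
    return [(tok, cat) for tok, cat in tagged if cat in CONTENT_CATEGORIES]


def families(tagged):
    """Stage 3 (simplified): replace each word by its family representative."""
    return [(FAMILIES.get(tok, tok), cat) for tok, cat in tagged]


def shallow_parse(tagged):
    """Stage 4 (simplified): read relations off adjacent category pairs."""
    relations = []
    for (w1, c1), (w2, c2) in zip(tagged, tagged[1:]):
        if (c1, c2) == ("ADJ", "NOUN"):
            relations.append(("noun-modifier", w2, w1))
        elif (c1, c2) == ("NOUN", "VERB"):
            relations.append(("subject-verb", w1, w2))
        elif (c1, c2) == ("VERB", "NOUN"):
            relations.append(("verb-object", w1, w2))
    return relations


def index_terms(sentence):
    """Run the four stages in order; return the terms and their relations."""
    terms = families(content_words(tag(preprocess(sentence))))
    return terms, shallow_parse(terms)
```

Calling `index_terms("The tall builder builds houses.")` runs all four stages on one sentence. The point of the sketch is the data flow between stages, not the linguistic quality of any single step.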
This paper focuses on the description of the preprocessor module, with emphasis on proper noun recognition tasks. Although our scheme is oriented towards the indexing of Spanish texts, it is also a proposal for a general architecture that can be applied to other languages with very slight modifications. To facilitate comprehension, English examples are used when possible.

2 The preprocessor

Most current systems assume that input texts are already tokenized, i.e. correctly segmented into tokens or high-level information units that identify every individual component of the texts. This working hypothesis is not realistic due to the heterogeneous nature of the application texts and their sources. For this reason, we have developed a preprocessor module, an advanced tokenizer which accounts for a number of complex linguistic phenomena, as well as for pre-tagging tasks. The architecture of the preprocessor is shown in Fig. 1, consisting of the following submodules...
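To make the idea of a tokenizer that also performs a pre-tagging task concrete, here is a small sketch. It is not the preprocessor described in the paper: it swaps in a common capitalization heuristic (merge runs of capitalized words that do not open a sentence), and the names `tokenize` and `group_proper_nouns` are assumptions made for illustration.

```python
import re

# A word is a run of word characters; any other non-space character
# (punctuation) becomes a token of its own.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
SENTENCE_END = {".", "!", "?"}


def tokenize(text):
    """Split raw text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def group_proper_nouns(tokens):
    """Merge runs of capitalized tokens into single proper-noun tokens.

    Heuristic assumption: a capitalized token that opens a sentence is
    left alone, because its capital letter carries no information.
    """
    result = []
    run = []  # capitalized tokens collected so far
    sentence_start = True
    for tok in tokens:
        if tok[0].isupper() and not sentence_start:
            run.append(tok)
        else:
            if run:
                result.append(" ".join(run))
                run = []
            result.append(tok)
        sentence_start = tok in SENTENCE_END
    if run:
        result.append(" ".join(run))
    return result
```

For an input such as `"Jorge visited Manuel Vilares yesterday."`, the two capitalized words in mid-sentence are merged into one token, while the sentence-initial `Jorge` stays an ordinary token. That limitation shows why a strategy for proper noun recognition needs more than capitalization alone.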

This note was uploaded on 09/21/2009 for the course CS 580 taught by Professor Fdfdf during the Spring '09 term at University of Toronto.
