Bootstrapping Named Entity Recognition with Automatically Generated Gazetteer Lists

Zornitsa Kozareva
Dept. de Lenguajes y Sistemas Informaticos
University of Alicante
Alicante, Spain
firstname.lastname@example.org

Abstract

Current Named Entity Recognition systems suffer from a lack of hand-tagged data and from degraded performance when moved to another domain. This paper explores two aspects: the automatic generation of gazetteer lists from unlabeled data, and the building of a Named Entity Recognition system with labeled and unlabeled data.

1 Introduction

Automatic information extraction and information retrieval concerning a particular person, location, organization, or title of a movie or book are closely tied to the Named Entity Recognition (NER) task. NER consists of detecting the most salient and informative elements in a text, such as names of people, company names, locations, monetary currencies, and dates. Early NER systems (Fisher et al., 1997), (Black et al., 1998), among others, which participated in the Message Understanding Conferences (MUC), used linguistic tools and gazetteer lists. However, these are difficult to develop and domain sensitive.

To surmount these obstacles, the application of machine learning approaches to NER became a research subject. Various state-of-the-art machine learning algorithms, such as Maximum Entropy (Borthwick, 1999), AdaBoost (Carreras et al., 2002), Hidden Markov Models (Bikel et al., ), and memory-based learning (Tjong Kim Sang, 2002b), have been used¹. (Klein et al., 2003), (Mayfield et al., 2003), (Wu et al., 2003), and (Kozareva et al., 2005c), among others, combined several classifiers to obtain a better named entity coverage rate.

¹ For other machine learning methods, consult http://www.cnts.ua.ac.be/conll2002/ner/ and http://www.cnts.ua.ac.be/conll2003/ner/

Nevertheless, all these machine learning algorithms rely on previously hand-labeled training data.
Obtaining such data is labor-intensive and time-consuming, and it may not even be available for languages with limited funding. Resource limitations directed NER research (Collins and Singer, 1999), (Carreras et al., 2003), (Kozareva et al., 2005a) toward the use of semi-supervised techniques. These techniques are needed because we live in a multilingual society, and access to information from various language sources is a reality. The development of NER systems for languages other than English has commenced.

This paper presents the development of a Spanish Named Entity Recognition system based on a machine learning approach. No morphological or syntactic information was used for it. However, we propose and incorporate a very simple method for automatic gazetteer² construction. Such a method can be easily adapted to other languages and is inexpensive to obtain, as it relies on n-gram extraction from unlabeled data. We compare the performance of our NER system when labeled and unlabeled training data is present....
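
The idea of building a gazetteer by n-gram extraction from unlabeled data can be illustrated with a rough sketch. The text above does not give the actual extraction procedure, so everything below is an assumption made for illustration: the helper name `candidate_gazetteer`, the capitalization heuristic, and the `max_n` and `min_count` thresholds are hypothetical and are not taken from the paper.

```python
from collections import Counter
import re


def candidate_gazetteer(sentences, max_n=3, min_count=2):
    """Collect frequent runs of capitalized tokens from unlabeled text.

    Hypothetical helper: the capitalization heuristic and both thresholds
    are illustrative assumptions, not the paper's procedure.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence)
        # Skip the sentence-initial token, which is capitalized regardless.
        for start in range(1, len(tokens)):
            for n in range(1, max_n + 1):
                ngram = tokens[start:start + n]
                if len(ngram) < n:
                    break  # ran past the end of the sentence
                if not all(tok[0].isupper() for tok in ngram):
                    break  # a lowercase token ends the capitalized run
                counts[" ".join(ngram)] += 1
    # Keep only n-grams seen at least min_count times.
    return {entry for entry, seen in counts.items() if seen >= min_count}


unlabeled = [
    "El lunes llegó Juan Carlos a Madrid.",
    "Ayer Juan Carlos visitó Madrid.",
    "Hoy llueve en Alicante.",
]
gazetteer = candidate_gazetteer(unlabeled)
```

With these three sentences, `gazetteer` keeps the n-grams that occur at least twice and drops the single occurrence of "Alicante". A real system would need further filtering, and the resulting list carries no entity-type labels on its own.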