0010026v1 - Enriching very large ontologies using the WWW...

arXiv:cs.CL/0010026 17 Oct 2000

Enriching very large ontologies using the WWW

Eneko Agirre¹, Olatz Ansa¹, Eduard Hovy² and David Martínez¹

Abstract. This paper explores the possibility of exploiting text on the world wide web in order to enrich the concepts in existing ontologies. First, a method to retrieve documents from the WWW related to a concept is described. These document collections are used 1) to construct topic signatures (lists of topically related words) for each concept in WordNet, and 2) to build hierarchical clusters of the concepts (the word senses) that lexicalize a given word. The overall goal is to overcome two shortcomings of WordNet: the lack of topical links among concepts, and the proliferation of senses. Topic signatures are validated on a word sense disambiguation task with good results, which are improved when the hierarchical clusters are used.

1 INTRODUCTION

Knowledge acquisition is a long-standing problem in both Artificial Intelligence and Computational Linguistics. The acquisition of semantic and world knowledge poses a problem with no simple answer. Huge efforts and investments have been made to build repositories of such knowledge (which we shall call ontologies for simplicity), e.g. CYC [1], EDR [2] and WordNet [3], but with unclear results. WordNet, for instance, has been criticized for its lack of relations between topically related concepts and for the proliferation of word senses.

As an alternative to entirely hand-made repositories, automatic or semi-automatic means have been proposed over the last 30 years. On the one hand, shallow techniques are used to enrich existing ontologies [4] or to induce hierarchies [5], usually by analyzing large corpora of texts. On the other hand, deep natural language processing is called for to acquire knowledge from more specialized texts (dictionaries, encyclopedias or domain-specific texts) [6][7].
These research lines are complementary: deep understanding would provide specific relations among concepts, whereas shallow techniques could provide generic knowledge about the concepts.

This paper explores the possibility of exploiting text on the world wide web in order to enrich WordNet. The first step consists of linking each concept in WordNet to relevant document collections on the web, which are further processed to overcome some of WordNet's shortcomings.

On the one hand, concepts are linked to topically related words. Topically related words form the topic signature for each concept in the hierarchy. As in [8][9] we define a topic signature as a family of related terms $\{ t, \langle (w_1, s_1) \ldots (w_i, s_i) \ldots \rangle \}$, where $t$ is the topic (i.e. the target concept) and each $w_i$ is a word associated with

¹ IxA NLP group. University of the Basque Country. 649 pk. 20.080 Donostia. Spain. Email: [email protected], [email protected] [email protected]
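The topic signature defined above can be pictured as a small data structure: a topic plus a list of (word, value) pairs. The following is only a minimal sketch, not the authors' implementation. The text available here breaks off before it says what each $s_i$ stands for, so the sketch assumes it is a numeric weight attached to $w_i$. The names `TopicSignature` and `top_words`, the topic label, and the example words and numbers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TopicSignature:
    """A topic t together with its pairs (w_i, s_i).

    Assumption: s_i is treated as a numeric weight for the word w_i.
    """
    topic: str
    terms: List[Tuple[str, float]] = field(default_factory=list)

    def top_words(self, n: int) -> List[str]:
        """Return the n words with the largest s_i, highest first."""
        ranked = sorted(self.terms, key=lambda pair: pair[1], reverse=True)
        return [word for word, _ in ranked[:n]]


# Hypothetical example; the words and values are made up for illustration.
signature = TopicSignature(
    topic="boy (sense 1)",
    terms=[("child", 0.9), ("school", 0.4), ("male", 0.7)],
)
print(signature.top_words(2))
```

Keeping the pairs in a plain list mirrors the ordered sequence in the definition; sorting is done on demand so the stored signature is left unchanged.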

