10.1.1.26.6659 - An Unsupervised Iterative Method for...

+ Jing-Shin Chang and +* Keh-Yih Su
+ shin@hermes.ee.nthu.edu.tw, +* kysu@bdc.com.tw
+ Department of Electrical Engineering, National Tsing-Hua University, Hsinchu, Taiwan 30043, ROC
* Behavior Design Corporation, 2F, No.5, Industrial East Road IV, Science-Based Industrial Park, Hsinchu, Taiwan 30077, ROC

ABSTRACT

An unsupervised iterative approach for extracting a new lexicon (or unknown words) from a Chinese text corpus is proposed in this paper. Instead of using a non-iterative segmentation-merging-filtering-and-disambiguation approach, the proposed method iteratively integrates the contextual constraints (among word candidates) and a joint character association metric to progressively improve the segmentation results of the input corpus (and thus the new word list). An augmented dictionary, which includes potential unknown words (in addition to known words), is used to segment the input corpus, unlike traditional approaches which use only known words for segmentation. In the segmentation process, the augmented dictionary is used to impose contextual constraints over known words and potential unknown words within input sentences; an unsupervised Viterbi Training process is then applied to ensure that the selected potential unknown words (and known words) maximize the likelihood of the input corpus. On the other hand, the joint character association metric (which reflects the global character association characteristics across the corpus) is derived by integrating several commonly used word association metrics, such as mutual information and entropy, with a joint Gaussian mixture density function; such integration allows the filter to use multiple features simultaneously to evaluate character association, unlike traditional filters which apply multiple features independently.
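As a rough illustration of the segmentation step described above, the following Python sketch finds the most likely segmentation of one sentence by Viterbi search over an augmented dictionary. It is a minimal sketch under several assumptions that are not taken from the paper: a unigram word model, a hypothetical function name `viterbi_segment`, made-up probabilities, a hypothetical fallback probability `unk_prob` for single characters, and Latin letters standing in for Chinese characters. It shows only the decoding step; the re-estimation of word probabilities that Viterbi Training performs between iterations is not shown.

```python
import math


def viterbi_segment(sentence, word_probs, max_len=4, unk_prob=1e-8):
    """Return the most likely word sequence for `sentence`.

    Assumes a unigram word model: the score of a segmentation is the sum
    of log-probabilities of its words. `word_probs` plays the role of the
    augmented dictionary (known words plus candidate unknown words).
    """
    n = len(sentence)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of sentence[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            if word in word_probs:
                p = word_probs[word]
            elif len(word) == 1:
                # Assumed fallback so that every sentence can be segmented.
                p = unk_prob
            else:
                continue
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start

    # Follow the back-pointers from the end of the sentence.
    words = []
    i = n
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return words[::-1]


# Toy dictionary of known words (probabilities are made up).
known = {"ab": 0.4, "cd": 0.3, "a": 0.1, "b": 0.1, "c": 0.05, "d": 0.05}
print(viterbi_segment("abcd", known))

# Augmented with a candidate unknown word "abcd".
augmented = dict(known, abcd=0.2)
print(viterbi_segment("abcd", augmented))
```

With only the known words, the toy sentence is split into two words; once the candidate `abcd` is added with a high enough probability, the search prefers it as a single word. This is the sense in which candidate unknown words compete with known words during segmentation.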
The proposed method then allows the contextual constraints and the joint character association metric to enhance each other; this is achieved by iteratively applying the joint association metric to truncate unlikely unknown words in the augmented dictionary and using the segmentation result to improve the estimation of the joint association metric. The refined augmented dictionary and improved estimation are then used in the next iteration to acquire better segmentation and carry out more reliable filtering. Experiments show that both the precision and recall rates are improved almost monotonically, in contrast to non-iterative segmentation-merging-filtering-and-disambiguation approaches, which often sacrifice precision for recall or vice versa. With a corpus of 311,591 sentences, the performance is 76% (bigram), 54% (trigram), and 70% (quadragram) in F-measure, which is significantly better than using the non-iterative approach with F-measures of 74% (bigram), 46% (trigram), and 58% (quadragram).

Keywords: Unknown Word Identification, New Lexicon Extraction, Unsupervised Method,

