Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora

Doctoral === National Tsing Hua University === Department of Electrical Engineering === 85 === Automatic lexicon acquisition from large text corpora is surveyed in this dissertation, with special emphasis on optimization techniques for maximizing the joint precision-recall performance. Both English compound wor...


Bibliographic Details
Main Authors: Chang, Jing-Shin, 張景新
Other Authors: Keh-Yih Su
Format: Others
Language: zh-TW
Published: 1997
Online Access: http://ndltd.ncl.edu.tw/handle/62275166919316022114
id ndltd-TW-085NTHU0442122
record_format oai_dc
spelling ndltd-TW-085NTHU0442122 2015-10-13T18:05:33Z http://ndltd.ncl.edu.tw/handle/62275166919316022114 Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora 詞彙自動抽取及最佳化技術之研究 Chang, Jing-Shin 張景新 Doctoral, National Tsing Hua University, Department of Electrical Engineering, 85 Keh-Yih Su 蘇克毅 1997 學位論文 ; thesis 113 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Doctoral === National Tsing Hua University === Department of Electrical Engineering === 85 === Automatic lexicon acquisition from large text corpora is surveyed in this dissertation, with special emphasis on optimization techniques for maximizing the joint precision-recall performance. Both English compound word extraction and Chinese unknown word identification are studied in order to explore precision-recall optimization techniques in different languages of different complexity using different available resources. In the English compound word extraction task, the simplest system architecture is studied, which assumes that lexicon extraction is conducted using a classifier (or a filter) based on a set of multiple association features. Under such circumstances, a two-stage optimization scheme is proposed, in which the first stage aims at minimizing classification error and the second stage focuses on maximizing joint precision-recall, starting from the minimum-error status. To achieve the minimum error rate, various approaches are used to improve the error rate performance of the classifier. In addition, a non-linear learning algorithm is developed for achieving maximum precision-recall performance in terms of a user-specified objective function of precision and recall. In the Chinese unknown word extraction task, where contextual information as well as word association metrics are used, an iterative approach is proposed that improves both precision and recall simultaneously. For the English compound word extraction task, the weighted precision and recall (WPR) using the proposed approach can reach about 88% for bigram compounds and 88% for trigram compounds, with a training (testing) corpus of 20,715 (2,301) sentences sampled from technical manuals of cars. The F-measure performances are about 84% for bigrams and 86% for trigrams. By applying the proposed optimization method, the precision and recall profile is observed to follow the preferred criteria of different lexicographers. For the Chinese unknown word identification task, experimental results show that both precision and recall rates improve almost monotonically, in contrast to non-iterative segmentation-merging-filtering-and-disambiguation approaches, which often sacrifice precision for recall or vice versa. With a corpus of 311,591 sentences, the performance is 76% (bigram), 54% (trigram), and 70% (quadgram) in F-measure, which is significantly better than the non-iterative approach, with F-measures of 74% (bigram), 46% (trigram), and 58% (quadgram).
author2 Keh-Yih Su
author_facet Keh-Yih Su
Chang, Jing-Shin
張景新
author Chang, Jing-Shin
張景新
spellingShingle Chang, Jing-Shin
張景新
Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora
author_sort Chang, Jing-Shin
title Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora
title_short Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora
title_full Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora
title_fullStr Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora
title_full_unstemmed Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora
title_sort automatic lexicon acquisition and precision-recall maximization for untagged text corpora
publishDate 1997
url http://ndltd.ncl.edu.tw/handle/62275166919316022114
work_keys_str_mv AT changjingshin automaticlexiconacquisitionandprecisionrecallmaximizationforuntaggedtextcorpora
AT zhāngjǐngxīn automaticlexiconacquisitionandprecisionrecallmaximizationforuntaggedtextcorpora
AT changjingshin cíhuìzìdòngchōuqǔjízuìjiāhuàjìshùzhīyánjiū
AT zhāngjǐngxīn cíhuìzìdòngchōuqǔjízuìjiāhuàjìshùzhīyánjiū
_version_ 1718028375811948544
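
The description above reports results as precision, recall, F-measure, and weighted precision and recall (WPR). The sketch below shows how such scores could be computed for a set of extracted lexicon candidates. It is only an illustration: the record does not define WPR, so the weighted sum `w_p * P + w_r * R` used here is an assumption, and the function names and the toy candidate lists are hypothetical rather than taken from the dissertation.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Precision and recall of extracted candidates against a reference lexicon."""
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-measure; beta = 1 gives the harmonic mean of P and R."""
    denom = beta ** 2 * precision + recall
    if denom == 0.0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


def weighted_pr(precision: float, recall: float, w_p: float = 0.5, w_r: float = 0.5) -> float:
    """ASSUMED form of WPR: a weighted sum of precision and recall.

    The weights stand in for a lexicographer's preference between
    precision and recall; the record itself gives no formula.
    """
    return w_p * precision + w_r * recall


# Toy data, invented for illustration only.
extracted = {"brake pedal", "fuel tank", "the engine", "spark plug"}
reference = {"brake pedal", "fuel tank", "spark plug", "air filter", "oil pump"}

p, r = precision_recall(extracted, reference)
print(f"precision={p:.3f} recall={r:.3f}")
print(f"F1={f_measure(p, r):.3f} WPR(0.5:0.5)={weighted_pr(p, r):.3f}")
```

Shifting weight toward `w_p` rewards a cleaner candidate list, while shifting it toward `w_r` rewards broader coverage, which is one way a user-specified objective over precision and recall could express different lexicographers' preferences.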