Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora

Doctoral === National Tsing Hua University === Department of Electrical Engineering === 85 === Automatic lexicon acquisition from large text corpora is surveyed in this dissertation, with special emphases on optimization techniques for maximizing the joint precision-recall performance. Both English compound wor...

Bibliographic Details
Main Authors:	Chang, Jing-Shin, 張景新
Other Authors:	Keh-Yih Su
Format:	Others
Language:	zh-TW
Published:	1997
Online Access:	http://ndltd.ncl.edu.tw/handle/62275166919316022114

id	ndltd-TW-085NTHU0442122
record_format	oai_dc
spelling	ndltd-TW-085NTHU0442122 2015-10-13T18:05:33Z http://ndltd.ncl.edu.tw/handle/62275166919316022114 Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora 詞彙自動抽取及最佳化技術之研究 Chang, Jing-Shin 張景新 Doctoral National Tsing Hua University Department of Electrical Engineering 85 Automatic lexicon acquisition from large text corpora is surveyed in this dissertation, with special emphases on optimization techniques for maximizing the joint precision-recall performance. Both English compound word extraction and Chinese unknown word identification tasks are studied in order to explore precision-recall optimization techniques in different languages of different complexity using different available resources. In the English compound word extraction task, the simplest system architecture, which assumes that the lexicon extraction task is conducted using a classifier (or a filter) based on a set of multiple association features, is studied. Under such circumstances, a two-stage optimization scheme is proposed, in which the first stage aims at minimizing classification error and the second stage focuses on maximizing joint precision-recall, starting from the minimum error status. To achieve minimum error rate, various approaches are used to improve the error rate performance of the classifier. In addition, a non-linear learning algorithm is developed for achieving maximum precision-recall performance in terms of a user-specified objective function of precision and recall. In the Chinese unknown word extraction task, where contextual information as well as word association metrics are used, an iterative approach, which allows us to improve both precision and recall simultaneously, is proposed to iteratively improve the precision and recall performance. For the English compound word extraction task, the weighted precision and recall (WPR) using the proposed approach can achieve as high as about 88% for bigram compounds, and 88% for trigram compounds for a training (testing) corpus of 20715 (2301) sentences sampled from technical manuals of cars. The F-measure performances are about 84% for bigrams and 86% for trigrams. By applying the proposed optimization method, the precision and recall profile is observed to follow the preferred criteria of different lexicographers. For the Chinese unknown word identification task, experiment results show that both precision and recall rates are improved almost monotonically, in contrast to non-iterative segmentation-merging-filtering-and-disambiguation approaches, which often sacrifice precision for recall or vice versa. With a corpus of 311,591 sentences, the performance is 76% (bigram), 54% (trigram), and 70% (quadgram) in F-measure, which is significantly better than using the non-iterative approach with F-measures of 74% (bigram), 46% (trigram), and 58% (quadgram). Keh-Yih Su 蘇克毅 1997 degree thesis ; thesis 113 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Doctoral === National Tsing Hua University === Department of Electrical Engineering === 85 === Automatic lexicon acquisition from large text corpora is surveyed in this dissertation, with special emphases on optimization techniques for maximizing the joint precision-recall performance. Both English compound word extraction and Chinese unknown word identification tasks are studied in order to explore precision-recall optimization techniques in different languages of different complexity using different available resources. In the English compound word extraction task, the simplest system architecture, which assumes that the lexicon extraction task is conducted using a classifier (or a filter) based on a set of multiple association features, is studied. Under such circumstances, a two-stage optimization scheme is proposed, in which the first stage aims at minimizing classification error and the second stage focuses on maximizing joint precision-recall, starting from the minimum error status. To achieve minimum error rate, various approaches are used to improve the error rate performance of the classifier. In addition, a non-linear learning algorithm is developed for achieving maximum precision-recall performance in terms of a user-specified objective function of precision and recall. In the Chinese unknown word extraction task, where contextual information as well as word association metrics are used, an iterative approach, which allows us to improve both precision and recall simultaneously, is proposed to iteratively improve the precision and recall performance. For the English compound word extraction task, the weighted precision and recall (WPR) using the proposed approach can achieve as high as about 88% for bigram compounds, and 88% for trigram compounds for a training (testing) corpus of 20715 (2301) sentences sampled from technical manuals of cars. The F-measure performances are about 84% for bigrams and 86% for trigrams. By applying the proposed optimization method, the precision and recall profile is observed to follow the preferred criteria of different lexicographers. For the Chinese unknown word identification task, experiment results show that both precision and recall rates are improved almost monotonically, in contrast to non-iterative segmentation-merging-filtering-and-disambiguation approaches, which often sacrifice precision for recall or vice versa. With a corpus of 311,591 sentences, the performance is 76% (bigram), 54% (trigram), and 70% (quadgram) in F-measure, which is significantly better than using the non-iterative approach with F-measures of 74% (bigram), 46% (trigram), and 58% (quadgram).
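The abstract reports results as F-measure and weighted precision and recall (WPR) and says the precision-recall profile can follow a lexicographer's preference. The following is a minimal Python sketch of how such figures could be computed for an extracted lexicon. Several things here are assumptions, not taken from the record: WPR is taken to be the convex combination w_p * P + (1 - w_p) * R, the function names and the toy candidate scores are hypothetical, and the plain threshold sweep is only a simple stand-in for illustration, not the non-linear learning algorithm of the dissertation.

```python
from typing import Callable, Dict, Set, Tuple


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Precision and recall of an extracted lexicon against a gold lexicon."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta measure; beta = 1 gives the harmonic mean of P and R."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def weighted_pr(precision: float, recall: float, w_p: float = 0.5) -> float:
    """ASSUMED definition of WPR: w_p * P + (1 - w_p) * R."""
    return w_p * precision + (1.0 - w_p) * recall


def best_threshold(
    scores: Dict[str, float],
    gold: Set[str],
    objective: Callable[[float, float], float],
) -> Tuple[float, float]:
    """Sweep every observed score as a cut-off and keep the one that
    maximizes the user-specified objective of precision and recall.
    A brute-force stand-in, not the method described in the abstract."""
    best_t, best_value = float("nan"), -1.0
    for t in sorted(set(scores.values())):
        predicted = {cand for cand, s in scores.items() if s >= t}
        p, r = precision_recall(predicted, gold)
        value = objective(p, r)
        if value > best_value:
            best_t, best_value = t, value
    return best_t, best_value


# Hypothetical candidate compounds with made-up association scores.
toy_scores = {"air bag": 0.9, "brake pad": 0.8, "the engine": 0.4, "fuel pump": 0.3}
toy_gold = {"air bag", "brake pad", "fuel pump"}

# A recall-tolerant user (F1) and a precision-minded user (WPR with w_p = 0.9)
# end up with different cut-offs on the same candidate list.
f1_choice = best_threshold(toy_scores, toy_gold, f_measure)
wpr_choice = best_threshold(toy_scores, toy_gold, lambda p, r: weighted_pr(p, r, 0.9))
```

On the toy data, the two objectives select different thresholds, which is the kind of preference-dependent precision-recall trade-off the abstract describes.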
author2	Keh-Yih Su
author_facet	Keh-Yih Su Chang, Jing-Shin 張景新
author	Chang, Jing-Shin 張景新
spellingShingle	Chang, Jing-Shin 張景新 Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora
author_sort	Chang, Jing-Shin
title	Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora
title_short	Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora
title_full	Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora
title_fullStr	Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora
title_full_unstemmed	Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora
title_sort	automatic lexicon acquisition and precision-recall maximization for untagged text corpora
publishDate	1997
url	http://ndltd.ncl.edu.tw/handle/62275166919316022114
work_keys_str_mv	AT changjingshin automaticlexiconacquisitionandprecisionrecallmaximizationforuntaggedtextcorpora AT zhāngjǐngxīn automaticlexiconacquisitionandprecisionrecallmaximizationforuntaggedtextcorpora AT changjingshin cíhuìzìdòngchōuqǔjízuìjiāhuàjìshùzhīyánjiū AT zhāngjǐngxīn cíhuìzìdòngchōuqǔjízuìjiāhuàjìshùzhīyánjiū
_version_	1718028375811948544

Similar Items