Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora
Doctoral === National Tsing Hua University === Department of Electrical Engineering === 85 === Automatic lexicon acquisition from large text corpora is surveyed in this dissertation, with special emphasis on optimization techniques for maximizing the joint precision-recall performance. Both English compound wor...
Main Authors: | Chang, Jing-Shin, 張景新 |
---|---|
Other Authors: | Keh-Yih Su |
Format: | Others |
Language: | zh-TW |
Published: |
1997
|
Online Access: | http://ndltd.ncl.edu.tw/handle/62275166919316022114 |
id |
ndltd-TW-085NTHU0442122 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-085NTHU0442122 2015-10-13T18:05:33Z http://ndltd.ncl.edu.tw/handle/62275166919316022114 Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora 詞彙自動抽取及最佳化技術之研究 Chang, Jing-Shin 張景新 Doctoral National Tsing Hua University Department of Electrical Engineering 85 Automatic lexicon acquisition from large text corpora is surveyed in this dissertation, with special emphasis on optimization techniques for maximizing the joint precision-recall performance. Both English compound word extraction and Chinese unknown word identification tasks are studied in order to explore precision-recall optimization techniques in different languages of different complexity using different available resources. In the English compound word extraction task, the simplest system architecture is studied, which assumes that the lexicon extraction task is conducted using a classifier (or a filter) based on a set of multiple association features. Under such circumstances, a two-stage optimization scheme is proposed, in which the first stage aims at minimizing classification error and the second stage focuses on maximizing joint precision-recall, starting from the minimum-error status. To achieve the minimum error rate, various approaches are used to improve the error rate performance of the classifier. In addition, a non-linear learning algorithm is developed for achieving maximum precision-recall performance in terms of a user-specified objective function of precision and recall. In the Chinese unknown word extraction task, where contextual information as well as word association metrics are used, an iterative approach is proposed that improves both precision and recall simultaneously.
For the English compound word extraction task, the weighted precision and recall (WPR) using the proposed approach can reach about 88% for bigram compounds and 88% for trigram compounds, for a training (testing) corpus of 20715 (2301) sentences sampled from technical manuals of cars. The F-measure performances are about 84% for bigrams and 86% for trigrams. By applying the proposed optimization method, the precision and recall profile is observed to follow the preferred criteria of different lexicographers. For the Chinese unknown word identification task, experimental results show that both precision and recall rates are improved almost monotonically, in contrast to non-iterative segmentation-merging-filtering-and-disambiguation approaches, which often sacrifice precision for recall or vice versa. With a corpus of 311,591 sentences, the performance is 76% (bigram), 54% (trigram), and 70% (quadgram) in F-measure, which is significantly better than the non-iterative approach, with F-measures of 74% (bigram), 46% (trigram), and 58% (quadgram). Keh-Yih Su 蘇克毅 1997 thesis 113 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others
|
sources |
NDLTD |
description |
Doctoral === National Tsing Hua University === Department of Electrical Engineering === 85 === Automatic lexicon acquisition from large text corpora is
surveyed in this dissertation, with special emphasis on
optimization techniques for maximizing the joint
precision-recall performance. Both English compound word
extraction and Chinese unknown word identification tasks are
studied in order to explore precision-recall optimization
techniques in different languages of different complexity using
different available resources. In the English compound word
extraction task, the simplest system architecture is studied,
which assumes that the lexicon extraction task is conducted
using a classifier (or a filter) based on a set of multiple
association features. Under such circumstances, a two-stage
optimization scheme is proposed, in which the first stage aims
at minimizing classification error and the second stage focuses
on maximizing joint precision-recall, starting from the
minimum-error status. To achieve the minimum error rate, various
approaches are used to improve the error rate performance of the
classifier. In addition, a non-linear learning algorithm is
developed for achieving maximum precision-recall performance in
terms of a user-specified objective function of precision and
recall. In the Chinese unknown word extraction task, where
contextual information as well as word association metrics are
used, an iterative approach is proposed that improves both
precision and recall simultaneously. For the English compound
word extraction task, the weighted precision and recall (WPR)
using the proposed approach can reach about 88% for bigram
compounds and 88% for trigram compounds, for a training
(testing) corpus of 20715 (2301) sentences sampled from
technical manuals of cars. The F-measure performances are about
84% for bigrams and 86% for trigrams. By applying the proposed
optimization method, the precision and recall profile is
observed to follow the preferred criteria of different
lexicographers. For the Chinese unknown word identification
task, experimental results show that both precision and recall
rates are improved almost monotonically, in contrast to
non-iterative
segmentation-merging-filtering-and-disambiguation approaches,
which often sacrifice precision for recall or vice versa. With a
corpus of 311,591 sentences, the performance is 76% (bigram),
54% (trigram), and 70% (quadgram) in F-measure, which is
significantly better than the non-iterative approach, with
F-measures of 74% (bigram), 46% (trigram), and 58% (quadgram).
|
author2 |
Keh-Yih Su |
author_facet |
Keh-Yih Su Chang, Jing-Shin 張景新 |
author |
Chang, Jing-Shin 張景新 |
spellingShingle |
Chang, Jing-Shin 張景新 Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora |
author_sort |
Chang, Jing-Shin |
title |
Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora |
title_short |
Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora |
title_full |
Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora |
title_fullStr |
Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora |
title_full_unstemmed |
Automatic Lexicon Acquisition and Precision-Recall Maximization for Untagged Text Corpora |
title_sort |
automatic lexicon acquisition and precision-recall maximization for untagged text corpora |
publishDate |
1997 |
url |
http://ndltd.ncl.edu.tw/handle/62275166919316022114 |
work_keys_str_mv |
AT changjingshin automaticlexiconacquisitionandprecisionrecallmaximizationforuntaggedtextcorpora AT zhāngjǐngxīn automaticlexiconacquisitionandprecisionrecallmaximizationforuntaggedtextcorpora AT changjingshin cíhuìzìdòngchōuqǔjízuìjiāhuàjìshùzhīyánjiū AT zhāngjǐngxīn cíhuìzìdòngchōuqǔjízuìjiāhuàjìshùzhīyánjiū |
_version_ |
1718028375811948544 |
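
The description field of this record reports results as F-measure and as weighted precision and recall (WPR), but does not define either metric. The sketch below shows one common way such scores can be computed for an extracted word list. It is an illustration under stated assumptions, not the dissertation's implementation: the function names, the toy data, and in particular the linear form `wp * P + (1 - wp) * R` used for WPR are assumptions not stated in the record.

```python
def precision_recall(extracted, gold):
    """Precision and recall of an extracted lexicon against a reference list."""
    extracted, gold = set(extracted), set(gold)
    true_pos = len(extracted & gold)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


def f_measure(precision, recall, beta=1.0):
    """Standard F-measure; beta > 1 weights recall more heavily."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def weighted_pr(precision, recall, wp=0.5):
    """ASSUMED form of WPR: a convex combination of precision and recall."""
    return wp * precision + (1.0 - wp) * recall


# Hypothetical toy data: candidate bigram compounds vs. a reference lexicon.
extracted = ["brake pedal", "fuel pump", "of the", "in a"]
gold = ["brake pedal", "fuel pump", "spark plug"]

p, r = precision_recall(extracted, gold)
f1 = f_measure(p, r)
wpr = weighted_pr(p, r, wp=0.5)
```

Changing `wp` (or `beta`) shifts the score toward precision or recall, which is one way a user-specified preference between the two could be expressed.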