Chinese Word Segmentation using Specialized HMM
Master's thesis (碩士) === National Central University (國立中央大學) === Graduate Institute of Computer Science and Information Engineering (資訊工程研究所) === 94 === The first step in Chinese language processing is word segmentation. Previous studies have proposed various methods for this problem, e.g. heuristic-based and statistical approaches. HMM is a statistical machine learning app...
Main Authors: | Qian-Xiang Lin, 林千翔 |
---|---|
Other Authors: | Chia-Hui Chang |
Format: | Others |
Language: | zh-TW |
Published: | 2006 |
Online Access: | http://ndltd.ncl.edu.tw/handle/jea684 |
id |
ndltd-TW-094NCU05392082 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-094NCU05392082 2018-05-13T04:29:03Z http://ndltd.ncl.edu.tw/handle/jea684 Chinese Word Segmentation using Specialized HMM 基於特製隱藏式馬可夫模型之中文斷詞研究 Qian-Xiang Lin 林千翔 Master's thesis (碩士), National Central University (國立中央大學), Graduate Institute of Computer Science and Information Engineering (資訊工程研究所), 94 Chia-Hui Chang 張嘉惠 2006 thesis (學位論文) 41 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
Master's thesis (碩士) === National Central University (國立中央大學) === Graduate Institute of Computer Science and Information Engineering (資訊工程研究所) === 94 === The first step in Chinese language processing is word segmentation. Previous studies have proposed various methods for this problem, e.g. heuristic-based and statistical approaches. The hidden Markov model (HMM) is a statistical machine learning approach that has been applied successfully in many fields, e.g. POS tagging and shallow parsing. However, we find that a standard HMM reaches only about 80% on Chinese word segmentation. Segmentation ambiguity and unknown words are commonly recognized as the two main problems in Chinese word segmentation. In this thesis, we propose a two-stage specialized HMM that incorporates both kinds of information into the model. In the first stage, we use the maximum matching heuristic to capture segmentation ambiguity and a masking approach to capture unknown-word information. By extending the observation symbols in this way, the proposed M-HMM improves the F-measure from 0.812 to 0.953. In the second stage, we use a lexicalization technique to further improve HMM performance. The idea is to add new state symbols for high-frequency characters or for symbols with high tagging error rates. Experimental results show that the Lexicalized M-HMM improves the F-measure from 0.953 to 0.963.
|
author2 |
Chia-Hui Chang |
author_facet |
Chia-Hui Chang Qian-Xiang Lin 林千翔 |
author |
Qian-Xiang Lin 林千翔 |
spellingShingle |
Qian-Xiang Lin 林千翔 Chinese Word Segmentation using Specialized HMM |
author_sort |
Qian-Xiang Lin |
title |
Chinese Word Segmentation using Specialized HMM |
publishDate |
2006 |
url |
http://ndltd.ncl.edu.tw/handle/jea684 |
_version_ |
1718638220658343936 |
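
The description in this record mentions two building blocks without giving details: the maximum matching heuristic and evaluation by F-measure. The sketch below illustrates generic versions of both, assuming a forward (left-to-right) variant of maximum matching and a word-level F-measure computed over character spans. It is not the thesis's implementation; the function names, the toy lexicon, and the `max_len` parameter are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple


def forward_maximum_matching(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation (assumed variant of maximum matching).

    At each position, take the longest lexicon word that starts there;
    fall back to a single character when nothing matches.
    """
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into a set of (start, end) character spans."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def f_measure(predicted: List[str], gold: List[str]) -> float:
    """Word-level F-measure: a predicted word counts as correct only if
    both of its boundaries agree with the gold segmentation."""
    pred_spans = to_spans(predicted)
    gold_spans = to_spans(gold)
    correct = len(pred_spans & gold_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Toy lexicon (hypothetical) that triggers a classic overlap ambiguity.
lexicon = {"研究", "研究生", "生命", "起源"}
predicted = forward_maximum_matching("研究生命起源", lexicon)
gold = ["研究", "生命", "起源"]
score = f_measure(predicted, gold)
```

On this toy input the greedy heuristic picks 研究生 first and so splits 生命 incorrectly, which is the kind of segmentation ambiguity the description refers to. A purely greedy matcher cannot recover from such a choice, which is one motivation for feeding matching information into a statistical model rather than using it alone.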