Chinese Word Segmentation using Specialized HMM

Master's === National Central University (國立中央大學) === Graduate Institute of Computer Science and Information Engineering (資訊工程研究所) === 94 === The first step in Chinese language processing tasks is word segmentation. Previous studies have proposed various methods for this problem, including heuristic-based and statistical approaches. HMM is a statistical machine learning app...

Bibliographic Details
Main Authors:	Qian-Xiang Lin, 林千翔
Other Authors:	Chia-Hui Chang
Format:	Others
Language:	zh-TW
Published:	2006
Online Access:	http://ndltd.ncl.edu.tw/handle/jea684

id	ndltd-TW-094NCU05392082
record_format	oai_dc
spelling	ndltd-TW-094NCU05392082 2018-05-13T04:29:03Z http://ndltd.ncl.edu.tw/handle/jea684 Chinese Word Segmentation using Specialized HMM 基於特製隱藏式馬可夫模型之中文斷詞研究 Qian-Xiang Lin 林千翔 碩士 國立中央大學 資訊工程研究所 94 Chia-Hui Chang 張嘉惠 2006 學位論文 ; thesis 41 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's === National Central University (國立中央大學) === Graduate Institute of Computer Science and Information Engineering (資訊工程研究所) === 94 === The first step in Chinese language processing tasks is word segmentation. Previous studies have proposed various methods for this problem, including heuristic-based and statistical approaches. The hidden Markov model (HMM) is a statistical machine learning approach that has been applied successfully in many fields, such as POS tagging and shallow parsing. However, we find that a standard HMM reaches only about 80% in F-measure on Chinese word segmentation. As is commonly known, segmentation ambiguity and unknown words are the two main problems in Chinese word segmentation. In this paper, we propose a two-stage specialized HMM that incorporates this information into the model. In the first stage, we use the maximum matching heuristic to capture segmentation ambiguity and a masking approach to handle unknown-word information. By extending the observation symbols in this way, the proposed M-HMM improves the F-measure from 0.812 to 0.953. In the second stage, we use a lexicalization technique to further improve HMM performance. The idea is to add new state symbols for high-frequency characters or for symbols with high tagging error. Experimental results show that the Lexicalized M-HMM improves the F-measure from 0.953 to 0.963.
author2	Chia-Hui Chang
author_facet	Chia-Hui Chang Qian-Xiang Lin 林千翔
author	Qian-Xiang Lin 林千翔
author_sort	Qian-Xiang Lin
title	Chinese Word Segmentation using Specialized HMM
publishDate	2006
url	http://ndltd.ncl.edu.tw/handle/jea684
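
The abstract names two standard ingredients that are easy to illustrate: the maximum matching heuristic and the word-level F-measure used to report results. The sketch below is not the thesis's M-HMM. It shows only greedy forward maximum matching and a span-based F-measure, under stated assumptions: the function names, the `max_len` limit, the toy lexicon, and the example sentence are all invented for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    lexicon word, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def to_spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def f_measure(predicted, gold):
    """Word-level F-measure: a predicted word counts as correct only if
    both of its boundaries match a gold word."""
    pred_spans, gold_spans = to_spans(predicted), to_spans(gold)
    correct = len(pred_spans & gold_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Toy lexicon and sentence, invented for illustration only.
lexicon = {"研究", "研究生", "生命", "起源"}
gold = ["研究", "生命", "起源"]
predicted = forward_max_match("研究生命起源", lexicon)
print(predicted)                   # ['研究生', '命', '起源']
print(f_measure(predicted, gold))  # one of three words matches in each direction
```

In this toy case the greedy match picks 研究生 over 研究 and so mis-segments the rest of the sentence, which is the kind of segmentation ambiguity the abstract describes. A matcher like this could plausibly supply extra observation features to a tagger, but how the thesis actually combines the two is not stated in this record.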

Similar Items