Chinese Word Segmentation using Specialized HMM

Master's === National Central University === Graduate Institute of Computer Science and Information Engineering === 94 === The first step in Chinese language processing tasks is word segmentation. Various methods have been proposed to address this problem in previous studies, e.g. heuristic-based approaches, statistical-based approaches, etc. HMM is a statistical machine learning app...

Bibliographic Details
Main Authors: Qian-Xiang Lin, 林千翔
Other Authors: Chia-Hui Chang
Format: Others
Language: zh-TW
Published: 2006
Online Access: http://ndltd.ncl.edu.tw/handle/jea684
id ndltd-TW-094NCU05392082
record_format oai_dc
spelling ndltd-TW-094NCU05392082 2018-05-13T04:29:03Z http://ndltd.ncl.edu.tw/handle/jea684 Chinese Word Segmentation using Specialized HMM 基於特製隱藏式馬可夫模型之中文斷詞研究 Qian-Xiang Lin 林千翔 Master's National Central University Graduate Institute of Computer Science and Information Engineering 94 Chia-Hui Chang 張嘉惠 2006 thesis 41 zh-TW
collection NDLTD
sources NDLTD
description Master's === National Central University === Graduate Institute of Computer Science and Information Engineering === 94 === The first step in Chinese language processing is word segmentation. Previous studies have proposed various methods for this problem, e.g. heuristic-based and statistical approaches. The hidden Markov model (HMM) is a statistical machine learning approach that has been applied successfully in many fields, e.g. POS tagging and shallow parsing. However, we find that a standard HMM achieves only about 80% on Chinese word segmentation. Segmentation ambiguity and unknown words are commonly recognized as the two main problems in Chinese word segmentation. In this paper, we propose a two-stage specialized HMM that incorporates both kinds of information into the model. In the first stage, we use the maximum matching heuristic to capture segmentation ambiguity and a masking approach to handle unknown-word information. By extending the observation symbols in this way, the proposed M-HMM raises the F-measure from 0.812 to 0.953. In the second stage, we apply a lexicalization technique to further improve HMM performance: new state symbols are added for high-frequency characters or for symbols with high tagging error. Experimental results show that the lexicalized M-HMM raises the F-measure from 0.953 to 0.963.
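The abstract names two ingredients that can be illustrated in a few lines: the maximum matching heuristic and the F-measure used for evaluation. The following is a minimal sketch under stated assumptions, not the thesis's implementation. The function names, the `max_len` limit, and the toy lexicon and sentence are all hypothetical, and the HMM itself (with its extended observation symbols and masking) is not reproduced here. The example also shows the kind of segmentation ambiguity the abstract refers to: the greedy matcher prefers the longer word 研究生 and misses the intended split.

```python
from typing import List, Set, Tuple


def forward_maximum_matching(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation: take the longest lexicon word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character is always accepted.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word list into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def f_measure(predicted: List[str], gold: List[str]) -> float:
    """Word-level F1: a predicted word is correct only if both boundaries match the gold."""
    pred_spans, gold_spans = word_spans(predicted), word_spans(gold)
    correct = len(pred_spans & gold_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Toy lexicon, sentence, and gold segmentation, invented for illustration only.
LEXICON = {"研究", "研究生", "生命", "起源"}
SENTENCE = "研究生命起源"
GOLD = ["研究", "生命", "起源"]

predicted = forward_maximum_matching(SENTENCE, LEXICON)
score = f_measure(predicted, GOLD)
print(predicted, round(score, 3))
```

On this toy input the matcher returns 研究生 / 命 / 起源, so only one of three predicted words agrees with the gold segmentation. A statistical model that sees the matcher's output as part of its observations, as the abstract describes, would be one way to recover from such greedy errors.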