A Hybrid Chinese Segmentation System for Finding Long Terms and New Terms

Master's === Yuan Ze University === Department of Information Management === 99 === This study proposes a hybrid Chinese segmentation method. First, we segment the documents with two segmentation methods, High-Frequency Maximum Matching (HFMM) and CKIP. Second, we verify the long terms generated by HFMM using the part of speech (POS) given by...
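
As a rough illustration of the dictionary-based matching step named above, here is a minimal sketch of plain forward maximum matching in Python. It is not the thesis's HFMM variant, whose high-frequency criterion is not described in this record; the `lexicon` contents and the `max_len` limit are assumptions made only for the example.

```python
def forward_max_match(text, lexicon, max_len=6):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    found in the lexicon; fall back to a single character otherwise.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    # Hypothetical lexicons: one with only short terms, one with a long term.
    short_lexicon = {"中文", "斷詞", "系統"}
    long_lexicon = short_lexicon | {"中文斷詞系統"}
    print(forward_max_match("中文斷詞系統", short_lexicon))
    print(forward_max_match("中文斷詞系統", long_lexicon))
```

With the short lexicon the text is cut into three two-character terms; adding the long entry makes the matcher return it as a single term, which is the kind of long-term candidate a later verification step would have to accept or reject.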

Bibliographic Details
Main Authors: Yu-Shyang Lin, 林渝翔
Other Authors: Cheng-Jye Luh
Format: Others
Language: zh-TW
Published: 2011
Online Access: http://ndltd.ncl.edu.tw/handle/74114384622240439792
id ndltd-TW-099YZU05396009
record_format oai_dc
spelling ndltd-TW-099YZU05396009 2016-04-13T04:16:58Z http://ndltd.ncl.edu.tw/handle/74114384622240439792 A Hybrid Chinese Segmentation System for Finding Long Terms and New Terms 一個產生長詞與新詞的中文混合斷詞系統 Yu-Shyang Lin 林渝翔 Master's Yuan Ze University Department of Information Management 99 This study proposes a hybrid Chinese segmentation method. First, we segment the documents with two segmentation methods, High-Frequency Maximum Matching (HFMM) and CKIP. Second, we verify the long terms generated by HFMM using the part of speech (POS) given by CKIP and a set of POS combination rules. Finally, we find the new terms that CKIP generally does not generate. Experimental results on the Sinica corpus show that the proposed method achieves a certain level of precision, recall, and F1-measure. Once the manually selected long terms are added to the Sinica corpus, our method performs much better than the other segmentation methods. In addition, experimental results on Google News show that the method finds 7.5 new terms on average from news articles in 3 categories. The average accuracy of the new terms reaches 80.82%, indicating that the proposed method can also find new terms accurately. Cheng-Jye Luh 陸承志 2011 thesis 61 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === Yuan Ze University === Department of Information Management === 99 === This study proposes a hybrid Chinese segmentation method. First, we segment the documents with two segmentation methods, High-Frequency Maximum Matching (HFMM) and CKIP. Second, we verify the long terms generated by HFMM using the part of speech (POS) given by CKIP and a set of POS combination rules. Finally, we find the new terms that CKIP generally does not generate. Experimental results on the Sinica corpus show that the proposed method achieves a certain level of precision, recall, and F1-measure. Once the manually selected long terms are added to the Sinica corpus, our method performs much better than the other segmentation methods. In addition, experimental results on Google News show that the method finds 7.5 new terms on average from news articles in 3 categories. The average accuracy of the new terms reaches 80.82%, indicating that the proposed method can also find new terms accurately.
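
The precision, recall, and F1-measure mentioned in the description are commonly computed for word segmentation by comparing character spans of predicted words against a gold segmentation. The Python sketch below shows that common scheme; it is an assumption that the thesis scored its results this way, and the function names and sample sentences are hypothetical.

```python
def to_spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for token in tokens:
        spans.add((start, start + len(token)))
        start += len(token)
    return spans


def segmentation_prf(gold_tokens, predicted_tokens):
    """Return (precision, recall, f1) over exactly matching word spans."""
    gold = to_spans(gold_tokens)
    predicted = to_spans(predicted_tokens)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical gold and predicted segmentations of the same text.
    gold = ["中文", "斷詞", "系統"]
    predicted = ["中文", "斷詞系統"]
    print(segmentation_prf(gold, predicted))
```

Under this scheme a predicted long term only counts as correct when the gold segmentation contains the same span, which is consistent with the description's remark that scores improve once manually selected long terms are added to the gold corpus.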
author2 Cheng-Jye Luh
author_facet Cheng-Jye Luh
Yu-Shyang Lin
林渝翔
author Yu-Shyang Lin
林渝翔
spellingShingle Yu-Shyang Lin
林渝翔
A Hybrid Chinese Segmentation System for Finding Long Terms and New Terms
author_sort Yu-Shyang Lin
title A Hybrid Chinese Segmentation System for Finding Long Terms and New Terms
title_sort hybrid chinese segmentation system for finding long terms and new terms
publishDate 2011
url http://ndltd.ncl.edu.tw/handle/74114384622240439792