A Hybrid Chinese Segmentation System for Finding Long Terms and New Terms

Bibliographic Details
Main Authors: Yu-Shyang Lin, 林渝翔
Other Authors: Cheng-Jye Luh
Format: Others
Language: zh-TW
Published: 2011
Online Access: http://ndltd.ncl.edu.tw/handle/74114384622240439792
Description
Summary: Master's === Yuan Ze University === Department of Information Management === 99 === This study proposes a hybrid Chinese segmentation method. First, we segment the documents with two segmentation methods, High-Frequency Maximum Matching (HFMM) and CKIP. Second, we verify the long terms generated by HFMM using the part-of-speech (POS) tags given by CKIP and a set of POS combination rules. Finally, we find new terms that CKIP generally would not generate. The experimental results on the Sinica corpus showed that the proposed method achieves a certain level of Precision, Recall, and F1-measure. Once manually selected long terms are added to the Sinica corpus, our method performs much better than other segmentation methods. In addition, the experimental results on Google News showed that we obtain 7.5 new terms on average from news articles in 3 categories. The average accuracy rate of the new terms reached 80.82%, indicating that the proposed method can also find new terms accurately.
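The summary does not spell out how HFMM works or how the scores were computed, so the following is only a minimal sketch of the two generic building blocks it mentions: plain forward maximum matching against a lexicon (not the thesis's high-frequency variant) and span-based Precision, Recall, and F1 for a segmentation. The function names, the `max_len` parameter, the toy lexicon, and the gold segmentation are all assumptions made for illustration.

```python
def forward_max_match(text, lexicon, max_len=6):
    """Greedy left-to-right segmentation: at each position take the
    longest lexicon entry, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop advances.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


def to_spans(tokens):
    """Convert a token list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans


def precision_recall_f1(predicted, gold):
    """Score a predicted segmentation against a gold one by exact span match."""
    pred_spans, gold_spans = to_spans(predicted), to_spans(gold)
    correct = len(pred_spans & gold_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy lexicon and gold segmentation (hypothetical, not taken from the thesis).
lexicon = {"元智", "大學", "元智大學", "資訊", "管理", "資訊管理"}
predicted = forward_max_match("元智大學資訊管理學系", lexicon)
gold = ["元智大學", "資訊管理", "學系"]
print(predicted)
print(precision_recall_f1(predicted, gold))
```

In this toy run the long entries 元智大學 and 資訊管理 win over their shorter parts, while 學系 is split into single characters because it is missing from the lexicon. That is the kind of gap a second segmenter or a new-term detection step would be expected to cover.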