A Hybrid Chinese Feature Selection Method for Knowledge Document Classification

Master's === National Cheng Kung University === Department of Industrial and Information Management (Special Program) === 101 === Enterprises use knowledge management systems to train employees, and industry knowledge documents are an important source of explicit knowledge. Classifying knowledge documents is a significant task for enterprises today. For selecting fe...


Bibliographic Details
Main Authors: Kuan-Chung Kuo, 郭冠忠
Other Authors: Hei-Chia Wang
Format: Others
Language: zh-TW
Published: 2013
Online Access: http://ndltd.ncl.edu.tw/handle/61154689719734252916
id ndltd-TW-101NCKU5041064
record_format oai_dc
spelling ndltd-TW-101NCKU5041064 2016-03-18T04:42:17Z http://ndltd.ncl.edu.tw/handle/61154689719734252916 A Hybrid Chinese Feature Selection Method for Knowledge Document Classification 利用混合式中文特徵選取法於知識文件分類 Kuan-Chung Kuo 郭冠忠 Master's National Cheng Kung University Department of Industrial and Information Management (Special Program) 101 Hei-Chia Wang 王惠嘉 2013 thesis 45 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === National Cheng Kung University === Department of Industrial and Information Management (Special Program) === 101 === Enterprises use knowledge management systems to train employees, and industry knowledge documents are an important source of explicit knowledge, so classifying these documents is a significant task for enterprises today. Because the selected features affect classification accuracy, text pre-processing is required before knowledge documents are classified. Chinese sentences, however, are hard to segment during pre-processing because no white space separates Chinese terms. Two approaches to Chinese segmentation are currently common: dictionary-based and statistics-based. Unknown terms are a persistent problem for dictionary-based segmentation systems, since no dictionary can cover every term while new terms keep being coined. To address this problem, this study combined two dictionary-based Chinese segmentation systems, the Stanford Chinese Word Segmenter and the CKIP segmentation system, with one statistics-based method, n-grams, and calculated the TF-ICF (Term Frequency-Inverse Category Frequency) score of each term to select the final features, which were then classified and validated with an SVM classifier. The study found that the hybrid Chinese feature selection method achieves higher classification accuracy than a method using a single Chinese segmentation system, and that TF-ICF performs better than TF and TF-IDF. The hybrid Chinese feature selection method can therefore improve the accuracy of Chinese knowledge document classification.
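The feature-scoring step described in the abstract can be illustrated with a minimal Python sketch. The record does not give the thesis's exact TF-ICF formula, so the sketch assumes one common form, score(t, c) = tf(t, c) × log(|C| / cf(t)), where tf(t, c) is the count of term t in category c, |C| is the number of categories, and cf(t) is the number of categories containing t. The function names (`char_ngrams`, `tf_icf`, `top_features`) and the toy data are hypothetical and are not taken from the thesis. In a hybrid setup, candidate terms produced by the dictionary-based segmenters would presumably be pooled with the character n-grams before scoring; the segmenters themselves are not called here.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Return every character n-gram of length n_min..n_max, in order."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


def tf_icf(docs_by_category):
    """Score terms per category with an ASSUMED TF-ICF form.

    docs_by_category maps a category label to a list of token lists.
    score = tf * log(number_of_categories / category_frequency)
    """
    # Term counts aggregated over all documents of each category.
    term_counts = {
        cat: Counter(tok for doc in docs for tok in doc)
        for cat, docs in docs_by_category.items()
    }
    # Category frequency: in how many categories each term occurs.
    category_freq = Counter()
    for counts in term_counts.values():
        category_freq.update(counts.keys())

    n_categories = len(docs_by_category)
    return {
        cat: {
            term: tf * math.log(n_categories / category_freq[term])
            for term, tf in counts.items()
        }
        for cat, counts in term_counts.items()
    }


def top_features(term_scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(term_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy corpus (hypothetical): one short document per category.
corpus = {
    "finance": [char_ngrams("銀行貸款利率")],
    "it": [char_ngrams("資料庫系統")],
}
scores = tf_icf(corpus)
selected = {cat: top_features(s, 3) for cat, s in scores.items()}
```

Under this assumed formula, a term that occurs in every category scores zero, so only category-specific terms survive selection. The selected terms could then serve as the feature vocabulary for an SVM classifier.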
author2 Hei-Chia Wang
author_facet Hei-Chia Wang
Kuan-Chung Kuo
郭冠忠
author Kuan-Chung Kuo
郭冠忠
title A Hybrid Chinese Feature Selection Method for Knowledge Document Classification
publishDate 2013
url http://ndltd.ncl.edu.tw/handle/61154689719734252916