Summary: | Master's === National Kaohsiung University of Applied Sciences === Master's Program, Department of Electrical Engineering === 92 === Automatic text categorization uses machine learning techniques to classify heterogeneous texts by means of an implemented classification system. The theory of the Support Vector Machine (SVM) builds on statistical learning, neural networks, and optimization techniques. The major features of the SVM are (1) the capacity to handle both linear and non-linear problems, and (2) no limit on the total number of data items (data size). As a result, the SVM algorithm offers an effective solution to the difficulties of text categorization with large-scale data.
This research is based mainly on the Support Vector Machine (SVM) learning algorithm and proposes a feature selection strategy for classifying Chinese documents. Under several experimental settings, we discuss the differences among several feature selection strategies and verify their impact on the performance of SVM-based classification. Based on this analysis, we selected one strategy for our classification system and combined different kernel functions with various parameters in the SVM algorithm to set up the document categorization experiments. Our experimental results indicate that, with the selected feature selection strategy, the SVM algorithm achieves satisfactory document classification performance. We also demonstrate that our system achieves outstanding categorization accuracy with only 500 dimensions. Finally, we conducted several experiments comparing neural network and kNN classifiers with our SVM classifier for document categorization; the SVM classifier outperformed the others.
|
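The two ingredients named in the abstract above, reducing documents to a small number of feature dimensions and then applying SVM kernel functions with different parameters, can be illustrated with a minimal sketch. The abstract does not say which feature selection strategy or kernel settings were chosen, so everything here is an assumption made for illustration: the chi-square score as the selection criterion, the toy English tokens (Chinese text would first need word segmentation), and all function and parameter names. It is not the thesis's implementation, and it shows only the kernel evaluations, not SVM training.

```python
import math
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each term by its maximum chi-square statistic over all classes.

    docs   : list of token lists (assumed to be already segmented)
    labels : list of class labels, one per document
    """
    n = len(docs)
    class_count = Counter(labels)
    term_count = Counter()        # number of documents containing the term
    term_class_count = Counter()  # (term, class) -> number of documents
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_count[term] += 1
            term_class_count[(term, label)] += 1

    scores = {}
    for term, df in term_count.items():
        best = 0.0
        for cls, n_cls in class_count.items():
            a = term_class_count[(term, cls)]  # in class, contains term
            b = df - a                         # other classes, contains term
            c = n_cls - a                      # in class, lacks term
            d = n - df - c                     # other classes, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def to_vector(tokens, vocabulary):
    """Term-frequency vector restricted to the selected vocabulary."""
    counts = Counter(tokens)
    return [float(counts[t]) for t in vocabulary]


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical toy corpus; a real run would keep e.g. k = 500 terms.
docs = [
    ["stock", "market", "rise"],
    ["stock", "invest", "fund"],
    ["team", "match", "win"],
    ["player", "match", "goal"],
]
labels = ["finance", "finance", "sports", "sports"]

scores = chi_square_scores(docs, labels)
vocabulary = select_top_k(scores, k=2)
vectors = [to_vector(tokens, vocabulary) for tokens in docs]

x0, x2 = vectors[0], vectors[2]  # one finance and one sports document
kernel_values = {
    "linear": linear_kernel(x0, x2),
    "polynomial": polynomial_kernel(x0, x2),
    "rbf": rbf_kernel(x0, x2),
}
```

In an actual experiment the reduced vectors would be passed to an SVM trainer, and the kernel type and its parameters (here `degree`, `coef0`, and `gamma`) would be varied to compare classification accuracy.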