Summary: | Master's === Chung Yuan Christian University === Graduate Institute of Electronic Engineering === 97 === Abstract
Data description and classification are important tasks that are widely applied in supervised learning. This thesis considers three supervised learning methods: k-Nearest Neighbor (k-NN), Support Vector Data Description (SVDD), and Support Vector Machine (SVM).
Feature selection in supervised learning helps find a feature subset that yields higher classification accuracy. This thesis considers both a forward-selection-based wrapper approach and a correlation-based filter approach. Correlation between each feature and the class label is measured using entropy and information gain (IG), while feature-feature correlation is calculated using the Pearson correlation coefficient. The study compares the performance of the three classifiers (k-NN, SVDD, and SVM) with and without feature selection. The classifiers with the proposed feature selection methods are expected to outperform the classifiers without feature selection. In addition, the selected feature subset can be used to describe the data structure regardless of the classifier type or feature selection method used.
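The two correlation measures named above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names `entropy`, `information_gain`, and `pearson` are hypothetical, and the sketch assumes each feature has already been discretized before information gain is computed (the abstract does not say how continuous features were handled).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG = H(labels) - H(labels | feature), for a discrete feature.

    Assumption: continuous features were binned beforehand.
    """
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels within each feature value.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def pearson(x, y):
    """Pearson correlation coefficient between two numeric sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

In a filter approach of this kind, a high `information_gain` against the class label suggests a relevant feature, while a high absolute `pearson` value between two features suggests redundancy.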
The dataset used is the Pima Indians Diabetes dataset from the UCI Machine Learning Repository. The results show that forward feature selection produces the best feature subset for SVM and 5-NN. Feature selection based on mean information gain with a standard deviation threshold gives the best result for the 1-NN classifier, and this method can be considered a substitute for forward selection: it is computationally efficient, and its accuracy for SVM and 5-NN does not decrease significantly compared with forward selection. Finally, among the eight candidate features, glucose level is the most prominent feature for diabetes detection across all classifiers and feature selection methods considered. The relevance measure provided by IG can be used to rank features from the most important to the least significant, which can be very useful in medical applications such as prioritizing features for symptom recognition.
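One way to read the threshold-based selection described above is sketched below in Python. The exact rule is not given in the abstract, so the cutoff `mean + k * std`, the parameter `k`, and the function name `select_by_ig_threshold` are all assumptions made for illustration. The scores in the usage note are placeholders, not results from the thesis.

```python
from statistics import mean, pstdev

def select_by_ig_threshold(ig_scores, k=0.0):
    """Rank features by IG and keep those at or above mean + k * std.

    ig_scores: dict mapping feature name -> information gain.
    k: hypothetical multiplier on the standard deviation (assumption;
       the source does not state how the threshold is formed).
    """
    values = list(ig_scores.values())
    threshold = mean(values) + k * pstdev(values)
    # Sort from the most relevant feature to the least relevant one.
    ranked = sorted(ig_scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]
```

For example, with placeholder scores `{"f1": 0.4, "f2": 0.1, "f3": 0.1}`, calling `select_by_ig_threshold(scores)` keeps only `f1`, while a negative `k` loosens the cutoff and keeps more features. Unlike forward selection, this needs no repeated classifier training, which is consistent with the efficiency noted above.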
|