A Sequential Feature Selecting Strategy Based on Relevance Between Data Label and Principal Component

Bibliographic Details
Main Authors: Ting-Kng Tiun, 張呈光
Other Authors: Wei-Ning Yang
Format: Others
Language: zh-TW
Published: 2017
Online Access: http://ndltd.ncl.edu.tw/handle/bw8d5v
Description
Summary: Master's === National Taiwan University of Science and Technology === Department of Information Management === 105 === A binary classification method predicts the class of an object from its associated feature vector. Traditional classification methods usually suffer from the high dimensionality of the feature vector, which creates a need to reduce the number of features. There are two major approaches to doing so. One selects a subset of the original features, which preserves the meaning of each feature. However, correlation among the original features makes it difficult to find a proper subset of significant features among a large number of candidates, so this approach typically resorts to random optimization algorithms. The other approach first transforms the original features into uncorrelated integrated features by principal component analysis (PCA) and then sequentially searches for the subset of significant integrated features. Removing the correlation among the integrated features makes this sequential search feasible, but at the cost of the interpretability of the significant features. In this study, we first transform the original features into uncorrelated integrated features by PCA and then rank the integrated features by their associated variances. To find the subset of significant integrated features, we build candidate subsets by adding integrated features one at a time in rank order. For each subset, a test score, defined as a linear combination of the integrated features in the subset, is generated for classification. The coefficient of each integrated feature in the linear combination is determined by a genetic algorithm (GA) so that the area under the receiver operating characteristic (ROC) curve (AUC) of the test score is maximized. Besides this self-developed classifier, we apply two other commonly used classifiers for comparison.
Using the training data, the classification accuracy of each subset is evaluated, and the subset with the highest classification accuracy is taken as the final subset of significant integrated features used for classification. Besides ranking the integrated features by their variances, we can also rank them by their Fisher Information, $R^2$, or AUC and then sequentially grow the subset of integrated features according to the resulting ranks. Experimental results show that ranking by Fisher Information can sometimes yield a better subset than ranking by PCA variance alone. However, ranking by PCA variance gives more consistent performance and requires less computing power. We believe further investigation is needed into whether using Fisher Information or other correlation measures as the selection criterion can yield better classification performance than PCA variance.
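To illustrate how the choice of ranking measure can change which subsets are examined, the sketch below ranks already-computed integrated features by variance and by a label-aware measure, then grows nested subsets in rank order. It rests on several assumptions: the abstract does not define its "Fisher Information" measure, so Fisher's discriminant ratio (squared gap between class means over the summed class variances) is used here as a stand-in, and the names `variance_measure`, `fisher_ratio`, `rank_features`, `nested_subsets`, and `best_subset`, as well as the `evaluate` callback, are hypothetical.

```python
from statistics import mean, pvariance


def variance_measure(column, labels):
    """Label-free measure: variance of the integrated feature."""
    return pvariance(column)


def fisher_ratio(column, labels):
    """Assumed label-aware measure: squared gap between the class means
    divided by the sum of the class variances."""
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y == 0]
    spread = pvariance(pos) + pvariance(neg)
    gap = (mean(pos) - mean(neg)) ** 2
    return gap / spread if spread else float("inf")


def rank_features(columns, labels, measure):
    """Feature indices ordered from the highest to the lowest measure."""
    return sorted(range(len(columns)),
                  key=lambda j: measure(columns[j], labels),
                  reverse=True)


def nested_subsets(order):
    """Grow the candidate subset one feature at a time in rank order."""
    return [order[: k + 1] for k in range(len(order))]


def best_subset(subsets, evaluate):
    """Pick the subset with the highest score from a caller-supplied
    evaluator (for example, training accuracy of some classifier)."""
    return max(subsets, key=evaluate)


# Hypothetical toy data: feature 0 has the larger variance but identical
# class means; feature 1 varies less but separates the two classes.
columns = [[-3, 3, -2.5, 2.5], [-1, -0.8, 0.9, 1.1]]
labels = [0, 0, 1, 1]
print(rank_features(columns, labels, variance_measure))  # [0, 1]
print(rank_features(columns, labels, fisher_ratio))      # [1, 0]
print(nested_subsets(rank_features(columns, labels, fisher_ratio)))
```

In this toy case the variance ranking visits the uninformative feature first, while the label-aware ranking starts from the discriminative one, which is the kind of difference the reported comparison is about.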