A Sequential Feature Selection Strategy Based on Relevance Between Data Label and Principal Component


Bibliographic Details
Main Authors: Ting-Kng Tiun, 張呈光
Other Authors: Wei-Ning Yang
Format: Others
Language: zh-TW
Published: 2017
Online Access: http://ndltd.ncl.edu.tw/handle/bw8d5v
id ndltd-TW-105NTUS5396081
record_format oai_dc
spelling ndltd-TW-105NTUS53960812019-05-15T23:46:35Z http://ndltd.ncl.edu.tw/handle/bw8d5v A Sequential Feature Selection Strategy Based on Relevance Between Data Label and Principal Component 植基於主成份分析與資料類別關係之遞進式特徵提取方法 Ting-Kng Tiun 張呈光 Master's === National Taiwan University of Science and Technology === Department of Information Management === 105 === Binary classification methods predict the class of an object from its associated feature vector. Traditional classification methods often suffer from the high dimensionality of the feature vector, so the number of features must be reduced. There are two major approaches to reducing the number of features. The first selects a subset of the original features, which preserves the meaning of each retained feature; however, the relevance among the original features makes it difficult to find a proper subset of significant features from a large pool, which calls for randomized optimization algorithms. The second approach first transforms the original attributes into uncorrelated integrated features by principal component analysis (PCA) and then sequentially searches for the subset of significant integrated features. Because the transformation removes the relevance among the integrated features, the sequential search becomes feasible, at the cost of losing the interpretability of the selected features. In this study, we first transform the original features into uncorrelated integrated features by PCA and then rank the integrated features by their associated variances. To find the subset of significant integrated features, we sequentially inflate the subset, adding integrated features in order of their ranks. For each subset of integrated features, a test score, which is a linear combination of the integrated features, is generated for classification. The coefficient on each integrated feature in the linear combination is determined so that the area under the Receiver Operating Characteristic (ROC) curve of the test score is maximized, using a Genetic Algorithm (GA).
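The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's actual implementation: the data are synthetic, the GA is a toy selection-plus-mutation loop, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-labelled data (hypothetical stand-in for the thesis data).
n, p = 200, 10
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# PCA: project onto eigenvectors of the covariance matrix, then rank the
# integrated features (principal components) by their variances.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]          # descending variance
Z = Xc @ eigvecs[:, order]                 # uncorrelated integrated features

def auc(score, label):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    ranks = np.argsort(np.argsort(score)) + 1
    n_pos = label.sum()
    n_neg = len(label) - n_pos
    return (ranks[label == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def ga_max_auc(Zk, label, pop=40, gens=60, sigma=0.3):
    """Toy GA: evolve coefficient vectors to maximize the AUC of Zk @ w."""
    popu = rng.normal(size=(pop, Zk.shape[1]))
    for _ in range(gens):
        fit = np.array([auc(Zk @ w, label) for w in popu])
        elite = popu[np.argsort(fit)[::-1][: pop // 2]]               # selection
        children = elite + rng.normal(scale=sigma, size=elite.shape)  # mutation
        popu = np.vstack([elite, children])
    fit = np.array([auc(Zk @ w, label) for w in popu])
    return popu[fit.argmax()], fit.max()

# Sequentially inflate the subset of top-variance integrated features.
for k in range(1, 4):
    w, best = ga_max_auc(Z[:, :k], y)
    print(f"top-{k} components: training AUC = {best:.3f}")
```

In a full implementation one would evaluate each candidate subset on held-out data rather than on the training AUC alone, as the abstract's accuracy-based subset selection does.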
Besides the self-developed classifier, we applied two other commonly used classifiers for comparison. Using the training data, the classification accuracy of each subset is evaluated, and the subset with the largest classification accuracy is taken as the final subset of significant integrated features used for classification. In addition to ranking the integrated features by their variances, we can also rank them by the corresponding Fisher Information, $R^2$, or AUC, and then sequentially inflate the subset of integrated features according to the resulting ranks. Experimental results show that ranking by Fisher Information can sometimes yield a better subset than ranking by PCA variance alone; however, ranking by variance gives more consistent performance at a lower computational cost. Whether Fisher Information or other relevance measures can serve as selection criteria that consistently outperform ranking by PCA variance deserves further investigation. Wei-Ning Yang 楊維寧 2017 學位論文 ; thesis 24 zh-TW
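The alternative relevance measures mentioned above can also be sketched briefly. The snippet below, a hedged illustration with assumed names and synthetic data, shows two of them: the Fisher criterion for a binary label and the squared correlation ($R^2$) between a component and the 0/1 label.

```python
import numpy as np

def fisher_score(z, y):
    """Fisher criterion of one integrated feature for a binary label."""
    z0, z1 = z[y == 0], z[y == 1]
    return (z1.mean() - z0.mean()) ** 2 / (z1.var() + z0.var())

def r_squared(z, y):
    """Squared Pearson correlation between the feature and the 0/1 label."""
    return np.corrcoef(z, y)[0, 1] ** 2

# Rank components by each relevance measure instead of raw variance.
rng = np.random.default_rng(1)
Z = rng.normal(size=(100, 5))
y = (Z[:, 3] > 0).astype(int)   # hypothetical: component 3 carries the label

ranks_fisher = np.argsort([-fisher_score(Z[:, j], y) for j in range(5)])
ranks_r2 = np.argsort([-r_squared(Z[:, j], y) for j in range(5)])
print(ranks_fisher[0], ranks_r2[0])   # expected to pick component 3
```

Unlike the variance ranking, both measures use the label, which is why they can surface a low-variance but highly discriminative component, at the price of the less stable behavior the abstract reports.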
collection NDLTD
language zh-TW
format Others
sources NDLTD
author2 Wei-Ning Yang
author_facet Wei-Ning Yang
Ting-Kng Tiun
張呈光
author Ting-Kng Tiun
張呈光
spellingShingle Ting-Kng Tiun
張呈光
A Sequential Feature Selection Strategy Based on Relevance Between Data Label and Principal Component
author_sort Ting-Kng Tiun
title A Sequential Feature Selection Strategy Based on Relevance Between Data Label and Principal Component
title_short A Sequential Feature Selection Strategy Based on Relevance Between Data Label and Principal Component
title_full A Sequential Feature Selection Strategy Based on Relevance Between Data Label and Principal Component
title_fullStr A Sequential Feature Selection Strategy Based on Relevance Between Data Label and Principal Component
title_full_unstemmed A Sequential Feature Selection Strategy Based on Relevance Between Data Label and Principal Component
title_sort sequential feature selection strategy based on relevance between data label and principal component
publishDate 2017
url http://ndltd.ncl.edu.tw/handle/bw8d5v
work_keys_str_mv AT tingkngtiun asequeacialfeatureselectingstrategybasedonrelevancebetweendatalabelandprinciplecomponent
AT zhāngchéngguāng asequeacialfeatureselectingstrategybasedonrelevancebetweendatalabelandprinciplecomponent
AT tingkngtiun zhíjīyúzhǔchéngfènfēnxīyǔzīliàolèibiéguānxìzhīdìjìnshìtèzhēngtíqǔfāngfǎ
AT zhāngchéngguāng zhíjīyúzhǔchéngfènfēnxīyǔzīliàolèibiéguānxìzhīdìjìnshìtèzhēngtíqǔfāngfǎ
AT tingkngtiun sequeacialfeatureselectingstrategybasedonrelevancebetweendatalabelandprinciplecomponent
AT zhāngchéngguāng sequeacialfeatureselectingstrategybasedonrelevancebetweendatalabelandprinciplecomponent
_version_ 1719153231798468608