Protein Sub-Nuclear Localization Based on Effective Fusion Representations and Dimension Reduction Algorithm LDA

An effective representation of a protein sequence plays a crucial role in protein sub-nuclear localization. The existing representations, such as dipeptide composition (DipC), pseudo-amino acid composition (PseAAC) and position-specific scoring matrix (PSSM), are insufficient to represent a protein sequence because each reflects only a single perspective.


Bibliographic Details
Main Authors: Shunfang Wang, Shuhui Liu
Format: Article
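Language: English
Published: MDPI AG 2015-12-01
Series: International Journal of Molecular Sciences
Subjects: protein sub-nuclear localization; DipPSSM; PseAAPSSM; linear discriminant analysis; KNN classifier
Online Access: http://www.mdpi.com/1422-0067/16/12/26237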
id doaj-0c1d3f42bde44bb48e383b53ddc81418
record_format Article
spelling doaj-0c1d3f42bde44bb48e383b53ddc81418; 2020-11-24T21:39:45Z; eng; MDPI AG; International Journal of Molecular Sciences; ISSN 1422-0067; 2015-12-01; volume 16, issue 12, pages 30343-30361; DOI 10.3390/ijms161226237; ijms161226237; Shunfang Wang and Shuhui Liu, both of the School of Information Science and Engineering, Yunnan University, Kunming 650504, China
collection DOAJ
language English
format Article
sources DOAJ
author Shunfang Wang
Shuhui Liu
publisher MDPI AG
series International Journal of Molecular Sciences
issn 1422-0067
publishDate 2015-12-01
description An effective representation of a protein sequence plays a crucial role in protein sub-nuclear localization. The existing representations, such as dipeptide composition (DipC), pseudo-amino acid composition (PseAAC) and position-specific scoring matrix (PSSM), are insufficient to represent a protein sequence because each reflects only a single perspective. This paper therefore proposes two fusion feature representations, DipPSSM and PseAAPSSM, which integrate PSSM with DipC and PseAAC, respectively. In constructing each fusion representation, balance factors are introduced to weight the importance of its components, and their optimal values are sought by a genetic algorithm. Because the proposed representations are high-dimensional, linear discriminant analysis (LDA) is used to find their important low-dimensional structure, which is essential for classification and location prediction. Numerical experiments on two public datasets, using a KNN classifier and cross-validation tests, showed that in terms of the common indexes of sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC), the proposed fusion representations outperform the traditional representations in protein sub-nuclear localization, and that a representation treated by LDA outperforms the untreated one.
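The record does not give the formula used to combine the components, so the following is only a minimal sketch of how a balance factor might weight two feature vectors before they are joined. It assumes a weighted concatenation with a single factor in [0, 1]. The function names `dipeptide_composition` and `fuse` and the toy PSSM-derived vector are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400-dimensional dipeptide frequency vector (DipC)."""
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    counts = Counter(pairs)
    # Only pairs of standard residues are counted; avoid division by zero.
    total = sum(counts[d] for d in DIPEPTIDES) or 1
    return [counts[d] / total for d in DIPEPTIDES]


def fuse(features_a, features_b, balance):
    """Assumed fusion rule: weighted concatenation of two feature vectors.

    `balance` weights the first vector and (1 - balance) the second.
    This is an illustrative assumption, not the paper's stated formula.
    """
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must lie in [0, 1]")
    weighted_a = [balance * x for x in features_a]
    weighted_b = [(1.0 - balance) * x for x in features_b]
    return weighted_a + weighted_b


if __name__ == "__main__":
    dipc = dipeptide_composition("ACACDE")
    toy_pssm_features = [0.1] * 20  # hypothetical stand-in for PSSM-derived features
    fused = fuse(dipc, toy_pssm_features, balance=0.4)
    print(len(fused))
```

In a setup like this, a search procedure such as the genetic algorithm mentioned in the description would propose candidate values of `balance` and keep those that give the best cross-validated classification score.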
topic protein sub-nuclear localization
DipPSSM
PseAAPSSM
linear discriminant analysis
KNN classifier
url http://www.mdpi.com/1422-0067/16/12/26237
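The four evaluation indexes named in the description (sensitivity, specificity, accuracy and MCC) have standard definitions in terms of confusion-matrix counts. The sketch below computes them for one class treated as positive against all others. The function name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, accuracy and MCC from counts.

    tp, fp, tn, fn are true/false positive/negative counts for one class
    treated as "positive" against all other classes.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal total is zero; 0.0 is used here
    # as a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    print(binary_metrics(tp=8, fp=2, tn=85, fn=5))
```

For a multi-class problem such as sub-nuclear localization, these counts would typically be gathered once per location class, which yields one set of indexes per class.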