N-GlycoGo: Predicting Protein N-Glycosylation Sites on Imbalanced Data Sets by Using Heterogeneous and Comprehensive Strategy

Glycosylation is the most complex post-translational modification of proteins. It participates in many biological processes in the human body and is closely related to many disease states. Among glycosylation types, N-linked glycosylation accounts for the largest share of available data. However, the current N-linked glycosylati...

Bibliographic Details
Main Authors: Ching-Hsuan Chien, Chi-Chang Chang, Shih-Huan Lin, Chi-Wei Chen, Zong-Han Chang, Yen-Wei Chu
Format: Article
Language: English
Published: IEEE 2020-01-01
Series: IEEE Access
Subjects: Ensemble learning; machine learning; N-linked glycosylation
Online Access: https://ieeexplore.ieee.org/document/9187771/
id doaj-4bd4827ff15f4a40b4f9ab7f9f425e54
record_format Article
spelling doaj-4bd4827ff15f4a40b4f9ab7f9f425e54
2021-03-30T03:32:25Z
eng
IEEE
IEEE Access
2169-3536
2020-01-01
8
165944
165950
10.1109/ACCESS.2020.3022629
9187771
N-GlycoGo: Predicting Protein N-Glycosylation Sites on Imbalanced Data Sets by Using Heterogeneous and Comprehensive Strategy
Ching-Hsuan Chien 0
Chi-Chang Chang 1 https://orcid.org/0000-0001-6513-9212
Shih-Huan Lin 2
Chi-Wei Chen 3
Zong-Han Chang 4
Yen-Wei Chu 5 https://orcid.org/0000-0002-5525-4011
Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
School of Medical Informatics, Chung Shan Medical University, Taichung, Taiwan
Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung, Taiwan
Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
Glycosylation is the most complex post-translational modification of proteins. It participates in many biological processes in the human body and is closely related to many disease states. Among glycosylation types, N-linked glycosylation accounts for the largest share of available data. However, current N-linked glycosylation prediction tools do not take into account the severe imbalance between positive and negative data. In this study, we used protein sequence and amino acid characteristics to construct an N-linked glycosylation prediction model called N-GlycoGo. Based on sequence, structure, and function, 11 heterogeneous features were encoded. Further, XGBoost was selected for modeling. Finally, independent testing of the human and mouse prediction models showed that N-GlycoGo achieved Matthews correlation coefficient (MCC) values of 0.397 and 0.719, respectively, higher than those of other glycosylation site prediction tools. We have developed a fast and accurate prediction tool, N-GlycoGo, for N-linked glycosylation. N-GlycoGo is available at http://ncblab.nchu.edu.tw/n-glycogo/.
https://ieeexplore.ieee.org/document/9187771/
Ensemble learning
machine learning
N-linked glycosylation
collection DOAJ
language English
format Article
sources DOAJ
author Ching-Hsuan Chien
Chi-Chang Chang
Shih-Huan Lin
Chi-Wei Chen
Zong-Han Chang
Yen-Wei Chu
title N-GlycoGo: Predicting Protein N-Glycosylation Sites on Imbalanced Data Sets by Using Heterogeneous and Comprehensive Strategy
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2020-01-01
description Glycosylation is the most complex post-translational modification of proteins. It participates in many biological processes in the human body and is closely related to many disease states. Among glycosylation types, N-linked glycosylation accounts for the largest share of available data. However, current N-linked glycosylation prediction tools do not take into account the severe imbalance between positive and negative data. In this study, we used protein sequence and amino acid characteristics to construct an N-linked glycosylation prediction model called N-GlycoGo. Based on sequence, structure, and function, 11 heterogeneous features were encoded. Further, XGBoost was selected for modeling. Finally, independent testing of the human and mouse prediction models showed that N-GlycoGo achieved Matthews correlation coefficient (MCC) values of 0.397 and 0.719, respectively, higher than those of other glycosylation site prediction tools. We have developed a fast and accurate prediction tool, N-GlycoGo, for N-linked glycosylation. N-GlycoGo is available at http://ncblab.nchu.edu.tw/n-glycogo/.
topic Ensemble learning
machine learning
N-linked glycosylation
url https://ieeexplore.ieee.org/document/9187771/
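
The description reports results as Matthews correlation coefficient (MCC) values on imbalanced data. As a rough illustration of why MCC suits that setting, the sketch below computes MCC and plain accuracy from confusion-matrix counts. The counts, the `mcc` and `accuracy` helper names, and the two example classifiers are hypothetical and are not taken from the N-GlycoGo study; only the standard MCC formula is assumed.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all sites labeled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical test set: 50 positive sites, 950 negative sites.
# Classifier A labels every site negative.
always_negative = dict(tp=0, tn=950, fp=0, fn=50)
# Classifier B recovers 30 of the 50 positives at the cost of 40 false alarms.
partial = dict(tp=30, tn=910, fp=40, fn=20)

for name, counts in [("always_negative", always_negative), ("partial", partial)]:
    print(f"{name}: accuracy={accuracy(**counts):.3f} mcc={mcc(**counts):.3f}")
```

In this made-up example the trivial classifier scores slightly higher on accuracy while its MCC is zero, which is the kind of distortion that motivates reporting MCC when negatives greatly outnumber positives.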