N-GlycoGo: Predicting Protein N-Glycosylation Sites on Imbalanced Data Sets by Using Heterogeneous and Comprehensive Strategy
Glycosylation is the most complex post-translational modification of proteins. It participates in many biological processes in the human body and is closely related to many disease states. Among glycosylation types, N-linked glycosylation accounts for the largest share of available glycosylation data. However, the current N-linked glycosylati...
Main Authors: | Ching-Hsuan Chien, Chi-Chang Chang, Shih-Huan Lin, Chi-Wei Chen, Zong-Han Chang, Yen-Wei Chu |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2020-01-01 |
Series: | IEEE Access |
Subjects: | Ensemble learning, machine learning, N-linked glycosylation |
Online Access: | https://ieeexplore.ieee.org/document/9187771/ |
id |
doaj-4bd4827ff15f4a40b4f9ab7f9f425e54 |
---|---|
record_format |
Article |
spelling |
doaj-4bd4827ff15f4a40b4f9ab7f9f425e54; 2021-03-30T03:32:25Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2020-01-01; vol. 8, pp. 165944-165950; DOI 10.1109/ACCESS.2020.3022629; document 9187771; N-GlycoGo: Predicting Protein N-Glycosylation Sites on Imbalanced Data Sets by Using Heterogeneous and Comprehensive Strategy; Ching-Hsuan Chien (Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan); Chi-Chang Chang (https://orcid.org/0000-0001-6513-9212; School of Medical Informatics, Chung Shan Medical University, Taichung, Taiwan); Shih-Huan Lin (Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung, Taiwan); Chi-Wei Chen (Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan); Zong-Han Chang (Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan); Yen-Wei Chu (https://orcid.org/0000-0002-5525-4011; Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan); https://ieeexplore.ieee.org/document/9187771/; Ensemble learning, machine learning, N-linked glycosylation |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Ching-Hsuan Chien, Chi-Chang Chang, Shih-Huan Lin, Chi-Wei Chen, Zong-Han Chang, Yen-Wei Chu |
author_sort |
Ching-Hsuan Chien |
title |
N-GlycoGo: Predicting Protein N-Glycosylation Sites on Imbalanced Data Sets by Using Heterogeneous and Comprehensive Strategy |
publisher |
IEEE |
series |
IEEE Access |
issn |
2169-3536 |
publishDate |
2020-01-01 |
description |
Glycosylation is the most complex post-translational modification of proteins. It participates in many biological processes in the human body and is closely related to many disease states. Among glycosylation types, N-linked glycosylation accounts for the largest share of available glycosylation data. However, current N-linked glycosylation prediction tools do not take into account the serious imbalance between positive and negative data. In this study, we used protein sequence and amino acid characteristics to construct an N-linked glycosylation prediction model called N-GlycoGo. Eleven heterogeneous features were encoded based on sequence, structure, and function, and XGBoost was selected for modeling. In independent testing, the human and mouse prediction models achieved Matthews correlation coefficient (MCC) values of 0.397 and 0.719, respectively, higher than those of other glycosylation site prediction tools. We have developed a fast and accurate prediction tool, N-GlycoGo, for N-linked glycosylation. N-GlycoGo is available at http://ncblab.nchu.edu.tw/n-glycogo/. |
topic |
Ensemble learning, machine learning, N-linked glycosylation |
url |
https://ieeexplore.ieee.org/document/9187771/ |
_version_ |
1724183278829699072 |
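
The description reports results as Matthews correlation coefficient (MCC) values on imbalanced data. As a minimal sketch of why MCC is a common choice in that setting, the snippet below computes MCC and plain accuracy from confusion-matrix counts. The function names and all counts are hypothetical and chosen only for illustration; they are not taken from the N-GlycoGo paper or its data sets.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical imbalanced set: 50 positive sites among 1000 candidates.
# A classifier that labels every site negative still looks accurate.
all_negative = dict(tp=0, tn=950, fp=0, fn=50)
print(accuracy(**all_negative))  # 0.95
print(mcc(**all_negative))       # 0.0

# A hypothetical classifier that actually finds some positives.
useful = dict(tp=30, tn=930, fp=20, fn=20)
print(accuracy(**useful), mcc(**useful))
```

Because MCC uses all four cells of the confusion matrix, a trivial majority-class predictor scores 0 even when its accuracy is high, which is why it is often preferred for comparing tools on imbalanced positive and negative sets.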