Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus

Abstract. Background: Coronaviruses can cross the species barrier and cause severe respiratory syndrome in humans. SARS-CoV-2, which may have originated in bats, is still circulating in China. In this study, a prediction model is proposed to evaluate the infection risk of non-human-origin coronavirus for...


Bibliographic Details
Main Authors: Xiao-Li Qiang, Peng Xu, Gang Fang, Wen-Bin Liu, Zheng Kou
Format: Article
Language: English
Published: BMC 2020-03-01
Series: Infectious Diseases of Poverty
Subjects: Coronavirus; Cross-species infection; Spike protein; Machine learning
Online Access: http://link.springer.com/article/10.1186/s40249-020-00649-8
id doaj-7a860f1a59264e96b86a5563054ea803
record_format Article
spelling doaj-7a860f1a59264e96b86a5563054ea803 2020-11-25T03:31:07Z eng BMC Infectious Diseases of Poverty 2049-9957 2020-03-01 9118 10.1186/s40249-020-00649-8
Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus
Xiao-Li Qiang; Peng Xu; Gang Fang; Wen-Bin Liu; Zheng Kou (all: Institute of Computing Science and Technology, Guangzhou University)
Abstract
Background: Coronaviruses can cross the species barrier and cause severe respiratory syndrome in humans. SARS-CoV-2, which may have originated in bats, is still circulating in China. In this study, a prediction model is proposed to evaluate the infection risk of non-human-origin coronaviruses for early warning.
Methods: The spike protein sequences of 2666 coronaviruses were collected from the 2019 Novel Coronavirus Resource (2019nCoVR) database of the China National Genomics Data Center on Jan 29, 2020. A total of 507 human-origin viruses were regarded as positive samples, whereas 2159 non-human-origin viruses were regarded as negative samples. To capture the key information of the spike protein, three feature encoding algorithms (amino acid composition, AAC; parallel correlation-based pseudo-amino-acid composition, PC-PseAAC; and G-gap dipeptide composition, GGAP) were used to train 41 random forest models. The optimal feature with the best performance was identified by the multidimensional scaling method, which was used to explore the pattern of human coronavirus.
Results: The 10-fold cross-validation results showed that good performance was achieved with the GGAP (g = 3) feature. The predictive model achieved a maximum accuracy (ACC) of 98.18% with a Matthews correlation coefficient (MCC) of 0.9638. Seven clusters for human coronaviruses (229E, NL63, OC43, HKU1, MERS-CoV, SARS-CoV, and SARS-CoV-2) were found. The cluster for SARS-CoV-2 was very close to that for SARS-CoV, which suggests that both viruses have the same human receptor (angiotensin-converting enzyme II). The large gap in the distance curve suggests that the origin of SARS-CoV-2 is not clear and that surveillance in the field should continue. The smooth distance curve for SARS-CoV suggests that its close relatives still exist in nature and that public health remains challenged.
Conclusions: The optimal feature (GGAP, g = 3) performed well in predicting infection risk and could be used to explore the evolutionary dynamic in a simple, fast and large-scale manner. The study may be beneficial for the surveillance of coronavirus genome mutations in the field.
http://link.springer.com/article/10.1186/s40249-020-00649-8
Coronavirus; Cross-species infection; Spike protein; Machine learning
collection DOAJ
language English
format Article
sources DOAJ
author Xiao-Li Qiang
Peng Xu
Gang Fang
Wen-Bin Liu
Zheng Kou
spellingShingle Xiao-Li Qiang
Peng Xu
Gang Fang
Wen-Bin Liu
Zheng Kou
Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus
Infectious Diseases of Poverty
Coronavirus
Cross-species infection
Spike protein
Machine learning
author_facet Xiao-Li Qiang
Peng Xu
Gang Fang
Wen-Bin Liu
Zheng Kou
author_sort Xiao-Li Qiang
title Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus
title_short Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus
title_full Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus
title_fullStr Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus
title_full_unstemmed Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus
title_sort using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus
publisher BMC
series Infectious Diseases of Poverty
issn 2049-9957
publishDate 2020-03-01
description Abstract
Background: Coronaviruses can cross the species barrier and cause severe respiratory syndrome in humans. SARS-CoV-2, which may have originated in bats, is still circulating in China. In this study, a prediction model is proposed to evaluate the infection risk of non-human-origin coronaviruses for early warning.
Methods: The spike protein sequences of 2666 coronaviruses were collected from the 2019 Novel Coronavirus Resource (2019nCoVR) database of the China National Genomics Data Center on Jan 29, 2020. A total of 507 human-origin viruses were regarded as positive samples, whereas 2159 non-human-origin viruses were regarded as negative samples. To capture the key information of the spike protein, three feature encoding algorithms (amino acid composition, AAC; parallel correlation-based pseudo-amino-acid composition, PC-PseAAC; and G-gap dipeptide composition, GGAP) were used to train 41 random forest models. The optimal feature with the best performance was identified by the multidimensional scaling method, which was used to explore the pattern of human coronavirus.
Results: The 10-fold cross-validation results showed that good performance was achieved with the GGAP (g = 3) feature. The predictive model achieved a maximum accuracy (ACC) of 98.18% with a Matthews correlation coefficient (MCC) of 0.9638. Seven clusters for human coronaviruses (229E, NL63, OC43, HKU1, MERS-CoV, SARS-CoV, and SARS-CoV-2) were found. The cluster for SARS-CoV-2 was very close to that for SARS-CoV, which suggests that both viruses have the same human receptor (angiotensin-converting enzyme II). The large gap in the distance curve suggests that the origin of SARS-CoV-2 is not clear and that surveillance in the field should continue. The smooth distance curve for SARS-CoV suggests that its close relatives still exist in nature and that public health remains challenged.
Conclusions: The optimal feature (GGAP, g = 3) performed well in predicting infection risk and could be used to explore the evolutionary dynamic in a simple, fast and large-scale manner. The study may be beneficial for the surveillance of coronavirus genome mutations in the field.
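Illustrative note: the G-gap dipeptide composition (GGAP) named in the abstract can be sketched as below. This is a minimal sketch and not the authors' implementation. It assumes the common definition in which each of the 400 features is the relative frequency of an ordered residue pair separated by g intervening positions. The function name `ggap_composition`, the handling of non-standard residues, and the toy input sequence are hypothetical choices made for illustration only.

```python
from itertools import product

# The 20 standard amino acids; any other symbol (e.g. X) is skipped (assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ggap_composition(sequence, g=3):
    """Return a 400-dimensional g-gap dipeptide composition vector.

    Feature (a, b) is the fraction of valid residue pairs in which residue a
    at position i is followed by residue b at position i + g + 1.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    seq = sequence.upper()
    total = 0
    # Slide over every position that still has a partner g + 1 steps ahead.
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:  # ignore pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short (or no valid pairs): return an all-zero vector.
        return [0.0] * len(pairs)
    return [counts[p] / total for p in pairs]


if __name__ == "__main__":
    # Toy sequence for illustration, not a real spike protein.
    vector = ggap_composition("ACDEFGHIKLMNPQRSTVWYACDE", g=3)
    print(len(vector))
```

A vector of this kind, computed per spike sequence, could then be passed to any standard random forest classifier. The training setup reported in the abstract (41 models, 10-fold cross-validation) is not reproduced here.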
topic Coronavirus
Cross-species infection
Spike protein
Machine learning
url http://link.springer.com/article/10.1186/s40249-020-00649-8
work_keys_str_mv AT xiaoliqiang usingthespikeproteinfeaturetopredictinfectionriskandmonitortheevolutionarydynamicofcoronavirus
AT pengxu usingthespikeproteinfeaturetopredictinfectionriskandmonitortheevolutionarydynamicofcoronavirus
AT gangfang usingthespikeproteinfeaturetopredictinfectionriskandmonitortheevolutionarydynamicofcoronavirus
AT wenbinliu usingthespikeproteinfeaturetopredictinfectionriskandmonitortheevolutionarydynamicofcoronavirus
AT zhengkou usingthespikeproteinfeaturetopredictinfectionriskandmonitortheevolutionarydynamicofcoronavirus
_version_ 1724573609213558784