Identification of SARS-CoV-2 origin: Using Ngrams, principal component analysis and Random Forest algorithm

COVID-19 is an infectious disease caused by the newly discovered SARS-CoV-2 virus. This virus causes a respiratory tract infection; symptoms include dry cough, fever, tiredness and, in more severe cases, breathing difficulty. SARS-CoV-2 is an extremely contagious virus that is spreading rapidly all o...


Bibliographic Details
Main Authors: Hamoucha El Boujnouni, Mohamed Rahouti, Mohamed El Boujnouni
Format: Article
Language: English
Published: Elsevier 2021-01-01
Series: Informatics in Medicine Unlocked
Subjects: Bioinformatics; Genomes; SARS-CoV-2; COVID-19; Ngrams; Principal component analysis
Online Access: http://www.sciencedirect.com/science/article/pii/S2352914821000678
id doaj-a68e5ecdbf80462d90f0cbdac4e8665b
record_format Article
spelling doaj-a68e5ecdbf80462d90f0cbdac4e8665b
2021-06-19T04:55:02Z
eng
Elsevier
Informatics in Medicine Unlocked
2352-9148
2021-01-01
24
100577
Identification of SARS-CoV-2 origin: Using Ngrams, principal component analysis and Random Forest algorithm
Hamoucha El Boujnouni 0
Mohamed Rahouti 1
Mohamed El Boujnouni 2
Research Center of Plant and Microbial Biotechnologies, Biodiversity, and Environment, Faculty of Sciences, Mohammed V University in Rabat, PO Box 1014, Morocco; Corresponding author.
Research Center of Plant and Microbial Biotechnologies, Biodiversity, and Environment, Faculty of Sciences, Mohammed V University in Rabat, PO Box 1014, Morocco
Laboratory of Information Technologies, National School of Applied Sciences, Chouaib Doukkali University in El Jadida, PO Box 1166, Morocco
collection DOAJ
language English
format Article
sources DOAJ
author Hamoucha El Boujnouni
Mohamed Rahouti
Mohamed El Boujnouni
title Identification of SARS-CoV-2 origin: Using Ngrams, principal component analysis and Random Forest algorithm
publisher Elsevier
series Informatics in Medicine Unlocked
issn 2352-9148
publishDate 2021-01-01
description COVID-19 is an infectious disease caused by the newly discovered SARS-CoV-2 virus. This virus causes a respiratory tract infection; symptoms include dry cough, fever, tiredness and, in more severe cases, breathing difficulty. SARS-CoV-2 is an extremely contagious virus that is spreading rapidly all over the world, and the scientific community is working tirelessly to find an effective treatment. This paper aims to determine the origin of this virus by comparing its nucleic acid sequence with those of all members of the Coronaviridae family. The study uses a new approach that combines three techniques: Ngrams (for text categorization), Principal Component Analysis (for dimensionality reduction) and the Random Forest algorithm (for supervised classification). The experimental results show that a large set of SARS-CoV-2 genomes, collected from different locations around the world, presents significant similarities to coronavirus genomes found in pangolins. This finding confirms some previous results obtained by other methods, which also suggest that pangolins should be considered as possible hosts in the emergence of the new coronavirus.
topic Bioinformatics
Genomes
SARS-CoV-2
COVID-19
Ngrams
Principal component analysis
url http://www.sciencedirect.com/science/article/pii/S2352914821000678
_version_ 1721371776118685696
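
Note: the description above names Ngrams as the first of three steps applied to nucleic acid sequences. The record does not give the authors' n-gram length, normalization, or implementation, so the following is only a minimal sketch of that first step under assumptions of our own: overlapping n-grams over the alphabet A, C, G, T, relative frequencies as features, and hypothetical helper names (`ngram_counts`, `ngram_vector`). The Principal Component Analysis and Random Forest steps are not shown.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous nucleotides are counted as features


def ngram_counts(sequence, n=3):
    """Count overlapping n-grams in a nucleotide sequence (case-insensitive)."""
    seq = sequence.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def ngram_vector(sequence, n=3):
    """Return relative n-gram frequencies over all 4**n n-grams, in a fixed order.

    N-grams containing symbols outside ALPHABET (for example 'N') are ignored.
    """
    counts = ngram_counts(sequence, n)
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=n)]
    total = sum(counts[g] for g in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[g] / total for g in vocabulary]


# Toy sequence, invented for illustration; not data from the article.
toy_sequence = "ATGATGCC"
toy_counts = ngram_counts(toy_sequence, n=3)
toy_vector = ngram_vector(toy_sequence, n=2)
```

Each genome would yield one fixed-length vector of this kind (16 values for n=2, 64 for n=3), which is the sort of numeric table that a dimensionality-reduction step and a supervised classifier could then take as input.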