Extended many-item similarity indices for sets of nucleotide and protein sequences

Quantification of similarities between protein sequences or DNA/RNA strands is a (sub-)task that is ubiquitously present in bioinformatics workflows, and is usually accomplished by pairwise comparisons of sequences, utilizing simple (e.g. percent identity) or more intricate concepts (e.g. substituti...

Bibliographic Details
Main Authors:	Dávid Bajusz, Ramón Alain Miranda-Quintana, Anita Rácz, Károly Héberger
Format:	Article
Language:	English
Published:	Elsevier 2021-01-01
Series:	Computational and Structural Biotechnology Journal
Subjects:	Multiple comparisons; DNA sequences; Protein sequences; Diversity analysis; Similarity indices; Consistency
Online Access:	http://www.sciencedirect.com/science/article/pii/S2001037021002592

id	doaj-c0d7a08a80384582ab85eb23e6a40801
record_format	Article
spelling	doaj-c0d7a08a80384582ab85eb23e6a40801 | 2021-06-27T04:36:45Z | eng | Elsevier | Computational and Structural Biotechnology Journal | 2001-0370 | 2021-01-01 | 19 | 3628 | 3639 | Extended many-item similarity indices for sets of nucleotide and protein sequences | Dávid Bajusz (0) | Ramón Alain Miranda-Quintana (1) | Anita Rácz (2) | Károly Héberger (3) | (0) Medicinal Chemistry Research Group, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117 Budapest, Hungary | (1) Department of Chemistry and Quantum Theory Project, University of Florida, Gainesville, FL 32611, USA; Corresponding authors. | (2) Plasma Chemistry Research Group, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117 Budapest, Hungary | (3) Plasma Chemistry Research Group, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117 Budapest, Hungary; Corresponding authors.
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Dávid Bajusz, Ramón Alain Miranda-Quintana, Anita Rácz, Károly Héberger
title	Extended many-item similarity indices for sets of nucleotide and protein sequences
publisher	Elsevier
series	Computational and Structural Biotechnology Journal
issn	2001-0370
publishDate	2021-01-01
description	Quantification of similarities between protein sequences or DNA/RNA strands is a (sub-)task that is ubiquitously present in bioinformatics workflows, and is usually accomplished by pairwise comparisons of sequences, utilizing simple (e.g. percent identity) or more intricate concepts (e.g. substitution scoring matrices). Complex tasks (such as clustering) rely on a large number of pairwise comparisons under the hood, instead of a direct quantification of set similarities. Based on our recently introduced framework that enables multiple comparisons of binary molecular fingerprints (i.e., direct calculation of the similarity of fingerprint sets), here we introduce novel symmetric similarity indices for analogous calculations on sets of character sequences with more than two (t) possible items (e.g. DNA/RNA sequences with t = 4, or protein sequences with t = 20). The features of these new indices are studied in detail with analysis of variance (ANOVA), and demonstrated with three case studies of protein/DNA sequences with varying degrees of similarity (or evolutionary proximity). The Python code for the extended many-item similarity indices is publicly available at: https://github.com/ramirandaq/tn_Comparisons.
topic	Multiple comparisons; DNA sequences; Protein sequences; Diversity analysis; Similarity indices; Consistency
url	http://www.sciencedirect.com/science/article/pii/S2001037021002592
