RAFTS3G: an efficient and versatile clustering software to analyses in large protein datasets

Abstract Background: Clustering methods are essential for partitioning biological samples and are useful for reducing information complexity in large datasets. Tools in this context usually generate data with greedy algorithms that solve some data mining difficulties, which can degrade biological rel...


Bibliographic Details
Main Authors: Bruno Thiago de Lima Nichio, Aryel Marlus Repula de Oliveira, Camilla Reginatto de Pierri, Leticia Graziela Costa Santos, Alexandre Quadros Lejambre, Ricardo Assunção Vialle, Nilson Antônio da Rocha Coimbra, Dieval Guizelini, Jeroniza Nunes Marchaukoski, Fabio de Oliveira Pedrosa, Roberto Tadeu Raittz
Format: Article
Language: English
Published: BMC 2019-07-01
Series: BMC Bioinformatics
Online Access: http://link.springer.com/article/10.1186/s12859-019-2973-4
id doaj-b2773d0fe0e344969419c808173a62fc
record_format Article
spelling doaj-b2773d0fe0e344969419c808173a62fc; 2020-11-25T02:40:27Z; eng; BMC; BMC Bioinformatics; ISSN 1471-2105; 2019-07-01; 20(1):1-7; doi:10.1186/s12859-019-2973-4
RAFTS3G: an efficient and versatile clustering software to analyses in large protein datasets
Authors: Bruno Thiago de Lima Nichio; Aryel Marlus Repula de Oliveira; Camilla Reginatto de Pierri; Leticia Graziela Costa Santos; Alexandre Quadros Lejambre; Ricardo Assunção Vialle; Nilson Antônio da Rocha Coimbra; Dieval Guizelini; Jeroniza Nunes Marchaukoski; Fabio de Oliveira Pedrosa; Roberto Tadeu Raittz
Affiliation (all authors): Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná
http://link.springer.com/article/10.1186/s12859-019-2973-4
collection DOAJ
language English
format Article
sources DOAJ
author Bruno Thiago de Lima Nichio
Aryel Marlus Repula de Oliveira
Camilla Reginatto de Pierri
Leticia Graziela Costa Santos
Alexandre Quadros Lejambre
Ricardo Assunção Vialle
Nilson Antônio da Rocha Coimbra
Dieval Guizelini
Jeroniza Nunes Marchaukoski
Fabio de Oliveira Pedrosa
Roberto Tadeu Raittz
author_sort Bruno Thiago de Lima Nichio
title RAFTS3G: an efficient and versatile clustering software to analyses in large protein datasets
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2019-07-01
description Abstract Background: Clustering methods are essential for partitioning biological samples and are useful for reducing information complexity in large datasets. Tools in this context usually generate data with greedy algorithms that solve some data mining difficulties, which can degrade biologically relevant information during the clustering process. The lack of standardized metrics and consistent databases also raises questions about the clustering efficiency of some methods. Benchmarks are needed to explore the full potential of clustering methods, among which alignment-free methods stand out, and a good choice of dataset is essential for them. Results: Here we present a new approach to data mining in large protein sequence datasets, the Rapid Alignment Free Tool for Sequences Similarity Search to Groups (RAFTS3G), a clustering method that aims to lose less biological information while generating groups. The strategy developed in our algorithm is optimized to be more stringent, which is reflected in increased accuracy and sensitivity in the generation of clusters over a wide range of similarity. RAFTS3G is the better choice compared with three main methods when the user wants a more reliable result even without knowing the ideal clustering threshold. Conclusion: In general, RAFTS3G is able to group up to millions of biological sequences in large datasets, which makes it a remarkably efficient option for clustering. Compared with other "gold-standard" methods for clustering large biological data, RAFTS3G maintains the balance between reducing redundancy in biological information and creating consistent groups. We apply the binary search concept to grouped sequences, which maintains the sensitivity/accuracy relation and minimizes the time needed to generate data with the RAFTS3G process.
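Illustrative note: the abstract contrasts RAFTS3G with greedy clustering tools and singles out alignment-free comparison, but this record gives no algorithmic detail. The sketch below only shows what a generic greedy, alignment-free clusterer can look like. It is not the RAFTS3G algorithm: the k-mer Jaccard similarity, the function names, the threshold values, and the toy sequences are all assumptions made for illustration.

```python
from typing import Dict, List


def kmer_set(seq: str, k: int = 3) -> frozenset:
    """Return the set of overlapping k-mers in a sequence (no alignment needed)."""
    return frozenset(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two k-mer sets (0.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def greedy_cluster(seqs: Dict[str, str], threshold: float = 0.5, k: int = 3) -> List[List[str]]:
    """Greedy incremental clustering (hypothetical helper, not from the paper).

    Sequences are visited longest first (ties broken by name). Each one joins
    the first cluster whose representative is similar enough; otherwise it
    becomes the representative of a new cluster.
    """
    order = sorted(seqs, key=lambda name: (-len(seqs[name]), name))
    reps: List[frozenset] = []      # k-mer profile of each cluster representative
    clusters: List[List[str]] = []  # member names, parallel to reps
    for name in order:
        profile = kmer_set(seqs[name], k)
        for idx, rep in enumerate(reps):
            if jaccard(profile, rep) >= threshold:
                clusters[idx].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            reps.append(profile)
            clusters.append([name])
    return clusters


# Toy input (made-up sequences): p1 and p2 differ only in the last residue.
toy = {
    "p1": "MKTAYIAKQRQISFVKSHFSRQ",
    "p2": "MKTAYIAKQRQISFVKSHFSRA",
    "p3": "GSSGSSGWWPLLCCDDEEHHNN",
}
print(greedy_cluster(toy, threshold=0.5))
```

Because a greedy pass assigns each sequence to the first acceptable representative, the result depends on the chosen threshold and visiting order, which is the kind of sensitivity the abstract alludes to when it mentions users not knowing the ideal threshold.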
url http://link.springer.com/article/10.1186/s12859-019-2973-4