Alignment-free clustering of large data sets of unannotated protein conserved regions using minhashing

Abstract. Background: Clustering of protein sequences is of key importance in predicting the structure and function of newly sequenced proteins and is also of use for their annotation. With the advent of multiple high-throughput sequencing technologies, new protein sequences are becoming available at...

Bibliographic Details
Main Authors:	Armen Abnousi, Shira L. Broschat, Ananth Kalyanaraman
Format:	Article
Language:	English
Published:	BMC 2018-03-01
Series:	BMC Bioinformatics
Subjects:	Protein conserved region Clustering Protein domain families
Online Access:	http://link.springer.com/article/10.1186/s12859-018-2080-y

id	doaj-aa8bde82f49447489d855b62c8b337bd
record_format	Article
spelling	doaj-aa8bde82f49447489d855b62c8b337bd | 2020-11-24T23:34:58Z | eng | BMC | BMC Bioinformatics | 1471-2105 | 2018-03-01 | 19 | 1 | 1 | 18 | 10.1186/s12859-018-2080-y | Alignment-free clustering of large data sets of unannotated protein conserved regions using minhashing | Armen Abnousi (School of EECS, Washington State University); Shira L. Broschat (School of EECS, Washington State University); Ananth Kalyanaraman (School of EECS, Washington State University) | http://link.springer.com/article/10.1186/s12859-018-2080-y | Protein conserved region; Clustering; Protein domain families
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Armen Abnousi Shira L. Broschat Ananth Kalyanaraman
title	Alignment-free clustering of large data sets of unannotated protein conserved regions using minhashing
publisher	BMC
series	BMC Bioinformatics
issn	1471-2105
publishDate	2018-03-01
description	Abstract. Background: Clustering of protein sequences is of key importance in predicting the structure and function of newly sequenced proteins and is also of use for their annotation. With the advent of multiple high-throughput sequencing technologies, new protein sequences are becoming available at an extraordinary rate. The rapid growth rate has impeded deployment of existing protein clustering/annotation tools which depend largely on pairwise sequence alignment. Results: In this paper, we propose an alignment-free clustering approach, coreClust, for annotating protein sequences using detected conserved regions. The proposed algorithm uses Min-Wise Independent Hashing for identifying similar conserved regions. Min-Wise Independent Hashing works by generating a (w,c)-sketch for each document and comparing these sketches. Our algorithm fits well within the MapReduce framework, permitting scalability. We show that coreClust generates results comparable to existing known methods. In particular, we show that the clusters generated by our algorithm capture the subfamilies of the Pfam domain families for which the sequences in a cluster have a similar domain architecture. We show that for a data set of 90,000 sequences (about 250,000 domain regions), the clusters generated by our algorithm give a 75% average weighted F1 score, our accuracy metric, when compared to the clusters generated by a semi-exhaustive pairwise alignment algorithm. Conclusions: The new clustering algorithm can be used to generate meaningful clusters of conserved regions. It is a scalable method that when paired with our prior work, NADDA for detecting conserved regions, provides a complete end-to-end pipeline for annotating protein sequences.
topic	Protein conserved region Clustering Protein domain families
url	http://link.springer.com/article/10.1186/s12859-018-2080-y

Similar Items