Compressive genomics for protein databases

Motivation: The exponential growth of protein sequence databases has increasingly made the fundamental question of searching for homologs a computational bottleneck. The amount of unique data, however, is not growing nearly as fast; we can exploit this fact to greatly accelerate homology search. Acc...

Bibliographic Details
Main Authors:	Daniels, Noah M. (Author), Gallant, Andrew (Author), Peng, Jian (Contributor), Cowen, Lenore J. (Author), Baym, Michael Hartmann (Contributor), Berger Leighton, Bonnie (Contributor)
Other Authors:	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory (Contributor), Massachusetts Institute of Technology. Department of Mathematics (Contributor)
Format:	Article
Language:	English
Published:	Oxford University Press, 2016-08-26T18:03:39Z.
Subjects:	Article
Online Access:	Get fulltext: http://hdl.handle.net/1721.1/104045


LEADER	02564 am a22003013u 4500
001	104045
042			|a dc
100	1	0	|a Daniels, Noah M. |e author
100	1	0	|a Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory |e contributor
100	1	0	|a Massachusetts Institute of Technology. Department of Mathematics |e contributor
100	1	0	|a Peng, Jian |e contributor
100	1	0	|a Baym, Michael Hartmann |e contributor
100	1	0	|a Berger Leighton, Bonnie |e contributor
700	1	0	|a Gallant, Andrew |e author
700	1	0	|a Peng, Jian |e author
700	1	0	|a Cowen, Lenore J. |e author
700	1	0	|a Baym, Michael Hartmann |e author
700	1	0	|a Berger Leighton, Bonnie |e author
245	0	0	|a Compressive genomics for protein databases
260			|b Oxford University Press, |c 2016-08-26T18:03:39Z.
856			|z Get fulltext |u http://hdl.handle.net/1721.1/104045
520			|a Motivation: The exponential growth of protein sequence databases has increasingly made the fundamental question of searching for homologs a computational bottleneck. The amount of unique data, however, is not growing nearly as fast; we can exploit this fact to greatly accelerate homology search. Acceleration of programs in the popular PSI/DELTA-BLAST family of tools will speed up not only homology search directly but also the huge collection of other current programs that primarily interact with large protein databases via precisely these tools. Results: We introduce a suite of homology search tools, powered by compressively accelerated protein BLAST (CaBLASTP), which are significantly faster than, and comparably accurate to, all known state-of-the-art tools, including HHblits, DELTA-BLAST and PSI-BLAST. Further, our tools are implemented in a manner that allows direct substitution into existing analysis pipelines. The key idea is that we introduce a local similarity-based compression scheme that allows us to operate directly on the compressed data. Importantly, CaBLASTP's runtime scales almost linearly in the amount of unique data, as opposed to current BLASTP variants, which scale linearly in the size of the full protein database being searched. Our compressive algorithms will speed up many tasks, such as protein structure prediction and orthology mapping, which rely heavily on homology search.
520			|a Simons Foundation
520			|a National Institutes of Health (U.S.) (NIH grant (R01GM080330))
520			|a National Science Foundation (U.S.) (NSF MSPRF grant)
546			|a en_US
655	7		|a Article
773			|t Bioinformatics

Similar Items