Algorithms, data structures, and numerics for likelihood-based phylogenetic inference of huge trees

Abstract. Background: The rapid accumulation of molecular sequence data, driven by novel wet-lab sequencing technologies, poses new challenges for large-scale maximum likelihood-based phylogenetic analyses on trees with more than 30,000 taxa and several g...


Bibliographic Details
Main Authors: Izquierdo-Carrasco Fernando, Smith Stephen A, Stamatakis Alexandros
Format: Article
Language: English
Published: BMC 2011-12-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/12/470
id doaj-27c6c73d896c44c3944763946fa56f34
record_format Article
spelling doaj-27c6c73d896c44c3944763946fa56f34 2020-11-24T22:12:50Z eng BMC BMC Bioinformatics 1471-2105 2011-12-01 12 1 470 10.1186/1471-2105-12-470 Algorithms, data structures, and numerics for likelihood-based phylogenetic inference of huge trees Izquierdo-Carrasco Fernando Smith Stephen A Stamatakis Alexandros http://www.biomedcentral.com/1471-2105/12/470
collection DOAJ
language English
format Article
sources DOAJ
author Izquierdo-Carrasco Fernando
Smith Stephen A
Stamatakis Alexandros
title Algorithms, data structures, and numerics for likelihood-based phylogenetic inference of huge trees
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2011-12-01
description Abstract
Background: The rapid accumulation of molecular sequence data, driven by novel wet-lab sequencing technologies, poses new challenges for large-scale maximum likelihood-based phylogenetic analyses on trees with more than 30,000 taxa and several genes. The three main computational challenges are: numerical stability, the scalability of search algorithms, and the high memory requirements for computing the likelihood.
Results: We introduce methods for solving these three key problems and provide respective proof-of-concept implementations in RAxML. The mechanisms presented here are not RAxML-specific and can thus be applied to any likelihood-based (Bayesian or maximum likelihood) tree inference program. We develop a new search strategy that can reduce the time required for tree inferences by more than 50% while yielding equally good trees (in the statistical sense) for well-chosen starting trees. We present an adaptation of the Subtree Equality Vector technique for phylogenomic datasets with missing data (already available in RAxML v728) that can reduce execution times and memory requirements by up to 50%. Finally, we discuss issues pertaining to the numerical stability of the Γ model of rate heterogeneity on very large trees and argue in favor of rate heterogeneity models that use a single rate or rate category for each site to resolve these problems.
Conclusions: We address three major issues pertaining to large-scale tree reconstruction under maximum likelihood and propose respective solutions. Respective proof-of-concept/production-level implementations of our ideas are made available as open-source code.
url http://www.biomedcentral.com/1471-2105/12/470