Algorithms, data structures, and numerics for likelihood-based phylogenetic inference of huge trees


Bibliographic Details
Main Authors: Izquierdo-Carrasco Fernando, Smith Stephen A, Stamatakis Alexandros
Format: Article
Language: English
Published: BMC 2011-12-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/12/470
Description
Summary:
Background: The rapid accumulation of molecular sequence data, driven by novel wet-lab sequencing technologies, poses new challenges for large-scale maximum likelihood-based phylogenetic analyses on trees with more than 30,000 taxa and several genes. The three main computational challenges are: numerical stability, the scalability of search algorithms, and the high memory requirements for computing the likelihood.
Results: We introduce methods for solving these three key problems and provide respective proof-of-concept implementations in RAxML. The mechanisms presented here are not RAxML-specific and can thus be applied to any likelihood-based (Bayesian or maximum likelihood) tree inference program. We develop a new search strategy that can reduce the time required for tree inferences by more than 50% while yielding equally good trees (in the statistical sense) for well-chosen starting trees. We present an adaptation of the Subtree Equality Vector technique for phylogenomic datasets with missing data (already available in RAxML v728) that can reduce execution times and memory requirements by up to 50%. Finally, we discuss issues pertaining to the numerical stability of the Γ model of rate heterogeneity on very large trees and argue in favor of rate heterogeneity models that use a single rate or rate category for each site to resolve these problems.
Conclusions: We address three major issues pertaining to large scale tree reconstruction under maximum likelihood and propose respective solutions. Respective proof-of-concept/production-level implementations of our ideas are made available as open-source code.
ISSN:1471-2105
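
Note on numerical stability: the summary names numerical stability as one of the three challenges on very large trees. The sketch below is not taken from the article or from RAxML. It only illustrates, under assumed values, why a long product of small likelihood terms underflows in double precision, and how one generic remedy (rescaling by a fixed factor and counting the rescaling events) keeps the logarithm of the result recoverable. The names `SCALE_THRESHOLD`, `SCALE_FACTOR`, `accumulate_scaled`, and `log_value`, as well as the input data, are hypothetical.

```python
import math

# Assumed constants for illustration only; real programs choose their own.
SCALE_THRESHOLD = 2.0 ** -256
SCALE_FACTOR = 2.0 ** 256


def accumulate_scaled(factors):
    """Multiply positive factors, rescaling whenever the running value gets tiny.

    Returns the rescaled value and the number of rescaling events applied.
    """
    value = 1.0
    scalings = 0
    for factor in factors:
        value *= factor
        if value < SCALE_THRESHOLD:
            # Multiplying by a power of two is exact in binary floating point.
            value *= SCALE_FACTOR
            scalings += 1
    return value, scalings


def log_value(value, scalings):
    """Undo the rescaling in log space to recover the log of the true product."""
    return math.log(value) - scalings * math.log(SCALE_FACTOR)


# Hypothetical input: 10,000 terms of size 1e-5 each.
factors = [1e-5] * 10_000

naive = math.prod(factors)                  # underflows to 0.0
value, scalings = accumulate_scaled(factors)
stable_log = log_value(value, scalings)     # finite log of the true product
```

Here the naive product collapses to zero, whereas the rescaled accumulation stays within the representable range and the log of the true product can still be reported. Whether and how a given inference program applies such rescaling, in particular under the Γ model, is a matter for the article itself.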