Algorithms, data structures, and numerics for likelihood-based phylogenetic inference of huge trees
Abstract. Background: The rapid accumulation of molecular sequence data, driven by novel wet-lab sequencing technologies, poses new challenges for large-scale maximum likelihood-based phylogenetic analyses on trees with more than 30,000 taxa and several g...
Main Authors: | Izquierdo-Carrasco Fernando, Smith Stephen A, Stamatakis Alexandros |
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2011-12-01 |
Series: | BMC Bioinformatics |
Online Access: | http://www.biomedcentral.com/1471-2105/12/470 |
id |
doaj-27c6c73d896c44c3944763946fa56f34 |
---|---|
record_format |
Article |
spelling |
doaj-27c6c73d896c44c3944763946fa56f34; 2020-11-24T22:12:50Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2011-12-01; 12; 1; 470; 10.1186/1471-2105-12-470; Algorithms, data structures, and numerics for likelihood-based phylogenetic inference of huge trees; Izquierdo-Carrasco Fernando; Smith Stephen A; Stamatakis Alexandros; http://www.biomedcentral.com/1471-2105/12/470 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Izquierdo-Carrasco Fernando Smith Stephen A Stamatakis Alexandros |
author_sort |
Izquierdo-Carrasco Fernando |
title |
Algorithms, data structures, and numerics for likelihood-based phylogenetic inference of huge trees |
publisher |
BMC |
series |
BMC Bioinformatics |
issn |
1471-2105 |
publishDate |
2011-12-01 |
description |
Abstract. Background: The rapid accumulation of molecular sequence data, driven by novel wet-lab sequencing technologies, poses new challenges for large-scale maximum likelihood-based phylogenetic analyses on trees with more than 30,000 taxa and several genes. The three main computational challenges are numerical stability, the scalability of search algorithms, and the high memory requirements for computing the likelihood. Results: We introduce methods for solving these three key problems and provide proof-of-concept implementations in RAxML. The mechanisms presented here are not RAxML-specific and can thus be applied to any likelihood-based (Bayesian or maximum likelihood) tree inference program. We develop a new search strategy that can reduce the time required for tree inferences by more than 50% while yielding equally good trees (in the statistical sense) for well-chosen starting trees. We present an adaptation of the Subtree Equality Vector technique for phylogenomic datasets with missing data (already available in RAxML v728) that can reduce execution times and memory requirements by up to 50%. Finally, we discuss issues pertaining to the numerical stability of the Γ model of rate heterogeneity on very large trees and argue in favor of rate heterogeneity models that use a single rate or rate category for each site to resolve these problems. Conclusions: We address three major issues pertaining to large-scale tree reconstruction under maximum likelihood and propose solutions. Proof-of-concept and production-level implementations of our ideas are made available as open-source code. |
url |
http://www.biomedcentral.com/1471-2105/12/470 |
_version_ |
1725802185130770432 |