OCTAL: Optimal Completion of gene trees in polynomial time

Abstract. Background: For a combination of reasons (including data generation protocols, approaches to taxon and gene sampling, and gene birth and loss), estimated gene trees are often incomplete, meaning that they do not contain all of the species of interest. As incomplete gene trees can impact down...

Bibliographic Details
Main Authors: Sarah Christensen, Erin K. Molloy, Pranjal Vachaspati, Tandy Warnow
Format: Article
Language: English
Published: BMC 2018-03-01
Series: Algorithms for Molecular Biology
Subjects: Species trees, Gene trees, Missing data, Multispecies coalescent, Phylogenomics
Online Access: http://link.springer.com/article/10.1186/s13015-018-0124-5
id doaj-dd4fb2459f144549a27142ce5d1a1684
record_format Article
spelling doaj-dd4fb2459f144549a27142ce5d1a1684; 2020-11-25T02:26:30Z; eng; BMC; Algorithms for Molecular Biology; ISSN 1748-7188; 2018-03-01; volume 13, issue 1, pages 1-18; DOI 10.1186/s13015-018-0124-5
OCTAL: Optimal Completion of gene trees in polynomial time
Sarah Christensen, Erin K. Molloy, Pranjal Vachaspati, Tandy Warnow (all: Department of Computer Science, University of Illinois at Urbana-Champaign)
http://link.springer.com/article/10.1186/s13015-018-0124-5
Species trees; Gene trees; Missing data; Multispecies coalescent; Phylogenomics
collection DOAJ
language English
format Article
sources DOAJ
author Sarah Christensen
Erin K. Molloy
Pranjal Vachaspati
Tandy Warnow
title OCTAL: Optimal Completion of gene trees in polynomial time
publisher BMC
series Algorithms for Molecular Biology
issn 1748-7188
publishDate 2018-03-01
description Abstract
Background: For a combination of reasons (including data generation protocols, approaches to taxon and gene sampling, and gene birth and loss), estimated gene trees are often incomplete, meaning that they do not contain all of the species of interest. As incomplete gene trees can impact downstream analyses, accurate completion of gene trees is desirable.
Results: We introduce the Optimal Tree Completion problem, a general optimization problem that involves completing an unrooted binary tree (i.e., adding missing leaves) so as to minimize its distance from a reference tree on a superset of the leaves. We present OCTAL, an algorithm that finds an optimal solution to this problem when the distance between trees is defined using the Robinson–Foulds (RF) distance, and we prove that OCTAL runs in $O(n^2)$ time, where n is the total number of species. We report on a simulation study in which gene trees can differ from the species tree due to incomplete lineage sorting, and estimated gene trees are completed using OCTAL with a reference tree based on a species tree estimated from the multi-locus dataset. OCTAL produces completed gene trees that are closer to the true gene trees than an existing heuristic approach in ASTRAL-II, but the accuracy of a completed gene tree computed by OCTAL depends on how topologically similar the reference tree (typically an estimated species tree) is to the true gene tree.
Conclusions: OCTAL is a useful technique for adding missing taxa to incomplete gene trees and provides good accuracy under a wide range of model conditions. However, results show that OCTAL’s accuracy can be reduced when incomplete lineage sorting is high, as the reference tree can be far from the true gene tree. Hence, this study suggests that OCTAL would benefit from using other types of reference trees instead of species trees when there are large topological distances between true gene trees and species trees.
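The Optimal Tree Completion problem stated in the abstract can be illustrated with a small sketch. The code below is not OCTAL and does not run in $O(n^2)$ time: it is an exhaustive search that tries every way of attaching the missing leaves and keeps a completion with the smallest RF distance to the reference tree, so it is only practical for tiny examples. It assumes trees are written as nested Python tuples of leaf names (an arbitrary rooting of the unrooted tree) and that the RF distance is the unnormalized count of non-trivial bipartitions present in exactly one tree. All function names (`leaves`, `bipartitions`, `rf_distance`, `insertions`, `completions`, `brute_force_completion`) are hypothetical and chosen for illustration.

```python
def leaves(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions of the unrooted tree, each stored as the
    side that does not contain a fixed anchor leaf."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return below

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Unnormalized RF distance between two trees on the same leaf set."""
    assert leaves(tree_a) == leaves(tree_b), "trees must share a leaf set"
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` next to one subtree.
    Some results are the same unrooted tree; that is harmless here."""
    yield (tree, leaf)
    if not isinstance(tree, str):
        for i, child in enumerate(tree):
            for new_child in insertions(child, leaf):
                yield tree[:i] + (new_child,) + tree[i + 1:]


def completions(tree, missing):
    """Yield all completions of `tree` by the leaves in `missing`."""
    if not missing:
        yield tree
        return
    for extended in insertions(tree, missing[0]):
        yield from completions(extended, missing[1:])


def brute_force_completion(tree, reference):
    """Return a completion of `tree` with minimum RF distance to `reference`."""
    missing = sorted(leaves(reference) - leaves(tree))
    return min(completions(tree, missing),
               key=lambda t: rf_distance(t, reference))


if __name__ == "__main__":
    reference = ((("a", "b"), "c"), ("d", ("e", "f")))
    incomplete = (("a", "b"), ("d", "e"))  # leaves c and f are missing
    best = brute_force_completion(incomplete, reference)
    print(best, rf_distance(best, reference))
```

In this toy input the incomplete tree agrees with the reference tree on the shared leaves, so the search can reach an RF distance of 0. When the two trees conflict on the shared leaves, the minimum distance is positive, which is the situation the abstract describes for high incomplete lineage sorting.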
topic Species trees
Gene trees
Missing data
Multispecies coalescent
Phylogenomics
url http://link.springer.com/article/10.1186/s13015-018-0124-5