SUMAC: Constructing Phylogenetic Supermatrices and Assessing Partially Decisive Taxon Coverage

The amount of phylogenetically informative sequence data in GenBank is growing at an exponential rate, and large phylogenetic trees are increasingly used in research. Tools are needed to construct phylogenetic sequence matrices from GenBank data and evaluate the effect of missing data. Supermatrix C...


Bibliographic Details
Main Author: William A. Freyman
Format: Article
Language: English
Published: SAGE Publishing 2015-01-01
Series: Evolutionary Bioinformatics
Online Access: https://doi.org/10.4137/EBO.S35384
id doaj-24fcb7b65590451d814ca970fe7608f1
record_format Article
spelling doaj-24fcb7b65590451d814ca970fe7608f1 | 2020-11-25T03:17:32Z | eng | SAGE Publishing | Evolutionary Bioinformatics | 1176-9343 | 2015-01-01 | 11 | 10.4137/EBO.S35384 | SUMAC: Constructing Phylogenetic Supermatrices and Assessing Partially Decisive Taxon Coverage | William A. Freyman (Department of Integrative Biology, University of California, Berkeley, Berkeley, CA, USA) | The amount of phylogenetically informative sequence data in GenBank is growing at an exponential rate, and large phylogenetic trees are increasingly used in research. Tools are needed to construct phylogenetic sequence matrices from GenBank data and evaluate the effect of missing data. Supermatrix Constructor (SUMAC) is a tool to data-mine GenBank, construct phylogenetic supermatrices, and assess the phylogenetic decisiveness of a matrix given the pattern of missing sequence data. SUMAC calculates a novel metric, Missing Sequence Decisiveness Scores (MSDS), which measures how much each individual missing sequence contributes to the decisiveness of the matrix. MSDS can be used to compare supermatrices and prioritize the acquisition of new sequence data. SUMAC constructs supermatrices either through an exploratory clustering of all GenBank sequences within a taxonomic group or by using guide sequences to build homologous clusters in a more targeted manner. SUMAC assembles supermatrices for any taxonomic group recognized in GenBank and is optimized to run on multicore computer systems by parallelizing multiple stages of operation. SUMAC is implemented as a Python package that can run as a stand-alone command-line program, or its modules and objects can be incorporated within other programs. SUMAC is released under the open source GPLv3 license and is available at https://github.com/wf8/sumac. | https://doi.org/10.4137/EBO.S35384
collection DOAJ
language English
format Article
sources DOAJ
author William A. Freyman
title SUMAC: Constructing Phylogenetic Supermatrices and Assessing Partially Decisive Taxon Coverage
publisher SAGE Publishing
series Evolutionary Bioinformatics
issn 1176-9343
publishDate 2015-01-01
description The amount of phylogenetically informative sequence data in GenBank is growing at an exponential rate, and large phylogenetic trees are increasingly used in research. Tools are needed to construct phylogenetic sequence matrices from GenBank data and evaluate the effect of missing data. Supermatrix Constructor (SUMAC) is a tool to data-mine GenBank, construct phylogenetic supermatrices, and assess the phylogenetic decisiveness of a matrix given the pattern of missing sequence data. SUMAC calculates a novel metric, Missing Sequence Decisiveness Scores (MSDS), which measures how much each individual missing sequence contributes to the decisiveness of the matrix. MSDS can be used to compare supermatrices and prioritize the acquisition of new sequence data. SUMAC constructs supermatrices either through an exploratory clustering of all GenBank sequences within a taxonomic group or by using guide sequences to build homologous clusters in a more targeted manner. SUMAC assembles supermatrices for any taxonomic group recognized in GenBank and is optimized to run on multicore computer systems by parallelizing multiple stages of operation. SUMAC is implemented as a Python package that can run as a stand-alone command-line program, or its modules and objects can be incorporated within other programs. SUMAC is released under the open source GPLv3 license and is available at https://github.com/wf8/sumac .
url https://doi.org/10.4137/EBO.S35384