Utilities for Off-Target DNA Mining in Non-Model Organisms and Querying for Phylogenetic Patterns


Bibliographic Details
Other Authors: Mechtley, Alisha (author)
Format: Others
Language: English
Published: Florida State University
Online Access: http://purl.flvc.org/fsu/fd/2018_Sp_Mechtley_fsu_0071E_14520_comp
Description
Summary: High throughput sequencing data are rich in information and contain many off-target sequences (reads) that are often ignored but may be biologically relevant. Seed extension, a combination of reference-based and de novo assembly methods, can be used to extract this information, but it is time-consuming to implement because it requires that multiple seeds (sequences from one or many closely related species) be gathered in advance. A new tool, SeedSQrrL, is presented here that can automatically crawl the web to gather the seeds from the closest taxonomic relative for each gene and store them in a relational database. The seeds can then be used to create multiple seed extensions, which are later combined into a reference or used for downstream phylogenetic analysis. Patterns in the resulting gene trees can be searched for using the traditional methods of tree comparison (Robinson-Foulds topological distance and branch-length comparison methods). Currently, no open source tree pattern matching program exists that allows the user to modify algorithms and create their own custom pattern matching functions. I have worked on such a tool, called Treematcher, and it will be made available in the ETE Toolkit (a Python Environment for Tree Exploration). Three biological case studies will be included to demonstrate the capabilities of the two programs: 1) a custom function in Treematcher to perform a regular expression-like query, 2) SeedSQrrL will be used to isolate mitochondrial genes from snakes and chloroplast genes from angiosperms, and 3) a large case study of animals will be assembled. === A Dissertation submitted to the Department of Scientific Computing in partial fulfillment of the requirements for the degree of Doctor of Philosophy. === Spring Semester 2018. === April 2, 2018.
=== Automated Gene Reference Collection, Gene Tree Pattern Matching, High Throughput Sequence Analysis, NCBI Taxonomy, Open Source Software for Bioinformatics, Python === Includes bibliographical references. === Alan Lemmon, Professor Directing Dissertation; Michelle Arbeitman, University Representative; Anke Meyer-Baese, Committee Member; Peter Beerli, Committee Member; Dennis Slice, Committee Member.
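
Illustrative note: the Robinson-Foulds topological distance named in the summary can be sketched in a few lines of Python. This is a minimal illustration only, not code from the dissertation, Treematcher, or the ETE Toolkit. It assumes rooted trees written as nested tuples of leaf names and uses the rooted (clade-based) variant of the distance; the function names `clades` and `rf_distance` are hypothetical.

```python
def clades(tree):
    """Return the set of clades (frozensets of leaf names) in a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):  # a leaf is represented by its name
            return frozenset([node])
        # An internal node's clade is the union of its children's leaf sets.
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Count clades found in exactly one of the two rooted trees (assumed variant)."""
    return len(clades(tree_a) ^ clades(tree_b))


# Two hypothetical four-taxon gene trees that disagree on one grouping.
tree_a = ((("A", "B"), "C"), "D")
tree_b = ((("A", "C"), "B"), "D")
print(rf_distance(tree_a, tree_b))
```

The two example trees share the clades {A, B, C} and {A, B, C, D} but differ in whether A groups with B or with C, so each tree contributes one unshared clade to the distance.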