Seqenv: linking sequences to environments through text mining

Understanding the distribution of taxa and associated traits across different environments is one of the central questions in microbial ecology. High-throughput sequencing (HTS) studies are presently generating huge volumes of data to address this biogeographical topic. However, these studies are of...


Bibliographic Details
Main Authors: Lucas Sinclair, Umer Z. Ijaz, Lars Juhl Jensen, Marco J.L. Coolen, Cecile Gubry-Rangin, Alica Chroňáková, Anastasis Oulas, Christina Pavloudi, Julia Schnetzer, Aaron Weimann, Ali Ijaz, Alexander Eiler, Christopher Quince, Evangelos Pafilis
Format: Article
Language: English
Published: PeerJ Inc. 2016-12-01
Series: PeerJ
Subjects: Bioinformatics; Ecology; Microbiology; Genomics; Sequence analysis; Text processing
Online Access: https://peerj.com/articles/2690.pdf
id doaj-9f3814d827644da488609463fc7e4955
record_format Article
spelling doaj-9f3814d827644da488609463fc7e4955 | 2020-11-24T22:40:53Z | eng | PeerJ Inc. | PeerJ | 2167-8359 | 2016-12-01 | 4 | e2690 | 10.7717/peerj.2690
Seqenv: linking sequences to environments through text mining
Lucas Sinclair (0), Umer Z. Ijaz (1), Lars Juhl Jensen (2), Marco J.L. Coolen (3), Cecile Gubry-Rangin (4), Alica Chroňáková (5), Anastasis Oulas (6), Christina Pavloudi (7), Julia Schnetzer (8), Aaron Weimann (9), Ali Ijaz (10), Alexander Eiler (11), Christopher Quince (12), Evangelos Pafilis (13)
(0) Department of Ecology and Genetics, Limnology, Uppsala University, Uppsala, Sweden
(1) Infrastructure and Environment Research Division, School of Engineering, University of Glasgow, Glasgow, United Kingdom
(2) The Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
(3) Western Australia Organic and Isotope Geochemistry Centre (WA-OIGC), Department of Chemistry, Curtin University of Technology, Bentley, WA, Australia
(4) Institute of Biological & Environmental Sciences, University of Aberdeen, Aberdeen, United Kingdom
(5) Institute of Soil Biology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic
(6) Bioinformatics Group, The Cyprus Institute of Neurology and Genetics, Nicosia, Cyprus
(7) Institute of Marine Biology Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Heraklion Crete, Greece
(8) Department of Molecular Ecology, Microbial Genomics and Bioinformatics Group, Max Planck Institute for Marine Microbiology, Bremen, Germany
(9) Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
(10) Hawkesbury Institute for the Environment, University of Western Sydney, Hawkesbury, Sydney, Australia
(11) Department of Ecology and Genetics, Limnology, Uppsala University, Uppsala, Sweden
(12) Warwick Medical School, University of Warwick, Warwick, United Kingdom
(13) Institute of Marine Biology Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Heraklion Crete, Greece
Understanding the distribution of taxa and associated traits across different environments is one of the central questions in microbial ecology. High-throughput sequencing (HTS) studies are presently generating huge volumes of data to address this biogeographical topic. However, these studies are often focused on specific environment types or processes, leading to the production of individual, unconnected datasets. The large amounts of legacy sequence data with associated metadata that exist can be harnessed to better place the genetic information found in these surveys into a wider environmental context. Here we introduce a software program, seqenv, to carry out precisely such a task. It automatically performs similarity searches of short sequences against the “nt” nucleotide database provided by NCBI and, from every hit, extracts the textual metadata field if it is available. After collecting all the isolation sources from all the search results, we run a text mining algorithm to identify and parse words that are associated with the Environmental Ontology (EnvO) controlled vocabulary. This, in turn, enables us to determine both in which environments individual sequences or taxa have previously been observed and, by weighted summation of those results, to summarize complete samples. We present two demonstrative applications of seqenv to a survey of ammonia-oxidizing archaea as well as to a plankton paleome dataset from the Black Sea. These demonstrate the ability of the tool to reveal novel patterns in HTS and its utility in the fields of environmental source tracking, paleontology, and studies of microbial biogeography. To install seqenv, go to: https://github.com/xapple/seqenv.
https://peerj.com/articles/2690.pdf
Bioinformatics; Ecology; Microbiology; Genomics; Sequence analysis; Text processing
collection DOAJ
language English
format Article
sources DOAJ
author Lucas Sinclair
Umer Z. Ijaz
Lars Juhl Jensen
Marco J.L. Coolen
Cecile Gubry-Rangin
Alica Chroňáková
Anastasis Oulas
Christina Pavloudi
Julia Schnetzer
Aaron Weimann
Ali Ijaz
Alexander Eiler
Christopher Quince
Evangelos Pafilis
author_sort Lucas Sinclair
title Seqenv: linking sequences to environments through text mining
publisher PeerJ Inc.
series PeerJ
issn 2167-8359
publishDate 2016-12-01
description Understanding the distribution of taxa and associated traits across different environments is one of the central questions in microbial ecology. High-throughput sequencing (HTS) studies are presently generating huge volumes of data to address this biogeographical topic. However, these studies are often focused on specific environment types or processes, leading to the production of individual, unconnected datasets. The large amounts of legacy sequence data with associated metadata that exist can be harnessed to better place the genetic information found in these surveys into a wider environmental context. Here we introduce a software program, seqenv, to carry out precisely such a task. It automatically performs similarity searches of short sequences against the “nt” nucleotide database provided by NCBI and, from every hit, extracts the textual metadata field if it is available. After collecting all the isolation sources from all the search results, we run a text mining algorithm to identify and parse words that are associated with the Environmental Ontology (EnvO) controlled vocabulary. This, in turn, enables us to determine both in which environments individual sequences or taxa have previously been observed and, by weighted summation of those results, to summarize complete samples. We present two demonstrative applications of seqenv to a survey of ammonia-oxidizing archaea as well as to a plankton paleome dataset from the Black Sea. These demonstrate the ability of the tool to reveal novel patterns in HTS and its utility in the fields of environmental source tracking, paleontology, and studies of microbial biogeography. To install seqenv, go to: https://github.com/xapple/seqenv.
topic Bioinformatics
Ecology
Microbiology
Genomics
Sequence analysis
Text processing
url https://peerj.com/articles/2690.pdf
_version_ 1725702967375429632
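
The description outlines a pipeline that tags the isolation sources of search hits with environment terms and then summarizes whole samples by weighted summation. The Python sketch below illustrates that idea only; it is not seqenv's code. The keyword vocabulary, the labels (which are not real EnvO identifiers), the substring matching, the per-sequence normalization, and the use of sequence abundance as the weight are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical mini-vocabulary: keyword -> environment label.
# These labels are illustrative and are NOT real EnvO identifiers.
VOCAB = {"soil": "soil", "sediment": "sediment", "seawater": "sea water", "lake": "lake"}


def tag_terms(isolation_source):
    """Return the labels whose keyword occurs in the free-text field (naive substring match)."""
    text = isolation_source.lower()
    return [label for keyword, label in VOCAB.items() if keyword in text]


def sequence_profile(isolation_sources):
    """Normalized label frequencies over all hits of one sequence (assumed scheme)."""
    counts = Counter()
    for source in isolation_sources:
        counts.update(tag_terms(source))
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


def sample_profile(abundances, profiles):
    """Abundance-weighted sum of per-sequence profiles, rescaled to sum to 1 (assumed scheme)."""
    summary = Counter()
    for seq_id, abundance in abundances.items():
        for label, freq in profiles.get(seq_id, {}).items():
            summary[label] += abundance * freq
    total = sum(summary.values())
    return {label: weight / total for label, weight in summary.items()} if total else {}


# Made-up isolation sources for two hypothetical sequences.
hits = {
    "otu1": ["forest soil", "agricultural soil", "lake sediment"],
    "otu2": ["seawater", "marine sediment"],
}
profiles = {seq_id: sequence_profile(sources) for seq_id, sources in hits.items()}
sample = sample_profile({"otu1": 30, "otu2": 10}, profiles)
print(sample)
```

In this toy input, the more abundant sequence dominates the sample summary, which is the intent of weighting by abundance; a real run would draw the hits from similarity searches and the terms from the EnvO vocabulary.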