Knowledge-Driven Methods for Geographic Information Extraction in the Biomedical Domain

Bibliographic Details
Other Authors: Tahsin, Tasnia (Author)
Format: Doctoral Thesis
Language: English
Published: 2019
Online Access: http://hdl.handle.net/2286/R.I.55581
id ndltd-asu.edu-item-55581
record_format oai_dc
spelling ndltd-asu.edu-item-55581 2020-01-15T03:01:11Z Knowledge-Driven Methods for Geographic Information Extraction in the Biomedical Domain
abstract: Accounting for over a third of all emerging and re-emerging infections, viruses represent a major public health threat, which researchers and epidemiologists across the world have been attempting to contain for decades. Recently, genomics-based surveillance of viruses through methods such as virus phylogeography has grown into a popular tool for infectious disease monitoring. When conducting such surveillance studies, researchers need to manually retrieve geographic metadata denoting the location of the infected host (LOIH) of viruses from public sequence databases such as GenBank and from any publication related to their study. The large volume of semi-structured and unstructured information that must be reviewed, along with the ambiguity of geographic locations, makes this task especially challenging. Prior work has demonstrated that the majority of GenBank records lack sufficient geographic granularity concerning the LOIH of viruses. As a result, reviewing full-text publications is often necessary for in-depth analysis of virus migration, which can be a very time-consuming process. Moreover, integrating geographic metadata pertaining to the LOIH of viruses from different sources, including different fields in GenBank records as well as full-text publications, and normalizing the integrated metadata to unique identifiers for subsequent analysis are also challenging tasks that often require expert domain knowledge. Automated information extraction (IE) methods could therefore significantly accelerate this process, positively impacting public health research. However, very few research studies have attempted to use IE methods in this domain.
This work explores the use of novel knowledge-driven geographic IE heuristics for extracting, integrating, and normalizing the LOIH of viruses based on information available in GenBank and related publications; when evaluated on manually annotated test sets, the methods achieved high accuracy and proved adequate for addressing this challenging problem. It also presents GeoBoost, a pioneering software system for georeferencing GenBank records, as well as a large-scale database containing over two million virus GenBank records georeferenced using the algorithms introduced here. The methods, database, and software developed here could help support diverse public health domains focusing on sequence-informed virus surveillance, thereby enhancing existing platforms for controlling and containing disease outbreaks.
Dissertation/Thesis
Tahsin, Tasnia (Author)
Gonzalez, Graciela (Advisor)
Scotch, Matthew (Advisor)
Runger, George (Committee member)
Arizona State University (Publisher)
Bioinformatics
Public health
Geographic information science and geodesy
GenBank
geographic information extraction
geographic information retrieval
phylogeography
viroinformatics
virus surveillance
eng
110 pages
Doctoral Dissertation Biomedical Informatics 2019
Doctoral Dissertation
http://hdl.handle.net/2286/R.I.55581
http://rightsstatements.org/vocab/InC/1.0/
2019
collection NDLTD
language English
format Doctoral Thesis
sources NDLTD
topic Bioinformatics
Public health
Geographic information science and geodesy
GenBank
geographic information extraction
geographic information retrieval
phylogeography
viroinformatics
virus surveillance
author2 Tahsin, Tasnia (Author)
title Knowledge-Driven Methods for Geographic Information Extraction in the Biomedical Domain
publishDate 2019
url http://hdl.handle.net/2286/R.I.55581