COMPUTATIONAL TOOLS FOR THE DYNAMIC CATEGORIZATION AND AUGMENTED UTILIZATION OF THE GENE ONTOLOGY

Ontologies provide an organization of language, in the form of a network or graph, which is amenable to computational analysis while remaining human-readable. Although they are used in a variety of disciplines, ontologies in the biomedical field, such as Gene Ontology, are of interest for their role...

Full description

Bibliographic Details
Main Author: Hinderer, Eugene Waverly, III
Format: Others
Published: UKnowledge 2019
Subjects:
Online Access:https://uknowledge.uky.edu/biochem_etds/43
https://uknowledge.uky.edu/cgi/viewcontent.cgi?article=1043&context=biochem_etds
id ndltd-uky.edu-oai-uknowledge.uky.edu-biochem_etds-1043
record_format oai_dc
collection NDLTD
format Others
sources NDLTD
topic Annotation Enrichment Analysis
Biomedical Ontologies
Information Retrieval
Ontological Maintenance
Semantic Correspondence
Bioinformatics
Genomics
spellingShingle Annotation Enrichment Analysis
Biomedical Ontologies
Information Retrieval
Ontological Maintenance
Semantic Correspondence
Bioinformatics
Genomics
Hinderer, Eugene Waverly, III
COMPUTATIONAL TOOLS FOR THE DYNAMIC CATEGORIZATION AND AUGMENTED UTILIZATION OF THE GENE ONTOLOGY
description Ontologies provide an organization of language, in the form of a network or graph, which is amenable to computational analysis while remaining human-readable. Although they are used in a variety of disciplines, ontologies in the biomedical field, such as Gene Ontology, are of interest for their role in organizing terminology used to describe—among other concepts—the functions, locations, and processes of genes and gene-products. Due to the consistency and level of automation that ontologies provide for such annotations, methods for finding enriched biological terminology from a set of differentially identified genes in a tissue or cell sample have been developed to aid in the elucidation of disease pathology and unknown biochemical pathways. However, despite their immense utility, biomedical ontologies have significant limitations and caveats. One major issue is that gene annotation enrichment analyses often result in many redundant, individually enriched ontological terms that are highly specific and weakly justified by statistical significance. These large sets of weakly enriched terms are difficult to interpret without manually sorting into appropriate functional or descriptive categories. Also, relationships that organize the terminology within these ontologies do not contain descriptions of semantic scoping or scaling among terms. Therefore, there exists some ambiguity, which complicates the automation of categorizing terms to improve interpretability. We emphasize that existing methods enable the danger of producing incorrect mappings to categories as a result of these ambiguities, unless simplified and incomplete versions of these ontologies are used which omit problematic relations. Such ambiguities could have a significant impact on term categorization, as we have calculated upper boundary estimates of potential false categorizations as high as 121,579 for the misinterpretation of a single scoping relation, has_part, which accounts for approximately 18% of the total possible mappings between terms in the Gene Ontology. However, the omission of problematic relationships results in a significant loss of retrievable information. In the Gene Ontology, this accounts for a 6% reduction for the omission of a single relation. However, this percentage should increase drastically when considering all relations in an ontology. To address these issues, we have developed methods which categorize individual ontology terms into broad, biologically-related concepts to improve the interpretability and statistical significance of gene-annotation enrichment studies, meanwhile addressing the lack of semantic scoping and scaling descriptions among ontological relationships so that annotation enrichment analyses can be performed across a more complete representation of the ontological graph. We show that, when compared to similar term categorization methods, our method produces categorizations that match hand-curated ones with similar or better accuracy, while not requiring the user to compile lists of individual ontology term IDs. Furthermore, our handling of problematic relations produces a more complete representation of ontological information from a scoping perspective, and we demonstrate instances where medically-relevant terms--and by extension putative gene targets--are identified in our annotation enrichment results that would be otherwise missed when using traditional methods. Additionally, we observed a marginal, yet consistent improvement of statistical power in enrichment results when our methods were used, compared to traditional enrichment analyses that utilize ontological ancestors. Finally, using scalable and reproducible data workflow pipelines, we have applied our methods to several genomic, transcriptomic, and proteomic collaborative projects.
author Hinderer, Eugene Waverly, III
author_facet Hinderer, Eugene Waverly, III
author_sort Hinderer, Eugene Waverly, III
title COMPUTATIONAL TOOLS FOR THE DYNAMIC CATEGORIZATION AND AUGMENTED UTILIZATION OF THE GENE ONTOLOGY
title_short COMPUTATIONAL TOOLS FOR THE DYNAMIC CATEGORIZATION AND AUGMENTED UTILIZATION OF THE GENE ONTOLOGY
title_full COMPUTATIONAL TOOLS FOR THE DYNAMIC CATEGORIZATION AND AUGMENTED UTILIZATION OF THE GENE ONTOLOGY
title_fullStr COMPUTATIONAL TOOLS FOR THE DYNAMIC CATEGORIZATION AND AUGMENTED UTILIZATION OF THE GENE ONTOLOGY
title_full_unstemmed COMPUTATIONAL TOOLS FOR THE DYNAMIC CATEGORIZATION AND AUGMENTED UTILIZATION OF THE GENE ONTOLOGY
title_sort computational tools for the dynamic categorization and augmented utilization of the gene ontology
publisher UKnowledge
publishDate 2019
url https://uknowledge.uky.edu/biochem_etds/43
https://uknowledge.uky.edu/cgi/viewcontent.cgi?article=1043&context=biochem_etds
work_keys_str_mv AT hinderereugenewaverlyiii computationaltoolsforthedynamiccategorizationandaugmentedutilizationofthegeneontology
_version_ 1719269297748967424
spelling ndltd-uky.edu-oai-uknowledge.uky.edu-biochem_etds-10432019-10-16T04:27:58Z COMPUTATIONAL TOOLS FOR THE DYNAMIC CATEGORIZATION AND AUGMENTED UTILIZATION OF THE GENE ONTOLOGY Hinderer, Eugene Waverly, III Ontologies provide an organization of language, in the form of a network or graph, which is amenable to computational analysis while remaining human-readable. Although they are used in a variety of disciplines, ontologies in the biomedical field, such as Gene Ontology, are of interest for their role in organizing terminology used to describe—among other concepts—the functions, locations, and processes of genes and gene-products. Due to the consistency and level of automation that ontologies provide for such annotations, methods for finding enriched biological terminology from a set of differentially identified genes in a tissue or cell sample have been developed to aid in the elucidation of disease pathology and unknown biochemical pathways. However, despite their immense utility, biomedical ontologies have significant limitations and caveats. One major issue is that gene annotation enrichment analyses often result in many redundant, individually enriched ontological terms that are highly specific and weakly justified by statistical significance. These large sets of weakly enriched terms are difficult to interpret without manually sorting into appropriate functional or descriptive categories. Also, relationships that organize the terminology within these ontologies do not contain descriptions of semantic scoping or scaling among terms. Therefore, there exists some ambiguity, which complicates the automation of categorizing terms to improve interpretability. We emphasize that existing methods enable the danger of producing incorrect mappings to categories as a result of these ambiguities, unless simplified and incomplete versions of these ontologies are used which omit problematic relations. Such ambiguities could have a significant impact on term categorization, as we have calculated upper boundary estimates of potential false categorizations as high as 121,579 for the misinterpretation of a single scoping relation, has_part, which accounts for approximately 18% of the total possible mappings between terms in the Gene Ontology. However, the omission of problematic relationships results in a significant loss of retrievable information. In the Gene Ontology, this accounts for a 6% reduction for the omission of a single relation. However, this percentage should increase drastically when considering all relations in an ontology. To address these issues, we have developed methods which categorize individual ontology terms into broad, biologically-related concepts to improve the interpretability and statistical significance of gene-annotation enrichment studies, meanwhile addressing the lack of semantic scoping and scaling descriptions among ontological relationships so that annotation enrichment analyses can be performed across a more complete representation of the ontological graph. We show that, when compared to similar term categorization methods, our method produces categorizations that match hand-curated ones with similar or better accuracy, while not requiring the user to compile lists of individual ontology term IDs. Furthermore, our handling of problematic relations produces a more complete representation of ontological information from a scoping perspective, and we demonstrate instances where medically-relevant terms--and by extension putative gene targets--are identified in our annotation enrichment results that would be otherwise missed when using traditional methods. Additionally, we observed a marginal, yet consistent improvement of statistical power in enrichment results when our methods were used, compared to traditional enrichment analyses that utilize ontological ancestors. Finally, using scalable and reproducible data workflow pipelines, we have applied our methods to several genomic, transcriptomic, and proteomic collaborative projects. 2019-01-01T08:00:00Z text application/pdf https://uknowledge.uky.edu/biochem_etds/43 https://uknowledge.uky.edu/cgi/viewcontent.cgi?article=1043&context=biochem_etds Theses and Dissertations--Molecular and Cellular Biochemistry UKnowledge Annotation Enrichment Analysis Biomedical Ontologies Information Retrieval Ontological Maintenance Semantic Correspondence Bioinformatics Genomics