Probabilistic species-driven gene mention normalisation

The scientific literature is an important source of knowledge about biological systems and their constituent components. But the increase in the rate of new publications means that manual curation of the literature has become intractable. This has motivated the development and application of t...


Bibliographic Details
Main Author: Harmston, Nathan
Other Authors: Stumpf, Michael
Published: Imperial College London 2013
Subjects:
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.693953
id ndltd-bl.uk-oai-ethos.bl.uk-693953
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-693953 2018-02-05T15:35:55Z Probabilistic species-driven gene mention normalisation Harmston, Nathan Stumpf, Michael 2013 The scientific literature is an important source of knowledge about biological systems and their constituent components. But the increase in the rate of new publications means that manual curation of the literature has become intractable. This has motivated the development and application of text mining methods to automatically extract the information present in the scientific literature. Extracting information automatically from text is, however, an inherently noisy process and this uncertainty provides the rationale for the development of probabilistic methods for text mining. In this thesis, we concern ourselves with the task of identifying gene mentions in text and normalising these to unique Entrez gene identifiers, referred to as species-driven gene mention normalisation. We propose novel heuristics which improve the performance of species mention normalisation and reduce the number of ambiguous mentions found in the MEDLINE database. This directly impacts the performance of downstream components and we provide a novel probabilistic method for assigning a species to individual gene mentions. In order to avoid uncertainties and noise being propagated in text-mining approaches, we develop a Bayesian network description of a text mining pipeline, which allows us to quantify uncertainties reliably and in an easily interpretable manner. Our results show the importance of incorporating as much information as possible into the text mining pipeline and the importance of viewing text mining as a noisy system. 572.8 Imperial College London http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.693953 http://hdl.handle.net/10044/1/39837 Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
topic 572.8
spellingShingle 572.8
Harmston, Nathan
Probabilistic species-driven gene mention normalisation
description The scientific literature is an important source of knowledge about biological systems and their constituent components. But the increase in the rate of new publications means that manual curation of the literature has become intractable. This has motivated the development and application of text mining methods to automatically extract the information present in the scientific literature. Extracting information automatically from text is, however, an inherently noisy process and this uncertainty provides the rationale for the development of probabilistic methods for text mining. In this thesis, we concern ourselves with the task of identifying gene mentions in text and normalising these to unique Entrez gene identifiers, referred to as species-driven gene mention normalisation. We propose novel heuristics which improve the performance of species mention normalisation and reduce the number of ambiguous mentions found in the MEDLINE database. This directly impacts the performance of downstream components and we provide a novel probabilistic method for assigning a species to individual gene mentions. In order to avoid uncertainties and noise being propagated in text-mining approaches, we develop a Bayesian network description of a text mining pipeline, which allows us to quantify uncertainties reliably and in an easily interpretable manner. Our results show the importance of incorporating as much information as possible into the text mining pipeline and the importance of viewing text mining as a noisy system.
author2 Stumpf, Michael
author_facet Stumpf, Michael
Harmston, Nathan
author Harmston, Nathan
author_sort Harmston, Nathan
title Probabilistic species-driven gene mention normalisation
title_short Probabilistic species-driven gene mention normalisation
title_full Probabilistic species-driven gene mention normalisation
title_fullStr Probabilistic species-driven gene mention normalisation
title_full_unstemmed Probabilistic species-driven gene mention normalisation
title_sort probabilistic species-driven gene mention normalisation
publisher Imperial College London
publishDate 2013
url http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.693953
work_keys_str_mv AT harmstonnathan probabilisticspeciesdrivengenementionnormalisation
_version_ 1718613263406596096