Probabilistic species-driven gene mention normalisation

The scientific literature is an important source of knowledge about biological systems and their constituent components. But the increase in the rate of new publications means that manual curation of the literature has become intractable. This has motivated the development and application of t...

Bibliographic Details
Main Author:	Harmston, Nathan
Other Authors:	Stumpf, Michael
Published:	Imperial College London 2013
Subjects:	572.8
Online Access:	http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.693953

id	ndltd-bl.uk-oai-ethos.bl.uk-693953
record_format	oai_dc
spelling	ndltd-bl.uk-oai-ethos.bl.uk-693953 2018-02-05T15:35:55Z Probabilistic species-driven gene mention normalisation Harmston, Nathan Stumpf, Michael 2013 The scientific literature is an important source of knowledge about biological systems and their constituent components. But the increase in the rate of new publications means that manual curation of the literature has become intractable. This has motivated the development and application of text mining methods to automatically extract the information present in the scientific literature. Extracting information automatically from text is, however, an inherently noisy process and this uncertainty provides the rationale for the development of probabilistic methods for text mining. In this thesis, we concern ourselves with the task of identifying gene mentions in text and normalising these to unique Entrez gene identifiers, referred to as species-driven gene mention normalisation. We propose novel heuristics which improve the performance of species mention normalisation and reduce the number of ambiguous mentions found in the MEDLINE database. This directly impacts the performance of downstream components and we provide a novel probabilistic method for assigning a species to individual gene mentions. In order to avoid uncertainties and noise being propagated in text-mining approaches we develop a Bayesian network description of a text mining pipeline, which allows us to quantify uncertainties reliably and in an easily interpretable manner. Our results show the importance of incorporating as much information as possible into the text mining pipeline and the importance of viewing text mining as a noisy system. 572.8 Imperial College London http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.693953 http://hdl.handle.net/10044/1/39837 Electronic Thesis or Dissertation
collection	NDLTD
sources	NDLTD
topic	572.8
spellingShingle	572.8 Harmston, Nathan Probabilistic species-driven gene mention normalisation
description	The scientific literature is an important source of knowledge about biological systems and their constituent components. But the increase in the rate of new publications means that manual curation of the literature has become intractable. This has motivated the development and application of text mining methods to automatically extract the information present in the scientific literature. Extracting information automatically from text is, however, an inherently noisy process and this uncertainty provides the rationale for the development of probabilistic methods for text mining. In this thesis, we concern ourselves with the task of identifying gene mentions in text and normalising these to unique Entrez gene identifiers, referred to as species-driven gene mention normalisation. We propose novel heuristics which improve the performance of species mention normalisation and reduce the number of ambiguous mentions found in the MEDLINE database. This directly impacts the performance of downstream components and we provide a novel probabilistic method for assigning a species to individual gene mentions. In order to avoid uncertainties and noise being propagated in text-mining approaches we develop a Bayesian network description of a text mining pipeline, which allows us to quantify uncertainties reliably and in an easily interpretable manner. Our results show the importance of incorporating as much information as possible into the text mining pipeline and the importance of viewing text mining as a noisy system.
author2	Stumpf, Michael
author_facet	Stumpf, Michael Harmston, Nathan
author	Harmston, Nathan
author_sort	Harmston, Nathan
title	Probabilistic species-driven gene mention normalisation
title_short	Probabilistic species-driven gene mention normalisation
title_full	Probabilistic species-driven gene mention normalisation
title_fullStr	Probabilistic species-driven gene mention normalisation
title_full_unstemmed	Probabilistic species-driven gene mention normalisation
title_sort	probabilistic species-driven gene mention normalisation
publisher	Imperial College London
publishDate	2013
url	http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.693953
work_keys_str_mv	AT harmstonnathan probabilisticspeciesdrivengenementionnormalisation
_version_	1718613263406596096

Similar Items