Probabilistic species-driven gene mention normalisation
The scientific literature is an important source of knowledge about biological systems and their constituent components. But the increase in the rate of new publications means that manual curation of the literature has become intractable. This has motivated the development and application of t...
Main Author: | Harmston, Nathan |
---|---|
Other Authors: | Stumpf, Michael |
Published: | Imperial College London 2013 |
Subjects: | 572.8 |
Online Access: | http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.693953 |
id |
ndltd-bl.uk-oai-ethos.bl.uk-693953 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-bl.uk-oai-ethos.bl.uk-693953 2018-02-05T15:35:55Z Probabilistic species-driven gene mention normalisation Harmston, Nathan Stumpf, Michael 2013 The scientific literature is an important source of knowledge about biological systems and their constituent components. But the increase in the rate of new publications means that manual curation of the literature has become intractable. This has motivated the development and application of text mining methods to automatically extract the information present in the scientific literature. Extracting information automatically from text is, however, an inherently noisy process and this uncertainty provides the rationale for the development of probabilistic methods for text mining. In this thesis, we concern ourselves with the task of identifying gene mentions in text and normalising these to unique Entrez gene identifiers, referred to as species-driven gene mention normalisation. We propose novel heuristics which improve the performance of species mention normalisation and reduce the number of ambiguous mentions found in the MEDLINE database. This directly impacts the performance of downstream components, and we provide a novel probabilistic method for assigning a species to individual gene mentions. In order to avoid uncertainties and noise being propagated in text-mining approaches, we develop a Bayesian network description of a text mining pipeline, which allows us to quantify uncertainties reliably and in an easily interpretable manner. Our results show the importance of incorporating as much information as possible into the text mining pipeline and the importance of viewing text mining as a noisy system. 572.8 Imperial College London http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.693953 http://hdl.handle.net/10044/1/39837 Electronic Thesis or Dissertation |
collection |
NDLTD |
sources |
NDLTD |
topic |
572.8 |
spellingShingle |
572.8 Harmston, Nathan Probabilistic species-driven gene mention normalisation |
description |
The scientific literature is an important source of knowledge about biological systems and their constituent components. But the increase in the rate of new publications means that manual curation of the literature has become intractable. This has motivated the development and application of text mining methods to automatically extract the information present in the scientific literature. Extracting information automatically from text is, however, an inherently noisy process and this uncertainty provides the rationale for the development of probabilistic methods for text mining. In this thesis, we concern ourselves with the task of identifying gene mentions in text and normalising these to unique Entrez gene identifiers, referred to as species-driven gene mention normalisation. We propose novel heuristics which improve the performance of species mention normalisation and reduce the number of ambiguous mentions found in the MEDLINE database. This directly impacts the performance of downstream components, and we provide a novel probabilistic method for assigning a species to individual gene mentions. In order to avoid uncertainties and noise being propagated in text-mining approaches, we develop a Bayesian network description of a text mining pipeline, which allows us to quantify uncertainties reliably and in an easily interpretable manner. Our results show the importance of incorporating as much information as possible into the text mining pipeline and the importance of viewing text mining as a noisy system. |
author2 |
Stumpf, Michael |
author_facet |
Stumpf, Michael Harmston, Nathan |
author |
Harmston, Nathan |
author_sort |
Harmston, Nathan |
title |
Probabilistic species-driven gene mention normalisation |
publisher |
Imperial College London |
publishDate |
2013 |
url |
http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.693953 |