Summary: | The scientific literature is an important source of knowledge about biological systems and their constituent components. However, the growing rate of new publications has made manual curation of the literature intractable. This has motivated the development and application of text mining methods to automatically extract the information present in the scientific literature. Extracting information automatically from text is, however, an inherently noisy process, and this uncertainty provides the rationale for developing probabilistic methods for text mining. In this thesis, we concern ourselves with the task of identifying gene mentions in text and normalising these to unique Entrez gene identifiers, referred to as species-driven gene mention normalisation. We propose novel heuristics which improve the performance of species mention normalisation and reduce the number of ambiguous mentions found in the MEDLINE database. This directly affects the performance of downstream components, and we provide a novel probabilistic method for assigning a species to individual gene mentions. To avoid uncertainties and noise being propagated through text mining approaches, we develop a Bayesian network description of a text mining pipeline, which allows us to quantify uncertainties reliably and in an easily interpretable manner. Our results show the importance of incorporating as much information as possible into the text mining pipeline and of viewing text mining as a noisy system.
|