Summary: | Information about the genes and pathways involved in a disease is usually 'buried' in the scientific literature, which makes systematic studies aimed at a comprehensive understanding difficult. Text mining offers a way to retrieve and extract the most relevant information from the literature, and may therefore enable data relevant to a given disease to be collected and explored systematically. This thesis aims to develop a text-mining pipeline that can identify the genes and pathways involved in a given disease and extract their interaction patterns. As a case study, we used thyroid cancer, a cancer with increasing incidence that is characterised by multiple heterogeneous subtypes. First, although pathways are essential to many systems biology studies, they have not been a focus of text mining. To address this problem, we designed and implemented an automated method, called PathNER, for mining pathway mentions from the literature. PathNER combines soft dictionary matching with rules and detects pathway mentions in both abstracts and full texts with an F1-score of 84%. When applied on a large scale, PathNER identified disease-associated pathways that have been “neglected” by manually created databases. PathNER can also help prioritise pathways for curation. Second, we designed a disease subtype classification method for the thyroid cancer literature, which assigns subtype labels to articles. The classification method achieved a micro-average F1-score of 85.9% for primary subtypes. Using gene recognition tools and PathNER, we extracted the genes and pathways associated with different thyroid cancer subtypes (molecular profiling). The results demonstrate that this approach can complement existing annotated databases and the most recent review articles in terms of coverage and efficiency, providing a basis for a systematic understanding of thyroid cancer. 
Finally, we extended an existing state-of-the-art text-mining system for molecular event extraction to develop PWTEES, which also incorporates pathway mentions. PWTEES achieved an F1-score of 59% and can be used to assist semi-automated curation. We then applied PWTEES to more than 30,000 articles to generate a systematic profile of the molecular interactions of thyroid cancer. We integrated the molecular details of pathways from curated databases and constructed interaction networks with genes and pathways as nodes. Subsequent network analysis provides additional, more comprehensive perspectives on the existing data. The main outcome of the thesis is a systematic and efficient pipeline that can identify the key pathways and genes that characterise a disease (its molecular profile) and build networks of their interactions (its interactome). Although we used thyroid cancer as a case study, the work presented here can be applied to other diseases to support systematic studies and assisted curation efforts.
|