From Correlation to Causality: Does Network Information improve Cancer Outcome Prediction?

Motivation: Disease progression in cancer can vary substantially between patients. Yet, patients often receive the same treatment. Recently, there has been much work on predicting disease progression and patient outcome variables from gene expression in order to personalize treatment options. A wide...

Full description

Bibliographic Details
Main Author: Roy, Janine
Other Authors: Technische Universität Dresden, Fakultät Informatik
Format: Doctoral Thesis
Language:English
Published: Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden 2014
Subjects:
Online Access:http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-144933
http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-144933
http://www.qucosa.de/fileadmin/data/qucosa/documents/14493/Thesis_Roy.pdf
Description
Summary:Motivation: Disease progression in cancer can vary substantially between patients. Yet, patients often receive the same treatment. Recently, there has been much work on predicting disease progression and patient outcome variables from gene expression in order to personalize treatment options. A widely used approach is high-throughput experiments that aim to explore predictive signature genes which would provide identification of clinical outcome of diseases. Microarray data analysis helps to reveal underlying biological mechanisms of tumor progression, metastasis, and drug-resistance in cancer studies. Despite first diagnostic kits in the market, there are open problems such as the choice of random gene signatures or noisy expression data. The experimental or computational noise in data and limited tissue samples collected from patients might furthermore reduce the predictive power and biological interpretability of such signature genes. Nevertheless, signature genes predicted by different studies generally represent poor similarity; even for the same type of cancer. Integration of network information with gene expression data could provide more efficient signatures for outcome prediction in cancer studies. One approach to deal with these problems employs gene-gene relationships and ranks genes using the random surfer model of Google's PageRank algorithm. Unfortunately, the majority of published network-based approaches solely tested their methods on a small amount of datasets, questioning the general applicability of network-based methods for outcome prediction. Methods: In this thesis, I provide a comprehensive and systematically evaluation of a network-based outcome prediction approach -- NetRank - a PageRank derivative -- applied on several types of gene expression cancer data and four different types of networks. The algorithm identifies a signature gene set for a specific cancer type by incorporating gene network information with given expression data. To assess the performance of NetRank, I created a benchmark dataset collection comprising 25 cancer outcome prediction datasets from literature and one in-house dataset. Results: NetRank performs significantly better than classical methods such as foldchange or t-test as it improves the prediction performance in average for 7%. Besides, we are approaching the accuracy level of the authors' signatures by applying a relatively unbiased but fully automated process for biomarker discovery. Despite an order of magnitude difference in network size, a regulatory, a protein-protein interaction and two predicted networks perform equally well. Signatures as published by the authors and the signatures generated with classical methods do not overlap -- not even for the same cancer type -- whereas the network-based signatures strongly overlap. I analyze and discuss these overlapping genes in terms of the Hallmarks of cancer and in particular single out six transcription factors and seven proteins and discuss their specific role in cancer progression. Furthermore several tests are conducted for the identification of a Universal Cancer Signature. No Universal Cancer Signature could be identified so far, but a cancer-specific combination of general master regulators with specific cancer genes could be discovered that achieves the best results for all cancer types. As NetRank offers a great value for cancer outcome prediction, first steps for a secure usage of NetRank in a public cloud are described. Conclusion: Experimental evaluation of network-based methods on a gene expression benchmark dataset suggests that these methods are especially suited for outcome prediction as they overcome the problems of random gene signatures and noisy expression data. Through the combination of network information with gene expression data, network-based methods identify highly similar signatures over all cancer types, in contrast to classical methods that fail to identify highly common gene sets across the same cancer types. In general allows the integration of additional information in gene expression analysis the identification of more reliable, accurate and reproducible biomarkers and provides a deeper understanding of processes occurring in cancer development and progression.