Causal inference and prior integration in bioinformatics using information theory

An important problem in bioinformatics is the reconstruction of gene regulatory networks from expression data. The analysis of genomic data stemming from high-throughput technologies such as microarray experiments or RNA-sequencing faces several difficulties. The first major issue is the high varia...

Bibliographic Details
Main Author:	Olsen, Catharina
Other Authors:	Bontempi, Gianluca
Format:	Doctoral Thesis
Language:	fr
Published:	Universite Libre de Bruxelles 2013
Subjects:	Informatique générale; Sciences exactes et naturelles; Colon (Anatomy) > Cancer; Bioinformatics; Information theory; Cancer colorectal; Bio-informatique; Théorie de l'information; bioinformatics; prior integration; causal inference; machine learning
Online Access:	http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/209401

id	ndltd-ulb.ac.be-oai-dipot.ulb.ac.be-2013-209401
record_format	oai_dc
spelling	ndltd-ulb.ac.be-oai-dipot.ulb.ac.be-2013-209401 2018-04-11T17:33:48Z info:eu-repo/semantics/doctoralThesis info:ulb-repo/semantics/doctoralThesis info:ulb-repo/semantics/openurl/vlink-dissertation Olsen, Catharina; Bontempi, Gianluca; Lenaerts, Tom; Meyer, Patrick E.; Quackenbush, John; Geurts, Pierre; Haibe-Kains, Benjamin; Jansen, Maarten Universite Libre de Bruxelles Université libre de Bruxelles, Faculté des Sciences – Informatique, Bruxelles 2013-10-17 fr 1 v. (xvi, 197 p.) Doctorat en Sciences info:eu-repo/semantics/nonPublished local/bictel.ulb.ac.be:ULBetd-10162013-104610 local/ulbcat.ulb.ac.be:994817 http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/209401 No full-text files
collection	NDLTD
language	fr
format	Doctoral Thesis
sources	NDLTD
topic	Informatique générale; Sciences exactes et naturelles; Colon (Anatomy) -- Cancer; Bioinformatics; Information theory; Cancer colorectal; Bio-informatique; Théorie de l'information; bioinformatics; prior integration; causal inference; machine learning
description	An important problem in bioinformatics is the reconstruction of gene regulatory networks from expression data. The analysis of genomic data stemming from high-throughput technologies such as microarray experiments or RNA-sequencing faces several difficulties. The first major issue is the high variable-to-sample ratio, which is due to a number of factors: a single experiment captures all genes, while the number of experiments is restricted by the experiment’s cost, time and patient cohort size. The second problem is that these data sets typically exhibit high amounts of noise.
Another important problem in bioinformatics is the question of how the quality of inferred networks can be evaluated. The current best practice is a two-step procedure. In the first step, the highest-scoring interactions are compared to known interactions stored in biological databases. The inferred network passes this quality assessment if there is a large overlap with the known interactions. In this case, a second step is carried out in which unknown but high-scoring, and thus promising, new interactions are validated ‘by hand’ via laboratory experiments. Unfortunately, when prior knowledge is integrated in the inference procedure, this validation procedure is biased because the same information is used in both the inference and the validation. Therefore, it no longer allows an independent validation of the resulting network.
The main contribution of this thesis is a complete computational framework that uses experimental knock-down data in a cross-validation scheme to both infer and validate directed networks. Its components are i) a method that integrates genomic data and prior knowledge to infer directed networks, ii) its implementation in an R/Bioconductor package and iii) a web application to retrieve prior knowledge from PubMed abstracts and biological databases. To infer directed networks from genomic data and prior knowledge, we propose a two-step procedure. First, we adapt the pairwise feature selection strategy mRMR to integrate prior knowledge in order to obtain the network’s skeleton. Then, for the subsequent orientation phase of the algorithm, we extend a criterion based on interaction information to include prior knowledge. The implementation of this method is available both as part of the prior retrieval tool Predictive Networks and as a stand-alone R/Bioconductor package named predictionet.
Furthermore, we propose a fully data-driven quantitative validation of such directed networks using experimental knock-down data. We start by identifying the set of genes that were truly affected by the perturbation experiment. The rationale of our validation procedure is that these truly affected genes should also be among the perturbed gene’s children in the inferred network. Consequently, we can compute a performance score === Doctorat en Sciences === info:eu-repo/semantics/nonPublished
author2	Bontempi, Gianluca
author_facet	Bontempi, Gianluca Olsen, Catharina
author	Olsen, Catharina
author_sort	Olsen, Catharina
title	Causal inference and prior integration in bioinformatics using information theory
publisher	Universite Libre de Bruxelles
publishDate	2013
url	http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/209401
work_keys_str_mv	AT olsencatharina causalinferenceandpriorintegrationinbioinformaticsusinginformationtheory
_version_	1718628518479265792

Similar Items