Analysis and Reconstruction of the Hematopoietic Stem Cell Differentiation Tree: A Linear Programming Approach for Gene Selection

Stem cells differentiate through an organized hierarchy of intermediate cell types to terminally differentiated cell types. This process is largely guided by master transcriptional regulators, but it also depends on the expression of many other types of genes. The discrete cell types in the differen...

Full description

Bibliographic Details
Main Author:	Ghadie, Mohamed A.
Other Authors:	Perkins, Theodore
Language:	en
Published:	Université d'Ottawa / University of Ottawa 2015
Subjects:	Linear Programming Distance Metric Learning Machine Learning Feature Selection Tree Reconstruction Hierarchical Clustering Minimum Spanning Tree Clustering Optimization Maximum Parsimony Euclidean Distance Weighted Euclidean Stem Cell Differentiation Hematopoiesis Transcriptional Regulation Transcription Factor Gene Selection Gene Expression Microarray Cell Type Marker Gene Functional Annotation Random Forest Biological Function Regulation Statistical Significance Erythropoiesis Natural Killer Cell T Cell B Cell Granulocyte Monocyte Megakaryocyte Minimization Linear Constraint Cell Lineage
Online Access:	http://hdl.handle.net/10393/32048 http://dx.doi.org/10.20381/ruor-2753

Description
Summary:	Stem cells differentiate through an organized hierarchy of intermediate cell types to terminally differentiated cell types. This process is largely guided by master transcriptional regulators, but it also depends on the expression of many other types of genes. The discrete cell types in the differentiation hierarchy are often identified based on the expression or non-expression of certain marker genes. Historically, these have often been various cell-surface proteins, which are fairly easy to assay biochemically but are not necessarily causative of the cell type, in the sense of being master transcriptional regulators. This raises important questions about how gene expression across the whole genome controls or reflects cell state, and in particular, differentiation hierarchies. Traditional approaches to understanding gene expression patterns across multiple conditions, such as principal components analysis or K-means clustering, can group cell types based on gene expression, but they do so without knowledge of the differentiation hierarchy. Hierarchical clustering and maximization of parsimony can organize the cell types into a tree, but in general this tree is different from the differentiation hierarchy. Using hematopoietic differentiation as an example, we demonstrate how many genes other than marker genes are able to discriminate between different branches of the differentiation tree by proposing two models for detecting genes that are up-regulated or down-regulated in distinct lineages. We then propose a novel approach to solving the following problem: Given the differentiation hierarchy and gene expression data at each node, construct a weighted Euclidean distance metric such that the minimum spanning tree with respect to that metric is precisely the given differentiation hierarchy. We provide a set of linear constraints that are provably sufficient for the desired construction and a linear programming framework to identify sparse sets of weights, effectively identifying genes that are most relevant for discriminating different parts of the tree. We apply our method to microarray gene expression data describing 38 cell types in the hematopoiesis hierarchy, constructing a sparse weighted Euclidean metric that uses just 175 genes. These 175 genes are different than the marker genes that were used to identify the 38 cell types, hence offering a novel alternative way of discriminating different branches of the tree. A DAVID functional annotation analysis shows that the 175 genes reflect major processes and pathways active in different parts of the tree. However, we find that there are many alternative sets of weights that satisfy the linear constraints. Thus, in the style of random-forest training, we also construct metrics based on random subsets of the genes and compare them to the metric of 175 genes. Our results show that the 175 genes frequently appear in the random metrics, implicating their significance from an empirical point of view as well. Finally, we show how our linear programming method is able to identify columns that were selected to build minimum spanning trees on the nodes of random variable-size matrices.

Analysis and Reconstruction of the Hematopoietic Stem Cell Differentiation Tree: A Linear Programming Approach for Gene Selection

Similar Items