Clustering of temporal gene expression data with mixtures of mixed effects models

While time-dependent processes are important to biological functions, methods to leverage temporal information from large data have remained computationally challenging. In temporal gene-expression data, clustering can be used to identify genes with shared function in complex processes. Algorithms...

Bibliographic Details
Main Author: Lu, Darlene
Other Authors: Demissie, Serkalem
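Language: en_US
Published: 2019
Subjects: Biostatistics; Clustering; EM algorithm; Gene expression; Mixture model; Model selection; Polynomial regression
Online Access: https://hdl.handle.net/2144/34905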
id ndltd-bu.edu-oai-open.bu.edu-2144-34905
record_format oai_dc
spelling ndltd-bu.edu-oai-open.bu.edu-2144-34905 2019-12-22T15:11:48Z 2021-02-27T00:00:00Z 2019-04-23T17:33:01Z 2019 2019-02-27T17:02:54Z Thesis/Dissertation https://hdl.handle.net/2144/34905 en_US Attribution 4.0 International http://creativecommons.org/licenses/by/4.0/
collection NDLTD
language en_US
sources NDLTD
topic Biostatistics
Clustering
EM algorithm
Gene expression
Mixture model
Model selection
Polynomial regression
description While time-dependent processes are important to biological function, methods that leverage temporal information from large data sets have remained computationally challenging. In temporal gene-expression data, clustering can be used to identify genes with shared function in complex processes. Algorithms such as K-Means and standard Gaussian mixture models (GMM) fail to account for variability in replicated data or repeated measures over time. They also require the number of clusters to be assumed a priori, so many cluster numbers must be evaluated to select an optimal result. An improved penalized GMM offers a computationally efficient algorithm that simultaneously optimizes cluster number and labels. The work presented in this dissertation was motivated by mouse bone-fracture models, where the aim was to determine patterns of temporal gene expression during bone-healing progression. To this end, an extension to the penalized GMM was proposed that accounts for correlation between replicated data and repeated measures over time by introducing random effects, using a mixture of mixed-effects polynomial regression models and an entropy-penalized EM algorithm (EPEM). First, the performance of EPEM for different mixed-effects models was assessed with simulation studies, and EPEM was applied to the fracture-healing study. Second, two modifications were considered to address the high computational cost of EPEM: one clustered subsets of the data determined by predicted polynomial order (S-EPEM), and the other used a modified initialization to decrease the initial burden (I-EPEM). Each was compared to EPEM and applied to the fracture-healing study. Lastly, because varied rates of fracture healing were observed for mice with different genetic backgrounds (strains), a new analysis strategy was proposed to compare patterns of temporal gene expression between mouse strains, and it was assessed with simulation studies. Expression profiles for each strain were treated as separate objects to cluster, in order to identify genes assigned to different clusters across strains. We found that the addition of random effects decreased the accuracy of predicted cluster labels compared to K-Means, GMM, and fixed-effects EPEM. Polynomial-order optimization with BIC gave the highest accuracy, and optimization on subspaces obtained with singular value decomposition performed well. Computation time for S-EPEM was much reduced, with a slight decrease in accuracy. I-EPEM achieved accuracy similar to EPEM with a decrease in computation time. Application of the new analysis strategy to the fracture-healing data identified several distinct temporal gene-expression patterns for the different strains.
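The description mentions polynomial-order optimization with BIC. As a rough illustration of that idea only, the sketch below picks a polynomial order for a single replicated time course by minimizing BIC. It assumes an ordinary fixed-effects least-squares fit with Gaussian errors, not the mixed-effects mixture model of the dissertation. The function names `bic_for_order` and `select_order` and the simulated data are hypothetical.

```python
import numpy as np

def bic_for_order(t, y, order):
    """BIC of a least-squares polynomial fit, assuming i.i.d. Gaussian errors."""
    n = len(y)
    coeffs = np.polyfit(t, y, order)
    resid = y - np.polyval(coeffs, t)
    # Maximum-likelihood variance estimate, floored to avoid log(0).
    sigma2 = max(float(np.mean(resid ** 2)), 1e-12)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    # Free parameters: (order + 1) coefficients plus the error variance.
    k = order + 2
    return k * np.log(n) - 2 * loglik

def select_order(t, y, max_order=4):
    """Return the order in 0..max_order with the smallest BIC, and all BICs."""
    bics = [bic_for_order(t, y, p) for p in range(max_order + 1)]
    return int(np.argmin(bics)), bics

# Hypothetical data: 8 time points, 3 replicates, quadratic trend plus noise.
rng = np.random.default_rng(0)
t = np.tile(np.linspace(0.0, 1.0, 8), 3)
y = 1.0 + 2.0 * t - 3.0 * t ** 2 + rng.normal(scale=0.05, size=t.size)

best_order, bics = select_order(t, y)
print("selected order:", best_order)
```

Because the BIC penalty grows with the number of coefficients, higher orders are chosen only when they reduce the residual variance enough to offset it. A mixed-effects version would replace the Gaussian log-likelihood here with the marginal likelihood of the fitted mixed model.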
author2 Demissie, Serkalem
author_facet Demissie, Serkalem
Lu, Darlene
author Lu, Darlene
author_sort Lu, Darlene
title Clustering of temporal gene expression data with mixtures of mixed effects models
title_sort clustering of temporal gene expression data with mixtures of mixed effects models
publishDate 2019
url https://hdl.handle.net/2144/34905