Clustering of temporal gene expression data with mixtures of mixed effects models
While time-dependent processes are important to biological functions, methods to leverage temporal information from large data have remained computationally challenging. In temporal gene-expression data, clustering can be used to identify genes with shared function in complex processes. Algorithms...
Main Author: | Lu, Darlene |
---|---|
Other Authors: | Demissie, Serkalem |
Language: | en_US |
Published: | 2019 |
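Subjects: | Biostatistics; Clustering; EM algorithm; Gene expression; Mixture model; Model selection; Polynomial regression |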
Online Access: | https://hdl.handle.net/2144/34905 |
id |
ndltd-bu.edu-oai-open.bu.edu-2144-34905 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-bu.edu-oai-open.bu.edu-2144-34905 2019-12-22T15:11:48Z Clustering of temporal gene expression data with mixtures of mixed effects models Lu, Darlene Demissie, Serkalem Biostatistics Clustering EM algorithm Gene expression Mixture model Model selection Polynomial regression 2021-02-27T00:00:00Z 2019-04-23T17:33:01Z 2019 2019-02-27T17:02:54Z Thesis/Dissertation https://hdl.handle.net/2144/34905 en_US Attribution 4.0 International http://creativecommons.org/licenses/by/4.0/ |
collection |
NDLTD |
language |
en_US |
sources |
NDLTD |
topic |
Biostatistics; Clustering; EM algorithm; Gene expression; Mixture model; Model selection; Polynomial regression |
description |
While time-dependent processes are important to biological functions, methods that leverage temporal information from large data sets remain computationally challenging. In temporal gene-expression data, clustering can be used to identify genes with shared function in complex processes. Algorithms like K-Means and standard Gaussian mixture models (GMM) fail to account for variability in replicated data or repeated measures over time, and they require the number of clusters to be specified a priori, so many candidate cluster numbers must be evaluated to select an optimal result. An improved penalized GMM offers a computationally efficient algorithm that optimizes the cluster number and the cluster labels simultaneously.
The work presented in this dissertation was motivated by mouse bone-fracture models, in which the aim was to determine patterns of temporal gene expression during bone-healing progression. To address this, an extension to the penalized GMM was proposed that accounts for correlation between replicated data and repeated measures over time by introducing random effects, using a mixture of mixed-effects polynomial regression models and an entropy-penalized EM algorithm (EPEM).
First, the performance of EPEM for different mixed-effects models was assessed with simulation studies, and the method was applied to the fracture-healing study. Second, modifications to address the high computational cost of EPEM were considered; these either clustered subsets of data determined by predicted polynomial order (S-EPEM) or used a modified initialization to decrease the initial burden (I-EPEM). Each was compared to EPEM and applied to the fracture-healing study. Lastly, as varied rates of fracture healing were observed for mice with different genetic backgrounds (strains), a new analysis strategy was proposed to compare patterns of temporal gene expression between mouse strains and was assessed with simulation studies. Expression profiles for each strain were treated as separate objects to cluster, in order to identify genes that clustered into different groups across strains.
We found that the addition of random effects decreased the accuracy of predicted cluster labels compared to K-Means, GMM, and fixed-effects EPEM. Polynomial-order optimization with BIC achieved the highest accuracy, and optimization on subspaces obtained with singular value decomposition also performed well. Computation time for S-EPEM was much reduced, with a slight decrease in accuracy. I-EPEM was comparable to EPEM, with similar accuracy and a decrease in computation time. Application of the new analysis strategy to the fracture-healing data identified several distinct temporal gene-expression patterns for the different strains. |
author2 |
Demissie, Serkalem |
author_facet |
Demissie, Serkalem Lu, Darlene |
author |
Lu, Darlene |
author_sort |
Lu, Darlene |
title |
Clustering of temporal gene expression data with mixtures of mixed effects models |
publishDate |
2019 |
url |
https://hdl.handle.net/2144/34905 |
work_keys_str_mv |
AT ludarlene clusteringoftemporalgeneexpressiondatawithmixturesofmixedeffectsmodels |
_version_ |
1719306434733146112 |
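
The description reports that polynomial-order optimization with BIC gave the highest accuracy. As an illustration of that idea only, the following minimal sketch selects a polynomial order for a single expression profile by BIC. It is a simplified fixed-effects version with a Gaussian error model, not the mixed-effects EPEM procedure from the dissertation; the function names (`fit_polynomial`, `bic`, `select_order`), the maximum order, and the example data are all hypothetical.

```python
import math

def fit_polynomial(times, values, order):
    """Ordinary least-squares polynomial fit via the normal equations.
    Returns coefficients ordered from the constant term upward."""
    m = order + 1
    # Normal equations A c = b with A[i][j] = sum t^(i+j), b[i] = sum y * t^i.
    A = [[float(sum(t ** (i + j) for t in times)) for j in range(m)] for i in range(m)]
    b = [float(sum(y * t ** i for t, y in zip(times, values))) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            factor = A[row][col] / A[col][col]
            for k in range(col, m):
                A[row][k] -= factor * A[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coef = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(A[i][k] * coef[k] for k in range(i + 1, m))
        coef[i] = (b[i] - tail) / A[i][i]
    return coef

def bic(times, values, order):
    """BIC of a Gaussian polynomial regression (assumed error model)."""
    coef = fit_polynomial(times, values, order)
    n = len(times)
    rss = sum((y - sum(c * t ** p for p, c in enumerate(coef))) ** 2
              for t, y in zip(times, values))
    sigma2 = max(rss / n, 1e-12)  # maximum-likelihood variance, floored for stability
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    n_params = order + 2  # polynomial coefficients plus the error variance
    return -2 * log_lik + n_params * math.log(n)

def select_order(times, values, max_order=3):
    """Return the polynomial order in 0..max_order with the smallest BIC."""
    return min(range(max_order + 1), key=lambda d: bic(times, values, d))

# Hypothetical profile: six time points, two replicates each, quadratic trend.
times = [t for t in range(6) for _ in range(2)]
values = [1 + 0.5 * t - 0.3 * t ** 2 + (0.1 if i % 2 == 0 else -0.1)
          for i, t in enumerate(times)]
best_order = select_order(times, values)
print(best_order)
```

In this made-up example the two replicates at each time point sit symmetrically around a quadratic curve, so the cubic fit cannot reduce the residual error and the BIC penalty favors order 2. A mixed-effects version would additionally model replicate-level random effects, which this sketch does not attempt.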