Supervised and unsupervised model-based clustering with variable selection
The thesis tackles the problem of uncovering hidden structures in high-dimensional data in the presence of noise and non-informative variables. It proposes supervised and unsupervised mixture models that select the relevant variables and are robust to measurement errors and outliers. Within the...
Main Author: | Cozzini, Alberto Maria |
---|---|
Other Authors: | Montana, Giovanni ; Jasra, Ajay |
Published: | Imperial College London 2012 |
Subjects: | 519.53 |
Online Access: | http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.560758 |
id |
ndltd-bl.uk-oai-ethos.bl.uk-560758 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-bl.uk-oai-ethos.bl.uk-560758 ; 2017-08-30T03:18:55Z ; Supervised and unsupervised model-based clustering with variable selection ; Cozzini, Alberto Maria ; Montana, Giovanni ; Jasra, Ajay ; 2012 ; 519.53 ; Imperial College London ; http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.560758 ; http://hdl.handle.net/10044/1/9973 ; Electronic Thesis or Dissertation |
collection |
NDLTD |
sources |
NDLTD |
topic |
519.53 |
description |
The thesis tackles the problem of uncovering hidden structures in high-dimensional data in the presence of noise and non-informative variables. It proposes supervised and unsupervised mixture models that select the relevant variables and are robust to measurement errors and outliers. Within the class of unsupervised clustering models we extend variable selection to the family of Student's t mixture models. While t distributions are naturally robust to noise and extreme events, sparsity is achieved by imposing regularization on the location and dispersion parameters. An EM algorithm is implemented to return the maximum likelihood estimate of the model parameters given the added penalty term. To further assess the contribution of each variable we propose a resampling procedure that ranks the variables according to their selection probability. Supervised clustering is implemented in a Bayesian framework. The model assumes a mixture of Lasso-type regressions with t-distributed errors. While the Lasso representation of the normal linear model imposes regularization on the regression coefficients, variable selection is explicitly modelled by a latent binary indicator variable. The model relies on a particle Markov chain Monte Carlo algorithm to approximate the posterior distribution of the parameters of interest. To highlight the properties and advantages of the proposed models, two real-life problems are considered. The first one requires us to identify subtypes of breast cancer tumors by grouping patients based only on their gene expression levels when only a few of the thousands of genes are informative. In the second case our aim is to cluster different financial markets spanning several macro sectors and explain their trading performance only on the basis of the observed statistical features of their price dynamics. |
author2 |
Montana, Giovanni ; Jasra, Ajay |
author |
Cozzini, Alberto Maria |
title |
Supervised and unsupervised model-based clustering with variable selection |
publisher |
Imperial College London |
publishDate |
2012 |
url |
http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.560758 |
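The description above mentions a resampling procedure that ranks variables by their selection probability. The record gives no details of that procedure, so the following is only a minimal sketch of the general idea, under stated assumptions: a crude 2-means split with a thresholded centroid difference stands in for the penalized Student's t mixture fit, and all function names, the threshold, the subsample fraction, and the simulated data are hypothetical.

```python
import numpy as np


def two_means_labels(X, n_iter=20):
    """Crude 2-means clustering (a stand-in, NOT the thesis's t-mixture EM)."""
    # Deterministic initialisation: the point farthest from the overall mean,
    # then the point farthest from that first centre.
    c0 = X[np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1))]
    c1 = X[np.argmax(np.linalg.norm(X - c0, axis=1))]
    centers = np.stack([c0, c1]).astype(float)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers


def select_variables(X, threshold):
    """Hypothetical selector: keep variables whose centroids differ enough."""
    _, centers = two_means_labels(X)
    return np.abs(centers[0] - centers[1]) > threshold


def selection_probabilities(X, n_resamples=100, frac=0.5, threshold=1.0, seed=0):
    """Fraction of random subsamples in which each variable is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_resamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        counts += select_variables(X[idx], threshold)
    return counts / n_resamples


# Simulated data (assumption): 10 variables, only the first two separate
# the two groups; the remaining eight are pure noise.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 10))
X[:100, :2] += 5.0

probs = selection_probabilities(X)
ranking = np.argsort(-probs, kind="stable")  # most frequently selected first
print("selection probabilities:", np.round(probs, 2))
print("variable ranking:", ranking)
```

Swapping `select_variables` for a sparse model fit of one's choice leaves the resampling loop unchanged; the ranking then reflects how stably that fit selects each variable across subsamples.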