Supervised and unsupervised model-based clustering with variable selection

The thesis tackles the problem of uncovering hidden structures in high-dimensional data in the presence of noise and non-informative variables. It proposes a supervised and an unsupervised mixture model, both of which select the relevant variables and are robust to measurement errors and outliers. Within the...

Bibliographic Details
Main Author: Cozzini, Alberto Maria
Other Authors: Montana, Giovanni ; Jasra, Ajay
Published: Imperial College London 2012
Subjects: 519.53
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.560758
id ndltd-bl.uk-oai-ethos.bl.uk-560758
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-560758
2017-08-30T03:18:55Z
Supervised and unsupervised model-based clustering with variable selection
Cozzini, Alberto Maria
Montana, Giovanni ; Jasra, Ajay
2012
519.53
Imperial College London
http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.560758
http://hdl.handle.net/10044/1/9973
Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
topic 519.53
description The thesis tackles the problem of uncovering hidden structures in high-dimensional data in the presence of noise and non-informative variables. It proposes a supervised and an unsupervised mixture model, both of which select the relevant variables and are robust to measurement errors and outliers. Within the class of unsupervised clustering models we extend variable selection to the family of Student's t mixture models. While t distributions are naturally robust to noise and extreme events, sparsity is achieved by imposing regularization on the location and dispersion parameters. An EM algorithm is implemented to return the maximum likelihood estimate of the model parameters given the added penalty term. To further assess the contribution of each variable, we propose a resampling procedure that ranks the variables according to their selection probability. Supervised clustering is implemented in a Bayesian framework. The model assumes a mixture of Lasso-type regressions with t-distributed errors. While the Lasso representation of the normal linear model imposes regularization on the regression coefficients, variable selection is explicitly modelled by a latent binary indicator variable. The model relies on a particle Markov chain Monte Carlo algorithm to approximate the posterior distribution of the parameters of interest. To highlight the properties and advantages of the proposed models, two real-life problems are considered. The first one requires us to identify subtypes of breast cancer tumors by grouping patients based only on their gene expression levels when only a few of the thousands of genes are informative. In the second case our aim is to cluster different financial markets spanning several macro sectors and explain their trading performance only on the basis of the observed statistical features of their price dynamics.
author2 Montana, Giovanni ; Jasra, Ajay
author Cozzini, Alberto Maria
title Supervised and unsupervised model-based clustering with variable selection
publisher Imperial College London
publishDate 2012
url http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.560758
_version_ 1718521935065776128
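
The description states that t distributions are naturally robust to extreme events and that sparsity comes from regularizing the location and dispersion parameters inside an EM algorithm. The record does not give the penalty or the update equations, so the following is only a minimal sketch of the general idea, under loudly stated assumptions: a single mixture component, a diagonal scale, fixed degrees of freedom nu, and an L1 penalty on the location only. All function and variable names are illustrative and are not taken from the thesis.

```python
import numpy as np


def soft_threshold(x, threshold):
    """Shrink x toward zero; entries within the threshold become exactly 0."""
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)


def t_weights(X, mu, sigma2, nu):
    """E-step scale weights of a Student's t component (diagonal scale assumed).

    Observations with a large squared Mahalanobis distance from mu receive
    small weights, which is what limits the influence of outliers.
    """
    p = X.shape[1]
    delta = np.sum((X - mu) ** 2 / sigma2, axis=1)
    return (nu + p) / (nu + delta)


def penalized_location_update(X, resp, u, sigma2, lam):
    """M-step location update under an ASSUMED L1 penalty lam * sum_j |mu_j|.

    resp holds the component responsibilities, u the t weights. The
    weighted mean is soft-thresholded, so weak locations are set to 0.
    """
    w = resp * u
    weighted_mean = (w[:, None] * X).sum(axis=0) / w.sum()
    return soft_threshold(weighted_mean, lam * sigma2 / w.sum())


# Toy data (hypothetical): variable 0 is shifted by 2, variables 1 and 2
# are centred at 0, and the last row is a gross outlier.
base = np.linspace(-1.0, 1.0, 21)
X = np.column_stack([base + 2.0, base, -base])
X = np.vstack([X, [50.0, 50.0, 50.0]])

mu = np.array([2.0, 0.0, 0.0])   # current location estimate
sigma2 = np.ones(3)              # current diagonal scale
resp = np.ones(X.shape[0])       # single component: all responsibilities are 1

u = t_weights(X, mu, sigma2, nu=4.0)
mu_new = penalized_location_update(X, resp, u, sigma2, lam=5.0)

print("weight of the outlier:", u[-1])
print("plain column means:   ", X.mean(axis=0))
print("penalized t update:   ", mu_new)
```

In this toy run the outlier drags the plain column means away from the bulk of the data, while the t weights nearly remove it from the update and the soft-thresholding sets the two uninformative locations to exactly zero. A variable whose location is zero in every component no longer separates the clusters, which is one common way such a penalty acts as variable selection.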