Clustering Problems for High Dimensional Data
Main Author: Wang, Wanjie
Format: Others
Published: Research Showcase @ CMU, 2014
Online Access: http://repository.cmu.edu/dissertations/384
http://repository.cmu.edu/cgi/viewcontent.cgi?article=1384&context=dissertations
Description:
We consider a clustering problem in which we observe feature vectors X_i ∈ R^p, i = 1, 2, ..., n, from several possible classes. The class labels are unknown, and the main interest is to estimate them. We propose a three-step clustering procedure: first, we evaluate the significance of each feature with the Kolmogorov-Smirnov (KS) statistic; next, we retain the small fraction of features whose KS scores exceed a preselected threshold t > 0; finally, we cluster using only the selected features with a version of Principal Component Analysis (PCA). A main challenge in this procedure is how to set the threshold t. We propose a new approach whose core is the Signal-to-Noise Ratio (SNR) of the post-selection PCA. The SNR is reminiscent of the recent innovation of Higher Criticism; for this reason, we call the proposed threshold the Higher Criticism Threshold (HCT), although it differs significantly from the HCT proposed earlier by [Donoho 2008] in the context of classification. Motivated by many examples in Big Data, we study spectral clustering with the HCT in a two-class model where the signals are both rare and weak. Through a delicate analysis of the post-selection PCA, we forge a close link between the HCT and the ideal threshold choice, and we show that the HCT yields optimal results in the spectral clustering approach. The approach is successfully applied to three gene microarray data sets, where it compares favorably with existing clustering methods. Our analysis is subtle and requires new developments in Random Matrix Theory (RMT). One challenge we face is that most existing RMT results cannot be applied directly to our case: they usually concern matrices with i.i.d. entries, whereas the object of interest here is the post-selection data matrix, whose columns (due to feature selection) are dependent and have hard-to-track distributions. We develop new RMT tools to overcome this problem. We also derive theoretical approximations for the tail distribution of the Kolmogorov-Smirnov statistic under both the null and the alternative hypotheses; these approximations justify the effectiveness of the KS statistic for feature selection. In addition, we establish the fundamental limits for the clustering, signal recovery, and detection problems under the Asymptotic Rare and Weak (ARW) model: we identify a boundary in the parameter space such that, beyond the boundary, no method can succeed, while within it some method (typically an exhaustive search) achieves the inference goal.
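To make the three-step procedure concrete, here is a minimal sketch in Python. It assumes the columns of X have been standardized so that a non-informative feature is approximately N(0, 1), scores each feature with scipy's KS test against the standard-normal null, and takes the threshold t as given (a data-driven choice is sketched next). The function name and the k-means step on the leading principal components are illustrative choices, not the dissertation's exact implementation.

```python
import numpy as np
from scipy.stats import kstest
from sklearn.cluster import KMeans

def ks_screen_pca_cluster(X, t, k=2):
    """Sketch of the three-step procedure: KS screening, thresholding,
    post-selection PCA. X is the n-by-p data matrix (columns assumed
    standardized); t is the KS-score threshold; k is the number of clusters."""
    n, p = X.shape
    # Step 1: significance of each feature via the KS statistic
    # against the standard-normal null.
    ks_scores = np.array([kstest(X[:, j], 'norm').statistic for j in range(p)])
    # Step 2: keep only the features whose KS score exceeds t.
    selected = ks_scores > t
    if not selected.any():
        selected = ks_scores >= ks_scores.max()  # fall back to the top feature
    # Step 3: PCA on the post-selection matrix, then cluster the n samples
    # by their leading (k - 1) principal-component scores.
    U, s, _ = np.linalg.svd(X[:, selected], full_matrices=False)
    pc_scores = U[:, :k - 1] * s[:k - 1]
    return KMeans(n_clusters=k, n_init=10).fit_predict(pc_scores)
```

For the two-class case, clustering by the sign of the first left singular vector of the post-selection matrix is a common equivalent alternative to running k-means on a single component.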
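The abstract's stated main challenge is choosing t. In the published IF-PCA line of work, a data-driven threshold is obtained by maximizing a Higher Criticism functional of the feature p-values; the sketch below uses the classical HC functional of Donoho and Jin (2004) as a stand-in, since the dissertation's SNR-based HCT may differ in detail. Here `ks_pvalue` is a hypothetical helper mapping KS scores to their null p-values (e.g., via the tail approximations developed in the thesis).

```python
import numpy as np

def hc_threshold(ks_scores, ks_pvalue):
    """Data-driven KS-score threshold via the classical Higher Criticism
    functional (a stand-in for the dissertation's HCT).
    ks_scores : (p,) array of KS scores, one per feature.
    ks_pvalue : callable mapping KS scores to null p-values (hypothetical)."""
    p = len(ks_scores)
    # Sorted p-values, clipped away from 0 and 1 to avoid division by zero.
    pv = np.sort(np.clip(ks_pvalue(ks_scores), 1e-12, 1 - 1e-12))
    j = np.arange(1, p + 1)
    hc = np.sqrt(p) * (j / p - pv) / np.sqrt(pv * (1 - pv))
    # Maximize over a conventional range: skip the very smallest p-values,
    # stop at p/2 (a common recipe; the exact range is a tuning choice).
    lo, hi = max(1, int(0.01 * p)), p // 2
    jhat = lo + int(np.argmax(hc[lo:hi]))
    # Threshold = KS score of the feature with the (jhat + 1)-th smallest
    # p-value, i.e. the (jhat + 1)-th largest KS score.
    return np.sort(ks_scores)[::-1][jhat]
```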
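For reference, one standard formulation of the two-class rare/weak signal model from the related literature is written below in LaTeX; this is an assumption for illustration, and the dissertation's exact ARW parametrization may differ in detail.

```latex
% Two-class rare/weak model (one standard formulation; an assumption,
% not necessarily the dissertation's exact parametrization):
\[
  X_i = \ell_i \mu + Z_i, \qquad \ell_i \in \{-1, 1\}, \qquad
  Z_i \overset{\mathrm{iid}}{\sim} N(0, I_p), \qquad i = 1, \dots, n,
\]
\[
  \mu_j \overset{\mathrm{iid}}{\sim} (1 - \epsilon_p)\,\delta_0
        + \epsilon_p\,\delta_{\tau_p}, \qquad
  \epsilon_p = p^{-\beta}, \quad \tau_p = p^{-\alpha}.
\]
% Signals are rare (\epsilon_p -> 0) and individually weak (\tau_p -> 0);
% fundamental limits then take the form of regions in the (\alpha, \beta)
% plane (jointly with the growth of n relative to p) separating impossible
% from achievable clustering, signal recovery, and detection.
```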