Clustering, Classification, and Factor Analysis in High Dimensional Data Analysis
Clustering, classification, and factor analysis are three popular data mining techniques. In this dissertation, we investigate these methods in high-dimensional data analysis. Since there are many more features than samples and most of the features are non-informative in high dimensional da...
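Main Author: | Wang, Yanhong |
---|---|
Format: | Others |
Published: | ScholarWorks @ Georgia State University 2013 |
Subjects: | Cluster analysis; Classification; Cross-validation; High-dimensional data; Optimal score; Principal components analysis; Tuning parameter; Variable selection; Factor Analysis |
Online Access: | http://scholarworks.gsu.edu/math_diss/16 http://scholarworks.gsu.edu/cgi/viewcontent.cgi?article=1016&context=math_diss |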
id |
ndltd-GEORGIA-oai-scholarworks.gsu.edu-math_diss-1016 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-GEORGIA-oai-scholarworks.gsu.edu-math_diss-1016 2014-11-22T03:37:20Z Clustering, Classification, and Factor Analysis in High Dimensional Data Analysis Wang, Yanhong 2013-12-17T08:00:00Z text application/pdf http://scholarworks.gsu.edu/math_diss/16 http://scholarworks.gsu.edu/cgi/viewcontent.cgi?article=1016&context=math_diss Mathematics Dissertations ScholarWorks @ Georgia State University |
collection |
NDLTD |
format |
Others |
sources |
NDLTD |
topic |
Cluster analysis; Classification; Cross-validation; High-dimensional data; Optimal score; Principal components analysis; Tuning parameter; Variable selection; Factor Analysis |
description |
Clustering, classification, and factor analysis are three popular data mining techniques. In this dissertation, we investigate these methods in high-dimensional data analysis. Because high-dimensional data contain many more features than samples, and most of the features are non-informative, dimension reduction is necessary before clustering or classification can be performed. In the first part of this dissertation, we reinvestigate an existing clustering procedure, optimal discriminant clustering (ODC; Zhang and Dai, 2009), and propose using cross-validation to select the tuning parameter. We then develop a variation of ODC for high-dimensional data, sparse optimal discriminant clustering (SODC), by adding a group-lasso type of penalty to ODC. We also demonstrate that both ODC and SODC can be used as dimension reduction tools for data visualization in cluster analysis. In the second part, three existing sparse principal component analysis (SPCA) methods, Lasso-PCA (L-PCA), Alternative Lasso PCA (AL-PCA), and sparse principal component analysis by choice of norm (SPCABP), are applied to genome-wide SNP data from the International HapMap Project for AIM selection. Their classification accuracy is compared, and SPCABP is shown to outperform the other two SPCA methods. Third, we propose a novel method based on SPCABP, called sparse factor analysis by projection (SFABP), and propose using cross-validation to select the tuning parameter and the number of factors. Our simulation studies show that SFABP performs better than unpenalized factor analysis when both are applied to classification problems. |
author |
Wang, Yanhong |
title |
Clustering, Classification, and Factor Analysis in High Dimensional Data Analysis |
publisher |
ScholarWorks @ Georgia State University |
publishDate |
2013 |
url |
http://scholarworks.gsu.edu/math_diss/16 http://scholarworks.gsu.edu/cgi/viewcontent.cgi?article=1016&context=math_diss |