High-dimensional Data Clustering and Statistical Analysis of Clustering-based Data Summarization Products
Main Author: | Zhou, Dunke |
---|---|
Language: | English |
Published: | The Ohio State University / OhioLINK, 2012 |
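Subjects: | Statistics; k-means; sparsity; stability analysis; subspace clustering; Mallows distance; climate data; data reduction |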
Online Access: | http://rave.ohiolink.edu/etdc/view?acc_num=osu1338303646 |
id |
ndltd-OhioLink-oai-etd.ohiolink.edu-osu1338303646 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-OhioLink-oai-etd.ohiolink.edu-osu1338303646 2021-08-03T06:05:17Z High-dimensional Data Clustering and Statistical Analysis of Clustering-based Data Summarization Products Zhou, Dunke Statistics k-means sparsity stability analysis subspace clustering Mallows distance climate data data reduction <p>With the advancement of modern technology, we have seen the expansion of data in two dimensions: number of variables and number of observations. Such high dimensionality and large data volume pose new challenges to statistical analysis. This thesis considers two problems related to cluster analysis: high-dimensional data clustering and statistical analysis of clustering-based data summarization products.</p><p>High dimensionality often makes traditional clustering methods ineffective. Variable selection is a common approach to reducing the dimensionality of data for better cluster analysis. Most recently developed methods perform variable selection, explicitly or implicitly, based on a variable importance (VI) measure. In this thesis, an algorithmic framework is introduced which iterates between constructing the VI and performing variable selection, each conditional on the other. Within this framework, we develop an ensemble VI which is constructed by averaging a set of VIs. Both theoretical and simulation studies show that the proposed ensemble VI has better variable selection performance than a non-ensemble VI and is robust to the choice of the number of groups in cluster analysis. In addition to this development of the VI, we propose a new VI-based variable selection method which selects a set of variables by sequentially testing for the existence of group structure in the data. Its effectiveness is demonstrated through a simulation study and a real data application.</p><p>In the second problem, we consider a histogram type of data summarization product from massive climate data. 
Unlike the traditional data reduction method, which summarizes the observations in each spatial grid-box during a period of time by their average, a NASA science team recently used a multivariate histogram, constructed by cluster analysis, to represent those observations. This method has been applied to the observations collected by the Atmospheric Infrared Sounder (AIRS) to produce AIRS L3Q products. In this thesis, we study potential statistical tools using pairwise dissimilarities that are suitable for analyzing this histogram type of data. Through theoretical analysis and simulations, we investigate several different dissimilarity measures and find that Mallows distance is preferable to the others when the locations of the representative vectors are important for the analysis. We apply multidimensional scaling and clustering methods to analyze the AIRS data collected in December 2002. The results from these studies show the effectiveness of statistical methods based on Mallows distance in extracting information from this histogram type of data.</p> 2012-06-27 English text The Ohio State University / OhioLINK http://rave.ohiolink.edu/etdc/view?acc_num=osu1338303646 http://rave.ohiolink.edu/etdc/view?acc_num=osu1338303646 unrestricted This thesis or dissertation is protected by copyright: all rights reserved. It may not be copied or redistributed beyond the terms of applicable copyright laws. |
collection |
NDLTD |
language |
English |
sources |
NDLTD |
topic |
Statistics k-means sparsity stability analysis subspace clustering Mallows distance climate data data reduction |
spellingShingle |
Statistics k-means sparsity stability analysis subspace clustering Mallows distance climate data data reduction Zhou, Dunke High-dimensional Data Clustering and Statistical Analysis of Clustering-based Data Summarization Products |
author |
Zhou, Dunke |
author_facet |
Zhou, Dunke |
author_sort |
Zhou, Dunke |
title |
High-dimensional Data Clustering and Statistical Analysis of Clustering-based Data Summarization Products |
title_short |
High-dimensional Data Clustering and Statistical Analysis of Clustering-based Data Summarization Products |
title_full |
High-dimensional Data Clustering and Statistical Analysis of Clustering-based Data Summarization Products |
title_fullStr |
High-dimensional Data Clustering and Statistical Analysis of Clustering-based Data Summarization Products |
title_full_unstemmed |
High-dimensional Data Clustering and Statistical Analysis of Clustering-based Data Summarization Products |
title_sort |
high-dimensional data clustering and statistical analysis of clustering-based data summarization products |
publisher |
The Ohio State University / OhioLINK |
publishDate |
2012 |
url |
http://rave.ohiolink.edu/etdc/view?acc_num=osu1338303646 |
work_keys_str_mv |
AT zhoudunke highdimensionaldataclusteringandstatisticalanalysisofclusteringbaseddatasummarizationproducts |
_version_ |
1719430684114681856 |