Dimension reduction of high-dimensional dataset with missing values

Datasets containing a very large number of variables or features are now routinely generated in many fields. Dimension reduction techniques are usually applied before statistical analysis of these datasets in order to avoid the effects of the curse of dimensionality. Principal component an...

Bibliographic Details
Main Authors: Ran Zhang, Bin Ye, Peng Liu
Format: Article
Language: English
Published: SAGE Publishing 2019-08-01
Series: Journal of Algorithms & Computational Technology
Online Access: https://doi.org/10.1177/1748302619867440
collection DOAJ
issn 1748-3026
description Datasets containing a very large number of variables or features are now routinely generated in many fields. Dimension reduction techniques are usually applied before statistical analysis of these datasets in order to avoid the effects of the curse of dimensionality. Principal component analysis is one of the most important techniques for dimension reduction and data visualization. However, missing values, which arise in almost every field, produce biased estimates and are difficult to handle, especially in high-dimension, low-sample-size settings. By exploiting a Lasso estimator of the population covariance matrix, we propose to regularize principal component analysis to reduce the dimensionality of datasets with missing data. The Lasso estimator of the covariance matrix is computationally tractable because it is obtained by solving a convex optimization problem. To illustrate the effectiveness of our method for dimension reduction, the principal component directions are evaluated using the Frobenius norm and cosine distance. Its performance is compared with that of other incomplete-data handling methods such as mean substitution and multiple imputation. Simulation results also show that our method is superior to other incomplete-data handling methods in the context of discriminant analysis of real-world high-dimensional datasets.