Dimension reduction of high-dimensional dataset with missing values
Nowadays, datasets containing a very large number of variables or features are routinely generated in many fields. Dimension reduction techniques are usually applied before these datasets are analyzed statistically, in order to avoid the effects of the curse of dimensionality. Principal component an...
Main Authors: | Ran Zhang, Bin Ye, Peng Liu |
---|---|
Format: | Article |
Language: | English |
Published: | SAGE Publishing 2019-08-01 |
Series: | Journal of Algorithms & Computational Technology |
Online Access: | https://doi.org/10.1177/1748302619867440 |
id |
doaj-a47a61c0686343e095a0f076621fb5c0 |
---|---|
record_format |
Article |
spelling |
doaj-a47a61c0686343e095a0f076621fb5c0; 2020-11-25T03:24:00Z; eng; SAGE Publishing; Journal of Algorithms & Computational Technology; 1748-3026; 2019-08-01; 13; 10.1177/1748302619867440; Dimension reduction of high-dimensional dataset with missing values; Ran Zhang; Bin Ye; Peng Liu; Nowadays, datasets containing a very large number of variables or features are routinely generated in many fields. Dimension reduction techniques are usually performed prior to statistically analyzing these datasets in order to avoid the effects of the curse of dimensionality. Principal component analysis is one of the most important techniques for dimension reduction and data visualization. However, datasets with missing values arising in almost every field will produce biased estimates and are difficult to handle, especially in the high dimension, low sample size settings. By exploiting a Lasso estimator of the population covariance matrix, we propose to regularize the principal component analysis to reduce the dimensionality of dataset with missing data. The Lasso estimator of covariance matrix is computationally tractable by solving a convex optimization problem. To illustrate the effectiveness of our method on dimension reduction, the principal component directions are evaluated by the metrics of Frobenius norm and cosine distance. The performances are compared with other incomplete data handling methods such as mean substitution and multiple imputation. Simulation results also show that our method is superior to other incomplete data handling methods in the context of discriminant analysis of real world high-dimensional datasets.; https://doi.org/10.1177/1748302619867440 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Ran Zhang Bin Ye Peng Liu |
spellingShingle |
Ran Zhang Bin Ye Peng Liu Dimension reduction of high-dimensional dataset with missing values Journal of Algorithms & Computational Technology |
author_facet |
Ran Zhang Bin Ye Peng Liu |
author_sort |
Ran Zhang |
title |
Dimension reduction of high-dimensional dataset with missing values |
title_short |
Dimension reduction of high-dimensional dataset with missing values |
title_full |
Dimension reduction of high-dimensional dataset with missing values |
title_fullStr |
Dimension reduction of high-dimensional dataset with missing values |
title_full_unstemmed |
Dimension reduction of high-dimensional dataset with missing values |
title_sort |
dimension reduction of high-dimensional dataset with missing values |
publisher |
SAGE Publishing |
series |
Journal of Algorithms & Computational Technology |
issn |
1748-3026 |
publishDate |
2019-08-01 |
description |
Nowadays, datasets containing a very large number of variables or features are routinely generated in many fields. Dimension reduction techniques are usually applied before these datasets are analyzed statistically, in order to avoid the effects of the curse of dimensionality. Principal component analysis is one of the most important techniques for dimension reduction and data visualization. However, missing values, which arise in almost every field, lead to biased estimates and are difficult to handle, especially in high-dimension, low-sample-size settings. By exploiting a Lasso estimator of the population covariance matrix, we propose to regularize principal component analysis in order to reduce the dimensionality of datasets with missing data. The Lasso estimator of the covariance matrix is computationally tractable because it is obtained by solving a convex optimization problem. To illustrate the effectiveness of our method for dimension reduction, the principal component directions are evaluated using the Frobenius norm and cosine distance. Its performance is compared with that of other incomplete-data handling methods such as mean substitution and multiple imputation. Simulation results also show that our method is superior to other incomplete-data handling methods in the context of discriminant analysis of real-world high-dimensional datasets. |
url |
https://doi.org/10.1177/1748302619867440 |
work_keys_str_mv |
AT ranzhang dimensionreductionofhighdimensionaldatasetwithmissingvalues AT binye dimensionreductionofhighdimensionaldatasetwithmissingvalues AT pengliu dimensionreductionofhighdimensionaldatasetwithmissingvalues |
_version_ |
1724604103564197888 |
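
The pipeline summarized in the description field (estimate a covariance matrix from incomplete data, regularize it with an L1 penalty, extract principal directions, compare directions by cosine distance) can be sketched roughly as follows. This is an illustrative sketch only and is not the authors' estimator: it assumes an available-case (pairwise) covariance estimate, and it swaps in soft-thresholding of the off-diagonal entries, which is the closed-form minimizer of 0.5 * ||Sigma - S||_F^2 + lam * sum over j != k of |Sigma_jk|. All function names, the toy data, and the value of `lam` are hypothetical. Only the Python standard library is used; a numerical library would be the usual choice in practice.

```python
import math


def pairwise_covariance(rows):
    """Available-case covariance estimate; missing entries are None.

    Assumes every pair of columns is observed together in at least one row.
    """
    p = len(rows[0])
    means = []
    for j in range(p):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    cov = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j, p):
            prods = [(r[j] - means[j]) * (r[k] - means[k])
                     for r in rows if r[j] is not None and r[k] is not None]
            cov[j][k] = cov[k][j] = sum(prods) / len(prods)
    return cov


def soft_threshold(cov, lam):
    """Shrink off-diagonal entries toward zero by lam (L1-penalized fit)."""
    out = [row[:] for row in cov]
    for j in range(len(cov)):
        for k in range(len(cov)):
            if j != k:
                shrunk = max(abs(cov[j][k]) - lam, 0.0)
                out[j][k] = math.copysign(shrunk, cov[j][k])
    return out


def leading_direction(mat, iters=500):
    """First principal direction by power iteration.

    Assumes the dominant eigenvalue is positive and that the all-ones start
    vector is not orthogonal to the leading eigenvector.
    """
    p = len(mat)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(mat[j][k] * v[k] for k in range(p)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def cosine_distance(u, v):
    """1 - |cos(angle)|; the absolute value ignores the sign of a direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - abs(dot) / (norm_u * norm_v)


# Hypothetical toy data: 6 samples, 3 variables, None marks a missing value.
data = [
    [2.1, 1.9, None],
    [0.9, 1.2, 0.3],
    [None, 3.1, -0.2],
    [4.0, 3.8, 0.1],
    [3.2, None, -0.4],
    [1.5, 1.4, 0.2],
]
raw = pairwise_covariance(data)
regularized = soft_threshold(raw, lam=0.1)
print(cosine_distance(leading_direction(raw), leading_direction(regularized)))
```

A pairwise estimate that has been thresholded is not guaranteed to be positive semidefinite, so a production implementation would typically add a projection step or solve the penalized problem under a positive-semidefinite constraint.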