A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data

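Finding a good predictive model for a high-dimensional data set can be challenging. For genetic data, it is not only important to find a model with high predictive accuracy, but it is also important that this model uses only few features and that the selection of these features is stable. This is because, in bioinformatics, the models are used not only for prediction but also for drawing biological conclusions which makes the interpretability and reliability of the model crucial. We suggest using three target criteria when fitting a predictive model to a high-dimensional data set: the classification accuracy, the stability of the feature selection, and the number of chosen features. As it is unclear which measure is best for evaluating the stability, we first compare a variety of stability measures. We conclude that the Pearson correlation has the best theoretical and empirical properties. Also, we find that for the stability assessment behaviour it is most important that a measure contains a correction for chance or large numbers of chosen features. Then, we analyse Pareto fronts and conclude that it is possible to find models with a stable selection of few features without losing much predictive accuracy.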

Bibliographic Details
Main Authors:	Andrea Bommert, Jörg Rahnenführer, Michel Lang
Format:	Article
Language:	English
Published:	Hindawi Limited 2017-01-01
Series:	Computational and Mathematical Methods in Medicine
Online Access:	http://dx.doi.org/10.1155/2017/7907163

id	doaj-6d253096edf540a4869f7fa4409373aa
record_format	Article
spelling	doaj-6d253096edf540a4869f7fa4409373aa; 2020-11-25T00:02:49Z; eng; Hindawi Limited; Computational and Mathematical Methods in Medicine; 1748-670X; 1748-6718; 2017-01-01; 2017; 10.1155/2017/7907163; 7907163; A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data; Andrea Bommert (Department of Statistics, TU Dortmund University, 44221 Dortmund, Germany); Jörg Rahnenführer (Department of Statistics, TU Dortmund University, 44221 Dortmund, Germany); Michel Lang (Department of Statistics, TU Dortmund University, 44221 Dortmund, Germany); http://dx.doi.org/10.1155/2017/7907163
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Andrea Bommert, Jörg Rahnenführer, Michel Lang
title	A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data
publisher	Hindawi Limited
series	Computational and Mathematical Methods in Medicine
issn	1748-670X 1748-6718
publishDate	2017-01-01
description	Finding a good predictive model for a high-dimensional data set can be challenging. For genetic data, it is not only important to find a model with high predictive accuracy, but it is also important that this model uses only few features and that the selection of these features is stable. This is because, in bioinformatics, the models are used not only for prediction but also for drawing biological conclusions which makes the interpretability and reliability of the model crucial. We suggest using three target criteria when fitting a predictive model to a high-dimensional data set: the classification accuracy, the stability of the feature selection, and the number of chosen features. As it is unclear which measure is best for evaluating the stability, we first compare a variety of stability measures. We conclude that the Pearson correlation has the best theoretical and empirical properties. Also, we find that for the stability assessment behaviour it is most important that a measure contains a correction for chance or large numbers of chosen features. Then, we analyse Pareto fronts and conclude that it is possible to find models with a stable selection of few features without losing much predictive accuracy.
url	http://dx.doi.org/10.1155/2017/7907163
