Improved two-stage model averaging for high-dimensional linear regression, with application to Riboflavin data analysis

Abstract Background Model averaging has attracted increasing attention in recent years for the analysis of high-dimensional data. By weighting several competing statistical models suitably, model averaging attempts to achieve stable and improved prediction. In this paper, we develop a two-stage mode...


Bibliographic Details
Main Author:	Juming Pan
Format:	Article
Language:	English
Published:	BMC 2021-03-01
Series:	BMC Bioinformatics
Subjects:	High-dimensional regression; Model averaging; Variable selection; Cross-validation; Jackknife
Online Access:	https://doi.org/10.1186/s12859-021-04053-3

id	doaj-5d2430440e6c4688afd382ff46f5569c
record_format	Article
spelling	doaj-5d2430440e6c4688afd382ff46f5569c | 2021-03-28T11:46:22Z | eng | BMC | BMC Bioinformatics | 1471-2105 | 2021-03-01 | 221117 | 10.1186/s12859-021-04053-3 | Improved two-stage model averaging for high-dimensional linear regression, with application to Riboflavin data analysis | Juming Pan 0 | Department of Mathematics, Rowan University | Abstract. Background: Model averaging has attracted increasing attention in recent years for the analysis of high-dimensional data. By weighting several competing statistical models suitably, model averaging attempts to achieve stable and improved prediction. In this paper, we develop a two-stage model averaging procedure to enhance accuracy and stability in prediction for high-dimensional linear regression. First, we employ a high-dimensional variable selection method such as LASSO to screen out redundant predictors and construct a class of candidate models; then we apply jackknife cross-validation to optimize the model weights for averaging. Results: In simulation studies, the proposed technique outperforms commonly used alternative methods in the high-dimensional regression setting, in terms of minimizing the mean squared prediction error. We apply the proposed method to a riboflavin data set; the results show that the method is quite efficient in forecasting the riboflavin production rate when there are thousands of genes and only tens of subjects. Conclusions: Compared with a recent high-dimensional model averaging procedure (Ando and Li in J Am Stat Assoc 109:254–65, 2014), the proposed approach enjoys three appealing features and thus has better predictive performance: (1) More suitable methods are applied for model construction and weighting. (2) Computational flexibility is retained, since each candidate model and its corresponding weight are determined in the low-dimensional setting and quadratic programming is used in the cross-validation. (3) Model selection and averaging are combined in the procedure, so it makes full use of the strengths of both techniques. As a consequence, the proposed method can achieve stable and accurate predictions in high-dimensional linear models, and can greatly help applied researchers analyze genetic data in medical research. | https://doi.org/10.1186/s12859-021-04053-3 | High-dimensional regression; Model averaging; Variable selection; Cross-validation; Jackknife
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Juming Pan
title	Improved two-stage model averaging for high-dimensional linear regression, with application to Riboflavin data analysis
publisher	BMC
series	BMC Bioinformatics
issn	1471-2105
publishDate	2021-03-01
description	Abstract. Background: Model averaging has attracted increasing attention in recent years for the analysis of high-dimensional data. By weighting several competing statistical models suitably, model averaging attempts to achieve stable and improved prediction. In this paper, we develop a two-stage model averaging procedure to enhance accuracy and stability in prediction for high-dimensional linear regression. First, we employ a high-dimensional variable selection method such as LASSO to screen out redundant predictors and construct a class of candidate models; then we apply jackknife cross-validation to optimize the model weights for averaging. Results: In simulation studies, the proposed technique outperforms commonly used alternative methods in the high-dimensional regression setting, in terms of minimizing the mean squared prediction error. We apply the proposed method to a riboflavin data set; the results show that the method is quite efficient in forecasting the riboflavin production rate when there are thousands of genes and only tens of subjects. Conclusions: Compared with a recent high-dimensional model averaging procedure (Ando and Li in J Am Stat Assoc 109:254–65, 2014), the proposed approach enjoys three appealing features and thus has better predictive performance: (1) More suitable methods are applied for model construction and weighting. (2) Computational flexibility is retained, since each candidate model and its corresponding weight are determined in the low-dimensional setting and quadratic programming is used in the cross-validation. (3) Model selection and averaging are combined in the procedure, so it makes full use of the strengths of both techniques. As a consequence, the proposed method can achieve stable and accurate predictions in high-dimensional linear models, and can greatly help applied researchers analyze genetic data in medical research.
topic	High-dimensional regression; Model averaging; Variable selection; Cross-validation; Jackknife
url	https://doi.org/10.1186/s12859-021-04053-3


Similar Items