Improved two-stage model averaging for high-dimensional linear regression, with application to Riboflavin data analysis

Abstract. Background: Model averaging has attracted increasing attention in recent years for the analysis of high-dimensional data. By weighting several competing statistical models suitably, model averaging attempts to achieve stable and improved prediction. In this paper, we develop a two-stage mode...


Bibliographic Details
Main Author: Juming Pan
Format: Article
Language: English
Published: BMC 2021-03-01
Series: BMC Bioinformatics
Subjects: High-dimensional regression; Model averaging; Variable selection; Cross-validation; Jackknife
Online Access: https://doi.org/10.1186/s12859-021-04053-3
id doaj-5d2430440e6c4688afd382ff46f5569c
record_format Article
spelling doaj-5d2430440e6c4688afd382ff46f5569c; 2021-03-28T11:46:22Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2021-03-01; 22; 1; 1; 17; 10.1186/s12859-021-04053-3; Juming Pan (Department of Mathematics, Rowan University)
collection DOAJ
language English
format Article
sources DOAJ
author Juming Pan
title Improved two-stage model averaging for high-dimensional linear regression, with application to Riboflavin data analysis
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2021-03-01
description Abstract. Background: Model averaging has attracted increasing attention in recent years for the analysis of high-dimensional data. By suitably weighting several competing statistical models, model averaging attempts to achieve stable and improved prediction. In this paper, we develop a two-stage model averaging procedure to enhance accuracy and stability in prediction for high-dimensional linear regression. First, we employ a high-dimensional variable selection method such as LASSO to screen out redundant predictors and construct a class of candidate models; then we apply jackknife cross-validation to optimize the model weights for averaging. Results: In simulation studies, the proposed technique outperforms commonly used alternative methods in the high-dimensional regression setting in terms of mean squared prediction error. We apply the proposed method to the riboflavin data; the results show that the method is quite efficient in forecasting the riboflavin production rate when there are thousands of genes and only tens of subjects. Conclusions: Compared with a recent high-dimensional model averaging procedure (Ando and Li in J Am Stat Assoc 109:254–65, 2014), the proposed approach has three appealing features that lead to better predictive performance: (1) More suitable methods are applied for model construction and weighting. (2) Computational flexibility is retained, since each candidate model and its corresponding weight are determined in a low-dimensional setting and quadratic programming is used in the cross-validation. (3) Model selection and averaging are combined in the procedure, so it makes full use of the strengths of both techniques. As a consequence, the proposed method can achieve stable and accurate predictions in high-dimensional linear models and can greatly help practical researchers analyze genetic data in medical research.
topic High-dimensional regression
Model averaging
Variable selection
Cross-validation
Jackknife
url https://doi.org/10.1186/s12859-021-04053-3
_version_ 1724199611096104960
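
The two-stage procedure summarized in the abstract (LASSO screening to build candidate models, then jackknife cross-validation to choose averaging weights) can be illustrated with a minimal sketch. This is not the author's implementation, and several choices are assumptions made only for illustration: the candidate models are nested sets of the LASSO-selected predictors ordered by coefficient magnitude, the weights are found with a simple Frank-Wolfe iteration standing in for the quadratic programming step, and the data are simulated rather than the riboflavin data. All function names (`lasso_order`, `jackknife_predictions`, `simplex_weights`, and so on) are hypothetical, and only the Python standard library is used.

```python
import math
import random

def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y, cols):
    """Least-squares fit (intercept first) on the predictor columns in `cols`."""
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    q = len(cols) + 1
    A = [[sum(z[a] * z[b] for z in Z) for b in range(q)] for a in range(q)]
    rhs = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(q)]
    return solve(A, rhs)

def predict(beta, row, cols):
    return beta[0] + sum(bj * row[j] for bj, j in zip(beta[1:], cols))

def lasso_order(X, y, lam, sweeps=100):
    """Stage 1: coordinate-descent LASSO on centred data.
    Returns the selected columns, largest absolute coefficient first."""
    n, p = len(X), len(X[0])
    beta, resid = [0.0] * p, y[:]
    scale = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + scale[j] * beta[j]
            new = math.copysign(max(abs(rho) - lam, 0.0), rho) / scale[j]  # soft threshold
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    chosen = [j for j in range(p) if beta[j] != 0.0]
    return sorted(chosen, key=lambda j: -abs(beta[j]))

def jackknife_predictions(X, y, cols):
    """Leave-one-out prediction of each y[i] from a model refitted without row i."""
    return [predict(ols(X[:i] + X[i + 1:], y[:i] + y[i + 1:], cols), X[i], cols)
            for i in range(len(y))]

def cv_criterion(w, P, y):
    """Sum of squared jackknife errors of the weighted average prediction."""
    return sum((yi - sum(wk * Pk[i] for wk, Pk in zip(w, P))) ** 2
               for i, yi in enumerate(y))

def simplex_weights(P, y, iters=300):
    """Stage 2: minimise cv_criterion over weights >= 0 that sum to 1.
    Frank-Wolfe with exact line search (an assumption; stands in for a QP solver)."""
    K, n = len(P), len(y)
    w = [1.0 / K] * K
    for _ in range(iters):
        fit = [sum(w[k] * P[k][i] for k in range(K)) for i in range(n)]
        r = [y[i] - fit[i] for i in range(n)]
        grad = [-2.0 * sum(r[i] * P[k][i] for i in range(n)) for k in range(K)]
        best = min(range(K), key=grad.__getitem__)
        d = [P[best][i] - fit[i] for i in range(n)]
        dd = sum(v * v for v in d)
        if dd == 0.0:
            break
        step = min(1.0, max(0.0, sum(a * b for a, b in zip(r, d)) / dd))
        w = [(1.0 - step) * wk for wk in w]
        w[best] += step
    return w

# Simulated data (assumption): 40 subjects, 30 predictors, only the first three matter.
random.seed(0)
n, p = 40, 30
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3 * r[0] - 2 * r[1] + 1.5 * r[2] + random.gauss(0, 0.5) for r in X]
means = [sum(r[j] for r in X) / n for j in range(p)]
X = [[r[j] - means[j] for j in range(p)] for r in X]  # centre predictors
y_bar = sum(y) / n
y = [v - y_bar for v in y]                             # centre response

order = lasso_order(X, y, lam=0.3)
models = [order[:k] for k in range(1, min(5, len(order)) + 1)]  # nested candidates
P = [jackknife_predictions(X, y, m) for m in models]
weights = simplex_weights(P, y)
fits = [ols(X, y, m) for m in models]

def averaged_prediction(row):
    """Model-averaged prediction for one (centred) predictor row."""
    return sum(w * predict(b, row, m) for w, b, m in zip(weights, fits, models))

print("candidate models:", models)
print("weights:", [round(w, 3) for w in weights])
```

Each candidate model is fitted by ordinary least squares on a handful of predictors, so both the fits and the weight optimization stay low-dimensional, which is the computational point made in feature (2) of the abstract. In practice a dedicated LASSO routine and a quadratic programming solver would replace the hand-written loops above.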