New perspectives in cross-validation

Appealing due to its universality, cross-validation is a ubiquitous tool for model tuning and selection. At its core, cross-validation splits the data (potentially several times) and alternately uses some of the data for fitting a model and the rest for testing it. This produces a reliable estimate of the risk, although many questions remain concerning how best to compare such estimates across different models. Despite its widespread use, many theoretical problems remain unanswered for cross-validation, particularly in high-dimensional regimes where bias issues are non-negligible.

We first provide an asymptotic analysis of the cross-validated risk in relation to the train-test split risk for a large class of estimators under stability conditions. This analysis takes the form of a central limit theorem and allows us to characterize the speed-up offered by cross-validation for general parametric M-estimators. In particular, we show that when the loss used for fitting differs from the loss used for evaluation, k-fold cross-validation may reduce the variance by a factor smaller (or larger) than k.

We then turn our attention to the high-dimensional regime, where the number of parameters is comparable to the number of observations. In this regime, k-fold cross-validation exhibits asymptotic bias, so increasing the number of folds is of interest. We study the extreme case of leave-one-out cross-validation and show that, for generalized linear models under smoothness conditions, it is a consistent estimate of the risk at the optimal rate. Given the large computational cost of leave-one-out cross-validation, we finally consider the problem of obtaining a fast approximate leave-one-out (ALO) estimator. We propose a general strategy for deriving ALO formulas for penalized generalized linear models and apply it to common estimators such as the LASSO, the SVM, and nuclear norm minimization. The performance of these approximations is evaluated on simulated and real datasets.
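
As a quick illustration of the core procedure described above, here is a minimal k-fold cross-validation sketch (not code from the thesis): the estimator is supplied through hypothetical fit, predict, and loss callables, and setting k equal to the sample size recovers leave-one-out cross-validation. Note that the loss used for evaluation need not be the loss the estimator minimizes when fitting, which is exactly the setting where the variance reduction can differ from a factor of k.

```python
import numpy as np

def k_fold_cv(X, y, fit, predict, loss, k=5, seed=None):
    """Estimate the risk by k-fold cross-validation.

    fit(X, y) -> model, predict(model, X) -> yhat, and loss(y, yhat) -> float
    are caller-supplied; k = len(y) gives leave-one-out cross-validation.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), k)   # random, near-equal folds
    fold_risks = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        model = fit(X[train_idx], y[train_idx])     # fit on the other k-1 folds
        yhat = predict(model, X[test_idx])          # predict on the held-out fold
        fold_risks.append(loss(y[test_idx], yhat))  # evaluation loss may differ from fitting loss
    return float(np.mean(fold_risks))

# Toy example: ordinary least squares fit, evaluated with squared loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(size=200)

ols_fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
ols_predict = lambda beta, X: X @ beta
sq_loss = lambda y, yhat: float(np.mean((y - yhat) ** 2))

print(k_fold_cv(X, y, ols_fit, ols_predict, sq_loss, k=5, seed=0))
```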

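The appeal of approximate leave-one-out (ALO) estimators is easiest to see in the one case where an exact shortcut is classical: ridge regression, where the leave-one-out residual equals the ordinary residual divided by one minus the leverage. The sketch below uses that standard identity (not the thesis's general ALO derivation) to compare the brute-force computation, which refits the model n times, with the single-fit shortcut; the two agree to machine precision.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients: solve (X'X + lam*I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def loo_mse_brute(X, y, lam):
    """Exact leave-one-out MSE by refitting n times (n ridge solves)."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.delete(np.arange(n), i)
        beta = ridge_fit(X[keep], y[keep], lam)
        errs[i] = y[i] - X[i] @ beta
    return float(np.mean(errs ** 2))

def loo_mse_shortcut(X, y, lam):
    """Same quantity from a single fit, via the leverage identity
    e_i^loo = (y_i - yhat_i) / (1 - H_ii) for the ridge smoother H."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)  # hat matrix
    resid = y - H @ y
    return float(np.mean((resid / (1.0 - np.diag(H))) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(size=100)
print(loo_mse_brute(X, y, lam=1.0), loo_mse_shortcut(X, y, lam=1.0))  # identical values
```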

Bibliographic Details
Main Author: Zhou, Wenda
Language: English
Published: 2020
Subjects: Statistics; Statistics--Methodology; Statistics--Models
Online Access:https://doi.org/10.7916/d8-3z39-7v31