New perspectives in cross-validation

Appealing due to its universality, cross-validation is a ubiquitous tool for model tuning and selection. At its core, cross-validation splits the data (potentially several times) and alternately uses some of the data for fitting a model and the rest for testing it. This produces a reliable estimate of the risk, although many questions remain concerning how best to compare such estimates across different models. Despite its widespread use, many theoretical problems remain unanswered for cross-validation, particularly in high-dimensional regimes where bias issues are non-negligible.

We first provide an asymptotic analysis of the cross-validated risk in relation to the train-test split risk for a large class of estimators under stability conditions. This analysis takes the form of a central limit theorem and allows us to characterize the speed-up offered by cross-validation for general parametric M-estimators. In particular, we show that when the loss used for fitting differs from the loss used for evaluation, k-fold cross-validation may reduce the variance by a factor smaller (or larger) than k.

We then turn our attention to the high-dimensional regime, where the number of parameters is comparable to the number of observations. In this regime, k-fold cross-validation exhibits asymptotic bias, so increasing the number of folds is of interest. We study the extreme case of leave-one-out cross-validation and show that, for generalized linear models under smoothness conditions, it is a consistent estimate of the risk at the optimal rate. Given the large computational cost of leave-one-out cross-validation, we finally consider the problem of obtaining a fast approximate leave-one-out (ALO) estimator. We propose a general strategy for deriving ALO formulas for penalized generalized linear models and apply it to common estimators such as the LASSO, the SVM, and nuclear norm minimization. The performance of these approximations is evaluated on simulated and real datasets.
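
As a quick illustration of the core procedure described above, here is a minimal k-fold cross-validation sketch (not code from the thesis): the estimator is supplied through hypothetical fit, predict, and loss callables, and setting k equal to the sample size recovers leave-one-out cross-validation. Note that the loss used for evaluation need not be the loss the estimator minimizes when fitting, which is exactly the setting where the variance reduction can differ from a factor of k.

```python
import numpy as np

def k_fold_cv(X, y, fit, predict, loss, k=5, seed=None):
    """Estimate the risk by k-fold cross-validation.

    fit(X, y) -> model, predict(model, X) -> yhat, and loss(y, yhat) -> float
    are caller-supplied; k = len(y) gives leave-one-out cross-validation.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), k)   # random, near-equal folds
    fold_risks = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        model = fit(X[train_idx], y[train_idx])     # fit on the other k-1 folds
        yhat = predict(model, X[test_idx])          # predict on the held-out fold
        fold_risks.append(loss(y[test_idx], yhat))  # evaluation loss may differ from fitting loss
    return float(np.mean(fold_risks))

# Toy example: ordinary least squares fit, evaluated with squared loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(size=200)

ols_fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
ols_predict = lambda beta, X: X @ beta
sq_loss = lambda y, yhat: float(np.mean((y - yhat) ** 2))

print(k_fold_cv(X, y, ols_fit, ols_predict, sq_loss, k=5, seed=0))
```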

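The appeal of approximate leave-one-out (ALO) estimators is easiest to see in the one case where an exact shortcut is classical: ridge regression, where the leave-one-out residual equals the ordinary residual divided by one minus the leverage. The sketch below uses that standard identity (not the thesis's general ALO derivation) to compare the brute-force computation, which refits the model n times, with the single-fit shortcut; the two agree to machine precision.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients: solve (X'X + lam*I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def loo_mse_brute(X, y, lam):
    """Exact leave-one-out MSE by refitting n times (n ridge solves)."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.delete(np.arange(n), i)
        beta = ridge_fit(X[keep], y[keep], lam)
        errs[i] = y[i] - X[i] @ beta
    return float(np.mean(errs ** 2))

def loo_mse_shortcut(X, y, lam):
    """Same quantity from a single fit, via the leverage identity
    e_i^loo = (y_i - yhat_i) / (1 - H_ii) for the ridge smoother H."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)  # hat matrix
    resid = y - H @ y
    return float(np.mean((resid / (1.0 - np.diag(H))) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(size=100)
print(loo_mse_brute(X, y, lam=1.0), loo_mse_shortcut(X, y, lam=1.0))  # identical values
```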

Bibliographic Details
Main Author: Zhou, Wenda
Language: English
Published: 2020
Subjects: Statistics; Statistics--Methodology; Statistics--Models
Online Access:https://doi.org/10.7916/d8-3z39-7v31