How much data are required to develop and validate a risk prediction model?

Bibliographic Details
Main Author: Taiyari, Khadijeh
Published: University College London (University of London) 2017
Subjects:
Online Access:https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.747107
Description
Summary: It has been suggested that when developing risk prediction models using regression, the number of events in the dataset should be at least 10 times the number of parameters being estimated by the model. This rule was originally proposed to ensure the unbiased estimation of regression coefficients with confidence intervals that have correct coverage. However, only limited research has been conducted to assess the adequacy of this rule with regard to predictive performance. Furthermore, there is only limited guidance regarding the number of events required to develop risk prediction models using hierarchical data, for example when one has observations from several hospitals. One of the aims of this dissertation is to determine the number of events required to obtain reliable predictions from standard or hierarchical models for binary outcomes. This will be achieved by conducting several simulation studies based on real clinical data. It has also been suggested that when validating risk prediction models, there should be at least 100 events in the validation dataset. However, few studies have examined the adequacy of this recommendation. Furthermore, there are no guidelines regarding the sample size requirements when validating a risk prediction model based on hierarchical data. The second main aim of this dissertation is to investigate the sample size requirements for model validation using both simulation and analytical methods. In particular, we will derive the relationship between sample size and the precision of some common measures of model performance, such as the C statistic, D statistic, and calibration slope. The results from this dissertation will enable researchers to better assess their sample size requirements when developing and validating prediction models using both standard (independent) and clustered data.
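
To make the validation question concrete, the short Python sketch below (not code from the dissertation) illustrates the kind of simulation the abstract describes: it assumes a simple three-predictor logistic data-generating model with illustrative coefficients, draws repeated validation samples at a few illustrative sizes, and records how the spread of two common performance measures, the C statistic and the calibration slope, narrows as the validation sample (and hence the number of events) grows. Every coefficient, sample size, and replicate count in the sketch is an assumption chosen for illustration only; the D statistic is omitted.

# Minimal sketch, assuming an illustrative logistic data-generating model.
# Monte Carlo check of how the precision of the C statistic and the
# calibration slope depends on the size of the validation sample.
import numpy as np
from sklearn.metrics import roc_auc_score
import statsmodels.api as sm

rng = np.random.default_rng(42)
beta = np.array([-1.0, 0.8, 0.5, -0.6])   # assumed "true" model: intercept + 3 predictors

def simulate_validation(n):
    """Draw one validation sample of size n from the assumed true model."""
    X = rng.normal(size=(n, 3))
    lp = beta[0] + X @ beta[1:]            # true linear predictor
    y = rng.binomial(1, 1 / (1 + np.exp(-lp)))
    return lp, y

def metrics(lp, y):
    """C statistic and calibration slope of the (true) linear predictor."""
    c_stat = roc_auc_score(y, lp)
    # Calibration slope: logistic regression of the outcome on the linear predictor.
    fit = sm.Logit(y, sm.add_constant(lp)).fit(disp=0)
    return c_stat, fit.params[1]

for n in (100, 500, 2000):                 # illustrative validation sample sizes
    reps = np.array([metrics(*simulate_validation(n)) for _ in range(500)])
    print(f"n={n:5d}  SE(C)={reps[:, 0].std():.3f}  "
          f"SE(calibration slope)={reps[:, 1].std():.3f}")

As is typical for such estimators, the Monte Carlo standard errors shrink roughly in proportion to 1/sqrt(n), which is the sort of sample size versus precision relationship the abstract says the dissertation derives analytically for these measures.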