On sample selection models and skew distributions

Bibliographic Details
Main Author: Ogundimu, Emmanuel O.
Published: University of Warwick 2013
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.582436
Description
Summary: This thesis is concerned with methods for dealing with missing data in nonrandom samples and recurrent events data. The first part of this thesis is motivated by scores arising from questionnaires, which often follow asymmetric distributions on a fixed range. This can be due to scores clustering at one end of the scale or to selective reporting. Sometimes, the scores are further subjected to sample selection, resulting in partial observability. Thus, methods based on complete cases for skew data are inadequate for the analysis of such data, and a general sample selection model is required. Heckman proposed a full maximum likelihood estimation method under the normality assumption for sample selection problems, and parametric and non-parametric extensions have been proposed. A general selection distribution for a vector $Y \in \mathbb{R}^p$ has a PDF $f_Y$ given by $f_Y(y) = f_{Y^\star}(y)\, \frac{P(S^\star \in C \mid Y^\star = y)}{P(S^\star \in C)}$, where $S^\star \in \mathbb{R}^q$ and $Y^\star \in \mathbb{R}^p$ are two random vectors, and $C$ is a measurable subset of $\mathbb{R}^q$. We use this generalization to develop a sample selection model with an underlying skew-normal distribution. A link is established between the continuous component of our model log-likelihood function and an extended version of a generalized skew-normal distribution. This link is used to derive the expected value of the model, which extends Heckman's two-step method. The general selection distribution is also used to establish the closed skew-normal distribution as the continuous component of the usual multilevel sample selection models. The finite-sample performance of the maximum likelihood estimators of the models is studied via Monte Carlo simulation. The model parameters are more precisely estimated under the new models, even in the presence of moderate to extreme skewness, than under the Heckman selection models.
An application to data from a study of neck injuries, where the responses are substantially skewed, successfully discriminates between selection and inherent skewness, and the multilevel model is used to analyze unit and item non-response jointly. We also discuss computational and identification issues, and provide an extension of the model using copula-based sample selection models with truncated marginals. The second part of this thesis is motivated by studies that seek to analyze processes that generate events repeatedly over time. We consider the number of events per subject within a specified study period as the primary outcome of interest. One considerable challenge in the analysis of this type of data is the large proportion of patients that might discontinue before the end of the study, leading to partially observed data. Sophisticated sensitivity analysis tools are therefore necessary for the analysis of such data. We propose the use of two frequentist-based imputation methods for dealing with missing data in the recurrent event data framework. The recurrent events are modeled as over-dispersed Poisson data, with a constant rate function. Different assumptions about the future behavior of dropouts, depending on the reasons for dropout and the treatment received, are made and evaluated in a simulation study. We illustrate our approach with a clinical trial in patients who suffer from bladder cancer.
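
The general selection density quoted in the summary can be made concrete in its simplest special case. The sketch below is not the thesis's skew-normal model. It assumes the classical Heckman setting: a bivariate normal pair (Y*, S*) with p = q = 1, unit variance for S*, and selection set C = (0, ∞). The function names and parameter names (mu, sigma, gamma, rho) are illustrative and do not come from the thesis. Under these assumptions the formula reduces to a normal density reweighted by a normal CDF, and the mean of the selected outcome is the familiar inverse Mills ratio correction used in Heckman's two-step method.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def selection_density(y, mu, sigma, gamma, rho):
    """Density of Y = (Y* | S* > 0) under an assumed bivariate normal model.

    Assumed setup (illustrative, not from the thesis):
    Y* ~ N(mu, sigma^2), S* ~ N(gamma, 1), corr(Y*, S*) = rho.
    Implements f_Y(y) = f_{Y*}(y) * P(S* > 0 | Y* = y) / P(S* > 0).
    """
    z = (y - mu) / sigma
    base = norm_pdf(z) / sigma  # f_{Y*}(y)
    # S* | Y* = y is normal with mean gamma + rho * z and variance 1 - rho^2
    p_sel_given_y = norm_cdf((gamma + rho * z) / math.sqrt(1.0 - rho * rho))
    p_sel = norm_cdf(gamma)  # P(S* > 0)
    return base * p_sel_given_y / p_sel


def selected_mean(mu, sigma, gamma, rho):
    """E[Y* | S* > 0]: the mean plus the inverse Mills ratio correction."""
    return mu + rho * sigma * norm_pdf(gamma) / norm_cdf(gamma)


def integrate(func, lo, hi, n=20000):
    """Composite trapezoid rule on [lo, hi] with n subintervals."""
    h = (hi - lo) / n
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, n):
        total += func(lo + i * h)
    return total * h


# Hypothetical parameter values chosen only for the check below.
MU, SIGMA, GAMMA, RHO = 1.0, 2.0, 0.3, 0.6
LO, HI = MU - 12.0 * SIGMA, MU + 12.0 * SIGMA

total_mass = integrate(
    lambda y: selection_density(y, MU, SIGMA, GAMMA, RHO), LO, HI
)
numeric_mean = integrate(
    lambda y: y * selection_density(y, MU, SIGMA, GAMMA, RHO), LO, HI
)
```

Here `total_mass` should be close to 1 and `numeric_mean` should agree with `selected_mean(MU, SIGMA, GAMMA, RHO)`, which is a quick numerical check that the reweighted density is a proper density and that the correction term is consistent with it.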