Summary: | High-dimensional data sets with a large number of explanatory variables are increasingly important in applications of regression analysis. It is well known that most traditional statistical techniques, such as Ordinary Least Squares (OLS) estimation, do not perform well with such data: the estimation problem is either ill-conditioned or undefined. Thus a need for regularization arises. Various regularization methods have been suggested in the literature; among the best known is the Partial Least Squares (PLS) regression method. The aim of this thesis is to consolidate and extend results in the literature to (a) show that PLS estimation can be regarded as estimation under a statistical model based on the so-called “Krylov hypothesis”, (b) introduce a derivation of the PLS estimator as an approximate maximum likelihood estimator under this model, and (c) propose an algorithm that modifies the PLS estimator to yield an exact maximum likelihood estimator under the same model. It will be shown that the constrained optimization problem in (c) can be recast as an unconstrained optimization problem on the Grassmann manifold. Two simulation studies consisting of a number of examples (using artificial data) in low dimensions will be presented. These allow a visual inspection of the Krylov maximum likelihood (KML) as it varies over the Grassmann manifolds, so that characteristics of the data for which KML can be expected to give better results than PLS can be identified. However, it was observed that this visual approach is practical only when the number of explanatory variables is small. As soon as the number of explanatory variables is moderate (say p = 10) or of the order of thousands, exploring how the different parameters affect the behaviour of the objective function is not straightforward.
The predictive ability of the OLS, PLS and Krylov Maximum Likelihood (KML) regression methods is explored on artificial data for which the sample size is larger than the number of explanatory variables, both with and without multicollinearity. Finally, the predictive ability of the PLS and KML regression methods is compared on two real-life high-dimensional data sets from the literature.
|