Summary: | The area of data imputation, which is the process of replacing missing data with substituted values, has been covered quite extensively in recent years. The literature on the practical impact of data imputation however, remains scarce. This thesis explores the impact of some of the state of the art data imputation methods on HCC survival prediction and classification in combination with data-level methods such as oversampling. More specifically, it explores imputation methods for mixed-type datasets and their impact on a particular HCC dataset. Previous research has shown that, the newer, more sophisticated imputation methods outperform simpler ones when evaluated with normalized root mean square error (NRMSE). Contrary to intuition however, the results of this study show that when combined with other data-level methods such as clustering and oversampling, the differences in imputation performance does not always impact classification in any meaningful way. This might be explained by the noise that is introduced when generating synthetic data points in the oversampling process. The results also show that one of the more sophisticated imputation methods, namely MICE, is highly dependent on prior assumptions about the underlying distributions of the dataset. When those assumptions are incorrect, the imputation method performs poorly and has a considerable negative impact on classification.
|