Summary: | Medicine and health are information-intensive domains whose data volumes have been
increasing constantly. To make full use of these data, the technique of
Knowledge Discovery in Databases (KDD) has been developed as a comprehensive pathway
for discovering valid and unsuspected patterns and trends that are both understandable and useful to data analysts.
This study investigated, for the first time, the entire KDD process of developing a classification model for cardiovascular disease (CVD) from a Canadian dataset. The data source was the Canadian Heart Health Database, which contains 265 easily collected variables and 23,129 instances from ten Canadian provinces. Many practical issues arising in different steps of the integrated process were addressed, and possible solutions were suggested based on the experimental results. Five specific learning schemes representing five distinct KDD approaches were employed, as they had never been compared with one another. In addition, two improvement approaches, cost-sensitive learning and ensemble learning, were examined. The performance of the developed models was measured from many aspects.
The dataset was prepared through data cleaning and missing value imputation. Three pairs of experiments demonstrated that dataset balancing and outlier removal had a positive influence on the classifier, whereas variable normalization was not helpful. Three combinations of subset generation method and evaluation function were tested in the variable
subset selection phase, and the combination of Best-First search and Correlation-based
Feature Selection showed comparable performance and was retained for its other benefits.
Among the five learning schemes investigated, the C4.5 decision tree achieved the best
performance in classifying CVD, followed by the Multilayer Feed-forward Network, K-Nearest Neighbor, Logistic Regression, and Naïve Bayes. Cost-sensitive learning, exemplified by the MetaCost algorithm, failed to outperform the single C4.5 decision tree when the cost matrix was varied from 5:1 to 1:7. In contrast, the models developed through ensemble modeling, especially the AdaBoost.M1 algorithm, outperformed the other models.
Although the model with the best performance might be suitable for CVD screening in
the general Canadian population, it is not ready for use in practice. I propose some criteria to improve the further evaluation of the model. Finally, I describe some limitations of the study and propose potential solutions to address them throughout the KDD process. Such possibilities should be explored in further research.
|