Summary: | Master's thesis === National Defense Medical Center === Graduate Institute of Public Health === 96 === Objective: This study investigated the performance of artificial intelligence methods and data mining techniques for classifying the survival of colorectal cancer patients. Three prediction models were compared: decision tree (DT), artificial neural network (ANN), and a traditional biostatistical method, logistic regression (LR). We compared the performance of the prediction models on cancer registry data and examined the prognostic factors of colonic and rectal cancers.
Methodology: The study samples were patients diagnosed with colorectal cancer in the USA cancer registry database (SEER) during 1988-2004 and in the Taiwan Cancer Registry Database (CRS) during 1979-2003. Patients were excluded if the diagnosis lacked pathological or histological confirmation, if they were diagnosed after 2002 (fewer than 5 years of follow-up), or if the cause of death was an accident or unknown. The SEER dataset was divided into four sets: 60131 cases of colonic cancer who died of colorectal cancer, 26255 cases of rectal cancer who died of colorectal cancer, 34767 cases of colonic cancer who died of other metastasis, and 14807 cases of rectal cancer who died of other metastasis. The CRS dataset was likewise split into four sets of 19901, 19119, 10922 and 9490 cases, respectively. Model performance was evaluated by accuracy (ACC), the area under the ROC curve (AUC), and specificity with sensitivity fixed at 95%. All measures were estimated by 10-fold cross-validation to obtain unbiased estimates.
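The evaluation measures named above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code used in the thesis: the function names (`accuracy`, `auc`, `specificity_at_sensitivity`, `k_fold_indices`) are hypothetical, labels are assumed to be coded 1 for the event and 0 otherwise, and a higher model score is assumed to mean a higher predicted probability of the event.

```python
import math
import random
from typing import Iterator, List, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of cases whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a random
    positive case scores higher than a random negative case (ties count 0.5).
    Quadratic in the number of cases, which is fine for a sketch."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def specificity_at_sensitivity(y_true: Sequence[int], scores: Sequence[float],
                               target: float = 0.95) -> float:
    """Specificity at the highest threshold whose sensitivity is >= target."""
    pos = sorted((s for t, s in zip(y_true, scores) if t == 1), reverse=True)
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Number of positives that must be flagged to reach the target sensitivity
    # (small epsilon guards against floating-point round-up).
    k = max(1, math.ceil(target * len(pos) - 1e-9))
    threshold = pos[k - 1]  # cases with score >= threshold are called positive
    return sum(n < threshold for n in neg) / len(neg)


def k_fold_indices(n: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]
```

In a 10-fold setup, each model would be fitted on the `train` indices and scored on the `test` indices of every fold, and the three measures would then be averaged over the ten folds.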
Results: 1. The prognostic factors for colorectal cancer patients who died of colorectal cancer were tumor extension, surgery type and age at diagnosis. The prognostic factors for colonic cancer patients who died of other metastasis were tumor extension, surgery type and the number of primaries. The prognostic factors for rectal cancer patients who died of other metastasis were the number of primaries, histology type and AJCC stage. This information could not be obtained from the CRS dataset because the number and content of its predictive variables were insufficient. 2. Models for colonic cancer predicted better than models for rectal cancer in the SEER data. 3. Reducing the number of prognostic factors had a similar influence on all three models (see page 200, appendix table 4 and appendix table 5). 4. The ANN and DT models substantially outperformed LR when the sample size was below 2000. 5. The prediction results indicated that ANN was the best model, DT the second and LR the worst of the three. 6. Models for patients who died of other metastasis achieved better accuracy than models for patients who died of colorectal cancer in the SEER data.
Conclusion: A sample size of at least 2000 is suggested. Sample size strongly influenced the LR model. The DT model could not perform without sufficient predictive variables. Models for patients who died of other metastasis achieved better accuracy than models for patients who died of colorectal cancer in the SEER data. Overall, ANN outperformed DT and LR.
Key words: data mining, colorectal cancer, decision tree, artificial neural network, logistic regression
|