Lung Cancer Survival Prediction using Ensemble Data Mining on Seer Data

We analyze the lung cancer data available from the SEER program with the aim of developing accurate survival prediction models for lung cancer. Carefully designed preprocessing steps resulted in removal/modification/splitting of several attributes, and 2 of the 11 derived attributes were found to ha...

Bibliographic Details
Main Authors: Ankit Agrawal, Sanchit Misra, Ramanathan Narayanan, Lalith Polepeddi, Alok Choudhary
Format: Article
Language: English
Published: Hindawi Limited 2012-01-01
Series: Scientific Programming
Online Access: http://dx.doi.org/10.3233/SPR-2012-0335
id doaj-6673e981fa5a42bcb5330b5c1a655023
record_format Article
spelling doaj-6673e981fa5a42bcb5330b5c1a655023; 2021-07-02T01:13:14Z; eng; Hindawi Limited; Scientific Programming; 1058-9244; 1875-919X; 2012-01-01; vol. 20, no. 1, pp. 29-42; 10.3233/SPR-2012-0335; Lung Cancer Survival Prediction using Ensemble Data Mining on Seer Data
Ankit Agrawal, Sanchit Misra, Ramanathan Narayanan, Lalith Polepeddi, Alok Choudhary (all authors: Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL, USA)
http://dx.doi.org/10.3233/SPR-2012-0335
collection DOAJ
language English
format Article
sources DOAJ
author Ankit Agrawal
Sanchit Misra
Ramanathan Narayanan
Lalith Polepeddi
Alok Choudhary
title Lung Cancer Survival Prediction using Ensemble Data Mining on Seer Data
publisher Hindawi Limited
series Scientific Programming
issn 1058-9244
1875-919X
publishDate 2012-01-01
description We analyze the lung cancer data available from the SEER program with the aim of developing accurate survival prediction models for lung cancer. Carefully designed preprocessing steps resulted in removal/modification/splitting of several attributes, and 2 of the 11 derived attributes were found to have significant predictive power. Several supervised classification methods were used on the preprocessed data along with various data mining optimizations and validations. In our experiments, ensemble voting of five decision-tree-based classifiers and meta-classifiers was found to result in the best prediction performance in terms of accuracy and area under the ROC curve. We have developed an on-line lung cancer outcome calculator for estimating the risk of mortality after 6 months, 9 months, 1 year, 2 years and 5 years of diagnosis, for which a smaller non-redundant subset of 13 attributes was carefully selected using attribute selection techniques, while trying to retain the predictive power of the original set of attributes. Further, ensemble voting models were also created for predicting conditional survival outcome for lung cancer (estimating risk of mortality after 5 years of diagnosis, given that the patient has already survived for a period of time), and included in the calculator. The on-line lung cancer outcome calculator developed as a result of this study is available at http://info.eecs.northwestern.edu:8080/LungCancerOutcomeCalculator/.
url http://dx.doi.org/10.3233/SPR-2012-0335