Lifelong Machine Learning and root cause analysis for large-scale cancer patient data

Abstract. Introduction: This paper presents a lifelong learning framework that continually adapts to changing data patterns over time through an incremental learning approach. In many big data systems, iteratively re-training on high-dimensional data from scratch is computationally infeasible since constant...

Bibliographic Details
Main Authors:	Gautam Pal, Xianbin Hong, Zhuo Wang, Hongyi Wu, Gangmin Li, Katie Atkinson
Format:	Article
Language:	English
Published:	SpringerOpen 2019-12-01
Series:	Journal of Big Data
Subjects:	Lifelong learning; Real-time data processing; Lambda Architecture; Streaming k-means; Random Decision Forest; Dimension reduction
Online Access:	https://doi.org/10.1186/s40537-019-0261-9

id	doaj-c10ab20aa48e471a98bc6374248789dd
record_format	Article
spelling	doaj-c10ab20aa48e471a98bc6374248789dd 2020-12-06T12:55:50Z eng SpringerOpen Journal of Big Data 2196-1115 2019-12-01 6 1 1 29 10.1186/s40537-019-0261-9 Lifelong Machine Learning and root cause analysis for large-scale cancer patient data Gautam Pal 0, Xianbin Hong 1, Zhuo Wang 2, Hongyi Wu 3, Gangmin Li 4, Katie Atkinson 5 Department of Computer Science; Research Institute of Big Data Analytics, Xian Jiaotong-Liverpool University; Research Institute of Big Data Analytics, Xian Jiaotong-Liverpool University; Research Institute of Big Data Analytics, Xian Jiaotong-Liverpool University; Research Institute of Big Data Analytics, Xian Jiaotong-Liverpool University; Department of Computer Science Abstract. Introduction: This paper presents a lifelong learning framework that continually adapts to changing data patterns over time through an incremental learning approach. In many big data systems, iteratively re-training on high-dimensional data from scratch is computationally infeasible, since constant data stream ingestion on top of a historical data pool increases the training time exponentially. The need therefore arises to retain past learning and to update the model quickly and incrementally from the new data. In addition, current machine learning approaches make predictions without providing a comprehensive root cause analysis. To resolve these limitations, our framework builds on an ensemble of stream data and historical batch data to form an incremental lifelong learning (LML) model. Case description: A cancer patient’s pathological tests, such as blood, DNA, urine, or tissue analysis, provide a unique signature based on the DNA combinations. Our analysis enables personalized and targeted medication to achieve a therapeutic response. The model is evaluated on data from the National Cancer Institute’s Genomic Data Commons unified data repository. The aim is to prescribe personalized medicine based on thousands of genotype and phenotype parameters for each patient. Discussion and evaluation: The model uses a dimension reduction method to reduce training time in an online sliding-window setting. We identify the Gleason score as a determining factor for cancer possibility and substantiate our claim with the Lilliefors and Kolmogorov–Smirnov tests. We present clustering and Random Decision Forest results. The model’s prediction accuracy is compared with standard machine learning algorithms for numeric and categorical fields. Conclusion: We propose an ensemble framework of stream and batch data for incremental lifelong learning. The framework first applies a streaming clustering technique and then a Random Decision Forest Regressor/Classifier to isolate anomalous patient data, and it provides reasoning through root cause analysis based on feature correlations, with the aim of improving the overall survival rate. While the stream clustering technique creates groups of patient profiles, the RDF drills down into each group for comparison and reasoning, yielding actionable insights. The proposed MALA architecture retains previously learned knowledge, transfers it to future learning, and iteratively becomes more knowledgeable over time. https://doi.org/10.1186/s40537-019-0261-9 Lifelong learning; Real-time data processing; Lambda Architecture; Streaming k-means; Random Decision Forest; Dimension reduction
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Gautam Pal, Xianbin Hong, Zhuo Wang, Hongyi Wu, Gangmin Li, Katie Atkinson
spellingShingle	Gautam Pal Xianbin Hong Zhuo Wang Hongyi Wu Gangmin Li Katie Atkinson Lifelong Machine Learning and root cause analysis for large-scale cancer patient data Journal of Big Data Lifelong learning Real-time data processing Lambda Architecture Streaming k-means Random Decision Forest Dimension reduction
author_facet	Gautam Pal, Xianbin Hong, Zhuo Wang, Hongyi Wu, Gangmin Li, Katie Atkinson
author_sort	Gautam Pal
title	Lifelong Machine Learning and root cause analysis for large-scale cancer patient data
title_short	Lifelong Machine Learning and root cause analysis for large-scale cancer patient data
title_full	Lifelong Machine Learning and root cause analysis for large-scale cancer patient data
title_fullStr	Lifelong Machine Learning and root cause analysis for large-scale cancer patient data
title_full_unstemmed	Lifelong Machine Learning and root cause analysis for large-scale cancer patient data
title_sort	lifelong machine learning and root cause analysis for large-scale cancer patient data
publisher	SpringerOpen
series	Journal of Big Data
issn	2196-1115
publishDate	2019-12-01
description	Abstract. Introduction: This paper presents a lifelong learning framework that continually adapts to changing data patterns over time through an incremental learning approach. In many big data systems, iteratively re-training on high-dimensional data from scratch is computationally infeasible, since constant data stream ingestion on top of a historical data pool increases the training time exponentially. The need therefore arises to retain past learning and to update the model quickly and incrementally from the new data. In addition, current machine learning approaches make predictions without providing a comprehensive root cause analysis. To resolve these limitations, our framework builds on an ensemble of stream data and historical batch data to form an incremental lifelong learning (LML) model. Case description: A cancer patient’s pathological tests, such as blood, DNA, urine, or tissue analysis, provide a unique signature based on the DNA combinations. Our analysis enables personalized and targeted medication to achieve a therapeutic response. The model is evaluated on data from the National Cancer Institute’s Genomic Data Commons unified data repository. The aim is to prescribe personalized medicine based on thousands of genotype and phenotype parameters for each patient. Discussion and evaluation: The model uses a dimension reduction method to reduce training time in an online sliding-window setting. We identify the Gleason score as a determining factor for cancer possibility and substantiate our claim with the Lilliefors and Kolmogorov–Smirnov tests. We present clustering and Random Decision Forest results. The model’s prediction accuracy is compared with standard machine learning algorithms for numeric and categorical fields. Conclusion: We propose an ensemble framework of stream and batch data for incremental lifelong learning. The framework first applies a streaming clustering technique and then a Random Decision Forest Regressor/Classifier to isolate anomalous patient data, and it provides reasoning through root cause analysis based on feature correlations, with the aim of improving the overall survival rate. While the stream clustering technique creates groups of patient profiles, the RDF drills down into each group for comparison and reasoning, yielding actionable insights. The proposed MALA architecture retains previously learned knowledge, transfers it to future learning, and iteratively becomes more knowledgeable over time.
topic	Lifelong learning; Real-time data processing; Lambda Architecture; Streaming k-means; Random Decision Forest; Dimension reduction
url	https://doi.org/10.1186/s40537-019-0261-9
work_keys_str_mv	AT gautampal lifelongmachinelearningandrootcauseanalysisforlargescalecancerpatientdata AT xianbinhong lifelongmachinelearningandrootcauseanalysisforlargescalecancerpatientdata AT zhuowang lifelongmachinelearningandrootcauseanalysisforlargescalecancerpatientdata AT hongyiwu lifelongmachinelearningandrootcauseanalysisforlargescalecancerpatientdata AT gangminli lifelongmachinelearningandrootcauseanalysisforlargescalecancerpatientdata AT katieatkinson lifelongmachinelearningandrootcauseanalysisforlargescalecancerpatientdata
_version_	1724398442320494592

Similar Items