Machine learning with The Cancer Genome Atlas head and neck squamous cell carcinoma dataset: improving usability by addressing inconsistency, sparsity, and high-dimensionality
Main Author: | |
---|---|
Other Authors: | |
Format: | Others |
Language: | English |
Published: | University of Iowa, 2019 |
Subjects: | |
Online Access: | https://ir.uiowa.edu/etd/6841 https://ir.uiowa.edu/cgi/viewcontent.cgi?article=8375&context=etd |
Summary: | In recent years, more data has become available for historical oncology case analysis. A large dataset describing over 500 patient cases of Head and Neck Squamous Cell Carcinoma is a potential goldmine for improving oncological decision support. Unfortunately, the best approaches for drawing useful inferences from it are unknown. With so much information, from DNA and RNA sequencing to clinical records, we must use computational learning to find associations and biomarkers.
The available data is sparse, inconsistent, and, for some data types, very large. We processed clinical records with an expert oncologist and used complex modeling methods to substitute (impute) data for cases missing treatment information. We then used machine learning algorithms to test whether the imputed data is useful for predicting patient survival. We saw no difference in the ability to predict patient survival with the imputed data, though imputed treatment variables were more important to the survival models.
To deal with the large number of features in RNA expression data, we used two approaches: using all the data on high-performance computers, and transforming the data into a smaller set of features (sparse principal components, or SPCs). We compared the performance of survival models trained on both datasets and saw no differences. However, the SPC models trained more quickly and also allowed us to pinpoint the biological processes each SPC is involved in, which can inform future biomarker discovery.
We also examined ten processed molecular features for their ability to predict survival and found some predictive power, though not enough to be clinically useful. |
---|
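The summary compares survival models but does not say which metric was used. One common choice for this kind of comparison is Harrell's concordance index, and the sketch below illustrates it under that assumption. The function name `concordance_index` and the toy patient values are hypothetical and are not taken from the thesis. Pairs with tied survival times are skipped for simplicity.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable patient pairs ranked correctly by risk.

    times       -- observed follow-up time for each patient
    events      -- 1 if death was observed, 0 if the case was censored
    risk_scores -- model output, higher meaning higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Skip tied times (a simplification) and pairs where the earlier
        # patient was censored, since their true ordering is unknown.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk assigned to the earlier death
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: four patients, the third one censored.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.2, 0.4, 0.1]
print(concordance_index(toy_times, toy_events, toy_risk))  # 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. Computing the same index for two models on the same held-out cases is one way to check a finding of "no difference" between them.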