Enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines
Abstract. Background: Data mining techniques such as support vector machines (SVMs) have been successfully used to predict outcomes for complex problems, including for human health. Much health data is imbalanced, with many more controls than positive cases. Methods: The impact of three balancing metho...
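Main Authors: | Alice M. Richardson, Brett A. Lidbury |
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2017-08-01 |
Series: | BMC Medical Informatics and Decision Making |
Subjects: | Analysis of variance, Hepatitis B, Hepatitis C, Machine learning, Random forests, Synthetic minority oversampling technique |
Online Access: | http://link.springer.com/article/10.1186/s12911-017-0522-5 |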
id |
doaj-6ff1a6157ee046c98ff826c1b72f4928 |
---|---|
record_format |
Article |
spelling |
doaj-6ff1a6157ee046c98ff826c1b72f4928; 2020-11-25T00:42:44Z; eng; BMC; BMC Medical Informatics and Decision Making; ISSN 1472-6947; 2017-08-01; volume 17, issue 1, pages 1-11; DOI 10.1186/s12911-017-0522-5; Enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines; Alice M. Richardson (Present address: National Centre for Epidemiology & Population Health, Australian National University); Brett A. Lidbury (Present address: National Centre for Epidemiology & Population Health, Australian National University); http://link.springer.com/article/10.1186/s12911-017-0522-5; Analysis of variance; Hepatitis B; Hepatitis C; Machine learning; Random forests; Synthetic minority oversampling technique |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Alice M. Richardson Brett A. Lidbury |
author_sort |
Alice M. Richardson |
title |
Enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines |
title_sort |
enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines |
publisher |
BMC |
series |
BMC Medical Informatics and Decision Making |
issn |
1472-6947 |
publishDate |
2017-08-01 |
description |
Abstract. Background: Data mining techniques such as support vector machines (SVMs) have been successfully used to predict outcomes for complex problems, including for human health. Much health data is imbalanced, with many more controls than positive cases.
Methods: The impact of three balancing methods and one feature selection method is explored, to assess the ability of SVMs to classify imbalanced diagnostic pathology data associated with the laboratory diagnosis of hepatitis B (HBV) and hepatitis C (HCV) infections. Random forests (RFs) for predictor variable selection, and data reshaping to overcome a large imbalance of negative to positive test results in relation to HBV and HCV immunoassay results, are examined. The methodology is illustrated using data from ACT Pathology (Canberra, Australia), consisting of laboratory test records from 18,625 individuals who underwent hepatitis virus testing over the decade from 1997 to 2007.
Results: Overall, the prediction of HCV test results by immunoassay was more accurate than for HBV immunoassay results associated with identical routine pathology predictor variable data. HBV and HCV negative results were vastly in excess of positive results, so three approaches to handling the negative/positive data imbalance were compared. Generating datasets by the Synthetic Minority Oversampling Technique (SMOTE) resulted in significantly more accurate prediction than single downsizing or multiple downsizing (MDS) of the dataset. For downsized datasets, applying an RF for predictor variable selection had a small effect on the performance, which varied depending on the virus. For SMOTE, an RF had a negative effect on performance. An analysis of variance of the performance across settings supports these findings. Finally, age and assay results for alanine aminotransferase (ALT), sodium for HBV and urea for HCV were found to have a significant impact upon laboratory diagnosis of HBV or HCV infection using an optimised SVM model.
Conclusions: Laboratories looking to include machine learning via SVM as part of their decision support need to be aware that the balancing method, predictor variable selection and the virus type interact to affect the laboratory diagnosis of hepatitis virus infection with routine pathology laboratory variables in different ways depending on which combination is being studied. This awareness should lead to careful use of existing machine learning methods, thus improving the quality of laboratory diagnosis. |
topic |
Analysis of variance Hepatitis B Hepatitis C Machine learning Random forests Synthetic minority oversampling technique |
url |
http://link.springer.com/article/10.1186/s12911-017-0522-5 |
work_keys_str_mv |
AT alicemrichardson enhancementofhepatitisvirusimmunoassayoutcomepredictionsinimbalancedroutinepathologydatabydatabalancingandfeatureselectionbeforetheapplicationofsupportvectormachines AT brettalidbury enhancementofhepatitisvirusimmunoassayoutcomepredictionsinimbalancedroutinepathologydatabydatabalancingandfeatureselectionbeforetheapplicationofsupportvectormachines |
_version_ |
1725280622252916736 |
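
Illustrative note: the abstract in this record contrasts downsizing the negative class with generating synthetic positives via SMOTE. The following is a minimal sketch of those two balancing ideas in plain Python, not the authors' implementation. The function names (`downsize`, `smote`), the neighbour count `k=3`, and the toy two-feature records are all assumptions made for illustration; none of the values come from the ACT Pathology data.

```python
import random


def downsize(majority, minority, rng):
    """Single downsizing: keep a random majority subset as large as the minority."""
    return rng.sample(majority, len(minority)), list(minority)


def smote(minority, n_new, k, rng):
    """Create n_new synthetic minority rows by interpolating towards a near neighbour."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to the base row.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


rng = random.Random(0)
# Hypothetical toy records with two features each (stand-ins for, e.g., age and ALT).
negatives = [[rng.uniform(20, 80), rng.uniform(10, 40)] for _ in range(50)]
positives = [[rng.uniform(30, 60), rng.uniform(40, 120)] for _ in range(5)]

# Option 1: shrink the negative class to the size of the positive class.
down_neg, down_pos = downsize(negatives, positives, rng)

# Option 2: grow the positive class with synthetic rows until the classes match.
extra_pos = smote(positives, n_new=len(negatives) - len(positives), k=3, rng=rng)
balanced_pos = positives + extra_pos
```

Multiple downsizing, as named in the abstract, could be approximated under the same assumptions by calling `downsize` repeatedly with different random draws and training one classifier per draw. In practice a maintained library implementation of SMOTE would normally be preferred over a hand-written one.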