Enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines

Abstract. Background: Data mining techniques such as support vector machines (SVMs) have been successfully used to predict outcomes for complex problems, including for human health. Much health data is imbalanced, with many more controls than positive cases. Methods: The impact of three balancing metho...

Bibliographic Details
Main Authors:	Alice M. Richardson, Brett A. Lidbury
Format:	Article
Language:	English
Published:	BMC 2017-08-01
Series:	BMC Medical Informatics and Decision Making
Subjects:	Analysis of variance Hepatitis B Hepatitis C Machine learning Random forests Synthetic minority oversampling technique
Online Access:	http://link.springer.com/article/10.1186/s12911-017-0522-5

id	doaj-6ff1a6157ee046c98ff826c1b72f4928
record_format	Article
spelling	doaj-6ff1a6157ee046c98ff826c1b72f4928 2020-11-25T00:42:44Z eng BMC BMC Medical Informatics and Decision Making 1472-6947 2017-08-01 17 1 1 11 10.1186/s12911-017-0522-5 Alice M. Richardson [0] Brett A. Lidbury [1] [0] Present address: National Centre for Epidemiology & Population Health, Australian National University [1] Present address: National Centre for Epidemiology & Population Health, Australian National University
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Alice M. Richardson Brett A. Lidbury
author_facet	Alice M. Richardson Brett A. Lidbury
author_sort	Alice M. Richardson
title	Enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines
publisher	BMC
series	BMC Medical Informatics and Decision Making
issn	1472-6947
publishDate	2017-08-01
description	Background: Data mining techniques such as support vector machines (SVMs) have been successfully used to predict outcomes for complex problems, including for human health. Much health data is imbalanced, with many more controls than positive cases. Methods: The impact of three balancing methods and one feature selection method is explored, to assess the ability of SVMs to classify imbalanced diagnostic pathology data associated with the laboratory diagnosis of hepatitis B (HBV) and hepatitis C (HCV) infections. Random forests (RFs) for predictor variable selection, and data reshaping to overcome a large imbalance of negative to positive test results in relation to HBV and HCV immunoassay results, are examined. The methodology is illustrated using data from ACT Pathology (Canberra, Australia), consisting of laboratory test records from 18,625 individuals who underwent hepatitis virus testing over the decade from 1997 to 2007. Results: Overall, the prediction of HCV test results by immunoassay was more accurate than for HBV immunoassay results associated with identical routine pathology predictor variable data. HBV and HCV negative results were vastly in excess of positive results, so three approaches to handling the negative/positive data imbalance were compared. Generating datasets by the Synthetic Minority Oversampling Technique (SMOTE) resulted in significantly more accurate prediction than single downsizing or multiple downsizing (MDS) of the dataset. For downsized data sets, applying a RF for predictor variable selection had a small effect on the performance, which varied depending on the virus. For SMOTE, a RF had a negative effect on performance. An analysis of variance of the performance across settings supports these findings. Finally, age and assay results for alanine aminotransferase (ALT), sodium for HBV and urea for HCV were found to have a significant impact upon laboratory diagnosis of HBV or HCV infection using an optimised SVM model. Conclusions: Laboratories looking to include machine learning via SVM as part of their decision support need to be aware that the balancing method, predictor variable selection and the virus type interact to affect the laboratory diagnosis of hepatitis virus infection with routine pathology laboratory variables in different ways depending on which combination is being studied. This awareness should lead to careful use of existing machine learning methods, thus improving the quality of laboratory diagnosis.
topic	Analysis of variance Hepatitis B Hepatitis C Machine learning Random forests Synthetic minority oversampling technique
url	http://link.springer.com/article/10.1186/s12911-017-0522-5
_version_	1725280622252916736
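
The abstract in this record compares three ways of balancing negative and positive results before fitting an SVM: single downsizing, multiple downsizing (MDS) and SMOTE. The sketch below illustrates the general idea of each on toy data. It is not the authors' implementation: the function names, the two-feature Gaussian toy data, the choice of k and the reading of MDS as "several independent downsized sets" are all assumptions made for illustration, and the SVM and random forest steps are omitted.

```python
import math
import random


def downsize(majority, minority, rng):
    """Single downsizing: keep a random majority subset as large as the minority class."""
    return rng.sample(majority, len(minority)), list(minority)


def multiple_downsize(majority, minority, n_sets, rng):
    """Assumed reading of MDS: draw several independent downsized data sets."""
    return [downsize(majority, minority, rng) for _ in range(n_sets)]


def smote(minority, n_new, k, rng):
    """SMOTE-style oversampling: interpolate between a minority point and one of
    its k nearest minority neighbours to create synthetic minority points."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Nearest minority neighbours of x by Euclidean distance, excluding x itself.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Hypothetical toy data: 200 negative and 10 positive records with two features.
rng = random.Random(0)
negatives = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
positives = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(10)]

neg_sample, pos_sample = downsize(negatives, positives, rng)
new_pos = smote(positives, len(negatives) - len(positives), k=5, rng=rng)
print(len(neg_sample), len(pos_sample), len(positives) + len(new_pos))  # prints: 10 10 200
```

Downsizing discards most of the negative records, whereas the SMOTE-style step keeps all of them and adds synthetic positives; either balanced set could then be passed to a classifier of the reader's choice.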

Similar Items