A comparison of machine learning algorithms for chemical toxicity classification using a simulated multi-scale data model

Abstract. Background: Bioactivity profiling using high-throughput in vitro assays can reduce the cost and time required for toxicological screening of environmental chemicals and can also reduce the need for animal testing. Several pu...

Bibliographic Details
Main Authors:	Li Zhen, Setzer R Woodrow, Elloumi Fathi, Judson Richard, Shah Imran
Format:	Article
Language:	English
Published:	BMC 2008-05-01
Series:	BMC Bioinformatics
Online Access:	http://www.biomedcentral.com/1471-2105/9/241

id	doaj-1cc864f1cc574010b61fc3476ba90fee
record_format	Article
spelling	doaj-1cc864f1cc574010b61fc3476ba90fee; 2020-11-25T00:24:59Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2008-05-01; 9; 1; 241; 10.1186/1471-2105-9-241; A comparison of machine learning algorithms for chemical toxicity classification using a simulated multi-scale data model; Li Zhen; Setzer R Woodrow; Elloumi Fathi; Judson Richard; Shah Imran; http://www.biomedcentral.com/1471-2105/9/241
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Li Zhen, Setzer R Woodrow, Elloumi Fathi, Judson Richard, Shah Imran
spellingShingle	Li Zhen; Setzer R Woodrow; Elloumi Fathi; Judson Richard; Shah Imran; A comparison of machine learning algorithms for chemical toxicity classification using a simulated multi-scale data model; BMC Bioinformatics
author_facet	Li Zhen, Setzer R Woodrow, Elloumi Fathi, Judson Richard, Shah Imran
author_sort	Li Zhen
title	A comparison of machine learning algorithms for chemical toxicity classification using a simulated multi-scale data model
title_sort	comparison of machine learning algorithms for chemical toxicity classification using a simulated multi-scale data model
publisher	BMC
series	BMC Bioinformatics
issn	1471-2105
publishDate	2008-05-01
description	Abstract. Background: Bioactivity profiling using high-throughput in vitro assays can reduce the cost and time required for toxicological screening of environmental chemicals and can also reduce the need for animal testing. Several public efforts are aimed at discovering patterns or classifiers in high-dimensional bioactivity space that predict tissue, organ or whole animal toxicological endpoints. Supervised machine learning is a powerful approach to discover combinatorial relationships in complex in vitro/in vivo datasets. We present a novel model to simulate complex chemical-toxicology data sets and use this model to evaluate the relative performance of different machine learning (ML) methods. Results: The classification performance of Artificial Neural Networks (ANN), K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Naïve Bayes (NB), Recursive Partitioning and Regression Trees (RPART), and Support Vector Machines (SVM) in the presence and absence of filter-based feature selection was analyzed using K-way cross-validation testing and independent validation on simulated in vitro assay data sets with varying levels of model complexity, number of irrelevant features and measurement noise. While the prediction accuracy of all ML methods decreased as non-causal (irrelevant) features were added, some ML methods performed better than others. In the limit of using a large number of features, ANN and SVM were always in the top performing set of methods while RPART and KNN (k = 5) were always in the poorest performing set. The addition of measurement noise and irrelevant features decreased the classification accuracy of all ML methods, with LDA suffering the greatest performance degradation. LDA performance is especially sensitive to the use of feature selection. Filter-based feature selection generally improved performance, most strikingly for LDA. Conclusion: We have developed a novel simulation model to evaluate machine learning methods for the analysis of data sets in which in vitro bioassay data is being used to predict in vivo chemical toxicology. From our analysis, we can recommend that several ML methods, most notably SVM and ANN, are good candidates for use in real world applications in this area.
url	http://www.biomedcentral.com/1471-2105/9/241

Similar Items