The influence of the negative-positive ratio and screening database size on the performance of machine learning-based virtual screening.

The machine learning-based virtual screening of molecular databases is a commonly used approach to identify hits. However, many aspects associated with training predictive models can influence the final performance and, consequently, the number of hits found. Thus, we performed a systematic study of...

Bibliographic Details
Main Authors:	Rafał Kurczab, Andrzej J Bojarski
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2017-01-01
Series:	PLoS ONE
Online Access:	http://europepmc.org/articles/PMC5383296?pdf=render

id	doaj-b34fb8ebbb2c4f6da2962f9eb52238eb
record_format	Article
spelling	doaj-b34fb8ebbb2c4f6da2962f9eb52238eb | 2020-11-24T21:52:05Z | eng | Public Library of Science (PLoS) | PLoS ONE | ISSN 1932-6203 | 2017-01-01 | vol. 12, no. 4, e0175410 | doi:10.1371/journal.pone.0175410 | Rafał Kurczab; Andrzej J Bojarski | http://europepmc.org/articles/PMC5383296?pdf=render
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Rafał Kurczab Andrzej J Bojarski
title	The influence of the negative-positive ratio and screening database size on the performance of machine learning-based virtual screening.
publisher	Public Library of Science (PLoS)
series	PLoS ONE
issn	1932-6203
publishDate	2017-01-01
description	The machine learning-based virtual screening of molecular databases is a commonly used approach to identify hits. However, many aspects associated with training predictive models can influence the final performance and, consequently, the number of hits found. Thus, we performed a systematic study of the simultaneous influence of the proportion of negatives to positives in the testing set, the size of screening databases and the type of molecular representations on the effectiveness of classification. The results obtained for eight protein targets, five machine learning algorithms (SMO, Naïve Bayes, Ibk, J48 and Random Forest), two types of molecular fingerprints (MACCS and CDK FP) and eight screening databases with different numbers of molecules confirmed our previous findings that increases in the ratio of negative to positive training instances greatly influenced most of the investigated parameters of the ML methods in simulated virtual screening experiments. However, the performance of screening was shown to also be highly dependent on the molecular library dimension. Generally, with the increasing size of the screened database, the optimal training ratio also increased, and this ratio can be rationalized using the proposed cost-effectiveness threshold approach. To increase the performance of machine learning-based virtual screening, the training set should be constructed in a way that considers the size of the screening database.
url	http://europepmc.org/articles/PMC5383296?pdf=render

Similar Items