Scalable estimator of the diversity for de novo molecular generation resulting in a more robust QM dataset (OD9) and a more efficient molecular optimization

Abstract: Chemical diversity is one of the key terms when dealing with machine learning and molecular generation. This is particularly true for quantum chemical datasets, which should be composed meticulously since the calculations are highly time demanding. Previously we have seen that t...

Bibliographic Details
Main Authors: Jules Leguy, Marta Glavatskikh, Thomas Cauchy, Benoit Da Mota
Format: Article
Language: English
Published: BMC 2021-10-01
Series: Journal of Cheminformatics
Subjects: Chemical space exploration; Organic chemistry; Quantum chemistry dataset
Online Access: https://doi.org/10.1186/s13321-021-00554-8
id doaj-5ee95c78f5cb4002b6b284051ae3f53b
record_format Article
spelling doaj-5ee95c78f5cb4002b6b284051ae3f53b
2021-10-03T11:48:17Z
eng
BMC
Journal of Cheminformatics
1758-2946
2021-10-01
13
1
1
17
10.1186/s13321-021-00554-8
Scalable estimator of the diversity for de novo molecular generation resulting in a more robust QM dataset (OD9) and a more efficient molecular optimization
Jules Leguy (0)
Marta Glavatskikh (1)
Thomas Cauchy (2)
Benoit Da Mota (3)
(0) Univ Angers, LERIA, SFR MATHSTIC
(1) Univ Angers, LERIA, SFR MATHSTIC
(2) Univ Angers, CNRS, MOLTECH-ANJOU, SFR MATRIX
(3) Univ Angers, LERIA, SFR MATHSTIC
Abstract: Chemical diversity is one of the key terms when dealing with machine learning and molecular generation. This is particularly true for quantum chemical datasets, which should be composed meticulously since the calculations are highly time demanding. Previously we have seen that the best-known quantum chemical dataset, QM9, lacks chemical diversity. As a consequence, ML models trained on QM9 showed generalizability shortcomings. In this paper we present (i) a fast and generic method to evaluate chemical diversity, (ii) a new quantum chemical dataset of 435k molecules, OD9, that includes QM9 and new molecules generated with a diversity objective, and (iii) an analysis of the impact of diversity on unconstrained and goal-directed molecular generation, using QED optimization as an example. Our innovative approach makes it possible to estimate individually the contribution of a solution to the diversity of a set, allowing for effective incremental evaluation. In the first application, we will see how the diversity constraint allows us to generate more than a million molecules that would efficiently complete the reference datasets. The compounds were calculated with DFT thanks to a collaborative effort through the QuChemPedIA@home BOINC project. With regard to goal-directed molecular generation, getting a high QED score is not complicated, but adding a little diversity can cut the number of calls to the evaluation function by a factor of ten.
https://doi.org/10.1186/s13321-021-00554-8
Chemical space exploration
Organic chemistry
Quantum chemistry dataset
collection DOAJ
language English
format Article
sources DOAJ
author Jules Leguy
Marta Glavatskikh
Thomas Cauchy
Benoit Da Mota
title Scalable estimator of the diversity for de novo molecular generation resulting in a more robust QM dataset (OD9) and a more efficient molecular optimization
publisher BMC
series Journal of Cheminformatics
issn 1758-2946
publishDate 2021-10-01
description Abstract: Chemical diversity is one of the key terms when dealing with machine learning and molecular generation. This is particularly true for quantum chemical datasets, which should be composed meticulously since the calculations are highly time demanding. Previously we have seen that the best-known quantum chemical dataset, QM9, lacks chemical diversity. As a consequence, ML models trained on QM9 showed generalizability shortcomings. In this paper we present (i) a fast and generic method to evaluate chemical diversity, (ii) a new quantum chemical dataset of 435k molecules, OD9, that includes QM9 and new molecules generated with a diversity objective, and (iii) an analysis of the impact of diversity on unconstrained and goal-directed molecular generation, using QED optimization as an example. Our innovative approach makes it possible to estimate individually the contribution of a solution to the diversity of a set, allowing for effective incremental evaluation. In the first application, we will see how the diversity constraint allows us to generate more than a million molecules that would efficiently complete the reference datasets. The compounds were calculated with DFT thanks to a collaborative effort through the QuChemPedIA@home BOINC project. With regard to goal-directed molecular generation, getting a high QED score is not complicated, but adding a little diversity can cut the number of calls to the evaluation function by a factor of ten.
topic Chemical space exploration
Organic chemistry
Quantum chemistry dataset
url https://doi.org/10.1186/s13321-021-00554-8