Scalable estimator of the diversity for de novo molecular generation resulting in a more robust QM dataset (OD9) and a more efficient molecular optimization
Abstract Chemical diversity is one of the key terms when dealing with machine learning and molecular generation. This is particularly true for quantum chemical datasets, whose composition should be chosen meticulously because the calculations are highly time-consuming. Previously we have seen that t...
Main Authors: | Jules Leguy, Marta Glavatskikh, Thomas Cauchy, Benoit Da Mota |
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2021-10-01 |
Series: | Journal of Cheminformatics |
Subjects: | Chemical space exploration, Organic chemistry, Quantum chemistry dataset |
Online Access: | https://doi.org/10.1186/s13321-021-00554-8 |
id |
doaj-5ee95c78f5cb4002b6b284051ae3f53b |
---|---|
record_format |
Article |
spelling |
doaj-5ee95c78f5cb4002b6b284051ae3f53b; 2021-10-03T11:48:17Z; eng; BMC; Journal of Cheminformatics; 1758-2946; 2021-10-01; vol. 13, no. 1, pp. 1-17; 10.1186/s13321-021-00554-8; Scalable estimator of the diversity for de novo molecular generation resulting in a more robust QM dataset (OD9) and a more efficient molecular optimization; Jules Leguy (0), Marta Glavatskikh (1), Thomas Cauchy (2), Benoit Da Mota (3); (0) Univ Angers, LERIA, SFR MATHSTIC; (1) Univ Angers, LERIA, SFR MATHSTIC; (2) Univ Angers, CNRS, MOLTECH-ANJOU, SFR MATRIX; (3) Univ Angers, LERIA, SFR MATHSTIC; Abstract Chemical diversity is one of the key terms when dealing with machine learning and molecular generation. This is particularly true for quantum chemical datasets, whose composition should be chosen meticulously because the calculations are highly time-consuming. We have previously seen that QM9, the best-known quantum chemical dataset, lacks chemical diversity. As a consequence, ML models trained on QM9 showed generalizability shortcomings. In this paper we present (i) a fast and generic method to evaluate chemical diversity, (ii) a new quantum chemical dataset of 435k molecules, OD9, that includes QM9 and new molecules generated with a diversity objective, and (iii) an analysis of the impact of diversity on unconstrained and goal-directed molecular generation, using QED optimization as an example. Our approach makes it possible to estimate the individual contribution of a solution to the diversity of a set, allowing for efficient incremental evaluation. In the first application, we show how the diversity constraint allows us to generate more than a million molecules that would efficiently complete the reference datasets. The compounds were calculated with DFT thanks to a collaborative effort through the QuChemPedIA@home BOINC project. With regard to goal-directed molecular generation, getting a high QED score is not complicated, but adding a little diversity can cut the number of calls to the evaluation function by a factor of ten.; https://doi.org/10.1186/s13321-021-00554-8; Chemical space exploration; Organic chemistry; Quantum chemistry dataset |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Jules Leguy, Marta Glavatskikh, Thomas Cauchy, Benoit Da Mota |
author_sort |
Jules Leguy |
title |
Scalable estimator of the diversity for de novo molecular generation resulting in a more robust QM dataset (OD9) and a more efficient molecular optimization |
publisher |
BMC |
series |
Journal of Cheminformatics |
issn |
1758-2946 |
publishDate |
2021-10-01 |
description |
Abstract Chemical diversity is one of the key terms when dealing with machine learning and molecular generation. This is particularly true for quantum chemical datasets, whose composition should be chosen meticulously because the calculations are highly time-consuming. We have previously seen that QM9, the best-known quantum chemical dataset, lacks chemical diversity. As a consequence, ML models trained on QM9 showed generalizability shortcomings. In this paper we present (i) a fast and generic method to evaluate chemical diversity, (ii) a new quantum chemical dataset of 435k molecules, OD9, that includes QM9 and new molecules generated with a diversity objective, and (iii) an analysis of the impact of diversity on unconstrained and goal-directed molecular generation, using QED optimization as an example. Our approach makes it possible to estimate the individual contribution of a solution to the diversity of a set, allowing for efficient incremental evaluation. In the first application, we show how the diversity constraint allows us to generate more than a million molecules that would efficiently complete the reference datasets. The compounds were calculated with DFT thanks to a collaborative effort through the QuChemPedIA@home BOINC project. With regard to goal-directed molecular generation, getting a high QED score is not complicated, but adding a little diversity can cut the number of calls to the evaluation function by a factor of ten. |
topic |
Chemical space exploration; Organic chemistry; Quantum chemistry dataset |
url |
https://doi.org/10.1186/s13321-021-00554-8 |