Dataset’s chemical diversity limits the generalizability of machine learning predictions

Abstract: The QM9 dataset has become the gold standard for machine learning (ML) predictions of various chemical properties. QM9 is based on the GDB, which is a combinatorial exploration of the chemical space. ML molecular predictions have recently been published with an accuracy on par with Densit...


Bibliographic Details
Main Authors: Marta Glavatskikh, Jules Leguy, Gilles Hunault, Thomas Cauchy, Benoit Da Mota
Format: Article
Language: English
Published: BMC 2019-11-01
Series: Journal of Cheminformatics
Subjects:
QM9
PC9
DFT
Online Access: http://link.springer.com/article/10.1186/s13321-019-0391-2
id doaj-171f98117f0f45a491c571af55c928bd
record_format Article
spelling doaj-171f98117f0f45a491c571af55c928bd
2020-11-25T04:01:35Z
eng
BMC
Journal of Cheminformatics
ISSN 1758-2946
2019-11-01
Volume 11, issue 1, pages 1-15
DOI 10.1186/s13321-019-0391-2
Dataset’s chemical diversity limits the generalizability of machine learning predictions
Marta Glavatskikh (LERIA, University of Angers)
Jules Leguy (LERIA, University of Angers)
Gilles Hunault (LERIA, University of Angers)
Thomas Cauchy (Laboratoire MOLTECH-Anjou, UMR CNRS 6200, SFR MATRIX, UNIV Angers)
Benoit Da Mota (LERIA, University of Angers)
http://link.springer.com/article/10.1186/s13321-019-0391-2
Molecular chemistry
SchNet
QM9
PC9
DFT
collection DOAJ
language English
format Article
sources DOAJ
author Marta Glavatskikh
Jules Leguy
Gilles Hunault
Thomas Cauchy
Benoit Da Mota
author_sort Marta Glavatskikh
title Dataset’s chemical diversity limits the generalizability of machine learning predictions
publisher BMC
series Journal of Cheminformatics
issn 1758-2946
publishDate 2019-11-01
description Abstract: The QM9 dataset has become the gold standard for machine learning (ML) predictions of various chemical properties. QM9 is based on the GDB, which is a combinatorial exploration of the chemical space. ML molecular predictions have recently been published with an accuracy on par with Density Functional Theory calculations. Such ML models need to be tested and generalized on real data. PC9, a new QM9-equivalent dataset (only H, C, N, O and F, and up to 9 “heavy” atoms) drawn from the PubChemQC project, is presented in this article. A statistical study of bonding distances and chemical functions shows that this new dataset encompasses more chemical diversity. Kernel Ridge Regression, Elastic Net and the neural network model provided by SchNet have been used on both datasets. The overall accuracy in energy prediction is higher for the QM9 subset. However, a model trained on PC9 shows a stronger ability to predict energies of the other dataset.
topic Molecular chemistry
SchNet
QM9
PC9
DFT
url http://link.springer.com/article/10.1186/s13321-019-0391-2