A new semi-automated workflow for chemical data retrieval and quality checking for modeling applications

Abstract The quality of data used for QSAR model derivation is extremely important as it strongly affects the final robustness and predictive power of the model. Ambiguous or wrong structures need to be carefully checked, because they lead to errors in calculation of descriptors, hence leading to me...

Bibliographic Details
Main Authors: Domenico Gadaleta, Anna Lombardo, Cosimo Toma, Emilio Benfenati
Format: Article
Language: English
Published: BMC 2018-12-01
Series: Journal of Cheminformatics
Subjects: QSAR, Data curation, Data cleaning, Semi-automated, Workflow
Online Access: http://link.springer.com/article/10.1186/s13321-018-0315-6
id doaj-94d1c313162c4387a83d36b239a22c9f
record_format Article
spelling doaj-94d1c313162c4387a83d36b239a22c9f; 2020-11-25T01:19:17Z; eng; BMC; Journal of Cheminformatics; 1758-2946; 2018-12-01; 10; 1; 1; 13; 10.1186/s13321-018-0315-6; A new semi-automated workflow for chemical data retrieval and quality checking for modeling applications; Domenico Gadaleta (0); Anna Lombardo (1); Cosimo Toma (2); Emilio Benfenati (3); Laboratory of Environmental Chemistry and Toxicology, Department of Environmental Health Sciences, Istituto di Ricerche Farmacologiche Mario Negri IRCCS; Laboratory of Environmental Chemistry and Toxicology, Department of Environmental Health Sciences, Istituto di Ricerche Farmacologiche Mario Negri IRCCS; Laboratory of Environmental Chemistry and Toxicology, Department of Environmental Health Sciences, Istituto di Ricerche Farmacologiche Mario Negri IRCCS; Laboratory of Environmental Chemistry and Toxicology, Department of Environmental Health Sciences, Istituto di Ricerche Farmacologiche Mario Negri IRCCS; Abstract The quality of data used for QSAR model derivation is extremely important as it strongly affects the final robustness and predictive power of the model. Ambiguous or wrong structures need to be carefully checked, because they lead to errors in calculation of descriptors, hence leading to meaningless results. The increasing amounts of data, however, have often made it hard to check very large databases manually. In the light of this, we designed and implemented a semi-automated workflow integrating structural data retrieval from several web-based databases, automated comparison of these data, chemical structure cleaning, selection and standardization of data into a consistent, ready-to-use format that can be employed for modeling. The workflow integrates best practices for data curation that have been suggested in the recent literature. The workflow has been implemented with the freely available KNIME software and is freely available to the cheminformatics community for improvement and application to a broad range of chemical datasets.; http://link.springer.com/article/10.1186/s13321-018-0315-6; QSAR; Data curation; Data cleaning; Semi-automated; Workflow
collection DOAJ
language English
format Article
sources DOAJ
author Domenico Gadaleta
Anna Lombardo
Cosimo Toma
Emilio Benfenati
spellingShingle Domenico Gadaleta
Anna Lombardo
Cosimo Toma
Emilio Benfenati
A new semi-automated workflow for chemical data retrieval and quality checking for modeling applications
Journal of Cheminformatics
QSAR
Data curation
Data cleaning
Semi-automated
Workflow
author_facet Domenico Gadaleta
Anna Lombardo
Cosimo Toma
Emilio Benfenati
author_sort Domenico Gadaleta
title A new semi-automated workflow for chemical data retrieval and quality checking for modeling applications
title_short A new semi-automated workflow for chemical data retrieval and quality checking for modeling applications
title_full A new semi-automated workflow for chemical data retrieval and quality checking for modeling applications
title_fullStr A new semi-automated workflow for chemical data retrieval and quality checking for modeling applications
title_full_unstemmed A new semi-automated workflow for chemical data retrieval and quality checking for modeling applications
title_sort new semi-automated workflow for chemical data retrieval and quality checking for modeling applications
publisher BMC
series Journal of Cheminformatics
issn 1758-2946
publishDate 2018-12-01
description Abstract The quality of data used for QSAR model derivation is extremely important as it strongly affects the final robustness and predictive power of the model. Ambiguous or wrong structures need to be carefully checked, because they lead to errors in calculation of descriptors, hence leading to meaningless results. The increasing amounts of data, however, have often made it hard to check very large databases manually. In the light of this, we designed and implemented a semi-automated workflow integrating structural data retrieval from several web-based databases, automated comparison of these data, chemical structure cleaning, selection and standardization of data into a consistent, ready-to-use format that can be employed for modeling. The workflow integrates best practices for data curation that have been suggested in the recent literature. The workflow has been implemented with the freely available KNIME software and is freely available to the cheminformatics community for improvement and application to a broad range of chemical datasets.
topic QSAR
Data curation
Data cleaning
Semi-automated
Workflow
url http://link.springer.com/article/10.1186/s13321-018-0315-6
work_keys_str_mv AT domenicogadaleta anewsemiautomatedworkflowforchemicaldataretrievalandqualitycheckingformodelingapplications
AT annalombardo anewsemiautomatedworkflowforchemicaldataretrievalandqualitycheckingformodelingapplications
AT cosimotoma anewsemiautomatedworkflowforchemicaldataretrievalandqualitycheckingformodelingapplications
AT emiliobenfenati anewsemiautomatedworkflowforchemicaldataretrievalandqualitycheckingformodelingapplications
AT domenicogadaleta newsemiautomatedworkflowforchemicaldataretrievalandqualitycheckingformodelingapplications
AT annalombardo newsemiautomatedworkflowforchemicaldataretrievalandqualitycheckingformodelingapplications
AT cosimotoma newsemiautomatedworkflowforchemicaldataretrievalandqualitycheckingformodelingapplications
AT emiliobenfenati newsemiautomatedworkflowforchemicaldataretrievalandqualitycheckingformodelingapplications
_version_ 1725139068638986240