A new semi-automated workflow for chemical data retrieval and quality checking for modeling applications
Abstract The quality of data used for QSAR model derivation is extremely important as it strongly affects the final robustness and predictive power of the model. Ambiguous or wrong structures need to be carefully checked, because they lead to errors in calculation of descriptors, hence leading to me...
Main Authors: | Domenico Gadaleta, Anna Lombardo, Cosimo Toma, Emilio Benfenati |
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2018-12-01 |
Series: | Journal of Cheminformatics |
Subjects: | QSAR, Data curation, Data cleaning, Semi-automated, Workflow |
Online Access: | http://link.springer.com/article/10.1186/s13321-018-0315-6 |
id |
doaj-94d1c313162c4387a83d36b239a22c9f |
---|---|
record_format |
Article |
spelling |
doaj-94d1c313162c4387a83d36b239a22c9f; 2020-11-25T01:19:17Z; eng; BMC; Journal of Cheminformatics; 1758-2946; 2018-12-01; 101113; 10.1186/s13321-018-0315-6; A new semi-automated workflow for chemical data retrieval and quality checking for modeling applications; Domenico Gadaleta, Anna Lombardo, Cosimo Toma, Emilio Benfenati; Laboratory of Environmental Chemistry and Toxicology, Department of Environmental Health Sciences, Istituto di Ricerche Farmacologiche Mario Negri IRCCS; Abstract The quality of data used for QSAR model derivation is extremely important as it strongly affects the final robustness and predictive power of the model. Ambiguous or wrong structures need to be carefully checked, because they lead to errors in calculation of descriptors, hence leading to meaningless results. The increasing amounts of data, however, have often made it hard to check very large databases manually. In the light of this, we designed and implemented a semi-automated workflow integrating structural data retrieval from several web-based databases, automated comparison of these data, chemical structure cleaning, selection and standardization of data into a consistent, ready-to-use format that can be employed for modeling. The workflow integrates best practices for data curation that have been suggested in the recent literature. The workflow has been implemented with the freely available KNIME software and is freely available to the cheminformatics community for improvement and application to a broad range of chemical datasets.; http://link.springer.com/article/10.1186/s13321-018-0315-6; QSAR, Data curation, Data cleaning, Semi-automated, Workflow |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Domenico Gadaleta, Anna Lombardo, Cosimo Toma, Emilio Benfenati |
author_sort |
Domenico Gadaleta |
title |
A new semi-automated workflow for chemical data retrieval and quality checking for modeling applications |
title_sort |
new semi-automated workflow for chemical data retrieval and quality checking for modeling applications |
publisher |
BMC |
series |
Journal of Cheminformatics |
issn |
1758-2946 |
publishDate |
2018-12-01 |
description |
Abstract The quality of data used for QSAR model derivation is extremely important as it strongly affects the final robustness and predictive power of the model. Ambiguous or wrong structures need to be carefully checked, because they lead to errors in calculation of descriptors, hence leading to meaningless results. The increasing amounts of data, however, have often made it hard to check very large databases manually. In the light of this, we designed and implemented a semi-automated workflow integrating structural data retrieval from several web-based databases, automated comparison of these data, chemical structure cleaning, selection and standardization of data into a consistent, ready-to-use format that can be employed for modeling. The workflow integrates best practices for data curation that have been suggested in the recent literature. The workflow has been implemented with the freely available KNIME software and is freely available to the cheminformatics community for improvement and application to a broad range of chemical datasets. |
topic |
QSAR, Data curation, Data cleaning, Semi-automated, Workflow |
url |
http://link.springer.com/article/10.1186/s13321-018-0315-6 |
_version_ |
1725139068638986240 |