PubChem chemical structure standardization

Abstract Background PubChem is a chemical information repository, consisting of three primary databases: Substance, Compound, and BioAssay. When individual data contributors submit chemical substance descriptions to Substance, the unique chemical structures are extracted and stored into Compound thr...

Bibliographic Details
Main Authors: Volker D. Hähnke, Sunghwan Kim, Evan E. Bolton
Format: Article
Language: English
Published: BMC 2018-08-01
Series: Journal of Cheminformatics
Subjects: PubChem; Standardization; InChI; Tautomerism; Aromaticity; Kekulization
Online Access: http://link.springer.com/article/10.1186/s13321-018-0293-8
id doaj-22ad18a5ff6f478e8b0f14e3c88a2e7f
record_format Article
spelling doaj-22ad18a5ff6f478e8b0f14e3c88a2e7f 2020-11-25T01:27:31Z eng BMC Journal of Cheminformatics 1758-2946 2018-08-01 10 1 1 40 10.1186/s13321-018-0293-8 PubChem chemical structure standardization Volker D. Hähnke 0 Sunghwan Kim 1 Evan E. Bolton 2 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services Abstract Background PubChem is a chemical information repository, consisting of three primary databases: Substance, Compound, and BioAssay. When individual data contributors submit chemical substance descriptions to Substance, the unique chemical structures are extracted and stored into Compound through an automated process called structure standardization. The present study describes the PubChem standardization approaches and analyzes them for their success rates, reasons that cause structures to be rejected, and modifications applied to structures during the standardization process. Furthermore, the PubChem standardization is compared to the structure normalization of the IUPAC International Chemical Identifier (InChI) software, as manifested by conversion of the InChI back into a chemical structure. Results The observed rejection rate for substances processed by PubChem standardization was 0.36%, which is predominantly attributed to structures with invalid atom valences that cannot be readily corrected without additional information from contributors. Of all structures that pass standardization, 44% are modified in the process, reducing the count of unique structures from 53,574,724 in substance to 45,808,881 in compound as identified by de-aromatized canonical isomeric SMILES.
Even though the processing time is very low on average (only 0.4% of structures have individual standardization time above 0.1 s), total standardization time is completely dominated by edge cases: 90% of the time to standardize all structures in PubChem substance is spent on the 2.05% of structures with the highest individual standardization time. It is worth noting that 60% of the structures obtained from PubChem structure standardization are not identical to the chemical structure resulting from the InChI (primarily due to preferences for a different tautomeric form). Conclusions Standardization of chemical structures is complicated by the diversity of chemical information and its representation approaches. The PubChem standardization is an effective and efficient tool to account for molecular diversity and to eliminate invalid/incomplete structures. Further development will concentrate on improved tautomer consideration and an expanded stereocenter definition. Modifications are difficult to thoroughly validate, with slight changes often affecting many thousands of structures and various edge cases. The PubChem structure standardization service is accessible as a public resource (https://pubchem.ncbi.nlm.nih.gov/standardize) and via programmatic interfaces. http://link.springer.com/article/10.1186/s13321-018-0293-8 PubChem Standardization InChI Tautomerism Aromaticity Kekulization
collection DOAJ
language English
format Article
sources DOAJ
author Volker D. Hähnke
Sunghwan Kim
Evan E. Bolton
spellingShingle Volker D. Hähnke
Sunghwan Kim
Evan E. Bolton
PubChem chemical structure standardization
Journal of Cheminformatics
PubChem
Standardization
InChI
Tautomerism
Aromaticity
Kekulization
author_facet Volker D. Hähnke
Sunghwan Kim
Evan E. Bolton
author_sort Volker D. Hähnke
title PubChem chemical structure standardization
title_short PubChem chemical structure standardization
title_full PubChem chemical structure standardization
title_fullStr PubChem chemical structure standardization
title_full_unstemmed PubChem chemical structure standardization
title_sort pubchem chemical structure standardization
publisher BMC
series Journal of Cheminformatics
issn 1758-2946
publishDate 2018-08-01
description Abstract Background PubChem is a chemical information repository, consisting of three primary databases: Substance, Compound, and BioAssay. When individual data contributors submit chemical substance descriptions to Substance, the unique chemical structures are extracted and stored into Compound through an automated process called structure standardization. The present study describes the PubChem standardization approaches and analyzes them for their success rates, reasons that cause structures to be rejected, and modifications applied to structures during the standardization process. Furthermore, the PubChem standardization is compared to the structure normalization of the IUPAC International Chemical Identifier (InChI) software, as manifested by conversion of the InChI back into a chemical structure. Results The observed rejection rate for substances processed by PubChem standardization was 0.36%, which is predominantly attributed to structures with invalid atom valences that cannot be readily corrected without additional information from contributors. Of all structures that pass standardization, 44% are modified in the process, reducing the count of unique structures from 53,574,724 in substance to 45,808,881 in compound as identified by de-aromatized canonical isomeric SMILES. Even though the processing time is very low on average (only 0.4% of structures have individual standardization time above 0.1 s), total standardization time is completely dominated by edge cases: 90% of the time to standardize all structures in PubChem substance is spent on the 2.05% of structures with the highest individual standardization time. It is worth noting that 60% of the structures obtained from PubChem structure standardization are not identical to the chemical structure resulting from the InChI (primarily due to preferences for a different tautomeric form). 
Conclusions Standardization of chemical structures is complicated by the diversity of chemical information and its representation approaches. The PubChem standardization is an effective and efficient tool to account for molecular diversity and to eliminate invalid/incomplete structures. Further development will concentrate on improved tautomer consideration and an expanded stereocenter definition. Modifications are difficult to thoroughly validate, with slight changes often affecting many thousands of structures and various edge cases. The PubChem structure standardization service is accessible as a public resource (https://pubchem.ncbi.nlm.nih.gov/standardize) and via programmatic interfaces.
topic PubChem
Standardization
InChI
Tautomerism
Aromaticity
Kekulization
url http://link.springer.com/article/10.1186/s13321-018-0293-8