Consistency of systematic chemical identifiers within and between small-molecule databases

<p>Abstract</p> <p>Background</p> <p>Correctness of structures and associated metadata within public and commercial chemical databases greatly impacts drug discovery research activities such as quantitative structure–property relationships modelling and compound novelty...


Bibliographic Details
Main Authors: Akhondi Saber A, Kors Jan A, Muresan Sorel
Format: Article
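Language: English
Published: BMC 2012-12-01
Series: Journal of Cheminformatics
Subjects: Molecular structure; Chemical databases; Systematic chemical identifiers; Quality control; InChI; SMILES; IUPAC
Online Access: http://www.jcheminf.com/content/4/1/35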
id doaj-d6529c3c36d74456a2b1acac00601919
record_format Article
spelling doaj-d6529c3c36d74456a2b1acac00601919; 2020-11-24T22:06:27Z; eng; BMC; Journal of Cheminformatics; 1758-2946; 2012-12-01; 4; 1; 35; 10.1186/1758-2946-4-35; Consistency of systematic chemical identifiers within and between small-molecule databases; Akhondi Saber A; Kors Jan A; Muresan Sorel; http://www.jcheminf.com/content/4/1/35; Molecular structure; Chemical databases; Systematic chemical identifiers; Quality control; InChI; SMILES; IUPAC
collection DOAJ
language English
format Article
sources DOAJ
author Akhondi Saber A
Kors Jan A
Muresan Sorel
title Consistency of systematic chemical identifiers within and between small-molecule databases
publisher BMC
series Journal of Cheminformatics
issn 1758-2946
publishDate 2012-12-01
description <p>Abstract</p> <p>Background</p> <p>The correctness of structures and associated metadata in public and commercial chemical databases greatly affects drug discovery research activities such as quantitative structure–property relationship modelling and compound novelty checking. MOL files, SMILES notations, IUPAC names, and InChI strings are ubiquitous file formats and systematic identifiers for chemical structures. Although these are interchangeable for many cheminformatics purposes, there have been no studies of the inconsistency among these structure identifiers that arises from different approaches to data integration, including the use of different software and different rules for structure standardisation. We have investigated the consistency of systematic identifiers of small molecules within and between some of the commonly used chemical resources, with and without structure standardisation.</p> <p>Results</p> <p>The consistency between systematic chemical identifiers and their corresponding MOL representation varies greatly between data sources (37.2% to 98.5%). We observed the lowest overall consistency for MOL-IUPAC names. Disregarding stereochemistry increases the consistency (84.8% to 99.9%). A wide variation in consistency also exists between MOL representations of compounds linked via cross-references (25.8% to 93.7%). Removing stereochemistry improved the consistency (47.6% to 95.6%).</p> <p>Conclusions</p> <p>We have shown that considerable inconsistency exists in structural representation and systematic chemical identifiers within and between databases. This can have a great influence, especially when merging data and when systematic identifiers are used as a key index for structure integration or for cross-querying several databases. Regenerating systematic identifiers from their MOL representation, and applying well-defined and documented chemistry standardisation rules to all compounds before generating them, can dramatically increase internal consistency.</p>
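The consistency percentages in the abstract, with and without stereochemistry, can be illustrated with a small sketch. This is not the authors' pipeline: it assumes each record already carries two InChI strings (the one stored in the database and one regenerated from the MOL representation by some cheminformatics toolkit), and it approximates "disregarding stereochemistry" by dropping the InChI stereo layers (/b, /t, /m, /s). The names `strip_stereo`, `consistency`, and `PAIRS`, and the toy records themselves, are hypothetical.

```python
import re

# InChI stereo layers: /b (double bond), /t (tetrahedral),
# /m (inversion flag), /s (stereo type). Assumption: removing them
# stands in for a stereo-insensitive comparison.
_STEREO_LAYER = re.compile(r"/[btms][^/]*")


def strip_stereo(inchi: str) -> str:
    """Return the InChI string with its stereo layers removed."""
    return _STEREO_LAYER.sub("", inchi)


def consistency(pairs, ignore_stereo=False):
    """Percentage of (stored, regenerated) InChI pairs that agree."""
    if not pairs:
        return 0.0
    matches = 0
    for stored, regenerated in pairs:
        if ignore_stereo:
            stored = strip_stereo(stored)
            regenerated = strip_stereo(regenerated)
        if stored == regenerated:
            matches += 1
    return 100.0 * matches / len(pairs)


# Illustrative strings only; they are not taken from any database in the study.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"
ALA_FLAT = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
L_ALA = ALA_FLAT + "/t2-/m0/s1"
D_ALA = ALA_FLAT + "/t2-/m1/s1"

# Each tuple is (identifier stored in the database, identifier regenerated from MOL).
PAIRS = [
    (ETHANOL, ETHANOL),   # agrees in both modes
    (L_ALA, D_ALA),       # differs only in stereo configuration
    (L_ALA, ALA_FLAT),    # stereo present on one side only
    (ETHANOL, METHANOL),  # different compounds
]

if __name__ == "__main__":
    print(consistency(PAIRS))                      # stereo-sensitive
    print(consistency(PAIRS, ignore_stereo=True))  # stereo-insensitive
```

On this toy input, two of the three mismatches disappear once stereo layers are ignored, which mirrors the direction of the effect reported in the Results, though not its magnitude.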
topic Molecular structure
Chemical databases
Systematic chemical identifiers
Quality control
InChI
SMILES
IUPAC
url http://www.jcheminf.com/content/4/1/35