The chemfp project

Abstract The chemfp project has had four main goals: (1) promote the FPS format as a text-based exchange format for dense binary cheminformatics fingerprints, (2) develop a high-performance implementation of the BitBound algorithm that could be used as an effective baseline to benchmark new similari...

Full description

Bibliographic Details
Main Author: Andrew Dalke
Format: Article
Language:English
Published: BMC 2019-12-01
Series:Journal of Cheminformatics
Subjects:
Online Access:https://doi.org/10.1186/s13321-019-0398-8
id doaj-0411d7b7f182429eb36a3bcfc8c74a6a
record_format Article
spelling doaj-0411d7b7f182429eb36a3bcfc8c74a6a2020-12-06T12:49:10ZengBMCJournal of Cheminformatics1758-29462019-12-0111112110.1186/s13321-019-0398-8The chemfp projectAndrew Dalke0Andrew Dalke Scientific ABAbstract The chemfp project has had four main goals: (1) promote the FPS format as a text-based exchange format for dense binary cheminformatics fingerprints, (2) develop a high-performance implementation of the BitBound algorithm that could be used as an effective baseline to benchmark new similarity search implementations, (3) experiment with funding a pure open source software project through commercial sales, and (4) publish the results and lessons learned as a guide for future implementors. The FPS format has had only minor success, though it did influence development of the FPB binary format, which is faster to load but more complex. Both are summarized. The chemfp benchmark and the no-cost/open source version of chemfp are proposed as a reference baseline to evaluate the effectiveness of other similarity search tools. They are used to evaluate the faster commercial version of chemfp, which can test 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k = 1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query. The same search of 970 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest CPU-based similarity search implementations. Modern CPUs are fast enough that memory bandwidth and latency are now important factors. Single-threaded search uses most of the available memory bandwidth. Sorting the fingerprints by popcount improves memory coherency, which when combined with 4 OpenMP threads makes it possible to construct an N × N similarity matrix for 1 million fingerprints in about 30 min. These observations may affect the interpretation of previous publications which assumed that search was strongly CPU bound. The chemfp project funding came from selling a purely open-source software product. Several product business models were tried, but none proved sustainable. Some of the experiences are discussed, in order to contribute to the ongoing conversation on the role of open source software in cheminformatics.https://doi.org/10.1186/s13321-019-0398-8Molecular fingerprintsSimilarity searchingTanimotoHigh-performanceFormatOpen source
collection DOAJ
language English
format Article
sources DOAJ
author Andrew Dalke
spellingShingle Andrew Dalke
The chemfp project
Journal of Cheminformatics
Molecular fingerprints
Similarity searching
Tanimoto
High-performance
Format
Open source
author_facet Andrew Dalke
author_sort Andrew Dalke
title The chemfp project
title_short The chemfp project
title_full The chemfp project
title_fullStr The chemfp project
title_full_unstemmed The chemfp project
title_sort chemfp project
publisher BMC
series Journal of Cheminformatics
issn 1758-2946
publishDate 2019-12-01
description Abstract The chemfp project has had four main goals: (1) promote the FPS format as a text-based exchange format for dense binary cheminformatics fingerprints, (2) develop a high-performance implementation of the BitBound algorithm that could be used as an effective baseline to benchmark new similarity search implementations, (3) experiment with funding a pure open source software project through commercial sales, and (4) publish the results and lessons learned as a guide for future implementors. The FPS format has had only minor success, though it did influence development of the FPB binary format, which is faster to load but more complex. Both are summarized. The chemfp benchmark and the no-cost/open source version of chemfp are proposed as a reference baseline to evaluate the effectiveness of other similarity search tools. They are used to evaluate the faster commercial version of chemfp, which can test 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k = 1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query. The same search of 970 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest CPU-based similarity search implementations. Modern CPUs are fast enough that memory bandwidth and latency are now important factors. Single-threaded search uses most of the available memory bandwidth. Sorting the fingerprints by popcount improves memory coherency, which when combined with 4 OpenMP threads makes it possible to construct an N × N similarity matrix for 1 million fingerprints in about 30 min. These observations may affect the interpretation of previous publications which assumed that search was strongly CPU bound. The chemfp project funding came from selling a purely open-source software product. Several product business models were tried, but none proved sustainable. Some of the experiences are discussed, in order to contribute to the ongoing conversation on the role of open source software in cheminformatics.
topic Molecular fingerprints
Similarity searching
Tanimoto
High-performance
Format
Open source
url https://doi.org/10.1186/s13321-019-0398-8
work_keys_str_mv AT andrewdalke thechemfpproject
AT andrewdalke chemfpproject
_version_ 1724398584428756992