Abstract
Sources of commercially available compounds have grown continuously for several years, culminating in billion- to trillion-sized combinatorial Chemical Spaces. Assessing how well a compound collection provides relevant chemistry requires a benchmark set of pharmaceutically relevant structures that enables an unbiased comparison. For this purpose, the ChEMBL database was mined for molecules displaying biological activity, and three benchmark sets spanning successive orders of magnitude in size were created by systematic filtering and processing: Set L (‘large-sized’, 379k), Set M (‘medium-sized’, 25k), and Set S (‘small-sized’, 3k). Set S, tailored for broad coverage of the physicochemical and topological landscape, was then used to analyze the chemical diversity offered by commercial combinatorial Chemical Spaces and enumerated compound libraries. Three search methods were used: FTrees (pharmacophore features), SpaceLight (molecular fingerprints), and SpaceMACS (maximum common substructure). Across all three, eXplore and REAL Space consistently performed best. In general, each Chemical Space provided a larger number of compounds with higher similarity to the respective query molecule than the enumerated libraries did, and each also offered unique scaffolds for every method.
Supplementary materials
Title
Supporting Information
Description
Supporting figures and tables.
Title
Benchmark Set L [379k]
Description
Benchmark Set L 'large-sized' featuring 379k molecules.
Title
Benchmark Set M [25k]
Description
Benchmark Set M 'medium-sized' featuring 25k molecules.
Title
Benchmark Set S [3k]
Description
Benchmark Set S 'small-sized' featuring 2.9k molecules.