Abstract
Modern virtual high-throughput screening (vHTS) pipelines tend to be overmarketed and undervalidated: no rigorous study has conclusively demonstrated that every one of their steps reliably adds enrichment over the baseline random hit rate. Moreover, the few benchmarking studies available focus primarily on the docking stage, which usually sits at or near the beginning of a pipeline, and even there, authors tend to use flawed data sets that artificially inflate performance metrics. Herein, we present an alternative approach to pipeline validation and data set generation that requires no additional experimental work or expenditure, yet offers negative data that is vastly superior in both quality and quantity to that of any data set used in vHTS pipeline validation to date. By randomizing ligands across published experimental structures and generating structural isomers of known binders, we can generate practically unlimited amounts of negative data. The resulting sets of positive and negative data points match closely in molecular properties, are much more suitable for pipeline validation, and have far greater evidentiary value than any of the current sets. Once generated, such sets can be run through any proposed pipeline, with performance assessed at every step. We stress the importance of using negative data of adequate quality and quantity in validation studies to demonstrate, definitively and verifiably, the utility of a given tool or workflow. Our goal is to help distinguish tools and pipelines that truly accelerate hit discovery and lead optimization from those that promise to do so but do not, so that academia and industry can begin to tackle the many unaddressed medical needs of the 21st century.
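The step-by-step assessment described above can be illustrated with a minimal sketch. It assumes each compound carries a known binding label and that each pipeline step can be modelled as a keep/discard filter; the function name `stepwise_enrichment`, the toy library, and the step names are hypothetical and are not taken from the study. The enrichment factor used here is the common definition: the hit rate among surviving compounds divided by the hit rate of the full set.

```python
from typing import Callable, List, Sequence, Tuple

# (identifier, is_known_binder): hypothetical representation of a labelled compound
Compound = Tuple[str, bool]
# (step name, filter returning True if the compound survives the step)
Step = Tuple[str, Callable[[Compound], bool]]


def stepwise_enrichment(
    compounds: Sequence[Compound], steps: Sequence[Step]
) -> List[Tuple[str, int, float]]:
    """Return (step name, survivors, enrichment factor) after each step.

    The enrichment factor is the hit rate among the survivors divided by
    the baseline hit rate of the full starting set.
    """
    total = len(compounds)
    total_hits = sum(1 for _, is_binder in compounds if is_binder)
    if total == 0 or total_hits == 0:
        raise ValueError("need at least one compound and one known binder")

    survivors = list(compounds)
    report = []
    for name, keep in steps:
        # Steps are applied cumulatively, as in a screening funnel.
        survivors = [c for c in survivors if keep(c)]
        hits = sum(1 for _, is_binder in survivors if is_binder)
        if survivors:
            ef = (hits * total) / (len(survivors) * total_hits)
        else:
            ef = 0.0  # nothing survived, so nothing was enriched
        report.append((name, len(survivors), ef))
    return report


if __name__ == "__main__":
    # Toy data: 2 binders among 10 compounds (baseline hit rate 0.2).
    library = [("binder_1", True), ("binder_2", True)]
    library += [(f"decoy_{i}", False) for i in range(8)]
    toy_steps = [
        ("docking", lambda c: c[0] in {"binder_1", "binder_2",
                                       "decoy_0", "decoy_1", "decoy_2"}),
        ("rescoring", lambda c: c[0] in {"binder_1", "decoy_0"}),
    ]
    for name, kept, ef in stepwise_enrichment(library, toy_steps):
        print(f"{name}: kept={kept}, EF={ef:.2f}")
```

In this toy run the docking step keeps both binders among five survivors (EF 2.00), and the rescoring step keeps one binder among two survivors (EF 2.50). A step whose enrichment factor does not exceed that of the preceding step would be a candidate for removal from the pipeline.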
Supplementary materials
Title
Protein–ligand pairs used in the present study.
Description
PDBbind sheet: the 754 protein–ligand pairs and their binding status (binding/nonbinding). The first row is the 1FCX ligand docked to the protein from the 1FCX crystal structure (redocking), the second is the 1FCX ligand docked to the protein from 1G74 (crossdocking), the third is the 1FCX ligand docked to the protein from 2P3I (crossdocking), and so on. MAYGEN sheet: the 4QSW, 5A5N, and 6EPU ligands and their structural isomers used in the present study, in SMILES format.
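The row layout described for the PDBbind sheet can be sketched as a simple enumeration of ligand and protein sources. This is an illustration only: the function name `enumerate_pairs` is hypothetical, only the three PDB IDs named above are used, and the ordering beyond the first three rows is an assumption. The sketch deliberately assigns no binding labels, since the binding status of each pair is given in the sheet itself.

```python
from itertools import product
from typing import List, Sequence, Tuple

# (ligand source PDB ID, protein source PDB ID, docking type)
PairRow = Tuple[str, str, str]


def enumerate_pairs(pdb_ids: Sequence[str]) -> List[PairRow]:
    """Pair the ligand from each structure with the protein from each structure.

    A pair whose ligand and protein come from the same structure is tagged
    "redocking"; every other pair is tagged "crossdocking".
    """
    rows = []
    for ligand_src, protein_src in product(pdb_ids, repeat=2):
        kind = "redocking" if ligand_src == protein_src else "crossdocking"
        rows.append((ligand_src, protein_src, kind))
    return rows


if __name__ == "__main__":
    # Only the three IDs named in the description; the full sheet has 754 rows.
    for row in enumerate_pairs(["1FCX", "1G74", "2P3I"])[:3]:
        print(row)
```

For these three IDs, the first three rows are the 1FCX ligand against the 1FCX, 1G74, and 2P3I proteins, matching the order given in the description.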