Endless Data for Drug Discovery Pipeline Validation for Free – Computational Chemistry’s Gift

28 October 2024, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Modern virtual high-throughput screening (vHTS) pipelines tend to be overmarketed and undervalidated: no rigorous study has conclusively demonstrated that each of their steps reliably adds enrichment over the baseline random hit rate. Moreover, the few benchmarking studies available focus primarily on docking, which is usually at or near the beginning of a pipeline, and even there, authors tend to use flawed data sets that artificially inflate performance metrics. Herein, we present an alternative approach to pipeline validation and data set generation that requires no additional experimental work or expenditure, yet offers negative data vastly superior in both quality and quantity to any data set used in vHTS pipeline validation to date. By randomizing ligands across published experimental structures and generating structural isomers of known binders, practically unlimited amounts of negative data can be generated. The resulting sets of positive and negative data points are closely matched in molecular properties, which makes them much more suitable for pipeline validation and gives them far greater evidentiary value than any current set. Once generated, such sets are to be run through any proposed pipeline, with performance assessed at every step. We stress the importance of using negative data of adequate quality and quantity in validation studies to definitively and verifiably demonstrate the utility of a given tool or workflow. Our goal is to help distinguish tools and pipelines that truly accelerate hit discovery and lead optimization from those that promise to do so but do not, so that academia and industry can begin to tackle the many unaddressed medical needs of the 21st century.
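
To make "enrichment atop the baseline random hit rate" and "assessing performance at every step" concrete, the following is a minimal sketch, not the authors' code. It assumes that each pipeline step assigns every compound a score where higher means more likely to bind, and that each step keeps a fixed top fraction of the survivors; the function names, the `keep_fraction` parameter, and the toy scores are all hypothetical.

```python
from typing import List, Sequence


def enrichment_factor(labels: Sequence[bool], selected: Sequence[bool]) -> float:
    """Hit rate among the selected compounds divided by the baseline hit rate."""
    if len(labels) != len(selected):
        raise ValueError("labels and selected must have the same length")
    n_total = len(labels)
    n_binders = sum(labels)
    n_selected = sum(selected)
    if n_total == 0 or n_binders == 0 or n_selected == 0:
        raise ValueError("need at least one compound, one binder, one selection")
    hits = sum(1 for is_binder, is_kept in zip(labels, selected) if is_binder and is_kept)
    return (hits / n_selected) / (n_binders / n_total)


def stepwise_enrichment(
    labels: Sequence[bool],
    step_scores: Sequence[Sequence[float]],
    keep_fraction: float = 0.5,
) -> List[float]:
    """Cumulative enrichment after each step of a filtering pipeline.

    Assumption: higher score = better, and every step keeps the top
    `keep_fraction` of the compounds that survived the previous step.
    """
    survivors = list(range(len(labels)))
    results = []
    for scores in step_scores:
        n_keep = max(1, int(len(survivors) * keep_fraction))
        survivors = sorted(survivors, key=lambda i: scores[i], reverse=True)[:n_keep]
        kept = set(survivors)
        selected = [i in kept for i in range(len(labels))]
        results.append(enrichment_factor(labels, selected))
    return results


# Toy data: 2 binders among 8 compounds, two hypothetical scoring steps.
toy_labels = [True, False, False, False, True, False, False, False]
toy_steps = [
    [0.9, 0.8, 0.1, 0.2, 0.7, 0.6, 0.3, 0.4],  # e.g. a docking score
    [0.9, 0.1, 0.0, 0.0, 0.8, 0.2, 0.0, 0.0],  # e.g. a rescoring step
]
toy_result = stepwise_enrichment(toy_labels, toy_steps, keep_fraction=0.5)
```

A step adds value in this sketch only if its entry in the returned list exceeds the previous one; a value of 1.0 means the selection is no better than random picking.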

Keywords

virtual high-throughput screening
drug discovery
CADD
molecular modeling
molecular docking
molecular dynamics
Gnina
MM-GBSA
MM-PBSA
ABFE
RBFE

Supplementary materials

Title
Protein – ligand pairs used in the present study.
Description
PDBbind sheet: the 754 protein – ligand pairs and their binding status (binding/nonbinding). The first row is the 1FCX ligand docked to the protein from the 1FCX crystal structure (redocking), the second is the 1G74 ligand docked to the 1FCX protein (crossdocking), the third is the 2P3I ligand docked to the 1FCX protein (crossdocking), and so on. MAYGEN sheet: the 4QSW, 5A5N, and 6EPU ligands and their structural isomers used in the present study, in SMILES format.
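
The row ordering described for the PDBbind sheet (native ligand first, then ligands taken from other structures) can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used to build the sheet: the function name and the `presumed_binding` field are hypothetical, a full cross product is assumed (the actual 754 pairs need not be one), and a ligand placed in a non-native protein is only presumed to be a non-binder, whereas the sheet records the binding status explicitly.

```python
from typing import Dict, List, Sequence, Union

Row = Dict[str, Union[str, bool]]


def enumerate_pairs(pdb_ids: Sequence[str]) -> List[Row]:
    """List protein-ligand pairs: the native ligand first, then all others.

    Each PDB ID stands for both the protein and the ligand taken from
    that crystal structure.
    """
    rows: List[Row] = []
    for protein_id in pdb_ids:
        # Redocking row first, followed by crossdocking rows in input order.
        ligand_order = [protein_id] + [lig for lig in pdb_ids if lig != protein_id]
        for ligand_id in ligand_order:
            is_native = ligand_id == protein_id
            rows.append(
                {
                    "protein": protein_id,
                    "ligand": ligand_id,
                    "mode": "redocking" if is_native else "crossdocking",
                    # Assumption: non-native pairs are presumed non-binders.
                    "presumed_binding": is_native,
                }
            )
    return rows


example_rows = enumerate_pairs(["1FCX", "1G74", "2P3I"])
```

With n structures this produces n redocking rows and n(n - 1) crossdocking rows, which is why the number of presumed negatives grows much faster than the number of positives.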
