Abstract
The emergence of ultra-large screening libraries containing billions of readily available compounds poses a growing challenge for docking-based virtual screening. Machine learning (ML)-boosted strategies such as the tool HASTEN combine rapid ML prediction with brute-force docking of small fractions of such libraries to increase screening throughput and take on giga-scale libraries. In our case study of an anti-bacterial chaperone and an anti-viral kinase, we first generated a brute-force docking baseline for 1.56 billion compounds in the Enamine REAL lead-like library with the fast Glide HTVS protocol. With HASTEN, we observed robust recall of 90% of the true 1000 top-scoring virtual hits for both targets when docking only 1% of the entire library. This 99% reduction in the number of required docking experiments significantly shortens the screening time. For the kinase target, the use of a hydrogen bonding constraint caused a major proportion of docking attempts to fail, which hampered ML predictions. We demonstrate the optimization potential in the treatment of failed compounds when performing ML-boosted screening and showcase HASTEN as a fast and robust tool in a growing arsenal of approaches to unlock the chemical space covered by giga-scale screening libraries for everyday drug discovery campaigns.
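The recall figure quoted in the abstract (the share of the true top-scoring hits recovered after docking only a small fraction of the library) can be illustrated with a minimal Python sketch. This is not HASTEN's code: the function name `top_k_recall`, the compound IDs, and the toy scores are hypothetical, and the sketch assumes that lower (more negative) docking scores are better, as is the usual convention for Glide scores.

```python
def top_k_recall(true_scores, docked_ids, k=1000):
    """Return the fraction of the k best-scoring compounds that were docked.

    true_scores: dict mapping compound ID -> brute-force docking score
                 (assumption: lower score = better pose).
    docked_ids:  iterable of compound IDs selected for docking by the
                 ML-boosted workflow.
    """
    if k <= 0 or not true_scores:
        raise ValueError("k must be positive and true_scores non-empty")
    # Rank all compounds by their baseline score, best (lowest) first.
    top_k = sorted(true_scores, key=true_scores.get)[:k]
    docked = set(docked_ids)
    recovered = sum(1 for cid in top_k if cid in docked)
    return recovered / len(top_k)


# Toy baseline of eight compounds (hypothetical scores, not real data).
baseline = {
    "c0": -9.1, "c1": -8.7, "c2": -8.5, "c3": -8.0,
    "c4": -6.2, "c5": -5.9, "c6": -5.0, "c7": -4.4,
}
# Hypothetical subset that the ML-guided procedure chose to dock.
docked_subset = ["c0", "c1", "c3", "c6"]

recall = top_k_recall(baseline, docked_subset, k=4)
docked_fraction = len(docked_subset) / len(baseline)
print(f"docked {docked_fraction:.0%} of the library, top-4 recall = {recall:.2f}")
```

In the study's terms, `k` would be 1000 and the docked subset about 1% of the 1.56 billion compounds; the toy example only shows how the metric is defined against a brute-force baseline.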
Supplementary materials
Title
Supporting Information
Description
Supporting Figures S1-S5.
Supporting Tables S1-S6.
Summary of utilized Chemprop parameters.
Supplementary weblinks
Title
Schrodinger Phase databases for Enamine REAL lead-like library of 1.56 billion compounds (March 2021)
Description
A collection of Phase databases created from the Enamine REAL lead-like library as downloaded in March 2021 (1.56 billion compounds).
Title
Glide HTVS docking results of Enamine REAL lead-like library (1.56 billion compounds) for two targets
Description
Docking results for 1.56 billion compounds of the Enamine REAL lead-like library (obtained March 2021) for the targets SurA and GAK. The intended use of the data is to serve as a giga-scale benchmarking dataset, e.g. for machine learning approaches.