Multi-fidelity machine learning models for improved high-throughput screening predictions

26 July 2022, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

High-throughput screening (HTS) is one of the leading techniques for hit identification in drug discovery. It comprises multiple phases, one primary screen and one or more confirmatory screens, which together produce multi-fidelity data: noisy primary screening data are available for a large number of compounds, and higher-quality confirmatory data for a low-to-moderate number of compounds. Existing computational pipelines do not integrate the primary screening data of individual HTS campaigns, leaving millions of potentially useful data points unused for bioactivity prediction. Furthermore, there is a lack of publicly available multi-fidelity bioactivity benchmarks to support the modelling of real-world HTS data. To address these challenges, we assembled public (PubChem) and private (AstraZeneca) collections of multi-fidelity HTS datasets, totalling over 28 million data points, with many targets possessing more than 1 million labels. We then designed and evaluated machine learning models to assess the improvements offered by integrating multi-fidelity data, including classical models and a bespoke, novel deep learning approach based on graph neural networks. Jointly modelling primary and confirmatory data led to a 12% decrease in mean absolute error (MAE) and a 152% increase in R-squared on the public datasets, and to a 17% reduction in MAE coupled with a 46% uplift in R-squared on the AstraZeneca datasets (averaged across all evaluated methods). We conclude that joint modelling of multi-fidelity HTS data improves predictive performance, and that deep learning enables unique and highly desirable strategies such as leveraging signals from multi-million scale datasets and transfer learning.
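The reported gains are relative changes in MAE and R-squared between a model trained on confirmatory data alone and one that also uses primary data. The sketch below illustrates how such percentages can be computed. It is not the authors' evaluation code: the function names and the toy activity values are hypothetical, and it assumes the percentages are taken relative to the single-fidelity baseline, which the abstract does not spell out.

```python
# Illustrative sketch only: names and numbers are assumptions, not from the paper.

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def relative_change_pct(baseline, new):
    """Percentage change of `new` relative to `baseline` (assumed convention)."""
    return 100.0 * (new - baseline) / abs(baseline)

# Made-up confirmatory-screen activities and two sets of predictions.
y_true = [5.0, 6.0, 7.0, 8.0]
pred_single_fidelity = [6.0, 6.0, 6.0, 7.0]   # hypothetical: confirmatory data only
pred_multi_fidelity = [5.5, 6.0, 6.5, 7.5]    # hypothetical: primary + confirmatory

mae_base = mean_absolute_error(y_true, pred_single_fidelity)
mae_new = mean_absolute_error(y_true, pred_multi_fidelity)
r2_base = r_squared(y_true, pred_single_fidelity)
r2_new = r_squared(y_true, pred_multi_fidelity)

# A negative MAE change is an improvement; a positive R-squared change is an improvement.
print(f"MAE change: {relative_change_pct(mae_base, mae_new):.1f}%")
print(f"R-squared change: {relative_change_pct(r2_base, r2_new):.1f}%")
```

Note that a relative change in R-squared can exceed 100% when the baseline value is small, which is consistent with a figure such as the 152% reported for the public datasets.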

Keywords

high-throughput screening
hts
single dose
single concentration
dose response
concentration response
primary
confirmatory
screen
assay
bioassay
pubchem
pharmaceutical
industry
molecule
compound
molecular
embedding
representation
chemical space
latent space
data modalities
integration
augmentation
deep learning
machine learning
artificial intelligence
ai
ml
computational
graph neural network
gnn
graph representation learning
neural network
transfer learning
multi-fidelity
lead optimization
public
private
shallow
random forest
support vector machine
svm
rf
vgae
variational
autoencoder
aggregator
fingerprint

Supplementary materials

Title: Supplementary Information
Description: The materials include additional figures and tables, model hyperparameters, and details about the methodology and evaluation.
