Performance Insights for Small Molecule Drug Discovery Models: Data Scaling, Multitasking, and Generalization

01 November 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Predictive models hold considerable promise for accelerating the discovery of safe, efficacious therapeutics. To better understand and improve the performance of small molecule predictive models, we conducted multiple experiments with deep learning and traditional machine learning approaches, leveraging our large internal datasets as well as publicly available datasets. These experiments included assessing performance on random, temporal, and reverse-temporal data ablation tasks, as well as tasks testing model extrapolation to different property spaces. We identified factors that drive the higher performance of predictive models built with graph neural networks relative to traditional methods such as XGBoost and random forests. Building on these findings, we derived a scaling relationship that accounts for 81% of the variance in model performance across different assays and data regimes. This relationship can be used to estimate the performance of models for ADMET (absorption, distribution, metabolism, excretion, and toxicity) endpoints as well as drug discovery assay data in general. The results provide insights into how to further improve model performance.
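
The abstract reports a scaling relationship explaining 81% of the variance in model performance across assays and data regimes, but does not specify its functional form. As a minimal sketch only, the snippet below assumes a simple power-law (log-linear) dependence of test error on training-set size and fits it by least squares; the function name, functional form, and example numbers are illustrative assumptions, not the authors' derived relationship.

```python
import numpy as np

def fit_scaling_law(n_train, test_error):
    """Fit log(error) = a + b * log(n_train) by ordinary least squares.

    Returns the intercept a, slope b, and the R^2 of the fit.
    NOTE: this power-law form is an assumption for illustration; the paper's
    actual scaling relationship may differ.
    """
    x = np.log(np.asarray(n_train, dtype=float))
    y = np.log(np.asarray(test_error, dtype=float))
    A = np.column_stack([np.ones_like(x), x])   # design matrix [1, log N]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ coef
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return coef[0], coef[1], r2

# Synthetic data-ablation results (training-set size vs. mean absolute error),
# purely hypothetical values for demonstration:
sizes = [500, 1_000, 5_000, 10_000, 50_000]
errors = [0.62, 0.55, 0.41, 0.37, 0.28]
a, b, r2 = fit_scaling_law(sizes, errors)
print(f"log(error) ~ {a:.2f} + {b:.2f} * log(N)  (R^2 = {r2:.2f})")
```

In practice, such a fit would be repeated per assay and per splitting strategy (random, temporal, reverse-temporal) to compare how performance scales with data under each regime.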
