Roadmap to Pharmaceutically Relevant Reactivity Models Leveraging High-Throughput Experimentation

Jessica Xu; Dipannita Kalyani; Thomas Struble; Spencer Dreher; Shane Krska; Stephen L. Buchwald; Klavs F. Jensen

doi:10.26434/chemrxiv-2022-x694w

Catalysis

19 September 2022, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The merger of High-Throughput Experimentation (HTE) and data science presents an opportunity to both accelerate and inspire innovations in synthetic chemistry. Similarly, developments in machine learning (ML) have enabled the distillation of large and complex datasets into predictive models capable of generalizing patterns in the data. However, efforts to merge HTE with ML remain constrained by the small number of reported datasets, which have limited structural diversity, and by the corresponding trained models, which do not extrapolate well to substrates beyond the training set. Herein, we detail the first ML models for Pd-catalyzed C–N couplings using large, structurally diverse, pharmaceutically relevant datasets (~5000 unique products) generated using chemistry compatible with the nanomole scale. Careful consideration is given to both the diversity of the dataset and the accuracy of model predictions for substrates bearing features beyond those present in the training set. The structural diversity in the dataset is enabled by leveraging the Merck & Co., Inc. Building Block Collection, with an initial focus on C–N coupling using secondary amines. The large dataset enables the systematic evaluation of model performance using five different data-splitting strategies. These five splits are carefully designed to evaluate the model’s ability to extrapolate beyond the substrates in the training set. Classification models built with a lens toward application to medicinal chemistry campaigns exceeded the baseline precision-recall by 25–67%, depending on the splitting strategy. These results would manifest as significant enrichment of successful C–N couplings among the hits recommended by the models. In addition, the accuracy of the best models for each of the five splits ranges from 70% to 87%, suggesting excellent overall predictivity of the models even for completely unseen substrates.
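The abstract does not spell out the five splitting strategies or how the precision-recall baseline is defined. As a minimal sketch of the general idea only, the Python below holds out every reaction that involves a chosen subset of amines, so the test set contains only unseen amines, and compares a model's precision against the positive rate that a random classifier would be expected to achieve. The function names, the dictionary fields, and the toy reactions are all hypothetical and are not taken from the manuscript.

```python
import random


def split_by_held_out_substrate(reactions, key, test_fraction=0.2, seed=0):
    """Send every reaction whose `key` substrate is held out to the test set.

    Hypothetical helper: one possible extrapolative split, not necessarily
    one of the five strategies used in the preprint.
    """
    substrates = sorted({r[key] for r in reactions})
    random.Random(seed).shuffle(substrates)
    n_test = max(1, round(test_fraction * len(substrates)))
    held_out = set(substrates[:n_test])
    train = [r for r in reactions if r[key] not in held_out]
    test = [r for r in reactions if r[key] in held_out]
    return train, test


def baseline_precision(labels):
    """Fraction of successful reactions: expected precision of random picks."""
    return sum(labels) / len(labels)


def precision(labels, predictions):
    """Fraction of predicted-successful reactions that actually succeeded."""
    picked = [y for y, p in zip(labels, predictions) if p]
    return sum(picked) / len(picked) if picked else 0.0


# Synthetic toy data (invented for illustration): 5 amines x 4 aryl halides,
# with an arbitrary rule standing in for the experimental outcome.
reactions = [
    {"amine": f"amine_{i}", "aryl_halide": f"aryl_halide_{j}",
     "success": int((i + j) % 3 != 0)}
    for i in range(5)
    for j in range(4)
]

train_set, test_set = split_by_held_out_substrate(reactions, key="amine")
test_labels = [r["success"] for r in test_set]
print("held-out amines:", sorted({r["amine"] for r in test_set}))
print("baseline precision on test set:", baseline_precision(test_labels))
```

Passing `key="aryl_halide"` instead would hold out aryl halides rather than amines. Enrichment over the baseline can then be read as the difference between `precision(...)` for a model's recommended hits and `baseline_precision(...)` on the same test set.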

Keywords

High-Throughput Experimentation

Machine Learning

palladium-catalyzed C–N cross-coupling

aryl halides

secondary amines

Supplementary materials

Title

Supplementary material

Description

Supporting information on experimental and modeling details referred to in the manuscript text

Version History

Sep 19, 2022 Version 1

Metrics

Views: 4,189

Downloads: 2,896

License

The content is available under CC BY-NC-ND 4.0

Funding

Machine Learning for Pharmaceutical Discovery and Synthesis consortium

NIH

R35-GM122483

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
