Abstract
The merger of high-throughput experimentation (HTE) and data science presents an opportunity to both accelerate and inspire innovation in synthetic chemistry. Similarly, developments in machine learning (ML) have enabled the distillation of large, complex data sets into predictive models that generalize patterns in the data. However, efforts to merge HTE with ML remain constrained: the few reported data sets have limited structural diversity, and the models trained on them do not extrapolate well to substrates beyond the training set. Herein, we detail the first ML models for Pd-catalyzed C–N couplings built on large, structurally diverse, pharmaceutically relevant data sets (~5000 unique products) generated using chemistry compatible with the nanomole scale. Careful consideration is given to both the diversity of the data set and the accuracy of model predictions for substrates bearing features beyond those present in the training set. The structural diversity of the data set is enabled by leveraging the Merck & Co., Inc. Building Block Collection, with an initial focus on C–N coupling using secondary amines. The large data set enables the systematic evaluation of model performance using five different data-splitting strategies, each carefully designed to evaluate the model's ability to extrapolate beyond the substrates in the training set. Classification models, built with a lens toward application in medicinal chemistry campaigns, exceeded the baseline precision-recall by 25-67%, depending on the splitting strategy. These results would manifest as significant enrichment of successful C–N couplings when using the hits recommended by the models. In addition, the accuracy of the best model for each of the five splits ranges from 70% to 87%, suggesting excellent overall predictivity even for completely unseen substrates.
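To make the evaluation idea concrete, the following is a minimal sketch of one possible extrapolation split, not the five strategies used in this work (the abstract does not specify them). It assumes each reaction is a record with hypothetical `amine`, `halide`, and `success` fields, holds out entire amines so that no test amine appears in training, and computes the no-skill precision baseline, which for a precision-recall analysis equals the fraction of successful reactions. All names and the toy data are illustrative assumptions.

```python
import random


def split_by_unseen_amine(records, test_fraction=0.2, seed=0):
    """Hold out whole amines so that no test amine appears in training.

    `records` is a list of dicts with a hypothetical "amine" key.
    """
    amines = sorted({r["amine"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(amines)
    n_test = max(1, round(len(amines) * test_fraction))
    test_amines = set(amines[:n_test])
    train = [r for r in records if r["amine"] not in test_amines]
    test = [r for r in records if r["amine"] in test_amines]
    return train, test


def baseline_precision(records):
    """Fraction of successful reactions: the precision of a no-skill model."""
    return sum(r["success"] for r in records) / len(records)


def precision(records, predicted_hits):
    """Fraction of model-recommended reactions that actually succeeded."""
    hits = [r for r, flagged in zip(records, predicted_hits) if flagged]
    return sum(r["success"] for r in hits) / len(hits) if hits else 0.0


# Toy, made-up records purely for illustration (not data from this work).
records = [
    {"amine": f"amine_{i % 5}", "halide": f"halide_{i % 4}", "success": i % 3 == 0}
    for i in range(20)
]
train, test = split_by_unseen_amine(records)
```

Comparing `precision(test, predictions)` against `baseline_precision(test)` gives the enrichment a chemist would see by running only the recommended couplings. Analogous splits could hold out the coupling partner, or both partners at once, to probe progressively harder extrapolation.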
Supplementary materials
Supporting information on experimental and modeling details referred to in the manuscript text