Abstract
Multivariate chemical reaction optimization involving catalytic systems is a non-trivial task due to the large number of tuneable parameters and discrete choices. Closed-loop optimization featuring active Machine Learning (ML) represents a powerful strategy for automating reaction optimization. However, translating chemical reaction conditions into a machine-readable format brings the challenge of finding highly informative features that accurately capture the factors governing reaction success and allow the model to learn efficiently. Herein, we compare the efficacy of different calculated chemical descriptors on a dataset generated by high-throughput experimentation to determine their impact on a supervised ML model predicting reaction yield. We then examine the effect of featurization and of the size of the initial dataset within closed-loop reaction optimization. Finally, we consider the balance between descriptor complexity and dataset size. Ultimately, tailored descriptors did not outperform simple generic representations; however, a larger initial dataset accelerated reaction optimization.
Supplementary materials
Title
The Effect of Chemical Representation on Supervised and Active Machine Learning Towards Yield Prediction
Description
Table of Contents
General Considerations
Analytical Methods
High Throughput Experimentation
Reaction Scheme and Ligand Structures
Synthesis of Materials
Preparation of the Dataset
Generation of Morgan Fingerprints
Density Functional Theory (DFT)-based Geometry Optimization
Sterimol Parameters
Percentage Buried Volume
Natural Bond Orbital (NBO) Analysis
CHarges from ELectrostatic Potentials Using a Grid-Based Method (ChELPG) Analysis
Summary of DFT Descriptor Values
Machine Learning
Linear Model
Random Forest
Gaussian Process
Artificial Neural Network
Adaptive Boosting Model
Support Vector Regression
Leave-one-group-out (LOGO) Cross Validation (CV)
Feature Importance Assessment of the Random Forest
Closed-loop Optimization
Expected Improvement Acquisition Function
De-full Factorization of the Chemical Space Study
Batch-Sequential Active Learning
The Impact of Initialization of the Active Learning
The Impact of Initialization: Dataset Size vs. Complexity of Parameterization
Active Learning Trajectories – Insights