Abstract
Computer-aided synthesis planning is a rapidly growing field that suggests synthetic routes for molecules of interest. The methods used usually depend on large training datasets, but a finite experimental budget limits how much data can be obtained from experiments. Active learning, which recent studies have used successfully, is a strategy for identifying the data points that affect model accuracy the most. However, little has been done to explore the robustness of methods for predicting reaction yield. This study investigates how the choice of machine learning algorithm and the number of initial data points influence reaction yield prediction on two public high-throughput experimentation datasets. Our results show that active learning based on output margin reached a pre-defined accuracy (measured by AUROC) faster than passive learning. Feature importance analysis of the trained machine learning models suggested that active learning had a larger influence on model accuracy when only a few features were important for the model prediction.
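As a rough illustration of the difference between margin-based active learning and passive learning, the sketch below selects the next experiments from a pool of candidates. It rests on assumptions not stated in the abstract: the task is treated as binary classification (for example, high versus low yield), the output margin is taken to be |2p - 1| for a predicted probability p, and passive learning is taken to be uniform random selection. The function names and the probability values are hypothetical and are not taken from the authors' repository.

```python
import random


def margin_query(probabilities, batch_size):
    """Return indices of the pool points with the smallest output margin.

    Assumption: binary classification, where the margin is the absolute
    difference between the two class probabilities, |p - (1 - p)| = |2p - 1|.
    A small margin means the model is uncertain about that point.
    """
    margins = [abs(2.0 * p - 1.0) for p in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: margins[i])
    return ranked[:batch_size]


def passive_query(pool_size, batch_size, seed=0):
    """Return randomly chosen pool indices (assumed passive-learning baseline)."""
    rng = random.Random(seed)
    return rng.sample(range(pool_size), batch_size)


# Hypothetical predicted probabilities of the positive class for six
# unlabeled candidate reactions.
pool_probabilities = [0.95, 0.52, 0.10, 0.45, 0.80, 0.49]

active_choice = margin_query(pool_probabilities, batch_size=3)
passive_choice = passive_query(len(pool_probabilities), batch_size=3)

print("margin-based selection:", active_choice)
print("random selection:", passive_choice)
```

In a full loop, the selected reactions would be labeled, added to the training set, and the model retrained before the next query; the margin-based strategy concentrates the experimental budget on candidates closest to the decision boundary.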
Supplementary weblinks
Code availability: The GitHub repository for the source code used to run the experiments.