Using Active Learning to Develop Machine Learning Models for Reaction Yield Prediction

30 June 2021, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Computer-aided synthesis planning is a rapidly growing field that suggests synthetic routes for molecules of interest. The methods used usually depend on access to large training datasets, but a finite experimental budget limits how much data can be obtained. Active learning, which recent studies have used with success, is a strategy for identifying the data points that affect model accuracy the most. However, little has been done to explore the robustness of these methods when predicting reaction yield. This study investigates the influence of the machine learning algorithm and the number of initial data points on reaction yield prediction for two public high-throughput experimentation datasets. Our results show that active learning based on output margin reached a pre-defined accuracy (AUROC) faster than passive learning. Feature importance analysis of the trained machine learning models suggested that active learning had a larger influence on model accuracy when only a few features were important for the model prediction.
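
As an illustration of the selection step contrasted in the abstract, the following is a minimal sketch of margin-based querying next to a passive (random) baseline. It assumes a binary classifier (for example, high versus low yield) and takes the output margin to be |P(y=1) - P(y=0)|; the abstract does not give the exact definition used in the study, and the function names and example probabilities are hypothetical.

```python
import random


def margin(prob_positive):
    """Assumed output margin for a binary classifier: |P(y=1) - P(y=0)|."""
    return abs(prob_positive - (1.0 - prob_positive))


def select_by_margin(pool_probs, batch_size):
    """Active learning: indices of the pool points with the smallest margin,
    i.e. the predictions the model is least certain about."""
    ranked = sorted(range(len(pool_probs)), key=lambda i: margin(pool_probs[i]))
    return ranked[:batch_size]


def select_at_random(pool_probs, batch_size, seed=0):
    """Passive learning baseline: indices drawn uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(len(pool_probs)), batch_size)


# Hypothetical predicted probabilities for five unlabeled reactions.
pool_probs = [0.95, 0.52, 0.10, 0.45, 0.70]
chosen = select_by_margin(pool_probs, batch_size=2)
```

In a full loop, the selected reactions would be labeled, added to the training set, and the model retrained, repeating until the held-out AUROC reaches the pre-defined target.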
