Completing and balancing database excerpted chemical reactions with a hybrid mechanistic - machine learning approach

20 July 2023, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Developing reaction routes with Computer-Aided Synthesis Planning (CASP) requires an understanding of complete reaction structures. However, most reactions in current databases are missing reaction co-participants. Although reaction prediction and atom-mapping tools can predict major reaction participants and trace atom rearrangements in reactions, they fail to identify the molecules needed to complete reactions. This is because these approaches are data-driven models trained on current reaction databases, which consist largely of incomplete reactions. In this work, a workflow was developed to tackle the reaction completion challenge. It includes a heuristic-based method that identifies the balanced reactions in reaction databases and completes some imbalanced reactions by adding candidate molecules. A machine learning masked language model (MLM) was then trained on the reaction SMILES sentences of these completed reactions. The model predicted missing molecules for the incomplete reactions, a task analogous to predicting missing words in sentences. The model is promising for predicting small and medium-sized missing molecules in incomplete reaction records. The workflow combining the heuristic and machine learning methods completed more than half of the entire reaction space.
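The two ideas in the abstract, checking whether a reaction record is balanced and masking a molecule so a language model can learn to fill it in, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `heavy_atom_counts`, `balance_deficit`, and `mask_molecule`, the `[MASK]` token, and the regex-based atom counting are all assumptions made for illustration. The counter ignores hydrogens and charges and only handles simple SMILES; a real workflow would more likely rely on a cheminformatics toolkit.

```python
import re
from collections import Counter

# Simplified SMILES atom tokenizer (assumption: enough for simple organic molecules).
# Order matters: bracket atoms, then two-letter halogens, then one-letter atoms.
_ATOM_RE = re.compile(r"\[([^\]]+)\]|(Cl|Br)|([BCNOPSFI])|([bcnops])")
# Inside a bracket atom: optional isotope digits, then the element symbol.
_BRACKET_ELEMENT_RE = re.compile(r"^\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms per element in a (possibly multi-molecule) SMILES."""
    counts = Counter()
    for bracket, halogen, upper, aromatic in _ATOM_RE.findall(smiles):
        if bracket:
            match = _BRACKET_ELEMENT_RE.match(bracket)
            if match is None:
                continue  # e.g. a wildcard atom such as [*]
            element = match.group(1).capitalize()
            if element == "H":
                continue  # explicit hydrogens are ignored in this sketch
        elif halogen:
            element = halogen
        elif upper:
            element = upper
        else:
            element = aromatic.upper()
        counts[element] += 1
    return counts


def balance_deficit(reaction_smiles: str):
    """Return (atoms missing from products, atoms missing from reactants).

    Expects 'reactants>agents>products'; agents are not counted.
    Both results are empty when the heavy atoms balance.
    """
    reactants, _agents, products = reaction_smiles.split(">")
    left = heavy_atom_counts(reactants)
    right = heavy_atom_counts(products)
    # Counter subtraction keeps only positive differences.
    return dict(left - right), dict(right - left)


def mask_molecule(reaction_smiles: str, side: str, index: int,
                  mask_token: str = "[MASK]"):
    """Replace one molecule with a mask token; return (masked reaction, target)."""
    reactants, agents, products = reaction_smiles.split(">")
    parts = {"reactants": reactants.split("."), "products": products.split(".")}
    target = parts[side][index]
    parts[side][index] = mask_token
    masked = ">".join(
        [".".join(parts["reactants"]), agents, ".".join(parts["products"])]
    )
    return masked, target


# Esterification recorded without its water by-product.
incomplete = "CC(=O)O.CO>>CC(=O)OC"
print(balance_deficit(incomplete))   # one oxygen is unaccounted for in the products

# The completed record, turned into a masked training example.
completed = "CC(=O)O.CO>>CC(=O)OC.O"
print(mask_molecule(completed, "products", 1))
```

In this example the deficit of one oxygen atom on the product side points to water as a candidate molecule to add, and the masked form of the completed record is the kind of sentence-with-a-gap that a masked language model would be asked to fill.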

Keywords

chemoinformatics
reaction informatics
organic synthesis
machine learning
