Exploring BERT for Reaction Yield Prediction: Evaluating the Impact of Tokenization, Molecular Representation, and Pre-training Data Augmentation

19 February 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Predicting reaction yields in synthetic chemistry remains a significant challenge. This study systematically evaluates the impact of tokenization, molecular representation, pre-training data, and adversarial training on a BERT-based model for yield prediction of Buchwald-Hartwig and Suzuki-Miyaura coupling reactions, using publicly available high-throughput experimentation (HTE) datasets. We demonstrate that the choice of molecular representation (SMILES, DeepSMILES, SELFIES, Morgan fingerprint-based notation, IUPAC names) has minimal impact on model performance, whereas BPE and SentencePiece tokenization typically outperform other methods. WordPiece is strongly discouraged for SELFIES and fingerprint-based notation. Furthermore, pre-training with relatively small datasets (<100K reactions) achieves performance comparable to that obtained with larger datasets containing millions of examples. We propose the use of artificially generated, domain-specific pre-training data. The artificially generated sets prove to be a good surrogate for reaction schemes extracted from reaction databases such as Pistachio or Reaxys. The best performance was observed for hybrid pre-training sets combining real data with domain-specific artificial data. Finally, we show that a novel adversarial training approach, which perturbs input embeddings dynamically, improves model robustness and generalisability for yield and reaction success prediction. These findings provide valuable insights for developing robust and practical machine learning models for yield prediction in synthetic chemistry. GSK’s BERT training code base is made available to the community with this work.
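To make concrete what "tokenization" of a reaction string means, the sketch below splits a reaction SMILES into atom-level tokens with a fixed regular expression. This is only an illustration under stated assumptions: the pattern is one commonly used for SMILES in reaction modelling and is not claimed to be the tokenizer evaluated in this work, the function name `tokenize_smiles` is hypothetical, and the example reaction is invented for demonstration. BPE, SentencePiece, and WordPiece differ from this rule-based split in that they learn their vocabularies from a corpus.

```python
import re
from typing import List

# Atom-level SMILES pattern (an assumption for illustration, not the
# tokenizer evaluated in the preprint). Bracket atoms such as [Pd] and
# two-letter halogens (Br, Cl) are kept as single tokens.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a (reaction) SMILES string into atom-level tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Guard against silently dropping characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognised characters in SMILES: {smiles!r}")
    return tokens


# Hypothetical Buchwald-Hartwig-style example:
# bromobenzene + morpholine >> N-phenylmorpholine
reaction = "Brc1ccccc1.C1COCCN1>>c1ccc(N2CCOCC2)cc1"
tokens = tokenize_smiles(reaction)
print(tokens)
```

A learned subword tokenizer would typically merge frequent runs of such tokens (for example an aromatic ring fragment) into single vocabulary entries, which is where the methods compared in the abstract diverge.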

Keywords

Predictive Synthesis
Machine Learning
BERT
Molecular Representation
Buchwald-Hartwig Reaction
Suzuki-Miyaura Reaction
Adversarial Training
Data Augmentation
Artificial Data

Supplementary materials

Title: Supporting Information for the Main Manuscript
Description: File containing supporting results presented in tables and figures.

Title: Artificially Generated Reaction Schemes
Description: File containing artificially generated Suzuki-Miyaura and Buchwald-Hartwig reaction schemes. Training and validation data are provided.
