Abstract
Predicting reaction yields in synthetic chemistry remains a significant challenge. This study systematically evaluates the impact of tokenization, molecular representation, pre-training data, and adversarial training on a BERT-based model for yield prediction of Buchwald-Hartwig and Suzuki-Miyaura coupling reactions, using publicly available high-throughput experimentation (HTE) data sets. We demonstrate that the choice of molecular representation (SMILES, DeepSMILES, SELFIES, Morgan fingerprint-based notation, IUPAC names) has minimal impact on model performance, whereas BPE and SentencePiece tokenization typically outperform other methods. WordPiece is strongly discouraged for SELFIES and fingerprint-based notation. Furthermore, pre-training on relatively small data sets (<100K reactions) achieves performance comparable to pre-training on larger data sets containing millions of examples. We propose the use of artificially generated, domain-specific pre-training data. The artificially generated sets prove to be a good surrogate for reaction schemes extracted from reaction data sets such as Pistachio or Reaxys. The best performance was observed for hybrid pre-training sets that combine real data with domain-specific artificial data. Finally, we show that a novel adversarial training approach, which dynamically perturbs input embeddings, improves model robustness and generalisability for yield and reaction success prediction. These findings provide valuable insights for developing robust and practical machine learning models for yield prediction in synthetic chemistry. GSK’s BERT training code base is made available to the community with this work.
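For readers unfamiliar with BPE, the following is a minimal sketch of how byte-pair encoding learns subword tokens from SMILES strings. It is a generic illustration and not the tokenizer implementation used in this work: the function names (`learn_bpe`, `merge_pair`, `tokenize`), the three-molecule corpus, and the number of merges are all assumptions chosen for brevity, and starting from single characters is a simplification.

```python
from collections import Counter


def merge_pair(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings."""
    # Start from character-level tokens (a simplifying assumption).
    sequences = [list(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        sequences = [merge_pair(seq, best) for seq in sequences]
    return merges


def tokenize(text, merges):
    """Apply the learned merges, in order, to a new string."""
    seq = list(text)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq


# Hypothetical toy corpus: bromobenzene, phenylboronic acid, aniline.
corpus = ["Brc1ccccc1", "OB(O)c1ccccc1", "Nc1ccccc1"]
merges = learn_bpe(corpus, num_merges=5)
tokens = tokenize("Brc1ccccc1", merges)
```

Concatenating the tokens always reproduces the input string, so the encoding is lossless; frequent substructures such as aromatic ring fragments end up as single tokens once enough merges have been learned.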
Supplementary materials
Title
Supporting Information for the Main Manuscript
Description
File containing supporting results presented in tables and figures.
Title
Artificially Generated Reaction Schemes
Description
File containing artificially generated Suzuki-Miyaura and Buchwald-Hartwig reaction schemes. Training and validation data are provided.
Supplementary weblinks
Title
SynthCoder GitHub Repository
Description
GitHub repository for GSK's SynthCoder code base.