Improve retrosynthesis planning with a molecular editing language

26 December 2023, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Retrosynthetic analysis is a fundamental strategy in organic synthesis, and many computational methods have been developed to address this task. A widely adopted approach treats retrosynthetic prediction as a sequence-to-sequence (seq2seq) translation task, in which the Simplified Molecular Input Line Entry System (SMILES) string of a product is translated into the SMILES strings of its corresponding reactants. However, sequence-based models that use SMILES face several issues, including limited performance and a lack of interpretability and controllability. In this work, we introduce a novel chemical language for retrosynthetic prediction named E-SMILES, an extension of SMILES designed specifically for seq2seq retrosynthetic prediction. This language not only documents the static molecular structure but also encodes the editing operations applied to the molecule during retrosynthesis, enabling it to characterize retrosynthesis reactions more effectively. With E-SMILES, seq2seq retrosynthetic models can simulate the stepwise retrosynthetic analysis strategy of chemists, ensuring that atoms in the predicted reactants match those in the product and yielding more interpretable and controllable predictions. Furthermore, E-SMILES is naturally aligned with the product's SMILES, which reduces the edit distance between the model's input and output sequences. This liberates the model from learning the complex SMILES syntax and allows it to focus on the retrosynthesis process itself. Leveraging E-SMILES, our retrosynthesis model achieves top-1 accuracies of 58.9% and 68.5% on the USPTO-50k dataset, without and with a given reaction class, respectively, significantly surpassing previous state-of-the-art results. We envisage that E-SMILES can serve as a new foundational tool, promoting the development of sequence-based retrosynthetic prediction methods.
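The alignment argument in the abstract can be illustrated with a minimal sketch. It does not use E-SMILES, whose syntax is not given on this page; it only compares plain SMILES strings. The example reaction (acetanilide from acetyl chloride and aniline) and the function name levenshtein are assumptions chosen for illustration. The sketch shows that writing the reactants in an atom order that follows the product gives a much smaller character-level edit distance than an equally valid reordered writing.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


# Illustrative molecules (an assumption, not taken from the preprint).
product = "CC(=O)Nc1ccccc1"               # acetanilide
reactants_aligned = "CC(=O)Cl.Nc1ccccc1"  # atom order follows the product
reactants_reordered = "Nc1ccccc1.CC(=O)Cl"  # same molecules, other order

aligned_distance = levenshtein(product, reactants_aligned)
reordered_distance = levenshtein(product, reactants_reordered)

print("aligned:", aligned_distance)
print("reordered:", reordered_distance)
```

In the aligned writing the product string survives intact and only the three characters "Cl." are inserted, so a model mostly copies its input. In the reordered writing the same chemistry requires many more character edits, which is the overhead an aligned target language aims to avoid.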

Keywords

Retrosynthesis planning
Chemical language
Deep learning
Reaction retrieval
Prompt learning
Model interpretability

Supplementary materials

Title: Supplementary Information
Description: Supplementary Information for Improve retrosynthesis planning with a molecular editing language
