UnCorrupt SMILES: a novel approach to de novo design

14 November 2022, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Generative deep learning models have emerged as a powerful approach for de novo drug design, as they aid researchers in finding new molecules with desired properties. Despite continuous improvements in the field, a subset of the outputs that sequence-based de novo generators produce cannot be progressed due to errors. Here, we propose to fix these invalid outputs post hoc. Transformer models from the field of natural language processing have been shown to be very effective in similar tasks. Therefore, a transformer model was trained to translate invalid Simplified Molecular-Input Line-Entry System (SMILES) strings into valid representations. The performance of this SMILES corrector was evaluated on four representative methods of de novo generation: a recurrent neural network (RNN), a target-directed RNN, a generative adversarial network (GAN), and a variational autoencoder (VAE). This study found that the percentage of invalid outputs from these specific generative models ranges between 4 and 89 %, with different models having different error type distributions. Post hoc correction of SMILES increases model validity, with the SMILES corrector fixing 35 to 80 % of invalid model outputs. While corrector models trained with one error per input sequence alter 60 to 90 % of invalid inputs, higher performance was obtained for transformer models trained with multiple errors per input. In this case, the best model was able to correct 60 to 95 % of invalid generator outputs. Further analysis showed that these fixed molecules are comparable to the correct molecules from the de novo generators with regard to novelty and similarity. Additionally, the SMILES corrector can be used to expand the number of interesting new molecules within the targeted chemical space. Introducing different errors into existing molecules yields novel analogs with a uniqueness of 39 % and a novelty of approximately 20 %. The results of this research demonstrate that SMILES correction is a viable post hoc extension and can enhance the search for better drug candidates.
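As an illustration of how training pairs with one or several errors per input could be built, the following is a minimal sketch and not the authors' implementation. The abstract does not list the error types used, so deleting a parenthesis or a ring-closure digit is an assumption made here, and the names `has_syntax_error`, `corrupt`, and `make_pairs` are hypothetical. The check is purely syntactic; a real pipeline would more likely rely on a cheminformatics toolkit to decide validity.

```python
import random


def has_syntax_error(smiles: str) -> bool:
    """Cheap syntactic check: unbalanced parentheses or unpaired ring digits.

    Assumption: this is only a rough proxy for SMILES validity. It ignores
    valence and aromaticity rules and does not treat '%nn' closures specially.
    """
    depth = 0
    open_rings = set()
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            # Digits inside [...] are isotopes, H counts, or charges.
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return True
        elif ch.isdigit():
            # A ring-closure digit opens a ring the first time, closes it the next.
            open_rings ^= {ch}
    return depth != 0 or bool(open_rings)


def corrupt(smiles: str, n_errors: int, rng: random.Random) -> str:
    """Delete n_errors randomly chosen parentheses or digits (hypothetical error type)."""
    chars = list(smiles)
    for _ in range(n_errors):
        candidates = [i for i, c in enumerate(chars) if c in "()" or c.isdigit()]
        if not candidates:
            break
        del chars[rng.choice(candidates)]
    return "".join(chars)


def make_pairs(valid_smiles, n_errors=1, seed=0):
    """Return (corrupted, original) pairs, i.e. source and target for a translation model."""
    rng = random.Random(seed)
    return [(corrupt(s, n_errors, rng), s) for s in valid_smiles]


if __name__ == "__main__":
    for bad, good in make_pairs(["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1"], n_errors=1):
        print(bad, "->", good, "| syntax error:", has_syntax_error(bad))
```

With `n_errors` greater than 1, two deletions can occasionally cancel out (for example, removing both digits of one ring closure), so a pair builder along these lines would need to re-check each corrupted string before using it as an invalid input.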

Keywords

SMILES correction
invalid SMILES
molecular transformer
de novo drug design
analog generation

Supplementary materials

Title: Supporting information: UnCorrupt SMILES: a novel approach to de novo design
Description: Supporting information relating to working paper UnCorrupt SMILES: a novel approach to de novo design
