Reconstruction of lossless molecular representations, SMILES and SELFIES, from fingerprints

09 June 2022, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

SMILES is the dominant molecular representation in AI-based chemical applications, but it has innate limitations associated with its internal structure. Here, we exploit the idea that a set of structural fingerprints can serve as an efficient alternative to unique molecular representations. For this purpose, we trained neural machine translation (NMT) based models that translate a set of various structural fingerprints into conventional text-based molecular representations, i.e., SMILES and SELFIES. An assessment of their conversion efficiency showed that our models successfully reconstructed molecules with a high level of accuracy. Our approach therefore brings structural fingerprints into play as strong representational tools in chemical natural language processing (NLP) applications by restoring the connectivity information that is lost during fingerprint transformation. This comprehensive study addresses the major limitation of structural fingerprints, which has precluded their use in NLP models. Our findings could facilitate the development of text-based or fingerprint-based chemoinformatic models for generative and translational tasks.
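To make the translation framing concrete, the following minimal sketch shows one way a fingerprint and a SMILES string could be turned into a source/target token pair for a sequence-to-sequence model. It is illustrative only: encoding the fingerprint as its on-bit indices, the regular-expression SMILES tokenizer, and the function names `fingerprint_to_tokens`, `tokenize_smiles`, and `make_pair` are all assumptions and are not taken from the abstract. The bit vector in the usage lines is likewise an invented toy input, not a real fingerprint.

```python
import re

# Hypothetical SMILES tokenizer (an assumption, not the authors' scheme):
# bracket atoms, two-letter halogens, and two-digit ring closures are kept whole.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~*$@])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into target-side tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Refuse inputs the pattern cannot cover completely.
    if "".join(tokens) != smiles:
        raise ValueError("untokenizable SMILES: " + smiles)
    return tokens


def fingerprint_to_tokens(bits):
    """Encode a binary fingerprint as the indices of its set bits (source side)."""
    return [str(i) for i, bit in enumerate(bits) if bit]


def make_pair(bits, smiles):
    """Build one (source, target) training pair for a translation model."""
    return fingerprint_to_tokens(bits), tokenize_smiles(smiles)


# Toy input: an 8-bit vector standing in for a real structural fingerprint.
source, target = make_pair([0, 1, 0, 0, 1, 1, 0, 0], "CC(=O)Cl")
```

Representing the fingerprint by its set-bit indices keeps the source sequence short for sparse fingerprints; other encodings are equally possible under this framing.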

Keywords

Fingerprints
SMILES
SELFIES
NMT
