Atom-in-SMILES tokenization

31 October 2022, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Tokenization is an important preprocessing step in natural language processing that can have a significant influence on prediction quality. In this study, we show that conventional SMILES tokenization is itself a source of error, producing tokens that fail to reflect the true nature of molecules. To address this, we propose the atom-in-SMILES approach, which resolves the ambiguities arising from the genericness of SMILES tokens. Our findings across multiple translation tasks suggest that proper tokenization has a great impact on prediction quality. Judging by prediction accuracy and token-degeneration comparisons, atom-in-SMILES proves more effective than other tokenization schemes at drawing high-quality SMILES sequences out of AI-based chemical models. We investigate token degeneration, highlight its pernicious influence on prediction quality, quantify token-level repetitions, and include generated examples for qualitative analysis. We believe that atom-in-SMILES tokenization can readily be adopted by the community at large, providing chemically accurate, tailor-made tokens for molecular prediction models.
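To make the contrast concrete, the sketch below shows the conventional atom-level SMILES tokenization that the abstract criticizes, using the regular-expression pattern widely adopted in chemical language models. The environment-aware token shown in the final comment is a hypothetical illustration of the atom-in-SMILES idea, not the paper's exact token specification.

```python
import re

# Standard atom-level SMILES tokenization regex, widely used in
# SMILES-based sequence models. Each match is one token.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into conventional, generic tokens."""
    tokens = SMILES_REGEX.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles, "tokenizer lost characters"
    return tokens

aspirin = "CC(=O)Oc1ccccc1C(=O)O"
print(tokenize_smiles(aspirin))
# Note that every aliphatic carbon above becomes the same generic token
# 'C' (and every aromatic carbon the token 'c'), regardless of its
# chemical environment. An atom-in-SMILES tokenizer would instead emit
# environment-aware atom tokens, e.g. something like '[CH3;!R;C]' for a
# non-ring methyl carbon bonded to one carbon (hypothetical format), so
# that chemically distinct atoms receive distinct tokens.
```

This is the genericness the abstract refers to: the conventional scheme maps many chemically different atoms onto a handful of identical tokens, which the atom-in-SMILES scheme is designed to disambiguate.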

Keywords

atom-in-molecules
tokenization
SMILES
atom-in-SMILES

Supplementary materials

Title: Supplementary material
Description: Training details and pseudo-code
