Abstract
Tokenization is an important preprocessing step in natural language processing that can have a significant influence on prediction quality. In this study, we show that conventional SMILES tokenization is itself a source of error, producing tokens that fail to reflect the true nature of molecules.
To address this, we propose the atom-in-SMILES tokenization approach, which resolves the ambiguity arising from the genericness of SMILES tokens. Our findings on multiple translation tasks suggest that proper tokenization has a substantial impact on prediction quality. Based on comparisons of prediction accuracy and token degeneration, atom-in-SMILES is more effective than other tokenization schemes at drawing high-quality SMILES sequences out of AI-based chemical models. We investigate token degeneration, highlight its pernicious influence on prediction quality, quantify token-level repetition, and include generated examples for qualitative analysis. We believe that atom-in-SMILES tokenization can be readily adopted by the community at large, providing chemically accurate, tailor-made tokens for molecular prediction models.
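To illustrate the ambiguity referred to above, the sketch below contrasts conventional regex-based SMILES tokenization, in which every aliphatic carbon collapses to the same generic "C" token, with an illustrative atom-level token that encodes local chemical context. This is a minimal sketch, not the paper's implementation: the exact atom-in-SMILES token format is given in the paper's supplementary pseudo-code, while the context fields used here (aromaticity, ring membership, neighbor symbols), the helper names tokenize_smiles and atom_context_token, and the availability of RDKit are assumptions for illustration.

```python
# Minimal sketch (assumptions noted above): conventional SMILES tokenization
# versus an illustrative context-aware atom token. Requires RDKit.
import re
from rdkit import Chem

# Widely used regex for SMILES tokenization (bracket atoms, two-letter
# elements, bonds, branches, ring-closure digits).
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Conventional tokenization: every aliphatic carbon becomes the same generic 'C' token."""
    return SMILES_TOKEN_PATTERN.findall(smiles)

def atom_context_token(atom: Chem.Atom) -> str:
    """Illustrative atom-in-SMILES-style token: symbol + ring flag + neighbor symbols.
    The field layout is an assumption, not the paper's exact format."""
    symbol = atom.GetSymbol().lower() if atom.GetIsAromatic() else atom.GetSymbol()
    ring_flag = "R" if atom.IsInRing() else "!R"
    neighbors = "".join(sorted(nbr.GetSymbol() for nbr in atom.GetNeighbors()))
    return f"[{symbol};{ring_flag};{neighbors}]"

if __name__ == "__main__":
    smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
    print(tokenize_smiles(smiles))      # methyl and carboxyl carbons are both just 'C'
    mol = Chem.MolFromSmiles(smiles)
    print([atom_context_token(a) for a in mol.GetAtoms()])  # local context disambiguates them
```

In a full tokenizer, the non-atom tokens (bonds, branches, ring-closure digits) would be kept as in the conventional scheme, with only the atomic tokens enriched by their chemical environment.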
Supplementary materials
Supplementary material: Details of trainings and pseudo-code