Abstract
Molecular generation models, especially chemical language models (CLMs) utilizing SMILES, a string representation of compounds, face limitations in handling large and complex compounds while maintaining structural accuracy. To address these challenges, we propose FRATTVAE, a Transformer-based variational autoencoder that treats molecules as tree structures with fragments as nodes. By employing several innovative deep learning techniques, including ECFP (Extended-Connectivity Fingerprint) based token embeddings and the Transformer’s self-attention mechanism, FRATTVAE efficiently handles large-scale compounds, improving both computational speed and generation accuracy. Evaluations across benchmark datasets, ranging from small molecules to natural compounds, demonstrate that FRATTVAE consistently outperforms existing models, achieving superior reconstruction accuracy and generation quality. Additionally, in molecular optimization tasks, FRATTVAE generated stable, high-quality molecules with desired properties while avoiding structural alerts. These results highlight FRATTVAE as a robust and versatile solution for molecular generation and optimization, well suited to a variety of applications in cheminformatics and drug discovery.
Supplementary Materials
Supplementary Methods, Supplementary Tables, and Supplementary Figures.