Abstract
Chemical language models (CLMs) are molecular generation models that leverage large language model architectures by representing compounds as SMILES strings. Chemical variational auto-encoders (VAEs), which explicitly construct a latent space, are particularly strong at molecular optimization and generation in a continuous space. We propose the Fragment Tree-Transformer based VAE (FRATTVAE), which treats a molecule as a tree structure whose nodes are fragments. This representation admits fragment tokens that SMILES cannot manage and effectively handles large molecules, including salts and solvents. By embedding fragment tokens with Extended Connectivity Fingerprints (ECFP), applying tree positional encoding, and adopting the Transformer architecture, FRATTVAE achieves superior molecule-generation accuracy and computational speed. In distribution-learning experiments on benchmark datasets ranging from small molecules to natural compounds, FRATTVAE consistently achieved high scores on all metrics while balancing reconstruction accuracy and generation quality. In molecular optimization tasks, it generated high-quality, stable molecules with the desired properties while avoiding structural alerts.
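To make the fragment-embedding step concrete, the following is a minimal sketch, not the authors' implementation: each fragment node is encoded by its ECFP bit vector, which a learned linear projection maps into the Transformer's token-embedding space. The fingerprint radius, bit length, embedding width, and the helper ecfp_embedding are illustrative assumptions; FRATTVAE's actual hyperparameters may differ.

```python
import numpy as np
import torch
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp_embedding(fragment_smiles: str, proj: torch.nn.Linear,
                   radius: int = 2, n_bits: int = 2048) -> torch.Tensor:
    """Embed one fragment node: SMILES -> ECFP bit vector -> learned projection."""
    mol = Chem.MolFromSmiles(fragment_smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.float32)
    DataStructs.ConvertToNumpyArray(fp, arr)   # copy the bits into a NumPy array
    return proj(torch.from_numpy(arr))         # shape: (d_model,)

d_model = 256                           # assumed embedding width
proj = torch.nn.Linear(2048, d_model)   # learned projection (illustrative)
emb = ecfp_embedding("c1ccccc1", proj)  # a benzene-ring fragment
print(emb.shape)                        # torch.Size([256])
```

Because ECFP is a fixed, structure-derived encoding, fragments never seen during training still receive meaningful embeddings, which is one motivation for this design over a learned lookup table per fragment token.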
Supplementary materials
Supplemental Methods, Supplemental Tables, and Supplemental Figures.