Abstract
Designing innovative molecular structures tailored to specific protein targets is a fundamental challenge in drug discovery. Existing approaches based on graph neural networks for generating three-dimensional (3D) molecules within protein pockets often produce molecules with invalid configurations, suboptimal drug-likeness, and limited synthesizability, while also requiring long generation times. To address these challenges, we present 3DSMILES-GPT, a fully language-model-driven framework for 3D molecular generation. First, leveraging the architecture of large language models, we treat both two-dimensional (2D) and 3D molecular representations as linguistic expressions and pre-train the model on an extensive dataset, enabling it to learn the 2D and 3D characteristics of molecules at scale. We then fine-tune the model on protein pocket-molecule pair data and apply reinforcement learning to further optimize the biophysical and chemical properties of the generated molecules. Experimental results demonstrate that, compared with existing methods, 3DSMILES-GPT generates molecules with superior metrics such as the Vina docking score, and it achieves a 33% improvement in the quantitative estimate of drug-likeness (QED) over current models. Generation speed is also substantially improved, with a threefold speedup over the fastest existing method. These results highlight the potential of 3DSMILES-GPT to improve both the efficacy and efficiency of drug-like molecule generation in the drug discovery process.
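To make the idea of "2D and 3D molecular representations as linguistic expressions" concrete, the sketch below shows one plausible way to serialize a molecule's SMILES string together with its atomic coordinates into a single token sequence. This is a minimal illustration assuming RDKit is available; the actual tokenization scheme, coordinate precision, and vocabulary used by 3DSMILES-GPT are not specified in the abstract.

```python
# Illustrative sketch only: one possible 2D+3D serialization of a molecule
# into a single "sentence" for a language model. Not the paper's exact format.
from rdkit import Chem
from rdkit.Chem import AllChem


def molecule_to_3d_string(smiles: str, decimals: int = 2) -> str:
    """Serialize a molecule as its SMILES followed by per-atom 3D coordinates."""
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.AddHs(mol)
    AllChem.EmbedMolecule(mol, randomSeed=0)   # generate a 3D conformer
    AllChem.MMFFOptimizeMolecule(mol)          # quick geometry clean-up
    mol = Chem.RemoveHs(mol)
    conf = mol.GetConformer()
    coord_tokens = []
    for i in range(mol.GetNumAtoms()):
        pos = conf.GetAtomPosition(i)
        coord_tokens.append(
            f"{pos.x:.{decimals}f} {pos.y:.{decimals}f} {pos.z:.{decimals}f}"
        )
    # One "sentence": 2D topology (SMILES) + 3D geometry (flattened coordinates)
    return Chem.MolToSmiles(mol) + " | " + " ".join(coord_tokens)


print(molecule_to_3d_string("CCO"))  # ethanol -> "CCO | x1 y1 z1 x2 y2 z2 ..."
```

Under this kind of scheme, pre-training reduces to standard next-token prediction over such sequences, and pocket-conditioned generation amounts to prepending a pocket representation as the prompt.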