3D Molecular Pocket-based Generation with Token-only Large Language Model

19 August 2024, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Designing novel molecular structures tailored to specific protein targets is a fundamental challenge in drug discovery. Most existing approaches that use graph neural networks to generate three-dimensional (3D) molecules within protein pockets produce molecules with invalid configurations, suboptimal drug-like qualities and limited synthesizability, and they also require long generation times. To address these challenges, we present 3DSMILES-GPT, a fully language-model-driven framework for 3D molecular generation. First, adopting the architecture of large language models, we treat both two-dimensional (2D) and 3D molecular representations as linguistic expressions and pre-train the model on an extensive dataset. This enables the model to learn the 2D and 3D characteristics of molecules at large scale. Subsequently, we fine-tune the model on paired protein pocket and molecule data, followed by reinforcement learning to further optimize the biophysical and chemical properties of the generated molecules. Experimental results demonstrate that, compared to existing methods, 3DSMILES-GPT generates molecules that score better on metrics such as the Vina docking score. Notably, it achieves a 33% improvement in the quantitative estimate of drug-likeness (QED) compared to current models. Additionally, generation is significantly faster, with a threefold speedup over the fastest existing methods. These results highlight the potential of 3DSMILES-GPT to improve the generation of drug-like molecules, offering gains in both efficacy and efficiency in the drug discovery process.
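To make the idea of treating 2D and 3D molecular representations as "linguistic expressions" concrete, the following is a minimal Python sketch of how a molecule might be flattened into a single token sequence. The abstract does not specify the tokenization used by 3DSMILES-GPT, so everything here is an assumption made for illustration: the regular expression, the `<2D>`, `<3D>` and `<EOS>` marker tokens, the one-decimal coordinate rounding, and the function names `tokenize_smiles`, `coord_tokens` and `to_sequence` are all hypothetical.

```python
import re

# Hypothetical scheme for illustration only; the abstract does not describe
# the actual encoding used by 3DSMILES-GPT.
# Bracket atoms first, then two-letter halogens, ring-closure %NN, single
# letters, digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+()/\\.@:~*$]"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, failing on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def coord_tokens(coords, decimals=1):
    """Turn (x, y, z) tuples into fixed-precision text tokens, atom by atom."""
    out = []
    for x, y, z in coords:
        out.extend(f"{v:.{decimals}f}" for v in (x, y, z))
    return out


def to_sequence(smiles, coords):
    """Concatenate the 2D (SMILES) and 3D (coordinate) parts into one sequence."""
    return (
        ["<2D>"] + tokenize_smiles(smiles)
        + ["<3D>"] + coord_tokens(coords)
        + ["<EOS>"]
    )


if __name__ == "__main__":
    # Toy methanol heavy atoms with made-up coordinates in angstroms.
    print(to_sequence("CO", [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0)]))
```

A sequence of this kind can be fed to a standard autoregressive language model without any graph-specific machinery, which is the property the abstract emphasizes. A real system would also need to fix the atom ordering between the SMILES and coordinate parts and choose a coordinate frame relative to the protein pocket, neither of which this sketch attempts.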
