Abstract
The design of small molecules is crucial for technological applications ranging from drug discovery to energy storage. Due to the vast design space available to modern synthetic chemistry, the community has increasingly sought to use data-driven and machine learning approaches to navigate this space. Although generative machine learning methods have recently shown potential for computational molecular design, their use is hindered by complex training procedures, and they often fail to generate valid and unique molecules. In this context, pre-trained Large Language Models (LLMs) have emerged as potential tools for molecular design, as they appear to be capable of creating and modifying molecules based on simple instructions provided through natural language prompts. In this work, we show that the Claude 3 Opus LLM can read, write, and modify molecules according to prompts, with 97% of the generated molecules being valid and unique. By quantifying these modifications in a low-dimensional latent space, we systematically evaluate the model's behavior under different prompting conditions. Notably, the model is able to perform guided molecular generation when asked to manipulate the electronic structure of molecules using simple, natural-language prompts. Our findings highlight the potential of LLMs as powerful and versatile molecular design engines.
Supplementary materials
Title
Supplementary Information (SI) for “Large Language Models as Molecular Design Engines”
Description
Supplementary Information (SI) for the paper, containing additional recorded metrics and supporting figures that accompany the primary manuscript.
The following contents are available in the Supporting Information:
(1) Distribution of API call times for molecular generation with the Claude API across prompts A-H. (2) Median Tanimoto similarity between parent and generated molecules, calculated across prompts A-H for the 64 parent SMILES structures. (3) Diversity of the generated molecules across prompts A-H, compared to the baseline diversity of the parent SMILES set. (4) Distribution of Synthetic Accessibility scores (SA Scores) for generated molecules across prompts A-H. (5) Tanimoto similarity (parent-generated) for all prompts (A-P).
Supplementary weblinks
Title
Dataset for "Large Language Models as Molecular Design Engines"
Description
Comprises the data (claude-gpt-paper.zip), code (claude-gpt-paper-codes.zip), and a molecular viewer app (llm-visulizer-dashapp.zip) for viewing the molecules generated by the Large Language Model.
Title
GitHub Repository for the codes used in the paper "Large Language Models as Molecular Design Engines"
Description
Contains a simple Jupyter notebook (GPT_modification_just_plots.ipynb) that can be run on Google Colab to generate the figures shown in the paper. Instructions for running the notebook are provided in its markdown cells and in the README file of the GitHub repository.