Fine-tuning ChatGPT Achieves State-of-the-Art Performance for Chemical Text Mining

16 November 2023, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Extracting knowledge from complex and diverse chemical texts is a pivotal task for both experimental and computational chemists. The task is still considered to be extremely challenging due to the complexity of the chemical language and scientific literature. This study fine-tuned ChatGPT for five intricate chemical text mining tasks: compound entity recognition, reaction role labelling, metal-organic framework (MOF) synthesis information extraction, nuclear magnetic resonance spectroscopy (NMR) data extraction, and the conversion of reaction paragraph to action sequence. The fine-tuned ChatGPT demonstrated impressive performance, significantly reducing the need for repetitive and extensive prompt engineering experiments. It achieved exact accuracy levels ranging from 69% to 95% on these tasks with minimal annotated data. For comparison, we fine-tuned open-source pre-trained large language models (LLMs) such as Llama2, T5, and BART. The results showed that the fine-tuned ChatGPT excelled in all tasks. It even outperformed those task-adaptive pre-training and fine-tuning models that were based on a significantly larger amount of in-domain data. Given its versatility, robustness, and low-code capability, leveraging fine-tuned LLMs as toolkits for automated data acquisition could revolutionize chemical knowledge extraction.

Keywords

Chemical Text Mining
Large Language Models
ChatGPT
Fine-tune
Few-data
Knowledge Extraction
Cheminformatics
synthesis
chemical synthesis
llama
language model
MOF
NMR
reaction role
chemical procedure
paragraph
LLMs
structured data

Comments

Comments are not moderated before they are posted, but they can be removed by the site moderators if they are found to be in contravention of our Commenting Policy [opens in a new tab] - please read this policy before you post. Comments should be used for scholarly discussion of the content in question. You can find more information about how to use the commenting feature here [opens in a new tab] .
This site is protected by reCAPTCHA and the Google Privacy Policy [opens in a new tab] and Terms of Service [opens in a new tab] apply.