Fine-tuning Large Language Models for Chemical Text Mining

Wei Zhang; Qinggong Wang; Xiangtai Kong; Jiacheng Xiong; Shengkun Ni; Duanhua Cao; Buying Niu; Mingan Chen; Runze Zhang; Yitian Wang; Lehan Zhang; Xutong Li; Zhaoping Xiong; Qian Shi; Ziming Huang; Zunyun Fu; Mingyue Zheng

doi:10.26434/chemrxiv-2023-k7ct5-v2

Organic Chemistry

01 February 2024, Version 2

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Extracting knowledge from complex and diverse chemical texts is a pivotal task for both experimental and computational chemists. The task remains extremely challenging because of the complexity of chemical language and the scientific literature. This study explored the power of fine-tuned large language models (LLMs) on five intricate chemical text mining tasks: compound entity recognition, reaction role labelling, metal-organic framework (MOF) synthesis information extraction, nuclear magnetic resonance spectroscopy (NMR) data extraction, and the conversion of reaction paragraphs to action sequences. The fine-tuned LLMs demonstrated impressive performance, significantly reducing the need for repetitive and extensive prompt engineering experiments. For comparison, we guided GPT-3.5 and GPT-4 with prompt engineering and fine-tuned GPT-3.5 as well as open-source LLMs such as Llama2, T5, and BART. The results showed that the fine-tuned GPT models excelled in all tasks, achieving exact accuracy ranging from 69% to 95% with minimal annotated data. They even outperformed models built with task-adaptive pre-training and fine-tuning on a significantly larger amount of in-domain data. Given their versatility, robustness, and low-code capability, fine-tuned LLMs could serve as flexible and effective toolkits for automated data acquisition and could revolutionize chemical knowledge extraction.

Keywords

Chemical Text Mining

Large Language Models

Version History

Feb 01, 2024 Version 2

Nov 16, 2023 Version 1

Version Notes

Updated the performance of GPT-4 with prompt engineering. Descriptions and methods are clearer.

Metrics

Views: 3,261

Downloads: 1,861

License

The content is available under CC BY-NC-ND 4.0

DOI

10.26434/chemrxiv-2023-k7ct5-v2

Funding

National Natural Science Foundation of China: T2225002

National Natural Science Foundation of China: 82273855

National Natural Science Foundation of China: 82204278

National Key Research and Development Program of China: 2022YFC3400504

SIMM-SHUTCM Traditional Chinese Medicine Innovation Joint Research Program: E2G805H

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
