HSQCid: A Powerful Tool for Paving the Way to High-throughput Structural Dereplication of Natural Products Based on Fast NMR Experiments

17 June 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Structural dereplication is an essential step in the study of natural products (NPs). The number of known NPs is so large that efficient dereplication is highly desirable. NMR spectroscopy is still the gold standard for structural identification. A 13C NMR spectrum is an effective molecular fingerprint, but its acquisition is time-consuming, especially for mass-limited NPs. Several alternative methods and tools have been proposed, but none has reached general use. Here, a new artificial intelligence tool, HSQCid, which uses contrastive learning between 1H-13C HSQC spectra and structures, is proposed for effective structural identification. Two structure encoders are compared, and the graph neural network is preferred over the Transformer. Of about 400K predicted data, 80% are used for training and 20% for testing. In addition, with 18K experimental data as external test data, top-1 and top-5 accuracy reach 74.9% and 92.2%, respectively. Top-1 accuracy increases by at least 12% when HSQCid is combined with other easily obtainable structure features, such as the total number of hydrogens attached to carbons, available from 1H NMR spectra. Further data analysis shows that filtering by structure features nearly eliminates the influence (>10%) of the difference between predicted and experimental data. Surprisingly, the number or ratio of quaternary carbons has a significant influence on identification accuracy only in specific and rare cases (less than 3%). Furthermore, a benchmark method based on matching 13C peaks is evaluated and found to be markedly inferior to the proposed method. HSQCid will be available online in the near future for free public use. HSQCid is expected to help pave the way to high-throughput, highly effective structural dereplication of NPs.
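
The retrieval step described in the abstract can be illustrated with a minimal Python sketch. It is not the HSQCid implementation: it assumes that trained encoders have already mapped the query HSQC spectrum and each candidate structure to embedding vectors, that candidates are ranked by cosine similarity, and that the structure-feature filter is an exact match on the number of carbon-attached hydrogens. The names `rank_candidates`, `ch_hydrogens`, and the toy library values are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(
    query: Sequence[float],
    library: Dict[str, dict],
    ch_hydrogens: Optional[int] = None,
    top_k: int = 5,
) -> List[str]:
    """Return up to top_k candidate names, most similar first.

    If ch_hydrogens is given, candidates whose count of carbon-attached
    hydrogens differs are discarded before ranking (assumed exact-match
    filter; a real tool might allow a tolerance).
    """
    hits = []
    for name, entry in library.items():
        if ch_hydrogens is not None and entry["ch_hydrogens"] != ch_hydrogens:
            continue
        hits.append((cosine(query, entry["embedding"]), name))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [name for _, name in hits[:top_k]]


# Toy library: embeddings and hydrogen counts are made-up illustration values.
library = {
    "A": {"embedding": [1.0, 0.0, 0.0], "ch_hydrogens": 10},
    "B": {"embedding": [0.9, 0.1, 0.0], "ch_hydrogens": 12},
    "C": {"embedding": [0.0, 1.0, 0.0], "ch_hydrogens": 12},
}
query = [0.8, 0.2, 0.0]  # hypothetical embedding of a measured HSQC spectrum

unfiltered = rank_candidates(query, library)
filtered = rank_candidates(query, library, ch_hydrogens=10)
```

In this toy case the unfiltered ranking places candidate B first, whereas restricting the library to candidates with 10 carbon-attached hydrogens leaves only A. This shows how an inexpensive structure feature can change the top-1 result.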

Supplementary materials

supporting information 1: Data sources and quality evaluation; models by contrastive; peak matching methods; Data partitioning of traditional Chinese medicine related natural products
supporting information 2: Chemical structure classes of Traditional Chinese medicine related natural products
