Learning the Language of NMR: Structure Elucidation from NMR spectra using Transformer Models

14 August 2023, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The application of machine learning models in chemistry has made remarkable strides in recent years. Even though there is considerable interest in automating common procedures in analytical chemistry using machine learning, very few models have been adopted into everyday use. Among the analytical instruments available to chemists, Nuclear Magnetic Resonance (NMR) spectroscopy is one of the most important, offering insights into molecular structure unobtainable with other methods. However, most processing and analysis of NMR spectra is still performed manually, making the task tedious and time-consuming, especially for larger quantities of spectra. We present a transformer-based machine learning model capable of predicting the molecular structure directly from the NMR spectrum. Our model is pretrained on synthetic NMR spectra, achieving a top-1 accuracy of 67.0% when predicting the structure from both the 1H and 13C spectra. Additionally, we train a model which, given a spectrum and a set of likely compounds, selects the one corresponding to the spectrum. This model achieves a top-1 accuracy of 96.0% when trained on 1H spectra.
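
To make the reported metric concrete, the following is a minimal sketch of how a top-k accuracy could be computed for ranked structure predictions. It is an illustration only, not the authors' evaluation code. It assumes that structures are compared as strings (for example canonical SMILES, which is an assumption here) and that predictions and targets were canonicalized the same way beforehand. The function name `top_k_accuracy` and the toy data are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of samples whose target appears among the first k predictions.

    Assumption: structures are plain strings that were canonicalized
    identically, so exact string equality means "same molecule".
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one sample is required")
    hits = sum(
        1
        for preds, target in zip(ranked_predictions, targets)
        if target in preds[:k]
    )
    return hits / len(targets)


# Hypothetical toy data: one ranked candidate list per spectrum.
predictions = [
    ["CCO", "COC"],          # correct at rank 1
    ["CC(=O)O", "OCC=O"],    # correct only at rank 2
    ["c1ccccc1", "C1CCCCC1"],  # correct at rank 1
]
targets = ["CCO", "OCC=O", "c1ccccc1"]

top1 = top_k_accuracy(predictions, targets, k=1)
top2 = top_k_accuracy(predictions, targets, k=2)
```

With this toy data, two of the three targets are ranked first and all three appear within the first two candidates. The same function would apply to the candidate-selection setting if each ranked list is taken to be the model's ordering of the supplied set of likely compounds.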

Keywords

Deep Learning
Analytical Chemistry
NMR Spectroscopy
Structure Elucidation
Natural Language Processing
