SEISMiQ: de novo impurity structure elucidation from tandem mass spectra boosts drug development

04 March 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Liquid chromatography-tandem mass spectrometry (LC-MS/MS) is an essential analytical technique in the pharmaceutical industry, used in particular to elucidate the structures of unknown impurities arising in the synthesis of active pharmaceutical ingredients. However, interpreting mass spectra is challenging and time-consuming and requires significant expertise. Although computational tools for automating this process have recently been developed, their accuracy in determining chemical structures is limited. In this paper, we introduce a new method for elucidating the structures of unknown impurities from their MS/MS spectra. We significantly improve elucidation accuracy by integrating domain experts' knowledge, specifically the impurity sum formula and known substructure, into the model's training and inference process. Performance improves further through transfer learning from simulated MS/MS spectra of impurities from an in-house database. Finally, the need to collect experimental data for fine-tuning can be circumvented by simulating the entire drug substance synthesis process in silico via reaction templates.
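
To make the role of the sum formula concrete, the following is a minimal sketch, not the method described in this paper. It assumes a much simpler setting in which the sum formula is used only as a post-hoc check: the formula is tested against a measured [M+H]+ precursor m/z, and candidate structures whose formulas differ are discarded. All function names, the candidate list, and the 5 ppm tolerance are hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Monoisotopic masses (in u) of the most abundant isotope of each element.
# Only a few elements are listed; extend the table as needed.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "S": 31.9720707,
    "F": 18.9984032,
    "Cl": 34.9688527,
}
PROTON_MASS = 1.00727646  # mass added on protonation, [M+H]+


def parse_formula(formula):
    """Turn a flat sum formula such as 'C8H10N4O2' into element counts.

    Assumption: no brackets, charges, or isotope labels in the string.
    """
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def monoisotopic_mass(formula):
    """Monoisotopic mass of the neutral molecule."""
    return sum(MONOISOTOPIC_MASS[el] * n for el, n in parse_formula(formula).items())


def matches_precursor(formula, precursor_mz, tol_ppm=5.0):
    """Check a sum formula against a measured [M+H]+ m/z within tol_ppm."""
    expected = monoisotopic_mass(formula) + PROTON_MASS
    return abs(expected - precursor_mz) / expected * 1e6 <= tol_ppm


def filter_by_sum_formula(candidates, sum_formula):
    """Keep the names of candidates whose formula equals the given sum formula.

    `candidates` is a list of (name, formula) pairs (hypothetical input format).
    """
    target = parse_formula(sum_formula)
    return [name for name, formula in candidates if parse_formula(formula) == target]
```

For example, `matches_precursor("C8H10N4O2", 195.0877)` accepts the caffeine formula for a precursor near m/z 195.0877, and `filter_by_sum_formula` treats formulas as equal regardless of the order in which elements are written. A filter of this kind only prunes candidates after generation, whereas the abstract describes integrating the sum formula into training and inference.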

Keywords

Structure Elucidation
Impurity Prediction
Reaction Templates
Tandem Mass Spectrometry
Transformer Architecture
Late Stage Drug Development
Chemical Development
Analytical Development
Reaction Networks
Structure Prediction
De Novo Structure Elucidation
Active Pharmaceutical Ingredient
Domain Knowledge Integration
SMILES
SMARTS
Data Augmentation
Language Model
Spectrum Simulation
