LSM1-MS2: A Self-Supervised Foundation Model for Tandem Mass Spectrometry Applications, Encompassing Extensive Chemical Property Predictions and Spectral Matching

Gabriel Asher; Jennifer M. Campbell; Jack Geremia; Timothy Kassis

doi:10.26434/chemrxiv-2024-k06gb-v2

Analytical Chemistry

11 March 2024, Version 2

This is not the most recent version. There is a newer version of this content available.

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

We introduce LSM1-MS2, a self-supervised foundation model for tandem mass spectrometry, pre-trained on extensive unlabeled MS2 data. LSM1-MS2 learns from the complex information in MS2 spectra without requiring detailed identification, and it processes traditional MS2 data efficiently. The architecture, based on a specialized transformer and a custom tokenization scheme, enables Masked Peak Modeling, a pre-training task in which the model learns to reconstruct MS2 peaks, preparing it for various downstream applications. LSM1-MS2 is then fine-tuned on a smaller labeled dataset for two tasks: compound property prediction and compound identification through spectral matching. Preliminary results show that the model predicts 209 RDKit-derived descriptors with a mean absolute error (MAE) of 3.35 on the CASMI 2022 dataset, outperforming recent supervised models. Notably, it requires only 1% of the labeled data used by traditional methods. Even without fine-tuning, the embeddings of LSM1-MS2 yield results similar to those of supervised models. In compound identification, LSM1-MS2 performs well in database lookups, achieving an MAE of 0.07 in Tanimoto similarity on CASMI 2022 and surpassing both supervised models and traditional methods. It also retrieves close matches for known and unknown molecule queries more efficiently than conventional approaches. Overall, these results indicate that LSM1-MS2 can be fine-tuned effectively with minimal labeled data and delivers robust predictions, offering an alternative to traditional approaches to MS2 data utilization.
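As a rough illustration of the masking idea behind Masked Peak Modeling, the sketch below hides a random subset of peaks in a toy spectrum and records the originals as reconstruction targets. It is not the authors' implementation: the abstract does not describe the tokenization scheme, mask token, or mask ratio, so the `(m/z, intensity)` tuple representation, the `MASK` placeholder, the `mask_fraction` value, and the function name `mask_peaks` are all assumptions made for illustration.

```python
import random

# Hypothetical placeholder standing in for a masked peak; the real
# model's mask token is not described in the abstract.
MASK = ("[MASK]", "[MASK]")


def mask_peaks(peaks, mask_fraction=0.25, seed=0):
    """Hide a random subset of (m/z, intensity) peaks.

    Returns the masked spectrum and a dict mapping each masked
    position to the original peak a model would have to reconstruct.
    """
    if not peaks:
        return [], {}
    rng = random.Random(seed)
    # Always mask at least one peak so there is something to predict.
    n_mask = max(1, round(len(peaks) * mask_fraction))
    positions = sorted(rng.sample(range(len(peaks)), n_mask))
    masked = list(peaks)
    targets = {}
    for i in positions:
        targets[i] = peaks[i]
        masked[i] = MASK
    return masked, targets


# Toy spectrum with made-up values, for illustration only.
spectrum = [(85.03, 0.12), (127.04, 0.55), (145.05, 1.00), (163.06, 0.31)]
masked, targets = mask_peaks(spectrum, mask_fraction=0.25, seed=0)
```

During pre-training, a model would receive `masked` as input and be scored on how well it recovers the peaks stored in `targets`. Because no labels are involved, the same procedure applies to any unlabeled MS2 spectrum.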

Keywords

MS2

Tandem Mass Spectrometry

Deep Learning

Machine Learning
