LSM1-MS2: A Self-Supervised Foundation Model for Tandem Mass Spectrometry Applications, Encompassing Extensive Chemical Property Predictions and Spectral Matching

11 March 2024, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

We introduce LSM1-MS2, a self-supervised foundation model for tandem mass spectrometry, pre-trained on extensive unlabeled MS2 data. LSM1-MS2 leverages the complex information in MS2 spectra without requiring detailed compound identification, and it processes conventional MS2 data efficiently. The model's architecture, based on a specialized transformer and a custom tokenization scheme, enables Masked Peak Modeling, a pre-training objective in which the model learns to reconstruct MS2 peaks, preparing it for various downstream applications. We then fine-tune LSM1-MS2 on a smaller labeled dataset for two tasks: compound property prediction and compound identification through spectral matching. Preliminary results show that the model predicts 209 RDKit-extracted descriptors with a mean absolute error (MAE) of 3.35 on the CASMI 2022 dataset, outperforming recent supervised models while requiring only 1% of the labeled data used by traditional methods. Even without fine-tuning, LSM1-MS2's embeddings yield results similar to those of supervised models. In compound identification, LSM1-MS2 performs well in database lookups, achieving an MAE of 0.07 in Tanimoto similarity on CASMI 2022 and surpassing both supervised models and traditional methods. It also retrieves close matches for known and unknown molecule queries more efficiently than conventional approaches. Overall, these results indicate that LSM1-MS2 can be fine-tuned effectively with minimal labeled data and delivers robust predictions, offering an alternative to traditional approaches to MS2 data utilization.
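
To make two ideas from the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It illustrates (1) the masking step of a masked-peak objective, in which a random subset of peaks is hidden and kept as reconstruction targets, and (2) how an MAE over Tanimoto similarities could be computed. The function names, the `MASK` placeholder, the 25% mask fraction, the representation of a spectrum as `(m/z, intensity)` pairs, and the representation of fingerprints as sets of on-bit indices are all assumptions made for illustration; the abstract does not specify the model's tokenization or masking details.

```python
import random

MASK = None  # hypothetical placeholder standing in for a masked peak


def mask_peaks(peaks, mask_fraction=0.25, seed=0):
    """Hide a random subset of (m/z, intensity) peaks.

    Returns (masked, targets): `masked` is a copy of `peaks` with the chosen
    positions replaced by MASK, and `targets` maps each masked index to the
    original peak a model would be trained to reconstruct.
    The mask fraction is an assumed value, not taken from the paper.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(peaks) * mask_fraction))
    chosen = sorted(rng.sample(range(len(peaks)), n_mask))
    masked = list(peaks)
    targets = {}
    for i in chosen:
        targets[i] = peaks[i]
        masked[i] = MASK
    return masked, targets


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity |A & B| / |A | B| for two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(a & b) / len(union)


def mean_absolute_error(predicted, reference):
    """Mean absolute difference between two equal-length sequences."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


if __name__ == "__main__":
    # Toy spectrum: eight (m/z, relative intensity) peaks, invented for illustration.
    spectrum = [(50.1, 0.10), (77.0, 0.35), (91.1, 1.00), (105.0, 0.42),
                (119.1, 0.08), (133.1, 0.27), (151.0, 0.60), (179.1, 0.15)]
    masked, targets = mask_peaks(spectrum)
    print("masked positions:", sorted(targets))
    print("similarity:", tanimoto({1, 2, 3}, {2, 3, 4}))
```

In a real pipeline the fingerprints would typically come from a cheminformatics toolkit and the reconstruction would be done by the trained model; here both are replaced by plain Python data so the sketch stays self-contained.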

Keywords

MS2
Tandem Mass Spectrometry
AI
Deep Learning
Machine Learning
