Abstract
We introduce LSM1-MS2, a self-supervised foundation model for tandem mass spectrometry, pre-trained on extensive unlabeled MS2 data. LSM1-MS2 learns from the complex information within MS2 spectra without requiring the spectra to be identified, and it processes conventional MS2 data efficiently. The model's architecture, based on a specialized transformer and a custom tokenization scheme, enables Masked Peak Modeling, a pre-training objective in which the model learns to reconstruct masked MS2 peaks, preparing it for various downstream applications. We then fine-tune LSM1-MS2 on a smaller labeled dataset for two tasks: compound property prediction and compound identification through spectral matching. Preliminary results show that the model predicts 209 RDKit-derived descriptors with a mean absolute error (MAE) of 3.35 on the CASMI 2022 dataset, outperforming recent supervised models. Notably, it requires only 1\% of the labeled data used by traditional methods. Even without fine-tuning, LSM1-MS2's rich embeddings yield results comparable to those of supervised models. In compound identification, LSM1-MS2 excels at database lookups, achieving an MAE of 0.07 in Tanimoto similarity on CASMI 2022 and surpassing both supervised models and traditional methods. It also retrieves close matches for known and unknown molecule queries more efficiently than conventional approaches. Overall, LSM1-MS2 offers effective fine-tuning with minimal labeled data and robust predictive performance, with the potential to reshape how MS2 data are used in mass spectrometry analysis.
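To make the Tanimoto-based metric concrete, the following is a minimal sketch, not the evaluation code used for LSM1-MS2. It assumes that the reported MAE is the mean absolute difference between a model-predicted similarity and the true Tanimoto similarity of the corresponding molecular fingerprints; the abstract does not state this explicitly. The function names, the representation of fingerprints as sets of on-bit indices, and the toy values are all hypothetical.

```python
from typing import Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        # Convention assumed here: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / union


def mean_absolute_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """Mean absolute difference between predicted and reference values."""
    if len(pred) != len(true) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


# Toy fingerprints (hypothetical bit indices, not real molecules).
query_fp = {1, 2, 3, 4}
candidate_fps = [{1, 2, 3, 5}, {1, 2}]

# Reference similarities computed from the fingerprints.
true_sims = [tanimoto(query_fp, fp) for fp in candidate_fps]

# Hypothetical similarities a model might predict from spectra alone.
predicted_sims = [0.65, 0.40]

mae = mean_absolute_error(predicted_sims, true_sims)
print(f"true similarities: {true_sims}, MAE: {mae:.3f}")
```

Under this reading, an MAE of 0.07 would mean that predicted similarities deviate from the fingerprint-based Tanimoto values by 0.07 on average, on a scale from 0 to 1.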