Abstract
Mass spectrometry (MS) is the primary method for characterizing biological and environmental samples at a molecular level. Despite this, the interpretation of mass spectra remains challenging. Existing methods rely heavily on limited spectral libraries and human expertise. Here, we take an orthogonal approach and introduce a transformer-based foundation model pre-trained in a self-supervised manner on millions of unlabeled mass spectra from our new GeMS (GNPS Experimental Mass Spectra) dataset. We show that by learning to predict masked spectral peaks and chromatographic retention orders, our model discovers rich molecular representations, which we name DreaMS (Deep Representations Empowering the Annotation of Mass Spectra). Fine-tuning the network to predict spectral similarity, molecular fingerprints, chemical properties, and the presence of fluorine from mass spectra yields state-of-the-art performance across all of these tasks. This underscores the practical utility of DreaMS across diverse spectrum interpretation tasks and establishes it as a foundation for future advances in the field.