Abstract
Training machine learning models for tasks such as de novo sequencing or spectral clustering requires large collections of confidently identified spectra. Here we describe a dataset of 2.8 million high-confidence peptide-spectrum matches derived from nine different species. The dataset is based on a previously described benchmark but has been re-processed to ensure consistent data quality and enforce separation of training and test peptides.
Supplementary weblinks
Title
Nine-species benchmark
Description
This is a de novo sequencing benchmark dataset derived from nine
publicly available mass spectrometry datasets. There are two versions
of the benchmark: main and balanced. The balanced version randomly
removes spectra from the more heavily represented species in order to
create a smaller, more evenly balanced dataset. Also provided are two
zip files containing the raw data as well as intermediate results.
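
As a rough illustration of the kind of random downsampling the balanced
version describes, the sketch below caps the number of spectra kept per
species. It is not the procedure used to build this dataset: the function
name, the per-species cap, the seed, and the species labels and spectrum
identifiers are all hypothetical.

```python
import random
from collections import defaultdict


def balance_by_species(records, cap, seed=0):
    """Keep at most `cap` randomly chosen spectra per species.

    `records` is an iterable of (species, spectrum_id) pairs.
    The cap and seed are illustrative assumptions, not values
    taken from the benchmark.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible

    # Group spectrum identifiers by species.
    by_species = defaultdict(list)
    for species, spectrum_id in records:
        by_species[species].append(spectrum_id)

    # Downsample only the species that exceed the cap.
    balanced = {}
    for species, ids in by_species.items():
        if len(ids) > cap:
            ids = rng.sample(ids, cap)
        balanced[species] = sorted(ids)
    return balanced


# Hypothetical toy input: one over-represented and one small species.
records = [("species_a", f"a{i}") for i in range(10)]
records += [("species_b", f"b{i}") for i in range(3)]

balanced = balance_by_species(records, cap=4)
for species, ids in balanced.items():
    print(species, len(ids))
```

Species already at or below the cap are left untouched, so only the
over-represented species lose spectra.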