To Bin or not to Bin: Alternative Representations of Mass Spectra
This work addresses the challenge of improving molecular property prediction from mass spectra for researchers in chemistry and bioinformatics, although its contribution is incremental because it builds on existing embedding methodologies.
The paper tackles the problem of preprocessing mass spectra for machine learning by proposing set-based and graph-based representations as alternatives to binning, and shows that a set transformer and a graph neural network trained on these representations both perform substantially better on a regression task than a multilayer perceptron trained on binned data.
Mass spectrometry, especially tandem mass spectrometry, is commonly used to assess the chemical diversity of samples. The resulting mass fragmentation spectra represent molecules whose structure may not have been determined. This poses the challenge of experimentally determining or computationally predicting molecular structures from mass spectra. An alternative is to predict molecular properties or molecular similarity directly from spectra. Various methodologies have been proposed to embed mass spectra for use in machine learning tasks. However, these methodologies require preprocessing of the spectra, which often includes binning or sub-sampling peaks, mainly to create vectors of uniform size and to remove noise. Here, we investigate two alternatives to binning mass spectra before downstream machine learning tasks, namely set-based and graph-based representations. We use the set-based representation to train a set transformer and the graph-based representation to train a graph neural network on a regression task, and we show that both perform substantially better than a multilayer perceptron trained on binned data.
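To make the three input representations concrete, the following minimal Python sketch builds each of them from a single peak list. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the bin width of 1.0, the m/z range of 0 to 1000, the fully connected graph, and the use of absolute m/z differences as edge features are all hypothetical choices made for this example.

```python
import numpy as np


def bin_spectrum(mz, intensity, mz_min=0.0, mz_max=1000.0, bin_width=1.0):
    """Fixed-length vector: sum the intensities of peaks falling in each m/z bin.

    The range and bin width are assumed values chosen for illustration.
    """
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    n_bins = int(np.ceil((mz_max - mz_min) / bin_width))
    binned = np.zeros(n_bins)
    idx = np.floor((mz - mz_min) / bin_width).astype(int)
    keep = (idx >= 0) & (idx < n_bins)  # drop peaks outside the range
    # np.add.at accumulates correctly when several peaks share a bin
    np.add.at(binned, idx[keep], intensity[keep])
    return binned


def set_representation(mz, intensity):
    """Unordered set of peaks: one (m/z, relative intensity) row per peak."""
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    return np.stack([mz, intensity / intensity.max()], axis=1)


def graph_representation(mz, intensity):
    """Peaks as nodes; every pair of peaks is joined by an edge.

    Assumption: edges carry the absolute m/z difference between the two
    peaks. The source does not specify how its graphs are constructed.
    """
    mz = np.asarray(mz, dtype=float)
    nodes = set_representation(mz, intensity)
    src, dst = np.triu_indices(len(mz), k=1)  # each unordered pair once
    edge_index = np.stack([src, dst])
    edge_attr = np.abs(mz[dst] - mz[src])
    return nodes, edge_index, edge_attr


# Toy spectrum with three peaks (made-up values)
mz = [100.2, 100.7, 250.1]
intensity = [10.0, 30.0, 60.0]

binned = bin_spectrum(mz, intensity)
peak_set = set_representation(mz, intensity)
nodes, edge_index, edge_attr = graph_representation(mz, intensity)
```

In this toy spectrum the first two peaks fall into the same bin, so the binned vector can no longer distinguish them, whereas the set and graph representations keep every peak at its measured m/z. The binned vector always has the same length, while the set and graph representations grow with the number of peaks, which is why they call for models that accept variable-sized inputs.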