Efficiently predicting high resolution mass spectra with graph neural networks
This work targets small molecule identification from mass spectra, which the authors describe as the primary open problem in computational metabolomics, and is aimed at researchers who need that identification to be both efficient and accurate.
The paper casts high-resolution mass spectrum prediction for small molecule identification as a mapping from molecular graphs to probability distributions over molecular formulas. The authors report significantly lower prediction error and orders-of-magnitude faster runtime than state-of-the-art methods.
Identifying a small molecule from its mass spectrum is the primary open problem in computational metabolomics. This is typically cast as information retrieval: an unknown spectrum is matched against spectra predicted computationally from a large database of chemical structures. However, current approaches to spectrum prediction model the output space in ways that force a tradeoff between capturing high-resolution mass information and keeping learning tractable. We resolve this tradeoff by casting spectrum prediction as a mapping from an input molecular graph to a probability distribution over molecular formulas. We discover that a large corpus of mass spectra can be closely approximated using a fixed vocabulary constituting only 2% of all observed formulas. This enables efficient spectrum prediction using an architecture similar to graph classification, GrAFF-MS, which achieves significantly lower prediction error and orders-of-magnitude faster runtime than state-of-the-art methods.
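The fixed-vocabulary idea can be illustrated with a small toy sketch. It is not the authors' implementation: the corpus, the formulas and intensities in it, the helper names (`build_vocabulary`, `coverage`, `softmax`), and the choice to rank formulas by total intensity are all assumptions made for illustration. The sketch keeps the few formulas that carry the most intensity, measures how much of the corpus they explain, and then turns one score per vocabulary entry into a probability distribution, as a classifier head would.

```python
from collections import Counter
import math

# Toy corpus with made-up values: each spectrum is a list of
# (molecular formula, relative intensity) peaks.
corpus = [
    [("C6H5", 0.6), ("C2H3O", 0.3), ("C9H11N2", 0.1)],
    [("C6H5", 0.5), ("C2H3O", 0.4), ("C4H9S", 0.1)],
    [("C6H5", 0.7), ("C3H7", 0.2), ("C2H3O", 0.1)],
]

def build_vocabulary(spectra, size):
    """Keep the `size` formulas carrying the most total intensity (assumed criterion)."""
    totals = Counter()
    for peaks in spectra:
        for formula, intensity in peaks:
            totals[formula] += intensity
    return [formula for formula, _ in totals.most_common(size)]

def coverage(spectra, vocab):
    """Fraction of total intensity explained by formulas in the vocabulary."""
    vocab_set = set(vocab)
    kept = sum(i for peaks in spectra for f, i in peaks if f in vocab_set)
    total = sum(i for peaks in spectra for _, i in peaks)
    return kept / total

def softmax(logits):
    """Turn per-formula scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Two of the five observed formulas form the vocabulary.
vocab = build_vocabulary(corpus, size=2)
covered = coverage(corpus, vocab)

# Stand-in for a network output: one score per vocabulary entry.
logits = [2.0, 1.0]
predicted = dict(zip(vocab, softmax(logits)))

print(vocab, round(covered, 3), predicted)
```

Because the output is a distribution over a fixed list of formulas, each predicted peak carries an exact formula (and hence an exact mass) while the prediction step stays as simple as a classification head; in this toy corpus, two of five formulas already account for most of the intensity.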