Prefix-Tree Decoding for Predicting Mass Spectra from Molecules
This work addresses limitations of computational tools for mass spectrum prediction, a task that matters for discovering clinically relevant metabolites. However, the contribution appears incremental because it builds on existing encoding-decoding approaches.
The paper predicts mass spectra from molecules by treating each spectrum as a set of molecular formulae and decoding that set with a prefix tree structure to overcome the combinatorial challenges. It reports promising empirical results.
Computational predictions of mass spectra from molecules have enabled the discovery of clinically relevant metabolites. However, such predictive tools are still limited as they occupy one of two extremes, operating either (a) by fragmenting molecules combinatorially with overly rigid constraints on potential rearrangements and poor time complexity or (b) by decoding lossy and nonphysical discretized spectra vectors. In this work, we use a new intermediate strategy for predicting mass spectra from molecules by treating mass spectra as sets of molecular formulae, which are themselves multisets of atoms. After first encoding an input molecular graph, we decode a set of molecular subformulae, each of which specifies a predicted peak in the mass spectrum; a second model then predicts the peak intensities. Our key insight is to overcome the combinatorial possibilities for molecular subformulae by decoding the formula set using a prefix tree structure, atom-type by atom-type, representing a general method for ordered multiset decoding. We show promising empirical results on mass spectra prediction tasks.
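To make the atom-type-by-atom-type decoding concrete, the following is a minimal Python sketch of expanding a prefix tree over subformulae of a parent formula. It is an illustration under stated assumptions, not the authors' implementation: the names decode_formula_set and toy_keep are hypothetical, and toy_keep is a hand-written rule standing in for the learned model that would decide which branches to keep after encoding the molecular graph.

```python
from typing import Callable, Dict, List, Tuple

Formula = Dict[str, int]   # element symbol -> atom count
Prefix = Tuple[int, ...]   # counts chosen so far, in atom_order


def decode_formula_set(
    parent: Formula,
    atom_order: List[str],
    keep: Callable[[Prefix, str, int], bool],
) -> List[Formula]:
    """Expand a prefix tree level by level, one atom type per level.

    `keep` is a hypothetical stand-in for a learned scorer: given the
    counts chosen so far, it decides whether a count for the next
    element stays in the tree.
    """
    frontier: List[Prefix] = [()]  # root node: empty prefix
    for element in atom_order:
        max_count = parent.get(element, 0)  # a subformula cannot exceed the parent
        next_frontier: List[Prefix] = []
        for prefix in frontier:
            for count in range(max_count + 1):
                if keep(prefix, element, count):
                    next_frontier.append(prefix + (count,))
        frontier = next_frontier  # pruned prefixes are never expanded again
    # Each surviving leaf is a full subformula, i.e. one candidate peak.
    return [dict(zip(atom_order, leaf)) for leaf in frontier]


def toy_keep(prefix: Prefix, element: str, count: int) -> bool:
    """Arbitrary illustrative rule (assumption, not chemistry)."""
    if element == "C":
        return count >= 1
    if element == "H":
        return count == 2 * prefix[0] + 1  # tie the H count to the chosen C count
    return True  # keep every O count


# Parent formula C2H6O, decoded in the fixed order C, H, O.
formulae = decode_formula_set({"C": 2, "H": 6, "O": 1}, ["C", "H", "O"], toy_keep)
```

For this parent formula, exhaustive enumeration would visit 3 * 7 * 2 = 42 candidate subformulae, while the pruned tree keeps 4 leaves. Because a rejected prefix is never expanded, the work scales with the number of retained prefixes rather than with the full product of atom counts. Predicting an intensity for each retained formula would be a separate step, which this sketch omits.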