One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra
This work addresses the challenge of identifying molecular structures from mass spectra, a problem relevant to researchers in chemistry and drug discovery. It presents a strong baseline, but the contribution is incremental because it builds on existing methods.
The paper tackles de novo molecule generation from mass spectra with a two-stage pipeline that uses MIST as the encoder and MolForge as the decoder, enhanced with additional training data and probability thresholding. It reports a tenfold improvement over previous state-of-the-art methods, generating the correct molecular structure in 31% of cases at top-1 and 40% at top-10 in MassSpecGym.
A common approach to de novo molecule generation from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt MIST (Goldman et al., 2023) as the encoder and MolForge (Ucak et al., 2023) as the decoder, leveraging additional training data to enhance performance. We also threshold the probabilities of each fingerprint bit to focus on the presence of substructures. This results in a tenfold improvement over previous state-of-the-art methods: the correct molecular structure is generated from the mass spectrum in 31% of cases at top-1 and 40% at top-10 in MassSpecGym (Bushuiev et al., 2024). We position this as a strong baseline for future research in de novo molecule elucidation from mass spectra.
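The thresholding step and the top-k metric described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the 0.5 cutoff, the `>=` comparison, and the toy inputs are all assumptions made for illustration, and structures are compared as plain strings where a real evaluation would likely use a canonical identifier.

```python
from typing import List, Sequence


def threshold_fingerprint(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Convert per-bit probabilities into a binary fingerprint.

    A bit is set to 1 when its predicted probability reaches the threshold,
    so the decoder only sees which substructures are considered present.
    The default threshold of 0.5 is an assumption, not a value from the paper.
    """
    return [1 if p >= threshold else 0 for p in probs]


def top_k_accuracy(
    ranked_candidates: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
) -> float:
    """Fraction of spectra whose true structure appears in the first k candidates.

    Structures are compared as strings here (hypothetical identifiers).
    """
    hits = sum(
        1
        for candidates, target in zip(ranked_candidates, targets)
        if target in candidates[:k]
    )
    return hits / len(targets)


# Toy encoder output: hypothetical probabilities for four fingerprint bits.
predicted_probs = [0.92, 0.08, 0.51, 0.49]
binary_fingerprint = threshold_fingerprint(predicted_probs)

# Toy decoder output: ranked candidate structures for two spectra.
ranked = [["mol_A", "mol_B"], ["mol_C", "mol_D"]]
truth = ["mol_B", "mol_X"]
top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)
```

In this sketch the binarized fingerprint would be passed to the decoder in place of the raw probabilities, and the top-k score counts a spectrum as solved when any of its first k generated candidates matches the reference structure.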