LGMay 19

MSAlign: Aligning Molecule and Mass Spectra Foundation Models for Metabolite Identification

Paul Krzakala, Gabriel Melo, Camille Lançon, Charlotte Laclau, Rémi Flamary, Etienne Thévenot, Florence d'Alché-Buc

arXiv:2605.1975265.6

AI Analysis

Provides a simple, fast, and reproducible method for metabolite identification from mass spectra, addressing reproducibility issues in the field.

MSAlign aligns frozen DreaMS and ChemBERTa foundation models via lightweight MLP projections and a candidate-based contrastive objective, consistently outperforming existing approaches across all MassSpecGym and Spectraverse benchmarks for metabolite identification.

Accurately identifying metabolites i.e. small molecules from mass spectrometry data remains a core challenge in metabolomics, with broad applications in drug discovery, environmental analysis, and clinical research. We address the Molecule Retrieval task, which consists in recovering the chemical structure of a metabolite from its MS/MS spectrum given a set of candidate molecules. While the recent release of benchmark datasets such as MassSpecGym and Spectraverse has considerably accelerated the development of novel machine learning approaches, the complexity of data preprocessing pipelines and the lack of unified implementations make methods and results difficult to reproduce and compare. We make three contributions. First, we propose a unified framework encompassing recent approaches based on representation alignment and contrastive learning. Second, we introduce MSAlign, inspired by multimodal alignment in vision-language models, which learns a shared representation space by aligning two frozen foundation models (DreaMS for mass spectra and ChemBERTa for molecules) through lightweight MLP projections trained with a candidate-based contrastive objective. MSAlign is simple to implement, fast to train and consistently outperforms existing approaches across all benchmarks. Third, we investigate a long-standing evaluation problem: data splitting strategies in molecule retrieval implicitly trade off data leakage against domain shift. We formalize this tension by introducing a quantitative measure of distribution shift, and use it to evaluate splitting strategies in existing benchmarks. All datasets, splits, candidate sets, and a unified implementation of MSAlign and baselines are publicly released to support reproducible research.

View on arXiv PDF

Similar