Suiren-1.0 Technical Report: A Family of Molecular Foundation Models
This work addresses the challenge of molecular modeling for researchers in chemistry and drug discovery, though it appears incremental because it builds on existing foundation-model and equivariant-architecture concepts.
The authors address the accurate modeling of diverse organic systems by introducing Suiren-1.0, a family of molecular foundation models that reports state-of-the-art results across a range of tasks. Suiren-Base is pre-trained on 70M samples, and Suiren-Dimer is obtained through continued pre-training on 13.5M samples.
We introduce Suiren-1.0, a family of molecular foundation models for the accurate modeling of diverse organic systems. Suiren-1.0, comprising three specialized variants (Suiren-Base, Suiren-Dimer, and Suiren-ConfAvg), is integrated within an algorithmic framework that bridges the gap between 3D conformational geometry and 2D statistical ensemble spaces. We first pre-train Suiren-Base (1.8B parameters) on a 70M-sample Density Functional Theory dataset using spatial self-supervision and SE(3)-equivariant architectures, achieving robust performance in quantum property prediction. Suiren-Dimer extends this capability through continued pre-training on 13.5M intermolecular interaction samples. To enable efficient downstream application, we propose Conformation Compression Distillation (CCD), a diffusion-based framework that distills complex 3D structural representations into 2D conformation-averaged representations. This yields the lightweight Suiren-ConfAvg, which generates high-fidelity representations from SMILES or molecular graphs. Our extensive evaluations demonstrate that Suiren-1.0 establishes state-of-the-art results across a range of tasks. All models and benchmarks are open-sourced.
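
As a small illustration of the symmetry that SE(3)-equivariant architectures are built around, the sketch below checks that interatomic distances, and therefore any scalar property computed from them, are unchanged when a molecule is rigidly rotated and translated. It is not part of Suiren-1.0: the helper names (rotate_z, translate, pairwise_distances) and the water-like coordinates are assumptions chosen for illustration, and only a rotation about the z-axis is used for brevity.

```python
import math

def rotate_z(p, theta):
    """Rotate a 3D point about the z-axis by theta radians."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)

def translate(p, t):
    """Shift a 3D point by the vector t."""
    return tuple(a + b for a, b in zip(p, t))

def pairwise_distances(coords):
    """Return all interatomic distances in a fixed (i < j) order."""
    n = len(coords)
    return [math.dist(coords[i], coords[j])
            for i in range(n) for j in range(i + 1, n)]

# Hypothetical water-like geometry in angstroms (illustrative values only).
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]

# Apply a rigid motion: rotation followed by translation.
moved = [translate(rotate_z(p, 0.7), (1.5, -2.0, 0.3)) for p in coords]

before = pairwise_distances(coords)
after = pairwise_distances(moved)

# The largest change in any distance should be at floating-point noise level.
max_diff = max(abs(a - b) for a, b in zip(before, after))
print(max_diff < 1e-9)
```

An invariant target such as an energy should behave like these distances under rigid motion, while vector-valued targets such as forces should rotate together with the molecule, which is the equivariant case.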