LG · Mar 24, 2025

MODIS: Multi-Omics Data Integration for Small and unpaired datasets

Daniel Lepe-Soltero, Thierry Artières, Anaïs Baudot, Paul Villoutreix

arXiv:2503.18856v24.12 citations · h-index: 3 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses data integration in computational biology for scenarios such as rare diseases. The advance is incremental because it builds on existing methods such as variational auto-encoders and GANs.

The paper tackles multi-omics integration for small, unpaired datasets, such as those available for rare diseases. It proposes MODIS, a semi-supervised framework that reports high prediction accuracy and robust performance with limited supervision, validated on a synthetic dataset and on TCGA with between 10 and 34 classes.

An important objective in computational biology is the efficient integration of multi-omics data. The task of integration comes with challenges: multi-omics data are most often unpaired (requiring diagonal integration), partially labeled with information about biological conditions, and in some situations, such as rare diseases, only very small datasets are available. We present MODIS, a semi-supervised framework designed to account for these particular challenges. To address the challenge of very small datasets, we propose to exploit information contained in larger multi-omics databases by training our model on a large reference database and a small target dataset simultaneously, effectively turning the problem of transfer learning into a problem of learning with class imbalance. MODIS performs diagonal integration on unpaired samples, leveraging class labels to align modalities despite class imbalance and data scarcity. The architecture combines multiple variational auto-encoders, a class classifier and an adversarially trained modality classifier. To ensure training stability, we adapted a regularized relativistic GAN loss to this setting. We first validate MODIS on a synthetic dataset to assess the level of supervision needed for accurate alignment and to quantify the impact of class imbalance on predictive performance. We then apply our approach to the large public TCGA database, considering between 10 and 34 classes (cancer types and normal tissue). MODIS demonstrates high prediction accuracy, robust performance with limited supervision, and stability to class imbalance. These results position MODIS as a promising solution for challenging integration scenarios, particularly diagonal integration with a small number of samples, typical of rare-disease studies. The code is available at https://github.com/VILLOUTREIXLab/MODIS.
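To make the adversarial component more concrete, here is a minimal sketch of a plain relativistic pairing loss applied to a modality critic. It is not the MODIS implementation: the abstract does not spell out how the loss was adapted or which regularization is used, so the regularization terms are left out entirely. The function names and the idea of randomly pairing latent codes from two modalities within a batch are assumptions made for illustration only.

```python
import math


def softplus(t: float) -> float:
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def relativistic_critic_loss(scores_a, scores_b):
    """Mean of -log(sigmoid(s_a - s_b)) over paired critic scores.

    scores_a / scores_b: critic outputs for latent codes from two
    modalities (hypothetical pairing: samples are unpaired, so pairs
    are simply formed within a batch). The loss is small when the
    critic scores modality-A codes above modality-B codes.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    # -log(sigmoid(d)) equals softplus(-d)
    return sum(softplus(-(a - b)) for a, b in zip(scores_a, scores_b)) / len(scores_a)


def relativistic_encoder_loss(scores_a, scores_b):
    """Adversarial counterpart: the same loss with the roles swapped.

    Minimizing it pushes the encoders to undo the ordering the critic
    learned, which is what makes the latent codes hard to tell apart.
    """
    return relativistic_critic_loss(scores_b, scores_a)


if __name__ == "__main__":
    separated = relativistic_critic_loss([2.0, 1.5], [0.0, -0.5])
    confused = relativistic_critic_loss([0.3, 0.3], [0.3, 0.3])
    print(f"critic loss, modalities separated: {separated:.4f}")
    print(f"critic loss, modalities indistinguishable: {confused:.4f}")
```

When the critic cannot separate the two modalities (equal scores), both losses equal log 2, which is the balance point an aligned latent space would sit at under this simplified loss.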
