CVMay 19, 2025

RMMSS: Towards Advanced Robust Multi-Modal Semantic Segmentation with Hybrid Prototype Distillation and Feature Selection

arXiv:2505.12861v23.6h-index: 3Has Code

Originality Incremental advance

AI Analysis

This work addresses robustness in multi-modal semantic segmentation for real-world applications like autonomous driving, though it is incremental as it builds on existing distillation methods.

The paper tackles the problem of multi-modal semantic segmentation under missing sensor data by proposing RMMSS, a two-stage framework with hybrid prototype distillation and feature selection, which improves missing-modality performance by up to 3.89% on datasets while maintaining full-modality performance with only a -0.1% mIoU drop.

Multi-modal semantic segmentation (MMSS) faces significant challenges in real-world applications due to incomplete, degraded, or missing sensor data. While current MMSS methods typically use self-distillation with modality dropout to improve robustness, they largely overlook inter-modal correlations and thus suffer significant performance degradation when no modalities are missing. To this end, we present RMMSS, a two-stage framework designed to progressively enhance model robustness under missing-modality conditions, while maintaining strong performance in full-modality scenarios. It comprises two key components: the Hybrid Prototype Distillation Module (HPDM) and the Feature Selection Module (FSM). In the first stage, we pre-train the teacher model with full-modality data and then introduce HPDM to do cross-modal knowledge distillation for obtaining a highly robust model. In the second stage, we freeze both the pre-trained full-modality teacher model and the robust model and propose a trainable FSM that extracts optimal representations from both the feature and logits layers of the models via feature score calculation. This process learns a final student model that maintains strong robustness while achieving high performance under full-modality conditions. Our experiments on three datasets demonstrate that our method improves missing-modality performance by 2.80%, 3.89%, and 0.89%, respectively, compared to the state-of-the-art, while causing almost no drop in full-modality performance (only -0.1% mIoU). Meanwhile, different backbones (AnySeg and CMNeXt) are utilized to validate the generalizability of our framework.

View on arXiv PDF Code

Similar