CVMay 19, 2025

RMMSS: Towards Advanced Robust Multi-Modal Semantic Segmentation with Hybrid Prototype Distillation and Feature Selection

arXiv:2505.12861v2h-index: 3
Originality Incremental advance
AI Analysis

This work addresses robustness in multi-modal semantic segmentation for real-world applications like autonomous driving, though it is incremental as it builds on existing distillation methods.

The paper tackles the problem of multi-modal semantic segmentation under missing sensor data by proposing RMMSS, a two-stage framework with hybrid prototype distillation and feature selection, which improves missing-modality performance by up to 3.89% on datasets while maintaining full-modality performance with only a -0.1% mIoU drop.

Multi-modal semantic segmentation (MMSS) faces significant challenges in real-world applications due to incomplete, degraded, or missing sensor data. While current MMSS methods typically use self-distillation with modality dropout to improve robustness, they largely overlook inter-modal correlations and thus suffer significant performance degradation when no modalities are missing. To this end, we present RMMSS, a two-stage framework designed to progressively enhance model robustness under missing-modality conditions, while maintaining strong performance in full-modality scenarios. It comprises two key components: the Hybrid Prototype Distillation Module (HPDM) and the Feature Selection Module (FSM). In the first stage, we pre-train the teacher model with full-modality data and then introduce HPDM to do cross-modal knowledge distillation for obtaining a highly robust model. In the second stage, we freeze both the pre-trained full-modality teacher model and the robust model and propose a trainable FSM that extracts optimal representations from both the feature and logits layers of the models via feature score calculation. This process learns a final student model that maintains strong robustness while achieving high performance under full-modality conditions. Our experiments on three datasets demonstrate that our method improves missing-modality performance by 2.80%, 3.89%, and 0.89%, respectively, compared to the state-of-the-art, while causing almost no drop in full-modality performance (only -0.1% mIoU). Meanwhile, different backbones (AnySeg and CMNeXt) are utilized to validate the generalizability of our framework.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes