ML | LG | May 21, 2025

Robust Multimodal Learning via Entropy-Gated Contrastive Fusion

arXiv:2505.15417v1 | 1 citation | h-index: 2 | ICML
Originality: Incremental advance
AI Analysis

The work addresses robustness and calibration in multimodal learning for applications such as robotics and healthcare. Its contribution, a new fusion layer, appears to be an incremental advance.

The paper tackles missing inputs in multimodal systems by introducing Adaptive Entropy-Gated Contrastive Fusion (AECF), which improves masked-input mAP by 18 percentage points at a 50% drop rate and reduces ECE by up to 200%, while adding only 1% to run-time.
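For readers unfamiliar with the calibration metric cited above, the following is a minimal sketch of the standard binned expected calibration error (ECE). It illustrates the common textbook definition, not the paper's evaluation code; the function name, the equal-width binning, and the default of 10 bins are assumptions made for this example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| over equal-width bins.

    confidences: predicted top-class probabilities in [0, 1]
    correct:     1 (or True) where the prediction was right, else 0
    n_bins:      number of equal-width bins (assumed value, not from the paper)
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Map the confidence to a bin index; conf == 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

Under this definition ECE lies in [0, 1], with 0 meaning that confidence matches accuracy in every occupied bin. Four predictions made at 0.9 confidence, of which three are correct, give an ECE of about 0.15.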

Real-world multimodal systems routinely face missing inputs: a robot loses audio on a factory floor, or a clinical record omits lab tests at inference time. Standard fusion layers preserve either robustness or calibration, but never both. We introduce Adaptive Entropy-Gated Contrastive Fusion (AECF), a single lightweight layer that (i) adapts its entropy coefficient per instance, (ii) enforces monotone calibration across all modality subsets, and (iii) drives a curriculum mask directly from training-time entropy. On AV-MNIST and MS-COCO, AECF improves masked-input mAP by +18 pp at a 50% drop rate while reducing ECE by up to 200%, yet adds only 1% to run-time. All backbones remain frozen, making AECF an easy drop-in layer for robust, calibrated multimodal inference.
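The abstract does not spell out the layer's internals, so the following is only an illustrative sketch of the general idea of a gate over modalities that tolerates missing inputs and exposes its own entropy. It is not the authors' implementation. The names `masked_softmax`, `gate_entropy`, and `fuse` are hypothetical, and the per-instance entropy coefficient, the calibration constraint, and the curriculum mask are not modeled.

```python
import math


def masked_softmax(scores, present):
    """Softmax over the scores of present modalities; absent ones get weight 0."""
    live = [s for s, p in zip(scores, present) if p]
    if not live:
        raise ValueError("at least one modality must be present")
    peak = max(live)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if p else 0.0 for s, p in zip(scores, present)]
    total = sum(exps)
    return [e / total for e in exps]


def gate_entropy(weights):
    """Shannon entropy (in nats) of the gate weights; zero weights are skipped."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def fuse(embeddings, scores, present):
    """Weighted sum of per-modality embeddings under the masked gate.

    embeddings: one equal-length vector per modality (e.g. from frozen backbones)
    scores:     one unnormalized gate score per modality (assumed to be given)
    present:    one boolean per modality, False where the input is missing
    """
    weights = masked_softmax(scores, present)
    dim = len(embeddings[0])
    fused = [sum(w * emb[d] for w, emb in zip(weights, embeddings))
             for d in range(dim)]
    return fused, weights, gate_entropy(weights)
```

With two modalities and equal scores, the gate splits its weight evenly and the entropy is ln 2. Dropping one modality moves all weight onto the survivor and the entropy falls to 0. A training objective could use this entropy as a regularizer or as a signal for choosing which modalities to mask, but how AECF does so is not specified in the text above.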

