DeLo: Dual Decomposed Low-Rank Experts Collaboration for Continual Missing Modality Learning
This work addresses a real-world challenge in multimodal AI: models must learn incrementally from data streams in which some modalities are missing. It offers an architectural solution relevant to practitioners in fields such as robotics and healthcare.
The paper tackles Continual Missing Modality Learning (CMML): adapting Large Multimodal Models to sequential data in which modalities are frequently missing. It proposes DeLo, a dual-decomposed low-rank expert architecture designed to resolve modality interference and prevent catastrophic forgetting, and reports significant improvements over state-of-the-art methods on established benchmarks.
Adapting Large Multimodal Models (LMMs) to real-world scenarios poses the dual challenges of learning from sequential data streams while handling frequent modality incompleteness, a task known as Continual Missing Modality Learning (CMML). Existing work on CMML has predominantly relied on prompt tuning, a technique that struggles with this task because its learnable prompts share one embedding space and therefore interfere across tasks. A naive application of Low-Rank Adaptation (LoRA) with a modality-shared module also suffers from modality interference caused by competing gradients. To this end, we propose DeLo, the first framework to leverage a novel dual-decomposed low-rank expert architecture for CMML. Specifically, this architecture resolves modality interference through decomposed LoRA experts, dynamically composing the LoRA update matrix from rank-one factors drawn from disentangled modality-specific factor pools. Embedded within a task-partitioned framework that structurally prevents catastrophic forgetting, this expert system is supported by two key mechanisms: a Cross-Modal Guided Routing strategy to handle incomplete data and a Task-Key Memory for efficient, task-agnostic inference. Extensive experiments on established CMML benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches. This highlights the value of a principled, architecturally aware LoRA design for real-world multimodal challenges.
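To make the idea of composing a LoRA update from rank-one factors concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each modality owns a pool of factor pairs (a_i, b_i) and that a router supplies one score per factor. The top-k selection, the softmax weighting, the pool sizes, and all names (`compose_lora_update`, `pools`, `scores`) are illustrative assumptions. The abstract does not specify how Cross-Modal Guided Routing computes its scores, so random scores stand in for the router here.

```python
import numpy as np


def compose_lora_update(pool_a, pool_b, scores, top_k):
    """Build a low-rank update from the top-k rank-one factors of one pool.

    pool_a: (n_factors, d_in) input-side vectors a_i (hypothetical layout)
    pool_b: (n_factors, d_out) output-side vectors b_i (hypothetical layout)
    scores: (n_factors,) router scores; the scoring rule is an assumption
    Returns delta_w of shape (d_out, d_in) and the chosen factor indices.
    """
    # Pick the k highest-scoring factors (assumed selection rule).
    idx = np.argsort(scores)[-top_k:]
    # Softmax over the selected scores, shifted for numerical stability.
    weights = np.exp(scores[idx] - scores[idx].max())
    weights /= weights.sum()
    # Weighted sum of outer products: delta_w = sum_i w_i * b_i a_i^T.
    delta_w = np.einsum("k,ko,ki->oi", weights, pool_b[idx], pool_a[idx])
    return delta_w, idx


rng = np.random.default_rng(0)
d_in, d_out, n_factors, top_k = 8, 6, 5, 2

# One disentangled factor pool per modality (sizes chosen for illustration).
pools = {
    modality: (
        rng.normal(size=(n_factors, d_in)),
        rng.normal(size=(n_factors, d_out)),
    )
    for modality in ("image", "text")
}

# Placeholder router scores for the text pool.
scores = rng.normal(size=n_factors)
delta_w, chosen = compose_lora_update(*pools["text"], scores, top_k)

# The composed update has rank at most top_k by construction.
rank = np.linalg.matrix_rank(delta_w)
```

Because each modality draws from its own pool in this sketch, gradients for one modality's factors never touch the other's, which is one plausible reading of how disentangled pools avoid the competing-gradient problem of a modality-shared LoRA module.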