CV, CL, ET | May 25

CMAP: Cross-Modal Adaptive Prompting for Multi-Domain Task-Incremental Learning

arXiv:2605.25708
AI Analysis

For multi-domain task-incremental learning, CMAP provides a parameter-efficient method that improves performance without external data, addressing both catastrophic forgetting and the lack of task identity at inference.

CMAP introduces cross-modal adaptive prompting for multi-domain task-incremental learning, leveraging CLIP's text embedding space for task routing, confidence estimation, and encoder adaptation. It achieves 74.2% Transfer, 80.5% Average, and 88.7% Last on the MTIL benchmark, surpassing prior SOTA by 5.0, 3.7, and 3.0 percentage points with only 2.5M parameters.

Multi-domain task-incremental learning requires a model to sequentially acquire knowledge across visually diverse domains without forgetting prior tasks, and without access to task identity at inference. Parameter-efficient methods built on frozen vision-language models have made strong progress, yet all existing approaches rely exclusively on visual features for task routing, confidence estimation, and encoder adaptation, leaving CLIP's cross-modal text embedding space entirely unexploited. We address this gap through three contributions. Text-space task routing replaces visual Gaussian matching with cosine similarity to frozen CLIP text prototypes, giving order-independent routing robust to data scarcity at zero parameter cost. Multi-prototype visual-textual confidence replaces single-Gaussian class modeling with K-means visual prototypes and cross-modal alignment scores under task-calibrated thresholds. Symmetric cross-modal gating extends per-layer Gumbel gates to the text encoder conditioned on batch image features, preserving cross-modal alignment on out-of-distribution inputs. On the MTIL benchmark spanning 11 datasets and 1201 classes, our method achieves 74.2% Transfer, 80.5% Average, and 88.7% Last under Order-I, surpassing the prior state of the art by 5.0, 3.7, and 3.0 percentage points with only 2.5M trainable parameters and no external data.
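The text-space task routing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 3-dimensional vectors are hypothetical stand-ins for frozen CLIP text and image embeddings, the helper names (`l2_normalize`, `cosine`, `route_task`) are invented for illustration, and scoring a task by the maximum similarity over its class prototypes is an assumption, since the abstract does not say how per-class similarities are aggregated. Because the prototypes come from a frozen text encoder rather than from statistics fitted on training images, the routing rule needs no trainable parameters and does not depend on the order in which tasks were learned.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def route_task(image_feature, task_text_prototypes):
    """Pick the task whose text prototypes best match the image feature.

    task_text_prototypes maps a task id to a list of class text embeddings.
    Assumption: a task's score is the max cosine similarity over its classes.
    Returns (task_id, score).
    """
    best_task, best_score = None, -math.inf
    for task_id, prototypes in task_text_prototypes.items():
        score = max(cosine(image_feature, p) for p in prototypes)
        if score > best_score:
            best_task, best_score = task_id, score
    return best_task, best_score


# Toy stand-ins for frozen CLIP text embeddings (hypothetical values).
prototypes = {
    "aircraft": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "flowers": [[0.0, 1.0, 0.0], [0.0, 0.9, 0.2]],
}

# Toy stand-in for a CLIP image embedding of a test sample.
image_feature = [0.1, 0.8, 0.1]

task, score = route_task(image_feature, prototypes)
print(task, round(score, 3))
```

In a real pipeline the prototypes would be computed once per task by encoding class-name prompts with the frozen text encoder, then cached; adding a new task only appends entries to the dictionary.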

