CV · Dec 2, 2025

WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens

arXiv:2512.02536v1 · 1 citation · h-index: 24
Originality: Incremental advance
AI Analysis

The paper addresses a specific bottleneck in multimodal AI research, though the contribution appears incremental because it builds on existing query-token methods.

The paper tackles the problem of task generalization collapse when bridging Vision-Language Models with Diffusion Models using fixed query tokens, proposing Noisy Query Tokens and a VAE branch to enhance continual learning. The results show mitigation of generalization collapse and stable learning across diverse tasks.

Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.
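The abstract names two components but does not specify how they are wired together. As a purely illustrative sketch, the code below assumes that "noisy" means adding Gaussian noise to a fixed set of query tokens, and that the VAE branch maps image latents into the same width with a single linear layer before the two are concatenated as conditioning for the diffusion model. The function names, shapes, noise scheme, and concatenation step are all assumptions, not details taken from the paper, and the VLM forward pass is omitted entirely.

```python
import random

def noisy_query_tokens(base_tokens, noise_std, rng):
    """Add i.i.d. Gaussian noise to each query token (assumed noise scheme)."""
    return [[x + rng.gauss(0.0, noise_std) for x in tok] for tok in base_tokens]

def linear_project(vectors, weight, bias):
    """Apply y = W x + b to each vector; weight has shape (out_dim, in_dim)."""
    return [
        [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weight, bias)]
        for v in vectors
    ]

def build_condition(query_tokens, vae_latents, weight, bias, noise_std, rng):
    """Concatenate noisy query tokens with projected VAE latents along the
    token axis (hypothetical way to combine the two branches)."""
    noisy = noisy_query_tokens(query_tokens, noise_std, rng)
    projected = linear_project(vae_latents, weight, bias)
    return noisy + projected

# Toy sizes, chosen only for illustration.
rng = random.Random(0)
num_queries, dim, latent_dim, num_latents = 4, 8, 3, 2
queries = [[0.0] * dim for _ in range(num_queries)]
latents = [[rng.random() for _ in range(latent_dim)] for _ in range(num_latents)]
weight = [[rng.uniform(-0.1, 0.1) for _ in range(latent_dim)] for _ in range(dim)]
bias = [0.0] * dim

cond = build_condition(queries, latents, weight, bias, noise_std=0.5, rng=rng)
print(len(cond), len(cond[0]))  # 6 tokens, each of width 8
```

In a real system the query tokens and the projection weights would be trained end to end, as the abstract states. This sketch only shows the shapes involved under the assumptions above.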

