WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
The work addresses a specific bottleneck for multimodal AI researchers: connecting pre-trained Vision-Language Models to Diffusion Models. It appears incremental, since it builds on existing query-token methods.
The paper tackles task generalization collapse, which arises when Vision-Language Models are bridged with Diffusion Models through a fixed set of query tokens. It proposes Noisy Query Tokens to improve continual learning and a VAE branch to recover fine-grained image details. The reported results show mitigated generalization collapse and stable learning across diverse tasks.
Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.
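The abstract does not say how the noise is injected or how the VAE branch is wired in, so the following is only a toy sketch under stated assumptions, not the authors' implementation. It assumes that "noisy" means adding Gaussian noise to a fixed set of learnable query vectors, and that the VAE branch maps image latents through a single linear layer and appends them to the conditioning sequence. The VLM forward pass and the diffusion model are omitted entirely. All names (`make_noisy_queries`, `linear_project`, `build_condition`, `noise_scale`) are hypothetical.

```python
import random

def make_noisy_queries(base_queries, noise_scale, rng):
    # Assumption: perturb each learnable query vector with Gaussian noise,
    # so the bridge sees a neighborhood of points rather than fixed tokens.
    return [[q + noise_scale * rng.gauss(0.0, 1.0) for q in token]
            for token in base_queries]

def linear_project(vectors, weight, bias):
    # Plain linear layer y = W x + b; weight has shape (out_dim, in_dim).
    return [[sum(w * x for w, x in zip(row, vec)) + b
             for row, b in zip(weight, bias)]
            for vec in vectors]

def build_condition(base_queries, vae_latents, weight, bias,
                    noise_scale=0.1, seed=0):
    # Assumption: the conditioning sequence is the noisy queries followed by
    # the linearly projected VAE latents (the fine-detail tokens).
    rng = random.Random(seed)
    queries = make_noisy_queries(base_queries, noise_scale, rng)
    detail_tokens = linear_project(vae_latents, weight, bias)
    return queries + detail_tokens

# Toy shapes: 3 query tokens of width 4, 2 VAE latent vectors of width 2.
base = [[0.0] * 4 for _ in range(3)]
latents = [[1.0, 2.0], [3.0, 4.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # projects 2 -> 4
b = [0.0] * 4

cond = build_condition(base, latents, W, b)
print(len(cond), len(cond[0]))
```

In a real system the query vectors and the projection weights would be trained end to end, and the noisy queries would pass through the VLM before conditioning the diffusion model; the sketch only illustrates the shapes involved.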