Transfer between Modalities with MetaQueries
For researchers and practitioners, this work addresses the complex training recipes and careful data balancing that unified multimodal models usually require, and it offers a flexible approach to tasks such as image editing and generation. The contribution is incremental in that it builds on existing multimodal LLM (MLLM) and diffusion methods.
The paper tackles the challenge of aligning different modalities in unified multimodal models. It introduces MetaQueries, a learnable interface that connects autoregressive multimodal LLMs to diffusion models. This interface enables knowledge-augmented image generation with simplified training and strong performance, even when the MLLM backbone is frozen.
Unified multimodal models aim to integrate understanding (text output) and generation (pixel output), but aligning these different modalities within a single architecture often demands complex training recipes and careful data balancing. We introduce MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal LLMs (MLLMs) and diffusion models. MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Notably, this transfer is effective even when the MLLM backbone remains frozen, thereby preserving its state-of-the-art multimodal understanding capabilities while achieving strong generative performance. Additionally, our method is flexible and can be easily instruction-tuned for advanced applications such as image editing and subject-driven generation.
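The data flow described above can be illustrated with a toy sketch in plain Python. It is not the authors' implementation. It assumes that the learnable queries are appended to the prompt embeddings, that the frozen MLLM's outputs at the query positions are read back, and that a small trainable connector maps them to the size the diffusion decoder expects. The names `frozen_mllm`, `connector_w`, `build_condition`, and all dimensions are hypothetical, and the "MLLM" is a fixed random map rather than a real model.

```python
import random

random.seed(0)

HIDDEN = 8       # hypothetical MLLM hidden size
COND = 4         # hypothetical conditioning size for the diffusion decoder
NUM_QUERIES = 3  # hypothetical number of learnable queries


def rand_matrix(rows, cols):
    """Small random matrix stored as a list of rows."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Frozen stand-in for the MLLM backbone: these weights are never updated.
FROZEN_W = rand_matrix(HIDDEN, HIDDEN)


def frozen_mllm(tokens):
    """Toy backbone: mix each position with the sequence mean, then project.

    The mean-mixing is only there so that query positions can "see" the prompt.
    """
    n = len(tokens)
    mean = [sum(col) / n for col in zip(*tokens)]
    mixed = [[t + m for t, m in zip(tok, mean)] for tok in tokens]
    return matmul(mixed, FROZEN_W)


# Trainable parts in this sketch (assumption): the queries and the connector.
meta_queries = rand_matrix(NUM_QUERIES, HIDDEN)
connector_w = rand_matrix(HIDDEN, COND)


def build_condition(prompt_embeddings):
    """Return one conditioning vector per query for a diffusion decoder."""
    sequence = prompt_embeddings + meta_queries  # append queries after the prompt
    hidden = frozen_mllm(sequence)               # backbone stays frozen
    query_hidden = hidden[-NUM_QUERIES:]         # keep only the query positions
    return matmul(query_hidden, connector_w)     # map to the decoder's width


prompt = rand_matrix(5, HIDDEN)  # stand-in for an embedded caption of 5 tokens
cond = build_condition(prompt)
print(len(cond), len(cond[0]))   # prints: 3 4
```

In a real training loop, under the same assumptions, only `meta_queries`, `connector_w`, and the diffusion decoder would receive gradients from the diffusion loss on paired image-caption data, while the backbone weights stay fixed. Because the conditioning has a fixed number of rows regardless of prompt length, the decoder sees a constant-size interface.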