AI · CL · CV · Oct 13, 2023

EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs

arXiv:2310.08949v3 · 29 citations · h-index: 11 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses multimodal generation challenges for AI researchers, offering an incremental improvement over existing methods.

The paper tackles the problem of inefficient multimodal generation by introducing EasyGen, which uses BiDiffuser and LLMs to enhance modality interactions, achieving high-quality image generation and data-efficient training.

We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models that predominately depend on encoders like CLIP or ImageBind and need ample amounts of training data to bridge modalities, EasyGen leverages BiDiffuser, a bidirectional conditional diffusion model, to foster more efficient modality interactions. EasyGen achieves text generation by training a projection layer linking BiDiffuser and an LLM, and facilitates image generation by training an adapter to align the LLM's text space with the BiDiffuser's image space. Comprehensive quantitative and qualitative experiments show that EasyGen excels in data-efficient training, high-quality image generation, and extendibility, effectively addressing the challenges in multimodal generation. The source code is available at https://github.com/zxy556677/EasyGen.
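The two trainable bridges mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation: the abstract does not say how the projection layer or the adapter is built, so both are assumed here to be single linear maps, and the dimensions, helper names (make_linear, apply_linear), and stand-in feature values are all hypothetical. The sketch only shows the direction of each mapping: BiDiffuser features into the LLM's embedding space for text generation, and LLM hidden states back into the BiDiffuser's conditioning space for image generation.

```python
import random

# Hypothetical sizes, chosen only for illustration.
DIFFUSER_DIM = 8   # width of a BiDiffuser feature vector (assumed)
LLM_DIM = 16       # width of an LLM token embedding (assumed)


def make_linear(in_dim, out_dim, seed):
    """Create a randomly initialised linear map as (weight, bias)."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def apply_linear(layer, vectors):
    """Apply y = W x + b to every vector in a sequence."""
    weight, bias = layer
    return [
        [sum(w * x for w, x in zip(row, vec)) + b
         for row, b in zip(weight, bias)]
        for vec in vectors
    ]


# Assumed stand-ins for the two trained components named in the abstract.
projection = make_linear(DIFFUSER_DIM, LLM_DIM, seed=0)  # BiDiffuser -> LLM
adapter = make_linear(LLM_DIM, DIFFUSER_DIM, seed=1)     # LLM -> BiDiffuser

# Text-generation path: diffusion features become LLM-sized vectors that
# could be fed to the LLM alongside ordinary text embeddings.
image_features = [[0.1 * (i + j) for j in range(DIFFUSER_DIM)]
                  for i in range(4)]
soft_prompt = apply_linear(projection, image_features)

# Image-generation path: LLM hidden states become diffuser-sized vectors
# that could serve as the conditioning signal for image synthesis.
llm_hidden = [[0.05 * j for j in range(LLM_DIM)] for _ in range(3)]
condition = apply_linear(adapter, llm_hidden)

print(len(soft_prompt), len(soft_prompt[0]))
print(len(condition), len(condition[0]))
```

In this picture only the two small maps would be trained while the diffusion model and the LLM stay fixed, which is one plausible reading of the data-efficiency claim; the abstract itself does not state which parameters are frozen.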

Code Implementations: 1 repo
