CV | May 29, 2025

OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

arXiv:2505.23661v3 | 41 citations | h-index: 24 | Has Code
Originality: Incremental advance
AI Analysis

OpenUni gives researchers and developers who need efficient multimodal models an open-source option, though the advance is incremental because it builds on existing methods.

The paper tackles the problem of unifying multimodal understanding and generation by introducing OpenUni, a simple and lightweight baseline that generates high-quality images and achieves exceptional performance on benchmarks like GenEval, DPG-Bench, and WISE with only 1.1B and 3.1B activated parameters.

In this report, we present OpenUni, a simple, lightweight, and fully open-source baseline for unifying multimodal understanding and generation. Inspired by prevailing practices in unified model learning, we adopt an efficient training strategy that minimizes the training complexity and overhead by bridging the off-the-shelf multimodal large language models (LLMs) and diffusion models through a set of learnable queries and a lightweight transformer-based connector. With a minimalist choice of architecture, we demonstrate that OpenUni can: 1) generate high-quality and instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. To support open research and community advancement, we release all model weights, training code, and our curated training datasets (including 23M image-text pairs) at https://github.com/wusize/OpenUni.
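The bridging idea in the abstract (learnable queries plus a lightweight transformer connector between a multimodal LLM and a diffusion model) can be illustrated with a minimal PyTorch sketch. This is not the released OpenUni code: the class name `QueryConnector`, the helper `make_toy_mllm`, all dimensions, and the choice to freeze the language model and read out the hidden states at the query positions are assumptions made for illustration. The toy encoder stands in for a real multimodal LLM, and the diffusion model that would consume the conditioning tokens is omitted.

```python
import torch
import torch.nn as nn


def make_toy_mllm(dim=64, heads=4):
    """Hypothetical stand-in for an off-the-shelf multimodal LLM (not a real one)."""
    layer = nn.TransformerEncoderLayer(
        d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers=1)


class QueryConnector(nn.Module):
    """Sketch: learnable queries -> frozen MLLM -> connector -> conditioning tokens."""

    def __init__(self, mllm, mllm_dim=64, cond_dim=32, num_queries=8,
                 depth=2, heads=4):
        super().__init__()
        self.mllm = mllm
        # Assumption: the pretrained MLLM is kept frozen; only the queries,
        # connector, and projection below are trainable.
        for p in self.mllm.parameters():
            p.requires_grad_(False)
        # Learnable query embeddings appended to the prompt sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, mllm_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=mllm_dim, nhead=heads,
            dim_feedforward=4 * mllm_dim, batch_first=True,
        )
        # Lightweight transformer connector over the query outputs.
        self.connector = nn.TransformerEncoder(layer, num_layers=depth)
        # Project to the width the diffusion model's conditioning expects.
        self.proj = nn.Linear(mllm_dim, cond_dim)

    def forward(self, prompt_embeds):
        # prompt_embeds: (batch, prompt_len, mllm_dim)
        batch = prompt_embeds.shape[0]
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        hidden = self.mllm(torch.cat([prompt_embeds, q], dim=1))
        # Keep only the hidden states at the query positions.
        query_states = hidden[:, -q.shape[1]:]
        return self.proj(self.connector(query_states))


if __name__ == "__main__":
    model = QueryConnector(make_toy_mllm())
    cond = model(torch.randn(2, 5, 64))
    print(cond.shape)  # conditioning tokens for a (hypothetical) diffusion model
```

In this sketch the only trainable parts are the queries, the connector, and the projection, which is one way a design like this can keep training overhead low while reusing pretrained components.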

Code Implementations: 1 repo
