CV · Sep 29, 2025

UniVid: The Open-Source Unified Video Model

Jiabin Luo, Junhui Lin, Zeyu Zhang, Biao Wu, Meng Fang, Ling Chen, Hao Tang

arXiv:2509.24200v2 · 14 citations · h-index: 7 · Has Code

Originality: Incremental advance

AI Analysis

This work targets efficient and effective unified video models that handle both generation and understanding. It is assessed as an incremental advance with measurable gains on standard benchmarks.

The paper tackles the challenge of unified video modeling for both generation and understanding by addressing text-visual token imbalance and inefficient cross-modal attention, resulting in state-of-the-art performance with a 2.2% improvement on VBench-Long and accuracy gains of 1.0% and 3.3% on MSVD-QA and ActivityNet-QA.

Unified video modeling that combines generation and understanding capabilities is increasingly important but faces two key challenges: (1) maintaining semantic faithfulness during flow-based generation, which is hampered by text-visual token imbalance and by the limitations of uniform cross-modal attention across the flow trajectory; and (2) efficiently extending image-centric MLLMs to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance, achieving a 2.2% improvement on VBench-Long total score compared to EasyAnimateV5.1, and 1.0% and 3.3% accuracy gains on MSVD-QA and ActivityNet-QA, respectively, compared with the best prior 7B baselines. Code: https://github.com/AIGeeksGroup/UniVid. Website: https://aigeeksgroup.github.io/UniVid.
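The abstract does not say how Temperature Modality Alignment is implemented, so the following is only a generic illustration of the underlying issue, not the authors' method. It shows how a softmax temperature changes the share of attention that a few text tokens receive when they compete with many visual tokens, and how a temperature could vary along the generation trajectory instead of staying uniform. The function names, the logit values, and the linear schedule are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature on the logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def text_attention_mass(text_logits, visual_logits, temperature=1.0):
    """Fraction of attention that lands on the text tokens.

    Hypothetical setup: one query attends jointly over text and visual
    keys, and a single temperature is applied to all logits.
    """
    weights = softmax(list(text_logits) + list(visual_logits), temperature)
    return sum(weights[: len(text_logits)])


def linear_temperature(step, num_steps, start, end):
    """Hypothetical schedule: interpolate the temperature over the steps."""
    if num_steps <= 1:
        return start
    frac = step / (num_steps - 1)
    return start + frac * (end - start)


if __name__ == "__main__":
    # Assumed toy numbers: 4 text tokens with slightly higher logits
    # compete against 256 visual tokens.
    text = [2.0] * 4
    visual = [1.0] * 256
    for step in range(0, 11, 5):
        t = linear_temperature(step, 11, start=0.5, end=1.0)
        mass = text_attention_mass(text, visual, temperature=t)
        print(f"step={step} temperature={t:.2f} text_mass={mass:.4f}")
```

In this toy setting the text tokens score higher than the visual tokens, so a lower temperature concentrates more attention on them; with equal logits, the text share would simply be the token-count ratio regardless of temperature.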
