CV · Sep 7, 2025

UniVerse-1: Unified Audio-Video Generation via Stitching of Experts

arXiv:2509.06155v147 citations · h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses the challenge of joint audio-video generation for multimedia applications. It is incremental in that it builds on existing pre-trained expert models, and it aims to close the gap with state-of-the-art models such as Veo3.

The paper tackles simultaneous generation of coordinated audio and video by introducing UniVerse-1, a unified model built with a stitching of experts (SoE) technique and an online annotation pipeline. After fine-tuning on approximately 7,600 hours of audio-video data, the model is reported to produce well-coordinated audio-visuals for ambient sounds and strong alignment for speech.

We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To enhance training efficiency, we bypass training from scratch and instead employ a stitching of experts (SoE) technique. This approach deeply fuses the corresponding blocks of pre-trained video and music generation expert models, thereby fully leveraging their foundational capabilities. To ensure accurate annotations and temporal alignment for both ambient sounds and speech with video content, we developed an online annotation pipeline that processes the required training data and generates labels during the training process. This strategy circumvents the performance degradation often caused by misaligned text-based annotations. Through the synergy of these techniques, our model, after being fine-tuned on approximately 7,600 hours of audio-video data, produces results with well-coordinated audio-visuals for ambient sound generation and strong alignment for speech generation. To systematically evaluate our proposed method, we introduce Verse-Bench, a new benchmark dataset. In an effort to advance research in audio-video generation and to close the performance gap with state-of-the-art models such as Veo3, we make our model and code publicly available. We hope this contribution will benefit the broader research community. Project page: https://dorniwang.github.io/UniVerse-1/.
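The abstract says only that SoE "deeply fuses the corresponding blocks" of the two pre-trained experts; it does not give the fusion mechanism. The toy sketch below illustrates one possible reading of block-wise stitching, not the authors' implementation. Everything in it is an assumption made for illustration: the names `make_scale_block`, `stitched_block`, `stitch_experts`, and `run`, the equal depth of the two experts, and the mean-based cross-modal exchange controlled by the hypothetical `mix` parameter.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Block = Callable[[Vector], Vector]
FusedBlock = Callable[[Vector, Vector], Tuple[Vector, Vector]]


def make_scale_block(factor: float) -> Block:
    """Hypothetical stand-in for one pre-trained expert block: scales its input."""
    def block(x: Vector) -> Vector:
        return [factor * v for v in x]
    return block


def mean(x: Vector) -> float:
    return sum(x) / len(x)


def stitched_block(video_block: Block, audio_block: Block, mix: float) -> FusedBlock:
    """Pair one video block with one audio block and let them exchange information.

    Assumption: the exchange is a simple additive summary of the other stream.
    A real model would more plausibly use learned projections or attention.
    """
    def fused(video_h: Vector, audio_h: Vector) -> Tuple[Vector, Vector]:
        v = video_block(video_h)  # original video expert computation
        a = audio_block(audio_h)  # original audio expert computation
        v_out = [x + mix * mean(a) for x in v]  # video stream sees audio summary
        a_out = [x + mix * mean(v) for x in a]  # audio stream sees video summary
        return v_out, a_out
    return fused


def stitch_experts(video_blocks: List[Block], audio_blocks: List[Block],
                   mix: float = 0.1) -> List[FusedBlock]:
    """Stitch corresponding blocks; assumes both experts have the same depth."""
    if len(video_blocks) != len(audio_blocks):
        raise ValueError("this sketch assumes experts with equal numbers of blocks")
    return [stitched_block(vb, ab, mix) for vb, ab in zip(video_blocks, audio_blocks)]


def run(stitched: List[FusedBlock], video_h: Vector, audio_h: Vector) -> Tuple[Vector, Vector]:
    """Pass both hidden states through every stitched block in order."""
    for fused in stitched:
        video_h, audio_h = fused(video_h, audio_h)
    return video_h, audio_h
```

With `mix=0.0` the stitched stack reduces to the two experts running independently, which is one way to see why starting from pre-trained blocks can avoid training from scratch: the fusion path can begin as a small perturbation of behavior the experts already have.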

