CV · AI · May 16

MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation

arXiv:2605.16716 · 76.5 · Has Code
AI Analysis

For researchers and practitioners in text-to-video generation, MAVEN addresses the underexplored problem of cultural representation, providing a method to enhance cultural fidelity in both mono- and cross-cultural scenarios.

MAVEN introduces a multi-agent framework to improve cultural fidelity in text-to-video generation, achieving significant gains in cultural relevance while maintaining visual quality and temporal consistency, as shown on a new benchmark of 243 prompts across three cultures.

Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within a single prompt remains underexplored. We introduce MAVEN, a multi-agent prompt refinement framework designed to improve cultural fidelity in both mono-cultural and cross-cultural T2V generation. MAVEN decomposes prompts into person, action, and location dimensions, handled by specialized agents operating in parallel or sequentially. To support systematic evaluation, we contribute a new benchmark of 243 culturally grounded prompts and 972 corresponding videos, spanning three cultures (Chinese, American, Romanian), three action categories, and both mono-cultural and cross-cultural scenarios. Evaluations combining CLIP-based metrics, VLM-as-judge assessments, and video quality measures show that multi-agent refinement, particularly parallel specialization, significantly improves cultural relevance while preserving visual quality and temporal consistency. The dataset and code are available at https://github.com/AIM-SCU/CRAFT
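The difference between parallel and sequential agent refinement can be illustrated with a minimal sketch. This is not the authors' implementation: the agents here are hypothetical stubs standing in for LLM-backed agents, and the names `make_stub_agent`, `refine_parallel`, and `refine_sequential`, the single `culture` string, and the merge-by-concatenation step are all assumptions made for illustration. Each stub reports how many words it saw, so the two modes can be told apart.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical agent signature: (prompt, culture) -> refinement note.
# A real agent would presumably call a language model; these stubs do not.
Agent = Callable[[str, str], str]

DIMENSIONS = ("person", "action", "location")  # dimensions named in the abstract


def make_stub_agent(dimension: str) -> Agent:
    """Build a placeholder agent that records how much context it received."""
    def agent(prompt: str, culture: str) -> str:
        return f"[{dimension}/{culture}, saw {len(prompt.split())} words]"
    return agent


AGENTS: Dict[str, Agent] = {d: make_stub_agent(d) for d in DIMENSIONS}


def refine_parallel(prompt: str, culture: str) -> str:
    """Every agent sees only the original prompt; notes are merged afterwards."""
    with ThreadPoolExecutor() as pool:
        futures = {d: pool.submit(AGENTS[d], prompt, culture) for d in DIMENSIONS}
        notes = [futures[d].result() for d in DIMENSIONS]
    return " ".join([prompt, *notes])


def refine_sequential(prompt: str, culture: str) -> str:
    """Each agent sees the prompt as already extended by the agents before it."""
    current = prompt
    for d in DIMENSIONS:
        current = f"{current} {AGENTS[d](current, culture)}"
    return current


if __name__ == "__main__":
    base = "A person dancing in a square"
    print(refine_parallel(base, "Romanian"))
    print(refine_sequential(base, "Romanian"))
```

In the parallel mode all three stubs report the same word count because none of them sees another agent's output, whereas in the sequential mode the count grows at each step. How MAVEN actually merges agent outputs and handles cross-cultural prompts is not specified in the abstract.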

