MM · Apr 17

MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production

arXiv:2604.15127 · 79.5 · h-index: 5
Predicted impact top 14% in MM · last 90 days
Originality: Synthesis-oriented
AI Analysis

For researchers in video production and multimodal AI, this benchmark fills a gap by evaluating the complete script creation process, but it is an incremental contribution as it extends existing sub-task benchmarks.

The paper introduces MCSC-Bench, a benchmark for multimodal context-to-script creation that evaluates the full video production workflow, from noisy inputs to structured scripts. Experiments show that current multimodal LLMs struggle with this task, while models trained on MCSC-Bench achieve state-of-the-art performance, with an 8B model outperforming Gemini-2.5-Pro.

Real-world video creation often involves a complex reasoning workflow of selecting relevant shots from noisy materials, planning missing shots for narrative completeness, and organizing them into coherent storylines. However, existing benchmarks focus on isolated sub-tasks and lack support for evaluating this full process. To address this gap, we propose Multimodal Context-to-Script Creation (MCSC), a new task that transforms noisy multimodal inputs and user instructions into structured, executable video scripts. We further introduce MCSC-Bench, the first large-scale MCSC dataset, comprising 11K+ well-annotated videos. Each sample includes: (1) redundant multimodal materials and user instructions; (2) a coherent, production-ready script containing material-based shots, newly planned shots (with shooting instructions), and shot-aligned voiceovers. MCSC-Bench supports comprehensive evaluation across material selection, narrative planning, and conditioned script generation, and includes both in-domain and out-of-domain test sets. Experiments show that current multimodal LLMs struggle with structure-aware reasoning under long contexts, highlighting the challenges posed by our benchmark. Models trained on MCSC-Bench achieve SOTA performance, with an 8B model surpassing Gemini-2.5-Pro, and generalize to out-of-domain scenarios. Downstream video generation guided by the generated scripts further validates the practical value of MCSC. Datasets will be public soon.
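The sample structure described in the abstract (material-based shots, newly planned shots with shooting instructions, and shot-aligned voiceovers) can be pictured with a small data model. The following is only an illustrative sketch: the class names, field names, and the precision/recall helper are assumptions made for this example and are not taken from the MCSC-Bench release or its official metrics.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Shot:
    """One script shot (hypothetical schema, not the dataset's actual format)."""
    voiceover: str                               # shot-aligned voiceover text
    material_id: Optional[str] = None            # set when the shot reuses a provided material
    shooting_instruction: Optional[str] = None   # set when the shot is newly planned

    @property
    def is_planned(self) -> bool:
        # A shot with no source material is treated as newly planned.
        return self.material_id is None


@dataclass
class Script:
    """An ordered list of shots forming one video script."""
    shots: List[Shot] = field(default_factory=list)

    def selected_materials(self) -> List[str]:
        return [s.material_id for s in self.shots if s.material_id is not None]

    def planned_shots(self) -> List[Shot]:
        return [s for s in self.shots if s.is_planned]


def selection_precision_recall(predicted: Script, reference: Script) -> Tuple[float, float]:
    """Set-based precision/recall over selected material IDs.

    This is an assumed, simplified stand-in for a material-selection score,
    not the metric defined by the paper.
    """
    pred = set(predicted.selected_materials())
    ref = set(reference.selected_materials())
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall


# Toy reference script: two material-based shots and one planned shot.
reference = Script([
    Shot("Opening view of the product.", material_id="m1"),
    Shot("Close-up of the label.", shooting_instruction="Macro shot, soft light."),
    Shot("Customer using the product.", material_id="m3"),
])

# Toy model output: keeps m1 but picks a redundant material m2 instead of m3.
predicted = Script([
    Shot("Here is the product.", material_id="m1"),
    Shot("Another angle.", material_id="m2"),
])

precision, recall = selection_precision_recall(predicted, reference)
```

In this toy case the predicted script selects one of the two reference materials and one redundant material, so both precision and recall are 0.5. Narrative planning and voiceover quality would need separate, richer measures that this sketch does not attempt.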

