SDAIASFeb 9

NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control

arXiv:2602.09070v1
Originality Highly original
AI Analysis

This addresses the problem of generating soundtracks for long videos, which is incremental as it builds on existing vision-language models but introduces novel control mechanisms.

The paper tackles the challenge of synthesizing coherent soundtracks for long-form videos by proposing NarraScore, a hierarchical framework that uses emotion as a narrative proxy, achieving state-of-the-art consistency and narrative alignment with negligible computational overhead.

Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a \textit{Global Semantic Anchor} ensures stylistic stability, while a surgical \textit{Token-Level Affective Adapter} modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes