CV · Jan 12

HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression

arXiv:2601.07366v12.81 citations

Originality: Incremental advance

AI Analysis

The paper addresses the problem of creating coherent, high-level stories from fast-paced, information-dense e-commerce videos, with applications such as video summarization. The advance is incremental because it builds on existing video captioning approaches.

The paper tackles structured narration generation for e-commerce videos. It introduces the E-HVC dataset with dual-granularity annotations and the HiVid-Narrator framework, which pairs a staged annotation construction with the SPA-Compressor for token compression, and it reports superior narrative quality with fewer input tokens than existing methods.

Generating structured narrations for real-world e-commerce videos requires models to perceive fine-grained visual details and organize them into coherent, high-level stories, capabilities that existing approaches struggle to unify. We introduce the E-commerce Hierarchical Video Captioning (E-HVC) dataset with dual-granularity, temporally grounded annotations: a Temporal Chain-of-Thought that anchors event-level observations and Chapter Summaries that compose them into concise, story-centric summaries. Rather than directly prompting chapters, we adopt a staged construction that first gathers reliable linguistic and visual evidence via curated ASR and frame-level descriptions, then refines coarse annotations into precise chapter boundaries and titles conditioned on the Temporal Chain-of-Thought, yielding fact-grounded, time-aligned narratives. We also observe that e-commerce videos are fast-paced and information-dense, with visual tokens dominating the input sequence. To enable efficient training while reducing input tokens, we propose the Scene-Primed ASR-anchored Compressor (SPA-Compressor), which compresses multimodal tokens into hierarchical scene and event representations guided by ASR semantic cues. Built upon these designs, our HiVid-Narrator framework achieves superior narrative quality with fewer input tokens compared to existing methods.
