SD · CL · Feb 9

Beyond Transcripts: A Renewed Perspective on Audio Chaptering

arXiv:2602.08979v1 · 1 citation · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses audio chaptering for navigating podcasts and lectures, offering incremental improvements with new evaluation protocols and insights into acoustic features.

The paper tackles audio chaptering by comparing text-based, audio-only, and multimodal approaches. It finds that the novel AudioSeg architecture outperforms text-based methods and identifies pauses and transcript quality as key factors affecting performance.

Audio chaptering, the task of automatically segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions unresolved about leveraging audio information, handling ASR errors, and transcript-free evaluation. We address these gaps through three contributions: (1) a systematic comparison between text-based models with acoustic features, a novel audio-only architecture (AudioSeg) operating on learned audio representations, and multimodal LLMs; (2) empirical analysis of factors affecting performance, including transcript quality, acoustic features, duration, and speaker composition; and (3) formalized evaluation protocols contrasting transcript-dependent text-space protocols with transcript-invariant time-space protocols. Our experiments on YTSeg reveal that AudioSeg substantially outperforms text-based approaches, pauses provide the largest acoustic gains, and MLLMs remain limited by context length and weak instruction following, yet MLLMs are promising on shorter audio.
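
The difference between text-space and time-space evaluation may be easier to see with a small sketch. The abstract does not say which metrics the protocols use, so everything below is an assumption for illustration only: chapter boundaries are compared as timestamps in seconds, a prediction counts as correct when it falls within a tolerance window of an unmatched reference boundary, and the function name `boundary_f1` and the 5-second default tolerance are hypothetical. Because it works purely on timestamps, a score of this kind does not depend on any transcript or ASR system.

```python
from typing import List, Tuple


def boundary_f1(pred: List[float], ref: List[float],
                tol: float = 5.0) -> Tuple[float, float, float]:
    """Hypothetical time-space score: precision, recall, and F1 over
    chapter boundaries given as timestamps in seconds.

    A predicted boundary matches a reference boundary when they lie
    within `tol` seconds of each other; each boundary is matched at
    most once. This is an illustrative assumption, not the paper's
    documented protocol.
    """
    pred = sorted(pred)
    ref = sorted(ref)
    matched = 0
    i = j = 0
    # Two-pointer sweep over both sorted lists.
    while i < len(pred) and j < len(ref):
        if abs(pred[i] - ref[j]) <= tol:
            matched += 1
            i += 1
            j += 1
        elif pred[i] < ref[j]:
            i += 1  # prediction too early to match anything remaining
        else:
            j += 1  # reference boundary was missed
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if matched else 0.0
    return precision, recall, f1
```

A text-space counterpart would instead compare boundaries as sentence or token indices in a transcript, so its scores would shift whenever the transcript changes, which is the dependence the time-space formulation is meant to avoid.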

