CL · Jun 28, 2024

TreeSeg: Hierarchical Topic Segmentation of Large Transcripts

arXiv:2407.12028v1 · 3 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient, accurate segmentation of noisy transcripts, such as those from meetings or videos, for applications like content organization and LLM processing. The advance is incremental: it builds on existing embedding and clustering techniques.

The paper tackles topic segmentation of large transcripts, which matters both for organizing content and for fitting inputs into LLM context windows. It presents TreeSeg, a method that uses embedding models and divisive clustering to create hierarchical segmentations, and reports that it outperforms baselines on the ICSI and AMI corpora.

From organizing recorded videos and meetings into chapters, to breaking down large inputs in order to fit them into the context window of commoditized Large Language Models (LLMs), topic segmentation of large transcripts emerges as a task of increasing significance. Still, accurate segmentation presents many challenges, including (a) the noisy nature of the Automatic Speech Recognition (ASR) software typically used to obtain the transcripts, (b) the lack of diverse labeled data and (c) the difficulty in pin-pointing the ground-truth number of segments. In this work we present TreeSeg, an approach that combines off-the-shelf embedding models with divisive clustering, to generate hierarchical, structured segmentations of transcripts in the form of binary trees. Our approach is robust to noise and can handle large transcripts efficiently. We evaluate TreeSeg on the ICSI and AMI corpora, demonstrating that it outperforms all baselines. Finally, we introduce TinyRec, a small-scale corpus of manually annotated transcripts, obtained from self-recorded video sessions.
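
The idea of divisive clustering over a sequence of utterance embeddings can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state TreeSeg's split criterion or stopping rule, so the within-segment squared-error cost, the `max_depth` limit, and all names here (`Node`, `split_segment`, `build_tree`, `leaves`) are assumptions made for illustration. The embeddings are taken as plain lists of floats, standing in for the output of any off-the-shelf embedding model.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


@dataclass
class Node:
    """A contiguous span [start, end) of utterances in the binary tree."""
    start: int
    end: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def _sse(vectors: Sequence[Vector], start: int, end: int) -> float:
    """Sum of squared distances from each vector in [start, end) to their mean."""
    n = end - start
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors[start:end]) / n for d in range(dim)]
    return sum((v[d] - mean[d]) ** 2 for v in vectors[start:end] for d in range(dim))


def split_segment(vectors: Sequence[Vector], start: int, end: int,
                  min_size: int = 1) -> Optional[int]:
    """Return the cut index that minimizes the summed cost of both halves.

    Assumed criterion (not taken from the paper): within-segment squared error.
    """
    best_cut, best_cost = None, float("inf")
    for cut in range(start + min_size, end - min_size + 1):
        cost = _sse(vectors, start, cut) + _sse(vectors, cut, end)
        if cost < best_cost:
            best_cut, best_cost = cut, cost
    return best_cut


def build_tree(vectors: Sequence[Vector], start: int = 0, end: Optional[int] = None,
               max_depth: int = 2, min_size: int = 1) -> Node:
    """Recursively bisect the span top-down, producing a binary tree of segments."""
    if end is None:
        end = len(vectors)
    node = Node(start, end)
    # Assumed stopping rule: a depth limit and a minimum segment size.
    if max_depth == 0 or end - start < 2 * min_size:
        return node
    cut = split_segment(vectors, start, end, min_size)
    if cut is None:
        return node
    node.left = build_tree(vectors, start, cut, max_depth - 1, min_size)
    node.right = build_tree(vectors, cut, end, max_depth - 1, min_size)
    return node


def leaves(node: Node) -> List[Tuple[int, int]]:
    """Flatten the tree into its leaf segments, in transcript order."""
    if node.left is None or node.right is None:
        return [(node.start, node.end)]
    return leaves(node.left) + leaves(node.right)
```

For example, given six toy embeddings where the first three cluster near the origin and the last three near (5, 5), `leaves(build_tree(vectors, max_depth=1))` yields the two segments `(0, 3)` and `(3, 6)`. Reading the tree at different depths gives coarser or finer segmentations, which is one way a hierarchical output sidesteps having to fix the number of segments in advance.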

