AS AI CL LG SD | Jan 14, 2025

SEAL: Speaker Error Correction using Acoustic-conditioned Large Language Models

Anurag Kumar, Rohit Paturi, Amber Afshan, Sundararajan Srinivasan

arXiv:2501.08421v1 | 2.3 | 2 citations | h-index: 7 | ICASSP

Originality: Incremental advance

AI Analysis

This work addresses speaker error correction in ASR systems, offering a domain-specific improvement for speech processing applications.

The paper tackles speaker errors produced by speaker diarization in ASR pipelines by introducing an acoustic conditioning approach and a constrained decoding strategy for LLMs, reducing speaker error rates by 24–43% across multiple datasets.
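The summary does not say whether the 24–43% figures are absolute or relative. Assuming they are relative reductions against the first-pass Acoustic SD speaker error rate, they follow the usual formula sketched below. The function name and the error rates in the usage line are illustrative only, not results from the paper.

```python
def relative_reduction(baseline: float, corrected: float) -> float:
    """Relative error-rate reduction in percent.

    Assumption: the reported figures compare the corrected system's speaker
    error rate against the first-pass (baseline) rate on the same data.
    """
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline - corrected) / baseline


if __name__ == "__main__":
    # Hypothetical rates: a 10.0% baseline corrected down to 6.0%.
    print(relative_reduction(10.0, 6.0))  # 40.0
```

Under this reading, a 40% reduction means the corrected system makes 40% fewer speaker errors than the baseline, not that its error rate dropped by 40 percentage points.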

Speaker Diarization (SD) is a crucial component of modern end-to-end ASR pipelines. Traditional SD systems, which are typically audio-based and operate independently of ASR, often introduce speaker errors, particularly during speaker transitions and overlapping speech. Recently, language models, including fine-tuned large language models (LLMs), have been shown to be effective as second-pass speaker error correctors by leveraging lexical context in the transcribed output. In this work, we introduce a novel acoustic conditioning approach that provides more fine-grained information from the acoustic diarizer to the LLM. We also show that a simpler constrained decoding strategy reduces LLM hallucinations while avoiding complicated post-processing. Our approach reduces speaker error rates by 24–43% across the Fisher, Callhome, and RT03-CTS datasets, compared to the first-pass Acoustic SD.
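The abstract does not spell out how the constrained decoding works. One common way to stop an LLM from hallucinating words during speaker correction is to restrict every decoding step so the model can only copy the next transcript word or insert a speaker tag before it. The sketch below illustrates that general idea, not necessarily SEAL's method. `constrained_decode`, `words_to_labels`, `toy_score`, and the `<spk1>`/`<spk2>` tags are all hypothetical, and `toy_score` is a rule-based stand-in for an LLM's log-probabilities.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical speaker tags; assumed never to collide with transcript words.
SPEAKERS = ["<spk1>", "<spk2>"]

# Hypothetical scorer interface standing in for an LLM:
# (tokens emitted so far, candidate token) -> score (higher is better).
Scorer = Callable[[Sequence[str], str], float]


def constrained_decode(words: Sequence[str], speakers: Sequence[str], score: Scorer) -> List[str]:
    """Greedy decoding restricted to copying input words or inserting speaker tags.

    Words can never be inserted, dropped, or altered, so the output transcript
    always matches the ASR input; only the speaker assignment can change.
    """
    out: List[str] = []
    i = 0
    while i < len(words):
        candidates: List[Tuple[str, bool]]  # (token, is_word)
        if not out:
            # The output must open with a speaker tag.
            candidates = [(s, False) for s in speakers]
        elif out[-1] in speakers:
            # A tag must be followed by the next word (prevents tag loops).
            candidates = [(words[i], True)]
        else:
            # Either continue with the next word or start a new turn.
            candidates = [(words[i], True)] + [(s, False) for s in speakers]
        tok, is_word = max(candidates, key=lambda c: score(out, c[0]))
        out.append(tok)
        if is_word:
            i += 1
    return out


def words_to_labels(tokens: Sequence[str], speakers: Sequence[str]) -> List[Optional[str]]:
    """Map the tagged token stream back to one speaker label per word."""
    labels: List[Optional[str]] = []
    current: Optional[str] = None
    for tok in tokens:
        if tok in speakers:
            current = tok
        else:
            labels.append(current)
    return labels


def toy_score(prefix: Sequence[str], tok: str) -> float:
    """Rule-based stand-in for an LLM: switch speakers after a question."""
    last_word = next((t for t in reversed(prefix) if t not in SPEAKERS), None)
    current = next((t for t in reversed(prefix) if t in SPEAKERS), None)
    if tok in SPEAKERS:
        if current is None:
            return 1.0 if tok == SPEAKERS[0] else 0.0
        if last_word is not None and last_word.endswith("?") and tok != current:
            return 1.0
        return -1.0
    return 0.0  # copying the next word is the neutral default


if __name__ == "__main__":
    tokens = constrained_decode(["how", "are", "you?", "fine", "thanks"], SPEAKERS, toy_score)
    print(tokens)
    print(words_to_labels(tokens, SPEAKERS))
```

Because every non-tag token is copied from the input, the corrected labels line up one-to-one with the ASR words, so no post-hoc text alignment is needed to map them back; this is one plausible reading of how constraining the decoder avoids complicated post-processing.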

