CL AI LG SD · Sep 4, 2025

PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation

Jiajun He, Naoki Sawada, Koichi Miyazaki, Tomoki Toda

arXiv:2509.04357v1 · h-index: 55

Originality: Incremental advance

AI Analysis

This paper addresses the accurate recognition of domain-specific named entities in ASR and represents an incremental improvement over prior contextual ASR methods.

The paper tackles the recognition of domain-specific named entities in ASR systems, especially homophones. It proposes PARCO, which combines phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering to improve phonetic discrimination and ensure complete entity retrieval. Experiments show that it achieves a CER of 4.22% on Chinese AISHELL-1 and a WER of 11.14% on English DATA2 with 1,000 distractors, outperforming baselines, and that it delivers robust gains on the out-of-domain datasets THCHS-30 and LibriSpeech.

Automatic speech recognition (ASR) systems struggle with domain-specific named entities, especially homophones. Contextual ASR improves recognition but often fails to capture fine-grained phoneme variations due to limited entity diversity. Moreover, prior methods treat entities as independent tokens, leading to incomplete multi-token biasing. To address these issues, we propose Phoneme-Augmented Robust Contextual ASR via COntrastive entity disambiguation (PARCO), which integrates phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering. These components enhance phonetic discrimination, ensure complete entity retrieval, and reduce false positives under uncertainty. Experiments show that PARCO achieves CER of 4.22% on Chinese AISHELL-1 and WER of 11.14% on English DATA2 under 1,000 distractors, significantly outperforming baselines. PARCO also demonstrates robust gains on out-of-domain datasets like THCHS-30 and LibriSpeech.
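
The abstract reports results as CER (character error rate) and WER (word error rate). As a reminder of what those numbers measure, here is a minimal sketch of the standard definition: the Levenshtein edit distance between reference and hypothesis, divided by the reference length. The function names (`edit_distance`, `error_rate`, `cer`, `wer`) and the example sentences are illustrative assumptions, not taken from the paper, and the authors' actual scoring pipeline (text normalization, tokenization) may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by reference length."""
    if not ref_tokens:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(ref, hyp):
    """Character error rate; spaces are ignored (an assumption of this sketch)."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


def wer(ref, hyp):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(ref.split(), hyp.split())


if __name__ == "__main__":
    # One entity word misrecognized out of four reference words.
    print(wer("call doctor smith now", "call doctor smyth now"))
    # One character substituted out of four reference characters.
    print(cer("北京大学", "北京大雪"))
```

In this sketch a single misrecognized entity in a four-word utterance already costs 25 points of WER, which illustrates why entity errors can weigh heavily on these metrics for short, entity-dense utterances.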
