Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild

Berkin Durmus, Chen Cen, Eduardo Pacheco, Arda Okan, Atila Orhon

arXiv:2604.07354

Predicted impact: top 11% in CL · last 90 days
Originality: Incremental advance

AI Analysis

The paper provides a standardized benchmark for contextual speech-to-text, addressing the gap between academic benchmarks and industrial needs in high-stakes domains.

The paper introduces Contextual Earnings-22, a benchmark for contextual speech recognition with custom vocabulary, and shows that keyword prompting and keyword boosting both reach comparable, significantly improved accuracy when scaled from proof-of-concept to large-scale systems.

The accuracy frontier of speech-to-text systems has plateaued on academic benchmarks. In contrast, industrial benchmarks and adoption in high-stakes domains suggest otherwise. We hypothesize that the primary difference between the two is contextual conditioning: academic benchmarks are dominated by frequently encountered general vocabulary that is relatively easy to recognize compared with rare and context-defined custom vocabulary that has disproportionate impact on the usability of speech transcripts. Despite progress on contextual speech-to-text, there is no standardized benchmark. We introduce Contextual Earnings-22, an open dataset built upon Earnings-22, with realistic custom vocabulary contexts to foster research and reveal latent progress. We set six strong baselines for two dominant approaches: keyword prompting and keyword boosting. Experiments show both reach comparable and significantly improved accuracy when scaled from proof-of-concept to large-scale systems.
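
The hypothesis above can be illustrated with a small sketch: an aggregate word error rate may look acceptable while half of the custom vocabulary is lost. This is only an illustration and not the paper's evaluation protocol. The function names, the term "zentrix", the sample sentences, and the choice of keyword recall as the keyword-level metric are all assumptions made for this example.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    return edit_distance(ref, hyp) / len(ref)


def keyword_recall(reference, hypothesis, keywords):
    """Fraction of keyword occurrences in the reference found in the hypothesis."""
    kw = {k.lower() for k in keywords}
    ref_counts = Counter(w for w in reference.lower().split() if w in kw)
    hyp_counts = Counter(hypothesis.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        raise ValueError("reference contains none of the keywords")
    hits = sum(min(count, hyp_counts[word]) for word, count in ref_counts.items())
    return hits / total


# Hypothetical example: "zentrix" is an invented company name (custom vocabulary).
reference = "zentrix reported adjusted ebitda growth of ten percent this quarter"
hypothesis = "zen tricks reported adjusted ebitda growth of ten percent this quarter"
keywords = ["zentrix", "ebitda"]

wer = word_error_rate(reference, hypothesis)              # 2 errors / 10 words = 0.2
recall = keyword_recall(reference, hypothesis, keywords)  # 1 of 2 keywords = 0.5
print(f"WER: {wer:.2f}, keyword recall: {recall:.2f}")
```

In this toy case most of the transcript is correct, yet the one term a reader would most likely search for is missing, which is the kind of gap a custom-vocabulary benchmark is meant to expose.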
