AS, AI, CL
Jul 18, 2024

Handling Numeric Expressions in Automatic Speech Recognition

arXiv:2408.00004v2
h-index: 6
Originality: Incremental advance
AI Analysis

The paper addresses formatting challenges in ASR for applications that require precise numeric output. The advance is incremental because it builds on existing methods.

The paper tackles the problem of correctly formatting numeric expressions such as years and timestamps in ASR transcripts. It compares cascaded and end-to-end approaches and finds that adapted end-to-end models offer competitive performance at lower latency and inference cost.

This paper addresses the problem of correctly formatting numeric expressions in automatic speech recognition (ASR) transcripts. This is challenging because the expected transcript format depends on the context, e.g., 1945 (year) vs. 19:45 (timestamp). We compare cascaded and end-to-end approaches to recognizing and formatting numeric expressions such as years, timestamps, currency amounts, and quantities. For the end-to-end approach, we employ a data generation strategy that uses a large language model (LLM) together with a text-to-speech (TTS) model to generate adaptation data. The results on our test data set show that, while approaches based on LLMs perform well in recognizing formatted numeric expressions, adapted end-to-end models offer competitive performance with the advantage of lower latency and inference cost.
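
To make the context dependence concrete, the following toy sketch formats the same spoken words as either a year or a timestamp depending on the preceding word. It is not the paper's method: the cue-word set, the function names, and the rule-based approach are all assumptions chosen for illustration, and the paper itself compares LLM-based cascades with adapted end-to-end models rather than hand-written rules.

```python
# Toy illustration only: the cue words and rules below are hypothetical,
# not taken from the paper.
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
         "fourteen": 14, "fifteen": 15, "sixteen": 16,
         "seventeen": 17, "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
TIME_CUES = {"at", "until", "by"}  # assumed hints that a timestamp follows


def parse_two_digit(tokens, i):
    """Parse a spoken number 0-99 at tokens[i]; return (value, next_index) or None."""
    tok = tokens[i]
    if tok in TEENS:
        return TEENS[tok], i + 1
    if tok in TENS:
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt in UNITS and UNITS[nxt] > 0:
            return TENS[tok] + UNITS[nxt], i + 2
        return TENS[tok], i + 1
    if tok in UNITS:
        return UNITS[tok], i + 1
    return None


def format_numeric(transcript):
    """Rewrite spoken two-number pairs as a timestamp or a year, based on the previous word."""
    tokens = transcript.lower().split()
    out, i = [], 0
    while i < len(tokens):
        first = parse_two_digit(tokens, i)
        second = None
        if first is not None and first[1] < len(tokens):
            second = parse_two_digit(tokens, first[1])
        if first is not None and second is not None:
            a, b, nxt = first[0], second[0], second[1]
            cue = out[-1] if out else ""
            if cue in TIME_CUES and a < 24 and b < 60:
                out.append(f"{a:02d}:{b:02d}")   # timestamp reading
            else:
                out.append(f"{a}{b:02d}")        # year-style reading
            i = nxt
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(format_numeric("the war ended in nineteen forty five"))
# the war ended in 1945
print(format_numeric("the train leaves at nineteen forty five"))
# the train leaves at 19:45
```

Hand-written cues like these break down quickly on real speech (currency amounts, quantities, and ambiguous prepositions are not handled), which is one way to see why the paper turns to learned approaches instead.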

