CL SD AS | Dec 4, 2024

ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction

Victor Junqiu Wei, Weicheng Wang, Di Jiang, Yuanfeng Song, Lu Wang

arXiv:2412.03075v16.612 citations | h-index: 13 | EMNLP

Originality: Synthesis-oriented

AI Analysis

This paper addresses ASR error correction for Chinese and provides a new benchmark and methods, but it is incremental in that it applies existing LLM techniques to a specific domain.

The paper tackles the problem of Chinese ASR error correction by creating the first benchmark dataset (ASR-EC) and evaluating large language models (LLMs) across three paradigms, finding that multi-modal augmentation is the most effective method and achieves state-of-the-art performance.

Automatic speech recognition (ASR) is a fundamental task in speech and natural language processing and a building block of many applications, such as voice assistants and speech translation. Despite recent advances in ASR technology, modern ASR systems still produce a substantial number of recognition errors due to environmental noise, ambiguity, and other factors, so ASR error correction is crucial. Motivated by this, this paper studies ASR error correction for Chinese, one of the most widely spoken languages in the world. We first create a benchmark dataset named ASR-EC that contains a wide spectrum of ASR errors generated by industry-grade ASR systems. To the best of our knowledge, it is the first Chinese ASR error correction benchmark. Then, inspired by recent advances in large language models (LLMs), we investigate how to harness LLMs to correct ASR errors, applying them in three paradigms. The first paradigm is prompting, which is further categorized as zero-shot, few-shot, and multi-step. The second is finetuning, which finetunes LLMs on ASR error correction data. The third is multi-modal augmentation, which jointly uses the audio and the ASR transcripts for error correction. Extensive experiments reveal that prompting is not effective for ASR error correction, that finetuning is effective for only a portion of the LLMs, and that multi-modal augmentation is the most effective method, achieving state-of-the-art performance.
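To make the prompting paradigm concrete, here is a minimal sketch of how zero-shot and few-shot prompts for ASR error correction might be assembled. The instruction wording, the function names, and the example sentence pair are all assumptions made for illustration; the abstract does not give the paper's actual prompts, and the multi-step variant is left out because the abstract does not describe it.

```python
from typing import List, Tuple

# Hypothetical instruction text; the paper's real prompt wording is not
# stated in the abstract.
INSTRUCTION = (
    "Correct the errors in the following Chinese ASR transcript. "
    "Output only the corrected sentence."
)


def zero_shot_prompt(transcript: str) -> str:
    """Build a prompt that contains only the instruction and the input."""
    return f"{INSTRUCTION}\nTranscript: {transcript}\nCorrected:"


def few_shot_prompt(transcript: str, examples: List[Tuple[str, str]]) -> str:
    """Build a prompt that shows (erroneous, corrected) pairs before the input."""
    demos = "\n".join(
        f"Transcript: {src}\nCorrected: {tgt}" for src, tgt in examples
    )
    return f"{INSTRUCTION}\n{demos}\nTranscript: {transcript}\nCorrected:"


if __name__ == "__main__":
    # Invented homophone error (号 vs. 好), not taken from ASR-EC.
    demo_pairs = [("今天天气真号", "今天天气真好")]
    print(zero_shot_prompt("我想听一首哥"))
    print(few_shot_prompt("我想听一首哥", demo_pairs))
```

Either string would then be sent to whichever LLM is under evaluation, and the model's completion compared against the reference transcript.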
