CV · SD · AS · May 27, 2025

Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing

arXiv:2506.02012v1 · 1 citation · h-index: 19 · APSIPA
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of improving VSR accuracy for applications such as assistive technology. The advance is incremental, since it builds on existing methods for integrating LLMs.

This paper tackles the problem of effectively integrating Large Language Models (LLMs) into Visual Speech Recognition (VSR) systems and reports significant performance improvements from model scaling, context-aware decoding, and iterative polishing.

Visual Speech Recognition (VSR) transcribes speech by analyzing lip movements. Recently, Large Language Models (LLMs) have been integrated into VSR systems, leading to notable performance improvements. However, the potential of LLMs has not been extensively studied, and how to use LLMs effectively in VSR tasks remains largely unexplored. This paper systematically explores how to better leverage LLMs for VSR tasks and makes three key contributions: (1) Scaling Test: We study how LLM size affects VSR performance, confirming a scaling law in the VSR task. (2) Context-Aware Decoding: We add contextual text to guide LLM decoding, improving recognition accuracy. (3) Iterative Polishing: We propose iteratively refining LLM outputs, progressively reducing recognition errors. Extensive experiments demonstrate that, with these designs, the potential of LLMs can be largely harnessed, leading to significant improvements in VSR performance.
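To make contributions (2) and (3) concrete, here is a minimal sketch of how context-aware decoding and iterative polishing might be combined in a single loop. It is not the authors' implementation: the abstract does not specify the prompt format, the source of the contextual text, or the stopping rule. The names `Decoder`, `build_prompt`, `polish`, and `max_rounds`, and the prompt wording, are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical decoder interface (an assumption, not from the paper):
# takes visual features and a text prompt, returns a transcript.
Decoder = Callable[[object, str], str]


def build_prompt(context: Optional[str], draft: Optional[str]) -> str:
    """Assemble the text prompt; the wording here is illustrative only."""
    parts = []
    if context:
        # Context-aware decoding: prepend contextual text to guide the LLM.
        parts.append(f"Context: {context}")
    if draft:
        # Iterative polishing: feed the previous output back for refinement.
        parts.append(f"Previous transcript: {draft}")
    parts.append("Transcribe the speech in the video:")
    return "\n".join(parts)


def polish(decode: Decoder, features: object,
           context: Optional[str] = None, max_rounds: int = 3) -> Optional[str]:
    """Re-decode up to max_rounds times, stopping early once the output is stable."""
    draft: Optional[str] = None
    for _ in range(max_rounds):
        new_draft = decode(features, build_prompt(context, draft))
        if new_draft == draft:
            break  # assumed stopping rule: no change between rounds
        draft = new_draft
    return draft
```

In this sketch the first round decodes from the visual features and the optional context alone. Each later round also sees the previous transcript, so the decoder can correct it. Stopping when the output no longer changes is one plausible criterion; a fixed number of rounds would work equally well as an assumption.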
