Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
This work addresses efficiency challenges in AVSR for noisy environments, though the contribution is incremental because it builds on existing LLM and compression methods.
The paper tackles the high computational cost of audio-visual speech recognition with large language models by proposing Llama-MTSK, a Matryoshka-based multimodal LLM that adapts audio-visual token allocation under varying compute constraints. On major AVSR datasets, it matches or outperforms models trained at fixed compression levels.
Audio-Visual Speech Recognition (AVSR) leverages audio and visual modalities to improve robustness in noisy environments. Recent advances in Large Language Models (LLMs) show strong performance in speech recognition, including AVSR. However, the long speech representations lead to high computational costs for LLMs. Prior methods compress inputs before feeding them to LLMs, but high compression often harms accuracy. To address this, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR, which flexibly adapts audio-visual token allocation under varying compute constraints. Inspired by Matryoshka Representation Learning, our model encodes representations at multiple granularities with a single architecture, avoiding the need for separate models. For efficient fine-tuning, we introduce three LoRA-based strategies using global and scale-specific modules. Evaluations on major AVSR datasets show Llama-MTSK matches or outperforms models trained at fixed compression levels.
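To make the idea of encoding representations at multiple granularities concrete, the following is a minimal sketch, not the paper's implementation. It assumes that compression is done by average pooling over consecutive tokens and that the compression rates are 1, 2, and 4; the abstract specifies neither. The names `avg_pool_tokens` and `multi_scale_views` are hypothetical and used only for illustration.

```python
from typing import Dict, List, Sequence

Token = List[float]


def avg_pool_tokens(tokens: Sequence[Token], rate: int) -> List[Token]:
    """Compress a token sequence by averaging each group of `rate` consecutive tokens.

    Assumption: average pooling stands in for whatever compression the model uses.
    A trailing group shorter than `rate` is averaged over its actual length.
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    pooled: List[Token] = []
    for start in range(0, len(tokens), rate):
        group = tokens[start:start + rate]
        dim = len(group[0])
        pooled.append([sum(tok[d] for tok in group) / len(group) for d in range(dim)])
    return pooled


def multi_scale_views(tokens: Sequence[Token], rates: Sequence[int]) -> Dict[int, List[Token]]:
    """Build one compressed view of the same sequence per compression rate.

    In a Matryoshka-style setup, a single model would be trained on all of
    these views so that any one of them can be selected at inference time.
    """
    return {rate: avg_pool_tokens(tokens, rate) for rate in rates}


# Toy sequence of 8 two-dimensional "audio tokens" (hypothetical data).
audio_tokens = [[float(i), float(i) * 2.0] for i in range(8)]

# Hypothetical compression rates; higher rates mean fewer tokens for the LLM.
views = multi_scale_views(audio_tokens, rates=(1, 2, 4))

for rate, view in views.items():
    print(f"rate={rate}: {len(view)} tokens")
```

Under these assumptions, a higher rate shortens the sequence passed to the LLM and so lowers its compute cost, while the rate-1 view keeps every token. Choosing a rate at inference time is how a single model could trade accuracy against the available compute budget.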