LLM-Guided Reinforcement Learning for Audio-Visual Speech Enhancement

Chih-Ning Chen, Jen-Cheng Hou, Hsin-Min Wang, Shao-Yi Chien, Yu Tsao, Fan-Gang Zeng

arXiv:2603.13952

AI Analysis

This work aims to improve speech enhancement quality for applications such as hearing aids and communication systems. The contribution is incremental: it builds on existing AVSE methods and adds a novel reward mechanism.

The paper addresses the poor correlation between traditional training objectives and perceptual quality in audio-visual speech enhancement. It proposes a reinforcement learning framework with an LLM-based interpretable reward model, which outperforms baselines in PESQ, STOI, neural quality metrics, and subjective tests on the AVSEC-4 dataset.

In existing Audio-Visual Speech Enhancement (AVSE) methods, objectives such as Scale-Invariant Signal-to-Noise Ratio (SI-SNR) and Mean Squared Error (MSE) are widely used; however, they often correlate poorly with perceptual quality and provide limited interpretability for optimization. This work proposes a reinforcement learning-based AVSE framework with a Large Language Model (LLM)-based interpretable reward model. An audio LLM generates natural language descriptions of enhanced speech, which are converted by a sentiment analysis model into a 1-5 rating score serving as the PPO reward for fine-tuning a pretrained AVSE model. Compared with scalar metrics, LLM-generated feedback is semantically rich and explicitly describes improvements in speech quality. Experiments on the 4th COG-MHEAR AVSE Challenge (AVSEC-4) dataset show that the proposed method outperforms a supervised baseline and a DNSMOS-based RL baseline in PESQ, STOI, neural quality metrics, and subjective listening tests.
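The reward path described in the abstract (audio LLM description, then sentiment model, then a 1-5 score used as the PPO reward) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not say how the sentiment output becomes a rating, so the sketch assumes a 5-class sentiment distribution reduced to its expected value. The callables `describe` and `sentiment_probs` are hypothetical placeholders for the audio LLM and the sentiment model, and `ppo_clipped_objective` is the standard per-sample clipped surrogate from PPO, shown without any detail on how the paper computes advantages.

```python
import math
from typing import Any, Callable, Sequence


def rating_from_probs(probs: Sequence[float]) -> float:
    """Expected 1-5 rating from a 5-class sentiment distribution.

    Assumption: class i (0-based) corresponds to rating i + 1.
    """
    if len(probs) != 5:
        raise ValueError("expected exactly 5 class probabilities")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum((i + 1) * p for i, p in enumerate(probs)) / total


def reward_for_clip(
    enhanced_audio: Any,
    describe: Callable[[Any], str],                   # hypothetical audio-LLM wrapper
    sentiment_probs: Callable[[str], Sequence[float]],  # hypothetical sentiment model
) -> float:
    """Turn one enhanced utterance into a scalar reward in [1, 5]."""
    description = describe(enhanced_audio)
    return rating_from_probs(sentiment_probs(description))


def ppo_clipped_objective(
    logp_new: float, logp_old: float, advantage: float, eps: float = 0.2
) -> float:
    """Standard PPO clipped surrogate for a single sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Stand-in models, for illustration only.
    fake_describe = lambda audio: "Clear speech with little residual noise."
    fake_sentiment = lambda text: [0.0, 0.0, 0.0, 0.5, 0.5]
    print(reward_for_clip(None, fake_describe, fake_sentiment))
```

Using the expected value rather than the arg-max class keeps the reward continuous, which is one plausible design choice; a hard 1-5 label would work with the same interface.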
