SD AI AS Sep 23, 2025

Explore the Reinforcement Learning for the LLM based ASR and TTS system

arXiv:2509.18569v1, 2 citations, h-index: 18
Originality: Incremental advance
AI Analysis

This work addresses the problem of improving ASR and TTS systems for audio processing applications, but it appears incremental as it builds on existing RL methods like GRPO and DiffRO.

The study tackled the underexplored application of reinforcement learning (RL) to automatic speech recognition (ASR) and text-to-speech (TTS) systems using large language models, proposing a lightweight RL framework and demonstrating that RL significantly enhances performance in both tasks with limited training data and optimization steps.

In recent years, large language models (LLMs) have played an important role in automatic speech recognition (ASR) and text-to-speech (TTS) systems. While reinforcement learning (RL) has significantly enhanced LLM performance in text-based tasks, its application to ASR and TTS remains underexplored due to the complexity of training audio-based models. In this study, we propose a lightweight RL framework tailored for audio-based LLMs that can process audio inputs and generate audio outputs. Based on this framework, we evaluate the effectiveness of reinforcement learning on both ASR and TTS tasks. For the ASR task, we experiment with different rule-based reward functions within the Group Relative Policy Optimization (GRPO) framework and investigate the impact of RL data construction. For the TTS task, we compare GRPO with Differentiable Reward Optimization (DiffRO) and further combine the two approaches to achieve improved performance. Our experiments demonstrate that RL can significantly enhance the performance of both ASR and TTS systems, even with limited training data and a small number of optimization steps.
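
To make the ASR setup in the abstract more concrete, the following is a minimal sketch of how a rule-based reward can feed GRPO's group-relative advantage. It is not the authors' implementation. The abstract does not say which reward functions were tried, so the reward here (one minus the word error rate, clipped to [0, 1]) is an assumption chosen for illustration, and the names `word_error_rate`, `wer_reward`, and `group_relative_advantages` are hypothetical. The advantage follows the standard GRPO formulation: each sampled hypothesis's reward is normalized by the mean and standard deviation of its own group.

```python
from statistics import mean, pstdev


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def wer_reward(reference: str, hypothesis: str) -> float:
    """Assumed rule-based reward: 1 - WER, clipped so it stays in [0, 1]."""
    return 1.0 - min(word_error_rate(reference, hypothesis), 1.0)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    # Hypothetical group of transcripts sampled for one utterance.
    group = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "cat sat mat",
    ]
    rewards = [wer_reward(reference, hyp) for hyp in group]
    for hyp, adv in zip(group, group_relative_advantages(rewards)):
        print(f"{adv:+.3f}  {hyp}")
```

Because the advantage is relative to the group, a group in which every hypothesis earns the same reward contributes no gradient signal. This is one reason the choice of RL training data, which the abstract says the authors investigate, can matter in practice.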

