AS | AI | CL | Sep 23, 2025

Advancing Speech Summarization in Multi-modal LLMs with Reinforcement Learning

arXiv:2509.19631v1 | h-index: 2 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses a practical deployment limitation of open-source MLLMs for speech summarization, though the contribution appears incremental because it builds on existing MLLM capabilities.

The paper tackles speech summarization in multi-modal large language models (MLLMs), an area where open-source MLLMs lag behind text-based LLMs, by proposing a novel multi-stage reinforcement learning training framework. The resulting model outperforms strong baselines and much larger MLLMs, and significantly narrows the gap with state-of-the-art text-based LLMs.

Speech summarization is a critical component of spoken content understanding, particularly in the era of rapidly growing spoken and audiovisual data. Recent advances in multi-modal large language models (MLLMs), leveraging the power of LLMs, enable generating textual summaries directly from speech without intermediate transcriptions, while supporting controllable styles and zero-shot generalization. However, open-source MLLMs continue to lag behind the state-of-the-art text-based LLMs, limiting their practical deployment for speech summarization. In this work, we present a novel multi-stage reinforcement learning training framework to enhance the speech summarization capabilities in MLLMs. Our model delivers substantial improvements over strong baselines, outperforms much larger MLLMs, and significantly narrows the gap with state-of-the-art text-based LLMs.

