CL · AI · Sep 18, 2025

Cross-Modal Knowledge Distillation for Speech Large Language Models

arXiv:2509.14930v1 · 7 citations · h-index: 1
Originality: Incremental advance
AI Analysis

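This paper addresses the loss of knowledge and reasoning performance that occurs when speech capabilities are added to large language models, a problem relevant to both researchers and practitioners. The contribution appears incremental, since it builds on existing knowledge distillation techniques.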

The paper tackles catastrophic forgetting and modality inequivalence in speech large language models, showing that adding speech capabilities degrades knowledge and reasoning even with text inputs, with further drops for spoken queries. It proposes a cross-modal knowledge distillation framework using text-to-text and speech-to-text channels to transfer knowledge from a text-based teacher model, validating its effectiveness in preserving textual knowledge, improving cross-modal alignment, and enhancing reasoning in speech-based interactions.

In this work, we present the first systematic evaluation of catastrophic forgetting and modality inequivalence in speech large language models, showing that introducing speech capabilities can degrade knowledge and reasoning even when inputs remain textual, and performance further decreases with spoken queries. To address these challenges, we propose a cross-modal knowledge distillation framework that leverages both text-to-text and speech-to-text channels to transfer knowledge from a text-based teacher model to a speech LLM. Extensive experiments on dialogue and audio understanding tasks validate the effectiveness of our approach in preserving textual knowledge, improving cross-modal alignment, and enhancing reasoning in speech-based interactions.
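The abstract does not spell out the training objective, so the following is only a minimal sketch of how a two-channel distillation loss could be combined, not the authors' implementation. It assumes a generic soft-label objective (KL divergence between temperature-softened teacher and student distributions), that the text-based teacher always reads the text form of a query, and that the student is scored once on the same text (text-to-text channel) and once on the spoken version (speech-to-text channel). The function names, the `text_weight` mixing parameter, and the `temperature` default are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one token's logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
    )


def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """Token-averaged KL(teacher || student) over one response sequence."""
    per_token = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)


def cross_modal_kd_loss(teacher_logits, student_text_logits,
                        student_speech_logits, text_weight=0.5,
                        temperature=2.0):
    """Weighted sum of the two channels (the weighting is an assumption).

    teacher_logits:        teacher outputs given the text query
    student_text_logits:   student outputs given the same text query
    student_speech_logits: student outputs given the spoken query
    """
    t2t = distill_loss(teacher_logits, student_text_logits, temperature)
    s2t = distill_loss(teacher_logits, student_speech_logits, temperature)
    return text_weight * t2t + (1.0 - text_weight) * s2t


# Toy logits: 2 response tokens over a 3-word vocabulary.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student_text = [[1.8, 0.6, -0.9], [0.0, 0.3, 2.7]]
student_speech = [[0.5, 0.5, 0.0], [1.0, 0.2, 1.0]]
loss = cross_modal_kd_loss(teacher, student_text, student_speech)
```

In this sketch the text-to-text term penalizes drift from the teacher on textual inputs (the forgetting problem), while the speech-to-text term pulls the student's response to a spoken query toward the teacher's response to the equivalent text (the modality gap).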

