ASAICLMar 6

X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs

arXiv:2603.245964 citationsh-index: 11
Originality Highly original
AI Analysis

This addresses the problem of capability misalignment in speech LLMs for applications requiring robust multi-modal interactions, representing a novel method for a known bottleneck.

The paper tackled the performance gap between end-to-end speech LLMs and text-based counterparts by proposing X-OPD, a cross-modal on-policy distillation framework, which significantly narrowed this gap in complex tasks while preserving inherent capabilities.

While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit a significant performance degradation compared to their text-based counterparts. The standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs to their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts, where a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling teacher's capabilities into student's multi-modal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap in complex tasks while preserving the model's inherent capabilities.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes