AS · AI · Apr 9

TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs

arXiv:2604.08384 · 22.7
Predicted impact: top 25% in AS · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficient cross-modal alignment and robust low-resource adaptation for speech LLMs, offering an incremental improvement over existing text-only alignment methods.

The paper targets the cost of collecting audio-text pairs for speech LLM post-training. It proposes TASU2, a controllable CTC simulation framework that generates CTC posteriors under a specified WER range. The authors report better in-domain and out-of-domain recognition than the prior TASU method, as well as gains over baselines such as text-only fine-tuning and TTS-based augmentation.

Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcripts, but they provide limited control over uncertainty and error rate, making curriculum design largely heuristic. We propose TASU2, a controllable CTC simulation framework that simulates CTC posterior distributions under a specified WER range, producing text-derived supervision that better matches the acoustic decoding interface. This enables principled post-training curricula that smoothly vary supervision difficulty without TTS. Across multiple source-to-target adaptation settings, TASU2 improves in-domain and out-of-domain recognition over TASU, and consistently outperforms strong baselines including text-only fine-tuning and TTS-based augmentation, while mitigating source-domain performance degradation.
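The abstract does not spell out how the simulation works, so the following is only a toy sketch of the general idea of building CTC-style posteriors from a transcript under a target error range. It is not the TASU2 algorithm. Everything here is an assumption made for illustration: the function names, the blank index of 0, the substitution-only error model, the use of token error rate as a stand-in for WER, and the smoothed one-hot frames.

```python
import random

BLANK = 0  # assumed index of the CTC blank symbol


def simulate_ctc_posteriors(tokens, vocab_size, err_range, peak=0.9, seed=0):
    """Toy simulation (hypothetical, not the paper's method).

    tokens:     list of token ids in 1..vocab_size-1
    vocab_size: number of symbols including the blank; must be at least 3
    err_range:  (low, high) target token substitution rate
    peak:       probability mass placed on the emitted symbol of each frame
    Returns (frames, corrupted_tokens); each frame is a list of probabilities.
    """
    rng = random.Random(seed)
    rate = rng.uniform(*err_range)

    # Substitute a fixed number of tokens so the error rate is controlled.
    n_err = round(rate * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_err))
    corrupted = []
    for i, tok in enumerate(tokens):
        if i in positions:
            tok = rng.choice([v for v in range(1, vocab_size) if v != tok])
        corrupted.append(tok)

    # Build a CTC-style frame path: each token held for 1 to 3 frames,
    # followed by one blank frame so repeated tokens stay distinguishable.
    path = []
    for tok in corrupted:
        path.extend([tok] * rng.randint(1, 3))
        path.append(BLANK)

    # Smoothed one-hot distribution per frame; each row sums to 1.
    rest = (1.0 - peak) / (vocab_size - 1)
    frames = [[peak if v == sym else rest for v in range(vocab_size)]
              for sym in path]
    return frames, corrupted


def greedy_ctc_decode(frames):
    """Standard greedy CTC decoding: argmax, collapse repeats, drop blanks."""
    out, prev = [], BLANK
    for frame in frames:
        sym = max(range(len(frame)), key=frame.__getitem__)
        if sym != BLANK and sym != prev:
            out.append(sym)
        prev = sym
    return out


reference = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
frames, corrupted = simulate_ctc_posteriors(reference, vocab_size=12,
                                            err_range=(0.2, 0.4))
decoded = greedy_ctc_decode(frames)
errors = sum(a != b for a, b in zip(reference, decoded))
print(f"token error rate of simulated posteriors: {errors / len(reference):.2f}")
```

In this sketch, widening or shifting `err_range` changes how noisy the text-derived supervision is, which is the kind of knob a difficulty curriculum could vary. A realistic simulator would also need insertions, deletions, and less peaked distributions.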

