SD | CL | AS | Jun 30, 2025

Efficient Interleaved Speech Modeling through Knowledge Distillation

arXiv:2506.23670v2 | 1 citation | h-index: 8
Originality: Incremental advance
AI Analysis

Compact distilled models such as TinyWave enable efficient speech generation for real-time conversational agents, assistive technologies, and low-resource environments.

The paper addresses the deployment of large speech language models in constrained environments. It builds compact speech generation models through knowledge distillation, achieving 3x compression with minimal performance loss: on Libri-Light, the student stays within 1.4 normalized perplexity points of its teacher.

Current speech language models exceed the size and latency constraints of many deployment environments. We build compact, expressive speech generation models through layer-aligned distillation, matching hidden states, attention maps, and softened logits to compress large multimodal transformers by 3x with minimal loss in performance. We introduce TinyWave, a family of 2B-parameter models for speech-to-speech and interleaved speech-text generation, trained on 50,000 hours of public audio. TinyWave supports (i) speech-only generation using phonetic or expressive tokens and (ii) mixed speech-text continuations. Evaluation on Libri-Light shows TinyWave within 1.4 normalized perplexity points of its teacher. Accuracy on spoken StoryCloze and SALMon reaches 93-97% of the teacher's performance, outperforming size-matched baselines. These models are optimized for deployment on commodity hardware, enabling applications in real-time conversational agents, assistive technologies, and low-resource environments. We release models, training code, and evaluation scripts to support reproducible research on compact, expressive speech generation.
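The abstract names three distillation signals: hidden states, attention maps, and softened logits. The sketch below shows one common way such a combined objective can be written for a single aligned teacher/student layer pair. It is an illustration under stated assumptions, not the paper's implementation: the function names, the temperature of 2.0, the equal loss weights, the use of mean squared error for hidden states and attention maps, and the assumption that teacher and student vectors share the same dimensionality are all hypothetical choices made here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_logit_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher, student, w_hidden=1.0, w_attn=1.0,
                      w_logit=1.0, temperature=2.0):
    """Weighted sum of the three matching terms for one aligned layer pair.

    `teacher` and `student` are dicts with flat lists under the keys
    "hidden", "attn", and "logits" (a hypothetical layout for this sketch).
    """
    hidden_term = mse(teacher["hidden"], student["hidden"])
    attn_term = mse(teacher["attn"], student["attn"])
    logit_term = soft_logit_kl(teacher["logits"], student["logits"], temperature)
    return w_hidden * hidden_term + w_attn * attn_term + w_logit * logit_term


# Toy values for one token position; all numbers are made up for illustration.
teacher_out = {"hidden": [0.2, -0.1, 0.5], "attn": [0.7, 0.2, 0.1],
               "logits": [2.0, 0.5, -1.0]}
student_out = {"hidden": [0.1, 0.0, 0.4], "attn": [0.6, 0.3, 0.1],
               "logits": [1.5, 0.7, -0.8]}

loss = distillation_loss(teacher_out, student_out)
print(f"combined distillation loss: {loss:.4f}")
```

In a real training setup the hidden sizes of teacher and student often differ, so a learned projection would typically sit between them before the hidden-state term is computed; that step is omitted here for brevity.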
