SD · CL · AS · Dec 14, 2023

STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models

arXiv:2312.09040v2 · 4 citations · h-index: 10 · ICASSP
AI Analysis

This work addresses the computational inefficiency of speech SSL models for deployment in resource-constrained environments, representing an incremental improvement in model compression techniques.

The paper tackles the problem of compressing large Transformer-based speech self-supervised learning models by proposing STaR distillation, which transfers temporal relations between speech frames instead of direct frame representations, achieving an overall score of 79.8 on the SUPERB benchmark for models with up to 27 million parameters.

Despite the great performance of Transformer-based speech self-supervised learning (SSL) models, their large parameter size and computational cost make them impractical to use. In this study, we propose to compress speech SSL models by distilling speech temporal relation (STaR). Unlike previous works that directly match the representation of each speech frame, STaR distillation transfers the temporal relation between speech frames, which is more suitable for a lightweight student with limited capacity. We explore three STaR distillation objectives and select the best combination as the final STaR loss. Our model distilled from HuBERT BASE achieves an overall score of 79.8 on the SUPERB benchmark, the best performance among models with up to 27 million parameters. We show that our method is applicable across different speech SSL models and maintains robust performance with further reduced parameters.
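
To make the contrast between frame-wise matching and relation matching concrete, here is a minimal sketch. It assumes the "temporal relation" is expressed as a frame-by-frame Gram matrix (scaled dot products between every pair of frames) and compared with a mean squared error. The abstract does not specify the three objectives, so the function names `temporal_gram` and `relation_loss`, the scaling by feature width, and the MSE comparison are illustrative assumptions, not the paper's implementation.

```python
def temporal_gram(frames):
    """Return the T x T matrix of scaled dot products between all frame pairs.

    frames: list of T feature vectors, each a list of D floats.
    Dividing by D is an assumed normalization so that models of
    different width produce relations on a comparable scale.
    """
    dim = len(frames[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / dim for fj in frames]
        for fi in frames
    ]


def relation_loss(teacher_frames, student_frames):
    """Mean squared error between teacher and student temporal Gram matrices.

    Both inputs must cover the same T frames; their feature widths may differ,
    because each Gram matrix is T x T regardless of D.
    """
    g_t = temporal_gram(teacher_frames)
    g_s = temporal_gram(student_frames)
    n = len(g_t)
    total = sum(
        (g_t[i][j] - g_s[i][j]) ** 2 for i in range(n) for j in range(n)
    )
    return total / (n * n)


# Toy example (made-up values): 3 frames, teacher width 4, student width 2.
teacher = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
student = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
collapsed_student = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

matched_loss = relation_loss(teacher, student)
collapsed_loss = relation_loss(teacher, collapsed_student)
```

In this sketch the narrower student needs no projection layer to match the teacher's width, since only the T x T relation is compared. That is one plausible reading of why relation-based targets suit a student with limited capacity. The student whose frames preserve the teacher's pairwise structure gets zero loss, while the student that collapses all frames to the same vector is penalized.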

Code Implementations: 2 repos