AS · CL · LG · May 19, 2023

Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation

arXiv:2305.11685v2 · 7 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for more compact speech SSL models to facilitate wider adoption in academia and small companies, representing an incremental improvement in model compression techniques.

The paper tackles the problem of compressing large Transformer-based speech self-supervised learning models by reusing attention maps across layers and introducing a masking distillation strategy, resulting in a student model that achieves a phoneme error rate of 7.72% and word error rate of 9.96% on the SUPERB benchmark.

Transformer-based speech self-supervised learning (SSL) models, such as HuBERT, show surprising performance in various speech processing tasks. However, the huge number of parameters in speech SSL models necessitates compression into a more compact model for wider usage in academia or small companies. In this study, we propose reusing attention maps across the Transformer layers, which removes the key and query parameters while retaining the number of layers. Furthermore, we propose a novel masking distillation strategy to improve the student model's speech representation quality. We extend the distillation loss to utilize both masked and unmasked speech frames to fully leverage the teacher model's high-quality representation. Our universal compression strategy yields a student model that achieves a phoneme error rate (PER) of 7.72% and a word error rate (WER) of 9.96% on the SUPERB benchmark.
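The two ideas in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The single-head setup, the toy dimensions, the helper names (`attention_map`, `masked_unmasked_loss`), the L1 frame distance, and the `unmasked_weight` parameter are all assumptions made for illustration; the abstract does not specify the loss form or which layers share a map.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(x, w_q, w_k):
    """Single-head scaled dot-product attention map, softmax(Q K^T / sqrt(d))."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    d = len(w_q[0])
    k_t = [list(col) for col in zip(*k)]
    return [softmax([s / math.sqrt(d) for s in row]) for row in matmul(q, k_t)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
frames, dim = 4, 3  # toy sizes, chosen only for the example
x = rand_matrix(frames, dim, rng)

# Layer 1 owns query, key, and value projections and computes the map.
w_q, w_k, w_v1 = (rand_matrix(dim, dim, rng) for _ in range(3))
attn = attention_map(x, w_q, w_k)
h1 = matmul(attn, matmul(x, w_v1))

# Layer 2 (hypothetical "reuse" layer) keeps only a value projection and
# applies the map computed in layer 1, so it needs no query/key weights.
w_v2 = rand_matrix(dim, dim, rng)
h2 = matmul(attn, matmul(h1, w_v2))

# Weights saved per reuse layer: one query and one key matrix.
saved_params = 2 * dim * dim

def frame_l1(a, b):
    """Mean absolute difference between two feature frames."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def masked_unmasked_loss(student, teacher, mask, unmasked_weight=1.0):
    """Assumed loss form: mean distance over masked frames plus a weighted
    mean distance over unmasked frames (mask[i] is True if frame i is masked)."""
    masked = [frame_l1(s, t) for s, t, m in zip(student, teacher, mask) if m]
    unmasked = [frame_l1(s, t) for s, t, m in zip(student, teacher, mask) if not m]
    mean = lambda values: sum(values) / len(values) if values else 0.0
    return mean(masked) + unmasked_weight * mean(unmasked)
```

In this sketch the second layer keeps its own value projection, so the number of layers is unchanged while the query and key weights of the reuse layer disappear. The loss keeps masked and unmasked frames as separate terms so that their relative weight can be tuned.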

Code Implementations: 1 repo
