CL SD AS · Oct 13, 2022

On Compressing Sequences for Self-Supervised Speech Models

Yen Meng, Hsuan-Jui Chen, Jiatong Shi, Shinji Watanabe, Paola Garcia, Hung-yi Lee, Hao Tang

arXiv:2210.07189v3 · 1.915 citations · h-index: 74

Originality: Incremental advance

AI Analysis

This work addresses the computational inefficiency of large self-supervised speech models for researchers and practitioners, offering an incremental improvement by focusing on sequence compression rather than model size reduction.

The paper reduces the computational cost of self-supervised speech models by compressing sequences along the time axis, using fixed-length and variable-length subsampling. Subsampling during self-supervised training improves performance on downstream tasks under certain frame rates and gives a significant inference speed-up. Variable-length subsampling performs particularly well at low frame rates, and when phonetic boundaries are available, no degradation is observed at an average frame rate as low as 10 Hz.

Compressing self-supervised models has become increasingly necessary, as self-supervised models become larger. While previous approaches have primarily focused on compressing the model size, shortening sequences is also effective in reducing the computational cost. In this work, we study fixed-length and variable-length subsampling along the time axis in self-supervised learning. We explore how individual downstream tasks are sensitive to input frame rates. Subsampling while training self-supervised models not only improves the overall performance on downstream tasks under certain frame rates, but also brings significant speed-up in inference. Variable-length subsampling performs particularly well under low frame rates. In addition, if we have access to phonetic boundaries, we find no degradation in performance for an average frame rate as low as 10 Hz.
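The two subsampling styles named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how frames are merged, so average pooling is an assumption here, and the function names `fixed_length_subsample` and `variable_length_subsample` as well as the example boundary positions are hypothetical. The 50 Hz input rate is also an assumption, chosen so that a factor of 5 gives the 10 Hz rate mentioned above.

```python
import numpy as np


def fixed_length_subsample(frames, factor):
    """Average every `factor` consecutive frames (the last group may be shorter)."""
    return np.stack([
        frames[i:i + factor].mean(axis=0)
        for i in range(0, len(frames), factor)
    ])


def variable_length_subsample(frames, boundaries):
    """Average frames within each segment.

    `boundaries` holds strictly increasing frame indices in (0, len(frames))
    at which a new segment starts, e.g. hypothetical phonetic boundaries.
    """
    edges = [0, *boundaries, len(frames)]
    return np.stack([
        frames[start:end].mean(axis=0)
        for start, end in zip(edges[:-1], edges[1:])
    ])


# Assumed input: 2 seconds of 4-dimensional features at 50 Hz (100 frames).
features = np.random.default_rng(0).normal(size=(100, 4))

# Fixed-length: merging every 5 frames turns 50 Hz into 10 Hz (20 frames).
fixed = fixed_length_subsample(features, factor=5)

# Variable-length: segments of unequal duration, one output frame per segment.
segment_starts = [7, 15, 31, 40, 58, 73, 90]  # made-up boundaries
variable = variable_length_subsample(features, segment_starts)

# Average frame rate = number of output frames / duration in seconds.
average_rate_hz = len(variable) / 2.0
```

With variable-length subsampling the output rate is only defined on average, since each segment contributes one frame regardless of its duration; this is why the abstract speaks of an "average frame rate".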

