TriNet: stabilizing self-supervised learning from complete or slow collapse on ASR
This work addresses collapse in self-supervised learning (SSL) for automatic speech recognition (ASR), offering incremental improvements on speech recognition tasks.
The paper tackles collapse in self-supervised learning for ASR by proposing TriNet, a triple-branch architecture that stabilizes pre-training and achieves a 6.06% relative word error rate reduction compared to the state-of-the-art Data2vec.
Self-supervised learning (SSL) models face two failure modes: abrupt informational collapse and slow dimensional collapse. We propose TriNet, which introduces a novel triple-branch architecture for preventing collapse and stabilizing pre-training. TriNet learns the SSL latent embedding space and incorporates it into a higher-level space for predicting pseudo target vectors generated by a frozen teacher. Our experimental results show that the proposed method notably stabilizes and accelerates pre-training and achieves a relative word error rate reduction (WERR) of 6.06% compared to the state-of-the-art (SOTA) Data2vec on a downstream benchmark ASR task. We will release our code at https://github.com/tencent-ailab/.
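To make the WERR figure concrete, the sketch below computes a relative word error rate reduction, assuming the conventional definition (baseline WER minus new WER, divided by baseline WER). The function name and the two WER values are hypothetical and chosen only so the arithmetic is easy to follow; they are not the absolute WERs reported in the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction (WERR) in percent.

    Assumes the conventional definition:
        WERR = 100 * (baseline_wer - new_wer) / baseline_wer
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values (in percent) for illustration only;
# they are NOT taken from the paper.
baseline_wer = 10.0  # e.g. a baseline system
new_wer = 9.394      # e.g. an improved system

werr = relative_wer_reduction(baseline_wer, new_wer)
print(f"WERR = {werr:.2f}%")
```

Note that WERR is a relative quantity: the same 6.06% relative reduction corresponds to a smaller absolute gain when the baseline WER is already low.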