ASCL Mar 31, 2022

Memory-Efficient Training of RNN-Transducer with Sampled Softmax

arXiv:2203.16868v1
8 citations
Originality: Incremental advance
AI Analysis

This work addresses a critical memory bottleneck for developers of end-to-end speech recognition systems, though it is incremental as it adapts an existing technique to a specific architecture.

The authors address the high memory consumption of RNN-Transducer training for automatic speech recognition by applying sampled softmax, which computes the output distribution over only a small subset of the vocabulary. Experiments on LibriSpeech, AISHELL-1, and CSJ-APS show reduced memory use with accuracy comparable to the baseline.

RNN-Transducer has been one of the promising architectures for end-to-end automatic speech recognition. Although RNN-Transducer has many advantages, including its strong accuracy and streaming-friendly property, its high memory consumption during training has been a critical problem for development. In this work, we propose to apply sampled softmax to RNN-Transducer, which requires only a small subset of the vocabulary during training and thus reduces memory consumption. We further extend sampled softmax to optimize memory consumption for a minibatch, and employ distributions of auxiliary CTC losses for sampling vocabulary to improve model accuracy. We present experimental results on LibriSpeech, AISHELL-1, and CSJ-APS, where sampled softmax greatly reduces memory consumption while maintaining the accuracy of the baseline model.
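To make the memory argument concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that the dominant training cost is the joint-network output of shape (batch, frames, labels + 1, vocab), and that a sampled subset always keeps the blank symbol and every label appearing in the minibatch, padded with uniformly drawn negatives. The function names, the uniform sampler, and all sizes are hypothetical; the abstract also mentions sampling from auxiliary CTC distributions, which is not shown here.

```python
import random


def joint_logits_bytes(batch, frames, labels, vocab, bytes_per_value=4):
    """Bytes needed for a (batch, frames, labels + 1, vocab) float tensor."""
    return batch * frames * (labels + 1) * vocab * bytes_per_value


def sample_vocab(targets, vocab_size, num_sampled, blank_id=0, rng=random):
    """Pick one vocabulary subset shared by the whole minibatch.

    Assumption: the blank and all target labels are always kept, and the
    remaining slots are filled with uniformly sampled negative labels.
    """
    required = {blank_id}
    for seq in targets:
        required.update(seq)
    candidates = [v for v in range(vocab_size) if v not in required]
    extra = max(0, num_sampled - len(required))
    negatives = rng.sample(candidates, min(extra, len(candidates)))
    return sorted(required) + sorted(negatives)


def remap_targets(targets, subset):
    """Rewrite label ids as positions inside the sampled subset."""
    index = {v: i for i, v in enumerate(subset)}
    return [[index[v] for v in seq] for seq in targets]


# Hypothetical minibatch: two utterances, vocabulary of 1000 units.
targets = [[5, 17, 5], [42, 8]]
subset = sample_vocab(targets, vocab_size=1000, num_sampled=16,
                      rng=random.Random(0))
local = remap_targets(targets, subset)

# Joint-output size with the full vocabulary versus the sampled subset.
full = joint_logits_bytes(batch=2, frames=200, labels=3, vocab=1000)
small = joint_logits_bytes(batch=2, frames=200, labels=3, vocab=len(subset))
print(len(subset), local, full // small)
```

Because the subset is shared across the minibatch, the joint output's last dimension shrinks from the vocabulary size to the subset size, and the loss is then computed on the remapped targets.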
