SD AI LG AS Feb 6, 2025

Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers

Adam Stooke, Rohit Prabhavalkar, Khe Chai Sim, Pedro Moreno Mengibar

arXiv:2502.05232v14.02 citations h-index: 36 NIPS

Originality: Incremental advance

AI Analysis

This work addresses efficiency and complexity issues in speech recognition systems, offering a more streamlined approach that could benefit real-time applications, though it appears incremental as it builds on existing transformer and transducer methods.

The paper aims to simplify and speed up automatic speech recognition. It shows that transformer-based encoders can align audio to text internally, which enables a new model, the Aligner-Encoder. The model performs close to the state of the art, and in a representative comparison its total inference time is 2x faster than RNN-T and 16x faster than AED.

Modern systems for automatic speech recognition, including the RNN-Transducer and Attention-based Encoder-Decoder (AED), are designed so that the encoder is not required to alter the time-position of information from the audio sequence into the embedding; alignment to the final text output is processed during decoding. We discover that the transformer-based encoder adopted in recent years is actually capable of performing the alignment internally during the forward pass, prior to decoding. This new phenomenon enables a simpler and more efficient model, the "Aligner-Encoder". To train it, we discard the dynamic programming of RNN-T in favor of the frame-wise cross-entropy loss of AED, while the decoder employs the lighter text-only recurrence of RNN-T without learned cross-attention -- it simply scans embedding frames in order from the beginning, producing one token each until predicting the end-of-message. We conduct experiments demonstrating performance remarkably close to the state of the art, including a special inference configuration enabling long-form recognition. In a representative comparison, we measure the total inference time for our model to be 2x faster than RNN-T and 16x faster than AED. Lastly, we find that the audio-text alignment is clearly visible in the self-attention weights of a certain layer, which could be said to perform "self-transduction".
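
The decoding and training scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The names `EOS`, `joint_logits`, `greedy_decode`, `framewise_cross_entropy`, and `text_bias` are hypothetical, and the toy "decoder" conditions only on the previous token, whereas the abstract describes a text-only recurrence over the emitted text. The sketch assumes the encoder has already aligned the audio, so that frame i carries the information for token i.

```python
import math
from typing import List, Sequence

EOS = 0  # hypothetical end-of-message token id, also reused as the start symbol


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one frame's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def joint_logits(frame, prev_token, text_bias):
    """Toy stand-in for the decoder: frame scores plus a bias from the previous token."""
    return [f + b for f, b in zip(frame, text_bias[prev_token])]


def greedy_decode(frames, text_bias):
    """Scan frames in order from the beginning, one token per frame, stopping at EOS."""
    tokens = []
    prev = EOS
    for frame in frames:
        logits = joint_logits(frame, prev, text_bias)
        prev = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if prev == EOS:
            break
        tokens.append(prev)
    return tokens


def framewise_cross_entropy(frames, targets, text_bias):
    """Sum of -log p(y_i | frame_i, y_{i-1}) over the targets followed by EOS."""
    labels = list(targets) + [EOS]
    if len(labels) > len(frames):
        raise ValueError("need at least len(targets) + 1 frames")
    loss = 0.0
    prev = EOS
    for frame, label in zip(frames, labels):
        loss -= log_softmax(joint_logits(frame, prev, text_bias))[label]
        prev = label  # teacher forcing: condition on the reference token
    return loss


# Toy input: vocabulary {0: EOS, 1, 2}, four "already aligned" frames, no text bias.
frames = [[0.0, 5.0, 0.0], [0.0, 0.0, 5.0], [5.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
text_bias = [[0.0] * 3 for _ in range(3)]
print(greedy_decode(frames, text_bias))  # [1, 2]
print(framewise_cross_entropy(frames, [1, 2], text_bias))
```

Because token i is read from frame i, the loss needs no dynamic programming over alignments, and frames after the end-of-message position are simply ignored.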
