Benchmarking Rotary Position Embeddings for Automatic Speech Recognition
This work addresses a computational bottleneck in ASR for researchers and practitioners by offering a more efficient alternative to widely used Relative Position embeddings, though the contribution is incremental because it adapts an existing technique to a new domain.
The paper tackles the computational inefficiency of Relative Position embeddings in Automatic Speech Recognition by evaluating Rotary Position Embeddings, finding similar or better error rates and training time reduced by up to 21% across diverse tasks.
Self-attention relies on positional embeddings to encode input order. Relative Position (RelPos) embeddings are widely used in Automatic Speech Recognition (ASR). However, RelPos has time complexity quadratic in the input length and is often incompatible with fast GPU implementations of attention. In contrast, Rotary Positional Embedding (RoPE) rotates each input vector based on its absolute position, which takes time linear in the sequence length, and implicitly encodes relative distances through the self-attention dot products. It is therefore usually compatible with efficient attention implementations. However, its use in ASR remains underexplored. This work evaluates RoPE across diverse ASR tasks with training data ranging from 100 to 50,000 hours, covering various speech types (read, spontaneous, clean, noisy) and different accents in both streaming and non-streaming settings. ASR error rates are similar to or better than those obtained with RelPos, while training time is reduced by up to 21%. Code is available via the SpeechBrain toolkit.
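The rotation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the SpeechBrain implementation: the function names `rope_rotate` and `attention_score`, the pairing of adjacent feature dimensions, and the frequency schedule with base 10000 are all assumptions chosen for illustration. The sketch rotates each vector using only its own absolute position, which costs one pass over the sequence, and then checks that the query-key dot product depends only on the distance between the two positions.

```python
import numpy as np


def rope_rotate(x, positions, base=10000.0):
    """Rotate adjacent feature pairs of x by position-dependent angles.

    x: array of shape (seq_len, dim); dim must be even.
    positions: array of shape (seq_len,) holding absolute positions.
    The frequency schedule (base ** (-2i / dim)) is an assumed convention.
    """
    x = np.asarray(x, dtype=float)
    seq_len, dim = x.shape
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    # One rotation frequency per feature pair.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    # Angle for every (position, pair): shape (seq_len, dim // 2).
    angles = np.outer(positions, inv_freq)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # Standard 2-D rotation applied to each (even, odd) pair.
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


def attention_score(q, k, q_pos, k_pos):
    """Dot product of one rotated query with one rotated key."""
    q_rot = rope_rotate(q[None, :], np.array([q_pos]))[0]
    k_rot = rope_rotate(k[None, :], np.array([k_pos]))[0]
    return float(q_rot @ k_rot)


rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# Both pairs of positions are 3 steps apart, at different absolute offsets.
score_near = attention_score(q, k, q_pos=5, k_pos=2)
score_far = attention_score(q, k, q_pos=105, k_pos=102)
print(np.isclose(score_near, score_far))
```

Because the rotation is applied to queries and keys before the dot product, the attention computation itself is unchanged, which is why this scheme can usually be combined with fused attention kernels.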