CRoPE: Efficient Parametrization of Rotary Positional Embedding
For AI practitioners, this work addresses parameter redundancy in the attention blocks of transformer models and offers an incremental improvement in efficiency.
The paper tackles an inefficiency in how rotary positional embeddings are implemented in transformers by proposing a complex linear transformation parametrization, which reduces the parameter count in attention blocks by nearly 50% with negligible impact on performance.
Rotary positional embedding has become the state-of-the-art approach to encoding position information in transformer-based models. While it is often succinctly expressed in complex linear algebra, we note that the actual implementation of the $Q/K/V$-projections is not equivalent to a complex linear transformation. We argue that a complex linear transformation is a more natural parametrization and saves nearly 50\% of the parameters within the attention block. We show empirically that removing this redundancy has negligible impact on model performance. Our modification achieves more efficient parameter usage, as well as a cleaner interpretation of the representation space.
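To make the parameter-count argument concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It treats a real vector of even dimension d as d/2 complex numbers, applies the rotary embedding as a per-coordinate complex multiplication, and compares the number of real parameters in an unconstrained real d-by-d projection with that of a complex (d/2)-by-(d/2) projection. The function names, the frequency schedule with base 10000, and the assumption of a single square projection are illustrative choices and are not taken from the text above.

```python
import cmath

def rope_rotate(z, pos, base=10000.0):
    """Rotate complex coordinate z[k] by the angle pos * base**(-k / len(z)).

    The frequency schedule is an assumption made for this sketch.
    """
    n = len(z)
    return [zk * cmath.exp(1j * pos * base ** (-k / n)) for k, zk in enumerate(z)]

def attention_score(q, k, pos_q, pos_k):
    """Real part of the Hermitian inner product of the rotated query and key."""
    rq = rope_rotate(q, pos_q)
    rk = rope_rotate(k, pos_k)
    return sum((a * b.conjugate()).real for a, b in zip(rq, rk))

def complex_linear(W, z):
    """Apply a complex matrix W (list of rows) to a complex vector z."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in W]

def real_params(d):
    """Real parameters in an unconstrained real d x d projection."""
    return d * d

def complex_params(d):
    """Real parameters in a complex (d/2) x (d/2) projection (two per entry)."""
    m = d // 2
    return 2 * m * m

if __name__ == "__main__":
    d = 8
    print(real_params(d), complex_params(d))  # prints: 64 32
```

Under these assumptions the complex projection uses exactly half the parameters of the real one for a single square projection. The abstract's figure of nearly 50% refers to the attention block as a whole. The sketch also shows why the complex view is convenient: the score computed by `attention_score` depends only on the difference `pos_q - pos_k`, since the two rotations combine into a single phase factor.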