Base of RoPE Bounds Context Length
This work addresses how to extend context length effectively in large language models, revealing limitations in current methods and offering insights for future long-context training.
The paper shows that adjusting the base parameter of Rotary Position Embedding (RoPE) to extend context length in LLMs may yield only a superficial long-context ability. It demonstrates, theoretically and empirically, that the base of RoPE bounds context length: there is an absolute lower bound on the base value required to obtain a given context length capability.
Position embedding is a core component of current Large Language Models (LLMs). Rotary position embedding (RoPE), a technique that encodes position information with a rotation matrix, has been the de facto choice for position embedding in many LLMs, such as the Llama series. RoPE has also been used to extend long-context capability, roughly by adjusting its \textit{base} parameter to mitigate out-of-distribution (OOD) problems in position embedding. However, in this paper we find that, based on the OOD theory, LLMs may obtain only a superficial long-context ability. We revisit the role of RoPE in LLMs and propose a novel property of long-term decay. From this property we derive that the \textit{base of RoPE bounds context length}: there is an absolute lower bound on the base value required to obtain a given context length capability. Our work reveals the relationship between context length and RoPE base both theoretically and empirically, which may shed light on future long-context training.
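
To make the role of the base parameter concrete, the following is a minimal NumPy sketch of RoPE. It assumes the commonly used formulation in which adjacent dimension pairs are rotated by angles $m\theta_i$ with $\theta_i = \text{base}^{-2i/d}$; this formulation is not spelled out in the abstract, and the function names `rope_frequencies` and `apply_rope` are hypothetical, chosen only for illustration. The sketch does not reproduce the paper's lower bound; it only shows where the base enters the computation.

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    """Per-pair rotation frequencies theta_i = base^(-2i/dim), i = 0..dim/2-1.

    Assumed standard RoPE formulation; a larger base gives smaller
    (slower) frequencies in the later dimension pairs.
    """
    return base ** (-np.arange(0, dim, 2) / dim)

def apply_rope(x, positions, base=10000.0):
    """Rotate each (even, odd) dimension pair of x by position * theta_i.

    x:         array of shape (seq_len, dim), dim must be even
    positions: array-like of shape (seq_len,)
    """
    x = np.asarray(x, dtype=float)
    positions = np.asarray(positions, dtype=float)
    theta = rope_frequencies(x.shape[-1], base)
    angles = positions[:, None] * theta[None, :]   # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin    # 2D rotation, first component
    out[..., 1::2] = x_even * sin + x_odd * cos    # 2D rotation, second component
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(1, 8))
    k = rng.normal(size=(1, 8))
    # The attention score depends only on the relative offset (here 2).
    near = apply_rope(q, [3]) @ apply_rope(k, [1]).T
    far = apply_rope(q, [10]) @ apply_rope(k, [8]).T
    print(np.allclose(near, far))
    # Raising the base lowers the slowest rotation frequency.
    print(rope_frequencies(8, 1e4)[-1], rope_frequencies(8, 1e6)[-1])
```

Because each pair is rotated by an angle proportional to position, the dot product of a rotated query and key depends only on their relative offset. Raising the base lowers the frequencies of the later pairs, which is the adjustment that base-scaling methods for context extension rely on.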