Efficient Context Scaling with LongCat ZigZag Attention
LoZA addresses the challenge of efficient long-context processing for applications such as retrieval-augmented generation and tool-integrated reasoning. It represents an incremental improvement in sparse attention methods.
The paper tackles the problem of scaling context length in attention-based models. It introduces LongCat ZigZag Attention (LoZA), a sparse attention scheme that converts full-attention models into sparse versions with a limited compute budget. The resulting models achieve significant speed-ups in long-context scenarios and can process up to 1 million tokens.
We introduce LongCat ZigZag Attention (LoZA), a sparse attention scheme designed to transform any existing full-attention model into a sparse version with a rather limited compute budget. In long-context scenarios, LoZA can achieve significant speed-ups in both prefill-intensive (e.g., retrieval-augmented generation) and decode-intensive (e.g., tool-integrated reasoning) cases. Specifically, by applying LoZA to LongCat-Flash during mid-training, we serve LongCat-Flash-Exp as a long-context foundation model that can swiftly process up to 1 million tokens, enabling efficient long-term reasoning and long-horizon agentic capabilities.
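The abstract does not spell out LoZA's sparsity pattern, so the sketch below is only a generic illustration of why sparse attention matters at long context lengths. It counts the (query, key) pairs scored by full causal attention and by a hypothetical local pattern in which each query sees a few leading "sink" tokens plus a sliding window of recent tokens. The function names and the `window` and `sinks` parameters are assumptions made for this example; they are not taken from the paper and do not describe LoZA itself.

```python
def full_causal_pairs(n: int) -> int:
    """Count (query, key) pairs scored by full causal attention over n tokens."""
    # Query q (0-indexed) attends to keys 0..q, i.e. q + 1 keys.
    return n * (n + 1) // 2


def local_sparse_pairs(n: int, window: int, sinks: int) -> int:
    """Count pairs for a hypothetical sink + sliding-window causal pattern.

    Each query sees the first `sinks` tokens and the most recent `window`
    tokens (including itself). This is an illustrative pattern only,
    not the pattern used by LoZA.
    """
    budget = window + sinks  # maximum number of keys any single query can see
    if n <= budget:
        # Every query still sees its entire causal prefix.
        return full_causal_pairs(n)
    # The first `budget` queries see their full prefix; the rest see `budget` keys.
    return full_causal_pairs(budget) + (n - budget) * budget


if __name__ == "__main__":
    n = 1_000_000  # context length mentioned in the abstract
    dense = full_causal_pairs(n)
    sparse = local_sparse_pairs(n, window=1024, sinks=4)  # hypothetical settings
    print(f"dense pairs:  {dense}")
    print(f"sparse pairs: {sparse}")
    print(f"ratio:        {dense / sparse:.1f}x")
```

The dense count grows quadratically with context length, while the local pattern grows linearly once the context exceeds the per-query budget. That gap is the general source of the speed-ups sparse attention schemes target, for both prefill and decode.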