CL · Jan 26, 2025

ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer

arXiv:2501.15570v1 · 7 citations · h-index: 1 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for more efficient and expressive language models for AI researchers and practitioners, though it appears incremental as it builds on existing hybrid and RWKV architectures.

The paper tackles the problem of improving expressiveness and efficiency in language models by introducing ARWKV, a series of models based on pure native RWKV-7 attention distilled from Qwen 2.5, which reduces knowledge processing time to 8 hours on 16 AMD MI300X GPUs while maintaining performance.

As is known, hybrid quadratic and subquadratic attention models in multi-head architectures have surpassed both Transformer and Linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. For further research on expressiveness, we introduce our series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention, which aims to make RNNs more expressive and demonstrates state tracking ability beyond transformers. We work with QRWK 32B based on the RWKV-6 architecture, another approach that reduces the entire knowledge processing time to just 8 hours using 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens. We will explain the detailed process and share our insights on building more powerful foundation models. Please note that this is an ongoing work that will be updated continuously. The model checkpoints and source code are available at https://github.com/yynil/RWKVInside and https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1.
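The abstract does not state which objective drives the distillation, so the following is only a minimal sketch of one common choice: a temperature-scaled KL divergence between the teacher's and the student's next-token distributions, averaged over positions. The function names, the `temperature` parameter, and the use of plain Python lists for logits are all illustrative assumptions, not the authors' implementation.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def distillation_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for a single token position.

    Assumption: both models share one vocabulary, so the two logit
    vectors have the same length and index the same tokens.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student logits must have equal length")
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def sequence_distillation_loss(teacher_seq, student_seq, temperature=1.0):
    """Mean per-position KL over a sequence of logit vectors."""
    if len(teacher_seq) != len(student_seq) or not teacher_seq:
        raise ValueError("sequences must be non-empty and of equal length")
    per_position = [
        distillation_kl(t, s, temperature)
        for t, s in zip(teacher_seq, student_seq)
    ]
    return sum(per_position) / len(per_position)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. Because it depends only on output logits, nothing in this sketch is tied to a particular teacher, which is consistent with the abstract's remark that any LLM could serve as the source.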

Code Implementations: 1 repo