LG | CL | Apr 11, 2025

Millions of States: Designing a Scalable MoE Architecture with RWKV-7 Meta-learner

arXiv:2504.08247v1 | 1 citation | h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses scalability and adaptability limitations in state-based sequence models. It appears incremental, since it builds directly on RWKV-7.

The paper targets two limitations of RWKV-7: it has no mechanism for token-parameter interactions, and it cannot grow without retraining. It proposes Meta-State, an extension that adds both capabilities through a Self-State Encoder mechanism, enabling progressive model scaling without retraining while maintaining linear complexity and constant memory usage.

State-based sequence models like RWKV-7 offer a compelling alternative to Transformer architectures, achieving linear complexity while demonstrating greater expressive power in short-context scenarios and enabling state tracking beyond the \(\text{TC}^0\) complexity class. However, RWKV-7 lacks mechanisms for token-parameter interactions and native scalability, limiting its adaptability and growth without retraining. In this paper, we propose Meta-State, a novel extension to RWKV-7 that replaces attention mechanisms with a fully state-driven approach, integrating token-parameter interactions through a Self-State Encoder (SSE) mechanism. The SSE repurposes a portion of the RWKV-7 Weighted Key-Value (WKV) state as transformation weights to encode token-parameter interactions in a linear, state-driven manner without introducing new trainable matrices or softmax operations, while preserving the autoregressive property of token processing. Meta-State supports progressive model scaling by expanding the WKV state and parameter tokens, reusing existing parameters without retraining. Our approach bridges the gap between state-based modeling, token-parameter interactions, and scalable architectures, offering a flexible framework for efficient and adaptable sequence modeling with linear complexity and constant memory usage.
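The abstract gives no equations, so the following is only a toy sketch of the two ideas it names, not the paper's SSE or the actual RWKV-7 recurrence. It assumes a simplified state update S <- decay * S + v k^T as a stand-in for the WKV update, reads out by using rows of the state as linear transformation weights (no softmax, no extra trainable matrix), and assumes zero-padding as one possible way to expand the state while reusing existing values. All function names (`wkv_step`, `state_readout`, `expand_state`, `pad`, `run`) are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def zeros(rows: int, cols: int) -> Matrix:
    """Return a rows x cols matrix of zeros."""
    return [[0.0] * cols for _ in range(rows)]


def pad(vec: Vector, dim: int) -> Vector:
    """Zero-pad a vector up to length dim."""
    return vec + [0.0] * (dim - len(vec))


def wkv_step(state: Matrix, k: Vector, v: Vector, decay: float) -> Matrix:
    """Assumed simplified update: S <- decay * S + v k^T (not the real RWKV-7 rule)."""
    n = len(state)
    return [[decay * state[i][j] + v[i] * k[j] for j in range(n)] for i in range(n)]


def state_readout(state: Matrix, x: Vector, n_rows: int) -> Vector:
    """Use the first n_rows rows of the state as weights applied linearly to x."""
    return [sum(w * xj for w, xj in zip(state[i], x)) for i in range(n_rows)]


def expand_state(state: Matrix, new_dim: int) -> Matrix:
    """Grow a square state to new_dim x new_dim, copying old entries, zeros elsewhere."""
    old = len(state)
    grown = zeros(new_dim, new_dim)
    for i in range(old):
        grown[i][:old] = state[i]
    return grown


def run(tokens: List[Tuple[Vector, Vector]], dim: int, decay: float = 0.9) -> Matrix:
    """Fold a token stream into a fixed-size state: memory is O(dim^2), not O(len(tokens))."""
    state = zeros(dim, dim)
    for k, v in tokens:
        state = wkv_step(state, pad(k, dim), pad(v, dim), decay)
    return state


# Made-up (key, value) pairs, purely for illustration.
tokens = [([1.0, 0.0], [0.5, 2.0]), ([0.0, 1.0], [1.0, -1.0]), ([1.0, 1.0], [0.0, 3.0])]
x = [1.0, 2.0]

small = run(tokens, 2)
y_small = state_readout(small, x, 2)

# Expand the state from 2x2 to 4x4 and read out again with a padded input.
grown = expand_state(small, 4)
y_grown = state_readout(grown, pad(x, 4), 2)

# Expanding mid-stream and continuing matches running at the larger size throughout.
resumed = wkv_step(expand_state(run(tokens[:2], 2), 4),
                   pad(tokens[2][0], 4), pad(tokens[2][1], 4), 0.9)

print(y_small, y_grown)
```

In this toy setting the readout over the original dimensions is unchanged by expansion, which is the sense in which existing values are reused. Whether Meta-State achieves this through zero-initialised expansion or another scheme is not stated in the abstract.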
