LG, AI · Jan 30

Agile Reinforcement Learning through Separable Neural Architecture

Rajib Mostakim, Reza T. Batley, Sourav Saha

arXiv:2601.23225v1 · 1 citation · h-index: 3

Originality: Highly original

AI Analysis

The paper addresses slow policy learning in capacity-limited RL settings, proposing a more parameter-efficient alternative to MLP function approximators.

The paper tackles parameter inefficiency in deep reinforcement learning for resource-constrained environments by introducing SPAN, a spline-based adaptive network. Compared to MLP baselines, SPAN achieves a 30-50% improvement in sample efficiency and 1.3 to 9 times higher success rates.

Deep reinforcement learning (RL) is increasingly deployed in resource-constrained environments, yet the go-to function approximators, multilayer perceptrons (MLPs), are often parameter-inefficient because their inductive bias is a poor match for the smooth structure of many value functions. This mismatch can also hinder sample efficiency and slow policy learning in the capacity-limited regime. Although model compression techniques exist, they operate post hoc and do not improve learning efficiency. Recent spline-based separable architectures, such as Kolmogorov-Arnold Networks (KANs), have been shown to offer parameter efficiency but are widely reported to incur significant computational overhead, especially at scale. To address these limitations, this work introduces SPAN (SPline-based Adaptive Networks), a novel function approximation approach to RL. SPAN adapts the low-rank KHRONOS framework by integrating a learnable preprocessing layer with a separable tensor-product B-spline basis. SPAN is evaluated across discrete (PPO) and high-dimensional continuous (SAC) control tasks, as well as offline settings (Minari/D4RL). Empirical results demonstrate that SPAN achieves a 30-50% improvement in sample efficiency and 1.3 to 9 times higher success rates across benchmarks compared to MLP baselines. Furthermore, SPAN demonstrates superior anytime performance and robustness to hyperparameter variations, suggesting that it is a viable, high-performance alternative for learning intrinsically efficient policies in resource-limited settings.
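
To make the phrase "a learnable preprocessing layer with a separable tensor-product B-spline basis" more concrete, the following is a minimal forward-pass sketch in plain Python. It is not the authors' implementation: the class name SeparableSplineNet, the sigmoid-squashed linear preprocessing, the clamped uniform knot vector, and all default sizes are assumptions chosen for illustration. The sketch evaluates f(x) = sum over r of the product over d of g_{r,d}(z_d), where z is the preprocessed input and each g_{r,d} is a univariate B-spline, which is one common reading of a low-rank separable model.

```python
import math
import random


def bspline_basis(t, n_basis, degree=3):
    """Evaluate n_basis B-spline basis functions at t in [0, 1].

    Uses the Cox-de Boor recursion on a clamped uniform knot vector
    (an assumption; the abstract does not specify the knot layout).
    """
    if n_basis < degree + 1:
        raise ValueError("n_basis must be at least degree + 1")
    n_inner = n_basis - degree - 1  # number of interior knots
    knots = ([0.0] * (degree + 1)
             + [(i + 1) / (n_inner + 1) for i in range(n_inner)]
             + [1.0] * (degree + 1))
    # Keep t inside the half-open span so exactly one degree-0 function fires.
    t = min(max(t, 0.0), 1.0 - 1e-12)
    basis = [1.0 if knots[i] <= t < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    for p in range(1, degree + 1):
        nxt = []
        for i in range(len(knots) - 1 - p):
            left_den = knots[i + p] - knots[i]
            right_den = knots[i + p + 1] - knots[i + 1]
            left = (t - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = ((knots[i + p + 1] - t) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            nxt.append(left + right)
        basis = nxt
    return basis  # length n_basis, entries sum to 1


class SeparableSplineNet:
    """Hypothetical low-rank separable spline approximator (forward pass only)."""

    def __init__(self, in_dim, latent_dim=3, rank=4, n_basis=8, degree=3, seed=0):
        rng = random.Random(seed)
        self.latent_dim, self.rank = latent_dim, rank
        self.n_basis, self.degree = n_basis, degree
        # Preprocessing layer (assumed form): z = sigmoid(W x + b), so z lies in (0, 1).
        self.W = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
                  for _ in range(latent_dim)]
        self.b = [0.0] * latent_dim
        # Spline coefficients C[r][d][k]: one univariate spline per rank and latent dim.
        self.C = [[[rng.gauss(0.0, 0.5) for _ in range(n_basis)]
                   for _ in range(latent_dim)] for _ in range(rank)]

    def preprocess(self, x):
        # tanh form of the sigmoid avoids overflow for large pre-activations.
        return [0.5 * (1.0 + math.tanh(0.5 * (sum(w * xi for w, xi in zip(row, x)) + b)))
                for row, b in zip(self.W, self.b)]

    def __call__(self, x):
        z = self.preprocess(x)
        basis = [bspline_basis(zd, self.n_basis, self.degree) for zd in z]
        out = 0.0
        for r in range(self.rank):
            term = 1.0
            for d in range(self.latent_dim):
                # Univariate spline g_{r,d}(z_d) = sum_k C[r][d][k] * B_k(z_d)
                term *= sum(c * bk for c, bk in zip(self.C[r][d], basis[d]))
            out += term  # sum of rank-one (separable) products
        return out

    def num_params(self):
        n_pre = sum(len(row) for row in self.W) + len(self.b)
        return n_pre + self.rank * self.latent_dim * self.n_basis


net = SeparableSplineNet(in_dim=4)
value = net([0.1, -0.2, 0.3, 0.0])  # e.g. a scalar value estimate for one state
```

With in_dim=4 and the assumed defaults, this sketch has 111 parameters, whereas a full (non-separable) tensor-product grid over the same three latent coordinates would need 8^3 = 512 coefficients. That gap is the kind of parameter saving a low-rank separable basis is meant to provide. Training is omitted here; in practice one would express the same computation in an autodiff framework.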

