56.4AIMay 23
Distilling Game Code World Model Generation into Lightweight Large Language ModelsTyrone Serapio, Arjun Prakash, Haoyang Xu et al.
Large Language Models (LLMs) have shown great ability in generating executable code from natural language, opening the possibility of automatically constructing environments for AI agents. Recent work on Code World Models (CWMs) demonstrates that LLMs can translate game rules into Python implementations compatible with solvers like Monte Carlo Tree Search. We study this problem in game settings, where generated environments must implement rules, legal actions, state transitions, observations, and rewards. We refer to these game-specific executable models as Game Code World Models (GameCWMs). However, current approaches to generating code world models rely on frontier models and inference-time refinement loops, limiting accessibility and scalability. This work investigates whether GameCWM generation capabilities can be distilled into smaller models through post-training. We introduce: (1) a curated dataset of 30 games spanning perfect and imperfect information games, (2) a verification framework that evaluates generated code against structural and semantic game properties, and (3) a post-training pipeline combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR). We experiment with Qwen2.5-3B-Instruct and find that SFT can increase syntactic correctness, while RLVR can improve execution-level adherence to game rules, thereby improving Qwen's ability to generate valid GameCWMs in both perfect and imperfect information games. Overall, our pipeline makes Qwen2.5-3B-Instruct more capable of generating valid GameCWMs, thereby offering a scalable path toward automatic environment generation from natural language.
30.1LGMar 17
Bi-Level Policy Optimization with Nyström HypergradientsArjun Prakash, Naicheng He, Denizalp Goktas et al.
The dependency of the actor on the critic in actor-critic (AC) reinforcement learning means that AC can be characterized as a bilevel optimization (BLO) problem, also called a Stackelberg game. This characterization motivates two modifications to vanilla AC algorithms. First, the critic's update should be nested to learn a best response to the actor's policy. Second, the actor should update according to a hypergradient that takes changes in the critic's behavior into account. Computing this hypergradient involves finding an inverse Hessian vector product, a process that can be numerically unstable. We thus propose a new algorithm, Bilevel Policy Optimization with Nyström Hypergradients (BLPO), which uses nesting to account for the nested structure of BLO, and leverages the Nyström method to compute the hypergradient. Theoretically, we prove BLPO converges to (a point that satisfies the necessary conditions for) a local strong Stackelberg equilibrium in polynomial time with high probability, assuming a linear parametrization of the critic's objective. Empirically, we demonstrate that BLPO performs on par with or better than PPO on a variety of discrete and continuous control tasks.
LGSep 26, 2025
Spectral Collapse Drives Loss of Plasticity in Deep Continual LearningNaicheng He, Kaicheng Guo, Arjun Prakash et al.
We investigate why deep neural networks suffer from loss of plasticity in deep continual learning, failing to learn new tasks without reinitializing parameters. We show that this failure is preceded by Hessian spectral collapse at new-task initialization, where meaningful curvature directions vanish and gradient descent becomes ineffective. To characterize the necessary condition for successful training, we introduce the notion of $τ$-trainability and show that current plasticity preserving algorithms can be unified under this framework. Targeting spectral collapse directly, we then discuss the Kronecker factored approximation of the Hessian, which motivates two regularization enhancements: maintaining high effective feature rank and applying L2 penalties. Experiments on continual supervised and reinforcement learning tasks confirm that combining these two regularizers effectively preserves plasticity.
CYSep 10, 2020
On the Fairness of 'Fake' Data in Legal AILauren Boswell, Arjun Prakash
The economics of smaller budgets and larger case numbers necessitates the use of AI in legal proceedings. We examine the concept of disparate impact and how biases in the training data lead to the search for fairer AI. This paper seeks to begin the discourse on what such an implementation would actually look like with a criticism of pre-processing methods in a legal context . We outline how pre-processing is used to correct biased data and then examine the legal implications of effectively changing cases in order to achieve a fairer outcome including the black box problem and the slow encroachment on legal precedent. Finally we present recommendations on how to avoid the pitfalls of pre-processed data with methods that either modify the classifier or correct the output in the final step.
STApr 21, 2020
Structural clustering of volatility regimes for dynamic trading strategiesArjun Prakash, Nick James, Max Menzies et al.
We develop a new method to find the number of volatility regimes in a nonstationary financial time series by applying unsupervised learning to its volatility structure. We use change point detection to partition a time series into locally stationary segments and then compute a distance matrix between segment distributions. The segments are clustered into a learned number of discrete volatility regimes via an optimization routine. Using this framework, we determine a volatility clustering structure for financial indices, large-cap equities, exchange-traded funds and currency pairs. Our method overcomes the rigid assumptions necessary to implement many parametric regime-switching models, while effectively distilling a time series into several characteristic behaviours. Our results provide significant simplification of these time series and a strong descriptive analysis of prior behaviours of volatility. Finally, we create and validate a dynamic trading strategy that learns the optimal match between the current distribution of a time series and its past regimes, thereby making online risk-avoidance decisions in the present.