Aaron David Tucker

LG
h-index16
4papers
24citations
Novelty44%
AI Score41

4 Papers

LGJul 22, 2024Code
Planning in a recurrent neural network that plays Sokoban

Mohammad Taufeeque, Philip Quirke, Maximilian Li et al.

Planning is essential for solving complex tasks, yet the internal mechanisms underlying planning in neural networks remain poorly understood. Building on prior work, we analyze a recurrent neural network (RNN) trained on Sokoban, a challenging puzzle requiring sequential, irreversible decisions. We find that the RNN has a causal plan representation which predicts its future actions about 50 steps in advance. The quality and length of the represented plan increases over the first few steps. We uncover a surprising behavior: the RNN "paces" in cycles to give itself extra computation at the start of a level, and show that this behavior is incentivized by training. Leveraging these insights, we extend the trained RNN to significantly larger, out-of-distribution Sokoban puzzles, demonstrating robust representations beyond the training regime. We open-source our model and code, and believe the neural network's interesting behavior makes it an excellent model organism to deepen our understanding of learned planning.

AIFeb 16
Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution

Matthew Kowal, Goncalo Paulo, Louis Jaburi et al.

As large language models are increasingly trained and fine-tuned, practitioners need methods to identify which training data drive specific behaviors, particularly unintended ones. Training Data Attribution (TDA) methods address this by estimating datapoint influence. Existing approaches like influence functions are both computationally expensive and attribute based on single test examples, which can bias results toward syntactic rather than semantic similarity. To address these issues of scalability and influence to abstract behavior, we leverage interpretable structures within the model during the attribution. First, we introduce Concept Influence which attribute model behavior to semantic directions (such as linear probes or sparse autoencoder features) rather than individual test examples. Second, we show that simple probe-based attribution methods are first-order approximations of Concept Influence that achieve comparable performance while being over an order-of-magnitude faster. We empirically validate Concept Influence and approximations across emergent misalignment benchmarks and real post-training datasets, and demonstrate they achieve comparable performance to classical influence functions while being substantially more scalable. More broadly, we show that incorporating interpretable structure within traditional TDA pipelines can enable more scalable, explainable, and better control of model behavior through data.

LGJun 11, 2025
Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban

Mohammad Taufeeque, Aaron David Tucker, Adam Gleave et al.

We partially reverse-engineer a convolutional recurrent neural network (RNN) trained to play the puzzle game Sokoban with model-free reinforcement learning. Prior work found that this network solves more levels with more test-time compute. Our analysis reveals several mechanisms analogous to components of classic bidirectional search. For each square, the RNN represents its plan in the activations of channels associated with specific directions. These state-action activations are analogous to a value function - their magnitudes determine when to backtrack and which plan branch survives pruning. Specialized kernels extend these activations (containing plan and value) forward and backward to create paths, forming a transition model. The algorithm is also unlike classical search in some ways. State representation is not unified; instead, the network considers each box separately. Each layer has its own plan representation and value function, increasing search depth. Far from being inscrutable, the mechanisms leveraging test-time compute learned in this network by model-free training can be understood in familiar terms.

LGFeb 3, 2022
Variance-Optimal Augmentation Logging for Counterfactual Evaluation in Contextual Bandits

Aaron David Tucker, Thorsten Joachims

Methods for offline A/B testing and counterfactual learning are seeing rapid adoption in search and recommender systems, since they allow efficient reuse of existing log data. However, there are fundamental limits to using existing log data alone, since the counterfactual estimators that are commonly used in these methods can have large bias and large variance when the logging policy is very different from the target policy being evaluated. To overcome this limitation, we explore the question of how to design data-gathering policies that most effectively augment an existing dataset of bandit feedback with additional observations for both learning and evaluation. To this effect, this paper introduces Minimum Variance Augmentation Logging (MVAL), a method for constructing logging policies that minimize the variance of the downstream evaluation or learning problem. We explore multiple approaches to computing MVAL policies efficiently, and find that they can be substantially more effective in decreasing the variance of an estimator than naïve approaches.