LG | AI | Jan 28, 2025

Inducing, Detecting and Characterising Neural Modules: A Pipeline for Functional Interpretability in Reinforcement Learning

arXiv:2501.17077v2 | 2 citations | h-index: 3 | ICML
Originality: Incremental advance
AI Analysis

This work addresses interpretability for RL systems, which is crucial for alignment with human values. The advance is incremental: it builds on existing modularity concepts and adds new detection techniques.

The authors tackled the challenge of interpretability in reinforcement learning by proposing a method to induce and detect functional modules in policy networks, demonstrating that encouraging sparsity and locality leads to distinct navigational modules in 2D and 3D MiniGrid environments.

Interpretability is crucial for ensuring RL systems align with human values. However, it remains challenging to achieve in complex decision-making domains. Existing methods frequently attempt interpretability at the level of fundamental model units, such as neurons or decision nodes: an approach which scales poorly to large models. Here, we instead propose an approach to interpretability at the level of functional modularity. We show how encouraging sparsity and locality in network weights leads to the emergence of functional modules in RL policy networks. To detect these modules, we develop an extended Louvain algorithm which uses a novel 'correlation alignment' metric to overcome the limitations of standard network analysis techniques when applied to neural network architectures. Applying these methods to 2D and 3D MiniGrid environments reveals the consistent emergence of distinct navigational modules for different axes, and we further demonstrate how these functions can be validated through direct interventions on network weights prior to inference.
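
To make the ideas of "encouraging sparsity and locality" and "direct interventions on network weights" concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes one common formulation in which each weight is penalised by its magnitude times a term that grows with the distance between the two neurons it connects, with neurons given made-up 1D positions. The function names `locality_l1_penalty` and `ablate_module`, the parameters `sparsity` and `locality`, and the toy layer are all hypothetical.

```python
# Illustrative sketch only: names, positions and coefficients are assumptions,
# not taken from the paper.

def locality_l1_penalty(weights, src_pos, dst_pos, sparsity=1.0, locality=1.0):
    """Sum of |w_ji| * (sparsity + locality * distance) over one dense layer.

    weights[j][i] connects input neuron i to output neuron j.
    src_pos / dst_pos assign each neuron an assumed 1D position, so that
    long-range connections cost more than short-range ones.
    """
    total = 0.0
    for j, row in enumerate(weights):
        for i, w in enumerate(row):
            distance = abs(dst_pos[j] - src_pos[i])
            total += abs(w) * (sparsity + locality * distance)
    return total


def ablate_module(weights, module_outputs):
    """Return a copy of the layer with every weight into the given output
    neurons set to zero, leaving the original layer unchanged."""
    return [
        [0.0 for _ in row] if j in module_outputs else list(row)
        for j, row in enumerate(weights)
    ]


# Toy layer: 2 input neurons, 3 output neurons.
weights = [
    [0.5, 0.0],   # output neuron 0
    [0.0, -2.0],  # output neuron 1
    [1.0, 0.0],   # output neuron 2 (long-range connection to input 0)
]
src_pos = [0.0, 1.0]
dst_pos = [0.0, 1.0, 2.0]

penalty = locality_l1_penalty(weights, src_pos, dst_pos)

# Intervention: silence a hypothetical module consisting of output neuron 1.
ablated = ablate_module(weights, module_outputs={1})
```

In a training loop, a penalty of this kind would be added to the RL loss so that long-range and unnecessary connections are driven towards zero. The ablation helper mirrors the idea of validating a detected module by editing its weights before inference and observing which behaviour changes.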
