LG | SC | Jan 11, 2024

Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents

arXiv:2401.05821v4 | 28 citations | h-index: 11 | Has Code | NIPS
Originality: Incremental advance
AI Analysis

This work addresses interpretability and alignment challenges in reinforcement learning by making agents inspectable by domain experts. It is an incremental advance, since it builds on existing concept bottleneck methods.

The paper tackles the problem of aligning reinforcement learning agents by introducing Successive Concept Bottleneck Agents (SCoBots), which use successive concept bottleneck layers to represent both object properties and relations between objects. SCoBots reach competitive performance while allowing domain experts to identify and resolve misalignment issues, such as one found in the game Pong.

Goal misalignment, reward sparsity, and difficult credit assignment are only a few of the many issues that make it difficult for deep reinforcement learning (RL) agents to learn optimal policies. Unfortunately, the black-box nature of deep neural networks prevents domain experts from inspecting the model and revising suboptimal policies. To this end, we introduce *Successive Concept Bottleneck Agents* (SCoBots), which integrate consecutive concept bottleneck (CB) layers. In contrast to current CB models, SCoBots represent concepts not only as properties of individual objects but also as relations between objects, which is crucial for many RL tasks. Our experimental results provide evidence of SCoBots' competitive performance, and also of their potential for domain experts to understand and regularize their behavior. Among other things, SCoBots enabled us to identify a previously unknown misalignment problem in the iconic video game Pong and to resolve it. Overall, SCoBots thus result in more human-aligned RL agents. Our code is available at https://github.com/k4ntz/SCoBots.
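To make the idea of successive concept bottlenecks more concrete, here is a minimal sketch of the data flow the abstract describes: object properties first, then relations between objects, then an action chosen only from those human-readable concepts. It is not the SCoBots implementation. Every name here (`GameObject`, `object_concepts`, `relational_concepts`, `prune`, `policy`), the coordinate values, and the choice of pruned concepts are hypothetical, and the hand-written rule stands in for whatever learned action selector the authors use.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class GameObject:
    # Hypothetical object representation: a name and a screen position.
    name: str
    x: float
    y: float


def object_concepts(objects):
    """First bottleneck: properties of individual objects (here, positions only)."""
    return {f"{o.name}.{axis}": getattr(o, axis)
            for o in objects for axis in ("x", "y")}


def relational_concepts(objects):
    """Second bottleneck: relations between each pair of objects."""
    concepts = {}
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            concepts[f"dy({a.name},{b.name})"] = b.y - a.y
            concepts[f"dist({a.name},{b.name})"] = hypot(b.x - a.x, b.y - a.y)
    return concepts


def prune(concepts, removed):
    """Expert intervention: drop concepts the policy should not rely on."""
    return {k: v for k, v in concepts.items() if k not in removed}


def policy(concepts):
    """Toy hand-written rule standing in for a learned, inspectable selector."""
    dy = concepts.get("dy(player,ball)", 0.0)
    if dy > 1.0:
        return "DOWN"  # ball is below the player (screen y grows downward)
    if dy < -1.0:
        return "UP"    # ball is above the player
    return "NOOP"


# Illustrative Pong-like state; the numbers are made up.
objects = [
    GameObject("player", 140, 100),
    GameObject("ball", 80, 60),
    GameObject("enemy", 16, 62),
]
props = object_concepts(objects)
relations = relational_concepts(objects)
# Assumed intervention: remove the player-enemy relations from the policy input.
kept = prune(relations, removed={"dy(player,enemy)", "dist(player,enemy)"})
action = policy(kept)
```

Because every intermediate value is a named concept, an expert can read which relations feed the decision and remove the ones that should not matter, which is the kind of inspection and regularization the abstract refers to.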

Code Implementations: 1 repo