LG AI | Oct 3, 2022

Reward Learning with Trees: Methods and Evaluation

Tom Bewley, Jonathan Lawry, Arthur Richards, Rachel Craddock, Ian Henderson

arXiv:2210.01007v13.31 citations | h-index: 27

Originality: Incremental advance

AI Analysis

This work addresses the need for transparency and verifiability in AI alignment for researchers and practitioners, though it is incremental as it builds on existing methods for interpretable models.

The paper tackles the problem of learning reward functions from human feedback by using interpretable tree models instead of opaque neural networks. It shows that reward trees are broadly competitive with neural networks on challenging high-dimensional tasks, with good robustness to limited or corrupted data.

Recent efforts to learn reward functions from human feedback have tended to use deep neural networks, whose lack of transparency hampers our ability to explain agent behaviour or verify alignment. We explore the merits of learning intrinsically interpretable tree models instead. We develop a recently proposed method for learning reward trees from preference labels, and show it to be broadly competitive with neural networks on challenging high-dimensional tasks, with good robustness to limited or corrupted data. Having found that reward tree learning can be done effectively in complex settings, we then consider why it should be used, demonstrating that the interpretable reward structure gives significant scope for traceability, verification and explanation.
