Modeling Human Beliefs about AI Behavior for Scalable Oversight
This work addresses the problem of ensuring reliable human supervision of advanced AI systems. The contribution is incremental: it builds on existing scalable oversight research by focusing on belief modeling.
The paper tackles the challenge of scalable oversight for AI systems that exceed human capabilities by addressing how incorrect human beliefs about AI behavior can lead to unreliable feedback. It proposes modeling human beliefs to improve value learning, introducing concepts like belief model covering and using foundation model representations to mimic evaluators' beliefs.
As AI systems advance beyond human capabilities, scalable oversight becomes critical: how can we supervise AI that exceeds our abilities? A key challenge is that human evaluators may form incorrect beliefs about AI behavior in complex tasks, leading to unreliable feedback and poor value inference. To address this, we propose modeling evaluators' beliefs to interpret their feedback more reliably. We formalize human belief models, analyze their theoretical role in value learning, and characterize when ambiguity remains. To reduce reliance on precise belief models, we introduce "belief model covering" as a relaxation. This motivates our preliminary proposal to use the internal representations of adapted foundation models to mimic human evaluators' beliefs. These representations could be used to learn correct values from human feedback even when evaluators misunderstand the AI's behavior. Our work suggests that modeling human beliefs can improve value learning and outlines practical research directions for implementing this approach to scalable oversight.
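The core idea, that feedback should be interpreted through what the evaluator believes happened rather than what actually happened, can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and not the paper's formalism: values are linear in two hypothetical features (task completed, side effect), the evaluator's belief model is a hypothetical function that perceives only half of the side effect, and feedback is the true value of the believed behavior.

```python
# Toy illustration only: the linear setup, feature names, and numbers are
# assumptions made for this sketch, not the paper's formal model.

def dot(u, v):
    """Inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

def solve_2x2(rows, targets):
    """Solve the 2x2 linear system rows @ w = targets with Cramer's rule."""
    (a, b), (c, d) = rows
    y1, y2 = targets
    det = a * d - b * c
    if det == 0:
        raise ValueError("system is singular: the values are ambiguous")
    return ((y1 * d - b * y2) / det, (a * y2 - c * y1) / det)

# Hypothetical true values of (task completed, side effect).
TRUE_WEIGHTS = (1.0, -2.0)

def believed_features(actual):
    """Hypothetical belief model: the evaluator perceives half the side effect."""
    done, side_effect = actual
    return (done, 0.5 * side_effect)

# Actual behavior of two trajectories: one clean, one with a side effect.
trajectories = [(1.0, 0.0), (1.0, 1.0)]

# The evaluator scores what they *believe* happened, using their true values.
feedback = [dot(TRUE_WEIGHTS, believed_features(x)) for x in trajectories]

# Naive learner: assumes the evaluator saw the actual behavior.
naive_w = solve_2x2(trajectories, feedback)

# Belief-aware learner: interprets feedback through the belief model.
aware_w = solve_2x2([believed_features(x) for x in trajectories], feedback)

print("naive:", naive_w)
print("belief-aware:", aware_w)
```

In this toy setting the naive learner infers a side-effect penalty of -1.0, while the belief-aware learner recovers the assumed true penalty of -2.0. If the belief model instead mapped every side effect to zero, the system would be singular and the penalty could not be identified from feedback, a simple instance of the remaining ambiguity the abstract refers to.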