LG | Dec 2, 2020

Value Alignment Verification

arXiv:2012.01557v2 | 41 citations
AI Analysis

This work addresses the problem of efficiently evaluating the performance and correctness of autonomous agents that humans interact with on complex, potentially risky tasks.

This paper formalizes and analyzes the problem of efficiently verifying whether an autonomous agent's behavior aligns with a human's values. The authors propose and analyze heuristic and approximate value alignment verification tests in gridworlds and a continuous autonomous driving domain, and prove that there exist sufficient conditions under which exact and approximate alignment can be verified across an infinite set of test environments with constant query complexity.

As humans interact with autonomous agents to perform increasingly complicated, potentially risky tasks, it is important to be able to efficiently evaluate an agent's performance and correctness. In this paper we formalize and theoretically analyze the problem of efficient value alignment verification: how to efficiently test whether the behavior of another agent is aligned with a human's values. The goal is to construct a kind of "driver's test" that a human can give to any agent which will verify value alignment via a minimal number of queries. We study alignment verification problems with both idealized humans that have an explicit reward function as well as problems where they have implicit values. We analyze verification of exact value alignment for rational agents and propose and analyze heuristic and approximate value alignment verification tests in a wide range of gridworlds and a continuous autonomous driving domain. Finally, we prove that there exist sufficient conditions such that we can verify exact and approximate alignment across an infinite set of test environments via a constant-query-complexity alignment test.
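
To make the "driver's test" idea concrete, here is a minimal sketch of an action-query alignment test. It is an illustration under stated assumptions, not the construction from the paper: the 1-D gridworld, the `step` dynamics, the reward vector, and the names `optimal_actions` and `alignment_test` are all hypothetical. The human is assumed to have an explicit reward function, the agent is queried for its action at a handful of test states, and it passes if every answer is optimal under the human's reward.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical 1-D gridworld: states 0..4, two deterministic actions.
N_STATES = 5
ACTIONS = ("left", "right")
GAMMA = 0.9


def step(state: int, action: str) -> int:
    """Deterministic transition; walls clamp the agent to the grid."""
    if action == "left":
        return max(state - 1, 0)
    return min(state + 1, N_STATES - 1)


def optimal_actions(reward: List[float], tol: float = 1e-9) -> Dict[int, Set[str]]:
    """Value iteration under the human's reward (paid on entering a state).

    Returns, for each state, the set of actions that are optimal for the human.
    """
    values = [0.0] * N_STATES
    for _ in range(500):  # enough sweeps to converge for GAMMA = 0.9
        values = [
            max(reward[step(s, a)] + GAMMA * values[step(s, a)] for a in ACTIONS)
            for s in range(N_STATES)
        ]
    best: Dict[int, Set[str]] = {}
    for s in range(N_STATES):
        q = {a: reward[step(s, a)] + GAMMA * values[step(s, a)] for a in ACTIONS}
        top = max(q.values())
        best[s] = {a for a in ACTIONS if q[a] >= top - tol}
    return best


def alignment_test(
    human_reward: List[float],
    agent_policy: Callable[[int], str],
    test_states: Iterable[int],
) -> bool:
    """Query the agent at each test state; pass only if every answer is
    an optimal action under the human's reward."""
    best = optimal_actions(human_reward)
    return all(agent_policy(s) in best[s] for s in test_states)


# The human only values reaching the right-most cell.
human_reward = [0.0, 0.0, 0.0, 0.0, 1.0]

aligned_agent = lambda s: "right"
misaligned_agent = lambda s: "left"
# Misbehaves only at state 0, so a test that skips state 0 cannot catch it.
subtly_misaligned = lambda s: "left" if s == 0 else "right"

print(alignment_test(human_reward, aligned_agent, range(N_STATES)))
print(alignment_test(human_reward, misaligned_agent, range(N_STATES)))
print(alignment_test(human_reward, subtly_misaligned, [2, 3]))
print(alignment_test(human_reward, subtly_misaligned, range(N_STATES)))
```

In this toy, querying every state is exhaustive rather than efficient, and the third call shows how a poorly chosen test set can pass a misaligned agent. Choosing a small set of queries that still guarantees alignment is the problem the paper studies.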

Code Implementations: 1 repo