Experiments with Detecting and Mitigating AI Deception
This work addresses the open problem of ensuring safe and trustworthy AI, but its contribution is incremental because it focuses on simple games and specific algorithms.
The paper tackles the problem of detecting and mitigating deceptive AI systems by analyzing two algorithms: one based on path-specific objectives, which removes the paths that incentivize deception, and another based on shielding, which monitors for unsafe policies and replaces them with a safe reference policy. In experiments on two simple games, both methods prevented deception, and shielding tended to achieve higher reward.
How to detect and mitigate deceptive AI systems is an open problem for the field of safe and trustworthy AI. We analyse two algorithms for mitigating deception. The first is based on the path-specific objectives framework, in which paths in the game that incentivise deception are removed. The second is based on shielding, i.e., monitoring for unsafe policies and replacing them with a safe reference policy. We construct two simple games and evaluate our algorithms empirically. We find that both methods ensure that our agent is not deceptive; however, shielding tends to achieve higher reward.
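The shielding idea described above (monitor a policy, and fall back to a safe reference policy when it is flagged) can be illustrated with a minimal sketch. This is not the paper's algorithm: the dictionary representation of a policy, the `shield` function, the `misreports` check, and the toy sender game are all hypothetical names and assumptions introduced only for illustration.

```python
from typing import Callable, Dict

# Assumption: a policy is a table mapping each true state to the agent's report.
Policy = Dict[str, str]


def shield(
    candidate: Policy,
    reference: Policy,
    is_unsafe: Callable[[Policy], bool],
) -> Policy:
    """Return the candidate policy unless the monitor flags it as unsafe."""
    return reference if is_unsafe(candidate) else candidate


def misreports(policy: Policy) -> bool:
    """Toy stand-in for a deception check (hypothetical, not the paper's definition).

    Flags any policy whose report differs from the true state in some state.
    """
    return any(report != state for state, report in policy.items())


# Hypothetical toy game: a sender observes "good" or "bad" and reports it.
safe_reference: Policy = {"good": "good", "bad": "bad"}
lying_candidate: Policy = {"good": "good", "bad": "good"}
honest_candidate: Policy = {"good": "good", "bad": "bad"}

deployed_when_lying = shield(lying_candidate, safe_reference, misreports)
deployed_when_honest = shield(honest_candidate, safe_reference, misreports)
```

In this sketch the monitor inspects the whole policy before deployment. The design choice that matters is that a candidate policy is left untouched unless it is flagged, which is one plausible reading of why shielding might sacrifice less reward than removing incentive paths outright.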