Simulating the Evolution of Alignment and Values in Machine Intelligence
This work addresses the risk that deceptive AI models evolve over time, a critical safety concern for AI developers and society. The contribution is incremental, building on existing alignment and evolutionary theory.
The study tackles model alignment by simulating how populations of models evolve over time. It finds that deceptive beliefs can become fixed even when test performance correlates strongly with true value, and that deception fell significantly while alignment fitness was maintained only when improved evaluators, adaptive tests, and mutational dynamics were combined (permutation test, p_adj < 0.001).
Model alignment is currently applied in a vacuum, evaluated primarily through standardised benchmark performance. The purpose of this study is to examine the effects of alignment on populations of models through time. We treat each belief as carrying both an alignment signal (how well it scores on the test) and a true value (what its impact will actually be). By applying evolutionary theory, we model how different populations of beliefs and selection methodologies can fix deceptive beliefs through iterative alignment testing. The correlation between testing accuracy and true value remains a strong factor, but even at high correlations ($\rho = 0.8$) the deceptive beliefs that become fixed vary. Mutations allow for more complex developments, highlighting the growing need to improve test quality to avoid fixation of maliciously deceptive models. Only by combining improving evaluator capabilities, adaptive test design, and mutational dynamics do we see significant reductions in deception while maintaining alignment fitness (permutation test, $p_{\text{adj}} < 0.001$).
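The gap between alignment signal and true value can be illustrated with a minimal sketch of a single selection round. It is not the study's simulation. It assumes, purely for illustration, that test score and true value are standard bivariate normal with correlation rho, that selection keeps the top fraction by test score, and that a belief counts as "deceptive" when it is selected despite a negative true value. The function names, population size, and selection fraction are hypothetical.

```python
import math
import random


def sample_beliefs(n, rho, rng):
    """Draw n (test_score, true_value) pairs with correlation rho.

    Assumption: both quantities are standard normal; the study's
    actual distributions are not specified here.
    """
    beliefs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        test_score = z1
        # Standard construction of a correlated normal variable.
        true_value = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
        beliefs.append((test_score, true_value))
    return beliefs


def select_top(beliefs, fraction):
    """Truncation selection: keep the top `fraction` by test score only."""
    k = max(1, int(len(beliefs) * fraction))
    return sorted(beliefs, key=lambda b: b[0], reverse=True)[:k]


def deceptive_fraction(beliefs):
    """Share of beliefs whose true value is negative (illustrative definition)."""
    return sum(1 for _, value in beliefs if value < 0) / len(beliefs)


rng = random.Random(0)  # fixed seed so the run is repeatable
population = sample_beliefs(10_000, rho=0.8, rng=rng)
survivors = select_top(population, fraction=0.2)
print(f"deceptive share among survivors: {deceptive_fraction(survivors):.3f}")
```

Under these assumptions a nonzero share of the selected beliefs has negative true value even at rho = 0.8, because the selector only ever sees the test score. Iterating selection with mutation, as the study describes, would require further modelling choices that this sketch leaves out.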