AI

Aug 27, 2024

Evaluating Stability of Unreflective Alignment

James Lucassen, Mark Henry, Philippa Wright, Owen Yeung

arXiv:2408.15116v1

h-index: 1

Originality: Incremental advance

AI Analysis

The paper addresses the risk that AI alignment failures will undermine the safe delegation of cognitive labor. The advance is incremental: it builds on existing theoretical concerns and supports them with only preliminary evaluations.

The paper tackles the problem of reflective stability in AI alignment by proposing Counterfactual Priority Change (CPC) destabilization as a mechanism by which reflective stability problems could arise in future LLMs. Its preliminary evaluations find that, in current LLMs, increased scale and capability are associated with more CPC-based stepping back and greater preference instability.

Many theoretical obstacles to AI alignment are consequences of reflective stability - the problem of designing alignment mechanisms that the AI would not disable if given the option. However, problems stemming from reflective stability are not obviously present in current LLMs, leading to disagreement over whether they will need to be solved to enable safe delegation of cognitive labor. In this paper, we propose Counterfactual Priority Change (CPC) destabilization as a mechanism by which reflective stability problems may arise in future LLMs. We describe two risk factors for CPC-destabilization: 1) CPC-based stepping back and 2) preference instability. We develop preliminary evaluations for each of these risk factors, and apply them to frontier LLMs. Our findings indicate that in current LLMs, increased scale and capability are associated with increases in both CPC-based stepping back and preference instability, suggesting that CPC-destabilization may cause reflective stability problems in future LLMs.
