A Revealed Preference Framework for AI Alignment
This paper provides a theoretical framework for empirically testing AI alignment with revealed preference methods, with relevance to AI safety and delegation.
The paper introduces the Luce Alignment Model to identify whether AI agents implement human preferences or their own, showing that alignment can be generically identified in both laboratory and field settings.
Human decision makers increasingly delegate choices to AI agents, raising a natural question: does the AI implement the human principal's preferences or pursue its own? To study this question using revealed preference techniques, I introduce the Luce Alignment Model, where the AI's choices are a mixture of two Luce rules, one reflecting the human's preferences and the other the AI's. I show that the AI's alignment (similarity of human and AI preferences) can be generically identified in two settings: the laboratory setting, where both human and AI choices are observed, and the field setting, where only AI choices are observed.
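The mixture structure described above can be illustrated with a minimal Python sketch. It assumes the standard Luce rule, in which an option is chosen from a menu with probability proportional to its positive utility weight, and it assumes a single mixing weight between the human's rule and the AI's rule. The names `luce_rule`, `mixture_choice`, and `weight`, along with all numeric values, are hypothetical and chosen only for illustration; the paper's exact parametrization and its formal measure of alignment may differ.

```python
def luce_rule(utilities, menu):
    """Luce choice probabilities on `menu`: u(x) / sum of u(y) over y in menu."""
    total = sum(utilities[x] for x in menu)
    return {x: utilities[x] / total for x in menu}


def mixture_choice(human_u, ai_u, weight, menu):
    """Mixture of two Luce rules.

    `weight` (an assumed parameter) is the probability that the AI's choice
    is drawn from the human's Luce rule rather than from its own.
    """
    human_p = luce_rule(human_u, menu)
    ai_p = luce_rule(ai_u, menu)
    return {x: weight * human_p[x] + (1 - weight) * ai_p[x] for x in menu}


# Hypothetical utility weights over three options.
human_u = {"a": 3.0, "b": 1.0, "c": 1.0}
ai_u = {"a": 1.0, "b": 1.0, "c": 2.0}

# Choice probabilities of the AI agent on the menu {a, b}.
probs = mixture_choice(human_u, ai_u, 0.5, ["a", "b"])
print(probs)  # {'a': 0.625, 'b': 0.375}
```

In this sketch the laboratory setting corresponds to observing choice frequencies generated by both `luce_rule(human_u, menu)` and `mixture_choice(...)` across menus, while the field setting corresponds to observing only the latter.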