TH · AI · GT · Mar 29

A Revealed Preference Framework for AI Alignment

arXiv:2603.27868 · 5.3 · h-index: 7
Predicted impact: top 62% in TH · last 90 days
Originality: Incremental advance
AI Analysis

Provides a theoretical framework for empirically testing AI alignment using revealed preference methods, relevant for AI safety and delegation.

The paper introduces the Luce Alignment Model to identify whether AI agents implement human preferences or their own, showing that alignment can be generically identified in both laboratory and field settings.

Human decision makers increasingly delegate choices to AI agents, raising a natural question: does the AI implement the human principal's preferences or pursue its own? To study this question using revealed preference techniques, I introduce the Luce Alignment Model, where the AI's choices are a mixture of two Luce rules, one reflecting the human's preferences and the other the AI's. I show that the AI's alignment (similarity of human and AI preferences) can be generically identified in two settings: the laboratory setting, where both human and AI choices are observed, and the field setting, where only AI choices are observed.
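As a rough illustration of the mixture structure described in the abstract, the sketch below computes choice probabilities when the AI's choices mix two Luce rules. It is a minimal sketch under stated assumptions, not the paper's specification: the constant mixing weight `weight`, the function names, and the example utilities are all hypothetical, and the paper's exact parametrization and its alignment measure are not given here.

```python
from typing import Dict, Iterable


def luce_probs(utils: Dict[str, float], menu: Iterable[str]) -> Dict[str, float]:
    """Standard Luce rule: P(x | menu) = u(x) / sum of u(y) over y in menu."""
    items = list(menu)
    total = sum(utils[x] for x in items)
    return {x: utils[x] / total for x in items}


def mixture_probs(
    human_utils: Dict[str, float],
    ai_utils: Dict[str, float],
    weight: float,
    menu: Iterable[str],
) -> Dict[str, float]:
    """Mix two Luce rules on the same menu.

    `weight` is an assumed constant weight on the human's Luce rule;
    the remaining mass (1 - weight) goes to the AI's own Luce rule.
    """
    items = list(menu)
    human = luce_probs(human_utils, items)
    ai = luce_probs(ai_utils, items)
    return {x: weight * human[x] + (1.0 - weight) * ai[x] for x in items}


# Hypothetical positive Luce utilities for three alternatives.
human_utils = {"a": 2.0, "b": 1.0, "c": 1.0}
ai_utils = {"a": 1.0, "b": 3.0, "c": 1.0}

# AI choice probabilities on the menu {a, b} with equal weight on each rule.
example = mixture_probs(human_utils, ai_utils, weight=0.5, menu=["a", "b"])
```

In the laboratory setting described above, both the human's choice frequencies and the AI's would be observed, whereas in the field setting only the mixture frequencies (here, `example`) would be available to the analyst.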

