Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis
This work addresses fundamental complexity barriers in aligning AI systems with human values, which is critical for AI safety and ethics, though it is incremental in formalizing existing alignment challenges.
The paper formalizes AI alignment as a multi-objective optimization problem and proves an information-theoretic lower bound showing that intrinsic alignment overheads are unavoidable when the number of objectives or agents is large, establishing limits to alignment itself. It also constructs algorithms to achieve alignment under various conditions, demonstrating that reward hacking is inevitable in large task spaces, and offers principles for scalable human-AI collaboration.
We formalize AI alignment as a multi-objective optimization problem called $\langle M,N,\varepsilon,δ\rangle$-agreement, in which a set of $N$ agents (including humans) must reach approximate ($\varepsilon$) agreement across $M$ candidate objectives, with probability at least $1-δ$. Analyzing communication complexity, we prove an information-theoretic lower bound showing that once either $M$ or $N$ is large enough, no amount of computational power or rationality can avoid intrinsic alignment overheads. This establishes rigorous limits to alignment *itself*, not merely to particular methods, clarifying a "No-Free-Lunch" principle: encoding "all human values" is inherently intractable and must be managed through consensus-driven reduction or prioritization of objectives. Complementing this impossibility result, we construct explicit algorithms as achievability certificates for alignment under both unbounded and bounded rationality with noisy communication. Even in these best-case regimes, our bounded-agent and sampling analysis shows that with large task spaces ($D$) and finite samples, *reward hacking is globally inevitable*: rare high-loss states are systematically under-covered, implying scalable oversight must target safety-critical slices rather than uniform coverage. Together, these results identify fundamental complexity barriers -- tasks ($M$), agents ($N$), and state-space size ($D$) -- and offer principles for more scalable human-AI collaboration.