AI · LG · Jan 27, 2025

What is Harm? Baby Don't Hurt Me! On the Impossibility of Complete Harm Specification in AI Alignment

arXiv:2501.16448v1 · h-index: 1 · IJCNLP-AACL

Originality: Highly original

AI Analysis

This work addresses a foundational challenge for AI alignment researchers and developers: it argues that pursuing complete harm specifications is inherently flawed, and that the field should shift toward systems that remain safe under specification uncertainty.

The paper tackles the problem of specifying harm in AI alignment. Using information theory, it argues that complete harm specification is impossible for any system whose notion of harm is defined outside its specifications, because the entropy of harm H(O) always exceeds the mutual information I(O;I) between ground-truth harm O and the system's specifications I.

"First, do no harm" faces a fundamental challenge in artificial intelligence: how can we specify what constitutes harm? While prior work treats harm specification as a technical hurdle to be overcome through better algorithms or more data, we argue this assumption is unsound. Drawing on information theory, we demonstrate that complete harm specification is fundamentally impossible for any system where harm is defined external to its specifications. This impossibility arises from an inescapable information-theoretic gap: the entropy of harm H(O) always exceeds the mutual information I(O;I) between ground truth harm O and a system's specifications I. We introduce two novel metrics: semantic entropy H(S) and the safety-capability ratio I(O;I)/H(O), to quantify these limitations. Through a progression of increasingly sophisticated specification attempts, we show why each approach must fail and why the resulting gaps are not mere engineering challenges but fundamental constraints akin to the halting problem. These results suggest a paradigm shift: rather than pursuing complete specifications, AI alignment research should focus on developing systems that can operate safely despite irreducible specification uncertainty.
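To make the quantities in the abstract concrete, the following is a minimal sketch that computes H(O), I(O;I), and the safety-capability ratio I(O;I)/H(O) for a small discrete joint distribution. The toy distribution, the function names, and the choice of bits as the unit are all assumptions made for illustration; none of them come from the paper, and the sketch does not cover semantic entropy H(S), which the abstract names but does not define.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(joint):
    """Mutual information I(O;I) in bits.

    joint[o][i] is P(O=o, I=i): rows index ground-truth harm O,
    columns index the system's specification I.
    """
    p_o = [sum(row) for row in joint]        # marginal over O
    p_i = [sum(col) for col in zip(*joint)]  # marginal over I
    mi = 0.0
    for o, row in enumerate(joint):
        for i, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (p_o[o] * p_i[i]))
    return mi


def safety_capability_ratio(joint):
    """Ratio I(O;I) / H(O) as named in the abstract."""
    p_o = [sum(row) for row in joint]
    return mutual_information(joint) / entropy(p_o)


# Hypothetical toy case: a binary specification that agrees with
# binary ground-truth harm 80% of the time.
toy_joint = [
    [0.4, 0.1],  # O = harmless: spec says harmless / harmful
    [0.1, 0.4],  # O = harmful:  spec says harmless / harmful
]

h_o = entropy([sum(row) for row in toy_joint])
i_oi = mutual_information(toy_joint)
ratio = safety_capability_ratio(toy_joint)

print(f"H(O) = {h_o:.3f} bits")
print(f"I(O;I) = {i_oi:.3f} bits")
print(f"I(O;I)/H(O) = {ratio:.3f}")
```

In this toy case the specification captures only part of the uncertainty about harm, so the ratio falls below 1. The example illustrates the gap for one chosen distribution; it is not a demonstration of the paper's general impossibility claim.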
