Are language models rational? The case of coherence norms and belief revision
This paper addresses AI safety and alignment by linking rationality to the prediction of model behavior, though its contribution is incremental in that it applies existing philosophical concepts to machine learning.
The paper investigates whether coherence norms of rationality apply to language models by introducing a new account of credence based on next-token probabilities, and finds that these norms apply to some models but not to others.
Do norms of rationality apply to machine learning models, in particular language models? In this paper we investigate this question by focusing on a special subset of rational norms: coherence norms. We consider both logical coherence norms and coherence norms tied to the strength of belief. To make sense of the latter, we introduce the Minimal Assent Connection (MAC) and propose a new account of credence, which captures the strength of belief in language models. This proposal uniformly assigns strength of belief simply on the basis of model-internal next-token probabilities. We argue that rational norms tied to coherence do apply to some language models, but not to others. This issue is significant because rationality is closely tied to predicting and explaining behavior, and it is therefore connected to considerations about AI safety and alignment, as well as to understanding model behavior more generally.
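To make the idea of reading strength of belief off next-token probabilities concrete, the following is a minimal illustrative sketch, not the paper's definition of MAC or of credence. It assumes, purely for illustration, that a model is prompted to assent to or dissent from a proposition and that its credence is the probability of an assent token renormalized against a dissent token. The function names (`credence`, `negation_coherent`), the token strings, the tolerance, and the logit values are all hypothetical.

```python
import math


def softmax(logits):
    """Convert a {token: logit} mapping into next-token probabilities."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def credence(logits, assent="Yes", dissent="No"):
    """Hypothetical credence: probability of assent, renormalized over
    the assent and dissent tokens only (an assumption of this sketch)."""
    probs = softmax(logits)
    p_yes, p_no = probs[assent], probs[dissent]
    return p_yes / (p_yes + p_no)


def negation_coherent(cred_p, cred_not_p, tol=0.05):
    """One probabilistic coherence check: the credences in p and in not-p
    should sum to roughly 1. The tolerance is an arbitrary choice."""
    return abs(cred_p + cred_not_p - 1.0) <= tol


# Made-up logits standing in for a model's output after being asked
# whether it assents to p, and separately to not-p.
logits_p = {"Yes": 2.0, "No": 0.0, "Maybe": -1.0}
logits_not_p = {"Yes": 0.0, "No": 2.0, "Maybe": -1.0}

c_p = credence(logits_p)
c_not_p = credence(logits_not_p)
print(round(c_p, 3), round(c_not_p, 3), negation_coherent(c_p, c_not_p))
```

In this toy case the two credences sum to 1, so the check passes; a model that assigned high credence to both a proposition and its negation would fail it. Whether such a failure counts as a violation of a rational norm depends on whether the norm applies to the model at all, which is the question the paper addresses.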