A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment
This work addresses the problem of understanding the internal representations of large language models, a question of interest to researchers in AI interpretability, though its contribution to causal interpretations is incremental.
The authors investigate whether GPT models trained on next-token prediction implicitly learn a causal world model. In controlled environments based on Othello and Chess, they find that the model is likely to generate legal moves for out-of-distribution sequences when a causal structure is encoded in the attention mechanism with high confidence.
Are generative pre-trained transformer (GPT) models, trained only to predict the next token, implicitly learning a world model from which sequences are generated one token at a time? We address this question by deriving a causal interpretation of the attention mechanism in GPT and presenting a causal world model that arises from this interpretation. Furthermore, we propose that GPT models, at inference time, can be utilized for zero-shot causal structure learning for input sequences, and introduce a corresponding confidence score. Empirical tests were conducted in controlled environments using the setups of the Othello and Chess strategy games. A GPT, pre-trained on real-world games played with the intention of winning, was tested on out-of-distribution synthetic data consisting of sequences of random legal moves. We find that the GPT model is likely to generate legal next moves for out-of-distribution sequences for which a causal structure is encoded in the attention mechanism with high confidence. In cases where it generates illegal moves, it also fails to capture a causal structure.
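The object the abstract interprets causally is the attention mechanism of a GPT. As a rough illustration only, the sketch below computes the attention matrix of a single causally masked head on toy inputs and shows its lower-triangular structure: each token can attend only to itself and earlier tokens. This is not the authors' derivation, their structure-learning procedure, or their confidence score, none of which are specified in the abstract. The function names (`causal_attention`, `make_toy_inputs`) and the random toy inputs are assumptions made for illustration.

```python
import math
import random


def causal_attention(queries, keys):
    """Attention matrix of one causally masked head (standard GPT-style masking).

    Row i holds the weights token i assigns to tokens 0..i; later positions
    are masked out and receive weight 0.
    """
    n = len(queries)
    d = len(queries[0])
    attn = []
    for i in range(n):
        # Scaled dot-product scores against positions 0..i only (causal mask).
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the unmasked positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        attn.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return attn


def make_toy_inputs(n_tokens=5, dim=4, seed=0):
    """Hypothetical random query/key vectors standing in for a trained model."""
    rng = random.Random(seed)
    queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
    keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
    return queries, keys


queries, keys = make_toy_inputs()
A = causal_attention(queries, keys)
for row in A:
    print(" ".join(f"{w:.2f}" for w in row))
```

Each printed row sums to 1 and has zeros to the right of the diagonal, so the matrix respects the temporal order of the sequence. In a trained model the query and key vectors would come from learned projections of the token embeddings rather than from random draws.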