Why are language models less surprised than humans? Testing the Parse Multiplicity Mismatch Hypothesis
For psycholinguistics and NLP, the paper tests a specific computational-constraint hypothesis for the mismatch between LM surprisal and human processing difficulty.
The paper tests whether language models' ability to consider more simultaneous sentence interpretations than humans explains their underprediction of garden path effects. Reducing the number of active parses in RNNGs increased predicted garden path effects, but not enough to match human data, suggesting that this hypothesis cannot by itself explain the mismatch.
Surprisal theory posits that the processing difficulty of a word is determined by its predictability in context, offering a potential link between human sentence processing and next-word predictions from language models. While language model (LM) surprisals successfully predict reading times in naturalistic text, they systematically underpredict the magnitude of difficulty observed in controlled studies of syntactic ambiguity, particularly in garden path sentences. This mismatch might arise from differences between the computational constraints of humans and those of LMs. Here we test one such hypothesis: that LMs can consider a greater number of distinct sentence interpretations simultaneously than humans can. Using Recurrent Neural Network Grammars (RNNGs) with word-synchronous beam search, we systematically vary the number of simultaneous parses used to compute word surprisal, and then use these surprisals to predict human reading times. Reducing the number of simultaneous active parses indeed increases the magnitude of predicted garden path effects, but not nearly enough to capture the full magnitude of the effects in humans. This suggests that differences in the number of simultaneous parses available to LMs and humans cannot reconcile LM-based surprisal with human sentence processing.
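To make the manipulation concrete, the following is a minimal sketch of how word surprisal can be computed from a beam of parses, and why shrinking the beam can inflate surprisal at a disambiguating word. It is not the paper's implementation. The function names, the two parse labels, and all probabilities are hypothetical values chosen for illustration, and the sketch assumes each parse has exactly one continuation, whereas an RNNG scores sequences of parser actions.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def beam_surprisal(prev_logps, curr_logps):
    """Surprisal in bits of the current word, estimated from a beam.

    prev_logps: log joint probabilities of the parses kept before the word.
    curr_logps: log joint probabilities of their continuations through the word.
    """
    return -(logsumexp(curr_logps) - logsumexp(prev_logps)) / math.log(2)

def garden_path_surprisal(k):
    """Surprisal at a disambiguating word when only the top-k parses are kept.

    All numbers below are hypothetical toy values, not estimates from any model.
    """
    # Joint probability of each parse with the ambiguous prefix.
    prefix = {"main_verb": 0.9, "reduced_relative": 0.1}
    # Probability of the disambiguating word under each parse.
    word_given_parse = {"main_verb": 0.001, "reduced_relative": 0.5}

    # Keep the k most probable parses of the prefix (the beam).
    kept = sorted(prefix, key=prefix.get, reverse=True)[:k]
    prev = [math.log(prefix[p]) for p in kept]
    curr = [math.log(prefix[p] * word_given_parse[p]) for p in kept]
    return beam_surprisal(prev, curr)

if __name__ == "__main__":
    for k in (2, 1):
        print(f"beam size {k}: {garden_path_surprisal(k):.2f} bits")
```

With both parses in the beam, the dispreferred reading still contributes probability mass to the disambiguating word, so surprisal stays moderate. With a beam of one, that reading has already been pruned and surprisal rises sharply. This is the direction of the effect the abstract describes, shown here only in a toy setting.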