A Psycholinguistic Evaluation of Language Models' Sensitivity to Argument Roles
This work asks how language models' processing of language compares with that of humans, a question of interest to researchers in NLP and cognitive science. Its contribution is incremental: it identifies specific limitations of current models.
The study evaluates large language models' sensitivity to argument roles by replicating psycholinguistic experiments. The models distinguish verbs in plausible contexts from verbs in implausible ones, but they do not reproduce the patterns that humans show during real-time processing.
We present a systematic evaluation of large language models' sensitivity to argument roles, i.e., who did what to whom, by replicating psycholinguistic studies on human argument role processing. In three experiments, we find that language models are able to distinguish verbs that appear in plausible and implausible contexts, where plausibility is determined through the relation between the verb and its preceding arguments. However, none of the models capture the same selective patterns that human comprehenders exhibit during real-time verb prediction. This indicates that language models' capacity to detect verb plausibility does not arise from the same mechanism that underlies human real-time sentence processing.
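One way to make the plausibility comparison concrete is to score the same verb after a plausible context and after a role-reversed, implausible context, then compare its surprisal in the two. The sketch below is only an illustration of that idea under stated assumptions, not the authors' pipeline: the example sentence pair, the function names, and the probabilities in `TOY_PROBS` are all hypothetical placeholders, and a real evaluation would replace `toy_verb_prob` with probabilities read from a language model.

```python
import math
from typing import Callable, Iterable, Tuple

# Each item: (plausible context, implausible context, target verb).
Item = Tuple[str, str, str]


def surprisal_bits(prob: float) -> float:
    """Surprisal of an event with probability `prob`, in bits."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def mean_plausibility_effect(
    verb_prob: Callable[[str, str], float],
    items: Iterable[Item],
) -> float:
    """Mean of surprisal(implausible) - surprisal(plausible) over items.

    A positive value means the verb is, on average, less expected
    in the implausible contexts than in the plausible ones.
    """
    diffs = [
        surprisal_bits(verb_prob(implausible, verb))
        - surprisal_bits(verb_prob(plausible, verb))
        for plausible, implausible, verb in items
    ]
    if not diffs:
        raise ValueError("need at least one item")
    return sum(diffs) / len(diffs)


# HYPOTHETICAL probabilities, chosen only to make the arithmetic easy
# to follow. They are not results from any model or from the paper.
TOY_PROBS = {
    ("The owner forgot which customer the waitress had", "served"): 0.25,
    ("The owner forgot which waitress the customer had", "served"): 0.125,
}


def toy_verb_prob(context: str, verb: str) -> float:
    """Stand-in for a model's P(verb | context); looks up TOY_PROBS."""
    return TOY_PROBS[(context, verb)]


ITEMS = [
    (
        "The owner forgot which customer the waitress had",  # plausible
        "The owner forgot which waitress the customer had",  # role-reversed
        "served",
    )
]

effect = mean_plausibility_effect(toy_verb_prob, ITEMS)
print(f"mean plausibility effect: {effect} bits")
```

Keeping the scoring function as a parameter separates the measurement (a surprisal difference) from the choice of model, so the same comparison can be run across several models or compared against human measures item by item.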