Grammaticality Judgments in Humans and Language Models: Revisiting Generative Grammar with LLMs
This paper addresses the problem of understanding the syntactic capabilities of AI models, which is relevant to both linguistics and AI researchers, but its contribution is incremental because it applies existing methods to new data.
The paper tested whether large language models (LLMs) such as GPT-4 and LLaMA-3, trained only on surface text, reproduce grammaticality contrasts from generative grammar, such as subject-auxiliary inversion and parasitic gap licensing. It found that LLMs reliably distinguish grammatical from ungrammatical variants, implying sensitivity to syntactic structure.
What counts as evidence for syntactic structure? In traditional generative grammar, systematic contrasts in grammaticality, such as those involving subject-auxiliary inversion and the licensing of parasitic gaps, are taken as evidence for an internal, hierarchical grammar. In this paper, we test whether large language models (LLMs), trained only on surface forms, reproduce these contrasts in ways that imply an underlying structural representation. We focus on two classic constructions: subject-auxiliary inversion (testing recognition of the subject boundary) and parasitic gap licensing (testing abstract dependency structure). We evaluate models including GPT-4 and LLaMA-3 using prompts that elicit acceptability ratings. Results show that LLMs reliably distinguish between grammatical and ungrammatical variants in both constructions, which supports the view that they are sensitive to structure and not just to linear order. Structural generalizations, as distinct from cognitive knowledge, emerge from predictive training on surface forms, suggesting functional sensitivity to syntax without explicit encoding.
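The evaluation described above (prompting for acceptability ratings on grammatical and ungrammatical variants) can be sketched roughly as follows. This is a minimal illustration, not the paper's procedure. The prompt wording, the 1-7 scale, the example sentences, and the names `MinimalPair`, `build_prompt`, `contrast_accuracy`, and `toy_rater` are all assumptions introduced here. The `rate` callable stands in for whatever model call returns a numeric rating, and `toy_rater` returns arbitrary placeholder scores only so that the sketch runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class MinimalPair:
    """A grammatical sentence and its minimally different ungrammatical variant."""
    construction: str
    grammatical: str
    ungrammatical: str


# Hypothetical prompt wording and scale; the paper's actual prompts are not shown.
PROMPT_TEMPLATE = (
    "Rate the acceptability of the following English sentence on a scale "
    "from 1 (completely unacceptable) to 7 (completely acceptable). "
    "Reply with a single number.\n\nSentence: {sentence}"
)


def build_prompt(sentence: str) -> str:
    """Wrap a sentence in the (assumed) acceptability-rating prompt."""
    return PROMPT_TEMPLATE.format(sentence=sentence)


def contrast_accuracy(
    pairs: Iterable[MinimalPair], rate: Callable[[str], float]
) -> float:
    """Fraction of pairs where the grammatical variant is rated strictly higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(
        rate(build_prompt(p.grammatical)) > rate(build_prompt(p.ungrammatical))
        for p in pairs
    )
    return correct / len(pairs)


# Illustrative textbook-style items, NOT the paper's stimuli.
PAIRS = [
    MinimalPair(
        "subject-auxiliary inversion",
        "Is the man who is tall happy?",
        "Is the man who tall is happy?",
    ),
    MinimalPair(
        "parasitic gap",
        "Which articles did John file without reading?",
        "John filed the articles without reading.",
    ),
]


def toy_rater(prompt: str) -> float:
    """Stand-in for a model call: arbitrary placeholder scores, not results."""
    sentence = prompt.rsplit("Sentence: ", 1)[1]
    good = {p.grammatical for p in PAIRS}
    return 6.0 if sentence in good else 2.0


if __name__ == "__main__":
    print(contrast_accuracy(PAIRS, toy_rater))
```

In practice `toy_rater` would be replaced by a function that sends the prompt to a model and parses the numeric reply. Scoring each pair by a strict comparison, instead of a fixed threshold, keeps the measure independent of how a given model uses the rating scale.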