Measuring Form and Function in Language Models
For researchers in cognitive science and NLP, this work provides a targeted evaluation method to compare language models against human developmental benchmarks.
The paper introduces quantitative metrics from child language acquisition to evaluate language models on the formal syntactic and functional discourse properties of English determiners. It finds that no model trained on an amount of data comparable to what children receive meets both benchmarks, but some very large models do.
We introduce quantitative metrics from child language acquisition to evaluate language models. Our focus is on the formal syntactic and functional discourse properties of determiners in English, which young children acquire early and accurately. We propose Contextual Alternative Choice (CAC), a new prompting method that provides targeted tests of both syntactic and discourse knowledge of language. The method enables direct comparison of language models against children and, more importantly, against statistical benchmarks independently established in empirical research. No current model trained on a comparable amount of data simultaneously meets both the formal and the functional benchmarks as human children do, but some very large models do. We present our results as methodological and technical contributions, with specific emphasis on the cognitive status of language models.