Child-directed speech facilitates production, not comprehension, in BabyLMs
For researchers in language acquisition and computational linguistics, this work challenges the prevailing view that CDS is unhelpful for BabyLMs by revealing a dissociation between comprehension and production, highlighting the need for production-based evaluations.
The study shows that child-directed speech (CDS) facilitates grammatical production in language models, as measured by a novel frame-completion task, but not comprehension, where models trained on web-crawl data perform better. CDS-trained models produce grammatical completions earlier in training and concentrate more probability mass on appropriate slot-fillers.
Recent studies suggest that child-directed speech (CDS) is not conducive to language learning in BabyLMs. However, current evaluations focus predominantly on comprehension rather than production, which is central to usage-based theories of language acquisition. These theories argue that CDS facilitates early language use through constructional "frames" (frequent lexical patterns with open slots). We introduce a novel generation-based evaluation inspired by such theories, in the form of a frame-completion task, and compare Llama models trained on CDS, the BabyLM corpus, and web-crawl data (FineWeb-edu) on comprehension benchmarks and on our novel framework. Our results reveal a clear dissociation between models' comprehension and production capabilities: while FineWeb-trained models excel at minimal pairs, CDS-trained models produce grammatical completions substantially earlier in training and concentrate probability mass on appropriate slot-fillers. These findings show that comprehension benchmarks underestimate what CDS affords to BabyLMs.
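To make "probability mass on appropriate slot-fillers" concrete, the following is a minimal sketch of one way such a quantity could be computed for a single frame. It is an illustration under stated assumptions, not the paper's actual metric: the frame, the toy logits, the filler set, and the function names `softmax` and `slot_filler_mass` are all hypothetical, and a real evaluation would read next-token scores from a trained model rather than a hand-written dictionary.

```python
import math


def softmax(logits):
    """Convert a {token: logit} mapping into a {token: probability} mapping."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: val / total for tok, val in exps.items()}


def slot_filler_mass(next_token_probs, fillers):
    """Sum the probability assigned to tokens treated as acceptable fillers."""
    return sum(p for tok, p in next_token_probs.items() if tok in fillers)


# Hypothetical frame "Where's the ___" with made-up next-token logits.
toy_logits = {"ball": 2.0, "dog": 1.5, "the": 0.0, "is": -1.0}
# Assumed set of acceptable fillers for the open slot (here: nouns).
fillers = {"ball", "dog"}

probs = softmax(toy_logits)
mass = slot_filler_mass(probs, fillers)
print(f"slot-filler mass: {mass:.3f}")
```

A model that has learned the frame would be expected to place most of its next-token probability on the filler set, so tracking this value across training checkpoints gives one possible view of when production of the frame emerges.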