Mila Marcheva

h-index 30
2 papers

2 Papers

CL · Oct 11, 2025
BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data

Jaap Jumelet, Abdellah Fourtassi, Akari Haga et al. · mila

We present BabyBabelLM, a multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. We curate developmentally plausible pretraining data aiming to cover the equivalent of 100M English words of content in each of 45 languages. We compile evaluation suites and train baseline models in each language. BabyBabelLM aims to facilitate multilingual pretraining and cognitive modeling.

11.2 · CL · May 8
A Computational Operationalisation of Competing Maturational Theories of Syntactic Development via Statistical Grammar Induction

Mila Marcheva, Suchir Salhan, Weiwei Sun · mila

This paper is concerned with what intermediate syntactic categories children acquire during first language development, and in what order. Maturational theories make different predictions. Bottom-up accounts (GROWING) propose that lexical and inflectional structure emerges first, while inward accounts (INWARD) predict early access to discourse-related categories. We computationally operationalise these hypotheses of staged syntactic emergence using statistical grammar induction, asking what each proposed ordering makes learnable when input and learning algorithm are held constant. Our framework makes category acquisition explicit and allows us to explore how different maturational orderings shape the structure that can be learned under identical conditions. Based on this operationalisation, the GROWING account significantly outperforms the INWARD account across three evaluation metrics.