CurLL: A Developmental Framework to Evaluate Continual Learning in Language Models
This work provides a systematic evaluation framework for continual learning in language models, addressing the need for benchmarks that measure how models progressively acquire and retain skills.
The authors introduce CurLL, a dataset and benchmark based on human developmental stages from ages 5-10. Experiments that train a 135M-parameter transformer under independent, joint, and sequential (continual) setups show trade-offs in skill retention and transfer efficiency.
We introduce a comprehensive continual learning dataset and benchmark (CurLL) grounded in human developmental trajectories from ages 5-10, enabling systematic and fine-grained assessment of models' ability to progressively acquire new skills. CurLL spans five developmental stages (0-4) covering ages 5-10, supported by a skill graph that breaks down broad skills into smaller abilities, concrete goals, and measurable indicators, while also capturing which abilities build on others. We generate a 23.4B-token synthetic dataset with controlled skill progression, vocabulary complexity, and format diversity, comprising paragraphs, comprehension-based QA (CQA), skill-testing QA (CSQA), and instruction-response (IR) pairs. Stage-wise token counts range from 2.12B to 6.78B tokens, supporting precise analysis of forgetting, forward transfer, and backward transfer. Using a 135M-parameter transformer trained under independent, joint, and sequential (continual) setups, we show trade-offs in skill retention and transfer efficiency. By mirroring human learning patterns and providing fine-grained control over skill dependencies, this work advances continual learning evaluations for language models.
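The abstract names forgetting, forward transfer, and backward transfer without defining them. The sketch below shows one common way such quantities are computed in the continual learning literature from a stage-by-stage score matrix; it is an illustration under stated assumptions, not necessarily the formulation used in CurLL. The function name `continual_metrics`, the matrix `R`, the `baseline` vector, and all numbers are hypothetical.

```python
import numpy as np


def continual_metrics(R, baseline):
    """Summarize a sequential-training run (illustrative definitions only).

    R[i, j]     : score on stage j after training sequentially through stage i.
    baseline[j] : score on stage j for a reference model that has not been
                  trained on the sequence (assumption: e.g. random init).
    """
    R = np.asarray(R, dtype=float)
    b = np.asarray(baseline, dtype=float)
    T = R.shape[0]
    if T < 2 or R.shape != (T, T) or b.shape != (T,):
        raise ValueError("expected a T x T matrix and a length-T baseline, T >= 2")

    final = R[-1, :-1]            # final score on every stage except the last
    just_learned = np.diag(R)[:-1]  # score on stage j right after training on it

    # Backward transfer: how later training changed earlier-stage scores.
    backward = float(np.mean(final - just_learned))

    # Forward transfer: score on stage j+1 before training on it, vs. baseline.
    forward = float(np.mean(np.diag(R, k=1) - b[1:]))

    # Forgetting: best score seen before the final stage minus the final score.
    best_before_final = R[:-1, :-1].max(axis=0)
    forgetting = float(np.mean(best_before_final - final))

    return {"backward": backward, "forward": forward, "forgetting": forgetting}


# Made-up scores for three stages; not results from the paper.
example_R = [
    [0.8, 0.3, 0.2],
    [0.7, 0.9, 0.4],
    [0.6, 0.8, 0.9],
]
example_baseline = [0.2, 0.2, 0.2]
metrics = continual_metrics(example_R, example_baseline)
```

With five stages (0-4), `R` would be a 5 x 5 matrix filled in by evaluating the sequentially trained model on every stage after each training phase. Negative backward transfer and positive forgetting both indicate that earlier-stage skills degraded.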