CL | AI | Apr 12, 2025

Parameterized Synthetic Text Generation with SimpleStories

arXiv:2504.09184v3 | 7 citations | h-index: 6 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work provides a domain-specific resource for studying language model training processes, with incremental improvements in dataset generation.

The authors tackled the problem of generating synthetic text datasets with controlled characteristics by creating SimpleStories, a dataset of 2 million samples each in English and Japanese. They report improved sample efficiency and model interpretability compared to TinyStories.

We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. Through parameterizing prompts at multiple levels of abstraction, we achieve control over story characteristics at scale, inducing syntactic and semantic diversity. Ablations on a newly trained model suite show improved sample efficiency and model interpretability compared to the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we move the frontier regarding the fewest-parameter language model that outputs grammatical natural language.
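
To illustrate what parameterizing prompts can look like in practice, here is a minimal sketch in Python. It is not the authors' pipeline: the parameter pools (`THEMES`, `STYLES`, `FEATURES`), the template wording, and the function names are all hypothetical, chosen only to show how sampling one value per parameter yields varied generation prompts.

```python
import random

# Hypothetical parameter pools; the values SimpleStories actually uses
# are not given in this listing.
THEMES = ["friendship", "courage", "curiosity"]
STYLES = ["fable", "mystery", "adventure"]
FEATURES = ["a plot twist", "dialogue", "a moral lesson"]

# Hypothetical prompt template with one slot per parameter.
TEMPLATE = (
    "Write a short story in simple language. "
    "Theme: {theme}. Style: {style}. Include {feature}. "
    "Keep it to about {num_paragraphs} paragraphs."
)


def sample_prompt(rng: random.Random) -> dict:
    """Draw one value per parameter and fill the template."""
    params = {
        "theme": rng.choice(THEMES),
        "style": rng.choice(STYLES),
        "feature": rng.choice(FEATURES),
        "num_paragraphs": rng.randint(2, 5),
    }
    return {"params": params, "prompt": TEMPLATE.format(**params)}


def build_prompts(n: int, seed: int = 0) -> list:
    """Build n prompts reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [sample_prompt(rng) for _ in range(n)]


if __name__ == "__main__":
    for item in build_prompts(3):
        print(item["prompt"])
```

Storing the sampled parameters next to each prompt keeps the labels available for later filtering or analysis of the generated stories.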

Code Implementations: 1 repo
