CL · Mar 15

Multilingual TinyStories: A Synthetic Combinatorial Corpus of Indic Children's Stories for Training Small Language Models

arXiv:2603.14563 · 26.4 · h-index: 1

AI Analysis

The dataset provides a foundational resource for multilingual language modeling and transfer learning in the Indic linguistic sphere, addressing the data-scarcity bottleneck in low-resource language development.

The paper tackles the scarcity of high-quality training data for low-resource Indic languages by introducing the Multilingual TinyStories dataset, a synthetically generated collection of 132,942 children's stories across 17 Indian languages. The corpus totals over 93.9 million tokens and is designed for training small language models.

The development of robust language models for low-resource languages is frequently bottlenecked by the scarcity of high-quality, coherent, and domain-appropriate training corpora. In this paper, we introduce the Multilingual TinyStories dataset, a large-scale, synthetically generated collection of children's stories encompassing 17 Indian languages. Designed specifically for the training and evaluation of Small Language Models (SLMs), the corpus provides simple, narrative-driven text strictly localized to native scripts. We detail our hybrid curation pipeline, which leverages the Sarvam-M language model and a novel combinatorial prompt engineering framework for native generation, coupled with the Google Translate API for large-scale cross-lingual expansion. Through strict programmatic filtering, we compiled 132,942 stories and over 93.9 million tokens in our release, serving as a foundational resource for multilingual language modeling and transfer learning in the Indic linguistic sphere.
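
The abstract names two mechanisms without showing them: combinatorial prompt construction and programmatic filtering to native scripts. The Python sketch below illustrates one plausible reading of each. It is not the authors' pipeline: the feature pools, the prompt wording, the function names, and the thresholds (`min_ratio`, `min_chars`) are all assumptions made for illustration, and the script check relies on Unicode character names as a simple heuristic.

```python
import itertools
import unicodedata

# Hypothetical feature pools; the paper's actual lists are not given in the abstract.
CHARACTERS = ["a curious rabbit", "a kind farmer"]
SETTINGS = ["a village fair", "a mango orchard"]
MORALS = ["sharing", "honesty"]


def build_prompts(language):
    """Yield one prompt per (character, setting, moral) combination."""
    for character, setting, moral in itertools.product(CHARACTERS, SETTINGS, MORALS):
        yield (
            f"Write a short children's story in {language} about {character} "
            f"at {setting} that teaches {moral}. Use only simple words."
        )


def script_ratio(text, script_prefix):
    """Fraction of alphabetic characters whose Unicode name starts with script_prefix."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    matches = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return matches / len(letters)


def keep_story(text, script_prefix, min_ratio=0.9, min_chars=20):
    """Keep a story only if it is long enough and mostly in the target script.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    return len(text) >= min_chars and script_ratio(text, script_prefix) >= min_ratio
```

With these small pools, `build_prompts("Hindi")` yields 2 × 2 × 2 = 8 distinct prompts, which is the point of a combinatorial design: prompt variety grows multiplicatively with each added feature pool. A call such as `keep_story(story, "DEVANAGARI")` would then reject a Hindi story that came back mostly in Latin script.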
