CL, AI, Sep 19, 2025

Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research

arXiv:2509.16413v1, 4 citations, h-index: 4, Has Code, EMNLP
Originality: Synthesis-oriented
AI Analysis

The paper addresses the uncertainty researchers face over design choices in small language model development. Its contribution is incremental: it provides a tool rather than a new model or paradigm.

The authors address the lack of systematic methods for developing small language models by introducing Pico, a modular framework for hypothesis-driven research. The result is a lightweight sandbox for testing design changes, plus a suite of baseline models for reproducible experimentation.

Building language models (LMs), especially small and medium ones, remains more art than science. While large LMs often improve by sheer scale, it is still unclear why many design choices work. For small LMs, this uncertainty is more limiting: tight parameter budgets make each decision critical, yet researchers still lack systematic, scientific ways to test and refine new ideas. We introduce Pico, a lightweight, modular framework that enables systematic, hypothesis-driven research for small and medium-scale language model development. Pico consists of two libraries that together provide a practical sandbox where researchers can make targeted changes to a model's architecture or training procedures and directly observe their effects on the model's behavior. To support reproducible experimentation, we also release a suite of baseline models, pico-decoder, trained under standardized conditions and open-sourced for the community. Case studies highlight how Pico can support iterative small LM design and analysis.

