LG · AI · CL · Sep 10, 2025

Open-sci-ref-0.01: open and reproducible reference baselines for language model and dataset comparison

arXiv:2509.09009v2 · 4 citations · h-index: 17
Originality: Synthesis-oriented
AI Analysis

The paper provides standardized baselines that researchers can use to assess and compare language model training methods. The contribution is incremental, since it establishes reference points rather than proposing new methods.

The paper introduces open-sci-ref, a family of dense transformer models trained as reproducible baselines across multiple scales and datasets, establishing reference points for comparing training approaches and revealing that NemoTron-CC HQ consistently outperforms other datasets.

We introduce open-sci-ref, a family of dense transformer models trained as research baselines across multiple model scales (0.13B to 1.7B parameters) and token scales (up to 1T) on 8 recent open reference datasets. Evaluated on various standardized benchmarks, our set of training runs establishes reference points that enable researchers to assess the sanity and quality of alternative training approaches across scales and datasets. Intermediate checkpoints allow the training dynamics to be compared and studied. The established reference baselines allow training procedures to be compared through their scaling trends, aligning them on a common compute axis. Comparison of open reference datasets reveals that training on NemoTron-CC HQ consistently outperforms other reference datasets, followed by DCLM-baseline and FineWeb-Edu. In addition to intermediate training checkpoints, the release includes logs, code, and downstream evaluations to simplify reproduction, standardize comparison, and facilitate future research.
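As a rough illustration of what aligning runs on a common compute axis can look like, the sketch below places two runs on one axis using the widely used rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. This approximation and the helper names `approx_train_flops` and `tokens_at_compute` are assumptions made for illustration; the abstract does not state how the authors compute their compute axis. Only the model sizes (0.13B, 1.7B) and the 1T token limit come from the abstract.

```python
def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training compute for a dense transformer: C ~ 6 * N * D.

    Assumption for illustration only; not taken from the paper.
    """
    return 6.0 * n_params * n_tokens


def tokens_at_compute(flops: float, n_params: float) -> float:
    """Token count at which a model with n_params has spent `flops` of compute."""
    return flops / (6.0 * n_params)


# Smallest and largest model scales named in the abstract, at the 1T-token limit.
small_params = 0.13e9
large_params = 1.7e9
token_budget = 1e12

c_small = approx_train_flops(small_params, token_budget)
c_large = approx_train_flops(large_params, token_budget)

# Point on the larger model's training curve that matches the total compute
# of the smaller model's full run. An intermediate checkpoint near this token
# count is what one would compare against at equal compute.
matched_tokens = tokens_at_compute(c_small, large_params)

print(f"0.13B @ 1T tokens: {c_small:.3e} FLOPs")
print(f"1.7B  @ 1T tokens: {c_large:.3e} FLOPs")
print(f"1.7B matches the 0.13B run's compute after {matched_tokens:.3e} tokens")
```

Under this approximation, a released intermediate checkpoint of a larger model can be compared with the final checkpoint of a smaller one at equal compute, rather than only at equal token counts.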

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
