AIMay 13

What properties of reasoning supervision are associated with improved downstream model quality?

Mikołaj Langner, Dzmitry Pihulski, Jan Eliasz, Michał Rajkowski, Przemysław Kazienko, Maciej Piasecki, Jan Kocoń, Teddy Ferdinan

arXiv:2605.1329061.9

AI Analysis

For practitioners training reasoning models, this provides a scale-aware framework to validate training data without expensive trial-and-error fine-tuning.

This work investigates whether intrinsic data metrics can predict the utility of reasoning datasets before training, finding strong correlations with downstream model performance. The predictors are scale-dependent: smaller models rely on alignment-focused metrics, while larger models benefit from high redundancy.

Validating training data for reasoning models typically requires expensive trial-and-error fine-tuning cycles. In this work, we investigate whether the utility of a reasoning dataset can be reliably predicted prior to training using intrinsic data metrics. We propose a suite of quantitative measures and evaluate their predictive power by fine-tuning 8B and 11B models on semantically distinct variants of a Polish reasoning dataset. Our analysis reveals that these intrinsic metrics demonstrate strong and significant correlations with downstream model performance. Crucially, we find that the predictors of utility are scale-dependent: smaller models rely on alignment-focused metrics to ensure precision, whereas larger models benefit from high redundancy, utilizing verbose traces to solve complex tasks. These findings establish a scale-aware framework for validating reasoning data, enabling practitioners to select effective training sets without the need for exhaustive empirical testing.

View on arXiv PDF

Similar