Towards Label-Free Biological Reasoning Synthetic Dataset Creation via Uncertainty Filtering
The work addresses the bottleneck of costly supervision when training large reasoning models for biological reasoning, though the contribution is incremental because it builds on existing uncertainty metrics.
The paper tackles the expense of ground-truth labels for synthetic chain-of-thought datasets in domains like biology. It proposes uncertainty-based filtering, which uses model confidence metrics to select which sampled reasoning traces to keep. The filtered subset has higher accuracy, and supervised fine-tuning on it outperforms training on unfiltered synthetic data and narrows the gap to ground-truth training.
Synthetic chain-of-thought (CoT) traces are widely used to train large reasoning models (LRMs), improving generalization by providing step-level supervision. Yet most approaches require ground-truth labels to seed or filter these traces - an expensive bottleneck in domains like biology where wet-lab data are scarce. We propose a label-free alternative: uncertainty-based filtering, which uses a model's own confidence - quantified through established uncertainty metrics like self-consistency and predictive perplexity - as a substitute for external labels. We sample multiple reasoning traces and retain only low-uncertainty subsets. Applied to biological perturbation prediction, a domain where wet-lab labels are especially costly, we show that the filtered subset has higher accuracy, and that supervised fine-tuning (SFT) on uncertainty-filtered data outperforms unfiltered synthetic data, narrows the gap to ground-truth training, and surpasses strong LRM baselines. Ablations show that per-class filtering corrects for class-specific uncertainty scales and that hybrid uncertainty metrics yield higher-quality datasets. Our results suggest that model-internal confidence is a powerful signal for efficient reasoning dataset creation, enabling LRMs in domains where supervision is expensive.
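The filtering idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data layout (`answers`, `logprobs`), the `alpha` weighting in the hybrid score, and the `keep_fraction` threshold are all assumptions made for illustration. The sketch computes self-consistency as majority-vote agreement, perplexity from token log-probabilities, and then keeps the lowest-uncertainty examples separately within each pseudo-label class, so that a class with systematically higher uncertainty is not filtered out entirely.

```python
import math
from collections import Counter, defaultdict


def self_consistency(answers):
    """Return the majority answer and the fraction of samples agreeing with it."""
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / len(answers)


def perplexity(token_logprobs):
    """Perplexity of one trace: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def hybrid_uncertainty(answers, trace_logprobs, alpha=0.1):
    """Hypothetical hybrid score; lower means more confident.

    Adds disagreement (1 - agreement) to alpha times the mean log-perplexity
    over the sampled traces. This weighting is an assumption, not the paper's.
    """
    _, agreement = self_consistency(answers)
    mean_log_ppl = sum(math.log(perplexity(lp)) for lp in trace_logprobs) / len(trace_logprobs)
    return (1.0 - agreement) + alpha * mean_log_ppl


def filter_per_class(examples, keep_fraction=0.5, alpha=0.1):
    """Keep the lowest-uncertainty fraction of examples within each pseudo-label.

    Each example is a dict with 'answers' (one final answer per sampled trace)
    and 'logprobs' (one list of token log-probabilities per sampled trace).
    """
    by_class = defaultdict(list)
    for ex in examples:
        pseudo_label, _ = self_consistency(ex["answers"])
        score = hybrid_uncertainty(ex["answers"], ex["logprobs"], alpha)
        by_class[pseudo_label].append((score, ex))

    kept = []
    for scored in by_class.values():
        scored.sort(key=lambda pair: pair[0])  # most confident first
        n_keep = max(1, int(len(scored) * keep_fraction))
        kept.extend(ex for _, ex in scored[:n_keep])
    return kept


# Toy usage: four questions, four sampled traces each, identical log-probs.
lp = [[-0.1, -0.1]] * 4
examples = [
    {"id": "a", "answers": ["up", "up", "up", "up"], "logprobs": lp},
    {"id": "b", "answers": ["up", "up", "down", "up"], "logprobs": lp},
    {"id": "c", "answers": ["down", "down", "down", "down"], "logprobs": lp},
    {"id": "d", "answers": ["down", "up", "down", "none"], "logprobs": lp},
]
kept_ids = sorted(ex["id"] for ex in filter_per_class(examples))
```

In this toy run, one example survives per pseudo-label class, which is the point of per-class filtering: a single global threshold could instead discard every example of whichever class the model is less certain about. No ground-truth label is consulted at any step.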