CL · Apr 1, 2025

Z1: Efficient Test-time Scaling with Code

arXiv:2504.00810v1 · 36 citations · h-index: 28 · EMNLP
Originality: Incremental advance
AI Analysis

This work addresses efficiency issues in LLM reasoning for developers and researchers, though it is incremental as it builds on existing test-time scaling methods.

The paper tackles the high token cost of test-time scaling for LLMs by training models on code reasoning trajectories, which reduces excess thinking tokens while maintaining performance. The resulting model, Z1-7B, matches R1-Distill-Qwen-7B while using about 30% of its average thinking tokens, and it generalizes to broader reasoning tasks, scoring 47.5% on GPQA Diamond.

Large Language Models (LLMs) can achieve enhanced complex problem-solving through test-time compute scaling, yet this often entails longer contexts and high reasoning-token costs. In this paper, we propose an efficient test-time scaling method that trains LLMs on code-related reasoning trajectories, helping them reduce excess thinking tokens while maintaining performance. First, we create Z1-Code-Reasoning-107K, a curated dataset of simple and complex coding problems paired with their short and long solution trajectories. Second, we present a novel Shifted Thinking Window to mitigate overthinking overhead by removing context-delimiting tags (e.g., <think>...</think>) and capping reasoning tokens. Trained with long and short trajectory data and equipped with the Shifted Thinking Window, our model, Z1-7B, demonstrates the ability to adjust its reasoning level to the complexity of problems and exhibits efficient test-time scaling across different reasoning tasks, matching R1-Distill-Qwen-7B performance with about 30% of its average thinking tokens. Notably, fine-tuned with only code trajectories, Z1-7B demonstrates generalization to broader reasoning tasks (47.5% on GPQA Diamond). Our analysis of efficient reasoning elicitation also provides valuable insights for future research.
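The abstract describes the Shifted Thinking Window only at a high level: no <think>...</think> delimiters, and a cap on reasoning tokens. The following is a minimal sketch of what such capped decoding could look like, not the authors' implementation. The function name `capped_generate`, the `step` callback, the budget parameters, and the hint text are all hypothetical; in particular, appending a hint once the budget runs out is an assumption about how the cap might be enforced.

```python
from typing import Callable, List

# Hypothetical hint; the wording actually used by the authors is not given here.
FORCE_ANSWER_HINT = ["<hint: stop thinking and answer>"]


def capped_generate(
    step: Callable[[List[str]], str],
    prompt: List[str],
    max_thinking_tokens: int,
    max_answer_tokens: int,
    eos: str = "<eos>",
) -> List[str]:
    """Decode without thinking delimiters, capping the reasoning budget."""
    context = list(prompt)
    output: List[str] = []

    # Phase 1: free generation, up to the thinking budget.
    for _ in range(max_thinking_tokens):
        token = step(context)
        if token == eos:
            return output  # the model finished on its own within budget
        context.append(token)
        output.append(token)

    # Phase 2: budget exhausted, so append a hint nudging the model to answer.
    context.extend(FORCE_ANSWER_HINT)
    output.extend(FORCE_ANSWER_HINT)
    for _ in range(max_answer_tokens):
        token = step(context)
        if token == eos:
            break
        context.append(token)
        output.append(token)
    return output


def toy_step(context: List[str]) -> str:
    """Stand-in for a model: keeps 'thinking' until it sees the hint."""
    if FORCE_ANSWER_HINT[0] in context:
        return "<eos>" if context[-1] == "42" else "42"
    return "hmm"


result = capped_generate(toy_step, ["question"], max_thinking_tokens=3, max_answer_tokens=5)
print(result)
```

With the toy model, the output contains three "thinking" tokens, then the hint, then the answer. A model that stops early on an easy problem returns before the cap is ever reached, which is the adaptive behaviour the abstract attributes to training on both short and long trajectories.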

Code Implementations: 1 repo