ConPress: Learning Efficient Reasoning from Multi-Question Contextual Pressure
ConPress addresses inference cost for users of large reasoning models: compressed reasoning makes inference faster and cheaper. The contribution is incremental, since it builds on existing chain-of-thought methods.
The paper targets the high inference overhead of large reasoning models. It identifies a self-compression phenomenon, in which models produce shorter reasoning traces when multiple questions are presented together, and proposes ConPress, a self-supervised fine-tuning method that reduces reasoning token usage by 59% on MATH500 and 33% on AIME25 while maintaining competitive accuracy.
Large reasoning models (LRMs) typically solve reasoning-intensive tasks by generating long chain-of-thought (CoT) traces, leading to substantial inference overhead. We identify a reproducible inference-time phenomenon, termed Self-Compression: when multiple independent and answerable questions are presented within a single prompt, the model spontaneously produces shorter reasoning traces for each question. This phenomenon arises from multi-question contextual pressure during generation and consistently manifests across models and benchmarks. Building on this observation, we propose ConPress (Learning from Contextual Pressure), a lightweight self-supervised fine-tuning approach. ConPress constructs multi-question prompts to induce self-compression, samples the resulting model outputs, and parses and filters per-question traces to obtain concise yet correct reasoning trajectories. These trajectories are directly used for supervised fine-tuning, internalizing compressed reasoning behavior in single-question settings without external teachers, manual pruning, or reinforcement learning. With only 8k fine-tuning examples, ConPress reduces reasoning token usage by 59% on MATH500 and 33% on AIME25, while maintaining competitive accuracy.
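The data-construction loop described in the abstract (build a multi-question prompt, sample an output, parse per-question traces, keep the correct ones) might look roughly like the following. This is a minimal sketch, not the paper's implementation: the prompt template, the `### Solution k` / `Final answer:` output format, the exact-match correctness filter, and all function names are assumptions made for illustration, and `generate` stands in for whatever model sampling call is actually used.

```python
import re
from typing import Callable, Dict, List, Optional

# Assumed output format (hypothetical, not from the paper): each solution is
# introduced by "### Solution k" and ends with "Final answer: <value>".
SECTION_RE = re.compile(
    r"### Solution (\d+)\s*\n(.*?)(?=\n### Solution \d+|\Z)", re.S
)
ANSWER_RE = re.compile(r"Final answer:\s*(.+)")


def build_multi_question_prompt(questions: List[str]) -> str:
    """Pack several independent questions into a single prompt."""
    header = (
        "Solve each question independently. For question k, write "
        "'### Solution k', your reasoning, then 'Final answer: <value>'.\n\n"
    )
    body = "\n".join(f"Question {i}: {q}" for i, q in enumerate(questions, start=1))
    return header + body


def parse_traces(output: str) -> Dict[int, str]:
    """Split one multi-question output into per-question reasoning traces."""
    return {int(k): text.strip() for k, text in SECTION_RE.findall(output)}


def extract_answer(trace: str) -> Optional[str]:
    """Return the final answer stated in a trace, or None if it is missing."""
    match = ANSWER_RE.search(trace)
    return match.group(1).strip() if match else None


def collect_sft_examples(
    questions: List[str],
    gold_answers: List[str],
    generate: Callable[[str], str],  # hypothetical model sampling function
    num_samples: int = 1,
) -> List[Dict[str, str]]:
    """Sample multi-question outputs and keep only correct per-question traces.

    Each kept trace is paired with its question alone, so fine-tuning targets
    the single-question setting.
    """
    prompt = build_multi_question_prompt(questions)
    examples = []
    for _ in range(num_samples):
        traces = parse_traces(generate(prompt))
        for i, (question, gold) in enumerate(zip(questions, gold_answers), start=1):
            trace = traces.get(i)
            # Assumed filter: exact string match against the reference answer.
            if trace is not None and extract_answer(trace) == gold:
                examples.append({"prompt": question, "completion": trace})
    return examples
```

The resulting list of prompt and completion pairs could then be passed to any standard supervised fine-tuning pipeline. Pairing each trace with its question in isolation, rather than with the full multi-question prompt, is what would let the shorter reasoning style transfer to single-question inference.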