Breaking the Overscaling Curse: Thinking Parallelism Before Parallel Thinking
This work addresses budget inefficiency in parallel thinking for LLM reasoning, where a single parallelism level is applied to every sample regardless of difficulty, with the aim of making such systems more cost-effective.
The paper tackles the overscaling curse in parallel thinking for LLMs: because samples are heterogeneous, a fixed, high parallelism level wastes budget on samples that need fewer paths. It proposes T2, a lightweight method that estimates the optimal parallelism level for each sample, and reports that T2 significantly reduces cost while maintaining comparable performance.
Parallel thinking enhances LLM reasoning through multi-path sampling and aggregation. In system-level evaluations, a single global parallelism level N is allocated to all samples and is typically set large to maximize overall dataset accuracy. However, because samples are heterogeneous, some can achieve comparable performance with a smaller N' < N, which causes budget redundancy. This incompatibility between system-level efficacy and sample-level efficiency constitutes the overscaling curse. In this paper, we formalize and quantify the overscaling curse, show its universality and severity in practice, and analyze its trigger mechanism. We then propose a lightweight method, T2, that breaks the overscaling curse by using latent representations to estimate the optimal parallelism level for each sample before decoding. Experiments show that T2 significantly reduces cost while maintaining comparable performance, enabling more efficient parallel thinking.
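To make the notion of budget redundancy concrete, the sketch below works through a toy case. It is not the paper's formalization and it is not T2: it assumes an oracle that knows each sample's per-path probability p of producing the correct answer, assumes answers are binary, and assumes aggregation by majority vote with ties counted as half correct. The function names, the tolerance `tol`, and the example values are all hypothetical. For each sample it finds the smallest N' whose expected accuracy is within `tol` of the accuracy at the global level N, then reports the fraction of the global budget that was unnecessary.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a majority vote over n i.i.d. paths is correct.

    Toy assumptions: binary answers, each path correct with probability p,
    and an exact tie counts as half correct.
    """
    total = 0.0
    for k in range(n + 1):
        prob = comb(n, k) * p**k * (1 - p) ** (n - k)
        if 2 * k > n:
            total += prob
        elif 2 * k == n:
            total += 0.5 * prob
    return total


def minimal_parallelism(p: float, n_max: int, tol: float = 0.01) -> int:
    """Smallest N' <= n_max whose accuracy is within tol of the accuracy at n_max."""
    target = majority_accuracy(p, n_max) - tol
    for n in range(1, n_max + 1):
        if majority_accuracy(p, n) >= target:
            return n
    return n_max


def budget_redundancy(ps: list[float], n_max: int, tol: float = 0.01) -> float:
    """Fraction of the global budget (n_max paths per sample) that is not needed."""
    used = sum(minimal_parallelism(p, n_max, tol) for p in ps)
    return 1 - used / (n_max * len(ps))


if __name__ == "__main__":
    # Hypothetical dataset: one easy sample, one coin-flip sample, one hopeless sample.
    ps = [1.0, 0.5, 0.0]
    print(budget_redundancy(ps, n_max=8))  # 0.875: each sample needs only 1 of 8 paths
```

In this toy setting, samples that are always solved, never solved, or stuck at chance gain nothing from extra paths, so only moderately hard samples justify a large N. An oracle of this kind is unavailable in practice, which is why a method such as T2 has to estimate the per-sample level before decoding.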