CL AI Oct 2, 2025

The Disparate Impacts of Speculative Decoding

Jameson Sandler, Ahmet Üstün, Marco Romanelli, Sara Hooker, Ferdinando Fioretto

arXiv:2510.02128v14.92 citations h-index: 18

Originality: Incremental advance

AI Analysis

This work addresses fairness issues in inference optimization for large language models, particularly for underrepresented tasks, but it is incremental because it builds on existing speculative decoding techniques.

The paper analyzes speculative decoding and finds that its speed-up benefits are not uniform across tasks, consistently diminishing for underrepresented tasks, and proposes a mitigation strategy that improves fairness by 12% on average.

The practice of speculative decoding, whereby inference is probabilistically supported by a smaller, cheaper "drafter" model, has become a standard technique for systematically reducing the decoding time of large language models. This paper analyzes speculative decoding through the lens of its potentially disparate speed-up rates across tasks. Crucially, the paper shows that the speed-up gained from speculative decoding is not uniformly distributed across tasks, consistently diminishing for under-fit, and often underrepresented, tasks. To better understand this phenomenon, the paper derives an analysis that quantifies this observed "unfairness" and draws attention to the factors that cause such disparate speed-ups to emerge. Further, guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates the approach across several model pairs, revealing on average a 12% improvement in its fairness metric.
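To make the source of uneven speed-ups concrete, the following minimal sketch uses the standard idealized analysis of speculative decoding, in which each drafted token is accepted independently with probability `alpha`. It is not the paper's fairness metric or mitigation strategy. The function names, the task labels, and the acceptance rates are all hypothetical values chosen for illustration.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model verification pass.

    Assumes each of the `gamma` drafted tokens is accepted independently
    with probability `alpha` (an idealization, not a measured quantity).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    if alpha == 1.0:
        # All drafts accepted, plus one token from the target model.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def idealized_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Idealized wall-clock speed-up over plain autoregressive decoding.

    `cost_ratio` is the drafter's time per token divided by the target
    model's time per verification pass (assumed constant across tasks).
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


if __name__ == "__main__":
    # Hypothetical acceptance rates: the drafter matches the target model
    # less often on a task it fits poorly.
    acceptance_rates = {"well_fit_task": 0.8, "under_fit_task": 0.5}
    for task, alpha in acceptance_rates.items():
        s = idealized_speedup(alpha, gamma=4, cost_ratio=0.05)
        print(f"{task}: alpha={alpha:.2f}, idealized speed-up={s:.2f}x")
```

Under these assumptions, a task with a lower acceptance rate receives a smaller speed-up from the same drafter and target pair, which is the kind of disparity the abstract describes.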

