HARP: Hesitation-Aware Reframing in Transformer Inference Pass
This work addresses efficiency and performance in Transformer inference for language modeling. The contribution is incremental: it builds on existing methods, to which it adds a novel adaptation.
The paper tackles the variable computational demands of large language model inference by introducing HARP, a model-agnostic, training-free method that selectively applies additional computation when the model is uncertain during token generation. It reports performance improvements of up to +5.16% while running inference twice as fast as beam search.
This paper aims to improve the performance of large language models by addressing the variable computational demands of inference steps, where some tokens require more computation than others. We present HARP, a simple modification to the off-the-shelf Transformer forward pass. Drawing on hesitation and the framing effect in decision-making, HARP selectively applies additional computation when the model encounters uncertainty during token generation. Our method mimics human cognitive processes by pausing at difficult decision points and reframing the inputs to obtain a different perspective. Unlike other approaches, HARP is model-agnostic, training-free, and easy to implement. We evaluate our method across various downstream tasks and model sizes, demonstrating performance improvements of up to +5.16%. Notably, HARP achieves these gains while running inference twice as fast as beam search. Simple yet yielding significant gains, HARP provides insight into the potential of adaptive computation for enhancing Transformer-based language models.
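The selective-computation idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how uncertainty is measured, how inputs are reframed, or how the two passes are combined. The sketch assumes Shannon entropy of the next-token distribution as the uncertainty signal, a caller-supplied `reframe` function, and a weighted average of the two logit vectors. The names `harp_step`, `forward`, `reframe`, `threshold`, and `weight` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def harp_step(
    forward: Callable[[Sequence[float]], Sequence[float]],  # hypothetical model call
    inputs: Sequence[float],
    reframe: Callable[[Sequence[float]], Sequence[float]],  # hypothetical perturbation
    threshold: float,  # assumed entropy cutoff, in nats
    weight: float = 0.5,  # assumed mixing weight for the reframed pass
) -> Tuple[List[float], bool]:
    """Return next-token logits and whether the extra pass was triggered."""
    logits = list(forward(inputs))
    # Confident step: keep the ordinary single forward pass.
    if shannon_entropy(softmax(logits)) <= threshold:
        return logits, False
    # Uncertain step ("hesitation"): run a second pass on reframed inputs
    # and blend the two views. The blend rule is an assumption.
    reframed = forward(reframe(inputs))
    combined = [(1.0 - weight) * a + weight * b for a, b in zip(logits, reframed)]
    return combined, True


# Toy usage: the "model" is the identity, so inputs double as logits.
def toy_forward(x: Sequence[float]) -> List[float]:
    return list(x)


def toy_reframe(x: Sequence[float]) -> List[float]:
    return [x[0] + 1.0] + list(x[1:])


confident = harp_step(toy_forward, [10.0, 0.0, 0.0], toy_reframe, threshold=1.0)
uncertain = harp_step(toy_forward, [0.0, 0.0, 0.0], toy_reframe, threshold=1.0)
```

In this sketch the extra forward pass runs only on steps whose entropy exceeds the cutoff, so the added cost scales with how often the model hesitates rather than with sequence length. That is the intuition behind comparing its inference time with beam search, which pays for multiple hypotheses at every step.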