Temporal Guidance for Large Language Models
This work addresses efficiency issues in LLM decoding for researchers and practitioners, offering an incremental improvement over existing contrastive methods.
The paper tackles the computational overhead of contrastive decoding in large language models by proposing Temporal Guidance (TeGu), a method that uses multi-token prediction for self-contrast. Across various models and benchmarks, TeGu achieves significant performance improvements at low additional resource cost.
Contrastive Decoding (CD) improves the generation quality of large language models (LLMs), but it incurs significant additional computational overhead because it requires an auxiliary model. Existing internal self-contrastive decoding methods, such as Decoding by Contrasting Layers (DoLa), focus on discrepancies across different layers, but these methods are notably unstable on small-scale models. In this work, based on the observation that LLMs exhibit local preferences, we propose Temporal Guidance (TeGu), a novel contrastive guidance strategy along the temporal dimension. Our method leverages Multi-Token Prediction (MTP) to construct weaker amateur predictions for model self-contrast. To standardize the implementation of this mechanism, we further introduce a lightweight Conditional MTP Projector (cMTPP), which avoids maintaining the multiple independent networks that other MTP modules require. Across various model series and benchmarks, TeGu achieves significant performance improvements while maintaining low additional memory consumption and computational overhead.
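The abstract does not state the exact TeGu scoring rule, so the following is only a minimal sketch of the generic contrastive decoding idea it builds on, not the authors' implementation. It assumes the common formulation in which expert log-probabilities are amplified and amateur log-probabilities are subtracted, with a plausibility cutoff on the expert distribution. It also assumes the "amateur" logits come from a multi-token-prediction output made at an earlier position, which has therefore seen less context. The function names, `alpha`, and `beta` are hypothetical and chosen for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(expert_logits, amateur_logits, alpha=0.5, beta=0.1):
    """Score each token by contrasting an expert with a weaker amateur.

    expert_logits: next-token logits computed from the full context.
    amateur_logits: logits for the same position from a weaker predictor
        (assumed here: an MTP output produced from an earlier position).
    alpha: contrast strength (hypothetical default).
    beta: plausibility cutoff relative to the expert's most likely token
        (hypothetical default).
    """
    log_p_exp = log_softmax(expert_logits)
    log_p_ama = log_softmax(amateur_logits)
    # Tokens the expert considers implausible are masked out entirely.
    cutoff = max(log_p_exp) + math.log(beta)
    scores = []
    for le, la in zip(log_p_exp, log_p_ama):
        if le < cutoff:
            scores.append(float("-inf"))
        else:
            scores.append((1 + alpha) * le - alpha * la)
    return scores


def pick_token(expert_logits, amateur_logits, alpha=0.5, beta=0.1):
    """Greedy choice under the contrastive scores."""
    scores = contrastive_scores(expert_logits, amateur_logits, alpha, beta)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    expert = [2.0, 1.9, -5.0]   # full-context prediction
    amateur = [2.0, 0.0, 3.0]   # weaker, less-informed prediction
    print(pick_token(expert, amateur))
```

In this toy case the expert barely prefers token 0 over token 1, while the amateur strongly prefers token 0 over token 1, so the contrast shifts the choice to token 1. Token 2 is favored by the amateur but masked by the plausibility cutoff, and setting `alpha=0` reduces the rule to ordinary greedy decoding on the expert.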