CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding
This work addresses storage efficiency for speculative decoding in large language models, offering an incremental improvement over existing methods.
The paper reduces the storage required by retrieval-based speculative decoding by compacting the datastore: it matches REST's accepted token length with 10.6-13.5x less storage and, at equal storage, achieves a 16.5-17.1% higher acceptance length on the HumanEval and MT Bench benchmarks.
We present CREST (Compact Retrieval-Based Speculative Decoding), a redesign of REST that allows it to be effectively "compacted". REST is a drafting technique for speculative decoding that looks up the n most recent tokens generated by the target LLM in a datastore and retrieves exact n-gram matches. The key idea of CREST is to store only a subset of the smallest and most common n-grams in the datastore, with the aim of achieving comparable performance with less storage space. We find that storing a subset of n-grams both reduces storage space and improves performance. On the HumanEval and MT Bench benchmarks, CREST matches REST's accepted token length with 10.6-13.5x less storage space and achieves a 16.5-17.1% higher acceptance length than REST when using the same storage space.
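To make the compaction idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "smallest" means n-gram contexts of at most `max_n` tokens and that "most common" means the `top_k` contexts with the highest occurrence count; the function names (`build_datastore`, `compact`, `draft`), the parameters, and the use of a plain dictionary as the index are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def build_datastore(corpus, max_n=3, draft_len=2):
    """Map every context of 1..max_n tokens to a count of the tokens that follow it."""
    store = defaultdict(Counter)
    for tokens in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n):
                context = tuple(tokens[i:i + n])
                continuation = tuple(tokens[i + n:i + n + draft_len])
                store[context][continuation] += 1
    return store


def compact(store, max_n, top_k):
    """Assumed compaction rule: keep the top_k most frequent contexts of length <= max_n."""
    freq = {ctx: sum(c.values()) for ctx, c in store.items() if len(ctx) <= max_n}
    # Ties are broken by preferring shorter contexts, then lexicographic order.
    kept = sorted(freq, key=lambda ctx: (-freq[ctx], len(ctx), ctx))[:top_k]
    return {ctx: store[ctx] for ctx in kept}


def draft(store, generated, max_n):
    """Try the longest suffix of the generated tokens first; return its most common continuation."""
    for n in range(min(max_n, len(generated)), 0, -1):
        context = tuple(generated[-n:])
        if context in store:
            return list(store[context].most_common(1)[0][0])
    return []  # no match: the target model decodes without a draft


# Toy corpus (hypothetical data, for illustration only).
corpus = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "b", "c", "d"]]
full = build_datastore(corpus, max_n=2, draft_len=1)
small = compact(full, max_n=2, top_k=3)
proposal = draft(small, ["a", "b", "c"], max_n=2)
```

In this toy setup the full datastore holds seven contexts and the compacted one holds three, yet the lookup for the suffix `("b", "c")` still succeeds. The draft tokens would then be verified by the target LLM as in any speculative decoding scheme; that verification step is outside the scope of this sketch.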