LG, CL | Jun 16, 2024

Optimized Speculative Sampling for GPU Hardware Accelerators

Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet

arXiv:2406.11016v223.126 citations | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the bottleneck of sampling speed for large language models on GPUs, offering incremental optimizations for faster inference in applications like speech recognition and summarization.

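The paper tackles slow speculative sampling on GPU hardware accelerators in two ways. First, it computes the intermediate matrices concurrently across GPU threads, which improves profiling time by 6-13% relative to the baseline implementation with no accuracy loss. Second, it approximates softmax-parameterized probability distributions with sigmoid, which improves profiling time by 37-94% at the cost of a minor accuracy decline.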

In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks. This results in profiling time improvements ranging from 6% to 13% relative to the baseline implementation, without compromising accuracy. To further accelerate speculative sampling, probability distributions parameterized by softmax are approximated by sigmoid. This approximation approach results in significantly greater relative improvements in profiling time, ranging from 37% to 94%, with a minor decline in accuracy. We conduct extensive experiments on both automatic speech recognition and summarization tasks to validate the effectiveness of our optimization methods.
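The abstract does not spell out which intermediate quantities are computed concurrently, so the following is only a minimal sketch based on the standard speculative sampling acceptance rule, not the authors' GPU kernels. It assumes a target distribution `p` and a draft distribution `q` over the same vocabulary. The function names (`softmax`, `sigmoid`, `intermediates`, `speculative_step`) are hypothetical. The point it illustrates is that the acceptance ratio and the residual are elementwise in the vocabulary dimension, so each entry could be handled by a separate thread, whereas softmax needs a reduction over the whole vocabulary. The `sigmoid` function here is a plain elementwise stand-in and omits any scaling the paper may apply.

```python
import math
import random


def softmax(logits):
    """Normalized distribution; needs a max and a sum over the whole vocabulary."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Elementwise alternative (assumption): no reduction across the vocabulary."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def intermediates(p, q):
    """Per-token quantities; entry i depends only on p[i] and q[i]."""
    ratio = [min(1.0, pi / qi) if qi > 0 else 1.0 for pi, qi in zip(p, q)]
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return ratio, residual


def speculative_step(p, q, draft_token, rng):
    """Accept the draft token with probability min(1, p/q), else resample.

    Returns (token, accepted).
    """
    ratio, residual = intermediates(p, q)
    if rng.random() < ratio[draft_token]:
        return draft_token, True
    # Rejected: resample from the residual max(0, p - q); rng.choices
    # normalizes the weights. Fall back to p if the residual is all zero,
    # which can happen when the inputs are not normalized.
    weights = residual if sum(residual) > 0 else p
    token = rng.choices(range(len(p)), weights=weights, k=1)[0]
    return token, False


if __name__ == "__main__":
    rng = random.Random(0)
    p = softmax([2.0, 1.0, 0.1])   # target model distribution
    q = softmax([1.0, 1.0, 1.0])   # draft model distribution (uniform)
    draft = rng.choices(range(3), weights=q, k=1)[0]
    print(speculative_step(p, q, draft, rng))
```

On a GPU, the two list comprehensions in `intermediates` correspond to work that can be split across threads within a block, which is the kind of concurrency the abstract describes.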
