Model-free Speculative Decoding for Transformer-based ASR with Token Map Drafting
This work addresses the deployment of ASR on CPU-based and resource-constrained devices by enabling efficient speculative decoding without hardware accelerators. The contribution is incremental, since it builds on existing speculative decoding methods.
The paper tackles the computational expense of autoregressive decoding in transformer-based ASR systems such as Whisper. It proposes Token Map Drafting, a model-free speculative decoding technique that uses a precomputed n-gram token map to accelerate inference without accuracy loss, achieving speed-ups of 1.27x on the CI-AVSR dataset and 1.37x on an internal dataset.
End-to-end automatic speech recognition (ASR) systems based on transformer architectures, such as Whisper, offer high transcription accuracy and robustness. However, their autoregressive decoding is computationally expensive, which limits deployment on CPU-based and resource-constrained devices. Speculative decoding (SD) mitigates this issue by using a smaller draft model to propose candidate tokens, which are then verified by the main model. Running a separate draft model, however, is impractical on devices that lack hardware accelerators such as GPUs. To address this, we propose \emph{Token Map Drafting}, a model-free SD technique that eliminates the need for a separate draft model. Instead, we leverage a precomputed n-gram token map derived from domain-specific training data, enabling efficient speculative decoding with minimal overhead. Our method significantly accelerates ASR inference in structured, low-perplexity domains without sacrificing transcription accuracy. Experimental results demonstrate decoding speed-ups of $1.27\times$ on the CI-AVSR dataset and $1.37\times$ on our internal dataset without degrading recognition accuracy. Additionally, our approach achieves a $10\%$ absolute improvement in decoding speed over the Distill-spec baseline running on CPU, highlighting its effectiveness for on-device ASR applications.
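The drafting idea described above can be illustrated with a minimal sketch. It is one plausible reading, not the authors' implementation: the abstract does not specify the map layout, the context length, the draft length, or the acceptance rule. The names `build_token_map`, `speculative_decode`, `context_len`, `draft_len`, and the `target_next` callback (a stand-in for the main model's greedy next-token prediction) are all hypothetical. Verification here is greedy, so the output is identical to plain greedy decoding by construction.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

TokenMap = Dict[Tuple[int, ...], List[int]]


def build_token_map(corpus: Sequence[Sequence[int]],
                    context_len: int = 2,
                    draft_len: int = 3) -> TokenMap:
    """Map each context n-gram to its most frequent continuation (assumed layout)."""
    counts: Dict[Tuple[int, ...], Counter] = defaultdict(Counter)
    for seq in corpus:
        for i in range(len(seq) - context_len):
            ctx = tuple(seq[i:i + context_len])
            cont = tuple(seq[i + context_len:i + context_len + draft_len])
            counts[ctx][cont] += 1
    return {ctx: list(c.most_common(1)[0][0]) for ctx, c in counts.items()}


def speculative_decode(prefix: Sequence[int],
                       token_map: TokenMap,
                       target_next: Callable[[List[int]], int],
                       eos: int,
                       context_len: int = 2,
                       max_len: int = 50) -> Tuple[List[int], int]:
    """Greedy decoding with map-based drafts.

    Returns the decoded tokens and the number of verification steps. In a real
    system each step would be one batched forward pass of the main model over
    all draft positions; this sketch simulates that with per-position calls.
    """
    out = list(prefix)
    steps = 0
    while len(out) < max_len and (not out or out[-1] != eos):
        draft = token_map.get(tuple(out[-context_len:]), [])
        steps += 1
        for tok in draft:
            expected = target_next(out)
            # Always keep the main model's token, so accuracy is unchanged.
            out.append(expected)
            if expected != tok or expected == eos:
                break  # first mismatch (or end of sequence) ends this step
        else:
            # Every draft token matched (or there was no draft): the same
            # pass also yields one extra token from the main model.
            out.append(target_next(out))
    return out[:max_len], steps
```

When the map's continuation matches what the main model would have produced, several tokens are accepted per verification step; when it does not, the step degrades to ordinary one-token decoding. This is consistent with the abstract's observation that the gains appear in structured, low-perplexity domains, where n-gram continuations are highly predictable.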