CLAIDCLGPFOct 5, 2020

A Streaming Approach For Efficient Batched Beam Search

arXiv:2010.02164v3993 citations
Originality Incremental advance
AI Analysis

This work addresses computational inefficiency in beam search for machine translation and other NLP tasks, offering practical speed improvements for practitioners.

The paper tackles the problem of inefficient GPU utilization during variable-length decoding by proposing a streaming batching strategy that refills batches when candidates terminate or are pruned. The method reduces runtime by up to 71% compared to fixed-width beam search and 17% compared to variable-width baselines while maintaining BLEU scores.

We propose an efficient batching strategy for variable-length decoding on GPU architectures. During decoding, when candidates terminate or are pruned according to heuristics, our streaming approach periodically "refills" the batch before proceeding with a selected subset of candidates. We apply our method to variable-width beam search on a state-of-the-art machine translation model. Our method decreases runtime by up to 71% compared to a fixed-width beam search baseline and 17% compared to a variable-width baseline, while matching baselines' BLEU. Finally, experiments show that our method can speed up decoding in other domains, such as semantic and syntactic parsing.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes