E-BATCH: Energy-Efficient and High-Throughput RNN Batching
This work addresses energy and throughput bottlenecks in RNN accelerators for applications such as speech recognition and NLP, and represents a strong incremental improvement over existing batching techniques.
The paper tackles low hardware utilization in RNN inference, which stems from data dependencies across time-steps and from padding inefficiencies in batching. It proposes E-BATCH, a scheme that, relative to state-of-the-art methods, improves throughput by 1.8x on E-PUR and 2.1x on TPU, and energy efficiency by 3.6x on E-PUR and 1.6x on TPU.
Recurrent Neural Network (RNN) inference exhibits low hardware utilization due to the strict data dependencies across time-steps. Batching multiple requests can increase throughput. However, RNN batching requires a large amount of padding, since the batched input sequences may differ widely in length. Schemes that dynamically update the batch every few time-steps avoid padding, but they require executing different RNN layers in a short timespan, which decreases energy efficiency. Hence, we propose E-BATCH, a low-latency and energy-efficient batching scheme tailored to RNN accelerators. It consists of a runtime system and effective hardware support. The runtime concatenates multiple sequences to create large batches, resulting in substantial energy savings. Furthermore, the accelerator notifies the runtime when the evaluation of a sequence is done, so that a new sequence can be immediately added to the batch, which largely reduces the amount of padding. E-BATCH dynamically controls the number of time-steps evaluated per batch to achieve the best trade-off between latency and energy efficiency for the given hardware platform. We evaluate E-BATCH on top of E-PUR and TPU. On E-PUR, E-BATCH improves throughput by 1.8x and energy efficiency by 3.6x, whereas on TPU it improves throughput by 2.1x and energy efficiency by 1.6x, over the state-of-the-art.
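
The padding argument in the abstract can be illustrated with a toy model. The sketch below is not the E-BATCH runtime or its hardware interface: it only counts idle (padded) slot-steps under two simplified policies, assuming all sequences are queued up front, every slot advances one time-step per step, and a single layer is evaluated. The function names `padded_batching` and `refill_on_completion` and the sample lengths are hypothetical, chosen for illustration.

```python
import heapq
from typing import List, Tuple


def padded_batching(lengths: List[int], batch_size: int) -> Tuple[int, int]:
    """Conventional batching: every batch is padded to its longest sequence.

    Returns (time_steps, padded_slot_steps).
    """
    steps = 0
    padding = 0
    for i in range(0, len(lengths), batch_size):
        group = lengths[i:i + batch_size]
        longest = max(group)
        steps += longest
        # Shorter sequences (and any unfilled slots) sit idle until the
        # longest sequence in the batch finishes.
        padding += longest * batch_size - sum(group)
    return steps, padding


def refill_on_completion(lengths: List[int], batch_size: int) -> Tuple[int, int]:
    """Simplified model of refilling a slot as soon as its sequence ends.

    Each slot is represented by the time-step at which it becomes free;
    the next queued sequence is appended to the slot that frees up first.
    Returns (time_steps, padded_slot_steps).
    """
    free_at = [0] * batch_size
    heapq.heapify(free_at)
    for n in lengths:
        t = heapq.heappop(free_at)      # earliest slot to finish
        heapq.heappush(free_at, t + n)  # concatenate the next sequence
    steps = max(free_at)
    # Padding only remains at the tail, once the queue is empty.
    padding = sum(steps - t for t in free_at)
    return steps, padding


if __name__ == "__main__":
    lengths = [10, 2, 3, 9, 4, 1, 8, 2]  # hypothetical sequence lengths
    print(padded_batching(lengths, batch_size=4))       # (18, 33)
    print(refill_on_completion(lengths, batch_size=4))  # (12, 9)
```

For these sample lengths, padding to the longest sequence takes 18 time-steps with 33 idle slot-steps, while refilling slots on completion takes 12 time-steps with 9 idle slot-steps. The model says nothing about energy or about how many time-steps E-BATCH evaluates per batch, which the abstract describes as a platform-dependent trade-off.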