AS CL LG · Nov 22, 2023

Turbocharge Speech Understanding with Pilot Inference

arXiv:2311.17065v3 · 4 citations · h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses the problem of efficient speech understanding for edge computing, though it is incremental as it builds on known hybrid approaches.

The paper accelerates speech understanding on resource-constrained edge devices by introducing techniques such as pilot inference, achieving a 2x reduction in end-to-end latency and a 2x reduction in offloading needs while maintaining state-of-the-art accuracy.

Modern speech understanding (SU) runs a sophisticated pipeline: as it ingests streaming voice input, the pipeline repeatedly executes encoder-decoder deep neural networks; in doing so, it generates tentative outputs (called hypotheses) and periodically scores them. This paper sets out to accelerate SU on resource-constrained edge devices. It takes a hybrid approach: speed up on-device execution, and offload inputs that are beyond the device's capacity. While the approach is well known, we address SU's unique challenges with novel techniques: (1) late contextualization, which executes a model's attentive encoder in parallel with input ingestion; (2) pilot inference, which mitigates the SU pipeline's temporal load imbalance; (3) autoregression offramps, which evaluate offloading decisions based on pilot inferences and hypotheses. Our techniques are compatible with existing speech models, pipelines, and frameworks; they can be applied independently or in combination. Our prototype, called PASU, is tested on Arm platforms with 6-8 cores: it delivers SOTA accuracy, reduces end-to-end latency by 2x, and reduces offloading needs by 2x.
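The abstract does not spell out how an offramp reaches its decision, so the following is only a minimal sketch of the general idea of deciding whether to offload from the hypotheses produced by a pilot inference. The `Hypothesis` type, the length-normalized confidence score, and the `threshold` value are all assumptions made for illustration; none of them is taken from PASU.

```python
import math
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """A tentative transcript with its decoder score (hypothetical structure)."""
    text: str
    log_prob: float   # total log-probability assigned by the decoder
    num_tokens: int   # number of decoded tokens


def confidence(hyps):
    """Average per-token probability of the best-scoring hypothesis.

    Length normalization is an assumed heuristic, not the paper's metric.
    """
    best = max(hyps, key=lambda h: h.log_prob / max(h.num_tokens, 1))
    return math.exp(best.log_prob / max(best.num_tokens, 1))


def should_offload(pilot_hyps, threshold=0.6):
    """Offload when the pilot inference's best hypothesis looks unreliable.

    The threshold of 0.6 is an arbitrary illustrative value.
    """
    return confidence(pilot_hyps) < threshold


if __name__ == "__main__":
    easy = [Hypothesis("turn on the light", 4 * math.log(0.9), 4)]
    hard = [Hypothesis("turn on the flight", 4 * math.log(0.3), 4)]
    print(should_offload(easy), should_offload(hard))
```

In this toy setup, an input whose pilot hypotheses score confidently stays on the device, while a low-confidence input is sent to the cloud; a real system would presumably weigh latency and device load as well.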
