Adaptive Edge-Cloud Inference for Speech-to-Action Systems Using ASR and Large Language Models
This paper addresses latency, connectivity, and privacy challenges in voice-controlled IoT systems, though the contribution appears incremental: it combines existing ASR and LLM components with adaptive routing.
The paper tackles the trade-off between cloud-based and edge-based speech-to-action systems by presenting ASTA, an adaptive solution that dynamically routes voice commands between edge and cloud inference based on real-time system metrics. Experimental results on 80 spoken commands show ASTA successfully routes all inputs, achieving 62.5% ASR accuracy and generating executable commands without repair for 47.5% of inputs.
Voice-based interaction has emerged as a natural and intuitive modality for controlling IoT devices. However, speech-driven edge devices face a fundamental trade-off between cloud-based solutions, which offer stronger language understanding capabilities at the cost of latency, connectivity dependence, and privacy concerns, and edge-based solutions, which provide low latency and improved privacy but are limited by computational constraints. This paper presents ASTA, an adaptive speech-to-action solution that dynamically routes voice commands between edge and cloud inference to balance performance and system resource utilization. ASTA integrates on-device automatic speech recognition and lightweight offline language-model inference with cloud-based LLM processing, guided by real-time system metrics such as CPU workload, device temperature, and network latency. A metric-aware routing mechanism selects the inference path at runtime, while a rule-based command validation and repair component ensures successful end-to-end command execution. We implemented our solution on an NVIDIA Jetson-based edge platform and evaluated it using a diverse dataset of 80 spoken commands. Experimental results show that ASTA successfully routes all input commands for execution, achieving a balanced distribution between online and offline inference. The system attains an ASR accuracy of 62.5% and generates executable commands without repair for only 47.5% of inputs, highlighting the importance of the repair mechanism in improving robustness. These results suggest that adaptive edge-cloud orchestration is a viable approach for resilient and resource-aware voice-controlled IoT systems.
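The abstract names two mechanisms, metric-aware routing and rule-based validation and repair, without giving their details. The following is a minimal illustrative sketch of how such components could look, not the authors' implementation. The thresholds, the policy of preferring the edge path when the device is healthy, the command schema, and all names (`Metrics`, `choose_path`, `ALLOWED`, `SYNONYMS`, `validate`, `repair`) are assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metrics:
    cpu_load: float                      # fraction of CPU in use, 0.0 to 1.0
    temperature_c: float                 # device temperature in Celsius
    network_latency_ms: Optional[float]  # None means no connectivity


# Hypothetical thresholds; the abstract does not state the actual values.
CPU_LIMIT = 0.80
TEMP_LIMIT_C = 70.0
LATENCY_LIMIT_MS = 200.0


def choose_path(m: Metrics) -> str:
    """Return "edge" or "cloud" for the next command (assumed policy)."""
    network_ok = (
        m.network_latency_ms is not None
        and m.network_latency_ms <= LATENCY_LIMIT_MS
    )
    if not network_ok:
        return "edge"  # the cloud path is unusable, so stay on-device
    device_stressed = m.cpu_load >= CPU_LIMIT or m.temperature_c >= TEMP_LIMIT_C
    # Offload only when the device is under pressure and the network is usable.
    return "cloud" if device_stressed else "edge"


# Hypothetical command schema: each device maps to its permitted actions.
ALLOWED = {"light": {"on", "off"}, "fan": {"on", "off"}}
# Hypothetical rewrite rules applied during repair.
SYNONYMS = {"lamp": "light", "lights": "light", "turn_on": "on", "turn_off": "off"}


def validate(cmd: dict) -> bool:
    """A command is executable if its action is permitted for its device."""
    device, action = cmd.get("device"), cmd.get("action")
    if not isinstance(device, str) or not isinstance(action, str):
        return False
    return action in ALLOWED.get(device, set())


def repair(cmd: dict) -> Optional[dict]:
    """Normalize fields and apply synonym rules; return None if still invalid."""
    def norm(value) -> str:
        token = str(value or "").strip().lower().replace(" ", "_")
        return SYNONYMS.get(token, token)

    fixed = {"device": norm(cmd.get("device")), "action": norm(cmd.get("action"))}
    return fixed if validate(fixed) else None
```

In this sketch, a language-model output such as `{"device": "Lamp", "action": "Turn On"}` fails `validate` but is rewritten by `repair` into an executable command, while an unknown device yields `None`. This mirrors the role the abstract attributes to the repair component when only 47.5% of inputs are executable without it.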