RO, AI | Jan 21

Vision-Language Models on the Edge for Real-Time Robotic Perception

Sarat Ahmad, Maryam Hafeez, Syed Ali Raza Zaidi

arXiv:2601.14921v1 | 5.62 citations | h-index: 5

Originality: Synthesis-oriented

AI Analysis

This work addresses latency and resource constraints in robotic perception through edge computing, but it is incremental: it applies existing models to a new deployment scenario.

This work tackles the deployment of Vision-Language Models (VLMs) for real-time robotic perception by testing them on edge infrastructure. Edge deployment preserved near-cloud accuracy while reducing end-to-end latency by 5%, and a compact model achieved sub-second responsiveness at the cost of accuracy.

Vision-Language Models (VLMs) enable multimodal reasoning for robotic perception and interaction, but their deployment in real-world systems remains constrained by latency, limited onboard resources, and privacy risks of cloud offloading. Edge intelligence within 6G, particularly Open RAN and Multi-access Edge Computing (MEC), offers a pathway to address these challenges by bringing computation closer to the data source. This work investigates the deployment of VLMs on ORAN/MEC infrastructure using the Unitree G1 humanoid robot as an embodied testbed. We design a WebRTC-based pipeline that streams multimodal data to an edge node and evaluate LLaMA-3.2-11B-Vision-Instruct deployed at the edge versus in the cloud under real-time conditions. Our results show that edge deployment preserves near-cloud accuracy while reducing end-to-end latency by 5%. We further evaluate Qwen2-VL-2B-Instruct, a compact model optimized for resource-constrained environments, which achieves sub-second responsiveness, cutting latency by more than half but at the cost of accuracy.
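The latency figures above are relative reductions in end-to-end latency. As a minimal sketch of how such a figure could be computed from timed requests, assuming mean latency is the statistic being compared (the abstract does not say which statistic the authors use), the helper names `time_call_ms` and `relative_reduction` and the sample values below are hypothetical and are not taken from the paper:

```python
import statistics
import time
from typing import Callable, Sequence


def time_call_ms(fn: Callable[[], object]) -> float:
    """Return the wall-clock duration of one call to fn, in milliseconds."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000.0


def relative_reduction(baseline_ms: Sequence[float],
                       candidate_ms: Sequence[float]) -> float:
    """Fractional drop in mean latency of candidate relative to baseline.

    Assumption: mean latency is the compared statistic; a median or a
    tail percentile could be substituted in the same formula.
    """
    base = statistics.mean(baseline_ms)
    cand = statistics.mean(candidate_ms)
    if base <= 0:
        raise ValueError("baseline mean latency must be positive")
    return (base - cand) / base


# Placeholder samples for illustration only; NOT measurements from the paper.
cloud_ms = [1000, 1200, 800]
edge_ms = [700, 800, 900]
print(f"relative reduction: {relative_reduction(cloud_ms, edge_ms):.1%}")
```

In a real comparison, `time_call_ms` would wrap one full request (frame capture, transport, inference, and response) so that the measured interval is end-to-end rather than inference-only.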

