RO · AI · Jan 21

Vision-Language Models on the Edge for Real-Time Robotic Perception

arXiv:2601.14921v1 · 2 citations · h-index: 5
Originality: Synthesis-oriented
AI Analysis

This paper addresses latency and resource constraints for robotic perception in edge computing, but its contribution is incremental: it applies existing models to a new deployment scenario.

This work tackled the deployment of Vision-Language Models (VLMs) for real-time robotic perception by testing them on edge infrastructure, showing that edge deployment preserved near-cloud accuracy with a 5% latency reduction and that a compact model achieved sub-second responsiveness but with accuracy trade-offs.

Vision-Language Models (VLMs) enable multimodal reasoning for robotic perception and interaction, but their deployment in real-world systems remains constrained by latency, limited onboard resources, and privacy risks of cloud offloading. Edge intelligence within 6G, particularly Open RAN and Multi-access Edge Computing (MEC), offers a pathway to address these challenges by bringing computation closer to the data source. This work investigates the deployment of VLMs on ORAN/MEC infrastructure using the Unitree G1 humanoid robot as an embodied testbed. We design a WebRTC-based pipeline that streams multimodal data to an edge node and evaluate LLaMA-3.2-11B-Vision-Instruct deployed at the edge versus in the cloud under real-time conditions. Our results show that edge deployment preserves near-cloud accuracy while reducing end-to-end latency by 5%. We further evaluate Qwen2-VL-2B-Instruct, a compact model optimized for resource-constrained environments, which achieves sub-second responsiveness, cutting latency by more than half but at the cost of accuracy.
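
The abstract reports its latency results as relative reductions ("by 5%", "by more than half"). As a rough illustration of how such a figure could be computed from per-request timings, here is a minimal sketch. It assumes the median is the summary statistic, which the abstract does not state. The function name `relative_latency_reduction` and the sample timings are hypothetical placeholders, not measurements from the paper.

```python
from statistics import median


def relative_latency_reduction(baseline_ms, candidate_ms):
    """Fractional reduction in median latency of candidate vs. baseline.

    Assumption: the median is used as the summary statistic; the paper's
    abstract does not say which statistic the authors used.
    """
    base = median(baseline_ms)
    cand = median(candidate_ms)
    if base <= 0:
        raise ValueError("baseline latency must be positive")
    return (base - cand) / base


# Hypothetical placeholder timings in milliseconds (NOT data from the paper),
# chosen only so the arithmetic is easy to follow.
cloud_ms = [1000.0, 1020.0, 980.0]
edge_ms = [950.0, 969.0, 931.0]

reduction = relative_latency_reduction(cloud_ms, edge_ms)
print(f"relative latency reduction: {reduction:.1%}")
```

A positive value means the candidate deployment is faster than the baseline, and a value above 0.5 would correspond to "cutting latency by more than half".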

