ROI-Driven Foveated Attention for Unified Egocentric Representations in Vision-Language-Action Systems
This addresses data collection and cross-embodiment issues in vision-language-action systems for robotics, though it is incremental as it builds on existing methods with engineering improvements.
The paper tackles the high cost and poor transfer of data in embodied AI systems by proposing an ROI-driven workflow that projects end-effector poses into a camera to create hand-centric regions, enabling data reuse across robots and improving alignment.
The development of embodied AI systems is increasingly constrained by the availability and structure of physical interaction data. Despite recent advances in vision-language-action (VLA) models, current pipelines suffer from high data collection cost, limited cross-embodiment alignment, and poor transfer from internet-scale visual data to robot control. We propose a region-of-interest (ROI) driven engineering workflow that introduces an egocentric, geometry-grounded data representation. By projecting end-effector poses via forward kinematics (FK) into a single external camera, we derive movement-aligned hand-centric ROIs without requiring wrist-mounted cameras or multi-view systems. Unlike directly downsampling the full frame, ROI is cropped from the original image before resizing, preserving high local information density for contact-critical regions while retaining global context. We present a reproducible pipeline covering calibration, synchronization, ROI generation, deterministic boundary handling, and metadata governance. The resulting representation is embodiment-aligned and viewpoint-normalized, enabling data reuse across heterogeneous robots. We argue that egocentric ROI serves as a practical data abstraction for scalable collection and cross-embodiment learning, bridging internet-scale perception and robot-specific control.