CV LG RO | Sep 24, 2025

Large Pre-Trained Models for Bimanual Manipulation in 3D

Hanna Yurchyk, Wei-Di Chang, Gregory Dudek, David Meger

arXiv:2509.20579v16.21 citations | h-index: 28 | Humanoids

Originality: Incremental advance

AI Analysis

This work addresses bimanual robotic manipulation and offers an incremental improvement over existing methods.

The paper tackles bimanual robotic manipulation by lifting attention maps from a pre-trained Vision Transformer (DINOv2) into voxel representations. When these voxel-level cues are added to a state-of-the-art voxel-based policy, the result is an average absolute improvement of 8.2% and a relative gain of 21.9% across tasks in the RLBench bimanual benchmark.
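The two reported figures measure different things: the absolute improvement is a difference in percentage points, while the relative gain is that difference divided by the baseline. As a rough sanity check, the sketch below backs out the baseline they would imply together. It assumes both figures are computed from the same averaged success rates, which the summary does not confirm (a relative gain averaged per task would not satisfy this identity), and the function name `implied_rates` is hypothetical.

```python
def implied_rates(abs_gain_pts: float, rel_gain_pct: float) -> tuple[float, float]:
    """Back out baseline and improved rates (in %) from the two gains.

    Assumes rel_gain = abs_gain / baseline on the same averaged rates;
    this is an assumption, not something the source states.
    """
    baseline = abs_gain_pts / (rel_gain_pct / 100.0)
    improved = baseline + abs_gain_pts
    return baseline, improved


baseline, improved = implied_rates(8.2, 21.9)
print(f"implied baseline: {baseline:.1f}%, implied improved: {improved:.1f}%")
```

Under that assumption the figures correspond to a baseline of roughly 37.4% and an improved rate of roughly 45.6%. These are inferred values, not numbers reported by the authors.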

We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.
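The lifting step described in the abstract can be illustrated with a minimal sketch. It assumes a pinhole camera with known intrinsics `K` and a camera-to-world transform, a metric depth image aligned with the RGB image, and a saliency map already extracted from the ViT and resized to the image resolution. Averaging the scores that fall into each voxel is an illustrative choice; the abstract does not specify the aggregation rule, and the function name and parameters here are hypothetical rather than the authors' implementation.

```python
import numpy as np


def lift_saliency_to_voxels(saliency, depth, K, cam_to_world,
                            bounds_min, bounds_max, grid_size):
    """Scatter per-pixel saliency into a cubic voxel grid (mean per voxel)."""
    h, w = depth.shape
    # Pixel coordinates: v indexes rows, u indexes columns.
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    z = depth.reshape(-1).astype(float)
    # Back-project each pixel into the camera frame (pinhole model).
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # (N, 4)
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]

    # Map world coordinates to integer voxel indices inside the workspace.
    bounds_min = np.asarray(bounds_min, dtype=float)
    bounds_max = np.asarray(bounds_max, dtype=float)
    voxel_size = (bounds_max - bounds_min) / grid_size
    idx = np.floor((pts_world - bounds_min) / voxel_size).astype(int)
    valid = (z > 0) & np.all((idx >= 0) & (idx < grid_size), axis=1)
    idx = idx[valid]
    sal = saliency.reshape(-1)[valid]

    # Average the saliency of all points that land in the same voxel.
    flat = np.ravel_multi_index(tuple(idx.T), (grid_size,) * 3)
    n_voxels = grid_size ** 3
    sums = np.bincount(flat, weights=sal, minlength=n_voxels)
    counts = np.bincount(flat, minlength=n_voxels)
    grid = np.zeros(n_voxels)
    occupied = counts > 0
    grid[occupied] = sums[occupied] / counts[occupied]
    return grid.reshape(grid_size, grid_size, grid_size)


# Toy usage: a 2x2 image at unit depth, identity camera pose.
saliency = np.array([[0.1, 0.2], [0.3, 0.4]])
depth = np.ones((2, 2))
K = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
grid = lift_saliency_to_voxels(saliency, depth, K, np.eye(4),
                               bounds_min=(-1, -1, 0), bounds_max=(1, 1, 2),
                               grid_size=2)
print(grid.shape)
```

In a multi-camera setup, the same routine could be run per view and the resulting grids combined. The resulting per-voxel score would then be appended as an extra feature channel alongside the occupancy and color features that a voxel-based policy consumes.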
