CV · LG · RO · Sep 24, 2025

Large Pre-Trained Models for Bimanual Manipulation in 3D

arXiv:2509.20579v11 citations · h-index: 28 · Humanoids
Originality: Incremental advance
AI Analysis

This work addresses bimanual robotic manipulation and reports incremental improvements over an existing voxel-based policy.

The paper tackles the problem of bimanual robotic manipulation by integrating attention maps from a pre-trained Vision Transformer into voxel representations, resulting in an average absolute improvement of 8.2% and a relative gain of 21.9% across tasks in the RLBench benchmark.

We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.
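As a rough illustration of the lifting step described above, the sketch below back-projects a per-pixel saliency map into a voxel grid. The abstract does not specify how the lifting is done, so everything here is an assumption: a pinhole camera with a known depth image and intrinsics, a saliency map already resized to the image resolution (extracting it from DINOv2 is not shown), and max-pooling of saliency within each voxel. The function name `lift_saliency_to_voxels` and all of its parameters are hypothetical.

```python
import numpy as np


def lift_saliency_to_voxels(saliency, depth, intrinsics, cam_to_world,
                            bounds_min, bounds_max, grid_size):
    """Back-project per-pixel saliency into a cubic voxel grid (illustrative only).

    saliency, depth: (H, W) arrays; intrinsics: 3x3 pinhole matrix;
    cam_to_world: 4x4 homogeneous transform; bounds_*: workspace corners.
    """
    depth = np.asarray(depth, dtype=float)
    saliency = np.asarray(saliency, dtype=float)
    h, w = depth.shape
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]

    # Pixel coordinates: v indexes rows, u indexes columns.
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Pinhole back-projection into the camera frame.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)

    # Transform points into the world (workspace) frame.
    pts_world = (pts_cam @ np.asarray(cam_to_world, dtype=float).T)[:, :3]

    # Map world coordinates to integer voxel indices.
    bounds_min = np.asarray(bounds_min, dtype=float)
    bounds_max = np.asarray(bounds_max, dtype=float)
    voxel_size = (bounds_max - bounds_min) / grid_size
    idx = np.floor((pts_world - bounds_min) / voxel_size).astype(int)

    # Keep pixels with valid depth that land inside the workspace.
    valid = (depth.reshape(-1) > 0) & np.all((idx >= 0) & (idx < grid_size), axis=1)

    # Assumed aggregation: each voxel keeps the maximum saliency projected into it.
    grid = np.zeros((grid_size, grid_size, grid_size))
    i, j, k = idx[valid].T
    np.maximum.at(grid, (i, j, k), saliency.reshape(-1)[valid])
    return grid
```

The resulting grid could then be concatenated as an extra channel alongside whatever per-voxel features the policy already consumes. Whether the paper uses max-pooling, averaging, or multi-camera fusion is not stated in the abstract.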

