RO · CV · Dec 10, 2025

ViTA-Seg: Vision Transformer for Amodal Segmentation in Robotics

arXiv:2512.09510v1 · 1 citation · h-index: 36
Originality: Incremental advance
AI Analysis

The paper addresses occlusion challenges in robotic manipulation, but the advance is incremental: it builds on existing Vision Transformer methods with domain-specific adaptations.

The paper tackles the problem of occlusions in robotic bin picking by proposing ViTA-Seg, a Vision Transformer framework for real-time amodal segmentation, which achieves strong accuracy and computational efficiency on benchmarks like COOCA and KINS.

Occlusions in robotic bin picking compromise accurate and reliable grasp planning. We present ViTA-Seg, a class-agnostic Vision Transformer framework for real-time amodal segmentation that leverages global attention to recover complete object masks, including hidden regions. We propose two architectures: a) Single-Head for amodal mask prediction; b) Dual-Head for amodal and occluded mask prediction. We also introduce ViTA-SimData, a photo-realistic synthetic dataset tailored to industrial bin-picking scenarios. Extensive experiments on two amodal benchmarks, COOCA and KINS, demonstrate that ViTA-Seg Dual-Head achieves strong amodal and occlusion segmentation accuracy with computational efficiency, enabling robust, real-time robotic manipulation.
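
To make the mask terminology concrete, the sketch below shows how the three kinds of mask relate on a toy example, assuming the usual convention that the amodal mask covers the whole object and the occluded mask is the part of it that is not visible. It is not the ViTA-Seg implementation: the function names and the toy grids are hypothetical, and intersection over union is used only as a common overlap measure, since the abstract does not name the metric.

```python
# Illustrative only: toy binary masks as nested lists (1 = object pixel).
# All names here are hypothetical and not taken from the ViTA-Seg code.

def occluded_from(amodal, visible):
    """Occluded region: pixels inside the amodal mask that are not visible."""
    return [[a & (1 - v) for a, v in zip(a_row, v_row)]
            for a_row, v_row in zip(amodal, visible)]

def iou(pred, target):
    """Intersection over union of two binary masks of equal shape."""
    pairs = [(p, t) for p_row, t_row in zip(pred, target)
             for p, t in zip(p_row, t_row)]
    inter = sum(p & t for p, t in pairs)
    union = sum(p | t for p, t in pairs)
    # Two empty masks are treated as a perfect match (a convention, assumed here).
    return inter / union if union else 1.0

# A 2x4 scene: the object spans three columns, but the third is hidden.
visible = [[1, 1, 0, 0],
           [1, 1, 0, 0]]
amodal = [[1, 1, 1, 0],
          [1, 1, 1, 0]]

occluded = occluded_from(amodal, visible)
print(occluded)
print(iou(visible, amodal))
```

Under this convention, a Single-Head model would output only something like `amodal`, while a Dual-Head model would output both `amodal` and `occluded`. Comparing the visible mask against the amodal ground truth shows how much overlap is lost when hidden regions are ignored.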

