SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations
This addresses robustness issues in computer vision for robotics and autonomous systems, though it is incremental by extending evaluation paradigms rather than proposing a new method.
The paper tackles the problem of 6D object-pose estimation under real-world variations in illumination and sensor settings by introducing SenseShift6D, a multimodal RGB-D dataset with 166.4k RGB and 16.7k depth images across 1,380 sensor-lighting permutations, and shows that test-time sensor control improves performance by 19.5 percentage points on pretrained models.
Recent advances on 6D object-pose estimation have achieved high performance on representative benchmarks such as LM-O, YCB-V, and T-Less. However, these datasets were captured under fixed illumination and camera settings, leaving the impact of real-world variations in illumination, exposure, gain or depth-sensor mode-and the potential of test-time sensor control to mitigate such variations-largely unexplored. To bridge this gap, we introduce SenseShift6D, the first RGB-D dataset that physically sweeps 13 RGB exposures, 9 RGB gains, auto-exposure, 4 depth-capture modes, and 5 illumination levels. For five common household objects (spray, pringles, tincase, sandwich, and mouse), we acquire 166.4k RGB and 16.7k depth images, which can provide 1,380 unique sensor-lighting permutations per object pose. Experiments with state-of-the-art models on our dataset demonstrate that applying multimodal sensor control at test time yields substantial performance gains, achieving a 19.5 pp improvement on pretrained generalizable models. It also enhances robustness precisely where those models tend to fail. Moreover, even instance-level pose estimators, where train and test set share identical object and background, performance still varies under environmental and sensor change, demonstrating that test-time sensor control remains effective compared to costly expansions in the quantity and diversity of real-world training data, without any additional training. SenseShift6D extends the object pose evaluation paradigm from data-centered to sensor-aware robustness, laying a foundation for adaptive, self-tuning perception systems capable of operating robustly in uncertain real-world environments.