RO CV May 23

PoseRefer: Pathway-Local Parameters for Semantically Grounded Reference Resolution

arXiv:2605.24622

AI Analysis

For researchers in human-robot interaction and 3D grounding, this work provides a diagnostic framework to evaluate whether fusion methods truly integrate multimodal cues or rely on category artifacts.

The paper introduces MM-Conv, a dataset of natural co-speech gesture from VR interaction, and evaluates a decoupled late-fusion architecture for 3D reference resolution. The fusion model achieves 31.9% top-1 accuracy, outperforming pose-only and text-only baselines, and reveals that fusion accuracy can be confounded by category representation artifacts unless pathways are architecturally decoupled.

A robot resolving "put the cup on that one" must fuse gesture, language, and scene geometry, yet 3D grounding benchmarks only partially capture this regime: descriptions are written post-hoc, gestures are templated, or pointing is staged for the camera. MM-Conv captures natural co-speech gesture from dyadic VR interaction alongside full-body motion capture and 3D scene graphs. We use it to evaluate pose-language fusion with a decoupled late-fusion architecture in which pose and text pathways share no learned parameters. Together, the dataset and the decoupled architecture make category, pose, and text contributions easier to isolate through controlled ablations. Fusion with frozen MiniLM category embeddings exceeds both pose alone and the best text-only pathway on every reference type, reaching 31.9% top-1 accuracy. The learned scalar gate flips between opposing policies depending on whether the text pathway has category access. This is a reliability diagnostic: fusion-accuracy claims for semantic grounding systems cannot be distinguished from category-representation artifacts unless pathways are architecturally decoupled.
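The late-fusion idea in the abstract can be illustrated with a minimal sketch. It assumes each pathway independently scores the candidate objects in the scene and that a single learned scalar, squashed by a sigmoid, weights the two pathways' log-probabilities. The abstract does not give the exact gate form, so the function names (`fuse`, `top1`, `top1_accuracy`), the convex-combination rule, and the toy scores are all illustrative assumptions, not the paper's implementation.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over one list of candidate scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(pose_scores, text_scores, gate_logit):
    """Late fusion with one scalar gate (assumed form, not from the paper).

    Each pathway is normalized on its own, so no parameters are shared;
    only the scalar gate_logit would be learned at the fusion step.
    g near 1 trusts the pose pathway, g near 0 trusts the text pathway.
    """
    g = sigmoid(gate_logit)
    pose_lp = log_softmax(pose_scores)
    text_lp = log_softmax(text_scores)
    return [g * p + (1.0 - g) * t for p, t in zip(pose_lp, text_lp)]


def top1(scores):
    """Index of the highest-scoring candidate object."""
    return max(range(len(scores)), key=scores.__getitem__)


def top1_accuracy(predictions, targets):
    """Fraction of references whose top-ranked candidate is the target."""
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)


# Toy scene with three candidate objects (made-up scores):
# the pose pathway prefers object 0, the text pathway prefers object 1.
pose = [2.0, 0.0, 0.0]
text = [0.0, 3.0, 0.0]

pose_dominant = top1(fuse(pose, text, gate_logit=10.0))   # gate close to 1
text_dominant = top1(fuse(pose, text, gate_logit=-10.0))  # gate close to 0
```

In this toy setup, moving the gate from one extreme to the other changes which object is selected. Inspecting the learned gate value under different ablations (for example, with and without category access in the text pathway) is one way to read the kind of policy flip the abstract describes.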
