Context-based Motion Retrieval using Open Vocabulary Methods for Autonomous Driving
This work addresses the challenge of identifying edge cases involving vulnerable road users for the robust evaluation of autonomous driving systems. It represents a domain-specific, incremental improvement.
The paper tackles the problem of retrieving rare human behavior scenarios from autonomous driving datasets. It proposes a context-aware motion retrieval framework that combines SMPL-based motion sequences with video frames in a multimodal embedding space aligned with natural language. The method achieves up to 27.5% higher accuracy than state-of-the-art models on the newly introduced WayMoCo dataset.
Autonomous driving systems must operate reliably in safety-critical scenarios, particularly those involving unusual or complex behavior by Vulnerable Road Users (VRUs). Identifying these edge cases in driving datasets is essential for robust evaluation and generalization, but retrieving such rare human behavior scenarios within the long tail of large-scale datasets is challenging. To support targeted evaluation of autonomous driving systems in diverse, human-centered scenarios, we propose a novel context-aware motion retrieval framework. Our method combines Skinned Multi-Person Linear (SMPL)-based motion sequences and corresponding video frames before encoding them into a shared multimodal embedding space aligned with natural language. Our approach enables the scalable retrieval of human behavior and its context through text queries. This work also introduces our dataset WayMoCo, an extension of the Waymo Open Dataset. It contains automatically labeled motion and scene context descriptions derived from generated pseudo-ground-truth SMPL sequences and corresponding image data. Our approach outperforms state-of-the-art models by up to 27.5% in accuracy on motion-context retrieval when evaluated on the WayMoCo dataset.
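The retrieval step described above can be illustrated with a minimal sketch. It assumes that every scenario (motion sequence plus video frames) has already been encoded into a single vector, and that the text query has been encoded into the same space. The encoders themselves are not shown, and the embeddings here are random placeholders rather than outputs of the paper's model. The function names `l2_normalize` and `retrieve_top_k` are hypothetical, and cosine-similarity ranking is an assumed, commonly used choice for shared embedding spaces, not a confirmed detail of the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_top_k(text_emb, scenario_embs, k=3):
    """Rank scenarios by cosine similarity to a text query embedding.

    text_emb:      shape (dim,), hypothetical text-query embedding.
    scenario_embs: shape (n, dim), hypothetical joint motion+video embeddings.
    Returns the indices of the k most similar scenarios and their similarities.
    """
    sims = l2_normalize(scenario_embs) @ l2_normalize(text_emb)
    order = np.argsort(-sims)[:k]  # descending similarity
    return order, sims[order]


# Toy data: random placeholder embeddings (an assumption for illustration only).
rng = np.random.default_rng(0)
dim = 8
scenario_embs = rng.normal(size=(5, dim))

# Simulate a query whose embedding lies very close to scenario 2.
query_emb = l2_normalize(scenario_embs[2]) + 0.01 * rng.normal(size=dim)

top_idx, top_sims = retrieve_top_k(query_emb, scenario_embs, k=3)
print(top_idx, top_sims)
```

Because the query and all scenarios live in one space, adding new scenarios only requires encoding them once. Each text query then reduces to a single matrix-vector product, which is what makes this style of retrieval scale to large datasets.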