CV LG SD AS · Oct 25, 2019

Self-supervised Moving Vehicle Tracking with Stereo Sound

arXiv:1910.11760v1 · 157 citations
Originality: Incremental advance
AI Analysis

This paper addresses object localization in low-visibility conditions, with applications such as autonomous driving and surveillance. The advance is incremental because it builds on existing visual detection models.

The paper localizes moving vehicles using only stereo sound at inference time, training on unlabeled audio-visual data in a self-supervised manner. The approach outperforms baselines on a new dataset and assists visual localization in poor lighting.

Humans are able to localize objects in the environment using both visual and auditory cues, integrating information from multiple modalities into a common reference frame. We introduce a system that can leverage unlabeled audio-visual data to learn to localize objects (moving vehicles) in a visual reference frame, purely using stereo sound at inference time. Since it is labor-intensive to manually annotate the correspondences between audio and object bounding boxes, we achieve this goal by using the co-occurrence of visual and audio streams in unlabeled videos as a form of self-supervision, without resorting to the collection of ground-truth annotations. In particular, we propose a framework that consists of a vision "teacher" network and a stereo-sound "student" network. During training, knowledge embodied in a well-established visual vehicle detection model is transferred to the audio domain using unlabeled videos as a bridge. At test time, the stereo-sound student network can work independently to perform object localization using just stereo audio and camera meta-data, without any visual input. Experimental results on a newly collected Auditory Vehicle Tracking dataset verify that our proposed approach outperforms several baseline approaches. We also demonstrate that our cross-modal auditory localization approach can assist in the visual localization of moving vehicles under poor lighting conditions.
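The teacher-student idea in the abstract can be illustrated with a toy sketch. This is not the paper's architecture. It assumes synthetic clips, a stand-in "teacher" that reads the vehicle's horizontal position from a frame, and a linear "student" that regresses that position from stereo channel levels. All names (`make_clip`, `teacher`, `audio_features`, `train_student`, `student_predict`) and the amplitude-panning audio model are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_clip(x, width=32, n_samples=256):
    """Synthesize one toy clip for a vehicle at horizontal position x in [0, 1]."""
    frame = np.zeros((8, width))
    frame[:, int(round(x * (width - 1)))] = 1.0           # bright column marks the vehicle
    source = rng.standard_normal(n_samples)                # stand-in for engine noise
    stereo = np.stack([(1.0 - x) * source, x * source])    # assumed amplitude panning (left, right)
    return frame, stereo

def teacher(frame):
    """Stand-in for a frozen visual detector: normalized x of the brightest column."""
    return frame.sum(axis=0).argmax() / (frame.shape[1] - 1)

def audio_features(stereo):
    """Normalized per-channel RMS levels plus a constant bias term."""
    rms = np.sqrt((stereo ** 2).mean(axis=1))
    return np.append(rms / rms.sum(), 1.0)

def train_student(n_clips=200, lr=0.4, epochs=500):
    """Fit a linear audio student to the teacher's pseudo-labels by gradient descent."""
    positions = rng.uniform(0.05, 0.95, n_clips)
    clips = [make_clip(x) for x in positions]
    feats = np.array([audio_features(s) for _, s in clips])
    targets = np.array([teacher(f) for f, _ in clips])     # labels come from the teacher, not humans
    w = np.zeros(feats.shape[1])
    losses = []
    for _ in range(epochs):
        err = feats @ w - targets
        losses.append(float((err ** 2).mean()))            # mean squared error before this update
        w -= lr * 2.0 * feats.T @ err / n_clips            # gradient step on the MSE
    return w, losses

def student_predict(w, stereo):
    """Audio-only inference: no frame is needed once the student is trained."""
    return float(audio_features(stereo) @ w)
```

In this sketch the frames are used only to produce training targets. At inference time `student_predict` takes the stereo signal alone, which mirrors the audio-only test-time setting described in the abstract.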

