CV SD AS Apr 26, 2022

Sound Localization by Self-Supervised Time Delay Estimation

Ziyang Chen, David F. Fouhey, Andrew Owens

arXiv:2204.12489v3 14.524 citations h-index: 30 Has Code

Originality: Incremental advance

AI Analysis

This paper addresses sound localization for audio processing applications, offering a self-supervised approach that reduces reliance on labeled data. The advance is incremental, since it adapts existing visual tracking techniques rather than introducing a new learning framework.

The paper tackles the problem of sound localization by estimating interaural time delays from stereo recordings, proposing a self-supervised method that matches supervised performance on real-world data, with a multimodal extension enabling visually-guided localization in multi-speaker scenarios.
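To make the underlying task concrete, here is a minimal sketch of time delay estimation using plain cross-correlation. This is a classical hand-crafted baseline, not the learned method the paper proposes. The function name `estimate_delay`, the `max_delay_s` search window, and the synthetic test signal are all illustrative assumptions.

```python
import random


def estimate_delay(left, right, sample_rate, max_delay_s=0.001):
    """Estimate how much `right` lags `left`, in seconds.

    A positive result means the sound reached the left microphone first.
    The 1 ms default search window is an assumed value for illustration.
    """
    max_lag = int(round(max_delay_s * sample_rate))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Correlate left[n] with right[n + lag] over the overlapping samples.
        score = 0.0
        for n in range(len(left)):
            m = n + lag
            if 0 <= m < len(right):
                score += left[n] * right[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate


# Synthetic stereo pair: the right channel is the left channel delayed by 5 samples.
rng = random.Random(0)
left = [rng.gauss(0.0, 1.0) for _ in range(512)]
right = [0.0] * 5 + left[:-5]

delay = estimate_delay(left, right, sample_rate=16000)
print(f"estimated delay: {delay * 1e6:.1f} microseconds")
```

Cross-correlation picks the lag at which the two channels align best. On clean synthetic noise this is reliable, but it is the kind of fixed matching rule that a learned representation aims to improve on for noisy real-world recordings.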

Sounds reach one microphone in a stereo pair sooner than the other, resulting in an interaural time delay that conveys their directions. Estimating a sound's time delay requires finding correspondences between the signals recorded by each microphone. We propose to learn these correspondences through self-supervision, drawing on recent techniques from visual tracking. We adapt the contrastive random walk of Jabri et al. to learn a cycle-consistent representation from unlabeled stereo sounds, resulting in a model that performs on par with supervised methods on "in the wild" internet recordings. We also propose a multimodal contrastive learning model that solves a visually-guided localization task: estimating the time delay for a particular person in a multi-speaker mixture, given a visual representation of their face. Project site: https://ificl.github.io/stereocrw/
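The cycle-consistency idea behind a contrastive random walk can be sketched in a few lines. This is a simplified illustration, not the authors' implementation. It assumes each channel has already been encoded into one embedding vector per time step, and the names `transition` and `cycle_loss` and the temperature of 0.1 are hypothetical choices. A walker steps from each left-channel node to the right channel and back, and the loss rewards returning to the starting node.

```python
import math


def transition(src, dst, temperature=0.1):
    """Row-stochastic matrix: softmax over dot-product similarities."""
    matrix = []
    for s in src:
        scores = [sum(a * b for a, b in zip(s, d)) / temperature for d in dst]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        matrix.append([e / total for e in exps])
    return matrix


def cycle_loss(left_emb, right_emb, temperature=0.1):
    """Mean negative log-probability of a left -> right -> left round trip
    ending at the node where it started."""
    a_lr = transition(left_emb, right_emb, temperature)
    a_rl = transition(right_emb, left_emb, temperature)
    loss = 0.0
    for i in range(len(left_emb)):
        p_return = sum(a_lr[i][j] * a_rl[j][i] for j in range(len(right_emb)))
        loss -= math.log(p_return + 1e-12)
    return loss / len(left_emb)


# Toy embeddings: four distinct one-hot "time steps" in the left channel,
# and the same steps shifted by one position in the right channel.
matched_left = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
matched_right = matched_left[1:] + matched_left[:1]

# Uninformative embeddings: every time step looks the same.
flat = [[1, 0, 0, 0]] * 4

print("matched loss:", cycle_loss(matched_left, matched_right))
print("uninformative loss:", cycle_loss(flat, flat))
```

When the embeddings identify corresponding time steps across channels, the round trip returns home with high probability and the loss is near zero. When they carry no information, the walk is uniform and the loss is log(N) for N nodes. In training, gradients of this loss would update the encoder that produces the embeddings, and no delay labels are needed.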
