Sound Localization by Self-Supervised Time Delay Estimation
This paper addresses sound localization for audio processing applications. It offers a self-supervised approach that reduces reliance on labeled data, though the contribution is incremental in that it adapts existing visual tracking techniques.
The paper tackles sound localization by estimating interaural time delays from stereo recordings. It proposes a self-supervised method that performs on par with supervised methods on real-world internet recordings, along with a multimodal extension that enables visually guided localization in multi-speaker scenarios.
Sounds reach one microphone in a stereo pair sooner than the other, resulting in an interaural time delay that conveys their directions. Estimating a sound's time delay requires finding correspondences between the signals recorded by each microphone. We propose to learn these correspondences through self-supervision, drawing on recent techniques from visual tracking. We adapt the contrastive random walk of Jabri et al. to learn a cycle-consistent representation from unlabeled stereo sounds, resulting in a model that performs on par with supervised methods on "in the wild" internet recordings. We also propose a multimodal contrastive learning model that solves a visually-guided localization task: estimating the time delay for a particular person in a multi-speaker mixture, given a visual representation of their face. Project site: https://ificl.github.io/stereocrw/
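To make the notion of a time delay concrete, the sketch below estimates it with plain cross-correlation, a classical hand-crafted baseline and not the learned contrastive random walk model proposed in the paper. The function names (`estimate_delay`, `delay_to_angle`), the sign convention, the far-field formula sin(theta) = c * tau / d, and the sample rate and microphone spacing in the demo are all assumptions made for illustration.

```python
import math
import random


def estimate_delay(left, right, max_lag):
    """Return the lag d (in samples) that maximizes sum_t left[t] * right[t + d].

    Assumed sign convention: a positive lag means the right channel is a
    delayed copy of the left one, i.e. the sound reached the left
    microphone first.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for t in range(len(left)):
            u = t + lag
            if 0 <= u < len(right):  # skip samples shifted outside the signal
                score += left[t] * right[u]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_to_angle(delay_samples, sample_rate, mic_distance, speed_of_sound=343.0):
    """Convert a delay to an angle in degrees under a far-field assumption.

    Uses sin(theta) = c * tau / d, clamped to [-1, 1] so that delays larger
    than the microphone spacing allows do not raise a math domain error.
    """
    tau = delay_samples / sample_rate
    ratio = speed_of_sound * tau / mic_distance
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))


if __name__ == "__main__":
    # Synthetic stereo pair: white noise, with the right channel delayed
    # by 5 samples relative to the left channel.
    rng = random.Random(0)
    source = [rng.gauss(0.0, 1.0) for _ in range(400)]
    true_delay = 5
    left = source
    right = [0.0] * true_delay + source[:-true_delay]

    lag = estimate_delay(left, right, max_lag=20)
    # Hypothetical setup: 16 kHz audio, microphones 0.2 m apart.
    angle = delay_to_angle(lag, sample_rate=16000, mic_distance=0.2)
    print("estimated delay (samples):", lag)
    print("estimated angle (degrees):", angle)
```

Cross-correlation compares raw samples directly, so it tends to degrade with noise, reverberation, and overlapping sources. A learned approach such as the one described above instead matches learned feature representations of the two channels, while still solving the same correspondence problem.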