Space-Time Memory Network for Sounding Object Localization in Videos
This work addresses the challenge of robust audio-visual object localization for video analysis, though it appears incremental because it builds on existing cross-modal learning approaches.
The paper tackles the problem of localizing sounding objects in videos by leveraging temporal synchronization and association between the audio and visual modalities. The result is a space-time memory network that the authors report outperforms recent state-of-the-art methods across various complex audio-visual scenes.
Leveraging temporal synchronization and association within sight and sound is an essential step towards robust localization of sounding objects. To this end, we propose a space-time memory network for sounding object localization in videos. It can simultaneously learn spatio-temporal attention over both uni-modal and cross-modal representations from audio and visual modalities. We show and analyze both quantitatively and qualitatively the effectiveness of incorporating spatio-temporal learning in localizing audio-visual objects. We demonstrate that our approach generalizes over various complex audio-visual scenes and outperforms recent state-of-the-art methods.
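To make the idea of cross-modal spatio-temporal attention more concrete, the following is a minimal sketch of generic scaled dot-product attention from a single audio embedding over a memory of visual features spanning several frames and spatial positions. It is not the authors' architecture: the function names (`softmax`, `cross_modal_attention`), the tensor layout, and the toy values are all assumptions made purely for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(audio_query, visual_memory):
    """Attend from one audio embedding over a space-time visual memory.

    audio_query:   list of d floats (hypothetical audio embedding).
    visual_memory: nested list [T][P][d], T frames by P spatial positions.
    Returns a [T][P] attention map whose entries sum to 1.
    """
    d = len(audio_query)
    scale = 1.0 / math.sqrt(d)
    num_positions = len(visual_memory[0])
    # Scaled dot-product similarity between the audio query and every
    # visual feature, flattened jointly over time and space.
    scores = [
        scale * sum(a * v for a, v in zip(audio_query, feature))
        for frame in visual_memory
        for feature in frame
    ]
    weights = softmax(scores)
    # Restore the [T][P] layout so each frame has its own spatial map.
    return [
        weights[t * num_positions:(t + 1) * num_positions]
        for t in range(len(visual_memory))
    ]

# Toy example (assumed values): 2 frames, 3 positions, 2-dimensional features.
audio = [1.0, 0.0]
memory = [
    [[0.0, 1.0], [4.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0], [1.0, 0.0]],
]
attention = cross_modal_attention(audio, memory)
```

In this toy setup the highest weight falls on the position whose visual feature is most aligned with the audio embedding (frame 0, position 1). Normalizing jointly over all frames and positions, rather than per frame, is one possible design choice; it lets evidence from other time steps compete with the current frame, which is the intuition behind attending over a space-time memory.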