CV LG MM SD AS | Aug 31, 2024

Multi-scale Multi-instance Visual Sound Localization and Segmentation

arXiv:2409.00486v1 | 6.53 citations | h-index: 20

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of accurately localizing sounding objects in videos, with applications in multimedia analysis, though it appears incremental because it builds on prior audio-visual association methods.

The paper tackles the problem of visual sound localization by predicting object locations from audio in videos, proposing a multi-scale multi-instance framework (M2VSL) that achieves state-of-the-art performance on benchmarks like VGGSound-Instruments and AVSBench.

Visual sound localization is a typical and challenging problem that predicts the location of objects corresponding to the sound source in a video. Previous methods mainly used the audio-visual association between global audio and one-scale visual features to localize sounding objects in each image. Despite their promising performance, they omit the multi-scale visual features of the corresponding image and cannot learn discriminative regions that match the ground truth. To address this issue, we propose a novel multi-scale multi-instance visual sound localization framework, namely M2VSL, that can directly learn multi-scale semantic features associated with sound sources from the input image to localize sounding objects. Specifically, our M2VSL leverages learnable multi-scale visual features to align audio-visual representations at multi-level locations of the corresponding image. We also introduce a novel multi-scale multi-instance transformer to dynamically aggregate multi-scale cross-modal representations for visual sound localization. We conduct extensive experiments on the VGGSound-Instruments, VGG-Sound Sources, and AVSBench benchmarks. The results demonstrate that the proposed M2VSL can achieve state-of-the-art performance on sounding object localization and segmentation.
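
The idea of aligning a global audio representation with visual features at several scales can be illustrated with a small NumPy sketch. This is not the M2VSL implementation: the function names (`cosine_map`, `upsample_nearest`, `multiscale_localize`), the use of cosine similarity, the max-over-locations score, and the plain averaging of per-scale maps are all assumptions made for illustration. In particular, the paper aggregates scales with a learned multi-scale multi-instance transformer, which the simple mean below only stands in for.

```python
import numpy as np


def cosine_map(audio, feats):
    """Cosine similarity between a global audio vector (D,) and every
    spatial location of a visual feature map (H, W, D). Returns (H, W)."""
    a = audio / (np.linalg.norm(audio) + 1e-8)
    f = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    return f @ a


def upsample_nearest(sim, size):
    """Nearest-neighbour upsampling of an (H, W) map to (size, size).
    Assumes size is an integer multiple of both H and W."""
    ry, rx = size // sim.shape[0], size // sim.shape[1]
    return np.repeat(np.repeat(sim, ry, axis=0), rx, axis=1)


def multiscale_localize(audio, feature_maps, size):
    """Hypothetical stand-in for multi-scale aggregation.

    Returns the fused (size, size) localization map and one
    multi-instance score per scale (max similarity over locations)."""
    sims = [cosine_map(audio, f) for f in feature_maps]
    # Multi-instance view: each location is an instance, and the bag
    # (the whole image at that scale) is scored by its best instance.
    scores = [float(s.max()) for s in sims]
    # Assumption: average the upsampled maps instead of a learned transformer.
    fused = np.mean([upsample_nearest(s, size) for s in sims], axis=0)
    return fused, scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=16)
    # Two toy visual feature maps at different resolutions.
    maps = [rng.normal(size=(7, 7, 16)), rng.normal(size=(14, 14, 16))]
    fused, scores = multiscale_localize(audio, maps, size=14)
    print(fused.shape, scores)
```

The location with the highest value in the fused map is the predicted sounding region. Thresholding the same map would give a coarse segmentation mask, which is how a similarity map of this kind is typically turned into a localization or segmentation output.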

