Sep 29, 2024

Solution for Temporal Sound Localisation Task of ECCV Second Perception Test Challenge 2024

arXiv:2409.19595v1
AI Analysis

This is an incremental improvement for participants in sound event localization competitions.

The authors tackled the Temporal Sound Localisation task by improving audio feature extraction, demonstrating that sound features are superior for localizing sound events, and achieved first place in the ECCV 2024 challenge with a score of 0.4925.

This report proposes an improved method for the Temporal Sound Localisation (TSL) task, which localizes and classifies the sound events occurring in a video according to a predefined set of sound classes. The champion solution from last year's first competition approached TSL by fusing the audio and video modalities with equal weight. Because the TSL task aims to localize sound events, we conducted experiments that demonstrate the superiority of sound features (Section 3). Based on these findings, to enhance the audio modality features, we employ various models to extract audio features, such as the InterVideo, CaVMAE, and VideoMAE models. Our approach ranks first on the final test with a score of 0.4925.
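The contrast between equal-weight fusion and an audio-favoured setup can be illustrated with a minimal sketch. The abstract does not specify how the modalities are combined, so everything below is an assumption for illustration only: the function name `fuse_features`, the `audio_weight` parameter, and the scale-then-concatenate scheme are hypothetical and are not taken from the report.

```python
from typing import List, Sequence


def fuse_features(
    audio: Sequence[Sequence[float]],
    video: Sequence[Sequence[float]],
    audio_weight: float = 0.5,
) -> List[List[float]]:
    """Scale each modality by its weight and concatenate per time step.

    Hypothetical helper: audio_weight=0.5 mimics equal-weight fusion,
    while larger values emphasise the audio features.
    """
    if len(audio) != len(video):
        raise ValueError("audio and video must have the same number of time steps")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")

    video_weight = 1.0 - audio_weight
    fused = []
    for audio_step, video_step in zip(audio, video):
        # Scaled audio features come first, followed by scaled video features.
        scaled_audio = [audio_weight * x for x in audio_step]
        scaled_video = [video_weight * x for x in video_step]
        fused.append(scaled_audio + scaled_video)
    return fused


# One time step with a 2-dim audio feature and a 1-dim video feature.
equal = fuse_features([[1.0, 2.0]], [[4.0]], audio_weight=0.5)
audio_only = fuse_features([[1.0, 2.0]], [[4.0]], audio_weight=1.0)
```

With `audio_weight=0.5` both modalities contribute equally, as in the previous year's solution described above; pushing the weight towards 1.0 lets the audio features dominate the fused representation.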
