Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning
This work addresses the limited reasoning of sound source localization methods in complex acoustic scenes, though the contribution is incremental because it builds on existing MLLM capabilities.
The paper tackles sound source localization by proposing a training-free framework that uses Multimodal Large Language Models for meta-reasoning, achieving competitive performance on single-source and multi-source benchmarks.
The sound source localization (SSL) task aims to identify the locations of sound-emitting objects by leveraging correlations between the audio and visual modalities. Most existing SSL methods rely on feature matching based on contrastive learning but lack explicit reasoning and verification, which limits their effectiveness in complex acoustic scenes. Inspired by human meta-cognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies Audio-Visual Consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source and multi-source benchmarks demonstrate competitive performance. The source code is available at https://github.com/VisualAIKHU/GAR-SSL.
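The control flow of the three stages can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Candidate`, `consistency_score`, and `gar_step`, the vote-fraction score, and the threshold value `0.5` are all assumptions made for illustration, and the MLLM calls are replaced by caller-supplied functions. It only shows how a gate might skip refinement when the analysis stage already reports high consistency.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical box format (x1, y1, x2, y2); the abstract does not specify one.
Box = Tuple[float, float, float, float]


@dataclass
class Candidate:
    """Output of the Generation stage: a box plus an audio class label."""
    box: Box
    audio_label: str


def consistency_score(votes: List[bool]) -> float:
    """Assumed stand-in for Audio-Visual Consistency: fraction of agreeing votes."""
    return sum(votes) / len(votes) if votes else 0.0


def gar_step(
    candidate: Candidate,
    analyze: Callable[[Candidate], List[bool]],
    refine: Callable[[Candidate], Candidate],
    gate_threshold: float = 0.5,  # assumed value, not taken from the paper
) -> Tuple[Candidate, float]:
    """Analyze a generated candidate and refine it only if the gate opens."""
    votes = analyze(candidate)          # Analysis: one boolean per anchor vote
    score = consistency_score(votes)
    if score >= gate_threshold:
        return candidate, score         # consistent enough: leave the box alone
    return refine(candidate), score     # inconsistent: apply Refinement
```

In this sketch, `analyze` and `refine` would wrap MLLM prompts in a real system; keeping them as plain callables makes the gating logic testable without a model.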