CV · Mar 2, 2024

DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction

arXiv:2403.01226v1 · 20 citations · h-index: 16 · CVPR
Originality: Highly original
AI Analysis

This work addresses the problem of enhancing saliency prediction accuracy for audio-visual applications, representing an incremental advance with a novel method.

The paper tackles audio-visual saliency prediction by proposing DiffSal, a diffusion-based architecture that formulates the task as conditional generation, achieving an average relative improvement of 6.3% over previous state-of-the-art results across six benchmarks.

Audio-visual saliency prediction can draw on the complementary information in multiple modalities, but further performance gains are still held back by customized architectures and task-specific loss functions. In recent studies, denoising diffusion models have shown more promise in unifying task frameworks owing to their inherent generalization ability. Following this motivation, this work proposes a novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal), which formulates the prediction problem as a conditional generative task over the saliency map, using the input audio and video as conditions. Based on the spatio-temporal audio-visual features, an additional network, Saliency-UNet, is designed to perform multi-modal attention modulation for progressive refinement of the ground-truth saliency map from the noisy map. Extensive experiments demonstrate that the proposed DiffSal achieves excellent performance across six challenging audio-visual benchmarks, with an average relative improvement of 6.3% over the previous state-of-the-art results across six metrics.
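
To make the conditional generative formulation concrete, here is a minimal NumPy sketch of how a training pair could be built: a clean saliency map is corrupted with Gaussian noise at a sampled step, and a denoiser conditioned on audio and video features is asked to recover it. Everything in it is an assumption for illustration rather than the paper's implementation: the linear beta schedule is borrowed from standard DDPM practice, `toy_denoiser` is a hypothetical stand-in for Saliency-UNet, the feature arrays are placeholders, and the mean-squared-error objective on the predicted clean map is one common choice, not a confirmed detail of DiffSal.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention schedule from linear betas (assumed, DDPM-style)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(saliency, t, alpha_bar, rng):
    """Sample a noisy map x_t from the clean saliency map x_0 at step t."""
    noise = rng.standard_normal(saliency.shape)
    noisy = np.sqrt(alpha_bar[t]) * saliency + np.sqrt(1.0 - alpha_bar[t]) * noise
    return noisy, noise


def toy_denoiser(noisy_map, t, video_feat, audio_feat, alpha_bar):
    """Hypothetical stand-in for Saliency-UNet.

    It blends the noisy map with the averaged conditions, trusting the
    noisy map less as the noise level grows. A real model would be a
    learned network with multi-modal attention.
    """
    weight = np.sqrt(alpha_bar[t])
    condition = 0.5 * (video_feat + audio_feat)
    return weight * noisy_map + (1.0 - weight) * condition


def training_loss(saliency, video_feat, audio_feat, t, alpha_bar, rng):
    """MSE between the predicted and the clean saliency map (assumed objective)."""
    noisy, _ = add_noise(saliency, t, alpha_bar, rng)
    predicted = toy_denoiser(noisy, t, video_feat, audio_feat, alpha_bar)
    return float(np.mean((predicted - saliency) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = make_alpha_bar()
    saliency = rng.random((8, 8))    # placeholder ground-truth map
    video_feat = rng.random((8, 8))  # placeholder video features
    audio_feat = rng.random((8, 8))  # placeholder audio features
    t = int(rng.integers(0, len(alpha_bar)))
    print(training_loss(saliency, video_feat, audio_feat, t, alpha_bar, rng))
```

At inference time, a model of this kind would start from pure noise and apply the denoiser repeatedly over decreasing steps, with the audio and video features held fixed as the conditions.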

