CV | LG | SD | AS | Mar 7, 2024

Dynamic Cross Attention for Audio-Visual Person Verification

arXiv:2403.04661v3 | 6 citations | h-index: 11 | FG
Originality: Incremental advance
AI Analysis

This work addresses person verification for security or identification applications by improving audio-visual fusion, though it appears incremental as it builds on existing cross-attention methods.

The paper addresses weak complementary relationships between the audio and visual modalities in person verification. It proposes a Dynamic Cross-Attention (DCA) model that selects cross-attended or unattended features according to the strength of that relationship, and reports consistent improvements across several cross-attention variants as well as results that outperform state-of-the-art methods on the VoxCeleb1 dataset.

Although person or identity verification has been predominantly explored using individual modalities such as face and voice, audio-visual fusion has recently shown immense potential to outperform unimodal approaches. Audio and visual modalities are often expected to exhibit strong complementary relationships, which play a crucial role in effective audio-visual fusion. However, they do not always strongly complement each other; they may also exhibit weak complementary relationships, resulting in poor audio-visual feature representations. In this paper, we propose a Dynamic Cross-Attention (DCA) model that can dynamically select the cross-attended or unattended features on the fly based on the strong or weak complementary relationships, respectively, across audio and visual modalities. In particular, a conditional gating layer is designed to evaluate the contribution of the cross-attention mechanism and to choose cross-attended features only when they exhibit strong complementary relationships, and unattended features otherwise. Extensive experiments are conducted on the VoxCeleb1 dataset to demonstrate the robustness of the proposed model. Results indicate that the proposed model consistently improves performance on multiple variants of cross-attention while outperforming state-of-the-art methods.
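
The conditional gating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 2 x d `gate_weights` matrix, the choice to score the gate from the cross-attended features, and the low softmax temperature are all assumptions made for illustration. A sharp two-way softmax stands in for the "choose attended or unattended" decision.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gated_select(unattended, attended, gate_weights, temperature=0.1):
    """Blend attended and unattended features with a two-way softmax gate.

    gate_weights is a hypothetical 2 x d matrix (assumption): row 0 scores
    the cross-attended path, row 1 scores the unattended path. A small
    temperature pushes the gate towards a near-hard selection.
    """
    # One logit per path, computed from the cross-attended features.
    logits = [sum(w * a for w, a in zip(row, attended)) for row in gate_weights]
    g_att, g_raw = softmax(logits, temperature)
    # Convex combination of the two feature vectors.
    fused = [g_att * a + g_raw * u for a, u in zip(attended, unattended)]
    return fused, (g_att, g_raw)


# Toy example with made-up numbers.
unattended = [1.0, 0.0]
attended = [0.0, 1.0]

# Weights that favour the cross-attended path ("strong" relationship).
fused_strong, gate_strong = gated_select(
    unattended, attended, [[0.0, 2.0], [0.0, -2.0]]
)

# Weights that favour the unattended path ("weak" relationship).
fused_weak, gate_weak = gated_select(
    unattended, attended, [[0.0, -2.0], [0.0, 2.0]]
)
```

With the first set of weights the gate puts almost all of its mass on the cross-attended features, and with the second set it falls back to the unattended ones. In a trained model the gate weights would be learned jointly with the rest of the network rather than set by hand.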

