CV · AI · May 22, 2025

PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association

arXiv:2505.17002v2 · 10 citations · h-index: 10 · INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses a specific multimodal learning task for the AI community, with incremental improvements in face-voice association.

The paper tackles the problem of learning associations between faces and voices by addressing issues with negative mining and margin parameters, proposing a method that aligns embedding spaces and uses enhanced gated fusion, achieving improved performance on the VoxCeleb dataset.

We study the task of learning associations between faces and voices, which has recently been gaining interest in the multimodal community. Existing methods suffer from deliberately crafted negative mining procedures and from reliance on a distant margin parameter. These issues are addressed by learning a joint embedding space in which orthogonality constraints are applied to the fused embeddings of faces and voices. However, the embedding spaces of faces and voices possess different characteristics and need to be aligned before they are fused. To this end, we propose a method that accurately aligns the embedding spaces and fuses them with an enhanced gated fusion, thereby improving the performance of face-voice association. Extensive experiments on the VoxCeleb dataset reveal the merits of the proposed approach.
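The abstract names three ingredients (alignment, gated fusion, and orthogonality constraints on the fused embeddings) without giving their exact form. The following is a minimal NumPy sketch of how such a pipeline might look, not the authors' implementation. The linear projections used for alignment, the sigmoid gate, the loss formulation, and all names, dimensions, and weights (`align`, `gated_fusion`, `orthogonality_loss`, 512-d faces, 192-d voices) are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def align(face, voice, w_face, w_voice):
    """Assumed alignment step: project both modalities into one shared
    space and L2-normalize so their scales are comparable."""
    return l2_normalize(face @ w_face), l2_normalize(voice @ w_voice)


def gated_fusion(face, voice, w_gate, b_gate):
    """Assumed gate: a per-dimension sigmoid computed from both inputs
    decides how much of each modality enters the fused embedding."""
    gate = sigmoid(np.concatenate([face, voice], axis=1) @ w_gate + b_gate)
    return gate * face + (1.0 - gate) * voice


def orthogonality_loss(fused, labels):
    """Assumed orthogonality-style objective: same-identity pairs are pushed
    toward cosine 1, different-identity pairs toward cosine 0. It needs
    neither negative mining nor a margin parameter."""
    z = l2_normalize(fused)
    cos = z @ z.T
    same = labels[:, None] == labels[None, :]
    pos = same & ~np.eye(len(labels), dtype=bool)  # same identity, excluding self-pairs
    neg = ~same
    pos_term = 1.0 - cos[pos].mean() if pos.any() else 0.0
    neg_term = np.abs(cos[neg]).mean() if neg.any() else 0.0
    return pos_term + neg_term


# Toy batch: 4 face/voice pairs from 2 identities (random stand-in features).
rng = np.random.default_rng(0)
face_feats = rng.normal(size=(4, 512))
voice_feats = rng.normal(size=(4, 192))
labels = np.array([0, 0, 1, 1])

dim = 128  # hypothetical shared embedding size
w_face = rng.normal(scale=0.05, size=(512, dim))
w_voice = rng.normal(scale=0.05, size=(192, dim))
w_gate = rng.normal(scale=0.05, size=(2 * dim, dim))
b_gate = np.zeros(dim)

f, v = align(face_feats, voice_feats, w_face, w_voice)
fused = gated_fusion(f, v, w_gate, b_gate)
loss = orthogonality_loss(fused, labels)
print(fused.shape, float(loss))
```

In a trained system the projection and gate weights would be learned by minimizing a loss of this kind, typically alongside a classification objective. Here they are random and only demonstrate the data flow.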

Code Implementations: 1 repo
