SD · CV · Dec 4, 2025

Shared Multi-modal Embedding Space for Face-Voice Association

arXiv:2512.04814v1 · 1 citation · h-index: 19
Originality: Synthesis-oriented
AI Analysis

This paper addresses cross-modal biometric association for multilingual applications, but it appears incremental because it builds on existing embedding and loss techniques.

The paper tackles the FAME 2026 challenge: learning face-voice associations in a multilingual setting, including testing on languages unseen during training. It uses separate uni-modal pipelines that extract general and demographic features, then projects those features into a shared embedding space trained with an Adaptive Angular Margin loss. The approach achieved first place with an average Equal-Error Rate of 23.99%.
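To make the reported metric concrete, here is a minimal sketch of how an Equal-Error Rate can be estimated from pairwise similarity scores. It is not the challenge's official scoring script: the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER from similarity scores and binary labels.

    scores: similarity per face-voice pair (higher = more likely same identity)
    labels: 1 if the pair shares an identity, 0 otherwise
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need both matching and non-matching pairs")

    best_gap, eer = float("inf"), 1.0
    # Sweep every observed score as a threshold, plus one that rejects everything.
    for threshold in sorted(set(scores)) + [float("inf")]:
        # A pair is accepted when its score reaches the threshold.
        false_accepts = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        false_rejects = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        far = false_accepts / neg  # false acceptance rate
        frr = false_rejects / pos  # false rejection rate
        # The EER is read off where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Hypothetical toy scores, not data from the paper.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

An EER of 23.99% therefore means that, at the operating threshold where both error types are equally frequent, roughly one in four decisions of each type is wrong.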

The FAME 2026 challenge combines two demanding tasks: learning face-voice associations, and doing so in a multilingual setting that includes testing on languages the model was not trained on. Our approach consists of separate uni-modal processing pipelines with general face and voice feature extraction, complemented by additional age-gender feature extraction to support prediction. The resulting uni-modal features are projected into a shared embedding space and trained with an Adaptive Angular Margin (AAM) loss. Our approach achieved first place in the FAME 2026 challenge, with an average Equal-Error Rate (EER) of 23.99%.
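The projection and loss described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the margin is adapted, so the sketch substitutes a fixed additive angular margin (ArcFace-style). The feature sizes, the linear projections `W_face` and `W_voice`, and the random inputs standing in for extracted general and age-gender features are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def project(features, weight):
    """Linear projection into the shared space, followed by L2 normalisation."""
    return l2_normalize(features @ weight)


def angular_margin_loss(embeddings, class_weights, labels, margin=0.2, scale=30.0):
    """Cross-entropy over cosine logits with a fixed additive angular margin.

    Assumption: a constant margin replaces the paper's adaptive one.
    The case theta + margin > pi is not treated specially here.
    """
    # Cosine similarity between each embedding and each identity prototype.
    cos = np.clip(embeddings @ l2_normalize(class_weights, axis=0), -1.0, 1.0)
    theta = np.arccos(cos)
    rows = np.arange(len(labels))
    logits = cos.copy()
    # Penalise only the target class by widening its angle.
    logits[rows, labels] = np.cos(theta[rows, labels] + margin)
    logits *= scale
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, labels].mean()


rng = np.random.default_rng(0)
# Hypothetical uni-modal features for 4 samples (sizes are illustrative).
face_feats = rng.normal(size=(4, 512))
voice_feats = rng.normal(size=(4, 192))
W_face = rng.normal(size=(512, 128)) * 0.01
W_voice = rng.normal(size=(192, 128)) * 0.01

face_emb = project(face_feats, W_face)
voice_emb = project(voice_feats, W_voice)

# Both modalities are scored against the same identity prototypes,
# which is what pulls them into one shared space.
identity_labels = np.array([0, 1, 2, 0])
class_weights = rng.normal(size=(128, 3))
all_emb = np.concatenate([face_emb, voice_emb])
all_labels = np.concatenate([identity_labels, identity_labels])

loss = angular_margin_loss(all_emb, class_weights, all_labels)
loss_no_margin = angular_margin_loss(all_emb, class_weights, all_labels, margin=0.0)
```

Because face and voice embeddings are unit-normalised and share one set of identity prototypes, a cosine similarity between a face embedding and a voice embedding can be used directly as a verification score at test time.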
