LG, CV, SD, AS
Nov 9, 2023

On the Behavior of Audio-Visual Fusion Architectures in Identity Verification Tasks

arXiv:2311.05071v1
h-index: 4
Originality: Synthesis-oriented
AI Analysis

This work addresses identity verification in multimodal systems, particularly when a modality is missing. Its contribution is incremental, since it focuses on modifications to the fusion architecture.

The paper investigates modifications to audio-visual fusion architectures for identity verification, finding that averaging output embeddings improves error rates on the Voxceleb1-E test set in both full-modality and missing-modality scenarios.

We train an identity verification architecture and evaluate modifications to the part of the model that combines audio and visual representations, including in scenarios where one input is missing in either of the two examples to be compared. We report results on the Voxceleb1-E test set which suggest that averaging the output embeddings improves error rate both in the full-modality setting and when a single modality is missing, and that it makes more complete use of the embedding space than systems which use shared layers. We also discuss possible reasons for this behavior.
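The averaging idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes that the audio and visual encoders emit embeddings of the same dimension, that each embedding is L2-normalized before averaging, and that two examples are compared by cosine similarity. The function names (`l2_normalize`, `fuse_by_averaging`, `verification_score`) are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale a vector to unit length (eps guards against division by zero)."""
    x = np.asarray(x, dtype=float)
    return x / max(np.linalg.norm(x), eps)


def fuse_by_averaging(audio_emb=None, visual_emb=None):
    """Average whichever per-modality embeddings are available.

    Assumption (not from the paper): both embeddings share one dimension
    and are normalized before averaging. If one modality is missing, the
    result is simply the remaining embedding.
    """
    present = [l2_normalize(e) for e in (audio_emb, visual_emb) if e is not None]
    if not present:
        raise ValueError("at least one modality is required")
    return l2_normalize(np.mean(present, axis=0))


def verification_score(emb_a, emb_b):
    """Cosine similarity between two unit-length fused embeddings."""
    return float(np.dot(emb_a, emb_b))


# Toy usage: the enrolled example has both modalities, the test example has audio only.
enrolled = fuse_by_averaging(audio_emb=[1.0, 0.0], visual_emb=[0.0, 1.0])
probe = fuse_by_averaging(audio_emb=[1.0, 0.0])
score = verification_score(enrolled, probe)
```

Because the fused vector stays in the same space as each single-modality embedding, a full-modality example and a single-modality example can be compared directly. A decision threshold on `score` would still have to be chosen on held-out data.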

