LG CV SD AS

Nov 9, 2023

On the Behavior of Audio-Visual Fusion Architectures in Identity Verification Tasks

Daniel Claborne, Eric Slyman, Karl Pazdernik

arXiv:2311.05071v1 | 2.0 | h-index: 4

Originality: Synthesis-oriented

AI Analysis

This work addresses identity verification challenges in multimodal systems, particularly when modalities are missing, but is incremental as it focuses on architectural tweaks.

The paper investigates modifications to audio-visual fusion architectures for identity verification, finding that averaging the output embeddings improves error rates on the VoxCeleb1-E test set in both full-modality and missing-modality scenarios.

We train an identity verification architecture and evaluate modifications to the part of the model that combines audio and visual representations, including in scenarios where one input is missing in either of the two examples being compared. We report results on the VoxCeleb1-E test set that suggest averaging the output embeddings improves the error rate both in the full-modality setting and when a single modality is missing, and makes more complete use of the embedding space than systems that use shared layers. We discuss possible reasons for this behavior.
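The fusion-by-averaging idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`l2_normalize`, `fuse_by_averaging`, `verification_score`), the choice to L2-normalize each embedding before and after averaging, and the use of cosine similarity as the verification score are all assumptions made for illustration. The sketch only shows why averaging degrades gracefully when one modality is absent: the fused vector is simply the mean over whichever embeddings are present.

```python
import math
from typing import List, Optional, Sequence


def l2_normalize(vec: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit length (eps guards against a zero vector)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def fuse_by_averaging(
    audio_emb: Optional[Sequence[float]] = None,
    visual_emb: Optional[Sequence[float]] = None,
) -> List[float]:
    """Average the available modality embeddings (hypothetical helper).

    Assumption: both encoders emit vectors of the same dimension, and a
    missing modality is signalled by passing None.
    """
    present = [l2_normalize(e) for e in (audio_emb, visual_emb) if e is not None]
    if not present:
        raise ValueError("at least one modality embedding is required")
    dim = len(present[0])
    if any(len(e) != dim for e in present):
        raise ValueError("embeddings must share the same dimension")
    # Element-wise mean over whichever modalities are present.
    mean = [sum(e[i] for e in present) / len(present) for i in range(dim)]
    return l2_normalize(mean)


def verification_score(emb_a: Sequence[float], emb_b: Sequence[float]) -> float:
    """Cosine similarity between two unit-length fused embeddings."""
    return sum(a * b for a, b in zip(emb_a, emb_b))


# Toy 2-D embeddings: one example has both modalities, the other only audio.
full = fuse_by_averaging(audio_emb=[1.0, 0.0], visual_emb=[0.0, 1.0])
audio_only = fuse_by_averaging(audio_emb=[2.0, 0.0])
score = verification_score(full, audio_only)
```

Because the fused vector has the same dimension whether one or two modalities are supplied, the two examples can be compared with the same scoring function in either case. A threshold on `score` would then decide whether the two examples share an identity.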
