AS CV MM SD

Aug 31, 2024

Comparative Analysis of Modality Fusion Approaches for Audio-Visual Person Identification and Verification

Aref Farhadipour, Masoumeh Chapariniya, Teodora Vukovic, Volker Dellwo

arXiv:2409.00562v215.223 citations

h-index: 21

Originality: Synthesis-oriented

AI Analysis

This work addresses person identification and verification for security or biometric applications, but it is incremental as it compares existing fusion methods on a standard dataset.

The paper compared three modality fusion strategies for audio-visual person identification and verification, finding that feature fusion of gammatonegram and facial features achieved the highest accuracy of 98.37% in identification, while concatenating facial features with x-vector resulted in an EER of 0.62% in verification.

Multimodal learning integrates information from multiple modalities to improve learning and comprehension. We compare three modality fusion strategies for person identification and verification using two modalities: voice and face. A one-dimensional convolutional neural network extracts x-vectors from voice, while the pre-trained VGGFace2 network with transfer learning handles the face modality. In addition, the gammatonegram is used as a speech representation together with the pre-trained Darknet19 network. The proposed systems are evaluated with K-fold cross-validation on the 118 speakers of the VoxCeleb2 test set. Single-modality systems and the three proposed multimodal strategies are compared under identical conditions. Results show that feature fusion of gammatonegram and facial features achieves the highest performance, with an accuracy of 98.37% in the person identification task. In the verification task, however, concatenating facial features with the x-vector reaches an equal error rate (EER) of 0.62%.
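The abstract mentions two ideas that are easy to illustrate in isolation: concatenating per-modality embeddings (feature fusion) and scoring a verification system by its equal error rate. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names `fuse_features` and `equal_error_rate` are hypothetical, the L2 normalisation before concatenation is an assumption the abstract does not state, and the embeddings are treated as plain NumPy arrays rather than outputs of VGGFace2, Darknet19, or an x-vector network.

```python
import numpy as np


def fuse_features(face_emb, voice_emb):
    """Feature-level fusion: L2-normalise each modality, then concatenate.

    Normalisation is an assumption made here so that neither modality
    dominates the fused vector; the abstract does not specify it.
    """
    face = face_emb / np.linalg.norm(face_emb, axis=-1, keepdims=True)
    voice = voice_emb / np.linalg.norm(voice_emb, axis=-1, keepdims=True)
    return np.concatenate([face, voice], axis=-1)


def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER: the operating point where FAR and FRR meet.

    Higher scores are assumed to mean "more likely the same person".
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    # Every observed score is a candidate decision threshold.
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # False rejection rate: genuine trials scoring below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance rate: impostor trials scoring at or above it.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
```

In a setup like this, the fused vectors would feed a classifier for identification, while verification scores (for example, cosine similarities between fused vectors of trial pairs) would be passed to `equal_error_rate` as genuine and impostor lists.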
