CV SD AS Jun 21, 2025

SSAVSV: Towards Unified Model for Self-Supervised Audio-Visual Speaker Verification

arXiv:2506.17694v1 h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses scalability and efficiency issues in speaker verification for applications requiring multimodal processing, though it appears incremental as it builds on existing self-supervised and transformer methods.

The paper tackles the problem of computationally expensive and data-hungry audio-visual speaker verification by proposing a self-supervised learning framework with a unified vision transformer backbone, achieving competitive performance without labeled data and reducing computational costs.

Conventional audio-visual methods for speaker verification rely on large amounts of labeled data and separate modality-specific architectures, which makes them computationally expensive and limits their scalability. To address these problems, we propose a self-supervised learning framework based on contrastive learning with asymmetric masking and masked data modeling to obtain robust audio-visual feature representations. In particular, we employ a unified framework for self-supervised audio-visual speaker verification using a single shared backbone for audio and visual inputs, leveraging the versatility of vision transformers. The proposed unified framework can handle audio, visual, or audio-visual inputs using a single shared vision transformer backbone during training and testing while being computationally efficient and robust to missing modalities. Extensive experiments demonstrate that our method achieves competitive performance without labeled data while reducing computational costs compared to traditional approaches.
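The abstract names contrastive learning with asymmetric masking but does not spell out the objective. As a rough illustration only, the sketch below pairs random token masking at different ratios per modality with a generic symmetric InfoNCE loss over paired audio and visual embeddings. The function names (`mask_tokens`, `info_nce`), the temperature, and any mask ratios are assumptions made for illustration, and the code is not the authors' implementation.

```python
import math
import random


def mask_tokens(tokens, mask_ratio, rng):
    """Keep a random subset of tokens, dropping roughly `mask_ratio` of them.

    Calling this with a different ratio for audio and visual tokens is one
    plausible reading of "asymmetric masking" (an assumption, not the paper's spec).
    """
    n_keep = max(1, round(len(tokens) * (1.0 - mask_ratio)))
    keep = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in keep]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/visual embeddings.

    Row i of each list is assumed to come from the same clip (positive pair);
    every other row in the batch acts as a negative.
    """
    a = [l2_normalize(x) for x in audio_emb]
    v = [l2_normalize(x) for x in visual_emb]
    n = len(a)
    # Cosine similarities scaled by the temperature.
    logits = [[sum(p * q for p, q in zip(a[i], v[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_diag(mat):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(mat):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(mat)

    transposed = [list(col) for col in zip(*logits)]
    # Average the audio-to-visual and visual-to-audio directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))
```

In a setup like the one the abstract describes, the masked audio and visual token sequences would each pass through the same shared encoder before pooling into the embeddings given to `info_nce`; the encoder itself is omitted here.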
