SD | AI | CV | IR | MM | AS | Jan 16, 2025

Metric Learning with Progressive Self-Distillation for Audio-Visual Embedding Learning

arXiv:2501.09608v1 | 3 citations | h-index: 8 | ICASSP
Originality: Incremental advance
AI Analysis

This work addresses audio-visual embedding learning for applications like multimedia retrieval, but it appears incremental as it builds on existing metric learning methods.

The paper tackles the problem of suboptimal performance in audio-visual embedding learning by proposing a novel architecture that integrates cross-modal triplet loss with progressive self-distillation, resulting in enhanced representation learning through dynamic refinement of soft audio-visual alignments.

Metric learning projects samples into an embedded space, where similarities and dissimilarities are quantified based on their learned representations. However, existing methods often rely on label-guided representation learning, where representations of different modalities, such as audio and visual data, are aligned based on annotated labels. This approach tends to underutilize latent complex features and potential relationships inherent in the distributions of audio and visual data that are not directly tied to the labels, resulting in suboptimal performance in audio-visual embedding learning. To address this issue, we propose a novel architecture that integrates cross-modal triplet loss with progressive self-distillation. Our method enhances representation learning by leveraging inherent distributions and dynamically refining soft audio-visual alignments -- probabilistic alignments between audio and visual data that capture the inherent relationships beyond explicit labels. Specifically, the model distills audio-visual distribution-based knowledge from annotated labels in a subset of each batch. This self-distilled knowledge is used t
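The two ingredients named in the abstract, a cross-modal triplet loss and soft audio-visual alignments, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the margin, the temperature, and the blending weight `alpha` are all assumptions made for illustration, and the rule for mixing label-based targets with soft alignments is one generic self-distillation recipe rather than the procedure the authors describe.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so distances depend only on direction."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def cross_modal_triplet_loss(audio, visual, labels, margin=0.2):
    """Mean hinge loss over all (audio anchor, visual positive, visual negative) triplets.

    The margin value is an illustrative assumption.
    """
    a = l2_normalize(audio)
    v = l2_normalize(visual)
    # dist[i, j]: squared Euclidean distance between audio i and visual j
    dist = ((a[:, None, :] - v[None, :, :]) ** 2).sum(axis=2)
    # same[i, j]: audio i and visual j carry the same label
    same = labels[:, None] == labels[None, :]
    # loss[i, p, n] = dist[i, p] - dist[i, n] + margin
    loss = dist[:, :, None] - dist[:, None, :] + margin
    # keep only triplets where p is a positive and n is a negative for anchor i
    valid = same[:, :, None] & ~same[:, None, :]
    if not valid.any():
        return 0.0
    return float(np.maximum(loss, 0.0)[valid].mean())


def soft_alignment(audio, visual, temperature=0.1):
    """Row-wise softmax over cosine similarities: a probabilistic audio-to-visual alignment."""
    sim = l2_normalize(audio) @ l2_normalize(visual).T / temperature
    sim -= sim.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    p = np.exp(sim)
    return p / p.sum(axis=1, keepdims=True)


def blended_targets(labels, soft, alpha):
    """Mix label-based hard targets with soft alignments (hypothetical blending rule).

    alpha = 0 uses labels only; larger alpha trusts the model's own alignments more.
    """
    hard = (labels[:, None] == labels[None, :]).astype(float)
    hard /= hard.sum(axis=1, keepdims=True)
    return (1.0 - alpha) * hard + alpha * soft


# Toy batch: 4 paired audio/visual embeddings, 2 classes (random data, illustration only)
rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1])
audio = rng.normal(size=(4, 8))
visual = rng.normal(size=(4, 8))

loss = cross_modal_triplet_loss(audio, visual, labels)
soft = soft_alignment(audio, visual)
targets = blended_targets(labels, soft, alpha=0.3)
```

In a progressive scheme, `alpha` would typically start near 0 and grow during training, so that early updates are guided by labels and later updates rely increasingly on the refined soft alignments. The schedule is left out here because the source does not specify one.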

