SD | CV | LG | Dec 22, 2025

Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning

arXiv:2512.19687v1 | 9 citations | h-index: 36
Originality: Incremental advance
AI Analysis

This work addresses multimodal understanding for applications such as speech retrieval and sound event detection. It is an incremental advance, since it builds on existing contrastive learning methods.

The authors tackle audiovisual perception by introducing PE-AV, a family of encoders trained with scaled contrastive learning that sets a new state of the art across standard audio and video benchmarks.

We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio, and natively support joint embeddings across audio-video, audio-text, and video-text modalities. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision consistent across modalities. Our audio data includes speech, music, and general sound effects, avoiding single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection.
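
To make the idea of combining several pairwise contrastive objectives concrete, here is a minimal sketch of a symmetric InfoNCE-style loss averaged over modality pairs. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not enumerate the ten pairs, so the modality names (`audio`, `video`, `caption`), the temperature value, the equal weighting of pairs, and the function names are all hypothetical. Pure Python is used only to keep the example self-contained; real training would use a tensor library.

```python
import math
from itertools import combinations


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings.

    a[i] and b[i] are assumed to describe the same clip; every other
    item in the batch acts as a negative. The temperature is an
    assumed value, not one taken from the paper.
    """
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b]
              for u in a]
    logits_t = [list(col) for col in zip(*logits)]  # b-to-a direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


def multi_pair_loss(embeddings, pairs=None):
    """Average the symmetric loss over a set of modality pairs.

    `embeddings` maps a (hypothetical) modality name to a batch of
    vectors. By default every unordered pair of modalities is used.
    """
    if pairs is None:
        pairs = list(combinations(sorted(embeddings), 2))
    losses = [symmetric_contrastive_loss(embeddings[x], embeddings[y])
              for x, y in pairs]
    return sum(losses) / len(losses)


# Toy batch of three clips; each modality embeds clip i as basis vector i.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = {"audio": basis, "video": basis, "caption": basis}
# Same data, but captions are rotated so they no longer match their clips.
misaligned = {"audio": basis, "video": basis,
              "caption": basis[1:] + basis[:1]}

print(multi_pair_loss(aligned) < multi_pair_loss(misaligned))
```

Adding another modality or caption type to `embeddings` adds more pairs to the average, which is the sense in which the number of pairwise objectives can be scaled in this sketch.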

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
